
Removing newline within html tags

Status
Not open for further replies.

arunrr

Programmer
Oct 2, 2009
103
US
Hello,

Thanks in advance. I have the following input...

<td class="inningsDetails"><b>Did not bat</b> <span><a href="/ci/content/player/44716.html" target="" title="view the player profile for Allan Donald" class="playerName">AA Donald</a></span>,
<span><a href="/ci/content/player/44742.html" target="" title="view the player profile for Michael Bosch" class="playerName">MP Bosch</a></span>,
<span><a href="/ci/content/player/44803.html" target="" title="view the player profile for Jonathan Winters" class="playerName">JM Winters</a></span>
</td>

I would like to grep for "Did not bat", identify the <td> tag and join the lines until </td>. Output should be...

<td class="inningsDetails"><b>Did not bat</b> <span><a href="/ci/content/player/44716.html" target="" title="view the player profile for Allan Donald" class="playerName">AA Donald</a></span>, <span><a href="/ci/content/player/44742.html" target="" title="view the player profile for Michael Bosch" class="playerName">MP Bosch</a></span>, <span><a href="/ci/content/player/44803.html" target="" title="view the player profile for Jonathan Winters" class="playerName">JM Winters</a></span>
</td>

Thanks again,
Arun
 
Hi,

Yes, I want to have it all on the same line.

I am searching through a large HTML file for different items. For example, I am looking for "Did not bat" and need to find all the names that correspond to it. There can be more than one, as I show in my input. I am not sure how else to identify them.

Thanks
Arun
 
Hi

What about using a specialized tool?
Code:
lynx -stdin -dump -nolist -width=1024 /input/file | grep 'Did not bat'
Note that if a line is longer than the maximum width, it will be wrapped.

Feherke.
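
Where lynx is not available, roughly comparable plain text can be produced with awk alone. The following is only a sketch under assumptions: the function name did_not_bat_text is made up for illustration, the cell is assumed to open on the 'Did not bat' line and to end at the first </td>, and anything between < and > is treated as a tag, which is not a real HTML parser.

Code:
```shell
# did_not_bat_text: hypothetical helper name, not from the thread.
# Reads HTML on stdin and prints the text of each cell that starts on a
# "Did not bat" line, with tags removed and whitespace collapsed.
# Usage (path as used elsewhere in the thread): did_not_bat_text < /input/file
did_not_bat_text() {
  awk '
    /Did not bat/ { collecting = 1 }      # marker line opens the cell
    collecting    { text = text " " $0 }  # gather the lines of the cell
    collecting && /<\/td>/ {              # first closing td ends the cell
      gsub(/<[^>]*>/, "", text)           # drop anything that looks like a tag
      gsub(/[ \t]+/, " ", text)           # collapse runs of blanks
      gsub(/^ | $/, "", text)             # trim both ends
      gsub(/ ,/, ",", text)               # tidy blanks left before commas
      print text
      text = ""; collecting = 0
    }
  '
}
```

Unlike the lynx pipeline, this prints only the matching cells and does not wrap long lines.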
 
Thanks.

I had a few scripts that used lynx and had to change things around, as the hosting company that I am using does not have lynx installed on the Linux servers for shared hosting. Also, they will not allow me to install anything. Quite painful. They say that if I sign up for a dedicated server, then I can install other tools. I had to decline as I can't afford the cost of a dedicated server now.

A solution without lynx would be helpful...

Thanks again,
Arun
 
Hi

Sad.

I am still unsure whether you really need the HTML markup in your output, and whether you need the rest of the text outside the <td>. The following does strictly what you requested (as I understood it): it removes the newline characters inside the tag that starts on the 'Did not bat' line and ends with its closing pair.
Code:
awk '/Did not bat/{s=$1;sub(/.*</,"",s)}s{j=(j==""?$0:j" "$0)}s&&$0~"</"s">"{print j;s=j="";next}!s' /input/file
Note that <td> elements nested inside the <td> are not handled.


Feherke.
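
For readers who find the one-liner dense, here is a commented variant of the same idea, offered as a sketch rather than as the poster's exact code. The function name join_tag_lines and the longer variable names are made up for illustration, and this variant joins the collected lines with a single space. Like the one-liner, it assumes the opening tag is the first field on the 'Did not bat' line and does not handle nested tags of the same name.

Code:
```shell
# join_tag_lines: hypothetical helper name, not from the thread.
# Reads HTML on stdin, joins the lines of the element that opens on a
# "Did not bat" line, and passes every other line through unchanged.
# Usage (path as used in the thread): join_tag_lines < /input/file
join_tag_lines() {
  awk '
    /Did not bat/ {             # marker line: work out which tag to close
      tag = $1                  # first field, e.g. "<td"
      sub(/.*</, "", tag)       # keep only the tag name, e.g. "td"
    }
    tag {                       # while collecting, append the current line
      joined = (joined == "" ? $0 : joined " " $0)
    }
    tag && $0 ~ "</" tag ">" {  # closing tag reached: emit and reset
      print joined
      tag = joined = ""
      next
    }
    !tag                        # outside the element: print the line as is
  '
}
```

Because the closing line is appended before the match is tested, the closing </td> ends up on the same line as the rest of the cell.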
 
Thanks Feherke,
Sorry for the delayed response - was travelling.
Arun
 