How do I Strip HTML tags from a file using AWK?

Status
Not open for further replies.

Handyman (Programmer, GB)
Oct 26, 1998
I need to extract text from Web pages and would like to use AWK to convert the HTML source to plain text. Has anyone done this before? Is there an easy way to do it?
 
I would suggest:

    { gsub(/<[^>][^>]*>/, "", $0); }
    ! /^ *$/ { print; }

The RE in front of { print; } prevents blank lines from being printed.

You could also do this with sed:

    sed -e 's/<[^>][^>]*>//g' -e '/^ *$/d'

-- Derek

 
BTW, the RE that removes HTML tags will fail if a tag spans multiple lines. Correcting that is left as an exercise for the student :).
Derek Ludwig
derek@ludwig.com

 
I seem to have the same problem if the tag spans multiple lines. Any suggestions on how to do this?
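
One possible way to handle tags that span lines, offered as a rough sketch rather than anything posted in this thread: instead of a per-line gsub, keep a flag that remembers whether the previous line ended inside a tag, and scan each line for `<` and `>` with index(). The wrapper name strip_tags and the variable names in_tag, line, and out are made up for illustration. It assumes every `<` opens a tag and the next `>` closes it, so a literal `<` in the text, or a `>` inside a comment, script, or attribute value, will confuse it. Unlike the RE above, it also removes an empty `<>`.

```sh
# strip_tags: hypothetical helper; reads the named files (or stdin) and
# prints the text with anything between "<" and ">" removed, even when
# a tag starts on one line and ends on a later one.
strip_tags() {
  awk '
  {
    line = $0
    out = ""
    while (length(line) > 0) {
      if (in_tag) {
        # Inside a tag: discard everything up to and including ">".
        i = index(line, ">")
        if (i == 0) {
          line = ""                     # tag continues on the next line
        } else {
          line = substr(line, i + 1)
          in_tag = 0
        }
      } else {
        # Outside a tag: keep everything before the next "<".
        i = index(line, "<")
        if (i == 0) {
          out = out line
          line = ""
        } else {
          out = out substr(line, 1, i - 1)
          line = substr(line, i + 1)
          in_tag = 1                    # flag survives into the next record
        }
      }
    }
    # Same blank-line filter as the one-line version above.
    if (out !~ /^ *$/) print out
  }' "$@"
}
```

It could be run as `strip_tags page.html > page.txt`, where page.html stands for whatever file holds the HTML source.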
 