I need to extract text from Web pages and would like to use AWK to turn the HTML source into plain text. Has anyone done this before? Is there an easy way to do it?
I would suggest:

{ gsub(/<[^>][^>]*>/, "", $0); }
! /^ *$/ { print; }

The RE in front of the { print } prevents blank lines from being printed.

You could also do this with sed:

sed -e 's/<[^>][^>]*>//g' -e '/^ *$/d'
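
As a quick sanity check, both commands can be run against the same input. This is only a sketch: the sample HTML in the `sample` variable is invented for illustration, and it keeps every tag on a single line.

```shell
# Sample input is made up for this example; every tag fits on one line.
sample='<html><body>
<h1>Title</h1>

<p>Some <b>bold</b> text.</p>
</body></html>'

# awk version: strip the tags, then print only the non-blank lines.
printf '%s\n' "$sample" |
    awk '{ gsub(/<[^>][^>]*>/, "", $0) } ! /^ *$/ { print }'

# sed version: the same two steps as a substitute and a delete command.
printf '%s\n' "$sample" |
    sed -e 's/<[^>][^>]*>//g' -e '/^ *$/d'
```

Each command should print the heading text and the paragraph text, with the tag-only and empty lines dropped.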
<br>
-- Derek

BTW ... the RE that removes HTML tags will fail if a tag spans multiple lines. Correcting that is left as an exercise for the student.
Derek Ludwig
derek@ludwig.com
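
For anyone attempting the multi-line exercise, here is one possible approach, offered as a sketch and not as the poster's intended answer. Instead of a single RE, it keeps a flag that remembers whether the previous line ended inside a tag. The function name `strip_tags` and the variable `in_tag` are made up for this example. It assumes every `<` starts a tag and the next `>` ends it, so a literal `<` in the text, or a `>` inside an attribute value or comment, will confuse it. Unlike the RE above, it also removes an empty `<>`.

```shell
# strip_tags: hypothetical helper name, not from the original post.
# Reads HTML on stdin and writes the text outside of tags to stdout.
strip_tags() {
    awk '
    {
        line = $0
        out = ""
        while (line != "") {
            if (in_tag) {
                # Inside a tag, possibly one opened on an earlier line.
                i = index(line, ">")
                if (i == 0) {
                    line = ""                  # tag continues on the next line
                } else {
                    line = substr(line, i + 1) # skip past the closing bracket
                    in_tag = 0
                }
            } else {
                i = index(line, "<")
                if (i == 0) {
                    out = out line             # no more tags on this line
                    line = ""
                } else {
                    out = out substr(line, 1, i - 1)
                    line = substr(line, i + 1)
                    in_tag = 1                 # carried over to the next record
                }
            }
        }
        # Same blank-line filter as in the original suggestion.
        if (out !~ /^ *$/) print out
    }'
}
```

Usage would be something like `strip_tags < page.html`, where `page.html` is a placeholder file name. Text is still emitted line by line, so a sentence broken by a multi-line tag stays split across two output lines.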