How do I strip HTML tags from a file using AWK?

Handyman · Oct 26, 1998

I need to extract text from Web pages, and I would like to use AWK to convert the HTML source to plain text. Has anyone done this before? Is there an easy way of doing it?

derekludwig · Jun 25, 1999

I would suggest 
 
{ gsub(/<[^>][^>]*>/, "", $0); } 
! /^ *$/ { print; } 
 
The RE in front of the { print } is to prevent 
blank lines from being printed. 
 
You could also do this with sed: 
 
sed -e 's/<[^>][^>]*>//g' -e '/^ *$/d' 
 
-- Derek

derekludwig · Jun 12, 2001

BTW ... the RE that removes HTML tags will fail if the tag spans multiple lines. Correcting that is left as an exercise for the student.
Derek Ludwig
derek@ludwig.com

teser · Jun 12, 2001

I seem to have the same problem if the tag spans multiple lines. Any suggestions on how to do this?
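
One possible way to handle tags that span multiple lines, offered as a sketch rather than a definitive answer: instead of a single RE per line, keep a flag that records whether the previous line ended inside a tag, and scan each line for "<" and ">" by hand. The function name strip_tags and the variable names are made up for this illustration. It assumes every "<" starts a tag and the next ">" ends it, so a bare "<" in the text, or a ">" inside an attribute value or a comment, will confuse it. Unlike the RE above, it also removes an empty "<>".

```shell
# strip_tags: hypothetical helper, not part of any standard tool.
# Reads the named files (or standard input) and prints the text with tags removed.
strip_tags() {
    awk '
    {
        line = $0
        out = ""
        while (length(line) > 0) {
            if (in_tag) {
                # Look for the ">" that closes a tag opened earlier,
                # possibly on a previous line.
                i = index(line, ">")
                if (i == 0) {
                    line = ""               # tag continues on the next line
                } else {
                    line = substr(line, i + 1)
                    in_tag = 0
                }
            } else {
                i = index(line, "<")
                if (i == 0) {
                    out = out line          # no more tags on this line
                    line = ""
                } else {
                    out = out substr(line, 1, i - 1)
                    line = substr(line, i + 1)
                    in_tag = 1
                }
            }
        }
        # Same blank-line filter as in the one-liner above.
        if (out !~ /^ *$/) print out
    }' "$@"
}
```

It could be used as "strip_tags page.html > page.txt" (file names are only examples). Because in_tag is never reset between input lines, a tag opened on one line is still being skipped when the next line is read.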
