Printing just HTML tags with sed

Status
Not open for further replies.

reclspeak

IS-IT--Management
Dec 6, 2002
57
GB
I can find lots of examples of sed routines that delete HTML tags from the pattern space, but I'm stuck trying to work out a routine that does the opposite: print just the strings between each "<" and ">" (without the < and > characters themselves), skipping anything that matches the comment pattern (oh, and coping with either case).

The comment pattern I mean is, e.g.:

<! some text >
<! >
<! some more text >

So I'm after finding patterns such as <HTML>, < HTML>, <html> or < html> and printing just the HTML/html string.

The purpose is for a script to reject HTML files that use tags which are not on an authorised list.

This will probably be similar to the recent question about printing the contents between "(" and ")" characters with a comma, but I think I'm getting my escape characters messed up (oh, and there's something about being rubbish at sed too!).

Any pointers or examples will be gratefully received,


recl
 
Hi

A first try. You did not mention whether you want to keep the attributes too.
Code:
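#! /usr/bin/sed -nf

# Append every input line to the hold space.
H
# On the last line, process the whole file as one string.
$ {
  g
  # Drop comments and other <! ... > declarations.
  # (.* is greedy: it spans from the first <!-- to the last -->)
  s/<!--.*-->//g
  s/<![^>]*>//g
  # Keep what is inside each < >, tab-separated (\t is a GNU sed extension).
  s/<\([^>]*\)>[^<]*/\1\t/g
  p
}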
By the way, wouldn't tidy be better suited to what you are trying to do?

Feherke.
 
Thanks Feherke, terrific answer.

Giving this a try right now.


recl
 
Hmm, nearly there.

You are quite right to ask: I'm not after attributes, just the tag names themselves.

However, I am still hitting a snag with multi-line input between the < and > characters, which the routine echoes unprocessed, such as:

<FRAME SRC="wsmbanner.html" NAME="banner" SCROLLING="no" NORESIZE
MARGINWIDTH="0" MARGINHEIGHT="0" FRAMEBORDER="no" BORDER="0" BORDERCOLOR="#006363">

Any clues how I can handle this?
 
Hi

I have no problem with that tag wrapped to multiple lines.
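
The wrapped tag is not a problem because H appends every input line to the hold space and the substitution only runs on the last line, once g has copied the whole file back into the pattern space; a tag that spans several lines is therefore matched as one string. A minimal sketch of that behaviour, run against the FRAME tag quoted above. The function name slurp_tags and the variables sample and result are made up for illustration, and the tab separator is left out so the sketch does not depend on GNU sed:

```shell
#!/bin/sh
# Sample input: the wrapped FRAME tag quoted earlier in the thread.
sample='<FRAME SRC="wsmbanner.html" NAME="banner" SCROLLING="no" NORESIZE
MARGINWIDTH="0" MARGINHEIGHT="0" FRAMEBORDER="no" BORDER="0" BORDERCOLOR="#006363">'

# slurp_tags (hypothetical name): read HTML on stdin and print the
# contents of each tag without the surrounding < and >.
slurp_tags() {
  sed -n '
    # Collect every line in the hold space.
    H
    $ {
      # Last line: fetch the whole file and edit it as one string.
      g
      s/<\([^>]*\)>[^<]*/\1/g
      p
    }'
}

result=$(printf '%s\n' "$sample" | slurp_tags)
```

Printing "$result" shows the contents of the tag with the < and > gone, still wrapped over two lines because the newline inside the tag is kept.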

The script, slightly modified to keep only the tag names without attributes:
Code:
#! /usr/bin/sed -nf

H
$ {
  g
  s/<!--.*-->//g
  s/<![^>]*>//g
  # Capture only the tag name: stop at the first space, then skip the attributes.
  s/<\([^> ]*\)[^>]*>[^<]*/\1\t/g
  p
}

Feherke.
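
For the stated purpose of rejecting files with unauthorised tags, the tag-name script can feed a simple comparison against a list of allowed names. The following is only a sketch under assumptions: the whitelist in allowed, the function names list_tags and unauthorised_tags, and the file name page.html are hypothetical; closing-tag slashes are stripped and names are lower-cased so that <HTML>, </html> and <html> all count as html; and a literal newline replaces the \t separator so that each tag lands on its own line.

```shell
#!/bin/sh
# Hypothetical whitelist of authorised tag names, lower case, one per line.
allowed='html
head
title
body
p
frame'

# list_tags FILE: print the distinct tag names used in FILE, one per line.
list_tags() {
  sed -n '
    H
    $ {
      g
      s/<!--.*-->//g
      s/<![^>]*>//g
      # Not in the thread: drop blanks after "<" so "< HTML>" reads as "<HTML>".
      s/<[[:space:]]*/</g
      # As in the thread, but ending each name with a newline instead of \t.
      s/<\([^> ]*\)[^>]*>[^<]*/\1\
/g
      p
    }' "$1" |
    tr -d '/' |
    tr '[:upper:]' '[:lower:]' |
    sed '/^[[:space:]]*$/d' |
    sort -u
}

# unauthorised_tags FILE: print the tag names in FILE that are not whitelisted.
unauthorised_tags() {
  list_tags "$1" | grep -v -x -F -e "$allowed"
}
```

Calling unauthorised_tags page.html prints each offending tag name; following grep's convention it returns a non-zero status when there are none, so a calling script would reject the file whenever any output appears.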
 