Printing just HTML tags with sed

reclspeak · Oct 13, 2005

I can find lots of examples of sed routines that delete pattern spaces containing HTML tags, but I'm stuck trying to work out a routine that does the opposite: print just the strings between instances of "<" and ">" (without the < and > characters themselves), skip anything that matches the comment pattern, and cope with either case. The comment pattern looks like this:

<! some text >
<! >
<! some more text >

So I'm after finding patterns such as <HTML>, < HTML>, <html> or < html> and printing just the HTML/html string.

The purpose is for a script to reject HTML files that don't have authorised TAGS.

This will probably be similar to the recent question about printing the contents between "(" and ")" characters with a comma, but I think I'm getting my escape characters messed up (oh, and being rubbish at sed doesn't help!)

Any pointers or examples will be gratefully received,

recl

feherke · Oct 13, 2005

Hi

A first try. You did not mention whether you want to keep the attributes too.

Code:

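#! /usr/bin/sed -n

# Collect every input line in the hold space.
H
# On the last line, process the whole file at once.
$ {
  g
  # Remove comments. Note: .* is greedy, so this deletes from the first <!-- to the last -->.
  s/<!--.*-->//g
  # Remove other <! ... > declarations.
  s/<![^>]*>//g
  # Keep each tag's content (name and attributes) followed by a tab; drop the text after it.
  s/<\([^>]*\)>[^<]*/\1\t/g
  p
}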

By the way, wouldn't tidy be better for what you are trying to do?

Feherke.

http://rootshell.be/~feherke/

feherke · Oct 13, 2005

Hi

Ok, first stupid error found. The shebang should include -f too. Sorry.

Code:

#! /usr/bin/sed -nf

Feherke.

http://rootshell.be/~feherke/
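For anyone wondering why the -f matters: with "#! /usr/bin/sed -n" the system ends up running sed -n followed by the script's path, so sed tries to read the path itself as sed commands. With -nf the path becomes the argument of -f. If sed lives somewhere other than /usr/bin on your machine, you can skip the shebang and name the script explicitly. A minimal sketch, assuming GNU sed (for the \t in the replacement); the scratch directory, file names and sample page are made up for illustration:

```shell
# Scratch directory and file names are hypothetical.
tmp=$(mktemp -d)

# The script from the thread, saved without its shebang line.
cat > "$tmp/tags.sed" <<'EOF'
H
$ {
  g
  s/<!--.*-->//g
  s/<![^>]*>//g
  s/<\([^>]*\)>[^<]*/\1\t/g
  p
}
EOF

# A small made-up page to try it on.
printf '%s\n' '<html>' '<!-- a comment -->' '<body>text</body>' '</html>' > "$tmp/page.html"

# -n suppresses automatic printing, -f names the script file.
out=$(sed -n -f "$tmp/tags.sed" "$tmp/page.html")

# Show one tag per line instead of tab-separated.
printf '%s\n' "$out" | tr '\t' '\n'
```

Closing tags come out with their leading slash (for example /body), so a later step has to strip that if only the names are wanted.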

reclspeak · Oct 14, 2005

Thanks Feherke, terrific answer.

Giving this a try right now.

recl

reclspeak · Oct 14, 2005

Hmm, nearly there.

You are quite right to ask. I'm not after attributes, just the tag names themselves.

However, where I am still hitting a snag is with multi-line input between the < and > characters, which the routine echoes unprocessed, such as:

<FRAME SRC="wsmbanner.html" NAME="banner" SCROLLING="no" NORESIZE
MARGINWIDTH="0" MARGINHEIGHT="0" FRAMEBORDER="no" BORDER="0" BORDERCOLOR="#006363">

Any clues how I can handle this?

feherke · Oct 14, 2005

Hi

I have no problem with that tag wrapped to multiple lines.

Slightly modified to keep only the tag names, without attributes:

Code:

#! /usr/bin/sed -nf

H
$ {
  g
  s/<!--.*-->//g
  s/<![^>]*>//g
  # Capture only the tag name (up to the first space or >) and skip the attributes.
  s/<\([^> ]*\)[^>]*>[^<]*/\1\t/g
  p
}

Feherke.

http://rootshell.be/~feherke/
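To tie this back to the original goal of rejecting files with unauthorised tags, here is a rough sketch of how the script above could feed such a check. It assumes GNU sed. The extra substitution that drops whitespace after "<" (so that < html> is recognised) is an addition that is not in the thread, and the allowed list, file names and sample page are all hypothetical:

```shell
tmp=$(mktemp -d)

# The final script from the thread plus one added line (marked below).
cat > "$tmp/tags.sed" <<'EOF'
H
$ {
  g
  s/<!--.*-->//g
  s/<![^>]*>//g
  # Added for this sketch: drop whitespace right after "<".
  s/<[[:space:]]*/</g
  s/<\([^> ]*\)[^>]*>[^<]*/\1\t/g
  p
}
EOF

# Made-up sample, including the wrapped FRAME tag from the thread.
cat > "$tmp/page.html" <<'EOF'
<HTML>
< html>
<! some text >
<FRAME SRC="wsmbanner.html" NAME="banner" SCROLLING="no" NORESIZE
MARGINWIDTH="0" MARGINHEIGHT="0" FRAMEBORDER="no" BORDER="0" BORDERCOLOR="#006363">
<BLINK>hello</BLINK>
</HTML>
EOF

# One name per line, lower-cased, closing slash and blank lines removed, duplicates dropped.
tags=$(sed -n -f "$tmp/tags.sed" "$tmp/page.html" |
  tr '\t' '\n' |
  tr '[:upper:]' '[:lower:]' |
  sed -e 's|^/||' -e '/^$/d' |
  sort -u)

# Hypothetical list of authorised tag names.
allowed="html head title body frame frameset"

# Collect every tag that is not on the list.
rejected=""
for tag in $tags; do
  case " $allowed " in
    *" $tag "*) ;;
    *) rejected="$rejected $tag" ;;
  esac
done

if [ -n "$rejected" ]; then
  echo "rejected, unauthorised tags:$rejected"
else
  echo "accepted"
fi
```

On this sample the wrapped FRAME tag is reduced to its name like any other tag, and blink is the only name reported as unauthorised. A tag whose name is followed directly by a line break rather than a space would still need extra handling.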
