I can find lots of examples of sed routines that delete HTML tags from the pattern space, but I'm stuck trying to work out a routine that does the opposite: print just the strings between each "<" and ">" (without the < and > characters themselves), cope with either upper or lower case, and skip anything matching the comment pattern, i.e.
<! some text >
<! >
<! some more text >
So I'm after finding patterns such as <HTML>, < HTML>, <html> or < html> and printing just the HTML/html string.
The purpose is for a script to reject HTML files that contain tags which aren't on an authorised list.
This will probably be similar to the recent question about printing the contents between "(" and ")" characters with a comma, but I think I'm getting my escape characters messed up (oh, and something about being rubbish at sed too!)
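To make the requirement concrete, here is a rough sketch of the sort of thing that might work. It is not a tested solution for real-world HTML: it assumes a POSIX shell and sed, that every tag opens and closes on the same line, and that anything starting with "<!" counts as a comment. The function name list_tags is made up for illustration, and closing tags such as </p> are reported under the same name as the opening tag.

```shell
# list_tags (hypothetical helper): read HTML on stdin and print each
# distinct tag name, lowercased, one per line.
# Assumptions: tags do not span lines; "<! ... >" is a comment to skip.
list_tags() {
    # Step 1: put every "<" at the start of its own line.
    # The replacement is a literal backslash-newline, so the line
    # beginning with "</g'" must stay unindented.
    sed 's/</\
</g' |
    # Step 2: keep only lines that look like "<", optional spaces,
    # optional "/", then a name starting with a letter; print the name.
    # "<!" lines fail the match, so comments are dropped.
    sed -n 's/^<[[:space:]]*\/\{0,1\}\([A-Za-z][A-Za-z0-9]*\)[^>]*>.*/\1/p' |
    # Step 3: fold case and remove duplicates.
    tr '[:upper:]' '[:lower:]' |
    sort -u
}
```

It would be called as `list_tags < page.html`. With a file of authorised names, one lowercase name per line (call it allowed.txt, again just an assumed name), `list_tags < page.html | grep -vxFf allowed.txt` should print only the tags that are not authorised, so any output would mean the file gets rejected.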
Any pointers or examples will be gratefully received,
recl