Extract part of an HTML page's contents and display/save it as a text file

Status
Not open for further replies.

robinsoninho (Programmer, GB)
Jul 25, 2005
Hi, I'm fairly new to Perl. Is it possible to run a Perl script that extracts the text from an HTML page between, say, the <body> ..... </body> tags and then displays or saves this text?

Also, could that script be run for a web page server-side, through cgi-bin?

Thanks in advance.
 
yep

maybe m/(<body>(.*)</body>)/

?

Kind Regards
Duncan
 
Sorry, bit drunk.

m|<body>(.*)</body>|;

Kind Regards
Duncan
 
duncdude's code could work if the body tag is completely plain:

<body>

with no attributes. This may be more flexible:

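Code:
open(my $fh, '<', 'index.html') or die "Cannot open index.html: $!";
my $some_html = do { local $/; <$fh> };   # slurp the whole file at once
close($fh);
# /s lets . match newlines, /i makes the tag match case-insensitive
my ($text) = $some_html =~ m|<body.*?>(.+)</body>|si;
print $text if defined $text;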

Of course, this also captures all the HTML tags inside the body of the page. If you want just the text, you need some further filtering:


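Code:
open(my $fh, '<', 'index.html') or die "Cannot open index.html: $!";
my $some_html = do { local $/; <$fh> };   # slurp the whole file at once
close($fh);
my ($text) = $some_html =~ m|<body.*?>(.+)</body>|si;
die "No <body> element found\n" unless defined $text;
$text =~ s/<.*?>//gs;   # strip anything that looks like a tag
print $text;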
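
For the cgi-bin part of the question, here is a minimal sketch of how the same extraction might be wrapped as a CGI script. It assumes the web server is configured to execute Perl scripts in cgi-bin and that index.html sits where the script can read it; the helper name cgi_response is made up for this example. The only CGI-specific step is printing a Content-Type header and a blank line before the text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a complete CGI response (header plus plain
# text) from an HTML string, using the same regexes as the code above.
sub cgi_response {
    my ($html) = @_;
    my ($text) = $html =~ m|<body.*?>(.+)</body>|si;
    $text = '' unless defined $text;   # no <body> element found
    $text =~ s/<.*?>//gs;              # crude tag stripping
    return "Content-Type: text/plain\n\n" . $text;
}

# 'index.html' is the file name used elsewhere in this thread.
my $file = 'index.html';
if (open(my $fh, '<', $file)) {
    my $html = do { local $/; <$fh> };
    close($fh);
    print cgi_response($html);
} else {
    # Still send a valid response so the browser shows the reason.
    print "Content-Type: text/plain\n\nCannot open $file: $!\n";
}
```

Run from the command line, the same script simply prints the header followed by the text, which makes it easy to check before putting it in cgi-bin.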

Or, better yet, use an HTML-aware module like HTML::Parser or maybe HTML::SimpleParse.
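
As a rough illustration of the module route, the sketch below uses HTML::Parser (a CPAN module, so it may need installing first) to collect only the text that appears inside the body element. The function name body_text and the sample HTML are assumptions for the example, not something from this thread.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the plain text found inside <body>...</body>.
sub body_text {
    my ($html) = @_;
    my $in_body = 0;
    my $text    = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # Tag names are reported in lower case by default.
        start_h => [ sub { $in_body = 1 if $_[0] eq 'body' }, 'tagname' ],
        end_h   => [ sub { $in_body = 0 if $_[0] eq 'body' }, 'tagname' ],
        # 'dtext' is the text with entities such as &amp; decoded.
        text_h  => [ sub { $text .= $_[0] if $in_body }, 'dtext' ],
    );
    $p->ignore_elements(qw(script style));   # skip non-visible content
    $p->parse($html);
    $p->eof;
    return $text;
}

my $sample = '<html><head><title>T</title></head>'
           . '<body class="x"><h1>Hello</h1><p>World &amp; all</p></body></html>';
print body_text($sample), "\n";
```

Unlike the regex version, this copes with attributes on the body tag, upper-case tags, and entities without any extra filtering.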
 
It looks like that code should work great. I only have one suggestion: instead of
Code:
m|<body.*?>
why not use
Code:
m|<body[^>]*>
This regex should produce the same results, but generally, if I remember correctly, non-greedy quantifiers can slow down matching because of the extra backtracking they cause.
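
One way to check both claims on your own data is to compare the captures and time the two patterns with the core Benchmark module. This is only a sketch; the sample HTML is made up, and the timings will vary with the input and the Perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample input with attributes on the body tag.
my $html = '<html><body class="main" id="top"><p>Hello</p></body></html>';

# Both patterns should capture the same body content.
my ($lazy)    = $html =~ m|<body.*?>(.+)</body>|si;
my ($negated) = $html =~ m|<body[^>]*>(.+)</body>|si;
print "Same capture: ", ($lazy eq $negated ? "yes" : "no"), "\n";

# Run each pattern for at least one CPU second and print a comparison table.
cmpthese(-1, {
    lazy    => sub { my ($t) = $html =~ m|<body.*?>(.+)</body>|si },
    negated => sub { my ($t) = $html =~ m|<body[^>]*>(.+)</body>|si },
});
```

On a tiny page like this the difference may be negligible, so it is worth running the comparison on a realistic file before drawing conclusions.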
 