Extract part of an HTML page's contents and display/save it as a txt file

robinsoninho · Feb 16, 2006

Hi, I'm fairly new to Perl. Is it possible to run a Perl script that extracts the text from an HTML page between, say, the <body> ... </body> tags and then displays or saves that text?

Also, could that script be run on a web page as a server-side CGI script in cgi-bin?

thanks in advance

duncdude · Feb 16, 2006

yep

maybe m/(<body>(.*)</body>)/

?

Kind Regards
Duncan

duncdude · Feb 16, 2006

sorry. bit drunk.

m|<body>(.*)</body>|;

Kind Regards
Duncan

KevinADC · Feb 16, 2006

duncdude's code could work if the body tag is very generic:

<body>

with no attributes. This may be more flexible:

Code:

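# Read the whole file into one string
open(my $fh, '<', 'index.html') or die "Cannot open index.html: $!";
my $some_html = do { local $/; <$fh> };
close($fh);
# Capture everything between the opening <body ...> tag and </body>
my ($text) = $some_html =~ m|<body.*?>(.+)</body>|si;
print $text if defined $text;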

Of course, this also returns all the HTML tags that are inside the body of the page. If you want just the text, you need to do some further filtering:

Code:

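# Read the whole file into one string
open(my $fh, '<', 'index.html') or die "Cannot open index.html: $!";
my $some_html = do { local $/; <$fh> };
close($fh);
# Capture everything between the opening <body ...> tag and </body>
my ($text) = $some_html =~ m|<body.*?>(.+)</body>|si;
die "No <body> section found\n" unless defined $text;
# Strip the remaining tags, leaving only the text
$text =~ s/<.*?>//gs;
print $text;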
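As a minimal sketch, the two steps above can be wrapped in a reusable sub and tried on an inline string instead of index.html. The sub name body_text and the sample HTML are made up for illustration, and the whitespace clean-up at the end is an extra step that is not in the code above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text inside <body>...</body> with tags stripped.
sub body_text {
    my ($html) = @_;
    # Capture everything between the opening body tag (attributes allowed) and </body>
    my ($text) = $html =~ m|<body.*?>(.+)</body>|si;
    return unless defined $text;    # no body section found
    $text =~ s/<.*?>//gs;           # drop every remaining tag
    $text =~ s/\s+/ /g;             # collapse runs of whitespace (extra step)
    $text =~ s/^ | $//g;            # trim a leading or trailing space
    return $text;
}

# Made-up sample page
my $sample = qq{<html><head><title>t</title></head>\n}
           . qq{<body bgcolor="#fff">\n<h1>Hello</h1>\n<p>some <b>bold</b> text</p>\n</body></html>};

print body_text($sample), "\n";    # prints "Hello some bold text"
```

To save the text rather than display it, open a file for writing (for example a hypothetical body.txt, with open(my $out, '>', 'body.txt') or die ...) and print to that handle instead of STDOUT.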

or better yet, use an HTML-aware module like HTML::Parser or maybe HTML::SimpleParse
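For comparison, here is a rough sketch of the module route using HTML::Parser (a CPAN module, so it has to be installed separately). The helper name body_text_parsed and the sample string are assumptions for illustration only; the handlers just switch a flag on and off around the body element and collect the text events in between:

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Hypothetical helper: collect the text that appears inside <body>.
sub body_text_parsed {
    my ($html) = @_;
    my $in_body = 0;
    my $text    = '';
    my $p = HTML::Parser->new(
        api_version => 3,
        # Track whether we are currently inside the body element
        start_h => [ sub { $in_body = 1 if $_[0] eq 'body' }, 'tagname' ],
        end_h   => [ sub { $in_body = 0 if $_[0] eq 'body' }, 'tagname' ],
        # 'dtext' hands over text with entities such as &amp; decoded
        text_h  => [ sub { $text .= $_[0] if $in_body }, 'dtext' ],
    );
    $p->ignore_elements(qw(script style));    # skip script and style content
    $p->parse($html);
    $p->eof;
    return $text;
}

print body_text_parsed(
    '<html><body class="x"><p>a &amp; b</p><script>var x;</script></body></html>'
), "\n";
```

Unlike the regex version, this copes with a ">" inside attribute values and decodes entities, at the cost of an extra dependency.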

rharsh · Feb 16, 2006

It looks like that code should work great. I only have one suggestion: instead of

Code:

m|<body.*?>

why not use

Code:

m|<body[^>]*>

This regex should produce the same results, but generally, if I remember correctly, non-greedy operators can drastically slow down the execution of code.
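Rather than trusting my memory, the two patterns can be checked and timed with the core Benchmark module. This is only a sketch: the sample HTML is made up, and the timings will depend on the Perl version and the input, so treat the result as a measurement for your own data and not as a general rule.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample: a body tag with attributes followed by a lot of content
my $html = '<html><body bgcolor="#ffffff" onload="init()">'
         . ( '<p>text</p>' x 1000 )
         . '</body></html>';

# Both patterns should capture exactly the same body content
my ($lazy)    = $html =~ m|<body.*?>(.+)</body>|si;
my ($negated) = $html =~ m|<body[^>]*>(.+)</body>|si;
print "same capture\n" if $lazy eq $negated;

# Run each pattern for at least one CPU second and print a comparison table
cmpthese( -1, {
    lazy    => sub { my ($t) = $html =~ m|<body.*?>(.+)</body>|si },
    negated => sub { my ($t) = $html =~ m|<body[^>]*>(.+)</body>|si },
} );
```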
