
Parsing HTML data


leroy07

Programmer
Apr 10, 2007
AU
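#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
my $site = "...";    # site URL missing from the original post
my $content = get $site;
die "Could not fetch $site\n" unless defined $content;
print "$content \n";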

Since my webpage uses img src="images/" etc. instead of a full path, I want my Perl app to connect, get the $content, and change every img src="images/" so that it uses the full path instead.
I'm new to substituting and I don't know how to do it with that many quotation marks.
 
Something like this should do:

Code:
s|images/|http://www.mysite.com/images/|g;

Usually you would use '/' as the delimiter for pattern matching and substitution, but you can use pretty much any symbol you want, which saves you from escaping the slashes in a URL.
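
To make the delimiter point concrete, here is a small sketch. Both substitutions do the same thing, but the '/' version needs every slash in the URL escaped. The sample string in $html and the variable names are made up for illustration.

Code:
```perl
use strict;
use warnings;

# Made-up sample input for illustration.
my $html = '<img src="images/logo.gif">';

# With '/' as the delimiter, every slash inside the pattern and the
# replacement has to be escaped with a backslash.
(my $with_slashes = $html) =~ s/images\//http:\/\/www.mysite.com\/images\//g;

# With '|' as the delimiter, the slashes need no escaping at all.
(my $with_pipes = $html) =~ s|images/|http://www.mysite.com/images/|g;

print "$with_slashes\n$with_pipes\n";
```

Note that a pattern this simple rewrites every occurrence of images/, including ones inside URLs that are already absolute.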

Alchemy is easy with Perl!
s/lead/gold/g;
 
maybe:

Code:
$content =~ s|(<img src="?)(images/.*?"?>)|$1http://www.mysite.com/$2|igs;



------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
Kevin,
I was thinking along those lines myself, but then I realised it would fail if src wasn't the first attribute of the img tag (the devil's advocate in me).
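
A quick sketch of that failure case, using Kevin's substitution on two made-up tags (the tags and variable names are only for illustration):

Code:
```perl
use strict;
use warnings;

# Two made-up tags: src first, and src after another attribute.
my @tags = (
    '<img src="images/a.gif" alt="A">',
    '<img alt="B" src="images/b.gif">',
);

my @results;
for my $tag (@tags) {
    # The pattern requires src to follow "<img " immediately.
    (my $copy = $tag) =~ s|(<img src="?)(images/.*?"?>)|$1http://www.mysite.com/$2|igs;
    push @results, $copy;
    print "$copy\n";
}
```

The first tag is rewritten; the second comes back unchanged because alt sits between "<img" and src.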
 
True, but that should be easy to compensate for.
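
One possible way to compensate, as a rough sketch rather than anything posted in the thread: let the pattern skip over other attributes inside the tag before it reaches src. The sample tags are made up, and the pattern assumes no attribute value contains a '>' character.

Code:
```perl
use strict;
use warnings;

# Made-up sample tags; src is not always the first attribute.
my $html = join "\n",
    '<img src="images/a.gif" alt="A">',
    '<img alt="B" src="images/b.gif">',
    '<img src="http://www.mysite.com/images/c.gif">';

# [^>]*? skips over any attributes that come before src without
# leaving the tag; the lookahead keeps "images/" out of the match.
$html =~ s{(<img\b[^>]*?\bsrc\s*=\s*["']?)/?(?=images/)}
          {$1http://www.mysite.com/}gi;

print "$html\n";
```

The first two tags get the full URL and the third, which is already absolute, is left alone.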

------------------------------------------
- Kevin, perl coder unexceptional! [wiggle]
 
I suggest the following regex. It won't over-capture like majorbiff's code, but won't be quite as limiting as Kevin's. Also, if you're uncomfortable with lookaheads and lookbehinds, you can change them to captured text instead, as in the commented-out line:

Code:
my $content = do {local $/; <DATA>};

$content =~ s{(?<=['"=])/?(?=images/)}{http://www.mysite.com/}igs;

# Below line uses captured groupings instead of zero-width assertions:
# $content =~ s{(['"=])/?(images/)}{$1http://www.mysite.com/$2}igs;

print $content;

__DATA__

<html>
<body>
Parse Me: <img src="images/doublequotes.jpg">
Parse Me: <img src='images/singlequotes.gif'>
Parse Me: <img src=images/noquotes.gif>
Parse Me: <img src="/images/beginningslash.jpg">

Don't Parse Me: <img src="http://www.mysite.com/images/hello.gif">
Don't Parse Me: I like my image/ directory, oh yes I do.  It's fantabulous!
</body>
</html>

However, if you want to be really bad-ass, I would suggest you take a look at HTML::Parser. There's an example in the package that can "Perform transformations on link attributes in an HTML document". This is pretty much the most powerful method that you're going to find, as it works directly on all of the HTML attributes that can contain links. It pulls the link attributes from the HTML::Tagset module. The only thing that it probably misses is raw JavaScript code, but that's probably a good thing. Anyway, the script is the hrefsub example that ships with the package.


And an example of running it is this:

Code:
>perl hrefsub.pl "$_=URI->new_abs($_, 'http://www.mynewsite.com')" hello.html
<html>
<body>
Parse Me: <img src="http://www.mynewsite.com/images/doublequotes.jpg">
Parse Me: <img src="http://www.mynewsite.com/images/singlequotes.gif">
Parse Me: <img src="http://www.mynewsite.com/images/noquotes.gif">
Parse Me: <img src="http://www.mynewsite.com/images/beginningslash.jpg">

Don't Parse Me: <img src="http://www.mysite.com/images/hello.gif">
Don't Parse Me: I like my image/ directory, oh yes I do.  It's fantabulous!
</body>
</html>

Notice how it changes all the attribute quotes to double quotes. Isn't that sweet!
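
For a feel of what a parser-based rewrite looks like in your own code, here is a minimal sketch, assuming HTML::Parser, HTML::Entities and URI are installed. It is not the hrefsub script itself: it only touches the src attribute of img tags, and $base, $html and start_tag are names made up for this illustration.

Code:
```perl
use strict;
use warnings;
use HTML::Parser ();
use HTML::Entities qw(encode_entities);
use URI;

my $base = 'http://www.mysite.com/';   # assumed base URL
my $html = <<'HTML';                   # made-up sample input
<p>Logo: <img alt="logo" src='images/logo.gif'></p>
<p>Remote: <img src="http://www.example.org/pic.gif"></p>
HTML

my $out = '';

# Rebuild each <img> start tag with an absolute src; pass other tags through.
sub start_tag {
    my ($tag, $attr, $attrseq, $text) = @_;
    if ($tag eq 'img' && defined $attr->{src}) {
        $attr->{src} = URI->new_abs($attr->{src}, $base)->as_string;
        $text = "<$tag"
              . join('', map { sprintf ' %s="%s"', $_, encode_entities($attr->{$_}, '<>&"') } @$attrseq)
              . '>';
    }
    $out .= $text;
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&start_tag, 'tagname, attr, attrseq, text' ],
    default_h   => [ sub { $out .= shift }, 'text' ],   # everything else unchanged
);
$parser->parse($html);
$parser->eof;

print $out;
```

Because the tag is rebuilt from its parsed attributes, the rewritten img tags come out with double-quoted values, and URLs that are already absolute pass through URI->new_abs unchanged. Edge cases such as XHTML-style self-closing tags are not handled here.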

- Miller
 