Extraction/Scraper 1

Status
Not open for further replies.

SPYDERIX

Technical User
Jan 11, 2002
1,899
CA
I'm trying to scrape a website to get a daily URL from an <img> tag and am failing hopelessly. Connecting and reading the page is easy, but my preg_match calls aren't getting me anywhere.

The page has many links on it, and the image I want has a specific alt attribute.

The alt attribute is alt="free image". The src attribute comes after it, and its URL is what I need to store in a variable.

My problem really is just the regex to match this properly. What am I doing wrong here?

Code:
<?php

// $string is the fread() of the site I connect to

preg_match ('^<[0-9a-zA-Z =\"]+alt=\"free image\"[0-9a-zA-Z =\"]+>$', $string, $matches);

$first = $matches[0];

preg_match ('^http://[0-9a-zA-Z.]+$', $first, $matches2);

echo $matches2[0];

?>

Thanks.

NATE


Got a question? Search G O O G L E first.
 
so you want to take the src attribute from an img tag whose alt attribute is set to "free image"?

off the top of my head, this pattern should return the src in $matches[3]:

Code:
$pattern = '/<img.*?\\balt\\s*=\\s*(\'|").*?\\1.*?\\bsrc\\s*=(\'|")(.*?)\\2.*?>/ims';
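
For reference, a minimal sketch of how a pattern along these lines might be used with preg_match. The sample markup and the example.com URL are made up for illustration; in the original post, $string would hold the fetched page. Leaving the whitespace after the src value optional is an assumption here, so that src="..."> with nothing after the closing quote also matches.

```php
<?php
// Hypothetical sample markup standing in for the fetched page ($string in the original post).
$string = '<p><img alt="free image" src="http://example.com/daily/today.jpg" width="100"></p>';

// Single-quoted PHP string: \\ reaches the regex engine as one backslash, \' as a plain quote.
$pattern = '/<img.*?\\balt\\s*=\\s*(\'|").*?\\1.*?\\bsrc\\s*=(\'|")(.*?)\\2.*?>/ims';

// preg_match returns 1 on a match, 0 on no match and false on error.
if (preg_match($pattern, $string, $matches) === 1) {
    $url = $matches[3];   // the third capture group holds the src value
    echo $url, "\n";
} else {
    $url = null;
    echo "no matching img tag found\n";
}
```

Note that $matches[0] holds the whole matched tag, while $matches[1] and $matches[2] only hold the quote characters.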
 
no worries. do you need it to be commented/explained or is the pattern intelligible?
 
Ya, an explanation would be nice. The pattern doesn't really make sense to me, to be honest, especially seeing as the alt text I'm looking for isn't in there anywhere. I'm actually surprised it grabs the right image, considering there are other images on the page.

NATE


 
Code:
	/		start the pattern
	<img	match the opening of the img tag
	.*?		match any characters (or none), including newlines, lazily, i.e. as few as possible
	\\b		assert a word boundary, making alt a word on its own (the second backslash is the escaping for use in a php string)
	alt		match alt (after a word boundary)
	\\s*	match an arbitrary amount of white space (including none)
	=		match the equals sign
	\\s*	match an arbitrary amount of white space (including none)
	(\'|")	match EITHER the single quote or the double quote and store the match in backreference 1
	.*?		match any characters (including newlines) lazily; this is the alt text
	\\1		match the value stored in backreference 1 (to make sure we are matching pairs of quotes.  ideally you would make sure that the preceding character is NOT a backslash too)
	.*?		match any characters (or none), including newlines, lazily
	\\b		assert a word boundary
	src		match the src attribute
	\\s*	match an arbitrary amount of white space (including none)
	=		match the equals sign
	(\'|")	match EITHER the single quote or the double quote and store the match in backreference 2
	(.*?)	match any characters (including newlines) lazily and store them in backreference 3 (essentially this is the main capture and takes whatever is in the quotes)
	\\2		match the character stored in backreference 2, thus matching the quotes
	.*?		match any characters (or none) lazily
	>		match the closing bracket of the img tag
	/		end the pattern
	i		case insensitive switch
	m		multiline switch (makes ^ and $ match at line breaks; the pattern uses neither, so it has no effect here)
	s		expand dot to cover newlines
	
	weak points:
		the pattern never tests the alt text itself, so it matches the first img tag it can, whatever its alt value is
		the alt attribute must come before the src attribute
		in matching the quotes, no allowance is made for escaped quotes, e.g. <img alt='John\'s diary'>
		in general xhtml compliance is assumed but many coders still do not use quotes around attribute values.  the pattern does not allow for this
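
If the alt text itself needs to be checked, one possible approach is to work in two steps, much as the original post tried to: first collect each img tag, then test the attributes inside each tag separately. The sketch below is only an illustration under some assumptions: findImgSrcByAlt is a hypothetical helper name, the sample markup is made up, attribute values are assumed to be quoted, and no attribute value is assumed to contain a > character.

```php
<?php
// Hypothetical helper: return the src of the first <img> whose alt equals $alt, or null.
function findImgSrcByAlt(string $html, string $alt): ?string
{
    // Step 1: collect every <img ...> tag; [^>]* keeps each match inside a single tag.
    if (!preg_match_all('/<img\\b[^>]*>/i', $html, $tags)) {
        return null;
    }

    // Step 2: test each tag on its own, so the order of alt and src does not matter.
    $altPattern = '/\\balt\\s*=\\s*(["\'])' . preg_quote($alt, '/') . '\\1/i';
    $srcPattern = '/\\bsrc\\s*=\\s*(["\'])(.*?)\\1/i';

    foreach ($tags[0] as $tag) {
        if (preg_match($altPattern, $tag) && preg_match($srcPattern, $tag, $m)) {
            return $m[2];   // group 1 is the quote character, group 2 the URL
        }
    }
    return null;
}

// Made-up markup: the wanted image comes second and has src before alt.
$html = '<img src="http://example.com/logo.png" alt="logo">'
      . '<img src="http://example.com/daily/42.jpg" alt="free image">';

$src = findImgSrcByAlt($html, 'free image');
echo $src, "\n";
```

preg_quote is used so that any regex metacharacters in the alt text are treated literally.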
 
Ok, cool. Thanks. It really will just be a case of watching the page's markup to make sure the attribute values are quoted and so on.
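
If watching the markup by hand turns out to be a nuisance, a different technique from the regex in this thread is to let an HTML parser read the attributes. The sketch below uses PHP's DOM extension (DOMDocument), which has to be available in the PHP build. findImgSrcWithDom is a hypothetical helper name and the sample markup is made up.

```php
<?php
// Hypothetical helper built on PHP's DOM extension instead of a regex.
function findImgSrcWithDom(string $html, string $alt): ?string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser has already dealt with quoting, so attributes can be compared directly.
    foreach ($doc->getElementsByTagName('img') as $img) {
        if ($img->getAttribute('alt') === $alt) {
            return $img->getAttribute('src');
        }
    }
    return null;
}

// Made-up markup with a single-quoted alt and an unquoted src value.
$html = "<p><img alt='free image' src=http://example.com/daily/42.jpg></p>";

$src = findImgSrcWithDom($html, 'free image');
echo $src, "\n";
```

The trade-off is an extra dependency and a full parse of the page, in exchange for not having to care about quote styles or attribute order.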

NATE


 