how to get href value of <div style= using preg_match_all ?

Status
Not open for further replies.

Leopard2014

Programmer
Oct 30, 2012
4
NL
Hi all, I am trying to parse an HTML page that has many divs, and I want to output only the href value for these divs. Could anyone show me how this can be done? Thanks.
Note: I am using file_get_contents to get the remote HTML for parsing. My main goal is to get the MId= value from the href.

PHP:
<td width=154 valign="top">						<div style="height:135px; border:1px solid #FFFFFF; background:#FFFFFF; margin-left:2px; text-align:center; ">
				<a href="/Ext/series.php?name=folder&id=1442&MId=71402&page=0"><img border="0" src="http://www.somesite.com/1.jpg" width="150" height="83"></a><br>
					<font face="Tahoma" size="2"><b>Movie Name</b><br/>
						24 episodes
					</font>
			</div>
			
		</td>
 
Your question states that you want the answer to use preg_match_all. However, there are other methods that are more user friendly.

That said, using preg_match_all you'd do something like this:

Code:
$mids = array();
$pattern = "/\<a\s[^\>]*href=(['" ])(.*?)\1/ims";
preg_match_all($pattern, $text, $matches);
foreach($matches as $match):
 $bits = parse_url($match[2],PHP_URL_QUERY);
 $bits = explode('&', $bits);
 if (isset($bits['MId']) $mids[] = $bits['MId'];
endforeach;
echo '<pre>' . print_r($mids, true). '</pre>';

If you want a (perhaps) more user-friendly method using jQuery-like DOM parsing, then check out phpQuery.
 
Hi

Regular expressions are not suitable for parsing complex structured data. An extremely simple alternative is the DOMDocument class:
PHP:
$doc = new DOMDocument();
$doc->loadHTMLFile($url_of_the_document_to_scrap);
foreach ($doc->getElementsByTagName('a') as $elem)
  foreach ($elem->attributes as $attr)
    if ($attr->name == 'href' && preg_match('/\bMId=(\d+)/', $attr->value, $match))
      $mid = $match[1];

Feherke.
feherke.github.com/
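
As a rough sketch of the same DOMDocument idea, the MId value could also be read without any regular expression, by taking the href with getAttribute() and splitting its query string with parse_str(). The sample markup below is a shortened copy of the fragment from the question (with & written as &amp;), and the variable names are only illustrative assumptions.

```php
<?php
// Shortened copy of the fragment from the question; in practice the HTML
// would come from file_get_contents() or loadHTMLFile().
$html = '<div style="text-align:center;">'
      . '<a href="/Ext/series.php?name=folder&amp;id=1442&amp;MId=71402&amp;page=0">'
      . '<img border="0" src="http://www.somesite.com/1.jpg" width="150" height="83"></a><br>'
      . '<font face="Tahoma" size="2"><b>Movie Name</b><br/>24 episodes</font>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$mids = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    // Take only the query-string part of the href.
    $query = parse_url($anchor->getAttribute('href'), PHP_URL_QUERY);
    // parse_str() turns "a=1&b=2" into array('a' => '1', 'b' => '2').
    parse_str((string) $query, $params);
    if (isset($params['MId'])) {
        $mids[] = $params['MId'];
    }
}

print_r($mids);
```

For a real page, $doc->loadHTMLFile($url) would replace the loadHTML() call; the rest of the loop stays the same.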
 
In my code above a line was wrong; it should read:

Code:
$pattern = "/\<a\s[^\>]*href=(['\" ])(.*?)\\1/ims";

I agree with feherke that regular expressions are not the best tool. DOMDocument is probably faster than phpQuery, but I use jQuery all the time, so phpQuery is an easy drop-in from that for me.
 
Many thanks for the replies, guys. I tried both suggested solutions but got errors, as follows:

feherke, I tried your solution. It gives me these warnings, and it only outputs /Ext/series.php?name=folder&id=1442, but I want only the value of MId=71402!
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: ID Description already defined in
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: htmlParseEntityRef: expecting ';' in

jpadie, I tried your solution and got this error:
Parse error: syntax error, unexpected T_VARIABLE in

pointing at: if (isset($bits['MId']) $mids[] = $bits['MId'];

Could you guys help me extract all three values (MId=71402, Movie Name, 24 episodes) and construct a hyperlink like this:

PHP:
<a href="./doit.php?Id=71402&title=Movie Name 24 episode">Movie Name 24 episode</a> <br />
 
Hi

Leopard2014 said:
feherke i tried your solution it give me these warnings
Ignore them. Those are normal when parsing shitty HTML. (That is why I suggested using a dedicated HTML parser. Regular expressions can parse syntactically correct HTML quite acceptably, but making them cope with today's horrible HTML documents is a pain.)

Leopard2014 said:
it only output /Ext/series.php?name=folder&id=1442 but i want only value of MId=71402!
I only tried it with the input you posted, and it works fine with that. Post a real document if you need further assistance.


Feherke.
feherke.github.com/
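
For anyone who would rather silence the parser warnings than ignore them, libxml_use_internal_errors(true) collects them instead of printing them. The sketch below combines that with one possible way to pull out the three requested values and build the link. It assumes that the title and episode count always sit in a <font> element inside the same parent <div> as the link, which is only known to hold for the single fragment posted; doit.php and its Id/title parameters are taken from the desired output in the question.

```php
<?php
// Shortened copy of the posted fragment, with the raw "&" left in the href
// as on the original page.
$html = '<div style="text-align:center;">'
      . '<a href="/Ext/series.php?name=folder&id=1442&MId=71402&page=0">'
      . '<img border="0" src="http://www.somesite.com/1.jpg" width="150" height="83"></a><br>'
      . '<font face="Tahoma" size="2"><b>Movie Name</b><br/>24 episodes</font>'
      . '</div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();              // discard the collected warnings

$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    parse_str((string) parse_url($anchor->getAttribute('href'), PHP_URL_QUERY), $params);
    if (!isset($params['MId'])) {
        continue;
    }
    // Assumption: the descriptive text lives in the first <font> of the parent <div>.
    $font = $anchor->parentNode->getElementsByTagName('font')->item(0);
    if ($font === null) {
        continue;
    }
    // Join the non-empty text of each child node ("Movie Name", "24 episodes").
    $parts = array();
    foreach ($font->childNodes as $child) {
        $text = trim($child->textContent);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    $title = implode(' ', $parts);
    // URL-encode the parameters, then HTML-escape the whole attribute value.
    $query = http_build_query(array('Id' => $params['MId'], 'title' => $title), '', '&');
    $links[] = '<a href="./doit.php?' . htmlspecialchars($query) . '">'
             . htmlspecialchars($title) . '</a><br />';
}

echo implode("\n", $links), "\n";
```

The title is URL-encoded in the href (spaces become +), so doit.php would receive it decoded again through $_GET['title'].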
 
If using my code, add a closing round bracket after the isset:
Code:
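$mids = array();
// Group 1 captures the opening quote, group 2 the href value.
$pattern = "/\<a\s[^\>]*href=(['\" ])(.*?)\\1/ims";
// PREG_SET_ORDER yields one array per matched link instead of one per capture group.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
foreach($matches as $match):
 // Decode any &amp; in the raw attribute text before taking the query string.
 $query = parse_url(html_entity_decode($match[2]), PHP_URL_QUERY);
 // parse_str() builds a key => value array; explode() would only give numeric keys.
 parse_str((string) $query, $bits);
 if ( isset($bits['MId']) ) $mids[] = $bits['MId'];
endforeach;
echo '<pre>' . print_r($mids, true). '</pre>';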

I strongly echo what spamjim says. The fact that someone publishes a website categorically does not give you permission (in every copyright jurisdiction in which I practise) to scrape their content. For that, you must have the copyright owner's express permission. To the extent that the content does not attract copyright protection, the data will still have protection as a database right and/or a sui generis right (at least in Europe).

I am assuming that you have this consent and also that you are not using the resulting data for a potentially illicit purpose such as a torrent site.
 