How to get the href value of <div style= using preg_match_all?

Leopard2014 · Oct 30, 2012

Hi all, I am trying to parse an HTML page that has many divs, and I want to output only the href value of the links inside these divs. Could anyone show me how this can be done? Thanks.
Note: I am using file_get_contents to get the remote HTML for parsing. My main goal is to get the MId= value from the href.

PHP:

<td width=154 valign="top">						<div style="height:135px; border:1px solid #FFFFFF; background:#FFFFFF; margin-left:2px; text-align:center; ">
				<a href="/Ext/series.php?name=folder&id=1442&MId=71402&page=0"><img border="0" src="http://www.somesite.com/1.jpg" width="150" height="83"></a><br>
					<font face="Tahoma" size="2"><b>Movie Name</b><br/>
						24 episodes
					</font>
			</div>
			
		</td>

jpadie · Oct 31, 2012

Your question states that you want the answer to use preg_match_all. However, there are other methods that are more user friendly.

Using preg_match_all, you'd do something like this:

Code:

$mids = array();
$pattern = "/\<a\s[^\>]*href=(['" ])(.*?)\1/ims";
preg_match_all($pattern, $text, $matches);
foreach($matches as $match):
 $bits = parse_url($match[2],PHP_URL_QUERY);
 $bits = explode('&', $bits);
 if (isset($bits['MId']) $mids[] = $bits['MId'];
endforeach;
echo '<pre>' . print_r($mids, true). '</pre>';

If you want a (perhaps) more user-friendly method using jQuery-like DOM parsing, then check out phpQuery.

feherke · Oct 31, 2012

Hi

Regular expressions are not suitable for parsing complex structured data. An extremely simple alternative is the DOMDocument class:

PHP:

$doc=new DOMDocument();
$doc->loadHTMLFile($url_of_the_document_to_scrap);
foreach ($doc->getElementsByTagName('a') as $elem)
  foreach ($elem->attributes as $attr)
    if ($attr->name=='href' && preg_match('/\bMId=(\d+)/',$attr->value,$match))
      $mid=$match[1];

Feherke.
http://feherke.github.com/

jpadie · Oct 31, 2012

In my code above one line was wrong. It should read:

Code:

$pattern = "/\<a\s[^\>]*href=(['\" ])(.*?)\\1/ims";

I agree with feherke that regular expressions are not the best tool. DOMDocument is probably faster than phpQuery, but I use jQuery all the time, so phpQuery is an easy drop-in for me.
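For anyone following along, here is a minimal sketch of what the corrected pattern captures when run against the anchor from the first post. The $html sample string and the $hrefs variable are assumptions made for illustration; in the thread the markup comes from file_get_contents.

```php
<?php
// Sample anchor modelled on the first post (the <img> is replaced by text).
$html = '<a href="/Ext/series.php?name=folder&id=1442&MId=71402&page=0">Movie Name</a>';

// Group 1 is the opening quote character, group 2 the href value;
// \\1 requires the same character again as the closing quote.
$pattern = "/\<a\s[^\>]*href=(['\" ])(.*?)\\1/ims";

// With the default flag (PREG_PATTERN_ORDER) $matches is indexed by
// capture group first, so $matches[2] lists every href that was found.
$count = preg_match_all($pattern, $html, $matches);

$hrefs = $matches[2];
print_r($hrefs);
```

Because $matches is grouped by capture group here, looping over it directly walks the groups and not the individual links. Passing PREG_SET_ORDER as the fourth argument gives one entry per link instead.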

feherke · Oct 31, 2012

Hi

Argh! phpQuery! This was the second time I forgot its name.

Feherke.
http://feherke.github.com/

spamjim · Oct 31, 2012

Would it be better to simply ask http://www.panet.co.il for their database instead of scraping their web site? Are there copyright laws in your region that prevent site scraping?

Does this discussion fall within the policies of Tek-Tips.com? http://tek-tips.com/market.cfm

Leopard2014 · Oct 31, 2012

Many thanks, guys, for the replies. I tried both suggested solutions, but I got errors as follows:

feherke, I tried your solution. It gives me these warnings, and it only outputs /Ext/series.php?name=folder&id=1442, but I want only the value of MId=71402!

quote said:
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: ID Description already defined in
Warning: DOMDocument::loadHTMLFile() [domdocument.loadhtmlfile]: htmlParseEntityRef: expecting ';' in

jpadie, I tried your solution and I got this error:

quote said:
Parse error: syntax error, unexpected T_VARIABLE in
pointing at: if (isset($bits['MId']) $mids[] = $bits['MId'];

Could you guys help me extract all three values (MId=71402, Movie Name, 24 episodes) and construct a hyperlink like this:

PHP:

<a href="./doit.php?Id=71402&title=Movie Name 24 episode">Movie Name 24 episode</a> <br />
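One possible way to build such links, sketched with DOMDocument and DOMXPath. This is only a sketch under assumptions: the sample $html is cut down from the first post, the XPath expressions assume each entry is a <div> containing the link and a <font> caption, and the variable names are made up for illustration. The title is URL-encoded so that the spaces survive in the query string.

```php
<?php
// Sample entry modelled on the first post; real input would come from
// file_get_contents() on the remote page.
$html = '<div style="height:135px; text-align:center;">
<a href="/Ext/series.php?name=folder&id=1442&MId=71402&page=0"><img border="0" src="http://www.somesite.com/1.jpg"></a><br>
<font face="Tahoma" size="2"><b>Movie Name</b><br/>
24 episodes
</font>
</div>';

libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = array();
// Assumption: every entry is a <div> that directly contains the link.
foreach ($xpath->query('//div[a[@href]]') as $div) {
    $href = $xpath->query('a/@href', $div)->item(0)->nodeValue;
    // parse_str() turns the query string into a name => value array.
    parse_str((string) parse_url($href, PHP_URL_QUERY), $params);
    if (!isset($params['MId'])) continue;
    // The <font> element holds the title and the episode count;
    // collapse its line breaks and indentation into single spaces.
    $font = $xpath->query('.//font', $div)->item(0);
    $title = $font ? trim(preg_replace('/\s+/', ' ', $font->textContent)) : '';
    $links[] = sprintf('<a href="./doit.php?Id=%s&amp;title=%s">%s</a> <br />',
        rawurlencode($params['MId']), rawurlencode($title), htmlspecialchars($title));
}
echo implode("\n", $links), "\n";
```

If the real page nests its entries differently, the two XPath expressions are the part that would need adjusting.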

feherke · Oct 31, 2012

Hi

Leopard2014 said:
feherke i tried your solution it give me these warnings

Ignore them. Those are normal when parsing shitty HTML. (That is why I suggested using a dedicated HTML parser. Regular expressions can parse syntactically correct HTML quite acceptably, but it is a pain to make them cope with today's horrible HTML documents.)

Leopard2014 said:
it only output /Ext/series.php?name=folder&id=1442 but i want only value of MId=71402!

I only tried it with the input you posted, and it works fine with that. Post a real document if you need further assistance.

Feherke.
http://feherke.github.com/

feherke · Oct 31, 2012

Hi

Leopard2014 said:
feherke i tried your solution it give me these warnings

To get rid of them you can use libxml_use_internal_errors(). Just call it with a true parameter before the loadHTMLFile() call.
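A minimal sketch of that call in context. As assumptions for the sake of a self-contained example, it parses a made-up $html string with loadHTML() in place of loadHTMLFile() on a URL, and it restores the previous error setting afterwards.

```php
<?php
// Deliberately sloppy sample: a duplicate id and unescaped ampersands.
$html = '<p id="Description">one</p><p id="Description">two</p>'
      . '<a href="/Ext/series.php?name=folder&id=1442&MId=71402&page=0">link</a>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// The collected messages stay available for inspection if needed.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous); // restore the earlier setting

$mid = null;
foreach ($doc->getElementsByTagName('a') as $elem) {
    if (preg_match('/\bMId=(\d+)/', $elem->getAttribute('href'), $match)) {
        $mid = $match[1];
    }
}
echo $mid, ' (', count($errors), " parser messages collected)\n";
```

How many messages libxml collects depends on the installed libxml version, so the count is printed only for information.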

Feherke.
http://feherke.github.com/

jpadie · Oct 31, 2012

If using my code, add a closing round bracket after the isset:

Code:

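$mids = array();
// Group 1 captures the quote character, group 2 the href value.
$pattern = "/\<a\s[^\>]*href=(['\" ])(.*?)\\1/ims";
// PREG_SET_ORDER yields one entry per link rather than one per capture group.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
foreach($matches as $match):
 // Decode entities such as &amp; before the query string is split.
 $query = parse_url(html_entity_decode($match[2]),PHP_URL_QUERY);
 if (!is_string($query)) continue; // no query string, or a malformed URL
 // parse_str() builds the name => value array that explode('&') cannot.
 parse_str($query, $bits);
 if ( isset($bits['MId']) ) $mids[] = $bits['MId'];
endforeach;
echo '<pre>' . print_r($mids, true). '</pre>';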

I strongly echo what spamjim says. The fact that someone publishes a website categorically does not give you permission (in every copyright jurisdiction in which I practise) to scrape their content. For that, you must have the copyright owner's express permission. To the extent that the content does not attract copyright protection, the data will still be protected by a database right and/or a sui generis right (at least in Europe).

I am assuming that you have this consent and also that you are not using the resulting data for a potentially illicit purpose such as a torrent site.
