RegExp: Replace SRC attribute from <iframe>

c4n

Programmer
Mar 12, 2002
Hello,

I want to parse some HTML code with PHP and REPLACE the SRC attribute of a <iframe>.

Example INPUT:
<iframe marginwidth="0" src=" marginheight="0" width="120" height="240" scrolling="no" frameborder="0" ></iframe>


OUTPUT should be for example:
<iframe marginwidth="0" src=" marginheight="0" width="120" height="240" scrolling="no" frameborder="0" ></iframe>

The real problem is that the SRC value can be wrapped in double quotes (src="...), single quotes (src='...), or no quotes at all (src=...).
It can also sit between other attributes (as in the code above) or at the end, just before the closing >, for example:
<iframe width="120" height="240" src="
The code should be output exactly as input just with the SRC parameter replaced.

I tried various regular expressions, but each one only works in some cases. For example I tried:

$pattern = "/<iframe ([^>]*)src=['\"]*([^'\s\">]*)['\"\s]+([^>]*)>(.*)<\/iframe>/Uis";

$pattern = "/<iframe ([^>]*)src=(.*)\s([^>]*)>(.*)<\/iframe>/Uis";

and others...

Any suggestions?

Many thanks,

c4n
 
c4n,

I believe this will do the job.

Let me know if it's not just right.
(Any improvements welcome, also)

Code:
$test = <<<END
<iframe width="120" height="240" src="http://www.google.com"></iframe>
END;

$pattern = "/(<iframe )([^src]*)(src=)([\'\"]?)([^>\s\'\"]+)([\'\"]?)/e";
$tmp = preg_replace($pattern,
  '"\\1\\2\\3\\4".str_replace("\\5", "http://www.yahoo.com", "\\5")."\\6"',
  $test);
echo stripslashes($tmp);

Thanks,
-Lrnmore
 
Hi,

That does the job, many thanks.

I made two small changes:

1. Added the i modifier to make the pattern case-insensitive. Now it also matches <IFRAME, <Iframe etc.

2. Placed stripslashes() inside the preg_replace() replacement, so that only the replaced part goes through stripslashes() rather than the entire HTML code. I believe this will save CPU/memory when many HTML pages are parsed.

This is what I'll be using:

Code:
$pattern = "/(<iframe )([^src]*)(src=)([\'\"]?)([^>\s\'\"]+)([\'\"]?)/ie";
$tmp = preg_replace($pattern,'stripslashes("\\1\\2\\3\\4".str_replace("\\5", "http://www.yahoo.com", "\\5")."\\6")',$test);

Again thanks for your help!

Best regards,

c4n
 
Hi again Lrnmore,

I found a little problem with this part of the RE:

([^src]*)

This doesn't match the string "src" exactly. It is a character class, so it matches any run of characters other than "s", "r" and "c". If you input code like this:

Code:
$test = <<<END
<iframe scrolling="0" src="http://www.google.com"></iframe>
END;

it breaks on scrolling="0" (note the "scr", which is not "src", yet it breaks anyway). Similarly, it breaks if you have name="s" before the src, or any attribute with "s" in it.

This seems to work ok:

Code:
$test = <<<END
<iframe scrolling="0" src="http://www.google.com"></iframe>
END;

$pattern = "/(<iframe )(.*)(src=)([\'\"]?)([^>\s\'\"]+)([\'\"]?)/ie";
$tmp = preg_replace($pattern,'stripslashes("\\1\\2\\3\\4".str_replace("\\5", "http://www.yahoo.com", "\\5")."\\6")',$test);
echo $tmp;

I just changed ([^src]*) to (.*) and it seems to work ok with the tests I did. Any thoughts?

Thanks again, your code did help a lot!

c4n
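
The difference between ([^src]*) and (.*) described above can be checked with a short test. This is only a sketch: the sample tag and the variable names are made up for illustration, and the patterns are cut down to the part that matters.

```php
<?php
// Hypothetical sample tag: "scrolling" contains the letters s, c and r.
$tag = '<iframe scrolling="0" src="http://www.google.com"></iframe>';

// [^src]* is a negated character class: it matches any run of characters
// that are not "s", "r" or "c", so it stops at the "s" in "scrolling".
$classPattern = '/(<iframe )([^src]*)(src=)/i';

// .* matches any characters (except newlines, unless the s modifier is set),
// so it can skip over "scrolling" and reach the real src= attribute.
$dotPattern = '/(<iframe )(.*)(src=)/i';

$classMatches = preg_match($classPattern, $tag);
$dotMatches   = preg_match($dotPattern, $tag);

var_dump($classMatches); // int(0)
var_dump($dotMatches);   // int(1)
```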
 
c4n,

Good job with the improvements.

The only thing I can see now is that there might be a new line in the tag.

So would you want to use the "s" modifier?

Code:
$pattern = "/(<iframe )(.*)(src=)([\'\"]?)([^>\s\'\"]+)([\'\"]?)/ies";
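
To see why the "s" modifier matters here: without it, the dot does not match a newline, so a tag that wraps across lines is missed. A minimal sketch, using a made-up sample tag and a shortened pattern:

```php
<?php
// Hypothetical tag with a line break between attributes.
$tag = "<iframe width=\"120\"\nheight=\"240\" src=\"http://www.yahoo.com\"></iframe>";

// Without s, .* stops at the newline and never reaches src=.
$withoutS = preg_match('/<iframe .*src=/i', $tag);

// With s, the dot also matches newlines.
$withS = preg_match('/<iframe .*src=/is', $tag);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
```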
 
Some thoughts:
1. If you have the literal that you want to replace, use it in the regex. The /e modifier combined with a str_replace in the replacement is redundant here and can easily be eliminated. I would not recommend searching for something, then searching again inside the replacement before replacing it.
2. Another solution - very clean and elegant - is to load the document into a DOM and manipulate it that way. Find all the iframe tags (getElementsByTagName()), inspect the src attributes and replace the ones you want changed.
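
A rough sketch of the DOM idea from point 2, assuming PHP 5 or later with the DOM extension available. The input string and the $newUrl name are made up for illustration. Note that saveHTML() re-serializes the whole document (it normalizes quotes and may add doctype, html and body wrappers), so the output will not be byte-for-byte identical to the input.

```php
<?php
// Hypothetical input and replacement URL, for illustration only.
$html = '<div><iframe width="120" height="240" src="http://www.yahoo.com"></iframe></div>';
$newUrl = 'http://www.ask.com';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Visit every iframe and overwrite its src attribute if it has one.
foreach ($doc->getElementsByTagName('iframe') as $iframe) {
    if ($iframe->hasAttribute('src')) {
        $iframe->setAttribute('src', $newUrl);
    }
}

$result = $doc->saveHTML();
echo $result;
```

This handles single, double or missing quotes without any special cases, because the parser has already dealt with them.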
 
As always thanks for the advice.

I see what you mean.

You can remove the str_replace, because it's already captured.

Code:
$test = <<<END
<iframe width="120" 
height="240" src="http://www.yahoo.com"></iframe>
END;

$nwURL = "http://www.ask.com";

$pattern = "/(<iframe )(.*)(src=)([\'\"]?)([^>\s\'\"]+)([\'\"]?)/ies";
$tmp = preg_replace($pattern,
  '"\\1\\2\\3\\4".$nwURL."\\6"',
  $test);
echo stripslashes($tmp);

Thanks,
-Lrnmore
 
One final thing:
Removing the /e pattern modifier now will probably give a little speed gain (how much really?) since the replacement does not need to be parsed as PHP code.
Some of the subpatterns could also be consolidated and we'd come up with:
Code:
$test = <<<END
<iframe width="120"
height="240" src="http://www.yahoo.com"></iframe>
END;

$nwURL = "http://www.ask.com";

$pattern = "/(<iframe .*src=([\'\"]?))([^>\s\'\"]+)[\'\"]?/is";
$tmp = preg_replace($pattern,
  "\\1".$nwURL,
  $test);
echo stripslashes($tmp);
 
Hello both,

Thanks for your valuable posts. Just a little issue with the last code posted. You probably didn't test it, DRJ478, otherwise I am sure you would have spotted it.

The code above turns the test <iframe> into

Code:
<iframe width="120" height="240" src="http://www.ask.com></iframe>

Note that the final quote is missing.

This works ok:

Code:
$test = <<<END
<iframe width="120" scrolling="0" height="240" src="http://www.yahoo.com" name="test"></iframe>
END;

$nwURL = "http://www.ask.com";

$pattern = "/(<iframe .*src=([\'\"]?))([^>\s\'\"]+)([\'\"]?)/is";
$tmp = preg_replace($pattern,"\\1".$nwURL."\\4",$test);
echo stripslashes($tmp);

If you (or anyone else) have any other suggestions/improvements please feel free to post them.

Thanks to both.

c4n
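
One possible refinement, offered as a sketch rather than a tested drop-in: with the s modifier, the greedy .* runs to the last src= in the whole document, so when a page holds more than one iframe only the last URL is replaced. The variant below keeps each match inside a single tag with [^>]*? and uses preg_replace_callback, which also works on current PHP versions where the /e modifier from the earlier posts no longer exists (it was removed in PHP 7). The input, $newUrl and the pattern are illustrative assumptions. Attribute values that themselves contain ">" are not handled.

```php
<?php
// Hypothetical input with two iframes and mixed quoting styles.
$html = <<<END
<iframe width="120" src="http://www.yahoo.com" name="a"></iframe>
<iframe scrolling="0"
 src='http://www.google.com'></iframe>
END;

$newUrl = 'http://www.ask.com';

// Group 1: the tag up to and including src= ([^>]*? cannot leave the tag).
// Group 2: the opening quote, if any; \2 requires the same closing quote.
$pattern = '/(<iframe\b[^>]*?\ssrc=)([\'"]?)[^>\s\'"]+\2/i';

$result = preg_replace_callback(
    $pattern,
    function (array $m) use ($newUrl) {
        // Rebuild the match with the original quoting around the new URL.
        return $m[1] . $m[2] . $newUrl . $m[2];
    },
    $html
);

echo $result;
```

Because the replacement is built in a callback, no stripslashes() call is needed, and a replacement URL that begins with a digit cannot be misread as part of a backreference.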
 
c4n
You're right - the quote is missing.
I always test the code before posting it. In this case Firefox was a bit too nice by adding the additional quote in the source code when viewed.
If I had tested it in IE it would have shown. However, IE is my least favorite program to test in.

BTW, have you considered the DOM approach?
 
Yes, I also don't use IE much. I found it by testing the code in my PHP editor, which has a built-in browser.

As for the DOM approach: I haven't got any experience with DOM, so I will read the manual and see if I can get it to work.

Regards,

c4n
 
DRJ478,

Thanks for bringing the thread up again ;)

I'm glad you demonstrated the "nesting" in the RE.

Is it safe to assume that a nested subpattern simply takes the next number, as in:
(this)(that(n))(thisandthat)

So n would be "\\3"?

Thanks,
-Lrnmore
 
yes it should...

Known is handfull, Unknown is worldfull
 
Nested subpatterns just count in order, like vbkris says. Opening parentheses are counted from left to right to determine the backreference number. The only exception is when a subpattern is made non-capturing with (?:). The ?: after the opening parenthesis means that the subexpression is not captured and therefore does not count in the backreferences.
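
The numbering can be checked with preg_match. This small sketch uses the pattern from Lrnmore's question against a made-up subject string:

```php
<?php
$subject = 'thisthatnthisandthat';

// Groups are numbered by the position of their opening parenthesis.
preg_match('/(this)(that(n))(thisandthat)/', $subject, $m);
// $m[1] = "this", $m[2] = "thatn", $m[3] = "n", $m[4] = "thisandthat"

// With (?:...) the outer group is not captured, so "n" moves up to group 2.
preg_match('/(this)(?:that(n))(thisandthat)/', $subject, $n);
// $n[1] = "this", $n[2] = "n", $n[3] = "thisandthat"

print_r($m);
print_r($n);
```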
 
>>(?:)

Ah, I thought that was a Microsoft addition to RegExp, hmm...

Known is handfull, Unknown is worldfull
 