
preg_replace except if start of string = x 1

Status
Not open for further replies.

ThomasJSmart

Programmer
Sep 16, 2002
634
Hi

I need some help with a regex and am having difficulty finding an answer in searches.

I have a large body of text that contains several links. The extensions of some of the links need to be replaced.

Examples:

In these links the ".pdf" extension needs to become ".html":
<a href="/Page/File.pdf">File</a>
<a href="/Abcdef/File.pdf">File ABC</a>
<a href=" File</a>


In this link nothing happens because it has no extension:
<a href="/Page/Directory/">directory</a>


In these links nothing happens because they are in the /Content/ directory:
<a href="/Content/File.pdf">File</a>
<a href=" File</a>


In this link nothing happens because it's not one of the site domains:
<a href="
Any help is much appreciated, thank you.
Thomas

 
Hi

It can be done with a single regular expression, using a lookahead to skip any URL that contains /Content/:
PHP:
$str=preg_replace(',href="((?:http://web\.com)?(?![^"]*/Content/)/[^"]*?\w+\.)\w+",','href="\1html"',$str);
For readability and maintainability, a regular expression with a callback would be better:
PHP:
$str=preg_replace_callback('/href="([^"]+)"/','changeextension',$str);

function changeextension($match)
{
  $url=parse_url($match[1]);
  // Absolute link to a host other than the site: leave it unchanged.
  if (isset($url['scheme'],$url['host']) && $url['scheme']=='http' && $url['host']!='web.com') return $match[0];
  if (empty($url['path'])) return $match[0];
  $path=pathinfo($url['path']);
  // pathinfo() drops the trailing slash from dirname, so re-append it before testing.
  if (strpos($path['dirname'].'/','/Content/')!==false) return $match[0];
  if (empty($path['extension'])) return $match[0];
  return 'href="'.substr_replace($match[1],'html',strlen($match[1])-strlen($path['extension'])).'"';
}
But of course, regular expressions will fail on erroneous or just ugly HTML, so where possible use a dedicated HTML parsing library.

Feherke.
 
Both solutions seem very nice and elegant :) The callback seems an easier way to check against multiple site domains, so I will probably use that one.

The string is cleaned with tidy before it reaches this regex, so it should be OK.

Thank you!
Thomas
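
For the multiple-domain case mentioned above, one possible sketch is to pass an allow-list of hosts into the callback. The host names and the function name changeExtensionForHosts are made up for illustration; only the /Content/ rule and the extension swap come from the thread.

```php
<?php
// Hypothetical allow-list; replace with the real site domains.
$siteHosts = ['web.com', 'www.web.com'];

function changeExtensionForHosts(array $match, array $siteHosts): string
{
    $url = parse_url($match[1]);
    if ($url === false || empty($url['path'])) {
        return $match[0];
    }
    // Relative links carry no host and are treated as belonging to the site.
    if (isset($url['host']) && !in_array(strtolower($url['host']), $siteHosts, true)) {
        return $match[0];
    }
    $path = pathinfo($url['path']);
    // Re-append the slash that pathinfo() strips from dirname before testing.
    if (strpos($path['dirname'] . '/', '/Content/') !== false || empty($path['extension'])) {
        return $match[0];
    }
    // Replace everything from the start of the extension to the end of the URL.
    $cut = strlen($match[1]) - strlen($path['extension']);
    return 'href="' . substr_replace($match[1], 'html', $cut) . '"';
}

$html = '<a href="http://www.web.com/Page/File.pdf">A</a> <a href="http://other.com/File.pdf">B</a>';
$out = preg_replace_callback(
    '/href="([^"]+)"/',
    function ($m) use ($siteHosts) {
        return changeExtensionForHosts($m, $siteHosts);
    },
    $html
);
echo $out, "\n";
```

Adding a domain then only means adding an entry to $siteHosts; the pattern itself stays untouched.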

 
Hi

Thomas said:
The callback seems an easier way to check against multiple site domains so will probably use that one.
Yes, and the callback is also easier to extend to URLs with a query and fragment, like <a href="/Page/File.pdf?query#fragment">File</a>, if you have such links too.

Feherke.
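
A minimal sketch of how the query and fragment case might be handled: take the extension from the path that parse_url() reports, swap it in the part of the URL before the first "?" or "#", and put the rest back unchanged. The function name changeExtensionKeepQuery is hypothetical, and the host and /Content/ checks are left out here to keep the example short.

```php
<?php
function changeExtensionKeepQuery(array $match): string
{
    $raw = $match[1];
    // Take the extension from the path only, never from the host name.
    $path = parse_url($raw, PHP_URL_PATH);
    if (!is_string($path)) {
        return $match[0];
    }
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    if ($ext === '') {
        return $match[0];
    }
    // Length of the URL up to (not including) the first '?' or '#'.
    $end = strcspn($raw, '?#');
    $head = substr($raw, 0, $end - strlen($ext)); // ends with the dot
    $tail = (string) substr($raw, $end);          // query and fragment, if any
    return 'href="' . $head . 'html' . $tail . '"';
}

$html = '<a href="/Page/File.pdf?query#fragment">File</a>';
$out = preg_replace_callback('/href="([^"]+)"/', 'changeExtensionKeepQuery', $html);
echo $out, "\n";
```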
 
As a substitute for the regex, the PHP port of jQuery would also provide a nice solution to this (or you could manipulate the DOM directly, if you prefer).
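
For the direct DOM route, a rough sketch with PHP's bundled DOMDocument could look like the following. The sample markup and the host web.com are assumptions for illustration. Note that saveHTML() on the whole document adds a doctype and html/body wrappers, so this sketch only reads the rewritten attributes back.

```php
<?php
$html = '<p><a href="/Page/File.pdf">File</a> '
      . '<a href="/Content/File.pdf">Kept</a> '
      . '<a href="http://other.com/File.pdf">Foreign</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about sloppy markup quiet
$doc->loadHTML($html);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    $host = parse_url($href, PHP_URL_HOST);
    // Skip links to other hosts (null means a relative link).
    if ($host !== null && $host !== 'web.com') {
        continue;
    }
    $path = parse_url($href, PHP_URL_PATH);
    if (!is_string($path) || strpos($path, '/Content/') !== false) {
        continue;
    }
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    if ($ext === '') {
        continue;
    }
    // Swap the extension before any query or fragment.
    $end = strcspn($href, '?#');
    $a->setAttribute('href', substr($href, 0, $end - strlen($ext)) . 'html' . substr($href, $end));
}

$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```

Because the parser has already isolated each href, attribute order, quoting style and stray whitespace in the markup no longer matter.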
 