
preg_replace autolink input 1

Status
Not open for further replies.

jasc2k (Programmer) | Nov 2, 2005 | 113 | GB
hi all,

my shortened function below simply takes user input and links any URLs or email addresses appropriately in HTML, so that when posted the links work, as seen on Facebook etc.

Code:
/* Convert all URL matches to appropriate HTML links */
			$message = preg_replace('#([\s|^])(www)#i', '$1http://$2', $message);
			$pattern = '#((http|https|ftp|telnet|news|gopher|file|wais):\/\/[^\s]+)#i';
			$replacement = '<a href="$1" target="_blank">$1</a>';
			$message = preg_replace($pattern, $replacement, $message);
		
			/* Convert all E-mail matches to appropriate HTML links */
			$pattern = '#([0-9a-z]([-_.]?[0-9a-z])*@[0-9a-z]([-.]?[0-9a-z])*\\.';
			$pattern .= '[a-wyz][a-z](fo|g|l|m|mes|o|op|pa|ro|seum|t|u|v|z)?)#i';
			$replacement = '<a href="mailto:\\1">\\1</a>';
			$message = preg_replace($pattern, $replacement, $message);

no matter what I do I always get the same basic problem: when two or more URLs are entered in a row (without spaces, or joined by another character), all of the links get joined into one.

i.e. the user submits three URLs back to back and the whole string comes out as one link instead of three?
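the joining is easy to reproduce in isolation. a hedged sketch (the URLs are made up): the `[^\s]+` in the pattern is greedy, so with no whitespace between two URLs a single match swallows both.

```php
<?php
// Minimal repro (hypothetical URLs): [^\s]+ keeps matching until the next
// whitespace, so two URLs written back to back become one match.
$message = 'see http://one.example.comhttp://two.example.com here';
$pattern = '#((http|https)://[^\s]+)#i';
echo preg_replace($pattern, '<a href="$1" target="_blank">$1</a>', $message);
// emits a single <a> whose href spans both "URLs"
```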

I know someone must have the answer - interestingly, on posting I realised the same happens here :-(
thanks in advance
 
the best answer would be to do some pre-analysis based on the list of known allowed TLDs.

but as ever, if the user submits garbage there is a limit to what you can do to prevent garbage output.
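one way to sketch that pre-analysis idea (a compromise, not a full solution): insert a space before any protocol prefix that is glued to preceding text, then linkify as before. `https?://` here is an assumption - extend it for whatever other protocols you accept.

```php
<?php
// Hedged pre-pass: any "http://" or "https://" glued to a preceding
// non-space character gets a space in front of it, splitting joined URLs.
// (This would also split a URL that legitimately contains "http://" in
// its query string - an accepted compromise.)
$message = 'http://a.example.comhttps://b.example.co.uk';
$split = preg_replace('#(?<=\S)(https?://)#i', ' $1', $message);
echo $split; // "http://a.example.com https://b.example.co.uk"
```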
 
hi there,

thanks for your reply - as soon as I saw that even this forum's script does the same, I kind of gave up hope and assumed it was just a step too far :-)

thanks
 
i spent a couple of hours yesterday on this and pretty much got it cracked.

then I lost my laptop.

it is possible I have the script on a backup somewhere. I will see whether I can dig it out.
 
whoa, lost your laptop?
that is not good! hope you find it again (and not just for me, lol)

cheers
 
have not given up on this. I recrafted a solution from what I could remember and got reasonably close but not good enough.

the aggravating thing is that this has to be done in PHP rather than JavaScript, because we inevitably need zero-width assertions that are not properly supported by JS. otherwise it would be nice to leave this kind of processor-intensive server bottleneck to the client side.
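for reference, the kind of zero-width assertion meant here is a lookbehind - e.g. matching `www.` only when not preceded by a quote, slash, space or `=`, so attribute values are left alone (the strings below are illustrative):

```php
<?php
// Lookbehind sketch: (?<!...) asserts about the preceding character
// without consuming it.
$pattern = '/(?<![ \/\'"=])(www\.)/';
var_dump(preg_match($pattern, 'gluedwww.example.com')); // int(1): "www." preceded by a letter
var_dump(preg_match($pattern, ' www.example.com'));     // int(0): preceded by a space
```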

just to note: any solution will be a compromise. in my proposed solution I do not recognise a bare domain.com as a valid URL; I do recognise www-prefixed and protocol-prefixed strings as valid URLs.

i can be more precise if you could give some more refinements as to which TLDs you are principally interested in, i.e. all gTLDs plus particular country domains.

without this ability to refine it is difficult, for example, to set up rules that differentiate between a valid Colombian .co domain and the co in a .co.uk address.
 
superb, thank you for your dedication.
my biggest problem (hence the post) is that I have a text field a user can type into, with carriage returns, which also cause the issue.

just to note: any solution will be a compromise. in my proposed solution I do not recognise a bare domain.com as a valid URL; I do recognise www-prefixed and protocol-prefixed strings as valid URLs.

i figured there must be some sort of limit to its range. not having domain.com would suit me fine, but would that not mean file.php also gets caught by it?

I would (ideally) want .com, .org, .net and .co.uk - I think these would be the most used, but again I'm not sure how difficult adding these is.

Many Thanks,
James
 
have a go with this. it is an imperfect solution but I think it's a reasonable compromise.

Code:
<?php
$text =<<<TEXT
string with a url in it http://www.boogy.com/withapath/file.php?aquery=something#fragment
more text with a half formed url www.microsoft.com
set of text with an elided set of urls http://www.domain.comhttp://www.domains.co.ukwww.halfdomain.edu
text with an hyperlink <a href='http://www.iamalink.com'>Link</a>
image <img src='http://www.tek-tips.com/images/header-logo.gif'>
TEXT;

echo linkify(nl2br($text)); //obviously you do not need the nl2br, nor the echo should you wish to capture the results.
function linkify($text){
	$protocols = array('http','https','ftp','file','gopher'); //add to if you need
	
	//domains
	$gTLDs = array('.info','.com','.edu','.org','.net','.mil'); //you should be able to add quite a few others so long as they do not overlap
	$newTLDs = array('.aero', '.biz', '.coop', '.info', '.museum', '.name', '.pro');
	$ukcc = array('.co.uk','.gov.uk','.ac.uk', 
					'.ltd.uk','.me.uk','.mod.uk','.net.uk',
					'.nhs.uk','.nic.uk', '.org.uk','.parliament.uk',
					'.plc.uk','.police.uk','.sch.uk', '.bl.uk','.icnet.uk',
					'.jet.uk','.nls.uk');
	$others = array('.tv','.eu');
	
	$domains = array_merge($gTLDs, $newTLDs, $ukcc, $others);
	//
	
	$_protocols = array_map('preg_quote', $protocols);
	$_domains = array_map('preg_quote', $domains);

	//first split the non word breaks
	$pattern = '/((?<!\'|\"|=| )(' . implode('|', $_protocols) . ')[^( |\.)])/imsu';
	$replace = ' \\1';
	$text = preg_replace($pattern, $replace, $text);
	$pattern = '/(?<! |\/|"\'|=)(www\.)/ims';
	$text = preg_replace($pattern, $replace, $text);
	
	//by now we should have clean links
	//recognise links
	$pattern = '/([^ ]*(' . implode ('|', $_domains) . ')((\?|\/|&|#)[^ ]*)?)/imsue';
	$text = preg_replace($pattern, "_linkify('\\1')", $text);
	return $text;
}

function _linkify($text){
	if (!preg_match('/^(src|href)/ims', $text)):
		return <<<HTTP
<a href="$text" target="_blank">$text</a>
HTTP;
	else:
		return $text;
	endif;
}

?>
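a note on the `/imsue` patterns above: the `e` modifier evaluates the replacement as PHP code; it was deprecated in PHP 5.5 and removed in PHP 7, so on modern PHP the same step would use `preg_replace_callback`. a minimal sketch of that substitution (with a trimmed stand-in for the `_linkify` helper):

```php
<?php
// Sketch: replacing the /e-based preg_replace with preg_replace_callback.
// _linkify() here is a trimmed stand-in for the helper in the post above.
function _linkify($text) {
    return '<a href="' . $text . '" target="_blank">' . $text . '</a>';
}
$_domains = array('\.com', '\.co\.uk');
$pattern  = '/([^ ]*(' . implode('|', $_domains) . ')((\?|\/|&|#)[^ ]*)?)/ims'; // no "ue"
$text = preg_replace_callback($pattern, function ($m) {
    return _linkify($m[1]); // the callback receives the match array directly
}, 'try www.example.com today');
echo $text; // try <a href="www.example.com" target="_blank">www.example.com</a> today
```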
 
a minor improvement to handle line endings
Code:
<?php
$text =<<<TEXT
string with a url in it http://www.boogy.com/withapath/file.php?aquery=something#fragment
more text with a half formed url www.microsoft.com
set of text with an elided set of urls http://www.domain.comhttp://www.domains.co.ukwww.halfdomain.edu
text with an hyperlink <a href='http://www.iamalink.com'>Link</a>
image <img src='http://www.tek-tips.com/images/header-logo.gif'>
TEXT;

echo nl2br(linkify($text));
function linkify($text){
	$protocols = array('http','https','ftp','file','gopher'); //add to if you need
	
	//domains
	$gTLDs = array('.info','.com','.edu','.org','.net','.mil'); //you should be able to add quite a few others so long as they do not overlap
	$newTLDs = array('.aero', '.biz', '.coop', '.info', '.museum', '.name', '.pro');
	$ukcc = array('.co.uk','.gov.uk','.ac.uk', 
					'.ltd.uk','.me.uk','.mod.uk','.net.uk',
					'.nhs.uk','.nic.uk', '.org.uk','.parliament.uk',
					'.plc.uk','.police.uk','.sch.uk', '.bl.uk','.icnet.uk',
					'.jet.uk','.nls.uk');
	$others = array('.tv','.eu');
	
	$domains = array_merge($gTLDs, $newTLDs, $ukcc, $others);
	//
	
	$_protocols = array_map('preg_quote', $protocols);
	$_domains = array_map('preg_quote', $domains);

	//first split the non word breaks
	$pattern = '/((?<!\'|\"|=| )(' . implode('|', $_protocols) . ')[^( |\.)])/imsu';
	$replace = ' \\1';
	$text = preg_replace($pattern, $replace, $text);
	$pattern = '/(?<! |\/|"\'|=)(www\.)/ims';
	$text = preg_replace($pattern, $replace, $text);
	
	//by now we should have clean links
	//recognise links
	$pattern = '/([^(\s|\n)]*(' . implode ('|', $_domains) . ')((\?|\/|&|#)[^(\s|\n)]*)?)/imsue';
	$text = preg_replace($pattern, "_linkify('\\1')", $text);
	return $text;
}

function _linkify($text){
	if (!preg_match('/^(src|href)/ims', $text)):
		return <<<HTTP
<a href="$text" target="_blank">$text</a>
HTTP;
	else:
		return $text;
	endif;
}
?>
 
hey there,

the test is looking good: the only issue is the links appear to get hyperlinked too lol

was tempted to use:
Code:
$text = preg_replace('#([\s|^])(www)#i', '$1http://$2', $text);
but not sure

am now trying to incorporate email links and youtube links into your solution like so:
Code:
function _linkify($text){
///////////////
$youtube = '<p><object width="188" height="152"><param name="movie" value="http://www.youtube.com/v/\\2"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/\\2" type="application/x-shockwave-flash" wmode="transparent" width="188" height="152"></embed></object></p>';
$yousearch = "!(http://.*youtube\.com.*v=)?([a-zA-Z0-9_-]{11})(&.*)?!";

	if(preg_match("/\@/",$text)){
		$text = "<a rel='nofollow' href='mailto:{$text}'>{$text}</a>";
	}
	
	if(preg_match($yousearch,$text)){
		$text = preg_replace($yousearch, $youtube, $text);
	}
///////////////

	if (!preg_match('/^(src|href)/ims', $text)){       
		return "<a rel='nofollow' href=".$text." target='_blank'>".$text."</a>";    
	} else {        
		return $text;    
	}
}

but I believe I need all the preg_matches in one if statement; will run some more tests

thank you
 
surely it should get linked? this code links just about everything that is NOT inside a tag as an href or src attribute.

can you provide some more precise detail about what you need? it's taken me soooo long to write this that I'd like to see it through to the end.
 
hi there,

it does get linked, just to the wrong place :-)
I have left my test.php up so you can look at it.
thanks so much for your dedication - i have tried countless versions from research on Google and all of them have massive issues or do not work properly.

In a perfect world I would like email links to work, and if someone enters a youtube link then the HTML is altered to embed a youtube video - however I don't expect you to look at this as you have already done so much for me on this post and others :-)

thanks
 
i see. the 'wrong' linking is due to browser functionality. I have fixed this in the code below.

the regexes do not handle all types of links, and at the moment definitely do not handle things like mailto: protocols and other URIs that have @ and similar within them. it is probably not a lot of work to augment, but if you don't need it, so much the better.

the code now embeds youtube urls

Code:
<?php
$text =<<<TEXT
string with a url in it http://www.boogy.com/withapath/file.php?aquery=something#fragment
more text with a half formed url www.microsoft.com
set of text with an elided set of urls http://www.domain.comhttp://www.domains.co.ukwww.halfdomain.edu
text with an hyperlink <a href='http://www.iamalink.com'>Link</a>
image <img src='http://www.tek-tips.com/images/header-logo.gif'>
and a youtube video: http://www.youtube.com/watch?v=fllDB3FK7pI
TEXT;
ini_set('display_errors', true);
error_reporting(E_ALL);
set_magic_quotes_runtime(FALSE); //note: this function was removed in PHP 7
echo linkify($text);
function linkify($text){
	$protocols = array('http','https','ftp','file','gopher'); //add to if you need
	
	//domains
	$gTLDs = array('.info','.com','.edu','.org','.net','.mil'); //you should be able to add quite a few others so long as they do not overlap
	$newTLDs = array('.aero', '.biz', '.coop', '.info', '.museum', '.name', '.pro');
	$ukcc = array('.co.uk','.gov.uk','.ac.uk', 
					'.ltd.uk','.me.uk','.mod.uk','.net.uk',
					'.nhs.uk','.nic.uk', '.org.uk','.parliament.uk',
					'.plc.uk','.police.uk','.sch.uk', '.bl.uk','.icnet.uk',
					'.jet.uk','.nls.uk');
	$others = array('.tv','.eu');
	
	$domains = array_merge($gTLDs, $newTLDs, $ukcc, $others);
	//
	
	$_protocols = array_map('preg_quote', $protocols);
	$_domains = array_map('preg_quote', $domains);

	//first split the non word breaks
	$pattern = '/((?<!\'|\"|=| )(' . implode('|', $_protocols) . ')[^( |\.)])/imsu';
	$replace = ' \\1';
	$text = preg_replace($pattern, $replace, $text);
	$pattern = '/(?<! |\/|"\'|=)(www\.)/ims';
	$text = preg_replace($pattern, $replace, $text);
	
	
	//now translate youtube links
	$pattern = '/\s(http\:\/\/www\.youtube\.com\/watch\?v\=(\w{11}))/imse';
	$text = preg_replace($pattern, "_youTubeEmbed('\\2')", $text);
	
	//by now we should have clean links
	//recognise links
	$pattern = '/([^(\s|\n)]*(' . implode ('|', $_domains) . ')((\?|\/|&|#)[^(\s|\n)]*)?)/imsue';
	$text = preg_replace($pattern, "_linkify('\\1')", $text);
	return $text;
}

function _youTubeEmbed($code){
	return <<<HTML
<object width="425" height="350" data="http://www.youtube.com/v/{$code}" type="application/x-shockwave-flash"><param name="src" value="http://www.youtube.com/v/{$code}" /></object>
HTML;
}

function _linkify($text){
	$text = str_replace('\"', '"', $text);
	if (!preg_match('/^(src|href|data|value)/ims', $text)):
		$protocols = array('http','https','ftp','file','gopher');
		$_protocols = implode('|',$protocols);
		if (preg_match('/^('.$_protocols . ')/ims', $text)):
			return <<<HTTP
<a href="http://{$text}" target="_blank">$text</a>
HTTP;
		else:
			return <<<HTTP
<a href="{$text}" target="_blank">$text</a>
HTTP;
		endif;
	else:
		return $text;
	endif;
}

?>
 
wow, you did that really quickly - am at work at the moment, will test tonight

you're also probably right about email addresses - they will only work if the user uses Outlook anyway, so I think I will leave that out!

thanks (will let you know how I get on)
 
The mailto: protocol should work with whatever the default mail program is on the user's machine. It is possible that email addresses will get garbled by my code. You'll have to see.
Am I right in thinking that you are in St Austell?
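if you do want addresses linked rather than garbled, one hedged approach (the pattern is illustrative, deliberately not RFC-complete) is a separate pass that wraps bare addresses in mailto: links before the URL pass runs:

```php
<?php
// Illustrative email pass: simple word-ish local part and dotted domain,
// not a full RFC 5322 matcher.
$pattern = '/\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)\b/';
$text = preg_replace($pattern, '<a href="mailto:$1">$1</a>', 'write to joe@example.com please');
echo $text; // write to <a href="mailto:joe@example.com">joe@example.com</a> please
```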
 
hi
yeah, you would be right - St Austell, Cornwall. how did you know that? lol

anyway, have re-uploaded the new test.php and have a problem with this line:
Code:
return<<<HTTP<a href="http://{$text}" target="_blank">$text</a>HTTP;
I get an unexpected T_SL parse error, but there are definitely no white spaces?

tried to change to:
Code:
return '<a rel="nofollow" href="http://'.$text.'" target="_blank">'.$text.'</a>';

but when the links are displayed on the page they look fine, yet the actual link points to the wrong address - weird, I can see no reason for this?
you are also right, email addresses get garbled lol - will try to implement a check.

Again many thanks
 
this
Code:
return<<<HTTP<a href="http://{$text}" target="_blank">$text</a>HTTP;
is called heredoc syntax. It MUST be split across multiple lines, like this:

Code:
return <<<HTTP
<a href="http://{$text}" target="_blank">$text</a>
HTTP;

can you give me some test html to work with? i thought I had fixed the browser problem.

spent a lot of my childhood in your neck of the woods. we had a few houses in Charlestown, played a lot of golf at Trevose, and my family (the Treffrys) came from Fowey (the big house above the harbour, Place, is theirs).
 
aha. spotted the problem.
change this line as shown (note the added !)

Code:
if (!preg_match('/^('.$_protocols . ')/ims', $text)):
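the effect of the added `!`, shown in isolation (the function name here is mine, purely for illustration): only strings that do not already start with a protocol get the http:// prefix.

```php
<?php
// Hypothetical helper demonstrating the corrected branch logic.
function prefix_if_needed($text) {
    $protocols = 'http|https|ftp|file|gopher';
    if (!preg_match('/^(' . $protocols . ')/i', $text)) {
        return 'http://' . $text; // no protocol yet: add one
    }
    return $text; // already protocol-prefixed: leave alone
}
echo prefix_if_needed('www.example.com');    // http://www.example.com
echo "\n";
echo prefix_if_needed('http://example.com'); // http://example.com (unchanged)
```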
 
WOW and WOW

firstly, my preliminary tests of your code now look flawless - certainly 1000 times better than my previous attempts. thanks so much. If you own a website it must be perrrfect!

secondly, what a small world! where are you now (if you don't mind me asking)? I do like Cornwall, apart from the broadband speeds lol

Will attempt to implement your code live tonight/tomorrow and run more tests. I notice it inadvertently picks up emails as links but does not link them to anywhere - I think this is actually tidy!

Will repost upon success
Many, many thanks - you have solved the impossible. you should blog it or something for others.

Laters
 
