
preg_replace autolink input 1

Status
Not open for further replies.

jasc2k (Programmer) | Nov 2, 2005 | 113 | GB
hi all,

my shortened function below simply takes user input and links any URLs or email addresses appropriately in HTML, so that when posted the links work, as seen on Facebook etc.

Code:
/* Convert all URL matches to appropriate HTML links */
			$message = preg_replace('#([\s|^])(www)#i', '$1http://$2', $message);
			$pattern = '#((http|https|ftp|telnet|news|gopher|file|wais):\/\/[^\s]+)#i';
			$replacement = '<a href="$1" target="_blank">$1</a>';
			$message = preg_replace($pattern, $replacement, $message);
		
			/* Convert all E-mail matches to appropriate HTML links */
			$pattern = '#([0-9a-z]([-_.]?[0-9a-z])*@[0-9a-z]([-.]?[0-9a-z])*\\.';
			$pattern .= '[a-wyz][a-z](fo|g|l|m|mes|o|op|pa|ro|seum|t|u|v|z)?)#i';
			$replacement = '<a href="mailto:\\1">\\1</a>';
			$message = preg_replace($pattern, $replacement, $message);

no matter what I do I always get the same basic problem: when two or more URLs are entered in a row (without spaces, or joined by another character), all of the links get joined into one.

i.e. the user submits three URLs back to back and the whole string comes out as one link instead of three?
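the joining is easy to reproduce in isolation. a hedged sketch (the URLs are made up): the `[^\s]+` in the pattern is greedy, so with no whitespace between two URLs a single match swallows both.

```php
<?php
// Minimal repro (hypothetical URLs): [^\s]+ keeps matching until the next
// whitespace, so two URLs written back to back become one match.
$message = 'see http://one.example.comhttp://two.example.com here';
$pattern = '#((http|https)://[^\s]+)#i';
echo preg_replace($pattern, '<a href="$1" target="_blank">$1</a>', $message);
// emits a single <a> whose href spans both "URLs"
```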

I know someone must have the answer - interestingly, on posting I realised the same happens here :-(
thanks in advance
 
the best answer would be to do some pre-analysis based on the list of known allowed TLDs.

but as ever, if the user submits garbage there is a limit to what you can do to prevent garbage output.
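one way to sketch that pre-analysis idea (a compromise, not a full solution): insert a space before any protocol prefix that is glued to preceding text, then linkify as before. `https?://` here is an assumption - extend it for whatever other protocols you accept.

```php
<?php
// Hedged pre-pass: any "http://" or "https://" glued to a preceding
// non-space character gets a space in front of it, splitting joined URLs.
// (This would also split a URL that legitimately contains "http://" in
// its query string - an accepted compromise.)
$message = 'http://a.example.comhttps://b.example.co.uk';
$split = preg_replace('#(?<=\S)(https?://)#i', ' $1', $message);
echo $split; // "http://a.example.com https://b.example.co.uk"
```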
 
hi there,

thanks for your reply - as soon as I saw that even this forum's script does the same, I kind of gave up hope and assumed it was just a step too far :-)

thanks
 
i spent a couple of hours yesterday on this and pretty much got it cracked.

then I lost my laptop.

it is possible I have the script on a backup somewhere. I will see whether I can dig it out.
 
whoa, lost your laptop?
that is not good! hope you find it again (and not just for me, lol)

cheers
 
have not given up on this. I recrafted a solution from what I could remember and got reasonably close but not good enough.

the aggravating thing is that this has to be done in PHP rather than JavaScript, because we inevitably need zero-width assertions that are not properly supported by JS. otherwise it would be nice to leave this kind of processor-intensive server bottleneck to the client side.
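for reference, the kind of zero-width assertion meant here is a lookbehind - e.g. matching `www.` only when not preceded by a quote, slash, space or `=`, so attribute values are left alone (the strings below are illustrative):

```php
<?php
// Lookbehind sketch: (?<!...) asserts about the preceding character
// without consuming it.
$pattern = '/(?<![ \/\'"=])(www\.)/';
var_dump(preg_match($pattern, 'gluedwww.example.com')); // int(1): "www." preceded by a letter
var_dump(preg_match($pattern, ' www.example.com'));     // int(0): preceded by a space
```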

just to note: any solution will be a compromise. in my proposed solution I do not recognise a bare domain.com as a valid URL; I do recognise www-prefixed and protocol-prefixed strings as valid URLs.

i can be more precise if you could give some more refinements as to which TLDs you are principally interested in, i.e. all gTLDs plus particular country domains.

without this ability to refine it is difficult, for example, to set up rules that differentiate between a valid Colombian .co domain and the co in a .co.uk address.
 
superb, thank you for your dedication.
my biggest problem (hence the post) is that I have a text field a user can type into, with carriage returns, which also cause the issue.

just to note: any solution will be a compromise. in my proposed solution I do not recognise a bare domain.com as a valid URL; I do recognise www-prefixed and protocol-prefixed strings as valid URLs.

i figured there must be some sort of limit to its range. not having domain.com would suit me fine, but would that not mean file.php also gets caught by it?

I would (ideally) want .com, .org, .net and .co.uk - I think these would be the most used, but again I'm not sure how difficult adding these is.

Many Thanks,
James
 
have a go with this. it is an imperfect solution but I think it's a reasonable compromise.

Code:
<?php
$text =<<<TEXT
string with a url in it http://www.boogy.com/withapath/file.php?aquery=something#fragment
more text with a half formed url www.microsoft.com
set of text with an elided set of urls http://www.domain.comhttp://www.domains.co.ukwww.halfdomain.edu
text with an hyperlink <a href='http://www.iamalink.com'>Link</a>
image <img src='http://www.tek-tips.com/images/header-logo.gif'>
TEXT;

echo linkify(nl2br($text)); //obviously you do not need the nl2br, nor the echo should you wish to capture the results.
function linkify($text){
	$protocols = array('http','https','ftp','file','gopher'); //add to if you need
	
	//domains
	$gTLDs = array('.info','.com','.edu','.org','.net','.mil'); //you should be able to add quite a few others so long as they do not overlap
	$newTLDs = array('.aero', '.biz', '.coop', '.info', '.museum', '.name', '.pro');
	$ukcc = array('.co.uk','.gov.uk','.ac.uk', 
					'.ltd.uk','.me.uk','.mod.uk','.net.uk',
					'.nhs.uk','.nic.uk', '.org.uk','.parliament.uk',
					'.plc.uk','.police.uk','.sch.uk', '.bl.uk','.icnet.uk',
					'.jet.uk','.nls.uk');
	$others = array('.tv','.eu');
	
	$domains = array_merge($gTLDs, $newTLDs, $ukcc, $others);
	//
	
	$_protocols = array_map('preg_quote', $protocols);
	$_domains = array_map('preg_quote', $domains);

	//first split the non word breaks
	$pattern = '/((?<!\'|\"|=| )(' . implode('|', $_protocols) . ')[^( |\.)])/imsu';
	$replace = ' \\1';
	$text = preg_replace($pattern, $replace, $text);
	$pattern = '/(?<! |\/|"\'|=)(www\.)/ims';
	$text = preg_replace($pattern, $replace, $text);
	
	//by now we should have clean links
	//recognise links
	$pattern = '/([^ ]*(' . implode ('|', $_domains) . ')((\?|\/|&|#)[^ ]*)?)/imsue';
	$text = preg_replace($pattern, "_linkify('\\1')", $text);
	return $text;
}

function _linkify($text){
	if (!preg_match('/^(src|href)/ims', $text)):
		return <<<HTTP
<a href="$text" target="_blank">$text</a>
HTTP;
	else:
		return $text;
	endif;
}

?>
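a note on the `/imsue` patterns above: the `e` modifier evaluates the replacement as PHP code; it was deprecated in PHP 5.5 and removed in PHP 7, so on modern PHP the same step would use `preg_replace_callback`. a minimal sketch of that substitution (with a trimmed stand-in for the `_linkify` helper):

```php
<?php
// Sketch: replacing the /e-based preg_replace with preg_replace_callback.
// _linkify() here is a trimmed stand-in for the helper in the post above.
function _linkify($text) {
    return '<a href="' . $text . '" target="_blank">' . $text . '</a>';
}
$_domains = array('\.com', '\.co\.uk');
$pattern  = '/([^ ]*(' . implode('|', $_domains) . ')((\?|\/|&|#)[^ ]*)?)/ims'; // no "ue"
$text = preg_replace_callback($pattern, function ($m) {
    return _linkify($m[1]); // the callback receives the match array directly
}, 'try www.example.com today');
echo $text; // try <a href="www.example.com" target="_blank">www.example.com</a> today
```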
 
a minor improvement to handle line endings
Code:
<?php
$text =<<<TEXT
string with a url in it http://www.boogy.com/withapath/file.php?aquery=something#fragment
more text with a half formed url www.microsoft.com
set of text with an elided set of urls http://www.domain.comhttp://www.domains.co.ukwww.halfdomain.edu
text with an hyperlink <a href='http://www.iamalink.com'>Link</a>
image <img src='http://www.tek-tips.com/images/header-logo.gif'>
TEXT;

echo nl2br(linkify($text));
function linkify($text){
	$protocols = array('http','https','ftp','file','gopher'); //add to if you need
	
	//domains
	$gTLDs = array('.info','.com','.edu','.org','.net','.mil'); //you should be able to add quite a few others so long as they do not overlap
	$newTLDs = array('.aero', '.biz', '.coop', '.info', '.museum', '.name', '.pro');
	$ukcc = array('.co.uk','.gov.uk','.ac.uk', 
					'.ltd.uk','.me.uk','.mod.uk','.net.uk',
					'.nhs.uk','.nic.uk', '.org.uk','.parliament.uk',
					'.plc.uk','.police.uk','.sch.uk', '.bl.uk','.icnet.uk',
					'.jet.uk','.nls.uk');
	$others = array('.tv','.eu');
	
	$domains = array_merge($gTLDs, $newTLDs, $ukcc, $others);
	//
	
	$_protocols = array_map('preg_quote', $protocols);
	$_domains = array_map('preg_quote', $domains);

	//first split the non word breaks
	$pattern = '/((?<!\'|\"|=| )(' . implode('|', $_protocols) . ')[^( |\.)])/imsu';
	$replace = ' \\1';
	$text = preg_replace($pattern, $replace, $text);
	$pattern = '/(?<! |\/|"\'|=)(www\.)/ims';
	$text = preg_replace($pattern, $replace, $text);
	
	//by now we should have clean links
	//recognise links
	$pattern = '/([^(\s|\n)]*(' . implode ('|', $_domains) . ')((\?|\/|&|#)[^(\s|\n)]*)?)/imsue';
	$text = preg_replace($pattern, "_linkify('\\1')", $text);
	return $text;
}

function _linkify($text){
	if (!preg_match('/^(src|href)/ims', $text)):
		return <<<HTTP
<a href="$text" target="_blank">$text</a>
HTTP;
	else:
		return $text;
	endif;
}
?>
 
hey there,

the test is looking good: the only issue is the links appear to get hyperlinked too lol

was tempted to use:
Code:
$text = preg_replace('#([\s|^])(www)#i', '$1http://$2', $text);
but not sure

am now trying to incorporate email links and youtube links into your solution like so:
Code:
function _linkify($text){
///////////////
$youtube = '<p><object width="188" height="152"><param name="movie" value="http://www.youtube.com/v/\\2"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/\\2" type="application/x-shockwave-flash" wmode="transparent" width="188" height="152"></embed></object></p>';
$yousearch = "!(http://.*youtube\.com.*v=)?([a-zA-Z0-9_-]{11})(&.*)?!";

	if(preg_match("/\@/",$text)){
		$text = "<a rel='nofollow' href='mailto:{$text}'>{$text}</a>";
	}
	
	if(preg_match($yousearch,$text)){
		$text = preg_replace($yousearch, $youtube, $text);
	}
///////////////

	if (!preg_match('/^(src|href)/ims', $text)){       
		return "<a rel='nofollow' href=".$text." target='_blank'>".$text."</a>";    
	} else {        
		return $text;    
	}
}

but I believe I need all the preg_matches in one if statement; will run some more tests

thank you
 
surely it should get linked? this code links just about everything that is NOT inside a tag as an href or src attribute.

can you provide some more precise detail about what you need? it's taken me soooo long to write this that I'd like to see it through to the end.
 
hi there,

it does get linked, just to the wrong place :-)
I have left my test.php up so you can look at it.
thanks so much for your dedication - i have tried countless versions from research on Google and all of them have massive issues or do not work properly.

In a perfect world I would like email links to work, and if someone enters a youtube link then the HTML is altered to embed a youtube video - however I don't expect you to look at this as you have already done so much for me on this post and others :-)

thanks
 
i see. the 'wrong' linking is due to browser functionality. I have fixed this in the code below.

the regexes do not handle all types of links, and at the moment definitely do not handle things like mailto: protocols and other URIs that have @ and similar within them. it is probably not a lot of work to augment, but if you don't need it, so much the better.

the code now embeds youtube urls

Code:
<?php
$text =<<<TEXT
string with a url in it http://www.boogy.com/withapath/file.php?aquery=something#fragment
more text with a half formed url www.microsoft.com
set of text with an elided set of urls http://www.domain.comhttp://www.domains.co.ukwww.halfdomain.edu
text with an hyperlink <a href='http://www.iamalink.com'>Link</a>
image <img src='http://www.tek-tips.com/images/header-logo.gif'>
and a youtube video: http://www.youtube.com/watch?v=fllDB3FK7pI
TEXT;
ini_set('display_errors', true);
error_reporting(E_ALL);
set_magic_quotes_runtime(FALSE); //note: this function was removed in PHP 7
echo linkify($text);
function linkify($text){
	$protocols = array('http','https','ftp','file','gopher'); //add to if you need
	
	//domains
	$gTLDs = array('.info','.com','.edu','.org','.net','.mil'); //you should be able to add quite a few others so long as they do not overlap
	$newTLDs = array('.aero', '.biz', '.coop', '.info', '.museum', '.name', '.pro');
	$ukcc = array('.co.uk','.gov.uk','.ac.uk', 
					'.ltd.uk','.me.uk','.mod.uk','.net.uk',
					'.nhs.uk','.nic.uk', '.org.uk','.parliament.uk',
					'.plc.uk','.police.uk','.sch.uk', '.bl.uk','.icnet.uk',
					'.jet.uk','.nls.uk');
	$others = array('.tv','.eu');
	
	$domains = array_merge($gTLDs, $newTLDs, $ukcc, $others);
	//
	
	$_protocols = array_map('preg_quote', $protocols);
	$_domains = array_map('preg_quote', $domains);

	//first split the non word breaks
	$pattern = '/((?<!\'|\"|=| )(' . implode('|', $_protocols) . ')[^( |\.)])/imsu';
	$replace = ' \\1';
	$text = preg_replace($pattern, $replace, $text);
	$pattern = '/(?<! |\/|"\'|=)(www\.)/ims';
	$text = preg_replace($pattern, $replace, $text);
	
	
	//now translate youtube links
	$pattern = '/\s(http\:\/\/www\.youtube\.com\/watch\?v\=(\w{11}))/imse';
	$text = preg_replace($pattern, "_youTubeEmbed('\\2')", $text);
	
	//by now we should have clean links
	//recognise links
	$pattern = '/([^(\s|\n)]*(' . implode ('|', $_domains) . ')((\?|\/|&|#)[^(\s|\n)]*)?)/imsue';
	$text = preg_replace($pattern, "_linkify('\\1')", $text);
	return $text;
}

function _youTubeEmbed($code){
	return <<<HTML
<object width="425" height="350" data="http://www.youtube.com/v/{$code}" type="application/x-shockwave-flash"><param name="src" value="http://www.youtube.com/v/{$code}" /></object>
HTML;
}

function _linkify($text){
	$text = str_replace('\"', '"', $text);
	if (!preg_match('/^(src|href|data|value)/ims', $text)):
		$protocols = array('http','https','ftp','file','gopher');
		$_protocols = implode('|',$protocols);
		if (preg_match('/^('.$_protocols . ')/ims', $text)):
			return <<<HTTP
<a href="http://{$text}" target="_blank">$text</a>
HTTP;
		else:
			return <<<HTTP
<a href="{$text}" target="_blank">$text</a>
HTTP;
		endif;
	else:
		return $text;
	endif;
}

?>
 
wow, you did that really quickly - am at work at the moment, will test tonight

you're also probably right about email addresses - they will only work if the user uses Outlook anyway, so I think I will leave that out!

thanks (will let you know how I get on)
 
The mailto: protocol should work with whatever the default mail program is on the user's machine. It is possible that email addresses will get garbled by my code. You'll have to see.
Am I right in thinking that you are in St Austell?
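if you do want addresses linked rather than garbled, one hedged approach (the pattern is illustrative, deliberately not RFC-complete) is a separate pass that wraps bare addresses in mailto: links before the URL pass runs:

```php
<?php
// Illustrative email pass: simple word-ish local part and dotted domain,
// not a full RFC 5322 matcher.
$pattern = '/\b([\w.+-]+@[\w-]+(?:\.[\w-]+)+)\b/';
$text = preg_replace($pattern, '<a href="mailto:$1">$1</a>', 'write to joe@example.com please');
echo $text; // write to <a href="mailto:joe@example.com">joe@example.com</a> please
```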
 
hi
yeah, you would be right - St Austell, Cornwall. how did you know that? lol

anyway, have re-uploaded the new test.php and have a problem with this line:
Code:
return<<<HTTP<a href="http://{$text}" target="_blank">$text</a>HTTP;
I get an unexpected T_SL parse error, but there are definitely no white spaces?

tried to change to:
Code:
return '<a rel="nofollow" href="http://'.$text.'" target="_blank">'.$text.'</a>';

but when the links are displayed on the page they look fine, yet the actual link points to the wrong address - weird, I can see no reason for this?
you are also right, email addresses get garbled lol - will try to implement a check.

Again many thanks
 
this
Code:
return<<<HTTP<a href="http://{$text}" target="_blank">$text</a>HTTP;
is called heredoc syntax. It MUST be split across multiple lines, like this:

Code:
return <<<HTTP
<a href="http://{$text}" target="_blank">$text</a>
HTTP;

can you give me some test html to work with? i thought I had fixed the browser problem.

spent a lot of my childhood in your neck of the woods. we had a few houses in Charlestown, played a lot of golf at Trevose, and my family (the Treffrys) came from Fowey (the big house above the harbour, Place, is theirs).
 
aha. spotted the problem.
change this line as shown (note the added !)

Code:
if (!preg_match('/^('.$_protocols . ')/ims', $text)):
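the effect of the added `!`, shown in isolation (the function name here is mine, purely for illustration): only strings that do not already start with a protocol get the http:// prefix.

```php
<?php
// Hypothetical helper demonstrating the corrected branch logic.
function prefix_if_needed($text) {
    $protocols = 'http|https|ftp|file|gopher';
    if (!preg_match('/^(' . $protocols . ')/i', $text)) {
        return 'http://' . $text; // no protocol yet: add one
    }
    return $text; // already protocol-prefixed: leave alone
}
echo prefix_if_needed('www.example.com');    // http://www.example.com
echo "\n";
echo prefix_if_needed('http://example.com'); // http://example.com (unchanged)
```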
 
WOW and WOW

firstly, my preliminary tests of your code now look flawless - certainly 1000 times better than my previous attempts. thanks so much. If you own a website it must be perrrfect!

secondly, what a small world! where are you now (if you don't mind me asking)? I do like Cornwall, apart from the broadband speeds lol

Will attempt to implement your code live tonight/tomorrow and run more tests. I notice it inadvertently picks up emails as links but does not link them to anywhere - I think this is actually tidy!

Will repost upon success
Many, many thanks - you have solved the impossible. you should blog it or something for others.

Laters
 
