Please help with (what should be) a simple regex issue?

Status
Not open for further replies.

BobMCT

IS-IT--Management
Sep 11, 2000
756
US
I'm trying to alter all "href=" parameters and their values in some html files. I'm trying to use sed for this.

An example of an html line is this:
<a href="
My current sed regex looks like this:
s/href="(.+)"/href="#"/g

With it I am trying to match href=" followed by anything up to the closing " and replace the whole match with
href="#" so the link becomes inactive.

Could someone please explain how, once href=" is found, to match everything between it and the closing " and replace it with "#", so that the resulting HTML link looks like this:
<a href="#">whatever</a>

Thanks for helping my headache!
 
Most versions of sed don't understand "+" to mean one-or-more occurrences (it's part of the Extended Regular Expression language), although GNU sed can understand it if you specify the -r or --regexp-extended option.

Another potential issue is that .+ would be a "greedy" expression and if you have multiple URLs on a single line it would consume all of the text between them as well.

Try sed 's/href="[^"]*"/href="#"/g' perhaps? This is less greedy because it only matches any non-double-quote characters surrounded by a pair of double-quotes. It also uses '*' for any-number-of-occurrences which is supported by all versions of sed.
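To see the difference between the greedy pattern and the negated character class side by side, here is a minimal sketch using Python's re module instead of sed. The sample line and its two file names are made up for illustration.

```python
import re

# Hypothetical line with two links; the file names are invented.
line = '<a href="one.html">one</a> <a href="two.html">two</a>'

# Greedy: .+ runs on to the LAST double quote on the line,
# swallowing everything between the two links.
greedy = re.sub(r'href="(.+)"', 'href="#"', line)

# Negated class: [^"]* cannot cross a double quote, so each
# attribute value is matched on its own.
bounded = re.sub(r'href="[^"]*"', 'href="#"', line)

print(greedy)   # <a href="#">two</a>
print(bounded)  # <a href="#">one</a> <a href="#">two</a>
```

The first result loses the whole first link, which is the "consume all of the text between them" problem described above.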

Annihilannic
 
Thanks Annihilannic,

I was unaware of the (relatively minor) differences in the sed syntax. Short of trial and error, which I do plenty of, I use the web and my "Mastering Regular Expressions" book. And that's how I arrived at my non-working solution.

Thanks for saving my day.[thumbsup2]
 
Hi

Just as a note, do not expect perfection when altering SGML with regular expressions.

In this case, for example:

  • The following will not be modified:
      - <a href=''>whatever</a>
      - <a href=>whatever</a>
  • The following will be modified:
      - <base href="">
      - <link rel="stylesheet" type="text/css" href="style.css">
      - <map name="map"><area href="whatever.html" shape="rect" coords="0,0,100,100"></map>
      - <map name="map"><area nohref="nohref" shape="rect" coords="25,25,75,75"></map>
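The cases above can be checked quickly. This is only a sketch in Python, applying the same pattern as the sed command; the value page.html is a placeholder I made up, since the examples above carry no URL.

```python
import re

PATTERN = re.compile(r'href="[^"]*"')

samples = [
    "<a href='page.html'>whatever</a>",   # single quotes: not matched
    "<a href=page.html>whatever</a>",     # no quotes: not matched
    '<base href="page.html">',            # not a link, but matched
    '<link rel="stylesheet" type="text/css" href="style.css">',
    '<area nohref="nohref" shape="rect" coords="25,25,75,75">',
]

# Map each sample to what the substitution turns it into.
results = {s: PATTERN.sub('href="#"', s) for s in samples}

for before, after in results.items():
    status = "modified" if before != after else "untouched"
    print(f"{status:9} {after}")
```

Note the last sample: the pattern has no word boundary, so it matches inside nohref="nohref" and the attribute comes out as nohref="#".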


Feherke.
 
Very good point feherke,

So perhaps by further qualifying my patterns I could control what gets replaced and what doesn't. Perhaps you could help by commenting on the following:

I need to replace the value of any href of the following forms:

href=" with "#"
href="sites/*anything*" with "#"

any href that begins with:
href="home_files*anything*" is to be ignored

Again, I tried various combinations this morning without any luck.

Thanks
 
sed '/href="home_files/b
s/href="[^"]*"/href="#"/g'
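For anyone who reads Python more easily than sed, here is a rough line-by-line equivalent of the script above, as I understand it: the b command skips the rest of the script for any line containing href="home_files, so that whole line is passed through unchanged. The function name and sample lines are made up.

```python
import re

def neutralize_lines(lines):
    """Mimic the sed script: skip lines with a home_files link, rewrite the rest."""
    out = []
    for line in lines:
        if 'href="home_files' in line:
            # Like sed's "b": leave the entire line alone.
            out.append(line)
        else:
            out.append(re.sub(r'href="[^"]*"', 'href="#"', line))
    return out

sample = [
    '<a href="sites/a.html">a</a>',
    '<a href="home_files/b.html">b</a>',
    '<a href="home_files/c.html">c</a> <a href="sites/d.html">d</a>',
]
for line in neutralize_lines(sample):
    print(line)
```

The third sample line comes out untouched, including its sites/ link, because the decision is made per line and not per link.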

Hope This Helps, PH.
 
Hi

OK, then another note. The following three all point to the same thing:

  • <a href="home_files">same</a>
  • <a href="%68%6f%6d%65_files">same</a>
  • <a href="&#x68;&#x6f;&#x6d;&#x65;_files">same</a>

If you want a robust solution, forget Sed or any other regular-expression-based approach. Choose a programming/scripting language and a parser library/module, and code it with them. Such parsers handle differences like these and, to a certain degree, even errors in the markup.
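As a small illustration of why the three spellings are equivalent, the sketch below decodes them with Python's standard library. The normalize helper is a name I made up; it only combines html.unescape (entity decoding, which an HTML parser does for attribute values) with urllib.parse.unquote (percent-decoding, which happens at the URL level).

```python
from html import unescape
from urllib.parse import unquote

variants = [
    "home_files",
    "%68%6f%6d%65_files",
    "&#x68;&#x6f;&#x6d;&#x65;_files",
]

def normalize(href):
    # Decode character references first, then percent escapes.
    return unquote(unescape(href))

normalized = [normalize(v) for v in variants]
print(normalized)  # ['home_files', 'home_files', 'home_files']
```

A plain text pattern such as href="home_files only ever sees the first spelling.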

What I have come across so far, and would start this task with:

  • HTML Parser in Java
  • HTML::Parser in Perl
  • Document Object Model in PHP
  • HTMLParser in Python
  • Hpricot in Ruby

(No guarantee they are the best; I have mostly just given them a try and have not used them exhaustively.)
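To give an idea of what the parser route looks like, here is a minimal sketch built on Python's html.parser.HTMLParser. The class and function names are made up, and the rule (keep links whose decoded target starts with home_files, point every other <a href> at "#") is my reading of the requirement stated earlier in the thread. It copies every tag it does not rewrite verbatim, but a rewritten <a> tag is rebuilt from its parsed attributes, so its quoting and spacing may change.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import unquote


class HrefNeutralizer(HTMLParser):
    """Re-emit the input, pointing selected <a href="..."> values at "#"."""

    def __init__(self, keep_prefix="home_files"):
        # convert_charrefs=False: entities in text reach the handlers
        # below undecoded, so they can be written back as they were.
        super().__init__(convert_charrefs=False)
        self.keep_prefix = keep_prefix
        self.out = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # The parser has already decoded entities in attribute values;
        # unquote() additionally undoes %xx escapes before the check.
        if tag != "a" or href is None or unquote(href).startswith(self.keep_prefix):
            self.out.append(self.get_starttag_text())  # copy tag verbatim
            return
        parts = []
        for name, value in attrs:
            if name == "href":
                value = "#"
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<a " + " ".join(parts) + ">")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags: avoid the default, which also emits an end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def neutralize(html_text):
    parser = HrefNeutralizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(neutralize('<a href="sites/x.html">x</a> <a href="home_files/y.html">y</a>'))
```

Unlike the regex, this leaves <base>, <link> and <area> alone, copes with single-quoted and unquoted values, and recognises the encoded spellings of home_files.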

Feherke.
 
Appreciate all the help.

Normally I would write a program to do this, but in this particular case it's a one-off, so I didn't want to spend much time on it.

But as we all know - many times the short cut wastes additional valuable time.
 
Hi

Well, if you know what to expect, regular expressions are indeed mostly enough.

For example, if you know there will be no line containing both kinds of links, PHV's code is perfect. But if you have both replaceable and ignorable links on the same line, something stronger than Sed is required. Personally I prefer Ruby, but Perl and JavaScript are also simple:
Code:
perl -pe 's/href="([^"]*)"/index($1,"home_files")?"href=\"#\"":$&/ge' /input/file

# or

ruby -pe '$_.gsub!(/href="([^"]*)"/){ $1.start_with?("home_files") ? $& : "href=\"#\"" }' /input/file

# or

js -e 'while(s=readline())print(s.replace(/href="([^"]*)"/g,function($0,$1){return $1.indexOf("home_files")==0?$0:"href=\"#\""}))' /input/file
But similarly simple solutions can be put together in PHP and Python too.
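For the Python case, a minimal sketch of the same per-match decision could look like this. The function name and sample line are made up; the idea is the same as in the one-liners above, namely a replacement callback that keeps any match whose value starts with home_files.

```python
import re

HREF = re.compile(r'href="([^"]*)"')

def neutralize(line, keep_prefix="home_files"):
    def repl(match):
        # Keep the original attribute for links to be ignored,
        # otherwise point the link at "#".
        if match.group(1).startswith(keep_prefix):
            return match.group(0)
        return 'href="#"'
    return HREF.sub(repl, line)

mixed = '<a href="home_files/c.html">c</a> <a href="sites/d.html">d</a>'
print(neutralize(mixed))
# <a href="home_files/c.html">c</a> <a href="#">d</a>
```

It still shares the regex limitations mentioned earlier (quoting styles, other tags, encoded values); it only solves the mixed-line problem.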

Feherke.
 
"But as we all know - many times the short cut wastes additional valuable time."

From fortune:
If a shortcut was easy it would not be a shortcut, it would be the way.

A Maintenance contract is essential, not a Luxury.
Do things on the cheap & it will cost you dear
 