RegEx help!

Status
Not open for further replies.

nwruiz

Programmer
Jun 22, 2005
60
US
Here comes a question from a RegEx newbie. I have the following HTML fragment that I am trying to fix:

Code:
<table cellpadding="3" cellspacing="0" bordercolor="#CCCCCC" border="1">
<tr align="Center" bgcolor="#CCCCCC">
	<td valign="top" class="tablefont" colspan="2"><b>Service Classification for 2006</b></td>
    <td valign="top" class="tablefont" width="29%"><b>EDI Load Profile Code</b></td>
<tr> 
	<td valign="top" class="tablefont" width="31%">SC-1, SC1B</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc1std_06.xls">Standard Service</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC1, 2SC1</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="31%">SC-1C</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc1c_06.xls">Optional Large Time of Use</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC1C, 2SC1C </td></tr>
<tr> 
	<td valign="top" class="tablefont" rowspan="2" width="31%">SC-2</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc2nd_06.xls">Non-Demand</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC2, 2SC2 </td></tr>
<tr> 
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc2dem_06.xls">Demand</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">2SC2D, 3SC2D, 1SC2D</td></tr>
<tr> 
	<td  valign="top" class="tablefont" rowspan="4" width="31%">SC-3</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3sec_06.xls">Secondary</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC3</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3pri_06.xls">Primary</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">2SC3</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3sub_06.xls">Subtransmission</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">3SC3</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3tra_06.xls">Transmission</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">4SC3</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="31%">Private Area Lighting</td>
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/pal_06.xls">Private Area Lighting</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC1L</a> (xls)</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="31%">Traffic Signals</td>
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/traffic_06.xls">Traffic Signals</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC4L</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="31%">Street Lighting</td>
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/stlght_06.xls">Street Lighting</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC2L, 1SC3L, 1SC5L, 1SC6L</td></tr>
</table>
There is a stray </a> tag in the fragment. I am trying to capture the entire contents of the table cell (<td>) which contains this problem. My end goal is to make this generic for any two tags. Since RegEx is coupled most closely with Perl, I decided to ask the question here.

My first idea was to use the following expression:
Code:
(<td[^>]*>[^<a.*?>]*?)(</a>)
which almost did the trick, but only by luck: [^<a.*?>] is a character class that excludes the single characters <, a, ., *, ? and >, so it would not work for tag names longer than one character. I am currently working along the lines of:
Code:
<(td)[^>]*>.*?(?!(<td[^>]*>)|(<a[^>]*>))(</a>).*?</td>
but this is not correct.

The logic I am after is:
Capture the <td> element (tags included) whose contents have no opening <a> tag but do have a closing </a> tag.

I am trying to ensure that I do not enter into another <td> element in this search. Does anyone have any ideas as to how I can do this?

Nick Ruiz
Associate Integrator
PPLSolutions IT Billing and Transactions
 
This RegEx meets my immediate needs for cleaning this fragment, but it would not satisfy the general condition I am trying to fix. It captures any <td> cell where an <a> tag does not immediately follow the beginning of the <td> element. I need it to capture any <td> element where no <a> tag exists but a </a> exists.

Code:
<(td)[^>]*>
(?!<a[^>]*>)
.*?</a>.*?
</td>
Whitespace has been included to make the RegEx easier to read.
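
To make the limitation described above concrete, here is a small sketch in Python (a swap-in for illustration; the pattern itself should port to Perl or .NET). The three sample cells are made up for the demonstration. Because the lookahead is tested only at the position right after the opening <td>, a cell whose balanced link starts later in the text is flagged as well.

```python
import re

# The pattern from the post. re.VERBOSE ignores the layout whitespace,
# re.DOTALL lets "." cross line breaks in a multi-line HTML source.
PATTERN = re.compile(
    r"""<(td)[^>]*>
        (?!<a[^>]*>)
        .*?</a>.*?
        </td>""",
    re.VERBOSE | re.DOTALL,
)

# Hypothetical sample cells, not taken from the original page.
stray = '<td width="29%">1SC1L</a> (xls)</td>'
balanced_late = '<td>see <a href="x.xls">file</a> (xls)</td>'
balanced_first = '<td><a href="x.xls">file</a> (xls)</td>'

def matches(cell):
    """True when the pattern flags the cell."""
    return PATTERN.search(cell) is not None

print(matches(stray))           # the cell we want to find
print(matches(balanced_late))   # well-formed, but the link is not the first thing in the cell
print(matches(balanced_first))  # well-formed, link directly after <td>
```

The second cell is the false positive: its <a> and </a> are paired, yet the pattern still matches.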

Nick Ruiz
Associate Integrator
PPLSolutions IT Billing and Transactions
 
nwruiz said:
Since RegEx is coupled most closely with Perl, I decided to ask the question here.

Does this mean that you're not actually using Perl? If you are, I'd suggest avoiding regexps altogether and using a proper tag-aware HTML parser (HTML::TokeParser::Simple is a favourite of mine).
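
For readers who want to see what the tag-aware route looks like, here is a minimal sketch. It uses Python's standard html.parser module rather than HTML::TokeParser::Simple (a swapped-in library, chosen only for illustration), and the class name StrayCloseFinder and the shortened fragment are hypothetical. It counts open <a> tags per table cell and records the position of any </a> that has nothing to close.

```python
from html.parser import HTMLParser

class StrayCloseFinder(HTMLParser):
    """Records closing tags that have no matching open tag in the current <td>."""

    def __init__(self, tag="a"):
        super().__init__()
        self.tag = tag
        self.open_count = 0   # unclosed <tag> elements seen in the current cell
        self.stray = []       # (line, column) of each unmatched </tag>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.open_count = 0       # every cell starts with a fresh count
        elif tag == self.tag:
            self.open_count += 1

    def handle_endtag(self, tag):
        if tag != self.tag:
            return
        if self.open_count > 0:
            self.open_count -= 1      # this close pairs with an earlier open
        else:
            self.stray.append(self.getpos())

# Shortened, hypothetical version of the fragment from the question.
html = """<tr>
<td><a href="pal_06.xls">Private Area Lighting</a> (xls)</td>
<td>1SC1L</a> (xls)</td></tr>"""

finder = StrayCloseFinder("a")
finder.feed(html)
finder.close()
print(finder.stray)   # positions of unmatched </a> tags
```

Passing a different tag name to the constructor covers the "any two tags" goal without touching a regular expression.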
 
In case anyone else was curious ... this is the line that he is talking about

Code:
<table cellpadding="3" cellspacing="0" bordercolor="#CCCCCC" border="1">
<tr align="Center" bgcolor="#CCCCCC">
    <td valign="top" class="tablefont" colspan="2"><b>Service Classification for 2006</b></td>
    <td valign="top" class="tablefont" width="29%"><b>EDI Load Profile Code</b></td>
<tr>
    <td valign="top" class="tablefont" width="31%">SC-1, SC1B</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc1std_06.xls">Standard Service</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC1, 2SC1</td></tr>
<tr>
    <td valign="top" class="tablefont" width="31%">SC-1C</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc1c_06.xls">Optional Large Time of Use</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC1C, 2SC1C </td></tr>
<tr>
    <td valign="top" class="tablefont" rowspan="2" width="31%">SC-2</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc2nd_06.xls">Non-Demand</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC2, 2SC2 </td></tr>
<tr>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc2dem_06.xls">Demand</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">2SC2D, 3SC2D, 1SC2D</td></tr>
<tr>
    <td  valign="top" class="tablefont" rowspan="4" width="31%">SC-3</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3sec_06.xls">Secondary</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC3</td></tr>
<tr>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3pri_06.xls">Primary</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">2SC3</td></tr>
<tr>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3sub_06.xls">Subtransmission</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">3SC3</td></tr>
<tr>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3tra_06.xls">Transmission</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">4SC3</td></tr>
<tr>
    <td valign="top" class="tablefont" width="31%">Private Area Lighting</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/pal_06.xls">Private Area Lighting</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC1L</a> (xls)</td></tr>  <!-- the stray </a> is on this line -->
<tr>
    <td valign="top" class="tablefont" width="31%">Traffic Signals</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/traffic_06.xls">Traffic Signals</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC4L</td></tr>
<tr>
    <td valign="top" class="tablefont" width="31%">Street Lighting</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/stlght_06.xls">Street Lighting</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC2L, 1SC3L, 1SC5L, 1SC6L</td></tr>
</table>
 
I'm not actually using Perl, but I know that most RegExp pros use them regularly with Perl. RegExps work across programming languages, and unfortunately Tek-Tips does not have a forum dedicated to them. My actual solution is in .NET, but I feel that I could get a better answer in this forum, and I'm sure there are some Perl people out there with similar questions. Thanks for your help!

Nick Ruiz
 
A quasi-generic regex solution is:

Code:
<(?!a )[^>]*>[^<]*</a>

However, I would suggest that this is a waste of your time, and that this is actually an X-Y problem. What you should do is run your HTML through a validator to find all the flaws in it, instead of hunting for this one very specific, probably rare flaw. Plenty of validators are freely available on the web, and one would catch all sorts of special cases that this regex will not handle, such as a <b> tag embedded within the link tag.
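
As a rough illustration of both points, the sketch below runs the quasi-generic pattern in Python (a swap-in for illustration; the pattern is unchanged). The sample cells and the helper name first_match are hypothetical. The last cell is well-formed, but because the </b> sits directly before the </a>, the pattern reports it anyway.

```python
import re

QUASI_GENERIC = re.compile(r"<(?!a )[^>]*>[^<]*</a>")

def first_match(cell):
    """Return the matched text, or None when the pattern finds nothing."""
    found = QUASI_GENERIC.search(cell)
    return found.group(0) if found else None

# Hypothetical sample cells.
stray = '<td width="29%">1SC1L</a> (xls)</td>'
balanced = '<td width="40%"><a href="pal_06.xls">Private Area Lighting</a> (xls)</td>'
bold_in_link = '<td><a href="x.xls"><b>Demand</b></a> (xls)</td>'

print(first_match(stray))         # the stray </a> together with the tag before it
print(first_match(balanced))      # nothing to report
print(first_match(bold_in_link))  # reported even though <a> and </a> are paired
```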

Good Luck.
 
Thanks for your help! Now, allow me to explain why I want to fix these problems. I am actually pulling this HTML source from a website and would like to store it in an XML-structured object. However, in .NET, I can't do this, since it's not proper XML (or for my purposes, XHTML). Scraping from this page would be much simpler (and more reliable) if I could traverse this page as if it were an XML document.

Now, how would I modify that string to grab the tag that encompasses the </a>? I see that the <(?!a )[^>]*> would grab the <td> tag in this case. I was wondering if I could use a group definition to get the end </td> tag. Something like this:

Code:
<((?!a ))[^>]*>[^<]*</a>.*?</\1>
However, this doesn't work. Thanks again for your help!
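
One likely reason, sketched in Python (a swap-in for illustration; the same reasoning should apply in .NET and Perl): the capturing group wraps only the lookahead, which is zero-width, so group 1 is always the empty string and </\1> can only ever match the literal text "</>". The variant below is one possible repair, not the only one, and it captures the tag name itself. The sample cell is a shortened, hypothetical version of the fragment.

```python
import re

cell = '<tr><td width="29%">1SC1L</a> (xls)</td></tr>'

# Attempt from the post: group 1 encloses only a zero-width lookahead,
# so it captures "" and the back-reference can only match "</>".
attempt = re.compile(r"<((?!a ))[^>]*>[^<]*</a>.*?</\1>", re.DOTALL)

# Variant: keep the lookahead outside the group and capture the tag name,
# so the back-reference finds the closing tag of the same element.
variant = re.compile(r"<(?!a\b)(\w+)\b[^>]*>[^<]*</a>.*?</\1>", re.DOTALL)

print(attempt.search(cell))
m = variant.search(cell)
print(m.group(1))   # name of the enclosing tag
print(m.group(0))   # the whole enclosing element
```

Like the pattern it is built on, the variant still assumes that only plain text sits between the opening tag and the stray </a>.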

Nick Ruiz
 
As a recommendation, you could pull all the data into a single line, then split the line on the begin and end tag syntax while keeping the tags. Finish by checking which tags are missing their partner, and either drop the tag from the array or add the missing tag before the next/previous tag in the array.

@array = split(m{(</?[^>]+>)}, $_);
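
The split-and-pair idea can be sketched as follows, here in Python rather than Perl (a swap-in for illustration). The function name find_unpaired and the sample input are hypothetical, and the sketch takes the "drop the tag" option rather than inserting a missing partner. It also assumes lowercase tag names and does not reset its bookkeeping per table cell.

```python
import re

def find_unpaired(html, name):
    """Return the split tokens plus the indexes of <name>/</name> tags lacking a partner."""
    # The capturing group makes re.split keep the tags as separate tokens.
    tokens = [t for t in re.split(r"(</?[^>]+>)", html) if t]
    open_stack = []   # indexes of <name ...> tags still waiting for a close
    unpaired = []     # indexes of closing tags with nothing to close
    for i, tok in enumerate(tokens):
        if re.match(rf"<{re.escape(name)}\b", tok):
            open_stack.append(i)
        elif tok == f"</{name}>":
            if open_stack:
                open_stack.pop()
            else:
                unpaired.append(i)
    # Opening tags left on the stack never found a partner either.
    return tokens, sorted(unpaired + open_stack)

html = '<td><a href="pal_06.xls">Lighting</a> (xls)</td><td>1SC1L</a> (xls)</td>'
tokens, bad = find_unpaired(html, "a")

# Drop the unpaired tags and glue the remaining tokens back together.
cleaned = "".join(t for i, t in enumerate(tokens) if i not in bad)
print(cleaned)
```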

hope it helps


Michael Libeson
 
nwruiz said:
This RegEx meets my immediate needs for cleaning this fragment, but it would not satisfy the general condition I am trying to fix. It captures any <td> cell where an <a> tag does not immediately follow the beginning of the <td> element. I need it to capture any <td> element where no <a> tag exists but a </a> exists.

Code:
<(td)[^>]*>
(?!<a[^>]*>)
.*?</a>.*?
</td>

Whitespace has been included to make the RegEx easier to read.

Ignoring my earlier regex, I instead enhanced the one that you posted to match more generically, but within the scope of a <td>.

Code:
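my $regex = qr{<td[^>]*>                     # opening <td ...>
(?: (?> [^<]+) | (?: < (?!a|/a|/td) ) )*     # text, or a "<" that does not begin <a, </a or </td
</a>                                         # the closing tag that has no opener
(?: (?> [^<]+) | (?: < (?!/td) ) )*          # everything up to the end of the cell
</td>}sx;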

my ($full) = ($html =~ m{($regex)});

print "$full\n";

Results are:
Code:
<td valign="top" class="tablefont" width="29%">1SC1L</a> (xls)</td>
 
My regex can be compressed just a little bit. It seems I got a little bit carried away in the details, as this is exactly the same:

Code:
my $regex = qr{<td[^>]*>
(?: (?> [^<]+) | (?: < (?!a|/a|/td) ) )*
</a>
.*?    # replaces the explicit second alternation
</td>}sx;

my ($full) = ($html =~ m{($regex)});

print "$full\n";

Results are:
Code:
<td valign="top" class="tablefont" width="29%">1SC1L</a> (xls)</td>
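
Since the original goal was a pattern that works for any two tags in a .NET or other non-Perl setting, here is a hedged Python port of the same idea (a swap-in for illustration). The function name stray_close_cells and its parameters are hypothetical. It replaces the atomic group with a single-character [^<] alternative, because atomic groups are only available in Python 3.11 and later, and it adds \b so that tags such as <abbr> are not mistaken for <a>.

```python
import re

def stray_close_cells(html, cell="td", tag="a"):
    """Return every <cell>...</cell> that holds a </tag> with no <tag> before it."""
    c, t = re.escape(cell), re.escape(tag)
    pattern = re.compile(
        rf"""<{c}\b[^>]*>
             (?: [^<] | <(?! /?{t}\b | /{c}\b ) )*   # text or other tags only
             </{t}>
             .*?
             </{c}>""",
        re.VERBOSE | re.DOTALL,
    )
    return pattern.findall(html)

# Shortened, hypothetical version of the fragment from the question.
html = """<tr>
    <td width="40%"><a href="pal_06.xls">Private Area Lighting</a> (xls)</td>
    <td width="29%">1SC1L</a> (xls)</td></tr>"""

print(stray_close_cells(html))
```

Calling stray_close_cells(html, cell="p", tag="b") would apply the same check to a different pair of tags.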
 
Looks good! Thank you for your help!

Nick Ruiz
 