Regex Match

Status
Not open for further replies.

fmuquartet

Programmer
Sep 21, 2006
60
US
Can someone help the poor lad with a regex for extracting a number from a certain position in an HTML file?

String to extract: 6768743

HTML:
<div align="center"><center><table border="1" cellspacing="1" width="70%"><tr> <td width="50%" align="center" valign="middle"><strong>OS X</strong> </td> <td width="50%" align="center" valign="middle">443921 </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>OS Y</strong> </td> <td width="50%" align="center" valign="middle">235124 </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>Post Code</strong>
</td> <td width="50%" align="center" valign="middle">OX15 4HE </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>Lat</strong> (WGS84) </td> <td width="50%" align="center" valign="middle">N52:00:46 ( 52.012814 ) </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>Long</strong> (WGS84) </td> <td width="50%" align="center" valign="middle">W1:21:41 ( -1.361467 ) </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>LR</strong> </td>
<td width="50%" align="center" valign="middle">SP439351 </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>mX</strong> </td> <td width="50%" align="center" valign="middle">-151557</td></tr><tr> <td width="50%" align="center" valign="middle"><strong>mY</strong> </td> <td width="50%" align="center" valign="middle">6768743 </td></tr><tr> <td width="50%" align="center" valign="middle">
<strong>M</strong> </td> <td width="50%" align="center" valign="middle">Actual : FFFDAFFB_00674867<br>Street : FFFDB034_00674774<br>Road : FFFDB610_00672640<br>Road64 : FFFE0C00_00668A00<br>Road128: FFFE0C00_00659000</td>
</tr><tr> <td width="100%" align="center" valign="middle" colspan="2" >
 
Code:
use HTML::TableExtract;

use strict;

my $te = HTML::TableExtract->new(
	keep_html => 0,
);

my $html_string = do {local $/; <DATA>};

$te->parse($html_string);

# Examine all tables
TABLE:
foreach my $ts ($te->tables) {
	ROW:
	foreach my $row ($ts->rows) {
		# Strip Spacing from Cells
		s/^\s+|\s+$//g foreach (@$row);

		my ($key, $val) = @$row;

		if ($key eq 'mY') {
			print "mY = $val\n";
			last TABLE;
		}
	}
}

1;

__DATA__

<div align="center"><center>
<table border="1" cellspacing="1" width="70%">
	<tr>
		<td width="50%" align="center" valign="middle"><strong>OS X</strong> </td>
		<td width="50%" align="center" valign="middle">443921 </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>OS Y</strong> </td>
		<td width="50%" align="center" valign="middle">235124 </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>Post Code</strong></td>
		<td width="50%" align="center" valign="middle">OX15 4HE </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>Lat</strong> (WGS84)  </td>
		<td width="50%" align="center" valign="middle">N52:00:46 ( 52.012814 ) </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>Long</strong> (WGS84)  </td>
		<td width="50%" align="center" valign="middle">W1:21:41 ( -1.361467 ) </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>LR</strong> </td>
		<td width="50%" align="center" valign="middle">SP439351 </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>mX</strong> </td>
		<td width="50%" align="center" valign="middle">-151557</td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>mY</strong> </td>
		<td width="50%" align="center" valign="middle">6768743 </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>M</strong> </td>
		<td width="50%" align="center" valign="middle">Actual : FFFDAFFB_00674867<br>Street : FFFDB034_00674774<br>Road   : FFFDB610_00672640<br>Road64 : FFFE0C00_00668A00<br>Road128: FFFE0C00_00659000</td>
	</tr>
	<tr>
		<td width="100%" align="center" valign="middle" colspan="2" ></td>
	</tr>
</table>
</div>
------------------------------------------------------------
Pragmas (perl 5.8.8) used :
  • strict - Perl pragma to restrict unsafe constructs
Other Modules used :
  • HTML::TableExtract
 
Thanks guys, I will try both of them and see which one is more efficient.
 
fmuquartet said:
... and see which one is more efficient.

I surmise that this was probably just a comment made in passing. However, I must point out that it is misguided. To quote Knuth: "premature optimization is the root of all evil".

Efficiency is not what is important for most programming projects. Instead, achieving code that is logical and ordered is a much more worthwhile goal.

What you did not explain in your premise was that you were trying to do more than a simple string extraction. What you truly wanted was to extract a value from a two-column HTML table with keys in the first column and values in the second. While a regular expression could in theory accomplish such a task, and probably do it more "quickly", it would not truly be an "efficient" way to program it.

Anyway, I expect you will discover this on your own if you take the time to explore the code that I provided.

- Miller
 
Miller, my apologies for not clarifying exactly what I am trying to accomplish; by no means am I dismissing your answer to my question. The value mentioned in my original post constantly changes, and I need a regex to address this problem.

Let me know if this clears things up for you.
 
Since you insist on a regex solution, I surmise that you are not actually doing this in Perl.

There is rarely a clean way to parse HTML solely using regular expressions. That isn't to say it can't be done, just that it's messy and prone to bugs. That is why I demonstrated the use of the HTML::TableExtract CPAN module.
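
To make the "prone to bugs" point concrete, here is a small sketch. The pattern, the sub name naive_extract and both HTML snippets are made up for illustration; they are not taken from the page above. The pattern is tied to one exact layout, so a harmless change in markup makes it silently stop matching:

```perl
use strict;
use warnings;

# Hypothetical naive pattern: it assumes <strong> around the label and
# exactly one space between the name cell and the value cell.
my $naive = qr{<strong>mY</strong> </td> <td[^>]*>\s*(-?\d+)};

# The same data in two layouts, both invented for this illustration.
my $layout_a = '<td><strong>mY</strong> </td> <td align="center">6768743 </td>';
my $layout_b = "<td><b>mY</b></td>\n<td align=\"center\">6768743 </td>";

# Returns the captured number, or undef when the pattern does not match.
sub naive_extract {
    my ($html) = @_;
    return $html =~ $naive ? $1 : undef;
}

for my $html ($layout_a, $layout_b) {
    my $value = naive_extract($html);
    print defined $value ? "matched: $value\n" : "no match\n";
}
```

The first layout matches and the second does not, even though a browser would render both the same way. A table parser does not care about that kind of difference.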

Anyway, just to prove how ugly it can be, here is the same code done as a regex.

Code:
use strict;

my $html_string = do {local $/; <DATA>};

if ($html_string =~ m{
	<td[^>]*>                    # Begin Name Cell
	(?:<(?!/td)[^>]*>|\s*)*      # White Space or Formatting Tags
	\QmY\E                       # Name of Value
	(?:<(?!/td)[^>]*>|\s*)*      # White Space or Formatting Tags
	</td>                        # End Name Cell
	\s*
	<td[^>]*>                    # Begin Value Cell
	\s*
		([^<]*?)                 # Value to Save
	\s*
	</td>                        # End Value Cell
}xs) {
	print "'$1'\n";
} else {
	print "No Match\n";
}


1;

__DATA__

<div align="center"><center>
<table border="1" cellspacing="1" width="70%">
    <tr>
        <td width="50%" align="center" valign="middle"><strong>OS X</strong> </td>
        <td width="50%" align="center" valign="middle">443921 </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>OS Y</strong> </td>
        <td width="50%" align="center" valign="middle">235124 </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>Post Code</strong></td>
        <td width="50%" align="center" valign="middle">OX15 4HE </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>Lat</strong> (WGS84)  </td>
        <td width="50%" align="center" valign="middle">N52:00:46 ( 52.012814 ) </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>Long</strong> (WGS84)  </td>
        <td width="50%" align="center" valign="middle">W1:21:41 ( -1.361467 ) </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>LR</strong> </td>
        <td width="50%" align="center" valign="middle">SP439351 </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>mX</strong> </td>
        <td width="50%" align="center" valign="middle">-151557</td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>mY</strong> </td>
        <td width="50%" align="center" valign="middle">6768743 </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>M</strong> </td>
        <td width="50%" align="center" valign="middle">Actual : FFFDAFFB_00674867<br>Street : FFFDB034_00674774<br>Road   : FFFDB610_00672640<br>Road64 : FFFE0C00_00668A00<br>Road128: FFFE0C00_00659000</td>
    </tr>
    <tr>
        <td width="100%" align="center" valign="middle" colspan="2" ></td>
    </tr>
</table>
</div>
------------------------------------------------------------
Pragmas (perl 5.8.8) used :
  • strict - Perl pragma to restrict unsafe constructs
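
Since the value changes but the label next to it does not, the same kind of pattern could be wrapped in a small sub keyed on the label. This is only a sketch under assumptions: cell_value_for is a made-up name, the HTML is a shortened two-row sample, and value cells that contain tags (such as the M row) will still not match.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the cell that follows the
# cell labelled $label, or undef when nothing matches.
sub cell_value_for {
    my ($html, $label) = @_;

    # Any run of formatting tags (but not </td>) or white space.
    my $tags_or_space = qr{(?:<(?!/td)[^>]*>|\s+)*};

    return $html =~ m{
        <td[^>]*>          # begin name cell
        $tags_or_space
        \Q$label\E         # label, quoted so it is matched literally
        $tags_or_space
        </td> \s*          # end name cell
        <td[^>]*> \s*      # begin value cell
        ([^<]*?)           # value to save
        \s* </td>          # end value cell
    }xs ? $1 : undef;
}

# Shortened sample rows, assumed to follow the layout posted earlier.
my $html = <<'HTML';
<tr> <td width="50%" align="center"><strong>mX</strong> </td> <td width="50%">-151557</td></tr>
<tr> <td width="50%" align="center"><strong>mY</strong> </td> <td width="50%">6768743 </td></tr>
HTML

for my $label (qw(mX mY)) {
    my $value = cell_value_for($html, $label);
    print "$label = ", (defined $value ? $value : 'not found'), "\n";
}
```

Whatever number ends up in the mY cell is what gets captured, so nothing in the pattern needs to change when the value does.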

- Miller
 