
Regex Match

Status
Not open for further replies.

fmuquartet

Programmer
Sep 21, 2006
60
0
0
US
Can someone help the poor lad with a regex for extracting a number from a certain position in an HTML file?

String to extract: 6768743

HTML:
<div align="center"><center><table border="1" cellspacing="1" width="70%"><tr> <td width="50%" align="center" valign="middle"><strong>OS X</strong> </td> <td width="50%" align="center" valign="middle">443921 </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>OS Y</strong> </td> <td width="50%" align="center" valign="middle">235124 </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>Post Code</strong>
</td> <td width="50%" align="center" valign="middle">OX15 4HE </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>Lat</strong> (WGS84) </td> <td width="50%" align="center" valign="middle">N52:00:46 ( 52.012814 ) </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>Long</strong> (WGS84) </td> <td width="50%" align="center" valign="middle">W1:21:41 ( -1.361467 ) </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>LR</strong> </td>
<td width="50%" align="center" valign="middle">SP439351 </td></tr><tr> <td width="50%" align="center" valign="middle"><strong>mX</strong> </td> <td width="50%" align="center" valign="middle">-151557</td></tr><tr> <td width="50%" align="center" valign="middle"><strong>mY</strong> </td> <td width="50%" align="center" valign="middle">6768743 </td></tr><tr> <td width="50%" align="center" valign="middle">
<strong>M</strong> </td> <td width="50%" align="center" valign="middle">Actual : FFFDAFFB_00674867<br>Street : FFFDB034_00674774<br>Road : FFFDB610_00672640<br>Road64 : FFFE0C00_00668A00<br>Road128: FFFE0C00_00659000</td>
</tr><tr> <td width="100%" align="center" valign="middle" colspan="2" >
 
Code:
use HTML::TableExtract;

use strict;

my $te = HTML::TableExtract->new(
	keep_html => 0,
);

my $html_string = do {local $/; <DATA>};

$te->parse($html_string);

# Examine all tables
TABLE:
foreach my $ts ($te->tables) {
	ROW:
	foreach my $row ($ts->rows) {
		# Strip Spacing from Cells
		s/^\s+|\s+$//g foreach (@$row);

		my ($key, $val) = @$row;

		if ($key eq 'mY') {
			print "mY = $val\n";
			last TABLE;
		}
	}
}

1;

__DATA__

<div align="center"><center>
<table border="1" cellspacing="1" width="70%">
	<tr>
		<td width="50%" align="center" valign="middle"><strong>OS X</strong> </td>
		<td width="50%" align="center" valign="middle">443921 </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>OS Y</strong> </td>
		<td width="50%" align="center" valign="middle">235124 </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>Post Code</strong></td>
		<td width="50%" align="center" valign="middle">OX15 4HE </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>Lat</strong> (WGS84)  </td>
		<td width="50%" align="center" valign="middle">N52:00:46 ( 52.012814 ) </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>Long</strong> (WGS84)  </td>
		<td width="50%" align="center" valign="middle">W1:21:41 ( -1.361467 ) </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>LR</strong> </td>
		<td width="50%" align="center" valign="middle">SP439351 </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>mX</strong> </td>
		<td width="50%" align="center" valign="middle">-151557</td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>mY</strong> </td>
		<td width="50%" align="center" valign="middle">6768743 </td>
	</tr>
	<tr>
		<td width="50%" align="center" valign="middle"><strong>M</strong> </td>
		<td width="50%" align="center" valign="middle">Actual : FFFDAFFB_00674867<br>Street : FFFDB034_00674774<br>Road   : FFFDB610_00672640<br>Road64 : FFFE0C00_00668A00<br>Road128: FFFE0C00_00659000</td>
	</tr>
	<tr>
		<td width="100%" align="center" valign="middle" colspan="2" ></td>
	</tr>
</table>
</div>
------------------------------------------------------------
Pragmas (perl 5.8.8) used :
  • strict - Perl pragma to restrict unsafe constructs
Other Modules used :
  • HTML::TableExtract
 
Thanks guys, I will try both of them and see which one is more efficient.
 
fmuquartet said:
... and see which one is more efficient.

I surmise that this was probably just a comment made in passing. However, I must point out that it is misguided. To quote Knuth: "premature optimization is the root of all evil".

Efficiency is not what matters in most programming projects. Instead, achieving code that is logical and ordered is a much more worthwhile goal.

What you did not explain in your premise was that you were trying to do more than a simple string extraction. What you truly wanted was to extract a value from a two-column HTML table with keys in the first column and values in the second. While a regular expression could in theory accomplish such a task, and probably do it more "quickly", it would not truly be an "efficient" way to program it.

Anyway, I expect you will discover this on your own if you take the time to explore the code that I provided.
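
To make the key/value point concrete, here is a minimal sketch that reads every row of such a table into a hash, so any value can be looked up by its label. It assumes HTML::TableExtract is installed and that each data row has exactly two cells; the helper name table_to_hash and the trimmed three-row sample are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Sample markup: three rows trimmed from the table in the original post.
my $html = <<'HTML';
<table border="1">
  <tr><td><strong>mX</strong> </td><td>-151557</td></tr>
  <tr><td><strong>mY</strong> </td><td>6768743 </td></tr>
  <tr><td><strong>Post Code</strong></td><td>OX15 4HE </td></tr>
</table>
HTML

# Hypothetical helper: map the text of each label cell to its value cell.
sub table_to_hash {
    my ($markup) = @_;

    my $te = HTML::TableExtract->new(keep_html => 0);
    $te->parse($markup);

    my %value_for;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            my ($key, $val) = @$row;

            # Skip rows that do not have both a label and a value.
            next unless defined $key && defined $val;

            # Strip leading and trailing whitespace from both cells.
            s/^\s+|\s+$//g for $key, $val;

            $value_for{$key} = $val if length $key;
        }
    }
    return %value_for;
}

my %value_for = table_to_hash($html);
print "mY = $value_for{mY}\n";
```

Once the table is in a hash, it no longer matters that the number changes from page to page; only the label has to stay the same.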

- Miller
 
Miller, my apologies for not clarifying exactly what I am trying to accomplish. By no means am I dismissing your answer to my question. The value mentioned in my original post changes constantly, and I need a regex to address this problem.

Let me know if this clears things up for you.
 
Since you insist on a regex solution, I surmise that you are not actually doing this in Perl.

There is rarely a clean way to parse HTML solely using regular expressions. That isn't to say it can't be done, just that it's messy and prone to bugs. That is why I demonstrated the use of the HTML::TableExtract CPAN module.

Anyway, just to prove how ugly it can be, here is the same code done as a regex.

Code:
use strict;

my $html_string = do {local $/; <DATA>};

if ($html_string =~ m{
	<td[^>]*>                    # Begin Name Cell
	(?:<(?!/td)[^>]*>|\s*)*      # White Space or Formatting Tags
	\QmY\E                       # Name of Value
	(?:<(?!/td)[^>]*>|\s*)*      # White Space or Formatting Tags
	</td>                        # End Name Cell
	\s*
	<td[^>]*>                    # Begin Value Cell
	\s*
		([^<]*?)                 # Value to Save
	\s*
	</td>                        # End Value Cell
}xs) {
	print "'$1'\n";
} else {
	print "No Match\n";
}

1;

__DATA__

<div align="center"><center>
<table border="1" cellspacing="1" width="70%">
    <tr>
        <td width="50%" align="center" valign="middle"><strong>OS X</strong> </td>
        <td width="50%" align="center" valign="middle">443921 </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>OS Y</strong> </td>
        <td width="50%" align="center" valign="middle">235124 </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>Post Code</strong></td>
        <td width="50%" align="center" valign="middle">OX15 4HE </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>Lat</strong> (WGS84)  </td>
        <td width="50%" align="center" valign="middle">N52:00:46 ( 52.012814 ) </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>Long</strong> (WGS84)  </td>
        <td width="50%" align="center" valign="middle">W1:21:41 ( -1.361467 ) </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>LR</strong> </td>
        <td width="50%" align="center" valign="middle">SP439351 </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>mX</strong> </td>
        <td width="50%" align="center" valign="middle">-151557</td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>mY</strong> </td>
        <td width="50%" align="center" valign="middle">6768743 </td>
    </tr>
    <tr>
        <td width="50%" align="center" valign="middle"><strong>M</strong> </td>
        <td width="50%" align="center" valign="middle">Actual : FFFDAFFB_00674867<br>Street : FFFDB034_00674774<br>Road   : FFFDB610_00672640<br>Road64 : FFFE0C00_00668A00<br>Road128: FFFE0C00_00659000</td>
    </tr>
    <tr>
        <td width="100%" align="center" valign="middle" colspan="2" ></td>
    </tr>
</table>
</div>
------------------------------------------------------------
Pragmas (perl 5.8.8) used :
  • strict - Perl pragma to restrict unsafe constructs
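
Because the number itself changes between pages, the label is the only stable thing to anchor on. As a rough sketch, the same pattern can be wrapped in a sub that takes the label as a parameter. The name extract_value and the three-row sample are made up for illustration, and the sketch assumes the label cell holds nothing but the label plus formatting tags, so a label followed by extra text, such as Lat, would not match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the cell that follows the cell
# labelled $name, or undef when no such row is found.
sub extract_value {
    my ($html, $name) = @_;

    # Formatting tags (anything except a closing td) or white space.
    my $filler = qr{(?:<(?!/td)[^>]*>|\s+)*};

    return $html =~ m{
        <td[^>]*>  $filler     # Begin Name Cell
        \Q$name\E              # Name of Value, quoted so it matches literally
        $filler    </td>       # End Name Cell
        \s*
        <td[^>]*>  \s*         # Begin Value Cell
        ([^<]*?)               # Value to Save
        \s*        </td>       # End Value Cell
    }xs ? $1 : undef;
}

# Sample markup: three rows trimmed from the table in the original post.
my $html = <<'HTML';
<table border="1">
  <tr><td><strong>mX</strong> </td><td>-151557</td></tr>
  <tr><td><strong>mY</strong> </td><td>6768743 </td></tr>
  <tr><td><strong>Post Code</strong></td><td>OX15 4HE </td></tr>
</table>
HTML

my $value = extract_value($html, 'mY');
print defined $value ? "mY = $value\n" : "No Match\n";
```

The \Q...\E quoting keeps a label such as Post Code working under the x modifier, which would otherwise ignore the space.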

- Miller
 