.NET regular expressions to repair malformed XHTML

Status
Not open for further replies.

nwruiz

Programmer
Jun 22, 2005
60
US
Hello,

I have the following regular expressions that I am trying to use in my VB.NET 2.0 code. The code responds to an XmlException by trying to repair the malformed XML that caused it.

Code:
'Trying to parse information from the following exception message:
'The 'td' start tag on line 36 does not match the end tag of 'a'. Line 36, position 56.
    Private Function FixTagMismatch(ByVal ex As XmlException) As Boolean
        Dim message As String = ex.Message
        Dim success As Boolean = False

        Const textInsideQuotes As String = "'([^']*)'"
        Const lineNumberSearch As String = "Line\s(\d*),"
        Const positionSearch As String = "position\s(\d*)."
        Const rexOptions As Integer = CInt(RegexOptions.IgnoreCase _
                Or RegexOptions.Multiline _
                Or RegexOptions.Singleline _
                Or RegexOptions.CultureInvariant _
                Or RegexOptions.Compiled)

        Dim rex As New Regex(textInsideQuotes, rexOptions)
        Dim matches As MatchCollection = rex.Matches(message) 'Regex.Matches(message, textInsideQuotes)
        Dim tag1 As String = String.Empty
        Dim tag2 As String = String.Empty

        ' Retrieve the two tags in question.
        If matches.Count = 2 Then
            success = True
            tag1 = matches.Item(0).Value
            tag2 = matches.Item(1).Value
        End If

        ' Retrieve the line number
        rex = New Regex(lineNumberSearch, rexOptions)
        Dim tmp As String = rex.Match(message).Value
        'Dim line As Integer = CInt(rex.Match(message).Value)

        ' Retrieve the position of the problem
        rex = New Regex(positionSearch, rexOptions)
        Dim tmp2 As String = rex.Match(message).Value
        'Dim pos As Integer = CInt(rex.Match(message).Value)

        'Dim reader As New StringReader(Me.ModifiedText)
        Dim errorLine As String = String.Empty

        'Dim builder As New StringBuilder

        '' Travel to the line in question.
        'For i As Integer = 1 To line - 1
        '    builder.Append(reader.ReadLine() & vbNewLine)
        'Next

        'errorLine = reader.ReadLine()
        ' Fix the problem on this line
        Dim search As String = "<tag1[^>]*>[^<tag2.*?>].*?</tag3>"
        search = search.Replace("tag1", tag1)
        search = search.Replace("tag2", tag2)

        rex = New Regex(search, rexOptions)
        matches = rex.Matches(Me.ModifiedText)
        For Each mat As Match In matches
            Debug.WriteLine(mat.Value)
        Next

        'builder.Append(errorLine & vbNewLine)

        'builder.Append(reader.ReadToEnd())
        'reader.Close()
        Return success
    End Function

I have several problems with this method. The first RegEx used successfully parses the values inside of single-quote strings (e.g. 'td' and 'a' in this example). However, the expressions used to fetch the line number and the position of the error do not work. The RegEx retrieves the entire string containing the line/position numbers (e.g. "Line 36," and "position 56.") instead of just the integers inside.

I am new to regular expressions, so any help would be greatly appreciated. I thought the "lineNumberSearch" and "positionSearch" constants did the same thing as the "textInsideQuotes" pattern, but apparently they do not. I used Expresso to create and test these expressions, and they tested successfully there. However, when I actually run my .NET 2.0 code, they do not return what I expect.
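
As a possible alternative to parsing the message text, XmlException exposes the error location through its LineNumber and LinePosition properties, so the two location regexes may not be needed at all. The sketch below is only an illustration: the module name, the GetLine helper, and the three-line sample input are assumptions made for this example and are not part of the original code.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module ErrorLineSketch
    ' Hypothetical helper: returns line number lineNumber (1-based) of text,
    ' or Nothing when the text has fewer lines than that.
    Function GetLine(ByVal text As String, ByVal lineNumber As Integer) As String
        Using reader As New StringReader(text)
            Dim current As String = reader.ReadLine()
            Dim index As Integer = 1
            While current IsNot Nothing AndAlso index < lineNumber
                current = reader.ReadLine()
                index += 1
            End While
            Return current
        End Using
    End Function

    Sub Main()
        ' Assumed sample input: a stray </a> on the second line.
        Dim source As String = "<tr>" & vbLf & "<td>1SC1L</a> (xls)</td>" & vbLf & "</tr>"
        Try
            Dim doc As New XmlDocument()
            doc.LoadXml(source)
        Catch ex As XmlException
            ' The location comes from the exception itself, not from its message.
            Console.WriteLine(ex.LineNumber)
            Console.WriteLine(ex.LinePosition)
            Console.WriteLine(GetLine(source, ex.LineNumber))
        End Try
    End Sub
End Module
```

Reading the properties avoids depending on the exact wording of the message, which is not guaranteed to stay the same between framework versions.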

Finally, in the "search" string, I am trying to find the problem in the HTML source code. Below is the fragment that I am using for testing.

Code:
<table cellpadding="3" cellspacing="0" bordercolor="#CCCCCC" border="1">
<tr align="Center" bgcolor="#CCCCCC">
	<td valign="top" class="tablefont" colspan="2"><b>Service Classification for 2006</b></td>
    <td valign="top" class="tablefont" width="29%"><b>EDI Load Profile Code</b></td>
<tr> 
	<td valign="top" class="tablefont" width="31%">SC-1, SC1B</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc1std_06.xls">Standard Service</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC1, 2SC1</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="31%">SC-1C</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc1c_06.xls">Optional Large Time of Use</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC1C, 2SC1C </td></tr>
<tr> 
	<td valign="top" class="tablefont" rowspan="2" width="31%">SC-2</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc2nd_06.xls">Non-Demand</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC2, 2SC2 </td></tr>
<tr> 
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc2dem_06.xls">Demand</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">2SC2D, 3SC2D, 1SC2D</td></tr>
<tr> 
	<td  valign="top" class="tablefont" rowspan="4" width="31%">SC-3</td>
    <td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3sec_06.xls">Secondary</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">1SC3</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3pri_06.xls">Primary</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">2SC3</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3sub_06.xls">Subtransmission</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">3SC3</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/sc3tra_06.xls">Transmission</a> (xls)</td>
    <td valign="top" class="tablefont" width="29%">4SC3</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="31%">Private Area Lighting</td>
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/pal_06.xls">Private Area Lighting</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC1L</a> (xls)</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="31%">Traffic Signals</td>
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/traffic_06.xls">Traffic Signals</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC4L</td></tr>
<tr> 
	<td valign="top" class="tablefont" width="31%">Street Lighting</td>
	<td valign="top" class="tablefont" width="40%"><a href="../../non_html/stlght_06.xls">Street Lighting</a> (xls)</td>
	<td valign="top" class="tablefont" width="29%">1SC2L, 1SC3L, 1SC5L, 1SC6L</td></tr>
</table>
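
Two details of the "search" pattern may be worth checking: `[^<tag2.*?>]` is a negated character class, so it matches a single character rather than "not this tag", and the placeholder "tag3" is never replaced. A simpler way to confirm the mismatch on the reported line might be to count start and end tags for the tag named in the exception. The following is a minimal sketch under that assumption; the module and function names are hypothetical, and it does not handle self-closing tags or tags that span several lines.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module OrphanTagSketch
    ' Hypothetical helper: True when the line holds more </tagName> end tags
    ' than <tagName ...> start tags.
    Function HasOrphanEndTag(ByVal line As String, ByVal tagName As String) As Boolean
        Dim name As String = Regex.Escape(tagName)
        ' The start tag must be followed by whitespace or ">", so "a" does not match "<abbr>".
        Dim opens As Integer = Regex.Matches(line, "<" & name & "(\s[^>]*)?>", RegexOptions.IgnoreCase).Count
        Dim closes As Integer = Regex.Matches(line, "</" & name & "\s*>", RegexOptions.IgnoreCase).Count
        Return closes > opens
    End Function

    Sub Main()
        ' Both lines are taken from the test fragment above.
        Dim bad As String = "<td valign=""top"" class=""tablefont"" width=""29%"">1SC1L</a> (xls)</td>"
        Dim good As String = "<td valign=""top"" class=""tablefont"" width=""40%""><a href=""../../non_html/sc2dem_06.xls"">Demand</a> (xls)</td>"
        Console.WriteLine(HasOrphanEndTag(bad, "a"))   ' True
        Console.WriteLine(HasOrphanEndTag(good, "a"))  ' False
    End Sub
End Module
```

A check like this only detects the problem. Whether to delete the stray end tag or insert a matching start tag is a separate decision.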

Thank you for your help!

Nick Ruiz
Associate Integrator
PPLSolutions IT Billing and Transactions
 
Perhaps I may be on my way to solving this problem. I managed to extract the integers I needed from the "Line 36," and "position 56." examples by making the following modification:

Code:
' This is just a code snippet. Please refer to the code in the previous post for the context.
Dim line As Integer = CInt(rex.Match(message).Groups.Item(1).Value)
' ...
Dim pos As Integer = CInt(rex.Match(message).Groups.Item(1).Value)
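
For reference, here is a self-contained sketch of the same idea that can be run on its own. It uses the sample message from the first post. The module name and the slightly tightened patterns (`\d+` instead of `\d*`, and an escaped `\.`) are assumptions made for this example, not the code actually in use.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module MessageParseSketch
    Sub Main()
        ' Sample message copied from the comment at the top of FixTagMismatch.
        Dim message As String = "The 'td' start tag on line 36 does not match the end tag of 'a'. Line 36, position 56."

        ' Match.Value is the whole match; Groups(1).Value is only the parenthesized part.
        Dim tagMatches As MatchCollection = Regex.Matches(message, "'([^']*)'")
        Console.WriteLine(tagMatches(0).Value)            ' 'td' (quotes included)
        Console.WriteLine(tagMatches(0).Groups(1).Value)  ' td
        Console.WriteLine(tagMatches(1).Groups(1).Value)  ' a

        ' \d+ requires at least one digit; \. matches a literal period.
        Dim lineMatch As Match = Regex.Match(message, "Line\s(\d+),")
        Dim posMatch As Match = Regex.Match(message, "position\s(\d+)\.")
        If lineMatch.Success AndAlso posMatch.Success Then
            Dim lineNumber As Integer = CInt(lineMatch.Groups(1).Value)
            Dim position As Integer = CInt(posMatch.Groups(1).Value)
            Console.WriteLine(lineNumber)  ' 36
            Console.WriteLine(position)    ' 56
        End If
    End Sub
End Module
```

Checking Success before calling CInt avoids a conversion error when the message does not have the expected shape.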

Nick Ruiz
Associate Integrator
PPLSolutions IT Billing and Transactions
 