
using regular expressions to find text in a file

Status
Not open for further replies.

itroom123

Technical User
Nov 11, 2009
Hi All,

I'm using a script to analyze a file and extract a reference from the text. This is what I've got so far:

************************************************************
'script to find OGI ref in a file
dim colmatches

Const ForReading = 1
Const ForWriting = 2
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "([a-z][a-z][a-z][a-z][0-9][0-9])"
objRegEx.Global = True
objRegEx.ignoreCase = True

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("d:\archive\temp\test.prn", ForReading)

Do Until objFile.AtEndOfStream
strSearchString = objFile.ReadLine
Set colMatches = objRegEx.Execute(strSearchString)
loop
If colMatches.Count > 0 Then
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("D:\archive\temp\output\test.txt", ForWriting)
objFile.Write colMatches(0).value
objFile.Close
DocumentName = "00-"&colMatches(0).value&" unknown file"
End If

objFile.Close
*******************************************************

My problem is that this seems to work only intermittently.

Any ideas why it sometimes doesn't work?

P.S. The pattern can appear anywhere within the file.
 
const ForAppending=8
sfile="D:\archive\temp\output\test.txt"
'if objFSO.fileexists(sfile) then
' objFSO.deletefile sfile
'end if

Do Until objFile.AtEndOfStream
strSearchString = objFile.ReadLine
Set colMatches = objRegEx.Execute(strSearchString)
'loop   (commented out here; moved to after End If)
If colMatches.Count > 0 Then
'Set objFSO = CreateObject("Scripting.FileSystemObject")   (commented out; objFSO already exists)
Set objFile2 = objFSO.OpenTextFile(sfile, ForAppending, true)
objFile2.WriteLine colMatches(0).value
objFile2.Close
'what do you want to do with it? It is now lost...
DocumentName = "00-"&colMatches(0).value&" unknown file"
End If
loop
objFile.Close
PS: your script logic keeps overwriting colMatches on every pass through the loop, so in the end only the last line of the file is tested. The file handling is also mixed up: objFile is reused for both the input file and the output file.
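
The point about the loop can be shown in isolation. This is only a minimal sketch: the function name `FirstMatchInLines` and the sample lines are made up for illustration, and an in-memory array stands in for the file so the snippet runs on its own. The test happens inside the loop, and the loop stops at the first hit instead of carrying on and overwriting the result.

```vbscript
Option Explicit

' Hypothetical helper: returns the first match found in any line,
' or "" when no line contains the pattern.
Function FirstMatchInLines(lines, pattern)
    Dim re, textLine, matches
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = pattern
    re.IgnoreCase = True
    re.Global = False

    FirstMatchInLines = ""
    For Each textLine In lines
        Set matches = re.Execute(textLine)
        If matches.Count > 0 Then
            ' Keep the hit and stop; later lines cannot overwrite it.
            FirstMatchInLines = matches(0).Value
            Exit Function
        End If
    Next
End Function

Dim sample
' Sample data invented for this sketch.
sample = Array("no reference here", "ref TEST09 on this line", "last line, nothing")
WScript.Echo FirstMatchInLines(sample, "[a-z]{4}[0-9]{2}")
```

With the original layout (loop closed before the `If`), the same data would report no match, because only "last line, nothing" would ever be tested.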
 
Thanks for the help,

Yeah, I noticed that. I've made some amendments and cleaned it up, and it's working a lot better, but when I check the source file there are still instances where the pattern exists but is not being picked up.


'Script to get OGi ref from file contents
'JH 11/11/2009
'***********************************************************
dim colmatches

Const ForReading = 1
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "[a-z][a-z][a-z][a-z][0-9][0-9][a-z][a-z][0-9][0-9]"
objRegEx.Global = False
objRegEx.ignoreCase = True

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("c:\archive\capture.prn", ForReading)

strSearchString = objFile.readAll
Set colMatches = objRegEx.Execute(strSearchString)
If colMatches.Count = 0 Then
DocumentName = "00-unknown reference OGI_File"
else
DocumentName = "00-"&colMatches(0).value&" OGI_File"
End If
objFile.Close
'***********************************************************
 
Could you show us some examples of the text that the pattern is not picking up?

Andy
---------------------------------
' Signature removed for testing purposes.

 
>where the pattern exists but is not being picked up?

tsuji's comment still holds.

The line:
DocumentName = "00-"&colMatches(0).value&" OGI_File"
only retrieves the first element of colMatches, so if there were any other matches you are not doing anything with them. Just a guess, but...

For i = 0 To colMatches.Count - 1 'the collection is zero-based
DocumentName = DocumentName & ";" & "00-"&colMatches(i).value&" OGI_File"
Next

(this will result in a string which starts with ";", but I don't care)
 
Hi all,
Yeah, I know that at the moment it only picks up the first instance of the pattern, but that shouldn't be a problem: any further matches will be the same.

I'm more worried about instances where it says there are no matches when a simple Notepad text search can find one.


- in this file it correctly finds "test09hq01" as the pattern

- in this file the pattern is "aaaa02pc01" but it doesn't find it
 
mrmovie - as he's using Global = False then he's only ever going to get one match (the first one). Thinking about it, there's no need to use a matches collection at all, if it's being used this way, just assign the result to a variable.
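
As a rough sketch of "just assign the result to a variable": the helper name `FindReference` is hypothetical, and the `{4}`/`{2}` quantifiers are simply a shorter spelling of the pattern already posted in the thread. `Execute` still returns a collection internally, but the caller only ever sees a string.

```vbscript
Option Explicit

' Hypothetical helper: returns the first reference found in text, or "".
Function FindReference(text)
    Dim re, matches
    Set re = CreateObject("VBScript.RegExp")
    ' Same as [a-z][a-z][a-z][a-z][0-9][0-9][a-z][a-z][0-9][0-9]
    re.Pattern = "[a-z]{4}[0-9]{2}[a-z]{2}[0-9]{2}"
    re.IgnoreCase = True
    re.Global = False   ' first match only

    Set matches = re.Execute(text)
    If matches.Count > 0 Then
        FindReference = matches(0).Value
    Else
        FindReference = ""
    End If
End Function

Dim ref, DocumentName
ref = FindReference("*p1000X*p1100YAAAA02PC01 *p1850X")
If ref = "" Then
    DocumentName = "00-unknown reference OGI_File"
Else
    DocumentName = "00-" & ref & " OGI_File"
End If
WScript.Echo DocumentName
```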

itroom123 - My web filter is blocking access to your file examples. Could you post a small sample containing the data (and the data that should be found with the pattern) from the second file, please?

Regards

Andy

 
below is an excerpt from the second file

____________________________________________________________
E&k*p88x2310YReason*p242x2310Yfor*p310x2310Yissue*p1588x2310
YIssue*p1702x2310YDate*p90x2703YAdditional Information*p41x3174YPlease check the information on the
schedule is correct, if it is not, contact your insurance adviser.*p90x2475YBreakdown*p319x2475YPremium*p851x2475YInsurance
*p1051x2475YPremium*p1233x2475YTax*p1618x2475YTotal(8U(s1p8.
00vs3b4148T*p484x558Y*p481x665Y*p481x780Y*p481x818Y*p481x861
Y*p481x902Y*p478x946Y*p1844x561Y*p1021x1105Y*p1842x666Y*p184
2x767Y*p1841x865Y*p1842x964Y*p1841x1064Y*p360x1195Y*p131x134
6Y*p539x1349Y*p1032x1351Y*p1626x1353Y*p540x1449Y*p1842x1436Y
*p392x1542Y*p1231x1571Y*p1533x1570Y*p1745x1684Y*p539x1774Y*p
588x2106Y*p538x2293Y*p1844x2296Y(8U(s1p10.00vsb4148T*p853x24
27YInsurance*p1053x2427YPremium*p1235x2427YTax*p1617x2427YTo
tal(8U(s1p8.00vs3b4148T*p144x2771Y(8U(s1p8.00vsb4148T*p1672x
1827Y*p56x3307YFortis Insurance Limited, Fortis House,
Tollgate, Eastleigh, Hampshire SO53 3YA Registered Number
354568 England. *p56x3337YAuthorised and regulated by the
Financial Services
Authority.(8U(s1p8.00vs3b4148T*p539x1681Y(8U(s1p7.00vsb4148T
*p1733x1574YVoluntary Accidental Damage
Excess*p1733x1604YCompulsory Accidental Damage
Excess(8U(s1p8.00vs3b4148T*p903x1568Y(8U(s1p7.00vsb4148T*p13
08x1631YThese excesses are in addition to any Policy
Excesses that may
apply(8U(s1p8.00vs3b4148T*p571x2402Y*p1970x2549Y*p569x2471Y*
p1797x2479Y*p1335x2414Y*p1347x2481Y*p1845x2201Y(8U(s1p10.00v
sb4148T*p1490x2567YTotal Amount Payable*p1530x2209YPolicy
Version(8U(s1p8.00vs3b4148T*p1794x2416Y(8U(s1p8.00vsb4148T*p
1735x1550Y*b0M (0N (s1p8v0s0b52T *p500X*p565YMs Karen Lane
*p500X*p665YHousewife *p500X*p780Y34 Northfields
*p500X*p830YLambourne *p500X*p880YHungerford
*p500X*p930YBerkshire *p500X*p980YRG17 8YJ
*p1000X*p1100YAAAA02PC01 *p1850X*p565Y81780W
*p1850X*p665YCALG865556E *p1850X*p1450YYes *p620X*p2125Y
*p+40X *p+40X *p+40X *p+40X *p+40X
*p1850X*p2310Y12/11/2009 (0N (s1p14v0s3b52T *p100X
*p3400YDUPLICATE (0N (s1p8v0s0b52T *p500X*p1450Y9
years *p400X*p1540YComprehensive
*p120X*p2775Y *p+40Y *p+40Y
*p1900X*p1700YRG17 8YJ
__________________________________________________________

The pattern is AAAA02PC01 (in bold in the original post).
 
Hmm, that's a bit strange. It finds the pattern in a .prn file just fine on my machine, using both a slightly modified version of your code and the exact code you posted on 11 Nov 09 12:09. [ponder]

Andy

 
Nothing that would have made a difference to what it picked up I'm afraid.

Andy

 
Sorry, I'm not that familiar with the ins and outs of regular expressions; I was reacting to the OP's statement of "where the pattern exists but is not being picked up".
 
>Finds the pattern in a .prn file just fine

I'm impressed - the prn files are binary files which confuse both the filesystemobject and the regular expressions parser. It comes as no surprise to me that the code misses some matches. HQ, the sample as posted here is not a binary file and thus your search works.

What you need to do is open the file as a binary and get rid of all the ASCII control codes before looking for patterns
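
The "get rid of the control codes" step on its own might look something like the sketch below. It assumes the file contents are already in a string; `StripControlChars` is a made-up name, and swapping each control character for a space (rather than deleting it) is just one choice, made so that words on either side of a control code do not get glued together.

```vbscript
Option Explicit

' Hypothetical helper: replaces every ASCII control character
' (codes 0-31 and 127) in text with a single space.
Function StripControlChars(text)
    Dim re
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "[\x00-\x1F\x7F]"
    re.Global = True    ' replace every occurrence, not just the first
    StripControlChars = re.Replace(text, " ")
End Function

Dim sample
' ESC (27) in front of printer commands, as in a .prn file; sample invented here.
sample = Chr(27) & "*p1000X" & Chr(27) & "*p1100YAAAA02PC01" & vbCrLf
WScript.Echo StripControlChars(sample)
```

This only cleans the string; it does nothing about how the file is read in the first place.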
 
@strongm, thanks, that makes sense.

@HarleyQuinn, sorry, I didn't notice that what I pasted was not the same.

So I will need to use regular expressions to replace the ASCII control codes with "".

Something like the below.

'***********************************************************
' open file and remove esc characters
'***********************************************************

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("d:\archive\capture.prn", 1)
Do Until objFile.AtEndOfStream
strCharacters = objFile.Read(1)

If AscB(strCharacters) = 27 or AscB(strCharacters) = 13 then
' drop ESC (27) and carriage return (13)
elseif AscB(strCharacters) = 9 then
strDoc = strDoc & " "
else
strDoc = strDoc & strCharacters
end if

Loop

'***********************************************************
' Create new file with chars removed
'***********************************************************

strFilename = "d:\archive\result.prn"

Const ForWriting = 2
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objTextFile = objFSO.OpenTextFile _
(strFilename, ForWriting, True)

objTextFile.WriteLine(strDoc)
objTextFile.Close
 
mrmovie - Sorry, my post was intended as explanatory, not as a put-down of your post.

Thanks strongm, having not had the pleasure of working with .prn files before I took the OP's posted sample at face value [blush]

Andy

 
>Something like the below.


More or less. But that might still be susceptible to certain control character sequences causing problems. I'd probably use a stream that can read binary files and can be used from scripting, such as the ADODB stream. Further, I'd probably clean out all control characters rather than just a selected few.
 
OK, so using my limited knowledge:

You're saying I need to use an ADODB stream to read the file in the correct format and then run my pattern test against that?

In which case I've come up with this:
Code:
'Script to get OGi ref from file contents
'JH 12/11/2009
'***********************************************************
'Read from file in binary format
'***********************************************************
  Const adTypeBinary = 1
  
  'Create Stream object
  Dim BinaryStream
  Set BinaryStream = CreateObject("ADODB.Stream")
  
  'Specify stream type - we want To get binary data.
  BinaryStream.Type = adTypeBinary
  
  'Open the stream
  BinaryStream.Open
  
  'Load the file data from disk To stream object
  BinaryStream.LoadFromFile "d:\archive\capture1.prn"
  
  'Open the stream And get binary data from the object
  ReadBinaryFile = BinaryStream.Read

'***********************************************************
'Find OGI Ref
'***********************************************************

Const ForReading = 1
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "[a-z][a-z][a-z][a-z][0-9][0-9][a-z][a-z][0-9][0-9]"
objRegEx.Global = False
objRegEx.ignoreCase = True

    Set colMatches = objRegEx.Execute(ReadBinaryFile)
    If colMatches.Count = 0 Then        
        DocumentName = "00-unknown reference OGI_File"
    else
        DocumentName = "00-"&colMatches(0).value&" OGI_File"
    End If
'***********************************************************


or am i on the wrong lines here?
 
Close - but you still need to clean out the control characters from the binary data and turn it back into a string before feeding it to the regular expression routine.

Basically use your code that built strDoc above, but feed it from the ADO BinaryStream instead of FSO's TextStream
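
Put together, that suggestion could be sketched roughly as follows. Treat it as an illustration, not tested against real .prn files: `ReadCleanText` is a made-up name, the bytes are walked with `LenB`/`MidB`/`AscB`, every control character becomes a space, and `Chr` assumes a single-byte ANSI code page. Building the string one character at a time is slow for large files.

```vbscript
Option Explicit

Const adTypeBinary = 1

' Hypothetical helper: loads a file as raw bytes through ADODB.Stream and
' returns a string with control characters (0-31, 127) turned into spaces.
Function ReadCleanText(path)
    Dim stream, bytes, i, code, result
    Set stream = CreateObject("ADODB.Stream")
    stream.Type = adTypeBinary
    stream.Open
    stream.LoadFromFile path
    bytes = stream.Read      ' Null when the file is empty
    stream.Close

    result = ""
    If Not IsNull(bytes) Then
        For i = 1 To LenB(bytes)
            code = AscB(MidB(bytes, i, 1))
            If code < 32 Or code = 127 Then
                result = result & " "
            Else
                result = result & Chr(code)
            End If
        Next
    End If
    ReadCleanText = result
End Function

Dim fso, re, matches, path
path = "d:\archive\capture.prn"   ' path used earlier in the thread
Set fso = CreateObject("Scripting.FileSystemObject")
If fso.FileExists(path) Then
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "[a-z][a-z][a-z][a-z][0-9][0-9][a-z][a-z][0-9][0-9]"
    re.IgnoreCase = True
    re.Global = False
    Set matches = re.Execute(ReadCleanText(path))
    If matches.Count > 0 Then WScript.Echo matches(0).Value
End If
```

The regular expression then only ever sees an ordinary string, which is the point of cleaning the data first.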
 
Hi Guys, i think i've got it.

Code:
'*********************************************************************************
' open file and remove esc characters
'*********************************************************************************

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("D:\archive\capture.prn", 1)
Do Until objFile.AtEndOfStream
    strCharacters = objFile.Read(1)

	If AscB(strCharacters) >= 0 and AscB(strCharacters) <= 32 then
        ' control characters and space (codes 0-32) are dropped
	elseif AscB(strCharacters) = 127 then 
		' 127 is a delete character
		strDoc = strDoc & " "  
		' replace with space
	else
		strDoc = strDoc & strCharacters 
	end if

Loop
'*********************************************************************************
' Create new file with chars removed
'*********************************************************************************

strFilename = "d:\archive\result.prn"
Const ForWriting = 2
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objTextFile = objFSO.OpenTextFile _
    (strFilename, ForWriting, True)

objTextFile.WriteLine(strDoc)
objTextFile.Close

'***********************************************************
'Find OGI Ref
'***********************************************************
dim colmatches

Const ForReading = 1
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "[a-z][a-z][a-z][a-z][0-9][0-9][a-z][a-z][0-9][0-9]"
objRegEx.Global = False
objRegEx.ignoreCase = True

Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objFile = objFSO.OpenTextFile("d:\archive\result.prn", ForReading)

    strSearchString = objFile.readAll
    Set colMatches = objRegEx.Execute(strSearchString)
    If colMatches.Count = 0 Then        
		DocumentName = "00-unknown reference OGI_File"
    else
		DocumentName = "00-"&colMatches(0).value&" OGI_File"
    End If
objFile.Close
'***********************************************************

what do you think?
 