Tek-Tips is the largest IT community on the Internet today!

search binary data for string value

Status
Not open for further replies.

impulse24

IS-IT--Management
Jul 13, 2001
US
Hi,

I have some binary files that I want to run through VB.NET and have it count how many instances of a string occur in each file. This is easy on Unix with the grep command.

Is there a way to do this in VB.NET? I have never opened or processed files in binary mode in VB, and am not sure how to do it.

These are not text files, so I can't use a TextStream to read them; they are strictly binary PDF files. I want to be able to see where the PDF string appears in the binary data, and whether it occurs more than once.

In vb 6 the code I used is:
Open srcFile For Binary As #1
datarray$ = Input$(lngSize, #1)
workarray$ = datarray$
Close #1

But it's very slow, and I am looking for a performance gain in VB.NET.

What I would also like to do is split the file wherever there are multiple occurrences of PDF. So if I have a binary .pdf file that has 2 occurrences of the string PDF in the binary data, I would split it into 2 separate binary PDF files. Any help would be appreciated.
 
An uncompressed PDF file is mostly plain text and can be read with a text stream. The format is derived from PostScript, so it may be possible to find the text simply by looking for it, with InStr for example. This may not work if some parts of the string you are looking for are formatted differently from the other parts, for example if one word in a three-word phrase is in italics.

However, if the PDF has been compressed or encrypted this will not be true, and you would need to decompress and/or decrypt the binary data first. It is unlikely that you will be able to establish how to do this.

To check whether you have a simple text file PDF try opening it with a text editor (even Notepad will do) and see if there are any characters which cannot be rendered on the screen.
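
The same check can be roughed out in code rather than by eye. This is only a sketch of the idea: the module name TextCheck and the function CountNonTextBytes are made up for illustration, and treating "printable ASCII plus tab, CR and LF" as text is an assumption that will misjudge files in other encodings.

```vbnet
Imports System
Imports System.IO

Module TextCheck

    ' Counts bytes outside printable ASCII (tab, LF and CR are treated as text).
    ' Hypothetical helper: a heuristic only, it knows nothing about PDF structure.
    Function CountNonTextBytes(data As Byte()) As Integer
        Dim count As Integer = 0
        For Each value As Byte In data
            Dim isWhitespace As Boolean = (value = 9 OrElse value = 10 OrElse value = 13)
            If Not isWhitespace AndAlso (value < 32 OrElse value > 126) Then
                count += 1
            End If
        Next
        Return count
    End Function

    Sub Main(args As String())
        If args.Length = 0 Then
            Console.WriteLine("Usage: TextCheck <file>")
            Return
        End If
        Dim data As Byte() = File.ReadAllBytes(args(0))
        Console.WriteLine(CountNonTextBytes(data) & " of " & data.Length & " bytes are not printable ASCII")
    End Sub

End Module
```

A count of zero suggests the file is plain text; anything much above that points to compressed or embedded binary data.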


Bob Boffin
 
These are compressed TIFF images within a PDF file. If I open the PDF file in a hex editor, or even Notepad, the first tag I see is PDF, then the binary encoding for the TIFF image, and then the PDF tag again. I want to split these out into separate files wherever the PDF tags appear. I know I can do this with the full Acrobat product, but it's a very slow process (it can process 5-6 files per minute). When testing on my Unix platform I can run simple grep/split commands that generate what I am looking for (about 200 files per minute), but again I want something running on Windows in the VB.NET environment that my users can run.

Any help would be appreciated.
 
Ahhh, I see. Something like this should work:

Code:
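    ' Read the whole file into memory as raw bytes.
    Dim b() As Byte = System.IO.File.ReadAllBytes("C:\YourFile.PDF")

    ' Convert the search text to its ASCII byte sequence.
    Dim encoding As New System.Text.ASCIIEncoding
    Dim SearchString As String = "Info"
    Dim bSearch As Byte() = encoding.GetBytes(SearchString)
    Dim bFound As Boolean

    ' VB For loops include the upper bound, so the last valid
    ' start position is b.Length - bSearch.Length.
    For i As Integer = 0 To b.Length - bSearch.Length
      If b(i) = bSearch(0) Then
        bFound = True
        ' Compare the remaining bytes; stop at the first mismatch.
        For j As Integer = 1 To bSearch.Length - 1
          If b(i + j) <> bSearch(j) Then
            bFound = False
            Exit For
          End If
        Next
        If bFound Then
          MsgBox(SearchString & " found at byte offset: " & i)
        End If
      End If
    Next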

I threw that at a little compressed PDF and a 1.5 MB PDF and it returned the proper byte offsets almost as fast as I could click the 'OK' on the message box.
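
For the counting and splitting part of the question, the same byte search could be wrapped up roughly as follows. Treat it as a sketch under a few assumptions: PdfSplitter, FindAll and SplitAtMarker are names invented for this example, the marker is passed in by the caller (the standard PDF header begins with "%PDF-", which is less likely than a bare "PDF" to match by accident inside image data), any bytes before the first marker are dropped, and each piece is assumed to be a complete PDF in its own right, as described above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module PdfSplitter

    ' Returns the byte offset of every occurrence of pattern in data.
    Function FindAll(data As Byte(), pattern As Byte()) As List(Of Integer)
        Dim offsets As New List(Of Integer)
        If pattern.Length = 0 Then Return offsets
        ' The upper bound is inclusive, so a match at the very end is still found.
        For i As Integer = 0 To data.Length - pattern.Length
            Dim matched As Boolean = True
            For j As Integer = 0 To pattern.Length - 1
                If data(i + j) <> pattern(j) Then
                    matched = False
                    Exit For
                End If
            Next
            If matched Then offsets.Add(i)
        Next
        Return offsets
    End Function

    ' Writes one file per marker occurrence; each piece runs from one marker
    ' up to the next marker (or to the end of the file). Returns the piece count.
    Function SplitAtMarker(sourcePath As String, marker As String, outputDir As String) As Integer
        Dim data As Byte() = File.ReadAllBytes(sourcePath)
        Dim offsets As List(Of Integer) = FindAll(data, Encoding.ASCII.GetBytes(marker))
        Dim baseName As String = Path.GetFileNameWithoutExtension(sourcePath)
        For n As Integer = 0 To offsets.Count - 1
            Dim nextStart As Integer = If(n + 1 < offsets.Count, offsets(n + 1), data.Length)
            Dim outPath As String = Path.Combine(outputDir, baseName & "_" & (n + 1).ToString() & ".pdf")
            Using fs As New FileStream(outPath, FileMode.Create, FileAccess.Write)
                fs.Write(data, offsets(n), nextStart - offsets(n))
            End Using
        Next
        Return offsets.Count
    End Function

    Sub Main()
        ' In-memory demonstration of the counting step only.
        Dim sample As Byte() = Encoding.ASCII.GetBytes("%PDF-1.4 aaa %PDF-1.4 bbb")
        Dim marker As Byte() = Encoding.ASCII.GetBytes("%PDF-")
        Console.WriteLine(FindAll(sample, marker).Count) ' prints 2
    End Sub

End Module
```

A call such as SplitAtMarker("C:\YourFile.PDF", "%PDF-", "C:\Out") would write YourFile_1.pdf, YourFile_2.pdf and so on into an existing output folder (C:\Out is a placeholder). The return value is the number of occurrences, so it doubles as the count.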

-Rick

 