Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

comparing two memory buffers.

Status
Not open for further replies.

denc4

Programmer
Feb 17, 2005
107
NL
I am trying to write a program that finds certain data
patterns in files. FIND.EXE from DOS would do the trick
ofcourse but I want it to search from the end of the file too.

To achieve this I need a procedure which compares two
memory buffers. one with the search pattern, and one data
buffer which holds the file data. Both of an arbitrary size.
The procedure must be able to search from start to end, and
from end to start of the buffer.

The idea is to load the buffer with about 1024 bytes of the
file, search it using the routine, then read another 1024
bytes etc. until a match have been found.

It needs to be a 16 bit real mode DOS program.

I tried to write the procedure using a variety of string
instructions: scas, lods, cmps. By using these instructions
I could use the direction flag to search forward or
backward. I encountered a lot of problems along the way.
One of them is the question of what if the match overlaps
two buffers? I would have to save the state of the previous
search before I go on to the next.

I have written a lot of versions of the procedure, all
incomplete and flawed. So I will not post them here.

Has anybody had to write a similar algorithm, or does
anybody have an idea?

thx.
 
If you haven't already looked it up, have a look at Boyer-Moore string searching. It's much quicker than byte-by-byte (even with scas etc.) because it doesn't need to look at every byte.

The principle is:

say you're looking for a string of data 250 bytes long, which contains no number 23 at all. Now you find a number 23 in the next new byte. Not only do you know that your pattern doesn't start here, but also it doesn't start anywhere in the next 249 bytes, because that would still place the 23 somewhere in its length. Therefore you can skip ahead 249 bytes!

As for the buffers, unfortunately you'll probably need overlapping blocks of some sort. Again, if you're searching for a pattern 250 bytes long, you just make sure that each successive block you load overlaps the previous by 249 bytes so there's never a case where a search-pattern can exist without falling into one of the blocks completely.

I've used specific examples, but it's not hard to make the lengths arbitary.
 
I've looked at the Boyer-Moore string searching algo at
It looks promising, and shed some new light on the matter.
but is it me or is it flawed (must be me, since the algo is
over 30 years old:)).

They present an example:

search pattern: EXAMPLE
string: THIS IS A SIMPLE EXAMPLE

since "E" does not equal "S", the website explains it would
continue comparing after the "IS" word:

search pattern: EXAMPLE
string: THIS IS A SIMPLE EXAMPLE

Now I say, what if the strings would be like this:

search pattern: EXAMPLE
string: TEXAMPLEA SIMPLE EXAMPLE

"E" does not equal "L", now continue:

search pattern: EXAMPLE
string: TEXAMPLEA SIMPLE EXAMPLE

whoops! a match have been missed...

lionelhill, you are talking about 250 bytes and the number
23. Well, I would still have to access every byte in the
target buffer in order to determine there isn't a number 23
present. So there will be no speed gained, if you mentioned
it because of the speed factor that is...

By "overlapping" I actually ment this, example:

search pattern is: lionelhill
read a chunk of a file: gowngoengsdlionel
data search, conclusion: lionel appears at the end.
now read another chunk: hillgewnvornfeo
data search, conclusion: hill appears at beginning.

I hope you understand. That's why I mentioned I presumably
had to save the state of the previous search, before I
went into the next. Otherwise the above matches cannot be
found.

wow, I hope I am not being too negative all of a sudden..

anyway, thx.
 
The "search pattern:" and "string:" lines were perfectly
aligned when I typed them. Unfortunately I see this is not
the case now. Please note that I ment them to be aligned..
 
Hi denc4,

I believe you didn´t read fully how the algorythm works. But i might be wrong ;)
Just curious, but what happen if you use the part of the algorythm that positively identify a match in a part of the string to solve the buffering problem?

Rick

P.S.: This is a nice, simple and interesting post
 
To deal with overlapping first:

Say you're looking for the text "cd" in "abcdef", and your buffer size is only 3.
If your buffers don't overlap, you don't find cd, because the first buffer-full is "abc" and the 2nd is "def".

So instead you make your buffers overlap by the size of the search string (2 bytes) minus one. i.e. here, you need to overlap by one.

The first buffer is "abc" - not found!
The second buffer is "cde" - found!
The final buffer is "ef"

The bad side of this is that you do handle the whole file in more chunks than you'd like, because of the overlap. The good side is you definitely find everything.

Now to move on to your example for the Boyer-Moore thing:

Search for
"example"
in
"this is a simple example"
If you line up "example" with "this is..." then the final "e" of example lines up with "s" of "is". This is not only a not-match, but there is no "s" anywhere in the search-string "example". Therefore "example" is not only absent from position 1, it's also absent from every other position for the length of the word "example". You can skip all the way until after the "s" you've just found in the searched text.

Practically, the way to do this is to make a table showing the position of the first instance of each letter in your search-string. These positions are the increments to make when you find a non-match.

Hope this helps
 
The search method you describe is what I have tried to do for a long time.
To find "cd" in "abcdef":
compare "cd" with "ab", then compare "cd" with "bc", if no match then move on to next buffer fill and it would start all over again.

This part I can do using the x86 string instructions.
I have managed to write something, although it does use the
Boyer-Moore approach nor is it perfect.

Code:
; ---------------------------------------------------------------------------
;  IN: si: offset to data to search for.
;      di: offset to data to search in.
;      bx: remaining bytes in si buffer.
;      cx: remaining bytes in di buffer.
;      bp: original si value.
;      dx: original bx value.
;      df: direction to search in.
; OUT: si: at next byte that needs to be found, or at trailing byte.
;      di: at next byte that needs to be searched, or at trailing byte.
;      bx: number of bytes remaining in si buffer, if any.
;      cx: number of bytes remaining in di buffer, if any.
;              
; if bx=0 then
;   search pattern found.
;   si at trailing byte of buffer.
;   if cx=0 then end of buffer reached.
;   else cx bytes remain in di buffer.
;   endif
; else
;   search pattern not found.
;   cx=0
;   di at trailing byte of buffer.
;   bx bytes remain in si buffer.
; endif

  search proc near
    cmpsb
    jz nxchar
    mov si,bp                 ;restore original search
    mov bx,dx                 ; string values.
    loop search
    ret

  nxchar:
    dec bx                    ;one character matched.
    loopnz search
    ret
  search endp

now I can use the direction flag, the procedure is unaware
and does not care in what direction it is searching, which
makes it versatile. And I can start searching a newly filled
buffer with the result of the previous search still in
bx, so I can find "cd" in "abcdef" using a 3 byte buffer.

Unfortunately it has a bug. When I compare "console" with
"conconsole" and cmpsb determines that "s" does not equal to
"c", it restores the original si and bx values because
the entire search string was not found. The problem is that
di has already advanced one position. So the next cmpsb
will start comparing the search string "console" with "onsole". I cannot seem to come up with a working answer.

One more thing: you spoke about a table. could you post
an example of using such a table? should I use xlat?

thx.
 
I'm not a big fan of xlat. Last time I looked, it was quite slow compared to simple instructions. Table for me means define a place to put all the information you want (256 dup (?) or whatever), and then read/write information from/to it in the normal way
assuming ds already correct
mov si, offset mytable (base of table)
add si, thing (move to correct entry)
mov ax, ds:[si]

Or if you can:
mov ax, ds[si+thing] and handle the addressing that way.

As for functional code, you might like to see if you can find Michael Abrash's Zen of assembly. I've got his black book (graphics) which includes what I think is the Zen of assembly book. It has a large section on finding text strings in strings using Boyer Moore, and copious notes on efficient assembly. Unfortunately I've only got it in paper form, and in any case it's copyright, so I can't send you chunks, but I believe it can still be downloaded, not sure where from.
 
i want to write a program using the Atmel assembler language to generate a digital counter to display numbers using the Atmel boards. The counter can also be reset by the user at any time using an interrupt. (the AT90S8515 Assembler Boards )

i will really appreciate if anyone can help
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top