Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Algorithm to compare binary files and create patch file.

Status
Not open for further replies.

bcarney

Programmer
Aug 30, 1999
8
CA
I have to write a utility to compare two versions of a binary file (new and old) and then to extract the differences between the two versions of the file. The extracted list will then be used to update the old file and make it the same as the new one (so basically it is a patch utility). I have written it but find that the patch files it creates are larger than those that I can obtain by using diff or some other shareware patch utilities I have tried. The cause seems to be a problem in the routine where I synchronize the two files after a difference is found, but I just cannot get my head around this logic. If anyone has any good algorithms or can help me out with this I really would appreciate it - my deadline is looming nearer!
 
I'm impressed you even attempted this - I would just have called a shareware utility. Is there some reason you can't do this? <p>Mike Lacey<br><a href=mailto:Mike_Lacey@Cargill.Com>Mike_Lacey@Cargill.Com</a><br><a href= Cargill's Corporate Web Site</a><br>
 
There is Mike. The client is adamant that they do not want to rely on non-proprietary software.
 
Piece of cake. Read both files into string variables and do a byte-by-byte comparison, building a difference string as you go.<br>
Something like:<br>
FiNum = Freefile<br>
Open &quot;Oldfile.ext&quot; for Binary as #FiNum<br>
Old$ = String$(LOF(FiNum),0)<br>
Get #FiNum, 1, Old$<br>
Close #FiNum<br>
FiNum = Freefile<br>
Open &quot;Newfile.ext&quot; for Binary as #FiNum<br>
New$ = String$(LOF(FiNum),0)<br>
Get #FiNum, 1, New$<br>
Close #FiNum<br>
Diff$ = &quot;&quot;<br>
If Len(New$) = Len(Old$) then 'files are the same size<br>
FoundDiff = False<br>
For B = 1 to Len(New$)<br>
If Mid$(New$,B,1) &lt;&gt; Mid$(Old$,B,1) then<br>
Diff$ = Diff$ + Mid$(New$,B,1)<br>
FoundDiff = True<br>
Else<br>
If FoundDiff then Exit For<br>
End If<br>
Next<br>
End If<br>
<br>
The same logic would apply if changes were expected in more than one area of the files. You would simply maintain an array of difference strings. Build an element starting with the first mismatch, stop at the first match, build the next element starting with the next mismatch, and so on.<br>
The logic is slightly trickier when the files are different sizes (indicating insertions or deletions). You need to mark the location of the first difference and then use a separate loop to search for subsequent matches in the new file. If the bytes in oldfile match bytes in newfile later than their position in oldfile then there may have been an insertion. The same logic can find the location of deletions.<br>
<br>
The assignment won't seem so complex after you straighten out the spaghetti. Be careful if you decide to use recursion.
 
Just a thought - the unix diff program. (I know you're using VB; bear with me) There is probably source code out there for a version of diff (the linux distribution would be the first place I would check). Having said that, diff is for text files really.<br>
<br>
Cmp - another unix utility - is for binary files and may be if use but exits when it finds the first difference.<br>
<br>
Larry Wall's patch program may be interesting and the source may be available. Anecdotally patch is not a good example of readable code...<br>
<br>
Mike<br>
<p>Mike Lacey<br><a href=mailto:Mike_Lacey@Cargill.Com>Mike_Lacey@Cargill.Com</a><br><a href= Cargill's Corporate Web Site</a><br>
 
I don't think so Alt255 - if it were that simple the money would be in the bank. <br>
<br>
Many of the files that will be compared are several GB in size - you aren't going to read those in a single leap. So, we read the file in smaller chunks. After we come to the first difference between buffers (strings?) the following scenario exists:<br>
i) the old file has simply been changed - easy, just read ahead until the bytes match again, writing the location and new byte value to the patch file,<br>
ii) the new file has been inserted into - sounds easy, just advance through the new file buffer until the bytes match again or you have come to the end of the buffer in which case read next chunk and repeat,<br>
iii) the new file has been deleted from, same as ii but move forward through the old file buffer,<br>
<br>
----- BUT -----<br>
<br>
when we come to the first difference we don't know whether it is a change, insert or delete. Should we move ahead through the old buffer or the new or perhaps both. What if the files don't match until the next read of the new file, or the old file, what if they only match up again after 30 reads of the new file. This is the part of the logic that isn't such a &quot;piece of cake&quot;.<br>
<br>
One other point, when using strings remember that when you compare each element of the string you are actually comparing 2 bytes, not one. This is obviously overcome by using arrays of bytes.<br>
<br>
Mike, I had thought about the linux source, but the limitations you mention are the problem. The patch program sounds promising so I will try to track down the code for that.<br>
<br>
Thanks guys, please keep the thoughts coming, I need all the help I can get.
 
Geeeeezzzz guys,<br>
<br>
I was certain that EdEric or another bright spark would come up with a winner for this one. Still &quot;deperately seeking Susan - sorry solution&quot;.<br>
<br>
Regards<br>
<br>
Brad.
 
There are lots of ways to compare small files. After all, one can live with a little inefficiency when faced with a two-second comparison. You complicated the issue when you noted the &quot;several GB&quot; size of the files. Had you mentioned that in your original post, I wouldn't have bothered to stick my foot in my mouth.<br>
When you find a fast, efficient solution to your problem, make sure you post it here. We'll nominate you for the Nobel Peace Prize for Computer Sciences.<br>
I don't mean to seem uncaring -- I care! I want to see the answer (probably more than most). But this is the sort of quagmire that loves to suck in unwary programmers, leaving only their nose and eyes exposed above the muck.<br>

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top