Finding duplicate content using "Hash" strings

ferndoc · Feb 8, 2005

I'm attempting to locate duplicate content by divining a hex string that can be sorted, then compared to other such strings. In the old days I'd have called this a hash total but I can't find anybody else using the term (hence no FAQ found). [blush]

To get right to the question: Am I reinventing the Oxcart while the rest of you know some Windows API faster than a speeding bullet? If not, any suggestions to speed this thing up? And what should I be calling this scheme?

See the code fragment below.
Stoke your imagination with a vast proliferation of JPG files: some with ever-duplicated camera-generated names, others copied to other folders, copied with a rename or date change, or otherwise disguised by innocent users armed with loaded weapons.

Logic of the whole deal: sort by byte count and hash string, then test only the files whose byte count and hash string both match for identical content. This vastly simplifies the one-to-many compares.
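For the final "identical content" test on a candidate pair, a minimal sketch of a byte-for-byte compare in VBA might look like this. The name FilesIdentical and the 64 KB chunk size are my own assumptions for illustration, and since FileLen and LOF return a Long this only suits files under 2 GB.

```vba
' Hypothetical helper: True when two files have byte-for-byte identical content.
Public Function FilesIdentical(ByVal pathA As String, ByVal pathB As String) As Boolean
    Const CHUNK As Long = 65536          ' assumed block size; tune to taste
    Dim fA As Integer, fB As Integer
    Dim bufA() As Byte, bufB() As Byte
    Dim remaining As Long, n As Long, i As Long

    ' Different lengths can never match, so skip reading entirely.
    If FileLen(pathA) <> FileLen(pathB) Then Exit Function

    fA = FreeFile
    Open pathA For Binary Access Read Shared As #fA
    fB = FreeFile
    Open pathB For Binary Access Read Shared As #fB

    FilesIdentical = True
    remaining = LOF(fA)
    Do While remaining > 0
        n = IIf(remaining < CHUNK, remaining, CHUNK)
        ReDim bufA(0 To n - 1)
        ReDim bufB(0 To n - 1)
        Get #fA, , bufA                  ' Binary mode reads raw bytes, no descriptor
        Get #fB, , bufB
        For i = 0 To n - 1
            If bufA(i) <> bufB(i) Then
                FilesIdentical = False   ' first mismatch settles it
                Exit Do
            End If
        Next i
        remaining = remaining - n
    Loop

    Close #fA
    Close #fB
End Function
```

Because it is only called for pairs that already agree on byte count and hash string, the full read should be rare.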

I have had good results reading the file into a big buffer and XORing the 2nd through nth reads into the first, then folding the resulting string down to some shorter length by XORing its 2nd through nth chunks (of, say, 16 bytes) into the first 16 bytes. This can detect a single-character change in a 10MB file (reliability is not determined, hence the actual content comparison in the scheme).
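The folding step could be sketched roughly as below, working on a Byte array rather than a string. FoldXor and BytesToHex are hypothetical names for illustration, and the 16-byte output length is just the figure mentioned above.

```vba
' Hypothetical helper: fold a byte array down to outLen bytes by XORing
' every outLen-byte chunk into the first one.
Public Function FoldXor(ByRef src() As Byte, ByVal outLen As Long) As Byte()
    Dim folded() As Byte
    Dim i As Long, j As Long
    ReDim folded(0 To outLen - 1)        ' starts as all zero bytes
    For i = LBound(src) To UBound(src)
        j = (i - LBound(src)) Mod outLen ' position inside the folded result
        folded(j) = folded(j) Xor src(i)
    Next i
    FoldXor = folded
End Function

' Hypothetical helper: render bytes as an uppercase hex string for sorting.
Public Function BytesToHex(ByRef b() As Byte) As String
    Dim i As Long, s As String
    For i = LBound(b) To UBound(b)
        s = s & Right$("0" & Hex$(b(i)), 2)   ' two hex digits per byte
    Next i
    BytesToHex = s
End Function
```

One known limit of any pure XOR scheme: XOR does not care about order, so two files made of the same blocks in a different sequence fold to the same string. That is one more reason to keep the content comparison.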

BUT ----

Execution of my routine is SLOW: about 13 seconds for ~10MB on a 2.4 GHz Celeron, with the file itself already cached.

Code:

'code to safe open for Binary Access Read Shared goes here
getBuf = String(getBufL, 0)  'read buffer, getBufL bytes long
xorStr = getBuf              'running XOR accumulator, starts as all zero bytes
Do While Lof(f1) > Loc(f1)  'Loc is adjusted each read
  'code to adjust getBuf length for final block goes here
  Get #f1, , getBuf
  'This is the guilty code: elapsed time 13+ sec; comment it out and elapsed time is ~0.02 sec
  For i = 1 To Len(getBuf)  'piggy loop ~1.3 sec/MB
    X = Asc(Mid(xorStr, i, 1)) Xor Asc(Mid(getBuf, i, 1))
    Mid(xorStr, i, 1) = Chr$(X)
  Next i
Loop
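One possible way to speed up the loop above is to read into Byte arrays and XOR the bytes directly, which avoids the temporary strings that Mid, Asc and Chr$ create on every pass. This is an untimed sketch, not a measured fix; XorFileBlocks, path and blockLen are hypothetical names, and xorBuf plays the role of xorStr.

```vba
' Hypothetical rewrite of the XOR loop using Byte arrays instead of
' Mid/Asc/Chr$ on strings.
Public Function XorFileBlocks(ByVal path As String, ByVal blockLen As Long) As Byte()
    Dim f As Integer
    Dim getBuf() As Byte, xorBuf() As Byte
    Dim remaining As Long, n As Long, i As Long

    ReDim xorBuf(0 To blockLen - 1)      ' accumulator, all zero bytes
    f = FreeFile
    Open path For Binary Access Read Shared As #f
    remaining = LOF(f)
    Do While remaining > 0
        n = IIf(remaining < blockLen, remaining, blockLen)
        ReDim getBuf(0 To n - 1)         ' final block may be short
        Get #f, , getBuf                 ' raw bytes, no string conversion
        For i = 0 To n - 1
            xorBuf(i) = xorBuf(i) Xor getBuf(i)
        Next i
        remaining = remaining - n
    Loop
    Close #f
    XorFileBlocks = xorBuf
End Function
```

The result is still blockLen bytes long, so it would need folding or hex conversion before being stored as a sortable string.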

This technique will be valuable (and much faster) even if I stick to the first one or two buffers. But I'd like to get a fast hashing technique that can be proven failsafe so I can find duplicates externally (without diving back into the files in some frenzied each-to-many bit comparo drill).

BTW: doing this in Access with VBA for recordsets to handle potentially huge directory and file lists -- open to suggestion...

PHV · Feb 8, 2005

Isn't MD5 an option?

Hope This Helps, PH.

ferndoc · Feb 8, 2005

Blush!

I had never heard of MD5 but Google gave me "The MD5 Homepage"

http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html

stating: "[The MD5 algorithm] takes as input a message of arbitrary length and produces as output a 128-bit 'fingerprint' or 'message digest' of the input."

Is this what you meant? It looks like a 4 star match and I'll dive into it.

Thanks
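On the earlier question about a Windows API: one route to MD5 from VBA is the Windows CryptoAPI in advapi32.dll. The sketch below assumes 32-bit VBA (64-bit Office would need PtrSafe and LongPtr in the Declares), the Declares must sit at the top of a standard module, and MD5Hex is a hypothetical wrapper name.

```vba
' Sketch: MD5 of a byte array through the Windows CryptoAPI (32-bit VBA).
Private Declare Function CryptAcquireContext Lib "advapi32.dll" Alias "CryptAcquireContextA" _
    (ByRef phProv As Long, ByVal pszContainer As String, ByVal pszProvider As String, _
     ByVal dwProvType As Long, ByVal dwFlags As Long) As Long
Private Declare Function CryptCreateHash Lib "advapi32.dll" _
    (ByVal hProv As Long, ByVal Algid As Long, ByVal hKey As Long, _
     ByVal dwFlags As Long, ByRef phHash As Long) As Long
Private Declare Function CryptHashData Lib "advapi32.dll" _
    (ByVal hHash As Long, ByRef pbData As Byte, ByVal dwDataLen As Long, _
     ByVal dwFlags As Long) As Long
Private Declare Function CryptGetHashParam Lib "advapi32.dll" _
    (ByVal hHash As Long, ByVal dwParam As Long, ByRef pbData As Byte, _
     ByRef pdwDataLen As Long, ByVal dwFlags As Long) As Long
Private Declare Function CryptDestroyHash Lib "advapi32.dll" (ByVal hHash As Long) As Long
Private Declare Function CryptReleaseContext Lib "advapi32.dll" _
    (ByVal hProv As Long, ByVal dwFlags As Long) As Long

Private Const PROV_RSA_FULL As Long = 1
Private Const CRYPT_VERIFYCONTEXT As Long = &HF0000000
Private Const CALG_MD5 As Long = &H8003&   ' trailing & keeps the literal a positive Long
Private Const HP_HASHVAL As Long = 2

' Hypothetical wrapper: returns 32 uppercase hex digits, or "" on any API failure.
Public Function MD5Hex(ByRef data() As Byte) As String
    Dim hProv As Long, hHash As Long
    Dim digest(0 To 15) As Byte, digestLen As Long, i As Long

    If CryptAcquireContext(hProv, vbNullString, vbNullString, _
                           PROV_RSA_FULL, CRYPT_VERIFYCONTEXT) = 0 Then Exit Function
    If CryptCreateHash(hProv, CALG_MD5, 0, 0, hHash) <> 0 Then
        ' For a large file, CryptHashData can be called once per block read.
        If CryptHashData(hHash, data(LBound(data)), _
                         UBound(data) - LBound(data) + 1, 0) <> 0 Then
            digestLen = 16               ' MD5 digests are 128 bits
            If CryptGetHashParam(hHash, HP_HASHVAL, digest(0), digestLen, 0) <> 0 Then
                For i = 0 To 15
                    MD5Hex = MD5Hex & Right$("0" & Hex$(digest(i)), 2)
                Next i
            End If
        End If
        CryptDestroyHash hHash
    End If
    CryptReleaseContext hProv, 0
End Function
```

Equal MD5 strings make accidental duplicates extremely likely but are not proof, since MD5 collisions can be constructed on purpose. If certainty matters, the final content comparison still has a place.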
