I'm attempting to locate duplicate content by deriving, for each file, a hex string that can be sorted and then compared to other such strings. In the old days I'd have called this a hash total, but I can't find anybody else using the term (hence no FAQ found).
To get right to the question: Am I reinventing the Oxcart while the rest of you know some Windows API faster than a speeding bullet? If not, any suggestions to speed this thing up? And what should I be calling this scheme?
See fragment below.
Stoke your imagination with a vast proliferation of JPG files: some with endlessly duplicated camera-generated names, others copied to other folders, copied with a rename or a date change, or otherwise disguised by innocent users armed with loaded weapons.
Logic of the whole deal: sort on byte count plus hash string, then test only the files that match on both for identical content. That vastly simplifies the one-to-many compares.
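The "test for identical content" part is just a straight byte-for-byte compare of any two files that land in the same byte-count/hash bucket. A rough sketch (the 64 KB chunk size is arbitrary):

Code:
'Sketch of the confirmation step: byte-for-byte compare of two candidate files.
Function FilesAreIdentical(ByVal pathA As String, ByVal pathB As String) As Boolean
    Dim fA As Integer, fB As Integer
    Dim bufA() As Byte, bufB() As Byte
    Dim remaining As Long, chunk As Long, i As Long

    If FileLen(pathA) <> FileLen(pathB) Then Exit Function  'different sizes: not duplicates

    fA = FreeFile
    Open pathA For Binary Access Read Shared As #fA
    fB = FreeFile
    Open pathB For Binary Access Read Shared As #fB

    remaining = LOF(fA)
    FilesAreIdentical = True
    Do While remaining > 0 And FilesAreIdentical
        chunk = remaining
        If chunk > 65536 Then chunk = 65536              'compare in 64 KB pieces
        ReDim bufA(1 To chunk)
        ReDim bufB(1 To chunk)
        Get #fA, , bufA
        Get #fB, , bufB
        For i = 1 To chunk
            If bufA(i) <> bufB(i) Then
                FilesAreIdentical = False                'first mismatch ends the compare
                Exit For
            End If
        Next i
        remaining = remaining - chunk
    Loop

    Close #fA
    Close #fB
End Function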
Have produced good results using a big buffer, XORing the 2nd through nth reads into the first, then folding that string down to some shorter length (say 16 bytes) by XORing the second 16-byte group through the nth into the first 16. It can detect a single-character change in a 10 MB file (reliability is not proven, hence the actual content comparison in the scheme).
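The fold-down and hex formatting step (not shown in the fragment below) comes out something like this sketch, working on the accumulated buffer string:

Code:
'Sketch of the fold: collapse the accumulator string down to 16 bytes by XOR,
'then format those bytes as a fixed-width hex string so the results sort cleanly.
Function FoldToHex(ByVal xorStr As String) As String
    Dim folded(0 To 15) As Byte
    Dim i As Long
    For i = 1 To Len(xorStr)
        folded((i - 1) Mod 16) = folded((i - 1) Mod 16) Xor Asc(Mid$(xorStr, i, 1))
    Next i
    For i = 0 To 15
        'Right$("0" & ...) pads to two characters per byte; Hex$ alone drops
        'leading zeros and would wreck the sort order
        FoldToHex = FoldToHex & Right$("0" & Hex$(folded(i)), 2)
    Next i
End Function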
BUT ----
Execution of my routine is SLOW: about 13 seconds for a ~10 MB file on a 2.4 GHz Celeron, with the file itself already cached.
This technique will be valuable (and much faster) even if I stick to the first one or two buffers. But I'd like to get a fast hashing technique that can be proven failsafe so I can find duplicates externally (without diving back into the files in some frenzied each-to-many bit comparo drill).
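To make the question concrete: MD5 through the CryptoAPI in advapi32 is the sort of Windows API route I'm picturing. The following is an untested sketch, declares and constants written from memory, posted only to show the shape of it; my current routine is the fragment further below:

Code:
'Untested sketch: MD5 of one file through the Windows CryptoAPI (advapi32.dll).
'Declares and constants from memory; belongs in a standard module.
Private Declare Function CryptAcquireContext Lib "advapi32.dll" Alias "CryptAcquireContextA" _
    (phProv As Long, ByVal pszContainer As String, ByVal pszProvider As String, _
     ByVal dwProvType As Long, ByVal dwFlags As Long) As Long
Private Declare Function CryptCreateHash Lib "advapi32.dll" _
    (ByVal hProv As Long, ByVal AlgId As Long, ByVal hKey As Long, _
     ByVal dwFlags As Long, phHash As Long) As Long
Private Declare Function CryptHashData Lib "advapi32.dll" _
    (ByVal hHash As Long, pbData As Any, ByVal dwDataLen As Long, ByVal dwFlags As Long) As Long
Private Declare Function CryptGetHashParam Lib "advapi32.dll" _
    (ByVal hHash As Long, ByVal dwParam As Long, pbData As Any, _
     pdwDataLen As Long, ByVal dwFlags As Long) As Long
Private Declare Function CryptDestroyHash Lib "advapi32.dll" (ByVal hHash As Long) As Long
Private Declare Function CryptReleaseContext Lib "advapi32.dll" _
    (ByVal hProv As Long, ByVal dwFlags As Long) As Long

Private Const PROV_RSA_FULL As Long = 1
Private Const CRYPT_VERIFYCONTEXT As Long = &HF0000000
Private Const CALG_MD5 As Long = &H8003&
Private Const HP_HASHVAL As Long = 2

Public Function FileMD5Hex(ByVal fPath As String) As String
    Dim hProv As Long, hHash As Long
    Dim buf() As Byte, hashVal(0 To 15) As Byte
    Dim f As Integer, bytesLeft As Long, chunk As Long
    Dim hashLen As Long, i As Long

    If CryptAcquireContext(hProv, vbNullString, vbNullString, _
                           PROV_RSA_FULL, CRYPT_VERIFYCONTEXT) = 0 Then Exit Function
    If CryptCreateHash(hProv, CALG_MD5, 0, 0, hHash) = 0 Then GoTo CleanProv

    f = FreeFile
    Open fPath For Binary Access Read Shared As #f
    bytesLeft = LOF(f)
    Do While bytesLeft > 0
        chunk = bytesLeft
        If chunk > 65536 Then chunk = 65536            'hash in 64 KB pieces
        ReDim buf(1 To chunk)
        Get #f, , buf
        If CryptHashData(hHash, buf(1), chunk, 0) = 0 Then Exit Do
        bytesLeft = bytesLeft - chunk
    Loop
    Close #f

    hashLen = 16
    If CryptGetHashParam(hHash, HP_HASHVAL, hashVal(0), hashLen, 0) <> 0 Then
        For i = 0 To hashLen - 1
            FileMD5Hex = FileMD5Hex & Right$("0" & Hex$(hashVal(i)), 2)
        Next i
    End If

    CryptDestroyHash hHash
CleanProv:
    CryptReleaseContext hProv, 0
End Function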
Code:
'code to safe-open f1 for Binary Access Read Shared goes here
getBuf = String(getBufL, &H0)               'read buffer, getBufL bytes of Chr$(0)
xorStr = getBuf                             'accumulator, same length as the buffer
Do While LOF(f1) > Loc(f1)                  'Loc advances with each Get
    'code to shorten getBuf for the final block goes here
    Get #f1, , getBuf                       'read the next block
    'This is the guilty code: elapsed time 13+ sec; comment it out and it drops to ~0.02
    For i = 1 To Len(getBuf)                'piggy loop, ~1.3 sec/MB
        X = Asc(Mid(xorStr, i, 1)) Xor Asc(Mid(getBuf, i, 1))
        Mid(xorStr, i, 1) = Chr$(X)         'XOR this block into the accumulator
    Next i
Loop
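For comparison, here is the same accumulation reworked to read into a Byte array (reusing f1 and getBufL from the fragment above). It sidesteps the per-character Mid/Asc/Chr$ calls that eat the 13 seconds; a sketch only, not timed:

Code:
'Sketch: same XOR accumulation, but Get reads straight into a Byte array,
'so there are no Mid/Asc/Chr$ calls per character.
Dim getBytes() As Byte, xorBytes() As Byte
Dim i As Long, n As Long

ReDim xorBytes(1 To getBufL)                     'accumulator starts as all zeros
Do While LOF(f1) > Loc(f1)
    n = getBufL
    If LOF(f1) - Loc(f1) < n Then n = LOF(f1) - Loc(f1)  'shorten for the final block
    ReDim getBytes(1 To n)
    Get #f1, , getBytes                          'read the next block into the array
    For i = 1 To n
        xorBytes(i) = xorBytes(i) Xor getBytes(i)  'XOR this block into the accumulator
    Next i
Loop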
BTW: doing this in Access with VBA, using recordsets to handle potentially huge directory and file lists -- open to suggestion...
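For the Access side, once byte count and hash string are stored per file, a grouping query can surface the candidate duplicate sets without touching the files again. A sketch against a hypothetical tblFiles(FilePath, ByteCount, HashStr) table:

Code:
'Sketch: pull only the byte-count/hash combinations that occur more than once.
'tblFiles(FilePath, ByteCount, HashStr) is a made-up table name and layout.
Dim db As DAO.Database, rs As DAO.Recordset, sql As String
Set db = CurrentDb
sql = "SELECT ByteCount, HashStr, Count(*) AS Dupes " & _
      "FROM tblFiles " & _
      "GROUP BY ByteCount, HashStr " & _
      "HAVING Count(*) > 1;"
Set rs = db.OpenRecordset(sql)
Do While Not rs.EOF
    Debug.Print rs!ByteCount, rs!HashStr, rs!Dupes   'each row = one candidate duplicate set
    rs.MoveNext
Loop
rs.Close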