How to find duplicate documents

gusset · Aug 19, 2003

i am building a document management system. one of the fields contains the OCRd text of the document.

i have been asked whether there is a way to find duplicate documents. now, we all know that OCR will make AT LEAST one error (to our eyes) in transcribing a scanned image into text. so simply comparing them will not work.

is there a clever way i have not thought of? the best i have come up with so far is to test a few randomly-selected substrings of, say, 20 words or so in each record to see if they are matched in any of the other records. this is not foolproof. nothing will be.
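A minimal sketch of the random-substring test described above, in Python. The function names, the 20-word window, and the sample count are illustrative assumptions, not part of any existing system. Text is normalized to lowercase letters and digits first, so punctuation and spacing differences between two scans do not count as mismatches.

```python
import random
import re


def normalize(text):
    """Lowercase the text and split it into runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sample_windows(words, window=20, samples=5, rng=None):
    """Pick up to `samples` random runs of `window` consecutive words."""
    rng = rng or random.Random()
    if len(words) <= window:
        # Document shorter than one window: use the whole text once.
        return [" ".join(words)] if words else []
    n_starts = len(words) - window + 1
    starts = rng.sample(range(n_starts), min(samples, n_starts))
    return [" ".join(words[s:s + window]) for s in starts]


def probably_duplicate(text_a, text_b, window=20, samples=5, rng=None):
    """True if any sampled window from A occurs word-for-word in B."""
    # Pad with spaces so a match must start and end on word boundaries.
    haystack = " " + " ".join(normalize(text_b)) + " "
    windows = sample_windows(normalize(text_a), window, samples, rng)
    return any(" " + w + " " in haystack for w in windows)
```

A single OCR error only spoils the windows that overlap it, so taking several samples per document makes a false "no match" less likely. It never rules one out entirely.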

but is this the best i can expect?

thanks

g

lebisol · Aug 22, 2003

Hm... nice idea, although you noted the pitfalls ;-)

The first thing that came to my mind is trying to simulate the regular copy-paste [replace] interaction you get on any Windows platform. Perhaps this:
- your approach of sequential words (a spellcheck/grammar-style pass)
- also comparing the files' true size in bytes, along with storing file names in the DB
Obviously, if you re-scan your docs and find that the new document is a bit "larger" because the second scan picked up more words, then you will be able to note the difference.
the DB
==========================================================
#  [file_name]    [doc_type]  [doc_size]  [date_created]
----------------------------------------------------------
1  My Books       .txt        1050 Kb     1/1/2002
2  Return of 69   .pdf        5003 Kb     2/2/2003
3  ....
4  ....
==========================================================

So I guess it would be a "search & upload" project: you would first search for the file name, extension/type and the time it was created, compare them, and replace appropriately.
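The metadata check can be sketched roughly like this in Python. The field names follow the table above (with the size stored as a number of Kb). The 5% size tolerance and the "My Books rescan" record are made-up assumptions for illustration only.

```python
from itertools import combinations

# Sample rows; the third one is a hypothetical re-scan added for the example.
records = [
    {"file_name": "My Books", "doc_type": ".txt", "doc_size_kb": 1050},
    {"file_name": "Return of 69", "doc_type": ".pdf", "doc_size_kb": 5003},
    {"file_name": "My Books rescan", "doc_type": ".txt", "doc_size_kb": 1062},
]


def candidate_pairs(records, size_tolerance=0.05):
    """Return name pairs with the same type and sizes within the tolerance."""
    pairs = []
    for a, b in combinations(records, 2):
        if a["doc_type"] != b["doc_type"]:
            continue
        larger = max(a["doc_size_kb"], b["doc_size_kb"])
        smaller = min(a["doc_size_kb"], b["doc_size_kb"])
        # Relative size difference, guarding against two empty files.
        if larger == 0 or (larger - smaller) / larger <= size_tolerance:
            pairs.append((a["file_name"], b["file_name"]))
    return pairs


print(candidate_pairs(records))
```

This only narrows down which pairs are worth a content comparison. Two unrelated documents can easily have the same type and a similar size.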
Now, if you are looking for a "content match" of documents A and B, then your approach is the way to go, although comparing tens of thousands of words can be a load for your application.
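One way to keep that load down is to swap in a different technique from the random-substring test: break each document into overlapping runs of words ("shingles") once, then compare the resulting sets. The sketch below is only an illustration. The 5-word shingle size is an arbitrary assumption, and any "duplicate" cutoff on the score would have to be tuned on real scans.

```python
import re


def shingles(text, size=5):
    """Return the set of overlapping `size`-word runs in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        # Too short for a full shingle: treat the whole text as one.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(text_a, text_b, size=5):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # two empty documents are trivially identical
    return len(a & b) / len(a | b)
```

An OCR error in one word only changes the few shingles that contain it, so two scans of the same page still score close to 1.0, while unrelated documents score near 0.0.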
I hope this gives you some ideas!
All the best!

> need more info?
:: don't click HERE ::

gusset · Aug 22, 2003

thanks a lot.

only one complaint: i didn't click on "::don't click HERE::"...and nothing happened

seriously, i am only talking about comparing documents. it has to be done only once, so it's not onerous at all.

i hadn't thought of comparing filesizes, so thanks! that's another useful indicator.

all the best

g

lebisol · Aug 22, 2003

hehehe

I just don't like the "mywebsite.com" signatures since they lead to no useful info with regard to the post! Direct URLs do ;-)
good luck with the project!

