How to find duplicate documents

Status
Not open for further replies.

gusset

Technical User
Mar 19, 2002
251
GB
i am building a document management system. one of the fields contains the OCR'd text of each document.

i have been asked whether there is a way to find duplicate documents. now, we all know that OCR will make AT LEAST one error (to our eyes) in transcribing a scanned image into text. so simply comparing them will not work.

is there a clever way i have not thought of? the best i have come up with so far is to test a few randomly-selected substrings of, say, 20 words or so in each record to see if they are matched in any of the other records. this is not foolproof. nothing will be.

but is this the best i can expect?
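
to make the idea concrete, here is a minimal sketch of the substring test in Python. it assumes the OCR text is already loaded into a dict keyed by record id; the function names, the 5 samples per document and the 20-word window are all placeholders of mine, not anything taken from a real system.

```python
import random
import re


def normalize(text):
    """Lowercase the text and keep only runs of letters/digits, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def sample_windows(text, n_samples=5, window=20, seed=0):
    """Pick up to n_samples distinct runs of `window` consecutive words from the text."""
    words = normalize(text).split()
    if not words:
        return []
    if len(words) <= window:
        return [" ".join(words)]
    n_starts = len(words) - window + 1
    rng = random.Random(seed)  # fixed seed so repeated runs pick the same windows
    starts = rng.sample(range(n_starts), min(n_samples, n_starts))
    return [" ".join(words[s:s + window]) for s in starts]


def likely_duplicates(docs, n_samples=5, window=20, min_hits=1, seed=0):
    """docs maps record id -> OCR text.

    Returns (id_a, id_b, hits) for every ordered pair where at least
    min_hits of the windows sampled from id_a also occur in id_b.
    """
    # Pad with spaces so a window can only match on whole-word boundaries.
    padded = {doc_id: " " + normalize(text) + " " for doc_id, text in docs.items()}
    results = []
    for id_a, text_a in docs.items():
        samples = sample_windows(text_a, n_samples, window, seed)
        for id_b, body_b in padded.items():
            if id_a == id_b:
                continue
            hits = sum(1 for s in samples if " " + s + " " in body_b)
            if hits >= min_hits:
                results.append((id_a, id_b, hits))
    return results
```

a single OCR error only spoils the windows that happen to cover it, so the remaining samples still match. every document is compared against every other, which is slow on a large table but tolerable for a one-off run.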

thanks

g
 
hm... nice idea, although you noted the pitfalls ;-)

The first thing that came to my mind is trying to simulate the regular copy-paste "replace" prompt you get on any Windows platform. Perhaps this:
- your approach of comparing sequences of words (a spellcheck/grammar-style pass)
- also comparing the files' true size in bytes, along with storing the file names in the DB
Obviously, if you re-scan your docs and find that the new document is a bit "larger" because the second scan picked up more words, then you will be able to note the difference...
the DB
==============================================================
[#]  [file_name]    [doc_type]  [doc_size]  [date_created]
--------------------------------------------------------------
1    My Books       .txt        1050 Kb     1/1/2002
2    Return of 69   .pdf        5003 Kb     2/2/2003
3    ....
4    ....
==============================================================

So I guess it would be a "search & upload" project: you would first search for the file name, extension/type and the time it was created, compare them, and replace as appropriate.
Now, if you are looking for a "content match" between documents A and B, then your approach is the way to go, although comparing tens of thousands of words can be a load on your application...
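
One way to keep that load down is to use the size column as a cheap first pass and only run the content comparison on records whose sizes are close. A rough sketch in Python, assuming each record is a (name, size in bytes) pair; the function name and the 5% tolerance are made-up placeholders, and size alone proves nothing, it only narrows the list of candidates:

```python
def size_candidates(records, tolerance=0.05):
    """Return pairs of names whose sizes differ by at most `tolerance`
    (as a fraction of the larger size).

    records is a list of (name, size_in_bytes) tuples.
    """
    ordered = sorted(records, key=lambda r: r[1])
    pairs = []
    for i, (name_a, size_a) in enumerate(ordered):
        for name_b, size_b in ordered[i + 1:]:
            # Sizes are ascending, so once one is too large all later ones are too.
            if size_b - size_a > tolerance * size_b:
                break
            pairs.append((name_a, name_b))
    return pairs


if __name__ == "__main__":
    rows = [("scan_a", 1000), ("scan_b", 1030), ("scan_c", 5000)]
    print(size_candidates(rows))
```

Only the pairs that come back would then go through the word-sequence check.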
I hope this gives you some ideas!
All the best!

> need more info?
:: don't click HERE ::
 
thanks a lot.

only one complaint: i didn't click on "::don't click HERE::"...and nothing happened :)

seriously, i am only talking about comparing documents. it has to be done only once, so it's not onerous at all.

i hadn't thought of comparing filesizes, so thanks! that's another useful indicator.

all the best

g
 
hehehe :)
I just don't like the "mywebsite.com" signatures, since they lead to no useful info with regard to the post! Direct URLs do ;-)
good luck with the project!

> need more info?
:: don't click HERE ::
 