i am building a document management system. one of the fields contains the OCR'd text of each document.
i have been asked whether there is a way to find duplicate documents. now, we all know that OCR will make AT LEAST one error (to our eyes) in transcribing a scanned image into text, so two scans of the same document will rarely produce identical text. simply comparing the text fields for equality will not work.
is there a clever way i have not thought of? the best i have come up with so far is to take a few randomly-selected substrings of, say, 20 words or so from each record and test whether they also appear in any of the other records. this is not foolproof. nothing will be.
but is this the best i can expect?
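
for reference, here is a minimal sketch of the substring test described above. it is only an illustration under stated assumptions: the records are assumed to be available as a dict mapping a record id to its OCR text, and every name in it (normalize, sample_windows, likely_duplicates, window, samples, min_hits) is made up for this example rather than taken from any existing system. it compares every record against every other record, so it would only be practical as-is on a small collection.

```python
import random
import re


def normalize(text):
    """Lowercase the text and keep only runs of letters/digits, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def sample_windows(text, window=20, samples=5, rng=None):
    """Pick `samples` random runs of `window` consecutive words from `text`."""
    rng = rng or random.Random()
    words = normalize(text).split()
    if len(words) <= window:
        # Short record: the whole text is the only probe available.
        return [" ".join(words)] if words else []
    starts = [rng.randrange(len(words) - window + 1) for _ in range(samples)]
    return [" ".join(words[s:s + window]) for s in starts]


def likely_duplicates(records, window=20, samples=5, min_hits=1, seed=0):
    """Return (id_a, id_b, hits) where sampled word runs from record a occur in record b.

    `records` is assumed to be a dict of {record_id: ocr_text} (hypothetical layout).
    """
    rng = random.Random(seed)
    # Pad with spaces so a probe can only match on whole-word boundaries.
    padded = {rid: f" {normalize(text)} " for rid, text in records.items()}
    matches = []
    for id_a, text_a in records.items():
        probes = sample_windows(text_a, window, samples, rng)
        for id_b, text_b in padded.items():
            if id_a == id_b:
                continue
            hits = sum(1 for p in probes if f" {p} " in text_b)
            if hits >= min_hits:
                matches.append((id_a, id_b, hits))
    return matches
```

a probe that happens to cover an OCR error will miss, which is why several probes are drawn per record and a pair is reported as soon as min_hits of them match. raising min_hits trades missed duplicates for fewer false matches on shared boilerplate text.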
thanks
g