
Efficient searching algorithm

Status
Not open for further replies.

markknowsley

Programmer
Aug 30, 2005
152
GB
I want to write an algorithm that will search through a list of documents held in a database and pick out those documents that are duplicated once or more, and show the location of each copy of the document.

Here's the process I normally follow for such things with a few hundred rows:

1. Create a view of all the documents in the db.
2. Connect to the database and fill a dataset with the list of documents using SqlDataAdapter
3. Use a pair of ForEach statements, like this:

Code:
foreach (DataRow rowDocuments in dsDocuments.Tables[0].Rows)
{
    foreach (DataRow rowDuplicate in dsDocuments.Tables[0].Rows)
    {
        // skip comparing a row against itself, and compare values with Equals -
        // == on two boxed column values compares object references, not values
        if (!ReferenceEquals(rowDuplicate, rowDocuments)
            && rowDuplicate[0].Equals(rowDocuments[0]))
        {
            //chuck the duplicate into a separate DataTable
        }
    }
}

I'm looking at a database that contains over 600,000 documents. If my maths is right that's 600,000² comparisons - err... (frantic button pushing) about 360 billion.

Does anyone have any tips, or can anyone recommend any resources, on how to make this process more efficient?
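If you do have to stay in memory, one way to cut the nested loops down from O(n²) to roughly O(n) is a single pass with a Dictionary keyed on the column being compared. A minimal sketch - `DuplicateFinder` and `keyColumn` are just illustrative names, and it assumes the duplicate key is a single column:

```csharp
using System.Collections.Generic;
using System.Data;

public class DuplicateFinder
{
    // One pass over the table, grouping rows by the key column in a
    // Dictionary. Returns only the keys that appear more than once,
    // together with every row carrying that key.
    public static Dictionary<object, List<DataRow>> FindDuplicates(
        DataTable table, string keyColumn)
    {
        Dictionary<object, List<DataRow>> groups =
            new Dictionary<object, List<DataRow>>();
        foreach (DataRow row in table.Rows)
        {
            object key = row[keyColumn];
            List<DataRow> rows;
            if (!groups.TryGetValue(key, out rows))
            {
                rows = new List<DataRow>();
                groups.Add(key, rows);
            }
            rows.Add(row);
        }

        // keep only the keys that occurred more than once
        Dictionary<object, List<DataRow>> duplicates =
            new Dictionary<object, List<DataRow>>();
        foreach (KeyValuePair<object, List<DataRow>> pair in groups)
        {
            if (pair.Value.Count > 1)
                duplicates.Add(pair.Key, pair.Value);
        }
        return duplicates;
    }
}
```

Each row is touched once on the grouping pass, so 600,000 rows means 600,000 dictionary lookups rather than 360 billion comparisons.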
 
What database are you using? I think this would be better handled as a set-based problem on that side.

Ignorance of certain subjects is a great part of wisdom
 
According to your code above, documents are duplicated if all the fields match, correct?

If so, then a SQL query like this would work.
Code:
select field_1, field_2, field_3, count(pk_field) as number_of_dups
from   document
group by field_1, field_2, field_3
having count(pk_field) > 1
This will return fields 1-3 and the number of times each combination is duplicated. Then, to get the specific records that are duplicated:
Code:
select d.pk_field, d.field_1, d.field_2, d.field_3[, field_n]
from   document d inner join (
               select field_1, field_2, field_3, count(pk_field) as number_of_dups
               from   document
               group by field_1, field_2, field_3
               having count(pk_field) > 1
       ) t on d.field_1 = t.field_1 and d.field_2 = t.field_2 and d.field_3 = t.field_3
order by d.field_1, d.field_2, d.field_3[, field_n]
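On the C# side, the whole duplicate-finding step then collapses to filling a DataSet with that query. A sketch using SqlDataAdapter as in the original post - the field names are the placeholders from the query above, and the connection string is left for you to supply:

```csharp
using System.Data;
using System.Data.SqlClient;

public class DuplicateQuery
{
    public static DataSet LoadDuplicates(string connectionString)
    {
        // the duplicate-detection query from above; field names are placeholders
        string sql =
            "select d.pk_field, d.field_1, d.field_2, d.field_3 " +
            "from document d inner join ( " +
            "    select field_1, field_2, field_3 " +
            "    from document " +
            "    group by field_1, field_2, field_3 " +
            "    having count(pk_field) > 1 " +
            ") t on d.field_1 = t.field_1 and d.field_2 = t.field_2 " +
            "   and d.field_3 = t.field_3 " +
            "order by d.field_1, d.field_2, d.field_3";

        DataSet dsDuplicates = new DataSet();
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            SqlDataAdapter adapter = new SqlDataAdapter(sql, conn);
            adapter.Fill(dsDuplicates); // Fill opens and closes the connection itself
        }
        return dsDuplicates;
    }
}
```

This way the database does the grouping with its indexes, and only the duplicated rows ever cross the wire to the application.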

Jason Meckley
Programmer
Specialty Bakers, Inc.
 