
Efficient searching algorithm

Status
Not open for further replies.

markknowsley

Programmer
Aug 30, 2005
152
GB
I want to write an algorithm that will search through a list of documents held in a database and pick out those documents that are duplicated once or more, and show the location of each copy of the document.

Here's the process I normally follow for such things with a few hundred rows:

1. Create a view of all the documents in the db.
2. Connect to the database and fill a dataset with the list of documents using SqlDataAdapter
3. Use a pair of ForEach statements, like this:

Code:
foreach (DataRow rowDocuments in dsDocuments.Tables[0].Rows)
{
    foreach (DataRow rowDuplicate in dsDocuments.Tables[0].Rows)
    {
        // skip comparing a row against itself, and compare values with Equals -
        // == on two boxed column values compares object references, not values
        if (!ReferenceEquals(rowDuplicate, rowDocuments)
            && rowDuplicate[0].Equals(rowDocuments[0]))
        {
            //chuck the duplicate into a separate DataTable
        }
    }
}

I'm looking at a database that contains over 600,000 documents. If my maths is right that's 600,000² comparisons - err... (frantic button pushing) about 360 billion.

Does anyone have any tips, or can anyone recommend any resources, on how to make this process more efficient?
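If you do have to stay in memory, one way to cut the nested loops down from O(n²) to roughly O(n) is a single pass with a Dictionary keyed on the column being compared. A minimal sketch - `DuplicateFinder` and `keyColumn` are just illustrative names, and it assumes the duplicate key is a single column:

```csharp
using System.Collections.Generic;
using System.Data;

public class DuplicateFinder
{
    // One pass over the table, grouping rows by the key column in a
    // Dictionary. Returns only the keys that appear more than once,
    // together with every row carrying that key.
    public static Dictionary<object, List<DataRow>> FindDuplicates(
        DataTable table, string keyColumn)
    {
        Dictionary<object, List<DataRow>> groups =
            new Dictionary<object, List<DataRow>>();
        foreach (DataRow row in table.Rows)
        {
            object key = row[keyColumn];
            List<DataRow> rows;
            if (!groups.TryGetValue(key, out rows))
            {
                rows = new List<DataRow>();
                groups.Add(key, rows);
            }
            rows.Add(row);
        }

        // keep only the keys that occurred more than once
        Dictionary<object, List<DataRow>> duplicates =
            new Dictionary<object, List<DataRow>>();
        foreach (KeyValuePair<object, List<DataRow>> pair in groups)
        {
            if (pair.Value.Count > 1)
                duplicates.Add(pair.Key, pair.Value);
        }
        return duplicates;
    }
}
```

Each row is touched once on the grouping pass, so 600,000 rows means 600,000 dictionary lookups rather than 360 billion comparisons.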
 
What database are you using? I think this would be better handled as a set-based problem on that side.

Ignorance of certain subjects is a great part of wisdom
 
According to your code above, documents are duplicated if all the fields match, correct?

If so, then a SQL query like this would work.
Code:
select field_1, field_2, field_3, count(pk_field) as number_of_dups
from   document
group by field_1, field_2, field_3
having count(pk_field) > 1
This will return fields 1-3 and the number of times each combination is duplicated. Then, to get the specific records that are duplicated:
Code:
select d.pk_field, d.field_1, d.field_2, d.field_3[, field_n]
from   document d inner join (
               select field_1, field_2, field_3, count(pk_field) as number_of_dups
               from   document
               group by field_1, field_2, field_3
               having count(pk_field) > 1
       ) t on d.field_1 = t.field_1 and d.field_2 = t.field_2 and d.field_3 = t.field_3
order by d.field_1, d.field_2, d.field_3[, field_n]
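On the C# side, the whole duplicate-finding step then collapses to filling a DataSet with that query. A sketch using SqlDataAdapter as in the original post - the field names are the placeholders from the query above, and the connection string is left for you to supply:

```csharp
using System.Data;
using System.Data.SqlClient;

public class DuplicateQuery
{
    public static DataSet LoadDuplicates(string connectionString)
    {
        // the duplicate-detection query from above; field names are placeholders
        string sql =
            "select d.pk_field, d.field_1, d.field_2, d.field_3 " +
            "from document d inner join ( " +
            "    select field_1, field_2, field_3 " +
            "    from document " +
            "    group by field_1, field_2, field_3 " +
            "    having count(pk_field) > 1 " +
            ") t on d.field_1 = t.field_1 and d.field_2 = t.field_2 " +
            "   and d.field_3 = t.field_3 " +
            "order by d.field_1, d.field_2, d.field_3";

        DataSet dsDuplicates = new DataSet();
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            SqlDataAdapter adapter = new SqlDataAdapter(sql, conn);
            adapter.Fill(dsDuplicates); // Fill opens and closes the connection itself
        }
        return dsDuplicates;
    }
}
```

This way the database does the grouping with its indexes, and only the duplicated rows ever cross the wire to the application.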

Jason Meckley
Programmer
Specialty Bakers, Inc.
 