
Fastest way to check 50,000 records against a DB?


TheLe (Programmer) - Aug 2, 2006
Hello, I have a VB.NET optimization question.

I am reading some 50,000+ records from a file, and each record contains a unique PersonID (numeric) along with other data.

Before I process the record, I need to check that personID against the database to make sure it exists in our system. If it does, then I proceed. If it does not, I go to the next record in the file.

Currently I am doing a SELECT statement for each record I have in the file. If I get a record, then I know it exists.

Is there a faster and/or a more efficient way to check and see if that PersonID exists in the database?

Assume a table called FOO with a single field called PersonID.

Thank you,

`Le
 
I would send the entire list to the database, check to see if the ID's exist in there, remove any invalid ones, then send a revised list back to the client. This would cut out a lot of trips to the db.
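
Something along these lines, purely as a sketch - FilterExistingIds, the batch size, and the SQL Server / already-open connection are placeholders of mine, not anything from the post; FOO and PersonID are the names given above:

Code:
Imports System.Collections.Generic
Imports System.Data.SqlClient
Imports System.Text

Module IdFilter
  ' Returns the subset of candidateIds that actually exist in FOO.PersonID.
  ' Assumes conn is an open SqlConnection and PersonID is numeric.
  Public Function FilterExistingIds(ByVal candidateIds As List(Of Integer), _
                                    ByVal conn As SqlConnection) As List(Of Integer)
    Dim existing As New List(Of Integer)
    Const batchSize As Integer = 1000   ' keep each IN (...) list a manageable size

    For offset As Integer = 0 To candidateIds.Count - 1 Step batchSize
      Dim count As Integer = Math.Min(batchSize, candidateIds.Count - offset)
      Dim batch As List(Of Integer) = candidateIds.GetRange(offset, count)

      ' Numeric IDs, so building the IN list as text is safe here.
      Dim ids As New StringBuilder
      For Each id As Integer In batch
        If ids.Length > 0 Then ids.Append(",")
        ids.Append(id)
      Next

      Dim cmd As New SqlCommand( _
        "SELECT PersonID FROM FOO WHERE PersonID IN (" & ids.ToString() & ")", conn)
      Using reader As SqlDataReader = cmd.ExecuteReader()
        While reader.Read()
          existing.Add(reader.GetInt32(0))
        End While
      End Using
    Next

    Return existing
  End Function
End Module

Batching keeps each IN list short while still cutting the round trips from 50,000 down to a few dozen.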

Hope this helps,

Alex

[small]----signature below----[/small]
I don't do any programming whatsoever

Ignorance of certain subjects is a great part of wisdom
 
If FOO isn't too large, read the whole table into a sorted array and use a binary search for the lookup.
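
A rough sketch of that, assuming SQL Server and the FOO/PersonID names from the question (the function names are just made up for illustration):

Code:
Imports System.Collections.Generic
Imports System.Data.SqlClient

Module SortedIdLookup
  ' Pull every PersonID back already sorted by the database.
  Public Function LoadSortedIds(ByVal conn As SqlConnection) As Integer()
    Dim ids As New List(Of Integer)
    Dim cmd As New SqlCommand("SELECT PersonID FROM FOO ORDER BY PersonID", conn)
    Using reader As SqlDataReader = cmd.ExecuteReader()
      While reader.Read()
        ids.Add(reader.GetInt32(0))
      End While
    End Using
    Return ids.ToArray()
  End Function

  ' Array.BinarySearch returns a negative number when the value is not found.
  Public Function IdExists(ByVal sortedIds As Integer(), ByVal personId As Integer) As Boolean
    Return Array.BinarySearch(sortedIds, personId) >= 0
  End Function
End Module

Load the array once, then call IdExists for each record in the file.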

"I think we're all Bozos on this bus!" - Firesign Theatre [jester]
 
Thank you for your replies.

AlexCuse, I can't easily gather all the IDs up front to send to the database. That would require me to read through the entire file first to collect the IDs, send them to the database, and then read through the file again to process the records one at a time (remember, each ID comes with dozens of other bits of data that I need for the file processing).

ArtieChoke, I considered that, or perhaps reading it all into a local VB dataset and verifying against that. Unfortunately the table FOO already has 500,000 records, and it will grow very quickly once we go live.

Thanks for all the suggestions so far!

`Le

 
What are you doing with the file entries once this validation is performed?

I still think that a set-based solution on the database side is going to be your best bet for filtering out the IDs that don't exist in your system, but you have not given enough information for anyone to say how you should proceed. You may even be able to implement a solution entirely in the database.

So, what are you doing here (besides checking the rows against the database)?

Hope this helps,

Alex

[small]----signature below----[/small]
I don't do any programming whatsoever

Ignorance of certain subjects is a great part of wisdom
 
I take it that you are using something like a For ... Next loop to find each PersonID, then retrieve info from other tables?

Are you doing this for every person sequentially in the db?

Point 1:
Use "SELECT * FROM PersonsTbl WHERE PersonID > '" & LastId & "'"
This would save you searching for records that do not exist.

Point 2:
Why exactly are you needing to go through every Person in one go? Could you explain what you are trying to achieve?
 
My first attempt would be to pull back a sorted list of IDs from the database (select ID from Table order by ID asc) and move the results into a hash table.

Then I would throw together a binary search function that accepts an ID parameter and returns a boolean indicating if the record was found.

With those two chunks of code in place, you can call the function that hits the database and populates the hash table once, then call the binary search function for each record in the input file. Since everything is in memory you should have no IO issues (other than reading the input file) slowing you down. Loading all of the IDs in memory will take a chunk of memory though, and if it winds up using the swap file, it'll be a big performance hit. And the binary search should be the most efficient way of finding the right value.

-Rick

VB.Net Forum forum796 forum855 ASP.NET Forum
[monkey]I believe in killer coding ninja monkeys.[monkey]
 
Why not read the entire file into a temp table, then
select * from temptable t, FOO f where t.ID = f.ID
 
I'd probably go with a Bulk Insert to a staging table in the database. You can then run a query directly against the main table and the staging table and work with the records that match.
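
Roughly like this, assuming SQL Server 2005, a staging table named FOO_Staging(PersonID int), and that the file has already been parsed into a DataTable whose column names match the staging table - all of those names are placeholders, not something from the post:

Code:
Imports System.Data
Imports System.Data.SqlClient

Module StagingLoad
  ' Bulk-copy the parsed file rows into a staging table, then let the database
  ' return only the PersonIDs that also exist in FOO. Assumes conn is open.
  Public Sub LoadAndMatch(ByVal fileRows As DataTable, ByVal conn As SqlConnection)
    ' 1. Bulk insert the file contents into the staging table.
    Using bulk As New SqlBulkCopy(conn)
      bulk.DestinationTableName = "FOO_Staging"
      bulk.WriteToServer(fileRows)
    End Using

    ' 2. One set-based query instead of 50,000 single-row SELECTs.
    Dim cmd As New SqlCommand( _
      "SELECT s.PersonID FROM FOO_Staging s INNER JOIN FOO f ON f.PersonID = s.PersonID", conn)
    Using reader As SqlDataReader = cmd.ExecuteReader()
      While reader.Read()
        ' Process or collect the matching IDs here.
      End While
    End Using
  End Sub
End Module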


____________________________________________________________
Mark,
http://aspnetlibrary.com

Need help finding an answer? Try the Search Facility or read FAQ222-2244.
 
Thanks for all the comments so far.

I do not believe that reading the entire contents of the file into a temp table would be very beneficial. I suspect the main slowdown is hitting the database so often.

That being said, the hash table suggestion seems like a great one. Thanks -- I am researching the hash table logic now and it looks like it could work, so I am going that route first.

Some great ideas so far!

Thanks,

`Le
 
Unless the ID fields are very long, I don't see why you would need a hashing scheme - it just adds overhead, and it depends heavily on the size of the input set to avoid too many collisions.

I had already suggested reading the IDs from the table and doing a binary search. You don't need the added complexity of a hashing scheme. Just load an array with the IDs in sorted order; a binary search is very easy to write and use.

"I think we're all Bozos on this bus!" - Firesign Theatre [jester]
 
One thing the hashing scheme might give you is to greatly reduce the number of values you need to store and look up, but that will require a really good algorithm - doh! :)

"I think we're all Bozos on this bus!" - Firesign Theatre [jester]
 
To be clear, I was referring to a hash table, not hashing the values. A hash table is a very lightweight way of storing a collection of key-value pairs.

Basically, I would use a DataReader (very fast, little overhead, forward only) to run the query (select ID from table order by ID asc), loop through all of the IDs, and add them to the hash table with the current index as the key. Using the index as the key makes the binary search function a breeze.

Off the top of my head, you should wind up with two methods like this:
Code:
  Private Function PopulateHashTable() As Hashtable
    Dim myHashTable As New Hashtable
    Dim index As Integer = 0

    ' The original post left the connection and query out; these are placeholders.
    ' Adjust the connection string and the table/column names to match yours.
    Dim myConnection As New Data.SqlClient.SqlConnection("your connection string here")
    Dim myCommand As New Data.SqlClient.SqlCommand("SELECT ID FROM FOO ORDER BY ID ASC", myConnection)

    myConnection.Open()
    Dim myReader As Data.SqlClient.SqlDataReader = myCommand.ExecuteReader()

    ' Key = running index, value = ID (already sorted thanks to the ORDER BY).
    If myReader.HasRows Then
      Do While myReader.Read()
        myHashTable.Add(index, CInt(myReader.Item("ID")))
        index += 1
      Loop
    End If

    myReader.Close()
    myConnection.Close()

    Return myHashTable
  End Function

  ' Standard binary search over the index -> ID pairs built above.
  Private Function BinarySearch(ByVal Value As Integer, ByVal HashTable As Hashtable) As Boolean
    Dim Found As Boolean = False
    Dim Top As Integer = HashTable.Count - 1
    Dim Bottom As Integer = 0

    Do While Bottom <= Top And Not Found
      Dim Middle As Integer = (Bottom + Top) \ 2
      Dim CompValue As Integer = CInt(HashTable(Middle))
      If Value > CompValue Then
        Bottom = Middle + 1
      ElseIf Value < CompValue Then
        Top = Middle - 1
      Else
        Found = True
      End If
    Loop

    Return Found
  End Function

The first one loads the index/ID pairs into the hash table. It should be reasonably memory efficient so long as your ID field is an integer (although the non-generic Hashtable boxes each Integer onto the heap, so there is still some allocation per entry). If the ID is alphanumeric, you take a couple of extra hits. First, you'll have to supply your own IComparable logic so that you can determine which value is bigger. And second, the strings themselves live on the heap, so each comparison involves an extra level of indirection.

The binary search itself is fairly generic. Each iteration cuts the remaining list in half until it finds the right value, so it should find the answer in roughly log2(N) + 1 comparisons, where N is the number of items being searched. Finding a value among 50,000 rows should therefore take no more than about 17 iterations.

-Rick

VB.Net Forum forum796 forum855 ASP.NET Forum
[monkey]I believe in killer coding ninja monkeys.[monkey]
 
In VB6 there was the Dictionary object, an associative array that was very fast at checking whether something existed in it. Have a look at this post.

thread222-1313989

Swi
 
VB.NET also has a Dictionary object, as well as a LinkedList and all the other, more obscure, collection types.
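
For example, the generic Dictionary gives you a hash-based ContainsKey check (the names below are just for illustration):

Code:
Imports System.Collections.Generic

Module DictionaryLookup
  ' Build a lookup keyed by PersonID; the Boolean value is just a placeholder.
  Public Function BuildLookup(ByVal ids As IEnumerable(Of Integer)) As Dictionary(Of Integer, Boolean)
    Dim lookup As New Dictionary(Of Integer, Boolean)
    For Each id As Integer In ids
      lookup(id) = True
    Next
    Return lookup
  End Function

  ' Hash-based check, roughly constant time per call, no sorting needed.
  Public Function IdExists(ByVal lookup As Dictionary(Of Integer, Boolean), _
                           ByVal personId As Integer) As Boolean
    Return lookup.ContainsKey(personId)
  End Function
End Module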

Christiaan Baes
Belgium

"In a system where you can define a factor as part of a third factor, you need another layer to check the main layer in case the second layer is not the base unit." - jrbarnett
 
TheLe said: "I do not believe that reading the entire contents of the file into a temp table would be very beneficial. I suspect the main slowdown is hitting the database so often."
Why not try it first rather than just guess? If it turns out to be slower then fine, discount that method, but if not then it may solve your problem.

Set-based operations on a database will generally be a lot faster than row-by-row lookups, so a staging table should, in theory, be the faster method.


____________________________________________________________
Mark,
http://aspnetlibrary.com

Need help finding an answer? Try the Search Facility or read FAQ222-2244.
 
I think that you will find the dictionary more than sufficient as far as speed goes.

Swi
 
It is basically an indexed array, so the way it searches should not matter.

Code:
If dict.Exists("YOUR VALUE") Then Msgbox "Value Found!"

Swi
 
It depends on what type of search the .Exists method is using. If it just searches linearly through the dictionary, it could take 50,000 compare operations to find the correct value, whereas a binary search will max out at about 17.

Only way to know for sure would be to try it though I guess.

Performance-wise, if the existing table is extremely large, it will likely be faster to just load the text file into a temp table. If the existing table is not extremely large, it will likely be faster to load the existing IDs into memory and binary search them. If I get some free time this afternoon I might toy around with the dictionary and its search performance.
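
A throwaway comparison along these lines would do it - the sizes and names below are made up for illustration, not measurements:

Code:
Imports System.Collections.Generic
Imports System.Diagnostics

Module LookupTiming
  Sub Main()
    Const idCount As Integer = 500000
    Const lookups As Integer = 50000

    ' Fake data standing in for the FOO.PersonID values.
    Dim sortedIds(idCount - 1) As Integer
    Dim lookup As New Dictionary(Of Integer, Boolean)
    For i As Integer = 0 To idCount - 1
      sortedIds(i) = i * 2              ' even numbers, so odd probes will miss
      lookup(i * 2) = True
    Next

    Dim watch As Stopwatch = Stopwatch.StartNew()
    For i As Integer = 0 To lookups - 1
      Dim found As Boolean = lookup.ContainsKey(i)
    Next
    Console.WriteLine("Dictionary:    " & watch.ElapsedMilliseconds & " ms")

    watch = Stopwatch.StartNew()
    For i As Integer = 0 To lookups - 1
      Dim found As Boolean = Array.BinarySearch(sortedIds, i) >= 0
    Next
    Console.WriteLine("Binary search: " & watch.ElapsedMilliseconds & " ms")
  End Sub
End Module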

-Rick

VB.Net Forum forum796 forum855 ASP.NET Forum
[monkey]I believe in killer coding ninja monkeys.[monkey]
 