
Most efficient way to update large MSSQL database using C#


darylbewise

Programmer
Jan 5, 2009
I have a MS SQL database table with around 500,000 rows. Each day, this table is updated from a CSV file.

Currently I loop through each row of the CSV file and check whether the matching row in the database has changed. If it has changed, I issue an UPDATE SQL query; if the row does not exist, I issue an INSERT SQL query.

In this situation there are two database queries for each row of the CSV file, so updating the database from a CSV file with 500,000 rows requires 1,000,000 database queries.
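In SQL terms, the per-row round trip looks something like the following sketch (the table, key, and column names here are placeholders, not the actual schema):
Code:
-- query 1: fetch the current row so it can be compared with the CSV row
SELECT Col1, Col2, Col3 FROM dbo.LiveTable WHERE Id = @Id;

-- query 2: either update the row if it changed...
UPDATE dbo.LiveTable
SET    Col1 = @Col1, Col2 = @Col2, Col3 = @Col3
WHERE  Id = @Id;

-- ...or insert it if it was not found
INSERT INTO dbo.LiveTable (Id, Col1, Col2, Col3)
VALUES (@Id, @Col1, @Col2, @Col3);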

I have been looking for a more efficient way of updating this database. What I was thinking of doing was loading the entire database table into either a DataTable, an array, or an ArrayList.

If I used a datatable:

1. Load the entire database into the datatable
2. Loop through the datatable for each row of the CSV file in order to find the relevant row
3. Either update the relevant datatable row using something like:

Code:
table.Rows[row].BeginEdit();
table.Rows[row]["column"] = "1";
table.Rows[row].EndEdit();
table.AcceptChanges();
OR
Add a new row to the datatable:
Code:
table.Rows.Add("column1", "column2", ..., "column7");
4. Once I had looped through the entire CSV file, I would then bulk upload the datatable back to the database:
Code:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(ConfigurationManager.ConnectionStrings["Database"].ConnectionString, SqlBulkCopyOptions.TableLock))
{
    // bulk-insert every row of the DataTable into the destination table
    bulkCopy.DestinationTableName = tablename;
    bulkCopy.WriteToServer(table);
}

By making all required alterations to the datatable instead of directly to the database, I was hoping to save processing time. However, loading such a large database into a datatable makes me worry about the amount of memory the datatable would consume.

Any suggestions on the best way to tackle this problem would be much appreciated. Should I update the database directly, or manipulate the data in a DataTable, an array, or an ArrayList first?
 
Why not push all the work to SQL Server? You can load a staging table with your data, then execute an UPDATE within SQL that updates records where the fields in question do not match, followed by an INSERT for all records that do not exist in the destination table. These are set-based operations, requiring you to execute only two queries after you have loaded the data into the staging table. To make the comparisons easier, you can compare binary checksums to detect changed rows.
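A rough sketch of those two set-based statements, assuming a placeholder destination table dbo.LiveTable (key column Id, data columns Col1..Col3) and a staging table dbo.StagingTable with the same layout; the real table names and column lists will differ:
Code:
-- 1. update rows whose contents have changed (checksum mismatch)
UPDATE live
SET    Col1 = s.Col1,
       Col2 = s.Col2,
       Col3 = s.Col3
FROM   dbo.LiveTable AS live
JOIN   dbo.StagingTable AS s ON s.Id = live.Id
WHERE  BINARY_CHECKSUM(live.Col1, live.Col2, live.Col3)
    <> BINARY_CHECKSUM(s.Col1, s.Col2, s.Col3);

-- 2. insert staged rows that do not exist in the destination yet
INSERT INTO dbo.LiveTable (Id, Col1, Col2, Col3)
SELECT s.Id, s.Col1, s.Col2, s.Col3
FROM   dbo.StagingTable AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.LiveTable AS live WHERE live.Id = s.Id);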

I know it is not the C# solution you are looking for but I am sure it will be faster than any solution outside of SQL server.
 
research "Extract Transform Load", or ETL for short. you definitely do not want to load the entire table into memory.

I know MS provides DTS and SSIS. some love them, some hate them. if you like GUIs and wizards, they are great. if you like unit testing or piecing your ETL process together piece by piece, they're terrible.

I have used DTS in the past only out of ignorance and necessity. Since then I have discovered Rhino.Tools, and more specifically Rhino.ETL. I plan on utilizing Rhino.ETL for my future data transformation processes.

this article was recently published and is the best introductory material on Rhino.ETL


Jason Meckley
Programmer
Specialty Bakers, Inc.
 
Thank you for your quick replies. I am interested in both of the methods that you have suggested.

I am currently only using Microsoft SQL Server Management Studio Express; are DTS and SSIS still available in this version?

MDXer:
Sounds like quite an interesting solution.

Sorry if this seems a little basic, however I have never really used MS SQL to do all the processing for me; I have previously used C# to manipulate all the required data before uploading it to the database. Please let me know if I have this correct:

1. Create the staging table using a stored procedure. The staging table would be a complete duplicate of the original table.
2. Loop through each row of the input file (XML, TXT or CSV); if no alteration has occurred, delete the row from the staging table. If alterations have occurred, run an UPDATE query on that row.
3. Once completed, the staging table will contain all rows that need altering in the main table. Update the main table.
4. Truncate all values from the staging table.
Do I have this correct?
 
I wouldn't worry about deleting anything from the staging table, as it is just as easy and much faster to simply ignore the unchanged data.

Looks good from what I know of your process.

In regards to DTS and SSIS: DTS wasn't very good for ETL, as it was mainly a workflow control tool. SSIS, on the other hand, is an extremely powerful tool if you invest the time into really learning it. If you approach it as simply a GUI to drag and drop, then you probably won't have a very good experience with it, just as if you dragged and dropped your way through all your C# in a GUI. As experienced developers know, it is understanding the various pieces and what they really do that allows you to leverage the potential of the tool.

As for Express, I doubt you would have SSIS, as it comes with BIDS, which I believe requires Standard or Enterprise edition.

You may not need an ETL tool at all. An ETL tool would simply allow you to make all your changes prior to putting the data in SQL Server, which you may still be able to do in C#, and that may be faster depending on what the edits are.
 
Ok, that's great. Thanks for all your help and advice.

I will have a test with Rhino.ETL and my C# code to see which works best.

Once I have all the required alterations in the staging table(s), what is the quickest way to actually copy just the altered rows and new rows over to the live table?

Is there a quick and easy sql command?
 
Code:
INSERT INTO tablename
SELECT * FROM ...
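If the server is SQL Server 2008 or later, a single MERGE statement can handle both the changed rows and the new rows in one pass; here is a sketch using the same placeholder names as the earlier example:
Code:
-- one-statement alternative on SQL Server 2008+ (placeholder names)
MERGE dbo.LiveTable AS live
USING dbo.StagingTable AS s ON live.Id = s.Id
WHEN MATCHED AND BINARY_CHECKSUM(live.Col1, live.Col2, live.Col3)
              <> BINARY_CHECKSUM(s.Col1, s.Col2, s.Col3) THEN
    UPDATE SET Col1 = s.Col1, Col2 = s.Col2, Col3 = s.Col3
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Col1, Col2, Col3) VALUES (s.Id, s.Col1, s.Col2, s.Col3);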

but for any SQL questions, you should post in the appropriate forum
 
Is there a quick and easy sql command?
whether you are updating 2 rows or 2,000,000 rows, the actions are the same. there are performance tuning options, but that is more of a DBA function: index tuning, cluster management, IO operations.

first get it working, then determine acceptable execution times (1 second, 1 minute, 1 hour, 1 day). only after this is completed should you worry about performance.

Jason Meckley
Programmer
Specialty Bakers, Inc.
 
