Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Best Practice - Load Multiple Text Files to Multiple tables 2

Status
Not open for further replies.

BxWill

MIS
Mar 30, 2009
367
US
Insight needed.

I have over 12 text files within a folder and I need to create a database and load the files into the respective tables.

Currently, I am creating the database, creating a temporary table for each text file, bulk inserting each text file to the temporary table and then using "Insert Into" to insert from the respective temporary table into the respective database table.

Any insight as to best practice?

Text files include;

InitialLoad (contains 2014 thru Aug 2016 data)
Monthly text files from Sept 2016 thru July 2017



 
OOPS. As Olaf pointed out, I gave a VFP answer rather than a SQL Server answer, so ignore this:

Do you need to massage the data on its way in? Is that why you're using a staging table? If the text file contains exactly what you need, why not just import from it directly using APPEND FROM?

Tamar
 
APPEND FROM = > wrong forum, Tamar, this is MSSQL.

BxWill, I know where you come from, looking at recent threads.

There is no best practice you could just follow for sure success. It all depends, Especially having a IDENTITY colunm in the one target table filled with many staging tables each for one source text file. When you generate IDENTITY columns even for only one staging table and repeat imports for each day or week or month, you restart generating the number sequence 1,2,3,.. while it's already at whatever number in the target table. IDENTITY columns have no place in staging tables for that matter, and only can be set by the final insert of the data into the production data table.

One way out to generate those values beforehand needs a disciplined creation of staging tables, so the initial INDENTIY start number is higher than the max number in production target table, so your script generating the staging temp table will need to create INT IDENTITY(N,1) with N = current MAX +1.

In consequence you can't do that in parallel with many text files all ending up in the same target table, as you can't know how much each text file will advance the identity sequence counter.

It is easier to cope with uniqueidentifier and newid(), as those keys are random unique and can be generated independently.

Bye, Olaf.

PS: all that said I don't have the finger on the exact point in your case, but the strong feeling his is where your double key problem stems from. In the end, when keys are generated before put into the final target table, things can surely go out of whack and asnc. That's a reason keys are not pregenerated or if so, in the fashion newid() does without such a sequence. A sequence counter like IDENTITY implicitly is can only be used for numbering the rows in the final step from staging table to production table, which obviously is a disadvantage, when you want to precheck uniqueness. In the end uniqueness of keys in the staging table also tells nothing about an overlap with already existing keys, unless those are used to merge data in as update instead of insert. You have a hard time, if the data you regularly import actually is not all new data but often enough update of existing data and you don't have an identifier usable for your database. Then merging conditions must be done for equal product name, GTIN, or whatever other real world unique identifier.

 
Exactly!

Overall, at this point, it appears best to abandon the use of the identity column and import the data while exploring the use of another field to use as a primary key.

[Bold]
If I did abandon the use of an identity column, the question remains - Is it good practice to create a different temporary table for each monthly text file?[/Bold]

For each monthly file for a particular year, say the 12 files for 2017, is it a good idea to create 12 different temporary tables, perform validation, etc. on the data prior to using "Insert Into" to load the data from the 12 temp tables to "Table_2017" in the database?

Or, can I use the one temporary table, #TempTable1, to load Jan 2017 data and then re-use #TempTable1 to load Feb 2017 data and re-use #TempTable1 to load March 2017 data and so on.

Then, create one temporary table, #TempTable2, to load the monthly text files for 2018 and so on...


This is the key question.
 
>Is it good practice to create a different temporary table for each monthly text file?

The question kind of answers itself. You have a monthly routine that certainly will not take a month, you create a temp table that will not be persistent and permanent.

=> You do generate a temp table for staging every month, it's temporary, gone after the process so next month you have a new one.

If you have multiple files I would make it depend on their overall size, whether I'd collect their data into one staging table or not, but when I expect them to have data about the same final rows, then collecting them within the same temp table in one go just leads to doublets in the merge of data again or makes massaging them before merging them into production more complicated than when working on each file separately.

You can reuse the same temp table name because your last step of processing a single input file will end on dropping that temp table implicitly or explicitly doesn't matter. The typical destroying of a temp table happens, because the connection to the server is closed. The scope/lifetime of a normal temp table is the session and closing the connection ends it.

Your major decision also could be removing the identity from the temp table only still using it in the destination table for creating unique keys. Then the identity is generated on the fly at the final step. You still need a couple of columns for identifying a row your staging data should update instead of adding it again.

To expand a little: Whether you need to identify rows for updates or not depends on the nature of the data you import of course.

Assume you have a logging data, eg all sales transactions of the last month, every row is a new row. Then the only need of staging table for preparation/massaging the data before inserting it into the target database table is about any data errors or complications of converting text to the T-SQL data types.

Assume you have a list of products from a vendor, essentially the same list every month in their current state, then most rows are not to be inserted into the final table, but updated, if they contain any change at all. This requires that table to have some identifier for the product, which may not be your databases primary key, but a unique product code or other per vendor unique identifier. Your database may collect products from many vendors each having a different type of key and thus you have to generate your own.

In both cases the temp staging table is of use, but will have additional and differing purposes to clean data and be able to use T-SQL MERGE, for example, to both insert and update data from the staging table, which is impossible for a BULK operation just appending the data (Hence Tamar's post mentions APPEND FROM a text file is Foxpro, not MSSQL). The need to not directly bulk load into production mainly comes from being able to finally make a per row decision of insert or update and maybe also reject, for example to only import a certain category of products from a vendors products text file.

Identity can be fine, but not on the staging level. In the first run, you may simply not generate a primary key in the staging table at all, also no compound key on any columns. You want to be able to check for double keys, then you first have to allow them coming in as they are. In the worst case a text file may contain many temporal versions of an original row, thus only the last row for the same record in the text file would need to be in there and you have to reject all previous. such things can, for example, happen, if a vendor creates his monthly text file as the log of changes on products. Whenever a product was changed more than once in a month it is double in this export. Notice, in that case the vendors monthly products text file is not the list of all products but a change log, which might be much smaller then the full list but will have other preprocessing needs.

Finally, there is no single recipe and solution, you have to know the nature of the import file and what you want to take in from it.

Bye, Olaf.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top