
Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB


BxWill

MIS
Need to clean several large pipe delimited text files prior to import into a newly created Sql Server 2012 database.

The plan is to load the Sql Server database and then append data to the database on a monthly basis.

Each text file contains 500,000 to 850,000 records with over 50 columns. Also, there are three date columns in the format YYYYMMDD.

Specifically, I need to review and clean each text file. Cleaning includes:

1. Extract records that have a zero in any of the three date columns.
2. Extract records that have zeroes in the Month and Day portion of the date (for example, a value like "20160000").

Initially, I opened the text files using MS Excel, but then researched which text editor could/should be used instead.

Downloaded and installed the following text editors - GVIM, Sublime Text, and EmEditor. Upon opening the largest text file in each of the editors, I was not readily able to display the data in a columnar format that would facilitate a review of the records.

Based on my experience with MS Excel, it appears that editing the text files using MS Excel would be convenient and relatively easy compared to using the text editors.

I did also create the Create Table and Bulk Insert SQL scripts to load the database (prior to cleaning the text files). All of the tables were loaded except one. That table failed to load because of records that have zeroes in the Month and Day portion of the date (for example, a value like "20160000").


Any insight as to the preferred method to clean the text files so I can initially load the Sql Server database?

Any additional insight as to the use of SSIS or Bulk Insert to perform the initial load of the database and the subsequent monthly appends of data, in a way that stores all records with errors in a "temporary" table and loads only the records that do not have errors?



 
If I were you, I'd first address this at the source of the files, so that clean text files are output in the first place.

>I did also create the Create Table and Bulk Insert SQL scripts to load the database (prior to cleaning the text files).
That's also what I would do; SQL Server should be able to import the file as text-only data.

>All of the tables were loaded except one. That table failed to load because of records that have zeroes in the Month and Day portion of the date.
Well, why import as date? Import all fields as text, mend such errors or remove such lines, and then export the corrected data to import with the proper formats/types.
You may also go from the text-only staging tables into your real data tables by converting from text to the wanted type yourself, which saves another round trip to the HDD and back.
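
A minimal sketch of that idea, with made-up table and column names (stage_claims, claims, EffDate) standing in for the real 50+ columns:

Code:
-- Staging table: every field is varchar, so a value like "20160000" loads without error
CREATE TABLE dbo.stage_claims (
    RecordId varchar(50),
    EffDate  varchar(8),   -- YYYYMMDD kept as text
    Amount   varchar(20)
    -- ... remaining columns, all varchar
);

-- Later, move only the rows holding valid dates into the typed table
INSERT INTO dbo.claims (RecordId, EffDate, Amount)
SELECT RecordId,
       CONVERT(date, EffDate, 112),     -- style 112 = YYYYMMDD
       CONVERT(decimal(18,2), Amount)
FROM dbo.stage_claims
WHERE ISDATE(EffDate) = 1;              -- rejects "20160000" and plain zeroes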

Bye, Olaf.
 
If you have access to Unix or Linux, awk would resolve this situation nicely. Also, the awk utility is available for Windows as an add-on.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


 
No need to use any external program here.

Just load everything into a staging table - if required due to "bad data", load the fields that have bad data into varchar columns.
Then either delete, or select only, the records that meet the desired criteria and transform/load them into the final destination.

Having the records processed in SQL will most likely prove to be easier than doing that process outside.
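
A sketch of that delete/select step, reusing the made-up staging layout from the earlier example (stage_claims with a varchar EffDate column) and covering only the "20160000" pattern described above:

Code:
-- Park the problem rows in a side table for later review...
SELECT *
INTO dbo.stage_claims_errors
FROM dbo.stage_claims
WHERE EffDate = '0'                -- a bare zero in the date column
   OR EffDate LIKE '____0000';     -- zero Month and Day, e.g. "20160000"

-- ...then remove them from staging before the transform/load
DELETE FROM dbo.stage_claims
WHERE EffDate = '0'
   OR EffDate LIKE '____0000';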

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
>Having the records processed in SQL will most likely prove to be easier than doing that process outside.

I wouldn't say so in every case, as T-SQL string functions are far behind what other languages offer.
But staging the data has one big advantage compared to parsing the text file itself: the data is already separated into columns.
Admittedly, a parser also has an easy task slicing a line of the text file at each pipe symbol, but what then?
a) The parser writes the correct lines out into a secondary text file?
That's quite costly in time; it means you read and write, and then afterwards read a million rows again.
b) The parser keeps the lines in memory and finally puts the data into MSSQL?
That could work, too, but you need a tool capable of bulk loading while parsing; inserting single rows would be inefficient.

So finally, while it sounds like doing too much in the first place, loading the dirty data into text/char/varchar fields and post-processing it can pay off performance-wise, as bulk loading takes that much more advantage of a uniformly structured bulk of data: no transaction log entry for each row, no per-row locking, etc. It's cheaper to load too much and then throw away or mend the dirty data.

We can totally agree on staging the data as the way to go instead of preprocessing the text file, because you really can't process a text file and remove single lines from it in place; if you do so, you rewrite a secondary file, and that's where you waste the I/O.

Bye, Olaf.
 
Bottom line:

I am interested in performing the initial load of the data sourced from the various text files into the Sql Server 2012 database (version: Developer) as fast as possible, and then performing the monthly appends of data. I just want to ensure that I am setting this up in the most efficient manner at the outset.

Additional context: I initially installed SQL Server 2014 Express Edition and subsequently installed SQL Server 2012 as a second instance. I did not have Integration Services installed. My initial assumption was that the text data was clean and that all I would need to do is create the database tables and load the text files using Bulk Insert. It was only when I encountered errors running the Bulk Insert scripts that the focus shifted to the use of SSIS to perform the initial load and the subsequent monthly appends of the data.

I have read extensively over the last few days about various alternatives to SSIS such as PowerShell and VBScript. Other options include the use of MS Access and/or Power Query and Power Pivot. However, it appears that I should focus on either using MS Excel to clean the data and then SSIS to perform the ETL into the SQL Server tables, or using SSIS to load the error-free records into the respective SQL Server tables and route the records with errors into another table so I can research/review them at a later date or send them back to the department that prepared the text files.


With that being said, the following questions arise:

1) Considering that I have a Windows 7, 64-bit operating system with 64-bit MS Excel 2016 on an HP EliteBook 8440p laptop with 16 GB of memory, what version of SQL Server is ideal to allow the installation and successful use of SSIS?



2) As previously stated, I have two installations of SQL Server on the laptop - SQL Server 2014 Express and SQL Server 2012 Developer. Will this present issues going forward? I am interested in installing and using the version of SQL Server that will allow me to efficiently use business intelligence functionality - it doesn't have to be the latest version. Any insight as to whether an upgrade to SQL Server 2016 is warranted at this point?


3) It appears that it may be possible to strictly use Bulk Insert within T-SQL to initially load the data from the text files into the respective SQL Server tables, perform the monthly appends of the data to the tables, and route the records with errors to a separate file. However, I do think that the use of SSIS will enable a more efficient ETL process than Bulk Insert. Any thoughts/insights regarding strictly using Bulk Insert to perform the initial load and the monthly appends of data to the database tables?
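
For reference, BULK INSERT itself can divert rejected rows to a separate file via its ERRORFILE option. A rough sketch, with made-up paths and a made-up target table (dbo.claims):

Code:
-- Rows that fail conversion are written to the error file (plus a companion
-- .Error.Txt file describing each failure) instead of aborting the whole load.
-- The error file must not already exist; MAXERRORS raises the default limit of 10.
BULK INSERT dbo.claims
FROM 'C:\import\claims_201601.txt'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    ERRORFILE       = 'C:\import\claims_201601_errors.txt',
    MAXERRORS       = 100000,
    TABLOCK
);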
 
If I am not mistaken, SSIS is a paid-for tool. It's included in Developer, but you are not licensed to use it in production.

Personally, I wouldn't spend the money on it for such a simple task as cleaning your data. As others have already said, it's best to import to a staging table that has all varchar columns, and then clean the data prior to copying to real tables.

From a performance perspective, you may get better throughput by using SSIS, but all of your other suggestions are likely to be slower.

SQL 2012, 2014, and 2016 all have Business Intelligence, but again, there is a cost associated with it. If you're willing to bear the cost, then go for it.

-George
Microsoft SQL Server MVP
My Blogs
SQLCop
twitter
"The great things about standards is that there are so many to choose from." - Fortune Cookie Wisdom
 
The gist of what we said is not to clean the data before bulk loading. You do bulk load the data, but not as the final types; all columns are just loaded as text, so there is no value-range or conversion error that way. The data also is not loaded into the final tables, but into staging tables with a similar structure, only with char/varchar columns.

Data in the staging tables can be cleaned before being merged with the real data tables. For that matter you can use functions such as ISDATE().

You don't need SSIS, nor Excel, nor PowerShell. For the bulk load you have the BULK INSERT command and/or the bcp.exe tool and format files; BULK INSERT lets you specify a format file via the FORMATFILE option, and in there you can declare all fields as text/char/varchar. As the text file is text, you can't get any conversion errors that way.
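
To illustrate, a sketch of such a load with made-up paths, a made-up staging table (dbo.stage_claims), and only three of the 50+ fields shown in the non-XML format file:

Code:
-- Contents of C:\import\claims.fmt (every field declared as SQLCHAR, i.e. plain text):
--   11.0
--   3
--   1  SQLCHAR  0  50  "|"     1  RecordId  SQL_Latin1_General_CP1_CI_AS
--   2  SQLCHAR  0  8   "|"     2  EffDate   SQL_Latin1_General_CP1_CI_AS
--   3  SQLCHAR  0  20  "\r\n"  3  Amount    SQL_Latin1_General_CP1_CI_AS

BULK INSERT dbo.stage_claims
FROM 'C:\import\claims_201601.txt'
WITH (FORMATFILE = 'C:\import\claims.fmt', TABLOCK);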

I already mentioned the big advantage you have this way: you get the data separated into columns and can address the individual columns to detect problematic values, to either mend them or skip that data for the daily ETL process.

Your ETL process thereby runs within SQL Server.

Bye, Olaf.

 
Although I know it is sacrilege to discuss non-Microsoft solutions in a Microsoft forum, awk will be the fastest. Awk is written in C and will run against the delimited sequential file, producing an output delimited sequential file that can be loaded using bcp. I know discussing Unix/Linux utilities in an MS environment will not be well received, but awk against the flat file will be faster than any database procedure.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


 
We're not talking about small files here, John.

Simply do the I/O calculation for, e.g., a 1 GB file:

Solution with awk:
1. awk reads original file and writes clean file
2. bcp reads clean file and uploads into tables

All data is read 2x and written 2x.

Solution with staging tables
1. bcp reads dirty file into staging tables
2. cleaning within staging tables
3. merging staging with production data

All data is read 1x and written 1x in step 1 alone.
Cleaning the data also needs to read it back 1x, but since that is also done in MSSQL it profits from caching. Also, you only write the changes, not all of the data as in step 2 of the awk solution.
And the final merging step again profits from cached data and only does the necessary inserts and updates.

Overall you can save a lot of I/O with the solution not using awk.

Bye, Olaf.
 
I have worked on data warehouses for credit cards. Millions of customers with 30+ million transactions daily. [glasses]


==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity

 
John, my logic still holds anyway. I'd say you can also do it in a way that is not the most performant once drives and RAM are sufficiently fast. And from the day that becomes true onward it will stay true, which is why you never had any bad experience with it.

Bye, Olaf.
 
Olaf, your reasoning is fine. And if timing doesn't require "the fastest", then I wholly endorse loading the raw data into a table. It makes analysis of the errors much easier. Our experience with credit card data on fast Unix and Linux servers across a range of databases has shown that system utilities written in C are fastest. As you mention, the input and output files should be on separate storage directly attached to the server for maximum optimization of the awk and bulk load processes.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


 
You don't mention whether you have a programming background, but if you do (and your data is pretty consistent) this would be a great place for a Python program. It is very easy to pick up, and just parsing a line is not that hard; a few YouTube videos would probably get you on your way. It is fast, free, and runs on pretty much any platform.

Just a thought

Simi
 
I don't want to cause a blood rush, but I'm actually saying you are not using the fastest approach, simply looking at how many times the data needs to be read and written. RAM and CPU are so fast today that it won't matter, and even an SSD is always a magnitude behind RAM. Caching mechanisms not only include HDD hardware caching and RAM caching; the OS and the SQL Server process also contribute. And MSSQL can act on its data with multiple cores, while parsing the lines of a sequentially read file is a sequential process that is not parallelizable.

Let's make some assumptions:
1GB file, cleansable to 0.8GB

Your way reads in 1GB and writes 0.8GB to a separate drive so as not to have one controller as the bottleneck, fine. Then it reads the 0.8GB into MSSQL tables. Total read: 1.8GB, total write: 1.6GB, once to file and once into MSSQL, roughly speaking.

My way reads in 1GB, writes 1GB into staging tables, then reads 1GB and cleans it to merge into 0.8GB of production table data. That seems like the worse case, as in simple terms it means reading 2GB and writing 1.8GB overall, but the big advantage is that the process doing the first read, the write, and the read-back is one and the same process: the MSSQL process doing a BULK INSERT. It can profit from its own caching. Involving two tools always has at least this disadvantage.

Bye, Olaf.
 
One option that requires a single pass at the data is to use a bit of C# code, either as a standalone application or within SSIS as a script.

Read the file, parse each record as string data, and transform the data as required. If a record is considered valid, pass it to the feed to the database (either add a row to a DataTable object or add it to the SSIS pipeline); invalid records can go through the same process, but to a different destination for further processing if required.

1GB read - 1GB or less written to the database - still a single process.

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
Frederico, in this case the C# process and the MSSQL process are involved - again, two processes.

I'm talking about MSSQL using BULK INSERT and then more or less SQL along these lines:

Code:
CREATE DATABASE staging ...several options;
CREATE TABLE staging.dbo.stagetable (...) WITH (MEMORY_OPTIMIZED = ON);
BULK INSERT staging.dbo.stagetable FROM 'datafile' WITH (FORMATFILE = 'formatfile', ...);
-- some SQL cleaning the data, e.g. DELETE FROM staging.dbo.stagetable WHERE ISDATE(...) = 0
MERGE INTO proddb.dbo.prodtable AS prod
USING staging.dbo.stagetable AS stage
ON <match_condition>
WHEN MATCHED THEN
    UPDATE SET ...
WHEN NOT MATCHED THEN
    INSERT ... ;
DROP TABLE staging.dbo.stagetable;
DROP DATABASE staging;

There is no other process involved, only MSSQL - not even bcp.exe or sqlcmd.exe. The MSSQL process reads the data into the memory-optimized dbo.stagetable; nothing is written to the HDD at this stage. You only read 1GB into the staging table in memory, so the 1GB read from the flat file is written into RAM, not into any MDF file - neither yourdb.mdf nor tempdb.mdf, nor even staging.mdf. The CREATE DATABASE surely creates a small staging.mdf/ldf file pair, but they are dropped in the end. There is more detail to this, e.g. the staging db should be created with its recovery model set to simple. But roughly, you only read from the original flat file, do some things in RAM, and MERGE into the final production db table. Minimal writing to slower devices happens only at the end, after all the processing in RAM.
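
A sketch of the "several options" hinted at above (made-up paths; note that memory-optimized tables require SQL Server 2014 or later, so on 2012 a regular staging table would be used instead):

Code:
CREATE DATABASE staging
ON PRIMARY (NAME = staging_data, FILENAME = 'C:\data\staging.mdf'),
FILEGROUP staging_mod CONTAINS MEMORY_OPTIMIZED_DATA
    (NAME = staging_mod, FILENAME = 'C:\data\staging_mod')   -- a directory, not a file
LOG ON (NAME = staging_log, FILENAME = 'C:\data\staging.ldf');

ALTER DATABASE staging SET RECOVERY SIMPLE;   -- keep logging of the staging work minimal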

Bye, Olaf.
 
Olaf, a simple rule of thumb is that a SELECT costs 1, an INSERT or DELETE costs 2, and an UPDATE costs 3. Ignoring the bcp load that is part of both processes, the difference will be the cost of running awk across the entire flat file versus running UPDATE across the subset of records that have errors. A flat file read will be considerably faster than a database SELECT. I presume that if over 30% of the rows have errors, then awk will be faster. As the size of the flat file grows, the threshold percentage that favors the awk solution will drop.

The Python suggestion is a consideration, but it should not modify the original file, because an audit trail of what was changed should be kept.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


 
1, 2, 3 what? If the table is in RAM, any SQL will run in RAM and take negligible time compared to the initial read-in of the file via BULK INSERT. You are still thinking of normal tables stored on the HDD, John, but that's not the case here.

Bye, Olaf.
 