
Clean Large Text File - Use MS Excel or Text Editor - prior to import into SQL Server 2012 DB


BxWill

Need to clean several large pipe-delimited text files prior to import into a newly created SQL Server 2012 database.

The plan is to load the Sql Server database and then append data to the database on a monthly basis.

Each text file contains 500,000 to 850,000 records with over 50 columns. Also, there are three date columns in the format YYYYMMDD.

Specifically, I need to review and clean each text file. Cleaning includes:

1. Extract records that have a zero in any of the three date columns.
2. Extract records that have zeroes in the month and day portion of the date (for example, a value like "20160000"). A T-SQL sketch of both checks is shown below.
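
For reference, once the raw rows have been loaded into a staging table with the dates still held as text (the approach discussed later in this thread), both checks can be expressed as a simple query. This is only a minimal sketch; the table name dbo.Stage and the column names are hypothetical.

Code:
-- Minimal sketch: dbo.Stage and the column names are hypothetical, and the
-- date columns are assumed to still be varchar values in YYYYMMDD form.
SELECT *
FROM dbo.Stage
WHERE ReceiptDate = '0' OR SaleDate = '0' OR DeliveryDate = '0'   -- case 1: a bare zero in a date column
   OR RIGHT(ReceiptDate, 4)  = '0000'                             -- case 2: month and day both zero, e.g. '20160000'
   OR RIGHT(SaleDate, 4)     = '0000'
   OR RIGHT(DeliveryDate, 4) = '0000';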

Initially, I opened the text files using MS Excel, but then researched text editors that might be better suited to the task.

Downloaded and installed the following text editors - GVIM, Sublime Text, and EmEditor. Upon opening the largest text file in each of the editors, I was not readily able to display the data in a columnar format that would facilitate a review of the records.

Based on my experience with MS Excel, it appears that editing the text files using MS Excel would be convenient and relatively easy compared to the use of the text editors.

I also created the CREATE TABLE and BULK INSERT SQL scripts to load the database (prior to cleaning the text files). All of the tables loaded except one; that table failed because of records with zeroes in the month and day portion of the date (for example, "20160000").


Any insight as to the preferred method to clean the text files so I can initially load the Sql Server database?

Any additional insight as to the use of SSIS or BULK INSERT to perform the initial loading of the database and the subsequent monthly appends, such that records with errors are stored in a "temporary" table and only error-free records are loaded?
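
As a point of reference while the approaches are discussed below: BULK INSERT itself has MAXERRORS and ERRORFILE options that let a load continue past bad rows and divert them to a file for review (a file rather than a table, so not exactly the temporary-table idea). A minimal sketch, with hypothetical paths and table name:

Code:
BULK INSERT dbo.Stage
FROM 'C:\Data\MonthlyFile.txt'
WITH (
    FIELDTERMINATOR = '|',                       -- pipe-delimited source
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,                         -- skip the header row
    MAXERRORS       = 1000,                      -- keep loading after row-level errors
    ERRORFILE       = 'C:\Data\MonthlyFile.err'  -- rejected rows are written here
);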



 
Still at this.

Question - Is the implicit assumption that T-SQL is a more effective method of cleaning the data than formulas, IF statements, etc. within MS Excel?

So, upon the creation and use of T-SQL scripts to "export" the error-free data from the staging tables to the production tables, what remains in the staging tables are the records with the errors.

Therefore, isn't there still a need to identify all possible types of errors for the records that remain in the staging table - requiring a review of each record? Am I thinking about this correctly?

Also, considering that all errors may not be identifiable at the same time, wouldn't this preclude the use of a temporary table as the staging table?

Any examples of SQL scripts and/or stored procedures that identify the errors that remain within the staging table?

Is there a method, such as a stored procedure, that "identifies" the various types of errors and stores the records with like errors in separate tables?

For example, a table for duplicate records, a table for records with nonsensical dates, a table for records with misspelled fields, etc. (a sketch of such routing follows). Then again, maybe this is already taken care of by the various SQL scripts that will be developed to identify all possible errors for the records that remain within the staging table...
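
For what it's worth, routing each class of bad record into its own table is usually just a handful of INSERT ... SELECT statements run against the staging table before the good rows are exported. A minimal sketch, with hypothetical table and column names; the error tables are assumed to have the same structure as the staging table:

Code:
-- Rows whose textual dates are not valid dates (YYYYMMDD expected).
INSERT INTO dbo.Err_BadDates
SELECT *
FROM dbo.Stage
WHERE ISDATE(ReceiptDate) = 0 OR ISDATE(SaleDate) = 0 OR ISDATE(DeliveryDate) = 0;

-- Rows whose sale amount contains anything other than digits.
INSERT INTO dbo.Err_BadAmounts
SELECT *
FROM dbo.Stage
WHERE SaleAmount LIKE '%[^0-9]%';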






 
Duplicate records would be recognized by 'primary key' errors when attempting to insert. Dates can be checked a few ways. As far as 'misspellings' go, well, that's a whole different subject; you would need a dictionary or list of valid values. Your recent post opens a whole new series of questions and scenarios.
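
Rather than relying on primary key violations during the insert, duplicates can also be looked for directly in the staging table. A minimal sketch, assuming a hypothetical business key of ItemNbr plus SALE_DATE:

Code:
-- Key combinations that occur more than once on the assumed key.
SELECT ItemNbr, SALE_DATE, COUNT(*) AS Occurrences
FROM dbo.Stage
GROUP BY ItemNbr, SALE_DATE
HAVING COUNT(*) > 1;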

@Olaf: 1,2,3 are relative costs. So Update(3) is more efficient than Delete followed by Insert (2+2) for example.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


 
BxWill,

whatever you need and want. The nature of the data might not make it necessary to post-process invalid records; think of measurement data taken every ms, for example, where losing some records might not be an issue.
If you look at my code, I don't keep staging data; I don't even keep the staging database. More importantly, I create a separate database that I drop at the end. If you keep staging data within your production database, you have to think about log file growth for the import/merge process, which should be kept to a minimum; that is not achieved with temp tables either, but with a separate database you can finally drop. Therefore many of your questions have no base to stand on: the staging table in the end is not only empty, but gone.

In my short example I simply suggested DELETEing all rows with an invalid date, but you can of course also move such rows into tables used for manual inspection. John might suggest that you can also process your initial flat file with awk in a way that sorts each row (the good ones go into the pot, the bad ones go into the crop), like Cinderella. Indeed, once the first read is into an MSSQL table, I think one of the advantages is being able to apply any SQL, which you perhaps know better than any other tooling. If you're also a developer, Frederico has a point in using a frontend process in C#, or, as Simian suggests, Python; whatever flavor of language you speak, you surely know your way around acting on data with it, too.

It's totally up to you what you do between BULK INSERT and MERGE in my script. I would just not suggest loading all the data into a staging table, tables or schema of your production database, or into tempdb, but rather into a separate db on the same instance, maybe on a separate filegroup/partition. Since staging.dbo.stagetable is an in-memory table in MSSQL Server, you can also act on it from outside with SQL commands, as long as you don't pull all the data out for processing; if you did, you would be better off reading the flat file into such an outside process in the first place.

In the end you always have to imagine where you read from and write to: the flow of the data from HDD to RAM, within the LAN, the HDD and network controllers, and between processes. The flow of data through the slower devices costs the most; data in RAM can be processed fastest, no matter whether the code doing so is assembler or interpreted commands like ad-hoc queries. The parsing or compiling of code takes negligible time; its overall execution time, including the data processing it does, will be negligible even in comparison to SSD reads and writes, and of course even more so in comparison with network transfers. As said, RAM is still 1-2 orders of magnitude faster than SSD, and that comparison will always hold true: SSDs use non-volatile memory that always has to be flashed, and that process takes more time. Once data is in RAM, anything can act comparably fast on it; all the languages know the same core basic data types, and they mostly mean the same bytes in the same order. Anyway, very rarely do you have such complex processing of data that it becomes the bottleneck of your whole ETL process; what mainly determines the overall execution time is I/O, i.e. how many times you need to read and write from/to/through slower devices.

The roughly one million rows you need to process each day are in the GB range, I assume; this once was a huge amount of data, but today it's quite small, fitting into physical RAM multiple times on a modern server. So you might also consider a trade-off between performance and the maintainability and ease of modification of the data cleansing code, but indeed SQL is fine for data processing. I already said T-SQL string functions are not the very best; there are a few new ones making things better, but if you're a C# pro or Python pro or even a Linux pro with Perl, awk or grep, you might go that route; you'll see what fits your needs best. It should always be a concern to ask your data source for better quality initial flat files, so you can keep your preprocessing as simple as possible. What's most important in my experience is looking for ways to minimize the pressure on logs, i.e. you have or set the recovery model to simple or bulk logged (a minimal sketch follows). This even plays a role for the staging db you only create temporarily. If not, you write out all data twice: once into the data file and once into the log file MSSQL maintains for any db. If you have no deep comprehension of transaction logging, you should read up on it to understand the implications.
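
The minimal sketch mentioned above, assuming the throwaway staging database is named staging:

Code:
-- Simple recovery keeps the staging database's log small during the bulk load;
-- the whole database is dropped after the final merge anyway.
ALTER DATABASE staging SET RECOVERY SIMPLE;
-- ... BULK INSERT, cleansing and MERGE happen here ...
DROP DATABASE staging;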

Bye, Olaf.
 
>So Update(3) is more efficient than Delete followed by Insert (2+2) for example

I was suggesting MERGE for the final merge of cleansed data, neither Delete+Insert, nor Insert nor Delete nor Update alone. Do you know T-SQL MERGE?

Bye, Olaf.
 
I am currently reviewing T-SQL Merge.

As previously stated, I hypothesized that using MS Excel, rather than a text editor, to review and clean a text file prior to using BULK INSERT to load a SQL Server 2012 database may be feasible.

I proceeded to use MS Excel to open each text file and "marked" each cell within a DATE column that had an error by replacing the erroneous date entries with "11111111". An interesting observation was that there were errors with "00" in the day portion of the date (formatted as YYYYMMDD) and errors with "0000" in the MMDD portion of the date.

Truncation errors were "resolved" by modifying the length of the field within the CREATE TABLE sql script. Other errors such as asterisks (i.e. ********) were also replaced with "11111111."

I do realize that opening the text file in MS Excel, "cleaning" and modifying the data there, and then saving it as a tab-delimited text file prior to using BULK INSERT to load a table within SQL Server may not have been the most efficient method, because it appears that MS Excel may actually alter the formatting of some of the data. But I believe it was imperative to go through the process of identifying the errors within the text file initially via MS Excel before constructing the T-SQL statements to identify the specific errors that will still reside in the staging tables after exporting the "good" records to the production tables.

I successfully loaded the table within my Test database with all of the records: those that contained "bad" fields as well as those with all "good" fields. After fine-tuning this process, I plan to perform the initial load of the records into the staging tables and then load the "good" records into the tables within the Production database. Then, on a monthly basis, I need to import all of the records into the staging tables, append the good records to the tables within the Production database, and identify the "bad" records so that I can return them to the department that sourced the text file.

Now, I will revisit the creation of staging tables... Still, I need to determine if a Text editor can perform the above-mentioned editing/cleaning tasks.

So far, I have not been able to open a text file in a Text Editor and display the data in a columnar fashion so that I can even perform the above-mentioned modifications or similar modifications if I desired to do so in the future.

Any additional insight regarding the use of a Text editor vs. MS Excel to open a text file and "clean" the data prior to import into Sql Server is appreciated.

Note, I would like for this to be a seamless automated process whereby the "good" records are loaded and the "bad" records are identified as quickly as possible. Interested in basically using Bulk Insert to perform this.
However, I may have to use SSIS if there is not a steep learning curve to master.

Also, I would appreciate insight as to the process that is planned to append the data on a monthly basis.





 
As it seems you already found out: never, never use Excel to look at bad data. Excel will do formatting of its own that may either completely change the data or not process it at all.

Loading the raw data into a SQL Server table and then looking at it with SQL will enable you to do all types of analysis of the possible errors you may have, and to devise a parsing method for each individual type of error if required.

For situations like yours, where the data is already delimited, loading the data into individual fields will in most cases be acceptable, though even that may not work well in particular cases. Assuming it does, once you have the data loaded into individual columns you can analyse each one of them and see which ones need specific processing based on the desired datatype; that may mean dealing with invalid/partial dates, numbers with spaces or invalid characters in them, and so on. All of this is done easily with SQL.


Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
You still don't understand what I suggest, or you disregard it.

All you need to know is the number of columns in your text file, ideally also the names and expected data formats of the textual data, e.g. the already mentioned YYYYMMDD. Then you need neither Excel, nor a text editor, nor awk, nor bcp; the BULK INSERT command will import this.

The only thing you need to avoid conversion problems with BULK INSERT is to prepare a format file and specify it in the BULK INSERT statement.
This format file only needs to be created once and remains the same for all future bulk loads, as long as the text file format doesn't change.

The main point in avoiding any conversion errors is to treat all the text file columns as TEXT, as char/varchar fields. So your format file will state that all columns are text columns. No matter whether your dates contain YYYYMM00 or YYYY0000 values, the bulk insert will import them, because this way nothing is automatically converted. You reach your main goal of having the data in a table, and you can look at it in the columns of the staging table used for the BULK INSERT. Tada! That's what you want, isn't it?
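
A minimal sketch of that load, assuming a format file at a hypothetical path that declares every column as character data:

Code:
-- All columns arrive as text, so no conversion can fail at this point.
BULK INSERT staging.dbo.stagetable
FROM 'C:\Data\SaleData.txt'
WITH (
    FORMATFILE = 'C:\Data\SaleData.fmt',  -- declares every field as char/varchar
    FIRSTROW   = 2
);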

Take a look at the format file specifications in the SQL Server documentation.
Whether the textual data (which is still textual data within the staging table) is valid or not can then be checked via T-SQL.

All your ingredients are SQL Server and your text file, no more, no less. You may use the bcp tool to generate format files for your staging tables, which you design to have only the necessary-width char/varchar/nchar/nvarchar columns, whatever suits the text file best. You can't have a conversion error from text to text, can you? Your only enemy is a wrong charset/encoding; you need to know whether your text file is ANSI or Unicode, UTF-8/UTF-16 or whatever encoding.

You have experienced how the automatic conversion fails on dirty data; either you clean it and then let it do its work, or you take it into your own hands and convert the textual data yourself via CAST/CONVERT/FORMAT and check it via LIKE, ISDATE and other T-SQL constructs. T-SQL MERGE comes later; it is not your concern in this stage of filling the staging tables. MERGE is the final run from staging to production, letting this be done in a single query and with the least effect on the production data transaction logs, but that's the last step (a minimal sketch follows). Whether you do this from the initial pure textual staging tables or from secondary staging tables already holding the cleaned and converted data depends on convenience, performance, maintainability, your understanding of T-SQL, etc. It's possible to do the necessary conversions within the MERGE, as the part of the merge command that queries the staging table data can contain the conversion expressions/functions/calls.
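
A minimal sketch of what that final MERGE from staging to production could look like; the table, column and key names are hypothetical, and the conversions happen inside the USING query as described above:

Code:
MERGE dbo.Production AS tgt
USING (
    SELECT ItemNbr,
           TRY_CONVERT(date, RECEIPT_DATE) AS RECEIPT_DATE,
           TRY_CONVERT(date, SALE_DATE)    AS SALE_DATE,
           TRY_CONVERT(int,  SALE_AMOUNT)  AS SALE_AMOUNT
    FROM staging.dbo.stagetable
    WHERE ISDATE(RECEIPT_DATE) = 1
      AND ISDATE(SALE_DATE) = 1            -- only cleansed rows reach production
) AS src
ON tgt.ItemNbr = src.ItemNbr AND tgt.SALE_DATE = src.SALE_DATE
WHEN MATCHED THEN
    UPDATE SET tgt.SALE_AMOUNT = src.SALE_AMOUNT
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ItemNbr, RECEIPT_DATE, SALE_DATE, SALE_AMOUNT)
    VALUES (src.ItemNbr, src.RECEIPT_DATE, src.SALE_DATE, src.SALE_AMOUNT);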

We're only dealing with these lines of my proposed T-SQL script:
Code:
CREATE DATABASE staging ...several options;
CREATE TABLE dbo.stagetable (...) WITH (MEMORY_OPTIMIZED = ON);
BULK INSERT dbo.stagetable FROM 'datafile' WITH (FORMATFILE = 'formatfile', ...);

This alone already involves enough tasks: defining the structure you need for your staging table or tables, and creating a format file to specify in the BULK INSERT. That's the part that reads in the text file without any complications, because where there is no conversion there are no conversion errors. The final state is a memory-optimized SQL Server table holding the text file 1:1, but within ordinary table columns you can act on with SQL; that's the big advantage over the pure text data file: you already have your columns.

Bye, Olaf.
 
Appreciate the insight.

Will review the data within the links regarding format file specifications and pursue the use of staging tables and post back...
 
So far, I am not able to populate the table in production, "Data2."

Basically, converting from varchar to a date field when the date is formatted as YYYYMMDD presents a problem.

Specifically, the error is "Conversion failed when converting date and/or time from character string."

[bold]What modifications are needed to successfully load the production table, Data2, using the data from the staging table, #Data1?[/bold]

As previously stated, it appears that the majority of issues with the text file data are records where one or more date fields end in two or four zeroes. The dates are formatted as YYYYMMDD. There was one record that had double quotes in the Sale Amount column within the text file (i.e. ""). [bold]Is it possible to filter this out using the CASE statement as well?[/bold]

[bold]Also, any additional insight as to the feasibility of additional fields within the production table, Data2, such as the following (see the sketch below);

•TableId int identity(1,1)
•CreatedBy varchar(255) default system_user,
•CreatedAt datetime default getdate()?[/bold]
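
For what it's worth, a minimal sketch of how such audit columns could be declared (the table name is hypothetical and the business columns are omitted). Because these columns are filled by IDENTITY and DEFAULT, they would be left out of the column list of the INSERT that copies rows from the staging table:

Code:
CREATE TABLE dbo.Data2_Audited (
    TableId   int          IDENTITY(1,1) NOT NULL,
    CreatedBy varchar(255) NOT NULL DEFAULT SYSTEM_USER,
    CreatedAt datetime     NOT NULL DEFAULT GETDATE()
    -- ... the business columns from Data2 would follow here ...
);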

Note, an item can be sold more than once during an annual period. Therefore, I do not believe that the Item number should be the primary key.


The code that I have created so far is as follows:


Code:
DECLARE @LOADDIR VARCHAR(255) = 'C:\Data\';
--DECLARE @SQL NVARCHAR(MAX);              <------ IS THIS LINE NEEDED?


CREATE TABLE [#Data1](
	[ItemNbr]        [varchar] (60)  NULL,
	[LATITUDE]       [varchar] (60)  NULL,
	[LONGITUDE]      [varchar] (60)  NULL,
	[LAST NAME]      [varchar] (60)  NULL,
	[FIRST NAME]     [varchar] (100) NULL,
	[RECEIPT_DATE]   [varchar] (60)  NULL,
	[SALE_DATE]      [varchar] (60)  NULL,
	[SALE_AMOUNT]    [varchar] (60)  NULL,
	[DELIVERY_DATE]  [varchar] (60)  NULL,
	[ORDER_DATE]     [varchar] (60)  NULL
)


CREATE TABLE [Data2](
	[ItemNbr]        [varchar] (12)  NULL,
	[LATITUDE]       decimal (10,6)  NULL,
	[LONGITUDE]      decimal (10,6)  NULL,
	[LAST NAME]      [varchar] (25)  NULL,
	[FIRST NAME]     [varchar] (25)  NULL,
	[RECEIPT_DATE]   [DATE] NULL,
	[SALE_DATE]      [DATE] NULL,
	[SALE_AMOUNT]    [INT]  NULL,
	[DELIVERY_DATE]  [DATE] NULL,
	[ORDER_DATE]     [DATE] NULL
)


BULK INSERT #Data1
   	FROM 'C:\Data\201701SaleData.txt'
   WITH
      (
         FIELDTERMINATOR = '\t',
		 ROWTERMINATOR = '\n',
		 FIRSTROW = 2
	 )




--ERRORS RECEIVED AT THIS POINT !

--Iteration 1: Error: Conversion failed when converting date and/or time from character string.

INSERT INTO Data2
SELECT * FROM #Data1
WHERE
   CASE WHEN RIGHT([RECEIPT_DATE],2) <> '00'						THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT([SALE_DATE],2) <> '00'						THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT([DELIVERY_DATE],2) <> '00'					THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT([ORDER_DATE],2) <> '00'						THEN 1 ELSE 0 END = 1
   OR CASE WHEN ISNUMERIC[SALE_AMOUNT] 							THEN 1 ELSE 0 END = 1


--Iteration 2: Error: Invalid column name "DATETIME2"

INSERT INTO Data2
SELECT * FROM #Data1
WHERE
   CASE WHEN RIGHT(CONVERT([RECEIPT_DATE],DATETIME2),2) <> '00'				THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT(CONVERT([SALE_DATE],DATETIME2),2) <> '00'				THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT(CONVERT([DELIVERY_DATE],DATETIME2),2) <> '00'			THEN 1 ELSE 0 END = 1
   OR CASE WHEN RIGHT(CONVERT([ORDER_DATE],DATETIME2),2) <> '00'			THEN 1 ELSE 0 END = 1
   OR CASE WHEN ISNUMERIC [SALE_AMOUNT]							THEN 1 ELSE 0 END = 1



--Iteration 3; Error: Incorrect syntax near 'CAST', expected 'AS'.

INSERT INTO Data2
    select * FROM #Data1
	WHERE
           (case when CAST(RECEIPT_DATE) as datetime) and RIGHT(SALE_DATE),2 <> '00' then 1 ELSE 0 END = 1)
           (case when CAST(SALE_DATE) as datetime) and RIGHT(SALE_DATE),2 <> '00' then 1 ELSE 0 END = 1)
           (case when CAST(DELIVERY_DATE) as datetime) and RIGHT(SALE_DATE),2 <> '00' then 1 ELSE 0 END = 1)
	   (case when CAST(ORDER_DATE) as datetime) and RIGHT(SALE_DATE),2 <> '00' then 1 ELSE 0 END = 1)
 
There are a few things that you will need to take into consideration and learn how to use.

First, the errors in your code:
you are using datetime; note that if you are on 2008 or higher you should be using datetime2 with the desired precision,
and if your field is just a date then use DATE as the datatype - do not use datetime if not required.

CAST - select cast(receipt_date as datetime) is the correct form

convert - select convert(datetime, receipt_date) or select convert(datetime, receipt_date, format) - see the online documentation for the available format styles

Regarding the double quotes - if that is the only type of bad data you can just use the replace function on the select itself
e.g. select convert(int, replace(SALE_AMOUNT, '"','')) as SALE_AMOUNT

Regarding using ISNUMERIC or ISDATE: be aware that they may return valid when the data is invalid. Do search the net for examples, and use them with care and extra validation.
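
A few commonly cited examples of that pitfall; on SQL Server 2012 and later, TRY_CONVERT is usually a safer test because it simply returns NULL when the value will not convert:

Code:
SELECT ISNUMERIC('$')    AS currency_sign,  -- 1, but will not convert to int
       ISNUMERIC('1e5')  AS scientific,     -- 1
       ISNUMERIC('12,3') AS with_comma,     -- 1
       ISNUMERIC('.')    AS lone_dot;       -- 1

SELECT TRY_CONVERT(int, '$')    AS currency_sign,  -- NULL
       TRY_CONVERT(int, '12,3') AS with_comma;     -- NULL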

As for dates:

you said that the input format will always be YYYYMMDD - if that is the case it makes life easier.

Code:
set dateformat ymd -- force format of date to specific format as to avoid invalid conversions - normally not required if input is yyyymmdd

insert into Data2
-- remove if adding values as null is acceptable
select *
from (
-- end remove
select case
       when len(ltrim(rtrim(receipt_date))) <> 8 then null -- invalid date as size is not 8 
       when ltrim(rtrim(receipt_date)) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim(receipt_date)), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim(receipt_date)),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim(receipt_date))) = 1 then convert(date, ltrim(rtrim(receipt_date)), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
       end as receipt_date
from #Data1
-- remove if adding values as null is acceptable
) t
where receipt_date is not null
-- end remove

If the input formats can vary then the code above would need to be expanded, and eventually moved into a function, as it may become too cumbersome to code inline.
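
A minimal sketch of what such a function could look like (the function name is hypothetical; the body simply wraps the checks from the example above):

Code:
CREATE FUNCTION dbo.TryParseYyyymmdd (@value varchar(60))
RETURNS date
AS
BEGIN
    DECLARE @s varchar(60) = LTRIM(RTRIM(@value));

    IF LEN(@s) <> 8
       OR @s NOT LIKE '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]'
       OR RIGHT(@s, 2) = '00'            -- day part is zeros
       OR SUBSTRING(@s, 5, 2) = '00'     -- month part is zeros
       OR ISDATE(@s) = 0
        RETURN NULL;

    RETURN CONVERT(date, @s, 112);
END;

The stacked CASE expressions then collapse to dbo.TryParseYyyymmdd(receipt_date), dbo.TryParseYyyymmdd(sale_date) and so on, at the cost of a scalar function call per row.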

Similar validation can be done for the numeric fields, although for those you may also need to use PATINDEX to check whether there are any non-numeric characters.

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
I recommend not using temp tables, but aside from that, you can't expect CAST to cast just any string to a datetime.

Which version of MSSQL server do you use? How fluent are you with T-SQL and do you learn about advanced features and new functions regularly?

Ideally you use SQL2012 and can use TRY_CONVERT.

Here's a result of different conversions:
[screenshot: results of CAST, CONVERT and TRY_CONVERT on sample date strings]


You can see TRY_CONVERT has the advantage of resulting in NULL if the date is invalid. Your date checks only care about '00', but you would still accept the 31st of months with fewer days, etc.; forget about programming such things yourself. SQL 2012 also adds PARSE, TRY_CAST and TRY_PARSE besides TRY_CONVERT, and DATETIMEFROMPARTS.
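
The screenshots are not reproduced here, but the effect is easy to re-create; a small sketch of the same idea:

Code:
SELECT TRY_CONVERT(date, '20160131') AS valid_date,   -- 2016-01-31
       TRY_CONVERT(date, '20160100') AS zero_day,     -- NULL
       TRY_CONVERT(date, '20160000') AS zero_mmdd,    -- NULL
       TRY_CONVERT(date, '20160231') AS feb_31st;     -- NULL: not caught by a '00'-only check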

Bye, Olaf.
 
I recommend not using temp tables, but aside from that, you can't expect CAST to cast just any string to a datetime.
Why not? select cast('20010101' as date) and convert(date, '20010101') work just fine, and so do other strings.

And why not use temp? A permanent table for staging is not necessarily the best option, and neither is a declared table variable, if that is what you were thinking of.

As for TRY_CONVERT: just as with all the date manipulation functions in SQL Server, it suffers from the same issues. select try_convert(date,'200702') will return a valid date (2020-07-02) and select try_convert(date,'2007') will return 2007-01-01, although neither would normally be considered valid when dealing with 8-digit dates as input.

So although the OP is using 2012 and could use that function, extra validation would still be needed to prevent errors.

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
I was speaking to BxWill, mostly, but if you care:

The issue with using temp tables for staging is the load on the transaction log (with a million rows a day that's tough even with bulk-logged recovery mode), and I addressed this in more detail above. It's not accidental but very much on purpose that I wrote about creating a staging database and dropping it after the merge process. In short: bad experience, simply bad experience, and good experience with the suggested solution, provided you know your options.

On the topic of conversion: I showed what you get if you CAST or CONVERT. But I also showed what happens for a wrong date: no record. If I had added the messages tab, you'd see the errors. TRY_CONVERT does not error but results in NULL instead, also a nice way to handle wrong dates or detect them, though there is also ISDATE() for that purpose. No need to write lengthy CASE statements anymore:

[screenshot: ISDATE() and TRY_CONVERT results on sample date strings]


So your proposed solution can also be written very concisely as [tt]TRY_CONVERT(date, receipt_date) AS receipt_date[/tt], and if you'd like another default than NULL, [tt]CASE WHEN ISDATE(textdate) = 1 THEN CONVERT(datetime, textdate) ELSE otherdatetime END AS adatetime[/tt]. Or you can sort those rows out via [tt]WHERE ISDATE(...) = 1[/tt].

Bye, Olaf.
 
No Olaf,

my proposed solution cannot be written with a single TRY_CONVERT unless the input always comes in blocks of a fixed, known size, where anything shorter than expected is always to be treated as invalid, and where blocks of the expected size may still contain data that does not represent a valid date.

Looking at the following values, for example:
Code:
select dt1
     , try_convert(date, dt1) as date
     , try_convert(datetime, dt1) as datetime
from (select '20:00'  union
      select '2000'  union
      select '20.00'  union 
      select '200001'  union 
      select '2000010'  union 
      select '20000101'  union 
      select '20000000'  union 
      select '20:00:00'  union 
      select '20010001'
      ) t (dt1)
Although the list contains only one valid YYYYMMDD date, the output shows otherwise:

Code:
dt1	 date       datetime
20.00    NULL       NULL
20:00    1900-01-01 1900-01-01 20:00:00.000
20:00:00 1900-01-01 1900-01-01 20:00:00.000
2000     2000-01-01 2000-01-01 00:00:00.000
20000000 NULL       NULL
200001   NULL       NULL
2000010  NULL       NULL
20000101 2000-01-01 2000-01-01 00:00:00.000
20010001 NULL       NULL

As for the temp table: the reasons you gave do not, in my opinion and experience, warrant someone "not recommending" its use; they only warrant presenting other options with the usual "try them and use the one that performs better for each case".

Tempdb is indeed a very viable option in many cases, for several reasons
(assuming that the server memory is not enough to hold the data and it does have to spill physically to disk):

it is already sized taking the load on the system into consideration - if not, your DBAs aren't doing their work well.
it is split into different files/disks/LUNs, hence reducing possible IO contention.
on modern versions of SQL Server and modern servers it may even be located on SSD, making it faster to load onto.
there is no absolute requirement to explicitly drop the table as part of the process, even if it may be advisable.


Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
Well,

indeed I already assume YYYYMMDD-formatted data and errors like the ones BxWill reported, with 00 as the day or month portion. Obviously there are also wrong dates with an invalid month or day. I have advised loading into a varchar field, as BxWill did, which takes in everything, including dots and other separators, etc. All these cases can be inspected via ISDATE()=0, and acting on other wrong formats would be the task of other queries. A simple all-digits check, for example, could be done by testing whether replacing every digit leaves an empty field (see the sketch below).
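
A minimal sketch of that all-digits test (the column and table names are hypothetical); a pattern match is the shorter equivalent of replacing every digit and checking for an empty result:

Code:
-- Rows where the textual "date" column contains anything other than the digits 0-9.
SELECT *
FROM staging.dbo.stagetable
WHERE RECEIPT_DATE LIKE '%[^0-9]%';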

In regard to the temp table issue, I can tell more than just one story. Yes, it may be a DBA problem, too, but in the end, to avoid a disk overflow and a tempdb state that is not easy to shrink again, I simply work with separate new dbs that I drop after each staging process, and that just works fine no matter how clever or not the DBAs are. Speed of the drive is not of the essence if you use memory tables, as I suggest; they are still faster than SSDs. RAM is always the fastest thing in any system, unless you pair the newest SSD model with a very old server, but it would be a wonder to get that working, and even then the SSD would most probably not reach 100% of its performance, while an appropriate server would again have faster RAM.

But speed may not be the biggest concern overall; long-term stability is most important. Google Brent Ozar and tempdb and you will get nice tips on many aspects, too, but the process of staging data doesn't belong in the production data logs; what belongs there is merely the last merging step, so using tempdb for staging is only fine to me for test purposes. It's not only my experience: a colleague of mine came to a customer having problems with an unshrinkable tempdb, which simply resulted from loading a copy of user data into temp for a self-written login. You can slowly kill your tempdb; you can obviously simply restart the SQL Server to get a fresh one, but why allow anything to slowly "kill" your system, or even just get it into a troublesome state, if you can do without?

BxWill already stated that the outset is in the magnitude of a million rows of staging data (more precisely 500,000 to 850,000 records), so that is a high volume to care about; performance does not only depend on that, but of course also on the frequency of this ETL process. To me, a sure case for avoiding tempdb. MERGE, like any T-SQL query command, can access all databases on the same server instance; with a separate staging db you can create it on the drive or filegroup best suited for the staging purpose and make better use of multiple drives, controllers or RAID systems as necessary, and you can change that for any staging process and data volume, whereas tempdb is where it is and is already involved in many operations of production use. Should I continue enumerating advantages?

Bye, Olaf.


 
Differences of opinion. One million rows is a tiny volume for me; I deal with over 50 million rows daily in just one of my many processes, and tempdb on my main prod server during ETL handles well over 10k transactions per second with a max latency of 5 milliseconds (not SSD yet).

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
Appreciate the insight!

Bottom line - I need to ensure that the data is initially loaded into SQL Server 2012 as efficiently and quickly as possible, that the monthly receipt of the data is loaded into the respective table within SQL Server as efficiently and quickly as possible, and that any problems whatsoever with the data are quickly identified and definitely not loaded into production.

As a novice at actually loading and managing the SQL Server database going forward, I just want to ensure that I am setting everything up using "best practices", assuming that the volume of the data that will reside in the database 6 months down the road, 12 months down the road, etc. will increase. There will be a lot of data on which we will have to perform detailed multi-year trend analyses, pattern analyses of sales transactions, historical correlations between/among multi-year data, etc. I definitely want to spend the majority of my time actually analyzing the data, but I understand that if there is invalid data within the database, then the analyses are in vain no matter how quick they are...

I currently have SQL Server 2012 Developer installed on a Windows 7 Professional environment and am just trying to ensure that, as we scale with ever-increasing data, we will not have any issues with the way the data is set up.

The discussion regarding the use of staging tables within the same database vs. the use of a separate database for staging is noted. However, I will first correct the present code and then explore the other alternatives...

Will try the various suggestions and post back.
 
How much of a load one million rows puts on the transaction log is also a matter of frequency, as I also said.

Yes, it's a low volume in itself, as I already said:
myself said:
The roughly one million rows you need to process each day are in the GB range, I assume; this once was a huge amount of data, but today it's quite small, fitting into physical RAM multiple times on a modern server

No matter how fast your storage system is, RAM is always a magnitude faster, and no matter how small a million rows is, the transaction log load is permanent in backups, unless you don't care to keep log backups.

If this is a monthly process, as you now say, BxWill, then this is indeed a low volume overall, too. Speed may not be of the essence in this case, and growing logs are also not a general problem, as they can be shrunk after log backups; frequent log backups are a best practice anyway (see the sketch below).
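
A minimal sketch of that routine; the database name, logical log file name and backup path are hypothetical:

Code:
-- Back up the transaction log, then shrink the log file back down if it has grown.
BACKUP LOG ProductionDb TO DISK = 'D:\Backup\ProductionDb_log.trn';
DBCC SHRINKFILE (ProductionDb_log, 1024);   -- target size in MB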

Scalability is also an aspect for which MSSQL Server has several options. Besides, with your main usage of the database being BI, you mostly read data. If you aim for that, you should also consider learning MDX and perhaps even loading this data into OLAP cubes right after feeding your normal OLTP database.

Bye, Olaf.

 
Using the suggested code as displayed below:

Code:
set dateformat ymd -- force format of date to specific format as to avoid invalid conversions - normally not required if input is yyyymmdd

insert into Data2
-- remove if adding values as null is acceptable
select *
from (
-- end remove
select case
       when len(ltrim(rtrim(receipt_date))) <> 8 then null -- invalid date as size is not 8 
       when ltrim(rtrim(receipt_date)) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim(receipt_date)), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim(receipt_date)),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim(receipt_date))) = 1 then convert(date, ltrim(rtrim(receipt_date)), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
       end as receipt_date
from #Data1
-- remove if adding values as null is acceptable
) t
where receipt_date is not null
-- end remove

So far, I am not able to validate the data that is within the staging table and populate the table in production.

For the example given above, it appears that I am just validating the receipt_date field. I would like to validate all of the date fields and, in the near future, other fields as well.

Therefore, I assume that I can just stack the Case Statements - do something like the following to perform validation on two or more fields. [Bold]Is this correct or is there a more preferred route?[/Bold]


Code:
CREATE TABLE [#Data1](
	[ItemNbr]        [varchar] (60)  NULL,
	[LATITUDE]       [varchar] (60)  NULL,
	[LONGITUDE]      [varchar] (60)  NULL,
	[LAST NAME]      [varchar] (60)  NULL,
	[FIRST NAME]     [varchar] (100) NULL,
	[RECEIPT_DATE]   [varchar] (60)  NULL,
	[SALE_DATE]      [varchar] (60)  NULL,
	[SALE_AMOUNT]    [varchar] (60)  NULL,
	[DELIVERY_DATE]  [varchar] (60)  NULL,
	[ORDER_DATE]     [varchar] (60)  NULL
)

CREATE TABLE [Data2](
	[ItemNbr]        [varchar] (12)  NULL,
	[LATITUDE]       decimal (10,6)  NULL,
	[LONGITUDE]      decimal (10,6)  NULL,
	[LAST NAME]      [varchar] (25)  NULL,
	[FIRST NAME]     [varchar] (25)  NULL,
	[RECEIPT_DATE]   [DATE] NULL,
	[SALE_DATE]      [DATE] NULL,
	[SALE_AMOUNT]    [INT]  NULL,
	[DELIVERY_DATE]  [DATE] NULL,
	[ORDER_DATE]     [DATE] NULL
)

BULK INSERT #Data1
   	FROM 'C:\Data\201701SaleData.txt'
   WITH
      (
         FIELDTERMINATOR = '\t',
		 ROWTERMINATOR = '\n',
		 FIRSTROW = 2
	 )



insert into Data2
-- remove if adding values as null is acceptable
select *
from (
-- end remove
select 
[ItemNbr],
[LATITUDE],
[LONGITUDE],
[LAST NAME],
[FIRST NAME],

case
       when len(ltrim(rtrim(RECEIPT_DATE))) <> 8 then null -- invalid date as size is not 8 
       when ltrim(rtrim(RECEIPT_DATE)) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim(RECEIPT_DATE)), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim(RECEIPT_DATE)),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim(RECEIPT_DATE))) = 1 then convert(date, ltrim(rtrim(RECEIPT_DATE)), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
end as RECEIPT_DATE,
case
       when len(ltrim(rtrim(SALE_DATE))) <> 8 then null -- invalid date as size is not 8 
       when ltrim(rtrim(SALE_DATE)) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim(SALE_DATE)), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim(SALE_DATE)),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim(SALE_DATE))) = 1 then convert(date, ltrim(rtrim(SALE_DATE)), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
end as SALE_DATE,
[SALE_AMOUNT],
[DELIVERY_DATE], 
[ORDER_DATE]

from #Data1
-- remove if adding values as null is acceptable
) t
where RECEIPT_DATE is not null
or SALE_DATE is not null
-- end remove


I did initially use SELECT * but then reverted to listing all of the fields within the staging table that I wanted to insert into the production table, because I received the following error:

Code:
"Column name or number of supplied values does not match table definition."


Appreciate any additional insight as I am trying to understand the most efficient method to accomplish the objective.
 
Yes, stacking is what you need, as per your second block of code.

My example was just that - an example - to be expanded by yourself as required for all the fields with issues, and with all the remaining fields you need to insert into the destination table.

What matters in my example is the approach, which can be used for any field; some fields may require extra validation and others less, but that is up to you to determine based on the data you have.
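
One more detail worth noting, picking up the column-mismatch error mentioned above: naming the target columns explicitly in the INSERT keeps the statement working even when the select list and the table definition drift apart. Below is only a compact sketch of an alternative form, in which invalid values simply become NULL via TRY_CONVERT instead of being filtered out by the stacked CASE approach; the name columns are assumed to fit the narrower production widths.

Code:
INSERT INTO Data2 ([ItemNbr], [LATITUDE], [LONGITUDE], [LAST NAME], [FIRST NAME],
                   [RECEIPT_DATE], [SALE_DATE], [SALE_AMOUNT], [DELIVERY_DATE], [ORDER_DATE])
SELECT [ItemNbr],
       TRY_CONVERT(decimal(10,6), [LATITUDE]),
       TRY_CONVERT(decimal(10,6), [LONGITUDE]),
       [LAST NAME],
       [FIRST NAME],
       TRY_CONVERT(date, [RECEIPT_DATE]),
       TRY_CONVERT(date, [SALE_DATE]),
       TRY_CONVERT(int,  REPLACE([SALE_AMOUNT], '"', '')),  -- strip the stray double quotes
       TRY_CONVERT(date, [DELIVERY_DATE]),
       TRY_CONVERT(date, [ORDER_DATE])
FROM #Data1;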

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 