Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Clean Large Text File - Use MS Excel or Text Editor - prior to import into 2012 Sql Server DB 5

Status
Not open for further replies.

BxWill

MIS
Mar 30, 2009
367
US
Need to clean several large pipe delimited text files prior to import into a newly created Sql Server 2012 database.

The plan is to load the Sql Server database and then append data to the database on a monthly basis.

Each text file contain 500,000 to 850,000 records with over 50 columns. Also, there are three date columns in the format YYYYMMDD.

Specifically, I need to review and clean each text file. Cleaning include;

1. Extract records that have a zero in the any of the three date columns.
2. Extract records that have zeroes in the Month and Day portion of the date. (for example, has a format like "20160000").

Initially, I opened the text files using MS Excel but then researched the use of a text editor that can/should be used.

Downloaded and installed the following text editors - GVIM, Sublime Text, and EmEditor. Upon opening the largest text file in each of the editors, I was not readily able to display the data in a columnar format that would facilitate a review of the records.

Based on my experience with MS Excel, It appears that editing the text files using MS Excel would be convenient and relatively easy compared to the use of the text editors.

Did also create the Create Table and Bulk Insert Sql Scripts to load the database(prior to cleaning of the text files). All of the tables were loaded except one. The one table that was not loaded failed because of records that have zeroes in the Month and Day portion of the date. (for example, has a format like "20160000").


Any insight as to the preferred method to clean the text files so I can initially load the Sql Server database?

Any additional insight as to the use of SSIS or Bulk Insert to perform the intial loading of the database and the subsequent appending of data on a monthly basis that will enable the storing of all records with errors in a "temporary" table and just load records that do not have errors?



 
Bill, do you really need this complex verification? Do you really have cases of text dates, as Frederico suggested, like '20:00' or '20:00:00'. As faar as you told you had cases like '20170100' and '20170000', i.e. 00 as month or day. If so TRY_convert would suffice in finding those and turn them to NULLs, so you could replace month and day positions with '01' to fix those dates or skip them and categorize them as bad data.

Admitted Fredericos case statement is more thoroughly (turning more wrong values into NULL) than try_convert, you can check length of trimmed values to detected too short dates, you can check for ':' contained in the data, etc. to also sort those out, if they ever will occur, even though they now don't.

To apply same checks on many fields, I would also start designing some scalar valued user defined functions, so you can apply them easier.

Overall, if you're not so well versed with T-SQL, you could consider choosing a data cleansing method using your preferred language, sorry to say so, but you don't make the impression you gain much from the fact you have your data in MSSQL tables.

Bye, Olaf.
 
Appreciate the insight.

Proficient for the most part with T-SQL but not advanced relative to my expertise with MS Excel, MS Access, and Power Query/Power Pivot. Not a programmer although I am comfortable with creating
some VBA within MS Excel and MS Access.

As this is the first Sql Server database that I will create, the slope of the learning curve has decreased quite significantly over the last two to three months. Of course there are some intricacies
and nuances related to the creation of a Sql Server database that I am determined to master to ensure that the database is setup as efficiently as possible.

Finalizing the case statements and have begun the review of try_convert and scalar valued user defined functions.

Considering that the data that I receive are usually large text files, the goal is to quickly review, validate and load the "good" data as quickly as possible so that we can perform detailed trend analytics.

I realize that there are various methods to accomplish the goal and I am interested in learning how to perform the validation task at least two to three different ways - using case statements or using try_convert
or using scalar valued user defined functions or using staging tables in another database, etc. etc.

Any additional insight is appreciated.
 
A Scalar-valued function is just a function you define, i.e. to define a more generic and general CASE statement to process a parameter value passed in and returning either false/true in regard of the text being a valid date in your terms (other than perhaps what TRY_CONVERT says) or to return the date as converted and NULL for non convertible dates, just like TRY_CONVERT does, only with your rules.

This just helps to write less code, you don't need to copy over the CASE statement for each field, you just call [tt]dbo.yourdateconvert(stagingtable.field) as newdatetime[/tt]

Calling a function instead of putting it directly into the SQL statement has both pros and cons in performance, but you'll see how it behaves. It shortens your code and effort for sure and also gives you one central place to change for any date validation insteead of needing to mend or change all copies of the case statements, if necessary.

Bye, Olaf.

 
Apparently, there are asterisks in several records for quite a few columns of the text data.

Encountered the error
Code:
Conversion failed when converting the varchar value '********' to data type int.

Upon downloading and installing a third-party tool (ApexSqlSearch), I was not able to search for the instances where there are asterisks. Therefore, I created SQL and searched the integer fields.

Code:
Select count([SALE_AMOUNT])
FROM #Data1
where [SALE_AMOUNT] = '********'

As a result of the query, there were three fields that had asterisks on some of the records.

Created Case statements similar to the following;

Code:
Case
   When [SALE_AMOUNT] = '********'
   ELSE [SALE_AMOUNT] * 1
end as [SALE_AMOUNT]

Now, upon trying to insert the data from the staging table into the test production table, another error results;

Code:
Error converting data type varchar to numeric

It appears that I have to repeat the process again - reviewing any decimal fields for asterisks and/or non-numeric characters and then constructing the associated Case statement.

Did search the internet for an example of a scalar-valued function. So far, I have not found an example...

Is it possible to provide an example of a scalar valued function that validates at least two of the fields below?
Code:
[LATITUDE]   decimal (10,6) ,
		[LONGITUDE]  decimal (10,6),
		[LAST NAME]   [varchar] (25)  NULL,
		[FIRST NAME]   [varchar] (25)  NULL,
		[RECEIPT_DATE] [DATE] (10) NULL,                  
		[SALE_DATE] [DATE] (10) NULL,                      
		[SALE_AMOUNT]  [INT] NULL,
		[DELIVERY_DATE] [DATE] (10) NULL,       
		[ORDER_DATE] [DATE] (10) NULL


Any additional insight is appreciated.
 
Code:
Case
   When [SALE_AMOUNT] = '********'
   ELSE [SALE_AMOUNT] * 1
end as [SALE_AMOUNT]

Please refer to the books online: This is incomplete, missing THEN and what value to return in case of ****.

Code:
--Searched CASE expression:
CASE
     WHEN Boolean_expression [highlight #FCE94F]THEN result_expression[/highlight] [ ...n ] 
     [ ELSE else_result_expression ] 
END

BxWill said:
Did search the internet for an example of a scalar-valued function. So far, I have not found an example...

Are you sure? Just start up MSSQL Management Studio and take a look at all you have below the database node of the object explorer tree
sqlfunction1.png


Scalar Valued Functions are the simplest concept of programming in any language, even not at the level of designing OOP classes or more complex things, they return a scalar value, which just means a simple value, a simple single type, like a double float or a char(20) or whatever scalar value in contrast to a vector of many values or an array or table valued function, which you find in the node above scalar-valued functions.

Are you a developer at all? If not, this is not your job, you should hire one.

Bye, Olaf.
 
sample function for integer validation
This function is limited but contains some rather important aspects.
patindex - to determine if there is any invalid character on the string. I would add a "." to it as to allow for decimals
checking to see if left char is a "-" for sign. I would also look for the rightmost char for same purpose and also for "+" sign.

Function does not check if there are trailing/leading spaces. at least the trailing ones could be removed on the validation as to allow string like " 1235" to be considered valid.

The particular function is only checking if the string is a possibly valid integer - it would be easy for you to modify it and return the converted value back but on that case although it works fine for integers, a similar one for decimals would be more limiting because of the possibly varying number of decimal positions.

you can also use the sample function to add the date validation you were given and create a function to validate a date.

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 
Had to step away from this but now returning...

Just to recap,

Have a large text files - from 500,000 to 850,000 records to be loaded to Sql Server monthly. Format for date fields are YYYYMMDD. Latitude and Longitude fields do not have decimals. Also, there is an alpha character at the end of the longitude data (for example, "C" or "T."


Created staging table using tsql, #Data1

Code:
CREATE TABLE  #Data1
Latitude  [varchar] (60) NULL,	
Longitude  [varchar] (60) NULL,
Sale Amount [varchar] (60) NULL,
Total Amount [varchar] (60) NULL,
Sale Date [varchar] (60) NULL,
Receiving Date [varchar] (60) NULL,
.
.
.


Created test production table, SampleTable

Code:
CREATE TABLE  Data2
Latitude  decimal (8,6) NULL,	
Longitude  decimal (9,6) NULL,
Sale Amount int NULL,
Total Amount int NULL,
Sale Date date NULL,
Receiving Date date NULL,
.
.
.

Upon trying to load the data from the staging table to the test production table, I receive the following error;

Code:
Arithmetic overflow error converting varchar to data type numeric.
The statement has been terminated

I am using the Tsql below. What revisions are recommended so that the error no longer appears?
Also, any insight as to how I should cast the longitude data considering that it has an alpha character at the end?

--********************************************************************

Code:
insert into Data2
select
Case 
  WHEN [Latitude] LIKE N'%[^.0-9*-]%' OR LEN([Latitude]) > 11
                THEN 'Invalid Latitude (bad character OR too long). '
  WHEN [Latitude] NOT LIKE N'%*%' AND ISNUMERIC([Latitude]) = 0 AND LEN([Latitude]) > 0
                THEN 'Invalid Latitude (not a number). '
  WHEN ISNUMERIC([Latitude]) = 1 AND AND CHARINDEX('.', [Latitude]) = 0
                THEN Cast([Latitude]/1000000.0 as decimal(8,6) as [Latitude]
  ELSE [Latitude]
END as [Latitude],

Case 
  WHEN [Longitude] LIKE N'%[^.0-9*-]%' OR LEN([Longitude]) > 11
                THEN 'Invalid Longitude (bad character OR too long). '
  WHEN [Latitude] NOT LIKE N'%*%' AND ISNUMERIC([Longitude]) = 0 AND LEN([Longitude]) > 0
                THEN 'Invalid Longitude (not a number). '
  WHEN ISNUMERIC([Longitude]) = 1 AND CHARINDEX('.', [Longitude]) = 0
                THEN Cast(isnull((Left(LTRIM([Longitude]),7)),0) as decimal(9,6)
  ELSE [Longitude]
END as [Longitude],

Case
  When IsNumeric([Sale Amount]) = 0
       Then Null -- Non Numeric data
  When Len([Sale Amount]) > 11
       Then Null --Invalid length
  ELSE [Sale Amount]
END as [Sale Amount],


Case
  When IsNumeric([Total Amount]) = 0
       Then Null -- Non Numeric data
  When Len([Total Amount]) > 11
       Then Null --Invalid length
  ELSE [Total Amount]
END as [Total Amount],

Case
  when len(ltrim(rtrim([Sale Date]))) <> 8 then null -- invalid date as size is not 8 
  when ltrim(rtrim([Sale Date])) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date 
  when right(ltrim(rtrim([Sale Date])), 2) = '00' then null -- not valid as day part is zeros
  when substring(ltrim(rtrim([Sale Date])),5 , 2) = '00' then null -- not valid as month part is zeros
  when isdate(ltrim(rtrim([Sale Date]))) = 1 then convert(date, ltrim(rtrim([Sale Date])), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
  else null
end as [Sale Date],

Case
  when len(ltrim(rtrim([Receiving Date]))) <> 8 then null -- invalid date as size is not 8 
  when ltrim(rtrim([Receiving Date])) not like '[0-9][0-9][0-9][0-9][0-1][0-9][0-3][0-9]' then null -- not a valid format for date -- note the 0-1 and 0-3 to do a tiny bit of validation at this point
       when right(ltrim(rtrim([Receiving Date])), 2) = '00' then null -- not valid as day part is zeros
       when substring(ltrim(rtrim([Receiving Date])),5 , 2) = '00' then null -- not valid as month part is zeros
       when isdate(ltrim(rtrim([Receiving Date]))) = 1 then convert(date, ltrim(rtrim([Receiving Date])), 112) -- now use isdate to verify that string is a valid date and convert if so using specific format 112
       else null
end as [Receiving Date]

from #Data1
where [Latitude] is not null
or [Latitude] is not null
or [Sale Amount] is not null
or [Total Amount] is not null
or [Sale Date] is not null
or [Receiving Date] is not null

Assumption is that I should check the fields that have a decimal data type even though the data within the source file does
not contain any data with decimals.

I am currently reviewing the maximum length of the values in each field that have a decimal data type and will then manually revise the precision for all of the fields with a decimal data type.

Any additional insight as to the resolution of this error is appreciated.



 
The error points out the value of varchar uses more decimal places either after or before decimal point than the numeric field has. Higher your N(x,y) definitions, don't let them be too tight.
I would define Latitude and Longitude as double float anyway, that would do away with such overflows in both the magnitude and precision.

N(8,6) gives enough room for +90.0000 to -90.0000, but would eg fail on 5 or more decimal places. You don't necessarily have wrong data, but more precise than you can store.
You could also make further validation of that than with ISNUMERIC, define user defined functions checking the range to be between -90 and 90 or -180 and 180 respectively and rounding to the wanted precision.

PS, also take a careful look for your case statements about Latitude and Longitude, you got that messed up.

Bye, Olaf.

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top