Context: I have a rather large text file: over 800,000 records with more than 180 fields per record, with commas separating the fields. There are also quite a few instances where several fields contain commas within the field values themselves!
Consequently, it appears that I cannot use BULK INSERT in T-SQL to load the "clean" records into the SQL Server database table.
Therefore, it appears that AWK may be the tool to resolve this problem. Note that I have never used AWK and am not at all familiar with its functionality, but resolving this issue via AWK appears preferable to the alternative: opening the text file in MS Excel, using a volatile function to count the commas in each record, extracting the "good" records, and then converting to a tab-delimited file for bulk insert into the SQL Server database.
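From my initial reading, it appears that AWK's built-in NF variable (the number of fields in the current record) would detect the embedded-comma problem directly. A minimal sketch based on that reading, assuming a hypothetical expected field count of 181 and a file named bigfile.txt (both would need to be adjusted to the actual layout):
[Code]
# Print the record number and field count for every record whose
# comma-split field count differs from the expected value; any
# embedded comma inflates NF, so these are the suspect records.
awk -F',' -v expected=181 'NF != expected { print NR ": " NF " fields" }' bigfile.txt
[/Code]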
Over the last hour or so, I downloaded files from the Cygwin site but then restored the laptop to a system restore point because it appeared to be "bogged down" with unnecessary files! Although I was able to search for "GAWK", it appeared that unnecessary, unrelated files were downloaded as well.
[Bold]
What is the simplest method to download and install AWK on a Windows 7 Professional 64-bit laptop? Is it preferable to install 32-bit AWK instead of 64-bit?[/Bold]
What is an honest assessment of the learning curve for a novice to master a quick review of a text file via AWK, i.e., determining the proportion of "bad" records relative to all of the records within the file? That way, I could quickly send the file back to the source department instead of spending an inordinate amount of time extracting the "good" records.
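From what I can tell, the proportion question itself appears to be nearly a one-liner. A sketch, again assuming a hypothetical expected field count of 181:
[Code]
# Count the records whose field count is off, then report the bad
# proportion over the whole file (NR = total records read).
awk -F',' -v expected=181 '
    NF != expected { bad++ }
    END { if (NR > 0) printf "%d of %d records bad (%.1f%%)\n", bad, NR, 100 * bad / NR }
' bigfile.txt
[/Code]
The single-quoting above is a Unix-shell convention; under the Windows command prompt it is apparently easier to save the program text to a file and run it with gawk -f.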
(In other words, going beyond a single count, is it possible for me to develop an "AWK template" that I can use to quickly review every text file from the source department and validate various fields, such as:
1. Zip code: is it a valid length, and is it valid for the particular city?
2. Dates: is it a valid date?
3. Other fields: is a field numeric or alphabetic when it should be the other?
4. For comma-delimited files: are there commas within one or more fields? etc. (see the template sketch after this list)
For example, if the proportion of "bad" records exceeds the "bad record" threshold (say, 50%), I can readily send the file back to the source department instead of spending three to four weeks validating and loading only the good records!)
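To make the "AWK template" idea concrete, here is a sketch of what one might look like. Loud caveats: the field positions (zip code in field 5, date in field 12, a numeric code in field 7), the YYYY-MM-DD date shape, the expected field count of 181, and the 50% threshold are all hypothetical placeholders, and a regular expression can only check the shape of a zip code or date; validating a zip code against its city, or a date against the calendar, would require a lookup table or additional logic:
[Code]
#!/usr/bin/awk -f
# validate.awk -- sketch of a reusable per-file validation template.
# Run as: gawk -f validate.awk bigfile.txt
BEGIN { FS = ","; expected = 181; threshold = 0.50 }

# Embedded commas shift every later field, so a wrong field count
# disqualifies the record before any per-field check runs.
NF != expected { bad++; next }

# Zip code (assumed field 5): five digits, optionally ZIP+4.
$5 !~ /^[0-9][0-9][0-9][0-9][0-9](-[0-9][0-9][0-9][0-9])?$/ { bad++; next }

# Date (assumed field 12): YYYY-MM-DD shape only, not calendar validity.
$12 !~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { bad++; next }

# Code field (assumed field 7): must be all digits.
$7 !~ /^[0-9]+$/ { bad++; next }

END {
    if (NR == 0) { print "Empty file."; exit }
    printf "%d of %d records bad (%.1f%%)\n", bad, NR, 100 * bad / NR
    if (bad / NR > threshold)
        print "Over threshold: return the file to the source department."
}
[/Code]
If that reading is right, each new file from the source department would only require adjusting the BEGIN line and the per-field rules.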
Any additional insight relating to the use of AWK to resolve the issue of "bad" records within a relatively large text file is appreciated.