Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

AWK - change the field delimiters within a closed CSV file from a comma to a pipe symbol 2

Status
Not open for further replies.

BxWill

MIS
Mar 30, 2009
367
US
thread271-1784472

Revisited the post above regarding the processing of text files using AWK.

I have a large CSV file - 220 megabytes with approximately 1,300,000 records.

[Bold]
Upon reviewing the CSV file, there is a comma within one of the fields - the field "City" that contains data

such as "New York City, New York."[/Bold]


Due to the CSV file containing a comma within one of the fields, I am not able to successfully import the data into a Sql Server database.


Is it possible for AWK to change the field delimiter within a CSV file from a comma to a pipe symbol without opening the CSV file?

 
How will awk know if the comma is data or a field delimiter?

I would consider using sed to change "New York City, New York" to New York City New York" and the same for any other similar occurrences of comma within the city field.

You can also check the unix/linux tr (translate) command as far as switching all commas to pipe (or any other character).

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


 
The problem with specifically changing "New York City, New York" to "New York City New York", is that it's very specific and won't actually catch something like "Sacramento, CA". You'll end up playing whack-a-mole with every load.

Each record/row in the CSV should have one less comma than the number of fields you are expecting. That is, if you expect 10 fields, there should be 9 commas. I would first "filter" the file with grep and spit out those rows without the expected number of commas.

Example using five fields per valid record in the CSV...

Code:
$ cat commas.csv
one,two,three,four,five
1,2,3,4,5
a,b,hello, world,d,e
123,234,345,456,567
1,2,3
$
$ grep '^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*$' commas.csv | tee commas.good
one,two,three,four,five
1,2,3,4,5
123,234,345,456,567
$
$ grep -v '^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*$' commas.csv | tee commas.bad
a,b,hello, world,d,e
1,2,3

Process the "good" file as normal, and manually inspect the "bad" records. The number should be fairly small. Fix those, then load them. You could even make your script notify you only when there are bad records to fix. That way you're not putting a manual step in an automated process. Only to handle exceptions.

 
Excellent thought - differentiating between a comma within a particular field relative to a comma separating a field.

Great insight to compare the total number of commas expected and separating the "good" vs. the "bad" records.

Especially considering that there may be more than one instance where commas are within a particular field.

Interesting point is that changing to a pipe-delimited format, I want to ensure that there is not a trailing pipe at the end of each record.

Will review grep and the example code.

 
The problem with changing your comma separator to something like a pipe character, is that you still have the same problem, just with a different character.

For example, changing the commas to pipes in this record gets you a mess...

Code:
one,two,a|b|c,four,five

If the data can contain a pipe character, you just broke a good record. That example record, five valid fields becomes seven fields and will fail to load.
 
SamBones makes a good point. You need to search the file first to find an ASCII character that can become the delimiter. (one that isn't already within the file). When using the translate command to change the delimiter, translate is NOT limited to printable characters. It can take Hex or Octal values for the translate.

==================================
advanced cognitive capabilities and other marketing buzzwords explained with sarcastic simplicity


 
Actually my point was to leave the separator as a comma. I speak from experience. Universally changing the separator character throughout the entire file is in effect altering your source data. There are a lot of hidden problems that will result in data corruption. Or having to intervene every time you have to run the load. Plus, if this data plays in any kind of financial or regulatory accountability, an auditor will freak when you tell him/her how much you're altering the data prior to loading it.

Personally, I would spend a little more time learning the data. Just using grep and the simple pattern I gave, you can find every single record that has too many, or not enough delimiters. Looks at those bad records and see what you're getting. Is it an occasional "City, State"? A name like "John Smith, Jr.". Maybe a numeric field with something like "1,200". The problem is, you just can't do a global search and replace of commas with some other character. The tools you are talking about don't have enough brains to do it right and you'll just be altering your data.

Actually, if you look at the "spec" for CSV files (RFC 4180), it does allow for quoting fields so they can legally contain delimiters. For example...

Code:
231,"Jim Jones, Junior","123 Main St, Apt 12a","New York, NY",10038

That's five fields, the middle three enclosed in quotes. So, even though there are 7 commas in the record, there are only 4 comma delimiters. Load programs like Oracle's SQL*Loader handle this with no problems at all. A lot of other load programs I've used in the past can handle commas in quoted fields too. Excel can load CSVs with extra commas in quoted fields, but it's limited to 1 million rows. Informatica can handle this kind of CSV too.


So what are you trying to load this data into? What kind of database, and using what kind of load tools?

Is there any way you could share a few records? Maybe obfuscate any PII details.
 
Reviewed the data in detail and it appears that there is indeed just one field within the CSV file that has a comma within the field.

That field is titled "City" as originally stated.

Further, it appears that this will always be the case.

Yes, considering the Auditor's perspective, altering the source data should not be performed as stated by SamBones.

However, considering that there is only one field that contains a comma, the idea of searching a 220 megabyte CSV file and replacing with a delimiter that is not within the file is a novel approach.


Loading into a Sql Server Db via Bulk Insert with example below;

Code:
BULK INSERT #TestData1
	FROM 'C:\Project\EquipmentData.txt' 
   WITH
      (
		 --FIELDTERMINATOR = ',',  --CSV Field Delimiter  Cannot bulk load...
		 FIELDTERMINATOR = '\|',  
		 --ROWTERMINATOR = '\n',    Error - column is too long ... for Row 1 column 34
		 --ROWTERMINATOR = '\r\n',  Error - column is too long ... for Row 1 column 34 
		 --ROWTERMINATOR = '\r',    Error - column is too long ... for Row 1 column 34 
		 ROWTERMINATOR = '0x0a',    
		 FIRSTROW = 2,
		 LASTROW = 100
      );
GO


Noted is SamBones comment
[Bold]
"A lot of other load programs I've used in the past can handle commas in quoted fields too. Excel can load CSVs with extra commas in quoted fields, but it's limited to 1 million rows. Informatica can handle this kind of CSV too."[/Bold]

After several days of trial and error, this coincides with my present thought - importing a subset of the data from the CSV file into MS Excel and then saving/converting to a pipe delimited format for import into the Sql Server database.

But, I want to load all of the data at the same time.

johnherman, is it possible to provide an example of your approach - searching the CSV file and replacing the delimiter with a delimiter that is not within the file?

SamBones, Is it possible to provide a list of the other load programs that can handle commas in quoted fields for CSV files with over a million rows?

Thanks in advance for any additional insight!
 
You still haven't shown us an example record with a comma within a field. I don't mind you changing names and numbers to protect the innocent, but it would be good to see an example record. If your comma split field is enclosed in quotes, you might be able to fix it with [tt]sed[/tt] or [tt]awk[/tt] as per your original request. Something like...

Code:
$ cat tt.csv
one,two,three
one,"hi, there",three
"123 main st., Apt 22","New York, NY",90023
$
$ awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' tt.csv
one,two,three
one,hi there,three
123 main st. Apt 22,New York NY,90023
$

The field has to be properly quoted though, otherwise there's no way for it to know which commas to lose.

Looking at the syntax for the bulk insert, it does look like there is a "FIELDQUOTE" parameter that can be specified. Something like this?

Code:
BULK INSERT #TestData1
	FROM 'C:\Project\EquipmentData.txt' 
   WITH
      (
		 FORMAT = 'CSV' ,
		 FIELDTERMINATOR = ',' ,  
		 FIELDQUOTE = '"' ,
		 ROWTERMINATOR = '0x0a' , -- This is a newline char (\n), is this coming from *nix?
		 FIRSTROW = 2 ,
		 LASTROW = 100 ,
		 FORMATFILE = '<path_to_format_file_if_used>'
      );
GO

Also, that 'FORMATFILE' can let you do some more granular control I believe.

This is a TSQL BULK INSERT, right? Have you looked into SSIS BULK INSERT? One of my DBA homies here said SSIS BULK INSERT or BCP can handle quoted fields properly.

One last thing, most programs that can create a CSV file can specify a different field separator character. See if the app/program creating it could use a pipe character from the beginning instead of a comma. Or some other character. I would recommend nothing that's whitespace (i.e. tab) and not an easy to type character. Maybe something that requires an Alt-### to enter (see charmap for ideas.

 
Using EmEditor, I was able to review the CSV file in more detail.

Noted are the following;

1. There are 1,259,409 records
2. From Row 1,094,712 to Row 1,259,408 - There is a comma at the end of the record
3. The last Row contain text - evidently a comment.

Consequently, upon receiving large text files with the number of records exceeding the number that can be opened with
MS Excel, I need to develop a process to quickly review and clean the data.

My understanding is that AWK can perform the "cleansing" of the data.

Further, I can see why I was not readily able to Bulk Insert the data into the Sql Server database because of the inconsistent record endings and the comment on the last line.

In this situation, there were just 1,259,409 records. What if there were 5,000,000 + records?!

Consequently, I need to develop a process to identify records with errors such as commas at the end of some of the records, comments on the last row, etc.

Question - Does functionality exist within AWK to eliminate the commas at the end of the line for some of the records and any text on the last record that has a different pattern than the preceding records?

Attached is a copy of sample data for reference. Note, within the sample file, records 16 through 8061 have a comma at the end of each record. There are a total of 8061 records within the sample file.

Thanks in advance for any additional insight.
 
 https://files.engineering.com/getfile.aspx?folder=31162664-6072-4b38-a71c-59cb36849595&file=SampleData_20190607.txt
Yes, with 'sed' and 'awk' you can do anything you need to do with the records.

What OS are you on? If this is Linux or some Unix, you're good to go. If it's Windows, your options change a little.

This command will remove that comment on the last row...

Code:
head -n -1 filename.csv

This command will remove the first line of headers if you so desire...

Code:
tail -n +2 filename.csv

This command will remove that extra comma in the quotes (replacing it with <space><dash><space>, "New York, NY" becomes "New York - NY")...

Code:
sed 's/\(".*\), \(.*"\)/\1 - \2/1' filename.csv

If that trailing comma at the end of some lines is an extra, this will remove it...

Code:
sed 's/,$//g' filename.csv

Actually, I just checked the data and that trailing comma is not extra. You actually have a NULL field at the end of some of the records. So the trailing comma needs to stay. This will show you that the comma at the end needs to stay...

Code:
head -20 filename.csv | sed 's/[^,]//g'

Putting it all together, this might clean your file...

Code:
head -n -1 filename.csv | tail -n +2 | sed 's/\(".*\), \(.*"\)/\1 - \2/1' > filename_cleaned.csv

Again, I'm more comfortable with 'sed', so this is using 'sed', not 'awk'. Just personal preference (and more appropriate for a serial edit of a file).

 
Great catch on the trailing comma issue!

Upon opening the file with EmEditor, I was not able to display the data in a columnar mode where I could have readily see that there was a null value at the end of some of the records.

Now, that I think about this a little more, actually parsing the field into two columns may provide more flexibility within the database. Instead of "New York , NY" in one field, maybe having the city in one column and the state in a column.

Therefore, can AWK or sed remove quotes and allow for the parsing of the data that results in the city name being in one column and
the State abbreviation being in a separate column?

My initial thought is that as long as there is a city name, state name within quotes on each record, then there will not be an issue.

The other thought that I have is - Can there be conditional tests performed using AWK and/or sed?

For example, maybe like nested IF Statements in MS Excel -

If there are comments at the end of a file, delete the comments; If there is this problem, resolve it; If there is this problem resolve it, and so on?

Sort of like a process that can be run on any large test file to detect the common problems (and not so common problems), catalog them and clean the text files so that they can be bulk inserted into Sql Server?


 
As far as if AWK can process the text file via nested IF functionality, this was already addressed during the previous thread.

Will re-visit the previous thread...

However, still interested if functionality exists within AWK or sed to parse the text that is within the quotes into two separate columns.

 
BxWill said:
Therefore, can AWK or sed remove quotes and allow for the parsing of the data that results in the city name being in one column and
the State abbreviation being in a separate column?

Code:
$ # Remove double quote and space between comma and state abbreviation
$ sed 's/"//g;s/, /,/g' filename.csv > filename_fixed.csv

Just killing the double quote and the space before the state abbreviation is all that's needed. The comma is still in between the fields. The only possible problem is if you have any record that doesn't match this format ("City, ST").

 
Downloaded the "Complete Package, except sources" for sed (version 4.2.1) and installed on a Windows 7 64-bit laptop.


Copy pasted the following
"sed 's/\(".*\), \(.*"\)/\1 - \2/1' C:\EquipmentData_Latest.txt"

at the command prompt "C:\Program Files (x86)\GnuWin32\bin>" to remove the comma within the field "City."

Just to recap, the field "City" contains the City and State with a comma between the city and state and the data within this field is enclosed with double quotes for each of the over 1.3 million records.


Receive the following error:

sed: -e ex[ressopm #1.char 1: unknown command: '''


Currently attempting to resolve the error. However, considering that I am using sed for the first time, I am not quite familiar with the nuances/intricacies of sed to readily determine what is causing the error.

Any immediate insight as to the resolution of the above-mentioned error?



 
That command like was assuming a Unix like environment. Windows handles single and double quotes differently. You could try putting the [tt]sed[/tt] commands into a file.

Code:
s/\(".*\), \(.*"\)/\1 - \2/1

Then run it like this...

Code:
sed -f sed_commands.txt C:\EquipmentData_Latest.txt

 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top