Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations SkipVought on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Remove "extra" crlf in windows CSV file. 2

Status
Not open for further replies.

densolis

MIS
Jan 7, 2009
7
0
0
US
Hi,

I am doing a conversion from Sale Force to Oracle on a Windows platform. I saw where I can get a windows version of awk to run on windows from Sourceforge. I do not know anything about awk other than it is a Unix command that does string manipulation.

Background:
The Accounts (customer name and address) table that I was given has CRLF and LF buried in the data. CRLF is also used to terminate the record in Windows. Here are a couple of examples of my data:

20,"10","1449 Jupiter Park Dr., <LF>
Unit #19","","Jupiter","FL","33458"<CRLF>

49,"10","420 E. Aviation Dr., <CRLF>
Suite 110","","Tucson","AZ","85714"<CRLF>

The data should be:

20,"10","1449 Jupiter Park Dr.,Unit #19","","Jupiter","FL","33458"<CRLF>
49,"10","420 E. Aviation Dr.,Suite 110","","Tucson","AZ","85714"<CRLF>

Questions:
Given that I have NO knowledge of awk, (I should be able to install it and get it running):
1. Can awk remove the CRLF and LF that are imbedded in the data but not remove the CRLF that terminate the record? The CRLF and LF that need to be removed will be within quotes, if that helps.
2. If awk can do that above, can you please supply the awk command to do it. Please remember that I have not used awk before.

Thanks,

Dennis
 
awk '{nq=gsub(/"/,"&");if(nq%2){x=$0;sub(/\r/,"");getline;$0=x $0}}1' input

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
PH,

Ah - how obvious (not).
I will start playing with your command and see if I can figure out what you are doing and then I will try it.

Can your recommend a good manual / document / webpage that will help me learn awk?

Thanks for your help.

Dennis

 
Hi PHV,

It's very nice, but IMO this sequence
Code:
awk '{nq=gsub(/"/,"&");if(nq%2){[highlight]x=$0;sub(/\r/,"")[/highlight];getline;$0=x $0}}1' input

is to be changed to
Code:
awk '{nq=gsub(/"/,"&");if(nq%2){[highlight #8AE234]sub(/\r/,"");x=$0[/highlight];getline;$0=x $0}}1' input
 
Good catch mikrom, thanks.
But I'm pretty sure that in a windows environment the \r is eaten by awk as the \n.
So I'd try simply this:
Code:
awk '{nq=gsub(/"/,"&");if(nq%2){x=$0;getline;$0=x $0}}1' input

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
PHV said:
But I'm pretty sure that in a windows environment the \r is eaten by awk as the \n.

I'm not sure. I have awk version which comes with MSYS
and when the densolis' file is in format DOS (CRLF) then it outputs only this
Code:
$ awk '{nq=gsub(/"/,"&");if(nq%2){x=$0;getline;$0=x $0}}1' densolis.txt
 Unit #19","","Jupiter","FL","33458"
 Suite 110","","Tucson","AZ","85714"

but when the densolis' file is in format UNIX (LF) the the output is OK:
Code:
$ awk '{nq=gsub(/"/,"&");if(nq%2){x=$0;getline;$0=x $0}}1' densolis.txt
20,"10","1449 Jupiter Park Dr.,  Unit #19","","Jupiter","FL","33458"
49,"10","420 E. Aviation Dr.,  Suite 110","","Tucson","AZ","85714"

Maybe there are other Windows versions of awk...

I have one more question:

Could you explain please, what does mean the last 1 at the end of your script, i.e.
Code:
awk '{...{...}}[highlight]1[/highlight]' input

I tried out that it has the functionality like
Code:
awk '{...{...;[highlight]print[/highlight]}}' input
or
Code:
awk '{...{...}[highlight]print[/highlight]}' input

but I don't understand why the 1 is placed at the end after the block
Code:
awk '[highlight]{...}1[/highlight]' input
 
1 is a pattern that always evaluate to true and thus the default action (ie print) is executed
I placed it in a new block to be sure to print all lines

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Hi,

As I said earlier, I do not know how to use awk.

I downloaded gawk from SourceForge and install installed it. I entered the following and got the following error:


C:\Renco\VAX\TXT Files\awk>gawk '{nq=gsub(/"/,"&");if(nq%2){x=$0;getline;$0=x $0
}}1' input
gawk: '{nq=gsub(//,
gawk: ^ invalid char ''' in expression
'");if(nq%2){x=$0;getline;$0=x $0}}1' input ' is not recognized as an internal o
r external command,
operable program or batch file.

C:\Renco\VAX\TXT Files\awk>awk '{nq=gsub(/"/,"&");if(nq%2){x=$0;getline;$0=x $0}
}1' input opportunity_awk.csv
awk: '{nq=gsub(//,
awk: ^ invalid char ''' in expression
'");if(nq%2){x=$0;getline;$0=x $0}}1' input opportunity_awk.csv' is not recogniz
ed as an internal or external command,
operable program or batch file.


So my question is, what am I doing wrong? Like I said, I barely know how to spell awk.

Thanks,

Dennis
 
I download gawk from here is some actual data. Even thought the data is on Windows, the data file terminate with a LF. Or at least that what I can figure out from displaying the data in hex in notepad++

Here is the header line and one line of bad data followed by a good line of data.

"Id","IsDeleted","AccountId","IsPrivate","Name","Description","StageName","StageSortOrder","Amount","Probability","ExpectedRevenue","TotalOpportunityQuantity","CloseDate","Type","NextStep","LeadSource","IsClosed","IsWon","ForecastCategory","ForecastCategoryName","CampaignId","HasOpportunityLineItem","Pricebook2Id","OwnerId","CreatedDate","CreatedById","LastModifiedDate","LastModifiedById","SystemModstamp","LastActivityDate","LastStageChangeDate","FiscalYear","FiscalQuarter","PrimaryPartnerAccountId","PartnerAccountId","EAU__c","Target_Price_at_EAU__c","Renco_Forecast__c","End_Product__c","Contract_Manufacturer__c","Sample_Approved__c","Reason_for_Lost__c","Opportunity_Notes__c","Outside_Rep2__c","Job_Suffix__c","Terr2__c","Terr3__c","OEM_Account__c","Next_Action_Due_By__c","Next_Action_Due__c","Next_Action__c","Previous_Action__c","SC__c","SLS__c","OEM_Rep__c","PL_Rep__c","RD_Rep__c","Commission__c","Approved_Sample__c","Order_Complete_Date__c","Notes__c"
"006A000000AuE5KIAV","0","001A000000TuWIGIA3","0","Component Distriburtors Inc.-Inductor-RL-1950 RL-1952","Ship to Customer - Zurvahn Corp, FL
End Customer - Accu-Sort Systems
Project - Danaher","Quote Request","4","","65","","","2011-01-13 00:00:00","New Project","","","0","0","Pipeline","Pipeline","000000000000000AAA","","","005A0000000TXpKIAW","2011-01-06 19:22:33","005A0000000TXpKIAW","2013-11-18 14:34:05","005A0000000TARMIA4","2013-11-18 14:34:12","","","2010","1","","","150","0.000","2. Pipeline (25%)","","","0","","Ship to Customer - Zurvahn Corp, FL
End Customer - Accu-Sort Systems
Project - Danaher","","","","","","","","","","","","","","","","","",""


The above is the header and a line of bad data. The data should be one complete line. Here is a good line of data:

"006A000000Av3sCIAR","0","001A000000TuVh5IAF","0","Aquacal-Transformer-RL-2260-130-24","Renco is raising the sales price.","Quote Request","4","","65","","","2011-01-19 00:00:00","Existing Project","Provide Quote","Outside Sales Rep","0","0","Pipeline","Pipeline","000000000000000AAA","","","005A0000000TXolIAG","2011-01-12 21:30:03","005A0000000TXolIAG","2013-11-18 14:34:05","005A0000000TARMIA4","2013-11-18 14:34:12","2011-03-26 00:00:00","","2010","1","","","4500","13.650","4. Best Case (75%)","Salt system for pool and spa","","0","","Renco is raising the sales price.","","","","","","","","","","","","","","","","","",""

Dennis
 
Create a file named, say, Dennis.awk with this content:
{nq=gsub(/\"/,"&");if(nq%2){x=$0;getline;$0=x $0}}1

and in the console window type this:
awk -f Dennis.awk opportunity_awk.csv

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Ok, I did what you said above. I then ran the command and all of the output come to the screen and my file was left untouched.

Is this what is supposed to happen? Or am I supposed to redirect the output to a file?

Like I said, I am just learning how to spell akw.

Thanks
 
awk -f Dennis.awk opportunity_awk.csv > opportunity_new.csv

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
ok - did that and it output the data to a file. Interesting this is when I looked at the output, the new file was bigger in size according to the directory.
Original file 9,214k and had 18,548 lines
New file is 9,232k and has 18,411 lines.

Also, there are multiple CRLF in a given record. The above process only removed the first. So do I have to run this process multiple time, once on each new output file until I remove all of the CR LF?

Also, some lines have just a entry LF in the data. How does the process know which LF is extra and which is the actual "end of record marker"?

Thanks,
 
One other question. How do I replace the extra Cr LF with a ", ". Can I do the same thing with the extra "LF"?

Thanks,
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top