Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations strongm on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

CRLF in middle of pipe delimited file

Status
Not open for further replies.

Swi

Programmer
Feb 4, 2002
1,963
US
Hi,

I have a client who has sent in a pipe delimited file however one of the fields they sent seems to containg CRLF in the middle of the field. The end of record marker is also a CRLF.

What are your thoughts on the best way to handle this?

If we know number of columns from header could then read line by line and concatenate string until # of fields is hit and move on.

Any other way to do this that you can think of?

Thanks.

Swi
 
I would think the simplest method would be to eliminate all the CRLF characters using the replace method, then re-add them where they need to go (if you're talking about string manipulation that is)

For example, a multi line text file, with each line being a record and the CRLF at the end of the line to mark the end:

open "mytextfile.txt" as input as 1
open "newtextfile.txt" as output as 2
while eof(1) = false

line input #1,origregord 'stored the line in a string​
newline = replace(origrecord,"CRLF","") 'cuts out all instances of CRLF in the line​
newline = newline + "CRLF"​
print #2,newline​

wend
close 1
close 2

This would only work if the records were individual lines in the file and could be read separately.
If it's all together in one string of characters, and the delimiter characters for each record are also in the records themselves, then I'm not sure how you'd distinguish them apart.



------------------------------------
[yinyang] Over 30 years of programming, and still learning every day! [yinyang]
 
Unfortunately it is all in one string of characters. So the end of record marker is a CRLF but also several fileds within the file also contain CRLFs. Seems like these file fields came from a free form text field from an online form or database field. Thanks.

Swi
 
Yeah, that's a MESS.

I wonder if your client could do a find/replace in their software to weed out all instances of the CRLF characters in the fields and then re-export to file.... Hmmmm....

------------------------------------
[yinyang] Over 30 years of programming, and still learning every day! [yinyang]
 
Yeah, I've asked that but they refused. [sad] Thanks.

Swi
 
If we could see an example of the file, that might help suggest a solution
 
Is there a way to identify the first column (e.g. an ID field that can be matched to a Regex)?

"Knowledge is power. Information is liberating. Education is the premise of progress, in every society, in every family." (Kofi Annan)
Oppose SOPA, PIPA, ACTA; measures to curb freedom of information under whatever name whatsoever.
 
Valid CSV can contain line breaks in the fields, when the fields are enclosed in the quotes
- for example like this:

line_breaks.csv
Code:
"aa"|"B
b"|"C
C"|"d
D"|"e
E"
foo|bar|baz|spam|eggs

File like this can be handled well. You can write a small utility which turns your CSV with line breaks into a CSV without line breaks. There are modules for CSV processing for many languages available. Here a short script in python

Code:
[COLOR=#800080]import[/color] csv, re

[COLOR=#0000ff]# original CSV for reading[/color]
csv_file_inp = [COLOR=#008b8b]open[/color]([COLOR=#ff00ff]"line_breaks.csv"[/color],[COLOR=#ff00ff]"r"[/color])
csv_reader = csv.reader(csv_file_inp, delimiter=[COLOR=#ff00ff]'|'[/color])

[COLOR=#0000ff]# corrected CSV for writing[/color]
csv_file_out = [COLOR=#008b8b]open[/color]([COLOR=#ff00ff]"line_breaks_corrected.csv"[/color],[COLOR=#ff00ff]"w"[/color])
csv_writer = csv.writer(csv_file_out, delimiter=[COLOR=#ff00ff]"|"[/color])

[COLOR=#a52a2a][b]for[/b][/color] row [COLOR=#a52a2a][b]in[/b][/color] csv_reader:
  [COLOR=#008b8b]print[/color] [COLOR=#ff00ff]"reading original row = %s"[/color] % row
  [COLOR=#0000ff]# remove line break from every field[/color]
  [COLOR=#a52a2a][b]for[/b][/color] field [COLOR=#a52a2a][b]in[/b][/color] row:
     idx = row.index(field)
     [COLOR=#0000ff]#print "field = '%s'" % field[/color]
     [COLOR=#0000ff]# remove line breaks[/color]
     field_corrected = re.sub([COLOR=#ff00ff]'[/color][COLOR=#6a5acd]\n[/color][COLOR=#ff00ff]'[/color],[COLOR=#ff00ff]''[/color], field)
     [COLOR=#0000ff]#print "field_corrected = '%s'" % field_corrected[/color]
     row[idx] = field_corrected
  [COLOR=#008b8b]print[/color] [COLOR=#ff00ff]"writing corrected row = %s "[/color] % row
  csv_writer.writerow(row)

[COLOR=#0000ff]# at end[/color]
csv_file_inp.close()
csv_file_out.close()

Now when I run this script
Code:
$ python line_breaks.py
reading original row = ['aa', 'B\nb', 'C\nC', 'd\nD', 'e\nE']
writing corrected row = ['aa', 'Bb', 'CC', 'dD', 'eE'] 
reading original row = ['foo', 'bar', 'baz', 'spam', 'eggs']
writing corrected row = ['foo', 'bar', 'baz', 'spam', 'eggs']

I get as result the CSV file without line breaks:

line_breaks_corrected.csv
Code:
aa|Bb|CC|dD|eE
foo|bar|baz|spam|eggs

 
This is one of the reasons I asked for an example of the CSV file
 
Yesterday I didn't have access to windows, so I could try it only with Linux and Python.
In windows for example powershell have the import-csv cmdlet with the CSV parsing ability.
Now I tried this:

line_breaks.ps1
Code:
[COLOR=#0000ff]# -- functions[/color]
[COLOR=#804040][b]function[/b][/color] [COLOR=#008080]remove_line_break[/color]([COLOR=#008080]$input_string[/color]) {
  [COLOR=#008080]$output_string[/color] = [COLOR=#2e8b57][b][RegEx][/b][/color]::replace([COLOR=#008080]$input_string[/color],[COLOR=#ff00ff]"\r\n"[/color],[COLOR=#ff00ff]""[/color])
  [COLOR=#804040][b]return[/b][/color] [COLOR=#008080]$output_string[/color]
}

[COLOR=#804040][b]function[/b][/color] [COLOR=#008080]create_csv_line[/color]([COLOR=#008080]$input_array[/color]) {
  [COLOR=#008080]$output_string[/color] = [COLOR=#2e8b57][b][String][/b][/color]::join( [COLOR=#ff00ff]"|"[/color], [COLOR=#008080]$input_array[/color])
  [COLOR=#804040][b]return[/b][/color] [COLOR=#008080]$output_string[/color]  
}

[COLOR=#0000ff]# -- main[/color]
[COLOR=#008080]$csv_input[/color] = [COLOR=#ff00ff]"line_breaks.csv"[/color]
[COLOR=#008080]$csv_output[/color] = [COLOR=#ff00ff]"line_breaks_corrected.csv"[/color]

[COLOR=#008080]$csv_header[/color] = [COLOR=#ff00ff]"c1"[/color], [COLOR=#ff00ff]"c2"[/color], [COLOR=#ff00ff]"c3"[/color], [COLOR=#ff00ff]"c4"[/color], [COLOR=#ff00ff]"c5"[/color]
[COLOR=#008080]$csv_lines[/color] = [COLOR=#804040][b]import-csv[/b][/color] -path [COLOR=#008080]$csv_input[/color] -delimiter [COLOR=#ff00ff]"|"[/color] -header [COLOR=#008080]$csv_header[/color]

[COLOR=#008080]$line_num[/color] = [COLOR=#ff00ff]0[/color]
[COLOR=#804040][b]foreach[/b][/color]([COLOR=#008080]$line[/color] [COLOR=#804040][b]in[/b][/color] [COLOR=#008080]$csv_lines[/color]){
  [COLOR=#008080]$line_num[/color]++
  [COLOR=#0000ff]#echo $line.c1, $line.c2, $line.c3, $line.c4, $line.c5[/color]
  [COLOR=#0000ff]# remove line breaks from columns[/color]
  [COLOR=#008080]$col1[/color] = remove_line_break([COLOR=#008080]$line[/color].c1)
  [COLOR=#008080]$col2[/color] = remove_line_break([COLOR=#008080]$line[/color].c2)
  [COLOR=#008080]$col3[/color] = remove_line_break([COLOR=#008080]$line[/color].c3)
  [COLOR=#008080]$col4[/color] = remove_line_break([COLOR=#008080]$line[/color].c4)
  [COLOR=#008080]$col5[/color] = remove_line_break([COLOR=#008080]$line[/color].c5)
  [COLOR=#0000ff]#echo $col1, $col2, $col3, $col4, $col5[/color]

  [COLOR=#0000ff]# create corrected cvs line[/color]
  [COLOR=#008080]$csv_line_corrected[/color] = create_csv_line([COLOR=#008080]$col1[/color], [COLOR=#008080]$col2[/color], [COLOR=#008080]$col3[/color], [COLOR=#008080]$col4[/color], [COLOR=#008080]$col5[/color])
  [COLOR=#0000ff]#echo $csv_line_corrected[/color]

  [COLOR=#0000ff]# write to file[/color]
  [COLOR=#804040][b]if[/b][/color] ([COLOR=#008080]$line_num[/color] -[COLOR=#804040][b]eq[/b][/color] [COLOR=#ff00ff]1[/color]) {
    [COLOR=#0000ff]# replace file[/color]
    [COLOR=#008080]$csv_line_corrected[/color] | [COLOR=#804040][b]out-file[/b][/color] [COLOR=#008080]$csv_output[/color] -encoding [COLOR=#804040][b]default[/b][/color] 
  }
  [COLOR=#804040][b]else[/b][/color] {
    [COLOR=#0000ff]# append to file[/color]
    [COLOR=#008080]$csv_line_corrected[/color] | [COLOR=#804040][b]out-file[/b][/color] [COLOR=#008080]$csv_output[/color] -encoding [COLOR=#804040][b]default[/b][/color] -append
  } 
}

running this script from powershell
Code:
PS C:\mikrom\Work> .\line_breaks.ps1
or from windows command line running in my directory
Code:
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe  -Windowstyle Hidden -ExecutionPolicy Bypass -STA -File ".\line_breaks.ps1"
I get the similar result as before with python
 
A few days ago I started to learn the R programming language.
The solution of the problem given above is in R very straightforward.

line_breaks.R
Code:
[COLOR=#0000ff]# run:[/color]
[COLOR=#0000ff]# Rscript line_breaks.R [/color]

[COLOR=#0000ff]# read CSV input file[/color]
csv [COLOR=#a52a2a][b]<-[/b][/color] read.csv[COLOR=#6a5acd]([/color]file [COLOR=#a52a2a][b]=[/b][/color] [COLOR=#ff00ff]"line_breaks.csv"[/color][COLOR=#6a5acd],[/color] sep[COLOR=#a52a2a][b]=[/b][/color][COLOR=#ff00ff]"|"[/color][COLOR=#6a5acd],[/color] header[COLOR=#a52a2a][b]=[/b][/color][COLOR=#ff00ff]FALSE[/color][COLOR=#6a5acd])[/color]

[COLOR=#0000ff]# create matrix from CSV data[/color]
csv_matrix [COLOR=#a52a2a][b]<-[/b][/color] as.matrix[COLOR=#6a5acd]([/color]csv[COLOR=#6a5acd])[/color]

print[COLOR=#6a5acd]([/color][COLOR=#ff00ff]"Original CSV data:"[/color][COLOR=#6a5acd])[/color]
print[COLOR=#6a5acd]([/color]csv_matrix[COLOR=#6a5acd])[/color] 

[COLOR=#a52a2a][b]for[/b][/color] [COLOR=#6a5acd]([/color]i [COLOR=#a52a2a][b]in[/b][/color] [COLOR=#ff00ff]1[/color][COLOR=#6a5acd]:[/color]dim[COLOR=#6a5acd]([/color]csv_matrix[COLOR=#6a5acd])[[/color][COLOR=#ff00ff]1[/color][COLOR=#6a5acd]])[/color] [COLOR=#6a5acd]{[/color]
  [COLOR=#a52a2a][b]for[/b][/color] [COLOR=#6a5acd]([/color]j [COLOR=#a52a2a][b]in[/b][/color] [COLOR=#ff00ff]1[/color][COLOR=#6a5acd]:[/color]dim[COLOR=#6a5acd]([/color]csv_matrix[COLOR=#6a5acd])[[/color][COLOR=#ff00ff]2[/color][COLOR=#6a5acd]])[/color] [COLOR=#6a5acd]{[/color]
    [COLOR=#0000ff]# correct every matrix element[/color]
    csv_matrix[COLOR=#6a5acd][[/color]i[COLOR=#6a5acd],[/color]j[COLOR=#6a5acd]][/color] [COLOR=#a52a2a][b]<-[/b][/color] gsub[COLOR=#6a5acd]([/color][COLOR=#ff00ff]"[/color][COLOR=#6a5acd]\n[/color][COLOR=#ff00ff]"[/color][COLOR=#6a5acd],[/color][COLOR=#ff00ff]""[/color][COLOR=#6a5acd],[/color]csv_matrix[COLOR=#6a5acd][[/color]i[COLOR=#6a5acd],[/color]j[COLOR=#6a5acd]])[/color]
  [COLOR=#6a5acd]}[/color]
[COLOR=#6a5acd]}[/color]

print[COLOR=#6a5acd]([/color][COLOR=#ff00ff]"Corrected CSV data:"[/color][COLOR=#6a5acd])[/color]
print[COLOR=#6a5acd]([/color]csv_matrix[COLOR=#6a5acd])[/color]

[COLOR=#0000ff]# write corrected data into CSV output file[/color]
write.table[COLOR=#6a5acd]([/color]csv_matrix[COLOR=#6a5acd],[/color] file [COLOR=#a52a2a][b]=[/b][/color] [COLOR=#ff00ff]'line_breaks_corrected.csv'[/color][COLOR=#6a5acd],[/color]
  quote [COLOR=#a52a2a][b]=[/b][/color] [COLOR=#ff00ff]FALSE[/color][COLOR=#6a5acd],[/color] sep [COLOR=#a52a2a][b]=[/b][/color] [COLOR=#ff00ff]"|"[/color][COLOR=#6a5acd],[/color] row.names [COLOR=#a52a2a][b]=[/b][/color] [COLOR=#ff00ff]FALSE[/color][COLOR=#6a5acd],[/color] col.names [COLOR=#a52a2a][b]=[/b][/color] [COLOR=#ff00ff]FALSE[/color][COLOR=#6a5acd])[/color]

Running the script
Code:
$ Rscript line_breaks.R
[1] "Original CSV data:"
     V1    V2     V3     V4     V5    
[1,] "aa"  "B\nb" "C\nC" "d\nD" "e\nE"
[2,] "foo" "bar"  "baz"  "spam" "eggs"
[1] "Corrected CSV data:"
     V1    V2    V3    V4     V5    
[1,] "aa"  "Bb"  "CC"  "dD"   "eE"  
[2,] "foo" "bar" "baz" "spam" "eggs"
creates the CSV file with line breaks removed.


 
Ok, since no response on the "If we could see an example of the file" request perhaps you could tell us which version of Excel you have - since Excel 2016 can actually import files like the one mikrom is using perfectly OK … (I suspect Excel 2010 can as well, but I no longer have a copy readily to hand that I can test this on)
 
@strongm,
It looks like OP is not interested anymore, maybe the client has sent a simpler CSV file in the meantime.
But still, this thread can be a good guide for someone else with a similar problem.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top