Hi everyone,
I searched the forums (and the internet) and didn't find a solution for my problem, which was how to read and manage CSV data where some of the columns are quote-encapsulated and also have one or more CR, LF, or CR+LF within the quotes. When the CSV is opened in M$ Excel, you end up with a multiline box when you put your cursor on one of these quote encapsulated cells. This is a nice feature, but the source of my pain. When Strawberry Perl (or Solaris 10 Perl 5.8 under VirtualBox for that matter) does a readline, it decides the end of the line to read is in the middle of the quoted data! So lines get cut off, columns get thrown off, and you end up with a big ugly mess for output. Here's my solution, and I hope it helps you.
This program will take an input CSV, and assume the "real" line break is an unquoted CR+LF, Windows style. All input columns are stripped of commas, quoted, CR, and LF, and then reencapsulated in quotes and written back out in CSV format. The output file(s) are suitable for standard Perl readline and split. Optionally, the input file can be split into several output files by specifying how many lines each should contain in $splitafter. I had one source file that was over 500,000 lines in length and wouldn't load into Excel (64k line limit), so this is why I added that feature. A second feature I added, just in case, will force the program to write a line every $maxcols columns read, just in case the CR+LF detection code doesn't work on your CSV, and you know how wide your rectangular data is.
Thanks,
Zistrosk
#!/usr/bin/perl
#csvconvert.plx - perl program to read a CSV of any length and width and
#remove embedded CR+LF within quoted fields, and write it back out to a new
#file so that it may be read with a regular split command afterwards.
#
#note that $maxcols can be set to the width of the data, if the automatic
#detection of "true end of line", looking for unquoted CR+LF in the file, fails.
#a nonzero value of $maxcols will force the program to read exactly that many columns
#and the output a new CSV line with that many columns read (i.e. assume square dataset)
use warnings;
$maxcols = 0; # number of columns in CSV (assume square data, first column is #0, each row has 0-29)
$repln = 0; # counter for when to report number of lines processed
$repmx = 999; # value to reset $repln to when it counts down and reports
$rownum = 0; # number of rows read so far
$pathloc = 'C:\\mycsvdirectory\\';
#setup filename for report input
$fileinput = 'myinputcsvtobeconverted.csv';
$fhifn = $pathloc . $fileinput;
#setup filename for report output
$fileoutputext = '.csv'; #output file extension
$fileoutput = 'mychoiceofoutputfilenames';
$fileoutindex = 0;
$fileoutindexs = sprintf "%02.0f", $fileoutindex;
$fhofn = $pathloc . $fileoutput . $fileoutindexs . $fileoutputext;
$splitafter = 65000;# split output to a new CSV after this many lines. 0=disable file split
$split = 0;# line counter for $splitafter
open($fhi,,$fhifn) or die("unable to open input file.");
binmode $fhi;#set binary mode, so read byte by byte will be accurate
open($fho,'>',$fhofn) or die("unable to open output file.");
print "\n Converting CSV in:\n" . $fhifn . "\nto\n" . $fhofn . "\n\nRemoving carriage returns and linefeeds from quoted data to allow easier processing.\n\n";
# convert the CSV, reading it byte by byte
$linelen = 1; #cant rely on windows readline returning false when no more lines, so have to use an alternate method, line length
# read selected columns of entire csv into @csvdata[x][y]
# have to manually handle compressing quotes, seeing if in quote mode or not, and end of line checking.
# readline does not work here because some of the quote-encapsulated columns have newline characters in them
$quote = 0; # =0 not in quotes, =1 in quotes
$byte = ""; # variable used to store the byte read from input
$prevbyte = ""; # previous byte read, for double quote comparisons
$line = "";# empty before using
my @columns; # place to store the columns as they are read from each row
$columnsinx = 0; # current pointer into @columns
while ($linelen > 0) {
$prevbyte = $byte; # save previous byte read for comparisons
$linelen = read($fhi, $byte, 1); #number of bytes actually read into $byte, returned in $linelen
if ($byte eq "\"") {
if ($byte eq $prevbyte) { $byte = "";} #compress double quotes
$quote*=-1; # toggle quote mode every time a quote is found.
$quote+=1; } # if double quotes are found, they're compressed, and quote mode is restored to what it should be
if ( ((ord($byte) == 10) || ($byte eq ",")) && $quote == 0) { #if newline or comma found outside of quotes, process column
chomp $line; # get rid of trailing blanks, if any
$columns[$columnsinx] = $line;
$columnsinx++; #index to next column number
if ( (($maxcols > 0 ) && ($columnsinx > $maxcols)) || ((ord($byte) == 10) && (ord($prevbyte) == 13)) ) {
$columnsinx--; #index set to last actual data item
# check for x0d0a for end of line, which is \r\n.
$lno = ""; #empty output string
for ($j = 1;$j < ($columnsinx - 1);$j++) {
$columns[$j] =~ s/\"//g; #strip quotes from column data
$lno = $lno . "\"" . $columns[$j] . "\",";
$columns[$j] = ""; #added to empty @csvdata as we go, since we're not accumulating it
}
$columns[$columnsinx] =~ s/\"//g; #strip quotes from column data
$lno = $lno . "\"" . $columns[$columnsinx] . "\"\n";
print $fho $lno;
$columnsinx = 0;
$rownum++;
$repln++;
if ($repln > $repmx) {
print $rownum . " lines converted.\n";
$repln=0;
}
if ($splitafter > 0) {
$split++; #count lines for splitting
if ($split > $splitafter) {
print "\nClosing output file: " . $fhofn . "\n";
close $fho;
$fileoutindex++; #increase index number of file number to write
$fileoutindexs = sprintf "%02.0f", $fileoutindex;
$fhofn = $pathloc . $fileoutput . $fileoutindexs . $fileoutputext;
print "\nOpening output file: " . $fhofn . "\n";
open($fho,'>',$fhofn) or die("unable to open output file.");
$split = 0;
}
}
}
$line = "";
$byte = "";
$quote = 0;
}
if ((ord($byte) == 13)) {$line = $line . " "; } #cnvt linefeed to space
elsif ((ord($byte) == 10)) {$line = $line . ""; } #absorb carriage return
elsif ($byte eq ",") {$line = $line . ""; } #dont add anything to line
else { $line = $line . $byte; }# accumulate column into $line
}
print "\nProcessing complete. There were a total of " . $rownum . " rows converted.\n";
my $elapsed_seconds = time - $^T;
print "-- Script took " . $elapsed_seconds . " seconds to run --\n";
# close files when done
close $fhi;
close $fho;
exit 0;
I searched the forums (and the internet) and didn't find a solution for my problem, which was how to read and manage CSV data where some of the columns are quote-encapsulated and also have one or more CR, LF, or CR+LF within the quotes. When the CSV is opened in M$ Excel, you end up with a multiline box when you put your cursor on one of these quote encapsulated cells. This is a nice feature, but the source of my pain. When Strawberry Perl (or Solaris 10 Perl 5.8 under VirtualBox for that matter) does a readline, it decides the end of the line to read is in the middle of the quoted data! So lines get cut off, columns get thrown off, and you end up with a big ugly mess for output. Here's my solution, and I hope it helps you.
This program will take an input CSV, and assume the "real" line break is an unquoted CR+LF, Windows style. All input columns are stripped of commas, quoted, CR, and LF, and then reencapsulated in quotes and written back out in CSV format. The output file(s) are suitable for standard Perl readline and split. Optionally, the input file can be split into several output files by specifying how many lines each should contain in $splitafter. I had one source file that was over 500,000 lines in length and wouldn't load into Excel (64k line limit), so this is why I added that feature. A second feature I added, just in case, will force the program to write a line every $maxcols columns read, just in case the CR+LF detection code doesn't work on your CSV, and you know how wide your rectangular data is.
Thanks,
Zistrosk
#!/usr/bin/perl
#csvconvert.plx - perl program to read a CSV of any length and width and
#remove embedded CR+LF within quoted fields, and write it back out to a new
#file so that it may be read with a regular split command afterwards.
#
#note that $maxcols can be set to the width of the data, if the automatic
#detection of "true end of line", looking for unquoted CR+LF in the file, fails.
#a nonzero value of $maxcols will force the program to read exactly that many columns
#and the output a new CSV line with that many columns read (i.e. assume square dataset)
use warnings;
$maxcols = 0; # number of columns in CSV (assume square data, first column is #0, each row has 0-29)
$repln = 0; # counter for when to report number of lines processed
$repmx = 999; # value to reset $repln to when it counts down and reports
$rownum = 0; # number of rows read so far
$pathloc = 'C:\\mycsvdirectory\\';
#setup filename for report input
$fileinput = 'myinputcsvtobeconverted.csv';
$fhifn = $pathloc . $fileinput;
#setup filename for report output
$fileoutputext = '.csv'; #output file extension
$fileoutput = 'mychoiceofoutputfilenames';
$fileoutindex = 0;
$fileoutindexs = sprintf "%02.0f", $fileoutindex;
$fhofn = $pathloc . $fileoutput . $fileoutindexs . $fileoutputext;
$splitafter = 65000;# split output to a new CSV after this many lines. 0=disable file split
$split = 0;# line counter for $splitafter
open($fhi,,$fhifn) or die("unable to open input file.");
binmode $fhi;#set binary mode, so read byte by byte will be accurate
open($fho,'>',$fhofn) or die("unable to open output file.");
print "\n Converting CSV in:\n" . $fhifn . "\nto\n" . $fhofn . "\n\nRemoving carriage returns and linefeeds from quoted data to allow easier processing.\n\n";
# convert the CSV, reading it byte by byte
$linelen = 1; #cant rely on windows readline returning false when no more lines, so have to use an alternate method, line length
# read selected columns of entire csv into @csvdata[x][y]
# have to manually handle compressing quotes, seeing if in quote mode or not, and end of line checking.
# readline does not work here because some of the quote-encapsulated columns have newline characters in them
$quote = 0; # =0 not in quotes, =1 in quotes
$byte = ""; # variable used to store the byte read from input
$prevbyte = ""; # previous byte read, for double quote comparisons
$line = "";# empty before using
my @columns; # place to store the columns as they are read from each row
$columnsinx = 0; # current pointer into @columns
while ($linelen > 0) {
$prevbyte = $byte; # save previous byte read for comparisons
$linelen = read($fhi, $byte, 1); #number of bytes actually read into $byte, returned in $linelen
if ($byte eq "\"") {
if ($byte eq $prevbyte) { $byte = "";} #compress double quotes
$quote*=-1; # toggle quote mode every time a quote is found.
$quote+=1; } # if double quotes are found, they're compressed, and quote mode is restored to what it should be
if ( ((ord($byte) == 10) || ($byte eq ",")) && $quote == 0) { #if newline or comma found outside of quotes, process column
chomp $line; # get rid of trailing blanks, if any
$columns[$columnsinx] = $line;
$columnsinx++; #index to next column number
if ( (($maxcols > 0 ) && ($columnsinx > $maxcols)) || ((ord($byte) == 10) && (ord($prevbyte) == 13)) ) {
$columnsinx--; #index set to last actual data item
# check for x0d0a for end of line, which is \r\n.
$lno = ""; #empty output string
for ($j = 1;$j < ($columnsinx - 1);$j++) {
$columns[$j] =~ s/\"//g; #strip quotes from column data
$lno = $lno . "\"" . $columns[$j] . "\",";
$columns[$j] = ""; #added to empty @csvdata as we go, since we're not accumulating it
}
$columns[$columnsinx] =~ s/\"//g; #strip quotes from column data
$lno = $lno . "\"" . $columns[$columnsinx] . "\"\n";
print $fho $lno;
$columnsinx = 0;
$rownum++;
$repln++;
if ($repln > $repmx) {
print $rownum . " lines converted.\n";
$repln=0;
}
if ($splitafter > 0) {
$split++; #count lines for splitting
if ($split > $splitafter) {
print "\nClosing output file: " . $fhofn . "\n";
close $fho;
$fileoutindex++; #increase index number of file number to write
$fileoutindexs = sprintf "%02.0f", $fileoutindex;
$fhofn = $pathloc . $fileoutput . $fileoutindexs . $fileoutputext;
print "\nOpening output file: " . $fhofn . "\n";
open($fho,'>',$fhofn) or die("unable to open output file.");
$split = 0;
}
}
}
$line = "";
$byte = "";
$quote = 0;
}
if ((ord($byte) == 13)) {$line = $line . " "; } #cnvt linefeed to space
elsif ((ord($byte) == 10)) {$line = $line . ""; } #absorb carriage return
elsif ($byte eq ",") {$line = $line . ""; } #dont add anything to line
else { $line = $line . $byte; }# accumulate column into $line
}
print "\nProcessing complete. There were a total of " . $rownum . " rows converted.\n";
my $elapsed_seconds = time - $^T;
print "-- Script took " . $elapsed_seconds . " seconds to run --\n";
# close files when done
close $fhi;
close $fho;
exit 0;