Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Perl - parsing Windows CSV files with embedded quoted CR+LF, CR, or LF

Status
Not open for further replies.

Zistrosk

Programmer
Aug 5, 2011
5
US
Hi everyone,

I searched the forums (and the internet) and didn't find a solution for my problem, which was how to read and manage CSV data where some of the columns are quote-encapsulated and also have one or more CR, LF, or CR+LF within the quotes. When the CSV is opened in M$ Excel, you end up with a multiline box when you put your cursor on one of these quote encapsulated cells. This is a nice feature, but the source of my pain. When Strawberry Perl (or Solaris 10 Perl 5.8 under VirtualBox for that matter) does a readline, it decides the end of the line to read is in the middle of the quoted data! So lines get cut off, columns get thrown off, and you end up with a big ugly mess for output. Here's my solution, and I hope it helps you.

This program will take an input CSV, and assume the "real" line break is an unquoted CR+LF, Windows style. All input columns are stripped of commas, quoted, CR, and LF, and then reencapsulated in quotes and written back out in CSV format. The output file(s) are suitable for standard Perl readline and split. Optionally, the input file can be split into several output files by specifying how many lines each should contain in $splitafter. I had one source file that was over 500,000 lines in length and wouldn't load into Excel (64k line limit), so this is why I added that feature. A second feature I added, just in case, will force the program to write a line every $maxcols columns read, just in case the CR+LF detection code doesn't work on your CSV, and you know how wide your rectangular data is.

Thanks,
Zistrosk

#!/usr/bin/perl
#csvconvert.plx - perl program to read a CSV of any length and width and
#remove embedded CR+LF within quoted fields, and write it back out to a new
#file so that it may be read with a regular split command afterwards.
#
#note that $maxcols can be set to the width of the data, if the automatic
#detection of "true end of line", looking for unquoted CR+LF in the file, fails.
#a nonzero value of $maxcols will force the program to read exactly that many columns
#and the output a new CSV line with that many columns read (i.e. assume square dataset)

use warnings;
$maxcols = 0; # number of columns in CSV (assume square data, first column is #0, each row has 0-29)
$repln = 0; # counter for when to report number of lines processed
$repmx = 999; # value to reset $repln to when it counts down and reports
$rownum = 0; # number of rows read so far
$pathloc = 'C:\\mycsvdirectory\\';
#setup filename for report input
$fileinput = 'myinputcsvtobeconverted.csv';
$fhifn = $pathloc . $fileinput;
#setup filename for report output
$fileoutputext = '.csv'; #output file extension
$fileoutput = 'mychoiceofoutputfilenames';
$fileoutindex = 0;
$fileoutindexs = sprintf "%02.0f", $fileoutindex;
$fhofn = $pathloc . $fileoutput . $fileoutindexs . $fileoutputext;
$splitafter = 65000;# split output to a new CSV after this many lines. 0=disable file split
$split = 0;# line counter for $splitafter
open($fhi,,$fhifn) or die("unable to open input file.");
binmode $fhi;#set binary mode, so read byte by byte will be accurate
open($fho,'>',$fhofn) or die("unable to open output file.");

print "\n Converting CSV in:\n" . $fhifn . "\nto\n" . $fhofn . "\n\nRemoving carriage returns and linefeeds from quoted data to allow easier processing.\n\n";

# convert the CSV, reading it byte by byte
$linelen = 1; #cant rely on windows readline returning false when no more lines, so have to use an alternate method, line length
# read selected columns of entire csv into @csvdata[x][y]
# have to manually handle compressing quotes, seeing if in quote mode or not, and end of line checking.
# readline does not work here because some of the quote-encapsulated columns have newline characters in them
$quote = 0; # =0 not in quotes, =1 in quotes
$byte = ""; # variable used to store the byte read from input
$prevbyte = ""; # previous byte read, for double quote comparisons
$line = "";# empty before using
my @columns; # place to store the columns as they are read from each row
$columnsinx = 0; # current pointer into @columns
while ($linelen > 0) {
$prevbyte = $byte; # save previous byte read for comparisons
$linelen = read($fhi, $byte, 1); #number of bytes actually read into $byte, returned in $linelen
if ($byte eq "\"") {
if ($byte eq $prevbyte) { $byte = "";} #compress double quotes
$quote*=-1; # toggle quote mode every time a quote is found.
$quote+=1; } # if double quotes are found, they're compressed, and quote mode is restored to what it should be
if ( ((ord($byte) == 10) || ($byte eq ",")) && $quote == 0) { #if newline or comma found outside of quotes, process column
chomp $line; # get rid of trailing blanks, if any
$columns[$columnsinx] = $line;
$columnsinx++; #index to next column number
if ( (($maxcols > 0 ) && ($columnsinx > $maxcols)) || ((ord($byte) == 10) && (ord($prevbyte) == 13)) ) {
$columnsinx--; #index set to last actual data item
# check for x0d0a for end of line, which is \r\n.
$lno = ""; #empty output string
for ($j = 1;$j < ($columnsinx - 1);$j++) {
$columns[$j] =~ s/\"//g; #strip quotes from column data
$lno = $lno . "\"" . $columns[$j] . "\",";
$columns[$j] = ""; #added to empty @csvdata as we go, since we're not accumulating it
}
$columns[$columnsinx] =~ s/\"//g; #strip quotes from column data
$lno = $lno . "\"" . $columns[$columnsinx] . "\"\n";
print $fho $lno;
$columnsinx = 0;
$rownum++;
$repln++;
if ($repln > $repmx) {
print $rownum . " lines converted.\n";
$repln=0;
}
if ($splitafter > 0) {
$split++; #count lines for splitting
if ($split > $splitafter) {
print "\nClosing output file: " . $fhofn . "\n";
close $fho;
$fileoutindex++; #increase index number of file number to write
$fileoutindexs = sprintf "%02.0f", $fileoutindex;
$fhofn = $pathloc . $fileoutput . $fileoutindexs . $fileoutputext;
print "\nOpening output file: " . $fhofn . "\n";
open($fho,'>',$fhofn) or die("unable to open output file.");
$split = 0;
}
}
}
$line = "";
$byte = "";
$quote = 0;
}
if ((ord($byte) == 13)) {$line = $line . " "; } #cnvt linefeed to space
elsif ((ord($byte) == 10)) {$line = $line . ""; } #absorb carriage return
elsif ($byte eq ",") {$line = $line . ""; } #dont add anything to line
else { $line = $line . $byte; }# accumulate column into $line
}
print "\nProcessing complete. There were a total of " . $rownum . " rows converted.\n";

my $elapsed_seconds = time - $^T;
print "-- Script took " . $elapsed_seconds . " seconds to run --\n";

# close files when done

close $fhi;
close $fho;
exit 0;


 
Any reason you didn't just Text::CSV?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[noevil]
Travis - Those who say it cannot be done are usually interrupted by someone else doing it; Give the wrong symptoms, get the wrong solutions;
 
Yes. Text::CSV didn't come with Strawberry Perl, and is not present on the Solaris systems that I also work with. While I could add it to my own personal copy of Strawberry Perl, there's no way I'd be able to put it on the several thousand servers I manage, and I wanted to make sure that the code I wrote was fully portable across the environment, so I could run it anywhere I needed to.

Zistrosk
 
I need to second the recommendation for Text::CSV

Code:
my $csv = Text::CSV->new({
    binary => 1,
    eol => "\n",
    quote_char => q{"},
 }) or die "Cannot use CSV: ".Text::CSV->error_diag();

Even if you're worried about being able to include it on thousands of machines, just rely on the pure perl version: Text::CSV_PP

.02,
- Miller
 
Thank you all for your suggestions.

my attempts to use:

use Text::CSV;
use Text::CSV_PP;
use Text::CSV__PP;

all end in failure. I did try them one at a time, of course. The system cannot find them. This is both Perl 5.8.4 for Solaris 10u6, and the latest package of Strawberry Perl under Windows.

What am I doing wrong?

Thanks,
Zistrosk
 
You need to either install them using strawberry perl's cpan.

Code:
cpan -i Text::CSV

Or you can just download the .pm's yourself and put them in your source tree.

Also, I have to mention that perl 5.8 is legacy now so I have to encourage you to consider upgrading to perl 5.12.

- Miller
 
Just so you know, you are not alone Zistrosk! I fully understand where you're coming from when managing hundreds of systems with various operating systems and architectures and no direct access to the internet. CPAN installations are usually painful and unreliable, so one is forced to stick with the lowest common denominator of perl versions/capabilities and make scripts as portable and self-contained as possible.

Annihilannic.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top