Regular expression to fix a "dirty" csv file

arclyte · Jan 2, 2006

Hi,

I have a problem that surpasses my knowledge of regular expressions. Can somehow help me to figure out the logic to solve this problem?

I am downloading a CSV file from a supplier which is "dirty", that is, it has quotes and commas within fields and has line breaks where there should be html tags. I've asked them to fix it, but they haven't, so I'm stuck with trying to fix it myself. Let me give an example record or two:

"SKU","Name","Description","Price","Quantity","Category"
"1234","Clock Radio","This is a clock radio, it measures 4" x 3" x 2"
Features:
Glows in the dark
Tells the time
Screws up my CSV parsing","$12.99","56","Clocks"

For this to work properly, each record should be on one line with comma seperated, double quoted fields (which contain no quotes or commas). I need to do three things:
1. Delete any commas in the descriptions
2. Change all " in the fields to "
3. Change all the \n in fields to 

For 2, I think that I want to get rid of any quotes that are not of the format: ","
I don't know if that's the best way to go, but I think that every record is formatted that way. Any help or advice would be greatly appreciated. If it were up to me, we'd just find a new supplier!

TrojanWarBlade · Jan 2, 2006

This one is not quite as easy as it first seems.
I'm just off to bed so I'm sure others will try to help you.
If you don't have a solution by the time I'm out of bed again, I'll take a look for you.

Good luck.

Trojan.

rharsh · Jan 2, 2006

Yeah, most of the CSV modules like having a new-line only at the end of each record, so this might be a bit more tricky.

Can you post some more sample data? It's hard to tell what the file looks like with only one record.

rharsh · Jan 2, 2006

Without seeing more data, there's no way to know how to split the records, but this looks like it will deal with cleaning the info you supplied. I did make the assumption that the string "," will only appear between your fields - if that's not the case, this will require more work.

Code:

my $record = '"1234","Clock Radio","This is a clock radio, it measures 4" x 3" x 2"
Features:
Glows in the dark
Tells the time
Screws up my CSV parsing","$12.99","56","Clocks"';

$record =~ s/\n/<br>/g;
my @fields = split(/(?<="),(?=")/, $record);
map { s/,/;/g; } @fields; # Change commas to semicolons
map { while (s/^([^"]*".+)"(?=.+")/$1\&quot;/g) {} } @fields; # Update quotes
print join(',', @fields), "\n";

icrf · Jan 3, 2006

If the fields are quoted, you can leave the commas in them. Any CSV parser (even simple regex ones) can handle that.

Again, assuming all fields are quoted, you could try something like this (variation on rharsh's):

Code:

[COLOR=#006600]# split on newlines surrounded by quotes[/color]
$/ = [COLOR=#FF8000]qq[/color]{[COLOR=#808080]"\n"[/color]};
[COLOR=#FF0000]open[/color] F, [COLOR=#808080]'evil_data.csv'[/color] [COLOR=#FF8000]or[/color] [COLOR=#FF0000]die[/color] $!;
[COLOR=#0000FF]for[/color] [COLOR=#0000FF]my[/color] $record (<F>) {
	[COLOR=#FF0000]chomp[/color] $record;
	$record =~ [COLOR=#FF8000]s[/color]/\n/<br>/g;
	[COLOR=#006600]# split on commas surrounded by quotes[/color]
	[COLOR=#0000FF]my[/color] @fields = [COLOR=#FF0000]split[/color](/(?<="),(?=")/, $record);
	[COLOR=#006600]# strip leading and trailing quotes, replace all others with &quot;[/color]
	[COLOR=#FF8000]s[/color]/^[COLOR=#808080]"|"[/color]$//g && [COLOR=#FF8000]s[/color]/"/&quot;/g [COLOR=#0000FF]for[/color](@fields);
	[COLOR=#006600]# recreate CSV string (could use Text::CSV's combine)[/color]
	[COLOR=#FF0000]print[/color] [COLOR=#808080]'"'[/color], [COLOR=#FF0000]join[/color]([COLOR=#808080]'","'[/color], @fields), [COLOR=#808080]'"'[/color], [COLOR=#808080]"\n"[/color];
}
[COLOR=#FF0000]close[/color] F;

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo

arclyte · Jan 3, 2006

Hey Guys,

Thanks for the replies. I haven't tried using your code yet, but am going to try that now. I figured I'd post some more data so you could see the mess I'm dealing with

Since this text gets wrapped, I'm putting in \n's just so you know where they are in the file...

Code:

"sku","name","description","manufacturer","price","retails online","MSRP","weight","reconditioned","image","category"\n
"DVD-224","SUPER SLIM PROGRESSIVE SCAN DVD PLAYER","FEATURES:\n
Progressive Scan DVD Player\n
Delivers Over 500 Horizontal Lines of Crystal Clear Pictures\n
Plays DVD/MP3/CD/CD-R/CD-RW Disks\n
Compatible to NTSC/PAL System\n
Multiple Subtitles/Viewing Angle\n
Compact Midsize Chassis\n
Slow/Fast Motion Play\n
Zoom Operation\n
Parental Lock Control\n
Convenient On Screen Display\n
110-220 Automatic Power Adjustable Input\n
\n
SPECIFICATIONS:\n
Audio\n
Dolby Digital\n
Coaxial Digital Out\n
Optical Digital Out\n
Analog Audio Output\n
Video\n
Component Video Out (Progressive Scan)\n
RCA Video Out\n
S-Video Out\n
FCC Approved & UL Listed\n
FDA Approved\n
\n
ACCESSORIES INCLUDED:\n
Full Function Remote Control","COBY","34.99","59.99","99.99","6","No","DVD224.jpg","DVD Players"\n
"10393","Milk Motion Lamp","Retro 70's style Milk Jar Replica Motion Lamp.\n
\n
White Wax/ Blue Liquid\n
\n
Dimensions:\n
H: 16" L: 5.5" W: 5.5"","Novelty Lamps","17.99","29.99","39.99","5","No","10393.jpg","Motion Lamps"\n

So, three records. 1. The field list, 2. A "good" record that needs to have 's put in for the description and 3. A bad record that contains " for inches that should be "

I'm hoping that a solution to this will work across the whole file, but i'm not absolutely sure. I have a feeling that there's probably something in the file that's going to differ from this and screw me up even more! But I think this record is a good example because of the double quotes. Everything else _should_ conform to the "," format though.
Thanks again!

Jim

arclyte · Jan 3, 2006

icrf,

tried your code on the snippet i pasted above... it worked (almost). the only problem i had was with the final category, "Motion Lamps". It came up as "Motion Lamps" ". So, it looks like it works, just need to fix the very last record of the file. I don't suppose you'd want to tell me how this code works? Or how I can use the Text:CSV combine command (would that process faster on large files?) I'll use your code, but I wouldn't mind knowing how it's doing what it's doing

Thank you very much, this is a life saver.

icrf · Jan 3, 2006

To keep it from botching the last record, just make sure there's no blank newline at the end of the input csv file. Also, using Text::CSV in this case might be a little overkill, as it wouldn't really do anything different. It may have some C-optimized parts, but I doubt it'd run much faster.

The comments in the code pretty much outline what it's doing. $/ is the input record separator, so when you read a single record from the file handle <F>, it reads until it finds a newline surrounded by quotes (what $/ is set to), returns all that and stops. chomp takes the "\n" off the end and s/\n/ /g replaces newlines with 's.

/(?<="),(?=")/ is a regular expression looking for a comma surrounded by quotes (but without consuming the quotes. I just copied that from rharsh's code, but since I'm handling quote escaping slightly differently, you could just leave that as /","/ and it'd work fine).

[tt]s/^"|"$//g && s/"/"/g for(@fields);[/tt]
This is just looping over @fields and running two substitutions on each. The first strips off leading quotes ^" or | trailing quotes "$ and the second replaces any remaining quotes with the html entity "

The final print line just joins and quotes the now-escaped fields.

As a whole, this probably wouldn't run too terribly slow on large files. Is this something that's going to be run on a very regular basis? I probably don't see new product info from my vendors much more than annually, so I don't mind a little extra work to get it in.

- Andrew
Text::Highlight - A language-neutral syntax highlighting module in Perl
also on SourceForge including demo

arclyte · Jan 3, 2006

icrf/Andrew,

Thanks again for your help with this. I'm pretty new to using regexp's and probably never would have figure this out on my own. Now that I read through the code (and docs) it starts to make sense.

The problem is our supplier isn't currently sending us any quantity information on how much they have in stock and the only info they give on new or out of stock items is via some tables on their website, so we have to run the update fairly regularly to make sure we're not advertising out of stock items. Now that I have a way of actually using their feeds without having to do hours of manual fixing that should be a breeze! Thanks a lot.

Sincerely,

Jim

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Regular expression to fix a "dirty" csv file

arclyte

Programmer

TrojanWarBlade

Programmer

rharsh

Technical User

rharsh

Technical User

icrf

Programmer

arclyte

Programmer

arclyte

Programmer

icrf

Programmer

arclyte

Programmer

Similar threads

Part and Inventory Search

Sponsor

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Regular expression to fix a &quot;dirty&quot; csv file

Programmer

Programmer

Technical User

Technical User

Programmer

Programmer

Programmer

Programmer

Programmer

Similar threads

Log in

Part and Inventory Search

Sponsor

Regular expression to fix a "dirty" csv file