TOUGH string match /substitute

perlnoob · Dec 29, 2004

hey all,

Situation:
1. A file comes in a .csv format (I have no control on that part).
2.Cells need to be parsed out for DBI module load into database
3.Cell Data looks like this:
col1,col2,col3....colx
I use a while loop, and split function to assign each row a variable for the loop for processing

Problem:
1.Spliting the variables up by a comma doesnt work because some data looks like this:
cell1,"ce,ll,2",cell3,"cell,4",cell5...cellx
2. So -"ce- becomes cell2, and -ll- becomes cell3...etc. Basically the columns are shifted to the right.

Side note:
1. I did notice that all the cells that have commas (within the cell) such as cell2, and cell4, etc, have double quotes around them (due to the nature of the sftp process when I port in the file.

Question:
So how do I parse out the cells, and say "oh, and when you see double quotes, I need you to remove all the commas up until you see the next double quote". E.G. I want "ce,ll,x" to look like -cellx- (so that the split doesnt fail).

thanks for your help!!!

TonyGroves · Dec 29, 2004

If you're just going to be loading the parsed data into a database, then you might not need to go through this step at all. Some databases (MySQL is one) can take input directly from CSV files using the SQL "LOAD DATA" command.

perlnoob · Dec 29, 2004

I know about SQL Loader, and have tried it already, but there are limitations that it can do when I load the data, such as joining queries, doing look ups, storing temp variables for calculations, etc....

thanks,

rharsh · Dec 29, 2004

Take a look at the Text::CSV module. It should take care of your problem.

perlnoob · Dec 29, 2004

Thanks, I'll give that a try!
Also, about my last post, I meant "can't" do, not can..oops.

Sorry about that, but thanks for the reply.

Also, is that module on CPAN?

cntr · Dec 29, 2004

Code:

# Set the input field-separator (FS) to " ;
# use empty string as the output field-separator
# (OFS; used by awk to rebuild the line read when
# assignment is made to $1, $2, etc.).
BEGIN { FS="\""; OFS="" }

# The even-numbered "fields" are the ones that are
# enclosed in quotes.  Zap the commas in them.
{ for (i=2;i<=NF;i+=2)
    if ( !gsub(/,/, "", $i) )
      # A nominal change must be made to field.
      $i=$i
  print
}

Run with awk -f prog.awk infile

With the input
[tt]
cell1,"ce,ll,2",cell3,"cell,4",cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
"cell,1",cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,"cellx"
[/tt]
the ouput is
[tt]
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
[/tt]

perlnoob · Dec 29, 2004

I just did some more data mining, and found another "problem" with my cells...

here is the example line:
cell1,""ce"ll2""","cell,3",cell4....cellx

Some cells have more than one "pair" of double quotes.... So I have a variable number of double quotes per cell now...sheesh

So
cell1,""ce"ll2""","cell,3",cell4....cellx
should be
cell1,cell2,cell3,cell4...cellx

I'm still looking at the TEXT::CSV module, but the admin is gone, so I can't install the module on the server, so I'm hoping some match/substitution will work!!

cntr · Dec 29, 2004

[tt]
cell1,"ce,ll,2",cell3,"cell,4",cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
"cell,1",cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,"cellx"
cell1,""ce"ll2""","cell,3",cell4,cellx
cell1,""ce",l,l2,""","cell,3",cell4,cellx
cell1,cell2,cell3,"""ce,ll"4"",cell5,cellx
[/tt]
is transformed to
[tt]
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cellx
cell1,cell2,cell3,cell4,cellx
cell1,cell2,cell3,cell4,cell5,cellx
[/tt]

Code:

{ split( $0, wheat, /""?/ )
  splitp( $0, chaff, "\"\"?" )
  
  for (i=1; i in wheat; i++)
  { s = wheat[i]
    if ( inquotes )
      cell = cell s
    else
      printf "%s", s
    if (i in chaff)
    { len = length( chaff[i] )
      if ( inquotes )
      { inquotes -= len
        if ( !inquotes )
        { gsub(/,/, "", cell)
          printf "%s", cell
        }
      }
      else
      { inquotes = (len>1 ? 4 : 1)
        cell = ""
      }
    }
  }
  print ""
}

# Like split(), but returns array of the matching portions
# instead of the non-matching ones.
function splitp(s,A,re,        mark,_mark )
{ delete A
  mark = sprintf( "%c", 1 )
  _mark = "[^" mark "]*"
  if ( gsub( re, mark "&" mark, s  ) )
    gsub( "^" _mark mark "|" mark _mark "$", "", s )
  else   s = ""
  return split( s, A, mark _mark mark )
}

perlnoob · Dec 30, 2004

Muchas gracias!

Looks like you're looking at each individual character, and filtering that way.

Awesome approach, thanks for the tip!

cntr · Dec 30, 2004

After this line is read
[tt]
cell1,""ce"ll2""","cell,3",cell4,cellx
[/tt]
and these lines are executed
[tt]
split( $0, wheat, /""?/ )
splitp( $0, chaff, "\"\"?" )
[/tt]
the array wheat contains 7 items, shown separated by equal signs:
[tt]
cell1,=ce=ll2==,=cell,3=,cell4,cellx
[/tt]
The array chaff has 6 items:
[tt]
""="=""="="="
[/tt]
We step through both arrays simultaneously (first item in array has index 1).

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

TOUGH string match /substitute

perlnoob

Programmer

TonyGroves

Programmer

perlnoob

Programmer

rharsh

Technical User

perlnoob

Programmer

cntr

Instructor

perlnoob

Programmer

cntr

Instructor

perlnoob

Programmer

cntr

Instructor

Similar threads

Part and Inventory Search

Sponsor