Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

TOUGH string match /substitute

Status
Not open for further replies.

perlnoob

Programmer
Dec 29, 2004
6
US
hey all,

Situation:
1. A file comes in a .csv format (I have no control on that part).
2.Cells need to be parsed out for DBI module load into database
3.Cell Data looks like this:
col1,col2,col3....colx
I use a while loop, and split function to assign each row a variable for the loop for processing

Problem:
1.Spliting the variables up by a comma doesnt work because some data looks like this:
cell1,"ce,ll,2",cell3,"cell,4",cell5...cellx
2. So -"ce- becomes cell2, and -ll- becomes cell3...etc. Basically the columns are shifted to the right.

Side note:
1. I did notice that all the cells that have commas (within the cell) such as cell2, and cell4, etc, have double quotes around them (due to the nature of the sftp process when I port in the file.

Question:
So how do I parse out the cells, and say "oh, and when you see double quotes, I need you to remove all the commas up until you see the next double quote". E.G. I want "ce,ll,x" to look like -cellx- (so that the split doesnt fail).

thanks for your help!!!

 
If you're just going to be loading the parsed data into a database, then you might not need to go through this step at all. Some databases (MySQL is one) can take input directly from CSV files using the SQL "LOAD DATA" command.
 
I know about SQL Loader, and have tried it already, but there are limitations that it can do when I load the data, such as joining queries, doing look ups, storing temp variables for calculations, etc....


thanks,
 
Thanks, I'll give that a try!
Also, about my last post, I meant "can't" do, not can..oops.

Sorry about that, but thanks for the reply.

Also, is that module on CPAN?
 
Code:
# Set the input field-separator (FS) to " ;
# use empty string as the output field-separator
# (OFS; used by awk to rebuild the line read when
# assignment is made to $1, $2, etc.).
BEGIN { FS="\""; OFS="" }

# The even-numbered "fields" are the ones that are
# enclosed in quotes.  Zap the commas in them.
{ for (i=2;i<=NF;i+=2)
    if ( !gsub(/,/, "", $i) )
      # A nominal change must be made to field.
      $i=$i
  print
}
Run with awk -f prog.awk infile

With the input
[tt]
cell1,"ce,ll,2",cell3,"cell,4",cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
"cell,1",cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,"cellx"
[/tt]
the ouput is
[tt]
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
[/tt]
 
I just did some more data mining, and found another "problem" with my cells...

here is the example line:
cell1,""ce"ll2""","cell,3",cell4....cellx

Some cells have more than one "pair" of double quotes.... So I have a variable number of double quotes per cell now...sheesh

So
cell1,""ce"ll2""","cell,3",cell4....cellx
should be
cell1,cell2,cell3,cell4...cellx

I'm still looking at the TEXT::CSV module, but the admin is gone, so I can't install the module on the server, so I'm hoping some match/substitution will work!!

 
[tt]
cell1,"ce,ll,2",cell3,"cell,4",cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
"cell,1",cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,"cellx"
cell1,""ce"ll2""","cell,3",cell4,cellx
cell1,""ce",l,l2,""","cell,3",cell4,cellx
cell1,cell2,cell3,"""ce,ll"4"",cell5,cellx
[/tt]
is transformed to
[tt]
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cell5,cellx
cell1,cell2,cell3,cell4,cellx
cell1,cell2,cell3,cell4,cellx
cell1,cell2,cell3,cell4,cell5,cellx
[/tt]

Code:
{ split( $0, wheat, /""?/ )
  splitp( $0, chaff, "\"\"?" )
  
  for (i=1; i in wheat; i++)
  { s = wheat[i]
    if ( inquotes )
      cell = cell s
    else
      printf "%s", s
    if (i in chaff)
    { len = length( chaff[i] )
      if ( inquotes )
      { inquotes -= len
        if ( !inquotes )
        { gsub(/,/, "", cell)
          printf "%s", cell
        }
      }
      else
      { inquotes = (len>1 ? 4 : 1)
        cell = ""
      }
    }
  }
  print ""
}

# Like split(), but returns array of the matching portions
# instead of the non-matching ones.
function splitp(s,A,re,        mark,_mark )
{ delete A
  mark = sprintf( "%c", 1 )
  _mark = "[^" mark "]*"
  if ( gsub( re, mark "&" mark, s  ) )
    gsub( "^" _mark mark "|" mark _mark "$", "", s )
  else   s = ""
  return split( s, A, mark _mark mark )
}
 
Muchas gracias!

Looks like you're looking at each individual character, and filtering that way.

Awesome approach, thanks for the tip!
 
After this line is read
[tt]
cell1,""ce"ll2""","cell,3",cell4,cellx
[/tt]
and these lines are executed
[tt]
split( $0, wheat, /""?/ )
splitp( $0, chaff, "\"\"?" )
[/tt]
the array wheat contains 7 items, shown separated by equal signs:
[tt]
cell1,=ce=ll2==,=cell,3=,cell4,cellx
[/tt]
The array chaff has 6 items:
[tt]
""="=""="="="
[/tt]
We step through both arrays simultaneously (first item in array has index 1).
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top