Splitting text file 2

lmiu · Jan 25, 2006

Hi, can anyone help me, please? Thanks in advance.

Problem: I have a text file converted from a PDF, and it contains a table of data. The columns do not line up (they drift in a curve), and columns and words are separated by one or more spaces at random, even inside a column that contains two words. Is there any way to split each line into its fields so that I can use each field later? Thanks. Here is a sample:

number name date another_date precis
1234 red monkeys 01.01.2006 01.01.2006 dfsdfhdsfh
2345 orange fruit 01.01.2005 01.01.2006 jhfskdfha
3456 apple 01.01.2005 01.01.2006 dfasdf
5678 pear 01.01.2005 01.01.2006 dkfsdlfk

TonyGroves · Jan 25, 2006

If the only field containing embedded spaces is the "name" field, then you could use:
[tt]
for(<F>)
{
# anchored with $ so the lazy "name" group grows to take in embedded spaces
next unless /^(\S+)\s+(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s*$/;
$number=$1;
$name=$2;
$date=$3;
$another_date=$4;
$precis=$5;
...
}
[/tt]

lmiu · Jan 25, 2006

Ah no, sorry. Because of the restrictions of the message box here, I didn't draw a full sample.

There could be more than one such field, and a field could also break across lines and carry on beneath, like:

number name date another_date precis
1234 red monkeys 01.01.2006 01.01.2006 dfsdfhdsfh
smoking
2345 orange fruit 01.01.2005 01.01.2006 jhfsk dfha
dfsd jdf dsf
3456 apple 01.01.2005 01.01.2006 dfasdf
5678 pear 01.01.2005 01.01.2006 dkfsdlfk

rharsh · Jan 25, 2006

How are the fields delimited in the text file? By tabs? Or is it an arbitrary number of spaces?

lmiu · Jan 25, 2006

The files are not delimited in any way; the real files are very similar to my sample above, but on a larger scale.

The number of spaces between columns is definitely one or more. Words within the same column are separated by exactly one space.

It's a nightmare :C

TonyGroves · Jan 25, 2006

Then you would have to locate the dates:
[tt]
for(<F>)
{
/^(\S+)\s+(.+?)\s+(\S+?\.\S+?\.\S+)\s+(\S+?\.\S+?\.\S+)\s+(.+)/;
$number=$1;
$name=$2;
$date=$3;
$another_date=$4;
$precis=$5;
...
}
[/tt]
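
To make the date-anchoring idea concrete, here is a minimal, self-contained sketch. It assumes every date has the form dd.mm.yyyy and that a record sits on a single line; the helper name split_record is made up for illustration, and the two sample lines are taken from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split one single-line record into five fields,
# using the two dd.mm.yyyy dates as anchors (the date format is an assumption).
sub split_record {
    my ($line) = @_;
    my @fields = $line =~
        /^(\d+)\s+(.+?)\s+(\d\d\.\d\d\.\d{4})\s+(\d\d\.\d\d\.\d{4})\s+(.*\S)\s*$/;
    s/\s+/ /g for @fields;    # collapse runs of spaces inside a field
    return @fields;           # empty list if the line does not match
}

for my $line ('1234  red monkeys 01.01.2006 01.01.2006  dfsdfhdsfh',
              '2345 orange   fruit 01.01.2005  01.01.2006   jhfsk  dfha') {
    my @fields = split_record($line) or next;
    print join('|', @fields), "\n";
}
```

Because the dates are the only fields with a fixed shape, everything between the leading number and the first date falls into the name, however many words it has. Lines that do not match, such as a continuation line, are skipped here rather than parsed.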

rharsh · Jan 25, 2006

Ok.. a different question then, you mentioned records could span multiple lines, but does a record always start with a new line followed by some quantity of numbers?

lmiu · Jan 25, 2006

Yes, each record always starts with a unique numeric ID field, and that number is always on a new line, as shown above.

TonyGroves · Jan 25, 2006

Oh, I never noticed the multiple-line bit.

rharsh · Jan 25, 2006

See if this helps:

Code:

$_ = <DATA>; # Grab Headers;
print join('|', split(/\s+/, $_)), "\n";

my (@temp, $line);
while ($line = <DATA>) {
    if (defined($line) && $line =~ /^\d+/) {
        do {
            push @temp, $line;
            $line = <DATA>;
        } until (!defined($line) || $line =~ /^\d+/);  # test the line just read, so a final continuation line is not dropped
        &parse_record(\@temp);
        @temp = ();
        redo;
    }
}

sub parse_record {
    my $record;
    map {/\s*(.+)\s*/; $record .= "$1 "} @{$_[0]};
    my @results =
    ($record =~ /^(\d+)\s+([\w ]+?)\s+((?:\d+\.){2}\d+)\s*((?:\d+\.){2}\d+)\s*(.+[^\s])\s*$/);
    print join('|', @results), "\n";
}

__DATA__
number    name    date      another_date  precis
1234  red monkeys 01.01.2006 01.01.2006  dfsdfhdsfh
      smoking
2345 orange   fruit 01.01.2005  01.01.2006   jhfsk  dfha
                                              dfsd jdf dsf
3456   apple     01.01.2005 01.01.2006    dfasdf
5678 pear          01.01.2005    01.01.2006 dkfsdlfk

KevinADC · Jan 25, 2006

A slightly different approach, although I think rharsh's code is better since it validates the lines/data better:

Code:

my @results = ();
my $line;
while(<DATA>){
   chomp;
   if (/^\d+\s+/){
      # a new record starts: store the previous one, if any
      push @results,$line if defined $line;
      $line = $_;
      next;
   }
   if (defined $line and (/^\s+/)) {
      # continuation line: append it to the current record
      $line .= $_;
   }
}
push @results,$line if defined $line;   # store the final record
print "$_\n" for @results;
__DATA__
1234  red monkeys 01.01.2006 01.01.2006  dfsdfhdsfh
      smoking
2345 orange   fruit 01.01.2005  01.01.2006   jhfsk  dfha
                                              dfsd jdf dsf
3456   apple     01.01.2005 01.01.2006    dfasdf
5678 pear          01.01.2005    01.01.2006 dkfsdlfk

lmiu · Jan 26, 2006

Thanks rharsh and Kevin for your help. They both work fine for this scenario. I have been trying to split another file, almost identical to the one I posted above, but this time there is another column between column two (name) and column three (date), and it looks just like column two.

Because this new column three looks like column two (name), it is now almost, if not completely, impossible to tell what belongs to column two (name) and what belongs to the new column three, e.g.:

id name new_name date another_date precis

1 red monkeys smoking monkeys
smoking red
2 was orange orange peel
3 pear pear

No need to resolve.

lmiu

stevexff · Jan 26, 2006

Can you tackle the problem at source, i.e. get the supplier of the file to add a delimiter? Without a pattern to follow, you won't be able to distinguish between the end of col 2 and the start of col 3, no matter which way you try to do it.
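
The ambiguity can be shown with a small sketch. It assumes only that each of the two free-text columns holds at least one word; the helper name candidate_splits is made up for illustration. For the first sample row it lists every way the words could be divided between name and new_name, and nothing in the text itself says which one is right.

```perl
use strict;
use warnings;

# Hypothetical helper: list every way a run of words could be divided
# between two adjacent free-text columns (at least one word in each).
sub candidate_splits {
    my ($text) = @_;
    my @words = split ' ', $text;
    my @splits;
    for my $i (1 .. $#words) {
        push @splits, [ "@words[0 .. $i - 1]", "@words[$i .. $#words]" ];
    }
    return @splits;
}

# Prints one candidate per line as name|new_name
for my $pair (candidate_splits('red monkeys smoking monkeys')) {
    print join('|', @$pair), "\n";
}
```

Four words give three equally plausible splits, so some outside knowledge (a list of valid names, or a delimiter added at the source) is needed to choose between them.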

lmiu · Jan 26, 2006

Ah, the answer is no. What I see is what I get. I wish it were delimited; it would have been a lot easier.

I tried using substr to index different parts of the line, but because parts of the file are too far out of line, I would scrape the wrong fields. Nasty PDF.

Thanks anyway
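
For reference, the fixed-position approach described above might look like the sketch below, written with unpack rather than substr. The column widths in the template are invented for illustration and the helper name fixed_fields is hypothetical. The second call shows how a line shifted by just two characters gets cut at the wrong positions.

```perl
use strict;
use warnings;

# Assumed layout (hypothetical widths): number 6, name 14, date 11,
# another_date 11, precis takes the rest of the line.
my $template = 'A6 A14 A11 A11 A*';

sub fixed_fields {
    my ($line) = @_;
    return unpack($template, $line);   # 'A' strips trailing spaces from each field
}

# A line that really does follow the assumed layout
my $aligned = sprintf '%-6s%-14s%-11s%-11s%s',
    '1234', 'red monkeys', '01.01.2006', '01.01.2006', 'dfsdfhdsfh';
print join('|', fixed_fields($aligned)), "\n";

# The same line shifted right by two characters: fields are cut in the wrong places
print join('|', fixed_fields("  $aligned")), "\n";
```

This only works when every line follows the same column positions, which is exactly what the converted PDF does not guarantee.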

KevinADC · Jan 26, 2006

You may need to find a better way to convert the pdf to text, so that hopefully there can be a consistent delimiter inserted between the columns.
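
As a sketch of what a consistent delimiter would buy: some converters can keep the page layout (pdftotext, for instance, has a -layout option), and if that leaves at least two spaces between columns while words inside a column stay one space apart, a plain split is enough. Whether this particular PDF converts that cleanly is an assumption, the sample line is invented, and the helper name split_columns is hypothetical.

```perl
use strict;
use warnings;

# Assumption: columns are separated by two or more spaces,
# and words inside a column by exactly one.
sub split_columns {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;          # trim both ends
    return split /\s{2,}/, $line;
}

my $sample = '1  red monkeys  smoking monkeys  01.01.2006  01.01.2006  dfsdfhdsfh';
print join('|', split_columns($sample)), "\n";
```

With that guarantee, even two adjacent free-text columns can be told apart, because the boundary no longer depends on the content of the fields.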

lmiu · Jan 26, 2006

True, I tried a few.

Any recommendations for a PDF to text converter?
