Splitting text file 2

Status
Not open for further replies.

lmiu

Programmer
Jan 25, 2006
16
GB
Hi, can anyone help me, please? Thanks in advance.

Problem: I have a text file converted from a PDF, and it contains a table of data. The columns do not line up (they drift in a curve), and the columns are separated by one or more spaces, at random. A single column can also contain two words separated by a space. Is there any way to split each line into its fields so that I can use each field later? Thanks

number name date another_date precis
1234 red monkeys 01.01.2006 01.01.2006 dfsdfhdsfh
2345 orange fruit 01.01.2005 01.01.2006 jhfskdfha
3456 apple 01.01.2005 01.01.2006 dfasdf
5678 pear 01.01.2005 01.01.2006 dkfsdlfk
 
If the only field containing embedded spaces is the "name" field, then you could use:
[tt]
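while(<F>)
{
# Skip lines that do not match, so $1..$5 never hold values from an earlier line.
# The last field is a single word anchored at the end of the line; otherwise the
# lazy name field stops at its first word and the rest shifts one column left.
next unless /^(\S+)\s+(.+?)\s+(\S+)\s+(\S+)\s+(\S+)\s*$/;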
$number=$1;
$name=$2;
$date=$3;
$another_date=$4;
$precis=$5;
...
}
[/tt]
 
Ah no, sorry, due to the restriction of the message box here I didn't draw a full sample.

More than one field could contain embedded spaces, and a field can also break onto a new line and carry on beneath it, like:

number name date another_date precis
1234 red monkeys 01.01.2006 01.01.2006 dfsdfhdsfh
smoking
2345 orange fruit 01.01.2005 01.01.2006 jhfsk dfha
dfsd jdf dsf
3456 apple 01.01.2005 01.01.2006 dfasdf
5678 pear 01.01.2005 01.01.2006 dkfsdlfk
 
How are the fields delimited in the text file? By tabs? Or is it an arbitrary number of spaces?
 
The files are not delimited in any way; they are very similar to my sample above, but on a larger scale.

The number of spaces between columns is definitely one or more. Words within the same column are separated by exactly one space.

It's a nightmare :C
 
Then you would have to locate the dates:
[tt]
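while(<F>)
{
# Anchor on the two dotted dates; skip lines that do not match (for example
# continuation lines), so $1..$5 never hold values from an earlier line.
next unless /^(\S+)\s+(.+?)\s+(\S+?\.\S+?\.\S+)\s+(\S+?\.\S+?\.\S+)\s+(.+)/;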
$number=$1;
$name=$2;
$date=$3;
$another_date=$4;
$precis=$5;
...
}
[/tt]
 
Ok.. a different question then, you mentioned records could span multiple lines, but does a record always start with a new line followed by some quantity of numbers?
 
Yes, each record always starts with a unique numeric id field, and that number is always on a new line, as shown above.



 
See if this helps:
Code:
$_ = <DATA>; # Grab Headers;
print join('|', split(/\s+/, $_)), "\n";

my (@temp, $line);
while ($line = <DATA>) {
    if (defined($line) && $line =~ /^\d+/) {
        do {
            push @temp, $line;
            $line = <DATA>;
        } until (!defined($line) || $line =~ /^\d+/); # stop at end of input or at the next record, so a final continuation line is kept
        &parse_record(\@temp);
        @temp = ();
        redo;
    }
}

sub parse_record {
    my $record;
    map {/\s*(.+)\s*/; $record .= "$1 "} @{$_[0]};
    my @results =
    ($record =~ /^(\d+)\s+([\w ]+?)\s+((?:\d+\.){2}\d+)\s*((?:\d+\.){2}\d+)\s*(.+[^\s])\s*$/);
    print join('|', @results), "\n";
}

__DATA__
number    name    date      another_date  precis
1234  red monkeys 01.01.2006 01.01.2006  dfsdfhdsfh
      smoking
2345 orange   fruit 01.01.2005  01.01.2006   jhfsk  dfha
                                              dfsd jdf dsf
3456   apple     01.01.2005 01.01.2006    dfasdf
5678 pear          01.01.2005    01.01.2006 dkfsdlfk
 
A slightly different approach, although I think rharsh's code is better since it validates the lines/data better:

Code:
my @results = ();
my $line;
while(<DATA>){
   chomp;
   if (/^\d+\s+/) {
      # A new record starts here: store the previous one first,
      # so the records stay in their original order
      push @results, $line if defined $line;
      $line = $_;
   }
   elsif (defined $line and /^\s+\S/) {
      # Indented continuation line: append it to the current record
      $line .= $_;
   }
}
# Store the final record
push @results, $line if defined $line;
print "$_\n" for @results;
__DATA__
1234  red monkeys 01.01.2006 01.01.2006  dfsdfhdsfh
      smoking
2345 orange   fruit 01.01.2005  01.01.2006   jhfsk  dfha
                                              dfsd jdf dsf
3456   apple     01.01.2005 01.01.2006    dfasdf
5678 pear          01.01.2005    01.01.2006 dkfsdlfk
 
Thanks rharsh and Kevin for your help. Both work fine for this scenario. I have been trying to split another file, almost identical to the one I posted above, but this time there is another column between column two (name) and column three (date), and it looks just like column two.

Because this new column three looks like column two (name), it is becoming almost if not completely impossible to tell what belongs to column two (name) and what belongs to the new column three, e.g.:

id name new_name date another_date precis

1 red monkeys smoking monkeys
smoking red
2 was orange orange peel
3 pear pear


No need to resolve.

lmiu


 
Can you tackle the problem at source, i.e. get the supplier of the file to add a delimiter? Without a pattern to follow, you won't be able to distinguish between the end of col 2 and the start of col 3, no matter which way you try to do it.
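
To make the ambiguity concrete, here is a minimal sketch that lists every way the free-text words of one record could be divided between name and new_name. The helper name candidate_splits is hypothetical, and the input is record 1 from the sample above with the id removed. Nothing in the data marks one candidate as the right one.

```perl
use strict;
use warnings;

# Hypothetical helper: list every way to divide the words of one record
# between the "name" and "new_name" columns (each column gets at least one word).
sub candidate_splits {
    my ($text) = @_;
    my @words = split ' ', $text;
    my @candidates;
    for my $i (1 .. $#words) {
        push @candidates, [ join(' ', @words[0 .. $i - 1]),
                            join(' ', @words[$i .. $#words]) ];
    }
    return @candidates;
}

# Free text of record 1 from the sample, with the id removed.
for my $pair (candidate_splits('red monkeys smoking monkeys')) {
    print "name='$pair->[0]'  new_name='$pair->[1]'\n";
}
```

Four words give three candidate splits, and a regex has no more information than this loop does.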
 
Ah, the answer is no. What I see is what I get. I wish it were delimited; it would have been a lot easier.

I tried using substr to index different parts of the line, but because parts of the file are too far out of alignment, I would scrape the wrong fields. Nasty PDF.

Thanks anyway
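
For reference, the substr idea can be sketched as below, assuming the column start offsets are taken from the header line. The helper names column_starts and slice_fixed and the aligned sample row are hypothetical, built with sprintf so the columns line up exactly. It only works when every row lines up with the header, which is exactly what the converted PDF does not guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based start offset of each word in the header line.
sub column_starts {
    my ($header) = @_;
    my @starts;
    while ($header =~ /\S+/g) {
        push @starts, $-[0];    # $-[0] is where the last match began
    }
    return @starts;
}

# Hypothetical helper: cut a line at the given offsets and trim each piece.
sub slice_fixed {
    my ($line, @starts) = @_;
    my @fields;
    for my $i (0 .. $#starts) {
        my $from  = $starts[$i];
        my $piece = '';
        if ($from < length $line) {
            $piece = $i < $#starts
                ? substr($line, $from, $starts[$i + 1] - $from)
                : substr($line, $from);
        }
        $piece =~ s/^\s+|\s+$//g;
        push @fields, $piece;
    }
    return @fields;
}

# Made-up, perfectly aligned sample; the real file is not this tidy.
my $fmt    = '%-5s%-14s%-17s%s';
my $header = sprintf $fmt, qw(id name new_name date);
my $row    = sprintf $fmt, '1', 'red monkeys', 'smoking monkeys', '01.01.2006';
print join('|', slice_fixed($row, column_starts($header))), "\n";
```

As soon as a row drifts left or right of the header offsets, the cuts land inside words, which matches the wrong fields described above.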
 
You may need to find a better way to convert the pdf to text, so that hopefully there can be a consistent delimiter inserted between the columns.
 
True, I tried a few.

Any recommendations for PDF to text converter?

 