Text file extraction problem 2

KenCunningham

Technical User
Mar 20, 2001

Hi Folks,

I've recently been struggling with a problem whereby I receive a text file for processing, but have to remove any entries which fall below a certain monetary value before doing so. Basically, these entries are the text of reminder letters, each a standard 65 lines in length.

So far I have been able to identify the lines where these values occur (using grep -n), and have used head and tail to remove the lines above and below the first occurrence, writing the remainder to a new file. But this alters the number of lines in the file, so the line numbers obtained by the first grep -n are no longer valid.
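To illustrate, this is roughly the sort of thing I've been trying (the file name, pattern and line numbers below are just made up for the example):

# letters.txt, the pattern and the line numbers are made up for this example
grep -n 'VALUE: 0.00' letters.txt

# say the first hit falls inside the third 65-line letter (lines 131-195);
# keep everything before and after that letter
head -130 letters.txt > keep.txt
tail -n +196 letters.txt >> keep.txt

Once that first letter has been cut out, of course, the line numbers from the original grep -n no longer match the new file.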

I may (well) be making this more difficult than it needs to be, but if anyone has any experience of removing documents from text files on the basis of value or other criteria, I'd be glad to hear how they managed it. If you require further information, please let me know. TIA.
 
Have I understood this correctly? Do you want to remove letters where the value falls below a certain amount?

If so I'd be tempted to split the original file into separate files, each with a different letter in. Then use grep to identify which letters you want & cat them together.

There are probably quite neat ways which I'm not immediately aware of to split the file using awk or sed - but you could also use expr to divide the number of lines in the file by 65 to find how many letters it contains & then use head/tail to chop the original file into pieces.
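Off the top of my head, something like this might do it (untested, and assuming the input really is called letters.txt and every letter is exactly 65 lines):

# letters.txt and the 65-line length are assumptions for the example
LINES=`wc -l < letters.txt`
LETTERS=`expr $LINES / 65`
N=1
while [ $N -le $LETTERS ]
do
  # last line of letter N, then peel off the final 65 lines of that chunk
  END=`expr $N \* 65`
  head -$END letters.txt | tail -65 > letter.$N
  N=`expr $N + 1`
done

You could then grep each letter.$N file for the values you want and cat the keepers together.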

Any use?
 
This sounds like it's more suited to a small C program than a script. (Always choose the right tool for the job ;-))

However, if you really want to use a script, maybe you could make use of the split command since you know the number of lines in each document. I'm thinking something like:

# split into 65-line chunks named TempOut.aa, TempOut.ab, ...
split -l 65 InputFile TempOut.
for FILE in TempOut.*
do
  # chunks matching the pattern go to Output1, the rest to Output2
  grep pattern "$FILE" > /dev/null
  if [ $? = 0 ]; then
    cat "$FILE" >> Output1
  else
    cat "$FILE" >> Output2
  fi
  rm "$FILE"
done

Any help?
 
Thanks for your input, folks. I'll give both methods some trial data and see how I get on. I'll get back to you with the result. Cheers.
 
Guys, I'm afraid I was misled and in turn misled you. It turns out that the documents supplied do not all consist of 65 lines. Some may be 66 or 67 because two reminders are included within the text.

If it's any consolation, your ideas were perfectly feasible and I got sensible output from those files which did consist of a multiple of 65 lines. Ah well, back to the drawing board!

Thanks again.
 
One example of processing variable length text records, using Perl:
Code:
#!/usr/bin/perl -w

use English;

my $threshold = 4;

# read one blank-line-separated record at a time
$INPUT_RECORD_SEPARATOR = "\n\n";

while( <DATA> ) {
  # print only records whose value exceeds the threshold
  if( $_ =~ /value=(\d+)/ ) {
    if( $1 > $threshold ) {
      print;
    }
  }
}


__END__
record number=1
value=5

record number=2
value=2

record number=3
some other setting=5
value=7
some other setting=7
Cheers, Neil :)
 
I'm not a perl guru myself, but given the variable number of lines, it makes sense to use something like perl or C that can buffer input, and then make a decision to print based on some text parsing.
 
Neil, many thanks, but we don't have Perl. I think, however, if it makes life easier it might be time to invest! Thanks again.
 
Ken, if each letter has an identifiable header, you can still split the input into separate letter files with something like this:

typeset -i SUFFIX
OUTFILE=output
SUFFIX=0
while read LINE
do
  echo $LINE | grep HEADER > /dev/null
  if [ $? = 0 ]; then
    SUFFIX=$SUFFIX+1
  fi
  echo $LINE >> $OUTFILE.$SUFFIX
done < file

Then you can use a for loop to loop through the files, as I suggested above.
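For example, something along these lines (the 'VALUE: 0' pattern and the keep/rejected file names are just placeholders for whatever identifies the letters you want to drop):

for FILE in output.*
do
  # the pattern and output names are placeholders - adjust to suit
  # letters containing a low value go to rejected, the rest to keep
  grep 'VALUE: 0' $FILE > /dev/null
  if [ $? = 0 ]; then
    cat $FILE >> rejected
  else
    cat $FILE >> keep
  fi
  rm $FILE
done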
 
Dobbyn, thanks - I'd come to a similar conclusion myself whilst thinking about this overnight (sad but true!). I'll give it a go when I get the chance today and get back to you. Cheers.
 
Dobbyn, just to let you know that your solution works, with just one problem. Unfortunately, all lines where text is separated by more than one space (i.e. to fit into a box on the form) seem to be compressed so that only one space is retained in the output file. Seems a little strange to me, but whilst I investigate further, have you any ideas why this might be? Cheers.
 
Folks, and Dobbyn in particular, sometimes I amaze myself with my own stupidity :~/ Putting "" around the $LINE statement above seems to have done the trick. Very many thanks, and an additional star for Dobbyn.
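In other words, the line that writes out each record now reads:

echo "$LINE" >> $OUTFILE.$SUFFIX

Without the quotes, the shell splits $LINE into separate words and echo joins them back together with single spaces, which is why the extra spacing was being squeezed out.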
 
Sorry - I didn't thoroughly test that script before posting it :) Glad to be able to help!
 
Ken, invest in perl????
It's free :) all you need to invest is time installing it.

It's well worth it too.

There's no present like the time, they say. - Henry's Cat.
 
KarveR, thanks for that, I'll get a copy. Do you recommend the Nutshell book as a start to the learning process? Cheers.
 