Text file extraction problem 2

KenCunningham

Technical User
Mar 20, 2001

Hi Folks,

I've recently been struggling with a problem whereby I receive a text file for processing, but have to remove any entries which fall below a certain monetary value before doing so. Basically, these entries are the text of reminder letters, each a standard 65 lines in length.

So far I have been able to identify the lines where these values occur (using grep -n), and have used head and tail to remove the lines above and below the first occurrence, writing the remainder to a new file. But this alters the number of lines in the file, so the line numbers obtained by the first grep -n are no longer valid.
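To illustrate, this is roughly the sort of thing I've been trying (the file name, pattern and line numbers below are just made up for the example):

# letters.txt, the pattern and the line numbers are made up for this example
grep -n 'VALUE: 0.00' letters.txt

# say the first hit falls inside the third 65-line letter (lines 131-195);
# keep everything before and after that letter
head -130 letters.txt > keep.txt
tail -n +196 letters.txt >> keep.txt

Once that first letter has been cut out, of course, the line numbers from the original grep -n no longer match the new file.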

I may (well) be making this more difficult than it needs to be, but if anyone has any experience of removing documents from text files on the basis of value or other criteria, I'd be glad to hear how they managed it. If you require further information, please let me know. TIA.
 
Have I understood this correctly? Do you want to remove letters where the value falls below a certain amount?

If so I'd be tempted to split the original file into separate files, each with a different letter in. Then use grep to identify which letters you want & cat them together.

There are probably quite neat ways which I'm not immediately aware of to split the file using awk or sed - but you could also use expr to divide the number of lines in the file by 65 to find how many letters it contains & then use head/tail to chop the original file into pieces.
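Off the top of my head, something like this might do it (untested, and assuming the input really is called letters.txt and every letter is exactly 65 lines):

# letters.txt and the 65-line length are assumptions for the example
LINES=`wc -l < letters.txt`
LETTERS=`expr $LINES / 65`
N=1
while [ $N -le $LETTERS ]
do
  # last line of letter N, then peel off the final 65 lines of that chunk
  END=`expr $N \* 65`
  head -$END letters.txt | tail -65 > letter.$N
  N=`expr $N + 1`
done

You could then grep each letter.$N file for the values you want and cat the keepers together.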

Any use?
 
This sounds like it's more suited to a small C program than a script. (Always choose the right tool for the job ;-))

However, if you really want to use a script, maybe you could make use of the split command since you know the number of lines in each document. I'm thinking something like:

# split into 65-line chunks named TempOut.aa, TempOut.ab, ...
split -l 65 InputFile TempOut.
for FILE in TempOut.*
do
  # chunks matching the pattern go to Output1, the rest to Output2
  grep pattern "$FILE" > /dev/null
  if [ $? = 0 ]; then
    cat "$FILE" >> Output1
  else
    cat "$FILE" >> Output2
  fi
  rm "$FILE"
done

Any help?
 
Thanks for your input, folks. I'll give both methods some trial data and see how I get on. I'll get back to you with the result. Cheers.
 
Guys, I'm afraid I was misled and in turn misled you. It turns out that the documents supplied do not all consist of 65 lines. Some may be 66 or 67 because two reminders are included within the text.

If it's any consolation, your ideas were perfectly feasible and I got sensible output from those files which did consist of a multiple of 65 lines. Ah well, back to the drawing board!

Thanks again.
 
One example of processing variable length text records, using Perl:
Code:
#!/usr/bin/perl -w

use English;

my $threshold = 4;

# read one blank-line-separated record at a time
$INPUT_RECORD_SEPARATOR = "\n\n";

while( <DATA> ) {
  # print only records whose value exceeds the threshold
  if( $_ =~ /value=(\d+)/ ) {
    if( $1 > $threshold ) {
      print;
    }
  }
}


__END__
record number=1
value=5

record number=2
value=2

record number=3
some other setting=5
value=7
some other setting=7
Cheers, Neil :)
 
I'm not a perl guru myself, but given the variable number of lines, it makes sense to use something like perl or C that can buffer input, and then make a decision to print based on some text parsing.
 
Neil, many thanks, but we don't have Perl. I think, however, if it makes life easier it might be time to invest! Thanks again.
 
Ken, if each letter has an identifiable header, you can still split the input into separate letter files with something like this:

typeset -i SUFFIX
OUTFILE=output
SUFFIX=0
while read LINE
do
  echo $LINE | grep HEADER > /dev/null
  if [ $? = 0 ]; then
    SUFFIX=$SUFFIX+1
  fi
  echo $LINE >> $OUTFILE.$SUFFIX
done < file

Then you can use a for loop to loop through the files, as I suggested above.
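For example, something along these lines (the 'VALUE: 0' pattern and the keep/rejected file names are just placeholders for whatever identifies the letters you want to drop):

for FILE in output.*
do
  # the pattern and output names are placeholders - adjust to suit
  # letters containing a low value go to rejected, the rest to keep
  grep 'VALUE: 0' $FILE > /dev/null
  if [ $? = 0 ]; then
    cat $FILE >> rejected
  else
    cat $FILE >> keep
  fi
  rm $FILE
done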
 
Dobbyn, thanks - I'd come to a similar conclusion myself whilst thinking about this overnight (sad but true!). I'll give it a go when I get the chance today and get back to you. Cheers.
 
Dobbyn, just to let you know that your solution works, with just one problem. Unfortunately, all lines where text is separated by more than one space (i.e. to fit into a box on the form) seem to be compressed so that only one space is retained in the output file. Seems a little strange to me, but whilst I investigate further, have you any ideas why this might be? Cheers.
 
Folks, and Dobbyn in particular, sometimes I amaze myself with my own stupidity :~/ Putting "" around the $LINE statement above seems to have done the trick. Very many thanks, and an additional star for Dobbyn.
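In other words, the line that writes out each record now reads:

echo "$LINE" >> $OUTFILE.$SUFFIX

Without the quotes, the shell splits $LINE into separate words and echo joins them back together with single spaces, which is why the extra spacing was being squeezed out.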
 
Sorry - I didn't thoroughly test that script before posting it :) Glad to be able to help!
 
Ken, invest in perl????
It's free :) all you need to invest is time installing it.

It's well worth it too.

There's no present like the time, they say. - Henry's Cat.
 
KarveR, thanks for that, I'll get a copy. Do you recommend the Nutshell book as a start to the learning process? Cheers.
 