
Using Awk to find a character in a file


debih99 (Programmer)
Mar 22, 2001
I have a 'master' file; in this example each record is 20 bytes long. The data looks like this:
11111TestOne0005.10
22222TestTwo0006.50
The key in this file runs from column 1 to column 12, so 11111TestOne and 22222TestTwo are the keys. (In reality, this file will be larger: approx. 7,000 records at 450 bytes each.)

I have a second file, which I will need to search to find each key. (This file in reality will be close to 1 million records at 600 bytes each.)
The data looks like this:
9999999999999999999911111TestOneXXXX
999999999999999999999999999999999999
The key in this file runs from column 21 to column 32.

The result of this process should be a NEW file containing the records from the second file whose keys matched the master file. In this example, the new file should look like this:
9999999999999999999911111TestOneXXXX

Here is my code:
[tt]
key=`cut -c1-12 masterfile`
for i in $key
do
    awk /"$i"/ second_file >> new_second_file
done
[/tt]

When I test on a 'small' sample of data, it works fine. BUT when I test it against reality (1 million recs), the process runs long: after 90 minutes I cancel the job, so I don't actually know if it is working. Probably not.

Can someone suggest another approach? Thanks
 
I could suggest a Perl-based solution -- would that be of use?
Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.
 
Mike,
Unfortunately they do not want us to use Perl here :(
So I must follow the standards using UNIX ksh -- on an HP platform.
What now?
 
.. you could try grep (or one of its flavours, such as egrep or fgrep), which is a utility geared specifically to pattern matching:

[tt]
grep "$i" filename > resultfile
[/tt]

see "man grep" for various interesting options

 
Debi,

You can take a step out of the process like this.
[tt]
for i in $(cut -c1-12 masterfile)
do
awk /"$i"/ second_file >> new_second_file
done
[/tt]

It may well be that grep will run faster than awk in this case, so also try:
[tt]
for i in $(cut -c1-12 masterfile)
do
grep "$i" second_file >> new_second_file
done
[/tt]

If you need to be sure that the process is running -- add the line:
[tt]
print '.'
[/tt]

after the awk, or grep, line.
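Put together, the grep version with a progress indicator would look something like:
[tt]
for i in $(cut -c1-12 masterfile)
do
    grep "$i" second_file >> new_second_file
    print '.'    # one dot per master key, so you can watch it work
done
[/tt]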
Mike
 
..this is a bigger challenge than it first appears. On closer inspection, the real problem is that, for a master file of 7,000 records and a data file of 1 million records, these solutions will perform 7,000 million record reads (one full pass over the data file per key) - a lot of activity and time whatever your cpu/disk combo. Perhaps a different and more sophisticated approach is required, bearing in mind the restrictions imposed by shell programming.

Perhaps....

1. Cut keys from master file, then sort them
2. Set up array of master file keys
3. Process data records, and do binary chop for each record against array of keys.

I'd be happy to let you know what a binary chop is if necessary. It would certainly be an interesting shell exercise!
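For the record, here is a rough ksh sketch of the chop itself -- the names (KEYS, target) are just illustrative, it assumes the keys contain no whitespace, and note that some older ksh88 builds cap array sizes below the 7,000 entries needed here:
[tt]
# load the sorted master keys into a ksh array
set -A KEYS $(cut -c1-12 masterfile | sort)
n=${#KEYS[*]}

target=11111TestOne            # the key pulled from one data record
lo=0 hi=$((n - 1)) found=0
while (( lo <= hi ))
do
    mid=$(( (lo + hi) / 2 ))   # midpoint of the remaining range
    if [[ ${KEYS[mid]} = "$target" ]]; then
        found=1
        break
    elif [[ ${KEYS[mid]} < "$target" ]]; then
        lo=$((mid + 1))        # target sorts after the midpoint
    else
        hi=$((mid - 1))        # target sorts before the midpoint
    fi
done
(( found )) && print "matched $target"
[/tt]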

And to give it go-faster stripes... has anyone ever done multi-threading with shell?!?
 
Tim,

Yes, good point -- and I didn't spot it.... (doh!)

Anyway -- another approach.

Some pseudo-code:

[tt]
read the master file keys into an associative (indexed) array (7,000 entries)

for each line in the second file (1,000,000 rows)
    pull the key out of columns 21-32
    is that key an index of the master array?
        yes: print the line
next line in the second file
[/tt]

This approach has the advantage that it cuts right down on the file reads -- it still won't be fast though. It *will* have to do the 1 million lookups in the associative array.

C would be the best language to write this in -- fastest to run.

Second would be Perl -- but you can't use that.

Awk -- yep, will take "a while" to run though.

ksh -- forget it...

So -- writing the whole thing in Awk using an associative array would be my suggestion.
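Something like this, as a sketch only (assuming the column positions from the original posts, and an awk/nawk that knows FNR):
[tt]
# pass 1 (NR == FNR): store the 7,000 master keys, columns 1-12,
# in the associative array "keys"
# pass 2: print any data record whose columns 21-32 hold a stored key
awk 'NR == FNR { keys[substr($0, 1, 12)] = 1; next }
     substr($0, 21, 12) in keys' masterfile second_file > new_second_file
[/tt]
One read of each file, and each of the million records costs a single array lookup rather than 7,000 comparisons.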
Mike
 