
Using Awk to find a character in a file


debih99 (Programmer)
Mar 22, 2001
I have a 'master' file; in this example each record is 20 bytes long. The data looks like this:
11111TestOne0005.10
22222TestTwo0006.50
The key in this file runs from column 1 to column 12, so 11111TestOne and 22222TestTwo are the keys. (In reality, this file will be larger: approx. 7,000 records at 450 bytes each.)

I have a second file, which I will need to search to find each key. (This file in reality will be close to 1 million records at 600 bytes each.)
The data looks like this:
9999999999999999999911111TestOneXXXX
999999999999999999999999999999999999
The key in this file runs from column 21 to column 32.

The result of this process should be a NEW file containing the records from the second file whose keys matched the master file. In this example, the new file should look like this:
9999999999999999999911111TestOneXXXX

Here is my code:
[tt]
key=`cut -c1-12 masterfile`
for i in $key
do
    awk /"$i"/ second_file >> new_second_file
done
[/tt]

When I test on a 'small' sample of data, it works fine. BUT when I test it against reality (1 million recs), the process runs long: after 90 minutes I cancel the job, so I don't actually know if it is working. Probably not.

Can someone suggest another approach? Thanks
 
I could suggest a Perl-based solution -- would that be of use?
Mike
michael.j.lacey@ntlworld.com
Email welcome if you're in a hurry or something -- but post in tek-tips as well please, and I will post my reply here as well.
 
Mike,
Unfortunately they do not want us to use Perl here :(
So I must follow the standards using UNIX ksh -- on an HP platform.
What now?
 
.. you could try grep (or one of its flavours, such as egrep or fgrep), which is a utility geared specifically to pattern matching:

[tt]
grep "$i" filename > resultfile
[/tt]

see "man grep" for various interesting options

 
Debi,

You can take a step out of the process like this.
[tt]
for i in $(cut -c1-12 masterfile)
do
awk /"$i"/ second_file >> new_second_file
done
[/tt]

It may well be that grep will run faster than awk in this case, so also try:
[tt]
for i in $(cut -c1-12 masterfile)
do
grep "$i" second_file >> new_second_file
done
[/tt]

If you need to be sure that the process is running -- add the line:
[tt]
print '.'
[/tt]

after the awk, or grep, line.
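Put together, the grep version with a progress indicator would look something like:
[tt]
for i in $(cut -c1-12 masterfile)
do
    grep "$i" second_file >> new_second_file
    print '.'    # one dot per master key, so you can watch it work
done
[/tt]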
Mike
 
..this is a bigger challenge than it first appears. On closer inspection, the real problem is that, for a master file of 7,000 records and a data file of 1 million records, these solutions will perform 7,000 million record reads (one full pass over the data file per key) - a lot of activity and time whatever your cpu/disk combo. Perhaps a different and more sophisticated approach is required, bearing in mind the restrictions imposed by shell programming.

Perhaps....

1. Cut keys from master file, then sort them
2. Set up array of master file keys
3. Process data records, and do binary chop for each record against array of keys.

I'd be happy to let you know what a binary chop is if necessary. It would certainly be an interesting shell exercise!
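For the record, here is a rough ksh sketch of the chop itself -- the names (KEYS, target) are just illustrative, it assumes the keys contain no whitespace, and note that some older ksh88 builds cap array sizes below the 7,000 entries needed here:
[tt]
# load the sorted master keys into a ksh array
set -A KEYS $(cut -c1-12 masterfile | sort)
n=${#KEYS[*]}

target=11111TestOne            # the key pulled from one data record
lo=0 hi=$((n - 1)) found=0
while (( lo <= hi ))
do
    mid=$(( (lo + hi) / 2 ))   # midpoint of the remaining range
    if [[ ${KEYS[mid]} = "$target" ]]; then
        found=1
        break
    elif [[ ${KEYS[mid]} < "$target" ]]; then
        lo=$((mid + 1))        # target sorts after the midpoint
    else
        hi=$((mid - 1))        # target sorts before the midpoint
    fi
done
(( found )) && print "matched $target"
[/tt]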

And to give it go-faster stripes... has anyone ever done multi-threading with shell?!?
 
Tim,

Yes, good point -- and I didn't spot it.... (doh!)

Anyway -- another approach.

Some pseudo-code:

[tt]
read the master file keys into an associative (indexed) array (7,000 entries)

for each line in the second file (1,000,000 rows)
    pull the key out of columns 21-32
    is that key an index of the master array?
        yes: print the line
next line in the second file
[/tt]

This approach has the advantage that it cuts right down on the file reads -- it still won't be fast though. It *will* have to do the 1 million lookups in the associative array.

C would be the best language to write this in -- fastest to run.

Second would be Perl -- but you can't use that.

Awk -- yep, will take "a while" to run though.

ksh -- forget it...

So -- writing the whole thing in Awk using an associative array would be my suggestion.
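Something like this, as a sketch only (assuming the column positions from the original posts, and an awk/nawk that knows FNR):
[tt]
# pass 1 (NR == FNR): store the 7,000 master keys, columns 1-12,
# in the associative array "keys"
# pass 2: print any data record whose columns 21-32 hold a stored key
awk 'NR == FNR { keys[substr($0, 1, 12)] = 1; next }
     substr($0, 21, 12) in keys' masterfile second_file > new_second_file
[/tt]
One read of each file, and each of the million records costs a single array lookup rather than 7,000 comparisons.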
Mike
 