Help with extracting data based on common field in 2 files 4

xxalex · Feb 9, 2005

Hello,

This is my first post.
I have a problem that I have nearly solved, I need some help to complete it.
I have 2 files.

input.txt (6000 lines) looks like this:

BT04 1LG,UniqueID1,Type1
BT19 6SF,UniqueID2,Type1
BT22 4PT,UniqueID3,Type2
BT19 5SF,UniqueID4,Type2

database.txt (750,000 lines) looks like this:

BT22 4PT,350432,237859
BT19 5SF,374650,384950
BT04 1LG,238400,438900
BT44 8DF,237465,849503
BT04 1LG,238475,483928
BT19 6SF,384758,948372

The desired output is:

"$2input.txt,$3input.txt,$2database.txt,$3database.txt" where $1==$1
UniqueID1,Type1,238475,483928
UniqueID1,Type1,238400,483900

Note: $1 in input.txt will match several records in database.txt

I have been able to output the desired results with an input file containing 1 record, however I need to do this with 6000 records.

My script looks like this:

########################################
#!/bin/bash
# These lines read the input file and assign variables

postcode=`cat input.txt | awk -F, '{ print $1 }'`
id=`cat input.txt | awk -F, '{ print $2 }'`
type=`cat input.txt | awk -F, '{ print $3 }'`

## This line reads in the database file, matches $1 in input file with S1 in database, and outputs the id and type from the input file followed by the remaining information from the database file

cat database.txt | awk -F, -v post="$postcode" '$1==post {print "'$id'"",""'$type'"","$2","$3}' > results.txt

########################################

How do I make this script run on all 6000 lines of the input file?
Is it possible to do it like this or is there an better way?

I am relatively new to shell scripting and any help you can give me would be most appreciated.

Thanks

Alex

PHV · Feb 9, 2005

Something like this ?
awk -F',' '
NR=FNR{a[$1]=$2","$3;next}
$1 in a{print a[$1]","$2","$3}
' input.txt database.txt > results.txt

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244

xxalex · Feb 9, 2005

Thanks for the quick reply PH,

When I run that code I get a blank results file.
I don't know how to debug it because I don't understand it.

#!/bin/bash
awk -F',' '
NR=FNR{a[$1]=$2","$3;next}
$1 in a{print a[$1]","$2","$3}
' input.txt database.txt > results.txt

When i run this from the command line with ./ph the computer processor works hard and then outputs a blank file. Can you help?

Thanks

Alex

futurelet · Feb 9, 2005

Code:

BEGIN { FS = OFS = "," }
NR==FNR { a[$1] = $2 FS $3 ; next }
$1 in a { print a[$1],$2,$3 }

Run with[tt]
awk -f prog.awk input.txt database.txt[/tt]

Let me know whether or not this helps.

If you have nawk, use it instead of awk because on some systems awk is very old and lacks many useful features. Under Solaris, use /usr/xpg4/bin/awk.

For an introduction to Awk, see faq271-5564.

PHV · Feb 9, 2005

You may try this:
awk -F',' '
NR=FNR{a[$1]=$2","$3;next}
{if($1 in a)print a[$1]","$2","$3}
' input.txt database.txt > results.txt

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244

futurelet · Feb 9, 2005

xxalex, be aware that PHV's code is incorrect and untested.

vgersh99 · Feb 9, 2005

NR==FNR

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

xxalex · Feb 9, 2005

Futurelet,

Your code worked a treat. I'm gonna run it on some known datasets and then work out how to write like that myself for the next time I need it. Thanks!

PHV,
With NR==FNR as supplied by vgersh99 your code works too. Thanks!

There seem to be so many ways of doing these things.

Now I can try to understand your code for myself.

Thanks

Alex

futurelet · Feb 9, 2005

Alex, be sure to see FAQ271-5564.

xxalex · Feb 9, 2005

Thanks futurelet, I'll print that off tomorrow

xxalex · Feb 9, 2005

This script was suggested by a friend:

#!/bin/sh

out=results.txt

> $out
while read xx1 xx2 ; do
POSTCODE="`echo $xx1 $xx2 | sed 's|,.*||'`"
UniqueID=`echo $xx2 | sed 's|...,||' | sed 's|,.*||'`
Type=`echo $xx2 | sed 's|.*,.*,||'`
# echo POSTCODE=$POSTCODE UniqueID=$UniqueID Type=$Type
grep "$POSTCODE" database.txt | \
sed "s|.... ...,|${UniqueID},${Type},|" >> $out
done < input.txt

***********************************************************************

I've been testing the scripts with known data. Here are my findings.

The script posted above return all 171 expected results.
futurelet's script returned 57 results.
PVH's script returned 57 results.

I think both futurelet's and PVH's scripts are returning only unique records.

How can I fix these scripts to return all records? The duplicates are very important.

Thanks

Alex

xxalex · Feb 9, 2005

Just to add,

Fulurelet's and PVH's scripts only return the last occurance of any grid reference to the output file. eg.

If the input file looked like this...

BT04 1LG,UniqueID1,Type1
BT19 6SF,UniqueID2,Type1
BT19 6SF,UniqueID3,Type2
BT04 1LG,UniqueID4,Type2
BT04 1LG,UniqueID5,Type3
BT19 6SF,UniqueID6,Type4

only UniqueID5 and UniqueID6 (the last occurance of each postcode) are ouput.

Can you tell me what part of the script is outputting only 1 set of results for each postcode?

Thanks again

Alex

PHV · Feb 9, 2005

awk -F',' '
NR==FNR{++c[$1];a[$1","c[$1]]=$2","$3;next}
$1 in c{for(i=1;i<=c[$1];++i)print a[$1","i]","$2","$3}
' input.txt database.txt > results.txt

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244

xxalex · Feb 9, 2005

Thanks PVH, that worked. I'll compare the two and find what you changed.

Cheers

Alex

futurelet · Feb 9, 2005

Code:

BEGIN { FS = OFS = "," }
NR==FNR { a[$1] = a[$1] SUBSEP $2 FS $3 ; next }
$1 in a { split( a[$1], tmp, SUBSEP )
  for (key=2; key in tmp; key++)
    print tmp[key],$2,$3
}

PHV, always test your code before posting it.
PHV, try to make your code more readable. Use [ignore]

Code:

 or [tt][/ignore] to print in a monospaced font.  Don't cram things together; instead of [tt]$2","$3[/tt] , say [tt]$2 "," $3[/tt].
PHV, someone fluent in Awk doesn't write [tt]a[foo","bar][/tt] to index a two-dimensional array; he writes
[tt]a[foo, bar][/tt].

xxalex · Feb 10, 2005

Hey

Just wanted to say thanks for all of your help. Futurelet, your script works great, I like how I can specify input and output files when I call the script... I must write like this in the future.

Cheers

Alex

ps. PHV, sorry I got your name wrong in so many of the posts, I was a bit tired last night.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Help with extracting data based on common field in 2 files 4

xxalex

Programmer

PHV

MIS

xxalex

Programmer

futurelet

Programmer

PHV

MIS

futurelet

Programmer

vgersh99

Programmer

xxalex

Programmer

futurelet

Programmer

xxalex

Programmer

xxalex

Programmer

xxalex

Programmer

PHV

MIS

xxalex

Programmer

futurelet

Programmer

xxalex

Programmer

Similar threads

Part and Inventory Search

Sponsor