Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Help with extracting data based on common field in 2 files 4

Status
Not open for further replies.

xxalex

Programmer
Feb 9, 2005
11
GB
Hello,

This is my first post.
I have a problem that I have nearly solved, I need some help to complete it.
I have 2 files.

input.txt (6000 lines) looks like this:

BT04 1LG,UniqueID1,Type1
BT19 6SF,UniqueID2,Type1
BT22 4PT,UniqueID3,Type2
BT19 5SF,UniqueID4,Type2

database.txt (750,000 lines) looks like this:

BT22 4PT,350432,237859
BT19 5SF,374650,384950
BT04 1LG,238400,438900
BT44 8DF,237465,849503
BT04 1LG,238475,483928
BT19 6SF,384758,948372

The desired output is:

"$2input.txt,$3input.txt,$2database.txt,$3database.txt" where $1==$1
UniqueID1,Type1,238475,483928
UniqueID1,Type1,238400,483900

Note: $1 in input.txt will match several records in database.txt

I have been able to output the desired results with an input file containing 1 record, however I need to do this with 6000 records.

My script looks like this:

########################################
#!/bin/bash
# These lines read the input file and assign variables

postcode=`cat input.txt | awk -F, '{ print $1 }'`
id=`cat input.txt | awk -F, '{ print $2 }'`
type=`cat input.txt | awk -F, '{ print $3 }'`

## This line reads in the database file, matches $1 in input file with S1 in database, and outputs the id and type from the input file followed by the remaining information from the database file

cat database.txt | awk -F, -v post="$postcode" '$1==post {print "'$id'"",""'$type'"","$2","$3}' > results.txt

########################################

How do I make this script run on all 6000 lines of the input file?
Is it possible to do it like this or is there an better way?

I am relatively new to shell scripting and any help you can give me would be most appreciated.

Thanks

Alex
 
Something like this ?
awk -F',' '
NR=FNR{a[$1]=$2","$3;next}
$1 in a{print a[$1]","$2","$3}
' input.txt database.txt > results.txt

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
Thanks for the quick reply PH,

When I run that code I get a blank results file.
I don't know how to debug it because I don't understand it.

#!/bin/bash
awk -F',' '
NR=FNR{a[$1]=$2","$3;next}
$1 in a{print a[$1]","$2","$3}
' input.txt database.txt > results.txt

When i run this from the command line with ./ph the computer processor works hard and then outputs a blank file. Can you help?

Thanks

Alex
 
Code:
BEGIN { FS = OFS = "," }
NR==FNR { a[$1] = $2 FS $3 ; next }
$1 in a { print a[$1],$2,$3 }

Run with[tt]
awk -f prog.awk input.txt database.txt[/tt]

Let me know whether or not this helps.

If you have nawk, use it instead of awk because on some systems awk is very old and lacks many useful features. Under Solaris, use /usr/xpg4/bin/awk.

For an introduction to Awk, see faq271-5564.
 
You may try this:
awk -F',' '
NR=FNR{a[$1]=$2","$3;next}
{if($1 in a)print a[$1]","$2","$3}
' input.txt database.txt > results.txt

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
NR==FNR

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Futurelet,

Your code worked a treat. I'm gonna run it on some known datasets and then work out how to write like that myself for the next time I need it. Thanks!

PHV,
With NR==FNR as supplied by vgersh99 your code works too. Thanks!

There seem to be so many ways of doing these things.

Now I can try to understand your code for myself.

Thanks

Alex
 
Thanks futurelet, I'll print that off tomorrow :)
 
This script was suggested by a friend:

#!/bin/sh

out=results.txt

> $out
while read xx1 xx2 ; do
POSTCODE="`echo $xx1 $xx2 | sed 's|,.*||'`"
UniqueID=`echo $xx2 | sed 's|...,||' | sed 's|,.*||'`
Type=`echo $xx2 | sed 's|.*,.*,||'`
# echo POSTCODE=$POSTCODE UniqueID=$UniqueID Type=$Type
grep "$POSTCODE" database.txt | \
sed "s|.... ...,|${UniqueID},${Type},|" >> $out
done < input.txt

***********************************************************************

I've been testing the scripts with known data. Here are my findings.

The script posted above return all 171 expected results.
futurelet's script returned 57 results.
PVH's script returned 57 results.

I think both futurelet's and PVH's scripts are returning only unique records.

How can I fix these scripts to return all records? The duplicates are very important.

Thanks

Alex
 
Just to add,

Fulurelet's and PVH's scripts only return the last occurance of any grid reference to the output file. eg.

If the input file looked like this...

BT04 1LG,UniqueID1,Type1
BT19 6SF,UniqueID2,Type1
BT19 6SF,UniqueID3,Type2
BT04 1LG,UniqueID4,Type2
BT04 1LG,UniqueID5,Type3
BT19 6SF,UniqueID6,Type4

only UniqueID5 and UniqueID6 (the last occurance of each postcode) are ouput.

Can you tell me what part of the script is outputting only 1 set of results for each postcode?

Thanks again

Alex
 
awk -F',' '
NR==FNR{++c[$1];a[$1","c[$1]]=$2","$3;next}
$1 in c{for(i=1;i<=c[$1];++i)print a[$1","i]","$2","$3}
' input.txt database.txt > results.txt

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
Thanks PVH, that worked. I'll compare the two and find what you changed.

Cheers

Alex
 
Code:
BEGIN { FS = OFS = "," }
NR==FNR { a[$1] = a[$1] SUBSEP $2 FS $3 ; next }
$1 in a { split( a[$1], tmp, SUBSEP )
  for (key=2; key in tmp; key++)
    print tmp[key],$2,$3
}

PHV, always test your code before posting it.
PHV, try to make your code more readable. Use [ignore]
Code:
 or [tt][/ignore] to print in a monospaced font.  Don't cram things together; instead of [tt]$2","$3[/tt] , say [tt]$2 "," $3[/tt].
PHV, someone fluent in Awk doesn't write [tt]a[foo","bar][/tt] to index a two-dimensional array; he writes
[tt]a[foo, bar][/tt].
 
Hey

Just wanted to say thanks for all of your help. Futurelet, your script works great, I like how I can specify input and output files when I call the script... I must write like this in the future.

Cheers

Alex


ps. PHV, sorry I got your name wrong in so many of the posts, I was a bit tired last night.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top