Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations gkittelson on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

match field in 2 files with awk?? 2

Status
Not open for further replies.

malpa

Technical User
Feb 8, 2004
122
CO
Hi,

I have 2 files. On of them
filea contains
------
field1,field2,.. ,field4,...
001200,001200,001200,001200,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001201,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001202,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001203,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001204,001204,001204,001204,6,9,001204,CANADA,CANADA,001204

Filed4 (series) is variable, and the field 5 contains the length of field4.

and fileb contains
------------------
field1
012000059 042213997 0507301019480507301022390002800029BOGOAXE2AUP00000200319211110011CE12D7IMED2X7O00000C000 0000000000251
012000059 042554019 0507301927250507301935570008500086BOGOAXE1AUP00000594046211110011CEN2D7IMED5H7O00000C000 0000000000832
012000132 018920468



I want to make a join between field4 (filea) and substr(field1,1,18) (fileb) and obtain a file with this structure

if (field4(file_a) in substr(field1(file_b),1,field5(file_a)))
{
print $0(file_b),field8(file_a),field9(file_a)
}
else{print $0(file_a) > "Is not here"}

Thanks

malpa
 
And what have you tried so far ?
Tons of similar threads in this forum.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
I believe one of your own threads could be useful to revisit.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Thanks for your replay.

Program functions very well if both fields are similar.
But what happend if a field (file_a, field 4)that you want to compare is within the other field (file_b, field 1).

If you notice both fields are not equal. The fourth field file_a is a part of the first field file_b. You must to keep in mind that the field 4 file_a is variable.

Field 1 file_b is much bigger than field 4 file_a.


awk -F'|' '
FNR==NR{t[$1]=$2;next}
t[$1]==$2
' file1.txt file2.txt

If you have some advice for this, I'll thank you.

I am sorry if I am wrong.

Thanks again.
 
If $1(file_b) must begin with $4(file_a) consider the substr function.
If $1(file_b) must contain $4(file_a) consider the ~ comparison operator.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
Hi

I modified the program but it doesn't work.

match.awk

BEGIN {FS=","}
{ if(!fn) fn = "filea"
while( (getline < fn) > 0)
{
a[$1]=$4
}
}
{
if(substr($0,1,8) ~ /^a[$1]/){
printf"%s,%s\n", $0, a[$1]>"match"
}
}

awk -f match.awk fileb

Thanks a lot

Malpa
 
A starting point (typed, not tested):
awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t)if($1~"^"i{print $0","t;next}
print > "Is not here"
}' filea fileb > match

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
First, thanks a lot.

At this moment its working, it goes well, I'll say to you later the results.

The fileb contains 3.413.544 records and the filea contains 5673 records.

Is there any problem with the size???
Is there any restrction with this??

If I want to match the results with another file Would be better do this in two steps or Can I do this at the same time in the same program???.


I modified it a little.

awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t)if($1~"^"i){print $1","t;next}
print > "Is not here"
}' filea fileb > match

Again thanks a lot.

Malpa
 
If I want to match the results with another file
...
}' filea fileb filec filed ... > match

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
Hi PVH

Thanks again.

The results was fine.

Now, I have to compare the same field (filea) with two fields(fileb) at the same time.
I ran the same program twice and It worked fine.

Is it possible to do this at the same time ??.

So I changed the program but I have some troubles with this.

awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t)if(substr($1,1,18)~"^"i){
a[$x]=t
i=0
t=0
for (i in t)if(substr($1,19,36)~"^"i){
print $1","a[$x]","t;next}}
print > "Is_not_here"
}' filea fileb > match



Thanks

malpa
 
What is supposed to be $x ?
What do you want to do ?

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
Hi,

the idea is to find telephones origin and telephones destiny and to print in the exit file the telephone origin the city origin and telephone destiny the city destiny.

At the firts time the program compared the field $4 (filea) with the field substr($1,1,18) (fileb) and it prints $1(fileb),t =(field8 fiel9).
Now I want to compare filed $4 (filea) again with field substr($1,19,36) (fileb) and it must print what it finds at the first time and what it finds at the second time.


filea contains
------
field1,field2,.. ,field4,field5,field6,field7,field8,field9
001200,001200,001200,001200,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001201,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001202,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001203,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200

and fileb contains
------------------
field1

123456789012345678901234567890123456

012000059 042213997 0507301019480507301022390002800029BOGOAXE2AUP00000200319211110011CE12D7IMED2X7O00000C000 0000000000251
012000059 042554019 0507301927250507301935570008500086BOGOAXE1AUP00000594046211110011CEN2D7IMED5H7O00000C000 0000000000832

match

090000000 090000000 0507302035140507302035140000000000MEDEAXE1SAN00000000000224200061BOG1H7I0 00000Z100T0032 T0032 0000000000000,BOGOTA CUNDINAMARCA, BOGOTA CUNDINAMARCA

090000000 090000000 0507302035320507302035320000000000MEDEAXE1SAN00000000000224200061BOG4P7O0 00000Z100T0035 T0154 0000000000000,9 BOGOTA CUNDINAMARCA,9 BOGOTA CUNDINAMARCA

Thanks so much

malpa.
 
Something like this ?
awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t)if($1~"^"i || substr($1,19)~"^"i)
{print $1","t;next}
print > "Is not here"
}' filea fileb > match

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886
 
Hi PVH

Thanks for your replay.
It is like making a dispesión of traffic.

When you call to long distance anybody wherever you want You can distinguish between who did the call and who receive the call.
File_a contains who did the call subscriber_a substr($1,1,18) and who receive the call subscriber_b substr($1,19,36).
File_b contains the number series (field 4) and city field 8 and country field 9.

I want to print:

$1(file_b), city(field 8 file_a) country (field 9 file_a) (regarding with substr($1,1,18) , city(field 8 file_a) country (field 9 file_a) (regarding with substr($1,18,39).

Thanks.

malpa
 
Hi,


At this moment I am at the airport, I`m working with this script and I have a doubt with this??


awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t){
if(substr($1,1,18)~"^"i){
a=1
b=t#Here I want to save t for first time;
#Can I do this in this way?
}
if(substr($1,19,36~"^"i && a==1){
#Here tcan be different to the previous b=t
a=0
printf "%s%s%s",$1,b,t
next
}
}
print > "Is not here"
}' filea fileb > match

Thanks

malpa
 
File_a contains who did the call subscriber_a substr($1,1,18) and who receive the call subscriber_b substr($1,19,36).
File_b contains the number series (field 4) and city field 8 and country field 9.

I think you got filea and fileb the wrong way around...?

Also, in filea, the numbers have an extra 0 on the prefix? Do you want to take that off to match against the numbers in fileb? I have done that in my example.

Unfortunately with the test data you have given it produces no output, because the destination numbers do not appear in filea.

Of course you can remove the "print read" and "print check" statements, they are just for debugging.

Code:
awk '
BEGIN {
        FS=OFS=","
        while (getline < "filea") {
                prefix[++i]=substr($4,2)
                city[i]=$8
                country[i]=$9
                print "read prefix " prefix[i] " city " city[i] " country " country[i]
        }
        close("filea")
}
{
        from=substr($0,1,18)
        to=substr($0,19,18)
        print "read from " from " to " to
        for (i in prefix) {
                print "check from prefix " prefix[i]
                if (from ~ "^"prefix[i]) {
                        for (j in prefix) {
                                print "check to prefix " prefix[j]
                                if (to ~ "^"prefix[j] ) {
                                        print from,city[i],country[i],city[j],country[j]
                                        break
                                }
                        }
                        break
                }
        }
}
' fileb

Annihilannic.
 
Sorry, of course you can print "no match" like you suggested if nothing is found.

Code:
awk '
BEGIN {
        FS=OFS=","
        while (getline < "filea") {
                prefix[++i]=substr($4,2)
                city[i]=$8
                country[i]=$9
                print "read prefix " prefix[i] " city " city[i] " country " country[i]
        }
        close("filea")
}
{
        from=substr($0,1,18)
        to=substr($0,19,18)
        print "read from " from " to " to
        matchedfrom=0
        for (i in prefix) {
                print "check from prefix " prefix[i]
                if (from ~ "^"prefix[i]) {
                        matchedfrom=1
                        matchedto=0
                        for (j in prefix) {
                                print "check to prefix " prefix[j]
                                if (to ~ "^"prefix[j] ) {
                                        matchedto=1
                                        print from,city[i],country[i],city[j],country[j]
                                        break
                                }
                        }
                        if (!matchedto) {
                                print from,city[i],country[i],"(no match),(no match)"
                        }
                        break
                }
        }
        if (!matchedfrom) {
                print from,"(no match),(no match),(no match),(no match)"
        }
}
' fileb


Annihilannic.
 
Hi Annihilannic

Thanks a lot for your replay. Actually Your script is working.
File_a contains many series including local series and long disntance series. Size of each serie is variable.

file_b contains call data registers (3.500.000 per day)
-------
002534306 000000000 0507301505500507301506170000400000BOGOAXE2AUN00000000000116900063PAC3I7ITUN2HBO00000Z000
0000000000027
012012360 0019549932384 0507302158220507302213490015400156BOGOAXE1AUP00003416400111110011MUZ8D7IEUM2I7O00000N000
0000000001527
012066817 0017863125866 0507302304110507302304240000200010BOGOAXE1AUP00000219000111110011CEN2D7IEUA1I7O00000N000
0000000000013
012100615 00524422251072 0507302327280507302334020006500066BOGOAXE1AUP00001544400111110011CEN2D7IMET1I7O00000Q000
0000000000634
012100687 00525555263828 0507302348090507302356370008400086BOGOAXE1AUP00002012400111110011CEN2D7IMET1I7O00000Q000
0000000000828
012101033 0017862735836 0507302325190507302329120003800039BOGOAXE2AUP00000854100111110011CE12D7IEUA1I7O00000N000
0000000000353
012101501 0017862014029 0507302212200507302231000018600186BOGOAXE2AUP00004073400111110011CE12D7IEUM1I7O00000N000
0000000001840
012105000 0018188319380 0507302220210507302232420012300123BOGOAXE1AUP00002693700111110011CEN2D7IEUI1I7O00000N000
0000000001221
012105022 0059322443145 0507302324040507302328340004500046BOGOAXE2AUP00000920000111110011CE12D7IECA1I7O00000M000
0000000000430
012105093 005072305862 0507302225400507302227550002200023BOGOAXE2AUP00000518650111110011CE12D7IPAC1I7O00000P000
0000000000215

file_a contains local and long distance series (8500 series).
-------
------------------field4---------field8,field9,----------
00253,00253,00253,00253,5,9,00253,DJIBOUTI,DJIBOUTI,00253
25,25,11001,012,3,7,11001,BOGOTA,CUNDINAMARCA,0057
00593,00593,00593,00593,5,9,00593,ECUADOR,ECUADOR,00593
00507,00507,00507,00507,5,9,00507,PANAMA,PANAMA,00507
001200,001200,001200,001781,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001785,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001786,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200

If you want, you can try with these files. I apologize my english goes wrong. This is a short report I took it while it is working.

002534306 ,DJIBOUTI,DJIBOUTI,000000000 ,(no match),(no match)
012012360 ,BOGOTA,CUNDINAMARCA,0019549932384 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00154,BOGOAXE1AU,P,111110,MUZ8D7I,EUM2I7
012066817 ,BOGOTA,CUNDINAMARCA,0017863125866 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00002,BOGOAXE1AU,P,111110,CEN2D7I,EUA1I7
012100615 ,BOGOTA,CUNDINAMARCA,00524422251072 ,MEXICO,MEXICO,050730,00065,BOGOAXE1AU,P,111110,CEN2D7I,MET1I7
012100687 ,BOGOTA,CUNDINAMARCA,00525555263828 ,MEXICO,MEXICO,050730,00084,BOGOAXE1AU,P,111110,CEN2D7I,MET1I7
012101033 ,BOGOTA,CUNDINAMARCA,0017862735836 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00038,BOGOAXE2AU,P,111110,CE12D7I,EUA1I7
012101501 ,BOGOTA,CUNDINAMARCA,0017862014029 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00186,BOGOAXE2AU,P,111110,CE12D7I,EUM1I7
012105000 ,BOGOTA,CUNDINAMARCA,0018188319380 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00123,BOGOAXE1AU,P,111110,CEN2D7I,EUI1I7



Thanks a lot

I love awk.

malpa
 
Thanks a lot PVH and Annhilannic. At last I learned manyhtings about awk.

I have used both programs and these work fine.

Question?

What is the difference between :

if (from ~ "^"prefix)

and

if (from ~ /^prefix/) Why this doesn´t works.

Thanks

malpa

 
Hi

Because the backets are metacharacters and when are between the slashes, [tt]awk[/tt] interpret them as part of the regular expression instead of array index.

[highlight #eeeeff][tt]from ~ "^"prefix[/tt][/highlight] - value of variable [tt]from[/tt] matches the value of [tt]i[/tt]th[/tt] element of array [tt]prefix[/tt], at the begin

[highlight #eeeeff][tt]from ~ /^prefix/[/tt][/highlight] - value of variable [tt]from[/tt] matches the string prefix followed by any character from the set of characters i, at the begin

Feherke.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top