match field in 2 files with awk?? 2

malpa · Aug 30, 2005

Hi,

I have 2 files. On of them
filea contains
------
field1,field2,.. ,field4,...
001200,001200,001200,001200,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001201,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001202,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001203,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001204,001204,001204,001204,6,9,001204,CANADA,CANADA,001204

Filed4 (series) is variable, and the field 5 contains the length of field4.

and fileb contains
------------------
field1
012000059 042213997 0507301019480507301022390002800029BOGOAXE2AUP00000200319211110011CE12D7IMED2X7O00000C000 0000000000251
012000059 042554019 0507301927250507301935570008500086BOGOAXE1AUP00000594046211110011CEN2D7IMED5H7O00000C000 0000000000832
012000132 018920468

I want to make a join between field4 (filea) and substr(field1,1,18) (fileb) and obtain a file with this structure

if (field4(file_a) in substr(field1(file_b),1,field5(file_a)))
{
print $0(file_b),field8(file_a),field9(file_a)
}
else{print $0(file_a) > "Is not here"}

Thanks

malpa

PHV · Aug 30, 2005

And what have you tried so far ?
Tons of similar threads in this forum.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

vgersh99 · Aug 30, 2005

I believe one of your own threads could be useful to revisit.

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

malpa · Aug 30, 2005

Thanks for your replay.

Program functions very well if both fields are similar.
But what happend if a field (file_a, field 4)that you want to compare is within the other field (file_b, field 1).

If you notice both fields are not equal. The fourth field file_a is a part of the first field file_b. You must to keep in mind that the field 4 file_a is variable.

Field 1 file_b is much bigger than field 4 file_a.

awk -F'|' '
FNR==NR{t[$1]=$2;next}
t[$1]==$2
' file1.txt file2.txt

If you have some advice for this, I'll thank you.

I am sorry if I am wrong.

Thanks again.

PHV · Aug 30, 2005

If $1(file_b) must begin with $4(file_a) consider the substr function.
If $1(file_b) must contain $4(file_a) consider the ~ comparison operator.

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

malpa · Sep 9, 2005

Hi

I modified the program but it doesn't work.

match.awk

BEGIN {FS=","}
{ if(!fn) fn = "filea"
while( (getline < fn) > 0)
{
a[$1]=$4
}
}
{
if(substr($0,1,8) ~ /^a[$1]/){
printf"%s,%s\n", $0, a[$1]>"match"
}
}

awk -f match.awk fileb

Thanks a lot

Malpa

PHV · Sep 9, 2005

A starting point (typed, not tested):
awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t)if($1~"^"i{print $0","t;next}
print > "Is not here"
}' filea fileb > match

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

malpa · Sep 9, 2005

First, thanks a lot.

At this moment its working, it goes well, I'll say to you later the results.

The fileb contains 3.413.544 records and the filea contains 5673 records.

Is there any problem with the size???
Is there any restrction with this??

If I want to match the results with another file Would be better do this in two steps or Can I do this at the same time in the same program???.

I modified it a little.

awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t)if($1~"^"i){print $1","t;next}
print > "Is not here"
}' filea fileb > match

Again thanks a lot.

Malpa

PHV · Sep 9, 2005

If I want to match the results with another file
...
}' filea fileb filec filed ... > match

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

malpa · Sep 12, 2005

Hi PVH

Thanks again.

The results was fine.

Now, I have to compare the same field (filea) with two fields(fileb) at the same time.
I ran the same program twice and It worked fine.

Is it possible to do this at the same time ??.

So I changed the program but I have some troubles with this.

awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t)if(substr($1,1,18)~"^"i){
a[$x]=t
i=0
t=0
for (i in t)if(substr($1,19,36)~"^"i){
print $1","a[$x]","t;next}}
print > "Is_not_here"
}' filea fileb > match

Thanks

malpa

PHV · Sep 12, 2005

What is supposed to be $x ?
What do you want to do ?

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

malpa · Sep 12, 2005

Hi,

the idea is to find telephones origin and telephones destiny and to print in the exit file the telephone origin the city origin and telephone destiny the city destiny.

At the firts time the program compared the field $4 (filea) with the field substr($1,1,18) (fileb) and it prints $1(fileb),t =(field8 fiel9).
Now I want to compare filed $4 (filea) again with field substr($1,19,36) (fileb) and it must print what it finds at the first time and what it finds at the second time.

filea contains
------
field1,field2,.. ,field4,field5,field6,field7,field8,field9
001200,001200,001200,001200,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001201,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001202,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001203,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200

and fileb contains
------------------
field1

123456789012345678901234567890123456

012000059 042213997 0507301019480507301022390002800029BOGOAXE2AUP00000200319211110011CE12D7IMED2X7O00000C000 0000000000251
012000059 042554019 0507301927250507301935570008500086BOGOAXE1AUP00000594046211110011CEN2D7IMED5H7O00000C000 0000000000832

match

090000000 090000000 0507302035140507302035140000000000MEDEAXE1SAN00000000000224200061BOG1H7I0 00000Z100T0032 T0032 0000000000000,BOGOTA CUNDINAMARCA, BOGOTA CUNDINAMARCA

090000000 090000000 0507302035320507302035320000000000MEDEAXE1SAN00000000000224200061BOG4P7O0 00000Z100T0035 T0154 0000000000000,9 BOGOTA CUNDINAMARCA,9 BOGOTA CUNDINAMARCA

Thanks so much

malpa.

PHV · Sep 12, 2005

Something like this ?
awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t)if($1~"^"i || substr($1,19)~"^"i)
{print $1","t;next}
print > "Is not here"
}' filea fileb > match

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ181-2886

malpa · Sep 12, 2005

Hi PVH

Thanks for your replay.
It is like making a dispesión of traffic.

When you call to long distance anybody wherever you want You can distinguish between who did the call and who receive the call.
File_a contains who did the call subscriber_a substr($1,1,18) and who receive the call subscriber_b substr($1,19,36).
File_b contains the number series (field 4) and city field 8 and country field 9.

I want to print:

$1(file_b), city(field 8 file_a) country (field 9 file_a) (regarding with substr($1,1,18) , city(field 8 file_a) country (field 9 file_a) (regarding with substr($1,18,39).

Thanks.

malpa

malpa · Sep 15, 2005

Hi,

At this moment I am at the airport, I`m working with this script and I have a doubt with this??

awk -F',' '
FNR==NR{t[$4]=$8" "$9;next}
{ for(i in t){
if(substr($1,1,18)~"^"i){
a=1
b=t#Here I want to save t for first time;
#Can I do this in this way?
}
if(substr($1,19,36~"^"i && a==1){
#Here tcan be different to the previous b=t
a=0
printf "%s%s%s",$1,b,t
next
}
}
print > "Is not here"
}' filea fileb > match

Thanks

malpa

Annihilannic · Sep 16, 2005

File_a contains who did the call subscriber_a substr($1,1,18) and who receive the call subscriber_b substr($1,19,36).
File_b contains the number series (field 4) and city field 8 and country field 9.

I think you got filea and fileb the wrong way around...?

Also, in filea, the numbers have an extra 0 on the prefix? Do you want to take that off to match against the numbers in fileb? I have done that in my example.

Unfortunately with the test data you have given it produces no output, because the destination numbers do not appear in filea.

Of course you can remove the "print read" and "print check" statements, they are just for debugging.

Code:

awk '
BEGIN {
        FS=OFS=","
        while (getline < "filea") {
                prefix[++i]=substr($4,2)
                city[i]=$8
                country[i]=$9
                print "read prefix " prefix[i] " city " city[i] " country " country[i]
        }
        close("filea")
}
{
        from=substr($0,1,18)
        to=substr($0,19,18)
        print "read from " from " to " to
        for (i in prefix) {
                print "check from prefix " prefix[i]
                if (from ~ "^"prefix[i]) {
                        for (j in prefix) {
                                print "check to prefix " prefix[j]
                                if (to ~ "^"prefix[j] ) {
                                        print from,city[i],country[i],city[j],country[j]
                                        break
                                }
                        }
                        break
                }
        }
}
' fileb

Annihilannic.

Annihilannic · Sep 16, 2005

Sorry, of course you can print "no match" like you suggested if nothing is found.

Code:

awk '
BEGIN {
        FS=OFS=","
        while (getline < "filea") {
                prefix[++i]=substr($4,2)
                city[i]=$8
                country[i]=$9
                print "read prefix " prefix[i] " city " city[i] " country " country[i]
        }
        close("filea")
}
{
        from=substr($0,1,18)
        to=substr($0,19,18)
        print "read from " from " to " to
        matchedfrom=0
        for (i in prefix) {
                print "check from prefix " prefix[i]
                if (from ~ "^"prefix[i]) {
                        matchedfrom=1
                        matchedto=0
                        for (j in prefix) {
                                print "check to prefix " prefix[j]
                                if (to ~ "^"prefix[j] ) {
                                        matchedto=1
                                        print from,city[i],country[i],city[j],country[j]
                                        break
                                }
                        }
                        if (!matchedto) {
                                print from,city[i],country[i],"(no match),(no match)"
                        }
                        break
                }
        }
        if (!matchedfrom) {
                print from,"(no match),(no match),(no match),(no match)"
        }
}
' fileb

Annihilannic.

malpa · Sep 16, 2005

Hi Annihilannic

Thanks a lot for your replay. Actually Your script is working.
File_a contains many series including local series and long disntance series. Size of each serie is variable.

file_b contains call data registers (3.500.000 per day)
-------
002534306 000000000 0507301505500507301506170000400000BOGOAXE2AUN00000000000116900063PAC3I7ITUN2HBO00000Z000
0000000000027
012012360 0019549932384 0507302158220507302213490015400156BOGOAXE1AUP00003416400111110011MUZ8D7IEUM2I7O00000N000
0000000001527
012066817 0017863125866 0507302304110507302304240000200010BOGOAXE1AUP00000219000111110011CEN2D7IEUA1I7O00000N000
0000000000013
012100615 00524422251072 0507302327280507302334020006500066BOGOAXE1AUP00001544400111110011CEN2D7IMET1I7O00000Q000
0000000000634
012100687 00525555263828 0507302348090507302356370008400086BOGOAXE1AUP00002012400111110011CEN2D7IMET1I7O00000Q000
0000000000828
012101033 0017862735836 0507302325190507302329120003800039BOGOAXE2AUP00000854100111110011CE12D7IEUA1I7O00000N000
0000000000353
012101501 0017862014029 0507302212200507302231000018600186BOGOAXE2AUP00004073400111110011CE12D7IEUM1I7O00000N000
0000000001840
012105000 0018188319380 0507302220210507302232420012300123BOGOAXE1AUP00002693700111110011CEN2D7IEUI1I7O00000N000
0000000001221
012105022 0059322443145 0507302324040507302328340004500046BOGOAXE2AUP00000920000111110011CE12D7IECA1I7O00000M000
0000000000430
012105093 005072305862 0507302225400507302227550002200023BOGOAXE2AUP00000518650111110011CE12D7IPAC1I7O00000P000
0000000000215

file_a contains local and long distance series (8500 series).
-------
------------------field4---------field8,field9,----------
00253,00253,00253,00253,5,9,00253,DJIBOUTI,DJIBOUTI,00253
25,25,11001,012,3,7,11001,BOGOTA,CUNDINAMARCA,0057
00593,00593,00593,00593,5,9,00593,ECUADOR,ECUADOR,00593
00507,00507,00507,00507,5,9,00507,PANAMA,PANAMA,00507
001200,001200,001200,001781,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001785,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200
001200,001200,001200,001786,6,9,001200,ESTADOSUNIDOS,ESTADOSUNIDOS,001200

If you want, you can try with these files. I apologize my english goes wrong. This is a short report I took it while it is working.

002534306 ,DJIBOUTI,DJIBOUTI,000000000 ,(no match),(no match)
012012360 ,BOGOTA,CUNDINAMARCA,0019549932384 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00154,BOGOAXE1AU,P,111110,MUZ8D7I,EUM2I7
012066817 ,BOGOTA,CUNDINAMARCA,0017863125866 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00002,BOGOAXE1AU,P,111110,CEN2D7I,EUA1I7
012100615 ,BOGOTA,CUNDINAMARCA,00524422251072 ,MEXICO,MEXICO,050730,00065,BOGOAXE1AU,P,111110,CEN2D7I,MET1I7
012100687 ,BOGOTA,CUNDINAMARCA,00525555263828 ,MEXICO,MEXICO,050730,00084,BOGOAXE1AU,P,111110,CEN2D7I,MET1I7
012101033 ,BOGOTA,CUNDINAMARCA,0017862735836 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00038,BOGOAXE2AU,P,111110,CE12D7I,EUA1I7
012101501 ,BOGOTA,CUNDINAMARCA,0017862014029 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00186,BOGOAXE2AU,P,111110,CE12D7I,EUM1I7
012105000 ,BOGOTA,CUNDINAMARCA,0018188319380 ,ESTADOSUNIDOS,ESTADOSUNIDOS,050730,00123,BOGOAXE1AU,P,111110,CEN2D7I,EUI1I7

Thanks a lot

I love awk.

malpa

malpa · Sep 30, 2005

Thanks a lot PVH and Annhilannic. At last I learned manyhtings about awk.

I have used both programs and these work fine.

Question?

What is the difference between :

if (from ~ "^"prefix)

and

if (from ~ /^prefix/) Why this doesn´t works.

Thanks

malpa

feherke · Sep 30, 2005

Hi

Because the backets are metacharacters and when are between the slashes, [tt]awk[/tt] interpret them as part of the regular expression instead of array index.

[highlight #eeeeff][tt]from ~ "^"prefix[/tt][/highlight] - value of variable [tt]from[/tt] matches the value of [tt]i[/tt]^{th[/tt] element of array [tt]prefix[/tt], at the begin

[highlight #eeeeff][tt]from ~ /^prefix/[/tt][/highlight] - value of variable [tt]from[/tt] matches the string prefix followed by any character from the set of characters i, at the begin

Feherke.

http://rootshell.be/~feherke/}

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

match field in 2 files with awk?? 2

Technical User

MIS

Programmer

Technical User

MIS

Technical User

MIS

Technical User

MIS

Technical User

MIS

Technical User

MIS

Technical User

Technical User

MIS

MIS

Technical User

Technical User

Programmer

Similar threads

Log in

Part and Inventory Search

Sponsor