Mege and compare 3 files 7

mrr · Feb 16, 2005

What is the best way to input 3 files into AWK?
I want to load each of the 3 files into different arrays and then gsub text and then compare/natch fields between the three files and then export to 1 merged file.

Can you do 3 while((getline<"file1,2,3) > 0) and load each file into an array?

Thanks

PHV · Feb 16, 2005

A starting point:
nawk '
FILENAME!=file{++f;a[f,0]=file=FILENAME}
{a[f,FNR]=$0}
END{for(i=1;i<=f;++i)print a[i,0]"(1) "a[i,1]}
' file1 file2 file3

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244

futurelet · Feb 16, 2005

Code:

{ by_line_num[FILENAME,FNR] = $0
  by_content[FILENAME,$0] = FNR
}

END {
  
  file1 = ARGV[1]
  file2 = ARGV[2]
  file3 = ARGV[3]
  
  # Let's say we want to print only lines found
  # in all 3 files, in the order in which they
  # occur in file 1.
  
  for (i=1; (file1,i) in by_line_num; i++)
  { content = by_line_num[file1,i]
    if ((file2,content) in by_content &&
        (file3,content) in by_content)
      print content
  }

}

Save as "3files.awk" and run with [tt]
awk -f 3files.awk f1 f2 f3[/tt]

Let me know whether or not this helps.

If you have nawk, use it instead of awk because on some systems awk is very old and lacks many useful features. Under Solaris, use /usr/xpg4/bin/awk.

For an introduction to Awk, see faq271-5564.

mrr · Feb 17, 2005

Thanks for responding PHV and futurelet.
This may be too tall of a request for help but this is what I am trying to accomplish.
1. Load all three files into arrays and the first file is the master file that has the most detailed fields and number of records. I want to match on fields 1, 6 and 7 from file 1 to same fields in file 2. I also want to match file1 fields 1, 4 and 5 to file 3 fields 1, 4 and 5.

If I find a complete match from either 2 or all 3 files, I would append the uid of 1 and then move to next record in file 1 and if another match is found increment the uid by one.

There may be duplicate records in all three files and also may be incomplete records in either file 2 or 3 or also records in files 2 or 3 that do not have a match in file 1.

The consistant field in all three files is field 1 but the contents sometimes vary due to the data having non-alpha number characters such as - _ | \ ( ) and spaces.
Before doing a match comparison on the first field of all three files, I would apply this:
{ gsub (/ /, "",$1) to remove spaces in data of field 1

I would also apply the following to remove/edit the special characters from field 1
gsub (/[\/_-()]/, "",$1), "",$1)### REMOVES SPECIFIC NON-ALPHANUMERIC CHARACTERS FROM FIELD 1
if ( match( $1, "[0-9]X$")) { gsub (/X/, "EXT",$1)};### FIELD 1 MODIFIER
if ( match( $1, "[0-9]EX$")) { gsub (/EX/, "EXT",$1)}; ### FIELD 1 MODIFIER

These are only temporarily modifications for matching and not to affect the ouput.

Also in order for the matching to work on fields 5, 6, 7 I would also like to do a match on only the first 3 characters of each field due to different spellings in data.

One last request, could each record output have the following appended to end of each record:
first 3 characters of field 6"-"first 3 characters of field 5 "-" and then first 4 characters of field4
it would look like CCC-AAA-1901 This is a grouping type number that I will use along with the uid to sort and group other similar records.

Here's my test data file1:
1 146 337 1901 AAABB CCCCCC XXX1 111
11 146 337 1902 DDDDD CCCCCC XXX2 112
111 146 337 1903 EEEEE CCCCCC XXX3 113
44 146 337 1904 HHHHH CCCCCC XXX8 114
1 1 146 337 1905 FF CCCCCC XXX4 115
1 146 337 1906 GGGG CCCCCC XXX5 116
111 146 337 1907 HHHHH CCCCCC XXX6 117
22 146 337 1908 HHHHH CCCCCC XXX7 118
44 146 337 1909 HHHHH CCCCCC XXX8 119
44E 146 337 1910 HHHHH CCCCCC XXX8 120
44X 146 337 1911 HHHHH CCCCCC XXX8 121
44EXT 146 337 1912 HHHHH CCCCCC XXX8 122
55_1 146 337 1913 HHHHH CCCCCC XXX8 123
55-1 146 337 1914 HHHHH CCCCCC XXX8 124
55(1) 146 337 1915 HHHHH CCCCCC XXX8 125

Here's file 2:

1 146 337 1901 AAABB CCCCCC XXX1 111
11 146 337 1902 DDDDD CCCCCC XXX2 112
111 146 337 1903 EEEEE CCCCCC XXX3 113
44 146 337 1904 HHHHH CCCCCC XXX8 114
11 146 337 1905 FF CCCCCC XXX4 115
1 146 337 1906 GGGG CCCCCC XXX5 116
111 146 337 1907 HHHHH CCCCCC XXX6 117
22 146 337 1908 HHHHH CCCCC1 XXX7 118
44 146 337 1909 HHHHH CCCCC2 XXX8 119
44E 146 337 1910 HHHHH CCCCC4 XXX8 120
44X 146 337 1911 HHHXX CCCCCX XXX8 121
44EXT 146 337 1912 HHHHH CCCCCE XXX8 122
551 146 337 1913 HHH00 CCCCCF XXX8 123
55-1 146 337 1914 HHHHH CCCCD XXX8 124
551) 146 337 1915 HHHHH CCCCC XXX8 125

and here's file3:

1 146 337 1901 AAABB CCCCCC XXX1 333 GRP1
11 146 337 1902 DDDDD CCCCCC XXX2 777 GRP2
111 146 337 1903 EEEEE CCCCCC XXX3 834 GRP3
44 146 337 1955 HHHHH CCCCCC XXX8 787 GRP2
1 1 146 337 1960 FF CCCCCC XXX4 99 GRP2
1 146 337 1980 GGGG CCCCCC XXX5 199 GRP1
111 146 337 1908 HHHHH DDDDDD XXX6 339 GRP8
22 146 337 1908 HHHHH CCCCCC XXX7 454 GRP19
44 146 337 1909 HHHHH XXXXX XXX8 66 GRP20
44E 146 337 1910 HHHHH CCCCCC XXX8 88 GRP22
44X 146 337 1911 HHHHH CCCCCC XXX8 999 GRP55
44EXT 146 337 1912 HHHHH CCCCCC XXX8 565 GRP79
55_1 146 337 1913 HHHHH CCCCCC XXX8 17 GRP12
55-1 146 337 1955 HHHHH CCCCCC XXX8 5454
55(1) 146 337 1915 HHHHH GGGGG XXX8 100
Please note that file 3 has a ninth field and so the grouping type number would have to be placed in field 10 in all 3 files.

Thanks for any help on this, I know it is alot to ask for.

xmb · Feb 17, 2005

what about the missing field in the last ttwo records in file3? did i understand correctly, add the 10th field of file3 to all others?

. Mac for productivity
.. Linux for developement
... Windows for solitaire

xmb · Feb 17, 2005

[first, written with big headache]

I certainly see it doesnt to _exactly_ what you want, but i have my own difficults to understand _exactly_ whats up.

Code:

#!/usr/bin/awk -f

# specialized example for mrr on tek-tips.com
# slow cause too much preloading, or not..

# respect 15 minutes b[dmg] kiddie code!
# syntax free excepts your gsub field $1 regex
# to be honest, it didnt run from the first run.. why? i forgot the if in scomp

# for him:
# gsub (/ /, "",$1) # does not make sense
# aha, got it, still wrong

BEGIN {
    #loadfile(1) # not needed?
    loadfile(2)
    loadfile(3)
}

{
    # comparing relative to the processed file 1
    if (scomp(1, 2) && scomp(6, 2) && scomp(7, 2)) p() 
    if (scomp(1, 3) && scomp(4, 3) && scomp(5, 3)) p() 
}

function scomp(field, file) { # shortcomp
    if (comp(field, file, field)) return 1
    else return 0
}

function comp(field, file, filefield) {
    if (File[file, filefield] = $field) # $field can also be str
        return 1
    else return 0
}

# preloading
function loadfile(file  ,i) {
    while (getline <file) {
        # per line stuff here
        gsub (/[\/_()-]/, "",$1) # + if (/[^0-9]/) may be faster
        sub(/ /, "")
        if ($1 ~ /X$/) sub(/E?X$/, "EXT", $1) # temp uhu?

        for (i = 1; i <= NF; i++)
            File[file, i] = $i
    }
}

# pseudo print
function p() {
    $1++
    $(NF + 1) = substr($6, 1, 3) "-" substr($5, 1, 3)
    if (Dup[$0]++) { Realdup[$0]++; return } # slowdown..
    print
}

. Mac for productivity
.. Linux for developement
... Windows for solitaire

futurelet · Feb 17, 2005

The consistant field in all three files is field 1 but the contents sometimes vary due to the data having non-alpha number characters such as - _ | \ ( ) and spaces.
Before doing a match comparison on the first field of all three files, I would apply this:
{ gsub (/ /, "",$1) to remove spaces in data of field 1

If a space can be found in field 1, what is the separator between fields?

xmb · Feb 17, 2005

To have some comment on the code..

the files were named '1', '2' and '3', with '1' being an argument to awk
Time needed was -4 minutes
dup checking, nja.. correct would be

if (Dup[$0]++) { if (Dup[$0] == 1) Realdup[++dup] = $0; return }

I hope i do not need to explain the function names, else for everything, ask.

. Mac for productivity
.. Linux for developement
... Windows for solitaire

xmb · Feb 17, 2005

Please anyone besides mrr tell me this is cool code :/

Futurelet, i can as well have a script mark every of my posts with a star, but thats not the point.

. Mac for productivity
.. Linux for developement
... Windows for solitaire

xmb · Feb 17, 2005

[this aint the future..]

mrr, if you can tell me for whats that, i'd name my script properly.

futurelet · Feb 17, 2005

mrr said:
if ( match( $1, "[0-9]X$")) { gsub (/X/, "EXT",$1)}
if ( match( $1, "[0-9]EX$")) { gsub (/EX/, "EXT",$1)}

I didn't see any cases where field 1 ended with "EX" but I did see that some ended with "E".

So I'm using this:
[tt]
if ( s ~ /[0-9](E|X)$/ )
sub( /(E|X)$/, "EXT", s )
[/tt]

futurelet · Feb 17, 2005

On rare occasions you will need to use the getline command, which can do explicit input from any number of files....

It's too bad that those without a good grasp of the AWK way use getline in almost every program.
[tt]
awk -f merge.awk file1 file2 file3
[/tt]
When invoked in this manner, AWK will automatically read all the lines in file1, then all lines in file2, then file3---if you let it.
[tt]
awk -f merge.awk file2 file3 file1
[/tt]
When invoked in this manner, AWK will automatically read all the lines in file2, then all lines in file3, then file1---if you let it.

So in most programs, there is no need for anyone to take the reading of lines out of AWK's capable hands.

futurelet · Feb 17, 2005

Run with
[tt]awk -f merge3.awk file1 file2 file3[/tt]

Code:

BEGIN {
  ## What is the field-separator? ? ? ?
  ## I'm assuming 2 or more spaces or a tab.
  FS = "  +|\t"
  for (i=1; i<ARGC; i++)
    filenum[ARGV[i]] = i
}

NR==FNR { lines[NR] = $0 ; next }

2==filenum[FILENAME] { file2data[grab167()]++
                       next }

{ file3data[grab145()]++ }

END {
  for (i=1; i in lines; i++)
  { $0 = lines[i]
    if ((grab167() in file2data) ||
        (grab145() in file3data))
      display()
  }
}


function display()
{ printf "%s %s-%s-%s\n",
    $0,substr($6,1,3),substr($5,1,3),substr($4,1,4)
}

function grab167( )
{ return clean1($1) SUBSEP clean567($6) SUBSEP \
         clean567($7)
}

function grab145( )
{ return clean1($1) SUBSEP $4 SUBSEP clean567($5)
}

function clean567( s )
{ return substr( s, 1, 3)
}

function clean1( s )
{ gsub ( /[-\/_() ]/, "", s )
  if ( s ~ /[0-9](E|X)$/ )
    sub( /(E|X)$/, "EXT", s )
  return s
}

mrr · Feb 18, 2005

xmb and futurelet,
Thanks for your help on this, it has definitely been overwelming to me on how to bring in three files and compare. The file is Tab delimited, sorry I did not mention it.

I will try to test on real data and see what happens.
I appreciate your help.

PHV · Feb 18, 2005

Simply add this to your awk program:
BEGIN { FS = "\t" }

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244

mrr · Feb 18, 2005

futurelet,
Thanks for this very nice script.
I've tested it on my data and it looks like it is working with the exception of 2 issues.
1. I would like for all lines to be printed to output from the 3 files (not just the records that have a match between the 3 files).

2. I need the addition of the unique id number that increments by one each time there is a match between the three files. There may be several records in any of the three files that would make a single match for the uid based on the similar values found in file 1 fields 1,6,7 matching to file 2 same fields and also file 1 fields 1,4,5 matching to file 3.

In other words, each file 1,2 and 3 may have numerous records that after comparing the matched fields, they would constitue a single match of a specific uid.

If there is no match of records in file 1 against files 2 and 3, then the uid entry can be blank.
This uid key will allow me to sort and group similar records of similar data.

If you need further clarification, please respond.

Thanks again.

vgersh99 · Feb 18, 2005

futurelet,
once again - nicely done. modular and clear as a whistle!

stick around this site - there's so much others can gain!

here ya go!

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

futurelet · Feb 18, 2005

It seemed to me that the only way to do it was to have 2 uids: one for fields 1,6,7 and one for fields 1,4,5.
Perhaps Vlad or xmb can figure a better way.

Code:

BEGIN {
  ## The field-separator is a tab.
  FS = "\t"
  ## Make a way to get file-number from filename.
  for (i=1; i<ARGC; i++)
    filenum[ARGV[i]] = i
}

{ 
  whichfile = filenum[FILENAME]
  # Store all lines of all files.
  lines[whichfile,FNR] = $0

  if (1==whichfile)
  { # Save key to match with file2.
    keys[2,grab167()]++
    # Save key to match with file3.
    keys[3,grab145()]++
  }
  else  # File 2 or 3.
  { key = (2==whichfile) ? grab167() : grab145()
    # If this key was in file1...
    if ((whichfile,key) in keys)
      # and no uid has yet been assigned...
      if (!((whichfile,key) in uidmap))
      { uid[whichfile]++
        uidmap[whichfile,key] = uid[whichfile]
      }
  }
}

END {
  # Print file1.
  for (line=1; (1,line) in lines; line++)
  { $0 = lines[1,line]
    display( uidmap[2,grab167()], uidmap[3,grab145()] )
  }

  for (file=2; file<=3; file++)
  { print "____ file " file
    for (line=1; (file,line) in lines; line++)
    { $0 = lines[file,line]
      if (2==file)
        display( uidmap[2,grab167()], 0 )
      else
        display( 0, uidmap[3,grab145()] )
    }
  }
}


function display( id1, id2    ,id)
{ id = form_id( id1 ) " " form_id( id2 )
  printf "%s %s\t%s-%s-%s\n", id,
    $0,substr($6,1,3),substr($5,1,3),substr($4,1,4)
}

function form_id( id )
{ if (id)
    return sprintf( "%03d", id )
  return "   "
}

# For file2.
function grab167( )
{ return clean1($1) SUBSEP clean567($6) SUBSEP \
         clean567($7)
}

# For file3.
function grab145( )
{ return clean1($1) SUBSEP $4 SUBSEP clean567($5)
}

function clean567( s )
{ return substr( s, 1, 3)
}

function clean1( s )
{ gsub ( /[-\/_() ]/, "", s )
  if ( s ~ /[0-9](E|X)$/ )
    sub( /(E|X)$/, "EXT", s )
  return s
}

mrr · Feb 19, 2005

futurelet,
Thanks again for this powerful comparison script.
Most of it is over my head, but it has helped me in learning more about the use of functions.

After analyzing my data I found that file1 has data in field1 that starts with numeric "0-" where files 2 and 3 same field1 starts with "O-", hence they will not be able to match.

Can you tell me why when I add the commented lines in the clean1 function they do not correct the problem in file1?

function clean1( s )
{ gsub ( /[-\/_() ]/, "", s )
###gsub( /^0-/, "O-", s ) ###used to change 0 to O in field 1
if ( s ~ /[0-9](E|X)$/ )
sub( /(E|X)$/, "EXT", s )
### if ( s ~ /^0-/ )
### sub( /(0-)/, "O-", s ) ###used to change 0 to O in field 1
return s

mrr · Feb 19, 2005

I got it to work by changing the record to:
gsub( /^0/, "O", s ) ###used to change 0 to O in field 1

Thanks futurelet and others for help on this problem.

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Mege and compare 3 files 7

Technical User

MIS

Programmer

Technical User

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Programmer

Technical User

MIS

Technical User

Programmer

Programmer

Technical User

Technical User

Similar threads

Log in

Part and Inventory Search

Sponsor