Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

AWK in bash to compare files 2

Status
Not open for further replies.

Awksome

Programmer
Jul 17, 2009
11
US
I have "n" files in directory A and "n" files in directory B. The files are expected to be the same with same data. Each file has 14 columns and "x" rows. Of the 14 column, 2 columns are to be considered as key identifiers. Based on this unique combination, I need to compare each field value between the two files and report the difference.

Note: File in Directory A has values displayed upto 5 decimal places. But file in Directory B has some values displayed as scientific values, some whole numbers and some upto some decimal places.

I am a beginner with Bash and learned that this could be executedin AWK and not sure on executing this.

The text file is tab delimited.
Here is the sample file:
File A:
Row1: 6666; D000; $12.00405; $234.08976; $212.09876; the number goes on for another 9 columns
Row2: 6666; D015; $12.56745; $456.34567; $324; 12345;....

File b:
Row1: 6666; D015; $12.56745; $456.34567; $324; 12345;....
Row2: 6666; D000; $12.00405; $234.08976; $212.09876;....

So as you can see, the first two values are the unique row identifier. With that, I need to compare all the other values.

any help would be appreciated.

Thanks!
 
You say tab delimited, but your sample data has semi-colons as well; is it both?

Try this anyway:

Code:
awk -F '[ \t;$]+' '
        # load first file into array indexed by fields 1 and 2
        NR == FNR {
                for (i=3; i<=NF; i++) {
                        file1[$1,$2,i] = $i
                }
                # store the number of fields for this index
                file1nf[$1,$2] = NF
                next
        }
        {
                if (!file1nf[$1,$2]) {
                        print "key " $1 " " $2 " missing from file1"
                        next
                }
                if (file1nf[$1,$2] != NF ) {
                        print "key " $1 " " $2 " has different number of fields: " file1nf[$1,$2], NF
                }
                for (i=3; i<=NF; i++) {
                        if (file1[$1,$2,i] != $i) {
                                print "key " $1 " " $2 " field " i " differs: " file1[$1,$2,i], $i
                        }
                }

        }
' dira/file1 dirb/file1

It ignores the case where lines are in dira/file1 but not in dirb/file1, so you may want to add that feature. Also it ignores semi-colons and $ signs.

Sample output:

Code:
$ cat dira/file1
6666; D000; $12.00405; $234.08976; $212.09876; 1; 2; 3; 4; 5; 6; 7; 8; 9;
6666; D015; $12.56745; $456.34567; $324; 12345; 1; 2; 3; 4; 5; 6; 7; 8;
6666; D012; $12.56745; $456.34567; $324; 12345; 1; 2; 3; 4; 5; 6; 7; 8;
6666; D013; $12.56745; $456.34567; $324; 12345; 1; 2; 2; 4; 5; 6; 7; 8;
6666; D014; $12.56745; $456.34567; $324; 12345; 1; 2; 3; 4; 5; 6; 7; 8;
6666; D016; $12.56745; $456.34567; $324; 12345; 1; 2; 3; 4; 5; 6; 7;
$ cat dirb/file1
6666; D015; $12.56745; $456.34567; $324; 12345; 1; 2; 3; 4; 5; 6; 7; 8;
6666; D000; $12.00405; $234.08976; $212.09876; 1; 2; 3; 4; 5; 6; 7; 8; 9;
6666; D001; $12.00405; $234.08976; $212.09876; 1; 2; 3; 4; 5; 6; 7; 8; 9;
6666; D013; $12.56745; $456.34567; $324; 12345; 1; 2; 3; 4; 5; 6; 7; 8;
6666; D014; $12.56745; $4.5634567e+02; $324.0000; $12345; 1; 2; 3; 4; 5; 6; 7; 8;
6666; D016; $12.56745; $456.34567; $324; 12345; 1; 2; 3; 4; 5; 6; 7; 8;
$ ./compare
key 6666 D001 missing from file1
key 6666 D013 field 9 differs: 2 3
key 6666 D016 has different number of fields: 14 15
key 6666 D016 field 14 differs:  8
$

Annihilannic.
 
Thanks for the response!

The fields are tab delimited. Just for clarity purpose I used ";".

I will try this code. But I need two other checks to be performed as well:
1. MAtch th enumber of records on both files. It should e same.
2. The values being compared are reprensented differently. For example, in File1 - all values are upto 5 decimal places.
File2 - some as whole numbers, some in scientific

File1 - 6670; D0005; $123.00000; $56.65432
File2 - 6670; D0005; $123; $56.6543212e6

how to compare these values in field 3 and 4?

Thanks again!
 
Okay, you can take the semi-colon out of the delimiters I specified with -F then.

You don't need to do anything to handle the numbers that are in scientific notation because AWK recognises and understands that automatically (as you can see from the D014 lines in the test data I used).

You should be able to modify my code easily enough to check the number of records of the two files; let me know if you get stuck.

Annihilannic.
 
Thanks for the response!

Now that I read through your code, I am getting confused with variable File1. Since both the files names are same in the two directories, now that I am trying to replace the names as per my requirement and just not sure which file is what. can you detail your code, please?

I still could not find a way to compare NR of both files. I would appreciate if you could help.
 
Sorry, it was just a poor choice of names for my test files... I have renamed them to dira/testfile and dirb/testfile for clarity. That is the only place where you should need to change any filenames.

Code:
awk -F '[ \t$]+' '
        # load first file into array indexed by fields 1 and 2
        NR == FNR {
                file1nr = FNR
                for (i=3; i<=NF; i++) {
                        file1[$1,$2,i] = $i
                }
                # store the number of fields for this index
                file1nf[$1,$2] = NF
                next
        }
        # compare contents of second file to information captured above
        {
                file2nr = FNR
                if (!file1nf[$1,$2]) {
                        print "key " $1 " " $2 " missing from file1"
                        next
                }
                if (file1nf[$1,$2] != NF ) {
                        print "key " $1 " " $2 " has different number of fields: " file1nf[$1,$2], NF
                }
                for (i=3; i<=NF; i++) {
                        if (file1[$1,$2,i] != $i) {
                                print "key " $1 " " $2 " field " i " differs: " file1[$1,$2,i], $i
                        }
                }

        }
        END {
                if (file1nr != file2nr)
                        print "files have different numbers of records: " file1nr, file2nr
        }
' dira/testfile dirb/testfile

So the array file1 contains the data from the first input file specified on the command-line.

Annihilannic.
 
Thanks Annihilannic!!!! This code works like a miracle!

I still have one question though. When I run this code on my files, I see that the output still shows the difference because of scientific value representation.

Here is a sample output:
key 10139 D000 field 4 differs: -1.19119e+06 -91190.11328125
key 10139 D000 field 5 differs: -1.19119e+06 -91190.11328125

-.19119e+06 is from dira/testfile
-91190.11328125 is from dirb/testfile

I want to compare the values upto 6 decimal places. How do I handle this?

Thanks agai for your time.
 
Also, the first record in both the files are Header and need to avoid that in processing. Thanks!
 
I'd prefer not to write any more code for you since you don't appear to be attempting anything yourself... have you had a look at the awk man page at all to get a general idea of how it works?

I don't understand what you're saying about those two numbers. They are different numbers, despite the representation, so they won't match anyway.

Annihilannic.
 
Thanks Anni!!!

I made few modifications as per my requirement and it works great... Thanks for the wonderful code!
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top