Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations IamaSherpa on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

comparing multiple files with different fields. 3

Status
Not open for further replies.

Guest_imported

New member
Jan 1, 1970
0
Hi,
I am a novice awk programmer and this is my first and critical assignment.
I have two flat files as input.
File 1 : 1|xyz|111|perrysburg|21
2|abc|222|sylvania|21
3|pqr|121|toledo|21
4|wpr|112|annarbor|21
5|mno|131|detroit|21
6|hij|133|columbus|21
7|klm|333|cincinati|21

File 2 : 1|xyz|111|perrysburg|21
3|pqr|121|toledo|21
4|wpr|112|annarbor|21
6|hij|133|columbus|21
8|def|212|maumee|21

I would like to search on fields 1,2,4 in File1 with fields 1,2,4 from File2 and if there is a match then write the whole record of File2 to an output file1 else write the whole record of File2 to output file2.
Also,pl. remember that the file names and the column positions for the search need to be dynamic (user inputs).

Pl. help me ASAP.:-(

Thank you for your time.
 
Hi baby_awk,

A similar problem is solved in thread271-353112.

Hope that helps,
Grant.
 
Hi Grant,
It helped a bit. It sucked out the duplicates but I want them to re-direct to a file and also I want the o/p of the the non-duplicate records into a file.
The script also has fixed set of columns and file names and I have them as user inputs.
Pl. advise ASAP.
Many many thanks....
 
Hi baby_awk,

Here is an example of ways of getting input from the user either from the command line, or accepting keyboard input during processing, and redirecting to files:

The data files:

% cat mydata.txt
abc def ghi
jkl mno pqr
stu dqf vwx
yza dlf bcd
efg hij klm


The program:

% cat my.awk
#!/usr/bin/awk -f

BEGIN{
printf("Enter regular expression mask: ");
getline re < &quot;-&quot;;
}

$2 ~ re {
print $0 > outfile;
}


Running the program:

my.awk outfile=myout.txt mydata.txt
Enter regular expression mask: d.f


The results:

% cat myout.txt
abc def ghi
stu dqf vwx
yza dlf bcd



I believe this has all the elements you need to put your program together.

Good luck,
Grant
 
Grant,
My column no's and file names will be an input from the shell script to this awk program. Therefore, I need to check the user input file names with the user input column no's for duplicates.

Pl. advise. This is critical.
 
Hi baby_awk,

How about if you show what you've done so far (with data) and explain where the problem is?

Grant.
 
Hi Grant,
Below given is the program I have written.
It generates the o/p specified below.
I run this program by supplying a file2 as an input.
BEGIN{
while ( (getline < &quot;file1&quot; ) > 0 )
{
array[$1,$2,$4]=1;
}
close(&quot;file1&quot;);
}
{
if ( array[$1,$2,$4] != &quot;&quot; )
print $0 > &quot;dups&quot;
else
print $0 > &quot;no_dups&quot;
}

The &quot;dups&quot; o/p file has
1|xyz|111|perrysburg|21
3|pqr|121|toledo|21
4|wpr|112|annarbor|21
6|hij|133|columbus|21

The &quot;no_dups&quot; o/p file has
8|def|212|maumee|21

I want to modify the above program to
accept both the files from the user (currently file1
is hard coded) and also the the columns no's
(currently $1,$2,$4 are hard coded). The column no's in the
array should be substituted by the column no's the user will
pass from the command line.

Hope you understand my problem.
Pl. email me at skuthe@hotmail.com
if u need any further info.

Thanks for you help.
 
function intake(msg) {
printf &quot;%s&quot;, msg
getline ret < &quot;-&quot;
if (ret) {
return ret
}
}


BEGIN {
file = intake(&quot;Name of file to load: &quot;)
while (x < 3) {
x++
array[x] = intake(&quot;Record to search: &quot;)
if (x == 3) {
array[&quot; $&quot;array[x],&quot; $&quot;array[x - 1],&quot; $&quot;array[x - 2]]= 1
}
}
for (all in array) {
print array[all],all
}
}

Try this and see if it is what you need as far as an idea.
 
Hey Guys,
thanks for the rapid response.
But the problem is still not solved yet.
The above example restricts the user for 3 columns but
the requirement is has to be dynamic (i.e it could be 2,4,6,8,9).
Also the parameters will be supplied to the awk script from the command line.

Pl. let me know if you can find something.

Thanks.
 
I suggest you check out &quot;Effective Awk Programming&quot;, pgs 185-190 for an awk getopt
which sounds like it's what you need.


 
Hi baby_awk,

Try this:

#!/usr/bin/awk -f

BEGIN{
FS=&quot;|&quot;;

MaxCols=0;
for (j=1; j<(ARGC-2); j++)
{
if ( ARGV[j] ~ /^[0-9]+$/ )
{
MaxCols+=1;
ColNo[MaxCols]=ARGV[j];
delete ARGV[j];
}
}

infile=ARGV[ARGC-2];
delete ARGV[ARGC-2];

while ( (getline < infile ) > 0 )
{
for (j=1; j<=MaxCols; j++)
ColVal[j]=$(ColNo[j]);

ColStr=ColVal[1];
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColNo[j];

array[ColStr]=1;
}

close(infile);
}

{
for (j=1; j<=MaxCols; j++)
{
ColVal[j]=$(ColNo[j]);
}

ColStr=ColVal[1];
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColNo[j];

if ( array[ColStr] != &quot;&quot; )
print $0 > &quot;dups&quot;
else
print $0 > &quot;no_dups&quot;
}

To run it:
my.awk 1 2 4 file2 file1


You should be able to vary the number of columns. Use the same order as above. That is, program name followed by columns, followed by files.

Warning: It's not bullet proof.

Hope this works,
Grant.
 
Here is Arnold Robbins getopt code, copied from a copy
distributed with gawk. If you have gawk you probably have this already.

# getopt.awk --- do C library getopt(3) function in awk
#
# Arnold Robbins, arnold@gnu.org, Public Domain
#
# Initial version: March, 1991
# Revised: May, 1993

# External variables:
# Optind -- index in ARGV of first non-option argument
# Optarg -- string value of argument to current option
# Opterr -- if nonzero, print our own diagnostic
# Optopt -- current option letter

# Returns:
# -1 at end of options
# ? for unrecognized option
# <c> a character representing the current option

# Private Data:
# _opti -- index in multi-flag option, e.g., -abc
function getopt(argc, argv, options, thisopt, i)
{
if (length(options) == 0) # no options given
return -1

if (argv[Optind] == &quot;--&quot;) { # all done
Optind++
_opti = 0
return -1
} else if (argv[Optind] !~ /^-[^: \t\n\f\r\v\b]/) {
_opti = 0
return -1
}
if (_opti == 0)
_opti = 2
thisopt = substr(argv[Optind], _opti, 1)
Optopt = thisopt
i = index(options, thisopt)
if (i == 0) {
if (Opterr)
printf(&quot;%c -- invalid option\n&quot;,
thisopt) > &quot;/dev/stderr&quot;
if (_opti >= length(argv[Optind])) {
Optind++
_opti = 0
} else
_opti++
return &quot;?&quot;
}
if (substr(options, i + 1, 1) == &quot;:&quot;) {
# get option argument
if (length(substr(argv[Optind], _opti + 1)) > 0)
Optarg = substr(argv[Optind], _opti + 1)
else
Optarg = argv[++Optind]
_opti = 0
} else
Optarg = &quot;&quot;
if (_opti == 0 || _opti >= length(argv[Optind])) {
Optind++
_opti = 0
} else
_opti++
return thisopt
}
BEGIN {
Opterr = 1 # default is to diagnose
Optind = 1 # skip ARGV[0]

# test program
if (_getopt_test) {
while ((_go_c = getopt(ARGC, ARGV, &quot;ab:cd&quot;)) != -1)
printf(&quot;c = <%c>, optarg = <%s>\n&quot;,
_go_c, Optarg)
printf(&quot;non-option arguments:\n&quot;)
for (; Optind < ARGC; Optind++)
printf(&quot;\tARGV[%d] = <%s>\n&quot;,
Optind, ARGV[Optind])
}
}
 
Hi baby_awk,

I wonder if you had a chance to try out the script I sent above. I was at work and didn't have much time to explain it, but I believe it matches your requirements: it's possible to have varying numbers of columns. Everything -- both column numbers and input files -- get input via the command line.

In your examples, you use '|' as your field separator. If this is not how the file actually appears you can remove the line FS=&quot;|&quot;;

The code may appear a bit cryptic, but if anything is unclear I will be happy to explain it.

I tried it quickly from my workplace and it seemed to work fine. You might want to make it more bulletproof. For example, the user could input more column numbers than exist in the file. I don't know how it would behave in that case.

Anyway, I hope it solves your requirements,
Grant.
 
Hi Grant,
It was really nice of you to send the script. I was trying the script yesterday afternoon. The script is excellent in terms of accepting user parameters but it did not fully solve the problem. It finds the &quot;Non-Dups&quot; correctly but cannot determine the Dups. For Dups it just repeats the second input file. I was trying to solve it but being very new to awk and no books for available for references I am unable to make it work as per my needs also I was not able to understand the following part of the code
[ {
for (j=1; j<=MaxCols; j++)
{
ColVal[j]=$(ColNo[j]);
}

ColStr=ColVal[1];
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColNo[j];
]

I will really appreciate if you can spend some time on it and explain the above part and give me some tips as to what needs to be done to get the Dups part work correctly.

Thank you very much for your time.
 
Hi Grant,
The actual problem in finding the DUPS is that the program compares the two files based on the FIRST column only; all other coulmns are ignored.
This is the actual problem in dups and not the one I mentioned in my earlier message. SORRY for the confusion.

Thanks.
 
Hi baby_awk,

I used the following data, taken from your example:


% cat file1
1|xyz|111|perrysburg|21
2|abc|222|sylvania|21
3|pqr|121|toledo|21
4|wpr|112|annarbor|21
5|mno|131|detroit|21
6|hij|133|columbus|21
7|klm|333|cincinati|21


% cat file2
1|xyz|111|perrysburg|21
3|pqr|121|toledo|21
4|wpr|112|annarbor|21
6|hij|133|columbus|21
8|def|212|maumee|21




I ran it using this syntax:

% my.awk 1 2 4 file2 file1



I got these results:

% cat dups
1|xyz|111|perrysburg|21
3|pqr|121|toledo|21
4|wpr|112|annarbor|21
6|hij|133|columbus|21


% cat no_dups
2|abc|222|sylvania|21
5|mno|131|detroit|21
7|klm|333|cincinati|21




Please try again using the data above. If we get different results, it suggests that the problem is due to system or version differences.




As for an explanation of the code.

The basic plan is as follows:

1) On the command line, the user should be able to specify:
-a list of (any number of) columns to check, followed by
-the '2nd' filename, (file2 in the example above), and
-the '1st' filename, (file1 in the example above).

Eg:
my.awk 1 2 4 file2 file1


Arguments must arrive in the order listed above.



2) In the BEGIN{} section:
-Set the field separator to &quot;|&quot;;
FS=&quot;|&quot;;



-Get the columns by looping through ARGV[] array.
-The highest column number should at at ARGV[ARGC-2].
-Keep track of the number of column numbers (MaxCols).
-Store the columns numbers in ColNo[] array.
-Delete the column no's from the ARGV[] array as we go.


MaxCols=0;
for (j=1; j<(ARGC-2); j++)
{
if ( ARGV[j] ~ /^[0-9]+$/ )
{
MaxCols+=1;
ColNo[MaxCols]=ARGV[j];
delete ARGV[j];
}
}



-Get name of '2nd file' (infile) from ARGV[] array and delete.

infile=ARGV[ARGC-2];
delete ARGV[ARGC-2];



-Loop through 'infile' (using getline) to build array[].


while ( (getline < infile ) > 0 )
{
for (j=1; j<=MaxCols; j++)
ColVal[j]=$(ColNo[j]);

ColStr=ColVal[1];
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColNo[j];

array[ColStr]=1;
}


-The subscript of array[] will consist of all the values
from each of the requested columns all concatenated
together, separated by SUBSEP. (You may need to look up
how arrays work in awk).
-As a starting point in constructing the key, we have the
column numbers (stored in ColNo[]) that the user provided
on the command line.
-There are 'MaxCols' elements in ColNo. In other words,
MaxCols is the number of columns the user put on the command
line.

-What does $(ColNo[j]) mean?
-We normally get the value of column 1 by referencing $1,
column 2 by referencing $2, and so on.
-We can also use the reference a column value using the
notation '$( expression )'.
-Example: $( 1 + 1 ) is the same as $2.
-Likewise, $(ColNo[j]) refers to the value in whichever
column is referred to by ColNo[j].



-We do a 'for' loop using the contents of ColNo[] to
build a new array ColVal[], which contains the VALUES of
each column.

for (j=1; j<=MaxCols; j++)
ColVal[j]=$(ColNo[j]);




-We do a second loop to concatenate the values -- separated
by SUBSEP -- together, to build the subscript for array[].

ColStr=ColVal[1];
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColNo[j];




-We set the contents of each array element to an arbitrary
value. In this case, we set it to 1.

array[ColStr]=1;




-We then are done with 'infile', so we close it.

close(infile);





3) In the 'main' loop:
-As we pass through each line of 'file 1', we do almost the same
thing as in the BEGIN{} section to create the subscript of values
of each of the columns.

for (j=1; j<=MaxCols; j++)
{
ColVal[j]=$(ColNo[j]);
}

ColStr=ColVal[1];
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColNo[j];





-We then test to see if the array[subscript] combination exists.
-If it does, write to 'dups'.
-Else, write to 'no_dups'.

if ( array[ColStr] != &quot;&quot; )
print $0 > &quot;dups&quot;
else
print $0 > &quot;no_dups&quot;



Hope this clarifies things. (Ya, right!)
Grant.
 
Hi,
I have modified your script as below.
Why does it checks only the last record of file 1 ????
why not all records from file 1????????????

Please reply ASAP.............

#!/usr/bin/awk -f

BEGIN{
# FS=&quot;|&quot;;

MaxCols=0;
for (j=1; j<(ARGC-2); j++)
{
if ( ARGV[j] ~ /^[0-9]+$/ )
{
MaxCols+=1;
ColNo[MaxCols]=ARGV[j];
delete ARGV[j];
}
}

for (j=1; j<=MaxCols; j++)
{
if ( j < MaxCols )
ColStr=ColStr&quot;$&quot;ColNo[j]&quot;,&quot;;
else
ColStr=ColStr&quot;$&quot;ColNo[j];
}

#The ColStr will have values (S1,$3,$4) after the
#execution of the above for loop if columns 1,3,4 were #passed to the program from command line.

infile=ARGV[ARGC-2];
delete ARGV[ARGC-2];

while ( ( getline < infile ) > 0 )
{
array[ColStr]=1;
}
close(infile)
}
{ if ( array[ColStr] != &quot;&quot; )
print $0 > &quot;dups&quot;
else
print $0 > &quot;no_dups&quot;
}


 
Hi Grant,
I ran the code exactly as you have mentioned with the similar file contents for file1 and file2

Here is the output ....

Dups
====
1|xyz|111|perrysburg|21
2|abc|222|sylvania|21
3|pqr|121|toledo|21
4|wpr|112|annarbor|21
5|mno|131|detroit|21
6|hij|133|columbus|21
7|klm|333|cincinati|21


No_Dups
=======
9|def|212|maumee|21

I am on HP-Unix box.
 
Hi baby_awk,

The changes you have made have eliminated an important part of the main loop.

The fundamental plan is really the same as it was in the small script that you wrote earlier,

eg:
BEGIN{
while ( (getline < &quot;file1&quot; ) > 0 )
{
array[$1,$2,$4]=1;
}
close(&quot;file1&quot;);
}

#main
{
if ( array[$1,$2,$4] != &quot;&quot; )
print $0 > &quot;dups&quot;
else
print $0 > &quot;no_dups&quot;
}


The basic plan is this:
In the BEGIN{} section, we pass through each record of file2 and build an array which has a subscript that is composed of the concatenation of the values of the requested columns.

Then in the main section, we pass through each record of file1, getting the values in the corresponding column numbers. We build a subscript using the same method as in the BEGIN{} section. It's important to remember that we must build the subscript for each record in file1!

One problem with your changes to my code in the more elaborate version, is that, in the main section, you are no longer building a new subscript for each record. Instead, you are re-using a subscript that was left over from the last line of file2 which you built in the BEGIN{} section.

I think it would be better if you go to my previous message and re-try my code using the exact examples I used. We should try to eliminate any problems due to version differences, etc.

Grant.
 
Hi baby_awk,

Oops, please disregard my last message. I see you did run the examples.

I am surprised at your results. The record '9|def|212|maumee|21' was part of file2 and was not referenced in file1. The main loop should only go through file1, so it should never have been tested in the main loop at all, and therefore should not have ended up in either dups or no_dups.

I wonder if there is any other HP users using your version who could test this. (What is your version anyway?).

Grant.
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top