Remove duplicates from a file depending on user input columns

Status
Not open for further replies.

skuthe

Programmer
Sep 10, 2002
33
US
Guys,
I want to remove duplicate records from a file where only the columns supplied by the user match.

For example:
2001|typ|CA|089
2001|bkj|CA|987
2004|bkp|CA|986
2006|typ|CA|654

Consider the above file as input.
Suppose the user supplies columns 2 and 3 to search for duplicates; the program should then give the following output as duplicate records:
2001|typ|CA|089
2006|typ|CA|654
Other records should be written to a non-duplicate file.

Any suggestions??

Thank you in advance.
 
Not to steal Grant's thunder, but... isn't it similar to thread271-355879?

vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
 
I looked at it but couldn't find a way to trim it to my use.
I guess he is comparing two files there, whereas my requirement is for one file only.

Could you please give me an appropriate script that satisfies my requirement?


Thank you.
 
Hi skuthe,

I am sending you some pseudo-code that I think might solve your problem. But it really does draw heavily on the program from Thread271-355879.

By the way, I am assuming you will need to allow the user to enter varying numbers of columns.

Here's the pseudo-code:


Command line syntax:
% my.awk col_1 col_2 ... col_n file


BEGIN{

# Get list of columns from command line.
# Also get MaxCols.

# Do pre-pass through file ( ARGV[ARGC-1] )
while ( ( getline < ARGV[ARGC-1] ) > 0 )
{
# Use list of columns from command line
# to build SUBSCRIPT made up of a
# concatenated list of values from those
# columns.

array[SUBSCRIPT]+=1;
}
close (ARGV[ARGC-1]);
}

#main
{
# Use list of columns to build SUBSCRIPT made
# up of concatenated list of values from those
# columns.


# Use array[SUBSCRIPT] to test each record to see
# if it was a duplicate.
if (array[SUBSCRIPT] > 1)
{
print $0 > "dupefile.txt";
}
else
{
print $0 > "otherfile.txt";
}
}



If you look at Thread271-355879 you should be able to steal a fair bit of code.

Hope this helps,
Grant.
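
For readers who want something runnable, the two-pass pseudo-code above can be filled in roughly as follows. This is only a sketch under stated assumptions, not the code from Thread271-355879: the file names twopass.awk and records.txt and the helper make_key() are made up for illustration, and the sample data is the four records from the original question.

```sh
# Sample data from the original question.
printf '%s\n' '2001|typ|CA|089' '2001|bkj|CA|987' \
              '2004|bkp|CA|986' '2006|typ|CA|654' > records.txt

cat > twopass.awk <<'EOF'
# Usage: awk -f twopass.awk col_1 col_2 ... col_n file

# Build the array subscript from the user-selected columns of the
# current record (make_key is a hypothetical helper name).
function make_key(   j, k) {
    k = $(ColNo[1])
    for (j = 2; j <= MaxCols; j++)
        k = k SUBSEP $(ColNo[j])
    return k
}

BEGIN {
    FS = "|"
    file = ARGV[ARGC - 1]

    # Leading numeric arguments are column numbers, not input files.
    for (j = 1; j < ARGC - 1; j++) {
        if (ARGV[j] ~ /^[0-9]+$/) {
            ColNo[++MaxCols] = ARGV[j]
            delete ARGV[j]    # otherwise awk tries to open "2", "3", ...
        }
    }

    # Pre-pass: count how many records share each key.
    while ((getline < file) > 0)
        count[make_key()]++
    close(file)
}

# Main pass: every record whose key occurred more than once is a duplicate.
{
    if (count[make_key()] > 1) {
        print > "dupefile.txt"
    } else {
        print > "otherfile.txt"
    }
}
EOF

rm -f dupefile.txt otherfile.txt
awk -f twopass.awk 2 3 records.txt
```

With columns 2 and 3 selected, dupefile.txt should end up holding the two typ|CA records and otherfile.txt the other two.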


By the way, I am willing to fill in the code for $150 US. You also have to pay Vlad's finders fee (20%).




Heh, heh.


 
Here's my attempt.

#usage: awk -f this-file -v f1=2 -v f2=3 -v dfn="dup.out" input-file < input-fil

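BEGIN {
    FS = "|"
    if (!f1) f1 = 1
    if (!f2) f2 = 2
    if (!dfn) dfn = "dup.out"
}
{
    if (a[$f1,$f2]) {
        # Second or later occurrence: write the stored first record once,
        # then the current record, and mark the key as a known duplicate.
        if (a[$f1,$f2] != 1) print a[$f1,$f2] > dfn
        print > dfn
        a[$f1,$f2] = 1
    }
    else {
        # First occurrence: remember the whole record.
        a[$f1,$f2] = $0
    }
}
END {
    # Re-read the input from stdin and print the records that never repeated.
    while ((getline < "-") > 0) {
        if (a[$f1,$f2] != 1) print
    }
}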

You can have this one for $149.99 with no finder's fee. CaKiwi
 
That's

#usage: awk -f this-file -v f1=2 -v f2=3 -v dfn="dup.out" input-file < input-file

CaKiwi
 
Hi skuthe,

Ask CaKiwi to toss in free 24-hour support and 3 months AOL for free!

;)

grant
 
Grant,
I picked up the code from the thread you mentioned. After making my changes I ran it, and it gives me errors.
My code is below. Please let me know where I am going wrong and how to correct it.

Thanks.

My code :
#!/usr/bin/awk -f

BEGIN{
FS="|";

#print ARGC
MaxCols=0;
for (j=1; j<=(ARGC-2); j++)
{
if ( ARGV[j] ~ /^[0-9]+$/ )
{
MaxCols+=1;
ColNo[MaxCols]=ARGV[j];
# delete ARGV[j];
}
}

print "Argv 0 : " ARGV[0];
print "Argv 1 : " ARGV[1];
print "Argv 2 : " ARGV[2];
print "Argv 3 : " ARGV[3];
print "Maxcols : " MaxCols;

while ( (getline < ARGV[ARGC-1] ) > 0 )
{

for (j=1; j<=MaxCols; j++)
{
ColVal[j]=$(ColNo[j]);
ColStr=ColVal[1];
}
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColVal[j];

array[ColStr]+=1;
}

close(ARGV[ARGC-1]);
}

{
for (j=1; j<=MaxCols; j++)
{
ColVal[j]=$(ColNo[j]);
}

ColStr=ColVal[1];
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColVal[j];

if ( array[ColStr] > 1 )
print "Dups" $0
else
print "No Dups" $0
}


awk -fmy_awk.awk 2 3 file1

Errors :

awk: Cannot find or open file 4.
The source line number is 41.


 
Hi skuthe,

Just at a glance I see 2 problems:


PROBLEM 1:
You commented out the delete of ARGV[j]:
# delete ARGV[j];

The delete is necessary because of the way awk handles command line arguments.

Let me explain it this way:

Awk lets you define variables on the command line that can be referenced in the program using the syntax awk myvar=value. (There is more to this syntax, including the use of the -v switch, but I won't get into that now).

One of the problems is that, at least on my version of awk, variables defined in this way are not accessible in the BEGIN{} section, they are only accessible in main.

Another thing -- and this is a personal preference -- is that I often prefer to minimize the extra stuff that has to be written on the command line, so I sometimes like to get rid of the 'myvar=' portion of the variable definition.

But if you don't use the syntax to tell awk that an argument is a variable, awk assumes it is an input file. (I don't know if you have ever tried this, but if you put 2 or more file names on your command line, awk will process them in the order supplied).

So, when we ask the user to supply the column numbers on the command line without anything to signal that it is a variable, then we can access those values in the BEGIN{} section via the ARGV[] array, but we need to remember to delete those values from the ARGV[] array, or awk will think they are filenames when the main loop starts, and it will bomb out because no files exist by those names. (Or if files do exist with the same name, things might get really hairy!)
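
The three ways of passing a value described above can be compared with a small experiment. This is an illustrative sketch only: the file names (argv_demo.txt and the .out files) and the variable n are assumptions, not part of the original program.

```sh
printf 'x\n' > argv_demo.txt

# 1. Assignment given as an operand: it is applied only when awk reaches it
#    in the argument list, so BEGIN still sees n unset.
awk 'BEGIN { print "BEGIN: [" n "]" } { print "main: [" n "]" }' \
    n=5 argv_demo.txt > argv_operand.out

# 2. Assignment given with -v: it is applied before BEGIN runs.
awk -v n=5 'BEGIN { print "BEGIN: [" n "]" } { print "main: [" n "]" }' \
    argv_demo.txt > argv_v.out

# 3. Bare numbers: readable through ARGV in BEGIN, but they must be deleted
#    from ARGV or awk will try to open files named "2" and "3".
awk 'BEGIN {
         for (j = 1; j < ARGC - 1; j++) {
             cols = cols " " ARGV[j]
             delete ARGV[j]
         }
         print "cols:" cols
     }
     { print "read: " $0 }' 2 3 argv_demo.txt > argv_bare.out

cat argv_operand.out argv_v.out argv_bare.out
```

The first run should show an empty value in BEGIN and 5 in main, the second should show 5 in both, and the third should list the columns and then read only argv_demo.txt.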




PROBLEM 2:

You copied some of the code incorrectly. You have:
for (j=1; j<=MaxCols; j++)
{
ColVal[j]=$(ColNo[j]);
ColStr=ColVal[1];
}
for (j=2; j<=MaxCols; j++)
ColStr=ColStr SUBSEP ColVal[j];


It should be:
for (j=1; j<=MaxCols; j++)
{
ColVal[j]=$(ColNo[j]);
}

ColStr=ColVal[1];
for (j=2; j<=MaxCols; j++)
{
ColStr=ColStr SUBSEP ColVal[j];
}

This was an easy mistake to make because in the original code I did not clearly delineate the scope of the for block with the use of {}'s. Normally I do, but sometimes I can be a slacker.


Try it again with those corrections and let me know how it goes.

Grant.




Question: Has anybody ever tried using array references on the command line?

Example: my.awk col[1]=1 col[2]=3 col[3]=5 mydata.txt

A usage like this (if it works) might make for a messier command line, and increase the possibility of user error, but it might make the program code simpler and easier to understand. This would only be needed in cases where there was a need for a varying number of arguments, such as column numbers.
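
On the question above: as far as I know, awk only treats a command-line operand as an assignment when the part before the = is a plain variable name, so col[1]=1 would most likely be taken as a file name. One alternative that keeps a variable number of columns in a single argument is to pass a comma-separated list with -v and split() it in BEGIN. The sketch below is an assumption-laden illustration (the variable name cols and the file names are invented here); it just prints the selected column values for each record.

```sh
printf '%s\n' '2001|typ|CA|089' '2001|bkj|CA|987' \
              '2004|bkp|CA|986' '2006|typ|CA|654' > records_cols.txt

awk -F'|' -v cols=2,3 '
BEGIN {
    # split() returns the number of pieces, which is the number of columns.
    MaxCols = split(cols, ColNo, ",")
}
{
    k = $(ColNo[1])
    for (j = 2; j <= MaxCols; j++)
        k = k "," $(ColNo[j])
    print k
}' records_cols.txt > cols_demo.out

cat cols_demo.out
```

Because -v assignments are already in effect when BEGIN runs, nothing has to be removed from ARGV in this variant.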
 
Grant,
It worked fine, but there is one problem.
I need to put one record from each set of duplicates into the non-duplicate file.
That is, of two duplicates, the first one should be copied to the non-dup file and the other should go to the dup file.

Any ideas, please?
 
Hi skuthe,

Ouch! I thought from your original example ALL dupes were supposed to go into the dupes file, including the first.

Let me give it some thought. It should be possible to reduce it to a single pass through the file.

Grant.
 
Hi skuthe,

Well, I thought about it. It seems to me the following will do what you want (and it's simpler):


Command line syntax:
% my.awk col_1 col_2 ... col_n file


BEGIN{

# Get list of columns from command line.
# Also get MaxCols.

}

#main
{
# Use list of columns to build SUBSCRIPT made
# up of concatenated list of values from those
# columns.

array[SUBSCRIPT]+=1;

# Use array[SUBSCRIPT] to test each record to see
# if it was a duplicate.
if (array[SUBSCRIPT] > 1)
{
print $0 > "dupefile.txt";
}
else
{
print $0 > "otherfile.txt";
}
}



Grant.
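
A runnable version of the single-pass pseudo-code above might look like the following. It is a sketch under assumptions rather than the code skuthe ended up with: the file names onepass.awk and records.txt are invented, and the sample data is taken from the original question.

```sh
printf '%s\n' '2001|typ|CA|089' '2001|bkj|CA|987' \
              '2004|bkp|CA|986' '2006|typ|CA|654' > records.txt

cat > onepass.awk <<'EOF'
# Usage: awk -f onepass.awk col_1 col_2 ... col_n file
BEGIN {
    FS = "|"
    # Collect the column numbers and remove them from ARGV so that awk
    # does not treat them as input files.
    for (j = 1; j < ARGC - 1; j++) {
        if (ARGV[j] ~ /^[0-9]+$/) {
            ColNo[++MaxCols] = ARGV[j]
            delete ARGV[j]
        }
    }
}
{
    # Build the subscript from the selected columns.
    key = $(ColNo[1])
    for (j = 2; j <= MaxCols; j++)
        key = key SUBSEP $(ColNo[j])

    # The first record with a given key is kept; later ones are duplicates.
    if (++count[key] > 1) {
        print > "dupefile.txt"
    } else {
        print > "otherfile.txt"
    }
}
EOF

rm -f dupefile.txt otherfile.txt
awk -f onepass.awk 2 3 records.txt
```

Since the count is incremented before it is tested, the first occurrence of each key sees a count of 1 and goes to otherfile.txt, and only repeats go to dupefile.txt.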
 
Grant,
Nope, it does not work.
It copies all the records of the input file into the dup file without writing anything to the non-dup file
(i.e., it replicates the input file).

Waiting for your response,
Thanks.


Why did you start a new thread?
 
Sorry Grant, it works just fine!
I had misaligned the code, and that generated the wrong output.

Thanks for your help. I appreciate it!
 
Hi vgersh99 and CaKiwi,

Sorry guys. I guess none of us get paid this time.
<crustytheclown>Oh, well!</crustytheclown>

Grant.
 