
How to Remove Dupe Records based on one Column


dougcranston (Technical User)
Ladies and Gentlemen,

I am struggling to find a way to remove duplicate lines from a pipe-delimited file using AWK.

The file I get is in the 2k-5k range.

The file is Sorted by the REF column.

SAMPLE INPUT FILE

CD|REF|NAME|Month|Year
A12345N|123123|GALLAGH T|09|2007
D2345AN|123123|GALLAGH T|09|2007
D2345AN|345678|FOLLGERS J|09|2007
A12345N|654321|JOHNS A|09|2007
D2345AN|654321|JOHNS A|09|2007
A42345N|078982|VANCLYBURN A|09|2007

The REF column is the one that contains duplicates.

The result should be an output file containing only the first record for each REF value.

SAMPLE NEEDED OUTPUT FILE

CD|REF|NAME|Month|Year
A12345N|123123|GALLAGH T|09|2007
D2345AN|345678|FOLLGERS J|09|2007
A12345N|654321|JOHNS A|09|2007
A42345N|078982|VANCLYBURN A|09|2007

I am running on a Windows desktop with no access to Linux or UNIX systems, but I do have a command-line version of AWK.

I have tried searching this site and the web, but I have not found a script that works or that I understand well enough to adapt to this problem.

Any suggestions would be greatly appreciated.

Thanks,

Dougc
 
I too need a similar function. I found an awk program called histsort that deletes duplicates:
{
    # remember each line the first time it is seen,
    # preserving the original order
    if (data[$0]++ == 0)
        lines[++count] = $0
}

END {
    # print the remembered lines in order
    for (i = 1; i <= count; i++)
        print lines[i]
}
The delete is based on the entire line. I need to delete based on a key within the line; in other words, delete dups based on columns 1-15.

Any recommendations!
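
For reference, a minimal sketch of that variation, assuming "columns 1-15" means character positions 1 through 15 of each line (the substr() call and variable names are just for illustration and would change if the key is really a set of delimited fields):
Code:
{
    key = substr($0, 1, 15)      # dedup key: first 15 characters of the line (assumption)
    if (!(key in seen)) {
        print                    # first line with this key: keep it
        seen[key] = 1            # remember the key so later duplicates are skipped
    }
}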
 
You may try this simple awk program:
Code:
$2 in arr{next}{print;arr[$2]}
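Spelled out with comments, the same program might look like this as a standalone script (the pipe field separator is an assumption based on the sample data above):
Code:
BEGIN { FS = "|" }       # fields are pipe-separated (assumed from the sample file)
$2 in arr { next }       # this REF has been seen before: skip the record
{ print; arr[$2] }       # first occurrence: print it and record the REF in the array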

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
PH

Thanks for the code.

Unfortunately, when I try it, it returns all the rows. I prefaced your code with a BEGIN {FS="|"} too.

Search continues.

Thanks again.

Dougc
 
jdrocea,

Thanks. If I find a solution, I will let you know too.

Dougc
 
jdrocea,

Found the solution... I re-Googled and came across one online.

I had to tweak it a little: since my test file keys on the second field, I added a BEGIN block to set the field separator and used $2.

Works fine.

Thanks all.


BEGIN {FS="|"}

{
if ($2 in stored_lines)
x=1
else
print
stored_lines[$2]=1
}
 
Well... back to the drawing board.

For a larger file it fails. I had an 80k file and it gave me an out-of-memory error with the DOS-based version of AWK I am running.

Oh well. Back at it again.

Thanks,

Dougc
 
I'm probably missing something obvious here, but you say:

"The file is Sorted by the REF column."

But the last record in your example has a REF of 078982, and the first has a REF of 123123. This doesn't look sorted to me at all.


In order to understand recursion, you must first understand recursion.
 
OT I know, but try this from the DOS command line on your file:

For /F "tokens=1,2,* delims=|" %i in (input.dat) Do @find "%j" output.dat||@echo %i %j %k>>output.dat


In order to understand recursion, you must first understand recursion.
 
Provided the file is sorted by the second column, you may try this:
Code:
BEGIN{FS="|"}
$2!=ref{print;ref=$2}
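Because this version only compares each record's REF against the one from the previous record, it keeps no array in memory, which is presumably why it copes where the array-based script ran out of memory. A commented equivalent (same pipe-separator assumption as above):
Code:
BEGIN { FS = "|" }             # fields are pipe-separated
$2 != ref { print; ref = $2 }  # REF differs from the previous record:
                               # print the record and remember its REF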

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Thanks for the code.

Unfortunately, when I try it, it returns all the rows. I prefaced your code with a BEGIN {FS="|"} too.
It works for me. Did you include the program on the command line, or did you put it in a file? If the former, show us the command line you used.
Code:
E:\>echo BEGIN{FS="|"}$2 in a{next}{print;a[$2]}|awk -f - file
output:
Code:
CD|REF|NAME|Month|Year
A12345N|123123|GALLAGH T|09|2007
D2345AN|345678|FOLLGERS J|09|2007
A12345N|654321|JOHNS A|09|2007
A42345N|078982|VANCLYBURN A|09|2007
 
For a larger file it fails. I had an 80k file and it gave me an out-of-memory error with the DOS-based version of AWK I am running.
If you're in a DOS box under Windows, I don't see how you're running out of memory. I tried this under Windows 2000:
Code:
ruby -e '9000.times{puts "foo@#{ rand(9999) }@bar"}' >junk

awk "BEGIN{FS=\"@\"}{a[$2]}END{for(k in a)c++;print c}" junk
The file "junk" is about 120k and the output of the awk program is 5876. This works with awk, gawk, and mawk.

The awk I used is compiled by Kernighan himself (he's the K in AWK); it can be downloaded from his web site.
 
PHV,
Thank you very much. I tried your code with the file that was dying and it worked like a champ.

I saved your code as ddupe.awk and ran it from the command prompt using: awk -fddupe.awk tst.txt

BEGIN{FS="|"}
$2!=ref{print;ref=$2}

It whittled the file down as expected.

Again thank you.

Doug C
 
futurelet,

I tried your code and it ran to a point and then posted an out-of-memory error.

What I did was save your AWK code:

BEGIN{FS="|"}
$2 in a{next}{print;
a[$2]}

I then saved it as: ddupe2.awk

I then ran it using: awk -fddupe2.awk tst2.txt

It processed a portion of the file and then aborted with the following error msg:

awk ERROR ddupe2.awk: Out of Memory

I am using a DOS AWK version from 1991.

Although I am familiar with piping commands, I was not able to get it to work the way you showed.

I am wondering if the issue is the difference between the two scripts: your code builds an array entry for every distinct REF, whereas the one from PHV only keeps the previous value... Maybe it is the version of AWK.

futurelet, I do want to thank you for your input; I suspect it is either self-inflicted on my end or an awk version issue.

Thanks again.

Dougc
 