
How to Remove Dupe Records based on one Column


dougcranston (Technical User)
Ladies and Gentlemen,

I am struggling to find a way to remove duplicate lines from a pipe-delimited file using AWK.

The file I get is in the 2k-5k range.

The file is Sorted by the REF column.

SAMPLE INPUT FILE

CD|REF|NAME|Month|Year
A12345N|123123|GALLAGH T|09|2007
D2345AN|123123|GALLAGH T|09|2007
D2345AN|345678|FOLLGERS J|09|2007
A12345N|654321|JOHNS A|09|2007
D2345AN|654321|JOHNS A|09|2007
A42345N|078982|VANCLYBURN A|09|2007

The REF column is the one that contains duplicates.

The result should be an output file containing only the first record for each REF value.

SAMPLE NEEDED OUTPUT FILE

CD|REF|NAME|Month|Year
A12345N|123123|GALLAGH T|09|2007
D2345AN|345678|FOLLGERS J|09|2007
A12345N|654321|JOHNS A|09|2007
A42345N|078982|VANCLYBURN A|09|2007

I am running on a Windows desktop with no access to Linux or UNIX systems, but I do have a command-line version of AWK.

I have tried searching this site and the web, but I have not found a script that works or that I understand well enough to adapt to this problem.

Any suggestions would be greatly appreciated.

Thanks,

Dougc
 
I too need a similar function. I found an awk program called histsort that deletes duplicates:
{
    # remember each line the first time it is seen,
    # preserving the original order
    if (data[$0]++ == 0)
        lines[++count] = $0
}

END {
    # print the remembered lines in order
    for (i = 1; i <= count; i++)
        print lines[i]
}
The delete is based on the entire line. I need to delete based on a key within the line; in other words, delete dups based on columns 1-15.

Any recommendations!
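
For reference, a minimal sketch of that variation, assuming "columns 1-15" means character positions 1 through 15 of each line (the substr() call and variable names are just for illustration and would change if the key is really a set of delimited fields):
Code:
{
    key = substr($0, 1, 15)      # dedup key: first 15 characters of the line (assumption)
    if (!(key in seen)) {
        print                    # first line with this key: keep it
        seen[key] = 1            # remember the key so later duplicates are skipped
    }
}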
 
You may try this simple awk program:
Code:
$2 in arr{next}{print;arr[$2]}
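Spelled out with comments, the same program might look like this as a standalone script (the pipe field separator is an assumption based on the sample data above):
Code:
BEGIN { FS = "|" }       # fields are pipe-separated (assumed from the sample file)
$2 in arr { next }       # this REF has been seen before: skip the record
{ print; arr[$2] }       # first occurrence: print it and record the REF in the array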

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
PH

Thanks for the code.

Unfortunately, when I try it, it returns all the rows. I prefaced your code with a BEGIN {FS="|"} too.

Search continues.

Thanks again.

Dougc
 
jdrocea,

Thanks. If I find a solution, I will let you know too.

Dougc
 
jdrocea,

Found the solution... I re-Googled and came across one online.

I had to tweak it a little: since my test file keys on the second field, I added a BEGIN block to set the field separator and used $2.

Works fine.

Thanks all.


BEGIN {FS="|"}

{
if ($2 in stored_lines)
x=1
else
print
stored_lines[$2]=1
}
 
Well... back to the drawing board.

For a larger file it fails. I had an 80k file and it gave me an out-of-memory error with the DOS-based version of AWK I am running.

Oh well. Back at it again.

Thanks,

Dougc
 
I'm probably missing something obvious here, but you say:

"The file is Sorted by the REF column."

But the last record in your example has a REF of 078982, and the first has a REF of 123123. This doesn't look sorted to me at all.


In order to understand recursion, you must first understand recursion.
 
OT I know, but try this from the DOS command line on your file:

For /F "tokens=1,2,* delims=|" %i in (input.dat) Do @find "%j" output.dat||@echo %i %j %k>>output.dat


In order to understand recursion, you must first understand recursion.
 
Provided the file is sorted by the second column, you may try this:
Code:
BEGIN{FS="|"}
$2!=ref{print;ref=$2}
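Because this version only compares each record's REF against the one from the previous record, it keeps no array in memory, which is presumably why it copes where the array-based script ran out of memory. A commented equivalent (same pipe-separator assumption as above):
Code:
BEGIN { FS = "|" }             # fields are pipe-separated
$2 != ref { print; ref = $2 }  # REF differs from the previous record:
                               # print the record and remember its REF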

Hope This Helps, PH.
FAQ219-2884
FAQ181-2886
 
Thanks for the code.

Unfortunately, when I try it, it returns all the rows. I prefaced your code with a BEGIN {FS="|"} too.
It works for me. Did you include the program on the command line, or did you put it in a file? If the former, show us the command line you used.
Code:
E:\>echo BEGIN{FS="|"}$2 in a{next}{print;a[$2]}|awk -f - file
output:
Code:
CD|REF|NAME|Month|Year
A12345N|123123|GALLAGH T|09|2007
D2345AN|345678|FOLLGERS J|09|2007
A12345N|654321|JOHNS A|09|2007
A42345N|078982|VANCLYBURN A|09|2007
 
For a larger file it fails. I had an 80k file and it gave me an out-of-memory error with the DOS-based version of AWK I am running.
If you're in a DOS box under Windows, I don't see how you're running out of memory. I tried this under Windows 2000:
Code:
ruby -e '9000.times{puts "foo@#{ rand(9999) }@bar"}' >junk

awk "BEGIN{FS=\"@\"}{a[$2]}END{for(k in a)c++;print c}" junk
The file "junk" is about 120k and the output of the awk program is 5876. This works with awk, gawk, and mawk.

The awk I used is compiled by Kernighan himself (he's the K in AWK); it can be downloaded from his web site.
 
PHV,
Thank you very much. I tried your code with the file that was dying and it worked like a champ.

I saved your code as ddupe.awk and ran it from the command prompt using: awk -fddupe.awk tst.txt

BEGIN{FS="|"}
$2!=ref{print;ref=$2}

It whittled the file down as expected.

Again thank you.

Doug C
 
futurelet,

I tried your code and it ran to a point and then posted an out-of-memory error.

What I did was save your AWK code:

BEGIN{FS="|"}
$2 in a{next}{print;
a[$2]}

I then saved it as: ddupe2.awk

I then ran it using: awk -fddupe2.awk tst2.txt

It processed a portion of the file and then aborted with the following error msg:

awk ERROR ddupe2.awk: Out of Memory

I am using a DOS AWK version from 1991.

Although I am familiar with piping commands, I was not able to get it to work the way you showed.

I am wondering if the issue is the difference between the two scripts: your code builds an array entry for every distinct REF, whereas the one from PHV only keeps the previous value... Maybe it is the version of AWK.

futurelet, I do want to thank you for your input; I suspect it is either self-inflicted on my end or an awk version issue.

Thanks again.

Dougc
 