Finding and Removing Duplicate Data in a File 3

Status
Not open for further replies.

wellster34

Programmer
Sep 4, 2001
113
CA
Hello,

I have a few questions about UNIX scripting.

(1). Is there a way to find duplicates in a file?
(2). After finding duplicates, is there a way to weed them out to create a non-duplicate data file?

Example File called test.dat (using 3 characters per line)

123
456
789
ABC
123
DEF
GHI

In this example I want to have in my data file (test.dat):
123
456
789
ABC
DEF
GHI

That is, the second 123 that was found has been removed. This example shows only one duplicate, but I could have more than one. In some cases I had 4.

Any suggestions via UNIX?

Thanks
[dazed]
 
If order is not important use

sort -u infile > outfile

If order is important, use the following awk script.

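# Skip the line if it is already stored in a[]; otherwise append it.
{
for (j=1;j<=ix;j++) if (a[j] == $0) next
ix++
a[ix] = $0
}
# After all input is read, print the stored lines in their original order.
END {
for (j=1;j<=ix;j++) print a[j]
}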

CaKiwi
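
For comparison, a shorter order-preserving variant is sketched below. It is not CaKiwi's script: it swaps the linear scan for an associative-array lookup, and the array name seen and the temporary file names are only illustrative.

```shell
# Recreate the sample test.dat from the original question in a temp directory.
workdir=$(mktemp -d)
printf '%s\n' 123 456 789 ABC 123 DEF GHI > "$workdir/test.dat"

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++ is
# true only for first occurrences; awk prints those lines in input order.
awk '!seen[$0]++' "$workdir/test.dat" > "$workdir/test.dedup"

deduped=$(cat "$workdir/test.dedup")
printf '%s\n' "$deduped"
rm -rf "$workdir"
```

Because the lookup is keyed on the whole line, each input line is checked once instead of being compared against every stored line.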
 
I'm not sure if this is exactly what you're looking for, but try the following command:

sort filename | uniq > output_file

Hope that helps!

John
 
If order is unimportant then

sort -u filename > output_file

is even shorter.

Cheers,
ND [smile]

bigoldbulldog@hotmail.com
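
For question (1), just finding the duplicates, a minimal sketch using uniq -d is shown below. It assumes the sample test.dat from the original post; the temporary directory and variable names are illustrative only.

```shell
# Recreate the sample test.dat from the original question.
dupdir=$(mktemp -d)
printf '%s\n' 123 456 789 ABC 123 DEF GHI > "$dupdir/test.dat"

# uniq only compares adjacent lines, so the file is sorted first.
# -d prints one copy of each line that occurs more than once.
dups=$(sort "$dupdir/test.dat" | uniq -d)

printf '%s\n' "$dups"
rm -rf "$dupdir"
```

Replacing -d with -c would instead prefix every distinct line with the number of times it occurs.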
 
Thanks for the help; this will get me going in the right direction. These commands check the whole record of information, right? That brings me to one other question:

Suppose a record is 10 characters long, but the first 2 characters make a record unique. Is there a way to weed out the duplicates by checking just the first 2 of the 10 characters?

i.e.
1234567890
ABCDEFGHIJ
KLMNOPQRST
1200000000
1211111111

So, in this example the 1200000000 & 1211111111 are duplicates. Is there a way to omit them?
 
This works without any re-sorting

awk '{
key=substr($1,1,2)
if( ! a[key] ) print
a[key]=1
}' < input > output_file

Cheers,
ND [smile]

bigoldbulldog@hotmail.com
 
Hi,

Can you please explain what the code below does?

awk '{
key=substr($1,1,2)
if( ! a[key] ) print
a[key]=1
}' < input > output_file

I have found out that my records are unique on the first 39 characters, so does that mean I should change the 2 to a 39 in this code?

Also, when I ran it against a file that does not contain duplicates, I lost records. Do you know why that is happening?

Thanks
 
I forgot to mention that my data contains double quotes (") and commas (,). Would that affect the commands/logic above? Here is an example of data from my file:

"11111111111","222222","33","44444444",
"22222222222","333333","44","55555555",
"33333333333","444444","55","66666666",
and so on...

 
Quotes and commas should not be a problem, but spaces would be since bigoldbulldog's script uses field 1. Change

key=substr($1,1,2)

to

key=substr($0,1,39)

to match the first 39 characters in a line.

If this doesn't work, post the file that loses records.
CaKiwi
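
To illustrate the point about spaces, here is a small sketch comparing a key taken from $1 with a key taken from $0. The sample lines and the key length of 5 are made up for the demonstration and are not from wellster34's file.

```shell
# Hypothetical data: the first 5 characters contain a space.
spdir=$(mktemp -d)
printf '%s\n' 'AB 01 first' 'AB 02 second' 'AB 01 third' > "$spdir/spaces.dat"

# Key from $1: awk splits on whitespace, so $1 is just "AB" on every line
# and everything after the first line is treated as a duplicate.
by_field=$(awk '{ key = substr($1, 1, 5); if (!a[key]) print; a[key] = 1 }' "$spdir/spaces.dat")

# Key from $0: the first 5 characters of the whole line, spaces included.
by_line=$(awk '{ key = substr($0, 1, 5); if (!a[key]) print; a[key] = 1 }' "$spdir/spaces.dat")

printf 'keyed on $1:\n%s\n' "$by_field"
printf 'keyed on $0:\n%s\n' "$by_line"
rm -rf "$spdir"
```

With the $1 key only the first line survives, while the $0 key keeps the first two lines and drops only the third.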
 
Your array is keyed on the first 2 chars of $1.
So to compare the first 39 chars of $0:

awk '{
key=substr($0,1,39) # load 39 cols from input record
if( ! a[key] ) print $0 # if not in array - print record ($0 isn't necessary - but it's clearer for you)
a[key]=1
}' < input > output_file

HTH ;-)

Dickie Bird
Honi soit qui mal y pense
 
wellster34,

do you really mean "first 39" chars?
Seems like your data is 'well-behaved' and has 4 fields that are ',' separated [CSV-like].

You might want to rethink your sorting criteria:
character length
VS
field content

Just a thought...

vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
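
A minimal sketch of the field-content approach is shown below, assuming the first comma-separated field is what makes a record unique. That assumption, the third sample line, and the file and variable names are all hypothetical.

```shell
# Two lines from the posted sample plus one made-up line repeating field 1.
csvdir=$(mktemp -d)
cat > "$csvdir/records.csv" <<'EOF'
"11111111111","222222","33","44444444",
"22222222222","333333","44","55555555",
"11111111111","999999","00","12345678",
EOF

# -F',' makes awk split on commas, so $1 is the first quoted field.
# Only the first record seen for each value of $1 is printed.
by_first_field=$(awk -F',' '!seen[$1]++' "$csvdir/records.csv")

printf '%s\n' "$by_first_field"
rm -rf "$csvdir"
```

Splitting on a plain comma is only safe while no quoted field contains a comma of its own.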
 
It worked changing the $1 to $0!!! Thank you all!!! [smile]

Unfortunately, I still do not understand why it worked. I'm just happy that it is working.

I understand a lot of the logic, like the substring, the array, and printing the record to the output file if its key is not in the array. Then after that there is the a[key]=1 statement. Is that assigning a value of 1 to the array?
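
On the a[key]=1 question: yes, it stores the value 1 in the array element whose index is key. An element that has never been set is empty, which awk treats as false, so ! a[key] is true only until that key has been marked. The sketch below dumps the array after reading three of the sample records from earlier in the thread; the variable names are illustrative.

```shell
# Mark each 2-character key, then print every index and its stored value.
# "for (k in a)" visits indexes in no fixed order, so the output is sorted.
marks=$(printf '%s\n' 1234567890 ABCDEFGHIJ 1200000000 |
  awk '{ key = substr($0, 1, 2); a[key] = 1 }
       END { for (k in a) print k "=" a[k] }' |
  LC_ALL=C sort)

printf '%s\n' "$marks"
```

Three records produce only two entries, because 1234567890 and 1200000000 share the key 12.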
 