Finding and Removing Duplicate Data in a File

wellster34 · Nov 7, 2002

Hello,

I have a few questions about UNIX scripting.

(1). Is there a way to find duplicates in a file?
(2). After finding duplicates, is there a way to weed them out to create a non-duplicate data file?

Example File called test.dat (using 3 characters per line)

123
456
789
ABC
123
DEF
GHI

In this example I want to have in my data file (test.dat):
123
456
789
ABC
DEF
GHI

That is, the second 123 that was found gets removed. This example shows only one duplicate, but I could have more than one. In some cases I had 4.

Any suggestions via UNIX?

Thanks
[dazed]

CaKiwi · Nov 7, 2002

If order is not important use

sort -u infile > outfile

If order is important, use the following awk script.

{
for (j=1;j<=ix;j++) if (a[j] == $0) next
ix++
a[ix] = $0
}
END {
for (j=1;j<=ix;j++) print a[j]
}

CaKiwi
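The same first-occurrence-wins idea is often written as a single awk pattern backed by an associative array, which avoids rescanning the saved lines for every record. This is only a sketch, not something posted in the thread; it assumes the whole line is the key, and the names workdir, test.dedup and seen are made up for illustration.

```shell
# Scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)

# Recreate the sample test.dat from the original question.
printf '%s\n' 123 456 789 ABC 123 DEF GHI > "$workdir/test.dat"

# seen[$0]++ yields 0 (false) the first time a line appears and a
# non-zero count afterwards, so !seen[$0]++ is true only for first
# occurrences. With no action given, awk prints the matching line.
awk '!seen[$0]++' "$workdir/test.dat" > "$workdir/test.dedup"

cat "$workdir/test.dedup"
```

The output keeps the input order and drops only the later copies, however many there are.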

johngiggs · Nov 7, 2002

I'm not sure if this is exactly what you're looking for, but try the following command:

sort filename | uniq > output_file

Hope that helps!

John

bigoldbulldog · Nov 7, 2002

If order is unimportant then

sort -u filename > output_file

is even shorter.

Cheers,
ND [smile]

bigoldbulldog@hotmail.com
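The sort-based commands above answer question (2). For question (1), just listing which lines are duplicated, a possible sketch uses uniq -d, which prints one copy of each line that is repeated in sorted input. The file and variable names (workdir, dups) are placeholders, and the original order is not preserved.

```shell
# Scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)

# Sample test.dat from the original question.
printf '%s\n' 123 456 789 ABC 123 DEF GHI > "$workdir/test.dat"

# uniq only compares adjacent lines, so the input has to be sorted first.
# -d keeps one copy of each line that occurs more than once.
dups=$(sort "$workdir/test.dat" | uniq -d)

echo "$dups"
```

Replacing -d with -c prints every distinct line prefixed by its count instead, which helps when a line may appear four or more times.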

wellster34 · Nov 7, 2002

Thanks for the help, this will get me going in the right direction. These commands check the whole record of information, right? That brings me to one other question:

Suppose a record is 10 characters long but the first 2 characters make it unique. Is there a way to sort out the duplicates by checking just the first 2 of the 10 characters?

i.e.
1234567890
ABCDEFGHIJ
KLMNOPQRST
1200000000
1211111111

So, in this example the 1200000000 & 1211111111 are duplicates. Is there a way to omit them?

wellster34 · Nov 7, 2002

By the way, order is important.

bigoldbulldog · Nov 7, 2002

This works without any re-sorting

awk '{
key=substr($1,1,2)
if( ! a[key] ) print
a[key]=1
}' < input > output_file

Cheers,
ND [smile]

bigoldbulldog@hotmail.com

wellster34 · Nov 8, 2002

Hi,

Can you please explain what the code below does?

awk '{
key=substr($1,1,2)
if( ! a[key] ) print
a[key]=1
}' < input > output_file

I have found out that my records are unique on the first 39 characters, so does that mean I should change the 2 in this code to 39?

Also, when I ran it against a file that does not contain duplicates, I lost records. Do you know why that is happening?

Thanks

wellster34 · Nov 8, 2002

I forgot to mention that my data contains double quotes (") and commas (,). Would that impact the commands/logic above? Here is an example of data from my file:

"11111111111","222222","33","44444444",
"22222222222","333333","44","55555555",
"33333333333","444444","55","66666666",
and etc...

CaKiwi · Nov 8, 2002

Quotes and commas should not be a problem, but spaces would be since bigoldbulldog's script uses field 1. Change

key=substr($1,1,2)

to

key=substr($0,1,39)

to match the first 39 characters in a line.

If this doesn't work, post the file that loses records.
CaKiwi
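To see why spaces matter, here is a small sketch on made-up data (not wellster34's file, which was never posted). It assumes a 5-character key and compares building the key from $1 with building it from $0. The names workdir, field_out and line_out are placeholders.

```shell
# Scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)

# Hypothetical records whose first 5 characters contain a space.
cat > "$workdir/input" <<'EOF'
AB 01 first
AB 02 second
AB 01 repeat
EOF

# $1 stops at the first space, so every key is just "AB" and only the
# first record survives, even though "AB 02" is not a duplicate.
field_out=$(awk '{
    key = substr($1, 1, 5)
    if (!a[key]) print
    a[key] = 1
}' "$workdir/input")

# $0 is the whole line, so the keys are "AB 01" and "AB 02" and only
# the genuine repeat is dropped.
line_out=$(awk '{
    key = substr($0, 1, 5)
    if (!a[key]) print
    a[key] = 1
}' "$workdir/input")

printf 'using $1:\n%s\n\nusing $0:\n%s\n' "$field_out" "$line_out"
```

A key that is too short has the same effect with either variable: records that differ only after the key are treated as duplicates and dropped.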

dickiebird · Nov 8, 2002

Your array is keyed on the first 2 chars of $1.
So to compare the first 39 chars:

awk '{
key=substr($0,1,39) #load 39 cols from input record
if( ! a[key] ) print $0 # if not in array - print record ($0 isn't necessary - but it's clearer for you)
a[key]=1
}' < input > output_file

HTH ;-)

Dickie Bird
Honi soit qui mal y pense

vgersh99 · Nov 8, 2002

wellster34,

do you really mean "first 39" chars?
Seems like your data is 'well-behaved' and has 4 fields that are ',' separated [CSVlike].

You might want to rethink your sorting criteria:
character length
VS
field content

Just a thought...

vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
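A field-based key along these lines might look like the sketch below. It assumes the first four comma-separated fields define uniqueness (in the sample rows they happen to span exactly 39 characters) and that no field contains an embedded comma, since a plain -F',' split does not understand quoting. The fifth field and the names workdir, data.csv, seen and out are invented for illustration.

```shell
# Scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)

# Rows modelled on the sample data, with a hypothetical fifth field
# and one repeated key added for the demonstration.
cat > "$workdir/data.csv" <<'EOF'
"11111111111","222222","33","44444444","first"
"22222222222","333333","44","55555555","second"
"11111111111","222222","33","44444444","repeat"
"33333333333","444444","55","66666666","third"
EOF

# Build the key from fields 1-4 (FS is the comma set by -F) and print
# a record only the first time its key is seen, preserving input order.
out=$(awk -F',' '{
    key = $1 FS $2 FS $3 FS $4
    if (!seen[key]++) print
}' "$workdir/data.csv")

echo "$out"
```

Unlike a fixed character count, this keeps working if a field changes width, at the cost of depending on the delimiter.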

wellster34 · Nov 8, 2002

It worked changing the $1 to $0!!! Thank you all!!! [smile]

Unfortunately, I still do not understand why it worked. I'm just happy that it is working.

I understand a lot of the logic like the substring, array, not in the array print the record to the output file. Then after that there is the a[key]=1 statement. Is that assigning a value of 1 to the array?
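On the a[key]=1 question: yes, it stores 1 in the array element named by key. awk arrays are associative, and an element that has never been assigned is empty, which awk treats as false. The trace below is a sketch using the 10-character sample records from earlier in the thread; the names workdir, trace and state are made up, and it prints what the script decides for each record rather than the record itself.

```shell
# Scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)

# The 10-character sample records from earlier in the thread.
printf '%s\n' 1234567890 ABCDEFGHIJ KLMNOPQRST 1200000000 1211111111 \
    > "$workdir/input"

trace=$(awk '{
    key = substr($0, 1, 2)
    # a[key] is empty (false) until the assignment below has run for
    # this key, so the first record with a given key counts as new.
    state = a[key] ? "seen before, skipped" : "new, printed"
    print key ": " state
    # Mark the key as seen; any later record with the same key now
    # finds a true value in a[key].
    a[key] = 1
}' "$workdir/input")

echo "$trace"
```

The value 1 itself is arbitrary; any non-zero, non-empty value would mark the key as seen.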
