delete "almost" duplicate

PrinceOfDervock · Sep 21, 2010

Hi,

I am a novice awk user.
Actually I use mawk on Windows XP.

I am already familiar with the following code for deleting duplicate records.

!arr[$0]++ deletes a record if the whole line matches an earlier record
!arr[$2]++ deletes a record if field two matches an earlier record
!arr[$2,$3]++ deletes a record if fields two and three both match an earlier record

The third one I use a lot.

Let me describe the kind of files I usually have to deal with.
They are ASCII files containing real-world map coordinates for data points.
Field 1 is the point_identifier, field 2 is the easting (X coordinate), field 3 is the northing (Y coordinate), and field 4 is the elevation.

Example:

point1 1234567.123 1234568.123 12345.123
point2 1234567.123 1234568.123 12345.123
point3 1234577.123 1234568.123 12345.123
point4 1234577.123 1234568.123 12345.123
point5 1234576.999 1234568.999 12345.123
.
.
point1000000 1234566.999 1234568.999 12345.123

Relative to fields 2 and 3 in the example above, you can see that:

point2 is an exact duplicate of point1
point4 is an exact duplicate of point3

The code !arr[$2,$3]++ will take care of this scenario.
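
For reference, here is a minimal sketch of that idiom run against the first five sample records. The file name points.txt is an assumption made for illustration. The quoting shown is for a POSIX shell; on the Windows command prompt the program would need double quotes, or could be kept in a separate file and passed with -f.

```shell
# Sample records from the example above (first five points only).
printf '%s\n' \
  'point1 1234567.123 1234568.123 12345.123' \
  'point2 1234567.123 1234568.123 12345.123' \
  'point3 1234577.123 1234568.123 12345.123' \
  'point4 1234577.123 1234568.123 12345.123' \
  'point5 1234576.999 1234568.999 12345.123' > points.txt

# arr[$2,$3] counts how often each (X, Y) pair has been seen;
# the pattern is true (and the record printed) only on the first sighting.
awk '!arr[$2,$3]++' points.txt
```

This prints the point1, point3 and point5 records: point5 survives because its X and Y strings are not identical to any earlier pair.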

What I would really like to do is:

As well as deleting these exact duplicates, I would like to delete "almost duplicates" (you could call them "nearest neighbors") that lie within a user-defined distance.

By this I mean that, with a user-defined distance of 1:
I would also like to delete point5 because its X and Y are within 1 of point4's X and Y.
I would also like to delete point1000000 because its X and Y are within 1 of point1's X and Y.

Thanking you in advance,
Kenny.

Annihilannic · Oct 21, 2010

I think the simplest way would be to reduce the precision of the coordinates as you store them in the array. awk arrays are all associative (i.e. indexed by strings, not numbers), so with full-precision keys you would have to scan every index in the array each time to check whether any stored point fell within the range. However, if you chopped off the fractional component, there would at least be a finite set of keys to look up for each coordinate (in your example, n-1, n and n+1).
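
A rough sketch of that truncate-and-check-neighbours idea follows; it has only been reasoned through against the sample records, not run on real survey data. It assumes "within distance d" means that both the X and the Y differences are at most d, as in the question. The names near_dedupe.awk, points.txt, floor, abs, keptx and kepty are made up for illustration. Each kept point is stored under its truncated cell (X/d, Y/d), and each new record is compared only with kept points in the 3 x 3 block of cells around it.

```shell
# Sample records, including the far-away line number from the question.
printf '%s\n' \
  'point1 1234567.123 1234568.123 12345.123' \
  'point2 1234567.123 1234568.123 12345.123' \
  'point3 1234577.123 1234568.123 12345.123' \
  'point4 1234577.123 1234568.123 12345.123' \
  'point5 1234576.999 1234568.999 12345.123' \
  'point1000000 1234566.999 1234568.999 12345.123' > points.txt

cat > near_dedupe.awk <<'EOF'
# Hypothetical helpers: awk has no built-in floor() or abs().
function floor(v,    i) { i = int(v); return (i > v) ? i - 1 : i }
function abs(v) { return v < 0 ? -v : v }

{
    # Cell that this point falls into when coordinates are chopped to d.
    cx = floor($2 / d); cy = floor($3 / d)
    dup = 0
    # Look only at the cell itself and its eight neighbours.
    for (i = cx - 1; i <= cx + 1 && !dup; i++)
        for (j = cy - 1; j <= cy + 1 && !dup; j++)
            if ((i, j) in keptx &&
                abs($2 - keptx[i, j]) <= d && abs($3 - kepty[i, j]) <= d)
                dup = 1
    if (!dup) {
        # Two points in one cell always differ by less than d on each
        # axis, so a cell never holds more than one kept point.
        keptx[cx, cy] = $2; kepty[cx, cy] = $3
        print
    }
}
EOF

# d is the user-defined distance.
awk -v d=1 -f near_dedupe.awk points.txt
```

With d=1 this prints only the point1 and point3 records. The result depends on input order, since the first point seen in a neighbourhood is the one that is kept. For a true straight-line distance, each cell would need to hold a list of points rather than a single one.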

Annihilannic.
