delete "almost" duplicate

Status
Not open for further replies.

PrinceOfDervock

Technical User
Sep 21, 2010
1
US
Hi,

I am a novice awk user; more precisely, I use mawk on Windows XP.

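I am already familiar with the following patterns for deleting duplicate records. Each one keeps the first occurrence and drops later ones:

!arr[$0]++       drops a record whose whole line has already been seen
!arr[$2]++       drops a record whose field 2 has already been seen
!arr[$2,$3]++    drops a record whose fields 2 and 3 have already been seen together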

The third one I use a lot.

Let me describe the kind of files I usually have to deal with.
They are ASCII files containing real-world map coordinates for data points, where field 1 is the point identifier, field 2 is the easting (X coordinate), field 3 is the northing (Y coordinate) and field 4 is the elevation.

Example:

point1 1234567.123 1234568.123 12345.123
point2 1234567.123 1234568.123 12345.123
point3 1234577.123 1234568.123 12345.123
point4 1234577.123 1234568.123 12345.123
point5 1234576.999 1234568.999 12345.123
.
.
point1000000 1234566.999 1234568.999 12345.123

Relative to fields 2 and 3 in the example above, you can see that:

point2 is an exact duplicate of point1
point4 is an exact duplicate of point3

The code !arr[$2,$3]++ will take care of this scenario.
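
For reference, here is a minimal sketch of that exact-duplicate filter as a complete command, written as a POSIX shell function. The function name dedup_xy and the array name seen are only illustrative. Note that the array key is the text of the two fields, so 1234567.1 and 1234567.10 would count as different values.

```shell
# dedup_xy [FILE...]: keep only the first record for each distinct
# (field 2, field 3) pair; reads standard input when no file is given.
dedup_xy() {
    awk '!seen[$2,$3]++' "$@"
}
```

Run on the first five sample records, this should keep point1, point3 and point5, because point5 is close to point3 but not identical to it. With mawk on Windows it is probably easiest to put the one-line program in a file and run it with mawk -f, which avoids the quoting differences between shells.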

What I would really like to do is:

As well as deleting these exact duplicates, I would like to delete "almost duplicates" (you could call them "nearest neighbors") that lie within a user-defined distance of an earlier point.

By this I mean, if using a user-defined distance of 1:
I would like to also delete point5 because its X and Y are within 1 of point4's X and Y.
I would like to also delete point1000000 because its X and Y are within 1 of point1's X and Y.

Thanking you in advance,
Kenny.
 
I think the simplest way would be to reduce the precision of the coordinates as you store them in the array. awk arrays are all associative (i.e. indexed by strings, not numbers), so there is no way to ask for the indices that fall within a numeric range; with full-precision indices you would have to scan the entire array each time to check whether any point was within the distance. However, if you chop off the fractional component, there is only a finite set of indices to look up for each coordinate (in your example, n-1, n and n+1).
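
A minimal sketch of that idea, under some assumptions of mine: "within distance d" is taken to mean that X and Y each differ by at most d (a square window rather than a circle), the distance is passed in as an awk variable named d, and the first point seen in any neighbourhood is the one kept. The names neardup, keptx and kepty are made up for illustration. Each point is assigned to a grid cell d wide, so only the point's own cell and its 8 neighbours need to be looked up.

```shell
# The awk program is held in a shell variable; the same text can be saved as
# neardup.awk and run with: mawk -v d=1 -f neardup.awk points.txt
NEARDUP_PROG='
function floor(v,   i) { i = int(v); return (i > v) ? i - 1 : i }
function abs(v)        { return (v < 0) ? -v : v }

BEGIN { if (d <= 0) d = 1 }    # d: user-defined distance (assumed variable name)

{
    # Grid cell of this point. Cells are d wide, so a kept point within d on
    # both axes must sit in this cell or in one of its 8 neighbours.
    cx = floor($2 / d); cy = floor($3 / d)
    for (i = cx - 1; i <= cx + 1; i++)
        for (j = cy - 1; j <= cy + 1; j++)
            if ((i, j) in keptx &&
                abs($2 - keptx[i, j]) <= d && abs($3 - kepty[i, j]) <= d)
                next               # exact or near duplicate: drop the record
    keptx[cx, cy] = $2             # nothing close enough was kept: keep this one
    kepty[cx, cy] = $3
    print
}'

# neardup DIST [FILE...]: print records that are not within DIST (per axis)
# of an earlier kept record.
neardup() {
    dist=$1; shift
    awk -v d="$dist" "$NEARDUP_PROG" "$@"
}
```

With the sample records and a distance of 1, this should keep point1 and point3 and drop point2, point4, point5 and point1000000. Two points in the same cell are always within d of each other, so storing one kept point per cell is enough. If a true straight-line distance is wanted instead, the two abs() tests could be replaced by a comparison of the squared distance against d*d.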



Annihilannic.
 