PrinceOfDervock
Technical User
Hi,
I am a novice awk user.
Actually I use mawk on Windows XP.
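I am already familiar with the following idioms for deleting duplicate records. Each keeps the first occurrence and deletes later ones:
!arr[$0]++ deletes a record if the whole line matches an earlier one
!arr[$2]++ deletes a record if field two matches
!arr[$2,$3]++ deletes a record if fields two and three both match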
The third one I use a lot.
Let me describe the kind of files I usually have to deal with.
They are ASCII files containing real-world map coordinates for data points, where field 1 is the point identifier, field 2 is the easting (X coordinate), field 3 is the northing (Y coordinate) and field 4 is the elevation.
Example:
point1 1234567.123 1234568.123 12345.123
point2 1234567.123 1234568.123 12345.123
point3 1234577.123 1234568.123 12345.123
point4 1234577.123 1234568.123 12345.123
point5 1234576.999 1234568.999 12345.123
.
.
point1000000 1234566.999 1234568.999 12345.123
Relative to fields 2 and 3 in the example above, you can see that:
point2 is an exact duplicate of point1
point4 is an exact duplicate of point3
The code !arr[$2,$3]++ will take care of this scenario.
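To make the starting point concrete, here is a small sketch that runs that idiom over the sample records. The file name points.txt is only a placeholder, and the quoting assumes a Unix-style shell; under the Windows command prompt the program would normally go in a file passed with -f instead.

```shell
# Sample records from the example above (the elided rows are left out).
cat > points.txt <<'EOF'
point1 1234567.123 1234568.123 12345.123
point2 1234567.123 1234568.123 12345.123
point3 1234577.123 1234568.123 12345.123
point4 1234577.123 1234568.123 12345.123
point5 1234576.999 1234568.999 12345.123
point1000000 1234566.999 1234568.999 12345.123
EOF

# Print a record only the first time its (X, Y) pair is seen.
awk '!arr[$2,$3]++' points.txt
```

On these six records it prints point1, point3, point5 and point1000000: the exact duplicates go, but the two near neighbours stay.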
What I would really like to do is:
As well as deleting these exact duplicates, I would like to delete "almost duplicates" (you could call them "nearest neighbors") that fall within a user-defined distance.
By this I mean, with a user-defined distance of 1:
I would also like to delete point5 because its X and Y are within 1 of point4's X and Y.
I would also like to delete point1000000 because its X and Y are within 1 of point1's X and Y.
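One possible way to express this in awk is sketched below. It is only a sketch under stated assumptions, not a definitive solution. It reads "within 1" as "both the X difference and the Y difference are at most d", it keeps the first point it meets and drops later ones that are close to an already kept point, and it buckets kept points into a grid of d-by-d cells so each record is checked against 9 cells rather than every earlier record. The names neardup.awk, points.txt, d, kx, ky, floor and abs are all made up for this illustration.

```shell
cat > neardup.awk <<'EOF'
# floor() is needed because int() truncates toward zero, which would
# make the cells around 0 twice as wide for negative coordinates.
function floor(v,   t) { t = int(v); return (t > v) ? t - 1 : t }
function abs(v) { return v < 0 ? -v : v }

BEGIN { if (d <= 0) d = 1 }    # d = user-defined distance, default 1

{
    cx = floor($2 / d); cy = floor($3 / d)    # grid cell of this point
    # A kept point within d on both axes must sit in this cell or in
    # one of its 8 neighbours, so only those 9 cells are examined.
    for (i = cx - 1; i <= cx + 1; i++)
        for (j = cy - 1; j <= cy + 1; j++)
            if ((i, j) in kx && abs($2 - kx[i, j]) <= d && abs($3 - ky[i, j]) <= d)
                next                          # near or exact duplicate: drop
    # Two kept points can never share a cell under this test, so one
    # stored point per cell is enough.
    kx[cx, cy] = $2; ky[cx, cy] = $3
    print
}
EOF

cat > points.txt <<'EOF'
point1 1234567.123 1234568.123 12345.123
point2 1234567.123 1234568.123 12345.123
point3 1234577.123 1234568.123 12345.123
point4 1234577.123 1234568.123 12345.123
point5 1234576.999 1234568.999 12345.123
point1000000 1234566.999 1234568.999 12345.123
EOF

awk -v d=1 -f neardup.awk points.txt
```

On the six sample records this prints only point1 and point3. The heredocs are just a Unix-shell convenience; on Windows the program can be saved as a file and run the same way with mawk -v d=1 -f neardup.awk points.txt. The result depends on record order, since each record is compared only with points that were kept. A true straight-line distance would need the test changed to compare dx*dx + dy*dy with d*d, and then more than one kept point could fall in the same cell.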
Thanking you in advance,
Kenny.