I have two files, each in the following format:
Example data in File1:
176, 488, 14, 475, 167, 497, 482, 617, 491, 168, 483
215, 106, 14, 276, 488, 105, 298, 299, 498, 497, 714
.
.
.
Example data in File2:
216, 475, 276, 14, 488, 601, 298, 482, 617, 714, 497
25, 488, 475, 476, 167, 617, 485, 616, 491, 483, 480
.
.
.
File1 might have up to 1000 lines and File2 might have 100000 lines.
I need to compare fields 2-11 of each line in File1 with fields 2-11 of each line in File2. The numbers in the fields will not be in increasing order.
Every time a line in File2 has at least 4 matches in fields 2-11 with a line in File1, print field 1 from File2 on a separate line in a new file (Newfile).
In the above example, lines 216 and 25 from File2 each have at least 4 matches with line 176 in File1, so field 1 from each of them (216 and 25) gets printed.
Now compare the 215 line from File1 with each line in File2.
In File2, line 216 has at least 4 matches, so print field 1 (216), but line 25 has fewer than 4 matches, so print nothing for it.
The resulting printout in Newfile would be:
216,
25,
216,
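For reference, the comparison described above could be sketched in awk roughly as follows. This is only a sketch under some assumptions that are not stated in the post: fields are separated by a comma with optional spaces, the values in fields 2-11 are distinct within a line, lines with fewer than 11 fields can be skipped, and File2 fits in memory. The sample files are written to a scratch directory so the real File1 and File2 are not touched.

```shell
# Work in a scratch directory so existing files are not overwritten.
work=$(mktemp -d) && cd "$work" || exit 1

# Sample data copied from the post.
cat > File1 <<'EOF'
176, 488, 14, 475, 167, 497, 482, 617, 491, 168, 483
215, 106, 14, 276, 488, 105, 298, 299, 498, 497, 714
EOF
cat > File2 <<'EOF'
216, 475, 276, 14, 488, 601, 298, 482, 617, 714, 497
25, 488, 475, 476, 167, 617, 485, 616, 491, 483, 480
EOF

# File2 is read first and kept in memory; File1 is then streamed, so the
# output is grouped by File1 line, as in the Newfile shown in the post.
awk -F'[ \t]*,[ \t]*' '
{ sub(/[ \t\r]+$/, "") }          # drop trailing blanks or CR (assumption)
NF < 11 { next }                  # skip short or empty lines (assumption)
NR == FNR {                       # first file named on the command line: File2
    n++
    id[n] = $1
    for (f = 2; f <= 11; f++)
        val[n, f] = $f
    next
}
{                                 # second file: File1
    split("", in1)                # portable way to empty an array
    for (f = 2; f <= 11; f++)
        in1[$f] = 1               # set of values on this File1 line
    for (j = 1; j <= n; j++) {
        hits = 0
        for (f = 2; f <= 11; f++) {
            v = val[j, f]
            if (v in in1)
                hits++
        }
        if (hits >= 4)
            print id[j] ","       # trailing comma kept to match the post
    }
}' File2 File1 > Newfile

cat Newfile
```

On the sample data this writes 216, then 25, then 216, to Newfile, one per line. With 1000 and 100000 lines the nested loops do on the order of a billion lookups, so a run may take a while; that has not been measured here.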
Ideally I would also like to eliminate all duplicates in Newfile (like the extra 216) if possible; it is OK if that is a separate step.
Any suggestions/help on how to do this in UNIX/awk/etc. would be appreciated.
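For the duplicate removal as a separate step, one possible sketch is an awk one-liner that keeps only the first occurrence of each line and preserves the original order. The output name Newfile.uniq is made up for illustration, and the Newfile contents below are just the three example lines from the post.

```shell
# Work in a scratch directory so an existing Newfile is not overwritten.
work=$(mktemp -d) && cd "$work" || exit 1

# Example Newfile contents taken from the post.
printf '%s\n' '216,' '25,' '216,' > Newfile

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true only for first occurrences, and awk prints those lines.
awk '!seen[$0]++' Newfile > Newfile.uniq

cat Newfile.uniq
```

On the example this leaves 216, and 25, in Newfile.uniq. If the order of the lines does not matter, `sort -u Newfile` is another option; note that it sorts the lines as text unless told otherwise.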