I am completely new to shell scripting but have been assigned the task of creating several batch files to manipulate data in a windows environment. My final task requires me to find rows in a .csv document where the last column is “Y” then delete not only the duplicate(s) but the original as well. There two catches, one is the original ends with “N” and two, there may be rows that are missing the first column, therefore I found the following script to help eliminate the duplicate based on the 2nd, 3rd, 4th, 5th and 6th columns which works great with two exceptions, eliminating the original and keeping the duplicate row(s) if there is no “Y” present (Shown in Desired Output below):
Below is a sample of the infile data:
Note: The 3rd and 4th column are actually duplicates of the 2nd minus the 1st column and the 6th and 7th columns are duplicates of the 5th also minus the 1st column.
Current Output:
As you can see the 2nd row starting with “180320” from Current Output is to be eliminated completely and row 9 from sample of infile starting with ,”1” is to remain because there was never a row created with “Y” for the original “180319”. I realize I am asking for quite a bit and I hope I have not confused too much however, it would be greatly appreciated if someone can tell me how to make this happen. Thank you.
Code:
gawk -F, "!_[$2, $3, $4, $5, $6]++" infile
Below is a sample of the infile data:
"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","N”
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
,”1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"
Note: The 3rd and 4th column are actually duplicates of the 2nd minus the 1st column and the 6th and 7th columns are duplicates of the 5th also minus the 1st column.
Current Output:
Desired Output:"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"
"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
,”1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"
As you can see the 2nd row starting with “180320” from Current Output is to be eliminated completely and row 9 from sample of infile starting with ,”1” is to remain because there was never a row created with “Y” for the original “180319”. I realize I am asking for quite a bit and I hope I have not confused too much however, it would be greatly appreciated if someone can tell me how to make this happen. Thank you.