Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Mike Lewis on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

Delete similar records under specific conditions using gawk 2

Status
Not open for further replies.

chino10

Technical User
Jun 23, 2011
16
US
I am completely new to shell scripting but have been assigned the task of creating several batch files to manipulate data in a windows environment. My final task requires me to find rows in a .csv document where the last column is “Y” then delete not only the duplicate(s) but the original as well. There two catches, one is the original ends with “N” and two, there may be rows that are missing the first column, therefore I found the following script to help eliminate the duplicate based on the 2nd, 3rd, 4th, 5th and 6th columns which works great with two exceptions, eliminating the original and keeping the duplicate row(s) if there is no “Y” present (Shown in Desired Output below):

Code:
gawk -F, "!_[$2, $3, $4, $5, $6]++" infile

Below is a sample of the infile data:
"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","N”
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
,”1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"

Note: The 3rd and 4th column are actually duplicates of the 2nd minus the 1st column and the 6th and 7th columns are duplicates of the 5th also minus the 1st column.

Current Output:
"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"
Desired Output:
"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
,”1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"

As you can see the 2nd row starting with “180320” from Current Output is to be eliminated completely and row 9 from sample of infile starting with ,”1” is to remain because there was never a row created with “Y” for the original “180319”. I realize I am asking for quite a bit and I hope I have not confused too much however, it would be greatly appreciated if someone can tell me how to make this happen. Thank you.

 
Try this:


Code:
awk -F, '
        [gray]# accumulate records in r array, update index if first field is present[/gray]
        {
                r[[blue]NR[/blue]]=[blue]$0[/blue]
                [olive]if[/olive] ([blue]$1[/blue]!=[red]"[/red][purple][/purple][red]"[/red]) { ix=[blue]$1[/blue] }
        }
        [gray]# accumulate indices to be skipped[/gray]
        [blue]$NF[/blue] == [red]"[/red][purple]\"Y\"[/purple][red]"[/red] { s[ix] }
        [green]END[/green] {
                [gray]# reprocess r array, skipping indices in s[/gray]
                [olive]for[/olive] (i=1;i<=[blue]NR[/blue];i++) {
                        [b]split[/b](r[i],a,[red]"[/red][purple],[/purple][red]"[/red])
                        [olive]if[/olive] (a[1]!=[red]"[/red][purple][/purple][red]"[/red]) { ix=a[1] }
                        [olive]if[/olive] (!(ix [olive]in[/olive] s)) { [b]print[/b] r[i] }
                }
        }

' inputfile


Annihilannic.
 
Thank you for responding Annihilannic. I attempted to make this work in a dos environment but continue to receive the following error: "The system cannot find the file specified". I am sure it's because I don't have the syntax correct yet but if, by chance, you have any additional tips I would appreciate it. In the mean time I will continue to tweak and let you know.

 
Hi

chino10 said:
I attempted to make this work in a dos environment
DOS and Windows command interpreters usually not handle single quoted strings. So put Annihilannic's code excepting the first and last lines in a separate file, for example called chino10.awk, then execute it like this :
Code:
awk -F, -f chino10.awk inputfile


Feherke.
 
Feherke thank you for responding so quickly. I set it up as you stated but I received the following error:

gawk: -F
gawk: ^ invalid char ‘-‘ in expression

Also, does it matter what the extension is for the "separate" file (i.e., chino_10.awk)?
 
This is the contents of the file I created:

Code:
{
        r[NR]=$0
        if ($1!="") { ix=$1 }
}

$NF == "\"Y\"" { s[ix] }
END {
        for (i=1;i<=NR;i++) {
                split(r[i],a,",")
                if (a[1]!="") { ix=a[1] }
                if (!(ix in s)) { print r[i] }
        }
}
This is the code I'm using to run it:

Code:
gawk –F, -f C:\Users\username\Desktop\Test\GawkScript.txt C:\Users\mikecornejo\Desktop\BatchTests\inputfile.csv
Did I miss something?
 
Hi

The code looks good. But what kind of character is here ? Looks like an en dash :
Code:
gawk [COLOR=red yellow]–[/color]F, -f C:\Users\username\Desktop\Test\GawkScript.txt C:\Users\mikecornejo\Desktop\BatchTests\inputfile.csv


Feherke.
 
Awesome! It looks like that was the problem. I originally put it in MS word and apparently the formatting threw it off. I did it in notepad and worked like a charm. Thank you very much!

Annihilannic thank you very much!

Tek-Tips rocks!
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top