Delete similar records under specific conditions using gawk 2

chino10 · Jun 23, 2011

I am completely new to shell scripting but have been assigned the task of creating several batch files to manipulate data in a windows environment. My final task requires me to find rows in a .csv document where the last column is “Y” then delete not only the duplicate(s) but the original as well. There two catches, one is the original ends with “N” and two, there may be rows that are missing the first column, therefore I found the following script to help eliminate the duplicate based on the 2nd, 3rd, 4th, 5th and 6th columns which works great with two exceptions, eliminating the original and keeping the duplicate row(s) if there is no “Y” present (Shown in Desired Output below):

Code:

gawk -F, "!_[$2, $3, $4, $5, $6]++" infile

Below is a sample of the infile data:

"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","N”
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
,"3","5.1","1Z0E947E0357735643","20.43","UPAC","Y"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
,”1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"

Note: The 3rd and 4th column are actually duplicates of the 2nd minus the 1st column and the 6th and 7th columns are duplicates of the 5th also minus the 1st column.

Current Output:

"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180320","3","5.1","1Z0E947E0357735643","20.43","UPAC","N"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"

Desired Output:

"180321","1","1.7","1Z0E947E0357465677","6.81","UPAC","N"
"180319","1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
,”1","1.7","1Z0E947E0356058038","6.26","UPAC","N"
"180318","1","1.7","1Z0E947E0355356628","6.81","UPAC","N"

As you can see the 2nd row starting with “180320” from Current Output is to be eliminated completely and row 9 from sample of infile starting with ,”1” is to remain because there was never a row created with “Y” for the original “180319”. I realize I am asking for quite a bit and I hope I have not confused too much however, it would be greatly appreciated if someone can tell me how to make this happen. Thank you.

Annihilannic · Jun 30, 2011

Try this:

Code:

awk -F, '
        [gray]# accumulate records in r array, update index if first field is present[/gray]
        {
                r[[blue]NR[/blue]]=[blue]$0[/blue]
                [olive]if[/olive] ([blue]$1[/blue]!=[red]"[/red][purple][/purple][red]"[/red]) { ix=[blue]$1[/blue] }
        }
        [gray]# accumulate indices to be skipped[/gray]
        [blue]$NF[/blue] == [red]"[/red][purple]\"Y\"[/purple][red]"[/red] { s[ix] }
        [green]END[/green] {
                [gray]# reprocess r array, skipping indices in s[/gray]
                [olive]for[/olive] (i=1;i<=[blue]NR[/blue];i++) {
                        [b]split[/b](r[i],a,[red]"[/red][purple],[/purple][red]"[/red])
                        [olive]if[/olive] (a[1]!=[red]"[/red][purple][/purple][red]"[/red]) { ix=a[1] }
                        [olive]if[/olive] (!(ix [olive]in[/olive] s)) { [b]print[/b] r[i] }
                }
        }

' inputfile

Annihilannic.

chino10 · Jul 4, 2011

Thank you for responding Annihilannic. I attempted to make this work in a dos environment but continue to receive the following error: "The system cannot find the file specified". I am sure it's because I don't have the syntax correct yet but if, by chance, you have any additional tips I would appreciate it. In the mean time I will continue to tweak and let you know.

feherke · Jul 4, 2011

Hi

chino10 said:
I attempted to make this work in a dos environment

DOS and Windows command interpreters usually not handle single quoted strings. So put Annihilannic's code excepting the first and last lines in a separate file, for example called chino10.awk, then execute it like this :

Code:

awk -F, -f chino10.awk inputfile

Feherke.

http://free.rootshell.be/~feherke/

chino10 · Jul 4, 2011

Feherke thank you for responding so quickly. I set it up as you stated but I received the following error:

gawk: -F
gawk: ^ invalid char ‘-‘ in expression

Also, does it matter what the extension is for the "separate" file (i.e., chino_10.awk)?

feherke · Jul 4, 2011

Hi

chino10 said:
put Annihilannic's code [red]excepting the first and last lines[/red] in a separate file

Sounds like you missed the words in [red]red[/red].

Regarding the extension, it can be anything, or even nothing.

Feherke.

http://free.rootshell.be/~feherke/

chino10 · Jul 4, 2011

This is the contents of the file I created:

Code:

{
        r[NR]=$0
        if ($1!="") { ix=$1 }
}

$NF == "\"Y\"" { s[ix] }
END {
        for (i=1;i<=NR;i++) {
                split(r[i],a,",")
                if (a[1]!="") { ix=a[1] }
                if (!(ix in s)) { print r[i] }
        }
}

This is the code I'm using to run it:

Code:

gawk –F, -f C:\Users\username\Desktop\Test\GawkScript.txt C:\Users\mikecornejo\Desktop\BatchTests\inputfile.csv

Did I miss something?

feherke · Jul 4, 2011

Hi

The code looks good. But what kind of character is here ? Looks like an en dash :

Code:

gawk [COLOR=red yellow]–[/color]F, -f C:\Users\username\Desktop\Test\GawkScript.txt C:\Users\mikecornejo\Desktop\BatchTests\inputfile.csv

Feherke.

http://free.rootshell.be/~feherke/

chino10 · Jul 4, 2011

Awesome! It looks like that was the problem. I originally put it in MS word and apparently the formatting threw it off. I did it in notepad and worked like a charm. Thank you very much!

Annihilannic thank you very much!

Tek-Tips rocks!

Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

Delete similar records under specific conditions using gawk 2

chino10

Technical User

Annihilannic

MIS

chino10

Technical User

feherke

Programmer

chino10

Technical User

feherke

Programmer

chino10

Technical User

feherke

Programmer

chino10

Technical User

Similar threads

Part and Inventory Search

Sponsor