
find difference between two files (added, removed, changed)

Status
Not open for further replies.

entrylevel

Technical User
Nov 29, 2001
46
CA
I read some of the posts here, but still could not come up with a good, smart logic for my case using awk.

The goal is to find the differences between two files (the added lines, the removed lines and the changed lines). The first column holds unique entries, while the 2nd column may contain duplicates.

file1

1.1 bj
2.2 hk
3.3 sj
4.4 tw
5.5 ch
6.6 kr
7.7 sg
10.10 sj
11.11 fr

file2

2.2 hk
3.3 uk
4.4 tw
5.5 sd
7.7 sg
8.8 in
9.9 ca
10.10 sj
11.11 fr

The desired output would be (appended to a file):

following lines have been removed

1.1 bj
6.6 kr

following lines have been added

8.8 in
9.9 ca

following lines have been changed

3.3 from sj to uk
5.5 from ch to sd

I used diff in a script to compare and get that info. It is easy only for the removed and added items, and not so smart with the changed items. I would like some ideas on how to use awk to accomplish it. Please inspire. Thanks.

Regards!
 
I'd parse the output of diff before writing my own file comp routine.
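
A minimal sketch of that idea, assuming the default diff output format, where lines taken from file1 start with "<" and lines taken from file2 start with ">". The array names (was, now, oldkey, newkey) are made up for illustration. A key seen only on "<" lines is treated as removed, a key seen only on ">" lines as added, and a key seen on both with different values as changed.

```shell
# Sample data from the question, written to a scratch directory.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1.1 bj' '2.2 hk' '3.3 sj' '4.4 tw' '5.5 ch' '6.6 kr' '7.7 sg' '10.10 sj' '11.11 fr' > file1
printf '%s\n' '2.2 hk' '3.3 uk' '4.4 tw' '5.5 sd' '7.7 sg' '8.8 in' '9.9 ca' '10.10 sj' '11.11 fr' > file2

out=$(diff file1 file2 | awk '
# On "<" and ">" lines, $2 is the key and $3 is the value.
/^</ { was[$2] = $3; oldkey[++no] = $2 }
/^>/ { now[$2] = $3; newkey[++nn] = $2 }
END {
    print "following lines have been removed"
    for (i = 1; i <= no; i++) { k = oldkey[i]; if (!(k in now)) print k, was[k] }
    print "following lines have been added"
    for (i = 1; i <= nn; i++) { k = newkey[i]; if (!(k in was)) print k, now[k] }
    print "following lines have been changed"
    for (i = 1; i <= no; i++) {
        k = oldkey[i]
        if ((k in now) && was[k] != now[k]) print k " from " was[k] " to " now[k]
    }
}')
printf '%s\n' "$out"
```

The keys are remembered in the order diff reports them, so the report does not depend on the unspecified order of for (k in array).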
 
I want to load these two files into arrays and then compare them:

file1 to array

fb[file1,FNR]=$1
fn[file1,$1]=$2
f1[$1]=$2

file2 to array

fb[file2,FNR]=$1
fn[file2,$1]=$2
f2[$1]=$2

Please let me know whether the flow below can load everything or not. Thanks.

BEGIN {FS = OFS = "\t"}

fb[FILENAME,FNR] = $1
fn[FILENAME,$1] = $2
NR==FNR { f1[$1] = $2; next }
f2[$1]=$2; next

END{ if ....... }

Regards!
 
I have one..

Code:
awk -v f=list 'BEGIN { while (getline <f) Seen[$0] = 1; close(f) } { if (Seen[$0]) delete Seen[$0]; else print "New:", $0 } END { print "-- END --"; for (i in Seen) print "Not seen..", i }' list2

But this is not enough for version-like comparing. The rewrite I did of my older but effective dpkg-diff.awk may be exactly what you're looking for. Have fun =)

Code:
#!/usr/bin/awk -f

# dpkg-diff.awk - compare and print differences in the installed packages of two dpkg status files
# to line-diff.awk - as example for entryuser@tek, format <(N.N) (.*)>, line number, content
# by xmb - localhack
# xmb<@skilled.ch>

# status in the array:
# 1 = installed in the reference file, but not in the other one
# 2 = both matching
# 3 = newly installed on the other status file
# 4 = different content
# decided to use additionally a plain index for faster processing of the END block

BEGIN {
    if (! file) file = "/var/lib/dpkg/status"

    if (ARGC <= 1 && ! stdin) {
        print "Usage: dpkg-diff.awk [-v file=reference-file] <-v stdin=1|other file>"
        print "use -v stdin=1 to pipe instead of specifying a file (like the /var/backups .gz ones)"
        exit 1
    }

    if (stdin) # hmm
        ARGC = 1
    else
        ARGC = 2

    while (getline < file) {
        line = get_line()
        Package[line]  = 1
        Package[line, "v"] = $0
        idx = idx line " "
    } close(file)
}

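# Return the first field (the key) and leave only the rest of the line in $0.
function get_line(  tmp) {
    tmp = $1
    $1 = ""
    gsub(/^[ \t]+|[ \t]+$/, "")   # trim leading and trailing whitespace
    return tmp
}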

#$1 == "Package:" {
{
    #line = $1; gsub(/^[ \t]*|[ \t]*/, "")
    #line = get_line()
    #get_line(); str = R[1] SUBSEP R[2]
    line = get_line()

    if ($1) { # useless, visual extra check, /install ok/
        if (Package[line]) {
            if (Package[line, "v"] == $0) Package[line] = 2
            else {
                Res[4, line] = $0
                Package[line] = 4
            } # ..
        } else {
            Packages[line] = 3
            idx = idx line " "
        }
    }

    if (! valid_file)
        valid_file = 1
}

END {
    if (! valid_file) {
        print "Invalid file specified"
        exit 1
    }

    # 'iin' instead of 'in' cause 'in' is an awk keyword
    split(idx, Idx)
    while (Idx[++i]) {
        if (Packages[Idx[i]] == 1)
            Res[1, ++iin] = Idx[i]
        else if (Packages[Idx[i]] == 3)
            Res[3, ++out] = Idx[i]
        else if (Package[Idx[i]] == 4)
            Res[4, ++xtra] = Idx[i]
    } iin = out = xtra = 0

    while (Res[1, ++iin])
        printf "+ %s\n", Res[1, iin]

    while (Res[3, ++out])
        printf "- %s\n", Res[3, out]

    while (Res[4, ++xtra])
        printf "~:%-5g %s -> %s\n", Res[4, xtra], Package[Res[4, xtra], "v"], Res[4, Res[4, xtra]]

    if (! Res[1, 1] && ! Res[3, 1] && ! Res[4, 1])
    #if (iin == out == xtra == 0)
        print "No differences found."
}
---

run as awk -f line-diff.awk -v file=file1 file2
---

xmb (gp:23:1)~/awk $ awk -f line-diff.awk -v file=f1 f2
- 8.8
- 9.9
~:3.3   sj -> uk
~:5.5   ch -> sd

. Mac for productivity
.. Linux for developement
... Windows for solitaire
 
Hi xmb,

Thanks for the post. I need some time to digest your code above; however, when I cut and pasted the code and ran it, I got:

# awk -f awkf1 -v file=file1 file2
awk: syntax error near line 30
awk: illegal statement near line 30
awk: syntax error near line 32
awk: illegal statement near line 32
awk: syntax error near line 37
awk: bailing out near line 37

I believe that if I study the code above, it will be easier for me to get the complete result, including the removed items. Thanks for the idea.

Regards!
 
Np. Gaah, I know why people would not include command-line copy-paste-ready code blocks. That's not related to this code, though. I'm working on a fully working version.

 
(That it doesn't print additions is a bug.)

To make it work, change idx to use SUBSEP instead of spaces:
idx = idx line SEP

The duplicated loop code can be replaced with this, which also makes the whole thing work in the end, compared with the three loops. The final print loops stay.

Code:
    split(idx, Idx, SEP)
    while (Idx[++i]) {
        if (Package[Idx[i]] != 2) { # print i, Idx[i], Package[Idx[i]]
            Res[Package[Idx[i]], ++Res[4, Package[Idx[i]], "count"]] = Idx[i]
        }
    }

I have an idea about using that algorithm for a flexible, individual comparison engine.
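
A small sketch of why a separator such as SUBSEP can be safer than a space when keys are collected in one string and split later. The two sample keys here are made up and deliberately contain a space, which a default split() would break apart. It assumes the key text never contains the SUBSEP character itself.

```shell
out=$(awk 'BEGIN {
    SEP = SUBSEP                     # "\034" by default
    idx = idx "1.1 bj" SEP           # each key is followed by the separator
    idx = idx "2.2 hk" SEP
    n = split(idx, Idx, SEP)         # the trailing SEP yields one empty last element
    print "elements: " n
    i = 0
    while (Idx[++i] != "")           # stop at the empty trailing element
        print i ": " Idx[i]
}')
printf '%s\n' "$out"
# elements: 3
# 1: 1.1 bj
# 2: 2.2 hk
```

Testing against "" instead of relying on truthiness also keeps a key such as 0 from ending the loop early.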

 
Where 'SEP' is SUBSEP. I'm doing it with SEP = "\xff", but using SUBSEP is fine.

The last piece of code follows: instant printing at compare time, with no speed loss from excessive loops. Cool that I hit on that; nice code. For my part, I find it amazing! It took me some time at the beginning because of a ']' placed too late.
Printing variables in a tidy way requires some definitions, though.

Code:
    Fmt[1] = "+ %s\n"
    Fmt[2] = "= %s\n"
    Fmt[3] = "- %s\n"
    Fmt[4] = "~%-4g %s -> %s\n"
}

function p(idx, a1, a2, a3, a4) { printf Fmt[idx], a1, a2, a3, a4 }

..
    ..
    while (Idx[++i]) {
        if (Package[Idx[i]] != 2) { # print i, Idx[i], Package[Idx[i]]
            #Res[Package[Idx[i]], ++Res[4, Package[Idx[i]], "count"]] = Idx[i]
            # test the status stored in Package[], not the key itself
            if (Package[Idx[i]] == 4)
                p(4, Idx[i], Package[Idx[i], "v"], Res[4, Idx[i]])
            else
                p(Package[Idx[i]], Idx[i])
        }
    }
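
A self-contained sketch of the same "format table" idea, not the author's script: the formats live in an array and printf picks one by name. The table keys ("added", "removed", "changed") and the array names are hypothetical. Additions and changes are printed while file2 is being read; removals can only be known once file2 has been read completely, so they are printed in END.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1.1 bj' '2.2 hk' '3.3 sj' '4.4 tw' '5.5 ch' '6.6 kr' '7.7 sg' '10.10 sj' '11.11 fr' > file1
printf '%s\n' '2.2 hk' '3.3 uk' '4.4 tw' '5.5 sd' '7.7 sg' '8.8 in' '9.9 ca' '10.10 sj' '11.11 fr' > file2

out=$(awk '
BEGIN {
    # Hypothetical format table, one printf format per kind of difference.
    Fmt["added"]   = "+ %s %s\n"
    Fmt["removed"] = "- %s %s\n"
    Fmt["changed"] = "~ %s %s -> %s\n"
}
NR == FNR { ref[$1] = $2; order[++n] = $1; next }    # load the reference file
!($1 in ref) { printf Fmt["added"], $1, $2; next }   # key unknown to the reference
{
    if (ref[$1] != $2) printf Fmt["changed"], $1, ref[$1], $2
    seen[$1] = 1                                     # key exists in both files
}
END {
    for (i = 1; i <= n; i++)
        if (!(order[i] in seen)) printf Fmt["removed"], order[i], ref[order[i]]
}' file1 file2)
printf '%s\n' "$out"
# ~ 3.3 sj -> uk
# ~ 5.5 ch -> sd
# + 8.8 in
# + 9.9 ca
# - 1.1 bj
# - 6.6 kr
```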

 
Thanks for the details; I will spend some time studying xmb's posts above.

Regarding my 2nd post:

file1 to array

fb[file1,FNR]=$1
fn[file1,$1]=$2
f1[$1]=$2

file2 to array

fb[file2,FNR]=$1
fn[file2,$1]=$2
f2[$1]=$2

It seems I cannot do it in either of the ways below:

1.
NR==FNR {f1[$1] = $2; fb[file1,FNR] = $1; fn[file1,$1] = $2; next}
{f2[$1]=$2; fb[file2,FNR] = $1; fn[file2,$1] = $2; next}

or

2.
fb[FILENAME,FNR] = $1
fn[FILENAME,$1] = $2
NR==FNR {f1[$1] = $2; next}
{f2[$1]=$2; next}

I think I still have not mastered how awk processes the data flow.
If somebody could comment, I would appreciate it.

Regards!
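
A possible explanation, offered as a sketch rather than a definitive diagnosis. In attempt 1, file1 and file2 are unquoted, so awk reads them as uninitialized variables (empty strings) and the subscripts of both files collide. In attempt 2, the first two assignments sit outside braces, so awk treats them as patterns instead of statements. Putting the shared assignments inside an action block and using FILENAME avoids both problems. The counters n1 and n2 are added here only to have something to print.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1.1 bj' '2.2 hk' '3.3 sj' '4.4 tw' '5.5 ch' '6.6 kr' '7.7 sg' '10.10 sj' '11.11 fr' > file1
printf '%s\n' '2.2 hk' '3.3 uk' '4.4 tw' '5.5 sd' '7.7 sg' '8.8 in' '9.9 ca' '10.10 sj' '11.11 fr' > file2

out=$(awk '
# An action block with no pattern runs for every line of every input file.
{
    fb[FILENAME, FNR] = $1        # key found at each line position, per file
    fn[FILENAME, $1]  = $2        # value stored under each key, per file
}
NR == FNR { f1[$1] = $2; n1++; next }   # true only while reading the first file
          { f2[$1] = $2; n2++ }         # reached only for the second file
END {
    print "file1 entries: " n1 ", file2 entries: " n2
    print "first key in file1: " fb["file1", 1]
    print "value of 3.3 in file2: " fn["file2", "3.3"]
}' file1 file2)
printf '%s\n' "$out"
# file1 entries: 9, file2 entries: 9
# first key in file1: 1.1
# value of 3.3 in file2: uk
```

FILENAME holds the name exactly as given on the command line, so the lookups in END use "file1" and "file2" as quoted strings.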

 
A simple crude starting point:
awk '
FNR==NR{a[$1]=$2;next}
{b[$1]=$2}
END{
print "following lines have been removed"
for(i in a)if(!(i in b))print i,a[i]
print "following lines have been added"
for(i in b)if(!(i in a))print i,b[i]
print "following lines have been changed"
for(i in a)if((i in b)&&a[i]!=b[i])print i" from "a[i]" to "b[i]
}
' file1 file2 >> /path/to/append

Hope This Helps, PH.
Want to get great answers to your Tek-Tips questions? Have a look at FAQ219-2884 or FAQ222-2244
 
The output is:

The following lines have been removed:
1.1 bj
6.6 kr
The following lines have been added:
9.9 ca
8.8 in
The following lines have been changed:
5.5 from ch to sd
3.3 from sj to uk

Note that the following code does not use getline. It is rarely wise to use getline in Awk. It's best to let Awk read the files for you.
Code:
BEGIN { OFS="\t" }
NR==FNR { f1[$1] = $2 ; next }
{ f2[$1] = $2 }
END {
  print "The following lines have been removed:"
  for (k in f1)
    if (!(k in f2))
      print k, f1[k]
  print "The following lines have been added:"
  for (k in f2)
    if (!(k in f1))
      print k, f2[k]
  print "The following lines have been changed:"
  for (k in f1)
    if ((k in f2) && (f1[k] != f2[k]))
      printf "%s from %s to %s\n", k,f1[k],f2[k]
}
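
To illustrate the point about getline, here is a sketch of how the Seen[] one-liner posted earlier in the thread might look when awk reads both files itself. The order[] array is an addition that is not in the original one-liner; it keeps the "Not seen" lines in file order, since for (k in array) visits keys in an unspecified order.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1.1 bj' '2.2 hk' '3.3 sj' '4.4 tw' '5.5 ch' '6.6 kr' '7.7 sg' '10.10 sj' '11.11 fr' > file1
printf '%s\n' '2.2 hk' '3.3 uk' '4.4 tw' '5.5 sd' '7.7 sg' '8.8 in' '9.9 ca' '10.10 sj' '11.11 fr' > file2

out=$(awk '
NR == FNR    { seen[$0] = 1; order[++n] = $0; next }   # first file: remember each line
($0 in seen) { delete seen[$0]; next }                 # line present in both files
             { print "New:", $0 }                      # line only in the second file
END {
    for (i = 1; i <= n; i++)
        if (order[i] in seen) print "Not seen:", order[i]
}' file1 file2)
printf '%s\n' "$out"
# New: 3.3 uk
# New: 5.5 sd
# New: 8.8 in
# New: 9.9 ca
# Not seen: 1.1 bj
# Not seen: 3.3 sj
# Not seen: 5.5 ch
# Not seen: 6.6 kr
```

This compares whole lines, so a changed value shows up as one "New" line plus one "Not seen" line. The NR == FNR test assumes the first file is not empty.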
 
thanks everybody PHV, futurelet and xmb, I learned a lot!

Regards!

FNR==NR {f1[$1]=$2;next}
{f2[$1]=$2}
END{
  m=0
  n=0
  p=0
  # count first, so that a heading is printed only when its section is not empty
  for(i in f1) if(!(i in f2))
    m++
  if(m>=1)
    print "following lines have been removed"
  for(i in f1) if(!(i in f2))
    print i, f1[i]
  for(i in f2) if(!(i in f1))
    n++
  if(n>=1)
    print "following lines have been added"
  for(i in f2) if(!(i in f1))
    print i, f2[i]
  for(i in f1) if((i in f2)&&f1[i]!=f2[i])
    p++
  if(p>=1)
    print "following lines have been changed"
  for(i in f1) if((i in f2)&&f1[i]!=f2[i])
    print i" from "f1[i]" to "f2[i]
}
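
One way to avoid walking each array twice, sketched under the assumption that the whole report fits comfortably in memory: build each section as a string and print its heading only if the string is not empty. The shell function name report and the arrays k1 and k2 (keys in file order) are made up for this example.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1.1 bj' '2.2 hk' '3.3 sj' '4.4 tw' '5.5 ch' '6.6 kr' '7.7 sg' '10.10 sj' '11.11 fr' > file1
printf '%s\n' '2.2 hk' '3.3 uk' '4.4 tw' '5.5 sd' '7.7 sg' '8.8 in' '9.9 ca' '10.10 sj' '11.11 fr' > file2

report() {
    awk '
    NR == FNR { f1[$1] = $2; k1[++n1] = $1; next }
              { f2[$1] = $2; k2[++n2] = $1 }
    END {
        # Build each section as a string while walking the keys in file order.
        for (i = 1; i <= n1; i++) {
            k = k1[i]
            if (!(k in f2))          removed = removed k " " f1[k] "\n"
            else if (f1[k] != f2[k]) changed = changed k " from " f1[k] " to " f2[k] "\n"
        }
        for (i = 1; i <= n2; i++) {
            k = k2[i]
            if (!(k in f1)) added = added k " " f2[k] "\n"
        }
        # Each section already ends in a newline, so printf adds none.
        if (removed != "") printf "following lines have been removed\n%s", removed
        if (added   != "") printf "following lines have been added\n%s", added
        if (changed != "") printf "following lines have been changed\n%s", changed
    }' "$1" "$2"
}

out=$(report file1 file2)    # all three sections appear
same=$(report file1 file1)   # identical input: nothing is printed
printf '%s\n' "$out"
```

Appending with report file1 file2 >> /path/to/append then adds nothing to the file when the two inputs match.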
 
It's not clear whether awk is a requirement, or whether you need the adds, deletes and changes listed in that order,

but I really like the -e option of diff for this type of thing.

"Code what you mean,
and mean what you code!
But by all means post your code!"

Razalas
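
For reference, a minimal way to try the diff -e suggestion on the sample files from the question. The -e option makes diff write an ed script: a (append), d (delete) and c (change) commands addressed by file1 line numbers, listed from the end of the file backwards. It does not give "from X to Y" wording directly, so the changed lines would still need some post-processing.

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' '1.1 bj' '2.2 hk' '3.3 sj' '4.4 tw' '5.5 ch' '6.6 kr' '7.7 sg' '10.10 sj' '11.11 fr' > file1
printf '%s\n' '2.2 hk' '3.3 uk' '4.4 tw' '5.5 sd' '7.7 sg' '8.8 in' '9.9 ca' '10.10 sj' '11.11 fr' > file2

# diff exits with status 1 when the files differ, which is not an error here.
out=$(diff -e file1 file2) || [ $? -eq 1 ]
printf '%s\n' "$out"
```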
 