find out the delta between two files

hmehta · Sep 13, 2002

I have two files, file1 and file2. File1 has 2000 entries and file2 has 5000-odd entries. I want to find out what the two files have in common, and also what the extra items in file2 are. How could I do this in awk? A simple diff does not work, because a line-to-line comparison is not valid in this case.

vgersh99 · Sep 13, 2002

#define "common things" ????

This question is too vague. First, you have to define what makes a line unique. Second, you need to look at both files for any repeating 'patterns' and come up with a consistent 'least common denominator' that you can use to identify the 'uniqueness' of a line/record.

Post a sample 'line/record' here to see if we can help you, but I'd say 'study your data first'.
vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+

marsd · Sep 13, 2002

Vgersh has a good point:
There are so many things that you could compare:
Length of strings..
Alphanumeric chars,control chars, etc..
Identical records..
Number of matches in consecutive segments of records..
Number of regexp matches in each record..
Number of string matches in each record..
etc,etc...

You could really come up with some extravagant
code for this. The optimal search algorithm, and
a good sort..would be fun to build too..

Thank goodness I'm not a programmer and still think this kind of stuff is fun..

hmehta · Sep 13, 2002

See the following two files, File1 and File2. They have some city/state names in common, and there are additional items as well. I need to identify the common items as well as the additional ones in File1.
example :
File1:
abbeville,SC
aberdeen,ID
abernathy,TX
abingdon,IL
abiquiu,NM
achusnet,MA
acton,MA
adams,NY
adamsville,MI

File2:
abbeville,AL aberdeen,MD adams,WI
adams,NY

marsd · Sep 13, 2002

If it's only the first field to match against:

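# The opening "awk" line was missing from the post; -F, (comma-separated
# fields) and -v file1=file1 are assumptions needed to make it run.
awk -F, -v file1=file1 '
# Read fname and collect field number fld of every line
# (or the whole line when fld is not numeric).
function loader(fname,fld, barra,x,z) {
    if (fld ~ /[0-9]/) {
        while ((getline < fname) > 0) {
            barra[++x] = $fld
        }
        close(fname)
        return rebuild(barra,x)
    } else {
        while ((getline < fname) > 0) {
            barra[++x] = $0
        }
        close(fname)
        return rebuild(barra,x)
    }
}

# Join a[1..b] into a single newline-separated string.
function rebuild(a,b, cnt,ret) {
    for (cnt = 1 ; cnt <= b ; cnt++) {
        ret = (cnt == 1) ? a[cnt] : ret "\n" a[cnt]
    }
    return ret
}

BEGIN {
    z = split(loader(file1,1),arr1,"\n")
}

{
    # Regex match: any line of file2 that contains a first field from file1.
    for (x = 1 ; x <= z ; x++) {
        if ($0 ~ arr1[x]) {
            print $0, "matched->Additional::", arr1[x]
        }
    }
}' file2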

There's also this which you are welcome to adapt:

function qsortM(array1,array2,min1,max1,min2,max2,matches) {
    if (min1 >= max1 || min2 > max2) {
        return
    }

    m = min1 ; e = min2
    for (i = (min1 + 1) ; i <= max1 ; i++) {
        for (x = (min2 + 1) ; x <= max2 ; x++) {
            if (array1[i] == array2[x] && length(array1[i]) > 0) {
                print "Match", array1[i]"="array2[x],"at",i, "and", x
                matches++
            }
        }
    }
    qsortM(array1,array2,min1,(m - 1),min2,(e - 1),matches)
    qsortM(array1,array2,(m + 1),max1,(e + 1),max2,matches)
    return matches
}

function _elems(arr, i,k) {
    for (i in arr) {
        k++
    }
    return k
}

function parray(arr,u, f) {
    print "Max=", u
    while (f <= u) {
        f++
        if (arr[f] == "") {
            continue
        } else {
            print arr[f]
        }
    }
}

BEGIN {
    srand()
    cn = 0
    for (z = int(1 + rand() * 125) ; z < 125 ; z += 2) {
        #print z
        a[cn++] = z * 12
    }
    cn = 0
    for (p = int(1 + rand() * 125) ; p < 125 ; p += 2) {
        #print p
        d[cn++] = p * 12
    }
    #print "Elements in a:", cnt1 = _elems(a)
    #print "Elements in d:", cnt2 = _elems(d)
    #parray(a,cnt1)
    #parray(d,cnt2)
    print "Matches found" , mat = qsortM(a,d,1,_elems(a),1,_elems(d),0)
}

I'm not explaining it though..I'm tired from just writing the nonsense.
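
For the exact-match case, a common awk idiom is to load one file into an associative array and test the other file against it. What follows is only a minimal sketch, not anything posted in the thread. It assumes that entries are separated by spaces or newlines and that "common" means an identical `city,ST` string. The temporary directory and the `seen`, `tag` and `result` names are made up for illustration; the sample data is copied from the post above.

```shell
# Sketch only: recreate the sample files from the thread in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' abbeville,SC aberdeen,ID abernathy,TX abingdon,IL abiquiu,NM \
    achusnet,MA acton,MA adams,NY adamsville,MI > "$tmp/file1"
printf '%s\n' 'abbeville,AL aberdeen,MD adams,WI' adams,NY > "$tmp/file2"

result=$(awk '
    # First file (NR == FNR): remember every whitespace-separated entry.
    NR == FNR { for (i = 1; i <= NF; i++) seen[$i] = 1; next }
    # Second file: label each entry as common to both files or extra in file2.
    {
        for (i = 1; i <= NF; i++) {
            tag = ($i in seen) ? "common" : "extra"
            print tag, $i
        }
    }
' "$tmp/file1" "$tmp/file2")

echo "$result"
rm -rf "$tmp"
```

With the sample data only adams,NY is labelled common; the other three entries of File2 are labelled extra. If the match should be on the city name alone, the array key would have to be the part before the comma instead of the whole entry.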

jamisar · Sep 14, 2002

hmehta: reduce the files to the same format,
one entry per line,
sort them,
then try: comm file1 file2
------------ jamisar
Einfachheit ist das Resultat der Reife. (Friedrich Schiller)
Simplicity is the fruit of maturity.
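
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe for the real data: it assumes entries are separated by spaces or newlines (as in the File2 sample), and the scratch directory and the `f1.sorted`/`f2.sorted` names are made up.

```shell
# Sketch only: recreate the sample files from the thread in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' abbeville,SC aberdeen,ID abernathy,TX abingdon,IL abiquiu,NM \
    achusnet,MA acton,MA adams,NY adamsville,MI > "$tmp/file1"
printf '%s\n' 'abbeville,AL aberdeen,MD adams,WI' adams,NY > "$tmp/file2"

# Same format (one entry per line), then sort; comm needs sorted input.
tr -s ' ' '\n' < "$tmp/file1" | LC_ALL=C sort -u > "$tmp/f1.sorted"
tr -s ' ' '\n' < "$tmp/file2" | LC_ALL=C sort -u > "$tmp/f2.sorted"

# -12 keeps only lines present in both files; -13 keeps lines unique to file2.
common=$(LC_ALL=C comm -12 "$tmp/f1.sorted" "$tmp/f2.sorted")
only2=$(LC_ALL=C comm -13 "$tmp/f1.sorted" "$tmp/f2.sorted")

echo "Common:";        echo "$common"
echo "Only in file2:"; echo "$only2"
rm -rf "$tmp"
```

Using the same `LC_ALL=C` for both sort and comm keeps the two tools agreeing on the collation order.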
