
find out the delta between two files

Status
Not open for further replies.

hmehta

IS-IT--Management
Jun 5, 2002
27
US
I have two files, file1 and file2. File1 has 2000 entries and file2 has 5000-odd entries. I want to find out what the two files have in common, and also which items are extra in file2. How could I do this in awk? A simple diff does not work, because a line-by-line comparison is not valid in this case.
 
#define "common things" ????

The question is too vague. First you have to define what makes a line unique. Second, you need to look at both files for repeating patterns and come up with the persistent 'least common denominator' that you can use to identify the uniqueness of a line/record.

Post a sample line/record here so we can see whether we can help, but I'd say 'study your data first'.
vlad
+---------------------------+
|#include<disclaimer.h> |
+---------------------------+
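
To illustrate why the choice of key matters, here is a minimal sketch. It assumes comma-separated city,state records with one record per line; the function name common_by_key and its argument order are made up for this example and are not from the thread.

```shell
# Hypothetical helper: print the records of file B whose key also occurs in file A.
# $1 = key field number (0 means the whole line), $2 = file A, $3 = file B.
common_by_key() {
    awk -F, -v k="$1" '
        { key = (k > 0) ? $k : $0 }        # decide what makes two records "the same"
        NR == FNR { seen[key] = 1; next }  # first file: remember its keys
        key in seen                        # second file: print records with a known key
    ' "$2" "$3"
}
```

If file A contains adams,NY and file B contains adams,WI and adams,NY, then `common_by_key 1 A B` (city only) reports both lines of B, while `common_by_key 0 A B` (whole record) reports only adams,NY. Note that the NR == FNR test misbehaves if file A is empty.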
 
Vgersh has a good point:
There are so many things that you could compare:
Length of strings..
Alphanumeric chars,control chars, etc..
Identical records..
Number of matches in consecutive segments of records..
Number of regexp matches in each record..
Number of string matches in each record..
etc,etc...

You could really come up with some extravagant
code for this. The optimal search algorithm, and
a good sort..would be fun to build too..

Thank goodness I'm not a programmer and still think this kind of stuff is fun.. ;)

 
See the following two files, File1 and File2. They have some city/state names in common, and File1 has additional items as well. I need to identify the common items as well as the additional ones in File1.
example :
File1:
abbeville,SC
aberdeen,ID
abernathy,TX
abingdon,IL
abiquiu,NM
achusnet,MA
acton,MA
adams,NY
adamsville,MI

File2:
abbeville,AL aberdeen,MD adams,WI
adams,NY

 
If it's only the first field to match against:

awk -F, -v file1=file1 '
# Read field number fld (or the whole line when fld is 0) from every
# line of fname and return the values joined by newlines.
function loader(fname, fld,    barra, x) {
    x = 0
    while ((getline < fname) > 0) {
        barra[++x] = (fld > 0) ? $fld : $0
    }
    close(fname)
    return rebuild(barra, x)
}

# Join a[1..b] into one newline-separated string.
function rebuild(a, b,    cnt, ret) {
    ret = ""
    for (cnt = 1; cnt <= b; cnt++) {
        ret = (cnt == 1) ? a[cnt] : ret "\n" a[cnt]
    }
    return ret
}

BEGIN {
    # Load the first field (the city) of every line in file1.
    z = split(loader(file1, 1), arr1, "\n")
}

{
    # Substring match: report every file1 city that occurs in this line of file2.
    for (x = 1; x <= z; x++) {
        if (arr1[x] != "" && index($0, arr1[x]) > 0) {
            print $0, "matched->Additional::", arr1[x]
        }
    }
}' file2
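
A shorter way to get both the common and the extra items is the usual two-file awk idiom, sketched below. It assumes one city,state record per line and an exact whole-line match, which is not quite what the script above does; the function name classify_entries is hypothetical.

```shell
# Hypothetical helper: label every record of the second file as "common"
# (also present in the first file) or "extra" (only in the second file).
# Assumes one record per line and an exact whole-line comparison.
classify_entries() {
    awk '
        NR == FNR { seen[$0] = 1; next }   # first file: remember each record
        {
            label = ($0 in seen) ? "common" : "extra"
            print label ": " $0
        }
    ' "$1" "$2"
}
```

Called as `classify_entries file1 file2`, it labels each line of file2; swapping the two arguments lists what is extra in file1 instead. The lookup in the seen array replaces the inner loop, so each line of the second file is checked once.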

There's also this which you are welcome to adapt:

# Compare every element of array1[min1..max1] with every element of
# array2[min2..max2], print each match, and return the match count.
# Despite its name this does not sort; one nested pass visits each pair once.
function qsortM(array1, array2, min1, max1, min2, max2, matches,    i, x) {
    if (min1 > max1 || min2 > max2) {
        return matches
    }
    for (i = min1; i <= max1; i++) {
        for (x = min2; x <= max2; x++) {
            if (array1[i] == array2[x] && length(array1[i]) > 0) {
                print "Match", array1[i] "=" array2[x], "at", i, "and", x
                matches++
            }
        }
    }
    return matches
}

# Count the elements of arr.
function _elems(arr,    i, k) {
    k = 0
    for (i in arr) {
        k++
    }
    return k
}

# Print the non-empty elements arr[1..u].
function parray(arr, u,    f) {
    print "Max=", u
    for (f = 1; f <= u; f++) {
        if (arr[f] != "") {
            print arr[f]
        }
    }
}

BEGIN {
    # Fill a and d with multiples of 12, starting from random offsets.
    srand()
    cn = 0
    for (z = int(1 + rand() * 125); z < 125; z += 2) {
        #print z
        a[++cn] = z * 12
    }
    cn = 0
    for (p = int(1 + rand() * 125); p < 125; p += 2) {
        #print p
        d[++cn] = p * 12
    }
    #print "Elements in a:", cnt1 = _elems(a)
    #print "Elements in d:", cnt2 = _elems(d)
    #parray(a, cnt1)
    #parray(d, cnt2)
    mat = qsortM(a, d, 1, _elems(a), 1, _elems(d), 0)
    print "Matches found", mat
}

I'm not explaining it though..I'm tired from just writing the nonsense.
 
hmehta: reduce the files to the same format
one entry per line
sort them
then try: comm file1 file2
------------
jamisar
Einfachheit ist das Resultat der Reife. (Friedrich Schiller)
Simplicity is the fruit of maturity.
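
The steps above can be sketched as follows. This assumes entries never contain spaces or tabs (so whitespace can be turned into line breaks) and that duplicates within a file do not matter; the function names normalize and compare_files are made up for the example.

```shell
# Hypothetical helper: put one entry per line, drop blank lines, sort, and
# remove duplicates. Assumes entries contain no spaces or tabs.
normalize() {
    tr -s ' \t' '\n\n' < "$1" | sed '/^$/d' | LC_ALL=C sort -u
}

# Hypothetical helper: comm needs both inputs sorted in the same collation
# order, so the same locale is used for sort and comm.
compare_files() {
    cmp_dir=$(mktemp -d) || return 1
    normalize "$1" > "$cmp_dir/a"
    normalize "$2" > "$cmp_dir/b"
    echo "Common to both:"
    LC_ALL=C comm -12 "$cmp_dir/a" "$cmp_dir/b"   # lines present in both files
    echo "Only in $2:"
    LC_ALL=C comm -13 "$cmp_dir/a" "$cmp_dir/b"   # lines present only in the second file
    rm -rf "$cmp_dir"
}
```

Running `compare_files file1 file2` prints the shared entries first and then the entries found only in file2. Use `comm -23` in the same way to list what is only in file1.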
 