diff command inconsistent

jones54 · Mar 21, 2006

Hi,

I have 2 files:

file 1 contains:
x
y
z

file 2 contains:
x
y
z
1
2
3

I am using the diff command to compare the 2 files so that I can get the differences between the 2 files. File 2 will always have extra data appended to it so the difference will always be the last few lines. In the case above the difference is:

1
2
3

My command is:

diff $location/file1.txt $location/file2.txt > $location/diff.txt

This seems to work some of the time but not all of the time. I have set up a cron job to compare these files 24x7 and I have noticed that the result of the diff is not always consistent.

Should I be using some options in the command? Or am I doing something incorrectly?

Any ideas or suggestions would be much appreciated.

Salem · Mar 21, 2006

The problem will be if any of the appended data matches any of the data which exists already. This is likely to confuse diff.

What I think you need to do is something along the lines of this:
[tt]
# count the lines in both files
# (redirecting into wc keeps the file name out of its output)
a=`wc -l < $location/file1.txt`
b=`wc -l < $location/file2.txt`

# calculate how many lines were appended
c=$((b - a))

# output those appended lines
tail -n $c $location/file2.txt > $location/diff.txt
[/tt]
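
The line-count approach above quietly assumes that file1 is still an exact prefix of file2. A minimal sketch of how that assumption could be checked before trusting the tail, written in portable sh; the function name `appended_lines` and the scratch files are made up for illustration and are not part of the original posts:

```shell
#!/bin/sh
# Hypothetical helper: print the lines appended to "new" relative to "old",
# but only if "old" is still an exact prefix of "new".
appended_lines() {
    old=$1
    new=$2

    # Redirecting into wc keeps the file name out of its output.
    a=$(wc -l < "$old")
    b=$(wc -l < "$new")
    a=$((a + 0))    # arithmetic strips the padding some wc versions add
    b=$((b + 0))

    # Refuse if the first $a lines of "new" are not identical to "old".
    if ! head -n "$a" "$new" | cmp -s - "$old"; then
        echo "appended_lines: $old is not a prefix of $new" >&2
        return 1
    fi

    # Nothing appended: print nothing.
    [ "$b" -gt "$a" ] || return 0
    tail -n "$((b - a))" "$new"
}

# Example run against scratch copies of the two files from the question.
dir=$(mktemp -d)
printf 'x\ny\nz\n' > "$dir/file1.txt"
printf 'x\ny\nz\n1\n2\n3\n' > "$dir/file2.txt"
appended_lines "$dir/file1.txt" "$dir/file2.txt"
```

If the older file has been edited or rotated in the meantime, the function returns a non-zero status instead of printing lines that may not be the appended ones.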

jones54 · Mar 21, 2006

Thanks a million for that Salem.

I am impressed with that solution .. very clever!!

I will use your solution but just to clarify something ...

if we have 2 files

file1:
x
y
z
1
2
3

file2:
x
y
z
1
2
3
y
z
1
2
x
y
z

Could diff get confused by the repetition in this file? I would expect to get:
y
z
1
2
x
y
z

as the diff here.

Would this result not be guaranteed 100% of the time?

Is this how diff is supposed to behave??

Once again, many thanks.

Salem · Mar 21, 2006

One of the historic uses of diff was for the SCCS version control system, in which the output was a series of 'ed' commands to convert the old file into the new file.

In essence, any set of edits which achieve this count as a successful difference, though diff does try to minimise the amount of text needed to achieve the modification.
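
To see the "series of ed commands" form for yourself, diff's -e option produces it. A small sketch, assuming a POSIX diff and using scratch files whose names are only illustrative:

```shell
#!/bin/sh
# Build the two sample files from the original question in a scratch directory.
dir=$(mktemp -d)
printf 'x\ny\nz\n' > "$dir/old.txt"
printf 'x\ny\nz\n1\n2\n3\n' > "$dir/new.txt"

# -e asks diff for an ed script that turns old.txt into new.txt.
# diff exits with status 1 when the files differ, which is not an error here.
ed_script=$(diff -e "$dir/old.txt" "$dir/new.txt") || [ "$?" -eq 1 ]

printf '%s\n' "$ed_script"
```

For this pair the script should be an append command for line 3 ("3a"), then the three new lines, then a line holding a single dot that ends the input. Note that the default (non -e) output also carries change markers and "> " prefixes, so the file diff writes is never just the bare appended lines.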

If you had an old file of
1
2
3
x
y
z

And a new file of
1
2
3
x
y
z
1
2
3
x
y
z

There is no way to determine whether "1 2 3 x y z" should be inserted before the current text, or appended after the current text.

Likewise with your changes, if the appended data in the new file is partially common with the existing data, it potentially gives diff many plausible choices as to how to represent the difference, and it's not always going to choose the 'append only' difference.
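
The ambiguity in the "1 2 3 x y z" example can be checked mechanically. A minimal sketch (file names are invented for the demonstration) that applies both candidate edits and compares each result with the new file:

```shell
#!/bin/sh
dir=$(mktemp -d)

# The old file, and the new file that contains the same six lines twice.
printf '1\n2\n3\nx\ny\nz\n' > "$dir/old.txt"
printf '1\n2\n3\nx\ny\nz\n1\n2\n3\nx\ny\nz\n' > "$dir/new.txt"

# The block of text that was added, whichever end it went on.
printf '1\n2\n3\nx\ny\nz\n' > "$dir/block.txt"

# Edit A: insert the block before the existing text.
cat "$dir/block.txt" "$dir/old.txt" > "$dir/prepended.txt"
# Edit B: append the block after the existing text.
cat "$dir/old.txt" "$dir/block.txt" > "$dir/appended.txt"

# Both results are byte-identical to new.txt, so nothing in the files
# themselves says which edit really happened.
cmp -s "$dir/prepended.txt" "$dir/new.txt" && echo "insert-before explains new.txt"
cmp -s "$dir/appended.txt" "$dir/new.txt" && echo "append-after explains new.txt"
```

Both messages are printed, which is why diff is free to report either edit.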

SamBones · Mar 21, 2006

Can you give more information on what's not working correctly? You say sometimes it works and sometimes it doesn't. What does it do when it doesn't work?

My [tt]diff[/tt] seems to work all the time, even if it has repeating sections. Even when it's just a repeated section, the comparison starts with the first matching lines and reports the repeated lines as being inserted after the matched lines.

p5wizard · Mar 21, 2006

SamBones said:
My diff seems to work all the time, even if it has repeating sections. Even when it's just a repeated section, the comparison starts with the first matching lines and reports the repeated lines as being inserted after the matched lines.

My diffs on AIXes (433-09 up to 53-02) work the same...

HTH,

p5wizard

jones54 · Mar 22, 2006

Hi,

I have a log file which is being continuously updated. Being a log file, a lot of the entries are repeated each time a user performs an action. Note that not all lines in the file are time stamped.

I take a copy of this log file at x mins and then take another copy of the log file in x + 10 mins. I then do a diff on the log files using
diff file_x file_x+10 > difference.txt

This task is automated using a cron job. This task could run correctly for 10 hours, always creating the correct result. But occasionally the difference is incorrect.

I'm not sure in what way the files are compared when it goes wrong. The difference seems to contain almost everything, but the file looks a bit distorted - not like the normal log file.

I tried to sort each file thinking this might help but again the difference was not accurate all of the time.

I put in Salem's solution yesterday and that seems to be working well.

SamBones · Mar 22, 2006

Just a guess, but your log file writes might be buffered. That is, each log line doesn't actually get written to the log until a write buffer fills up.

One possible solution is to put a "[tt]sync[/tt]" command right before your diff. The "[tt]sync[/tt]" command forces all buffers to be written.

There could still be a timing issue if the log is being written to too quickly. Salem's solution is a great idea, but it can still miss lines if the log is being written to too fast. If lines are written to the log between the last [tt]wc[/tt] and the [tt]tail[/tt], you will be missing some lines.

Something like this might tighten it up a bit (borrowing from Salem's idea)...

Code:

#!/bin/ksh

export LOGFILE=/path-to/some.log

[ ! -f ~/.lasttime ] && print "0" > ~/.lasttime

typeset -L LAST=$(<~/.lasttime)
typeset -L THIS=$(cat ${LOGFILE}|wc -l)

if (( THIS == LAST ))
then
        print "No change. Last time = ${LAST}. This time = ${THIS}"

elif (( LAST > THIS ))
then
        print "File is smaller. Resetting."
        print "0" > ~/.lasttime

else
        (( LAST += 1 ))   # Start after what we did last time

        print "New lines ${LAST} -> ${THIS}"
        print "=========="
        sed -n "${LAST},${THIS}p" ${LOGFILE}
        print "=========="

        print ${THIS} > ~/.lasttime
fi

This saves the position in the log file you've pulled from each time. Even if it's being written to too fast and some lines get missed, they'll be picked up on the next run. The only issue is that if you reset the log file, you won't get any entries since the last run.
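
One possible way to soften the reset issue is to treat a shorter file as a fresh log and print it from line 1 on that same run. A minimal sketch in portable sh rather than ksh; the function name `new_lines` and the state-file argument are assumptions made for the example, not part of the script above:

```shell
#!/bin/sh
# Hypothetical helper: print the lines added to log file $1 since the
# previous call, remembering the last line count in state file $2.
new_lines() {
    logfile=$1
    statefile=$2

    last=0
    [ -f "$statefile" ] && last=$(cat "$statefile")

    this=$(wc -l < "$logfile")
    this=$((this + 0))    # arithmetic strips the padding some wc versions add

    # A shorter file means the log was reset: start again from line 1
    # instead of skipping what has been written since the reset.
    [ "$this" -lt "$last" ] && last=0

    if [ "$this" -gt "$last" ]; then
        sed -n "$((last + 1)),${this}p" "$logfile"
    fi

    echo "$this" > "$statefile"
}
```

It would be called as `new_lines /path-to/some.log ~/.lasttime`. This still cannot notice a log that was reset and then grew past its old length between two runs, and lines written to the old log just before the reset are lost either way.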

dstxaix · Mar 23, 2006

If you simply want to make sure you are always getting all the latest info from the log file, you can run a tail -f command.

vgersh99 · Mar 23, 2006

SamBones,
sorry for being anally non-UUOC, but.......

Code:

typeset -L THIS=$(wc -l < ${LOGFILE})

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

SamBones · Mar 23, 2006

It's not a full-fledged UUOC. The command "[tt]wc -l filename[/tt]" outputs both the number of lines and the file name. Piping it eliminates the filename from the output. So, there was a reason for using [tt]cat[/tt], not just laziness or habit.

But, your version is a prettier redirection.
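
For anyone comparing the three forms side by side, here is a small sketch using a scratch file (the name sample.log is only for illustration):

```shell
#!/bin/sh
dir=$(mktemp -d)
printf 'x\ny\nz\n' > "$dir/sample.log"

with_name=$(wc -l "$dir/sample.log")       # count followed by the file name
piped=$(cat "$dir/sample.log" | wc -l)     # count only, via an extra cat process
redirected=$(wc -l < "$dir/sample.log")    # count only, no extra process

echo "wc -l file     -> $with_name"
echo "cat file | wc  -> $piped"
echo "wc -l < file   -> $redirected"

# Some wc implementations pad the count with blanks; arithmetic expansion
# gives a clean number either way.
count=$((redirected + 0))
echo "normalised count: $count"
```

The padding is presumably also why the ksh script declares its variables with typeset -L, which left-justifies the value and drops leading blanks.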

The command [tt]cat[/tt] isn't an obscenity you know (not like "[tt]goto[/tt]" (pardon my language)).

[smile]
