finding duplicate files

rajeshtektips · Nov 7, 2005

Hi,

I am new to shell scripting and urgently need to write a shell script that checks whether an incoming file already exists in a directory.

Requirement is...

Every day we receive files in the Incoming folder. After a file is loaded into the database, it is moved to the Backup folder.

The problem is that we sometimes receive a file that already exists in the Backup folder, but under a different name.

We don't want to process those duplicate files.

I need to write a script that checks whether an incoming file already exists in the Backup folder.

Could someone please help me write the script?

Thanks
raj

TrojanWarBlade · Nov 7, 2005

I'm just off to bed and too tired to answer this properly but you need to do something like "ls -lS" to sort files by size and then read each in turn, "sum" the file and compare the output of "sum" with the output from the last test. If it is the same you have a duplicate.

Trojan.
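
A rough sketch of the idea described above, for anyone following along. It is a variation, not Trojan's exact recipe: it sorts on the checksum itself rather than using "ls -lS", and it uses "cksum" (CRC plus byte count) instead of "sum". The name list_dupes is made up for illustration, and file names containing spaces are assumed not to occur.

```shell
#!/bin/sh
# list_dupes DIR: print pairs of files in DIR whose cksum CRC and size match.
# (list_dupes is a hypothetical helper name, not a standard utility.)
list_dupes() {
    (
        cd "$1" || exit 1
        # cksum prints "<crc> <size> <name>"; sorting puts equal sums side by side.
        cksum ./* 2>/dev/null | sort -n -k1,1 -k2,2 |
        awk '{
            key = $1 " " $2
            if (key == prev) print prevname " = " $3
            prev = key
            prevname = $3
        }'
    )
}
```

Matching checksums are only a strong hint; running "cmp" on each reported pair would confirm that the contents are really identical.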

motoslide · Nov 7, 2005

Something along these lines:

Code:

FILE=$1
HERE=`pwd`
BACKUP=$HERE/backup
cd "$BACKUP"
sum * > sumlist
cd "$HERE"
NEWSUM=`sum "$FILE"|awk '{ print $1 '}`
FOUND=`grep "$NEWSUM" "$BACKUP/sumlist"`
if [ "$FOUND" = "" ]
        then
        # [process this file]
        echo "NewFile"
        else
        echo "Duplicate!  Did not process $FILE " >>somelogfile
fi

This is basically the same thing TrojanWarBlade referred to. Run the script and give the filename as an argument.

It's too bad the "find" command doesn't have a -sum option!

olded · Nov 7, 2005

There's any number of ways of doing this. Here is one:

Code:

#!/bin/ksh

file=$1 # what is your file
set - $(cksum $file)
kfilesum=$1

# cycle thru all the files in the backup directory
cksum $(find backup -type f -print)|
while read ksum c2 fname
do
   if [[ $kfilesum -eq $ksum ]]
   then
      echo "$file is duplicate of $fname"
   fi
done

adimstec · Nov 8, 2005

Maybe something like this :

#!/bin/ksh

cat file_in_incoming_folder | while read a
do
cat backup_file | while read b
do
if [ "$a" -eq "$b" ]
then
: # file already backed up
else
: # back up the file
fi
done
done

vgersh99 · Nov 8, 2005

knock-knock
Who's there?
UUOC police!

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

p5wizard · Nov 8, 2005

Not only that, doesn't even make sense to me...

Why would you want to (numerically) compare each line of one file to all the lines in another file???

wiz.

adimstec · Nov 8, 2005

If the question is for me:

It was just a quick and easy way to be sure that a file is not present in a list.

But maybe I am wrong; in that case, sorry.

vgersh99 · Nov 8, 2005

adimstec,
and how exactly does your code accomplish this? [I must be going blind or something].

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

rajeshtektips · Nov 8, 2005

olded,

Thanks for your code...

I modified your code and tried to run it, but it gives the error "COMPMAN_TOP^M: This is not an identifier."

Please find my code below.

#!/bin/sh
#--- Set the environment
COMPMAN_TOP=/u61/compman; export COMPMAN_TOP
SCRIPT_DIR=$COMPMAN_TOP/scripts; export SCRIPT_DIR
DATA_DIR=$COMPMAN_TOP/incoming; export DATA_DIR
LOG_DIR=$COMPMAN_TOP/log; export LOG_DIR
BACKUP_DIR=$COMPMAN_TOP/backup; export BACKUP_DIR
#--------------------------------------------------------------
#--- Check for the presence of Duplicate files

cd $BACKUP_DIR
sum * > sumlist
cd $DATA_DIR
NEWSUM=`sum $DATA_DIR/*.dat`
FOUND=`grep $NEWSUM $BACKUP_DIR/sumlist`
if [ $FOUND = '' ]
then
# {process this file]
echo 'NewFile'
else
echo 'Duplicate!'
fi

Please let me know what is wrong in this code.

olded · Nov 8, 2005

First, this is not my code. Mine was the next.

Second, I can't see where your problem is, but I don't agree with your logic and the hack you made of motoslide's code.

You are trying to grab the output of the checksum of everything in DATA_DIR and grep it with the checksum of everything in the BACKUP directory?

Both my solution and motoslide's depend on grabbing one file and checking its checksum against everything in the BACKUP directory.

If you need to check each file in your DATA_DIR, I suggest setting up a loop to obtain each file in DATA_DIR and then calling either my script or motoslide's.
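
One way such a loop might look, as a sketch only. It folds the checksum comparison into the loop instead of calling either script, and it uses "cksum" rather than "sum". The function name check_incoming is hypothetical, the *.dat pattern is taken from the script posted earlier, and the checksum list is written to a temporary file outside the backup directory so that it is not checksummed itself.

```shell
#!/bin/sh
# check_incoming DATA_DIR BACKUP_DIR
# (hypothetical helper; reports each *.dat file as Duplicate or NewFile)
check_incoming() {
    data_dir=$1
    backup_dir=$2
    sumlist=${TMPDIR:-/tmp}/sumlist.$$

    # Checksum every backup file once; "cksum < file" prints "<crc> <size>".
    : > "$sumlist"
    for b in "$backup_dir"/*; do
        if [ -f "$b" ]; then
            cksum < "$b" >> "$sumlist"
        fi
    done

    # Compare each incoming file against the whole list.
    for f in "$data_dir"/*.dat; do
        [ -f "$f" ] || continue
        newsum=$(cksum < "$f")
        # -x matches whole lines and -F treats the pattern as a fixed string.
        if grep -qxF "$newsum" "$sumlist"; then
            echo "Duplicate: $f"
        else
            echo "NewFile: $f"
        fi
    done
    rm -f "$sumlist"
}
```

It would be called as check_incoming "$DATA_DIR" "$BACKUP_DIR". Matching on the whole "<crc> <size>" line avoids a checksum accidentally matching part of a file name or a size column.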

motoslide · Nov 8, 2005

Did you write this on a DOS machine and try to run it in UNIX? Most UNIX shells aren't real happy with the CR/LF line termination.

And, as OldEd states, you've changed the scope of the source string from a single numeric value:

NEWSUM=`sum $FILE|awk '{ print $1 '}`

to a list which will likely contain multiple values:

NEWSUM=`sum $DATA_DIR/*.dat`

I would still guess that the (current) error you see is due to DOS/UNIX line termination, not a problem in your logic.
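
The "^M" in the error message is how a carriage return is usually displayed. A minimal sketch for checking a script for carriage returns and writing a cleaned copy follows; has_cr and strip_cr are made-up helper names, and the ".unix" suffix is just an assumption for the output file.

```shell
#!/bin/sh
# has_cr FILE: succeed if FILE contains at least one carriage return.
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE: write a copy without carriage returns to FILE.unix,
# leaving the original untouched.
strip_cr() {
    tr -d '\r' < "$1" > "$1.unix"
}
```

For example, "has_cr duplicate.sh && strip_cr duplicate.sh" would produce duplicate.sh.unix only when there is something to fix.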

rajeshtektips · Nov 8, 2005

motoslide,

I created the .sh file in UNIX, and I also changed NEWSUM. It now gives the result below:

"duplicate.sh[14]: [31644: not found.
duplicate"

Even if the incoming file is new, it displays "duplicate" in the output.

Please find the code below..

#!/bin/sh
#--- Set the environment
COMPMAN_TOP=/u61/compman; export COMPMAN_TOP
SCRIPT_DIR=$COMPMAN_TOP/scripts; export SCRIPT_DIR
DATA_DIR=$COMPMAN_TOP/incoming; export DATA_DIR
LOG_DIR=$COMPMAN_TOP/log; export LOG_DIR
BACKUP_DIR=$COMPMAN_TOP/backup; export BACKUP_DIR
#--------------------------------------------------------------
#--- Check for the presence of Duplicate files

cd $BACKUP_DIR
sum * > sumlist
cd $DATA_DIR
NEWSUM=`sum $DATA_DIR/*.dat|awk '{ print $1 }'`
FOUND=`grep $NEWSUM $BACKUP_DIR/sumlist`
if [ $FOUND = '' ]
then
# {process this file]
echo 'NewFile'
else
echo 'Duplicate!'
fi

vgersh99 · Nov 8, 2005

change
if [ $FOUND = '' ]

to

if [ "$FOUND" = '' ]

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+

motoslide · Nov 8, 2005

Re-check this line:

Code:

NEWSUM=`sum $DATA_DIR/*.dat|awk '{ print $1 }'`

It should be:

Code:

NEWSUM=`sum $DATA_DIR/*.dat|awk '{ print $1 '}`

Vlad:
Ref:

Code:

 if [ "$FOUND" = '' ]

Following the "=" sign are 2 single-quotes, right? I typically use double-quotes, but I recall that in the past you folks have told me single-quotes are more predictable. In the posts above, it looks more like a single double-quote than a double single-quote.
Damn. That sounds much more confusing than I had planned.

Annihilannic · Nov 8, 2005

motoslide said:
It should be:

Code:

NEWSUM=`sum $DATA_DIR/*.dat|awk '{ print $1 '}`

No it shouldn't? The original syntax was correct.

Annihilannic.

adimstec · Nov 8, 2005

#!/usr/bin/ksh

cat fic1 | while read x
do
grep $x fic2
if [ $? -eq 0 ]
then
echo "the file $x is present in the file fic2"
else
echo "The file $x is not present in the file fic2"
fi
done

motoslide · Nov 9, 2005

Annihilannic:
I've learned something new today. I've always ended my awk sequences with '} because that's how I was told. It looked wrong, but always worked. I just modified my script to end with }' and it works exactly the same. Maybe there was a bug in an ancient version of awk which required the reverse?
In any case, Thank-you!

olded · Nov 9, 2005

adimstec:

First, before Vlad, the UUOC cop, <grin> shows up, take a look at this link:

http://support.internetconnection.net/DEFINITIONS/Definition_of_UUOC.html

Second, I don't mean to be mean, but you seem to be bewildered. I don't think you are understanding the original question.

p5wizard · Nov 9, 2005

Motoslide, that is just your shell interpreting the single quotes.

Try the following commands:

echo '{}'
echo '{'}
echo {'}'
echo {}
those are all equivalent

awk needs that subprogram between {}, and seeing as it usually contains characters that have special meaning to the shell, those characters are usually escaped by delimiting the entire awk program in single quotes.

So you can start your awk subprogram with '{ or {' and end it with '} or }'. Your shell doesn't care; it strips the quotes and passes the whole string to awk as a single positional parameter. It just makes sense to use the '{ ... }' fashion because that makes it easier for us humans to understand...

HTH,

p5wizard
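
A small sketch that makes the point above visible: the shell removes the quotes before the command runs, so all three spellings hand over the same single argument. The function name show_args is made up for the demonstration.

```shell
#!/bin/sh
# show_args: print each argument it receives, wrapped in angle brackets.
# (show_args is a hypothetical helper used only for this demonstration.)
show_args() {
    for arg in "$@"; do
        printf '<%s>\n' "$arg"
    done
}

# Each call prints the same line: <{ print $1 }>
show_args '{ print $1 }'
show_args '{ print $1 '}
show_args {' print $1 }'

# awk therefore behaves identically with either ending.
echo 'a b' | awk '{ print $1 }'
echo 'a b' | awk '{ print $1 '}
```

Both awk commands print "a".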
