finding duplicate files

Status
Not open for further replies.

rajeshtektips

Programmer
Nov 7, 2005
7
US
Hi,

I am new to shell scripting and urgently need to write a shell script that checks whether an incoming file already exists in a directory.

The requirement is:

Every day we receive files in the Incoming folder. After a file is loaded into the database, it is moved to the Backup folder.

The problem is that we sometimes receive a file that already exists in the Backup folder under a different name.

We don't want to process those duplicate files.

I need to write a script that checks whether an incoming file already exists in the Backup folder.

Can someone please help me write the script?

Thanks
raj
 
I'm just off to bed and too tired to answer this properly but you need to do something like "ls -lS" to sort files by size and then read each in turn, "sum" the file and compare the output of "sum" with the output from the last test. If it is the same you have a duplicate.


Trojan.
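
A minimal sketch of that idea, assuming a POSIX sh with cksum, sort and cmp available. The function name list_duplicates is made up for illustration. Instead of sorting by size with "ls -lS", it sorts the cksum output (CRC and byte count) so that matching files end up on adjacent lines, then confirms each candidate pair with cmp, since a matching checksum alone does not prove the contents are identical. File names containing newlines are not handled.

```shell
#!/bin/sh
# list_duplicates DIR (hypothetical helper): report files in DIR whose
# contents match the previous file in checksum order.
list_duplicates() {
    for f in "$1"/*; do
        # cksum prints "CRC SIZE NAME" for each regular file
        if [ -f "$f" ]; then cksum "$f"; fi
    done | sort -n -k1,1 -k2,2 | {
        prev_crc='' prev_size='' prev_name=''
        while read -r crc size name; do
            # Same CRC and size as the previous line: confirm byte for byte
            if [ "$crc" = "$prev_crc" ] && [ "$size" = "$prev_size" ] &&
               cmp -s "$name" "$prev_name"; then
                echo "$name duplicates $prev_name"
            fi
            prev_crc=$crc prev_size=$size prev_name=$name
        done
    }
}
```

Calling list_duplicates backup would print one line for each file that matches its neighbour in checksum order.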
 
Something along these lines:
Code:
FILE=$1
HERE=`pwd`
BACKUP=$HERE/backup
cd $BACKUP
sum * > sumlist
cd $HERE
NEWSUM=`sum $FILE|awk '{ print $1 '}`
FOUND=`grep $NEWSUM $BACKUP/sumlist`
if [ "$FOUND" = "" ]
        then
        # {process this file]
        echo "NewFile"
        else
        echo "Duplicate!  Did not process $FILE " >>somelogfile
fi

This is basically the same thing TrojanWarBlade referred to. Run the script and give the filename as an argument.

It's too bad the "find" command doesn't have a -sum option!
 
There's any number of ways of doing this. Here is one:

Code:
#!/bin/ksh

file=$1 # what is your file
set - $(cksum $file)
kfilesum=$1

# cycle thru all the files in the backup directory
cksum $(find backup -type f -print)|
while read ksum c2 fname
do
   if [[ $kfilesum -eq $ksum ]]
   then
      echo "$file is duplicate of $fname"
   fi
done
 
Maybe something like this :




#!/bin/ksh

cat file_in_incomming_folder | while read a
do
cat backup_file |while read b
do
if [$a -eq $b]
then
file already backuped
else
backup the file
fi
done
done
 
knock-knock
Who's there?
UUOC police!

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
Not only that, doesn't even make sense to me...

Why would you want to (numerically) compare each line of one file to all the lines in another file???

wiz.
 
If the question is for me:

It was just a quick and easy way to be sure that a file is not present in a list.

But maybe I am wrong, so in that case, sorry.
 
adimstec,
and how exactly does your code accomplish this? [I must be going blind or something].

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
 
olded,

Thanks for your code.

I modified your code and tried to run it, but it gives the error "COMPMAN_TOP^M: This is not an identifier."

Please find my code below.

#!/bin/sh
#--- Set the environment
COMPMAN_TOP=/u61/compman; export COMPMAN_TOP
SCRIPT_DIR=$COMPMAN_TOP/scripts; export SCRIPT_DIR
DATA_DIR=$COMPMAN_TOP/incoming; export DATA_DIR
LOG_DIR=$COMPMAN_TOP/log; export LOG_DIR
BACKUP_DIR=$COMPMAN_TOP/backup; export BACKUP_DIR
#--------------------------------------------------------------
#--- Check for the presence of Duplicate files

cd $BACKUP_DIR
sum * > sumlist
cd $DATA_DIR
NEWSUM=`sum $DATA_DIR/*.dat`
FOUND=`grep $NEWSUM $BACKUP_DIR/sumlist`
if [ $FOUND = '' ]
then
# {process this file]
echo 'NewFile'
else
echo 'Duplicate!'
fi

Please let me know what is wrong with this code.
 
First, this is not my code. Mine was the next one.

Second, I can't see where your problem is, but I don't agree with your logic or with the hack you made of motoslide's code.

You are trying to grab the checksums of everything in DATA_DIR and grep for them in the checksums of everything in the BACKUP directory?

Both my solution and motoslide's depend on grabbing one file and checking its checksum against everything in the BACKUP directory.

If you need to check each file in your DATA_DIR, I suggest setting up a loop to obtain each file in DATA_DIR and then calling either my script or motoslide's.
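
A sketch of such a loop, assuming a POSIX sh with cksum and cmp. The function names is_duplicate and check_incoming are hypothetical, and the *.dat pattern is taken from the script posted above. Each incoming file is compared against every regular file in the backup directory: first by cksum output, then byte for byte with cmp, because equal checksums are only a strong hint.

```shell
#!/bin/sh
# is_duplicate FILE BACKUP_DIR (hypothetical helper): succeed if some
# regular file in BACKUP_DIR has the same contents as FILE.
is_duplicate() {
    # Reading from stdin makes cksum print "CRC SIZE" without a file name
    new_sum=$(cksum < "$1")
    for b in "$2"/*; do
        [ -f "$b" ] || continue
        if [ "$(cksum < "$b")" = "$new_sum" ] && cmp -s "$1" "$b"; then
            return 0
        fi
    done
    return 1
}

# check_incoming INCOMING_DIR BACKUP_DIR (hypothetical helper): classify
# every *.dat file in INCOMING_DIR as new or duplicate.
check_incoming() {
    for f in "$1"/*.dat; do
        [ -f "$f" ] || continue
        if is_duplicate "$f" "$2"; then
            echo "Duplicate: $f"
        else
            echo "New: $f"
        fi
    done
}
```

With the variables from the posted script, the call would be check_incoming "$DATA_DIR" "$BACKUP_DIR". Recomputing the backup checksums for every incoming file is slow for large directories; caching them in a list kept outside the backup directory would avoid that.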
 
Did you write this on a DOS machine and try to run it in UNIX? Most UNIX shells aren't real happy with the CR/LF line termination.

And, as OldEd states, you've changed the scope of the source string from a single numeric value:

NEWSUM=`sum $FILE|awk '{ print $1 '}`

to a list which will likely contain multiple values:

NEWSUM=`sum $DATA_DIR/*.dat`

I would still guess that the (current) error you see is due to DOS/UNIX line termination, not a problem in your logic.
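
The ^M in the error message is a carriage return. A quick way to check a script for DOS line endings and strip them, sketched with standard grep and tr; the function names has_crlf and strip_cr are made up for illustration:

```shell
#!/bin/sh
# has_crlf FILE (hypothetical helper): succeed if FILE contains a
# carriage return anywhere.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE (hypothetical helper): print FILE with every carriage
# return removed.
strip_cr() {
    tr -d '\r' < "$1"
}
```

For example, strip_cr duplicate.sh > duplicate.unix.sh writes a cleaned copy and leaves the original untouched.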
 
motoslide,

I created the .sh file on UNIX, and I also changed NEWSUM. It now gives the result below:

"duplicate.sh[14]: [31644: not found.
duplicate"

Even if the incoming file is new, it displays "duplicate" in the output.

Please find the code below..

#!/bin/sh
#--- Set the environment
COMPMAN_TOP=/u61/compman; export COMPMAN_TOP
SCRIPT_DIR=$COMPMAN_TOP/scripts; export SCRIPT_DIR
DATA_DIR=$COMPMAN_TOP/incoming; export DATA_DIR
LOG_DIR=$COMPMAN_TOP/log; export LOG_DIR
BACKUP_DIR=$COMPMAN_TOP/backup; export BACKUP_DIR
#--------------------------------------------------------------
#--- Check for the presence of Duplicate files

cd $BACKUP_DIR
sum * > sumlist
cd $DATA_DIR
NEWSUM=`sum $DATA_DIR/*.dat|awk '{ print $1 }'`
FOUND=`grep $NEWSUM $BACKUP_DIR/sumlist`
if [ $FOUND = '' ]
then
# {process this file]
echo 'NewFile'
else
echo 'Duplicate!'
fi
 
change
if [ $FOUND = '' ]

to

if [ "$FOUND" = '' ]

vlad
+----------------------------+
| #include<disclaimer.h> |
+----------------------------+
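
A small sketch of why the quotes matter. The function names are hypothetical, and the exact error text differs between shells. When FOUND is empty, the unquoted expansion disappears and the test becomes [ = '' ], which shells such as bash and dash reject as an error. The error counts as "false", so the else branch runs and a new file is reported as a duplicate.

```shell
#!/bin/sh
# classify_unquoted VALUE (hypothetical): the fragile form of the test.
# stderr is silenced because [ complains when the expansion is empty
# or contains several words.
classify_unquoted() {
    FOUND=$1
    if [ $FOUND = '' ] 2>/dev/null; then
        echo 'NewFile'
    else
        echo 'Duplicate!'
    fi
}

# classify_quoted VALUE (hypothetical): the same test with the expansion
# in double quotes, so [ always sees exactly three arguments.
classify_quoted() {
    FOUND=$1
    if [ "$FOUND" = '' ]; then
        echo 'NewFile'
    else
        echo 'Duplicate!'
    fi
}
```

With an empty argument, classify_quoted '' reports NewFile, while classify_unquoted '' falls through to the else branch.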
 
Re-check this line:
Code:
NEWSUM=`sum $DATA_DIR/*.dat|awk '{ print $1 }'`

It should be:
Code:
NEWSUM=`sum $DATA_DIR/*.dat|awk '{ print $1 '}`

Vlad:
Ref:
Code:
 if [ "$FOUND" = '' ]

Following the "=" sign are 2 single-quotes, right? I typically use double-quotes, but I recall in the past where you folks have told me single-quotes are more predictable. In the posts above, it looks more like a single double-quote, instead of double single-quote.
Damn. That sounds much more confusing than I had planned.
 
#!/usr/bin/ksh

cat fic1 | while read x
do
grep $x fic2
if [ $? -eq 0 ]
then
echo "the file $x is present in the file fic2"
else
echo "The file $x is not present in the file fic2"
fi
done

 
Annihilannic:
I've learned something new today. I've always ended my awk sequences with '} because that's how I was told. It looked wrong, but always worked. I just modified my script to end with }' and it works exactly the same. Maybe there was a bug in an ancient version of awk which required the reverse?
In any case, thank you!
 
Motoslide, that is just your shell interpreting the single quotes.

Try the following commands:

echo '{}'
echo '{'}
echo {'}'
echo {}

Those are all equivalent.

awk needs that subprogram between {}, and seeing as it usually contains characters that have special meaning to the shell, those characters are usually escaped by delimiting the entire awk program in single quotes.

So you can start/end your awk subprogram with '{ or {' / '} or }', your shell doesn't care about that, it just gives the whole string as a positional parameter to awk to process it. It just makes sense to use the '... { ... }' fashion because that makes it easier for us humans to understand...

HTH,

p5wizard
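
The same check can be run against the awk program from this thread. In this sketch the helper names show_arg, first_field and first_field_alt are made up; show_arg just prints the one argument the shell hands over, so both spellings can be compared directly.

```shell
#!/bin/sh
# show_arg ARG (hypothetical helper): print the single argument exactly
# as the shell passed it.
show_arg() {
    printf '%s\n' "$1"
}

# The closing brace inside or outside the quotes yields the same string.
inside=$(show_arg '{ print $1 }')
outside=$(show_arg '{ print $1 '})

# first_field / first_field_alt (hypothetical): identical awk programs,
# written with the two quoting styles discussed above.
first_field() {
    awk '{ print $1 }'
}
first_field_alt() {
    awk '{ print $1 '}
}
```

Both variables hold the same text, and both functions print the first field of each input line.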
 