Find Duplicate copies of a Source File

Status
Not open for further replies.

zenenigma

Programmer
Apr 23, 2001
119
US
Background:

We have a program that interacts with hundreds of virtual "folders", each of which holds 30-40 PDF files. The PDF documents are stored on a network drive and can be in any one of dozens of subfolders depending on when they were added. To create all of these documents, we took 100 or so original PDF files and ran a script to duplicate them over and over to populate the virtual folders. So each original PDF file likely has 100+ copies.


Issue:

One of the original source PDF files (that we duplicated) contains information that we need to redact. We've found the location of one of the copies, but I have been told that finding the locations of all the copies will take a while. I know the file is between 212 KB and 213 KB. From glancing at the other files, file size looks like the quickest way to find the duplicates.


What I've Tried:

Windows Search - Useless; you can only search by file sizes that are "at least" or "less than" a value, not within a range.

Half a dozen "Duplicate File Finder" programs. The problem is that none of them lets me set size filters on which files should be analyzed, and none lets me pick an "original file" and say "find all copies of this particular file".


What I'm looking for:

A program that will let me enter either a) the original file so it can find all duplicates of it, or b) a file size range so it doesn't search for hours over my network.

If it's wishful thinking and no one has created it, then I guess I'll just have to let one of these "Duplicate" programs run overnight and give me a list of 100 files that have 100+ duplicates each.
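
For option (b), a short script can do the size filtering that Windows Search and the duplicate finders would not. What follows is only a minimal sketch in Python, not a tested tool: the function name find_by_size is made up for this example, and it assumes "212kb - 213kb" means 212*1024 to 213*1024 bytes and that every copy has a .pdf extension in any letter case.

```python
import os

def find_by_size(root, min_bytes, max_bytes, ext=".pdf"):
    """Yield (path, size) for files under root whose size is within
    [min_bytes, max_bytes] and whose name ends with ext (case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(ext):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if min_bytes <= size <= max_bytes:
                yield path, size
```

It would be called as something like find_by_size(r"\\server\share\Volumes", 212 * 1024, 213 * 1024), where the UNC path is a placeholder for wherever the PDFs actually live. The walk only reads directory entries and file sizes, never file contents, which is why it should be much faster over a network than a full duplicate scan.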


Any help would be appreciated.

-ZE
 
hi,
I don't understand whether the duplicate files have the same name.
If the names differ in some parts, is it possible to tag
each document with a unique key (plus other info) and then
search for that key?

If you find a commercial program, or write some
scripts/programs that load/update a DB, this key could be useful.

bye
vic
 
Dippn, I will check those links. Thank you very much.

I found one program that found the documents quickly using file size, but it had no way to export the list - all I could do was delete from within the program (which wouldn't work, because I need to replace the files, not just delete them).
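
Both missing pieces (exporting the list, and replacing rather than deleting) are easy to script once the matching paths are known. A rough sketch, assuming the paths are already in a Python list and the redacted PDF has been prepared separately; export_list and replace_files are names invented for this example.

```python
import csv
import shutil

def export_list(paths, csv_path):
    """Write one path per row to a CSV file, with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["path"])
        for p in paths:
            writer.writerow([p])

def replace_files(paths, replacement):
    """Overwrite every file in paths with the contents of replacement.
    Returns the number of files overwritten."""
    count = 0
    for p in paths:
        shutil.copyfile(replacement, p)
        count += 1
    return count
```

Exporting first and reviewing the CSV before calling replace_files is the safer order, since the overwrite cannot be undone without a backup.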


Victorv, the duplicate files *may* have the same name. The file name will always be a 2-digit number followed by the .PDF extension. It is built on a 3-tier folder system (best way I can think to describe it).

So if you are adding the first file to the system, it would go here:

Volumes\00\00\00\01.PDF
Volumes\00\00\00\02.PDF
etc.

After 99 PDFs are created in that folder, the next PDF would be:

Volumes\00\00\01\01.PDF

And so on.
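
To make the numbering concrete, here is a small sketch that maps the Nth file added to its path. Only the first step is confirmed by the examples above (file 100 lands in 00\00\01\01.PDF); that each folder level runs from 00 to 99 before the next level up increments is an assumption, and volume_path is a made-up name.

```python
def volume_path(n, root="Volumes", per_folder=99, per_level=100):
    """Return the path of the n-th file added (n starts at 1).

    Assumes 99 files (01-99) per leaf folder, as described above, and
    ASSUMES each of the three folder levels counts 00-99.
    """
    if n < 1:
        raise ValueError("n must be 1 or greater")
    folder, file_no = divmod(n - 1, per_folder)
    file_no += 1  # file names start at 01, not 00
    top, rest = divmod(folder, per_level * per_level)
    mid, leaf = divmod(rest, per_level)
    if top >= per_level:
        raise ValueError("n is beyond what three two-digit levels can hold")
    parts = [root, f"{top:02d}", f"{mid:02d}", f"{leaf:02d}", f"{file_no:02d}.PDF"]
    return "\\".join(parts)
```

With these assumptions, volume_path(1) gives Volumes\00\00\00\01.PDF and volume_path(100) gives Volumes\00\00\01\01.PDF, matching the examples above.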

As for coding the documents, that would make sense going forward - but I'm hoping this will be the only occurrence where we'll need to swap out a file.

Another employee is currently struggling to find the files using database scripts.
 
What happens if you search for *.pdf and after the search is finished you set the view of the search results to "sort by size" in the details view, then select all the .pdf files with the correct size and copy (or move) them to a new folder, double check you have the right files and make your amendments? After the copy is done, the old files can be deleted from the search view window.

You will have to paste the amended files back into their correct locations or create new locations.
 
You've found nothing that will check duplicates based on CRC or MD5 value?
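
If nothing off the shelf does it, comparing against one known original by checksum is not much code. A minimal sketch using MD5 from Python's standard hashlib; find_copies and file_md5 are names invented here. It only finds byte-identical copies, so if the duplication script altered anything inside the PDFs (the 212-213 KB spread in the first post might hint at that), it would miss them and a size filter would be the fallback.

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_copies(original, root):
    """Return paths under root whose contents match original.

    Size is compared first so that only same-size files are read and
    hashed. The original itself is included if it lives under root.
    """
    want_size = os.path.getsize(original)
    want_hash = file_md5(original)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getsize(path) != want_size:
                    continue
                if file_md5(path) == want_hash:
                    matches.append(path)
            except OSError:
                continue  # unreadable file; skip it
    return matches
```

Because of the size check, only files that already match the original's exact byte count get read over the network, which keeps the run short compared with hashing everything.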

Anyway, if this is an NTFS volume, you ought to look into hard links or symbolic links (depending on the scenario). Something that will make the file show up in all those other places, but point to a single original copy. That way, you change your original copy and the change shows up in all the other places.
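
As a rough illustration of the hard-link idea, the sketch below swaps each known copy for a hard link to the original using Python's os.link. The name relink is made up, and it assumes every path is on the same volume as the original, since hard links cannot span volumes; whether this works through a given network share is something to test first.

```python
import os

def relink(original, copies):
    """Replace each path in copies with a hard link to original.
    Returns the number of paths that were relinked."""
    count = 0
    for path in copies:
        if os.path.samefile(original, path):
            continue  # already the same file (or the original itself)
        tmp = path + ".tmp-link"
        os.link(original, tmp)   # create the new link next to the copy
        os.replace(tmp, path)    # then swap it in over the old copy
        count += 1
    return count
```

One caveat: the links only stay in sync if the original is later modified in place. A tool that saves by writing a new file and renaming it over the old one would leave the other links pointing at the old contents.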

Measurement is not management.
 