Need advice on formatting/files per folder in image archive

Status
Not open for further replies.

Tranman

Programmer
Sep 25, 2001
695
US
Hi All,
I am about to create an image archive at work. It will initially contain about 520,000 PDF files, ranging in size from 150KB to 550KB, with the mean size being about 240KB. We will be adding about 70,000 new images to the archive every year.

The archive will reside on a 500GB Seagate USB drive.

The image files will never be modified, and will not be opened on the drive--only copied.

The images will be indexed in an .mdb file, which will be maintained by a VB.Net program, with ADO.Net, which I have not yet written.

As far as I know we will not be using Windows Explorer on the drive at all.

So my questions are:
What is the optimum number of files to put in a folder? Obviously, I don't want to drop all 520,000 files in a single folder. I've heard 2,000. I've also heard anything less than 10,000, and anything less than 15,000.

What size should my disk clusters be? The initial load of files will be 515,000 images and will take up about 125GB. I figure that if I set the cluster size to 32KB, and *if* every file wasted an entire cluster (which they won't), that would only amount to ~17GB, which would be an acceptable amount of waste on this huge drive. Performance is an issue.

Should I disable 8.3 generation? File names are not unique in the first 6 bytes.

Should I enable/disable indexing?

Does anyone who has done a project like this have any other advice/know of anything I'm not thinking of?

Tranman





"Adam was not alone in the Garden of Eden, however,...much is due to Eve, the first woman, and Satan, the first consultant." Mark Twain
 
If you don't do anything special at all - i.e. if you just plug the drive in, format it and then save files to it - everything will just work and you'll have plenty of room for all your current images and about 20 years' worth of new ones. I therefore assume that you have some special considerations that you haven't mentioned.
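The "about 20 years" figure can be sanity-checked from the numbers in the original post (500GB drive, 520,000 files at a mean of 240KB, 70,000 new files a year). This is only a rough sketch: it assumes decimal units (1GB = 1,000,000KB), ignores file-system overhead and cluster slack, and the function name is made up for illustration.

```python
def years_of_headroom(drive_gb, initial_files, files_per_year, mean_kb):
    """Estimate how many years of new files fit after the initial load.

    Assumes decimal units and ignores file-system overhead and slack space.
    """
    initial_gb = initial_files * mean_kb / 1_000_000   # space used by the first load
    yearly_gb = files_per_year * mean_kb / 1_000_000   # growth per year
    return (drive_gb - initial_gb) / yearly_gb


years = years_of_headroom(drive_gb=500, initial_files=520_000,
                          files_per_year=70_000, mean_kb=240)
print(f"Roughly {years:.0f} years of growth fit on the drive")
```

With these inputs the estimate comes out a little over 20 years, which agrees with the figure above. A formatted drive holds somewhat less than its nominal capacity, so the real number would be slightly lower.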

Disabling indexing will probably make the drive slightly faster to read from/write to but slower to search, but if you're going to be tracking the files via an Access database then you're not going to be using Windows search to look for them anyway.

Changing the cluster size can make more (or less) efficient use of the storage space so you could go for smaller clusters, but you have more than enough room anyway so storage efficiency isn't that relevant.
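To put numbers on the storage-efficiency point, here is a minimal sketch of the slack-space arithmetic for a few cluster sizes. It assumes decimal units (1KB = 1,000 bytes), and the `wasted_fraction` parameter is an illustrative knob, not a measured value: 1.0 is the worst case from the original post (every file wastes a whole cluster), and 0.5 is the usual rule-of-thumb guess of half a cluster per file.

```python
def slack_gb(file_count, cluster_kb, wasted_fraction=1.0):
    """Estimate space lost to partly filled final clusters, in decimal GB.

    wasted_fraction is the assumed share of one cluster wasted per file.
    """
    return file_count * cluster_kb * wasted_fraction * 1000 / 1e9


for cluster_kb in (4, 32, 64):
    worst = slack_gb(520_000, cluster_kb)
    typical = slack_gb(520_000, cluster_kb, wasted_fraction=0.5)
    print(f"{cluster_kb:>2}KB clusters: worst case {worst:.1f}GB, "
          f"half-cluster guess {typical:.1f}GB")
```

For 32KB clusters the worst case is about 17GB, the same figure quoted in the original post, so even the pessimistic estimate is small next to a 500GB drive.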

Do you have a special reason for not just leaving things in their default format though? If you want to share the drive amongst several users or if you want high performance then a USB-attached hard drive is not a good way of doing it for lots of reasons, speed and fault tolerance being two that spring to mind.

This Wikipedia page has some good basic info about the NTFS file system. There are some links at the end of the page to more in-depth sites.

Regards

Nelviticus
 
Nelviticus,
Thanks for the response.

"...if you just plug the drive in, format it and then save files to it - everything will just work..."

During the pre-project investigation, as a test, I copied ~400,000 documents to the drive. Up above ~300,000, it got *incredibly* slow--like each file was taking 3 or 4 times as long just to load. It seemed like everything about the drive was taking eternity. I felt like it was unacceptably slow. That is what initially caused me to start investigating possible reasons for the slowdown.

"I therefore assume that you have some special considerations that you haven't mentioned."

I did say that, "Performance is an issue." The person using this drive (a single R/A--the drive will not be shared), will be extracting subsets of the library that vary from <10 files to up in the 40,000 to 50,000 range.

Wasted space is not a problem. That is why I thought that if large clusters ran faster (like I could get the average file to take 8 or 9 physical reads, instead of the 60 or 70 it would take with 4K clusters), it might be an advantage to use them.
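The cluster counts behind that estimate can be worked out as below. This is a sketch of the arithmetic only: it counts how many clusters a file occupies, which is not necessarily the number of physical reads, since that depends on fragmentation and on how the OS batches I/O. The function name is illustrative.

```python
import math


def clusters_per_file(file_kb, cluster_kb):
    """Number of clusters a file of file_kb occupies (last one may be partly full)."""
    return math.ceil(file_kb / cluster_kb)


# Smallest, mean and largest file sizes quoted in the original post.
for file_kb in (150, 240, 550):
    print(f"{file_kb}KB file: "
          f"{clusters_per_file(file_kb, 4)} clusters at 4KB, "
          f"{clusters_per_file(file_kb, 32)} clusters at 32KB")
```

For the mean 240KB file this gives 60 clusters at 4KB and 8 at 32KB, which is where the "60 or 70" versus "8 or 9" figures come from.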

We will have an identical drive for backup, so if one crashes, it's not a big deal. You can buy these things for $129.95 now. Even if both drives failed, we could still reload from DVD backup.

So I guess the bottom line is that I just need to try all of the things I mentioned and see if they help? I felt like the performance when I, to paraphrase, "didn't do anything special at all", was unacceptable.

Tranman

"Adam was not alone in the Garden of Eden, however,...much is due to Eve, the first woman, and Satan, the first consultant." Mark Twain
 
OK, well I used to know a few useful things about NTFS but my knowledge is a little rusty. A bit of Googling for 'NTFS performance' throws up a lot of links which could probably be useful to you. For instance, I had a quick look at this one which implies that your massive slow-down may have been down to the overhead of generating 8.3 aliases for all your files, which if true would mean that disabling 8.3 generation would be beneficial.
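One way to gauge how exposed a set of file names is to short-name clashes is to count how many share their first six characters, since 8.3 aliases are built from the start of the long name. The sketch below is a rough diagnostic only, not a reimplementation of the NTFS short-name algorithm (it ignores spaces, illegal characters and the numeric tail), and the sample file names are invented.

```python
from collections import Counter
from pathlib import PurePath


def prefix_collisions(names, stem_len=6):
    """Return {prefix: count} for prefixes shared by more than one file name.

    The prefix is the first stem_len characters of the name without its
    extension, upper-cased. This simplifies the real 8.3 rules.
    """
    stems = Counter(PurePath(name).stem[:stem_len].upper() for name in names)
    return {stem: count for stem, count in stems.items() if count > 1}


# Hypothetical names for illustration only.
sample = ["INV2008_000001.pdf", "INV2008_000002.pdf", "HR_0001.pdf"]
print(prefix_collisions(sample))
```

If most of the archive falls into a handful of prefixes, that supports the idea that alias generation is doing a lot of extra work in large folders.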

However, I suspect that the improvements you can get from tweaking the file system and layout might be less than the gains you could get from using a different hardware solution. At the bottom end of the scale you could get an external drive with an e-SATA connection (and an e-SATA PCI card if the user's PC doesn't have it), which will be considerably faster than USB. Going up a bit you can get RAID-capable directly attached storage (DAS) boxes which give you fault tolerance and blistering speed for a reasonable price nowadays. The ones I've seen have been Linux-based devices which use their own file system.

Regards

Nelviticus
 
A few things to think about. (BTW I'm a Programmer/System Admin in a large CorpRecords department). We image and manage approx 53 million pages!!!! We keep our index info in a number of tables. ok, a LARGE number of tables.

This is how our server space is laid out. Images (including a number of PDFs) are stored logically by department, for example AP, HR, etc. Within each group we further divide by whatever the department needs. Below this are a number of folders, numbered starting with 0. Within each folder are a MAX. of 1024 files. Once the folder reaches 1024 files, a new folder is created. Once there are 1024 folders, a new level folder is created.

This keeps it fast, and searching is easier, if not faster, when you have a known starting point.

example:

-AP
  -AP1
    -0
      1
      2
      3...1024
    -1
      1
      2
      3...1024

etc. We use NTFS. Document/page retrieval can be anywhere from 1 page to many hundreds at one time, thru both a Web interface and a windows application.
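A layout like the one described above can be derived from a running file number, so the index only has to store that number. The sketch below is one possible reading of the scheme, not the poster's actual code: it assumes a level folder, then a numbered folder, then up to 1024 files numbered from 1. The root path, the `.pdf` extension and the function name are placeholders.

```python
from pathlib import PureWindowsPath

FILES_PER_FOLDER = 1024    # cap taken from the post above
FOLDERS_PER_LEVEL = 1024   # cap taken from the post above


def archive_path(root, seq, ext=".pdf"):
    """Map a zero-based sequence number to root\\level\\folder\\file.

    Folders are numbered from 0 and files from 1, as in the example tree.
    """
    folder_index, file_index = divmod(seq, FILES_PER_FOLDER)
    level, folder = divmod(folder_index, FOLDERS_PER_LEVEL)
    return PureWindowsPath(root, str(level), str(folder), f"{file_index + 1}{ext}")


# "E:/archive" is a hypothetical drive and folder.
for seq in (0, 1023, 1024, 519_999):
    print(seq, archive_path("E:/archive", seq))
```

Under this reading, 520,000 files would fill 508 folders inside a single level folder, so a second level would not be needed until the archive passed about a million files.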


 