
Number of files within directories -- practical or theoretical limits?


astarte (Technical User), Feb 20, 2003, US
I'm helping with a web project that creates static HTML pages from a database. Right now it places all these pages in one directory. So far there are about 10,000, with a good chance of reaching 20,000 by the end of the year.

I don't know much about Linux, but I vaguely remember from Unix training years ago that the filesystem becomes increasingly inefficient once a directory holds more than a certain number of files.

Is this correct? I would like to argue for a different method of page generation, but I don't have facts or theories to back me up.

Any help or pointers to other sources would be very welcome.
 
It's correct.

Each time a file is accessed it has to be looked up by name, and if it isn't in the namei cache (that really is what it's called: "namei"), the search has to step through the directory entry by entry.

You can, of course, have a large namei cache, but at some point that's self-defeating, so it's better to use the divide-and-conquer approach of hierarchical directories. That's why the darn things were invented (there are probably people here too young to remember when there were no subdirectories).
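
To make that concrete, here's a minimal sketch (not from the original project; the md5 hashing, the two-level depth, and the paths are all arbitrary choices) of how a page generator could spread its output across subdirectories so that no single directory grows huge:

import hashlib
import os

def sharded_path(root, filename):
    # Two hex characters per level gives 256 x 256 buckets, so even
    # 20,000 pages average well under one file per directory.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return os.path.join(root, digest[:2], digest[2:4], filename)

def store_page(root, filename, html):
    # Create the bucket on demand and write the generated page into it.
    path = sharded_path(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(html)

# e.g. store_page("/var/www/pages", "performance_1.html", "<html>...</html>")
# (the root path here is made up for the example)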

Things like "sar" can tell you how well your namei is doing on some systems. Tony Lawrence
SCO Unix/Linux Resources tony@pcunix.com
 
By the way, a good book on this subject is "Web Performance Tuning".


Another trick that book mentions is to keep file names different at the beginning rather than at the end. We humans tend to create things like this:

performance_1.html
performance_2.html

etc., but since string comparisons are done left to right, a machine searching a directory will find a name more quickly if the files were named

1_performance.html
2_performance.html

etc.
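
In a page generator that boils down to putting the varying part first when the name is built. A throwaway sketch (the function name and the "performance" slug are just made up for illustration):

def page_filename(page_id, slug="performance"):
    # "1234_performance.html" differs from its neighbours in the very
    # first characters, so a linear directory scan can reject
    # non-matching entries almost immediately instead of comparing a
    # long common prefix every time.
    return "%d_%s.html" % (page_id, slug)

# page_filename(1)  ->  "1_performance.html"
# page_filename(2)  ->  "2_performance.html"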

Good book.

It also reminds me that the "namei" cache is called the "dnlc" (directory name lookup cache) on some systems, and that some filesystems (e.g. Veritas) use hashed directory lookups.


Tony Lawrence
SCO Unix/Linux Resources
tony@pcunix.com
 
The maximum number of subdirectories in one directory is somewhere near 32,000 (that's the limit on ext2/ext3, at least); I don't know if there is a hard limit on the number of files.
One of my machines has a directory with more than 100,000 files in it. If you did an ls on it you'd be there for a very long time, but if you know the exact file name, I think it can open the file in a normal amount of time.
Deleting the files in that kind of directory will also take a very long time (hours), even if you use something more efficient than rm.
It would probably help to use ReiserFS for this, too.
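
For what it's worth, when one of those huge directories finally has to be emptied, streaming the entries instead of listing (and sorting) them all first at least avoids the "ls" overhead; the per-file unlinks will still be slow on a filesystem that does linear directory lookups. A rough sketch using only the Python standard library (the example path is invented):

import os

def purge_directory(path):
    # os.scandir streams directory entries one at a time; it never
    # builds or sorts the full listing the way "ls" or a shell glob would.
    removed = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                os.unlink(entry.path)
                removed += 1
    return removed

# purge_directory("/var/www/old_pages")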
 
Thanks, everybody, for these quick replies. Any more?

> Why must you have static pages?

The database is offline. Once a week we generate the pages, 10,000-plus of them, then we FTP them to the server.

Makes sense? No -- it's unscalable, Internet-illiterate, and rapidly becoming unmanageable. But we have to work as a team, and sometimes morale and goodwill within the team are more important than doing things in the most sensible way. I'm trying to nudge them in the right direction.

I posted here because the hosting organization recently switched from a flaky NT4 server at the end of a leased line to Red Hat Linux with a much better connection. To my surprise, there are no obvious improvements, and there is a very definite degradation of the FTP upload: what used to take 3 hours now takes 48 hours. I was wondering if the different filesystem could be part of the explanation.

 