Backing up Volumes with Millions of files - Suggestions?


markey164 (Technical User) - Apr 22, 2008
Hi All,

Looking for your thoughts on this scenario.

At our site, we have a volume containing ~20,000 user home folders totalling ~30 million files. The volume is on a Netapp Filer, which backs up via NDMP to a Tape Library with 4 LTO4 tape drives.

All other volumes on the NAS back up fine (and fast) with a throughput of 100-200GB/hr, but as soon as the backups reach the above volume, performance is dire.

We had a single subclient for the above volume to begin with, and when the job started it would sit there for at least 12 hours simply scanning the volume before even beginning to back it up. The throughput would often be only around 5GB/hr and rarely went above 50-60GB/hr, obviously because of the sheer number of small files being handled.

The volume has 26 subfolders representing the surnames of the users. We've now set up a subclient per folder to give us a better idea of the breakdown of each folder in our reports. However, it still takes 4-5 hours to scan many of the folders, and the throughput is still only a few GB/hr on average. Also, for some reason the overall backup window for this volume is now longer than before we split it into separate subclients per folder.

We are looking at implementing synthetic fulls; however, whilst this will address the *full* backups, it seems likely the incrementals are still going to run into a two-day window because of the sheer number of files they still have to scan.

Does anyone else have problems backing up volumes with millions of files, and what steps have you taken to improve the situation?

I'm interested in any thoughts/comments/suggestions on this subject.
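
For a sense of scale, here's the rough arithmetic I'm looking at (the total data size below is just an assumed figure for illustration; I haven't quoted the actual volume size):

```python
# Figures quoted above: 30 million files, a ~12-hour scan, and 5-50 GB/hr throughput.
# The total data size is NOT stated above; 3 TB is purely an assumed figure.
FILES = 30_000_000
SCAN_HOURS = 12
ASSUMED_VOLUME_GB = 3_000

scan_rate = FILES / (SCAN_HOURS * 3600)
print(f"Scan phase: ~{scan_rate:,.0f} files/second, sustained for {SCAN_HOURS} hours")

for gb_per_hr in (5, 50):
    print(f"Data phase at {gb_per_hr} GB/hr: ~{ASSUMED_VOLUME_GB / gb_per_hr:,.0f} hours")
```

Even at the optimistic end of that range, the data phase alone runs into days for the assumed size.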
 
With such a scenario, you will most likely not be able to keep a fast drive in streaming mode. Using faster drives will make the situation even worse, at least for the media and the drive, and it could also increase the backup time.

IMHO, a snapshot-based solution is what you should consider.
 
605, when you say 'fast drive', are you referring to the fact that we are using LTO4? And why would using faster drives make the situation and the backup time worse?

What do you mean by streaming mode? I've looked at Books Online and the knowledgebase, but I can't find any reference to the term.

When you say a snapshot solution is what we should consider, please can you elaborate? We are already using the snapshot features of the filer on a 30-day cycle for short-term recovery of files, using the shadow copy features, but for DR purposes we obviously still need to back up to tape. So I'm not sure what else you might mean?
 
I'll let the previous poster answer your questions as to what he meant, but in general terms, a faster drive can sometimes have problems writing data that is being fed to it slowly.

The problem exists because drives want to write data to tape fast, and when the rate at which data is being sent to them cannot match that high speed, they cannot sustain writing at the intended speed. That has one of two consequences.

Firstly, modern drives (which probably includes your LTO4s) can slow the tape down so that it passes over the heads more slowly, in order to match the tape speed to the flow of data being written, although I'm not sure the speed can be *exactly* matched (perhaps some drives only have 2, 3 or 4 speeds - you know: normal speed, half speed, quarter speed?).

Secondly, drives that cannot slow down, or cannot slow down enough to exactly match the speed of the incoming data, suffer from an effect called "shoe-shining". This is where the drive isn't being given data at the high speed it expects, so it has to stop the tape, rewind it a little, and start it forward again. This constant stopping, rewinding and forwarding is reminiscent of someone shining your shoes (rubbing a cloth back and forth over them) and is very bad for performance, and for head life! With the tape going back and forth over the heads, it's claimed they can actually wear out faster. Performance is sometimes atrocious as a result of shoe-shining, and the data sometimes ends up being written slower than the rate of the already slow feed! That is why a faster drive sometimes won't improve performance, and can sometimes make it worse! Ask your drive's vendor whether it has variable speeds (and how many), and how it avoids the shoe-shining effect.

In a nutshell then, it would seem (from the information you've given) that you would get the most benefit from trying to ensure that you get the data to the drives faster, as the drives don't appear to be the problem, but I think you already knew that.
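
To put some rough numbers on the shoe-shining effect, here is a very simplified model. Every figure in it (buffer size, back-hitch time, minimum speed) is an assumption for illustration, not a spec for any particular drive:

```python
# Very simplified shoe-shine model. Per cycle the drive writes one buffer's worth
# of data; the cycle takes whichever is longer: refilling the buffer at the feed
# rate, or writing it out plus one back-hitch (stop, rewind, restart).
# Every figure here is an assumption for illustration, not a spec for a real drive.
feed_mb_s = 15.0          # data arriving from the backup (~54 GB/hr)
min_native_mb_s = 40.0    # assumed slowest sustained speed the drive supports
buffer_mb = 64.0          # assumed drive buffer
backhitch_s = 3.0         # assumed stop/rewind/restart time

if feed_mb_s >= min_native_mb_s:
    print("Feed rate >= minimum native speed: the drive can stream.")
else:
    fill_s = buffer_mb / feed_mb_s                        # host refills the buffer
    drain_s = buffer_mb / min_native_mb_s + backhitch_s   # burst write + reposition
    effective_mb_s = buffer_mb / max(fill_s, drain_s)
    print(f"Effective write rate: ~{effective_mb_s:.1f} MB/s "
          f"(vs. {feed_mb_s:.0f} MB/s arriving) - the back-hitches cost throughput.")
```

The exact numbers don't matter; the point is that once the drive has to back-hitch, the effective rate can drop below even the slow rate you were feeding it.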

It would seem that both the scan and the backup itself are taking a long time. You have done exactly the right thing by placing your user folders under 26 separate folders, labelled A-Z. I have seen sites that put them all in one folder, with awful results! I wouldn't split them into 26 separate subclients though (too many to run at once).

I suggest 7 subclients, each with its full on a different day of the week. Try, by experimentation, grouping the 26 folders into the 7 subclients so that they are roughly evenly balanced, both by NUMBER of files and by SIZE of data to back up (check the Galaxy reports every day for a while to get the balance about right). It is a sad fact that surnames are not evenly spread across the alphabet, with lots more starting with "S", for example, than "X". You might find that to be evenly balanced you have to put several folders in one subclient (for example W, X, Y, Z) but only a couple in another (for example A and S). Note that even 7 subclients will be too many if the server doing the backup is not very fast.
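
If it helps, here is a minimal sketch of that grouping step, greedily packing the heaviest folders into whichever of the 7 subclients is currently lightest. The per-folder figures are invented placeholders - in practice you would lift them from the Galaxy reports:

```python
from collections import defaultdict

# folder -> (millions of files, GB of data); placeholder figures only --
# in practice pull these from the daily Galaxy reports.
folders = {
    "A": (1.8, 1700), "B": (1.5, 1400), "C": (1.6, 1500), "D": (1.0, 900),
    "E": (0.6, 500),  "F": (0.8, 700),  "G": (1.1, 1000), "H": (1.7, 1600),
    "I": (0.3, 250),  "J": (0.9, 800),  "K": (1.0, 950),  "L": (1.2, 1100),
    "M": (2.0, 1900), "N": (0.7, 600),  "O": (0.5, 450),  "P": (1.3, 1200),
    "Q": (0.1, 80),   "R": (1.2, 1150), "S": (2.6, 2500), "T": (1.4, 1300),
    "U": (0.2, 150),  "V": (0.4, 350),  "W": (1.5, 1450), "X": (0.05, 40),
    "Y": (0.3, 280),  "Z": (0.2, 180),
}

def load(files_m, size_gb):
    # Weight file count and data size equally after normalising; tweak to taste.
    return files_m / 2.6 + size_gb / 2500.0

groups = defaultdict(list)
totals = {day: 0.0 for day in range(7)}   # one subclient, one full per weekday

# Greedy packing: heaviest folder first, always into the lightest subclient so far.
for name, (files_m, size_gb) in sorted(folders.items(),
                                       key=lambda kv: load(*kv[1]),
                                       reverse=True):
    day = min(totals, key=totals.get)
    groups[day].append(name)
    totals[day] += load(files_m, size_gb)

for day in range(7):
    print(f"Subclient {day + 1}: {', '.join(sorted(groups[day]))}")
```

It will never be perfect (the folders themselves are lumpy), but it gives you a first cut to refine against the daily reports.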

Other suggestions:

* Delete unwanted files (and if the users won't do it, you do it for them! :)

* Make sure there is plenty of free space on the disk (a full disk can cause horrible fragmentation)

* Perform regular defragmentation of the disk, even if Windows says (as it notoriously does) that defragmentation is not needed (look at the fragmentation report yourself).

* Backup a snapshot of your disk rather than the live disk itself (I think this was what the previous poster was getting at). This means take a snapshot of the disk and let the backup run on that. At the end of the backup, dissolve the snapshot. Yes I know that won't make the thing run faster but it does take the load off the live disk (and maybe shift the work to a different server too).

* Best of all, move some of the user folders (e.g. M to Z) onto a different disk. There are limits to how much data can be scanned and shifted in short periods of time, particularly with millions of little files to deal with. The problem of poor performance is almost certainly on the Windows file system side, not in the backup software, which of course will just move data from disk to tape as quickly as the hardware allows. Windows is probably straining at having to open and close so many files so quickly, with all the overhead involved in doing that (for any file, regardless of size).

* Actually there is an even better solution: put your user files on a Unix platform. And please do not think that I am joking here. I think you will be surprised at just how much more quickly things happen in a proper operating system. There is unlikely to be any effect on your users by storing their files on a different platform. Oh, and use a Unix system for your Galaxy Media Agent as well. Then watch the data fly!

None of the above are quick fixes to your problem - sorry. You may need to get someone in if this problem continues to cause you difficulties.

Good luck.
 
Dear Craig,

That's some very interesting information there. I've never heard of the 'shoe-shining' effect, but I will definitely be investigating this further.

I have, though, found a relevant article on the HP forums which claims (not necessarily factually) that shoe-shining shouldn't affect modern LTO drives. It does mention the drives being variable-speed between 40GB/hr and 120GB/hr, but not what happens when the data rate drops below 40GB/hr.

I will find out more from our suppliers on this point, as we definitely need to know more about it.

I forgot to mention that we actually SnapMirror the data from the primary filer (where the home drives are served from) to a DR filer, and it is the DR filer that is backed up to tape. So we are already backing up a snapshot of the data rather than the live data itself (which is possibly what the previous poster was getting at). I should have mentioned that point.

In discussion here, we have already concluded that there has been a considerable increase in the backup window since we went from a single subclient to 26 subclients. Hence our next test is to bring it back down to 4 subclients, which gives one subclient/job per tape drive and might balance things a bit better.

The 7 subclients (one full on each day of the week) is an interesting idea though, and we will seriously consider that option as well.

Funny you should mention putting the files on a Unix platform: that is the very platform we moved them *off*. We had performance issues serving them from Unix boxes via a Samba layer to Windows clients. I'm not sure of all the details on that side of things, other than that it was causing performance problems that apparently could not be resolved except by moving the data off Unix.

Also note that the OS on our NetApp filer is NetApp's own ONTAP, not Windows ;o), so Windows isn't involved in this particular backup issue. The NetApp also does continuous disk defragmentation as part of its maintenance/housekeeping procedures.

You've given me some food for thought, but I'm interested in any more thoughts/comments on this.

Cheers

Mark
 
Hi again.

Well, it looks like you've already covered some of my points, and your replies raise some very interesting ones of their own.

Would you like to share here what you find out from your tape drive vendor?

Good luck with it.
 
'Streaming mode' is another term for constant tape motion (no repositioning).

If you cannot stream one drive with your data, you will obviously not be able to stream another tape drive that you might add to achieve higher throughput. A lot of users do this and are surprised that the situation becomes worse.

The faster the tape drives run, the more they will 'overshoot' when there is a data 'gap'. The more they do, the longer the repositioning time will become.

LTO4 drives will slow down and try to adjust, but every tape drive has a minimum write speed.
 
> If you cannot stream one drive with your data...

How can you positively confirm whether a given drive is streaming or not?
 
Actually, you can hear it from the drive noise - though that's a bit hard when the drive is inside a jukebox.

Otherwise, the only way to check is indirect: via the backup transfer rate (for a given drive speed).
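
For example, something along these lines, taking the 40GB/hr lower bound quoted from the HP forums earlier in the thread as an assumed minimum streaming speed (confirm the real figure with your drive vendor):

```python
def is_probably_streaming(gb_written: float, elapsed_hours: float,
                          min_streaming_gb_hr: float = 40.0) -> bool:
    """Compare a job's observed transfer rate with the drive's assumed minimum
    streaming speed (40 GB/hr here, per the HP forum figure quoted earlier in
    the thread - confirm the real value with the drive vendor)."""
    observed = gb_written / elapsed_hours
    print(f"Observed {observed:.0f} GB/hr vs. assumed minimum {min_streaming_gb_hr:.0f} GB/hr")
    return observed >= min_streaming_gb_hr

# Roughly the slow volume described above: ~5 GB/hr, so almost certainly not streaming.
is_probably_streaming(gb_written=60, elapsed_hours=12)
```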
 
Or you could have a look at the repositioning information in the drive logs.
 
> Or you could have a look at the repositioning information in the drive logs.

Could you be more specific about these logs? Where are they? What are they called? What writes them?

It seems that there are very poor methods for sys admins to see if a drive is streaming properly or not. Listen to it? Come now. What if it's in a noisy environment (and whose aren't)? What if it's inside a big library and you can't get near it? What if it's across the other side of the city, the country, the world, the universe?

Let's all go out in the streets and demonstrate. Let's tell our drive vendors that we want better ways to determine what the drive is doing, and especially how it's coping with streaming. Has it slowed down? By how much? Is that managing to allow streaming or not?
 
> Actually, you can hear that by the drive noise

Not possible to hear in our setup. We have 4 drives all physically installed next to each other in a large rack-height Qualstar tape library, which itself is in a noisy data centre.

We do have good support from Qualstar, so I will see what info they can provide in this area. The drives themselves are actually manufactured by IBM, but we'll see what we can find out.
 
Have you considered Synthetic Fulls?

We have a server with roughly 16 million scanned images (.tif files totaling 350GB or so). It takes a full 72 hours to do a "Regular" full from start to finish, but only about 20 minutes on average to do its nightly incremental, and just over 3 hours to complete a "Synthetic" full.
 
> It takes a full 72 hours to do a "Regular" full from start to finish

For one of my servers (7M files / 1.3TB) I have a scan time of 3 hours and a transfer time of 11 hours, so I'm surprised that 16M files / 350GB would take so long - what was your scan time / data transfer breakdown?
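
For what it's worth, here is the quick arithmetic behind my surprise, scaling my observed rates to 16M files / 350GB (purely illustrative - the scan/transfer split of your 72-hour job wasn't given):

```python
# Rates observed on my 7M-file / 1.3TB server (3 h scan, 11 h transfer),
# scaled to 16M files / 350 GB. Purely illustrative.
scan_rate_files_s = 7e6 / (3 * 3600)      # ~650 files/s
xfer_rate_gb_hr = 1.3 * 1024 / 11         # ~120 GB/hr

est_scan_h = 16e6 / scan_rate_files_s / 3600
est_xfer_h = 350 / xfer_rate_gb_hr
print(f"Estimated at my rates: {est_scan_h:.1f} h scan + {est_xfer_h:.1f} h transfer "
      f"= {est_scan_h + est_xfer_h:.1f} h, nowhere near 72 h.")
```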

> Have you considered Synthetic Fulls?
Synthetic Fulls may be a good solution here, but you need to make sure that you enable "Verify Synthetic Fulls".
Verifying Synthetic Fulls ensures that files that are static in nature are not lost.

Markey164 - just 2 quick questions:
Have you run any performance monitoring to see if the delays are hardware based?
Is it possible that you have OnAccess AntiVirus, or something similar slowing things down?




 
I don't have any regular fulls left in my job history to look at for the details of how long the scan phase took versus the data transfer phase. I just remember that the few times we did a regular full, we started the job on Friday night and it would run well into Monday afternoon. Our longest retention period is 1 year, so it has been more than that long since the last regular full was run on this box.

In our case, the server is pretty old hardware, DL380 G2 and the data exists on drives in an added SCSI shelf consisting of multiple 72GB 10K drives in a RAID 5 config.

We have had to do several restores of various files and folders and have never had a problem. I am very happy running Synthetic Fulls on this and 3 other servers in our environment where those other 3 are across slow WAN links.
 
> In our case, the server is pretty old hardware

Fair call. Given the hardware and storage, I guess 72 hours probably isn't that much of a stretch.


> We have had to do several restores of various files and folders and have never had a problem.

Do you "verify Synthetic Fulls"?
 
You should consider the Direct to Disk Option (DDO). I don't know if NetApp filers work the same way, but we have an EMC filer that requires no scanning to perform a backup.

Craig sounds like he is correct when he says your disks just can't keep up with the LTO4 drives. How many disks do you have allocated to the volume on your filer that is giving you the issues? What you could do is add a subclient for the volume in question, limit its streams to 2 or 3, change your data path, and set the policy to use a specified number of resources (tape drives) - in other words, match the filer's abilities to the tape drives. You could try changing the chunk size on your tapes, but I doubt that would resolve your scanning issue.

I had a VTL for almost a year emulating an IBM LTO4 drive; it ran very well, but management was a pain. I have since taken the Unix head off it, offloaded compression to the MA, and moved the CommServe off, and I literally went from managing CommVault every day to not having touched it in a month. DDO is the way to go - CommVault handles it so much better, and they will tell you this.
 
Look at CommVault's Data Classification Enabler (DCE) - it's designed to significantly speed up the scan by using its own separately maintained change database. It is separately licensed, but it might be worth it for the system affected by your problem. Perhaps CV will give you an evaluation key for it so you can try it out.

Go to Books Online and search for "Data Classification Enabler".
 
A Synthetic Full backup policy would ensure that you only back up the incrementals, which will reduce the backup time. The DCE will reduce the scan phase.

If, after using these two, you are still not happy, then the Image Level iDA is the way to go; this is block-level, so the fact that the file system is populated with millions of files becomes irrelevant.

In regards to a backup target: yes, you need to stream the tape drive for it to be an efficient device. Using disk targets is an option, and certainly the way to go if you want to use synthetic full backups - the incrementals can go to disk while the full/synthetic full goes to tape.

---------------------------------------
EMCTA & EMCIE - Backup & Recovery
Legato & Commvault Certified Specialist
MCSE
 