speeding up disk staging?

I have been given the task of speeding up our unacceptably slow backups. We use BAB 11.5 Enterprise on a fast server with a library capable of holding 8 SDLT320 drives but with only two fitted. There are two jobs configured and both use disk staging to (supposedly) make the backups faster. Of the average 40 hours a full backup takes, around 30 hours is taken by the staging and only 10 hours by the migration from disk to tape.

I can see that some of the 30 sources are much slower than others and it's clear that any bottlenecks are in the network or sources rather than at the tape server.

My first improvement was to set the migrations to start at the end of each staging sub-job so that the 10 hours of migration is mostly layered over the 30 hours of staging. I think prioritising the slowest sources to stage first will further help this.
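
To sanity-check the overlap, here is a minimal Python sketch of the idea (the per-source hours are made up, and it assumes the migrations queue on a single tape drive), comparing "all staging, then all migration" with starting each migration as soon as its staging sub-job finishes:

```python
# Rough model of layering migration over staging (hypothetical numbers).
# Staging sub-jobs run back to back; each source's migration to tape can
# start as soon as its staging finishes, but migrations share one drive.

stage_hours = [6, 5, 4, 4, 3, 3, 2, 2, 1]       # hypothetical per-source staging times
migrate_hours = [h / 3.0 for h in stage_hours]  # migration assumed ~3x faster than staging

# Serial approach: all staging first, then all migration.
serial_total = sum(stage_hours) + sum(migrate_hours)

# Overlapped approach: each migration waits for its staging AND for the drive.
stage_done = 0.0
drive_free = 0.0
for s, m in zip(stage_hours, migrate_hours):
    stage_done += s                      # this source finishes staging here
    start = max(stage_done, drive_free)  # wait for the data and for the drive
    drive_free = start + m               # drive is busy until this migration ends

print(f"serial:     {serial_total:.1f} h")
print(f"overlapped: {drive_free:.1f} h")
```

With those made-up figures the 10 hours of migration all but disappears inside the 30 hours of staging, which is the effect I am after.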

What I would really like is to have the equivalent of multi-streaming or multiplexing for the disk staging. Would I be right in believing the best way to make this happen is to get more tape drives and then split the existing jobs into smaller ones, with the slow clients spread evenly over the jobs? In this way, each job will perform staging at the current slow rate, but I will have more staging jobs running concurrently and the total job time will drop.
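
If it helps, here is a similarly rough sketch of the splitting idea (same made-up hours; it assumes the staging disk and the network can feed each extra job without becoming the new bottleneck). Sources are dealt out slowest-first to the least-loaded job, and the longest job sets the wall-clock time:

```python
# Spread sources across N concurrent staging jobs, slowest first (greedy),
# and see how the longest job (= total wall-clock time) shrinks.

def wall_clock(source_hours, n_jobs):
    jobs = [0.0] * n_jobs
    for h in sorted(source_hours, reverse=True):  # deal out the slowest sources first
        jobs[jobs.index(min(jobs))] += h          # onto the currently lightest job
    return max(jobs)                              # the slowest job sets the total time

sources = [6, 5, 4, 4, 3, 3, 2, 2, 1]             # hypothetical per-source staging hours
for n in (2, 4):
    print(f"{n} concurrent jobs -> ~{wall_clock(sources, n):.1f} h of staging")
```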

If there is another way to conveniently get more staging jobs running concurrently, I would really appreciate hearing your opinion...
 
Multistreaming and multiplexing do not work with disk staging.

The typical cause of slow throughput is a large number of small files and for that, there is little that can be done.
 
Check your patch level and make sure you have SP1 and Device Patch 6 applied - I'm sure there is a fix for one cause of slow disk staging in one of them.

Running the copy job straight after the backup could have its own issues - not because the software can't cope with it, but more so if ARCserve is merging a lot of detail records from the catalog files after the backup. Merging catalog file detail into the database can be very disk intensive, especially if your ASDB resides on the same physical or logical partition as your disk staging area.

As DavidMichel says, MUX and multistreaming are not supported for disk staging, and really there's no need for them if you think about it: there's nothing to stop you dividing up your estate into different jobs and running them concurrently.

You could also hedge your bets, if you're not already, and use both disk staging and the tape drives at the same time, in which case you could use MUX or multistreaming for the jobs going straight to tape. Perhaps identify the really slow servers and create a MUX job to tape just for those.

Once you start pushing the limits of your system you'll start hitting the other limiting factors, like insufficient cache on disk arrays/enclosures AND array controllers (not necessarily one and the same thing). Once you get to that level you'll need to start looking at perfmon and other traces to identify and work around the bottlenecks.

Going back to the DB disk activity, you could always enable the catalog database if you haven't already, meaning that detail files aren't automatically merged at the end of a backup but are merged on demand. This avoids the huge disk activity at the end of the backup, and also keeps your database at a more manageable size.

Assuming you have a separate dedicated backup server, I would schedule your migration to tape to start after any catalog files have been merged into the database (if you decide to go that way).
 
Ah Ha! They have not applied ANY patches to Arcserve that I am aware of - applying SP1 was right at the top of my to-do list and I will also find that device patch. Excellent advice!

I know that MUX and multistreaming are not supported for staging - I was hoping to 'simulate' their benefits using other techniques.

This tape server is an HP ML350 with dual 3GHz CPUs, 3.5GB RAM, a RAID 5 array dedicated to staging, a SAN connection also dedicated to staging, and the system and ASDB on a smaller RAID 0+1 array. What I can see is that when there are two migration jobs running, they each get three times more throughput to tape than the best staging job gets to disk. For this reason, I think that running more staging jobs concurrently is well within the tape server's capacity. There will obviously be a saturation point somewhere, but doubling the number of concurrent staging jobs sounds possible to me.
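
A quick back-of-envelope version of that reasoning (the MB/min figure is hypothetical, standing in for whatever the best staging job actually achieves):

```python
# Back-of-envelope headroom check with a hypothetical staging rate.
best_stage_rate = 500                # best single staging job to disk, MB/min (made up)
migrate_rate = 3 * best_stage_rate   # each migration reads ~3x that from the staging disk
concurrent_migrations = 2

proven_throughput = concurrent_migrations * migrate_rate
print(f"the staging array already sustains ~{proven_throughput} MB/min during migration,")
print(f"roughly {proven_throughput // best_stage_rate} staging jobs' worth of I/O,")
print("so doubling the concurrent staging jobs looks feasible on the server side")
```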

I should also mention that the strategy with staging is to have at least a week of retention time on the staging disks so that they can do quick restores of recent data. I will not be allowed to put the slower clients direct to tape, even though it is a good idea.

I have not had to use catalogs before. I was planning to upgrade to a SQL database, possibly on its own array or on a SAN disk. The current DB size is 3.2GB and it's definitely slower than I want. I will take that good advice about DB activity on board and try to minimise it. I only altered the jobs last night, so I will see how the daily jobs compare to last week's.

It looks like my strategy will be:

Patching
Two more tape drives
Splitting the jobs into smaller concurrent chunks
Possibly a SQL DB
Fixing the client-side issues if possible
 
One thing to be aware of with the HP/Compaq kit is that the built-in SmartArray controllers have either no cache or a piddling small amount that is not configurable - if you have one of these it will definitely be a bottleneck; I have come across this in several configurations in the past.

Whilst SQL is certainly better than the built-in VLDB for an enterprise configuration, you have to bear in mind that if you host SQL on the same server as the backup machine it is also going to suck some CPU and memory - the larger the DB, the more it sucks :) You can keep the DB in check by running a weekly housekeeping job, and then it pretty much takes care of itself.

You might still want to consider catalog.db (you can use this with SQL or VLDB) - when using staging it also does housekeeping of the catalog files from the disk, and catalog files are usually a lot smaller than when their detail is expanded into the db.

I would start by looking at what array card you have and (a) whether it has cache, and (b) what ratio it's set at. From memory they are normally set to 75% read / 25% write - in your case it might make more sense to reverse this so that more of the memory is used to cache writes.

It would also be interesting to see whether staging to the SAN is significantly faster than staging to the SCSI-attached storage (as we would expect), or whether it tops out at around the same figure for both SCSI and fibre storage.

Go with the patches first and then look at the other stuff if that doesn't help; I'm sure there are some obscure backup-to-disk fixes in SP1 that relate in part to what you're describing. You could also dice with death and look at available firmware and driver updates for the array and fibre cards as well, although I have to admit that most issues I see with fibre card drivers relate to tape-attached backup rather than the other way around.
 
I had a look at the tape server's SCSI card - it's a SmartArray 6400 with 192MB of battery-backed cache. The cache is split 50/50 between read and write; I will switch it to favour writes.

I noticed that the tape drives are connected to the same SCSI card; I have always been told they should attach to their own card for better performance, but perhaps this is less important nowadays? The tape performance is more than adequate in any case; migrating from the local array averages 3GB per minute, while migrating from the SAN averages over 1.5GB per minute.
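
Put another way, at those rates the migration time scales simply with the amount of staged data; a small sketch (the 1TB figure is only an example, not the real staging volume):

```python
# Migration time at the observed rates for an example amount of staged data.
rates_gb_per_min = {"local array": 3.0, "SAN": 1.5}   # GB/min, as measured above
staged_gb = 1024                                      # hypothetical: ~1TB staged

for src, rate in rates_gb_per_min.items():
    minutes = staged_gb / rate
    print(f"{src}: {minutes:.0f} min (~{minutes / 60:.1f} h) to migrate {staged_gb} GB")
```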

I just discovered the SAN disks are the worst fragmented I have EVER seen, with some of the larger files having 52,000 fragments! The local staging array is not much better and the array with the ASDB is also terrible - plenty of scope for improvement.

On the client side, it's looking like the agents and the network have real problems too. Running a comparison backup direct to tape and via staging may tell me something: if the staging job is no faster than the direct-to-tape job, then I think I can assume the greatest problems are not on the tape server.

It seems everywhere I look on this system there are problems to be fixed...
 
Other stuff you might want to look at:

Anti-virus - make sure it's patched and up to date, especially the system-level real-time drivers (not just the definitions). Bearing in mind that most system-level AV drivers intercept, so to speak, read and write calls, this may have some bearing on the problem if they are faulty or out of date.

Your array card should be fine; I have used these before. Using a separate SCSI channel for each tape drive is preferable, but as you're not having issues with this, it's probably best to leave it alone. If you have a U320-based card then you're probably fine to have two drives hanging off it anyway; with U160 you may start to push the limits. Just make sure they are not attached to the SmartArray's external port. Even though HP support some of their own drives doing this, they do not support more than a single drive attached to this port, and do not support any libraries attached to it at all.

For the agents, look to hard-set the NIC/IP that the client agents use at the client agent end (providing they are not on changing IP addresses via DHCP - DHCP reserved is fine - you can do this in the client agent configuration). This doc may also be helpful in optimising client agent performance:

 
Hi BackupFanatic,

Take a look at the thread "AS 11.5 Disk staging slow" in the arcserve-backup forum. It seems to be the same problem, which was fixed by a patch from CA. The patch increased the staging speed several times over. I think all other tuning would be rather cosmetic!

 
That's why I recommended patching up, as IIRC the testfix issued in this case should be in the latest published patch.
 
For some reason I can't find that post, but I will be patching the tape server and some clients next week - it will be interesting to compare the speeds before and after. I discovered that many servers are still on the unpatched 11.1 client and even found a few AS 6.61 clients around. I started defragging the staging disks today; the larger one only managed 23% before I had to stop it after 8 hours. It's going to take a whole week at this rate. The whole situation is looking like a train wreck, which I see as a good thing - as long as I can fix it up!
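
For what it's worth, a crude extrapolation of that defrag run (it assumes the rate stays roughly constant, which defrag passes rarely do as the free space gets chewed up):

```python
# Crude extrapolation: 23% defragmented after 8 hours.
done, hours_so_far = 0.23, 8.0
total_hours = hours_so_far / done   # assumes a constant defrag rate
print(f"~{total_hours:.0f} h in total, ~{total_hours - hours_so_far:.0f} h still to go")
```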
 
The 6.61 clients will certainly cause issues, and it's best, as you know, to patch the others up. Defragging such large files is always a problem; as I mentioned before, it will only really work itself out if you have at least as much free space as the largest file you are trying to defragment (and I know that may not be realistic).
 
Just to finish this thread up:

Before I had the chance to apply patches or make other changes, the DB self-destructed. Without going into details, nothing would bring ARCserve back, so I uninstalled it, cleaned out the registry and then installed fresh from SP1.

Many small problems have been fixed and the staging speed is a little faster - it seems the disk or network I/O is the limit here.
 
Hi,
I've been running disk staging for about a year now; my first experience came with the pre-release version of 11.5. I have a speedy IBM server, IBM DS storage and LTO3 drives. The network connectivity to the backup server is built on two 1000Mbit interfaces with load balancing. The outcome of my tests shows the following:
1. 3-10 large DB dump files totalling about 120GB of data
- to tape: about 4200MB/min
- to disk: about 4000MB/min
2. one server with 620,000 files and about 295GB of data
- to tape: 1300MB/min
- to disk: 670MB/min
3. several servers in one backup job totalling 200GB of data
- to tape: 900MB/min
- to disk: 495MB/min

So this means that whatever SP or patch level is used, the results are pretty much the same: this "slow" disk storage is a slower backup device than tape. The difference is smaller when we are running a big, steady backup stream from a client agent. For example, a job that was backing up one server ran for 5h 20min to disk and only 2h 30min to tape. The storage I'm using is a DS4100, fully equipped with about 2.6TB of capacity, and it is not comparable in speed to, say, the DS4400 series. LTO3 drives are really fast if you can feed them a steady, big data stream.
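
Just to re-express those figures as run times (GB and MB/min taken straight from the list above):

```python
# Turn the quoted sizes and rates back into approximate job durations.
tests = [
    ("large db dumps",   120, 4200, 4000),   # (name, GB, MB/min to tape, MB/min to disk)
    ("620k small files", 295, 1300,  670),
    ("mixed servers",    200,  900,  495),
]
for name, gb, tape_rate, disk_rate in tests:
    mb = gb * 1024
    t_tape, t_disk = mb / tape_rate / 60, mb / disk_rate / 60   # hours
    print(f"{name:16s} tape {t_tape:4.1f} h   disk {t_disk:4.1f} h   "
          f"(disk runs at {disk_rate / tape_rate:.0%} of tape speed)")
```

The penalty only really shows up on the jobs full of small files or mixed servers; the big steady DB dumps stage almost as fast as they go to tape.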
 