Recently we upgraded our PowerPath licenses on several systems, including a Legato Networker backup server with the AFTD backup-to-disk device option. The backup server has 3 x 1 TB LUNs from a 14-disk ATA RAID 5 group.
This weekend our backups were running at a severely reduced rate compared to normal, which typically happens when there is a problem with the storage system.
In Navisphere Analyzer I could see that utilization was constantly hitting 100% for the LUNs used by the backup-to-disk devices, while IO/s were quite low compared to normal operation.
Furthermore, powermt and iostat revealed a large queue of outstanding requests on both active paths to the LUNs, which led me to suspect that the load balancing was somehow the culprit.
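For reference, these were the commands I was watching - the backlog shows up in the per-path queued-I/O counters of powermt display and in the wait/actv columns of iostat (I didn't save the output, unfortunately):

    powermt display dev=all
    iostat -xnz 5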
I reverted the policy to Basic Failover and immediately saw a doubling in IO/s and a massive increase in the throughput of the running backups.
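For the record, the change itself was trivial (bf being BasicFailover, as opposed to the CLAROpt policy the licensed install had been using):

    powermt set policy=bf dev=all
    powermt save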
I cannot fully explain why this became such a big problem, but I suspect the parallelism may have led to spindle contention. On the other hand, IO ops/s actually doubled when I reverted to BF mode, and disk utilization never exceeded 40% in CO mode, so this may be an incorrect assumption.
Alternatively, I may have run into a problem related to throttling or queue depth at the OS level.
I should mention that this host has only one HBA and is zoned so that it can see ports on both SPs. In load-balancing mode, this means the same local HBA is used to queue up transactions to both ports on the currently active SP - and I guess this may have consequences for how I should configure /etc/system.
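The kind of thing I have in mind is the classic sd/ssd max-throttle setting, along these lines - although the value of 20 is only a guess on my part, not a verified recommendation for this setup:

    * /etc/system - cap the per-LUN queue depth (value is a guess, not a recommendation)
    set sd:sd_max_throttle=20
    * or, if these LUNs sit under the ssd driver instead:
    * set ssd:ssd_max_throttle=20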
Anyway, I am a bit worried as to whether this problem may also occur on some of our more important SAN-attached servers. If there are parameters I need to tune in /etc/system or similar, I'd very much like to hear about it.