
Why is I/O wait that high?


zaxxon

I am running an audit of a Tivoli Storage Manager database, which is awfully slow:

Code:
Topas Monitor for host:    sremhv09             EVENTS/QUEUES    FILE/TTY
Tue Jan 17 09:22:46 2006   Interval:  1         Cswitch    1131  Readch  1306.3K
                                                Syscall    6706  Writech  146.4K
Kernel    5.0   |##                          |  Reads       390  Rawin         0
User     12.0   |####                        |  Writes       59  Ttyout      425
Wait     83.0   |########################    |  Forks         4  Igets         0
Idle      0.0   |                            |  Execs         6  Namei       282
                                                Runqueue    0.0  Dirblk        0
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Waitqueue   1.0
en0      13.2     90.0    22.0     9.4     3.8
lo0       0.0      0.0     0.0     0.0     0.0  PAGING           MEMORY
                                                Faults     1540  Real,MB    6143
Disk    Busy%     KBPS     TPS KB-Read KB-Writ  Steals        0  % Comp     17.1
skpower0 75.0    696.0   174.0   696.0     0.0  PgspIn        0  % Noncomp  14.8
hdisk2   41.0    348.0    87.0   348.0     0.0  PgspOut       0  % Client   14.0
hdisk4   34.0    348.0    87.0   348.0     0.0  PageIn      199
skpower3 17.0    100.0    25.0   100.0     0.0  PageOut       0  PAGING SPACE
hdisk14   9.0     52.0    13.0    52.0     0.0  Sios        199  Size,MB    2048
                                                                 % Used      0.6
Name            PID  CPU%  PgSp Owner           NFS (calls/sec)  % Free     99.3
dsmserv      430214   4.0 381.3 root            ServerV2       0
ftpd         507906   1.0   1.2 fnsw            ClientV2      10   Press:
IBM.CSMAg    446686   0.0   2.2 root            ServerV3       0   "h" for help
vmptacrt      53274   0.0   0.1 root            ClientV3       0   "q" to quit
pilegc        57372   0.0   0.2 root
xmgc          61470   0.0   0.1 root

I can't see where the high wait is coming from. The machine has nothing else to do; the hdiskpower devices are EMC PowerPath devices for our storage subsystem, which has no problems, and the current data transfer rates are laughable. I'd have expected to see throughput in MB/s, not KB/s... The LUN in use is a RAID 5, but see my test further down; that doesn't explain why this audit is so slow...
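For reference, the per-disk numbers can also be cross-checked with iostat (just a sketch; interval and count are arbitrary):
Code:
# per-disk transfer statistics, 2-second samples, 5 reports
iostat -d 2 5
# overall tty/CPU plus disk view
iostat 2 5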

AIX is 5.2 ML 4 and we have no problems at all with another application using an Oracle database on this box.
The system has 1x 1.45 GHz CPU and 6 GB RAM.

More tests checking read/write performance:
Code:
root@sremhv09:/tsmcache> time dd if=/dev/zero of=./outfile bs=1024 count=1000000
1000000+0 records in.
1000000+0 records out.

real    0m21.29s
user    0m1.38s
sys     0m18.58s

root@sremhv09:/tsmcache> time dd if=./outfile of=/dev/null
2000000+0 records in.
2000000+0 records out.

real    0m35.06s
user    0m14.68s
sys     0m18.69s
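Rough arithmetic on those dd numbers (the read uses the default 512-byte block size, so both runs move the same ~977 MB):
Code:
# write: 1000000 blocks x 1024 bytes / 21.29 s  => roughly 46 MB/s
echo "1000000*1024/1048576/21.29" | bc -l
# read:  2000000 blocks x  512 bytes / 35.06 s  => roughly 28 MB/s
echo "2000000*512/1048576/35.06" | bc -l
So raw sequential throughput to that filesystem doesn't look like the problem by itself.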

Anyone got an idea, or should I maybe ask the TSM people?
Thanks in advance!

laters
zaxxon
 
IO wait is not only disk IO...

I'd look into network throughput - perhaps the duplex setting (fdx/hdx) is wrong on your network card?

Do the same dd test, from another box using an NFS mount to one of your EMC based filesystems on the TSM server.
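Something along these lines, perhaps (hostname, export and mount point are only examples, and the filesystem has to be exported on the TSM server first):
Code:
# on another box:
mkdir -p /mnt/tsmcache
mount sremhv09:/tsmcache /mnt/tsmcache
time dd if=/dev/zero of=/mnt/tsmcache/nfs_outfile bs=1024 count=1000000
time dd if=/mnt/tsmcache/nfs_outfile of=/dev/null bs=1024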

There is a hardware checksum offload issue with some types of network adapters - I don't know the details offhand, but I can look it up for you if you want - I'd think Google can point you in the right direction.
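To check the adapter settings, something like this (ent0 is assumed to be the physical adapter behind en0, and the exact attribute names vary per adapter type):
Code:
# negotiated media speed / duplex
entstat -d ent0 | grep -i "Media Speed"
# adapter attributes - look for the checksum offload and media speed settings
lsattr -El ent0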

Then again, sequential read or write performance is hard to compare with random IO on a TSM database...



HTH,

p5wizard
 
Ok, thanks for the advice.
So can I/O wait even come from the application itself, because it is, for whatever reason, too slow to respond?

laters
zaxxon
 
Perhaps too simplified, but:

IO wait means the processor is waiting for IO to complete - on any IO adapter: SCSI, FC, Ethernet, ... - not on the application.

There are processes sitting in the 'b' (blocked for IO) queue, waiting to be moved to the 'r' (run) queue, and the processor(s) has (have) nothing better to do than wait for processes to enter the 'r' queue so that they can be dispatched.

Look at the 'r' and 'b' queues in vmstat (first 2 cols)

Essentially your 'b' queue is occupied and your 'r' queue is mostly empty, hence the high IO wait value - but that doesn't explain the reason(s) why your processes are blocked...
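For example (interval and count are arbitrary; the first two columns are r and b, and wa is the IO wait percentage):
Code:
# 2-second samples, 10 reports
vmstat 2 10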

HTH,

p5wizard
 
Your TSM server is accessing your disks (some kind of backup running, maybe?). What kind of storage do you use? The slower your disks, the longer the I/O wait will be...
In this case your I/O clearly comes from your disks; the traffic on Ethernet is minimal.

rgds,

R.
 
Also look at tape traffic. Trouble is, it can't be measured as easily as disk IO. But if you can isolate a tape drive from TSM, you can run a few dd tests to calculate throughput yourself. Use random data though, because most drives do HW compression, so all zeroes is a bad test.
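Something along these lines, maybe (rmt0 and the block size are just examples, and /dev/urandom being available on 5.2 is an assumption - any incompressible file will do as input):
Code:
# build ~512 MB of incompressible test data on local disk first
dd if=/dev/urandom of=/tsmcache/random.dat bs=1024k count=512
# then time it going to a tape drive TSM isn't currently using
time dd if=/tsmcache/random.dat of=/dev/rmt0 bs=256k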

And just because Ethernet traffic is minimal doesn't mean it can't be the source of your problem - perhaps Ethernet throughput needs to be much higher? Is it 100 Mbit or Gbit? Half or full duplex? Is your switch port set to the same speed and mode? Have the network guys look into that. Can't hurt.


HTH,

p5wizard
 
The TSM server is not active. There was previously a "dsmserv DUMPDB" and a "dsmserv LOADDB" because we lost the recovery log. The TSM DB sits on an FC-attached CLARiiON CX700 storage subsystem which carries no high traffic and shows no faults. The disks used are FC disks, and there is no other access while that awfully slow "dsmserv AUDITDB fix=yes" is running.
The dd test was done on the same filesystem on the CLARiiON storage subsystem where the TSM DB that has to be audited resides.
For this AUDITDB, no access via Ethernet or anything else except the FC path to the disks is required. I had our network specialist check the FC switch and he said everything looks good, which the dd test seems to confirm.
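For the record, the host side of the FC path can be checked with something like this (fcs0/fscsi0 are assumed names - use whatever lsdev shows):
Code:
# any disk/adapter errors logged on the host?
errpt | head -20
# FC adapter and protocol device attributes
lsdev -C | grep -i fc
lsattr -El fcs0
lsattr -El fscsi0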
So I have no idea where the problem comes from, if it is not the application itself...

laters
zaxxon
 
Well, if nothing else but auditdb is running, and there's only 1 CPU, then high IO wait can be easily explained:

Only 1 active process (well, arguably, but for the sake of argument) running on a 1.45 GHz CPU, and it has to perform an IO-intensive job. If ever there's a race for the most IO-bound job, I guess TSM auditdb is a candidate for 1st place...

So most of the time your auditdb job has to wait for an IO to complete. Your CPU does not have anything better to do than wait. That wait cannot be counted as "idle" because your one CPU is waiting to be able to dispatch the process.

If you had 4 CPUs, you'd see about 20% IO wait, 5% active and 75% idle (the other three CPUs would sit idly by, watching the 1st CPU go mad about IO).

You could of course run a CPU-bound process (calculate prime numbers from 0 to infinity?) alongside the auditdb; you'd see less IO wait, but the auditdb wouldn't run any faster...
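For instance, a throwaway CPU burner like this (purely to illustrate the accounting effect - it would just turn %wait into %user):
Code:
# busy-loop in the background, watch topas for a while, then kill it again
while :; do :; done &
BURNER=$!
sleep 60; kill $BURNER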

A 1.45 GHz processor is a few orders of magnitude faster than even your fastest disk server. Also, perhaps your storage server's cache isn't too happy with the scattered reads your auditdb is provoking...


HTH,

p5wizard
 