Need help with vmstat and slow system 2

Status
Not open for further replies.

zaxxon

MIS
Dec 12, 2001
226
DE
Hello,

I already opened a post for my problem some time ago here:
Sadly, I haven't yet found out where the problems come from. When those high numbers show up in the (r)un queue column under kthr, the applications on the machine get very slow for several seconds until the bottleneck clears; everything then runs smoothly for about a minute before it "lags" again. All CPUs are busy handling the huge number of kthreads waiting in the run queue. Even typing commands in a shell on the machine lags, so I sometimes type faster than PuTTY can display the letters.

Some more info:
AIX 5.2 ML 6
CPUs: 4 x 1.45 GHz Power4
RAM: 15.5 GB

The apps running on it are DB2 V8.2 FP9, TSM-Server 5.2.6.0, WebSphere App-Server 5.0.2 FP 17, and Java 1.3.1, all part of an IBM Content Manager Archive v8.1 FP 10.
IBM Support for CM has been no help at all so far.

The bad thing is that I can't find anything useful in the logs of any of the applications when the slowdown occurs. Usually, once it starts, it never gets better and keeps going at the same intervals: about 20 secs very slow, then 30-60 secs of everything running smoothly, and so on.
If I restart all applications, everything runs smoothly and continuously again.
It doesn't happen every day, but sometimes it happens even twice a day: sometimes after 6 hours of running very well, sometimes after 4 hours and then again just 2 hours later. I haven't found any other factor that might influence this yet.

Here is the VMM tuning I applied and that the system is running with:
Code:
                 10.0 minperm percentage
                 30.0 maxperm percentage
                 30.0 maxclient percentage

And here is a long vmstat output (vmstat 1):

Code:
 0  0 1215653 23542   0   1   0   0    0   0 1892 37549 16271 15 10 49 26
 1  0 1215842 23321   0   0   0   0    0   0 2039 44036 19186 25 12 43 20
 0  1 1215853 23293   0   1   0   0    0   0 1943 39213 17564 21  8 49 22
 0  0 1215858 23264   0   2   0   0    0   0 1752 34737 16295 15  9 60 16
 0  0 1215875 23239   0   1   0   0    0   0 1705 42358 17620 21 10 54 15
 0  1 1215881 23206   0   0   0   0    0   0 1799 40752 17839 21  9 52 19
 1  0 1215881 23196   0   2   0   0    0   0 1482 30612 12799 15  6 56 23
 0  0 1215895 23143   0   0   0   0    0   0 1771 35944 15716 20  9 45 26
 0  0 1215916 23113   0   1   0   0    0   0 1794 34288 17040 15  6 63 16
 6  0 1215943 23056   0   0   0   0    0   0 1791 37548 16567 18 12 57 13
 8  1 1215988 22986   0   1   0   0    0   0 1852 39122 15370 33 12 40 15
 1  1 1215988 22976   0   0   0   0    0   0 1682 33139 14908 15  9 60 16
 0  0 1216255 22697   0   0   0   0    0   0 1799 45013 19325 21 13 56 11
 1  0 1216479 22465   0   1   0   0    0   0 1601 31782 14686 15  8 72  5
 0  1 1216688 22213   0   0   0   0    0   0 1713 51791 21284 23 11 45 22
 2  1 1216759 22120   0   1   0   0    0   0 1700 40117 17048 17 10 60 13
 0  0 1216781 22094   0   0   0   0    0   0 1715 30957 13898 16  7 69  8
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 5  0 1216456 22373   0   1   0   0    0   0 1906 33769 15779 18 17 47 18
 6  1 1215303 23505   0   0   0   0    0   0 1558 37857 15442 21 27 39 13
 1  1 1215272 23526   0   0   0   0    0   0 1293 24516 9158 12 22 52 14
10  0 1213430 25368   0   0   0   0    0   0 860 4683 1286  2 57 25 16
 5  1 1211859 26939   0   0   0   0    0   0 791 4665 974  3 64 21 12
 4  1 1209980 28815   0   0   0   0    0   0 897 5522 1918  1 55 26 17
24  1 1207942 30849   0   0   0   0    0   0 822 3674 1092  1 63 20 15
16  2 1206935 31855   0   0   0   0    0   0 750 4397 1002  1 63 21 15
10  1 1205487 33303   0   0   0   0    0   0 844 4692 980  2 62 11 25
13  0 1204762 34026   0   0   0   0    0   0 838 4661 1129  1 42 33 24
 9  1 1203719 35066   0   0   0   0    0   0 917 5291 1393  1 65  4 29
 5  1 1201738 37047   0   0   0   0    0   0 687 4037 1040  2 67 11 20
19  1 1200406 38375   0   0   0   0    0   0 775 15343 1098  3 80  4 13
35  1 1199143 39694   0   0   0   0    0   0 713 13811 1231  2 76 15  6
22  3 1197181 41594   0   0   0   0    0   0 766 8365 1281  3 92  0  6
51  0 1196010 42759   0   0   0   0    0   0 925 8131 1932  2 80  2 16
60  0 1194988 43779   0   0   0   0    0   0 761 7766 1486  3 86  7  3
 8  1 1194416 44346   0   0   0   0    0   0 633 4883 959  2 56 33 10
 7  2 1194441 44318   0   0   0   0    0   0 850 4351 879  6 63  8 23
11  1 1194070 44688   0   0   0   0    0   0 720 5282 1109  3 75 15  7
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
43  0 1194361 44397   0   0   0   0    0   0 635 5260 1346  2 65 29  3
 1  1 1194362 44396   0   0   0   0    0   0 714 3527 741  1 32 34 33
35  1 1194459 44284   0   1   0   0    0   0 892 8730 1761  3 93  4  1
102  0 1195201 43542   0   0   0   0    0   0 759 6811 946  1 83  8  8
24  0 1195249 43494   0   0   0   0    0   0 616 3751 978  2 61 30  6
 1  2 1195250 43486   0   0   0   0    0   0 689 4170 1575  0 27 22 50
30  0 1195380 43354   0   0   0   0    0   0 712 5169 994  2 74 13 11
57  0 1195049 43685   0   0   0   0    0   0 655 5488 870  2 90  8  0
61  0 1197056 41664   0   0   0   0    0   0 736 7940 1309  3 95  1  0
12  0 1197121 41589   0   0   0   0    0   0 774 5615 1810  1 69 14 16
25  0 1200173 38522   0   1   0   0    0   0 887 13646 3579  6 72  0 23
31  2 1202163 36528   0   0   0   0    0   0 618 7632 1481  3 86  4  7
31  0 1203110 35564   0   0   0   0    0   0 851 6915 1520  3 76  3 19
 3  4 1203458 35215   0   0   0   0    0   0 693 3699 647  1 75  0 24
24  1 1207381 31261   0   1   0   0    0   0 1214 20069 5849  8 73 17  2
51  1 1213352 25275   0   0   0   0    0   0 1368 48685 12205 25 60  5 10
 2  1 1214065 24517   0   3   0   0    0   0 2058 38901 12323 19 31 10 41
 3  0 1215158 23400   0   0   0   0    0   0 2503 41619 18429 23 16 25 37
 3  1 1215701 22850   0   0   0   0    0   0 1603 33452 14327 16 10 63 10
 0  0 1215509 22991   0   1   0   0    0   0 1882 40642 19318 22 11 43 24
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 0  0 1216100 22382   0   0   0   0    0   0 2004 46266 19341 24 13 47 15
 1  0 1216116 22347   0   1   0   0    0   0 2038 43065 19391 21 14 44 21
 0  0 1216311 22105   0   1   0   0    0   0 1960 45368 19212 24 10 51 14
 0  0 1217090 21309   0   0   0   0    0   0 1811 45064 18378 24 12 51 13
 1  0 1217349 21022   0   0   0   0    0   0 2098 52137 22315 29 11 34 26
 2  0 1217349 21008   0   1   0   0    0   0 1899 46866 20705 25  9 43 22
 0  0 1217349 20997   0   0   0   0    0   0 1814 46477 18653 37 11 35 17
10  1 1217609 20725   0   1   0   0    0   0 1829 46101 15392 46 12 19 23
 4  0 1217940 20381   0   0   0   0    0   0 1871 47111 16830 41 12 35 12
 1  0 1217940 20368   0   1   0   0    0   0 1986 47634 18076 46 13 30 12
 2  2 1218129 20169   0   0   0   0    0   0 1814 44158 15483 46  7 37  9
 2  1 1218130 20137   0   1   0   0    0   0 1921 46017 15420 53 12 23 12
 4  0 1218353 19897   0   0   0   0    0   0 1868 49748 16706 45 11 33 11
 6  0 1218973 19264   0   1   0   0    0   0 1985 63548 21004 51 15 22 13
 2  0 1219028 19199   0   0   0   0    0   0 1876 42975 15747 41 11 39 10
 7  0 1219333 18882   0   1   0   0    0   0 1885 50906 17753 47 12 28 13
 3  0 1219333 18865   0   1   0   0    0   0 2023 50949 18768 50 10 23 16
 1  0 1219351 18815   0   1   0   0    0   0 1707 48153 16904 44 15 28 13
 1  0 1219433 18703   0   0   0   0    0   0 1835 50007 17836 44 13 27 15
 8  0 1220196 17923   0   1   0   0    0   0 2024 53724 18092 47 17 15 21
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 4  0 1220248 17855   0   0   0   0    0   0 2164 53855 18590 48 13 24 15
 8  0 1220310 17778   0   1   0   0    0   0 1821 43663 18127 20 11 58 11
 4  0 1220358 17723   0   0   0   0    0   0 1854 48589 20240 21 12 47 19
 0  0 1220369 17663   0   1   0   0    0   0 2008 51147 22026 23 13 47 17
20  1 1219939 18084   0   0   0   0    0   0 1616 34068 13950 16 39 18 26
34  0 1218451 19567   0   0   0   0    0   0 914 6842 1827  3 88  4  5
 2  1 1216887 21129   0   0   0   0    0   0 797 5364 1303  1 71 16 12
 4  0 1215615 22401   0   0   0   0    0   0 873 4921 1248  2 63 19 17
 4  2 1214411 23601   0   0   0   0    0   0 795 4516 1015  3 70 10 17
 3  2 1212305 25704   0   0   0   0    0   0 1046 5460 1914  2 69 11 18
 2  2 1212114 25892   0   0   0   0    0   0 734 15152 1162  4 56 11 28
 3  2 1210881 27115   0   0   0   0    0   0 927 18764 1936  4 57 14 25
 2  2 1210477 27509   0   0   0   0    0   0 747 7071 907  2 57 13 29
 2  1 1209250 28729   0   0   0   0    0   0 783 5806 1479  1 50 18 31
 7  1 1207882 30096   0   0   0   0    0   0 620 2724 731  0 53 34 13
 4  1 1206470 31505   0   0   0   0    0   0 778 4555 1173  1 49 20 30
 2  2 1205634 32338   0   0   0   0    0   0 792 5099 1199  1 49 15 34
11  0 1204672 33299   0   0   0   0    0   0 989 5549 1381  7 50  6 36
15  0 1203258 34713   0   0   0   0    0   0 741 4669 1197  1 51 29 20
12  0 1201493 36478   0   0   0   0    0   0 745 4885 1198  1 68 16 14
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 2  1 1200461 37509   0   1   0   0    0   0 794 4747 1086  2 63 26  9
27  0 1198973 38997   0   0   0   0    0   0 648 4066 669  1 75 17  7
 1  2 1199088 38880   0   0   0   0    0   0 685 3643 892  1 28 31 39
31  0 1200074 37893   0   0   0   0    0   0 920 12891 2727  6 78 11  5
33  0 1202010 35957   0   0   0   0    0   0 830 11077 1910  3 78 12  7
20  1 1201975 35992   0   0   0   0    0   0 697 5005 880  1 95  3  0
13  2 1202096 35871   0   0   0   0    0   0 655 3214 614  1 64  4 30
38  1 1201974 35992   0   0   0   0    0   0 849 6226 1201  2 89  0  9
38  0 1203252 34708   0   1   0   0    0   0 776 8589 1871  3 96  1  0
50  0 1204212 33747   0   0   0   0    0   0 726 6485 1311  2 75 15  7
28  2 1204151 33777   0   0   0   0    0   0 674 8032 2304  4 84  0 12
83  0 1204718 33208   0   0   0   0    0   0 833 8742 2384  3 79  7 12
80  0 1205702 32222   0   0   0   0    0   0 866 9346 2227  4 91  0  4
100  1 1205970 31953   0   1   0   0    0   0 758 11823 2451  6 79 10  5
112  0 1211368 26534   0   0   0   0    0   0 1361 27365 7793 13 84  1  3
59  0 1216764 21124   0   0   0   0    0   0 1423 41961 12133 25 64  2  9
 1  2 1217012 20802   0   4   0   0    0   0 2379 46381 15861 21 23 14 42
10  0 1218522 19273   0   0   0   0    0   0 2342 43237 18957 22 14 42 22
 0  0 1219374 18403   0   1   0   0    0   0 1973 45384 19781 24 12 44 20
 0  1 1220058 17677   0   1   0   0    0   0 1981 43488 19099 23 10 36 31
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 2  1 1220065 17646   0   0   0   0    0   0 2105 50420 22252 25 13 39 23
 0  0 1220089 17597   0   1   0   0    0   0 2090 52754 21856 27 13 38 22
 5  1 1220160 17505   0   1   0   0    0   0 2196 54743 23175 29 16 33 23
 4  1 1220401 17254   0   0   0   0    0   0 1817 40038 18292 19  9 55 17
 6  1 1221709 15923   0   1   0   0    0   0 2098 51189 19967 37 16 30 17
 2  0 1221966 15644   0   0   0   0    0   0 1645 38375 17143 19 10 50 20
 6  0 1222454 15134   0   0   0   0    0   0 2043 47069 20710 27 10 36 27
 0  0 1222507 15062   0   0   0   0    0   0 2085 43164 20048 21 12 48 18
 0  0 1222561 14986   0   0   0   0    0   0 2171 47430 22109 21 13 40 26
 3  1 1222721 14808   0   0   0   0    0   0 2092 56306 22569 27 15 39 20
 0  0 1222798 14723   0   0   0   0    0   0 1873 39622 17532 21  9 51 19
 0  1 1222966 14510   0   2   0   0    0   0 2120 44575 19638 18 12 42 28
 3  0 1223059 14352   0   0   0   0    0   0 2221 45944 21141 23 12 42 23
 4  1 1223076 14315   0   0   0   0    0   0 1929 43699 19552 18 11 52 19
 2  0 1223303 14044   0   0   0   0    0   0 2189 54128 24345 26 12 32 29
 5  0 1223380 13916   0   1   0   0    0   0 2094 38245 17355 15  9 55 22
 1  0 1223395 13878   0   1   0   0    0   0 1951 47598 21715 23 10 46 20
 7  0 1223403 13861   0   0   0   0    0   0 1895 40763 19283 21  8 50 21
 0  2 1223463 13759   0   1   0   0    0   0 2023 44691 20227 24 10 41 25
 8  1 1223463 13714   0   0   0   0    0   0 2057 41690 19772 20 10 52 19
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 8  0 1223543 13589   0   1   0   0    0   0 2030 43829 19370 20 11 46 23
 1  0 1223549 13542   0   1   0   0    0   0 2026 42225 19659 22 13 43 23
 5  0 1223575 13478   0   2   0   0    0   0 1706 35836 16670 18 10 53 19
 2  0 1223628 13383   0   0   0   0    0   0 1811 41055 18007 16 11 55 17
 7  1 1222457 14540   0   0   0   0    0   0 1170 12420 5001  6 60 25 10
 3  1 1221657 15339   0   0   0   0    0   0 719 3942 877  2 59 19 20
 6  2 1220455 16537   0   0   0   0    0   0 934 6703 1924  3 60 11 26
 3  3 1220604 16388   0   0   0   0    0   0 781 6097 1735  1 53  7 39
 6  4 1220032 16958   0   0   0   0    0   0 740 23103 1356  3 78  5 14
 3  3 1219164 17821   0   0   0   0    0   0 829 7132 1084  1 64  6 28
 4  3 1218007 18964   0   1   0   0    0   0 825 4983 1088  1 61  7 31
 2  2 1216612 20353   0   0   0   0    0   0 850 8713 1387  1 53 12 34
 8  0 1215395 21570   0   0   0   0    0   0 688 6703 1037  2 79 14  5
 6  1 1213355 23599   0   0   0   0    0   0 813 6212 1302  2 65 16 16
 3  2 1211692 25253   0   0   0   0    0   0 719 4187 778  1 52 31 15
 1  2 1211100 25838   0   1   0   0    0   0 937 5085 1303  7 53 16 24
 5  1 1209998 26938   0   0   0   0    0   0 788 4795 1080  2 58 23 17
 1  3 1208938 27989   0   0   0   0    0   0 757 4361 1838  1 46 17 36
 1  2 1207608 29300   0   1   0   0    0   0 746 4367 1038  1 45 27 27
18  2 1206938 29964   0   0   0   0    0   0 831 3927 937  0 74  5 21
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
35  1 1205956 30944   0   0   0   0    0   0 752 7393 1509  3 81 10  6
23  2 1204168 32729   0   0   0   0    0   0 717 4379 851  1 83 15  1
37  0 1204081 32814   0   0   0   0    0   0 812 6142 1596  3 92  2  3
70  0 1203666 33229   0   0   0   0    0   0 811 7278 1666  2 98  0  0
23  0 1204528 32311   0   1   0   0    0   0 885 12927 3582  5 83 11  1
36  0 1205931 30904   0   0   0   0    0   0 882 12052 2518  6 84  8  2
74  0 1205688 31146   0   0   0   0    0   0 658 4312 751  2 85 13  0
114  0 1204389 32444   0   0   0   0    0   0 687 5501 1119  3 97  0  0
 1  1 1204091 32740   0   0   0   0    0   0 754 6495 1495  4 38 29 29
 1  1 1204091 32740   0   0   0   0    0   0 729 4381 931  1 26 38 35
 1  1 1204096 32735   0   0   0   0    0   0 647 3214 577  0 26 49 25
 3  1 1204099 32732   0   0   0   0    0   0 662 7553 905  2 28 32 38
 6  2 1204149 32680   0   0   0   0    0   0 707 3625 1012  1 63  7 29
58  0 1205329 31451   0   0   0   0    0   0 801 12325 3034  5 92  2  0
32  2 1207168 29603   0   1   0   0    0   0 749 7982 1671  4 72 11 13
 1  3 1207314 29457   0   0   0   0    0   0 658 3155 571  0 26 25 49
 1  3 1207321 29449   0   0   0   0    0   0 801 4642 1574  1 29 11 59
 2  4 1209069 27687   0   0   0   0    0   0 818 6052 1401  1 56 11 33
37  2 1213280 23442   0   1   0   0    0   0 1174 21018 6502  8 85  4  2
23  2 1215002 21708   0   0   0   0    0   0 932 15045 3753  5 85  3  7

Thanks in advance for any new ideas.


laters
zaxxon
 
It's a CPU issue by the look of it.

Does the slowdown follow any pattern, e.g. every 5 minutes?

What's the top process when the system is slow?

Memory looks fine to me.
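
A quick back-of-the-envelope check of the memory columns, using the first vmstat row above. This is only a sketch: it assumes avm and fre are counted in 4 KB pages (the usual unit for AIX vmstat), and the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # bytes; assumption: vmstat reports avm/fre in 4 KB pages


def pages_to_gib(pages: int) -> float:
    """Convert a page count to GiB."""
    return pages * PAGE_SIZE / 2**30


def pages_to_mib(pages: int) -> float:
    """Convert a page count to MiB."""
    return pages * PAGE_SIZE / 2**20


# Values taken from the first row of the posted vmstat output.
avm_gib = pages_to_gib(1215653)
fre_mib = pages_to_mib(23542)
print(f"avm: {avm_gib:.2f} GiB, fre: {fre_mib:.0f} MiB")
```

Under that assumption, active virtual memory is roughly 4.6 GiB on a box with 15.5 GB of RAM, and the pi/po columns stay at or near zero throughout, which is consistent with memory not being the bottleneck.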

Mike

"Whenever I dwell for any length of time on my own shortcomings, they gradually begin to seem mild, harmless, rather engaging little things, not at all like the staring defects in other people's characters."
 
During your slowdowns, the kernel context switch rate (cs) drops by an order of magnitude and the CPU time spent in system calls (sy) goes through the roof.

The first thing that comes to mind is that an application is creating a sort of "thread storm", spawning off bunches of threads (as seen in the kthr columns) for some reason. This would explain the sy percentage (system calls to spawn, manage, and reap threads) as well as the context switch drop (all of an application's threads run in the same context).

During a slowdown, try running "tprof -x sleep 60" and then check the __prof.all file. Your culprit should be at the top of the list.
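
To pick the slow intervals out of a long vmstat capture, a rough sketch like the following may help. It assumes the 17-column layout shown in the output above; the thresholds (run queue above the 4 CPUs, at least 50% system time) and all names are illustrative guesses, not tuned values.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    r: int   # kthreads in the run queue
    b: int   # kthreads blocked
    cs: int  # context switches per second
    us: int  # user CPU %
    sy: int  # system CPU %


def parse_vmstat(text: str) -> List[Sample]:
    """Parse data rows of the 17-column vmstat layout; header rows are skipped."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 17 or not all(f.isdigit() for f in fields):
            continue  # header, separator, or malformed row
        v = [int(f) for f in fields]
        samples.append(Sample(r=v[0], b=v[1], cs=v[12], us=v[13], sy=v[14]))
    return samples


def is_stalled(s: Sample, ncpu: int = 4, sy_pct: int = 50) -> bool:
    """Hypothetical rule: more runnable threads than CPUs and mostly system time."""
    return s.r > ncpu and s.sy >= sy_pct


# Four rows copied from the posted output, plus one header line.
SAMPLE = """
 r  b   avm   fre  re  pi  po  fr   sr  cy  in   sy  cs us sy id wa
 0  0 1216100 22382   0   0   0   0    0   0 2004 46266 19341 24 13 47 15
60  0 1194988 43779   0   0   0   0    0   0 761 7766 1486  3 86  7  3
10  0 1213430 25368   0   0   0   0    0   0 860 4683 1286  2 57 25 16
 1  1 1194362 44396   0   0   0   0    0   0 714 3527 741  1 32 34 33
"""

samples = parse_vmstat(SAMPLE)
for s in samples:
    state = "STALL" if is_stalled(s) else "ok"
    print(f"{state:5} r={s.r:3} sy={s.sy:2}% cs={s.cs}")
```

Run against a full capture, the timestamps of the flagged rows would show when to kick off tprof and how long each slow phase lasts.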

- Rod

IBM Certified Advanced Technical Expert pSeries and AIX 5L
CompTIA Linux+
CompTIA Security+

 
BTW, zaxxon, did I ever compliment you on your handle? :)

Check out the Zaxxon World Record from October 1983. :)

Note that the editor thought the postal abbreviation for Arkansas was AK, which is Alaska.

Also, since turning forty is supposed to be career limiting in IT, I should say I was five years old at the time. :)

- Rod
 
@Rod
Cheers and congratulations on the high score, even if I'm a bit late, hehe! I first saw Zaxxon on an arcade machine when I was 10, back in 1984 ^^
If you run into any problems with your career advancement because of your age, tell them to ask me; I will vouch for the high score and your age back then, hehe ;)

But back to the problem:
For the moments when it occurs, I would also say it is a CPU problem/shortage, which is what so many kthreads waiting in the run queue for CPUs suggests. But until the problem starts to occur, the apps run fine for several hours without any high load on the CPUs.
To make it more visible, a typical problem day looks like this:

Code:
06:00 Production starts; apps have been running since 22:00 the day before, used only by some small batch jobs
08:00 Production starts to get into the heavy load range
10:20 First issues: the system starts to be slow, with the effect shown in the vmstat output
10:30 Restart of all apps on the box
10:35 Everything runs smoothly again under heavy production load
14:30 Same problems again, restarting apps

The machine runs totally fine until the problems occur; no problem of any kind is visible.

@mrn
The slowdown interval is as described and as seen in the output: ~20 secs awfully slow with many kthreads in the run queue, then ~30-60 secs of everything running smoothly, then ~20 secs of slowdown again, and so on, until I restart all the apps. It does not get worse or better if I do not restart the apps.

@Rod & mrn
I already had topas running and saw that our DB2 was the busiest when the problems occur. When no problems occur, the TSM-Server and Java use the most CPU time, but not at a problematic load like the one described.
I doubt DB2 itself is the problem, since we have a similar setup of apps and OS/hardware on another box that cooperates with this one and does not have the problem. The only difference is that there is no TSM-Server there, but an additional DB2 instance instead; all of it is part of that CM archive. By the way, I forgot to say that we never restart the TSM-Server, and it works fine after a restart of the other apps.
I also ran pstat -A and checked which kthreads were waiting in the run queue, but of course this was no big help. As you suggested, I will have tprof running next time, and I have instructed my co-workers to restart only small parts of the applications, one by one, to find out which part of that complex bunch of apps might be causing the problem.

But all in all, I am glad that none of you think it is caused by an OS or OS configuration problem, so I can at least be sure about that part.

laters
zaxxon
 
We were able to see that the problem must come from the WebSphere Application Server or the Java application running on it. In the tprof output file, the db2agents were listed at the top. When restarting only part of the apps, we left DB2 itself running, and after restarting WAS and the HTTP server on top of it, everything was fine again.
We are going to patch the software soon and hope that will fix it. At least we now know which part of the apps it is. Thanks all for the help :)

laters
zaxxon
 