System slowness, help needed on diagnosis!

Status
Not open for further replies.

Jimbo2112

IS-IT--Management
Mar 18, 2002
109
GB
Hi All,

I am looking for some advice on how to diagnose my system slowness. Here is the situation.

System:
E450, 3 x 300 MHz CPUs, 512 MB RAM (with 50 GB external disk pack)
SPARC 20 backup server
7 SPARC 4 and SPARC 5 clients
100 Mb LAN alongside an NT LAN
Running Solaris 2.6

For the last 2 months we have been getting sporadic loss of system speed. This may affect only one terminal, or it may affect all of them. The performance meters on some terminals show a high number of errors, while others show very few.

Can you give me advice on things to check, and on command-line tools I can use to get to the root of this problem?

Thanks

Jimbo
 
Hi,

The errors I am seeing are on the performance meter used by OpenWindows. They do not correspond to the number of error messages that I am seeing in /var/adm/messages. Are these two things (performance meter and messages file) not linked?

Jimbo
 
No idea, I don't really know what perfmeter logs as errors.

What happens when you open an error monitor terminal? What does it show there?
That should show all those errors.
Any collisions? Heavy disk usage? Processes that are eating CPU?
 
Do you mean what happens when I open a Console? I have opened one for a client that is showing errors on the performance meter and I will monitor it for anything happening, but I think that the errors that go to the console are the same ones that go to the messages file.

As for the rest of the performance monitoring features, there is sporadic load depending on what demands are made on the client, and very few collisions. Interrupts are also running consistently high though. Maybe there is something there.

Finally, I am not sure whether these readouts are the cause of the slowness. I am working through a process of elimination to get to the root cause!

Cheers

Jimbo
 
hi there,

Very few collisions?
Collisions shouldn't be occurring at all.
At least, I don't have any collisions occurring, not even a few; all our hardware is on a 100 Mb full-duplex network.
The perfmeter here doesn't show collisions on any station, not even a few...
I'm not sure how many collisions can occur before everything slows down, but I suspect it can't be many... or none at all.

Check the NIC speed post here in the Solaris forum; it'll tell you how to check and set up your network interface card for full duplex and half duplex, since that is the main problem when you get strange performance drops.

If it's happening from time to time, and not on all clients, you may need more NFS daemons, but I'd go for the network stuff first.

Maybe do a snoop on the network card.
That will show you network messages related to the NFS service, and you could learn more from the error messages there, but not while the network is shaky. Therefore, cure the network and collisions thing first.
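
For anyone without that other post to hand, here is a minimal sketch of the duplex check. On hme interfaces the negotiated state can be read with ndd (link_speed and link_mode). The helper name describe_hme_link is made up for illustration, and the 0/1 meanings are the ones I understand the hme driver to use, so verify them against the documentation for your release.

```shell
#!/bin/sh
# Translate the two numbers printed by
#   ndd -get /dev/hme link_speed
#   ndd -get /dev/hme link_mode
# into words. describe_hme_link is a hypothetical helper name.
describe_hme_link() {
    speed=$1   # assumed meaning: 1 = 100 Mbit/s, 0 = 10 Mbit/s
    mode=$2    # assumed meaning: 1 = full duplex, 0 = half duplex
    case $speed in
        1) s="100 Mbit/s" ;;
        0) s="10 Mbit/s" ;;
        *) s="unknown speed" ;;
    esac
    case $mode in
        1) m="full duplex" ;;
        0) m="half duplex" ;;
        *) m="unknown duplex" ;;
    esac
    echo "$s, $m"
}
```

As root you would first pick the interface with ndd -set /dev/hme instance 0, then call describe_hme_link "$(ndd -get /dev/hme link_speed)" "$(ndd -get /dev/hme link_mode)". The switch port has to agree with whatever the card reports, otherwise you get exactly this kind of on-and-off slowness.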



 
Hmmmmm .. my last reply failed ........ here goes again!

The network card is set to 100 and full duplex.

I would like to know how to assess the need for NFS daemons on my system. What dictates the need for the NFS daemon? Is it the process that serves the remote file systems from the server to the client? Please excuse my ignorance, I inherited this job!

Cheers

Jimbo
 
I tried netstat and it reflects what I am seeing in the performance meters. The number of collisions is much lower now, but the consistent thing showing on each machine is the number of interrupts.
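
To put a number on "much lower", the collision count can be compared with the packets sent. This is only a sketch: it assumes the column headings Opkts and Collis that netstat -i prints on Solaris, and the function name collision_rate is made up for illustration.

```shell
#!/bin/sh
# Read "netstat -i" output on stdin and print, per interface,
# collisions as a percentage of output packets.
# collision_rate is a hypothetical helper name.
collision_rate() {
    awk '
        NR == 1 {
            # find the columns by name rather than by fixed position
            for (i = 1; i <= NF; i++) {
                if ($i == "Opkts")  o = i
                if ($i == "Collis") c = i
            }
            next
        }
        o && c && $o > 0 {
            printf "%s %.2f\n", $1, 100 * $c / $o
        }
    '
}
```

Usage would be netstat -i | collision_rate on each machine. On a link that really is running full duplex the figure should stay at zero, so anything above that is worth comparing between the clients that slow down and the ones that do not.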

As a piece of background info, I have recently removed a line in the /etc/system file that was used to resync the SCSI rate between the external disk pack and the E450 server. The line slowed the data transfer rate between the two devices so that we stopped getting SCSI errors every 30 seconds or so. But I took out the line as I figured that having these SCSI errors reported was the lesser of two evils (one being SCSI errors and the other being enforced system slowness). I am wondering if the interrupts are the SCSI errors and thus a red herring?

Thanx for your patience chaps!

Jimbo
 
Hi Jimbo,

The Saga Continues...

What was that line in /etc/system?

Could be. Retrying the transfer several times a second can of course clutter the system.
A few months ago I had SCSI error mayhem on an external disk pack. It turned out that the tape device was connected behind it without a SCSI ID; removing the tape device made the errors "Go Away".



 
Indeed .... what a saga!

The line removed was:

set scsi_options = 0x58

This was a supplied solution from Sun themselves. I hope they gave me the optimum solution as there seems to be a wide variety of ways to implement this command.

We have an internal DDS3 tape on the E450, so I don't think that would be an issue. What I would like advice on though is whether 0x58 is the best setting for my SCSI devices. It is hard for me to analyse this as I don't have a very good grip on the syntax for the set command used in my system file!
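
On the syntax: the value after scsi_options is a bit mask, with each bit switching one SCSI feature on or off. The sketch below decodes a value into its bits. The bit assignments are my assumption, taken from the SCSI_OPTIONS_* definitions as I know them from Sun's driver documentation, so check them against the headers on your own machine. The function name decode_scsi_options is made up.

```shell
#!/bin/sh
# Print which features a scsi_options value switches on or off.
# decode_scsi_options is a hypothetical helper name; the bit values
# below are assumed from the SCSI_OPTIONS_* definitions.
decode_scsi_options() {
    value=$(( $1 ))
    for pair in \
        0x8:disconnect/reconnect \
        0x10:linked-commands \
        0x20:synchronous-transfer \
        0x40:parity \
        0x80:tagged-queuing \
        0x100:fast-scsi \
        0x200:wide-scsi
    do
        bit=${pair%%:*}      # text before the colon, e.g. 0x20
        name=${pair#*:}      # text after the colon
        if [ $(( value & $bit )) -ne 0 ]; then
            state=on
        else
            state=off
        fi
        echo "$name: $state"
    done
}
```

Under those assumptions, decode_scsi_options 0x58 reports disconnect/reconnect, linked commands and parity as on, and synchronous, tagged, fast and wide transfers as off. That would fit a setting intended to slow the bus down rather than tune it.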

I have just taken a look at my /var/adm/messages file and there are no scsi speed errors! So it seems that since I changed the set scsi_options line in the system file, the scsi errors of old have not returned! This is making my brain hurt!

Cheers

Jimbo
 
Also, make sure your overall network health is OK. Are you isolated from sporadic and unnecessary traffic? For example, we had an issue a week or so ago where clients were showing sporadic but critical system slowness. It turned out to be a user listening to online music from the net (a definite no-no!). It absolutely killed us. We removed the user from the switch, and things immediately cleared up.

Hope this might help!
 
Although I would like to remove some of our users from the building :), it would not be because they are abusing the system. They can only interact within our firewall. The only link from the NT LAN to our Unix LAN is via a Solstice NFS client and some sporadic FTP transfers (although this is single user only and generally from PCs, so I would not expect it to affect clients on the Unix LAN).

I guess my next question is ...... what impact on system speed do interrupts have generally? The performance monitor on the server is showing peaks of around 3000. Below is a bit of info I collected from the outputs of vmstat

vmstat 1 10

procs    memory             page              disk        faults        cpu
r b w   swap  free re  mf  pi po  fr de sr f0 m0 m1 m2   in   sy  cs us sy  id
0 0 0  68184 90888  2 220 372 47 125  0 12  0  0  1  1 1145 1061 262  2  2  96
0 0 0 825000 20584  0   7   0  0   0  0  0  0  0  0  0 1255  192 212  0  3  96
0 0 0 825000 20584  0   0   0  0   0  0  0  0  0  0  0  530  117  42  0  0 100
0 0 0 825000 20584  0   0   0  0   0  0  0  0  0  0  0  693  125 112  0  0 100
0 1 0 825000 20584  0   0   0  0   0  0  0  0  0  0  0  736  185 108  0  1  99
0 0 0 825000 20584  0   0   0  0   0  0  0  0  0  0  0  741  132 124  0  0 100
0 0 0 825000 20584  0   0   0  0   0  0  0  0  0  0  0  517  135  48  0  0 100
0 0 0 825000 20584  0   0   0  0   0  0  0  0  0  0  0  603  120  44  0  0 100
0 0 0 825000 20584  0   0   0  0   0  0  0  0  0  0  0  531  137  64  0  0 100
0 0 0 825000 20584  0   0   0  0   0  0  0  0  0  0  0  718  162  92  0  1  99
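
To compare the interrupt rate between machines, the "in" column can be averaged over the samples. A rough sketch follows; the function name avg_interrupts is made up. It skips the first data line because vmstat reports averages since boot there, and it finds the column by its heading since the disk columns differ from box to box.

```shell
#!/bin/sh
# Read "vmstat <interval> <count>" output on stdin and print the
# average of the "in" (interrupts) column, ignoring the first sample.
# avg_interrupts is a hypothetical helper name.
avg_interrupts() {
    awk '
        $1 == "r" {
            # heading row: remember which column is called "in"
            for (i = 1; i <= NF; i++) if ($i == "in") col = i
            next
        }
        col && $1 ~ /^[0-9]+$/ {
            seen++
            if (seen == 1) next     # first sample is the since-boot average
            sum += $col
            n++
        }
        END {
            if (n) printf "%d samples, %.0f interrupts/s average\n", n, sum / n
        }
    '
}
```

Usage would be vmstat 1 10 | avg_interrupts. Fed the ten lines above, it reports 9 samples with an average of 703 interrupts a second, on a machine that is 96 to 100 percent idle at the time.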


vmstat -s

0 swap ins
0 swap outs
0 pages swapped in
0 pages swapped out
80631001 total address trans. faults taken
6241159 page ins
1150226 page outs
17019920 pages paged in
2166818 pages paged out
984878 total reclaims
971813 reclaims from free list
0 micro (hat) faults
80631001 minor (as) faults
3692256 major faults
28930262 copy-on-write faults
7534898 zero fill page faults
4632519 pages examined by the clock daemon
73 revolutions of the clock hand
5721818 pages freed by the clock daemon
1253065 forks
100497 vforks
1343993 execs
95842905 cpu context switches
454767311 device interrupts
113401195 traps
387260244 system calls
192196129 total name lookups (cache hits 94%)
23196 toolong
1764047 user cpu
2730862 system cpu
93275285 idle cpu
11709269 wait cpu
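
The last four counters are easier to judge as percentages. A small sketch that converts them; the function name cpu_shares is made up, and it relies only on the "<count> <state> cpu" layout shown above.

```shell
#!/bin/sh
# Read "vmstat -s" output on stdin and print each CPU state
# (user, system, idle, wait) as a share of all CPU ticks since boot.
# cpu_shares is a hypothetical helper name.
cpu_shares() {
    awk '
        # match "1764047 user cpu" but not "95842905 cpu context switches"
        NF == 3 && $3 == "cpu" {
            ticks[$2] = $1
            total += $1
            order[++n] = $2
        }
        END {
            if (total > 0)
                for (i = 1; i <= n; i++)
                    printf "%s %.1f%%\n", order[i], 100 * ticks[order[i]] / total
        }
    '
}
```

Usage would be vmstat -s | cpu_shares. On the counters above it works out to roughly 1.6% user, 2.5% system, 85.2% idle and 10.7% wait. These are averages since boot, so they can hide short bursts.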

Does any of this look excessive? I guess at the end of this episode I will never get a job as a pure sys admin!

Ho Ho!

Jimbo
 
Hi there,

The scsi_options setting is for async transfer.
Rather strange that you need to set it.
I mean, I don't have to set a SCSI option on our E250 with 2 disk packs on the second controller. I just connect them, mirror them with DiskSuite, set it all up and it runs perfectly.
Maybe if the disk packs are connected to the same controller as some fast hard disk you could run into trouble, but then again, never mix different disk types on the same controller.
That said, I did have some SCSI errors about read/write errors, reducing this and retrying that. After a complete format and check of the disk they stayed away. Lucky me.

Network issues are indeed a crucial thing. Use snoop to determine *what* is actually being fed to your interface.
Anything you don't know or don't trust, find out where it's coming from.
Run snoop > filename as root and let it run for a few minutes when the machine is slowing down; afterwards you can check the snooped file and see what the network is sending.
I had 1 desktop U10 running a full-blown routing service for its subnet; needless to say, such a thing slows the whole store down.
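
A few minutes of snoop output is a lot to read by eye, so a first pass could simply count lines per sender. This is a sketch only: top_talkers is a made-up name, and it assumes the one-line "source -> destination ..." summary format that snoop prints by default.

```shell
#!/bin/sh
# Read snoop summary lines on stdin and list senders by packet count,
# busiest first. top_talkers is a hypothetical helper name.
#
# Possible capture steps (interface and file name are assumptions):
#   snoop -d hme0 -o /tmp/slow.cap      # save packets while it is slow
#   snoop -i /tmp/slow.cap | top_talkers
top_talkers() {
    awk '
        $2 == "->" { count[$1]++ }
        END { for (h in count) print count[h], h }
    ' | sort -rn
}
```

Saving with -o and replaying with -i keeps the raw packets, so the same capture can be looked at again in more detail later. A plain snoop > filename as described above works with the function too.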

Our solaris environment is behind 1 router which blocks all unnecessary traffic from our normal IT-LAN.

Anyway, I'm curious to find out *what* the problem actually was/is.

Put on some coffee, power off your mobile phone and get searching!

Cheers!



 
OK ....... here are some lines from snoop ......

zeus is main server
mercury is client that runs a lot of post process operations on a tuesday night
atlas is a client that I work on

zeus -> mercury RPC R XID=4007801815 Success
mercury -> zeus NFS C LOOKUP3 FH=A9F0 sti-0204_01080_2.LCK
zeus -> mercury NFS R LOOKUP3 OK FH=143F
mercury -> zeus NFS C LINK3 FH=D977 to FH=A9F0 sti-0204_01080_2.LCK
zeus -> mercury NFS R LINK3 File exists
mercury -> zeus NFS C LOOKUP3 FH=A9F0 .nfs75C7
zeus -> mercury NFS R LOOKUP3 No such file or directory
mercury -> zeus NFS C RENAME3 FH=A9F0 .mercury274 to FH=A9F0 .nfs75C7
? -> (broadcast) ETHER Type=8137 (Novell (old) NetWare IPX), size = 56 bytes
atlas -> zeus TELNET C port=55148
zeus -> atlas TELNET R port=55148 /dev/hme (promiscuou
sn8032dfh21368.bma.org.uk -> (broadcast) ARP C Who is 172.24.22.160, bmjspjssrv1.bma.org.uk ?
172.24.26.1 -> ALL-ROUTERS.MCAST.NET UDP D=1985 S=1985 LEN=28
zeus -> mercury NFS R RENAME3 OK
mercury -> zeus NFS C REMOVE3 FH=A9F0 .nfs75C7
atlas -> zeus TELNET C port=55148
zeus -> mercury NFS R REMOVE3 OK
mercury -> zeus NFS C GETATTR3 FH=C433
zeus -> mercury NFS R GETATTR3 OK
mercury -> zeus NFS C LOOKUP3 FH=1C1C OUT
zeus -> mercury NFS R LOOKUP3 OK FH=A9F0
mercury -> zeus NFS C LOOKUP3 FH=A9F0 .mercury274
zeus -> mercury NFS R LOOKUP3 No such file or directory
mercury -> zeus NFS C GETATTR3 FH=A9F0
zeus -> mercury NFS R GETATTR3 OK
mercury -> zeus NFS C ACCESS3 FH=A9F0 (lookup)
zeus -> mercury NFS R ACCESS3 OK (lookup)
mercury -> zeus NFS C CREATE3 FH=A9F0 (GUARDED) .mercury274
? -> * ETHER Type=0000 (LLC/802.3), size = 17 bytes
zeus -> mercury NFS R CREATE3 OK FH=E732
? -> (broadcast) ETHER Type=8137 (Novell (old) NetWare IPX), size = 60 bytes


then a bit further on .........

atlas -> zeus RSTAT C Get Statistics
zeus -> atlas RSTAT R Get Statistics
172.24.25.1 -> ALL-ROUTERS.MCAST.NET UDP D=1985 S=1985 LEN=28
172.24.3.5 -> (broadcast) ARP C Who is 172.24.6.61, laser2 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.58.246, 172.24.58.246 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.59.4, 172.24.59.4 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.59.8, 172.24.59.8 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.56.125, 172.24.56.125 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.50.94, 172.24.50.94 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.51.105, 172.24.51.105 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.56.5, 172.24.56.5 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.21.191, 172.24.21.191 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.56.13, 172.24.56.13 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.56.18, 172.24.56.18 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.53.2, 172.24.53.2 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.50.133, 172.24.50.133 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.50.147, 172.24.50.147 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.51.160, 172.24.51.160 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.59.209, 172.24.59.209 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.51.32, 172.24.51.32 ?
172.24.3.5 -> (broadcast) ARP C Who is 172.24.3.10, 172.24.3.10 ?
atlas -> (broadcast) ARP C Who is 172.24.6.114, olympus ?
atlas -> olympus RPC R XID=1018397668 Success


Hope this sheds some light!

Thanks for all the attention you guys .... hope I can repay you with my knowledge some time! (I work in Medical Publishing)

Jimbo

 
Hi there,

First, I would say the free swap looks small compared with the swap used.
The rest is getting a bit deep for me, but I would say the machine is running low on resources and is being pushed, no?

Run /usr/proc/bin/ptree and post the output; that might be helpful too.

Iga



Medical publishing ?
I just visited the Korperwelten exhibition. Almost threw up... Way too visual...

 
Hi All,

Sorry no response yesterday ... I have a very dodgy back which went after picking up a sock at the gym!

Here is that ptree ......

157 /usr/sbin/nis_cachemgr
149 /usr/sbin/rpcbind
151 /usr/sbin/keyserv
159 /usr/sbin/rpc.nisd
188 /usr/lib/nfs/statd
183 /usr/sbin/inetd -s
198 rpc.rstatd
302 bootpd
1138 /usr/dt/bin/rpc.ttdbserverd
1147 in.telnetd
1149 -sh
20488 csh
3981 ./ptree
22669 in.rlogind
22671 -sh
24715 in.telnetd
24717 -sh
26729 csh
21753 in.telnetd
21755 -sh
21759 csh
25464 in.telnetd
25466 -sh
25470 csh
11572 in.rlogind
11574 -sh
11720 /bin/csh ./spj02_cdrom_all
1622 in.rlogind
1624 -sh
1932 in.telnetd
1934 -sh
190 /usr/lib/nfs/lockd
212 /usr/sbin/syslogd
208 /usr/lib/autofs/automountd
220 /usr/sbin/cron
232 /usr/sbin/nscd -S passwd,yes -S group,yes
248 /usr/lib/lpsched
3937 /usr/lib/lpsched
3938 /bin/sh -c /etc/lp/interfaces/laser1 laser1-31593 csmith@zeus "31593-
3939 /bin/sh -c /etc/lp/interfaces/laser1 laser1-31593 csmith@zeus "3159
3966 <defunct>
3967 /usr/spool/lp/bin/lp.tell laser1
3968 /bin/sh -c /etc/lp/interfaces/laser1 laser1-31593 csmith@zeus "
3969 /opt/hpnp/bin/hpnpf -j laser1-31593+csmith@zeus -w -b1396316
3970 /usr/bin/sh /etc/lp/interfaces/model.orig/laser1 laser1-315
3980 cat /var/spool/lp/tmp/zeus/31593-1
272 /usr/lib/sendmail -bd -q15m
267 /usr/lib/power/powerd
279 /usr/lib/utmpd
284 /opt/hpnp/bin/hpnpd
297 /usr/sbin/vold
358 /usr/lib/nfs/nfsd -a 16
355 /usr/lib/nfs/mountd
371 /usr/dt/bin/dtlogin -daemon
473 /usr/lib/snmp/snmpdx -y -c /etc/snmp/conf
673 mibiisa -p 32891
471 /opt/SUNWpcnfs/sbin/rpc.pcnfsd
483 /usr/lib/dmi/snmpXdmid -s zeus
482 /usr/lib/dmi/dmispd
565 /utils/adobe/Acrobat3.0/Distillr/sparcsolaris/bin/distilld -noparamprefs
539 /utils/adobe/Acrobat3.0/Distillr/sparcsolaris/bin/distilld -noparamprefs
591 /utils/adobe/Acrobat3.0/Distillr/sparcsolaris/bin/distilld -noparamprefs
617 /utils/adobe/Acrobat3.0/Distillr/sparcsolaris/bin/distilld -noparamprefs
670 /utils/adobe/Acrobat3.0/Distillr/sparcsolaris/bin/distilld -noparamprefs
697 /utils/adobe/Acrobat3.0/Distillr/sparcsolaris/bin/distilld -noparamprefs
725 /utils/adobe/Acrobat3.0/Distillr/sparcsolaris/bin/distilld -noparamprefs
747 /etc/xyvision/access/lmgrd -c /etc/xyvision/access/license.dat
748 xymgr -T zeus 4 -c /etc/xyvision/access/license.dat
750 bgquer composeq composexq spool1q ps1fmtq ps5fmtq ps7fmtq ps9fmtq ps4fmtq
761 /usr/lib/saf/sac -t 300
764 /usr/lib/saf/listen tcp
765 /usr/lib/saf/ttymon
756 sproc
762 /usr/lib/saf/ttymon -g -h -p zeus console login: -T sun -d /dev/console
1183 xzentec -e /usr/app/xyvision/xz/bin/xsh
1192 /usr/app/xyvision/xz/bin/xsh
13903 /usr/app/xyvision/xz/bin/xsp -xsh
1196 xzentec -e /usr/app/xyvision/xz/bin/xsh
1205 /usr/app/xyvision/xz/bin/xsh
1209 xzentec -e /usr/app/xyvision/xz/bin/xsh
1218 /usr/app/xyvision/xz/bin/xsh
777 /usr/app/xyvision/xz/bin/xsp -xsh
28166 xzentec -e /usr/app/xyvision/xz/bin/xsh
28175 /usr/app/xyvision/xz/bin/xsh


Cheers

Jimbo
 
Process 3966 <defunct> is bad. This will slow the system down. Run ps -ef, look for its PID and PPID, find out which program it belongs to, then kill the process.
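
A sketch for pulling those two numbers out of ps -ef; the function name defunct_parents is made up. Keep in mind that a defunct entry is a process that has already exited and is only waiting for its parent to collect its exit status, so sending it a kill does nothing. The PPID is the process to look at.

```shell
#!/bin/sh
# Read "ps -ef" output on stdin and print "PID PPID" for every
# defunct (zombie) entry. defunct_parents is a hypothetical helper name.
defunct_parents() {
    # PID and PPID are the second and third columns of ps -ef;
    # a zombie shows <defunct> as its command
    awk '$NF == "<defunct>" { print $2, $3 }'
}
```

Usage would be ps -ef | defunct_parents, followed by ps -fp with the PPID it prints to see which program is leaving the entry behind.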
 
Hi All,

This is interesting. The processes that are defunct are linked to print processes. We currently have a network printer (HP 5000 Si) which is very slow. I am not sure if it is the configuration or whether there are some processes (like the above) that are slowing it down. The printer should be the fastest we have, but it is printing very slowly, as if it is ripping each page one at a time instead of caching a whole job. For admin purposes we are using Jetadmin, which I understand has been superseded. This might explain why I find it hard to get support for the Jetadmin software!

Another thing is that the laser printer seemed to slow down after an engineer serviced it. It looks like he changed the configuration, but you can never be sure. We have since had someone change parts inside, which should have reset the settings to default.

I am still getting the consistent interrupts on each client in the lan. I hope these problems are all linked so that when I fix one I fix them all (dream on!)

Cheers

Jimbo
 
I think that was the problem all along. Run ./jetadmin, then select diagnostics to see any errors. You might need to reinstall jetadmin_SOLd621.PKG.
 