
AIX connection problem - help needed

Status
Not open for further replies.

MCubitt

Programmer
Mar 14, 2002
1,081
GB
We're running an AIX 5.1 server with Oracle 9.2.0.1.0 running on it and several databases therein.

The AIX appears to log itself out of our network every now and again.

I am unable to telnet in (Connection to host lost.) and even going to the physical server, I see the following message:
Action: StartDTscreenBlank
An attempt to start a new process on host "Servername" failed.
To continue, you may need to stop an unneeded process on this host.

I am unable to close the message window.

If I attempt to start a terminal session on the server I get the message again.

I am forced to Exit -> Logout and restart.

This is a weekly event, but not at the same time or day each week (though it tends to be overnight).

Are there any error logs or system logs I can check? Has anyone experience of this problem, and a resolution?

 
Hi,

You can check mailx and errpt -a | more.
What maintenance level are you running on AIX 5.1?
If you are monitoring any error alerts via syslog, then check the files listed in /etc/syslog.conf.

Is your server connected to a switch? If so, do the port speeds match? Type entstat -d ent? (where ? is the network card number: 0, 1, etc.) and check the speed settings, then type lsattr -El ent? and compare. If they do not match, set both ends to the same speed.

If it only happens at night, do you run anything from cron at that time which could heavily utilise your network?

Just a few things to check.



 
Many thanks for those tips.

The errpt was a new command for me and could prove very useful. In this instance it did not catch any overnight error - if there was one.

The mailbox has a recurring error:
-------------------------------------------------------
651-880: The CEC or SPCN reported an error. Report the SRN and the
following reference and physical location codes to your service
provider.
Error log information:
Date: Tue 9 Sep 09:00:22 2003
Sequence number: 647
Label: EPOW_SUS_CHRP
Ref. Code: 10111520 FRU: 53P2399 U0.1-V2
-------------------------------------------------------


Not sure how serious that is, but will report it to our maintenance company.

Unfortunately I have no idea about the network infrastructure or setup.

Thanks anyway. I learned something, if not the solution to my specific problem!

 
Hi,

Since you are not able to open a session even while on the local server, I would suspect that the machine runs out of resources at this time, normally memory.
Possibly due to something heavy running on it at night.
Since you did not mention that you have to reboot to clear the problem, I assume there are no "zombie" processes filling up the paging space.

However, a good practice would be to look at the system resources with topas or nmon just before it happens.

In one such case I used the script below to collect data into a cyclic log of 10000 lines that can be analyzed after the event.
========================
#!/bin/ksh
###########################################################################################
# Run it once and it loops forever, taking one snapshot per pass.
# It writes to a log file which is trimmed to the last 10000 lines, so it never explodes.
# It's good if you want to catch the system load at a certain point in time, or just before the system sticks.
###########################################################################################
#set -x

LOGFILE="/logs/perform_check.log"

# Make sure the log exists so the first tail does not fail.
touch $LOGFILE

while true ; do
# Keep only the newest 10000 lines.
tail -n 10000 $LOGFILE > $LOGFILE.tmp
mv $LOGFILE.tmp $LOGFILE

echo "\n#######################################################" >> $LOGFILE
# Take the timestamp on every pass, not once at startup.
date >> $LOGFILE
echo "==================================================" >> $LOGFILE
vmstat 2 5 >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "Shows top 10 memory usage by process:" >> $LOGFILE
# Field 4 of ps auxw is %MEM; sort numerically, largest first.
ps auxw | sort -rn -k 4,4 | head -10 >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "Shows top 10 CPU usage by process:" >> $LOGFILE
# Field 3 of ps auxw is %CPU.
ps auxw | sort -rn -k 3,3 | head -10 >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "Shows zombie processes:" >> $LOGFILE
ps -ef | grep defunct | grep -v grep >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "Shows iostat:" >> $LOGFILE
iostat 2 3 >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "rpcinfo :\n" >> $LOGFILE
rpcinfo >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "Stations connected over IP :\n" >> $LOGFILE
arp -a >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "Running processes - ps -ef :\n" >> $LOGFILE
ps -ef >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "Running processes - ps auxww :\n" >> $LOGFILE
ps auxww >> $LOGFILE
echo "==================================================" >> $LOGFILE
echo "\n######################################################" >> $LOGFILE
sleep 60
done




"Long live king Moshiach !"
 
Thanks for the reply and script.

I *DO* have to reboot to solve the problem. I am unable to start a terminal session or telnet in so access to the server is pretty restricted.

I am running the script just now and will look forward to reviewing the resulting logs.


There is a heavy job that runs each night: the Tivoli backup. It is also apparent that /usr is full:
Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4           65536     38088   42%     2397    15% /
/dev/hd2         7471104         0  100%    69293     8% /usr

I presume that is not a healthy sign!

Sorry, I am very new to Unix and we have been thrown into the deep end, rather!



 
A full /usr is by itself a good reason for not being able to open a new session.
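
To catch a filling file system before it reaches 100%, something along these lines could be run by hand or from cron. It is only a rough sketch: it assumes your df accepts the POSIX -P flag (capacity in field 5, mount point in field 6), full_filesystems is a made-up helper name, and the 90% threshold is arbitrary.

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold (percent).
# Reads "df -kP" style text on stdin: in the POSIX layout the capacity
# is field 5 and the mount point is field 6.
# full_filesystems is a hypothetical helper name, not a system command.
full_filesystems() {
    limit=${1:-90}
    awk -v limit="$limit" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)            # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}

# Report every file system that is at least 90% full.
df -kP | full_filesystems 90
```

Mount points containing spaces would confuse the field numbering, so treat the output as a hint and confirm with df itself.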

Also:
1. If you have to reboot, the problem could be zombies. Run "ps -ef | grep defunct" to detect them; possibly some process is crashing frequently and leaving zombies behind.

2. Change the sleep in the script to 600 so that it samples every 10 minutes. One minute could be too frequent for overnight logging: with the log capped at 10000 lines, your relevant data might be gone by morning.

"Long live king Moshiach !"
 
MCubitt, the CEC is the main drawer of the computer, so you will want to get your maintenance company called soon. It might not be a big problem, but I wouldn't chance it.

When you call, they will probably ask you if you have run diag. That is a program that will run through tests to see if it can detect the problem. Sometimes it's useful, sometimes it just repeats what you see in errpt.

I would suggest you run diag, select Advanced Diagnostic Routines, and then Problem Determination to see if you get any more info.

Since this happens during backups, I'm wondering how fast your network is (assuming you are backing up to an external device). Is Tivoli on this system, or is the system only a Tivoli client? Presumably the databases aren't very active while the backups are going on, but have you checked with the DBAs to see if there are any heavy-duty night jobs scheduled around the same time as the backups?



 
In my experience, the EPOW_SUS_CHRP error warrants an immediate call to 1-800-IBM-SERV, unless you are poking the power button when you should not be.

IBM Certified -- AIX 4.3 Obfuscation
 
Last night I scheduled a shutdown.

The log this morning reads...

IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
369D049B   0910081403 I O SYSPFS         UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
C092AFE4   0910081403 I O ctcasd         ctcasd Daemon Started
A6DF45AA   0910081403 I O RMCdaemon      The daemon is started.
BE0A03E5   0910081203 P H sysplanar0     ENVIRONMENTAL PROBLEM
BE0A03E5   0910081103 P H sysplanar0     ENVIRONMENTAL PROBLEM
2BFA76F6   0910080703 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0910081303 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0910080603 T O errdemon       ERROR LOGGING TURNED OFF
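
The TIMESTAMP column in this summary is packed as MMDDhhmmYY, which is easy to misread. A small sketch for unpacking it follows; decode_errpt_ts is a made-up helper name, and it assumes the two-digit year means 20YY.

```shell
#!/bin/sh
# Decode an errpt summary timestamp (MMDDhhmmYY) into "YYYY-MM-DD hh:mm".
# Assumption: the two-digit year belongs to 20YY.
# decode_errpt_ts is a hypothetical helper name, not part of AIX.
decode_errpt_ts() {
    ts=$1
    case $ts in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "usage: decode_errpt_ts MMDDhhmmYY" >&2; return 1 ;;
    esac
    mm=$(printf '%s' "$ts" | cut -c1-2)    # month
    dd=$(printf '%s' "$ts" | cut -c3-4)    # day
    hh=$(printf '%s' "$ts" | cut -c5-6)    # hour
    mi=$(printf '%s' "$ts" | cut -c7-8)    # minute
    yy=$(printf '%s' "$ts" | cut -c9-10)   # two-digit year
    echo "20$yy-$mm-$dd $hh:$mi"
}

# The timestamp of the SYSPFS entry in the summary.
decode_errpt_ts 0910081403
```

For the SYSPFS entry this prints 2003-09-10 08:14.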


UNABLE TO ALLOCATE SPACE IN FILE SYSTEM is a worry!
LABEL: JFS_FS_FULL
IDENTIFIER: 369D049B

Date/Time: Wed 10 Sep 08:14:30 2003
Sequence Number: 753
Machine Id: 005D87BA4C00
Node Id: IFSSERVER
Class: O
Type: INFO
Resource Name: SYSPFS

Description
UNABLE TO ALLOCATE SPACE IN FILE SYSTEM

Probable Causes
FILE SYSTEM FULL

Recommended Actions
USE FUSER UTILITY TO LOCATE UNLINKED FILES STILL REFERENCED
INCREASE THE SIZE OF THE ASSOCIATED FILE SYSTEM
REMOVE UNNECESSARY DATA FROM FILE SYSTEM

Detail Data
MAJOR/MINOR DEVICE NUMBER
000A 0005
FILE SYSTEM DEVICE AND MOUNT POINT
/dev/hd2, /usr

I have no idea how to use FUSER. Still waiting on maintenance company to come back to me.

 
Something (vaguely) similar resulted in us clearing down /usr/adm/wtmp on a regular basis.

Dickie Bird (:)-)))
 
Excuse my ignorance, can I simply delete this file or must I empty it?

By the way, my file is not huge:
-rw-rw-r-- 1 adm adm 1102248 10 Sep 08:37 wtmp
 
A IBM 595W ACPS :
Serial Number...............YL1021000469
EC Level....................H62921
Product Specific.(CC).......51AF
FRU Number.................. 53P2399
Version.....................RS6K
System Info Specific.(YL)...U0.1-V2
Physical Location: U0.1-V2


This FRU looks like a power supply.
So the sysplanar errors are possibly coming from a bad power supply that needs replacement.
However, this would not explain /usr filling up. I think we are dealing with at least two different problems here.
I would increase /usr and watch its behaviour for a while.
Also, the output of the above script could be helpful.
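
Increasing /usr on AIX is normally done with chfs, which takes the size in 512-byte blocks. The sketch below only builds and prints the command rather than running it; grow_fs_cmd is a made-up helper name and 256 MB is an arbitrary example. It will only work if the volume group still has free physical partitions, which lsvg rootvg will show.

```shell
#!/bin/sh
# Build (but do not run) the chfs command that grows a file system.
# chfs sizes are given in 512-byte blocks, so 1 MB = 2048 blocks.
# grow_fs_cmd is a hypothetical helper name, not an AIX command.
grow_fs_cmd() {
    fs=$1
    add_mb=$2
    blocks=$((add_mb * 2048))
    echo "chfs -a size=+$blocks $fs"
}

# Check for free physical partitions first with: lsvg rootvg
# Then print the command that would add 256 MB to /usr.
grow_fs_cmd /usr 256
```

Once the printed command looks right, run it as root. smitty offers the same operation through its file system menus.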

"Long live king Moshiach !"
 
Again, forgive my ignorance, but HOW do I expand /usr? I need guidance here. I am, in fact, your worst nightmare ;-)
 
I am assuming you have no man pages on your system ?

To use fuser do

fuser <filesystem>

e.g.

fuser /usr

You may see something like this;

/u02/oracle9 > fuser /usr
/usr: 8152

This means process 8152 has /usr itself open (add the -c flag, as in fuser -c /usr, to report on every open file in the file system that contains /usr)

In this case it is the cron process
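
The reason the error report points at fuser is that a file which has been deleted but is still held open by a process keeps its disk space until that process closes it. A minimal demonstration of the effect, using a scratch file under /tmp (demo_unlinked_file is a made-up name, and this is only an illustration, not a diagnostic tool):

```shell
#!/bin/sh
# Show that a deleted file stays alive while a process holds it open.
# demo_unlinked_file is a hypothetical helper name for this illustration.
demo_unlinked_file() {
    f=${TMPDIR:-/tmp}/unlinked_demo.$$
    exec 3>"$f"            # this shell now holds the file open on fd 3
    rm -f "$f"             # the name is gone, so ls and du no longer see it
    if [ -e "$f" ]; then visible=yes; else visible=no; fi
    if echo "still writable" >&3; then writable=yes; else writable=no; fi
    exec 3>&-              # closing the descriptor finally frees the space
    echo "visible=$visible writable=$writable"
}

demo_unlinked_file
```

This prints visible=no writable=yes. On a real system the fix is to stop or restart whichever process still holds the deleted file.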

Alex
 
It must exist - so just do (in a script or at command line):
> /usr/adm/wtmp
(check permissions remain the same afterwards)
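
If you want the permissions check done for you, a sketch like this empties the file in place and warns if the mode or ownership changed. empty_in_place is a made-up helper name; redirecting nothing into a file truncates it without removing it, so the owner and mode should stay as they were.

```shell
#!/bin/sh
# Empty a log file in place and confirm mode, owner, and group are unchanged.
# empty_in_place is a hypothetical helper name, not a system command.
empty_in_place() {
    file=$1
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    before=$(ls -l "$file" | awk '{print $1, $3, $4}')
    : > "$file" || return 1        # truncate without deleting the file
    after=$(ls -l "$file" | awk '{print $1, $3, $4}')
    [ "$before" = "$after" ] || echo "warning: permissions changed on $file" >&2
}

# Example (run as root on the affected server):
# empty_in_place /usr/adm/wtmp
```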

Dickie Bird (:)-)))
 
I have man pages but not time to read them. Besides, you know UNIX. It takes no prisoners! It does what you type often without prompting!

Anyway, the resulting figure from the fuser was about as useful to me as a chocolate teapot.

I expect levw is right - still waiting for maintenance company.







 
Levw: Thanks (unfortunately we are outta space so maintenance guy will have to re-jig the allocation)

Dickiebird: Thanks - worked correctly.

 
Mcubitt, when things settle down, I suggest you go out to and search for "certification AND AIX" and download the certification guide for AIX 5.x. Although not compelling and gripping reading, that book will explain a lot of things and how to do them.

And learn about smitty. IBM has made it so easy for sys admins with that tool.

Also, I'm wondering if there might be a core file in /usr that is taking up room there? cd into /usr and type

find . -name core

If you find one, it should be OK to remove it, but do an ls -l to get the timestamp. It might be useful if it is from today. In that case, you could just move it to someplace where you do have room.
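
The same search with the long listing built in might look like this. It is only a sketch: list_core_files is a made-up helper name, and -xdev is added so that find stays on the file system it starts in.

```shell
#!/bin/sh
# List regular files named "core" under a directory, with size and timestamp.
# -xdev keeps find from crossing into other mounted file systems.
# list_core_files is a hypothetical helper name, not a system command.
list_core_files() {
    dir=${1:-/usr}
    find "$dir" -xdev -type f -name core -exec ls -l {} \; 2>/dev/null
}

# Example:
# list_core_files /usr
```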
 