Why do long-running jobs die?

Status
Not open for further replies.

chayase

IS-IT--Management
May 10, 2001
This might be just a general UNIX topic...

We have some S-85s that are shared by many users. Each system is a 12-way machine with 96 GB of memory and 128 GB of paging space. Some long-running jobs (those that run for more than 5 days) seem to die unexpectedly. We're not sure whether the program is at fault or which resource is running out when a job dies. Can anyone suggest how to monitor and/or troubleshoot this?

Thanks...

Colleen
 
1) Check the vmstat output when the job dies; look at the fre column.
2) Have you done any memory tuning?
 
Hi,
Also check errpt for detailed output. Was a core file created? If so, analyse that as well.

I didn't do any memory tuning, and most of my jobs run for months.

JSiva Om Maha Ganapathiye Namaga!
 
Hi,

Just to add to the above tips:

1. Create a script that logs vmstat output every 10 minutes to a log file, and check the log after a job dies.

2. In errpt, look for "APPLICATION TERMINATED ABNORMALLY" entries.

If you have any, cut and paste the following lines into any shell. They will display the reason for each application error and the time it happened:
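==========================================
ksh
# C60BB505 is the errpt identifier this script filters on (the entries from item 2).
# Start with an empty DATE1 so the newest entry is printed too; after that,
# entries logged in the same minute (HH:MM) as the previous one are skipped.
DATE1=""
for SEQUENCE in `errpt -a -j C60BB505 | grep Sequence | cut -d: -f2`
do
DATE2=`errpt -a -j C60BB505 -l $SEQUENCE | grep Date | awk '{ print $5 }' | cut -d: -f 1,2`
if [[ $DATE1 != $DATE2 ]] ;then
# Print month, day and time of the entry, then a tab, without a newline
echo "`errpt -l $SEQUENCE -a | grep Date | awk '{ print $3,$4,$5 }'` \t\c"
# Print the lines between PROGRAM NAME and ADDITIONAL INFORMATION, minus the headings
errpt -a -j C60BB505 -l $SEQUENCE | awk '/PROGRAM NAME/,/ADDITIONAL INFORMATION/' | grep -vE "PROG|ADD"
DATE1=$DATE2
fi
done
exit
===================================
"Long live king Moshiach !"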
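For tip 1 above (logging vmstat at intervals), a minimal sketch of such a logger might look like the following. The names LOGFILE, SAMPLE_CMD and log_sample are made up for illustration, and the default log path and the `vmstat 1 2` sample are assumptions, not something from the posts above. It prefixes every line with a timestamp so the log can be lined up against the times errpt reports.

```shell
# Hypothetical helper: append one timestamped sample to a log file.
# LOGFILE and SAMPLE_CMD are illustrative names; adjust both to your system.
LOGFILE=${LOGFILE:-/tmp/vmstat.log}
SAMPLE_CMD=${SAMPLE_CMD:-"vmstat 1 2"}

log_sample() {
    # One timestamp per sample, so every line of the sample can be
    # matched against the date and time shown by errpt.
    stamp=$(date '+%Y-%m-%d %H:%M:%S')
    # SAMPLE_CMD is left unquoted on purpose so it splits into command and arguments.
    $SAMPLE_CMD 2>&1 | while IFS= read -r line
    do
        printf '%s %s\n' "$stamp" "$line"
    done >> "$LOGFILE"
}
```

To sample every 10 minutes, log_sample could be called from cron, or from a loop such as `while true; do log_sample; sleep 600; done` started with nohup. After a job dies, the lines closest to the time of death show what the fre column looked like.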
 