
Can I add an error to be reported by errpt?

Status
Not open for further replies.

bi

Technical User
Apr 13, 2001
1,552
US
Here's my problem: All too frequently, the temperature in our computer room gets too hot. Today it got up to 93 F (33 C) before we knew the air conditioning had failed. This has happened several times, and each time we have lost one or two SSA disks two weeks after the overheating.

My HP system just goes into "hibernation" when the temperature gets too high. The disks stay spinning, but you can't get to the system and anyone who is logged in gets kicked out (that's how I discovered today's overheating).

Is there some way I can add to errpt a condition where if the internal temperature of the SSA drawers or the server itself gets above a certain point, an error is logged? I have the commands to extract the internal temp of the server and the SSA drawers and I have a script that warns me by email when there is a change in the number of errors in the error report.

Is there some way I can automatically get the system to shut down gracefully if the temperature remains above a certain level (for those times at night when nobody is here to shut the systems down)?

I am wondering if there is something with powerfail I can use to do this? (I would check the man pages and the rc.powerfail script, but my systems are currently down!)

A real monitoring system would be best, but I don't think management will want to spend the money.

Any suggestions/help is most appreciated.

 
Hi bi

The errlogger command writes entries into the error log.
Use your commands to check the temperature and, if it is
higher than expected, use errlogger to write a message to the error log. (See man errlogger.)

HTH
Axel
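
To make the errlogger suggestion concrete, here is a minimal sketch of a threshold check. The names log_if_hot, THRESHOLD_F and DRY_RUN, and the TEMP_HIGH marker text, are assumptions made up for illustration; none of them come from errlogger itself. The temperature is passed in as an integer because the output format of the sensor commands is not shown in this thread.

```shell
#!/bin/sh
# Sketch: write an operator message to the AIX error log when a
# temperature reading exceeds a threshold.
# THRESHOLD_F, DRY_RUN, log_if_hot and TEMP_HIGH are hypothetical names.

THRESHOLD_F=${THRESHOLD_F:-85}   # assumed alarm threshold, degrees Fahrenheit
DRY_RUN=${DRY_RUN:-1}            # 1 = only print the message, 0 = call errlogger

log_if_hot() {
    # $1 = source label (e.g. an enclosure name), $2 = integer temperature in F
    if [ "$2" -gt "$THRESHOLD_F" ]; then
        msg="TEMP_HIGH: $1 at $2 F exceeds threshold $THRESHOLD_F F"
        if [ "$DRY_RUN" -eq 1 ]; then
            echo "would log: $msg"
        else
            errlogger "$msg"     # AIX only; normally has to be run as root
        fi
        return 0
    fi
    return 1                     # reading is at or below the threshold
}
```

Run from cron with DRY_RUN=0, a call such as log_if_hot ssa0 "$TEMP" would add one entry per hot reading, which an existing errpt-counting mail script would then pick up.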
 
Hi,
here is what you can do.

If you know what type of error is reported in the error report,
create an entry in the ODM that runs your script.
For example:

1. Create an ODM stanza file for your error; call it temp.odm
NB: this example uses TAPE_ERR1 as the label in the error report
errnotify:
en_pid = 0
en_name = "TAPE_ERR1"
en_persistenceflg = 1
en_label = "TAPE_ERR1"
en_crcid = 0
en_class = "-"
en_type = "-"
en_alertflg = "-"
en_resource = ""
en_rtype = ""
en_rclass = ""
en_symptom = ""
en_method = "/u/script/temp/error_notify 1"

NB: you will have to add temp.odm to the ODM. Do this with
odmadd temp.odm

If you later don't want the script to run, remove it from en_method and update the ODM.

The field you are interested in is en_method; this is where you run your script, here called error_notify.

Your script could contain something like:

TYPE=$1    # $1 is the 1 passed in en_method above

case $TYPE in
1) print "Tape error" ;;
# or, in your case:
# 1) print "Temp too high, shutting down system" ;;
esac


You can add more errors here: in the ODM entry for each label that should run your script, pass 2, 3, 4, etc. as the argument, and add a matching case for that TYPE in the script.

Does that make sense?


The other thing is that the /etc/rc.powerfail script should account for this type of error and act accordingly. I think it does shut down the box; have a look at the script.

HTH
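
For the temperature case specifically, a notify method might look roughly like the sketch below. It assumes the overheating message was written with errlogger (those entries appear under the label OPMSG, so the stanza would use en_label = "OPMSG") and that en_method passes "$1", which errnotify replaces with the entry's sequence number. TEMP_HIGH is an assumed marker string that your own check would have to put in the message; handle_entry, entry_text and DRY_RUN are likewise hypothetical names. Using shutdown -F is a design choice, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical errnotify method, invoked as: temp_notify <sequence-number>
# TEMP_HIGH is an assumed marker text, not a real AIX error label.

DRY_RUN=${DRY_RUN:-1}     # 1 = only print the action, 0 = really shut down

entry_text() {
    # Print the detailed error log entry for sequence number $1 (AIX errpt).
    errpt -a -l "$1" 2>/dev/null
}

handle_entry() {
    # $1 = full text of one error log entry
    case "$1" in
        *TEMP_HIGH*)
            if [ "$DRY_RUN" -eq 1 ]; then
                echo "would run: shutdown -F"
            else
                shutdown -F      # AIX fast shutdown, still an orderly halt
            fi
            ;;
        *)
            echo "ignored"       # some other operator message
            ;;
    esac
}

# Main: act only when errnotify passed a sequence number.
if [ $# -ge 1 ]; then
    handle_entry "$(entry_text "$1")"
fi
```

Leaving DRY_RUN at 1 until the whole chain has been tested avoids an unplanned shutdown from a stray operator message.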




 
Thanks for the replies. I'll have to wait to try the solutions, however, because the air conditioning broke an hour and a half after it was fixed! I'll post what I did, once I can get it all up and going.

 
Hi,
Do you have a script that measures the temperature inside an RS6K?

Could you please e-mail it to me? I have been looking for such a script.

Thanks in advance
PoldiAIX
 
This is a small script that will check the temperature of the SSA enclosures attached to your system. You can modify it to e-mail you when temperatures reach a certain level:

#!/bin/ksh

# Print three arguments as aligned columns (no newline), anything else as-is.
function display
{
    if [ ${#} -eq 3 ]
    then
        printf "%-25s %-20s %s " "${1}" "${2}" "${3}"
    else
        echo "${*}"
    fi
}

# Read one enclosure's temperature and print it converted to Fahrenheit.
function check_temp
{
    ITEM="${1}"

    display "Checking temperature" "${ITEM}" "..."
    /usr/sbin/ssaencl -al ${ITEM} | grep temperature | awk '{print $2}' | read TEMP
    ((TEMP=TEMP*9/5+32))
    display "${TEMP}"
}

for ENCLOSURE in $(lscfg | grep enclosure | awk '{print $2}' | sort)
do
    check_temp ${ENCLOSURE}
done


Regards,
Chuck
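
One way the e-mail modification could look is sketched below. LIMIT_F, MAILTO, SEND_MAIL and warn_if_hot are hypothetical names, and the 85 F limit is only an assumed value; the conversion uses the same integer arithmetic as the script above, which treats the enclosure reading as Celsius.

```shell
#!/bin/sh
# Sketch: convert an enclosure reading to Fahrenheit and mail a warning
# above a limit. LIMIT_F, MAILTO, SEND_MAIL and warn_if_hot are hypothetical.

LIMIT_F=${LIMIT_F:-85}       # assumed warning limit, degrees Fahrenheit
MAILTO=${MAILTO:-root}       # assumed recipient
SEND_MAIL=${SEND_MAIL:-0}    # 0 = only print what would be mailed

c_to_f() {
    # Integer Celsius to Fahrenheit (truncating division).
    echo $(( $1 * 9 / 5 + 32 ))
}

warn_if_hot() {
    # $1 = enclosure name, $2 = integer temperature in Celsius
    temp_f=$(c_to_f "$2")
    if [ "$temp_f" -gt "$LIMIT_F" ]; then
        text="$1 is at ${temp_f} F (limit ${LIMIT_F} F)"
        if [ "$SEND_MAIL" -eq 1 ]; then
            echo "$text" | mail -s "$(hostname): temperature warning" "$MAILTO"
        else
            echo "would mail $MAILTO: $text"
        fi
    fi
}
```

Inside check_temp, the raw reading would be handed over before the conversion, for example warn_if_hot "${ITEM}" "${TEMP}".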
 
The command to check the temperature of the server itself is

/usr/lpp/diagnostics/bin/uesensor -l

(That's a lowercase L as the option.)

I also have a script that checks errpt as often as you want and emails you if there is a change in the number of errors. It was written by a Tek-Tips forum member. Do you want that?
 
This script will do the trick:

#!/bin/ksh

# Current number of entries in the error log (header line excluded).
TOTALERRS=`errpt | grep -v "IDENTIFIER" | wc -l`

# Load the count saved by the previous run, starting from 0 the first time.
if [ -s /usr/local/bin/errpt.count ]
then
    OLDERRS=`cat /usr/local/bin/errpt.count`
else
    echo "0" > /usr/local/bin/errpt.count
    OLDERRS=`cat /usr/local/bin/errpt.count`
fi

((NEWERRS=TOTALERRS-OLDERRS))

if [ ${NEWERRS} -gt 1 ]
then
    # Several new errors: send a single summary mail.
    echo "Please check errpt, ${NEWERRS} errors found!" | /usr/bin/mailx -vs "`hostname`: errpt report" user@domain.com
elif [ ${NEWERRS} -gt 0 ]
then
    # Exactly one new error: mail its description.
    errpt | grep -v "IDENTIFIER" | head -${NEWERRS} | cut -c 42- |
    while read ERRMSG
    do
        echo "errpt:${ERRMSG}" | /usr/bin/mailx -vs "`hostname`: errpt report" user@domain.com
    done
fi

# Save the current count for the next run.
echo ${TOTALERRS} > /usr/local/bin/errpt.count


Regards,
Chuck
 
Here's the script I use. Change the mailto value to a value you have in your /etc/aliases file. Also, this script is designed to start up at boot time and run every 5 minutes. Make changes if you don't like it that way.

This script was written by Bill Verzal, a frequent contributor to this forum.

#! /bin/ksh
#
# $0 = errmon.sh
#
# Written 11/3/1998 Bill Verzal.
#
# This script will run every [interval] and check the error log
# for new entries. Upon finding them, it will send an email to
# administrators containing a message indicating the change
# in errlog status, as well as the offending lines.
#
if [ "$1" = "-v" ] ; then
set -x
fi
lc="NULL"
tc="$lc"
# lc="last count"
# tc="this count"
#interval=900
interval=300
# Divide interval by 60 to get number of minutes.
me="$0 - Hardware error monitoring"
myname=`hostname`
args="$*"
#mailto="root"
mailto="alert"
true=0
false=1
boj=`date`

echo "$me started.\nThis message goes to $mailto." | mail -s "Errlog monitoring for $myname" $mailto
logger "$0 started"

while [ "$true" != "$false" ] ; do
    # Count entries for all error classes (hardware, software, undetermined, operator).
    tc=`errpt -dH,S,U,O | wc -l`
    if [ "$lc" = "NULL" ] ; then
        lc="$tc"
    fi
    if [ "$lc" -ne "$tc" ] ; then
        foo=`echo "$tc-$lc"|bc`
        msg="$foo new errors have been found on $myname"
        page_msg="$foo new errors have been found on $myname"
        errlogl=`errpt -dH,S,U,O -a`
        if [ "$tc" -eq "0" ] ; then
            msg="$msg\n Errlog was cleared"
        else
            logger "$msg"
            msg=" $msg \n Errlog details below:\n $errlogl \n"
        fi
        # Mail in both cases, so that a cleared error log is reported too.
        echo "$msg" | mail -s "Errlog status change on host $myname" $mailto
    fi
    lc="$tc"
    sleep $interval
done


 