AIX Hardware Monitoring

Mag0007 · Jan 15, 2006

Are there any good tools to monitor the hardware for AIX? Currently, I log in to the server and run errpt. Is there a way to automate this?

Does anyone have any opinions on diagela? What about a script for errpt, which will email me when there is a hardware problem?

TIA

plamb · Jan 15, 2006

we use Hobbit (like BigBrother)

hirschaj · Jan 15, 2006

You can just modify the ODM to take care of sending you an email whenever the system sees whatever type of error you define. Look at example 1 (errnotify) in this link...

http://www.unet.univie.ac.at/aix/aixprggd/genprogc/error_notice.htm
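
For reference, an errnotify entry along the lines of that example can be written out and reviewed as a plain text file before anything touches the ODM. The sketch below only prints the stanza. The object name hw_mail, the file path, and the recipient are made-up values, and the odmadd step is left as a comment. In the en_method string, $1 stands for the sequence number of the error log entry that triggered the notification.

```shell
#!/bin/sh
# Print an errnotify stanza that mails the detailed report for every
# hardware-class (H) error. "hw_mail" and the recipient are illustrative
# values, not taken from the thread.
make_errnotify_stanza() {
    recipient=$1
    cat <<EOF
errnotify:
        en_name = "hw_mail"
        en_persistenceflg = 1
        en_class = "H"
        en_method = "errpt -a -l \$1 | mail -s 'Hardware error' ${recipient}"
EOF
}

# Write the stanza to a file so it can be inspected first.
make_errnotify_stanza root > /tmp/hw_mail.add
cat /tmp/hw_mail.add

# After review, the entry would be loaded (and could later be removed) with:
#   odmadd /tmp/hw_mail.add
#   odmdelete -o errnotify -q "en_name = 'hw_mail'"
```

Keeping the stanza in a file also gives a record of exactly what was added, which makes backing the change out on a production box less nerve-racking.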

Jim Hirschauer

http://www.aixexpert.com

Mag0007 · Jan 15, 2006

Jim:
I looked at that, but I'm a little hesitant to play with the ODM on a production box.

How does diagela compare?

rzs0502 · Jan 16, 2006

On AIX 5.x,

Run 'diag'
Choose 'Task Selection'
Select 'Automatic Error Log Analysis and Notification'

"If you always do what you've always done, you will always be where you've always been."

Mag0007 · Jan 16, 2006

rzs:

I will try that....

Do you know if this exists on AIX 4.3.3?

spamly · Jan 16, 2006

I tie my errpt messages into my enterprise monitoring system. You could always modify this script to use emails instead of snmptrap (available from the IBM AIX Toolbox).

One advantage that this script has is the ability to categorize alerts. We send out a page if we find a "critical" alert, "major" alerts generate emails, and "minor" alerts are just logged in our event viewer. All notifications are controlled through the enterprise monitoring server.

My scripts are generally ugly, but get the job done.

Code:

#!/usr/bin/ksh
# /usr/local/scripts/em-errpt.ksh
#   Version 2.0
#   Updated: September 2, 2005
#   * Crontab initiated *
# root crontab should look like this.
###########################################
# Create alerts for any new lines in the errpt
###########################################
#0,5,10,15,20,25 * * * * /root/scripts/em-errpt.ksh > /dev/null 2>&1
#30,35,40,45,50,55 * * * * /root/scripts/em-errpt.ksh > /dev/null 2>&1
#
# Version 1.0
# The purpose of this script is to send any errpt messages to Enterprise
# Monitoring.  Check every five minutes for new messages and send all new
# lines (subject and time only) to Enterprise Monitoring with a custom
# sendtrap.
# Version 2.0
# Alerts are now categorized based on severity.
#
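###########
# ARRAYS - See below for how to populate these arrays.
# The empty assignments set element 0 of each array, so ${#NAME[*]} is one
# more than the number of populated entries; the until loops further down
# rely on that to stop after the last entry.
MINOR=
MAJOR=
CRITICAL=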

###########
# Populating the MINOR array.
MINOR[1]="RECEIVER OVER-RUN ON INPUT"
MINOR[2]="ERROR LOGGING TURNED ON"
MINOR[3]="SOFTWARE PROGRAM ERROR"
MINOR[4]="The daemon is started."
MINOR[5]="UNABLE TO ALLOCATE SPACE IN FILE SYSTEM"
MINOR[6]="SOFTWARE PROGRAM ABNORMALLY TERMINATED"

###########
# Populating the MAJOR array.
MAJOR[1]="JFS2 LOGGING IS BACK TO NORMAL"
MAJOR[2]="JFS2 LOG RECORDS FORCED OVERWRITTEN"
MAJOR[3]="DISK OPERATION ERROR"
MAJOR[4]="ADAPTER ERROR"
MAJOR[5]="I/O ERROR DETECTED BY LVM"
MAJOR[6]="COMMUNICATION PROTOCOL ERROR"

###########
# Populating the CRITICAL array.
CRITICAL[1]="ENVIRONMENTAL PROBLEM"

# Functions
_snmptrap()
{
 # Replace enterprise_monitoring_server with the name of your monitoring host.
 TRAPCOMMAND="/usr/local/bin/snmptrap -v 1 -c public enterprise_monitoring_server .1.3.6.1.4.1.8072"
 TRAPFROM=`uname -n`
 TRAPREQUIRED="6 1 '' .1.3.6.1.2.1.1.6 s"
 TRAPDETAILSEPERATOR=".1.3.6.1.2.1.1.6 s"
 ${TRAPCOMMAND} ${TRAPFROM} ${TRAPREQUIRED} "${TRAPSUBJECT}" \
  ${TRAPDETAILSEPERATOR} "${TRAPGROUP}" ${TRAPDETAILSEPERATOR} "${TRAPDETAIL1}" \
  ${TRAPDETAILSEPERATOR} "${TRAPDETAIL2}" &
}

# Move the previous errpt output to /tmp/em-errpt-old.txt.  This will give us
# something to compare the new one to.  If it doesn't exist, create a blank
# old one.
if [ -f /tmp/em-errpt.txt ] ; then
        mv /tmp/em-errpt.txt /tmp/em-errpt-old.txt
else
        touch /tmp/em-errpt-old.txt
fi

# Generate an errpt to /tmp/em-errpt.txt
errpt | grep -v IDENTIFIER > /tmp/em-errpt.txt

# Compare the new errpt output with the previous.  Any new lines should
# generate alerts.
# Separate fields with "|".  The first two fields are the timestamp and resource.
# The last field is a description.  The description comes through with commas
# instead of spaces.
for ALERT in `diff /tmp/em-errpt.txt /tmp/em-errpt-old.txt \
        | grep "^< " | sed 's/^< //' \
        | awk '{ printf "%s|%s|", $2, $5; \
        for (i=6; i<=NF; i++) printf "%s,", $i; printf "\n" }'`
do
        # Now that I have the alert, separate the fields and send the trap.
        DELIMITEDMESSAGE=`echo $ALERT | awk -F\| '{ print $3}'`
        # This line converts all commas to spaces and removes the space
        # from the end of the line.
        DESCRIPTION=`echo $DELIMITEDMESSAGE | sed 's/\,/ /g' | sed 's/ $//g'`
        RESOURCE=`echo $ALERT | awk -F\| '{ print $2 }'`
        TIMESTAMP=`echo $ALERT | awk -F\| '{ print $1 }'`
        # Assign the appropriate severity.
        DEFINED=FALSE
        COUNT=1
        until [[ $COUNT = ${#MINOR[*]} ]] ; do
                if [[ "$DESCRIPTION" = "${MINOR[$COUNT]}" ]] ; then
                        ASSIGNEDSEV=MINOR
                        DEFINED=TRUE
                fi
                (( COUNT = COUNT + 1 ))
        done
        if [[ $DEFINED != TRUE ]] ; then
                COUNT=1
                until [[ $COUNT = ${#MAJOR[*]} ]] ; do
                        if [[ "$DESCRIPTION" = "${MAJOR[$COUNT]}" ]] ; then
                                ASSIGNEDSEV=MAJOR
                                DEFINED=TRUE
                        fi
                        (( COUNT = COUNT + 1 ))
                done
        fi
        if [[ $DEFINED != TRUE ]] ; then
                COUNT=1
                until [[ $COUNT = ${#CRITICAL[*]} ]] ; do
                        if [[ "$DESCRIPTION" = "${CRITICAL[$COUNT]}" ]] ; then
                                ASSIGNEDSEV=CRITICAL
                                DEFINED=TRUE
                        fi
                        (( COUNT = COUNT + 1 ))
                done
        fi
        if [[ $DEFINED = FALSE ]] ; then
                ASSIGNEDSEV=MAJOR
        fi
        # "team" and "Support team" are placeholders for your own group names.
        export TRAPSUBJECT="$ASSIGNEDSEV: team: `uname -n` $DESCRIPTION"
        export TRAPGROUP="Support team"
        export TRAPDETAIL1="$RESOURCE"
        export TRAPDETAIL2="$TIMESTAMP"
        _snmptrap
done

exit 0
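
As spamly mentions, the same idea can send email instead of SNMP traps. A minimal sketch of that variation follows. It is not spamly's code: severity_of and new_entries are hypothetical helper names, the state files under /tmp and the MAILTO default are made up, and the severity strings are copied from the arrays above. The MAJOR strings are not listed because anything unrecognised falls back to MAJOR, as in the original.

```shell
#!/bin/sh
# Map an errpt description to a severity (strings taken from the script above).
severity_of() {
    case "$1" in
        "ENVIRONMENTAL PROBLEM") echo CRITICAL ;;
        "RECEIVER OVER-RUN ON INPUT" | "ERROR LOGGING TURNED ON") echo MINOR ;;
        "SOFTWARE PROGRAM ERROR" | "The daemon is started.") echo MINOR ;;
        "UNABLE TO ALLOCATE SPACE IN FILE SYSTEM") echo MINOR ;;
        "SOFTWARE PROGRAM ABNORMALLY TERMINATED") echo MINOR ;;
        *) echo MAJOR ;;
    esac
}

# Print the lines that are in the new listing but not in the old one.
new_entries() {   # usage: new_entries old_file new_file
    diff "$1" "$2" | sed -n 's/^> //p'
}

# errpt exists only on AIX; on any other host this part is skipped.
if type errpt >/dev/null 2>&1; then
    old=/tmp/errpt-mail-old.txt      # assumed state file names
    new=/tmp/errpt-mail.txt
    if [ -f "$new" ]; then mv "$new" "$old"; else : > "$old"; fi
    errpt | grep -v IDENTIFIER > "$new"
    # errpt columns: IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
    new_entries "$old" "$new" | while read -r id stamp type class res desc; do
        sev=$(severity_of "$desc")
        echo "$stamp $res $desc" |
            mail -s "$sev: $(uname -n) $desc" "${MAILTO:-root}"
    done
fi
```

On the first run the old listing is empty, so every entry currently in the error log is mailed once.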

rzs0502 · Jan 16, 2006

Hi Mag

I think it was only introduced in AIX 5.1
Prior to that, we used scripts like Spamly's to monitor.

Here's the one we currently use.

<code>
#!/usr/bin/ksh

# This script runs from the cron periodically and searches for
# hardware messages in the error log as well as stale logical volumes.

MMDD=`date +%m%d` # Display the month and day of month in numerics
YY=`date +%y` # Display the year in numerics
DatE=`date`
Time=`date +%H%M`
HosT=`uname -n`
HOUR=`date +%H`
end_date=${MMDD}${Time}${YY}
hw_errs=/var/tmp/hw_errors
sw_errs=/var/tmp/sw_errors
OUTFILE=/var/tmp/stalelv.$$
VAR=`echo Hardware errors on $HosT, call AIX support ! `

#get the start time
case $HOUR in
00) hour=23 ;; 01) hour=00 ;; 02) hour=01 ;; 03) hour=02 ;; 04) hour=03 ;; 05) hour=04 ;;
06) hour=05 ;; 07) hour=06 ;; 08) hour=07 ;; 09) hour=08 ;; 10) hour=09 ;; 11) hour=10 ;;
12) hour=11 ;; 13) hour=12 ;; 14) hour=13 ;; 15) hour=14 ;; 16) hour=15 ;; 17) hour=16 ;;
18) hour=17 ;; 19) hour=18 ;; 20) hour=19 ;; 21) hour=20 ;; 22) hour=21 ;; 23) hour=22 ;;
esac

# Change to the working directory
cd /var/tmp
[[ -f "$hw_errs" ]] && rm $hw_errs

# Format the search date and time criteria
start_date=`date +%m%d${hour}%M%y`
errpt -d'H' -T "UNKN,PERM,PEND,PERF,TEMP" -s "$start_date" -e"$end_date" | awk '{print $5,$6,$7}' | egrep -v RES | grep -v "rmt" > $hw_errs 2>/dev/null
if [ -s "$hw_errs" ]
then
    mail -s "H/W Errors on $HosT on $DatE" aixadmin <$hw_errs
    LINES=`cat $hw_errs | wc -l`
    if [ "$LINES" -ge 2 ]
    then
        MSG1=`tail -1 $hw_errs`
        MSG2=`tail -2 $hw_errs | head -1`
        /usr/bin/logger "(SYSMON) $VAR $MSG1"
        /usr/bin/logger "(SYSMON) $VAR $MSG2"
    else
        MSG1=`cat $hw_errs`
        /usr/bin/logger "(SYSMON) $VAR $MSG1"
    fi
fi

# Check for any stale logical volumes.
STALE=`lsvg -o | lsvg -il | grep -i stale`
if [ -n "$STALE" ]
then
    echo "Check for stale LV's on ${HosT} - ${DatE}" > ${OUTFILE}
    mail -s "${HosT}: Stale Logical Volume" aixadmin < ${OUTFILE}
fi
if [ -f "$OUTFILE" ]
then
    rm $OUTFILE
fi
</code>
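
The 24-entry case table in the script above works out "one hour ago" by hand. The same value can be computed with shell arithmetic, as in the sketch below; prev_hour is a hypothetical helper name, not part of rzs0502's script. As with the table, only the hour is moved back, so in the first hour after midnight the start stamp carries today's date with hour 23 and ends up later than the end stamp.

```shell
#!/bin/sh
# Print the hour before HH as two digits, wrapping 00 back to 23.
prev_hour() {
    h=${1#0}     # drop one leading zero so 08 and 09 are not read as octal
    printf '%02d\n' $(( (h + 23) % 24 ))
}

# Build the errpt -s start stamp (mmddhhmmyy) one hour back.
HOUR=$(date +%H)
start_date="$(date +%m%d)$(prev_hour "$HOUR")$(date +%M%y)"
echo "$start_date"
```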

Mag0007 · Jan 17, 2006

I tried doing this:

Run 'diag'
Choose 'Task Selection'
Select 'Automatic Error Log Analysis and Notification'

And added my email address to the list.

Then I tried filling my /tmp, and I got no email. Is there something I need to enable to get this to work?

jrothey · Jan 17, 2006

We've implemented a series of scripts specifically relating to particular applications. For drives and their diskspace try something small like this:

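#!/usr/bin/ksh
BSE=/home/bsp
# $4 is the %Used column (for example "95%").  Adding 0 forces a numeric
# comparison; a plain string comparison would miss "100%".
for a in `df -k|awk \
'NR > 1 && $4+0 >= 95 {print $7}'`; do
$BSE/bse_scripts/adm.page $a "is >= 95% on "`uname -n`
done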

Just create a script, check_diskspace.sh, and add it to your crontab to run every minute or however often depending on the drive itself. The adm.page file just has admin references to our pagers/phones. See below:

# This file is used to enter pager numbers or email addresses that will be
# utilized for administration of the Baan Server.
echo "$@" | mailx email_address@domain.com
echo "$@" | mailx phone_number@mobile.mycingular.com
#echo "$@" | mailx phone_number@messaging.nextel.com

Then as you continue to grow scripts you can always reference back to this adm.page for your core contact list and change/update/modify as needed.
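
A variation on the disk check with the threshold passed as an argument is sketched below. over_threshold is a hypothetical name, and the column positions assume the AIX df -k layout (column 4 is %Used, column 7 is the mount point), so it would need adjusting on other systems. It reads df output on stdin, which makes it easy to try against a saved listing.

```shell
#!/bin/sh
# Print the mount point of every filesystem at or above a usage threshold.
# Expects AIX-style "df -k" output on stdin.
over_threshold() {   # usage: df -k | over_threshold 95
    # "+ 0" turns "97%" into the number 97 before comparing.
    awk -v limit="$1" 'NR > 1 && $4 + 0 >= limit + 0 { print $7 }'
}

# Try it on a sample listing (made-up figures); prints /tmp and /var.
over_threshold 95 <<'EOF'
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/hd4            65536     29000   56%     2100     7% /
/dev/hd3           131072         0  100%      410     1% /tmp
/dev/hd9var         65536      1900   98%      650     4% /var
EOF
```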

rzs0502 · Jan 17, 2006

Did you press F7 (or Esc-7) to commit the email address in the diag menu?
I've been caught by that before, forgetting that Enter doesn't commit changes in diag menus.

Germo · Jan 18, 2006

rzs0502,

I have a few servers on AIX 5.2, but I can't find the "Automatic Error Log Analysis and Notification" option in the diag "Task Selection" section. Is there a fileset to install to get this option?

Thanks

rzs0502 · Jan 18, 2006

I have one 5.1 box left and I see the problem.
The error reporting depends on diagela.
To enable it, run:
/usr/lpp/diagnostics/bin/diagela ENABLE

To add the email address:
diag -> Task Selection -> Periodic Diagnostics -> Add to the error notification mailing list

Germo · Jan 18, 2006

Thanks, that's answered my question.

Mag0007 · Jan 18, 2006

rzs:

Can you try overloading your /tmp or whatever filesystem, to put an entry in your error report? And see if you get the email/page?

Thx

dano1979 · Jan 18, 2006

I think the key here is preventative maintenance. On our AIX 4.3 machine I added a script to crontab that runs every morning and gives me filesystem stats and the most recent errors.

Something like this, directed to a file and piped to the mailx command (very basic). Adjust the size to your needs; we run Universe on top of AIX and it can only support filesystems up to 2.4GB, and after that very bad things happen.

find / -size +3000000 -exec ls -l {} \; >> $logfile
errpt | head -n10 >> $logfile
cat $logfile | mailx -s "subject" username
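
A slightly fuller sketch of the same report is below. daily_report is a hypothetical function name, and the log file path and recipient in the usage comment are placeholders. Note that find -size without a suffix counts 512-byte blocks, so +3000000 matches files larger than roughly 1.5 GB.

```shell
#!/bin/sh
# Print large files under a directory followed by the newest error log
# entries. The size is in 512-byte blocks, as find -size expects.
daily_report() {   # usage: daily_report top_dir min_blocks
    find "$1" -type f -size +"$2" -exec ls -l {} \; 2>/dev/null
    # errpt exists only on AIX; skip it quietly anywhere else.
    if type errpt >/dev/null 2>&1; then
        errpt | head -n 10
    fi
}

# Typical use from cron (placeholder path and recipient):
#   logfile=/tmp/daily_report.txt
#   daily_report / 3000000 > "$logfile"
#   mailx -s "Daily report for `uname -n`" username < "$logfile"
```

Writing the log with > rather than >> starts a fresh file each morning, so the report does not keep growing.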

To check current CPU and memory use, try something like this:

ps aux | grep -v kproc | head -23

diagela should notify you of anything of importance. We have an old tape drive with a faulty SCSI adapter, and diagela picks it up pretty quickly.

Have fun!

bonsky · Jan 25, 2006

Well, we are all using IBMs here, so how about making use of the existing IBM Director? It's free when used with IBM machines. The latest version, IBM Director 5.1, supports AIX as a client as well. I think it's worth a try.
Thanks!
