Monitoring (nagios)

daFranze · Dec 9, 2004

I want to improve our machine monitoring. We are currently monitoring
CPU load, disk capacity, swap space, number of processes, DB tables, tnsping, etc.

Last week a power supply in my data center failed and a fuse blew, so all machines in the rack had only one PS left delivering power (hard to say this in English, I hope you can pick up what I mean).

That's why I want to write some additional monitoring scripts that process 'prtdiag -v' output and scan /var/adm/messages for errors, warnings, etc.
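A minimal sketch of such a messages scan, written as plain Bourne/Korn-compatible shell. The function name `scan_messages` and the keyword pattern are assumptions made for illustration, not an existing tool; the pattern in particular would need tuning against the strings your own hardware logs.

```shell
#!/bin/ksh
# Sketch only: scan_messages and its keyword list are hypothetical.
# Prints every line of a syslog-style file that mentions an error or
# warning keyword. Returns 1 if any line matched, 0 if none did, and
# 2 if the file could not be read.
scan_messages() {
    msg_file="${1:-/var/adm/messages}"
    # Assumed keyword list; extend it as needed.
    pattern='error|warning|fail|panic'
    [ -r "${msg_file}" ] || return 2
    # egrep is used because the old /usr/bin/grep on Solaris has no -E.
    if egrep -i "${pattern}" "${msg_file}"
    then
        return 1
    fi
    return 0
}
```

Called as `scan_messages /var/adm/messages`, the return code can be handed straight to whatever does the alerting.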

Does anybody know of public-domain or GNU scripts/tools for monitoring Sun hardware and messages?

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years

daFranze · Dec 10, 2004

Nobody is scanning the messages?

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years

bfitzmai · Dec 10, 2004

With all the help you give here, daFranze, I looked to see if I could find anything that would help you. I do not use any automated monitoring of /var/adm/messages, nor have I heard of such a tool.

daFranze · Dec 10, 2004

I know we used to scan the messages when we used CA Unicenter, but I don't know if it was a plugin or a separate tool; long time ago and nobody knows... :-( ;-)

I'm just writing a small script...

Thank you bfitzmai for helping me search!

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years

Mike042 · Dec 10, 2004

Hi Franz

I have spent months trying to find an answer to a similar problem on our systems (a mix of Compaq Alpha and 3 different types of Sun hardware). I want each box to monitor itself and pass "warning" or "problem" messages (depending on which threshold value is exceeded) to one central system for display. On Sun systems the problem I found with the command:
/usr/platform/`uname -i`/sbin/prtdiag -v
is that the output format differs depending on the hardware. So in my script I test for the type of hardware with uname -i | awk -F, '{print $2}' and then cut out the appropriate sections of the output to check the internal temperature, fan status and power supply status.
Currently all I do with /var/adm/messages is output a count of the lines in the file that carry today's date. That way I can tell when new lines are added on a particular system and can investigate.
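The daily line count could be sketched like this. The helper name `count_lines_for_day` and its optional second argument are hypothetical; the sketch assumes the usual syslog stamp at the start of each line ("Dec  9 10:00:01", with the day padded by a space), which is what `date '+%b %e'` produces in the C locale.

```shell
#!/bin/ksh
# Sketch only: count_lines_for_day is a hypothetical helper.
# Usage: count_lines_for_day FILE [PREFIX]
# Prints how many lines of FILE start with PREFIX (default: today's
# date in syslog form, e.g. "Dec  9").
count_lines_for_day() {
    cnt_file="${1}"
    cnt_prefix="${2}"
    if [ -z "${cnt_prefix}" ]
    then
        # LC_ALL=C keeps the month name in English, as syslog writes it.
        cnt_prefix=`LC_ALL=C date '+%b %e'`
    fi
    [ -r "${cnt_file}" ] || { echo 0; return 1; }
    grep -c "^${cnt_prefix} " "${cnt_file}"
    # grep -c exits non-zero when the count is 0, so report success here.
    return 0
}
```

Storing the previous count and comparing it on the next run is enough to notice that new lines have arrived.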
I know this doesn't really answer your enquiry, but as I haven't found anything useful either in many months of searching, perhaps there is nothing.

Good luck with your research.

Mike

daFranze · Dec 10, 2004

Thank you Mike, may I have a look at that prtdiag script? If you don't want to post it publicly I can post an email address.

My messages scanner Version 0.1 is ready to use! ;-)

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years

Mike042 · Dec 10, 2004

Hi Franz,

I don't mind posting the script. It is self-contained and runs on Solaris (written for Korn shell on Solaris 8). This version is the final "test version" before I incorporated it into my server monitoring script, hence the echo commands. The various outputs would be tested against values from a "Threshold values" file and appropriate "warning" or "problem" messages sent. Also, it is written in my "easy to read" style of scripting, which may not be the most efficient way to write it, but it does mean that in years to come it can be quickly understood and modified.
As I mentioned we have 3 types of Sun Systems, Sun-Fire v1280 (= Netra-T12), Enterprise 250 (= Ultra-250) and SunBlade 150 (has no environmental monitoring) so the script is written to cater for only them. It should be quite easy to add sections of code for other hardware you might possess.
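One way to keep that extension point in a single place is a case statement on the platform name. This is only a sketch: the function name `platform_family` and the labels it prints are made up for illustration, while the two platform strings are the ones named above.

```shell
#!/bin/ksh
# Sketch only: platform_family and the labels it prints are hypothetical.
# Usage: platform_family "`uname -i`"
# Maps a platform string such as "SUNW,Ultra-250" to the name of the
# prtdiag layout that has to be parsed.
platform_family() {
    # Drop the vendor prefix: "SUNW,Ultra-250" -> "Ultra-250"
    hw=`echo "${1}" | awk -F, '{print $2}'`
    case "${hw}" in
        Ultra-250)  echo "e250"  ;;        # Enterprise 250
        Netra-T12)  echo "v1280" ;;        # Sun-Fire v1280
        *)          echo "unsupported" ;;  # no parser written yet
    esac
}
```

Supporting another machine then means adding one case branch plus the matching extraction section.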

Code:

#!/bin/ksh
#
#
   LOGS_DIR="/tmp"
#
   function extract_text
   {
#  Starting string  = ${START_STRING}
#  Finishing string = ${FINISH_STRING}
#  Name of File     = ${1}
#  Minimum number of lines = ${2}

   ext_output_flag="OFF"
   ext_lines_count=0

   if [[ ! -e ${1} ]]; then return 1; fi
   cat ${1} | while read ext_line
   do
       ext_string_found=`echo "${ext_line}" | grep -c "${START_STRING}"`
       if [[ ext_string_found -gt 0 ]]
       then
           ext_output_flag="ON"
           ext_lines_count=0
       fi
       if [[ "${ext_output_flag}" = "ON" ]]
       then
           echo "${ext_line}"
           ((ext_lines_count=ext_lines_count + 1))
           ext_string_found=`echo "${ext_line}" | grep -c "${FINISH_STRING}"`
           if [[ ext_string_found -gt 0 ]]
           then
               if [[ ext_lines_count -gt ${2} ]]; then ext_output_flag="OFF"; fi
           fi
       fi
   done
   }
#
#
   HARDWARE_TYPE=`uname -i | awk -F, '{print $2}'`
 
   ENVMON_FILE="${LOGS_DIR}/unixmon_envmon.$$"
   PRTDIAG_FILE="${LOGS_DIR}/unixmon_prtdiag.$$"
   echo "Environmental Monitoring on:  $(date)" > ${ENVMON_FILE}
   /usr/platform/`uname -i`/sbin/prtdiag -v > ${PRTDIAG_FILE}

   START_STRING="== Environmental Status =="
   FINISH_STRING="== HW Revisions =="
   grep_start=`grep -n "${START_STRING}" ${PRTDIAG_FILE} | cut -d':' -f1`
   grep_end=`grep -n "${FINISH_STRING}" ${PRTDIAG_FILE} | cut -d':' -f1`
   head -$((grep_end-1)) ${PRTDIAG_FILE} | tail -$((grep_end-grep_start)) >> ${ENVMON_FILE}
   rm ${PRTDIAG_FILE}
 
   high_temp_threshold=40     # would normally get this from "Threshold values" file

   current_temperature=20
   count=0
   number=0
   if [[ "${HARDWARE_TYPE}" = "Ultra-250" ]]
   then
       START_STRING="System Temperatures"
       FINISH_STRING="\============"
       extract_text ${ENVMON_FILE} 3 > ${LOGS_DIR}/unixmon_temperature.$$
       cat ${LOGS_DIR}/unixmon_temperature.$$ | sed "1,2 d" | while read line
       do
           field1=`echo "${line}" | awk '{print $1}' | cut -c 1-3`
           field2=`echo "${line}" | awk '{print $2}'`
           if [[ "${field1}" != "CPU" && -n ${field2} ]]
           then
               ((number=number + field2))
               ((count=count + 1))
           fi
       done
   fi
   if [[ "${HARDWARE_TYPE}" = "Netra-T12" ]]
   then
       START_STRING="Temperature sensors:"
       FINISH_STRING="\------------"
       extract_text ${ENVMON_FILE} 5 > ${LOGS_DIR}/unixmon_temperature.$$
       cat ${LOGS_DIR}/unixmon_temperature.$$ | grep -i ambient | while read line
       do
           field3=`echo "${line}" | awk '{print $3}' | sed "s/C//"`
           if [[ -n ${field3} ]]
           then
               ((number=number + field3))
               ((count=count + 1))
           fi
       done
   fi
   if [[ number -gt 0 && ${number} = [0-9]* && count -gt 0 ]]
   then
       ((current_temperature=number / count))
   fi
   rm -f ${LOGS_DIR}/unixmon_temperature.$$     # -f: file is absent on unsupported hardware
echo "current_temperature = ${current_temperature}  (${number} / ${count})"

   internal_fan_status=0
   count=0
   if [[ "${HARDWARE_TYPE}" = "Ultra-250" ]]
   then
       START_STRING="Fan Bank"
       FINISH_STRING="\============"
       extract_text ${ENVMON_FILE} 3 > ${LOGS_DIR}/unixmon_fanstatus.$$
       cat ${LOGS_DIR}/unixmon_fanstatus.$$ | sed "1,6 d" | while read line
       do
echo "${line}"
           field3=`echo "${line}" | awk '{print $3}' | tr '[a-z]' '[A-Z]'`
           if [[ -n ${field3} ]]
           then
               if [[ "$(echo ${field3} | cut -c 1-2)" != "OK" ]]
               then
                   ((count=count + 1))
               fi
           fi
       done
   fi
   if [[ "${HARDWARE_TYPE}" = "Netra-T12" ]]
   then
       START_STRING="Board Status:"
       FINISH_STRING="\------------"
       extract_text ${ENVMON_FILE} 5 > ${LOGS_DIR}/unixmon_fanstatus.$$
       cat ${LOGS_DIR}/unixmon_fanstatus.$$ | grep -i fan | while read line
       do
echo "${line}"
           field2=`echo "${line}" | awk '{print $2}' | tr '[a-z]' '[A-Z]'`
           if [[ -n ${field2} ]]
           then
               if [[ "$(echo ${field2} | cut -c 1-2)" != "OK" ]]
               then
                   ((count=count + 1))
               fi
           fi
       done
   fi
   internal_fan_status=${count}
   rm -f ${LOGS_DIR}/unixmon_fanstatus.$$     # -f: file is absent on unsupported hardware
echo "internal_fan_status = ${internal_fan_status}"

   power_supply_status=0
   count=0
   if [[ "${HARDWARE_TYPE}" = "Ultra-250" ]]
   then
       START_STRING="Power Supplies"
       FINISH_STRING="\============"
       extract_text ${ENVMON_FILE} 3 > ${LOGS_DIR}/unixmon_powerstatus.$$
       cat ${LOGS_DIR}/unixmon_powerstatus.$$ | sed "1,5 d" | while read line
       do
echo "${line}"
           field2=`echo "${line}" | awk '{print $2}' | tr '[a-z]' '[A-Z]'`
           if [[ -n ${field2} ]]
           then
               if [[ "$(echo ${field2} | cut -c 1-2)" != "OK" ]]
               then
                   ((count=count + 1))
               fi
           fi
       done
   fi
   if [[ "${HARDWARE_TYPE}" = "Netra-T12" ]]
   then
       START_STRING="Voltage sensors:"
       FINISH_STRING="\------------"
       extract_text ${ENVMON_FILE} 5 > ${LOGS_DIR}/unixmon_powerstatus.$$
       cat ${LOGS_DIR}/unixmon_powerstatus.$$ | grep -i ps | while read line
       do
echo "${line}"
           field3=`echo "${line}" | awk '{print $NF}' | tr '[a-z]' '[A-Z]'`
           if [[ -n ${field3} ]]
           then
               if [[ "$(echo ${field3} | cut -c 1-2)" != "OK" ]]
               then
                   ((count=count + 1))
               fi
           fi
       done
   fi
   power_supply_status=${count}
   rm -f ${LOGS_DIR}/unixmon_powerstatus.$$     # -f: file is absent on unsupported hardware
echo "power_supply_status = ${power_supply_status}"

   mv ${ENVMON_FILE} ${LOGS_DIR}/unixmon_envmon.history

   exit 0

I hope it makes sense to you. I am sure you will make changes to it to suit your needs. Feel free to use it as you wish.
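The threshold step mentioned earlier is not part of the script above, so here is a rough sketch of how it might look. The function `classify_reading` and its argument order are hypothetical, and the limits are passed in directly because the format of the "Threshold values" file is not shown.

```shell
#!/bin/ksh
# Sketch only: classify_reading is a hypothetical helper.
# Usage: classify_reading NAME VALUE WARN_AT PROBLEM_AT
# Prints "ok", "warning" or "problem" for an integer reading.
classify_reading() {
    if [ "${2}" -ge "${4}" ]
    then
        echo "problem: ${1} = ${2} (limit ${4})"
    elif [ "${2}" -ge "${3}" ]
    then
        echo "warning: ${1} = ${2} (limit ${3})"
    else
        echo "ok: ${1} = ${2}"
    fi
}
```

For example, `classify_reading temperature "${current_temperature}" 35 40` would use the script's 40 degree limit as the "problem" level; the 35 degree warning level is an assumed value.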

Mike

daFranze · Dec 13, 2004

Thank you Mike!

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years

coffeysm · Dec 13, 2004

I use the Big Brother monitoring tool. It's a great program and it's free. It is basically a client/server program: a client runs on each machine and reports back to a server, which runs a web server and displays the status of all machines on a web page. It is easy to set up and highly customizable. I have it monitor everything on my servers: disk, CPU, tripwire, temperature, etc.

http://www.bb4.org

daFranze · Dec 14, 2004

Nagios is more or less the same as BB.
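For anyone feeding counts like the ones from the prtdiag script into Nagios: a plugin only has to print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch follows; the wrapper name `nagios_result` and the wording of the messages are assumptions.

```shell
#!/bin/ksh
# Sketch only: nagios_result is a hypothetical wrapper.
# Usage: nagios_result LABEL FAILED_COUNT
# Prints a one-line status and returns the matching Nagios plugin
# exit code: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
nagios_result() {
    case "${2}" in
        ''|*[!0-9]*)
            # Empty or non-numeric count: the check itself did not work.
            echo "UNKNOWN - ${1}: no reading"
            return 3
            ;;
    esac
    if [ "${2}" -eq 0 ]
    then
        echo "OK - ${1}: all units report OK"
        return 0
    fi
    echo "CRITICAL - ${1}: ${2} unit(s) not OK"
    return 2
}
```

A plugin script would end with something like `nagios_result "power supplies" "${power_supply_status}"; exit $?`.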

Best Regards, Franz
--
Solaris System Manager from Munich, Germany
I used to work for Sun Microsystems Support (EMEA) for 5 years
