After you have completed this procedure, your system will issue SNMP traps to the host or hosts that you specified. You will need to use an appropriate SNMP monitor, such as Solstice Enterprise Agents software, to view the traps as they are issued.
Note - Set the mdmonitord command to probe your system regularly to help ensure that you receive traps if problems arise. See "Setting the mdmonitord Command for Periodic Error Checking". Also, refer to "Monitoring Solaris Volume Manager with a cron Job" for additional error-checking options.
Solaris Volume Manager SNMP Agent Limitations
The Solaris Volume Manager SNMP agent has certain limitations, and will not issue traps for all Solaris Volume Manager problems that system administrators will likely need to know about. Specifically, the agent issues traps only in the following instances:
A RAID 1 or RAID 5 subcomponent goes into "needs maintenance" state
A hot spare is swapped into service
A hot spare starts to resynchronize
A hot spare completes resynchronization
A mirror is taken offline
A disk set is taken by another host and the current host panics
Many problematic situations, such as an unavailable disk with RAID 0 volumes or soft partitions on it, do not result in SNMP traps, even when reads and writes to the device are attempted. SCSI or IDE errors are generally reported in these cases, but other SNMP agents must issue traps for those errors to be reported to a monitoring console.
Monitoring Solaris Volume Manager with a cron Job
How to Automate Checking for Errors in Volumes
To automatically check your Solaris Volume Manager configuration for errors, create a script that the cron utility can periodically.
The following example shows a script that you can adapt and modify for your needs.
Note - This script serves as a starting point for automating Solaris Volume Manager error checking. You will probably need to modify this script for your own configuration.
# #ident "@(#)metacheck.sh 1.3 96/06/21 SMI" #!/bin/ksh #!/bin/ksh -x #!/bin/ksh -v # ident='%Z%%M% %I% %E% SMI' # # Copyright (c) 1999 by Sun Microsystems, Inc. # # metacheck # # Check on the status of the metadevice configuration. If there is a problem # return a non zero exit code. Depending on options, send email notification. # # -h # help # -s setname # Specify the set to check. By default, the 'local' set will be checked. # -m recipient [recipient...] # Send email notification to the specified recipients. This # must be the last argument. The notification shows up as a short # email message with a subject of # "Solaris Volume Manager Problem: metacheck.who.nodename.setname" # which summarizes the problem(s) and tells how to obtain detailed # information. The "setname" is from the -s option, "who" is from # the -w option, and "nodename" is reported by uname(1). # Email notification is further affected by the following options: # -f to suppress additional messages after a problem # has been found. # -d to control the supression. # -w to identify who generated the email. # -t to force email even when there is no problem. # -w who # indicate who is running the command. By default, this is the # user-name as reported by id(1M). This is used when sending # email notification (-m). # -f # Enable filtering. Filtering applies to email notification (-m). # Filtering requires root permission. When sending email notification # the file /etc/lvm/metacheck.setname.pending is used to # controll the filter. The following matrix specifies the behavior # of the filter: # # problem_found file_exists # yes no Create file, send notification # yes yes Resend notification if the current date # (as specified by -d datefmt) is # different than the file date. # no yes Delete file, send notification # that the problem is resolved. # no no Send notification if -t specified. # # -d datefmt # Specify the format of the date for filtering (-f). This option # controls the how often re-notification via email occurs. If the # current date according to the specified format (strftime(3C)) is # identical to the date contained in the # /etc/lvm/metacheck.setname.pending file then the message is # suppressed. The default date format is "%D", which will send one # re-notification per day. # -t # Test mode. Enable email generation even when there is no problem. # Used for end-to-end verification of the mechanism and email addresses. # # # These options are designed to allow integration of metacheck # into crontab. For example, a root crontab entry of: # # 0,15,30,45 * * * * /usr/sbin/metacheck -f -w SVMcron \ # -d '\%D \%h' -m notice@example.com 2148357243.8333033@pager.example.com # # would check for problems every 15 minutes, and generate an email to # notice@example.com (and send to an email pager service) every hour when # there is a problem. Note the \ prior to the '%' characters for a # crontab entry. Bounced email would come back to root@nodename. # The subject line for email generated by the above line would be # Solaris Volume Manager Problem: metacheck.SVMcron.nodename.local # # display a debug line to controlling terminal (works in pipes) decho() { if [ "$debug" = "yes" ] ; then echo "DEBUG: $*" < /dev/null > /dev/tty 2>&1 fi } # if string $1 is in $2-* then return $1, else return "" strstr() { typeset look="$1" typeset ret="" shift # decho "strstr LOOK .$look. FIRST .$1." while [ $# -ne 0 ] ; do if [ "$look" = "$1" ] ; then ret="$look" fi shift done echo "$ret" } # if string $1 is in $2-* then delete it. return result strdstr() { typeset look="$1" typeset ret="" shift # decho "strdstr LOOK .$look. FIRST .$1." while [ $# -ne 0 ] ; do if [ "$look" != "$1" ] ; then ret="$ret $1" fi shift done echo "$ret" } merge_continued_lines() { awk -e '\ BEGIN { line = "";} \ $NF == "\\" { \ $NF = ""; \ line = line $0; \ next; \ } \ $NF != "\\" { \ if ( line != "" ) { \ print line $0; \ line = ""; \ } else { \ print $0; \ } \ }' } # trim out stuff not associated with metadevices find_meta_devices() { typeset devices="" # decho "find_meta_devices .$*." while [ $# -ne 0 ] ; do case $1 in d+([0-9]) ) # metadevice name devices="$devices $1" ;; esac shift done echo "$devices" } # return the list of top level metadevices toplevel() { typeset comp_meta_devices="" typeset top_meta_devices="" typeset devices="" typeset device="" typeset comp="" metastat$setarg -p | merge_continued_lines | while read line ; do echo "$line" devices=`find_meta_devices $line` set -- $devices if [ $# -ne 0 ] ; then device=$1 shift # check to see if device already refered to as component comp=`strstr $device $comp_meta_devices` if [ -z $comp ] ; then top_meta_devices="$top_meta_devices $device" fi # add components to component list, remove from top list while [ $# -ne 0 ] ; do comp=$1 comp_meta_devices="$comp_meta_devices $comp" top_meta_devices=`strdstr $comp $top_meta_devices` shift done fi done > /dev/null 2>&1 echo $top_meta_devices } # # - MAIN # METAPATH=/usr/sbin PATH=//usr/bin:$METAPATH USAGE="usage: metacheck [-s setname] [-h] [[-t] [-f [-d datefmt]] \ [-w who] -m recipient [recipient...]]" datefmt="%D" debug="no" filter="no" mflag="no" set="local" setarg="" testarg="no" who=`id | sed -e 's/^uid=[0-9][0-9]*(//' -e 's/).*//'` while getopts d:Dfms:tw: flag do case $flag in d) datefmt=$OPTARG; ;; D) debug="yes" ;; f) filter="yes" ;; m) mflag="yes" ;; s) set=$OPTARG; if [ "$set" != "local" ] ; then setarg=" -s $set"; fi ;; t) testarg="yes"; ;; w) who=$OPTARG; ;; \?) echo $USAGE exit 1 ;; esac done # if mflag specified then everything else part of recipient shift `expr $OPTIND - 1` if [ $mflag = "no" ] ; then if [ $# -ne 0 ] ; then echo $USAGE exit 1 fi else if [ $# -eq 0 ] ; then echo $USAGE exit 1 fi fi recipients="$*" curdate_filter=`date +$datefmt` curdate=`date` node=`uname -n` # establish files msg_f=/tmp/metacheck.msg.$$ msgs_f=/tmp/metacheck.msgs.$$ metastat_f=/tmp/metacheck.metastat.$$ metadb_f=/tmp/metacheck.metadb.$$ metahs_f=/tmp/metacheck.metahs.$$ pending_f=/etc/lvm/metacheck.$set.pending files="$metastat_f $metadb_f $metahs_f $msg_f $msgs_f" rm -f $files > /dev/null 2>&1 trap "rm -f $files > /dev/null 2>&1; exit 1" 1 2 3 15 # Check to see if metadb is capable of running have_metadb="yes" metadb$setarg > $metadb_f 2>&1 if [ $? -ne 0 ] ; then have_metadb="no" fi grep "there are no existing databases" < $metadb_f > /dev/null 2>&1 if [ $? -eq 0 ] ; then have_metadb="no" fi grep "/dev/md/admin" < $metadb_f > /dev/null 2>&1 if [ $? -eq 0 ] ; then have_metadb="no" fi # check for problems accessing metadbs retval=0 if [ "$have_metadb" = "no" ] ; then retval=1 echo "metacheck: metadb problem, can't run '$METAPATH/metadb$setarg'" \ >> $msgs_f else # snapshot the state metadb$setarg 2>&1 | sed -e '1d' | merge_continued_lines > $metadb_f metastat$setarg 2>&1 | merge_continued_lines > $metastat_f metahs$setarg -i 2>&1 | merge_continued_lines > $metahs_f # # Check replicas for problems, capital letters in the flags # indicate an error, fields are seperated by tabs. # problem=`awk < $metadb_f -F\t '{if ($1 ~ /[A-Z]/) print $1;}'` if [ -n "$problem" ] ; then retval=`expr $retval + 64` echo "\ metacheck: metadb problem, for more detail run:\n\t$METAPATH/metadb$setarg -i" \ >> $msgs_f fi # # Check the metadevice state # problem=`awk < $metastat_f -e \ '/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'` if [ -n "$problem" ] ; then retval=`expr $retval + 128` echo "\ metacheck: metadevice problem, for more detail run:" \ >> $msgs_f # refine the message to toplevel metadevices that have a problem top=`toplevel` set -- $top while [ $# -ne 0 ] ; do device=$1 problem=`metastat $device | awk -e \ '/State:/ {if ($2 != "Okay" && $2 != "Resyncing") print $0;}'` if [ -n "$problem" ] ; then echo "\t$METAPATH/metastat$setarg $device" >> $msgs_f # find out what is mounted on the device mp=`mount|awk -e '/\/dev\/md\/dsk\/'$device'[ \t]/{print $1;}'` if [ -n "$mp" ] ; then echo "\t\t$mp mounted on $device" >> $msgs_f fi fi shift done fi # # Check the hotspares to see if any have been used. # problem="" grep "no hotspare pools found" < $metahs_f > /dev/null 2>&1 if [ $? -ne 0 ] ; then problem=`awk < $metahs_f -e \ '/blocks/ { if ( $2 != "Available" ) print $0;}'` fi if [ -n "$problem" ] ; then retval=`expr $retval + 256` echo "\ metacheck: hot spare in use, for more detail run:\n\t$METAPATH/metahs$setarg -i" \ >> $msgs_f fi fi # If any errors occurred, then mail the report if [ $retval -ne 0 ] ; then if [ -n "$recipients" ] ; then re="" if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then re="Re: " # we have a pending notification, check date to see if we resend penddate_filter=`cat $pending_f | head -1` if [ "$curdate_filter" != "$penddate_filter" ] ; then rm -f $pending_f > /dev/null 2>&1 else if [ "$debug" = "yes" ] ; then echo "metacheck: email problem notification still pending" cat $pending_f fi fi fi if [ ! -f $pending_f ] ; then if [ "$filter" = "yes" ] ; then echo "$curdate_filter\n\tDate:$curdate\n\tTo:$recipients" \ > $pending_f fi echo "\ Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate" >> $msg_f echo "\ --------------------------------------------------------------" >> $msg_f cat $msg_f $msgs_f | mailx -s \ "${re}Solaris Volume Manager Problem: metacheck.$who.$set.$node" $recipients fi else cat $msgs_f fi else # no problems detected, if [ -n "$recipients" ] ; then # default is to not send any mail, or print anything. echo "\ Solaris Volume Manager: $node: metacheck$setarg: Report: $curdate" >> $msg_f echo "\ --------------------------------------------------------------" >> $msg_f if [ -f $pending_f ] && [ "$filter" = "yes" ] ; then # pending filter exista, remove it and send OK rm -f $pending_f > /dev/null 2>&1 echo "Problem resolved" >> $msg_f cat $msg_f | mailx -s \ "Re: Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients elif [ "$testarg" = "yes" ] ; then # for testing, send mail every time even thought there is no problem echo "Messaging test, no problems detected" >> $msg_f cat $msg_f | mailx -s \ "Solaris Volume Manager Problem: metacheck.$who.$node.$set" $recipients fi else echo "metacheck: Okay" fi fi rm -f $files > /dev/null 2>&1 exit $retval
For information on invoking scripts by using the cron utility, see the cron(1M) man page.