SNMP Monitoring
This section describes the SNMP traps. It does not list the corresponding clear traps. Every SteelHead supports SNMP traps and email alerts for conditions that require attention or intervention. Most (but not all) events trigger an alarm, after which the related trap is sent. For most events, when the condition is fixed, the system clears the alarm and sends a clear trap. Clear traps are useful for determining when an event has been resolved.
RiOS v5.0 supports the following:
  • SNMP Version 1
  • SNMP Version 2c
RiOS v6.0 or later supports the following:
  • SNMP Version 3, which provides authentication through the User-based Security Model (USM).
  • View-based Access Control Model (VACM), which provides richer access control.
RiOS v7.0 or later supports the SNMP v3 authentication with AES 128 and DES encryption described in the following table.
Riverbed recommends the following OIDs as a good starting point from which to monitor your deployment. Additional variables can be added or removed as needed.
    The following OIDs are for xx55 SteelHeads only.
     
    OID | Object Type | Description
    1.3.6.1.4.1.17163.1.1.5.2.1.0 | optimizedConnections | Current total number of optimized connections
    1.3.6.1.4.1.17163.1.1.5.2.2.0 | passthroughConnections | Current total number of pass-through connections
    1.3.6.1.4.1.17163.1.1.5.2.3.0 | halfOpenedConnections | Current total number of half-opened (optimized) connections
    1.3.6.1.4.1.17163.1.1.5.2.4.0 | halfClosedConnections | Current total number of half-closed (optimized) connections
    1.3.6.1.4.1.17163.1.1.5.2.5.0 | establishedConnections | Current number of established (optimized) connections
    1.3.6.1.4.1.17163.1.1.5.2.6.0 | activeConnections | Current number of active (optimized) connections
    1.3.6.1.4.1.17163.1.1.5.2.7.0 | totalConnections | Total number of connections
    1.3.6.1.4.1.17163.1.1.5.1.1.0 | cpuLoad1 | 1-minute CPU load in hundredths
    1.3.6.1.4.1.17163.1.1.5.1.2.0 | cpuLoad5 | 5-minute CPU load in hundredths
    1.3.6.1.4.1.17163.1.1.5.1.3.0 | cpuLoad15 | 15-minute CPU load in hundredths
    1.3.6.1.4.1.17163.1.1.5.1.4.0 | cpuUtil1 | Percentage CPU utilization, aggregated across all CPUs, rolling average over the past minute
    1.3.6.1.4.1.17163.1.1.5.1.5.1.1.1 | cpuIndivIndex | A synthetic number numbering the CPUs
    1.3.6.1.4.1.17163.1.1.5.1.5.1.2.1 | cpuIndivId | Name of the CPU, also serves as the index for the table
    1.3.6.1.4.1.17163.1.1.5.1.5.1.3.1 | cpuIndivIdleTime | Idle time for this CPU
    1.3.6.1.4.1.17163.1.1.5.1.5.1.4.1 | cpuIndivSystemTime | System time for this CPU
    1.3.6.1.4.1.17163.1.1.5.1.5.1.5.1 | cpuIndivUserTime | User time for this CPU
    1.3.6.1.4.1.17163.1.1.4.0.8 | raidError | RAID errors
     
    On systems with multiple CPUs, the last number of each per-CPU (cpuIndiv) OID corresponds to the CPU number.
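The OID layout above can be sketched in a few lines of Python. This is a minimal illustration, not Riverbed tooling: the helper names are hypothetical, the use of the net-snmp snmpget client with an SNMP v2c community string is an assumption, and the CPU number is simply appended as given (the table only shows CPU 1).

```python
# Hypothetical helpers built only from the OIDs listed in the table above.
STATS_PREFIX = "1.3.6.1.4.1.17163.1.1.5"

# Scalar objects from the table (name -> full OID).
SCALAR_OIDS = {
    "optimizedConnections": STATS_PREFIX + ".2.1.0",
    "passthroughConnections": STATS_PREFIX + ".2.2.0",
    "totalConnections": STATS_PREFIX + ".2.7.0",
    "cpuLoad1": STATS_PREFIX + ".1.1.0",
    "cpuLoad5": STATS_PREFIX + ".1.2.0",
    "cpuLoad15": STATS_PREFIX + ".1.3.0",
    "cpuUtil1": STATS_PREFIX + ".1.4.0",
}

# Column numbers of the per-CPU objects: <prefix>.1.5.1.<column>.<cpu>
CPU_COLUMNS = {
    "cpuIndivIndex": 1,
    "cpuIndivId": 2,
    "cpuIndivIdleTime": 3,
    "cpuIndivSystemTime": 4,
    "cpuIndivUserTime": 5,
}


def per_cpu_oid(column, cpu_number):
    """Build a per-CPU OID; the last number is the CPU number."""
    return f"{STATS_PREFIX}.1.5.1.{CPU_COLUMNS[column]}.{cpu_number}"


def load_from_hundredths(raw_value):
    """Convert a cpuLoad value reported in hundredths to a plain load figure."""
    return raw_value / 100.0


def snmpget_argv(host, community, oid):
    """Argument list for net-snmp's snmpget (assumed client, SNMP v2c)."""
    return ["snmpget", "-v", "2c", "-c", community, host, oid]
```

For example, per_cpu_oid("cpuIndivIdleTime", 1) reproduces the cpuIndivIdleTime OID shown in the table, and a raw cpuLoad1 reading of 250 corresponds to a load of 2.5.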
    The following table summarizes the SNMP traps that represent serious issues. Riverbed recommends that you address them immediately. Each entry lists, in order, the trap name and OID, the SteelHead state, the trap text, and a description.
     
    procCrash (enterprises.17163.1.1.4.0.1)
    State: (none listed)
    Text: A procCrash trap signifies that a process managed by PM has crashed and left a core file. The variable sent with the notification indicates which process crashed.
    Description: A process crashed and was subsequently restarted by the system. The trap contains the name of the process that crashed. A system snapshot associated with this crash is created on the SteelHead and is accessible through the CLI or the Management Console. Riverbed Support might need this information to determine the cause of the crash. The crashed process restarts automatically, and no other action is required on the SteelHead.
    procExit (enterprises.17163.1.1.4.0.2)
    State: (none listed)
    Text: A procExit trap signifies that a process managed by PM has exited unexpectedly, but not left a core file. The variable sent with the notification indicates which process exited.
    Description: A process exited unexpectedly and was subsequently restarted by the system. The trap contains the name of the process. The process might have exited automatically due to other process failures on the SteelHead. Review the release notes for known issues related to this process exit. If none exist, contact Riverbed Support to determine the cause of this event. The process restarts automatically, and no other action is required on the SteelHead.
    bypassMode (enterprises.17163.1.1.4.0.7)
    State: Critical
    Text: The SteelHead has entered bypass (failthru) mode.
    Description: The SteelHead entered bypass mode and passes through all traffic unoptimized. This is the result of the optimization service locking up or crashing. It can also happen when the system is first turned on or turned off. If this trap is generated on a system that was previously optimizing and is still running, contact Riverbed Support.
    storeCorruption (enterprises.17163.1.1.4.0.9)
    State: Critical
    Text: The RiOS data store is corrupted.
    Description: Corruption is detected in the RiOS data store. Contact Riverbed Support immediately.
    haltError (enterprises.17163.1.1.4.0.12)
    State: Critical
    Text: The service is halted due to a software error.
    Description: The optimization service halts due to a serious software error. Check to see if a core dump or sysdump was created. If so, retrieve the information and contact Riverbed Support immediately.
    serviceError (enterprises.17163.1.1.4.0.13)
    State: Degraded
    Text: There has been a service error. Consult the log file.
    Description: The optimization service encountered a condition that might degrade optimization performance. Consult the system log for more information.
    licenseError (enterprises.17163.1.1.4.0.57)
    State: Critical
    Text: The main SteelHead license has expired, been removed, or become invalid.
    Description: A license on the SteelHead has been removed, has expired, or is invalid. The alarm clears when a valid license is added or updated.
    hardwareError (enterprises.17163.1.1.4.0.58)
    State: Either Critical or Degraded, depending on the state
    Text: Hardware error detected.
    Description: The system has detected a problem with the SteelHead hardware. These issues trigger the hardware error alarm:
  • The SteelHead does not have enough disk, memory, CPU cores, or NICs to support the current configuration.
  • The SteelHead is using a Dual In-line Memory Module (DIMM), a hard disk, or a NIC that is not qualified by Riverbed.
  • Other hardware issues.
    The alarm clears when you add the necessary hardware, remove the unqualified hardware, or resolve other hardware issues.
    lanWanLoopError (enterprises.17163.1.1.4.0.63)
    State: Critical
    Text: LAN-WAN loop detected. System will not optimize new connections until this error is cleared.
    Description: A network loop has been detected between the LAN and WAN interfaces on a SteelHead (virtual edition). This can occur when you connect the LAN and WAN virtual NICs to the same vSwitch or physical NIC. This alarm triggers when a SteelHead (virtual edition) starts up, and clears after you connect each LAN and WAN virtual interface to a distinct virtual switch and physical NIC (through the vSphere Networking tab) and then reboot the SteelHead (virtual edition).
    optimizationServiceStatusError (enterprises.17163.1.1.4.0.64)
    State: Critical
    Text: Optimization service currently not optimizing any connections.
    Description: The optimization service has encountered a condition that stops it from optimizing connections. The message indicates the reason for the condition:
  • optimization service is not running: This message appears after a configuration file error. For more information, review the SteelHead logs.
  • in-path optimization is not enabled: This message appears if an in-path setting is disabled for an in-path SteelHead. For more information, review the SteelHead logs.
  • optimization service is initializing: This message appears after a reboot. The alarm clears on its own; no other action is necessary. For more information, review the SteelHead logs.
  • optimization service is not optimizing: This message appears after a system crash. For more information, review the SteelHead logs.
  • optimization service is disabled by user: This message appears after entering the CLI command no service enable or shutting down the optimization service from the Management Console. For more information, review the SteelHead logs.
  • optimization service is restarted by user: This message appears after the optimization service is restarted from either the CLI or the Management Console. For more information, review the SteelHead logs.
    storageProfSwitchFailed (enterprises.17163.1.1.4.0.73)
    State: Either Critical or Needs Attention, depending on the state
    Text: Storage profile switch failed
    Description: An error occurred while repartitioning the disk drives during a storage profile switch. A profile switch changes the disk space allocation on the drives, clears the Granite and VSP data stores, and repartitions the data stores to the appropriate sizes.
    You switch a storage profile by entering the disk-config layout CLI command at the system prompt or by choosing Administration > System Settings: Disk Management on an EX or EX+Granite SteelHead and selecting a storage profile.
    These reasons can cause a profile switch to fail:
  • RiOS cannot validate the profile.
  • The profile contains an invalid upgrade or downgrade.
  • RiOS cannot clean up the existing VMDKs. During cleanup, RiOS uninstalls all slots and deletes all backups and packages.
    When you encounter this error, try to switch the storage profile again. If the switch succeeds, the error clears. If it fails, RiOS reverts the SteelHead to the previous storage profile.
  • If RiOS is unable to revert the SteelHead to the previous storage profile, the alarm status becomes critical.
  • If RiOS successfully reverts the SteelHead to the previous storage profile, the alarm status displays needs attention.
    flashProtectionFailed (enterprises.17163.1.1.4.0.75)
    State: Critical
    Text: Flash disk hasn't been backed up due to not enough free space on /var filesystem.
    Description: The USB flash drive has not been backed up because there is not enough available space in the /var filesystem directory.
    Examine the /var directory to see if it is storing an excessive number of snapshots, system dumps, or TCP dumps that you could delete. You could also delete any RiOS images that you no longer use.
    datastoreNeedClean (enterprises.17163.1.1.4.0.76)
    State: Critical
    Text: The data store needs to be cleaned.
    Description: You need to clear the RiOS data store. To clear the data store, choose Administration > Maintenance: Services and select the Clear Data Store check box before restarting the appliance.
    Clearing the data store degrades performance until the system repopulates the data.
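A trap receiver typically sees numeric OIDs, so it can help to map them back to the trap names and states in the table above. The sketch below assumes only the standard expansion of "enterprises" to 1.3.6.1.4.1; the dictionary and function names are hypothetical, and the states are copied from the table (None where the table lists no state).

```python
# Numeric prefix for iso.org.dod.internet.private.enterprises.
ENTERPRISES = "1.3.6.1.4.1"
TRAP_PREFIX = ENTERPRISES + ".17163.1.1.4.0"

# Trap name -> (last OID number, SteelHead state from the table).
TRAPS = {
    "procCrash": (1, None),
    "procExit": (2, None),
    "bypassMode": (7, "Critical"),
    "storeCorruption": (9, "Critical"),
    "haltError": (12, "Critical"),
    "serviceError": (13, "Degraded"),
    "licenseError": (57, "Critical"),
    "hardwareError": (58, "Critical or Degraded"),
    "lanWanLoopError": (63, "Critical"),
    "optimizationServiceStatusError": (64, "Critical"),
    "storageProfSwitchFailed": (73, "Critical or Needs Attention"),
    "flashProtectionFailed": (75, "Critical"),
    "datastoreNeedClean": (76, "Critical"),
}


def trap_oid(name):
    """Return the full numeric OID for a trap name from the table."""
    number, _state = TRAPS[name]
    return f"{TRAP_PREFIX}.{number}"


def lookup_trap(oid):
    """Return (name, state) for a numeric trap OID, or None if it is not listed."""
    oid = oid.lstrip(".")  # some tools print a leading dot
    for name, (number, state) in TRAPS.items():
        if oid == f"{TRAP_PREFIX}.{number}":
            return name, state
    return None
```

Clear traps are not listed in this section, so a lookup of a clear trap OID returns None in this sketch.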
    If an error condition exists, several alarms are generated along with the SNMP traps. If the email feature is configured, you also receive an email notification.
    To limit the number of alarms generated over a given period of time, use the stats alarm <alarm name> rate-limit count <thresholds> <count> command.
    There are three sets of thresholds: short, medium, and long. Each has a window, which is a number of seconds, and a maximum count. If, for any threshold, the number of alarms exceeds the maximum count during the window, the alarm is not generated and emails are not sent.
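The windowed rate limit described above can be modeled as follows. This is a conceptual sketch only, not the RiOS implementation: the class name, the window lengths, and the maximum counts are hypothetical, and it assumes that only alarms that were actually generated count toward each window.

```python
from collections import deque


class AlarmRateLimiter:
    """Suppress an alarm when any threshold's window already holds its maximum count."""

    def __init__(self, thresholds):
        # thresholds: name -> (window_seconds, max_count); values are illustrative.
        self.thresholds = thresholds
        self.generated = deque()  # timestamps of alarms that were generated

    def allow(self, now):
        """Return True if an alarm raised at time `now` (seconds) may be generated."""
        longest = max(window for window, _ in self.thresholds.values())
        # Forget alarms that are older than the longest window.
        while self.generated and now - self.generated[0] >= longest:
            self.generated.popleft()
        for window, max_count in self.thresholds.values():
            recent = sum(1 for t in self.generated if now - t < window)
            if recent >= max_count:
                return False  # one threshold is exhausted: no alarm, no email
        self.generated.append(now)
        return True


# Hypothetical short, medium, and long thresholds.
limiter = AlarmRateLimiter({"short": (5, 2), "medium": (60, 3), "long": (3600, 4)})
decisions = [limiter.allow(t) for t in (0, 1, 2, 10, 20, 100, 200)]
```

With these illustrative values, the third alarm is suppressed by the short threshold, the fifth by the medium threshold, and the seventh by the long threshold.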
    For more information about configuring SNMP and other important traps, see the SteelHead Management Console User’s Guide.