SteelFusion Core MIB : SNMP traps
  
SNMP traps
Every Core supports SNMP traps and email alerts for conditions that require attention or intervention. Most, but not all, events trigger an alarm, and the Core sends the related trap. For most events, when the condition clears, the system clears the alarm and also sends a clear trap. Clear traps are useful for determining when an event has been resolved.
This section describes the SNMP traps. It does not list the corresponding clear traps.
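The raise and clear lifecycle described above can be sketched as a small state tracker on the trap receiver side. This is a minimal illustration, not part of the Core software: `AlarmTracker` and `RAISE_TO_CLEAR` are hypothetical names, and the pairs are limited to the three clear traps that do appear in the trap table in this section (in each pair, the clear trap's final OID number is the raise trap's number plus 1000). Other clear traps are not listed in this section, so the sketch ignores them.

```python
# Raise trap -> clear trap, taken from the trap table in this section.
# Only these three pairs are listed in the source; others are omitted.
RAISE_TO_CLEAR = {
    "temperatureCritical": "temperatureNonCritical",
    "temperatureWarning": "temperatureNormal",
    "secureVaultLocked": "secureVaultUnlocked",
}
CLEAR_TO_RAISE = {clear: raise_ for raise_, clear in RAISE_TO_CLEAR.items()}


class AlarmTracker:
    """Hypothetical helper that tracks which paired alarms are raised."""

    def __init__(self):
        self.active = set()

    def on_trap(self, trap_name):
        """Apply one received trap and return the sorted active alarms."""
        if trap_name in RAISE_TO_CLEAR:
            # A raise trap marks the alarm as active.
            self.active.add(trap_name)
        elif trap_name in CLEAR_TO_RAISE:
            # A clear trap resolves the alarm it is paired with.
            self.active.discard(CLEAR_TO_RAISE[trap_name])
        # Any other trap (for example configChange) leaves the state unchanged.
        return sorted(self.active)
```

Feeding the tracker `temperatureWarning` followed by `temperatureNormal` leaves no active alarms, which mirrors how a clear trap signals that an event has been resolved.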
RiOS 6.0 and later versions include support for SNMPv3.
You can view Core health at the top of each Management Console page, by entering the CLI show info command, and through SNMP (health, systemHealth).
The Core tracks key hardware and software metrics and alerts you to any potential problems so that you can quickly discover and diagnose issues. The health of an appliance falls into one of the following states:
•  Healthy - The Core is functioning optimally.
•  Needs Attention - Accompanies a healthy state to indicate management-related issues not affecting the ability of the Core to perform.
•  Degraded - The Core system has detected an issue.
•  Admission Control - The Core is performing but has reached its connection limit.
•  Critical - The Core might or might not be performing; you must address a critical issue.
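If a monitoring script needs a single summary across several Cores, one option is to rank the states and report the worst one. The sketch below assumes that the order of the list above reflects increasing severity, which this section does not state explicitly; the `Health` enum and the `worst` function are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    # Order follows the list above. Treating it as a severity
    # ranking is an assumption, not something the source states.
    HEALTHY = 0
    NEEDS_ATTENTION = 1
    DEGRADED = 2
    ADMISSION_CONTROL = 3
    CRITICAL = 4


def worst(states):
    """Return the most severe state, or HEALTHY when nothing is reported."""
    return max(states, default=Health.HEALTHY)
```

For example, `worst([Health.DEGRADED, Health.CRITICAL])` returns `Health.CRITICAL` under this assumed ranking.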
The following table summarizes the SNMP traps sent from the system to configured trap receivers and their effect on the Core health state.
Trap and OID
Appliance State
Text
Description
procCrash
(enterprises.17163.1.100.4.0.1)
Healthy
A procCrash trap signifies that a process managed by PM has crashed and left a core file. The variable sent with the notification indicates which process crashed.
A process has crashed and subsequently been restarted by the system. The trap contains the name of the process that crashed. A system snapshot associated with this crash has been created on the appliance and is accessible through the CLI or the Management Console. Riverbed Support might need this information to determine the cause of the crash. No other action is required on the appliance as the crashed process is automatically restarted.
procExit
(enterprises.17163.1.100.4.0.2)
Healthy
A procExit trap signifies that a process managed by PM has exited unexpectedly, but not left a core file. The variable sent with the notification indicates which process exited.
A process has unexpectedly exited and been restarted by the system. The trap contains the name of the process. The process might have exited automatically or due to other process failures on the appliance. Review the release notes for known issues related to this process exit. If none exist, contact Riverbed Support to determine the cause of this event. No other action is required on the appliance as the process is automatically restarted.
configChange
(enterprises.17163.1.100.4.0.3)
Healthy
A change has been made to the system’s configuration.
A configuration change has been detected. Check the log files around the time of this trap to determine what changes were made and whether they were authorized.
cpuUtil
(enterprises.17163.1.100.4.0.4)
Degraded
The average CPU utilization in the past minute has gone above the acceptable threshold.
Average CPU utilization has exceeded an acceptable threshold. If CPU utilization spikes are frequent, it might be because the system is undersized. Sustained CPU load can be symptomatic of more serious issues. Consult the CPU Utilization report to gauge how long the system has been loaded and also monitor the amount of traffic currently going through the appliance. A one-time spike in CPU is normal, but we recommend reporting extended high CPU utilization to Riverbed Support. No other action is necessary as the alarm clears automatically.
pagingActivity
(enterprises.17163.1.100.4.0.5)
Degraded
The system has been paging excessively (thrashing).
The system is running low on memory and has begun swapping memory pages to disk. This event can be triggered during a software upgrade while the optimization service is still running, but there can be other causes. If this event triggers at any other time, generate a debug sysdump and send it to Riverbed Support. No other action is required as the alarm clears automatically. This alarm isn’t related to the hypervisor or the Edge (if raised on the Core side).
linkError
(enterprises.17163.1.100.4.0.6)
Degraded
An interface on the appliance has lost its link.
The system has lost one of its Ethernet links, typically due to an unplugged cable or dead switch port. Check the physical connectivity between the Core and its neighbor device. Investigate this alarm as soon as possible. Depending on what link is down, the system might no longer be optimizing and a network outage could occur.
This is often caused by interface transitions on surrounding devices, such as routers or switches. This alarm also accompanies service or system restarts on the Core.
powerSupplyError
(enterprises.17163.1.100.4.0.7)
Degraded
A power supply on the appliance has failed.
A redundant power supply on the appliance has failed and needs to be replaced. Contact Riverbed Support for an RMA replacement as soon as practically possible.
fanError
(enterprises.17163.1.100.4.0.8)
Degraded
A fan has failed on this appliance.
A fan is failing or has failed and needs to be replaced. Contact Riverbed Support for an RMA replacement as soon as practically possible.
memoryError
(enterprises.17163.1.100.4.0.9)
Degraded
A memory error has been detected on the appliance (not supported on all models).
A memory error has been detected. A system memory stick might be failing. Try reseating the memory first. If the problem persists, contact Riverbed Support for an RMA replacement as soon as practically possible.
ipmi
(enterprises.17163.1.100.4.0.10)
Degraded
An IPMI event has been detected on the appliance. Please check the details in the alarm report on the web UI.
An Intelligent Platform Management Interface (IPMI) event has been detected. Check the Alarm Status page for more detail. You can also view the IPMI events on the Core, by entering the CLI command:
show hardware error-log all
localFSFull
(enterprises.17163.1.100.4.0.11)
Critical
The appliance local file system is full.
The appliance local file system is full. You must create more space.
Note: The appliance local file system contains no files from the NFS exports.
temperatureCritical
(enterprises.17163.1.100.4.0.12)
Critical
The system temperature has reached a critical stage.
This trap/alarm triggers a critical state on the appliance. This alarm occurs when the appliance temperature reaches 90 degrees Celsius. The temperature value isn’t user-configurable. Reduce the appliance temperature.
temperatureWarning
(enterprises.17163.1.100.4.0.13)
Degraded
The system temperature has exceeded the threshold.
The appliance temperature warning is a configurable notification. By default, this notification is set to trigger when the appliance reaches 70 degrees Celsius. Raise the alarm trigger temperature if it is normal for the device to get that hot, or reduce its temperature.
scheduledJobError
(enterprises.17163.1.100.4.0.14)
Healthy
A scheduled job has failed during execution.
A scheduled job on the system (for example, a software upgrade) has failed. To determine which job failed, use the CLI or the Management Console.
confModeEnter
(enterprises.17163.1.100.4.0.15)
Healthy
A user has entered configuration mode.
A user on the system has entered configuration mode from either the CLI or the Management Console. Logging in to the Management Console as user admin sends this trap as well. This is for notification purposes only; no other action is necessary.
confModeExit
(enterprises.17163.1.100.4.0.16)
Healthy
A user has exited configuration mode.
A user on the system has exited configuration mode from either the CLI or the Management Console. Logging out of the Management Console as user admin sends this trap as well. This is for notification purposes only; no other action is necessary.
secureVaultLocked
(enterprises.17163.1.100.4.0.17)
Critical
Secure vault is locked. The secure datastore cannot be used.
You must unlock the secure vault.
For details, see Unlocking the secure vault.
testTrap
(enterprises.17163.1.100.4.0.19)
Healthy
Trap test.
An SNMP trap test has occurred on the Core. This message is informational and no action is necessary.
temperatureNonCritical
(enterprises.17163.1.100.4.0.1012)
Degraded
The system temperature is no longer in a critical stage.
This message is informational and no action is necessary.
temperatureNormal
(enterprises.17163.1.100.4.0.1013)
Healthy
The system temperature is back within the threshold.
This message is informational and no action is necessary.
secureVaultUnlocked
(enterprises.17163.1.100.4.0.1017)
Healthy
Secure vault is unlocked. The secure data store can be used now.
This message is informational and no action is necessary.
edgeError
(enterprises.17163.1.100.4.0.10500)
Critical
Edge module encountered error.
Edge module encountered an error and the connection to the Edge is down. If this error isn’t resolved, the blockstore will fill up.
highAvailabilityError
(enterprises.17163.1.100.4.0.10501)
Degraded or Critical
High-availability module encountered an error.
A degraded state indicates one of the following conditions:
•  Edge heartbeat channel failure
•  High availability heartbeat timed out
•  Edge blockstore connection failure
A critical state indicates one of the following conditions:
•  Edge blockstore activation failure
•  Edge blockstore local write failure
•  Edge detected split-brain
•  Edge requires activation
lunError
(enterprises.17163.1.100.4.0.10502)
Degraded
LUN module encountered error.
•  Metadata DB has been corrupted - Check the syslogs. Unmount and remount the export.
•  Backend connection to the export isn’t ready, or the export is unreachable on the backend - Check the connection to the backend.
•  Backend does not support required operations - Check the permission settings on the backend.
•  File system crawl of the backend export failed - Check the syslogs. Unmount and remount the export.
•  File system is invalid - Check the syslogs. Unmount and remount the export.
•  Resize of the export failed - Check the logs.
•  Export does not have enough available space - Increase the size of the export on the backend.
•  Export is not supported - Check the settings on the backend.
•  Reduction in size of export detected - Take the export offline and bring it back online, and the new size will take effect.
•  Available size of export detected is below the threshold - Increase the size of the export from the backend.
•  Metadata write sequence number is missing - Take the export offline and bring it back online.
•  Backend does not support writes - Set read/write permissions on the backend.
•  Lost connection to the backend export - Check the connection to the backend.
iscsiError
(enterprises.17163.1.100.4.0.10503)
Critical
iSCSI module encountered error.
NFS server isn’t running - Check the backend NFS server for errors.
snapshotError
(enterprises.17163.1.100.4.0.10505)
Critical
Snapshot module encountered error.
A snapshot failed to be committed to the SAN, or a snapshot has failed to complete due to Windows timing out.
Check the Core logs for details. Retry the Windows snapshot.
applianceUnlicensedError
(enterprises.17163.1.100.4.0.10506)
Critical
Appliance license expired/invalid.
Appliance license expired/invalid.
modelUnlicensedError
(enterprises.17163.1.100.4.0.10507)
Critical
Model license expired/invalid.
Model license expired/invalid. Some SteelFusion services may stop working.
blkdiskError
(enterprises.17163.1.100.4.0.10508)
Critical
Block-disk module encountered error.
Block-disk module encountered an error.
Note: This alarm applies only to Core-v implementations.
backupIntegrationError
(enterprises.17163.1.100.4.0.10509)
Critical
Backup-integration module encountered error.
The backup-integration module encountered an error. This indicates an issue with data protection; for example, the proxy mount has failed or the proxy server cannot be reached.
otherHardwareError
(enterprises.17163.1.100.4.0.10510)
Either Critical or Degraded, depending on the state
Hardware error detected.
Indicates that the system has detected a problem with the hardware. These issues trigger the hardware error alarm:
•  The appliance does not have enough disk, memory, CPU cores, or NICs to support the current configuration
•  The appliance is using a Dual In-line Memory Module (DIMM), a hard disk, or a NIC that isn’t qualified by Riverbed
•  Other hardware issues
The alarm clears when you add the necessary hardware, remove the unqualified hardware, or resolve other hardware issues.
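A trap receiver may report these traps as numeric OIDs rather than names. The following sketch, offered as an illustration rather than a Riverbed-provided tool, maps a numeric OID back to the trap name and appliance state from the table above. It relies on the standard expansion of `enterprises` to `1.3.6.1.4.1`; `TRAPS` and `describe_trap` are hypothetical names introduced here.

```python
# "enterprises" is the standard name for the OID prefix 1.3.6.1.4.1.
ENTERPRISES = "1.3.6.1.4.1"
TRAP_PREFIX = ENTERPRISES + ".17163.1.100.4.0."

# Final OID number -> (trap name, appliance state), copied from the table.
TRAPS = {
    1: ("procCrash", "Healthy"),
    2: ("procExit", "Healthy"),
    3: ("configChange", "Healthy"),
    4: ("cpuUtil", "Degraded"),
    5: ("pagingActivity", "Degraded"),
    6: ("linkError", "Degraded"),
    7: ("powerSupplyError", "Degraded"),
    8: ("fanError", "Degraded"),
    9: ("memoryError", "Degraded"),
    10: ("ipmi", "Degraded"),
    11: ("localFSFull", "Critical"),
    12: ("temperatureCritical", "Critical"),
    13: ("temperatureWarning", "Degraded"),
    14: ("scheduledJobError", "Healthy"),
    15: ("confModeEnter", "Healthy"),
    16: ("confModeExit", "Healthy"),
    17: ("secureVaultLocked", "Critical"),
    19: ("testTrap", "Healthy"),
    1012: ("temperatureNonCritical", "Degraded"),
    1013: ("temperatureNormal", "Healthy"),
    1017: ("secureVaultUnlocked", "Healthy"),
    10500: ("edgeError", "Critical"),
    10501: ("highAvailabilityError", "Degraded or Critical"),
    10502: ("lunError", "Degraded"),
    10503: ("iscsiError", "Critical"),
    10505: ("snapshotError", "Critical"),
    10506: ("applianceUnlicensedError", "Critical"),
    10507: ("modelUnlicensedError", "Critical"),
    10508: ("blkdiskError", "Critical"),
    10509: ("backupIntegrationError", "Critical"),
    10510: ("otherHardwareError", "Critical or Degraded"),
}


def describe_trap(oid):
    """Return (name, state) for a numeric trap OID, or None if not in the table."""
    oid = oid.lstrip(".")  # tolerate a leading dot, as some tools print one
    if not oid.startswith(TRAP_PREFIX):
        return None
    suffix = oid[len(TRAP_PREFIX):]
    if not suffix.isdigit():
        return None
    return TRAPS.get(int(suffix))
```

For highAvailabilityError and otherHardwareError the state depends on the underlying condition, so the lookup returns both possibilities and the receiver would need the alarm details to tell them apart.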