Viewing SCC alarm status reports
The appliance tracks key hardware and software metrics and alerts you of any potential problems so you can quickly discover and diagnose issues.
RiOS 7.0 and later feature better alarm reporting using hierarchical alarms. The system groups certain alarms into top-level categories, such as the SSL Settings alarm. When an alarm triggers, its parent expands to provide more information. For example, the System Disk Full top-level alarm aggregates over multiple partitions. If a specific partition is full, the System Disk Full alarm triggers and the Alarm Status report displays more information regarding which partition caused the alarm to trigger.
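The parent and child relationship described above can be sketched in a few lines of Python. This is only an illustration of the idea: the Alarm class, its fields, and the partition names used as children are assumptions made for the example, not the actual RiOS or SCC data model.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    """Hypothetical alarm node; not an actual RiOS or SCC structure."""
    name: str
    triggered: bool = False
    children: list["Alarm"] = field(default_factory=list)

    def is_triggered(self) -> bool:
        # A parent triggers when it, or any of its children, has triggered.
        return self.triggered or any(c.is_triggered() for c in self.children)

    def triggered_children(self) -> list[str]:
        # The detail a report could show when the parent expands.
        return [c.name for c in self.children if c.is_triggered()]


# Top-level alarm aggregating over two example partitions.
disk_full = Alarm("System Disk Full", children=[Alarm("/var"), Alarm("/config")])
disk_full.children[0].triggered = True  # simulate /var filling up

print(disk_full.is_triggered(), disk_full.triggered_children())
```

In this sketch the top-level alarm carries no state of its own; it reports as triggered because one child has, and the list of triggered children identifies the partition responsible.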
The health of the SCC falls into one of these states:
•  Healthy - The SCC is in a healthy state.
•  Needs Attention - The SCC is in a healthy state, but a management-related issue might need to be looked at. For example, the license might need to be reviewed.
•  Degraded - The SCC has detected an issue.
•  Critical - The SCC has encountered a critical issue that needs to be addressed immediately.
The health of a managed appliance on the SCC falls into one of these states:
•  Needs Attention - Accompanies a healthy state to indicate management-related issues not affecting the ability of the SteelHead to optimize traffic.
•  Degraded - The SteelHead is optimizing traffic but the system has detected an issue.
•  Admission Control - The SteelHead is optimizing traffic but has reached its connection limit.
•  Critical - The SteelHead might or might not be optimizing traffic; you must address a critical issue.
•  Unsupported - The SteelHead is unsupported.
The Alarm Status report provides the status for the SCC alarms and includes this alarm information:
Alarm
Reason
CPU Utilization
Displays an alarm when the system has reached the CPU threshold for any of the CPUs in the appliance. If the system has reached the CPU threshold, check your settings.
If your alarm thresholds are correct, reboot the appliance.
If more than 100 MB of data is moved through an appliance while performing PFS synchronization, the CPU utilization can become high and result in a CPU alarm. This CPU alarm isn’t cause for concern.
Disk Full
Displays an alarm when the system partitions (not the RiOS data store) are full or almost full. For example, RiOS monitors the available space on /var that’s used to hold logs, statistics, system dumps, TCP dumps, and so on.
This alarm monitors these system partitions:
Partition “/boot Full” Free Space
Partition “/bootmgr Full” Free Space
Partition “/config Full” Free Space
Partition “/data Full” Free Space
Partition “/proxy” Free Space
Partition “/var” Free Space
Hardware
•  Flash Error - Indicates an error with the flash drive hardware.
•  IPMI - Indicates an Intelligent Platform Management Interface (IPMI) event. (Not supported on all appliance models.)
This alarm triggers when there has been a physical security intrusion. These events trigger this alarm:
–  Chassis intrusion (physical opening and closing of the appliance case)
–  Memory errors (correctable or uncorrectable ECC memory errors)
–  Hard drive faults or predictive failures
–  Power supply status or predictive failure
By default, this alarm is enabled.
Licensing
Displays an alarm when your licenses aren’t current.
•  Autolicense critical event - This alarm triggers on a SteelHead (virtual edition) appliance when the Riverbed Licensing Portal can’t respond to a license request with valid licenses. The Licensing Portal can’t issue a valid license for one of these reasons:
–  A newer SteelHead (virtual edition) appliance is already using the token, so you can’t use it on the SteelHead (virtual edition) appliance displaying the critical alarm. Every time the SteelHead (virtual edition) appliance attempts to refetch a license token, the alarm retriggers.
–  The token has been redeemed too many times. Every time the SteelHead (virtual edition) appliance attempts to refetch a license token, the alarm retriggers.
•  Autolicense informational event - This alarm triggers if the Riverbed Licensing Portal has information regarding the licenses for a SteelHead (virtual edition) appliance. For example, the SteelHead (virtual edition) appliance displays this alarm when the portal returns licenses that are associated with a token that has been used on a different SteelHead (virtual edition) appliance.
•  Insufficient Appliance Management License(s) - This alarm triggers if there aren’t enough licenses to manage all connected appliances.
•  Invalid License(s) - This alarm triggers if there is any invalid license.
•  Licenses Expired - This alarm triggers if one or more features has at least one license installed, but all of them are expired.
•  Licenses Expiring - This alarm triggers if the license for one or more features is going to expire within two weeks.
•  License(s) Missing - This alarm triggers if any licenses are missing.
Note: The licenses expiring and licenses expired alarms are triggered per feature. For example: if you install two license keys for a feature, LK1-FOO-xxx (expired) and LK1-FOO-yyy (not expired), the alarms don’t trigger, because the feature has one valid license.
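The per-feature rule in this note can be illustrated with a short Python sketch. The function name and the dictionary layout (feature name mapped to the expiry dates of its installed license keys) are hypothetical; the text doesn't describe how the SCC stores license data.

```python
from datetime import date


def features_with_all_licenses_expired(licenses_by_feature, today):
    """Return features that have licenses installed, all of them expired.

    licenses_by_feature maps a feature name to a list of expiry dates,
    one per installed license key (an assumed representation).
    """
    return [
        feature
        for feature, expiries in licenses_by_feature.items()
        if expiries and all(expiry < today for expiry in expiries)
    ]


today = date(2024, 6, 1)
installed = {
    "FOO": [date(2023, 1, 1), date(2025, 1, 1)],  # one expired, one valid
    "BAR": [date(2023, 1, 1)],                    # only license has expired
}
expired_features = features_with_all_licenses_expired(installed, today)
print(expired_features)
```

Feature FOO doesn't appear in the result because one of its two keys is still valid, which mirrors the LK1-FOO example in the note.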
Link Duplex
Displays an alarm and sends an email notification when an interface wasn’t configured for half-duplex negotiation but has negotiated half-duplex mode. Half-duplex significantly limits the optimization service results.
The alarm displays which interface is triggering the duplex alarm.
By default, this alarm is enabled.
Link I/O Errors
Displays an alarm when the error rate on an aux or primary interface has exceeded 0.1 percent while either sending or receiving packets. This threshold is based on the observation that even a small link error rate reduces TCP throughput significantly. A properly configured LAN connection experiences very few errors. The alarm clears when the error rate drops below 0.05 percent.
Link State
Displays an alarm and sends an email notification if an Ethernet link is lost due to an unplugged cable or dead switch port. Depending on which link is down, the system might no longer be optimizing and a network outage could occur.
This condition is often caused by interface transitions on surrounding devices, such as routers or switches. This alarm also accompanies service or system restarts on the appliance.
This alarm applies to the aux and primary interfaces.
By default, this alarm is disabled.
Memory Paging
Displays an alarm when the system has reached the memory paging threshold. If 100 pages are swapped approximately every two hours, the SteelHead is functioning properly. If thousands of pages are swapped every few minutes, reboot the SteelHead. If rebooting doesn’t solve the problem, contact Riverbed Support at https://support.riverbed.com.
Process Dump Creation
Displays an alarm when the system has detected an error while trying to create a process dump. This alarm indicates an abnormal condition where RiOS can’t collect the core file after three retries. It can be caused when the /var directory that’s used to hold system dumps is reaching capacity or other conditions. When this alarm is raised, the directory is blacklisted.
SCC Appliance Configuration Backup
Displays an alarm when the daily backup has failed.
SCC External Configuration Backup/Restore
Displays an alarm when the external configuration backup has failed. It updates every 30 seconds.
SCC External Statistics Backup/Restore
Displays an alarm when the external statistics backup has failed. It updates every 30 seconds.
SCC Underprovisioned Virtual Machine
Displays an alarm when an underprovisioned virtual SteelHead is detected.
Secure Vault
Enables an alarm and sends an email notification if the system encounters a problem with the secure vault:
•  Secure Vault Locked - Needs Attention - Indicates that the secure vault is locked. To optimize SSL connections or to use RiOS data store encryption, the secure vault must be unlocked. Choose Appliance > Secure Vault and unlock the secure vault.
SSL
Enables an alarm if an error is detected in your SSL configuration:
•  Non-443 SSL Servers - Indicates that during a RiOS upgrade (for example, from 5.5 to 6.0), the system has detected a preexisting SSL server certificate configuration on a port other than the default SSL port 443. SSL traffic can’t be optimized. To restore SSL optimization, you can add an in-path rule to the client-side SteelHead to intercept the connection and optimize the SSL traffic on the nondefault SSL server port.
After adding an in-path rule, you must clear this alarm manually by entering this CLI command:
stats alarm non_443_ssl_servers_detected_on_upgrade clear
•  SSL Certificates Error - Indicates that an SSL peering certificate has failed to reenroll automatically within the Simple Certificate Enrollment Protocol (SCEP) polling interval.
•  SSL Certificates Expiring - Indicates that an SSL certificate is about to expire.
•  SSL Certificates SCEP - Indicates that an SSL certificate has failed to reenroll automatically within the SCEP polling interval.
Temperature
•  Critical Temperature - Enables an alarm and sends an email notification if the CPU temperature exceeds the rising threshold. When the CPU returns to the reset threshold, the critical alarm is cleared. The default value for the rising threshold temperature is 70ºC; the default reset threshold temperature is 67ºC.
•  Warning Temperature - Enables an alarm and sends an email notification if the CPU temperature approaches the rising threshold. When the CPU returns to the reset threshold, the warning alarm is cleared.
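Several alarms in this table use one threshold to trigger and a lower one to clear, so the alarm doesn't flap when a value hovers near the limit. The following Python sketch shows that pattern for the Critical Temperature alarm, using the 70ºC rising default and taking 67ºC as the reset threshold. The function name is hypothetical, and treating "returns to the reset threshold" as "at or below it" is an assumption.

```python
RISING_C = 70  # default rising threshold, from the text
RESET_C = 67   # reset threshold, as read from the text


def update_critical_alarm(active, cpu_temp_c, rising=RISING_C, reset=RESET_C):
    """Return the new alarm state after one CPU temperature reading."""
    if not active and cpu_temp_c > rising:
        return True   # exceeded the rising threshold: raise the alarm
    if active and cpu_temp_c <= reset:
        return False  # at or below the reset threshold (assumed): clear it
    return active     # between the thresholds: keep the current state


state = False
history = []
for reading in [65, 71, 69, 67, 68]:
    state = update_critical_alarm(state, reading)
    history.append(state)

print(history)
```

The reading of 69ºC keeps the alarm raised even though it is below 70ºC, because the alarm clears only at the reset threshold. The Link I/O Errors alarm follows the same shape, with 0.1 percent as the trigger and 0.05 percent as the clearing rate.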
What this report tells you
The Alarm Status report answers this question:
•  What is the current status of the SCC?
About report data
The SCC is designed to retain statistics for up to a maximum of three years, based on daily statistics for 2,000 appliances monitoring 50 to 100 TCP ports per SteelHead. Factors that can influence this number include the number of monitored TCP ports, the number of active interfaces on managed appliances, and changes in the types and amounts of data collected in RiOS releases.
The SCC polls data every five minutes. In general, the SCC retains five-minute granularity data points for a maximum of 30 days. One-hour granularity data points are stored for a maximum of 90 days. Beyond 90 days, the SCC retains one-day granularity data points for up to three years. If the stored statistics exceed capacity, the SCC deletes the oldest data from each of the three granularities, while attempting to preserve as much recent data as it can.
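The retention tiers above amount to a simple lookup from data age to the finest granularity still available. A minimal Python sketch follows, assuming inclusive boundaries and a three-year limit of 3 × 365 days; the function name is hypothetical, and the sketch ignores the early deletion that occurs when capacity is exceeded.

```python
def finest_retained_granularity(age_days):
    """Return the finest granularity kept for data of the given age."""
    if age_days <= 30:
        return "5 minutes"
    if age_days <= 90:
        return "1 hour"
    if age_days <= 3 * 365:  # three years, approximated in days
        return "1 day"
    return None  # older than the retention window


for age in (7, 45, 400, 2000):
    print(age, finest_retained_granularity(age))
```

A report covering the last week can therefore use five-minute points, while one reaching back a year has only daily points to draw on.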
Note: If the SCC and remote appliances lose connectivity with each other, the bandwidth and connection data during the period of lost connectivity can be skewed. For example, if a remote appliance loses connectivity with the SCC for six hours, data for the missing six hours appears to be 0 in reports for periods of Last Day or Custom intervals smaller than one day. However, when the remote appliance reestablishes connectivity, it sends an aggregate data point for the last day. Thus, reports for periods longer than Last Day do reflect bandwidth and connection data accurately. If you need to analyze data on the remote SteelHead for the missing period, you can view this in the SCC for the individual remote appliance.
To view the SCC Status report
•  Choose Diagnostics > SCC System: Alarm Status to display the Alarm Status page.
Related topics
•  Configuring alarm parameters
•  Configuring SNMP basic settings