Viewing Alarm Status reports
The Alarm Status report shows the status of the Core alarms.
For details about configuring alarm settings, see Setting alarm parameters.
What this report tells you
The Alarm Status report answers these questions:
•  What is the current status of the system?
•  What is the status of the system alarms?
To view the Alarm Status report
1. Choose Reports > Diagnostics: Alarm Status to display the Alarm Status page.
Alternatively, you can click the current alarm status that appears in the status bar of each page (Healthy, Admission Control, Degraded, or Critical) to display the Alarm Status page.
Figure: Alarm Status page
The following table summarizes the alarms shown in the report.
Control
Description
Backup Integration
Indicates that the backup-integration module has failed.
Block-disk
Indicates that the block-disk module has failed. This alarm is active only on iSCSI/block appliances.
Failover
Indicates that the Core has failed and the failover peer is in operation.
CPU Utilization
Indicates that the system has reached the CPU threshold for one or more of the CPUs in the Core. If this alarm triggers, check your alarm threshold settings.
If your alarm thresholds are correct, reboot the Core.
Note: If more than 100 MB of data are moved through the Core while performing PFS synchronization, the CPU utilization might become high and result in a CPU alarm. This CPU alarm isn’t cause for concern.
Disk Full
Indicates that one or more of the following partitions on the disk is full:
•  Partition "/boot" Full
•  Partition "/bootmgr" Full
•  Partition "/config" Full
•  Partition "/data" Full
•  Partition "/var" Full
Edge Service
Indicates that the Core has lost connection with one of the configured Edges.
Hardware
Indicates that one or more hardware failures have occurred.
This alarm setting also enables you to select one or more types of hardware failure (fan error, memory error, and so on), including:
•  Fan Error - Enables an alarm and sends an email notification if a fan is failing or has failed and needs to be replaced. By default, this alarm is enabled.
•  Flash Error - Enables an alarm when the system detects an error with the flash drive hardware. By default, this alarm is enabled.
•  IPMI - Enables an alarm and sends an email notification if an Intelligent Platform Management Interface (IPMI) event is detected.
•  Other Hardware Error - Indicates that the system has detected a problem with the hardware. The alarm clears when you add the necessary hardware, remove the nonqualified hardware, or resolve other hardware issues. The following issues trigger the hardware error alarm:
   •  The appliance does not have enough disk, memory, CPU cores, or NICs to support the current configuration.
   •  The appliance is using a dual in-line memory module (DIMM), a hard disk, or a NIC that isn’t qualified by Riverbed.
   •  DIMMs are plugged into the appliance but the system can’t recognize them because they are in the wrong slots. You must plug DIMMs into the black slots first and then use the blue slots when all of the black slots are in use.
   •  A DIMM is broken and you must replace it.
   •  Other hardware issues.
By default, all Hardware alarms are enabled.
•  Power Supply - Enables an alarm and sends an email notification if a power supply cord is inserted but has no power. A power supply slot with no cord inserted does not trigger this alarm. By default, this alarm is enabled.
•  RAID - Indicates an error with the RAID array (for example, missing drives, pulled drives, drive failures, and drive rebuilds). An audible alarm might also sound. To see if a disk has failed, enter this CLI command from the system prompt:
show raid diagram
For drive rebuilds, if a drive is removed and then reinserted, the alarm continues to be triggered until the rebuild is complete. Rebuilding a disk drive can take four to six hours. This alarm applies only to the SteelHead RAID Series 3000, 5000, and 6000.
High Availability
Indicates that the high-availability feature is degraded.
Licensing
Enables an alarm and sends an email notification if the appliance is unlicensed, there is an issue with the autolicense, the licenses have expired or are about to expire, or the model is unlicensed.
By default, all Licensing alarms are enabled.
Link Duplex
Indicates that an interface was not configured for half-duplex negotiation but has negotiated half-duplex mode. Half-duplex significantly limits the optimization service results.
The alarm displays which interface is triggering the duplex error.
Choose Configure > Networking: Data Interfaces and examine the Core link configuration. Next, examine the peer switch user interface to check its link configuration. If the configuration on one side is different from the other, traffic is sent at different rates on each side, causing many collisions.
To troubleshoot, change both interfaces to automatic duplex negotiation. If the interfaces don’t support automatic duplex, configure both ends for full duplex.
You can enable or disable the alarm for a specific interface. To enable or disable the alarm, choose Settings > System Settings: Alarms and select or clear the check box next to the link alarm.
Link I/O Errors
Indicates that the error rate on an interface has exceeded 0.1 percent while either sending or receiving packets. This threshold is based on the observation that even a small link error rate reduces TCP throughput significantly. A properly configured LAN connection experiences very few errors. The alarm clears when the error rate drops below 0.05 percent.
You can change the default alarm thresholds by entering the alarm error-threshold CLI command at the system prompt. For details, see the SteelFusion Command-Line Interface Reference Manual.
To troubleshoot, try a new cable and a different switch port. Another possible cause is electromagnetic noise nearby.
You can enable or disable the alarm for a specific interface: for example, you can disable the alarm for a link after deciding to tolerate the errors. To enable or disable an alarm, choose Settings > System Settings: Alarms and select or clear the check box next to the link name.
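The two thresholds described for Link I/O Errors (raise above 0.1 percent, clear below 0.05 percent) form a hysteresis band, which keeps the alarm from flapping when the error rate hovers near a single limit. The following Python sketch only illustrates that behavior. The class and function names are hypothetical, and the way the appliance actually samples packet and error counters is not described here, so the per-sample counts are an assumption.

```python
from dataclasses import dataclass

# Default thresholds taken from the description above, as fractions.
RAISE_THRESHOLD = 0.001   # 0.1 percent: alarm raises above this rate
CLEAR_THRESHOLD = 0.0005  # 0.05 percent: alarm clears below this rate


def error_rate(errors: int, packets: int) -> float:
    """Return the fraction of packets in error (0.0 if no packets were seen)."""
    return errors / packets if packets else 0.0


@dataclass
class LinkErrorAlarm:
    """Hypothetical model of one interface's Link I/O Errors alarm state."""
    raised: bool = False

    def update(self, errors: int, packets: int) -> bool:
        """Feed one sample (assumed per-interval counts); return the alarm state."""
        rate = error_rate(errors, packets)
        if not self.raised and rate > RAISE_THRESHOLD:
            self.raised = True
        elif self.raised and rate < CLEAR_THRESHOLD:
            self.raised = False
        # Between the two thresholds the previous state is kept.
        return self.raised


alarm = LinkErrorAlarm()
for errors in (5, 20, 8, 4):  # errors per 10,000 packets
    print(errors, alarm.update(errors, 10_000))
```

In this sketch a rate of 0.08 percent (8 errors per 10,000 packets) leaves an already raised alarm raised, because it lies between the clear and raise thresholds.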
Link State
Indicates that the system has lost one of its Ethernet links due to an unplugged cable or dead switch port. Check the physical connectivity between the appliance and its neighbor device. Investigate this alarm as soon as possible. Depending on which link is down, the system might no longer be optimizing and a network outage could occur.
You can enable or disable the alarm for a specific interface. To enable or disable the alarm, choose Settings > System Settings: Alarms and select or clear the check box next to the link name.
Memory Paging
Indicates extended memory paging activity.
If 100 pages are swapped every couple of hours, the appliance is functioning properly. If thousands of pages are swapped every few minutes, contact Riverbed Support.
Process Dump Creation Error
Indicates that the system detected an error while trying to create a process dump.
This alarm indicates an abnormal condition in which the system can’t collect the core file after three retries. This condition can be caused when the /var directory reaches capacity. When the alarm is raised, the directory is blacklisted.
Secure Vault
Secure Vault Locked - Indicates that the secure vault is locked. To optimize SSL connections or to use RiOS data store encryption, the secure vault must be unlocked. Go to Settings > Security: Secure Vault and unlock the secure vault.
Server Backup
Indicates that one of the following backup failures has occurred:
•  Proxy connection failure - Indicates that the connection between the Core and the proxy server has failed, or the credentials for ESXi proxy login are incorrect. The alarm is cleared when the connection is restored, correct credentials are provided, or the proxy configuration is deleted.
•  Backup failure - Indicates that a backup policy has failed. The message identifies the failing server, the backup policy name, and the reason for failure. The reason might be a storage backend failure for a snapshot or clone operation, a failure to mount or unmount the exports, or a slow proxy server resulting in a timeout. The alarm is cleared when the next protection operation succeeds.
•  Proxy cleanup timeout - Indicates that proxy cleanup is taking more time than expected. If a backup takes longer than 30 minutes, or if a snapshot remains on a VM after a failed backup, the Core triggers an alarm. To fix this issue, check the state of the ESXi server and ensure that the failed backup did not leave a snapshot on the VM. The alarm is cleared when the next protection operation succeeds.
•  Snapshot error - Indicates that proxy-mounted VMs have associated snapshots and can’t be unmounted.
•  Excluded VMs - Indicates that VMs are excluded from a backup policy.
Snapshot
Indicates that the connection to one or more of the snapshot storage arrays has failed.
SSL
Indicates that the system detected an error in your SSL configuration.
SteelFusion Core configuration status
Indicates that the Core configuration has been reverted to a previous version and all connections to the Edges are lost.
Contact Riverbed Support at https://support.riverbed.com.
SteelFusion Core Service
Indicates that the Core service isn’t running.
SteelFusion Protocol Service
Indicates that an NFS protocol error from the backend storage array is preventing an export from being mounted on the Core.
By default, this alarm is enabled.
Storage Volume Status
Indicates that the connection to the export has failed or there is an issue with any of the following:
•  Backend connectivity
•  No read/write permissions
•  Space threshold has been reached
•  Resize failure
•  An export is deactivated and unavailable. An export is deactivated if the blockstore is critically low on space and this particular export has a high rate of new writes.
•  Initialization of the blockstore for the export fails, making the export unavailable.
Check whether the data center export was taken offline on the Core while I/O operations were in progress. To troubleshoot this issue, reactivate the export through the Management Console or the CLI.
•  A Resize alarm is triggered for an export if its size is changed on the storage array and the Core can’t make the new size available to the branch client. A resize might not be propagated to the branch for these reasons:
   •  The size of the export on the storage array is reduced.
   •  The increased size of a pinned export cannot be accommodated in the Edge blockstore.
•  Data sync is blocked between Edge and Core, potentially due to a protocol-related issue.
 
Temperature
•  Critical Temperature - Indicates that the CPU temperature exceeds the rising threshold. When the CPU returns to the reset threshold, the critical alarm is cleared. The default value for the rising threshold temperature is 70°C; the default reset threshold temperature is 67°C.
•  Warning Temperature - Indicates that the CPU temperature is approaching the rising threshold. When the CPU returns to the reset threshold, the warning alarm is cleared.
After the alarm triggers, it cannot trigger again until after the temperature falls below the reset threshold and then exceeds the rising threshold again.
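The re-arming rule for the temperature alarm can be illustrated with a short Python sketch that walks through a series of CPU temperature readings and reports where the critical alarm would trigger. It uses the default thresholds stated above (rising 70°C, reset 67°C). The function name is hypothetical, and treating a reading at or below the reset threshold as "reset" is an assumption, since the text describes both returning to and falling below that threshold.

```python
from typing import Iterable, List


def trigger_points(readings: Iterable[float],
                   rising: float = 70.0,
                   reset: float = 67.0) -> List[int]:
    """Return the indexes of the readings at which the alarm would trigger.

    Hypothetical helper: the alarm triggers when a reading exceeds the
    rising threshold, and cannot trigger again until a reading has come
    back down to the reset threshold (assumed: at or below it).
    """
    armed = True
    triggers: List[int] = []
    for index, temp in enumerate(readings):
        if armed and temp > rising:
            triggers.append(index)
            armed = False          # stay quiet until the CPU cools down
        elif not armed and temp <= reset:
            armed = True           # cooled to the reset threshold: re-arm
    return triggers


# Readings in degrees Celsius; 72 at index 3 does not re-trigger because
# the temperature never dropped to the reset threshold after index 1.
print(trigger_points([65, 71, 69, 72, 68, 66, 71]))  # [1, 6]
```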