Viewing Alarm Status Reports
The Alarm Status report provides status for the Core alarms.
For details about configuring alarm settings, see Setting Alarm Parameters.
What This Report Tells You
The Alarm Status report answers these questions:
•  What is the current status of the system?
•  What is the status of the system alarms?
To view the alarm status report
1. Choose Reports > Diagnostics: Alarm Status to display the Alarm Status page.
Alternatively, you can click the current alarm status that appears in the status bar of each page (Healthy, Admission Control, Degraded, or Critical) to display the Alarm Status page.
The following table summarizes the alarms shown in the report.
Alarm
Description
Backup Integration
Indicates that the backup-integration module has failed.
Block-disk
Indicates that the block-disk module has failed.
Failover
Indicates that the Core has failed and the failover peer is in operation.
Core Disaster Recovery
Indicates that there is an issue with replication:
•  Journal LUN - This alarm triggers if the Journal LUN size is not large enough to support the configured replica LUNs. To resolve this issue, increase the size of the Journal LUN.
•  Replication synchronization latency - This alarm triggers if the replication latency to the secondary data center is over 150 ms. To troubleshoot this issue, check the storage array latency on the secondary data center, the inter-data center WAN latency, and whether there are any MPIO errors on the secondary data center.
•  Replication state - This alarm triggers if the status of one or more replicating LUNs is suspended. This acts as a reminder that data is not being replicated for the LUNs and therefore not protected from data center failure. This alarm is automatically resolved when data is synchronized and replication is active.
•  Journal corruption - This alarm triggers when any type of corruption is detected on the Journal LUN. Refer to the alarm email for details, including any possible resolutions (these will vary depending on the type of corruption).
•  Data center connection - This alarm triggers when connections are lost to the peer data center. To troubleshoot, check whether the inter-data center WAN and all the added interfaces are functioning correctly.
•  Journal LUN missing - This alarm triggers when the Journal LUN is not found on the storage array because it was accidentally unmapped from the backend, or the connection to the storage array was lost. To resolve this issue, check the backend settings, ensure that the LUN is exposed correctly, and check iSCSI logs for any issues.
CPU Utilization
Indicates that the system has reached the CPU threshold for one or more of the CPUs in the Core. If the system has reached the CPU threshold, check your settings.
If your alarm thresholds are correct, reboot the Core.
Note: If more than 100 MB of data is moved through the Core while performing PFS synchronization, the CPU utilization might become high and result in a CPU alarm. This CPU alarm is not a cause for concern.
Disk Full
Indicates that one or more of the following partitions on the disk is full:
•  Partition "/boot" Full
•  Partition "/bootmgr" Full
•  Partition "/config" Full
•  Partition "/data" Full
•  Partition "/var" Full
Edge Service
Indicates that the Core has lost connection with one of the configured Edges.
Hardware
Indicates that one or more hardware failures have occurred.
This alarm setting also enables you to select one or more types of hardware failure (fan error, memory error, and so on), including:
•  Fan Error - Enables an alarm and sends an email notification if a fan is failing or has failed and needs to be replaced. By default, this alarm is enabled.
•  Flash Error - Enables an alarm when the system detects an error with the flash drive hardware. By default, this alarm is enabled.
•  IPMI - Enables an alarm and sends an email notification if an Intelligent Platform Management Interface (IPMI) event is detected.
•  Other Hardware Error - This alarm indicates that the system has detected a problem with the hardware. The alarm clears when you add the necessary hardware, remove the nonqualified hardware, or resolve other hardware issues. The following issues trigger the hardware error alarm:
•  The appliance does not have enough disk, memory, CPU cores, or NICs to support the current configuration.
•  The appliance is using a dual in-line memory module (DIMM), a hard disk, or a NIC that is not qualified by Riverbed.
•  DIMMs are plugged into the appliance but the system cannot recognize them because they are in the wrong slots. You must plug DIMMs into the black slots first and then use the blue slots when all of the black slots are in use.
•  A DIMM is broken and you must replace it.
•  Other hardware issues.
By default, all Hardware alarms are enabled.
•  Power Supply - Enables an alarm and sends an email notification if an inserted power supply cord does not have power, as opposed to a power supply slot with no power supply cord inserted. By default, this alarm is enabled.
•  RAID - Indicates an error with the RAID array (for example, missing drives, pulled drives, drive failures, and drive rebuilds). An audible alarm might also sound. To see if a disk has failed, enter this CLI command from the system prompt:
show raid diagram
 
•  For drive rebuilds, if a drive is removed and then reinserted, the alarm continues to be triggered until the rebuild is complete. Rebuilding a disk drive can take 4-6 hours. This alarm applies only to the SteelHead RAID Series 3000, 5000, and 6000.
High-Availability
Indicates that the High-Availability feature is degraded.
iSCSI Service
Indicates that the iSCSI initiators are not accessible. Review the iSCSI configuration in Core. The iSCSI initiators might have been removed.
Licensing
Enables an alarm and sends an email notification if the appliance is unlicensed, there is an issue with the autolicense, the licenses have expired or are about to expire, or the model is unlicensed.
By default, all Licensing alarms are enabled.
Link Duplex
Indicates that an interface was not configured for half-duplex negotiation but has negotiated half-duplex mode. Half-duplex significantly limits the optimization service results.
The alarm displays which interface is triggering the duplex error.
Choose Configure > Networking: Data Interfaces and examine the Core link configuration. Next, examine the peer switch user interface to check its link configuration. If the configuration on one side is different from the other, traffic is sent at different rates on each side, causing many collisions.
To troubleshoot, change both interfaces to automatic duplex negotiation. If the interfaces do not support automatic duplex, configure both ends for full duplex.
You can enable or disable the alarm for a specific interface. To enable or disable the alarm, choose Settings > System Settings: Alarms and select or clear the check box next to the link alarm.
Link I/O Errors
Indicates that the error rate on an interface has exceeded 0.1 percent while either sending or receiving packets. This threshold is based on the observation that even a small link error rate reduces TCP throughput significantly. A properly configured LAN connection experiences very few errors. The alarm clears when the error rate drops below 0.05 percent.
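The separate trigger and clear thresholds described above form a hysteresis band: an error rate between 0.05 and 0.1 percent leaves the alarm in whatever state it was already in. The following Python sketch illustrates that behavior only. The class name, method name, and sample counters are hypothetical and are not part of the Core software or its CLI.

```python
from dataclasses import dataclass

# Thresholds quoted in the description above, expressed as fractions.
RAISE_THRESHOLD = 0.001   # alarm triggers when the error rate exceeds 0.1 percent
CLEAR_THRESHOLD = 0.0005  # alarm clears when the error rate drops below 0.05 percent


@dataclass
class LinkErrorAlarm:
    """Hypothetical model of an alarm with separate trigger and clear thresholds."""
    active: bool = False

    def update(self, error_packets: int, total_packets: int) -> bool:
        """Evaluate one sample of interface counters and return the alarm state."""
        if total_packets <= 0:
            return self.active  # no traffic in this sample, so nothing to evaluate
        rate = error_packets / total_packets
        if not self.active and rate > RAISE_THRESHOLD:
            self.active = True
        elif self.active and rate < CLEAR_THRESHOLD:
            self.active = False
        return self.active


# Illustrative samples: (error packets, total packets) per interval.
alarm = LinkErrorAlarm()
samples = [(5, 10_000), (15, 10_000), (8, 10_000), (4, 10_000)]
states = [alarm.update(errors, total) for errors, total in samples]
print(states)
```

In this sketch the third sample (0.08 percent) is below the trigger threshold, but the alarm stays raised because the rate has not yet dropped below the clear threshold.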
You can change the default alarm thresholds by entering the alarm error-threshold CLI command at the system prompt. For details, see the SteelFusion Command-Line Interface Reference Manual.
To troubleshoot, try a new cable and a different switch port. Another possible cause is electromagnetic noise nearby.
You can enable or disable the alarm for a specific interface: for example, you can disable the alarm for a link after deciding to tolerate the errors. To enable or disable an alarm, choose Settings > System Settings: Alarms and select or clear the check box next to the link name.
Link State
Indicates that the system has lost one of its Ethernet links due to an unplugged cable or dead switch port. Check the physical connectivity between the appliance and its neighbor device. Investigate this alarm as soon as possible. Depending on which link is down, the system might no longer be optimizing and a network outage could occur.
You can enable or disable the alarm for a specific interface. To enable or disable the alarm, choose Settings > System Settings: Alarms and select or clear the check box next to the link name.
LUN Status
Indicates that a LUN is having any of these issues:
•  A LUN is deactivated and unavailable. A LUN is deactivated if the blockstore is critically low on space and this particular LUN has a high rate of new writes.
•  Initialization of the blockstore for the LUN fails, making the LUN unavailable.
Check whether the data center LUN was taken offline on the Core while I/O operations were in progress. To troubleshoot this issue, reactivate the LUN through the Management Console or the CLI.
•  A Resize alarm is triggered for a LUN if its size is changed on the storage array and the Core is not able to make the new size available to the branch client. Some reasons why a resize might not be propagated to the branch are:
•   The size of the LUN on the storage array is reduced.
•   The increased size of a pinned LUN cannot be accommodated in the Edge blockstore.
•   In the FusionSync (replication) configuration, the replica LUN size is smaller than the primary LUN size.
In FusionSync, due to the allowed replica leeway when configuring replication, the replica LUN on the secondary data center can be larger than the LUN in the Core configuration (which is the size of the LUN on the primary data center). If the primary data center goes down and you fail over to the secondary data center, the LUN size on secondary will show as larger than the configured LUN size, causing the Resize alarm to be triggered.
Memory Paging
Indicates extended memory paging activity.
If 100 pages are swapped every couple of hours, the appliance is functioning properly. If thousands of pages are swapped every few minutes, contact Riverbed Support.
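To put the two paging levels above on a common scale, the following Python sketch converts each to pages per minute. The text gives only approximate figures, so the specific values used here (2 hours for "a couple of hours", 2000 pages for "thousands", and 5 minutes for "a few minutes") are assumptions chosen for illustration, and the helper function is hypothetical.

```python
def pages_per_minute(pages: int, minutes: float) -> float:
    """Convert a page count observed over an interval into a per-minute rate."""
    return pages / minutes


# Assumed reading of "100 pages every couple of hours": 100 pages in 2 hours.
normal = pages_per_minute(100, 2 * 60)

# Assumed reading of "thousands of pages every few minutes": 2000 pages in 5 minutes.
heavy = pages_per_minute(2000, 5)

print(f"normal: {normal:.2f} pages/min")
print(f"heavy:  {heavy:.0f} pages/min")
print(f"ratio:  {heavy / normal:.0f}x")
```

Under these assumptions the two levels differ by more than two orders of magnitude, so occasional paging should not be mistaken for the sustained activity that warrants a support call.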
Process Dump Creation Error
Indicates that the system detected an error while trying to create a process dump.
This alarm indicates an abnormal condition in which the system cannot collect the core file after three retries. This condition can be caused when the /var directory reaches capacity. When the alarm is raised, the directory is blacklisted.
Secure Vault
Secure Vault Locked - Indicates that the secure vault is locked. To optimize SSL connections or to use RiOS data store encryption, the secure vault must be unlocked. Go to Settings > Security: Secure Vault and unlock the secure vault.
Snapshot
Indicates that the connection to one or more of the snapshot storage arrays has failed.
SSL
Indicates that the system detected an error in your SSL configuration.
SteelFusion Core configuration status
Indicates that the Core configuration has been reverted to a previous version and all connections to the Edges are lost. Contact Riverbed Support at https://support.riverbed.com.
SteelFusion Core Service
Indicates that the Core service is not running.
Temperature
•  Critical Temperature - Indicates that the CPU temperature exceeds the rising threshold. When the CPU returns to the reset threshold, the critical alarm is cleared. The default value for the rising threshold temperature is 70°C; the default reset threshold temperature is 67°C.
•  Warning Temperature - Indicates that the CPU temperature is approaching the rising threshold. When the CPU returns to the reset threshold, the warning alarm is cleared.
After the alarm triggers, it cannot trigger again until after the temperature falls below the reset threshold and then exceeds the rising threshold again.
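The rising and reset thresholds can be read as a simple two-threshold rule. The following Python sketch traces the critical alarm state over a series of CPU temperature readings using the default values of 70°C and 67°C. The function name and the readings are hypothetical. The sketch also assumes that a reading at or below the reset threshold clears the alarm, which the text does not state precisely.

```python
RISING_THRESHOLD_C = 70  # default rising threshold from the description above
RESET_THRESHOLD_C = 67   # default reset threshold from the description above


def critical_alarm_states(readings_c, rising=RISING_THRESHOLD_C, reset=RESET_THRESHOLD_C):
    """Return the critical alarm state after each CPU temperature reading."""
    active = False
    states = []
    for temp in readings_c:
        if not active and temp > rising:
            active = True    # temperature exceeded the rising threshold
        elif active and temp <= reset:
            active = False   # assumption: reaching the reset threshold clears the alarm
        states.append(active)
    return states


# Illustrative readings in degrees Celsius.
readings = [65, 71, 69, 68, 66, 69, 72]
print(critical_alarm_states(readings))
```

In this trace the alarm stays raised at 69°C and 68°C after triggering at 71°C. Once cleared at 66°C, it does not trigger again at 69°C and only triggers again when the temperature exceeds 70°C.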