Alarm | SteelHead State | Reason |
Admission Control | Admission Control | In RiOS v7.0.1 and later, the system preemptively closes MAPI sessions to reduce the connection count in an attempt to bring the SteelHead out of admission control. RiOS closes MAPI sessions in this order: |
Application Consistent Snapshot | Degraded | An application-consistent snapshot failed to be committed to the SAN, or a snapshot failed to complete. Application-consistent snapshots are scheduled using the Core snapshot scheduler. A snapshot is application consistent if, in addition to being write-order consistent, it includes data from running applications that complete their operations and flush their buffers to disk. Application-consistent backups are recommended for database operating systems and applications such as SQL, Oracle, and Exchange. This error triggers when there are problems interacting with servers (ESXi or Windows). The first interaction is to prepare for a snapshot (where the server gets filesystems or a VM in a consistent state), and the second is to resume after the snapshot is taken (the server can clean up, stop logging changes, and so on). Errors can also occur due to misconfigurations on either side, local issues on the servers (high load, timeouts, reboots), networking problems, and so on. Check the Core logs for details, and then retry the snapshot. |
Asymmetric Routing | Needs Attention | Indicates that the system is experiencing asymmetric traffic. Indicates OK if the system is not experiencing asymmetric traffic. In addition, any asymmetric traffic is passed through, and the route appears in the Asymmetric Routing table. For details about the Asymmetric Routing table, see Configuring Asymmetric Routing Features. |
Connection Forwarding | Degraded | Indicates that the system has detected a problem with a connection-forwarding neighbor. The connection-forwarding alarms are inclusive of all connection-forwarding neighbors. For example, if a SteelHead has three neighbors, the alarm triggers if any one of the neighbors is in error. In the same way, the alarm clears only when all three neighbors are no longer in error. This alarm can also indicate that a connection-forwarding neighbor is running a RiOS version that is incompatible with IPv6. Neighbors must be running RiOS v8.5 or later. The SteelHead neighbors pass through IPv6 connections when this incompatibility is detected. These issues trigger the single connection-forwarding alarm: |
CPU Utilization | Degraded | Indicates that the system has reached the CPU threshold for any of the CPUs in the SteelHead. If the system has reached the CPU threshold, check your settings. For details, see Configuring Alarm Settings. If your alarm thresholds are correct, reboot the SteelHead. For details, see Rebooting and Shutting Down the SteelHead. |
Data Store | Critical | Resetting the Data Store alarm: If a data store alarm was caused by an unintended change to the configuration, you can change the configuration to match the previous RiOS data store settings, and then restart the service without clearing the data store to reset the alarm. Typical configuration changes that require a restart with a clear RiOS data store are enabling the extended peer table or changing the data store encryption type. For details, see Configuring Peering and Encrypting the RiOS Data Store. To clear the RiOS data store of data, choose Administration > Maintenance: Services, select Clear Data Store and click Reboot to reboot the optimization service. For details, see Starting and Stopping the Optimization Service. |
Disk Full | | Indicates that the system partitions (not the RiOS data store) are full or almost full. For example, RiOS monitors the available space on /var, which is used to hold logs, statistics, system dumps, TCP dumps, and so on. Examine the directory to see if it is storing an excessive number of snapshots, system dumps, or TCP dumps that you could delete. You could also delete any RiOS images that you no longer use. |
Domain Authentication Alert | Needs Attention | Indicates that the system is unable to communicate with the DC, has detected an SMB signing error, or delegation has failed. CIFS-signed and Encrypted-MAPI traffic is passed through without optimization. For details, see Configuring CIFS Optimization. |
Domain Join Error | Degraded | Indicates an attempt to join a Windows domain has failed. For details, see Troubleshooting a Domain Join Failure. |
Edge HA Service | Either Critical or Degraded, depending on the state | Indicates that only one of the appliances in a high availability (HA) SteelFusion Edge pair is actively serving storage data (the active peer). As the system writes new data to the active peer, it is reflected to the standby peer, which stores a copy of the data in its local data store. The two appliances maintain a heartbeat protocol between them, so that if the active peer goes down, the standby peer can take over servicing the LUNs. If the standby peer goes down, the active peer continues servicing the LUNs after raising this alarm and sending an email that the appliance is degraded or critical. The email contains the IP address of the peer appliance. Degraded indicates that the edge HA is not functioning but the LUNs are being serviced. After a failed peer resumes, it resynchronizes with the other peer in the HA pair to receive any data that was written since the time of the failure. After the peer receives all the written data, the normal HA mode resumes and any future writes are reflected to both peers. Critical indicates that the LUNs are no longer available and are not being serviced. Contact Riverbed Support. |
Flash Protection Failure | Critical | Indicates that the USB flash drive has not been backed up because there is not enough available space in the /var filesystem directory. Examine the /var directory to see if it is storing an excessive number of snapshots, system dumps, or TCP dumps that you could delete. You could also delete any RiOS images that you no longer use. |
Hardware | Either Critical or Degraded, depending on the state | show raid diagram This alarm applies only to the SteelHead RAID Series 3000, 5000, and 6000. To reboot the appliance, go to the Administration > Maintenance: Reboot/Shutdown page or enter the CLI reload command to automatically power cycle the SteelHead and restore the flash drive to its proper function. This alarm triggers when there has been a physical security intrusion. These events trigger this alarm: By default, this alarm is enabled. |
Disk access time can exceed the safety valve activation threshold for several reasons: the SteelHead might be undersized for the amount of traffic it is required to optimize, a larger than usual amount of traffic is being optimized temporarily, or a disk is experiencing hardware issues such as sector errors, failing mechanicals, or RAID disk rebuilding. You configure the safety valve activation threshold and timeout using these CLI commands: datastore safety-valve threshold and datastore safety-valve timeout. For details, see the Riverbed Command-Line Interface Reference Manual. To clear the alarm, restart the SteelHead. show raid diagram For drive rebuilds, if a drive is removed and then reinserted, the alarm continues to be triggered until the rebuild is complete. Rebuilding a disk drive can take 4-6 hours. This alarm applies only to the SteelHead RAID Series 3000, 5000, and 6000. |
Inbound QoS WAN Bandwidth Configuration | Degraded (Needs Attention) | Indicates that the inbound QoS WAN bandwidth for one or more of the interfaces is set incorrectly. You must configure the WAN bandwidth to be less than or equal to the interface bandwidth link rate. This alarm triggers when the system encounters one of these conditions: While this alarm appears, the SteelHead puts existing connections into the default class. The alarm clears when you configure the WAN bandwidth to be less than or equal to the bandwidth link rate or reconnect an interface configured with the correct WAN bandwidth. By default, this alarm is enabled. |
iSCSI Service | Needs Attention | Indicates that the iSCSI initiators are not accessible. Review the iSCSI configuration in Core. The iSCSI initiators might have been removed. |
Licensing | Needs Attention, Degraded, or Critical, depending on the state | Indicates whether your licenses are current. Discontinue use of the other SteelHead (virtual edition) appliance or contact Riverbed Support. Make sure that any previous SteelHead (virtual edition) appliances that were licensed with that token are no longer running. The alarm clears automatically the next time the SteelHead (virtual edition) appliance fetches the licenses from the Licensing Portal. Note: The licenses expiring and licenses expired alarms are triggered per feature. For example, if you install two license keys for a feature, LK1-FOO-xxx (expired) and LK1-FOO-yyy (not expired), the alarms do not trigger, because the feature has one valid license. If the Licenses Expiring alarm triggers, the system status changes to Needs Attention. The Licenses Expired alarm changes the system status to Degraded. Depending on the expiring license, other alarms might trigger simultaneously. For example, if the MSPEC or SH10BASE license expires, the Appliance Unlicensed alarm triggers and changes the health to Critical. |
Link Duplex | Degraded | Indicates that an interface was not configured for half-duplex negotiation but has negotiated half-duplex mode. Half-duplex significantly limits the optimization service results. The alarm displays which interface is triggering the duplex error. Choose Networking > Networking: Base Interfaces and examine the SteelHead link configuration. Next, examine the peer switch user interface to check its link configuration. If the configuration on one side is different from the other, traffic is sent at different rates on each side, causing many collisions. To troubleshoot, change both interfaces to automatic duplex negotiation. If the interfaces do not support automatic duplex, configure both ends for full duplex. You can enable or disable the alarm for a specific interface. To enable or disable an alarm, choose Administration > System Settings: Alarms and select or clear the check box next to the link alarm. |
Link I/O Errors | Degraded | Indicates that the error rate on an interface has exceeded 0.1 percent while either sending or receiving packets. This threshold is based on the observation that even a small link error rate reduces TCP throughput significantly. A properly configured LAN connection experiences few errors. The alarm clears when the error rate drops below 0.05 percent. You can change the default alarm thresholds by entering the alarm link_errors err-threshold xxxxx CLI command at the system prompt. For details, see the Riverbed Command-Line Interface Reference Manual. To troubleshoot, try a new cable and a different switch port. Another possible cause is electromagnetic noise nearby. You can enable or disable the alarm for a specific interface. For example, you can disable the alarm for a link after deciding to tolerate the errors. To enable or disable an alarm, choose Administration > System Settings: Alarms and select or clear the check box next to the link name. |
Link State | Degraded | Indicates that the system has lost one of its Ethernet links due to an unplugged cable or dead switch port. Check the physical connectivity between the SteelHead and its neighbor device. Investigate this alarm as soon as possible. Depending on what link is down, the system might no longer be optimizing, and a network outage could occur. You can enable or disable the alarm for a specific interface. To enable or disable the alarm, choose Administration > System Settings: Alarms and select or clear the check box next to the link name. |
LUN Status | Degraded | Indicates that a LUN is unavailable for any of these issues: Check whether the data center LUN was taken offline in SteelFusion Core while I/O operations were in progress. This alarm clears when you reactivate the LUN through the Management Console or the CLI. |
Memory Error | Degraded | Indicates that the system has detected a memory error. A system memory stick might be failing. First, try reseating the memory. If the problem persists, contact Riverbed Support for an RMA replacement as soon as practically possible. |
Memory Paging | Degraded | Indicates that the system has reached the memory paging threshold. If 100 pages are swapped approximately every two hours, the SteelHead is functioning properly. If thousands of pages are swapped every few minutes, reboot the SteelHead. For details, see Rebooting and Shutting Down the SteelHead. If rebooting does not solve the problem, contact Riverbed Support at https://support.riverbed.com. |
Neighbor Incompatibility | Degraded | Indicates that the system has encountered an error in reaching a SteelHead configured for connection forwarding. For details, see Configuring Connection Forwarding Features. |
Network Bypass | Critical | Indicates that the system is in bypass failover mode. If the SteelHead is in bypass failover mode, restart the optimization service. If restarting the service does not resolve the problem, reboot the SteelHead. If rebooting does not resolve the problem, shut down and restart the SteelHead. For details, see Rebooting and Shutting Down the SteelHead, and Starting and Stopping the Optimization Service. |
NFS V2/V4 Alarm | Degraded | Indicates that the system has detected either NFSv2 or NFSv4 is in use. The SteelHead supports only NFSv3 and passes through all other versions. For details, see Configuring NFS Optimization. |
Optimization Service | Critical | "optimization service is not running": This message appears after an optimization restart. For more information, review the SteelHead logs. "in-path optimization is not enabled": This message appears if an in-path setting is disabled for an in-path SteelHead. For more information, review the SteelHead logs. "optimization service is initializing": This message appears after a reboot. The alarm clears. For more information, review the SteelHead logs. "optimization service is not optimizing": This message appears after a system crash. For more information, review the SteelHead logs. "optimization service is disabled by user": This message appears after entering the CLI command no service enable or shutting down the optimization service from the Management Console. For more information, review the SteelHead logs. "optimization service is restarted by user": This message appears after the optimization service is restarted from either the CLI or Management Console. You might want to review the SteelHead logs for more information. |
Outbound QoS WAN Bandwidth Configuration | Degraded (Needs Attention) | Indicates that the outbound QoS WAN bandwidth for one or more of the interfaces is set incorrectly. You must configure the WAN bandwidth to be less than or equal to the interface bandwidth link rate. This alarm triggers when the system encounters one of these conditions: While this alarm appears, the SteelHead puts existing connections into the default class. The alarm clears when you configure the WAN bandwidth to be less than or equal to the bandwidth link rate or reconnect an interface configured with the correct WAN bandwidth. By default, this alarm is enabled. |
Path Selection Path Down | Degraded | Indicates that one of the predefined paths for a connection is unavailable because it has exceeded either the timeout value for path latency or the threshold for observed packet loss. When a path fails, the SteelHead directs traffic through another available path. When the original path comes back up, the SteelHead redirects the traffic back to it. |
Path Selection Path Probing Error | Degraded | Indicates that a path selection monitoring probe for a predefined path has received a probe response from an unexpected relay or interface. |
Process Dump Creation Error | Degraded | Indicates that the system has detected an error while trying to create a process dump. This alarm indicates an abnormal condition in which RiOS cannot collect the core file after three retries. Possible causes include the /var directory, which is used to hold system dumps, reaching capacity, among other conditions. When this alarm is raised, the directory is blacklisted. Contact Riverbed Support to correct the issue. |
Riverbed Host Tools Version | Degraded | Indicates that the Riverbed host tools package (RHSP) is incompatible with the Windows server version. RHSP provides snapshot capabilities by exposing the Edge through iSCSI to the Windows Server as a snapshot provider. RHSP is compatible with 64-bit editions of Microsoft Windows Server 2008 R2 or later and can be downloaded from the Riverbed Support site at https://support.riverbed.com. |
Secure Transport | | Indicates that a peer SteelHead has encountered a problem with the controller connection. The controller is a SteelHead that typically resides in the data center and manages the control channel and operations required for secure transport between SteelHead peers. The control channel between the SteelHeads uses SSL to secure the connection between the peer SteelHead and the SteelHead controller. |
Secure Vault | Degraded | Indicates a problem with the secure vault. |
Snapshot | Degraded | A snapshot failed to be committed to the Core, or a snapshot has failed to complete at the Edge because the blockstore is full, needs credentials, or there is a misconfiguration at the Core. Check the Core logs for details. Retry the Windows snapshot. |
Software Compatibility | Needs Attention or Degraded, depending on the state | Indicates that there is a mismatch between software versions in the Riverbed system. By default, this alarm is enabled. |
SSL | Needs Attention | Indicates that an error has been detected in your secure vault or SSL configuration. For details about checking your settings, see Verifying SSL and Secure Inner Channel Optimization. After adding an in-path rule, you must clear this alarm manually by entering this CLI command: stats alarm non_443_ssl_servers_detected_on_upgrade clear |
SteelFusion Core | Degraded | Indicates that the system has encountered any of the following issues with the SteelFusion Core: |
SteelFusion Edge Service | Needs Attention | Indicates that the Edge appliance connected to the Core is not servicing the Core. Check that the Edge appliance is running. |
Storage Profile Switch Failed | Either Critical or Needs Attention, depending on the state | On a SteelHead EX, indicates that an error has occurred while repartitioning the disk drives during a storage profile switch. The repartitioning was unsuccessful. A profile switch changes the disk space allocation on the drives to allow VE and VSP to use varying amounts of storage. It also clears the SteelFusion and VSP data stores, and repartitions the data stores to the appropriate sizes. You switch a storage profile by entering the disk-config layout CLI command at the system prompt or by choosing Administration > System Settings: Disk Management on an EX or EX+SteelFusion SteelHead and selecting a storage profile. A storage profile switch requires a reboot of the SteelHead. The alarm appears after the reboot. These reasons can cause a profile switch to fail: When you encounter this error, reboot the SteelHead and then switch the storage profile again. If the switch succeeds, the error clears. If it fails, RiOS reverts the SteelHead to the previous storage profile. For assistance, contact Riverbed Support: |
System Detail Report | Degraded | Indicates that the system has detected a problem with an optimization or system module. For details, see Viewing System Details Reports. |
Temperature | Critical or Warning | |
Web Proxy | Degraded | |
Uncommitted Edge Data | Degraded | Indicates that a large amount of data in the blockstore needs to be committed to SteelFusion Core. The difference between the contents of the blockstore and the SteelFusion Core-side LUN is significant. This alarm checks how much uncommitted data is in the Edge cache as a percentage of the total cache size. This alarm triggers when the appliance writes a large amount of data very quickly, but the WAN pipe is not large enough to get the data back to the SteelFusion Core fast enough to keep the uncommitted data percentage below 5 percent. As long as data is being committed, the cache will flush eventually. The threshold is 5 percent, which for a 4 TB (1260-4) system is 200 GB. To change the threshold, use this CLI command: [failover-peer] edge id <id> blockstore uncommitted [trigger-pct <percentage>] [repeat-pct <percentage>] [repeat-interval <minutes>] For example: Core3(config) # edge id Edge2 blockstore uncommitted trigger-pct 50 repeat-pct 25 repeat-interval 5 For details on the CLI command, see the SteelFusion Command-Line Interface Reference Manual. To check that data is being committed, go to Storage > Reports: Blockstore Metrics on the Edge. |
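
Two rows in the table above state numeric thresholds: Uncommitted Edge Data (5 percent of the cache, which is 200 GB on a 4 TB system) and Link I/O Errors (raised above 0.1 percent, cleared below 0.05 percent). The following Python sketch only illustrates that arithmetic and the raise/clear behavior. It is not RiOS code. The function and class names are hypothetical, decimal units (1 TB = 10^12 bytes) are assumed, and treating "exactly 5 percent" as triggering is also an assumption.

```python
def uncommitted_threshold_bytes(cache_bytes: int, trigger_pct: float = 5.0) -> float:
    """Bytes of uncommitted data that correspond to trigger_pct of the cache."""
    return cache_bytes * trigger_pct / 100.0


def uncommitted_alarm(uncommitted_bytes: int, cache_bytes: int,
                      trigger_pct: float = 5.0) -> bool:
    """True when uncommitted data is no longer below the trigger percentage.

    Assumption: reaching the threshold exactly counts as triggering.
    """
    return uncommitted_bytes / cache_bytes * 100.0 >= trigger_pct


class LinkErrorAlarm:
    """Hypothetical model of the Link I/O Errors raise/clear thresholds."""

    def __init__(self, raise_pct: float = 0.1, clear_pct: float = 0.05) -> None:
        self.raise_pct = raise_pct
        self.clear_pct = clear_pct
        self.active = False

    def update(self, error_pct: float) -> bool:
        """Feed one error-rate sample (in percent) and return the alarm state."""
        if not self.active and error_pct > self.raise_pct:
            self.active = True      # rate has exceeded the raise threshold
        elif self.active and error_pct < self.clear_pct:
            self.active = False     # rate has dropped below the clear threshold
        return self.active


if __name__ == "__main__":
    four_tb = 4 * 10**12
    print(uncommitted_threshold_bytes(four_tb) / 10**9, "GB")
    alarm = LinkErrorAlarm()
    print([alarm.update(rate) for rate in (0.02, 0.12, 0.08, 0.04)])
```

In this model, a sample of 0.08 percent keeps an already raised link alarm active, because it lies between the clear and raise thresholds.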
Alarm | Description |
ESXi Communication Failed | Indicates that RiOS cannot communicate with ESXi or the ESXi password is not synchronized with RiOS. Make sure that the ESXi RiOS Management IP address is correct or synchronize the passwords for ESXi and RiOS. |
ESXi Disk Creation Failed | Indicates that the ESXi disk creation has failed during the VSP setup. Contact Riverbed Support. |
ESXi Initial Config Failed | Indicates the ESXi initial configuration failed. Contact Riverbed Support. |
ESXi License | Indicates whether your ESXi license is current. |
ESXi Memory Overcommitted | Indicates that the total memory assigned to powered-on VMs is more than the total memory available to ESXi for the VMs. To view this number in the vSphere client, choose Allocation > Memory > Total Capacity. The amount of memory overcommitted is calculated as: Total memory assigned to powered-on VMs - ESXi memory total capacity. This alarm has configurable thresholds: |
ESXi Not Set up | Indicates that ESXi has not been set up on a freshly installed appliance. Complete the initial configuration wizard to enable VSP for the first time. The alarm clears after ESXi installation begins. |
ESXi Version Unsupported | Indicates that the appliance is running an unknown or unsupported ESXi version, resulting in no Riverbed support. VSP services are blocked. Reinstall an ESXi version that Riverbed supports. |
ESXi vSwitch MTU larger than 1500 | Indicates that a vSwitch with an uplink or a vmknic interface is configured with the maximum transmission unit (MTU) larger than 1500 bytes. Jumbo frames larger than 1500 bytes are not supported. |
Virtual CPU Utilization | Indicates average virtual CPU utilization of the individual cores has exceeded an acceptable threshold. The default threshold is 90 percent. If virtual CPU utilization spikes are frequent, the system might be undersized. Sustained virtual CPU load can be symptomatic of more serious issues. To gauge how long the system has been loaded and also monitor the amount of traffic currently going through the appliance, view the CPU Utilization with the display mode set to Individual Cores. An isolated spike in virtual CPU is normal, but Riverbed recommends reporting extended high CPU utilization to Riverbed Support. No other action is necessary; the alarm clears automatically. When you set the display mode on the CPU Utilization report to System Average, it shows the VSP CPU percentage in addition to the RiOS CPU utilization percentage. Some of the virtual CPU cores are shared by RiOS. This alarm might trigger due to CPU-intensive activities on your virtual machines. If this alarm triggers too often, you can increase the trigger thresholds or you can disable the Virtual CPU utilization alarm. |
VSP Service Not Running | Indicates that the virtualization service is not running. The email notification indicates whether the alarm was triggered because the VSP service was disabled, restarted, or crashed. This is a critical error that requires a VMware service restart. |
VSP Unsupported VM Count | Indicates that the number of virtual machines powered on exceeds five. |
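
The ESXi Memory Overcommitted and VSP Unsupported VM Count rows above each describe a simple check. As a minimal sketch, the formula and the five-VM limit can be written as follows. The function names and the sample values are hypothetical, memory is assumed to be measured in MB, and the alarm's configurable thresholds are not modeled.

```python
from typing import Iterable

# Limit stated in the VSP Unsupported VM Count row.
MAX_POWERED_ON_VMS = 5


def memory_overcommitted_mb(assigned_mb_per_vm: Iterable[int],
                            esxi_total_capacity_mb: int) -> int:
    """Total memory assigned to powered-on VMs minus ESXi memory total capacity.

    A positive result means memory is overcommitted by that many MB.
    """
    return sum(assigned_mb_per_vm) - esxi_total_capacity_mb


def vm_count_unsupported(powered_on_vms: int) -> bool:
    """True when the number of powered-on VMs exceeds five."""
    return powered_on_vms > MAX_POWERED_ON_VMS


if __name__ == "__main__":
    # Three hypothetical powered-on VMs on a host with 12288 MB total capacity.
    print(memory_overcommitted_mb([4096, 8192, 2048], 12288), "MB overcommitted")
    print(vm_count_unsupported(6))
```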