Alarm | Appliance state | Reason |
Admission Control | Admission Control | • Connection Limit - Indicates that the system connection limit has been reached. Additional connections are passed through unoptimized. The alarm clears when the appliance moves out of this condition. • CPU - Indicates that the appliance has entered admission control due to high CPU use. During this event, the appliance continues to optimize existing connections, but passes through new connections without optimization. The alarm clears automatically when the CPU usage decreases. • MAPI - Indicates that the total number of MAPI optimized connections has exceeded the maximum admission control threshold. By default, the maximum admission control threshold is 85 percent of the total maximum optimized connection count for the client-side appliance. The appliance reserves the remaining 15 percent so the MAPI admission control does not affect the other protocols. The 85 percent threshold is applied only to MAPI connections. The alarm clears automatically when the MAPI traffic decreases; however, it can take one minute for the alarm to clear. The system preemptively closes MAPI sessions to reduce the connection count in an attempt to bring the appliance out of admission control. RiOS closes MAPI sessions in this order: • MAPI prepopulation connections • MAPI sessions with the largest number of connections • MAPI sessions with the most idle connections • Most recently optimized MAPI sessions or the oldest MAPI session • MAPI sessions exceeding the memory threshold • Memory - Indicates that the appliance has entered admission control due to memory consumption. The appliance is optimizing traffic beyond its rated capability and is unable to handle the amount of traffic passing through the WAN link. During this event, the appliance continues to optimize existing connections, but new connections are passed through without optimization. No other action is necessary as the alarm clears automatically when the traffic decreases. 
• TCP - Indicates that the appliance has entered admission control due to high TCP memory use. During this event, the appliance continues to optimize existing connections, but new connections are passed through without optimization. The alarm clears automatically when the TCP memory pressure decreases. |
Application Consistent Snapshot | Degraded | An application-consistent snapshot failed to be committed to the SAN, or a snapshot failed to complete. Application consistent snapshots are scheduled using the Core snapshot scheduler. A snapshot is application consistent if, in addition to being write-order consistent, it includes data from running applications that complete their operations and flush their buffers to disk. Application-consistent backups are recommended for database operating systems and applications such as SQL, Oracle, and Exchange. This error triggers when there are problems interacting with servers (ESXi or Windows). The first interaction is to prepare for a snapshot (where the server gets filesystems or a VM in a consistent state), and the second is to resume after the snapshot is taken (the server can clean up, stop logging changes, and so on). Errors can also occur due to misconfigurations on either side, local issues on the servers (high load, timeouts, reboots), networking problems, and so on. Check the Core logs for details. Retry the snapshot. |
Asymmetric Routing | Needs Attention | Indicates that the system is experiencing asymmetric traffic. Indicates OK if the system is not experiencing asymmetric traffic. In addition, any asymmetric traffic is passed through, and the route appears in the Asymmetric Routing table. For details about the Asymmetric Routing table, see Configuring asymmetric routing features. |
Blockstore | Degraded or Critical | Indicates that the system has encountered any of the following issues with the blockstore: • Disk space Low - The blockstore is running out of space. Check your WAN connection as well as connectivity to the Core. This can also happen if clients write more data than can be sent over the WAN for a prolonged period of time. • Disk space Critical - The blockstore on the Edge is falling and is critically low. Edge will deactivate LUNs with high uncommitted data to improve availability on other LUNs. • Disk space Full - The blockstore is out of space. Check your WAN connection as well as connectivity to the Core. This can also happen if clients write more data than can be sent over the WAN for a prolonged period of time or when the Core HA time does not match the Edge HA time. It is critical that the Core HA time is synchronized with the Edge HA time. We recommend using NTP time synchronization to synchronize the Core HA and the Edge HA. You must also verify that the time zone is correct. • Memory Low - The blockstore is running out of RAM. This indicates a temporary condition caused by too much IO. Limit the number of active prepopulation sessions. Check if IOPS is more than what is recommended for the appliance model. • Read Error - The blockstore could not read data that was already replicated to the data center. Clients will not see any error because the Edge will fetch the data from the Core. Check the system logs to determine the root cause. Replace any disks that have failed. The alarm clears when you restart the service. • Read Cache Error - An appliance with read cache SSDs cannot start the read cache. Replace any failed or missing SSDs, and restart the appliance. If no drives have failed or no drives are missing, check system logs for more detailed information, and contact Riverbed Support for assistance as needed. • Critical Read Error - The blockstore could not read data that is not yet replicated to the Core. 
This might include data loss. The Edge deactivates all LUNs. Check the system logs to determine the root cause. Replace any disks that have failed. The alarm clears when you restart the service. • Startup Failed - The blockstore failed to start due to disk errors or an incorrect configuration. Check the system logs to determine the root cause. • Startup Wrong Version - The Edge software version is incompatible with the blockstore version on disk. The alarm indicates that the software has been upgraded or downgraded with an incompatible version. Revert to the previous software version. |
Blockstore | Degraded or Critical | • Standby Wrong Version - The Edge software version running on the standby peer is incompatible with the version on the active peer appliance in a high-availability pair. The active peer has been upgraded or downgraded to an incompatible software version. This alarm typically triggers during the process of upgrading both Edges in a high-availability pair. When upgrading across versions with disk-format changes, this is a normal part of the upgrade process. To clear the alarm, update the software version running on the standby peer to match the software version running on the active peer. • Write Error - The blockstore could not save data to disk due to a media error. The Edge deactivates all LUNs. Check the system logs to determine the root cause. Replace any disks that have failed. The alarm clears when you restart the service. |
Connection Forwarding | Degraded | Indicates that the system has detected a problem with a connection-forwarding neighbor. The connection-forwarding alarms are inclusive of all connection-forwarding neighbors. For example, if an appliance has three neighbors, the alarm triggers if any one of the neighbors is in error. In the same way, the alarm clears only when all three neighbors are no longer in error. • Cluster Neighbor Incompatible - Indicates that a connection-forwarding neighbor is running a RiOS version that is incompatible with IPv6. Neighbors must be running RiOS 8.5 or later. The appliance neighbors pass through IPv6 connections when this alarm triggers. • Multiple Interface - Indicates that the connection to an appliance in a connection forwarding cluster is lost. • Single Interface - Indicates that the connection to an appliance connection-forwarding neighbor is lost. These issues trigger the single connection-forwarding alarm: • The connection-forwarding neighbor has not sent a keepalive message within the time-out period to the neighbor appliance(s), indicating that the connection has been lost. • The connection cannot be established with a connection-forwarding neighbor. • The connection has been closed by the connection-forwarding neighbor. • The connection has been lost with the connection-forwarding neighbor due to an error. • The connection has been lost because requests have not been acknowledged by a connection-forwarding neighbor within the set threshold. • The appliance has timed out while waiting for an initialization message from a connection-forwarding neighbor. • The amount of latency between connection-forwarding appliances has exceeded the specified threshold. |
CPU Utilization | Degraded | Indicates that the system has reached the CPU threshold for any of the CPUs in the appliance. If the system has reached the CPU threshold, check your settings. For details, see Configuring alarm settings. If your alarm thresholds are correct, reboot the appliance. For details, see Rebooting and shutting down. |
Data Store | Critical | • Corruption - Indicates that the RiOS data store is corrupt or has become incompatible with the current configuration. • Data Store Clean Required - Indicates that you must clear the RiOS data store. To clear the data store, choose Administration > Maintenance: Services and select the Clear Data Store check box before restarting the appliance. Clearing the data store degrades performance until the system repopulates the data. • Encryption Level Mismatch - Indicates a RiOS data store error such as an encryption, header, or format error. • Synchronization Error - Indicates that the RiOS data store synchronization between two appliances has been disrupted and the RiOS data stores are no longer synchronized. For details, see Synchronizing peer RiOS data stores. Resetting the Data Store alarm: If a data store alarm was caused by an unintended change to the configuration, you can change the configuration to match the previous RiOS data store settings, and then restart the service without clearing the data store to reset the alarm. Typical configuration changes that require a restart with a clear RiOS data store are enabling the extended peer table or changing the data store encryption type. For details, see Configuring peering and Encrypting the RiOS data store. To clear the RiOS data store of data, choose Administration > Maintenance: Services, select Clear Data Store and click Reboot to reboot the optimization service. For details, see Starting and stopping the optimization service. |
Disk Full | | Indicates that the system partitions (not the RiOS data store) are full or almost full. For example, RiOS monitors the available space on /var, which is used to hold logs, statistics, system dumps, TCP dumps, and so on. Examine the directory to see if it is storing an excessive number of snapshots, system dumps, or TCP dumps that you could delete. You could also delete any RiOS images that you no longer use. |
Domain Authentication Alert | Needs Attention | Indicates that the system is unable to communicate with the Core, has detected an SMB signing error, or delegation has failed. CIFS-signed and Encrypted-MAPI traffic is passed through without optimization. For details, see Configuring CIFS optimization. |
Domain Join Error | Degraded | Indicates an attempt to join a Windows domain has failed. For details, see Troubleshooting a domain join failure. |
Edge HA Service | Either Degraded or Critical | Indicates that only one of the appliances in a high-availability (HA) Edge pair is actively serving storage data (the active peer). As the system writes new data to the active peer, it is reflected to the standby peer, which stores a copy of the data in its local data store. The two appliances maintain a heartbeat protocol between them, so that if the active peer goes down, the standby peer can take over servicing the LUNs. If the standby peer goes down, the active peer continues servicing the LUNs after raising this alarm and sending an email that the appliance is degraded or critical. The email contains the IP address of the peer appliance. Degraded indicates that Edge HA is not functioning but the LUNs are being serviced. After a failed peer resumes, it resynchronizes with the other peer in the HA pair to receive any data that was written since the time of the failure. After the peer receives all the written data, the normal HA mode resumes and any future writes are reflected to both peers. Critical indicates that the LUNs are no longer available and are not being serviced. Contact Riverbed Support. |
Hardware | Either Degraded or Critical | These alarms report issues with the SteelFusion Edge RiOS node hardware. • A VSP upgrade requires additional memory or a memory replacement. • Disk Error - Indicates that one or more disks is offline. To see which disk is offline, enter this command from the system prompt: show raid diagram This alarm applies only to the SteelHead RAID Series 3000, 5000, and 6000. • Fan Error - Indicates that a fan is failing or has failed and must be replaced. • Flash Error - Indicates an error with the flash drive hardware. At times, the USB flash drive that holds the system images might become unresponsive; the appliance continues to function normally. When this error triggers you cannot perform a software upgrade, as the appliance is unable to write a new upgrade image to the flash drive without first power cycling the system. To reboot the appliance, go to the Administration > Maintenance: Reboot/Shutdown page or enter the reload command to automatically power cycle the appliance and restore the flash drive to its proper function. • IPMI - Indicates an Intelligent Platform Management Interface (IPMI) event. This alarm triggers when there has been a physical security intrusion. These events trigger this alarm: • Chassis intrusion (physical opening and closing of the appliance case) • Memory errors (correctable or uncorrectable ECC memory errors) • Hard drive faults or predictive failures • Power supply status or predictive failure By default, this alarm is enabled. • Management Disk Size Error - Indicates that the size of the management disk is too small for the virtual appliance model. This condition can occur when upgrading a virtual appliance to a model VCX 5055 or VCX 7055 without first expanding the management disk to a size that supports the higher end models. To clear the alarm, increase the size of the management disk. • Memory Error - Indicates a memory error (for example, when a system memory stick fails). 
• Other Hardware Error - Indicates one of the following issues: • The appliance does not have enough disk, memory, CPU cores, or NIC cards to support the current configuration. • The appliance is using a memory Dual In-line Memory Module (DIMM), a hard disk, or a NIC that is not qualified by Riverbed.
• DIMMs are plugged into the appliance but RiOS cannot recognize them because: – a DIMM is in the wrong slot. You must plug DIMMs into the black slots first and then use the blue slots when all of the black slots are in use. —or— – a DIMM is broken and you must replace it. • Other hardware issues exist. • Safety Valve: disk access exceeds response times - Indicates that the appliance is experiencing increased disk access time and has started the safety valve disk bypass mechanism that switches connections into SDR-A. SDR-A performs data reduction in memory until the disk access latency falls below the safety valve activation threshold. Disk access time can exceed the safety valve activation threshold for several reasons: the appliance might be undersized for the amount of traffic it is required to optimize, a larger than usual amount of traffic is being optimized temporarily, or a disk is experiencing hardware issues such as sector errors, failing mechanicals, or RAID disk rebuilding. You configure the safety valve activation threshold and timeout using the datastore safety-valve threshold and datastore safety-valve timeout commands. For details, see the Riverbed Command-Line Interface Reference Manual. To clear the alarm, restart the SteelHead. • Power Supply - Indicates an inserted power supply cord does not have power, as opposed to a power supply slot with no power supply cord inserted. • RAID - Indicates an error with the RAID array (for example, missing drives, pulled drives, drive failures, and drive rebuilds). An audible alarm might also sound. To see if a disk has failed, enter the show raid diagram command from the system prompt. For drive rebuilds, if a drive is removed and then reinserted, the alarm continues to be triggered until the rebuild is complete. Rebuilding a disk drive can take 4 to 6 hours. This alarm applies only to the SteelHead RAID Series 3000, 5000, and 6000. |
Hypervisor Hardware | Either Degraded or Critical | Indicates that a problem has occurred with the Edge Hypervisor node hardware. The hypervisor hardware affects virtualization on the appliance. These issues trigger the hypervisor hardware alarm: • Hardware Management Connection - Indicates that RiOS has lost IP connectivity or cannot authenticate the connection to the hypervisor motherboard controller. • Hardware Management Controller Unauthenticated User - Indicates that RiOS does not recognize the password used to access the hardware management controller. • Memory - Indicates that a memory error has occurred; for example, a system memory stick has failed. • Other Hardware - Indicates that a hardware error has been detected. These issues trigger the hardware error alarm: • The hypervisor hardware is using a memory Dual In-line Memory Module (DIMM), a hard disk, or a NIC that is not qualified. • The hypervisor hardware has detected a RiOS NIC. The hypervisor does not support RiOS NICs. • DIMMs are plugged into the hypervisor hardware but the hypervisor cannot recognize them because: – a DIMM is in the wrong slot. You must plug DIMMs into the black slots first and then use the blue slots when all of the black slots are in use. —or— – a DIMM is broken and you must replace it. • Power - Indicates that the hypervisor has lost power unexpectedly. • Temperature - Indicates that a hypervisor CPU, board, or platform controller hub (PCH) temperature has exceeded the rising threshold. When the CPU, board, or PCH returns to the reset threshold, the critical alarm clears (after polling for 30 seconds). If the appliance has more than one CPU, the alarm displays both CPUs. The default temperature thresholds are set by the motherboard. |
Inbound QoS WAN Bandwidth Configuration | Degraded (Needs Attention) | Indicates that the inbound QoS WAN bandwidth for one or more of the interfaces is set incorrectly. You must configure the WAN bandwidth to be less than or equal to the interface bandwidth link rate. This alarm triggers when the system encounters one of these conditions: • An interface is connected and the WAN bandwidth is set higher than its bandwidth link rate: for example, if the bandwidth link rate is 1536 kbps, and the WAN bandwidth is set to 2000 kbps. • A nonzero WAN bandwidth is set and QoS is enabled on an interface that is disconnected; that is, the bandwidth link rate is 0. • A previously disconnected interface is reconnected, and its previously configured WAN bandwidth was set higher than the bandwidth link rate. The Management Console refreshes the alarm message to inform you that the configured WAN bandwidth is set higher than the interface bandwidth link rate. While this alarm appears, the appliance puts existing connections into the default class. The alarm clears when you configure the WAN bandwidth to be less than or equal to the bandwidth link rate or reconnect an interface configured with the correct WAN bandwidth. |
Licensing | Needs Attention, Degraded, or Critical | Indicates whether your licenses are current. • Appliance Unlicensed - This alarm triggers if the appliance has no BASE or MSPEC license installed for its currently configured model. For details about updating licenses, see Managing licenses and model upgrades. • Autolicense Critical Event - This alarm triggers on a SteelHead (virtual edition) appliance when the Riverbed Licensing Portal cannot respond to a license request with valid licenses. The Licensing Portal cannot issue a valid license for one of these reasons: – A newer SteelHead (virtual edition) appliance is already using the token, so you cannot use it on the SteelHead (virtual edition) appliance displaying the critical alarm. Every time the SteelHead (virtual edition) appliance attempts to refetch a license token, the alarm retriggers. – The token has been redeemed too many times. Every time the SteelHead (virtual edition) appliance attempts to refetch a license token, the alarm retriggers. Discontinue use of the other SteelHead (virtual edition) appliance or contact Riverbed Support. • Autolicense Informational Event - This alarm triggers if the Riverbed Licensing Portal has information regarding the licenses for a SteelHead (virtual edition) appliance. For example, the SteelHead (virtual edition) appliance displays this alarm when the portal returns licenses that are associated with a token that has been used on a different SteelHead (virtual edition) appliance. Make sure that any previous SteelHead (virtual edition) appliances that were licensed with that token are no longer running. The alarm clears automatically the next time the SteelHead (virtual edition) appliance fetches the licenses from the Licensing Portal. • Licenses Expired - This alarm triggers if one or more features has at least one license installed, but all of them are expired. 
• Licenses Expiring - This alarm triggers if the license for one or more features is going to expire within two weeks. Note: The licenses expiring and licenses expired alarms are triggered per feature. For example, if you install two license keys for a feature, LK1-FOO-xxx (expired) and LK1-FOO-yyy (not expired), the alarms do not trigger, because the feature has one valid license. If the Licenses Expiring alarm triggers, the system status changes to Needs Attention. The Licenses Expired alarm changes the system status to Degraded. Depending on the expiring license, other alarms might trigger simultaneously. For example, if the MSPEC or SH10BASE license expires, the Appliance Unlicensed alarm triggers and changes the health to Critical. |
Link Duplex | Degraded | Indicates that an interface was not configured for half-duplex negotiation but has negotiated half-duplex mode. Half-duplex significantly limits the optimization service results. The alarm displays which interface is triggering the duplex error. Choose Networking > Networking: Base Interfaces and examine the SteelHead link configuration. Next, examine the peer switch user interface to check its link configuration. If the configuration on one side is different from the other, traffic is sent at different rates on each side, causing many collisions. To troubleshoot, change both interfaces to automatic duplex negotiation. If the interfaces do not support automatic duplex, configure both ends for full duplex. You can enable or disable the alarm for a specific interface. To enable or disable the alarm, choose Administration > System Settings: Alarms and select or clear the check box next to the link name. |
Link I/O Errors | Degraded | Indicates that the error rate on an interface has exceeded 0.1 percent while either sending or receiving packets. This threshold is based on the observation that even a small link error rate reduces TCP throughput significantly. A properly configured LAN connection experiences few errors. The alarm clears when the error rate drops below 0.05 percent. You can change the default alarm thresholds by entering the alarm link_io_errors err-threshold <threshold-value> command at the system prompt. For details, see the Riverbed Command-Line Interface Reference Manual. To troubleshoot, try a new cable and a different switch port. Another possible cause is electromagnetic noise nearby. You can enable or disable the alarm for a specific interface. For example, you can disable the alarm for a link after deciding to tolerate the errors. To enable or disable an alarm, choose Administration > System Settings: Alarms and select or clear the check box next to the link name. |
Link State | Degraded | Indicates that the system has lost one of its Ethernet links due to an unplugged cable or dead switch port. Check the physical connectivity between the appliance and its neighbor device. Investigate this alarm as soon as possible. Depending on what link is down, the system might no longer be optimizing, and a network outage could occur. You can enable or disable the alarm for a specific interface. To enable or disable the alarm, choose Administration > System Settings: Alarms and select or clear the check box next to the link name. |
Memory Paging | Degraded | Indicates that the system has reached the memory paging threshold. If about 100 pages are swapped every two hours or so, the appliance is functioning properly. If thousands of pages are swapped every few minutes, reboot the appliance. For details, see Rebooting and shutting down. If rebooting does not solve the problem, contact Riverbed Support at https://support.riverbed.com. |
Neighbor Incompatibility | Degraded | Indicates that the system has encountered an error in reaching an appliance configured for connection forwarding. For details, see Configuring connection forwarding features. |
Network Bypass | Critical | Indicates that the system is in bypass failover mode. If the appliance is in bypass failover mode, restart the optimization service. If restarting the service does not resolve the problem, reboot the appliance. If rebooting does not resolve the problem, shut down and restart the appliance. For details, see Rebooting and shutting down, and Starting and stopping the optimization service. |
NFS V2/V4 Alarm | Degraded | Indicates that the system has detected either NFSv2 or NFSv4 is in use. The appliance supports only NFSv3 and passes through all other versions. For details, see Configuring NFS optimization. |
Optimization Service | Critical | • Internal Error - The optimization service has encountered a condition that might degrade optimization performance. Go to the Administration > Maintenance: Services page and restart the optimization service. • Unexpected Halt - The optimization service has halted due to a serious software error. See if a system dump was created. If so, retrieve the system dump and contact Riverbed Support immediately. For details, see Viewing logs. • Service Status - The optimization service has encountered an optimization service condition. The message indicates the reason for the condition: – "optimization service is not running": This message appears after an optimization service restart. For more information, review the appliance logs. – "in-path optimization is not enabled": This message appears if an in-path setting is disabled for an in-path SteelHead. For more information, review the SteelHead logs. – "optimization service is initializing": This message appears after a reboot. The alarm clears. For more information, review the SteelHead logs. – "optimization service is not optimizing": This message appears after a system crash. For more information, review the SteelHead logs. – "optimization service is disabled by user": This message appears after entering the no service enable command or shutting down the optimization service from the Management Console. For more information, review the SteelHead logs. – "optimization service is restarted by user": This message appears after the optimization service is restarted from either the CLI or Management Console. You might want to review the SteelHead logs for more information. |
Outbound QoS WAN Bandwidth Configuration | Degraded (Needs Attention) | Indicates that the outbound QoS WAN bandwidth for one or more of the interfaces is set incorrectly. You must configure the WAN bandwidth to be less than or equal to the interface bandwidth link rate. This alarm triggers when the system encounters one of these conditions: • An interface is connected and the WAN bandwidth is set higher than its bandwidth link rate: for example, if the bandwidth link rate is 1536 kbps, and the WAN bandwidth is set to 2000 kbps. • A nonzero WAN bandwidth is set and QoS is enabled on an interface that is disconnected; that is, the bandwidth link rate is 0. • A previously disconnected interface is reconnected, and its previously configured WAN bandwidth was set higher than the bandwidth link rate. The Management Console refreshes the alarm message to inform you that the configured WAN bandwidth is set higher than the interface bandwidth link rate. While this alarm appears, the SteelHead puts existing connections into the default class. The alarm clears when you configure the WAN bandwidth to be less than or equal to the bandwidth link rate or reconnect an interface configured with the correct WAN bandwidth. |
Path Selection Path Down | Degraded | Indicates that one of the predefined paths for a connection is unavailable because it has exceeded either the timeout value for path latency or the threshold for observed packet loss. When a path fails, the SteelHead directs traffic through another available path. When the original path comes back up, the SteelHead redirects the traffic back to it. |
Path Selection Path Probing Error | Degraded | Indicates that a path selection monitoring probe for a predefined path has received a probe response from an unexpected relay or interface. |
Process Dump Creation Error | Degraded | Indicates that the system has detected an error while trying to create a process dump. This alarm indicates an abnormal condition in which RiOS cannot collect the core file after three retries. It can be caused when the /var directory, which is used to hold system dumps, is reaching capacity or other conditions. When this alarm is raised, the directory is blacklisted. Contact Riverbed Support to correct the issue. |
Riverbed Host Tools Version | Degraded | Indicates that the Riverbed host tools package (RHSP) is incompatible with the Windows server version. RHSP provides snapshot capabilities by exposing the Edge through iSCSI to the Windows Server as a snapshot provider. RHSP is compatible with 64-bit editions of Microsoft Windows Server 2008 R2 or later and can be downloaded from the Riverbed Support site at https://support.riverbed.com. |
Secure Transport | Critical | Indicates that a peer appliance has encountered a problem with the controller connection. The controller is an appliance that typically resides in the data center and manages the control channel and operations required for secure transport between peers. The control channel between the peers uses SSL to secure the connection. • Connection with Controller Lost - Indicates that the peer appliance is no longer connected to the controller for one of these reasons: • The connectivity between the peer appliance and the SteelHead controller is lost. • The SSL for the connection is not configured correctly. • Registration with Controller Unsuccessful - Indicates that the peer appliance is not registered with the SteelHead controller, and the controller does not recognize it as a member of the secure transport group. |
Secure Vault | Degraded | Indicates a problem with the secure vault. • Secure Vault Locked - Needs Attention - Indicates that the secure vault is locked. To optimize SSL connections or to use RiOS data store encryption, the secure vault must be unlocked. Go to Administration > Security: Secure Vault and unlock the secure vault. • Secure Vault New Password Recommended - Degraded - Indicates that the secure vault requires a new, nondefault password. Reenter the password. • Secure Vault Not Initialized - Critical - Indicates that an error has occurred while initializing the secure vault. When the vault is locked, SSL traffic is not optimized and you cannot encrypt the RiOS data store. For details, see Unlocking the secure vault. |
Server Backup | Degraded | Indicates that one of the following backup failures has occurred: • Failed connection to the server - The connection between the Edge and the ESXi or Windows server is down, the server is not running, or the credentials for the ESXi or vCenter server login are incorrect. To fix this issue, check that the server is reachable from the Edge and vice versa. Also ensure that the correct credentials are being used for the ESXi server. This alarm clears when the connection between the Edge and the server is restored. • Backup failure on the Edge - A backup has failed on the Edge. The alarm displays a message with the affected server. • LUN is shared among multiple Windows servers - At least one LUN is shared among two or more Windows servers. To fix this issue, make sure that the LUN has access to only one IP address. The alarm clears when servers no longer share LUN(s) and the next protect operation succeeds. • Server with a backup policy does not have a LUN - A server with an associated backup policy does not have any VMs or LUNs to protect. |
Snapshot | Degraded | Indicates that a snapshot failed to commit to the Core, or that a snapshot failed to complete at the Edge because the blockstore is full, credentials are required, or the Core is misconfigured. Check the Core logs for details. Retry the Windows snapshot. |
Software Compatibility | Needs Attention or Degraded | Indicates that there is a mismatch between software versions in the Riverbed system. • Peer Mismatch - Needs Attention - Indicates that the appliance has encountered another appliance that is running an incompatible version of system software. Refer to the CLI, Management Console, or the SNMP peer table to determine which appliance is causing the conflict. Connections with that peer are not optimized; connections with other peers running compatible RiOS versions are unaffected. To resolve the problem, upgrade your system software. No other action is required as the alarm clears automatically. • Software Version Mismatch - Degraded - Indicates that the appliance is running an incompatible version of system software. To resolve the problem, upgrade your system software. No other action is required as the alarm clears automatically. By default, this alarm is enabled. |
SSL | Needs Attention | Indicates that an error has been detected in your secure vault or SSL configuration. For details about checking your settings, see Verifying SSL and secure inner channel optimization. • Non-443 SSL Servers - Indicates that during a RiOS upgrade (for example, from 8.5 to 9.0), the system has detected a preexisting SSL server certificate configuration on a port other than the default SSL port 443. SSL traffic might not be optimized. To restore SSL optimization, you can add an in-path rule to the client-side appliance to intercept the connection and optimize the SSL traffic on the nondefault SSL server port. After adding an in-path rule, you must clear this alarm manually by entering this command: stats alarm non_443_ssl_servers_detected_on_upgrade clear • SSL Certificates Error - Indicates that an SSL peering certificate has failed to reenroll automatically within the Simple Certificate Enrollment Protocol (SCEP) polling interval. • SSL Certificates Expiring - Indicates that an SSL certificate is about to expire. • SSL Certificates SCEP - Indicates that an SSL certificate has failed to reenroll automatically within the SCEP polling interval. |
SteelFusion Core | Degraded | Indicates that the system has encountered any of the following issues with the SteelFusion Core: • Unknown Edge - The Edge appliance has connected to a Core that does not recognize the appliance. Most likely, the configuration on the Core is missing an entry for the Edge. Check that the Edge is supplying the proper Edge ID. To find the Edge ID, choose Storage > Storage Edge Configuration on the Edge appliance. The Edge identifier appears under SteelFusion Settings. • SteelFusion Core Connectivity - The Edge does not have an active connection with the Core. Check the network between the Edge and the Core; recheck the Edge configuration on the Core. • Inner Channel Down - The data channel between the Core and the Edge is down. The connection between the Core and the Edge has stalled. Check the network between the Edge and the Core. • Keep-Alive Timeout - The connection between the Core and the Edge has stalled. Check the network between the Edge and the Core. |
SteelFusion Edge Service | Needs Attention | Indicates that the Edge appliance connected to the Core is not servicing the Core. Check that the Edge appliance is running. |
SteelFusion Protocol Service | | Indicates that an iSCSI protocol error is preventing a LUN on the Edge from being discovered by the clients (for example, ESXi). |
Storage Profile Switch Failed | Either Critical or Needs Attention, depending on the state | Indicates that an error has occurred while repartitioning the disk drives during a storage profile switch. The repartitioning was unsuccessful. A profile switch changes the disk space allocation on the drives to allow VSP to use varying amounts of storage. It also clears the SteelFusion and VSP data stores, and repartitions the data stores to the appropriate sizes. You switch a storage profile by entering the disk-config layout CLI command at the system prompt or by choosing Administration > System Settings: Disk Management and selecting a storage profile. A storage profile switch requires a reboot of the Edge. The alarm appears after the reboot. These reasons can cause a profile switch to fail: • RiOS can’t validate the profile. • The profile contains an invalid upgrade or downgrade. • RiOS can’t clean up the existing VMDKs. During cleanup, RiOS uninstalls all slots and deletes all backups and packages. When you encounter this error, reboot the Edge and then switch the storage profile again. If the switch succeeds, the error clears. If it fails, RiOS reverts the Edge to the previous storage profile. • If RiOS successfully reverts the Edge to the previous storage profile, the alarm status displays Needs Attention. • If RiOS is unable to revert the Edge to the previous storage profile, the alarm status becomes Critical. For assistance, contact Riverbed Support. |
Storage Volume Status | Critical or Degraded | Indicates that the connection to the volume has failed or there is an issue with any of the following: • Backend connectivity • No read/write permissions • Space threshold has been reached • Resize failure • A LUN is deactivated. A LUN is deactivated if the blockstore is critically low on space and this particular LUN has a high rate of new writes. • Initialization of the blockstore for the LUN fails. • Connectivity issues between Edge and Core. If the status is Degraded, the export is available for writing on the Edge; however, there might be issues with writing the Edge’s data to the backend storage array. If the status is Critical, the export might not be available to the clients at the branch. By default, this alarm is enabled. |
System Detail Report | Degraded | Indicates that the system has detected a problem with an optimization or system module. For details, see Viewing system detail reports. |
Temperature | Critical or Warning | • Critical - Indicates that the CPU temperature has exceeded the critical threshold. The default value for the rising threshold temperature is 80°C; the default reset threshold temperature is 67°C. • Warning - Indicates that the CPU temperature is about to exceed the critical threshold. |
Uncommitted Edge Data | Degraded | Indicates that a large amount of data in the blockstore needs to be committed to SteelFusion Core. The difference between the contents of the blockstore and the SteelFusion Core-side LUN is significant. This alarm checks how much uncommitted data is in the Edge cache as a percentage of the total cache size. This alarm triggers when the appliance writes a large amount of data very quickly, but the WAN link is not large enough to send the data back to the SteelFusion Core fast enough to keep the uncommitted data percentage below 5 percent. As long as data is being committed, the cache eventually flushes. The threshold is 5 percent, which for a 4 TiB (1260-4) system is approximately 200 GiB. To change the threshold, use this command: [failover-peer] edge id <id> blockstore uncommitted [trigger-pct <percentage>] [repeat-pct <percentage>] [repeat-interval <minutes>] For example: Core3(config) # edge id Edge2 blockstore uncommitted trigger-pct 50 repeat-pct 25 repeat-interval 5 For details on the CLI command, see the SteelFusion Command-Line Interface Reference Manual. To check that data is being committed, go to Storage > Reports: Blockstore Metrics on the Edge. |
Virtualization | Degraded or Critical | Hypervisor - Indicates a problem with the hypervisor. • License - Indicates that the hypervisor license has expired. • Operation - The hypervisor is in lockdown mode. |
Virtualization | Degraded | Virtual Services Platform - Indicates a communication issue between VSP and the hypervisor. • Connection - The hypervisor is not communicating for any of these reasons: • VSP is disconnected from the hypervisor. • The hypervisor password is invalid. • VSP was unable to gather some hardware information. • VSP is disconnected. • Installation - VSP is not installed properly and is powered off for any of these reasons: • A hypervisor upgrade has failed. • A configuration push from the Hypervisor Installer has failed. • VSP could not gather enough information to set up an interface. • The hypervisor is not installed. |
Web Proxy | | • Configuration - Indicates that an error has occurred with the web proxy configuration. • Service Status - Indicates that an error has occurred with the web proxy service. |
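
The Uncommitted Edge Data row in the table above expresses its threshold as a percentage of the total Edge cache size. As a rough illustration of that arithmetic only, the following Python sketch converts a trigger percentage into an absolute amount of uncommitted data. The function names are hypothetical and are not part of RiOS or the SteelFusion CLI, and the sketch assumes that the alarm triggers when uncommitted data strictly exceeds the computed value; the exact comparison the appliance uses is not stated here.

```python
GIB_PER_TIB = 1024  # binary units: 1 TiB = 1024 GiB


def uncommitted_threshold_gib(cache_size_tib, trigger_pct=5.0):
    """Return the uncommitted-data level, in GiB, that corresponds to
    trigger_pct percent of a cache of cache_size_tib TiB.

    Hypothetical helper for illustration; not a RiOS API.
    """
    if cache_size_tib <= 0:
        raise ValueError("cache size must be positive")
    if not 0 <= trigger_pct <= 100:
        raise ValueError("trigger percentage must be between 0 and 100")
    return cache_size_tib * GIB_PER_TIB * trigger_pct / 100.0


def exceeds_trigger(uncommitted_gib, cache_size_tib, trigger_pct=5.0):
    """Return True if the uncommitted data is above the trigger level.

    Assumption: a strict greater-than comparison.
    """
    return uncommitted_gib > uncommitted_threshold_gib(cache_size_tib, trigger_pct)


if __name__ == "__main__":
    # Default 5 percent trigger on a 4 TiB cache.
    print(uncommitted_threshold_gib(4))
    # The same cache with trigger-pct raised to 50, as in the CLI example.
    print(uncommitted_threshold_gib(4, trigger_pct=50))
```

With a 4 TiB cache, the default 5 percent trigger works out to 204.8 GiB, which is consistent with the figure of roughly 200 GiB given in the table. Raising trigger-pct to 50 moves the trigger level to 2048 GiB.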