Disaster recovery scenarios
This section describes basic product appliance disaster scenarios, and includes general recovery recommendations.
Keep in mind the following definitions:
Failover is the process of switching to a redundant computer server, storage, and network upon the failure or abnormal termination of the production server, storage, hardware component, or network.
Failback is the process of restoring a system, component, or service previously in a state of failure back to its original, working state.
Production site is the site in which applications, systems, and storage are originally designed and configured. Also known as the primary site.
Disaster recovery site is the site that is set up in preparation for a disaster. Also known as the secondary site.
Product appliance failure—failover
This section describes considerations to take into account in the case of a failure or a disaster affecting an entire site. The exact process depends on the storage array and other environment specifics. You must create thorough documentation of the disaster recovery plan for recovery to be implemented successfully. We recommend that you test the plan regularly so that the information in it remains accurate and up to date.
Data center failover
In the event that an entire data center experiences a failure or a disaster, you can restore Core operations provided that you have met the following prerequisites:
The disaster recovery site has the storage array replicated from the production site.
The network infrastructure is configured on the disaster recovery site similarly to the production site, enabling the Edges to communicate with Core.
Core and SteelHeads (or their virtual editions) at the disaster recovery site are installed, licensed, and configured similarly to the production site.
Ideally, the Core at the disaster recovery site is configured identically to the Core on the production site. You can import the configuration file from Core at the production site to ensure that you have configured both Cores the same way.
Unless the disaster recovery site is designed to be an exact replica of the production site, minor differences are inevitable: for example, the IP addresses of the Core, the storage array, and so on. We recommend that you regularly replicate the Core configuration file to the disaster recovery site and import it into the disaster recovery instance. You can script the necessary adjustments to automate the configuration adoption process.
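The following minimal sketch illustrates one way such an adjustment script could look. It assumes the exported configuration is a plain-text file in which only a handful of site-specific IP addresses need to change; the file names, addresses, and file format shown are hypothetical and must be adapted to your environment.

    # Illustrative sketch only: rewrites production-site IP addresses in an
    # exported Core configuration file so that it can be imported at the
    # disaster recovery site. File names, format, and addresses are hypothetical.
    IP_MAP = {
        "10.1.0.10": "10.2.0.10",   # Core management IP (example values)
        "10.1.0.20": "10.2.0.20",   # storage array iSCSI portal (example values)
    }

    def adapt_config(src_path, dst_path):
        with open(src_path) as src:
            text = src.read()
        for prod_ip, dr_ip in IP_MAP.items():
            text = text.replace(prod_ip, dr_ip)
        with open(dst_path, "w") as dst:
            dst.write(text)

    if __name__ == "__main__":
        adapt_config("core-config-production.txt", "core-config-dr.txt")

You can run a script like this as part of the regular replication job so that an up-to-date, site-adjusted configuration file is always available at the disaster recovery site.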
Likewise, the configuration of SteelHeads in the disaster recovery site should reflect the latest changes to the configuration in the production site. All the relevant in-path rules must be maintained and kept up to date.
There are some limitations:
If the LUN IDs in the disaster recovery site differ from those in the production site, you need to reconfigure the Core and all the Edges and deploy them as new. You must know which LUNs belong to which Edge and map them accordingly. We recommend that you implement a naming convention (see the example sketch following these notes).
Even if the data from the production storage array is replicated in synchronous mode, you should assume that some data already committed to the Edge has not yet been sent to the production storage or replicated to the disaster recovery site. This means that a gap in data consistency can occur if, after the failover, the Edges immediately start writing to the disaster recovery Core. To prevent data corruption, you need to configure all the LUNs at the Edges as new. Configuring the Edges as new empties their blockstore, causing the loss of all writes that occurred after the disaster at the production site. To prevent this data loss, we recommend that you configure FusionSync.
If you want data consistency at the application level, we recommend that you roll back to one of the previous snapshots.
Keep in mind that immediately after the recovery, the blockstore on the Edges does not have any data in the cache.
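As an illustration of the naming convention and LUN-to-Edge mapping mentioned above, the following sketch keeps a simple inventory that records which LUNs belong to which Edge, so that they can be remapped correctly at the disaster recovery site. The Edge names and LUN identifiers are hypothetical examples; the point is only that the mapping is recorded somewhere it can be consulted during a failover.

    # Illustrative sketch only: records which LUNs belong to which Edge so
    # that they can be remapped correctly at the disaster recovery site.
    # Edge names and LUN identifiers are hypothetical.
    LUN_INVENTORY = {
        "edge-branch-01": ["branch01-data-lun", "branch01-apps-lun"],
        "edge-branch-02": ["branch02-data-lun"],
    }

    def edge_for_lun(lun_name):
        """Return the Edge that a given LUN must be mapped to."""
        for edge, luns in LUN_INVENTORY.items():
            if lun_name in luns:
                return edge
        raise KeyError("LUN %s is not recorded in the inventory" % lun_name)

    if __name__ == "__main__":
        print(edge_for_lun("branch01-apps-lun"))   # prints: edge-branch-01

Embedding the branch name in each LUN name, as in this example, also makes the mapping self-documenting if the inventory itself is ever lost.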
Branch office failover
When the Edge in a branch office becomes inaccessible from outside the branch due to a network outage, operation in the branch office can continue. The products are designed with disconnected-operations resiliency in mind. If your workflow enables branch office users to operate independently for a period of time (which is defined during the network planning stage and implemented with a correctly sized appliance), the branch office remains operational and synchronizes with the data center later.
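A rough sizing sketch of this planning step follows, using purely hypothetical numbers: the blockstore must be large enough to absorb the writes expected during the longest disconnected-operation window you intend to tolerate.

    # Illustrative sizing sketch only: estimates the blockstore capacity an
    # Edge needs to absorb writes during a planned disconnection window.
    # All figures are hypothetical examples.
    write_rate_gb_per_hour = 5      # average branch write rate
    disconnected_hours = 48         # longest tolerated disconnection
    safety_factor = 1.5             # headroom for bursts and metadata

    required_blockstore_gb = write_rate_gb_per_hour * disconnected_hours * safety_factor
    print("Estimated blockstore needed: %d GB" % required_blockstore_gb)   # 360 GB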
If the branch office is completely lost, or if it is imperative for the business to bring the branch office service online sooner, you can choose to deploy the Edge in another branch office or in the data center.
If you choose to deploy an Edge in the data center, we recommend that you remove the LUNs from the Core to prevent data corruption caused by multiple write access to the LUNs. We recommend that you roll back to the latest application-consistent snapshot. If mostly read access is required to the data projected to the branch office, a good alternative is to temporarily mount a snapshot to a local host. The snapshot makes the data accessible in the data center while the branch office operates in disconnected-operation mode. Avoiding the failover also simplifies the failback to the production site.
If you choose to deploy the Edge in another branch office, follow the steps in “Edge replacement.” Be aware that in this scenario, all the uncommitted writes at the branch are not preserved. We recommend that you roll back the LUNs to the latest application-consistent snapshot.
Product appliance failure—failback
After a disaster is over, or a failure is fixed, you might need to revert the changes and move the data and computing resources back to where they were located before the disaster, while ensuring that data integrity is not compromised. This process is called failback. Unlike the failover process, which can occur in a rush, the failback process can be thoroughly planned and tested.
Data center failback
Because the product relies on primary storage to keep the data intact, the Core failback can only follow a successful storage array replication from the disaster recovery site back to the production site. There are multiple ways to perform the recovery; however, we recommend the following method. The process most likely requires downtime, which you can schedule in advance. We also recommend that you create an application-consistent snapshot and a backup before performing the following procedure. Perform these steps on one appliance at a time:
1. Shut down hosts and unmount LUNs.
2. Export the configuration file from the Core at the disaster recovery site.
3. From the Core, initiate taking the Edge offline. This process forces the Edge to replicate all the committed writes to the Core.
4. Remove iSCSI Initiator access from the LUN at the Core. This ensures that data cannot be written to the LUN until it becomes available again after the failback completes.
5. Make sure that you replicate the LUN with the storage array from the disaster recovery site back to the production site.
6. On a storage array at the production site, make the replicated LUN the primary LUN.
Depending on the storage array, you might need to create a snapshot, create a clone, or promote the clone to a LUN, or perform all of these actions.
For more information, see the user guide for your storage array. The preferred method is one that preserves the LUN ID, which might not work for all arrays. If the LUN ID is going to change, you need to add the LUN as new, first on the Core and then on the Edge (see the sketch after this procedure).
7. If you had to make changes on the disaster recovery site due to a LUN ID change, import the Core configuration file from the disaster recovery site and make the necessary adjustments to IP addresses and so on.
8. Add access to the LUN for the Core. The Core at the production site begins servicing the LUN immediately, but only if the LUN ID remained the same.
9. At the branch office, check to see if you need to change the Core IP address.
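The following minimal sketch illustrates the LUN ID check referred to in step 6: it compares the LUN IDs recorded before the failover with the IDs of the LUNs promoted at the production site, and reports which LUNs can simply be re-exposed to the Core (step 8) and which must be added as new on the Core and then on the Edge. The LUN names and IDs shown are hypothetical.

    # Illustrative sketch only: decides, per LUN, whether the Core can resume
    # servicing it or whether it must be added as new. Names and IDs are
    # hypothetical examples.
    recorded_lun_ids = {"branch01-data-lun": "600a0b80aaaa", "branch01-apps-lun": "600a0b80bbbb"}
    promoted_lun_ids = {"branch01-data-lun": "600a0b80aaaa", "branch01-apps-lun": "600a0b80cccc"}

    for name, old_id in recorded_lun_ids.items():
        new_id = promoted_lun_ids.get(name)
        if new_id == old_id:
            print("%s: LUN ID preserved, the Core can resume servicing it" % name)
        else:
            print("%s: LUN ID changed (%s -> %s), add the LUN as new on the Core, then on the Edge"
                  % (name, old_id, new_id))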
Branch office failback
The branch office failback process is similar to the Edge replacement process. The procedure requires downtime that you can schedule in advance.
If the production LUNs were mapped to another Edge, use this procedure to perform the Edge failback process:
1. Shut down hosts and unmount LUNs.
2. Take the LUNs offline from the disaster recovery Edge. This process forces the Edge to replicate all the committed writes to Core.
3. If any changes were made to the LUN mapping configuration, you need to merge the changes during the failback process. For assistance with this process, contact Riverbed Support.
4. Shut down the Edge at the disaster recovery site.
5. Bring up the Edge at the production site.
6. Follow the steps described in “Edge replacement.”
Keep in mind that after the failback process is completed, the blockstore on the Edges does not have any data in the cache.
If you removed the production LUNs from the Core and used them locally in the data center, shut down the hosts, unmount the LUNs, and then continue the setup process as described in the Core installation and configuration guide.