Disaster recovery scenarios
This section describes basic product appliance disaster scenarios, and includes general recovery recommendations.
Keep in mind the following definitions:
Failover is the process of switching to a redundant computer server, storage, and network upon the failure or abnormal termination of the production server, storage, hardware component, or network.
Failback is the process of restoring a system, component, or service previously in a state of failure back to its original, working state.
Production site is the site in which applications, systems, and storage are originally designed and configured. Also known as the primary site.
Disaster recovery site is the site that is set up in preparation for a disaster. Also known as the secondary site.
Product appliance failure—failover
This section describes considerations to take into account in the case of a failure or a disaster affecting an entire site. The exact process depends on the storage array and other environment specifics. You must create thorough documentation of the disaster recovery plan for recovery to be implemented successfully. We recommend that you test the plan regularly so that the information in it remains accurate and up to date.
Data center failover
In the event that an entire data center experiences a failure or a disaster, you can restore Core operations provided that you have met the following prerequisites:
The disaster recovery site has the storage array replicated from the production site.
The network infrastructure is configured on the disaster recovery site similarly to the production site, enabling the Edges to communicate with Core.
Core and SteelHeads (or their virtual editions) at the disaster recovery site are installed, licensed, and configured similarly to the production site.
Ideally, the Core at the disaster recovery site is configured identically to the Core on the production site. You can import the configuration file from Core at the production site to ensure that you have configured both Cores the same way.
Unless the disaster recovery site is designed to be an exact replica of the production site, minor differences are inevitable: for example, the IP addresses of the Core, the storage array, and so on. We recommend that you regularly replicate the Core configuration file to the disaster recovery site and import it into the disaster recovery instance. You can script the necessary adjustments to automate the configuration adoption process.
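The following minimal sketch illustrates one way such an adjustment script could look. It assumes the exported configuration is a plain-text file in which only a handful of site-specific IP addresses need to change; the file names, addresses, and file format shown are hypothetical and must be adapted to your environment.

    # Illustrative sketch only: rewrites production-site IP addresses in an
    # exported Core configuration file so that it can be imported at the
    # disaster recovery site. File names, format, and addresses are hypothetical.
    IP_MAP = {
        "10.1.0.10": "10.2.0.10",   # Core management IP (example values)
        "10.1.0.20": "10.2.0.20",   # storage array iSCSI portal (example values)
    }

    def adapt_config(src_path, dst_path):
        with open(src_path) as src:
            text = src.read()
        for prod_ip, dr_ip in IP_MAP.items():
            text = text.replace(prod_ip, dr_ip)
        with open(dst_path, "w") as dst:
            dst.write(text)

    if __name__ == "__main__":
        adapt_config("core-config-production.txt", "core-config-dr.txt")

You can run a script like this as part of the regular replication job so that an up-to-date, site-adjusted configuration file is always available at the disaster recovery site.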
Likewise, the configuration of SteelHeads in the disaster recovery site should reflect the latest changes to the configuration in the production site. All the relevant in-path rules must be maintained and kept up to date.
There are some limitations:
If the LUN IDs in the disaster recovery site differ from those in the production site, you need to reconfigure the Core and all the Edges and deploy them as new. You must know which LUNs belong to which Edge and map them accordingly. We recommend that you implement a naming convention (see the example sketch following these notes).
Even if the data from the production storage array is replicated in synchronous mode, you should assume that some data already committed to the Edge has not yet been sent to the production storage or replicated to the disaster recovery site. This means that a gap in data consistency can occur if, after the failover, the Edges immediately start writing to the disaster recovery Core. To prevent data corruption, you need to configure all the LUNs at the Edges as new. Configuring the Edges as new empties their blockstore, causing the loss of all writes that occurred after the disaster at the production site. To prevent this data loss, we recommend that you configure FusionSync.
If you want data consistency at the application level, we recommend that you roll back to one of the previous snapshots.
Keep in mind that immediately after the recovery, the blockstore on the Edges does not have any data in the cache.
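As an illustration of the naming convention and LUN-to-Edge mapping mentioned above, the following sketch keeps a simple inventory that records which LUNs belong to which Edge, so that they can be remapped correctly at the disaster recovery site. The Edge names and LUN identifiers are hypothetical examples; the point is only that the mapping is recorded somewhere it can be consulted during a failover.

    # Illustrative sketch only: records which LUNs belong to which Edge so
    # that they can be remapped correctly at the disaster recovery site.
    # Edge names and LUN identifiers are hypothetical.
    LUN_INVENTORY = {
        "edge-branch-01": ["branch01-data-lun", "branch01-apps-lun"],
        "edge-branch-02": ["branch02-data-lun"],
    }

    def edge_for_lun(lun_name):
        """Return the Edge that a given LUN must be mapped to."""
        for edge, luns in LUN_INVENTORY.items():
            if lun_name in luns:
                return edge
        raise KeyError("LUN %s is not recorded in the inventory" % lun_name)

    if __name__ == "__main__":
        print(edge_for_lun("branch01-apps-lun"))   # prints: edge-branch-01

Embedding the branch name in each LUN name, as in this example, also makes the mapping self-documenting if the inventory itself is ever lost.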
Branch office failover
When the Edge in a branch office becomes inaccessible from outside the branch due to a network outage, operation in the branch office can continue. The products are designed with disconnected-operations resiliency in mind. If your workflow enables branch office users to operate independently for a period of time (which is defined during the network planning stage and implemented with a correctly sized appliance), the branch office remains operational and synchronizes with the data center later.
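A rough sizing sketch of this planning step follows, using purely hypothetical numbers: the blockstore must be large enough to absorb the writes expected during the longest disconnected-operation window you intend to tolerate.

    # Illustrative sizing sketch only: estimates the blockstore capacity an
    # Edge needs to absorb writes during a planned disconnection window.
    # All figures are hypothetical examples.
    write_rate_gb_per_hour = 5      # average branch write rate
    disconnected_hours = 48         # longest tolerated disconnection
    safety_factor = 1.5             # headroom for bursts and metadata

    required_blockstore_gb = write_rate_gb_per_hour * disconnected_hours * safety_factor
    print("Estimated blockstore needed: %d GB" % required_blockstore_gb)   # 360 GB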
If the branch office is completely lost, or if it is imperative for the business to bring the branch office service online sooner, you can choose to deploy the Edge in another branch office or in the data center.
If you choose to deploy an Edge in the data center, we recommend that you remove the LUNs from the Core to prevent data corruption caused by multiple write access to the LUNs. We recommend that you roll back to the latest application-consistent snapshot. If mostly read access is required to the data projected to the branch office, a good alternative is to temporarily mount a snapshot to a local host. The snapshot makes the data accessible in the data center while the branch office operates in disconnected-operation mode. Avoiding the failover also simplifies the failback to the production site.
If you choose to deploy the Edge in another branch office, follow the steps in “Edge replacement.” Be aware that in this scenario, all the uncommitted writes at the branch are not preserved. We recommend that you roll back the LUNs to the latest application-consistent snapshot.
Product appliance failure—failback
After a disaster is over, or a failure is fixed, you might need to revert the changes and move the data and computing resources back to where they were located before the disaster, while ensuring that data integrity is not compromised. This process is called failback. Unlike the failover process, which can occur in a rush, the failback process can be thoroughly planned and tested.
Data center failback
Because the product relies on primary storage to keep the data intact, the Core failback can only follow a successful storage array replication from the disaster recovery site back to the production site. There are multiple ways to perform the recovery; however, we recommend the following method. The process most likely requires downtime, which you can schedule in advance. We also recommend that you create an application-consistent snapshot and a backup before performing the following procedure. Perform these steps on one appliance at a time:
1. Shut down hosts and unmount LUNs.
2. Export the configuration file from the Core at the disaster recovery site.
3. From the Core, initiate taking the Edge offline. This process forces the Edge to replicate all the committed writes to the Core.
4. Remove iSCSI Initiator access from the LUN at the Core. This ensures that data cannot be written to the LUN until it becomes available again after the failback completes.
5. Make sure that you replicate the LUN with the storage array from the disaster recovery site back to the production site.
6. On a storage array at the production site, make the replicated LUN the primary LUN.
Depending on the storage array, you might need to create a snapshot, create a clone, or promote the clone to a LUN, or perform all of these actions.
For more information, see the user guide for your storage array. The preferred method is one that preserves the LUN ID, which might not work for all arrays. If the LUN ID is going to change, you need to add the LUN as new, first on the Core and then on the Edge (see the sketch after this procedure).
7. If you had to make changes on the disaster recovery site due to a LUN ID change, import the Core configuration file from the disaster recovery site and make the necessary adjustments to IP addresses and so on.
8. Add access to the LUN for the Core. The Core at the production site begins servicing the LUN immediately, but only if the LUN ID remained the same.
9. At the branch office, check to see if you need to change the Core IP address.
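The following minimal sketch illustrates the LUN ID check referred to in step 6: it compares the LUN IDs recorded before the failover with the IDs of the LUNs promoted at the production site, and reports which LUNs can simply be re-exposed to the Core (step 8) and which must be added as new on the Core and then on the Edge. The LUN names and IDs shown are hypothetical.

    # Illustrative sketch only: decides, per LUN, whether the Core can resume
    # servicing it or whether it must be added as new. Names and IDs are
    # hypothetical examples.
    recorded_lun_ids = {"branch01-data-lun": "600a0b80aaaa", "branch01-apps-lun": "600a0b80bbbb"}
    promoted_lun_ids = {"branch01-data-lun": "600a0b80aaaa", "branch01-apps-lun": "600a0b80cccc"}

    for name, old_id in recorded_lun_ids.items():
        new_id = promoted_lun_ids.get(name)
        if new_id == old_id:
            print("%s: LUN ID preserved, the Core can resume servicing it" % name)
        else:
            print("%s: LUN ID changed (%s -> %s), add the LUN as new on the Core, then on the Edge"
                  % (name, old_id, new_id))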
Branch office failback
The branch office failback process is similar to the Edge replacement process. The procedure requires downtime that you can schedule in advance.
If the production LUNs were mapped to another Edge, use this procedure to perform the Edge failback process:
1. Shut down hosts and unmount LUNs.
2. Take the LUNs offline from the disaster recovery Edge. This process forces the Edge to replicate all the committed writes to Core.
3. If any changes were made to the LUN mapping configuration, you need to merge the changes during the failback process. For assistance with this process, contact Riverbed Support.
4. Shut down the Edge at the disaster recovery site.
5. Bring up the Edge at the production site.
6. Follow the steps described in “Edge replacement.”
Keep in mind that after the failback process is completed, the blockstore on the Edges does not have any data in the cache.
If you removed the production LUNs from the Core and used them locally in the data center, shut down the hosts, unmount the LUNs, and then continue the setup process as described in the Core installation and configuration guide.