Data Resilience and Security

This chapter describes security and data resilience deployment procedures and design considerations. It contains the following sections:

• “Recovering a single Core” on page 89

• “Edge replacement” on page 91

• “Disaster recovery scenarios” on page 92

• “Best practice for export snapshot rollback” on page 95

• “At-rest and in-flight data security” on page 96

• “Clearing the blockstore contents” on page 98

• “Edge network communication” on page 99

• “Additional security best practices” on page 99

• “Related information” on page 99

Recovering a single Core

If you deploy only a single Core, read this section to learn how to minimize downtime and data loss when recovering from a Core failure. This section includes the following topics:

• “Recovering a single physical Core” on page 90

• “Recovering a single Core-v” on page 90

Caution: We strongly recommend that you deploy Core as an HA pair so that in the event of a failure, you can seamlessly continue operations. Both physical and virtual Core HA deployments provide a fully automated failover without end-user impact.

Recovering a single physical Core

The Core internal configuration file is crucial to rebuilding your environment in the event of a failure. The possible configuration file recovery scenarios are as follows:

• Up-to-date Core configuration file is available on an external server - When you replace the failed Core with a new Core, you can import the latest configuration file to resume operations. The Edges reconnect to the Core and start replicating the new writes that were created after the Core failed.

In this scenario, you do not need to perform any additional configuration and there is no data loss on the Core and the Edge.

We recommend that you frequently back up the Core configuration file. For details about the backup and restore procedures for device configurations, see the SteelCentral Controller for SteelHead User Guide.

Use the following CLI commands to export the configuration file:

enable

configure terminal

configuration bulk export scp://username:password@server/path/to/config

Use the following CLI commands to replace the configuration file:

enable

configure terminal

no service enable

configuration bulk import scp://username:password@server/path/to/config

service enable

• Core configuration file is available but it is not up to date - If you do not regularly back up the Core configuration file, you might be missing the latest information. When you import the configuration file, you retain everything up to the last export. Data written to Edges and exports after the Core failure is lost. You must manually add the components of the environment that were added after the configuration file was exported.

• No Core configuration file is available - This is the worst-case scenario. In this case you need to build a new Core and reconfigure all Edges as if they were new. All data in the Edges is invalidated, and new writes to Edge exports after Core failure are lost. There is no data loss at the Core. If there were applications running at the Edge that cannot handle the loss of most recent data, they need to be recovered from an application-consistent snapshot and backup from the data center.

For more instructions on how to export and import the configuration file, see “Core configuration export” on page 111 and “Core in HA configuration replacement” on page 111. For general information about the configuration file, see the SteelFusion Core User Guide.
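The export and import command sequences in this section lend themselves to scripting. The following is a minimal sketch, not a Riverbed tool: it only assembles the command lines shown above so that another program can send them to the appliance. The function names and the example URL are assumptions made for illustration.

```python
from typing import List


def export_commands(scp_url: str) -> List[str]:
    """Return the CLI sequence that exports the Core configuration file."""
    return [
        "enable",
        "configure terminal",
        f"configuration bulk export {scp_url}",
    ]


def import_commands(scp_url: str) -> List[str]:
    """Return the CLI sequence that replaces the Core configuration file.

    The service is disabled before the import and enabled again afterward,
    in the same order as the commands listed in this section.
    """
    return [
        "enable",
        "configure terminal",
        "no service enable",
        f"configuration bulk import {scp_url}",
        "service enable",
    ]


if __name__ == "__main__":
    # Placeholder URL in the form used in this section.
    url = "scp://username:password@server/path/to/config"
    for command in import_commands(url):
        print(command)
```

Keeping the sequences in one place makes it harder to forget the `no service enable` and `service enable` steps around an import.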

Recovering a single Core-v

The following recommendations help you recover from potential failures and disasters and minimize data loss in a Core-v and its Edges.

• Continually back up the Core-v configuration file to an external shared storage - See the scenarios described in “Recovering a single physical Core” on page 90.

• Restore the Core-v from a VM snapshot - We strongly recommend that you do not use this procedure.

The primary reason not to use this procedure is that the configuration file in the Core-v from the snapshot might not be current. If you made any configuration changes since the last VM snapshot, you can lose data if an incorrect Core configuration is suddenly applied to the existing Edge deployment. Using this procedure can also mean that LUN snapshots triggered by the Edge are lost. SteelFusion 4.2 performs a configuration check when a Core-v from an old snapshot is booted up: when the Core-v tries to reconnect to an Edge that has a more recent configuration, an alarm is raised and the Core-Edge connection fails. Prior to SteelFusion 4.2, this check was not available.
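The behavior of the configuration check described above can be illustrated with a small model. This is only a sketch of the idea: this guide does not describe how the product compares configurations, so the revision counter, the class, and the exception name below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ApplianceConfig:
    name: str
    revision: int  # hypothetical counter that grows with every configuration change


class StaleConfigError(Exception):
    """Models the alarm raised when a Core-v is older than the Edge it contacts."""


def check_reconnect(core: ApplianceConfig, edge: ApplianceConfig) -> None:
    """Refuse the connection if the Edge holds a more recent configuration."""
    if edge.revision > core.revision:
        raise StaleConfigError(
            f"{core.name} (revision {core.revision}) is older than "
            f"{edge.name} (revision {edge.revision}); connection refused"
        )


if __name__ == "__main__":
    restored_core = ApplianceConfig("core-v-from-snapshot", revision=41)
    edge = ApplianceConfig("branch-edge", revision=42)
    try:
        check_reconnect(restored_core, edge)
    except StaleConfigError as alarm:
        print(f"ALARM: {alarm}")
```

Failing the connection, rather than letting the older configuration win, is what protects the existing Edge deployment in this model.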

Note: Core-v is not compatible with VMware Fault Tolerance (FT).

Edge replacement

In the event of catastrophic failure, you might need to replace the Edge appliance and remap the exports. It is usually impossible to properly shut down an Edge export and bring it offline, because the Edge tries to commit all its pending writes for the export to the Core first. If the Edge has failed, and you cannot successfully bring the export offline, you need to manually remove the export.

The blockstore is a part of Edge, and if you replace the Edge, the cached data on the failed blockstore is discarded. To protect the Edge against a single point of failure, consider an HA deployment of Edge. For more information, see “Edge high availability” on page 68.

Use the following procedure for an Edge disaster recovery scenario in which there is an unexpected Edge or remote site failure. This procedure does not include Edge HA.

Note: We recommend that you contact Riverbed Support before performing the following procedure.

To replace the Edge

1. Schedule time that is convenient to be offline (if possible).

2. On the Core, force unmap exports from the failed Edge.

3. In the Core Management Console, remove the failed Edge.

4. Add the replacement Edge.

You can use the same Edge Identifier.

5. Map exports back to the Edge.

You can lose data on the exports when writes to the Edge are not committed to the Core. In the case of minimal data loss, you might be able to recover the exports from a crash-consistent state, for example with a filesystem check. However, this ease of recovery depends on the type of applications that were using the exports. If you have concerns about data consistency, we recommend that you roll back the export to the latest application-consistent snapshot. For details, see “Best practice for export snapshot rollback” on page 95.

Disaster recovery scenarios

This section describes basic SteelFusion appliance disaster scenarios, and includes general recovery recommendations. It includes the following topics:

• “SteelFusion appliance failure—failover” on page 92

• “SteelFusion appliance failure—failback” on page 94

Keep in mind the following definitions:

• Failover - to switch to a redundant computer server, storage, and network upon the failure or abnormal termination of the production server, storage, hardware component, or network.

• Failback - the process of restoring a system, component, or service previously in a state of failure back to its original, working state.

• Production site - the site in which applications, systems, and storage are originally designed and configured. Also known as the primary site.

• Disaster recovery site - the site that is set up in preparation for a disaster. Also known as the secondary site.

SteelFusion appliance failure—failover

In the case of a failure or a disaster affecting the entire site, we recommend you take the following considerations into account. The exact process depends on the storage array and other environment specifics. You must create thorough documentation of the disaster recovery plan for successful recovery implementation. We recommend that you perform regular testing so that the information in the plan is maintained and up to date.

This section includes the following topics:

• “Data center failover” on page 92

• “Branch office failover” on page 93

Data center failover

In the event that an entire data center experiences failure or a disaster, you can restore the Core operations assuming you have met the following prerequisites:

• The disaster recovery site has the storage array replicated from the production site.

• The network infrastructure is configured on the disaster recovery site similarly to the production site, enabling the Edges to communicate with Core.

• Core and SteelHeads (or their virtual editions) at the disaster recovery site are installed, licensed, and configured similarly to the production site.

• Ideally, the Core at the disaster recovery site is configured identically to the Core on the production site. You can import the configuration file from Core at the production site to ensure that you have configured both Cores the same way.

Unless the disaster recovery site is designed to be an exact replica of the production site, minor differences are inevitable: for example, the IP addresses of the Core, the storage array, and so on. We recommend that you regularly replicate the Core configuration file to the disaster recovery site and import it into the disaster recovery instance. You can script the necessary adjustments to the configuration to automate the configuration adoption process.

Likewise, the configuration of SteelHeads in the disaster recovery site should reflect the latest changes to the configuration in the production site. All the relevant in-path rules must be maintained and kept up to date.
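The scripted adjustment of the Core configuration mentioned above could look like the following sketch. It assumes that the exported configuration is plain text in which site-specific IPv4 addresses appear literally; the sample lines and addresses are hypothetical and do not reflect the real export format.

```python
import re
from typing import Dict


def adapt_config(text: str, replacements: Dict[str, str]) -> str:
    """Replace production-site addresses with disaster recovery addresses."""
    if not replacements:
        return text
    # Match whole addresses only, so 10.1.1.1 is not rewritten inside 10.1.1.10.
    alternatives = "|".join(re.escape(ip) for ip in replacements)
    pattern = re.compile(r"(?<![\d.])(" + alternatives + r")(?![\d.])")
    return pattern.sub(lambda match: replacements[match.group(1)], text)


if __name__ == "__main__":
    # Hypothetical configuration lines, for illustration only.
    production = "core address 10.1.1.1\nstorage array address 10.1.1.10\n"
    mapping = {"10.1.1.1": "10.2.1.1", "10.1.1.10": "10.2.1.10"}
    print(adapt_config(production, mapping))
```

Review the adapted file before you import it into the disaster recovery instance, because a text substitution cannot know about settings that differ in ways other than the address.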

There are some limitations:

• Even if the data from the production storage array is replicated in synchronous mode, you must assume that some data already committed to the Edge has not yet been sent to the production storage or has not yet been replicated to the disaster recovery site. As a result, a gap in data consistency can occur if, after the failover, the Edges immediately start writing to the disaster recovery Core. To prevent data corruption, you need to configure all the exports at the Edges as new. Configuring the Edges as new empties their blockstore, which causes the loss of all writes that occurred after the disaster at the production site.

• If you want data consistency on the application level, we recommend that you perform a rollback to one of the previous snapshots.

• Keep in mind that initially after the recovery, the blockstore on Edges does not have any data in the cache.
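The consistency gap described in the first limitation can be pictured as two differences between three sets of writes. The sketch below is a model only: the numeric write identifiers are hypothetical, and the product does not necessarily expose writes in this form.

```python
from typing import Dict, List, Set


def consistency_gap(
    edge_writes: Set[int],
    production_writes: Set[int],
    dr_writes: Set[int],
) -> Dict[str, List[int]]:
    """Classify writes that the disaster recovery site has not received."""
    return {
        # Committed at the Edge but never sent to the production storage.
        "edge_only": sorted(edge_writes - production_writes),
        # Stored on the production array but not yet replicated.
        "not_replicated": sorted(production_writes - dr_writes),
    }


if __name__ == "__main__":
    gap = consistency_gap(
        edge_writes={1, 2, 3, 4, 5, 6},
        production_writes={1, 2, 3, 4},
        dr_writes={1, 2, 3},
    )
    print(gap)
```

If either list is not empty at the moment of failover, the Edge and the disaster recovery storage disagree about the state of the export, which is why the exports must be configured as new.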

Branch office failover

When the Edge in a branch becomes inaccessible from outside the branch office due to a network outage, operations in the branch office can continue. SteelFusion products are designed with disconnected operations resiliency in mind. If your workflow enables branch office users to operate independently for a period of time (which is defined during the network planning stage and implemented with a correctly sized appliance), the branch office remains operational and synchronizes with the data center later.
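As a rough planning aid, the period of independent operation can be estimated from how much space is available for writes that have not yet reached the data center and how fast the branch produces them. The following back-of-the-envelope sketch is not the Riverbed sizing method; both inputs and the function name are assumptions that you need to replace with figures from your own planning.

```python
def disconnected_hours(pending_write_capacity_gib: float,
                       write_rate_gib_per_hour: float) -> float:
    """Estimate how long a branch can keep writing while the WAN is down.

    pending_write_capacity_gib: space assumed to be available for writes
        that are not yet committed to the data center (hypothetical input).
    write_rate_gib_per_hour: average rate of new writes in the branch
        (hypothetical input).
    """
    if pending_write_capacity_gib < 0:
        raise ValueError("capacity must not be negative")
    if write_rate_gib_per_hour <= 0:
        raise ValueError("write rate must be greater than zero")
    return pending_write_capacity_gib / write_rate_gib_per_hour


if __name__ == "__main__":
    hours = disconnected_hours(200.0, 5.0)
    print(f"Estimated disconnected operation: {hours:.1f} hours")
```

Treat the result as an upper bound, because write rates are rarely constant during an outage.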

If the branch office is completely lost, or if it is imperative for the business to bring a branch office service online sooner, you can choose to deploy the Edge in another branch or in the data center.

If you choose to deploy an Edge in the data center, we recommend that you remove the exports from the Core to prevent data corruption caused by multiple write access to the exports. We recommend that you roll back to the latest application-consistent snapshot. If mostly read access is required to the data projected to the branch office, a good alternative is to temporarily mount a snapshot to a local host. This snapshot makes the data accessible in the data center while the branch office is operating in disconnected-operation mode. Avoiding the failover also simplifies failback to the production site.

If you choose to deploy the Edge in another branch office, follow the steps in “Edge replacement” on page 91. Be aware that in this scenario, all the uncommitted writes at the branch are lost. We recommend that you roll back the exports to the latest application-consistent snapshot.

SteelFusion appliance failure—failback

After a disaster is over, or a failure is fixed, you might need to revert the changes and move the data and computing resources to where they were located before the disaster, while ensuring that the data integrity is not compromised. This process is called failback. Unlike the failover process that can occur in a rush, you can thoroughly plan and test the failback process.

This section includes the following topics:

• “Data center failback” on page 94

• “Branch office failback” on page 95

Data center failback

Because SteelFusion relies on primary storage to keep the data intact, the Core failback can only follow a successful storage array replication from the disaster recovery site back to the production site. There are multiple ways to perform the recovery; however, we recommend that you use the following method. The process most likely requires downtime, which you can schedule in advance. We also recommend that you create an application-consistent snapshot and backup prior to performing the following procedure. Perform these steps on one appliance at a time.

To perform the Core failback process

1. Shut down hosts and unmount exports.

2. Export the configuration file from the Core at the disaster recovery site.

3. From the Core, initiate taking the Edge offline. This process forces the Edge to replicate all the committed writes to the Core.

4. Remove NFS access from the export at the Core.

Data cannot be written to the export until the export becomes available again after the failback completes.

5. Make sure that you replicate the export with the storage array from the disaster recovery site back to the production site.

6. On a storage array at the production site, make the replicated export the primary export.

Depending on the storage array, you might need to create a snapshot, create a clone, promote the clone to an export, or perform all of these actions.

For more information, see the user guide for your storage array.

7. Add access to the export for the Core. The Core at the production site begins servicing the export instantaneously.

8. At the branch office, check to see if you need to change the Core IP address.
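Because the order of the Core failback steps matters, it can help to track the procedure as an ordered checklist that stops at the first step that cannot be confirmed. The following sketch shows one way to do that. The step texts are shortened from the procedure above, and the `confirm` callback (for example, an operator prompt) is a hypothetical hook, not a product feature.

```python
from typing import Callable, List, Optional, Tuple

CORE_FAILBACK_STEPS = [
    "Shut down hosts and unmount exports",
    "Export the configuration file from the Core at the disaster recovery site",
    "From the Core, take the Edge offline",
    "Remove NFS access from the export at the Core",
    "Replicate the export back to the production site",
    "Make the replicated export the primary export on the production array",
    "Add access to the export for the Core",
    "Check whether the Core IP address must change at the branch office",
]


def run_checklist(
    steps: List[str], confirm: Callable[[str], bool]
) -> Tuple[List[str], Optional[str]]:
    """Walk the steps in order and stop at the first one that is not confirmed.

    Returns the completed steps and the step that stopped the run, or None
    if every step was confirmed.
    """
    completed: List[str] = []
    for step in steps:
        if not confirm(step):
            return completed, step
        completed.append(step)
    return completed, None


if __name__ == "__main__":
    # Simulate a run in which the storage replication has not finished yet.
    done, blocked = run_checklist(
        CORE_FAILBACK_STEPS, lambda step: not step.startswith("Replicate")
    )
    print(f"Completed {len(done)} steps; stopped at: {blocked}")
```

Run the checklist once per appliance, in line with the recommendation to perform the procedure on one appliance at a time.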

Branch office failback

The branch office failback process is similar to the Edge replacement process. The procedure requires downtime that you can schedule in advance.

If the production exports were mapped to another Edge, use the following procedure.

To perform the Edge failback process

1. Shut down hosts and unmount exports.

2. Take the exports offline from the disaster recovery Edge. This process forces the Edge to replicate all the committed writes to Core.

3. If any changes were made to the export mapping configuration, you need to merge the changes during the failback process. For assistance with this process, contact Riverbed Support.

4. Shut down the Edge at the disaster recovery site.

5. Bring up the Edge at the production site.

6. Follow the steps described in “Edge replacement” on page 91.

Keep in mind that after the failback process is completed, the blockstore on Edges does not have any data in the cache.

If you removed the production exports from the Core and used them locally in the data center, shut down hosts and unmount exports and then continue the setup process as described in the SteelFusion Core Installation and Configuration Guide.

Best practice for export snapshot rollback

When a single file restore is impossible or impractical, you can roll back the entire LUN snapshot on the storage array at the data center and project it out to the branch. We recommend the following procedure for an export snapshot rollback.

Note: A single file restore recovers a deleted file from a backup or a snapshot without rolling back the entire file system to a point in time at which the file still existed. When you use the export rollback, everything that was written to (and deleted from) the file system after the snapshot was taken is lost.

To roll back the export snapshot

1. Set the export offline at the server running at the Edge.

2. Remove NFS access from the export at the Core.

3. Remove the export from the Core.

4. Restore the export on the storage array from a snapshot.

5. Add the export to the Core.

6. Add NFS access for the export at the Core.

You can now access the export snapshot from a server on the Edge.

Keep in mind that after this process is completed, the blockstore on Edges does not have any data in the cache.
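The rollback procedure above has a symmetric shape: the steps that detach the export from the Core before the restore are undone in reverse order after it. The following sketch derives the six steps from that structure. The data layout and function name are illustrative assumptions; only the step texts come from the procedure.

```python
from typing import List, Tuple

# Each pair holds a detach step and the step that reverses it.
DETACH_AND_REATTACH: List[Tuple[str, str]] = [
    ("Remove NFS access from the export at the Core",
     "Add NFS access for the export at the Core"),
    ("Remove the export from the Core",
     "Add the export to the Core"),
]


def rollback_plan() -> List[str]:
    """Build the rollback steps: detach, restore, then reattach in reverse."""
    plan = ["Set the export offline at the server running at the Edge"]
    plan += [detach for detach, _ in DETACH_AND_REATTACH]
    plan.append("Restore the export on the storage array from a snapshot")
    plan += [reattach for _, reattach in reversed(DETACH_AND_REATTACH)]
    return plan


if __name__ == "__main__":
    for number, step in enumerate(rollback_plan(), start=1):
        print(f"{number}. {step}")
```

Seeing the procedure this way makes it easier to check that nothing removed before the restore is left unrestored afterward.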

At-rest and in-flight data security

For organizations that require high levels of security or face stringent compliance requirements, Edge provides data at-rest and in-flight encryption capabilities for the data blocks written on the blockstore cache. This section includes the following topics:

• “Enable data at-rest blockstore encryption” on page 97

• “Enable data in-flight secure peering encryption” on page 98

Supported encryption standards include AES-128, AES-192, and AES-256. The keys are maintained in an encrypted secure vault. In 2003, the United States government announced that all three AES key lengths are sufficient to protect classified information up to the secret level. Top secret information requires 192-bit or 256-bit keys.
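The relationship between the supported key lengths and the classification levels mentioned above can be summarized in a few lines. This sketch is illustrative only: the dictionary keys reuse the encryption type names that appear in the CLI examples later in this section, and the function names are assumptions.

```python
AES_KEY_BITS = {
    "AES_128": 128,
    "AES_192": 192,
    "AES_256": 256,
}


def key_bytes(enc_type: str) -> int:
    """Return the key size in bytes for an encryption type."""
    return AES_KEY_BITS[enc_type] // 8


def meets_top_secret_key_length(enc_type: str) -> bool:
    """Top secret information requires 192-bit or 256-bit keys."""
    return AES_KEY_BITS[enc_type] >= 192


if __name__ == "__main__":
    for name in AES_KEY_BITS:
        print(name, key_bytes(name), meets_top_secret_key_length(name))
```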

The vault is encrypted by AES with a 256-bit key and a 16-byte cipher, and you must unlock it before the blockstore is available. The secure vault password is verified every time the appliance powers up, which helps keep the data confidential if the Edge is lost or stolen.

Initially, the secure vault has a default password known only to the RiOS software so the Edge can automatically unlock the vault during system startup. You can change the password so that the Edge does not automatically unlock the secure vault during system startup and the blockstore is not available until you enter the password.

When the system boots, the contents of the vault are read into memory, decrypted, and mounted (through EncFS, a FUSE-based cryptographic file system). Because this information is only in memory, when an appliance is rebooted or powered off, the information is no longer available and the in-memory object disappears. Decrypted vault contents are never persisted on disk storage.

We recommend that you keep your secure vault password safe. To ensure that your private keys cannot be compromised, there is no password recovery. If you lose the password, you can reset the secure vault only after erasing all the information within it.

To reset a lost password

• From either Edge appliance, enter the following CLI commands:

> enable

# configure terminal

(conf)# secure-vault clear

When you use the secure-vault clear command, you lose the data in the blockstore if it was encrypted. You then need to reload or regenerate the certificates and private keys.

Note: The Edge blockstore encryption is the same mechanism that is used in the RiOS data store encryption. For more information, see the security information in the SteelHead Deployment Guide.

Data encryption requires extra CPU resources and might affect performance. We recommend blockstore encryption only if you require a high level of security or if compliance requirements dictate it.

Enable data at-rest blockstore encryption

The following example shows how to configure blockstore encryption on an Edge. The commands are entered on the Core at the data center.

To configure blockstore encryption on the Edge

1. From the Core, enter the following commands:

> enable

# configure

(config) # edge id <edge-identifier> blockstore enc-type <AES_128 | AES_192 | AES_256 | NONE>

2. To verify whether encryption has been enabled on the Edge, enter the following commands:

> enable

# show edge id <edge-identifier> blockstore

Write Reserve : 10%

Encryption type : AES_256

You can do the same procedure in the Core Management Console by choosing Configure > Manage: SteelFusion Edges.

Figure: Adding blockstore encryption

To verify whether encryption is enabled on your Edge appliance, look at the Blockstore Encryption field on your Edge status window as shown in Figure 7‑2.

Figure: Verify blockstore encryption
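If you collect the output of the show edge id <edge-identifier> blockstore command from several Edges, a short script can check the encryption setting for you. The following sketch assumes that the output consists of "name : value" lines like the two shown in the example earlier in this section; real output might contain more fields. It also assumes that an unencrypted blockstore is reported as NONE, which mirrors the enc-type option but is not confirmed by this guide.

```python
from typing import Dict


def parse_blockstore_status(output: str) -> Dict[str, str]:
    """Turn 'name : value' lines into a dictionary, ignoring other lines."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        name, separator, value = line.partition(":")
        if separator:
            fields[name.strip()] = value.strip()
    return fields


def encryption_enabled(output: str) -> bool:
    """Report whether the parsed status shows an encryption type other than NONE."""
    enc_type = parse_blockstore_status(output).get("Encryption type", "NONE")
    return enc_type != "NONE"


if __name__ == "__main__":
    sample = "Write Reserve : 10%\nEncryption type : AES_256\n"
    print(parse_blockstore_status(sample))
    print(encryption_enabled(sample))
```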

Enable data in-flight secure peering encryption

The SteelFusion Rdisk protocol operates in clear text, so remote branch data can be exposed to attackers during transfer over the WAN. To counter this exposure, the Edge provides data in-flight encryption capabilities when the data blocks are asynchronously propagated to the data center export.

You can use secure peering between the Edge and the data center SteelHead to create a secure SSL channel and protect the data in-flight over the WAN. For more information about security and SSL, see the SteelHead Deployment Guide and the SteelHead Deployment Guide - Protocols.

Clearing the blockstore contents

Under normal conditions, if you select Offline on the Core for a particular export, the contents of the blockstore on the corresponding Edge are synchronized and then cleared.

However, there can be a situation in which it is necessary to make sure the entire contents of the blockstore on an Edge are erased to a military-grade level. While you can achieve this level of deletion, it involves the use of commands not normally available for general use. To ensure the correct procedures are followed, open a support case with Riverbed Support.

Edge network communication

In a location in which you have deployed Edge, there can be a requirement to keep track of the ports and protocols used with the various interfaces that are active. The following table provides a general list of devices and ports, including a description of what the communication is related to.

Source device	Source port	Destination device	Destination port	Protocol	Description
Edge primary interface	Any	Core	7950-7955	TCP	BlockStream
Edge primary interface	Any	Core	7970	TCP	SteelFusion management
SCC and other management hosts	Any	Edge primary interface	22,80,443	TCP	Edge management
VSP management hosts	Any	Through the Edge primary interface (using virtual IP)			ESXi management
VSP management hosts	Any	Through Edge primary interface (using VM management IP, vm_pri)	3389	TCP	RDP to remote VM
vSphere client machine	Any	Through the Edge primary interface (using virtual IP)	22, 80, 443, 902, 903, 9443	TCP, UDP	ESXi management
Edge primary interface (using VM management IP, vm_pri)	Any	vSphere client machine	22, 80, 443, 902, 903	TCP, UDP	VM management
Edge in-path interface	Any	Core	Any	TCP, UDP, ICMP	WAN optimization
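When you translate the table above into firewall rules, it can help to keep the rows as data and query them. The following sketch transcribes only the first three rows of the table; the class, the list, and the function name are assumptions made for illustration, and the remaining rows can be added in the same way.

```python
from typing import List, NamedTuple, Tuple


class Flow(NamedTuple):
    source: str
    destination: str
    ports: Tuple[int, ...]
    protocols: Tuple[str, ...]
    description: str


FLOWS: List[Flow] = [
    Flow("Edge primary interface", "Core",
         tuple(range(7950, 7956)), ("TCP",), "BlockStream"),
    Flow("Edge primary interface", "Core",
         (7970,), ("TCP",), "SteelFusion management"),
    Flow("SCC and other management hosts", "Edge primary interface",
         (22, 80, 443), ("TCP",), "Edge management"),
]


def required_ports(source: str, destination: str, protocol: str = "TCP") -> List[int]:
    """Return the sorted destination ports to allow between two devices."""
    ports = set()
    for flow in FLOWS:
        if (flow.source == source and flow.destination == destination
                and protocol in flow.protocols):
            ports.update(flow.ports)
    return sorted(ports)


if __name__ == "__main__":
    print(required_ports("Edge primary interface", "Core"))
```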

Additional security best practices

For additional advice and guidance on appliance security in general, see the SteelHead Deployment Guide. The guide includes suggestions on restricting web-based access, the use of role-based accounts, creation of login banners, alarm settings, and so on, which you can apply in principle to Edge appliances.

Related information

• SteelFusion Core User Guide

• SteelFusion Edge User Guide

• SteelFusion Core Installation and Configuration Guide

• SteelFusion Command-Line Interface Reference Manual

• SteelHead Deployment Guide

• Riverbed Splash at https://splash.riverbed.com/community/product-lines/steelfusion