About replication design

In a deployment with a single data center and a Core without FusionSync, the Edge acknowledges the write operations to local hosts in the branch, temporarily saves the data in its blockstore and marks it as uncommitted, and then asynchronously sends the write operations to the Core. The Core writes the data to the LUN in the data center storage array. After the Core has received an acknowledgment from the storage array, the Core acknowledges the Edge. The Edge can then mark the relevant blockstore contents as committed. In this way, the Edge always maintains data consistency.

To maintain data consistency between the Edge and the two data centers—with a Core in each data center and FusionSync configured—the data flow is somewhat different.

In the steady state, the Edge acknowledges the write operations to local hosts, temporarily saves the data in its blockstore and marks it as uncommitted, and asynchronously sends the write operations to its Core. When you configure FusionSync, the primary Core applies the write operations to backend storage and replicates the write operations to the secondary Core. The data is replicated between the Cores synchronously, meaning that the primary Core acknowledges a write operation to the Edge only when both the local storage array and the secondary Core, along with its storage array, have acknowledged the data.

The Edge marks the relevant blockstore content as committed only when the primary Core has finally acknowledged the Edge.
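The acknowledgment chain described above can be sketched as a small model. This is an illustration only, not product code: the class names, the `array_ok` and `link_up` flags, and the `flush` method are all hypothetical, and the asynchronous send from Edge to Core is modeled as an explicit call.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Core:
    """Hypothetical Core; array_ok stands in for the storage array's acknowledgment."""
    array_ok: bool = True
    secondary: Optional["Core"] = None  # set only when FusionSync is configured
    link_up: bool = True                # Core-to-Core connectivity

    def write(self, data: bytes) -> bool:
        """Return True only when every required party has acknowledged the write."""
        if not self.array_ok:
            return False
        if self.secondary is None:
            return True  # single data center: the local array is enough
        # Synchronous replication: the secondary Core and its array must also ack.
        return self.link_up and self.secondary.write(data)


@dataclass
class Edge:
    """Hypothetical Edge with a blockstore split into uncommitted and committed blocks."""
    core: Core
    uncommitted: Dict[int, bytes] = field(default_factory=dict)
    committed: Set[int] = field(default_factory=set)

    def host_write(self, block_id: int, data: bytes) -> bool:
        """Acknowledge the local host immediately and hold the data as uncommitted."""
        self.uncommitted[block_id] = data
        self.committed.discard(block_id)
        return True

    def flush(self) -> None:
        """Send pending writes to the Core; commit only the ones it acknowledges."""
        for block_id, data in list(self.uncommitted.items()):
            if self.core.write(data):
                del self.uncommitted[block_id]
                self.committed.add(block_id)
```

In this model, a block moves from `uncommitted` to `committed` only after `Core.write` returns `True`, which with a secondary configured requires both arrays and the inter-Core link.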

If the primary Core loses its connection to the secondary Core, it pauses FusionSync. When FusionSync is paused, the Core does not acknowledge writes from the Edge. The Edge continues to acknowledge the local hosts in the branch and to buffer the writes, much as it does without FusionSync when WAN connectivity to the Core goes down. Although write operations between the Edge and the Core are not available, read operations are not affected, and read requests from the Edges continue to be serviced by the same Core as normal.

When the connectivity comes back up, FusionSync resumes automatically. If, for any reason, the connectivity between the Cores takes a long time to recover, the uncommitted data in the Edges might continue to increase. Uncommitted data in the Edge can lead to a full blockstore. If the blockstore write reserve is in danger of reaching capacity, you can suspend FusionSync. When FusionSync is suspended, the primary Core accepts writes from the Edges, keeps a log of the write operations on its Journal LUN, and acknowledges the write operations to the Edges so that the blockstore data is marked as committed.
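The difference between a paused and a suspended FusionSync can be sketched as a small state machine. The names below (`SyncState`, `PrimaryCore`, and its methods) are hypothetical and chosen for illustration; the sketch leaves out Witness approval and journal resynchronization, and assumes that in the active state the arrays and the secondary Core always acknowledge.

```python
from enum import Enum
from typing import List


class SyncState(Enum):
    ACTIVE = "active"        # replicating synchronously to the secondary Core
    PAUSED = "paused"        # link to the secondary lost; writes are not acknowledged
    SUSPENDED = "suspended"  # administrator action; writes are journaled and acknowledged


class PrimaryCore:
    """Hypothetical model of how the primary Core answers Edge writes in each state."""

    def __init__(self) -> None:
        self.state = SyncState.ACTIVE
        self.journal: List[bytes] = []  # stands in for the Journal LUN

    def peer_link_lost(self) -> None:
        if self.state is SyncState.ACTIVE:
            self.state = SyncState.PAUSED

    def peer_link_restored(self) -> None:
        # A paused FusionSync resumes on its own; a suspended one is not changed here.
        if self.state is SyncState.PAUSED:
            self.state = SyncState.ACTIVE

    def suspend(self) -> None:
        self.state = SyncState.SUSPENDED

    def handle_write(self, data: bytes) -> bool:
        """Return True if the write is acknowledged to the Edge."""
        if self.state is SyncState.PAUSED:
            return False  # the Edge keeps the block uncommitted
        if self.state is SyncState.SUSPENDED:
            self.journal.append(data)  # logged for later resynchronization
        return True
```

While paused, every `handle_write` call returns `False`, so uncommitted data accumulates on the Edge; after `suspend()`, writes are acknowledged again and the journal grows instead.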

When a primary Core is down, a secondary Core can take over the primary role. You have to manually initiate the failover on the secondary Core. The Edges maintain connectivity to both Cores (primary and secondary). When the failover occurs, the surviving Core automatically contacts the Edges to move all Edge data connections to the secondary Core. At this point, the secondary Core becomes primary with its Replication suspended. The new primary Core now accepts writes from the Edges, applies them to the storage array, logs the operations in the Journal LUN, and acknowledges the write operations to the Edges. When the connectivity between the Cores is restored, the new primary Core starts resynchronizing the writes logged in the Journal LUN, through the Core in the original data center (the old primary Core), to the LUNs. In this recovery scenario, the old primary Core becomes the secondary Core and all the LUNs protected by FusionSync are brought back into synchronization with their replicas.
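The journal-and-resynchronize step can be illustrated with a minimal sketch. It assumes a journal that records each write as an address and data pair in arrival order, and models each LUN as a dictionary; the function names and the journal layout are hypothetical and say nothing about the actual Journal LUN format.

```python
from typing import Dict, List, Tuple

Journal = List[Tuple[int, bytes]]  # (block address, data), in arrival order


def log_write(lun: Dict[int, bytes], journal: Journal, address: int, data: bytes) -> None:
    """Apply a write to the active LUN and record it while replication is suspended."""
    lun[address] = data
    journal.append((address, data))


def resynchronize(journal: Journal, replica_lun: Dict[int, bytes]) -> int:
    """Replay journaled writes, in order, onto the replica; return how many were replayed."""
    replayed = 0
    for address, data in journal:
        replica_lun[address] = data
        replayed += 1
    journal.clear()  # the replica has caught up, so the log can be emptied
    return replayed
```

Replaying in arrival order matters when the same address is written more than once during the outage, because the replica must end up with the latest data.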

Whatever the failover scenario, when the failed data center and Core are brought back online and connectivity between data centers is restored, you can fail back to the original data center by initiating a failover at the active (secondary) data center. Because a failover is only possible if the primary Core is not reachable by the secondary Core and the Witness, you must manually bring down the primary Core. You can accomplish this by stopping the Core service on the current primary Core (in the secondary data center) and then initiating a failover on the old primary Core located in the primary data center.

As with any high-availability scenario, a split-brain condition is possible. With Replication, this condition occurs when both the primary and secondary Cores are up and visible to the Edges but cannot communicate with each other. FusionSync could become suspended on both sides, and the Edges could send writes to both Cores, or some writes to one Core and some to the other. Writing to both sides leads to a condition in which both Cores are journaling and neither Core has a consistent copy of the data. Split brain is likely to result in data loss. To prevent the issue, you must define one of the Edges as a Witness.

The Witness must approve any request from a Core to suspend replication. The Witness ensures that the primary and secondary Cores are never both approved for suspension at the same time. When the request is approved, the Core can start logging the writes to the Journal LUN.
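The Witness rule amounts to granting suspension to at most one Core at a time. The following sketch shows that arbitration under simple assumptions: Cores are identified by name, a repeated request from the approved Core is granted again, and approval is cleared by an explicit call. The `Witness` class and its methods are hypothetical and do not represent the product's interface.

```python
from typing import Optional


class Witness:
    """Hypothetical arbiter that approves suspension for at most one Core at a time."""

    def __init__(self) -> None:
        self.approved_core: Optional[str] = None

    def request_suspend(self, core_name: str) -> bool:
        """Approve the request unless a different Core already holds approval."""
        if self.approved_core is None or self.approved_core == core_name:
            self.approved_core = core_name
            return True
        return False

    def clear(self, core_name: str) -> None:
        """Release approval, for example after the Cores have resynchronized."""
        if self.approved_core == core_name:
            self.approved_core = None
```

If both Cores lose contact with each other and both ask to suspend, only the first request succeeds; the other Core is refused and therefore never starts journaling, which is what prevents the two copies from diverging.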