Failover states and sequences

While performing its primary function of projecting exports, each Core in an HA deployment uses its heartbeat interfaces to check whether its peer is still active. By default, the peers check each other every 3 seconds with a heartbeat message. The message is sent over TCP port 7972 and contains the current state of the sending peer.

The state is one of the following:

• ActiveSelf—The Core is healthy, running its own configuration and serving its exports as normal. It has an active heartbeat with its peer.

• ActiveSolo—The Core is healthy but the peer is down. It is running its own configuration and that of the failed peer. It is serving its exports and also the exports of the failed peer. The exports transition to read-only mode.

• Inactive—The Core is healthy but has just started up and cannot automatically transition to ActiveSolo or ActiveSelf. Typically this state occurs if both Cores fail at the same time. To complete the transition, you must manually activate the correct Core.

• Passive—The default state when a Core starts up. Depending on the status of the peer, the Core state transitions to Inactive, ActiveSolo, or ActiveSelf.
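The four states can be summarized as a small lookup table. The following Python sketch is illustrative only: the names CoreState, StateTraits, and TRAITS are hypothetical and are not part of any Core interface. It also assumes that a Core in the Inactive or Passive state is not serving exports, which the descriptions above imply but do not state outright.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CoreState(Enum):
    ACTIVE_SELF = "ActiveSelf"
    ACTIVE_SOLO = "ActiveSolo"
    INACTIVE = "Inactive"
    PASSIVE = "Passive"


@dataclass(frozen=True)
class StateTraits:
    serves_own_exports: bool
    serves_peer_exports: bool
    exports_read_only: Optional[bool]  # None means "not applicable"
    needs_manual_activation: bool


# Hypothetical summary of the state descriptions; not a product API.
TRAITS = {
    # Healthy, running its own configuration, peer heartbeat active.
    CoreState.ACTIVE_SELF: StateTraits(True, False, False, False),
    # Peer is down: serves both sets of exports, in read-only mode.
    CoreState.ACTIVE_SOLO: StateTraits(True, True, True, False),
    # Assumption: not serving until an administrator activates the Core.
    CoreState.INACTIVE: StateTraits(False, False, None, True),
    # Assumption: not serving yet; startup state before any transition.
    CoreState.PASSIVE: StateTraits(False, False, None, False),
}


def states_serving_exports():
    """Return the names of the states in which a Core serves exports."""
    return sorted(s.value for s, t in TRAITS.items() if t.serves_own_exports)
```

In this sketch, states_serving_exports() lists only ActiveSelf and ActiveSolo, and Inactive is the only state flagged as needing manual activation.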

If three consecutive heartbeats receive no response, the secondary Core declares the primary failed and initiates a failover. Both Cores in an HA deployment are primary for their own functions and secondary for the peer. Therefore, whichever Core fails, its peer acts as the secondary and takes control of the failed Core’s exports.
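The detection rule above amounts to a counter of consecutive missed heartbeats. The following Python sketch shows one way to model it. The class PeerMonitor and its method names are hypothetical, and the sketch does not reproduce the actual heartbeat message format or any Core internals. It assumes the default 3-second interval and that any single reply resets the count.

```python
HEARTBEAT_INTERVAL_S = 3   # default check interval from the text
MISSED_LIMIT = 3           # consecutive misses before failover


class PeerMonitor:
    """Hypothetical model of the secondary's view of its peer."""

    def __init__(self, missed_limit=MISSED_LIMIT):
        self.missed_limit = missed_limit
        self.missed = 0
        self.peer_failed = False

    def on_interval(self, reply_received):
        """Record one heartbeat interval.

        Returns True only at the moment a failover should start.
        """
        if reply_received:
            # Assumption: one healthy reply clears the miss count.
            self.missed = 0
            self.peer_failed = False
            return False
        self.missed += 1
        if self.missed >= self.missed_limit and not self.peer_failed:
            self.peer_failed = True
            return True
        return False


def detection_window_seconds(interval=HEARTBEAT_INTERVAL_S, limit=MISSED_LIMIT):
    """Approximate time spanned by the missed heartbeats."""
    return interval * limit
```

With the defaults, the three missed heartbeats span roughly 9 seconds. The exact detection time depends on timing details that this sketch does not model.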

As an example, consider an HA deployment in which Core A and Core B are interconnected with two crossover cables that provide heartbeat connectivity.

There are two file servers: A and B. Each could be configured with virtual IP addresses to provide redundancy. File server A exports two fileshares, A1 and A2, which are mounted by Core A. File server B exports one fileshare, B1, which is mounted by Core B.

Assume that one Edge has been configured to connect to Core A in order to map projected fileshares A1 and A2, and another Edge has been configured to connect to Core B in order to map projected fileshare B1.

Under normal conditions, where Core A and Core B are both healthy, they operate in an active-active configuration. Each Core controls its own fileshares but is also aware of its peer’s fileshares. Each independently services read and write operations to and from the Edge for its own fileshares.

The Cores check each other via their heartbeat interfaces to ensure their peer is healthy. In a healthy state, both peers are reported as being ActiveSelf.

When a failover occurs in a Core HA deployment, the surviving Core transitions to the ActiveSolo state.

In this condition, the surviving Core transitions all exported fileshares that are part of the HA configuration and projected to Edges into read-only mode. In our example, this would include A1, A2, and B1.

Important: The read-only mode transition is from the perspective of the surviving Core and any Edges that are connected to the HA pair. There is no change made to the state of the exported fileshares on the backend file servers. They remain in read-write mode.
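The example topology can be written down as data to show which exports the surviving Core presents as read-only after a failover. This Python sketch is purely illustrative. The dictionary layout and the function failover_view are hypothetical, and nothing here changes or queries the backend file servers, which remain in read-write mode as noted above.

```python
# Example topology from the text: Core name -> exports it mounts.
EXPORTS_BY_CORE = {
    "Core A": ["A1", "A2"],
    "Core B": ["B1"],
}


def failover_view(exports_by_core, failed_core):
    """Describe the surviving Core's view after its peer fails.

    Hypothetical helper: models a two-Core HA pair only.
    """
    if failed_core not in exports_by_core:
        raise ValueError(f"unknown Core: {failed_core}")
    survivors = [core for core in exports_by_core if core != failed_core]
    if len(survivors) != 1:
        raise ValueError("expected exactly two Cores in the HA pair")
    # All exports in the HA configuration become read-only as seen by
    # the surviving Core and the connected Edges.
    read_only = sorted(
        export
        for exports in exports_by_core.values()
        for export in exports
    )
    return {
        "serving_core": survivors[0],
        "state": "ActiveSolo",
        "read_only_exports": read_only,
    }
```

Whichever Core fails in this example, the read-only set is the same: A1, A2, and B1. Only the serving Core differs.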

With the surviving Core in ActiveSolo state and the exports in read-only mode, the following scenarios apply:

• The ActiveSolo Core will defer all commits arriving from its connected Edges.

• The ActiveSolo Core will defer all snapshot requests coming in from its connected Edges.

• Edges connected to the ActiveSolo Core will absorb writes locally in the blockstore and acknowledge them, but commits will be paused.

• Edges connected to the ActiveSolo Core will continue to service read requests locally if the data is resident in the blockstore, and will request nonresident data via the ActiveSolo Core as normal.

• Mounting new exported fileshares on the ActiveSolo Core from backend file servers is permitted.

• Mapping exported fileshares from the ActiveSolo Core to Edge appliances will still be allowed.

• Any operation on backend file servers to resize an exported fileshare will be deferred.
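The scenarios above can be condensed into a lookup of how each kind of operation is handled while the surviving Core is in the ActiveSolo state. The following Python sketch is a summary aid, not a product interface. The Operation and Handling names are hypothetical labels chosen for this illustration.

```python
from enum import Enum


class Operation(Enum):
    EDGE_COMMIT = "commit from Edge"
    EDGE_SNAPSHOT = "snapshot request from Edge"
    EDGE_WRITE = "write at Edge"
    EDGE_READ = "read at Edge"
    MOUNT_NEW_EXPORT = "mount new export on Core"
    MAP_EXPORT_TO_EDGE = "map export to Edge"
    RESIZE_EXPORT = "resize export on backend"


class Handling(Enum):
    DEFERRED = "deferred until the peer recovers"
    PERMITTED = "permitted"


# Hypothetical summary of ActiveSolo behavior, taken from the list above.
ACTIVE_SOLO_POLICY = {
    Operation.EDGE_COMMIT: Handling.DEFERRED,
    Operation.EDGE_SNAPSHOT: Handling.DEFERRED,
    # Writes are absorbed and acknowledged in the Edge blockstore;
    # only the commit to the Core is paused.
    Operation.EDGE_WRITE: Handling.PERMITTED,
    # Resident data is served locally; nonresident data is requested
    # through the ActiveSolo Core as normal.
    Operation.EDGE_READ: Handling.PERMITTED,
    Operation.MOUNT_NEW_EXPORT: Handling.PERMITTED,
    Operation.MAP_EXPORT_TO_EDGE: Handling.PERMITTED,
    Operation.RESIZE_EXPORT: Handling.DEFERRED,
}


def deferred_operations():
    """Return the names of operations deferred in ActiveSolo."""
    return sorted(
        op.name
        for op, handling in ACTIVE_SOLO_POLICY.items()
        if handling is Handling.DEFERRED
    )
```

In this summary, three operation types are deferred (commits, snapshots, and resizes) and the remaining four continue.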

Once the failed Core in an HA configuration comes back online and starts communicating healthy heartbeat messages to the ActiveSolo Core, recovery to normal service is automatic. Both Core appliances return to an ActiveSelf state and exported fileshares are transitioned back to read-write mode.

All pending commits for the connected Edge appliances will be completed and any other deferred operations will resume and complete.
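The defer-then-resume behavior can be pictured as a queue that fills while the Core is in the ActiveSolo state and drains when the peer returns. The Python sketch below is a simplified model under stated assumptions. The class CoreSketch is hypothetical, and first-in, first-out replay order is an assumption made for the illustration, not documented behavior.

```python
from collections import deque


class CoreSketch:
    """Hypothetical model of deferral and automatic recovery."""

    def __init__(self):
        self.state = "ActiveSelf"
        self.deferred = deque()   # operations held during ActiveSolo
        self.completed = []       # operations that have been applied

    def peer_failed(self):
        # Three missed heartbeats: take over and defer commits.
        self.state = "ActiveSolo"

    def submit(self, operation):
        """Handle a commit, snapshot, or resize request."""
        if self.state == "ActiveSolo":
            self.deferred.append(operation)
            return "deferred"
        self.completed.append(operation)
        return "completed"

    def peer_recovered(self):
        # Healthy heartbeats resume: return to ActiveSelf and
        # complete everything that was held back.
        self.state = "ActiveSelf"
        while self.deferred:
            # Assumption: replay in arrival order.
            self.completed.append(self.deferred.popleft())


core = CoreSketch()
core.submit("commit-1")        # completed immediately
core.peer_failed()
core.submit("commit-2")        # deferred
core.submit("snapshot-1")      # deferred
core.peer_recovered()          # deferred operations now complete
```

After peer_recovered() runs, the sketch is back in the ActiveSelf state with an empty deferred queue, mirroring the automatic recovery described above.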

In circumstances where it is absolutely necessary, it is possible to “force” a transition back to read-write mode while in an ActiveSolo state. Contact Riverbed Support for assistance if you need to perform this task.