April 13, 2021

Why do I need independent connections to the Witness Appliance in a VMware vSAN stretched cluster?

The year was 2009, and my pager went off and interrupted dinner. A DRBD cluster that lacked proper fencing had experienced a failure, and data was coming up severely out of date in the application. I didn’t know it then, but something terrible had happened, and I would lose sleep for three days trying to put the pieces back together. If only vSAN had existed back then, my life would have been simpler. We will revisit this incident in a bit; for now, there is a more important question to address first.

Why do I need independent connections to the Witness in a vSAN stretched cluster?

This question comes up from time to time. You have two data centers near each other and want to run a vSAN stretched cluster (great idea!). You have a third location to place the witness appliance (awesome!). Then you look at the connectivity between the sites and discover that the data centers do not have independent paths to the witness: one data center depends on the other to reach it. This is a problem, as vSAN does not support this configuration.

vSAN requires that in the event a data center fails, the remaining site can still reach the witness. Why is this?
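To see why, it helps to work through the vote math. The sketch below is a simplified illustration (the site names and one-vote-per-site counts are assumptions for the example, not vSAN’s actual internals): an object stays available only while a strict majority of votes remains reachable, so a surviving data site that cannot reach the witness on its own drops to a minority and must stop serving I/O.

```python
# Simplified quorum math for a three-site stretched layout.
# Site names and one-vote-per-site counts are illustrative assumptions,
# not vSAN's actual implementation.
SITES = {"site_a": 1, "site_b": 1, "witness": 1}

def has_quorum(reachable):
    """Return True if the reachable sites hold a strict majority of votes."""
    total = sum(SITES.values())
    votes = sum(v for site, v in SITES.items() if site in reachable)
    return votes > total // 2

# Site A fails, but site B has its own path to the witness:
# 2 of 3 votes are reachable, so I/O continues at site B.
print(has_quorum({"site_b", "witness"}))  # True

# Site A fails, and site B depended on site A's link to reach the witness:
# site B is alone with 1 of 3 votes, so it must stop serving I/O.
print(has_quorum({"site_b"}))  # False
```

In other words, if the only route to the witness runs through the failed data center, the survivor loses quorum at exactly the moment it was supposed to take over.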


The short answer is the CAP theorem, which states that it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees:

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

In the presence of a network partition (a site becoming fully isolated from the other two sites), vSAN’s designers had to choose between two behaviors:

  • Cancel I/O operations, decreasing availability but ensuring consistency (protecting the data). In this scenario, virtual machines lose access to their storage, and reads and writes to volumes are blocked until quorum is restored.
  • Proceed with the operation, providing availability but risking inconsistency. In this scenario, both sites could become active, and applications and files could be updated independently on each site. Once the network was repaired, you would be left with two independent copies of a virtual machine that cannot be safely merged, requiring hand-editing of files and applications to resynchronize the data (if that is possible at all). A sketch contrasting the two behaviors follows this list.
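This minimal sketch (class and method names are hypothetical, not vSAN code) shows a consistency-first store refusing writes once it loses quorum, while an availability-first store keeps accepting them on both sides of the partition and lets the histories diverge.

```python
# Sketch of the two partition-handling policies. Class and method names
# are illustrative assumptions, not vSAN's actual code.

class QuorumLostError(Exception):
    """Raised when a consistency-first store refuses I/O without quorum."""

class ReplicaSite:
    def __init__(self, total_votes, reachable_votes, consistency_first):
        self.total_votes = total_votes
        self.reachable_votes = reachable_votes
        self.consistency_first = consistency_first
        self.log = []  # the write history this side of the partition sees

    def write(self, data):
        if self.reachable_votes <= self.total_votes // 2:
            if self.consistency_first:
                # vSAN's choice: fail safe and block I/O until quorum returns.
                raise QuorumLostError("no quorum: refusing write to protect data")
            # The alternative: keep accepting writes and risk split-brain.
        self.log.append(data)

# During a partition, each isolated side sees only its own vote (1 of 3).
safe = ReplicaSite(total_votes=3, reachable_votes=1, consistency_first=True)
risky = ReplicaSite(total_votes=3, reachable_votes=1, consistency_first=False)

risky.write("txn-42")  # accepted; the other side may accept a different txn-42
try:
    safe.write("txn-42")
except QuorumLostError as err:
    print(err)  # I/O blocked; the data stays consistent
```

The blocked write is painful but recoverable; the divergent histories are the 2009 story below.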

vSAN chose the former approach: always “fail safe” and prioritize data integrity at all costs. This is the design decision of most, if not all, modern enterprise storage platforms that run as scale-out distributed systems. Distributed systems that follow the latter approach tend to be stateless (like a firewall), where a split-brain scenario does not cause irreparable harm.

The DRBD cluster I had to fix in 2009 lacked this design decision. A network partition had allowed both DRBD instances to go active (split-brain) and process I/O, creating a shadow copy of the database server volume. A subsequent failure of the other node left this shadow copy of the old database accepting new transactions. It took days to resync the data, with some datasets having to be re-entered by hand due to lost transactions. DRBD was abandoned shortly after this incident.

What are my options to overcome the witness connectivity requirement?  

There are a few ways to survive a data center failure when you lack the proper connectivity to build a stretched cluster.

  1. Deploy a second connection at the data center that lacks an independent path to the witness. NSX SD-WAN along with wireless connections should be more than sufficient for this purpose.
  2. Use vSphere Replication to replicate asynchronously. When paired with Site Recovery Manager (SRM), this has the added benefit of allowing orchestrated failover. Asynchronous replication requires a manual failover, which prevents a split-brain from occurring automatically.
  3. Use application-level replication between two independent vSAN clusters. Exchange DAG, Active Directory replication, and Oracle Data Guard all allow asynchronous replication at a very low RPO.

vSAN stretched clusters are a powerful solution. They can deliver rapid, automatic failover and high availability with a zero recovery point objective. To deliver this in a reliable and consistent manner, the witness must be reachable from both data center locations. For more information, see the vSAN Stretched Cluster Guide.
