A tale of leader election, failure detection, and false positives

Building distributed systems requires navigating many trade-offs. Leader failure detection is a specific example of the trade-off between failure detection time and latency. This blog post captures insights, learned over years of building DynamoDB, on reducing the impact of failure detection on latency and availability. The learnings are also captured in the DynamoDB 2022 Paper and the Twitter Thread linked in the references below.

Let’s first understand the concepts of distributed databases, leaders, followers, and leader election. A distributed database table in a system like DynamoDB is divided into multiple partitions to handle the throughput and storage requirements of the table. Each partition hosts a disjoint and contiguous part of the table’s key-range. Each partition has multiple replicas distributed across different Availability Zones for high availability and durability. One of the replicas is the leader and the rest are followers. The leader replica coordinates all consistent reads and writes. On receiving a write, the leader replica writes the data to its local disk and replicates it to the followers. Replication ensures the followers stay in sync with the leader within a few milliseconds of a write being committed. The followers process writes in the same order as they were received by the leader replica. The replicas for a partition form a replication group, which uses Multi-Paxos for leader election and consensus. Clients that want to read the most recently written data send their read requests to the leader replica as consistent reads.
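To make this concrete, here is a minimal sketch (in Go) of how a table’s key-range maps to partitions and how each partition’s replication group designates a leader. The names (Partition, Replica, Role) are illustrative stand-ins, not DynamoDB’s internal types.

```go
package main

import "fmt"

// Role distinguishes the single leader from the followers in a replication group.
type Role int

const (
	RoleFollower Role = iota
	RoleLeader
)

// Replica is one copy of a partition, hosted in a specific Availability Zone.
type Replica struct {
	NodeID string
	AZ     string
	Role   Role
}

// Partition owns a disjoint, contiguous slice of the table's key-range and
// is served by a replication group: one leader plus followers across AZs.
type Partition struct {
	KeyRangeStart string // inclusive
	KeyRangeEnd   string // exclusive
	Replicas      []Replica
}

// Leader returns the replica currently acting as leader; it coordinates
// all consistent reads and writes for the partition.
func (p *Partition) Leader() *Replica {
	for i := range p.Replicas {
		if p.Replicas[i].Role == RoleLeader {
			return &p.Replicas[i]
		}
	}
	return nil // no leader elected yet
}

func main() {
	p := Partition{
		KeyRangeStart: "a",
		KeyRangeEnd:   "m",
		Replicas: []Replica{
			{NodeID: "n1", AZ: "us-east-1a", Role: RoleLeader},
			{NodeID: "n2", AZ: "us-east-1b", Role: RoleFollower},
			{NodeID: "n3", AZ: "us-east-1c", Role: RoleFollower},
		},
	}
	fmt.Println("consistent reads and writes go to:", p.Leader().NodeID)
}
```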

Failure detection is a key part of a distributed system: it dictates the time to recovery. The faster a failure is detected, the faster the recovery process and the lower the latency impact. Why are we talking about failure detection? In a distributed system with many nodes spread across multiple Availability Zones, node crashes are inevitable. The crashing node could be a leader or a follower. Recovering from a node failure can be expensive depending on the amount of data that needs to be recovered. In distributed databases, recovery time during node failures plays a critical role in defining the maximum size of each replica: smaller replicas heal faster, larger replicas heal slower, so it is better to keep replicas small. A node crash is not the only reason for triggering a leader failover; power outages, network router outages, software bugs, and deployments can trigger one as well.
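As a back-of-envelope illustration of why smaller replicas heal faster, the sketch below assumes a hypothetical re-replication bandwidth; the numbers are made up for illustration, not DynamoDB’s.

```go
package main

import "fmt"

func main() {
	// Assumed effective rate at which a new node can be caught up (GB/s).
	// Recovery time scales roughly with the amount of data to re-replicate.
	const copyBandwidthGBps = 0.5

	for _, replicaSizeGB := range []float64{10, 100, 500} {
		recoverySeconds := replicaSizeGB / copyBandwidthGBps
		fmt.Printf("replica of %4.0f GB -> ~%5.0f s to rebuild\n",
			replicaSizeGB, recoverySeconds)
	}
}
```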

Once a leader node crashes, someone has to trigger a failover and elect a new leader. The crucial step that impacts the latency and availability of a distributed database is failure detection. There is no easy way to detect failures in distributed systems, let alone specific leader failures. Detecting that something is working is easy as long as you get a response from the node. When a node is not responding, it could be down, the network could be disconnected, the node could be busy, or the message could be lost or delayed; it is hard to figure out with 100% certainty what is really happening. This makes the job of failure detectors even harder, and most failure detectors are not always accurate.

So how do we do failure detection? In a replicated data store such as DynamoDB, the leader replica sends messages to followers for data replication, metadata change replication, and so on. A message is typically delivered within a few milliseconds between nodes. The leader replica also maintains a lease and sends heartbeats to the followers. The heartbeats can be sent as part of normal replication messages (data or metadata) or out of band as special messages, and they renew the leader's lease. If a follower does not receive a heartbeat, or the leader does not hear back from the followers, either one can trigger a leader election. Once the leader election finishes, a new leader is elected. To avoid consistency problems, the newly elected leader waits out the old leader's lease. Choosing the lease timeout is where the trade-off lies; it determines whether the system is highly available and low latency or not. If the failure detection time is too short, it creates a lot of leader churn, impacting latency and availability. If the failure detection time is too long, it results in higher customer impact (a non-functioning leader). So, to reduce the latency and availability impact for consistent reads and writes, reducing false positives in failure detection is critical.
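The follower side of this lease-based detection can be sketched roughly as follows. The timing constants are illustrative placeholders, not DynamoDB’s actual heartbeat interval or lease duration.

```go
package main

import (
	"fmt"
	"time"
)

const (
	heartbeatInterval = 1 * time.Second // how often the leader renews its lease (assumed)
	leaseDuration     = 3 * time.Second // failure-detection timeout (assumed)
)

// FailureDetector is the follower's view of the leader's lease.
type FailureDetector struct {
	lastHeartbeat time.Time
}

// OnHeartbeat is called whenever a replication message or an explicit
// heartbeat arrives from the leader; either one renews the lease.
func (fd *FailureDetector) OnHeartbeat(now time.Time) {
	fd.lastHeartbeat = now
}

// LeaderSuspected reports whether the lease has lapsed. A short timeout
// detects failures quickly but risks spurious elections and leader churn;
// a long timeout leaves clients stuck with a dead leader for longer.
func (fd *FailureDetector) LeaderSuspected(now time.Time) bool {
	return now.Sub(fd.lastHeartbeat) > leaseDuration
}

func main() {
	fd := &FailureDetector{lastHeartbeat: time.Now()}

	// Two heartbeat intervals pass without any message: still within the lease.
	later := time.Now().Add(2 * heartbeatInterval)
	fmt.Println("suspected after 2s:", fd.LeaderSuspected(later))

	// Once the full lease duration has lapsed, the follower may propose an election.
	evenLater := time.Now().Add(leaseDuration + time.Second)
	fmt.Println("suspected after 4s:", fd.LeaderSuspected(evenLater))

	// A newly elected leader must also wait out the old leader's lease
	// before serving consistent reads, to avoid two leaders acting at once.
}
```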

Failure detection works well for failure scenarios where every replica in the group loses connection to the leader. However, nodes can experience gray network failures. Gray network failures can happen because of communication issues between a leader and a follower. They can also happen due to issues with the outbound or inbound communication of a node, or because front-end routers face communication issues with the leader even though the leader and followers can communicate with each other. A replica that isn’t receiving heartbeats from the leader will try to elect a new leader even during gray failures.

To avoid the unnecessary churn caused by gray failures, a follower that wants to trigger a failover first sends a message to the other replicas in the replication group asking whether they can communicate with the leader. If replicas respond with a healthy-leader message, the follower drops its attempt to trigger a leader election.
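A rough sketch of this confirmation step is below. The Peer type and CanReachLeader callback are hypothetical stand-ins for the real replication-group RPCs, which the post does not describe.

```go
package main

import "fmt"

// Peer abstracts another replica in the replication group. CanReachLeader
// would be an RPC in a real system; here it is a stubbed answer.
type Peer struct {
	NodeID         string
	CanReachLeader func() bool
}

// shouldTriggerElection returns true only if no peer reports a healthy
// leader. If any peer can still talk to the leader, the local replica is
// likely suffering a gray (partial) failure and drops its attempt.
func shouldTriggerElection(peers []Peer) bool {
	for _, p := range peers {
		if p.CanReachLeader() {
			return false // someone still sees the leader: do not churn
		}
	}
	return true // nobody can reach the leader: a real failover is warranted
}

func main() {
	peers := []Peer{
		{NodeID: "n2", CanReachLeader: func() bool { return true }},
		{NodeID: "n3", CanReachLeader: func() bool { return false }},
	}
	// n2 still sees the leader, so no election is triggered.
	fmt.Println("trigger election:", shouldTriggerElection(peers))
}
```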

This change in the failure detection algorithm used by DynamoDB significantly reduced the number of false positives in the system, and hence the number of spurious leader elections. In addition, designing systems to proactively push replica change events to consumers helps improve availability and reduce latency impact. During deployments, before performing a software update on a node, DynamoDB marks the node that is about to be taken out of the fleet as ineligible for leadership. This ensures a new leader election kicks in proactively, without waiting for failure detection.
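A sketch of that deployment-time step might look like the following. The function and field names are hypothetical, since the post does not describe the actual control-plane interface.

```go
package main

import "fmt"

// Node is a storage node that may currently lead one or more replication groups.
type Node struct {
	ID             string
	IsLeader       bool
	LeaderEligible bool
}

// prepareForSoftwareUpdate proactively removes the node from leadership
// before the deployment takes it out of the fleet, so followers never have
// to wait for a lease timeout to detect the "failure".
func prepareForSoftwareUpdate(n *Node, electNewLeader func()) {
	n.LeaderEligible = false // no replica will elect this node again
	if n.IsLeader {
		electNewLeader() // kick off an election now, not after a timeout
		n.IsLeader = false
	}
}

func main() {
	n := &Node{ID: "n1", IsLeader: true, LeaderEligible: true}
	prepareForSoftwareUpdate(n, func() { fmt.Println("proactive leader election started") })
	fmt.Printf("%s leader=%v eligible=%v\n", n.ID, n.IsLeader, n.LeaderEligible)
}
```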

References:

Builders Library Article on leader election
DynamoDB 2022 Paper
Twitter Thread on leader election

