Posts

Showing posts from November, 2022

Deploying large scale distributed systems

Deploying services such as DynamoDB is challenging. DynamoDB is a distributed system with many servers serving mission-critical workloads for customers. Some of these nodes are stateless and some are stateful (they store customer data). Unlike a traditional relational database, DynamoDB handles deployments without the need for maintenance windows. Deployments need to be safe, without impacting security, durability, availability or performance. This blog covers critical tips, learned over days, months and years of deploying distributed services at Amazon DynamoDB.

Deployment challenges

Roll-backs

Distributed system deployments are non-atomic. A deployment takes the software from one state to another. It's not just the start state and the end state of the software that matter; there could be times when the newly deployed software doesn't work and needs a rollback. The rolled-back state might be different from the initial state of the software. The rollback procedu

Tale of leader election, failure detection and false positives

Building distributed systems requires tackling many trade-offs. Failure detection of a leader is a specific example of the trade-off between failure detection time and latency. This blog post captures insights learned over years of building DynamoDB on how to reduce the impact of failure detection on latency and availability. The learnings are also captured in this DynamoDB 2022 paper and Twitter thread.

Let's understand the concepts of distributed databases, leaders, followers and leader election. A distributed database table for a system like DynamoDB is divided into multiple partitions to handle the throughput and storage requirements of the table. Each partition of the table hosts a disjoint and contiguous part of the table's key-range. Each partition has multiple replicas distributed across different Availability Zones for high availability and durability. One of the replicas is the leader and the rest of the replicas are followers. The leader replica coordinates all consistent reads and writes
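To make the partitioning model above concrete, here is a minimal sketch of how a key could be mapped to its partition and that partition's leader. It is an illustration only, not DynamoDB's implementation: the `Partition` and `Table` classes, the replica names, the Availability Zone labels and the use of plain string keys are all assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Partition:
    # Hypothetical model: each partition owns the contiguous key range
    # starting at start_key (inclusive) up to the next partition's start_key.
    start_key: str
    replicas: dict  # replica name -> Availability Zone (illustrative labels)
    leader: str     # name of the replica currently acting as leader


class Table:
    def __init__(self, partitions):
        # Sorting by start_key keeps the ranges disjoint and contiguous.
        self.partitions = sorted(partitions, key=lambda p: p.start_key)
        self.starts = [p.start_key for p in self.partitions]

    def partition_for(self, key):
        # Binary search for the last partition whose start_key <= key.
        idx = bisect_right(self.starts, key) - 1
        if idx < 0:
            raise KeyError(f"no partition covers key {key!r}")
        return self.partitions[idx]

    def leader_for(self, key):
        # In this sketch, writes and consistent reads are sent to the leader.
        return self.partition_for(key).leader


table = Table([
    Partition("", {"r1": "az-a", "r2": "az-b", "r3": "az-c"}, leader="r1"),
    Partition("m", {"r4": "az-a", "r5": "az-b", "r6": "az-c"}, leader="r5"),
])
print(table.leader_for("apple"))
print(table.leader_for("zebra"))
```

Keeping the partition start keys in a sorted list makes the lookup a binary search. If the leader of a partition changes after an election, only that partition's `leader` field needs to be updated in this model.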