Posts

Amazon MemoryDB for Redis: A Marriage of Speed and Durability

Forget trade-offs. Imagine a “database” that delivers in-memory speed with 11 9’s of durability. It was a joy to read the paper “Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database”, and it prompted me to share what I admire about its innovative approach. It fundamentally challenges the notion that speed and durability are opposing forces in database design. MemoryDB breaks the mold by decoupling the storage engine (Redis) from the durability layer (transaction log). Technically, the concept of separating storage and durability isn’t entirely new. Most systems do partial decoupling (shipping transaction logs off the box). However, the complete decoupling, the focus on in-memory performance, and the level of consistency it offers make it unique. Redis does not offer a replication solution that can tolerate the loss of nodes without data loss or that can offer scalable, strongly consistent reads. The paper highlights that while Redis boasts impressive microsecond latencies and the ability...
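To make the decoupling idea concrete, here is a minimal sketch, not MemoryDB's actual implementation. The names `TransactionLog`, `InMemoryEngine`, `durable_set` and `rebuild` are hypothetical, the log is a plain Python list standing in for a replicated service, and the ordering (apply in memory, then append, then acknowledge) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TransactionLog:
    """Hypothetical stand-in for a durable, replicated transaction log."""
    entries: list = field(default_factory=list)

    def append(self, record):
        # A real log would replicate the record before returning.
        self.entries.append(record)
        return len(self.entries) - 1  # position of the record in the log


@dataclass
class InMemoryEngine:
    """Hypothetical stand-in for the in-memory storage engine."""
    data: dict = field(default_factory=dict)

    def apply(self, record):
        key, value = record
        self.data[key] = value


def durable_set(engine, log, key, value):
    """Apply a write in memory, then return only once the log has the record."""
    record = (key, value)
    engine.apply(record)
    position = log.append(record)
    return position  # returning here models the acknowledgement to the client


def rebuild(log):
    """Recover a fresh engine purely by replaying the log."""
    engine = InMemoryEngine()
    for record in log.entries:
        engine.apply(record)
    return engine
```

Because the engine never owns durability in this sketch, losing it costs nothing that `rebuild(log)` cannot restore.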

Don't Just Work, Work with Purpose

Ever felt lost at work? Ever found yourself asking, 'What am I doing?' or 'Why am I doing this?' It's a sentiment many of us can relate to. I've pondered these questions numerous times throughout my career. One thing I've come to realize is that working in an environment where the foundations of a team are rooted in clear values and purpose leads to a vastly different experience. When everyone understands what matters most, decision-making becomes smoother and frustration decreases across the team. This doesn't mean you won't ponder these questions anymore, but the confusion will be less. Similarly, when you have a clear purpose, you feel excited about what you are doing. When your values and purpose clash with those of your workplace or peers, it can lead to frustration and a feeling of limited impact. It's almost like constantly swimming against the current, which is exhausting. On the other hand, if your values are aligned...

Confessions of a recovering speaker

Ever feel that sinking feeling after a presentation? The one where your audience seems to check out and the silence screams "unengaged"? Been there, done that. After bombing a talk I delivered to my teammates, I decided to improve my presentation skills. Inside Amazon, engineers have the opportunity to deliver POA (Principal Of Amazon) talks. These presentations, crafted by senior engineers, share knowledge and insights with a wider audience. A significant amount of effort goes into creating these presentations. Each speaker is assigned a dedicated coach who guides them through the process, helping polish the presentation and ensuring the message is delivered clearly and succinctly. My experience delivering multiple POA talks revealed a simple secret from the coaches: prioritize crafting a script before building slides. During my first POA talk in 2014, I arrived with a prepared slide deck. However, the coach's ...

When Hours Drag and Decades Dash

For more than a decade, I've had the privilege of being an engineer on the DynamoDB team, part of an exceptional group of individuals dedicated to tackling highly ambiguous challenges. I joined the team as an engineer with limited knowledge of databases and none of NoSQL databases. The fact that I was a novice both intimidated and excited me, as it meant I would constantly face challenges that would push me to expand my understanding. One particular instance that I vividly recall from my early days at AWS was being welcomed into a room with three senior engineers talking about solving a scaling challenge in the replication protocol of SimpleDB. The details of the specific problem are explained in the AWS article titled Summary of the Amazon SimpleDB service disruption. The opportunity to work on that project was solely due to my display of curiosity and eagerness to learn. The challenge was quite daunting because I had to ramp up on distributed systems, had t...

Operating large scale distributed systems

Operating large scale distributed systems is hard. This blog shares some insights that I have learnt over the years working at AWS. Most systems are designed to handle failures. One important aspect that is often overlooked in system design is what happens when the failed component comes back online, also known as failback. Sometimes it is the recovery that triggers the bigger impact. Consider, for example, a data center going down because of a power loss. Most systems handle data center failures by shifting traffic away from the data center that crashed. After the data center recovers, what if all the nodes come back online at the same time? Can the system handle the load generated by the nodes coming back up? Can it create a negative feedback loop and take the system down by hitting part of the system at the same time? Asking what happens when every node in the crashed data center returns at once is as important as asking what happens when the data center fails. Be min...
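One common way to soften the "everything comes back at once" problem is to spread restarts over a time window with random jitter. The sketch below is an illustration of that general idea, not a technique taken from this post; `restart_delays`, `peak_concurrent_starts`, the window size and the bucket size are all hypothetical.

```python
import random


def restart_delays(node_ids, window_seconds, seed=None):
    """Assign each recovering node a random start delay within the window."""
    rng = random.Random(seed)
    return {node: rng.uniform(0, window_seconds) for node in node_ids}


def peak_concurrent_starts(delays, bucket_seconds):
    """Count the largest number of nodes starting in any one time bucket."""
    buckets = {}
    for delay in delays.values():
        bucket = int(delay // bucket_seconds)
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return max(buckets.values())


if __name__ == "__main__":
    nodes = range(100)
    all_at_once = {node: 0.0 for node in nodes}
    jittered = restart_delays(nodes, window_seconds=60.0, seed=1)
    print("peak without jitter:", peak_concurrent_starts(all_at_once, 5.0))
    print("peak with jitter:", peak_concurrent_starts(jittered, 5.0))
```

Without jitter, all 100 nodes land in the same five-second bucket; with jitter, the same load is spread across the window, at the cost of a slower full recovery.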

Deploying large scale distributed systems

Deploying services such as DynamoDB is quite challenging. DynamoDB is a distributed system with a lot of servers serving mission critical workloads for customers. Some of these nodes are stateless and some are stateful (they store customer data). Unlike a traditional relational database, DynamoDB takes care of deployments without the need for maintenance windows. Deployments need to be safe, without impacting security, durability, availability or performance. This blog covers critical tips that took days, months and years to learn while deploying distributed services at Amazon DynamoDB. Deployment challenges: roll-backs. Distributed system deployments are non-atomic. A deployment takes the software from one state to another state. It’s not just the end state and the start state of the software that matter; there could be times when the newly deployed software doesn’t work and needs a rollback. The rolled-back state might be different from the initial state of the software. The rollback pro...

Tale of leader election, failure detection and false positives

Building distributed systems requires tackling many trade-offs. Failure detection of a leader is a specific example of the trade-off between failure detection time and latency. This blog post captures insights learnt over the years building DynamoDB to reduce the impact of failure detection on latency and availability. The learnings are captured in this DynamoDB 2022 Paper and Twitter Thread as well. Let’s understand the concept of distributed databases, leaders, followers and leader election. A distributed database table for a system like DynamoDB is divided into multiple partitions to handle the throughput and storage requirements of the table. Each partition of the table hosts a disjoint and contiguous part of the table’s key-range. Each partition has multiple replicas distributed across different Availability Zones for high availability and durability. One of the replicas is a leader and the rest of the replicas are followers. The leader replica coordinates all consiste...
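The idea of partitions owning disjoint, contiguous slices of a key-range can be sketched with a sorted list of split points and a binary search. This is only an illustration: `PartitionMap` and its split points are hypothetical, and plain string keys are an assumption rather than a description of how DynamoDB lays out its key space.

```python
import bisect


class PartitionMap:
    """Maps keys to partitions that own disjoint, contiguous key ranges."""

    def __init__(self, split_points):
        # Each split point is the lowest key owned by the next partition,
        # so N split points define N + 1 partitions.
        self.split_points = sorted(split_points)

    def partition_for(self, key):
        # bisect_right counts the split points that are <= key,
        # which is exactly the index of the owning partition.
        return bisect.bisect_right(self.split_points, key)


if __name__ == "__main__":
    table = PartitionMap(["g", "p"])
    for key in ["apple", "grape", "melon", "peach", "zebra"]:
        print(key, "-> partition", table.partition_for(key))
```

Every key maps to exactly one partition, and neighbouring keys tend to land in the same partition, which is what "disjoint and contiguous" buys you.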