Deploying large-scale distributed systems
Deploying services such as DynamoDB is challenging. DynamoDB is a distributed system with a large fleet of servers serving mission-critical customer workloads. Some of these nodes are stateless and some are stateful (they store customer data). Unlike a traditional relational database, DynamoDB handles deployments without maintenance windows, and every deployment must be safe: no impact to security, durability, availability, or performance. This blog covers hard-won lessons from years of deploying distributed services at Amazon DynamoDB.
Deployment challenges
Rollbacks
Distributed system deployments are non-atomic. A deployment takes the software from one state to another, but it is not just the start and end states that matter: there can be times when the newly deployed software does not work and needs a rollback, and the rolled-back state might be different from the initial state of the software. The rollback procedure is often missed in testing and can lead to customer impact.
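To make the rollback hazard concrete, here is a minimal sketch in Python (the record format and functions are illustrative assumptions, not DynamoDB internals): records persisted while the new software was live survive the rollback, so the rolled-back software lands in a state that never existed before the deployment.

    import json

    # Hypothetical on-disk record formats; version 2 adds a checksum field.
    def write_record_v1(store, key, value):
        store[key] = json.dumps({"fmt": 1, "value": value})

    def write_record_v2(store, key, value):
        store[key] = json.dumps({"fmt": 2, "value": value, "checksum": hash(value)})

    def read_record_v1(store, key):
        record = json.loads(store[key])
        # v1 code that assumes every record is format 1 breaks after a rollback,
        # because records written by the v2 software are still on disk.
        if record["fmt"] != 1:
            raise ValueError("unknown record format: %d" % record["fmt"])
        return record["value"]

    store = {}
    write_record_v1(store, "a", "old data")    # initial state: only v1 records
    write_record_v2(store, "b", "new data")    # after upgrade: v2 records appear
    read_record_v1(store, "a")                 # still fine after rollback
    try:
        read_record_v1(store, "b")             # rolled-back state != initial state
    except ValueError as err:
        print("rollback hazard:", err)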
Complex protocol deployments
The non-atomic nature of deployments poses another challenge: multiple versions of the software will be running in production at the same time. Nodes in a distributed system typically interact with each other through messages that follow a specific protocol. If a deployment introduces a new message, not all nodes will know how to handle it yet. Similarly, upgrading from an older protocol to a newer one means some nodes will not understand the newer protocol. Changing the protocol or introducing a new message without care can bring the whole system down.
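As a rough sketch of how a node can survive a mixed-version fleet (the message types here are made up for illustration), a handler that treats unknown messages as a normal, answerable case keeps the system running while old and new software coexist:

    # Message types this node's version understands; a newer version elsewhere
    # in the fleet may already be sending types that are not in this set.
    SUPPORTED_MESSAGES = {"HEARTBEAT", "REPLICATE"}

    def handle_message(message):
        msg_type = message["type"]
        if msg_type not in SUPPORTED_MESSAGES:
            # Failing hard here could take the whole system down mid-rollout,
            # so reply with an explicit "unsupported" response instead.
            return {"status": "UNSUPPORTED_MESSAGE", "type": msg_type}
        if msg_type == "HEARTBEAT":
            return {"status": "OK"}
        return {"status": "OK", "applied": True}

    # A node running newer software sends a message type this node has never seen:
    print(handle_message({"type": "LEASE_TRANSFER"}))   # graceful reply, not a crash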
Availability during deployments
Distributed systems that elect leaders generally detect failures based on lease thresholds. Deployments can take leaders down, and the time between detecting the failure and electing a new leader can stretch to a few seconds. During those seconds the system can be unavailable, which impacts customer traffic.
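A small sketch of lease-based failure detection (timings and names are illustrative, not DynamoDB's actual values) shows where the gap comes from: followers wait for the lease to lapse before they even start an election.

    import time

    LEASE_DURATION_SECS = 2.0    # illustrative lease threshold

    class LeaderLease:
        def __init__(self):
            self.last_heartbeat = time.monotonic()

        def renew(self):
            self.last_heartbeat = time.monotonic()

        def expired(self):
            # Followers only declare the leader failed after the lease lapses, so a
            # deployment that silently stops the leader costs up to LEASE_DURATION_SECS
            # of unavailability before a new election even starts.
            return time.monotonic() - self.last_heartbeat > LEASE_DURATION_SECS

    lease = LeaderLease()
    time.sleep(LEASE_DURATION_SECS + 0.1)    # leader taken down by a deployment; heartbeats stop
    if lease.expired():
        print("lease expired; starting leader election")   # writes stall until it completes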
Distributed Deployments
There is already a lot of information on the internet about blue/green deployments and other deployment best practices, so I am not going to cover those in this blog. At a high level, all deployments must be done in a staged manner: the first stage should target only a small set of nodes before pushing the change to the entire fleet. The main advantage of this strategy is that it reduces the potential impact of a faulty deployment. Deploying to a small subset of the fleet is not enough on its own; the fleet also needs to be monitored on the right critical metrics (error rate, latency, etc.) before each subsequent stage.
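A staged rollout with a metrics gate between waves can be sketched like this (wave sizes, metric names, thresholds, and the Node interface are assumptions for illustration, not DynamoDB's deployment system):

    WAVES = [0.01, 0.10, 0.50, 1.00]    # fraction of the fleet per wave
    ERROR_RATE_LIMIT = 0.001
    P99_LATENCY_LIMIT_MS = 20.0

    class Node:
        """Stand-in for a fleet host; a real system would call its deployment agent."""
        def __init__(self):
            self.version = "current"
        def install_new_version(self):
            self.version = "target"
        def rollback(self):
            self.version = "current"
        def error_rate(self):
            return 0.0       # stubbed metrics for the sketch
        def p99_latency_ms(self):
            return 5.0

    def metrics_healthy(fleet):
        return (max(n.error_rate() for n in fleet) <= ERROR_RATE_LIMIT and
                max(n.p99_latency_ms() for n in fleet) <= P99_LATENCY_LIMIT_MS)

    def staged_rollout(fleet):
        for fraction in WAVES:
            for node in fleet[:max(1, int(len(fleet) * fraction))]:
                node.install_new_version()
            if not metrics_healthy(fleet):
                # A bad wave only touched part of the fleet: roll everything back and stop.
                for node in fleet:
                    node.rollback()
                raise RuntimeError("rollout aborted at %.0f%% of the fleet" % (fraction * 100))

    staged_rollout([Node() for _ in range(100)])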
Upgrade Downgrade Tests
Deployment systems must strive to make sure every deployment can be rolled back without question. Before every deployment, DynamoDB runs a suite of upgrade and downgrade tests at a component level: the software is deployed, rolled back on purpose, and validated with functional tests at each step. Upgrade-downgrade tests require one full cluster configured exactly like production and two deployment artifacts (the production build and the target build). We upgrade one storage node at a time in the cluster and run functional tests. If the first upgrade succeeds, we deploy the software to the rest of the storage nodes and run the functional tests again. We then start the downgrade by rolling back one storage node and kicking off the functional tests. Once those pass, we roll back the software on the rest of the storage nodes and do one final run to verify that all the functional tests still pass. This testing procedure is the best way to gain confidence in deploying complex distributed algorithms.
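The procedure above maps fairly directly onto a test harness. The sketch below assumes hypothetical cluster and test-suite interfaces; it captures the shape of the sequence, not DynamoDB's actual tooling.

    def upgrade_downgrade_test(cluster, current_build, target_build, run_functional_tests):
        """Exercise every intermediate fleet state a deployment and rollback would visit."""
        nodes = cluster.storage_nodes()

        # 1. Upgrade a single node and verify the mixed-version fleet still works.
        nodes[0].install(target_build)
        run_functional_tests(cluster)

        # 2. Upgrade the rest of the fleet and verify again.
        for node in nodes[1:]:
            node.install(target_build)
        run_functional_tests(cluster)

        # 3. Roll one node back: a mixed fleet again, in the other direction.
        nodes[0].install(current_build)
        run_functional_tests(cluster)

        # 4. Roll back the rest and do one final verification.
        for node in nodes[1:]:
            node.install(current_build)
        run_functional_tests(cluster)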
Two Phase Deployments
DynamoDB handles complex protocol changes with read-write deployments, which are completed as a multi-step process. The first step deploys software that can read the new message format or protocol but does not yet send it. Once all the nodes can handle the new message, a second deployment enables sending it. Splitting the change this way ensures that the code that understands the new message is fully deployed first and that each step can be rolled back independently. Read-write deployments ensure that both types of messages can coexist in the system.
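In code, the two phases can be thought of as a reader that understands both formats plus a send-side flag that a later deployment flips (the flag, message fields, and function names are assumptions for this sketch):

    # Phase 1 artifact: every node can READ version 2 messages, but sending them
    # stays behind a flag that defaults to off. The phase 2 "deployment" only
    # flips the flag once phase 1 has reached the whole fleet.
    SEND_V2_MESSAGES = False

    def encode_replication_message(payload):
        if SEND_V2_MESSAGES:
            return {"version": 2, "payload": payload, "checksum": hash(str(payload))}
        return {"version": 1, "payload": payload}

    def decode_replication_message(message):
        # The reader accepts both formats from phase 1 onward, so v1 and v2
        # messages can coexist in the fleet and either phase can be rolled back.
        if message["version"] == 2:
            return message["payload"]    # checksum validation would happen here
        return message["payload"]

    print(decode_replication_message(encode_replication_message({"key": "a"})))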
Ensuring High Availability
Delayed failure detection can cause availability impact, and the goal during deployments is to keep that impact minimal. DynamoDB shortcuts failure detection by sending a message to the leader replicas in the data center that is about to be deployed. Upon receiving the message, a leader replica relinquishes leadership, which immediately triggers a new leader election instead of waiting for the lease to expire.
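A sketch of that handoff (the RPC name and replica interface are hypothetical) looks like this: the election starts right away rather than after the deployed leader's lease times out.

    def prepare_zone_for_deployment(zone, replicas):
        """Ask every leader hosted in the zone to step down before deploying to it."""
        for replica in replicas:
            if replica.zone == zone and replica.is_leader():
                # Peers start electing a new leader immediately rather than waiting
                # for the old leader's lease to expire mid-deployment.
                replica.send("RELEASE_LEADERSHIP")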