Operating large-scale distributed systems


Operating a large-scale distributed system is hard. This blog shares some insights I have learnt over the years working at AWS.



Most systems are designed to handle failures. One aspect that is often overlooked in system design is what happens when the failed component comes back online, also known as failback. Sometimes it is the recovery that triggers the bigger impact. Consider, for example, a data center going down because of a power loss. Most systems handle a data-center failure by shifting traffic away from the data center that crashed. But what happens after the outage is over, if all the nodes in that data center come back online at the same time? Can the system handle the load generated by the nodes rejoining? Can a synchronized recovery create a feedback loop that takes the system down by hammering one part of it all at once? Be mindful of both failover and failback.
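As a rough illustration of one common mitigation, here is a minimal Python sketch that adds randomized jitter so recovering nodes rejoin over a window instead of all at once. The window size and the register_with_control_plane() call are assumptions made for this example, not a description of any specific AWS system.

```python
import random
import time

# Hypothetical sketch: stagger how recovering nodes re-register after a
# data-center-wide restart so the control plane is not hit all at once.
# MAX_STARTUP_DELAY_SECONDS and register_with_control_plane() are invented
# names for illustration.

MAX_STARTUP_DELAY_SECONDS = 300  # spread rejoins over a 5-minute window


def register_with_control_plane(node_id: str) -> None:
    # Placeholder for the real registration / health-check call.
    print(f"{node_id} re-registered")


def rejoin_after_recovery(node_id: str) -> None:
    # Full jitter: each node waits a random fraction of the window before
    # rejoining, turning a synchronized spike into a gentle ramp.
    delay = random.uniform(0, MAX_STARTUP_DELAY_SECONDS)
    time.sleep(delay)
    register_with_control_plane(node_id)


if __name__ == "__main__":
    rejoin_after_recovery("node-42")
```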

There is no reason to treat configs as special and different from code. Every change to the system, whether it is a config change, a database change, or a code change, has the potential to cause a huge impact on the operating service. It is therefore important to treat all changes with the same level of scrutiny and to test them thoroughly before deploying them to the production environment.
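As a hedged sketch of what "treating config like code" can look like in practice, the snippet below validates a JSON config file in CI so that a bad change fails the build before it ever reaches production. The required keys and bounds are invented for illustration.

```python
import json
import sys

# Illustrative config gate: run this as a CI step on every config change,
# just like a unit test on a code change. The keys and limits are made up.

REQUIRED_KEYS = {"max_connections", "request_timeout_ms", "retry_limit"}


def validate_config(path: str) -> list[str]:
    errors = []
    with open(path) as f:
        config = json.load(f)  # a malformed file fails the build here

    missing = REQUIRED_KEYS - config.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if config.get("request_timeout_ms", 0) <= 0:
        errors.append("request_timeout_ms must be positive")
    if config.get("retry_limit", 0) > 10:
        errors.append("retry_limit looks suspiciously high")
    return errors


if __name__ == "__main__":
    problems = validate_config(sys.argv[1])
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # block the deployment, just like a failing test
```

A validated config can then flow through the same staged rollout as a code change, rather than being pushed everywhere in one step.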

One useful way to test changes, including config changes, against a production environment is canary testing: running a small set of test applications that continuously simulate the customer experience. These canaries can run from multiple locations and exercise all of the APIs offered by the service. Hitting the service endpoints from multiple data centers helps ensure that all of the service's public endpoints are functional and that customer traffic originating from anywhere on the internet is not affected by changes to the system.
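A minimal canary might look something like the sketch below: it probes each public endpoint on a schedule and emits a success and latency metric per probe. The endpoint list and the emit_metric() helper are placeholders; in practice one copy of this canary would run from every region or data center that serves customer traffic.

```python
import time
import urllib.request

# Illustrative canary: periodically exercise each public API endpoint and
# emit a metric per probe. Endpoints and emit_metric() are placeholders.

ENDPOINTS = [
    "https://api.example.com/v1/health",
    "https://api.example.com/v1/list-items",
]


def emit_metric(name: str, value: float) -> None:
    # Stand-in for publishing to a real metrics/alarming system.
    print(f"{name}={value}")


def probe(url: str) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            ok = 1.0 if resp.status == 200 else 0.0
    except Exception:
        ok = 0.0
    emit_metric(f"canary.success.{url}", ok)
    emit_metric(f"canary.latency_ms.{url}", (time.monotonic() - start) * 1000)


if __name__ == "__main__":
    while True:
        for endpoint in ENDPOINTS:
            probe(endpoint)
        time.sleep(60)  # one probe cycle per minute
```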

Run-books play an important part in operating a large-scale distributed system. A run-book contains detailed instructions for common operational procedures. Each team strives to automate these run-books, or to fix the underlying bug that made the run-book necessary, but there are cases where automation is not possible or is delayed because of team priorities. In those cases, ensuring run-books are linked to the tickets cut by the system helps operators immensely in quickly getting to the root cause of an issue and taking appropriate action to resolve the customer impact.
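As a small, hypothetical example of that linkage, an alarm-to-ticket pipeline can carry the run-book URL in the ticket body, so the operator who gets paged starts one click away from the instructions. The alarm names, URLs, and create_ticket() call below are made up for illustration.

```python
# Hypothetical sketch of attaching a run-book link when an alarm cuts a
# ticket. None of these names refer to a real internal system.

RUNBOOKS = {
    "HighErrorRate": "https://wiki.internal/runbooks/high-error-rate",
    "DiskSpaceLow": "https://wiki.internal/runbooks/disk-space-low",
}


def create_ticket(title: str, description: str) -> None:
    # Stand-in for the real ticketing system API.
    print(title)
    print(description)


def cut_ticket_for_alarm(alarm_name: str, details: str) -> None:
    runbook = RUNBOOKS.get(alarm_name, "No run-book linked yet -- add one!")
    create_ticket(
        title=f"[{alarm_name}] automated alarm",
        description=f"{details}\n\nRun-book: {runbook}",
    )


if __name__ == "__main__":
    cut_ticket_for_alarm("HighErrorRate", "5xx rate above 1% for 5 minutes")
```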

Different operational scenarios may require different tools, and it's important to select the ones best suited to the task at hand. For example, a rack outage may require replacing an entire rack of machines, whereas other cases, such as a data-center outage, may call for a different approach altogether.

Making high-judgment calls is another challenge that operators of large-scale distributed systems face. In emergency situations, it can be difficult to make decisions quickly and accurately. To address this, it's important to proactively write automated scripts and run-books that can handle common operations. This can help to ensure that operators are able to respond quickly and effectively, even when faced with complex or unexpected situations. 

Even when scripts and tools are used, it's important to have a peer verify the commands or inputs before they are executed. This protects against human error and helps ensure the correct actions are taken to recover from an outage. Logging every command that is executed also makes recovery and root-cause investigations after the event much easier.
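One way to get both properties, sketched below with invented names, is a thin wrapper that refuses to run a command until a peer re-types it, and that logs every attempt to a file for later review.

```python
import getpass
import logging
import shlex
import subprocess

# Sketch (not a real internal tool): require a second operator to confirm a
# command before it runs, and log exactly what was executed and by whom so
# post-incident reviews can reconstruct the timeline.

logging.basicConfig(filename="operator_commands.log",
                    format="%(asctime)s %(message)s",
                    level=logging.INFO)


def run_with_peer_check(command: str) -> int:
    operator = getpass.getuser()
    print(f"About to run: {command}")
    confirmation = input("Peer reviewer, type the command again to confirm: ")
    if confirmation.strip() != command.strip():
        logging.info("REJECTED operator=%s command=%r", operator, command)
        print("Confirmation did not match; aborting.")
        return 1

    logging.info("EXECUTING operator=%s command=%r", operator, command)
    result = subprocess.run(shlex.split(command))
    logging.info("FINISHED operator=%s command=%r rc=%d",
                 operator, command, result.returncode)
    return result.returncode


if __name__ == "__main__":
    run_with_peer_check("ls -l /var/log")
```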
