Take nobody's word for it

Operating a large-scale distributed system is hard. This blog shares some insights that I have learned over years of working at AWS. Most systems are designed to handle failures. One important aspect that is often overlooked in system design is what happens when the failed component comes back online, also known as failback. Sometimes it is the recovery that triggers the bigger impact. Consider, for example, a data center going down because of a power loss. Most systems handle data center failures by shifting traffic away from the data center that crashed. Once the outage is over, what if all the nodes come back online at the same time? Can the system handle the load generated by the nodes coming back up? Could the recovery create a self-reinforcing feedback loop and take the system down by hitting one part of it all at once? Asking what happens when every node in the crashed data center suddenly comes back up is just as important as planning for the failure itself. Be min...
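
To make the failback concern concrete, here is a minimal sketch of one common mitigation: spreading node rejoins over a time window with random jitter, so the recovering data center does not hit shared dependencies all at once. This is an illustration under my own assumptions, not a description of how any particular AWS system works. The function names, the node IDs, and the 60-second window are all hypothetical.

```python
import random


def staggered_rejoin_delays(node_ids, window_seconds, seed=None):
    """Assign each recovering node a random delay (in seconds) within the window.

    Hypothetical helper: a real system would apply the delay before the node
    starts taking traffic or calling shared dependencies.
    """
    rng = random.Random(seed)
    return {node: rng.uniform(0, window_seconds) for node in node_ids}


def peak_rejoins_per_second(delays):
    """Return the largest number of nodes that rejoin within any one-second bucket."""
    buckets = {}
    for delay in delays.values():
        bucket = int(delay)  # truncate to the whole second the rejoin falls in
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return max(buckets.values()) if buckets else 0


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(1000)]  # hypothetical fleet

    # No jitter: every node rejoins in the same second.
    together = staggered_rejoin_delays(nodes, window_seconds=0)
    print("peak without jitter:", peak_rejoins_per_second(together))

    # Jitter over an assumed 60-second window: rejoins are spread out.
    spread = staggered_rejoin_delays(nodes, window_seconds=60, seed=1)
    print("peak with jitter:", peak_rejoins_per_second(spread))
```

The trade-off in this sketch is a slower full recovery in exchange for a much lower peak load on whatever the nodes contact when they start up. The right window size depends on how much headroom those dependencies have.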
