Availability: What Does It Really Mean?


Metrics are critical for all cloud services, and one of the most commonly tracked is availability. The general definition of availability is “the quality of being able to be used or obtained”. The key word is “quality”, but the definition of quality is vague, so I have always wondered how different services choose to implement it. If a cloud service serves multiple types of requests, should the availability of each request type be considered when evaluating overall availability? Some request types are called more often than others; are they all equally important? Every request has a delay between the customer sending it and receiving a response; does that delay factor into the availability metric, or does only the number of requests that got a response count? What time scale should the metric use: seconds, minutes, hours, days, weeks, months, or something else?

I work at AWS (SimpleDB, DynamoDB, Keyspaces) and have built services where availability is one of the key tenets. This post summarizes what I have learned over the years about availability while running distributed services at scale. I also found an interesting paper, “Meaningful Availability”, from the Google G Suite team, and I will cover some of its key points and what I found interesting.


Let’s start with the question of why availability is important for a service. Availability is important because customers rely on the service to run their business. If the service is down, the business is impacted. Outages result in unhappy customers and can cost millions of dollars, and no one wants unhappy customers. Most cloud services provide service level agreements on availability. Services also track availability metrics to discover problems in the stack before customers do, so that we can take corrective action to improve the systems. This is generally done by setting alarms on the availability metrics at a stricter threshold than the external one (if the external commitment is that 99.9% of requests succeed, the internal threshold could be far more aggressive, even alarming on a single failed request). These internal alarms can either trigger automated corrective actions or page a human to take care of the issue.


Correctly defining the availability metric is critical. If the metric is not defined correctly, it can guide us toward the wrong action plan. The next obvious question is: what is the definition of availability? At a high level, availability is a measure that varies between 0 and 100, where 0 means never available and 100 means always available. This is very general, and we need a more specific definition. So let's start with the tenets for our availability metric.

Tenets of a good availability metric are:
  1. The availability metric must reflect what the user of the service is experiencing.
  2. A change in the availability metric should reflect a proportional change in the availability perceived by the customer.
  3. The availability metric must provide insights that are actionable.

A commonly used metric for availability is the success ratio: the count of successful requests divided by the count of total requests. This is one of the most popular ways to calculate availability, but it has a problem. The success ratio can undercount the customer impact. If a customer mostly gets successful responses and sees an occasional error at regular intervals, the customer will be quite happy, because occasional errors can be handled by retries. Retries do degrade the user experience, but the customer is at least able to make “slow” progress. On the other hand, if the customer gets the same number of errors packed into one short interval, the customer will be unhappy. Even though the number of errors is the same, the period over which they occur changes the user experience. Depending on how long the continuous errors last, this might mean a multi-second, multi-minute, or multi-hour outage for the customer.

Similarly, the success ratio can overcount the user impact. When clients get errors, the rule of thumb is to retry, most commonly with exponential backoff. Retries from many clients distributed across different data centers or locations can still arrive in a short period. The total number of requests increases, and so does the number of failing requests, which leads to miscalculating the impact. So overall counts are not the right way to track customer errors and the experience perceived by the customer.
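The undercounting problem can be seen with a small sketch. The data below is made up for illustration: two hypothetical customers each send 100 requests and each see 10 errors, once spread out and once clustered. The helper names are my own and not from any particular library.

```python
def success_ratio(outcomes):
    """Return successes / total for a list of booleans (True = success)."""
    if not outcomes:
        raise ValueError("no requests recorded")
    return sum(outcomes) / len(outcomes)


def longest_error_streak(outcomes):
    """Return the length of the longest run of consecutive failures."""
    longest = current = 0
    for ok in outcomes:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


# Hypothetical data: 100 requests and 10 failures for each customer.
spread_out = [i % 10 != 0 for i in range(100)]        # every 10th request fails
clustered = [not (40 <= i < 50) for i in range(100)]  # 10 failures in a row

print(success_ratio(spread_out), success_ratio(clustered))                # 0.9 0.9
print(longest_error_streak(spread_out), longest_error_streak(clustered))  # 1 10
```

Both customers have the same success ratio, yet the second one sat through ten consecutive failures, which is the experience the ratio alone does not show.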

Another common metric used by many services is the uptime ratio. This is the time for which the system was up divided by the total time (uptime plus downtime). This metric is better than the count because it measures customer impact in terms of time. However, it has similar problems when it comes to capturing the availability the customer perceives. Time-based metrics are generally set up to count downtime only after a certain threshold, i.e. only if the errors persist for a certain duration. This goes against the tenet that the availability metric should be proportional to the impact the customer is experiencing: if the threshold is not hit, customers are still impacted, but the metric does not reflect it.

Availability in cloud services is usually tracked at a global level. This also has a problem, because outages are generally local, so a global number can hide an outage that affects only some users. Just as nature has built some parts of the human body with redundancy, crucial parts of distributed systems are built with redundancy as well. These redundancies help improve availability. For example, a distributed database generally has multiple replicas. If one replica goes down, another replica is quickly built and traffic is routed away from the replica that went down. Similarly, in cross-region replicated systems, during a region outage traffic is routed away from the region that went down and all traffic is served from the second region.

To avoid the problems with the common approaches, we track the uptime ratio at the user level. The uptime and downtime of each specific user are tracked. To compute the user uptime ratio, the uptimes of all users are summed and divided by the sum of all users' uptime plus downtime.
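As a minimal sketch of that calculation, assume each user's uptime and downtime have already been measured in seconds. The function name, the dictionary layout, and the sample numbers are all hypothetical.

```python
def user_uptime_ratio(per_user):
    """Aggregate per-user time into one ratio.

    per_user maps a user id to (uptime_seconds, downtime_seconds).
    """
    total_up = sum(up for up, _ in per_user.values())
    total_down = sum(down for _, down in per_user.values())
    total = total_up + total_down
    if total == 0:
        raise ValueError("no active time recorded")
    return total_up / total


# Hypothetical measurements for three users.
sample = {
    "alice": (3600, 0),
    "bob": (3000, 600),   # bob saw ten minutes of downtime
    "carol": (1800, 0),
}

ratio = user_uptime_ratio(sample)
print(round(ratio, 4))  # 0.9333
```

Because the sums are taken over users, an outage that hits only one user still moves the number in proportion to the time that user lost.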

So what is uptime? What is downtime? How do we measure them? We have learned that defining uptime and downtime with a threshold is not a good idea. Instead, we measure uptime by looking at user requests. Once a request is served successfully, the system is considered up for that user until there is a reason to believe it is down. Similarly, once a request fails, the system is considered down for that user until a later request shows otherwise. This is how uptime and downtime are measured. You might be wondering: if the system went down and the user did not make any requests, will this not be misleading as well? To solve this problem, the system also keeps track of inactive time so that it can be removed from the calculation.
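Here is a rough sketch of that idea for a single user. It assumes each request is recorded as a (timestamp, succeeded) pair, and it assumes a simple inactivity rule that I picked for illustration: the time after a request counts as up or down for at most `inactivity_cutoff` seconds, and anything beyond that is treated as inactive. The cutoff value and the function name are hypothetical.

```python
def classify_time(events, inactivity_cutoff=60.0):
    """Split one user's timeline into uptime, downtime, and inactive time.

    events: list of (timestamp_seconds, succeeded), sorted by timestamp.
    Each gap between two requests takes the state of the earlier request,
    capped at inactivity_cutoff; the remainder of the gap is inactive.
    """
    uptime = downtime = inactive = 0.0
    for (t0, ok), (t1, _) in zip(events, events[1:]):
        gap = t1 - t0
        counted = min(gap, inactivity_cutoff)
        if ok:
            uptime += counted
        else:
            downtime += counted
        inactive += gap - counted
    return uptime, downtime, inactive


# Hypothetical request log: two failures, then a long quiet period.
log = [(0, True), (10, True), (20, False), (25, False), (30, True), (200, True)]

print(classify_time(log))  # (80.0, 10.0, 110.0)
```

The 110 seconds of inactive time are left out of both uptime and downtime, so a quiet user neither inflates nor deflates the ratio.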
Another question I raised at the beginning was whether it makes sense to report the metric at a quarterly or monthly level. Even with the user uptime metric, the longer the reporting period, the less meaningful the metric becomes: short outages get lost in the forest of metrics at the month level. To solve this problem, we can make the user uptime windowed, which gives a better understanding of customer impact. The windowed uptimes are analyzed to find the worst window, and the process is repeated with bigger window sizes. Short windows expose short outages and large windows expose long outages.
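The worst-window search can be sketched as follows. This assumes uptime and downtime have already been aggregated into fixed-size buckets (one minute each in the made-up data below); the function name and the bucket layout are my own choices for illustration.

```python
def worst_window(up, down, window):
    """Return the lowest availability over any run of `window` buckets.

    up, down: per-bucket uptime and downtime in seconds (same length).
    Windows with no active time are skipped; returns None if none qualify.
    """
    worst = None
    for start in range(len(up) - window + 1):
        u = sum(up[start:start + window])
        d = sum(down[start:start + window])
        if u + d == 0:
            continue  # nobody was active in this window
        ratio = u / (u + d)
        if worst is None or ratio < worst:
            worst = ratio
    return worst


# Hypothetical data: twelve one-minute buckets, with minute 5 fully down.
up = [60] * 12
down = [0] * 12
up[5], down[5] = 0, 60

for size in (1, 3, 12):
    print(size, worst_window(up, down, size))
```

With this data the one-minute window bottoms out at zero, while the twelve-minute window still shows roughly 92%, which is how the same outage can look severe or mild depending on the window size.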

To summarize, a good availability metric should be meaningful, proportional, and actionable. These are quite often missed points about metrics. I have been lucky to work on some of the most advanced and scalable services in my career, and I have learned from stupid mistakes, such as underestimating the cost of data accumulating over time and the performance impact of emitting metrics on every request. A meaningful metric is one that captures the exact user experience. A proportional metric is one in which a change in the metric reflects the corresponding change in the user experience. An actionable metric helps the owners of the system identify the root cause and why the user experience is impacted, so that remedial action can be taken.
