Estimated Reading Time: 12 minutes
In the previous article, we briefly saw what Design Qualities are and their place in the design process.
Availability is one of the main pillars of Design Qualities in infrastructure design.
We will detail what Availability is, how IT can use it, and what you can expect from it.
Definition of Availability
Quick reminder: “Service” is the outcome, software, or feature delivered to the final client. “System” is a group of interdependent items that interact to perform a task (a software, feature, platform, hardware…).
Availability means being able to use a fully functional system in its normal state of production.
There can be nuances to this uptime: it is meant to indicate a perfectly functioning system in its intended state. So a system with degraded performance, disrupted user access, or that cannot be measured (monitoring systems down) can be described as unavailable, and impacts overall availability.
As with most non-functional requirements, Availability has SMART metrics (Specific, Measurable, Achievable, Relevant, Time-bound): it is the measure of the time during which the system is available versus the time during which it is unavailable, over a given period. The measurement can also be expressed as the percentage of time during which the system is considered available (for example, 95%).
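As a minimal sketch of this metric (the 30-day measurement window and the hour-level granularity are illustrative assumptions, not part of any standard):

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability = uptime / (uptime + downtime), as a percentage."""
    return 100.0 * uptime_hours / (uptime_hours + downtime_hours)

# Illustrative 30-day month (720 h) with 36 h of downtime:
print(round(availability(684.0, 36.0), 2))  # 95.0
```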
The time during which a service is available is commonly referred to as “uptime”, as opposed to “downtime”.
Availability is not the same as Recoverability. If Availability refers to the time during which the service is rendered, Recoverability refers to how quickly a service can be restored, either on the same system or on another system.
Recoverability can help Availability, as long as the system is not compromised and there is no loss of data.
Other Design Qualities can influence Availability too: Scalability can help achieve better metrics, Performance can contribute to a faster time to resolution, etc.
Availability measurements are very often used in IT contracts: SLAs (Service Level Agreements) and SLOs (Service Level Objectives) frequently include these metrics.
Example: the SaaS service vRealize Automation Cloud (https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/downloads/eula/vmw-vrealize-automation-cloud-sla-jan-2020.pdf) has an availability commitment of 99.9% (“three nines”).
This contractually sets a service “guarantee” (not technical, but contractual), implicitly reflecting the quality of the service delivered.
The closer this figure is to 100%, the better the quality of service – but beware of the measurement method and of the exclusions applied to it.
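To get a feel for what such a figure means, here is a rough sketch of the conversion a calculator like uptime.is performs (assuming an average Gregorian year of 365.2425 days):

```python
SECONDS_PER_YEAR = 365.2425 * 24 * 3600  # ~31,556,952 s

def yearly_downtime_seconds(availability_pct: float) -> float:
    """Maximum downtime per year allowed by a given availability percentage."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100.0)

# 99.9% ("three nines") allows roughly 8 h 46 min of downtime per year.
print(f"{yearly_downtime_seconds(99.9) / 3600:.2f} hours")  # 8.77 hours
```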
The availability of the service can be affected (positively or negatively) by the systems that make up the service.
Layers or Scale-up
The different layers that make up the service have a negative impact on availability, since the unavailability of any one layer directly makes the service unavailable. The probabilities of unavailability then compound (for small values, they roughly add up).
– Application “A”, consisting of firmware running on a device with a power supply, has only 3 layers (firmware, hardware, power supply).
– Application “B”, executed by a runtime on an operating system, running on hardware with a power supply, and communicating through network cards equipped with firmware and a driver, represents 6 layers.
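The impact of the layer count can be sketched numerically. Purely for illustration, assume every layer is itself 99% available:

```python
def stacked(per_layer: float, layers: int) -> float:
    """Stacked layers multiply their availability rates."""
    return per_layer ** layers

print(round(stacked(0.99, 3) * 100, 2))  # application "A", 3 layers: 97.03
print(round(stacked(0.99, 6) * 100, 2))  # application "B", 6 layers: 94.15
```

Doubling the number of layers roughly doubles the unavailability, which is why leaner stacks fare better.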
Parallelization or Scale-out
Multiplying the systems that operate in parallel (scale-out) to carry the service improves Availability: the probability that two systems are down at the same time is lower than the probability of a single system failing.
Example: in aerospace, the control systems are all doubled, which “simply” multiplies the availability: even if one system breaks, the other takes over.
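A quick sketch of the doubled-system case, assuming two independent systems that are each 99% available:

```python
def parallel(availabilities):
    """Availability of independent systems running in parallel."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # probability this system is down
    return 1.0 - all_down      # service is up unless ALL members are down

print(round(parallel([0.99, 0.99]) * 100, 2))  # 99.99
```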
Improve or impact?
If the unavailability of a single stacked layer has a direct effect (unavailability of the entire service), the unavailability of one of the parallel systems also has potential impacts: on other design qualities (Scalability, Performance, Security, etc.) but also on Day-2 operations. Once one system is lost, the availability of what remains becomes much lower, and the failed system must be repaired as soon as possible.
Virtualization solutions make it possible to boost availability, not through virtualization itself, but through the ability to parallelize the hardware and software resources underneath the service.
Trying to improve Availability by parallelizing systems can also be counterproductive. Each of these systems has its own design qualities, including Manageability and Scalability, and parallelizing them can introduce risks that would not occur on a simple system but that multiply on parallelized ones.
A common example: using multiple network cards on a system to improve availability is quite standard, but if a bug affects the chipset shared by the cards, both cards fail at the same time, defeating the intended redundancy.
Worse still, if the cards are different and one suffers from a bug causing erratic behavior while the other does not, the failover mechanism between the cards will cause abrupt crashes, slowdowns, and other issues that are difficult to understand and troubleshoot, affecting overall availability for much longer than a single-card system with a clean failure.
Realistic targets and alternatives
Availability measures and contracts vary, depending on the type of offer, the level of understanding, and the needs:
- Generally offered as an availability percentage, or as a number of nines
- Example: two nines is 99%, four nines is 99.99%. Or a custom 97.5% availability.
- Availability is rarely measured end-to-end, across all components.
- The availability SLA is often offered only on the scope managed by the team offering the service
- Example: a virtualized infrastructure team can offer a platform SLA of 99.9% for the IaaS VMs that run on it.
- The availability SLA offered may depend on the SLAs of other services
- Example: the virtualized platform with a 99.9% SLA depends on a 99.95% SLA on the underlay network fabric.
- Some availability SLAs only apply during specified time slots; outside these slots the system has no guarantee.
- This can also stem from dependencies on other services.
- The advantage is easier and less risky maintenance windows.
- Example: 99.95% between 8am and 6pm, and no SLA otherwise.
- Most availability SLAs exclude maintenance periods or failures of unmanaged systems – which significantly improves the availability figure.
- This is often the case with SaaS services: the SLA is valid as long as the underlying provider’s services are 100% available; any unplanned or planned failure of the provider’s platform is explicitly excluded from the terms of the contract.
- Availability is a statistical average of the uptime rate; it represents neither a minimum nor a maximum. It is not uncommon to see hypervisors (therefore including hardware, power supply, and cooling) with an uptime of several years – 100% availability for that service!
The availability SLA can also be measured per month, to limit the contractual impact: a company that misses its 99.9% SLA for one month because of 12 hours of downtime (99.86% over the year), but performs normally the rest of the year, will likely be penalized less than if it reported the whole year as “target missed”.
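A small sketch checking these figures (assuming an illustrative 30-day month and a 365-day year):

```python
HOURS_PER_MONTH = 30 * 24   # 720 h, illustrative 30-day month
HOURS_PER_YEAR = 365 * 24   # 8,760 h

downtime = 12.0  # hours of downtime, all within a single month

monthly = 100 * (1 - downtime / HOURS_PER_MONTH)
annual = 100 * (1 - downtime / HOURS_PER_YEAR)

print(round(monthly, 2))  # 98.33 -> clearly misses a 99.9% monthly SLA
print(round(annual, 2))   # 99.86 -> only just misses it at year scale
```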
Example: Amazon S3 (https://aws.amazon.com/s3/sla/) has an availability SLA of 99.9% (excluding some request types), and if this SLA is broken the consumer receives service credits. Note that the SLA excludes network reachability and performance issues.
There is nothing wrong with the various alternatives above, as long as they are clearly understood and considered by the customers and consumers of the service.
It is essential to know how to calculate availability in order to choose the correct target, and therefore the architecture adapted to your needs.
The calculation rules are simple:
- Availability is regularly described by a number of 9’s (of which 2 are before the decimal point): two nines (99%), three nines (99.9%), three and a half nines (99.95%), etc.
- It is usually expressed per year, with a measurement in hours / minutes / seconds (visit https://uptime.is/ to calculate it easily)
- 99% represents a downtime of 3 days and 15 hours per year, while 99.99% represents an unavailability of only 52 minutes 35 seconds, and the nearly impossible six nines (99.9999%) allows only 31 seconds of downtime per year!
- Stacked layers multiply their availability rates (which means the overall quality decreases).
- Example: two layers at 99% give 0.99 * 0.99 = 98.01% availability.
- Parallel systems multiply their unavailability rates, and the result is subtracted from 100%.
- Example: two 99% systems give 1 – ((1 – 0.99) * (1 – 0.99)) = 1 – (0.01 * 0.01) = 1 – 0.0001 = 99.99%
- Calculating complex systems made of stacked and parallel groups
- Calculate each group independently, starting with the stacked layers, then the parallel groups, and so on.
- Example: two groups of two operating systems each hosting a database, clustered to present an application service, made redundant behind a traffic manager. The availability of each individual system is 99%.
- Each device (OS + database) is 98.01% available
- Two devices in parallel give 99.96% availability
- Adding the DB cluster and application service layers brings each group down to 97.97%
- Those two groups in parallel then have an availability of 99.96%
- Applying the traffic manager availability, this service has an end-to-end availability of 98.96%
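The worked example above can be reproduced step by step. This is only a sketch, assuming every individual component (OS, database, cluster/application layer, traffic manager) is 99% available and that failures are independent:

```python
def stacked(*avail):
    """Stacked layers multiply their availability rates."""
    result = 1.0
    for a in avail:
        result *= a
    return result

def parallel(*avail):
    """Parallel systems: up unless all members are down at once."""
    all_down = 1.0
    for a in avail:
        all_down *= (1.0 - a)
    return 1.0 - all_down

device = stacked(0.99, 0.99)          # OS + database:        98.01%
pair = parallel(device, device)       # two devices:          ~99.96%
group = stacked(pair, 0.99, 0.99)     # + DB cluster + app:   ~97.97%
groups = parallel(group, group)       # two groups:           ~99.96%
end_to_end = stacked(groups, 0.99)    # + traffic manager:    ~98.96%

print(round(end_to_end * 100, 2))  # 98.96
```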
Regarding the hardware components:
- The availability of hardware components partially depends on their error rate
- Example: a 1 TB SSD with an uncorrectable error rate of 1 in 10^15 and a DWPD of 3 will statistically experience a failure every year.
- Potential effect 1: the failure results in a restart of writes and a surface check, disrupting the service for 15 minutes: this gives 99.997% availability.
- Potential effect 2: it completely crashes the system and causes data corruption; the service must be recovered within the agreed metric of 24 hours: this gives 99.72% availability.
- The availability of hardware components is also derived from failure rates and amortization periods
- Example: the Cloud storage provider Backblaze publishes its statistics: https://www.backblaze.com/b2/hard-drive-test-data.html
- 0.93% of disks fail annually, giving an availability rate of 99.07%.
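The hardware figures above can be sketched the same way (assuming a 365-day year, and following this article's simplification of mapping an annualized failure rate directly to an availability rate):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 min

def availability_after_downtime(minutes_down: float) -> float:
    """Availability over one year given a single incident's downtime."""
    return 100 * (1 - minutes_down / MINUTES_PER_YEAR)

print(round(availability_after_downtime(15), 3))       # effect 1: 99.997
print(round(availability_after_downtime(24 * 60), 2))  # effect 2: 99.73 (quoted as 99.72 above)
print(round(100 - 0.93, 2))                            # Backblaze AFR: 99.07
```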
Note: in the following examples we take two nines (99%) availability as the basis of our math; reality may differ.
Example 1: the availability of an enterprise-class server will be better than that of a Raspberry Pi, since the server has two power supplies and two NICs.
Example 2: the availability of a VM is better than that of a “bare metal” server, since the VM is hosted on a cluster made up of several servers.
Example 3: adding a container stack reduces availability…
…which we can compensate for by mixing a container cluster with a virtualization cluster, to increase the availability.
One of the shortcomings of technically focused architects is wanting best-in-class without a real need, resulting in overly complex designs. Why would you design with multiple network fabrics if a single one is sufficient to reach the target SLA?
Think about the K.I.S.S. rule: “keep it simple, stupid” – don’t over-complicate the architecture when you don’t need to.
Another common mistake is to confuse the features of Recoverability with Availability. A platform based on a stretched cluster (compute + storage + network), which improves the availability of a system across two sites, cannot support an offer where the customer pins his VMs to a single site and only accepts hosting on the second site if the “main site” is lost.
Why is this not a valid offer? Let’s compare the two schemes (Availability); the difference should be obvious:
The invalid offer (on the right) claims to offer a “Private Cloud” based on two “Sites”, but actually uses only one!
We cannot reduce the set of systems on which the offered service relies (in this case, the hosting of VMs on the Private Cloud).
Finally, the third common mistake is overestimating SLAs: everyone would like four or five nines, but let’s be realistic and pragmatic. End to end, 95% is already a good SLA, as uptime can be impacted even while the service itself is up (e.g. network access down, performance issues, monitoring issues…).
As we have seen, Availability is an essential design quality. It is used both as a requirement and as an output to build the service offer. The calculation is not trivial, but it helps to choose the right solution.
I often hear distorted discussions: “corporate application X has only 97% availability, while public Cloud Z offers 99.99%!”. Again, think about end-to-end scope, exclusions, and the terms of the contract. Not all Availability measurements are equal.
Fashionable solutions are not necessarily beneficial for availability. They may bring other outcomes, but those must be weighed against the business requirements – and Availability is one of the most measured. Always remember the architecture rule: keep it simple.
Improving availability is exponentially expensive: trying to reach six nines (99.9999%), if not technically impossible, will result in a very expensive solution.
Therefore, align the architecture with the real needs, and do not hesitate to challenge them.