Estimated Reading Time: 15 minutes
This is the third article in the Design Qualities series.
After Availability, this one focuses on Recoverability. Although they are often associated, these two Design Qualities are distinct, and confusing them through poor understanding or definition can seriously affect an infrastructure offer.
Definition of Recoverability
Recoverability is the ability to restore a service. Unlike “Availability”, it does not guarantee that data stays available; rather, in the event of a disaster, it is the ability to recreate the service and its data on the same element or medium.
IT Infrastructure uses this design quality, in particular for the architecture of backups, replicated (asynchronous) storage, virtualized infrastructure, etc.
Recoverability can also be used in applications with versioning of data, configurations, and so on.
Please note that Recoverability is about restoring a service and its underlying systems to good condition, not about restarting it on another system. Clustering is an Availability quality, not a Recoverability one.
Recoverability is measured with multiple metrics, which we will detail later. The best known are RPO and RTO. The unit is time: minutes, hours, days.
Recoverability is not a mandatory capability: a “low cost” service offer, a development environment or even a temporary platform does not necessarily need to be recovered when a disaster occurs.
As with Availability and other Qualities, there are different ways to address Recoverability, including backup/restore, disaster recovery, (cold) standby instances, etc. We will come back to them below.
Recoverability is one of the key metrics (KPIs) of an infrastructure/Cloud offer. From simple storage offerings to the application, including infrastructure, these components should all have a Recoverability defined in the offer. Sometimes optional or offered at different levels, this duration before the platform is operational again and usable by end clients must be clearly understood, because misunderstanding it leads to bad experiences.
It is quite rare to have an end-to-end Recoverability commitment, from the hardware to the application data. Most of the time, only one layer (the scope of the current architecture design) is treated, thus excluding the lower and upper layers.
First example: an on-premise virtualization offer provides a 4h RTO recoverability option, but this only covers the recovery time of the virtualization infrastructure: assets are considered up and running as soon as the operating system has successfully restarted. If the application or its configuration has been damaged, its recovery time is not counted in the offer.
Another example: a SaaS-based storage service lets you restore your data quickly, with a 24-hour RPO. However, all the applications, automation, or configurations that use this data might need to be reconfigured or restarted.
There are several types of “Recovery”, depending on the type of affected system.
- Hardware (such as servers) relies on spare parts to fix failures: disks, CPU, memory, servers…
- Operating systems (OS) are usually recovered through backup and restore.
  - Example solutions: Veeam B&R, AWS Glacier, Symantec NetBackup, etc.
- Virtualization infrastructure has brought hardware independence and popularized disaster recovery.
  - Example solutions include VMware DRaaS with VMC on AWS, and storage array replication features/add-ons such as SnapMirror for NetApp or RecoverPoint for EMC.
- Software has two things to recover: the installer (or its package) and its configuration. These are managed by a repository, a configuration manager and versioning; combined with an offline repository, they make it possible to react quickly to a faulty binary.
  - For example, “enterprise” packaging is done with InstallShield AdminStudio, packages are distributed via a configuration manager such as Salt, Ansible or Microsoft System Center Configuration Manager (SCCM), and they are kept in an offline versioned repository.
  - The first role of a configuration manager is to restore the configuration to the expected state. Some configuration managers, like Salt and Puppet, go further by reacting on their own to a manual change and automatically re-applying the expected configuration.
  - Versioning with tools like Git keeps different versions of configurations and packages, which helps organize and speed up recovery.
- For applications, “cold standby” clustered instances can be included in the application’s clustering mechanism.
  - In Cloud Native apps, this can be achieved with a CI/CD-based methodology that deploys to multiple locations.
  - Example tools include Jenkins and Helm Charts, plus Tanzu Service Mesh, Kong or Kiali to redirect access.
- The data itself is the most critical and must be treated carefully: backup and restore with incremental techniques, regular checks and offsite copies (the well-known 3-2-1 backup method).
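The 3-2-1 method mentioned in the last item (three copies of the data, on two different media types, with one copy offsite) is simple enough to check programmatically. Here is a minimal Python sketch; the `BackupCopy` record and its field names are hypothetical and not tied to any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset (hypothetical inventory record)."""
    location: str  # e.g. "dc1", "cloud"
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies: list) -> bool:
    """Return True if the copies meet the 3-2-1 rule.

    3 copies in total (the production copy counts as one),
    on at least 2 different media types, with at least 1 offsite.
    """
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("dc1", "disk", offsite=False),             # production data
    BackupCopy("dc1", "tape", offsite=False),             # local backup
    BackupCopy("cloud", "object-storage", offsite=True),  # offsite copy
]
print(satisfies_3_2_1(copies))
```

Dropping the offsite copy from the list makes the check fail, which is exactly the situation where a site-wide disaster takes the backups with it.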
Earlier we mentioned some acronyms that are key to Recoverability. Here are the four acronyms and metrics:
- RPO: Recovery Point Objective
- RTO: Recovery Time Objective
- WRT: Work Recovery Time
- MTD: Maximum Tolerable Downtime
Many blog posts already detail these metrics, when they are triggered, etc. Most of these articles are about Disaster Recovery, the most frequent application of the “Recoverability” design quality, but their explanations are also valid for backup/restore and other methods.
The graph above shows the main stages linked to the triggering of the Recoverability steps: a disaster occurs and is identified (3), the teams are alerted and launch the recovery (4), based on the last point of backup (2) or an older but validated point (1). Once the technical recovery is complete (5), the teams of each system restart the systems and applications, to return to a nominal state (6) once completed.
Steps 1 and 2 are “in the past”, the last system/application backup points. I chose to represent 2 points and not just one (as you would see in other articles or explanations). It may happen that the last point is not valid, and the definition of RPO must consider this significant risk; thus the RPO may be wider than the point (2) and up to point (1) where data was successfully tested.
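To make the difference between points (1) and (2) concrete, here is a small Python sketch that picks the most recent validated restore point and measures the resulting data-loss window. The `RestorePoint` record, the function names and the dates are illustrative assumptions, not taken from any backup tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RestorePoint:
    taken_at: datetime
    validated: bool  # True once a restore/integrity test has succeeded


def effective_restore_point(points, disaster_at) -> Optional[RestorePoint]:
    """Most recent validated point taken before the disaster, or None."""
    candidates = [p for p in points if p.validated and p.taken_at <= disaster_at]
    return max(candidates, key=lambda p: p.taken_at, default=None)


def data_loss_window(points, disaster_at) -> Optional[timedelta]:
    """Time span of work lost if we restore from the effective point."""
    point = effective_restore_point(points, disaster_at)
    return None if point is None else disaster_at - point.taken_at


disaster = datetime(2024, 1, 3, 10, 0)
points = [
    RestorePoint(datetime(2024, 1, 2, 2, 0), validated=True),   # point (1)
    RestorePoint(datetime(2024, 1, 3, 2, 0), validated=False),  # point (2)
]
print(data_loss_window(points, disaster))
```

With these sample dates the last backup is only 8 hours old, but since it was never validated the loss window is measured from point (1), i.e. 32 hours.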
What will be restored is the data present in these backup points: all the work done since this (valid) restore point will be lost. Applications will be fully functional again, with data accessible, at the very end of the MTD period, which includes RTO + WRT.
The net loss of work is therefore: 2 × ((1 to 2) + RPO) + MTD. The period from step 1 to step 3 is counted twice because the work was already done once (between 1 and 3) but has been lost, so the teams involved have to redo it.
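The formula above can be written as a few lines of Python. This is only a worked illustration: the function name is hypothetical, all durations are in hours, and the sample values are made up.

```python
def net_work_loss_hours(gap_1_to_2: float, rpo: float, rto: float, wrt: float) -> float:
    """Net loss of work, following 2 x ((1 to 2) + RPO) + MTD.

    gap_1_to_2: time between the validated point (1) and the last point (2)
    rpo:        time between the last point (2) and the disaster (3)
    rto, wrt:   their sum is the MTD period
    """
    mtd = rto + wrt
    lost_work = gap_1_to_2 + rpo  # work done between (1) and (3)
    return 2 * lost_work + mtd    # done once, lost, then redone


# Illustrative values only: 24h between points, 8h RPO, 4h RTO, 6h WRT.
print(net_work_loss_hours(gap_1_to_2=24, rpo=8, rto=4, wrt=6))  # 74
```

When the last backup point is valid, `gap_1_to_2` is simply 0 and the loss shrinks to 2 × RPO + MTD.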
This is where Recoverability matters. The customer’s business is directly impacted, depending on the offer and its options. For the systems’ provider, the offer is an additional source of revenue, with its associated risks and penalties.
Minimize the Recoverability?
Everyone agrees: the ideal would be an RPO and RTO as close to zero as possible.
As with Availability, the closer we get to “zero defect”, the more complex the technology choices and the higher the overall cost. In addition, an RPO + RTO tending to zero leaves little room to react to external events, whether human errors or intrusions from other systems.
The reduction applies to all three axes: the RPO, the RTO and the WRT.
Reducing the RPO consists of replicating or backing up the data more frequently. Depending on the volume of data to move and the technologies, this can be done incrementally or in full.
There may not have been any changes (for example in configuration files), in which case the previous valid copy of the data is used as a reference pointer. For example, pushing an unmodified file with git will not update the file, but the sync date still needs to be recorded somewhere.
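The idea of keeping a pointer to the previous copy while still recording the sync date can be sketched with a content hash. This is a minimal Python illustration, not how git or any backup product works internally; the `catalog` dictionary and the `sync_file` function are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def sync_file(catalog: dict, name: str, content: bytes, now: datetime) -> bool:
    """Record a sync of `name`; return True if a new copy was stored."""
    digest = hashlib.sha256(content).hexdigest()
    entry = catalog.get(name)
    if entry is not None and entry["sha256"] == digest:
        # Unchanged: keep pointing at the previous valid copy,
        # but record that the check took place.
        entry["last_synced"] = now
        return False
    catalog[name] = {"sha256": digest, "stored_at": now, "last_synced": now}
    return True


catalog = {}
t1 = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc)
first = sync_file(catalog, "app.conf", b"port=8080\n", t1)   # stored
second = sync_file(catalog, "app.conf", b"port=8080\n", t2)  # pointer only
print(first, second, catalog["app.conf"]["last_synced"])
```

Keeping `stored_at` and `last_synced` separate is what lets you prove the RPO was honored even though no new data was copied.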
Too much reduction can also lead to failure. An RPO that is too short, and/or one with few checkpoints kept over time, is more exposed to intrusions and data corruption.
Example: in the case of ransomware that encrypts disks, if all the restore points are impacted before the issue is detected, then Recoverability has failed and the system is lost.
If the data integrity test period is longer than the data copy frequency, then the data can never be verified and validated. This leads to situations where the RPO is small in theory but cannot be met in practice.
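The relationship between copy frequency and verification time can be illustrated with a deliberately simplified model. In the Python sketch below, the model itself is an assumption (copies taken at a fixed interval, verified one at a time, starting as soon as they exist), and the function name is hypothetical.

```python
from datetime import timedelta
from typing import Optional


def worst_case_verified_age(copy_interval: timedelta,
                            verification_duration: timedelta) -> Optional[timedelta]:
    """Worst-case age of the newest *verified* copy when a disaster strikes.

    Returns None when verification takes longer than the copy interval,
    because the verifier then falls further behind with every new copy.
    """
    if copy_interval <= timedelta(0) or verification_duration < timedelta(0):
        raise ValueError("copy_interval must be > 0 and verification_duration >= 0")
    if verification_duration > copy_interval:
        return None
    # Just before copy N+1 finishes verification, the newest verified
    # copy is copy N, which is one interval plus one verification old.
    return copy_interval + verification_duration


print(worst_case_verified_age(timedelta(hours=1), timedelta(minutes=20)))
print(worst_case_verified_age(timedelta(hours=1), timedelta(hours=3)))
```

Under this model, an hourly copy with a 20-minute check gives a verified RPO of 1h20 in the worst case, while a 3-hour check on hourly copies can never keep up.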
Minimizing the RTO is a multi-axis challenge. It requires:
- Sufficient and correct tooling: the recovery period (RTO) is not the time to install and test new tools.
  - Example: to restore emails, you need a tool that can repair faulty .ost files and extract the emails, in addition to server-side replication and recovery.
- Automation, scripts, workflows: anything that can be done programmatically should be automated to save time during this phase, especially when the DR starts on a Sunday at 3am.
- Fast and functional access without dependencies.
  - Example: in a datacenter recovery, if all the emergency passwords are in a password vault that is itself part of the loss, the recovery will take considerably longer.
- Trained and prepared staff: triggering a Recovery is not the moment to find out who masters the tools, or to start reading the documentation.
- Management escalation process: Recovery involves a business impact and a cost. Technology is not the only factor to consider, and just as there must be a process for technical escalation, there must be one for business escalation that involves managers/directors.
  - Example: in the event of the loss of a storage array, it may be faster (better RTO) to restore backups (48h RPO) on another storage system rather than wait for support and engineering to fully recover the storage array (RPO close to 0, but RTO could take a week). Despite the data lost (due to the RPO of the backup/restore), having many client teams unable to work during the whole RTO+WRT period has a stronger impact in terms of lost profit, brand damage and reputation.
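The storage array trade-off in the last example is the kind of decision a management escalation has to make, and rough numbers help. Below is a hedged Python sketch: only the 48h RPO and the one-week RTO come from the example, while the other durations, the hourly costs and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    rpo_hours: float  # work lost, which will have to be redone
    rto_hours: float  # technical recovery time
    wrt_hours: float  # work recovery time after the technical restore


def business_impact(option: RecoveryOption,
                    downtime_cost_per_hour: float,
                    rework_cost_per_hour: float) -> float:
    """Rough cost: downtime during RTO+WRT plus rework of the lost data."""
    downtime = option.rto_hours + option.wrt_hours
    return downtime * downtime_cost_per_hour + option.rpo_hours * rework_cost_per_hour


options = [
    RecoveryOption("restore backups on another array", rpo_hours=48, rto_hours=12, wrt_hours=8),
    RecoveryOption("wait for full array recovery", rpo_hours=0, rto_hours=168, wrt_hours=4),
]
# Hypothetical costs: 1000 per hour of downtime, 500 per hour of rework.
best = min(options, key=lambda o: business_impact(o, 1000, 500))
print(best.name)
```

With these assumed costs, losing 48 hours of data is still far cheaper than a week of downtime, but the answer flips if downtime is cheap and the lost data is hard to recreate, which is why the decision belongs to the business.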
Although it has several axes, optimizing the RTO is fairly simple: plenty of software exists on the market, and the investment in human resources is limited to the IT provider team and a few scripting teams.
With a little effort and reasonable investment, the RTO can be optimized. However, these investments and operations must be repeated regularly to keep the RTO optimized and adapted to the needs.
WRT is the work that is required after technical restoration is done to bring (underlying) systems back to their nominal state.
This step often involves more people and several technologies: operating systems, software configuration, hardware configurations, validation of proper functioning, functional tests…
In my own experience, I had a case involving the restoration of several hundred VMs, each equipped with multiple network interfaces. RTO and WRT were short on the storage and virtualization infrastructure, but an (old) OS kernel bug that let an interface reply to ICMP on the wrong VLAN/MAC address resulted in a large time overrun.
The WRT phase is complex. It includes the time to bring the service back to normal status and the time to validate data integrity, but also the (re)work required from all the clients of the restored systems to bring the data back to its state before the failure.
Speeding up the WRT has little to do with technical resources. Of course a minimum is needed, from accessibility to security. An automated testing tool in “behavior-driven” mode will help to easily compare the current state and the expected functional state.
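The comparison between the current state and the expected functional state can be reduced to a very small core. The Python sketch below is not a behavior-driven testing framework, only the comparison idea; the check names are invented for illustration.

```python
def failed_checks(expected: dict, observed: dict) -> dict:
    """Return {check_name: (expected, observed)} for every mismatch.

    A check missing from `observed` is reported with None as its value.
    """
    return {
        name: (want, observed.get(name))
        for name, want in expected.items()
        if observed.get(name) != want
    }


expected = {
    "web tier answers on HTTPS": True,
    "database accepts writes": True,
    "directory logins work": True,
}
observed = {
    "web tier answers on HTTPS": True,
    "database accepts writes": False,
}
for name, (want, got) in failed_checks(expected, observed).items():
    print(f"FAIL {name}: expected {want}, observed {got}")
```

A check that was never run shows up as a failure too, which is the safe default during a recovery: untested does not mean working.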
Most of the duration of WRT is more related to people and processes.
You must have all the procedures, manuals and tutorials for system configuration, access, and troubleshooting. Then, it’s a matter of investing in the (human) resources that must redo the lost work.
IaC, CI/CD and Recoverability
Infrastructure as Code (IaC) enables IT to obtain the expected results from a configuration, via specialized tools.
Continuous Integration / Continuous Delivery (CI/CD) automates the deployment of systems and the loading of data into them.
In many cases, IaC could accelerate the RTO, particularly on the configurations of the systems. Integration with CI/CD will also greatly improve the RTO by automating the redeployment of systems and partial data. RPO can’t be enhanced by IaC, but it can be enhanced by CI/CD in some circumstances.
If the organization has IaC and CI/CD fully implemented, operated and very well understood and used (I insist on the “very”; unfortunately it’s not so frequent), then some recoveries can be sped up by re-creating the system “from scratch” rather than trying to heal it.
This “designed to fail” pattern is sometimes applicable to components with little data (or a low change rate) that are easy to re-integrate. For example, VMware NSX Manager is restored by restoring its configuration on freshly redeployed appliances, not from an appliance backup.
Common Mistakes and Being Realistic
On some sites, you might find promises of an RPO=0 or RTO=0. With what you’ve read above, you can imagine how unlikely this is.
Most of these offers are better-marketed Availability offers: a kind of “Business Continuity” (BC), with synchronous replication. However, BC does not protect against data integrity issues, accidental or intentional deletion, or data loss. An error in written data will be replicated by this synchronous system and can’t be recovered with this functionality. Multi-AZ, extended or stretched clusters (whether storage, network, application…) and synchronous data replication are NOT a Recoverability quality.
And this is one of the biggest mistakes: confusing Availability and Recoverability. A cluster, an extension between AZ: this is Availability.
A second type of misunderstanding is to limit Recoverability to the operating system, VM or container. The service is not back to its nominal state until the entire stack is available and validated, from the operating system to access (incl. permissions and the directory system) to the client’s data and configuration.
For example, in a system made up of a front tier (http) and a back tier (database), if the front tier has been restored but the back tier had acknowledged data that is no longer present, then there is still work to do (WRT).
Another common mistake is misunderstanding the (end-to-end) system. For example, while a database can have a short RPO thanks to log shipping, a 3-tier application (with a database) does not follow the same rules and will have a higher RPO.
Generally speaking, a system composed of several sub-systems will have a higher RPO than the sub-system with the largest RPO.
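As a quick sanity check on an offer, the rule above gives a floor for the end-to-end RPO. The Python sketch below only computes that floor; the real value is higher because the restore points of the tiers are rarely aligned. The sub-system names and values are hypothetical.

```python
def end_to_end_rpo_floor(subsystem_rpo_hours: dict) -> float:
    """Lower bound of the end-to-end RPO: the largest sub-system RPO.

    The actual end-to-end RPO is expected to be higher than this floor.
    """
    if not subsystem_rpo_hours:
        raise ValueError("at least one sub-system is required")
    return max(subsystem_rpo_hours.values())


# Hypothetical 3-tier application, RPO expressed in hours.
rpos = {
    "database (log shipping)": 0.25,
    "file store (nightly backup)": 24,
    "configuration repository": 4,
}
print(end_to_end_rpo_floor(rpos))  # 24
```

In this example the 15-minute RPO of the database says nothing about the application as a whole, which cannot promise better than 24 hours.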
Last common mistake: misjudging the timings. Everyone would like an RTO or WRT of a few minutes but, more realistically, RTO is often higher than RPO, and the operations involve a lot of people and processes. It is absolutely key not to underestimate the time it can take, especially when multiple issues happen (e.g. the measured RTO of 1 VM and of 10,000 VMs cannot be identical).
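A back-of-the-envelope calculation shows why the RTO of 1 VM and of 10,000 VMs cannot match. The Python sketch below assumes a fixed restore time per VM and a fixed number of parallel restore streams; both figures and the function name are hypothetical, and a real restore also has shared bottlenecks this model ignores.

```python
import math


def estimated_restore_hours(vm_count: int, minutes_per_vm: float,
                            parallel_streams: int) -> float:
    """Estimate restore duration when VMs are restored in parallel waves."""
    if vm_count < 0 or parallel_streams < 1:
        raise ValueError("vm_count must be >= 0 and parallel_streams >= 1")
    waves = math.ceil(vm_count / parallel_streams)
    return waves * minutes_per_vm / 60


# Assumed figures: 30 minutes per VM, 20 restores running at once.
print(estimated_restore_hours(1, 30, 20))       # 0.5
print(estimated_restore_hours(10_000, 30, 20))  # 250.0
```

Half an hour for one VM becomes more than ten days for 10,000 under the same assumptions, before counting any WRT.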
A small piece of advice in addition to the list of common mistakes: having a plan for the managerial escalation chain avoids A LOT of problems over time. The end customer of IT is the business; technology is not there for its own sake.
Like Availability, Recoverability is an essential design quality. It is a common option in Cloud offers, and mandatory for healthy Production and safe operations.
Security and risk reduction are the keywords of Recoverability: it’s about ensuring the data can be restored in case of an unexpected issue, and de-risking possible technical, human and environmental problems.
There is nothing wrong with having Recoverability engagement times of tens of hours (or even days). It is a complete and complex process. Beware of the hypothetical promises of the magic zero RPO/RTO; real RPO-RTO metrics are measured end-to-end, and only a small part of end-to-end Recovery is tied to technology.
Recoverability tools and processes need to be tested and validated regularly, but also reviewed, re-architected and re-implemented frequently. Escalation and operations processes must be clear, concise and tested as well.
Yes, Recoverability is an important technological and human investment. But that’s the price of insurance and security. Without Recoverability, the entire business can be lost, and the money saved by skipping it will be what puts the company out of business.