Setting RPO and RTO Based on Risk
Introduction
Assignment of RPO (Recovery Point Objective) and RTO (Recovery Time Objective) should be based on a thorough risk and impact analysis of your systems and data. Assessing the likelihood and impact of threats should drive recovery goals that support an acceptable level of data loss and downtime for each risk level. This means critical systems with high-impact risks should have a shorter RPO and RTO compared to less critical systems with lower risk.
The two most widely used metrics for Backup and Recovery, Disaster Recovery, Cyber Resiliency and Business Continuity are RTO and RPO. RTO represents the maximum acceptable downtime for a system, or how quickly that system needs to be operational after an outage. RPO represents the maximum tolerable data loss (unrecoverable updates and changes to data, such as files written or databases updated). It is generally expressed as a window of time, from minutes to days, of data that can be lost before business or operations are critically impacted.
Determining the business risk, and hence the RTO and RPO, is crucial for deciding appropriate data protection strategies based on the importance and value of different systems within a business. For example, in the event of a major outage, whether from natural disaster or ransomware, the RTO and RPO for systems such as Active Directory or the Domain Name System (DNS) should be very low, even approaching zero. In contrast, systems like print servers and long-term archival storage will likely have an RTO and RPO of days or even weeks. The Data Protection and Cyber Resiliency practice at WWT strongly encourages every organization to perform Application Dependency Mapping to identify the critical rebuild systems that need very tight RPO and RTO.
Risk analysis
Establishing RPO and RTO metrics should be based on a thorough risk analysis to determine the potential for negative consequences or harm during a variety of outages. RTO and RPO are the outcomes of measuring the level of risk a company is willing to accept regarding data loss and downtime during an incident.
One of the simplest and most widely used calculations of risk is Risk = Likelihood × Impact, where likelihood is represented as a percentage (or a value of 0.0 to 1.0) and impact is the monetary cost. For example, the likelihood of a natural disaster may be 10% with the cost being $1M, resulting in a risk of $100,000. While difficult to monetize, a company's reputation should be considered in the impact portion of the calculation.
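As a minimal sketch of the calculation above, the function below multiplies a likelihood by a monetary impact. The function name `expected_loss` and the input checks are illustrative assumptions; the 10% and $1M figures are the ones from the example in the text.

```python
def expected_loss(likelihood: float, impact_usd: float) -> float:
    """Risk = Likelihood x Impact, with likelihood given as a value from 0.0 to 1.0."""
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be between 0.0 and 1.0")
    if impact_usd < 0:
        raise ValueError("impact must not be negative")
    return likelihood * impact_usd


# Natural disaster example from the text: 10% likelihood, $1M impact.
natural_disaster_risk = expected_loss(0.10, 1_000_000)
print(f"Natural disaster risk: ${natural_disaster_risk:,.0f}")
```

Expressing every threat in the same monetary unit makes it possible to rank threats against each other before any recovery objectives are assigned.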
For cyber security, we can further expand the calculation to be Risk = Likelihood × Impact, where Likelihood = f(Vulnerabilities, Threats, Exposure, Security Controls). Given that a cyber event is more likely than a natural disaster, the resulting risk for an event of comparable impact will be higher, which means a higher expected monetary loss.
RTO and RPO are key factors after calculating the impact portion of risk. Recovery time and data loss can be quantified financially in terms of lost productivity and lost revenue per minute/hour/day.
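One way to picture how recovery time and data loss feed the impact figure is a simple linear model, sketched below. The model, the function name `outage_impact`, and all dollar figures are hypothetical assumptions chosen for illustration; a real business impact analysis would use measured costs and may not be linear.

```python
def outage_impact(
    downtime_hours: int,
    data_loss_hours: int,
    lost_revenue_per_hour: int,
    lost_productivity_per_hour: int,
    rework_cost_per_hour_of_lost_data: int,
) -> int:
    """Estimate the monetary impact of an outage (assumed linear model).

    Downtime cost scales with recovery time (bounded by RTO);
    rework cost scales with the window of lost data (bounded by RPO).
    """
    downtime_cost = downtime_hours * (lost_revenue_per_hour + lost_productivity_per_hour)
    data_loss_cost = data_loss_hours * rework_cost_per_hour_of_lost_data
    return downtime_cost + data_loss_cost


# Hypothetical system: 4 hours to recover, 1 hour of data lost.
impact = outage_impact(
    downtime_hours=4,
    data_loss_hours=1,
    lost_revenue_per_hour=50_000,
    lost_productivity_per_hour=10_000,
    rework_cost_per_hour_of_lost_data=20_000,
)
print(f"Estimated impact: ${impact:,}")
```

Running the same estimate with different RTO and RPO candidates shows how much impact, and therefore risk, each tightening of the objectives would remove.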
For help on determining risk, there is information available from the National Institute of Standards and Technology (NIST) Risk Management Framework. To convert risk into financial terms, the FAIR Institute (Factor Analysis of Information Risk) helps organizations understand the potential economic impact of cyber threats. This makes it easier to communicate risks to business stakeholders and prioritize mitigation strategies.
Maximum tolerable downtime
Maximum tolerable downtime (MTD) is the absolute longest amount of downtime an organization can tolerate before facing serious consequences. MTD is the sum of Incident Response Time (IRT), RTO, and the Operational Resumption Time (ORT) needed to get the business function running at an acceptable level. IRT can be as low as seconds or minutes in the event of a disaster where the plan to seamlessly fail over to a disaster recovery site is executed. However, IRT can be days or even weeks if an enterprise is affected by ransomware and remediation efforts are required before recovery can begin.
It's vitally important to perform a complete business impact analysis (BIA) for key applications to determine the absolute MTD. Clearly, the most critical business processes needed to maintain specific business functions will require the shortest MTD. Once MTD is established, the RTO can be determined.
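Because MTD is the sum of IRT, RTO and ORT, the RTO that remains available once MTD is fixed can be sketched as a subtraction. The function name `rto_budget` and the 72-hour, 24-hour and 8-hour figures below are hypothetical assumptions, not recommended values.

```python
from datetime import timedelta


def rto_budget(mtd: timedelta, irt: timedelta, ort: timedelta) -> timedelta:
    """Return the time left for recovery, given MTD = IRT + RTO + ORT."""
    budget = mtd - irt - ort
    if budget <= timedelta(0):
        # Incident response and resumption alone already use up the MTD.
        raise ValueError("IRT plus ORT meets or exceeds the MTD; no time is left for recovery")
    return budget


# Hypothetical application: 72-hour MTD, 24-hour incident response,
# 8 hours to resume operations once systems are restored.
budget = rto_budget(
    mtd=timedelta(hours=72),
    irt=timedelta(hours=24),
    ort=timedelta(hours=8),
)
print(f"RTO budget: {budget.total_seconds() / 3600:.0f} hours")
```

The error case is deliberate: with a ransomware-scale IRT of days or weeks, the subtraction can leave nothing for recovery, which signals that either the MTD or the response plan needs to be revisited.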
In the event of a site disaster or a cyber recovery event, the MTD is for the Minimum Viable Business needed to resume operations. Minimum Viable Business refers to the smallest set of critical systems, data, and processes an organization needs to maintain a basic level of operations immediately following a disaster.
Cost
As described above, RTO and RPO are factored into the impact portion of the risk equation. While it should be obvious that a strategy of driving RTO and RPO as low as possible will reduce risk to the organization, it comes with increased costs.
The lowest-cost RTO and RPO solution is archival or offsite storage, characteristically called cold storage. It is generally very low cost and is typically achieved using tape technology. Examples include backup to tape sent to offsite storage vendors such as Iron Mountain, and cold object or blob storage such as Amazon Glacier and Azure Archive. The downside to this low-cost option is an RTO of days or even weeks.
For an RTO and RPO of hours to days, standard data protection provided by Backup and Recovery applications will be sufficient. Backup and Recovery applications provide strong security and reliability at a reasonable cost but are usually inadequate for providing guaranteed RTO and RPO of less than a few hours.
Achieving a tighter RTO and RPO than Backup and Recovery can provide requires a more expensive solution, typically using array-based snapshots or asynchronous replication to another site. This can drive the RPO down to minutes or even seconds but does not improve RTO in the event of a site-wide or multi-site disaster.
Ultimately, driving RPO and RTO to near zero (it's nearly impossible to achieve zero RTO/RPO) requires a disaster recovery site with duplicate storage and sufficient computing capacity to resume Minimum Viable Business operations very quickly. This configuration requires synchronous replication of all transactions and enough computing resources to restart core applications within minutes. The overall cost of a second site with sufficient resources will be many times higher than that of backup and recovery but will drive risk down substantially.
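The cost tiers described in this section can be sketched as a lookup that returns the cheapest option able to meet a required RTO and RPO. The tier names follow the text, but the numeric capabilities in `TIERS` and the function name `cheapest_tier` are illustrative assumptions only; actual figures depend on the products and architecture in use.

```python
from datetime import timedelta
from typing import Optional

# (tier name, achievable RTO, achievable RPO), ordered from lowest to highest cost.
# All durations are assumed placeholder values for illustration.
TIERS = [
    ("cold storage / offsite archive", timedelta(days=2), timedelta(days=1)),
    ("backup and recovery", timedelta(hours=4), timedelta(hours=4)),
    ("snapshots / asynchronous replication", timedelta(hours=4), timedelta(minutes=1)),
    ("DR site with synchronous replication", timedelta(minutes=15), timedelta(seconds=5)),
]


def cheapest_tier(required_rto: timedelta, required_rpo: timedelta) -> Optional[str]:
    """Return the first (cheapest) tier whose RTO and RPO both meet the requirement."""
    for name, achievable_rto, achievable_rpo in TIERS:
        if achievable_rto <= required_rto and achievable_rpo <= required_rpo:
            return name
    return None  # No tier in this table meets the requirement.


print(cheapest_tier(timedelta(days=7), timedelta(days=7)))
print(cheapest_tier(timedelta(hours=8), timedelta(minutes=5)))
print(cheapest_tier(timedelta(minutes=30), timedelta(minutes=1)))
```

Ordering the table by cost keeps the selection rule simple: the first match is the least expensive option, so an application is never placed on a costlier tier than its objectives require.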
It's important to note that a disaster recovery solution provides excellent recovery times for site disasters, but little benefit for a cyber-attack, which is far more likely. Recall that Maximum Tolerable Downtime is the sum of Incident Response Time, RTO and Operational Resumption Time. Given that incident response times in a cyber-attack normally range from days to weeks, the RTO should have high visibility since Recovery Time is one of the last pieces to restarting the business.
Summary
Recovery Point Objectives and Recovery Time Objectives must be the outcome of a thorough risk analysis where the impact of an event is evaluated as a balance between the Maximum Tolerable Downtime and the cost required to achieve it. Only after the risk analysis is complete should RTO and RPO be assigned.
Furthermore, there is no single RTO or RPO value for an organization. RTO and RPO should be aligned with the recovery tier of each application or system, as a single blanket RTO and RPO is cost-prohibitive to implement at scale.