Understanding Service Level Agreements in Healthcare

Orion Health White Paper
Gustavo Herrera, Site Reliability Engineering Director
042018

How to measure and control the quality of services in the health industry

Service Level Agreements (SLAs) are one of the most important parts of any cloud service agreement, yet their importance can be poorly understood. In their most basic form, SLAs are a single contract between two parties focusing on the type and quality of service provided to a client. Yet how the contracts are developed, and what they contain, can be overlooked.

Strong SLAs are paramount within healthcare, where capabilities such as data availability can mean the difference between life and death. This whitepaper presents a comprehensive overview of what SLAs are, how they are created, and their importance to healthcare organisations.

Definitions: SLI, SLO and SLA

Service Level Indicator (SLI)

Defining a set of SLIs is the first step in creating a Cloud Service Agreement. An SLI can be defined as a variable (also known as a value) that has enough relevance for the service and can be quantitatively and objectively measured. The most well-known SLIs are “system availability” and “downtime”.

However, as every system is different, each can have an almost infinite number of variables. It is important to note that not all variables bear the same relevance or have the same impact on quality. Some are concrete and can be quantified, while others are more abstract and harder to represent in numbers.

To ensure the order of relevance is understood, a hierarchy of variables should be created, focusing on significance and measurability. These two features are key to defining accurate SLIs.

Some SLIs can be composed of an aggregation of other SLIs, but each must have enough relevance on its own to be listed independently; otherwise it should be rolled into a relevant SLI and discarded. For example, database availability, business logic availability, web server availability, and API availability are all different SLIs. Each has a meaning of its own, but they are also presented as an aggregated SLI, such as system availability.

However, it could be argued that some availabilities have relative importance; for example, web server availability will not have an impact on consumers using the API. Therefore, it is still valuable to make a distinction between the different SLIs rolled up into system availability or uptime.
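The roll-up of component SLIs into an aggregated availability can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method prescribed by this paper: the measured values and the `dependencies` mapping are invented, and the aggregation rule (multiplying availabilities) assumes that components fail independently and that every listed component is required.

```python
from math import prod

# Hypothetical monthly availability per component SLI (fraction of time usable).
component_availability = {
    "database": 0.9995,
    "business_logic": 0.9990,
    "web_server": 0.9950,
    "api": 0.9992,
}

# Assumed dependency chains: which components each type of consumer relies on.
dependencies = {
    "web_user": ["database", "business_logic", "web_server"],
    "api_consumer": ["database", "business_logic", "api"],
}


def aggregated_availability(components, measured):
    """Roll component SLIs into one SLI, assuming independent failures
    and that every listed component is required (serial dependency)."""
    return prod(measured[name] for name in components)


for consumer, components in dependencies.items():
    value = aggregated_availability(components, component_availability)
    print(f"{consumer}: {value:.4%}")
```

Keeping the component SLIs separate is what makes the per-consumer view possible: in this sketch a web server outage lowers the figure for web users but leaves the figure for API consumers unchanged.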

FIGURE 1: Service Level Indicator Diagram (labels: Variables, Relevant, SLI, Measurable)

Service Level Objective (SLO)

An SLI alone does not provide a means to measure the quality of a service, as there is still no judgment regarding what value, or range of values, the SLI could take. Service Level Objectives define what range of those values will constitute an acceptable level of service for a given SLI.

SLOs usually require some context to be interpreted. For example, availability is expressed as the percentage of time that a system is usable over a period of time (usually a month). Another common definition of availability is “the system is considered available if it responds within 10 seconds to a user request”.

In other cases, such as ‘error rate’, the SLI can be expressed as a number of errors (per minute) considered acceptable (which informs how many retries must be allowed). The most common pattern is to have an SLO expressed as the upper (or lower) bound of the acceptable range. For instance, any network round-trip time (RTT) lower than a given acceptable number of seconds would be a valid SLO (e.g. RTT < 4 sec).

It is important to consider external factors and chains of dependency when defining an SLO. Network latency, for example, cannot be defined if part of the communication happens over the public internet, since it is technically impossible to predict the route a particular packet will follow under TCP/IP networks. Sometimes even the laws of physics need to be considered, such as the time it takes for a beam of light to travel through the fibre optic cables under the Pacific Ocean.

Therefore, an SLO must be set to realistic values. It is of no use to define an SLO as “RTT = 0 Sec” or “Availability = 100%”, since those values can never be achieved.

Another factor to consider is the cost of enhancing the SLO compared with the added benefit as perceived by the customer. For example, for user interaction, 3-4 seconds of response time is considered acceptable and there is no perceived value in reducing the reaction time beyond that threshold. Usually, this follows the law of diminishing marginal returns: an increasing volume of investment in improving the SLO becomes less and less productive. Each additional unit of input produces less output than the prior unit, until it no longer makes economic sense to keep investing money.

The Myth of Five-Nines

The ‘myth of the five-nines’ is a particularly famous but currently unrealistic goal for IT providers who, in pursuit of perfection, now target performance up to three decimal places: in the case of uptime, being up 99.999% of the time, represented by five consecutive nines. If this goal were possible, it would equate to just 5 minutes and 15.6 seconds of service interruption in a given year. While this currently unachievable goal is frequently discussed, it is not offered by any of the major public cloud providers or any other major service providers for server-based capabilities.

Less obvious examples are values that go beyond the substrate that the SLOs are built upon. For example, if the Infrastructure Provider cannot guarantee an uptime for virtual machines (or containers) higher than 99.5%, then it will not be possible (without extra work) for the Service Provider to offer an overall service availability higher than 99.5%.

Ultimately, a combination of a relevant and quantifiable variable and an expected value range does not result in an SLA. In other words, when the expression “Availability > 99.5%” is found, that is actually an SLO and not an SLA. While SLOs are not enough, setting a reasonable set of SLOs is of the essence and constitutes the core of a well-defined Service Contract Agreement in terms of quality of service.
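The arithmetic behind availability targets such as 99.5% or 99.999% can be checked with a short Python sketch. The function names are hypothetical, and the year length (365.25 days, an average that includes leap days) is an assumption chosen for illustration.

```python
from datetime import timedelta

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year, including leap days


def downtime_budget(availability_slo, period_seconds=SECONDS_PER_YEAR):
    """Maximum downtime allowed in a period while still meeting the SLO."""
    if not 0.0 <= availability_slo <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return timedelta(seconds=(1.0 - availability_slo) * period_seconds)


def meets_slo(measured_availability, availability_slo):
    """An availability SLO is a lower bound on the measured SLI."""
    return measured_availability >= availability_slo


for slo in (0.995, 0.999, 0.9999, 0.99999):
    print(f"{slo:.3%} -> {downtime_budget(slo)} per year")
```

Under these assumptions a 99.999% target leaves a budget of roughly 315.6 seconds per year, while 99.5% leaves nearly two days, which illustrates how quickly the cost of each additional nine grows.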

Service Level Agreements (SLA)

To build SLAs from Service Level Objectives, some levels of criticality and cost (in terms of business value) need to be introduced. The importance of the SLO needs to be qualified and paired with the consequences of breaching the commitment.

Once this has been completed, Service Level Agreements are included in the Cloud Service Agreement and are usually associated with a level of compensation in case of failure to honour them. Such compensations, typically called “service credits”, can be defined monetarily or via bonuses on future services. The bonuses typically constitute an amount of service credits depending on the severity of the failure.

For example, an interruption of service for 30 minutes compared with one day of interruption will result in materially different degrees of penalty, in extreme cases resulting in dissolution of the contract.

SLI (what): quantifiable, relevant
SLO (objective): minimum/maximum, average
SLA (consequence): contractual, service credits

FIGURE 2: Service Level Agreements Diagram
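How an SLO is paired with a consequence can be illustrated with a small Python sketch. The credit schedule below is entirely hypothetical (real schedules are negotiated per contract), as are the function name, the 30-day month, and the monthly fee; the two downtime values reuse the 30-minute and one-day interruptions mentioned above.

```python
# Hypothetical credit schedule: (minimum measured availability, credit as a
# fraction of the monthly fee), ordered from best to worst outcome.
CREDIT_TIERS = [
    (0.9995, 0.00),  # SLO met: no credit
    (0.9990, 0.10),
    (0.9900, 0.25),
    (0.0000, 0.50),
]

MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day billing month


def service_credit(measured_availability, monthly_fee, tiers=CREDIT_TIERS):
    """Return the credit owed for one month under the assumed schedule."""
    for minimum, fraction in tiers:
        if measured_availability >= minimum:
            return monthly_fee * fraction
    raise ValueError("availability must be between 0 and 1")


for downtime_minutes in (30, 24 * 60):
    availability = 1 - downtime_minutes / MINUTES_PER_MONTH
    credit = service_credit(availability, monthly_fee=1000.0)
    print(f"{downtime_minutes} min down -> {availability:.3%} -> credit {credit:.2f}")
```

The SLI (measured availability) and the SLO (the first tier's lower bound) only become an SLA once the remaining tiers attach a contractual consequence to each degree of breach.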

The impact of automation on SLA

Site Reliability Engineering (SRE)

The relatively recent introduction of the concept of Site Reliability Engineering in the IT industry has revolutionised the approach to increasing the quality, speed, security, and certainty of delivering an SLA, especially on cloud services.

Site Reliability Engineering (or SRE) focuses on replacing Engineers with Automation. So, instead of having DevOps or Ops Engineers providing 24/7 support and maintenance of production systems, the modern approach has Developers building automated solutions that manage the 24/7 support and maintenance function. As a result, the mean time to react in an incident response can decrease from minutes or hours to just seconds, and the margin for human error is drastically reduced.

An added benefit, which is especially important for the healthcare industry, is the fundamental paradigm shift in data security. The level of isolation that can be achieved by utilising automation instead of humans provides an opportunity to protect private data that would otherwise be impossible. As automation has limited scope and access, is subject to stringent auditing controls, and has been reviewed, tested and approved prior to its implementation, the room for unexpected exposure of protected data is close to zero. Automation is also ubiquitous and can be deployed anywhere in the world, providing another advantage for the healthcare industry, which also has strong data sovereignty requirements.

In contrast, in a traditional operations model an international IT provider utilising DevOps or Ops Engineers requires local teams in each country, which significantly limits scalability in the context of a global service.

FIGURE 3: Site Reliability Engineering Diagram

Monitoring SLI and Alerting SLO

For effective management of Service Level Agreements it is fundamental to apply a robust monitoring and alerting system. Without visibility and appropriate alerting, SLAs are an empty concept.

Nowadays, monitoring and alerting has become a discipline of its own, and numerous statistical and machine learning techniques are applied to monitoring, including trend prediction, heuristic anomaly detection, and others. Within the field there are constant advances, and new capabilities are now expected by the industry. These include traceability and correlation between logs and metrics.

Building a monitoring and alerting system can quickly become a project of its own, instead of a means to an end. Fortunately, there are multiple offerings “as a Service” available that provide a much better cost/benefit ratio than building a solution.

For the purpose of this paper, the following concepts are important:

Warnings

When an SLI is approaching levels that could indicate an anomaly in the system, but is not at a level worthy of an alert, it would be effective to notify the support or incident response system.
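The difference between a level that merits a warning and one that merits an alert can be sketched as a simple threshold check. This is a minimal Python illustration under assumptions: the `classify` function, the `Level` names, and the disk-usage thresholds are hypothetical, and real monitoring services usually offer richer rules.

```python
from enum import Enum


class Level(Enum):
    OK = "ok"
    WARNING = "warning"
    ALERT = "alert"


def classify(value, warning_threshold, alert_threshold):
    """Classify an SLI where higher values are worse (e.g. used disk space)."""
    if warning_threshold >= alert_threshold:
        raise ValueError("warning threshold must sit below the alert threshold")
    if value >= alert_threshold:
        return Level.ALERT
    if value >= warning_threshold:
        return Level.WARNING
    return Level.OK


# Hypothetical thresholds for used disk space, as a fraction of capacity.
for used in (0.60, 0.82, 0.97):
    level = classify(used, warning_threshold=0.80, alert_threshold=0.95)
    print(f"{used:.0%} used -> {level.value}")
```

Placing the warning threshold well below the alert threshold leaves time to act before urgent intervention is needed.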

Common warnings could include “used disk space is higher than [a certain value]”, which would allow the process to add capacity before a level requiring an alert is reached, and “too many lost packets (in a network)”, which could indicate that some active element is not performing even though the SLOs are still within tolerable values. Warnings should not necessarily require immediate action, and could be handled by an Engineering Team on the next working day if they occur outside of business hours.

The term “incident response system” has been used above, which implies that warnings should be managed by automation as much as possible, reducing the cost and improving the quality of the service at the same time.

Usually, machine learning, trends, and heuristics should at most produce warnings, since those predictive tools are by their nature designed to warn about the likelihood of something that has not happened yet. However, in some cases they could be considered for alerts, as elaborated in the next section.

Alerting

Once the SLI values have reached levels close to an SLO and urgent action is required, or a trend is showing an imminent breach of an SLO, an alert must be triggered instead. In a traditional model all incident alerts are routed through an L1 Response Team, but it is the opinion of the author that such a model is out of date.

This method is unnecessarily slow, relying on response and processing by a Help Desk before the incident is escalated and triaged by L1 Engineers. Given the current expectations in terms of mean time to recover and availability, such response times would not be acceptable, especially for the Health Industry.

In traditional settings, a notification service is usually employed to send the alert to an on-call engineer, who will typically respond in 20 minutes to 2 hours for the most severe cases. In an automated system, the reaction time could be in the order of minutes.

It is customary to have SLOs for incidents, classified by severity, as in the following example:

SLI | Description | L1 | L2 | L3
Mean Time to Detection | Time to detect that an SLO has been (or is about to be) breached | < 1 minute | < 1 minute | < 1 minute
Incident Response | Time to commence responding to the incident | < 3 minutes | 4 hours | Next working day
Mean Time To Recover | Mean time to recover from incidents (taken over a 30-day period) | < 30 minutes | < 8 hours | N/A
Report Incident | Preliminary Report to the Customer | 24 hours | 5 working days | N/A
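Per-severity incident SLOs of this kind lend themselves to a simple lookup and comparison. The Python sketch below transcribes the detection, response, and recovery targets from the example; everything else is an assumption for illustration: the names `INCIDENT_SLOS` and `breached_targets`, treating every target as a strict upper bound (including the "4 hours" entry, which carries no "<" sign), and leaving "Next working day" and "N/A" unchecked.

```python
from datetime import timedelta

# Targets transcribed from the example table. None marks entries that are not
# a fixed duration ("Next working day") or do not apply ("N/A").
INCIDENT_SLOS = {
    "L1": {"detect": timedelta(minutes=1), "respond": timedelta(minutes=3),
           "recover": timedelta(minutes=30)},
    "L2": {"detect": timedelta(minutes=1), "respond": timedelta(hours=4),
           "recover": timedelta(hours=8)},
    "L3": {"detect": timedelta(minutes=1), "respond": None, "recover": None},
}


def breached_targets(severity, measured):
    """Return the names of the targets that the measured times fail to meet.

    Targets are treated as strict upper bounds, so a measured time equal to
    the limit counts as a breach.
    """
    targets = INCIDENT_SLOS[severity]
    return [
        name for name, limit in targets.items()
        if limit is not None and name in measured and measured[name] >= limit
    ]


incident = {"detect": timedelta(seconds=20),
            "respond": timedelta(minutes=5),
            "recover": timedelta(minutes=10)}
print(breached_targets("L1", incident))
```

Note that the recovery target in the example is a mean over 30 days, so in practice the value passed as `recover` would be that mean rather than the duration of a single incident.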

The traditional system also requires a substantial number of engineers and trained personnel to provide responses to incidents, which increases the complexity and cost of the service, as well as the likelihood of human error.

And finally, for a truly global service, it requires personnel with clearance in each country to be able to review the relevant information, since it could contain PHI/PII data (logs, URLs, incident reports, etc.).

Nowadays, for alert cases, the use of automation as a response should be the default for most, if not all, cases. While some exceptions will be necessary, it is good practice to review those exceptional cases periodically and explore new advances that would allow for the automation of incident responses.

This will not only reduce the mean time to recover and increase availability, but will also reduce costs and the opportunity for human error. In terms of access to PHI/PII data, as the automation is only software, it can be deployed within the legal boundaries allowed by the relevant regulation, without the need for human engineers to have clearance to review data (which completely removes any risk of a human accessing the data).

Reports and Credit Requests

While it does not have an impact on the actual service, it is important for customers to receive performance reports about the service. Such reports can be either periodic, issued by the Service Provider, or on-demand, available by request or, in a state-of-the-art service, extracted whenever necessary through a self-service web user interface or API. Some services also offer access to real-time monitoring via a self-service Web UI, which is helpful during incidents, allowing all of the interested parties to follow the events as they happen.

Reports should include all the SLAs agreed to by the parties, and a brief description of any incident that might have occurred, clearly specifying whether a breach has happened and providing means to request more information if necessary/available.

Finally, as SLAs specify compensation (expressed as “service credits”), whenever an incident breaches the SLA customers need to have a formal process to request the credits. This needs to be clearly specified in the contract, detailing the frequency with which requests will be honoured (monthly, yearly, etc.).

Disaster Recovery (DR)

Disaster Recovery is an SLA that can have significant ramifications on a healthcare provider. It is a comprehensive plan to recover a service from a natural or human-caused disaster. While it is rare, business interruption is accepted as a possibility in all industries, which makes it incredibly important to have a comprehensive DR plan to take control in such an unlikely event, especially in the healthcare industry where lives could be at risk.

Interestingly, new advances in cloud computing have drastically reduced the likelihood of a negative outcome from a DR incident. In the past, a common DR risk was the loss of power due to a climate catastrophe impacting a data centre. Now, with public cloud providers offering hosting options in different locations (within a country or abroad), it is technically simple to reduce the severity and impact of such incidents at a Business Continuity level. Whether that makes economic sense is still a business decision.

This section will explore the different aspects of a Disaster Recovery Plan, in particular from the point of view of the healthcare industry.

Preventive, Detective and Corrective Measures

An important action item in Disaster Recovery planning is the implementation of concrete preventive, detective, and corrective measures.

Protecting the data is perhaps the most important preventive task. A contemporary service should take advantage of preventive measures that utilise public cloud capabilities, such as the aforementioned distributed geographic locations.

Distributed data stores would, therefore, be the best solution to protect against data loss and ensure data availability at all times, but they are still rarely applicable, since they require systems capable of keeping replication up to date (with the problem of synchronicity) and expensive amounts of bandwidth. Instead, remote backups with cold infrastructure are still a common recovery technique in most modern DR plans.

With respect to detective measures, as mentioned previously, monitoring tools are now advanced enough to perform sophisticated analyses, such as heuristics, time series, trend analytics, etc., allowing for early detection and also, in some cases, the prediction of DR events.

Finally, corrective measures must be designed and implemented following the target SLO, as elaborated below.

Main SLI for DR

Recovery Time Objective (RTO) describes how much time would be required to reach at least a minimum acceptable level of service, so that operations can recommence. Recovery Point Objective (RPO) describes how much information will be lost due to the disastrous event, expressed in terms of time (how far back in time information will be lost).

RTO

New advances in cloud computing and automation have reduced recovery time significantly. Currently, it is accepted best practice to have the infrastructure deployment fully automated, and an RTO of an hour or less is now possible for state-of-the-art services. The healthcare industry should strive to achieve such standards.

RPO

The recovery point in time will strongly depend on the nature of the application or system being recovered. Typically, information in flight (being transmitted and not yet processed or acknowledged by the destination) is almost certain to be lost. While sophisticated solutions are sometimes possible, the most common tool for data recovery is still backups, and the RPO would then be equal to the interval between those backups.

A common, yet contentious, point is the strategy to follow for primary and secondary data repositories. A primary data repository is considered the true data owner and is the main target of the Disaster Recovery Data Protection (typically backups), while the secondary can be repopulated or reconstructed from the data in the primary, and therefore it is not absolutely required to include it in the backups.

However, the time for repopulation will have to be added to the RTO, and therefore, depending on the volume of data, the option of backing up the secondary can be chosen for convenience, since a backup store is usually cheaper than prolonging the RTO.

All data after the RPO will have to be reloaded from source.
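The relationship between backup frequency, RPO, and the recovery steps that make up the RTO can be sketched with some simple arithmetic. The Python below is an illustration under assumptions: all durations are invented, the function names are hypothetical, and the recovery steps are assumed to run one after another without overlap.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval):
    """With periodic backups, data written just after one backup stays
    unprotected until the next, so up to one full interval can be lost."""
    return backup_interval


def estimated_rto(deploy_infrastructure, restore_primary,
                  repopulate_secondary=timedelta(0)):
    """Sum the recovery steps, assuming they run sequentially."""
    return deploy_infrastructure + restore_primary + repopulate_secondary


# Hypothetical figures for a service with nightly backups.
print("Worst-case RPO:", worst_case_rpo(timedelta(hours=24)))

# Option A: the secondary repository is also backed up and restored quickly.
with_backup = estimated_rto(timedelta(minutes=20), timedelta(minutes=30),
                            repopulate_secondary=timedelta(minutes=15))
# Option B: the secondary is rebuilt from the primary after the restore.
with_rebuild = estimated_rto(timedelta(minutes=20), timedelta(minutes=30),
                             repopulate_secondary=timedelta(hours=3))
print("RTO with secondary backup:", with_backup)
print("RTO with secondary rebuild:", with_rebuild)
```

Comparing the two options against the target RTO shows when paying for extra backup storage is preferable to a longer recovery.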

DR Testing

Testing DR is useful in discovering gaps in the plan, which can be caused by missing a potential problem during planning or by ‘drifting’ in the infrastructure or software of a service. Drifting happens when external or internal changes to a service inadvertently invalidate the DR plan. This invalidation may not be immediately obvious and might not be discovered unless a DR Test is executed.

Testing DR periodically is a widespread practice that should be adopted by all state-of-the-art services. It is possible to clone a system and create a replica environment within the cloud, which can be intentionally disrupted as if a disastrous event had occurred. This testing should produce a number of reports, with different levels of detail, tailored for the different stakeholders and audiences.

FIGURE 4: DR Testing Diagram

Conclusion

There are a number of scenarios within the healthcare industry where lives can be at stake, so Service Levels for Software as a Service must be held to the highest standards possible.

It is imperative that the continuing advances in cloud and monitoring services are not overlooked and are tested and implemented as soon as they become available.

The healthcare industry needs to follow in the footsteps of the commercial sector and understand that increasing quality and security, while reducing costs, is not only possible but almost inevitable to remain relevant in an evolving world.

Further reading

Anatomy of a cloud service SLA: Availability guarantees: https://www.techrepublic.com/blog/the-enterprise-cloud/anatomy-of-a-cloud-service-sla-availability-guarantees/

Incident Management & Service Level Agreement: An Optimistic Approach: http://ijcsit.com/docs/Volume%204/vol4Issue3/ijcsit2013040317.pdf

Amazon Compute Service Level Agreement: https://aws.amazon.com/ec2/sla/

Microsoft Azure Service Level Agreements: https://azure.microsoft.com/en-us/support/legal/sla/

Service Level Objectives: https://landing.google.com/sre/book/chapters/service-level-objectives.html

CloudForge Service Level Agreement: http://www.cloudforge.com/uptime-sla

Orion Health™ is a trademark of Orion Health group of companies. All other trademarks displayed in this document are the property of Orion Health or their respective owners, and may not be used without written permission of the owner. All patient information shown in any imagery is for representation and demonstration purposes only and is not related to a real patient. Orion Health makes no warranties and the functionality described within may change without notice.

Copyright © 2018 Orion Health™ group of companies | All rights reserved | www.orionhealth.com