Understanding Service Level Agreements in Healthcare
Orion Health White Paper
Gustavo Herrera, Site Reliability Engineering Director
04 2018

How to measure and control the quality of services in the health industry

Service Level Agreements (SLAs) are one of the most important parts of any cloud service agreement, yet their importance can be poorly understood. In their most basic form, SLAs are a single contract between two parties focusing on the type and quality of service provided to a client. Yet how the contracts are developed, and what they contain, can be overlooked.

Strong SLAs are paramount within healthcare, where capabilities such as data availability can mean the difference between life and death. This whitepaper presents a comprehensive overview of what SLAs are, how they are created, and their importance to healthcare organisations.

Definitions: SLI, SLO and SLA

Service Level Indicator (SLI)

Defining a set of SLIs is the first step in creating a Cloud Service Agreement.

An SLI can be defined as a variable (also known as a value) that has enough relevance for the service and can be quantitatively and objectively measured. The most well-known SLIs are “system availability” and “downtime”.

However, as every system is different, a system can have an almost infinite number of variables. It is important to note that not all variables bear the same relevance or have the same impact on quality. Some are concrete and can be quantified, while others are more abstract and harder to represent in numbers.

To ensure the order of relevance is understood, a hierarchy of variables, focusing on significance and measurability, should be created. These two features are key to defining accurate SLIs.

Some SLIs can be composed of an aggregation of other SLIs, but each must have enough relevance on its own to be listed independently; otherwise it should be rolled into a relevant SLI and discarded. For example, database availability, business logic availability, web server availability, and API availability are all different SLIs. Each has a meaning of its own, but they are also presented as an aggregated SLI, such as system availability.

However, it could be argued that some availabilities have relative importance; for example, web server availability will not have an impact on consumers using the API. Therefore, it is still valuable to distinguish between the different SLIs rolled up into system availability or uptime.

FIGURE 1: Service Level Indicator Diagram (labels: Variables, Relevant, Measurable, SLI)

Service Level Objective (SLO)

An SLI alone does not provide a means to measure the quality of a service, as there is still no judgment regarding what value, or range of values, the SLI could take. Service Level Objectives define what range of those values will constitute an acceptable level of service for a given SLI.

SLOs usually require some context to be interpreted. For example, availability is expressed as the percentage of time that a system is usable over a period of time (usually a month). Another common definition of availability is “the system is considered available if it responds within 10 seconds to a user request”.

Another factor to consider is the cost of enhancing the SLO compared with the added benefit as perceived by the customer. For example, for user interaction, 3-4 seconds of response time is considered acceptable and there is no perceived value in reducing the reaction time beyond that threshold. Usually, this follows the law of diminishing marginal returns: an increasing volume of investment in improving the SLO will become less and less productive. Each additional unit of input will produce less output than the prior unit of input, until it no longer makes economic sense to keep investing money.

The Myth of Five-Nines:
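The arithmetic behind an availability target such as five-nines can be sketched in a few lines. The following is a minimal Python illustration and not part of the original paper: the function name `allowed_downtime`, the 365.25-day year and the 30-day month are assumptions made for the example.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime an availability target permits over a period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    # Fraction of the period during which the service may be unavailable.
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


# Assumed measurement periods (hypothetical choices for this sketch).
YEAR = timedelta(days=365.25)  # average year length, including leap years
MONTH = timedelta(days=30)     # a 30-day month

for target in (99.5, 99.9, 99.99, 99.999):
    per_year = allowed_downtime(target, YEAR)
    per_month = allowed_downtime(target, MONTH)
    print(f"{target}% -> {per_year} per year, {per_month} per month")
```

Under these assumptions, a 99.999% target over a 365.25-day year allows roughly 5 minutes and 15.6 seconds of interruption, which agrees with the figure quoted in this sidebar.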
The ‘myth of the five-nines’ is a particularly famous but currently unrealistic goal for IT providers who, in pursuit of perfection, now target performance to three decimal places: in the case of uptime, being up 99.999% of the time, a figure written with five consecutive nines.

If this goal were possible, it would equate to just 5 minutes and 15.6 seconds of service interruption in a given year. While this currently unachievable goal is frequently discussed, it is not offered by any of the major public cloud providers or any other major service providers for server-based capabilities.

For other SLIs, such as ‘error rate’, the objective can be expressed as the number of errors (per minute) considered acceptable, which informs how many retries must be allowed. The most common pattern is to have an SLO expressed as the upper (or lower) bound of the acceptable range. For instance, any network round-trip time (RTT) lower than a given acceptable number of seconds would be a valid SLO (e.g. RTT < 4 sec).

It is important to consider external factors and chains of dependency when defining an SLO. Network latency, for example, cannot be defined if part of the communication happens over the public Internet, since it is technically impossible to predict the route a particular packet will follow under TCP/IP networks. Sometimes even the laws of physics need to be considered, such as the time it takes for a beam of light to travel through the fibre optic cables under the Pacific Ocean.

Therefore, an SLO must be set to realistic values. It is of no use to define an SLO as “RTT = 0 Sec” or “Availability = 100%”, since those values can never be achieved.

Less obvious examples are values that go beyond the substrate that the SLOs are built upon. For example, if the Infrastructure Provider cannot guarantee an uptime for virtual machines (or containers) higher than 99.5%, then it will not be possible (without extra work) for the Service Provider to offer an overall service availability higher than 99.5%.

Ultimately, the combination of a relevant and quantifiable variable and an expected value range does not result in an SLA. In other words, when the expression “Availability > 99.5%” is found, that is actually an SLO and not an SLA. While SLOs are not enough on their own, setting a reasonable set of SLOs is of the essence and constitutes the core of a well-defined Service Contract Agreement in terms of quality of service.

Service Level Agreements (SLA)

To build SLAs from Service Level Objectives, some levels of criticality and cost (in terms of business value) need to be introduced. The importance of the SLO needs to be qualified and paired with the consequences of breaching the commitment.

Service Level Agreements are included in the Cloud Service Agreement once this has been completed and are usually associated with a level of compensation in case of failure to honour them. Such compensations, typically called “service credits”, can be defined monetarily or via bonuses on future services. The bonuses typically constitute an amount of service credits that depends on the severity of the failure.

For example, an interruption of service for 30 minutes compared with one day of interruption will result in materially different degrees of penalty, in extreme cases resulting in dissolution of the contract.

SLI (what)
• quantifiable
• relevant

SLO (objective)
• minimum/maximum
• average

SLA (consequence)
• contractual
• service credits

FIGURE 2: Service Level Agreements Diagram

The impact of automation on SLA

Site Reliability Engineering (SRE)

The relatively recent introduction of the concept of Site Reliability Engineering in the IT industry has revolutionised the approach to increasing the quality, speed, security, and certainty of delivering an SLA, especially on cloud services.

Site Reliability Engineering (or SRE) focuses on replacing Engineers with Automation. So, instead of having DevOps or Ops Engineers providing 24/7 support and maintenance of production systems, the modern approach has Developers building automated solutions that manage the 24/7 support and maintenance function. As a result, the mean time to react to an incident can decrease from minutes or hours to just seconds, and the margin for human error is drastically reduced.

An added benefit, which is especially important for the healthcare industry, is the fundamental paradigm shift in data security. The level of isolation that can be achieved by using automation instead of humans provides an opportunity to protect private data that would otherwise be impossible. As automation has limited scope and access, is subject to stringent auditing controls, and has been reviewed, tested and approved prior to its implementation, the room for unexpected exposure of protected data is close to zero. Automation is also ubiquitous and can be deployed anywhere in the world, providing another advantage for the healthcare industry, which also has strong data sovereignty requirements.

In contrast, in a traditional operations model an international IT provider that relies on DevOps or Ops Engineers requires local teams in each country, which significantly limits scalability in the context of a global service.

FIGURE 3: Site Reliability Engineering Diagram

Monitoring SLI and Alerting SLO

For effective management of Service Level Agreements it is fundamental to apply a robust monitoring and alerting system. Without visibility and appropriate alerting, SLAs are an empty concept.

Nowadays, monitoring and alerting has become a discipline of its own, and numerous statistical and machine learning techniques are applied to monitoring, including trend prediction, heuristic anomaly detection, and others.

Building a monitoring and alerting system can quickly become a project of its own, instead of a means to an end. Fortunately, there are multiple offerings “as a Service” available that provide a much better cost/benefit ratio than building a solution.

For the purpose of this paper, the following concepts are important:

Warnings