bit.ly/safer-operations Complexity

@_pkillPhoto by| bit.ly/safer-operations Vu Viet Anh on Unsplash @_pkill | bit.ly/safer-operations @_pkill | bit.ly/safer-operations Compare 2020 with 2010

Developers can do much with much less

@_pkill | bit.ly/safer-operations launch entire stack

few lines of code

@_pkill | bit.ly/safer-operations “Do you think that kind of automation is easy? Or cheap?”

Still from Jurassic Park (1993), Universal Pictures

@_pkill | bit.ly/safer-operations https://info.sourcegraph.com/emergence-of-big-code-2020-survey

Findings from 2020 Emergence of Big Code Survey

Over half of developers surveyed are working with 100x the volume of code than 10 years ago

@_pkill | bit.ly/safer-operations https://info.sourcegraph.com/emergence-of-big-code-2020-survey

Findings from 2020 Emergence of Big Code Survey

Majority of developers report

92% Pressure to release changes more quickly

90% Software their teams deliver is more critical

50% Understanding dependencies is more difficult

74% Updating code is more dangerous

@_pkill | bit.ly/safer-operations Why isn’t this leading to more company-ending outages?

@_pkill | bit.ly/safer-operations What is driving this increase in complexity?

It’s success

@_pkill | bit.ly/safer-operations Law of Stretched Systems

@_pkill | bit.ly/safer-operations Capacities being stretched

+ Organizational workload

+ Pace of development cycle

+ Demands on expertise

+ Speed of technological innovation

Photo by Mitchell Luo on Unsplash Are We Getting Better Yet? Progress toward safer operations

Alex Elman SRE Leadership @_pkill @_pkill | bit.ly/safer-operations @_pkill | bit.ly/safer-operations

Prioritize a learn and adapt safety mode over a prevent and fix safety mode

@_pkill | bit.ly/safer-operations Prevent

& Fix

@_pkill | bit.ly/safer-operations Learn &

Adapt

@_pkill | bit.ly/safer-operations The Prevent & Fix cycle

Prevent

Tradeoffs

Fix

Focus on safety Surprise

Time

© 2020 Alex Elman @_pkill | bit.ly/safer-operations The Learn & Adapt reinforcing loop

Learn Getting safer Adapt Focus on safety Surprise

Time

© 2020 Alex Elman @_pkill | bit.ly/safer-operations Action items form a defensive strategy but do not lead to learning

@_pkill | bit.ly/safer-operations A taxonomy of action-items

+ Emergency work

+ Planned work

● eventually gets completed

● never gets completed

+ Undiscovered critical work

Photo by Mitchell Luo on Unsplash Better preventions Better fixes

@_pkill | bit.ly/safer-operations itsonlyamodel.com

@_pkill | bit.ly/safer-operations Measuring` progress

@_pkillPhoto by Benn| bit.ly/safer-operations McGuinness on Unsplash It is wrong to suppose that “if you can’t measure it, you can’t manage it – a costly myth.”

W. Edwards Deming

@_pkill | bit.ly/safer-operations It is wrong to suppose that “if you can’t measure it, you can’t manage it – a costly myth.”

W. Edwards Deming

@_pkill | bit.ly/safer-operations The most important figures that one needs for “management are unknown or unknowable, but successful management must nevertheless take account of them.”

Lloyd S. Nelson

@_pkill | bit.ly/safer-operations Can these be measured?

Richer metrics

+ Magnitude of psychological safety

+ Comfort in surfacing risk to leadership

+ Potential $$ losses from incidents that were avoided

+ Net cost/benefit of experts changing teams

+ Amount an organization is learning

Photo by ASHLEY D'CRUZ on Unsplash Be data-driven avoid being driven by data

@_pkill | bit.ly/safer-operations Z22, CC BY-SA 3.0, via Wikimedia Commons

“It is a difficult thing to look a winking light on a board, or hear a peeping alarm … and … draw any of rational picture of something happening out in the vast plant, let alone meet it with anything but a mechanical response”

Three Mile Island commission report http://bit.ly/three-mile-island

@_pkill | bit.ly/safer-operations Valid Metrics

@_pkill | bit.ly/safer-operations Everybody has a story to tell

@_pkill | bit.ly/safer-operations Metrics anchor the story and the story gives meaning to the metrics

@_pkill | bit.ly/safer-operations Photo by Nicolas Solerieu on Unsplash Activities around creating and maintaining safety

Preparation

Reporting reliability

Assessing accountability

Incident analysis

Incident -ups

@_pkill | bit.ly/safer-operations Photo by Nicolas Solerieu on Unsplash Activities around creating and maintaining safety

Preparation

@_pkill | bit.ly/safer-operations Preparation

Emphasizes only avoiding a recurrence

Emphasizes a more accurate and complete understanding Barriers and guardrails are used to prevent people from repeating mistakes

@_pkillPhoto | bit.ly/safer-operations by Francisco Galarza Examples of barriers and guardrails

+ Turning MySQL Safe Mode on in Production

+ Disallowing SSH access

+ Capping instance capacity

+ Preventing rollbacks without approval

Photo by Mitchell Luo on Unsplash All practitioner actions are gambles.”

Richard I. Cook https://how.complexsystems.fail/#10

@_pkill | bit.ly/safer-operations Performance variability

@_pkill | bit.ly/safer-operations Ensure positive outcomes through activities like team practice and chaos experiments

@_pkill | bit.ly/safer-operations Chaos experiments as scrimmage

@_pkillPhoto | bit.ly/safer-operations by Valentin B. Kremer on Unsplash Practice: safe, predictable Chaos: safe, unpredictable Incidents: unsafe, unpredictable

@_pkill | bit.ly/safer-operations Why are incidents so difficult?

@_pkill | bit.ly/safer-operations Escape Rooms

Still@_pkill from Schitt’s | bit.ly/safer-operations Creek (2020), Pop TV/Netflix Hindsight

@_pkill Photo| bit.ly/safer-operations by Silas Köhler on Unsplash @_pkill | bit.ly/safer-operations Mistakes are a feature not a bug

@_pkill | bit.ly/safer-operations Stories

+ The tale of the well-choreographed incident response

+ A brand new on-call responders painful experience through the

obstacle course to address a stuck deploy

+ The one where an accidental line of YAML burned $250,000 in

cloud costs

@_pkill | bit.ly/safer-operations Humans in the loop

@_pkill | bit.ly/safer-operations Automation an opportunity to enhance or enable humans @_pkill | bit.ly/safer-operations Photo by Andy Kelly on Unsplash Photo by Nicolas Solerieu on Unsplash Activities around creating and maintaining safety

Preparation

Reporting reliability

@_pkill | bit.ly/safer-operations Reporting reliability

Reliability outcomes and human performance are predicted and controlled

Reliability outcomes and human performance are monitored and influenced Incidents are a source of insights but not a good measure of reliability

@_pkill | bit.ly/safer-operations Reliability Historically good performance

Robustness Retains good performance within a threshold when challenged

Brittleness Predictably poor performance when challenged

@_pkill | bit.ly/safer-operations How fast can we safely go in a brittle system?

@_pkill | bit.ly/safer-operations Service Level Objectives

@_pkill | bit.ly/safer-operations Service Level Objectives

+ How reliable have we been?

+ How fast can we go?

+ How fast should we go?

+ What is the user experiencing?

+ Can we keep relying on FooService?

Photo by Mitchell Luo on Unsplash Control vs Influence

@_pkill | bit.ly/safer-operations Things we control Things we influence What we want

inputs outputs outcomes

Circumstances out of our control circumstances @_pkill | bit.ly/safer-operations runbooks queue delays training timeouts Production procedures hypotheses stabilized fault tolerance mitigations

packet loss traffic spikes hardware failures circumstances users @_pkill | bit.ly/safer-operations Outcomes over Outputs

Watch the inputs Influence the outputs Target the outcomes

@_pkill | bit.ly/safer-operations Photo by Nicolas Solerieu on Unsplash Activities around creating and maintaining safety

Preparation

Reporting reliability

Assessing accountability

@_pkill | bit.ly/safer-operations Assessing accountability

People mistakes are blamed. They are obligated to take responsibility.

People who make mistakes feel supported which inspires them to seek opportunities. Attribution is important to learning but can also lead to blame

@_pkill | bit.ly/safer-operations Opportunity vs Obligation

@_pkill | bit.ly/safer-operations Opportunities are taken, not given

@_pkill | bit.ly/safer-operations Opportunities

+ Identifying

● Defined goals and rationale

+ Selecting

● Career growth

● Assumptions and risks

+ Realizing

● Definition of “done”

Photo by Mitchell Luo on Unsplash Photo by Nicolas Solerieu on Unsplash Activities around creating and maintaining safety

Preparation

Reporting reliability

Assessing accountability

Incident analysis

@_pkill | bit.ly/safer-operations Incident analysis

Incidents result only in technical fixes

Incidents are investments in more capable organizations Focusing on reducing errors diverts energy “and attention into … narrowly targeted ‘fixes’ that treat symptoms but not the underlying problem”

Robert L. Wears

@_pkill | bit.ly/safer-operations ↑ Performance = ↓ errors + ↑ insight generation

@_pkill | bit.ly/safer-operations Ignoring “above the line” misses at least 50% of the opportunities

Line of representation

Local-only fixes Databases “below the line” Code

Services

Users

Adapted from R. I. Cook for ACL, © 2016 all rights reserved @_pkill | bit.ly/safer-operations Improving a process

@_pkill | bit.ly/safer-operations What are we looking for during incident analysis?

@_pkill | bit.ly/safer-operations @_pkill | bit.ly/safer-operations @_pkill | bit.ly/safer-operations Is this in a runbook?

How does the lag impact this ability?

What else can impact the website’s ability?

@_pkill | bit.ly/safer-operations How are these responders performing?

@_pkill | bit.ly/safer-operations Judging human performance with metrics applies conclusions without context

@_pkill | bit.ly/safer-operations Absolute time (CST) Delta time

Event start 09/04 10:30:00 - A/B turned up to 1%

Time to Detect 10/22 03:00:00 1M 16d 16h 30m Product Manager identifies a discrepancy

Time to Diagnose 10/26 04:30:00 4d 1h 30m Software dev makes a diagnosis (bug)

Time to Correct 10/26 04:31:00 1m A/B test turned off

Time to Recover 10/26 04:37:00 6m Prod declared stabilized

@_pkill | bit.ly/safer-operations Recording performance metrics promotes one perspective over others

@_pkill | bit.ly/safer-operations @_pkill | bit.ly/safer-operations @_pkill | bit.ly/safer-operations @_pkill | bit.ly/safer-operations Incident analysis

Richer metrics

+ Number of ○ technical fixes inspired by incident analysis

○ insights generated per incident analyzed

+ Time spent

○ verbally coordinating actions ○ launching ec2 hosts

○ restarting many instances of a database

Photo by ASHLEY D'CRUZ on Unsplash Photo by Nicolas Solerieu on Unsplash Activities around creating and maintaining safety

Preparation

Reporting reliability

Assessing accountability

Incident analysis

Incident write-ups

@_pkill | bit.ly/safer-operations Incident write-ups

Incident write-ups favor a particular viewpoint above others

Incident write-ups faithfully present multiple perspectives Interview Debriefing

@_pkill | bit.ly/safer-operations “How did you become aware of the certificate expiration?” “Oh, Edvaldo had some magic command that he ran…”

@_pkill | bit.ly/safer-operations “I was kind of amazed that some of the certificates were pushed out through Puppet. Maybe there’s a gap there—that somebody knows how to push a certificate out but we weren’t aware that the Mongo [database] had to be restarted by a certain time and date.”

Photo @_pkillby Christina | bit.ly/safer-operations@ wocintechchat.com on Unsplash “Right from the start when we had to restart everything I knew this was going to be a ton of work … helping to take care of the number of servers that we have but also to help verify everything.”

Photo @_pkillby Christina | bit.ly/safer-operations@ wocintechchat.com on Unsplash “I asked Shira and Dmitri to help the workload.”

Photo @_pkillby Christina | bit.ly/safer-operations@ wocintechchat.com on Unsplash “We both made the mistake of assuming that users from India are served by the Hong Kong DC because that seemed to be the closest one, but it turns out it actually goes to London.”

@_pkillPhoto | bit.ly/safer-operations by Amy Hirschi on Unsplash @_pkill | bit.ly/safer-operations https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf

@_pkill | bit.ly/safer-operations https://youtu.be/TqaFT-0cY7U

@_pkill | bit.ly/safer-operations Traps to avoid

+ Counterfactual reasoning

● “She should have waited before restarting…”

+ Normative language

● “He lacked an understanding of...”

+ Mechanistic reasoning

● “Maintenance would be less risky if we automated this.”

Photo by Mitchell Luo on Unsplash Incident write-ups

Richer metrics

+ Number of

○ distinct write-up document opens ○ attendees to review meetings

○ distinct perspectives represented

○ employees trained using write-up

+ Qualitative survey feedback ○ How was the write-up useful?

Photo by ASHLEY D'CRUZ on Unsplash Ask deeper` questions

@_pkillPhoto by Benn| bit.ly/safer-operations McGuinness on Unsplash Our system has been very reliable over the past few quarters. Why?

@_pkill | bit.ly/safer-operations How close to the safety boundary is the pod autoscaler pushing my infrastructure?

@_pkill | bit.ly/safer-operations Are my cloud provider’s staff a team player in my sociotechnical system?

@_pkill | bit.ly/safer-operations Recap

@_pkill | bit.ly/safer-operations + Deeper understanding leads to better fixes and enduring prevention

+ Reliability is reported using SLOs not incidents metrics

+ Nobody has control over how an incident unfolds

+ Incidents are an opportunity to improve the accuracy of mental models

+ At least half of incident analysis should focus on human factors

+ Comparative storytelling enhances learning

Photo by Mitchell Luo on Unsplash bit.ly/safer-operations

Will Gallego Sue Lueder John Allspaw Ketan Gangatirkar Laura Maguire Johan Bergström Vanessa Huerta Granda Lloyd S. Nelson Jabe Bloom Patrick J. Hayes Jens Rasmussen Bryan Cantrill Rein Henrichs J. Paul Reed Martin Check Alex Hidalgo Casey Rosenthal Iulian Circo Lorin Hochstein Matthew Schemmel Richard Cook Robert R. Hoffman Joshua Seiden Todd Conklin Erik Hollnagel Steven Shorrock Sidney Dekker Nora Jones Andrew S. Townsend W. Edwards Deming Ryan Kitchens Robert L. Wears Kenneth M. Ford Gary Klein Ron Westrum Martin Fowler Jason Koppe David Woods https://learningfromincidents.io

@_pkill | bit.ly/safer-operations Thank you

Alex Elman @_pkill bit.ly/safer-operations