
Controlling the Costs of Coordination in Large-scale Distributed

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Laura Marie Dose Maguire, MSc
Graduate Program in Industrial and Systems Engineering

The Ohio State University 2020

Dissertation Committee
Dr. David D. Woods, Advisor
Dr. Michael F. Rayo, Committee Member
Dr. Philip J. Smith, Committee Member

Copyrighted by Laura Marie Dose Maguire 2020

Abstract

Responding to anomalies in the critical digital services domain involves coordination across a distributed network of automated subsystems and multiple human roles (Allspaw, 2015; Grayson, 2019). Exploring the costs of this joint activity is an underexamined area of study (Woods, 2017) but has important implications for managing complex systems across distributed teams and for tool design and use. It is understood that anomaly recognition is a shared activity between the users of the service, the automated monitoring systems, and the practitioners responsible for developing and operating the service (Allspaw, 2015). In addition, multiple, diverse perspectives are needed for their different views of the system and its behavior and their ability to recognize unexpected and abnormal conditions. While the collaborative interplay and synchronization of roles is critical in anomaly response (Patterson et al., 1999; Patterson & Woods, 2001), the cognitive costs for practitioners (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006) can be substantial. The choreography of this joint activity is shown to be subtle and highly integrated into the technical efforts of dynamic fault management. This work uses process tracing to take a detailed look at a corpus of five cases involving software engineers coping with unexpected service outages of varying difficulty. In doing so, it is noted that the practices of incident management work very differently than domain models suggest and that the tooling designed to aid coordination incurs a cognitive cost for practitioners. Adding to the literature on coordination in ambiguous, time-pressured and non-co-located groups, this study shows that adaptive choreography enables practitioners to cope with dynamic events – and dynamic coordination demands. These demands can also be a function of the coordination strategies of others – in particular when they shift costs of coordination across time and organizational boundaries.


Dedication

This dissertation is dedicated to my parents – Bente & Paul. Words are insufficient to express my gratitude for the gifts you have given me – love, curiosity, creative expression, unwavering confidence and deeply human ethics. I will forever be an explorer, a reader and an advocate for equality, empathy and justice from the examples you have laid out with your lives. I love you so much.


Acknowledgments

First and foremost, Dr. David Woods has my tremendous gratitude for his unflagging support and investment in my intellectual growth as a researcher and a scholar. His influence has been central in cultivating my own approach toward understanding complexity and the fields of practice that cope with its dragons. Dave, you are one in a million and I am truly honored to be a part of the CSEL family.

My PhD program would not have been as rich without the gentle cajoling and always genuine commitment to my development from Dr. Richard Cook. Richard, your mentorship and above all, friendship, is deeply appreciated.

Dr. Michael Rayo, for modeling the way to be a hardcore scholar and still be present in your own life. Mike, your willingness to make time even when overrun by the demands of the tenure track did not go unnoticed.

Dr. Philip Smith’s role as a professor, mentor and committee member has helped me become more rigorous, well-rounded and discerning – all characteristics he himself so clearly models in his work and teaching. Phil, thank you for your commitment to the next generation of CSErs even when you could have been riding or researching!

Dr. Emily Patterson, I am incredibly fortunate to have had your tremendous experience and insight on my committee and I count your mentorship as a highlight of my time at OSU.

My colleagues in the CSELab deserve an abundance of thanks. In particular, E. Asher Balkin, Kati Walker & Dan Welna for the countless conversations, whiteboarding sessions, proofreads and pep talks. To all the students- especially “The Morgans” (Reynolds & Fitzgerald), Christine


Jeffries, Dane Morey & Jesse Marquiese – I am so proud of you all and can’t wait to see what brilliance you unleash on the world.

I know how deeply fortunate I am to have the love & support of the Maguire (& honorary Maguire) clan. I am indebted to each of you for your enthusiasm, your empathy and your solidarity. In particular, this wouldn’t have happened without my sister Kristina’s enduring editing enthusiasm and the unflagging encouragement from her and my dear friend Yvonne.

Dr. Jody Reimer & Dr. Cayman Unterborn, you deserve all my admiration as wonderful people & accomplished scholars. Thank you for your friendship & your roles in getting me out of the lab and keeping me grounded.

These acknowledgements would be incomplete without a sincere recognition of the countless practitioners, managers, designers and product owners who keep the digital world operating smoothly for us. To the participants and organizations who shared their insights, their struggles and their seemingly mundane day-to-day, I thank you deeply. In particular, DL, MC & JR your support was very much appreciated.

Lastly, my constant companion Oliver, the instigator of study break nature excursions! Without him curled up under my desk, I am not sure I would have made it through as many late nights as I did.


Vita

Education

2003………………………… Bachelor of Commerce – Entrepreneurial Management, School of Business, Royal Roads University

2017………………………….Master of Science – Human Factors & Safety Systems, Faculty of Engineering, Lund University

Work

1999-2005……………………..Forestry Field worker, various

2006-2009……………………..Business Dev & Safety Manager – Dynamic Reforestation

2010-2012………………………Manager, Training & Program Dev – BC Forest Safety Council

2012-2015………………………Supervisor, Safety Training – Major Projects, Enbridge Inc.

2015-2016………………………Quality Specialist – Major Projects, Enbridge Pipelines

2017-Present…………………….Graduate Research Assistant – Cognitive Systems Engineering Lab, Ohio State University

2017-2019……………………Resilience Engineering Intern – IBM


Publications

Maguire, L. M. D. (2020). Managing the Hidden Costs of Coordination. ACM Queue, 17(6). Association for Computing Machinery.

Chuang, S., Maguire, L., Hsiao, S., Ho, Y., & Tsai, S. (2019). Nurses’ perceptions of hazards to patient safety in the intensive care units during a nursing staff shortage. International Journal of Healthcare, 6(1), 19-19.

Maguire, L. M., Vazquez, D. E., Haney, A., Byrd, C., & Sanders, E. B. N. (2019). Transdisciplinary Co-Design to Envision the Needs of the Intensive Care Unit of the Future. In Proceedings of the International Symposium on Human Factors and Ergonomics in Health Care (Vol. 8, No. 1, pp. 181-181). Sage CA: Los Angeles, CA: SAGE Publications.

Maguire, L. M. & Percival, J. (2019). Sensemaking in the Snow: Examining the cognitive work of avalanche forecasting in a Canadian ski operation. Neve e Valanghe, 93, pp. 94-95. Retrieved from https://issuu.com/aineva7/docs/nv93_rivista

Maguire, L. (2014). The human behind the factor: a brief look at how context informs practice in recreational backcountry users. In Proceedings of the International Snow Science Workshop (pp. 942-948).

Fields of Study

Major Field: Industrial & Systems Engineering
Minor Field: Cognitive Systems Engineering
Minor Field: Design


Table of Contents

Abstract ...... i

Dedication ...... ii

Acknowledgments ...... iii

Vita ...... v

List of Tables ...... xi

List of Figures ...... xii

Chapter 1 - Introduction ...... 1

Chapter 2 – Joint activity & anomaly response in large-scale distributed work systems 7

Key Concepts ...... 8

Large-scale distributed systems ...... 9

Defining a large-scale distributed system ...... 11
The large-scale distributed systems of interest ...... 12
Time & coherence ...... 12
The reification of distributed systems ...... 13

Coordination ...... 13
Joint Activity ...... 16
Cognitive costs of coordination ...... 18

Anomaly Response ...... 21

Summary ...... 24

Chapter 3 – The domain of Critical Digital Infrastructure ...... 25

Increasing complexity, speed and scale in ...... 26
Changing methods of ...... 26
Physical to virtual - Moving to the Cloud ...... 28
Service level agreements ...... 30
CD/CI as continuous adaptation ...... 30

Service outages ...... 31
Tooling to manage incident response ...... 31
Attempts to manage coordination for incident response ...... 33

Implications ...... 33

Chapter 4 – Research Methods ...... 35

Introduction ...... 35

Sources of Data ...... 36

Organizational characteristics of participants ...... 37

Individual characteristics of participants ...... 37

Data availability ...... 38
Collecting data on cognition in the wild ...... 38

Preparations ...... 39
Investigating models of incident response ...... 40
Understanding configurations for incident response ...... 42

Study: Process tracing a corpus of cases ...... 44
Case collection ...... 44
Data extraction and conversion ...... 45
Coding ...... 45

Chapter 5 - Findings ...... 49

Introduction ...... 49

Results from preparatory studies ...... 50
Preparatory study 1: Incident Models ...... 50
Preparatory study 2: How technology shapes the configurations for joint activity ...... 58

Process Tracing a corpus of cases ...... 64
Case # 1 - Hidden changes & multiple failures confound responders ...... 68
Case # 2 - A bad fault gets worse with multiple concurrent issues ...... 74
Case # 3 - Scaling beyond imagined parameters ...... 82

Case # 4 - When those who should know don’t ...... 88
Findings from Case 4 ...... 90

Case # 5 - Is this a problem and how bad is it? ...... 94
Findings from Case 5 ...... 97

Themes emerging from across the cases ...... 107
Additional patterns of choreography needed for coordination ...... 116

Summary ...... 120

Chapter 6 Discussion ...... 121

The elements of choreography ...... 122
Establishing Common Ground ...... 122
Maintaining & repairing common ground ...... 126
Recruiting ...... 128
Tracking others activities ...... 129
Taking initiative ...... 129
Delegating ...... 130
Taking Direction ...... 131
Synchronizing tasks ...... 131
Controlling costs for self & others ...... 132
Model updating ...... 132
Investing for future coordination ...... 133
Coordinating with tooling ...... 134


Monitoring coordination demands ...... 135
Summary of the elements of choreography ...... 136

Efforts to control the costs of coordination ...... 136
Limitations of the Incident Command model ...... 138
Controlling the costs of coordination ...... 139
The cost of cross-boundary coordination ...... 142

Chapter 7 Adaptive choreography ...... 149

The framework for Adaptive Choreography ...... 149

The dynamics of coordination in anomaly response ...... 151

Re-imagining incident management practices ...... 153
The changing nature of roles ...... 154
Responders as self-organizing units ...... 157

The scaffolding for Adaptive Choreography ...... 157

Summary of Adaptive Choreography ...... 158

Chapter 8 – Conclusion...... 160

Appendix A – Elements of Choreography ...... 162

Bibliography ...... 167


List of Tables

Table 5.1 Overview of Corpus Cases ...... 65


List of Figures

Figure 2.1 Components of coordination (Malone & Crowston, 1994) ...... 14
Figure 2.2 A description of joint activity ...... 17
Figure 2.3 Costs of Coordination - Clark & Brennan (1991) ...... 19
Figure 2.4 Multiple intermixed lines of reasoning (from Woods, 1994) ...... 21
Figure 4.1 Distribution of site visits ...... 40
Figure 4.2 Overview of research methods ...... 44
Figure 4.3 Coding framework ...... 47
Figure 5.1 One at a Time ...... 52
Figure 5.2 OAAT with Mean Time To Innocence ...... 53
Figure 5.3 OAAT with Incident Commander ...... 54
Figure 5.4 Escalation ...... 55
Figure 5.5 All Hands on Deck ...... 57
Figure 5.6 SWAT Incident response ...... 58
Figure 5.7 Recruitment of responders over time ...... 78
Figure 5.8 Corrected by cross-checking ...... 88
Figure 5.9 Dashboard time interval extended ...... 89
Figure 5.10 Tooling ...... 104
Figure 5.11 Maintaining common ground through updating ...... 105
Figure 6.1 The Incident Commander becomes a bottleneck ...... 138
Figure 6.2 Side-channeling as coordination costs rise ...... 139
Figure 6.3 Multi-role, multi-echelon coordination ...... 143
Figure 6.4 The cross-scale interactions of interest ...... 144
Figure 6.5 Nested series of relationships ...... 145


List of Figures continued

Figure 6.6 Escalation ...... 145
Figure 7.1 Adaptive Choreography as an incident response model ...... 155


List of Abbreviations

CDI Critical Digital Infrastructure
CD/CI Continuous Delivery/Continuous Integration
CI Continuous Integration
CSE Cognitive Systems Engineering
IC Incident Commander
ICS Incident Command System
IP Internet Protocol
SaaS Software as a Service
SIO Software Intensive Operations


Chapter 1 - Introduction

When we try to pick out anything by itself, we find it hitched to everything else in the universe. —John Muir, Nature Writings

The inherent interconnectedness and interdependence of complex systems – be they ecological, technological or social – inextricably links separate elements into a coherent whole. Science itself is an endeavor to understand the myriad relationships that define the orchestration of the causes/effects, the actions/reactions and the interactive dynamics that sustain life, enable societies to flourish, permit planes to fly, facilitate trade and allow families to function. In this way, a desire to understand the mechanisms of coordination is shared across a diverse cast of interested parties such as biologists, economists, engineers, managers and social scientists. And while most organic systems have always been recognized as intricate and inexplicable, it is largely only in recent times that human systems – bolstered by technological and geopolitical advances that extend speed and scale – have also developed emergent properties that make them difficult to explain and control. Where once nation states dictated their own economic policies, navigated their own conflicts and coped with their own natural disasters, shifting geopolitical landscapes and shared threats such as climate change and terrorism have necessitated strong global interdependencies. Aspirations in space exploration and scientific discovery have seen collective investment in shared assets that would otherwise be prohibitive to a single nation. Even at a more local level, raising children or caring for aging parents requires extensive coordination amongst family and community.

It is apparent that, through this interconnection, as Klein, Feltovich, Bradshaw, & Woods (2005) assert, nearly all forms of meaningful or challenging work are joint activity. Joint activity takes place in systems that were designed by humans, for human practitioners, and to serve human purposes and goals; therefore it is implicitly understood that joint activity occurs amongst human-human groupings. However, technological capabilities are needed to extend human reach into dangerous, unpredictable or inhospitable environments (Woods et al., 2004) and to enable vast quantities of information processing at speeds unmatched by human counterparts. So, increasingly, joint activity is distributed across both human and machine groupings (Allen and Ferguson, 2002; Woods et al., 2004). The multiple layers of distribution mean coordination is inherently necessary, and all coordinative work comes with additional cognitive costs (Klein, Woods, Bradshaw, Hoffman & Feltovich, 2004). As Woods (2017) notes, “the skills needed to marshal resources and deploy them effectively are related to but different from those associated with the problem solving. But to be effective, these resources must be directed, tracked, and redirected. These activities are themselves demanding” (p. 28). So, while additional resources are needed, they impose costs. Attempts to limit the number of parties involved can be futile and counterproductive in many complex, large-scale critical digital systems. This is because the system itself often supports high value interdependent work functions, and outages or performance degradations impede those dependent functions in ways that drive consequences for others.

In systems that have criticality-dependent stakeholders, off-nominal performance will drive an escalation in cognitive and coordinative demands (Woods & Patterson, 2001). The cognitive demands follow from the mental effort embedded in dynamic fault management of ambiguous problems in abstracted systems during high tempo and interactive failures (Woods et al., 1986; Woods & Hollnagel, 2006). The coordinative demands stem from the need for multiple, diverse perspectives representing varying mental models that can aid in hypothesis exploration (Watts-Perotti & Woods, 2007) or have access to privileged or localized information (Smith et al., 1999). This collaborative interplay and synchronization of roles is critical in anomaly response (Patterson et al., 1999; Patterson & Woods, 2001). In high demand scenarios, a diverse set of skills and roles is needed: responders who can aid with the technical issue, who have the authority to make decisions about tradeoffs between competing priorities, who can reallocate resources from other areas to the problem, and who can direct peripheral players that may be a distraction to technical experts (such as customer service or communication roles seeking to update users or the public). Often, these parties are non-collocated – both intentionally and by virtue of the scarcity of expertise (Beyer et al., 2016) – which drives the need for technology to facilitate coordination. And, while computer supported cooperative work tooling has advanced, the tools themselves are insufficient for high demand collaborative problem-solving (Olson & Olson, 2000; Olson, Olson, and Venolia, 2009) and therefore generate additional costs. The ways in which they are insufficient are explored further in Chapter 2.

In summary, the features of this problem space include: large-scale distributed work systems are increasingly common; non-routine failure events in these contexts require coordination across a variety of agents within a network of interdependent capabilities; these agents are both human and machine; and their geographic dispersion is supported by tooling that is insufficient to aid complex cognitive work.

This dissertation seeks to answer the question: what mechanisms do practitioners use to control the costs of coordination within large-scale distributed work systems as they carry out the functions of anomaly response under uncertainty, risk, and time pressure?

While extensive investigation has occurred across organizational, management, systems engineering, social psychology, engineering psychology and sociological research, this work provides a substantially different, integrated view of coordination. It shows how cross-scale dynamics and coordination demands that shift over time incur cognitive costs for practitioners, and it identifies the corresponding strategies used to manage the multiple, competing demands for attention that arise during a response to exceptional events.

As described in the opening examples, the answers to these questions are relevant to many diverse domains. Consistent with other work in Cognitive Systems Engineering (CSE), this research uses the particular circumstances of a specific domain (critical digital infrastructure) to extrapolate patterns of cognitive work around coordination that are applicable to other technical professionals working in large-scale distributed work systems operating at speed and at scale. A comprehensive definition of critical digital infrastructure is given in Chapter 3.


What follows examines large-scale distributed systems coordination through the lens of event driven cognitive work. It differs in important ways from previous investigations of coordination by using an integrated, cross-scale lens that focuses on the cognitive effects of the relationship between groups of geographically dispersed people working on difficult problems with automated supports across organizational and technological boundaries.

Whereas coordination is typically treated as process driven, this work asserts that event driven coordination accounts for the temporal variability of demands (including expansions of uncertainty and of the number of parties involved requiring coordinative effort). In addition, event driven coordination provides context for the situated action of practitioner strategies for controlling the costs of coordination. Just as Muir recognizes that everything is “hitched to everything else”, Suchman describes this as “the view that every course of action depends in essential ways on its material and social circumstances” (Suchman, 1987, p. 50). Central to this investigation are methods (process tracing) for surfacing the ways in which observable actions are informed by the tacit cognitive processes that drive them.

I have organized the remainder of this dissertation into seven chapters.

Following this introduction, in Chapter 2, I will discuss the background and relevance of studying how to control the cognitive costs of coordination. This dissertation draws from several streams of research relevant to uncovering and understanding the cognitive costs of coordination. These conceptual frameworks show how distributed work systems, enabled by computer supported cooperative work, influence joint cognitive systems during anomaly response. These frameworks provide interrelated and complementary insights on coordination in that:

· Distributed work systems research is necessary for examining conditions of work where practitioners are often spatially and temporally separated, and tasks are distributed across groups of human and machine agents.
· The organizing of domains in this manner is enabled by computer supported cooperative work, which has implications for both coordination design and for cognitive work.
· Coordination of human-human and human-machine teams is best understood through joint cognitive systems, which analyze how multiple human perspectives on a complex problem space and the diverse cognitive capacities inherent in humans and automation form an irreducible unit of analysis.
· The anomaly response model and case examples of dynamic fault management provide a foundation for time pressured diagnosis and repair in systems characterized by cascading interactive failure.

Next, chapter 3 introduces the domain of study (software engineering in critical digital infrastructure). I have chosen to provide a thick description (Ponterotto 2006; Denzin 2001) alongside the technical description to capture the changing organization of work and the tension inherent in navigating uncertainty during service outages. I integrate these different forms of narrative because, implicitly and explicitly, embedded in the traditions of CSE is ethnographic field research. This study is representative of a sustained three-year engagement with one organization and a two-year relationship with the others. In my analysis, I draw on this substantial context to provide insight. Before moving on, I will make the connection between studying how teams of DevOps engineers manage critical incidents and the relevance to other distributed work systems.

Following this, Chapter 4 describes the methods used in this study. As noted by Potter et al. (2000), “developing a meaningful understanding of a field of practice relies on multiple converging techniques” (p. 337). In this tradition, this work used knowledge elicitation, process tracing and observation to develop the analysis. In doing so, a number of methodological constraints were uncovered and are discussed.

Chapter 5 then presents the findings. The first set of findings outlines the approaches used by the organizations to handle coordination during anomaly response. The second set relates to a corpus of five cases characterized by a high degree of difficulty, with episodes where diverse perspectives contribute to model updating & revision, as well as episodes where the participants struggled to update & revise their models of how the Critical Digital Infrastructure (CDI) was malfunctioning. Then, a cross-case analysis focuses on how practitioners adapt to manage the costs of coordination and looks closely at how the tools and technology that enable interaction across diverse roles can facilitate, but also interfere with, coordination. The contribution of an integrative model of controlling the costs of coordination provides greater explanatory power than previous work disseminated across these streams.

The majority of Chapter 6 is dedicated to the discussion of the elements of choreography and the means by which responders manage the cognitive and coordinative demands of the joint activity of incident response. This is followed by a synthesis that argues that coordination must be considered using a cross-scale lens. Results from the preparatory studies and the corpus findings are laid out to account for the myriad roles of direct responders and stakeholders that influence incident response practices in critical digital systems. In this way, we are then prepared to look at the dynamics of coordination – how the elements are enacted across time and space.

The framework for Adaptive Choreography – which outlines the dynamics of coordination – is then established in Chapter 7, detailing the ways the elements of choreography are fulfilled by participants to enable smooth coordination. This chapter outlines some of the essential characteristics of the organizational context and of individual orientations towards coordinative work, and identifies opportunities for integrating adaptive choreography into incident response practices.

In the final section (Chapter 8), I conclude by summarizing the findings from the corpus, highlighting the key discussion points, touching on the limitations of this study, providing brief comments on implications for tool design, and ending with remarks on the implications for future research on coordination in cognitive work.


Chapter 2 – Joint activity & anomaly response in large-scale distributed work systems

The rise of automation in the control of managed processes and the prevalence of the computer in the workplace have been a transformative shift that created new demands for organizations, practitioners and the designers and developers of workplace technologies. Almost 25 years ago, Malone & Crowston (1994) noted that “there is a pervasive feeling in businesses today that global interdependencies are becoming more critical, that the pace of change is accelerating, and that we need to create more flexible and adaptive organizations. Together, these changes may soon lead us across a threshold where entirely new ways of organizing human activities become desirable” (p. 89). As this shift has continued and the scale of the systems being managed increases, new ways of organizing human activities, alongside the development of automated and semi-automated tooling to support complex work system functions, have become both desirable and, clearly, in many high hazard contexts, critical. Malone & Crowston’s comment and their corresponding Coordination Theory envisioned the effect of these changes on typical business practices that might connect a collaborative global workforce to remain competitive. Since that time, high risk and high consequence industries have continued to experience the same pervasive feeling (Smith et al., 1999; Obradovich & Smith, 2003; Mikkers et al., 2012; Bento, 2018).

Technology and practices that can enable frontline practitioners and their organizations to dynamically navigate the distances in time and space between the point of control (virtual or otherwise) and the anomalous events being attended to are still lacking. What is needed are coordinative devices that support fluid abilities to act on the managed processes: readily integrating new sources of information, revising assessments, allowing rapid reconfiguration of available resources (human and machine) and flexibly adapting to the changing demands of the environment and problem space.


As a result, several closely related but distinct vectors of research examine these new contexts for work. Research relevant to understanding and improving these systems spans the human (individual & social cognition), the technological (hardware, software, architectures, networks), the organizational (human resource management, resource allocation, management), the industrial (regulatory, economic) and the system (characteristics of complexity and emergent system behavior). The extent of the transdisciplinary nature of the research is indicative of the transformative effect information technology has had on society.

One vector, the distributed work stream, primarily emphasizes the distance and issues that arise when teams are not co-located (Smith, Spencer, & Billings, 2007). A second (and often overlapping) line of research is Computer Supported Cooperative Work, which largely refers to the role of the computer in providing connectivity and modulating the form of interactions (Grudin, 1994; Olson and Olson, 2007; Winograd et al., 1986; Tatar et al., 1991). A third vector is joint cognitive systems – collaborative cognitive effort across machines (or cognitive artifacts) and their human counterparts (Hollnagel and Woods, 2005; Smith, 2017; Woods and Roth, 1988; Hollan et al., 2000). These will each be examined in depth to draw the connection to controlling the cognitive costs of coordination during anomaly response. Before beginning this examination, it is useful to set some definitions up front.

Key Concepts

This analysis is situated chiefly within the foundation of coordination in joint activity and the grounding necessary for joint activity. These constructs have been well established in socio-linguistic circles (Clark, 1996; Clark and Brennan, 1991; Levinson, 1979) and more recently applied to cognitive systems (Klinger and Klein, 1999; Fairburn et al., 1999; Klein et al., 2005; Woods, 2017; Mansson et al., 2017). Specifically, in this dissertation work, these constructs are embedded in the context of anomaly response in large-scale distributed systems. Perhaps unsurprisingly, there are a wide variety of definitions for concepts such as coordination, choreography, costs of coordination, joint activity and distributed systems. As such, it is useful to briefly define how the terms are used in this work; a more detailed discussion follows later in the chapter.


Coordination refers to the ongoing management of dependent dynamic interactions arising from joint activity. This definition draws from Malone & Crowston (1994), Clark (1994), Klein et al (2005) and Johnson et al (2014).

Clark (1996) defines joint activities as ongoing processes with multiple participants that vary on multiple dimensions such as scriptedness, formality and verbalness.

Cost of coordination is most typically defined in the literature in terms of the time and effort associated with communication (MacMillan et al., 2004; Malone, 1994; Clark, 1996). For the purposes of this study, the definition from Klein et al. (2005) of the additional mental effort and load required to participate in joint activity is used. Expert performance is subtle and nuanced, and tooling affords new and different ways of working jointly that are not limited strictly to communication.

Finally, while all of the above definitions have layered constructs behind them, the term distributed system in particular defies a compact definition. Therefore, with baseline definitions established for our key terms, we begin with an in-depth look at distributed systems to uncover a suitable definition.

Large-scale distributed systems

Martin Van Steen said, “distributed systems are like 3D brain teasers: easy to disassemble; hard to put together.” And so it is with assembling a definition of a distributed system. Systems thinking emerged in the 1940s from biologist Ludwig von Bertalanffy’s attempts to counteract reductionist thinking in the sciences (Von Bertalanffy, 1972) and was later extended by Ashby.

There are a number of shared characteristics – having to do with the complexity of the system under study, the nature of the tasks being supported and the user – that overlap. The significant departures between these positions have to do, in part, with the extent to which they account for complexity (including where the boundaries of analysis are drawn) and the way they account for time (recognizing both tempo changes and the hyper-accelerated transactions in some environments).

Multiple fields of study have investigated large-scale distributed work systems including, but not limited to, systems engineering, organizational and management sciences, biological and ecological sciences, military sciences, and cognitive systems engineering. While biological and ecological sciences (as well as countless other fields) have important contributions relevant to system properties (related to complexity, emergence, interactive variability), this dissertation reluctantly excludes them from this discussion in order to focus on work systems. It could be argued the computer sciences (CS) include a scope much broader than work systems, and this is true. However, the framework of distributed systems in CS is fundamental for examining information technology architectures and the fundamental assertions inherent in how automated and/or intelligent machine agents are designed and deployed into a system of work.

These multiple lenses are critical for representing the diversity of the transdisciplinary nature of the problem space of how to design for and control large-scale distributed work systems. However, they are also a source of confusion when researchers use a ‘same but different’ term in varied domains, and the chasms in language and scope between fields may impede truly integrated research. Speaking of systems research, Ackoff (1971) notes:

“Despite the importance of systems concepts and the attention that they have received and are receiving, we do not yet have a unified or integrated set (i.e., a system) of such concepts. Different terms are used to refer to the same thing and the same term is used to refer to different things. This state is aggravated by the fact that the literature of systems research is widely dispersed and is therefore difficult to track. Researchers in a wide variety of disciplines and interdisciplines are contributing to the conceptual development of the systems sciences but these contributions are not as interactive and additive as they might be.” (p. 661)


The same could be said of the current state of large-scale distributed systems research. Even within a single field (systems engineering, for example) there are multiple definitions. There may be validity in avoiding prematurely defining a common conceptual framework: Hoffman et al. (2002) assert that as a field progresses, both incremental and substantial variations generate new labels to describe work that may share some common features but differ in significant ways, which can be beneficial for a field of science.

Defining a large-scale distributed system

The use of the term distributed system has spanned multiple organizational and technological advancements, some of which have fundamentally altered what it means for a system to be distributed. In this way, I argue that what constitutes a large-scale distributed system needs to be revisited. However, this does not demand the hubris of an entirely new definition and scope. As with Webster’s (1994) treatment of what constitutes an ‘Information Society’ and Woods’ (2015) examination of ‘resilience’, there is value in acknowledging how the multiple perspectives remain relevant (if underspecified).

Therefore, these disparate definitions converge on several similarities which I use to re-assemble the characteristics of large-scale distributed systems for the purposes of this paper. They involve:
- multiple nodes in a network, whereby a node has autonomous capabilities but also contributes in some way to the collective work/action that accomplishes system level objectives
- representation of cross-scale groupings, whereby a node can stand for a human or machine agent at one level as well as for higher levels of abstraction, such as organizational entities made up of human-human and human-machine groupings
- inclusion of multiple overlapping boundaries (both inter- and intra-organizational)
- geographic distances mitigated by technological connections that aid in the communication and dispersion of information across nodes and timeframes.

This definition provides a high-level framework to think flexibly about the interactions between nodes without explicitly dealing with the capabilities and limitations of each node. Some nodes represent highly proficient expert human resources; some represent automated agents that are dumb and dutiful (Wiener, 1988). Depending on the properties of the system you wish to understand, you move between levels of abstraction. For example, in studying the cognitive costs of coordination, where the unit of analysis is the joint cognitive system involving human-human groupings interacting with semi-autonomous machines, it may be appropriate to acknowledge more distal influences as abstracted nodes – such as a ‘regulatory agency’. In doing so we do not discard the complexity inherent in acting in a multi-scale system by ignoring its effects, but neither are we overrun by the challenges of examining each node in the system at the same level of detail.
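To make the idea of nodes at different levels of abstraction concrete, the following sketch (my own illustration, not drawn from the dissertation's data or any particular system; all names are hypothetical) represents a mixed network in which an on-call engineer, a monitoring service and a distal 'regulatory agency' are all nodes carrying different amounts of detail.

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Node:
    """A node in a large-scale distributed work system.

    A node may stand for a single human or machine agent, or for a coarser
    organizational entity treated at a higher level of abstraction.
    """
    name: str
    kind: Literal["human", "machine", "organization"]
    detail_level: int = 0                                # 0 = fully modeled; higher = more abstracted
    depends_on: List[str] = field(default_factory=list)  # cross-node interdependencies

# Mixed-granularity network: proximal agents are modeled in detail,
# distal influences are kept as abstracted nodes rather than ignored.
network = [
    Node("on-call engineer", "human", depends_on=["monitoring service"]),
    Node("monitoring service", "machine", depends_on=["production cluster"]),
    Node("production cluster", "machine", detail_level=1),
    Node("regulatory agency", "organization", detail_level=2),
]

# Shift the unit of analysis by choosing which level of detail to examine.
joint_cognitive_system = [n.name for n in network if n.detail_level == 0]
print(joint_cognitive_system)   # -> ['on-call engineer', 'monitoring service']
```

The point of the sketch is only that the same network structure can hold fully modeled agents and coarsely abstracted influences side by side, which mirrors how the unit of analysis can shift between a joint cognitive system and the broader multi-scale system.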

The large-scale distributed systems of interest

Not all large-scale distributed systems are of interest in this dissertation. For example, Amazon is undoubtedly a large-scale distributed system where the goods are housed in multiple warehouses and delivered through countless distribution centers and delivery trucks. It has an extensive network of interdependent activities distributed across supply chains and groups of humans and machines (Leblanc, 2019). However, individuals within the system are not responsible for maintaining process control. Therefore, the systems of interest involve a degree of responsibility and the concurrent skills and abilities to bear the responsibility.

Time & coherence

Two underemphasized but important features of distributed systems have to do with the function of time and the extent of integration.

First, distributed systems operate on varying time scales and there is seldom, if ever, a “global clock” shared across agents (Tanenbaum & Van Steen, 2007). They note that the “lack of a common reference of time leads to fundamental questions regarding the synchronization and coordination within a distributed system” (p. 3). This comment is relevant to the broader discussion of coordination to be examined later in this chapter.
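For readers less familiar with this property, distributed-systems practice typically substitutes logical ordering for a shared clock. The following minimal Lamport-clock sketch is illustrative only (it is not part of the dissertation's argument, and the agent names are hypothetical); it shows how two agents with no common time reference can still preserve the causal order of an alert and its acknowledgement.

```python
class LamportClock:
    """Minimal Lamport logical clock: orders events by causality, not wall time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event occurred."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the current logical time."""
        return self.tick()

    def receive(self, message_time):
        """Merge the sender's logical time so causal order is preserved."""
        self.time = max(self.time, message_time) + 1
        return self.time

# Two agents with no shared clock still agree that the alert
# causally precedes the acknowledgement that responds to it.
monitor, responder = LamportClock(), LamportClock()
alert_ts = monitor.send()               # monitor emits an alert
ack_ts = responder.receive(alert_ts)    # responder processes it
assert ack_ts > alert_ts
```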


The reification of distributed systems

In addition, distributed systems are commonly seen as being a single coherent system. Tanenbaum & Van Steen (2007) define this to mean “end users should not even notice that they are dealing with the fact that processes, data, and control are dispersed across a ” (p. 4) but note that achieving coherence is difficult, so architects often settle for the appearance of coherence. The danger here, as Woods et al. (2010) point out, is that this is a reification fallacy and can serve to obscure the complexity of the problem demands faced by those tasked with managing and restoring service during failures. While it is necessary, in this dissertation, to use the short-hand of distributed system, it is understood that it is not a single entity but rather a complex conglomerate, a tangled layered network (Woods, 2018) of extensive dependencies (Woods, 2015; Smith et al., 2013).

The definition for large scale distributed systems provided here is considered relevant for looking at the coordinative elements of critical digital infrastructure.

Coordination

As noted in the introduction to this dissertation, much of human activity requires interacting with others. And like the variety of examples given earlier, many disciplines and domains have approached the study of coordination. Each discipline approaches the term differently, and most of us have an intuitive sense of coordination. More specifically, we can intuitively recognize the ends of the coordination spectrum – when it works, in smooth well-coordinated activity (such as dining in a busy restaurant without incident), and when it doesn’t (such as attempting to leave a stadium after a professional sports game). This multi-disciplinary interest, plus the ease with which we assume we understand coordination, is a challenge for studying it as a phenomenon of cognitive work: the construct has a heavy legacy, and parsing the difference between the different approaches can be effortful or can rapidly lead to getting mired in debating the semantics of definitions.


In a seminal 1994 paper, Malone & Crowston identified the need for an interdisciplinary approach. This foundational piece defined coordination as “managing interdependencies between activities” (p. 12), then refined that definition to emphasize a focus on goal directed activity. They further go on to characterize different aspects of dependencies and the concurrent processes that can be used to manage them. In this work, coordination is seen as a function of goals, activities, actors and their interdependencies, as shown in Figure 2.1 below.

Figure 2.1 Components of coordination (Malone & Crowston, 1994)
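A deliberately simplified sketch (my own illustration; the activity and actor names are hypothetical) shows what it looks like to treat coordination as managing explicit dependencies among goal-directed activities and the actors assigned to them: given a record of what is done, the dependency structure determines what may proceed. As the next paragraph argues, the weakness of exactly this framing is that the dependency graph is treated as static.

```python
# Illustrative only: coordination framed as managing dependencies between
# goal-directed activities (after Malone & Crowston, 1994). Names are hypothetical.
activities = {
    "detect anomaly":   {"actor": "monitoring service", "needs": []},
    "diagnose fault":   {"actor": "on-call engineer",   "needs": ["detect anomaly"]},
    "notify customers": {"actor": "support lead",       "needs": ["diagnose fault"]},
    "deploy fix":       {"actor": "on-call engineer",   "needs": ["diagnose fault"]},
}

def runnable(done):
    """Return the activities whose prerequisites are all complete."""
    return [name for name, a in activities.items()
            if name not in done and all(n in done for n in a["needs"])]

print(runnable(done={"detect anomaly"}))   # -> ['diagnose fault']
```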

The coordination processes as outlined are useful for the purposes of understanding coordination dynamics, but this framing has limitations when applied to complex, adaptive systems. It implies that the processes are static – once identified and mapped, they are resolved and can be managed. Spiro et al. (1987) identified this as a form of oversimplification. And indeed, it would be wrong to imply a system is static when it is instead characterized by change, with fluidity between goal priorities, in the kinds of activities and actors needed to handle an event, and with near continuous management of interdependencies moment-over-moment. A danger in doing so is that strategies to manage coordination that are based on this interpretation are slow and stale in adapting to changing conditions.

One might argue that this branch of coordination literature is able to cope with highly dynamic worlds, as evidenced by the work done on ‘Adaptive Coordination’. The term has been used extensively in the literature (Grote and Zala-Mezö, 2005; Henrickson Parker et al., 2018; Burtscher et al., 2010) but it falls short of being ‘adaptive’ relative to the cognitive systems engineering and, arguably, the high reliability organizing literature.

Adaptive Coordination (AC) comes from Entin & Serfaty (1999), who defined coordination as either explicit or implicit. We will explore what this means before looking more closely at AC. Explicit coordination “is achieved through deliberate effort dealing with the coordination as such (e.g. by discussing the appropriate way of action and then deciding on which action to take), or it can be implicit, i.e. coordination is achieved without spending effort on coordination activities as such (e.g. by providing somebody with relevant information at the right time without the other person having to ask for that information)” (Grote and Zala-Mezö, 2005, p. 193). This framing gives pause from a cognitive work perspective. First, it supposes the visible presence of coordinative action to be representative of effort. For someone to know what information is relevant to another, and at what point in time, particularly in multi-threaded and time pressured activity with multiple competing demands, is effortful yet invisible. This cognitive work implies a deep knowledge of the task being undertaken, of the expected flow of events, and anticipation that the information being provided may be lacking in the other person. Contrary to this being “without spending effort”, the person providing the relevant information incurs a cost in doing so. Turning again to the term Adaptive Coordination, we find it “has been defined as a team’s ability to change its coordination activities in response to unexpected external events or changes in task or team characteristics” (Burtscher et al., 2010). More specifically, it is used to denote when practitioners shift from one form to another – implicit to explicit or vice versa (Grote and Zala-Mezö, 2005; Henrickson Parker et al., 2018; Burtscher et al., 2010). In doing so, it assumes that explicit coordination is merely another form of coordination, but some literature has shown this can be qualitatively different – a signal of a coordination breakdown brought on by the demands of the event. As Latour (2005) says, “group formations leave many more traces in their wake than established connections which, by definition, might remain mute and invisible” (p. 31). One aspect of recognizing coordination challenges is simply needing to talk about, or being able to more readily see, the coordination; so, in this way, adaptive coordination may not be an entirely controlled (or useful) technique from a cognitive work perspective. A second consideration is that in normal operations, coordination can be difficult to detect and to extricate from the system of work it is embedded within (Hutchins, 1996; Wears, 2005).


The emphasis on using predominantly observable features of coordinative interactions as a basis for understanding coordination, coupled with the focus on processes of coordination, can make it challenging to introduce alternative methods and frameworks for studying coordination. In this way, connecting the cognitive work inherent in coordination to the research question of how practitioners control the costs of coordination while carrying out the functions of anomaly response under conditions of uncertainty, risk and time pressure requires looking outside the traditional literature. A possible reason for the insufficiency of the broader field of coordination research for cognitive systems engineering is, as Nemeth et al. (2006) put it, “looking and listening alone are not enough to [capture] the density and complexity of information and interaction…” (p. 1012).

Joint Activity

The conceptual framework of joint activity (Clark, 1996; Klein, Feltovich, Bradshaw, & Woods, 2005) offers an alternative to the entrenched conceptualizations of coordination seen in the work of the last 25 years and extends the study of coordination more directly into the cognitive realm. The foundation of Joint Activity from Klein et al. (2005) stipulates that several basic criteria must be met as pre-conditions to well-coordinated joint activity. The first of these is that the parties intend to work together. The second is that the tasks are interdependent (all parties are reliant in some form on others to produce an end result). The third pre-condition is that the activity includes an extended set of behaviors (Clark, 1996) and is grounded by the Basic Compact (a continual commitment to the shared goals and to repairing any breakdowns in common ground). These must be scaffolded by coordination devices that provide structure to the efforts to choreograph shared activities.


Figure 2.2 A description of joint activity

Narrowing the literature to focus more specifically on the cognitive work, we see different lenses for looking at the practice of coordination. For example, Clark (1996) describes coordination in terms of joint actions and joint activities – the actions that emerge when there is a common goal and interdependence of tasks. When individuals take action that is intended to be coordinated with another, it is a joint action. The compilation of sequential joint actions forms joint activities. Mansson et al. (2017) elaborate:

“Joint activities are carried out by participants in particular roles that help shape what each does and is understood to be doing. Participant’s roles may, however, change between different activities or as the nature of the joint activity becomes clear. Although they participate to achieve a certain dominant goal, participants often also pursue other goals such as procedural, interpersonal, and private goals. It is however the public goals, those which are openly recognised by all participants, that define the joint activity. What is required to coordinate then, Clark argues, is common ground of the participants.” (p.551)

Clark & Brennan (1991) describe coordination of content and coordination of process as necessary elements for successful collective action. The mutual shared knowledge, beliefs and assumptions (or ‘common ground’) are required for coordinating on content. Coordinating on process requires updating common ground moment by moment; therefore, “all collective actions are built on common ground and its accumulation” (p. 222). This represents a different way – an event driven perspective – of thinking about coordination, distinct from external process-driven models of coordination.

Joint activity is underscored by the Laws of Collaborative Systems (Woods & Hollnagel, 2006):

First Law of Collaborative Systems (not all you, not all me): It’s not collaboration if either you do it all or I do it all. Collaborative problem solving occurs when the agents coordinate activity in the process of solving the problem.

Second Law of Collaborative Systems (nobody is incompetent): You can’t collaborate with another agent if you assume they are incompetent. Collaborative agents have access to partial, overlapping information and knowledge relevant to the problem at hand.

Coordination Costs, Continually: Achieving coordinated activity across agents consumes resources and requires ongoing investments and practice.

Computers are not situated (need context): Computers can’t tell if their model of the world is the world they are in. Computers need people to align and repair the context gap.

The Collaborative Variation on the Law of Regulation: Every controller is a model of the other agents who influence the target processes and its environment and of those who coordinate their activities directly or indirectly to achieve control, i.e., a model of the activities, models, capabilities, and expectations of the other agents.

Mr. Weasley’s Rule: Never trust anything that can think for itself if you can’t see where it keeps its brain. – Harry Potter

“To function effectively, a team must act as an information-processing unit, maintaining an awareness of the situation or context in which it is functioning and acquiring and using information to act in that situation” (MacMillan et al., 2004). In this way, coordination incurs costs additional to the technical or task-based work being completed together.

Cognitive costs of coordination

Prior research has shown that coordination in joint activity incurs cognitive costs for practitioners (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006; Woods, 2017). Effective joint activity during periods of high tempo, high consequence work requires mechanisms that maximize the value of coordination across roles while controlling the costs coordination can impose, which can exacerbate the risk of workload bottlenecks (Branlat, Morison & Woods, 2011) or of falling behind the pace of events (Woods & Branlat, 2011). However, follow-up studies that examine how to control the costs of coordination still need to be carried out (Woods, 2017). This work contributes to that effort by providing a detailed account of the cognitive costs associated with choreographing joint activity.

Costs of coordination have been explained in different ways in the literature. Studies from the cognitive psychology field often account for them quite literally, as the time costs (as measured by eye tracking) of delayed action responses as the complexity of the tasks increases. Malone (1987) describes “the costs of maintaining communication links (or "channels") between actors and the costs of exchanging "messages" along these links”. Clark & Brennan (1991) posit that grounding in communication comes with associated costs that are “paid” by one or all of the participants in a conversation (as described in Figure 2.3 below).

Figure 2.3 Costs of Coordination - Clark & Brennan (1991)


Building off this, Klein et al. (2005) extended the concept to show how these communicative costs are associated with various forms of cognitive work in joint activity. In this work, “coordination costs refer to the burden on joint action participants that is due to choreographing their efforts” (p. 12). Despite the extension of Clark (1996) from communication to a broader range of joint activities, the lineage limits that analysis in appreciating how the concepts of common ground and coordination costs are linked. Passages such as “effort spent in improving common ground will ultimately allow more efficient communication and less extensive signaling. On the other hand, failure to invest in adequate signaling and coordination devices, in an attempt to cut coordination costs, increases the likelihood of a breakdown in the joint activity” (p. 18) show the authors’ definition of coordination costs to be tightly coupled to communication.

Of note is how Klein (2007) further developed the notion of costs in the flexible execution of planning and replanning – known as flexecution. “Flexecution creates a cost of continually thinking about benefits and drawbacks for different courses of action—issues that are believed to be resolved during the planning stage of a conventional planning and execution sequence. Flexecution also requires additional communication to prevent confusion and maintain common ground. Each time the goals and priorities change, the leaders must issue notifications and re-direct their subordinates’ attention. Leaders who change directions too often might reduce their teams’ effectiveness. Thus, we have the added complexity of emerging goals in a (distributed) supervisory control environment” (p. 110).

In this way, Klein has pointed to the need for coordination to be dynamic – continually adjusting relative to the problem demands and environmental conditions. A key point is that being unable to keep pace with the problem demands can generate coordination breakdowns and the basic patterns of complex systems failure – decompensation, working at cross purposes and getting stuck in outdated behaviors (Woods & Branlat, 2011). These patterns are relevant to highly interdependent networked systems with extensive and hidden dependencies.


Anomaly Response

The anomaly response model (Woods & Hollnagel, 2006) describes a dynamic process of detecting, diagnosing and responding to unexpected events. It describes how an agent is able to recognize anomalous behavior, then act either to safeguard against performance degradations or to seek further information to explain the discrepant finding. As more information becomes available, they revise their hypotheses and replan their actions accordingly.

Figure 2.4 Multiple intermixed lines of reasoning (from Woods, 1994)

What is implicit here is that this takes place under time pressure and that anomalies are assumed to be of high consequence. Woods (1994) notes that in complex systems failures

“a fault initiates a temporally evolving sequence of events and process behavior by triggering a set of disturbances that grow and propagate through the monitored process if unchecked by control responses.” (p.64)


Given the consequence of failure, definitive assessments for disturbances are secondary to taking action to mitigate further degradation. Often these actions provide information that can inform and update tentative assessments. Woods (1994) separates dynamic fault management from troubleshooting a broken device by underscoring how “time pressure, multiple interacting goals, high consequences for failure and multiple interleaved tasks” (p. 2371) generate additional cognitive demands.

In addition, anomalies do not present fully formed. Instead, they evolve over time and cues about the changes within the system can be partially or fully obscured or intermittent. The capacity to cope with changing conditions can become saturated and recruiting additional resources can allow a system to stretch its capabilities (Woods & Wreathall, 2008).

All of these factors create significant cognitive demands. This cognitive complexity needs to be distributed to limit the complexity for any one individual (Smith, Spencer, & Billings, 2011). The framework of joint activity, coupled with anomaly response, links the benefits accrued through collaborative joint effort (across humans and machines) to the demands of dynamic fault management.

Joint activity in Anomaly response

The anomaly response model represents dynamic intertwined processes within multi-threaded activity. This activity is distributed across a multi-role, multi-echelon network of human and machine agents. The connectivity afforded by modern IT systems has increased the scale of this activity; therefore, the useful unit of analysis becomes the network instead of the typical unit of joint activity - a small work group. Operating within the network are nodes - joint cognitive systems of human-machine or human-human teams that are not static roles or configurations.

Nodes in a network are not definitive, and relationships change over time as conditions change. Olson and Olson (2000) make a highly relevant point:

"Coupling is associated with the nature of the task, with some interaction with the common ground of the participants. The greater the number of participants, the more likely all aspects of the task are ambiguous. Tasks that are by nature ambiguous are tightly coupled until clarification is achieved. The more common ground the participants have, the less interaction required to understand the situation and what to do." (p.162)

This is worth breaking down in greater detail before moving on as these insights can provide design guidance for distributed anomaly response:

• "Coupling is associated with the nature of the task, with some interaction with the common ground of the participants." If the nature of the work is non-routine and highly interdependent, as in anomaly response, group members will need more frequent and complex communication, short feedback cycles, and multiple streams of information.

• “The greater the number of participants, the more likely all aspects of the task are ambiguous. Tasks that are by nature ambiguous are tightly coupled until clarification is achieved.” This statement provides insight into the dynamic needs inherent in anomaly response and reinforces the need to support synchronization and coordination. That is, the redeployment, reconfiguration and redistribution of resources as new information becomes available.

• "The more common ground the participants have, the less interaction required to understand the situation and what to do." This is of particular interest for non-co-located work groups in several ways. The first is that some measure of common ground, along one or more dimensions, is often sacrificed in the service of other goal priorities. Well-calibrated teams can recognize when one or more dimensions are suddenly insufficient and can act quickly to repair them. The second relevant point is that it raises questions about what kinds of interactions will be necessary to repair and sustain common ground as the event progresses.


Summary

The conceptual framework for this dissertation emphasizes the constraints and demands imposed by large-scale, distributed systems and the nature of coordination – specifically joint activity – during the ambiguous, time-pressured work of anomaly response. These concepts have been well studied individually, and in some domains, collectively. However, there remains a valuable opportunity to explore how the integration of the particular orientations discussed – namely towards cognitive work and human-machine teams – influences coordination strategies.


Chapter 3 – The domain of Critical Digital Infrastructure

Critical digital infrastructure (CDI) is increasingly at the core of many high-risk domains that serve societal needs. Electronic health records, military intelligence surveillance, 911 call routing systems and the monitoring of high hazard processes like nuclear power plants and air traffic control systems are all software intensive operations (SIO) that critically depend on reliable, timely system performance. Technology advances in recent years have shifted not only the technical architecture but also the organizational design and work practices surrounding software engineering and digital service delivery (Kim et al. 2016; Beyer et al. 2016). This has, in turn, led to a concurrent shift in the skill sets and types of expertise needed to manage the infrastructure under these new conditions. These changes have created new challenges for coordination that have yet to be fully realized. In addition, SIO are heterogeneous - both in terms of the ways in which they utilize new architectures and the degree to which their practices are integrated across the systems they manage. The implications of the kinds of technological advances that generate new forms of dependencies, increase complexity and create new coordinative demands have been discussed for many high hazard domains. An important point is that, similar to the patterns from high hazard domains, these new forms of dependencies, increased complexity and coordinative demands have become a reality for many kinds of critical societal infrastructure, and their full implications have yet to be explored.

This chapter expands on this point and sets the context for a certain class of software intensive organization - those operating critical digital infrastructure (CDI). The first section describes the differences between traditional software engineering and Agile or DevOps methods, including the architecture and infrastructure. A brief explanation of some of the processes and practices is given, as this is important grounding for the reader. The chapter wraps up with a discussion of these changes and their implications for coordination.

Increasing complexity, speed and scale in software engineering

The shifting landscape described in this section was a result of technological capabilities - practices like continuous deployment and the move from dedicated hardware to virtual servers - making greater speeds and scales possible. However, what underpins it is a fundamental orientation to the world that is based on rapid change. For all the rhetoric in the management sciences about being responsive to change, many organizations utterly fail to understand the kinds of flexibility and adaptation that are needed to adapt at the speed of progress. Software engineering recognized this need in its first push to closely integrate new feature development and ongoing operations (hence the term DevOps) but quickly discovered that the organizational paradigms need to be fundamentally different as well. However, challenges remain in the 'transformation' to a DevOps or Agile model, in part due to the tension inherent in how to regulate these kinds of environments (Feltovich et al., 2007). This section describes these two orientations and touches briefly on the models of organizing reflected in the approaches to software engineering.

Changing methods of software development

The first substantial shift toward taking advantage of the speed and scale technology affords was in the software development process itself: the move from waterfall to continuous deployment/continuous integration (CD/CI) methods.

The waterfall method of developing new software followed a linear, sequential process with clearly delineated phase transitions. These phases were the domains of specialized sets of experts (user researchers, designers, software engineers, testers, support engineers).

The process could last months or years depending on the complexity of the development, the rigor of the quality assurance process and the priorities of the company (be first to market or be highly reliable with quality products). The requirements phase involved extensive user research to develop a clearly defined scope of features to be included in the development. These design briefs created a model of the user and the product and were hardened before (or shortly into) development to prevent scope creep and keep the project on time and on budget. While it allowed a higher degree of predictability around the product to be delivered, the costs and the timeline, the method had a number of drawbacks. First, the models generated from the requirements phase were a snapshot of the user and the way they would use the product at a certain point in time, independent of changing needs and conditions in the market. The long duration between when the user research was conducted and when the final product was delivered created gaps that made products obsolete quickly: other products had entered the market, tastes and needs had changed, or users had found alternatives to meet their needs. Secondly, inevitable 'bugs' in the code had to wait until the next version (another full development cycle) or a special update for the problem to be fixed. This meant users had to tolerate poor performance that may have begun the day they bought the product.

Because of the multitude of ways in which bugs could affect performance, the diversity of hardware the software could run on, and the other programs running alongside it, it became difficult to predict all the ways in which the programs could break. Developers prioritized fixes to balance the efficiency-thoroughness tradeoff, and hitting launch dates for new deployments of software was a frenzied affair.

In contrast to the waterfall method, the Agile method involves ongoing updates being made to the system in real time without taking it offline for maintenance. Requirements gathering, user testing, fixing bugs and delivering the 'new' program are all integrated into a continuous process. It is no longer about conceptualizing the whole project start to finish – having some developers build it, running some tests on it across a range of conditions, pushing it out to the world to use, and having some tech support engineers available for users when it breaks. CD/CI purposefully pushes out 'unfinished' projects – what are known as Minimum Viable Products – with minimal features but ones that go straight to the heart of the issue. Do users need a software program that lets them talk instantly with someone they are working with? A company might create a bare-bones version of messaging. Users provide ongoing feedback to iteratively improve the product. For example, after using the messaging platform they might discover the importance of letting someone else know their messages were read and agreed with, so emojis get added to aid in signalling and observability. Or, users prefer to send screenshots instead of describing something in words, and that drives the added feature of sharing photos.


To enable this, fundamental changes were needed. Examples of this are found in the approach to day-to-day activities, the tooling to support software development processes and the coordination across users, vendors and developers.

Physical to virtual - Moving to the Cloud

A fundamental aspect of this model of software engineering was the hardware it ran on. Prior to the internet, information technology (IT) infrastructure looked very different. In large organizations, software was run on "bare metal" – physical servers and computing assets owned by the organization that hosted particular content – and machines were added as capacity needs grew. But with the rise of cloud providers, companies no longer needed to own their own hardware – instead they could 'rent capacity' from other companies who would handle all the maintenance and upkeep of the hardware in accordance with service agreements. These service agreements specified the amount of downtime a company could expect and guaranteed the reliability of the system to a specified level.

Making changes in real time

In these large-scale systems, updates to fix bugs or security issues meant taking the servers 'down', or offline, to perform the updates. This meant users could not access the system while the servers were offline. Most companies handled this by doing the work in the middle of the night to minimize the impact on users. The ownership and location of the servers were important to how the maintenance would get done.

In an era where most organizations have computers and devices that are continuously connected to the internet, developers realized they could “push” fixes to users by having the systems regularly download updates. This meant users wouldn’t have to wait months or years for the newest version and wouldn’t have to re-buy glitchy software all the time.

To illustrate this, let's use an example familiar to most readers and then connect it to a broader domain example. Essentially, all small-scale IT (such as an individual with a personal computer) operates "on premises", whereby the owner of the computer is responsible for making sure updates are completed, achieving connectivity to a network (for instance, a local area network at a university campus or an internet connection) and troubleshooting any issues on their own. However, large organizations, or those whose IT is a critical function, might contract out the management of their hardware (servers, data centers) and software to large companies such as IBM or Cisco. These organizations would maintain a workforce of skilled technicians to visit the client site and handle their IT needs. In these cases, they are paying for the 'service' of reliable performance of the system. While this was reserved for a specific type of organization in the past, the shift from on-premises hosting to Cloud-based infrastructure is increasingly becoming the standard. Hosting 'in the Cloud' is where organizations virtually 'rent' the hardware and server space they need by operating virtual servers provided by a large provider like Amazon Web Services (AWS) or Microsoft Azure. Essentially, this eliminates the need for on-site management of hardware or software. It is called Software-as-a-Service (SaaS). This has the benefit of allowing an organization to scale (grow their system's capacity) on an as-needed basis. In between on prem and fully virtual is a hybrid Cloud architecture. This type of platform works with components from a mix of public or private cloud and on-premises infrastructure. Organizations that have their own internal offerings may choose a hybrid model to utilize their own company capacity and 'eat their own dog food' as the saying goes. This structure can offer flexibility (particularly for an organization undergoing a shift to Cloud and building up internal service delivery capacities).

This can also add increased complexity. The promise of SaaS offerings is that an organization outsources their hardware and software needs and pays for the reliability. In a labor market where highly skilled engineers are at a premium, this outsourcing ostensibly delivers reliable performance without needing to maintain internal capacity. However, many organizations run proprietary software and hosting these on external (and potentially competitor) platforms can create a concern about corporate security. In addition, strict regulations for the handling of data protection and privacy of personal information brought about by the European General Data Protection Regulation (GDPR) increased the consideration of data custody.


Service level agreements

Reliability-based service level agreements (SLAs) or objectives (SLOs) define the expected reliability (uptime) of services between a client and a vendor. To meet these service objectives, organizations developed strategies for maintaining reliability, including how reliability teams are resourced, how client issues get resolved and how acute problems (outages or severe degradations) are handled. In addition to handling unplanned (acute) outages, these groups manage chronic maintenance activities (such as installing updates, fixing bugs and responding to user support needs). The next section explains how continuous updating and reliability are related and how teams cope with unexpected events.
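To make these reliability targets concrete, the sketch below converts an availability objective into the downtime 'budget' it implies. This is a minimal illustration in Python; the 99.9% and 99.99% objectives and the 30-day window are assumed values for demonstration, not figures drawn from the organizations studied.

# Illustrative only: converting an availability objective into a downtime budget.
# The objectives and the 30-day window below are hypothetical values.

def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the minutes of allowable downtime for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

if __name__ == "__main__":
    # A 99.9% ("three nines") objective over 30 days allows roughly 43 minutes of downtime.
    print(round(downtime_budget_minutes(0.999), 1))   # 43.2
    # A 99.99% objective shrinks that budget to about 4.3 minutes.
    print(round(downtime_budget_minutes(0.9999), 1))  # 4.3

Even a fraction of a percentage point in the objective changes the room a response team has to diagnose and repair before the agreement is breached, which is one reason these figures shape how reliability teams are resourced.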

CD/CI as continuous adaptation

The success of CDI has led to systems that are continuously changing and expanding, leading to operations at large scales. The IT systems under management undergo continuous change through ongoing deployments of new code (Allspaw, 2016). Continuous deployment is needed to maintain functionality for service users. New features, capabilities and repairs are integrated without taking the system offline. In this way, the system's continuous adaptation to accommodate demands (user feature requests or changes in the market, including increased pressure to innovate) is both a source of resilience and a hard problem for system design and control (Alderson and Doyle, 2010).

As a continuous process, CDI, like space operations and nuclear power, requires recognition of disturbances to the 'live' system that threaten outages and diagnosis of what is driving the unexpected and abnormal behaviours in order to intervene in timely ways to avoid or minimize service losses (Woods, 1994; 1995). Because CDI services increasingly support business and safety critical activities that require flexibility to adapt in real time, stakeholders will increasingly demand near perfect reliability. The value of the system to its users and operators becomes so great that there is very little tolerance for outages – planned or unplanned. Thus, controlling the risk of service outages requires finely tuned capabilities for anomaly response when issues arise. Because the knowledge and expertise to detect, diagnose and repair outages is highly specialized, the personnel required to manage the systems need to be on-call and ready to respond wherever they may be. So, the need is for distributed anomaly response and for the engineering practices and tooling to support distributed anomaly response.

Service outages

Distributed anomaly response in this domain involves a mix of synchronous and asynchronous joint activity for diagnosing and resolving threats. This means new roles are continually coming into the incident to aid in the diagnostic and repair processes - escalating the coordination demands (Woods & Patterson, 2001; Klein et al., 2005). This domain reflects two general trends. One is that CDI has led to systems that operate at new scales with extensive and hidden interdependencies, increasing the pressures on the cognitive work of anomaly response. Another is the move from a physical control center to distributed roles who coordinate and communicate through software-based mechanisms such as online chat. As a result, as anomalies occur and threaten service outages, a broad range of roles get engaged. This extensive potential participation both provides value and increases the importance of keeping the costs of coordination low.

Tooling to manage incident response

Resolving anomalies requires dynamic coordination across multiple roles and groups that span inter- and intra-organizational boundaries. Which roles and groups provide knowledge and information crucial to diagnosis and resolution cannot be known in advance. Therefore, tooling that enables rapid coordination across a broad range of players is needed. Virtual chat platforms (online chat known as ChatOps) enable transparency and listening in across the different threads of activity. Most virtual chat platforms provide both ‘open’ virtual spaces (predominantly called channels) and ‘closed’ spaces (such as direct messages and private channels).
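As a concrete illustration of the kind of ChatOps automation this tooling supports, the sketch below posts an incident announcement into an 'open' channel via a generic incoming-webhook endpoint so peripheral roles can listen in. The webhook URL, channel name and payload fields are hypothetical placeholders assumed for illustration, not the API of any specific platform used by the organizations studied.

# A minimal sketch, assuming a generic incoming-webhook endpoint (hypothetical URL and fields).
import requests

WEBHOOK_URL = "https://chat.example.com/hooks/incident-webhook"  # placeholder, not a real endpoint

def announce_incident(channel: str, severity: str, summary: str) -> None:
    """Post an incident announcement so peripheral roles can follow along."""
    payload = {
        "channel": channel,  # e.g. "#incident-2041" (hypothetical naming convention)
        "text": f"[SEV-{severity}] {summary} - responders assembling, follow along here.",
    }
    response = requests.post(WEBHOOK_URL, json=payload, timeout=5)
    response.raise_for_status()  # surface delivery failures to the caller

# Example call with hypothetical values:
# announce_incident("#incident-2041", "1", "Elevated error rates on the checkout service")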

In addition, incident coordination can be aided through video or audio conferencing. Together, these sources provide traceability and insight into the mental models of participants. They support joint activity, the sharing of information across perspectives, and the integration of diverse models of the system. But the transparency also imposes burdens, as chat channels can host several hundred participants listening and looking in on responders, which allows others to anticipate when their involvement is needed. As the set of engaged roles expands, the costs of coordination increase, and this leads to adaptations; for example, a 'core' team — a distributed set of people pursuing critical threads — will shift out of the general channels to work together through alternative mechanisms.

Incident response can be conducted in 'public' warroom channels so users and stakeholders can follow the progress of response efforts. This occurs because of the nature of CDI and from the background of open source development and social coding, which are based on the benefits of distributed collaboration. Because these are critical systems, as services degrade stakeholders become interested, provide information about emerging service losses, and need updates about the status of resolution and restoration of services, especially in order to plan ahead to compensate for the impact of the degraded services. Stakeholder involvement generates additional pressures on the roles trying to understand the anomalies and resolve the situation. The more tangible and severe the threat of outage, the greater the time pressure, the more interruptions occur, and the greater the need to update other roles about progress, while many lines of reasoning and activity are going on under uncertainty (see the cases in Allspaw, 2015; Woods, 2017; Grayson, 2018). As disturbances persist and expand, consequences expand and uncertainty grows, engaging more and more roles, levels, and organizations — driving the tempo of operations up. More roles provide more resources, but come at the price of additional demands. Thus, understanding how to manage distributed anomaly response in ways that control the costs of coordination in this setting (a) is challenging and (b) can yield findings that generalize across settings undergoing growth in scale from dependence on increasingly software intensive and autonomous capabilities.

Descriptive analysis of the ways in which expert reliability engineers troubleshoot live production systems fails to account for the full complexity, pressure and uncertainty faced in an outage. This is in part why attempts to manage coordination can constrain rather than support performance. The next section will discuss methods used to manage coordination in digital service delivery.


Attempts to manage coordination for incident response

To minimize downtime during service outages, a variety of practices have been adopted from emergency service organizations, such as the incident command system (ICS). Various models have been defined with greater or lesser emphasis on bureaucracy, rigidity and role clarification (Bigley & Roberts, 2001; Jamieson, 2005). The software engineering domain has adopted these techniques to manage coordination during incident response (Beyer, Jones, Petoff and Murphy, 2016). However, close study of how engineering teams respond to incidents suggests this model is an oversimplification (Grayson, 2018; Woods, 2017).

Implications

Moving from Waterfall software development and years-long development lifecycles to one measured in weeks and days, and moving from hardware located "on prem" to the Cloud, undoubtedly increased adaptability to changing conditions. These changes also mean the company transfers the burden and costs of managing and maintaining IT infrastructure. These shifts create opportunities for organizations to grow and capitalize on market trends, and they bring substantial benefits to the user. However, they also expose organizations to new forms of failure and transform the work of engineers.

While Agile refers to a specific methodology, as do continuous deployment/continuous integration practices like DevOps, there remains ambiguity about these distinctions. Arguably, the domain has been "in transition" for the last decade, but what actually constitutes whether an organization is 'doing' DevOps remains up for debate. While organizations see the value of this approach, many remain poised on the cusp of change (particularly those in slow-to-change domains like personal banking, insurance or airline routing systems) and have yet to shift to new models of maintaining their systems. The ambiguity, and the substantial challenges of managing legacy systems while attempting to move to new architectures and technologies and cope with the demands these bring, represent significant issues for modern organizations. To a large extent, this is why we see many 'celebrated' failures of these kinds of systems - the rate of change in the world exceeds the rate at which many systems are maintained and updated, which leaves them "unable to keep pace with change and cascades" (Woods, 2018) that can trigger substantial failures. The time for an organization to choose whether or how to make the shift to continuous integration is quickly evaporating. The demands of the competitive market continue to grow and the capabilities afforded by the technologies are too attractive to ignore. John Allspaw once commented that there will soon be a time when he will not get on an airplane that can't update its software while in the air (personal communication). This is not to say that continuous integration methods are inherently safe. Recent incidents like Knight Capital's loss of $440 million over the course of 45 minutes, or the myriad examples of problematic and brittle software, are cases worth studying closely. However, what this points to is that as the technology enables greater speeds and scales, the need to be able to adapt in real time becomes critical. To do so, you need skilled operators who can wrangle the complexity inherent in the systems.

This chapter provided a high-level overview of the characteristics of the domain of interest to studying the cognitive and coordinative work of anomaly response. The next chapter will discuss the methods for researching cognition in software engineering.


Chapter 4 – Research Methods

Introduction

The preceding chapters considered current perspectives for managing coordination in large-scale distributed systems, particularly through the lens of cognitive work. A form of cognitive work - anomaly response - was used to demonstrate both the mismatches and the potential opportunities for applying these frameworks to high tempo, high consequence environments. Poorly designed coordination was shown to incur costs that can become untenable during periods of escalating uncertainty, high tempo and cascading failures. The resulting coordination breakdowns can be attributed to processes that do not account for cognitive load on participants and procedures that are too rigid to allow for alternate strategies when the costs of coordination become too high. A gap was found in the literature concerning the choreography of coordination and a more complete representation of the corresponding costs. Following the review, to set context for the reader, a description of the practices, applications and technologies of the domain was provided to draw the connections between the tempo of work, the inherent uncertainty of continuous deployment environments and the extensive dependencies between inter-organizational and intra-organizational functions. The omnipresence of critical digital infrastructure was made explicit to highlight the implications of this research across a broad range of industries currently trending towards distributed work groups managing complex, high hazard operations.

Next, the approach to the study is discussed. In her seminal work, Plans and Situated Actions: The Problem of Human-machine Communication, Suchman (1987) notes two important objectives: "One objective in studying situated action is to consider just those fleeting circumstances that our interpretations of action systematically rely upon, but which our accounts of action routinely ignore. A second objective is to make the relation between interpretations of action and action's circumstances our subject matter." (p. 72)

Suchman's comments are relevant to the subject matter of this dissertation in that the purpose of this research was to examine the strategies expert practitioners use to balance the cognitive and coordinative demands of managing uncertainty during high tempo events. Coordination for cognitive work, and the corresponding costs it entails, are similarly fleeting and often ignored. This work aims to make this substantial effort visible and draw the connections between the costs and the circumstances to aid in the future design of work.

This is achieved first by looking at a sample of the processes and tooling used to manage coordination for potential sources of brittleness leading to coordination breakdowns. Then, a corpus of cases drawn from a Consortium of critical digital service companies is examined to assess the conditions under which costs of coordination can become unmanageable. More specifically, the focus is on the work of software engineers tasked with the reliability of critical digital infrastructure. This work adds to a series of studies of anomaly response in CDI (Grayson, 2019; Allspaw, 2015).

This chapter is on research methods and will describe the sources of data including the organizations and their participants and provide a brief comment on the availability of data. The chapter then goes on to describe the methods used to explore the mechanisms used by practitioners to control the costs of coordination within large-scale distributed work systems as they carry out the functions of anomaly response under uncertainty, risk and pressure.

Sources of Data

Obtaining access to large-scale, distributed work organizations is not an inconsequential achievement for researchers (Prechelt et al. 2015). A partnership was established through the SNAFU Catchers Consortium with organizations that operate CDI, experience anomalies regularly, and, interestingly, also develop tooling to improve support for anomaly response. In joining the Consortium, all organizations agreed to provide access to their operations through an engineering unit responsible for service reliability.

Organizational characteristics of participants

The organizations studied are medium to large sized companies ranging from 1700 to over 300,000 employees with regional and global operations. These organizations represent different types of service providers with both internal and external customer bases. All are characterized by 24/7 operations of services provided as well as an interest in improving the reliability of service delivery through deeper study of the factors that influence organizational resilience.

Individual characteristics of participants

Typically, a mid-level manager or lead would act as a liaison to the Consortium, assisting in organizing site visits, aiding in collecting data and providing updates on potentially informative cases. These key informants were highly knowledgeable about the system, legacy constraints, the "explody bits" and the prioritization of work related to reliability and resilience. The software engineering teams involved were typical DevOps engineers who were tasked with developing new features as well as ensuring reliable uptime. There was a range of experience levels, both in terms of tenure within the company and experience in software development. Because the study spans multiple organizations and organizational boundaries and the participants are anonymized, it is not possible to prescriptively define more precise characteristics of the responders. It is accepted that their involvement in the incident response gives them legitimacy (meaning their involvement has some value to the organization, either because of their accumulated expertise or because their involvement is a form of coming up to speed or training to further develop their skills). Additional participants encountered during site visits ranged from system architecture roles to product owners, senior and executive management, communications personnel, advocates, client relations, administrators and vendor support engineers. The daily rate of live code deployments into production varied, but all teams were involved in both continuous improvement and ongoing maintenance work.


The Consortium is operated under the Chatham House Rule, whereby participants agreed to transparent discussions of real operational difficulties on the understanding that any information shared outside the Consortium meetings would not include the identity of the individuals or organizations involved. In addition, all data is de-identified and anonymized for the purposes of research.

Data availability

As previously described, the ChatOps method of collaboration provides a high degree of built-in traceability that is often supplemented by video conference recordings, system logs and other traces of activity. Therefore, CDI is a natural laboratory that provides a variety of assets for process tracing of anomaly response in actual failures. Using text-based transcripts in Slack is an unobtrusive method of data collection that causes little disruption to normal operations. Similarly, web conferences are easily recorded with minimal additional effort.

Collecting data on cognition in the wild

That being said, collecting data in production environments can be challenging (Buchanan et al. 2013; Woods 1993). Research in the natural laboratory seeks to discover "how these more or less visible activities are part of a larger process of collaboration and coordination, how they are shaped by the artifacts and in turn shape how those artifacts function in the workplace, and how they are adapted to the multiple goals and constraints of the organizational context and the work domain" (Hoffman and Woods 2000, p.3). However, observation in production environments "is permeated with the conflict between what is theoretically desirable on the one hand and what is practically possible on the other. It is desirable to ensure representativeness in the sample, uniformity of interview procedures, adequate data collection across the range of topics to be explored, and so on. But the members of organizations block access to information, constrain the time allowed for interviews, lose your questionnaires, go on holiday, and join other organizations in the middle of your unfinished study. In the conflict between the desirable and the possible, the possible always wins." (Buchanan et al. 2013, p.68)

In other words, studying naturally occurring phenomena such as anomaly response in 'live' systems means the opportunities for data collection are intermittent. Neale et al. (2004) highlight the challenges of capturing this kind of work, which is distributed across time and place, in addition to the difficulties of negotiating access in operational settings over long periods of time and across multiple participants. They note that, "much of the interaction of interest occurs at times that are either inaccessible to evaluators or occur over long time periods making it impractical to capture. Even when it is possible to collect data, it can be difficult to predict when and where the interaction of most interest is going to occur. Having only "snapshots" of the total interaction that is relevant leaves evaluators wondering whether more data is needed. All these factors make it difficult to prioritize the most appropriate data collection strategies." (p.113)

Therefore, preparations were made to maximize the completeness of the data collected and to understand the potential constraints on data collection within each of the Consortium companies. In this way, the methods used in this dissertation were both planned and opportunistic, and substantial effort was placed on being prepared to capture as much contextual data as was available when events occurred. Those preparations are described next.

Preparations

In the 2 years prior to the study, 10 site visits were made to 3 of the 4 participating companies, including 2 extended residencies. These multi-day excursions served both a social and a technical role. Strong observational research of work depends on the ability to establish rapport with practitioners to help deepen the understanding of context that will inform the later analysis. Technically, the site visits were important orientations to the engineering teams, the range of tooling used and the extent of its use, the variety of team structures and functions of peripheral units (including dependencies) and support roles within the broader organization, as well as the types of problems faced. Throughout this period, different methods of data collection, extraction from the tools in which the data were collected, transcription, preparation and representation for analysis were explored and refined. This foundation provided initial exposure to the challenges of coordination and the attempts to control the costs of coordination both within teams and across the organization. In year 3, the data collection methods were finalized.

Figure 4.1 Distribution of site visits

Investigating models of incident response

The first study was conducted to understand the variation in how teams organized around incident response. While there is a general structure that many organizations follow (the incident command system popularized by disaster response agencies), variations to accommodate available resources, architectures, organizational or technical constraints and pressures all matter for how the model influences coordination. As discussed in the introduction, current methods in software engineering for organizing coordination are based on process-driven requirements. This is problematic. As Hollnagel and Woods (1983) show, models of practice developed from a process perspective are incomplete and representative of work-as-imagined, not work-as-done. Therefore, an important part of examining how practitioners control the costs of coordination during anomaly response is to model how incident response actually takes place, grounded in the system as experienced by responders. There are two important points to be made here. The first is that the current models used to structure an anomaly response are based on implicit assumptions about how practitioners detect, diagnose and resolve service outages, including presumptions about available resources and information that may or may not be true. They are oversimplifications (Feltovich, Spiro & Coulson, 1997) of the complex underlying behavior and interactions taking place. Simplified models of incident response break down very quickly under even lightweight empirical study. Second, when an organization claims to be following a model or best practice, there is an implicit assumption that it is being implemented exactly as prescribed. Minor (or major) variations in important attributes related to the coordinative interplay - like resourcing, authority-responsibility structures, incentivization or expectations for collaboration - can render the original intent of the model unrecognizable.

Methods for defining models of incident response practices

To develop an understanding of the existing models for incident response in software engineering, a review of the site literature was conducted. This review provided an orientation to the generally accepted methods of responding to anomalous events in software systems. Next, the criteria for possible research subjects were determined and company liaisons were engaged to assist with securing access to participants. Of the fifteen work groups identified for the project, fourteen were included. Despite repeated attempts, one work group was unavailable for interviews during the project period. Sixteen engineers with on-call incident response duties were interviewed from twelve different work groups using a semi-structured protocol. Using the information provided in the interviews and access provided by the participants, subsequent observations were made of the customizable tooling dedicated to their incident response and user support. This included chat platforms, monitoring and alerting systems and, where available, postmortem or root cause analysis documents. Of the fourteen work groups, only one declined to provide access to analyze their tooling. This additional access to the tooling enabled approximately 26 hours of observation and approximately 14 hours of review. The observations were of active use of the tooling, while the review was retrospective. Following this initial work, a 'playback' of the data collected was conducted with the interviewees to confirm accuracy and provide an opportunity for clarification. The playbacks were conducted within ten days of the initial outreach. Four interviewees were unable to confirm data collection via the playback. Data collection and analysis were completed, and a playback was presented to a senior site reliability team responsible for tool deployment and development within their business unit. Feedback from this playback was used to further calibrate the models.

Understanding configurations for incident response

The second preparation study was conducted to analyze the configurations for incident response. Configurations are the structured and unstructured interactions of the work environment, hardware and software. A structured configuration refers to a setting or threshold established by the incident response practitioners (an alerting threshold, for example). An unstructured interaction is one where calibration is possible but not utilized.

There is a tight coupling between the incident model used and the configurations for incident response. A rigorous cognitive analysis acknowledges that the physical environment and the artifacts used shape the modes of interaction between people and technology and inform the possibilities for action. This preparation study was developed to establish a baseline of the kinds of physical environments, hardware and software used in incident response. This was not intended to be an exhaustive review, and the methods reflect this - interviews, artifact analysis and observation were used.
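As an illustration of a 'structured configuration' in this sense, the sketch below evaluates an alerting threshold set by responders against a stream of metric samples. The metric name, threshold and window size are assumptions for demonstration, not values observed in the study.

# A minimal sketch of a practitioner-set alerting threshold (all values are hypothetical).
from statistics import mean

ALERT_CONFIG = {
    "metric": "p95_latency_ms",  # hypothetical metric name
    "threshold": 750.0,          # alert when the windowed average exceeds this value
    "window": 5,                 # number of recent samples to average
}

def should_alert(samples: list, config: dict = ALERT_CONFIG) -> bool:
    """Return True when the recent windowed average breaches the configured threshold."""
    window = samples[-config["window"]:]
    return bool(window) and mean(window) > config["threshold"]

# Example: the last five samples average well above 750 ms, so an alert would fire.
print(should_alert([420.0, 510.0, 800.0, 910.0, 980.0, 1200.0]))  # True

Whether such a threshold is deliberately tuned by responders (structured) or left unexamined (unstructured) is exactly the kind of configuration choice this preparation study sought to baseline.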

Interviews

The study included interviews with 46 individuals with responsibility for service delivery. These individuals held roles with a variety of titles such as site reliability engineer, DevOps engineer, product owner, product manager, technical lead and engineering manager. The interviews used a semi-structured interview protocol and were conducted during business hours over the course of three months.

Artifact analysis

Supplemental to the interviews, twenty-seven engagements involving artifact analysis were completed. A follow-up to an interview was considered a discrete 'engagement' when the interview produced artifacts for review. As part of the interview process, interviewees were asked if they would permit access to their software tools or submit any further documents, wikis or other materials (virtual or physical) to help understand their workflow. 96% of participants agreed to additional engagements. As multiple interviewees were part of the same team, there was overlap in the access and artifacts provided. The scope of artifacts reviewed included:

● Standard Operating Procedures - prescriptions for incident management practices.
● Incident Response playbooks - containing technical detail on how to respond to commonly seen problems; some of these were housed in an internal wiki or shared document repository, others were catalogued in the incident response tools, so access to the tool was granted for review.
● Flowcharts - outlining the flow of an incident and escalation practices.
● Service Level Agreements (SLAs)/Service Level Objectives (SLOs) - these documents contained information about the expectations for reliability between a service owner and their users or clients.
● Contact lists - used for compiling information about who to contact for assistance with different dependent tools.
● Internal company wikis - these were used primarily for recruitment and detailed competencies, reporting relationships and contact information.
● Software tools - various tooling used for incident response functions like monitoring, alerting, on-call scheduling and in-event aids (chatbots).

As part of this investigation, seventeen discrete ChatOps workspaces, forty-one channels and seventy service configurations were examined in depth.

Observations

An estimated 420 hours of observation (in person and virtual) of incident response practices were conducted. This included direct observation in the physical workspace and virtual observation. Direct observation was conducted by sitting in the workspace near the response team while also being connected to the web conferences or audio bridges. Virtual observations were entirely online, through the web conference link being used by the incident responders.


Study: Process tracing a corpus of cases

The preparatory work, in concert with the literature review, provided the background to generate an analytical frame for coding. This frame considered concepts relevant to the domain that were reflective of coordinative functions. Coordinative functions can be supportive or adverse, but both are important for study and reveal different insights about cross-scale coordination.

Case collection

A collection of sixty-two cases was reviewed for pertinent coordination features. The review involved reading available post-mortem data and identifying cases that contained elements of coordination useful for studying coordination costs. This was done by applying a lightweight first pass of codes covering a selection of coordinative aspects. Based on this, fourteen cases were selected for further assessment of data availability. Of these, five events were selected to form a corpus of cases for process tracing analysis focused on how practitioners adapt to manage the costs of coordination.

Figure 4.2 Overview of research methods


Data extraction and conversion

The inherent traceability afforded by ChatOps offers tantalizing opportunities for research, yet challenges remain in easily extracting data for analysis. Many organizations use chat transcripts as the basis for their post-incident activities, but this is a relatively superficial process, and accessing time-stamped data inclusive of non-text-based objects for process tracing involves a more concerted effort. Non-text-based objects common in incident data include emoticons, screenshots (typically of logs, dashboards, user reports), gifs, and links shared from other channels that may be embedded as picture files. Text-only files were extracted as JavaScript Object Notation (JSON) files. JSON is a "lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate" (Crockford, 2009). Ingestion into the analysis tool required a conversion to a tab-separated values (tsv) file. This was completed by the tool developer. The tsv files were then uploaded into Churchkey (a tool for qualitative research analysis), which is described later.
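The sketch below illustrates the text-only extraction step: flattening exported chat messages from JSON into a tab-separated file suitable for ingestion. The field names ("ts", "user", "text") are assumptions about the export format; in this study the actual conversion was performed by the tool developer.

# A sketch of JSON-to-tsv conversion for chat transcripts (field names are assumed).
import csv
import json

def chat_json_to_tsv(json_path: str, tsv_path: str) -> None:
    """Write one row per message: timestamp, speaker, utterance."""
    with open(json_path, encoding="utf-8") as f:
        messages = json.load(f)  # assumed to be a list of message objects

    with open(tsv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["timestamp", "speaker", "utterance"])
        for msg in messages:
            writer.writerow([msg.get("ts", ""), msg.get("user", ""), msg.get("text", "")])

# Example with hypothetical file names:
# chat_json_to_tsv("incident_channel_export.json", "incident_channel_export.tsv")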

In addition, any audio recordings, including audio from web conferences, were transcribed using Descript. The timestamped transcripts were then edited to clean up transcription errors, identify different speakers and make note of any unintelligible text. These files were converted to rich text format and then processed by a third party into tsv. The audio transcripts were then uploaded into Churchkey, where they were integrated so that utterances from the audio files overlaid the utterances from the chat transcripts.
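The overlay step can be pictured as a merge of two timestamped streams into a single chronological record, as in the sketch below. The tuple structure (timestamp, source, speaker, text) is an assumption for illustration; Churchkey's internal representation is not documented here.

# A minimal sketch of interleaving chat and audio utterances by timestamp (structure assumed).
def overlay(chat: list, audio: list) -> list:
    """Merge chat and audio utterances, each a (timestamp, source, speaker, text) tuple."""
    return sorted(chat + audio, key=lambda utterance: utterance[0])

# Example with hypothetical values:
chat = [(1000.0, "chat", "responder_1", "seeing elevated 500s on checkout")]
audio = [(1002.5, "bridge", "responder_2", "rolling back the last deploy now")]
timeline = overlay(chat, audio)  # chronological record combining both sources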

With all data sources, the relevant parties and company-specific details were de-identified and anonymized in accordance with the non-disclosure agreements and charter of the Consortium.

Coding

As was previously discussed in Chapter 2, coordination is inherent in all forms of joint activity. However, to focus the corpus on cases relevant to controlling the costs of coordination, it was first necessary to narrow the 62 available cases using a first-pass coding. Cases were coded according to the concepts of breaking down, forming/dissolving, switching, crossing, synchronizing and mentioning as relevant frames to identify scenarios that can help in understanding the costs of coordination. Each is described briefly below.

Breaking down refers to a spectrum of degraded coordination. As noted in Chapter 2, smooth coordination is difficult to see; therefore, breakdowns offer an opportunity to 'see' the coordination. As the costs of coordination begin to rise, the effort needed to manage the interactions is revealed. Therefore, this aspect of the frame is evidenced by actions like the refusal of an invitation to participate in joint activity or the 'non-action' of being unresponsive to others' attempts to coordinate.

Forming/Dissolving involves the engagement of others into new groupings for the purposes of aiding in the joint activity. Recruitment in this domain can be effortful (physically stopping work to go locate another) or lightweight (@ mentioning someone in a shared virtual collaboration space or paging them using a notification tool). In the cases reviewed, for example, groups formed by virtue of detection (users reporting issues to the reliability team), diagnosis (particular skill sets, knowledge or the engineer with a recent code push were needed to help determine the issues), decision-making (authority was needed to approve a course of action or to escalate the urgency with higher levels of management). Both forming and dissolving of groups could happen abruptly (as when an entire team is paged in) or ad hoc (when individuals identified as potentially useful are recruited after the response has been underway for some time or when an invalidated hypothesis about the source of the issue renders a portion of the responders unnecessary and they drop off).

Switching represents a change in the medium for communication or coordination, such as moving from chat-based interactions to web conferencing or gathering participants in a co-located physical space. Switching may not be indicative of coordinative breakdowns or rising costs but rather of designed supports intended to aid a response team. In this way, it is a useful code for the first pass because it can reveal interesting characteristics about how coordination works.

Crossing refers to coordinative instances that happen at boundaries. It is the expansion or contraction of coordinative demands across some form of grouping – team, organization, vendor. This could be intra-organizational, as in the recruitment of another department's engineering team or a dependency on an internal company capability such as a network or internal Cloud services. Other boundaries are inter-organizational, where critical dependencies are external to the organization but coordination is crucial for incident response to meet the problem demands. Client/support boundaries may be either intra-organizational (for internal users) or inter-organizational (for external customers).

Synchronizing as a code relates to the timing and sequencing of activity. It can also relate to goals and priorities. Anomaly response is inherently a multi-threaded activity, as detection and diagnosis may be ongoing in concert with efforts to repair or safeguard the system from further degradation. Consequently, tasks and activities need to be synchronized to avoid decompensation and working at cross purposes (Woods and Branlat 2011). In addition, large-scale distributed systems have multiple goals and priorities that shift rapidly over time (Smith et al. 2003; Nemeth et al. 2004) and need re-synchronizing.

Mentioning refers to explicit communications related to coordination. As noted previously, smooth coordination is difficult to see. Traces of coordination difficulties and breakdowns make the invisible visible.

Figure 4.3 Coding framework


The second pass of coding was iterative, with the secondary codes beginning with several of the key known elements of choreography important to the cost of coordination – recruiting, common ground, delegating, signaling, being directable and being observable. Additional secondary codes provided descriptive features of those elements. Where a tool was involved, a code was applied (such as the tool aiding recruitment, signaling or delegating tasks).

The secondary codes also identified the participants relative to their position with respect to the core group of responders in order to stratify the groupings involved in the cases. These codes were core, specialists, intra-organizational responders and inter-organizational responders.
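The sketch below shows one way the two-pass frame can be represented: first-pass codes filter the available cases down to those rich in coordination features, and second-pass annotations attach choreography elements and participant strata to specific excerpts. The case records, threshold and values shown are hypothetical, not data from the corpus.

# A sketch of the two-pass coding frame as data structures (example values are hypothetical).
FIRST_PASS = {"breaking_down", "forming_dissolving", "switching",
              "crossing", "synchronizing", "mentioning"}

cases = [
    {"id": "case_A", "codes": {"switching", "crossing", "breaking_down"}},  # hypothetical
    {"id": "case_B", "codes": {"mentioning"}},                              # hypothetical
]

def candidate_cases(case_records: list, min_codes: int = 2) -> list:
    """Keep cases exhibiting at least min_codes of the first-pass coordination codes."""
    return [c for c in case_records if len(c["codes"] & FIRST_PASS) >= min_codes]

# A second-pass annotation attaches choreography elements and a participant stratum to an excerpt.
second_pass_annotation = {
    "case": "case_A",
    "excerpt_timestamp": 1005.0,              # hypothetical
    "elements": ["recruiting", "signaling"],  # choreography elements from the frame
    "participant_stratum": "intra-organizational responder",
    "tool_involved": True,
}

print([c["id"] for c in candidate_cases(cases)])  # ['case_A']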


Chapter 5 - Findings

Introduction

This chapter presents the findings from the preparatory studies and the corpus of cases. It follows from previous chapters, which grounded these findings in the context of large-scale distributed software systems where teams of human and machine roles conduct the joint activity of anomaly response. Additional background was provided in Chapter 3 on software engineering and service delivery in modern continuous delivery/continuous integration environments. The previous chapter covered the methodology used in the preparatory studies and for the analysis of the corpus of cases. The findings have three main thrusts – first, on the models of coordination that inform distributed anomaly response; next, on the technologies that shape the configurations for joint activity; and finally, on the elements of choreography that show how practitioners control the costs of coordination under uncertainty and time pressure.

Before presenting the results from the corpus of cases, the results of the two preparation studies are given. In the first, findings show that some models of coordination during incident response can aid in smooth coordination and resilient performance by allowing for flexibility and adaptation of practices to match the demands of the problems faced. Conversely, the rigidity imposed by other models can limit the range of possibilities available to responders to cope with fluctuating problem demands.

While these models influence practitioners' actions when responding to anomalies, the technology used can also heavily inform practice. Coping with software outages includes automation 'working' alongside human roles. Practitioners in this domain compile and use various displays, alerts, and other tools to assess the system and to take actions to resolve problems. The configuration of response efforts is heavily shaped by the ways the technology is utilized to bring people, data, knowledge and expertise to bear as an incident begins, evolves and spreads. Because of this context, the second preparatory study examined how technology shapes the configurations for joint activity.

Results from preparatory studies

The next section describes the two studies conducted as preparations for selecting and analyzing the set of incidents in the corpus.

Preparatory study 1: Incident Models

This study combined data from semi-structured interviews across fourteen different teams responsible for service delivery. The structures for service delivery ranged from on-premises (on prem) management of a third-party vendor's service, to pure Software-as-a-Service (SaaS) offerings, to internally developed services that were hosted and managed internally (no users external to the organization). Some followed a (perhaps unintentionally) hybrid service delivery model with shared responsibilities across organizational boundaries.

The different styles of incident response are categorized into four models, though several have variations:

• Incident Command model - considered an industry-accepted best practice, this model is based on the ICS structure of command and control currently popular in digital service delivery, where a collection of formal and ad hoc groupings works under a central (and usually singular) leader.
• One At A Time model - ownership of the problem is assigned to discrete groups responsible for different aspects of service delivery and passed along as the problem is found to reside somewhere else.
• Escalation model - popular with ticketing-style incident management; the problem gets escalated through different levels of responders.
• All Hands on Deck - this style of incident response focuses on recruiting all available resources when the severity of the incident is deemed to be high.


"Incident Command"

This model follows the command and control style response used in disaster recovery efforts. Upon declaration of an incident, the 'commander' role becomes the definitive authority on how the response should be conducted. "They structure the incident response task force, assigning responsibilities according to need and priority. De facto, the commander holds all positions that they have not delegated. If appropriate, they can remove roadblocks that prevent Ops from working most effectively." (Beyer et al. 2016). In this model, the incident commander (IC) is the key hub around which all other activities are connected. First, organizing incident response around an IC assumes a single individual is capable of the cognitive work needed to keep pace with the dynamic flow of events, assess the implications, and handle tradeoff decisions in concert with the rate of change of the incident. Second, the IC role is set up as the key point of authority who then delegates tasks to other personnel. 'Freelancing' (defined as any activity taken without explicit authorization by the IC) is to be avoided. Third, this model assumes tasks can be parsed into discrete units of activity that can be timeboxed with little concern for varying tempos of operations as problems, evidence, uncertainty, and threats change over time. The incident command model does not address how workload bottlenecks may produce lags in guiding or integrating activities. There is little provision for the consequences that arise when the IC role falls behind the pace of events. Interestingly, if slowing the pace of response is needed to keep the IC current, these actions will not help and may hinder the pace of cognitive work for other responders.

"One-At-A-Time"

The One-At-A-Time (OAAT) incident model is based on a structure of paging out one individual responder at a time. The underlying assumption is that this responder can follow prescribed protocols from the runbook and attempt to diagnose the issue to the full extent of their knowledge. Following this, they can recruit additional resources within their own team. If the responders are unable to successfully diagnose and repair the issue, additional resources from a different service delivery team are recruited and the issue is handed off to them. While some initial responders might remain engaged in the effort, it is more likely that the dependent service team would begin diagnosing on its own.


Figure 5.1 One at a Time

This model assumes that the knowledge and expertise needed to handle events are held within clearly delineated work groups and that getting to the right people with the right expertise is sufficient. It follows that recruiting additional resources and bringing them up to speed is straightforward, without delay or rifts in the response process. Thus, there is little or no additional benefit to handling an event that comes from interactions across the different skills, perspectives, and experience of different work groups. Handoffs, then, can be a straightforward exchange that contains all relevant information.

OAAT Variation: Mean Time To Innocence
This is a variation of OAAT in which responders exhaust all possible known efforts to restore service in non-routine or exceptional cases. In parallel to their response efforts, they compile evidence that the problem actually lies with one of their dependencies – thus they are ‘innocent’ of being responsible for the outage. Upon sufficient proof that the problem is not within their scope of the service, they hand off the problem to the role responsible for the suspected dependency. This is a hard handoff with no offer or expectation of continued involvement in the response effort.


Figure 5.2 OAAT with Mean Time To Innocence

This variation extends OAAT by assuming its underlying assumptions will still hold in more challenging cases. Even then, unanticipated interactions between dependencies can be identified by the responsible parties. Once identified, problems can be ascribed to one responsible party, and a specific single work group is accountable for preventing or minimizing any outages. Even in these more difficult cases, responders stay within their scope and do not want to be called in for issues with dependent services.

OAAT Variation: “With Incident Command”
In this instantiation of OAAT, an incident commander role triaged the level of response. They were not expected to perform any technical function but rather coordinated the efforts of technical experts, communications and note keepers, and provided a liaison to management. This role strictly adhered to a procedurally driven response effort which ‘managed’ technical experts by having them provide a description of the tasks they were undertaking and an expected time to completion. Communication between responders was not typically monitored except for updates about the status of previously determined actions. This variation adds assumptions about the incident commander role, given the other assumptions above about the connection between problems and expertise: the IC will know when a service is sufficiently proven to be responsible or not responsible and can then manage handoffs between the different technical experts.


Figure 5.3 OAAT with Incident Commander

Escalation
The Escalation model is based on a typical ticketing model that seeks to rank the severity of reported issues. In this model, when a user discovers an issue, they may open a ticket to report their findings. The ticket is then assessed and assigned to a lower skilled responder to handle by following a prescribed runbook. If the problem cannot be resolved in this way, the severity is escalated to a higher level and a more skilled responder (who may also have access to a greater range of diagnostic tools) takes over. With dependent services, this escalation may happen across boundaries. For example, a problem may be escalated through several levels internally and, failing resolution, cross the boundary to the vendor’s lower level service desks before it is escalated again. This model can control the costs of coordination by weeding out the easily resolved issues and reserving the attention of highly skilled (and scarce) resources for non-routine and exceptional events.


Figure 5.4 Escalation

This model assumes the steps through the escalation will not create any undue delay in getting to the responder with the appropriate skill level for the difficulty of the event, even though that difficulty is not known in advance. Each successive layer ‘trusts’ that the layer below sufficiently completed the troubleshooting within their runbook steps and that those steps do not complicate or hinder responding to the event as it is escalated. As escalation proceeds, it is assumed that handoff notes in the ticketing system are sufficient to provide context for incoming responders. The model also assumes that the users or clients have only a very basic level of proficiency and that routing the ticket through initial troubleshooting may resolve it; in practice, a highly proficient team of site reliability engineers may end up recruiting a less skilled group of vendor support.
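To make the escalation mechanics concrete, the following is a minimal sketch in Python of a tiered escalation policy. The tier names, severity handling and ticket fields are illustrative assumptions, not a description of any participant organization's tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative tiers; real organizations define their own levels, and the
# vendor boundary sits wherever the support contract places it.
TIERS = ["L1 service desk", "L2 internal support", "L3 site reliability", "Vendor support"]

@dataclass
class Ticket:
    summary: str
    severity: int = 1                  # raised as the issue resists resolution
    tier_index: int = 0                # current position in the escalation chain
    handoff_notes: List[str] = field(default_factory=list)

def attempt_resolution(ticket: Ticket, resolved: bool, notes: str) -> Optional[str]:
    """Record the runbook steps tried at this tier; escalate if the issue persists."""
    ticket.handoff_notes.append(f"{TIERS[ticket.tier_index]}: {notes}")
    if resolved:
        return None  # closed at the current tier
    if ticket.tier_index + 1 >= len(TIERS):
        return "escalation chain exhausted"
    # The model assumes these notes carry enough context for the next tier.
    ticket.severity += 1
    ticket.tier_index += 1
    return f"escalated to {TIERS[ticket.tier_index]}"

# Example: an issue that the first two tiers cannot resolve only crosses the
# organizational boundary to vendor support after passing through them.
ticket = Ticket("users report intermittent 500 errors")
print(attempt_resolution(ticket, resolved=False, notes="ran runbook restart; no change"))
print(attempt_resolution(ticket, resolved=False, notes="checked dashboards; cause unclear"))
print(attempt_resolution(ticket, resolved=False, notes="no fault found in our scope"))
```

The assumption flagged in the text above – that handoff notes alone provide sufficient context for the next tier – is exactly where this model is most fragile.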


“All Hands on Deck”
This model assumes the people who make up response teams have a refined understanding of the classes of problems that tend to arise in their system. Because their monitoring and alerting capabilities are strong, known problems can be matched to appropriate support level responses, while unknown or potentially severe problems trigger a broadcast page out to the pool of responders. For example, an actual or suspected high severity incident would trigger a large-scale page out to a broad collection of responders. This brings a diverse collection of skills and expertise to the incident in its early stages. If the incident is discovered to be lower priority or the solution is found quickly, responders drop off the effort accordingly, understanding they can be easily recruited again if the incident progression demands it. The distinguishing assumption in this model is that engaging multiple, diverse perspectives in the problem early is beneficial, as opposed to waiting to bring in additional or targeted resources and expertise. Research indicates the interactions between these perspectives broaden hypothesis generation and bring out important information relevant to response efforts early. The benefits of diverse expertise and perspectives come at a cost, as there must be means to coordinate over the multiple responders engaged. In the two observed instances of this style of incident response model, the response also utilized an incident commander role. When paired with the all hands model, the IC role did not generate bottlenecks that slowed responses. This was due to initiative being pushed down to individual responders as they generated potentially constructive activities and took it on themselves to be responsible for tasks.

This model makes a rather different set of assumptions. First, responders all share extensive common ground about the service, the expected response pattern and methods of adapting in real time. Second, responders can flexibly adapt to the structure of the incident response, anticipating the actions and needs of others.


Figure 5.5 All Hands on Deck

All Hands Variation: Incident Response SWAT
In police and military parlance, SWAT stands for ‘Special Weapons and Tactics’. In one observed instance of this incident response model, engineers specially trained to handle incident response took over the ‘management’ aspects of the response in order to offload the technical experts and allow them to focus on the diagnostic and repair work.

This variation also assumes that incident management can be handed over or taken over as additional responders come into the situation. The incoming responders must then have sufficiently broad knowledge about the services and about incident response practices to appropriately support the response by managing communications to users and stakeholders, recruiting additional responders, securing resources and removing blockers, while the response team that owns the service remains focused on the technical resolution.


Figure 5.6 SWAT Incident response

These models represent different ways of coordinating during incident response, with varying implications for the cognitive work of incident responders. The next preparatory study shifts focus from how the incident models shape joint activity to technology’s role in doing so.

Preparatory study 2: How technology shapes the configurations for joint activity

Cognitive Systems Engineering has a long history of surfacing new insights into how the design of technology-mediated control of dynamic systems - including the physical space, cognitive aids and interface displays - influences and informs cognitive work (Watts et al., 1996; Smith et al., 2009). In this spirit, the objective of this investigation was to understand how the technologies used in service reliability work in critical digital infrastructure shape possibilities for joint activity.


Given the 24/7 nature of operations and technology that enables access to the system from virtually anywhere, this section explores how the designed environments - spanning physical spaces, hardware and software - are adjusted and adapted relative to the event. Software engineers reconfigure these three aspects of the modern control ‘room’ to influence observability, directability and interpredictability.

The changing concept of the control room
Increasingly, a fully remote workforce is becoming the norm, so co-located responders in a shared physical environment are no longer standard. In addition, even with typically co-located teams, the 24/7 nature of operations means responders need to be prepared to take action at any point in time that they are on call or recruited into an incident. This means that opening a laptop in a home, conference hall, train, airport or other space must afford the ability to create a distributed control room with others who may also be remote or may be physically co-located. Geographic dispersion of the workforce, to account for continual support requirements, also drives the need for ‘control rooms’ to be inclusive of distant responders who may work the problem asynchronously. In this preparatory study, physical office space was only a small part of the configuration. Many modern software engineering spaces were designed with open floor plans conducive to Agile practices such as stand-ups and continuous reconfigurations of teams that require different workspaces as they carry out their tasks. The layout included banks of workstations arranged in pods with 3-4 desks on each side. Other collaborative workspaces allowed ad hoc groups to assemble quickly around tables or couches. Some of these areas included both movable and affixed digital smartboards or whiteboards. Multiple large screens viewable from many locations on the floor were affixed to walls. Floorplans also included smaller private meeting rooms for 2-20 people. The meeting rooms were typically glass enclosed with tables, chairs, whiteboards, smartboards or large monitors for projecting, and a conference call device with a speakerphone and multiple microphones.

Costs of coordination as influenced by the physical environment
Consistent with findings in computer supported cooperative work studies, the physical environment had inherent affordances for coordination that were essentially ‘free’ of cost. The arrangement of the workspace enabled strong observability into the actions of other responders. At times, this included being able to infer whether they could be interrupted and how well they were coping with the demands of the incident (by being able to read body language), and it gave some indication of the tasks they were handling (screen visibility) and who they might be interacting with (relative to physical proximity and observed interactions).

There were also costs to coordination in physical environments, such as coping with noise and distractions from neighbors and other activity going on in the space. In addition, interruptions by others during periods of cognitively demanding activity can disrupt lines of reasoning underway and add re-orientation costs. The physical environment also afforded the ability to easily shift to other spaces to support dynamic reconfiguration of the response team. This was necessary because there is an inherent rigidity in the desk configurations that makes it difficult for others recruited to the incident to work in close proximity. The design of meeting rooms and collaborative workspaces supported by smartboards and whiteboards lowered the cost of physically relocating by providing some of the same affordances as the open workspace while allowing inclusion of remote responders.

While colocation has its advantages, it also imposes limitations and costs. Even in collocated environments there are constraints on being able to collaborate: the monitors necessary for viewing extensive lines of code can obstruct visibility and limit the affordances provided by an open workspace.

In addition, in an effort to integrate teams that work closely together (and because of the desk configuration), teams sit side by side in the workspace. During an incident, background conversation from an adjacent team conducting stand-ups or discussing work gets picked up in the web conference audio. To minimize the effects of the background chatter, responders will mute their microphone until they are ready to speak, but frequently (because they are focused on their task activity) they will forget to unmute. This leads to additional cognitive costs: remembering to mute and unmute, and, for other responders, waiting for an expected response.

Technologies used to aid coordination during anomaly response
The above description provided a preview of the dynamics of translating physical control rooms into virtual spaces for incident response. Being prepared to work in both collocated and non-co-located contexts requires a bridge between the physical and virtual control rooms to enable all responders, regardless of location, to be integrated into the response efforts. This is especially true where there is a hybrid distribution (some responders co-located and others remote).

Software-based collaboration tools provide this bridge, and the next section discusses the ways in which software creates connectivity to form a virtual control room.

Software to connect the physical and virtual spaces
The findings showed that there was a wide variety of software tools in use - both internally and externally developed, as well as third-party SaaS offerings. These included software for:
• Monitoring system health (according to intervals and sensitivities defined by the service owner)
• Visualizing system performance (through dashboards, graphs and representations of pre-set thresholds for various functions)
• Alerting (messages sent to chatops channels, email, text or phone calls) at varying degrees of sensitivity
• Managing on-call schedules (to manage who should be alerted, at what interval and using which methods)
• Aiding collaboration (including real time messaging, web conferencing, audio bridges, and tools for generating shared artifacts such as documents or diagrams)
• Automating response plays for known problems or consistently used actions taken to mitigate issues
• Assisting during incident response (primarily chatbots used to convert a responder’s command line instructions in the messaging platform into automated action in the preceding tools; a minimal sketch follows this list). This automation serves to limit task switching during the event and minimize navigation across multiple platforms
• Tracking incident status (including user-facing ticketing systems to request support for an issue they are experiencing)
• Communicating service status and outages to users and dependent services
• Identifying human resources related to the response (including organizational charts, customer resource management tools, competency databases)
• Planning, prioritizing and scheduling the backlog of follow-up actions and bug fixes uncovered in incident response (including assigning ownership of the issue)
• Compiling and visualizing post mortem data
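The chatbot-style assistance noted above can be illustrated with a minimal Python sketch of a command dispatcher. The command names and handler behaviors are hypothetical assumptions for illustration; they do not describe any specific platform studied here.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical handlers; in a real deployment each would call the monitoring,
# paging, or status-page tool's API on the responder's behalf.
def show_graph(args: str) -> str:
    return f"[would fetch the '{args}' dashboard graph and post it to the channel]"

def page_team(args: str) -> str:
    return f"[would page the on-call rotation for '{args}']"

def set_status(args: str) -> str:
    return f"[would update the public status page to '{args}']"

# Dispatch table: one chat command per underlying tool action, so responders
# stay in the messaging platform instead of switching across tools.
COMMANDS: Dict[str, Callable[[str], str]] = {
    "graph": show_graph,
    "page": page_team,
    "status": set_status,
}

def handle_message(text: str) -> Optional[str]:
    """Convert a command typed into the chat channel into a tool action."""
    match = re.match(r"^!(\w+)\s*(.*)$", text.strip())
    if not match:
        return None  # ordinary conversation; the bot stays silent
    command, args = match.groups()
    handler = COMMANDS.get(command)
    return handler(args) if handler else f"unknown command: {command}"

if __name__ == "__main__":
    print(handle_message("!graph api-latency"))
    print(handle_message("!page db-oncall"))
```

The design intent, as described by participants, is to limit task switching and navigation across platforms during the event.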


How bridging software impacts costs of coordination in virtual control rooms
The use of collaboration software creates an ability to virtually assemble and re-assemble according to the demands of the incident. An example of this is a feature in chat platforms where users can direct message one another privately to avoid engaging the broader group in a discussion, creating a side channel where the conversation is focused and constrained to the direct participants. If more people are needed, adding them to the conversation and seamlessly converting it to a shared channel is a well-designed flow that retains the context of the discussion for incoming responders. Another example of software lowering costs of coordination is the use of web conferencing to overcome the limitations of the physical environment. When onsite responders cannot physically convene at desks and colocation is not an option (unavailability of meeting rooms or workspaces), web conferencing software enables participants to provide some degree of observability into their actions with a head-on view. However, this is quite limited: it is difficult to discern whether the person is engaged in the incident or looking at something else on their computer, and it provides no contextual cues about their interaction with others. Additional patterns of the costs of coordination when interacting with software tools are discussed in Chapter 6.

The study also showed some general patterns across the software tools used for incident response related to tool adoption and use. These patterns relate to the costs associated with selecting the tool, adopting it into practice and re-calibrating over time, and are discussed in greater detail next.

i) Investments into determining suitability of the tool
There were two main thrusts to the investments in determining the suitability of a tool, which can be understood as tradeoffs between thoroughness and efficiency. In terms of the first, interviewees described months-long, structured assessment processes that focused on reviewing best practices from industry, evaluating the tool’s performance for other related teams and making comparisons across platforms. Interviewees described substantial investments prior to tool adoption. These included the time spent surveying the current state of tooling for a specific function, researching the capabilities and limitations of various platforms or services, assessing how the service or tool would integrate into the existing stack and existing practices, and assessing how the adoption of the tool would influence dependent functions.

In several cases, the reasons for adopting a new tool were tightly coupled with reflecting on internal capabilities, which added a secondary decision around ‘on prem’ versus SaaS offerings. The adoptions described in this manner were typically carried out not by first responders but by product or service managers, business analysts or other supporting roles. The second approach was less structured, somewhat ad hoc, and ‘pushed’ the cost of assessing the tool further out in time by attempting to assess the tool while partially integrating it into practice in near real time. This approach was more commonly taken by front line responders. Both approaches faced an ‘adoption paradox’ in that teams were often looking for tooling to support their current practice because their current practice had become unsustainable in its present form. In other words, they were too busy coping with incidents to invest in adopting tooling that could help them cope with their incidents. Many vendors seem to be aware of this paradox and, accordingly, have streamlined tool adoption - essentially attempting to make it ‘plug and play’.

ii) Adapting practices to accommodate the implicit model of the tool
The easy adoption of tools is a double-edged sword. If there is a low cost to the initial adoption (simple onboarding or a promise of no extra training required) then the actual cost of coordination with that tool is shifted in time. As previously stated, technologies are designed to transform practice (ostensibly to make things easier for the practitioner), but a transformation of practice does not happen without implicitly requiring the practitioner to adjust their practice in some way. These adjustments were found to involve additional effort on behalf of individual practitioners (investing time and attention to learn how the tool functions) as well as the broader group of practitioners (investing time and attention to understand how the adjusted practice will influence joint activity). However, just because the tool was easy to adopt and the adoption costs could be ‘paid’ later did not mean they were. Almost universally, the survey participants described a reluctance to integrate a new tool into practice without a trial period, particularly for critical functions. As discussed in Chapter 2, technology that overpromises and underdelivers leaves practitioners to make up for the shortfall. Therefore, ‘trialing’ was common. The findings showed that an overwhelming majority of low cost adoptions resulted in the tool not being fully integrated into the workflow. There were multiple reasons for this:

an individual may have on-boarded the tool with the intent of ‘piloting’ it but did not have buy-in from other team members to implement it fully; only a specific feature was desired to fill a gap in existing practices; insufficient time was available to learn how the tool works (due to production pressure); or efforts to ‘test’ the tool in a production-like setting revealed shortcomings in its ability to integrate seamlessly into existing practices and the cost of further adjustments was deemed too high. This is a significant finding in that a tool that is easy to onboard may simply be deferring the actual cost of that onboarding to a later point in time, which ‘hides’ the true costs of coordinating with that tool, as discussed in the next section.

Before moving on, it is worth noting that a ‘piloting’ phase makes sense: it allows small scale testing to uncover potential blockers or bugs that may seriously impede performance. However, piloting will not uncover problems with using the tool at scale. So even if the tool performs well for a limited number of teams, the potential for additional costs for both adoption and ongoing coordination still exists as the scale increases.

Process Tracing a corpus of cases

This section provides a brief summary of the five cases included in the corpus and the reasons for their inclusion. This is followed by a more detailed description of each incident highlighting the coordinative elements of the response, emphasizing the inherent choreography and the adaptations made to control the costs of coordination. Responders, the organizations they belong to and the vendors involved have been de-identified. The service outages examined in these cases involve teams responsible for services delivered to internal and external clients, managing internally developed tools as well as on premises and SaaS offerings. An overview of the cases is provided in the following table. A complete version of the table is included in the Appendix.


Table 5.1 Overview of Corpus Cases

Case 1 (Q1 2019) – Hidden changes from vendor & multiple failures confound
Event: Multiple concurrent problems lead to short term outages and severe performance degradation over a two day period. Issues began occurring after a version upgrade and were found to be related to two of the three symptoms that were presenting at the time of the reported issues, one of which was related to an undisclosed change in the version upgrade. Case demonstrates many elements of choreography.
Data available: Web conf; ChatOps channels x 3; Debrief recording; Postmortem
Actors (direct/indirect): 8/3
Participants: Core members x 4; Specialized experts x 2; Additional resources x 4; Management x 1
Duration: 36 hr 26 min / 2363 min
First coding: Forming (multiple reconfigurations of core responders); Mentioning (requests to screenshare, discussions on long running implications); Breaking Down (vendor support); Switching (web conf to physical colocation to remote); Synchronizing (tradeoff decisions on timing of support bundles); Crossing (inter- and intra-organizational grouping)
Choreography of coordination: Establishing common ground; Recruiting (need for specialized skill sets beyond the capabilities of the core members); Being recruited; Switching coordination mechanisms to accommodate non-co-located members and difficult problems; Adapting coordination mechanisms to accommodate additional recruitment; Maintaining common ground; Repairing breakdowns; Controlling costs for self; Controlling costs for others; Investing in future coordination; Interacting with tools

Case 2 (Q3 2019) – A bad fault gets worse
Event: A routine upgrade uncovered brittleness in the system when a bug in one system revealed backups were corrupted. Previously unanticipated interactions, difficulties for resources to come up to speed, multiple perspectives not integrated, new undisclosed changes from the vendor and fragmented coordination exacerbated the recovery efforts and contributed to a 3-day outage with substantial data loss.
Data available: Screenshots of chatOps channels; screenshots of direct messages; Post-mortem
Actors (direct/indirect): 10/18
Participants: Core members x 2; Specialized experts x 5; Additional resources x 4; Management x 3; Vendor x 4
Duration: 71 hr 37 min / 4297 min
First coding: Forming/Dissolving (initial recruitment, ongoing recruitment & engagements); Mentioning (prompting, suggesting); Breaking down (side channels, no integration, delayed updating); Switching (multiple mediums); Synchronizing (delays, lag); Crossing (inter- and intra-org)
Choreography of coordination: Recruiting (timing of the incident meant resources were not readily available; issues with recruiting due to insufficient support contracts with vendor; recruitment ad hoc; difficulties coming up to speed; resources not integrated); Synchronizing (sequential & parallel activities; perspectives integrated resulted in delays); Establishing common ground (difficulties establishing CG due to side channeling and workload saturation); Maintaining common ground; Repairing common ground breakdowns (difficulties recognizing CG breakdowns); Delegating (sequential efforts); Contrast case on side channeling & IR structure


Case 3 (Q4 2018) – Scaling beyond imagined parameters
Event: Users began reporting errors that build logs for a third-party continuous integration software were not appearing. It was recognized to be the log parts table in a large-scale database, and the build logs were not being saved. Vendor support immediately joined the response and aided in real time model updating and managing tradeoff decisions. It was found the size of the database led the integer table to have been exceeded and the database schema needed to be replaced. The use of tools enabled cross checking with minimal additional effort.
Data available: ChatOps channel screenshots; post mortem (doc & recording)
Actors (direct/indirect): 7/9
Participants: Core members x 4; Specialized experts x 0; Additional resources x 1; Management x 1; Vendor x 3
Duration: 3 hrs 53 min / 233 min
First coding: Forming (initial recruitment adapted, integrating vendor resources); Mentioning (coordination breakdown, cross checking); Switching (used multiple mediums to engage inter/intra-org resources); Crossing (vendor resources joined as part of the response team)
Choreography of coordination: Recruiting (internal resources brought in and adapted protocol; intra-org recruitment smooth through established mediums); Crossing (resources from vendor beneficial in diagnostic activities, aided in real time model updating, assessed tradeoff decisions); Establishing common ground (vendor resources had extensive background on company system & demands); Maintaining common ground (use of multiple mediums to maintain CG); Repairing common ground (understandings about interactions within the system); Model updating (real time updating throughout the dialogue with vendor support); Controlling costs for others (incoming responder adapted protocol and took IC to keep support eng on diagnostic work); Synchronizing (alerting dependent services of an impending impact); Contrast case on vendor engagement: vendor support on the call within minutes

Case 4 (Q3 2019) – When those who should know don’t
Event: The code repository was ‘up and down’ all day and experiencing performance issues around critical functions. The support engineers reached out to the vendor support but coordination between the two engineering teams occurred through a ticketing system, which prevented common ground from being established, introduced additional coordination demands and resulted in delays. As is common with vendor-client troubleshooting, a request to run support bundles was made. Paradoxically, the act of running support bundles for the vendor to assist in diagnosing the problem was actually contributing to the problem by adding load to a taxed system. Independently of the joint diagnostic work, an engineer reviewing logs noticed a memory cap - instituted during the last hotpatch without notification - was throttling capacity by limiting memory caching.
Data available: ChatOps x 2; vendor support tickets; Web conf transcript
Actors (direct/indirect): 3/4
Participants: Core members x 3; Specialized experts x 0; Additional resources x 0; Management x 1; Vendor x 4
Duration: 12 hr 28 min / 748 min
First coding: Forming/Dissolving (multiple vendor support coming in/out of response requires additional coordinative demands); Mentioning (frustration expressed with delay in accessing needed perspectives); Breaking Down (efforts to work around the need for coordination with vendor support); Switching (engineers used multiple mediums to redirect vendor attention to the problem; taking advantage of co-location); Synchronizing (lag introduced by the use of the ticketing system and the length of time required to run support bundles); Crossing (intra-organizational boundaries)
Choreography of coordination: Establishing common ground (refusal to join the real time response effort created issues in establishing common ground about nominal functioning of the system); Maintaining CG (vendor suggestions for diagnostic testing would worsen the problem); Coordination breakdowns (tickets were consolidated and receipt of critical info to vendor was delayed); Repairing breakdowns (additional effort was involved in framing the communication to vendor support to underscore the nature of the issue); Taking/switching perspectives (additional effort was involved in anticipating vendor needs due to delay); Recruiting (attempts to engage vendor resources); Controlling costs – self (anticipating vendor needs allowed better synchronization; breakdown occurred); Controlling costs – others; Synchronizing (timing of support bundles; tradeoffs to maintain performance); Investing in future coordination (flagging post mortem inputs); Interactions with tools (issues with web conferencing, network, paging)

Case 5 (Q3 2019) – Is this a problem and how bad is it?
Event: Early morning reports of user issues for a European user group were intermittently causing issues with continuous integration software. The handoff got deprioritized after the US team came online to find a series of urgent CRE hotpatches. However, the problems returned later in the day and the system became unreachable for several minutes, triggering the incident. Additional resources were recruited but the intermittent nature of the problems made diagnosis challenging. After protracted efforts to fix the problem were unsuccessful, the system appeared to have regained normal operations. The incident was resolved but uncertainty remained as to the source of the problem.
Data available: ChatOps x 4; Audio
Actors (direct/indirect): 6/9
Participants: Core members x 4; Specialized experts x 2; Additional resources x 2; Management x 1; Vendor x unknown
Duration: 17 hr 12 min / 1032 min
First coding: Forming; Mentioning; Synchronizing; Switching; Crossing
Choreography of coordination: Establishing Common Ground (determining if it is in fact an incident and severity levels); Maintaining CG (changing opinions over the nature of the problem and steps to resolve); Repairing breakdowns; Taking/switching perspectives; Delegating; Recruiting; Coordination Breakdowns (repeated recruitment to the issue as people came in/out thinking the issue was resolved); Being recruited/Orienting (lightweight recruitment); Aiding in model updating (knowledge limitations were unacknowledged); Model Updating (real time updating on system functions); Controlling Costs – Self (some responders ‘left’ the incident thinking it was over); Controlling Costs – Others; Synchronizing; Investing in future coordination (note inputs for PM); Interacting with Tools (colocation for cross checking)

Chart categories:
Data available – Transcripts from ChatOps channels (multiple entries indicate data was pulled from user channels, warrooms, notification channels or internal support engineering channels, including inter- or intra-organizational channels in a shared workspace); web conference recordings; audio bridges of recorded teleconferences; post mortem debrief recordings; post mortem documents; vendor support ticketing systems; screenshots of artifacts presented during a case; screenshots of shared channels; screenshots or transcripts from direct message communications.
Participants – Core members are defined as those who typically respond (either as part of an organizational structure or an escalation policy due to dependencies or responsibility for certain functions); specialized experts are resources who do not typically respond but are recruited for their knowledge or experience; additional resources are technically skilled responders whose role in this incident is supportive – gathering information, updating stakeholders, handling logistics or otherwise assisting in the response.
Duration – The duration is inclusive of any initial mentions of an issue within user channels or automated monitoring systems; in this way, it differs from company records. It is NOT a reflection of a service outage unless otherwise stated.
First coding – Forming/Dissolving, Mentioning, Breaking down, Switching, Synchronizing, Crossing.
Choreography of coordination – Establishing Common Ground; Maintaining CG; Repairing breakdowns; Taking/switching perspectives; Delegating; Recruiting; Being recruited/Orienting; Aiding in model updating; Model Updating; Controlling Costs – Self; Controlling Costs – Others; Synchronizing; Investing in future coordination; Interacting with Tools.
Difficulties handled – Short description of some of the technical and coordinative difficulties.

Case # 1 - Hidden changes & multiple failures confound responders

This case is representative of many basic aspects of coordination choreography, including: recruitment of specialized resources; explicitly building, maintaining and repairing common ground; delegation; adaptation of coordination mechanisms; controlling costs of coordination for self and others; investing in future coordination; and coordinating with tools. There are salient examples of cross-boundary coordination, both inter- and intra-organizational.

A popular code repository (CR) offers an Enterprise version (CRE) of their Software-as-a-Service model. Two significant problems were introduced with an upgrade to the Enterprise offering, which caused major incidents immediately following the upgrade. The upgrade was performed over the weekend and on Monday morning, as the North American workday got started for the east coast, users began reporting issues with stuck jobs in the distributed continuous integration (CI) service. CRE and CI are critical functions for software engineers within the company, representing thousands of users. As this was being investigated, sporadic ‘500 errors’ began appearing for CRE users. The CRE cluster status did not report issues with the cluster and other available information did not immediately trigger recognition of a known failure pattern among the 5 responders working the problem. The team rapidly opened a support ticket and recruited highly skilled engineers who were responsible for a complementary service to assist. A conference room was secured and several other experienced responders joined the efforts, both virtually via web conference and on-site. Investigations uncovered a large increase in Consul errors in the logs as well as error log entries noting Redis was out of memory. The group attempted to restart Redis but it did not resolve the issue. After noticing the Redis max memory was capped in the configuration file, the group changed the cap to 5G, up from 1G. Testing cluster components revealed this alleviated the issues with Travis, but some pull request links were still giving 500 errors to some users. The group forced a new replica of the repository to resolve the 500 errors that occurred for users when viewing pull requests. A hotpatch - a fix applied in a production system so that it does not incur downtime or restarts - was applied, a process that took several hours. Upon completion, testing revealed the system behavior was unchanged after the hotpatch. As the incident had been underway for twelve hours, it was decided to stop further investigations, await a response from the vendor and pick up in the morning. The following day, with no response from the vendor in the ticketing system, an engineer managed to reach a support engineer through a shared Slack space.
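For readers unfamiliar with the kind of adjustment described above, the following is a minimal sketch, using the redis-py client, of checking and raising a Redis memory cap. The connection details and the 5 GB value are illustrative assumptions; this is not the responders' actual tooling or procedure.

```python
import redis

# Illustrative connection parameters; the production cluster in the case
# would have its own hosts, authentication and change-management process.
r = redis.Redis(host="localhost", port=6379)

# Inspect the current cap; a value of 0 means "no limit".
current = int(r.config_get("maxmemory")["maxmemory"])
print(f"current maxmemory: {current} bytes")

# Raise the cap (here to 5 GB, mirroring the 1G to 5G change in the case).
# CONFIG SET changes only the running instance; persisting the change also
# requires updating the configuration file, which is where the cap in this
# case had been set.
r.config_set("maxmemory", 5 * 1024 ** 3)
print(r.config_get("maxmemory"))
```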


The incident work was restarted. Some functionality was available (tagging commit status and PR status tagging) but creating CRE releases was still breaking. Several hours into the renewed effort, it was discovered that a curl request could trigger HTTP 422 errors. On the advice of the vendor, the affected servers were rebooted, to no effect. The affected servers were removed from live rotation. Noise from Consul log entries and ongoing issues with Travis masters not sending logs to logentries hampered the diagnostic process. On subsequent advice from the vendor, the hotpatch was reapplied successfully and the load balancers were returned to rotation. The HTTP 422 errors were discovered to have existed prior to the upgrade and were unrelated to the incident. It was discovered that the payload of a Travis API call was producing errors and it appeared that Travis was adding fields to the payload not needed by the API. A code change with a fix for the API call was found and the patch was applied to the master image, then deployed into production. The test patch was confirmed, a pull request was created to formally commit it to the master image, and the incident was closed.

Findings from Case 1
This case presented as three problems, though as the response proceeded it ended up being two actual problems. The problem presented in such a way that, almost immediately, additional responders with specialized expertise were recruited, and they were engaged throughout the full duration of the event.

Recruiting
Almost immediately, new resources were brought in and attempts to recruit vendor resources were made. The difficulties presenting themselves were non-routine and, very quickly, the standard efforts to gather information about the problem did not offer any promising hypothesis about the source(s) of the problem. Therefore, specialized expertise was recruited. A ticket was opened with vendor support underlining the urgency (the system was down) and the difficulties being faced. The ticketing system aims to prioritize attention by having users rank the severity of their issue. Vendor support then assesses the information provided and may adjust the severity according to their understanding of the problem.

Then, being collocated, one of the responders left the boardroom they had assembled in and walked across the open floor space to tell the two specialists their skills were needed. It is unclear what was said directly, but within seconds one of the specialists had logged into the web conference link they found in the group’s war room in the chat messaging channel. The specialists were recruited to incidents frequently enough that they had access to the sustained chat rooms and many of the tools the team used for incident response.


In this way, joining a response involves lower effort than if new tools or virtual locations need to be accessed. After having joined the web conference, the other specialist prompted them to physically co-locate to the boardroom, while the IC stated they could remain in the web conference. The responder muted their line and entered the boardroom.

Establishing common ground
Four minutes after the ticket was opened, a vendor support engineer responded in an attempt to establish common ground about the state of the system:

“Hi , Can you provide more information on what tasks you are attempting to carry out please? Do you get 500 errors in all cases or is it intermittent? Are your users able to use the appliance at all at the moment? Or is it is inaccessible? We'll keep an eye out for the support bundle. Thanks, ”

Delegation
Delegation by the IC is not an explicit giving of directions. More often than not it is querying to see who is available. There are three relevant examples of delegation in Case 1.

Ex 1:
Eng 2 (ic): Um, so let's check for the 500 errors in the, in log entries. Does anyone already have log entries open?
Eng 1: I do. Yeah. So you want me to look for 500 errors from?
Eng 2 (ic): Thank you. Yeah.
Eng 1: Uh, ok, right. On it.

Ex 2:
Eng 2 (ic): [Eng 4] and [Eng 5] are checking into narrowing the scope. [Eng 3] and [Eng 1] do you have something that you're already digging into?
Eng 1: Yeah.
Eng 2 (ic): Which gap?
Eng 1: I'm digging into the permission thing.

Ex 3:
Eng 2 (ic): Yeah. If someone is free... has free hands, can they start its import bundle please.

In these examples, again, the delegation is not explicit; it is the IC tracking the activities after responders took the initiative and began working on them.


These examples of delegation bring up interesting challenges to the idea of freelancing. Preventing freelancing and avoiding work at cross-purposes is one of the explicit goals of using an ICS. However, while there certainly can be efforts that work at cross purposes, the evidence suggests that high performing engineers more often than not take initiative without explicit coordination with other responders. It is difficult to know when these activities contributed to a smooth response, since they are not explicitly considered unless they were problematic.

Eng 2 (ic): Eng 3 and Eng 4, you guys said that you are taking into the root or trying to figure out exactly where this error is coming from. Is that what you were saying? I want to keep tabs on what you guys are going to.

Here there is an explicit need for an update on their activities. This is a low grade coordination breakdown: observability and directability dropped off because the initiative taking happened without an update to keep the IC in the loop. The IC is looking for the update not to control their activities but rather to make sense of the shifting capacities and potential bottlenecks.

Taking direction
There were very few examples where the IC contradicted the activities underway or redirected efforts. However, the following outlines a critical element of choreography – taking direction. One such example came from an exchange between two responders, one of whom had just given an update that there were no new issue reports in the user chat channel. This exchange followed:

Eng 5: Eng 4, I could monitor the #[CRE user channel name] so you can focus on the incident today.
Eng 4: (distracted) I'm good. I'm just flipping back and forth.
Eng 2 (ic): Actually, I wouldn't mind that. Uh, if you, if you have free hands to do that.
Eng 5: Yep, I can monitor that and general.
Eng 2 (ic): Thank you. That probably along with the Travis channel. Are you in the Travis channel?
Eng 5: I am. I can do that as well.

Eng 5 was an additional resource the group does not usually have. They came into the incident unsolicited, having noticed the incident start in the shared space. The tempo and tone of the incident were very different from other cases, and Eng 5 followed the group into the boardroom to lend a hand. This explicit redirect by the IC was an effort to redistribute the effort across available capacity. Eng 2 and Eng 5 had not previously worked incidents together so, while the IC was aware they had technical skills that could be useful, there was not a shared understanding of the kind of support they might be able to provide.

The offer to monitor the user channels enabled the off-loading of a task with little demand for technical skill, freeing capacity for the cognitively demanding tasks of diagnosing the incident.

A second example shows the connection between delegation and taking direction:
Eng 2 (ic): Hey [Eng 5], what are you digging into? Anything?
Eng 5: Um, I'm just trying to get rid of all these consul errors from the... I think the consul service is still running because the logs are still spitting out. I found a different service name and I'm just going to kill it.
Eng 2 (ic): What service?
Eng 5: Consul.
Eng 2 (ic): Okay. Kind of different service name from what I heard.
Eng 2 (ic): Yeah. There's one called enterprise managed console, and that one has been spitting out a gazillion errors.
Eng 2 (ic): Okay. So two separate consul services?
Eng 5: Yeah. So I'm just going to kill it so that we can filter that out of the logs.
Eng 2 (ic): Thank you

Synchronizing
In this example, the team is trying to coordinate across boundaries when one of the required tasks (running a support bundle) will take several hours.
Eng 4: While we're waiting for that, is somebody generating a support bundle in the ticket? I think they're waiting for that. [crosstalk]
Eng 1: Yes, sir. [crosstalk]
Eng 2: Yeah. It's still running. It takes a while to run. [crosstalk]
Eng 4: Okay, thank you. I didn't see a task for that. [crosstalk]
Eng 3: You have to do a full support bundle?
Eng 2: A full support bundle?
Eng 3: Well before one takes like a couple of hours.
Eng 2: Yeah. I did a full up, a full cluster support bundle.
Eng 3: Doesn't that take a couple of hours? Or did they improve the speed?
Eng 3: I recall it taking a couple hours.
Eng 2: Yeah. Takes a while.
Eng 1: While they are waiting for the support bundle we may want to let them know, say, Hey, this kind of take a couple hours. We want something on do something between that time.
Eng 4: I'll let them know that

Synchronizing across the team of responders and their user base is seen in this example where considerations are made about how their actions will impact users.

Eng 3: I think we should put it as a major outage now because it seems like you can't log into it or anything else like that.

Eng 2 (ic): I completely agree. That's part of the update. Thank you.
Eng 5: Just... to prevent our queuing problems. Um, can I propose as well that we're going to put in maintenance mode... Can we, uh... shut down Travis or else, if this is a long running incident with a long maintenance, we're going to have Travis queuing up a lot of stuff.
Eng 3: That's a good point.
Eng 6: That's fair point. We should do that.
Eng 4: Are we seeing signs Travis is not working at the moment?
Eng 5: Well is not working, so is not working.
Eng 4: Well, I mean, do we actually see that?
Eng 5: So it's known that when goes into maintenance mode for a prolonged period of time Travis goes crazy, and so I'm saying that as a precaution, if we're going to put in maintenance mode while we investigate, we should take down before does something crazy.
Eng 4: Yeah, absolutely. I agree with that.

There are a number of coordinative functions in this next example relating to synchronization, delegation and interacting with tools. The core+ group has independently resolved part of the performance issues. They’ve been working out of sequence with the inputs from the vendor but monitoring the issue (ticket) for asynchronous responses. As they are about to close out the incident for the time being they receive a request for info from the vendor. This example also shows the additional costs of working across tools. The vendor is responding in their ticketing system and this info is then copied over to the ChatOps channel where the responders are gathered so one of them can pick up the task. The IC inadvertently posts it in the wrong channel, catches themselves and corrects. Note that Eng 3 adopted the task without explicit instruction.

Eng 4: So at this point, should we just go ahead and close the incident?
Eng 2 (ic): Um, let me check the issue again.
Eng 2 (ic): Alright. We're at, uh, reading the update really quick. Uh, there's a request for information… And interesting to get the output of Redis CLI... A big key to determine what is consuming so much space in Redis. I'm going to paste that... this update in the...
Eng 3: Want me to run it?
Eng 2 (ic): Yes, please. Whoops, that's not where I meant for that to go. One second. I can find it in the support ticket. It's posted in channel now. Specifically Redis CLI, the key.


Case # 2 - A bad fault gets worse with multiple concurrent issues.

This case presents a contrast with the first case in that less formal coordination mechanisms are employed. Because of this, we are able to see many elements of choreography from a different angle. These include recruitment, synchronization, establishing and maintaining common ground, repairing common ground breakdowns, delegating, and how side channels can disrupt progress on incident response. As in the first case, cross-boundary coordination is highlighted. Also of interest is the role of high level monitoring of coordination demands that supersedes the ‘in event’ actions.

To combat performance degradations within a popular ticketing tool, a 6-hour system maintenance window - to upgrade hardware (memory & CPU) and then reboot the hosting infrastructure - was planned for Friday evening to improve user experience. The offering was managed as a semi-customized SaaS offering with the infrastructure hosted on internal Cloud services. This distributed the management of the service among the vendor, the product owner within the company and the company’s internal Cloud services.

Maintenance began as planned, with standard maintenance procedures requiring a backup and snapshot to be taken prior to maintenance activities. Attempts to take a backup failed. Upon investigation, it was found that the previous upgrade had included a new version of the backup utilities; however, there were no indications of this in the release notes or other communications with the vendor. Upon further investigation it was found that the backup system had been reporting false positives and, in fact, backups had not been occurring. The last good backup was from 23 days prior; however, since the backup process is configured to save only 5 days of backups when the job runs successfully, there were no good backups available on the backup server. It was found that the latest good backup available on the backup server was from 6 weeks prior. This backup had been used to test the restore process and had been saved in a different directory. The time pressure was mounting as the Cloud services team prepared to begin the scheduled upgrade, so the engineer attempted to install the new version of the backup utilities but recognized they would be unable to do so before the planned maintenance began. In consultation with another engineer, it was decided that taking a full disk snapshot of the VM would provide suitable safety for the maintenance to proceed, with the intention of resolving the backups issue on the following day.

In doing so, the engineer noticed that the hypervisor storage was 94% full. The snapshot completed and Cloud services conducted the maintenance. After the maintenance, the hypervisor storage was showing 95% full, so the engineer deleted some log files to create more space until more storage could be added. Following the maintenance, the engineer installed the new version of the backup utilities on the backup server and backups ran as per normal scheduling. This backup reported success, but this also turned out to be a false positive.
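The false positives described here arose because backup 'success' was inferred from the job's own reporting rather than from the artifacts it should have produced. As a hedged illustration of that distinction, the following Python sketch checks the backup artifacts directly; the paths, file pattern and thresholds are assumptions for illustration, not the configuration used by the teams in this case.

```python
import time
from pathlib import Path
from typing import List, Optional

# Illustrative values only; a real check would match the actual backup
# utility's output location, naming scheme and expected archive size.
BACKUP_DIR = Path("/var/backups/service")
MAX_AGE_HOURS = 30            # a daily job should leave something newer than this
MIN_SIZE_BYTES = 10_000_000   # a plausibly complete archive

def newest_backup(directory: Path) -> Optional[Path]:
    archives = sorted(directory.glob("*.tar.gz"),
                      key=lambda p: p.stat().st_mtime, reverse=True)
    return archives[0] if archives else None

def verify_backup() -> List[str]:
    """Check the artifact itself rather than trusting the job's exit status."""
    problems: List[str] = []
    latest = newest_backup(BACKUP_DIR)
    if latest is None:
        return ["no backup archives found"]
    age_hours = (time.time() - latest.stat().st_mtime) / 3600
    if age_hours > MAX_AGE_HOURS:
        problems.append(f"newest backup is {age_hours:.0f} hours old")
    if latest.stat().st_size < MIN_SIZE_BYTES:
        problems.append(f"newest backup is only {latest.stat().st_size} bytes")
    return problems

if __name__ == "__main__":
    issues = verify_backup()
    print("backups look current" if not issues else "; ".join(issues))
```

Even a check like this only confirms that an artifact exists and looks plausible; as the case goes on to show, the deeper safeguard is periodically exercising the restore path itself.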

Shortly after, monitoring alerted that the system was down and users began experiencing 503 errors because the hypervisor storage had been exceeded. Initial response efforts included attempts to free up some space by deleting some logs and turning off the VMs. As it was a weekend, few support personnel were available for consultation. The engineer opened a high severity ticket with internal cloud services requesting additional space be added to prevent issues for the return to work on Monday. In addition, they reached out to the vendor on Slack; this message was later deleted after it was thought the issue would be resolved with additional storage. Efforts to engage cloud computing and virtualization support were delayed by not having the appropriate level of support contract for off-hours support. The product owner was engaged and negotiated for concurrent support while the service agreement was being established. The technician began diagnostic work while the engineer made a second attempt to recruit assistance from the internal cloud provider.

A senior manager, who was notified of the original issue by the escalation policy in the alerting software, inquired whether additional help was needed and recruited reliability engineers from other services to assist. Coping with the multiple threads of the response, the responsible engineer, acting as the de facto incident manager, was approaching saturation and there were delays in bringing the new responders up to speed. Jointly, this group was able to engage the needed Cloud resources using multiple methods of contact. Meanwhile, the technician from the VM provider discovered the physical storage configuration on the host was not healthy for the workload. It appeared the second storage disk added to the system had disconnected, but the logs to verify this had previously been deleted. Believing that this storage disconnection had caused inconsistency in the virtual machine disk snapshot chain, the technician began work to clone the virtual machine disk files to attempt to force-consolidate them. During recovery, the disk snapshot files were found to have been corrupted and application backups were found to be incomplete due to multiple issues. The technician noted that the data was unrecoverable.


At this point, the work week had begun for some parts of the global user base and, unable to access the service or receive a response from their internal service team, users inquired with the vendor, who reached out to the internal support team. The engineer updated the vendor on the situation and the discussion focused on whether to bring the system up with data losses or remain down and attempt recovery. They decided to focus on recovery and engaged resources to assist efforts to recover the data. The responders notified their users of the situation and began regular updating.

The tradeoff decision involved attempting recovery by copying virtual machine snapshots in an attempt to force consolidation, a process that takes roughly 12 hours, while concurrently setting up a new server as a mitigation. When the recovery was found not to have worked, the default plan of switching over to the new server was further delayed. The team considered engaging external data recovery specialists, but the procurement process was thought to be too onerous for that to be a viable option. However, upon consultation with a major user, the decision to continue attempting data recovery was made and internal expertise from another part of the company was engaged. Following their assessment that the data was unrecoverable, the backup server was brought online and the service restored.

Findings from Case 2
If there ever was a case of multiple contributing factors, each individually insufficient but jointly sufficient to cause an outage, this is it. Several compounding problems resulted in this challenging incident. An uncommunicated change by the vendor to the backup utilities and false positives on the backups caused the lack of reliable backups over an extended period of time, and an unlikely data corruption that rendered snapshots useless exacerbated a difficult situation. There were a variety of expansions and contractions related to the uncertainty, and to the cognitive and coordinative demands, throughout the event. Several key aspects of choreography contributed to the costs borne by responders in this incident.

Establishing common ground
Extensive difficulties were noted in establishing common ground upon the initial declaration of the incident. Responders had not previously worked together, so efforts to establish common ground slowed the actual anomaly response efforts.


The incident commander was focused on diagnostic tasks and showed evidence of saturation, which meant incoming responders had trouble coming up to speed.

Mgr 1: Hi
Eng 2: Hey, how can I help?
Mgr: From “I actually do need help. I need someone from [internal cloud provider] to do an emergency maintenance on our server. Our storage filled to 100% out of no where and I’m not quite sure what to do. I have [virtual machine vendor] trying to allocate some space.” @ are there some snapshots you can remove?
Eng 2: gotcha. Do you know in which cloud account the server is located? Is this managed by [internal Cloud provider]?
Eng 1 (ic): I tried I think I made it worse deleting things… It’s managed by [internal Cloud provider] yes
Eng 2: Have you tried reaching out to [internal Cloud provider]?
Eng 1 (ic): yes no one has responded
Eng 2: can you log into the machine as root?
Eng 1 (ic): Yea thats what [virtual machine vendor] is doing now
Eng 2: Is [virtual machine vendor] = support, or ?
Eng 1 (ic): yes
Eng 2: One thing we could look at is whether there are logs or other large files that can be removed. Is [virtual machine vendor] support blocked (or out of options) or are they still investigating?
Eng 1 (ic): He’s trying to figure it out. Do you think we could go around [internal Cloud provider] and have [cloud support] provide maintenance?
Eng 2: It’ll be difficult unless you have access to the [internalcloud.com] account that [internal Cloud] support uses. Are you able to log in at [cloudprovider.com] and see the account containing the server? Also, are you sure it’s a server problem and not an application problem that resulted in the disk full condition? If you want, I can look at the filesystems with you to see where the space is being used up.
Mgr: Sure sounds like a runaway log
Eng 1 (ic): I made a ticket, reached out in the slack channels too
Eng 2: If it’s a runaway log then server maintenance won’t help

Recruitment There was a series of recruitments over the 4 days of the incident.


Figure 5.7 Recruitment of responders over time

The recruitment of Eng 1 was pivotal. They brought technical expertise, depth of experience in incident response practices and an extensive network, through which they in turn recruited additional participants. Vendor 3 brought themselves into the incident following a series of user reports made to the vendor directly. Users began reporting performance issues in the user channel but, as it was the weekend and monitoring the user channel was not a typical practice, there was no response from service owners. In addition, the responders were preoccupied with the response. Several times throughout the incident queries were made about recruiting additional responders, indicating that some engineers thought additional perspectives could be valuable, but the IC declined these offers, as indicated in the exchange below.

Mgr 1: Hi how is it going?
IC: Honestly, not the best, it looks like we might face some data loss. The guys at are checking if we have any other options now.
Mgr 1: Need any help?
IC: No sir, I am just watching them work on the system.
Mgr 1: Can the guys help?
IC: Probably not this time but if we do have issues when the vm is back up I will probably reach out to them.


Tracking others’ activities Three examples of interest were noted in this case related to the choreography of tracking others’ activities. The first is a general finding that applies to the event overall, the second is a tradeoff decision made without the full complement of responders, and the third relates to the ongoing attempts to recover data on day 3.

In general, the absence of a centralized war room in the initial stages of the incident and the subsequent use of side channels meant there were no traces of the response efforts to date. The lack of observability meant additional costs were incurred by the incident commander to keep everyone up to date. Ongoing requests to get a representation of the situation or to review prior discussions and actions had to go through the incident commander. While a subgroup began using web conferences on the second day, their use exacerbated the lack of observability for tracking others’ activities, as there was no record of the hypothesis generation, information sharing, or the discussions discarding, revising or prioritizing hypotheses with which to bring others up to speed. While incident response activity was limited to relatively few participants, there were multiple lines of reasoning underway concurrently (which is to be expected for a dynamic, ambiguous problem), but the connection between the activities was unclear. This was shown by the conclusions and recommendations of the virtual machine vendor support on Sunday being overridden by ongoing attempts to recover data.

Side channeling The multiple sidebar conferences (side channels from the main group) occurred variously through direct messages, phone calls and web conferences, and increased the costs of coordination for some participants. There were three factors related to people continually coming in and out of the incident: 1) it began over a weekend, 2) the proprietary nature of SaaS offerings meant external support teams were working the problem largely independently and 3) responders were juggling multiple priorities. The intermittent involvement, coupled with the inability to track others’ activities in a shared forum, meant incoming responders would continually prompt for updates or have to be brought up to speed at a later point in time. They were unable to look in and listen in, so anticipation was decreased and, without a central shared forum, tradeoff decisions did not include all stakeholders. As such, the benefits of diverse perspectives and their corresponding capabilities could not be fully realized.


In the second example, early Sunday morning the specialist who had attempted to consolidate the virtual machines determined that data recovery was not possible and advised restoring to the last good backup. As the core responders were discussing this conclusion, the vendor became engaged and was brought up to speed. A meeting later that afternoon reunited the virtual machine specialist with the core and vendor group, and the specialist reiterated their conclusion. However, the group decided to try another possible solution, a decision that prolonged the incident. That this conversation took place outside a shared forum, and that only the final decision was recorded in the shared Slack channel, meant the value of having multiple, diverse perspectives providing input into a tradeoff decision was lost.

Lastly, on Monday morning, after it was clear the recovery had failed, the contingency plan to spin up a new virtual machine was ready but not deployed, and further delays to the service restoration occurred as yet another virtual machine expert was brought in. As one participant recalled, “there was a lot of waiting in between figuring out how to get all the people together.”

Relatedly, when the incident commander faced a blocker they needed assistance with, they posted in the shared Slack channel but, as it was over the weekend and the other responders had been intermittently engaged (due to the side channeling), almost 2 hours went by before it was noticed. Upon returning to the incident, Eng 1 commented: “I suggest using @here in this channel for your blockers so that leaders can help if possible.”

Maintaining common ground The use of side channels made maintaining common ground difficult. Efforts to maintain common ground were seen when the initial efforts to restore the data had failed and it was clear alternatives would need to be explored. The senior manager converted the private direct messages that had been used between them, Eng 1 and the IC into a shared channel “for ease of adding more people”. Secondly, on Monday morning (almost 60 hours into the event), and again after the effort to consolidate the snapshots was unsuccessful, Eng 1 suggested that “during the lull in activity, it might be helpful to level-set on the high-level approach that’s currently underway and the high-level next steps for getting there.”


Delegation There was very little explicit delegation of technical tasks noted in this case. This is believed to be largely because the event was managed by vendor response teams working in isolation.

Taking Direction Interestingly, while there was little explicit delegation, many of the examples given in the previous elements of choreography entail others taking direction. For example, the suggestions made by Eng 1 with regards to maintaining common ground and creating observability by specifically mentioning those who could help with tasks were acted on – in effect, direction was taken.

Interacting with tools There are a number of ways in which the choreography of interacting with tools is relevant in this case, revealed in the postmortem. The first is that the handoff of product ownership from the ‘Pilot’ stage, where the service was managed by a highly proficient team of experienced reliability engineers, to the team tasked with moving it into ‘Production’ was sparse. As a result, it is unclear what factors led to the decision to set up the application as a custom configuration of virtual machines that required high proficiency to manage. This could have been managed by support engineers with less knowledge if a support contract had been put in place with the virtual machine vendor. It was not. The second is that the backup jobs were reporting successful completions even though the backups were not succeeding. When an automated process reports itself to be doing its job, additional confirmation represents what should be an unnecessary cost of using the automation. The core team had adopted a chatbot tool to support incident management practices but were only using the feature that updates users about planned maintenance outages. An unexpected interaction produced a cascading failure when (it is thought likely) the virtual machines failed to power on and the snapshots became corrupted. Last, the chatops channel used for communicating with users had less than 12% of the user base in it. In addition, less than 0.01% of the users were subscribed to an automated status page. Therefore, when the system began experiencing trouble, users struggled to come up to speed on the outage and its implications for their work. This represents additional costs of coordination for the users, as they have to actively seek out places where they can find information. Of note was that, during the outage, the number of members in the user channel on the chat platform tripled. This is a very salient example of escalating coordination demands during an exceptional event.


Monitoring coordination demands There was one example of this meta-function in this case: the senior manager who noticed, via the escalation policy, that the issue had not yet been attended to. They proactively reached out to the core response team early in the event to see if additional resources were needed.

Case # 3 - Scaling beyond imagined parameters

The difficulties handled in this case are less complex than those in the previous cases, but the choreography in this event demonstrates smooth recruitment and cross-boundary coordination, the value of investments in common ground, real time repairing of common ground and model updating, controlling costs for others, and synchronization. In addition, this case provides a contrast to cases where vendor support does not join the incident response, extending the duration of the event. A small support team of four software engineers was responsible for the service delivery of continuous integration (CI) software within a large, multinational firm. The software had been adopted several years earlier by a small group of developers as an on-premises enterprise offering. Over time, the offering had become a critical part of many development teams within the company and was running at scale. In this CI software, job logs are first saved to a database table named ‘log_parts’ and are later aggregated and moved to a table named ‘logs’ via a scheduled process. Over the several years of operations, the CI database had become extremely large. The support team noticed users reporting that their build logs were not appearing. Upon investigation, it was found that the database was reporting “number out of range” errors. The support engineer immediately recruited vendor support and the rest of the internal team, who logged into a web conference to assist with the diagnosis and repair. In the vendor’s support model, their on-call engineers were typically assigned to certain clients instead of a generic front-desk support. The two engineers who joined from the vendor were familiar with the organization and its instance of the CI software. In checking the database schema, the id field was noted as a 32-bit signed integer. The vendor checked to see if this was a bigint in later versions. The team started to scale job workers to zero. They recognized the implications this would have on dependent services and preemptively notified the engineering team responsible for them so they could shut down the services, and then communicated out to their users that they were investigating the issue.


Recognizing they might be facing data loss, the group began debating whether to create a new table or modify the current table’s schema. In order to understand the data loss involved in dropping the log_parts table, the vendor asked how many logs had been created, to understand how much data had not been aggregated and persisted in the logs table. By figuring out metrics for the min/max IDs in use in the logs and log_parts tables, they were able to help answer how far behind the aggregation was. Based on the aggregation status, the team was able to make a tradeoff decision that data migration (from the log_parts to the logs table) was not necessary. Instead, the team decided to handle log_parts migration independently for the teams that asked for it. In total, for about 4 hours the CI software build logs were not being saved to the database, and in order to prevent additional data loss the CI service was shut down for approximately 2 hours to repair the system.
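The following is a minimal sketch of the kinds of queries this diagnosis and repair involve, assuming a Postgres database and the table and column names described above (logs, log_parts, a 32-bit id); the exact statements the responders ran are not recorded in the case materials.

-- How close is the log_parts key to the 32-bit signed limit (2,147,483,647)?
SELECT max(id) FROM log_parts;

-- Min/max ids in both tables give a rough sense of how far behind the
-- scheduled aggregation (log_parts -> logs) has fallen.
SELECT min(id), max(id) FROM log_parts;
SELECT min(id), max(id) FROM logs;

-- One candidate repair discussed in the case: widen the column to a 64-bit
-- integer rather than create a new table (done only after scaling the
-- workers to zero so nothing is writing to the database).
ALTER TABLE log_parts ALTER COLUMN id TYPE bigint;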

Findings from Case 3 This is a good contrast case to Cases 1 and 5, in which the vendors were delayed in responding or refused to respond, respectively. Those delays and refusals slowed the response effort and added costs to working around the absence of needed perspectives at the point in time where they would have influenced the actions underway. In this case, the vendor joined almost immediately and was well integrated into the reconfigured response group.

Recruiting The usual pattern for this team is for the responder who is ‘first on scene’ to be the de facto incident commander. This incident began with casual user reports in the shared channel that the engineer began investigating. The metrics were not showing anything unusual, so they dug deeper into the specifics of the reported case. Forty-five minutes into this investigation they realized they were seeing something they had not encountered before. Deep into the diagnostic work, the engineer did not even trigger an incident. Instead, they leaned back in their chair and called out that they needed help to the team assembled nearby having lunch. As the team returned to their desks, one of the incoming responders, noting the engineer had immediately returned to their screen, declared themselves the incident responder and triggered the incident management software to notify users and begin compiling information about events to date. The engineer had meanwhile posted a Sev1 notice (the most critical problem level on their scale) in the messaging channel they shared with the CI tool vendor, along with a web conferencing link.


Within two minutes the post was acknowledged, and within six minutes two vendor support engineers who had worked closely with the team for several years joined the web conference.

Orienting to the problem Upon joining the call, the engineer who had been investigating gave a lightweight briefing, explaining that they had been getting some build logs but not others, that they had found a “number out of range” error, and a few details about how long they had been seeing the issues.

Establishing common ground As in other cases, the recruited engineers are specialists with deep knowledge of the system. In this case, the vendor support team had worked closely with this team for an extended period and had extensive knowledge of nominal and off-nominal performance as well. They were familiar with how the team conducted their incident response and so, with sparse detail, they were able to begin their own investigation to aid the diagnosis. The hypothesis generation was focused on known problems and potential interactions between the CI software and the company’s instance.

Maintaining common ground The response team immediately spun up a web conference when they recruited the vendor. Exchanges were rapid and richly detailed. Consider the following exchange, which took place over 9 seconds.

Eng 1: Uh, I think before we touch the table, we need to make sure that everything is out of...using the database. Do we still have workers running?
Eng 2: well it doesn't matter cause the workers don't talk to the database directly. I could scale the masters down to zero and that should ensure that nothing is touching the database. Right.
Eng 1: I want all fingers out of the pie. When we flew this change.
Eng 2: , do you agree that if I turn off all the masters, nothing can be touching the database?
VEng 1: Uh, so long as you also get out and there isn't any sort of backlog there. Um, but yes.
Eng 2: What do you mean. Get out?
VEng: Oh, sorry. Um, make sure that there are no, that is also, well, yes. If you turn off the masters, you're good. (pause) Um, that should be fine. My only question is if has things in process, it's currently trying to write to the table...write to the database, um.
Eng 2: Does talk directly to the database or does the master pull data out of [messaging broker] and write it into the database?
Eng 1: Yeah, I thought was a viewing system, I thought it was maybe Sidekiq, maybe there's like, Oh yeah, yeah.

VEng 1: Oh yeah, yeah. It would be Sidekiq, which would be on the platform. Um, yeah. So yes, if we shut down the platform that the masters, you should be good to go.
Eng 2: So team, should we scale our masters down so that we can actually do the surgery?
Eng 3(ic): Yes.
VEng 1: Yup.
Eng 4: Yeah, I think that's a good idea.
Eng 2: Alright. Doing it.

This exchange is an example of maintaining common ground about the expected outcome and also of model updating, as Eng 2 asks for clarification of how the systems interact. There is consensus that the action is the right one to take.

Delegation & Taking direction Tooling supported the functions of delegation and taking direction in this case. The team used incident management software that enabled real time task tracking. Any responder could type a task into the command line and send this to a shared task dashboard available to all participants in the chat channel. While direct delegation was made on occasion, responders instead assigned themselves tasks, also a form of taking initiative. Once complete, they would update the task dashboard. Conversely, the incident commander would notice and update the task list. There was a fluid interchange of both surfacing tasks and taking them.

Investing in future coordination Before scaling down the masters, the team proactively reached out to the owner of a large dependent service to let them know that they should shut down their workers as well. This service owner, in turn, contacted another dependent system’s service owner. In doing so, they established trust that the service team recognized the impact unplanned downtime had on users and were conscientious about minimizing disruption. A second example of investing in future coordination was an unprompted offer of help from a related service reliability team. The team responded that there were sufficient responders on the incident and that they would reach out if help became necessary.

Synchronizing There was a relatively low cost of coordination for the synchronization of tasks in this case, as all responders were working jointly in a web conference and were therefore able to stay apprised of others’ activities with relative ease. As noted in the example above, alerting dependent services of the impending impact enabled them to synchronize the powering down of their workers before the outage began.


The service owner for one of the dependent services also joined the web conference to be better able to anticipate when and how to bring their own service back up as they tracked the progress of the resolution.

Cross checking/Model updating The tools for collaboration are an important part of facilitating these exchanges. In the following exchange, the responders are on a web conference with a screenshare of Eng 4’s monitor. The group is cross checking the code before the change is deployed. Having all the responders connected in real time enables rapid updating; there is minimal delay and uncertainty is resolved quickly.

Eng 1: So in theory, we can just wake back up right now?
Eng 4: Correct.
Eng 2(ic): Let's do that.
Eng 1: All right, so...
Eng 4: Now we can test what number we get by adding a dummy row and then deleting?
Eng 1: Okay. That seems smart. We don't want it to start reusing ID numbers
Eng 4 (typing): ...One content is hello... one
Eng 1: Why is it using hello as a column name? (Pause)
Eng 4: It should have been a string, shouldn't it? Content is text. Yeah.
Eng 1: I think this query syntax is wrong. (Long pause)
Eng 4: Okay. It wants the text field, and that's a little trickier here in creating this, so I won't do that.
Eng 1: Uh, you do insert into table name values.
Eng 2(ic): Values is a, yeah... Values. It's in the wrong place. Yeah. But where do you select...? You have to put in also the columns you are inserting into it.
Eng 2(ic): If you're specifying the columns, you... it is in the correct place. So this is right.
Eng 4: So in this case, because this is a text field
Eng 1: that's really confusing. Okay. Yeah. I don't know. I don't know what's wrong. (Long Pause)
Eng 1: Oh, wait. Postgres uses single quotes for strings.
Eng 4: Ah,
VEng 1: Oh right. There you are

The value of collaborative interplay across diverse perspectives, with well-established common ground, is demonstrated here as all could keep track of the activities of others. All four participants shaped the outcome.


Interacting with the tools There are several salient examples of costs of coordination in interactions with tools in this case. The first set of examples has to do with the benefits chat messaging structure can provide for lowering the costs of coordination. Pre-established communication channels, and norms for how these were to be used, allowed users with issues and the support engineers tasked with aiding them to be virtually co-located. A service-specific chat room or channel is an enduring virtual forum for users that, in this case, provided a centralized place for the responders to engage in real time exchanges, which enabled the tempo of the response to remain high (as there was little to no lag between question and answer). In addition, pre-established channels and norms for the responders were also in place and had been configured with automated chatbots and integrations that enabled the responders to remain in context without switching across platforms. For example, integrations with the automated monitoring and alerting software gave near real-time feedback from these sources in the channels where the incident response was being managed. Finally, the ability to switch mediums (to web conferences) enabled higher fidelity conversation and screensharing.

The CI software had an ongoing issue in which users were unable to view logs when the queue for log processing filled up rapidly. The fix in these typical cases was simply to restart the Master pods, and the team had implemented monitoring and alerting on the queue for this condition. In this incident, however, monitoring showed that the queue was not full, which ended up being relevant to the way in which the tool was failing but also triggered uncertainty about whether the monitoring was reporting accurately.

Another example of cross-checking was in the update to the table schema. The group ran parts of the SQL on an empty mock-up table in production and, through this practice, issues were identified and resolved (a sketch of this kind of rehearsal follows).
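The statements below are a minimal, hypothetical illustration of that rehearsal, assuming Postgres and the log_parts/content names that appear in the transcripts; the team’s actual SQL was not captured in the case data.

-- Build an empty mock-up with the same column definitions and defaults so
-- risky statements can be rehearsed before touching the real table.
CREATE TABLE log_parts_mock (LIKE log_parts INCLUDING ALL);

-- Add a dummy row to confirm what id the sequence hands out next (note the
-- single quotes: Postgres string literals were the syntax issue the group
-- caught by cross-checking in the transcript above).
-- (Assumes the remaining columns are nullable or have defaults.)
INSERT INTO log_parts_mock (content) VALUES ('hello');
SELECT max(id) FROM log_parts_mock;

-- Remove the rehearsal table once the behavior is confirmed.
DROP TABLE log_parts_mock;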


Figure 5.8 Corrected by cross-checking

Case # 4 - When those who should know don’t

This case is reflective of the additional costs incurred with poorly supported cross-boundary coordination. It demonstrates how efforts to control costs for one party shift the costs to the other party, in the form of additional workload, workarounds and lag. The case shows challenges in establishing, maintaining and repairing common ground as a result of interacting through a ticketing system during the event. There are difficulties in synchronization due to the lag, and coordination breakdowns involving tooling.

Users at a large, global digital services company began reporting 502 errors for their code repository in their Enterprise service. Upon inspection, the support engineers noticed that load was spiking on one of the servers, but testing revealed that some of the functions being reported as unavailable by users could still be performed. In response, the vendor requested a support bundle to aid with the diagnostic work. Support bundles are scripts that gather a collection of relevant diagnostic data from various functions in the system; they are ‘bundled’ to give a wide range of key performance indicators to aid in diagnosis of the issue. The time to run a bundle can range from minutes up to several hours, and running one adds load. The engineers were reluctant to fulfil the request as it would generate additional load on an already degraded system.


Running the support bundles and diagnostics sent CRE into an unresponsive state multiple times, so the engineers began prioritizing the bundle runs to minimize load. After running them sequentially, the logs from the load balancer and diagnostic bundles from almost every node in the cluster were uploaded to the CRE support ticket. Early on, the team noticed a scheduled maintenance process (called a “hookshot” migration) had failed, and the vendor confirmed the migration may not have completed, leaving the system in an unsteady state. Adding to the trouble in narrowing down the sources of the issue, the team saw intermittent response times when calling the servers and could not find upstream drops in the logs. In addition, the status page reported that [code repository] was up. Vendor support was surprised by the volume of API traffic shown from two Internet Protocol addresses (IPs) and requested the team check and/or temporarily disable those jobs or processes to see if that might help restore overall performance. The two IPs belonged to services that were the company’s largest consumers of CRE, and the behavior being seen was normal load. The hotpatch conducted several weeks prior was thought to have contributed, and the team began tracking this. Almost 5 hours after reporting the first of the problems, the team had exhausted all candidate hypotheses and were awaiting further assistance. One of the reliability engineers was reviewing the dashboards set up to visualize the internal monitoring of the system. Internal monitoring can be calibrated to ping the system at varying intervals, and the data is represented according to the most relevant timeframe for managing system performance but can be adjusted. Searching for more diagnostic information, the engineer was looking into the behavior of the logs over time by shifting the timescales in the visualization. By extending the timeframe, the engineer noticed a substantial drop in the memcache (as shown in the screenshot below).

Figure 5.9 Dashboard time interval extended


Upon investigation, it was determined that the memory caching had been changed during the last version upgrade without notice in the release notes. Following this, the group changed the memory cap, the system recovered, and they declared the incident closed.

Findings from Case 4 This case highlights difficulties in establishing and maintaining common ground when one party is attempting to control the costs of coordination through a ticketing system and the use of support bundles. It is a contrast case to the previous case 3.

Establishing Common Ground The vendor asked the reliability engineers to run support bundles to gather diagnostic data instead of directly joining the response. This introduced substantial lag since the bundles can take up to several hours to run. The additional drag this put on the system meant the team had to run them sequentially to avoid further degradation. This process took several hours. Four hours after initially being recruited, the vendor responded:

VEng 1: Hi , Thanks for the update. I don't see the load balancer logs attached to this ticket yet, but we'll advise once we receive them. Regarding the active issue: Multiple members of our team are actively researching the situation you're experiencing. We've also reviewed graphs found at: . This specifically tell us that all the Unicorn workers on the web nodes are being used: If we zoom out on the active workers graph we can see when this behavior started right at the end of August: Our hypothesis centers around the increased utilization of workers, and that is why we're asking for a bundle from one of the web nodes:
# from any single web node
If you are concerned about generating this bundle during peak activity — Would you be able to generate this immediately after business hours today? In the interim could you share a diagnostics file and the unicorn.log file from ONE of the web nodes?
# generate and upload diagnostics output
# upload unicorn.log file
Thanks, VEng 1

Twenty minutes later, a second vendor support engineer, who had analyzed the remaining diagnostic bundles, pointed out a substantial amount of traffic on some of the IPs.

VEng 2: Hi , Thanks for reuploading. I'm analyzing the logs you provided and see a LOT of traffic from these IPs: Over 1M requests from .


Just looking at the timestamps this is a lot of traffic that is pounding the API. Could you check and/or temporarily disable these jobs or processes and see what that does to help restore overall performance? Thanks,

However, the organization is a large-scale user of the service and the traffic noted in the bundles was business as usual, so this hypothesis was politely dismissed. The team was frustrated by the amount of time that had elapsed over an issue that could have been resolved in minutes.

Eng 2(ic): We appreciate the analysis. These two services are the largest consumers of and the behavior you're seeing is BAU. We cannot disable these services. Further, the exhaustion of the active workers is the behavior since hotpatching from to which spawned this issue a week ago. Thanks,

Maintaining common ground

Even after the above exchanges, the vendor did not adjust their coordination strategies to allow for more rapid repair of common ground breakdowns. As noted previously, in an effort to keep vendor support up to date on the event’s progression, the team proactively and unprompted provided ongoing updates on their activities and on changes in system performance to maintain some degree of a common operating picture about the current state. The intention was to make clear the urgency of the problem and the difficulties the team faced.

Delegation This case is an example of the fluid interchange exhibited by the team in taking initiative to identify tasks and signaling to others who may have the capacity to take them on. As a small team, the IC actively takes on tasks as well. In this case, the IC is running the support bundles requested by the vendor, so Eng 1, who has noticed a task, looks for someone to delegate it to.

Eng 2(ic): parallel storage diag is running
Eng 1: 100% of our API calls to are failing. (adding clarification, not reporting as new issue)
Eng 1: Can someone investigate :this:?
Eng 2(ic): parallel diags on storage appear to have a huge impact
Eng 1: cancel and rerun serial
Eng 2(ic): done
Eng 3: <@Eng1> I can investigate API calls


Taking direction What is implicit in the above example of delegation is that someone is willing to be directed. Eng 3, tracking the activity stream, makes a clear statement to take on the API calls.

Repairing common ground Additional challenges were noted in maintaining common ground when the vendor support engineers were rotating in their responses to the ticketing system. At times, it is unclear whether a posting from a new responder is from someone who is up to speed on the problems, and therefore requires only a repair to an existing understanding of the issue, or from someone who needs a more deliberate and thorough grounding. An example of this is when a responder who may have been working jointly with others, but has not previously posted in the support ticket, makes a request or suggestion that seems unrelated to past discussions. Does this represent a new line of inquiry related to a recently generated hypothesis? Or does this person need to establish common ground? As shown in other cases and examples, there are different kinds of responses to establishing common ground versus repairing it.

Controlling costs of coordination for self What is striking about this case is the degree to which the vendor support team refuses to engage even as the channels being used for coordination are obviously not working. In this way, they continue to control their own costs of coordination in that they can manage the tempo of the response and are not reliant on the other team to run the event. However, this shifts the costs of coordination to the internal support team, who have to defer activity, wait for responses and invest additional effort to run the support bundles.

This is also an indication of a reciprocal pattern happening with the support team’s own users. Over the course of fifteen minutes multiple user reports came in:

User 1: Hi, I know folks are working on this, but any ETA on some level of stability? We have a provisioning outage in one of our regions for an Cloud Production service and need in order to run our deployment pipeline.
User 2: getting `502 Bad Gateway`
User 3: Hello Team, I am trying to access but having some problems. Could someone please help?
User 4: Rip
User 5: riphub
User 6: We’re also blocked from our CD deployment now.
User 7: Same here, you’re not alone
User 8: Poor octocat
User 9: Is there a place to raise a ticket or do we just wait for a response on the channel?

User 10: Getting a 502 error code
User 11: Same here showing partial outage
User 12: Any update on the performance issue? Last update was 2 hours ago and seems things have not improved. So wanted to make sure this is still being investigated. Trying to deploy builds to production but fighting `socket closed` timeouts.

The users attempt to get information about the state of the outage to manage their own deferred work. The team, preoccupied with managing the degraded performance, were not monitoring their own user channel. This shows a nested series of relationships: just as the responders are using multiple mediums to try and capture the attention of the vendor support engineers, so too are the users trying to capture the attention of the internal SRE support team. Also of note is that while user inputs are a needed source of information about how the system is performing, the cost of coordination is high when frustrated chatter amongst users about the performance issues adds pressure to the response. In response to the negative backlash in the channel, a responder posted a rebuke and redirected people to the ticket for status updates:

Eng 4: (it’s our channel so no haters) We have multiple people asking about the outage and raising tickets. Our SREs are hard at work trying to resolve the performance issues. Please see for status updates. Thank you.

Following this, the team decided to drop their monitoring of the channel to avoid distractions.

Controlling costs for others Partway through the incident, the team determined that putting the system into maintenance mode would enable them to work more quickly than coping with the continued degraded performance. Before doing so, they acknowledged the impact this would have on others and prepared to notify them so they could adapt accordingly.

Eng 3: “Is everyone aligned that we will put the system in maintenance mode after the support bundles? Because that’s what I will update the users and stakeholders with to prepare them for that.”

Interacting with tools To cope with the long-running incident, the responders took turns monitoring the system while the others commuted home, with the intent of jointly continuing the response efforts once everyone had eaten and had a chance to check in at home.

Having lost the affordances provided by colocation, they switched away from chat messaging when they needed a higher bandwidth medium to work through the overnight plan and the communication to users.

Eng 2(ic): “agreed. i'll jump in a in just a sec and we'll plan the announcement and the next few steps.”

During the incident, it was also discovered that one of their monitoring systems had not been reporting all of its metrics for almost 4 weeks.

Investing in future coordination In this event, the response team has additional capacity but chooses not to use it. In effect, they throttle the amount of resources dedicated to this incident to both preserve capacity and to maintain a focus on the chronic demands of maintaining the system. In doing so, they work towards reducing technical debt while retaining the ability to bring in fresh perspective and additional help for future problems and workload.

Case # 5 - Is this a problem and how bad is it?

This case occurs amidst a challenging period of performance problems with the continuous integration software. It does not involve vendor resources, but the recruitment of specialized expertise shows inter-organizational coordination challenges in establishing, maintaining and repairing common ground (despite the parties having worked together repeatedly), as well as coordination breakdowns. Elements of choreography such as taking and switching perspectives, delegation, orienting to problems, model updating, controlling costs for self and others, and investing in future coordination are shown. In addition, responders took advantage of colocation, which points to limitations in their interactions with tools.

In the early morning hours, the automated monitoring system issued non-urgent alerts for an issue with continuous integration (CI), an unstable code repository app webhook and slow pushes to the code repository. The issue with the CI software resolved on its own and the support engineer on call (a European member of the reliability team responsible for the CI tool and code repository) acknowledged the alerts.


The reliability team in this company uses a ‘follow the sun’ strategy for user support, with geographically distributed engineering teams providing 24/7 coverage. Shortly after, users began reporting issues with slow builds for the continuous integration service. The European engineer investigated and noted the problems were intermittent, but the services looked ok so they did not declare an incident. They shared their uncertainty by making a note in the shared reliability channel and asked for feedback on whether they needed to take any action, but received no responses. The problems seemingly resolved and were deemed to be normal system variation, so no hand-off was made when the North American support engineers came online. As the US team arrived at their desks they found three urgent hotpatches had been released for their code repository and they immediately began work on that. However, several hours into the US workday, the system became unreachable. This triggered the declaration of an incident that brought 3 members of the core team into the response efforts. As part of their routine practice of monitoring chatter in the user channels, the responders noticed the earlier reports of user issues. Engineers from a support team responsible for a dependent service were recruited to the incident to bring in fresh perspective. Several responders noted they were unable to reach the UI, but the master logs confirmed the masters were up and did not indicate any of the problems seen. Several minutes later the UI came back, but none of the monitoring had triggered, meaning the jobs weren’t disrupted. Unnerved, they began generating possible hypotheses. One responder suggested a recent deploy may have been the cause, with the new nodes not coming up soon enough before the older ones were shut down, as the merged pull request would have kicked off a new deploy. Another responder suggested a network issue may have been to blame and pointed out the problem was also seen with the master-build freezing during the software package update. However, the team did not have their own monitoring of the network’s performance and the network provider’s status page showed no service interruptions, so they were unable to pursue that possibility. With no further degradation noted, the group disbanded with inconclusive findings twenty-three minutes after the service became unreachable.

About an hour later, the European engineer re-posted about the early morning issues, saying “I know [Eng 2] was looking earlier, but widespread users continue to report no output in 10 mins issues for [continuous integration software]. I mention it because it's been going on all since well before US hours - and I couldn't see a cause.” The attempt to redirect the team’s attention back to this problem was acknowledged, with the on-call support engineer indicating they would look into it later.


Thirty-eight minutes later a flood of users began reporting stuck builds again, and a check revealed this included the incident response team’s own master-build. The engineer on support went back and reviewed the issues from the morning and discovered the system had apparently stalled right after the CI job worker information was printed and wrapped up, and none of the builds had run. One of the specialists reminded the incident commander that the master builds had been getting hung up earlier. Adding to the uncertainty, another user reported seeing sporadic stalling in different parts of the build. Looking at log files and the cluster status, nothing seemed unusual. The specialists had returned to their own work but, when prompted for suggestions for further useful diagnostic activity, one of them offered that “the item which I think will be best is to get more precise on what is happening when the build times out. Also to get the nodes, to see any patterns. It may just be select nodes.” This was seen by the engineer who made the comment as a suggestion the core group could follow up on, as they had disengaged from the joint activity. But the requestor, who was operating on the assumption that the level of engagement given earlier was still in play, was taken aback by the suggestion. Struggling to enact the suggestion, the requestor then made an explicit ask for the specialists to actively rejoin the incident. Upon doing so, the group checked all possible clusters, comms, hosts and worker nodes; all showed no issues and deploys seemed normal. Confused, they decided to attempt to narrow in on the specific points of failure showing up in the user reports (was it the same command? does it always fail while doing an “apt-get update” or curl to host?) in the hopes that information would enable them to reproduce it by repeating the command until they found out where it failed. Attempts to replicate produced a promising lead that switching to an identical server (known as a “mirror”) might resolve the issue. To do so would introduce some additional risk to the system. They resumed watching the jobs and contemplating next steps. After a day of tracking inconsistent problems and several hours with very little new information, the group was wearing down. The incident commander posted a status update to realign the group:

*things we've tried*:
- running local apt updates; this was fine
- running tons of build steps with apt update; half the builds were fine, the other half took upwards of 4-5 minutes
- comms to [code repo] from cluster node; :pass:
- comms to [enterprise universal repository manager] from cluster nodes; :pass:
- apt updates on nodes; :pass:
- noted no DNS issues during any of the above

*working on*:
- edit the apt mirror and use a new mirror in the matrix to test faster performance

The group agreed that the only way to test whether the mirror change would work was to modify the setting in production and see what happened. A trial in a test environment didn’t break anything, so they moved forward with the change. But they again got slow test speeds, and the master build failed with the new mirror as well. In turn, this generated a new possible fix: running the processes on each of the servers, one at a time, and recording the speed observed on each. After users restarted their builds there were fewer impacted users, but the problem had still not resolved. The group began debating whether they had sufficient evidence pointing to upstream network issues to approach the network provider and request involvement. At that point, someone asked whether they still had a problem and, in stepping back to review the logs, the problem appeared to have cleared up. Without a clear understanding of whether their change, or some other change independent of their actions, had resolved the issue, the incident was declared over.

Findings from Case 5 This case exemplified many themes related to costs of coordination.

Establishing Common Ground Initially, the problems noticed in the early morning hours presented themselves through non-urgent alerts that automated monitoring had triggered and that the alerting software had pushed as a notification. In an effort to establish common ground with the rest of their team (who would be coming online soon as the US workday started), the responder on support left a statement for the incoming responders. This was not quite a question but indicated they lacked the knowledge to determine significance: “Morning, jobs and services appear ok. I'm going to acknowledge the two alerts above but not resolve. I'd like to know if any action is expected when we see them.” The non-urgent status, coupled with the uncertainty around their meaning, meant it was unclear whether it was worth declaring an incident. Shortly after, the uncertainty grew as user reports began coming in. The engineer solicited more input from the users in the shared user channel, then added further information for context back in the channel shared with the other reliability engineers: “I'm getting strange, multiple reports of [continuous integration software] workload not flowing.”


Twenty minutes later, they added the results of their user solicitation: “The [continuous integration software] users are seeing jobs self cancelling after 10 mins with no output. The jobs have previously run normally. It's not a symptom I've seen before - it's got 13 agreeing they are seeing the same thing.”

Maintaining Common Ground After investigating the usual sources of relevant diagnostic data and finding them lacking, the core group recruited two specialist engineers. The recruitment was to bring additional perspective and deep knowledge of how the continuous integration software operated. In doing so, it is apparent there will be gaps in shared knowledge, beliefs and understanding about how the system operates. Core group responders recognize this and control the cost of coordination for the specialists by limiting their requests for model updating (essentially, they asked fewer questions about why something was being suggested). However, some degree of common ground needs to be maintained, as a core competency in this team is being able to anticipate others’ actions and smoothly take on tasks to support those actions. Due to the intermittent and ambiguous nature of the problem, multiple hypotheses were generated, some of which were unclear to the other responders. There were efforts to maintain common ground when the gap widened. In this example, late in the event, the group has eliminated all other hypotheses and wants to approach the network provider. The specialist (SpE 1) explains why that may not be sufficient.

Eng 1: Okay, so we have only one hypothesis that can support the current data, yes? It's probably something in the network
Eng 2: correct
Eng 1: Can we go to them and say: - our clusters can't reach this reliably - we can from other geos - we've tried different endpoint and gotten same behavior Is that sufficient
SpE 1: I think this stuff is a little more complicated uses a lot of trickery in the background to scale. It uses dns to give out different servers.
Eng 1: Ooooh Multicast

Repairing breakdowns In a related segment, the second specialist notices a discrepancy in the above exchange and seeks to repair the common ground breakdown.


SpE 2: > we can from other geos What does that mean? Did we actually try it in different geos in sl?
Eng 1: I meant that we can from Which is not the same place as our DC
SpE 1: it also has a lot of different nodes. Just because it works on the laptop may not mean there is not a problem on the nodes with apt mirror

Aiding others in model updating As mentioned before, the effort (and additional cost) inherent in aiding another in their model updating is closely related to maintaining common ground and repairing its breakdowns. Continuing the conversation from above, Eng 1 does not understand and seeks clarification.

Eng 1: Can you elaborate on that last point? [Edited]
SpE 1: What is happening is the nodes does a dns lookup to the dns server. They don't give a constant response. On the laptop it may be a different address then the server is using.
Eng 1: Okay, good point.
SpE 1: most services they only give out one, but packages mirrors do not The problem may be time dependent as well. They are constant moving parts changes some of them are not under our control

At this point in the response, where maintaining common ground is not necessary for the continued joint action, SpE 1 is incurring an additional cost to helping Eng 1 with their model updating. This is a form of investment in future coordination because SpE 1 recognizes that helping Eng 1 develop more knowledge will ultimately make future coordination easier.

Model Updating The next excerpt is a compact, rapid interchange that takes place over 4 minutes 5 seconds. The longest pause is 44 seconds, but the average is just 16.3 seconds between turns. The rapid interchange is a salient example of multiple cognitive functions for coordination in anomaly response. It is shown with codes related to the cognitive work to highlight the rapid and information-dense exchange as hypotheses are generated (HYP), queries are made related to model updating (MU), statements aid others in their model updating (AOMU), probes are made for new information to maintain common ground (MCG) and common ground breakdowns are repaired (RCG).

Eng 2: It may have something to do with the deploy. Maybe the nodes did not come up soon enough before the old ones were shutdown. We may want to investigate in a readiness probe if that is the case


Eng 3: was there a deploy in the last 10 minutes?
Eng 1: Did you just do a deploy?
Eng 2: whenever a PR is merged a deploy is kicked off
Eng 1: Sure, but why did it update the k8s resources when the k8s files weren't changed?
Eng 2: It deploy everything. There is no logic to make sure to only deploy when k8s items are changed
Eng 1: I thought (for some reason) that `kubectl apply` was a noop when the yaml was the same
Eng 2: it is but we rebuild the image each time, that will cause a new tag -> new deploy
Eng 4: The master containers have been running for 31 hours which means no new containers were made.
Eng 1: Good point :this: The workers are also 20 hours old
Eng 3: So what are the youngest items in the cluster atm? ingress?
Eng 1: Maybe ingresses or [messaging broker] ?
Eng 2: The build is still building:
Eng 1: Good thing [continuous integration software] is up so that we can watch it
Eng 5: Was it corroborated by users - ie could it have been local to [area] or the US network?
Eng 3: other users saw the outage too
Eng 1: Users reported it. Could have been users in [local area] though
Eng 3: One was [local area] One was [international area] though That's not even close
Eng 3: Jobs weren't interrupted, just access to the UI
Eng 5: Bear in mind that [network provider] has periodic glitches of network traffic and will report them at least 6 hours after an event occurred. It could potentially be caused by them.
Eng 3: Certainly a possibility
Eng 2: What I think is happening there is a [network provider] network issue. This can also been seen with the master build freezing during the apt-get update

Taking/switching perspectives Perspective taking is shown in this case as the responders continually decenter from their own needs to take the position of the network provider and determine whether that provider would consider the information available sufficient to warrant investigating. This ongoing cycle of determining whether they have met a sufficient burden of proof incurs additional costs, both in imagining the stance of the other party and in the additional effort that goes into gathering further proof.


Delegating This case again shows patterns of indirect delegation consistent with the other cases, where an absence of authoritative ‘directing’ has been noted. What is exceptional to note is that in the nineteen hours of transcript reviewed there was only one direct delegation of a task. Eng 2 (ic): are you available to chase the DNS thought while we are running other tests?

Taking Initiative Instead, taking initiative was more pronounced, perhaps because in recruiting the specialist engineers, the core group of responders (including the IC) were adjusting around the specialists’ actions. The specialists took initiative because the IC directing them would not make sense when the specialists had greater knowledge of the appropriate actions.

Recruiting Recruitment in this case is both of other responders and of the users, who provide critical diagnostic data about the specific details of their problems. Recruitment of users is also cross-boundary activity.

Other responders In this case, attempts were made to recruit additional specialized resources by pinging them in channel with a question related to the service but without explicitly asking them to join the incident. Eng 3 (ic): “I need more help on this. This isn't normal behavior.”

When that fails to engage them, the IC @mentions them, signaling their attention is needed directly. Eng 3 (ic): “@Eng1, @Eng2 i need one of you to help us dig, please :batman: we're in webex”

In part, the specialist engineers taking initiative (as discussed above) is a function of how they were recruited to the incident itself. They are brought in “to dig” not explicitly to take on a particular task. The implication here is that they will take appropriate action relative to their knowledge, which they do.


Active recruiting of users Repeated user recruitment was seen throughout this event. Of particular note is that when monitoring is not in place, or when the status pages from the dependent services are not showing any outages, the reports from users become essential for pinpointing how the problem is presenting. The following are two messages posted in the shared user chat channels in an attempt to consolidate the multiple user reports. “Hey folks, if you're getting timed out, please respond in thread of this issue with the build host that your build was running on. You'll find that information by expanding the `Worker information` in the build log and posting *in thread* the `hostname`. Please also link the build log *if your repo is public*. If you have opened an issue already, we're pulling information from those issues as well.”

“Everyone with stalled builds, please restart these builds and let me know if it's better :pass: or same :fail: (react to this post with :pass: or :fail:)”

Note that the responder asks the users to respond in a thread. The packed message list design of the messaging system can fragment this information across time making it harder to track the full scope of the problem. Putting it in a thread keeps it consolidated. In the second example, they take advantage of the affordances in the chat messaging platform (emoji) for a lightweight, low-cost way of communicating for both the user and the responder.

Recruiting vendor (network) Recruiting the internal network provider across organizational boundaries is necessary, but it is costly. The responders all share the established common ground that the network provider will refuse to actively join an incident without definitive proof that the issue is theirs or that the issue is not with other parts of the system.

Eng 2: we can try to open an issue with [network provider] and get them to try and look around
Eng 1(ic): who'd like to take that?
Eng 3: I don't think we have enough to go to yet. They won't take this
Eng 1 (ic): yea, that's what i saying to the group "we only see network problems sometimes in our application, but not from the nodes they run on" they won't take that

Also of interest here is that solicitation as a form of delegation (“who’d like to take that?”) is contingent on the other participants in the joint activity actively taking initiative and tracking the activities of others.

This technique enables flexibility but can break down during periods of workload saturation, as responders with tasks underway do not take on additional work, so the incident lead must keep track of the unassigned tasks and the sequencing of when other activity will be finished. This is why some degree of orchestration is needed: the choreographer needs to keep a holistic view of the activities underway in mind and adjust others’ actions accordingly when they notice a sequencing or prioritization problem.

Controlling Costs – Self When the intermittent activity seemingly spontaneously resolved, the specialized responders ‘left’ the incident to return to their own work, thinking it was over. When drawn back into the response, they did not explicitly rejoin; instead they gave directions for someone else to follow to avoid another disruption to their own tasks.

Controlling Costs – Others As noted in the previous example of difficulties in recruitment, when the recruited specialists did not rejoin, the recruiter initially tried to work with the changed pattern of coordination. This changed pattern was that the specialized responders give instructions and the core group tries to follow them; when there were difficulties in doing so, the recruiter explicitly asked the specialists to rejoin as full participants, taking on tasks and engaging in the group discussion.

Investing in future coordination After discovering that they did not have a quick and easy way to count jobs at a specific time interval, the IC makes a note in the chat transcript for follow-up in the post-incident activities: “is it possible to log/report/graph the count of jobs that are kill at the 10m timeout?” This is a form of lowering future workload by making it easier to track needed repairs or updates while in the incident.


Interacting with tools
Earlier findings showed difficulties establishing common ground between European and US counterparts about whether the incident was, in fact, an incident. That example showed how one engineer enacted two very common practices in ChatOps. The first is simply making a statement in a channel; however, the signal-to-noise ratio is often low in high-volume, large-scale chat messaging systems.

Most messaging platforms use a form of signaling. In the Slack example shown in Figure 5.10, there are five indicators: 1) a small red button appears with the number of messages; 2) an aggregate list becomes bolded when there are unread messages; 3) any form of activity in a channel bolds the channel name; 4) when there are specific mentions of your name, a group you belong to, or a broadcast function (such as notify everyone online or everyone in the channel), a small red button appears with the number of messages directed to you; and 5) for workspaces with many channels, or those separated into distinct organizations (cross-workspace), an additional banner at the bottom of the visible screen indicates unread messages.

Figure 5.10 Tooling

Another finding of difficulties in interacting with tools from this case comes mid-event, when a responder posts ":this: realized that I never ack'd" and moments later ":thinkx1000: I didn't get paged at all for this inc". Another responder chimes in "what said", confirming they were not paged by the alerting tool either.

Updating
As mentioned, this case represents a full day of difficulties handled by the responders. The entire response team of the core group and specialist engineers is co-located at the same site. A manager, who had periodically been checking in (both by looking and listening in and by physically coming by the team's area), had been offline and unavailable for several hours. Upon returning, the overhead involved in scrolling back through the chat logs is substantial, so he asks the group for a face-to-face update during a lull in activity.

Figure 5.11 Maintaining common ground through updating

Coordination Breakdown
There were three coordination breakdowns of note in this case. In the first example, the engineer responsible for coverage during European business hours maintains their usual practice of making a general comment about the state of the services around the time the US engineering team will be coming online. This time, however, they include a second comment intended to direct attention to alerts they were uncertain about - an implicit question that needs addressing. The problems continued and the European engineer used an @mention, which is designed to flag a message for an individual or group by prioritizing it in their message list. While they did post their updates to the shared channel, an urgent hotpatch took precedence when the US team came online and the attempt to establish common ground on whether the problem was an incident got lost.


The second example of a coordination breakdown is when the incident commander assumed the two specialists, whom they had recruited to the earlier incident, were still engaged. However, those responders had returned to their own work and were offering ideas but not actively re-joining the incident. There is no signaling to indicate withdrawal from the joint activity and, given the nature of virtual incident response, it is not immediately evident they have withdrawn. The IC has to explicitly ask again for their full participation.

This exchange also offers insight into the way in which model updating is an integrated part of incident response. When Eng 2 suggests a course of action that Eng 1 doesn't know how to perform, they walk through the steps. Eng 1 realizes the knowledge about the system inherent in their mental model is insufficient to carry the line of inquiry further. They give up attempts to update and instead explicitly recruit the other responders.

Eng 1(ic): i can't think of anything else to investigate or levers to pull to fix the problem <@mention sre team> anyone else have ideas?
Eng 2: The item which I think will be best is to get more precise on what is happening when the build timeout. Also to get the nodes, to see any patterns. It may just be select nodes.
Eng 5: <@mention sre team> I am finally back - anything I can do to assist?
Eng 1(ic): [posting to user channel] Most of the sporadic build timeout issues appear to be network operations with artifactory. I saw a couple with `apt` operations and restarting the builds have resulted in successful builds. We have yet to find an issue with networking, but continue to look around. If you consistently hit a build timeout issue, please open an issue with your and the build logs attached:
Eng 1(ic): Okay, is there a way that I can match a build to a node in the cluster?
Eng 2: if you expand worker information
Eng 2: There is a field called hostname . The end contains the node name
Eng 2:
Eng 1(ic):
Eng 1(ic): another issue opened by a user:
Eng 1(ic): I need more help on this. This isn't normal behavior
Eng 3:
Eng 1(ic): @Eng2 @Eng4 i need one of you to help us dig, please :emoji: we're in webex


A third example of coordination breakdown comes from the coordination between the response team and one of their third-party dependencies - in this case, the internal network team described in the recruiting vendor (network) example above. There are many indications of a long-running coordination breakdown with this provider. Early in the incident, when the network is proposed as a potential source of the issues, one responder comments "don't see anything posted in . to @Eng3's point, they aren't usually quick about doing so." This followed a discussion where Eng3 warned against using the status page as proof that the problem was not with the network, recounting that in the past they had only found out about network issues hours later. Past engagements have shown that the network team will refuse to engage unless a substantial threshold of proof that they are responsible for the problem has been met. This is a form of controlling costs that will be addressed in the next section detailing cross-case findings.

Themes emerging from across the cases

In the previous section, findings were presented highlighting discrete key elements of the choreography that enables smooth coordination. As the data in each of the separate cases shows, enacting these coordinative functions requires effort in addition to the cognitive work of anomaly response. When presenting a hypothesis, for example, a responder calibrates the information presented relative to the amount of common ground between the participants of the joint activity. This adds to previously identified elements of choreography such as those surrounding common ground, delegating, updating, synchronizing, tracking others' activities and taking direction.

Findings in the analysis across cases revealed new elements of the choreography while also confirming and deepening our understanding of prior patterns. A selection of cross-case findings is presented here to connect specific examples of the overhead costs incurred in coordination across the observations from the corpus. The full table of Elements of Choreography is in Appendix A.


Investments in establishing common ground
The evidence of an investment in establishing common ground (CG) during an incident was found in explicit statements implying that the speaker recognized there may have been differences in knowledge, beliefs or assumptions about the information being transferred and sought to define a baseline. Cases 2, 3, 4 & 5 showed clear evidence that investments had been made in establishing common ground. Consider the statement in Case 4 where a responder looking at load times for a node posts "long-term load is down to 17 (very low)". The qualifier "(very low)" is a lightweight example of an investment in CG with other responders who may not know whether 17 is high or low. "The master containers have been running for 31 hours which means no new containers were made." This statement from Case 5 makes explicit the knowledge that a new container would not have 31 hours of run time. In Case 2, an incoming responder asks "do you know in which cloud account the server is located? Is this managed by ?" to come up to speed on the way the system is configured and also to probe the knowledge of the existing responders. In Case 3, as a group of responders finishes diagnosing an issue, a vendor support engineer states their assumption about the goals of restoring service, establishing common ground about the priorities of the next phase of work: "We just need to make sure that the log parts that has not yet been aggregated, get aggregated, which to me seems like we should also, and that they're visible at some point, which seems like a data migration. At some point, I think that getting bills running, even if people can't say like see their logs from five minutes ago, that seems fine… Like it's better to be able to run future builds and get hung up on the last, on the migration."

It is also possible to infer that there are ongoing investments that add to common ground over a longer period of time. Statements that reveal knowledge and beliefs about the different aspects of the incident – the other responders, the system behavior, the other teams or units coordinated with or the organization as a whole – represent previous effort that went into learning or forming those beliefs. For example, in Case 3 the vendor support engineer suggests scaling workers to zero to stop build log generation, but still allow users to view old logs. In making this suggestion, the vendor demonstrates knowledge of how the system functions and the belief that the tradeoff is reasonable for their users. This understanding of the priorities of the situation is indicative of a past investment in establishing common ground. Similarly, learning about who other responders know (or have in their network) was shown to be important when the existing group of responders needed to recruit a skillset but was not aware of someone who held it (as in Cases 2 & 4). In contrast, also in Case 2, a specialized internal resource was able to be recruited into a long-running incident because a contact in the existing network was known to have connections with their department. Evidence of past investments in establishing common ground also relates to an organization itself. Cases 1, 2, 4 & 5 showed where extensive preparations and cross-checking were involved in communicating across organizational boundaries to third-party support teams through their ticketing systems. These internal and vendor support teams were known to be highly reluctant to join web conferences to engage in real-time collaborative diagnostic efforts. So the incident responders, knowing they needed their insight, adapted by initiating the work to bring others up to speed earlier, and allocated additional attention and care to crafting messages that would elicit a response, based on their knowledge of the kinds of information needed to convey the urgency of the situation. Lastly, and perhaps most obviously, investments in establishing and maintaining common ground about the system itself have direct impacts on the ability to carry out the functions of anomaly response. Cases one, three and five show how shared CG can be beneficial, and cases two and four show how the costs of coordination rise when there is limited shared knowledge of system performance. Recordings from the post-mortem meetings in cases one and four reveal that a substantial portion of the debriefing is dedicated to updating knowledge of how the system behaves. This form of model updating happened in a number of ways - through direct query and response about sources of failure and the nature of cascades noted in an incident, by correcting a statement made by another responder pertaining to system function, by specific explanations about the internal workings of a microservice, or by recognizing the implications of configuration changes or deferred maintenance. A second specific instance of how establishing knowledge of the system was an important element of coordination comes from the contrast provided when the investment is not made. The example in case four was the suggestion from the vendor that the volume on a specific application appeared to be part of the problem even though it showed business-as-usual volume.

Tracking others' activities
An intrinsic part of the case findings is keeping track of the different individual efforts underway that make up the joint activity. Effort to maintain a current understanding of what others are doing is necessary for all responders, but most explicitly for the incident commander. The evidence of this is in Case 3, where the incident responder tracks the activities in a chatbot accessible to all responders. In Case 4 the responders have a normative pattern of explicitly stating what they are doing and when that task finishes, such as the selection below, taken over an 8-minute period. This shorthand allows others to track their activities and identify unassigned tasks.
"Rolling restart underway"
"Waiting for the caches to warm"

"On the last job reboot” “drafting diag cmds” “Job nodes done rebooting” “Drafting ticket to ” Recruiting resources Preparatory studies showed the models and the tooling used to recruit resources can be instrumental in both increasing and decreasing the costs of recruitment (and the subsequent integration of the incoming resources to the effort). Recruitment strategies were myriad and largely successful. Issues were raised in preparatory study 2 about the effort involved in maintaining systems for recruitment and revealed that systems of recruitment can become stale without substantial effort to maintain currency for those who are recruitable and how to reach them. This was noted to a lesser degree in case 1 and case 5 findings when on call alerting systems revealed gaps in the scheduling had resulted in responders not being paged. Multiple roles and signaling devices were used to alert others their help was needed. In Case 1, the core group of responders quickly recognized the boundaries of their collective knowledge was insufficient for the problems face and brought in specialized expertise from another team in early stages. These resources were co-located and a member of the core group walked over to their workspace to recruit them which they quickly joined online before physically relocating to the meeting room where the others had gathered. Concurrently, they began recruiting the vendor, even before the boundaries of reformed group were reached as access to additional proprietary information available to the vendor was needed. The scaffolding of the established ticketing system was used to contact them. In Case 2, there were multiple layers to the recruitment, beginning first with a second opinion then to the Mgr 1 recruiting Eng 1 and the subsequent recruitment of vendor support and other specialized responders throughout the remainder of the incident. It was noted that the act of recruiting is more nuanced than simply notifying someone. As was shown in case one, two, four and five, vendor recruitments caused delays which meant the task activity changed - generating workarounds, creating unproductive lulls while waiting or shifting the timing of certain tasks as was shown in the proactive upload of service bundles in case one and five. Recruitment into virtual collaboration spaces also generated the need to bring responders into the tooling and cope with difficulties hearing, seeing or otherwise connecting with others. This was seen in cases two, four and five (as well as multiple times in the original set of cases reviewed) when explicit communication about how to join web conferences, difficulties in hearing and loss of internet access disrupted coordination efforts.


Delegating
Perhaps the most surprising findings in the cross-case analysis relate to the effort required for effective delegation (for example, probing to get a current assessment of responder availability to take on the task being delegated) and to the minimal presence of explicit delegation. In cases one, two and four there were examples of 'preparation' for delegation, whereby the incident commander ensures they have current knowledge of the status of the degradation and asks directly for updates on responder activities before delegating. In case one there is an explicit redirect of a responder to enable shared delegation of the workload. In case two, delegation was very subtle and offered as suggestions. This can also be seen in cases one, four and five, where the assignment of work is made more through suggestions and probes, putting the onus for taking on additional work on the responders. The following are compact examples of delegating from the cases. The incident commanders were found to use both direct requests and more general open queries to the group.
IC: Hey [Eng 2], Would you mind running the cluster reach to get a... To see if Consul is actually running?

In this example the IC knew Eng 2 was in the Consul admin account, and having them run it lowered switching costs and minimized the potential of working at cross purposes. There is no explicit direction given on the timing of the task, implying that Eng 2 should manage their workload accordingly.

IC: [Eng 3] and [Eng 1] do you have something that you're already digging into?

In this case the IC is working jointly on the response in addition to IC duties. They keep a high-level trace of the responders' activities.

IC: “So I'm, uh, does someone know the best way to restart Redis and if so, can you take that?”

Delegation of tasks can also be indirect, addressed to the group; this was common in incident response teams with significant common ground and high levels of initiative-taking.

Taking Direction
Implicit in delegating tasks is the reciprocal function of taking direction. At a distance, taking direction appears to be a passive process of waiting for instructions. However, incident responders are, by nature, proactive. Joining an incident in progress requires coming up to speed and orienting yourself to the problem (both active processes). Upon orienting to the problem, many responders immediately took the initiative to search for diagnostic information; therefore, upon being delegated a task, they incurred a cost of re-prioritizing the activity they had just begun in order to accept the new task. Another cost of coordination in taking direction is signaling your availability to take on workload. Once a task has been delegated, the engineer must assess the task relative to their skills and abilities as well as clarify task assignments or timelines for completion.

Synchronizing tasks
Challenges were noted when interacting across boundaries using ticketing systems, which created lag and a lack of observability that impeded anticipation and resulted in workarounds. Also inherent in the choreography related to synchronization is the cost of delay while waiting on responses or on participants to finish their tasks, as seen repeatedly in cases one, two and five. A clear example of this from case two is when a manager has stepped in to aid the incident responders by escalating the request for help to an unresponsive third-party support team.
Mgr 1: While we're waiting for that is somebody's generating a support bundle in the ticket? I think they're waiting for that. [crosstalk]
Eng 2: Yes, sir. [crosstalk]
Eng 3: Yeah. It's still running. It takes a while for run. [crosstalk]
Mgr 1: Okay, thank you. I didn't see a task for that. [crosstalk]
Eng 4: You have to do a full support bundle?
Eng 3: A full support bundle?
Eng 4: Well before one takes like a couple of hours.
Eng 3: Yeah. I did a full up, a full cluster support bundle
Eng 2: Doesn't that take a couple of hours? Or did they improve the speed?
Eng 5: I recall it taking a couple hours.
Eng 3: Yeah. Takes a while
Eng 2: While they are waiting for the support bundle we may want to let them know, say, Hey, this kind of take a couple hours. We want something on do something between that time.
Mgr 1: I'll let them know that

Finally, costs are found in the additional cross-checking that comes with resynthesizing efforts into a coherent whole. An interesting example of these last two costs is the tradeoff seen in case three, where the entire response group participated in the code review. This added cost in the sense that other threads of activity had stopped, but it ended up being a worthwhile investment when the multiple perspectives were brought to bear in resolving uncertainties.

Controlling costs for self
Implicit in the study of strategies used to control the costs of coordination is that they benefit the individual deploying them in some form. It could be argued that choreography used to control costs for oneself does not actually incur a cost. However, the cost is being shifted to the person who is trying to coordinate with them; therefore, it is worth noting these strategies here. The strategies noted were: ignoring prompts for input (see cases two and four for examples of this between the reliability team and their users, and cases one, four and five for examples between the reliability team and the vendor); dropping conference calls, audio bridges or web conferencing (see case two for actively engaged responders, and cases one, four and five for refusals to join at all); decreased monitoring of user forums (as previously noted in cases two and four); shedding load (see case 4, where migration work was postponed until requested); decreasing quality of work; requesting support; delegating tasks; or asking for cross-checks (as mentioned earlier).

An example from case one is seen when a responder has been tasked with sending an email with diagnostic information to a third-party vendor. This responder has multiple threads of activity underway and controls the costs of coordination with the third-party vendor by requesting support (offloading a task):
Eng 3: "If someone has all of their emails and the CC list so I can copy and paste, it'd be fantastic too.…"

Controlling costs for others
A less acknowledged aspect of the choreography of coordination is the effort expended to control costs for others. It seems counterintuitive that, when the costs of coordination are high, one would expend effort on behalf of others. However, forms of this overhead cost included: ignoring or deferring prompts for input that would interrupt others' work (common when coping with user channels in cases one, two and four) and other forms of gatekeeping; determining interruptibility to minimize the switching costs of redirecting attention (cases one, two and four showed this); pre-formulating ideas for potential contributions (role, task) prior to engaging (as in cases one and two); delegating tasks on another's behalf (case one, where the incident commander steps in to shift tasks away from one of their responders); delaying requests until others are able to respond (particularly evident when listening to audio transcripts of cases one, four and five, where extended pauses are common); signalling availability to help and proffering capacity (as previously noted); conducting cross-checks (in particular the example from case one below, but see also cases three and four); and lowering expectations for output.

Distributed incident response teams were frequently found to take advantage of the observability afforded by the coordination tools (in this case a web conference). Eng 1 has been tasked with deploying the code change that the group has settled on. Another responder suggests pulling up a screenshare so they can conduct a cross-check to offset the decreasing quality of work brought about by the time pressure.

Eng 1: I have it queued up. I'm going to double check out the service name right real quick.
Eng 4: Are you sharing your screen? So I can go check the commands to make sure you didn't make any typo's... Like, I always do,
Eng 1: uh, I can share this screen and my double checking has revealed that I don't have the service name right.

Maintaining common ground
While initial investments to establish common ground set a baseline for shared understanding, ongoing investments were needed to maintain common ground. Clark (1996) talks of maintaining and repairing common ground relative to short timespans (in conversations). Interestingly, the most salient examples of the role of maintaining common ground in the context of this kind of joint activity (incident response) came from longer-term investments. Linking back to the positioning of investments in establishing common ground as occurring over time, it makes sense that the concurrent maintenance also takes place across time. There are expenditures needed in: cultivating networks and establishing channels for recognizing change that can impact individual and team goals and tasks; developing a sense of nominal and off-nominal trajectories of change (at the individual, team and organizational levels); maintaining a sense of team-specific demands (including workload, changes, availability of specific resources, individual capacities, upcoming events); maintaining a sense of system demands (technical debt, hardware updates, deferred maintenance); and maintaining a sense of environmental demands that can impede coordination (weather events, limited office technologies, etc.). In Case 1, for example, Eng 4 calls forth a known problem with delayed reporting: "Bear in mind that has periodic glitches of network traffic and will report them at least 6 hours after an event occurred. It could potentially be caused by them." As with establishing common ground, while it was harder to trace these efforts explicitly, evidence was provided in the post-incident discussions when individuals were questioned about how they recognized something or knew some obscure but important fact was relevant. Often the mechanisms had to do with being involved in cross-department groups or initiatives, having read a news article or blog posting, or having gone for coffee with people from other units or companies.

114

As mentioned previously, maintaining common ground is an ongoing process. The following examples show how investments in maintaining common ground must take place over time.
Eng 2: "the exhaustion of the active workers is the behavior since hotpatching... which spawned this issue a week ago."
In this example, the responder is able to link an event from the week prior to the off-nominal performance currently found in the system. They know this because of ongoing investments in maintaining common ground, including daily stand-ups to stay current on others' activities, participation in post mortem debriefs about recent events, and more informal methods like discussions about current events that take place during lunch or coffee breaks.
Eng 1: "the web node long term load is way above normal (at half capacity)"
The use of "long term load" indicates this responder has context for performance over an extended period of time. In this particular case the responder is a relatively new member of the team, and their ability to maintain common ground came from accessing dashboard queries that provide a link to past performance.

Eng 3: "It's probably just the west coast signing off, but response times got a lot better about 10 min ago”

This responder, situated on the east coast, is able to connect the improvement in response times to the drop in usage that typically occurs when the other side of the country ends its workday. This example may seem inconsequential, but the knowledge it reflects is an acknowledgement of the distributed nature of company activities, the expected load that partial operations would generate, and that this particular day's activities are typical, which may explain part of the performance changes in the system.

Repairing breakdowns in common ground
Similar to maintaining common ground, repairing breakdowns in common ground can be effortful. The overhead costs include considering the implications of the breakdown, assessing the options for repair given other work and current conditions, deferring, reprioritizing or abandoning other activity to invest in repair, running mental simulations to consider outcomes, gauging interruptibility, establishing new channels for communication, recruiting resources to aid in repair, or delegating responsibility for repair to others. It was noted that not all breakdowns in common ground were repaired. Some were bypassed and revisited during the post mortem debrief, while others, perhaps deemed insignificant or too costly given other demands at that point in time, were ignored.


Additional patterns of choreography needed for coordination

In addition, a number of previously undiscussed patterns related to the choreography of smooth coordination during incident response were observed across the cases. These were: aiding others in model updating, recognizing your own need for model updating, taking initiative, investing in future coordination, coordinating with tools, and monitoring coordination demands.

Aiding others in model updating
This pattern was closely related to maintaining common ground but is a distinct element of choreography for coordination in that it is an active investment in furthering the knowledge and skills of the parties being coordinated with. In effect, it is a demonstration of reciprocity that benefits the individual as well, since it increases shared knowledge and understanding and fulfills the purpose of learning more about others' capabilities, which can lower coordination costs in the future. The overhead associated with aiding others in model updating included recognizing faulty mental models in others, determining what is known to them, retrieving information to share (either from prior knowledge or research conducted), sharing knowledge (the actual communication exchange), seeking confirmation it was understood, devising examples to aid understanding, fielding questions, clarifying others' statements, or adding further description to something someone else has said.

Developing knowledge of the system is effortful and ongoing. Practitioners use continuous updating to keep these efforts minimal. A common example of this is a discussion during an incident about contextual factors relevant to the problem but not explicit to the task at hand. These micro-investments are not thorough, but they are efficient. As noted, more elaborate examples of this occur during post mortems.

Recognizing your own need for model updating
The recognition that your own mental model needs updating is listed here as a separate pattern although it incurs similar costs as above. It is listed separately because the effort involved appeared distinct from recognizing others' need to update and taking action to assist them. In the corpus this appeared to be intermittent, but it is a pattern worth exploring further. An example comes from case two, when an engineer tasked with checking logs in a service networking tool recognizes they are unclear on how to do so and prompts the rest of the group by saying "I'm a little rusty checking logs in Consul…"

Taking initiative
The overhead costs of taking initiative included: considering task demands, considering available capacity & skills, considering others' capabilities, determining the sequencing of events, considering pairing options for larger or complex activities, communicating intent, and joining activity already underway. This was seen in multiple cases when responders came into an incident, brought themselves up to speed, noticed a task that could be useful to the effort underway and, unprompted, signaled that they were going to do it. In particular, cases one, three and five reflect a smooth interchange of taking direction and taking initiative.

Investing for future coordination
Reinforcing the finding that coordination is not episodic but should rather be considered across longer time frames is the pattern of responders investing for future coordination. This pattern included a wide range of activities that can incur costs, such as capturing or extracting data for use in the post mortem, updating mental models about system functions and interactions, identifying coordinative friction at the boundaries, compiling materials for a debrief, participating in debriefs, reflecting on one's own and others' actions, updating group processes, elaborating on coordination breakdowns, calibrating technologies, communicating post mortem results to stakeholders, prioritizing follow-up action items, re-prioritizing the backlog, and prioritizing recovery (self & others) by deferring tasks or taking on more tasks to aid peers while they recover. Essentially, many of these items are actions taken in the service of reciprocity.

One example included annotating items to flag them for aiding in reconstructing the timeline or for follow-up discussion in the post mortem: "‼️ is it possible to log/report/graph the count of jobs that are kill at the 10m timeout?" The use of the ‼️ emoji indicates to others that this is not a question to be answered in the flow of the response but rather a notation for after-action review. Consequently, no one attempted to answer it, as they recognized it as part of an investment in future coordination. Similarly, in case four, the short segment ":this: this message didn't go to users " used an emoji to flag an unexpected fault in one of the tools used for coordination, making a note of the malfunction for its developers. This is a lightweight way of lowering the cost of partaking in coordinative actions like providing input on how a tool functions in practice. A sketch of how such flagged items might be collected for the debrief appears below. Another form of investment comes in the ongoing efforts to improve working relationships across inter- and intra-organizational networks. Several teams meet regularly with key vendor product management to discuss the use of the tool and the quality of the support relationship. Implicitly, attempts to resolve this 'friction at the boundaries' are a recognition that the cost and quality of future coordination depend on being able to transparently work through any issues.
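Harvesting these flags can itself be automated, lowering the cost of compiling debrief material. The following sketch assumes a chat export in a JSON-lines form with ts, user, and text fields - a hypothetical format chosen for illustration, not necessarily what these teams exported - and simply collects messages carrying the ‼️ flag for the incident timeline.

```python
import json
from typing import Iterator

FLAG = "\u203c\ufe0f"  # the ‼️ emoji used to mark items for after-action review

def flagged_messages(export_path: str) -> Iterator[dict]:
    """Yield messages flagged for the post mortem, preserving timestamps
    so they can be dropped into the reconstructed incident timeline."""
    with open(export_path, encoding="utf-8") as fh:
        for line in fh:
            msg = json.loads(line)
            if FLAG in msg.get("text", ""):
                yield {"ts": msg["ts"], "user": msg["user"], "text": msg["text"]}

# Example usage (file name is a placeholder):
# for item in flagged_messages("incident_channel_export.jsonl"):
#     print(item["ts"], item["text"])
```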

In one long-running incident, a support engineer from another team, responsible for another service, had been observing the incident response in the shared ChatOps channels and offered to provide additional capacity if it was needed. "I know everyone has been focus on the issue all day, I just wanted to offer If anyone need a replacement or a break. I am available tonight to help. Just call me out with pagerduty"

Coordinating with tools
Human-machine interaction, human-computer interaction and interface design are all fields implicitly dealing with coordination between people and technology. However, the tools themselves are not always designed for coordination. The findings in this section outline some of the ways in which there is additional cognitive work, and corresponding costs, in coordinating with tools that are intended to aid coordination. As shown in preparatory study 2, significant additional efforts were found in assessing the suitability of the tool, comparing it to current practice, forecasting future value, setting up the tool, testing, orienting others/training, adapting practice to accommodate the tool, gaining access to the tool, calibrating the tool, troubleshooting issues, additional monitoring of the tool, cross-checking, assessing the validity of the tool's interjections, and gaining advantage of the tool's capabilities by adjusting other practices. Many of these findings were supported in the corpus. For example, part of the first coding of the data was to identify where coordination was being explicitly mentioned. Consistent across all cases are multiple references to the tools of coordination (asks for a link to the web conference, verbal prodding for people to join an audio bridge, statements indicating difficulties logging in or setting up the audio or video components). In addition, the use of ticketing systems adds an attentional demand for monitoring. Cases 1, 2, 4 and 5 all include references to responders explicitly directing attention to watching for an update in the ticketing system. If the active incident response is taking place in ChatOps or on a web conference, then the responders have to interrupt the tracking of activity taking place there, switch tools and reorient to the ticket or the vendor's proprietary collaboration mechanism. Even tools ostensibly designed for coordination, like on-call scheduling, incur these costs - as when gaps in the on-call schedule meant responders were not paged.


Another way in which this cost is incurred is when the tool is unable to determine the appropriateness of an intervention or to recognize when a requirement is not critical. In one case, a bot in a ChatOps channel interjected 7 times in a 35-minute period to ask responders to define the location of the incident - an action that was not necessary for communicating about the event because the location was implicit in the services that were degraded. Lag in the systems designed to provide observability about the status of a function meant an indicator that would be relevant and meaningful to the response was rendered useless, such as in the example below.

Eng 2: “Redis thing doesn't take into account Redis memory utilization that still says they're green. Despite the fact that we have clear evidence there, they're not.”

Monitoring coordination demands
A final pattern is a meta-coordinative activity relating to monitoring and responding to on-going coordination demands. The costs here were in aligning current demands with required support and ensuring appropriate resource requirements - people, support contracts, hardware/software - are met. In cases 3 & 4, the incident commander had the option of bringing in additional resources, as other responders had offered to join, but chose to defer the use of those resources. In case 2 this oversight function was filled by Mgr 1 who, in the early stages of the event, probed to assess whether the problem demands and available resources were well matched. When they weren't, they brought in Eng 1 to assist and later created a shared channel to invest in building & maintaining a collaborative posture amongst all the responders. A second example of this meta-coordinative function is in Case 1, where a manager who shared the workspace with the response team recognized shifting needs during an incident and stepped in to order lunch.

An example of this meta-function in action is seen when management, who had been tracking the long-running event in Case 4, steps in to ask whether appropriate handoffs have been considered and whether the team has sufficient coverage to take time to recover. Another example shows this is not tied to role or authority: in Case 2, a user directly messages a member of the response team to notify them that more frequent updates are needed to help users plan and adjust around the outage. Lastly, in Case 1, a manager who has been tracking the event notices the continued delay from one of the vendors and steps in to escalate the urgency.


Summary

In the analysis of the corpus of five cases, coordination is shown to be a continuous process spanning multiple roles and timeframes. The unit of analysis expanded beyond individual coordinative events (like conversation) to place these into the context of key functions in anomaly/incident response. These key functions could go easily or become more difficult depending on the challenges posed by the system's malfunctions and disturbances, the uncertainties that resulted, how the processes of anomaly response worked, and how additional people and expertise became engaged in the response.


Chapter 6 Discussion

This dissertation work has examined the mechanics of coordinated activity that is increasingly common in large-scale distributed work systems. Using multiple converging methods, including a series of preparatory studies and detailed process tracing of joint activity during incidents, this project has elicited patterns of cognitive work inherent in joint activity. Specifically, this work addressed the strategies used to control the costs of coordination within these systems while carrying out the functions of anomaly response under uncertainty, risk, and pressure. The detailed process tracing of the corpus of five cases laid out the elements of choreography used to manage the escalating coordinative and cognitive demands that arise during a response to exceptional events. These findings offer a model of the elements of the choreography used in coordination and an analysis of existing processes of coordination within software engineering. This detailed analysis provides an empirical basis to extend previous research on common ground in joint activity (Clark 1996; Clark and Brennan 1991; Klein et al. 2005; Bradshaw et al. 2009), mental models (Johnson-Laird 1983; Entin and Serfaty 1999; Fiore et al. 2001) and anomaly response in distributed work (Woods 1994; Watts-Perotti and Woods 2007; Woods and Patterson 2001). The integrated view of this analysis provides a basis for understanding the dynamics of choreography that complement the elements of choreography, by tracing how the costs of coordination are incurred by geographically dispersed people working on difficult problems with automated processes at speed and scale. Specifically, the three main contributions of this work are: 1) extending the underspecified elements of choreography to provide a detailed analysis of the cognitive work inherent in them, 2) detailing the dynamics of choreography across multi-role, multi-echelon networks (Woods, 2018), and 3) providing new insights into the costs of coordination with the tooling (automated and otherwise) necessary for anomaly response in distributed work systems. The elements of choreography explicitly laid out in Chapter 5 will be examined in detail first. The discussion will then shift to more general observations about how coping with non-routine failure events in these contexts requires the cross-scale, time-sensitive, systems view of coordination provided by Adaptive Choreography. Dynamic coordinative strategies are shown to facilitate smooth interactions across a variety of agents - both human and machine/automata - with interdependent capabilities. The third key point in the discussion will lay out the implications of tooling that incurs costs of coordination. Brief closing comments are then made on the design implications for coordination across human-human and human-machine teams.

The elements of choreography

Prior research identified many important aspects of choreography: establishing common ground, recruiting new resources, delegating, synchronizing tasks, controlling costs for self, taking perspectives/switching perspectives, controlling costs for others, maintaining common ground, and repairing breakdowns in common ground (Klein et al. 2005; Woods and Patterson 2001; Klinger and Klein 1999). However, it stopped short of laying out the inherent cognitive work in detail. What follows in this section is an empirical basis for the cognitive work involved in the elements of choreography.

In addition to the expansion of prior work, new elements of choreography emerged as important for anomaly response. These were: aiding others in model updating, recognizing your own need for model updating, taking initiative, investing for future coordination, coordinating with tools, and monitoring coordination demands.

Establishing Common Ground

The concept of common ground is applicable both to the understanding of a specific event underway and to the broader context in which that work takes place. We might think of this as knowledge, beliefs and assumptions about the anomaly response to an acute issue, as well as the on-going need to understand the implications of that issue relative to the broader pressures & constraints. For example, a responder paged into an event in progress will need to establish common ground by orienting themselves to information such as: what others believe is happening and why, what mitigations have been tried, and their results. However, this takes place within the broader context of the socio-technical system performance, including knowledge, beliefs and assumptions about things like: typical system behavior (including the kinds of expected variance and failure patterns), the expertise of others in the joint activity, and the availability and willingness of needed resources to coordinate. In this way, common ground about the Anomaly Response is situated within common ground relative to the Joint Activity.

Taken this way, the role of common ground in joint activity should be considered relative to immediate interactions within an event as well as the ongoing (and often “offscreen”) efforts put into investing in establishing common ground.

Clark (1996) describes how we infer much about common ground based on contextual cues related to roles, membership in various communities, and norms. As noted in Chapter 3, continuous integration software development environments are characterized by change. In addition, knowledge of how different types of software actually work, and consequently interact with the rest of the system, is highly specialized and continually recalibrated. Therefore, it is a common (and expected) practice to continually add knowledge and update mental models about how the system functions. Because of this, many of the coordinative interactions relate to establishing, maintaining and repairing common ground. Olson & Olson (2000) note that tightly coupled and ambiguous work – such as incident response in software engineering – requires greater common ground.

There were many examples of explicit investments in establishing common ground(AR) as new responders were recruited into a response underway. These were quick, compact, highly encoded, and transitioned quickly into diagnostic work. In these examples, the cost of coordination during the incident is low, as the costs have been 'paid' earlier. In contrast, the costs of coordination are substantially greater when investments in establishing common ground(JA) were required for incoming responders. These efforts tended to disrupt the flow of the anomaly response work, in some cases stopping it entirely as incident commanders and key responders were forced to divert attention from the efforts underway to helping 'on-board' the new resources.

As implied by the recognition that costs of coordination can be 'paid' sometime prior, it is possible to infer from the cases that there are ongoing 'off-screen' investments to add to common ground(JA) over a longer period of time. Statements made in the event that reveal knowledge and beliefs about the different aspects of the incident – the other responders, the system behavior, the other teams or units coordinated with or the organization as a whole – represent prior effort that went into learning or forming those beliefs. Based on the findings, the greater the common ground(JA), the easier it is to update or repair common ground(AR). While there is certainly a cost to the ongoing investments (time and attention paid to changes in interdependent systems and the teams that manage them), these can be incurred during lower-tempo periods and can be integrated into other activities. Observations made during the study showed that engineers were continuously making these informal small-scale investments in myriad ways: through casual conversations over lunch or coffee, in online chat forums in user or guild channels, as part of inter-organizational working groups, or during vendor site visits with intra-organizational support teams. Transcripts from particularly high-performing cross-boundary groups revealed inquiries that extended beyond the particulars of the incident but were useful for establishing common ground. Organizations (especially those requiring cross-boundary incident response) can structure these into their practices. For example, having new reliability engineers spend time reviewing past incident reports during the onboarding process orients the engineer to the system, its failure patterns and incident response strategies. Anchoring this knowledge-building in a specific case can provide an immediate basis for common ground(JA) with their new team, and discussions are better focused on needed repairs. Of note in the findings on common ground were the four bases for common ground that emerged from analysis across the preparatory studies and the corpus. Knowledge, beliefs and assumptions about: i) other individual participants, ii) teams (used here to denote established groupings of individuals, such as the support engineering function in another business unit or vendor), iii) the organization (a particular business unit, support function or third party) and iv) the system under management were shown to be important in being able to flexibly and more precisely coordinate during time-compressed events. These four bases are explored further next.

Other individual participants
Investments in establishing common ground incur a cost of effort and attention to learn about what others know and what they can do (existing skills and knowledge). These investments aid in rapid recruitment of the right resources for the specific problem demands (even, perhaps especially, when those are underspecified), as was shown in all cases. Effort in establishing knowledge of the other participants went into understanding both the roles & functions (espoused & actual) and others' stance toward a problem or system, as well as understanding the limits and boundaries of others' authority & responsibility. This is a critical requirement for all responders, as the knowledge of who to recruit extends the capabilities of the response team, particularly under time pressure. For example, if all responders are knowledgeable about the capabilities of others, then not only is the pool of potential resources larger (there are more known people who can be helpful), but recruitment becomes more fluid as the responders draw on their own knowledge instead of relying on internal personnel records that may or may not be current. This can be inferred from discussions in cases one, two, four and five.

Team membership
Similarly, having pertinent knowledge, beliefs and assumptions that are shared among the involved parties requires having insight not only into them as individuals but into the collective team. Throughout this analysis, collective responders have been referred to as groups to denote their often ad hoc and fluid structure. However, there are consistent groupings that can be considered a classic 'team'. Here, costs are incurred in learning about the pressures & constraints on the team and how they, as a collective grouping, respond to these. Related to anomaly response, this can include learning about normative behaviors and the handling of exceptions, understanding how roles & functions interact in practice, and understanding the limits and boundaries of the team's authority and responsibility.

The organization
Just as individuals are situated within team dynamics, when coordination involves crossing internal or external boundaries, the dynamics of the organization in which the team is embedded can be relevant to coordination. The corresponding overhead costs are in: maintaining a sense of organizational demands (including shifting priorities, new management, pressures & constraints for action); understanding priorities and goal conflicts; learning how priorities typically tend to shift and how quickly or slowly; learning how goal conflicts are typically dealt with; learning formal decision processes; learning informal or role-specific decision making authority; learning who makes decisions and at what speed; and learning about dependent units, their role & function (espoused and actual). Additional overhead costs are inherent in determining the boundaries of organizational silos; understanding the implications of organizational silos relative to joint activity; understanding how historical events influence current coordination; and learning about available resources for establishing & repairing common ground (access to tools, formal protocols or informal practices).

The system performance
In high-consequence anomaly response, or for organizations seeking to improve reliability, common ground(JA) surrounding knowledge, beliefs and assumptions about system performance is critical to coordination. It seems obvious to state, but supporting engineers to be sensitive to variation (nominal or off-nominal), to learn about cross-boundary pressures & constraints in the operations of other members of the joint activity, about normal, abnormal or anomalous system behavior (its expected or unexpected variability), and about previous non-routine or exceptional events and the responses taken in them were all found to be important foundations for common ground. As presented in the cross-case findings, this was noted prominently in all cases, although not in the same ways.

Maintaining & repairing common ground

Evidence to support the cognitive work inherent in maintaining common ground comes from questions or statements designed to calibrate understanding. These intermediary statements are indicative of prior attention to the person, software or system state. For example, the statement "it's probably just the west coast signing off, but response times got a lot better about 10 mins ago" tells us a lot about the effort of establishing and maintaining common ground. The first is the recognition that the organization has a substantial volume of users on the west coast whose use of the system has implications for performance. The second is that west coast users are likely to be signing off at that point in time. The contextual relevance of these kinds of knowledge, beliefs and assumptions is often assumed to be implicit, but it represents knowledge that had to be learned previously in order to make currently meaningful judgements. The last way this statement informs us about maintaining common ground is in the qualifier "about 10 minutes ago". This indicates the speaker has either been monitoring the system over time or has recently checked a dashboard or log file that enables them to see this. This statement uses common ground(JA) to maintain common ground(AR).

As noted in the cross-case findings, explicit common ground(AR) repairing statements were made when the speaker recognized that inferences about common ground may have been miscalibrated and were in need of repair. Consider the following example from Case 5:
Eng 2(ic): We appreciate the analysis. These two services are the largest consumers of and the behavior you're seeing is BAU. We cannot disable these services. Further, the exhaustion of the active workers is the behavior since hotpatching from to which spawned this issue a week ago.
In this excerpt, taken from the midst of an on-going incident, the IC informs the vendor support engineer (who has little to no common ground(JA)) about what levels of load are "business as usual". It is worth adding some context to this statement: the reliability team had been attempting to get help from the vendor support team for several hours, proactively providing information based on anticipated needs. As opposed to engaging directly with the responders, which would have allowed for a rapid calibration of common ground and clarifications, the vendor support team attempted an independent analysis, which was not useful and resulted in increased frustration as the resolution was further delayed.

In closing, two salient examples of the value of investments in establishing common ground about system performance were found in cases 2 & 3, whereby incoming responders external to the responsible reliability team had substantial common ground about system performance including, in case 2, knowledge about prioritizing tradeoffs. While their status as 'outsiders' meant their knowledge, beliefs and assumptions about the current state of the system (common ground(AR)) required effort to update, their prior investments (common ground(JA)) allowed for smoother integration into the incident underway. Their participation in diagnostic efforts was better calibrated to the problem demands. This contrasts with the case 5 example noted above, where vendor support made general statements about system performance that were poorly calibrated, brought no value to their engagement, and increased the costs of coordination as the responders now had to establish common ground during a high-tempo, high-consequence period.


It is important to note that in this analysis, investments in establishing common ground are seen as related to, but distinct from, model updating. This will be discussed shortly.

Recruiting

Observations from preparatory study 1 and the findings from the cross-case analysis showed that the overhead costs involved in effectively recruiting resources included monitoring current capacity relative to changing demands to recognize when additional help was needed, identifying the skills required, identifying who is available and determining how to contact them. Delays in recruitment (as seen in the One-At-A-Time models of incident response) generate additional overhead costs in bringing these responders up to speed and establishing common ground: the event has been progressing for longer, so the information needed to establish common ground is greater. Once these preconditions (what assistance was needed and who to recruit) had been established, there were additional costs of coordination inherent in contacting or alerting them, waiting for a response, and adapting current work to accommodate the new engagement (waiting, slowing down or speeding up to complete tasks). In addition, preparations for the engagement were needed, such as anticipating the needs of incoming responders (as they pertained to information, accessibility or understanding). Practically, this meant additional activities like developing a CritSit or status update, giving access/permissions to tools and coordination channels (sending links to chat channels or phone numbers for audio bridges), and generating shared artifacts (refreshing dashboards or taking screenshots of relevant views). Once engaged, extra costs may be incurred dealing with access issues (inability to join a web conference or trouble establishing an audio bridge), as seen in cases one and five and multiple times in the original set of cases reviewed.

Coming up to speed

Just as the recruiter incurs costs of coordination, so too do those being recruited. Aside from the obvious costs of being interrupted in their own work and acknowledging their re-orientation to the problem, the responder being recruited has to assess the request relative to their own capabilities and capacity to act. If they chose to respond to the request, there were additional costs involved in deferring or abandoning their own work and communicating about the deferral or abandonment to the parties they coordinate with. This aspect of the costs of coordination is inferred from the act of reprioritization, which necessarily diverts attention from the task at hand. Once recruited, effort was expended in gaining access to collaboration tools and assessing available information, clarifying (available data and expectations), determining roles of participants, requesting additional information, forming questions about the state of the problem, determining the interruptibility of the response underway, forming an interjection and interjecting. These costs were shown less explicitly by recruited responders and more in the ways in which peripheral resources (such as managers or other stakeholders) would re-enter a response they had left, as in cases one, two, four and five. In addition, efforts were needed in assessing work underway and its implications, considering their contributions relative to problem constraints and assessing how their contributions may influence work underway. There were no explicit examples of this, which is in and of itself a data point. That responders were able to seamlessly integrate into the effort underway is a signal of being well calibrated to what will be most helpful when joining. Think of a treadmill operating at full speed: if someone is aware of and prepared for the effort required when they step on, they will be able to begin without disruption - literally hitting ‘the ground’ running. Consider the alternative, when someone is unprepared and steps on - it is immediately apparent the person has not ‘come up to speed’ as they shoot off the back! In cases one, three and five there were excellent examples of smooth integration after having come up to speed appropriately.

Tracking others’ activities

Fundamental to smooth coordination is anticipation. However, being able to anticipate change in a high tempo event takes focused attention on the current trajectories and tempos of others’ activities. While this incurs a cost to individuals, it decreases redundant work and allows faster re-orientation to new tasks or potential threats, which can improve overall team performance. The findings clearly show that when other responders invest in tracking others’ activities, it can lower the tracking costs for the incident commander. Individuals are more adeptly able to anticipate what tasks they can take on or understand who might need help, enabling them to take initiative instead of being directed. In addition, in tracking others’ activities, they can synchronize their current efforts relative to upcoming workload, making the coordination smoother. This finding is a core characteristic of Adaptive Choreography, which will be explained further in the chapter.

Taking initiative

Closely related to tracking others’ activities is the element of initiative. Interpredictability is increased when participants in joint activity can “decenter and imagine the events from the other person’s point of view” (Klein et al. 2005, p.9). However, there is overhead involved when doing so for the purposes of coordination. These efforts included visualizing or mentally simulating the experience of the other, anticipating needs, and anticipating stance or orientation towards the problem or workload demands (this is shown in the proactive steps taken in cases one, two and four). In doing so, the party seeking to coordinate with another was able to adjust expectations or reorient their request to make it easier for the other party (see the examples regarding ‘controlling costs for others’). While this might typically be ascribed to an incident commander, it was also shown extensively by other responders, which demonstrated reciprocity. In several of the cases involving third party support, vendors acknowledged the perspective of the support team and the pressure to get service restored quickly. For example, in case 2 the vendor support team actively joined the core responders in diagnosing and managing tradeoff decisions to get the system back, adjusting their typical way of interacting. Also in case 2, despite incurring a cost by delaying the receipt of information, the third-party support team suggested running the bundles after the workday to avoid further slowing the system.

In cases one, two, four and five additional responders noticed the system struggling and proactively reached out to offer their help. This happened in several ways - explicitly in the response channel (“need help?”), less directly (as in case two, where several resources direct messaged a responder actively involved in the incident to offer help and provide insight) and through more subtle forms, like when additional responders noticed their peers being recruited and, in case one and case four respectively, relocated to the boardroom or logged into the web conference. Since these engineers were not recruited directly, they did not wait to be assigned tasks and instead both took initiative by identifying tasks they could take on.

Delegating

An obvious symbiotic element of choreography to taking initiative is delegation. Less obvious, however, are the efforts involved. Roles tasked with delegation, or roles that delegate in order to achieve certain outcomes, incur an overhead involved in considering task demands, available resources and others’ capabilities relative to the sequencing of events. Once determined, deciding how to direct the other, the extent of directions and the timing of the requests (relative to gauging others’ interruptibility) are all needed before actually making “the ask”. An additional aspect of delegation involves the management of expectations for completion (quality, timing, goal priorities).


The costs associated with delegation can be compounded when considering pairing options for larger or more complex activities or long running incidents, as seen in case one where the practice of taking initiative began to break down as the effort wore on and responders began to burn out. The compact examples of delegating shown in the cross-case findings reveal important insights into some of the limitations of the IC model. By using both direct requests as well as more general open queries to the group, the responders can adjust their activities relative to other concurrent demands they, or others, are facing - allowing others to take on additional load if able and to bypass tasks if their current focus takes longer than anticipated. In several of the cases the IC was both taking on tasks and ‘commanding’ the incident. According to the IC model of incident management this is to be avoided - the IC should focus only on the coordination aspects. However, the ability to make decisions quickly and with incomplete information is contingent upon an understanding of how those decisions may interact with other parts of the system. This is not an inconsequential amount of work and requires that the IC has sufficient knowledge of the evolving technical details as they shift over time - difficult to do with periodic updating. Instead, the shared responsibility for keeping track of others’ activities and the complementary functions of taking initiative, controlling costs for others, synchronizing, recruiting, updating, maintaining and repairing common ground added costs for individual responders but helped maintain coherence for all involved, which ended up reducing overall costs of coordination. It should be noted this was a highly functional strategy for incident response teams with common ground and reciprocal shared investments in managing the response efforts overall.

Taking Direction

Klein et al (2005) defined being directable as a core function for supporting interpredictability. Active efforts to signal availability and to monitor when tasks went unclaimed helped reduce the costs of delegation for the incident commander.

Synchronizing tasks

Sequencing and synchronizing tasks across responders and across activities (particularly those distributed across boundaries where there is less visibility into the synchronization of those tasks) can be effortful. Overhead related to this element of choreography includes considering the temporal relationships between tasks, the immediate and longer-term implications of each individual course of action, as well as the immediate and longer-term implications of the collective course of action and the interactions between them. This was particularly exacerbated when interacting across boundaries using ticketing systems that created lag and a lack of observability, which impeded anticipation and resulted in workarounds.

Controlling costs for self & others

Individual responders used strategies for controlling the costs of coordination for themselves by ignoring others’ requests for help, decreasing their participation in discussions or decision making, taking on fewer tasks and contributing less to sharing knowledge about system functions. These were seen to be intermittent, contingent on the demands and duration of the event.

Efforts to control the costs for others was another pattern noted across multiple cases. This included efforts to determine interruptibility and delaying requests until others appeared ready to be interrupted, formulating ideas for potential contributions (“would you like me to check the logs?” or “do you need me to be ic?”), signaling availability, proffering capacity, cross-checking work, lowering expectations for output from others and gatekeeping. Interestingly, gatekeeping was seen both within the team (the IC requesting users slow their requests on the response team to allow them to focus) and on behalf of the team (managers becoming intermediaries to executives and providing updates to prevent them from directly engaging, allowing the team to focus).

Model updating

As mentioned in Chapter 2, modern IT systems are complex and undergo continuous change. This inherently drives a need for multiple, diverse perspectives to maintain an accurate mental model (Woods, 2017). Individuals both seek knowledge to update their own model and contribute to updating others’. Thus, it is understood models will be incomplete in some way, so continued investment in helping others learn is mutually beneficial (their contributions to incident response are likely to be better calibrated to the problem demands) and reciprocated (they will know things that you will likely need to learn in future). While closely related to maintaining common ground, for the purposes of this analysis I have separated out the cognitive work of model updating as effort to expand and update technical knowledge about the system behaviour. It could be argued this is unnecessarily reductionist, but it serves to underscore the importance of knowledge, beliefs and assumptions about the socio-technical system as a whole. It is easy to dismiss the importance of having current, detailed knowledge about the state of goals, priorities, tradeoffs, pressures and constraints on teams, organizations or industries as a whole that influence and inform the possibilities for action. Cook (1998) argues effectively for analysis to consider organizational and technical issues together and concurrently, so as to consider narrow domain details as situated within a “set of pragmatic issues regarding the system in operation”. A salient example of how technical details are always in the context of the pragmatics is given in case 1 when Eng 3 says to the responder who is preparing an update to stakeholders, “I think we should put it as a major outage now because it seems like you can’t log into it or anything else like that”. This is both an assertion that the level of degradation is severe (failure to log in) and that classifying the event as a major outage will mean something to stakeholders (they should expect significant issues with their use of the service and know that there is a team actively working the problem). Moments later, Eng 5 proposes shutting down the CD service because of its relationship to the currently degraded function. When Eng 4 asks if the CD service is having issues, Eng 5 aids model updating by explaining how shutting down one service will impact the other. After this, Eng 4 agrees that is the right course of action, repairing the breakdown in common ground(AR). Many examples of model updating, independent of common ground, are found throughout the cases. In case 2, updating was limited, and the contrast was stark. In other cases, interactions between responders routinely included information that served to inform and educate others. This indirectly influences the response since it provides an opportunity for potential variations in the mental models amongst responders to be re-calibrated and aligned. The acts of 1) aiding others’ model updating, or 2) raising the need for updating in your own mental model, ensure both common ground(AR) and common ground(JA). It is a micro-learning opportunity that, in high performing teams, is near continuous and fluidly integrated into the incident response.

Investing for future coordination

Eliciting the element of choreography ‘investing for future coordination’ was particularly interesting in that it showed lightweight, continual efforts served to lower overall costs. The first key finding is that others in the network offering to join a response they were not required to join represents a second form of reciprocity (the first being the investments in model updating). Their willingness to incur costs establishes the possibility of them receiving aid from others at a future point in time. The second key finding was that leaving traces of items to follow up on (either through the use of select emojis, conventions for identifying follow-up items or @mentioning others) incurred an immediate cost but enabled post-incident activities and follow-up actions to be more readily identified and attended to. This form of shifting the costs of coordination provides a contrast to the other detrimental effects of shifting the costs of coordination in time, seen when needed resources refuse to join a response (discussed in the next section on general findings).
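To make the second finding concrete, the following is a minimal, hypothetical sketch of how lightweight traces left during an incident could be harvested afterwards. The marker emoji, the “#followup” tag and the “@postmortem” mention are invented conventions for illustration, not those observed in the corpus.

```python
import re

# Assumed, illustrative conventions for marking follow-up items in chat.
FOLLOW_UP_MARKERS = (":pushpin:", "#followup")

def extract_follow_ups(chat_log: list[str]) -> list[str]:
    """Collect messages tagged during the incident for post-incident review."""
    tagged = []
    for message in chat_log:
        if any(marker in message for marker in FOLLOW_UP_MARKERS):
            tagged.append(message)
        # @mentioning a hypothetical postmortem owner is another lightweight trace
        elif re.search(r"@postmortem", message):
            tagged.append(message)
    return tagged

if __name__ == "__main__":
    log = [
        "eng2: restarting the worker pool now",
        "eng4: :pushpin: we should alert on queue depth #followup",
        "ic: @postmortem capture why the hotpatch spawned this",
    ]
    print(extract_follow_ups(log))
```

The design point is the asymmetry of costs: a few keystrokes during the event buy a ready-made follow-up list afterwards, shifting coordination work to a lower-tempo period.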

Coordinating with tooling

Tools are intended to transform practice in some way, for example by streamlining a practice, consolidating information or aiding collaboration. There is an implicit assumption that when a tool is deployed into a field of practice it will provide a net benefit for the functions it is designed to support. As Woods & Hollnagel (2006) note,

“Our fascination with the possibilities afforded by new technological powers often obscures the fact that new computerized and automated devices also create new burdens and complexities for the individuals and teams of practitioners responsible for operating, troubleshooting, and managing high-consequence systems.” (p. 119)

A second implicit assumption is that, if the tool fails to support the functions intended, the failing is in improper use rather than with the tool itself. But this research has shown there are costs incurred in selecting, on-boarding, calibrating (and recalibrating!), training users and adjusting existing practices to make the tool useful. Reliability groups have existing ways of coordinating (both practices and tools) that are likely to inform how they will conduct incident response practices. Introducing new tooling disrupts the existing practices. This is a critical and underappreciated aspect of technological change: even if introducing the tool will produce benefit at some future point in time, there will be a cost in switching to new modes of practice. An incident response team does not form from thin air. Given it is difficult to anticipate the full range of disruptions or recognize all potential interactions, it is likely the group will experience some disruptions during a live incident, which adds to the cost of coordinating with that tool. An example given during preparation study 2 showed that adopting an alerting tool based on an on-call schedule is a benefit in that it automates the existing practices of notifying needed resources. However, if there are nuances to scheduling that the tool does not recognize and it fails to notify responders, it can introduce delays in getting help. There is a cost inherent to ‘partnering’ with the tool in that the design likely has not accounted for all possible variations, so it takes time to learn where these ‘gaps’ are (Nemeth & Cook, 2013). New tools that are deployed concurrent with changing responsibilities (and the potential need to develop new skills to handle service ownership) can uncover complex and previously unacknowledged dependencies that end up being poorly supported by the tool alone.
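As a purely illustrative sketch of the kind of scheduling gap described above, consider a tool whose only notion of “who is on call” is a fixed weekly rotation. The rotation, the names and the informal swap below are all invented; this is not modeled on any specific alerting product.

```python
from datetime import datetime, timezone

# Hypothetical roster: the tool only knows a mechanical weekly rotation.
WEEKLY_ROTATION = ["alice", "bob", "carol"]

def on_call(now: datetime) -> str:
    """Return the responder the tool *thinks* is on call this week."""
    week_index = now.isocalendar().week % len(WEEKLY_ROTATION)
    return WEEKLY_ROTATION[week_index]

# The team's actual practice (an ad hoc shift swap agreed in chat) is
# invisible to the tool:
informal_swaps = {"alice": "bob"}

if __name__ == "__main__":
    paged = on_call(datetime.now(timezone.utc))
    actually_available = informal_swaps.get(paged, paged)
    if paged != actually_available:
        # The gap practitioners must learn about and work around: the page
        # goes out, no one acknowledges, and getting help is delayed.
        print(f"Tool paged {paged}, but {actually_available} has the pager.")
    else:
        print(f"Tool paged {paged}, and the schedule matches reality this week.")
```

The point is not the specific bug but the structural cost: someone has to discover where the tool’s model of the schedule and the team’s actual arrangements diverge, usually during a live incident.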

Two surprising but consistent patterns were the extent of the effort expended to integrate new tools that purportedly increased efficiency: significant cross-functional effort was invested prior to adoption and, once in place, on-going ‘fine tuning’ was needed to calibrate. In one case, the tool was deemed to be frequently unreliable, so responders developed a cross-check to provide redundancy and guard against the tool’s unpredictability. Typically, tool designers account for the gaps by attributing them to user error. This represents a deep misunderstanding of the nature of technical work in large-scale, distributed work systems. Coping with gaps between actual and expected conditions (Cook et al. 2000) involves additional cognitive effort. The benefits of technology are often overstated and fail to properly support practitioners. This includes processes that do not reflect actual working conditions, unachievable workload given the resources available or an emphasis on key performance metrics that drives unintended behavior. Given this, the strategies employed to maintain operations are necessary adaptations, and changes to the practices may break down those strategies. However, investments in the change are substantial and not easily undone even if the tools are poorly suited. Therefore, adoption can be tentative and partial as people recognize that dependencies on new tools create new forms of failure or additional cognitive burden. Workarounds to cope with these additional costs are common. For example, a company-mandated tool adoption was being tracked for compliance. Some teams, noted to have ‘onboarded’ to the tool, were only using partial functions - and not the ones intended by the mandatory adoption. This is indicative of a poor fit between the tool and the purpose of its intended user, and/or the cost of coordination with the tool itself was too high to fully integrate.

Monitoring coordination demands

The function served by this element of choreography is to provide slack to the system. While individual responders in the event may, in periods of low activity, consider the broader coordination demands, this element benefits from being enacted by an outsider who is able to contribute a fresh perspective. This was noted in examples where an engineer from another team, or a manager distant from but tracking the progression of an event, steps in to ask if lunch has been ordered to allow the responders to keep working, or whether plans for hand-offs and rest have been considered for long-running incidents.


Summary of the elements of choreography

These findings on the functions within the elements of choreography are important because they provide an empirical basis for re-imagining incident management in critical digital infrastructure. Next, we shift from the specific elements of choreography to a more general discussion of how these play out in active coordination during anomaly response.

Efforts to control the costs of coordination

A key aim for this study is to connect the findings about how the elements of choreography are adapted to the effort and costs incurred by the parties engaging in joint activity. This next section discusses how the costs inherent in carrying out the cognitive work of the elements of choreography are adapted to the demands of the event, to the available resources, and to the constraints and limits imposed by organizational process or structure, and how these adaptations can disrupt smooth coordination. To do so, it is first necessary to identify some limitations of the existing frameworks: choreography as phases of activity (Klein et al, 2005) and coordination as assigned functions (Beyer et al., 2016). We will examine phases of activity first.

The concept that joint activity has phases is an oversimplification. By its nature, to assess coordination in organizations is to see it in relation to the interactions of multiple parties. But it is also to see it as part of a continuous flow where the past, both recent and earlier, has shaped how it will play out. Or, as Bradshaw, Feltovich & Johnson (2011) put it, “Joint activity is a process, extended in space and time” (p. 299). All coordinated action has a historical context that is either direct or indirect. Directly relevant history may be a previous interaction between parties or prior joint activity. Indirectly relevant history is a legacy of social, organizational or professional norms. In this way, the dynamics of coordination - who is involved, how they become involved, what their expected contributions are and how that affects the cognitive work in anomaly response - cannot be properly understood as episodic and must instead be assessed as a function of system behavior seen over time. It is also evidence that attempts at understanding the elements of choreography without considering the dynamics of coordination would be insufficient. It could be said this is one of the reasons why many attempts to proceduralize coordination fail (the other being that they fail to account for cognitive work). To account for the dynamics of coordination, it should be recognized that attention flows across multiple threads of activity concurrently (Woods et al, 2010), and that continued sensitivity to new information or events allows refocusing without interrupting ongoing lines of work (Woods & Hollnagel, 2006). Some of the cognitive work inherent in the elements of choreography may unfold in microseconds or seconds, for example, recalling whose skillset matches the problem demands when recruiting new responders. Other elements, such as establishing common ground, may take place across days, months or even years. Why is this important to note? One reason is that practitioners exploit these varying time scales to cope with rising costs of coordination. A second reason is that it begins to provide promising design directions for considering how to lower the costs of coordination - if the cost can be shifted in time or re-distributed to other resources in ways that maintain the integrity of the element, then we can better design for smooth coordination.

The second limitation inherent in existing models of coordination is coordination as an assigned role function. Making certain tasks and activities the explicit domain of a single role immediately sets up the potential for workload bottlenecks and for the IC role to fall behind the pace of actual events, as they must work both in the incident and on the incident (Maguire, 2019). To have a clear picture of coordination in large-scale distributed systems one must have a zoom lens to see the details and a wide-angle lens to connect those details to the broader system dynamics that can drive decisions in work practices and organizational design. For the purposes of making the findings explicit, the elements of choreography were separated out into discrete activities to aid the detailed view; however, cognitive work during anomaly response does not work this way. Many current incident response models oversimplify the dynamics of coordination by separating coordination functions into a class of activity distinct from the cognitive work - that is, assuming you can separate “managing the incident” from “thinking about or understanding the incident”. This, however, is wrong and fails to account for a fundamental principle in joint activity - coordination costs. You cannot assign the costs to a single party; all participants must exert effort into coordination. The two key issues then become: 1) are you getting value from the coordination, and 2) how do you best control excessive costs? So while some degree of success may be found in attempting to ‘offload’ the coordination costs to a single or a limited set of roles, as the complexity, the pace and the number of roles involved increase, the necessity for all participants to actively carry out the elements of choreography increases.

To explain this further, it is worth looking at attempts to control the costs of coordination by assigning coordination as a function to specific roles. The models identified in the preparation studies are used here as each purports to efficiently manage the demands of an incident response. The proponents of these systems advocate for their utility. However, upon closer examination, there are problematic aspects of this model for aiding and supporting the cognitive work of anomaly response. As many organizations have invested heavily in training on incident command systems and the model has uncritically spread as the de facto standard, it is worth taking a closer look at its limitations in supporting cognitive and coordinative demands.

Limitations of the Incident Command model

This model is based on a structure popularized by large-scale disaster recovery agencies where multiple, cross-scale organizations conduct multi-threaded joint activity in time pressured, degraded conditions. In this way, the analogy is a good one for critical service delivery. However, this model breaks down in important ways for the speeds, scales, complexities, and types of problems faced in digital service incident response. In non-routine and exceptional outages, managing the cognitive (technical problem) and coordinative work is demanding. Keeping pace with the rate of cascading failures and the resources at hand quickly overwhelms the capacity of any one person. Therefore, the IC becomes a bottleneck as instructions sent out to responders must be acted upon and then reported back to a single point of contact. The IC’s cognitive capacity to receive incoming information (whether of new developments, of the status of an action or of the outcomes of an instruction acted upon) is limited regardless of how proficient they are. In addition to limits on receiving incoming information, the effort of tracking and sequencing instructions increases as more responders join. Even in models where another layer of ‘management’, such as an Ops or Tech Lead, serves to help coordination, similar issues can arise.

Figure 6.1 The Incident Commander becomes a bottleneck


Problems with using the incident command system (ICS) are not limited to software engineering and have been noted before (Jensen and Waugh 2014). Branlat & Woods (2011) note, “the bottleneck illustrates issues associated with purely hierarchical control structures: operators more directly in contact with the controlled process lack authority and autonomy, and the required transmission of information between layers of the system is inefficient and is a source of bottlenecks” (p.20). In high tempo events, or those with multiple cascading failures, delayed action can exacerbate the problem. As Grayson (2018) notes, “the pace of events matters a great deal in anomaly response, since all systems have finite resources and limited capacity to handle changing load” (p.8). Excess coordination overhead depletes resources and the capacity to cope with escalating situations. An example of how practitioners circumvent the extensive costs inherent in the Incident Command model is the Allspaw-Collins effect (named for the engineers who first noticed the pattern). This is commonly seen in the creation of side channels away from the group response effort, which is “necessary to accomplish cognitively demanding work but leaves the other participants disconnected from the progress going on in the side channel” (Maguire, 2019, p.81).

Figure 6.2 Side-channeling as coordination costs rise.

Controlling the costs of coordination

Side channels happen both virtually, using direct messages, separate channels, phone calls, or text messages, and physically, by reorienting away from the group in a shared physical space. These adaptations are in response to excessive coordination costs. While locally adaptive (enabling the responders to carry on with their tasks), this becomes globally maladaptive when it goes on for too long without updating or synchronization. Currently, the discovery of side-channeling leads to admonishment and calls for a renewed discipline to follow the planned ICS structure. However, this is counterproductive. Rather than eliminate these necessary spaces, shared efforts must be made to both share and elicit updates or otherwise track the side channel activity more closely. This example highlights two important aspects of this discussion. The first is that while the ICS structure provides value in creating a repeatable pattern for incident responders to adhere to, upon critical examination, the practice is found to inadvertently increase the costs of coordination by forcing the responders to match the tempo of the IC as opposed to a reciprocal matching of tempo and demand. The second is that the cognitive and coordinative work of anomaly response is critically important to consider in designing work practices, team interactions and tooling. It is worth providing a few more general examples of how practitioners managed costs relative to the cognitive work before moving on.

The corpus also revealed examples of established patterns for controlling costs of coordination. Evidence of the four strategies for controlling overload identified by Woods & Hollnagel (2006) was seen in the corpus: shedding load, reducing thoroughness, recruiting resources, and shifting work in time. Practitioners shed load by decreasing or eliminating updates to stakeholders (seen predominantly in cases 1, 2 and 4) or by simply not completing tasks; they reduced thoroughness by making tradeoff decisions to get the system back up without having all data recovered (as in cases 2 and 3) or without performance issues fully resolved (as in cases 1 and 4). In all cases more resources were recruited to handle workload, bring fresh perspective and bring specialized expertise, and some degree of work was shifted in time. In addition to creating side channels and preserving capacity, two other strategies individuals used to control costs of coordination during anomaly response were finding workarounds and employing flexible choreography. The latter will be discussed in detail later in this chapter.

Interestingly, decisions to preserve capacity as a strategy for controlling the costs involve deferring use of available resources. This was seen in declining offers from other responders to join the event or in sending responders home and rotating coverage in shifts when anticipating long duration events. When coordination from others is needed but unavailable (because it is being deferred or restricted - often by cross-boundary parties), responders adapted by finding workarounds. Workarounds almost always add load to the responders as they are developing a contingency or safeguard without all relevant information. This was seen in multiple cases when the additional perspectives needed to move the response effort forward were unavailable at the point in time needed. The absence of these perspectives represented a significant cost of coordination - the overhead of dependency, particularly on another party who refused to participate in, or limited their efforts in, coordinated joint activity.

These findings are salient examples of how, as Patterson & Woods (2001) note, escalation in the demands of an event exacts a penalty for poorly designed coordination. I’ll make two points here about poorly designed coordination. The first is that many efforts seen in critical digital infrastructure to provide scaffolding for event management (through the use of tools and practices) do provide some degree of support. This is evidenced by the stark contrast provided in case 2, where the lack of coordination design generated multiple additional demands on the responders as they were required to build coordination support in real time (such as forming new chat channels for the ad hoc grouping, establishing common ground across diverse parties and developing new patterns for updating stakeholders). So some parts of the practices and models of incident response provide some value; however, they remain very much inadequate for properly supporting the demands of cognitive work in anomaly response. And to make the point explicitly - the more complex the problems and the faster the speeds and scales of problems, the less likely existing models will support the challenging cognitive work in restoring the service. If too rigid a coordinative structure, as in the Incident Command model, increases the costs of coordination, but too little coordinative structure, as in case 2, can generate additional demands, what alternatives exist for supporting coordination in dynamic, uncertain, and risky conditions?

A promising alternative is exemplified by two closely related elements of choreography - delegation and taking initiative. Many models of incident response suggest that the incident commander delegates tasks to responders who passively await instructions. While there was evidence of the incident commander assigning responsibilities, the more commonly observed practice was of incident responders taking initiative and verbalizing the tasks they were going to do. This action lowered the costs of coordination for the incident responder and avoided a workload bottleneck. What is implicit in a responder appropriately taking initiative is well-established and maintained common ground amongst the group, individuals tracking others’ activities and a commitment to controlling the costs of coordination for others. This is separate and distinct from the noted pattern of ‘freelancing’ (Beyer et al., 2016) where self-assigned tasks are dysfunctional and counter-productive to joint activity. This kind of coordinative interplay decreases the amount of effort required for the delegation of tasks.


Similarly, synchronization of activities can improve when participants in joint activity are tracking others’ activities more closely. Smooth coordination was also supported by pre-emptively controlling the costs of coordination for others (by deferring requests, taking on additional work another would otherwise have to do or blocking access to parties looking to interrupt the response). In turn, this reduces the possibility of coordination breakdowns brought about by typical responses to saturation: role retreat, shedding load or reduced quality of work. Distributing these functions across responders takes advantage of collective expertise through distributed cognitive capabilities and supports the active, engaged participation exhibited in high performing teams. Subtle but dynamic reconfiguration to address coordination demands exploits the continuously shifting availability of cognitive resources and multiple vantage points on the system.

The cost of cross-boundary coordination

How widely the functions should be distributed is relative to the boundaries of the system. Earlier in the discussion, it was noted that the unit of analysis needs to go beyond individual strategies for controlling the costs of coordination to look at the interactions between parties. This means extending the boundaries of the system to include parties whose action or inaction places pressure and/or constraints on the responders responsible for service reliability. A key finding from taking an integrated view of coordination is that non-collocated responders working on difficult problems at speed and scale adapt to rising costs of coordination by shifting coordination demands in time and across boundaries. As shown in the corpus, this can be practice driven (such as the use of ticketing systems to manage coordination) or structural, related to architectural choices (such as microservices) and the support agreements between client and vendor organizations. Despite the need to look at a broader system for insights into designing better coordination, cross-scale analysis is complex and becomes unwieldy quickly - a substantial reason why many models of system behavior either avoid it or simplify it. For the purposes of this discussion, I will focus primarily on a smaller subset of interactions, but it is worth first following this concept through to show how multi-role, multi-echelon systems and their corresponding temporal scales influence coordination. By framing the coordinative landscape as part of a broader system, the analysis becomes cross-scale, looking at networks that consider individuals, teams, business units and industry partnerships. In doing so, we then must look at a variety of tempos to understand how coordination works (and doesn’t). Individual responders responsible for resolving an outage experience the immediate and moment-to-moment demands produced by the event - the uncertainty, risks and time pressure. Responders act at the speed of the event as allowed by cognitive and coordinative demands. Others in the network removed from the immediate demands of the event may act according to other drivers, such as dependent services that re-prioritize multiple competing demands relative to the service level objectives outlined in contracts or according to the level of support targets embedded in the service agreement. In this way, while the tempo of the actual event may be measured in microseconds and seconds, with the responsible engineers working to match this pace, the tempo of action from others may be measured in minutes and hours.

Figure 6.3 Multi-role, multi-echelon coordination

Even more distant parties’ actions relative to the event may exist on much slower tempos, such as senior level decision makers whose decisions to delay hiring more support engineers may manifest over months or years. Another example is a regulator conducting a year-long investigation into a financial market trading software incident that resulted in viability-crushing losses and massive instability in the market. These are both examples of influential coordinative units that are distant from the moment-to-moment incident responders but whose actions nonetheless affect and add costs to the other parties in the network. Their slower tempo for action means increased costs of coordination (in the form of additional efforts for workarounds while waiting for contributions from other parts of the system).


With the broader system laid out, the rest of the analysis in this section will focus on a smaller, highly critical subset of coordination for incident response. In this case, the coordination across: i) teams of responders responsible for service delivery, ii) responders responsible for dependent services (both those the service depends on and those who depend on the service), iii) the vendor support team whose software influences performance in the large-scale information technology system, and iv) automated agents.

Figure 6.4 The cross-scale interactions of interest

This is reasonable given that these dynamics are central to the questions posed in this dissertation and the selection of cases in the corpus. What emerges from the case findings is a nested series of relationships of particular interest to this analysis. These are the interactions between users and the support engineers responsible for service reliability as well as the interactions between these responders and other support teams.


Figure 6.5 Nested series of relationships

To illustrate the dynamic of cross-boundary coordination costs through the lens of the nested series of relationships it is worth returning to the escalation model outlined in the findings from preparation study 1.

Figure 6.6 Escalation

In this method of organizing, an incident reported by a user experiencing difficulties moves through levels of triage involving multiple handoffs, from lower levels of diagnostics with limited visibility into the system to more sophisticated techniques where higher skilled responders deal with novel and/or high severity issues. If a user pushes code into production and their build hangs, they are likely to reach out to the internal team who owns the service for help (Recruitment 1). This team will begin to investigate and, if having difficulties, may recruit other more specialized internal resources to assist (R2). Upon discovering the issue is high severity, this team further recruits specialized skills (R3) while simultaneously opening a ticket with vendor support (R4). A similar escalation needs to occur on the vendor side before appropriate resources are deployed to help. These handoffs, even if escalated quite quickly in the face of an obvious high severity issue, delay recruitment of needed resources in both obvious ways (it takes more time to escalate through multiple layers) and less obvious ways. Findings showed higher level responders will typically repeat diagnostic tests completed earlier to ensure the results from those tests were trustworthy (i.e., that they were done properly). There are two ways this is interesting. The first is the need to trust the data: even though it takes extra time, the higher-level responders repeat previously completed tests. It also gives an indication that there is something important about having a basic context for the problem that may be instrumental for helping diagnose more substantial issues. Another way of saying this may be that simply reading the handoff notes from a lower level responder may be insufficient for a highly skilled responder to get the context needed to perform higher level diagnosis.
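As an illustration only, the way delay accumulates across these handoffs can be sketched as follows. The tier labels follow the R1-R4 recruitment steps above, but the per-tier engagement times and the repeated-diagnostics penalty are invented numbers, not measurements from the corpus.

```python
# A minimal sketch of cumulative handoff delay in the escalation model.
# Each tier adds both waiting time to engage and repeated diagnostics
# (re-running earlier tests so the data can be trusted).
ESCALATION_TIERS = [
    ("R1: internal service-owning team", 10),   # assumed minutes to engage
    ("R2: specialized internal resources", 20),
    ("R3: further specialized skills", 15),
    ("R4: vendor support ticket", 60),
]
REPEAT_DIAGNOSTICS_MIN = 15  # assumed time re-verifying earlier results per tier

def time_to_full_engagement() -> int:
    total = 0
    for tier, engage_minutes in ESCALATION_TIERS:
        total += engage_minutes + REPEAT_DIAGNOSTICS_MIN
        print(f"{tier}: engaged at ~{total} min into the incident")
    return total

if __name__ == "__main__":
    print(f"Total: ~{time_to_full_engagement()} min before all tiers are engaged")
```

The structural point is that the delay is not just the sum of queue times; each boundary crossing also re-incurs the cost of establishing trust in the data and context.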

There are two critical attempts to control costs of coordination across both these sets of relationships. One effort is setting up coordination devices and mechanisms to structure how support is initially sought. By setting up robust user communities within a company messaging platform or a web-based message board, the issue may be addressed through responses from peers in the user community. This structure, common in software engineering, offloads the costs of coordination to the user community and the support engineers tasked with monitoring the chat and responding as needed. A second method is using ticketing systems, which are very adept at controlling the cost of coordination for the responder by pushing the cost to the user. The ticketing systems often provide very little observability into the status of the incident beyond high level labels like “investigating” or an assigned severity level. It is challenging to see who is engaged, what they are doing, how frequently they are being updated on the changing status of the event, what current hypotheses are under consideration, and which have been discarded and why. A third problematic element is the use of support bundles for transmitting information. These bundle a broad collection of metrics to help with diagnostics - and save the vendor support from having to directly interact with the internal support team. Depending on the size of the system, these bundles can take several hours to run and may further add to performance issues. This process makes managing multiple competing demands much easier on the vendor side but adds delay. Very few groups tasked with reliability are going to wait for assistance without taking action to safeguard the system or attempting further diagnostic tests. Meanwhile, the practice of escalation ultimately adds costs of coordination to the user who is experiencing the pressure of the performance issues minute-over-minute and, accordingly, needs to take action. Their actions are workarounds based on incomplete knowledge (since they lack the necessary perspective from the support team on the other end of the ticketing system), which at best are successful and at worst can exacerbate the problem they are trying to fix. This also adds costs for both reliability support teams as incoming responders are behind the progression of the on-going incident and need more input to come up to speed. These strategies incur additional overhead for other practitioners - a locally adaptive but globally maladaptive practice.
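The observability gap described above can be made concrete with a minimal sketch contrasting what a cross-boundary ticket typically exposes with what responders need in order to coordinate. The field names are illustrative assumptions, not the schema of any particular ticketing product.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VendorTicket:
    """What the boundary-crossing artifact usually exposes."""
    status: str = "investigating"   # a high-level label is often all that is visible
    severity: int = 1

@dataclass
class WhatRespondersNeed:
    """The state responders must infer or work around while the ticket is open."""
    who_is_engaged: list[str] = field(default_factory=list)
    current_activities: list[str] = field(default_factory=list)
    hypotheses_under_consideration: list[str] = field(default_factory=list)
    hypotheses_discarded_and_why: dict[str, str] = field(default_factory=dict)
    last_update_to_vendor: Optional[str] = None

# The gap between these two views is the observability cost pushed onto the
# user side of the boundary: responders act on incomplete knowledge while the
# ticket sits in "investigating".
```

The design choice being illustrated is where the cost lands: the thin interface is cheap for the party answering tickets and expensive for the party waiting on them.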

The assumption built into this model is that the collaborative interplay between responder skill sets, experience and knowledge across different teams is not important to support; a single responder (or a group of responders from one particular service) should be able to diagnose an issue with interactions being mediated through the ticketing tool. In a system undergoing continual change this is a gross oversimplification. As noted, “as the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly” (Woods, 2017, p.1). Coordination across these boundaries is an issue of central concern to modern information technology intensive organizations. Both microservice architectures and the dependencies created by using a “service” based IT infrastructure create critical dependencies for both the user and the vendor of these kinds of software. They are critical to the user for obvious reasons, as the services represent necessary productive functions in the business that depend on their near constant availability. They are also critical to the vendor because this dependence means the client is paying not just for the product but for its reliability. Therefore, reliability matters in ways that can be viability-crushing for the vendor if the expectations for reliability are violated too severely. This means reconsidering how to control costs of coordination is not just of benefit to front line responders but ultimately cross-scale as well. Existing structures designed to provide interpredictability about cross-boundary coordination, such as the SLOs and SLAs discussed in Chapter 3, appear to meet the criteria of the Basic Compact (first discussed in Chapter 2) but actually do not. A contractual obligation to respond to requests for help within a set threshold does provide some degree of goal alignment, but there are no formal mechanisms to allow the vendor to relax “shorter-term local goals in order to permit more global and long-term goals to be addressed” (p.7), nor is there any stipulation to “try to detect and correct any loss of common ground that might disrupt the joint activity” (p.7). It is implied but not explicit in the service reliability arrangements (indeed, it would be difficult to describe the Basic Compact in legal contracts!). Instead, it depends on the willingness of individual vendor support engineers to make an exception to meet the requirements of the joint activity. Therefore, different perspectives that have varying views on the system, assembled in the right collaborative interplay, are critical to forming a coherent view of the event. Alternative models are needed to aid software engineers with managing the cognitive and coordinative demands of anomaly response in complex systems.
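As a rough illustration of that gap, the part of the cross-boundary relationship a contract can mechanically encode looks something like the following; the severity tiers and response targets are invented, not taken from any actual agreement, and nothing in the check speaks to the Basic Compact’s requirements.

```python
from datetime import timedelta

# Assumed, illustrative contractual response targets by severity.
SEVERITY_RESPONSE_SLA = {
    1: timedelta(minutes=30),
    2: timedelta(hours=4),
    3: timedelta(days=1),
}

def sla_met(severity: int, time_to_first_response: timedelta) -> bool:
    """The only cross-boundary commitment the contract can mechanically verify."""
    return time_to_first_response <= SEVERITY_RESPONSE_SLA[severity]

if __name__ == "__main__":
    # The vendor "met the SLA" here, yet the check says nothing about whether
    # their engagement was calibrated to the joint activity underway, whether
    # local goals were relaxed, or whether lost common ground was repaired.
    print(sla_met(1, timedelta(minutes=25)))  # True
```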


Chapter 7 Adaptive choreography

The elements of choreography laid out in Chapter 5 consolidate several important themes from the literature: aspects of coordination in joint activity, mental models amongst distributed teams working on complex, adaptive systems, and the cognitive work of distributed anomaly response for large-scale systems. The synthesis of these themes happens first by laying out in detail the functions needed to successfully coordinate cognitively demanding work under conditions of uncertainty, risk and time pressure, and then, in Chapter 6, by showing how these elements are enacted continuously and across the distributed groups to adjust performance to problem demands. An analysis of the limitations of two common current models of organizing during incident response made explicit the ways in which efforts to control costs of coordination can have the unintended consequence of i) adding cost to the coordinative efforts and ii) pushing those costs across boundaries within a system of interdependent parties. This contribution has produced a more nuanced and dynamic view of how practitioners cope with the need for variable interactions over time. This chapter extends this analysis by providing a model for designing interactions across non-collocated teams. In the model, coordination is based on the elements of choreography being dynamically distributed across boundaries in the Adaptive Choreography framework. Adaptive Choreography is a framework outlining the criteria and attributes (Klein et al, 2005; Johnson et al, 2011), scaffolding and elements of choreography needed for smooth coordination. In dynamic, rapidly changing events responders were seen to continually and flexibly enact different elements of choreography throughout joint activity and across the distributed team. The proposed framework of Adaptive Choreography provides promising design directions for organizing, training and resourcing both consistent and ad hoc teams for the purposes of incident response coordination.

The framework for Adaptive Choreography

The framework for Adaptive Choreography builds off past work in joint activity (Klein et al, 2005) to offer an alternative model for incident response. The elements of choreography are seen as the central features that are distributed across the engaged parties in dynamic reconfiguration to fulfill the functions of cognitive and coordinative work. It proposes reframing the command and control structure inherent in models of Incident Command to instead take advantage of distributed cognitive and coordinative capabilities while being supported by appropriate scaffolding and fulfilling the basic requirements for joint activity. For example, a key characteristic of the Incident Command System (and its software interpretations) is the importance placed on the Incident Commander, with extensive training programs focused on improving their abilities to handle the coordinative demands of the event. However, this model implies incident responders are passive ‘followers’ of the IC’s leadership. The findings of this research show definitively this is not the case. Whether acknowledged or not, all parties are actively involved in managing the coordination, albeit in a subtle and difficult to detect fashion. The anticipatory actions of individuals in a joint activity to ‘prepare’ themselves for their part in the shared effort, and the ways in which they re-adjust, re-calibrate and re-configure continuously, are clear evidence that the costs of coordination are borne by all - not just an assigned role. When integrated as part of the cognitive efforts, the coordination becomes less explicit and instead manifests in small adjustments to the interactions across the parties.

There is precedent for this distributed, small scale collective effort. A strong finding from both the preparation studies and the corpus was that engineers involved with service delivery maintain an on-going and substantial level of awareness about their service. This is accomplished by continuous, small, lightweight ‘checks’ that happen regularly during work hours and even intermittently during off-hours. Many of the engineers studied push monitoring alerts to their smartphones and, even in off-hours, will glance at them regularly. When the alert indicates expected performance, the engineer ignores it. When the alert is unusual, even if it does not trigger a page out (meaning it has not reached a pre-determined threshold for paging responders), engineers will begin formulating hypotheses about where the source(s) of the problem may lie. This may be simply thinking it over and trying to connect their knowledge of recent or expected events (“are we running an update this weekend?”) or it may involve gathering more information (checking dashboards, looking in channel to see if there are any reports). To be clear, I am not suggesting this should become the standard; I’m suggesting it is the standard. As Richard Cook notes, in reflecting on the data collected from the SNAFU Catchers Consortium: “It’s hard to know who is monitoring the system, who is looking at things, and who is available. We know for certain in (some organizations) that there is a pretty well-defined group of people who are in high-frequency update mode more or less continuously while awake. This is a huge contributor to reducing the costs of coordination and it is invisible to everyone and therefore not on any dashboard.” (Cook, personal communication) Organizations reap the benefits of this additional monitoring while rarely acknowledging the efforts involved. This investment in maintaining a sense of the system’s behavior (common ground) helps responders maintain their mental models. When an incident occurs, their ‘currency’ about system status may only be a few minutes or hours old as opposed to days (if over a weekend). This is not an assigned activity but one that likely has substantial benefit to the engineer’s ability to come up to speed quickly. The organization reaps this benefit without cost.

The last point to be made here is that a continual capacity to adjust to the coordination demands must be appropriately scaffolded. The scaffold of joint activity is a function of both the tacit devices - agreement, convention, precedent and salience (Klein et al, 2005) - and the adaptive mechanisms - tooling and practices - that provide affordances to know when choreography needs to be dynamically adjusted. This scaffolding can be embedded in the software itself, in the expectations for how roles interact, or in the mechanisms that enable a fluid shift from one medium to another. An example is the use of chatops user channels to provide a consistent forum for receiving early indications of issues through informal user reports or for providing updates to users as performance issues continue. Another example is that, in co-located teams, the availability of physical or virtual spaces can serve as dynamically reconfigured ‘control rooms’ depending on the specific needs of an event. In this way, the scaffold provides a semi-structured basis for the response efforts while not enforcing rigidity.
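The paging-threshold behavior described above can be sketched as follows. The thresholds, metric name and wording are invented for illustration and do not reflect any particular team’s monitoring configuration.

```python
# A sketch of the lightweight off-hours 'checks': alerts are filtered by a
# paging threshold, but engineers also glance at sub-threshold alerts and
# begin forming hypotheses before anyone is formally recruited.
PAGE_THRESHOLD = 0.95      # assumed error-rate level at which responders are paged
UNUSUAL_THRESHOLD = 0.50   # assumed level that is below paging but worth a look

def triage_alert(metric: str, value: float) -> str:
    if value >= PAGE_THRESHOLD:
        return f"PAGE: {metric}={value} - responders notified"
    if value >= UNUSUAL_THRESHOLD:
        # No page goes out, but the engineer who glances at this starts
        # connecting it to known context ("are we running an update this weekend?")
        return f"unusual: {metric}={value} - worth checking the dashboards"
    return f"expected: {metric}={value} - ignored"

if __name__ == "__main__":
    for metric, value in [("error_rate", 0.2), ("error_rate", 0.7), ("error_rate", 0.97)]:
        print(triage_alert(metric, value))
```

The middle branch is where the unacknowledged investment lives: it never appears in paging metrics, yet it is what keeps responders’ mental models only minutes old when an incident does start.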

The dynamics of coordination in anomaly response

In the last chapter, the discussion outlined some specific examples of the dynamics of coordination using a few different elements of choreography. These illustrative samples begin to frame out how cognitive and coordinative work is tightly coupled and why, in large-scale distributed systems, this coupling is a critically important lens for designing quicker, more precise and smoother coordination amongst software engineers tasked with system reliability. To fully appreciate how the flow of attention across multiple, often concurrent, activities takes place in a distributed work setting, it is useful to look at how the response efforts are governed.

Based on the findings of their simulation study on cyber security teams responding to potential attacks, Branlat & Woods (2011) draw the following conclusions:


“Similar to other domains related to emergency response, cyber defense would benefit from implementing polycentric control architectures (Ostrom, 1999; Woods and Branlat, 2010). This research emphasizes that lower echelons, through their more specific set of competences and direct contact with the controlled process than remote managers, develop a much finer knowledge of the process behavior. This knowledge allows them to detect anomalies early, thereby making them more able to adjust their actions to meet security or safety goals. That being said, both purely centralized and decentralized approaches are likely to fail (Andersson and Ostrom, 2008); they simply fall victim to different forms of adaptive challenges. In the domain of cyber security, in particular, systematically reducing vulnerabilities identified is not a viable strategy since it threatens other, production-related, goals (p.20).”

The conclusion that governance needs to strike a balance between centralized and decentralized control has direct implications for critical digital infrastructure reliability. In the digital systems domain, coordination designers can exploit the capabilities of modern tooling to support distributed work. Doing so can generate a coordination structure that amplifies the value of coordination while minimizing the costs. A core principle is to take advantage of shifting capacities for action across the distributed group. The framework allows for adaptability to meet problem demands and resource fluctuations while pushing initiative away from an IC role to the responders to reduce workload bottlenecks. This allows participants to invest in the overall group’s performance by tailoring their individual performance based on their changing capacities. Their changing capacities are based on a number of factors. The first is the availability of cognitive resources (are they completing cognitively demanding activities that are degraded by interruptions or task switching, or are they routine tasks that can easily be interleaved with other work?). A second factor is their ability to minimize other demands. Incidents are unplanned and interrupt other activities important to the on-going development and maintenance of the system. Some of this work, once started, must be followed through to completion. The responder may need to make tradeoffs to split attention across multiple demands, decreasing their capacity to fully engage in the incident response. Another factor is attentional control (are they capable of concentrating at that point in time or are they distracted, overwhelmed, fatigued?).

While the elements of choreography must be substantially fulfilled by the coordinative network, obviously, not all participants need to be exhibiting all elements at all points in time. The interplay of the group as a whole must ensure that the requirements are met. For example, in groups where less investment has been made in establishing common ground, greater efforts are required to bring incoming participants up to speed with relevant details. More effort will be expended maintaining and repairing common ground. It becomes more difficult to integrate new perspectives because what is known and unknown is unclear and emergent. This is particularly common when crossing inter-organizational boundaries with groups that are not frequently engaged together in joint activity. In cases where it was understood that coming up to speed would be effortful, responders adapted by taking initiative and providing additional information in anticipation of the need to establish more common ground. This is distinct from providing a situation representation (‘sit rep’) – where the information transmitted is typically proximal in time and space to the incident. Being adaptive with choreography recognizes that, without established common ground, contextual factors relevant to understanding the nature of the problem may need to be included to help make incoming responders immediately useful. What differentiates adaptive choreography is that it is difficult (and sometimes wasteful) to be prescriptive about what action is appropriate in an emergent, dynamic situation. Instead, action is context dependent and members of the joint activity emphasize or de-emphasize the patterns of activity depending on the needs.

Re-imagining incident management practices

Given the above discussion, it is useful to consider how these findings can provide constructive guidance for reimagining incident response. This section examines the context of incident management, the potential for change and some promising directions for re-imagining incident management in critical digital infrastructure.

Modern incident management is assumed to be non-collocated and distributed across inter- and intra-organizational boundaries. Groupings of responders are both formal and consistent (formed by a ‘core’ group that is typically the team responsible for the service) as well as ad hoc and emergent (reconfiguring as individuals and teams from other dependent services or core business functions join in when their skills or perspectives are deemed useful). Even in well-balanced and thoughtful organizations, there will always be elements of ad hoc groupings. Some responders may necessarily be working on other high value activities when an incident occurs (such as making architectural changes or system updates, attending conferences to share knowledge or coaching junior employees). Service outages are a fact of life in 24/7 operations, so with the exception of clearly high severity outages, recruitment may be delayed (and in some cases rejected) so as to make progress on other important organizational priorities. In addition, there is a relative scarcity of specialized expertise, so creating redundancy by employing multiple people in the same roles (even if economically viable, which it rarely is) can be challenging to meaningfully sustain. Because of this, the usual approaches to business continuity, such as training of “teams” or carefully orchestrating a diversity of skillsets within a team, will always fall somewhat short. Organizing to enable rapid reconfiguration of ad hoc groupings, and then preparing those individuals to better coordinate, provides a better chance of ensuring coordination can continue in conditions requiring adaptation to the availability of resources and adjustment of roles.

An aspect of the software engineering domain that allows for more dynamic modes of coordination is that a shared base of skills, along with the concurrent ability to access and interpret logs or dashboards, means there is greater fluidity for tasks to be adopted by others. This overlap of specialized skills and the ability of response teams to maintain common ground in real time (afforded by the tooling) and with relatively low overhead represents a substantial opportunity for the domain to design new models for incident response. The next section lays out the ways in which roles change in an Adaptive Choreography framework.

The changing nature of roles

The nature of joint activity means collectives that form for the purpose of responding to a specific service outage (such as an incident response team) will naturally retain some degree of the authority and power, the capacity to leverage or allocate resources, and the expectations of their role ‘outside of the incident’. However, this is not a 1:1 match. A high tempo, high consequence outage engenders reconfigurations: managers deferring to their engineers; users becoming a part of the incident response as they gather and provide diagnostic data; or vendor support engineers working under the direction of a client’s incident response team. Some aspects of existing roles are subordinated in service of the immediate demands of the incident. This dynamic shifting of roles (and the activities or authorities of those roles) actually aids in smooth coordination. This discussion will begin with a review of the “supporting cast” before shifting to the incident responders themselves.


Figure 7.1 Adaptive Choreography as an incident response model

Many of the roles traditionally held in incident command structures are maintained in name but transformed in practice within Adaptive Choreography. These roles provide value in aiding the incident response by providing a focal point for peripheral stakeholder engagement (management, user groups, communications or product-oriented roles, for example). The IC, for example, becomes one of several supporting roles. With greater autonomy pushed to the responders, the IC’s role shifts from commander to coordinator – providing support and helping to track activities, secure new resources or act as a buffer to the outside world. This last function can help prevent overload by helping to pace the tempo of the event. For example, the IC can assess the urgency of incoming information and then choose to wait until lower tempo periods before presenting it to the responders to integrate new, outside perspectives. Other core functions are monitoring cognitive and coordination demands, tracking activities underway for the purpose of cross-checking or sequencing, and identifying any gaps in the existing work. For example, a team that has had difficulties diagnosing an outage has stopped taking safeguarding actions that may temporarily resolve the service issue so that they may gather more information about the nature of the problem itself to help generate a sustainable solution. During the incident, a VP who is on-site with a major client directs the IC to restore service even if it means the problem cannot be identified and may return at a later point in time. The IC, operating in a supervisory control function, probes the responders to elicit the full extent of the tradeoff decision being asked of them. If the benefits of continuing to incur some degradation to the service outweigh the reputational cost to the VP, the IC can then serve to negotiate the directive by transmitting information on the implications of restoring service to overall system health. In this way, the IC’s role can mediate potentially locally adaptive but globally maladaptive actions by broadening the scope of analysis and preventing local needs or power dynamics from disrupting the best course of action. While some degree of activity tracking, delegation and synchronization is still performed by the IC role, it is not the primary means of coordination – the Adaptive Choreography amongst responders is primary. The IC’s function is not to direct activities but rather to watch for progressions that indicate the response group is not well calibrated to the broader organizational goals, priorities or demands, and then help revise plans, recruit or reallocate resources accordingly. There is a sentiment within the field that the IC is simply a high-level coordinator position (some even claiming this role does not need technical knowledge of the system), but what is being proposed is qualitatively different. The IC requires technical knowledge to be able to interpret the activity underway and its implications, to provide oversight and to ask pertinent questions to help reframe the responders when course corrections are needed.

In Figure 7.2, a number of roles are seen bridging the space between the core response team and the stakeholders. These are active participants whose primary roles are ones of support. This is a subtle but non-trivial departure from many existing practices. Take the role of the manager whose team owns the service experiencing the outage and whose engineers form the basis of the response team (this role is called Service Owner Management in the figure). A typical pattern is to stop or slow response efforts to give a status update to the manager, which incurs a cost to the responders. By prioritizing the needs of the incident responders and deferring to a support role, the manager instead invests effort in looking in and listening in to maintain awareness of the event’s progression. In this way, they are able to offer early course corrections, approve needed resources to be reallocated or otherwise anticipate how to better support timely resolution.

Similarly, in instances where a third-party provider’s insight and perspective is needed, the ability to have vendor resources actively participating in the incident lowers the cost of coordination for both parties. Of course, there is an overall cost to third parties or internal service providers of having sufficient engineering resources to join client response efforts. However, in highly interdependent systems where cross-boundary observability is limited (one may only be able to see if the service is up or down) and where unexpected interactions or undisclosed code changes may be contributing factors to an outage, it is crucial to have greater cross-boundary coordination. While many vendors offer access to greater levels of support for additional cost, this is disingenuous: if the vendor’s current pricing model does not account for the access to information and expertise necessary for complex incidents, then they are misrepresenting the true cost of that dependency. Other stakeholder roles looking and listening in to the response – such as client support, users (of an internal service), or other dependent services – incur a low cost of monitoring the responders’ activities to better anticipate what, if any, adaptations are needed to mitigate any follow-on effects from the outage.
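
The limit on cross-boundary observability noted above can be illustrated with a small sketch: an external consumer of a vendor’s service can often query only a coarse status signal, while the vendor’s internal telemetry (latency, saturation, recent changes) remains invisible. The endpoint URL and response shape below are hypothetical assumptions for illustration only.

    # Sketch of the limited view available across an organizational boundary.
    # The URL and JSON shape are hypothetical; calling this requires a real endpoint.
    import json
    import urllib.request

    def vendor_status(url: str = "https://status.example-vendor.com/api/v2/status.json") -> str:
        """Return the coarse up/down-style indicator that is all a consumer can see."""
        with urllib.request.urlopen(url, timeout=5) as resp:
            payload = json.load(resp)
        return payload.get("status", {}).get("indicator", "unknown")

Anything finer-grained than this coarse indicator typically requires the vendor’s engineers to join the response and share their internal view directly, which is exactly the cross-boundary coordination argued for above.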

Responders as self-organizing units

Instead of being passive recipients of instructions from the IC, the responders are engaged in actively and fluidly fulfilling the choreography requirements. As laid out in Chapters 5 and 6, these include taking on and shedding tasks, prioritizing actions, synchronizing the cadence of their work, and informing others about it as their ability to manage load changes moment to moment. Allocating some attentional resources to tracking others’ activities, assessing their own capacity to increase or decrease involvement, and taking initiative while signaling their actions does add some additional workload, as described. However, this continuous, lower intensity effort incurs the costs at a more predictable and dynamically adjustable level than in other models, where sudden or persistent coordination breakdowns take much more effort and time to repair.
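
One way to make this signaling concrete is to treat it as a lightweight claim-and-release pattern over a shared forum. The sketch below is an idealization for illustration only; the class, task names and responder identifiers are assumptions, not a description of any tool observed in the cases.

    # Idealized sketch of responders fluidly taking on and shedding tasks by
    # signaling in a shared forum. Not a description of any tool from the corpus.
    from dataclasses import dataclass, field

    @dataclass
    class IncidentBoard:
        open_tasks: set = field(default_factory=set)
        claimed: dict = field(default_factory=dict)   # task -> responder
        log: list = field(default_factory=list)       # the visible signaling channel

        def announce(self, task: str):
            self.open_tasks.add(task)
            self.log.append(f"task opened: {task}")

        def claim(self, responder: str, task: str) -> bool:
            # Taking initiative: claim only if the task is still unclaimed.
            if task in self.open_tasks:
                self.open_tasks.discard(task)
                self.claimed[task] = responder
                self.log.append(f"{responder} is taking: {task}")  # signaling the action
                return True
            return False

        def shed(self, responder: str, task: str):
            # Shedding load when capacity drops returns the task to the group.
            if self.claimed.get(task) == responder:
                del self.claimed[task]
                self.open_tasks.add(task)
                self.log.append(f"{responder} released: {task}")

    board = IncidentBoard()
    board.announce("check consul logs")
    board.claim("eng_1", "check consul logs")
    board.shed("eng_1", "check consul logs")   # e.g. interrupted by a higher-priority demand
    print(board.log)

The visible log is the important part of the analogy: each claim or release both accomplishes the technical step and, at the same time, informs everyone else tracking the channel.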

The scaffolding for Adaptive Choreography

Tools such as chat platforms, video conferencing and shared collaboration tools (such as Google Docs or virtual whiteboarding) enable distributed work in several ways. The first is that having documents, dashboards or other artifacts and shared frames of reference (Maguire and Jones, 2020) can quickly center the shared activity and better surface any discrepancies in mental models. Doing so is a shorthand way of establishing common ground by making the salient details immediately relevant and providing a basis to add or clarify information. As discussed in Chapter 2, ChatOps platforms enable looking in and listening in (Patterson, Watts-Perotti and Woods, 1999; Woods, 2017). In Adaptive Choreography, stakeholders such as users or dependent service owners can trace the progression of the incident over time by monitoring the incident response channels or subscribing to updates generated by the Comms Lead or Incident Commander and posted to a status page or other shared updating forum (such as an existing user channel or social media newsfeed). This is important in several ways. The first is that looking in/listening in, even when the diagnosis remains unclear, updates their mental models about system performance and may activate knowledge useful to the response effort. In addition, early awareness of the hypotheses being generated, as well as the mitigating actions being considered, can help them pre-emptively adjust their workload and revise or re-plan their own work to accommodate potential impacts. Another valuable aspect of this largely silent monitoring stems from their vantage point on the system, which is, by definition, different from the responders’. They may have access to information about the status of the system or the impacts of actions taken that can help inform the active response efforts. Taking initiative to track what information or perspective would be useful and then providing it at the point in time it is needed, without having to be directed, lowers the cost of coordination for the incident response team. Finally, if outside stakeholders or users join the incident response channel and begin interrupting the activity (say, for example, by prompting for status updates), these ‘bystanders’ can gatekeep by re-directing the interrupter to the updates channel or by providing the support that the responders are unable to at that point in time. Another way the tooling acts as a coordination mechanism is that it provides a semi-flexible shared forum for rapid recruitment, information sharing and searching.
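
As a purely illustrative example of how such updates can be broadcast so that stakeholders can ‘listen in’ without interrupting responders, the following sketch fans a single status update out to subscribed forums. The channel names and publishing interface are assumptions; a real deployment would call the chat platform’s or status page’s API rather than print.

    # Illustrative sketch of broadcasting incident updates to 'listening in' stakeholders.
    # Channel names and the publish interface are hypothetical, not from the cases.
    from datetime import datetime, timezone

    SUBSCRIBERS = ["#svc-users", "#dependent-service-x", "statuspage"]

    def publish_update(summary: str, hypothesis=None) -> dict:
        """Compose an update once; fan it out so stakeholders need not interrupt responders."""
        update = {
            "time": datetime.now(timezone.utc).isoformat(),
            "summary": summary,
            "working_hypothesis": hypothesis,  # even tentative diagnoses help others re-plan
        }
        for channel in SUBSCRIBERS:
            # In a real deployment this line would call the chat platform's API.
            print(f"[{channel}] {update['time']} {summary}")
        return update

    publish_update("Elevated API latency; failover to secondary Redis under consideration",
                   hypothesis="connection pool exhaustion after weekend upgrade")

The design choice the sketch highlights is separation of forums: responders coordinate in their own channel, while a single composed update reaches everyone else who only needs to track progression.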

Summary of Adaptive Choreography

This chapter has laid out an alternative model to the Incident Command System for managing service outages in critical digital infrastructure. In this decentralized model, multiple roles play a support function that serves to protect and enhance the core response group. The dynamics of how coordination during incident response takes place were discussed and the implications of adopting new approaches were explored. The transition from command-and-control to distributed coordination follows from the findings of the corpus of cases and the preparation studies. These findings showed that, somewhat paradoxically, successfully managing escalating events involves integrating the coordinative functions into the cognitive work. Integration streamlines the ability to diagnose, repair and learn from the incident and lowers the costs of coordination. Assessing coordination in this manner can lower costs both locally and in cross-boundary coordination by designing for better cognitive and coordinative support. Reducing unexpected escalations in coordination demands can reduce sudden coordination breakdowns (such as when a party experienced overload and dropped their commitment to the Basic Compact). Considerations for re-designing the interactions between core response teams, vendor support, users and other impacted stakeholders, in ways that take into account cross-scale costs of coordination, were given. While there are still aspects of Adaptive Choreography to extend, this analysis and proposal provide a basis for re-designing incident management practices and developing new tools to support coordination.


Chapter 8 – Conclusion

This work has examined the costs of coordination through the lens of the cognitive work of anomaly response in large-scale distributed work systems. The integrated view of coordination drew from multiple relevant lines of inquiry to forge a synthesized new view of joint activity in large-scale distributed systems. The domain of critical digital infrastructure is a relevant area of study with broader applications for practitioners in other high tempo, interactive failure environments characterized by time pressure, risk and uncertainty. This is both because of the inherent qualities that make it ideal for studying cognitive work (the natural production of artifacts useful for process tracing) and because of its increasing relevance in technology-mediated, distributed systems running at speed and scale.

To situate the corpus of cases in the context of the operational realities that shape and influence the practice of incident management, two preparation studies were completed focusing on the models of organizing incident response and the tooling to support incident resolution. A corpus of five cases using detailed process tracing formed the basis for analyzing the artifacts generated as software engineers coped with service outages. To elicit the data for this study, multiple converging methods and sources of data on critical incidents involving service disruptions were used. These included chat messaging transcripts, audio transcripts from web conferences and audio bridges, log files, dashboard screenshots, post-mortem documents and video recordings of incident responses. Capturing the complexity of in situ work is a challenge for any researcher, particularly for those who are designing and building tooling. Convergence across multiple sources helps to recreate, in part, the nuance of everyday work. In other words, the details matter. This close examination, across multiple data sources, was necessary to identify the ways in which expertise smooths out the ‘rough edges’ of poorly designed work practices or technologies. How practitioners control the costs of coordination is an issue relevant to any domain with interdependent activity. This study offers researchers a promising basis for how to collect data in this setting and take advantage of the affordances that distributed and technology-mediated work provides.
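
In simplified form, this convergence can be thought of as merging time-stamped records from each artifact into a single trace. The sketch below assumes a uniform record format and synchronized clocks, which real artifacts rarely provide without per-source parsing and reconciliation; the timestamps shown are illustrative, not drawn from the corpus.

    # Simplified sketch of converging multiple data sources into one process trace.
    # The record format and timestamps are assumptions for illustration.
    from datetime import datetime

    chat = [("2020-03-01T14:02:11", "chat", "eng_1: cancel and rerun serial")]
    logs = [("2020-03-01T14:01:58", "log", "storage diagnostics started (parallel)")]
    webex = [("2020-03-01T14:03:40", "audio", "IC asks who is generating the support bundle")]

    def build_trace(*sources):
        """Merge time-stamped records from heterogeneous artifacts into one timeline."""
        return sorted((rec for source in sources for rec in source),
                      key=lambda rec: datetime.fromisoformat(rec[0]))

    for ts, kind, event in build_trace(chat, logs, webex):
        print(ts, kind.ljust(5), event)

Interleaving the sources in this way is what lets the analyst see, for instance, that a chat message redirecting diagnostics follows a log event by only seconds, a relationship no single artifact reveals on its own.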


The findings showed that strategies for coordination during critical incidents are highly adaptive, enacted with fluency and dependent on the ability to anticipate and adapt in real time to others’ activities. Sixteen elements of choreography were laid out in detail. These elements outlined the cognitive work required to enable smooth coordination between multiple parties in joint activity. The dynamics of choreography show how these elements are actively managed by each party, depending on their changing capacity. In particular, the relationship between the cognitive and coordinative demands meant responders were better positioned to manage coordinative functions than a single incident command role. In some high tempo and high complexity cases, the incident commander created a workload bottleneck which delayed incident response and drove workarounds (responders moving into side channels) to cope with the cost of coordinating through a single entity.

In the Adaptive Choreography model, a framework for enacting the elements of choreography across both the core responder team and the multi-role, multi-echelon group of stakeholders was provided. A proposed redesign of the traditional incident command structure shifted that role into one of support. In this revised role, the IC provides meta-coordinative supervisory oversight, buffers the group from external demands and tracks activities in parallel, among other functions. This enables the IC to be an informed but distant support whose vantage point across multiple stakeholders is an added benefit to the group’s performance. The redistribution of functions also removes the IC as a bottleneck and allows for more expedient revision of plans and reallocation of resources where needed. Central to this redistribution of activities is the explicit adoption of innate coordinative functions by the core response team. On the surface, this redistribution appears to add load to responders; however, the findings show that in well-coordinated activity this was already implicitly taking place. The additional level of granularity of how to direct attention to coordination provides specifics about what considerations are meaningful to ensure the interactions in joint activity produce desired outcomes.

Collectively, the elements and dynamics of choreography provide a means to design for skilled joint activity in this domain and others. Future research into this topic should focus on studying the conditions under which Adaptive Choreography itself becomes too costly to carry out. Other promising areas of future research include the role of reciprocity, of investing in future coordination, and of aiding response teams to extend their analysis to include reflections on choreography in their post-incident work.


Appendix A – Elements of Choreography

Each element of choreography below is presented with its corresponding overhead costs, who incurs the cost, the temporal scale over which it operates, examples from the cases, and adaptations used to control the costs of coordination (CoC).

Element of choreography: Generalized investments in establishing common ground
Corresponding overhead costs:
● Learning about the organization*
 • Maintaining a sense of organizational demands (including shifting priorities, new management, pressures & constraints for action)
 • Understanding priorities and goal conflicts
 • Learning how priorities typically tend to shift and how quickly/slowly
 • Learning how goal conflicts are typically dealt with
 • Learning formal decision processes
 • Learning informal or role specific decision-making authority
 • Learning who makes decisions and at what speed
 • Learning about dependent units, their role & function (espoused and actual)
 • Determining the boundaries of organizational silos
 • Understanding the implications of organizational silos relative to the joint activity
 • Understanding how historical events influence current coordination
 • Learning about available resources for establishing & repairing common ground (tools, formal protocols, informal practices)
● Establishing knowledge of the ‘others’
 • Learning about what others know
 • Learning about what others can do
 • Learning about who others know
 • Understanding others’ stances toward a problem or system
 • Understanding roles & functions (espoused & actual)
 • Understanding the limits and boundaries of authority & responsibility of others
● Establishing knowledge of the immediate team
 • Learning about the pressures & constraints of the grouping
 • Learning about how the collective grouping responds to these
 • Learning about normative behaviours
 • Learning about handling of exceptions
 • Understanding how roles & functions interact in practice
 • Understanding the limits and boundaries of authority and responsibility of the team
● Establishing knowledge of system performance
 • Learning about the pressures & constraints for operations
 • Learning about normal system behaviour
 • Learning about abnormal or anomalous system behaviour
 • Learning about previous non-routine or exceptional events and responses taken
Who incurs the cost: Individuals (all individuals committed to the basic compact)
Temporal scale: Pre-event (years, months, days)
Examples from the cases:
 • A new responder joining the team spends the first few weeks on incidents but is not ‘expected to contribute’. They do contribute in whatever ways they can (pulling info, monitoring user channels, ordering lunch during a long running outage, aiding in reconstructing the timeline for the post mortem).
 • A responder contacted an influential and highly impacted manager to get feedback on the tradeoff decision the team was facing (lose data or be down longer in an attempt to recover).
 • Being able to immediately identify an individual who has information useful to the case and knowing how to contact them in multiple modes (email and chat).
 • Recognizing that an upgrade conducted over the weekend meant the team was coming into Monday operations without sufficient rest and may require additional support.
 • Recognizing that one responder prefers web conferencing to collocated activity and accommodating for that.
 • “For some context, we've gotten 40 million requests from that application in the last two weeks. The request volume that you're looking at is normal and business-as-usual for us.”
 • “These two services are the largest consumers of and the behaviour you're seeing is business as usual. We cannot disable these services”
Adaptations to control CoC:
 Individual:
 • Less involvement in “extra-curricular” activities, which means decreasing depth of knowledge for common ground.
 • Less dialogue during interactions with other departments or units that may not be directly relevant to the immediate activity but serve to establish common ground.
 Organizational:
 • Specialized incident response teams
 • Conducting fewer post mortems or longer lag times between event & post mortem
 • New hires not given adequate orientations or sufficient time to become familiar with past events
 • Little or no formal mechanisms to exchange information/updates between dependent services about potentially impactful changes
 • Training or orientation programs that focus on technical knowledge and skills but not on other important contextual factors
 • New or junior engineers given responsibility for service reliability with insufficient grounding



Element of choreography: Maintaining common ground
Corresponding overhead costs:
 • Cultivating networks and establishing channels for recognizing change that can impact individual and team goals and tasks
 • Developing a sense of nominal and off-nominal trajectories of change (at the individual, team and organizational levels)
 • Maintaining a sense of team specific demands (including workload, changes, availability of specific resources, individual capacities, upcoming events)
 • Maintaining a sense of system demands (technical debt, hardware updates, deferred maintenance)
 • Maintaining a sense of environmental demands that can impede coordination (weather events, limited office technologies etc)
Who incurs the cost: Individuals (all individuals committed to the basic compact)
Temporal scale: Pre-event, During, After (years, months, days, minutes, seconds)
Examples from the cases:
 • “the exhaustion of the active workers is the behavior since hotpatching... which spawned this issue a week ago.”
 • “the web node long term load is way above normal (at half capacity)”
 • “It's probably just the west coast signing off, but response times got a lot better about 10 min ago”
Adaptations to control CoC:
 • Less additional effort put into ongoing awareness of activities.
 • Less or no effort expended to query uncertainties or contribute to broader awareness of activity.

Element of choreography: Repairing breakdowns in common ground
Corresponding overhead costs:
 • Considering implications of the breakdown
 • Assessing the options for repair given other work and current conditions
 • Deferring, reprioritizing or abandoning other activity to invest in repair
 • Running mental simulations to consider outcomes
 • Gauging interruptibility
 • Establishing new channels for communication
 • Recruiting resources to aid in repair
 • Delegating responsibility for repair to others
Who incurs the cost: Individuals (all individuals committed to the basic compact)
Temporal scale: During (minutes, hours)
Examples from the cases:
 Mgr 1: While we're waiting for that is somebody's generating a support bundle in the ticket? I think they're waiting for that. [crosstalk]
 Eng 2 (IC): Yes, sir. [crosstalk]
 Eng 4: Yeah. It's still running. It takes a while for run. [crosstalk]
 Mgr 1: Okay, thank you. I didn't see a task for that.
Adaptations to control CoC:
 • Repairs not conducted. A comment will go unchallenged.
 • Repairs completed but without additional information that could aid model updating (pointing out an interpretation is wrong without saying why).

Element of choreography: Taking perspectives/switching perspectives
Corresponding overhead costs:
 ● Visualizing/mentally simulating the experience of the other
 ● Anticipating needs
 ● Anticipating stance or orientation towards the problem
 ● Anticipating workload demands
Who incurs the cost: Individuals, Groups
Temporal scale: During (minutes)
Examples from the cases:
 • Uploading service bundles and additional diagnostic data for vendor support proactively
Adaptations to control CoC:
 • Focusing narrowly on own goals and tasks
 • Waiting for requests to be made

Element of choreography: Delegating
Corresponding overhead costs:
 ● Considering task demands
 ● Considering available resources
 ● Considering other’s capabilities
 ● Determining sequencing of events
 ● Considering pairing options for larger or complex activities
 ● Determining how to direct the other
 ● Determining extent of directions
 ● Considering timing of the requests
 ● “Making the ask”
 ● Gauging interruptability
 ● Clarifying expectations for completion (quality, timing, goal priorities)
Who incurs the cost: Individuals
Temporal scale: During (minutes, hours, days)
Examples from the cases:
 IC: Hey [Eng 2], Would you mind running the cluster reach to get a... To see if console is actually running?
 IC: [Eng 3] and [Eng 1] do you have something that you're already digging into?
 IC: “So I'm, uh, does someone know the best way to restart Redis and if so, can you take that?”
Adaptations to control CoC:
 • Not managing the coordination of activities (unstructured assignments and expectations)

Element of choreography: Taking direction
Corresponding overhead costs:
 ● Re-prioritizing work to accept new tasks
 ● Signaling availability
 ● Clarifying task assignments and timelines for completion
 ● Early flagging of potential issues
 ● Confirming acceptance of tasks
Who incurs the cost: Individual
Temporal scale: During (seconds, minutes, hours)
Examples from the cases:
 Eng 1: Can someone investigate :this:?
 Eng 2 (IC): parallel diags on storage appear to have a huge impact
 Eng 1: cancel and rerun serial
 Eng 2 (IC): done
 Eng 3: <@Eng1> I can investigate API calls
Adaptations to control CoC:
 • Avoid taking on new work
 • Being unresponsive to queries for help



Element of choreography: Taking initiative
Corresponding overhead costs:
 ● Considering task demands
 ● Considering available capacity & skills
 ● Considering other’s capabilities
 ● Determining sequencing of events
 ● Considering pairing options for larger or complex activities
 ● Communicating intent
 ● Joining activity already underway
Who incurs the cost: Individual
Temporal scale: During (seconds, minutes, hours)
Examples from the cases:
 • “I’m looking at the log files now”
 • “Do you want me to restart the Masters?”
Adaptations to control CoC:
 • Focusing narrowly on own tasks
 • Not signaling availability
 • Avoiding additional effort

Element of choreography: Recruiting new resources
Corresponding overhead costs:
 ● Monitoring current capacity relative to changing demands
 ● Identifying the skills required
 ● Identifying who is available
 ● Determining how to contact them
 ● Contacting them/alerting them to the need
 ● Waiting for a response
 ● Adapting current work to accommodate the new engagement
 ● Preparing for engagement: anticipating needs; developing a crit sit or status update; giving access/permissions to tools & coordination channels; generating shared artifacts (dashboards, screenshots)
 ● Dealing with access issues (inability to join web conference or trouble establishing audio bridge)
Who incurs the cost: Recruiter (can be an IC, eng or user)
Temporal scale: During (seconds, minutes, hours, days)
Examples from the cases:
 • “^Support level response, but anyone who is free to look at it with me would be welcome”
 • “PagerDuty alerting is having issues. You may need to grab folks the old fashioned way.”
Adaptations to control CoC:
 • Bringing new people in without context (no sit rep, context)
 • Relying on chat log to orient newcomers
 • Not asking for resources
 • Passivity in engagement (waiting, slowly completing tasks to aid coordination)
 • Localized focus on tasks – relies on others to perform coordination (eventually this happens)

Element of choreography: Being recruited to an incident/Orienting to another’s problem (coming up to speed)
Corresponding overhead costs:
 ● Being interrupted in your own work
 ● Assessing the request relative to your own capabilities
 ● Assessing the request relative to your capacity to act
 ● Deferring or abandoning your own work
 ● Acknowledging your orientation to the problem
 ● Communicating about the deferral or abandonment to the parties you coordinate with
 ● Gaining access to collaboration tools
 ● Assessing available information
 ● Clarifying (available data and expectations)
 ● Requesting additional information
 ● Forming questions about the state
 ● Determining interruptability
 ● Forming interjection
 ● Interjecting
 ● Determining roles
 ● Assessing work underway
 ● Assessing implications of work underway
 ● Considering your contributions relative to problem constraints
 ● Assessing how your contributions may influence work underway
Who incurs the cost: Recruitee
Temporal scale: During (seconds, minutes, hours, days)
Examples from the cases:
 • “Do you need help?”
 • “Can you share your WebEx link with me?”
 • “GitHub is requesting we run a support bundle”
 • “Can you join the call?”
Adaptations to control CoC:
 • Ignoring requests for help
 • Joining the incident without reviewing the incident data provided or transcript and asking for an update
 • Not joining the incident in the manner requested (joining a chat room rather than a web conference)
 • Not acknowledging other activity (or leaving others to manage around your work)
 • Disregarding coordination with others and conducting the technical work only



Element of choreography: Aiding others in model updating/Recognizing your own need for model updating
Corresponding overhead costs:
 ● Recognizing faulty mental models
 ● Determining what is known
 ● Assessing differences amongst mental models
 ● Determining the implications of the differences
 ● Querying for clarification
 ● Retrieving information to share
 ● Sharing knowledge
 ● Seeking confirmation
 ● Devising examples
 ● Fielding questions
 ● Clarifying others’ statements
 ● Adding further descriptions
Who incurs the cost: Knowledgeable parties
Temporal scale: During (seconds, minutes, hours, days, months)
Examples from the cases:
 • An eng tasked with checking logs in a service networking tool recognizes they are unclear on how to do so and prompts the group: “I’m a little rusty checking logs in Consul…”
Adaptations to control CoC:
 • Ignoring recognized or stated deficiencies
 • Longer term investments: knowledge sharing workshops, post mortem accounts etc

Element of choreography: Controlling the costs for others
Corresponding overhead costs:
 ● Ignoring or deferring prompts for input
 ● Gatekeeping
 ● Determining interruptability
 ● Formulating ideas for potential contributions (role, task)
 ● Delegating tasks
 ● Delaying requests
 ● Signaling availability
 ● Pro-offering capacity
 ● Cross checking
 ● Lowered expectations for output
Who incurs the cost: Willing parties
Temporal scale: During, After
Examples from the cases:
 • Case 1 – pre-emptively running service bundles
 • Case 3 – Tradeoff decision of taking the system down vs working on it while up, recognizing the timing of their actions will adversely influence users
Adaptations to control CoC:
 • Not conducting these activities

Element of choreography: Controlling costs for self
Corresponding overhead costs:
 ● Ignoring prompts for input
 ● Dropping conference calls, audio bridges, web conferencing
 ● Decreasing monitoring of user forums
 ● Shedding load
 ● Decreasing quality of work
 ● Delegating tasks
 ● Asking for cross checks
Who incurs the cost: Individual (but also group participants in the joint activity)
Temporal scale: During, After
Examples from the cases:
 • This is typically not explicit, but rather seen when things do not get done
 • Eng 1 (IC) – Case 1: “If someone has all of their emails and the CC list so I can copy and paste, it'd be fantastic too.…”
 • Case 3 – Eng shares screen to have others look in on their work to prevent errors because of the time pressure
Adaptations to control CoC:
 • Don’t join the forum that requires listening in
 • Restrict work
 • Drop coordinative functions



Element of choreography: Synchronizing tasks
Corresponding overhead costs:
 ● Considering temporal relationships between tasks
 ● Considering immediate and longer term implications of each course of action
 ● Determining immediate and longer term implications of the collective course of actions (interactions between)
 ● Waiting on responses
 ● Cross checking
Who incurs the cost: Individual (to a certain extent); predominantly the responsible party
Temporal scale: During
Examples from the cases:
 • Vendor suggests running the bundles after the work day to avoid slowing the system
 • Case 3 – Tradeoff decision of taking the system down vs working on it while up, recognizing the timing of their actions will adversely influence users
 • “While we're waiting for that is somebody's generating a support bundle in the ticket? I think they're waiting for that. [crosstalk] Yes, sir. [crosstalk] Yeah. It's still running. It takes a while for run. [crosstalk] Okay, thank you. I didn't see a task for that. [crosstalk] You have to do a full support bundle? A full support bundle? Well before one takes like a couple of hours. Yeah. I did a full up, a full cluster support bundle. Doesn't that take a couple of hours? Or did they improve the speed? I recall it taking a couple hours. Yeah. Takes a while. While they are waiting for the support bundle we may want to let them know, say, Hey, this kind of take a couple hours. We want something on do something between that time. I'll let them know that”
 • Eng 4: Are you sharing your screen? So I can go check the commands to make sure you didn't make any typo's... Like, I always do.
Adaptations to control CoC:
 • Invest more effort in proactively anticipating other needs and in demonstrating appropriate severity for the level of support requested
 • Attempt to recruit external resources early on in the incident
 • Re-prioritizing efforts that will take longer to complete so they are underway early (running support bundles)


Bibliography

Ackoff, R. L. (1971). Towards a System of Systems Concepts. Management Science, 17(11), 661–671.

Alderson, D. L., & Doyle, J. C. (2010). Contrasting views of complexity and their implications for network-centric infrastructures. IEEE Transactions on systems, man, and cybernetics- Part A: Systems and humans, 40(4), 839-852.

Allen, J., & Ferguson, G. (2002). Human-machine collaborative planning. Proceedings of the Third International NASA Workshop on Planning and Scheduling for Space, 27–29.

Allspaw, J. (2015). Trade-Offs under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages. http://lup.lub.lu.se/student-papers/record/8084520/file/8084521.pdf

Bento, F. (2018). Complexity in the oil and gas industry: a study into exploration and exploitation in integrated operations. Journal of Open Innovation: Technology, Market, and Complexity, 4(1), 11.

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media, Inc.

Bigley, G. A., & Roberts, K. H. (2001). The incident command system: high-reliability organizing for complex and volatile task environments. The Academy of Management Journal, 44(6), 1281–1299.

Buchanan, D., Boddy, D., & McCalman, J. (2014). Getting in getting on getting out and getting back. In E. Bell, & H. Willmott (Eds.), Qualitative research in business and management: practices and preoccupations (Vol. 3). SAGE Publications Inc.


Burtscher, M. J., Wacker, J., Grote, G., & Manser, T. (2010). Managing nonroutine events in anesthesia: the role of adaptive coordination. Human factors, 52(2), 282-294.

Branlat, M., Morison, A., & Woods, D. D. (2011, October). Challenges in managing uncertainty during cyber events: Lessons from the staged-world study of a large-scale adversarial cyber security exercise. In Human Systems Integration Symposium (pp. 10-25).

Cares, J. R., Christian, R. J., & Manke, R. C. (2002). Fundamentals of distributed, networked military forces and the engineering of distributed systems. https://apps.dtic.mil/docs/citations/ADA402951

Clark, H. H. (1996). Using Language. Cambridge University Press.

Clark, H. H., & Brennan, S. E. (1991). Grounding in communication. Perspectives on Socially Shared Cognition, 13(1991), 127–149.

Crockford, D. (2009). JSON. https://www.json.org/json-en.html

Denzin, N. K. (2001). Interpretive Interactionism. SAGE.

Fairburn, C., Wright, P., & Fields, R. (1999). Air traffic control as distributed joint activity: Using Clark’s theory of language to understand collaborative working in ATC. Proceedings of the European Conference on Cognitive Science. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.8402&rep=rep1&type=pdf

Feltovich, P. J., Bradshaw, J. M., Clancey, W. J., & Johnson, M. (2007). Toward an Ontology of Regulation: Socially-Based Support for Coordination in Human and Machine Joint Activity. Engineering Societies in the Agents World VII, 175–192.

Feltovich, P. J., Spiro, R. J., & Coulson, R. L. (1997). Issues of expert flexibility in contexts characterized by complexity and change. Expertise in context: Human and machine, 125–146.

Grayson, M. R. (2018). Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems (D. D. Woods (ed.)) [Masters in Integrated Systems Engineering]. The Ohio State University.

Grudin, J. (1994). Computer-supported cooperative work: history and focus. Computer, 27(5), 19–26.


Hoffman, R. R., & Woods, D. D. (2000). Studying cognitive systems in context: preface to the special section. Human Factors, 42(1), 1–7.

Hoffman, R. R., & Woods, D. D. (2011). Beyond Simon’s Slice: Five Fundamental Trade-Offs that Bound the Performance of Macrocognitive Work Systems. IEEE Intelligent Systems, 26(6), 67–71.

Hollan, J., Hutchins, E., & Kirsh, D. (2000). Distributed Cognition: Toward a New Foundation for Human-computer Interaction Research. ACM Trans. Comput. -Hum. Interact., 7(2), 174–196.

Hollnagel, E., & Woods, D. (2005). Joint cognitive systems: Foundations of cognitive systems engineering. CRC Press.

Hollnagel, E., & Woods, D. D. (1983). Cognitive Systems Engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18(6), 583–600.

Jamieson, G. (2005, May). NIMS and the incident command system. In International oil spill conference (Vol. 2005, No. 1, pp. 291-294). American Petroleum Institute.

Jamshidi, M. (2017). Systems of Systems Engineering: Principles and Applications. CRC Press.

Johnson, M., Bradshaw, J. M., Feltovich, P. J., Jonker, C. M., van Riemsdijk, M. B., & Sierhuis, M. (2014). Coactive Design: Designing Support for Interdependence in Joint Activity. J. Hum. -Robot Interact., 3(1), 43–69.

Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution.

Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational Simulation, 53, 139–184.

Klein, G., Woods, D. D., Bradshaw, J. M., Hoffman, R. R., & Feltovich, P. J. (2004). Ten Challenges for Making Automation a “Team Player” in Joint Human-Agent Activity. IEEE Intelligent Systems, 19(6), 91–95.


Klinger, D., & Klein, G. (1999). An Accident Waiting to Happen. Ergonomics in Design: The Magazine of Human Factors Applications, 7(3), 20–25.

Koschmann, T., & LeBaron, C. D. (2003). Reconsidering Common Ground. ECSCW 2003, 81–98.

Larson, L., & DeChurch, L. (2020). Leading teams in the digital age: Four perspectives on technology and what they mean for leading teams. The Leadership Quarterly, 101377.

Leblanc, R. (2019, October 15). How Amazon Is Changing Supply Chain Management. The Balance Small Business. https://www.thebalancesmb.com/how-amazon-is-changing-supply-chain-management-4155324

Levinson, S. C. (1979). Activity types and language. https://www.degruyter.com/view/j/ling.1979.17.issue-5-6/ling.1979.17.5-6.365/ling.1979.17.5-6.365.xml

MacMillan, J., Entin, E. E., & Serfaty, D. (2004). Communication overhead: The hidden cost of team cognition.

Maguire, L. M. (2019). Managing the Hidden Costs of Coordination. Queue, 17(6), 71-93.

Maguire, L. M., & Jones, N. (2020, May 25). Learning from Adaptations to Coronavirus. LFI. https://www.learningfromincidents.io/blog/learning-from-adaptations-to-coronavirus

Malone, T. W. (1987). Modeling Coordination in Organizations and Markets. Management Science, 33(10), 1317–1332.

Malone, T. W., & Crowston, K. (1994). The Interdisciplinary Study of Coordination. ACM Comput. Surv., 26(1), 87–119.

Mansson, J. T., Lutzhoft, M., & Brooks, B. (2017). Joint Activity in the Maritime Traffic System: Perceptions of Ship Masters, Maritime Pilots, Tug Masters, and Vessel Traffic Service Operators. Journal of Navigation, 70(3), 547–560.

Mikkers, M., Henriqson, E., & Dekker, S. (2012). Managing multiple and conflicting goals in dynamic and complex situations: Exploring the practical field of maritime pilots. Journal of Maritime Research, 9(2), 13-18.


Neale, D. C., Carroll, J. M., & Rosson, M. B. (2004). Evaluating computer-supported cooperative work: models and frameworks. Proceedings of the 2004 ACM Conference on Computer Supported Cooperative Work, 112–121.

Nemeth, C. P., Cook, R. I., & Woods, D. D. (2004). The messy details: insights from the study of technical work in healthcare. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

Obradovich, J. H., & Smith, P. J. (2003). Design concepts for distributed work systems: an empirical investigation into distributed teams in complex domains. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 47, No. 3, pp. 354-358). Sage CA: Los Angeles, CA: SAGE Publications.

Olson, G. M., & Olson, J. S. (2000). Distance Matters. Hum. -Comput. Interact., 15(2), 139–178.

Olson, G. M., & Olson, J. S. (2007). Computer-Supported Cooperative Work. In F. T. Durso, R. S. Nickerson, S. T. Dumais, S. Lewandowsky, & T. J. Perfect (Eds.), Handbook of Applied Cognition (pp. 497–526). John Wiley & Sons Ltd.

Olson, G. M., Olson, J. S., & Venolia, G. (2009). What still matters about distance. Proceedings of HCIC. https://www.researchgate.net/profile/Gary_Olson/publication/255563884_What_Still_Matters_about_Distance/links/54f89ee20cf2ccffe9df588e.pdf

Perry, S. J., & Wears, R. L. (2011). Large-scale coordination of work: Coping with complex chaos within healthcare. In Informed by Knowledge (pp. 69–82). Psychology Press.

Ponterotto, J. G. (2006). Brief note on the origins, evolution, and meaning of the qualitative research concept thick description. The Qualitative Report, 11(3), 538–549.

Potter, S. S., Roth, E. M., Woods, D. D., & Elm, W. C. (2000). Bootstrapping multiple converging cognitive task analysis techniques for system design. Cognitive Task Analysis, 317–340.

Prechelt, L., Zieris, F., & Schmeisky, H. (2015). Difficulty Factors of Obtaining Access for Empirical Studies in Industry. 2015 IEEE/ACM 3rd International Workshop on Conducting Empirical Studies in Industry, 19–25.

Sarter, N. (1994). Strong, silent, and out-of-the-loop: Properties of advanced (cockpit) automation and their impact on human-automation interaction (Ph.D. thesis).

Schmidt, K., & Simone, C. (1996). Coordination mechanisms: Towards a conceptual foundation of CSCW. Computer Supported Cooperative Work (CSCW): An International Journal, 5(2), 155–200.

Shattuck, L. G., & Woods, D. D. (2000). Communication of Intent in Military Command and Control Systems. In C. McCann & R. Pigeau (Eds.), The Human in Command: Exploring the Modern Military Experience (pp. 279–291). Springer US.

Smith, M. W., Patterson, E. S., & Woods, D. D. (2007). Collaboration and Context in Handovers. Proceedings of the 2007 European Conference on Computer-Supported Cooperative Work. Limerick, Ireland: Springer.

Smith, P. J. (2017). Making brittle technologies useful. In Cognitive Systems Engineering (pp. 181–208). CRC Press.

Smith, P. J., Beatty, R., Spencer, A., & Billings, C. (2003). Dealing with the challenges of distributed planning in a stochastic environment: coordinated contingency planning. Digital Avionics Systems Conference, 2003. DASC ’03. The 22nd, 1, 5.D.1–5.1–8 vol.1.

Smith, P. J., Billings, C., McCoy, C. E., & Orasanu, J. (1999). Alternative Architectures for Distributed Cooperative Problem-Solving in the National Airspace System.

Smith, P. J., Spencer, A. L., & Billings, C. (2011). The Design of a Distributed Work System to Support Adaptive Decision Making Across Multiple Organizations. In Informed by Knowledge (pp. 153–166). Psychology Press.

Suchman, L. A. (1987). Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge University Press.

Tanenbaum, A. S., & Van Steen, M. (2007). Distributed systems: principles and paradigms. http://cds.cern.ch/record/1056310/files/0132392275_TOC.pdf

Tatar, D. G., Foster, G., & Bobrow, D. G. (1991). Design for conversation: Lessons from Cognoter. International Journal of Man-Machine Studies, 34(2), 185–209.

Von Bertalanffy, L. (1972). The History and Status of General Systems Theory. Academy of Management Journal, 15(4), 407–426.

Webster, F. (1994). What information society? The Information Society, 10(1), 1–23.


Wiener, E. L. (1988). 13 - Cockpit Automation. In E. L. Wiener & D. C. Nagel (Eds.), Human Factors in Aviation (pp. 433–461). Academic Press.

Winograd, T., Flores, F., & Flores, F. F. (1986). Understanding Computers and Cognition: A New Foundation for Design. Intellect Books.

Woods, D. (2018). The theory of graceful extensibility: basic rules that govern adaptive systems. Environment Systems and Decisions, 38(4), 433–457.

Woods, D. (2005). Studying cognitive systems in context: the cognitive systems triad. Institute for Ergonomics, The Ohio State University, Columbus, OH. Retrieved on November 12, 2018.

Woods, D. D. (1993). Process tracing methods for the study of cognition outside of the experimental psychology laboratory. Decision Making in Action: Models and Methods, 228–251.

Woods, D. D., Dekker, S., Cook, R., Johannesen, L., & Sarter, N. (2010). Behind human error. CRC Press.

Woods, D. D. (2015). Four concepts for resilience and the implications for the future of resilience engineering. Reliability Engineering & System Safety, 141, 5–9.

Woods, D. D. (2019). 4 Essentials of resilience, revisited. In ResearchGate.

Woods, D. D., & Hollnagel, E. (2006). Joint cognitive systems: Patterns in cognitive systems engineering. CRC Press.

Woods, D. D., & Branlat, M. (2011). Basic patterns in how adaptive systems fail. Resilience Engineering in Practice, 127–144.

Woods, D. D., & Patterson, E. S. (n.d.). How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands.

Woods, D. D., & Roth, E. M. (1988). Chapter 1 - Cognitive Systems Engineering. In M. Helander (Ed.), Handbook of Human-Computer Interaction (pp. 3–43). North-Holland.

Woods, D. D., Tittle, J., Feil, M., & Roesler, A. (2004). Envisioning human-robot coordination in future operations. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(2), 210–218.

Woods, D. (ed). (2017). Stella Report: Report from the SNAFUcatchers Workshop on Coping With Complexity


