
Controlling the Costs of Coordination in Large-scale Distributed

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Laura Marie Dose Maguire, MSc
Graduate Program in Industrial and Systems Engineering

The Ohio State University 2020

Dissertation Committee
Dr. David D. Woods, Advisor
Dr. Michael F. Rayo, Committee Member
Dr. Philip J. Smith, Committee Member

Copyrighted by Laura Marie Dose Maguire 2020

Abstract

Responding to anomalies in the critical digital services domain involves coordination across a distributed network of automated subsystems and multiple human roles (Allspaw, 2015; Grayson, 2019). Exploring the costs of this joint activity is an underexamined area of study (Woods, 2017) but has important implications for managing complex systems across distributed teams and for tool design and use. It is understood that anomaly recognition is a shared activity between the users of the service, the automated monitoring systems, and the practitioners responsible for developing and operating the service (Allspaw, 2015). In addition, multiple, diverse perspectives are needed for their different views of the system and its behavior and their ability to recognize unexpected and abnormal conditions. While the collaborative interplay and synchronization of roles is critical in anomaly response (Patterson et al., 1999; Patterson & Woods, 2001), the cognitive costs for practitioners (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006) can be substantial. The choreography of this joint activity is shown to be subtle and highly integrated into the technical efforts of dynamic fault management. This work uses process tracing to take a detailed look at a corpus of five cases involving software engineers coping with unexpected service outages of varying difficulty. In doing so, it is noted that the practices of incident management work very differently than domain models suggest and that the tooling designed to aid coordination incurs a cognitive cost for practitioners. Adding to the literature on coordination in ambiguous, time-pressured and non-co-located groups, this study shows that adaptive choreography enables practitioners to cope with dynamic events – and dynamic coordination demands. These demands can also be a function of the coordination strategies of others – in particular when they shift costs of coordination across time and organizational boundaries.


Dedication

This dissertation is dedicated to my parents – Bente & Paul. Words are insufficient to express my gratitude for the gifts you have given me – love, curiosity, creative expression, unwavering confidence and deeply human ethics. I will forever be an explorer, a reader and an advocate for equality, empathy and justice from the examples you have laid out with your lives. I love you so much.


Acknowledgments

First and foremost, Dr. David Woods has my tremendous gratitude for his unflagging support and investment in my intellectual growth as a researcher and a scholar. His influence has been central in cultivating my own approach toward understanding complexity and the fields of practice that cope with its dragons. Dave, you are one in a million and I am truly honored to be a part of the CSEL family.

My PhD program would not have been as rich without the gentle cajoling and always genuine commitment to my development from Dr. Richard Cook. Richard, your mentorship and above all, friendship, is deeply appreciated.

Dr. Michael Rayo, for modeling the way to be a hardcore scholar and still be present in your own life. Mike, your willingness to make time even when overrun by the demands of the tenure track did not go unnoticed.

Dr. Philip Smith’s role as a professor, mentor and committee member has helped me become more rigorous, well-rounded and discerning – all characteristics he himself so clearly models in his work and teaching. Phil, thank you for your commitment to the next generation of CSErs even when you could have been riding or researching!

Dr. Emily Patterson, I am incredibly fortunate to have had your tremendous experience and insight on my committee and I count your mentorship as a highlight of my time at OSU.

My colleagues in the CSELab deserve an abundance of thanks. In particular, E. Asher Balkin, Kati Walker & Dan Welna for the countless conversations, whiteboarding sessions, proofreads and pep talks. To all the students- especially “The Morgans” (Reynolds & Fitzgerald), Christine


Jeffries, Dane Morey & Jesse Marquiese – I am so proud of you all and can’t wait to see what brilliance you unleash on the world.

I know how deeply fortunate I am to have the love & support of the Maguire (& honorary Maguire) clan. I am indebted to each of you for your enthusiasm, your empathy and your solidarity. In particular, this wouldn’t have happened without my sister Kristina’s enduring editing enthusiasm and the unflagging encouragement from her and my dear friend Yvonne.

Dr. Jody Reimer & Dr. Cayman Unterborn, you deserve all my admiration as wonderful people & accomplished scholars. Thank you for your friendship & your roles in getting me out of the lab and keeping me grounded.

These acknowledgements would be incomplete without a sincere recognition of the countless practitioners, managers, designers and product owners who keep the digital world operating smoothly for us. To the participants and organizations who shared their insights, their struggles and their seemingly mundane day-to-day, I thank you deeply. In particular, DL, MC & JR your support was very much appreciated.

Lastly, my constant companion Oliver, the instigator of study break nature excursions! Without him curled up under my desk, I am not sure I would have made it through as many late nights as I did.


Vita

Education

2003………………………… Bachelor of Commerce – Entrepreneurial Management, School of Business, Royal Roads University

2017………………………….Master of Science – Human Factors & Safety Systems, Faculty of Engineering, Lund University

Work

1999-2005……………………..Forestry Field worker, various

2006-2009……………………..Business Dev & Safety Manager – Dynamic Reforestation

2010-2012………………………Manager, Training & Program Dev – BC Forest Safety Council

2012-2015………………………Supervisor, Safety Training – Major Projects, Enbridge Inc.

2015-2016………………………Quality Specialist – Major Projects, Enbridge Pipelines

2017-Present…………………….Graduate Research Assistant – Cognitive Systems Engineering Lab, Ohio State University

2017-2019……………………Resilience Engineering Intern – IBM


Publications

Maguire, L. M. D. (2020). Managing the Hidden Costs of Coordination. ACM Queue, 17(6). Association for Computing Machinery.

Chuang, S., Maguire, L., Hsiao, S., Ho, Y., & Tsai, S. (2019). Nurses’ perceptions of hazards to patient safety in the intensive care units during a nursing staff shortage. International Journal of Healthcare, 6(1), 19-19.

Maguire, L. M., Vazquez, D. E., Haney, A., Byrd, C., & Sanders, E. B. N. (2019). Transdisciplinary Co-Design to Envision the Needs of the Intensive Care Unit of the Future. In Proceedings of the International Symposium on Human Factors and Ergonomics in Health Care (Vol. 8, No. 1, pp. 181-181). Sage CA: Los Angeles, CA: SAGE Publications.

Maguire, L. M. & Percival, J. (2019). Sensemaking in the Snow: Examining the cognitive work of avalanche forecasting in a Canadian ski operation. Neve e Valanghe, 93, pp. 94-95. Retrieved from https://issuu.com/aineva7/docs/nv93_rivista

Maguire, L. (2014). The human behind the factor: a brief look at how context informs practice in recreational backcountry users. In Proceedings of the International Snow Science Workshop (pp. 942-948).

Fields of Study

Major Field: Industrial & Systems Engineering
Minor Field: Cognitive Systems Engineering
Minor Field: Design


Table of Contents

Abstract ...... i

Dedication ...... ii

Acknowledgments ...... iii

Vita ...... v

List of Tables ...... xi

List of Figures ...... xii

Chapter 1 - Introduction ...... 1

Chapter 2 – Joint activity & anomaly response in large-scale distributed work systems 7

Key Concepts ...... 8

Large-scale distributed systems ...... 9

Defining a large-scale distributed system ...... 11
The large-scale distributed systems of interest ...... 12
Time & coherence ...... 12
The reification of distributed systems ...... 13

Coordination ...... 13
Joint Activity ...... 16
Cognitive costs of coordination ...... 18

Anomaly Response ...... 21

Summary ...... 24

Chapter 3 – The domain of Critical Digital Infrastructure ...... 25

Increasing complexity, speed and scale in ...... 26
Changing methods of ...... 26
Physical to virtual - Moving to the Cloud ...... 28
Service level agreements ...... 30
CD/CI as continuous adaptation ...... 30

Service outages ...... 31
Tooling to manage incident response ...... 31
Attempts to manage coordination for incident response ...... 33

Implications ...... 33

Chapter 4 – Research Methods ...... 35

Introduction ...... 35

Sources of Data ...... 36

Organizational characteristics of participants ...... 37

Individual characteristics of participants ...... 37

Data availability ...... 38
Collecting data on cognition in the wild ...... 38

Preparations ...... 39
Investigating models of incident response ...... 40
Understanding configurations for incident response ...... 42

Study: Process tracing a corpus of cases ...... 44
Case collection ...... 44
Data extraction and conversion ...... 45
Coding ...... 45

Chapter 5 - Findings ...... 49

Introduction ...... 49

Results from preparatory studies ...... 50
Preparatory study 1: Incident Models ...... 50
Preparatory study 2: How technology shapes the configurations for joint activity ...... 58

Process Tracing a corpus of cases ...... 64
Case # 1 - Hidden changes & multiple failures confound responders ...... 68
Case # 2 - A bad fault gets worse with multiple concurrent issues ...... 74
Case # 3 - Scaling beyond imagined parameters ...... 82

Case # 4 - When those who should know don’t ...... 88
Findings from Case 4 ...... 90

Case # 5 - Is this a problem and how bad is it? ...... 94
Findings from Case 5 ...... 97

Themes emerging from across the cases ...... 107
Additional patterns of choreography needed for coordination ...... 116

Summary ...... 120

Chapter 6 Discussion ...... 121

The elements of choreography ...... 122
Establishing Common Ground ...... 122
Maintaining & repairing common ground ...... 126
Recruiting ...... 128
Tracking others activities ...... 129
Taking initiative ...... 129
Delegating ...... 130
Taking Direction ...... 131
Synchronizing tasks ...... 131
Controlling costs for self & others ...... 132
Model updating ...... 132
Investing for future coordination ...... 133
Coordinating with tooling ...... 134


Monitoring coordination demands ...... 135
Summary of the elements of choreography ...... 136

Efforts to control the costs of coordination ...... 136
Limitations of the Incident Command model ...... 138
Controlling the costs of coordination ...... 139
The cost of cross-boundary coordination ...... 142

Chapter 7 Adaptive choreography ...... 149

The framework for Adaptive Choreography ...... 149

The dynamics of coordination in anomaly response ...... 151

Re-imagining incident management practices ...... 153
The changing nature of roles ...... 154
Responders as self-organizing units ...... 157

The scaffolding for Adaptive Choreography ...... 157

Summary of Adaptive Choreography ...... 158

Chapter 8 – Conclusion...... 160

Appendix A – Elements of Choreography ...... 162

Bibliography ...... 167


List of Tables

Table 5.1 Overview of Corpus Cases ...... 65


List of Figures

Figure 2.1 Components of coordination (Malone & Crowston, 1994) ...... 14
Figure 2.2 A description of joint activity ...... 17
Figure 2.3 Costs of Coordination - Clark & Brennan (1991) ...... 19
Figure 2.4 Multiple intermixed lines of reasoning (from Woods, 1994) ...... 21
Figure 4.1 Distribution of site visits ...... 40
Figure 4.2 Overview of research methods ...... 44
Figure 4.3 Coding framework ...... 47
Figure 5.1 One at a Time ...... 52
Figure 5.2 OAAT with Mean Time To Innocence ...... 53
Figure 5.3 OAAT with Incident Commander ...... 54
Figure 5.4 Escalation ...... 55
Figure 5.5 All Hands on Deck ...... 57
Figure 5.6 SWAT Incident response ...... 58
Figure 5.7 Recruitment of responders over time ...... 78
Figure 5.8 Corrected by cross-checking ...... 88
Figure 5.9 Dashboard time interval extended ...... 89
Figure 5.10 Tooling ...... 104
Figure 5.11 Maintaining common ground through updating ...... 105
Figure 6.1 The Incident Commander becomes a bottleneck ...... 138
Figure 6.2 Side-channeling as coordination costs rise ...... 139
Figure 6.3 Multi-role, multi-echelon coordination ...... 143
Figure 6.4 The cross-scale interactions of interest ...... 144
Figure 6.5 Nested series of relationships ...... 145


List of Figures continued

Figure 6.6 Escalation ...... 145
Figure 7.1 Adaptive Choreography as an incident response model ...... 155


List of Abbreviations

CDI Critical Digital Infrastructure
CD/CI Continuous Delivery/Continuous Integration
CI Continuous Integration
CSE Cognitive Systems Engineering
IC Incident Commander
ICS Incident Command System
IP Internet Protocol
SaaS Software as a Service
SIO Software Intensive Operations


Chapter 1 - Introduction

When we try to pick out anything by itself, we find it hitched to everything else in the universe. —John Muir, Nature Writings

The inherent interconnectedness and interdependence of complex systems – be they ecological, technological or social – inextricably links separate elements into a coherent whole. Science itself is an endeavor to understand the myriad relationships that define the orchestration of the causes/effects, the actions/reactions and the interactive dynamics that sustain life, enable societies to flourish, permit planes to fly, facilitate trade and allow families to function. In this way, a desire to understand the mechanisms of coordination is shared across a diverse cast of interested parties such as biologists, economists, engineers, managers and social scientists. And while most organic systems have always been recognized as intricate and inexplicable, it is largely only in recent times that human systems – bolstered by technological and geopolitical advances that extend speed and scale – have also developed emergent properties that make them difficult to explain and control. Where once nation states dictated their own economic policies, navigated their own conflicts and coped with their own natural disasters, shifting geopolitical landscapes and shared threats such as climate change and terrorism have necessitated strong global interdependencies. Aspirations in space exploration and scientific discovery have seen collective investment in shared assets that would otherwise be prohibitive to a single nation. Even at a more local level, raising children or caring for aging parents requires extensive coordination amongst family and community.

It is apparent that, through this interconnection, as Klein, Feltovich, Bradshaw, & Woods (2005) assert, nearly all forms of meaningful or challenging work are joint activity. Joint activity takes place in systems that were designed by humans, for human practitioners, and to serve human purposes and goals; therefore it is implicitly understood that joint activity occurs amongst human-human groupings. However, technological capabilities are needed to extend human reach into dangerous, unpredictable or inhospitable environments (Woods et al., 2004) and to enable vast quantities of information processing at speeds unmatched by human counterparts. So, increasingly, joint activity is distributed across both human and machine groupings (Allen and Ferguson, 2002; Woods et al., 2004). The multiple layers of distribution mean coordination is inherently necessary, and all coordinative work comes with additional cognitive costs (Klein, Woods, Bradshaw, Hoffman & Feltovich, 2004). As Woods (2017) notes, “the skills needed to marshal resources and deploy them effectively are related to but different from those associated with the problem solving. But to be effective, these resources must be directed, tracked, and redirected. These activities are themselves demanding” (p. 28). So, while additional resources are needed, they impose costs. Attempts to limit the number of parties involved can be futile and counterproductive in many complex, large-scale critical digital systems. This is because the system itself often supports high value interdependent work functions, and outages or performance degradations impede those dependent functions in ways that drive consequences for others.

In systems that have criticality-dependent stakeholders, off-nominal performance will drive an escalation in cognitive and coordinative demands (Woods & Patterson, 2001). The cognitive demands follow from the mental effort embedded in dynamic fault management of ambiguous problems in abstracted systems during high tempo and interactive failures (Woods et al., 1986; Woods & Hollnagel, 2006). The coordinative demands stem from the need for multiple, diverse perspectives representing varying mental models that can aid in hypothesis exploration (Watts-Perotti & Woods, 2007) or have access to privileged or localized information (Smith et al., 1999). This collaborative interplay and synchronization of roles is critical in anomaly response (Patterson et al., 1999; Patterson & Woods, 2001). In high demand scenarios, a diverse set of skills and roles is needed: responders who can aid with the technical issue, who have the authority to make decisions about tradeoffs between competing priorities, who can reallocate resources from other areas to the problem, and who can direct peripheral players that may be a distraction to technical experts (such as customer service or communication roles seeking to update users or the public). Often, these parties are non-collocated – both intentionally and by virtue of the scarcity of expertise (Beyer et al., 2016) – which drives the need for technology to facilitate coordination. And, while computer supported cooperative work tooling has advanced, the tools themselves are insufficient for high demand collaborative problem-solving (Olson & Olson, 2000; Olson, Olson, and Venolia, 2009) and therefore generate additional costs. The ways in which they are insufficient are explored further in Chapter 2.

In summary, the features of this problem space include: large-scale distributed work systems are increasingly common; non-routine failure events in these contexts require coordination across a variety of agents within a network of interdependent capabilities; these agents are both human and machine; and their geographic dispersion is supported by tooling that is insufficient to aid complex cognitive work.

This dissertation seeks to answer the question: what mechanisms do practitioners use to control the costs of coordination within large-scale distributed work systems as they carry out the functions of anomaly response under uncertainty, risk, and time pressure?

While extensive investigation has occurred across organizational, management, systems engineering, social psychology, engineering psychology and sociological research, this work provides a substantially different, integrated view of coordination. It shows how cross-scale dynamics and coordination demands that shift over time incur cognitive costs for practitioners, and it identifies the corresponding strategies used to manage the multiple, competing demands for attention that arise during a response to exceptional events.

As described in the opening examples, the answers to these questions are relevant to many diverse domains. Consistent with other work in Cognitive Systems Engineering (CSE), this research uses the particular circumstances of a specific domain (critical digital infrastructure) to extrapolate patterns of cognitive work around coordination that are applicable to other technical professionals working in large-scale distributed work systems operating at speed and at scale. A comprehensive definition of critical digital infrastructure is given in Chapter 3.


What follows examines large-scale distributed systems coordination through the lens of event driven cognitive work. It differs in important ways from previous investigations of coordination by using an integrated, cross-scale lens that focuses on the cognitive effects of the relationship between groups of geographically dispersed people working on difficult problems with automated supports across organizational and technological boundaries.

Whereas coordination is typically treated as process driven, this work asserts that event driven coordination accounts for the temporal variability of demands (including expansions of uncertainty and of the number of parties involved requiring coordinative effort). In addition, event driven coordination provides context for the situated action of practitioner strategies for controlling the costs of coordination. Just as Muir recognizes that everything is “hitched to everything else”, Suchman describes this as “the view that every course of action depends in essential ways on its material and social circumstances” (Suchman, 1987, p. 50). Central to this investigation are methods (process tracing) for surfacing the ways in which observable actions are informed by the tacit cognitive processes that drive them.

I have organized the remainder of this dissertation into seven chapters.

Following this introduction, in Chapter 2, I will discuss the background and relevance of studying how to control the cognitive costs of coordination. This dissertation draws from several streams of research relevant to uncovering and understanding the cognitive costs of coordination. These conceptual frameworks show how distributed work systems, enabled by computer supported cooperative work, influence joint cognitive systems during anomaly response. These frameworks provide interrelated and complementary insights on coordination in that:

· Distributed work systems research is necessary for examining conditions of work where practitioners are often spatially and temporally separated, and tasks are distributed across groups of human and machine agents.
· The organizing of domains in this manner is enabled by computer supported cooperative work, which has implications for both coordination design and for cognitive work.
· Coordination of human-human and human-machine teams is best understood through joint cognitive systems, which analyze how multiple human perspectives on a complex problem space and the diverse cognitive capacities inherent in humans and automation form an irreducible unit of analysis.
· The anomaly response model and case examples of dynamic fault management provide a foundation for time pressured diagnosis and repair in systems characterized by cascading interactive failure.

Next, chapter 3 introduces the domain of study (software engineering in critical digital infrastructure). I have chosen to provide a thick description (Ponterotto 2006; Denzin 2001) alongside the technical description to capture the changing organization of work and the tension inherent in navigating uncertainty during service outages. I integrate these different forms of narrative because, implicitly and explicitly, embedded in the traditions of CSE is ethnographic field research. This study is representative of a sustained three-year engagement with one organization and a two-year relationship with the others. In my analysis, I draw on this substantial context to provide insight. Before moving on, I will make the connection between studying how teams of DevOps engineers manage critical incidents and the relevance to other distributed work systems.

Following this, Chapter 4 describes the methods used in this study. As noted by Potter et al. (2000), “developing a meaningful understanding of a field of practice relies on multiple converging techniques” (p. 337). In this tradition, this work used knowledge elicitation, process tracing and observation to develop the analysis. In doing so, a number of methodological constraints were uncovered and are discussed.

Chapter 5 then presents the findings. The first set of findings outlines the approaches used by the organizations to handle coordination during anomaly response. The second set relates to a corpus of five cases characterized by a high degree of difficulty, with episodes where diverse perspectives contribute to model updating & revision, as well as episodes where the participants struggled to update & revise their models of how the Critical Digital Infrastructure (CDI) was malfunctioning. Then, a cross-case analysis focuses on how practitioners adapt to manage the costs of coordination and looks closely at how the tools and technology that enable interaction across diverse roles can facilitate, but also interfere with, coordination. The contribution of an integrative model of controlling the costs of coordination provides greater explanatory power than previous work disseminated across these streams.

The majority of Chapter 6 is dedicated to the discussion of the elements of choreography and the means by which responders manage the cognitive and coordinative demands of the joint activity of incident response. This is followed by a synthesis that argues that coordination must be considered using a cross-scale lens. Results from the preparatory studies and the corpus findings are laid out to account for the myriad roles of direct responders and stakeholders that influence incident response practices in critical digital systems. In this way, we are then prepared to look at the dynamics of coordination – how the elements are enacted across time and space.

The framework for Adaptive Choreography – which outlines the dynamics of coordination – is then established in Chapter 7, detailing the ways the elements of choreography are fulfilled by participants to enable smooth coordination. This chapter outlines some of the essential characteristics of the organizational context and of individual orientations towards coordinative work, and identifies opportunities for integrating adaptive choreography into incident response practices.

In the final section (Chapter 8), I conclude by summarizing the findings from the corpus, highlighting the key discussion points, touching on the limitations of this study, providing brief comments on implications for tool design, and ending with remarks on the implications for future research on coordination in cognitive work.


Chapter 2 – Joint activity & anomaly response in large-scale distributed work systems

The rise of automation in the control of managed processes and the prevalence of the computer in the workplace have been a transformative shift that created new demands for organizations, practitioners and the designers and developers of workplace technologies. Almost 25 years ago, Malone & Crowston (1994) noted that “there is a pervasive feeling in businesses today that global interdependencies are becoming more critical, that the pace of change is accelerating, and that we need to create more flexible and adaptive organizations. Together, these changes may soon lead us across a threshold where entirely new ways of organizing human activities become desirable” (p. 89). As this shift has continued and the scale of the systems being managed increases, new ways of organizing human activities, alongside the development of automated and semi-automated tooling to support complex work system functions, have become both desirable and, clearly, in many high hazard contexts, critical. Malone & Crowston’s comment and their corresponding Coordination Theory envisioned the effect of these changes on typical business practices that might connect a collaborative global workforce to remain competitive. Since that time, high risk and high consequence industries have continued to experience the same pervasive feeling (Smith et al., 1999; Obradovich & Smith, 2003; Mikkers et al., 2012; Bento, 2018).

Technology and practices that can enable frontline practitioners and their organizations to dynamically navigate the distances in time and space between the point of control (virtual or otherwise) and the anomalous events being attended to are still lacking. What is needed are coordinative devices that support fluid abilities to act on the managed processes: readily integrating new sources of information, revising assessments, allowing rapid reconfiguration of available resources (human and machine) and flexibly adapting to the changing demands of the environment and problem space.


As a result, several closely related but distinct vectors of research examine these new contexts for work. Research relevant to understanding and improving these systems spans the human (individual & social cognition), the technological (hardware, software, architectures, networks), the organizational (human resource management, resource allocation, management), the industrial (regulatory, economic) and the system (characteristics of complexity and emergent system behavior). The extent of the transdisciplinary nature of the research is indicative of the transformative effect information technology has had on society.

One vector, the distributed work stream, primarily emphasizes the distance and issues that arise when teams are not co-located (Smith, Spencer, & Billings, 2007). A second (and often overlapping) line of research is Computer Supported Cooperative Work, which largely refers to the role of the computer in providing connectivity and modulating the form of interactions (Grudin, 1994; Olson and Olson, 2007; Winograd et al., 1986; Tatar et al., 1991). A third vector is joint cognitive systems – collaborative cognitive effort across machines (or cognitive artifacts) and their human counterparts (Hollnagel and Woods, 2005; Smith, 2017; Woods and Roth, 1988; Hollan et al., 2000). These will each be examined in depth to draw the connection to controlling the cognitive costs of coordination during anomaly response. Before beginning this examination, it is useful to set some definitions up front.

Key Concepts

This analysis is situated chiefly within the foundation of coordination in joint activity and the grounding necessary for joint activity. These constructs have been well established in socio-linguistic circles (Clark, 1996; Clark and Brennan, 1991; Levinson, 1979) and more recently applied to cognitive systems (Klinger and Klein, 1999; Fairburn et al., 1999; Klein et al., 2005; Woods, 2017; Mansson et al., 2017). Specifically, in this dissertation work, these constructs are embedded in the context of anomaly response in large-scale distributed systems. Perhaps unsurprisingly, there are a wide variety of definitions for concepts such as coordination, choreography, costs of coordination, joint activity and distributed systems. As such, it is useful to briefly define how the terms are used in this work; a more detailed discussion follows later in the chapter.


Coordination refers to the ongoing management of dependent dynamic interactions arising from joint activity. This definition draws from Malone & Crowston (1994), Clark (1994), Klein et al (2005) and Johnson et al (2014).

Clark (1996) defines joint activities as ongoing processes with multiple participants that vary on multiple dimensions such as scriptedness, formality and verbalness.

Cost of coordination is most typically defined in the literature in terms of the time and effort associated with communication (MacMillan et al., 2004; Malone, 1994; Clark, 1996). For the purposes of this study, the definition from Klein et al. (2005) of the additional mental effort and load required to participate in joint activity is used. Expert performance is subtle and nuanced, and tooling affords new and different ways of working jointly that are not limited strictly to communication.

Finally, while all of the above definitions have layered constructs behind them, the term distributed system in particular defies a compact definition. Therefore, with baseline definitions established for our key terms, we begin with an in-depth look at distributed systems to uncover a suitable definition.

Large-scale distributed systems

Martin Van Steen said, “distributed systems are like 3D brain teasers: easy to disassemble; hard to put together.” And so it is with assembling a definition of a distributed system. Systems thinking emerged in the 1940s from biologist Ludwig von Bertalanffy’s attempts to counteract reductionist thinking in the sciences (Von Bertalanffy, 1972) and was later extended by Ashby.

There are a number of shared characteristics – having to do with the complexity of the system under study, the nature of the tasks being supported and the user – that overlap. The significant departures between these positions have to do, in part, with the extent to which they account for complexity (including where the boundaries of analysis are drawn) and the way they account for time (recognizing both tempo changes and the hyper-accelerated transactions in some environments).

Multiple fields of study have investigated large-scale distributed work systems including, but not limited to, systems engineering, organizational and management sciences, biological and ecological sciences, military sciences, and cognitive systems engineering. While biological and ecological sciences (as well as countless other fields) have important contributions relevant to system properties (related to complexity, emergence, interactive variability), this dissertation reluctantly excludes them from this discussion in order to focus on work systems. It could be argued the computer sciences (CS) include a scope much broader than work systems, and this is true. However, the framework of distributed systems in CS is fundamental for examining information technology architectures and the fundamental assertions inherent in how automated and/or intelligent machine agents are designed and deployed into a system of work.

These multiple lenses are critical for representing the diversity of the transdisciplinary nature of the problem space of how to design for and control large-scale distributed work systems. However, they are also a source of confusion when researchers use a ‘same but different’ term in varied domains, and the chasms in language and scope between fields may impede truly integrated research. Speaking of systems research, Ackoff (1971) notes:

“Despite the importance of systems concepts and the attention that they have received and are receiving, we do not yet have a unified or integrated set (i.e., a system) of such concepts. Different terms are used to refer to the same thing and the same term is used to refer to different things. This state is aggravated by the fact that the literature of systems research is widely dispersed and is therefore difficult to track. Researchers in a wide variety of disciplines and interdisciplines are contributing to the conceptual development of the systems sciences but these contributions are not as interactive and additive as they might be.” (p. 661)


The same could be said of the current state of large-scale distributed systems research. Even within a single field (systems engineering, for example) there are multiple definitions. There may be validity in avoiding prematurely defining a common conceptual framework: Hoffman et al. (2002) assert that as a field progresses, both incremental and substantial variations generate new labels to describe work that may share some common features but differ in significant ways, which can be beneficial for a field of science.

Defining a large-scale distributed system

The use of the term distributed system has spanned multiple organizational and technological advancements, some of which have fundamentally altered what it means for a system to be distributed. In this way, I argue that what constitutes a large-scale distributed system needs to be revisited. However, this does not demand the hubris of an entirely new definition and scope. As with Webster’s (1994) treatment of what constitutes an ‘Information Society’ and Woods’ (2015) examination of ‘resilience’, there is value in acknowledging how the multiple perspectives remain relevant (if underspecified).

Therefore, these disparate definitions converge on several similarities which I use to re-assemble the characteristics of large-scale distributed systems for the purposes of this paper. They involve:
- multiple nodes in a network, whereby a node has autonomous capabilities but also contributes in some way to the collective work/action that accomplishes system level objectives
- representation of cross-scale groupings, whereby a node can stand for a human or machine agent at one level as well as for higher levels of abstraction, such as organizational entities made up of human-human and human-machine groupings
- inclusion of multiple overlapping boundaries (both inter- and intra-organizational)
- geographic distances mitigated by technological connections that aid in the communication and dispersion of information across nodes and timeframes.

This definition provides a high-level framework to think flexibly about the interactions between nodes without explicitly dealing with the capabilities and limitations of each node. Some nodes represent highly proficient expert human resources; some represent automated agents that are dumb and dutiful (Wiener, 1988). Depending on the properties of the system you wish to understand, you move between levels of abstraction. For example, in studying the cognitive costs of coordination, where the unit of analysis is the joint cognitive system involving human-human groupings interacting with semi-autonomous machines, it may be appropriate to acknowledge more distal influences as abstracted nodes – such as a ‘regulatory agency’. In doing so we do not discard the complexity inherent in acting in a multi-scale system by ignoring its effects, but neither are we overrun by the challenges of examining each node in the system at the same level of detail.
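To make the idea of nodes at different levels of abstraction concrete, the following sketch (my own illustration, not drawn from the dissertation's data or any particular system; all names are hypothetical) represents a mixed network in which an on-call engineer, a monitoring service and a distal 'regulatory agency' are all nodes carrying different amounts of detail.

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Node:
    """A node in a large-scale distributed work system.

    A node may stand for a single human or machine agent, or for a coarser
    organizational entity treated at a higher level of abstraction.
    """
    name: str
    kind: Literal["human", "machine", "organization"]
    detail_level: int = 0                                # 0 = fully modeled; higher = more abstracted
    depends_on: List[str] = field(default_factory=list)  # cross-node interdependencies

# Mixed-granularity network: proximal agents are modeled in detail,
# distal influences are kept as abstracted nodes rather than ignored.
network = [
    Node("on-call engineer", "human", depends_on=["monitoring service"]),
    Node("monitoring service", "machine", depends_on=["production cluster"]),
    Node("production cluster", "machine", detail_level=1),
    Node("regulatory agency", "organization", detail_level=2),
]

# Shift the unit of analysis by choosing which level of detail to examine.
joint_cognitive_system = [n.name for n in network if n.detail_level == 0]
print(joint_cognitive_system)   # -> ['on-call engineer', 'monitoring service']
```

The point of the sketch is only that the same network structure can hold fully modeled agents and coarsely abstracted influences side by side, which mirrors how the unit of analysis can shift between a joint cognitive system and the broader multi-scale system.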

The large-scale distributed systems of interest

Not all large-scale distributed systems are of interest in this dissertation. For example, Amazon is undoubtedly a large-scale distributed system where the goods are housed in multiple warehouses and delivered through countless distribution centers and delivery trucks. It has an extensive network of interdependent activities distributed across supply chains and groups of humans and machines (Leblanc, 2019). However, individuals within the system are not responsible for maintaining process control. Therefore, the systems of interest involve a degree of responsibility and the concurrent skills and abilities to bear the responsibility.

Time & coherence

Two underemphasized but important features of distributed systems have to do with the function of time and the extent of integration.

First, distributed systems operate on varying time scales and there is seldom, if ever, a “global clock” shared across agents (Tanenbaum & Van Steen, 2007). They note that the “lack of a common reference of time leads to fundamental questions regarding the synchronization and coordination within a distributed system” (p. 3). This comment is relevant to the broader discussion of coordination to be examined later in this chapter.
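For readers less familiar with this property, distributed-systems practice typically substitutes logical ordering for a shared clock. The following minimal Lamport-clock sketch is illustrative only (it is not part of the dissertation's argument, and the agent names are hypothetical); it shows how two agents with no common time reference can still preserve the causal order of an alert and its acknowledgement.

```python
class LamportClock:
    """Minimal Lamport logical clock: orders events by causality, not wall time."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """A local event occurred."""
        self.time += 1
        return self.time

    def send(self):
        """Stamp an outgoing message with the current logical time."""
        return self.tick()

    def receive(self, message_time):
        """Merge the sender's logical time so causal order is preserved."""
        self.time = max(self.time, message_time) + 1
        return self.time

# Two agents with no shared clock still agree that the alert
# causally precedes the acknowledgement that responds to it.
monitor, responder = LamportClock(), LamportClock()
alert_ts = monitor.send()               # monitor emits an alert
ack_ts = responder.receive(alert_ts)    # responder processes it
assert ack_ts > alert_ts
```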


The reification of distributed systems

In addition, distributed systems are commonly seen as being a single coherent system. Tanenbaum & Van Steen (2007) define this to mean “end users should not even notice that they are dealing with the fact that processes, data, and control are dispersed across a ” (p. 4) but note that achieving coherence is difficult, so architects often settle for the appearance of coherence. The danger here, as Woods et al. (2010) point out, is that this is a reification fallacy and can serve to obscure the complexity of the problem demands faced by those tasked with managing and restoring service during failures. While it is necessary, in this dissertation, to use the short-hand of distributed system, it is understood that it is not a single entity but rather a complex conglomerate, a tangled layered network (Woods, 2018) of extensive dependencies (Woods, 2015; Smith et al., 2013).

The definition for large scale distributed systems provided here is considered relevant for looking at the coordinative elements of critical digital infrastructure.

Coordination

As noted in the introduction to this dissertation, much of human activity requires interacting with others. And like the variety of examples given earlier, many disciplines and domains have approached the study of coordination. Each discipline approaches the term differently, and most of us have an intuitive sense of coordination. More specifically, we can intuitively recognize the ends of the coordination spectrum – when it works, in smooth well-coordinated activity (such as dining in a busy restaurant without incident), and when it doesn’t (such as attempting to leave a stadium after a professional sports game). This multi-disciplinary interest, plus the ease with which we assume we understand coordination, is a challenge for studying it as a phenomenon of cognitive work: the construct has a heavy legacy, and parsing the difference between the different approaches can be effortful or can rapidly lead to getting mired in debating the semantics of definitions.


In a seminal 1994 paper, Malone & Crowston identified the need for an interdisciplinary approach. This foundational piece defined coordination as “managing interdependencies between activities” (p. 12), then refined that definition to emphasize a focus on goal directed activity. They further go on to characterize different aspects of dependencies and the concurrent processes that can be used to manage them. In this work, coordination is seen as a function of goals, activities, actors and their interdependencies, as shown in Figure 2.1 below.

Figure 2.1 Components of coordination (Malone & Crowston, 1994)
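A deliberately simplified sketch (my own illustration; the activity and actor names are hypothetical) shows what it looks like to treat coordination as managing explicit dependencies among goal-directed activities and the actors assigned to them: given a record of what is done, the dependency structure determines what may proceed. As the next paragraph argues, the weakness of exactly this framing is that the dependency graph is treated as static.

```python
# Illustrative only: coordination framed as managing dependencies between
# goal-directed activities (after Malone & Crowston, 1994). Names are hypothetical.
activities = {
    "detect anomaly":   {"actor": "monitoring service", "needs": []},
    "diagnose fault":   {"actor": "on-call engineer",   "needs": ["detect anomaly"]},
    "notify customers": {"actor": "support lead",       "needs": ["diagnose fault"]},
    "deploy fix":       {"actor": "on-call engineer",   "needs": ["diagnose fault"]},
}

def runnable(done):
    """Return the activities whose prerequisites are all complete."""
    return [name for name, a in activities.items()
            if name not in done and all(n in done for n in a["needs"])]

print(runnable(done={"detect anomaly"}))   # -> ['diagnose fault']
```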

The coordination processes as outlined are useful for the purposes of understanding coordination dynamics, but this framing has limitations when applied to complex, adaptive systems. It implies that the processes are static – once identified and mapped, they are resolved and can be managed. Spiro et al. (1987) identified this as a form of oversimplification. And indeed, it would be wrong to imply a system is static when it is instead characterized by change, with fluidity between goal priorities, in the kinds of activities and actors needed to handle an event, and with near continuous management of interdependencies moment-over-moment. A danger in doing so is that strategies to manage coordination that are based on this interpretation are slow and stale in adapting to changing conditions.

One might argue that this branch of coordination literature is able to cope with highly dynamic worlds, as evidenced by the work done on ‘Adaptive Coordination’. The term has been used extensively in the literature (Grote and Zala-Mezö, 2005; Henrickson Parker et al., 2018; Burtscher et al., 2010) but it falls short of being ‘adaptive’ relative to the cognitive systems engineering and, arguably, the high reliability organizing literature.

Adaptive Coordination (AC) comes from Entin & Serfaty (1999), who defined coordination as either explicit or implicit. We will explore what this means before looking more closely at AC. Explicit coordination “is achieved through deliberate effort dealing with the coordination as such (e.g. by discussing the appropriate way of action and then deciding on which action to take), or it can be implicit, i.e. coordination is achieved without spending effort on coordination activities as such (e.g. by providing somebody with relevant information at the right time without the other person having to ask for that information)” (Grote and Zala-Mezö, 2005, p. 193). This framing gives pause from a cognitive work perspective. First, it supposes the visible presence of coordinative action to be representative of effort. For someone to know what information is relevant to another, and at what point in time, particularly in multi-threaded and time pressured activity with multiple competing demands, is effortful yet invisible. This cognitive work implies a deep knowledge of the task being undertaken, of the expected flow of events, and anticipation that the information being provided may be lacking in the other person. Contrary to this being “without spending effort”, the person providing the relevant information incurs a cost in doing so. Turning again to the term Adaptive Coordination, we find it “has been defined as a team’s ability to change its coordination activities in response to unexpected external events or changes in task or team characteristics” (Burtscher et al., 2010). More specifically, it is used to denote when practitioners shift from one form to another – implicit to explicit or vice versa (Grote and Zala-Mezö, 2005; Henrickson Parker et al., 2018; Burtscher et al., 2010). In doing so, it assumes that explicit coordination is merely another form of coordination, but some literature has shown this can be qualitatively different – a signal of a coordination breakdown brought on by the demands of the event. As Latour (2005) says, “group formations leave many more traces in their wake than established connections which, by definition, might remain mute and invisible” (p. 31). One aspect of recognizing coordination challenges is simply needing to talk about, or being able to more readily see, the coordination; so, in this way, adaptive coordination may not be an entirely controlled (or useful) technique from a cognitive work perspective. A second consideration is that in normal operations, coordination can be difficult to detect and to extricate from the system of work it is embedded within (Hutchins, 1996; Wears, 2005).


The emphasis on using predominantly observable features of coordinative interactions as a basis for understanding coordination, coupled with the focus on processes of coordination, can make it challenging to introduce alternative methods and frameworks for studying coordination. In this way, connecting the cognitive work inherent in coordination to the research question of how practitioners control the costs of coordination while carrying out the functions of anomaly response under conditions of uncertainty, risk and time pressure requires looking outside the traditional literature. A possible reason for the insufficiency of the broader field of coordination research for cognitive systems engineering is, as Nemeth et al. (2006) put it, “looking and listening alone are not enough to [capture] the density and complexity of information and interaction…” (p. 1012).

Joint Activity

The conceptual framework of joint activity (Clark, 1996; Klein, Feltovich, Bradshaw, & Woods, 2005) offers an alternative to the entrenched conceptualizations of coordination seen in the work of the last 25 years and extends the study of coordination more directly into the cognitive realm. The foundation of Joint Activity from Klein et al. (2005) stipulates that several basic criteria must be met as pre-conditions to well-coordinated joint activity. The first of these is that the parties intend to work together. The second is that the tasks are interdependent (all parties are reliant in some form on others to produce an end result). The third pre-condition is that the activity includes an extended set of behaviors (Clark, 1996) and is grounded by the Basic Compact (a continual commitment to the shared goals and to repairing any breakdowns in common ground). These must be scaffolded by coordination devices that provide structure to the efforts to choreograph shared activities.


Figure 2.2 A description of joint activity

Narrowing the literature to focus more specifically on the cognitive work, we see different lenses for looking at the practice of coordination. For example, Clark (1996) describes coordination in terms of joint actions and joint activities – the actions that emerge when there is a common goal and interdependence of tasks. When individuals take action that is intended to be coordinated with another, it is a joint action. The compilation of sequential joint actions forms joint activities. Mansson et al. (2017) elaborate:

“Joint activities are carried out by participants in particular roles that help shape what each does and is understood to be doing. Participant’s roles may, however, change between different activities or as the nature of the joint activity becomes clear. Although they participate to achieve a certain dominant goal, participants often also pursue other goals such as procedural, interpersonal, and private goals. It is however the public goals, those which are openly recognised by all participants, that define the joint activity. What is required to coordinate then, Clark argues, is common ground of the participants.” (p.551)

Clark & Brennan (1991) describe coordination of content and coordination of process as necessary elements for successful collective action. The mutual shared knowledge, beliefs and assumptions (or ‘common ground’) are required for coordinating on content. Coordinating on process requires updating common ground moment by moment; therefore, “all collective actions are built on common ground and its accumulation” (p. 222). This represents a different way – an event driven perspective – of thinking about coordination, distinct from external process-driven models of coordination.

Joint activity is underscored by the Laws of Collaborative Systems (Woods & Hollnagel, 2006):

First Law of Collaborative Systems (not all you, not all me): It’s not collaboration if either you do it all or I do it all. Collaborative problem solving occurs when the agents coordinate activity in the process of solving the problem.

Second Law of Collaborative Systems (nobody is incompetent): You can’t collaborate with another agent if you assume they are incompetent. Collaborative agents have access to partial, overlapping information and knowledge relevant to the problem at hand.

Coordination Costs, Continually: Achieving coordinated activity across agents consumes resources and requires ongoing investments and practice.

Computers are not situated (need context): Computers can’t tell if their model of the world is the world they are in. Computers need people to align and repair the context gap.

The Collaborative Variation on the Law of Regulation: Every controller is a model of the other agents who influence the target processes and its environment and of those who coordinate their activities directly or indirectly to achieve control, i.e., a model of the activities, models, capabilities, and expectations of the other agents.

Mr. Weasley’s Rule: Never trust anything that can think for itself if you can’t see where it keeps its brain. – Harry Potter

“To function effectively, a team must act as an information-processing unit, maintaining an awareness of the situation or context in which it is functioning and acquiring and using information to act in that situation” (MacMillan et al., 2004). In this way, coordination incurs costs additional to the technical or task-based work being completed together.

Cognitive costs of coordination

Prior research has shown that coordination in joint activity incurs cognitive costs for practitioners (Klein et al., 2005; Klinger & Klein, 1999; Klein, 2006; Woods, 2017). Effective joint activity during periods of high tempo, high consequence work requires mechanisms that maximize the value of coordination across roles while controlling the costs coordination can impose, which can exacerbate the risk of workload bottlenecks (Branlat, Morison & Woods, 2011) or of falling behind the pace of events (Woods & Branlat, 2011). However, follow-up studies that examine how to control the costs of coordination still need to be carried out (Woods, 2017). This work contributes to that effort by providing a detailed account of the cognitive costs associated with choreographing joint activity.

Costs of coordination have been explained in different ways in the literature. Studies from the cognitive psychology field often account for them quite literally, as the time costs (as measured by eye tracking) of delayed action responses as the complexity of the tasks increases. Malone (1987) describes “the costs of maintaining communication links (or "channels") between actors and the costs of exchanging "messages" along these links”. Clark & Brennan (1991) posit that grounding in communication comes with associated costs that are “paid” by one or all of the participants in a conversation (as described in Figure 2.3 below).

Figure 2.3 Costs of Coordination - Clark & Brennan (1991)


Building off this, Klein et al. (2005) extended the concept to show how these communicative costs are associated with various forms of cognitive work in joint activity. In this work, “coordination costs refer to the burden on joint action participants that is due to choreographing their efforts” (p. 12). Despite the extension of Clark (1996) from communication to a broader range of joint activities, the lineage limits that analysis in appreciating how the concepts of common ground and coordination costs are linked. Passages such as “effort spent in improving common ground will ultimately allow more efficient communication and less extensive signaling. On the other hand, failure to invest in adequate signaling and coordination devices, in an attempt to cut coordination costs, increases the likelihood of a breakdown in the joint activity” (p. 18) show the authors’ definition of coordination costs to be tightly coupled to communication.

Of note is how Klein (2007) further developed the notion of costs in the flexible execution of planning and replanning – known as flexecution. “Flexecution creates a cost of continually thinking about benefits and drawbacks for different courses of action—issues that are believed to be resolved during the planning stage of a conventional planning and execution sequence. Flexecution also requires additional communication to prevent confusion and maintain common ground. Each time the goals and priorities change, the leaders must issue notifications and re-direct their subordinates’ attention. Leaders who change directions too often might reduce their teams’ effectiveness. Thus, we have the added complexity of emerging goals in a (distributed) supervisory control environment” (p. 110).

In this way, Klein has pointed to the need for coordination to be dynamic – continually adjusting relative to the problem demands and environmental conditions. A key point is that being unable to keep pace with the problem demands can generate coordination breakdowns and the basic patterns of complex systems failure – decompensation, working at cross purposes and getting stuck in outdated behaviors (Woods & Branlat, 2011). These patterns are relevant to highly interdependent networked systems with extensive and hidden dependencies.


Anomaly Response

The anomaly response model (Woods & Hollnagel, 2006) describes a dynamic process of detecting, diagnosing and responding to unexpected events. It describes how an agent is able to recognize anomalous behavior, then act either to safeguard against performance degradations or to seek further information to explain the discrepant finding. As more information becomes available, they revise their hypotheses and replan their actions accordingly.

Figure 2.4 Multiple intermixed lines of reasoning (from Woods, 1994)

What is implicit here is that this takes place under time pressure and that anomalies are assumed to be of high consequence. Woods (1994) notes that in complex systems failures

“a fault initiates a temporally evolving sequence of events and process behavior by triggering a set of disturbances that grow and propagate through the monitored process if unchecked by control responses.” (p.64)


Given the consequence of failure, definitive assessments for disturbances are secondary to taking action to mitigate further degradation. Often these actions provide information that can inform and update tentative assessments. Woods (1994) separates dynamic fault management from troubleshooting a broken device by underscoring how “time pressure, multiple interacting goals, high consequences for failure and multiple interleaved tasks” (p. 2371) generate additional cognitive demands.

In addition, anomalies do not present fully formed. Instead, they evolve over time and cues about the changes within the system can be partially or fully obscured or intermittent. The capacity to cope with changing conditions can become saturated and recruiting additional resources can allow a system to stretch its capabilities (Woods & Wreathall, 2008).

All of these factors create significant cognitive demands. This cognitive complexity needs to be distributed to limit the complexity for any one individual (Smith, Spencer, & Billings, 2011). The framework of joint activity, coupled with anomaly response, links the benefits accrued through collaborative joint effort (across humans and machines) to the demands of dynamic fault management.

Joint activity in Anomaly response

The anomaly response model represents dynamic intertwined processes within multi-threaded activity. This activity is distributed across a multi-role, multi-echelon network of human and machine agents. The connectivity afforded by modern IT systems has increased the scale of this activity; therefore, the useful unit of analysis becomes the network instead of the typical unit of joint activity - a small work group. Operating within the network are nodes - joint cognitive systems of human-machine or human-human teams that are not static roles or configurations.

Nodes in a network are not definitive, and relationships change over time as conditions change. Olson and Olson (2000) make a highly relevant point:

"Coupling is associated with the nature of the task, with some interaction with the common ground of the participants. The greater the number of participants, the more likely all aspects of the task are ambiguous. Tasks that are by nature ambiguous are tightly coupled until clarification is achieved. The more common ground the participants have, the less interaction required to understand the situation and what to do." (p.162)

This is worth breaking down in greater detail before moving on as these insights can provide design guidance for distributed anomaly response:

• "Coupling is associated with the nature of the task, with some interaction with the common ground of the participants." If the nature of the work is non-routine and highly interdependent, as in anomaly response, group members will need more frequent and complex communication, short feedback cycles, and multiple streams of information.

• “The greater the number of participants, the more likely all aspects of the task are ambiguous. Tasks that are by nature ambiguous are tightly coupled until clarification is achieved.” This statement provides insight into the dynamic needs inherent in anomaly response and reinforces the need to support synchronization and coordination. That is, the redeployment, reconfiguration and redistribution of resources as new information becomes available.

• "The more common ground the participants have, the less interaction required to understand the situation and what to do." This is of particular interest for non-co-located work groups in several ways. The first is that some measure of common ground, along one or more dimensions, is often sacrificed in the service of other goal priorities. Well-calibrated teams can recognize when one or more dimensions are suddenly insufficient and can act quickly to repair them. The second relevant point is that it raises questions about what kinds of interactions will be necessary to repair and sustain common ground as the event progresses.


Summary

The conceptual framework for this dissertation emphasizes the constraints and demands imposed by large-scale, distributed systems and the nature of coordination – specifically joint activity – during the ambiguous, time-pressured work of anomaly response. These concepts have been well studied individually, and in some domains, collectively. However, there remains a valuable opportunity to explore how the integration of the particular orientations discussed – namely towards cognitive work and human-machine teams – influences coordination strategies.


Chapter 3 – The domain of Critical Digital Infrastructure

Critical digital infrastructure (CDI) is increasingly at the core of many high-risk domains that serve societal needs. Electronic health records, military intelligence surveillance, 911 call routing systems and the monitoring of high hazard processes like nuclear power plants and air traffic control systems are all software intensive operations (SIO) that critically depend on reliable, timely system performance. Technology advances in recent years have shifted not only the technical architecture but also the organizational design and work practices surrounding software engineering and digital service delivery (Kim et al. 2016; Beyer et al. 2016). This has, in turn, led to a concurrent shift in the skill sets and types of expertise needed to manage the infrastructure under these new conditions. These changes have created new challenges for coordination that have yet to be fully realized. In addition, SIO are heterogeneous - both in terms of the ways in which they utilize new architectures and the degree to which their practices are integrated across the systems they manage. The implications of the kinds of technological advances that generate new forms of dependencies, increase complexity and create new coordinative demands have been discussed for many high hazard domains. An important point is that, similar to the patterns from high hazard domains, these new forms of dependencies, increased complexity and coordinative demands have become a reality for many kinds of critical societal infrastructure, and their full implications have yet to be explored.

This chapter expands on this point and sets the context for a certain class of software intensive organization - those operating critical digital infrastructure (CDI). The first section describes the differences between traditional software engineering and Agile or DevOps methods, including the architecture and infrastructure. A brief explanation of some of the processes and practices is given, as this is important grounding for the reader. The chapter wraps up with a discussion of these changes and their implications for coordination.

Increasing complexity, speed and scale in software engineering

The shifting landscape described in this section was a result of technological capabilities - practices like continuous deployment and the move from dedicated hardware to virtual servers - making greater speeds and scales possible. However, what underpins it is a fundamental orientation to the world that is based on rapid change. For all the rhetoric in the management sciences about being responsive to change, many organizations utterly fail to understand the kinds of flexibility and adaptation that are needed to adapt at the speed of progress. Software engineering recognized this need in its first push to closely integrate new feature development and ongoing operations (hence the term DevOps) but quickly discovered that the organizational paradigms need to be fundamentally different as well. However, challenges remain in the 'transformation' to a DevOps or Agile model, in part due to the tension inherent in how to regulate these kinds of environments (Feltovich et al., 2007). This section describes these two orientations and touches briefly on the models of organizing reflected in the approaches to software engineering.

Changing methods of software development

The first substantial shift toward taking advantage of the speed and scale technology affords was in the software development process itself: the move from waterfall to continuous deployment/continuous integration (CD/CI) methods.

The waterfall method of developing new software followed a linear, sequential process with clearly delineated phase transitions. These phases were the domains of specialized sets of experts (user researchers, designers, software engineers, testers, support engineers).

The process could last months or years depending on the complexity of the development, the rigor of the quality assurance process and the priorities of the company (be first to market or be highly reliable with quality products). The requirements phase involved extensive user research to develop a clearly defined scope of features to be included in the development. These design briefs created a model of the user and the product and were hardened before (or shortly into) development to prevent scope creep and keep the project on time and on budget. While it allowed a higher degree of predictability around the product to be delivered, the costs and the timeline, the method had a number of drawbacks. First, the models generated from the requirements phase were a snapshot of the user and the way they would use the product at a certain point in time, independent of changing needs and conditions in the market. The long duration between when the user research was conducted and when the final product was delivered created gaps that made products obsolete quickly: other products had entered the market, tastes and needs had changed, or users had found alternatives to meet their needs. Secondly, inevitable 'bugs' in the code had to wait until the next version (another full development cycle) or a special update for the problem to be fixed. This meant users had to tolerate poor performance that may have begun the day they bought the product.

Because of the multitude of ways in which bugs could affect performance, the diversity of hardware the software could run on, and the other programs running alongside it, it became difficult to predict all the ways in which the programs could break. Developers prioritized fixes to balance the efficiency-thoroughness tradeoff, and hitting launch dates for new deployments of software was a frenzied affair.

In contrast to the waterfall method, the Agile method involves ongoing updates being made to the system in real time without taking it offline for maintenance. Requirements gathering, user testing, fixing bugs and delivering the 'new' program are all integrated into a continuous process. It is no longer about conceptualizing the whole project start to finish – having some developers build it, running some tests on it across a range of conditions, pushing it out to the world to use, and having some tech support engineers available for users when it breaks. CD/CI purposefully pushes out 'unfinished' projects – what are known as Minimum Viable Products – with minimal features but ones that go straight to the heart of the issue. Do users need a software program that lets them talk instantly with someone they are working with? A company might create a bare-bones version of messaging. Users provide ongoing feedback to iteratively improve the product. For example, after using the messaging platform they might discover the importance of letting someone else know their messages were read and agreed with, so emojis get added to aid in signalling and observability. Or, users prefer to send screenshots instead of describing something in words, and that drives the added feature of sharing photos.


To enable this, fundamental changes were needed. Examples of this are found in the approach to day-to-day activities, the tooling to support software development processes and the coordination across users, vendors and developers.

Physical to virtual - Moving to the Cloud

A fundamental aspect of this model of software engineering was the hardware it ran on. Prior to the internet, information technology (IT) infrastructure looked very different. In large organizations, software was run on "bare metal" – physical servers and computing assets owned by the organization that hosted particular content – and machines were added as capacity needs grew. But with the rise of cloud providers, companies no longer needed to own their own hardware – instead they could 'rent capacity' from other companies who would handle all the maintenance and upkeep of the hardware in accordance with service agreements. These service agreements specified the amount of downtime a company could expect and guaranteed the reliability of the system to a specified level.

Making changes in real time

In these large-scale systems, updates to fix bugs or security issues meant taking the servers 'down', or offline, to perform the updates. This meant users could not access the system while the servers were offline. Most companies handled this by doing the work in the middle of the night to minimize the impact on users. The ownership and location of the servers were important to how the maintenance would get done.

In an era where most organizations have computers and devices that are continuously connected to the internet, developers realized they could “push” fixes to users by having the systems regularly download updates. This meant users wouldn’t have to wait months or years for the newest version and wouldn’t have to re-buy glitchy software all the time.

To illustrate this, let's use an example familiar to most readers and then connect it to a broader domain example. Essentially, all small-scale IT (such as an individual with a personal computer) operates "on premises", whereby the owner of the computer is responsible for making sure updates are completed, achieving connectivity to a network (for instance, a local area network at a university campus or an internet connection) and troubleshooting any issues on their own. However, large organizations, or those whose IT is a critical function, might contract out the management of their hardware (servers, data centers) and software to large companies such as IBM or Cisco. These organizations would maintain a workforce of skilled technicians to visit the client site and handle their IT needs. In these cases, they are paying for the 'service' of reliable performance of the system. While this was reserved for a specific type of organization in the past, the shift from on-premises hosting to Cloud-based infrastructure is increasingly becoming the standard. Hosting 'in the Cloud' is where organizations virtually 'rent' the hardware and server space they need by operating virtual servers provided by a large provider like Amazon Web Services (AWS) or Microsoft Azure. Essentially, this eliminates the need for on-site management of hardware or software. It is called Software-as-a-Service (SaaS). This has the benefit of allowing an organization to scale (grow their system's capacity) on an as-needed basis. In between on prem and fully virtual is a hybrid Cloud architecture. This type of platform works with components from a mix of public or private cloud and on-premises infrastructure. Organizations that have their own internal offerings may choose a hybrid model to utilize their own company capacity and 'eat their own dog food' as the saying goes. This structure can offer flexibility (particularly for an organization undergoing a shift to Cloud and building up internal service delivery capacities).

This can also add increased complexity. The promise of SaaS offerings is that an organization outsources their hardware and software needs and pays for the reliability. In a labor market where highly skilled engineers are at a premium, this outsourcing ostensibly delivers reliable performance without needing to maintain internal capacity. However, many organizations run proprietary software and hosting these on external (and potentially competitor) platforms can create a concern about corporate security. In addition, strict regulations for the handling of data protection and privacy of personal information brought about by the European General Data Protection Regulation (GDPR) increased the consideration of data custody.


Service level agreements

Reliability-based service level agreements (SLAs) or objectives (SLOs) define the expected reliability (uptime) of services between a client and a vendor. To meet these service objectives, organizations developed strategies for maintaining reliability, including how reliability teams are resourced, how client issues get resolved and how acute problems (outages or severe degradations) are handled. In addition to handling unplanned (acute) outages, these groups manage chronic maintenance activities (such as installing updates, fixing bugs and responding to user support needs). The next section explains how continuous updating and reliability are related and how teams cope with unexpected events.
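To make these reliability targets concrete, the sketch below converts an availability objective into the downtime 'budget' it implies. This is a minimal illustration in Python; the 99.9% and 99.99% objectives and the 30-day window are assumed values for demonstration, not figures drawn from the organizations studied.

# Illustrative only: converting an availability objective into a downtime budget.
# The objectives and the 30-day window below are hypothetical values.

def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the minutes of allowable downtime for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

if __name__ == "__main__":
    # A 99.9% ("three nines") objective over 30 days allows roughly 43 minutes of downtime.
    print(round(downtime_budget_minutes(0.999), 1))   # 43.2
    # A 99.99% objective shrinks that budget to about 4.3 minutes.
    print(round(downtime_budget_minutes(0.9999), 1))  # 4.3

Even a fraction of a percentage point in the objective changes the room a response team has to diagnose and repair before the agreement is breached, which is one reason these figures shape how reliability teams are resourced.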

CD/CI as continuous adaptation

The success of CDI has led to systems that are continuously changing and expanding, leading to operations at large scales. The IT systems under management undergo continuous change through ongoing deployments of new code (Allspaw, 2016). Continuous deployment is needed to maintain functionality for service users. New features, capabilities and repairs are integrated without taking the system offline. In this way, the system's continuous adaptation to accommodate demands (user feature requests or changes in the market, including increased pressure to innovate) is both a source of resilience and a hard problem for system design and control (Alderson and Doyle, 2010).

As a continuous process, CDI, like space operations and nuclear power, requires recognition of disturbances to the 'live' system that threaten outages and diagnosis of what is driving the unexpected and abnormal behaviours in order to intervene in timely ways to avoid or minimize service losses (Woods, 1994; 1995). Because CDI services increasingly support business and safety critical activities that require flexibility to adapt in real time, stakeholders will increasingly demand near perfect reliability. The value of the system to its users and operators becomes so great that there is very little tolerance for outages – planned or unplanned. Thus, controlling the risk of service outages requires finely tuned capabilities for anomaly response when issues arise. Because the knowledge and expertise to detect, diagnose and repair outages is highly specialized, the personnel required to manage the systems need to be on-call and ready to respond wherever they may be. So, the need is for distributed anomaly response and for the engineering practices and tooling to support distributed anomaly response.

Service outages

Distributed anomaly response in this domain involves a mix of synchronous and asynchronous joint activity for diagnosing and resolving threats. This means new roles are continually coming into the incident to aid in the diagnostic and repair processes - escalating the coordination demands (Woods & Patterson, 2001; Klein et al., 2005). This domain reflects two general trends. One is that CDI has led to systems that operate at new scales with extensive and hidden interdependencies, increasing the pressures on the cognitive work of anomaly response. Another is the move from a physical control center to distributed roles who coordinate and communicate through software-based mechanisms such as online chat. As a result, as anomalies occur and threaten service outages, a broad range of roles get engaged. This extensive potential participation both provides value and increases the importance of keeping the costs of coordination low.

Tooling to manage incident response

Resolving anomalies requires dynamic coordination across multiple roles and groups that span inter- and intra-organizational boundaries. Which roles and groups provide knowledge and information crucial to diagnosis and resolution cannot be known in advance. Therefore, tooling that enables rapid coordination across a broad range of players is needed. Virtual chat platforms (online chat known as ChatOps) enable transparency and listening in across the different threads of activity. Most virtual chat platforms provide both ‘open’ virtual spaces (predominantly called channels) and ‘closed’ spaces (such as direct messages and private channels).
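As a concrete illustration of the kind of ChatOps automation this tooling supports, the sketch below posts an incident announcement into an 'open' channel via a generic incoming-webhook endpoint so peripheral roles can listen in. The webhook URL, channel name and payload fields are hypothetical placeholders assumed for illustration, not the API of any specific platform used by the organizations studied.

# A minimal sketch, assuming a generic incoming-webhook endpoint (hypothetical URL and fields).
import requests

WEBHOOK_URL = "https://chat.example.com/hooks/incident-webhook"  # placeholder, not a real endpoint

def announce_incident(channel: str, severity: str, summary: str) -> None:
    """Post an incident announcement so peripheral roles can follow along."""
    payload = {
        "channel": channel,  # e.g. "#incident-2041" (hypothetical naming convention)
        "text": f"[SEV-{severity}] {summary} - responders assembling, follow along here.",
    }
    response = requests.post(WEBHOOK_URL, json=payload, timeout=5)
    response.raise_for_status()  # surface delivery failures to the caller

# Example call with hypothetical values:
# announce_incident("#incident-2041", "1", "Elevated error rates on the checkout service")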

In addition, incident coordination can be aided through video or audio conferencing. Together, these sources provide traceability and insight into the mental models of participants. They support joint activity, the sharing of information across perspectives, and the integration of diverse models of the system. But the transparency also imposes burdens, as chat channels can host several hundred participants listening and looking in on responders, which allows others to anticipate when their involvement is needed. As the set of engaged roles expands, the costs of coordination increase, and this leads to adaptations; for example, a 'core' team — a distributed set of people pursuing critical threads — will shift out of the general channels to work together through alternative mechanisms.

Incident response can be conducted in 'public' warroom channels so users and stakeholders can follow the progress of response efforts. This occurs because of the nature of CDI and from the background of open source development and social coding, which are based on the benefits of distributed collaboration. Because these are critical systems, as services degrade stakeholders become interested, provide information about emerging service losses, and need updates about the status of resolution and restoration of services, especially in order to plan ahead to compensate for the impact of the degraded services. Stakeholder involvement generates additional pressures on the roles trying to understand the anomalies and resolve the situation. The more tangible and severe the threat of outage, the greater the time pressure, the more interruptions occur, and the greater the need to update other roles about progress, while many lines of reasoning and activity are going on under uncertainty (see the cases in Allspaw, 2015; Woods, 2017; Grayson, 2018). As disturbances persist and expand, consequences expand and uncertainty grows, engaging more and more roles, levels, and organizations — driving the tempo of operations up. More roles provide more resources, but come at the price of additional demands. Thus, understanding how to manage distributed anomaly response in ways that control the costs of coordination in this setting (a) is challenging and (b) can yield findings that generalize across settings undergoing growth in scale from dependence on increasingly software intensive and autonomous capabilities.

Descriptive analysis of the ways in which expert reliability engineers troubleshoot live production systems fails to account for the full complexity, pressure and uncertainty faced in an outage. This is in part why attempts to manage coordination can constrain rather than support performance. The next section will discuss methods used to manage coordination in digital service delivery.


Attempts to manage coordination for incident response

To minimize downtime during service outages, a variety of practices have been adopted from emergency service organizations, such as the incident command system (ICS). Various models have been defined with greater or lesser emphasis on bureaucracy, rigidity and role clarification (Bigley & Roberts, 2001; Jamieson, 2005). The software engineering domain has adopted these techniques to manage coordination during incident response (Beyer, Jones, Petoff and Murphy, 2016). However, close study of how engineering teams respond to incidents suggests this model is an oversimplification (Grayson, 2018; Woods, 2017).

Implications

Moving from Waterfall software development and years-long development lifecycles to one measured in weeks and days, and moving from hardware located "on prem" to the Cloud, undoubtedly increased adaptability to changing conditions. These changes also mean the company transfers the burden and costs of managing and maintaining IT infrastructure. These shifts create opportunities for organizations to grow and capitalize on market trends, and they bring substantial benefits to the user. However, they also expose organizations to new forms of failure and transform the work of engineers.

While Agile refers to a specific methodology, as do continuous deployment/continuous integration practices like DevOps, there remains ambiguity about these distinctions. Arguably, the domain has been "in transition" for the last decade, but what actually constitutes whether an organization is 'doing' DevOps remains up for debate. While organizations see the value of this approach, many remain poised on the cusp of change (particularly those in slow-to-change domains like personal banking, insurance or airline routing systems) and have yet to shift to new models of maintaining their systems. The ambiguity, and the substantial challenges of managing legacy systems while attempting to move to new architectures and technologies and cope with the demands these bring, represent significant issues for modern organizations. To a large extent, this is why we see many 'celebrated' failures of these kinds of systems - the rate of change in the world exceeds the rate at which many systems are maintained and updated, which leaves them "unable to keep pace with change and cascades" (Woods, 2018) that can trigger substantial failures. The time for an organization to choose whether or how to make the shift to continuous integration is quickly evaporating. The demands of the competitive market continue to grow and the capabilities afforded by the technologies are too attractive to ignore. John Allspaw once commented that there will soon be a time when he will not get on an airplane that can't update its software while in the air (personal communication). This is not to say that continuous integration methods are inherently safe. Recent incidents like Knight Capital's loss of $440 million over the course of 45 minutes, or the myriad examples of problematic and brittle software, are cases worth studying closely. However, what this points to is that as the technology enables greater speeds and scales, the need to be able to adapt in real time becomes critical. To do so, you need skilled operators who can wrangle the complexity inherent in the systems.

This chapter provided a high-level overview of the characteristics of the domain of interest to studying the cognitive and coordinative work of anomaly response. The next chapter will discuss the methods for researching cognition in software engineering.


Chapter 4 – Research Methods

Introduction

The preceding chapters considered current perspectives for managing coordination in large-scale distributed systems, particularly through the lens of cognitive work. A form of cognitive work - anomaly response - was used to demonstrate both the mismatches and the potential opportunities for applying these frameworks to high tempo, high consequence environments. Poorly designed coordination was shown to incur costs that can become untenable during periods of escalating uncertainty, high tempo and cascading failures. The resulting coordination breakdowns can be attributed to processes that do not account for cognitive load on participants and procedures that are too rigid to allow for alternate strategies when the costs of coordination become too high. A gap was found in the literature concerning the choreography of coordination and a more complete representation of the corresponding costs. Following the review, to set context for the reader, a description of the practices, applications and technologies of the domain was provided to draw the connections between the tempo of work, the inherent uncertainty of continuous deployment environments and the extensive dependencies between inter-organizational and intra-organizational functions. The omnipresence of critical digital infrastructure was made explicit to highlight the implications of this research across a broad range of industries currently trending towards distributed work groups managing complex, high hazard operations.

Next, the approach to the study is discussed. In her seminal work, Plans and Situated Actions: The Problem of Human-machine Communication, Suchman (1987) notes two important objectives: "One objective in studying situated action is to consider just those fleeting circumstances that our interpretations of action systematically rely upon, but which our accounts of action routinely ignore. A second objective is to make the relation between interpretations of action and action's circumstances our subject matter." (p. 72)

Suchman's comments are relevant to the subject matter of this dissertation in that the purpose of this research was to examine the strategies expert practitioners use to balance the cognitive and coordinative demands of managing uncertainty during high tempo events. Coordination for cognitive work, and the corresponding costs it entails, are similarly fleeting and often ignored. This work aims to make this substantial effort visible and draw the connections between the costs and the circumstances to aid in the future design of work.

This is achieved first by looking at a sample of the processes and tooling used to manage coordination for potential sources of brittleness leading to coordination breakdowns. Then, a corpus of cases drawn from a Consortium of critical digital service companies is examined to assess the conditions under which costs of coordination can become unmanageable. More specifically, the focus is on the work of software engineers tasked with the reliability of critical digital infrastructure. This work adds to a series of studies of anomaly response in CDI (Grayson, 2019; Allspaw, 2015).

This chapter is on research methods and will describe the sources of data including the organizations and their participants and provide a brief comment on the availability of data. The chapter then goes on to describe the methods used to explore the mechanisms used by practitioners to control the costs of coordination within large-scale distributed work systems as they carry out the functions of anomaly response under uncertainty, risk and pressure.

Sources of Data

Obtaining access to large-scale, distributed work organizations is not an inconsequential achievement for researchers (Prechelt et al. 2015). A partnership was established through the SNAFU Catchers Consortium with organizations that operate CDI, experience anomalies regularly, and, interestingly, also develop tooling to improve support for anomaly response. In joining the Consortium, all organizations agreed to provide access to their operations through an engineering unit responsible for service reliability.

Organizational characteristics of participants

The organizations studied are medium to large sized companies ranging from 1700 to over 300,000 employees with regional and global operations. These organizations represent different types of service providers with both internal and external customer bases. All are characterized by 24/7 operations of services provided as well as an interest in improving the reliability of service delivery through deeper study of the factors that influence organizational resilience.

Individual characteristics of participants

Typically, a mid-level manager or lead would act as a liaison to the Consortium, assisting in organizing site visits, aiding in collecting data and providing updates on potentially informative cases. These key informants were highly knowledgeable about the system, legacy constraints, the "explody bits" and the prioritization of work related to reliability and resilience. The software engineering teams involved were typical DevOps engineers who were tasked with developing new features as well as ensuring reliable uptime. There was a range of experience levels, both in terms of tenure within the company and experience in software development. Because the study spans multiple organizations and organizational boundaries and the participants are anonymized, it is not possible to prescriptively define more precise characteristics of the responders. It is accepted that their involvement in the incident response gives them legitimacy (meaning their involvement has some value to the organization, either because of their accumulated expertise or because their involvement is a form of coming up to speed or training to further develop their skills). Additional participants encountered during site visits ranged from system architecture roles to product owners, senior and executive management, communications personnel, advocates, client relations, administrators and vendor support engineers. The daily rate of live code deployments into production varied, but all teams were involved in both continuous improvement and ongoing maintenance work.


The Consortium is operated under the Chatham House Rule, whereby participants agreed to transparent discussions of real operational difficulties on the understanding that any information shared outside the Consortium meetings would not include the identity of the individuals or organizations involved. In addition, all data is de-identified and anonymized for the purposes of research.

Data availability

As previously described, the ChatOps method of collaboration provides a high degree of built-in traceability that is often supplemented by video conference recordings, system logs and other traces of activity. Therefore, CDI is a natural laboratory that provides a variety of assets for process tracing of anomaly response in actual failures. Using text-based transcripts in Slack is an unobtrusive method of data collection that causes little disruption to normal operations. Similarly, web conferences are easily recorded with minimal additional effort.

Collecting data on cognition in the wild

That being said, collecting data in production environments can be challenging (Buchanan et al. 2013; Woods 1993). Research in the natural laboratory seeks to discover "how these more or less visible activities are part of a larger process of collaboration and coordination, how they are shaped by the artifacts and in turn shape how those artifacts function in the workplace, and how they are adapted to the multiple goals and constraints of the organizational context and the work domain" (Hoffman and Woods 2000, p.3). However, observation in production environments "is permeated with the conflict between what is theoretically desirable on the one hand and what is practically possible on the other. It is desirable to ensure representativeness in the sample, uniformity of interview procedures, adequate data collection across the range of topics to be explored, and so on. But the members of organizations block access to information, constrain the time allowed for interviews, lose your questionnaires, go on holiday, and join other organizations in the middle of your unfinished study. In the conflict between the desirable and the possible, the possible always wins." (Buchanan et al. 2013, p.68)

In other words, studying naturally occurring phenomena such as anomaly response in 'live' systems means the opportunities for data collection are intermittent. Neale et al. (2004) highlight the challenges of capturing this kind of work, which is distributed across time and place, in addition to the difficulties of negotiating access in operational settings over long periods of time and across multiple participants. They note that, "much of the interaction of interest occurs at times that are either inaccessible to evaluators or occur over long time periods making it impractical to capture. Even when it is possible to collect data, it can be difficult to predict when and where the interaction of most interest is going to occur. Having only "snapshots" of the total interaction that is relevant leaves evaluators wondering whether more data is needed. All these factors make it difficult to prioritize the most appropriate data collection strategies." (p.113)

Therefore, preparations were made to maximize the completeness of the data collected and to understand the potential constraints on data collection within each of the Consortium companies. In this way, the methods used in this dissertation were both planned and opportunistic, and substantial effort was placed on being prepared to capture as much contextual data as was available when events occurred. Those preparations are described next.

Preparations

In the 2 years prior to the study, 10 site visits were made to 3 of the 4 participating companies, including 2 extended residencies. These multi-day excursions served both a social and a technical role. Strong observational research of work depends on the ability to establish rapport with practitioners to help deepen the understanding of context that will inform the later analysis. Technically, the site visits were important orientations to the engineering teams, the range of tooling used and the extent of its use, the variety of team structures and functions of peripheral units (including dependencies) and support roles within the broader organization, as well as the types of problems faced. Throughout this period, different methods of data collection, extraction from the tools in which the data were collected, transcription, preparation and representation for analysis were explored and refined. This foundation provided initial exposure to the challenges of coordination and the attempts to control the costs of coordination both within teams and across the organization. In year 3, the data collection methods were finalized.

Figure 4.1 Distribution of site visits

Investigating models of incident response

The first study was conducted to understand the variation in how teams organized around incident response. While there is a general structure that many organizations follow (the incident command system popularized by disaster response agencies), variations to accommodate available resources, architectures, organizational or technical constraints and pressures all matter for how the model influences coordination. As discussed in the introduction, current methods in software engineering for organizing coordination are based on process-driven requirements. This is problematic. As Hollnagel and Woods (1983) show, models of practice developed from a process perspective are incomplete and representative of work-as-imagined, not work-as-done. Therefore, an important part of examining how practitioners control the costs of coordination during anomaly response is to model how incident response actually takes place, grounded in the system as experienced by responders. There are two important points to be made here. The first is that the current models used to structure an anomaly response are based on implicit assumptions about how practitioners detect, diagnose and resolve service outages, including presumptions about available resources and information that may or may not be true. They are oversimplifications (Feltovich, Spiro & Coulson, 1997) of the complex underlying behavior and interactions taking place. Simplified models of incident response break down very quickly under even lightweight empirical study. Second, when an organization claims to be following a model or best practice, there is an implicit assumption that it is being implemented exactly as prescribed. Minor (or major) variations in important attributes related to the coordinative interplay - like resourcing, authority-responsibility structures, incentivization or expectations for collaboration - can render the original intent of the model unrecognizable.

Methods for defining models of incident response practices

To develop an understanding of the existing models for incident response in software engineering, a review of the site literature was conducted. This review provided an orientation to the generally accepted methods of responding to anomalous events in software systems. Next, the criteria for possible research subjects were determined and company liaisons were engaged to assist with securing access to participants. Of the fifteen work groups identified for the project, fourteen were included. Despite repeated attempts, one work group was unavailable for interviews during the project period. Sixteen engineers with on-call incident response duties were interviewed from twelve different work groups using a semi-structured protocol. Using the information provided in the interviews and access provided by the participants, subsequent observations were made of the customizable tooling dedicated to their incident response and user support. This included chat platforms, monitoring and alerting systems and, where available, postmortem or root cause analysis documents. Of the fourteen work groups, only one declined to provide access to analyze their tooling. This additional access to the tooling enabled approximately 26 hours of observation and approximately 14 hours of review. The observations were of active use of the tooling, while the review was retrospective. Following this initial work, a 'playback' of the data collected was conducted with the interviewees to confirm accuracy and provide an opportunity for clarification. The playbacks were conducted within ten days of the initial outreach. Four interviewees were unable to confirm data collection via the playback. Data collection and analysis were completed, and a playback was presented to a senior site reliability team responsible for tool deployment and development within their business unit. Feedback from this playback was used to further calibrate the models.

Understanding configurations for incident response

The second preparation study was conducted to analyze the configurations for incident response. Configurations are the structured and unstructured interactions of the work environment, hardware and software. A structured configuration refers to a setting or threshold established by the incident response practitioners (an alerting threshold, for example). An unstructured interaction is one where calibration is possible but not utilized.

There is a tight coupling between the incident model used and the configurations for incident response. A rigorous cognitive analysis acknowledges that the physical environment and the artifacts used shape the modes of interaction between people and technology and inform the possibilities for action. This preparation study was developed to establish a baseline of the kinds of physical environments, hardware and software used in incident response. This was not intended to be an exhaustive review, and the methods reflect this - interviews, artifact analysis and observation were used.
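As an illustration of a 'structured configuration' in this sense, the sketch below evaluates an alerting threshold set by responders against a stream of metric samples. The metric name, threshold and window size are assumptions for demonstration, not values observed in the study.

# A minimal sketch of a practitioner-set alerting threshold (all values are hypothetical).
from statistics import mean

ALERT_CONFIG = {
    "metric": "p95_latency_ms",  # hypothetical metric name
    "threshold": 750.0,          # alert when the windowed average exceeds this value
    "window": 5,                 # number of recent samples to average
}

def should_alert(samples: list, config: dict = ALERT_CONFIG) -> bool:
    """Return True when the recent windowed average breaches the configured threshold."""
    window = samples[-config["window"]:]
    return bool(window) and mean(window) > config["threshold"]

# Example: the last five samples average well above 750 ms, so an alert would fire.
print(should_alert([420.0, 510.0, 800.0, 910.0, 980.0, 1200.0]))  # True

Whether such a threshold is deliberately tuned by responders (structured) or left unexamined (unstructured) is exactly the kind of configuration choice this preparation study sought to baseline.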

Interviews

The study included interviews with 46 individuals with responsibility for service delivery. These individuals held roles with a variety of titles such as site reliability engineer, DevOps engineer, product owner, product manager, technical lead and engineering manager. The interviews used a semi-structured interview protocol and were conducted during business hours over the course of three months.

Artifact analysis

Supplemental to the interviews, twenty-seven engagements involving artifact analysis were completed. A follow-up to an interview was considered a discrete 'engagement' when the interview produced artifacts for review. As part of the interview process, interviewees were asked if they would permit access to their software tools or submit any further documents, wikis or other materials (virtual or physical) to help understand their workflow. 96% of participants agreed to additional engagements. As multiple interviewees were part of the same team, there was overlap in the access and artifacts provided. The scope of artifacts reviewed included:

● Standard Operating Procedures - prescriptions for incident management practices.
● Incident Response playbooks - containing technical detail on how to respond to commonly seen problems; some of these were housed in an internal wiki or shared document repository, others were catalogued in the incident response tools, so access to the tool was granted for review.
● Flowcharts - outlining the flow of an incident and escalation practices.
● Service Level Agreements (SLAs)/Service Level Objectives (SLOs) - these documents contained information about the expectations for reliability between a service owner and their users or clients.
● Contact lists - used for compiling information about who to contact for assistance with different dependent tools.
● Internal company wikis - these were used primarily for recruitment and detailed competencies, reporting relationships and contact information.
● Software tools - various tooling used for incident response functions like monitoring, alerting, on-call scheduling and in-event aids (chatbots).

As part of this investigation, seventeen discrete ChatOps workspaces, forty-one channels and seventy service configurations were examined in depth.

Observations

An estimated 420 hours of observation (in person and virtual) of incident response practices were conducted. This included direct observation in the physical workspace and virtual observation. Direct observation was conducted by sitting in the workspace near the response team while also being connected to the web conferences or audio bridges. Virtual observations were entirely online, through the web conference link being used by the incident responders.


Study: Process tracing a corpus of cases

The preparatory work, in concert with the literature review, provided the background to generate an analytical frame for coding. This frame considered concepts relevant to the domain that were reflective of coordinative functions. Coordinative functions can be supportive or adverse, but both are important for study and reveal different insights about cross-scale coordination.

Case collection

A collection of sixty-two cases was reviewed for pertinent coordination features. The review involved reading available post-mortem data and identifying cases that contained elements of coordination useful for studying coordination costs. This was done by applying a lightweight first pass of codes covering a selection of coordinative aspects. Based on this, fourteen cases were selected for further assessment of data availability. Of these, five events were selected to form a corpus of cases for process tracing analysis focused on how practitioners adapt to manage the costs of coordination.

Figure 4.2 Overview of research methods


Data extraction and conversion

The inherent traceability afforded by ChatOps offers tantalizing opportunities for research, yet challenges remain in easily extracting data for analysis. Many organizations use chat transcripts as the basis for their post-incident activities, but this is a relatively superficial process, and accessing time-stamped data inclusive of non-text-based objects for process tracing involves a more concerted effort. Non-text-based objects common in incident data include emoticons, screenshots (typically of logs, dashboards, user reports), gifs, and links shared from other channels that may be embedded as picture files. Text-only files were extracted as JavaScript Object Notation (JSON) files. JSON is a "lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate" (Crockford, 2009). Ingestion into the analysis tool required a conversion to a tab-separated values (tsv) file. This was completed by the tool developer. The tsv files were then uploaded into Churchkey (a tool for qualitative research analysis), which is described later.
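The sketch below illustrates the text-only extraction step: flattening exported chat messages from JSON into a tab-separated file suitable for ingestion. The field names ("ts", "user", "text") are assumptions about the export format; in this study the actual conversion was performed by the tool developer.

# A sketch of JSON-to-tsv conversion for chat transcripts (field names are assumed).
import csv
import json

def chat_json_to_tsv(json_path: str, tsv_path: str) -> None:
    """Write one row per message: timestamp, speaker, utterance."""
    with open(json_path, encoding="utf-8") as f:
        messages = json.load(f)  # assumed to be a list of message objects

    with open(tsv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["timestamp", "speaker", "utterance"])
        for msg in messages:
            writer.writerow([msg.get("ts", ""), msg.get("user", ""), msg.get("text", "")])

# Example with hypothetical file names:
# chat_json_to_tsv("incident_channel_export.json", "incident_channel_export.tsv")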

In addition, any audio recordings, including audio from web conferences, were transcribed using Descript. The timestamped transcripts were then edited to clean up transcription errors, identify different speakers and make note of any unintelligible text. These files were converted to rich text format and then processed by a third party into tsv. The audio transcripts were then uploaded into Churchkey, where they were integrated so that utterances from the audio files overlaid the utterances from the chat transcripts.
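The overlay step can be pictured as a merge of two timestamped streams into a single chronological record, as in the sketch below. The tuple structure (timestamp, source, speaker, text) is an assumption for illustration; Churchkey's internal representation is not documented here.

# A minimal sketch of interleaving chat and audio utterances by timestamp (structure assumed).
def overlay(chat: list, audio: list) -> list:
    """Merge chat and audio utterances, each a (timestamp, source, speaker, text) tuple."""
    return sorted(chat + audio, key=lambda utterance: utterance[0])

# Example with hypothetical values:
chat = [(1000.0, "chat", "responder_1", "seeing elevated 500s on checkout")]
audio = [(1002.5, "bridge", "responder_2", "rolling back the last deploy now")]
timeline = overlay(chat, audio)  # chronological record combining both sources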

With all data sources, the relevant parties and company-specific details were de-identified and anonymized in accordance with the non-disclosure agreements and charter of the Consortium.

Coding

As was previously discussed in Chapter 2, coordination is inherent in all forms of joint activity. However, to focus the corpus on cases relevant to controlling the costs of coordination, it was first necessary to narrow the 62 available cases using a first-pass coding. Cases were coded according to the concepts of breaking down, forming/dissolving, switching, crossing, synchronizing and mentioning as relevant frames to identify scenarios that can help in understanding the costs of coordination. Each is described briefly below.

Breaking down refers to a spectrum of degraded coordination. As noted in Chapter 2, smooth coordination is difficult to see; therefore, breakdowns offer an opportunity to 'see' the coordination. As the costs of coordination begin to rise, the effort needed to manage the interactions is revealed. Therefore, this aspect of the frame is evidenced by actions like the refusal of an invitation to participate in joint activity or the 'non-action' of being unresponsive to others' attempts to coordinate.

Forming/Dissolving involves the engagement of others into new groupings for the purposes of aiding in the joint activity. Recruitment in this domain can be effortful (physically stopping work to go locate another) or lightweight (@ mentioning someone in a shared virtual collaboration space or paging them using a notification tool). In the cases reviewed, for example, groups formed by virtue of detection (users reporting issues to the reliability team), diagnosis (particular skill sets, knowledge or the engineer with a recent code push were needed to help determine the issues), decision-making (authority was needed to approve a course of action or to escalate the urgency with higher levels of management). Both forming and dissolving of groups could happen abruptly (as when an entire team is paged in) or ad hoc (when individuals identified as potentially useful are recruited after the response has been underway for some time or when an invalidated hypothesis about the source of the issue renders a portion of the responders unnecessary and they drop off).

Switching represents a change in the medium for communication or coordination, such as moving from chat-based interactions to web conferencing or gathering participants in a co-located physical space. Switching may not be indicative of coordinative breakdowns or rising costs but rather of designed supports intended to aid a response team. In this way, it is a useful code for the first pass because it can reveal interesting characteristics about how coordination works.

Crossing refers to coordinative instances that happen at boundaries. It is the expansion or contraction of coordinative demands across some form of grouping – team, organization, vendor. This could be intra-organizational, as in the recruitment of another department's engineering team or a dependency on an internal company capability such as a network or internal Cloud services. Other boundaries are inter-organizational, where critical dependencies are external to the organization but coordination is crucial for incident response to meet the problem demands. Client/support boundaries may be either intra-organizational (for internal users) or inter-organizational (for external customers).

Synchronizing as a code relates to the timing and sequencing of activity. It can also relate to goals and priorities. Anomaly response is inherently a multi-threaded activity, as detection and diagnosis may be ongoing in concert with efforts to repair or safeguard the system from further degradation. Consequently, tasks and activities need to be synchronized to avoid decompensation and working at cross purposes (Woods and Branlat 2011). In addition, large-scale distributed systems have multiple goals and priorities that shift rapidly over time (Smith et al. 2003; Nemeth et al. 2004) and need re-synchronizing.

Mentioning refers to explicit communications related to coordination. As noted previously, smooth coordination is difficult to see. Traces of coordination difficulties and breakdowns make the invisible visible.

Figure 4.3 Coding framework


The second pass of coding was iterative, with the secondary codes beginning with several of the key known elements of choreography important to the cost of coordination – recruiting, common ground, delegating, signaling, being directable and being observable. Additional secondary codes provided descriptive features of those elements. Where a tool was involved, a code was applied (such as the tool aiding recruitment, signaling or delegating tasks).

The secondary codes also identified the participants relative to their position with respect to the core group of responders in order to stratify the groupings involved in the cases. These codes were core, specialists, intra-organizational responders and inter-organizational responders.
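The sketch below shows one way the two-pass frame can be represented: first-pass codes filter the available cases down to those rich in coordination features, and second-pass annotations attach choreography elements and participant strata to specific excerpts. The case records, threshold and values shown are hypothetical, not data from the corpus.

# A sketch of the two-pass coding frame as data structures (example values are hypothetical).
FIRST_PASS = {"breaking_down", "forming_dissolving", "switching",
              "crossing", "synchronizing", "mentioning"}

cases = [
    {"id": "case_A", "codes": {"switching", "crossing", "breaking_down"}},  # hypothetical
    {"id": "case_B", "codes": {"mentioning"}},                              # hypothetical
]

def candidate_cases(case_records: list, min_codes: int = 2) -> list:
    """Keep cases exhibiting at least min_codes of the first-pass coordination codes."""
    return [c for c in case_records if len(c["codes"] & FIRST_PASS) >= min_codes]

# A second-pass annotation attaches choreography elements and a participant stratum to an excerpt.
second_pass_annotation = {
    "case": "case_A",
    "excerpt_timestamp": 1005.0,              # hypothetical
    "elements": ["recruiting", "signaling"],  # choreography elements from the frame
    "participant_stratum": "intra-organizational responder",
    "tool_involved": True,
}

print([c["id"] for c in candidate_cases(cases)])  # ['case_A']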


Chapter 5 - Findings

Introduction

This chapter presents the findings from the preparatory studies and the corpus of cases. It follows from previous chapters, which grounded these findings in the context of large-scale distributed software systems where teams of human and machine roles conduct the joint activity of anomaly response. Additional background was provided in Chapter 3 on software engineering and service delivery in modern continuous delivery/continuous integration environments. The previous chapter covered the methodology used in the preparatory studies and for the analysis of the corpus of cases. The findings have three main thrusts – first, on the models of coordination that inform distributed anomaly response; next, on the technologies that shape the configurations for joint activity; and finally, on the elements of choreography that show how practitioners control the costs of coordination under uncertainty and time pressure.

Before presenting the results from the corpus of cases, the results of the two preparation studies are given. In the first, findings show that some models of coordination during incident response can aid in smooth coordination and resilient performance by allowing for flexibility and adaptation of practices to match the demands of the problems faced. Conversely, the rigidity imposed by other models can limit the range of possibilities available to responders to cope with fluctuating problem demands.

While these models influence practitioners' actions when responding to anomalies, the technology used can also heavily inform practice. Coping with software outages includes automation 'working' alongside human roles. Practitioners in this domain compile and use various displays, alerts, and other tools to assess the system and to take actions to resolve problems. The configuration of response efforts is heavily shaped by the ways the technology is utilized to bring people, data, knowledge and expertise to bear as an incident begins, evolves and spreads. Because of this context, the second preparatory study examined how technology shapes the configurations for joint activity.

Results from preparatory studies

The next section describes the two studies conducted as preparations for selecting and analyzing the set of incidents in the corpus.

Preparatory study 1: Incident Models

This study combined data from semi-structured interviews across fourteen different teams responsible for service delivery. The structures for service delivery ranged from on-premises (on prem) management of a third-party vendor's service, to pure Software-as-a-Service (SaaS) offerings, to internally developed services that were hosted and managed internally (no users external to the organization). Some followed a (perhaps unintentionally) hybrid service delivery model with shared responsibilities across organizational boundaries.

The different styles of incident response are categorized into four models, though several have variations:

• Incident Command model - considered an industry-accepted best practice, this model is based on the ICS structure of command and control currently popular in digital service delivery, where a collection of formal and ad hoc groupings works under a central (and usually singular) leader.
• One At A Time model - ownership of the problem is assigned to discrete groups responsible for different aspects of service delivery and passed along as the problem is found to reside somewhere else.
• Escalation model - popular with ticketing-style incident management; the problem gets escalated through different levels of responders.
• All Hands on Deck - this style of incident response focuses on recruiting all available resources when the severity of the incident is deemed to be high.


"Incident Command"

This model follows the command and control style response used in disaster recovery efforts. Upon declaration of an incident, the 'commander' role becomes the definitive authority on how the response should be conducted. "They structure the incident response task force, assigning responsibilities according to need and priority. De facto, the commander holds all positions that they have not delegated. If appropriate, they can remove roadblocks that prevent Ops from working most effectively." (Beyer et al. 2016). In this model, the incident commander (IC) is the key hub around which all other activities are connected. First, organizing incident response around an IC assumes a single individual is capable of the cognitive work needed to keep pace with the dynamic flow of events, assess the implications, and handle tradeoff decisions in concert with the rate of change of the incident. Second, the IC role is set up as the key point of authority who then delegates tasks to other personnel. 'Freelancing' (defined as any activity taken without explicit authorization by the IC) is to be avoided. Third, this model assumes tasks can be parsed into discrete units of activity that can be timeboxed with little concern for varying tempos of operations as problems, evidence, uncertainty, and threats change over time. The incident command model does not address how workload bottlenecks may produce lags in guiding or integrating activities. There is little provision for the consequences that arise when the IC role falls behind the pace of events. Interestingly, if slowing the pace of response is needed to keep the IC current, these actions will not help and may hinder the pace of cognitive work for other responders.

"One-At-A-Time"

The One-At-A-Time (OAAT) incident model is based on a structure of paging out one individual responder at a time. The underlying assumption is that this responder can follow prescribed protocols from the runbook and attempt to diagnose the issue to the full extent of their knowledge. Following this, they can recruit additional resources within their own team. If the responders are unable to successfully diagnose and repair the issue, additional resources from a different service delivery team are recruited and the issue is handed off to them. While some initial responders might remain engaged in the effort, it is more likely that the dependent service team would begin diagnosing on its own.


Figure 5.1 One at a Time

This model assumes that the knowledge and expertise needed to handle events are held within clearly delineated work groups and that getting to the right people with the right expertise is sufficient. It follows that recruiting additional resources and bringing them up to speed is straightforward, without delay or rifts in the response process. Thus, there is little or no additional benefit to handling an event that comes from interactions across the different skills, perspectives, and experience of different work groups. Handoffs, then, can be a straightforward exchange that contains all relevant information.

OAAT Variation: Mean Time To Innocence
This is a variation of OAAT in which responders exhaust all possible known efforts to restore service in non-routine or exceptional cases. In parallel to their response efforts, they compile evidence that the problem actually lies with one of their dependencies – thus they are ‘innocent’ of being responsible for the outage. Upon sufficient proof that the problem is not within their scope of the service, they hand off the problem to the role responsible for the suspected dependency. This is a hard handoff with no offer or expectation of continued involvement in the response effort.


Figure 5.2 OAAT with Mean Time To Innocence

This variation extends OAAT by assuming its underlying assumptions will still hold in more challenging cases. Even then, unanticipated interactions between dependencies can be identified by the responsible parties. Once identified, problems can be ascribed to one responsible party, and a specific single work group is accountable for preventing or minimizing any outages. Even in these more difficult cases, responders stay within their scope and do not want to be called in for issues with dependent services.

OAAT Variation: “With Incident Command”
In this instantiation of OAAT, an incident commander role triaged the level of response. They were not expected to perform any technical function but rather coordinated the efforts of technical experts, communications and note keepers, and provided a liaison to management. This role strictly adhered to a procedurally driven response effort which ‘managed’ technical experts by having them provide a description of the tasks they were undertaking and an expected time to completion. Communication between responders was not typically monitored except for updates about the status of previously determined actions. This variation adds assumptions about the incident commander role, given the other assumptions above about the connection between problems and expertise: the IC will know when a service is sufficiently proven to be responsible or not responsible and can then manage handoffs between the different technical experts.


Figure 5.3 OAAT with Incident Commander

Escalation
The Escalation model is based on a typical ticketing model that seeks to rank the severity of reported issues. In this model, when a user discovers an issue, they may open a ticket to report their findings. The ticket is then assessed and assigned to a lower skilled responder to handle by following a prescribed runbook. If the problem cannot be resolved in this way, the severity is escalated to a higher level and a more skilled responder (who may also have access to a greater range of diagnostic tools) takes over. With dependent services, this escalation may happen across boundaries. For example, a problem may be escalated through several levels internally and, failing resolution, cross the boundary to the vendor’s lower level service desks before it is escalated again. This model can control the costs of coordination by weeding out the easily resolved issues and reserving the attention of highly skilled (and scarce) resources for non-routine and exceptional events.


Figure 5.4 Escalation

This model assumes the steps through the escalation will not create any undue delay in getting to the responder with the appropriate skill level for the difficulty of the event, even though that difficulty is not known in advance. Each successive layer ‘trusts’ that the layer below sufficiently completed the troubleshooting within their runbook steps and that those steps do not complicate or hinder responding to the event as it is escalated. As escalation proceeds, it is assumed that handoff notes in the ticketing system are sufficient to provide context for incoming responders. The model also assumes that the users or clients have only a very basic level of proficiency and that routing the ticket through initial troubleshooting may resolve it; in practice, a highly proficient team of site reliability engineers may end up recruiting a less skilled group of vendor support.
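To make the escalation mechanics concrete, the following is a minimal sketch in Python of a tiered escalation policy. The tier names, severity handling and ticket fields are illustrative assumptions, not a description of any participant organization's tooling.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative tiers; real organizations define their own levels, and the
# vendor boundary sits wherever the support contract places it.
TIERS = ["L1 service desk", "L2 internal support", "L3 site reliability", "Vendor support"]

@dataclass
class Ticket:
    summary: str
    severity: int = 1                  # raised as the issue resists resolution
    tier_index: int = 0                # current position in the escalation chain
    handoff_notes: List[str] = field(default_factory=list)

def attempt_resolution(ticket: Ticket, resolved: bool, notes: str) -> Optional[str]:
    """Record the runbook steps tried at this tier; escalate if the issue persists."""
    ticket.handoff_notes.append(f"{TIERS[ticket.tier_index]}: {notes}")
    if resolved:
        return None  # closed at the current tier
    if ticket.tier_index + 1 >= len(TIERS):
        return "escalation chain exhausted"
    # The model assumes these notes carry enough context for the next tier.
    ticket.severity += 1
    ticket.tier_index += 1
    return f"escalated to {TIERS[ticket.tier_index]}"

# Example: an issue that the first two tiers cannot resolve only crosses the
# organizational boundary to vendor support after passing through them.
ticket = Ticket("users report intermittent 500 errors")
print(attempt_resolution(ticket, resolved=False, notes="ran runbook restart; no change"))
print(attempt_resolution(ticket, resolved=False, notes="checked dashboards; cause unclear"))
print(attempt_resolution(ticket, resolved=False, notes="no fault found in our scope"))
```

The assumption flagged in the text above – that handoff notes alone provide sufficient context for the next tier – is exactly where this model is most fragile.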


“All Hands on Deck”
This model assumes the people who make up response teams have a refined understanding of the classes of problems that tend to arise in their system. Because their monitoring and alerting capabilities are strong, known problems can be matched to appropriate support level responses, while unknown or potentially severe problems trigger a broadcast page out to the pool of responders. For example, an actual or suspected high severity incident would trigger a large-scale page out to a broad collection of responders. This brings a diverse collection of skills and expertise to the incident in its early stages. If the incident is discovered to be lower priority or the solution is found quickly, responders drop off the effort accordingly, understanding they can be easily recruited again if the incident progression demands it. The distinguishing assumption in this model is that engaging multiple, diverse perspectives in the problem early is beneficial, as opposed to waiting to bring in additional or targeted resources and expertise. Research indicates the interactions between these perspectives broaden hypothesis generation and bring out important information relevant to response efforts early. The benefits of diverse expertise and perspectives come at a cost, as there must be means to coordinate over the multiple responders engaged. In the two observed instances of this style of incident response model, the response also utilized an incident commander role. When paired with the all hands model, the IC role did not generate bottlenecks that slowed responses. This was due to initiative being pushed down to individual responders as they generated potentially constructive activities and took it on themselves to be responsible for tasks.

This model makes a rather different set of assumptions. First, responders all share extensive common ground about the service, the expected response pattern and methods of adapting in real time. Second, responders can flexibly adapt to the structure of the incident response, anticipating the actions and needs of others.


Figure 5.5 All Hands on Deck

All Hands Variation: Incident Response SWAT
In police and military parlance, SWAT stands for ‘Special Weapons and Tactics’. In one observed instance of this incident response model, engineers specially trained to handle incident response took over the ‘management’ aspects of the response in order to offload the technical experts and allow them to focus on the diagnostic and repair work.

This variation also assumes that incident management can be handed over or taken over as additional responders come into the situation. The incoming responders must then have sufficiently broad knowledge about the services and about incident response practices to appropriately support the response by managing communications to users and stakeholders, recruiting additional responders, securing resources and removing blockers, while the response team that owns the service remains focused on the technical resolution.


Figure 5.6 SWAT Incident response

These models represent different ways of coordinating during incident response, with varying implications for the cognitive work of incident responders. The next preparatory study shifts focus from how the incident models shape joint activity to technology’s role in doing so.

Preparatory study 2: How technology shapes the configurations for joint activity

Cognitive Systems Engineering has a long history of surfacing new insights into how the design of technology-mediated control of dynamic systems - including the physical space, cognitive aids and interface displays - influences and informs cognitive work (Watts et al., 1996; Smith et al., 2009). In this spirit, the objective of this investigation was to understand how the technologies used in service reliability work in critical digital infrastructure shape possibilities for joint activity.


Given the 24/7 nature of operations and technology that enables access to the system from virtually anywhere, this section explores how the designed environments - spanning physical spaces, hardware and software - are adjusted and adapted relative to the event. Software engineers reconfigure these three aspects of the modern control ‘room’ to influence observability, directability and interpredictability.

The changing concept of the control room
Increasingly, a fully remote workforce is becoming the norm, so co-located responders in a shared physical environment are no longer standard. In addition, even with typically co-located teams, the 24/7 nature of operations means responders need to be prepared to take action at any point in time that they are on call or recruited into an incident. This means that opening a laptop in a home, conference hall, train, airport or other space must afford the ability to create a distributed control room with others who may also be remote or may be physically co-located. Geographic dispersion of the workforce, to account for continual support requirements, also drives the need for ‘control rooms’ to be inclusive of distant responders who may work the problem asynchronously. In this preparatory study, physical office space was only a small part of the configuration. Many modern software engineering spaces were designed with open floor plans conducive to Agile practices such as stand-ups and continuous reconfigurations of teams that require different workspaces as they carry out their tasks. The layout included banks of workstations arranged in pods with 3-4 desks on each side. Other collaborative workspaces allowed ad hoc groups to assemble quickly around tables or couches. Some of these areas included both movable and affixed digital smartboards or whiteboards. Multiple large screens viewable from many locations on the floor were affixed to walls. Floorplans also included smaller private meeting rooms for 2-20 people. The meeting rooms were typically glass enclosed with tables, chairs, whiteboards, smartboards or large monitors for projecting, and a conference call device with a speakerphone and multiple microphones.

Costs of coordination as influenced by the physical environment
Consistent with findings in computer supported cooperative work studies, the physical environment had inherent affordances for coordination that were essentially ‘free’ of cost. The arrangement of the workspace enabled strong observability into the actions of other responders. At times, this included being able to infer whether they could be interrupted and how well they were coping with the demands of the incident (by being able to read body language), and it gave some indication of the tasks they were handling (screen visibility) and who they might be interacting with (relative to physical proximity and observed interactions).

There were also costs to coordination in physical environments, such as coping with noise and distractions from neighbors and other activity going on in the space. In addition, interruptions by others during periods of cognitively demanding activity can disrupt lines of reasoning underway and add re-orientation costs. The physical environment also afforded the ability to easily shift to other spaces to support dynamic reconfiguration of the response team. This was necessary because there is an inherent rigidity in the desk configurations that makes it difficult for others recruited to the incident to work in close proximity. The design of meeting rooms and collaborative workspaces supported by smartboards and whiteboards lowered the cost of physically relocating by providing some of the same affordances as the open workspace while allowing inclusion of remote responders.

While colocation has its advantages, it also imposes limitations and costs. Even in collocated environments there are constraints on being able to collaborate: the monitors necessary for viewing extensive lines of code can obstruct visibility and limit the affordances provided by an open workspace.

In addition, in an effort to integrate teams that work closely together (and because of the desk configuration), teams sit side by side in the workspace. During an incident, background conversation from an adjacent team conducting stand-ups or discussing work gets picked up in the web conference audio. To minimize the effects of the background chatter, responders will mute their microphone until they are ready to speak, but frequently (because they are focused on their task activity) they will forget to unmute. This leads to additional cognitive costs: remembering to mute and unmute, and, for other responders, waiting for an expected response.

Technologies used to aid coordination during anomaly response
The above description provided a preview of the dynamics of translating physical control rooms into virtual spaces for incident response. Being prepared to work in both collocated and non-co-located contexts requires a bridge between the physical and virtual control rooms to enable all responders, regardless of location, to be integrated into the response efforts. This is especially true where there is a hybrid distribution (some responders co-located and others remote).

Software-based collaboration tools provide this bridge, and the next section discusses the ways in which software creates connectivity to form a virtual control room.

Software to connect the physical and virtual spaces
The findings showed that there was a wide variety of software tools in use - both internally and externally developed, as well as third-party SaaS offerings. These included software for:
• Monitoring system health (according to intervals and sensitivities defined by the service owner)
• Visualizing system performance (through dashboards, graphs and representations of pre-set thresholds for various functions)
• Alerting (messages sent to chatops channels, email, text or phone calls) at varying degrees of sensitivity
• Managing on-call schedules (to manage who should be alerted, at what interval and using which methods)
• Aiding collaboration (including real time messaging, web conferencing, audio bridges, and tools for generating shared artifacts such as documents or diagrams)
• Automating response plays for known problems or consistently used actions taken to mitigate issues
• Assisting during incident response (primarily chatbots used to convert a responder’s command line instructions in the messaging platform into automated action in the preceding tools; a minimal sketch follows this list). This automation serves to limit task switching during the event and minimize navigation across multiple platforms
• Tracking incident status (including user-facing ticketing systems to request support for an issue they are experiencing)
• Communicating service status and outages to users and dependent services
• Identifying human resources related to the response (including organizational charts, customer resource management tools, competency databases)
• Planning, prioritizing and scheduling the backlog of follow-up actions and bug fixes uncovered in incident response (including assigning ownership of the issue)
• Compiling and visualizing post mortem data
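The chatbot-style assistance noted above can be illustrated with a minimal Python sketch of a command dispatcher. The command names and handler behaviors are hypothetical assumptions for illustration; they do not describe any specific platform studied here.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical handlers; in a real deployment each would call the monitoring,
# paging, or status-page tool's API on the responder's behalf.
def show_graph(args: str) -> str:
    return f"[would fetch the '{args}' dashboard graph and post it to the channel]"

def page_team(args: str) -> str:
    return f"[would page the on-call rotation for '{args}']"

def set_status(args: str) -> str:
    return f"[would update the public status page to '{args}']"

# Dispatch table: one chat command per underlying tool action, so responders
# stay in the messaging platform instead of switching across tools.
COMMANDS: Dict[str, Callable[[str], str]] = {
    "graph": show_graph,
    "page": page_team,
    "status": set_status,
}

def handle_message(text: str) -> Optional[str]:
    """Convert a command typed into the chat channel into a tool action."""
    match = re.match(r"^!(\w+)\s*(.*)$", text.strip())
    if not match:
        return None  # ordinary conversation; the bot stays silent
    command, args = match.groups()
    handler = COMMANDS.get(command)
    return handler(args) if handler else f"unknown command: {command}"

if __name__ == "__main__":
    print(handle_message("!graph api-latency"))
    print(handle_message("!page db-oncall"))
```

The design intent, as described by participants, is to limit task switching and navigation across platforms during the event.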


How bridging software impacts costs of coordination in virtual control rooms
The use of collaboration software creates an ability to virtually assemble and re-assemble according to the demands of the incident. An example of this is a feature in chat platforms where users can direct message one another privately to avoid engaging the broader group in a discussion, creating a side channel where the conversation is focused and constrained to the direct participants. If more people are needed, adding them to the conversation and seamlessly converting it to a shared channel is a well-designed flow that retains the context of the discussion for incoming responders. Another example of software lowering costs of coordination is the use of web conferencing to overcome the limitations of the physical environment. When onsite responders cannot physically convene at desks and colocation is not an option (unavailability of meeting rooms or workspaces), web conferencing software enables participants to provide some degree of observability into their actions with a head-on view. However, this is quite limited: it is difficult to discern whether the person is engaged in the incident or looking at something else on their computer, and it provides no contextual cues about their interaction with others. Additional patterns of the costs of coordination when interacting with software tools are discussed in Chapter 6.

The study also showed some general patterns across the software tools used for incident response related to tool adoption and use. These patterns relate to the costs associated with selecting the tool, adopting it into practice and re-calibrating over time, and are discussed in greater detail next.

i) Investments into determining suitability of the tool
There were two main thrusts to the investments in determining the suitability of a tool, which can be understood as tradeoffs between thoroughness and efficiency. In terms of the first, interviewees described months-long, structured assessment processes that focused on reviewing best practices from industry, evaluating the tool’s performance for other related teams and making comparisons across platforms. Interviewees described substantial investments prior to tool adoption. These included the time spent surveying the current state of tooling for a specific function, researching the capabilities and limitations of various platforms or services, assessing how the service or tool would integrate into the existing stack and existing practices, and assessing how the adoption of the tool would influence dependent functions.

In several cases, the reasons for adopting a new tool were tightly coupled with reflecting on internal capabilities, which added a secondary decision around ‘on prem’ versus SaaS offerings. The adoptions described in this manner were typically carried out not by first responders but by product or service managers, business analysts or other supporting roles. The second approach was less structured, somewhat ad hoc, and ‘pushed’ the cost of assessing the tool further out in time by attempting to assess the tool while partially integrating it into practice in near real time. This approach was more commonly taken by front line responders. Both approaches faced an ‘adoption paradox’ in that teams were often looking for tooling to support their current practice because their current practice had become unsustainable in its present form. In other words, they were too busy coping with incidents to invest in adopting tooling that could help them cope with their incidents. Many vendors seem to be aware of this paradox and, accordingly, have streamlined tool adoption - essentially attempting to make it ‘plug and play’.

ii) Adapting practices to accommodate the implicit model of the tool
The easy adoption of tools is a double-edged sword. If there is a low cost to the initial adoption (simple onboarding or a promise of no extra training required) then the actual cost of coordination with that tool is shifted in time. As previously stated, technologies are designed to transform practice (ostensibly to make things easier for the practitioner), but a transformation of practice does not happen without implicitly requiring the practitioner to adjust their practice in some way. These adjustments were found to involve additional effort on behalf of individual practitioners (investing time and attention to learn how the tool functions) as well as the broader group of practitioners (investing time and attention to understand how the adjusted practice will influence joint activity). However, just because the tool was easy to adopt and the adoption costs could be ‘paid’ later did not mean they were. Almost universally, the survey participants described a reluctance to integrate a new tool into practice without a trial period, particularly for critical functions. As discussed in Chapter 2, technology that overpromises and underdelivers leaves practitioners to make up for the shortfall. Therefore, ‘trialing’ was common. The findings showed that an overwhelming majority of low cost adoptions resulted in the tool not being fully integrated into the workflow. There were multiple reasons for this:

an individual may have on-boarded the tool with the intent of ‘piloting’ it but did not have buy-in from other team members to implement it fully; only a specific feature was desired to fill a gap in existing practices; insufficient time was available to learn how the tool works (due to production pressure); or efforts to ‘test’ the tool in a production-like setting revealed shortcomings in its ability to integrate seamlessly into existing practices and the cost of further adjustments was deemed too high. This is a significant finding in that a tool that is easy to onboard may simply be deferring the actual cost of that onboarding to a later point in time, which ‘hides’ the true costs of coordinating with that tool, as discussed in the next section.

Before moving on, it is worth noting that a ‘piloting’ phase makes sense: it allows small scale testing to uncover potential blockers or bugs that may seriously impede performance. However, piloting will not uncover problems with using the tool at scale. So even if the tool performs well for a limited number of teams, the potential for additional costs for both adoption and ongoing coordination still exists as the scale increases.

Process Tracing a corpus of cases

This section provides a brief summary of the five cases included in the corpus and the reasons for their inclusion. This is followed by a more detailed description of each incident highlighting the coordinative elements of the response, emphasizing the inherent choreography and the adaptations made to control the costs of coordination. Responders, the organizations they belong to and the vendors involved have been de-identified. The service outages examined in these cases involve teams responsible for services delivered to internal and external clients, managing internally developed tools as well as on premises and SaaS offerings. An overview of the cases is provided in the following table. A complete version of the table is included in the Appendix.


Table 5.1 Overview of Corpus Cases

Case 1 (Q1 2019) – Hidden changes from vendor & multiple failures confound
Event: Multiple concurrent problems lead to short term outages and severe performance degradation over a two day period. Issues began occurring after a version upgrade and were found to be related to two of the three symptoms that were presenting at the time of the reported issues, one of which was related to an undisclosed change in the version upgrade. Case demonstrates many elements of choreography.
Data available: Web conf; ChatOps channels x 3; Debrief recording; Postmortem
Actors (direct/indirect): 8/3
Participants: Core members x 4; Specialized experts x 2; Additional resources x 4; Management x 1
Duration: 36 hr 26 min / 2363 min
First coding: Forming (multiple reconfigurations of core responders); Mentioning (requests to screenshare, discussions on long running implications); Breaking Down (vendor support); Switching (web conf to physical colocation to remote); Synchronizing (tradeoff decisions on timing of support bundles); Crossing (inter- and intra-organizational grouping)
Choreography of coordination: Establishing common ground; Recruiting (need for specialized skill sets beyond the capabilities of the core members); Being recruited; Switching coordination mechanisms to accommodate non-co-located members and difficult problems; Adapting coordination mechanisms to accommodate additional recruitment; Maintaining common ground; Repairing breakdowns; Controlling costs for self; Controlling costs for others; Investing in future coordination; Interacting with tools

Case 2 (Q3 2019) – A bad fault gets worse
Event: A routine upgrade uncovered brittleness in the system when a bug in one system revealed backups were corrupted. Previously unanticipated interactions, difficulties for resources to come up to speed, multiple perspectives not integrated, new undisclosed changes from the vendor and fragmented coordination exacerbated the recovery efforts and contributed to a 3-day outage with substantial data loss.
Data available: Screenshots of chatOps channels; screenshots of direct messages; Post-mortem
Actors (direct/indirect): 10/18
Participants: Core members x 2; Specialized experts x 5; Additional resources x 4; Management x 3; Vendor x 4
Duration: 71 hr 37 min / 4297 min
First coding: Forming/Dissolving (initial recruitment, ongoing recruitment & engagements); Mentioning (prompting, suggesting); Breaking down (side channels, no integration, delayed updating); Switching (multiple mediums); Synchronizing (delays, lag); Crossing (inter- and intra-org)
Choreography of coordination: Recruiting (timing of the incident meant resources were not readily available; issues with recruiting due to insufficient support contracts with vendor; recruitment ad hoc; difficulties coming up to speed; resources not integrated); Synchronizing (sequential & parallel activities; perspectives integrated resulted in delays); Establishing common ground (difficulties establishing CG due to side channeling and workload saturation); Maintaining common ground; Repairing common ground breakdowns (difficulties recognizing CG breakdowns); Delegating (sequential efforts); Contrast case on side channeling & IR structure


Case 3 (Q4 2018) – Scaling beyond imagined parameters
Event: Users began reporting errors that build logs for a third-party continuous integration software were not appearing. It was recognized to be the log parts table in a large-scale database, and the build logs were not being saved. Vendor support immediately joined the response and aided in real time model updating and managing tradeoff decisions. It was found the size of the database led the integer table to have been exceeded and the database schema needed to be replaced. The use of tools enabled cross checking with minimal additional effort.
Data available: ChatOps channel screenshots; post mortem (doc & recording)
Actors (direct/indirect): 7/9
Participants: Core members x 4; Specialized experts x 0; Additional resources x 1; Management x 1; Vendor x 3
Duration: 3 hrs 53 min / 233 min
First coding: Forming (initial recruitment adapted, integrating vendor resources); Mentioning (coordination breakdown, cross checking); Switching (used multiple mediums to engage inter/intra-org resources); Crossing (vendor resources joined as part of the response team)
Choreography of coordination: Recruiting (internal resources brought in and adapted protocol; intra-org recruitment smooth through established mediums); Crossing (resources from vendor beneficial in diagnostic activities, aided in real time model updating, assessed tradeoff decisions); Establishing common ground (vendor resources had extensive background on company system & demands); Maintaining common ground (use of multiple mediums to maintain CG); Repairing common ground (understandings about interactions within the system); Model updating (real time updating throughout the dialogue with vendor support); Controlling costs for others (incoming responder adapted protocol and took IC to keep support eng on diagnostic work); Synchronizing (alerting dependent services of an impending impact); Contrast case on vendor engagement: vendor support on the call within minutes

Case 4 (Q3 2019) – When those who should know don’t
Event: The code repository was ‘up and down’ all day and experiencing performance issues around critical functions. The support engineers reached out to the vendor support but coordination between the two engineering teams occurred through a ticketing system, which prevented common ground from being established, introduced additional coordination demands and resulted in delays. As is common with vendor-client troubleshooting, a request to run support bundles was made. Paradoxically, the act of running support bundles for the vendor to assist in diagnosing the problem was actually contributing to the problem by adding load to a taxed system. Independently of the joint diagnostic work, an engineer reviewing logs noticed a memory cap - instituted during the last hotpatch without notification - was throttling capacity by limiting memory caching.
Data available: ChatOps x 2; vendor support tickets; Web conf transcript
Actors (direct/indirect): 3/4
Participants: Core members x 3; Specialized experts x 0; Additional resources x 0; Management x 1; Vendor x 4
Duration: 12 hr 28 min / 748 min
First coding: Forming/Dissolving (multiple vendor support coming in/out of response requires additional coordinative demands); Mentioning (frustration expressed with delay in accessing needed perspectives); Breaking Down (efforts to work around the need for coordination with vendor support); Switching (engineers used multiple mediums to redirect vendor attention to the problem; taking advantage of co-location); Synchronizing (lag introduced by the use of the ticketing system and the length of time required to run support bundles); Crossing (intra-organizational boundaries)
Choreography of coordination: Establishing common ground (refusal to join the real time response effort created issues in establishing common ground about nominal functioning of the system); Maintaining CG (vendor suggestions for diagnostic testing would worsen the problem); Coordination breakdowns (tickets were consolidated and receipt of critical info to vendor was delayed); Repairing breakdowns (additional effort was involved in framing the communication to vendor support to underscore the nature of the issue); Taking/switching perspectives (additional effort was involved in anticipating vendor needs due to delay); Recruiting (attempts to engage vendor resources); Controlling costs – self (anticipating vendor needs allowed better synchronization; breakdown occurred); Controlling costs – others; Synchronizing (timing of support bundles; tradeoffs to maintain performance); Investing in future coordination (flagging post mortem inputs); Interactions with tools (issues with web conferencing, network, paging)

Case 5 (Q3 2019) – Is this a problem and how bad is it?
Event: Early morning reports of user issues for a European user group were intermittently causing issues with continuous integration software. The handoff got deprioritized after the US team came online to find a series of urgent CRE hotpatches. However, the problems returned later in the day and the system became unreachable for several minutes, triggering the incident. Additional resources were recruited but the intermittent nature of the problems made diagnosis challenging. After protracted efforts to fix the problem were unsuccessful, the system appeared to have regained normal operations. The incident was resolved but uncertainty remained as to the source of the problem.
Data available: ChatOps x 4; Audio
Actors (direct/indirect): 6/9
Participants: Core members x 4; Specialized experts x 2; Additional resources x 2; Management x 1; Vendor x unknown
Duration: 17 hr 12 min / 1032 min
First coding: Forming; Mentioning; Synchronizing; Switching; Crossing
Choreography of coordination: Establishing Common Ground (determining if it is in fact an incident and severity levels); Maintaining CG (changing opinions over the nature of the problem and steps to resolve); Repairing breakdowns; Taking/switching perspectives; Delegating; Recruiting; Coordination Breakdowns (repeated recruitment to the issue as people came in/out thinking the issue was resolved); Being recruited/Orienting (lightweight recruitment); Aiding in model updating (knowledge limitations were unacknowledged); Model Updating (real time updating on system functions); Controlling Costs – Self (some responders ‘left’ the incident thinking it was over); Controlling Costs – Others; Synchronizing; Investing in future coordination (note inputs for PM); Interacting with Tools (colocation for cross checking)

Chart categories:
Data available – Transcripts from ChatOps channels (multiple entries indicate data was pulled from user channels, warrooms, notification channels or internal support engineering channels, including inter- or intra-organizational channels in a shared workspace); web conference recordings; audio bridges of recorded teleconferences; post mortem debrief recordings; post mortem documents; vendor support ticketing systems; screenshots of artifacts presented during a case; screenshots of shared channels; screenshots or transcripts from direct message communications.
Participants – Core members are defined as those who typically respond (either as part of an organizational structure or an escalation policy due to dependencies or responsibility for certain functions); specialized experts are resources who do not typically respond but are recruited for their knowledge or experience; additional resources are technically skilled responders whose role in this incident is supportive – gathering information, updating stakeholders, handling logistics or otherwise assisting in the response.
Duration – The duration is inclusive of any initial mentions of an issue within user channels or automated monitoring systems; in this way, it differs from company records. It is NOT a reflection of a service outage unless otherwise stated.
First coding – Forming/Dissolving, Mentioning, Breaking down, Switching, Synchronizing, Crossing.
Choreography of coordination – Establishing Common Ground; Maintaining CG; Repairing breakdowns; Taking/switching perspectives; Delegating; Recruiting; Being recruited/Orienting; Aiding in model updating; Model Updating; Controlling Costs – Self; Controlling Costs – Others; Synchronizing; Investing in future coordination; Interacting with Tools.
Difficulties handled – Short description of some of the technical and coordinative difficulties.

Case # 1 - Hidden changes & multiple failures confound responders

This case is representative of many basic aspects of coordination choreography, including: recruitment of specialized resources; explicitly building, maintaining and repairing common ground; delegation; adaptation of coordination mechanisms; controlling costs of coordination for self and others; investing in future coordination; and coordinating with tools. There are salient examples of cross-boundary coordination, both inter- and intra-organizational.

A popular code repository (CR) offers an Enterprise version (CRE) of their Software-as-a-Service model. Two significant problems were introduced with an upgrade to the Enterprise offering, which caused major incidents immediately following the upgrade. The upgrade was performed over the weekend and on Monday morning, as the North American workday got started for the east coast, users began reporting issues with stuck jobs in the distributed continuous integration (CI) service. CRE and CI are critical functions for software engineers within the company, representing thousands of users. As this was being investigated, sporadic ‘500 errors’ began appearing for CRE users. The CRE cluster status did not report issues with the cluster and other available information did not immediately trigger recognition of a known failure pattern among the 5 responders working the problem. The team rapidly opened a support ticket and recruited highly skilled engineers who were responsible for a complementary service to assist. A conference room was secured and several other experienced responders joined the efforts, both virtually via web conference and on-site. Investigations uncovered a large increase in Consul errors in the logs as well as error log entries noting Redis was out of memory. The group attempted to restart Redis but it did not resolve the issue. After noticing the Redis max memory was capped in the configuration file, the group changed the cap to 5G, up from 1G. Testing cluster components revealed this alleviated the issues with Travis, but some pull request links were still giving 500 errors to some users. The group forced a new replica of the repository to resolve the 500 errors that occurred for users when viewing pull requests. A hotpatch - a fix applied in a production system so that it does not incur downtime or restarts - was applied, a process that took several hours. Upon completion, testing revealed the system behavior was unchanged after the hotpatch. As the incident had been underway for twelve hours, it was decided to stop further investigations, await a response from the vendor and pick up in the morning. The following day, with no response from the vendor in the ticketing system, an engineer managed to reach a support engineer through a shared Slack space.
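For readers unfamiliar with the kind of adjustment described above, the following is a minimal sketch, using the redis-py client, of checking and raising a Redis memory cap. The connection details and the 5 GB value are illustrative assumptions; this is not the responders' actual tooling or procedure.

```python
import redis

# Illustrative connection parameters; the production cluster in the case
# would have its own hosts, authentication and change-management process.
r = redis.Redis(host="localhost", port=6379)

# Inspect the current cap; a value of 0 means "no limit".
current = int(r.config_get("maxmemory")["maxmemory"])
print(f"current maxmemory: {current} bytes")

# Raise the cap (here to 5 GB, mirroring the 1G to 5G change in the case).
# CONFIG SET changes only the running instance; persisting the change also
# requires updating the configuration file, which is where the cap in this
# case had been set.
r.config_set("maxmemory", 5 * 1024 ** 3)
print(r.config_get("maxmemory"))
```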


The incident work was restarted. Some functionality was available (tagging commit status and PR status tagging) but creating CRE releases was still breaking. Several hours into the renewed effort, it was discovered that a curl request could trigger HTTP 422 errors. On the advice of the vendor, the affected servers were rebooted, to no effect. The affected servers were removed from live rotation. Noise from Consul log entries and ongoing issues with Travis masters not sending logs to logentries hampered the diagnostic process. On subsequent advice from the vendor, the hotpatch was reapplied successfully and the load balancers were returned to rotation. The HTTP 422 errors were discovered to have existed prior to the upgrade and were unrelated to the incident. It was discovered that the payload of a Travis API call was producing errors and it appeared that Travis was adding fields to the payload not needed by the API. A code change with a fix for the API call was found and the patch was applied to the master image, then deployed into production. The test patch was confirmed, a pull request was created to formally commit it to the master image, and the incident was closed.

Findings from Case 1
This case presented as three problems, though as the response proceeded it ended up being two actual problems. The problem presented in such a way that, almost immediately, additional responders with specialized expertise were recruited, and they were engaged throughout the full duration of the event.

Recruiting
Almost immediately, new resources were brought in and attempts to recruit vendor resources were made. The difficulties presenting themselves were non-routine and, very quickly, the standard efforts to gather information about the problem did not offer any promising hypothesis about the source(s) of the problem. Therefore, specialized expertise was recruited. A ticket was opened with vendor support underlining the urgency (the system was down) and the difficulties being faced. The ticketing system aims to prioritize attention by having users rank the severity of their issue. Vendor support then assesses the information provided and may adjust the severity according to their understanding of the problem.

Then, being collocated, one of the responders left the boardroom they had assembled in and walked across the open floor space to tell the two specialists their skills were needed. It is unclear what was said directly, but within seconds one of the specialists had logged into the web conference link they found in the group’s war room in the chat messaging channel. The specialists were recruited to incidents frequently enough that they had access to the sustained chat rooms and many of the tools the team used for incident response.


In this way, joining a response involves lower effort than if new tools or virtual locations need to be accessed. After having joined the web conference, the other specialist prompted them to physically co-locate to the boardroom, while the IC stated they could remain in the web conference. The responder muted their line and entered the boardroom.

Establishing common ground
Four minutes after the ticket was opened, a vendor support engineer responded in an attempt to establish common ground about the state of the system:

“Hi , Can you provide more information on what tasks you are attempting to carry out please? Do you get 500 errors in all cases or is it intermittent? Are your users able to use the appliance at all at the moment? Or is it is inaccessible? We'll keep an eye out for the support bundle. Thanks, ”

Delegation
Delegation by the IC is not an explicit giving of directions. More often than not it is querying to see who is available. There are three relevant examples of delegation in Case 1.

Ex 1:
Eng 2 (ic): Um, so let's check for the 500 errors in the, in log entries. Does anyone already have log entries open?
Eng 1: I do. Yeah. So you want me to look for 500 errors from?
Eng 2 (ic): Thank you. Yeah.
Eng 1: Uh, ok, right. On it.

Ex 2:
Eng 2 (ic): [Eng 4] and [Eng 5] are checking into narrowing the scope. [Eng 3] and [Eng 1] do you have something that you're already digging into?
Eng 1: Yeah.
Eng 2 (ic): Which gap?
Eng 1: I'm digging into the permission thing.

Ex 3:
Eng 2 (ic): Yeah. If someone is free... has free hands, can they start its import bundle please.

In these examples, again, the delegation is not explicit; it is the IC tracking the activities after responders took the initiative and began working on them.


These examples of delegation bring up interesting challenges to the idea of freelancing. Preventing freelancing and avoiding work at cross-purposes is one of the explicit goals of using an ICS. However, while there certainly can be efforts that work at cross purposes, the evidence suggests that high performing engineers more often than not take initiative without explicit coordination with other responders. It is difficult to know when these activities contributed to a smooth response, since they are not explicitly considered unless they were problematic.

Eng 2 (ic): Eng 3 and Eng 4, you guys said that you are taking into the root or trying to figure out exactly where this error is coming from. Is that what you were saying? I want to keep tabs on what you guys are going to.

Here there is an explicit need for an update on their activities. This is a low grade coordination breakdown: observability and directability dropped off because the initiative taking happened without an update to keep the IC in the loop. The IC is looking for the update not to control their activities but rather to make sense of the shifting capacities and potential bottlenecks.

Taking direction
There were very few examples where the IC contradicted the activities underway or redirected efforts. However, the following outlines a critical element of choreography – taking direction. One such example came from an exchange between two responders, one of whom had just given an update that there were no new issue reports in the user chat channel. This exchange followed:

Eng 5: Eng 4, I could monitor the #[CRE user channel name] so you can focus on the incident today.
Eng 4: (distracted) I'm good. I'm just flipping back and forth.
Eng 2 (ic): Actually, I wouldn't mind that. Uh, if you, if you have free hands to do that.
Eng 5: Yep, I can monitor that and general.
Eng 2 (ic): Thank you. That probably along with the Travis channel. Are you in the Travis channel?
Eng 5: I am. I can do that as well.

Eng 5 was an additional resource the group does not usually have. They came into the incident unsolicited, having noticed the incident start in the shared space. The tempo and tone of the incident were very different from other cases, and Eng 5 followed the group into the boardroom to lend a hand. This explicit redirect by the IC was an effort to redistribute the effort across available capacity. Eng 2 and Eng 5 had not previously worked incidents together so, while the IC was aware they had technical skills that could be useful, there was not a shared understanding of the kind of support they might be able to provide.

The offer to monitor the user channels enabled the off-loading of a task with little demand for technical skill, freeing capacity for the cognitively demanding tasks of diagnosing the incident.

A second example shows the connection between delegation and taking direction:
Eng 2 (ic): Hey [Eng 5], what are you digging into? Anything?
Eng 5: Um, I'm just trying to get rid of all these consul errors from the... I think the consul service is still running because the logs are still spitting out. I found a different service name and I'm just going to kill it.
Eng 2 (ic): What service?
Eng 5: Consul.
Eng 2 (ic): Okay. Kind of different service name from what I heard.
Eng 2 (ic): Yeah. There's one called enterprise managed console, and that one has been spitting out a gazillion errors.
Eng 2 (ic): Okay. So two separate consul services?
Eng 5: Yeah. So I'm just going to kill it so that we can filter that out of the logs.
Eng 2 (ic): Thank you

Synchronizing
In this example, the team is trying to coordinate across boundaries when one of the required tasks (running a support bundle) will take several hours.
Eng 4: While we're waiting for that, is somebody generating a support bundle in the ticket? I think they're waiting for that. [crosstalk]
Eng 1: Yes, sir. [crosstalk]
Eng 2: Yeah. It's still running. It takes a while to run. [crosstalk]
Eng 4: Okay, thank you. I didn't see a task for that. [crosstalk]
Eng 3: You have to do a full support bundle?
Eng 2: A full support bundle?
Eng 3: Well before one takes like a couple of hours.
Eng 2: Yeah. I did a full up, a full cluster support bundle.
Eng 3: Doesn't that take a couple of hours? Or did they improve the speed?
Eng 3: I recall it taking a couple hours.
Eng 2: Yeah. Takes a while.
Eng 1: While they are waiting for the support bundle we may want to let them know, say, Hey, this kind of take a couple hours. We want something on do something between that time.
Eng 4: I'll let them know that

Synchronizing across the team of responders and their user base is seen in this example where considerations are made about how their actions will impact users.

Eng 3: I think we should put it as a major outage now because it seems like you can't log into it or anything else like that.

Eng 2 (ic): I completely agree. That's part of the update. Thank you.
Eng 5: Just... to prevent our queuing problems. Um, can I propose as well that we're going to put in maintenance mode... Can we, uh... shut down Travis or else, if this is a long running incident with a long maintenance, we're going to have Travis queuing up a lot of stuff.
Eng 3: That's a good point.
Eng 6: That's fair point. We should do that.
Eng 4: Are we seeing signs Travis is not working at the moment?
Eng 5: Well is not working, so is not working.
Eng 4: Well, I mean, do we actually see that?
Eng 5: So it's known that when goes into maintenance mode for a prolonged period of time Travis goes crazy, and so I'm saying that as a precaution, if we're going to put in maintenance mode while we investigate, we should take down before does something crazy.
Eng 4: Yeah, absolutely. I agree with that.

There are a number of coordinative functions in this next example relating to synchronization, delegation and interacting with tools. The core+ group has independently resolved part of the performance issues. They’ve been working out of sequence with the inputs from the vendor but monitoring the issue (ticket) for asynchronous responses. As they are about to close out the incident for the time being they receive a request for info from the vendor. This example also shows the additional costs of working across tools. The vendor is responding in their ticketing system and this info is then copied over to the ChatOps channel where the responders are gathered so one of them can pick up the task. The IC inadvertently posts it in the wrong channel, catches themselves and corrects. Note that Eng 3 adopted the task without explicit instruction.

Eng 4: So at this point, should we just go ahead and close the incident?
Eng 2 (ic): Um, let me check the issue again.
Eng 2 (ic): Alright. We're at, uh, reading the update really quick. Uh, there's a request for information… And interesting to get the output of Redis CLI... A big key to determine what is consuming so much space in Redis. I'm going to paste that... this update in the...
Eng 3: Want me to run it?
Eng 2 (ic): Yes, please. Whoops, that's not where I meant for that to go. One second. I can find it in the support ticket. It's posted in channel now. Specifically Redis CLI, the key.


Case # 2 - A bad fault gets worse with multiple concurrent issues.

This case presents a contrast with the first case in that less formal coordination mechanisms are employed. Because of this, we are able to see many elements of choreography from a different angle. These include recruitment, synchronization, establishing and maintaining common ground, repairing common ground breakdowns, delegating, and how side channels can disrupt progress on incident response. As in the first case, cross-boundary coordination is highlighted. Also of interest is the role of high level monitoring of coordination demands that supersedes the ‘in event’ actions.

To combat performance degradations within a popular ticketing tool, a 6-hour system maintenance window - to upgrade hardware (memory & CPU) and then reboot the hosting infrastructure - was planned for Friday evening to improve user experience. The offering was managed as a semi-customized SaaS offering with the infrastructure hosted on internal Cloud services. This distributed the management of the service among the vendor, the product owner within the company and the company’s internal Cloud services.

Maintenance began as planned, with standard maintenance procedures requiring a backup and snapshot to be taken prior to maintenance activities. Attempts to take a backup failed. Upon investigation, it was found that the previous upgrade had included a new version of the backup utilities; however, there were no indications of this in the release notes or other communications with the vendor. Upon further investigation it was found that the backup system had been reporting false positives and, in fact, backups had not been occurring. The last good backup was from 23 days prior; however, since the backup process is configured to save only 5 days of backups when the job runs successfully, there were no good backups available on the backup server. It was found that the latest good backup available on the backup server was from 6 weeks prior. This backup had been used to test the restore process and had been saved in a different directory. The time pressure was mounting as the Cloud services team prepared to begin the scheduled upgrade, so the engineer attempted to install the new version of the backup utilities but recognized they would be unable to do so before the planned maintenance began. In consultation with another engineer, it was decided that taking a full disk snapshot of the VM would provide suitable safety for the maintenance to proceed, with the intention of resolving the backups issue on the following day.

In doing so, the engineer noticed that the hypervisor storage was 94% full. The snapshot completed and Cloud services conducted the maintenance. After the maintenance, the hypervisor storage was showing 95% full, so the engineer deleted some log files to create more space until more storage could be added. Following the maintenance, the engineer installed the new version of the backup utilities on the backup server and backups ran as per normal scheduling. This backup reported success, but this also turned out to be a false positive.
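The false positives described here arose because backup 'success' was inferred from the job's own reporting rather than from the artifacts it should have produced. As a hedged illustration of that distinction, the following Python sketch checks the backup artifacts directly; the paths, file pattern and thresholds are assumptions for illustration, not the configuration used by the teams in this case.

```python
import time
from pathlib import Path
from typing import List, Optional

# Illustrative values only; a real check would match the actual backup
# utility's output location, naming scheme and expected archive size.
BACKUP_DIR = Path("/var/backups/service")
MAX_AGE_HOURS = 30            # a daily job should leave something newer than this
MIN_SIZE_BYTES = 10_000_000   # a plausibly complete archive

def newest_backup(directory: Path) -> Optional[Path]:
    archives = sorted(directory.glob("*.tar.gz"),
                      key=lambda p: p.stat().st_mtime, reverse=True)
    return archives[0] if archives else None

def verify_backup() -> List[str]:
    """Check the artifact itself rather than trusting the job's exit status."""
    problems: List[str] = []
    latest = newest_backup(BACKUP_DIR)
    if latest is None:
        return ["no backup archives found"]
    age_hours = (time.time() - latest.stat().st_mtime) / 3600
    if age_hours > MAX_AGE_HOURS:
        problems.append(f"newest backup is {age_hours:.0f} hours old")
    if latest.stat().st_size < MIN_SIZE_BYTES:
        problems.append(f"newest backup is only {latest.stat().st_size} bytes")
    return problems

if __name__ == "__main__":
    issues = verify_backup()
    print("backups look current" if not issues else "; ".join(issues))
```

Even a check like this only confirms that an artifact exists and looks plausible; as the case goes on to show, the deeper safeguard is periodically exercising the restore path itself.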

Shortly after, monitoring alerted that the system was down and users began experiencing 503 errors because the hypervisor storage had been exceeded. Initial response efforts included attempts to free up some space by deleting some logs and turning off the VMs. As it was a weekend, few support personnel were available for consultation. The engineer opened a high severity ticket with internal cloud services requesting additional space be added to prevent issues for the return to work on Monday. In addition, they reached out to the vendor on Slack; this message was later deleted after it was thought the issue would be resolved with additional storage. Efforts to engage cloud computing and virtualization support were delayed by not having the appropriate level of support contract for off-hours support. The product owner was engaged and negotiated for concurrent support while the service agreement was being established. The technician began diagnostic work while the engineer made a second attempt to recruit assistance from the internal cloud provider.

A senior manager, who was notified of the original issue by the escalation policy in the alerting software, inquired whether additional help was needed and recruited reliability engineers from other services to assist. Coping with the multiple threads of the response, the responsible engineer, acting as the de facto incident manager, was approaching saturation and there were delays in bringing the new responders up to speed. Jointly, this group was able to engage the needed Cloud resources using multiple methods of contact. Meanwhile, the technician from the VM provider discovered the physical storage configuration on the host was not healthy for the workload. It appeared the second storage disk added to the system had disconnected, but the logs to verify this had previously been deleted. Believing that this storage disconnection had caused inconsistency in the virtual machine disk snapshot chain, the technician began work to clone the virtual machine disk files to attempt to force-consolidate them. During recovery, the disk snapshot files were found to have been corrupted and application backups were found to be incomplete due to multiple issues. The technician noted that the data was unrecoverable.


At this point, the work week had begun for some parts of the global user base and, unable to access the service or receive a response from their internal service team, users inquired with the vendor, who reached out to the internal support team. The engineer updated the vendor on the situation and the discussion focused on whether to bring the system up with data losses or remain down and attempt recovery. They decided to focus on recovery and engaged resources to assist efforts to recover the data. The responders notified their users of the situation and began regular updating.

The tradeoff decision involved attempting recovery by copying virtual machine snapshots in an attempt to force consolidation, a process that takes roughly 12 hours, while concurrently setting up a new server as a mitigation. When the recovery was found not to have worked, the default plan of switching over to the new server was further delayed. The team considered engaging external data recovery specialists, but the procurement process was thought to be too onerous for that to be a viable option. However, upon consultation with a major user, the decision to continue attempting data recovery was made and internal expertise from another part of the company was engaged. Following their assessment that the data was unrecoverable, the backup server was brought online and the service restored.

Findings from Case 2
If there ever was a case of multiple contributing factors, each individually insufficient but jointly sufficient to cause an outage, this is it. Several compounding problems resulted in this challenging incident. An uncommunicated change by the vendor to the backup utilities and false positives on the backups caused the lack of reliable backups over an extended period of time, and an unlikely data corruption that rendered snapshots useless exacerbated a difficult situation. There were a variety of expansions and contractions related to the uncertainty, and to the cognitive and coordinative demands, throughout the event. Several key aspects of choreography contributed to the costs borne by responders in this incident.

Establishing common ground
Extensive difficulties were noted in establishing common ground upon the initial declaration of the incident. Responders had not previously worked together, so efforts to establish common ground slowed the actual anomaly response efforts.


The incident commander was focused on diagnostic tasks and showed evidence of saturation, which meant incoming responders had trouble coming up to speed.

Mgr 1: Hi
Eng 2: Hey, how can I help?
Mgr: From “I actually do need help. I need someone from [internal cloud provider] to do an emergency maintenance on our server. Our storage filled to 100% out of no where and I’m not quite sure what to do. I have [virtual machine vendor] trying to allocate some space.” @ are there some snapshots you can remove?
Eng 2: gotcha. Do you know in which cloud account the server is located? Is this managed by [internal Cloud provider]?
Eng 1 (ic): I tried I think I made it worse deleting things… It’s managed by [internal Cloud provider] yes
Eng 2: Have you tried reaching out to [internal Cloud provider]?
Eng 1 (ic): yes no one has responded
Eng 2: can you log into the machine as root?
Eng 1 (ic): Yea thats what [virtual machine vendor] is doing now
Eng 2: Is [virtual machine vendor] = support, or ?
Eng 1 (ic): yes
Eng 2: One thing we could look at is whether there are logs or other large files that can be removed. Is [virtual machine vendor] support blocked (or out of options) or are they still investigating?
Eng 1 (ic): He’s trying to figure it out. Do you think we could go around [internal Cloud provider] and have [cloud support] provide maintenance?
Eng 2: It’ll be difficult unless you have access to the [internalcloud.com] account that [internal Cloud] support uses. Are you able to log in at [cloudprovider.com] and see the account containing the server? Also, are you sure it’s a server problem and not an application problem that resulted in the disk full condition? If you want, I can look at the filesystems with you to see where the space is being used up.
Mgr: Sure sounds like a runaway log
Eng 1 (ic): I made a ticket, reached out in the slack channels too
Eng 2: If it’s a runaway log then server maintenance won’t help

Recruitment There was a series of recruitments over the 4 days of the incident.


Figure 5.7 Recruitment of responders over time

The recruitment of Eng 1 was pivotal. They brought technical expertise, depth of experience in incident response practices and an extensive network, through which they in turn recruited additional participants. Vendor 3 brought themselves into the incident following a series of user reports made to the vendor directly. Users began reporting performance issues in the user channel but, as it was the weekend and monitoring the user channel was not a typical practice, there was no response from service owners. In addition, the responders were preoccupied with the response. Several times throughout the incident queries were made about recruiting additional responders, indicating that some engineers thought additional perspectives could be valuable, but the IC declined these offers, as indicated in the exchange below.

Mgr 1: Hi how is it going?
IC: Honestly, not the best, it looks like we might face some data loss. The guys at are checking if we have any other options now.
Mgr 1: Need any help?
IC: No sir, I am just watching them work on the system.
Mgr 1: Can the guys help?
IC: Probably not this time but if we do have issues when the vm is back up I will probably reach out to them.


Tracking others’ activities Three examples of interest were noted in this case related to the choreography of tracking others’ activities. The first is a general finding that applies to the event overall, the second is a tradeoff decision made without the full complement of responders, and the third relates to the ongoing attempts to recover data on day 3.

In general, the absence of a centralized war room in the initial stages of the incident and the subsequent use of side channels meant there were no traces of the response efforts to date. The lack of observability meant additional costs were incurred by the incident commander to keep everyone up to date. Ongoing requests to get a representation of the situation or to review prior discussions and actions had to go through the incident commander. While a subgroup began using web conferences on the second day, their use exacerbated the lack of observability for tracking others’ activities, as there was no record of the hypothesis generation, information sharing, or the discussions discarding, revising or prioritizing hypotheses with which to bring others up to speed. While incident response activity was limited to relatively few participants, there were multiple lines of reasoning underway concurrently (which is to be expected for a dynamic, ambiguous problem), but the connection between the activities was unclear. This was shown by the conclusions and recommendations of the virtual machine vendor support on Sunday being overridden by ongoing attempts to recover data.

Side channeling The multiple sidebar conferences (side channels from the main group) occurred variously through direct messages, phone calls and web conferences, and increased the costs of coordination for some participants. There were three factors related to people continually coming in and out of the incident: 1) it began over a weekend, 2) the proprietary nature of SaaS offerings meant external support teams were working the problem largely independently and 3) responders were juggling multiple priorities. The intermittent involvement, coupled with the inability to track others’ activities in a shared forum, meant incoming responders would continually prompt for updates or have to be brought up to speed at a later point in time. They were unable to look in and listen in, so anticipation was decreased and, without a central shared forum, tradeoff decisions did not include all stakeholders. As such, the benefits of diverse perspectives and their corresponding capabilities could not be fully realized.


In the second example, early Sunday morning the specialist who had attempted to consolidate the virtual machines determined that data recovery was not possible and advised restoring to the last good backup. As the core responders were discussing this conclusion, the vendor became engaged and was brought up to speed. A meeting later that afternoon reunited the virtual machine specialist with the core and vendor group, and the specialist reiterated their conclusion. However, the group decided to try another possible solution, a decision that prolonged the incident. That this conversation took place outside a shared forum, and that only the final decision was recorded in the shared Slack channel, meant the value of having multiple, diverse perspectives providing input into a tradeoff decision was lost.

Lastly, on Monday morning, after it was clear the recovery had failed, the contingency plan to spin up a new virtual machine was ready but not deployed, and further delays to the service restoration occurred as yet another virtual machine expert was brought in. As one participant recalled, “there was a lot of waiting in between figuring out how to get all the people together.”

Relatedly, when the incident commander faced a blocker they needed assistance with, they posted in the shared Slack channel but, as it was over the weekend and the other responders had been intermittently engaged (due to the side channeling), almost 2 hours went by before it was noticed. Upon returning to the incident, Eng 1 commented: “I suggest using @here in this channel for your blockers so that leaders can help if possible.”

Maintaining common ground The use of side channels made maintaining common ground difficult. Efforts to maintain common ground were seen when the initial efforts to restore the data had failed and it was clear alternatives would need to be explored. The senior manager converted the private direct messages that had been used between them, Eng 1 and the IC into a shared channel “for ease of adding more people”. Secondly, on Monday morning (almost 60 hours into the event), and again after the effort to consolidate the snapshots was unsuccessful, Eng 1 suggested that “during the lull in activity, it might be helpful to level-set on the high-level approach that’s currently underway and the high-level next steps for getting there.”


Delegation There was very little explicit delegation of technical tasks noted in this case. This is believed to be largely because the event was managed by vendor response teams working in isolation.

Taking Direction Interestingly, while there was little explicit delegation, many of the examples given in the previous elements of choreography entail others taking direction. For example, the suggestions made by Eng 1 with regards to maintaining common ground and creating observability by specifically mentioning those who could help with tasks were acted on – in effect, direction was taken.

Interacting with tools There are a number of ways in which the choreography of interacting with tools is relevant in this case, revealed in the postmortem. The first is that the handoff of product ownership from the ‘Pilot’ stage, where the service was managed by a highly proficient team of experienced reliability engineers, to the team tasked with moving it into ‘Production’ was sparse. As a result, it is unclear what factors led to the decision to set up the application as a custom configuration of virtual machines that required high proficiency to manage. This could have been managed by support engineers with less knowledge if a support contract had been put in place with the virtual machine vendor. It was not. The second is that the backup jobs were reporting successful completions even though the backups were not succeeding. When an automated process reports itself to be doing its job, additional confirmation represents what should be an unnecessary cost of using the automation. The core team had adopted a chatbot tool to support incident management practices but were only using the feature that updates users about planned maintenance outages. An unexpected interaction produced a cascading failure when (it is thought likely) the virtual machines failed to power on and the snapshots became corrupted. Last, the chatops channel used for communicating with users had less than 12% of the user base in it. In addition, less than 0.01% of the users were subscribed to an automated status page. Therefore, when the system began experiencing trouble, users struggled to come up to speed on the outage and its implications for their work. This represents additional costs of coordination for the users, as they have to actively seek out places where they can find information. Of note was that, during the outage, the number of members in the user channel on the chat platform tripled. This is a very salient example of escalating coordination demands during an exceptional event.


Monitoring coordination demands There was one example of this meta-function in this case: the senior manager who noticed, via the escalation policy, that the issue had not yet been attended to. They proactively reached out to the core response team early in the event to see if additional resources were needed.

Case # 3 - Scaling beyond imagined parameters

The difficulties handled in this case are less complex than those in the previous cases, but the choreography in this event demonstrates smooth recruitment and cross-boundary coordination, the value of investments in common ground, real time repairing of common ground and model updating, controlling costs for others, and synchronization. In addition, this case provides a contrast to cases where vendor support does not join the incident response, extending the duration of the event. A small support team of four software engineers was responsible for the service delivery of continuous integration (CI) software within a large, multinational firm. The software had been adopted several years earlier by a small group of developers as an on-premises enterprise offering. Over time, the offering had become a critical part of many development teams within the company and was running at scale. In this CI software, job logs are first saved to a database table named ‘log_parts’ and are later aggregated and moved to a table named ‘logs’ via a scheduled process. Over the several years of operations, the CI database had become extremely large. The support team noticed users reporting that their build logs were not appearing. Upon investigation, it was found that the database was reporting “number out of range” errors. The support engineer immediately recruited vendor support and the rest of the internal team, who logged into a web conference to assist with the diagnosis and repair. In the vendor’s support model, their on-call engineers were typically assigned to certain clients instead of a generic front-desk support. The two engineers who joined from the vendor were familiar with the organization and its instance of the CI software. In checking the database schema, the id field was noted as a 32-bit signed integer. The vendor checked to see if this was a bigint in later versions. The team started to scale job workers to zero. They recognized the implications this would have on dependent services and preemptively notified the engineering team responsible for them so they could shut down the services, and then communicated out to their users that they were investigating the issue.


Recognizing they might be facing data loss, the group began debating whether to create a new table or modify the current table’s schema. In order to understand the data loss involved in dropping the log_parts table, the vendor asked how many logs had been created, to understand how much data had not been aggregated and persisted in the logs table. By figuring out metrics for the min/max IDs in use in the logs and log_parts tables, they were able to help answer how far behind the aggregation was. Based on the aggregation status, the team was able to make a tradeoff decision that data migration (from the log_parts to the logs table) was not necessary. Instead, the team decided to handle log_parts migration independently for the teams that asked for it. In total, for about 4 hours the CI software build logs were not being saved to the database, and in order to prevent additional data loss the CI service was shut down for approximately 2 hours to repair the system.
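The following is a minimal sketch of the kinds of queries this diagnosis and repair involve, assuming a Postgres database and the table and column names described above (logs, log_parts, a 32-bit id); the exact statements the responders ran are not recorded in the case materials.

-- How close is the log_parts key to the 32-bit signed limit (2,147,483,647)?
SELECT max(id) FROM log_parts;

-- Min/max ids in both tables give a rough sense of how far behind the
-- scheduled aggregation (log_parts -> logs) has fallen.
SELECT min(id), max(id) FROM log_parts;
SELECT min(id), max(id) FROM logs;

-- One candidate repair discussed in the case: widen the column to a 64-bit
-- integer rather than create a new table (done only after scaling the
-- workers to zero so nothing is writing to the database).
ALTER TABLE log_parts ALTER COLUMN id TYPE bigint;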

Findings from Case 3 This is a good contrast case to Cases 1 and 5, in which the vendors were delayed in responding or refused to respond, respectively. Those delays and refusals slowed the response effort and added costs to working around the absence of needed perspectives at the point in time where they would have influenced the actions underway. In this case, the vendor joined almost immediately and was well integrated into the reconfigured response group.

Recruiting The usual pattern for this team is for the responder who is ‘first on scene’ to be the de facto incident commander. This incident began with casual user reports in the shared channel that the engineer began investigating. The metrics were not showing anything unusual, so they dug deeper into the specifics of the reported case. Forty-five minutes into this investigation they realized they were seeing something they had not encountered before. Deep into the diagnostic work, the engineer did not even trigger an incident. Instead, they leaned back in their chair and called out that they needed help to the team assembled nearby having lunch. As the team returned to their desks, one of the incoming responders, noting the engineer had immediately returned to their screen, declared themselves the incident responder and triggered the incident management software to notify users and begin compiling information about events to date. The engineer had meanwhile posted a Sev1 notice (the most critical problem level on their scale) in the messaging channel they shared with the CI tool vendor, along with a web conferencing link.


Within two minutes the post was acknowledged, and within six minutes two vendor support engineers who had worked closely with the team for several years joined the web conference.

Orienting to the problem Upon joining the call, the engineer who had been investigating gave a lightweight briefing, explaining that they had been getting some build logs but not others, that they had found a “number out of range” error, and a few details about how long they had been seeing the issues.

Establishing common ground As in other cases, the recruited engineers are specialists with deep knowledge of the system. In this case, the vendor support team had worked closely with this team for an extended period and had extensive knowledge of nominal and off-nominal performance as well. They were familiar with how the team conducted their incident response and so, with sparse detail, they were able to begin their own investigation to aid the diagnosis. The hypothesis generation was focused on known problems and potential interactions between the CI software and the company’s instance.

Maintaining common ground The response team immediately spun up a web conference when they recruited the vendor. Exchanges were rapid and richly detailed. Consider the following exchange, which took place over 9 seconds.

Eng 1: Uh, I think before we touch the table, we need to make sure that everything is out of...using the database. Do we still have workers running?
Eng 2: well it doesn't matter cause the workers don't talk to the database directly. I could scale the masters down to zero and that should ensure that nothing is touching the database. Right.
Eng 1: I want all fingers out of the pie. When we flew this change.
Eng 2: , do you agree that if I turn off all the masters, nothing can be touching the database?
VEng 1: Uh, so long as you also get out and there isn't any sort of backlog there. Um, but yes.
Eng 2: What do you mean. Get out?
VEng: Oh, sorry. Um, make sure that there are no, that is also, well, yes. If you turn off the masters, you're good. (pause) Um, that should be fine. My only question is if has things in process, it's currently trying to write to the table...write to the database, um.
Eng 2: Does talk directly to the database or does the master pull data out of [messaging broker] and write it into the database?
Eng 1: Yeah, I thought was a viewing system, I thought it was maybe Sidekiq, maybe there's like, Oh yeah, yeah.

VEng 1: Oh yeah, yeah. It would be Sidekiq, which would be on the platform. Um, yeah. So yes, if we shut down the platform that the masters, you should be good to go.
Eng 2: So team, should we scale our masters down so that we can actually do the surgery?
Eng 3(ic): Yes.
VEng 1: Yup.
Eng 4: Yeah, I think that's a good idea.
Eng 2: Alright. Doing it.

This exchange is an example of maintaining common ground about the expected outcome and also of model updating, as Eng 2 asks for clarification of how the systems interact. There is consensus that the action is the right one to take.

Delegation & Taking direction Tooling supported the functions of delegation and taking direction in this case. The team used incident management software that enabled real time task tracking. Any responder could type a task into the command line and send this to a shared task dashboard available to all participants in the chat channel. While direct delegation was made on occasion, responders instead assigned themselves tasks, also a form of taking initiative. Once complete, they would update the task dashboard. Conversely, the incident commander would notice and update the task list. There was a fluid interchange of both surfacing tasks and taking them.

Investing in future coordination Before scaling down the masters, the team proactively reached out to the owner of a large dependent service to let them know that they should shut down their workers as well. This service owner, in turn, contacted another dependent system’s service owner. In doing so, they established trust that the service team recognized the impact unplanned downtime had on users and were conscientious about minimizing disruption. A second example of investing in future coordination was an unprompted offer of help from a related service reliability team. The team responded that there were sufficient responders on the incident and that they would reach out if help became necessary.

Synchronizing There was a relatively low cost of coordination for the synchronization of tasks in this case, as all responders were working jointly in a web conference and were therefore able to stay apprised of others’ activities with relative ease. As noted in the example above, alerting dependent services of the impending impact enabled them to synchronize the powering down of their workers before the outage began.


The service owner for one of the dependent services also joined the web conference to be better able to anticipate when and how to bring their own service back up as they tracked the progress of the resolution.

Cross checking/Model updating The tools for collaboration are an important part of facilitating these exchanges. In the following exchange, the responders are on a web conference with a screenshare of Eng 4’s monitor. The group is cross checking the code before the change is deployed. Having all the responders connected in real time enables rapid updating; there is minimal delay and uncertainty is resolved quickly.

Eng 1: So in theory, we can just wake back up right now?
Eng 4: Correct.
Eng 2(ic): Let's do that.
Eng 1: All right, so...
Eng 4: Now we can test what number we get by adding a dummy row and then deleting?
Eng 1: Okay. That seems smart. We don't want it to start reusing ID numbers
Eng 4 (typing): ...One content is hello... one
Eng 1: Why is it using hello as a column name? (Pause)
Eng 4: It should have been a string, shouldn't it? Content is text. Yeah.
Eng 1: I think this query syntax is wrong. (Long pause)
Eng 4: Okay. It wants the text field, and that's a little trickier here in creating this, so I won't do that.
Eng 1: Uh, you do insert into table name values.
Eng 2(ic): Values is a, yeah... Values. It's in the wrong place. Yeah. But where do you select...? You have to put in also the columns you are inserting into it.
Eng 2(ic): If you're specifying the columns, you... it is in the correct place. So this is right.
Eng 4: So in this case, because this is a text field
Eng 1: that's really confusing. Okay. Yeah. I don't know. I don't know what's wrong. (Long Pause)
Eng 1: Oh, wait. Postgres uses single quotes for strings.
Eng 4: Ah,
VEng 1: Oh right. There you are

The value of collaborative interplay across diverse perspectives, with well-established common ground, is demonstrated here as all could keep track of the activities of others. All four participants shaped the outcome.


Interacting with the tools There are several salient examples of costs of coordination in interactions with tools in this case. The first set of examples has to do with the benefits chat messaging structure can provide for lowering the costs of coordination. Pre-established communication channels, and norms for how these were to be used, allowed users with issues and the support engineers tasked with aiding them to be virtually co-located. A service-specific chat room or channel is an enduring virtual forum for users that, in this case, provided a centralized place for the responders to engage in real time exchanges, which enabled the tempo of the response to remain high (as there was little to no lag between question and answer). In addition, pre-established channels and norms for the responders were also in place and had been configured with automated chatbots and integrations that enabled the responders to remain in context without switching across platforms. For example, integrations with the automated monitoring and alerting software gave near real-time feedback from these sources in the channels where the incident response was being managed. Finally, the ability to switch mediums (to web conferences) enabled higher fidelity conversation and screensharing.

The CI software had an ongoing issue in which users were unable to view logs when the queue for log processing filled up rapidly. The fix in these typical cases was simply to restart the Master pods, and the team had implemented monitoring and alerting on the queue for this condition. In this incident, however, monitoring showed that the queue was not full, which ended up being relevant to the way in which the tool was failing but also triggered uncertainty about whether the monitoring was reporting accurately.

Another example of cross-checking was in the update to the table schema. The group ran parts of the SQL on an empty mock-up table in production and, through this practice, issues were identified and resolved (a sketch of this kind of rehearsal follows).
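The statements below are a minimal, hypothetical illustration of that rehearsal, assuming Postgres and the log_parts/content names that appear in the transcripts; the team’s actual SQL was not captured in the case data.

-- Build an empty mock-up with the same column definitions and defaults so
-- risky statements can be rehearsed before touching the real table.
CREATE TABLE log_parts_mock (LIKE log_parts INCLUDING ALL);

-- Add a dummy row to confirm what id the sequence hands out next (note the
-- single quotes: Postgres string literals were the syntax issue the group
-- caught by cross-checking in the transcript above).
-- (Assumes the remaining columns are nullable or have defaults.)
INSERT INTO log_parts_mock (content) VALUES ('hello');
SELECT max(id) FROM log_parts_mock;

-- Remove the rehearsal table once the behavior is confirmed.
DROP TABLE log_parts_mock;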


Figure 5.8 Corrected by cross-checking

Case # 4 - When those who should know don’t

This case is reflective of the additional costs incurred with poorly supported cross-boundary coordination. It demonstrates how efforts to control costs for one party shift the costs to the other party, in the form of additional workload, workarounds and lag. The case shows challenges in establishing, maintaining and repairing common ground as a result of interacting through a ticketing system during the event. There are difficulties in synchronization due to the lag, and coordination breakdowns involving tooling.

Users at a large, global digital services company began reporting 502 errors for their code repository in their Enterprise service. Upon inspection, the support engineers noticed that load was spiking on one of the servers, but testing revealed that some of the functions being reported as unavailable by users could still be performed. In response, the vendor requested a support bundle to aid with the diagnostic work. Support bundles are scripts that gather a collection of relevant diagnostic data from various functions in the system; they are ‘bundled’ to give a wide range of key performance indicators to aid in diagnosis of the issue. The time to run a bundle can range from minutes up to several hours, and running one adds load. The engineers were reluctant to fulfil the request as it would generate additional load on an already degraded system.


Running the support bundles and diagnostics sent CRE into an unresponsive state multiple times, so the engineers began prioritizing the bundle runs to minimize load. After running them sequentially, the logs from the load balancer and diagnostic bundles from almost every node in the cluster were uploaded to the CRE support ticket. Early on, the team noticed a scheduled maintenance process (called a “hookshot” migration) had failed, and the vendor confirmed the migration may not have completed, leaving the system in an unsteady state. Adding to the trouble in narrowing down the sources of the issue, the team saw intermittent response times when calling the servers and could not find upstream drops in the logs. In addition, the status page reported that [code repository] was up. Vendor support was surprised by the volume of API traffic shown from two Internet Protocol addresses (IPs) and requested the team check and/or temporarily disable those jobs or processes to see if that might help restore overall performance. The two IPs belonged to services that were the company’s largest consumers of CRE, and the behavior being seen was normal load. The hotpatch conducted several weeks prior was thought to have contributed, and the team began tracking this. Almost 5 hours after reporting the first of the problems, the team had exhausted all candidate hypotheses and were awaiting further assistance. One of the reliability engineers was reviewing the dashboards set up to visualize the internal monitoring of the system. Internal monitoring can be calibrated to ping the system at varying intervals, and the data is represented according to the most relevant timeframe for managing system performance but can be adjusted. Searching for more diagnostic information, the engineer was looking into the behavior of the logs over time by shifting the timescales in the visualization. By extending the timeframe, the engineer noticed a substantial drop in the memcache (as shown in the screenshot below).

Figure 5.9 Dashboard time interval extended


Upon investigation, it was determined that the memory caching had been changed during the last version upgrade without notice in the release notes. Following this, the group changed the memory cap, the system recovered, and they declared the incident closed.

Findings from Case 4 This case highlights difficulties in establishing and maintaining common ground when one party is attempting to control the costs of coordination through a ticketing system and the use of support bundles. It is a contrast case to the previous case 3.

Establishing Common Ground The vendor asked the reliability engineers to run support bundles to gather diagnostic data instead of directly joining the response. This introduced substantial lag since the bundles can take up to several hours to run. The additional drag this put on the system meant the team had to run them sequentially to avoid further degradation. This process took several hours. Four hours after initially being recruited, the vendor responded:

VEng 1: Hi , Thanks for the update. I don't see the load balancer logs attached to this ticket yet, but we'll advise once we receive them. Regarding the active issue: Multiple members of our team are actively researching the situation you're experiencing. We've also reviewed graphs found at: . This specifically tell us that all the Unicorn workers on the web nodes are being used: If we zoom out on the active workers graph we can see when this behavior started right at the end of August: Our hypothesis centers around the increased utilization of workers, and that is why we're asking for a bundle from one of the web nodes:
# from any single web node
If you are concerned about generating this bundle during peak activity — Would you be able to generate this immediately after business hours today? In the interim could you share a diagnostics file and the unicorn.log file from ONE of the web nodes?
# generate and upload diagnostics output
# upload unicorn.log file
Thanks, VEng 1

Twenty minutes later, a second vendor support engineer, who had analyzed the remaining diagnostic bundles, pointed out a substantial amount of traffic on some of the IPs.

VEng 2: Hi , Thanks for reuploading. I'm analyzing the logs you provided and see a LOT of traffic from these IPs: Over 1M requests from .


Just looking at the timestamps this is a lot of traffic that is pounding the API. Could you check and/or temporarily disable these jobs or processes and see what that does to help restore overall performance? Thanks,

However, the organization is a large-scale user of the service and the traffic noted in the bundles was business as usual, so this hypothesis was politely dismissed. The team was frustrated by the amount of time that had elapsed over an issue that could have been resolved in minutes.

Eng 2(ic): We appreciate the analysis. These two services are the largest consumers of and the behavior you're seeing is BAU. We cannot disable these services. Further, the exhaustion of the active workers is the behavior since hotpatching from to which spawned this issue a week ago. Thanks,

Maintaining common ground

Even after the above exchanges, the vendor did not adjust their coordination strategies to allow for more rapid repair of common ground breakdowns. As noted previously, in an effort to keep vendor support up to date on the event’s progression, the team proactively and unprompted provided ongoing updates on their activities and on changes in system performance to maintain some degree of a common operating picture about the current state. The intention was to make clear the urgency of the problem and the difficulties the team faced.

Delegation This case is an example of the fluid interchange exhibited by the team in taking initiative to identify tasks and signaling to others who may have the capacity to take them on. As a small team, the IC actively takes on tasks as well. In this case, the IC is running the support bundles requested by the vendor, so Eng 1, who has noticed a task, looks for someone to delegate it to.

Eng 2(ic): parallel storage diag is running
Eng 1: 100% of our API calls to are failing. (adding clarification, not reporting as new issue)
Eng 1: Can someone investigate :this:?
Eng 2(ic): parallel diags on storage appear to have a huge impact
Eng 1: cancel and rerun serial
Eng 2(ic): done
Eng 3: <@Eng1> I can investigate API calls


Taking direction What is implicit in the above example of delegation is that someone is willing to be directed. Eng 3, tracking the activity stream, makes a clear statement to take on the API calls.

Repairing common ground Additional challenges were noted in maintaining common ground when the vendor support engineers were rotating in their responses to the ticketing system. At times, it is unclear whether a posting from a new responder is from someone who is up to speed on the problems, and therefore requires only a repair to an existing understanding of the issue, or from someone who needs a more deliberate and thorough grounding. An example of this is when a responder who may have been working jointly with others, but has not previously posted in the support ticket, makes a request or suggestion that seems unrelated to past discussions. Does this represent a new line of inquiry related to a recently generated hypothesis? Or does this person need to establish common ground? As shown in other cases and examples, there are different kinds of responses to establishing common ground versus repairing it.

Controlling costs of coordination for self What is striking about this case is the degree to which the vendor support team refuses to engage even as the channels being used for coordination are obviously not working. In this way, they continue to control their own costs of coordination in that they can manage the tempo of the response and are not reliant on the other team to run the event. However, this shifts the costs of coordination to the internal support team, who have to defer activity, wait for responses and invest additional effort to run the support bundles.

This is also an indication of a reciprocal pattern happening with the support team’s own users. Over the course of fifteen minutes multiple user reports came in:

User 1: Hi, I know folks are working on this, but any ETA on some level of stability? We have a provisioning outage in one of our regions for an Cloud Production service and need in order to run our deployment pipeline.
User 2: getting `502 Bad Gateway`
User 3: Hello Team, I am trying to access but having some problems. Could someone please help?
User 4: Rip
User 5: riphub
User 6: We’re also blocked from our CD deployment now.
User 7: Same here, you’re not alone
User 8: Poor octocat
User 9: Is there a place to raise a ticket or do we just wait for a response on the channel?

User 10: Getting a 502 error code
User 11: Same here showing partial outage
User 12: Any update on the performance issue? Last update was 2 hours ago and seems things have not improved. So wanted to make sure this is still being investigated. Trying to deploy builds to production but fighting `socket closed` timeouts.

The users attempt to get information about the state of the outage to manage their own deferred work. The team, preoccupied with managing the degraded performance, were not monitoring their own user channel. This shows a nested series of relationships: just as the responders are using multiple mediums to try and capture the attention of the vendor support engineers, so too are the users trying to capture the attention of the internal SRE support team. Also of note is that while user inputs are a needed source of information about how the system is performing, the cost of coordination is high when frustrated chatter amongst users about the performance issues adds pressure to the response. In response to the negative backlash in the channel, a responder posted a rebuke and redirected people to the ticket for status updates:

Eng 4: (it’s our channel so no haters) We have multiple people asking about the outage and raising tickets. Our SREs are hard at work trying to resolve the performance issues. Please see for status updates. Thank you.

Following this, the team decided to drop their monitoring of the channel to avoid distractions.

Controlling costs for others Partway through the incident, the team determined that putting the system into maintenance mode would enable them to work more quickly than coping with the continued degraded performance. Before doing so, they acknowledged the impact this would have on others and prepared to notify them so they could adapt accordingly.

Eng 3: “Is everyone aligned that we will put the system in maintenance mode after the support bundles? Because that’s what I will update the users and stakeholders with to prepare them for that.”

Interacting with tools To cope with the long-running incident, the responders took turns monitoring the system while the others commuted home, with the intent of jointly continuing the response efforts once everyone had eaten and had a chance to check in at home.

Having lost the affordances provided by colocation, they switched away from chat messaging when they needed a higher bandwidth medium to work through the overnight plan and the communication to users.

Eng 2(ic): “agreed. i'll jump in a in just a sec and we'll plan the announcement and the next few steps.”

During the incident, it was also discovered that one of their monitoring systems had not been reporting all of its metrics for almost 4 weeks.

Investing in future coordination In this event, the response team has additional capacity but chooses not to use it. In effect, they throttle the amount of resources dedicated to this incident to both preserve capacity and to maintain a focus on the chronic demands of maintaining the system. In doing so, they work towards reducing technical debt while retaining the ability to bring in fresh perspective and additional help for future problems and workload.

Case # 5 - Is this a problem and how bad is it?

This case occurs amidst a challenging period of performance problems with the continuous integration software. It does not involve vendor resources, but the recruitment of specialized expertise shows inter-organizational coordination challenges in establishing, maintaining and repairing common ground (despite the parties having worked together repeatedly), as well as coordination breakdowns. Elements of choreography such as taking and switching perspectives, delegation, orienting to problems, model updating, controlling costs for self and others, and investing in future coordination are shown. In addition, responders took advantage of colocation, which points to limitations in their interactions with tools.

In the early morning hours, the automated monitoring system issued non-urgent alerts for an issue with continuous integration (CI), an unstable code repository app webhook and slow pushes to the code repository. The issue with the CI software resolved on its own and the support engineer on call (a European member of the reliability team responsible for the CI tool and code repository) acknowledged the alerts.


The reliability team in this company uses a ‘follow the sun’ strategy for user support, with geographically distributed engineering teams providing 24/7 coverage. Shortly after, users began reporting issues with slow builds for the continuous integration service. The European engineer investigated and noted the problems were intermittent, but the services looked ok so they did not declare an incident. They shared their uncertainty by making a note in the shared reliability channel and asked for feedback on whether they needed to take any action, but received no responses. The problems seemingly resolved and were deemed to be normal system variation, so no hand-off was made when the North American support engineers came online. As the US team arrived at their desks they found three urgent hotpatches had been released for their code repository and they immediately began work on that. However, several hours into the US workday, the system became unreachable. This triggered the declaration of an incident that brought 3 members of the core team into the response efforts. As part of their routine practice of monitoring chatter in the user channels, the responders noticed the earlier reports of user issues. Engineers from a support team responsible for a dependent service were recruited to the incident to bring in fresh perspective. Several responders noted they were unable to reach the UI, but the master logs confirmed the masters were up and did not indicate any of the problems seen. Several minutes later the UI came back, but none of the monitoring had triggered, meaning the jobs weren’t disrupted. Unnerved, they began generating possible hypotheses. One responder suggested a recent deploy may have been the cause, with the new nodes not coming up soon enough before the older ones were shut down, as the merged pull request would have kicked off a new deploy. Another responder suggested a network issue may have been to blame and pointed out the problem was also seen with the master-build freezing during the software package update. However, the team did not have their own monitoring of the network’s performance and the network provider’s status page showed no service interruptions, so they were unable to pursue that possibility. With no further degradation noted, the group disbanded with inconclusive findings twenty-three minutes after the service became unreachable.

About an hour later, the European engineer re-posted about the early morning issues, saying “I know [Eng 2] was looking earlier, but widespread users continue to report no output in 10 mins issues for [continuous integration software]. I mention it because it's been going on all since well before US hours - and I couldn't see a cause.” The attempt to redirect the team’s attention back to this problem was acknowledged, with the on-call support engineer indicating they would look into it later.


Thirty-eight minutes later a flood of users began reporting stuck builds again, and a check revealed this included the incident response team’s own master-build. The engineer on support went back and reviewed the issues from the morning and discovered the system had apparently stalled right after the CI job worker information was printed and wrapped up, and none of the builds had run. One of the specialists reminded the incident commander that the master builds had been getting hung up earlier. Adding to the uncertainty, another user reported seeing sporadic stalling in different parts of the build. Looking at log files and the cluster status, nothing seemed unusual. The specialists had returned to their own work but, when prompted for suggestions for further useful diagnostic activity, one of them offered that “the item which I think will be best is to get more precise on what is happening when the build times out. Also to get the nodes, to see any patterns. It may just be select nodes.” This was seen by the engineer who made the comment as a suggestion the core group could follow up on, as they had disengaged from the joint activity. But the requestor, who was operating on the assumption that the level of engagement given earlier was still in play, was taken aback by the suggestion. Struggling to enact the suggestion, the requestor then made an explicit ask for the specialists to actively rejoin the incident. Upon doing so, the group checked all possible clusters, comms, hosts and worker nodes; all showed no issues and deploys seemed normal. Confused, they decided to attempt to narrow in on the specific points of failure showing up in the user reports (was it the same command? does it always fail while doing an “apt-get update” or curl to host?) in the hopes that information would enable them to reproduce it by repeating the command until they found out where it failed. Attempts to replicate produced a promising lead that switching to an identical server (known as a “mirror”) might resolve the issue. To do so would introduce some additional risk to the system. They resumed watching the jobs and contemplating next steps. After a day of tracking inconsistent problems and several hours with very little new information, the group was wearing down. The incident commander posted a status update to realign the group:

*things we've tried*:
- running local apt updates; this was fine
- running tons of build steps with apt update; half the builds were fine, the other half took upwards of 4-5 minutes
- comms to [code repo] from cluster node; :pass:
- comms to [enterprise universal repository manager] from cluster nodes; :pass:
- apt updates on nodes; :pass:
- noted no DNS issues during any of the above

*working on*:
- edit the apt mirror and use a new mirror in the matrix to test faster performance

The group agreed that the only way to test whether the mirror change would work was to modify the setting in production and see what happened. A trial in a test environment didn’t break anything, so they moved forward with the change. But they again got slow test speeds, and the master build failed with the new mirror as well. In turn, this generated a new possible fix: running the processes on each of the servers, one at a time, and recording the speed observed on each. After users restarted their builds there were fewer impacted users, but the problem had still not resolved. The group began debating whether they had sufficient evidence pointing to upstream network issues to approach the network provider and request involvement. At that point, someone asked whether they still had a problem and, in stepping back to review the logs, the problem appeared to have cleared up. Without a clear understanding of whether their change, or some other change independent of their actions, had resolved the issue, the incident was declared over.

Findings from Case 5 This case exemplified many themes related to costs of coordination.

Establishing Common Ground Initially, the problems noticed in the early morning hours presented themselves through non-urgent alerts that automated monitoring had triggered and that the alerting software had pushed as a notification. In an effort to establish common ground with the rest of their team (who would be coming online soon as the US workday started), the responder on support left a statement for the incoming responders. This was not quite a question but indicated they lacked the knowledge to determine significance: “Morning, jobs and services appear ok. I'm going to acknowledge the two alerts above but not resolve. I'd like to know if any action is expected when we see them.” The non-urgent status, coupled with the uncertainty around their meaning, meant it was unclear whether it was worth declaring an incident. Shortly after, the uncertainty grew as user reports began coming in. The engineer solicited more input from the users in the shared user channel, then added further information for context back in the channel shared with the other reliability engineers: “I'm getting strange, multiple reports of [continuous integration software] workload not flowing.”


Twenty minutes later, they added the results of their user solicitation: “The [continuous integration software] users are seeing jobs self cancelling after 10 mins with no output. The jobs have previously run normally. It's not a symptom I've seen before - it's got 13 agreeing they are seeing the same thing.”

Maintaining Common Ground After investigating the usual sources of relevant diagnostic data and finding them lacking, the core group recruited two specialist engineers. The recruitment was to bring additional perspective and deep knowledge of how the continuous integration software operated. In doing so, it is apparent there will be gaps in shared knowledge, beliefs and understanding about how the system operates. Core group responders recognize this and control the cost of coordination for the specialists by limiting their requests for model updating (essentially, they asked fewer questions about why something was being suggested). However, some degree of common ground needs to be maintained, as a core competency in this team is being able to anticipate others’ actions and smoothly take on tasks to support those actions. Due to the intermittent and ambiguous nature of the problem, multiple hypotheses were generated, some of which were unclear to the other responders. There were efforts to maintain common ground when the gap widened. In this example, late in the event, the group has eliminated all other hypotheses and wants to approach the network provider. The specialist (SpE 1) explains why that may not be sufficient.

Eng 1: Okay, so we have only one hypothesis that can support the current data, yes? It's probably something in the network
Eng 2: correct
Eng 1: Can we go to them and say: - our clusters can't reach this reliably - we can from other geos - we've tried different endpoint and gotten same behavior Is that sufficient
SpE 1: I think this stuff is a little more complicated uses a lot of trickery in the background to scale. It uses dns to give out different servers.
Eng 1: Ooooh Multicast

Repairing breakdowns In a related segment, the second specialist notices a discrepancy in the above exchange and seeks to repair the common ground breakdown.


SpE 2: > we can from other geos What does that mean? Did we actually try it in different geos in sl?
Eng 1: I meant that we can from Which is not the same place as our DC
SpE 1: it also has a lot of different nodes. Just because it works on the laptop may not mean there is not a problem on the nodes with apt mirror

Aiding others in model updating As mentioned before, the effort (and additional cost) inherent in aiding another in their model updating is closely related to maintaining common ground and repairing its breakdowns. Continuing the conversation from above, Eng 1 does not understand and seeks clarification.

Eng 1: Can you elaborate on that last point? [Edited]
SpE 1: What is happening is the nodes does a dns lookup to the dns server. They don't give a constant response. On the laptop it may be a different address then the server is using.
Eng 1: Okay, good point.
SpE 1: most services they only give out one, but packages mirrors do not The problem may be time dependent as well. They are constant moving parts changes some of them are not under our control

At this point in the response, where maintaining common ground is not necessary for the continued joint action, SpE 1 is incurring an additional cost to helping Eng 1 with their model updating. This is a form of investment in future coordination because SpE 1 recognizes that helping Eng 1 develop more knowledge will ultimately make future coordination easier.

Model Updating The next excerpt is a compact, rapid interchange that takes place over 4 minutes 5 seconds. The longest pause is 44 seconds, but the average is just 16.3 seconds between turns. The rapid interchange is a salient example of multiple cognitive functions for coordination in anomaly response. It is shown with codes related to the cognitive work to highlight the rapid and information-dense exchange as hypotheses are generated (HYP), queries are made related to model updating (MU), statements aid others in their model updating (AOMU), probes are made for new information to maintain common ground (MCG) and common ground breakdowns are repaired (RCG).

Eng 2: It may have something to do with the deploy. Maybe the nodes did not come up soon enough before the old ones were shutdown. We may want to investigate in a readiness probe if that is the case


Eng 3: was there a deploy in the last 10 minutes?
Eng 1: Did you just do a deploy?
Eng 2: whenever a PR is merged a deploy is kicked off
Eng 1: Sure, but why did it update the k8s resources when the k8s files weren't changed?
Eng 2: It deploy everything. There is no logic to make sure to only deploy when k8s items are changed
Eng 1: I thought (for some reason) that `kubectl apply` was a noop when the yaml was the same
Eng 2: it is but we rebuild the image each time, that will cause a new tag -> new deploy
Eng 4: The master containers have been running for 31 hours which means no new containers were made.
Eng 1: Good point :this: The workers are also 20 hours old
Eng 3: So what are the youngest items in the cluster atm? ingress?
Eng 1: Maybe ingresses or [messaging broker] ?
Eng 2: The build is still building:
Eng 1: Good thing [continuous integration software] is up so that we can watch it
Eng 5: Was it corroborated by users - ie could it have been local to [area] or the US network?
Eng 3: other users saw the outage too
Eng 1: Users reported it. Could have been users in [local area] though
Eng 3: One was [local area] One was [international area] though That's not even close
Eng 3: Jobs weren't interrupted, just access to the UI
Eng 5: Bear in mind that [network provider] has periodic glitches of network traffic and will report them at least 6 hours after an event occurred. It could potentially be caused by them.
Eng 3: Certainly a possibility
Eng 2: What I think is happening there is a [network provider] network issue. This can also been seen with the master build freezing during the apt-get update

Taking/switching perspectives Perspective taking is shown in this case as the responders continually decenter from their own needs to take the position of the network provider and determine whether that provider would consider the information available sufficient to warrant investigating. This ongoing cycle of determining whether they have met a sufficient burden of proof incurs additional costs, both in imagining the stance of the other party and in the additional effort that goes into gathering further proof.


Delegating This case again shows patterns of indirect delegation consistent with the other cases, where an absence of authoritative ‘directing’ has been noted. What is exceptional to note is that in the nineteen hours of transcript reviewed there was only one direct delegation of a task. Eng 2 (ic): are you available to chase the DNS thought while we are running other tests?

Taking Initiative Instead, taking initiative was more pronounced, perhaps because in recruiting the specialist engineers, the core group of responders (including the IC) were adjusting around the specialists’ actions. The specialists took initiative because the IC directing them would not make sense when the specialists had greater knowledge of the appropriate actions.

Recruiting Recruitment in this case is both of other responders and of the users, who provide critical diagnostic data about the specific details of their problems. Recruitment of users is also cross-boundary activity.

Other responders In this case, attempts were made to recruit additional specialized resources by pinging them in channel with a question related to the service but without explicitly asking them to join the incident. Eng 3 (ic): “I need more help on this. This isn't normal behavior.”

When that fails to engage them, the IC @mentions them, signaling their attention is needed directly. Eng 3 (ic): “@Eng1, @Eng2 i need one of you to help us dig, please :batman: we're in webex”

In part, the specialist engineers taking initiative (as discussed above) is a function of how they were recruited to the incident itself. They are brought in “to dig” not explicitly to take on a particular task. The implication here is that they will take appropriate action relative to their knowledge, which they do.


Active recruiting of users Repeated user recruitment was seen throughout this event. Of particular note is that when monitoring is not in place, or when the status pages from the dependent services are not showing any outages, the reports from users become essential for pinpointing how the problem is presenting. The following are two messages posted in the shared user chat channels in an attempt to consolidate the multiple user reports. “Hey folks, if you're getting timed out, please respond in thread of this issue with the build host that your build was running on. You'll find that information by expanding the `Worker information` in the build log and posting *in thread* the `hostname`. Please also link the build log *if your repo is public*. If you have opened an issue already, we're pulling information from those issues as well.”

“Everyone with stalled builds, please restart these builds and let me know if it's better :pass: or same :fail: (react to this post with :pass: or :fail:)”

Note that the responder asks the users to respond in a thread. The packed message list design of the messaging system can fragment this information across time making it harder to track the full scope of the problem. Putting it in a thread keeps it consolidated. In the second example, they take advantage of the affordances in the chat messaging platform (emoji) for a lightweight, low-cost way of communicating for both the user and the responder.

Recruiting vendor (network) Recruiting the internal network provider across organizational boundaries is necessary, but it is costly. The responders all share the established common ground that the network provider will refuse to actively join an incident without definitive proof that the issue is theirs or that the issue is not with other parts of the system.

Eng 2: we can try to open an issue with [network provider] and get them to try and look around
Eng 1(ic): who'd like to take that?
Eng 3: I don't think we have enough to go to yet. They won't take this
Eng 1 (ic): yea, that's what i saying to the group "we only see network problems sometimes in our application, but not from the nodes they run on" they won't take that

Also of interest here is that solicitation as a form of delegation (“who’d like to take that?”) is contingent on the other participants in the joint activity actively taking initiative and tracking the activities of others.

This technique enables flexibility but can break down during periods of workload saturation, as responders with tasks underway do not take on additional work, so the incident lead must keep track of the unassigned tasks and the sequencing of when other activity will be finished. This is why some degree of orchestration is needed: the choreographer needs to keep a holistic view of the activities underway in mind and adjust others’ actions accordingly when they notice a sequencing or prioritization problem.

Controlling Costs – Self When the intermittent activity seemingly spontaneously resolved, the specialized responders ‘left’ the incident to return to their own work, thinking it was over. When drawn back into the response, they did not explicitly rejoin; instead they gave directions for someone else to follow to avoid another disruption to their own tasks.

Controlling Costs – Others As noted in the previous example of difficulties in recruitment, when the recruited specialists did not rejoin, the recruiter initially tried to work with the changed pattern of coordination. This changed pattern was that the specialized responders give instructions and the core group tries to follow them; when there were difficulties in doing so, the recruiter explicitly asked the specialists to rejoin as full participants, taking on tasks and engaging in the group discussion.

Investing in future coordination After discovering that they did not have a quick and easy way to count jobs at a specific time interval, the IC makes a note in the chat transcript for follow-up in the post-incident activities: “is it possible to log/report/graph the count of jobs that are kill at the 10m timeout?” This is a form of lowering future workload by making it easier to track needed repairs or updates while in the incident.


Interacting with tools
Earlier findings showed difficulties establishing common ground between European and US counterparts about whether the incident was, in fact, an incident. That example showed how one engineer enacted two very common practices in ChatOps. The first is simply making a statement in a channel; however, the signal-to-noise ratio is often low in high-volume, large-scale chat messaging systems.

Most messaging platforms use a form of signaling. In the Slack example shown in Figure 5.10, there are five indicators: 1) a small red button appears with the number of messages; 2) an aggregate list becomes bolded when there are unread messages; 3) any form of activity in a channel bolds the channel name; 4) when there are specific mentions of your name, a group you belong to, or a broadcast function (such as notify everyone online or everyone in the channel), a small red button appears with the number of messages directed to you; and 5) for workspaces with many channels, or those separated into distinct organizations (cross-workspace), an additional banner at the bottom of the visible screen indicates unread messages.

Figure 5.10 Tooling

Another finding of difficulties in interacting with tools from this case comes mid-event, when a responder posts ":this: realized that I never ack'd" and moments later ":thinkx1000: I didn't get paged at all for this inc". Another responder chimes in "what said", confirming they were not paged by the alerting tool either.

Updating
As mentioned, this case represents a full day of difficulties handled by the responders. The entire response team of the core group and specialist engineers is co-located at the same site. A manager, who had periodically been checking in (both by looking and listening in and by physically coming by the team's area), had been offline and unavailable for several hours. Upon returning, the overhead involved in scrolling back through the chat logs is substantial, so he asks the group for a face-to-face update during a lull in activity.

Figure 5.11 Maintaining common ground through updating

Coordination Breakdown
There were three coordination breakdowns of note in this case. In the first example, the engineer responsible for coverage during European business hours maintains their usual practice of making a general comment about the state of the services around the time the US engineering team will be coming online. This time, however, they include a second comment intended to direct attention to alerts they were uncertain about - an implicit question that needs addressing. The problems continued and the European engineer used an @mention, which is designed to flag a message for an individual or group by prioritizing it in their message list. While they did post their updates to the shared channel, an urgent hotpatch took precedence when the US team came online and the attempt to establish common ground on whether the problem was an incident got lost.


The second example of a coordination breakdown is when the incident commander assumed the two specialists, whom they had recruited to the earlier incident, were still engaged. However, those responders had returned to their own work and were offering ideas but not actively re-joining the incident. There is no signaling to indicate withdrawal from the joint activity and, given the nature of virtual incident response, it is not immediately evident they have withdrawn. The IC has to explicitly ask again for their full participation.

This exchange also offers insight into the way in which model updating is an integrated part of incident response. When Eng 2 suggests a course of action that Eng 1 doesn't know how to perform, they walk through the steps. Eng 1 realizes the knowledge about the system inherent in their mental model is insufficient to carry the line of inquiry further. They give up attempts to update and instead explicitly recruit the other responders.

Eng 1(ic): i can't think of anything else to investigate or levers to pull to fix the problem <@mention sre team> anyone else have ideas?
Eng 2: The item which I think will be best is to get more precise on what is happening when the build timeout. Also to get the nodes, to see any patterns. It may just be select nodes.
Eng 5: <@mention sre team> I am finally back - anything I can do to assist?
Eng 1(ic): [posting to user channel] Most of the sporadic build timeout issues appear to be network operations with artifactory. I saw a couple with `apt` operations and restarting the builds have resulted in successful builds. We have yet to find an issue with networking, but continue to look around. If you consistently hit a build timeout issue, please open an issue with your and the build logs attached:
Eng 1(ic): Okay, is there a way that I can match a build to a node in the cluster?
Eng 2: if you expand worker information
Eng 2: There is a field called hostname . The end contains the node name
Eng 2:
Eng 1(ic):
Eng 1(ic): another issue opened by a user:
Eng 1(ic): I need more help on this. This isn't normal behavior
Eng 3:
Eng 1(ic): @Eng2 @Eng4 i need one of you to help us dig, please :emoji: we're in webex


A third example of coordination breakdown comes from the coordination between the response team and one of their third-party dependencies - in this case, the internal network team described in the recruiting vendor (network) example above. There are many indications of a long-running coordination breakdown with this provider. Early in the incident, when the network is proposed as a potential source of the issues, one responder comments "don't see anything posted in . to @Eng3's point, they aren't usually quick about doing so." This followed a discussion where Eng3 warned against using the status page as proof that the problem was not with the network, recounting that in the past they had only found out about network issues hours later. Past engagements have shown that the network team will refuse to engage unless a substantial threshold of proof that they are responsible for the problem has been met. This is a form of controlling costs that will be addressed in the next section detailing cross-case findings.

Themes emerging from across the cases

In the previous section, findings were presented highlighting discrete key elements of the choreography that enables smooth coordination. As the data in each of the separate cases shows, enacting these coordinative functions requires effort in addition to the cognitive work of anomaly response. When presenting a hypothesis, for example, a responder calibrates the information presented relative to the amount of common ground between the participants of the joint activity. This adds to previously identified elements of choreography such as those surrounding common ground, delegating, updating, synchronizing, tracking others' activities and taking direction.

Findings in the analysis across cases revealed new elements of the choreography while also confirming and deepening our understanding of prior patterns. A selection of cross-case findings is presented here to connect specific examples of the overhead costs incurred in coordination across the observations from the corpus. The full table of Elements of Choreography is in Appendix A.


Investments in establishing common ground
The evidence of an investment in establishing common ground (CG) during an incident was found in explicit statements implying that the speaker recognized there may have been differences in knowledge, beliefs or assumptions about the information being transferred and sought to define a baseline. Cases 2, 3, 4 & 5 showed clear evidence that investments had been made in establishing common ground. Consider the statement in Case 4 where a responder looking at load times for a node posts "long-term load is down to 17 (very low)". The qualifier "(very low)" is a lightweight example of an investment in CG with other responders who may not know whether 17 is high or low. "The master containers have been running for 31 hours which means no new containers were made." This statement from Case 5 makes explicit the knowledge that a new container would not have 31 hours of run time. In Case 2, an incoming responder asks "do you know in which cloud account the server is located? Is this managed by ?" to come up to speed on the way the system is configured and also to probe the knowledge of the existing responders. In Case 3, as a group of responders finishes diagnosing an issue, a vendor support engineer states their assumption about the goals of restoring service, establishing common ground about the priorities of the next phase of work: "We just need to make sure that the log parts that has not yet been aggregated, get aggregated, which to me seems like we should also, and that they're visible at some point, which seems like a data migration. At some point, I think that getting bills running, even if people can't say like see their logs from five minutes ago, that seems fine… Like it's better to be able to run future builds and get hung up on the last, on the migration."

It is also possible to infer that there are ongoing investments that add to common ground over a longer period of time. Statements that reveal knowledge and beliefs about the different aspects of the incident – the other responders, the system behavior, the other teams or units coordinated with or the organization as a whole – represent previous effort that went into learning or forming those beliefs. For example, in Case 3 the vendor support engineer suggests scaling workers to zero to stop build log generation, but still allow users to view old logs. In making this suggestion, the vendor demonstrates knowledge of how the system functions and the belief that the tradeoff is reasonable for their users. This understanding of the priorities of the situation is indicative of a past investment in establishing common ground. Similarly, learning about who other responders know (or have in their network) was shown to be important when the existing group of responders needed to recruit a skillset but was not aware of someone who held it (as in Cases 2 & 4). In contrast, also in Case 2, a specialized internal resource was able to be recruited into a long-running incident because a contact in the existing network was known to have connections with their department. Evidence of past investments in establishing common ground also relates to an organization itself. Cases 1, 2, 4 & 5 showed where extensive preparations and cross-checking were involved in communicating across organizational boundaries to third-party support teams through their ticketing systems. These internal and vendor support teams were known to be highly reluctant to join web conferences to engage in real-time collaborative diagnostic efforts. So the incident responders, knowing they needed their insight, adapted by initiating the work to bring others up to speed earlier, and allocated additional attention and care to crafting messages that would elicit a response, based on their knowledge of the kinds of information needed to convey the urgency of the situation. Lastly, and perhaps most obviously, investments in establishing and maintaining common ground about the system itself have direct impacts on the ability to carry out the functions of anomaly response. Cases one, three and five show how shared CG can be beneficial, and cases two and four show how the costs of coordination rise when there is limited shared knowledge of system performance. Recordings from the post-mortem meetings in cases one and four reveal that a substantial portion of the debriefing is dedicated to updating knowledge of how the system behaves. This form of model updating happened in a number of ways - through direct query and response about sources of failure and the nature of cascades noted in an incident, by correcting a statement made by another responder pertaining to system function, by specific explanations about the internal workings of a microservice, or by recognizing the implications of configuration changes or deferred maintenance. A second specific instance of how establishing knowledge of the system was an important element of coordination comes from the contrast provided when the investment is not made. The example in case four was the suggestion from the vendor that the volume on a specific application appeared to be part of the problem even though it showed business-as-usual volume.

Tracking others' activities
An intrinsic part of the case findings is keeping track of the different individual efforts underway that make up the joint activity. Effort to maintain a current understanding of what others are doing is necessary for all responders, but most explicitly for the incident commander. The evidence of this is in Case 3, where the incident responder tracks the activities in a chatbot accessible to all responders. In Case 4 the responders have a normative pattern of explicitly stating what they are doing and when that task finishes, such as the selection below, taken over an 8-minute period. This shorthand allows others to track their activities and identify unassigned tasks.
"Rolling restart underway"
"Waiting for the caches to warm"

"On the last job reboot” “drafting diag cmds” “Job nodes done rebooting” “Drafting ticket to ” Recruiting resources Preparatory studies showed the models and the tooling used to recruit resources can be instrumental in both increasing and decreasing the costs of recruitment (and the subsequent integration of the incoming resources to the effort). Recruitment strategies were myriad and largely successful. Issues were raised in preparatory study 2 about the effort involved in maintaining systems for recruitment and revealed that systems of recruitment can become stale without substantial effort to maintain currency for those who are recruitable and how to reach them. This was noted to a lesser degree in case 1 and case 5 findings when on call alerting systems revealed gaps in the scheduling had resulted in responders not being paged. Multiple roles and signaling devices were used to alert others their help was needed. In Case 1, the core group of responders quickly recognized the boundaries of their collective knowledge was insufficient for the problems face and brought in specialized expertise from another team in early stages. These resources were co-located and a member of the core group walked over to their workspace to recruit them which they quickly joined online before physically relocating to the meeting room where the others had gathered. Concurrently, they began recruiting the vendor, even before the boundaries of reformed group were reached as access to additional proprietary information available to the vendor was needed. The scaffolding of the established ticketing system was used to contact them. In Case 2, there were multiple layers to the recruitment, beginning first with a second opinion then to the Mgr 1 recruiting Eng 1 and the subsequent recruitment of vendor support and other specialized responders throughout the remainder of the incident. It was noted that the act of recruiting is more nuanced than simply notifying someone. As was shown in case one, two, four and five, vendor recruitments caused delays which meant the task activity changed - generating workarounds, creating unproductive lulls while waiting or shifting the timing of certain tasks as was shown in the proactive upload of service bundles in case one and five. Recruitment into virtual collaboration spaces also generated the need to bring responders into the tooling and cope with difficulties hearing, seeing or otherwise connecting with others. This was seen in cases two, four and five (as well as multiple times in the original set of cases reviewed) when explicit communication about how to join web conferences, difficulties in hearing and loss of internet access disrupted coordination efforts.


Delegating
Perhaps the most surprising findings in the cross-case analysis relate to the effort required for effective delegation (for example, probing to get a current assessment of responder availability to take on the task being delegated) and to the minimal presence of explicit delegation. In cases one, two and four there were examples of 'preparation' for delegation, whereby the incident commander ensures they have current knowledge of the status of the degradation and asks directly for updates on responder activities before delegating. In case one there is an explicit redirect of a responder to enable shared delegation of the workload. In case two, delegation was very subtle and offered as suggestions. This can also be seen in cases one, four and five, where the assignment of work is made more through suggestions and probes, putting the onus for taking on additional work on the responders. The following are compact examples of delegating from the cases. The incident commanders were found to use both direct requests and more general open queries to the group.
IC: Hey [Eng 2], Would you mind running the cluster reach to get a... To see if Consul is actually running?

In this example the IC knew Eng 2 was in the Consul admin account, and having them run it lowered switching costs and minimized the potential of working at cross purposes. There is no explicit direction given on the timing of the task, implying that Eng 2 should manage their workload accordingly.

IC: [Eng 3] and [Eng 1] do you have something that you're already digging into?

In this case the IC is working jointly on the response in addition to IC duties. They keep a high-level trace of the responders' activities.

IC: “So I'm, uh, does someone know the best way to restart Redis and if so, can you take that?”

Delegation of tasks can also be indirect, addressed to the group; this was common in incident response teams with significant common ground and high levels of initiative-taking.

Taking Direction
Implicit in delegating tasks is the reciprocal function of taking direction. At a distance, taking direction appears to be a passive process of waiting for instructions. However, incident responders are, by nature, proactive. Joining an incident in progress requires coming up to speed and orienting yourself to the problem (both active processes). Upon orienting to the problem, many responders immediately took the initiative to search for diagnostic information; therefore, upon being delegated a task, they incurred a cost of re-prioritizing the activity they had just begun in order to accept the new task. Another cost of coordination in taking direction is signaling your availability to take on workload. Once a task has been delegated, the engineer must assess the task relative to their skills and abilities as well as clarify task assignments or timelines for completion.

Synchronizing tasks
Challenges were noted when interacting across boundaries using ticketing systems, which created lag and a lack of observability that impeded anticipation and resulted in workarounds. Also inherent in the choreography related to synchronization is the cost of delay while waiting on responses or on participants to finish their tasks, as seen repeatedly in cases one, two and five. A clear example of this from case two is when a manager has stepped in to aid the incident responders by escalating the request for help to an unresponsive third-party support team.
Mgr 1: While we're waiting for that is somebody's generating a support bundle in the ticket? I think they're waiting for that. [crosstalk]
Eng 2: Yes, sir. [crosstalk]
Eng 3: Yeah. It's still running. It takes a while for run. [crosstalk]
Mgr 1: Okay, thank you. I didn't see a task for that. [crosstalk]
Eng 4: You have to do a full support bundle?
Eng 3: A full support bundle?
Eng 4: Well before one takes like a couple of hours.
Eng 3: Yeah. I did a full up, a full cluster support bundle
Eng 2: Doesn't that take a couple of hours? Or did they improve the speed?
Eng 5: I recall it taking a couple hours.
Eng 3: Yeah. Takes a while
Eng 2: While they are waiting for the support bundle we may want to let them know, say, Hey, this kind of take a couple hours. We want something on do something between that time.
Mgr 1: I'll let them know that

Finally, costs are found in the additional cross-checking that comes with resynthesizing efforts into a coherent whole. An interesting example of these last two costs is the tradeoff seen in case three, where the entire response group participated in the code review. This added cost in the sense that other threads of activity had stopped, but it ended up being a worthwhile investment when the multiple perspectives were brought to bear in resolving uncertainties.

Controlling costs for self
Implicit in the study of strategies used to control the costs of coordination is that they benefit the individual deploying them in some form. It could be argued that choreography used to control costs for oneself does not actually incur a cost. However, the cost is being shifted to the person who is trying to coordinate with them; therefore, it is worth noting these strategies here. The strategies noted were: ignoring prompts for input (see cases two and four for examples of this between the reliability team and their users, and cases one, four and five for examples between the reliability team and the vendor); dropping conference calls, audio bridges or web conferencing (see case two for actively engaged responders, and cases one, four and five for refusals to join at all); decreased monitoring of user forums (as previously noted in cases two and four); shedding load (see case 4, where migration work was postponed until requested); decreasing quality of work; requesting support; delegating tasks; or asking for cross-checks (as mentioned earlier).

An example from case one is seen when a responder has been tasked with sending an email with diagnostic information to a third-party vendor. This responder has multiple threads of activity underway and controls the costs of coordination with the third-party vendor by requesting support (offloading a task):
Eng 3: "If someone has all of their emails and the CC list so I can copy and paste, it'd be fantastic too.…"

Controlling costs for others
A less acknowledged aspect of the choreography of coordination is the effort expended to control costs for others. It seems counterintuitive that, when the costs of coordination are high, one would expend effort on behalf of others. However, forms of this overhead cost included: ignoring or deferring prompts for input that would interrupt others' work (common when coping with user channels in cases one, two and four) and other forms of gatekeeping; determining interruptibility to minimize the switching costs of redirecting attention (cases one, two and four showed this); pre-formulating ideas for potential contributions (role, task) prior to engaging (as in cases one and two); delegating tasks on another's behalf (case one, where the incident commander steps in to shift tasks away from one of their responders); delaying requests until others are able to respond (particularly evident when listening to audio transcripts of cases one, four and five, where extended pauses are common); signalling availability to help and proffering capacity (as previously noted); conducting cross-checks (in particular the example from case one below, but see also cases three and four); and lowering expectations for output.

Distributed incident response teams were frequently found to take advantage of the observability afforded by the coordination tools (in this case a web conference). Eng 1 has been tasked with deploying the code change that the group has settled on. Another responder suggests pulling up a screenshare so they can conduct a cross-check to offset the decreasing quality of work brought about by the time pressure.

Eng 1: I have it queued up. I'm going to double check out the service name right real quick.
Eng 4: Are you sharing your screen? So I can go check the commands to make sure you didn't make any typo's... Like, I always do,
Eng 1: uh, I can share this screen and my double checking has revealed that I don't have the service name right.

Maintaining common ground
While initial investments to establish common ground set a baseline for shared understanding, ongoing investments were needed to maintain common ground. Clark (1996) talks of maintaining and repairing common ground relative to short timespans (in conversations). Interestingly, the most salient examples of the role of maintaining common ground in the context of this kind of joint activity (incident response) came from longer-term investments. Linking back to the positioning of investments in establishing common ground as occurring over time, it makes sense that the concurrent maintenance also takes place across time. There are expenditures needed in: cultivating networks and establishing channels for recognizing change that can impact individual and team goals and tasks; developing a sense of nominal and off-nominal trajectories of change (at the individual, team and organizational levels); maintaining a sense of team-specific demands (including workload, changes, availability of specific resources, individual capacities, upcoming events); maintaining a sense of system demands (technical debt, hardware updates, deferred maintenance); and maintaining a sense of environmental demands that can impede coordination (weather events, limited office technologies, etc.). In Case 1, for example, Eng 4 calls forth a known problem with delayed reporting: "Bear in mind that has periodic glitches of network traffic and will report them at least 6 hours after an event occurred. It could potentially be caused by them." As with establishing common ground, while it was harder to trace these efforts explicitly, evidence was provided in the post-incident discussions when individuals were questioned about how they recognized something or knew some obscure but important fact was relevant. Often the mechanisms had to do with being involved in cross-department groups or initiatives, having read a news article or blog posting, or having gone for coffee with people from other units or companies.

114

As mentioned previously, maintaining common ground is an ongoing process. The following examples show how investments in maintaining common ground must take place over time.
Eng 2: "the exhaustion of the active workers is the behavior since hotpatching... which spawned this issue a week ago."
In this example, the responder is able to link an event from the week prior to the off-nominal performance currently found in the system. They know this because of ongoing investments in maintaining common ground, including daily stand-ups to stay current on others' activities, participation in post mortem debriefs about recent events, and more informal methods like discussions about current events that take place during lunch or coffee breaks.
Eng 1: "the web node long term load is way above normal (at half capacity)"
The use of "long term load" indicates this responder has context for performance over an extended period of time. In this particular case the responder is a relatively new member of the team, and their ability to maintain common ground came from accessing dashboard queries that provide a link to past performance.

Eng 3: "It's probably just the west coast signing off, but response times got a lot better about 10 min ago”

This responder, situated on the east coast, is able to connect the improvement in response times to the drop in usage that typically occurs when the other side of the country ends its workday. This example may seem inconsequential, but the knowledge it reflects is an acknowledgement of the distributed nature of company activities, the expected load that partial operations would generate, and that this particular day's activities are typical, which may explain part of the performance changes in the system.

Repairing breakdowns in common ground
Similar to maintaining common ground, repairing breakdowns in common ground can be effortful. The overhead costs include considering the implications of the breakdown, assessing the options for repair given other work and current conditions, deferring, reprioritizing or abandoning other activity to invest in repair, running mental simulations to consider outcomes, gauging interruptibility, establishing new channels for communication, recruiting resources to aid in repair, or delegating responsibility for repair to others. It was noted that not all breakdowns in common ground were repaired. Some were bypassed and revisited during the post mortem debrief, while others, perhaps deemed insignificant or too costly given other demands at that point in time, were ignored.


Additional patterns of choreography needed for coordination

In addition, a number of previously undiscussed patterns related to the choreography of smooth coordination during incident response were observed across the cases. These were: aiding others in model updating, recognizing your own need for model updating, taking initiative, investing in future coordination, coordinating with tools, and monitoring coordination demands.

Aiding others in model updating
This pattern was closely related to maintaining common ground but is a distinct element of choreography for coordination in that it is an active investment in furthering the knowledge and skills of the parties being coordinated with. In effect, it is a demonstration of reciprocity that benefits the individual as well, since it increases shared knowledge and understanding and fulfills the purpose of learning more about others' capabilities, which can lower coordination costs in the future. The overhead associated with aiding others in model updating included recognizing faulty mental models in others, determining what is known to them, retrieving information to share (either from prior knowledge or research conducted), sharing knowledge (the actual communication exchange), seeking confirmation it was understood, devising examples to aid understanding, fielding questions, clarifying others' statements, or adding further description to something someone else has said.

Developing knowledge of the system is effortful and ongoing. Practitioners use continuous updating to keep these efforts minimal. A common example of this is a discussion during an incident about contextual factors relevant to the problem but not explicit to the task at hand. These micro-investments are not thorough, but they are efficient. As noted, more elaborate examples of this occur during post mortems.

Recognizing your own need for model updating
The recognition that your own mental model needs updating is listed here as a separate pattern although it incurs similar costs as above. It is listed separately because the effort involved appeared distinct from recognizing others' need to update and taking action to assist them. In the corpus this appeared to be intermittent, but it is a pattern worth exploring further. An example comes from case two, when an engineer tasked with checking logs in a service networking tool recognizes they are unclear on how to do so and prompts the rest of the group by saying "I'm a little rusty checking logs in Consul…"

Taking initiative
The overhead costs of taking initiative included: considering task demands, considering available capacity & skills, considering others' capabilities, determining the sequencing of events, considering pairing options for larger or complex activities, communicating intent, and joining activity already underway. This was seen in multiple cases when responders came into an incident, brought themselves up to speed, noticed a task that could be useful to the effort underway and, unprompted, signaled that they were going to do it. In particular, cases one, three and five reflect a smooth interchange of taking direction and taking initiative.

Investing for future coordination
Reinforcing the finding that coordination is not episodic but should rather be considered across longer time frames is the pattern of responders investing for future coordination. This pattern included a wide range of activities that can incur costs, such as capturing or extracting data for use in the post mortem, updating mental models about system functions and interactions, identifying coordinative friction at the boundaries, compiling materials for a debrief, participating in debriefs, reflecting on one's own and others' actions, updating group processes, elaborating on coordination breakdowns, calibrating technologies, communicating post mortem results to stakeholders, prioritizing follow-up action items, re-prioritizing the backlog, and prioritizing recovery (self & others) by deferring tasks or taking on more tasks to aid peers while they recover. Essentially, many of these items are actions taken in the service of reciprocity.

One example included annotating items to flag them for aiding in reconstructing the timeline or for follow-up discussion in the post mortem: "‼️ is it possible to log/report/graph the count of jobs that are kill at the 10m timeout?" The use of the ‼️ emoji indicates to others that this is not a question to be answered in the flow of the response but rather a notation for after-action review. Consequently, no one attempted to answer it, as they recognized it as part of an investment in future coordination. Similarly, in case four, the short segment ":this: this message didn't go to users " used an emoji to flag an unexpected fault in one of the tools used for coordination, making a note of the malfunction for its developers. This is a lightweight way of lowering the cost of partaking in coordinative actions like providing input on how a tool functions in practice. A sketch of how such flagged items might be collected for the debrief appears below. Another form of investment comes in the ongoing efforts to improve working relationships across inter- and intra-organizational networks. Several teams meet regularly with key vendor product management to discuss the use of the tool and the quality of the support relationship. Implicitly, attempts to resolve this 'friction at the boundaries' are a recognition that the cost and quality of future coordination depend on being able to transparently work through any issues.
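Harvesting these flags can itself be automated, lowering the cost of compiling debrief material. The following sketch assumes a chat export in a JSON-lines form with ts, user, and text fields - a hypothetical format chosen for illustration, not necessarily what these teams exported - and simply collects messages carrying the ‼️ flag for the incident timeline.

```python
import json
from typing import Iterator

FLAG = "\u203c\ufe0f"  # the ‼️ emoji used to mark items for after-action review

def flagged_messages(export_path: str) -> Iterator[dict]:
    """Yield messages flagged for the post mortem, preserving timestamps
    so they can be dropped into the reconstructed incident timeline."""
    with open(export_path, encoding="utf-8") as fh:
        for line in fh:
            msg = json.loads(line)
            if FLAG in msg.get("text", ""):
                yield {"ts": msg["ts"], "user": msg["user"], "text": msg["text"]}

# Example usage (file name is a placeholder):
# for item in flagged_messages("incident_channel_export.jsonl"):
#     print(item["ts"], item["text"])
```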

In one long-running incident, a support engineer from another team, responsible for another service, had been observing the incident response in the shared ChatOps channels and offered to provide additional capacity if it was needed. "I know everyone has been focus on the issue all day, I just wanted to offer If anyone need a replacement or a break. I am available tonight to help. Just call me out with pagerduty"

Coordinating with tools
Human-machine interaction, human-computer interaction and interface design are all fields implicitly dealing with coordination between people and technology. However, the tools themselves are not always designed for coordination. The findings in this section outline some of the ways in which there is additional cognitive work, and corresponding costs, in coordinating with tools that are intended to aid coordination. As shown in preparatory study 2, significant additional efforts were found in assessing the suitability of the tool, comparing it to current practice, forecasting future value, setting up the tool, testing, orienting others/training, adapting practice to accommodate the tool, gaining access to the tool, calibrating the tool, troubleshooting issues, additional monitoring of the tool, cross-checking, assessing the validity of the tool's interjections, and gaining advantage of the tool's capabilities by adjusting other practices. Many of these findings were supported in the corpus. For example, part of the first coding of the data was to identify where coordination was being explicitly mentioned. Consistent across all cases are multiple references to the tools of coordination (asks for a link to the web conference, verbal prodding for people to join an audio bridge, statements indicating difficulties logging in or setting up the audio or video components). In addition, the use of ticketing systems adds an attentional demand for monitoring. Cases 1, 2, 4 and 5 all include references to responders explicitly directing attention to watching for an update in the ticketing system. If the active incident response is taking place in ChatOps or on a web conference, then the responders have to interrupt the tracking of activity taking place there, switch tools and reorient to the ticket or the vendor's proprietary collaboration mechanism. Even tools ostensibly designed for coordination, like on-call scheduling, incur these costs - as when gaps in the on-call schedule meant responders were not paged.


Another way in which this cost is incurred is when the tool is unable to determine the appropriateness of an intervention or to recognize when a requirement is not critical. In one case, a bot in a ChatOps channel interjected 7 times in a 35-minute period to ask responders to define the location of the incident - an action that was not necessary for communicating about the event because the location was implicit in the services that were degraded. Lag in the systems designed to provide observability about the status of a function meant an indicator that would be relevant and meaningful to the response was rendered useless, such as in the example below.

Eng 2: “Redis thing doesn't take into account Redis memory utilization that still says they're green. Despite the fact that we have clear evidence there, they're not.”

Monitoring coordination demands
A final pattern is a meta-coordinative activity relating to monitoring and responding to on-going coordination demands. The costs here were in aligning current demands with required support and ensuring appropriate resource requirements - people, support contracts, hardware/software - are met. In cases 3 & 4, the incident commander had the option of bringing in additional resources, as other responders had offered to join, but chose to defer the use of those resources. In case 2 this oversight function was filled by Mgr 1 who, in the early stages of the event, probed to assess whether the problem demands and available resources were well matched. When they weren't, they brought in Eng 1 to assist and later created a shared channel to invest in building & maintaining a collaborative posture amongst all the responders. A second example of this meta-coordinative function is in Case 1, where a manager who shared the workspace with the response team recognized shifting needs during an incident and stepped in to order lunch.

An example of this meta-function in action is seen when management, who had been tracking the long-running event in Case 4, steps in to ask whether appropriate handoffs have been considered and whether the team has sufficient coverage to take time to recover. Another example shows this is not tied to role or authority: in Case 2, a user directly messages a member of the response team to notify them that more frequent updates are needed to help users plan and adjust around the outage. Lastly, in Case 1, a manager who has been tracking the event notices the continued delay from one of the vendors and steps in to escalate the urgency.


Summary

In the analysis of the corpus of five cases, coordination is shown to be a continuous process spanning multiple roles and timeframes. The unit of analysis expanded beyond individual coordinative events (like conversation) to place these into the context of key functions in anomaly/incident response. These key functions could go easily or become more difficult depending on the challenges posed by the system's malfunctions and disturbances, the uncertainties that resulted, how the processes of anomaly response worked, and how additional people and expertise became engaged in the response.


Chapter 6 Discussion

This dissertation work has examined the mechanics of coordinated activity that is increasingly common in large-scale distributed work systems. Using multiple converging methods, including a series of preparatory studies and detailed process tracing of joint activity during incidents, this project has elicited patterns of cognitive work inherent in joint activity. Specifically, this work addressed the strategies used to control the costs of coordination within these systems while carrying out the functions of anomaly response under uncertainty, risk, and pressure. The detailed process tracing of the corpus of five cases laid out the elements of choreography used to manage the escalating coordinative and cognitive demands that arise during a response to exceptional events. These findings offer a model of the elements of the choreography used in coordination and an analysis of existing processes of coordination within software engineering. This detailed analysis provides an empirical basis to extend previous research on common ground in joint activity (Clark 1996; Clark and Brennan 1991; Klein et al. 2005; Bradshaw et al. 2009), mental models (Johnson-Laird 1983; Entin and Serfaty 1999; Fiore et al. 2001) and anomaly response in distributed work (Woods 1994; Watts-Perotti and Woods 2007; Woods and Patterson 2001). The integrated view of this analysis provides a basis for understanding the dynamics of choreography that complement the elements of choreography, by tracing how the costs of coordination are incurred by geographically dispersed people working on difficult problems with automated processes at speed and scale. Specifically, the three main contributions of this work are: 1) extending the underspecified elements of choreography to provide a detailed analysis of the cognitive work inherent in them, 2) detailing the dynamics of choreography across multi-role, multi-echelon networks (Woods, 2018), and 3) providing new insights into the costs of coordination with the tooling (automated and otherwise) necessary for anomaly response in distributed work systems. The elements of choreography explicitly laid out in Chapter 5 will be examined in detail first. The discussion will then shift to more general observations about how coping with non-routine failure events in these contexts requires the cross-scale, time-sensitive, systems view of coordination provided by Adaptive Choreography. Dynamic coordinative strategies are shown to facilitate smooth interactions across a variety of agents - both human and machine/automata - with interdependent capabilities. The third key point in the discussion will lay out the implications of tooling that incurs costs of coordination. Brief closing comments are then made on the design implications for coordination across human-human and human-machine teams.

The elements of choreography

Prior research identified many important aspects of choreography: establishing common ground, recruiting new resources, delegating, synchronizing tasks, controlling costs for self, taking perspectives/switching perspectives, controlling costs for others, maintaining common ground, and repairing breakdowns in common ground (Klein et al. 2005; Woods and Patterson 2001; Klinger and Klein 1999). However, it stopped short of laying out the inherent cognitive work in detail. What follows in this section is an empirical basis for the cognitive work involved in the elements of choreography.

In addition to the expansion of prior work, new elements of choreography emerged as important for anomaly response. These were: aiding others in model updating, recognizing your own need for model updating, taking initiative, investing for future coordination, coordinating with tools, and monitoring coordination demands.

Establishing Common Ground

The concept of common ground is applicable both to the understanding of a specific event underway and to the broader context in which that work takes place. We might think of this as knowledge, beliefs and assumptions about the anomaly response to an acute issue, as well as the on-going need to understand the implications of that issue relative to the broader pressures & constraints. For example, a responder paged into an event in progress will need to establish common ground by orienting themselves to information such as: what others believe is happening and why, what mitigations have been tried, and their results. However, this takes place within the broader context of the socio-technical system performance, including knowledge, beliefs and assumptions about things like: typical system behavior (including the kinds of expected variance and failure patterns), the expertise of others in the joint activity, and the availability and willingness of needed resources to coordinate. In this way, common ground about the Anomaly Response is situated within common ground relative to the Joint Activity.

Taken this way, the role of common ground in joint activity should be considered relative to immediate interactions within an event as well as the ongoing (and often “offscreen”) efforts put into investing in establishing common ground.

Clark (1996) describes how we infer much about common ground based on contextual cues related to roles, membership in various communities, and norms. As noted in Chapter 3, continuous integration software development environments are characterized by change. In addition, knowledge of how different types of software actually work, and consequently interact with the rest of the system, is highly specialized and continually recalibrated. Therefore, it is a common (and expected) practice to continually add knowledge and update mental models about how the system functions. Because of this, many of the coordinative interactions relate to establishing, maintaining and repairing common ground. Olson & Olson (2000) note that tightly coupled and ambiguous work – such as incident response in software engineering – requires greater common ground.

There were many examples of explicit investments in establishing common ground(AR) as new responders were recruited into a response underway. These were quick, compact, highly encoded, and transitioned quickly into diagnostic work. In these examples, the cost of coordination during the incident is low, as the costs have been 'paid' earlier. In contrast, the costs of coordination are substantially greater when investments in establishing common ground(JA) were required for incoming responders. These efforts tended to disrupt the flow of the anomaly response work, in some cases stopping it entirely as incident commanders and key responders were forced to divert attention from the efforts underway to helping 'on-board' the new resources.

As implied by the recognition that costs of coordination can be 'paid' sometime prior, it is possible to infer from the cases that there are ongoing 'off-screen' investments to add to common ground(JA) over a longer period of time. Statements made in the event that reveal knowledge and beliefs about the different aspects of the incident – the other responders, the system behavior, the other teams or units coordinated with or the organization as a whole – represent prior effort that went into learning or forming those beliefs. Based on the findings, the greater the common ground(JA), the easier it is to update or repair common ground(AR). While there is certainly a cost to the ongoing investments (time and attention paid to changes in interdependent systems and the teams that manage them), these can be incurred during lower-tempo periods and can be integrated into other activities. Observations made during the study showed that engineers were continuously making these informal small-scale investments in myriad ways: through casual conversations over lunch or coffee, in online chat forums in user or guild channels, as part of inter-organizational working groups, or during vendor site visits with intra-organizational support teams. Transcripts from particularly high-performing cross-boundary groups revealed inquiries that extended beyond the particulars of the incident but were useful for establishing common ground. Organizations (especially those requiring cross-boundary incident response) can structure these into their practices. For example, having new reliability engineers spend time reviewing past incident reports during the onboarding process orients the engineer to the system, its failure patterns and incident response strategies. Anchoring this knowledge-building in a specific case can provide an immediate basis for common ground(JA) with their new team, and discussions are better focused on needed repairs. Of note in the findings on common ground were the four bases for common ground that emerged from analysis across the preparatory studies and the corpus. Knowledge, beliefs and assumptions about: i) other individual participants, ii) teams (used here to denote established groupings of individuals, such as the support engineering function in another business unit or vendor), iii) the organization (a particular business unit, support function or third party) and iv) the system under management were shown to be important in being able to flexibly and more precisely coordinate during time-compressed events. These four bases are explored further next.

Other individual participants
Investments in establishing common ground incur a cost of effort and attention to learn about what others know and what they can do (existing skills and knowledge). These investments aid in rapid recruitment of the right resources for the specific problem demands (even, perhaps especially, when those are underspecified), as was shown in all cases. Effort in establishing knowledge of the other participants went into understanding both the roles & functions (espoused & actual) and others' stance toward a problem or system, as well as understanding the limits and boundaries of others' authority & responsibility. This is a critical requirement for all responders, as the knowledge of who to recruit extends the capabilities of the response team, particularly under time pressure. For example, if all responders are knowledgeable about the capabilities of others, then not only is the pool of potential resources larger (there are more known people who can be helpful), but recruitment becomes more fluid as the responders draw on their own knowledge instead of relying on internal personnel records that may or may not be current. This can be inferred from discussions in cases one, two, four and five.

Team membership
Similarly, having pertinent knowledge, beliefs and assumptions that are shared among the involved parties requires having insight not only into them as individuals but into the collective team. Throughout this analysis, collective responders have been referred to as groups to denote their often ad hoc and fluid structure. However, there are consistent groupings that can be considered a classic 'team'. Here, costs are incurred in learning about the pressures & constraints on the team and how they, as a collective grouping, respond to these. Related to anomaly response, this can include learning about normative behaviors and the handling of exceptions, understanding how roles & functions interact in practice, and understanding the limits and boundaries of the team's authority and responsibility.

The organization
Just as individuals are situated within team dynamics, when coordination involves crossing internal or external boundaries, the dynamics of the organization in which the team is embedded can be relevant to coordination. The corresponding overhead costs are in: maintaining a sense of organizational demands (including shifting priorities, new management, pressures & constraints for action); understanding priorities and goal conflicts; learning how priorities typically tend to shift and how quickly or slowly; learning how goal conflicts are typically dealt with; learning formal decision processes; learning informal or role-specific decision making authority; learning who makes decisions and at what speed; and learning about dependent units, their role & function (espoused and actual). Additional overhead costs are inherent in determining the boundaries of organizational silos; understanding the implications of organizational silos relative to joint activity; understanding how historical events influence current coordination; and learning about available resources for establishing & repairing common ground (access to tools, formal protocols or informal practices).

The system performance
In high-consequence anomaly response, or for organizations seeking to improve reliability, common ground(JA) surrounding knowledge, beliefs and assumptions about system performance is critical to coordination. It seems obvious to state, but supporting engineers to be sensitive to variation (nominal or off-nominal), to learn about cross-boundary pressures & constraints in the operations of other members of the joint activity, about normal, abnormal or anomalous system behavior (its expected or unexpected variability), and about previous non-routine or exceptional events and the responses taken in them were all found to be important foundations for common ground. As presented in the cross-case findings, this was noted prominently in all cases, although not in the same ways.

Maintaining & repairing common ground

Evidence to support the cognitive work inherent in maintaining common ground comes from questions or statements designed to calibrate understanding. These intermediary statements are indicative of prior attention to the person, software or system state. For example, the statement "it's probably just the west coast signing off, but response times got a lot better about 10 mins ago" tells us a lot about the effort of establishing and maintaining common ground. The first is the recognition that the organization has a substantial volume of users on the west coast whose use of the system has implications for performance. The second is that west coast users are likely to be signing off at that point in time. The contextual relevance of these kinds of knowledge, beliefs and assumptions is often assumed to be implicit, but it represents knowledge that had to be learned previously in order to make currently meaningful judgements. The last way this statement informs us about maintaining common ground is in the qualifier "about 10 minutes ago". This indicates the speaker has either been monitoring the system over time or has recently checked a dashboard or log file that enables them to see this. This statement uses common ground(JA) to maintain common ground(AR).

As noted in the cross-case findings, explicit common ground(AR) repairing statements were made when the speaker recognized that inferences about common ground may have been miscalibrated and were in need of repair. Consider the following example from Case 5:
Eng 2(ic): We appreciate the analysis. These two services are the largest consumers of and the behavior you're seeing is BAU. We cannot disable these services. Further, the exhaustion of the active workers is the behavior since hotpatching from to which spawned this issue a week ago.
In this excerpt, taken from the midst of an on-going incident, the IC informs the vendor support engineer (who has little to no common ground(JA)) about what levels of load are "business as usual". It is worth adding some context to this statement: the reliability team had been attempting to get help from the vendor support team for several hours, proactively providing information based on anticipated needs. As opposed to engaging directly with the responders, which would have allowed for a rapid calibration of common ground and clarifications, the vendor support team attempted an independent analysis, which was not useful and resulted in increased frustration as the resolution was further delayed.

In closing, two salient examples of the value of investments in establishing common ground about system performance were found in cases 2 & 3, whereby incoming responders external to the responsible reliability team had substantial common ground about system performance including, in case 2, knowledge about prioritizing tradeoffs. While their status as 'outsiders' meant their knowledge, beliefs and assumptions about the current state of the system (common ground(AR)) required effort to update, their prior investments (common ground(JA)) allowed for smoother integration into the incident underway. Their participation in diagnostic efforts was better calibrated to the problem demands. This contrasts with the case 5 example noted above, where vendor support made general statements about system performance that were poorly calibrated, brought no value to their engagement, and increased the costs of coordination as the responders now had to establish common ground during a high-tempo, high-consequence period.


It is important to note that in this analysis, investments in establishing common ground are seen as related to, but distinct from, model updating. This will be discussed shortly.

Recruiting

Observations from preparatory study 1 and the findings from the cross-case analysis showed that the overhead costs involved in effectively recruiting resources included monitoring current capacity relative to changing demands to recognize when additional help was needed, identifying the skills required, identifying who is available and determining how to contact them. Delays in recruitment (as seen in the One-At-A-Time models of incident response) generate additional overhead costs in bringing these responders up to speed and establishing common ground: the event has been progressing for longer, so the information needed to establish common ground is greater. Once these preconditions (what assistance was needed and who to recruit) had been established, there were additional costs of coordination inherent in contacting or alerting them, waiting for a response, and adapting current work to accommodate the new engagement (waiting, slowing down or speeding up to complete tasks). In addition, preparations for the engagement were needed, such as anticipating the needs of incoming responders (as they pertained to information, accessibility or understanding). Practically, this meant additional activities like developing a CritSit or status update, giving access/permissions to tools and coordination channels (sending links to chat channels or phone numbers for audio bridges), and generating shared artifacts (refreshing dashboards or taking screenshots of relevant views). Once engaged, extra costs may be incurred dealing with access issues (inability to join a web conference or trouble establishing an audio bridge), as seen in cases one and five and multiple times in the original set of cases reviewed.

Coming up to speed

Just as the recruiter incurs costs of coordination, so too do those being recruited. Aside from the obvious costs of being interrupted in their own work and acknowledging their re-orientation to the problem, the responder being recruited has to assess the request relative to their own capabilities and capacity to act. If they chose to respond to the request, there were additional costs involved in deferring or abandoning their own work and communicating about the deferral or abandonment to the parties they coordinate with. This aspect of the costs of coordination is inferred from the act of reprioritization, which necessarily diverts attention from the task at hand. Once recruited, effort was expended in gaining access to collaboration tools and assessing available information, clarifying (available data and expectations), determining roles of participants, requesting additional information, forming questions about the state of the problem, determining the interruptibility of the response underway, forming an interjection and interjecting. These costs were shown less explicitly by recruited responders and more in the ways in which peripheral resources (such as managers or other stakeholders) would re-enter a response they had left, as in cases one, two, four and five. In addition, efforts were needed in assessing work underway and its implications, considering their contributions relative to problem constraints and assessing how their contributions may influence work underway. There were no explicit examples of this, which is in and of itself a data point. That responders were able to seamlessly integrate into the effort underway is a signal of being well calibrated to what will be most helpful when joining. Think of a treadmill operating at full speed: if someone is aware of and prepared for the effort required when they step on, they will be able to begin without disruption - literally hitting ‘the ground’ running. Consider the alternative, when someone is unprepared and steps on - it is immediately apparent the person has not ‘come up to speed’ as they shoot off the back! In cases one, three and five there were excellent examples of smooth integration after having come up to speed appropriately.

Tracking others’ activities

Fundamental to smooth coordination is anticipation. However, being able to anticipate change in a high tempo event takes focused attention on the current trajectories and tempos of others’ activities. While this incurs a cost to individuals, it decreases redundant work and allows faster re-orientation to new tasks or potential threats, which can improve overall team performance. The findings clearly show that when other responders invest in tracking others’ activities, it can lower the tracking costs for the incident commander. Individuals are more adeptly able to anticipate what tasks they can take on or understand who might need help, enabling them to take initiative instead of being directed. In addition, in tracking others’ activities, they can synchronize their current efforts relative to upcoming workload, making the coordination smoother. This finding is a core characteristic of Adaptive Choreography, which will be explained further in the chapter.

Taking initiative

Closely related to tracking others’ activities is the element of initiative. Interpredictability is increased when participants in joint activity can “decenter and imagine the events from the other person’s point of view” (Klein et al. 2005, p.9). However, there is overhead involved when doing so for the purposes of coordination. These efforts included visualizing or mentally simulating the experience of the other, anticipating needs, and anticipating stance or orientation towards the problem or workload demands (this is shown in the proactive steps taken in cases one, two and four). In doing so, the party seeking to coordinate with another was able to adjust expectations or reorient their request to make it easier for the other party (see the examples regarding ‘controlling costs for others’). While this might typically be ascribed to an incident commander, it was also shown extensively by other responders, which demonstrated reciprocity. In several of the cases involving third party support, vendors acknowledged the perspective of the support team and the pressure to get service restored quickly. For example, in case 2 the vendor support team actively joined the core responders in diagnosing and managing tradeoff decisions to get the system back, adjusting their typical way of interacting. Also in case 2, despite incurring a cost by delaying the receipt of information, the third-party support team suggested running the bundles after the workday to avoid further slowing the system.

In cases one, two, four and five additional responders noticed the system struggling and proactively reached out to offer their help. This happened in several ways - explicitly in the response channel (“need help?”), less directly (as in case two, where several resources direct messaged a responder actively involved in the incident to offer help and provide insight) and through more subtle forms, like when additional responders noticed their peers being recruited and, in case one and case four respectively, relocated to the boardroom or logged into the web conference. Since these engineers were not recruited directly, they did not wait to be assigned tasks and instead both took initiative by identifying tasks they could take on.

Delegating

An obvious symbiotic element of choreography to taking initiative is delegation. Less obvious, however, are the efforts involved. Roles tasked with delegation, or roles that delegate in order to achieve certain outcomes, incur an overhead involved in considering task demands, available resources and others’ capabilities relative to the sequencing of events. Once determined, deciding how to direct the other, the extent of directions and the timing of the requests (relative to gauging others’ interruptibility) are all needed before actually making “the ask”. An additional aspect of delegation involves the management of expectations for completion (quality, timing, goal priorities).


The costs associated with delegation can be compounded when considering pairing options for larger or more complex activities or long running incidents, as seen in case one where the practice of taking initiative began to break down as the effort wore on and responders began to burn out. The compact examples of delegating shown in the cross-case findings reveal important insights into some of the limitations of the IC model. By using both direct requests as well as more general open queries to the group, the responders can adjust their activities relative to other concurrent demands they, or others, are facing - allowing others to take on additional load if able and to bypass tasks if their current focus takes longer than anticipated. In several of the cases the IC was both taking on tasks and ‘commanding’ the incident. According to the IC model of incident management this is to be avoided - the IC should focus only on the coordination aspects. However, the ability to make decisions quickly and with incomplete information is contingent upon an understanding of how those decisions may interact with other parts of the system. This is not an inconsequential amount of work and requires that the IC has sufficient knowledge of the evolving technical details as they shift over time - difficult to do with periodic updating. Instead, the shared responsibility for keeping track of others’ activities and the complementary functions of taking initiative, controlling costs for others, synchronizing, recruiting, updating, maintaining and repairing common ground added costs for individual responders but helped maintain coherence for all involved, which ended up reducing overall costs of coordination. It should be noted this was a highly functional strategy for incident response teams with common ground and reciprocal shared investments in managing the response efforts overall.

Taking Direction

Klein et al (2005) defined being directable as a core function for supporting interpredictability. Active efforts to signal availability and to monitor when tasks went unclaimed helped reduce the costs of delegation for the incident commander.

Synchronizing tasks

Sequencing and synchronizing tasks across responders and across activities (particularly those distributed across boundaries where there is less visibility into the synchronization of those tasks) can be effortful. Overhead related to this element of choreography includes considering the temporal relationships between tasks, the immediate and longer-term implications of each individual course of action, as well as the immediate and longer-term implications of the collective course of action and the interactions between them. This was particularly exacerbated when interacting across boundaries using ticketing systems that created lag and a lack of observability, which impeded anticipation and resulted in workarounds.

Controlling costs for self & others

Individual responders used strategies for controlling the costs of coordination for themselves by ignoring others’ requests for help, decreasing their participation in discussions or decision making, taking on fewer tasks and contributing less to sharing knowledge about system functions. These were seen to be intermittent, contingent on the demands and duration of the event.

Efforts to control the costs for others was another pattern noted across multiple cases. This included efforts to determine interruptibility and delaying requests until others appeared ready to be interrupted, formulating ideas for potential contributions (“would you like me to check the logs?” or “do you need me to be ic?”), signaling availability, proffering capacity, cross-checking work, lowering expectations for output from others and gatekeeping. Interestingly, gatekeeping was seen both within the team (the IC requesting users slow their requests on the response team to allow them to focus) and on behalf of the team (managers becoming intermediaries to executives and providing updates to prevent them from directly engaging, allowing the team to focus).

Model updating

As mentioned in Chapter 2, modern IT systems are complex and undergo continuous change. This inherently drives a need for multiple, diverse perspectives to maintain an accurate mental model (Woods, 2017). Individuals both seek knowledge to update their own model and contribute to updating others’. Thus, it is understood models will be incomplete in some way, so continued investment in helping others learn is mutually beneficial (their contributions to incident response are likely to be better calibrated to the problem demands) and reciprocated (they will know things that you will likely need to learn in future). While closely related to maintaining common ground, for the purposes of this analysis I have separated out the cognitive work of model updating as effort to expand and update technical knowledge about the system behaviour. It could be argued this is unnecessarily reductionist, but it serves to underscore the importance of knowledge, beliefs and assumptions about the socio-technical system as a whole. It is easy to dismiss the importance of having current, detailed knowledge about the state of goals, priorities, tradeoffs, pressures and constraints on teams, organizations or industries as a whole that influence and inform the possibilities for action. Cook (1998) argues effectively for analysis to consider organizational and technical issues together and concurrently, so as to consider narrow domain details as situated within a “set of pragmatic issues regarding the system in operation”. A salient example of how technical details are always in the context of the pragmatics is given in case 1 when Eng 3 says to the responder who is preparing an update to stakeholders, “I think we should put it as a major outage now because it seems like you can’t log into it or anything else like that”. This is both an assertion that the level of degradation is severe (failure to log in) and that classifying the event as a major outage will mean something to stakeholders (they should expect significant issues with their use of the service and know that there is a team actively working the problem). Moments later, Eng 5 proposes shutting down the CD service because of its relationship to the currently degraded function. When Eng 4 asks if the CD service is having issues, Eng 5 aids model updating by explaining how shutting down one service will impact the other. After this, Eng 4 agrees that is the right course of action, repairing the breakdown in common ground(AR). Many examples of model updating, independent of common ground, are found throughout the cases. In case 2, updating was limited, and the contrast was stark. In other cases, interactions between responders routinely included information that served to inform and educate others. This indirectly influences the response since it provides an opportunity for potential variations in the mental models amongst responders to be re-calibrated and aligned. The acts of 1) aiding others’ model updating, or 2) raising the need for updating in your own mental model, ensure both common ground(AR) and common ground(JA). It is a micro-learning opportunity that, in high performing teams, is near continuous and fluidly integrated into the incident response.

Investing for future coordination

Eliciting the element of choreography ‘investing for future coordination’ was particularly interesting in that it showed lightweight, continual efforts served to lower overall costs. The first key finding is that others in the network offering to join a response they were not required to join represents a second form of reciprocity (the first being the investments in model updating). Their willingness to incur costs establishes the possibility of them receiving aid from others at a future point in time. The second key finding was that leaving traces of items to follow up on (either through the use of select emojis, conventions for identifying follow-up items or @mentioning others) incurred an immediate cost but enabled post-incident activities and follow-up actions to be more readily identified and attended to. This form of shifting the costs of coordination provides a contrast to the other detrimental effects of shifting the costs of coordination in time, seen when needed resources refuse to join a response (discussed in the next section on general findings).
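To make the second finding concrete, the following is a minimal, hypothetical sketch of how lightweight traces left during an incident could be harvested afterwards. The marker emoji, the “#followup” tag and the “@postmortem” mention are invented conventions for illustration, not those observed in the corpus.

```python
import re

# Assumed, illustrative conventions for marking follow-up items in chat.
FOLLOW_UP_MARKERS = (":pushpin:", "#followup")

def extract_follow_ups(chat_log: list[str]) -> list[str]:
    """Collect messages tagged during the incident for post-incident review."""
    tagged = []
    for message in chat_log:
        if any(marker in message for marker in FOLLOW_UP_MARKERS):
            tagged.append(message)
        # @mentioning a hypothetical postmortem owner is another lightweight trace
        elif re.search(r"@postmortem", message):
            tagged.append(message)
    return tagged

if __name__ == "__main__":
    log = [
        "eng2: restarting the worker pool now",
        "eng4: :pushpin: we should alert on queue depth #followup",
        "ic: @postmortem capture why the hotpatch spawned this",
    ]
    print(extract_follow_ups(log))
```

The design point is the asymmetry of costs: a few keystrokes during the event buy a ready-made follow-up list afterwards, shifting coordination work to a lower-tempo period.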

Coordinating with tooling

Tools are intended to transform practice in some way, for example by streamlining a practice, consolidating information or aiding collaboration. There is an implicit assumption that when a tool is deployed into a field of practice it will provide a net benefit for the functions it is designed to support. As Woods & Hollnagel (2006) note,

“Our fascination with the possibilities afforded by new technological powers often obscures the fact that new computerized and automated devices also create new burdens and complexities for the individuals and teams of practitioners responsible for operating, troubleshooting, and managing high-consequence systems.” (p. 119)

A second implicit assumption is that, if the tool fails to support the functions intended, the failing is in improper use rather than with the tool itself. But this research has shown there are costs incurred in selecting, on-boarding, calibrating (and recalibrating!), training users and adjusting existing practices to make the tool useful. Reliability groups have existing ways of coordinating (both practices and tools) that are likely to inform how they will conduct incident response practices. Introducing new tooling disrupts the existing practices. This is a critical and underappreciated aspect of technological change: even if introducing the tool will produce benefit at some future point in time, there will be a cost in switching to new modes of practice. An incident response team does not form from thin air. Given it is difficult to anticipate the full range of disruptions or recognize all potential interactions, it is likely the group will experience some disruptions during a live incident, which adds to the cost of coordinating with that tool. An example given during preparation study 2 showed that adopting an alerting tool based on an on-call schedule is a benefit in that it automates the existing practices of notifying needed resources. However, if there are nuances to scheduling that the tool does not recognize and it fails to notify responders, it can introduce delays in getting help. There is a cost inherent to ‘partnering’ with the tool in that the design likely has not accounted for all possible variations, so it takes time to learn where these ‘gaps’ are (Nemeth & Cook, 2013). New tools that are deployed concurrent with changing responsibilities (and the potential need to develop new skills to handle service ownership) can uncover complex and previously unacknowledged dependencies that end up being poorly supported by the tool alone.
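As a purely illustrative sketch of the kind of scheduling gap described above, consider a tool whose only notion of “who is on call” is a fixed weekly rotation. The rotation, the names and the informal swap below are all invented; this is not modeled on any specific alerting product.

```python
from datetime import datetime, timezone

# Hypothetical roster: the tool only knows a mechanical weekly rotation.
WEEKLY_ROTATION = ["alice", "bob", "carol"]

def on_call(now: datetime) -> str:
    """Return the responder the tool *thinks* is on call this week."""
    week_index = now.isocalendar().week % len(WEEKLY_ROTATION)
    return WEEKLY_ROTATION[week_index]

# The team's actual practice (an ad hoc shift swap agreed in chat) is
# invisible to the tool:
informal_swaps = {"alice": "bob"}

if __name__ == "__main__":
    paged = on_call(datetime.now(timezone.utc))
    actually_available = informal_swaps.get(paged, paged)
    if paged != actually_available:
        # The gap practitioners must learn about and work around: the page
        # goes out, no one acknowledges, and getting help is delayed.
        print(f"Tool paged {paged}, but {actually_available} has the pager.")
    else:
        print(f"Tool paged {paged}, and the schedule matches reality this week.")
```

The point is not the specific bug but the structural cost: someone has to discover where the tool’s model of the schedule and the team’s actual arrangements diverge, usually during a live incident.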

Two surprising but consistent patterns were the extent of the effort expended to integrate new tools that purportedly increased efficiency: significant cross-functional effort was invested prior to adoption and, once in place, on-going ‘fine tuning’ was needed to calibrate. In one case, the tool was deemed to be frequently unreliable, so responders developed a cross-check to provide redundancy and guard against the tool’s unpredictability. Typically, tool designers account for the gaps by attributing them to user error. This represents a deep misunderstanding of the nature of technical work in large-scale, distributed work systems. Coping with gaps between actual and expected conditions (Cook et al. 2000) involves additional cognitive effort. The benefits of technology are often overstated and fail to properly support practitioners. This includes processes that do not reflect actual working conditions, unachievable workload given the resources available or an emphasis on key performance metrics that drives unintended behavior. Given this, the strategies employed to maintain operations are necessary adaptations, and changes to the practices may break down those strategies. However, investments in the change are substantial and not easily undone even if the tools are poorly suited. Therefore, adoption can be tentative and partial as people recognize that dependencies on new tools create new forms of failure or additional cognitive burden. Workarounds to cope with these additional costs are common. For example, a company-mandated tool adoption was being tracked for compliance. Some teams, noted to have ‘onboarded’ to the tool, were only using partial functions - and not the ones intended by the mandatory adoption. This is indicative of a poor fit between the tool and the purpose of its intended user, and/or the cost of coordination with the tool itself was too high to fully integrate.

Monitoring coordination demands

The function served by this element of choreography is to provide slack to the system. While individual responders in the event may, in periods of low activity, consider the broader coordination demands, this element benefits from being enacted by an outsider who is able to contribute a fresh perspective. This was noted in examples where an engineer from another team, or a manager distant from but tracking the progression of an event, steps in to ask if lunch has been ordered to allow the responders to keep working, or whether plans for hand-offs and rest have been considered for long-running incidents.


Summary of the elements of choreography

These findings on the functions within the elements of choreography are important because they provide an empirical basis for re-imagining incident management in critical digital infrastructure. Next, we shift from the specific elements of choreography to a more general discussion of how these play out in active coordination during anomaly response.

Efforts to control the costs of coordination

A key aim for this study is to connect the findings about how the elements of choreography are adapted to the effort and costs incurred by the parties engaging in joint activity. This next section discusses how the costs inherent in carrying out the cognitive work of the elements of choreography are adapted to the demands of the event, to the available resources, and to the constraints and limits imposed by organizational process or structure, and how these adaptations can disrupt smooth coordination. To do so, it is first necessary to identify some limitations of the existing frameworks: choreography as phases of activity (Klein et al, 2005) and coordination as assigned functions (Beyer et al., 2016). We will examine phases of activity first.

The concept that joint activity has phases is an oversimplification. By its nature, to assess coordination in organizations is to see it in relation to the interactions of multiple parties. But it is also to see it as part of a continuous flow where the past, both recent and earlier, has shaped how it will play out. Or, as Bradshaw, Feltovich & Johnson (2011) put it, “Joint activity is a process, extended in space and time” (p. 299). All coordinated action has a historical context that is either direct or indirect. Directly relevant history may be a previous interaction between parties or prior joint activity. Indirectly relevant history is a legacy of social, organizational or professional norms. In this way, the dynamics of coordination - who is involved, how they become involved, what their expected contributions are and how that affects the cognitive work in anomaly response - cannot be properly understood as episodic and must instead be assessed as a function of system behavior seen over time. It is also evidence that attempts at understanding the elements of choreography without considering the dynamics of coordination would be insufficient. It could be said this is one of the reasons why many attempts to proceduralize coordination fail (the other being that they fail to account for cognitive work). To account for the dynamics of coordination, it should be recognized that attention flows across multiple threads of activity concurrently (Woods et al, 2010), and that continued sensitivity to new information or events allows refocusing without interrupting ongoing lines of work (Woods & Hollnagel, 2006). Some of the cognitive work inherent in the elements of choreography may unfold in microseconds or seconds, for example, recalling whose skillset matches the problem demands when recruiting new responders. Other elements, such as establishing common ground, may take place across days, months or even years. Why is this important to note? One reason is that practitioners exploit these varying time scales to cope with rising costs of coordination. A second reason is that it begins to provide promising design directions for considering how to lower the costs of coordination - if the cost can be shifted in time or re-distributed to other resources in ways that maintain the integrity of the element, then we can better design for smooth coordination.

The second limitation inherent in existing models of coordination is coordination as an assigned role function. Making certain tasks and activities the explicit domain of a single role immediately sets up the potential for workload bottlenecks and for the IC role to fall behind the pace of actual events, as they must work both in the incident and on the incident (Maguire, 2019). To have a clear picture of coordination in large-scale distributed systems one must have a zoom lens to see the details and a wide-angle lens to connect those details to the broader system dynamics that can drive decisions in work practices and organizational design. For the purposes of making the findings explicit, the elements of choreography were separated out into discrete activities to aid the detailed view; however, cognitive work during anomaly response does not work this way. Many current incident response models oversimplify the dynamics of coordination by separating coordination functions into a class of activity distinct from the cognitive work - that is, assuming you can separate “managing the incident” from “thinking about or understanding the incident”. This, however, is wrong and fails to account for a fundamental principle in joint activity - coordination costs. You cannot assign the costs to a single party; all participants must exert effort into coordination. The two key issues then become: 1) are you getting value from the coordination, and 2) how do you best control excessive costs? So while some degree of success may be found in attempting to ‘offload’ the coordination costs to a single or a limited set of roles, as the complexity, the pace and the number of roles involved increase, the necessity for all participants to actively carry out the elements of choreography increases.

To explain this further, it is worth looking at attempts to control the costs of coordination by assigning coordination as a function to specific roles. The models identified in the preparation studies are used here as each purports to efficiently manage the demands of an incident response. The proponents of these systems advocate for their utility. However, upon closer examination, there are problematic aspects of this model for aiding and supporting the cognitive work of anomaly response. As many organizations have invested heavily in training on incident command systems and the model has uncritically spread as the de facto standard, it is worth taking a closer look at its limitations in supporting cognitive and coordinative demands.

Limitations of the Incident Command model

This model is based on a structure popularized by large-scale disaster recovery agencies where multiple, cross-scale organizations conduct multi-threaded joint activity in time pressured, degraded conditions. In this way, the analogy is a good one for critical service delivery. However, this model breaks down in important ways for the speeds, scales, complexities, and types of problems faced in digital service incident response. In non-routine and exceptional outages, managing the cognitive (technical problem) and coordinative work is demanding. Keeping pace with the rate of cascading failures and the resources at hand quickly overwhelms the capacity of any one person. Therefore, the IC becomes a bottleneck as instructions sent out to responders must be acted upon and then reported back to a single point of contact. The IC’s cognitive capacity to receive incoming information (whether of new developments, of the status of an action or of the outcomes of an instruction acted upon) is limited regardless of how proficient they are. In addition to limits on receiving incoming information, the effort of tracking and sequencing instructions increases as more responders join. Even in models where another layer of ‘management’, such as an Ops or Tech Lead, serves to help coordination, similar issues can arise.

Figure 6.1 The Incident Commander becomes a bottleneck


Problems with using the incident command system (ICS) are not limited to software engineering and have been noted before (Jensen and Waugh 2014). Branlat & Woods (2011) note, “the bottleneck illustrates issues associated with purely hierarchical control structures: operators more directly in contact with the controlled process lack authority and autonomy, and the required transmission of information between layers of the system is inefficient and is a source of bottlenecks” (p.20). In high tempo events, or those with multiple cascading failures, delayed action can exacerbate the problem. As Grayson (2018) notes, “the pace of events matters a great deal in anomaly response, since all systems have finite resources and limited capacity to handle changing load” (p.8). Excess coordination overhead depletes resources and the capacity to cope with escalating situations. An example of how practitioners circumvent the extensive costs inherent in the Incident Command model is the Allspaw-Collins effect (named for the engineers who first noticed the pattern). This is commonly seen in the creation of side channels away from the group response effort, which is “necessary to accomplish cognitively demanding work but leaves the other participants disconnected from the progress going on in the side channel” (Maguire, 2019, p.81).

Figure 6.2 Side-channeling as coordination costs rise.

Controlling the costs of coordination

Side channels happen both virtually, using direct messages, separate channels, phone calls, or text messages, and physically, by reorienting away from the group in a shared physical space. These adaptations are in response to excessive coordination costs. While locally adaptive (enabling the responders to carry on with their tasks), this becomes globally maladaptive when it goes on for too long without updating or synchronization. Currently, the discovery of side-channeling leads to admonishment and calls for a renewed discipline to follow the planned ICS structure. However, this is counterproductive. Rather than eliminate these necessary spaces, shared efforts must be made to both share and elicit updates or otherwise track the side channel activity more closely. This example highlights two important aspects of this discussion. The first is that while the ICS structure provides value in creating a repeatable pattern for incident responders to adhere to, upon critical examination, the practice is found to inadvertently increase the costs of coordination by forcing the responders to match the tempo of the IC as opposed to a reciprocal matching of tempo and demand. The second is that the cognitive and coordinative work of anomaly response is critically important to consider in designing work practices, team interactions and tooling. It is worth providing a few more general examples of how practitioners managed costs relative to the cognitive work before moving on.

The corpus also revealed examples of established patterns for controlling costs of coordination. Evidence of the four strategies for controlling overload identified by Woods & Hollnagel (2006) was seen in the corpus: shedding load, reducing thoroughness, recruiting resources, and shifting work in time. Practitioners shed load by decreasing or eliminating updates to stakeholders (seen predominantly in cases 1, 2 and 4) or by simply not completing tasks; they reduced thoroughness by making tradeoff decisions to get the system back up without having all data recovered (as in cases 2 and 3) or without performance issues fully resolved (as in cases 1 and 4). In all cases more resources were recruited to handle workload, bring fresh perspective and bring specialized expertise, and some degree of work was shifted in time. In addition to creating side channels and preserving capacity, two other strategies individuals used to control costs of coordination during anomaly response were finding workarounds and employing flexible choreography. The latter will be discussed in detail later in this chapter.

Interestingly, decisions to preserve capacity as a strategy for controlling the costs involve deferring use of available resources. This was seen in declining offers from other responders to join the event or in sending responders home and rotating coverage in shifts when anticipating long duration events. When coordination from others is needed but unavailable (because it is being deferred or restricted - often by cross-boundary parties), responders adapted by finding workarounds. Workarounds almost always add load to the responders as they are developing a contingency or safeguard without all relevant information. This was seen in multiple cases when the additional perspectives needed to move the response effort forward were unavailable at the point in time needed. The absence of these perspectives represented a significant cost of coordination - the overhead of dependency, particularly on another party who refused to participate in, or limited their efforts in, coordinated joint activity.

These findings are salient examples of how, as Patterson & Woods (2001) note, escalation in the demands of an event exacts a penalty for poorly designed coordination. I’ll make two points here about poorly designed coordination. The first is that many efforts seen in critical digital infrastructure to provide scaffolding for event management (through the use of tools and practices) do provide some degree of support. This is evidenced by the stark contrast provided in case 2, where the lack of coordination design generated multiple additional demands on the responders as they were required to build coordination support in real time (such as forming new chat channels for the ad hoc grouping, establishing common ground across diverse parties and developing new patterns for updating stakeholders). So some parts of the practices and models of incident response provide some value; however, they remain very much inadequate for properly supporting the demands of cognitive work in anomaly response. And to make the point explicitly - the more complex the problems and the faster the speeds and scales of problems, the less likely existing models will support the challenging cognitive work in restoring the service. If too rigid a coordinative structure, as in the Incident Command model, increases the costs of coordination, but too little coordinative structure, as in case 2, can generate additional demands, what alternatives exist for supporting coordination in dynamic, uncertain, and risky conditions?

A promising alternative is exemplified by two closely related elements of choreography - delegation and taking initiative. Many models of incident response suggest that the incident commander delegates tasks to responders who passively await instructions. While there was evidence of the incident commander assigning responsibilities, the more commonly observed practice was of incident responders taking initiative and verbalizing the tasks they were going to do. This action lowered the costs of coordination for the incident responder and avoided a workload bottleneck. What is implicit in a responder appropriately taking initiative is well-established and maintained common ground amongst the group, individuals tracking others’ activities and a commitment to controlling the costs of coordination for others. This is separate and distinct from the noted pattern of ‘freelancing’ (Beyer et al., 2016) where self-assigned tasks are dysfunctional and counter-productive to joint activity. This kind of coordinative interplay decreases the amount of effort required for the delegation of tasks.


Similarly, synchronization of activities can improve when participants in joint activity are tracking others’ activities more closely. Smooth coordination was also supported by pre-emptively controlling the costs of coordination for others (by deferring requests, taking on additional work another would otherwise have to do or blocking access to parties looking to interrupt the response). In turn, this reduces the possibility of coordination breakdowns brought about by typical responses to saturation: role retreat, shedding load or reduced quality of work. Distributing these functions across responders takes advantage of collective expertise through distributed cognitive capabilities and supports the active, engaged participation exhibited in high performing teams. Subtle but dynamic reconfiguration to address coordination demands exploits the continuously shifting availability of cognitive resources and multiple vantage points on the system.

The cost of cross-boundary coordination

How widely the functions should be distributed is relative to the boundaries of the system. Earlier in the discussion, it was noted that the unit of analysis needs to go beyond individual strategies for controlling the costs of coordination to look at the interactions between parties. This means extending the boundaries of the system to include parties whose action or inaction places pressure and/or constraints on the responders responsible for service reliability. A key finding from taking an integrated view of coordination is that non-collocated responders working on difficult problems at speed and scale adapt to rising costs of coordination by shifting coordination demands in time and across boundaries. As shown in the corpus, this can be practice driven (such as the use of ticketing systems to manage coordination) or structural, related to architectural choices (such as microservices) and the support agreements between client and vendor organizations. Despite the need to look at a broader system for insights into designing better coordination, cross-scale analysis is complex and becomes unwieldy quickly - a substantial reason why many models of system behavior either avoid it or simplify it. For the purposes of this discussion, I will focus primarily on a smaller subset of interactions, but it is worth first following this concept through to show how multi-role, multi-echelon systems and their corresponding temporal scales influence coordination. By framing the coordinative landscape as part of a broader system, the analysis becomes cross-scale, looking at networks that consider individuals, teams, business units and industry partnerships. In doing so, we then must look at a variety of tempos to understand how coordination works (and doesn’t). Individual responders responsible for resolving an outage experience the immediate and moment-to-moment demands produced by the event - the uncertainty, risks and time pressure. Responders act at the speed of the event as allowed by cognitive and coordinative demands. Others in the network removed from the immediate demands of the event may act according to other drivers, such as dependent services that re-prioritize multiple competing demands relative to the service level objectives outlined in contracts or according to the level of support targets embedded in the service agreement. In this way, while the tempo of the actual event may be measured in microseconds and seconds, with the responsible engineers working to match this pace, the tempo of action from others may be measured in minutes and hours.

Figure 6.3 Multi-role, multi-echelon coordination

Even more distant parties’ actions relative to the event may exist on much slower tempos, such as senior level decision makers whose decisions to delay hiring more support engineers may manifest over months or years. Another example is a regulator conducting a year-long investigation into a financial market trading software incident that resulted in viability-crushing losses and massive instability in the market. These are both examples of influential coordinative units that are distant from the moment-to-moment incident responders but whose actions nonetheless affect and add costs to the other parties in the network. Their slower tempo for action means increased costs of coordination (in the form of additional efforts for workarounds while waiting for contributions from other parts of the system).


With the broader system laid out, the rest of the analysis in this section will focus on a smaller, highly critical subset of coordination for incident response. In this case, the coordination across: i) teams of responders responsible for service delivery, ii) responders responsible for dependent services (both those the service depends on and those who depend on the service), iii) the vendor support team whose software influences performance in the large-scale information technology system, and iv) automated agents.

Figure 6.4 The cross-scale interactions of interest

This is reasonable given that these dynamics are central to the questions posed in this dissertation and the selection of cases in the corpus. What emerges from the case findings is a nested series of relationships of particular interest to this analysis. These are the interactions between users and the support engineers responsible for service reliability as well as the interactions between these responders and other support teams.


Figure 6.5 Nested series of relationships

To illustrate the dynamic of cross-boundary coordination costs through the lens of the nested series of relationships it is worth returning to the escalation model outlined in the findings from preparation study 1.

Figure 6.6 Escalation

In this method of organizing, an incident reported by a user experiencing difficulties moves through levels of triage involving multiple handoffs, from lower levels of diagnostics with limited visibility into the system to more sophisticated techniques where higher skilled responders deal with novel and/or high severity issues. If a user pushes code into production and their build hangs, they are likely to reach out to the internal team who owns the service for help (Recruitment 1). This team will begin to investigate and, if having difficulties, may recruit other more specialized internal resources to assist (R2). Upon discovering the issue is high severity, this team further recruits specialized skills (R3) while simultaneously opening a ticket with vendor support (R4). A similar escalation needs to occur on the vendor side before appropriate resources are deployed to help. These handoffs, even if escalated quite quickly in the face of an obvious high severity issue, delay recruitment of needed resources in both obvious ways (it takes more time to escalate through multiple layers) and less obvious ways. Findings showed higher level responders will typically repeat diagnostic tests completed earlier to ensure the results from those tests were trustworthy (i.e., that they were done properly). There are two ways this is interesting. The first is the need to trust the data: even though it takes extra time, the higher-level responders repeat previously completed tests. It also gives an indication that there is something important about having a basic context for the problem that may be instrumental for helping diagnose more substantial issues. Another way of saying this may be that simply reading the handoff notes from a lower level responder may be insufficient for a highly skilled responder to get the context needed to perform higher level diagnosis.
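As an illustration only, the way delay accumulates across these handoffs can be sketched as follows. The tier labels follow the R1-R4 recruitment steps above, but the per-tier engagement times and the repeated-diagnostics penalty are invented numbers, not measurements from the corpus.

```python
# A minimal sketch of cumulative handoff delay in the escalation model.
# Each tier adds both waiting time to engage and repeated diagnostics
# (re-running earlier tests so the data can be trusted).
ESCALATION_TIERS = [
    ("R1: internal service-owning team", 10),   # assumed minutes to engage
    ("R2: specialized internal resources", 20),
    ("R3: further specialized skills", 15),
    ("R4: vendor support ticket", 60),
]
REPEAT_DIAGNOSTICS_MIN = 15  # assumed time re-verifying earlier results per tier

def time_to_full_engagement() -> int:
    total = 0
    for tier, engage_minutes in ESCALATION_TIERS:
        total += engage_minutes + REPEAT_DIAGNOSTICS_MIN
        print(f"{tier}: engaged at ~{total} min into the incident")
    return total

if __name__ == "__main__":
    print(f"Total: ~{time_to_full_engagement()} min before all tiers are engaged")
```

The structural point is that the delay is not just the sum of queue times; each boundary crossing also re-incurs the cost of establishing trust in the data and context.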

There are two critical attempts to control costs of coordination across both these sets of relationships. One effort is setting up coordination devices and mechanisms to structure how support is initially sought. By setting up robust user communities within a company messaging platform or a web-based message board, the issue may be addressed through responses from peers in the user community. This structure, common in software engineering, offloads the costs of coordination to the user community and the support engineers tasked with monitoring the chat and responding as needed. A second method is using ticketing systems, which are very adept at controlling the cost of coordination for the responder by pushing the cost to the user. The ticketing systems often provide very little observability into the status of the incident beyond high level labels like “investigating” or an assigned severity level. It is challenging to see who is engaged, what they are doing, how frequently they are being updated on the changing status of the event, what current hypotheses are under consideration, and which have been discarded and why. A third problematic element is the use of support bundles for transmitting information. These bundle a broad collection of metrics to help with diagnostics - and save the vendor support from having to directly interact with the internal support team. Depending on the size of the system, these bundles can take several hours to run and may further add to performance issues. This process makes managing multiple competing demands much easier on the vendor side but adds delay. Very few groups tasked with reliability are going to wait for assistance without taking action to safeguard the system or attempting further diagnostic tests. Meanwhile, the practice of escalation ultimately adds costs of coordination to the user who is experiencing the pressure of the performance issues minute-over-minute and, accordingly, needs to take action. Their actions are workarounds based on incomplete knowledge (since they lack the necessary perspective from the support team on the other end of the ticketing system), which at best are successful and at worst can exacerbate the problem they are trying to fix. This also adds costs for both reliability support teams as incoming responders are behind the progression of the on-going incident and need more input to come up to speed. These strategies incur additional overhead for other practitioners - a locally adaptive but globally maladaptive practice.
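The observability gap described above can be made concrete with a minimal sketch contrasting what a cross-boundary ticket typically exposes with what responders need in order to coordinate. The field names are illustrative assumptions, not the schema of any particular ticketing product.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VendorTicket:
    """What the boundary-crossing artifact usually exposes."""
    status: str = "investigating"   # a high-level label is often all that is visible
    severity: int = 1

@dataclass
class WhatRespondersNeed:
    """The state responders must infer or work around while the ticket is open."""
    who_is_engaged: list[str] = field(default_factory=list)
    current_activities: list[str] = field(default_factory=list)
    hypotheses_under_consideration: list[str] = field(default_factory=list)
    hypotheses_discarded_and_why: dict[str, str] = field(default_factory=dict)
    last_update_to_vendor: Optional[str] = None

# The gap between these two views is the observability cost pushed onto the
# user side of the boundary: responders act on incomplete knowledge while the
# ticket sits in "investigating".
```

The design choice being illustrated is where the cost lands: the thin interface is cheap for the party answering tickets and expensive for the party waiting on them.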

The assumption built into this model is that the collaborative interplay between responder skill sets, experience and knowledge across different teams is not important to support; a single responder (or a group of responders from one particular service) should be able to diagnose an issue with interactions being mediated through the ticketing tool. In a system undergoing continual change this is a gross oversimplification. As noted, “as the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly” (Woods, 2017, p.1). Coordination across these boundaries is an issue of central concern to modern information technology intensive organizations. Both microservice architectures and the dependencies created by using a “service” based IT infrastructure create critical dependencies for both the user and the vendor of these kinds of software. They are critical to the user for obvious reasons, as the services represent necessary productive functions in the business that depend on their near constant availability. They are also critical to the vendor because this dependence means the client is paying not just for the product but for its reliability. Therefore, reliability matters in ways that can be viability-crushing for the vendor if the expectations for reliability are violated too severely. This means reconsidering how to control costs of coordination is not just of benefit to front line responders but ultimately cross-scale as well. Existing structures designed to provide interpredictability about cross-boundary coordination, such as the SLOs and SLAs discussed in Chapter 3, appear to meet the criteria of the Basic Compact (first discussed in Chapter 2) but actually do not. A contractual obligation to respond to requests for help within a set threshold does provide some degree of goal alignment, but there are no formal mechanisms to allow the vendor to relax “shorter-term local goals in order to permit more global and long-term goals to be addressed” (p.7), nor is there any stipulation to “try to detect and correct any loss of common ground that might disrupt the joint activity” (p.7). It is implied but not explicit in the service reliability arrangements (indeed, it would be difficult to describe the Basic Compact in legal contracts!). Instead, it depends on the willingness of individual vendor support engineers to make an exception to meet the requirements of the joint activity. Therefore, different perspectives that have varying views on the system, assembled in the right collaborative interplay, are critical to forming a coherent view of the event. Alternative models are needed to aid software engineers with managing the cognitive and coordinative demands of anomaly response in complex systems.
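As a rough illustration of that gap, the part of the cross-boundary relationship a contract can mechanically encode looks something like the following; the severity tiers and response targets are invented, not taken from any actual agreement, and nothing in the check speaks to the Basic Compact’s requirements.

```python
from datetime import timedelta

# Assumed, illustrative contractual response targets by severity.
SEVERITY_RESPONSE_SLA = {
    1: timedelta(minutes=30),
    2: timedelta(hours=4),
    3: timedelta(days=1),
}

def sla_met(severity: int, time_to_first_response: timedelta) -> bool:
    """The only cross-boundary commitment the contract can mechanically verify."""
    return time_to_first_response <= SEVERITY_RESPONSE_SLA[severity]

if __name__ == "__main__":
    # The vendor "met the SLA" here, yet the check says nothing about whether
    # their engagement was calibrated to the joint activity underway, whether
    # local goals were relaxed, or whether lost common ground was repaired.
    print(sla_met(1, timedelta(minutes=25)))  # True
```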


Chapter 7 Adaptive choreography

The elements of choreography laid out in Chapter 5 consolidate several important themes from the literature: aspects of coordination in joint activity, mental models amongst distributed teams working on complex, adaptive systems, and the cognitive work of distributed anomaly response for large-scale systems. The synthesis of these themes happens first by laying out in detail the functions needed to successfully coordinate cognitively demanding work under conditions of uncertainty, risk and time pressure, and then, in Chapter 6, by showing how these elements are enacted continuously and across the distributed groups to adjust performance to problem demands. An analysis of the limitations of two common current models of organizing during incident response made explicit the ways in which efforts to control costs of coordination can have the unintended consequence of i) adding cost to the coordinative efforts and ii) pushing those costs across boundaries within a system of interdependent parties. This contribution has produced a more nuanced and dynamic view of how practitioners cope with the need for variable interactions over time. This chapter extends this analysis by providing a model for designing interactions across non-collocated teams. In the model, coordination is based on the elements of choreography being dynamically distributed across boundaries in the Adaptive Choreography framework. Adaptive Choreography is a framework outlining the criteria and attributes (Klein et al, 2005; Johnson et al, 2011), scaffolding and elements of choreography needed for smooth coordination. In dynamic, rapidly changing events responders were seen to continually and flexibly enact different elements of choreography throughout joint activity and across the distributed team. The proposed framework of Adaptive Choreography provides promising design directions for organizing, training and resourcing both consistent and ad hoc teams for the purposes of incident response coordination.

The framework for Adaptive Choreography

The framework for Adaptive Choreography builds off past work in joint activity (Klein et al, 2005) to offer an alternative model for incident response. The elements of choreography are seen as the central features that are distributed across the engaged parties in dynamic reconfiguration to fulfill the functions of cognitive and coordinative work. It proposes reframing the command and control structure inherent in models of Incident Command to instead take advantage of distributed cognitive and coordinative capabilities while being supported by appropriate scaffolding and fulfilling the basic requirements for joint activity. For example, a key characteristic of the Incident Command System (and its software interpretations) is the importance placed on the Incident Commander, with extensive training programs focused on improving their abilities to handle the coordinative demands of the event. However, this model implies incident responders are passive ‘followers’ of the IC’s leadership. The findings of this research show definitively this is not the case. Whether acknowledged or not, all parties are actively involved in managing the coordination, albeit in a subtle and difficult to detect fashion. The anticipatory actions of individuals in a joint activity to ‘prepare’ themselves for their part in the shared effort, and the ways in which they re-adjust, re-calibrate and re-configure continuously, are clear evidence that the costs of coordination are borne by all - not just an assigned role. When integrated as part of the cognitive efforts, the coordination becomes less explicit and instead manifests in small adjustments to the interactions across the parties.

There is precedent for this distributed, small scale collective effort. A strong finding from both the preparation studies and the corpus was that engineers involved with service delivery maintain an on-going and substantial level of awareness about their service. This is accomplished by continuous, small, lightweight ‘checks’ that happen regularly during work hours and even intermittently during off-hours. Many of the engineers studied push monitoring alerts to their smartphones and, even in off-hours, will glance at them regularly. When the alert indicates expected performance, the engineer ignores it. When the alert is unusual, even if it does not trigger a page out (meaning it has not reached a pre-determined threshold for paging responders), engineers will begin formulating hypotheses about where the source(s) of the problem may lie. This may be simply thinking it over and trying to connect their knowledge of recent or expected events (“are we running an update this weekend?”) or it may involve gathering more information (checking dashboards, looking in channel to see if there are any reports). To be clear, I am not suggesting this should become the standard; I’m suggesting it is the standard. As Richard Cook notes, in reflecting on the data collected from the SNAFU Catchers Consortium: “It’s hard to know who is monitoring the system, who is looking at things, and who is available. We know for certain in (some organizations) that there is a pretty well-defined group of people who are in high-frequency update mode more or less continuously while awake. This is a huge contributor to reducing the costs of coordination and it is invisible to everyone and therefore not on any dashboard.” (Cook, personal communication) Organizations reap the benefits of this additional monitoring while rarely acknowledging the efforts involved. This investment in maintaining a sense of the system’s behavior (common ground) helps responders maintain their mental models. When an incident occurs, their ‘currency’ about system status may only be a few minutes or hours old as opposed to days (if over a weekend). This is not an assigned activity but one that likely has substantial benefit to the engineer’s ability to come up to speed quickly. The organization reaps this benefit without cost.

The last point to be made here is that a continual capacity to adjust to the coordination demands must be appropriately scaffolded. The scaffold of joint activity is a function of both the tacit devices - agreement, convention, precedent and salience (Klein et al, 2005) - and the adaptive mechanisms - tooling and practices - that provide affordances to know when choreography needs to be dynamically adjusted. This scaffolding can be embedded in the software itself, in the expectations for how roles interact, or in the mechanisms that enable a fluid shift from one medium to another. An example is the use of chatops user channels to provide a consistent forum for receiving early indications of issues through informal user reports or for providing updates to users as performance issues continue. Another example is that, in co-located teams, the availability of physical or virtual spaces can serve as dynamically reconfigured ‘control rooms’ depending on the specific needs of an event. In this way, the scaffold provides a semi-structured basis for the response efforts while not enforcing rigidity.
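The paging-threshold behavior described above can be sketched as follows. The thresholds, metric name and wording are invented for illustration and do not reflect any particular team’s monitoring configuration.

```python
# A sketch of the lightweight off-hours 'checks': alerts are filtered by a
# paging threshold, but engineers also glance at sub-threshold alerts and
# begin forming hypotheses before anyone is formally recruited.
PAGE_THRESHOLD = 0.95      # assumed error-rate level at which responders are paged
UNUSUAL_THRESHOLD = 0.50   # assumed level that is below paging but worth a look

def triage_alert(metric: str, value: float) -> str:
    if value >= PAGE_THRESHOLD:
        return f"PAGE: {metric}={value} - responders notified"
    if value >= UNUSUAL_THRESHOLD:
        # No page goes out, but the engineer who glances at this starts
        # connecting it to known context ("are we running an update this weekend?")
        return f"unusual: {metric}={value} - worth checking the dashboards"
    return f"expected: {metric}={value} - ignored"

if __name__ == "__main__":
    for metric, value in [("error_rate", 0.2), ("error_rate", 0.7), ("error_rate", 0.97)]:
        print(triage_alert(metric, value))
```

The middle branch is where the unacknowledged investment lives: it never appears in paging metrics, yet it is what keeps responders’ mental models only minutes old when an incident does start.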

The dynamics of coordination in anomaly response

In the last chapter, the discussion outlined some specific examples of the dynamics of coordination using a few different elements of choreography. These illustrative samples begin to frame out how cognitive and coordinative work is tightly coupled and why, in large-scale distributed systems, this coupling is a critically important lens for designing quicker, more precise and smoother coordination amongst software engineers tasked with system reliability. To fully appreciate how the flow of attention across multiple, often concurrent, activities takes place in a distributed work setting, it is useful to look at how the response efforts are governed.

Based on the findings of their simulation study on cyber security teams responding to potential attacks, Branlat & Woods (2011) draw the following conclusions:


“Similar to other domains related to emergency response, cyber defense would benefit from implementing polycentric control architectures (Ostrom, 1999; Woods and Branlat, 2010). This research emphasizes that lower echelons, through their more specific set of competences and direct contact with the controlled process than remote managers, develop a much finer knowledge of the process behavior. This knowledge allows them to detect anomalies early, thereby making them more able to adjust their actions to meet security or safety goals. That being said, both purely centralized and decentralized approaches are likely to fail (Andersson and Ostrom, 2008); they simply fall victim to different forms of adaptive challenges. In the domain of cyber security, in particular, systematically reducing vulnerabilities identified is not a viable strategy since it threatens other, production-related, goals (p.20).”

The conclusion that governance needs to strike a balance between centralized and decentralized control has direct implications for critical digital infrastructure reliability. In the digital systems domain, coordination designers can exploit the capabilities of modern tooling to support distributed work. Doing so can generate a coordination structure that amplifies the value of coordination while minimizing the costs. A core principle is to take advantage of shifting capacities for action across the distributed group. The framework allows for adaptability to meet problem demands and resource fluctuations while pushing initiative away from an IC role to the responders to reduce workload bottlenecks. This allows participants to invest in the overall group’s performance by tailoring their individual performance based on their changing capacities. Their changing capacities are based on a number of factors. The first is the availability of cognitive resources (are they completing cognitively demanding activities that are degraded by interruptions or task switching, or are they routine tasks that can easily be interleaved with other work?). A second factor is their ability to minimize other demands. Incidents are unplanned and interrupt other activities important to the on-going development and maintenance of the system. Some of this work, once started, must be followed through to completion. The responder may need to make tradeoffs to split attention across multiple demands, decreasing their capacity to fully engage in the incident response. Another factor is attentional control (are they capable of concentrating at that point in time or are they distracted, overwhelmed, fatigued?).

While the elements of choreography must be substantially fulfilled by the coordinative network, obviously, not all participants need to be exhibiting all elements at all points in time. The interplay of the group as a whole must ensure that the requirements are met. For example, in groups where less investment has been made in establishing common ground, greater efforts are required to bring incoming participants up to speed with relevant details. More effort will be expended maintaining and repairing common ground. It becomes more difficult to integrate new perspectives because what is known and unknown is unclear and emergent. This is particularly common when crossing inter-organizational boundaries with groups that are not frequently engaged together in joint activity. In cases where it was understood that coming up to speed would be effortful, responders adapted by taking initiative and providing additional information in anticipation of the need to establish more common ground. This is distinct from providing a situation representation (‘sit rep’) – where the information transmitted is typically proximal in time and space to the incident. Being adaptive with choreography recognizes that, without established common ground, contextual factors relevant to understanding the nature of the problem may need to be included to help make incoming responders immediately useful. What differentiates adaptive choreography is that it is difficult (and sometimes wasteful) to be prescriptive about what action is appropriate in an emergent, dynamic situation. Instead, action is context dependent and members of the joint activity emphasize or de-emphasize the patterns of activity depending on the needs.

Re-imagining incident management practices

Given the above discussion, it is useful to consider how these findings can provide constructive guidance for reimagining incident response. This section examines the context of incident management, the potential for change and some promising directions for re-imagining incident management in critical digital infrastructure.

Modern incident management is assumed to be non-collocated and distributed across inter- and intra-organizational boundaries. Groupings of responders are both formal and consistent (formed by a ‘core’ group that is typically the team responsible for the service) as well as ad hoc and emergent (reconfiguring as individuals and teams from other dependent services or core business functions join in when their skills or perspectives are deemed useful). Even in well-balanced and thoughtful organizations, there will always be elements of ad hoc groupings. Some responders may necessarily be working on other high value activities when an incident occurs (such as making architectural changes or system updates, attending conferences to share knowledge or coaching junior employees). Service outages are a fact of life in 24/7 operations, so with the exception of clearly high severity outages, recruitment may be delayed (and in some cases rejected) so as to make progress on other important organizational priorities. In addition, there is a relative scarcity of specialized expertise, so creating redundancy by employing multiple people in the same roles (even if economically viable, which it rarely is) can be challenging to meaningfully sustain. Because of this, the usual approaches to business continuity, such as training of “teams” or carefully orchestrating a diversity of skillsets within a team, will always fall somewhat short. Organizing to enable rapid reconfiguration of ad hoc groupings, and then preparing those individuals to better coordinate, provides a better chance of ensuring coordination can continue in conditions requiring adaptation to the availability of resources and adjustment of roles.

An aspect of the software engineering domain that allows for more dynamic modes of coordination is that a shared base of skills, along with the concurrent ability to access and interpret logs or dashboards, means there is greater fluidity for tasks to be adopted by others. This overlap of specialized skills and the ability of response teams to maintain common ground in real time (afforded by the tooling) and with relatively low overhead represents a substantial opportunity for the domain to design new models for incident response. The next section lays out the ways in which roles change in an Adaptive Choreography framework.

The changing nature of roles

The nature of joint activity means collectives that form for the purpose of responding to a specific service outage (such as an incident response team) will naturally retain some degree of the authority and power, the capacity to leverage or allocate resources, and the expectations of their role ‘outside of the incident’. However, this is not a 1:1 match. A high tempo, high consequence outage engenders reconfigurations: managers deferring to their engineers; users becoming a part of the incident response as they gather and provide diagnostic data; or vendor support engineers working under the direction of a client’s incident response team. Some aspects of existing roles are subordinated in service of the immediate demands of the incident. This dynamic shifting of roles (and the activities or authorities of those roles) actually aids in smooth coordination. This discussion will begin with a review of the “supporting cast” before shifting to the incident responders themselves.


Figure 7.1 Adaptive Choreography as an incident response model

Many of the roles traditionally held in incident command structures are maintained in name but transformed in practice within Adaptive Choreography. These roles provide value in aiding the incident response by providing a focal point for peripheral stakeholder engagement (management, user groups, communications or product-oriented roles, for example). The IC, for example, becomes one of several supporting roles. With greater autonomy pushed to the responders, the IC’s role shifts from commander to coordinator – providing support and helping to track activities, secure new resources or act as a buffer to the outside world. This last function can help prevent overload by helping to pace the tempo of the event. For example, the IC can assess the urgency of incoming information and then choose to wait until lower tempo periods before presenting it to the responders to integrate new, outside perspectives. Other core functions are monitoring cognitive and coordination demands, tracking activities underway for the purpose of cross-checking or sequencing, and identifying any gaps in the existing work. For example, a team that has had difficulties diagnosing an outage has stopped taking safeguarding actions that may temporarily resolve the service issue so that they may gather more information about the nature of the problem itself to help generate a sustainable solution. During the incident, a VP who is on-site with a major client directs the IC to restore service even if it means the problem cannot be identified and may return at a later point in time. The IC, operating in a supervisory control function, probes the responders to elicit the full extent of the tradeoff decision being asked of them. If the benefits of continuing to incur some degradation to the service outweigh the reputational cost to the VP, the IC can then serve to negotiate the directive by transmitting information on the implications of restoring service to overall system health. In this way, the IC’s role can mediate potentially locally adaptive but globally maladaptive actions by broadening the scope of analysis and preventing local needs or power dynamics from disrupting the best course of action. While some degree of activity tracking, delegation and synchronization is still performed by the IC role, it is not the primary means of coordination – the Adaptive Choreography amongst responders is primary. The IC’s function is not to direct activities but rather to watch for progressions that indicate the response group is not well calibrated to the broader organizational goals, priorities or demands, and then help revise plans, recruit or reallocate resources accordingly. There is a sentiment within the field that the IC is simply a high-level coordinator position (some even claiming this role does not need technical knowledge of the system), but what is being proposed is qualitatively different. The IC requires technical knowledge to be able to interpret the activity underway and its implications, to provide oversight and to ask pertinent questions to help reframe the responders when course corrections are needed.

In Figure 7.2, a number of roles are seen bridging the space between the core response team and the stakeholders. These are active participants whose primary roles are ones of support. This is a subtle but non-trivial departure from many existing practices. Take the role of the manager whose team owns the service experiencing the outage and whose engineers form the basis of the response team (this role is called Service Owner Management in the figure). A typical pattern is to stop or slow response efforts to give a status update to the manager, which incurs a cost to the responders. By prioritizing the needs of the incident responders and deferring to a support role, the manager instead invests effort in looking in and listening in to maintain awareness of the event’s progression. In this way, they are able to offer early course corrections, approve needed resources to be reallocated or otherwise anticipate how to better support timely resolution.

Similarly, in instances where a third-party provider’s insight and perspective is needed, the ability to have vendor resources actively participating in the incident lowers the cost of coordination for both parties. Of course, there is an overall cost to third parties or internal service providers of having sufficient engineering resources to join client response efforts. However, in highly interdependent systems where cross-boundary observability is limited (one may only be able to see if the service is up or down) and where unexpected interactions or undisclosed code changes may be contributing factors to an outage, it is crucial to have greater cross-boundary coordination. While many vendors offer access to greater levels of support for additional cost, this is disingenuous: if the vendor’s current pricing model does not account for the access to information and expertise necessary for complex incidents, then they are misrepresenting the true cost of that dependency. Other stakeholder roles looking and listening in to the response – such as client support, users (of an internal service), or other dependent services – incur a low cost of monitoring the responders’ activities to better anticipate what, if any, adaptations are needed to mitigate any follow-on effects from the outage.
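
The limit on cross-boundary observability noted above can be illustrated with a small sketch: an external consumer of a vendor’s service can often query only a coarse status signal, while the vendor’s internal telemetry (latency, saturation, recent changes) remains invisible. The endpoint URL and response shape below are hypothetical assumptions for illustration only.

    # Sketch of the limited view available across an organizational boundary.
    # The URL and JSON shape are hypothetical; calling this requires a real endpoint.
    import json
    import urllib.request

    def vendor_status(url: str = "https://status.example-vendor.com/api/v2/status.json") -> str:
        """Return the coarse up/down-style indicator that is all a consumer can see."""
        with urllib.request.urlopen(url, timeout=5) as resp:
            payload = json.load(resp)
        return payload.get("status", {}).get("indicator", "unknown")

Anything finer-grained than this coarse indicator typically requires the vendor’s engineers to join the response and share their internal view directly, which is exactly the cross-boundary coordination argued for above.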

Responders as self-organizing units

Instead of being passive recipients of instructions from the IC, the responders are engaged in actively and fluidly fulfilling the choreography requirements. As laid out in Chapters 5 and 6, these include taking on and shedding tasks, prioritizing actions, synchronizing the cadence of their work, and informing others about it as their ability to manage load changes moment to moment. Allocating some attentional resources to tracking others’ activities, assessing their own capacity to increase or decrease involvement, and taking initiative while signaling their actions does add some additional workload, as described. However, this continuous, lower intensity effort incurs the costs at a more predictable and dynamically adjustable level than in other models, where sudden or persistent coordination breakdowns take much more effort and time to repair.
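
One way to make this signaling concrete is to treat it as a lightweight claim-and-release pattern over a shared forum. The sketch below is an idealization for illustration only; the class, task names and responder identifiers are assumptions, not a description of any tool observed in the cases.

    # Idealized sketch of responders fluidly taking on and shedding tasks by
    # signaling in a shared forum. Not a description of any tool from the corpus.
    from dataclasses import dataclass, field

    @dataclass
    class IncidentBoard:
        open_tasks: set = field(default_factory=set)
        claimed: dict = field(default_factory=dict)   # task -> responder
        log: list = field(default_factory=list)       # the visible signaling channel

        def announce(self, task: str):
            self.open_tasks.add(task)
            self.log.append(f"task opened: {task}")

        def claim(self, responder: str, task: str) -> bool:
            # Taking initiative: claim only if the task is still unclaimed.
            if task in self.open_tasks:
                self.open_tasks.discard(task)
                self.claimed[task] = responder
                self.log.append(f"{responder} is taking: {task}")  # signaling the action
                return True
            return False

        def shed(self, responder: str, task: str):
            # Shedding load when capacity drops returns the task to the group.
            if self.claimed.get(task) == responder:
                del self.claimed[task]
                self.open_tasks.add(task)
                self.log.append(f"{responder} released: {task}")

    board = IncidentBoard()
    board.announce("check consul logs")
    board.claim("eng_1", "check consul logs")
    board.shed("eng_1", "check consul logs")   # e.g. interrupted by a higher-priority demand
    print(board.log)

The visible log is the important part of the analogy: each claim or release both accomplishes the technical step and, at the same time, informs everyone else tracking the channel.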

The scaffolding for Adaptive Choreography

Tools such as chat platforms, video conferencing and shared collaboration tools (such as Google Docs or virtual whiteboarding) enable distributed work in several ways. The first is that having documents, dashboards or other artifacts and shared frames of reference (Maguire and Jones, 2020) can quickly center the shared activity and better surface any discrepancies in mental models. Doing so is a shorthand way of establishing common ground by making the salient details immediately relevant and providing a basis to add or clarify information. As discussed in Chapter 2, ChatOps platforms enable looking in and listening in (Patterson, Watts-Perotti and Woods, 1999; Woods, 2017). In Adaptive Choreography, stakeholders such as users or dependent service owners can trace the progression of the incident over time by monitoring the incident response channels or subscribing to updates generated by the Comms Lead or Incident Commander and posted to a status page or other shared updating forum (such as an existing user channel or social media newsfeed). This is important in several ways. The first is that looking in/listening in, even when the diagnosis remains unclear, updates their mental models about system performance and may activate knowledge useful to the response effort. In addition, early awareness of the hypotheses being generated, as well as the mitigating actions being considered, can help them pre-emptively adjust their workload and revise or re-plan their own work to accommodate potential impacts. Another valuable aspect of this largely silent monitoring stems from their vantage point on the system, which is, by definition, different from the responders’. They may have access to information about the status of the system or the impacts of actions taken that can help inform the active response efforts. Taking initiative to track what information or perspective would be useful and then providing it at the point in time it is needed, without having to be directed, lowers the cost of coordination for the incident response team. Finally, if outside stakeholders or users join the incident response channel and begin interrupting the activity (say, for example, by prompting for status updates), these ‘bystanders’ can gatekeep by re-directing the interrupter to the updates channel or by providing the support that the responders are unable to at that point in time. Another way the tooling acts as a coordination mechanism is that it provides a semi-flexible shared forum for rapid recruitment, information sharing and searching.
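
As a purely illustrative example of how such updates can be broadcast so that stakeholders can ‘listen in’ without interrupting responders, the following sketch fans a single status update out to subscribed forums. The channel names and publishing interface are assumptions; a real deployment would call the chat platform’s or status page’s API rather than print.

    # Illustrative sketch of broadcasting incident updates to 'listening in' stakeholders.
    # Channel names and the publish interface are hypothetical, not from the cases.
    from datetime import datetime, timezone

    SUBSCRIBERS = ["#svc-users", "#dependent-service-x", "statuspage"]

    def publish_update(summary: str, hypothesis=None) -> dict:
        """Compose an update once; fan it out so stakeholders need not interrupt responders."""
        update = {
            "time": datetime.now(timezone.utc).isoformat(),
            "summary": summary,
            "working_hypothesis": hypothesis,  # even tentative diagnoses help others re-plan
        }
        for channel in SUBSCRIBERS:
            # In a real deployment this line would call the chat platform's API.
            print(f"[{channel}] {update['time']} {summary}")
        return update

    publish_update("Elevated API latency; failover to secondary Redis under consideration",
                   hypothesis="connection pool exhaustion after weekend upgrade")

The design choice the sketch highlights is separation of forums: responders coordinate in their own channel, while a single composed update reaches everyone else who only needs to track progression.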

Summary of Adaptive Choreography

This chapter has laid out an alternative model to the Incident Command System for managing service outages in critical digital infrastructure. In this decentralized model, multiple roles play a support function that serves to protect and enhance the core response group. The dynamics of how coordination during incident response takes place were discussed and the implications of adopting new approaches were explored. The transition from command-and-control to distributed coordination follows from the findings of the corpus of cases and the preparation studies. These findings showed that, somewhat paradoxically, successfully managing escalating events involves integrating the coordinative functions into the cognitive work. Integration streamlines the ability to diagnose, repair and learn from the incident and lowers the costs of coordination. Assessing coordination in this manner can lower costs both locally and in cross-boundary coordination by designing for better cognitive and coordinative support. Reducing unexpected escalations in coordination demands can reduce sudden coordination breakdowns (such as when a party experienced overload and dropped their commitment to the Basic Compact). Considerations for re-designing the interactions between core response teams, vendor support, users and other impacted stakeholders, in ways that take into account cross-scale costs of coordination, were given. While there are still aspects of Adaptive Choreography to extend, this analysis and proposal provide a basis for re-designing incident management practices and developing new tools to support coordination.


Chapter 8 – Conclusion

This work has examined the costs of coordination through the lens of the cognitive work of anomaly response in large-scale distributed work systems. The integrated view of coordination drew from multiple relevant lines of inquiry to forge a synthesized new view of joint activity in large-scale distributed systems. The domain of critical digital infrastructure is a relevant area of study with broader applications for practitioners in other high tempo, interactive failure environments characterized by time pressure, risk and uncertainty. This is both because of the inherent qualities that make it ideal for studying cognitive work (the natural production of artifacts useful for process tracing) and because of its increasing relevance in technology-mediated, distributed systems running at speed and scale.

To situate the corpus of cases in the context of the operational realities that shape and influence the practice of incident management, two preparation studies were completed focusing on the models of organizing incident response and the tooling to support incident resolution. A corpus of five cases using detailed process tracing formed the basis for analyzing the artifacts generated as software engineers coped with service outages. To elicit the data for this study, multiple converging methods and sources of data on critical incidents involving service disruptions were used. These included chat messaging transcripts, audio transcripts from web conferences and audio bridges, log files, dashboard screenshots, post-mortem documents and video recordings of incident responses. Capturing the complexity of in situ work is a challenge for any researcher, particularly for those who are designing and building tooling. Convergence across multiple sources helps to recreate, in part, the nuance of everyday work. In other words, the details matter. This close examination, across multiple data sources, was necessary to identify the ways in which expertise smooths out the ‘rough edges’ of poorly designed work practices or technologies. How practitioners control the costs of coordination is an issue relevant to any domain with interdependent activity. This study offers researchers a promising basis for how to collect data in this setting and take advantage of the affordances that distributed and technology-mediated work provides.
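
In simplified form, this convergence can be thought of as merging time-stamped records from each artifact into a single trace. The sketch below assumes a uniform record format and synchronized clocks, which real artifacts rarely provide without per-source parsing and reconciliation; the timestamps shown are illustrative, not drawn from the corpus.

    # Simplified sketch of converging multiple data sources into one process trace.
    # The record format and timestamps are assumptions for illustration.
    from datetime import datetime

    chat = [("2020-03-01T14:02:11", "chat", "eng_1: cancel and rerun serial")]
    logs = [("2020-03-01T14:01:58", "log", "storage diagnostics started (parallel)")]
    webex = [("2020-03-01T14:03:40", "audio", "IC asks who is generating the support bundle")]

    def build_trace(*sources):
        """Merge time-stamped records from heterogeneous artifacts into one timeline."""
        return sorted((rec for source in sources for rec in source),
                      key=lambda rec: datetime.fromisoformat(rec[0]))

    for ts, kind, event in build_trace(chat, logs, webex):
        print(ts, kind.ljust(5), event)

Interleaving the sources in this way is what lets the analyst see, for instance, that a chat message redirecting diagnostics follows a log event by only seconds, a relationship no single artifact reveals on its own.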


The findings showed that strategies for coordination during critical incidents are highly adaptive, enacted with fluency and dependent on the ability to anticipate and adapt in real time to others’ activities. Sixteen elements of choreography were laid out in detail. These elements outlined the cognitive work required to enable smooth coordination between multiple parties in joint activity. The dynamics of choreography show how these elements are actively managed by each party, depending on their changing capacity. In particular, the relationship between the cognitive and coordinative demands meant responders were better positioned to manage coordinative functions than a single incident command role. In some high tempo and high complexity cases, the incident commander created a workload bottleneck which delayed incident response and drove workarounds (responders moving into side channels) to cope with the cost of coordinating through a single entity.

In the Adaptive Choreography model, a framework for enacting the elements of choreography across both the core responder team and the multi-role, multi-echelon group of stakeholders was provided. A proposed redesign of the traditional incident command structure shifted that role into one of support. In this revised role, the IC provides meta-coordinative supervisory oversight, buffers the group from external demands and tracks activities in parallel, among other functions. This enables the IC to be an informed but distant support whose vantage point across multiple stakeholders is an added benefit to the group’s performance. The redistribution of functions also removes the IC as a bottleneck and allows for more expedient revision of plans and reallocation of resources where needed. Central to this redistribution of activities is the explicit adoption of innate coordinative functions by the core response team. On the surface, this redistribution appears to add load to responders; however, the findings show that in well-coordinated activity this was already implicitly taking place. The additional level of granularity of how to direct attention to coordination provides specifics about what considerations are meaningful to ensure the interactions in joint activity produce desired outcomes.

Collectively, the elements and dynamics of choreography provide a means to design for skilled joint activity in this domain and others. Future research into this topic should focus on studying the conditions under which Adaptive Choreography itself becomes too costly to carry out. Other promising areas of future research include the role of reciprocity, of investing in future coordination, and of aiding response teams to extend their analysis to include reflections on choreography in their post-incident work.


Appendix A – Elements of Choreography

Each element of choreography below is presented with its corresponding overhead costs, who incurs the cost, the temporal scale over which it operates, examples from the cases, and adaptations used to control the costs of coordination (CoC).

Element of choreography: Generalized investments in establishing common ground
Corresponding overhead costs:
● Learning about the organization*
 • Maintaining a sense of organizational demands (including shifting priorities, new management, pressures & constraints for action)
 • Understanding priorities and goal conflicts
 • Learning how priorities typically tend to shift and how quickly/slowly
 • Learning how goal conflicts are typically dealt with
 • Learning formal decision processes
 • Learning informal or role specific decision-making authority
 • Learning who makes decisions and at what speed
 • Learning about dependent units, their role & function (espoused and actual)
 • Determining the boundaries of organizational silos
 • Understanding the implications of organizational silos relative to the joint activity
 • Understanding how historical events influence current coordination
 • Learning about available resources for establishing & repairing common ground (tools, formal protocols, informal practices)
● Establishing knowledge of the ‘others’
 • Learning about what others know
 • Learning about what others can do
 • Learning about who others know
 • Understanding others’ stances toward a problem or system
 • Understanding roles & functions (espoused & actual)
 • Understanding the limits and boundaries of authority & responsibility of others
● Establishing knowledge of the immediate team
 • Learning about the pressures & constraints of the grouping
 • Learning about how the collective grouping responds to these
 • Learning about normative behaviours
 • Learning about handling of exceptions
 • Understanding how roles & functions interact in practice
 • Understanding the limits and boundaries of authority and responsibility of the team
● Establishing knowledge of system performance
 • Learning about the pressures & constraints for operations
 • Learning about normal system behaviour
 • Learning about abnormal or anomalous system behaviour
 • Learning about previous non-routine or exceptional events and responses taken
Who incurs the cost: Individuals (all individuals committed to the basic compact)
Temporal scale: Pre-event (years, months, days)
Examples from the cases:
 • A new responder joining the team spends the first few weeks on incidents but is not ‘expected to contribute’. They do contribute in whatever ways they can (pulling info, monitoring user channels, ordering lunch during a long running outage, aiding in reconstructing the timeline for the post mortem).
 • A responder contacted an influential and highly impacted manager to get feedback on the tradeoff decision the team was facing (lose data or be down longer in an attempt to recover).
 • Being able to immediately identify an individual who has information useful to the case and knowing how to contact them in multiple modes (email and chat).
 • Recognizing that an upgrade conducted over the weekend meant the team was coming into Monday operations without sufficient rest and may require additional support.
 • Recognizing that one responder prefers web conferencing to collocated activity and accommodating for that.
 • “For some context, we've gotten 40 million requests from that application in the last two weeks. The request volume that you're looking at is normal and business-as-usual for us.”
 • “These two services are the largest consumers of and the behaviour you're seeing is business as usual. We cannot disable these services”
Adaptations to control CoC:
 Individual:
 • Less involvement in “extra-curricular” activities, which means decreasing depth of knowledge for common ground.
 • Less dialogue during interactions with other departments or units that may not be directly relevant to the immediate activity but serve to establish common ground.
 Organizational:
 • Specialized incident response teams
 • Conducting fewer post mortems or longer lag times between event & post mortem
 • New hires not given adequate orientations or sufficient time to become familiar with past events
 • Little or no formal mechanisms to exchange information/updates between dependent services about potentially impactful changes
 • Training or orientation programs that focus on technical knowledge and skills but not on other important contextual factors
 • New or junior engineers given responsibility for service reliability with insufficient grounding



Element of choreography: Maintaining common ground
Corresponding overhead costs:
 • Cultivating networks and establishing channels for recognizing change that can impact individual and team goals and tasks
 • Developing a sense of nominal and off-nominal trajectories of change (at the individual, team and organizational levels)
 • Maintaining a sense of team specific demands (including workload, changes, availability of specific resources, individual capacities, upcoming events)
 • Maintaining a sense of system demands (technical debt, hardware updates, deferred maintenance)
 • Maintaining a sense of environmental demands that can impede coordination (weather events, limited office technologies etc)
Who incurs the cost: Individuals (all individuals committed to the basic compact)
Temporal scale: Pre-event, During, After (years, months, days, minutes, seconds)
Examples from the cases:
 • “the exhaustion of the active workers is the behavior since hotpatching... which spawned this issue a week ago.”
 • “the web node long term load is way above normal (at half capacity)”
 • “It's probably just the west coast signing off, but response times got a lot better about 10 min ago”
Adaptations to control CoC:
 • Less additional effort put into ongoing awareness of activities.
 • Less or no effort expended to query uncertainties or contribute to broader awareness of activity.

Element of choreography: Repairing breakdowns in common ground
Corresponding overhead costs:
 • Considering implications of the breakdown
 • Assessing the options for repair given other work and current conditions
 • Deferring, reprioritizing or abandoning other activity to invest in repair
 • Running mental simulations to consider outcomes
 • Gauging interruptibility
 • Establishing new channels for communication
 • Recruiting resources to aid in repair
 • Delegating responsibility for repair to others
Who incurs the cost: Individuals (all individuals committed to the basic compact)
Temporal scale: During (minutes, hours)
Examples from the cases:
 Mgr 1: While we're waiting for that is somebody's generating a support bundle in the ticket? I think they're waiting for that. [crosstalk]
 Eng 2 (IC): Yes, sir. [crosstalk]
 Eng 4: Yeah. It's still running. It takes a while for run. [crosstalk]
 Mgr 1: Okay, thank you. I didn't see a task for that.
Adaptations to control CoC:
 • Repairs not conducted. A comment will go unchallenged.
 • Repairs completed but without additional information that could aid model updating (pointing out an interpretation is wrong without saying why).

Element of choreography: Taking perspectives/switching perspectives
Corresponding overhead costs:
 ● Visualizing/mentally simulating the experience of the other
 ● Anticipating needs
 ● Anticipating stance or orientation towards the problem
 ● Anticipating workload demands
Who incurs the cost: Individuals, Groups
Temporal scale: During (minutes)
Examples from the cases:
 • Uploading service bundles and additional diagnostic data for vendor support proactively
Adaptations to control CoC:
 • Focusing narrowly on own goals and tasks
 • Waiting for requests to be made

Element of choreography: Delegating
Corresponding overhead costs:
 ● Considering task demands
 ● Considering available resources
 ● Considering other’s capabilities
 ● Determining sequencing of events
 ● Considering pairing options for larger or complex activities
 ● Determining how to direct the other
 ● Determining extent of directions
 ● Considering timing of the requests
 ● “Making the ask”
 ● Gauging interruptability
 ● Clarifying expectations for completion (quality, timing, goal priorities)
Who incurs the cost: Individuals
Temporal scale: During (minutes, hours, days)
Examples from the cases:
 IC: Hey [Eng 2], Would you mind running the cluster reach to get a... To see if console is actually running?
 IC: [Eng 3] and [Eng 1] do you have something that you're already digging into?
 IC: “So I'm, uh, does someone know the best way to restart Redis and if so, can you take that?”
Adaptations to control CoC:
 • Not managing the coordination of activities (unstructured assignments and expectations)

Element of choreography: Taking direction
Corresponding overhead costs:
 ● Re-prioritizing work to accept new tasks
 ● Signaling availability
 ● Clarifying task assignments and timelines for completion
 ● Early flagging of potential issues
 ● Confirming acceptance of tasks
Who incurs the cost: Individual
Temporal scale: During (seconds, minutes, hours)
Examples from the cases:
 Eng 1: Can someone investigate :this:?
 Eng 2 (IC): parallel diags on storage appear to have a huge impact
 Eng 1: cancel and rerun serial
 Eng 2 (IC): done
 Eng 3: <@Eng1> I can investigate API calls
Adaptations to control CoC:
 • Avoid taking on new work
 • Being unresponsive to queries for help



Element of choreography: Taking initiative
Corresponding overhead costs:
 ● Considering task demands
 ● Considering available capacity & skills
 ● Considering other’s capabilities
 ● Determining sequencing of events
 ● Considering pairing options for larger or complex activities
 ● Communicating intent
 ● Joining activity already underway
Who incurs the cost: Individual
Temporal scale: During (seconds, minutes, hours)
Examples from the cases:
 • “I’m looking at the log files now”
 • “Do you want me to restart the Masters?”
Adaptations to control CoC:
 • Focusing narrowly on own tasks
 • Not signaling availability
 • Avoiding additional effort

Element of choreography: Recruiting new resources
Corresponding overhead costs:
 ● Monitoring current capacity relative to changing demands
 ● Identifying the skills required
 ● Identifying who is available
 ● Determining how to contact them
 ● Contacting them/alerting them to the need
 ● Waiting for a response
 ● Adapting current work to accommodate the new engagement
 ● Preparing for engagement: anticipating needs; developing a crit sit or status update; giving access/permissions to tools & coordination channels; generating shared artifacts (dashboards, screenshots)
 ● Dealing with access issues (inability to join web conference or trouble establishing audio bridge)
Who incurs the cost: Recruiter (can be an IC, eng or user)
Temporal scale: During (seconds, minutes, hours, days)
Examples from the cases:
 • “^Support level response, but anyone who is free to look at it with me would be welcome”
 • “PagerDuty alerting is having issues. You may need to grab folks the old fashioned way.”
Adaptations to control CoC:
 • Bringing new people in without context (no sit rep, context)
 • Relying on chat log to orient newcomers
 • Not asking for resources
 • Passivity in engagement (waiting, slowly completing tasks to aid coordination)
 • Localized focus on tasks – relies on others to perform coordination (eventually this happens)

Element of choreography: Being recruited to an incident/Orienting to another’s problem (coming up to speed)
Corresponding overhead costs:
 ● Being interrupted in your own work
 ● Assessing the request relative to your own capabilities
 ● Assessing the request relative to your capacity to act
 ● Deferring or abandoning your own work
 ● Acknowledging your orientation to the problem
 ● Communicating about the deferral or abandonment to the parties you coordinate with
 ● Gaining access to collaboration tools
 ● Assessing available information
 ● Clarifying (available data and expectations)
 ● Requesting additional information
 ● Forming questions about the state
 ● Determining interruptability
 ● Forming interjection
 ● Interjecting
 ● Determining roles
 ● Assessing work underway
 ● Assessing implications of work underway
 ● Considering your contributions relative to problem constraints
 ● Assessing how your contributions may influence work underway
Who incurs the cost: Recruitee
Temporal scale: During (seconds, minutes, hours, days)
Examples from the cases:
 • “Do you need help?”
 • “Can you share your WebEx link with me?”
 • “GitHub is requesting we run a support bundle”
 • “Can you join the call?”
Adaptations to control CoC:
 • Ignoring requests for help
 • Joining the incident without reviewing the incident data provided or transcript and asking for an update
 • Not joining the incident in the manner requested (joining a chat room rather than a web conference)
 • Not acknowledging other activity (or leaving others to manage around your work)
 • Disregarding coordination with others and conducting the technical work only



Element of choreography: Aiding others in model updating/Recognizing your own need for model updating
Corresponding overhead costs:
 ● Recognizing faulty mental models
 ● Determining what is known
 ● Assessing differences amongst mental models
 ● Determining the implications of the differences
 ● Querying for clarification
 ● Retrieving information to share
 ● Sharing knowledge
 ● Seeking confirmation
 ● Devising examples
 ● Fielding questions
 ● Clarifying others’ statements
 ● Adding further descriptions
Who incurs the cost: Knowledgeable parties
Temporal scale: During (seconds, minutes, hours, days, months)
Examples from the cases:
 • An eng tasked with checking logs in a service networking tool recognizes they are unclear on how to do so and prompts the group: “I’m a little rusty checking logs in Consul…”
Adaptations to control CoC:
 • Ignoring recognized or stated deficiencies
 • Longer term investments: knowledge sharing workshops, post mortem accounts etc

Element of choreography: Controlling the costs for others
Corresponding overhead costs:
 ● Ignoring or deferring prompts for input
 ● Gatekeeping
 ● Determining interruptability
 ● Formulating ideas for potential contributions (role, task)
 ● Delegating tasks
 ● Delaying requests
 ● Signaling availability
 ● Pro-offering capacity
 ● Cross checking
 ● Lowered expectations for output
Who incurs the cost: Willing parties
Temporal scale: During, After
Examples from the cases:
 • Case 1 – pre-emptively running service bundles
 • Case 3 – Tradeoff decision of taking the system down vs working on it while up, recognizing the timing of their actions will adversely influence users
Adaptations to control CoC:
 • Not conducting these activities

Element of choreography: Controlling costs for self
Corresponding overhead costs:
 ● Ignoring prompts for input
 ● Dropping conference calls, audio bridges, web conferencing
 ● Decreasing monitoring of user forums
 ● Shedding load
 ● Decreasing quality of work
 ● Delegating tasks
 ● Asking for cross checks
Who incurs the cost: Individual (but also group participants in the joint activity)
Temporal scale: During, After
Examples from the cases:
 • This is typically not explicit, but rather seen when things do not get done
 • Eng 1 (IC) – Case 1: “If someone has all of their emails and the CC list so I can copy and paste, it'd be fantastic too.…”
 • Case 3 – Eng shares screen to have others look in on their work to prevent errors because of the time pressure
Adaptations to control CoC:
 • Don’t join the forum that requires listening in
 • Restrict work
 • Drop coordinative functions



Element of choreography: Synchronizing tasks
Corresponding overhead costs:
 ● Considering temporal relationships between tasks
 ● Considering immediate and longer term implications of each course of action
 ● Determining immediate and longer term implications of the collective course of actions (interactions between)
 ● Waiting on responses
 ● Cross checking
Who incurs the cost: Individual (to a certain extent); predominantly the responsible party
Temporal scale: During
Examples from the cases:
 • Vendor suggests running the bundles after the work day to avoid slowing the system
 • Case 3 – Tradeoff decision of taking the system down vs working on it while up, recognizing the timing of their actions will adversely influence users
 • “While we're waiting for that is somebody's generating a support bundle in the ticket? I think they're waiting for that. [crosstalk] Yes, sir. [crosstalk] Yeah. It's still running. It takes a while for run. [crosstalk] Okay, thank you. I didn't see a task for that. [crosstalk] You have to do a full support bundle? A full support bundle? Well before one takes like a couple of hours. Yeah. I did a full up, a full cluster support bundle. Doesn't that take a couple of hours? Or did they improve the speed? I recall it taking a couple hours. Yeah. Takes a while. While they are waiting for the support bundle we may want to let them know, say, Hey, this kind of take a couple hours. We want something on do something between that time. I'll let them know that”
 • Eng 4: Are you sharing your screen? So I can go check the commands to make sure you didn't make any typo's... Like, I always do.
Adaptations to control CoC:
 • Invest more effort in proactively anticipating other needs and in demonstrating appropriate severity for the level of support requested
 • Attempt to recruit external resources early on in the incident
 • Re-prioritizing efforts that will take longer to complete so they are underway early (running support bundles)


Bibliography

Ackoff, R. L. (1971). Towards a System of Systems Concepts. Management Science, 17(11), 661–671.

Alderson, D. L., & Doyle, J. C. (2010). Contrasting views of complexity and their implications for network-centric infrastructures. IEEE Transactions on systems, man, and cybernetics- Part A: Systems and humans, 40(4), 839-852.

Allen, J., & Ferguson, G. (2002). Human-machine collaborative planning. Proceedings of the Third International NASA Workshop on Planning and Scheduling for Space, 27–29.

Allspaw, J. (2015). Trade-Offs under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages. http://lup.lub.lu.se/student-papers/record/8084520/file/8084521.pdf

Bento, F. (2018). Complexity in the oil and gas industry: a study into exploration and exploitation in integrated operations. Journal of Open Innovation: Technology, Market, and Complexity, 4(1), 11.

Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media, Inc.

Bigley, G. A., & Roberts, K. H. (2001). The incident command system: high-reliability organizing for complex and volatile task environments. The Academy of Management Journal, 44(6), 1281–1299.

Buchanan, D., Boddy, D., & McCalman, J. (2014). Getting in getting on getting out and getting back. In E. Bell, & H. Willmott (Eds.), Qualitative research in business and management: practices and preoccupations (Vol. 3). SAGE Publications Inc.


Burtscher, M. J., Wacker, J., Grote, G., & Manser, T. (2010). Managing nonroutine events in anesthesia: the role of adaptive coordination. Human factors, 52(2), 282-294.

Branlat, M., Morison, A., & Woods, D. D. (2011, October). Challenges in managing uncertainty during cyber events: Lessons from the staged-world study of a large-scale adversarial cyber security exercise. In Human Systems Integration Symposium (pp. 10-25).

Cares, J. R., Christian, R. J., & Manke, R. C. (2002). Fundamentals of distributed, networked military forces and the engineering of distributed systems. https://apps.dtic.mil/docs/citations/ADA402951

Clark, H. H. (1996). Using Language. Cambridge University Press.

Clark, H. H., & Brennan, S. E. (1991). Grounding in communication. Perspectives on Socially Shared Cognition, 13(1991), 127–149.

Crockford, D. (2009). JSON. https://www.json.org/json-en.html

Denzin, N. K. (2001). Interpretive Interactionism. SAGE.

Fairburn, C., Wright, P., & Fields, R. (1999). Air traffic control as distributed joint activity: Using Clark’s theory of language to understand collaborative working in ATC. Proceedings of the European Conference on Cognitive Science. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.8402&rep=rep1&type=pdf

Feltovich, P. J., Bradshaw, J. M., Clancey, W. J., & Johnson, M. (2007). Toward an Ontology of Regulation: Socially-Based Support for Coordination in Human and Machine Joint Activity. Engineering Societies in the Agents World VII, 175–192.

Feltovich, P. J., Spiro, R. J., & Coulson, R. L. (1997). Issues of expert flexibility in contexts characterized by complexity and change. Expertise in context: Human and machine, 125–146.

Grayson, M. R. (2018). Approaching Overload: Diagnosis and Response to Anomalies in Complex and Automated Production Software Systems (D. D. Woods (ed.)) [Masters in Integrated Systems Engineering]. The Ohio State University.

Grudin, J. (1994). Computer-supported cooperative work: history and focus. Computer, 27(5), 19–26.


Hoffman, R. R., & Woods, D. D. (2000). Studying cognitive systems in context: preface to the special section. Human Factors, 42(1), 1–7.

Hoffman, R. R., & Woods, D. D. (2011). Beyond Simon’s Slice: Five Fundamental Trade-Offs that Bound the Performance of Macrocognitive Work Systems. IEEE Intelligent Systems, 26(6), 67–71.

Hollan, J., Hutchins, E., & Kirsh, D. (2000). Distributed Cognition: Toward a New Foundation for Human-computer Interaction Research. ACM Trans. Comput. -Hum. Interact., 7(2), 174–196.

Hollnagel, E., & Woods, D. (2005). Joint cognitive systems: Foundations of cognitive systems engineering. CRC Press.

Hollnagel, E., & Woods, D. D. (1983). Cognitive Systems Engineering: New wine in new bottles. International Journal of Man-Machine Studies, 18(6), 583–600.

Jamieson, G. (2005, May). NIMS and the incident command system. In International oil spill conference (Vol. 2005, No. 1, pp. 291-294). American Petroleum Institute.

Jamshidi, M. (2017). Systems of Systems Engineering: Principles and Applications. CRC Press.

Johnson, M., Bradshaw, J. M., Feltovich, P. J., Jonker, C. M., van Riemsdijk, M. B., & Sierhuis, M. (2014). Coactive Design: Designing Support for Interdependence in Joint Activity. J. Hum. -Robot Interact., 3(1), 43–69.

Kim, G., Humble, J., Debois, P., & Willis, J. (2016). The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution.

Klein, G., Feltovich, P. J., Bradshaw, J. M., & Woods, D. D. (2005). Common ground and coordination in joint activity. Organizational Simulation, 53, 139–184.

Klein, G., Woods, D. D., Bradshaw, J. M., Hoffman, R. R., & Feltovich, P. J. (2004). Ten Challenges for Making Automation a “Team Player” in Joint Human-Agent Activity. IEEE Intelligent Systems, 19(6), 91–95.


Klinger, D., & Klein, G. (1999). An Accident Waiting to Happen. Ergonomics in Design: The Magazine of Human Factors Applications, 7(3), 20–25.

Koschmann, T., & LeBaron, C. D. (2003). Reconsidering Common Ground. ECSCW 2003, 81–98.

Larson, L., & DeChurch, L. (2020). Leading teams in the digital age: Four perspectives on technology and what they mean for leading teams. The Leadership Quarterly, 101377.

Leblanc, R. (2019, October 15). How Amazon Is Changing Supply Chain Management. The Balance Small Business. https://www.thebalancesmb.com/how-amazon-is-changing-supply-chain-management-4155324

Levinson, S. C. (1979). Activity types and language. https://www.degruyter.com/view/j/ling.1979.17.issue-5-6/ling.1979.17.5-6.365/ling.1979.17.5-6.365.xml

MacMillan, J., Entin, E. E., & Serfaty, D. (2004). Communication overhead: The hidden cost of team cognition.

Maguire, L. M. (2019). Managing the Hidden Costs of Coordination. Queue, 17(6), 71-93.

Maguire, L. M., & Jones, N. (2020, May 25). Learning from Adaptations to Coronavirus. LFI. https://www.learningfromincidents.io/blog/learning-from-adaptations-to-coronavirus

Malone, T. W. (1987). Modeling Coordination in Organizations and Markets. Management Science, 33(10), 1317–1332.

Malone, T. W., & Crowston, K. (1994). The Interdisciplinary Study of Coordination. ACM Comput. Surv., 26(1), 87–119.

Mansson, J. T., Lutzhoft, M., & Brooks, B. (2017). Joint Activity in the Maritime Traffic System: Perceptions of Ship Masters, Maritime Pilots, Tug Masters, and Vessel Traffic Service Operators. Journal of Navigation, 70(3), 547–560.

Mikkers, M., Henriqson, E., & Dekker, S. (2012). Managing multiple and conflicting goals in dynamic and complex situations: Exploring the practical field of maritime pilots. Journal of Maritime Research, 9(2), 13-18.


Neale, D. C., Carroll, J. M., & Rosson, M. B. (2004). Evaluating computer-supported cooperative work: models and frameworks. Proceedings of the 2004 ACM Conference on Computer Supported Cooperative Work, 112–121.

Nemeth, C. P., Cook, R. I., & Woods, D. D. (2004). The messy details: insights from the study of technical work in healthcare. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

Obradovich, J. H., & Smith, P. J. (2003). Design concepts for distributed work systems: an empirical investigation into distributed teams in complex domains. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (Vol. 47, No. 3, pp. 354-358). Sage CA: Los Angeles, CA: SAGE Publications.

Olson, G. M., & Olson, J. S. (2000). Distance Matters. Hum. -Comput. Interact., 15(2), 139–178.

Olson, G. M., & Olson, J. S. (2007). Computer-Supported Cooperative Work. In F. T. Durso, R. S. Nickerson, S. T. Dumais, S. Lewandowsky, & T. J. Perfect (Eds.), Handbook of Applied Cognition (pp. 497–526). John Wiley & Sons Ltd.

Olson, G. M., Olson, J. S., & Venolia, G. (2009). What still matters about distance. Proceedings of HCIC. https://www.researchgate.net/profile/Gary_Olson/publication/255563884_What_Still_Matters_about_Distance/links/54f89ee20cf2ccffe9df588e.pdf

Perry, S. J., & Wears, R. L. (2011). Large-scale coordination of work: Coping with complex chaos within healthcare. In Informed by Knowledge (pp. 69–82). Psychology Press.

Ponterotto, J. G. (2006). Brief note on the origins, evolution, and meaning of the qualitative research concept thick description. The Qualitative Report, 11(3), 538–549.

Potter, S. S., Roth, E. M., Woods, D. D., & Elm, W. C. (2000). Bootstrapping multiple converging cognitive task analysis techniques for system design. Cognitive Task Analysis, 317–340.

Prechelt, L., Zieris, F., & Schmeisky, H. (2015). Difficulty Factors of Obtaining Access for Empirical Studies in Industry. 2015 IEEE/ACM 3rd International Workshop on Conducting Empirical Studies in Industry, 19–25.

Sarter, N. (1994). Strong, silent, and out-of-the-loop: Properties of advanced (cockpit) automation and their impact on human-automation interaction (Ph.D. thesis).

Schmidt, K., & Simone, C. (1996). Coordination mechanisms: Towards a conceptual foundation of CSCW. Computer Supported Cooperative Work (CSCW): An International Journal, 5(2), 155–200.

Shattuck, L. G., & Woods, D. D. (2000). Communication of Intent in Military Command and Control Systems. In C. McCann & R. Pigeau (Eds.), The Human in Command: Exploring the Modern Military Experience (pp. 279–291). Springer US.

Smith, M. W., Patterson, E. S., & Woods, D. D. (2007). Collaboration and Context in Handovers. Proceedings of the 2007 European Conference on Computer-Supported Cooperative Work. Limerick, Ireland: Springer.

Smith, P. J. (2017). Making brittle technologies useful. In Cognitive Systems Engineering (pp. 181–208). CRC Press.

Smith, P. J., Beatty, R., Spencer, A., & Billings, C. (2003). Dealing with the challenges of distributed planning in a stochastic environment: coordinated contingency planning. Digital Avionics Systems Conference, 2003. DASC ’03. The 22nd, 1, 5.D.1–5.1–8 vol.1.

Smith, P. J., Billings, C., McCoy, C. E., & Orasanu, J. (1999). Alternative Architectures for Distributed Cooperative Problem-Solving in the National Airspace System.

Smith, P. J., Spencer, A. L., & Billings, C. (2011). The Design of a Distributed Work System to Support Adaptive Decision Making Across Multiple Organizations. In Informed by Knowledge (pp. 153–166). Psychology Press.

Suchman, L. A. (1987). Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge University Press.

Tanenbaum, A. S., & Van Steen, M. (2007). Distributed systems: principles and paradigms. http://cds.cern.ch/record/1056310/files/0132392275_TOC.pdf

Tatar, D. G., Foster, G., & Bobrow, D. G. (1991). Design for conversation: Lessons from Cognoter. International Journal of Man-Machine Studies, 34(2), 185–209.

Von Bertalanffy, L. (1972). The History and Status of General Systems Theory. Academy of Management Journal, 15(4), 407–426.

Webster, F. (1994). What information society? The Information Society, 10(1), 1–23.


Wiener, E. L. (1988). 13 - Cockpit Automation. In E. L. Wiener & D. C. Nagel (Eds.), Human Factors in Aviation (pp. 433–461). Academic Press.

Winograd, T., Flores, F., & Flores, F. F. (1986). Understanding Computers and Cognition: A New Foundation for Design. Intellect Books.

Woods, D. (2018). The theory of graceful extensibility: basic rules that govern adaptive systems. Environment Systems and Decisions, 38(4), 433–457.

Woods, D. (2005). Studying cognitive systems in context: the cognitive systems triad. Institute for Ergonomics, The Ohio State University, Columbus, OH. Retrieved on November 12, 2018.

Woods, D. D. (1993). Process tracing methods for the study of cognition outside of the experimental psychology laboratory. Decision Making in Action: Models and Methods, 228–251.

Woods, D. D., Dekker, S., Cook, R., Johannesen, L., & Sarter, N. (2010). Behind human error. CRC Press.

Woods, D. D. (2015). Four concepts for resilience and the implications for the future of resilience engineering. Reliability Engineering & System Safety, 141, 5–9.

Woods, D. D. (2019). 4 Essentials of resilience, revisited. In ResearchGate.

Woods, D. D., & Hollnagel, E. (2006). Joint cognitive systems: Patterns in cognitive systems engineering. CRC Press.

Woods, D. D., & Branlat, M. (2011). Basic patterns in how adaptive systems fail. Resilience Engineering in Practice, 127–144.

Woods, D. D., & Patterson, E. S. (n.d.). How Unexpected Events Produce An Escalation Of Cognitive And Coordinative Demands.

Woods, D. D., & Roth, E. M. (1988). Chapter 1 - Cognitive Systems Engineering. In M. Helander (Ed.), Handbook of Human-Computer Interaction (pp. 3–43). North-Holland.

Woods, D. D., Tittle, J., Feil, M., & Roesler, A. (2004). Envisioning human-robot coordination in future operations. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 34(2), 210–218.

Woods, D. (ed). (2017). Stella Report: Report from the SNAFUcatchers Workshop on Coping With Complexity


