Thèse n° 7141

Distributed Computing with Modern Shared Memory

Présentée le 20 novembre 2020

à la Faculté informatique et communications
Laboratoire de calcul distribué
Programme doctoral en informatique et communications

pour l’obtention du grade de Docteur ès Sciences

par

Mihail Igor ZABLOTCHI

Acceptée sur proposition du jury

Prof. R. Urbanke, président du jury
Prof. R. Guerraoui, directeur de thèse
Prof. H. Attiya, rapporteuse
Prof. N. Shavit, rapporteur
Prof. J.-Y. Le Boudec, rapporteur

2020

“Where we going man?” “I don’t know but we gotta go.” — Jack Kerouac, On the Road

To my grandmother Momo, who got me started on my academic adventure and taught me that learning can be fun.

Acknowledgements

I am first and foremost grateful to my advisor, Rachid Guerraoui. Thank you for believing in me, for encouraging me to trust my own instincts, and for opening so many doors for me. Looking back, it seems like you always made sure I had the perfect conditions to do my best possible work. Thank you also for all the things you taught me: how to get at the heart of a problem, how to stay optimistic despite all signs to the contrary, how to be pragmatic while also chasing beautiful problems.

I am grateful to my other supervisors and mentors: Yvonne Anne Pignolet and Ettore Ferranti for my internship at ABB Research, Maurice Herlihy for the time at Brown University and at Oracle Labs, Dahlia Malkhi and Ittai Abraham for my internship at VMware, Virendra Marathe and Alex Kogan for my internship at Oracle Labs and the subsequent collaboration, Marcos Aguilera for the collaboration on the RDMA work, and Aleksandar Dragojevic for the internship at Microsoft Research. Thank you for spoiling me with such interesting problems to work on and for teaching me so many valuable lessons about research.

I am grateful to the members of my PhD oral exam jury: Prof. Hagit Attiya, Prof. Jean-Yves Le Boudec, and Prof. Nir Shavit. Thank you for your insightful questions and comments, which have helped improve my thesis. I am also thankful to Prof. Rüdiger Urbanke for presiding over my thesis committee.

I am grateful to my collaborators: Naama Ben-David, Thanasis Xygkis, Karolos Antoniadis, Vincent Gramoli, Vasilis Trigonakis, Tudor David, Oana Bălmău, Julien Stainer, Peva Blanchard, and Nachshon Cohen. Discussing and working with you on tough problems has undoubtedly been the most enjoyable part of my PhD. Huge thanks to Naama and Thanasis for the hundreds of hours of peer programming, Skype meetings and deadline crunches that we shared in the last two years.

I am grateful to the administrative staff of the lab. Thank you, France, for your invaluable help on so many fronts, big and small. I always look forward to interacting with you because I know you will help me get things done quickly, and your green office feels like an oasis where I can take a breath of fresh air. I also always enjoy our chats on many different subjects. Thank you, Fabien, for your help with the technical infrastructure for my experiments; thanks to you, it has always been smooth sailing in that regard.


I am grateful to my colleagues and friends in the Distributed Computing Lab at EPFL, who have made it fun to come into work every day: Karolos, Matej, George D., Adi, George C., Thanasis, Sébastien, Le, Mahdi, Alex, Jovan, Arsany, Vlad, Tudor, David, Vasilis, Jingjing, Mahsa, Matteo, Rhicheek, Victor, and Andrei. Thanks for all the goofy memories that have helped me make light of the often stressful PhD life. Special thanks to Karolos for always being extra supportive in my moments of doubt, and for the countless hours of (sometimes scientific) discussions. Big thanks to Le for helping with the French translation of the thesis abstract.

I am grateful to my friends outside of the lab: Corina, Camilo, Cristina, Bogdan Stoica, Olivier, Maria, Viki, Étienne, Laura, Bogdan Sassu, Călin, Răzvan, and Diana. Thank you for the conversations, laughs, hikes, parties, and adventures that we shared. Without you the PhD would have been a much lonelier experience. I am also thankful to all the interesting people I met during my internships: Irina, Dan, Aran, Sasha, Heidi, Mike, Zoe, Rashika, Lily, Petros, Dario, and Ho-Cheung. Thank you for the fun memories!

I am deeply grateful to my girlfriend, Monica, who has been by my side for most of the PhD. Thank you for supporting and encouraging me during my most difficult moments. You have been the best adventure partner I could have hoped for. I am looking forward to our next adventure!

I am profoundly grateful to my family: my parents, Gabriela and Nicolae, my grandmother, Irina (to whom this thesis is dedicated), my aunt, Mihaela, and my cousin, Silvana. Thank you for your constant, loving support, not only during the PhD, but throughout my entire education. You have always seen in me the best version of myself and given me courage to believe in that version too.

My work has been supported in part by the European Research Council (ERC) Grant 339539 (AOC).

Igor Zablotchi Lausanne, 2020

Preface

The research presented in this dissertation was conducted in the Distributed Computing Laboratory at EPFL, under the supervision of Professor Rachid Guerraoui, between 2015 and 2020. The main results of this dissertation appear originally in the following publications (author names are in alphabetical order):

1. Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra Marathe, Athanasios Xygkis, Igor Zablotchi. Microsecond Consensus for Microsecond Applications. Under submission.

2. Rachid Guerraoui, Alex Kogan, Virendra Marathe, Igor Zablotchi. Efficient Multi-word Compare and Swap. In International Symposium on Distributed Computing (DISC), 2020.

3. Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra Marathe, Igor Zablotchi. The Impact of RDMA on Agreement. In ACM Symposium on Principles of Distributed Computing (PODC), 2019.

4. Nachshon Cohen, Rachid Guerraoui, Igor Zablotchi. The Inherent Cost of Remembering Consistently. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2018.

5. Oana Balmau, Rachid Guerraoui, Maurice Herlihy, Igor Zablotchi. Fast and Robust Memory Reclamation for Concurrent Data Structures. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2016.

Besides the above-mentioned publications, I was also involved in other research projects that resulted in the following publications:

1. Tudor David, Aleksandar Dragojevic, Rachid Guerraoui, Igor Zablotchi. Log-Free Concurrent Data Structures. In USENIX Annual Technical Conference (ATC), 2018.

2. Oana Balmau, Rachid Guerraoui, Vasileios Trigonakis, Igor Zablotchi. FloDB: Unlocking Memory in Persistent Key-Value Stores. In European Conference on Computer Systems (EuroSys), 2017.

3. Peva Blanchard, Rachid Guerraoui, Julien Stainer, Igor Zablotchi. The Disclosure Power of Shared Objects. In International Conference on Networked Systems (NETYS), 2017.


Abstract

In this thesis, we revisit classic problems in shared-memory distributed computing through the lenses of (1) emerging hardware technologies and (2) changing requirements. Our contributions consist, on the one hand, in providing a better understanding of the fundamental benefits and limitations of new technologies, and on the other hand, in introducing novel, efficient tools and systems to ease the task of leveraging new technologies or meeting new requirements.

First, we look at Remote Direct Memory Access (RDMA), a networking hardware feature which enables a computer to access the memory of a remote computer without involving the remote CPU. In recent years, the distributed computing community has taken an interest in RDMA due to its ultra-low latency and high throughput and has designed systems that take advantage of these characteristics. However, we argue that the potential of RDMA for distributed computing remains largely untapped. We show that RDMA’s unique semantics enable agreement algorithms which improve on fundamental trade-offs in distributed computing between performance and failure-tolerance. Furthermore, we show the practical applicability of our theoretical results through Mu, a state machine replication system which can replicate requests in under 2 microseconds, and can fail-over in under 1 millisecond when failures occur. Mu’s replication and fail-over latencies are at least 61% and 90% lower, respectively, than those of prior work.

Second, we focus on persistent memory, a novel class of memory technologies which is only now starting to become available. Persistent memory provides byte-addressable persistent storage with access times comparable to traditional DRAM. Recent work has focused on designing tools for working with persistent memory, but little is known about the fundamental cost of providing consistency in persistent memory. Furthermore, important shared-memory primitives do not yet have efficient persistent implementations. We provide an answer to the former question through a tight bound on the number of persistent fences required to implement a lock-free persistent object. We address the latter problem by presenting a novel efficient multi-word compare-and-swap algorithm for persistent memory.

Third and finally, we consider the current exponential increase in the amount of data worldwide. Memory capacity has been on the rise for decades, but remains scarce when compared to the rate of data growth. Given this scarcity and the prevalence of concurrent in-memory processing, the classic problem of concurrent memory reclamation remains highly relevant to this day. Previous work in this area has produced solutions which are either (a) fast but easily disrupted by process delays, or (b) slow but robust to process delays. We combine the best of both worlds in QSense, a memory reclamation algorithm which is fast in the common case when there are no process delays and falls back to a robust reclamation algorithm when process delays prevent the fast path from making progress.

Keywords: distributed algorithms, consensus, crash-fault tolerance, Byzantine-fault tolerance, state machine replication, microsecond computing, Remote Direct Memory Access (RDMA), persistent memory, multi-word compare-and-swap, memory reclamation

Résumé

Dans cette thèse, nous revisitons les problèmes classiques du calcul distribué à mémoire partagée à travers le prisme (1) des technologies émergentes et (2) de l’évolution des exigences. Nos contributions consistent, d’une part, à mieux comprendre les avantages et les limites fondamentales des nouvelles technologies et, d’autre part, à introduire des outils nouveaux et efficaces pour faciliter l’exploitation des nouvelles technologies ou la satisfaction de nouvelles exigences.

Nous examinons tout d’abord l’accès direct à la mémoire à distance (RDMA), une fonctionnalité du matériel de réseau qui permet à un ordinateur d’accéder à la mémoire d’un ordinateur distant sans impliquer le CPU distant. Ces dernières années, la communauté du calcul distribué s’est intéressée à la RDMA en raison de sa latence faible et de son débit élevé, et a conçu des systèmes qui tirent parti de ces caractéristiques. Cependant, nous soutenons que le potentiel de la RDMA reste largement inexploité. Nous montrons que la sémantique unique de la RDMA permet des algorithmes de consensus qui améliorent les compromis fondamentaux dans le calcul distribué entre la performance et la tolérance aux pannes. En outre, nous montrons l’applicabilité pratique de nos résultats théoriques grâce à Mu, un système de réplication de machines à états qui répond aux requêtes en moins de 2 microsecondes, et bascule en moins d’une milliseconde en cas de défaillance.

Deuxièmement, nous nous concentrons sur la mémoire persistante, une nouvelle classe de technologies de mémoire qui offrent un stockage persistant adressable par octet avec des temps d’accès comparables à ceux de la mémoire volatile. Les travaux récents se sont concentrés sur la conception d’outils pour travailler avec la mémoire persistante, mais on sait encore peu de choses sur le coût de la cohérence en mémoire persistante. En outre, d’importantes primitives de mémoire partagée ne disposent pas encore de mises en œuvre efficaces pour la mémoire persistante. Nous apportons une réponse à la première question par le biais d’une limite stricte sur le nombre de barrières persistantes requises pour mettre en œuvre un objet persistant. Nous abordons le second problème en présentant un nouvel algorithme efficace pour le problème de la comparaison et remplacement de plusieurs mots pour la mémoire persistante.

Enfin, nous examinons l’augmentation exponentielle actuelle de la quantité de données dans le monde. La capacité de la mémoire augmente depuis des décennies, mais elle reste limitée par rapport au taux de croissance des données. Compte tenu de cette disparité et de la prévalence du traitement concurrent des données en mémoire, le problème classique de la récupération concurrente de la mémoire reste très pertinent à ce jour. Les travaux antérieurs dans ce domaine ont produit des solutions qui sont soit (a) rapides mais facilement perturbées par les retards des processus, soit (b) lentes mais résistantes aux retards. Nous combinons le meilleur des deux mondes dans QSense, un algorithme de récupération de la mémoire qui est rapide dans le cas commun où il n’y a pas de retard de processus, et qui bascule à un algorithme de récupération robuste lorsque les retards de processus empêchent la voie rapide de progresser.

Mots clés : consensus, tolérance aux pannes, tolérance aux pannes byzantines, réplication de machines à états, accès direct à la mémoire à distance (RDMA), mémoire persistante, comparaison et échange de plusieurs mots, récupération de mémoire

Contents

Acknowledgements

Preface

Abstract (English/Français)

List of Figures

List of Tables

List of Algorithms and Listings

1 Introduction
   1.1 Contributions
      1.1.1 Shared remote memory
      1.1.2 Shared persistent memory
      1.1.3 Memory reclamation
   1.2 Thesis Roadmap

I Sharing Remote Memory

2 The Impact of RDMA on Agreement
   2.1 Introduction
   2.2 Related Work
   2.3 Model and Preliminaries
   2.4 Byzantine Failures
      2.4.1 The Robust Backup Sub-Algorithm
      2.4.2 The Cheap Quorum Sub-Algorithm
      2.4.3 Putting it Together: the Fast & Robust Algorithm
   2.5 Crash Failures
   2.6 Dynamic Permissions are Necessary for Efficient Consensus
   2.7 RDMA in Practice

3 Microsecond Consensus for Microsecond Applications
   3.1 Introduction
   3.2 Background
      3.2.1 Microsecond Applications and Computing
      3.2.2 State Machine Replication
      3.2.3 RDMA
   3.3 Overview of Mu
      3.3.1 Architecture
      3.3.2 RDMA Communication
   3.4 Replication Plane
      3.4.1 Basic Algorithm
      3.4.2 Extensions
   3.5 Background Plane
      3.5.1 Leader Election
      3.5.2 Permission Management
      3.5.3 Log Recycling
      3.5.4 Adding and Removing Replicas
   3.6 Implementation
   3.7 Evaluation
      3.7.1 Common-Case Latency
      3.7.2 Fail-Over Time
      3.7.3 Throughput
   3.8 Related Work
   3.9 Conclusion

II Sharing Persistent Memory

4 The Inherent Cost of Remembering Consistently
   4.1 Introduction
   4.2 Background
      4.2.1 Persistent Memory
      4.2.2 Processes and Operations
   4.3 ONLL: a Primer
      4.3.1 Rationale
      4.3.2 ONLL Design
      4.3.3 Illustration: Shared Counter
   4.4 ONLL: a Universal Construction
      4.4.1 Data Structures
      4.4.2 ONLL Algorithm
   4.5 ONLL: Correctness
      4.5.1 Lock-freedom
      4.5.2 Durable linearizability
   4.6 Lower Bound
   4.7 Related Work
   4.8 Concluding Remarks

5 Efficient Multi-Word Compare-and-Swap
   5.1 Introduction
   5.2 System Model
      5.2.1 Volatile Memory
      5.2.2 Persistent Memory
   5.3 Impossibility
   5.4 Volatile MCAS with k+1 CAS
      5.4.1 High-level Description
      5.4.2 Technical Details
      5.4.3 Correctness
   5.5 Persistent MCAS with k+1 CAS and 2 Persistent Fences
   5.6 Memory Management
      5.6.1 Managing Persistent Memory
      5.6.2 Efficient Reads
   5.7 Evaluation
      5.7.1 Experimental Setup
      5.7.2 Array Benchmark
      5.7.3 Doubly-linked List Benchmark
      5.7.4 B+-tree Benchmark
   5.8 Related Work
   5.9 Conclusion

III Memory Reclamation

6 Fast and Robust Memory Reclamation for Concurrent Data Structures
   6.1 Introduction
      6.1.1 The Problem
      6.1.2 The Trade-off
      6.1.3 The Contributions
   6.2 Model and Problem Definition
      6.2.1 Node States
      6.2.2 The Memory Reclamation Problem
      6.2.3 Terminology
   6.3 Background
      6.3.1 Quiescent State Based Reclamation
      6.3.2 Hazard Pointers (HP)
   6.4 An Overview of QSense
      6.4.1 Rationale
      6.4.2 QSense in a Nutshell
   6.5 Cadence
      6.5.1 The Fallback Path
      6.5.2 Merging the Fast and Fallback Paths
   6.6 Correctness & Complexity
      6.6.1 Cadence
      6.6.2 QSense
   6.7 Experimental Evaluation
      6.7.1 Experimental Setting
      6.7.2 Methodology
      6.7.3 Results
   6.8 Related Work
   6.9 Conclusion

IV Concluding Remarks

7 Conclusions and Future Work
   7.1 Summary and Implications
   7.2 Future Directions

V Appendices

A The Impact of RDMA on Agreement
   A.1 Correctness of Reliable Broadcast
   A.2 Correctness of Cheap Quorum
   A.3 Correctness of the Fast & Robust
   A.4 Correctness of Protected Memory Paxos

B Microsecond Consensus for Microsecond Applications
   B.1 Pseudocode of the Basic Version
   B.2 Definitions
   B.3 Invariants
      B.3.1 Validity
      B.3.2 Agreement
      B.3.3 Termination
   B.4 Optimizations & Additions
      B.4.1 New Leader Catch-Up
      B.4.2 Update Followers
      B.4.3 Followers Update Their Own FUO
      B.4.4 Grow Confirmed Followers
      B.4.5 Omit Prepare Phase

C The Inherent Cost of Remembering Consistently

D Efficient Multi-Word Compare-and-Swap
   D.1 Replacing RDCSS with CAS in Harris et al. algorithm leads to ABA
   D.2 Performance in read-only and update-heavy workloads

E Fast and Robust Memory Reclamation for Concurrent Data Structures
   E.1 QSBR Correctness
   E.2 QSense on a Linked List

Bibliography

Curriculum Vitae

List of Figures

2.1 Our message-and-memory model with permissions
2.2 Interactions of the components of the Fast & Robust Algorithm

3.1 Architecture of Mu
3.2 Performance comparison of different RDMA permission switching mechanisms
3.3 Replication latency of Mu integrated into different applications, with varying payload sizes
3.4 Replication latency of Mu compared with other replication solutions
3.5 End-to-end latencies of applications, with and without replication
3.6 Fail-over time distribution in Mu
3.7 Latency vs throughput in Mu

4.1 Executions of a counter implemented using ONLL
4.2 The execution trace and the fuzzy window in ONLL

5.1 Array benchmark for MCAS
5.2 Doubly-linked list benchmark for MCAS (80% and 98% reads)
5.3 B+-tree benchmark for MCAS (80% and 98% reads)

6.1 The concurrent memory reclamation problem
6.2 A high-level view of QSense
6.3 QSense, HP and no reclamation on a linked list of 2000 elements, with a 10% updates workload
6.4 Rooster processes and deferred reclamation
6.5 Time diagram for the proof of QSense liveness
6.6 Scalability of memory reclamation on a linked list (2000 elements), a skip list (20000 elements) and a BST (2000000 elements) with 50% updates
6.7 Path switching with process delays (8 processes, 50% updates)

C.1 Illustrating a read from a node that is never the latest in the non-fuzzy part

D.1 Doubly-linked list benchmark for MCAS (50% and 100% reads)
D.2 B+-tree benchmark for MCAS (50% and 100% reads)

List of Tables

2.1 Known fault tolerance results for Byzantine agreement

5.1 Comparison of non-blocking MCAS implementations

List of Algorithms and Listings

2.1 Reliable Broadcast
2.2 Validate Operation for Reliable Broadcast
2.3 Cheap Quorum normal operation—code for process p
2.4 Cheap Quorum panic mode—code for process p
2.5 Preferential Paxos—code for process p
2.6 Protected Memory Paxos—code for process p

3.1 Log Structure
3.2 Basic Replication Algorithm of Mu

4.1 The recordEntry structure used by ONLL
4.2 The execution trace used by ONLL
4.3 The update operation in ONLL
4.4 The read operation in ONLL
4.5 Recovery in ONLL

5.1 Data structures used by our MCAS algorithm for volatile memory
5.2 The readInternal auxiliary function, used by our MCAS algorithm for volatile memory
5.3 Our MCAS algorithm for volatile memory
5.4 The readInternal auxiliary function, used by our MCAS algorithm for persistent memory
5.5 Our MCAS algorithm for persistent memory

6.1 High-level steps taken when accessing a node in a data structure using hazard pointers
6.2 Example of illegal operation interleaving
6.3 Main functions of Cadence
6.4 High-level steps taken when accessing a node in a data structure using Cadence
6.5 Main QSense functions (I)
6.6 Main QSense functions (II)

B.1 Optimization: Leader Catch Up
B.2 Optimization: Update Followers
B.3 Optimization: Followers Update Their Own FUO

E.1 QSense on a concurrent linked-list (I)
E.2 QSense on a concurrent linked-list (II)
E.3 QSense on a concurrent linked-list (III)

1 Introduction

Distributed computing is concerned with devising means for multiple processes to collaborate on some common task, typically with the aim of improving performance or failure-tolerance, when compared to solutions which only employ a single process. Traditionally, in distributed computing, a system is modeled as either message-passing (if processes are distinct computers which communicate by exchanging messages over the network) or shared-memory (if processes are distinct threads of execution inside a computer, which communicate by performing operations on a common memory space).

Historically, distributed computing has been an extensively studied area, with research going back at least five decades [79]. Despite this wealth of research (or perhaps because of it), distributed computing remains a continually evolving field, both due to changing requirements and due to technological advancements which are changing the fundamental tools we have at our disposal in order to achieve those requirements. Here we identify three key areas—all revolving around the theme of modern shared memory—which have either not been well studied so far, or for which existing solutions are inefficient.

Remote Direct Memory Access (RDMA). RDMA allows a computer to access the memory of a remote computer, without CPU involvement at the remote side. Furthermore, RDMA provides low-latency communication by bypassing the OS kernel and by implementing several layers of the network stack in hardware. RDMA enables us to think of physically distinct machines as simultaneously passing messages and sharing memory—a mix of the message passing and shared memory models. In recent years, research has focused on exploiting RDMA’s better raw performance (low latency, high throughput) to build faster distributed systems: key-value stores [84, 132], remote procedure call frameworks [131], and state machine replication (SMR) systems [199, 225], among others. Despite this excellent previous work, we argue that a direction which has not been well studied so far is the potential for RDMA’s unique semantics to unlock fundamentally lower-complexity (and thus faster) distributed algorithms.
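To make the one-sided access concrete, the sketch below posts a single RDMA write through the libibverbs API. It assumes an already-connected queue pair and an already-registered local buffer, with the remote address and rkey obtained out of band; the function name and parameters are illustrative and error handling is minimal.

```cpp
// Sketch: issuing a one-sided RDMA write with libibverbs. The remote CPU is
// not involved; the NIC checks the rkey (permission token) and writes memory
// directly. Connection setup and memory registration are assumed elsewhere.
#include <infiniband/verbs.h>
#include <cstdint>

bool post_one_sided_write(ibv_qp* qp,
                          void* local_addr, uint32_t length, uint32_t lkey,
                          uint64_t remote_addr, uint32_t rkey) {
  ibv_sge sge{};
  sge.addr   = reinterpret_cast<uint64_t>(local_addr);
  sge.length = length;
  sge.lkey   = lkey;

  ibv_send_wr wr{};
  wr.opcode              = IBV_WR_RDMA_WRITE;  // one-sided remote write
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.send_flags          = IBV_SEND_SIGNALED;  // request a completion entry
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey        = rkey;

  ibv_send_wr* bad_wr = nullptr;
  return ibv_post_send(qp, &wr, &bad_wr) == 0;
  // The completion is later reaped from the send completion queue (ibv_poll_cq).
}
```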


Persistent memory. Persistent memory (also called non-volatile RAM, NVRAM or NVM) is fast, byte-addressable memory that preserves its contents even in the absence of power. Its access times are comparable to those of traditional (volatile) DRAM, thus making it a suitable candidate for main memory. This raises the prospect of a new class of objects: in-memory persistent objects. Of particular interest are concurrent implementations of such objects, in which several processes concurrently perform operations in order to increase performance. Such implementations are desirable because they enable much faster operation and recovery, when persistence is required, as compared to implementations that make use of traditional stable storage. However, concurrent implementations of persistent objects face the challenge of correct recovery after a crash, without negating the performance benefits on the common path. Recent work has focused on defining suitable safety criteria [29, 100, 127], developing fast and scalable data structures [57, 91, 153, 189, 229], and building transactional frameworks that provide generic all-or-nothing semantics for interacting with persistent memory [33, 55, 63, 75, 96, 126, 138, 141, 163, 175, 224]. Yet, the fundamental (lower-bound) cost of implementing concurrent persistent objects has not been well studied to date; moreover, many foundational shared-memory primitives do not yet have efficient implementations for persistent memory.

Memory reclamation. The global amount of digital data created, captured or replicated is growing exponentially: it was estimated at 33 Zettabytes in 2018 and it is expected to grow to 175 Zettabytes by 2025 [207]. The global amount of main memory remains comparatively scarce. Given this scarcity and the increasing trend towards in-memory processing on multicore machines, memory reclamation remains an important problem in modern shared-memory distributed computing. Despite decades of research, existing memory reclamation schemes are either robust to process delays but slow (e.g., hazard pointers), fast but not robust (e.g., epoch-based reclamation) or both fast and robust, but only narrowly applicable.

1.1 Contributions

In this thesis, we explore the opportunities and challenges of distributed computing with modern shared memory, as raised by the three aforementioned areas. We describe our contributions in more detail below (contributions are highlighted using ★).

1.1.1 Shared remote memory

Here we focus on solving the problem of consensus, in which processes need to agree on a common value, in an RDMA-enabled system. Consensus is of central importance in distributed computing, because it enables applications to be replicated across multiple hosts (called replicas) and thus provide high availability, i.e., continue to be responsive even if a subset of the replicas fail. We consider two types of failures: (1) crash failures (in which faulty replicas permanently cease to take steps) and (2) Byzantine failures (in which faulty replicas may deviate arbitrarily—even maliciously—from the protocol). We measure failure-tolerance by the lowest total number of replicas required to ensure correctness if up to f replicas can be faulty. We measure performance as the number of network delays required to solve consensus in the best case; a message is counted as one network delay, and a one-sided RDMA operation (i.e., read or write) is counted as two network delays (one for the invocation and one for the response).

★ We show that RDMA improves on the fundamental trade-off between failure-tolerance and performance in consensus algorithms. Specifically, we introduce a theoretical model that captures the key properties of RDMA and show that in our model:

(a) Byzantine agreement can be solved with 2f + 1 replicas in 2 network delays under standard eventual synchrony assumptions; it is known from previous work that this is impossible in a purely message-passing model.

(b) Crash-fault tolerant consensus can be solved with f + 1 replicas in 2 network delays under standard eventual synchrony assumptions; previous results in the message-passing and shared-memory models require either 2f + 1 replicas or 4 delays, respectively.

★ We introduce Mu, the first state machine replication system that requires a single one-sided RDMA operation (a remote write) to replicate a request in the common case. Mu takes less than 1.3 microseconds to replicate a (small) request, and less than a millisecond to fail-over in case of failure—cutting replication and failure recovery latencies of prior systems by at least 61% and 90%, respectively.

1.1.2 Shared persistent memory

Here we focus on lock-free implementations of persistent objects. Lock-free implementations are desirable because they guarantee that the system overall makes progress despite arbitrary process delays (as long as some process takes steps). An important metric when considering concurrent implementations of persistent objects is the number of persistence fences used. These are expensive primitives which allow programmers to control the order in which read and write operations reach persistent memory.
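To make this cost concrete, the sketch below shows the usual flush-then-fence pattern on x86 (CLWB followed by SFENCE, via compiler intrinsics). It only illustrates what a persistence fence is, not the construction analyzed in Chapter 4; the record layout and function names are ours.

```cpp
// Flush + fence pattern for persistent memory on x86 (illustrative only).
// Without the persistence fence, the two cache-line write-backs may become
// durable in either order, so a crash could leave `valid` set while
// `payload` is stale.
#include <immintrin.h>   // _mm_clwb, _mm_sfence (compile with -mclwb)
#include <cstdint>

struct PersistentRecord {
  std::uint64_t payload;
  std::uint64_t valid;   // set only after payload is known to be durable
};

inline void flush_line(void* addr) { _mm_clwb(addr); }    // write cache line back
inline void persistence_fence()    { _mm_sfence(); }      // order the write-backs

void durable_publish(PersistentRecord* rec, std::uint64_t value) {
  rec->payload = value;
  flush_line(&rec->payload);
  persistence_fence();      // payload reaches persistent memory before valid
  rec->valid = 1;
  flush_line(&rec->valid);
  persistence_fence();
}
```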

★ We provide a tight bound on the number of persistence fences required for any lock-free deterministic implementation of a persistent object.

We also consider the problem of efficiently implementing atomic primitives for persistent memory. Compare-and-swap (CAS) is perhaps the best known such primitive; it is used pervasively in lock-free algorithms. CAS conditionally updates a memory location, provided that its contents match some expected value. We focus here on the multi-word compare-and-swap (MCAS) primitive, a natural generalization of CAS that acts on multiple locations atomically. MCAS can significantly simplify algorithms which need to operate atomically on multiple variables at the same time (e.g., doubly-linked lists [198], B+-trees [17, 46]). As before, we focus on lock-free implementations and we additionally consider the desirable disjoint-access-parallelism property, which ensures that MCAS calls that operate on disjoint sets of memory locations do not conflict with each other.
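For concreteness, the following sketch states the sequential specification that an MCAS implementation must appear to satisfy atomically; the descriptor layout and function name are illustrative, not the interface used in Chapter 5.

```cpp
// Sequential specification of a k-word compare-and-swap (MCAS): the whole
// check-then-update below must appear to execute atomically. Chapter 5 shows
// how to achieve this lock-free with k+1 single-word CASes.
#include <cstddef>
#include <cstdint>

struct McasWord {
  std::uint64_t* address;   // word to update
  std::uint64_t  expected;  // value it must currently hold
  std::uint64_t  desired;   // value to install
};

// Returns true iff every word held its expected value, in which case all words
// are updated; otherwise no word is modified.
bool mcas_spec(McasWord* words, std::size_t k) {
  for (std::size_t i = 0; i < k; ++i)
    if (*words[i].address != words[i].expected) return false;
  for (std::size_t i = 0; i < k; ++i)
    *words[i].address = words[i].desired;
  return true;
}
```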

★ We introduce a novel multi-word compare-and-swap (MCAS) algorithm for persistent and volatile memory. Our algorithm is more efficient than state-of-the-art MCAS algorithms: it is the first to use only k+1 CASes for a k-word MCAS; existing solutions require at least 2k+1 CASes. The persistent version of our algorithm requires only 2 persistence fences per call to MCAS, as opposed to 2k+1 for the best previous result. We additionally prove that both versions are almost optimal in terms of the number of CASes used, by providing a lower bound on the number of CAS instructions required for any lock-free disjoint-access-parallel implementation of MCAS.

1.1.3 Memory reclamation

We focus on memory reclamation for lock-free data structures, in programming languages without automatic garbage collection (such as C or C++). In this context, the problem of memory reclamation can be stated as follows: given one or more data structure nodes that have been removed from the data structure, determine when it is safe to free or reuse those nodes’ memory.

★ We introduce QSense, a novel memory reclamation scheme with a hybrid design, consisting of a fast path and a slow path. In the common case when there are no process delays, QSense uses a fast, blocking, epoch-based reclamation scheme. When process delays prevent the fast path from making progress, QSense switches to a robust fallback path—a novel variant of hazard pointers which virtually eliminates the need for expensive memory fences.
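For background, the sketch below shows the classic hazard-pointer protocol that the fallback path is a variant of: readers publish the pointer they are about to dereference and re-validate it, and reclaimers free a retired node only once no published slot points to it. This is a textbook single-slot rendering with illustrative names, not QSense itself; the sequentially consistent atomics stand in for the memory fences mentioned above.

```cpp
// Textbook hazard-pointer sketch (illustrative; not QSense). One hazard slot
// per thread; defaults to sequentially consistent atomics.
#include <atomic>
#include <vector>

struct Node { std::atomic<Node*> next{nullptr}; /* ... payload ... */ };

constexpr int kMaxThreads = 64;
std::atomic<Node*> hazard_slot[kMaxThreads];      // one published pointer per thread

// Reader: publish the pointer before dereferencing it, then re-validate.
Node* protect(std::atomic<Node*>& src, int tid) {
  Node* p;
  do {
    p = src.load();
    hazard_slot[tid].store(p);                    // announce intent to access p
  } while (p != src.load());                      // retry if p changed meanwhile
  return p;                                       // safe until the slot is cleared
}

// Reclaimer: a removed node may be freed only once no thread has it published.
thread_local std::vector<Node*> retired;

void retire(Node* n) {
  retired.push_back(n);
  if (retired.size() < 128) return;               // amortize the scan over hazards
  std::vector<Node*> keep;
  for (Node* r : retired) {
    bool published = false;
    for (int t = 0; t < kMaxThreads; ++t)
      if (hazard_slot[t].load() == r) { published = true; break; }
    if (published) keep.push_back(r);             // some reader may still use it
    else delete r;                                // no reader can reach r anymore
  }
  retired.swap(keep);
}
```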

1.2 Thesis Roadmap

The rest of the thesis is organized in five parts, consisting of seven chapters and five appendices, as follows.

Part I

• In Chapter 2 we show that for crash- and Byzantine-fault tolerant consensus, RDMA enables a better trade-off between performance and failure-tolerance than either the message-passing or the shared-memory models.

• Chapter 3 introduces Mu, the first RDMA-based state machine replication system suitable for applications with microsecond latencies.

Part II

• Chapter 4 gives the first tight bound on the number of persistence fences required to implement a shared persistent object in the lock-free, deterministic case.

• Chapter 5 introduces a novel and efficient multi-word compare-and-swap algorithm for volatile and persistent memory.

Part III

• Chapter 6 introduces QSense, a fast and robust concurrent memory reclamation scheme.

Part IV

• Chapter 7 concludes the thesis with a summary of contributions and their implications, as well as an outline of future research directions.

Part V

• Appendices A through E contain supplementary material (proofs, extensions, examples, graphs etc.) related to Chapters 2–6, respectively.


Part I: Sharing Remote Memory


2 The Impact of RDMA on Agreement

Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this chapter, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or another. With Byzantine failures, we give an algorithm that only requires n ≥ 2f_P + 1 processes (where f_P is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires n ≥ f_P + 1 processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions.

2.1 Introduction

In recent years, a technology known as Remote Direct Memory Access (RDMA) has made its way into data centers, earning a spotlight in distributed systems research. RDMA provides the traditional send/receive communication primitives, but also allows a process to directly read/write remote memory. Research work shows that RDMA leads to some new and exciting distributed algorithms [5, 84, 129, 132, 199, 225].

RDMA provides a different interface from previous communication mechanisms, as it combines message-passing with shared-memory [5]. Furthermore, to safeguard the remote memory, RDMA provides protection mechanisms to grant and revoke access for reading and writing data. This mechanism is fine grained: an application can choose subsets of remote memory called regions to protect; it can choose whether a region can be read, written, or both; and it can choose individual processes to be given access, where different processes can have different accesses. Furthermore, protections are dynamic: they can be changed by the application over time. In this chapter, we lay the groundwork for a theoretical understanding of these RDMA capabilities, and we show that they lead to distributed algorithms that are inherently more powerful than before.

While RDMA brings additional power, it also introduces some challenges. With RDMA, the remote memories are subject to failures that cause them to become unresponsive. This behavior differs from traditional shared memory, which is often assumed to be reliable¹. In this chapter, we show that the additional power of RDMA more than compensates for these challenges.

¹ There are a few studies of failure-prone memory, as we discuss in related work.

Our main contribution is to show that RDMA improves on the fundamental trade-off in distributed systems between failure resilience and performance—specifically, we show how a consensus protocol can use RDMA to achieve both high resilience and high performance, while traditional algorithms had to choose one or another. We illustrate this on the fundamental problem of achieving consensus and capture the above RDMA capabilities as an M&M model [5], in which processes can use both message-passing and shared-memory. We consider asynchronous systems and require safety in all executions and liveness under standard additional assumptions (e.g., partial synchrony). We measure resiliency by the number of failures an algorithm tolerates, and performance by the number of (network) delays in common-case executions. Failure resilience and performance depend on whether processes fail by crashing or by being Byzantine, so we consider both.

With Byzantine failures, we consider the consensus problem called weak Byzantine agreement, defined by Lamport [145]. We give an algorithm that (a) requires only n ≥ 2f_P + 1 processes (where f_P is the maximum number of faulty processes) and (b) decides in two delays in the common case. With crash failures, we give the first algorithm for consensus that requires only n ≥ f_P + 1 processes and decides in two delays in the common case. With both Byzantine or crash failures, our algorithms can also tolerate crashes of memory—only m ≥ 2f_M + 1 memories are required, where f_M is the maximum number of faulty memories.

Our algorithms appear to violate known impossibility results: it is known that with message-passing, Byzantine agreement requires n ≥ 3f_P + 1 even if the system is synchronous [194], while consensus with crash failures requires n ≥ 2f_P + 1 if the system is partially synchronous [87]. There is no contradiction: our algorithms rely on the power of RDMA, not available in other systems.

RDMA’s power comes from two features: (1) simultaneous access to message-passing and shared-memory, and (2) dynamic permissions. Intuitively, shared-memory helps resilience, message-passing helps performance, and dynamic permissions help both.

To see how shared-memory helps resilience, consider the Disk Paxos algorithm [92], which uses shared-memory (disks) but no messages. Disk Paxos requires only n ≥ f_P + 1 processes, matching the resilience of our algorithm. However, Disk Paxos is not as fast: it takes at least four delays. In fact, we show that no shared-memory consensus algorithm can decide in two delays (Section 2.6).

To see how message-passing helps performance, consider the Fast Paxos algorithm [149], which uses message-passing and no shared-memory. Fast Paxos decides in only two delays in common executions, but it requires n ≥ 2f_P + 1 processes.

Of course, the challenge is achieving both high resilience and good performance in a single algorithm. This is where RDMA’s dynamic permissions shine. Clearly, dynamic permissions improve resilience against Byzantine failures, by preventing a Byzantine process from overwriting memory and making it useless. More surprising, perhaps, is that dynamic permissions help performance, by providing an uncontended instantaneous guarantee: if each process revokes the write permission of other processes before writing to a register, then a process that writes successfully knows that it executed uncontended, without having to take additional steps (e.g., to read the register). We use this technique in our algorithms for both Byzantine and crash failures.
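The following sketch phrases this pattern over a hypothetical rendering of the permission interface defined in Section 2.3; it is illustrative only and not one of the chapter's algorithms, and all names are ours.

```cpp
// Sketch of the "revoke before write" pattern. If the write returns ack, the
// writer knows that no other process could have written the region since the
// permission change took effect, so the write executed uncontended and no
// extra read-back is needed. Hypothetical interface, illustrative only.
#include <cstdint>
#include <set>

enum class Status { Ack, Nak };

struct Region {
  std::set<int> writers;                 // processes with write or read-write permission
  std::uint64_t reg = 0;

  // changePermission: here legalChange always allows shrinking the writer set to {p}.
  void revokeAllWritersExcept(int p) { writers = {p}; }

  Status write(int p, std::uint64_t v) {
    if (!writers.count(p)) return Status::Nak;
    reg = v;
    return Status::Ack;
  }
};

Status uncontended_write(Region& mr, int self, std::uint64_t v) {
  mr.revokeAllWritersExcept(self);       // dynamic permission change
  return mr.write(self, v);              // ack implies the write ran uncontended
}
```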

In summary, our contributions are as follows:

• We consider distributed systems with RDMA, and we propose a model that captures some of its key properties while accounting for failures of processes and memories, with support of dynamic permissions.

• We show that the shared-memory part of our RDMA improves resilience: our Byzantine agreement algorithm requires only n ≥ 2f_P + 1 processes.

• We show that the shared-memory by itself does not permit consensus algorithms that decide in two steps in common executions.

• We show that with dynamic permissions, we can improve the performance of our Byzantine Agreement algorithm, to decide in two steps in common executions.

• We give similar results for the case of crash failures: decision in two steps while requiring only n ≥ f_P + 1 processes.

• Our algorithms can tolerate the failure of memories, up to a minority of them.

The rest of the chapter is organized as follows. Section 2.2 gives an overview of related work. In Section 2.3 we formally define the RDMA-compliant M&M model that we use in the rest of the chapter, and specify the agreement problems that we solve. Section 2.4 presents our fast and resilient Byzantine agreement algorithm. In Section 2.5 we consider the special case of crash-only failures, and show an improvement of the algorithm and tolerance bounds for this setting. In Section 2.6 we briefly outline a lower bound that shows that the dynamic permissions of RDMA are necessary for achieving our results. Finally, in Section 2.7 we discuss the semantics of RDMA in practice, and how our model reflects these features. To ease readability, most proofs have been deferred to Appendix A.


2.2 Related Work

RDMA. Many high-performance systems were recently proposed using RDMA, such as distributed key-value stores [84, 132], communication primitives [84, 134], and shared address spaces across clusters [84]. Kalia et al. [133] provide guidelines for designing systems using RDMA. RDMA has also been applied to solve consensus [129, 199, 225]. Our model shares similarities with DARE [199] and APUS [225], which modify queue-pair state at run time to prevent or allow access to memory regions, similar to our dynamic permissions. These systems perform better than TCP/IP-based solutions, by exploiting better raw performance of RDMA, without changing the fundamental communication complexity or failure-resilience of the consensus protocol. Similarly, Rüsch et al. [209] use RDMA as a replacement for TCP/IP in existing BFT protocols.

M&M. Message-and-memory (M&M) refers to a broad class of models that combine message- passing with shared-memory, introduced by Aguilera et al. in [5]. In that work, Aguilera et al. consider M&M models without memory permissions and failures, and show that such models lead to algorithms that are more robust to failures and asynchrony. In particular, they give a consensus algorithm that tolerates more crash failures than message-passing systems, but is more scalable than shared-memory systems, as well as a leader election algorithm that reduces the synchrony requirements. In this chapter, our goal is to understand how memory permissions and failures in RDMA impact agreement.

Byzantine Fault Tolerance. Lamport, Shostak and Pease [150, 194] show that Byzantine agreement can be solved in synchronous systems iff n ≥ 3f_P + 1. With unforgeable signatures, Byzantine agreement can be solved iff n ≥ 2f_P + 1. In asynchronous systems subject to failures, consensus cannot be solved [89]. However, this result is circumvented by making additional assumptions for liveness, such as randomization [28] or partial synchrony [54, 87]. Many Byzantine agreement algorithms focus on safety and implicitly use the additional assumptions for liveness. Even with signatures, asynchronous Byzantine agreement can be solved only if n ≥ 3f_P + 1 [39].

Our Byzantine agreement results share similarities with results for shared memory. Malkhi et al. [167] and Alon et al. [10] show bounds on the resilience of strong and weak consensus in a model with reliable memory but Byzantine processes. They also provide consensus protocols, using read-write registers enhanced with sticky bits (write-once memory) and access control lists not unlike our permissions. Bessani et al. [31] propose an alternative to sticky bits and


Work                   Synchrony   Signatures   Non-Equiv   Resiliency   Strong Validity
[150]                  yes         yes          no          2f+1         yes
[150]                  yes         no           no          3f+1         yes
[10, 167]              no          no           yes         3f+1         yes
[60]                   no          yes          no          3f+1         yes
[60]                   no          no           yes         3f+1         yes
[60]                   no          yes          yes         2f+1         yes
This chapter (RDMA)    no          yes          yes         2f+1         no

Table 2.1 – Known fault tolerance results for Byzantine agreement.

access control lists through Policy-Enforced Augmented Tuple Spaces. All these works handle Byzantine failures with powerful objects rather than registers. Bouzid et al. [37] show that 3f_P + 1 processes are necessary for strong Byzantine agreement with read-write registers.

Some prior work solves Byzantine agreement with 2f_P + 1 processes using specialized trusted components that Byzantine processes cannot control [58, 59, 68, 69, 135, 223]. Some schemes decide in two delays but require a large trusted component: a coordinator [58], reliable broadcast [69], or message ordering [135]. For us, permission checking in RDMA is a trusted component of sorts, but it is small and readily available.

At a high-level, our improved tolerance is achieved by preventing equivocation by Byzantine processes, thereby effectively translating each Byzantine failure into a crash failure. Such translations from one type of failure into a less serious one have appeared extensively in the literature [27, 39, 60, 185]. Early work [27, 185] shows how to translate a crash tolerant algorithm into a Byzantine tolerant algorithm in the synchronous setting. Bracha [38] presents a similar translation for the asynchronous setting, in which n ≥ 3f_P + 1 processes are required to tolerate f_P Byzantine failures. Bracha’s translation relies on the definition and implementation of a reliable broadcast primitive, very similar to the one in this chapter. However, we show that using the capabilities of RDMA, we can implement it with higher fault tolerance.

Faulty memory. Afek et al. [3] and Jayanti et al. [128] study the problem of masking the benign failures of shared memory or objects. We use their ideas of replicating data across memories. Abraham et al. [1] consider honest processes but malicious memory.

Common-case executions. Many systems and algorithms tolerate adversarial scheduling but optimize for common-case executions without failures, asynchrony, contention, etc. (e.g., [35, 80, 86, 137, 143, 149, 168]). None of these match both the resilience and performance of our algorithms. Some algorithms decide in one delay but require n ≥ 5f_P + 1 for Byzantine failures [217] or n ≥ 3f_P + 1 for crash failures [42, 80].

Figure 2.1 – Our model with processes and memories, which may both fail. Processes can send messages to each other or access registers in the memories. Registers in a memory are grouped into memory regions that may overlap. Each region has a permission indicating what processes can read, write, and read-write the registers in the region (shown for two regions).

2.3 Model and Preliminaries

We consider a message-and-memory (M&M) model (Figure 2.1), which allows processes to use both message-passing and shared-memory [5]. The system has n processes P = {p_1, ..., p_n} and m (shared) memories M = {µ_1, ..., µ_m}. Processes communicate by accessing memories or sending messages. Throughout the chapter, memory refers to the shared memories, not the local state of processes.

The system is asynchronous in that it can experience arbitrary delays. We expect algorithms to satisfy the safety properties of the problems we consider, under this asynchronous system. For liveness, we require additional standard assumptions, such as partial synchrony, randomization, or failure detection.

Memory permissions. Each memory consists of a set of registers. To control access, an algorithm groups those registers into a set of (possibly overlapping) memory regions, and then defines permissions for those memory regions. Formally, a memory region mr of a memory µ is a subset of the registers of µ. We often refer to mr without specifying the memory µ explicitly. Each memory region mr has a permission, which consists of three disjoint sets of processes R_mr, W_mr, RW_mr indicating whether each process can read, write, or read-write the registers in the region. We say that p has read permission on mr if p ∈ R_mr or p ∈ RW_mr; we say that p has write permission on mr if p ∈ W_mr or p ∈ RW_mr. In the special case when R_mr = P \ {p}, W_mr = ∅, RW_mr = {p}, we say that mr is a Single-Writer Multi-Reader (SWMR) region—registers in mr correspond to the traditional notion of SWMR registers. Note that a register may belong to several regions, and a process may have access to the register on one region but not another—this models the existing RDMA behavior. Intuitively, when reading or writing data, a process specifies the region and the register, and the system uses the region to determine if access is allowed (we make this precise below).

Permission change. An algorithm indicates an initial permission for each memory region mr. Subsequently, the algorithm may wish to change the permission of mr during execution. For that, processes can invoke an operation changePermission(mr, new_perm), where new_perm is a triple (R, W, RW). This operation returns no results and it is intended to modify R_mr, W_mr, RW_mr to R, W, RW. To tolerate Byzantine processes, an algorithm can restrict processes from changing permissions. For that, the algorithm specifies a function legalChange(p, mr, old_perm, new_perm) which returns a boolean indicating whether process p can change the permission of mr to new_perm when the current permissions are old_perm. More precisely, when changePermission is invoked, the system evaluates legalChange to determine whether changePermission takes effect or becomes a no-op. When legalChange always returns false, we say that the permissions are static; otherwise, the permissions are dynamic.

Accessing memories. Processes access the memories via write(mr, r, v) and read(mr, r) operations, for memory region mr, register r, and value v. A write(mr, r, v) by process p changes register r to v and returns ack if r ∈ mr and p has write permission on mr; otherwise, the operation returns nak. A read(mr, r) by process p returns the last value successfully written to r if r ∈ mr and p has read permission on mr; otherwise, the operation returns nak. In our algorithms, a register belongs to exactly one region, so we omit the mr parameter from write and read operations.
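A minimal in-memory rendering of these definitions is sketched below, purely as an illustration of the semantics; in practice the permission check is enforced by the RDMA NIC, as discussed in Section 2.7. All type and function names are ours.

```cpp
// Minimal in-memory rendering of the region/permission semantics defined
// above (illustrative only).
#include <cstdint>
#include <optional>
#include <set>
#include <unordered_map>

using ProcId = int;
using Value  = std::uint64_t;

struct Permission { std::set<ProcId> R, W, RW; };  // three disjoint sets of processes

struct MemoryRegion {
  Permission perm;
  std::unordered_map<int, Value> registers;        // registers belonging to this region

  bool canRead(ProcId p)  const { return perm.R.count(p) || perm.RW.count(p); }
  bool canWrite(ProcId p) const { return perm.W.count(p) || perm.RW.count(p); }

  // write(mr, r, v): ack (true) iff r is in the region and p has write permission.
  bool write(ProcId p, int r, Value v) {
    if (!registers.count(r) || !canWrite(p)) return false;        // nak
    registers[r] = v;
    return true;                                                  // ack
  }

  // read(mr, r): last value written iff r is in the region and p has read permission.
  std::optional<Value> read(ProcId p, int r) const {
    if (!registers.count(r) || !canRead(p)) return std::nullopt;  // nak
    return registers.at(r);
  }

  // changePermission takes effect only if the algorithm's legalChange allows it;
  // otherwise it is a no-op.
  template <typename LegalChange>
  void changePermission(ProcId p, const Permission& new_perm, LegalChange legalChange) {
    if (legalChange(p, perm, new_perm)) perm = new_perm;
  }
};
```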

Sending messages. Processes can also communicate by sending messages over a set of directed links. We assume messages are unique. If there is a link from process p to process q, then p can send messages to q. Links satisfy two properties: integrity and no-loss. Given two correct processes p and q, integrity requires that a message m be received by q from p at most once and only if m was previously sent by p to q. No-loss requires that a message m sent from p to q be eventually received by q. In our algorithms, we typically assume a fully connected network so that every pair of correct processes can communicate. We also consider the special case when there are no links (see below).

Executions and steps. An execution is a sequence of process steps. In each step, a process does the following, according to its local state: (1) sends a message or invokes an operation on a memory (read, write, or changePermission), (2) tries to receive a message or a response from an outstanding operation, and (3) changes local state. We require a process to have at most one outstanding operation on each memory.


Failures. A memory may fail by crashing, which causes subsequent operations on its registers to hang without returning a response. Because the system is asynchronous, a process cannot differentiate a crashed memory from a slow one. We assume there is an upper bound f_M on the maximum number of memories that may crash. Processes may fail by crashing or becoming Byzantine. If a process crashes, it stops taking steps forever. If a process becomes Byzantine, it can deviate arbitrarily from the algorithm. However, that process cannot operate on memories without the required permission. We assume there is an upper bound f_P on the maximum number of processes that may be faulty. Where the context is clear, we omit the P and M subscripts from the number of failures, f.

Signatures. Our algorithms assume unforgeable signatures: there are primitives sign(v) and sValid(p,v) which, respectively, sign a value v and determine if v is signed by process p.

Messages and disks. The model defined above includes two common models as special cases. In the message-passing model, there are no memories (m = 0), so processes can communicate only by sending messages. In the disk model [92], there are no links, so processes can communicate only via memories; moreover, each memory has a single region which always permits all processes to read and write all registers.

Consensus

In the consensus problem, processes propose an initial value and must make an irrevocable decision on a value. With crash failures, we require the following properties:

• Uniform Agreement. If processes p and q decide v_p and v_q, then v_p = v_q.

• Validity. If some process decides v, then v is the initial value proposed by some process.

• Termination. Eventually all correct processes decide.

We expect Agreement and Validity to hold in an asynchronous system, while Termination requires standard additional assumptions (partial synchrony, randomization, failure detection, etc). With Byzantine failures, we change these definitions so the problem can be solved. We consider weak Byzantine agreement [145], with the following properties:

• Agreement. If correct processes p and q decide v_p and v_q, then v_p = v_q.

• Validity. With no faulty processes, if some process decides v, then v is the input of some process.

• Termination. Eventually all correct processes decide.


Complexity of algorithms. We are interested in the performance of algorithms in common-case executions, when the system is synchronous and there are no failures. In those cases, we measure performance using the notion of delays, which extends message-delays to our model. Under this metric, computations are instantaneous, each message takes one delay, and each memory operation (write, read, and changePermission) takes two delays. Intuitively, a delay represents the time incurred by the network to transmit a message; a memory operation takes two delays because its hardware implementation requires a round trip. We say that a consensus protocol is k-deciding if, in common-case executions, some process decides in k delays.

2.4 Byzantine Failures

We now consider Byzantine failures and give a 2-deciding algorithm for weak Byzantine agreement with n ≥ 2f_P + 1 processes and m ≥ 2f_M + 1 memories. The algorithm consists of the composition of two sub-algorithms: a slow one that always works, and a fast one that gives up under hard conditions.

The first sub-algorithm, called Robust Backup, is developed in two steps. We first implement a reliable broadcast primitive, which prevents Byzantine processes from sending different values to different processes. Then, we use the framework of Clement et al. [60] combined with this primitive to convert a message-passing consensus algorithm that tolerates crash failures into a consensus algorithm that tolerates Byzantine failures. This yields Robust Backup.² It uses only static permissions and assumes memories are split into SWMR regions. Therefore, this sub-algorithm works in the traditional shared-memory model with SWMR registers, and it may be of independent interest.

The second sub-algorithm is called Cheap Quorum. It uses dynamic permissions to decide in two delays using one signature in common executions. However, the sub-algorithm gives up if the system is not synchronous or there are Byzantine failures.

Finally, we combine both sub-algorithms using ideas from the Abstract framework of Aublin et al. [23]. More precisely, we start by running Cheap Quorum; if it aborts, we run Robust Backup. There is a subtlety: for this idea to work, Robust Backup must decide on a value v if Cheap Quorum decided v previously. To achieve this, Robust Backup decides on a preferred value if at least f + 1 processes have this value as input. Concretely, we use the classic crash-tolerant Paxos algorithm (run under the Robust Backup algorithm to ensure Byzantine tolerance), but with an initial set-up phase that ensures this safe decision. We call the resulting protocol Preferential Paxos.

2The attentive reader may wonder why at this point we have not achieved a 2-deciding algorithm already: if we apply Clement et al. [60] to a 2-deciding crash-tolerant algorithm (such as Fast Paxos [35]), will the result not be a 2-deciding Byzantine-tolerant algorithm? The answer is no, because Clement et al. needs reliable broadcast, which incurs at least 6 delays.


2.4.1 The Robust Backup Sub-Algorithm

We develop Robust Backup using the construction by Clement et al. [60], which we now explain.

Clement et al. show how to transform a message-passing algorithm A that tolerates fP crash failures into a message-passing algorithm that tolerates fP Byzantine failures in a system with n ≥ 2fP + 1 processes, assuming unforgeable signatures and a non-equivocation mechanism. They do so by implementing trusted message-passing primitives, T-send and T-receive, using non-equivocation and signature verification on every message. Processes include their full history with each message, and then verify locally whether a received message is consistent with the protocol. This restricts Byzantine behavior to crash failures.
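For intuition, the following C++ sketch shows one simplified way such trusted primitives could be layered on top of the reliable broadcast and validate primitives that we construct in the next subsections. The Msg type, the history piggybacking, and the broadcast/validate signatures are illustrative assumptions, not the exact interfaces of Clement et al. [60].

#include <string>
#include <vector>

// Illustrative message: a sequence number, a payload, and the sender's piggybacked
// history of previously sent and accepted payloads.
struct Msg { int seq; std::string payload; std::vector<std::string> history; };

// Assumed primitives from the reliable-broadcast layer (Sections 2.4.1.1 and 2.4.1.2).
void broadcast(int k, const Msg& m);            // sign and publish the k-th message
bool validate(int sender, int k, const Msg& m); // is m backed by a valid L2 proof?

class TrustedChannel {
  int next_seq_ = 1;
  std::vector<std::string> history_;            // everything sent or accepted so far
public:
  void t_send(const std::string& payload) {
    Msg m{next_seq_, payload, history_};        // attach the full history to the message
    broadcast(next_seq_, m);
    ++next_seq_;
    history_.push_back(payload);
  }
  // Called when deliver(k, m, q) fires; nested history entries would be checked with
  // validate as well, which always returns (unlike waiting for a re-delivery).
  bool t_receive(int q, int k, const Msg& m) {
    if (!validate(q, k, m)) return false;       // non-equivocation and signature check
    history_.push_back(m.payload);
    return true;
  }
};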

To apply this construction in our model, we show that our model can implement non-equivocation and message passing. We first show that shared-memory with SWMR registers (and no memory failures) can implement these primitives, and then show how our model can implement shared-memory with SWMR registers.

2.4.1.1 Reliable Broadcast

Consider a shared-memory system. We present a way to prevent equivocation through a solution to the reliable broadcast problem, which we recall below. Note that our definition of reliable broadcast includes the sequence number k (as opposed to being single-shot) so as to facilitate the integration with the Clement et al. construction, as we explain in Section 2.4.1.2.

Definition 2.4.1. Reliable broadcast is defined in terms of two primitives, broadcast(k,m) and deliver(k,m,q). When a process p invokes broadcast(k,m) we say that p broadcasts (k,m). When a process p invokes deliver(k,m,q) we say that p delivers (k,m) from q. Each correct process p must invoke broadcast(k,∗) with k one higher than p's previous invocation (and first invocation with k = 1). The following holds:

1. If a correct process p broadcasts (k,m), then all correct processes eventually deliver (k,m) from p.

2. If p and q are correct processes, p delivers (k,m) from r, and q delivers (k,m′) from r, then m = m′.

3. If a correct process delivers (k,m) from a correct process p, then p must have broadcast (k,m).

4. If a correct process delivers (k,m) from p, then all correct processes eventually deliver (k,m′) from p for some m′.

Algorithm 2.1 shows how to implement reliable broadcast that is tolerant to a minority of Byzantine failures in shared-memory using SWMR registers. To broadcast its k-th message m, p simply signs (k,m) and writes it in slot Value[p,k,p] of its memory3.

3The indexing of the slots is as follows: the first index is the writer of the SWMR register, the second index is the sequence number of the message, and the third index is the sender of the message.


Algorithm 2.1 – Reliable Broadcast.

Global state:
  SWMR Value[n,M,n]; initialized to ⊥. Value[p] is array of SWMR(p) registers.
  SWMR L1Proof[n,M,n]; initialized to ⊥. L1Proof[p] is array of SWMR(p) registers.
  SWMR L2Proof[n,M,n]; initialized to ⊥. L2Proof[p] is array of SWMR(p) registers.

Local state:
  last[n]: local array with last k delivered from each process. Initially, last[q] = 0
  state[n]: local array of registers. state[q] ∈ {WaitForSender, WaitForL1Proof, WaitForL2Proof}. Initially, state[q] = WaitForSender

Code for process p:
broadcast(k,m) {
  Value[p,k,p].write(sign((k,m))); }

for q in Π in parallel {
  while true { try_deliver(q); } }

try_deliver(q) {
  k = last[q];
  val = checkL2Proof(q,k);
  if (val != null) {
    deliver(k, proof.msg, q);
    last[q] += 1;
    state = WaitForSender;
    return; }
  if state == WaitForSender {
    val = Value[q,k,q].read();
    if (val == ⊥ || !sValid(p, val) || key != k) { return; }
    Value[p,k,q].write(sign(val));
    state = WaitForL1Proof; }
  if state == WaitForL1Proof {
    checkedVals = ∅;
    for i ∈ Π {
      val = Value[i,k,q].read();
      if (val != ⊥ && sValid(p,val) && key == k) { add val to checkedVals; } }
    if size(checkedVals) ≥ majority and checkedVals contains only one value {
      l1prf = sign(checkedVals);
      L1Proof[p,k,q].write(l1prf);
      state = WaitForL2Proof; } }
  if state == WaitForL2Proof {
    checkedL1Proofs = ∅;
    for i in Π {
      proof = L1Proof[i,k,q].read();
      if (checkL1Proof(proof)) { add proof to checkedL1Proofs; } }
    if size(checkedL1Proofs) ≥ majority {
      l2prf = sign(checkedL1Proofs);
      L2Proof[p,k,q].write(l2prf); } } }

value checkL2proof(q,k) {
  for i ∈ Π {
    proof = L2Proof[i,k,q].read();
    if (proof != ⊥ && check(proof)) {
      L2Proof[p,k,q].write(proof);
      return proof.msg; } }
  return null; }


Delivering a message from another process is a little more involved, requiring verification steps to ensure that all correct processes will eventually deliver the same message and no other. The high-level idea is that before delivering a message (k,m) from q, each process p checks that no other process saw a different value from q, and waits to hear that "enough" other processes also saw the same value. More specifically, each process p has 3 slots per process per sequence number, that only p can write to, but all processes can read from. These slots are initialized to ⊥, and p uses them to write the values that it has seen. The 3 slots represent 3 levels of 'proofs' that this value is correct; for each process q and sequence number k, p has a slot to write (1) the initial value v it read from q for k, (2) a proof that at least f + 1 processes saw the same value v from q for k, and (3) a proof that at least f + 1 processes wrote a proof of seeing value v from q for k in their second slot. We call these slots the Value slot, the L1Proof slot, and the L2Proof slot, respectively.

We note that each such valid proof has signed copies of only one value for the message. Any proof that shows copies of two different values or a value that isn’t signed is not considered valid. If a proof has copies of only value v, we say that this proof supports v.
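To make the notion of a valid proof concrete, the following C++ sketch performs the check just described. The SignedCopy and Proof types and the sigValid helper are illustrative stand-ins for the signed slot contents and the sValid primitive; they are assumptions, not the exact data structures of the algorithm.

#include <set>
#include <string>
#include <vector>

struct SignedCopy { int signer; int k; std::string value; bool wellSigned; };
struct Proof { std::vector<SignedCopy> copies; };

// Stand-in for sValid: does the copy carry a correct signature of its signer?
bool sigValid(const SignedCopy& c) { return c.wellSigned; }

// A proof for sequence number k is valid iff every copy is correctly signed and refers
// to k, all copies carry the same value, and at least f+1 distinct processes signed.
// If so, that single value is the value the proof supports.
bool proofSupports(const Proof& p, int k, int f, std::string* supported) {
  std::set<std::string> values;
  std::set<int> signers;
  for (const SignedCopy& c : p.copies) {
    if (c.k != k || !sigValid(c)) return false;   // unsigned or mismatched copy: invalid
    values.insert(c.value);
    signers.insert(c.signer);
  }
  if (values.size() != 1 || (int)signers.size() < f + 1) return false;
  *supported = *values.begin();
  return true;
}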

To deliver a value v from process q with sequence number k, process p must successfully write a valid proof-of-proofs in its L2Proof slot supporting value v (we call this an L2 proof). It has two options of how to do this; firstly, if it sees a valid L2 proof in some other process i's L2Proof[i,k,q] slot, it copies this proof over to its own L2 proof slot, and can then deliver the value that this proof supports. If p does not find a valid L2 proof in some other process's slot, it must try to construct one itself. We now describe how this is done.

A correct process p goes through three stages when constructing a valid L2 proof for (k,m) from q. In the pseudocode, the three stages are denoted using states that p goes through: WaitForSender, WaitForL1Proof, and WaitForL2Proof.

In the first stage, WaitForSender, p reads q's Value[q,k,q] slot. If p finds a (k,m) pair, p signs and copies it to its Value[p,k,q] slot and enters the WaitForL1Proof state.

In the second stage, WaitForL1Proof, p reads all Value[i,k,q] slots, for i ∈ Π. If all the values p reads are correctly signed and equal to (k,m), and if there are at least f + 1 such values, then p compiles them into an L1 proof, which it signs and writes to L1Proof[p,k,q]; p then enters the WaitForL2Proof state.

In the third stage, WaitForL2Proof, p reads all L1Proof[i,k,q] slots, for i ∈ Π. If p finds at least f + 1 valid and signed L1 proofs for (k,m), then p compiles them into an L2 proof, which it signs and writes to L2Proof[p,k,q]. The next time that p scans the L2Proof[·,k,q] slots, p will see its own L2 proof (or some other valid proof for (k,m)) and deliver (k,m).

This three-stage validation process ensures the following crucial property: no two valid L2 proofs can support different values. Intuitively, this property is achieved because for both L1 and L2 proofs, at least f + 1 values of the previous stage must be copied, meaning that at least one correct process was involved in the quorum needed to construct each proof. Because correct processes read the slots of all others at each stage before constructing the next proof, and because they never overwrite or delete values that they already wrote, it is guaranteed that no two correct processes will create valid L1 proofs for different values, since one must see the Value slot of the other. Thus, no process or processes, Byzantine or otherwise, can construct valid L2 proofs for different values.

Notably, a weaker version of broadcast, which does not require Property 4 (i.e., Byzantine consistent broadcast [47, Module 3.11]), can be solved with just the first stage of Algorithm 2.1, without the L1 and L2 proofs. The purpose of those proofs is to ensure the 4th property holds; that is, to enable all correct processes to deliver a value once some correct process delivered.

In Appendix A.1, we formally prove the above intuition and arrive at the following lemma.

Lemma 2.4.2. Reliable broadcast can be solved in shared-memory with SWMR regular registers with n ≥ 2f + 1 processes.

2.4.1.2 Applying Clement et al.’s Construction

Clement et al. show that given unforgeable transferable signatures and non-equivocation, one can reduce Byzantine failures to crash failures in message passing systems [60]. They define non-equivocation as a predicate validp for each process p, which takes a sequence number and a value and evaluates to true for just one value per sequence number. All processes must be able to call the same validp predicate, which always terminates every time it is called.

We now show how to use reliable broadcast to implement messages with transferable signatures and non-equivocation as defined by Clement et al. [60]. Note that our reliable broadcast mechanism already involves the use of transferable signatures, so to send and receive signed messages, one can simply use broadcast and deliver those messages. However, simply using broadcast and deliver is not quite enough to satisfy the requirements of the validp predicate of Clement et al. The problem occurs when trying to validate nested messages recursively.

In particular, recall that in Clement et al.'s construction, whenever a message is sent, the entire history of that process, including all messages it has sent and received, is attached. Consider two Byzantine processes q1 and q2, and assume that q1 attempts to equivocate in its kth message, signing both (k,m) and (k,m′). Assume therefore that no correct process delivers any message from q1 in its kth round. However, since q2 is also Byzantine, it could claim to have delivered (k,m) from q1. If q2 then sends a message that includes (q1,k,m) as part of its history, a correct process p receiving q2's message must recursively verify the history q2 sent. To do so, p can call try_deliver on (q1,k). However, since no correct process delivered any message from (q1,k), it is possible that this call never returns.

To solve this issue, we introduce a validate operation that can be used along with broadcast and deliver to validate the correctness of a given message. The validate operation is very


Algorithm 2.2 – Validate Operation for Reliable Broadcast.

bool validate(q,k,m) {
  val = checkL2proof(q,k);
  if (val == m) {
    return true; }
  return false; }

simple: it takes in a process id, a sequence number, and a message value m, and simply runs the checkL2proof helper function. If the function returns a proof supporting m, validate returns true. Otherwise it returns false. The pseudocode is shown in Algorithm 2.2.

In this way, Algorithms 2.1 and 2.2 together provide signed messages and a non-equivocation primitive. Thus, combined with the construction of Clement et al. [60], we immediately get the following result.

Theorem 2.4.3. There exists an algorithm for weak Byzantine agreement in a shared-memory system with SWMR regular registers, signatures, and up to fP Byzantine processes, where n ≥ 2fP + 1.

Non-equivocation in our model. To convert the above algorithm to our model, where memory may fail, we use the ideas in [3, 20, 128] to implement failure-free SWMR regular registers from the fail-prone memory, and then run weak Byzantine agreement using those regular registers. To implement an SWMR register, a process writes or reads all memories, and waits for a majority to respond. When reading, if p sees exactly one distinct non-⊥ value v across the memories, it returns v; otherwise, it returns ⊥.

Definition 2.4.4. Let A be a message-passing algorithm. Robust Backup(A) is the algorithm A in which all send and receive operations are replaced by T-send and T-receive operations (respectively) implemented with reliable broadcast.
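Before stating the resulting guarantee, here is a small C++ sketch of the majority-based register emulation just described; majorityRead and majorityWrite are assumed helpers that issue the operation to all m memories (over RDMA in our model) and return once a majority has responded.

#include <optional>
#include <set>
#include <string>
#include <vector>

using Val = std::optional<std::string>;               // nullopt plays the role of ⊥

// Assumed helpers: issue the operation to every memory, block until a majority replies.
std::vector<Val> majorityRead(int reg);
void majorityWrite(int reg, const std::string& v);

void swmrWrite(int reg, const std::string& v) {       // writer: write all, await a majority
  majorityWrite(reg, v);
}

Val swmrRead(int reg) {                               // reader: read all, await a majority
  std::set<std::string> distinct;
  for (const Val& r : majorityRead(reg))
    if (r.has_value()) distinct.insert(*r);
  if (distinct.size() == 1) return *distinct.begin(); // exactly one non-⊥ value seen
  return std::nullopt;                                // otherwise return ⊥
}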

Thus we get the following lemma, from the result of Clement et al. [60], Lemma 2.4.2, and the above handling of memory failures.

Lemma 2.4.5. If A is a consensus algorithm that is tolerant to fP process crash failures, then Robust Backup(A) is a weak Byzantine agreement algorithm that is tolerant to up to fP Byzantine processes and fM memory crashes, where n ≥ 2fP + 1 and m ≥ 2fM + 1, in the message-and-memory model.

The following theorem is an immediate corollary of the lemma.

Theorem 2.4.6. There exists an algorithm for Weak Byzantine Agreement in a message-and-memory model with up to fP Byzantine processes and fM memory crashes, where n ≥ 2fP + 1 and m ≥ 2fM + 1.

2.4.2 The Cheap Quorum Sub-Algorithm

We now give an algorithm that decides in two delays in common executions in which the system is synchronous and there are no failures. It requires only one signature for a fast decision, whereas the best prior algorithm requires 6fP + 2 signatures and n ≥ 3fP + 1 [23]. Our algorithm, called Cheap Quorum, is not in itself a complete consensus algorithm; it may abort in some executions. If Cheap Quorum aborts, it outputs an abort value, which is used to initialize the Robust Backup so that their composition preserves weak Byzantine agreement. This composition is inspired by the Abstract framework of Aublin et al. [23].

The algorithm has a special process ℓ, say ℓ = p1, which serves both as a leader and a follower. Other processes act only as followers. The memory is partitioned into n + 1 regions denoted Region[p] for each p ∈ Π, plus an extra one for p1, Region[ℓ], in which it proposes a value. Initially, Region[p] is a regular SWMR region where p is the writer. Unlike in Algorithm 2.1, some of the permissions are dynamic; processes may remove p1's write permission to Region[ℓ] (i.e., the legalChange function returns false to any permission change requests, except for ones revoking p1's permission to write on Region[ℓ]).

Processes initially execute under a normal mode in common-case executions, but may switch to panic mode if they intend to abort, as in [23]. The pseudo-code of the normal mode is in Algorithm 2.3. Region[p] contains three registers Value[p], Panic[p], Proof[p], initially set to ⊥, false, ⊥. To propose v, the leader p1 signs v and writes it to Value[ℓ]. If the write is successful (it may fail because its write permission was removed), then p1 decides v; otherwise p1 calls Panic_mode(). Note that all processes, including p1, continue their execution after deciding. However, p1 never decides again if it decided as the leader. A follower q checks if p1 wrote to Value[ℓ] and, if so, whether the value is properly signed. If so, q signs v, writes it to Value[q], and waits for other processes to write the same value to Value[∗]. If q sees 2f + 1 copies of v signed by different processes, q assembles these copies in a unanimity proof, which it signs and writes to Proof[q]. q then waits for 2f + 1 unanimity proofs for v to appear in Proof[∗], and checks that they are valid, in which case q decides v. This waiting continues until a timeout expires4, at which time q calls Panic_mode(). In Panic_mode() (shown in Algorithm 2.4), a process p sets Panic[p] to true to tell other processes it is panicking; other processes periodically check to see if they should panic too. p then removes write permission from Region[ℓ], and decides on a value to abort: either Value[p] if it is non-⊥, Value[ℓ] if it is non-⊥, or p's input value. If p has a unanimity proof in Proof[p], it adds it to the abort value. In Appendix A.2, we prove the correctness of Cheap Quorum, and in particular we show the following two important agreement properties:

Lemma 2.4.7 (Cheap Quorum Decision Agreement). Let p and q be correct processes. If p decides v1 while q decides v2, then v1 = v2.

Lemma 2.4.8 (Cheap Quorum Abort Agreement). Let p and q be correct processes (possibly identical). If p decides v in Cheap Quorum while q aborts from Cheap Quorum, then v will be q's abort value. Furthermore, if p is a follower, q's abort proof is a correct unanimity proof.

4The timeout is chosen to be an upper bound on the communication, processing and computation delays in the common case.


Algorithm 2.3 – Cheap Quorum normal operation—code for process p.

Leader code
propose(v) {
  sign(v);
  status = Value[ℓ].write(v);
  if (status == nak) Panic_mode();
  else decide(v); }

Follower code
propose(w) {
  do { v = read(Value[ℓ]);
       for all q ∈ Π do pan[q] = read(Panic[q]);
  } until (v ≠ ⊥ || pan[q] == true for some q || timeout);
  if (v ≠ ⊥ && sValid(p1,v)) {
    sign(v);
    write(Value[p],v);
    do { for all q ∈ Π do val[q] = read(Value[q]);
         if |{q : val[q] == v}| ≥ n then {
           Proof[p].write(sign(val[1..n]));
           for all q ∈ Π do prf[q] = read(Proof[q]);
           if |{q : verifyProof(prf[q]) == true}| ≥ n { decide(v); exit; } }
         for all q ∈ Π do pan[q] = read(Panic[q]);
    } until (pan[q] == true for some q || timeout); }
  Panic_mode(); }


The above construction assumes a fail-free memory with regular registers, but we can extend it to tolerate memory failures using the approach of Section 2.4.1, noting that each register has a single writer process.

Algorithm 2.4 – Cheap Quorum panic mode—code for process p.

panic_mode() {
  Panic[p] = true;
  changePermission(Region[ℓ], R: Π, W: {}, RW: {}); // remove write permission
  v = read(Value[p]);
  prf = read(Proof[p]);
  if (v ≠ ⊥) { Abort with 〈v, prf〉; return; }
  LVal = read(Value[ℓ]);
  if (LVal ≠ ⊥) { Abort with 〈LVal, ⊥〉; return; }
  Abort with 〈input value, ⊥〉; }


2.4.3 Putting it Together: the Fast & Robust Algorithm

The final algorithm, called Fast & Robust, combines Cheap Quorum (§2.4.2) and Robust Backup (§2.4.1), as we now explain. Recall that Robust Backup is parameterized by a message-passing consensus algorithm A that tolerates crash failures. A can be any such algorithm (e.g., Paxos).

Roughly, in Fast & Robust, we run Cheap Quorum and, if it aborts, we use a process’s abort value as its input value to Robust Backup. However, we must carefully glue the two algorithms together to ensure that if some correct process decided v in Cheap Quorum, then v is the only value that can be decided in Robust Backup.

For this purpose, we propose a simple wrapper for Robust Backup, called Preferential Paxos (Algorithm 2.5). Preferential Paxos first runs a set-up phase, in which processes may adopt new values, and then runs Robust Backup with the new values. More specifically, there are some preferred input values v1,...,vk, ordered by priority. We guarantee that every process adopts one of the top f + 1 priority inputs. In particular, this means that if a majority of processes get the highest priority value, v1, as input, then v1 is guaranteed to be the decision value. The set-up phase is simple; all processes send each other their input values. Each process p waits to receive n − f such messages, and adopts the value with the highest priority that it sees. This is the value that p uses as its input to Paxos.

Algorithm 2.5 – Preferential Paxos—code for process p.

1 propose((v, priorityTag)){

2 T-send(v, priorityTag) to all;

3 Wait to T-receive (val, priorityTag) from n − fP processes;
4 best = value with highest priority out of messages received;

5 RobustBackup(Paxos).propose(best); }

Lemma 2.4.9 (Preferential Paxos Priority Decision). Preferential Paxos implements weak Byzantine agreement with n ≥ 2fP + 1 processes. Furthermore, let v1,...,vn be the input values of an instance C of Preferential Paxos, ordered by priority. The decision value of correct processes is always one of v1,...,vfP+1.

Proof. Recall that T-send and T-receive are the trusted message passing primitives that are implemented in [60] using non-equivocation and signatures.

By Lemma 2.4.5, Robust Backup(Paxos) solves weak Byzantine agreement with n ≥ 2fP + 1 processes. Note that before calling Robust Backup(Paxos), each process may change its input, but only to the input of another process. Thus, by the correctness and fault tolerance of Paxos, Preferential Paxos clearly solves weak Byzantine agreement with n ≥ 2fP + 1 processes. Thus we only need to show that Preferential Paxos satisfies the priority decision property with 2fP + 1 processes that may only fail by crashing.

Since Robust Backup(Paxos) satisfies validity, if all processes call Robust Backup(Paxos) in



Figure 2.2 – Interactions of the components of the Fast & Robust Algorithm.

line 5 with a value v that is one of the fP + 1 top priority values (that is, v ∈ {v1,...,vfP+1}), then the decision of correct processes will also be in {v1,...,vfP+1}. So we just need to show that every process indeed adopts one of the top fP + 1 values. Note that each process p waits to see n − fP values, and then picks the highest priority value that it saw. No process can lie or pick a different value, since we use T-send and T-receive throughout. Thus, p can miss at most fP values that are higher priority than the one that it adopts, so the value it adopts is indeed among the top fP + 1 priority values.

We can now describe Fast & Robust in detail. We start executing Cheap Quorum. If Cheap Quorum aborts, we execute Preferential Paxos, with each process receiving its abort value from Cheap Quorum as its input value to Preferential Paxos. We define the priorities of inputs to Preferential Paxos as follows.

Definition 2.4.10 (Input Priorities for Preferential Paxos). The input values for Preferential Paxos as it is used in Fast & Robust are split into three sets (here, p1 is the leader of Cheap Quorum):

• T = {v | v contains a correct unanimity proof}

• M = {v | v ∉ T ∧ v contains the signature of p1}

• B = {v | v ∉ T ∧ v ∉ M}

The priority order of the input values is such that for all values vT ∈ T, vM ∈ M, and vB ∈ B, priority(vT) > priority(vM) > priority(vB).
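As an illustration, this classification amounts to a small priority function; the AbortValue fields and the hasCorrectUnanimityProof helper below are assumptions standing in for the abort value produced by Cheap Quorum and for the proof check.

#include <string>

// Illustrative abort value: the value, an (optional) unanimity proof, and whether
// the value carries the leader p1's signature.
struct AbortValue { std::string value; std::string unanimityProof; bool signedByP1; };

bool hasCorrectUnanimityProof(const AbortValue& v);   // assumed: 2f+1 matching signed copies?

// Higher number = higher priority, i.e., priority(T) > priority(M) > priority(B).
int priority(const AbortValue& v) {
  if (hasCorrectUnanimityProof(v)) return 2;   // set T
  if (v.signedByP1)                return 1;   // set M: no proof, but signed by p1
  return 0;                                    // set B: everything else
}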

Figure 2.2 shows how the various algorithms presented in this section come together to form the Fast & Robust algorithm.

In Appendices A.2 and A.3, we show that Fast & Robust is correct, with the following key lemma:

Lemma 2.4.11 (Composition Lemma). If some correct process decides a value v in Cheap Quorum before an abort, then v is the only value that can be decided in Preferential Paxos with priorities as defined in Definition 2.4.10.


Theorem 2.4.12. There exists a 2-deciding algorithm for Weak Byzantine Agreement in a message-and-memory model with up to fP Byzantine processes and fM memory crashes, where n ≥ 2fP + 1 and m ≥ 2fM + 1.

2.5 Crash Failures

We now restrict ourselves to crash failures of processes and memories. Clearly, we can use the algorithms of Section 2.4 in this setting, to obtain a 2-deciding consensus algorithm with n ≥ 2fP + 1 and m ≥ 2fM + 1. However, this is overkill since those algorithms use sophisticated mechanisms (signatures, non-equivocation) to guard against Byzantine behavior. With only crash failures, we now show it is possible to retain the efficiency of a 2-deciding algorithm while improving resiliency. We give a 2-deciding algorithm that allows the crash of all but one process (n ≥ fP + 1) and a minority of memories (m ≥ 2fM + 1).

Our starting point is the Disk Paxos algorithm [92], which works in a system with processes and memories where n ≥ fP + 1 and m ≥ 2fM + 1. This is our resiliency goal, but Disk Paxos takes four delays in common executions. Our new algorithm, called Protected Memory Paxos, removes two delays; it retains the structure of Disk Paxos but uses permissions to skip steps. Initially some fixed leader ℓ = p1 has exclusive write permission to all memories; if another process becomes leader, it takes the exclusive permission. Having exclusive permission permits a leader ℓ to optimize execution, because ℓ can do two things simultaneously: (1) write its consensus proposal and (2) determine whether another leader took over. Specifically, if ℓ succeeds in (1), it knows no leader ℓ′ took over, because ℓ′ would have taken the permission. Thus ℓ avoids the last read in Disk Paxos, saving two delays. Of course, care must be taken to implement this without violating safety.

The pseudocode of Protected Memory Paxos is in Algorithm 2.6. Each memory has one memory region, and at any time exactly one process can write to the region. Each memory i holds a register slot[i,p] for each process p. Intuitively, slot[i,p] is intended for p to write, but p may not have write permission to do that if it is not the leader—in that case, no process writes slot[i,p].

When a process p becomes the leader, it must execute a sequence of steps on a majority of the memories to successfully commit a value. It is important that p execute all of these steps on each of the memories that counts toward its majority; otherwise two leaders could miss each other’s values and commit conflicting values. We therefore present the pseudocode for this algorithm in a parallel-for loop (lines 18–37), with one thread per memory that p accesses. The algorithm has two phases similar to Paxos, where the second phase may only begin after the first phase has been completed for a majority of the memories. We represent this in the code with a barrier that waits for a majority of the threads.

When a process p becomes leader, it executes the prepare phase (the first leader p1 can skip this phase in its first execution of the loop), where, for each memory, p attempts to (1) acquire


Algorithm 2.6 – Protected Memory Paxos—code for process p.

1 Registers: for i=1..m, p ∈ Π,
2   slot[i,p]: tuple (minProp, accProp, value) // in memory i
3 Ω: failure detector that returns current leader

5 startPhase2(i) {
6   add i to ListOfReady;
7   while (size(ListOfReady) < majority of memories) {}
8   Phase2Started = true; }

10 propose(v) {
11   repeat forever {
12     wait until Ω == p; // wait to become leader
13     propNr = a higher value than any proposal number seen before;
14     CurrentVal = v;
15     CurrentMaxProp = 0;
16     Phase2Started = false;
17     ListOfReady = ∅;
18     for every memory i in parallel {
19       if (p != p1 || not first attempt) {
20         getPermission(i);
21         success = write(slot[i,p], (propNr, ⊥, ⊥));
22         if (not success) { abort(); }
23         vals = read all slots from i;
24         if (vals contains a non-null value) {
25           val = v ∈ vals with highest propNr;
26           if (val.propNr > propNr) { abort(); }
27           atomic {
28             if (val.propNr > CurrentMaxProp) {
29               if (Phase2Started) { abort(); }
30               CurrentVal = val.value;
31               CurrentMaxProp = val.propNr;
32         }}}
33         startPhase2(i); }
34       // done phase 1 or (p == p1 && p1's first attempt)
35       success = write(slot[i,p], (propNr, propNr, CurrentVal));
36       if (not success) { abort(); }
37     } until this has been done at a majority of the memories, or until 'abort' has been called
38     if (loop completed without abort) {
39       decide CurrentVal; } } }

exclusive write permission, (2) write a new proposal number in its slot, and (3) read all slots of that memory. p waits to succeed in executing these steps on a majority of the memories. If any of p's writes fail or p finds a proposal with a higher proposal number, then p gives up. This is represented with an abort in the pseudocode; when an abort is executed, the for loop terminates. We assume that when the for loop terminates—either because some thread has aborted or because a majority of threads have reached the end of the loop—all threads of the for loop are terminated and control returns to the main loop (lines 11–37).

If p does not abort, it adopts the value with highest proposal number of all those it read in

the memories. To make it clear that races should be avoided among parallel threads in the pseudocode, we wrap this part in an atomic environment.

In the next phase, each of p’s threads writes its value to its slot on its memory. If a write fails, p gives up. If p succeeds, this is where we optimize time: p can simply decide, whereas Disk Paxos must read the memories again.

Note that it is possible that some of the memories that made up the majority that passed the initial barrier may crash later on. To prevent p from stalling forever in such a situation, it is important that straggler threads that complete phase 1 later on be allowed to participate in phase 2. However, if such a straggler thread observes a more up-to-date value in its memory than the one adopted by p for phase 2, this must be taken into account. In this case, to avoid inconsistencies, p must abort its current attempt and restart the loop from scratch.

The code ensures that some correct process eventually decides, but it is easy to extend it so all correct processes decide [54], by having a decided process broadcast its decision. Also, the code shows one instance of consensus, with p1 as initial leader. With many consensus instances, the leader terminates one instance and becomes the default leader in the next.

Theorem 2.5.1. Consider a message-and-memory model with up to fP process crashes and fM memory crashes, where n ≥ fP + 1 and m ≥ 2fM + 1. There exists a 2-deciding consensus algorithm.

2.6 Dynamic Permissions are Necessary for Efficient Consensus

In §2.5, we showed how dynamic permissions can improve the performance of Disk Paxos. Are dynamic permissions necessary? We prove that with shared memory (or disks) alone, one cannot achieve 2-deciding consensus, even if the memory never fails, it has static permissions, processes may only fail by crashing, and the system is partially synchronous in the sense that eventually there is a known upper bound on the time it takes a correct process to take a step [87]. This result applies a fortiori to the Disk Paxos model [92].

Theorem 2.6.1. Consider a partially synchronous shared-memory model with registers, where registers can have arbitrary static permissions, memory never fails, and at most one process may fail by crashing. No consensus algorithm is 2-deciding.

Proof. Assume by contradiction that A is an algorithm in the stated model that is 2-deciding. That is, there is some execution E of A in which some process p decides a value v with 2 delays. We denote by R and W the sets of objects that p reads and writes in E, respectively. Note that since p decides in 2 delays in E, R and W must be disjoint, by the definition of operation delay and the fact that a process has at most one outstanding operation per object. Furthermore, p must issue all of its reads and writes without waiting for the response of any operation.

Consider an execution E′ in which p reads from the same set R of objects and writes the same values as in E to the same set W of objects. All of the read operations that p issues return


by some time t0, but the write operations of p are delayed for a long time. Another process p′ begins its proposal of a value v′ ≠ v after t0. Since no process other than p′ writes to any objects, E′ is indistinguishable to p′ from an execution in which it runs alone. Since A is a correct consensus algorithm that terminates if there is no contention, p′ must eventually decide value v′. Let t′ be the time at which p′ decides. All of p's write operations terminate and are linearized in E′ after time t′. Execution E′ is indistinguishable to p from execution E, in which it ran alone. Therefore, p decides v ≠ v′, violating agreement.

Theorem 2.6.1, together with the Fast Paxos algorithm of Lamport [149], shows that an atomic read-write shared memory model is strictly weaker than the message passing model in its ability to solve consensus quickly. This result may be of independent interest, since the classic shared memory and message passing models are often seen as equivalent, because of the seminal computational equivalence result of Attiya, Bar-Noy, and Dolev [20]. Interestingly, it is known that shared memory can tolerate more failures when solving consensus (with randomization or partial synchrony) [19, 39], and therefore it seems that perhaps shared memory is strictly stronger than message passing for solving consensus. However, our result shows that there are aspects in which message passing is stronger than shared memory. In particular, message passing can solve consensus faster than shared memory in well-behaved executions.

2.7 RDMA in Practice

Our model is meant to reflect capabilities of RDMA, while providing a clean abstraction to reason about. We now give an overview of how RDMA works, and how features of our model can be implemented using RDMA.

RDMA enables a remote process to access local memory directly through the network interface card (NIC), without involving the CPU. For a piece of local memory to be accessible to a remote process p, the CPU has to register that memory region and associate it with the appropriate connection (called a Queue Pair) for p. The association of a registered memory region and a queue pair is done indirectly through a protection domain: both memory regions and queue pairs are associated with a protection domain, and a queue pair q can be used to access a memory region r if q and r are in the same protection domain. The CPU must also specify what access level (read, write, read-write) is allowed to the memory region in each protection domain. A local memory area can thus be registered and associated with several queue pairs, with the same or different access levels, by associating it with one or more protection domains. Each RDMA connection can be used by the remote server to access registered memory regions using a unique region-specific key created as part of the registration process.

As highlighted by previous work [199], failures of the CPU, NIC and DRAM can be seen as independent (e.g., arbitrary delays, too many bit errors, failed ECC checks, respectively). For instance, zombie servers in which the CPU is blocked but RDMA requests can still be served

account for roughly half of all failures [199]. This motivates our choice to treat processes and memory separately in our model. In practice, if a CPU fails permanently, the memory will also become unreachable through RDMA eventually; however, in such cases memory may remain available long enough for ongoing operations to complete. Also, in practical settings it is possible for full-system crashes to occur (e.g., machine restarts), which correspond to a process and a memory failing at the same time—this is allowed by our model.

Memory regions in our model correspond to RDMA memory regions. Static permissions can be implemented by making the appropriate memory region registration before the execution of the algorithm; these permissions then persist during execution without CPU involvement. Dynamic permissions require the host CPU to change the access levels; this should be done in the OS kernel: the kernel creates regions and controls their permissions, and then shares memory with user-space processes. In this way, Byzantine processes cannot change permissions illegally. The assumption is that the kernel is not Byzantine. Alternatively, future hardware support similar to SGX could even allow parts of the kernel to be Byzantine.

Using RDMA, a process p can grant permissions to a remote process q by registering memory regions with the appropriate access permissions (read, write, or read/write) and sending the corresponding key to q. p can revoke permissions dynamically by simply deregistering the memory region. In practice, changing RDMA permissions can be several orders of magnitude slower than remote reads or writes. In this chapter, we ignore this cost, since we are interested in the common-case complexity of algorithms (and permission changes do not occur in the common case). We examine this cost in more depth in Chapter 3, when we consider ways to improve failover performance.
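As an illustration, with libibverbs this grant/revoke pattern boils down to registering and deregistering memory regions; the sketch below omits error handling and assumes the protection domain pd and buffer buf have already been set up.

#include <infiniband/verbs.h>
#include <cstddef>
#include <cstdint>

// Grant a remote process read/write access to [buf, buf+len): register the region and
// ship the resulting rkey to that process (e.g., in a setup message).
ibv_mr* grant_remote_rw(ibv_pd* pd, void* buf, std::size_t len, std::uint32_t* rkey_out) {
  ibv_mr* mr = ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
  if (mr != nullptr) *rkey_out = mr->rkey;
  return mr;
}

// Revoke dynamically: deregistering invalidates the rkey, so any later remote access
// that still uses it fails.
void revoke(ibv_mr* mr) { ibv_dereg_mr(mr); }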

For our reliable broadcast algorithm, each process can register the two dimensional array of values in read-only mode with a protection domain. All the queue pairs used by that process are also created in the context of the same protection domain. Additionally, the process can preserve write access permission to its row via another registration of just that row with the protection domain, thus enabling single-writer multiple-reader access. Thereafter the reliable broadcast algorithm can be implemented trivially by using RDMA reads and writes by all processes. Reliable broadcast with unreliable memories is similarly straightforward since failure of the memory ensures that no process will be able to access the memory.

For Cheap Quorum, the static memory region registrations are straightforward as above. To revoke the leader’s write permission, it suffices for a region’s host process to deregister the memory region. Panic messages can be relayed using RDMA message sends.

In our crash-only consensus algorithm, we leverage the capability of registering overlapping memory regions in a protection domain. As in above algorithms, each process uses one protection domain for RDMA accesses. Queue pairs for connections to all other processes are associated with this protection domain. The process’ entire slot array is registered with the protection domain in read-only mode. In addition, the same slot array can be dynamically registered (and deregistered) in write mode based on incoming write permission requests: A

proposer requests write permission using an RDMA message send. In response, the acceptor first deregisters write permission for the immediate previous proposer. The acceptor thereafter registers the slot array in write mode and responds to the proposer with the new key associated with the newly registered slot array. Reads of the slot array are performed by the proposer using RDMA reads. Subsequent second-phase RDMA writes of the value can be performed on the slot array as long as the proposer continues to have write permission to the slot array. The RDMA write fails if the acceptor granted write permission to another proposer in the meantime.

3 Microsecond Consensus for Microsecond Applications

In the previous chapter, we showed that in a model which captures the properties of RDMA, consensus can be solved in a single round trip, with better fault-tolerance than is possible in the traditional message-passing or shared-memory models. In this chapter, we focus on crash faults and look at the practical applicability of the results from the previous chapter. Namely, we consider the problem of making apps fault-tolerant through replication, when apps operate at the microsecond scale, as in finance, embedded computing, and microservices apps. These apps need a replication scheme that also operates at the microsecond scale, otherwise replication becomes a burden. We propose Mu, a system that takes less than 1.3 microseconds to replicate a (small) request in memory, and less than a millisecond to fail over the system—this cuts the replication and fail-over latencies of the prior systems by at least 61% and 90%. Mu implements bona fide state machine replication/consensus (SMR) with strong consistency for a generic app, but it really shines on microsecond apps, where even the smallest overhead is significant. To provide this performance, Mu introduces a new SMR protocol that carefully leverages RDMA. Roughly, in Mu a leader replicates a request by simply writing it directly to the log of other replicas using RDMA, without any additional communication. Doing so, however, introduces the challenge of handling concurrent leaders, changing leaders, garbage collecting the logs, and more—challenges that we address in this chapter through a judicious combination of RDMA permissions and distributed algorithmic design. We implemented Mu and used it to replicate several systems: a financial exchange app called Liquibook, Redis, Memcached, and HERD [132]. Our evaluation shows that Mu incurs a small replication latency, in some cases being the only viable replication system that incurs an acceptable overhead.

3.1 Introduction

Enabled by modern technologies such as RDMA, microsecond-scale computing is emerging as a must [26]. A microsecond app might be expected to process a request in 10 microseconds. Areas where software systems care about microsecond performance include finance (e.g., trading systems), embedded computing (e.g., control systems), and microservices (e.g., key-value stores).

Some of these areas are critical and it is desirable to replicate their microsecond apps across many hosts to provide high availability, due to economic, safety, or robustness reasons. Typically, a system may have hundreds of microservice apps [93], some of which are stateful and can disrupt a global execution if they fail (e.g., key-value stores)—these apps should be replicated for the sake of the whole system.

The golden standard to replicate an app is State Machine Replication (SMR) [211], whereby replicas execute requests in the same total order determined by a consensus protocol. Unfortunately, traditional SMR systems add hundreds of microseconds of overhead even on a fast network [119]. Recent work explores modern hardware in order to improve the performance of replication [129, 131, 136, 140, 199, 225]. The fastest of these (e.g., Hermes [136], DARE [199], and HovercRaft [140]) induce however an overhead of several microseconds, which is clearly high for apps that themselves take few microseconds. Furthermore, when a failure occurs, prior systems incur a prohibitively large fail-over time in the tens of milliseconds (not microseconds). For instance, HovercRaft takes 10 milliseconds, DARE 30 milliseconds, and Hermes at least 150 milliseconds. The rationale for such large latencies is the use of timeouts that account for the natural fluctuations in the latency of modern networks. Improving replication and fail-over latencies requires fundamentally new techniques.

We propose Mu, a new SMR system that adds less than 1.3 microseconds to replicate a (small) app request, with the 99th-percentile at 1.6 microseconds. Although Mu is a general-purpose SMR scheme for a generic app, Mu really shines with microsecond apps, where even the smallest replication overhead is significant. Compared to the fastest prior system, Mu is able to cut 61% of its latency. This is the smallest latency possible with current RDMA hardware, as it corresponds to one round of one-sided communication.

To achieve this performance, Mu introduces a new SMR protocol that carefully leverages RDMA for replication. Our protocol reaches consensus and replicates a request with just one round of parallel RDMA write operations on a majority of replicas. This is in contrast to prior systems, which take multiple rounds [129, 199, 225] or resort to two-sided communication [119, 131, 142, 169]. Roughly, in Mu the leader replicates a request by simply using RDMA to write it to the log of each replica, without additional rounds of communication. Doing this correctly is challenging because concurrent leaders may try to write to the logs simultaneously. In fact, the hardest part of most replication protocols is the mechanism to protect against races of concurrent leaders (e.g., Paxos proposal numbers [146]). Traditional replication implements this mechanism using send-receive communication (two-sided operations) or multiple rounds of communication. Instead, Mu uses RDMA write permissions to guarantee that a replica’s log can be written by only one leader. Critical to correctness are the mechanisms to change leaders and garbage collect logs, as we describe in the chapter.

Mu also improves fail-over time to just 873 microseconds, with the 99-th percentile at 945 microseconds, which cuts fail-over time of prior systems by an order of magnitude. The fact that Mu significantly improves both replication overhead and fail-over latency is perhaps

surprising: folklore suggests a trade-off between the latencies of replication in the fast path, and fail-over in the slow path.

The fail-over time of Mu has two parts: failure detection and leader change. For failure detection, traditional SMR systems typically use a timeout on heartbeat messages from the leader. Due to large variances in network latencies, timeout values are in the 10–100ms range even with the fastest networks. This is clearly high for microsecond apps. Mu uses a conceptually different method based on a pull-score mechanism over RDMA. The leader increments a heartbeat counter in its local memory, while other replicas use RDMA to periodically read the counter and calculate a badness score. The score is the number of successive reads that returned the same value. Replicas declare a failure if the score is above a threshold, corresponding to a timeout. Different from traditional heartbeats, this method can use an aggressively small timeout without false positives because network delays slow down the reads rather than the heartbeat. In this way, Mu detects failures usually within ∼600 microseconds. This is bottlenecked by variances in process scheduling, as we discuss later.
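The scoring logic itself is tiny, as the following C++ sketch shows; rdma_read_counter is an assumed helper standing in for a one-sided RDMA Read of the leader's heartbeat counter, and the polling period and threshold are illustrative.

#include <chrono>
#include <cstdint>
#include <thread>

std::uint64_t rdma_read_counter();   // assumed: RDMA Read of the leader's local counter

// Block until the leader is suspected. The score counts successive reads that returned
// the same value; a slow network slows the reads themselves rather than the heartbeat,
// so only a leader that stopped incrementing drives the score up.
void wait_until_leader_suspected(int threshold) {
  int score = 0;
  std::uint64_t last = rdma_read_counter();
  while (score < threshold) {
    std::this_thread::sleep_for(std::chrono::microseconds(20));   // illustrative period
    std::uint64_t cur = rdma_read_counter();
    score = (cur == last) ? score + 1 : 0;
    last = cur;
  }
  // Score reached the threshold: declare the leader failed and trigger an election.
}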

For leader change, the latency comes from the cost of changing RDMA write permissions, which with current NICs is hundreds of microseconds. This is higher than we expected: it is far slower than RDMA reads and writes, which go over the network. We attribute this delay to a lack of hardware optimization. RDMA has many methods to change permissions: (1) re-register memory regions, (2) change queue-pair access flags, or (3) close and reopen queue pairs. We carefully evaluate the speed of each method and propose a scheme that combines two of them using a fast-slow path to minimize latency. Despite our efforts, the best way to cut this latency further is to improve the NIC hardware.

We prove that Mu provides strong consistency in the form of linearizability [116], despite crashes and asynchrony, and it ensures liveness under the same assumptions as Paxos [146].

We implemented Mu and used it to replicate several apps: a financial exchange app called Liquibook [161], Redis, Memcached, and an RDMA-based key-value store called HERD [132].

We evaluate Mu extensively, by studying its replication latency stand-alone or integrated into each of the above apps. We find that, for some of these apps (Liquibook, HERD), Mu is the only viable replication system that incurs a reasonable overhead. This is because Mu's latency is significantly lower, by a factor of at least 2.7×, compared to other replication systems. We also report on our study of Mu's fail-over latency, with a breakdown of its components, suggesting ways to improve the infrastructure to further reduce the latency.

Mu has some limitations. First, Mu relies on RDMA and so it is suitable only for networks with RDMA, such as local area networks, but not across the wide area. Second, Mu is an in-memory system that does not persist data in stable storage—doing so would add additional latency dependent on the device speed 1. However, we observe that the industry is working on extensions of RDMA for persistent memory, whereby RDMA writes can be flushed at a remote

1For fairness, all SMR systems that we compare against also operate in-memory.

persistent memory with minimum latency [216]—once available, this extension will provide persistence for Mu.

To summarize, we make the following contributions:

• We propose Mu, a new SMR system with low replication and fail-over latencies.

• To achieve its performance, Mu leverages RDMA permissions and a scoring mechanism over heartbeat counters.

• We give the complete correctness proof of Mu (see Appendix B).

• We implement Mu, and evaluate both its raw performance and its performance in microsecond apps. Results show that Mu significantly reduces replication latencies to an acceptable level for microsecond apps.

One might argue that Mu is ahead of its time, as most apps today are not yet microsecond apps. However, this situation is changing. We already have important microsecond apps in areas such as trading, and more will come as existing timing requirements become stricter and new systems emerge as the composition of a large number of microservices (§3.2.1).

3.2 Background

3.2.1 Microsecond Applications and Computing

Apps that are consumed by humans typically work at the millisecond scale: to the human brain, the lowest reported perceptible latency is 13 milliseconds [201]. Yet, we see the emergence of apps that are consumed not by humans but by other computing systems. An increasing number of such systems must operate at the microsecond scale, for competitive, physical, or composition reasons. Schneider [210] speaks of a microsecond market where traders spend massive resources to gain a microsecond advantage in their high-frequency trading. Industrial robots must orchestrate their motors with microsecond granularity for precise movements [15]. Modern distributed systems are composed of hundreds [93] of stateless and stateful microservices, such as key-value stores, web servers, load balancers, and ad services—each operating as an independent app whose latency requirements are gradually decreasing to the microsecond level [36], as the number of composed services is increasing. With this trend, we already see the emergence of key-value stores with microsecond latency (e.g., [131, 186]).

To operate at the microsecond scale, the computing ecosystem must be improved at many layers. This is happening gradually by various recent efforts. Barroso et al. [26] argue for better support of microsecond-scale events. The latest Precision Time Protocol improves clock synchronization to achieve submicrosecond accuracy [13]. And other recent work improves


CPU scheduling [36, 190, 202], thread management [204], power management [203], RPC handling [70, 131], and the network stack [190]—all at the microsecond scale. Mu fits in this context, by providing microsecond SMR.

3.2.2 State Machine Replication

State Machine Replication (SMR) replicates a service (e.g., a key-value storage system) across multiple physical servers called replicas, such that the system remains available and consistent even if some servers fail. SMR provides strong consistency in the form of linearizability [116]. A common way to implement SMR, which we adopt in this chapter, is as follows: each replica has a copy of the service software and a log. The log stores client requests. We consider non-durable SMR systems [125, 130, 160, 164, 192, 208], which keep state in memory only, without logging updates to stable storage. Such systems make an item of data reliable by keeping copies of it in the memory of several nodes. Thus, the data remains recoverable as long as there are fewer simultaneous node failures than data copies [199].

A consensus protocol ensures that all replicas agree on what request is stored in each slot of the log. Replicas then apply the requests in the log (i.e., execute the corresponding operations), in log order. Assuming that the service is deterministic, this ensures all replicas remain in sync. We adopt a leader-based approach, in which a dynamically elected replica called the leader communicates with the clients and sends back responses after requests reach a majority of replicas. We assume a crash-failure model: servers may fail by crashing, after which they stop executing.

Recall that a consensus protocol must ensure safety and liveness properties. Safety here means (1) agreement (different replicas do not obtain different values for a given log slot) and (2) validity (replicas do not obtain spurious values). Liveness means termination—every live replica eventually obtains a value. We guarantee agreement and validity in an asynchronous system, while termination requires eventual synchrony and a majority of non-crashed replicas, as in typical consensus protocols.
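For concreteness, at each replica this scheme reduces to an in-order replay loop over the agreed log, as in the following C++ sketch; the Service type and its deterministic apply are placeholders for the replicated app.

#include <cstddef>
#include <string>
#include <vector>

struct Service {                                   // placeholder deterministic app
  std::string apply(const std::string& request) { return "ok:" + request; }
};

struct Replica {
  Service service;
  std::vector<std::string> log;                    // requests agreed upon by consensus
  std::size_t applied = 0;                         // next log slot to execute
  void replay() {
    while (applied < log.size())
      service.apply(log[applied++]);               // execute in log order; determinism of
  }                                                // apply keeps all replicas in sync
};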

3.2.3 RDMA

Recall from Chapter 2 that Remote Direct Memory Access (RDMA) allows a host to access the memory of another host without involving the processor at the other host. RDMA enables low-latency communication by bypassing the OS kernel and by implementing several layers of the network stack in hardware.

RDMA supports many operations: Send/Receive, Write/Read, and Atomics (compare-and-swap, fetch-and-increment). Because of their lower latency, we use only RDMA Writes and Reads. RDMA has several transports; we use Reliable Connection (RC) to provide in-order reliable delivery.



Figure 3.1 – Architecture of Mu. Grey color shows Mu components. A replica is either a leader or a follower, with different behaviors. The leader captures client requests and writes them to the local logs of all replicas. Followers replay the log to inject the client requests into the application. A leader election component includes a heartbeat and the identity of the current leader. A permission management component allows a leader to request write permission to the local log while revoking the permission from other nodes.

RDMA connection endpoints are called Queue Pairs (QPs). Each QP is associated to a Completion Queue (CQ). Operations are posted to QPs as Work Requests (WRs). The RDMA hardware consumes the WR, performs the operation, and posts a Work Completion (WC) to the CQ. Applications make local memory available for remote access by registering local virtual memory regions (MRs) with the RDMA driver. Both QPs and MRs can have different access modes (e.g., read-only or read-write). The access mode is specified when initializing the QP or registering the MR, but can be changed later. MRs can overlap: the same memory can be registered multiple times, yielding multiple MRs, each with its own access mode. In this way, different remote machines can have different access levels to the same memory. The same effect can be obtained by using different access flags for the QPs used to communicate with remote machines.
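To make these notions concrete, the following C++ sketch posts a one-sided RDMA Write as a work request and busy-polls the completion queue for the corresponding work completion (libibverbs; error handling omitted, and the QP, CQ, MR and the remote address/rkey are assumed to have been set up and exchanged beforehand).

#include <infiniband/verbs.h>
#include <cstdint>

void rdma_write(ibv_qp* qp, ibv_cq* cq, ibv_mr* mr,
                void* local_buf, std::uint32_t len,
                std::uint64_t remote_addr, std::uint32_t rkey) {
  ibv_sge sge{};
  sge.addr   = reinterpret_cast<std::uint64_t>(local_buf);
  sge.length = len;
  sge.lkey   = mr->lkey;

  ibv_send_wr wr{}, *bad = nullptr;
  wr.opcode              = IBV_WR_RDMA_WRITE;   // one-sided: no remote CPU involvement
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.send_flags          = IBV_SEND_SIGNALED;   // request a work completion
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey        = rkey;

  ibv_post_send(qp, &wr, &bad);                 // post the WR to the QP

  ibv_wc wc{};
  while (ibv_poll_cq(cq, 1, &wc) == 0) {}       // reap the WC from the CQ
  // wc.status == IBV_WC_SUCCESS means the write reached the remote memory
}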

3.3 Overview of Mu

3.3.1 Architecture

Figure 3.1 depicts the architecture of Mu. At the top, a client sends requests to an application and receives a response. We are not particularly concerned about how the client communicates with the application: it can use a network, a local pipe, a function call, etc. We do assume, however, that this communication is amenable to being captured and injected. That is, there is a mechanism to capture requests from the client before they reach the application, so that we can forward these requests to the replicas; a request is an opaque buffer that is not interpreted by Mu. Similarly, there is a mechanism to inject requests into the app. Providing

such mechanisms requires changing the application; however, in our experience, the changes are small and non-intrusive. These mechanisms are standard in any SMR system.

Each replica has an idea of which replica is currently the leader. A replica that considers itself the leader assumes that role (left of figure), and otherwise, assumes the role of a follower (right of figure). Each replica grants RDMA write permission to its log for its current leader and no other replica. The replicas constantly monitor their current leader to check that it is still active. The replicas might not agree on who the current leader is. But in the common case, all replicas have the same leader, and that leader is active. When that happens, Mu is simple and efficient. The leader captures a client request, uses an RDMA Write to append that request to the log of each follower, and then continues the application to process the request. When the followers detect a new request in their log, they inject the request into the application, thereby updating the replicas.

The main challenge in the design of SMR protocols is to handle leader failures. Of particular concern is the case when a leader appears failed (due to intermittent network delays) so another leader takes over, but the original leader is still active.

To detect failures in Mu, the leader periodically increments a local counter; the followers periodically check the counter using an RDMA Read. If the followers do not detect an increment of the counter after a few tries, a new leader is elected.

The new leader revokes the write permission of any old leaders, thereby ensuring that old leaders cannot interfere with the operation of the new leader. The new leader also reconstructs any partial work left by prior leaders.

Both the leader and the followers are internally divided into two major parts: the replication plane and the background plane. Roughly, the replication plane is responsible for copying requests captured by the leader to the followers, and for replaying those requests at the followers’ copy of the application. The background plane monitors the health of the leader and handles permission changes. Each plane has its own threads and queue pairs, in order to improve parallelism and to isolate performance and functionality. More specifically, the following components exist in each of the planes.

The replication plane has three components:

• Replicator. This component implements the main protocol to replicate a request from the leader to the followers, by writing the request in the followers’ logs using RDMA Write.

• Replayer. This component replays entries from the local log.

• Logging. This component stores client requests to be replicated. Each replica has its own local log, which may be written remotely by other replicas according to previously granted permissions. Replicas also keep a copy of remote logs, which is used by a new


leader to reconstruct partial log updates by older leaders.

The background plane has two components:

• Leader election. This component detects failures of leaders and selects other replicas to become leader.

• Permission management. This component grants and revokes write access of local data by remote replicas. It maintains a permissions array, which stores access requests by remote replicas. Basically, a remote replica uses RDMA to store a 1 in this array to request access.

We describe these planes in more detail in §3.4 and §3.5.

3.3.2 RDMA Communication

Each replica has two QPs for each remote replica: one QP for the replication plane and one for the background plane. The QPs for the replication plane share a completion queue, while the QPs for the background plane share another completion queue. The QPs operate in Reliable Connection (RC) mode.
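
A hedged sketch of this setup in ibverbs is shown below: one completion queue per plane, shared by that plane's QPs, all in RC mode. The names (ctx, pd, MAX_CQE) and capacities are illustrative, not Mu's actual configuration.

struct ibv_cq *repl_cq = ibv_create_cq(ctx, MAX_CQE, nullptr, nullptr, 0);
struct ibv_cq *bg_cq   = ibv_create_cq(ctx, MAX_CQE, nullptr, nullptr, 0);

struct ibv_qp_init_attr attr = {};
attr.qp_type = IBV_QPT_RC;            /* Reliable Connection transport */
attr.send_cq = repl_cq;               /* replication-plane QPs share repl_cq */
attr.recv_cq = repl_cq;
attr.cap.max_send_wr  = 128;
attr.cap.max_recv_wr  = 128;
attr.cap.max_send_sge = 1;
attr.cap.max_recv_sge = 1;
struct ibv_qp *repl_qp = ibv_create_qp(pd, &attr);

attr.send_cq = bg_cq;                 /* background-plane QP to the same replica */
attr.recv_cq = bg_cq;
struct ibv_qp *bg_qp = ibv_create_qp(pd, &attr);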

Each replica also maintains two MRs, one for each plane. The MR of the replication plane contains the consensus log and the MR of the background plane contains metadata for leader election (§3.5.1) and permission management (§3.5.2). During execution, replicas may change the level of access to their log that they give to each remote replica; this is done by changing QP access flags. Note that all replicas always have remote read and write access permissions to the memory region in the background plane of each replica.

3.4 Replication Plane

The replication plane takes care of execution in the common case, but remains safe during leader changes. This is where we take care to optimize the latency of the common path. We do so by ensuring that, in the replication plane, only a leader replica communicates over the network, whereas all follower replicas are silent (i.e., only do local work).

In this section, we discuss algorithmic details related to replication in Mu. For pedagogical reasons, we first describe a basic version of the algorithm and then discuss extensions and optimizations to improve functionality and performance. We give the intuition for why our algorithm is correct in this section and we provide a full correctness argument in Appendix B.


3.4.1 Basic Algorithm

The leader captures client requests, and calls propose to replicate these requests. It is simplest to understand our replication algorithm relative to the Paxos algorithm, which we briefly summarize; for details, we refer the reader to [146]. In Paxos, for each slot of the log, a leader first executes a prepare phase where it sends a proposal number to all replicas.² A replica replies with either nack if it has seen a higher proposal number, or otherwise with the value with the highest proposal number that it has accepted. After getting a majority of replies, the leader adopts the value with the highest proposal number. If it got no values (only acks), it adopts its own proposal value. In the next phase, the accept phase, the leader sends its proposal number and adopted value to all replicas. A replica acks if it has not received any prepare phase message with a higher proposal number.

In Paxos, replicas actively reply to messages from the leader, but in our algorithm, replicas are silent and communicate information passively by publishing it to their memory. Specifically, along with their log, a replica publishes a minProposal representing the minimum proposal number which it can accept. The correctness of our algorithm hinges on the leader reading and updating the minProposal number of each follower before updating anything in its log, and on updates on a replica’s log happening in slot-order.

However, this by itself is not enough; Paxos relies on active participation from the followers not only for the data itself, but also to avoid races. Simply publishing the relevant data on each replica is not enough, since two competing leaders could miss each other’s updates. This can be avoided if each of the leaders rereads the value after writing it [92]. However, this requires more communication. To avoid this, we shift the focus from the communication itself to the prevention of bad communication. A leader ℓ maintains a set of confirmed followers, which have granted write permission to ℓ and revoked write permission from other leaders before ℓ begins its operation. This is what prevents races among leaders in Mu. We describe these mechanisms in more detail below.

Log Structure. The main data structure used by the algorithm is the consensus log kept at each replica (Listing 3.1). The log consists of (1) a minProposal number, representing the smallest proposal number with which a leader may enter the accept phase on this replica; (2) a first undecided offset (FUO), representing the lowest log index which this replica believes to be undecided; and (3) a sequence of slots—each slot is a (propNr, value) tuple.

Listing 3.1 – Log Structure.

struct Log {
  minProposal = 0,
  FUO = 0,
  slots[] = (0, ⊥) for all slots
}

² Paxos uses the terms proposer and acceptor; instead, we use leader and replica.


Algorithm 3.2 – Basic Replication Algorithm of Mu.

 1 Propose(myValue):
 2   done = false
 3   If I just became leader or I just aborted:
 4     For every process p in parallel:
 5       Request permission from p
 6       If p acks, add p to confirmedFollowers
 7     Until this has been done for a majority
 8   While not done:
 9     Execute Prepare Phase
10     Execute Accept Phase

12 Prepare Phase:
13   For every process p in confirmedFollowers:
14     Read minProposal from p's log
15   Pick a new proposal number, propNum, higher than any minProposal seen so far
16   For every process p in confirmedFollowers:
17     Write propNum into LOG[p].minProposal
18     Read LOG[p].slots[myFUO]
19     Abort if any write fails
20   if all entries read were empty:
21     value = myValue
22   else:
23     value = entry value with the largest proposal number of slots read

25 Accept Phase:
26   For every process p in confirmedFollowers:
27     Write value, propNum to p in slot myFUO
28     Abort if any write fails
29   If value == myValue:
30     done = true
31   Locally increment myFUO

Algorithm Description. Each leader begins its propose call by constructing its confirmed followers set (Algorithm 3.2, lines 4–7). This step is only necessary the first time a new leader invokes propose or immediately after an abort. This step is done by sending permission requests to all replicas and waiting for a majority of acks. When a replica acks, it means that this replica has granted write permission to this leader and revoked it from other replicas. The leader then adds this replica to its confirmed followers set. During execution, if the leader ℓ fails to write to one of its confirmed followers, because that follower crashed or gave write access to another leader, ℓ aborts and, if it still thinks it is the leader, it calls propose again.

After establishing its confirmed followers set, the leader invokes the prepare phase. To do so, the leader reads the minProposal from its confirmed followers (line 14) and chooses a proposal number propNum which is larger than any that it has read or used before. Then, the leader writes its proposal number into minProposal for each of its confirmed followers. Recall that if this write fails at any follower, the leader aborts. It is safe to overwrite a follower f’s

minProposal in line 17 because, if that write succeeds, then ℓ has not lost its write permission since adding f to its confirmed followers set, meaning no other leader wrote to f since then. To complete its prepare phase, the leader reads the relevant log slot of all of its confirmed followers and, as in Paxos, adopts either (a) the value with the highest proposal number, if it read any non-⊥ values, or (b) its own initial value, otherwise.

The leader ℓ then enters the accept phase, in which it tries to commit its previously adopted value. To do so, ℓ writes its adopted value to its confirmed followers. If these writes succeed, then ℓ has succeeded in replicating its value. No new value or minProposal number could have been written on any of the confirmed followers in this case, because that would have involved a loss of write permission for ℓ. Since the confirmed followers set constitutes a majority of the replicas, this means that ℓ’s replicated value now appears in the same slot at a majority.

Finally, ℓ increments its own FUO to denote successfully replicating a value in this new slot. If the replicated value was ℓ’s own proposed value, then it returns from the propose call; otherwise it continues with the prepare phase for the new FUO.

3.4.2 Extensions

The basic algorithm described so far is clear and concise, but it also has downsides related to functionality and performance. We now address these downsides with some extensions, all of which are standard for Paxos-like algorithms; their correctness is discussed in the supplementary material.

Bringing stragglers up to date. In the basic algorithm, if a replica r is not included in some leader’s confirmed followers set, then its log will lag behind. If r later becomes leader, it can catch up by proposing new values at its current FUO, discovering previously accepted values, and re-committing them. This is correct but inefficient. Even worse, if r never becomes leader, then it will never recover the missing values. We address this problem by introducing an update phase for new leaders. After a replica becomes leader and establishes its confirmed followers set, but before attempting to replicate new values, the new leader (1) brings itself up to date with its highest-FUO confirmed follower and (2) brings its followers up to date. This is done by copying the contents of the more up-to-date log to the less up-to-date log.

Followers commit in background. In the basic algorithm, followers do not know when a value is committed and thus cannot replay the requests in the application. This is easily fixed without additional communication. Since a leader will not start replicating in an index i before it knows index i − 1 to be committed, followers can monitor their local logs and commit all values up to (but excluding) the highest non-empty log index. This is called commit piggybacking, since the commit message is folded into the next replicated value.
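
A hedged sketch of the follower-side rule (helper names are illustrative, not Mu's code): every non-empty slot strictly below the highest non-empty one is known to be committed and can be replayed.

size_t highest = highest_non_empty_index(log);   /* leader only writes slot i after i-1 is decided */
while (committed < highest) {
  replay(log.slots[committed].value);            /* inject into the local application copy */
  ++committed;
}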


Omitting the prepare phase. Once a leader finds only empty slots at a given index at all of its confirmed followers at line 18, then no higher index may contain an accepted value at any confirmed follower; thus, the leader may omit the prepare phase for higher indexes (until it aborts, after which the prepare phase becomes necessary again). This optimization concerns performance on the common path. With this optimization, the cost of a Propose call becomes a single RDMA write to a majority in the common case.

Growing confirmed followers. In our algorithm, the confirmed followers set remains fixed after the leader initially constructs it. This implies that processes outside the leader’s confirmed followers set will miss updates, even if they are alive and timely, and that the leader will abort even if one of its followers crashes. To avoid this problem, we extend the algorithm to allow the leader to grow its confirmed followers set by adding replicas which respond to its initial request for permission. The leader must bring these replicas up to date before adding them to its set. When its confirmed follower set is large, the leader cannot wait for its RDMA reads and writes to complete at all of its confirmed followers before continuing, since we require the algorithm to continue operating despite the failure of a minority of the replicas; instead, the leader waits for just a majority of the replicas to complete.

Replayer. Followers continually monitor the log for new entries. This creates a challenge: how to ensure that the follower does not read an incomplete entry that has not yet been fully written by the leader. We adopt a standard approach: we add an extra canary byte at the end of each log entry [162, 225]. Before issuing an RDMA Write to replicate a log entry, the leader sets the entry’s canary byte to a non-zero value. The follower first checks the canary and then the entry contents. In theory, it is possible that the canary gets written before the other contents under RDMA semantics. In practice, however, NICs provide left-to-right semantics in certain cases (e.g., the memory region is in the same NUMA domain as the NIC), which ensures that the canary is written last. This assumption is made by other RDMA systems [84, 85, 132, 162, 225]. Alternatively, we could store a checksum of the data in the canary, and the follower could read the canary and wait for the checksum to match the data.
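
A hedged sketch of the follower-side check follows; the slot layout and names are illustrative, not Mu's exact wire format. The canary is the last byte of the entry and is set by the leader before posting the RDMA Write that carries the entry.

struct Slot {
  uint64_t propNr;
  uint32_t len;
  char     payload[MAX_PAYLOAD];      /* MAX_PAYLOAD is an assumed constant */
  volatile uint8_t canary;            /* non-zero once the entry is complete */
};

const Slot &s = log.slots[next];
while (s.canary == 0) { /* spin: the entry has not been fully written yet */ }
inject(s.payload, s.len);             /* now safe to read the entry contents */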

3.5 Background Plane

The background plane has two main roles: electing and monitoring the leader, and handling permission change requests. In this section, we describe these mechanisms.

3.5.1 Leader Election

The leader election component of the background plane maintains an estimate of the current leader, which it continually updates. The replication plane uses this estimate to determine whether to execute as leader or follower.


Each replica independently and locally decides who it considers to be leader. We opt for a simple rule: replica i decides that j is leader if j is the replica with the lowest id, among those that i considers to be alive.

To know whether a replica has failed, we employ a pull-score mechanism, based on a local heartbeat counter. A leader election thread continually increments its own counter locally and uses RDMA Reads to read the counters (heartbeats) of other replicas and check whether they have been updated. It maintains a score for every other replica. If a replica has updated its counter since the last time it was read, we increment that replica’s score; otherwise, we decrement it. Once a replica’s score drops below a threshold, we consider it to have failed. To avoid oscillation, we have different failure and recovery thresholds, chosen so as to avoid false positives.
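
A hedged sketch of one scoring round is shown below; the threshold names and helper functions are illustrative, not Mu's actual code.

for (Replica &r : replicas) {
  uint64_t hb = rdma_read_heartbeat(r);                 /* one-sided RDMA Read */
  if (hb != r.last_hb) { r.score++; r.last_hb = hb; }   /* counter moved: reward */
  else                 { r.score--; }                   /* counter stale: penalize */
  if (r.alive && r.score < FAIL_THRESHOLD)           r.alive = false;  /* suspect */
  else if (!r.alive && r.score > RECOVERY_THRESHOLD) r.alive = true;   /* trust again */
}
/* Leader rule: the lowest-id replica among those currently considered alive. */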

3.5.2 Permission Management

The permission management module is used when changing leaders. Each replica maintains the invariant that only one replica at a time has write permission on its log. As explained in Section 3.4, when a leader changes in Mu, the new leader must request write permission from all the other replicas; this is done through a simple RDMA Write to a permission request array on the remote side. When a replica r sees a permission request from a would-be leader ℓ, r revokes write access from the current holder, grants write access to ℓ, and sends an ack to ℓ.

During the transition phase between leaders, it is possible that several replicas think themselves to be leader, and thus the permission request array may contain multiple entries. A permission management thread monitors and handles permission change requests one by one, in order of requester id, by spinning on the local permission request array.

RDMA provides multiple mechanisms to grant and revoke write access. The first mechanism is to register the consensus log as multiple overlapping RDMA memory regions (MRs), one per remote replica. In order to grant or revoke access from replica r, it suffices to re-register the MR corresponding to r with different access flags. The second mechanism is to revoke r’s write access by moving r’s QP to a non-operational state (e.g., init); granting r write access is then done by moving r’s QP back to the ready-to-receive (RTR) state. The third mechanism is to grant or revoke access from replica r by changing the access flags on r’s QP.

We compare the performance of these three mechanisms in Figure 3.2, as a function of the log size (which is the same as the RDMA MR size). We observe that the time to re-register an RDMA MR grows with the size of the MR, and can reach values close to 100ms for a log size of 4GB. On the other hand, the time to change a QP’s access flags or to cycle it through different states is independent of the MR size, with the former being roughly 10 times faster than the latter. However, changing a QP’s access flags while RDMA operations to that QP are in flight sometimes causes the QP to go into an error state. Therefore, in Mu we use a fast-slow path approach: we first optimistically try to change permissions using the faster QP access flag method and, if that leads to an error, switch to the slower, but robust, QP state method.
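
The following hedged sketch shows the shape of this fast-slow path in ibverbs; error handling and the full attribute lists required by the QP state transitions are omitted, and revoke_write and the flag choices are illustrative.

int revoke_write(struct ibv_qp *qp) {
  struct ibv_qp_attr attr = {};
  attr.qp_access_flags = IBV_ACCESS_REMOTE_READ;          /* drop REMOTE_WRITE */
  /* Fast path: change the access flags of the connected QP in place. */
  if (ibv_modify_qp(qp, &attr, IBV_QP_ACCESS_FLAGS) == 0)
    return 0;
  /* Slow path: cycle the QP through reset, init, RTR and RTS with the new
   * flags; each transition needs further attributes, elided here. */
  attr.qp_state = IBV_QPS_RESET;
  ibv_modify_qp(qp, &attr, IBV_QP_STATE);
  /* ... re-initialize to INIT, RTR and RTS with read-only access ... */
  return 0;
}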

[Figure 3.2 plots the time to grant/revoke access, in µs on a log scale, against the log size (4MB to 4GB) for the three mechanisms.]

Figure 3.2 – Performance comparison of different permission switching mechanisms. QP Flags: change the access flags on a QP; QP Restart: cycle a QP through the reset, init, RTR and RTS states; MR Rereg: re-register an RDMA MR with different access flags.


3.5.3 Log Recycling

Conceptually, a log is an infinite data structure but in practice we need to implement a circular log with limited memory. This is done as follows. Each follower has a local log head variable, pointing to the first entry not yet executed in its copy of the application. The replayer thread advances the log head each time it executes an entry in the application. Periodically, the leader’s background plane reads the log heads of all followers and computes minHead, the minimum of all log head pointers read from the followers. Log entries up to the minHead can be reused. Before these entries can be reused, they must be zeroed out to ensure the correct function of the canary byte mechanism. Thus, the leader zeroes all follower logs after the leader’s first undecided offset and before minHead, using an RDMA Write per follower. Note that this means that a new leader must first execute all leader change actions, ensuring that its first undecided offset is higher than all followers’ first undecided offsets, before it can recycle entries. To facilitate the implementation, we ensure that the log is never completely full.

3.5.4 Adding and Removing Replicas

Mu adopts a standard method to add or remove replicas: use consensus itself to inform replicas about the change [146]. More precisely, there is a special log entry that indicates that replicas have been removed or added. Removing replicas is easy: once a replica sees it has been removed, it stops executing, while other replicas subsequently ignore any communication with it. Adding replicas is more complicated because it requires copying the state of an existing replica into the new one. To do that, Mu uses the standard approach of check-pointing state, and we do that from one of the followers [225].


3.6 Implementation

Mu is implemented in 7157 lines of C++ code (CLOC [71]). It uses the ibverbs library for RDMA over Infiniband. We implemented all features and extensions in sections 3.4 and 3.5, except adding/removing replicas. Moreover, we implement some standard RDMA optimizations to reduce latency. RDMA Writes and Sends with payloads below a device-specific limit (256 bytes in our setup) are inlined, meaning that their payload is written directly to their work request. We pin threads to cores in the NUMA node of the NIC.
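
As an illustration of the inlining optimization, the following hedged ibverbs sketch posts an inlined RDMA Write; the buffer names, offsets and keys are illustrative, not Mu's actual variables.

struct ibv_sge sge = { (uintptr_t)req_buf, req_len, 0 /* lkey unused when inlined */ };
struct ibv_send_wr wr = {}, *bad_wr = nullptr;
wr.opcode              = IBV_WR_RDMA_WRITE;
wr.sg_list             = &sge;
wr.num_sge             = 1;
wr.send_flags          = IBV_SEND_INLINE | IBV_SEND_SIGNALED;  /* payload copied into the WR */
wr.wr.rdma.remote_addr = follower_log_addr + slot_offset;
wr.wr.rdma.rkey        = follower_log_rkey;
ibv_post_send(qp, &wr, &bad_wr);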

3.7 Evaluation

Our goal is to evaluate whether Mu indeed provides viable replication for microsecond com- puting. We aim to answer the following questions in our evaluation:

• What is the replication latency of Mu? How does it change with payload size and the application being replicated? How does Mu compare to other solutions?

• What is Mu’s fail-over time?

• What is the throughput of Mu?

We evaluate Mu on a 4-node cluster, where each node has two Intel Xeon E5-2640 v4 CPUs @ 2.40GHz (20 cores, 40 threads total per node), 256 GB of RAM equally split across the two NUMA domains, and a Mellanox Connect-X 4 NIC. All 4 nodes are connected to a single 100 Gbps Mellanox MSB7700 switch through 100 Gbps Infiniband. All experiments show 3-way replication, which accounts for most real deployments [119]. With more replicas (results omitted for brevity), replication latency increases gradually with the number of replicas, by up to 35% for Mu (at 9 replicas); the other systems show a larger increase than Mu at every replication level.

We compare against APUS [225], DARE [199], and Hermes [136] where possible. The most recent system, HovercRaft [140], also provides SMR, but its latency of 30–60 microseconds is substantially higher than that of the other systems, so we do not consider it further. For a fair comparison, we disable APUS’s persistence to stable storage, since Mu, DARE, and Hermes all provide only in-memory replication.

We measure time using the POSIX clock_gettime function with the CLOCK_MONOTONIC parameter. In our deployment, the resolution and overhead of clock_gettime are around 16ns [82]. In our figures, we show bars labeled with the median latency, with error bars showing 99-percentile and 1-percentile latencies. These statistics are computed over 1 million samples with a payload of 64 bytes each, unless otherwise stated.
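
A hedged sketch of how a single latency sample can be taken with CLOCK_MONOTONIC is given below; propose, payload and payload_len are illustrative, not Mu's exact harness.

#include <ctime>
#include <cstdint>

timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
propose(payload, payload_len);                    /* replicate one 64-byte request */
clock_gettime(CLOCK_MONOTONIC, &t1);
uint64_t ns = (t1.tv_sec - t0.tv_sec) * 1000000000ULL + (t1.tv_nsec - t0.tv_nsec);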


Applications. We use Mu to replicate several microsecond apps: three key-value stores, as well as an order matching engine for a financial exchange.

The key-value stores that we replicate with Mu are Redis [206], Memcached [176], and HERD [132]. For the first two, the client is assumed to be on a different cluster, and connects to the servers over TCP. In contrast, HERD is a microsecond-scale RDMA-based key-value store. We replicate it over RDMA and use it as an example of a microsecond application. Integration with the three applications requires 183, 228 and 196 additional lines of code, respectively.

The other application is in the context of financial exchanges, in which parties unknown to each other submit buy and sell orders for stocks, commodities, derivatives, etc. At the heart of a financial exchange is an order matching engine [14], such as Liquibook [161], which is responsible for matching the buy and sell orders of the parties. We use Mu to replicate Liquibook, whose inputs are buy and sell orders. We created an unreplicated client-server version of Liquibook using eRPC [131], and then replicated this system using Mu. The eRPC integration and the replication required 611 lines of code in total.

3.7.1 Common-Case Latency

We begin by testing the overhead that Mu introduces in normal execution, when there is no leader failure. For these experiments, we first measure raw replication latency and compare Mu to other replication systems, as well as to itself under different payloads and attached applications. We then evaluate client-to-client application latency.

Effect of Payload and Application on Latency. We first study Mu in isolation, to understand its replication latency under different conditions.

We evaluate the raw replication latency of Mu in two settings: standalone and attached. In the standalone setting, Mu runs just the replication layer with no application and no client; the leader simply generates a random payload and invokes propose() in a tight loop. In the attached setting, Mu is integrated into one of a number of applications; the application client produces a payload and invokes propose() on the leader. The two settings can yield different latencies because Mu and the application may interfere with each other.

Figure 3.3 compares standalone to attached runs as we vary payload size. Liquibook and HERD allow only one payload size (32 and 50 bytes, respectively), so they have only one bar each in the graph, while Redis and Memcached have many bars.

We see that the standalone version slightly outperforms the attached runs, for all tested applications and payload sizes. This is due to processor cache effects; in standalone runs, replication state, such as log and queue pairs, are always in cache, and the requests themselves need not be fetched from memory. This is not the case when attaching to an application. Mu supports two ways of attaching to an application, which have different processor cache


sharing. The direct mode uses the same thread to run both the application and the replication, and so they share L1 and L2 caches. In contrast, the handover method places the application thread on a separate core from the replication thread, thus avoiding sharing of L1 or L2 caches. Because the application must communicate the request to the replication thread, the handover method requires a cache-coherence miss per replicated request. This method consistently adds ≈400ns over the standalone method. For applications with large requests, this overhead might be preferable to the one caused by the direct method, where replication and application compete for CPU time. For lighter-weight applications, the direct method is preferable. In our experiments, we measure both methods and show the best method for each application: Liquibook and HERD use the direct method, while Redis and Memcached use the handover method.

Figure 3.3 – Replication latency of Mu integrated into different applications [Memcached (mcd), Liquibook (LiQ), Redis (rds), HERD] and payload sizes.

We see that for payloads under 256 bytes, standalone latency remains constant despite increasing payload size. This is because we can RDMA-inline requests for these payload sizes, so the amount of work needed to send a request remains the same. At a payload of 256 bytes, the NIC must do a DMA itself to fetch the value to be sent, which incurs a gradual increase in overhead as the payload size increases. However, Mu still performs well even at larger payloads; at 512B, the median latency is only 35% higher than the latency of inlined payloads.

Comparing Mu to Other Replication Systems. We now study the replication time of Mu compared to other replication systems, for various applications. This comparison is not possible for every pair of replication system and application, because certain replication systems are incompatible with certain applications. In particular, APUS works only with socket-based applications (Memcached and Redis). In DARE and Hermes, the replication protocol is bolted onto a key-value store, so we cannot attach it to the apps we consider—


instead, we report their performance with their key-value stores.

Figure 3.4 – Replication latency of Mu compared with other replication solutions: DARE, Hermes, Apus on memcached (mcd), and Apus on Redis (rds).

Figure 3.4 shows the replication latencies of these systems. Mu’s median latency outperforms all competitors by at least 2.7×, outperforming APUS on the same applications by 4×. Furthermore, Mu has smaller tail variation, with a difference of at most 500ns between the 1-percentile and 99-percentile latency. In contrast, Hermes and DARE both varied by more than 4µs across our experiments, with APUS exhibiting 99-percentile executions up to 20µs slower (cut off in the figure). We attribute this higher variance to two factors: the need to involve the CPU of many replicas in the critical path (Hermes and APUS), and sequentializing several RDMA operations so that their variance aggregates (DARE and APUS).

End-to-End Latencies. Figure 3.5 shows the end-to-end latency of our tested applications, which includes the latency incurred by the application and by replication (if enabled). We show the result in three graphs corresponding to three classes of applications.

The leftmost graph is for Liquibook. The left bar is the unreplicated version, and the right bar is replicated with Mu. We can see that the median latency of Liquibook without replication is 4.08µs, and therefore the overhead of replication is around 35%. There is a large variance in latency, even in the unreplicated system. This variance comes from the client-server communication of Liquibook, which is based on eRPC. This variance changes little with replication. The other replication systems cannot replicate Liquibook (as noted before, DARE and Hermes are bolted onto their app, and APUS can replicate only socket-based applications). However, extrapolating their latency from Figure 3.4, they would add unacceptable overheads— over 100% overhead for the best alternative (Hermes).

The middle graph in Figure 3.5 shows the client-to-client latency of replicated and unreplicated microsecond-scale key-value stores. The first bars, in orange, show HERD unreplicated and HERD replicated by Mu. The green bar shows DARE’s key-value store with its own replication system. The median unreplicated latency of HERD is 2.25µs, and Mu adds 1.34µs. While this is a significant overhead (59% of the original latency), this overhead is lower than any alternative.



Figure 3.5 – End-to-end latencies of applications, unreplicated (U) and replicated (R). The first graph shows a financial exchange app (Liquibook) unreplicated and replicated with Mu. The second graph shows microsecond key-value stores: HERD unreplicated, HERD replicated with Mu, and DARE. The third graph shows traditional key-value stores: Memcached and Redis, replicated with Mu and APUS. In this graph, each bar is split in two parts: application latency (bottom) and replication latency (top).

We do not show Hermes in this graph since Hermes does not allow for a separate client, and only generates its requests on the servers themselves. HERD replicated with Mu is the best option for a replicated key-value store, with overall median latency 2× lower than the next best option, and with a much lower variance.

The rightmost graph in Figure 3.5 shows the replication of the traditional key-value stores, Redis and Memcached. For these applications, we compare replication with Mu to replication with APUS. Each bar has two parts: the bottom is the latency of the application and client- server communication, and the top is the replication latency. Note that the scale starts at 100µs to show better precision.

Mu replicates these apps about 5µs faster than APUS, a 5% difference. With a faster network, this difference would be bigger. In either case, Mu provides fault tolerant replication with essentially no overhead for these applications.

3.7.2 Fail-Over Time

We now study Mu’s fail-over time. In these experiments, we run the system and subsequently introduce a leader failure. To get a thorough understanding of the fail-over time, we repeatedly introduce leader failures (1000 times) and plot a histogram of the fail-over times we observe. We also time the latency of permission switching, which corresponds to the time to change leaders after a failure is detected. The detection time is the difference between the total fail-over time and the permission switch time.

We inject failures by delaying the leader, thus making it become temporarily unresponsive. This causes other replicas to observe that the leader’s heartbeat has stopped changing, and thus detect a failure.



Figure 3.6 – Fail-over time distribution in Mu.

Figure 3.6 shows the results. We first note that the total fail-over time is quite low; the median fail-over time is 873µs and the 99-percentile fail-over time is 947µs, still below a millisecond. This represents an order of magnitude improvement over the best competitor at ≈10 ms (HovercRaft [140]).

The time to switch permissions constitutes about 30% of the total fail-over time, with mean latency at 244µs, and 99-percentile at 294µs. Recall that this measurement in fact encompasses two changes of permission at each replica; one to revoke write permission from the old leader and one to grant it to the new leader. Thus, improvements in the RDMA permission change protocol would be doubly amplified in Mu’s fail-over time.

The rest of the fail-over time is attributed to failure detection (≈600µs). Although our pull-score mechanism does not rely on network variance, there is still variance introduced by process scheduling (e.g., in rare cases, the leader process is descheduled by the OS for tens of microseconds)—this is what prevented us from using smaller timeouts/scores, and it is an area under active investigation for microsecond apps [36, 190, 202, 204].

3.7.3 Throughput

While Mu optimizes for low latency, in this section we evaluate the throughput of Mu. In our experiment, we run a standalone microbenchmark (not attached to an application). We increase throughput in two ways: by batching requests together before replicating, and by allowing multiple outstanding requests at a time. In each experiment, we vary the maximum number of outstanding requests allowed at a time, and the batch sizes.

Figure 3.7 shows the results in a latency-throughput graph. Each line represents a different max number of outstanding requests, and each data point represents a different batch size. As before, we use 64-byte requests.

We see that Mu reaches high throughput with this simple technique. At its highest point, the throughput reaches 47 Ops/µs with a batch size of 128 and 8 concurrent outstanding requests, with per-operation median latency at 17µs. Since the leader is sending requests to two other replicas, this translates to a throughput of 48Gbps, around half of the bandwidth of the NIC.


Figure 3.7 – Latency vs throughput in Mu. Each line represents a different number of allowed concurrent outstanding requests. Each point on the lines represents a different batch size. Batch size shown as annotation close to each point.

Latency and throughput both increase as the batch size increases. Median latency is also higher with more concurrent outstanding requests. However, the latency increases slowly, remaining at under 10µs even with a batch size of 64 and 8 outstanding requests.

There is a throughput wall at around 45 Ops/µs, with latency rising sharply. This can be traced to the transition between the client requests and the replication protocol at the leader replica. The leader must copy the request it receives into a memory region prepared for its RDMA write. This memory operation becomes a bottleneck. We could optimize throughput further by allowing direct contact between the client and the follower replicas. However, that may not be useful as the application itself might need some of the network bandwidth for its own operation, so the replication protocol should not saturate the network.

Increasing the number of outstanding requests while keeping the batch size constant sub- stantially increases throughput at a small latency cost. The advantage of more outstanding requests is largest with two concurrent requests over one. Regardless of batch size, this allows substantially higher throughput at a negligible latency increase: allowing two outstanding requests instead of one increases latency by at most 400ns for up to a batch size of 32, and only 1.1µs at a batch size of 128, while increasing throughput by 20–50% depending on batch size. This effect grows less pronounced with higher numbers of outstanding requests.

Similarly, increasing batch size increases throughput with a low latency hit for small batch sizes, but the latency hit grows for larger batches. Notably, using 2 outstanding requests and a batch size of 32 keeps the median latency at only 3.4µs, but achieves throughput of nearly 30 Ops/µs.


3.8 Related Work

SMR in General. State machine replication is a common technique for building fault-tolerant, highly available services [146, 211]. Many practical SMR protocols have been designed, addressing simplicity [34, 45, 119, 151, 187], cost [142, 149], and harsher failure assumptions [49, 50, 92, 142]. In the original scheme, which we follow, the order of all operations is agreed upon using consensus instances. At a high level, our Mu protocol resembles the classical Paxos algorithm [146], but there are some important differences. In particular, we leverage RDMA’s ability to grant and revoke access permissions to ensure that two leader replicas cannot both write a value without recognizing each other’s presence. This allows us to optimize out participation from the follower replicas, leading to better performance. Furthermore, these dynamic permissions guide our unique leader-change mechanism.

Several implementations of Multi-Paxos avoid repeating Paxos’s prepare phase for every consensus instance, as long as the same leader remains [53, 147, 169]. Piggybacking a commit message onto the next replicated request, as is done in Mu, is also used as a latency-hiding mechanism in [169, 225].

Aguilera et al. [5] suggested the use of local heartbeats in a leader election algorithm designed for a theoretical message-and-memory model, in an approach similar to our pull-score mechanism. However, no system has so far implemented such local heartbeats for leader election in RDMA.

Single round-trip replication has been achieved in several previous works using two-sided sends and receives [86, 136, 137, 142, 149]. Our theoretical work in Chapter 2 has shown that single-shot consensus can be achieved in a single one-sided round trip. However, Mu is the first system to put that idea to work and implement one-sided single round-trip SMR.

Alternative reliable replication schemes totally order only non-conflicting operations [62, 117, 136, 148, 192, 195, 213]. These schemes require opening the service being replicated to identify which operations commute. In contrast, we designed Mu assuming the replicated service is a black box. If desired, several parallel instances of Mu could be used to replicate concurrent operations that commute. This could be used to increase throughput in specific applications.

It is also important to note that we consider “crash” failures. In particular, we assume nodes cannot behave in a Byzantine manner [49, 61, 142].

Improving the Stack Underlying SMR. While we propose a new SMR algorithm adapted to RDMA in order to optimize latency, other systems keep a classical algorithm but improve the underlying communication stack [131, 159]. With this approach, somewhat orthogonal to ours, the best reported replication latency is 5.5µs [131], almost 5× slower than Mu. HovercRaft [140] shifts SMR from the application layer to the transport layer to avoid IO and CPU bottlenecks on the leader replica. However, their request latency is more than an order of

magnitude more than that of Mu, and they do not optimize fail-over time.

Some SMR systems leverage recent technologies such as programmable switches and NICs [125, 130, 160, 164]. However, programmable networks are not as widely available as RDMA, which has been commoditized with technologies such as RoCE and iWARP.

SMR over RDMA. A few SMR systems have recently been designed for RDMA [129, 199, 225]. DARE [199] was the first RDMA-based SMR system. Similarly to Mu, DARE uses only one-sided RDMA verbs executed by the leader to replicate the log in normal execution. However, DARE requires updating the tail pointer of each replica’s log in a separate RDMA Write from the one that copies over the new value, and therefore induces more round trips for replication. Furthermore, DARE has a heavier leader election protocol than Mu’s. DARE’s leader election is similar to the one used in RAFT, in which care is taken to ensure that at most one process considers itself leader at any point in time. APUS [225] improved upon DARE’s throughput. However, APUS requires active participation from the follower replicas during the replication protocol, resulting in higher latencies. Both DARE and APUS use transitions through queue pair states to allow or deny RDMA access. This is reminiscent of our permissions approach, but is less fine-grained.

Derecho [129] provides durable and non-durable SMR, by combining a data movement protocol (SMC or RDMC) with a shared-state table primitive (SST) for determining when it is safe to deliver messages. This design yields high throughput but also high latency: a minimum of 10µs for non-durable SMR [129, Figure 12(b)] and more for durable SMR. This latency results from a node delaying the delivery of a message until all nodes have confirmed its receipt using the SST, which takes additional RDMA communication steps compared to Mu. It would be interesting to explore how Mu’s protocol could improve a large system like Derecho.

Other RDMA Applications. More generally, RDMA has recently been the focus of many data center system designs, including key-value stores [84, 132] and transactions [134, 227]. Kalia et al. provide guidelines on the best ways to use RDMA to enhance performance [133]. Many of their suggested optimizations are employed by Mu. Kalia et al. also advocate the use of two-sided RDMA verbs (Sends/Receives) instead of RDMA Reads in situations in which a single RDMA Read might not suffice. However, this does not apply to Mu, since we know a priori which memory location to read, and we rarely have to follow up with another read.

Failure detection. Failure detection is typically done using timeouts. Conventional wisdom is that timeouts must be large, in the seconds [156], though some systems report timeouts as low as 10 milliseconds [140]. It is possible to improve detection time using inside information [154, 156] or fine-grained reporting [155], which requires changes to apps and/or the infrastructure. This is orthogonal to our score-based mechanism and could be used to further improve Mu.


3.9 Conclusion

Computers have progressed from batch-processing systems that operate at the time scale of minutes, to progressively lower latencies in the seconds, then milliseconds, and now we are in the microsecond revolution. Work has already started in this space at various layers of the computing stack. Our contribution fits in this context, by providing generic microsecond replication for microsecond apps.

Mu is a state machine replication system that can replicate microsecond applications with little overhead. This involved two goals: achieving low latency on the common path, and minimizing fail-over time to maintain high availability. To reach these goals, Mu relies on (a) RDMA permissions to replicate a request with a single one-sided operation, as well as (b) a failure detection mechanism that does not incur false positives due to network delays—a property that permits Mu to use aggressively small timeout values.

Part II: Sharing Persistent Memory


4 The Inherent Cost of Remembering Consistently

Persistent memory (PM) promises fast, byte-addressable and durable storage, with raw access latencies in the same order of magnitude as DRAM. But in order to take advantage of the durability of PM, programmers need to design persistent objects which maintain consistent state across system crashes and restarts. Concurrent implementations of persistent objects typically make heavy use of expensive persistent fence instructions to order PM accesses, thus negating some of the performance benefits of PM.

This raises the question of the minimal number of persistent fence instructions required to implement a persistent object. We answer this question in the deterministic lock-free case by providing lower and upper bounds on the required number of fence instructions. We obtain our upper bound by presenting a new universal construction that implements durably any object using at most one persistent fence per update operation invoked. Our lower bound states that in the worst case, each process needs to issue at least one persistent fence per update operation invoked.

4.1 Introduction

Persistent memory (PM) is fast, byte-addressable memory that preserves its contents even in the absence of power. Recent years have seen significant research into PM [152, 181, 205, 218], but the technology is only now starting to become commercially available.

PM shares similarities with both traditional stable storage and DRAM. Like stable storage, PM allows programs to persist their state across power failures or machine restarts. Unlike stable storage, PM is byte-addressable and fast (with access times in the same order of magnitude as DRAM [230]). In this sense, PM promises applications that are durable and fast, for they would not need to access slow storage to persist their state or during recovery.

Yet, the task of designing fast persistent objects (as building blocks of persistent applications) is complicated by two factors:


1. Processor registers and caches are expected to remain volatile (transient) for the foreseeable future. Therefore, simply writing to a memory location is not sufficient to ensure the persistence of its contents (even if the memory location is in PM), because the write instruction might, for instance, be satisfied in the cache and thus lost in the case of a crash.

2. There is a priori no guarantee on the order in which cache lines are written back to PM. However, program correctness might rely on such a guarantee, especially in a concurrent setting, which is the focus of this chapter.

Due to these two factors, programming for PM requires the use of flush instructions to force cache line write-backs, as well as of expensive fence instructions to ensure ordering among such flushes.

As we explain in Section 4.2, it is these latter fences that dominate the cost of PM write-backs, which raises an interesting question: What is the minimum number of persistent fences required to implement a persistent object? In this chapter, we answer this question for concurrent lock-free objects, by providing both upper and lower bounds on the number of fences required to implement them persistently.

We focus on the lock-free case because it provides an interesting trade-off. On the one hand, intuitively, lock-free objects can be implemented with a small number of fences, because they are already required to always be in a consistent state, such that progress can be made despite the failure of any number of processes. A priori, durability has the related requirement of an object state being consistent, no matter when a crash may occur. On the other hand, it is this very need for consistency that makes lock-free objects require at least a minimal number of fences for each operation invoked, as we show in the chapter (we discuss lock-based objects in Section 5.9).

The correctness (safety) property we consider in this chapter is durable linearizability [127]. Durably linearizable objects satisfy the standard linearizability property: every operation seems to happen instantaneously at a linearization point between its invocation and response, in separation from any other process in the system. In addition, after a full-system crash, the state of the object must reflect a consistent operation subhistory that includes all operations completed by the time of the crash.

For the upper bound, we propose a new universal construction called Order Now, Linearize Later (ONLL) that takes a deterministic sequential specification of an object O and produces a lock-free durably linearizable implementation of O that uses at most one fence per operation invoked, in the worst case. Our construction in fact guarantees detectable execution [91], an even stronger property than durable linearizability, which ensures in addition that, upon recovery, processes can determine which operations were linearized before the crash and which operations were not.


In our universal construction, we distinguish between read-only and update operations. An update operation op proceeds in 3 steps, called order, persist and linearize, respectively. First, op synchronizes with other update operations to establish the linearization order of op. This step uses a shared lock-free execution trace data structure, based on a lock-free queue, for determining this order in a lock-free manner. Second, op is stored in PM by using a per-process persistent log. Crucial to this construction is the fact that the persistent log can be implemented with only one persistent fence per append operation [64]. A helping mechanism is used to ensure that delayed processes do not create inconsistencies in the state of the object. Third, op announces that it has completed the persistence step. This is also the linearization point [116] of op if it runs solo. When setting the linearization point, care is taken to respect the linearization order computed in Step 1. A read-only operation determines its return value based on the update that most recently announced completion of the persistence step.
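
A hedged sketch of this three-step update path follows; the names (trace, plog, announce_linearized) are ours for illustration and are not the thesis API. The single persistent fence is the one issued inside the log append.

// Minimal sketch of ONLL's update path, under the assumptions above.
void update(const Op& op) {
  TraceNode* n = trace.enqueue(op);   // 1. order: lock-free queue fixes the linearization order
  plog[my_id].append(n);              // 2. persist: one persistent fence inside append [64]
  announce_linearized(n);             // 3. linearize: publish that op is durable
}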

Since persistent fences are only performed when appending one or more updates to a process’ persistent log, it is clear that our construction uses at most one persistent fence per operation. Moreover, no process can prevent the system from making progress, thus the construction is lock-free.

Our lower bound states that any lock-free implementation of a persistent object has at least one execution in which all concurrent processes need to issue one fence instruction per update operation invoked. The intuition behind this result is that processes cannot always rely on each other to persist updates and must therefore sometimes persist these updates themselves. To see this, imagine that some process p is designated to persist updates for one or more other processes but p is delayed. In order for lock-freedom to be satisfied, those other processes cannot wait indefinitely for p, and so must persist their updates themselves, thus each incurring the cost of persistent fences.

To summarize, the contributions of this chapter are:

1. The ONLL universal construction, providing a lock-free durably linearizable implemen- tation of any deterministic object. ONLL uses a single persistent fence per update operation and no persistent fences for read-only operations. ONLL also serves as upper bound on the number of persistent fences required to implement such objects.

2. A lower bound on the number of persistent fences in a lock-free durably linearizable implementation of an object.

We also discuss extensions to our universal construction for wait-freedom, improved read performance and memory reclamation.

The rest of this chapter is organized as follows. Section 4.2 recalls useful background. In Section 4.3, we give a high-level overview of our universal construction. In Section 4.4, we describe in detail the universal construction algorithm and we prove its correctness in Section 4.5. In Section 4.6, we present our lower bound result. We discuss relevant related work

in Section 4.7 and conclude with an overview of possible extensions and future directions in Section 5.9.

4.2 Background

4.2.1 Persistent Memory

So far, storage has been either slow but durable (e.g., SSD, hard disk, magnetic tape) or fast but volatile (e.g., DRAM). PM promises to combine the best of both worlds through fast and durable storage. Several implementations of PM are foreseen: Memristors [218], Phase Change Memory [152, 205] and 3D XPoint [181].

PM is expected to be byte-addressable and attached directly to the memory bus of the CPU, accessible by standard load and store instructions. Thus, programming for PM will probably be closer to programming for DRAM than for block-based storage. As argued in the introduction, the main difficulty in programming for PM will likely stem from the fact that a priori there is no guarantee on when and in what order (volatile) cache lines will be written back to PM. Therefore, programmers will need to use special instructions to ensure cache lines are written to the PM.

One such instruction on Intel machines is clflush [122], which forces a cache line to be written back to the PM. This instruction is strongly ordered: a call to clflush returns only once the cache line is written back to the PM and is durable. Consequently, this instruction stalls the CPU for the entire duration of accessing the PM, which is expected to be expensive in terms of CPU cycles.

Durability can also be ensured by using asynchronous write-back instructions, such as clflushopt or clwb [120]. We adopt this approach in this chapter since it can be up to an order of magnitude faster than clflush [121]. Multiple invocations of these instructions are not ordered, so multiple cache lines can be flushed in parallel. Since these instructions do not stall the CPU and can be processed in the background, we consider the cost of invoking such instructions to be zero. Of course, this also means that invoking write-back asynchronous instructions is not sufficient to ensure durability.

In order to ensure that an asynchronous write-back completes and data is made durable, a fence instruction is required, which stalls the CPU until all active asynchronous write-back instructions complete. The fence instruction stalls the CPU for the entire duration of accessing the PM, which can be expensive. Thus, our focus in this work is reducing the number of such fences. We emphasize that it is possible to execute a fence while no asynchronous cache line flush instructions are active, in which case the CPU does not (necessarily) access the PM. We denote an execution of a fence while asynchronous cache line flushes are pending as a persistent fence.
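For concreteness, the flush-then-fence pattern can be sketched as follows using Intel's clwb and sfence intrinsics; the helper names, the 64-byte cache-line size, and the availability of the clwb intrinsic on the target machine are assumptions of this illustration rather than part of ONLL.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Issue an asynchronous write-back (clwb) for every cache line covering
// [addr, addr + len). The write-backs are not ordered and do not stall the CPU.
static inline void flush_range(const void *addr, size_t len) {
    const size_t line = 64; // assumed cache-line size
    uintptr_t p = reinterpret_cast<uintptr_t>(addr) & ~(line - 1);
    uintptr_t end = reinterpret_cast<uintptr_t>(addr) + len;
    for (; p < end; p += line)
        _mm_clwb(reinterpret_cast<void *>(p));
}

// A persistent fence: stall until all pending write-backs have reached PM.
static inline void persistent_fence() {
    _mm_sfence();
}

// Example: make one log entry durable with a single persistent fence.
void persist(const void *entry, size_t len) {
    flush_range(entry, len);
    persistent_fence();
}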


Additional details of the memory ordering model for PM are available in [64]. Briefly, two writes separated by a persistent fence are guaranteed to reach PM in order. However, standard fences (e.g., mfence) do not generally guarantee ordering to the PM, unless all writes address the same cache line.

4.2.2 Processes and Operations

We consider a set of n processes that communicate through shared memory access primitives. We make no assumptions on the relative speeds of the processes; in particular, at any point in time, processes may be delayed for arbitrary or even infinite amounts of time. As in previous work [127], we also assume that the whole system can crash at any point in time and potentially recover later. On such a full-system crash, we assume that the contents of PM are preserved but that the contents of the processor’s registers and caches are lost. Processes running at the time of the full-system crash also crash and are replaced by new processes after recovery. After a system recovers and before resuming normal operation, we assume that a (potentially empty) recovery routine is invoked in order to bring the persistent objects on the PM back to a consistent state.

We classify operations on an object as read-only or update. Updates are operations that influence the result of subsequent operations; read-only operations do not influence later operations. Updates also read the state of the object and have return values. The state of the object is the sequence of update operations applied to the object; the first operation must be INITIALIZE. This definition implies that update operations are deterministic: applying the same sequence of updates on the object always results in the same state.

Our universal construction assumes the existence of a compute method. Given a read operation r and the state of the object s (i.e., the sequence of update operations on the object), this method computes the return value of r applied to s. For an update operation u, the return value is computed on the state of the object s immediately after appending u.
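As an illustration, a compute method for a simple counter (the running example of Section 4.3.3) could look as follows; the node layout and all names here are illustrative and do not appear in the ONLL pseudo code.

#include <string>

// Illustrative trace node: one update operation plus a link to the
// previous update in the sequence (names are hypothetical).
struct TraceNode {
    std::string op;        // "INITIALIZE" or "increment"
    TraceNode  *prev;      // earlier update in the sequence
};

// compute for a counter: replay all increments up to and including `node`.
// For an update ("increment") the value returned is the state right after
// appending it; for a read-only operation ("read") it is the state at `node`.
long compute(const TraceNode *node, const std::string & /*op*/) {
    long value = 0;
    for (const TraceNode *n = node; n != nullptr; n = n->prev)
        if (n->op == "increment")
            ++value;
    return value;
}

For the counter, both increment and read can use this same compute, which is why ONLL only needs the object's compute method and nothing else about its semantics.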

4.3 ONLL: a Primer

We give in this section a high-level view of ONLL, our universal construction that takes any deterministic object O and produces a lock-free durably linearizable implementation of O that requires at most one persistent fence per update operation and no persistent fence for read-only operations. Broadly, the execution of an operation under ONLL follows three stages: (1) order (in which the linearization order of the operations is established), (2) persist (in which the operation is made persistent) and (3) linearize (in which the operation is linearized)¹. We first present the rationale behind these three stages and then give an overview of how ONLL works.

¹ Formally, the linearization point of an operation can be reasoned about only after the entire event history is known. We say that the linearization point of an operation is at time t if for any valid sequence of events after time t the linearization point of the operation can be set to time t.


We end the section with a concrete example, illustrating a shared counter implementation produced by ONLL.

4.3.1 Rationale

Not performing any persistent fences when reading is a highly desirable property. However, this imposes some constraints on the design of ONLL:

1. The linearization point of an update operation cannot coincide with the time when the operation reaches PM. This is because PM can only be written using simple writes (as opposed to the cache, which supports CAS), so it cannot in general serve as a synchronization point. A reader cannot distinguish whether data already reached PM or not.

2. The linearization order of update operations must be known before the operation is written to PM. This is because PM must contain enough information to replay the operations in the correct (linearization) order.

3. The linearization point of an update operation must happen after the write to PM.

The last constraint follows from the following argument by contradiction. Suppose that the linearization point of an update does not happen after the write to the PM. Then, a reader may observe the update operation before it reaches PM. There are three cases, each leading to a different contradiction:

• The reader finishes its operation before the dependent update is persisted. This breaks linearizability if the reader performs an external operation (e.g., print) before the dependent update reaches PM and the system crashes afterward. It is not possible to recover the update, but the dependent read was already observed.

• The reader waits for the dependent update to be persisted. This breaks lock-freedom since the process executing the update operation may stall arbitrarily long.

• The reader helps the dependent update to persist. This breaks the property of never executing a persistent fence for read-only operations.

4.3.2 ONLL Design

The design of ONLL derives naturally from the above-mentioned constraints. Since the linearization order of an update operation must be known before the operation is made durable, and the operation must be made durable before it is linearized, an update operation u under ONLL has three stages: order, persist, linearize.

First, in the order stage, a descriptor d is created for u and is appended to the tail of a shared execution trace. Second, in the persist stage, u and the operations preceding it that are not yet persistent are appended to a per-process persistent log (residing in PM), along with the ordering information. Third, in the linearize stage, u becomes visible to other operations when the available flag of u, or of a later operation, is set. Finally, the returned value of u is computed based on the state of the object according to the execution trace up to d. So u is linearized at the linearize stage, after the write to PM, but according to the order computed at the order stage.

For a read-only operation r, the execution trace is traversed, starting from the tail, until the first descriptor d_first with a set available flag is reached. The return value of r is then computed based on the object state up to d_first, as recorded in the execution trace.

This design ensures that the linearization point of an operation happens after the write to the PM, so that readers are never required to wait or help with persisting the update operation. Moreover, after a crash, the PM contains enough information to recover the state of the object in the same order as the linearization points.

4.3.3 Illustration: Shared Counter

We illustrate how our ONLL universal construction works for a concrete shared object: a counter. A counter holds an integer value and has two operations: increment and read. The first is an update operation that increments the counter’s value and returns the new value. Read is a read-only operation that returns the value of the counter. In what follows, we walk through several increasingly complex executions of the counter (shown in Figure 4.1), to illustrate various situations that can arise with our construction.

Sequential update and read. In the first execution, a single process p1 executes an update operation (increment), followed by a read-only operation. Initially, both the execution trace and p1's persistent log are empty. Process p1 creates a new node n with execution index equal to 1 and available flag unset. Then, p1 appends to the persistent log an entry containing all operations that have not been persisted yet (just n in this case). To finalize the update, p1 sets n's available flag and returns the new value of the counter, 1. Next, p1 performs a read operation by traversing the execution trace from tail to head, stopping at the first node with a set available flag. In our case, there is only one node n, and its available flag is set. The read thus computes its return value based on the state of the counter at n. Node n corresponds to a state in which one increment has been performed, so the read returns 1.

Figure 4.1 – Executions of a counter implemented using ONLL.

Update concurrent with reads. In the second execution, process p1 is executing an update concurrently with two readers r1 and r2. The counter initially has value 1: there is already a node n1 in the execution trace. The update appends a new node n2 to the execution trace, appends the relevant entry to p1's persistent log, and then pauses. r1 traverses the execution trace from tail to head, stopping at n1, the first node with a set available flag. p1 resumes execution and sets the available flag of n2. r2 begins traversing the execution trace and stops at n2. Finally, the three operations return: r1 returns 1, based on the state of the counter at n1; r2 returns 2, based on the state at n2; p1's update returns 2, also based on the state at n2.

Update helping another update. In the third execution, processes p1 and p2 are each executing an increment concurrently. Initially, the counter has value 1: the execution trace only contains node n1. Process p1 appends a node n2 to the execution trace, adds the corresponding entry to the persistent log, and then pauses. p2 also appends a node n3 to the execution trace and then adds a persistent log entry containing all operations that have not been persisted yet: both p1's update and p2's update. Finally, p2 sets the available flag of n3 and returns 3. Any reader starting its traversal after n3's available flag has been set will return 3, even though the available flag of n2 has not yet been set.

Crash concurrent with updates and reads. In the fourth execution, processes p1, p2 and p3 are each executing an update operation. Initially, the counter has value 0 and the execution trace is empty. Process p1 appends a node n1 to the execution trace and then pauses for the rest of the execution. p2 appends an execution trace node n2, adds an entry to the persistent log corresponding to its own update and to the update of p1, and then pauses without setting the available flag of n2. p3 also appends an execution trace node n3, and starts adding an entry to the persistent log corresponding to the updates of p1, p2 and p3. The system crashes before any of the operations have returned.

After the crash, the state of the counter reflects the updates of p1 and p2. These can be reconstructed from p2’s persistent log, even though no available flag was set during the execution. The post-crash state of the counter does not however reflect p3’s update, because p3 did not finish adding its persistent log entry.

Since no available flag was set during the execution, any reader concurrent with the updates will return 0, the initial value of the counter. Post-crash readers will return 2.

4.4 ONLL: a Universal Construction

In this section, we first detail the data structures required by ONLL and then we describe the ONLL algorithm itself.

4.4.1 Data Structures

The ONLL algorithm depends on two basic building blocks: a single-fence persistent log and a lock-free queue.

We assume that an update operation can be stored in (persistent) memory by using an operation structure; input parameters are considered part of the operation and are thus also reflected in the operation structure.

4.4.1.1 Persistent Log Usage

ONLL uses per-process persistent logs. We leverage the log implementation of Cohen et al. [64], which uses only one persistent fence per append. Each append invocation records up to MAX-PROCESSES operations, the number of recorded operations, and an execution index; pseudo code is provided in Listing 4.1. The first operation in the operation array is the current update operation executed by the process. The rest of the operations in the operation array are used to help other processes to persist their data. The executionIndex is a unique index that represents the ordering of the linearization point of the first operation. Operations in the array are sequential, so that the execution index of the k-th help operation is executionIndex − k.


Listing 4.1 – The recordEntry structure used by ONLL.

struct recordEntry{
    operation ops[MAX_PROCESSES];
    int num_ops; // between 1 and MAX_PROCESSES
    long executionIndex;
}


Figure 4.2 – The execution trace and the fuzzy window. Since op2 has a set available flag, all operations preceding it, including op1, are considered part of the non-fuzzy window. op3 and op4 have no later operation with a set available flag and so are the fuzzy window of the execution trace.

4.4.1.2 Transient Execution Trace

The second data structure used by ONLL is a transient (i.e., not necessarily stored in PM) execution trace of the object. This represents the sequence of all update operations applied to the object. Recall that the execution trace is equivalent to the state of the object. We emphasize that read-only operations do not appear in the execution trace of an object since they do not influence the state of the object; in our design, a read-only operation never writes to shared memory or to PM.

The sequence of the update operations in the execution trace is partitioned into a non-fuzzy prefix and a fuzzy window postfix. The fuzzy window represents a set of currently executing operations that are not yet guaranteed to reside on PM and whose linearization points have not yet occurred. The non-fuzzy prefix consists of all other operations, which are guaranteed to reside on PM and whose linearization points have already occurred. The fuzzy window is implemented by assigning an available flag to each operation in the execution trace. The fuzzy window spans from the latest operation in the execution trace up to (but not including) the latest operation with a set available flag. Available flags can be set in any order, depending on the speed of the relevant process. A set available flag is never cleared.

It is important to note that the fuzzy window is continuous: if an operation op has its available flag unset, but a later operation has its available flag set, then op is not part of the fuzzy window. An illustration appears in Figure 4.2.

The execution trace is implemented in a lock-free manner, based on a lock-free queue algorithm [180]. A slight difference from a traditional lock-free algorithm is the need to compute the execution index of each operation, which counts the number of update operations in the execution trace before the current operation. Pseudo code for the queue operation appears in Algorithm 4.2. Computing the execution index of a new operation is done at Line 34.

The execution trace data structure supports the latestAvailable method for reading the latest operation with its available flag set (the latest operation in the non-fuzzy part of the execution trace). Pseudo code appears at Algorithm 4.2 Line 38. It is important to note that the latestAvailable method returns the latest observed available operation, which might not be the actual latest operation in the non-fuzzy part of the execution trace. This is because it is possible that while the latestAvailable method is traversing the operations with available flag unset, the availability of a later operation is set. In fact, it is possible that the returned node was never the latest operation in the non-fuzzy part. ONLL is correct despite this anomaly, as is described later (Proposition 4.5.9).

4.4.2 ONLL Algorithm

Our algorithms for updating the object and reading the object are presented in Algorithms 4.3 and 4.4, respectively.

An update operation starts by adding a new node to the execution trace at Line 44. This corresponds to setting the linearization order of the update operation without making it visible to read-only operations and without linearizing it². At Line 45, the fuzzy window of the operation is computed. This corresponds to the set of operations that precede the current operation but are not yet guaranteed to be persistent. (We later show — Proposition 4.5.2 — that this computation is finite since there are at most MAX-PROCESSES nodes in the fuzzy window.) Helping these operations to persist on the PM prevents waiting for an unresponsive process. Then, the current operation and the helped operations are persisted by appending them to the private, persistent log (Line 46). The update finishes by setting the available flag. This corresponds to the linearization point of the operation (unless another process helped it³) and makes the operation visible to read-only operations. Finally, if the update operation also returns a value, this value is computed and returned to the caller.

A read-only operation gets the latest node with available flag set. This node corresponds to the latest node in the non-fuzzy prefix of the object state. Then, the return value is computed based on this state and returned to the caller.

After a system crash, the transient execution trace is reconstructed from the persistent logs of all processes.

The recovery process starts by adding the initialization operation to the execution trace, which serves as a sentinel node. Then, it iteratively searches for the next operation in the execution trace by looking into the persistent logs of all processes. If an operation op was not stored directly into any persistent log, the recovery process looks for an operation with a higher execution index and finds op among the operations helped by that later operation. Finally, the found operation is pushed to the execution trace and its available flag is set. Recovery code is illustrated in Algorithm 4.5.

² We note that adding a new node uses a CAS instruction, which serves as a (concurrency) fence. However, no writes to the PM are pending, so this fence does not count as a persistent fence.

³ Helping is done by setting the available flag of a later operation, which logically linearizes the current operation. There is no help for setting the available flag: it is set only by the process executing the operation.


Algorithm 4.2 – The execution trace used by ONLL.

 1  struct queueNode{
 2      operation op;
 3      long idx;
 4      atomic<bool> available;
 5      queueNode *next;
 6      set<operation> getFuzzyOps(){
 7          queueNode *curr = this;
 8          set<operation> ops;
 9          while(curr->available == false){
10              ops.add(curr->op);
11              curr = curr->next;
12          }
13          return ops;
14      }
15      queueNode *latestAvailable(){
16          queueNode *curr = this;
17          while(curr->available == false){
18              curr = curr->next;
19          }
20          return curr;
21      }}

23  class executionTrace{
24      atomic<queueNode*> tail;
25      executionTrace(){
26          tail = new queueNode(INITIALIZE, 0, true, null);
27          // also serves as a sentinel
28      }
29      void insert(queueNode *node){
30          queueNode *ltail;
31          node->available.store(false, memory_order_relaxed);
32          do{
33              ltail = tail;
34              node->idx = ltail->idx + 1;
35              node->next = tail;
36          } while(tail.compare_exchange_weak(ltail, node) == false);
37      }
38      queueNode *latestAvailable(){
39          queueNode *curr = tail;
40          return curr->latestAvailable();
41      }}

Algorithm 4.3 – The update operation in ONLL.

42  Update(operation op){
43      queueNode *node = new queueNode(op);
44      executionTrace.insert(node);
45      operation fuzzyOps[MAX_PROCESSES] = node->getFuzzyOps();
46      persistentLog.append(fuzzyOps, node->idx);
47      node->available.store(true, memory_order_seq_cst);
48      return compute(node, op);
49  }


Algorithm 4.4 – The read operation in ONLL.

50  Read(operation op){
51      queueNode *node = executionTrace.latestAvailable();
52      return compute(node, op);
53  }

Algorithm 4.5 – Recovery in ONLL.

54  executionTrace.insert(queueNode(INITIALIZE)).setAvailable();
55  for(i = 1; true; i++){
56      Find log entry E with lowest execution index j : j ≥ i.
57      if(E does not exist)
58          break;
59      operation op = E.ops[j - i];
60      executionTrace.insert(queueNode(op)).setAvailable();
61  }
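Line 56 of Algorithm 4.5 leaves the search for a log entry abstract. One possible realization, assuming the per-process logs are visible after the crash as arrays of recordEntry (Listing 4.1) and using hypothetical container types and names, is sketched below.

#include <vector>

// Simplified stand-in for the recordEntry of Listing 4.1.
struct recordEntry {
    std::vector<int> ops;   // operation identifiers (placeholder for operation[])
    long executionIndex;    // execution index of ops[0]
};

// Return the log entry with the lowest executionIndex j such that j >= i,
// scanning the logs of all processes; nullptr if no such entry exists.
const recordEntry *findEntryAtLeast(
        const std::vector<std::vector<recordEntry>> &logs, long i) {
    const recordEntry *best = nullptr;
    for (const auto &log : logs)
        for (const auto &e : log)
            if (e.executionIndex >= i &&
                (best == nullptr || e.executionIndex < best->executionIndex))
                best = &e;
    return best;
}

// The operation with execution index i is then best->ops[best->executionIndex - i],
// matching Line 59 of Algorithm 4.5.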


4.5 ONLL: Correctness

In this section, we prove the following theorem.

Theorem 4.5.1. For any deterministic object O, there exists a lock-free durably linearizable implementation of O that requires at most one persistent fence per update operation and no persistent fence per read-only operation.

We prove the theorem by first showing that ONLL is lock-free and then that it is durably linearizable.

4.5.1 Lock-freedom

An implementation is lock-free if it guarantees that infinitely often, some operation returns in a finite number of steps. More specifically, if any process is permanently taking steps, some operation will eventually return.

The lock-freedom proof uses the following proposition, showing that the size of any fuzzy window traversed is bounded, regardless of the node at which the traversal starts.


Proposition 4.5.2. At any time during any execution of ONLL, among any MAX-PROCESSES+1 consecutive nodes in the execution trace, at least one has its available flag set.

Proof. Let t be any time during the execution and let S = {n_i, ..., n_{i+MAX-PROCESSES}} be MAX-PROCESSES+1 consecutive nodes in the execution trace. Clearly, there are at least two nodes n_j, n_k ∈ S that correspond to two operations by the same process.

In our model, two operations cannot be executed by the same process at the same time: the first operation must return before the second operation can be invoked. Thus, let n_j be the earlier operation and let t' be the time it finished. Clearly, t' < t, since at time t the operation corresponding to n_k has already appended itself to the execution trace, implying that it was already invoked.

According to Algorithm 4.3, an operation does not finish before setting its available flag and executing a memory fence. Thus, at least n_j has a set available flag, as required.

Recall that a set available flag is never unset; this property, together with Proposition 4.5.2, implies that getFuzzyOps and latestAvailable are wait-free: they cannot be interfered with by other processes and they always finish in O(MAX-PROCESSES) steps.

Lemma 4.5.3. Suppose that compute — the function for computing the return value of an operation — always finishes in a finite time. Then ONLL is lock-free.

Proof. Reads first find the latestAvailable node and then execute compute on the resulting node. latestAvailable finishes in a bounded number of steps and is thus wait-free. Compute operates on a prefix of the execution trace starting with the latestAvailable node. This prefix is never modified by any process (except for the available flag, which is ignored by compute). Since we assume compute finishes in a bounded number of steps, it is also wait-free. Thus, reads are wait-free.

Next we consider updates. Appending to the execution trace is a lock-free operation. Getting the fuzzy window of an operation is wait-free since it always finishes in a bounded time. Appending to the process’ private persistent log is also wait-free since an append is never interrupted by other processes and it finishes in a bounded number of steps. Finally, setting the available flag of a node is a wait-free operation. Thus, we conclude that updates are lock-free.

4.5.2 Durable linearizability

We first recall the concept of durable linearizability and then proceed with the proof of durable linearizability in ONLL.


4.5.2.1 Technical Preliminaries

The execution of a concurrent system is modeled by a history, a sequence of events. Events can be operation invocations and responses. Each event is labeled with the process and with the object to which it pertains. A subhistory of a history H is a subsequence of the events in H.

A response matches an invocation if they are performed by the same process on the same object. An operation in a history H consists of an invocation and the next matching response. An invocation is pending in H if no matching response follows it in H. An extension of H is a history obtained by appending responses to zero or more pending invocations in H. complete(H) denotes the subhistory of H containing all matching invocations and responses.

For a process p, the process subhistory H|p is the subhistory of H containing all events labeled with p. The object subhistory H|O is similarly defined for an object O. Two histories H and H' are equivalent if for every process p, H|p = H'|p.

A history H is sequential if the first event of H is an invocation, and each invocation, except possibly the last, is immediately followed by a matching response. A history is well-formed if each process subhistory is sequential.

A sequential specification of an object O is a set of sequential histories called the legal histories of O. A sequential history H is legal if for each object O appearing in H, H|O is legal.

An operation op1 precedes an operation op2 in H (denoted op1 →_H op2) if op1's response event appears in H before op2's invocation event. Precedence defines a partial order on the operations of H.

Definition 4.5.4 (Linearizability). A history H is linearizable if H has an extension H' and there is a legal sequential history S such that

L1 complete(H') is equivalent to S

L2 if an operation op1 precedes an operation op2 in H, then the same holds in S.

Informally, this definition is equivalent to saying that an object is linearizable if every opera- tion appears to take effect instantaneously at some point (the linearization point) between invocation and response. Incomplete operations (invocations without matching responses) may or may not have a linearization point.

Durable linearizability [127] captures the fact that an object’s state should remain consistent even across crashes and recoveries, without "erasing" any completed operations.

Definition 4.5.5 (Consistent cut). Given a history H, a consistent cut of H is a subhistory P of H such that if op2 ∈ P and op1 →_H op2, then op1 ∈ P and op1 →_P op2.

Definition 4.5.6 (Durable linearizability). An object O is durably linearizable if its states immediately before and immediately after a crash and recovery reflect histories H and H' respectively such that (1) H and H' are linearizable and (2) H' is a consistent cut of H for which every complete operation op in H is also in H'.

In other words, all operations that completed before the crash must be included in H', but some operations that had not yet completed may be excluded and thus not be reflected in the post-recovery state of the object. However, the operations in H' must constitute a consistent cut of H, meaning that if some operation op is included in H', then so must be all operations on which op depends.

Informally, an object O is durably linearizable if, in any history H produced by O, every opera- tion appears to take effect instantaneously at some point (the linearization point) between invocation and response. Incomplete operations (invocations without matching responses, either due to delayed processes or to system crashes) may or may not have linearization points. Operations concurrent with a crash may be reflected in the post-crash state of the object (in which case these operations have linearization points before the crash) or may not be reflected (in which case these operations have no linearization points).

4.5.2.2 Durable Linearizability in ONLL

Lemma 4.5.7. ONLL is durably linearizable.

The proof of durable linearizability for ONLL proceeds in four steps. First, we define linearization points for all update operations and define a time point for a crash (if a crash occurs). Second, we prove that the linearization point of an update falls between its invocation and response. Third, we show that a read is linearizable. Fourth, we show that the state of the object after a crash and recovery matches the state of the object before the crash, according to the update operations linearized before the crash.

We start by defining a point in time at which each update operation linearizes. When defining linearization points, we use the following convention. An integral time, denoted by t, corresponds to a specific instruction executed on the durable shared object; a non-integral (i.e., fractional) time, denoted by t − a·ε, denotes a linearization point that happens before time t but does not relate to a specific event during execution. We assume ε is sufficiently small and a is a positive finite number, so that t − 1 < t − a·ε < t.

The linearization point of an update operation op_i with execution index i is the earlier of (1) the time t_i at which the i-th available flag was set at the end of op_i, or (2) immediately before the time t_j at which the j-th available flag was set, for j > i. To avoid distinguishing between these two cases, we consider the first j-th available flag that was set such that j ≥ i. The linearization point of op_i is t_i = t_j − (j − i)·ε, for a sufficiently small ε > 0.

If a crash happens, let t_crash be the time of the crash; we assume this time is larger by at least one than the time of the last instruction executed before the crash. Clearly, operations that returned before the crash are included in the post-crash state of the object. We now examine operations that were ongoing at t_crash. Let op_i be such an ongoing operation, with execution index i. We examine several cases:

1. Some operation op_j with j ≥ i persisted op_i, and op_j set its available flag at time t_j. Then op_i is linearized at time t_i = t_j − (j − i)·ε, as discussed above.

2. op_i either (a) persisted itself, but did not set its flag, or (b) was persisted by one or more other operations (Algorithm 4.3, Line 46), but none of these operations set their flag. To establish the linearization point of op_i, let op_l be the operation with the highest execution index (= l) that finished persisting before t_crash. op_i is linearized at t_i = t_crash − (l − i)·ε.

3. op_i was not persisted at all. That is, no operation op_j with j ≥ i finished appending to the persistent log at Algorithm 4.3, Line 46. Then, op_i is not linearized and is lost in the crash.

Proposition 4.5.8. The linearization point of an update operation falls between the invocation and the response of this operation.

Proof. Consider an update operation op_i and let j ≥ i be the first index such that the j-th available flag was set. The linearization point of op_i is t_j − (j − i)·ε.

When op_i is inserted in the execution trace, the latest operation in the trace is i − 1; clearly, op_j is inserted in the execution trace no earlier than op_i was inserted in the execution trace (note that i can be equal to j, in which case the times are equal). Setting the j-th available flag is done at time t_j. Inserting the j-th node in the execution trace happens earlier and is related to the execution of an actual instruction; thus, its time is at most t_j − 1. Recall that ε is small enough so that t_j − 1 < t_j − (j − i)·ε. Thus, op_i is inserted in the execution trace before time t_j − (j − i)·ε. op_i is invoked before it is inserted in the execution trace, establishing that the linearization point of op_i happens after its invocation.

Operation op_i does not finish before setting the i-th available flag. Clearly, the i-th available flag is set no earlier than the first j-th available flag is set, j ≥ i. By definition of the linearization points, setting the j-th available flag happens at time t_j. Thus, the response to op_i happens after time t_j ≥ t_j − (j − i)·ε.

Next we consider the linearization point of reads. If a reader were to traverse an atomic snapshot of the execution trace and found the last node in the non-fuzzy window, its linearization point would follow immediately from the definition of the linearization points of writes. But ONLL readers traverse the execution trace directly (and not an atomic snapshot thereof) and thus they may find a node corresponding to a concurrent update⁴. In this case, we show that the linearization point of the read can be set to immediately after the linearization point of the concurrent update operation.

⁴ For more details, see Appendix C.


Proposition 4.5.9. A read-only operation has a linearization point between the invocation and the response of the operation, such that the return value of the operation corresponds to the state of the object at the linearization point.

Proof. A read-only operation traverses the execution trace from the tail until it reaches a node with a set available flag. The resulting node is the state on which the returned value is computed. Suppose that the tail pointed to op_j and the highest-index operation with a set available flag is op_i, i ≤ j. Consider the time t at which the read-only operation reads the tail (Line 7); either the i-th available flag was set at time t or not. If the i-th available flag was set at time t, then the linearization point of the read is set to time t. This clearly falls between the invocation and the response. The state of the object at time t contains op_i since its available flag is set, but not any operation in the range [i + 1, j], since their available flags are unset at time t. The latter is true since otherwise the traversal would find a later node k ≥ i + 1 with a set available flag.

Next, consider the case that the i-th available flag was not set at time t. Denote by t_e the time at which the read-only operation finds that the i-th available flag is set. At time t, all operations in the range [i, j] have their available flags unset, and at time t_e the i-th available flag is set. Thus, the linearization point of op_i falls between times t and t_e. We set the linearization point of the read-only operation to immediately after the linearization point of op_i and before the linearization point of any other update operation⁴. Since this linearization point is between times t and t_e, it is after the invocation of the read-only operation and before its response, as required.

Proposition 4.5.10. The state of the ONLL object after a crash includes all the operations that were linearized before the crash, executed in linearization order, and none of the operations that were not linearized before the crash.

Proof. By the definition of the linearization points before a crash, the state of the object just before t_crash corresponds to the last operation that was written to the persistent log. After a crash, the recovery reconstructs the execution trace by traversing all persistent logs. Thus, the last node in the execution trace after recovery is the last operation that was written to the persistent log. The order of operations follows the executionIndex, which is stored in the persistent logs. Thus, the order of operations is the same before the crash and after recovery. It remains to show that all operations appearing in the execution trace before the crash also appear after recovery.

Suppose, by way of contradiction, that an operation op_i that appeared before the crash does not appear after the crash. Since the latest operation is the same, there exists an operation op_j with j > i that appears both before the crash and after recovery. We pick the operation with the smallest such j (larger than i). Operations that have a set available flag clearly appear in the log. Thus, all operations in the range [i, j − 1] must have their available flag unset until t_crash. But since op_j was persisted in the persistent log before t_crash, it must have been added to the execution trace before t_crash. According to Proposition 4.5.2, there are at most MAX-PROCESSES − 1 operations between op_i and op_{j−1}, since the available flag of op_j is unset when it is appended to the execution trace.

But op_j appended to the persistent log all operations in the range [i, j], including op_i, so op_i appears in a persistent log and is therefore reconstructed by the recovery procedure. This contradicts our assumption that op_i does not appear after the crash.

The proof of Lemma 4.5.7 follows from Propositions 4.5.8, 4.5.9 and 4.5.10. The proof of Theorem 4.5.1 follows from Lemmas 4.5.3 and 4.5.7.

Interestingly, ONLL also provides detectable execution [91]. After recovery, it is possible to check if a given update operation appears in the execution trace. The operation was linearized before the crash if and only if it appears in the execution trace after recovery.

4.6 Lower Bound

We show that in any lock-free implementation of a durably linearizable object, there exists some execution in which every update operation must issue at least one persistent fence. This is trivially true if operations are executed sequentially (otherwise a crash immediately after an operation op would mean that op is not reflected in the state of the object after recovery). Intuitively, the need for processes to persist their operations also manifests in some concurrent executions: if some process p were to always rely on other processes to persist its operations, then p might need to wait indefinitely if those other processes are delayed, thus violating lock-freedom.

We prove this intuition below, through several intermediate results, after defining relevant terminology.

Terminology. We say that a process p runs solo between events A and B in an execution if p is the only process taking steps between A and B in that execution. Two sequences of operations H1 and H2 are equivalent (denoted H1 ≡ H2) if any possible execution, when started from H1, produces the same results (operation return values) as when started from H2. We denote by H1 · H2 the sequence obtained by executing sequence H2 after sequence H1. An operation op is an update if there exists a sequence H such that H · op ≢ H. For ease of description, we define a state as a sequence of update operations, starting with INITIALIZE.

Lemma 4.6.1. Let A, A' and B be sequences of operations such that A · B ≢ A' · B. Then A ≢ A'.

Proof. Assume that A ≡ A'. Then, by definition of equivalence, A · B ≡ A' · B, a contradiction.

Lemma 4.6.2. Let op be any update and H be any state such that H · op ≢ H. Then, if for some n ≥ 2, H · op^{n−1} ≡ H · op^n, the following property holds: ∀ j ≥ 1, H ≢ H · op^j.

Proof. Assume by contradiction that ∃ j ≥ 1 : H ≡ H · op^j. Then for all k ∈ N, H · op^{kj} ≡ H. Then (1) H · op^{n−1} · op^{j⌈n/j⌉−(n−1)} ≡ H and (2) H · op^n · op^{j⌈n/j⌉−(n−1)} ≡ H · op. The left sides of (1) and (2) are equivalent (because H · op^{n−1} ≡ H · op^n), but the right sides are not equivalent. We have reached a contradiction.

Theorem 4.6.3. In an n-process system, for any lock-free durably linearizable implementation of an update operation op, there is an execution in which (1) all processes call op concurrently and (2) each process performs at least one persistent fence during its call to op.

Proof. By definition of an update operation, there exists a state H such that op applied from H produces a different state: H · op ≢ H.

We now consider two cases: (1) H · op^{n−1} ≢ H · op^n and (2) H · op^{n−1} ≡ H · op^n. For each case, we construct an execution in which the n processes p1, ..., pn call op concurrently and all necessarily perform persistent fences.

Case 1: H · op^{n−1} ≢ H · op^n. We construct the following execution:

• Starting from H, let p1 call op and run solo until just before the response of op. p1 will eventually reach this point, due to the lock-freedom of the implementation. p1 will perform at least one persistent fence before being preempted. Otherwise, let p1 resume and perform the very next step of returning from op; if a crash occurs after this response, after recovery the contents of persistent memory will be identical to those at H, which is inconsistent with the only possible linearization H · op ≢ H.

• Let p2 call op and run solo until just before op returns (p2 eventually returns due to lock-freedom). p2 performs at least one persistent fence during its call to op. Otherwise, let p2 return from op and let p1 resume and perform the step of returning from op. If a crash occurs immediately after, at recovery the contents of memory will be identical to H · op ≢ H, but the only possible linearization is H · op · op. By Lemma 4.6.1, H · op · op ≢ H · op (taking A = H · op, A' = H · op · op and B = op^{n−2}). We have reached a situation in which the only possible linearization is not compatible with the contents of memory, a contradiction; thus p2 does indeed perform at least one persistent fence during its call to op.

• Continue with p3,...,pn, each time calling op and preempting the process just before returning. As with p2, each process will perform at least one persistent fence before being preempted.

Case 2: H · op^{n−1} ≡ H · op^n. We construct the following execution:

• Starting from H, let p1 call op and run solo. If left to run solo long enough, p1 will eventually perform a persistent fence. Otherwise, p1 either never returns from op, violating lock-freedom, or returns from op without performing a persistent fence. In the latter case, a crash may occur immediately after the response of op; upon recovery, the contents of persistent memory will be identical to those at H, which is inconsistent with the fact that H · op ≢ H.

• Preempt p1 just before the first persistent fence.

• Let p2 call op and run solo. If left to run solo long enough, p2 will eventually perform a fence. Otherwise, if p2 returns without a fence and the system crashes afterwards, the contents of persistent memory are identical to H, which is inconsistent with the possible linearizations H · op (i.e., p1's call is not linearized and p2's call is linearized) and H · op · op (i.e., both p1's call and p2's call are linearized); this is due to Lemma 4.6.2 (H ≢ H · op, H ≢ H · op · op).

• Continue in this way with processes p3, ..., pn.

• For each process pn,...p1, resume the process for one step—the persistent fence it was about to perform—then preempt it and move to the next process.

Our lower bound result holds for detectable execution [91] as well, since it is a stronger criterion than durable linearizability (an implementation satisfying the former requires at least as many persistent fences as an implementation satisfying the latter).

4.7 Related Work

Safety criteria. Several safety criteria have been proposed in the crash-recovery model. Persistent atomicity [100] requires any operation interrupted by a crash to be linearized or aborted before any later invocation by the pending process. In the same situation, recoverable linearizability [29] requires the operation to be linearized or aborted before any later invocation by the pending process on the same object. These two criteria assume that processes may crash and recover independently. However, it can be argued that this model is unnecessarily general, since processes typically crash together in a full-system crash (e.g., restart). In a more restricted model that only allows such full-system crashes, the two criteria become indistinguishable, and equivalent to durable linearizability [127], the safety condition adopted in this chapter.

Work has been done on verifying linearizability for traditional transient objects [95, 157, 212]. It would be interesting to see if such verification techniques could be extended to verify durable linearizability for persistent objects.

Our upper bound algorithm relates in an interesting way to Section 4.1 of Izraelevitz et al. [127]. In that section, the authors state that the linearization point of an operation must happen before it is persisted. However, our ONLL construction linearizes an operation op after the time when op persisted and generates the linearization order before persisting, so that it knows what to persist. In fact, as discussed in Section 4.3, we argue that having the persist point earlier than the linearization point is necessary for any lock-free algorithm where readers never execute a persistent fence.

The same section of [127] contains a theorem (Theorem 2), which provides a set of sufficient conditions for an object to be durably linearizable. While at first glance it may seem that our ONLL algorithm contradicts this theorem, this is not the case: ONLL is durably linearizable, without satisfying the condition that operations are linearized before they are persisted.

General transformations. Related to the generality of our upper bound construction, there has been work [52, 56, 118] on generating correct persistent applications from existing code (designed for DRAM). However, in contrast to our work, these approaches assume the application is already multi-threaded and generally also assume lock-based code. Moreover, the focus in this work [52, 56, 118] is on lessening the programming effort necessary to transform applications, not on achieving optimality in terms of the number of persistent fences used.

In the same vein of generality, our work shares similarities with the universal construction of Herlihy [109, 110]. In both cases, the construction yields a correct concurrent implementation of an object from its sequential specification.

Transactions. Significant work has been done on transactions as a means of interacting with PM [33, 55, 63, 75, 96, 126, 138, 141, 163, 175, 224]. These efforts share similarities to our work in the following sense: they strive for generality, they aim to reduce the cost of interacting with PM, and they often use logging. Yet, these works do not consider lock-freedom as a progress guarantee. Also, whereas in transactions logging is used to help maintain the consistency of application state, in our construction, the log is the state.

Persistent data structures. A specific class of shared objects is concurrent data structures [115]. There has been some work on designing efficient data structures for PM, focused mostly on indexing trees [57, 153, 189, 229]. This is natural, given that indexing trees are used extensively in data structures and file systems. Recently, Friedman et al. [91] have proposed three lock-free durable queue algorithms. These are specific approaches, not easily generalizable to other data structures or to other shared objects.

Lower bounds. Related to our lower bound result, Attiya et al. [21] have shown that linearizable implementations of strongly non-commutative operations cannot completely eliminate the use of expensive synchronization primitives such as memory barriers and atomic instructions (whose effects also include the effects of memory fences). This seems to imply that any implementation of a durably linearizable update operation requires (at least in some executions) two fences: one to account for the cited lower bound and one to account for our lower bound. However, since the effects of a memory fence also include stalling until pending cache line flushes have completed, a memory fence can also count as a persistent fence if flushes to PM are pending. Thus, it might be possible in some cases to implement a persistent object using only one (memory) fence per update operation, accounting for both our lower bound result and that of Attiya et al. We leave open the questions of when and if such one-fence updates are indeed achievable.

4.8 Concluding Remarks

In the PM era, programmers will need persistent and concurrent data structures. The perfor- mance of these is strongly influenced by the number of persistent fences executed for each operation. This chapter shows that lock-free implementations require exactly one persistent fence for any update operation to ensure correctness. Our upper bound uses a novel ordering scheme to persist operations before their linearization points. Our lower bound captures the very fact that processes cannot rely on each other to persist updates and thus shows that one cannot hope to reduce the number of persistent fences while still guaranteeing durable linearizability and lock-freedom.

Below, we discuss possible extensions of our results and future directions left open by our work.

Wait-freedom. According to the proof of Lemma 4.5.3, the only operation in our ONLL con- struction that is not wait-free is the execution trace transient data structure. Since this data structure is transient, standard techniques such as the wait-free construction of Timnat and Petrank [221] can be used to derive a wait-free execution trace data structure. Alternatively, a wait-free execution trace can be based on the wait-free queue of Kogan et al. [139]. ONLL can thus easily be made wait-free.

Compressing the execution trace. An ONLL object stores its state as a sequence of all operations applied to this object. This representation could be improved for specific cases, as most practical objects have an object-specific representation of their state. For example, for the shared persistent counter discussed in Section 4.3, an object-specific representation would be an integer field corresponding to the current value of the counter.

If such a representation exists, one could consider a hybrid approach that combines a small ONLL execution trace for correctness with an object-specific representation for efficiency. As explained below, this approach would have the double benefit of (1) allowing better read performance and (2) enabling memory reclamation, thus reducing memory consumption.

Readers in ONLL traverse the entire execution trace; thus, reading an object's state implies traversing all update operations in the history of the object. ONLL read performance can be significantly improved by storing a local view per process, similarly to log-based systems [24, 25, 48]. The local view of process p includes (1) a representation r_p of the object up to some operation op_p and (2) the execution index of op_p.

A read by process p begins as before by finding the first execution trace node n with a set available flag. Then, p applies to its local representation r_p all updates between op_p and n. Then, p updates its local execution index to that of n. Finally, the read is served directly from r_p.
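A minimal sketch of this local-view read, specialized to the counter example, is given below; CounterNode, LocalView and localRead are hypothetical names. The sketch relies on the fact that increments commute, so the order in which the missing updates are applied does not matter here (a general object would need to apply them in increasing execution-index order).

// Simplified trace node (cf. Algorithm 4.2), with only the fields needed here.
struct CounterNode {
    bool increment;        // the counter's single update operation
    long idx;              // execution index
    CounterNode *next;     // link toward older nodes
};

// Per-process local view: object-specific representation r_p plus the
// execution index it corresponds to.
struct LocalView {
    long value = 0;
    long idx = 0;
};

// Read through the local view: apply only the updates newer than view.idx,
// then serve the read from the local representation.
long localRead(LocalView &view, CounterNode *latestAvailable) {
    for (CounterNode *n = latestAvailable; n != nullptr && n->idx > view.idx; n = n->next)
        if (n->increment)
            ++view.value;
    view.idx = latestAvailable->idx;   // the sentinel guarantees a non-null node
    return view.value;
}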

In this way, the overhead of a read is the difference between the execution index of the local view and the execution index of the shared object, which is expected to be significantly smaller than the number of nodes in the execution trace.

Another effect of storing the entire execution trace in ONLL is the inability to reclaim memory. Both execution trace nodes and persistent log entries have to be kept forever. If the state can be stored as a small object-specific representation, however, then there is no need to remember all update operations and log entries, thus significantly reducing memory consumption.

Execution trace nodes can be reclaimed if we use the following scheme. As before, each process p has a local transient representation of the object r_p, which p brings up to date periodically. Note that once a process p has applied an operation op from the execution trace to r_p, p will never need to read op again. Thus, once all processes have updated their local representations past op, the execution trace prefix up to op can be safely reclaimed.

We can go one step further and also reclaim persistent log entries. Each process p periodically records its local representation r_p in its persistent log, along with the execution index n of op_p. Afterwards, p can reclaim the memory of all persistent log entries with execution indexes smaller than n.

Lock-based implementations. At first glance, it might seem that by allowing implementations of persistent objects to be blocking, one could reduce the number of persistent fence instructions. For instance, the work of Cohen et al. [64] enables an implementation in which each process announces its operation and one of the processes applies all announced operations (similarly to flat combining [108]) using a single persistent fence. This implementation might seem to use only one persistent fence for every batch of concurrent operations. However, upon closer inspection, it is easy to realize that all pending operations pay the price of a persistent fence (by waiting while the combiner performs the fence), even without actually performing the fence.

5 Efficient Multi-Word Compare-and-Swap

In the previous chapter, we provided a better understanding of the fundamental costs associated with lock-free programming for persistent memory. In this chapter, we turn our attention to atomic multi-word compare-and-swap (MCAS), a powerful building block for designing lock-free algorithms for persistent, as well as for volatile, memory. Despite its versatility, the widespread usage of MCAS has been limited because lock-free implementations of this primitive make heavy use of expensive compare-and-swap (CAS) instructions. Existing MCAS implementations indeed use at least 2k + 1 CASes per k-CAS. This leads to the natural desire to minimize the number of CASes required to implement MCAS.

We first prove in this chapter that it is impossible to “pack” the information required to perform a k-word CAS (k-CAS) in fewer than k locations to be CASed. Then we present the first algorithm that requires k + 1 CASes per call to k-CAS in the common uncontended case. We implement our algorithm and show that it outperforms a state-of-the-art baseline in a variety of benchmarks in most considered workloads. We also present a durably linearizable (persistent memory friendly) version of our MCAS algorithm using only 2 persistence fences per call, while still only requiring k + 1 CASes per k-CAS.

5.1 Introduction

Compare-and-swap (CAS) is a foundational primitive used pervasively in concurrent algorithms on shared memory systems. In particular, it is used extensively in lock-free algorithms, which avoid the pitfalls of blocking synchronization (e.g., that employs locks) and typically deliver more scalable performance on multicore systems. CAS conditionally updates a memory word such that a new value is written if and only if the old value in that word matches some expected value. CAS has been shown to be universal, and thus can implement any shared object in a non-blocking manner [110]. This primitive (or the similar load-linked/store-conditional (LL/SC)) is nowadays provided by nearly every modern architecture.
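In C++ terms, this corresponds to std::atomic's compare-and-exchange; the small wrapper below merely fixes the terminology used in the rest of this chapter and is not taken from any particular MCAS implementation.

#include <atomic>

// CAS(word, expected, desired): atomically install `desired` if and only if
// the word still holds `expected`; report whether the update took place.
bool cas(std::atomic<long> &word, long expected, long desired) {
    return word.compare_exchange_strong(expected, desired);
}

// Example: a lock-free increment built from CAS.
void increment(std::atomic<long> &counter) {
    long old = counter.load();
    while (!cas(counter, old, old + 1))
        old = counter.load();
}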

CAS does have an inherent limitation: it operates on a single word. However, many concurrent algorithms require atomic modification of multiple words, thus introducing significant complexity (and overheads) to get around the 1-word restriction of CAS [41, 76, 98, 99, 158, 184]. As a way to address the 1-word limitation, the research community suggested a natural extension of CAS to multiple words—an atomic multi-word compare-and-swap (MCAS). MCAS has been extensively investigated over the last two decades [11, 12, 76, 98, 99, 106, 110, 182, 214]. Arguably, this work partly led to the advent of the enormous wave of Transactional Memory (TM) research [103, 105, 114]. In fact, MCAS can be considered a special case of TM. While MCAS is not a silver bullet for concurrent programming [81, 113], the extensive body of literature demonstrates that the task of designing concurrent algorithms becomes much easier with MCAS. Not surprisingly, there has been a resurgence of interest in MCAS in the context of persistent memory, where the persistent variant of MCAS (PMCAS) serves as a building block for highly concurrent data structures, such as skip lists and B+-trees [17, 226], managed in persistent memory.

Existing lock-free MCAS constructions typically make heavy use of CAS instructions [11, 106, 182], requiring between 2 and 4 CASes per word modified by MCAS. That resulting cost is high: CASes may cost up to 3.2× more cycles than simple load or store instructions [73]. Naturally, algorithm designers aim to minimize the number of CASes in their MCAS implementations.

Toward this goal, it may be tempting to try to “pack” the information needed to perform the MCAS in fewer than k memory words and perform CAS only on those words. We show in this chapter that this is impossible. While this result might not be surprising, the proof is not trivial, and is done in two steps. First, we show through a bivalency argument that lock-free MCAS calls with non-disjoint sets of arguments must perform CAS on non-disjoint sets of memory locations, or violate linearizability. Building on this first result, we then show that any lock-free, disjoint-access-parallel k-word MCAS implementation admits an execution in which some call to MCAS must perform CAS on at least k different locations. (Our impossibility result focuses on disjoint-access-parallel (DAP) algorithms, in which MCAS operations on disjoint sets of words do not interfere with each other. DAP is a desirable property of scalable concurrent algorithms [124].)

We also show, however, in this chapter that MCAS can be “efficient”. We present the first MCAS algorithm that requires k+1 CAS instructions per call to k-CAS (in the common uncontended case). Furthermore, our construction has the desirable property that reads do not perform any writes to shared memory (unless they encounter an ongoing MCAS operation). This is to be contrasted with existing MCAS constructions (in which read operations do not write) that use at least 3k+1 CASes per k-CAS. Furthermore, we extend our MCAS construction to work with persistent memory (PM). The extension does not change the number of CASes and requires only 2 persistent fences per call (in the common uncontended case), comparing favorably to the prior work that employs 5k+1 CASes and 2k+1 fences [226].

Most previous MCAS constructions follow a multi-phase approach to perform a k-CAS operation op. In the first (locking) phase, op “locks” its designated memory locations one by one by replacing the current value in those locations with a pointer to a descriptor object. This descriptor contains all the information necessary to complete op by the invoking thread or (potentially) by a helper thread. In the second (status-change) phase, op changes a status flag in the descriptor to indicate successful (or unsuccessful) completion. In the third (unlocking) phase, op “unlocks” those designated memory locations, replacing pointers to its descriptor with new or old values, depending on whether op has succeeded or failed.

In order to obtain lower complexity, our algorithm makes two crucial observations concerning this unlocking phase. First, this phase can be deferred off the critical path with no impact on correctness. In our algorithm, once an MCAS operation completes, its descriptor is left in place until a later time. The unlocking is performed later, either by another MCAS operation locking the same memory location (and thus effectively eliminating the cost of unlocking for op) or during the memory reclamation of operation descriptors. (We describe a delayed memory reclamation scheme that employs epochs and amortizes the cost of reclamation across multiple operations.)

Our second, and perhaps more surprising, observation is that deferring the unlocking phase allows the locking phase to be implemented more efficiently. In order to avoid the ABA problem, many existing algorithms require extra complexity in the locking phase. For instance, the well-known Harris et al. [106] algorithm uses the atomic restricted double-compare single-swap (RDCSS) primitive (which requires at least 2 CASes per call) to conditionally lock a word, provided that the current operation was not completed by a helping thread. Naively performing the locking phase using CAS instead of RDCSS would make the Harris et al. algorithm prone to the ABA problem (we provide an example in Appendix D.1). However, in our algorithm, we get ABA prevention “for free” by using a memory reclamation mechanism to perform the unlocking phase, because such mechanisms already need to protect against ABA in order to reclaim memory safely.

Deferring the unlocking phase allows us to come up with an elegant and, arguably, simple MCAS construction. Prior work shows, however, that the correctness of an MCAS construction should not be taken for granted: for instance, Feldman et al. [88] and Cepeda et al. [51] describe correctness pitfalls in MCAS implementations. In this chapter, we carefully prove the correctness of our construction. We also evaluate our construction empirically by comparing it to a state-of-the-art MCAS implementation and showing superior performance across a variety of benchmarks (including a production-quality B+-Tree [17]) in most considered scenarios.

We note that the delayed unlocking/cleanup introduces a trade-off between higher MCAS performance (due to fewer CASes per MCAS, which also leads to less slow-down due to less helping) and lower read performance (because of the extra level of indirection reads have to traverse when encountering a descriptor left in place after a completed MCAS). One may argue that it also increases the amount of memory consumed by the MCAS algorithm. Regarding the former, our evaluation shows that the benefits of the lower complexity overcome the drawbacks of indirection in all workloads that experience MCAS contention. Furthermore, to mitigate the impact of indirection in reads, we propose a simple optimization that we describe in Section 5.6.2. As for the latter, we note that, much like for any lock-free algorithm, the memory consumption of our construction can be tuned by performing memory reclamation more (or less) often.

The rest of the chapter is organized as follows. In Section 5.2 we describe our model. In Section 5.3 we present our impossibility result. Section 5.4 details our MCAS algorithm for volatile memory and Section 5.5 presents the persistent version of our algorithm. Section 5.6 describes our lazy memory reclamation scheme for volatile and persistent memory, as well as an optimization to improve read performance. Section 5.7 presents the results of our experimental evaluation. We end with related work in Section 5.8. To improve readability, some content has been moved to the appendix: Appendix D.1 provides the ABA example for the naive simplification of the Harris et al. algorithm and Appendix D.2 contains additional performance graphs.

5.2 System Model

5.2.1 Volatile Memory

We assume a standard model of asynchronous shared memory [116], with basic atomic read, write and compare-and-swap (CAS) operations. The latter receives three arguments—an address, an expected value and a new value; it reads the value stored at the given address and, if it is equal to the expected value, atomically stores the new value at the given address, returning an indication of success or failure.
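As a minimal illustration of these CAS semantics (a sketch using C++'s std::atomic as a stand-in for the hardware instruction; the variable names are ours and not part of the algorithms in this chapter):

#include <atomic>
#include <cstdio>

int main() {
    std::atomic<long> word{42};

    // CAS(word, expected = 42, new = 100): succeeds because the current
    // value matches the expected one.
    long expected = 42;
    bool ok = word.compare_exchange_strong(expected, 100);
    std::printf("first CAS: %d, word = %ld\n", ok, word.load());

    // A CAS with a stale expected value fails and leaves the word unchanged;
    // compare_exchange_strong also writes the value it found into 'expected'.
    expected = 42;
    ok = word.compare_exchange_strong(expected, 7);
    std::printf("second CAS: %d, word = %ld, observed = %ld\n",
                ok, word.load(), expected);
    return 0;
}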

Using those atomic operations, we implement an atomic MCAS operation with the following semantics. The MCAS operation receives an array of tuples, where each tuple contains an address, an expected value and a new value. For ease of presentation, we assume the size of the array is a known constant N. (In practice, the size of the array can be dynamic, and different for every MCAS operation.) The MCAS operation reads the values stored at the given addresses and, if they are all equal to their respective expected values, atomically writes the new values to the corresponding addresses and returns an indication of success. Otherwise, if at least one read value is different from an expected one, the MCAS operation returns an indication of failure. We also provide a custom implementation of a read operation from a memory location that can be a target of an MCAS operation (which, in the most general case, can be any shared memory location).
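To make these semantics concrete, the sketch below gives a purely sequential specification of MCAS (MCASEntry and MCAS_spec are our own illustrative names); the constructions in this chapter make this behavior appear atomic to concurrent threads.

#include <cstddef>
#include <cstdint>

struct MCASEntry {
    uintptr_t* address;   // target word
    uintptr_t  expected;  // old (expected) value
    uintptr_t  desired;   // new value
};

// Sequential specification of MCAS over n address/old/new tuples.
bool MCAS_spec(MCASEntry* entries, size_t n) {
    // If any target word differs from its expected value, fail without writing.
    for (size_t i = 0; i < n; i++)
        if (*entries[i].address != entries[i].expected)
            return false;
    // Otherwise, write all new values and report success.
    for (size_t i = 0; i < n; i++)
        *entries[i].address = entries[i].desired;
    return true;
}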

Our MCAS implementation is linearizable [116]. This means, informally, that each (read or MCAS) operation appears to take effect instantaneously at some point in time in the interval during which the operation executes. In terms of progress, our MCAS implementation is non-blocking. That is, a lack of progress of any thread (e.g., due to the suspension or failure of that thread) does not prevent other threads from applying their operations. Furthermore, the MCAS implementation guarantees lock-freedom. That is, given a set of threads applying operations, it guarantees that, eventually, at least one of those threads will complete its operation.

Similar to many non-blocking algorithms, our design makes use of operation descriptors, which store information on existing MCAS operations, including the status of the operation and the array of tuples with addresses and values. We assume each word in the shared memory can contain either a regular value or a pointer to such a descriptor. A similar assumption has been made in prior work on MCAS [88, 106, 219, 226]. In practice, a single (e.g., least significant) bit can be used to distinguish between the two.

Initialization of the descriptor is done before invocation of the MCAS operation. We assume that all the addresses in the descriptor are sorted in a monotonic total order. This assumption is crucial for the liveness property of our algorithm. We note that this assumption can be easily lifted by explicitly sorting the array of tuples by corresponding addresses before an MCAS operation is executed.

5.2.2 Persistent Memory

We extend the model in Section 5.2.1 with standard assumptions about PM [65, 72, 91, 127]. Our PM model is similar to the one used in Chapter 4; we recall it here for completeness. We assume the system is equipped with persistent shared memory that can be accessed through the same set of atomic primitives (read, write and CAS). The system may also be equipped with DRAM to be used as transient storage. As in previous work [127], we assume that the overall system can crash at any time and possibly recover later. On such a full-system crash, we assume that the contents of persistent memory—but not those of processor caches, registers or volatile memory—are preserved. Moreover, threads that are active at the time of the crash are assumed to be lost forever and replaced by new threads in case of recovery. After a full-system crash but before the system recovers and resumes normal execution, we assume a recovery routine may be executed, in order to bring persistent memory-resident objects to a consistent state. The recovery routine can be executed in a single thread, and thus it does not have to be thread-safe. Another full-system crash, however, may occur during the recovery routine.

As is standard practice [65, 72, 226], we assume that a priori there is no guarantee on when and in what order cache lines are written back to persistent memory. We assume the existence of two primitives to enforce such write backs. The first primitive is PERSISTENT_FLUSH(addr), which takes as argument a memory location and asynchronously writes the contents of that location to persistent memory. Multiple invocations of this primitive are not ordered with respect to each other and thus several flushes can proceed in parallel. Concrete examples of this primitive are clflushopt and clwb [123]. The second primitive is PERSISTENT_FENCE(), which stalls the CPU until any pending flushes are committed to persistent memory. A concrete example of this primitive is sfence [123]. LOCK-prefixed instructions such as CAS also act as persistent fences [123]. Since persistent flushes do not stall the CPU, whereas persistent fences do, the cost of writing to persistent memory is dominated by the latter instructions and we consider the cost of the former to be negligible.
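As an illustration, the two primitives could be mapped to the x86 instructions mentioned above roughly as follows (a sketch using compiler intrinsics; it assumes a processor and compiler with clwb support, e.g., building with -mclwb):

#include <immintrin.h>

// Asynchronously write back the cache line containing addr; several such
// flushes may be outstanding at the same time.
static inline void PERSISTENT_FLUSH(void* addr) {
    _mm_clwb(addr);
}

// Stall until all previously issued flushes have reached persistent memory.
static inline void PERSISTENT_FENCE() {
    _mm_sfence();
}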

Regarding initialization, we assume descriptor contents are made persistent before invocation of MCAS.

The safety criterion we use when working with persistent memory is durable linearizability [127]. Informally, an implementation of an object is durably linearizable if it is linearizable and has the following additional properties in case of a full-system crash and recovery: (1) all operations that completed before the crash are reflected in the post-recovery state and (2) if some operation op that was ongoing at the time of the crash is reflected in the post-recovery state, then so are all the operations on which op depends (i.e., operations whose effects op observed and thus need to be linearized before op).

5.3 Impossibility

In this section we show that any lock-free disjoint-access-parallel (DAP) implementation of MCAS requires at least one CAS per modified word. Consider a call to k-CAS(addr_1, ..., addr_k, [old and new values]). We call addr_1, ..., addr_k the set of targets of the call. We also define the range of the call in an execution E to be the set of locations on which CAS (single-word CAS) is performed, successfully or not, during the call in E. Intuitively, we say that an MCAS implementation is DAP if non-conflicting calls to k-CAS do not access the same memory locations; for the formal definition, see [124].

Definition 5.3.1 (Star Configuration). We say that a set {c_0, ..., c_ℓ} of calls to k-CAS is in a star configuration if (1) the sets of targets of c_0 and c_i are non-disjoint for all i ∈ {1, ..., ℓ}, and (2) the sets of targets of c_i and c_j are disjoint for all i ≠ j ∈ {1, ..., ℓ}.

An example of a star configuration for ℓ = k is the following set of calls 𝒞 = {c_0, ..., c_k}, where we omit old and new values for ease of notation and we assume that the addresses a_i^(j) are all distinct:

• c_0: k-CAS(a_1^(0), ..., a_k^(0)).

• c_1: k-CAS(a_1^(0), a_2^(1), ..., a_k^(1)). Call c_1's set of targets intersects that of c_0 in a_1^(0).

• c_i, 1 ≤ i ≤ k: k-CAS(a_1^(i), ..., a_i^(0), ..., a_k^(i)). Call c_i's set of targets intersects that of c_0 in a_i^(0) and is disjoint from the set of targets of c_j for all j ≠ i, j ≠ 0.

In this section, we assume without loss of generality that all calls in 𝒞 have the correct old values for their target addresses and that each new value is distinct from its respective old value. Under these assumptions, in every execution it must be that either c_0 succeeds and all of c_1, ..., c_k fail, or c_0 fails and all of c_1, ..., c_k succeed.

We say that a state S of an implementation A is c_0-valent with respect to (wrt) some subset C ⊆ 𝒞 if, for any call c_i ∈ C, in any execution starting from S in which only c_0 and c_i take steps, c_0 succeeds. Similarly, we say that a state S is C-valent wrt c_0 if, for any call c_i ∈ C, in any execution starting from S in which only c_0 and c_i take steps, c_0 fails. We say that a state is univalent wrt c_0 and C if it is c_0-valent or C-valent; otherwise it is bivalent wrt c_0 and C. A state is critical wrt c_0 and C when (1) it is bivalent wrt c_0 and C and (2) if any process in {c_0} ∪ C takes a step, the state becomes univalent wrt c_0 and C.

Note that the initial state of A must be bivalent wrt c_0 and any non-empty subset of 𝒞.

Lemma 5.3.2. Consider a lock-free implementation A of k-CAS and let 𝒞 = {c_0, ..., c_ℓ} be a star configuration of calls to k-CAS. Then there exists an execution E of A such that, for all i ≥ 1, the ranges of c_0 and c_i in E are non-disjoint.

Proof. We follow a bivalency proof structure. We construct an execution in which process p_i performs call c_i, i ≥ 0. For ease of notation, we say that “call c_i takes a step” to mean “process p_i takes a step in its execution of c_i”.

The execution proceeds in stages. In the first stage, as long as some call in 𝒞 can take a step without making the state univalent wrt c_0 and any non-empty subset of 𝒞, let that call take a step. If the execution runs forever, the implementation is not lock-free. Otherwise, the execution enters a state S where no such step is possible, which must be a critical state wrt c_0 and some subset C_1 ⊆ 𝒞 \ {c_0}. We choose C_1 to be maximal, i.e., state S is not critical wrt c_0 and any subset of 𝒞 \ C_1 (otherwise, add that subset to C_1).

We prove in Lemma 5.3.3 below that c_0 and all calls in C_1 are about to perform CAS on some common location l_1. We let c_0 perform that CAS step, bringing the protocol to state S′. By our choice of C_1 as maximal, S′ must be bivalent wrt c_0 and any subset of 𝒞 \ C_1. The execution now enters the second stage, in which we let calls in 𝒞 \ C_1 take steps until they reach a critical state wrt c_0 and some subset C_2 ⊆ 𝒞 \ C_1. By induction, we can show that eventually c_0 will have reached critical points wrt all calls in 𝒞. At the end of the execution, we resume each process in 𝒞 \ {c_0} for one step; they were each about to perform a CAS step on some location on which c_0 has already performed a CAS step. Thus, in this execution, all calls in 𝒞 \ {c_0} have performed a CAS on a common location with c_0.

Lemma 5.3.3. Consider a lock-free implementation A of k-CAS and let 𝒞 = {c_0, ..., c_k} be a star configuration of calls to k-CAS. If S is a critical state of A wrt c_0 and some subset C ⊆ 𝒞, then in S, c_0 and all calls in C are about to perform a CAS step on a common location l.

Proof. From S, we consider the next steps of c_0 and any c_i ∈ C:

Case 1 One of the calls is about to read; assume wlog it is c_0. Consider two possible scenarios. First scenario: c_i moves first and runs solo until it returns (c_i must succeed because c_i took the first step). Second scenario: c_0 moves first and reads, then c_i runs solo until it returns (c_i must fail because c_0 took the first step). But the two scenarios are indistinguishable to c_i, thus c_i must either succeed in both or fail in both, a contradiction.

Case 2 Both calls are about to write. In this case, they must be about to write to the same register r, otherwise their writes commute. First scenario: c_0 writes r, then c_i writes r, then c_i runs solo until it returns (c_i must fail since c_0 took the first step). Second scenario: c_i writes r and then runs solo until it returns (c_i must succeed since c_i took the first step). But the two scenarios are indistinguishable to c_i, since its write to r obliterated any potential write by c_0 to r, so c_i must either succeed in both scenarios or fail in both; a contradiction.

Case 3 c_0 is about to CAS and c_i is about to write (or vice-versa). In this case, their operations must be to the same memory location r (otherwise they commute). First scenario: c_0 CASes r, then c_i writes to r and then runs solo until c_i returns (c_i must fail since c_0 took the first step). Second scenario: c_i writes to r and then runs solo until it returns (c_i must succeed since c_i took the first step). But the two scenarios are indistinguishable to c_i, since its write to r obliterated any preceding CAS by c_0 to r; thus c_i must either succeed in both scenarios or fail in both; a contradiction.

Case 4 Both calls are about to CAS. In this case, they must be about to CAS the same location, otherwise their CASes commute.

Theorem 5.3.4. Consider a lock-free disjoint-access-parallel implementation A of k-CAS in a system with n > k processes. Then there exists some execution E of A such that in E some call to k-CAS performs CAS on at least k locations.

Proof. We prove the theorem by contradiction. We first assume that calls to k-CAS perform CAS on exactly k−1 locations and derive a contradiction; we later show how assuming that k-CAS performs CAS on at most k−1 locations also leads to a contradiction.

We construct an execution E in which two concurrent but non-contending k-CAS calls (i.e., two k-CAS calls with disjoint sets of targets) perform CAS on the same location, thus contradicting the disjoint-access-parallelism (DAP) property and proving the theorem.

Let c_0, ..., c_k be k+1 calls to k-CAS in a star configuration. By Lemma 5.3.2, there exists an execution E of A such that, for all i ≥ 1, the ranges of c_0 and c_i in E are non-disjoint.

Let l_1, ..., l_{k−1} be the range of c_0. By Lemma 5.3.2, in E the range of c_1 must intersect that of c_0 in at least one location; assume wlog it is l_1. Furthermore, the range of c_2 must also intersect that of c_0 in at least one location; moreover, due to the DAP property, the intersection must contain some location other than l_1, since c_1 and c_2 have disjoint sets of targets. By induction, we can show that the range of each call c_i, i ∈ {1, 2, ..., k−1}, intersects the range of c_0 in l_i. However, the range of c_k must also intersect the range of c_0 in some location other than l_1, ..., l_{k−1}, due to the DAP property. We have reached a contradiction.

If we now assume that calls to k-CAS perform CAS on k−1 or fewer locations, then we also reach a similar contradiction as above. In fact, if some call c_i performs CAS on strictly fewer than k−1 locations, this may cause the contradiction to occur before call c_k, as c_i now has fewer locations to choose from in order to intersect with the range of c_0 in some location that is not in the ranges of c_1, ..., c_{i−1}.

5.4 Volatile MCAS with k+1 CAS

In this section we describe our MCAS construction for volatile memory. Our algorithm uses k+1 CAS operations in the common uncontended case, and does not involve cleaning up after completed MCAS operations. In Section 5.6 we describe a memory management scheme that can be used to clean up after completed MCAS operations, as well as for reclaiming or reusing the operation descriptors employed by the algorithm.

5.4.1 High-level Description

As is standard practice [102, 106, 219], our MCAS construction supports two operations: MCAS and read. Similarly to most MCAS algorithms [102, 106, 219], the MCAS operation uses operation descriptors that contain a set of addresses (the target addresses or words), and old and new values for each target address. In addition, each operation descriptor contains a status word indicating the status of the corresponding MCAS operation.

The MCAS operation proceeds in two stages. In the first stage, we attempt to install a pointer to the operation descriptor in each memory word targeted by the MCAS operation. If we succeed in installing the pointer, we say that the target address is owned (or locked) by the descriptor. The first stage ends when all target addresses are owned by the descriptor, or when we find a target address with a value different from the expected one. In the second stage, we finalize the MCAS operation by atomically changing its status to indicate its success or failure, depending on whether the first stage was successful (i.e., all target addresses have been locked). The read operation returns the current value at an address, either by reading it directly from the target address or by reading the appropriate value from a descriptor of a completed MCAS operation installed in that address. If either MCAS or read encounters another MCAS in progress (e.g., when they attempt to read the current value in the target address), they first help that MCAS operation to complete.

5.4.2 Technical Details

Structures and Terminology. We describe the structures used by our algorithm and explain the terminology. Pseudocode for the structures is shown in Listing 5.1.

Listing 5.1 – Data structures used by our MCAS algorithm for volatile memory.

struct WordDescriptor {
  void* address;
  uintptr_t old;
  uintptr_t new;
  MCASDescriptor* parent;
};

enum StatusType { ACTIVE, SUCCESSFUL, FAILED };

struct MCASDescriptor {
  StatusType status;
  size_t N;
  WordDescriptor words[N];
};

Algorithm 5.2 – The readInternal auxiliary function, used by our MCAS algorithm for volatile memory.

1  readInternal(void* addr, MCASDescriptor* self) {
2    retry_read:
3    val = *addr;
4    if (!isDescriptor(val)) then return <val, val>;
5    else { // found a descriptor
6      MCASDescriptor* parent = val->parent;
7      if (parent != self && parent->status == ACTIVE) {
8        MCAS(parent);
9        goto retry_read;
10     } else {
11       return parent->status == SUCCESSFUL ?
12         <val, val->new> : <val, val->old>;
13   }}}

An MCASDescriptor describes an MCAS operation. It contains a status field, which can be ACTIVE, SUCCESSFUL or FAILED, the number N of words targeted by the MCAS and an array of WordDescriptors for those words. These WordDescriptors are the children of the MCASDescriptor, which is their parent. We say that an MCASDescriptor (and the MCAS it describes) is active if its status is ACTIVE and finalized otherwise.

The WordDescriptor contains information related to a given word as a target of an MCAS operation: the word's address in memory, its expected value and the new intended value. The WordDescriptor also contains a pointer to the descriptor of its parent MCAS operation. As described later, this pointer is used as an optimization for fast lookup of the status field in the MCASDescriptor, and can be eliminated.

Algorithm. Both MCAS and read operations rely on the auxiliary readInternal function shown in Algorithm 5.2.

Algorithm 5.3 – Our MCAS algorithm for volatile memory. Commands in italic are related to memory reclamation (discussed in a later section).

14 read(void* address) {
15   epochStart();
16   <content, value> = readInternal(address, NULL);
17   epochEnd();
18   return value; }

20 MCAS(MCASDescriptor* desc) {
21   epochStart();
22   success = true;
23   for wordDesc in desc->words {
24     retry_word:
25     <content, value> = readInternal(wordDesc.address, desc);
26     // if this word already points to the right place, move on
27     if (content == &wordDesc) continue;
28     // if the expected value is different, the MCAS fails
29     if (value != wordDesc.old) { success = false; break; }
30     if (desc->status != ACTIVE) break;
31     // try to install the pointer to my descriptor; if failed, retry
32     if (!CAS(wordDesc.address, content, &wordDesc)) goto retry_word;
33   }
34   if (CAS(&desc.status, ACTIVE, success ? SUCCESSFUL : FAILED)) {
35     // if I finalized this descriptor, mark it for reclamation
36     retireForCleanup(desc); }
37   returnValue = (desc.status == SUCCESSFUL);
38   epochEnd();
39   return returnValue; }

The readInternal function takes an address addr and an MCASDescriptor self (called the current descriptor) and returns a tuple. The tuple contains two values (which might be identical) that, intuitively, represent the contents of the given (target) address and the actual value the former represents. More specifically, readInternal reads the content of the given addr (Line 3). If addr does not point to a descriptor (this is determined by the isDescriptor function; see below), the returned tuple contains two copies of the contents of addr (Line 4). If addr points to an active WordDescriptor whose parent is not the same as self, then readInternal helps the other (MCAS) operation to complete (Line 8) and then restarts (Line 9). Therefore, the role of the self pointer is to prevent an (MCAS) operation from helping itself recursively. If addr points to a finalized descriptor, the tuple returned by readInternal contains the pointer to the descriptor and the final value, corresponding to the status of the descriptor (Line 12). Finally, if addr points to a descriptor whose parent is equal to self, then readInternal returns the pointer to that descriptor (Line 12; a value is also returned in the tuple in this case, but is disregarded; see below).

Algorithm 5.3 provides the pseudo-code for the read and MCAS operations. The pseudo-code includes extensions relevant to memory management (in italics), whose discussion is deferred to Section 5.6.

The read operation is simply a call to readInternal with self equal to null as the current operation descriptor (Line 16).

The MCAS operation takes as argument an MCASDescriptor and returns a boolean indicating success or failure. As mentioned above, the operation proceeds in two stages. In the first stage, MCAS attempts to take ownership of (or acquire) each target word (Lines 23–33). To this end, for each WordDescriptor w in its words array, we start by calling readInternal on w's target address addr (Line 25; as described above, this handles any helping required in case another active operation owns addr). If addr is already owned by the current MCAS, we move on to the next word (Line 27). Otherwise, if the current value at addr does not match the expected value of w, the MCAS cannot succeed and thus we can skip the next WordDescriptors and go to the second stage (Line 29). If the values do match, we re-check that the operation is still active (Line 30); otherwise we go to the second stage—this prevents a memory location from being re-acquired by the current operation op in case op was already finalized by a helping thread. Finally, we attempt to take ownership of addr through a CAS (Line 32). Note that the failure of this CAS might mean that another thread has concurrently helped this MCAS to lock the target word. Therefore, we simply retry taking ownership of this target word, rather than failing the MCAS operation (Line 32).

In the second stage (Lines 34–36), MCAS finalizes the descriptor by atomically changing its status from ACTIVE to SUCCESSFUL (if all word acquisitions were successful in stage one) or to FAILED (otherwise).

Our pseudocode assumes the existence of the isDescriptor function, which takes a value and returns true if and only if the value is a pointer to a WordDescriptor. This function can be implemented, for instance, by designating a low-order mark bit in a word to indicate whether it contains a pointer to a descriptor or not [106, 226]. Whenever we make an address point to a descriptor (e.g., Line 32) or convert the contents of a word into a pointer to a descriptor (e.g., Line 6), we also set or unset the mark bit, respectively. In the interest of clarity, we do not show the implementation of isDescriptor or the code for marking/unmarking pointers.
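For concreteness, one possible realization of isDescriptor and of the marking/unmarking helpers is sketched below; it assumes WordDescriptors are at least 2-byte aligned, so the least significant bit of a pointer to them is always zero (the helper names markDescriptor and unmarkDescriptor are ours).

#include <cstdint>

constexpr uintptr_t DESCRIPTOR_MARK = 0x1;  // low-order mark bit

// A word contains a pointer to a WordDescriptor iff its mark bit is set.
inline bool isDescriptor(uintptr_t word) {
    return (word & DESCRIPTOR_MARK) != 0;
}

// Turn a descriptor pointer into a marked word before installing it with CAS.
inline uintptr_t markDescriptor(const void* descriptor) {
    return reinterpret_cast<uintptr_t>(descriptor) | DESCRIPTOR_MARK;
}

// Recover the original descriptor pointer from a marked word.
inline void* unmarkDescriptor(uintptr_t word) {
    return reinterpret_cast<void*>(word & ~DESCRIPTOR_MARK);
}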

5.4.3 Correctness

In this section we argue that our MCAS algorithm is linearizable and lock-free. We give preliminary invariants before showing the main results.

Lemma 5.4.1. Once an MCAS descriptor is finalized, its status never changes again. The status can only be modified through the CAS at Line 34, whose expected value is ACTIVE. If the CAS succeeds, the new value of the status can only be SUCCESSFUL or FAILED, thus any subsequent attempt to change the status will fail.

Lemma 5.4.2. An MCAS descriptor is finalized by at most one thread. This follows from the fact that a descriptor is finalized through a CAS and the fact that an MCAS descriptor cannot change status after being finalized (Lemma 5.4.1).

Lemma 5.4.3. If at least one thread attempts to finalize a descriptor d, some thread will successfully finalize d. The initial status of a descriptor is ACTIVE. Any thread attempting to finalize a descriptor does so through the CAS at Line 34, with expected value ACTIVE. Thus, at least one CAS finds the status to be ACTIVE and successfully changes it.

Lemma 5.4.4. An MCAS descriptor d is finalized as successful only if some thread observed all target locations of d to be acquired by d. This is because the status is changed to successful only if the success variable is true at Line 34. This only happens if some thread completed the for-loop over all of d's WordDescriptors without exiting the loop at Line 29. The only two ways for a thread to move to the next WordDescriptor in the loop are if the thread sees that the current target location was already acquired by d (Line 27) or if the thread successfully acquired the current target location for d (Line 32). In both cases the thread observed the target location to be acquired by d.

Lemma 5.4.5. An MCAS descriptor d is finalized as failed only if some thread observed a target location of d to contain a different value than its expected value in d. This is because the only way for the status to be changed to failed is if the success variable is false. This only happens if some thread observed that the current value of a target location is different from its expected value (Line 29).

Lemma 5.4.6. After a location l becomes acquired by some operation op, l will never become un-acquired again. This is because the only instruction that modifies a location l is the acquire CAS at Line 32.

We say that an operation op1 helps an MCAS operation op2 if op1 calls MCAS with op2's descriptor in Line 8.

Lemma 5.4.7. After a location l becomes acquired by some operation op, no operation op′ ≠ op will acquire l before op becomes finalized. Assume by contradiction that op′ acquires l after op acquires l and while op is still active. Consider op′'s last call to readInternal (Line 25) before the successful acquisition of l. During this call, op′ must have observed that l is owned by op (otherwise, if op had acquired l after the call to readInternal, the acquisition CAS would have failed). Moreover, op was active during that call to readInternal by op′. Thus, op′ helped op before returning from readInternal, finalizing op in the process. Thus op cannot be active at the time of the acquisition, a contradiction.

We say that a location l is re-acquired by operation op at time t if (1) l becomes acquired by op at time t, (2) there exists a time t′ < t such that l became acquired by op′ ≠ op at time t′, and (3) there exists a time t″ < t′ such that l became acquired by op at time t″.

Lemma 5.4.8. A location cannot be re-acquired. Assume the contrary and let t be the earliest time when any location is re-acquired in a given execution E. Let l be that location and op be the operation re-acquiring it. This means that l became acquired by op at some time t″, then became acquired by some op′ ≠ op at time t′ > t″, and then later became acquired by op at time t > t′ (the times in this lemma are represented in the timeline below, for convenience). By Lemma 5.4.7, op must have become finalized at some time t_f < t′ (t_f is unique by Lemma 5.4.1).

Now consider the thread T which acquires l on behalf of op at time t. T does so through the CAS at Line 32. Since op becomes finalized at time t_f, T must have performed the status check at Line 30 at some time t_s < t_f (otherwise T would have exited from the for loop without acquiring l). Let t_r < t_s be the last time when T performed readInternal (Line 25) before t_s. Note that t_r < t″, otherwise T would have seen l as already acquired by op at Line 27 and continued without attempting to acquire l.

Let 〈c, v〉 be the return value of the readInternal call by T at t_r; this means that l's value was c at some time before t_r. Since T successfully performs the CAS at Line 32 at time t, the value of l must also be c immediately before t. However, the acquisition of l at time t″ > t_r changes the value of l from c. Therefore, it must be the case that some thread changes the value of l back to c at some time t_c between t″ and t. Note that c must be a word descriptor (due to Lemma 5.4.6). Since word descriptors are unique, they uniquely identify their parent operations. Therefore, l must have been owned by some operation op″ before t_r and again at t_c; this means that op″ re-acquired l at time t_c, contradicting our choice of t as the earliest re-acquisition time.

[Timeline: t_r < t″ (op acquires l) < t_f (op finalized) < t′ (op′ acquires l) < t_c < t (op acquires l).]

Lemma 5.4.9. A location l cannot be acquired by operation op after op is finalized as successful. This follows from Lemmas 5.4.4 and 5.4.8.

Lemma 5.4.10. If op1 helps op2, then either op2's highest acquired location is higher than op1's highest acquired location, or op1 has not acquired any locations. If op1 is a read operation, the statement is trivially true. Assume now that op1 is an MCAS operation that helps op2 and that op1's highest acquired location is higher than op2's highest acquired location (⋆). Since op1 helps op2, op1 has observed one of its target locations l to be already acquired by op2. But since op1 iterates over locations in increasing order, l must be higher than op1's highest acquired location. This contradicts (⋆).

We define the helping graph at time t, H(t), as follows. The vertices of H(t) are the ongoing operations at time t. There is an edge from op1 to op2 if op1 is helping op2 at t. We define the call depth of an operation op at time t to be the length of the longest path starting from op in H(t).

Lemma 5.4.11. For any operation op and any time t, the call depth of op at t is finite. Assume the contrary. Since each thread can have at most one ongoing operation at t, H(t) has a finite vertex set. Let op be an operation and t be a time such that op has an infinite call depth at t. Then, H(t) must contain a cycle. This is a contradiction: if the cycle contains an operation op′ that has no acquired locations, then op′'s predecessor in the cycle cannot be helping it; if the cycle does not contain such an operation, then by traversing the cycle we would find operations with strictly increasing highest acquired locations (Lemma 5.4.10).

Informally, Lemma 5.4.11 says that while in our algorithm it is possible for operations to recursively help one another, the recursion depth is finite at any time, due to the sorting of memory locations.

We define the following predicates (recall that n is the number of threads). Let S(k): “If there are 0 < k ≤ n concurrent operations and at least one thread is taking steps and no operations are created, at least one operation will eventually return”. Let P(k): “If there are 0 < k ≤ n concurrent operations and at least one thread is taking steps, at least one operation will eventually return”.

Lemma 5.4.12. S(k) is true for all k, 0 < k ≤ n. Assume the contrary. Pick an active thread T: T is taking steps infinitely often, but no operations ever return. By Lemma 5.4.11, the call depth of T is finite, thus T must be taking some backward branch infinitely often. If T is taking the branch at Line 9 infinitely often, then MCAS operations are being finalized infinitely often (Line 9 is only executed if some operation was active at Line 7; but that same operation must be finalized by Line 9 due to the preceding MCAS call, which returns only after the operation is finalized). This is a contradiction because we started with a finite number of MCAS operations and no operations are being created. If T is taking the branch at Line 32 infinitely often, then locations either (a) become acquired infinitely often or (b) change owners infinitely often. Both possibilities lead to a contradiction: (a) because there is a finite number of target locations of ongoing MCAS operations and locations never become unacquired, and (b) because locations change owners only after operations become finalized, which would imply that operations become finalized infinitely often.

Lemma 5.4.13. P(k) is true for all k, 0 < k ≤ n. Consider the case k = n. P(k) is equivalent to S(k) in this case (no operations can be created if there are already as many operations as threads), and thus true. Consider the case k = n−1. If some operation is eventually created, then eventually some operation will return, by P(n). If no operation is ever created, then eventually some operation will return, by S(n−1). We can continue in this manner with k = n−2, ..., 1, each time using either P(k+1) or S(k).

Lemma 5.4.14. Our implementation is lock-free. This follows immediately from Lemma 5.4.13.

Lemma 5.4.15. Linearization point of a failed MCAS. By Lemma 5.4.5, if descriptor d is finalized as failed by thread T at time t_1, then at some time t_0 < t_1, T has observed some target location l to contain a different value than l's expected value in d. We can take t_0 as the linearization point of the MCAS.

Lemma 5.4.16. Linearization point of a successful MCAS. By Lemma 5.4.4, if thread T changes the status of descriptor d to successful, then T previously observed all of d's target locations to be acquired by d. Thus, when changing the status of d to successful, T changes the logical values of all target locations, marking the linearization point.

Lemma 5.4.17. Linearization point of a read. The linearization point of a read is the last executed dereference instruction at Line 3.

5.5 Persistent MCAS with k+1 CAS and 2 Persistent Fences

We discuss the modifications required to make our volatile MCAS algorithm work with persistent memory. The extra instructions are shown underlined in Algorithms 5.4 and 5.5.

Algorithm 5.4 – The readInternal auxiliary function, used by our MCAS algorithm for persis- tent memory. Underlined commands are related to persistence.

1  readInternal(void* addr, MCASDescriptor* self) {
2    retry_read:
3    val = *addr;
4    if (!isDescriptor(val)) then return <val, val>;
5    else { // found a descriptor
6      MCASDescriptor* parent = val->parent;
7      if (parent != self && parent->status == ACTIVE) {
8        MCAS(parent);
9        goto retry_read;
10     } else if (parent->status & DirtyFlag) {
11       PERSISTENT_FLUSH(&parent->status);
12       PERSISTENT_FENCE();
13       parent->status = parent->status & ~DirtyFlag;
14       goto retry_read;
15     } else {
16       return parent->status == SUCCESSFUL ?
17         <val, val->new> : <val, val->old>;
18   }}}

In the MCAS function (Algorithm 5.5), after all target locations have been successfully acquired, we add one persistent flush per target word and one persistent fence overall. The persistent fence ensures that all target locations persistently point to their respective WordDescriptors before attempting to modify the status.

When finalizing the status in Line 41, we mark the status with a special DirtyFlag. This flag indicates that the status is not yet persistent. We then perform a persistent flush and fence after the status has been finalized. This ensures that the finalized status of the descriptor is persistent before returning from the MCAS. Finally, we unset the DirtyFlag with a simple store (Line 46); this store cannot create a race with the CAS in Line 41 because that CAS must fail (the status must already be finalized if some thread is already at Line 46).

We also modify the readInternal function (Algorithm 5.4) such that, when an operation op encounters another operation op′ whose status is finalized but still has the DirtyFlag set, op helps op′ persist its status and unsets the DirtyFlag on op′'s status.


Algorithm 5.5 – Our MCAS algorithm for persistent memory. Commands in italic are related to memory reclamation, and underlined commands are related to persistence.

19 read(void* address) {
20   epochStart();
21   <content, value> = readInternal(address, NULL);
22   epochEnd();
23   return value; }

25 MCAS(MCASDescriptor* desc) {
26   epochStart();
27   success = true;
28   for wordDesc in desc->words {
29     retry_word:
30     <content, value> = readInternal(wordDesc.address, desc);
31     // if this word already points to the right place, move on
32     if (content == &wordDesc) continue;
33     // if the expected value is different, the MCAS fails
34     if (value != wordDesc.old) { success = false; break; }
35     if (desc->status != ACTIVE) break;
36     // try to install the pointer to my descriptor; if failed, retry
37     if (!CAS(wordDesc.address, content, &wordDesc)) goto retry_word; }
38   for wordDesc in desc->words { PERSISTENT_FLUSH(wordDesc.address); }
39   PERSISTENT_FENCE();
40   newStatus = success ? SUCCESSFUL : FAILED;
41   if (CAS(&desc.status, ACTIVE, newStatus | DirtyFlag)) {
42     // if I finalized this descriptor, mark it for reclamation
43     retireForCleanup(desc); }
44   PERSISTENT_FLUSH(&desc.status);
45   PERSISTENT_FENCE();
46   desc.status = desc.status & ~DirtyFlag;
47   returnValue = (desc.status == SUCCESSFUL);
48   epochEnd();
49   return returnValue; }

Our modifications enforce the following invariants. First, at the time when a descriptor becomes finalized, its acquisitions of target locations are persistent. Second, at the time when an MCAS operation returns, its finalized status is persistent. Third, when a read or MCAS operation op returns, all operations on which op depends are finalized and their statuses are persistent. With these invariants, we can argue that our persistent MCAS is correct. By correctness we refer to lock-freedom (liveness) and durable linearizability (safety). Lock-freedom is clearly preserved by our additions, thus we focus on durable linearizability. We examine the point in time when a full-system crash may occur during the execution of an MCAS operation op. There are two possibilities to consider:

1. If the crash occurs before op's status was finalized and made persistent, then we know that no operation op′ which observed the effects of op could have returned before the crash; otherwise, op′ would have helped op and persisted its status. In this case, neither op nor any such op′ will be linearized before the crash; during recovery, their effects will be rolled back by reverting any acquired locations to their old values.

2. If the crash occurs after op's status was finalized and made persistent, then op is linearized before the crash. During recovery, any locations still acquired by op will be detached and given either their new or old values (depending on op's success or failure status), as specified in op's descriptor.

In sum, the recovery procedure of our algorithm is as follows. The recovery goes through each operation descriptor D. If D's status is not finalized, then we roll D back by going through each target location ℓ of D; if ℓ is acquired by D (i.e., points to D), then we write into ℓ its old value, as specified in D. If D's status is finalized, then we detach D and install final values: we go through each target location ℓ of D; if ℓ is acquired by D and D was successful (resp. failed), then we write into ℓ the new (resp. old) value as specified in D.
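A sketch of this recovery routine, in the same pseudocode style as the listings above, is given below. It assumes a way to enumerate all allocated descriptors (allDescriptors(), cf. Section 5.6.1) and reuses the hypothetical markDescriptor helper introduced earlier; the DirtyFlag is stripped when inspecting the status.

recover() {
  for desc in allDescriptors() {
    status = desc->status & ~DirtyFlag;
    for wordDesc in desc->words {
      // is the location still acquired by desc, i.e., pointing to wordDesc?
      if (*wordDesc.address == markDescriptor(&wordDesc)) {
        // roll back unless desc was finalized as successful
        *wordDesc.address = (status == SUCCESSFUL) ? wordDesc.new : wordDesc.old;
        PERSISTENT_FLUSH(wordDesc.address);
      }
    }
  }
  PERSISTENT_FENCE(); // a crash during recovery simply re-runs this routine
}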

5.6 Memory Management

The MCAS algorithm has been presented so far under the assumption that no memory is ever reclaimed. For practical considerations, however, one should be able to reclaim and/or reuse MCAS operation descriptors. While efficient memory management of concurrent data structures remains an active area of research (see, e.g., [9, 43, 78, 200, 228]), here we describe one possible mechanism suitable for an MCAS implementation. We first describe memory management in the context of the volatile MCAS implementation presented in Section 5.4; we then explain how to extend our scheme to the persistent implementation in Section 5.5, as well as a possible optimization to make reads more efficient.

We note that the life cycle of an operation descriptor comprises several phases. Once its status is no longer ACTIVE, the (finalized) descriptor cannot be recycled just yet, as certain memory locations can still point to it. Therefore, we first need to detach such a descriptor by replacing the pointers to the descriptor (using CAS) with actual values (according to whether the corresponding MCAS has succeeded or failed) in the affected memory locations. Only after that can a detached descriptor be recycled, provided no concurrently running thread holds a reference to it. Note that CASes in the detachment phase are necessary only for those affected memory locations that still point to the to-be-detached descriptor, which, as our evaluation shows, is rare in practice.

Our scheme keeps track of two categories of descriptors: (1) those that have been finalized but not yet detached and (2) those that have been detached but to which other threads might still hold references. Similar to RCU approaches [171, 173], we use thread-local epoch counters to track threads’ progress and infer when a descriptor can be moved from category (1) to category (2), and when a descriptor from category (2) can be reclaimed.

We use two thread-local lists for reclaiming operation descriptors: one list is for descriptors that have been finalized, but not detached yet (finalizedDescList), and another is for descriptors that have been detached but to which readers might still hold references (detachedDescList).

In general, our memory management scheme is similar to an RCU (read-copy-update) implementation [171, 173]. We start with a simple blocking scheme and then extend it into a non-blocking one. Each thread maintains an epoch number, incremented by the thread upon entry to and before exit from the read and MCAS functions (see, e.g., Lines 15 and 17 in Algorithm 5.3). In the retireForCleanup function (cf. Line 36 in Algorithm 5.3), a thread adds the given descriptor to finalizedDescList. Once the size of this list reaches a certain threshold, the thread invokes a function semantically similar to synchronize_rcu() [171]. That is, it runs through all thread epochs, and waits for every epoch with an odd value (indicating that a thread is inside the read or MCAS functions) to advance. Once all epochs are traversed, all descriptors currently in the detachedDescList list can be reclaimed (returned to the operating system or put into a list of available descriptors for reuse). At the same time, all descriptors currently in the finalizedDescList list can be moved to the detachedDescList list, after replacing pointers to those descriptors in the corresponding memory locations with their actual values.
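The epoch traversal of this blocking scheme is sketched below in the same pseudocode style (allThreads and the per-thread epoch field are assumed to be globally visible; the function name is ours).

waitForAllEpochs() {
  for t in allThreads {
    e = t->epoch;
    if (e % 2 == 1)               // t is currently inside read or MCAS
      while (t->epoch == e) { }   // wait until t advances its epoch
  }
  // now: descriptors in detachedDescList can be reclaimed, and descriptors
  // in finalizedDescList can be detached and moved to detachedDescList
}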

To elaborate on this last step: given an MCASDescriptor d that is about to be moved from finalizedDescList to detachedDescList, a thread runs through all the WordDescriptors stored in d. For every such WordDescriptor w, the thread checks whether the location w->address still points to w (i.e., is still acquired by d) and, if so, writes w->old or w->new into w->address according to the status of d. The check and the write are done atomically using CAS.
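In the same pseudocode style, the detach step for one descriptor d could look roughly as follows (a sketch; markDescriptor is the hypothetical marking helper used earlier).

detach(MCASDescriptor* d) {
  useNew = (d->status == SUCCESSFUL);
  for wordDesc in d->words {
    expected = markDescriptor(&wordDesc);         // still acquired by d?
    value = useNew ? wordDesc.new : wordDesc.old;
    // a failed CAS is harmless: the location was already overwritten,
    // e.g., acquired by a later MCAS operation
    CAS(wordDesc.address, expected, value);
  }
}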

The scheme presented so far is blocking—if a thread does not advance its epoch number, any thread will be unable to complete the traversal of epochs. To avoid this issue, each thread may store a local copy of all thread epochs it has seen during the last traversal. On its next epoch traversal, it compares the current and the previously seen epochs for each thread t, and if those two are different, it infers that t has made progress. If progress is detected for all threads, any descriptor that was placed into finalizedDescList (detachedDescList) before the previous epoch traversal can be detached (reclaimed, respectively).

Note that while this scheme is non-blocking, a failure of a thread might prevent reclamation of any memory associated with descriptors. This is a common issue with epoch-based recla- mation schemes [78], which could be resolved either by enhancing the scheme (e.g., as in [43]) or by switching to a different scheme, e.g., one based on hazard pointers [78, 178].

5.6.1 Managing Persistent Memory

We now show how the memory management mechanism described above is extended to manage persistent memory. Upon recovery from a crash, any pending PMCAS operation is completed as described in Section 5.5. Pending PMCAS operations can be found by scanning allocated descriptors (e.g., if descriptors are allocated from a pool, similar to David et al. [72]). Moreover, since we assume the recovery is done by a single thread, we can immediately detach and recycle any finalized descriptor (after writing back the actual values into the corresponding memory locations). Therefore, when considering persistent memory, the only change required to support correct recycling of descriptors (in addition to using a persistent memory allocator) is flushing all writes while detaching descriptors and introducing a persistent fence right before reclaiming descriptors from detachedDescList. The fence is required to avoid a situation where a detached descriptor is recycled and a crash happens while the descriptor is being initialized with new values. In this case, and if a fence is not used, some memory locations may still point to the descriptor (since updates to those locations might not have been persisted before the crash), while the descriptor may already be updated with new content. Note, though, that the flushes and the fence take place off the critical path, therefore their impact on the performance of PMCAS is expected to be negligible.

5.6.2 Efficient Reads

Once a memory location has been modified by an MCAS operation, even if by a failed one, it would refer to an operation descriptor until that descriptor is detached. Until that happens, the latency of a read operation from that memory location would be increased as it would have to access an operation descriptor to determine the value that needs to be returned by the read. The memory management mechanism as described above, however, would detach the descriptor only as a part of an MCAS operation. This might cause degraded performance for read-dominated workloads in which MCAS operations are rare.

To this end, we propose the following optimization for eventually removing references to an operation descriptor and storing the corresponding value directly in the memory location as part of the read operation. If a read operation finds a pointer to a finalized operation descriptor, it will generate a pseudo-random number¹. With a small probability, it will run a simplified version of the memory reclamation scheme described above. Specifically, it will scan the epochs of all other threads, and then change the contents of the memory location it attempts to read to the actual value (using CAS). (To avoid deadlock between two threads scanning epoch numbers, a thread may indicate that it is in the middle of an epoch scan so that any descriptor can be detached, but not recycled, at that time.)
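Sketched in the same pseudocode style, the optimization could be hooked into the read path roughly as follows (CLEANUP_PROBABILITY, randomFraction and scanAllEpochs are hypothetical names).

// called by read() when it finds a finalized WordDescriptor wd installed at addr
maybeDetachOnRead(void* addr, WordDescriptor* wd) {
  if (randomFraction() > CLEANUP_PROBABILITY) return;  // usually do nothing
  scanAllEpochs();  // after this, the descriptor may be detached (but not recycled)
  value = (wd->parent->status == SUCCESSFUL) ? wd->new : wd->old;
  CAS(addr, markDescriptor(wd), value);  // a failed CAS is harmless
}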

5.7 Evaluation

5.7.1 Experimental Setup

We evaluate our algorithm on a 2-socket Intel Xeon machine with two E5-2630 v4 processors operating at 3.1 GHz. Each processor has 10 cores and each core has 2 hardware threads (40 hardware threads total).

¹ Generating a local pseudo-random number is a relatively inexpensive operation that requires only a few processor cycles (see, for instance, the generator in ASCYLIB [18]).


Each experimental run lasts 5 seconds; shown values are the average of 5 runs. We base our evaluation on the framework available from the authors of PMwCAS [198, 226].

The baseline of our evaluation is the volatile version of PMwCAS [198, 226], a state-of-the-art implementation of the Harris et al. [106] algorithm. Like the Harris et al. algorithm, volatile PMwCAS requires 3k+1 CASes per k-CAS. We use PMwCAS as our baseline since (1) it has recent, openly available and well-maintained code and (2) it is to our knowledge the only other MCAS algorithm in which readers do not write to shared memory in the common uncontended case.

PMwCAS implements an optimization of the Harris et al. algorithm: it marks pointers with a special RDCSS flag instead of allocating a distinct RDCSS descriptor. However, we found that this optimization made the PMwCAS algorithm incorrect, due to an ABA vulnerability. In our evaluation, we fixed the PMwCAS implementation to allocate and manually manage RDCSS descriptors.

Our evaluation uses three benchmarks: an array benchmark, in which threads perform MCAS-based read-modify-write operations at random locations in an array; a doubly-linked list benchmark, in which threads perform MCAS-based operations on a list implementing an ordered set; and a B+-tree benchmark, in which threads perform MCAS-based operations on a B+-tree. The first two benchmarks are based on the implementation available in [198], and the third is based on PiBench [197] and BzTree [17, 46]. We note, however, that we modified the benchmark in [198] so that all threads operate on the same key range (rather than having each thread use a unique set of keys), so we could induce contention by controlling the size of the key range.

In each experiment, we vary the number of threads from 1 to 39 (we reserve one hardware thread for the main thread). Threads are assigned according to the default settings in the evaluation frameworks used [46, 197, 198]. In the array and list benchmarks, threads are assigned in the following way: we first populate the first hardware thread of each core on the first socket, then on the second socket, then we populate the second hardware thread on each core on the first socket, and finally the second hardware thread on each core on the second socket. The B+-tree benchmark uses OpenMP [188], which dictates thread assignment; it also employs a scalable memory allocator [2].

5.7.2 Array Benchmark

The benchmark consists of each thread performing the following in a tight loop: reading k locations at random from the array (k = 4 in our experiments), computing a new value for each location, and attempting to install the new values using an MCAS.
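One iteration of this loop can be sketched as follows (illustrative only; read, mcas and compute_new_value stand in for the corresponding operations of the algorithm under test, and duplicate indices are ignored for brevity).

void benchmark_iteration(word_t* array, size_t array_size) {
    word_t* addrs[K];                          // K = 4 in our experiments
    word_t expected[K], desired[K];
    for (int i = 0; i < K; i++) {
        size_t idx = rand() % array_size;      // pick a random location
        addrs[i] = &array[idx];
        expected[i] = read(addrs[i]);          // MCAS-aware read
        desired[i] = compute_new_value(expected[i]);
    }
    mcas(K, addrs, expected, desired);         // may fail under contention; the outer loop simply retries
}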

In this benchmark we measure two quantities. The first is throughput: the number of read-modify-write operations completed successfully per time unit. The second metric is the helping ratio.

Figure 5.1 – Array benchmark. Top row shows throughput (higher is better), bottom row shows helping ratio (lower is better). Each column corresponds to a different array size (10, 100 and 1000, respectively).

We measure the helping ratio by dividing the number of ongoing MCAS operations encountered (and helped) during read or MCAS operations by the total number of MCAS operations. A higher helping ratio thus means more operations are slowed down due to the need to help other, incomplete MCAS operations.

We run the benchmark with three array sizes (10, 100, and 1000) in order to capture different contention levels. The results of this benchmark are shown in Figure 5.1 (our algorithm is denoted AOPT in all figures in this section).

The top row of Figure 5.1 shows that our algorithm outperforms PMwCAS at every contention level and at every thread count, including in single-threaded mode. This can be explained by two related factors. First, our algorithm has a lower CAS complexity (k + 1 CASes per k-CAS for our algorithm compared to 3k + 1 for PMwCAS). Second, as a consequence of its lower complexity, in our algorithm there is a shorter “window” for each MCAS operation to interfere with other operations by forcing them to help.

To illustrate the second factor above, we examine the helping ratios of the two algorithms (bottom row of Figure 5.1). We observe that the helping ratio of our algorithm is considerably lower than that of PMwCAS. This means that, on average, each operation helps (and is slowed down by) fewer MCAS operations in our algorithm than in PMwCAS.

In order to quantify the impact of descriptor cleanup on performance in our algorithm, we also measure the detaching ratio: the number of CASes performed in order to detach (in the sense of Section 5.6) finalized MCAS descriptors, divided by the total number of completed MCAS operations.

Figure 5.2 – Doubly-linked list benchmark. Top row shows results for 80% reads workload; bottom row shows results for 98% reads workload. Each column corresponds to a different initial list size (5, 50 and 500 elements, respectively).

Figure 5.3 – B+-tree benchmark. Top row shows results for 80% reads workload; bottom row shows results for 98% reads workload. Each column corresponds to a different initial tree size (16, 512 and 4096 elements, respectively).

We find the detaching ratio to be less than 0.001 for every thread count and array size. This is because finalized MCAS descriptors are constantly being replaced by ongoing MCAS operations, and thus recycling these detached descriptors requires no CASes. We conclude that the vast majority of our MCAS operations do not incur any cleanup CASes.

5.7.3 Doubly-linked List Benchmark

In this benchmark we operate on a shared ordered set object implemented from a doubly-linked list. The list supports search and update (insert and delete) operations. Insertions are done using 2-CAS and deletions are done using 3-CAS. We initialize the list by inserting a predefined (configurable) number of nodes. During the benchmark, each thread selects an operation type (search, insert or delete) at random, according to a configurable distribution; the thread also selects a value at random; it then performs the selected operation with the selected value.
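The deletion path can be sketched as follows. This is only our guess at a plausible word choice for the 3-CAS; the benchmark code in [198] is authoritative, and the LIVE/DELETED state word is an illustrative assumption.

bool list_delete(node* prev, node* victim, node* succ) {
    void* addrs[3]    = { &prev->next, &succ->prev, &victim->state };
    void* expected[3] = { victim,      victim,      (void*) LIVE };
    void* desired[3]  = { succ,        prev,        (void*) DELETED };
    // An insertion uses an analogous 2-CAS on prev->next and succ->prev.
    return mcas(3, addrs, expected, desired);
}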

We perform this benchmark with three initial list sizes (5, 50 and 500 elements) and two operation distributions: (1) 80% reads, 20% updates (in all our experiments, updates are evenly distributed among insertions and deletions) and (2) 98% reads. As is standard practice, the initial size of the list is half of the key range. Results are shown in Figure 5.2. We also ran experiments with 50% reads and 100% reads; for better readability, performance graphs for these less representative cases are deferred to Appendix D.2.

With 80% reads, our algorithm outperforms PMwCAS for list sizes 5 and 50 by 2.6× and 2.2× on average, respectively. This shows that under high and moderate contention, our algorithm’s faster MCAS operations (due to the double effect of lower complexity and lower helping ratio) compensate for its slower read operations (due to the extra level of indirection), even in read-heavy workloads. In the low contention case (list size 500), PMwCAS outperforms our algorithm at low thread counts and is outperformed at high thread counts. On average, PMwCAS outperforms our algorithm by 1.2×. Under low contention, operations have a low probability to conflict on the same element and thus the lower read complexity of PMwCAS has a stronger impact on performance than the lower MCAS complexity of our algorithm.

With 98% reads, we observe a similar behavior, with our algorithm outperforming PMwCAS by 2.3× and 1.8× for list sizes 5 and 50 respectively; for list size 500, PMwCAS outperforms our algorithm by 1.29× on average.

5.7.4 B+-tree Benchmark

In this benchmark we operate on a B+-tree which supports search and update (insert and delete) operations. Insertions and deletions use k-CAS, where k may vary, e.g., depending on whether the operation led to nodes being split or merged.

Similar to the previous benchmark, we initialize the B+-tree with a configurable number of entries; threads then select operations and values at random. We perform the benchmark with three initial tree sizes (16, 512 and 4096) and two operation distributions: 80% reads and 98% reads (as for the previous benchmark, performance graphs for the 50% reads and 100% reads cases are shown in Appendix D.2). As before, the initial size of the tree is half of the key range. Results are shown in Figure 5.3.

We observe a similar behavior to the previous benchmark. With both 80% reads and 98% reads, our algorithm outperforms PMwCAS under high and medium contention (because it performs fewer CASes and triggers less helping) and is slightly outperformed under low contention (where helping no longer plays a major role).

5.8 Related Work

Since the problem of atomic read-modify-write to several locations is central to many concurrent algorithms, significant work on this topic has been done.

Lock- and wait-free implementations of MCAS. Our algorithm shares similarities with previous work [106, 226]: as has become standard practice, it uses operation descriptors and a three-phase design (locking, status-change and unlocking). However, our algorithm introduces key differences with respect to previous work: it defers the unlocking phase and combines it with the reclamation of descriptors, without compromising correctness. This deferment has a triple beneficial effect on complexity: (1) it removes k CASes from the critical path, (2) it allows these CASes to be amortized across several operations, and (3) it removes the onus of ABA-prevention from the locking phase, thus shaving off k further CASes from the latter.

Table 5.1 summarizes the differences between our algorithm and existing non-blocking MCAS implementations. The results in Table 5.1 reflect the number of CASes per MCAS operation required for correctness by each algorithm in the common uncontended case. We note that previous MCAS implementations perform descriptor cleanup immediately after applying MCAS, and it is not clear how to separate cleanup from these algorithms while preserving correctness. If we take the cleanup cost into consideration for our algorithm as well, its theoretical (worst-case) complexity becomes 2k + 1, the same as some of the previous work. As our experiments in Section 5.7 demonstrate, however, the number of CASes in the cleanup phase is negligible in practice. Furthermore, we highlight the fact that unlike most previous work, including the one that employs 2k + 1 CASes, readers in our case do not write into the shared memory in the common case, even when cleanup is considered.

Table 5.1 – Comparison of non-blocking MCAS implementations in terms of the number of CAS instructions required, whether readers perform writes to shared memory or expensive atomic instructions, and the number of persistent fences (all per k-word MCAS, in the uncontended case).

                                   CASes    Readers write   P. fences
Israeli and Rappoport [124]        3k + 2   Yes             N/A
Anderson and Moir [11]             3k + 2   Yes             N/A
Moir [182]                         3k + 4   Yes             N/A
Harris et al. [106]                3k + 1   No              N/A
Ha and Tsigas [101, 102]           2k + 2   Yes             N/A
Attiya and Hillel [22]             6k + 2   N/A             N/A
Sundell [219]                      2k + 1   Yes             N/A
Feldman et al. [88]                3k − 1   Yes             N/A
Wang et al. [226] (volatile)       3k + 1   No              N/A
Wang et al. [226] (persistent)     5k + 1   No              2k + 1
Our algorithm                      k + 1    No              2

Israeli and Rappoport [124] propose a lock-free and disjoint-access-parallel implementation based on LL/SC and show how LL/SC can be obtained from CAS. Their algorithm requires storing per-thread valid bits at each memory location, thus limiting the number of bits available for data. In the absence of contention, a k-CAS requires 3k + 2 CAS instructions if using the LL/SC implementation from CAS provided in the paper. In their implementation, uncontended reads (i.e., read operations that do not help concurrent MCAS operations) perform expensive atomic LL instructions, which can be emulated by writes to shared memory, thus limiting performance in common read-heavy workloads.

Anderson and Moir [11] propose a wait-free implementation also based on LL/SC. The strong progress guarantee comes with high space requirements: each memory word needs to be followed contiguously by an auxiliary word containing information needed to help complete an ongoing operation on the memory word.

Moir [182] simplifies [11] considerably by removing the requirement of wait-freedom. Instead, his algorithm is conditionally wait-free: it is lock-free and provides a means to communicate with an external helping mechanism which may cancel MCAS operations that are no longer required to complete.

Harris et al. [106] introduce a lock-free algorithm based on CAS operations. In order to avoid the ABA problem, the algorithm uses a double-compare-single-swap primitive (implemented using two CAS instructions, in the absence of contention) to make each target word point to a global MCAS descriptor while ensuring that the descriptor is still active. In order to distinguish between values and descriptors, the two least-significant bits are reserved in each word. In total, a k-word MCAS uses 3k + 1 CAS instructions in the uncontended case.

Ha and Tsigas [101, 102] provide lock-free algorithms which measure the amount of contention on MCAS target words and react by dynamically choosing the best helping policy.

Attiya and Hillel [22] give a lock-free implementation using CAS and DCAS that requires 6k + 2 CAS instructions for a k-word MCAS in the uncontended case. To avoid the ABA problem, this algorithm stores a tag with each pointer which it atomically increments every time the pointer changes. The algorithm uses a conflict-resolution scheme in which contending operations decide whether to help or reset one another based on how many locations each operation acquired before the conflict was detected (preference is given to operations that own more locations). Their implementation does not provide a separate read operation.

Sundell [219] proposes a scheme that uses 2k + 1 CAS instructions for a k-word MCAS (in the absence of contention). An MCAS operation first uses CAS to acquire ownership of each target word, changes the status using a CAS and then uses CAS to write the final values back into the target word. The algorithm is wait-free under the assumption that there is a bound on the number of MCAS operations with equal old and new values.

Feldman et al. [88] propose an algorithm that is both wait-free and ABA-free. In their helping mechanism, a thread actively announces if it is blocked (i.e., if it fails to complete due to concurrent MCAS operations), relying on contending operations to help it to complete.

General techniques. Transactional memory (TM) [114, 214] can be seen as the most general approach to providing atomic access to multiple objects. It allows a block of code to be designated as a transaction and thus executed atomically, with respect to other transactions. Thus, TM is strictly more general than MCAS. This generality comes at a cost: software implementations of transactional memory (STM) have prohibitive performance overheads, whereas hardware support (HTM) is subject to spurious aborts and thus only provides “best-effort” guarantees.

Like any concurrent object, MCAS can be implemented using a universal construction [110], but such an implementation is not disjoint-access-parallel and has high overhead.

Restricted and extended multi-word operations. Previous work has explored other operations that atomically read and modify multiple words. These operations are either more general or more restricted than MCAS.

Luchangco, Moir and Shavit [166] present an obstruction-free implementation of a “k-compare-single-swap”, which compares on k words but only modifies one word (more restricted than MCAS). Their algorithm is based on LL/SC, for which they give an obstruction-free implementation from CAS.

Brown et al. [44] introduce extensions to LL/SC called LLX/SCX, which are more general than k-compare-single-swap, but more restricted than MCAS. LLX/SCX primitives operate on sets of data records, each comprising several words. SCX allows modifying a single word of a data record, conditional on the fact that no data record in a specified set was modified since LLX was last performed on it. Furthermore, SCX allows finalizing a subset of the data records, preventing them from being modified again. While LLX/SCX and MCAS can be used to solve similar problems, MCAS is more generic, as it allows modifying k words atomically, whereas LLX/SCX only allow modifying a single word.

Timnat et al. [220] propose an extension of MCAS called MCMS (Multiple Compare Multiple Swap), which also allows addresses to be compared without being swapped (more general than MCAS). They provide implementations of MCMS based on HTM and on the algorithm by Harris et al. [106].

Persistent MCAS. Pavlovic et al. [193] provide an implementation of MCAS for persistent memory which differs from ours in the progress guarantee (theirs is blocking) and hardware assumptions (theirs uses HTM). In their algorithm, a transaction is used to atomically verify expected values and acquire ownership of all target locations. In case of success, the new values are written non-transactionally. Reads that encounter a location owned by an MCAS operation block until the location is no longer owned.

Wang et al. [17, 226] introduce the first lock-free persistent implementation, based on the algorithm of Harris et al. [106]. The main differences with respect to our algorithm are outlined in Table 5.1. This algorithm uses a per-word dirty flag to indicate that the word is not yet guaranteed to be written to persistent memory. Operations encountering a set dirty flag will persist the associated word and then unset the flag. This technique avoids unnecessary persistent flushes, but uses 2 extra CAS instructions per target location in order to manipulate the dirty flag. In total, this algorithm uses 5k + 1 CAS instructions for a k-word MCAS in the uncontended case. Their implementation does not use explicit persistent fences; instead, it relies on the CAS instructions that are already required to unset the dirty flag to also enforce ordering among write backs [123]. Their original algorithm uses 2k + 1 such “CAS-fences”, but we believe it can be modified to only require 3 persistent fences.
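The per-word dirty-flag handling described above can be sketched as follows. This is our reading of Wang et al. [226]; the names pread, persist and DIRTY_BIT are illustrative, not their actual interface.

uint64_t pread(uint64_t* addr) {
    uint64_t v = *addr;
    if (v & DIRTY_BIT) {                 // the last writer has not yet persisted this word
        persist(addr);                   // write back the cache line (e.g., CLWB followed by a fence)
        CAS(addr, v, v & ~DIRTY_BIT);    // clear the dirty flag; failure means another operation already did
        v &= ~DIRTY_BIT;
    }
    return v;
}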

In our work we use the recent durable linearizability correctness condition [127], which assumes a full-system crash-recovery model, but other models of persistent memory can be explored in this context [6, 30, 67, 100, 196].

5.9 Conclusion

Atomic multi-word primitives significantly simplify concurrent algorithm design, but existing implementations have high overhead. In this chapter, we propose a simple and efficient lock- free algorithm for multi-word compare-and-swap, designed for both volatile and persistent memory. The complementary lower bound shows that the complexity of our algorithm, as measured in the number of CASes in the uncontended case, is nearly optimal.

Part III – Memory Reclamation

6 Fast and Robust Memory Reclamation for Concurrent Data Structures

In concurrent systems without automatic garbage collection, it is challenging to determine when it is safe to reclaim memory, especially for lock-free data structures. Existing concurrent memory reclamation schemes are either fast but do not tolerate process delays, robust to delays but with high overhead, or both robust and fast but narrowly applicable. This chapter introduces QSense, a novel concurrent memory reclamation technique. QSense is a hybrid technique with a fast path and a fallback path. In the common case (without process delays), a high-performing memory reclamation scheme is used (fast path). If process delays block memory reclamation through the fast path, a robust fallback path is used to guarantee progress. The fallback path uses hazard pointers, but avoids their notorious need for frequent and expensive memory fences. QSense is widely applicable, as we illustrate through several lock-free data structure algorithms. Our experimental evaluation shows that QSense has an overhead comparable to the fastest memory reclamation techniques, while still tolerating prolonged process delays.

6.1 Introduction

6.1.1 The Problem

Any realistic application requires its data structures to grow and shrink dynamically and hence to reclaim memory that is no longer being used. For the foreseeable future, many high-performing applications, such as operating systems and databases, will be written in languages where programmers manage memory explicitly (such as C or C++). There is thus a clear need for concurrent data structures that scale and efficiently allocate/free memory. Designing such data structures is however challenging, as it is not clear when it is safe to free memory, especially when locks are prohibited (lock-free constructions [90, 115]).

To illustrate the difficulty, consider, as shown in Figure 6.1, two processes p1 and p2 concurrently accessing a linked list of several nodes. Process p1 is reading node n1, while p2 is concurrently removing node n1. Assume p2 unlinks n1 from the list. Then, p2 needs to decide whether n1’s memory can be freed.

Figure 6.1 – The concurrent memory reclamation problem.

If p2 were allowed to block p1 and inspect p1’s references, then it could easily determine whether freeing n1 is safe. But in a lock-free context, p2 has a priori no way of knowing if p1 is still accessing n1 or not. So, if p2 goes ahead and frees node n1, it triggers an illegal access next time p1 tries to use n1.

6.1.2 The Trade-off

Various approaches have been proposed to address the issue of concurrent, programmer-controlled memory reclamation for lock-free data structures. Hazard pointers (HP) [179] (which we recall in more detail in Section 6.3) is perhaps the most widely used method. Basically, the programmer publishes the addresses (hazard pointers) of nodes for as long as they cannot be safely reclaimed. A reclaiming process must ensure that a node it is about to free is not marked by any other process. The hazard pointer methodology has two important advantages: (1) it is wait-free and (2) it is applicable to a wide range of data structures. However, hazard pointers also have a notable drawback: they require a memory fence instruction to be issued for every node traversed in the data structure. This can decrease performance by up to 75% (as we will see in Section 6.7).

Most memory reclamation techniques that seek to overcome the performance penalty of hazard pointers have been analyzed in terms of amortized overhead [4, 40, 43, 107]: the overhead of reclamation operations is spread across several node accesses or across several operations, thus considerably reducing their impact on performance. Quiescent State Based Reclamation (QSBR) (which we recall in Section 6.3) is among the most popular schemes applying the amortized overhead principle [43, 107]. QSBR is fast and can be applied to virtually any data structure. However, QSBR is blocking: if a process is delayed for a long time (a process failure is a particular type of delay), an unbounded amount of memory might remain unreclaimed. As such, using QSBR with lock-free data structures would negate one of the main advantages of lock-freedom: robustness to process delays.

There have indeed been several proposals for achieving both lock-freedom and low amortized overhead [40, 43]. Yet, these are ad-hoc methods that apply only to certain well-chosen data structures. They require significant effort to be adapted to other data structures (as we discuss in Section 6.8).

Overall, the current prevalent solutions for concurrent memory reclamation are either wait-free but with high overhead, fast but blocking, or ad-hoc, lacking a clear and systematic methodology on how to apply them.

6.1.3 The Contributions

We design, implement and evaluate QSense, a novel technique for concurrent memory reclamation that is at the same time easily applicable to a wide range of concurrent data structures (including lock-free ones), robust, and fast.

QSense uses a hybrid approach to provide fast and robust memory reclamation. Figure 6.2 depicts a high-level view of QSense. In the common case (i.e., when processes do not undergo prolonged delays), the fast QSBR scheme is employed. A delay could be caused, for instance, by cache misses, application-related delays, or being descheduled by the operating system.

By prolonged delay, we refer to a delay of a process p1 that is long enough such that a large number of nodes (larger than a given configurable threshold) are removed but cannot be safely reclaimed by another concurrent process p2. If prolonged process delays are detected, QSense automatically switches to a fall-back memory reclamation scheme that is robust. When all processes are active again (no more prolonged delays), the system automatically switches back to the fast path. By robustness we mean that any process performing operations on the data structure (called worker process) will finish any action related to memory reclamation within a bounded number of steps, regardless of the progress of other worker processes. To guarantee this progress, we require certain timing assumptions about a set of auxiliary background processes that do not participate in the actual data structure operations (all assumptions are discussed in Section 6.5).

The fall-back path consists of a subprotocol we call Cadence, a novel amortized variant of the widely-used hazard pointer mechanism. Cadence overcomes the necessity for per-node memory barriers during data structure traversal, significantly increasing performance. We achieve this through two new concepts. The first is rooster processes: background processes that periodically wake up and generate context switches, which act as memory barriers. The periodic context switches ensure that any hazard pointer becomes visible to other processes within a bounded time T . The second concept is deferred reclamation: a process p only reclaims a removed node n after n has been awaiting reclamation for longer than T . This ensures any hazard pointer potentially protecting n must be visible. Therefore, n’s memory can be safely freed provided that no hazard pointers are protecting n. Cadence can be used either as part of QSense or as a stand-alone memory reclamation scheme.

Figure 6.2 – A high-level view of QSense.

QSense requires minimal additions to the data structure code, consisting only of a number of calls to functions provided by the QSense interface. We present a simple set of rules that determine where to place these calls. QSense communicates with data structures through three functions: manage_qsense_state, assign_HP and free_node_later. The rules concerning where to call these three functions are (a minimal sketch of their placement follows the list):

1. call manage_qsense_state in states where no references to shared objects are held by the processes (usually in between data structure operations).

2. call assign_HP before using a reference to a node so as to inform other processes that the node’s memory should not be reclaimed yet.

3. call free_node_later whenever a node is removed from the data structure (where free would be called in a sequential setting).
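The following sketch illustrates where the three calls go in a lock-free linked-list traversal. It is a minimal sketch under an assumed list layout and hazard pointer index, not the code from Appendix E.2, and the usual hazard pointer revalidation and retry policy is simplified.

node* list_search(list_t* l, key_t key) {
    manage_qsense_state();                 // rule 1: declare a quiescent state between operations
retry:;
    node* prev = l->head;
    node* cur = prev->next;
    while (cur != NULL && cur->key < key) {
        assign_HP(cur, 0);                 // rule 2: protect cur before dereferencing it
        if (prev->next != cur) goto retry; // revalidate, as in the hazard pointer methodology
        prev = cur;
        cur = cur->next;
    }
    return cur;
}
// A removal additionally calls free_node_later(n) after unlinking n (rule 3),
// exactly where a sequential implementation would call free(n).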

We show empirically that QSense achieves good performance in a wide range of scenarios, while providing the same progress guarantees as a hazard pointer based scheme. Our experiments in Section 6.7 show that QSense achieves at most 29% overhead on average over leaky implementations (i.e., in which memory is never reclaimed) of a lock-free concurrent linked list, a skip list and a binary search tree. Moreover, QSense outperforms the popular hazard pointers technique by two to three times. To illustrate this point, Figure 6.3 shows sample results from our experiments, comparing QSense to hazard pointers and no memory reclamation (leaky implementation) on a concurrent linked list.

Roadmap. The rest of this chapter is organized as follows. In Section 6.2, we pose the problem.

Figure 6.3 – QSense, HP and no reclamation on a linked list of 2000 elements, with a 10% updates workload.

In Section 6.3, we recall prior work that inspired QSense. In Section 6.4, we give an overview of QSense. Then, in Section 6.5, we dive into the details of Cadence and detail the assumptions needed for its correctness. In Section 6.6, we prove safety and liveness properties of QSense. In Section 6.7, we compare QSense’s performance against that of popular memory reclamation schemes. Finally, in Section 6.8, we describe related work and we conclude in Section 6.9.

6.2 Model and Problem Definition

We consider a set of n processes that communicate through a set of shared memory locations using primitive memory access operations. A node is a set of memory locations that can be viewed as a logical entity. A data structure consists of one or more fixed nodes that can always be accessed directly by the processes, called roots, and the set of nodes that can be reached by following pointers from the roots.

6.2.1 Node States

At any given time, a node can be in one of five states [179]: (1) Allocated — the node has been allocated by a process, but not yet inserted into the data structure. (2) Reachable — the node is reachable by following valid pointers from the roots of the data structure. (3) Removed — the node is no longer reachable, but may still be in use by some processes. (4) Retired — the node is removed and cannot be used by any process, but is not yet free. (5) Free — the node’s memory is available for allocation.

6.2.2 The Memory Reclamation Problem

We can now state the memory reclamation problem as follows: given some removed nodes, make them available for re-allocation (i.e., change their state to free) after it is no longer possible for any process (except the reclaiming process) to access them (i.e., after they have become retired).

The problem is distinct from garbage collection. Here, we are only concerned with the reclamation of memory used by data structure nodes, and not the reclamation of arbitrary memory regions. Moreover, the nodes whose memory needs to be reclaimed are expressly marked by the programmer after they have been explicitly unlinked from the data structure and are no longer reachable.

6.2.3 Terminology

Safe: A node n is safe [179] for a process p if: (1) n is allocated and p is the process that allocated it, or (2) n is reachable, or (3) n is removed or retired and p is the process that removed n from the data structure.

Possibly unsafe: A node n is possibly unsafe [179] for a process p if it is impossible, using only knowledge of p’s private variables and the semantics of the algorithm, to positively establish that n is safe for p. Informally, a node n is possibly unsafe if it is impossible to determine using only local process data and knowledge of the algorithm whether accessing n will not trigger an access violation (e.g., if n has been reclaimed in the meanwhile).

Access hazard: An access hazard [179] is a step in the algorithm that might result in accessing a possibly unsafe node for the process that is executing the algorithm.

Hazardous reference: A process p holds a hazardous reference [179] to a node n if one of p’s private variables holds n’s address and p will reach an access hazard that uses n’s address. Informally, a hazardous reference is an address that will be used later in a hazardous manner (i.e., to access possibly unsafe memory) without further verification of safety.

6.3 Background

In this section we recall quiescent state based reclamation and hazard pointers, the techniques QSense builds upon.

6.3.1 Quiescent State Based Reclamation

Quiescent State Based Reclamation (QSBR) is a technique which emerged in the context of memory management in operating system kernels [16]. Quiescence-based schemes are fast, outrunning popular pointer-based schemes under a variety of workloads [107]. Therefore, in QSense, we chose a quiescence technique for the fast path. Though not a contribution of this chapter, the QSBR technique is detailed below, for completeness.

QSBR makes use of quiescent states (at process level) and grace periods (at system level).

A process is in a quiescent state if it does not hold references to any shared objects in the data structure. Quiescent states need to be specified at the application level. Typically, a process is in a quiescent state whenever it is in between operations (read/insert/delete). In practice, quiescent states are declared after processes have finished a larger number of operations — called the quiescence threshold — as batching operations in this way boosts performance.

A grace period is a time interval in the execution during which each worker process in the system goes through at least one quiescent state. If the time interval [a,b] is a grace period, after time b no process holds hazardous references to nodes that were removed before time a. The occurrence of grace periods is managed through an epoch-based technique [43, 107]. At every step of the execution, every process is in one of three logical epochs. Each process has three lists in which removed nodes are stored, called limbo lists (one per epoch). If a node n has been removed when a process p was in epoch i, n will be added to p’s i-th limbo list. Each process keeps track of its local epoch and all processes have access to a shared global epoch. When a process p declares a quiescent state, it does the following. If p’s local epoch ep is different from the global epoch eG, then p updates ep to eG. Else, if all processes have their local epoch equal to eG (including p), p increments eG by 1.
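The epoch manipulation in a quiescent state can be sketched as follows. This is only our reading of the scheme; the variable names, the modulo-3 indexing of limbo lists and the helper all_local_epochs_equal are illustrative assumptions.

void quiescent_state(void) {
    int e = global_epoch;                      // epochs are logical counters; limbo lists are indexed modulo 3
    if (local_epoch[my_id] != e) {
        // Entering epoch e: nodes retired two epochs ago can no longer be
        // referenced by any process, so that limbo list can be freed and reused.
        free_all(&limbo_list[(e + 1) % 3]);
        local_epoch[my_id] = e;
    } else if (all_local_epochs_equal(e)) {
        CAS(&global_epoch, e, e + 1);          // p increments the global epoch by 1
    }
}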

The Problem of Robustness in QSBR. The main advantage of QSBR is its low overhead. Nevertheless, its lack of robustness to significant process delays makes its out-of-the-box use unsuitable for some practical applications. As we show in Section 6.7, if achieving a grace period is no longer possible or takes a significant amount of time, the system might run out of memory and eventually block. In Section 6.5, we show how we address the problem of resilience to prolonged process delays in QSense.

6.3.2 Hazard Pointers (HP)

The main idea behind the hazard pointers scheme [179] is to associate with each process a number of single-writer multi-reader pointers. These pointers — called hazard pointers — are used by processes to indicate which nodes they might be about to access without further validation. The nodes marked by hazard pointers are unsafe to reclaim.

The hazard pointer scheme mainly consists of node marking and node reclamation. Node marking is the assignment of hazard pointers to nodes. This ensures that the reclaiming process can discern which of the nodes that were removed from the data structure are retired (and thus safe to reclaim). Node reclamation is a procedure that makes nodes available for re-allocation. Every time a process removes a node from a data structure, the process adds a reference to the node in a local list of removed nodes. After a given number of node removals, each process will go through its list of removed nodes and will free those nodes that are not protected by any hazard pointers. The programmer needs to ensure the following condition:

Condition 6.3.1. At the time of any hazardous access by a process p to the memory location of a node n (access hazard), n has been continuously protected by one of p’s hazard pointers since a time when n was definitely safe for p.

Michael [179] provides a methodology for programmers to enforce this condition: (1) Identify all hazardous references and access hazards in the data structure code. (2) For each hazardous reference, determine the step when the reference is created and the last access hazard where the reference is used. This is the period when the reference needs to be protected by a hazard pointer. (3) Examine the overlap of the periods from step 2. The maximum number of hazard pointers is the maximum number of distinct hazardous references that exist simultaneously for the same process. (4) For each hazardous reference, assign a hazard pointer to it and immediately afterwards, verify if the reference (the node) is still safe. If the verification fails, follow the path of the original algorithm that corresponds to failure due to contention (e.g., try again, backoff etc.).

The Problem of Instruction Reordering in HP. An important practical consideration when applying the above methodology is instruction reordering [115, 170]. In most modern processors, instructions may be executed out of order for performance reasons. In particular, in the widespread x86/AMD64/SPARC TSO memory models, stores may be executed after loads, even if the stores occur before the loads in the program code. Instruction reordering is relevant in the case of hazard pointers due to step 4 in the methodology above. We assign a hazard pointer and then verify that the node is still safe, thus ensuring that the hazard pointer starts protecting that node from a time when the node is definitely safe. However, if assigning the hazard pointer (a store) is reordered after the validation (a load), then we can no longer be certain that the node is still safe when the hazard pointer becomes visible to other processes. Therefore, it is necessary for the programmer to insert a memory barrier between the hazard pointer assignment and the validation. Listing 6.1 shows the high-level instructions that need to be added to the data structure code when accessing a node, if hazard pointers are used (lines 2–4).

A memory barrier [115] (or fence) is an instruction that ensures no memory operation is reordered around it — in particular, all stores that occur before a memory barrier in program order will be visible to other processes before any load appearing after the barrier is executed. Therefore, when validating that a node is still safe (as per step 4 in the methodology), we can be certain that the hazard pointer is already visible.

Listing 6.1 – High-level steps taken when accessing a node in a data structure using hazard pointers.

1 Read reference to node n

2 Assign a hazard pointer to n

3 Perform a memory barrier

4 Check if node n is still valid

5 Access n's memory

6 Release reference to n (e.g., move to successor node)

Memory barriers are expensive instructions. They can take hundreds of processor cycles. This cost results in a significant performance overhead for hazard pointer implementations, especially in read-only data structure operations (update operations typically use other expensive synchronization primitives such as compare-and-swap, so the marginal cost of memory barriers due to hazard pointers is much lower than for read-only operations). Moreover, these memory barriers must be performed on a per-element basis, which causes the performance of hazard pointers to scale poorly with data structure size.

6.4 An Overview of QSense

Combining QSBR’s high performance with hazard pointers’ robustness in a hybrid memory reclamation scheme is appealing. In this section, we first argue why merging QSBR with the original hazard pointers technique is, however, a challenge. Then, we give a high-level view of QSense.

6.4.1 Rationale

One could imagine a hybrid scheme where QSBR and hazard pointers are two separate entities, with the switch between the two schemes triggered by a signal or flag. QSBR would run in the common case (when no process delays are observed) and hazard pointers would be employed when a long process delay makes quiescence impossible. However, after a switch to hazard pointers based reclamation, hazardous references from when the system was running in QSBR mode would need to be protected as well. So, hazard pointers should be protecting nodes during the entire execution of the system, regardless of the mode of operation the system is currently in. As discussed above, the original hazard pointers algorithm requires a memory barrier call after every hazard pointer update, to ensure correctness. However, ideally, when the system operates in QSBR mode, the per-node memory barriers required by hazard pointers should be eliminated. Per-node memory barriers should be placed as specified in the HP algorithm only when the system goes into fallback mode. Nonetheless, such an approach is not correct. The scenario in Listing 6.2 illustrates why.

Consider two processes, PR and PD , at a time t when the system makes the switch from the fast path to the fallback path. PR is a reader process performing steps R1 to R5 and PD is a deleting process performing steps D1 to D4, during which it detects that it must switch to the fallback scheme.

Assume that PD ’s steps are not reordered (we can use memory barriers to ensure this, since we are mainly concerned with removing memory barriers from read-only operations, but not necessarily from deletion operations). However, PR ’s steps can be reordered; more precisely, in the TSO model, the R2 store can be delayed past all subsequent reads [170] if PR does not detect that the fallback-flag is turned on and does not perform a memory barrier.

Listing 6.2 – Example of illegal operation interleaving.

(R1). Read a pointer to a node n (Load)
(R2). Assign a hazard pointer to n (Store)
(R3). If fallback mode is active (Load), execute a memory barrier
      (here, suppose fallback mode is inactive and the memory barrier is not executed)
(R4). Recheck n (Load)
(R5). Use n (Loads and Stores)

(D1). Remove n
(D2). Check fallback-flag (here, suppose fallback mode was activated)
(D3). Scan hazard pointers
(D4). Free n (assuming no hazard pointer points to it)

Next, consider the following interleaving of PR ’s and PD ’s steps. Initially the fallback-flag is off (the system is running in the fast path). The reader will read a reference to n (R1) and assign a hazard pointer to n (R2). At R3 the fallback-flag is off, so the memory barrier is not executed. Thus, the store of the hazard pointer (R2) can be delayed past all subsequent reads. Then, PR rechecks n (R4) and finds that it is still a valid node. Now, suppose that another process P activates the fallback path. This is where PD steps in and executes D1 through D4. Since PR ’s store of the hazard pointer was delayed, PD is free to reclaim the node in question. Finally, in step R5, PR will try to use the reference to n that it had acquired without publishing the hazard pointer via a memory barrier and thus attempt to access a reclaimed node, which is incorrect.

If a memory barrier was called after the update of each hazard pointer in both the fast and fallback paths, the QSBR/HP hybrid would function correctly. However, adding per-node memory barriers when running the fast path means re-introducing the main performance bottleneck of the fallback scheme into the fast path. Consequently, the performance of the hybrid in QSBR mode would be similar to its performance in HP mode, which is what we initially set out to avoid.

The challenge is to eliminate the traversal memory barriers when the system operates in the fast path (QSBR), while optimistically updating the hazard pointer values. We address this challenge by designing Cadence, a hazard pointer inspired memory reclamation scheme which does not require per-node memory barriers upon traversal. Cadence is presented in Section 6.5. Then, in Section 6.6 we show that Cadence is a good candidate for the fallback scheme in QSense, preserving the safety properties of the algorithm, while not hindering the performance of QSBR in the fast path.

6.4.2 QSense in a Nutshell

QSense is a hybrid scheme, unifying the high-performing approach provided by QSBR and the robustness provided by hazard pointers. QSense is an adaptive scheme, using two paths (a fast path and a fallback path) and automatically deciding when it should switch paths. QSBR constitutes the fast path. QSBR is used in the common case, when all processes are active in the system. As QSBR has been presented in prior work [43, 107], we omit the implementation details.

When one or more processes experience prolonged delays (e.g., blocked in I/O), there is a risk of exhausting the available memory, because quiescence is not possible. In this case, Cadence serves as a safety net. Cadence guarantees that QSense continues to function within a bounded amount of memory. Cadence eliminates the expensive per-node memory fences needed for data structure traversal, the main drawback of the original hazard pointer scheme. Instead of using memory barriers, Cadence forces periodic context switches to make outstanding hazard pointer writes visible. Since the cost of expensive operations (in our case context switches) is spread across a large number of operations, Cadence achieves scalability that is up to three times as good as the original hazard pointer scheme, while maintaining the same safety guarantees. Moreover, using Cadence as the fallback path allows the elimination of memory barriers when running in the fast path. QSense automatically detects the need to switch to the fallback scheme and triggers the switch through a shared flag. Similarly, when all the processes become active again in the system (e.g., return from a routine incurring a longer delay), QSense re-establishes the quiescence-based reclamation mechanism automatically.

Applicability. QSense can be used with any data structure for which both the fast path and the slow path are applicable. Since Cadence does not introduce any additional usability constraints compared to hazard pointers, and QSBR can be applied to virtually any data structure, this means that QSense can be used with any data structure for which hazard pointers are applicable. Applying QSense to a data structure is done in three steps: (1) Call manage_qsense_state to declare a quiescent state (e.g., between every two data structure operations). The function automatically handles amortizing overhead by executing the memory reclamation code only once every Q calls to manage_qsense_state (where Q is the quiescence threshold introduced in Section 6.3.1). (2) Following the methodology from Section 6.3.2, protect hazardous references by calling assign_HP. (3) To reclaim memory, call free_node_later when free would be called in a sequential setting. An example of how to apply QSense to a concurrent linked list can be found in Appendix E.2.

6.5 Cadence

In this section, we present Cadence, our fallback path for QSense. We first describe Cadence as a stand-alone memory reclamation technique and then show how we integrate Cadence with QSBR in our QSense scheme.

6.5.1 The Fallback Path

Cadence builds upon hazard pointers, eliminating the need to use per-node memory barriers when traversing the data structure. Note that while Cadence is the fallback path in QSense, it could also be used as a stand-alone memory reclamation scheme. Algorithm 6.3 presents the pseudocode implementation of the main functions of Cadence. This scheme is based on two new concepts: rooster processes and deferred reclamation.

1. Rooster processes are a mechanism used to ensure that all new hazard pointer writes become globally visible after at most a period of time T (the sleep interval, a configurable parameter). For every core or hardware context on which a worker process is running, we create a rooster process and pin it to run on that core. Every rooster process has the task of sleeping for a predetermined amount of time T, waking up, and going back to sleep again, in an infinite loop. In this way, every worker process is guaranteed to be switched out periodically, and thus the worker processes’ outstanding writes, including hazard pointers, will become visible to the other processes (as detailed in Note on assumptions below). Therefore, for every time t, all hazard pointers that were published before time t − T are visible to all processes.

2. Deferred reclamation. Figure 6.4 illustrates the main idea of how rooster processes and deferred reclamation work together. By verifying that a node n is old enough (pseudocode shown in Algorithm 6.3, lines 35–39), the reclaiming process makes sure that at least one rooster process wake-up has occurred since n was removed from the data structure and thus that any hazard pointers protecting n have become visible. If n is removed by a process at time t0, then at time t > t0 + T, any potential hazard pointers protecting n are visible. This is because these hazard pointers were written before t0 (by the methodology in Section 6.3.2) and at least one rooster process wake-up has occurred between t0 and t. Therefore, at t, the reclaiming process can safely free the node’s memory provided that no hazard pointers are protecting it.
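A rooster process itself can be sketched as follows. This is an illustrative POSIX-threads sketch, not the exact Cadence code; it assumes <pthread.h>, <sched.h> and <time.h> (with _GNU_SOURCE for the affinity call), and SLEEP_INTERVAL_NS corresponds to T.

void* rooster_main(void* arg) {
    int core = *(int*) arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // Pin the rooster to the core of the worker it is responsible for.
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    struct timespec t = { 0, SLEEP_INTERVAL_NS };
    for (;;) {
        nanosleep(&t, NULL);   // waking up preempts the worker on this core,
                               // which makes the worker's pending hazard pointer
                               // writes visible to all other processes
    }
    return NULL;
}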

When a node n is removed from the data structure (when the free function would be called in a sequential setting), n is timestamped and placed inside the removing process’s local list of nodes awaiting reclamation. Once every R such node removals, a scan of the list is performed (see Algorithm 6.3, lines 14–33). A scan inspects the process’s removed nodes list and frees those nodes that are safe to reclaim (retired). Nodes that are old enough and are not protected by any hazard pointers are freed; hazard pointers of all the worker processes are checked, not only the reclaiming process’s. The rest of the nodes — which are either not old enough, or are protected — are left in the removed nodes list, to be reclaimed at a later time (Algorithm 6.3, lines 26–29).

From the programmer’s perspective, using Cadence is essentially identical to using hazard pointers. The difference is that no memory barriers are needed after publishing a new hazard pointer, thus making Cadence up to three times as fast as the original hazard pointers (as

Algorithm 6.3 – Main functions of Cadence.

1  // *** Cadence ***
2  timestamped_node {
3      node* actual_node;
4      timestamp time_created;
5      timestamped_node* next; };
6  HP_array HP; // Shared HP array

8  void assign_HP(node* target, int HP_index) {
9      // Assign hazard pointer to target node.
10     HP[HP_index] = target;
11     // No need for a memory barrier here.
12 }

14 void scan(timestamped_node* removed_nodes_list) {
15     // Insert non-null values in HP_in_use.
16     HP_array HP_in_use = get_protected_nodes(HP);
17     // Free non-hazardous nodes
18     timestamped_node* tmplist;
19     tmplist = removed_nodes_list;
20     removed_nodes_list = NULL;
21     timestamped_node* cur_node;
22     while (tmplist != NULL) {
23         cur_node = tmplist;
24         tmplist = tmplist->next;
25         // Deferred reclamation
26         if (!is_old_enough(cur_node) ||
27             HP_in_use.find(cur_node->actual_node)) {
28             cur_node->next = removed_nodes_list;
29             removed_nodes_list = cur_node;
30         } else {
31             free(cur_node->actual_node);
32             free(cur_node);
33 }}}

35 boolean is_old_enough(timestamped_node* wrapper_node) {
36     current_time = get_current_time();
37     age = current_time - wrapper_node->time_created;
38     return (age >= (ROOSTER_SLEEP_INTERVAL + EPSILON));
39 }

we will see in Section 6.7). Listing 6.4 shows the steps taken when accessing a node when Cadence is used.

Note on assumptions. For correctness, Cadence relies on the following assumptions. First, we assume that the only instruction reorderings possible are those between stores and subsequent loads, as in the TSO model. We assume this in Section 6.3.2 to determine the memory barrier placement and in Section 6.5.1 to justify that the memory barrier is no longer needed.

Second, we require that a context switch implies a memory barrier for the process being

Figure 6.4 – Rooster processes and deferred reclamation.

Listing 6.4 – High-level steps taken when accessing a node in a data structure using Cadence.

1 Read reference to node n

2 Assign a hazard pointer to n

3 (No memory barrier is needed here, unlike with hazard pointers)

4 Check if n is still valid

5 Access n's memory

6 Release reference to n (e.g., move to successor node)

switched out. This is necessary to guarantee that hazard pointers published by a process p become visible at the latest right after p is switched out by a rooster process. The low-level locking required to perform a context switch automatically provides a memory barrier for the process being switched out. Although this property is architecture dependent and might not be generally true if architectures change in the future, it does hold for most modern architectures [144, 170].

Third, we assume that rooster processes never fail. This is a reasonable assumption, considering that rooster processes do not take any steps that could produce exceptions: their only actions are going to sleep and waking up. However, small timing inconsistencies might appear, in the form of (1) rooster processes possibly taking slightly longer than T between wake-ups, i.e., “oversleeping” (it is reasonable to assume that this difference is small, since modern operating systems use fair schedulers [165]) and (2) different cores seeing slightly different times when creating and comparing timestamps. We make the assumption that these inconsistencies are bounded and introduce a tolerance ε to account for them. We use ε explicitly, as shown in Figure 6.4: to verify if a removed node n is old enough, we compute the difference between the current value of the system clock and n’s timestamp. If the time difference is larger than T + ε, then n is old enough.

Note that we are making no assumptions about the worker processes (i.e., the processes performing read or write operations on the data structure). In particular, they may be delayed for an arbitrary amount of time.

Therefore, the model under which our construction is correct and wait-free is partly asynchronous (the worker processes) and partly synchronous (the rooster processes).

6.5.2 Merging the Fast and Fallback Paths

In QSense, from a high-level design point of view, the fast path (QSBR) and the fallback path (Cadence) can be viewed as two separate entities, functioning independently. The switch between the two modes is triggered via a shared flag, called the fallback-flag. However, even if the two modes of operation are logically distinct, there are elements of the two schemes that have been merged or that are continuously active. Even if QSense is running in the fast path, hazard pointers still have to be set. As explained in Section 6.4.1, this is necessary because in the case of a switch to the fallback path, hazardous references need to be protected. Furthermore, for similar reasons, timestamps need to be recorded when a node is removed, regardless of the mode of operation of QSense. Moreover, when running in fallback mode and performing a scan, QSBR’s limbo_list (with all three epochs) becomes the removed_nodes_list scanned by Cadence. Pseudocode for the main functions used by QSense is shown in Algorithms 6.5 and 6.6 (unless stated otherwise, all pseudocode references in the rest of this section refer to Algorithms 6.5 and 6.6). An example of how to apply QSense to a concurrent linked list can be found in Appendix E.2.

Any of the worker processes can trigger the switch between the fast and fallback paths. The switch can be split into the following sub-problems:

1. Detecting the need to switch to the fallback path. The switch from QSBR to Cadence is triggered when a process detects that its removed (but not freed) nodes list has reached a size C, where C is a parameter of the system (lines 53–59). Reaching a large removed nodes list size for one process indicates that quiescence was not possible for an extended period.

2. Switching from the fast path to the fallback path. QSense signals the switch from QSBR to Cadence through the fallback-flag. The process that has detected the need for the switch sets the shared fallback-flag. The fallback-flag is checked by all processes when performing node reclamation (i.e., calling the free_node_later function, shown in Algorithm 6.6), so the path switch will eventually be detected by all active processes (line 41). If the flag is set to fallback mode, a hazard pointer style scan as described in Section 6.5.1 is immediately performed to reclaim nodes (lines 42–47; scan shown in Algorithm 6.3).

3. Detecting when it is safe to switch back to the fast path. To determine when to switch from the fallback path to the fast path we need to verify if all processes have once again become active in the system. While the system operates in fallback mode, there is at least one process which cannot participate, because using the fallback path implies that

Algorithm 6.5 – Main QSense functions (I).

1  // *** QSENSE interface ***
2  void manage_qsense_state();
3  void assign_HP(node* target, int HP_index);
4  void free_node_later(node* n);

6  // One limbo list per epoch
7  timestamped_node* limbo_list[3];
8  int call_count = 0;
9  int free_node_later_call_count = 0;

11 // *** QSENSE main functions ***
12 void manage_qsense_state() {
13     // Batch operations
14     call_count += 1;
15     if (call_count % QUIESCENCE_THRESHOLD != 0) {
16         return;
17     }
18     // Signal that the process is active
19     is_active(process_id);
20     seen_fallback_flag = fallback_flag;
21     if (seen_fallback_flag == FAST_PATH) {
22         // Common case: run the fast path
23         quiescent_state();
24         prev_seen_fallback_flag = FAST_PATH;
25     } else if (seen_fallback_flag == FALLBACK_PATH) {
26         // Try to switch to fast path
27         if (all_processes_active()) {
28             // Trigger switch to the fast path
29             fallback_flag = FAST_PATH;
30             prev_seen_fallback_flag = FAST_PATH;
31             quiescent_state();
32         }
33         prev_seen_fallback_flag = FALLBACK_PATH;
34 }}

one of the processes was delayed and quiescence was not possible. To assess whether all the processes have become active in the meantime, we keep an array of presence-flags (one flag per worker process), which is reset periodically. After each operation (or batch of operations) on the data structure, processes set their corresponding presence-flags to true, to signal that they are active (line 19). Then, processes scan the presence-flag array (line 27). If one of the processes sees all of the presence flags set to true, it infers that all processes might be active again in the system and a switch from fallback path to fast path is attempted.

4. Switching from the fallback path to the fast path. If QSense runs in fallback mode but all processes have become present in the meantime, the possibility of switching from Cadence back to QSBR is detected, as described above. Similarly to switching from the fast path to the fallback path, the switch in the opposite direction is signaled by setting the value of the shared fallback-flag and immediately declaring a quiescent state


Algorithm 6.6 – Main QSense functions (II).

35 void free_node_later(node* n) {
36   // Create timestamped wrapper node
37   timestamped_node* wrapper_node = alloc(size(timestamped_node));
38   wrapper_node->actual_node = n;
39   wrapper_node->time_created = get_current_time();
40   limbo_list[my_current_epoch].add(wrapper_node);
41   seen_fallback_flag = fallback_flag;
42   if (seen_fallback_flag == FALLBACK_PATH &&
43       (++free_node_later_call_count % R == 0)) {
44     // Running in fallback mode. All three epochs in the limbo list are scanned.
45     scan(limbo_list[0]); scan(limbo_list[1]);
46     scan(limbo_list[2]);
47     prev_seen_fallback_flag = FALLBACK_PATH;
48   } else if (prev_seen_fallback_flag == FALLBACK_PATH &&
49              seen_fallback_flag == FAST_PATH) {
50     // QSBR mode switch triggered by another process.
51     quiescent_state();
52     prev_seen_fallback_flag = FAST_PATH;
53   } else if (size(limbo_list) >= C &&
54              prev_seen_fallback_flag == FAST_PATH) {
55     // Trigger switch to fallback mode:
56     fallback_flag = FALLBACK_PATH;
57     prev_seen_fallback_flag = FALLBACK_PATH;
58     scan(limbo_list[0]); scan(limbo_list[1]);
59     scan(limbo_list[2]);
60 }}

(lines 28–31). If the switch to the fast path was already triggered by another process, the new value of the fallback-flag will be seen upon retiring a node (in the free_node_later function, lines 48–52).

The current version of QSense does not support dynamic membership: processes cannot join or leave the system as an algorithm is running. Also, if a process crashes and never recovers, QSense will switch to fallback mode and stay there forever. Both of these issues can be addressed by adding mechanisms for processes to announce entering or leaving the system and for evicting participating processes that have not quiesced in a long time. We leave these extensions for future work.

6.6 Correctness & Complexity

In this section we argue for the safety and liveness of Cadence and QSense. For completeness, safety and liveness proofs for QSBR can be found in Appendix E.1.


6.6.1 Cadence

Property 6.6.1 (Safety). If at time t, a node n is identified in the scan function as eligible to be reused by process p, then no process q ≠ p holds a hazardous reference to n at time t.

Proof. Assume by contradiction that (1) n is identified as eligible for reuse by p at time t and (2) there exists another process q that holds a hazardous reference to n at t. Then by (1) and the scan algorithm, at time t, p inspects n's timestamp and finds that n is old enough, meaning that n has been removed from the data structure at a time t′ ≤ t − T (where T is the rooster process sleep interval, including the tolerance ε). Therefore, by Condition 6.3.1 in Section 6.3.2, q has had a hazard pointer hp dedicated to n since a time t′′ ≤ t′. Since t − t′′ ≥ T, the write by q to hp at t′′ is visible to p at time t. Therefore, by the scan algorithm, p will not identify n as eligible for reuse (since it is protected by a hazard pointer).

Lemma 6.6.2. For any process p, at the end of a call to scan by p, p can have at most NK + T retired nodes in its removed nodes list, where N is the number of processes, K is the number of hazard pointers per process and T is the rooster process sleep interval.

Proof. At the time of the scan, there can be at most NK nodes protected by hazard pointers (since NK is the total number of hazard pointers), and there can be at most T nodes that are not yet old enough (for clarity, we assume that p can remove at most one node per time unit).

Property 6.6.3 (Liveness). At any time, there are at most N(K + T + R) retired nodes in the system, where R is the number of nodes a process can remove before it invokes scan.

Proof. Fix a process p and examine how many nodes can be removed by p but not yet reclaimed. Using Lemma 6.6.2, it follows that between two scan calls, the number of nodes in p's removed nodes list can grow up to NK + T + R, before the next scan call is triggered, lowering the size of the removed nodes list to NK + T again. So the maximum size of a process' removed nodes list is NK + T + R, where the NK term comes from the nodes that are protected by hazard pointers. When considering the entire system, we can have at most NK HP-protected nodes, and at most N(T + R) non-HP-protected nodes, so the maximum number of retired nodes is N(K + T + R).

6.6.2 QSense

Property 6.6.4 (Safety). If a node n is identified at time t by process p as eligible for reuse, then no process q ≠ p holds a hazardous reference to n.

Proof. Since all processes keep track of both hazard pointers (exactly as in Cadence) and of epochs (exactly as in QSBR), regardless of whether the system is in the fast path or in the



fallback path, the safety guarantees of both methods are maintained.

Figure 6.5 – Time diagram for the proof of QSense liveness.

For the next property, we define a legal value of C, the threshold introduced in step 1 of Section 6.5.2. C refers to the size of the removed nodes list and is used for determining when to switch to the fallback path. We say that C is legal if C > max(mQ, NK + T, (K + T + R)/2), where Q is the quiescence threshold introduced in Section 6.3, m is the maximum number of nodes that can be removed by a single operation, and N, K and T are as in Section 6.6.1. Picking a legal value for C is always possible since C is a configurable parameter.
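As a purely illustrative instance of this condition (these numbers are not from our experimental configuration): with N = 8, K = 2, T = 1000, R = 100, m = 3 and Q = 50, the bound evaluates to C > max(3 · 50, 8 · 2 + 1000, (2 + 1000 + 100)/2) = max(150, 1016, 551) = 1016, so any C ≥ 1017 is legal in that configuration.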

Property 6.6.5 (Liveness). If C has a legal value then, at any time, there can be no more than 2NC retired nodes in the system.

Proof. Assume by contradiction that there is a time t when there are U > 2NC retired nodes in the system. Then there exists a process p such that at time t, p has more than 2C retired nodes.

Let t1 < t be the time of the last quiescent state called by p before t (as illustrated in Figure 6.5). Since t1, p has completed at most Q operations (otherwise p would have gone through another quiescent state before t) and has therefore removed at most mQ nodes. Therefore, at t1, p has at least 2C − mQ > C (using the fact that C > mQ) retired nodes and therefore triggers a switch to the fallback path. This means that p will call a hazard pointers scan before starting any other operation, by construction of the QSense algorithm (step 2 in Section 6.5.2). Let t2 be the time of this first scan after t1. After this scan is complete, p can have at most NK + T retired nodes, by Lemma 6.6.2. If t > t2, then at t, p can have at most NK + T + mQ < 2C retired nodes (at most NK + T at the end of the scan plus mQ because p has completed at most Q operations since t2), a contradiction. Therefore it must be the case that t1 < t < t2. Since the scan at t2 is called immediately after the quiescent state at t1, without other operations being started by p (by step 2 in Section 6.5.2), the number of retired nodes does not increase between t1 and t. So at t1, p has more than 2C retired nodes. We now show this to be impossible.

Let t3 be the time of the last quiescent state called by p before t1. Since p performed Q operations between t3 and t1, it follows that at t3, p had at least 2C − mQ > C retired nodes, thus triggering a switch to the fallback state and a scan at time t4, t3 < t4 < t1. After this scan, p had at most NK + T retired nodes and therefore, at t1, p had at most NK + T + mQ < 2C retired nodes, a contradiction because we had shown that p had more than 2C retired nodes at t1. This completes the proof.

6.7 Experimental Evaluation

We first describe the experimental setup of our evaluation. We then present the methodology of our experiments and finally discuss our results.

6.7.1 Experimental Setting

We apply QSense to a lock-free linked list [177], a lock-free skip list [90] and a binary search tree [184]. The base implementations of these data structures are taken from ASCYLIB [74]. The code to reproduce our experiments is available at https://github.com/zablotchi/qsense. We compare the performance of QSense to that of QSBR, HP and a leaky implementation (i.e., one in which memory is never reclaimed). Our evaluation was performed on a 48-core AMD Opteron machine with four 12-core 2.1 GHz processors and 128 GB of RAM, running Ubuntu 14.04. Our code was compiled with GCC 4.8.4 and the -O3 optimization level.

Each experiment consists of a number of processes concurrently performing data structure operations (searching, inserting or deleting keys) for a predefined amount of time. Each operation is chosen at random, according to a given probability distribution, with a randomly chosen key. An initialization is performed before each experiment, where one process fills the data structure up to half the key-range.
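For concreteness, one worker's main loop in such an experiment can be sketched as follows (the helper names are ours for illustration; the actual harness is in the repository linked above):

// Sketch of a single worker's benchmark loop (illustrative names only).
while (!stop) {
    int key = random_key(key_range);               // uniformly random key
    int op  = random_percent();                    // 0..99, drives the operation mix
    if (op < insert_percent)                       ds_insert(set, key);
    else if (op < insert_percent + delete_percent) ds_delete(set, key);
    else                                           ds_search(set, key);
    local_op_count++;                              // aggregated into throughput
}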

6.7.2 Methodology

The purpose of our evaluation is two-fold. First, we aim to determine whether QSense performs similarly to QSBR in the base case. It is important to highlight the base case behavior of QSense (i.e., when no processes undergo prolonged delays), since this is the expected execution path in most scenarios. To this end, we run a set of tests emphasizing the scalability with respect to the number of cores. For this first category of tests, the system throughput is recorded as a function of the number of cores (each process is pinned to a different core). The number of cores is varied from 1 to 32, the operation distribution is fixed at 50% reads and 50% updates (i.e., 25% inserts, 25% deletes) and the key range sizes are fixed at 2000 for the linked list, 20000 for the skip list and 2000000 for the binary search tree.

The second goal of the evaluation is to examine the behavior of QSense in case of prolonged process delays. To this end, we run a set of tests that include periodic process disruptions and measure the system throughput as a function of time. We observe the switch from QSBR to Cadence when the system senses that one of the processes was delayed and the switch back to QSBR when all the processes became active in the system again. We seek to trigger a switch between the paths (QSBR to Cadence or vice versa) every 10 seconds. To induce the


[Plot: throughput (Mops/s) as a function of the number of processes, for None, QSBR, QSense and HP; panels: linked list, skip list, binary search tree.]

Figure 6.6 – Scalability of memory reclamation on a linked list (2000 elements), a skip list (20000 elements) and a BST (2000000 elements) with 50% updates.

system switch, every 20 seconds one of the processes is delayed for a period of 10 seconds. The operation distribution and key range sizes are fixed as above and the number of processes is fixed at 8. As before, the data structure is filled up to half of its capacity, processes start simultaneously, then run for 100 seconds. The throughput is recorded as a function of time and is measured every second.
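One way to inject these delays is a separate controller loop, sketched below; pause_worker and resume_worker are hypothetical helpers (e.g., wrappers around SIGSTOP/SIGCONT), and the real harness may implement this differently.

#include <unistd.h>

// Sketch of the delay injector driving the robustness experiment.
void inject_delays(int victim_id, volatile int* experiment_over) {
    while (!*experiment_over) {
        sleep(10);                 // run undisturbed for 10 seconds ...
        pause_worker(victim_id);   // ... then suspend one worker ...
        sleep(10);                 // ... for 10 seconds (one delay every 20 s)
        resume_worker(victim_id);
    }
}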

6.7.3 Results

Figure 6.6 shows the behavior of QSBR, QSense, HP and None (the leaky implementation) on the linked list, on the skip list and on the binary search tree in the common case, when no process delays occur. The throughput of the algorithms is plotted as a function of the number of cores (higher is better). Due to its amortized overhead, QSBR maintains a 2.3% overhead on average compared to None. As expected, HP achieves the lowest overall throughput, with an average overhead of 80% over the leaky implementation. This is due to the expensive per-node memory barriers used upon data structure traversal. QSense achieves two to three times better throughput than HP and has an average overhead of 29% over None. While close to QSBR in all scenarios, QSense does not match its performance. Even though QSense follows a quiescence-based reclamation scheme in the base case, it still has to keep the hazard pointer and timestamp values up to date. This induces increased overhead compared to QSBR. Despite the fact that no memory barriers are used upon updating the hazard pointers, the process-local variable updates alone add sequential complexity. This explains why the performance gap between QSBR and QSense is larger for the skip list than for the linked list or the tree: whereas the linked list only uses two hazard pointers per process and the tree uses six, the skip list can use up to 35 hazard pointers per process.

QSense does not completely match the performance of QSBR but, unlike QSBR, it can recover from long process delays. Figure 6.7 illustrates this. The throughput of QSense is shown alongside QSBR and HP. In this experiment, one of the processes is delayed in the 10–20, 30–40, 50–60, 70–80 and 90–100 time intervals. A similar pattern can be observed for all three


[Plot: throughput (Mops/s) as a function of time (s), for QSBR, QSense and HP; panels: linked list, skip list, binary search tree.]

Figure 6.7 – Path switching with process delays (8 processes, 50% updates).

data structures. Note that the first time a process is delayed (after 10 seconds), QSBR can no longer quiesce. Consequently, the system runs out of memory and eventually fails. In contrast, QSense continues to run. When quiescence is no longer possible, QSense switches to Cadence and then switches back to QSBR as soon as the delayed process becomes active. The sequence of fallbacks and recoveries continues throughout the entire experiment run. As intended, the throughput achieved by QSense is similar to QSBR during the fast path. When QSense is running in the fallback path, it can be seen that Cadence outperforms the original hazard pointer scheme by a factor of 3x on average, because it avoids per-node memory barriers upon data structure traversal.

6.8 Related Work

Reference counting (RC) [77, 94, 111, 222] assigns a shared counter to each node in the data structure, representing the number of references held to that node at any given time. When a node's counter reaches zero, the node is safe to reclaim. Though easy to implement, RC requires expensive atomic operations on every access to maintain consistent counters.
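As a minimal illustration of the idea (this is our sketch, not one of the cited algorithms, and it omits the race between loading a pointer and incrementing its counter):

#include <stdatomic.h>
#include <stdlib.h>

// Reference-counting sketch: every traversal step pays an atomic RMW.
typedef struct rc_node {
    atomic_int refs;          // number of references currently held to this node
    struct rc_node* next;
} rc_node;

rc_node* rc_acquire(rc_node* n) {
    if (n) atomic_fetch_add(&n->refs, 1);          // expensive on every access
    return n;
}

void rc_release(rc_node* n) {
    if (n && atomic_fetch_sub(&n->refs, 1) == 1)
        free(n);                                   // last reference dropped: reclaim
}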

Pointer-based techniques. Hazard pointers (HP) [179] and Pass-the-Buck [112] rely on the programmer marking nodes that are unsafe to reclaim before performing memory accesses to their locations. A reclaiming process must ensure that a node it is about to deallocate is not marked. An advantage of these schemes is that they maintain the lock-free property of data structures. Yet, their implementation requires a memory fence instruction to be issued for every node traversed, which significantly hinders the performance of read-only operations [107].
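To illustrate where this cost comes from, a typical hazard-pointer traversal step looks roughly as follows (a sketch built on the assign_HP interface of Algorithm 6.5; the per-node fence is precisely what Cadence removes from the common path):

#include <stdatomic.h>

// Sketch: protect a node with a hazard pointer before dereferencing it.
node* protect(node* volatile* src, int hp_index) {
    node* n;
    do {
        n = *src;
        assign_HP(n, hp_index);                      // publish the hazard pointer
        atomic_thread_fence(memory_order_seq_cst);   // make it visible to reclaimers
    } while (n != *src);                             // the node may have been replaced
    return n;
}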

Improvements on HP. Morrison et al. [183] introduce a new, strengthened version of the Total Store Ordering (TSO) memory model [191] in which there is a known bound on the time it takes for writes in the store buffer to become visible in main memory. A variant of HP which does not need memory barriers is proposed. While this solution is similar to Cadence, it relies on hardware guarantees that do not exist yet in practice. McKenney et al. [172] describe a

method to force quiescent states by creating a high-priority daemon process that executes on each CPU in turn, but this method has never been applied to HP. Finally, Aghazadeh et al. [4] provide an improvement on HP by reducing the number of hazard-pointer-to-node comparisons per scan call to one, at the cost of increasing the amount of time between node removal and node reclamation.

Epoch-based techniques. These techniques [104, 107, 174] rely on the assumption that live processes will eventually drop the references they hold on nodes (i.e., if a reclaiming process waits long enough, all other processes will eventually stop holding references to the deleted nodes, which will thus become safe to reclaim). Though epoch-based techniques have good performance [107], their main drawback is that they are blocking.
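The core mechanism can be sketched as follows (the names are assumed for illustration; the cited schemes amortize this work with per-epoch limbo lists, much as QSBR does):

#include <stdatomic.h>

#define N_PROCESSES 8          /* illustrative */

// Epoch-based sketch: a node removed in epoch e may be freed only once every
// process has announced an epoch at least as recent as e.
atomic_int global_epoch;
atomic_int announced_epoch[N_PROCESSES];

void announce_quiescence(int pid) {
    // Called when pid holds no references into the data structure.
    atomic_store(&announced_epoch[pid], atomic_load(&global_epoch));
}

int safe_to_free(int removal_epoch) {
    for (int p = 0; p < N_PROCESSES; p++)
        if (atomic_load(&announced_epoch[p]) < removal_epoch)
            return 0;          // a delayed process blocks reclamation: the scheme is blocking
    return 1;
}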

Ad-hoc techniques. Drop the Anchor (DTA) [40] combines timestamping with an HP-inspired technique. Processes use timestamps to track their progress and they place anchor pointers in the data structure upon traversal. When a process delay occurs, the other processes work together to reconstruct the data structure using the anchors. Similarly to QSense, DTA spreads the cost of expensive operations across a large number of operations. However, the applicability of DTA to other data structures besides the linked list implementation provided by the authors is unclear. In contrast, QSense is as easy to apply as HP, for which a well-established methodology exists.

DEBRA+ [43] uses a variant of QSBR when processes are making progress in the common case and has a slower mechanism for treating delays. When a delayed process is preventing other processes from quiescing for too long, the slow process is neutralized using an OS signal. Upon resuming execution, a neutralized process runs special recovery code to clean up any inconsistencies it might have left in the data structure before it was neutralized. Like QSense, DEBRA+ has a fast path and a recovery path. Unlike QSense, DEBRA+ relies on OS-specific instructions. Moreover, DEBRA+ is only easy to apply to lock-free data structures that have (1) an explicit help function and an explicit descriptor record containing all the information required by the help function, (2) operation code that follows a certain pattern (quiescent preamble – non-quiescent body – quiescent postamble) and (3) a way to perform clean-up if a process is interrupted half-way through an operation. It is unclear how to extend DEBRA+ to most state-of-the-art algorithms not satisfying the above properties.

Automatic memory reclamation. ThreadScan [8] is an automatic technique for concurrent memory reclamation. Processes add references to removed nodes to a shared delete buffer. Periodically, a scan is initiated by sending a signal to all processes. Each process examines its stack and registers and marks the corresponding entry in the delete buffer if there is a match, to indicate that the node is in use. After all processes complete the scan, unmarked nodes can be freed. ThreadScan amortizes the overhead of memory reclamation, similarly to QSense. Unlike QSense, ThreadScan has the advantage of being completely automatic. However, ThreadScan makes critical assumptions about the synchrony of its worker processes and about the layout of process stacks in memory. Cohen and Petrank [66] present an automatic memory reclamation

scheme for lock-free data structures, inspired by mark-and-sweep garbage collection. This technique relies on the data structure operations being in normalized form [221]. While such implementations of lock-free data structures exist, there is no clear methodology on how to normalize lock-free data structure implementations.

HTM techniques. Another direction for concurrent memory reclamation is the use of Hardware Transactional Memory (HTM) [114]. Dragojevic et al. [83] use HTM to produce faster and simpler solutions to a common subproblem in prior memory reclamation techniques. StackTrack [7] uses the bookkeeping facilities of HTM directly to track node references. The downside of this class of solutions is that they rely on the presence of HTM on the target machine, which is rare in practice.

6.9 Conclusion

QSense offers a fast, robust and highly applicable solution to the concurrent memory reclamation problem. Whenever possible, the fast QSBR technique is used. In case of prolonged process delays, QSense switches to Cadence, a novel hazard-pointer-inspired scheme that is robust to process delays. QSense can be integrated into data structures making use of the popular hazard-pointer-based schemes almost effortlessly, a significant advantage from the applicability standpoint. We show experimentally that QSense achieves performance similar to the high-performing techniques in the common case (i.e., when all worker processes are active in the system), while tolerating process delays.

Part IV: Concluding Remarks


7 Conclusions and Future Work

We conclude this thesis by summarizing its main findings and their implications, as well as by outlining promising directions for future work opened by this work.

7.1 Summary and Implications

At a high level, we believe this thesis takes a step toward a better understanding of the challenges and opportunities of distributed computing with modern shared memory. It is our hope that programmers can use the insights and tools in this thesis to “tame” the complexity of shared memory, be it by effectively leveraging new technologies or by adapting to changing requirements. Firstly, we have demonstrated, both in theory and in practice, the potential of RDMA to power fundamentally faster and more robust replication protocols, opening the way to replicated microsecond-level applications. Secondly, through our lower bound on the number of persistent fences, we have provided a useful point of reference for programmers aiming to use persistent memory efficiently. Furthermore, through our MCAS constructions, we have provided powerful tools to simplify the task of designing fast concurrent data structures for volatile and persistent memory. Finally, we have shown how to design a fast, robust, and widely applicable concurrent memory reclamation scheme; such schemes are becoming indispensable given the current pace of data growth.

In Part I, we focused on RDMA. We showed that RDMA's flexible and dynamic permissions enable algorithms which fundamentally improve on the traditional tradeoffs in distributed computing between performance and failure-resilience. In Chapter 2, we provided consensus protocols for the crash-fault and Byzantine-fault models with better complexity and failure-resilience than either shared memory or message passing, taken alone. Furthermore, in Chapter 3 we put these ideas to work through Mu, our RDMA-based state-machine replication system for microsecond applications.

In Part II, we considered the challenges raised by persistent memory. In Chapter 4 we showed that one persistent fence per operation is necessary and sufficient to implement any concurrent persistent object. In Chapter 5, we provided novel multi-word compare-and-swap algorithms for persistent and volatile memory. Our algorithms use significantly fewer CAS instructions per call to MCAS when compared to the state-of-the-art and, in the persistent version, far fewer persistent fences per MCAS call. Furthermore, we showed that our algorithms have near-optimal CAS usage, by providing a lower bound on the number of CASes required to implement MCAS.

In Part III, we looked at memory reclamation. We introduced QSense, a novel hybrid reclamation scheme which is fast in the common case, robust to arbitrary process delays and easy to integrate with a wide range of concurrent data structure algorithms. As part of QSense, we described Cadence, a novel memory reclamation scheme inspired by hazard pointers but which eliminates their reliance on expensive memory barriers on the common path.

7.2 Future Directions

Practical BFT using RDMA. In Chapter 2, we show that RDMA makes it possible to solve BFT with 2f + 1 replicas such that each decision takes only 2 delays in the common case. However, our construction is not well suited for a practical implementation, due to the need for replicas to send their entire histories with every message. Thus, an interesting direction for future research is designing, implementing, and evaluating a practical RDMA-based BFT system which maintains the same high fault tolerance and low communication complexity as our theoretical construction.

Applying dynamic permissions to other problems. In Chapter 2, we demonstrate that permissioned M&M models enable consensus protocols with better fault tolerance and communication complexity than either the shared memory model or the message passing model, taken separately. This opens the question of whether RDMA permissions can enable better solutions to other distributed computing problems, such as reliable broadcast [32] (and its Byzantine variant [38]), causal-order broadcast [32], and atomic commit [97, 215].

Remote-friendly memory reclamation. Memory reclamation is especially important for RDMA-based algorithms, as their high performance makes them prone to high memory usage. It is not clear whether existing techniques for concurrent memory reclamation can be easily adapted to the RDMA case, since the latter requires the local and remote sides to synchronize on whether a given memory location can safely be reused. Furthermore, RDMA-based algorithms often require the remote side to remain passive, in order to keep latency low. This case seems even more challenging, as memory reclamation will have to be coordinated entirely from the remote side(s).


Tight bounds on MCAS complexity. Our MCAS algorithms in Chapter 5 are nearly optimal. The volatile MCAS variant requires k + 1 CASes per call to k-CAS in the common case, whereas our lower bound states that at least k CASes are needed. The persistent MCAS variant uses the same number of CASes and 2 persistent fences, whereas our lower bound in Chapter 4 shows that at least 1 persistent fence per operation is needed to implement any concurrent object. It remains an open problem to design volatile and persistent MCAS algorithms that match these lower bounds, or to prove tighter lower bounds that match our proposed algorithms.


Part V: Appendices


A The Impact of RDMA on Agreement

A.1 Correctness of Reliable Broadcast

Observation A.1.1. In Algorithm 2.1, if p is a correct process, then no slot that belongs to p is written to more than once.

Proof. Since p is correct, p never writes on any slot more than once. Furthermore, since all slots are single-writer registers, no other process can write on these slots.

Observation A.1.2. In Algorithm 2.1, correct processes invoke and return from try_deliver(q) infinitely often, for all q ∈ Π.

Proof. The try_deliver() function does not contain any blocking steps, loops or goto statements. Thus, if a correct process invokes try_deliver(), it will eventually return. Therefore, for a fixed q the infinite loop at line 15 will invoke and return from try_deliver(q) infinitely often. Since the parallel for loop at line 14 performs the infinite loop in parallel for each q ∈ Π, the Observation holds.

Proof of Lemma 2.4.2. We prove the lemma by showing that Algorithm 2.1 correctly implements reliable broadcast. That is, we need to show that Algorithm 2.1 satisfies the four properties of reliable broadcast.

Property 1. Let p be a correct process that broadcasts (k,m). We show that all correct processes eventually deliver (k,m) from p. Assume by contradiction that there exists some correct process q which does not deliver (k,m). Furthermore, assume without loss of generality that k is the smallest key for which Property 1 is broken. That is, all correct processes must eventually

deliver all messages (k′,m′) from p, for k′ < k. Thus, all correct processes must eventually increment last[p] to k.

We consider two cases, depending on whether or not some process eventually writes an L2

proof for some (k,m′) message from p in its L2Proof slot.


First consider the case where no process ever writes an L2 proof of any value (k,m′) from p. Since p is correct, upon broadcasting (k,m), p must sign and write (k,m) into Value[p,k,p] at line 12. By Observation A.1.1, (k,m) will remain in that slot forever. Because of this, and because there is no L2 proof, all correct processes, after reaching last[p] = k, will eventually read (k,m) in line 26, write it into their own Value slot in line 28 and change their state to WaitForL1Proof.

Furthermore, since p is correct and we assume signatures are unforgeable, no process q can write any other valid value (k′,m′) ≠ (k,m) into its Values[q,k,p] slot. Thus, eventually each correct process q will add at least f + 1 copies of (k,m) to its checkedVals, write an L1 proof consisting of these values into L1Proof[q,k,p] in line 37, and change its state to WaitForL2Proof.

Therefore, all correct processes will eventually read at least f + 1 valid L1 proofs for (k,m) in line 42 and construct and write valid L2 proofs for (k,m). This contradicts the assumption that no L2 proof ever gets written.

In the case where there is some L2 proof, by the argument above, the only value it can prove is (k,m). Therefore, all correct processes will see at least one valid L2 proof at deliver. This contradicts our assumption that q is correct but does not deliver (k,m) from p.

Property 2. We now prove the second property of reliable broadcast. Let p and p′ be any two correct processes, and q be some process, such that p delivers (k,m) from q and p′ delivers (k,m′) from q. Assume by contradiction that m ≠ m′.

Since p and p′ are correct, they must have seen valid L2 proofs at line 50 before delivering (k,m) and (k,m′) respectively. Let P and P′ be those valid proofs for (k,m) and (k,m′) respectively. P (resp. P′) consists of at least f + 1 valid L1 proofs; therefore, at least one of those proofs was created by some correct process r (resp. r′). Since r (resp. r′) is correct, it must have written (k,m) (resp. (k,m′)) to its Values slot in line 28. Note that after copying a value to their slot, in the WaitForL1Proof state, correct processes read all Value slots (line 33).

Thus, both r and r′ read all Value slots before compiling their L1 proof for (k,m) (resp. (k,m′)).

Assume without loss of generality that r wrote (k,m) before r′ wrote (k,m′); by Observation A.1.1, it must then be the case that r′ later saw both (k,m) and (k,m′) when it read all Values slots (line 33). Since r′ is correct, it cannot have then compiled an L1 proof for (k,m′) (the check at line 35 failed). We have reached a contradiction.

Property 3. We show that if a correct process p delivers (k,m) from a correct process p′, then p′ broadcast (k,m). Correct processes only deliver values for which a valid L2 proof exists (lines 50–21). Therefore, p must have seen a valid L2 proof P for (k,m). P consists of at least f + 1 L1 proofs for (k,m) and each L1 proof consists of at least f + 1 matching copies of (k,m), signed by p′. Since p′ is correct and we assume signatures are unforgeable, p′ must have broadcast (k,m) (otherwise p′ would not have attached its signature to (k,m)).


Property 4. Let p be a correct process such that p delivers (k,m) from q. We show that all correct processes must deliver (k,m′) from q, for some m′.

By construction of the algorithm, if p delivers (k,m) from q, then for all i < k there exists mi such that p delivered (i,mi) from q before delivering (k,m) (this is because p can only deliver (k,m) if last[q] = k and last[q] is only incremented to k after p delivers (k−1, mk−1)).

Assume by contradiction that there exists some correct process r which does not deliver (k,m′) from q, for any m′. Further assume without loss of generality that k is the smallest key for which r does not deliver any message from q. Thus, r must have delivered (i,m′i) from q for all i < k; thus, r must have incremented last[q] to k. Since r never delivers any message from q for key k, r's last[q] will never increase past k.

Since p delivers (k,m) from q, p must have written a valid L2 proof P of (k,m) in its L2Proof slot in line 52. By Observation A.1.1, P will remain in p's L2Proof[p,k,q] slot forever. Thus, at least one of the slots L2Proof[·,k,q] will forever contain a valid L2 proof. Since r's last[q] eventually reaches k and never increases past k, r will eventually (by Observation A.1.2) see a valid L2 proof in line 50 and deliver a message for key k from q. We have reached a contradiction.

A.2 Correctness of Cheap Quorum

We prove that Cheap Quorum satisfies certain useful properties that will help us show that it composes with Preferential Paxos to form a correct weak Byzantine agreement protocol. For the proofs, we first formalize some terminology. We say that a process proposed a value v by time t if it successfully executes line 4; that is, p receives the response ack in line 4 by t. When a process aborts, note that it outputs a tuple. We say that the first element of its tuple is its abort value, and the second is its abort proof. We sometimes say that a process p aborts with value v and proof pr, meaning that p outputs (v, pr) in its abort. Furthermore, the value in a process p's Proof region is called a correct unanimity proof if it contains n copies of the same value, each correctly signed by a different process.

Observation A.2.1. In Cheap Quorum, no value written by a correct process is ever overwritten.

Proof. By inspecting the code, we can see that the correct behavior is for processes to never overwrite any values. Furthermore, since all regions are initially single-writer, and the legalChange function never allows another process to acquire write permission on a region that they cannot write to initially, no other process can overwrite these values.

Lemma A.2.2 (Cheap Quorum Validity). In Cheap Quorum, if there are no faulty processes and some process decides v, then v is the input of some process.

Proof. If p = p1, the lemma is trivially true, because p1 can only decide on its input value. If p ≠ p1, p can only decide on a value v if it read that value from the leader's region. Since only the leader can write to its region, it follows that p can only decide on a value that was proposed by the leader (p1).

Lemma A.2.3 (Cheap Quorum Termination). If a correct process p proposes some value, every correct process q will decide a value or abort.

Proof. Clearly, if q = p1 proposes a value, then q decides. Now let q ≠ p1 be a correct follower and assume p1 is a correct leader that proposes v. Since p1 proposed v, p1 was able to write v in the leader region, where v remains forever by Observation A.2.1. Clearly, if q eventually enters panic mode, then it eventually aborts; there is no waiting done in panic mode. If q never enters panic mode, then q eventually sees v on the leader region and eventually finds 2f + 1 copies of v on the regions of other followers (otherwise q would enter panic mode). Thus q eventually decides v.

Lemma A.2.4 (Cheap Quorum Progress). If the system is synchronous and all processes are correct, then no correct process aborts in Cheap Quorum.

Proof. Assume the contrary: there exists an execution in which the system is synchronous and all processes are correct, yet some process aborts. Processes can only abort after entering panic mode, so let t be the first time when a process enters panic mode and let p be that process. Since p cannot have seen any other process declare panic, p must have either timed out at line 12 or 22, or its checks failed on line 13. However, since the entire system is synchronous and p is correct, p could not have panicked because of a time-out at line 12. So, p1 must have written its value v, correctly signed, to p1's region at a time t′ < t. Therefore, p also could not have panicked by failing its checks on line 13. Finally, since all processes are correct and the system is synchronous, all processes must have seen p1's value and copied it to their slot. Thus, p must have seen these values and decided on v at line 20, contradicting the assumption that p entered panic mode.

Lemma A.2.5 (Lemma 2.4.7: Cheap Quorum Decision Agreement). Let p and q be correct processes. If p decides v1 while q decides v2, then v1 = v2.

Proof. Assume the property does not hold: p decided some value v1 and q decided some different value v2. Since p decided v1, p must have seen a copy of v1 at 2f + 1 replicas, including q. But then q cannot have decided v2, because by Observation A.2.1, v1 never gets overwritten in q's region, and by the code, q can only decide a value written in its region.

Lemma A.2.6 (Lemma 2.4.8: Cheap Quorum Abort Agreement). Let p and q be correct processes (possibly identical). If p decides v in Cheap Quorum while q aborts from Cheap Quorum, then v will be q's abort value. Furthermore, if p is a follower, q's abort proof is a correct unanimity proof.


Proof. If p = q, the property follows immediately, because of lines 4 through 6 of panic mode. If p ≠ q, we consider two cases:

• If p is a follower, then for p to decide, all processes, and in particular, q, must have replicated both v and a correct proof of unanimity before p decided. Therefore, by Observation A.2.1, v and the unanimity proof are still there when q executes the panic code in lines 4 through 6. Therefore q will abort with v as its value and a correct unanimity proof as its abort proof.

• If p is the leader, then first note that since p is correct, by Observation A.2.1 v remains the value written in the leader’s Value region. There are two cases. If q has replicated

a value into its Value region, then it must have read it from Value[p1], and therefore it must be v. Again by Observation A.2.1, v must still be the value written in q's Value region when q executes the panic code. Therefore q aborts with value v. Otherwise, if q has not replicated a value, then q's Value region must be empty at the time of the panic, since the legalChange function disallows other processes from writing on that region.

Therefore q reads v from Value[p1] and aborts with v.

Lemma A.2.7. Cheap Quorum is 2-deciding.

Proof. Consider an execution in which every process is correct and the system is synchronous.

Then no process will enter panic mode (by Lemma A.2.4) and thus p1 will not have its permission revoked. p1 will therefore be able to write its input value to p1's region and decide after this single write (2 delays).

A.3 Correctness of the Fast & Robust

We now prove the following key composition property that shows that the composition of Cheap Quorum and Preferential Paxos is safe.

Lemma A.3.1 (Lemma 2.4.11: Composition Lemma). If some correct process decides a value v in Cheap Quorum before an abort, then v is the only value that can be decided in Preferential Paxos with priorities as defined in Definition 2.4.10.

Proof. To prove this lemma, we mainly rely on two properties: the Cheap Quorum Abort Agreement (Lemma 2.4.8) and Preferential Paxos Priority Decision (Lemma 2.4.9). We consider two cases.

Case 1. Some correct follower process p ≠ p1 decided v in Cheap Quorum. Then note that by Lemma 2.4.8, all correct processes aborted with value v and a correct unanimity proof. Since n ≥ 2f + 1, there are at least f + 1 correct processes. Note that by the way we assign priorities to inputs of Preferential Paxos in the composition of the two algorithms, all correct processes have inputs with the highest priority. Therefore, by Lemma 2.4.9, the only decision

value possible in Preferential Paxos is v. Furthermore, note that by Lemma 2.4.7, if any other correct process decided in Cheap Quorum, that process's decision value was also v.

Case 2. Only the leader, p1, decides in Cheap Quorum, and p1 is correct. Then by Lemma 2.4.8, all correct processes aborted with value v. Since p1 is correct, v is signed by p1. It is possible that some of the processes also had a correct unanimity proof as their abort proof. However, note that in this scenario, all correct processes (at least f + 1 processes) had inputs with either the highest or second highest priorities, all with the same abort value. Therefore, by Lemma 2.4.9, the decision value must have been the value of one of these inputs. Since all these inputs had the same value v, v must be the decision value of Preferential Paxos.

Theorem A.3.2 (End-to-end Validity). In the Fast & Robust algorithm, if there are no faulty processes and some process decides v, then v is the input of some process.

Proof. Note that by Lemmas A.2.2 and 2.4.9, this holds for each of the algorithms individually. Furthermore, recall that the abort values of Cheap Quorum become the input values of Preferential Paxos, and the set-up phase does not invent new values. Therefore, we just have to show that if Cheap Quorum aborts, then all abort values are inputs of some process. Note that by the code in panic mode, if Cheap Quorum aborts, a process p can output an abort value from one of three sources: its own Value region, the leader's Value region, or its own input value. Clearly, if its abort value is its input, then we are done. Furthermore note that a correct leader only writes its input in the Value region, and correct followers only write a copy of the leader's Value region in their own region. Since there are no faults, this means that only the input of the leader may be written in any Value region, and therefore all processes always abort with some process's input as their abort value.

Theorem A.3.3 (End-to-end Agreement). In the Fast & Robust algorithm, if p and q are correct processes such that p decides v1 and q decides v2, then v1 = v2.

Proof. First note that by Lemmas 2.4.7 and 2.4.9, each of the algorithms satisfies this individually. Thus Lemma 2.4.11 implies the theorem.

Theorem A.3.4 (End-to-end Termination). In Fast & Robust algorithm, if some correct process is eventually the sole leader forever, then every correct process eventually decides.

Proof. Assume towards a contradiction that some correct process p is eventually the sole leader forever, and let t be the time when p last becomes leader. Now consider some process q that has not decided before t. We consider several cases:

1. If q is executing Preferential Paxos at time t, then q will eventually decide, by termination of Preferential Paxos (Lemma 2.4.9).

2. If q is executing Cheap Quorum at time t, we distinguish two sub-cases:


(a) p is also executing as the leader of Cheap Quorum at time t. Then p will eventually propose a value, so q will either decide in Cheap Quorum or abort from Cheap Quorum (by Lemma A.2.3) and decide in Preferential Paxos by Lemma 2.4.9.

(b) p is executing in Preferential Paxos. Then p must have panicked and aborted from Cheap Quorum. Thus, q will also abort from Cheap Quorum and decide in Preferential Paxos by Lemma 2.4.9.

Note that to strengthen Theorem A.3.4 to general termination as stated in our model, we require the additional standard assumption [146] that some correct process p is eventually the sole leader forever. In practice, however, p does not need to be the sole leader forever, but rather only long enough so that all correct processes decide.

A.4 Correctness of Protected Memory Paxos

In this section, we present the proof of Theorem 2.5.1. We do so by showing that Algorithm 2.6 satisfies all of the properties in the theorem.

We first show that Algorithm 2.6 correctly implements consensus, starting with validity. Intuitively, validity is preserved because each process that writes any value in a slot either writes its own value, or adopts a value that was previously written in a slot. We show that every value written in any slot must have been the input of some process.

Theorem A.4.1 (Validity). In Algorithm 2.6, if a process p decides a value v, then v was the input to some process.

Proof. Assume by contradiction that some process p decides a value v and v is not the input of any process. Since v is not the input value of p, p must have adopted v by reading it from some process p′ at line 23. Note also that a process cannot adopt the initial value ⊥, and thus, v must have been written in p′'s memory by some other process. Thus we can define a sequence of processes s1, s2, ..., sk, where si adopts v from the location where it was written by si+1 and s1 = p. This sequence is necessarily finite since there have been a finite number of steps taken up to the point when p decided v. Therefore, there must be a last element of the sequence, sk, who wrote v in line 35 without having adopted v. This implies v was sk's input value, a contradiction.

We now focus on agreement.

Theorem A.4.2 (Agreement). In Algorithm 2.6, for any processes p and q, if p and q decide values vp and vq respectively, then vp = vq.

Before showing the proof of the theorem, we first introduce the following useful lemmas.


Lemma A.4.3. The values a leader accesses on remote memory cannot change between when it reads them and when it writes them.

Proof. Recall that each memory only allows write-access to the most recent process that acquired it. In particular, that means that each memory only gives access to one process at a time. Note that the only place at which a process acquires write-permissions on a memory is at the very beginning of its run, before reading the values written on the memory. In particular, for each memory d a process does not issue a read on d before its permission request on d successfully completes. Therefore, if a process p succeeds in writing on memory m, then no other process could have acquired m's write-permission after p did, and therefore, no other process could have changed the values written on m after p's read of m.

Lemma A.4.4. If a leader writes values vi and vj at line 35 with the same proposal number to memories i and j, respectively, then vi = vj.

Proof. Assume by contradiction that a leader p writes different values v1 ≠ v2 with the same proposal number. Since each thread of p executes the phase 2 write (line 35) at most once per proposal number, it must be that different threads T1 and T2 of p wrote v1 and v2, respectively. If p does not perform phase 1 (i.e., if p = p1 and this is p's first attempt), then it is impossible for T1 and T2 to write different values, since CurrentVal was set to v at line 14 and was not changed afterwards. Otherwise (if p performs phase 1), let t1 and t2 be the times when T1 and T2 executed line 8, respectively (T1 and T2 must have done so, since we assume that they both reached the phase 2 write at line 35). Assume wlog that t1 ≤ t2. Due to the check and abort at line 29, CurrentVal cannot change after t1 while keeping the same proposal number. Thus, when T1 and T2 perform their phase 2 writes (after t1), CurrentVal has the same value as it did at t1; it is therefore impossible for T1 and T2 to write different values. We have reached a contradiction.

Lemma A.4.5. If a process p performs phase 1 and then writes to some memory m with proposal number b at line 35, then p must have written b to m at line 21 and read from m at line 23.

Proof. Let T be the thread of p which writes to m at line 35. If phase 1 is performed (i.e., the condition at line 19 is satisfied), then by construction T cannot reach line 35 without first performing lines 21 and 23. Since T only communicates with m, it must be that lines 21 and 23 are performed on m.

Proof of Theorem A.4.2. Assume by contradiction that vp ≠ vq. Let bp and bq be the proposal numbers with which vp and vq are decided, respectively. Let Wp (resp. Wq) be the set of memories to which p (resp. q) successfully wrote in phase 2 (line 35) before deciding vp (resp. vq). Since Wp and Wq are both majorities, their intersection must be non-empty. Let m be any memory in Wp ∩ Wq.

We first consider the case in which one of the processes did not perform phase 1 before deciding (i.e., one of the processes is p1 and it decided on its first attempt). Let that process be p wlog. Further assume wlog that q is the first process to enter phase 2 with a value different from vp. p's phase 2 write on m must have preceded q obtaining permissions from m (otherwise, p's write would have failed due to lack of permissions). Thus, q must have seen p's value during its read on m at line 23, and thus q cannot have adopted its own value. Since q is the first process to enter phase 2 with a value different from vp, q cannot have adopted any other value than vp, so q must have adopted vp. Contradiction.

We now consider the remaining case: both p and q performed phase 1 before deciding. We assume wlog that bp < bq and that bq is the smallest proposal number larger than bp for which some process enters phase 2 with CurrentVal ≠ vp.

Since bp < bq, p's read at m must have preceded q's phase 1 write at m (otherwise p would have seen q's larger proposal number and aborted). This implies that p's phase 2 write at m must have preceded q's phase 1 write at m (by Lemma A.4.3). Thus q must have seen vp during its read and cannot have adopted its own input value. However, q cannot have adopted vp, so q must have adopted vq from some other slot sl that q saw during its read. It must be that sl.minProposal < bq, otherwise q would have aborted. Since sl.minProposal ≥ sl.accProposal for any slot, it follows that sl.accProposal < bq. If sl.accProposal < bp, q cannot have adopted sl.value in line 30 (it would have adopted vp instead). Thus it must be that bp ≤ sl.accProposal < bq; however, this is impossible because we assumed that bq is the smallest proposal number larger than bp for which some process enters phase 2 with CurrentVal ≠ vp. We have reached a contradiction.

Finally, we prove that the termination property holds.

Theorem A.4.6 (Termination). Eventually, all correct processes decide.

Lemma A.4.7. If a correct process p is executing the for loop at lines 18–37, then p will eventually exit from the loop.

Proof. The threads of the for loop perform the following potentially blocking steps: obtaining permission (line 20), writing (lines 21 and 35), reading (line 23), and waiting for other threads (the barrier at line 7 and the exit condition at line 37). By our assumption that a majority of memories are correct, a majority of the threads of the for loop must eventually obtain permission in line 20 and invoke the write at line 21. If one of these writes fails due to lack of permission, the loop aborts and we are done. Otherwise, a majority of threads will perform the read at line 23. If some thread aborts at lines 22 and 26, then the loop aborts and we are done. Otherwise, a majority of threads must add themselves to ListOfReady, pass the barrier at line 7 and invoke the write at line 35. If some such write fails, the loop aborts; otherwise, a majority of threads will reach the check at line 37 and thus the loop will exit.


Proof of Theorem A.4.6. The Ω failure detector guarantees that eventually, all processes trust the same correct process p. Let t be the time after which all correct processes trust p forever.

By Lemma A.4.7, at some time t′ ≥ t, all correct processes except p will be blocked at line 12. Therefore, the minProposal values of all memories, on all slots except those of p, stop increasing. Thus, eventually, p picks a propNr that is larger than all others written on any memory, and stops restarting at line 26. Furthermore, since no process other than p is executing any steps of the algorithm, and in particular, no process other than p ever acquires any memory after time t′, p never loses its permission on any of the memories. So, all writes executed by p on any correct memory must return ack. Therefore, p will decide and broadcast its decision to all. All correct processes will receive p's decision and decide as well.

To complete the proof of Theorem 2.5.1, we now show that Algorithm 2.6 is 2-deciding.

Theorem A.4.8. Algorithm 2.6 is 2-deciding.

Proof. Consider an execution in which p1 is timely, and no process's failure detector ever suspects p1. Then, since no process other than p1 thinks itself the leader, and processes do not deviate from their protocols, no process calls changePermission on any memory. Furthermore, p1 does not perform phase 1 (lines 19–33), since it is p1's first attempt. Thus, since p1 initially has write permission on all memories, all of p1's phase 2 writes succeed. Therefore, p1 terminates, deciding its own proposed value v, after one write per memory.

B Microsecond Consensus for Microsecond Applications

B.1 Pseudocode of the Basic Version

1 Propose(myValue):

2 done = false

3 If I just became leader or I just aborted

4 For every process p in parallel:

5 Request permission from p

6 If p acks, add p to confirmedFollowers

7 Until this has been done for a majority of processes

8 While not done:

9 Execute Prepare Phase

10 Execute Accept Phase

11 struct Log {

12 log[] = ⊥ for all slots
13 minProposal = 0

14 FUO = 0 }

16 Prepare Phase:

17 Pick a new proposal number, propNum, that is higher than any seen so far
18 For every process p in confirmedFollowers:

19 Read minProposal from p's log

20 Abort if any read fails

21 If propNum < some minProposal read, abort

22 For every process p in confirmedFollowers:

23 Write propNum into LOG[p].minProposal

24 Read LOG[p].slots[myFUO]

25 Abort if any write or read fails

26 if all entries read were empty:

27 value = myValue

28 else :

29 value = entry value with the largest proposal number of slots read


30 Accept Phase:

31 For every process p in confirmedFollowers:

32 Write value,propNum to p in slot myFUO

33 Abort if any write fails

34 If value == myValue:

35 done = true

36 Locally increment myFUO

Note that write permission can be granted at most once per request; it is impossible to send a single permission request, be granted permission, lose permission and then regain it without issuing a new permission request. This is the way that permission requests work in our implementation, and it is key for the correctness argument to go through; in particular, it is important that a leader cannot lose permission between two of its writes to the same follower without being aware that it lost permission.

B.2 Definitions

Definition B.2.1 (Quorum). A quorum is any set that contains at least a majority of the processes.

Definition B.2.2 (Decided Value). We say that a value v is decided at index i if there exists a quorum Q such that for every process p ∈ Q, p's log contains v at index i.

Definition B.2.3 (Committed Value). We say that a value v is committed at process p at index i if p's log contains v at index i, such that i is less than p's FUO.

B.3 Invariants

Invariant B.3.1 (Committed implies decided). If a value v is committed at some process p at index i, then v is decided at index i.

Proof. Assume v is committed at some process p at index i. Then p must have incremented its FUO past i at line 36, and therefore p must have written v at a majority at line 32, so v is decided at index i.

Invariant B.3.2 (Values are never erased). If a log slot contains a value at time t, that log slot will always contain some value after time t.

Proof. By construction of the algorithm, values are never erased (note: values can be overwritten, but only with a non-⊥ value).

B.3.1 Validity

Invariant B.3.3. If a log slot contains a value v ≠ ⊥, then v is the input of some process.

Proof. Assume the contrary and let t be the earliest time when some log slot (call it L) contained a non-input value (call it v). In order for L to contain v, some process p must have written v into L at line 32. Thus, either v was the input value of p (which would lead to a contradiction), or p adopted v at line 29, after reading it from some log slot L′ at line 24. Thus, L′ must have contained v earlier than t, a contradiction of our choice of t.

Theorem B.3.4 (Validity). If a value v is committed at some process, then v was the input value of some process.

Proof. Follows immediately from Invariant B.3.3 and the definition of being committed.

B.3.2 Agreement

Invariant B.3.5 (Solo detection). If a process p writes to a process q in line 23 or in line 32, then no other process r wrote to q since p added q to its confirmed followers set.

Proof. Assume the contrary: p added q to its confirmed followers set at time t0 and wrote to q at time t2 > t0; r ≠ p wrote to q at time t1, t0 < t1 < t2. Then:

1. r had write permission on q at t1.

2. p had write permission on q at t2.

3. (From (1) and (2)) p must have obtained write permission on q between t1 and t2. But this is impossible, since p added q to its confirmed followers set at t0 < t1 and thus p must have obtained permission on q before t0. By the algorithm, p did not request permission on q again since obtaining it, and by the way permission requests work, permission is granted at most once per request. We have reached a contradiction.

Invariant B.3.6. If some process p1 successfully writes value v1 and proposal number b1 to its confirmed followers in slot i at line 32, then any process p2 entering the accept phase with proposal number b2 > b1 for slot i will do so with value v1.

Proof. Assume the contrary: some process enters the accept phase for slot i with a proposal number larger than b1, with a value v2 ≠ v1. Let p2 be the first such process to enter the accept phase.

Let C1 (resp. C2) be the confirmed followers set of p1 (resp. p2). Since C1 and C2 are both quorums, they must intersect in at least one process, call it q. Since q is in the confirmed followers set of both p1 and p2, both must have read its minProposal (line 19), written its minProposal with their own proposal value (line 23) and read its ith log slot (line 24). Furthermore, p1 must have written its new value into that slot (line 32). Note that since p1 successfully wrote value v1 on q, by Invariant B.3.5, p2 could not have written on q between the time at which p1 obtained its permission on it and the time of p1's write on q's ith slot. Thus, p2 either executed both of its writes on q before p1 obtained permissions on q, or after p1 wrote its value in q's ith slot. If p2 executed its writes before p1, then p1 must have seen p2's proposal number when reading q's minProposal in line 19 (since p1 obtains permissions before executing this line). Thus, p1 would have aborted its attempt and chosen a higher proposal number, contradicting the assumption that b1 < b2.

Thus, p2 must have executed its first write on q after p1 executed its write of v1 in q's log. Since p2's read of q's slot happens after its first write (in line 24), this read must have happened after p1's write, and therefore p2 saw v1, b1 in q's ith slot. By assumption, p2 did not adopt v1. By line 29, this means p2 read v2 with a higher proposal number than b1 from some other process in C2. This contradicts the assumption that p2 was the first process to enter the accept phase with a value other than v1 and a proposal number higher than b1.

Theorem B.3.7 (Agreement). If v1 is committed at p1 at index i and v2 is committed at p2 at index i, then v1 = v2.

Proof. In order for v1 (resp. v2) to be committed at p1 (resp. p2) at index i, p1 (resp. p2) must have incremented its FUO past i and thus must have successfully written v1 (resp. v2) to its confirmed follower set at line 32. Let b1 (resp. b2) be the proposal number p1 (resp. p2) used at line 32. Assume without loss of generality that b1 < b2. Then, by Invariant B.3.6, p2 must have entered its accept phase with value v1 and thus must have written v1 to its confirmed followers at line 32. Therefore, v1 = v2.

B.3.3 Termination

Invariant B.3.8 (Termination implies commitment.). If a process p calls propose with value v and returns from the propose call, then v is committed at p.

Proof. Follows from the algorithm: p returns from the propose call only after it sees done to be true at line 8; for this to happen, p must set done to true at line 35 and increment its FUO at line 36. In order for p to set done to true, p must have successfully written some value val to its confirmed follower set at line 32 and val must be equal to v (check at line 34). Thus, when p increments its FUO at line 36, v becomes committed at p.

Invariant B.3.9 (Weak Termination). If a correct process p invokes Propose and does not abort, then p eventually returns from the call.

Proof. The algorithm does not have any blocking steps or goto statements, and has only one unbounded loop at line 8. Thus, we show that p will eventually exit the loop at line 8.

Let t be the time immediately after p finishes constructing its confirmed followers set (lines 4–7). Let i be the highest index such that one of p's confirmed followers contains a value in its log at index i at time t. Given that p does not abort, it must be that p does not lose write permission on any of its confirmed followers and thus has write permission on a quorum for the duration of its call. Thus, after time t and until the end of p's call, no process is able to write any new value at any of p's confirmed followers [∗].

Since p never aborts, it will repeatedly execute the accept phase and increment its FUO at line 36 until p's FUO is larger than i. During its following prepare phase, p will find all slots to be empty (due to [∗]) and adopt its own value v at line 27. Since p does not abort, it must be that p succeeds in writing v to its confirmed followers at line 32 and sets done to true in line 35. Thus, p eventually exits the loop at line 8 and returns.

Theorem B.3.10 (Termination). If eventually there is a unique non-faulty leader, then eventually every Propose call returns.

Proof. We show that eventually p does not abort from any Propose call and thus, by Invariant B.3.9, eventually p returns from every Propose call.

Consider a time t such that (1) no processes crash after t and (2) a unique process p considers itself leader forever after t.

Furthermore, by Invariant B.3.9, by some time t′ > t all correct processes will return or abort from any Propose call they started before t; no process apart from p will call Propose again after t′ since p is the unique leader.

Thus, in any propose call p starts after t′, p will obtain permission from a quorum in lines 4–7 and will never lose any permissions (since no other process is requesting permissions). Thus, all of p's reads and writes will succeed, so p will not abort at lines 20, 25, or 33.

Furthermore, since no process invokes Propose after t′, the minProposals of p's confirmed followers do not change after this time. Thus, by repeatedly increasing its minProposal at line 17, p will eventually have the highest proposal number among its confirmed followers, so p will not abort at line 21.

Therefore, by Invariant B.3.9, p will eventually return from every Propose call.

B.4 Optimizations & Additions

B.4.1 New Leader Catch-Up

In the basic version of the algorithm described so far, it is possible for a new leader to miss decided entries from its log (e.g., if the new leader was not part of the previous leader’s confirmed followers). The new leader can only catch up by attempting to propose new values at its current FUO, discovering previously accepted values, and re-committing them. This is correct but inefficient.


We describe an extension that allows a new leader ` to catch up faster: after constructing its confirmed followers set (lines 4–7), ` can read the FUO of each of its confirmed followers, determine the follower f with the highest FUO, and bring its own log and FUO up to date with f. This is described in the pseudocode below:

Algorithm B.1 – Optimization: Leader Catch Up.

1  For every process p in confirmedFollowers
2    Read p's FUO
3    Abort if any read fails
4  F = follower with max FUO
5  if F.FUO > my_FUO:
6    Copy F.LOG[my_FUO: F.FUO] into my log
7    myFUO = F.FUO
8    Abort if any read fails

We defer our correctness argument for this extension to Section B.4.2.

B.4.2 Update Followers

While the previous extension allows a new leader to catch up in case it does not have the latest committed values, followers’ logs may still be left behind (e.g., for those followers that were not part of the leader’s confirmed followers).

As is standard for practical Paxos implementations, we describe a mechanism for followers’ logs to be updated so that they contain all committed entries that the leader is aware of. After a new leader ` updates its own log as described in Algorithm B.1, it also updates its confirmed followers’ logs and FUOs:

Algorithm B.2 – Optimization: Update Followers.

1  For every process p in confirmed followers:
2    Copy myLog[p.FUO: my_FUO] into p.LOG
3    p.FUO = my_FUO
4  Abort if any write fails

We now argue the correctness of the update mechanisms in this and the preceding subsections. These approaches clearly do not violate termination. We now show that they preserve agreement and validity.

Validity. We extend the proof of Invariant B.3.3 to also cover Algorithms B.1 and B.2; the proof of Theorem B.3.4 remains unchanged.

Assume by contradiction that some log slot L does not satisfy Invariant B.3.3. Without loss of generality, assume that L is the first log slot in the execution which stops satisfying Invariant B.3.3. In order for L to contain v, either (i) some process q wrote v into L at line 32, or (ii) v was copied into L using Algorithm B.1 or B.2. In case (i), either v was q's input value (a contradiction), or q adopted v at line 29 after reading it from some log slot L′ ≠ L. In this case, L′ must have contained v before L did, a contradiction of our choice of L. In case (ii), some log slot L′′ must have contained v before L did, again a contradiction.

Agreement. We extend the proof of B.3.7 to also cover Algorithms B.1 and B.2. Let t be the earliest time when agreement is broken; i.e., t is the earliest time such that, by time t, some process p1 has committed v1 at i and some process p2 has committed v2 ≠ v1 at i. We can assume without loss of generality that p1 commits v1 at t1 = t and p2 commits v2 at t2 < t1. We now consider three cases:

1. Both p1 and p2 commit normally by incrementing their FUO at line 36. Then the proof of B.3.7 applies to p1 and p2.

2. p1 commits normally by incrementing its FUO at line 36, while p2 commits with Algorithm B.1 or B.2. Then some process p3 must have committed v2 normally at line 36 and the proof of B.3.7 applies to p1 and p3.

3. p1 commits v1 using Algorithm B.1 or B.2. Then v1 was copied to p1’s log from some other process p3’s log, where v1 had already been committed. But then, agreement must have been broken earlier than t (v1 committed at p3, v2 committed at p2), a contradiction.

B.4.3 Followers Update Their Own FUO

In the algorithm and optimizations presented so far, the only way for the FUO of a process p to be updated is by the leader; either by p being the leader and updating its own FUO, or by p being the follower of some leader that executes Algorithm B.2. However, in the steady state, when the leader doesn't change, it would be ideal for a follower to be able to update its own FUO. This is especially important in practice for SMR, where each follower should be applying committed entries to its local state machine. Thus, knowing which entries are committed as soon as possible is crucial. For this purpose, we introduce another simple optimization, whereby a follower updates its own FUO to i if it has a non-empty entry in some slot j ≥ i and all slots k < i are populated.

Algorithm B.3 – Optimization: Followers Update Their Own FUO.

1  if LOG[i] ≠ ⊥ && my_FUO == i-1:
2    my_FUO = i
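In C-like terms, a follower in the steady state could run the loop below, which combines Algorithm B.3 with the apply step mentioned above; the struct layout and the apply_to_state_machine() hook are assumptions made for the sketch, not the actual implementation.

#define EMPTY 0                     /* stands for the ⊥ slot value */
struct replica { long log[1024]; long fuo; };
extern void apply_to_state_machine(long entry);   /* hypothetical SMR hook */

void follower_advance_fuo(struct replica *me) {
    /* Algorithm B.3: if LOG[i] != ⊥ and my_FUO == i-1, then my_FUO = i.
     * With i = fuo+1, a populated slot fuo+1 means the leader that wrote it
     * had already committed slot fuo, so slot fuo can be applied.
     * Bounds checks are omitted for brevity. */
    while (me->log[me->fuo + 1] != EMPTY) {
        long newly_committed = me->log[me->fuo];
        me->fuo++;
        apply_to_state_machine(newly_committed);
    }
}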

Note that this optimization doesn’t write any new values on any slot in the log, and therefore, cannot break Validity. Furthermore, since it does not introduce any waiting, it cannot break termination. We now prove that this doesn’t break Agreement.


Agreement. Assume by contradiction that executing Algorithm B.3 can break agreement. Let p be the first process whose execution of Algorithm B.3 breaks agreement, and let t be the time at which it changes its FUO to i, thereby breaking agreement.

It must be the case that p has all slots up to and including i populated in its log. Furthermore, since t is the first time at which disagreement happens, and p's FUO was at i−1 before t, it must be the case that for all values in slots 1 to i−2 of p's log, if any other process p′ also has those slots committed, then it has the same values as p in those slots. Let p's value at slot i−1 be v. Let `1 be the leader that populated slot i−1 for p, and let `2 be the leader that populated slot i for p. If `1 = `2, then p's entry at i−1 must be committed at `1 before time t, since otherwise `1 would not have started replicating entry i. So, if at time t, some process q has a committed value v′ in slot i−1 where v′ ≠ v, then this would have violated agreement with `1 before t, contradicting the assumption that t is the earliest time at which agreement is broken.

Now consider the case where `1 ≠ `2. Note that for `2 to replicate an entry at index i, it must have a value v′ committed at entry i−1. Consider the last leader, `3, who wrote a value on `2's (i−1)th entry. If `3 = `1, then v′ = v, since a single leader only ever writes one value on each index. Thus, if agreement is broken by p at time t, then it must have also been broken at an earlier time by `2, which had v committed at i−1 before time t. Contradiction.

If `3 = `2, we consider two cases, depending on whether or not p is part of `2's confirmed followers set. If p is not in the confirmed followers of `2, then `2 could not have written a value on p's ith log slot. Therefore, p must have been a confirmed follower of `2. If p was part of `2's quorum for committing entry i−1, then `2 was the last leader to write p's (i−1)th slot, contradicting the assumption that `1 wrote it last. Otherwise, if `2 did not use p as part of its quorum for committing, it still must have created a work request to write on p's (i−1)th entry before creating the work request to write on p's ith entry. By the FIFOness of RDMA queue pairs, p's (i−1)th slot must therefore have been written by `2 before the ith slot was written by `2, leading again to a contradiction.

Finally, consider the case where `3 ≠ `1 and `3 ≠ `2. Recall from the previous case that p must be in `2's confirmed followers set. Then when `2 takes over as leader, it executes the update followers optimization presented in Algorithm B.2. By executing this, it must update p with its own committed value at i−1, and update p's FUO to i. However, this contradicts the assumption that p's FUO was changed from i−1 to i by p itself using Algorithm B.3.

B.4.4 Grow Confirmed Followers

In our algorithm, the leader only writes to and reads from its confirmed followers set. So far, for a given leader `, this set is fixed and does not change after ` initially constructs it in lines 4–7. This implies that processes outside of `'s confirmed followers set will remain behind and miss updates, even if they are alive and timely.

We present an extension which allows such processes to join `’s confirmed followers set even

if they are not part of the initial majority. Every time Propose is invoked, ` will check to see if it received permission acks since the last Propose call and if so, will add the corresponding processes to its confirmed followers set. This extension is compatible with those presented in the previous subsections: every time `'s confirmed followers set grows, ` re-updates its own log from the new followers that joined (in case any of their logs is ahead of `'s), as well as updates the new followers' logs (in case any of their logs is behind `'s).

One complication raised by this extension is that, if the number of confirmed followers is larger than a majority, then ` can no longer wait for its reads and writes to complete at all of its confirmed followers before continuing execution, since that would interfere with termination in an asynchronous system.

The solution is for the leader to issue reads and writes to all of its confirmed followers, but only wait for completion at a majority of them. One crucial observation about this solution is that confirmed followers cannot miss operations or have operations applied out-of-order, even if they are not consistently part of the majority that the leader waits for before continuing. This is due to RDMA’s FIFO semantics.
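The C sketch below illustrates this "issue to all, wait for a majority" pattern. The post_write() and poll_one_completion() helpers are stand-ins for posting RDMA work requests and polling completions; they are assumptions of the sketch rather than the interface of our implementation.

#include <stdbool.h>

struct follower;                        /* opaque per-follower connection handle */
extern void post_write(struct follower *f, const void *buf, int len);
extern bool poll_one_completion(void);  /* true on success, false on a failed write */

/* Issue the write to every confirmed follower, but return as soon as a
 * majority of all n_total processes have acknowledged it.  Per-connection
 * FIFO ordering means slower followers still apply every write, in order. */
bool write_to_majority(struct follower *followers[], int n_confirmed,
                       int n_total, const void *buf, int len) {
    for (int i = 0; i < n_confirmed; i++)
        post_write(followers[i], buf, len);
    int acked = 0;
    while (acked <= n_total / 2) {
        if (!poll_one_completion())
            return false;               /* a write failed (permission lost): abort */
        acked++;
    }
    return true;
}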

The correctness of this extension derives from the correctness of the algorithm in general; whenever a leader ` adds some set S to its confirmed followers C, forming C′ = C ∪ S, the behavior is the same as if ` just became leader and its initial confirmed followers set was C′.

B.4.5 Omit Prepare Phase

As is standard practice for Paxos-derived implementations, the prepare phase can be omitted if there is no contention. More specifically, the leader executes the prepare phase until it finds no accepted values during its prepare phase (i.e., until the check at line 26 succeeds). Afterwards, the leader omits the prepare phase until it either (a) aborts, or (b) grows its confirmed followers set; after (a) or (b), the leader executes the prepare phase until the check at line 26 succeeds again, and so on.

This optimization concerns performance on the common path. With this optimization, the cost of a Propose call becomes a single RDMA write to a majority in the common case when there is a single leader.
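The following C sketch shows one way to track when the prepare phase can be skipped; prepare_phase(), accept_phase(), and on_followers_grew() are hypothetical wrappers around the corresponding parts of the pseudocode in Section B.1, used here only to make the control flow concrete.

#include <stdbool.h>

static bool can_skip_prepare = false;   /* set once line 26 found only empty slots */

extern bool prepare_phase(long *value_out, bool *slots_were_empty);
extern bool accept_phase(long value);

bool propose(long my_value) {
    long value = my_value;
    if (!can_skip_prepare) {
        bool empty;
        if (!prepare_phase(&value, &empty)) {
            can_skip_prepare = false;   /* (a) abort: prepare again next time */
            return false;
        }
        can_skip_prepare = empty;       /* check at line 26 succeeded */
    }
    if (!accept_phase(value)) {
        can_skip_prepare = false;
        return false;
    }
    return value == my_value;           /* common case: one RDMA write to a majority */
}

/* (b) the confirmed-followers set grew: force a full prepare phase again. */
void on_followers_grew(void) { can_skip_prepare = false; }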

The correctness of this optimization follows from the following lemma, which states that no 'holes' can form in the log of any replica. That is, if there is a value written in slot i of process p's log, then every slot j < i in p's log has a value written in it.

Lemma B.4.1 (No holes). For any process p, if p's log contains a value at index i, then p's log contains a value at every index j, 0 ≤ j ≤ i.

Proof. Assume by contradiction that the lemma does not hold. Let p be a process whose slot j is empty, but slot j+1 has a value, for some j. Let ` be the leader that wrote the value on slot j+1 of p's log, and let t be the last time at which ` gained write permission to p's log before writing the value in slot j+1. Note that after time t and as long as ` is still leader, p is in `'s confirmed followers set. By Algorithm B.2, ` must have copied a value into all slots of p that were after p's FUO and before `'s FUO. By the way FUO is updated, p's FUO cannot be past slot j at this time. Therefore, if `'s FUO is past j, slot j would have been populated by ` at this point in time. Otherwise, ` starts replicating values to all its confirmed followers, starting at its FUO, which we know is less than or equal to j. By the FIFO order of RDMA queue pairs, p cannot have missed updates written by `. Therefore, since p's (j+1)th slot gets updated by `, so must its jth slot. Contradiction.

Corollary B.4.2. Once a leader reads no accepted values from a majority of the followers at slot i, it may safely skip the prepare phase for slots j > i as long as its confirmed followers set does not decrease to less than a majority.

Proof. Let ` be a leader and C be its confirmed follower set, which is a quorum. Assume that ` executes line 27 for slot i; that is, no follower p ∈ C had any value in slot i. Then, by Lemma B.4.1, no follower in C has any value for any slot j > i. Since this constitutes a majority of the processes, no value is decided in any slot j > i, and by Invariant B.3.1, no value is committed at any process at any slot j > i. Furthermore, as long as ` has the write permission at a majority of the processes, ` is the only one that can commit new entries in these slots (by Invariant B.3.5). Thus, ` cannot break agreement by skipping the prepare phase on the processes in its confirmed followers set.

C The Inherent Cost of Remembering Consistently

Corner Case: Reader Traversal Concurrent with Update

In Figure C.1 we exemplify a read from a state that is never the latest in the non-fuzzy part. In part A, a reader starts and finds that the available flag of Node 4 is still unset. It continues with Node 3, but is scheduled out before checking its available flag. The non-fuzzy part consists of INIT, op1, op2. In part B, the thread executing op4 (Thread 4) sets the available flag of Node 4. The non-fuzzy part is expanded to INIT, op1, op2, op3, op4. We emphasize that Node 3 is part of the non-fuzzy part and that op3 is already linearized. In part C, the thread executing op3 (Thread 3) sets the available flag of Node 3. This is a redundant operation with respect to op3, which was already linearized at part B; but Thread 3 is unaware of op4 and continues with the update algorithm. In part D, the reader resumes, finds that the available flag of op3 is set, and computes the returned value based on the state up to op3 (INIT, op1, op2, op3). This state was never the non-fuzzy part of the execution trace.

The read operation is still correctly linearized since moving from history INIT, op1, op2 to history INIT, op1, op2, op3, op4 must pass through history INIT, op1, op2, op3. This is the linearization point of the read. Using the terminology of Lemma 4.5.7, op3 is linearized at time t4 − ε and the reader is linearized between t4 − ε and t4.
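For intuition, the reader's traversal described above amounts to walking back from the tail until the first operation whose available flag is set. The C sketch below uses illustrative field names (prev, available); it is not the data structure's actual layout.

struct op {
    struct op *prev;   /* previous operation in the log */
    int available;     /* set once the operation becomes visible to readers */
};

/* Walk back from the tail to the most recent available operation; the reader
 * may be scheduled out between iterations, as in parts A-D of Figure C.1. */
struct op *last_available(struct op *tail) {
    struct op *cur = tail;
    while (cur != NULL && !cur->available)
        cur = cur->prev;
    return cur;        /* e.g., op3 in part D: compute the result from here */
}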


[Figure C.1 consists of four snapshots of the operation log (INIT, op1, op2, op3, op4 at indices 0–4), each showing the available flags and the boundary between the non-fuzzy part and the fuzzy window: A: the reader traverses; B: op4.setAvailable; C: op3.setAvailable; D: the last observed available node is op3, so the reader computes its result from op3.]

Figure C.1 – Illustrating a read from a node that is never the latest in the non-fuzzy part.

D Efficient Multi-Word Compare-and-Swap

D.1 Replacing RDCSS with CAS in Harris et al.'s algorithm leads to ABA

Consider two memory locations a1 and a2 with initial values v1 and v2 respectively. Consider two 2-CAS operations op and op′ which operate on a1 and a2. op has old values v1 and v2 and new values v1′ and v2′, respectively; op′ has old values v1′ and v2′ and new values v1 and v2, respectively. Let D be op's descriptor.

1. op executes solo and performs the CAS to make a1 point to D, then pauses immediately before the CAS to acquire a2.

2. op′ executes solo: it first helps op complete, changing the values of a1 and a2 to v1′ and v2′ respectively; then op′ performs its own changes, modifying a1's and a2's values back to v1 and v2, respectively.

3. op resumes, successfully acquires a2, performs the status-change CAS on D, then performs unlocking CASes on a1 and a2. The CAS on a1 will fail, and a1's value will remain v1. The CAS on a2 will succeed, changing its value to v2′.

4. The values of a1 and a2 are now incompatible with any linearization of op and op′.
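The root cause is that a plain CAS cannot distinguish "the location still holds the old value" from "the location changed away and back". The C sketch below shows the acquire step with a plain CAS; the types and helper are illustrative only, not Harris et al.'s code, and the comment marks where the interleaving above defeats it.

#include <stdatomic.h>
#include <stdbool.h>

struct descriptor;   /* the MCAS operation descriptor, defined elsewhere */

/* Acquire one location for the MCAS by installing the descriptor over the
 * expected old value.  With a plain CAS this also succeeds after the value
 * went v1 -> v1' -> v1 (steps 1-2 above), even though op was already helped
 * to completion; RDCSS exists precisely to rule this out. */
bool acquire_with_plain_cas(_Atomic(void *) *addr, void *old_val,
                            struct descriptor *d) {
    void *expected = old_val;
    return atomic_compare_exchange_strong(addr, &expected, (void *)d);
}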

D.2 Performance in read-only and update-heavy workloads

Figures D.1 and D.2 show performance results for the 50% reads and 100% reads workloads in the doubly-linked list and B+-tree benchmarks, respectively. All other settings are the same as in Section 5.7. These more extreme workloads largely magnify the performance effects demonstrated by the workloads included in Section 5.7. Namely, as the write ratio increases, so does the gap by which AOPT outperforms PMwCAS in cases that involve contention between concurrent MCAS operations (i.e., small and moderate lists and trees). With 100% reads, our algorithm performs on par with or slightly trails behind PMwCAS at every contention level. Since the workload involves no update operations (and thus no MCASes), the lower complexity of our MCAS operations does not factor into the results, whereas the higher overhead of the extra level of indirection in our read operations does factor in.

Figure D.1 – Doubly-linked list benchmark. Top row shows results for the 50% reads workload; bottom row shows results for the 100% reads workload. Each column corresponds to a different initial list size (5, 50 and 500 elements, respectively).



Figure D.2 – B+-tree benchmark. Top row shows results for 50% reads workload; bottom row shows results for 100% reads workload. Each column corresponds to a different initial tree size (16, 512 and 4096 elements, respectively).


E Fast and Robust Memory Reclamation for Concurrent Data Structures

E.1 QSBR Correctness

Lemma E.1.1. If the time interval [a,b] is a grace period, after time b no process holds hazardous references to nodes that were removed before time a.

Proof. Let [a,b] be a grace period and consider a node n that was removed at time t < a. By the definition of the removed state, this means that no process can obtain a reference to n after a, but it is possible for processes to still hold references to n that they obtained before t. By the definition of a grace period, all processes will pass through a quiescent state between a and b. Therefore, for each process p there exists a time tj, a ≤ tj ≤ b, when p does not hold any reference to n. Thus, at b, none of the processes hold any references (and in particular, hazardous references) to n.

Lemma E.1.2. For any process p, at the time when p updates its local epoch from ej to eG, no process holds any hazardous references to the nodes already present in p's eGth limbo list.

Proof. Note that all epoch updates are done modulo three (because there are three logical epochs). Without loss of generality, suppose process p passes through a local epoch cycle 0 → 1 → 2 → 0. We want to show that when p reaches epoch 0 for the second time, no processes hold any hazardous references to the nodes in p's limbo list 0.

We claim that the system goes through a grace period [a,b] starting just before the quiescent state during which p updates its local epoch from 0 to 1 and ending just after the transition by p of its local epoch from 2 to 0. Note that this claim implies, using Lemma E.1.1, that after p’s local epoch becomes 0 again, no processes hold hazardous references to the nodes in p’s limbo list 0, as required to complete the proof of Lemma E.1.2.

We now proceed to prove the claim. Assume that [a,b] is not a grace period. It follows that there exists a process q that does not go through a quiescent state during [a,b]. Therefore, we know that the local epoch of q stays the same during the time interval [a,b]. Since during [a,b], p updates its local epoch from 0 to 1, then from 1 to 2 and then from 2 to 0, there exist times t1 < t2 < t3, t1, t2, t3 ∈ [a,b], such that eG = 1 at t1, eG = 2 at t2 and eG = 0 at t3. Since some process transitions the global epoch from 1 to 2 between t1 and t2, it must be the case that the epoch of q is equal to 1 (otherwise the update cannot be completed). But this means that later during [a,b] the global epoch cannot be advanced from 2 to 0, because there exists at least one process (q) whose local epoch is not equal to 2. We have reached a contradiction. This completes the proof of the claim and of Lemma E.1.2.
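The argument hinges on the rule that the global epoch can only advance once every local epoch has caught up with it. The C sketch below states that rule under assumed names (NPROC, global_epoch, local_epoch); it is a simplified illustration, not the QSBR code evaluated in this thesis.

#include <stdatomic.h>
#include <stdbool.h>

#define NPROC 8                      /* number of threads (assumption) */
static _Atomic int global_epoch;
static _Atomic int local_epoch[NPROC];

/* The global epoch moves from e to (e+1) mod 3 only if every process's local
 * epoch already equals e; a process that never reaches a quiescent state
 * therefore pins the global epoch, which is exactly what q does in the proof. */
bool try_advance_global_epoch(void) {
    int e = atomic_load(&global_epoch);
    for (int p = 0; p < NPROC; p++)
        if (atomic_load(&local_epoch[p]) != e)
            return false;
    return atomic_compare_exchange_strong(&global_epoch, &e, (e + 1) % 3);
}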

Property E.1.3 (Safety). If at time t, node n is identified by process p as eligible for reuse, then no process holds any hazardous references to n at time t.

Proof. This follows from Lemma E.1.2 and from the fact that a process p will identify a node n as eligible for reuse if and only if p has just updated its local epoch from ej to eG and n is in p's eGth limbo list.

E.2 QSense on a Linked List

Algorithms E.1, E.2 and E.3 show an example of how QSense can be applied to a lock-free concurrent linked-list [104]. The lines of code needed to use QSense are highlighted (in blue). First, in the beginning of each list operation, the manage_qsense_state function is called. This function takes care of switching between QSBR and Cadence, if necessary, and invoking a quiescent state for every batch of performed operations. Second, hazard pointers are assigned to protect nodes when the list is traversed, in the same way one would use the original hazard pointer technique. The only difference is that the memory barrier between the hazard pointer assignment and the verification step is no longer needed. Finally, free_node_later should be called instead of free, when a node is removed.


Algorithm E.1 – QSense on a concurrent linked-list (I).

Node* search(Node* set_head, Key key) {
    manage_qsense_state();
    Node *left_node, *right_node;
retry_search:
    left_node = set_head;
    right_node = set_head->next;
    while (True) {
        // Protect nodes by hazard pointers and perform the verification,
        // without the memory barrier
        assign_HP(left_node, 0); assign_HP(right_node, 1);
        if (right_node != left_node->next) {
            goto retry_search;
        }
        if (right_node->key >= key) {
            break;
        }
        left_node = right_node;
        right_node = unmarked(right_node->next);
    }
    return right_node;
}

Boolean insert(Node *list_head, Key key) {
    manage_qsense_state();
    do {
        Node* left_node;
        Node* right_node = search_and_cleanup(list_head, key, &left_node);
        if (right_node->key == key) {
            return False;
        }
        // Allocate a node with the allocator of your choice
        Node* node_to_add = new_node(key, right_node);
        if (CAS(&left_node->next, right_node, node_to_add) == right_node) {
            return True;
        }
        // Node was not inserted; free the node directly.
        free(node_to_add);
    } while (True);
}


Algorithm E.2 – QSense on a concurrent linked-list (II).

Boolean delete(Node *list_head, Key key) {
    manage_qsense_state();
    Node* cas_result;
    Node* unmarked_node;
    Node* left_node;
    Node* right_node;
    do {
        right_node = search_and_cleanup(list_head, key, &left_node);
        if (right_node->key != key) {
            return False;
        }
        // Try to mark right_node as logically deleted
        unmarked_node = unmarked(right_node->next);
        Node* marked_node = marked(unmarked_node);
        cas_result = CAS(&right_node->next, unmarked_node, marked_node);
    } while (cas_result != unmarked_node);
    if (!unlink_right(left_node, right_node)) {
        search_and_cleanup(list_head, key, &left_node);
    }
    return True;
}

Boolean unlink_right(Node* left_node, Node* right_node) {
    Node* new_next = unmarked(right_node->next);
    Node* old_right_node = CAS(&left_node->next, right_node, new_next);
    Boolean removed = (old_right_node == right_node);
    if (removed) {
        // Call instead of free
        free_node_later(old_right_node);
    }
    return removed;
}


Algorithm E.3 – QSense on a concurrent linked-list (III).

Node* search_and_cleanup(Node* set_head, Key key, Node** left_node_ref) {
    Node *left_node, *right_node;
retry_search_cleanup:
    left_node = set_head;
    right_node = set_head->next;
    while (True) {
        // Protect nodes by hazard pointers and perform the verification,
        // without the memory barrier
        assign_HP(left_node, 0); assign_HP(right_node, 1);
        if (right_node != left_node->next) {
            goto retry_search_cleanup;
        }
        if (!is_marked(right_node->next)) {
            if (right_node->key >= key) {
                break;
            }
            left_node = right_node;
        } else {
            // Perform cleanup of marked node
            unlink_right(left_node, right_node);
        }
        right_node = unmarked(right_node->next);
    }
    *left_node_ref = left_node;
    return right_node;
}


Bibliography

[1] Ittai Abraham, Gregory Chockler, Idit Keidar, and Dahlia Malkhi. Byzantine disk Paxos: optimal resilience with Byzantine shared memory. Distributed computing (DIST), 18(5), 2006.

[2] Yehuda Afek, Dave Dice, and Adam Morrison. Cache index-aware memory allocation. In ACM International Symposium on Memory Management (ISMM), 2011.

[3] Yehuda Afek, David S. Greenberg, Michael Merritt, and Gadi Taubenfeld. Computing with faulty shared memory. In ACM Symposium on Principles of Distributed Computing (PODC), 1992.

[4] Zahra Aghazadeh, Wojciech Golab, and Philipp Woelfel. Making objects writable. In ACM Symposium on Principles of Distributed Computing (PODC), 2014.

[5] Marcos K Aguilera, Naama Ben-David, Irina Calciu, Rachid Guerraoui, Erez Petrank, and Sam Toueg. Passing messages while sharing memory. In ACM Symposium on Principles of Distributed Computing (PODC), 2018.

[6] Marcos K Aguilera and Svend Frølund. Strict linearizability and the power of aborting. Technical Report HPL-2003-241, HP Labs, 2003.

[7] Dan Alistarh, Patrick Eugster, Maurice Herlihy, Alexander Matveev, and Nir Shavit. StackTrack: An automated transactional approach to concurrent memory reclamation. In European Conference on Computer Systems (EuroSys), 2014.

[8] Dan Alistarh, William M. Leiserson, Alexander Matveev, and Nir Shavit. Threadscan: Automatic and scalable memory reclamation. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2015.

[9] Dan Alistarh, William M. Leiserson, Alexander Matveev, and Nir Shavit. Forkscan: Conservative Memory Reclamation for Modern Operating Systems. In European Conference on Computer Systems (EuroSys), 2017.

[10] Noga Alon, Michael Merritt, Omer Reingold, Gadi Taubenfeld, and Rebecca N Wright. Tight bounds for shared memory systems accessed by Byzantine processes. Distributed computing (DIST), 18(2), 2005.


[11] James H. Anderson and Mark Moir. Universal Constructions for Multi-object Operations. In ACM Symposium on Principles of Distributed Computing (PODC), 1995.

[12] James H. Anderson, Srikanth Ramamurthy, and Rohit Jain. Implementing Wait-free Objects on Priority-based Systems. In ACM Symposium on Principles of Distributed Computing (PODC), 1997.

[13] Anonymous. 1588-2019—IEEE approved draft standard for a precision clock synchronization protocol for networked measurement and control systems. https://standards.ieee.org/content/ieee-standards/en/standard/1588-2019.html.

[14] Order matching system. https://en.wikipedia.org/wiki/Order_matching_system.

[15] Anonymous. When microseconds count: Fast current loop innovation helps motors work smarter, not harder. http://e2e.ti.com/blogs_/b/thinkinnovate/archive/2017/11/14/when-microseconds-count-fast-current-loop-innovation-helps-motors-work-smarter-not-harder.

[16] Andrea Arcangeli, Mingming Cao, Paul E McKenney, and Dipankar Sarma. Using read- copy-update techniques for System V IPC in the Linux 2.5 kernel. In USENIX Annual Technical Conference (ATC), 2003.

[17] Joy Arulraj, Justin Levandoski, Umar Farooq Minhas, and Per-Ake Larson. BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory. In International Conference on Very Large Databases (VLDB), 2018.

[18] ASCYLIB, a concurrent-search data-structure library with over 40 implementations of linked lists, hash tables, skip lists, binary search trees, queues, and stacks. https://github.com/LPD-EPFL/ASCYLIB, 2018.

[19] James Aspnes and Maurice Herlihy. Fast randomized consensus using shared memory. Journal of Algorithms, 11(3), 1990.

[20] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM (JACM), 42(1), 1995.

[21] Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael, and Martin Vechev. Laws of Order: Expensive Synchronization in Concurrent Algorithms Cannot Be Eliminated. In ACM Symposium on Principles of Programming Languages (POPL), 2011.

[22] Hagit Attiya and Eshcar Hillel. Highly concurrent multi-word synchronization. Theoretical Computer Science, 412(12-14), 2011.

[23] Pierre-Louis Aublin, Rachid Guerraoui, Nikola Knežević, Vivien Quéma, and Marko Vukolić. The next 700 BFT protocols. ACM Transactions on Computer Systems (TOCS), 32(4), 2015.


[24] Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. CORFU: A Distributed Shared Log. In ACM Transactions on Computer Systems (TOCS), 2013.

[25] Mahesh Balakrishnan, Aviad Zuck, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, and Tao Zou. Tango: Distributed Data Structures Over a Shared Log. In ACM Symposium on Operating Systems Principles (SOSP), 2013.

[26] Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan. Attack of the killer microseconds. Communications of the ACM, 60(4):48–54, April 2017.

[27] Rida Bazzi and Gil Neiger. Optimally simulating crash failures in a Byzantine environment. In International Workshop on Distributed Algorithms (WDAG), 1991.

[28] Michael Ben-Or. Another advantage of free choice (extended abstract): Completely asynchronous agreement protocols. In ACM Symposium on Principles of Distributed Computing (PODC), 1983.

[29] Ryan Berryhill, Wojciech Golab, and Mahesh Tripunitara. Robust Shared Objects for Non-Volatile Main Memory. In International Conference on Principles of Distributed Systems (OPODIS), 2015.

[30] Ryan Berryhill, Wojciech M. Golab, and Mahesh Tripunitara. Robust shared objects for non-volatile main memory. In International Conference on Principles of Distributed Systems (OPODIS), 2015.

[31] Alysson Neves Bessani, Miguel Correia, Joni da Silva Fraga, and Lau Cheuk Lung. Sharing memory between Byzantine processes using policy-enforced tuple spaces. IEEE Transactions on Parallel and Distributed Systems, 20(3), 2009.

[32] Kenneth P. Birman and Thomas A. Joseph. Reliable communication in the presence of failures. ACM Transactions on Computer Systems (TOCS), 5(1), 1987.

[33] Hans-J. Boehm and Dhruva R. Chakrabarti. Persistence Programming Models for Non-volatile Memory. In ACM International Symposium on Memory Management (ISMM), 2016.

[34] Romain Boichat, Partha Dutta, Svend Frølund, and Rachid Guerraoui. Deconstructing Paxos. ACM SIGACT News, 34(1):47–67, 2003.

[35] Romain Boichat, Partha Dutta, Svend Frolund, and Rachid Guerraoui. Reconstructing Paxos. ACM SIGACT News, 34(2), 2003.

[36] Sol Boucher, Anuj Kalia, and David G. Andersen. Putting the “micro” back in microservice. In USENIX Annual Technical Conference (ATC), 2018.


[37] Zohir Bouzid, Damien Imbs, and . A necessary condition for Byzantine k-set agreement. Information Processing Letters, 116(12), 2016.

[38] Gabriel Bracha. Asynchronous Byzantine agreement protocols. Information and Computation, 75(2), 1987.

[39] Gabriel Bracha and Sam Toueg. Asynchronous consensus and broadcast protocols. Journal of the ACM (JACM), 32(4), 1985.

[40] Anastasia Braginsky, Alex Kogan, and Erez Petrank. Drop the anchor: Lightweight memory management for non-blocking data structures. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2013.

[41] Anastasia Braginsky and Erez Petrank. A Lock-free B+Tree. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2012.

[42] Francisco Brasileiro, Fabíola Greve, Achour Mostéfaoui, and Michel Raynal. Consensus in one communication step. In International Conference on Parallel Computing Technologies, 2001.

[43] Trevor Brown. Reclaiming memory for lock-free data structures: There has to be a better way. In ACM Symposium on Principles of Distributed Computing (PODC), 2015.

[44] Trevor Brown, Faith Ellen, and Eric Ruppert. Pragmatic Primitives for Non-blocking Data Structures. In ACM Symposium on Principles of Distributed Computing (PODC), 2013.

[45] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In USENIX Symposium on Operating System Design and Implementation (OSDI), 2006.

[46] BzTree: a high-performance latch-free range index for non-volatile memory. https://github.com/wangtzh/bztree, 2019.

[47] Christian Cachin, Rachid Guerraoui, and Luís E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2. ed.). Springer, 2011.

[48] Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, and Marcos K. Aguilera. Black-box Concurrent Data Structures for NUMA Architectures. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.

[49] Miguel Castro and . Practical Byzantine fault tolerance. In USENIX Symposium on Operating System Design and Implementation (OSDI), 1999.

[50] Miguel Castro, Rodrigo Rodrigues, and Barbara Liskov. BASE: Using abstraction to improve fault tolerance. ACM Transactions on Computer Systems (TOCS), 21(3), August 2003.


[51] Diego Cepeda, Sakib Chowdhury, Nan Li, Raphael Lopez, Xinzhe Wang, and Wojciech Golab. Toward linearizability testing for multi-word persistent synchronization primitives. In International Conference on Principles of Distributed Systems (OPODIS), 2019.

[52] Dhruva R Chakrabarti, Hans-J Boehm, and Kumud Bhandari. Atlas: Leveraging Locks for Non-volatile Memory Consistency. In ACM International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2014.

[53] Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone. Paxos made live: An engineering perspective. In ACM Symposium on Principles of Distributed Computing (PODC), 2007.

[54] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM (JACM), 43(2), 1996.

[55] Andreas Chatzistergiou, Marcelo Cintra, and Stratis D Viglas. Rewind: Recovery Write-ahead System for In-memory Non-volatile Data-structures. In International Conference on Very Large Databases (VLDB), 2015.

[56] Himanshu Chauhan, Irina Calciu, Vijay Chidambaram, Eric Schkufza, Onur Mutlu, and Pratap Subrahmanyam. NVMove: Helping Programmers Move to Byte-Based Persistence. In Workshop on Interactions of NVM/Flash with Operating Systems and Workloads (INFLOW), 2016.

[57] Shimin Chen and Qin Jin. Persistent B+-trees in Non-volatile Main Memory. In International Conference on Very Large Databases (VLDB), 2015.

[58] Byung-Gon Chun, Petros Maniatis, and Scott Shenker. Diverse replication for single-machine Byzantine-fault tolerance. In USENIX Annual Technical Conference (ATC), 2008.

[59] Byung-Gon Chun, Petros Maniatis, Scott Shenker, and John Kubiatowicz. Attested append-only memory: making adversaries stick to their word. In ACM Symposium on Operating Systems Principles (SOSP), 2007.

[60] Allen Clement, Flavio Junqueira, Aniket Kate, and Rodrigo Rodrigues. On the (limited) power of non-equivocation. In ACM Symposium on Principles of Distributed Computing (PODC), 2012.

[61] Allen Clement, Edmund L. Wong, Lorenzo Alvisi, Michael Dahlin, and Mirco Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2009.

[62] Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert Tappan Morris, and Eddie Kohler. The scalable commutativity rule: designing scalable software for multicore processors. In ACM Symposium on Operating Systems Principles (SOSP), 2013.


[63] Joel Coburn, Adrian M Caulfield, Ameen Akel, Laura M Grupp, Rajesh K Gupta, Ranjit Jhala, and Steven Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Non-volatile Memories. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.

[64] Nachshon Cohen, Michal Friedman, and James R. Larus. Efficient Logging in Non-volatile Memory by Exploiting Coherency Protocols. In ACM International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2017.

[65] Nachshon Cohen, Rachid Guerraoui, and Igor Zablotchi. The Inherent Cost of Remembering Consistently. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2018.

[66] Nachshon Cohen and Erez Petrank. Automatic memory reclamation for lock-free data structures. In ACM International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2015.

[67] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin C. Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In ACM Symposium on Operating Systems Principles (SOSP), 2009.

[68] Miguel Correia, Nuno Ferreira Neves, and Paulo Veríssimo. How to tolerate half less one Byzantine nodes in practical distributed systems. In International Symposium on Reliable Distributed Systems (SRDS), 2004.

[69] Miguel Correia, Giuliana S Veronese, and Lau Cheuk Lung. Asynchronous Byzantine consensus with 2f + 1 processes. In ACM Symposium on Applied Computing (SAC), 2010.

[70] Alexandros Daglis, Mark Sutherland, and Babak Falsafi. RPCValet: NI-driven tail-aware balancing of µs-scale RPCs. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.

[71] Al Danial. cloc: Count lines of code. https://github.com/AlDanial/cloc.

[72] Tudor David, Aleksandar Dragojevic, Rachid Guerraoui, and Igor Zablotchi. Log-Free Concurrent Data Structures. In USENIX Annual Technical Conference (ATC), 2018.

[73] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In ACM Symposium on Operating Systems Principles (SOSP), 2013.

[74] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.


[75] Joel Edward Denny, Seyong Lee, and Jeffrey S. Vetter. Language-Based Optimizations for Persistence on Nonvolatile Main Memory Systems. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.

[76] David Detlefs, Christine H. Flood, Alex Garthwaite, Paul Martin, Nir Shavit, and Guy L. Steele, Jr. Even Better DCAS-Based Concurrent Deques. In International Symposium on Distributed Computing (DISC), 2000.

[77] David L. Detlefs, Paul A. Martin, Mark Moir, and Guy L. Steele, Jr. Lock-free reference counting. In ACM Symposium on Principles of Distributed Computing (PODC), 2001.

[78] Dave Dice, Maurice Herlihy, and Alex Kogan. Fast non-intrusive memory reclamation for highly-concurrent data structures. In ACM International Symposium on Memory Management (ISMM), 2016.

[79] Edsger W. Dijkstra. Solution of a problem in concurrent programming control. Communications of the ACM, 8(9):569, 1965.

[80] Dan Dobre and Neeraj Suri. One-step consensus with zero-degradation. In International Conference on Dependable Systems and Networks (DSN), 2006.

[81] Simon Doherty, David L. Detlefs, Lindsay Groves, Christine H. Flood, Victor Luchangco, Paul A. Martin, Mark Moir, Nir Shavit, and Guy L. Steele, Jr. DCAS is Not a Silver Bullet for Nonblocking Algorithm Design. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2004.

[82] Travis Downs. A benchmark for low-level cpu micro-architectural features. https: //github.com/travisdowns/uarch-bench.

[83] Aleksandar Dragojević, Maurice Herlihy, Yossi Lev, and Mark Moir. On the power of hardware transactional memory to simplify memory management. In ACM Symposium on Principles of Distributed Computing (PODC), 2011.

[84] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2014.

[85] Aleksandar Dragojević, Dushyanth Narayanan, Ed Nightingale, Matthew Renzelmann, Alex Shamis, Anirudh Badam, and Miguel Castro. No compromises: distributed transactions with consistency, availability, and performance. In ACM Symposium on Operating Systems Principles (SOSP), 2015.

[86] Partha Dutta, Rachid Guerraoui, and . How fast can eventual synchrony lead to consensus? In International Conference on Dependable Systems and Networks (DSN), 2005.

[87] , Nancy Lynch, and . Consensus in the presence of partial synchrony. Journal of the ACM (JACM), 35(2), 1988.


[88] Steven D. Feldman, Pierre LaBorde, and Damian Dechev. A Wait-Free Multi-Word Compare-and-Swap Operation. International Journal of Parallel Programming, 43(4), 2015.

[89] Michael J Fischer, Nancy A Lynch, and Michael S Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 1985.

[90] Keir Fraser. Practical lock-freedom. PhD thesis, University of Cambridge Computer Laboratory, 2004.

[91] Michal Friedman, Maurice Herlihy, Virendra J. Marathe, and Erez Petrank. A persistent lock-free queue for non-volatile memory. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2018.

[92] Eli Gafni and Leslie Lamport. Disk Paxos. Distributed computing (DIST), 16(1), 2003.

[93] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delimitrou. An open-source benchmark suite for microservices and their hardware-software implications for cloud & edge systems. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 3–18, April 2019.

[94] Anders Gidenstam, Marina Papatriantafilou, Håkan Sundell, and Philippas Tsigas. Efficient and reliable lock-free memory reclamation based on reference counting. IEEE Trans. Parallel Distrib. Syst., 20(8), 2008.

[95] Guy Golan-Gueta, G. Ramalingam, Mooly Sagiv, and Eran Yahav. Automatic Scalable Atomicity via Semantic Locking. ACM Transactions on Parallel Computing, 3(4), 2017.

[96] Yonatan Gottesman, Joel Nider, Ronen Kat, Yaron Weinsberg, and Michael Factor. Using Storage Class Memory Efficiently for an In-memory Database. In ACM International Systems and Storage Conference (SYSTOR), 2016.

[97] James N Gray. Notes on data base operating systems. In Operating Systems: An Advanced Course, volume 60, pages 393–481. Springer Berlin Heidelberg, Berlin, Heidelberg, 1978.

[98] Michael Greenwald. Non-blocking synchronization and system design. PhD thesis, Stanford University, 1999.

[99] Michael Greenwald. Two-handed emulation: how to build non-blocking implementation of complex data-structures using DCAS. In ACM Symposium on Principles of Distributed Computing (PODC), 2002.

[100] Rachid Guerraoui and Ron R Levy. Robust Emulations of Shared Memory in a Crash-Recovery Model. In IEEE International Conference on Distributed Computing Systems (ICDCS), 2004.


[101] Phuong Hoai Ha and Philippas Tsigas. Reactive Multi-Word Synchronization for Multiprocessors. In International Conference on Parallel Architecture and Compilation (PACT), 2003.

[102] Phuong Hoai Ha and Philippas Tsigas. Reactive Multi-word Synchronization for Multiprocessors. Journal of Instruction-Level Parallelism, 6, 2004.

[103] Tim Harris, James R. Larus, and Ravi Rajwar. Transactional Memory: 2nd Edition. Morgan & Claypool, 2010.

[104] Timothy L Harris. A Pragmatic Implementation of Non-blocking Linked Lists. In International Symposium on Distributed Computing (DISC), 2001.

[105] Timothy L. Harris and Keir Fraser. Language support for lightweight transactions. In ACM International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), 2003.

[106] Timothy L. Harris, Keir Fraser, and Ian A. Pratt. A Practical Multi-word Compare-and-Swap Operation. In International Symposium on Distributed Computing (DISC), 2002.

[107] Thomas E Hart, Paul E McKenney, Angela Demke Brown, and Jonathan Walpole. Performance of memory reclamation for lockless synchronization. Journal of Parallel and Distributed Computing, 67(12), 2007.

[108] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2010.

[109] Maurice Herlihy. Wait-free Synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13(1), 1991.

[110] Maurice Herlihy. A Methodology for Implementing Highly Concurrent Data Objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 15(5), 1993.

[111] Maurice Herlihy, Victor Luchangco, Paul Martin, and Mark Moir. Nonblocking memory management support for dynamic-sized data structures. ACM Transactions on Computer Systems (TOCS), 23(2), 2005.

[112] Maurice Herlihy, Victor Luchangco, and Mark Moir. The repeat offender problem: A mechanism for supporting dynamic-sized, lock-free data structures. In International Symposium on Distributed Computing (DISC), 2002.

[113] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer, III. Software Transactional Memory for Dynamic-sized Data Structures. In ACM Symposium on Principles of Distributed Computing (PODC), 2003.


[114] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free data structures. In International Symposium on Computer Architecture (ISCA), 1993.

[115] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming, Revised Reprint. Morgan Kaufmann Publishers Inc., 1st edition, 2012.

[116] Maurice Herlihy and Jeannette M Wing. Linearizability: A Correctness Condition for Concurrent Objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(3), 1990.

[117] Brandon Holt, James Bornholt, Irene Zhang, Dan R. K. Ports, Mark Oskin, and Luis Ceze. Disciplined inconsistency with consistency types. In Symposium on Cloud Computing (SoCC), 2016.

[118] Terry Ching-Hsiang Hsu, Helge Brügner, Indrajit Roy, Kimberly Keeton, and Patrick Eugster. NVthreads: Practical Persistence for Multi-threaded Applications. In European Conference on Computer Systems (EuroSys), 2017.

[119] Patrick Hunt, Mahadev Konar, Flavio Paiva Junqueira, and Benjamin Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX Annual Technical Conference (ATC), 2010.

[120] Intel. Intel Architecture Instruction Set Extensions Programming Reference. https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf.

[121] Intel. Intel64 and IA-32 Architectures Optimization Reference Manual. https://www-ssl.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf.

[122] Intel. Intel64 and IA-32 Architectures Software Developers Manuals Combined. http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.

[123] Intel. Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined. https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf, 2018.

[124] Amos Israeli and Lihu Rappoport. Disjoint-Access-Parallel Implementations of Strong Shared Memory Primitives. In ACM Symposium on Principles of Distributed Computing (PODC), 1994.

[125] Zsolt István, David Sidler, Gustavo Alonso, and Marko Vukolic. Consensus in a box: Inexpensive coordination in hardware. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[126] Joseph Izraelevitz, Terence Kelly, and Aasheesh Kolli. Failure-atomic Persistent Memory Updates via JUSTDO Logging. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.

[127] Joseph Izraelevitz, Hammurabi Mendes, and Michael L Scott. Linearizability of Persistent Memory Objects Under a Full-System-Crash Failure Model. In International Symposium on Distributed Computing (DISC), 2016.

[128] Prasad Jayanti, Tushar Deepak Chandra, and Sam Toueg. Fault-tolerant wait-free shared objects. Journal of the ACM (JACM), 45(3), 1998.

[129] Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weijia Song, Edward Tremel, Robbert Van Renesse, Sydney Zink, and Kenneth P. Birman. Derecho: Fast state machine replication for cloud services. ACM Transactions on Computer Systems (TOCS), 36(2), 2019.

[130] Xin Jin, Xiaozhou Li, Haoyu Zhang, Nate Foster, Jeongkeun Lee, Robert Soulé, Changhoon Kim, and Ion Stoica. Netchain: Scale-free sub-rtt coordination. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2018.

[131] Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter RPCs can be general and fast. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.

[132] Anuj Kalia, Michael Kaminsky, and David G Andersen. Using RDMA efficiently for key-value services. ACM SIGCOMM Computer Communication Review, 44(4), 2015.

[133] Anuj Kalia, Michael Kaminsky, and David G Andersen. Design guidelines for high performance RDMA systems. In USENIX Annual Technical Conference (ATC), 2016.

[134] Anuj Kalia, Michael Kaminsky, and David G Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In USENIX Symposium on Operating System Design and Implementation (OSDI), 2016.

[135] Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon Kuhnle, Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus Stengel. CheapBFT: resource-efficient Byzantine fault tolerance. In European Conference on Computer Systems (EuroSys), 2012.

[136] Antonios Katsarakis, Vasilis Avrielatos, M R Siavash Katebzadeh, Arpit Joshi, Aleksandar Dragojević, Boris Grot, and Vijay Nagarajan. Hermes: A fast, fault-tolerant and linearizable replication protocol. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020.

[137] Idit Keidar and Sergio Rajsbaum. On the cost of fault-tolerant consensus when there are no faults: preliminary version. ACM SIGACT News, 32(2), 2001.

[138] Wook-Hee Kim, Jinwoong Kim, Woongki Baek, Beomseok Nam, and Youjip Won. NVWAL: Exploiting NVRAM in Write-ahead Logging. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.

[139] Alex Kogan and Erez Petrank. Wait-free Queues with Multiple Enqueuers and Dequeuers. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2011.

[140] Marios Kogias and Edouard Bugnion. HovercRaft: Achieving scalability and fault-tolerance for microsecond-scale datacenter services. In European Conference on Computer Systems (EuroSys), 2020.

[141] Aasheesh Kolli, Steven Pelley, Ali Saidi, Peter M Chen, and Thomas F Wenisch. High-performance Transactions for Persistent Memories. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.

[142] Ramakrishna Kotla, Lorenzo Alvisi, Mike Dahlin, Allen Clement, and Edmund Wong. Zyzzyva: speculative Byzantine fault tolerance. In ACM Symposium on Operating Systems Principles (SOSP), 2007.

[143] Klaus Kursawe. Optimistic Byzantine agreement. In International Symposium on Reliable Distributed Systems (SRDS), 2002.

[144] Edya Ladan-Mozes, I-Ting Angelina Lee, and Dmitry Vyukov. Location-based memory fences. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2011.

[145] Leslie Lamport. The weak Byzantine generals problem. Journal of the ACM (JACM), 30(3), 1983.

[146] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2), 1998.

[147] Leslie Lamport. Paxos made simple. ACM SIGACT News, 32(4), 2001.

[148] Leslie Lamport. Generalized consensus and Paxos. Technical Report MSR-TR-2005-33, Microsoft Research, 2005.

[149] Leslie Lamport. Fast Paxos. Distributed computing (DIST), 19(2), 2006.

[150] Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3), 1982.

[151] Butler W Lampson. How to build a highly available system using consensus. In Interna- tional Workshop on Distributed Algorithms (WDAG), 1996.

[152] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting phase change memory as a scalable DRAM alternative. In International Symposium on Computer Architecture (ISCA), 2009.

[153] Se Kwon Lee, K Hyun Lim, Hyunsub Song, Beomseok Nam, and Sam H Noh. WORT: Write Optimal Radix Tree for Persistent Memory Storage Systems. In USENIX Conference on File and Storage Technologies (FAST), 2017.

[154] Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish. Improving availability in distributed systems with failure informers. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2013.

[155] Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish. Taming uncertainty in distributed systems with help from the network. In European Conference on Computer Systems (EuroSys), 2015.

[156] Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, and Michael Walfish. Detecting failures in distributed systems with the FALCON spy network. In ACM Symposium on Operating Systems Principles (SOSP), 2011.

[157] Mohsen Lesani, Todd Millstein, and Jens Palsberg. Automatic Atomicity Verification for Clients of Concurrent Data Structures. In International Conference on Computer Aided Verification (CAV), 2014.

[158] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. The Bw-Tree: A B-tree for New Hardware Platforms. In IEEE International Conference on Data Engineering (ICDE), 2013.

[159] Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. Socksdirect: datacenter sockets can be fast and compatible. In ACM Conference on SIGCOMM, 2019.

[160] Jialin Li, Ellis Michael, Naveen Kr Sharma, Adriana Szekeres, and Dan RK Ports. Just say NO to Paxos overhead: Replacing consensus with network ordering. In USENIX Symposium on Operating System Design and Implementation (OSDI), 2016.

[161] Liquibook. https://github.com/enewhuis/liquibook. Accessed 2020-05-25.

[162] Jiuxing Liu, Jiesheng Wu, and Dhabaleswar K. Panda. High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 32(3), 2004.

[163] Mengxing Liu, Mingxing Zhang, Kang Chen, Xuehai Qian, Yongwei Wu, Weimin Zheng, and Jinglei Ren. DUDETM: Building Durable Transactions with Decoupling for Persistent Memory. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.

[164] Ming Liu, Tianyi Cui, Henry Schuh, Arvind Krishnamurthy, Simon Peter, and Karan Gupta. Offloading distributed applications onto smartNICs using iPipe. In ACM Conference on SIGCOMM, 2019.

[165] Robert Love. Linux System Programming: Talking Directly to the Kernel and C Library, 2nd Edition. O’Reilly Media, Inc., 2013.

[166] Victor Luchangco, Mark Moir, and Nir Shavit. Nonblocking k-compare-single-swap. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2003.

[167] Dahlia Malkhi, Michael Merritt, Michael K Reiter, and Gadi Taubenfeld. Objects shared by Byzantine processes. Distributed computing (DIST), 16(1), 2003.

[168] Jean-Philippe Martin and Lorenzo Alvisi. Fast Byzantine consensus. IEEE Transactions on Dependable and Secure Computing (TDSC), 3(3), 2006.

[169] David Mazieres. Paxos made practical. https://www.scs.stanford.edu/~dm/home/papers/paxos.pdf, 2007.

[170] Paul E McKenney. Memory barriers: a hardware view for software hackers, 2010.

[171] Paul E. McKenney. Is parallel programming hard, and, if so, what can you do about it?, 2017.

[172] Paul E McKenney, Dipankar Sarma, Andrea Arcangeli, Andi Kleen, Orran Krieger, and Rusty Russell. Read copy update. In Ottawa Linux Symposium, 2002.

[173] Paul E. McKenney and John D. Slingwine. Read-Copy Update: Using Execution History to Solve Concurrency Problems. In International Conference on Parallel and Distributed Computing and Systems (PDCS), 1998.

[174] Paul E McKenney and John D Slingwine. Read-copy update: Using execution history to solve concurrency problems. In International Conference on Parallel and Distributed Computing and Systems (PDCS), 1998.

[175] Amirsaman Memaripour, Anirudh Badam, Amar Phanishayee, Yanqi Zhou, Ramnatthan Alagappan, Karin Strauss, and Steven Swanson. Atomic In-place Updates for Non-volatile Main Memories with Kamino-Tx. In European Conference on Computer Systems (EuroSys), 2017.

[176] Memcached. http://www.memcached.org.

[177] Maged M. Michael. High performance dynamic lock-free hash tables and list-based sets. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2002.

[178] Maged M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans. Parallel Distrib. Syst., 15(6), 2004.

[179] Maged M Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Trans. Parallel Distrib. Syst., 15(6), 2004.

[180] Maged M. Michael and Michael L. Scott. Simple, Fast, and Practical Non-blocking and Blocking Concurrent Queue Algorithms. In ACM Symposium on Principles of Distributed Computing (PODC), 1996.

[181] Micron. 3D XPoint Technology. https://www.micron.com/about/our-innovation/3d-xpoint-technology.

[182] Mark Moir. Transparent Support for Wait-Free Transactions. In International Workshop on Distributed Algorithms (WDAG), 1997.

[183] Adam Morrison and Yehuda Afek. Temporally bounding TSO for fence-free asymmetric synchronization. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.

[184] Aravind Natarajan and Neeraj Mittal. Fast concurrent lock-free binary search trees. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2014.

[185] Gil Neiger and Sam Toueg. Automatically increasing the fault-tolerance of distributed algorithms. Journal of Algorithms, 11(3), 1990.

[186] Stanko Novakovic, Yizhou Shan, Aasheesh Kolli, Michael Cui, Yiying Zhang, Haggai Eran, Boris Pismenny, Liran Liss, Michael Wei, Dan Tsafrir, and Marcos K. Aguilera. Storm: a fast transactional dataplane for remote data structures. In ACM International Systems and Storage Conference (SYSTOR), 2019.

[187] Diego Ongaro and John K Ousterhout. In search of an understandable consensus algorithm. In USENIX Annual Technical Conference (ATC), 2014.

[188] The OpenMP API specification for parallel programming. https://www.openmp.org/, 2019.

[189] Ismail Oukid, Johan Lasperas, Anisoara Nica, Thomas Willhalm, and Wolfgang Lehner. FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-tree for Storage Class Memory. In International Conference on Management of Data (SIGMOD), 2016.

[190] Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, and Hari Balakrishnan. Shenango: Achieving high CPU efficiency for latency-sensitive datacenter workloads. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.

[191] Scott Owens, Susmit Sarkar, and Peter Sewell. A better x86 memory model: X86-TSO. In International Conference on Theorem Proving in Higher Order Logics (TPHOLs), 2009.

[192] Seo Jin Park and John Ousterhout. Exploiting commutativity for practical fast replication. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.

[193] Matej Pavlovic, Alex Kogan, Virendra J. Marathe, and Tim Harris. Persistent Multi-Word Compare-and-Swap. In ACM Symposium on Principles of Distributed Computing (PODC), 2018.

[194] Marshall Pease, Robert Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. Journal of the ACM (JACM), 27(2), 1980.

[195] Fernando Pedone and André Schiper. Handling message semantics with generic broadcast protocols. Distributed computing (DIST), 15(2), 2002.

[196] Steven Pelley, Peter M. Chen, and Thomas F. Wenisch. Memory persistency: Semantics for byte-addressable nonvolatile memory technologies. IEEE Micro, 35(3), 2015.

[197] Benchmarking framework for index structures on persistent memory. https://github.com/wangtzh/pibench, 2019.

[198] Persistent multi-word compare-and-swap (PMwCAS) for NVRAM. https://github.com/microsoft/pmwcas, 2019.

[199] Marius Poke and Torsten Hoefler. DARE: High-performance state machine replication on RDMA networks. In Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2015.

[200] Manuel Pöter and Jesper Larsson Träff. Stamp-it, amortized constant-time memory reclamation in comparison to five other schemes. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2018.

[201] Mary C. Potter, Brad Wyble, Carl Erick Hagmann, and Emily Sarah McCourt. Detecting meaning in RSVP at 13 ms per picture. Attention, Perception, & Psychophysics, 76(2):270–279, 2014.

[202] George Prekas, Marios Kogias, and Edouard Bugnion. ZygOS: Achieving low tail latency for microsecond-scale networked tasks. In ACM Symposium on Operating Systems Principles (SOSP), 2017.

[203] George Prekas, Mia Primorac, Adam Belay, Christos Kozyrakis, and Edouard Bugnion. Energy proportionality and workload consolidation for latency-critical applications. In Symposium on Cloud Computing (SoCC), 2015.

[204] Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ousterhout. Arachne: Core-aware thread management. In USENIX Symposium on Operating System Design and Implementation (OSDI), 2018.

[205] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using phase-change memory technology. In International Symposium on Computer Architecture (ISCA), 2009.

[206] Redis. https://redis.io/. Accessed 2020-05-25.

[207] David Reinsel, John Gantz, and John Rydning. The digitization of the world from edge to core, 2018. https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf.

[208] Stephen Mathew Rumble. Memory and object management in RAMCloud. PhD thesis, Stanford University, 2014.

[209] Signe Rüsch, Ines Messadi, and Rüdiger Kapitza. Towards low-latency Byzantine agreement protocols using RDMA. In IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2018.

[210] David Schneider. The microsecond market. IEEE Spectrum, 49(6), June 2012.

[211] Fred B Schneider. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys (CSUR), 22(4):299–319, 1990.

[212] Ohad Shacham, Eran Yahav, Guy Golan Gueta, Alex Aiken, Nathan Bronson, Mooly Sagiv, and Martin Vechev. Verifying Atomicity via Data Independence. In International Symposium on Software Testing and Analysis (ISSTA), 2014.

[213] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. Conflict-free replicated data types. In International Symposium on Stabilization, Safety, and Security of Distributed Systems (SSS), October 2011.

[214] Nir Shavit and Dan Touitou. Software Transactional Memory. In ACM Symposium on Principles of Distributed Computing (PODC), 1995.

[215] Dale Skeen. Nonblocking commit protocols. In International Conference on Management of Data (SIGMOD), pages 133–142, 1981.

[216] SNIA. Extending RDMA for persistent memory over fabrics. https://www.snia.org/sites/default/files/ESF/Extending-RDMA-for-Persistent-Memory-over-Fabrics-Final.pdf.

[217] Yee Jiun Song and Robbert van Renesse. Bosco: One-step Byzantine asynchronous consensus. In International Symposium on Distributed Computing (DISC), 2008.

[218] Dmitri B Strukov, Gregory S Snider, Duncan R Stewart, and R Stanley Williams. The Missing Memristor Found. Nature, 453(7191), 2008.

[219] Håkan Sundell. Wait-Free Multi-Word Compare-and-Swap Using Greedy Helping and Grabbing. International Journal of Parallel Programming, 39(6), 2011.

[220] Shahar Timnat, Maurice Herlihy, and Erez Petrank. A Practical Transactional Memory Interface. In International European Conference on Parallel and Distributed Computing (Euro-Par), 2015.

[221] Shahar Timnat and Erez Petrank. A Practical Wait-free Simulation for Lock-free Data Structures. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2014.

[222] John D. Valois. Lock-free linked lists using compare-and-swap. In ACM Symposium on Principles of Distributed Computing (PODC), 1995.

[223] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, Lau Cheuk Lung, and Paulo Veríssimo. Efficient Byzantine fault-tolerance. IEEE Trans. Computers, 62(1), 2013.

[224] Haris Volos, Andres Jaan Tack, and Michael M Swift. Mnemosyne: Lightweight Persistent Memory. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.

[225] Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, and Heming Cui. APUS: Fast and scalable Paxos on RDMA. In Symposium on Cloud Computing (SoCC), 2017.

[226] Tianzheng Wang, Justin Levandoski, and Per-Ake Larson. Easy Lock-Free Indexing in Non-Volatile Memory. In IEEE International Conference on Data Engineering (ICDE), 2018.

[227] Xingda Wei, Jiaxin Shi, Yanzhe Chen, Rong Chen, and Haibo Chen. Fast in-memory transaction processing using RDMA and HTM. In ACM Symposium on Operating Systems Principles (SOSP), 2015.

[228] Haosen Wen, Joseph Izraelevitz, Wentao Cai, H. Alan Beadle, and Michael L. Scott. Interval-based memory reclamation. In ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2018.

[229] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems. In USENIX Conference on File and Storage Technologies (FAST), 2015.

[230] Yiying Zhang and Steven Swanson. A study of application performance with non-volatile main memory. In Symposium on Mass Storage Systems and Technologies (MSST), 2015.

Curriculum Vitae

Igor Zablotchi

E-mail: [email protected] Phone: +41789419236

Research Areas state-machine replication, Byzantine fault tolerance, RDMA, non-volatile memory, concurrent data structures, concurrent memory reclamation

Education

2020 PhD in Computer Science, EPFL • Thesis: Distributed Computing with Modern Shared Memory • Advisor: Rachid Guerraoui

2015 MSc in Computer Science, EPFL • Thesis: SMR-NoMembar — Eliminating Memory Barriers from Hazard Pointers • Supervisors: Maurice Herlihy, Rachid Guerraoui • GPA: 5.91/6 — Ranked 2/89 in CS Section

2012 BSc in Computer Science, EPFL • Thesis: Crossing Flocks of Agents Using Virtual Leaders • Supervisor: Denis Gillet • GPA: 5.83/6 — Ranked 1/77 in CS Section and 3/769 in EPFL overall

Internships & Exchanges

2019 Microsoft Research, Cambridge, UK — Research Internship • Supervised by Aleksandar Dragojević • Topic: hardware-software concurrent data structures • Programming language: C++

2018 Oracle Labs, Burlington, MA — Research Internship • Supervised by Virendra Marathe • Topic: fast RDMA-based consensus protocols, efficient multi-word compare-and-swap • Programming language: C++

2016 VMware Research, Palo Alto, CA — Research Internship • Supervised by Dahlia Malkhi and Ittai Abraham • Topic: transactions across blockchain ledgers with atomicity, fairness and expressiveness

2015 EPFL, Distributed Programming Laboratory — Research Internship • Supervised by Vasileios Trigonakis and Rachid Guerraoui • Topic: fast concurrent persistent key-value store • Programming languages: C, C++

2014–2015 Brown University — MSc Thesis Project (exchange semester) • Supervised by Maurice Herlihy • Topic: fast and robust concurrent memory reclamation • Programming language: C

2013 ABB Research Switzerland — Research Internship • Supervised by Ettore Ferranti and Yvonne-Anne Pignolet • Topics: building automation, brain-computer interface, domain-specific languages • Languages & tools: Python, Matlab, Xtext, OpenVIBE, Emotiv EPOC

Research Output

Peer-reviewed Conference Papers [author names in alphabetical order]

2019 The Impact of RDMA on Agreement. PODC ’19. Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe and Igor Zablotchi.

2018 Log-Free Concurrent Data Structures. USENIX ATC ’18. Tudor David, Aleksandar Dragojević, Rachid Guerraoui and Igor Zablotchi.

2018 The Inherent Cost of Remembering Consistently. SPAA ’18. Nachshon Cohen, Rachid Guerraoui and Igor Zablotchi.

2017 FloDB: Unlocking Memory in Persistent Key-Value Stores. EuroSys ’17. Oana Balmau, Rachid Guerraoui, Vasileios Trigonakis and Igor Zablotchi.

2017 The Disclosure Power of Shared Objects. NETYS ’17. Peva Blanchard, Rachid Guerraoui, Julien Stainer and Igor Zablotchi.

2016 Fast and Robust Memory Reclamation for Concurrent Data Structures. SPAA ’16. Oana Balmau, Rachid Guerraoui, Maurice Herlihy and Igor Zablotchi.

Conference Papers under Submission [author names in alphabetical order]

2020 Microsecond Consensus for Microsecond Applications. Marcos K. Aguilera, Naama Ben-David, Rachid Guerraoui, Virendra J. Marathe, Athanasios Xygkis and Igor Zablotchi.

2020 Efficient Multi-word Compare and Swap. Rachid Guerraoui, Alex Kogan, Virendra J. Marathe and Igor Zablotchi.

2020 Leaderless Consensus. Karolos Antoniadis, Vincent Gramoli, Rachid Guerraoui, Eric Ruppert and Igor Zablotchi.

Conference Presentations

2018 The Inherent Cost of Remembering Consistently. SPAA ’18

2017 The Disclosure Power of Shared Objects. NETYS ’17.

2016 Fast and Robust Memory Reclamation for Concurrent Data Structures. SPAA ’16.

Conference Posters

2018 Log-Free Concurrent Data Structures. USENIX ATC ’18

2017 FloDB: Unlocking Memory in Persistent Key-Value Stores. EuroSys ’17

Languages

English & French — fluent • Romanian — native language

Honors & Awards

2019 EPFL IC Teaching Assistant Award
2015 EPFL PhD Fellowship
2015 Société Suisse d’Informatique Prize — for achieving 2nd highest GPA in EPFL CS MSc Program
2012 EPFL MSc Excellence Fellowship
2012 EPFL Prize — for achieving 3rd highest GPA in 2012 graduating class

Teaching

Teaching Assistant

2016-2020 Concurrent Algorithms. Graduate class. EPFL
2019 Information Security and Privacy. Graduate class. EPFL
2017 Digital System Design. Undergraduate class. EPFL
2016 Practice of Object-Oriented Programming. Undergraduate class. EPFL

Student Assistant

2014-2015 Natural Language Processing. Graduate class. EPFL
2010-2014 Discrete Mathematics, Calculus, Linear Algebra. Undergraduate classes. EPFL

Mentoring

2019-2020 Loïc Vandenberghe and Manuel Vidigueira. Fast RDMA Consensus. MSc Semester Project. EPFL
2018 Ivi Dimopoulou. Implementation and Evaluation of 1-Fence Concurrent Persistent Data Structures. MSc Semester Project. EPFL

Professional Service

External Reviewer

2019, 2017 DISC (International Symposium on Distributed Computing)
2017 IPDPS (International Parallel and Distributed Processing Symposium)
2015 SPAA (Symposium on Parallelism in Algorithms and Architectures)
