Efficient Remote Procedure Calls for Datacenters
Anuj Kalia
CMU-CS-19-126
September 2019

Computer Science Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
David G. Andersen, Chair
Justine Sherry
Michael Kaminsky
Miguel Castro, Microsoft Research, Cambridge

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2019 Anuj Kalia

This research was sponsored by the National Science Foundation under grant numbers CNS-1314721 and CNS-1700521, Intel ISTC-CC, Intel ISTC-VCC, and a Facebook Fellowship. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Keywords: datacenter networks, distributed systems, Remote Procedure Calls, Remote Direct Memory Access, eRPC

To my father, Darshan Lal Kalia, for getting me here

Abstract

Datacenter network latencies are approaching their microsecond-scale speed-of-light limit, and network bandwidths continue to grow beyond 100 Gbps. These improvements call for rethinking the design of communication-intensive distributed systems for datacenters, whose performance has historically been limited by slow networks. With the slowing of Moore's law, a popular approach is to redesign distributed systems to use network hardware devices and technologies that offload communication or data access from commodity CPUs, such as smart network cards (NICs), lossless networks, programmable NICs, and programmable switches.

In this dissertation, we show that we can continue to use end-to-end, software-only communication mechanisms to build high-performance distributed systems, i.e., we bring the speed of fast networks to distributed systems without an expensive redesign around in-network hardware offloads. We show that the ubiquitous Remote Procedure Call (RPC) communication mechanism, when rearchitected specially for the capabilities of modern commodity datacenter hardware, is a fast, scalable, flexible, and simple communication choice for distributed systems.

We make three contributions. First, we present a detailed analysis of datacenter communication hardware—ranging from the peripheral bus that connects CPUs to NICs, to the datacenter's switched network—that informs our choice of the communication mechanism. Second, we lay out the advantages of RPCs over network hardware offloads through the design and evaluation of two new systems: a key-value store called HERD, and a distributed transaction processing system called FaSST. Third, we combine the lessons learned from the first two steps with new insights about datacenter packet loss and congestion control to create a new RPC library called eRPC, and show that existing distributed system codebases perform well over eRPC. In many cases, these systems substantially outperform offloads because they use less communication, and their end-to-end design provides flexibility and simplicity.

Acknowledgments

Around this time six years ago, I burned the midnight oil learning Paxos in an attempt to convince David Andersen and Michael Kaminsky to take me under their wing as a student. I still don't understand Paxos, but my effort paid off immensely. I have learned a tremendous amount from Dave, from executing a research vision, to ethics and the value of doing the right thing, to correctly formatting i++.
Dave gave me freedom to chart my own course, while still providing close guidance and supporting my decisions throughout. Thanks, Dave, for making this the amazing journey that it was.

I am grateful to Michael for his consistency in mentorship, collaboration, and support. Michael taught me by example that the perfect is the enemy of the good, and that the answer to many questions in research and in life is "it depends." At a time when I was looking for the perfect thesis topic, Michael encouraged me to commit to datacenter networks research. I am fortunate to have followed his advice. Thanks for everything, Michael.

My debt to Miguel Castro and Justine Sherry goes beyond them serving on my thesis committee. Miguel's feedback on my work was crucial to the FaSST and eRPC projects. Justine provided invaluable help during my job search, especially by single-handedly rescuing my job talk.

I am also grateful to Garth Gibson, Dushyanth Narayanan, Vyas Sekar, Amar Phanishayee, Nathan Beckmann, Brandon Lucia, and Rashmi Vinayak for their help and guidance at various phases in my PhD. Joe Moore's generosity in providing access to a cluster at NetApp was invaluable. The administrative aspects of the PhD program were easy thanks to Deb Cavlovich, Angy Malloy, and Karen Lindenfelser.

Members and friends of the FAWN group—Iulian Moraru, Hyeontaek Lim, Dong Zhou, Huanchen Zhang, Conglong Li, Sol Boucher, Thomas Kim, Angela Jiang, Chris Canel, Giulio Zhou, and Daniel Wong—created a fun, cooperative, and helpful research environment. In particular, I learned the art of building and evaluating systems from Hyeontaek.

I have been lucky to have many close friends. Alok, Rohan, Ojasvi and Rishika, and Deepak and Shikha: thanks for providing a home away from home, and for the many adventures. Deepak Vasisht: I would not have done a PhD if random luck had not made us roommates ten years ago. Thanks for the friendship, help, and competition, and I hope our paths will cross once again. My friends in Pittsburgh made this journey so much more enjoyable. Thanks to Yang Jiao for all the support, and for introducing me to good food. Jack Kosaian's magnanimity as a person, friend, and officemate made me finally stop working from home. Angela Jiang changed my worldview in the short span of a few months. Thanks, Angela, for being the coolest person ever.

I am grateful for the constant love and support of my family. My greatest debt is towards my mother and father, who made many great sacrifices to help me succeed. My father made it a mission in his life to see me get the best education, and towards this goal he left no stone unturned. For all the times he went to bat for me, this thesis is dedicated to him.

Contents

1 Introduction
  1.1 Thesis contributions and outline
  1.2 Distributed systems performance from a speed-of-light perspective
  1.3 An end-to-end design for high flexibility
  1.4 Scalability: A silicon power consumption argument
  1.5 A fast and general-purpose design for simplicity
  1.6 Evolution of our RPC designs

2 Background
  2.1 Modern datacenter networks
    2.1.1 Datacenter network hardware
    2.1.2 Userspace networking
    2.1.3 In-network offloads for distributed systems
  2.2 Communication-intensive distributed applications
    2.2.1 Main-memory key-value stores
    2.2.2 Distributed transaction processing
    2.2.3 State machine replication
    2.2.4 Common application workload characteristics
  2.3 Evaluation clusters
  2.4 Open-source code

3 Guidelines for use of modern high-speed NICs
  3.1 A review of PCI Express
    3.1.1 PCIe headers
    3.1.2 Memory-mapped I/O and Direct Memory Access
  3.2 How modern NICs work
  3.3 RDMA terminology
    3.3.1 RDMA verbs
    3.3.2 RDMA queue pairs
    3.3.3 RDMA transport types
  3.4 Preface to the guidelines
  3.5 Guidelines for NICs with transport-layer offload
    3.5.1 Prefer application-level ACKs over transport-level ACKs
    3.5.2 Avoid storing connection state on NICs
  3.6 Reduce PCIe traffic
    3.6.1 Measurement method: PCIe counters on commodity CPUs
    3.6.2 Reduce CPU-initiated MMIOs
    3.6.3 Reduce NIC-initiated DMAs
  3.7 Guidelines based on NIC architecture
    3.7.1 Engage multiple NIC processing units
    3.7.2 Avoid contention among NIC processing units
    3.7.3 Avoid NIC cache misses
  3.8 Related work
  3.9 Conclusion

4 Case study 1: HERD – An RPC-based key-value store
  4.1 Introduction
  4.2 Background
    4.2.1 Recent research on key-value stores
    4.2.2 One-sided RDMA–based key-value stores
  4.3 Design decisions
    4.3.1 Notation and experimental setup
    4.3.2 Constructing a fast RPC primitive
    4.3.3 Using datagram transport for responses
  4.4 Design of HERD
    4.4.1 HERD's key-value data structure
    4.4.2 Masking DRAM latency with prefetching
    4.4.3 Request format and handling
    4.4.4 Response format and