
Adopting Zoned Storage in Distributed Storage Systems

Abutalib Aghayev

CMU-CS-20-130
August 2020

Computer Science Department
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
George Amvrosiadis, Chair
Gregory R. Ganger
Garth A. Gibson
Peter J. Desnoyers, Northeastern University
Remzi H. Arpaci-Dusseau, University of Wisconsin–Madison
Sage A. Weil, Red Hat, Inc.

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2020 Abutalib Aghayev

This research was sponsored by the Los Alamos National Laboratory award number 394903; by a Hima and Jive Fellowship in Computer Science for International Students; and by a Carnegie Mellon University Parallel Data Lab (PDL) Entrepreneurship Fellowship. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government, or any other entity.

Keywords: file systems, distributed storage systems, shingled magnetic recording, zoned namespaces, zoned storage, hard disk drives, solid-state drives

Abstract

Hard disk drives and solid-state drives are the workhorses of modern storage systems. For the past several decades, storage systems software has communicated with these drives using the block interface. The block interface was introduced early on with hard disk drives, and as a result, almost every storage system in use today was built for the block interface. Therefore, when flash memory based solid-state drives recently became viable, the industry chose to emulate the block interface on top of flash memory by running a translation layer inside the solid-state drives. This translation layer was necessary because the block interface was not a direct fit for the flash memory. More recently, hard disk drives are shifting to shingled magnetic recording, which increases capacity but also violates the block interface. Thus, emerging hard disk drives are also emulating the block interface by running a translation layer inside the drive. Emulating the block interface using a translation layer, however, is becoming a source of significant performance and cost problems in distributed storage systems.

In this dissertation, we argue for the elimination of the translation layer—and consequently the block interface. We propose adopting the emerging zone interface instead—a natural fit for both high-capacity hard disk drives and solid-state drives—and rewriting the storage backend component of distributed storage systems to use this new interface. Our thesis is that adopting the zone interface using a special-purpose storage backend will improve the cost-effectiveness of data storage and the predictability of performance in distributed storage systems.

We provide the following evidence to support our thesis. First, we introduce Skylight, a novel technique to reverse engineer the translation layers of modern hard disk drives, and demonstrate the high garbage collection overhead of the translation layers. Second, based on the insights from Skylight, we develop ext4-lazy, an extension of the popular ext4 file system, which is used as a storage backend in many distributed storage systems. Ext4-lazy significantly improves performance over ext4 on hard disk drives with a translation layer, but it also shows that in the presence of a translation layer it is hard to achieve the full potential of a drive with evolutionary file system changes. Third, we show that even in the absence of a translation layer, the abstractions provided by general-purpose file systems such as ext4 are inappropriate for a storage backend. To this end, we study a decade-long evolution of a widely used distributed storage system, Ceph, and pinpoint technical reasons that render general-purpose file systems unfit for a storage backend. Fourth, to show the advantage of a special-purpose storage backend in adopting the zone interface, as well as the advantages of the zone interface itself, we extend BlueStore—Ceph's special-purpose storage backend—to work on zoned devices. As a result of this work, we demonstrate how having a special-purpose backend in Ceph enables quick adoption of the zone interface, how the zone interface eliminates in-device garbage collection when running RocksDB (a key-value database used for storing metadata in BlueStore), and how the zone interface enables Ceph to reduce tail latency and increase the cost-effectiveness of data storage without sacrificing performance.

Acknowledgments

My doctoral journey has been far from ordinary: it spanned two schools, five advisors, and more years than I care to admit. In hindsight, it was a blessing in disguise—I met many great people, some of whom became lifelong friends. Below, I will try to acknowledge them, and I apologize if I miss anyone.

I am grateful to George Amvrosiadis and Greg Ganger for agreeing to advise me at a critical stage of my studies and for helping me cross the finish line. It was a packed two years of work, but George and Greg's laid-back and cheerful attitude made them bearable and fun. I thank Garth Gibson and Eric Xing for agreeing to advise me upon my arrival at CMU and for introducing me to the field of machine learning systems. Garth was instrumental in my coming to CMU, and I received his unwavering support throughout the years, for which I am grateful. I am grateful to Peter Desnoyers for advising me at Northeastern, for supporting my decision to reapply to graduate schools late into my studies, and for always being there for me. Remzi Arpaci-Dusseau has been my unofficial advisor since that lucky day when I first met him at SOSP '13. I am grateful for his continuous support inside and outside of academia and for serving on my thesis committee. While at CMU, I also briefly worked with Phil Gibbons and Ehsan Totoni, and I am grateful to them for the opportunity.

I was also fortunate to have great collaborators from industry. Sage Weil and Theodore Ts'o are two super hackers, respectively responsible for leading the development of a distributed storage system and a file system used by millions. Yet they found time to meet with me regularly, to patiently answer my questions, and to discuss research directions. Towards the end of my studies, I asked Sage to serve on my thesis committee, and he kindly agreed. I am grateful to Sage and Ted for their support. Matias Bjørling turned my request for his slides into an internship, and thus started our collaboration on zoned storage, which turned out to be crucial to my dissertation. Tim Feldman thoroughly answered my questions as I was getting started with my research on shingled magnetic recording disks. I am grateful to Matias, Tim, and many other brilliant engineers and hackers, including Samuel Just, Kefu Chai, Igor Fedotov, Mark Nelson, Hans Holmberg, Damien Le Moal, Marc Acosta, Mark Callaghan, and Siying Dong, whose answers to my countless queries greatly contributed to my technical knowledge base.

As a member of the Parallel Data Lab (PDL) I benefited immensely from the opportunities it provided, and I am therefore grateful to all who make the PDL a great lab to be a part of. Thank you Jason Boles, Chad Dougherty, Mitch Franzos, and Chuck Cranor for keeping the PDL clusters running and for fixing my never-ending technical problems. Thank you Joan Digney for always polishing my slides and posters and proofreading my papers. Thank you Karen Lindenfelser, Bill Courtright, and Greg Ganger for organizing the wonderful PDL events and for keeping it all running, and Garth Gibson, for founding the PDL. Thank you to all the organizations that contribute to the PDL, and thank you Hugo Patterson for the PDL Entrepreneurship Fellowship. I am thankful to the staff of the Computer Science Department (CSD), including Deb Cavlovich for her help with the administrative aspects of the doctoral process, and Catherine Copetas for her help with organizing CSD events. I am thankful to the anonymous donor who established the Hima and Jive Fellowship in Computer Science for International Students.

During my studies I made many friends. It was a privilege to share an office with Samuel Yeom, from whom I learned so much, ranging from the finer details of the U.S. political system and why Ruth Bader Ginsburg is not retiring yet, to countless obscure facts, thanks to our daily ritual of solving The New York Times crossword after lunch. (To be fair, Sam would mostly solve it and I would try to catch up and feel elated when contributing an answer or two.) Michael Kuchnik, Jin Kyu Kim, and Aurick Qiao were the best collaborators one could wish for: smart, humble, hardworking, and fun. Saurabh Kadekodi and Jinliang Wei showed me the ropes around the campus and the department when I first arrived at CMU. As I was preparing to leave CMU, Anuj Kalia shared with me his experience of navigating the academic job market and gave me useful tips. I thank all my friends, colleagues, and professors at CMU and Northeastern, including Umut Acar, Maruan Al-Shedivat, David Andersen, Joy Arulraj, Kapil Arya, Mehmet Berat Aydemir, Ahmed Badawi, Nathan Beckmann, Naama Ben-David, Daniel Berger, Brandon Bohrer, Sol Boucher, Harsh Raju Chamarthi, Dominic Chen, Andrew Chung, Gene Cooperman, Henggang Cui, Wei Dai, Omar Delen, Chris Fallin, Pratik Fegade, Matthias Felleisen, Mohammed Hossein Hajkazemi, Mor Harchol-Balter, Aaron Harlap, Kevin Hsieh, Angela Jiang, David Kahn, Rajat Kateja, Ryan Kavanagh, Jack Kosaian, Yixin Luo, Lin Ma, Pete Manolios, Sara McAllister, Prashanth Menon, Alan Mislove, Mansour Shafaei Moghaddam, Todd Mowry, Ravi Teja Mullapudi, Onur Mutlu, Guevara Noubir, Jun Woo Park, İlqar Ramazanlı, Alexey Tumanov, David Wajc, Christo Wilson, Pengtao Xie, Jason Yang, Hao Zhang, Huanchen Zhang, and Qing Zheng, for their help and support and random fun chats over the years.

My doctoral adventure would not have even started had it not been for the constant support of my undergraduate professors. Thank you from the bottom of my heart, Levent Akın, Mehmet Ufuk Çağlayan, and Can Özturan, for supporting me despite my shenanigans and bad grades and never doubting me! And I would not have become interested in computers were it not for my high school teacher, Emrullah Kaya. Thank you for gifting me that Pascal book and for teaching me how to debug code!

Finally, I am grateful to my parents and to my wife, Sevinj, for their unconditional love and support. I do not think I can repay my debt to you, but I will try for the rest of my life.

Contents

1 Introduction
  1.1 10,000-feet View of Distributed Storage Systems
  1.2 The State of Current Storage Backends
    1.2.1 The Block Interface Tax
    1.2.2 The File Systems Tax
  1.3 Zoned Storage and Dilemma of Distributed Storage Systems
  1.4 Thesis Statement and Contributions
    1.4.1 Contributions
  1.5 Thesis Outline

2 Understanding and Quantifying the Block Interface Tax in DM-SMR Drives
  2.1 Magnetic Recording Techniques and Overview of Skylight
  2.2 Background on Shingled Magnetic Recording
  2.3 Test Drives
    2.3.1 Emulated Drives
    2.3.2 Real Drives
  2.4 Characterization Tests
    2.4.1 Characterization Goals
    2.4.2 Test Mechanisms
    2.4.3 Drive Type and Persistent Cache Type
    2.4.4 Disk Cache Location and Layout
    2.4.5 Cleaning Algorithm
    2.4.6 Persistent Cache Size
    2.4.7 Is Persistent Cache Shingled?
    2.4.8 Band Size
    2.4.9 Cleaning Time of a Single Band
    2.4.10 Block Mapping
    2.4.11 Effect of Mapping Type on Drive Reliability
    2.4.12 Zone Structure
  2.5 Related Work
  2.6 Summary and Recommendations

3 Reducing the Block Interface Tax in DM-SMR Drives by Evolving Ext4
  3.1 SMR Adoption and Ext4-Lazy Summary
  3.2 Background on the Ext4 File System
  3.3 Design and Implementation of ext4-lazy
    3.3.1 Motivation
    3.3.2 Design
    3.3.3 Implementation
  3.4 Evaluation
    3.4.1 Journal Bottleneck
    3.4.2 Ext4-lazy on a CMR Drive
    3.4.3 Ext4-lazy on DM-SMR Drives
    3.4.4 Performance Overhead of Ext4-Lazy
  3.5 Related Work
  3.6 Summary

4 Understanding and Quantifying the File System Tax in Ceph
  4.1 The State of Current Storage Backends
  4.2 Background on the Ceph Distributed Storage System
    4.2.1 Evolution of Storage Backends in Ceph
  4.3 Building Storage Backends on Local File Systems is Hard
    4.3.1 Challenge 1: Efficient Transactions
    4.3.2 Challenge 2: Fast Metadata Operations
    4.3.3 Other Challenges
  4.4 BlueStore: A Clean-Slate Approach
    4.4.1 BlueFS and RocksDB
    4.4.2 Data Path and Space Allocation
  4.5 Features Enabled by BlueStore
    4.5.1 Space-Efficient Checksums
    4.5.2 Overwrite of Erasure Coded Data
    4.5.3 Transparent Compression
  4.6 Evaluation
    4.6.1 Bare RADOS Benchmarks
    4.6.2 RADOS Block Device (RBD) Benchmarks
    4.6.3 Overwriting Erasure Coded (EC) Data
  4.7 Challenges of Building Storage Backends on Raw Storage
    4.7.1 Cache Sizing and Writeback
    4.7.2 Key-value Store Efficiency
    4.7.3 CPU and Memory Efficiency
  4.8 Related Work
  4.9 Summary

5 Freeing Ceph From the Block Interface Tax
  5.1 The Emergence of Zoned Storage
  5.2 Background
    5.2.1 Zoned Storage Overview
    5.2.2 RocksDB Overview
  5.3 Challenges of RocksDB on Zoned Storage
    5.3.1 Zone Cleaning
    5.3.2 Small SSTs
    5.3.3 Reordered Writes
    5.3.4 Synchronous Writes to the WAL
    5.3.5 Misaligned Writes to the WAL
  5.4 Handling Metadata Path—RocksDB on Zoned Storage
    5.4.1 File Types and Space Allocation
    5.4.2 Journaling and Superblock
    5.4.3 Caching
  5.5 Evaluation of RocksDB on HM-SMR HDDs
    5.5.1 Evaluation Setup
    5.5.2 Establishing CMR Baseline
    5.5.3 Establishing DM-SMR Baseline
    5.5.4 Getting RocksDB to Run on an HM-SMR Drive
    5.5.5 Running Fast with Asynchronous I/O
    5.5.6 Running Faster with a Cache
    5.5.7 Space Efficiency
  5.6 Evaluation of RocksDB on ZNS SSDs
    5.6.1 Evaluation Setup
    5.6.2 Quantifying the Block Interface Tax of RocksDB on Conventional SSDs
    5.6.3 Avoiding the Block Interface Tax with RocksDB on ZNS SSDs
  5.7 Handling Data Path—BlueStore on Zoned Storage
    5.7.1 Additions and Modifications to BlueStore
  5.8 Evaluation of Ceph on HM-SMR HDDs
    5.8.1 RADOS Write Throughput
    5.8.2 RADOS Random Read IOPS
    5.8.3 Tail Latency of RADOS Random Reads During Garbage Collection
  5.9 Summary

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
    6.2.1 Index Structures for Zoned Devices
    6.2.2 Shrinking Capacity Zoned Devices

Bibliography

List of Figures

1.1 10,000-feet view of the architecture found in most distributed storage systems. The scope of this dissertation is the storage backend that runs on every storage node.

2.1 Shingled disk tracks with head width k = 2.
2.2 Surface of a platter in a hypothetical DM-SMR drive. A persistent cache consisting of 9 tracks is located at the outer diameter. The guard region that separates the persistent cache from the first band is simply a track that is written at a full head width of k tracks. Although the guard region occupies the width of k tracks, it contains a single track's worth of data and the remaining k−1 tracks are wasted. The bands consist of 4 tracks, also separated with a guard region. Overwriting a sector in the last track of any band will not affect the following band. Overwriting a sector in any of the tracks will require reading and re-writing all of the tracks starting at the affected track and ending at the guard region within the band.
2.3 DM-SMR drive with the observation window encircled in red. Head assembly is visible parked at the inner diameter.
2.4 Discovering drive type using latency of random writes. Y-axis varies in each graph.
2.5 Seagate-SMR head position during random writes.
2.6 Surface of a disk platter in a hypothetical DM-SMR drive divided into two 2.5 track imaginary regions. The left figure shows the placement of random blocks 3 and 7 when writing synchronously. Each internal write contains a single block and takes 25 ms (50 ms in total) to complete. The drive reports 25 ms write latency for each block; reading the blocks in the written order results in a 5 ms latency. The right figure shows the placement of blocks when writing asynchronously with high queue depth. A single internal write contains both of the blocks, taking 25 ms to complete. The drive still reports 25 ms write latency for each block; reading the blocks back in the written order results in a 10 ms latency due to missed rotation.
2.7 Random write latency of different write sizes on Seagate-SMR, when writing at the queue depth of 31. Each latency graph corresponds to the latency of a group of writes. For example, the graph at 25 ms corresponds to the latency of writes with sizes in the range of 4–256 KiB. Since writes with different sizes in a range produced similar latency, we plotted a single latency as a representative.
2.8 Discovering disk cache structure and location using fragmented reads.

xi ò.À Seagate-SMR head position during fragmented reads...... òý ò.Ôý Discovering the type of cleaning using Test ç...... òÔ ò.ÔÔ Seagate-SMR head position during pause in step ò of Test Test ç...... òÔ ò.Ôò Latency of reads of random writes immediately aŸer the writes and aŸer pauses. òò ò.Ôç Verifying hypothesized cleaning algorithm on Seagate-SMR...... òò ò.Ô¥ ree dišerent scenarios triggering cleaning on drives using journal entries with quantized sizes and extent mapping. e text on the leŸ in the gure explains the meaning of the colors...... ò¥ ò.Ô Write latency of asynchronous writes of varying sizes with queue depth of çÔ until cleaning starts. Starting from the top, the graphs correspond to the lines ò- in Table ò.ò. When writing asynchronously, more writes are packed into the same journal entry. erefore, although the map merge operations still occur at every ò¥ýth journal write, the interval seems greater than in Figure ò.Ôâ. For ¥ KiB and ⥠KiB write sizes, we hit the map size limit rst, hence cleaning starts aŸer the same number of operations. For Ôò— KiB write size we hit the space limit before hitting the map size limit; therefore, cleaning starts aŸer fewer number of opera- tions than in ⥠KiB writes. Doubling the write size to ò â KiB conrms that we are hitting the space limit, since cleaning starts aŸer half the number of operations of Ôò— KiB writes...... òâ ò.Ôâ Write latency of ¥ KiB synchronous random writes, corresponding to the rst line in Table ò.ò. As explained in § ò.¥.ç, when writing synchronously the drive writes a journal entry for every write operation. Every ò¥ýth journal entry write results in a ≈ çò ms latency, which as was hypothesized in § ò.¥.ç includes a map merge operation. AŸer ≈ òç,ýýý writes, cleaning starts and the IOPS drops precipitously to ý–ç. To emphasize the high latency of writes during cleaning we perform ç,ýýý more operations. As the graph shows, these writes (òç,ýýý–òâ,ýýý) have ≈ ýý ms latency...... òÞ ò.ÔÞ Write latency of çý,ýýý random block writes with a repeating pattern. We choose Ôý,ýýý random blocks across the LBA space and write them in the chosen order. We then write the same Ôý,ýýý blocks in the same order two more times. Unlike Figure ò.Ôâ, the cleaning does not start aŸer ≈ òç,ýýý writes, because due to the repeating pattern, as the head of the log wraps around, the STL only nds stale blocks that it can overwrite without cleaning...... òÞ ò.ԗ Write latency of çý,ýýý random block writes chosen from a pool of Ôý,ýýý unique blocks. Unlike Figure ò.ÔÞ, cleaning starts aŸer ≈ òç,ýýý, because as the head of the log wraps around, the STL does not immediately nd stale blocks. However, since the blocks are chosen from a small pool, the STL still does nd a large number of stale blocks and can oŸen overwrite without cleaning. erefore, compared to Figure ò.Ôâ the write latency during cleaning (operations òç,ýýý-òâ,ýýý) is not as high, since in Figure ò.Ôâ the blocks are chosen across the LBA space and the STL almost never nds a stale block when the head of the log wraps around...... ò— ò.ÔÀ Discovering the band size. e žat latency regions correspond to sequentially reading a complete band from its native location...... òÀ

xii ò.òý Head position during the sequential read for Seagate-SMR, corresponding to the time period in Figure ò.ÔÀ...... òÀ ò.òÔ Detecting mapping type...... çÔ ò.òò Sequential read throughput of Seagate-SMR...... çò ò.ò¥ Hard read error under kernel ç.Ôâ when reading a region ašected by a torn write...... ç¥ ò.ò Seagate-SMR head position during sequential reads at dišerent ošsets...... ç¥ ò.òâ Sequential read latency of Seagate-CMR and Seagate-SMR corresponding to a complete cycle of ascent and descent through platter surfaces. While Seagate- CMR completes the cycle in ç. seconds, Seagate-SMR completes it in Ô,—ýý sec- onds, since the latter reads thousands of tracks from a single surface before switch- ing to the next surface...... ç

3.1 Throughput of CMR and DM-SMR drives from Table 3.1 under 4 KiB random write traffic. The CMR drive has a stable but low throughput under random writes. The DM-SMR drives, on the other hand, have a short period of high throughput followed by a continuous period of ultra-low throughput.
3.2 (a) Ext4 writes a metadata block to disk twice. It first writes the metadata block to the journal at some location J and marks it dirty in memory. Later, the writeback thread writes the same metadata block to its static location S on disk, resulting in a random write. (b) Ext4-lazy writes the metadata block approximately once to the journal and inserts a mapping (S, J) to an in-memory map so that the file system can find the metadata block in the journal.
3.3 Offsets of data and metadata writes obtained with blktrace, when compiling Linux kernel 4.6 with all of its modules on a fresh ext4 file system. The workload writes 12 GiB of data, 185 MiB of journal (omitted from the graph), and only 98 MiB of metadata, making it 0.77% of total writes.
3.4 (a) In ext2, the first megabyte of a 128 MiB block group contains the metadata blocks describing the block group, and the rest is data blocks. (b) In ext4, a single flex_bg concatenates multiple (16 in this example) block groups into one giant block group and puts all of the metadata in the first block group. (c) Modifying data in a flex_bg will result in a metadata write that may dirty one or two bands, seen at the boundary of bands 266,565 and 266,566.
3.5 Completion time for a benchmark creating 100,000 files on ext4-stock (ext4 with 128 MiB journal) and on ext4-baseline (ext4 with 10 GiB journal).
3.6 The volume of dirty pages during benchmark runs obtained by sampling /proc/meminfo every second on ext4-stock and ext4-baseline.
3.7 Microbenchmark runtimes on ext4-baseline and ext4-lazy.
3.8 Disk offsets of I/O operations during MakeDirs and RemoveDirs microbenchmarks on ext4-baseline and ext4-lazy. Metadata reads and writes are spread out while journal writes are at the center. The dots have been scaled based on the I/O size. In part (d), journal writes are not visible due to low resolution. These are pure metadata workloads with no data writes.

xiii ç.À e top graph shows the throughput of the disk during a Postmark run on ext¥- baseline and ext¥-lazy. e bottom graph shows the ošsets of write types during ext¥-baseline run. e graph does not režect sizes of the writes, but only their ošsets...... ý ç.Ôý Disk and CPU utilization sampled from iostat output every second, while com- piling Linux kernel ¥.â including all its modules, with Ôâ parallel jobs (make -j16) on a quad-core Intel iÞ-ç—òý (Sandy Bridge) CPU with — hardware threads..... Ô ç.ÔÔ Microbenchmark runtimes and cleaning times on ext¥-baseline and ext¥-lazy running on a DM-SMR drive. Cleaning time is the additional time aŸer the bench- mark run that the DM-SMR drive was busy cleaning...... Ô ç.Ôò e top graph shows the throughput of a ST—ýýýASýýýò DM-SMR drive with a ¥ýý GB partition during a Postmark run on ext¥-baseline and ext¥-lazy. e bottom graph shows the ošsets of write types during the run on ext¥-baseline. e graph does not režect sizes of the writes, but only their ošsets...... ç ç.Ôç e top graphs show the throughput of four DM-SMR drives on a full disk parti- tion during a Postmark run on ext¥-baseline and ext¥-lazy. Ext¥-lazy provides a speedup of .¥× ò×, ò×, Ô.Þ× in parts (a), (b), (c), and (d), respectively. e bot- tom graphs show the ošsets of write types during ext¥-baseline run. e graphs do not režect sizes of the writes, but only their ošsets...... ¥ ç.Ô¥ Postmark reported transaction throughput numbers for ext¥-baseline and ext¥- lazy running on four DM-SMR drives with a full disk partition. Only includes numbers from the transaction phase of the benchmark...... ¥

4.1 High-level depiction of Ceph's architecture. A single pool with 3× replication is shown. Therefore, each placement group (PG) is replicated on three OSDs.
4.2 Timeline of storage backend evolution in Ceph. For each backend, the period of development and the period of being the default production backend are shown.
4.3 The overhead of running an object store workload on a journaling file system. Object creation throughput is 80% higher on a raw HDD (4 TB Seagate ST4000NM0023) and 70% higher on a raw NVMe SSD (400 GB Intel P3600).
4.4 The effect of directory splitting on throughput with the FileStore backend. The workload inserts 4 KiB objects using 128 parallel threads at the RADOS layer to a 16-node Ceph cluster (setup explained in § 4.6). Directory splitting brings down the throughput for 7 minutes on an all-SSD cluster. Once the splitting is complete, the throughput recovers but does not return to peak, due to a combination of deeper nesting of directories, increased size of the underlying file system, and an imperfect implementation of the directory hashing code in FileStore.
4.5 The high-level architecture of BlueStore. Data is written to the raw storage device using direct I/O. Metadata is written to RocksDB running on top of BlueFS. BlueFS is a library file system designed for RocksDB, and it also runs on top of the raw storage device.

xiv ¥.â A possible on-disk data layout of BlueFS. e metadata in BlueFS lives only in the journal. e journal does not have a xed location—its extents are interleaved with le data. e WAL, LOG, and SST les are write-ahead log le, debug log le, and a sorted-string table les, respectively, generated by RocksDB...... âÀ ¥.Þ roughput of steady state object writes to RADOS on a Ôâ-node all-HDD cluster with dišerent I/O sizes using Ôò— threads. Compared to FileStore, the throughput is ý-Ôýý% greater on BlueStore and has a signicantly lower variance...... Þò ¥.— À th and above percentile latencies of object writes to RADOS on a Ôâ-node all- HDD cluster with dišerent sizes using Ôò— threads. BlueStore has an order of mag- nitude lower tail latency than FileStore...... Þç ¥.À roughput of ¥ KiB RADOS object writes with queue depth of Ôò— on a Ôâ-node all-SSD cluster. At steady state, BlueStore is ò× faster than FileStore. BlueStore does not sušer from directory splitting. However, its throughput is gradually brought down by the RocksDB compaction overhead...... Þ¥ ¥.Ôý Sequential write, random write, and sequential read throughput with dišerent I/O sizes and queue depth of ò â on a Ô TB Ceph virtual block device (RBD) allocated on a Ôâ-node all-HDD cluster. Results for an all-SSD cluster were similar...... Þ¥ ¥.ÔÔ IOPS observed from a client performing random ¥ KiB writes with queue depth of ò â to a Ceph virtual block device (RBD). e device is allocated on a Ôâ-node all-HDD cluster...... Þ

5.1 Zoned storage model.
5.2 Zone state transition diagram.
5.3 Data organization in RocksDB. Green squares represent Sorted String Tables (SSTs). Blue squares represent SSTs selected for two different concurrent compactions.
5.4 Benchmark runtimes of RocksDB with default and optimized settings on a CMR drive.
5.5 Memory used by the OS page cache and heap allocations, and the swap usage during the optimized RocksDB run.
5.6 (a) The number of seeks per second due to reads and (b) the number of fsync and fdatasync system calls during the benchmark run with default and optimized RocksDB settings on a CMR drive.
5.7 Benchmark runtimes of the baselines—RocksDB on DM-SMR and CMR drives (on the left)—and RocksDB on BlueFS iterations on an HM-SMR drive. The benchmark asynchronously inserts 150 million key-value pairs with 20-byte keys and 400-byte values.
5.8 Insertion throughput, HM-SMR drive read/write throughput, and compaction and memtable flush operations during the first 50 seconds of the benchmark with RocksDB on an HM-SMR drive using (a) synchronous and (b) asynchronous I/O.
5.9 (a) Insertion throughput, (b) read and write throughput, (c) compaction and memtable flush operations, and (d) number of zones allocated during the whole benchmark run of RocksDB on an HM-SMR drive using asynchronous I/O.

xv .Ôý (a) Insertion throughput and (b) read and write throughput during the whole benchmark on run of RocksDB on HM-SMR drive using asynchronous I/O and caching...... ÀÔ .ÔÔ Device write amplication of the enterprise SSD during RocksDB benchmarks. First, the fillseq benchmark sequentially inserts Þ billion key-value pairs; it completes in about two hours, lling the drive up to —ý%, and the write ampli- cation is Ô during this period. Next, the overwrite benchmark overwrites Þ billion key-value pairs in about ¥ý hours, during which the write amplication rises to about and stays there...... Àç .Ôò Device write amplication of the prototype ZNS SSD during RocksDB bench- marks. First, the fillseq benchmark sequentially inserts Ô.— billion key-value pairs; it completes in about an hour, lling the drive up to —ý%, and the write amplication is Ô during this period. Next, the overwrite benchmark overwrites Ô.— billion key-value pairs in about òâ hours, during which the write amplication stays around Ô...... À¥ .Ôç (a) Write throughput and (b) random read IOPS, at steady state to the RADOS layer on an —-node Ceph cluster with CMR, DM-SMR, and HM-SMR congura- tions. e benchmark issues reads and writes using Ôò— threads with three dišer- ent I/O sizes...... Ôýý .Ô¥ À th and above percentile latencies of random Ô MiB object reads at the RADOS layer during garbage collection on an —-node Ceph cluster with CMR, DM-SMR, and HM-SMR congurations. e benchmark issues reads using Ôò— threads. Garbage collection happens within the device in DM-SMR and in the host in HM- SMR. No garbage collection happens in CMR...... ÔýÔ

List of Tables

2.1 Emulated DM-SMR drive configurations. (For brevity, we drop DM from DM-SMR when naming the drives.)
2.2 Discovering persistent cache parameters. This estimate is based on the hypothesis that all of the 25 ms during a single block write is spent writing to disk. While the results of the experiments indicate this to be the case, we think 25 ms latency is artificially high and expect it to drop in future drives, which would require recalculation of this estimate.
2.3 Properties of the 5 TB and the 8 TB Seagate drives discovered using the Skylight methodology. The benchmarks worked out of the box on the 8 TB drive. Since the 8 TB drive was on loan, we did not drill a hole on it; therefore, shingling direction for it is not available.

3.1 CMR and DM-SMR drives from two vendors used for evaluation.
3.2 Distribution of the I/O types with MakeDirs and RemoveDirs benchmarks running on ext4-baseline and ext4-lazy.
3.3 Distribution of write types completed by the disk during a Postmark run on ext4-baseline and ext4-lazy. Metadata writes make up 1.3% of total writes in ext4-baseline, only 1/3 of which is unique.
3.4 Distribution of write types completed by a ST8000AS0002 DM-SMR drive during a Postmark run on ext4-baseline and ext4-lazy. Metadata writes make up 1.6% of total writes in ext4-baseline, only 1/5 of which is unique.

5.1 HDDs used in evaluation and their bandwidth at the first 12 GiB of the LBA space. We measured both sequential read and sequential write throughput for all of the drives to be the same.

Chapter 1

Introduction

Data has surpassed oil as the world's most valuable resource [56].¹ But unlike oil, of which there is a limited supply, the world is drowning in data [15]. Extracting value from this new resource requires large-scale distributed storage systems that can scale seamlessly to store vast volumes of data in a cost-effective manner, granting the opportunity to efficiently—with high throughput and low latency—access and process the stored data.

The systems community is on a constant quest for designing ever more cost-effective and efficient distributed storage systems. To tame the complexity, these systems are conventionally constructed in layers. At their penultimate layer, distributed storage systems rely on general-purpose file systems. In a layer below, general-purpose file systems communicate with the storage devices using the block interface. The block interface is a poor match for modern high-capacity hard disk drives and solid-state drives, and emulating it in modern storage devices imposes a "block interface tax"—unpredictable performance [80, 221] and increased cost [206]—on storage devices and consequently on the distributed storage systems. Moving back up a layer, although relying on general-purpose file systems is convenient for distributed storage system designers, in the long term it is counterproductive because it imposes a "file system tax"—reduced performance [143][10, 149], accidental complexity, and rigidity in the codebase [201].

In this dissertation we explore a clean-slate redesign of distributed storage systems that avoids both the block interface tax and the file system tax, in light of an emerging storage interface—the zone interface—which is a natural interface for modern storage devices. To this end, we study and quantify the block interface tax on modern hard disk drives and the file system tax on the Ceph distributed storage system. We then adapt Ceph to the zone interface and demonstrate how it improves (1) cost-effectiveness, by storing data on high-capacity hard drives, and (2) performance, by achieving higher I/O throughput and lower tail latency on these drives.

1.1 10,000-feet View of Distributed Storage Systems

Distributed storage systems are complex, with many moving parts. In this section we give a high-level overview of the architecture found in most such systems and outline the scope of our work.

¹We color code citations: peer-reviewed articles are cited in blue and non-peer-reviewed articles are cited in red.

[Figure 1.1 diagram: client nodes and metadata nodes connected over a high-speed network to storage nodes, each running the storage backend.]

Figure 1.1: 10,000-feet view of the architecture found in most distributed storage systems. The scope of this dissertation is the storage backend that runs on every storage node.

Distributed storage systems aggregate the storage space of multiple physical nodes into a single unified data store that offers high-bandwidth and parallel I/O, horizontal scalability, fault tolerance, and other desirable features. Over the years many such systems have been developed [74, 84, 202, 205][58], but almost all of them have followed the same high-level design shown in Figure 1.1: there is the client side, implemented as a kernel module or a library in the client nodes, that enables users to transparently access the storage system; there is the storage backend, implemented in the storage nodes, that serves user I/O requests; and there is the metadata service, implemented on the metadata nodes, that lets the clients know which storage nodes to read data from.

The scope of this dissertation is the storage backend that runs on the storage nodes. The storage nodes form the bulk of the nodes in a typical cluster running a distributed storage system, and they store all of the data in the system. The storage backend runs on every storage node, receives I/O requests over the network, and serves them by accessing the storage devices that are locally attached to the storage node. The storage backend is a crucial component that sits in the I/O path and plays a key role in the performance of the overall system.
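To make the storage backend's role concrete, the sketch below shows a minimal, hypothetical backend interface of the kind a storage node might expose. The names (ObjectStore, put, get, apply_transaction) are illustrative assumptions and do not correspond to the API of Ceph or any other system discussed in this dissertation.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Hypothetical storage backend interface running on a single storage node.

    The distributed layer (client protocol, placement, replication) lives above
    this interface; the backend only persists and retrieves data on the storage
    devices locally attached to the node.
    """

    @abstractmethod
    def put(self, object_id: str, offset: int, data: bytes) -> None:
        """Durably write data at the given offset within the named object."""

    @abstractmethod
    def get(self, object_id: str, offset: int, length: int) -> bytes:
        """Read length bytes at the given offset from the named object."""

    @abstractmethod
    def delete(self, object_id: str) -> None:
        """Remove the object and reclaim its space."""

    @abstractmethod
    def apply_transaction(self, ops: list) -> None:
        """Atomically apply a batch of data and metadata updates."""
```

Chapter 4 shows that the last operation—atomically applying batches of data and metadata updates—is precisely where general-purpose file systems make life difficult for backend implementers.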

1.2 The State of Current Storage Backends

Distributed storage systems differ in the details of how they design the storage backend. Almost all of them, however, rely on a general-purpose file system, such as ext4 or XFS, for implementing the storage backend [74, 84, 87, 177, 202, 205][58, 154, 190, 209]. This design decision has come largely unchallenged for the past several decades. Perhaps unwittingly, it has locked the distributed storage systems into running only on the storage devices that support the block interface. The block interface is a method for the storage software, such as a file system, to communicate with a storage device, and storage devices that support the block interface are called block devices. Almost every major file system is developed to run on block devices.

The block interface and general-purpose file systems impose significant performance taxes on distributed storage systems. Although they appear to be interrelated, in the following sections we decouple them, explain how we got here, and how each independently contributes to performance problems.

1.2.1 The Block Interface Tax

There have been tremendous technological advances in the hard disk drive (HDD) manufacturing process since IBM's introduction of the first HDD in 1956 [88]. The essence of how data is internally stored in HDDs, however, has largely remained the same: bit patterns that can be independently and randomly accessed and manipulated. The block interface has evolved over the years to match this essence of HDDs. It is a simple linear addressing scheme that represents the storage device as an array of 512-byte blocks. Each block is identified by an integer index, and the blocks can be read or written in random order. The earliest devices supporting the block interface appeared in 1986 [214], and it has been the dominant interface since then. Almost every major file system developed in the past few decades was designed for the block interface. Consequently, the block interface is deeply entrenched in the storage ecosystem—so much so that when NAND flash memory paved the way for a new breed of storage devices, the industry decided to emulate the block interface on top of flash memory, even though the block interface was a bad match for how flash memory operated. Thus, solid-state drives (SSDs) were born.

Unlike HDDs, which store data in blocks that can be randomly read or written, modern SSDs store data in pages with sizes similar to that of blocks. These pages are grouped into erase units spanning several megabytes, and all of the pages within an erase unit must be written sequentially. Hence, in-place overwrite of pages is impossible, and rewriting pages is only possible after first erasing all of the pages within an erase unit. To emulate the block interface on top of these constraints, SSDs internally run a translation layer that maps blocks to pages on erase units. To handle a block overwrite, the translation layer writes data to a new page, remaps the block to the new page, and marks the old page as stale. To reclaim space from the stale pages, the translation layer performs garbage collection at arbitrary times: it copies non-stale pages from one erase unit to another, remaps the corresponding blocks, and erases all the pages in the original erase unit.

Emulating the block interface using all this machinery comes with a high cost and performance penalty. Enterprise SSDs overprovision up to 28% of flash memory and use gigabytes of RAM for the efficient operation of the translation layer, significantly increasing the device cost [206]. Garbage collection performs extra internal writes that increase write amplification—the ratio of all writes performed internally by the device to the writes issued by the host—leading to early wear-out of flash cells. More importantly, garbage collection operations interfere with user requests and lead to high tail latencies in distributed storage systems [80, 105].

Perhaps surprisingly, the block interface is becoming a burden for modern high-capacity HDDs as well. To increase capacity, all three HDD manufacturers are shifting [130, 175, 176] to Shingled Magnetic Recording (SMR) [215]. SMR HDDs internally store data in zones spanning hundreds of megabytes. Not unlike erase units in flash, the zones in SMR HDDs must be written sequentially and must be reset before being written again. To ease the adoption of SMR HDDs, drive manufacturers have introduced Drive-Managed SMR (DM-SMR) HDDs which, just like SSDs, internally run a translation layer to emulate the block interface. And just like SSDs, they too suffer from unpredictable performance [2].

In summary, emulating the block interface with a translation layer on modern SSDs and HDDs increases device cost and results in unpredictable performance and high write amplification due to garbage collection performed by the translation layer. The unpredictable performance is undesirable in general, and in particular it is a major contributor to high tail latencies in distributed storage systems. We refer to these cost and performance overheads collectively as the block interface tax.
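To make the tax concrete, the toy simulator below models an out-of-place translation layer with greedy cleaning and reports the write amplification it incurs under uniformly random overwrites. It is a minimal sketch under simplifying assumptions—a single open erase unit, greedy victim selection, and parameters chosen purely for illustration—not the algorithm of any shipping SSD or DM-SMR drive.

```python
import random

PAGES_PER_UNIT = 256    # pages (blocks) per erase unit or band (illustrative)
NUM_UNITS = 64          # physical units managed by the translation layer
OVERPROVISION = 0.20    # share of physical space hidden from the host
LOGICAL_PAGES = int(NUM_UNITS * PAGES_PER_UNIT * (1 - OVERPROVISION))

units = [[] for _ in range(NUM_UNITS)]  # each slot holds a logical page, or None if stale
l2p = {}                                # logical page -> (unit, slot)
free = list(range(NUM_UNITS))
open_unit = free.pop()
device_writes = 0


def place(lpn):
    """Append one page to the open unit and remap the logical page to it."""
    global device_writes
    units[open_unit].append(lpn)
    l2p[lpn] = (open_unit, len(units[open_unit]) - 1)
    device_writes += 1


def clean():
    """Greedy cleaning: erase the full unit with the fewest live pages and copy
    those live pages back into it; the copies are pure write amplification."""
    global open_unit
    victim = min((u for u in range(NUM_UNITS) if u != open_unit),
                 key=lambda u: sum(p is not None for p in units[u]))
    live = [p for p in units[victim] if p is not None]
    units[victim] = []                  # "erase" the victim
    open_unit = victim
    for p in live:
        place(p)
    return open_unit


def host_write(lpn):
    """One host write, handled out of place the way an FTL or STL would."""
    global open_unit
    if lpn in l2p:                      # invalidate the previous copy
        u, s = l2p[lpn]
        units[u][s] = None
    if len(units[open_unit]) == PAGES_PER_UNIT:
        open_unit = free.pop() if free else clean()
    place(lpn)


random.seed(1)
for lpn in range(LOGICAL_PAGES):        # sequential fill: no cleaning needed
    host_write(lpn)
before = device_writes
overwrites = 4 * LOGICAL_PAGES
for _ in range(overwrites):             # random overwrites eventually force cleaning
    host_write(random.randrange(LOGICAL_PAGES))
print("write amplification:", round((device_writes - before) / overwrites, 2))
```

In this toy model, raising OVERPROVISION lowers the reported write amplification, which is one reason the enterprise SSDs mentioned above reserve a large fraction of their flash for the translation layer; the price is paid in device cost instead.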

1.2.2 The File Systems Tax

The developers of distributed storage systems have conventionally designed storage backends to run on top of general-purpose file systems [74, 84, 87, 177, 202, 205][58, 154, 190, 209]. This convention is attractive at first glance, because file systems implement most of the functionality expected from a storage backend, but in the long run it is counterproductive.

Developers hit the limitations of general-purpose file systems once they start pushing file systems to meet the scaling requirements of distributed storage systems, for example, when they store millions of entries in a single directory [10, 89, 149]. They continue, however, with their original design and try to fit general-purpose file system abstractions to their needs, incurring significant accidental complexity [25], which results in lost performance and hard-to-maintain, fragile code. This design decision is often rationalized by arguing that building a storage backend from scratch is akin to building a new general-purpose file system, which is known to take a decade on average [60, 210, 211][116]. Recent experience, however, shows that a special-purpose storage backend built from scratch can reach production quality in less than two years while significantly outperforming storage backends implemented on top of general-purpose file systems [5, 6].

Another major disadvantage of relying on file systems is their resistance to change. Once a file system matures, "...file system developers tend toward a high level of conservatism when it comes to making changes; given the consequences of mistakes, this seems like a healthy survival trait." [45] Effectively supporting emerging storage hardware, on the other hand, often requires drastic changes.

In summary, relying on general-purpose file systems for distributed storage backends leads to accidental complexity and lost performance, and it hinders swift and effective adoption of emerging storage technologies. We refer to these complexity, performance, and rigidity overheads collectively as the file system tax.

1.3 Zoned Storage and Dilemma of Distributed Storage Systems

Although they are worlds apart in how they are built and how they work, SMR HDDs and SSDs both manage their storage medium in the same manner: as a sequence of zones, each of which spans megabytes and can only be written sequentially. The Zoned Storage initiative [51] leverages this similarity to introduce the zone interface for managing both SMR HDDs and SSDs. The SMR HDDs that support the zone interface are called Host-Managed SMR (HM-SMR) HDDs, and the SSDs that support the zone interface are called Zoned Namespace (ZNS) SSDs.

These zoned devices are specifically targeted at large-scale distributed storage systems, and they come with multiple features that are attractive at scale. First, they are expected to scale to capacities of tens of terabytes in a single device. Second, zoned devices improve the cost-effectiveness of data storage without incurring garbage collection overhead. Specifically, HM-SMR HDDs increase capacity by 20% over conventional HDDs; and unlike enterprise SSDs, which declare themselves dead after 28% of their flash memory cells wear out, ZNS SSDs support dynamically decreasing the device capacity while continuing to store data on the remaining healthy cells. Third, zoned devices move the control over explicit data placement from the device to the host: now the host implements garbage collection and controls when it happens, thereby avoiding the high tail latencies caused by garbage collection kicking in at any moment in regular SSDs and DM-SMR HDDs. More importantly, the host now has a chance to intelligently place data directly on zones and eliminate garbage collection for certain common workloads [4].

Thus, going forward, distributed storage systems face a dilemma: should they adopt zoned storage and, if so, how? This dissertation provides evidence that distributed storage systems can avoid both the block interface tax and the file system tax by adopting zoned storage using a special-purpose storage backend.
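As a rough illustration of the contract that zoned devices impose on the host, the sketch below models a single zone with a write pointer: writes are accepted only at the pointer, reads are unrestricted, and the host explicitly resets the zone when its contents are no longer needed. This is a toy model of the zoned storage rules in spirit only—it is not the actual ZBC or ZNS command set—and the fixed-size block units are an assumption made for brevity.

```python
class Zone:
    """Toy model of one sequential-write-required zone (sizes in fixed-size blocks)."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.write_pointer = 0                    # next block that may be written
        self.blocks = [None] * capacity_blocks

    def append(self, data_blocks: list) -> int:
        """Writes land only at the write pointer, i.e., strictly sequentially."""
        if self.write_pointer + len(data_blocks) > self.capacity:
            raise IOError("zone full: open another zone or reset this one")
        start = self.write_pointer
        for i, block in enumerate(data_blocks):
            self.blocks[start + i] = block
        self.write_pointer += len(data_blocks)
        return start                              # offset where the data landed

    def read(self, offset: int, length: int) -> list:
        """Reads are unrestricted, as on a conventional drive."""
        return self.blocks[offset:offset + length]

    def reset(self):
        """Discard the zone's contents. The host decides when this happens,
        so no device-side garbage collection kicks in at arbitrary times."""
        self.blocks = [None] * self.capacity
        self.write_pointer = 0
```

A host-side allocator that fills whole zones sequentially and resets a zone only once all of the data in it is dead performs no cleaning at all; Chapter 5 adapts RocksDB and BlueStore to write in this zone-aligned manner.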

1.4 Thesis Statement and Contributions

Thesis statement: Distributed storage systems can improve the cost-effectiveness of data storage and the predictability of performance if they abandon the block interface for the zone interface and general-purpose file systems for special-purpose storage backends.

In this dissertation, we provide the following evidence to support our thesis statement:

We show that the block interface tax is becoming increasingly prohibitive. Distributed storage systems can run their storage backends on DM-SMR HDDs to be cost-effective and on regular SSDs to be performant. Both of these are block devices that emulate the block interface with a translation layer. Researchers have shown that the block interface tax—the garbage collection overhead incurred by the translation layer—leads to high tail latencies in SSDs [47, 80, 221]. We show that the garbage collection overhead is even higher in DM-SMR HDDs by performing a thorough performance characterization of popular DM-SMR HDDs (Chapter 2).

We show that it is hard to eliminate the block interface tax by evolutionary changes to general-purpose file systems. Distributed storage systems can eliminate the block interface tax either by modifying current file systems to work on zoned devices or by modifying file systems to avoid I/O patterns that cause garbage collection inside an emulated block device. Attempts to do the former have stalled due to technical difficulties [35, 140], suggesting the first option to be impractical. We show that while the second option can significantly improve the performance of ext4—the canonical file system for storage backends—on DM-SMR HDDs, it is hard to eliminate the garbage collection-inducing behavior (Chapter 3).

We study and quantify the file system tax and demonstrate that general-purpose file systems are unfit as distributed storage backends. We study the evolution of the storage backends in Ceph, a widely used distributed storage system, over ten years, and we pinpoint technical reasons as to why general-purpose file system abstractions are either too slow or inappropriate for the needs of storage backends and how they lead to performance problems. The file system tax would still exist even in the absence of the block interface tax (Chapter 4).

We demonstrate that a distributed storage system that already avoids the file system tax is also well-suited to avoid the block interface tax by quickly and effectively adopting zoned storage. We adapt BlueStore, a special-purpose storage backend in Ceph, to zoned devices. We demonstrate that BlueStore can quickly adopt zoned storage, and as a result, Ceph can leverage the extra capacity offered by SMR with high throughput and low tail latency, avoiding the block interface tax of DM-SMR HDDs (Chapter 5).

1.4.1 Contributions

This dissertation makes the following technical contributions:

• A novel methodology, Skylight, that combines software and hardware techniques to reverse engineer key properties of DM-SMR HDDs (Chapter 2).
• A full reverse engineering—using Skylight—of two real DM-SMR HDDs and their performance characterization (Chapter 2).
• A set of kernel modules that implement state-of-the-art translation layer algorithms and emulate DM-SMR HDDs on top of conventional HDDs (Chapter 2).
• The design and implementation of ext4-lazy, an extension of the ext4 file system, which reduces I/O patterns that cause garbage collection on emulated block devices (Chapter 3).
• A demonstration of performance improvements of ext4-lazy over ext4 on four DM-SMR HDDs from two vendors using micro- and macro-benchmarks (Chapter 3).
• The discovery of a long-standing bottleneck in ext4 for metadata-heavy workloads on HDDs and a fix for the bottleneck (Chapter 3).
• A longitudinal study of storage backend evolution in Ceph that dissects technical reasons leading to performance problems when building a storage backend on top of a general-purpose file system (Chapter 4).
• An introduction to the design of BlueStore, a clean-slate storage backend in Ceph, the challenges BlueStore solves, and the opportunities it provides (Chapter 4).
• An extension to RocksDB, a popular key-value store that BlueStore uses for storing metadata, which enables RocksDB to run on zoned devices (Chapter 5).
• A demonstration of a surprisingly high write amplification in an enterprise SSD when running a concurrent sequential workload, such as RocksDB (Chapter 5).
• A demonstration of performance improvements of RocksDB running on zoned devices over RocksDB running on block devices with a translation layer (Chapter 5).
• An extension to BlueStore that enables it to run on zoned devices and an extension to Ceph that enables it to leverage the zone interface and the redundancy of data to avoid high tail latencies (Chapter 5).

• An end-to-end demonstration on a cluster of how Ceph achieves cost-effective data storage and high performance when running on HM-SMR HDDs compared to when running on DM-SMR HDDs (Chapter 5).

1.5 Thesis Outline

The rest of this thesis is organized as follows. In Chapter 2, we study the inner workings of a DM-SMR drive—a modern high-capacity hard disk drive—and we show that the block interface tax is becoming even more prohibitive with these drives. We then put our newly acquired knowledge to use in Chapter 3, where we attempt to reduce the block interface tax in distributed storage systems by optimizing ext4—a file system used as a storage backend by many distributed storage systems—to reduce the garbage collection overhead in DM-SMR drives. While our work improves the performance of ext4 significantly on DM-SMR drives, it also demonstrates that it is hard to avoid the block interface tax by making evolutionary changes to general-purpose file systems. In Chapter 4 we demonstrate the file system tax: we show that implementing a storage backend on top of general-purpose file systems introduces significant performance overhead and hinders the adoption of novel storage hardware. In Chapter 5 we demonstrate that a distributed storage system which already avoids the file system tax is well-suited to avoiding the block interface tax and achieving cost-effective data storage, high throughput, and low tail latency. We conclude in Chapter 6.

Chapter 2

Understanding and Quantifying the Block Interface Tax in DM-SMR Drives

In this chapter we introduce SMR and describe how it increases capacity over Conventional Magnetic Recording (CMR). We then introduce Skylight, a novel methodology that combines software and hardware techniques to reverse engineer the translation layer operation in DM-SMR drives. Using Skylight we quantify the block interface tax—the performance and cost overhead stemming from emulating the block interface using a translation layer—on DM-SMR drives and introduce guidelines for their effective use.

2.1 Magnetic Recording Techniques and Overview of Skylight

In the nearly 60 years since the first hard disk drive was introduced [88], it has become the mainstay of computer storage systems. In 2013 the hard drive industry shipped over 400 exabytes [164] of storage, or almost 60 gigabytes for every person on earth. Although facing strong competition from NAND flash-based solid-state drives (SSDs), magnetic disks hold a 10× advantage over flash in both total bits shipped [157] and per-bit cost [55], an advantage that will persist if density improvements continue at current rates.

The most recent growth in disk capacity is the result of improvements to perpendicular magnetic recording (PMR) [146], which has yielded terabyte drives by enabling bits as short as 20 nm in tracks 70 nm wide [167], but further increases will require new technologies [191]. SMR is the first such technology to reach market: 5 TB drives are available from Seagate [166], and shipments of 8 TB and 10 TB drives have been announced by Seagate [165] and HGST [83]. Other technologies (Heat-Assisted Magnetic Recording [110] and Bit-Patterned Media [53]) remain in the research stage, and may in fact use shingled recording when they are released [196].

Shingled recording spaces tracks more closely, so they overlap like rows of shingles on a roof, squeezing more tracks and bits onto each platter [215]. The increase in density comes at a cost in complexity, as modifying a disk sector will corrupt other data on the overlapped tracks, requiring copying to avoid data loss [9, 75]. Rather than push this work onto the host file system [115], the DM-SMR drives shipped to date preserve compatibility with existing drives by implementing a Shingle Translation Layer (STL) [29, 78][75] that emulates the block interface on top of this complexity.

À Like an SSD, an DM-SMR drive combines out-of-place writes with dynamic mapping in or- der to e›ciently update data, resulting in a drive with performance much dišerent from that of a CMR drive due to seek overhead for out-of-order operations. Unlike SSDs, however, which have been extensively measured and characterized [òò, çò], little is known about the behavior and per- formance of DM-SMR drives and their translation layers, or how to optimize le systems, storage arrays, and applications to best use them. We introduce a methodology for measuring and characterizing such drives, developing a spe- cic series of micro-benchmarks for this characterization process, much as has been done in the past for conventional drives [ÞÞ, ԗâ, òÔâ]. We augment these timing measurements with a novel technique that tracks actual head movements via high-speed camera and image processing and provides a source of reliable information in cases where timing results are ambiguous. We validate this methodology on three dišerent emulated drives that use STLs previously de- scribed in the literature [òÀ, ޗ][¥ý], implemented as a Linux target [¥À] over a conventional drive, demonstrating accurate inference of properties. We then apply this method- ology to TB and — TB DM-SMR drives provided by Seagate, inferring the STL algorithm and its properties and providing the rst public characterization of such drives. Using our approach we are able to discover important characteristics of the Seagate DM-SMR drives and their translation layer, including the following: • Cache type and size. e drives use a persistent disk cache of òý GiB and ò GiB on the TB and — TB drives, respectively, with high random write speed until the cache is full. e ešective cache size is a function of write size and queue depth. • Persistent cache structure. e persistent disk cache is written as journal entries with quan- tized sizes—a phenomenon absent from the academic literature on SMRs. • Block Mapping. Non-cached data is statically mapped, using a xed assignment of logical block addresses (LBAs) to physical block addresses (PBAs), similar to that used in CMR drives, with implications for performance and durability. • Band size. DM-SMR drives organize data in bands—a set of contiguous tracks that are re- written as a unit; the examined drives have a small band size of Ô –¥ý MiB. • Cleaning mechanism. Aggressive cleaning during idle times moves data from the persistent cache to bands; cleaning duration is ý.â–Ô.â s per modied band. Our results show the details that may be discovered using Skylight, most of which impact (negatively or positively) the performance of dišerent workloads, as described in § ò.â. ese results—and the toolset allowing similar measurements on new drives—should thus be useful to users of DM-SMR drives, both in determining what workloads are best suited for these drives and in modifying applications to better use them. In addition, we hope that they will be of use to designers of DM-SMR drives and their translation layers, by illustrating the ešects of low-level design decisions on system-level performance.

2.2 Background on Shingled Magnetic Recording

Shingled recording is a response to limitations on areal density with perpendicular magnetic recording due to the superparamagnetic limit [191]. In brief, for bits to become smaller, write

heads must become narrower, resulting in weaker magnetic fields. This requires lower-coercivity (easily recordable) media, which is more vulnerable to bit flips due to thermal noise, requiring larger bits for reliability. As the head gets smaller this minimum bit size gets larger, until it reaches the width of the head and further scaling is impossible.

Several technologies have been proposed to go beyond this limit, of which SMR is the simplest [215]. To decrease the bit size further, SMR reduces the track width while keeping the head size constant, resulting in a head that writes a path several tracks wide. Tracks are then overlapped like rows of shingles on a roof, as seen in Figure 2.1. Writing these overlapping tracks requires only incremental changes in manufacturing, but much greater changes in storage software, as it becomes impossible to re-write a single sector without destroying data on the overlapped sectors.

Figure 2.1: Shingled disk tracks with head width k = 2.

For maximum capacity an SMR drive could be written from beginning to end, utilizing all tracks. Modifying any of this data, however, would require reading and re-writing the data that would be damaged by that write, then the data that would be damaged by the re-write, and so on, until the end of the surface is reached. This cascade of copying may be halted by inserting guard regions (tracks written at the full head width) so that the tracks before the guard region may be re-written without affecting any tracks following it, as shown in Figure 2.2. These guard regions divide each disk surface into re-writable bands; since the guards hold a single track's worth of data, storage efficiency for a band size of b tracks is b/(b + k − 1).

Given knowledge of these bands, a host file system can ensure they are only written sequentially, for example, by implementing a log-structured file system [115, 158]. Standards have also been developed to allow a drive to identify these bands to the host [94]: HM-SMR drives expose zones that are an order of magnitude larger than bands, which must also be written sequentially. At the time of this work, however, these standards were still in draft form and no drives based on them were available on the open market.

Alternately, DM-SMR drives present a standard re-writable block interface that is implemented by an internal Shingle Translation Layer, much as an SSD uses a Flash Translation Layer (FTL).


Figure 2.2: Surface of a platter in a hypothetical DM-SMR drive. A persistent cache consisting of 9 tracks is located at the outer diameter. The guard region that separates the persistent cache from the first band is simply a track that is written at a full head width of k tracks. Although the guard region occupies the width of k tracks, it contains a single track's worth of data and the remaining k − 1 tracks are wasted. The bands consist of 4 tracks, also separated with a guard region. Overwriting a sector in the last track of any band will not affect the following band. Overwriting a sector in any of the other tracks will require reading and re-writing all of the tracks starting at the affected track and ending at the guard region within the band.
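As a quick check of the efficiency expression above, using the 4-track bands of Figure 2.2 and the head width k = 2 from Figure 2.1 (the numbers serve only as an illustration):

\[
\text{efficiency} = \frac{b}{b + k - 1} = \frac{4}{4 + 2 - 1} = \frac{4}{5} = 80\%,
\]

so each 4-track band stores four tracks of data in the space of five tracks.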

Although the two are logically similar, appropriate algorithms differ due to differences in the constraints placed by the underlying media: (a) high seek times for non-sequential access, (b) lack of high-speed reads, (c) use of large (10s to 100s of MB) cleaning units, and (d) lack of wear-out, eliminating the need for wear leveling.

These translation layers typically store all data in bands, where it is mapped at a coarse granularity, and devote a small fraction of the disk to a persistent cache, as shown in Figure 2.2, which contains copies of recently written data. Data that should be retrieved from the persistent cache may be identified by checking a persistent cache map (or exception map) [29, 78]. Data is moved back from the persistent cache to bands by the process of cleaning (also known as garbage collection), which performs a read-modify-write (RMW) on every band whose data was overwritten. The cleaning process may be lazy, running only when the free cache space is low, or aggressive, running during idle times.

In one translation approach, a static mapping algorithmically assigns a native location [29], that is, a physical block address (PBA), to each logical block address (LBA) in the same way as is done in a CMR drive. An alternate approach uses coarse-grained dynamic mapping for non-cached LBAs [29], in combination with a small number of free bands. During cleaning, the drive writes an updated band to one of these free bands and then updates the dynamic map, potentially eliminating the need for a temporary staging area for cleaning updates and sequential writes.

In any of these cases drive operation may change based on the setting of the volatile cache (enabled or disabled) [160]. When the volatile cache is disabled, writes are required to be persistent before completion is reported to the host. When it is enabled, persistence is only guaranteed after a FLUSH CACHE command [213] or a write command with the Force Unit Access (FUA) bit set.

Drive Name       STL                Persistent Cache     Disk Cache      Cleaning     Band     Mapping
                                    Type and Size        Multiplicity    Type         Size     Type and Size
Emulated-SMR-1   Set-associative    Disk, 37.2 GiB       Single at ID    Lazy         40 MiB   Static, 3.9 TB
Emulated-SMR-2   Set-associative    Flash, 9.98 GiB      N/A             Lazy         25 MiB   Static, 3.9 TB
Emulated-SMR-3   Fully-associative  Disk, 37.2 GiB       Multiple        Aggressive   20 MiB   Dynamic, 3.9 TB

Table 2.1: Emulated DM-SMR drive configurations. (For brevity, we drop DM from DM-SMR when naming the drives.)

2.3 Test Drives

We now describe the drives we study. First, we discuss how we emulate three DM-SMR drives using our implementation of two STLs described in the literature. Second, we describe the real DM-SMR drives we study in this paper and the real CMR drive we use for emulating DM-SMR drives.

2.3.1 Emulated Drives

We implement Cassuto et al.'s set-associative STL [29] and a variant of their S-blocks STL [29, 79], which we call the fully-associative STL, as Linux device mapper targets. These are kernel modules that export a pseudo block device to user-space that internally behaves like a DM-SMR drive: the module translates incoming requests using the translation algorithm and executes them on a CMR drive.

The set-associative STL manages the disk as a set of N iso-capacity (same-sized) data bands, with typical sizes of 20–40 MiB, and uses a small (1–10%) section of the disk as the persistent cache. The persistent cache is also managed as a set of n iso-capacity cache bands, where n ≪ N. When a block in data band a is to be written, a cache band is chosen through (a mod n); the next empty block in this cache band is written and the persistent cache map is updated. Further accesses to the block are served from the cache band until cleaning moves the block to its native location, which happens when the cache band becomes full.

The fully-associative STL, on the other hand, divides the disk into large (we used 40 GiB) zones and manages each zone independently. The notion of zone here comes from zone bit recording [?] used in hard drives and is unrelated to SMR zones. A zone starts with 5% of its capacity provisioned to free bands for handling updates. When a block in logical band a is to be written to the corresponding physical band b, a free band c is chosen and written to, and the persistent cache map is updated. When the number of free bands falls below a threshold, cleaning merges the bands b and c, writes the result to a new band d, and remaps the logical band a to the physical band d, freeing bands b and c in the process. This dynamic mapping of bands allows the fully-associative STL to handle streaming writes with zero overhead.

To evaluate the accuracy of our emulation strategy, we implemented a pass-through device mapper target and found negligible overhead for our tests, confirming a previous study [147]. Although in theory this emulation approach may seem disadvantaged by the lack of access to exact sector layout, in practice this is not the case: even in real DM-SMR drives, the STL running

inside the drive is implemented on top of a layer that provides linear PBAs by hiding sector layout and defect management [66]. Therefore, we believe that the device mapper target running on top of a CMR drive provides an accurate model for predicting the behavior of an STL implemented by the controller of a DM-SMR drive.

Table 2.1 shows the three emulated DM-SMR drive configurations we use in our tests. The first two drives use the set-associative STL, and they differ in the type of persistent cache and band size. The last drive uses the fully-associative STL and disk for the persistent cache. We do not have a drive configuration combining the fully-associative STL and flash for the persistent cache. This is because the fully-associative STL aims to reduce long seeks during cleaning in disks, by using multiple caches evenly spread out on a disk, but flash does not suffer from long seek times. To emulate a DM-SMR drive with a flash cache (Emulated-SMR-2) we use the Emulated-SMR-1 implementation, but use a device mapper linear target to redirect the underlying LBAs corresponding to the persistent cache, storing them on an SSD.

To check the correctness of the emulated DM-SMR drives we ran repeated burn-in tests using fio [13]. We also formatted the emulated drives with the ext4 file system, compiled the Linux kernel on top, and successfully booted the system with the compiled kernel. The source code for the set-associative STL (1,200 lines of C) and a testing framework (250 lines of Go) are available at http://sssl.ccs.neu.edu/skylight.
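A minimal sketch of the set-associative placement rule just described; the band size, cache-band count, and names are illustrative rather than the emulator's actual code:

package main

import "fmt"

// setAssociativeSTL sketches the placement rule of the set-associative STL:
// a block in data band a may only be cached in cache band (a mod n).
type setAssociativeSTL struct {
	bandSize   int64   // band size in blocks (e.g., a 40 MiB band of 4 KiB blocks)
	numCache   int64   // n: number of cache bands (a small fraction of the disk)
	cacheFill  []int64 // next empty block in each cache band
	cacheLimit int64   // blocks per cache band
}

// write returns where an updated block goes: the next empty slot in the
// cache band selected by (dataBand mod n), or slot -1 if that cache band is
// full and must first be cleaned back to its native location.
func (s *setAssociativeSTL) write(lba int64) (cacheBand, slot int64) {
	dataBand := lba / s.bandSize
	cacheBand = dataBand % s.numCache
	if s.cacheFill[cacheBand] == s.cacheLimit {
		return cacheBand, -1 // cache band full: cleaning required
	}
	slot = s.cacheFill[cacheBand]
	s.cacheFill[cacheBand]++
	return cacheBand, slot
}

func main() {
	stl := &setAssociativeSTL{
		bandSize:   40 << 20 >> 12, // 40 MiB band, 4 KiB blocks
		numCache:   16,
		cacheLimit: 40 << 20 >> 12,
	}
	stl.cacheFill = make([]int64, stl.numCache)
	for _, lba := range []int64{0, 123456, 9999999} {
		b, s := stl.write(lba)
		fmt.Printf("LBA %d -> cache band %d, slot %d\n", lba, b, s)
	}
}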

2.3.2 Real Drives

Two real DM-SMR drives were tested: Seagate ST5000AS0011, a 5,900 RPM desktop drive (rotation time ≈ 10 ms) with four platters, eight heads, and 5 TB capacity (for brevity termed Seagate-SMR below), and Seagate ST8000AS0002, a similar drive with six platters, twelve heads, and 8 TB capacity. Emulated drives use a Seagate ST4000NC001 (Seagate-CMR), a real CMR drive identical in drive mechanics and specification (except the 4 TB capacity) to the ST5000AS0011. Results for the 8 TB and 5 TB DM-SMR drives were similar; to save space, we only present results for the publicly-available 5 TB drive.

2.4 Characterization Tests

To motivate our drive characterization methodology we first describe the goals of our measurements. We then describe the mechanisms and methodology for the tests, and finally present results for each tested drive. For emulated DM-SMR drives, we show that the tests produce accurate answers, based on implemented parameters; for real DM-SMR drives we discover their properties. The behavior of the real DM-SMR drives under some of the tests engenders further investigation, leading to the discovery of important details about their operation.

2.4.1 Characterization Goals

The goal of our measurements is to determine key drive characteristics and parameters:

• Drive type. In the absence of information from the vendor, is a drive a DM-SMR or a CMR?

Figure 2.3: DM-SMR drive with the observation window encircled in red. The head assembly is visible parked at the inner diameter.

• Persistent cache type. Does the drive use flash or disk for the persistent cache? The type of the persistent cache affects the performance of random writes and reliable (volatile-cache-disabled) sequential writes. If the drive uses disk for the persistent cache, is it a single cache, or is it distributed across the drive [29, 79]? The layout of the persistent disk cache affects the cleaning performance and the performance of a sequential read of a sparsely overwritten linear region.

• Cleaning. Does the drive use aggressive cleaning, improving performance for low duty-cycle applications, or lazy cleaning, which may be better for throughput-oriented ones? Can we predict the performance impact of cleaning?

• Persistent cache size. After some number of out-of-place writes the drive will need to begin a cleaning process, moving data from the persistent cache to bands so that it can accept new writes, negatively affecting performance. What is this limit, as a function of total blocks written, number of write operations, and other factors?

• Band size. Since a band is the smallest unit that may be re-written efficiently, knowledge of band size is important for optimizing DM-SMR drive workloads [29, 40]. What are the band sizes for a drive, and are these sizes constant over time and space [68]?

• Block mapping. The mapping type affects the performance of both cleaning and reliable sequential writes. For LBAs that are not in the persistent cache, is there a static mapping from LBAs to PBAs, or is this mapping dynamic?

• Zone structure. Determining the zone structure of a drive is a common step in understanding block mapping and band size, although the structure itself has little effect on external performance.

2.4.2 Test Mechanisms

The software part of Skylight uses fio to generate micro-benchmarks that elicit the drive characteristics. The hardware part of Skylight tracks the head movement during these tests. It resolves


ambiguities when interpreting the latency of the data obtained from the micro-benchmarks and leads to discoveries that are not possible with micro-benchmarks alone. To track the head movements, we installed (under clean-room conditions) a transparent window in the drive casing over the region traversed by the head. Figure 2.3 shows the head assembly parked at the inner diameter (ID). We recorded the head movements using a Casio EX-ZR500 camera at 1,000 frames per second and processed the recordings with ffmpeg to generate a head location value for each video frame.

We ran the tests on a 64-bit Intel Core-i3 Haswell system with 16 GiB RAM and a 64-bit Linux kernel, version 3.14. Unless otherwise stated, we disabled kernel read-ahead, drive look-ahead, and the drive volatile cache using hdparm. Extensions to fio developed for these tests have been upstreamed. Slow-motion clips for the head position graphs shown in the paper, as well as the tests themselves, are available at http://sssl.ccs.neu.edu/skylight.

Figure 2.4: Discovering drive type using the latency of random writes. The y-axis varies in each graph.

Figure 2.5: Seagate-SMR head position during random writes.

2.4.3 Drive Type and Persistent Cache Type

Test 1 exploits the unusual random write behavior of DM-SMR drives to differentiate them from CMR drives. While random writes to a CMR drive incur varying latency due to random seek time and rotational delay, random writes to a DM-SMR drive are sequentially logged to the persistent cache with a fixed latency. If the random writes are not local, DM-SMR drives that use separate persistent caches per LBA range [29] may still incur varying write latency. Therefore, the random writes are done within a small region to ensure that a single persistent cache is used.

Figure 2.4 shows the results for this test. Emulated-SMR-1 sequentially writes incoming random writes to the persistent cache. It fills one empty block after another and, due to the synchronicity of the writes, it misses the next empty block by the time the next write arrives. Therefore, it waits for a complete rotation, resulting in a 10 ms write latency, which is the rotation time of the underlying CMR drive.

Test 1: Discovering Drive Type

Ô Write blocks in the rst Ô GiB in random order to the drive. ò if latency is xed then the drive is DM-SMR else the drive is CMR. ing CMR drive. e sub-millisecond latency of Emulated-SMR-ò shows that this drive uses žash for the persistent cache. e latency of Emulated-SMR-ç is identical to that of Emulated-SMR- Ô, suggesting a similar setup. e varying latency of Seagate-CMR identies it as a conventional drive. Seagate-SMR shows a xed ≈ ò ms latency with a ≈ çò ms bump at the ò¥ýth write. While the xed latency indicates that it is a DM-SMR drive, we resort to the head position graph to understand why it takes ò ms to write a single block and what causes the çò ms latency. Figure ò. shows that the head, initially parked at the ID, seeks to the outer diameter (OD) for the rst write. It stays there during the rst òçÀ writes (incidentally, showing that the persistent cache is at the OD), and on the ò¥ýth write it seeks to the center, staying there for ≈ ò— ms before seeking back and continuing to write. Is all of ò ms latency associated with every block write spent writing or is some of it spent in rotational delay? When we repeat the test multiple times, the completion time of the rst write ranges between ¥Ô and ò ms, while the remaining writes complete in ò ms. e latency of the rst write always consists of a seek from the ID to the OD (≈ Ôâ ms). We hypothesize that the remaining time is spent in rotational delay—likely waiting for the beginning of a delimited location—and writing (ò ms). Depending on where the head lands aŸer the seek, the latency of the rst write changes between ¥Ô ms and ò ms. e remaining writes are written as they arrive, without seek time and rotational delay, each taking ò ms. Hence, we hypothesize that a single block host write results in a ò. track internal write. We realize that ò ms latency is articially high and expect it to drop in future drives, nevertheless, we base our further explanations on this assumption. In the following section we explore this phenomenon further.

Journal Entries with Quantized Sizes

If after Test 1 we immediately read the blocks in the written order, read latency is fixed at ≈ 5 ms, indicating a 0.5-track distance between blocks (covering a complete track takes a full rotation, which is 10 ms for this drive; therefore 5 ms translates to a 0.5-track distance). On the other hand, if we write blocks asynchronously at the maximum queue depth of 31 [117] and immediately read them, latency is fixed at ≈ 10 ms, indicating a missed rotation due to contiguous placement. Furthermore, although the drive still reports a 25 ms completion time for every write, asynchronous writes complete faster: for 256 write operations, asynchronous writes complete in 216 ms whereas synchronous writes complete in 6,539 ms, as seen in Figure 2.5. Gathering these facts, we arrive at Figure 2.6. Writing asynchronously with a high queue depth allows the drive to pack multiple blocks into a single internal write, placing them contiguously (shown on the right). The drive reports the completion of individual host writes packed into the same internal write once the internal write completes. Thus, although each of the host writes in the same internal write is reported to take 25 ms, it is the same 25 ms that went into writing the internal write. As a result, in the asynchronous case, the drive does fewer internal writes, which accounts for the fast completion time. The contiguous placement also explains the 10 ms latency when reading blocks in the written order.


Figure 2.6: Surface of a disk platter in a hypothetical DM-SMR drive divided into two 2.5-track imaginary regions. The left figure shows the placement of random blocks 3 and 7 when writing synchronously. Each internal write contains a single block and takes 25 ms (50 ms in total) to complete. The drive reports a 25 ms write latency for each block; reading the blocks in the written order results in a 5 ms latency. The right figure shows the placement of blocks when writing asynchronously with a high queue depth. A single internal write contains both of the blocks, taking 25 ms to complete. The drive still reports a 25 ms write latency for each block; reading the blocks back in the written order results in a 10 ms latency due to a missed rotation.

Writing synchronously, however, results in a separate internal write for every block (shown on the left in Figure 2.6), taking longer to complete. Placing blocks starting at the beginning of 2.5-track internal writes explains the 5 ms latency when reading blocks in the written order.

To understand how the internal write size changes with increasing host write size, we keep writing at the maximum queue depth, gradually increasing the write size. Figure 2.7 shows that writes in the range of 4 KiB–26 KiB result in 25 ms latency, suggesting that 31 host writes in this size range fit in a single internal write. As we jump to 28 KiB writes, the latency increases by ≈ 5 ms (or 0.5 track) and remains approximately constant for writes of sizes up to 54 KiB. We observe a similar jump in latency as we cross from 54 KiB to 56 KiB and also from 82 KiB to 84 KiB. This shows that the internal write size increases in 0.5-track increments. Given that the persistent cache is written using a "log-structured journaling mechanism" [67], we infer that the 0.5 track of the 2.5-track minimum internal write is the journal entry, which grows in 0.5-track increments, and that the remaining 2 tracks contain out-of-band data, like the parts of the persistent cache map affected by the host writes. The purpose of this quantization of journal entries is not known, but it may be to reduce rotational delay or to simplify delimiting and locating them. We further hypothesize that the 32 ms delay in Figure 2.4, observed every 240th write, is a map merge operation that stores the updated map at the middle tracks. As the write size increases to 256 KiB we see varying delays, and inspection of completion times shows fewer than 31 writes completing in each burst, implying a bound on the journal entry size. Different completion times for large writes suggest that for these, the journal entry size is determined dynamically, likely based on the drive resources available at the time the journal entry is formed.


Figure 2.7: Random write latency of different write sizes on Seagate-SMR, when writing at a queue depth of 31. Each latency graph corresponds to the latency of a group of writes; for example, the graph at 25 ms corresponds to the latency of writes with sizes in the range of 4–26 KiB. Since writes with different sizes in a range produced similar latency, we plot a single latency as a representative.

2.4.4 Disk Cache Location and Layout

We next determine the location and layout of the disk cache, exploiting a phenomenon called fragmented reads [29]. When sequentially reading a region in a DM-SMR drive, if the cache contains newer versions of some of the blocks in the region, the head has to seek to the persistent cache and back, physically fragmenting a logically sequential read. In Test 2, we use these variations in seek time to discover the location and layout of the disk cache.

Test 2: Discovering Disk Cache Location and Layout

1 Starting at a given offset, write a block and skip a block, and so on, writing 512 blocks in total.
2 Starting at the same offset, read 1,024 blocks; call the average latency lat_offset.
3 Repeat steps 1 and 2 at the offsets high, low, mid.
4 if lat_high < lat_mid < lat_low then there is a single disk cache at the ID.
  else if lat_high > lat_mid > lat_low then there is a single disk cache at the OD.
  else if lat_high = lat_mid = lat_low then there are multiple disk caches.
  else assert(lat_high = lat_low and lat_high > lat_mid); there is a single disk cache in the middle.

The test works by choosing a small region, writing every other block in it, and then reading the region sequentially from the beginning, forcing a fragmented read. LBA numbering conventionally starts at the OD and grows towards the ID. Therefore, a fragmented read at low LBAs on a drive with the disk cache located at the OD would incur negligible seek time, whereas a fragmented read at high LBAs on the same drive would incur high seek time.


Conversely, on a drive with the disk cache located at the ID, a fragmented read would incur high seek time at low LBAs and negligible seek time at high LBAs. On a drive with the disk cache located at the middle diameter (MD), fragmented reads at low and high LBAs would incur similar high seek times, and they would incur negligible seek times at middle LBAs. Finally, on a drive with multiple disk caches evenly distributed across the drive, the fragmented read latency would be mostly due to rotational delay and would vary little across the LBA space. Guided by these assumptions, to identify the location of the disk cache, the test chooses a small region at low, middle, and high LBAs and forces fragmented reads at these regions.

Figure 2.8 shows the latency of fragmented reads at three offsets on all DM-SMR drives. The test correctly identifies Emulated-SMR-1 as having a single cache at the ID. For Emulated-SMR-2 with a flash cache, latency is seen to be negligible for flash reads, and a full missed rotation for each disk read. Emulated-SMR-3 is also correctly identified as having multiple disk caches: the latency graphs of all fragmented reads overlap, all having the same 10 ms average latency. For Seagate-SMR (test performed with the volatile cache enabled with hdparm -W1) we confirm that it has a single disk cache at the OD.

Figure 2.9 shows the Seagate-SMR head position during fragmented reads at offsets of 0 TB, 2.5 TB, and 5 TB. For the offsets of 2.5 TB and 5 TB, we see that the head seeks back and forth between the OD and near-center and between the OD and the ID, respectively, occasionally missing a rotation. The cache-to-data distance for the LBAs near 0 TB was too small for the resolution of our camera.

Figure 2.8: Discovering disk cache structure and location using fragmented reads.

Figure 2.9: Seagate-SMR head position during fragmented reads.
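Given the three measured averages, the decision rule of Test 2 can be written directly; the equality tolerance below is an assumption of this sketch, and the measurements themselves follow the write-skip-read pattern of Test 2:

package main

import "fmt"

// classifyCacheLayout applies the decision rule of Test 2 to the average
// fragmented-read latencies (in ms) measured at low, middle, and high LBAs.
// The tolerance for treating two latencies as "equal" is illustrative.
func classifyCacheLayout(latLow, latMid, latHigh float64) string {
	const eps = 1.0 // ms
	eq := func(a, b float64) bool { return a-b < eps && b-a < eps }
	switch {
	case latHigh < latMid && latMid < latLow:
		return "single disk cache at the ID"
	case latHigh > latMid && latMid > latLow:
		return "single disk cache at the OD"
	case eq(latHigh, latMid) && eq(latMid, latLow):
		return "multiple disk caches"
	case eq(latHigh, latLow) && latHigh > latMid:
		return "single disk cache in the middle"
	default:
		return "inconclusive"
	}
}

func main() {
	// Example: latency grows with the offset, as on Seagate-SMR (cache at OD).
	fmt.Println(classifyCacheLayout(2.0, 12.0, 22.0))
}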

2.4.5 Cleaning Algorithm

The fragmented read effect is also used in Test 3 to determine whether the drive uses aggressive or lazy cleaning, by creating a fragmented region and then pausing to allow an aggressive cleaning to run before reading the region.

Test 3: Discovering Cleaning Type

1 Starting at a given offset, write a block and skip a block, and so on, writing 512 blocks in total.
2 Pause for 3–5 seconds.
3 Starting at the same offset, read 1,024 blocks.
4 if latency is fixed then cleaning is aggressive else cleaning is lazy.


Figure 2.10: Discovering the type of cleaning using Test 3.

Figure 2.11: Seagate-SMR head position during the pause in step 2 of Test 3.

Figure 2.10 shows the read latency graph of step 3 of Test 3 at the offset of 2.5 TB, with a three-second pause in step 2. For all drives, offsets were chosen to land within a single band (§ 2.4.8). After a pause, the top two emulated drives continue to show fragmented read behavior, indicating lazy cleaning, while in Emulated-SMR-3 and Seagate-SMR reads are no longer fragmented, indicating aggressive cleaning.

Figure 2.11 shows the Seagate-SMR head position during the 3.5-second period starting at the beginning of step 2. Two short seeks from the OD to the ID and back are seen in the first 200 ms; their purpose is not known. The RMW operation for cleaning a band starts at 1,242 ms after the last write, when the head seeks to the band at the 2.5 TB offset, reads for 180 ms, and seeks back to the cache at the OD, where it spends 1,210 ms. We believe this time is spent forming an updated band and persisting it to the disk cache, to protect against power failure during the band overwrite. Next, the head seeks to the band, taking 227 ms to overwrite it, and then seeks to the center to update the map. Hence, cleaning a band in this case took ≈ 1.6 s. We believe the center to contain the map because the head always moves to this position after performing an RMW, and stays there for a short period before eventually parking at the ID. After 3 seconds, reads begin and the head seeks back to the band location, where it stays until the reads complete (only the first 500 ms is seen in Figure 2.11).

We confirmed that the operation starting at 1,242 ms is indeed an RMW: when step 3 is begun before the entire cleaning sequence has completed, read behavior is unchanged from Test 2. We did not explore the details of the RMW; alternatives like partial read-modify-write [151] may also have been used.


Figure 2.12: Latency of reads of random writes immediately after the writes and after pauses.

Figure 2.13: Verifying the hypothesized cleaning algorithm on Seagate-SMR.

Seagate-SMR Cleaning Algorithm

We next explore performance-relevant details that are specific to the Seagate-SMR cleaning algorithm by running Test 4. In step 1, as the drive receives random writes, it sequentially logs them to the persistent cache as they arrive. Therefore, immediately reading the blocks back in the written order should result in a fixed rotational delay with no seek time. During the pause in step 3, the cleaning process moves blocks from the persistent cache to their native locations. As a result, reading after the pause should incur varying seek time and rotational delay for the blocks moved by the cleaning process, whereas unmoved blocks should still incur a fixed latency.

Test 4: Exploring the Cleaning Algorithm

1 Write 4,096 random blocks.
2 Read back the blocks in the written order.
3 Pause for 10–20 minutes.
4 Repeat steps 2 and 3.

In Figure 2.12, read latency is shown immediately after step 2, and then after 10, 30, and 50 minutes. We observe that the latency is fixed when we read the blocks immediately after the writes. If we re-read the blocks after a 10-minute pause, we observe random latencies for the first ≈ 800 blocks, indicating that the cleaning process has moved these blocks to their native locations. Since every block is expected to be on a different band, the number of operations with random read latencies after each pause shows the progress of the cleaning process, that is, the number of

bands it has cleaned. Given that it takes ≈ 30 minutes to clean ≈ 3,000 bands, it takes ≈ 600 ms to clean a band whose single block has been overwritten. We also observe a growing number of cleaned blocks in the unprocessed region (for example, operations 3,000–4,000 in the 30-minute graph); based on this behavior, we hypothesize that cleaning follows Algorithm 1.

Algorithm 1: Hypothesized Cleaning Algorithm of Seagate-SMR

1 Read the next block from the persistent cache; find the block's band.
2 Scan the persistent cache, identifying blocks belonging to the band.
3 Read-modify-write the band, then update the map.
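A compact sketch of this hypothesized cleaner over an in-memory model of the persistent cache; the band geometry and types are illustrative, not the drive's firmware:

package main

import "fmt"

const bandBlocks = 9216 // e.g., a 36 MiB band of 4 KiB blocks (illustrative)

// cleanOnePass sketches Algorithm 1: take the oldest cached block, find its
// band, gather every cached block that belongs to the same band, and
// read-modify-write that band, dropping the gathered blocks from the cache.
// cache holds cached LBAs in arrival order.
func cleanOnePass(cache *[]int64) (band int64, merged int) {
	if len(*cache) == 0 {
		return -1, 0
	}
	band = (*cache)[0] / bandBlocks // band of the next (oldest) cached block
	remaining := (*cache)[:0]
	for _, lba := range *cache {
		if lba/bandBlocks == band {
			merged++ // this block is folded into the read-modify-write
		} else {
			remaining = append(remaining, lba)
		}
	}
	*cache = remaining
	// A real drive would now read the band, apply the merged blocks, write
	// the updated band back, and persist the updated map.
	return band, merged
}

func main() {
	cache := []int64{10, 9300, 20, 30, 18500, 9400}
	for len(cache) > 0 {
		band, n := cleanOnePass(&cache)
		fmt.Printf("cleaned band %d, merged %d cached blocks\n", band, n)
	}
}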

To test this hypothesis we run Test 5. In Figure 2.13 we see that after one minute, all of the blocks written in step 1, some of those written in step 2, and all of those written in step 3 have been cleaned, as indicated by the non-uniform latency, while the remainder of the step 2 blocks remain in the cache, confirming our hypothesis. After two minutes all blocks have been cleaned. (The higher latency for step 2 blocks is due to their higher mean seek distance.)

Test 5: Verifying the Hypothesized Cleaning Algorithm

1 Write 128 blocks from a 256 MiB linear region in random order.
2 Write 128 random blocks across the LBA space.
3 Repeat step 1, using different blocks.
4 Pause for one minute; read all blocks in the written order.

2.4.6 Persistent Cache Size

We discover the size of the persistent cache by ensuring that the cache is empty and then measuring how much data may be written before cleaning begins. We use random writes across the LBA space to fill the cache, because sequential writes may fill the drive bypassing the cache [29] and cleaning may never start. Also, with sequential writes, a drive with multiple caches may fill only one of the caches and start cleaning before all of the caches are full [29]. With random writes, bypassing the cache is not possible; also, they will fill multiple caches at the same rate, and cleaning will start when all of the caches are almost full.

The simple task of filling the cache is complicated in drives using extent mapping: a cache is considered full when the extent map is full or when the disk cache is full, whichever happens first. The latter is further complicated by journal entries with quantized sizes: as seen previously (§ 2.4.3), a single 4 KB write may consume as much cache space as dozens of 8 KB writes. Due to this overhead, the actual size of the disk cache is larger than what is available to host writes; we differentiate the two by calling them the persistent cache raw size and the persistent cache size, respectively.

Figure 2.14 shows three possible scenarios on a hypothetical drive with a persistent cache raw size of 36 blocks and a 12-entry extent map. The minimum journal entry size is 2 blocks, and it grows in units of 2 blocks to a maximum of 16 blocks; out-of-band data of 2 blocks is written with every journal entry; the persistent cache size is 32 blocks.


Figure 2.14: Three different scenarios triggering cleaning on drives using journal entries with quantized sizes and extent mapping: (a) queue depth 1, write size 1 block; (b) queue depth 4, write size 1 block; (c) queue depth 4, write size 4 blocks.

Figure 2.14 (a) shows the case of queue depth 1 and 1-block writes. After the host issues 9 writes, the drive puts every write into a separate 2-block journal entry, fills the cache with 9 journal entries, and starts cleaning. Every write consumes a slot in the map. Due to the low queue depth, the drive leaves one empty block in each journal entry, wasting 9 blocks. Exploiting this behavior, Test 6 discovers the persistent cache raw size. (In this and the following tests, we detect the start of cleaning when the IOPS drops to near zero.)

Test 6: Discovering Persistent Cache Raw Size

1 Write with a small size and low queue depth until cleaning starts.
2 Persistent cache raw size = number of writes × (minimum journal entry size + out-of-band data size).

Figure 2.14 (b) shows the case of queue depth 4 and 1-block writes. After the host issues 12 writes, the drive forms three 4-block journal entries. Writing these journal entries to the cache fills the map, and the drive starts cleaning despite a half-empty cache. We use Test 7 to discover the persistent cache map size.

Test 7: Discovering Persistent Cache Map Size

1 Write with a small size and high queue depth until cleaning starts.
2 Persistent cache map size = number of writes.

Finally, Figure 2.14 (c) shows the case of queue depth 4 and 4-block writes. After the host issues 8 writes, the drive forms two 16-block journal entries, filling the cache. Due to the high queue depth and large write size, the drive is able to fill the cache (without wasting any blocks) before the map fills. We use Test 8 to discover the persistent cache size.

Test 8: Discovering Persistent Cache Size

1 Write with a large size and high queue depth until cleaning starts.
2 Persistent cache size = total host write size.
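The arithmetic behind Tests 6–8 can be collected into a small helper. The per-write journal-entry and out-of-band costs are assumptions taken from the analysis in § 2.4.3 (a 0.5-track entry plus 2 tracks of out-of-band data at roughly 1.8 MiB per track near the OD), so the output only reproduces the order of magnitude of Table 2.2:

package main

import "fmt"

// Estimates derived from Tests 6-8. rawMiB and sizeMiB are in MiB;
// mapEntries is a count of extent-map entries.
type cacheParams struct {
	rawMiB     float64
	mapEntries int
	sizeMiB    float64
}

// estimate applies the formulas of Tests 6-8 to the observed operation
// counts (writes issued before the IOPS collapse that signals cleaning).
// journalMiB and oobMiB are the assumed per-internal-write costs.
func estimate(syncWrites, asyncSmallWrites int, largeWriteMiB,
	journalMiB, oobMiB float64) cacheParams {
	return cacheParams{
		rawMiB:     float64(syncWrites) * (journalMiB + oobMiB), // Test 6
		mapEntries: asyncSmallWrites,                            // Test 7
		sizeMiB:    largeWriteMiB,                               // Test 8
	}
}

func main() {
	// Operation counts as reported in Table 2.2 for Seagate-SMR; the chapter
	// further rounds these up to account for a low cleaning watermark
	// (about 200,000 map entries and a 20 GiB cache).
	p := estimate(22800, 182270, 16780, 0.9, 3.6)
	fmt.Printf("raw size ≈ %.0f GiB, map ≈ %d entries, cache size ≈ %.1f GiB\n",
		p.rawMiB/1024, p.mapEntries, p.sizeMiB/1024)
}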

Drive            Write Size   QD   Operation Count   Host Writes   Internal Writes
Seagate-SMR      4 KiB        1    22,800            89 MiB        100 GiB (a)
                 4 KiB        31   182,270           0.7 GiB       N/A
                 64 KiB       31   182,231           11.12 GiB     N/A
                 128 KiB      31   137,496           16.78 GiB     N/A
                 256 KiB      31   67,830            16.56 GiB     N/A
Emulated-SMR-1   4 KiB        1    9,175,056         35 GiB        35 GiB
Emulated-SMR-2   4 KiB        1    2,464,153         9.4 GiB       9.4 GiB
Emulated-SMR-3   4 KiB        1    9,175,056         35 GiB        35 GiB

Table 2.2: Discovering persistent cache parameters. (a) This estimate is based on the hypothesis that all of the 25 ms during a single block write is spent writing to disk. While the results of the experiments indicate this to be the case, we think the 25 ms latency is artificially high and expect it to drop in future drives, which would require recalculation of this estimate.

Table 2.2 shows the results of the tests on Seagate-SMR, and Figure 2.15 shows the corresponding graph. In the first row of the table, we discover the persistent cache raw size using Test 6. Writing with a 4 KiB size and a queue depth of 1 produces a fixed 25 ms latency (§ 2.4.3), that is, 2.5 rotations. Hypothesizing that all of the 25 ms is spent writing and that a track size is ≈ 2 MiB at the OD, 22,800 operations correspond to ≈ 100 GiB.

In rows 2 and 3 we discover the persistent cache map size using Test 7. For write sizes of 4 KiB and 64 KiB, cleaning starts after ≈ 182,200 writes, which corresponds to 0.7 GiB and 11.12 GiB of host writes, respectively. This confirms that in both cases the drive hits the map size limit, corresponding to scenario (b) in Figure 2.14. Assuming that the drive uses a low watermark to trigger cleaning, we estimate that the map size is 200,000 entries.

In rows 4 and 5 we discover the persistent cache size using Test 8. With 128 KiB writes we write ≈ 17 GiB in fewer operations than in row 3, indicating that we are hitting the size limit. To confirm this, we increase the write size to 256 KiB in row 5; as expected, the number of operations drops by half while the total write size stays the same. Again, assuming that the drive has hit the low watermark, we estimate that the persistent cache size is 20 GiB.

Journal entries with quantized sizes and extent mapping are absent topics in the academic literature on SMR, so the emulated drives implement neither feature. Running Test 6 on the emulated drives produces all three answers, since in these drives the cache is block-mapped, and the cache size and cache raw size are the same. Furthermore, the set-associative STL divides the persistent cache into cache bands and assigns data bands to them using modulo arithmetic. Therefore, despite having a single cache, under random writes it behaves similarly to a fully-associative cache. The bottom rows of Table 2.2 show that in the emulated drives, Test 8 discovers the cache size (see Table 2.1) with 95% accuracy.


Figure 2.15: Write latency of asynchronous writes of varying sizes at a queue depth of 31 until cleaning starts. Starting from the top, the graphs correspond to lines 2–5 in Table 2.2. When writing asynchronously, more writes are packed into the same journal entry; therefore, although the map merge operations still occur at every 240th journal write, the interval seems greater than in Figure 2.16. For 4 KiB and 64 KiB write sizes, we hit the map size limit first, hence cleaning starts after the same number of operations. For the 128 KiB write size we hit the space limit before hitting the map size limit; therefore, cleaning starts after fewer operations than with 64 KiB writes. Doubling the write size to 256 KiB confirms that we are hitting the space limit, since cleaning starts after half the number of operations of 128 KiB writes.

2.4.7 Is the Persistent Cache Shingled?

We next determine whether the STL manages the persistent cache as a circular log [1]. While this would not guarantee that the persistent cache is shingled (an STL could also manage a randomly writable region as a circular log), it would strongly indicate that the persistent cache is shingled. We start with Test 9, which chooses a sequence of 10,000 random LBAs across the drive space and writes the sequence straight through, three times. Given that the persistent cache has space for ≈ 23,000 synchronous block writes (Table 2.2), a trivial STL would fill the cache and start cleaning before the writes complete.

Test 9: Discovering if the Persistent Cache is a Circular Log, Part I

1 Choose 10,000 random blocks across the LBA space.
2 for i ← 0 to i < 3 do
3     Write the 10,000 blocks from step 1 in the chosen order.

Figure 2.17 shows that, unlike in Figure 2.16, cleaning does not start after 23,000 writes. Two simple hypotheses that explain this phenomenon are:

1. The STL manages the persistent cache as a circular log. When the head of the log wraps around, the STL detects stale blocks and overwrites them without cleaning.


Figure 2.16: Write latency of 4 KiB synchronous random writes, corresponding to the first line in Table 2.2. As explained in § 2.4.3, when writing synchronously the drive writes a journal entry for every write operation. Every 240th journal entry write results in a ≈ 32 ms latency, which, as was hypothesized in § 2.4.3, includes a map merge operation. After ≈ 23,000 writes, cleaning starts and the IOPS drops precipitously to 0–3. To emphasize the high latency of writes during cleaning we perform 3,000 more operations. As the graph shows, these writes (23,000–26,000) have ≈ 500 ms latency.


Figure 2.17: Write latency of 30,000 random block writes with a repeating pattern. We choose 10,000 random blocks across the LBA space and write them in the chosen order. We then write the same 10,000 blocks in the same order two more times. Unlike in Figure 2.16, cleaning does not start after ≈ 23,000 writes because, due to the repeating pattern, as the head of the log wraps around, the STL finds only stale blocks, which it can overwrite without cleaning.

2. The STL overwrites blocks in place. Since there are 10,000 unique blocks, we never fill the persistent cache and cleaning never starts.

To nd out which one of these is true, we run Test Ôý. Since there are still Ôý,ýýý unique blocks, if the hypothesis (ò) is true, that is if the STL overwrites the blocks in-place, we should never consume more than Ôý,ýýý writes’ worth of space and cleaning should not start before the writes complete. Figure ò.ԗ shows that cleaning starts aŸer ≈ òç,ýýý writes, invalidating hypothe- sis (ò). Furthermore, if we compare Figure ò.ԗ to Figure ò.Ôâ, we see that the latency of writes aŸer cleaning starts is ≈ Ôýý ms and ≈ ýý ms, respectively. is corroborates hypothesis (Ô)—latency is lower in the former, because aŸer the head of the log wraps around, the STL nds some stale blocks (since these blocks were chosen from a small pool of Ôý,ýýý unique blocks), that it can overwrite without cleaning. When the blocks are chosen across the LBA space, as in Figure ò.Ôâ, once the head wraps around, the STL ends up ešectively cleaning before every write since it almost never nds a stale block.


Figure 2.18: Write latency of 30,000 random block writes chosen from a pool of 10,000 unique blocks. Unlike in Figure 2.17, cleaning starts after ≈ 23,000 writes, because as the head of the log wraps around, the STL does not immediately find stale blocks. However, since the blocks are chosen from a small pool, the STL still finds a large number of stale blocks and can often overwrite without cleaning. Therefore, compared to Figure 2.16 the write latency during cleaning (operations 23,000–26,000) is not as high, since in Figure 2.16 the blocks are chosen across the LBA space and the STL almost never finds a stale block when the head of the log wraps around.

Test 10: Discovering if the Persistent Cache is a Circular Log, Part II

1 Choose 10,000 random blocks across the LBA space.
2 for i ← 0 to i < 30,000 do
3     Randomly choose a block from the blocks in step 1 and write it.
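A toy model of hypothesis (1), a circular-log cache, helps show why the two workloads above behave differently under it; the cache capacity, the notion of "cleaning", and the numbers are deliberately simplified assumptions:

package main

import (
	"fmt"
	"math/rand"
)

const cacheSlots = 23000 // roughly the synchronous-write capacity observed

// circularLogWrites simulates hypothesis (1): the cache is a circular log,
// and a wrapped-around head may overwrite slots whose blocks have since been
// rewritten (stale), cleaning only when it hits live data.
// It returns how many writes forced a cleaning step.
func circularLogWrites(workload []int64) int {
	type slot struct {
		lba int64
		seq int64 // which write of this LBA the slot holds
	}
	logSlots := make([]slot, 0, cacheSlots)
	latest := map[int64]int64{} // newest sequence number per LBA
	head, cleanings := 0, 0
	for _, lba := range workload {
		latest[lba]++
		entry := slot{lba, latest[lba]}
		if len(logSlots) < cacheSlots {
			logSlots = append(logSlots, entry) // cache not yet full
			continue
		}
		// Head wrapped around: a stale slot is overwritten for free,
		// a live one forces a cleaning before reuse.
		if logSlots[head].seq == latest[logSlots[head].lba] {
			cleanings++
		}
		logSlots[head] = entry
		head = (head + 1) % cacheSlots
	}
	return cleanings
}

func main() {
	// Workloads of Tests 9 and 10: 30,000 writes from 10,000 unique blocks.
	pool := make([]int64, 10000)
	for i := range pool {
		pool[i] = rand.Int63n(1 << 30)
	}
	repeating := append(append(append([]int64{}, pool...), pool...), pool...)
	random := make([]int64, 30000)
	for i := range random {
		random[i] = pool[rand.Intn(len(pool))]
	}
	fmt.Println("repeating pattern, cleanings:", circularLogWrites(repeating))
	fmt.Println("random from pool, cleanings:", circularLogWrites(random))
}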

2.4.8 Band Size

STLs proposed to date [9, 29, 79] clean a single band at a time, by reading unmodified data from a band and updates from the cache, merging them, and writing the merged result back to a band. Test 11 determines the band size by measuring the granularity at which this cleaning process occurs.

Test 11: Discovering the Band Size

1 Select an accuracy granularity a and a band size estimate b.
2 Choose a linear region of size 100 × b and divide it into a-sized blocks.
3 Write 4 KiB to the beginning of every a-sized block, in random order.
4 Force cleaning to run for a few seconds and read 4 KiB from the beginning of every a-sized block in sequential order. Consecutive reads with identical high latency identify a cleaned band.
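The post-processing in step 4 can be sketched as a scan for runs of identical high read latencies; the thresholds and the toy trace below are assumptions, not measured values:

package main

import "fmt"

// cleanedBands finds runs of consecutive reads whose latency is both high
// (the block came from its native location) and nearly identical, which
// marks a band that cleaning has already moved out of the persistent cache.
// granularity is the stride a in MiB; a run of length n spans n*granularity.
func cleanedBands(latencies []float64, granularity float64) []float64 {
	const highMs, jitterMs = 8.0, 1.0 // illustrative thresholds
	var sizes []float64
	runStart := -1
	for i := 0; i <= len(latencies); i++ {
		inRun := i < len(latencies) && latencies[i] > highMs &&
			(runStart == -1 || (latencies[i]-latencies[runStart] < jitterMs &&
				latencies[runStart]-latencies[i] < jitterMs))
		switch {
		case inRun && runStart == -1:
			runStart = i
		case !inRun && runStart != -1:
			if i-runStart >= 3 { // ignore tiny runs
				sizes = append(sizes, float64(i-runStart)*granularity)
			}
			runStart = -1
		}
	}
	return sizes
}

func main() {
	// Toy trace: cached reads around 2-4 ms, one 30 MiB cleaned band at ~10 ms.
	trace := []float64{3, 2, 4, 10, 10, 10.2, 10.1, 10, 10, 9.9, 10, 10.1, 10,
		10, 10, 10.2, 10, 10, 9.9, 10, 10, 10.1, 10, 10, 10, 10, 10, 10, 10,
		10, 10, 10, 10, 3, 2}
	fmt.Println("cleaned band sizes (MiB):", cleanedBands(trace, 1.0))
}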

Assuming that the linear region chosen in Test 11 lies within a region of equal track length, for data that is not in the persistent cache, 4 KB reads at a fixed stride a should see identical latencies, that is, a rotational delay equivalent to (a mod T) bytes, where T is the track length. Conversely, reads of data from the cache will see varying delays in the case of a disk cache, due to the different (and random) order in which the blocks were written, or sub-millisecond delays in the case of a flash cache. With aggressive cleaning, after pausing to allow the disk to clean a few bands, a linear read of the written blocks will identify the bands that have been cleaned.


For a drive with lazy cleaning, the linear region is chosen so that the writes fill the persistent cache and force a few bands to be cleaned, which again may be detected by a linear read of the written data.

In Figure 2.19 we see the results of Test 11 for a = 1 MiB and b = 50 MiB, respectively, with the region located at the 2.5 TB offset; for each drive we zoom in to show an individual band that has been cleaned. We correctly identify the band size for the emulated drives (see Table 2.1). The band size of Seagate-SMR at this location is seen to be 30 MiB; running the test at different offsets shows that bands are iso-capacity within a zone (§ 2.4.12) but vary from 36 MiB at the OD to 17 MiB at the ID. Figure 2.20 shows the head position of Seagate-SMR corresponding to the time period in Figure 2.19. It shows that the head remains at the OD during the reads from the persistent cache up to 454 MiB, then seeks to the 2.5 TB offset and stays there for 30 MiB, and then seeks back to the cache at the OD, confirming that the blocks in the band are read from their native locations.

Figure 2.19: Discovering the band size. The flat latency regions correspond to sequentially reading a complete band from its native location.

Figure 2.20: Head position during the sequential read for Seagate-SMR, corresponding to the time period in Figure 2.19.

2.4.9 Cleaning Time of a Single Band

We observed that cleaning a band whose single block was overwritten can take ≈ 600 ms, whereas if we overwrite 2 MiB of the band by skipping every other block, the cleaning time increases to ≈ 1.6 s (§ 2.4.5). While the ≈ 600 ms cleaning time due to a single block overwrite gives us a lower bound on the cleaning time, we do not know the upper bound. Now that we understand the persistent cache structure and band size, in addition to the cleaning algorithm, we create an adversarial workload

that will give us an upper bound for the cleaning time of a single band. Table 2.2 shows that with a queue depth of 31, we can write 182,270 blocks, that is, 5,880 journal entries, resulting in 700 MiB of host writes. Assuming the band size is 35 MiB at the OD, 700 MiB corresponds to 20 bands. Therefore, if we distribute (through random writes) the blocks of 20 bands among 5,880 journal entries, the drive will need to read every packet to clean a single band. Assuming a 5–10 ms read time for a packet, reading all of the packets to assemble the band will take 29–60 s. To confirm this hypothesis, we shuffled the first 700 MiB worth of blocks and wrote them with a queue depth of 31. The cleaning took ≈ 15 minutes, which is ≈ 45 s per band.
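Written out, the arithmetic behind this adversarial bound is:

\[
\frac{182{,}270 \text{ writes}}{31 \text{ writes/entry}} \approx 5{,}880 \text{ journal entries}, \qquad
\frac{700 \text{ MiB}}{35 \text{ MiB/band}} = 20 \text{ bands},
\]
\[
5{,}880 \text{ entries} \times (5\text{--}10) \text{ ms/entry} \approx 29\text{--}60 \text{ s per band}.
\]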

2.4.10 Block Mapping

Once we discover the band size (§ 2.4.8), we can use Test 12 to determine the mapping type. This test exploits the varying inter-track switching latency between different track pairs to detect whether a band was remapped. After overwriting the first two tracks of band b, cleaning will move the band to its new location, which is a different physical location only if dynamic mapping is used. Plotting the latency graphs of step 2 and step 4 will produce the same pattern for static mapping and a different pattern for dynamic mapping.

Test 12: Discovering mapping type.

1 Choose two adjacent iso-capacity bands a and b; set n to the number of blocks in a track.
2 for i ← 0 to i < 2 do
      for j ← 0 to j < n do
          Read block j of track 0 of band a
          Read block j of track i of band b

ç Overwrite the rst two tracks of band b; force cleaning to run. ¥ Repeat step ò.

Adapting this test to a drive with lazy cleaning involves some extra work. First, we should start the test on a drive after a secure erase, so that the persistent cache is empty. Due to lazy cleaning, the graph of step 4 will be the graph of switching between a track and the persistent cache. Therefore, we fill the cache until cleaning starts and repeat step 2 once in a while, comparing its graph to the previous two: if it is similar to the last, then the data is still in the cache; if it is similar to the first, then the drive uses static mapping; otherwise, the drive uses dynamic mapping.

We used the terms track and block to concisely describe the test above, but the sizes chosen for these parameters need not match the track size and block size of the underlying drive. Figure 2.21, for example, shows the plots for the test on all of the drives using 2 MiB for the track size and 16 KiB for the block size. The latency pattern before and after cleaning is different only for Emulated-SMR-3 (seen on the top right), correctly indicating that it uses dynamic mapping. For all of the remaining drives, including Seagate-SMR, the latency pattern is the same before and after cleaning, indicating static mapping.


Figure 2.21: Detecting mapping type.

2.4.11 Effect of Mapping Type on Drive Reliability

The type of band mapping used in a DM-SMR drive affects the drive's reliability for the reasons explained next. Figure 2.22 shows the sequential read throughput of Seagate-SMR. We get a similar graph for sequential writes when we enable the volatile cache, which suggests that the drive sustains full throughput for sequential writes. Seagate-SMR does not contain flash and it uses static mapping; therefore, it can achieve full throughput only if it buffers the data in the volatile cache and writes directly to the band, bypassing the persistent cache. This performance improvement, however, comes with a risk of data loss. Since there is no backup of the overwritten data, if power is lost midway through the band overwrite, blocks in the following tracks are left in a corrupt state, resulting in data loss. We also lose the new data, since it was buffered in the volatile cache.

A similar error, known as a torn write [15, 109], occurs in CMR drives as well, wherein only a portion of a sector gets written before power is lost. In CMR drives, the time required to atomically overwrite a sector is small enough that reports of such errors are rare [109]. A DM-SMR drive with static mapping, on the other hand, is similar to a CMR drive with large (in the case of Seagate-SMR, 17–36 MiB) sectors. Therefore, there is a high probability that a random power loss during streaming sequential writes will disrupt a band overwrite. Figure 2.23 describes two error scenarios in a hypothetical DM-SMR drive that uses static mapping.

These errors are a consequence of the mapping scheme used, since the only way of sustaining full throughput in such a scheme is to write to the band directly. Introducing a small amount of flash to a DM-SMR drive for persistent buffering has its own challenges: exploiting parallelism for fast flash writes and managing wear-leveling is possible only if a large amount of flash is used, which is not feasible inside a DM-SMR drive. On the other hand, when using a dynamic band mapping scheme similar to the fully-associative STL, a drive can write the new contents of a band directly to a free band without jeopardizing the existing band data. This, followed by an atomic switch in the mapping table, would result in full-throughput sequential writes without sacrificing reliability.


The idea is similar to log-block FTLs [106, 142] that have been successful in overcoming slow block overwrites in NAND flash. For the reasons described, we expect that the next generation of DM-SMR drives will use dynamic band mapping.

We successfully reproduced torn writes on Seagate-SMR by using an Arduino UNO R3 board with a 2-channel relay shield to control the power to the drive. After running Test 13 at arbitrary offsets, we could reproduce hard read errors, as shown in Figure 2.24, on all of our sample drives. The offset where errors occurred differed between drives. These errors disappeared after overwriting the affected regions.

Figure 2.22: Sequential read throughput of Seagate-SMR.

Test 13: Reproducing torn writes.

1 Choose an offset.
2 for i ← 0 to i < 50 do
3     Power on the drive and start 1 MiB sequential writes at offset.
4     After 10 seconds, power off the drive; wait for 5 seconds and power on the drive.
5     Starting at the crash point, go back 5,000 blocks and read 6,000 blocks.

2.4.12 Zone Structure

We use sequential reads (Test 14) to discover the zone structure of Seagate-SMR. (As mentioned before, the notion of zone here comes from zone bit recording [?] and is unrelated to SMR zones.) While there are no such drives yet, on drives with dynamic mapping a secure erase that would restore the mapping to the default state may be necessary for this test to work. Figure 2.22 shows the zone profile of Seagate-SMR, with a zoom to the beginning.

Test 14: Discovering Zone Structure

1 Enable kernel read-ahead and drive look-ahead.
2 Sequentially read the whole drive in 1 MiB blocks.
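Test 14 as a stand-alone sketch; the device path is a placeholder, and read-ahead is assumed to have been enabled beforehand (e.g., with hdparm and blockdev), as the test requires:

package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

func main() {
	// Assumed device path; kernel read-ahead and drive look-ahead should be
	// enabled, as in step 1 of Test 14.
	f, err := os.Open("/dev/sdX")
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer f.Close()

	buf := make([]byte, 1<<20) // 1 MiB reads, as in step 2
	var bytesInWindow int64
	windowStart := time.Now()
	for {
		n, err := io.ReadFull(f, buf)
		bytesInWindow += int64(n)
		if elapsed := time.Since(windowStart); elapsed >= time.Second {
			// One throughput sample per second; plotting these samples
			// reproduces the zone profile (plateaus separated by head switches).
			mib := float64(bytesInWindow) / (1 << 20) / elapsed.Seconds()
			fmt.Printf("%.1f MiB/s\n", mib)
			bytesInWindow, windowStart = 0, time.Now()
		}
		if err != nil {
			break // io.EOF / io.ErrUnexpectedEOF at the end of the device
		}
	}
}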


Figure 2.23: Torn write scenarios in a hypothetical DM-SMR drive with bands consisting of 3 tracks and a write head width of 1.5 tracks. The tracks are shown horizontally instead of circularly to make the illustration clear. (1) shows the logical view of the band consisting of 3 tracks, each track having 4,000 sectors. (2) shows the physical layout of the tracks on a platter, which accounts for the track skew. (3a) and (3b) show the corruption scenario when power is lost during a track switch. The red region in (3a) shows the situation after track 1 of the band has been overwritten: track 1 contains new data whereas track 2 is corrupted; track 3 contains the old data. (3b) shows the situation after track 2 has been overwritten: track 1 and track 2 contain new data whereas track 3 is corrupted. If power is lost while the head switches from track 2 to track 3, the block ranges 10,000–11,999 and 8,000–9,999, or the single range 8,000–11,999, are left in a corrupt state. (4a) and (4b) show the corruption scenario when power is lost during a track overwrite. (4a) is identical to (3a) and shows the situation after track 1 of the band has been overwritten. (4b) shows the situation where power is lost after blocks 4,000–4,999 have been overwritten. In this case, the block ranges 7,000–7,999, 5,000–6,999, and 11,000–11,999, or the two ranges 5,000–7,999 and 11,000–11,999, are left in a corrupt state.

Similar to CMR drives, the throughput falls as we reach higher LBAs; unlike CMR drives, there is a pattern that repeats throughout the graph, shown by the zoomed part. This pattern has an axis of symmetry, indicated by the dotted vertical line at the 2,264th second. There are eight distinct plateaus to the left and to the right of the axis with similar throughputs. The fixed throughput within a single plateau and the sharp change in throughput between plateaus suggest a wide radial stroke and a head switch. Plateaus correspond to large zones of 18–20 GiB, gradually decreasing to 4 GiB as we approach higher LBAs. The slight decrease in throughput in the symmetric plateaus on the right is due to moving from larger to smaller radii, where the sector-per-track count decreases; therefore, throughput decreases as well. We confirmed these hypotheses using the head position graph shown in Figure 2.25 (a), which corresponds to the time interval of the zoomed graph of Figure 2.22. Unlike with CMR drives, where we could not observe head switches due to narrow radial strokes, with this DM-SMR drive head switches are visible to the unaided eye. Figure 2.25 (a) shows that the head starts at the OD and slowly moves towards the MD, completing this inward move at the 1,457th second, indicated by the vertical dotted line. At this point, the head has just completed a wide radial stroke reading tens of gigabytes from the top surface of the first platter, and it performs a jump back to the OD and starts a similar stroke on the bottom surface of the first platter.

ata2.00: exception Emask 0x0 SAct 0x1000 SErr 0x0 action 0x0
ata2.00: irq_stat 0x40000008
ata2.00: failed command: READ FPDMA QUEUED
ata2.00: cmd 60/c0:60:40:1d:19/02:00:80:00:00/40 tag 12 ncq 360448 in
         res 41/00:c0:f8:1e:19/00:02:80:00:00/00 Emask 0x401 (device error)
ata2.00: status: { DRDY ERR }

Figure 2.24: Hard read error under Linux kernel 3.16 when reading a region affected by a torn write.

[Figure 2.25 panels: (a) head position at the outer diameter (OD); (b) head position at the inner diameter (ID); (c) head position at the middle diameter (MD); x-axis: Time (s).]

Figure 2.25: Seagate-SMR head position during sequential reads at different offsets.

The direction of the head movement indicates that the shingling direction is towards the ID at the OD. The head completes the descent through the platters at the 2,264th second—indicated by the vertical solid line—and starts its ascent, reading the surfaces in the reverse order. These wide radial strokes create "horizontal zones" that consist of thousands of tracks on the same surface, as opposed to the "vertical zones" spanning multiple platters in CMR drives. We expect these horizontal zones to be the norm in DM-SMR drives, since they facilitate SMR mechanisms like the allocation of iso-capacity bands, static mapping, and dynamic band size adjustment [68]. Figure 2.25 (b), which corresponds to the end of Figure 2.22, shows that the direction of the head movement is reversed at the ID, indicating that both at the OD and at the ID the shingling direction is towards the middle diameter. To our surprise, Figure 2.25 (c) shows that a conventional serpentine layout with wide serpents is used at the MD. We speculate that although the whole surface is managed as if it were shingled, there is a large region in the middle that is not shingled.

It is hard to confirm the shingling direction without observing the head movement. The existence of "horizontal zones", on the other hand, can also be confirmed by contrasting the sequential read latency graphs of Seagate-SMR and Seagate-CMR. Figure 2.26 (a) shows the latency graph for the zoomed region in Figure 2.22. As expected, the shape of the latency graph matches the shape of the throughput graph mirrored around the x axis.

[Figure 2.26 panels: (a) Seagate-SMR; (b) Seagate-CMR; y-axis: Latency (ms); x-axis: Time (s).]

Figure 2.26: Sequential read latency of Seagate-CMR and Seagate-SMR corresponding to a complete cycle of ascent and descent through the platter surfaces. While Seagate-CMR completes the cycle in 3.5 seconds, Seagate-SMR completes it in 1,800 seconds, since the latter reads thousands of tracks from a single surface before switching to the next surface.

Figure 2.26 (b) shows an excerpt from the latency graph of Seagate-CMR that is also repeated throughout the latency graph. This graph too has a pattern that is mirrored at the center, also indicating a completed ascent and descent through the surfaces. However, Seagate-CMR completes the cycle in 3.5 s since it reads only a few tracks from each surface, whereas Seagate-SMR completes the cycle in 1,800 s, indicating that it reads thousands of tracks from a single surface. The smaller spikes in the Seagate-CMR graph correspond to track switches, and the higher spikes correspond to head switches. While the extra 1 ms of head switch latency every few megabytes does not affect the accuracy of emulation, it shows up in some of the tests, for example as the bump around the 4,030th MiB in Figure 2.19. Figure 2.26 also shows that the number of platters can be inferred from the latency graph of sequential reads.

2.5 Related Work

Little has been published on the subject of the system-level behavior of DM-SMR drives. Although several works have discussed requirements and possibilities for the use of shingled drives in systems [9, 114], only three papers to date present example translation layers and simulation results [29, 78, 118]. A range of STL approaches is found in the patent literature [40, 65, 68, 79], but evaluation and analysis are lacking. Several SMR-specific file systems have been proposed, such as SMRfs [76], SFS [115], and HiSMRfs [98]. He and Du [81] propose a static mapping to minimize rewrites for in-place updates, which requires a high guard overhead (20%) and assumes that file system free space is contiguous in the upper LBA region. Pitchumani et al. [147] present an emulator implemented as a Linux device mapper target that mimics shingled writing on top of a CMR drive. Tan et al. [187] describe a simulation of the S-blocks algorithm, with a more accurate simulator calibrated with data from a real CMR drive. Shafaei et al. [172, 173] propose DM-SMR models based on the Skylight work. Wu et al. [218, 219] evaluate the performance of host-aware SMR drives, which are hybrid SMR drives that can present both the block interface and the zone interface. This work draws heavily on earlier disk characterization studies that have used micro-benchmarks to elicit details of internal performance, such as [161], [77], [108], [186], and [216]. Due to the presence of a translation layer, however, the specific parameters examined in this work (and the micro-benchmarks for measuring them) are different.

Property                     ST5000AS0011        ST8000AS0002
Drive Type                   SMR                 SMR
Persistent Cache Type        Disk                Disk
Cache Layout and Location    Single, at the OD   Single, at the OD
Cache Size                   20 GiB              25 GiB
Cache Map Size               200,000             250,000
Band Size                    17–36 MiB           15–40 MiB
Block Mapping                Static              Static
Cleaning Type                Aggressive          Aggressive
Cleaning Algorithm           FIFO                FIFO
Cleaning Time                0.6–1.6 s/band      0.6–1.6 s/band
Zone Structure               4–20 GiB            5–40 GiB
Shingling Direction          Towards MD          N/A

Table 2.3: Properties of the 5 TB and the 8 TB Seagate drives discovered using the Skylight methodology. The benchmarks worked out of the box on the 8 TB drive. Since the 8 TB drive was on loan, we did not drill a hole in it; therefore, the shingling direction for it is not available.

2.6 Summary and Recommendations

As Table 2.3 shows, the Skylight methodology enables us to discover key properties of two drive-managed SMR disks automatically. With manual intervention, it allows us to completely reverse engineer a drive. The purpose of doing so is not just to satisfy our curiosity, however, but to guide both their use and their evolution. In particular, we draw the following conclusions from our measurements of the 5 TB Seagate drive:

• Write latency with the volatile cache disabled is high (Test 1). This appears to be an artifact of specific design choices rather than of fundamental requirements, and we hope for it to drop in later firmware revisions.

• Sequential throughput (with the volatile cache disabled) is much lower (by 3× or more, depending on write size) than for conventional drives. (We omitted these test results, as performance is identical to the random writes in Test 1.) Due to the use of static mapping (Test 12), achieving full sequential throughput requires enabling the volatile cache.

• Random I/O throughput (with the volatile cache enabled or with a high queue depth) is high (Test 7)—15× that of the equivalent CMR drive. This is a general property of any DM-SMR drive using a persistent cache.

• Throughput may degrade precipitously when the cache fills after many writes (Table 2.2). The point at which this occurs depends on write size and queue depth. (Although results with the volatile cache enabled are not presented in § 2.4.6, they are similar to those for a queue depth of 31.)

• Background cleaning begins after ≈1 second of idle time and proceeds in steps requiring 0.6–4 seconds of idle time to clean a single band (§ 2.4.9).

• Sequential reads of randomly-written data will result in random-like read performance until cleaning completes (§ 2.4.4).

DM-SMR drives like the ones we studied should offer good performance if the following conditions are met: (a) the volatile cache is enabled or a high queue depth is used, (b) writes display strong spatial locality, modifying only a few bands at any particular time, (c) non-sequential writes (or all writes, if the volatile cache is disabled) occur in bursts of less than 16 GB or 180,000 operations (Table 2.2), and (d) long powered-on idle periods are available for background cleaning.

The extra capacity and low energy consumption of DM-SMR drives make them ideal for increasing the cost-effectiveness of data storage in distributed storage systems. This is only possible, however, by running a general-purpose file system, such as ext4 or XFS, on top of DM-SMR drives. Yet in this chapter we have shown that DM-SMR drives have a high block interface tax—even a moderate amount of random writes leads to prohibitively high garbage collection overhead. We expect the high block interface tax of DM-SMR drives to cause performance problems with current file systems, which have been optimized for CMR drives for decades. In the next chapter, we first confirm this hunch and then use our newly acquired knowledge of DM-SMR drive internals to modify the ext4 file system to avoid I/O patterns that amplify the block interface tax.

Chapter 3

Reducing the Block Interface Tax in DM-SMR Drives by Evolving Ext4

Distributed storage systems can avoid the block interface tax either by modifying file systems to work directly on zoned devices or by modifying file systems to avoid I/O patterns that amplify the block interface tax. In this chapter we take the latter approach and introduce a simple technique that almost eliminates random writes in journaling file systems. We demonstrate our technique on the ext4 file system and achieve a significant performance improvement on DM-SMR drives.

3.1 SMR Adoption and Ext4-Lazy Summary

The industry has tried to address SMR adoption by introducing two kinds of SMR drives: drive-managed (DM-SMR) and host-managed (HM-SMR). DM-SMR drives are a drop-in replacement for conventional drives that offer higher capacity with the traditional block interface, but they can suffer performance degradation when subjected to non-sequential write traffic. Unlike CMR drives, which have a low but consistent throughput under random writes, DM-SMR drives offer high throughput for a short period followed by a precipitous drop, as shown in Figure 3.1. HM-SMR drives, on the other hand, offer the backward-incompatible zone interface that requires major changes to the I/O stack to allow SMR-aware software to optimize its access pattern.

The new HM-SMR drive interface presents an interesting problem to storage researchers who have already proposed new file system designs based on it [98, 115, 33]. It also presents a challenge to the developers of existing file systems [36, 57, 59] who have been optimizing their code for CMR drives for years. There have been attempts to revamp mature Linux file systems like ext4 and XFS [35, 140, 141] to use the new zone interface, but these attempts have stalled due to the large amount of redesign required. The Log-Structured File System (LFS) [158], on the other hand, has an architecture that can be most easily adapted to an HM-SMR drive. However, although LFS has been influential, hard disk file systems based on it [169, 107] have not reached production quality in practice [124, 159, 135].

In this work, we take an alternative approach to SMR adoption. Instead of redesigning for the zone interface used by HM-SMR drives, we make an incremental change to a mature, high-performance file system to optimize its performance on a DM-SMR drive.


Figure 3.1: Throughput of CMR and DM-SMR drives from Table 3.1 under 4 KiB random write traffic. The CMR drive has a stable but low throughput under random writes. The DM-SMR drives, on the other hand, have a short period of high throughput followed by a continuous period of ultra-low throughput.

Type     Vendor            Model          Capacity   Form Factor
DM-SMR   Seagate           ST8000AS0002   8 TB       3.5 inch
DM-SMR   Seagate           ST5000AS0011   5 TB       3.5 inch
DM-SMR   Seagate           ST4000LM016    4 TB       2.5 inch
DM-SMR   Western Digital   WD40NMZW       4 TB       2.5 inch
CMR      Western Digital   WD5000YS       500 GB     3.5 inch

Table 3.1: CMR and DM-SMR drives from two vendors used for evaluation.

The systems community is no stranger to taking a revolutionary approach when faced with a new technology [23], only to discover that the existing system can be evolved to take advantage of the new technology with a little effort [24]. Following a similar evolutionary approach, we take the first step towards optimizing the ext4 file system for DM-SMR drives, observing that random writes are even more expensive on these drives and that metadata writeback is a key generator of them.

We introduce ext4-lazy, a small change to ext4 that eliminates most metadata writeback. Like other journaling file systems [152], ext4 writes metadata twice: as Figure 3.2 (a) shows, it first writes the metadata block to a temporary location J in the journal and then marks the block as dirty in memory. Once the block has been in memory for long enough (controlled by /proc/sys/vm/dirty_expire_centisecs in Linux), the writeback (or flusher) thread writes it to its static location S, resulting in a random write. Although metadata writeback is typically a small portion of a workload, it results in many random writes, as Figure 3.3 shows. Ext4-lazy, on the other hand, marks the block as clean after writing it to the journal, to prevent the writeback, and inserts a mapping (S, J) into an in-memory map, allowing the file system to access the block in the journal, as seen in Figure 3.2 (b). Ext4-lazy uses a large journal so that it can continue writing updated blocks while reclaiming space from stale blocks. During mount, it reconstructs the in-memory map from the journal, resulting in a modest increase in mount time. Our results show that ext4-lazy significantly improves performance on DM-SMR drives, as well as on CMR drives for metadata-heavy workloads.



Figure 3.2: (a) Ext4 writes a metadata block to disk twice. It first writes the metadata block to the journal at some location J and marks it dirty in memory. Later, the writeback thread writes the same metadata block to its static location S on disk, resulting in a random write. (b) Ext4-lazy writes the metadata block approximately once to the journal and inserts a mapping (S, J) into an in-memory map so that the file system can find the metadata block in the journal.

Our key contribution in this work is the design, implementation, and evaluation of ext4-lazy on DM-SMR and CMR drives. Our change is minimally invasive—we modify 80 lines of existing code and introduce the new functionality in additional files totaling 600 lines of C code. On a metadata-light (≤ 1% of total writes) file server benchmark, ext4-lazy increases DM-SMR drive throughput by 1.7–5.4×. For directory traversal and metadata-heavy workloads it achieves a 2–13× improvement on both DM-SMR and CMR drives. In addition, we make two contributions that are applicable beyond our proposed approach:

• For purely sequential write workloads, DM-SMR drives perform at full throughput and do not suffer performance degradation. We identify the minimal sequential I/O size that triggers this behavior for a popular DM-SMR drive.

• We show that for physical journaling [152], a small journal is a bottleneck for metadata-heavy workloads. Based on our results, ext4 developers have increased the default journal size from 128 MiB to 1 GiB for file systems over 128 GiB [193].

3.2 Background on the Ext4 File System

e ext¥ le system evolved [ÔÔÔ, ÔòÞ] from extò [ò—], which was inžuenced by Fast File System (FFS) [Ôò—]. Similar to FFS, extò divides the disk into cylinder groups—or as extò calls them, block groups—and tries to put all blocks of a le in the same block group. To further increase locality, the metadata blocks (inode bitmap, block bitmap, and inode table) representing the les in a block group are also placed within the same block group, as Figure ç.¥ (a) shows. Group descriptor blocks, whose location is xed within the block group, identify the location of these metadata blocks that are typically located in the rst megabyte of the block group. In extò the size of a block group was limited to Ôò— MiB—the maximum number of ¥ KiB data blocks that a ¥ KiB block bitmap can represent. Ext¥ introduced žexible block groups or žex_


Figure 3.3: Offsets of data and metadata writes obtained with blktrace when compiling Linux kernel 4.6 with all of its modules on a fresh ext4 file system. The workload writes 12 GiB of data, 185 MiB of journal (omitted from the graph), and only 98 MiB of metadata, making metadata 0.77% of total writes.

Ext4 introduced flexible block groups, or flex_bgs [111], a set of contiguous block groups (we assume the default size of 16 block groups per flex_bg) whose metadata is consolidated in the first 16 MiB of the first block group within the set, as shown in Figure 3.4 (b).

Ext4 ensures metadata consistency via journaling; however, it does not implement journaling itself. Rather, it uses a generic kernel layer called the Journaling Block Device [194] that runs in a separate kernel thread called jbd2. In response to file system operations, ext4 reads metadata blocks from disk, updates them in memory, and exposes them to jbd2 for journaling. For increased performance, jbd2 batches metadata updates from multiple file system operations (by default, for 5 seconds) into a transaction buffer and atomically commits the transaction to the journal—a circular log of transactions with a head and a tail pointer. A transaction may commit early if the buffer reaches its maximum size or if a synchronous write is requested. In addition to metadata blocks, a committed transaction contains descriptor blocks that record the static locations of the metadata blocks within the transaction. After a commit, jbd2 marks the in-memory copies of the metadata blocks as dirty so that the writeback threads will write them to their static locations. If a file system operation updates an in-memory metadata block before its dirty timer expires, jbd2 writes the block to the journal as part of a new transaction and delays the writeback of the block by resetting its timer.

On DM-SMR drives, when the metadata blocks are eventually written back, they dirty the bands that are mapped to the metadata regions in a flex_bg, as seen in Figure 3.4 (c). The bottom part of Figure 3.4 (c) shows the logical view of the Seagate ST8000AS0002—an 8 TB DM-SMR drive we studied in Chapter 2. With an average band size of 30 MiB, the drive has over 260,000 bands with sectors statically mapped to the bands, and a ≈25 GiB persistent cache that is not visible to the host (and not shown in the figure). The STL in this drive detects sequential writes and streams them directly to the bands, bypassing the persistent cache. Random writes, however, end up in the persistent cache, dirtying bands. Since a metadata region is not aligned with a band, metadata writes to it may dirty zero, one, or two extra bands, depending on whether the metadata region spans one or two bands and whether the data around the metadata region has been written.

[Figure 3.4 panels: (a) ext2 block group—super block, group descriptors, block bitmap, inode bitmap, and inode table (~1 MiB), followed by ~127 MiB of data blocks; (b) ext4 flex_bg—~16 MiB of metadata for all block groups in the flex_bg, followed by the data blocks of block groups 0–15 (2 GiB total); (c) disk layout of an ext4 partition on an 8 TB DM-SMR drive—flex_bgs 0–3,999 overlaid on bands 0–266,566.]

Figure 3.4: (a) In ext2, the first megabyte of a 128 MiB block group contains the metadata blocks describing the block group, and the rest is data blocks. (b) In ext4, a single flex_bg concatenates multiple (16 in this example) block groups into one giant block group and puts all of the metadata in the first block group. (c) Modifying data in a flex_bg will result in a metadata write that may dirty one or two bands, seen at the boundary of bands 266,565 and 266,566.

3.3 Design and Implementation of ext4-lazy

We start by motivating ext4-lazy, follow with a high-level view of our design, and finish with the implementation details.

3.3.1 Motivation

The motivation for ext4-lazy comes from two observations: (1) metadata writeback in ext4 results in random writes that cause a significant cleaning load on a DM-SMR drive, and (2) file system metadata comprises a small set of blocks, and hot (frequently updated) metadata is an even smaller set. The corollary of the latter observation is that managing hot metadata in a circular log several times the size of the hot metadata turns random writes into purely sequential writes, reducing the cleaning load on a DM-SMR drive. We first give calculated evidence supporting the first observation and follow with empirical evidence for the second observation.

On an 8 TB partition, there are about 4,000 flex_bgs, the first 16 MiB of each containing the metadata region, as shown in Figure 3.4 (c). With a 30 MiB band size, updating every flex_bg would dirty 4,000 bands on average, requiring cleaning of 120 GiB worth of bands and generating 360 GiB of disk traffic. A workload touching 1/16 of the whole disk, that is, 500 GiB of files, would dirty at least 250 bands, requiring 22.5 GiB of cleaning work. The cleaning load increases further if we consider floating metadata like extent tree blocks and directory blocks.

To measure the hot metadata ratio, we emulated the I/O workload of a build server on ext4 by running 128 parallel Compilebench [126] instances and categorized all of the writes completed by the disk. Out of 433 GiB of total writes, 388 GiB were data writes, 34 GiB were journal writes, and 11 GiB were metadata writes. The total size of unique metadata blocks was 3.5 GiB, showing that it was only 0.8% of total writes, and that 90% of journal writes were overwrites.
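For reference, the arithmetic above can be written out explicitly; the factor of three is simply the ratio of the quoted cleaning traffic to the dirtied band capacity (360 GiB / 120 GiB), taken as given from the text rather than derived here:

\[
4{,}000 \times 30\ \text{MiB} \approx 120\ \text{GiB dirtied}, \qquad 3 \times 120\ \text{GiB} = 360\ \text{GiB of cleaning traffic},
\]
\[
\frac{500\ \text{GiB}}{2\ \text{GiB per flex\_bg}} = 250\ \text{flex\_bgs} \;\Rightarrow\; 3 \times 250 \times 30\ \text{MiB} \approx 22.5\ \text{GiB of cleaning work}.
\]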

3.3.2 Design

At a high level, ext4-lazy adds the following components to ext4 and jbd2:

Map: Ext4-lazy tracks the location of metadata blocks in the journal with jmap—an in-memory map that associates the static location S of a metadata block with its location J in the journal. The mapping is updated whenever a metadata block is written to the journal, as shown in Figure 3.2 (b).

Indirection: In ext4-lazy all accesses to metadata blocks go through jmap. If the most recent version of a block is in the journal, there will be an entry in jmap pointing to it; if no entry is found, then the copy at the static location is up-to-date.

Cleaner: The cleaner in ext4-lazy reclaims space from locations in the journal that have become stale, that is, invalidated by writes of new copies of the same metadata block.

Map reconstruction on mount: On every mount, ext4-lazy reads the descriptor blocks from the transactions between the tail and the head pointer of the journal and populates jmap.

3.3.3 Implementation

We now detail our implementation of the above components and the trade-offs we make during the implementation. We implement jmap as a standard Linux red-black tree [113] in jbd2. After jbd2 commits a transaction, it updates jmap with each metadata block in the transaction and marks the in-memory copies of those blocks as clean so they will not be written back. We add indirect lookup of metadata blocks to ext4 by changing the call sites that read metadata blocks to use a function that looks up the metadata block location in jmap, as shown in Listing 3.1, modifying 40 lines of ext4 code in total.

- submit_bh(READ | REQ_META | REQ_PRIO, bh);
+ jbd2_submit_bh(journal, READ | REQ_META | REQ_PRIO, bh);

Listing 3.1: Adding indirection to a call site reading a metadata block.

The indirection allows ext4-lazy to be backward-compatible and to gradually move metadata blocks to the journal. However, the primary reason for the indirection is to be able to migrate cold (not recently updated) metadata back to its static location during cleaning, leaving only hot metadata in the journal. We implement the cleaner in jbd2 in just 400 lines of C, leveraging the existing functionality. In particular, the cleaner merely reads live metadata blocks from the tail of the journal and adds them to the transaction buffer using the same interface used by ext4. For each transaction it keeps a doubly-linked list that links the jmap entries containing the live blocks of the transaction. Updating a jmap entry invalidates a block and removes it from the corresponding list. To clean a transaction, the cleaner identifies the live blocks of the transaction in constant time using the transaction's list, reads them, and adds them to the transaction buffer. The beauty of this cleaner is that it does not "stop the world", but transparently mixes cleaning with regular file system operations, causing no interruptions to them, as if cleaning were just another operation. We use a simple cleaning policy—after committing a fixed number of transactions, clean a fixed number of transactions—and leave sophisticated policy development, such as hot and cold separation, for future work.
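To make the data structure concrete, here is a hedged user-space sketch of a jmap with its update and lookup paths; it is not the kernel implementation. The names, the use of POSIX tsearch() in place of the kernel red-black tree, and the example block numbers are all assumptions made to keep the example self-contained and runnable.

/* User-space model of jmap: static location S -> journal location J.
 * Build: cc -O2 -o jmapdemo jmapdemo.c */
#include <search.h>
#include <stdio.h>
#include <stdlib.h>

struct jmap_entry {
    unsigned long long s_blk;   /* static location S */
    unsigned long long j_blk;   /* location J in the journal */
    unsigned int tid;           /* transaction that wrote the block */
};

static void *jmap_root;

static int cmp(const void *a, const void *b)
{
    const struct jmap_entry *x = a, *y = b;
    return (x->s_blk > y->s_blk) - (x->s_blk < y->s_blk);
}

/* Called after a transaction commit: remember where S now lives. */
void jmap_update(unsigned long long s, unsigned long long j, unsigned int tid)
{
    struct jmap_entry key = { s, 0, 0 }, *e;
    void **node = tsearch(&key, &jmap_root, cmp);
    if (*(struct jmap_entry **)node == &key) {      /* new entry */
        e = malloc(sizeof *e);
        *e = (struct jmap_entry){ s, j, tid };
        *(struct jmap_entry **)node = e;
    } else {                                        /* overwrite: old copy is stale */
        e = *(struct jmap_entry **)node;
        e->j_blk = j;
        e->tid = tid;
    }
}

/* Indirection: read the journal copy if one exists, else the static copy. */
unsigned long long jmap_lookup(unsigned long long s)
{
    struct jmap_entry key = { s, 0, 0 }, **node = tfind(&key, &jmap_root, cmp);
    return node ? (*node)->j_blk : s;
}

int main(void)
{
    jmap_update(123456, 2048, 7);   /* block 123456 committed at J = 2048 */
    jmap_update(123456, 4096, 9);   /* newer copy; J = 2048 becomes stale */
    printf("read block 123456 from %llu\n", jmap_lookup(123456));
    printf("read block 999999 from %llu\n", jmap_lookup(999999));
    return 0;
}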


Figure 3.5: Completion time for a benchmark creating 100,000 files on ext4-stock (ext4 with a 128 MiB journal) and on ext4-baseline (ext4 with a 10 GiB journal).

Figure 3.6: The volume of dirty pages during the benchmark runs, obtained by sampling /proc/meminfo every second on ext4-stock and ext4-baseline.

Map reconstruction is a small change to the recovery code in jbd2. Stock ext4 resets the journal on a normal shutdown; finding a non-empty journal on mount is a sign of a crash and triggers the recovery process. With ext4-lazy, the state of the journal represents the persistent image of jmap; therefore, ext4-lazy never resets the journal and always "recovers". In our prototype, ext4-lazy reconstructs the jmap by reading descriptor blocks from the transactions between the tail and head pointers of the journal, which takes ≈5 seconds when the space between the head and tail pointers is ≈1 GiB.
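A hedged sketch of the reconstruction step follows. The on-disk parsing is elided: a parsed_txn record standing in for a transaction whose descriptor blocks have already been read is an assumption, as is the convention that the journal copies immediately follow the descriptor block. The sketch only shows how replaying transactions from tail to head repopulates the map, reusing jmap_update() from the sketch above.

#include <stddef.h>

/* From the jmap sketch above; declared here so this file compiles on its own. */
void jmap_update(unsigned long long s, unsigned long long j, unsigned int tid);

struct parsed_txn {
    unsigned int tid;                   /* transaction id */
    size_t nblocks;                     /* metadata blocks in the commit */
    const unsigned long long *s_blks;   /* static locations from descriptors */
    unsigned long long first_j_blk;     /* journal block of the first copy */
};

/* Replay transactions oldest-to-newest so that newer copies win. */
void jmap_reconstruct(const struct parsed_txn *txns, size_t ntxns)
{
    for (size_t t = 0; t < ntxns; t++)
        for (size_t i = 0; i < txns[t].nblocks; i++)
            jmap_update(txns[t].s_blks[i],
                        txns[t].first_j_blk + i,   /* copies follow the descriptor */
                        txns[t].tid);
}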

3.4 Evaluation

We run all experiments on a system with a quad-core Intel i7-3820 (Sandy Bridge) 3.6 GHz CPU and 16 GB of RAM, running Linux kernel 4.6 on the Ubuntu 14.04 distribution, using the drives listed in Table 3.1. To reduce the variance between runs, we unmount the file system between runs, always start with the same file system state, disable lazy initialization (mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/) when formatting ext4 partitions, and fix the writeback cache ratio [227] for our disks to 50% of the total—by default, this ratio is computed dynamically from the writeback throughput [192]. We repeat every experiment at least five times and report the average and standard deviation of the runtime.

3.4.1 Journal Bottleneck

Since it affects our choice of baseline, we start by showing that for metadata-heavy workloads the default 128 MiB journal of ext4 is a bottleneck. We demonstrate the bottleneck on the CMR drive WD5000YS from Table 3.1 by creating 100,000 small files in over 60,000 directories, using the CreateFiles microbenchmark from Filebench [188]. The workload size is ≈1 GiB and fits in memory.

Although ext4-lazy uses a large journal by definition, since enabling a large journal on ext4 is simply a command-line option to mkfs, we choose ext4 with a 10 GiB journal (created by passing "-J size=10240" to mkfs.ext4) as our baseline. In the rest of this chapter, we refer to ext4 with the default journal size of 128 MiB as ext4-stock, and we refer to ext4 with the 10 GiB journal as ext4-baseline.

We measure how fast ext4 can create the files in memory and do not consider the writeback time. Figure 3.5 shows that on ext4-stock the benchmark completes in ≈460 seconds, whereas on ext4-baseline it completes 46× faster, in ≈10 seconds. Next we show how a small journal becomes a bottleneck.

The ext4 journal is a circular log of transactions with a head and a tail pointer (§ 3.2). As the file system performs operations, jbd2 commits transactions to the journal, moving the head forward. A committed transaction becomes checkpointed when every metadata block in it is either written back to its static location due to a dirty timer expiration, or written to the journal as part of a newer transaction. To recover space, at the end of every commit jbd2 checks for transactions at the tail that have been checkpointed, and when possible moves the tail forward. On a metadata-light workload with a small journal and the default dirty timer, jbd2 always finds checkpointed transactions at the tail and recovers the space without doing work. However, on a metadata-heavy workload, incoming transactions fill the journal before the transactions at the tail have been checkpointed. This results in a forced checkpoint, where jbd2 synchronously writes the metadata blocks of the tail transaction to their static locations and then moves the tail forward so that a new transaction can start [194].
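The interaction between the journal size and the writeback lag can be illustrated with a toy user-space model; it is not jbd2 code, and the slot counts, the lag, and the commit counts are made-up values chosen only to show that a journal smaller than the writeback lag forces a checkpoint on almost every commit, while a larger one forces none.

/* Toy model: committed transactions occupy journal slots until background
 * writeback checkpoints them, which in this model happens a fixed number of
 * commits later (standing in for the dirty timer).
 * Build: cc -O2 -o ckptmodel ckptmodel.c */
#include <stdbool.h>
#include <stdio.h>

#define COMMITS 10000
#define LAG     100              /* commits until writeback checkpoints a txn */

static bool checkpointed[COMMITS];

static int run(int slots)
{
    int tail = 0, used = 0, forced = 0;

    for (int id = 0; id < COMMITS; id++) {
        checkpointed[id] = false;
        if (id >= LAG)
            checkpointed[id - LAG] = true;   /* background writeback */

        /* jbd2-style lazy space recovery at commit time. */
        while (used > 0 && checkpointed[tail]) { tail++; used--; }

        if (used == slots) {                 /* journal full */
            if (!checkpointed[tail])
                forced++;                    /* synchronous metadata writeback */
            checkpointed[tail] = true;
            tail++; used--;
        }
        used++;                              /* commit transaction `id` */
    }
    return forced;
}

int main(void)
{
    printf("small journal  (32 slots): %d forced checkpoints\n", run(32));
    printf("large journal (256 slots): %d forced checkpoints\n", run(256));
    return 0;
}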

We observe the file system behavior while running the benchmark by enabling tracepoints in the jbd2 code (/sys/kernel/debug/tracing/events/jbd2/). On ext4-stock, the journal fills in 3 seconds, and from then on until the end of the run, jbd2 moves the tail by performing forced checkpoints. On ext4-baseline the journal never becomes full and no forced checkpoints happen during the run.

Figure 3.6 shows the volume of dirtied pages during the benchmark runs. On ext4-baseline, the benchmark creates over 60,000 directories and 100,000 files, dirtying about 1 GiB worth of pages in 10 seconds. On ext4-stock, the directories are created in the first 140 seconds. Forced checkpoints still happen during this period, but they complete fast, as the small steps in the first 140 seconds show. Once the benchmark starts filling the directories with files, the block groups fill up and writes spread out to a larger number of block groups across the disk. Therefore, forced checkpoints start taking as long as 30 seconds, as indicated by the large steps, during which the file system stalls, no writes to files happen, and the volume of dirtied pages stays fixed.

This result shows that for disks, a small journal is a bottleneck for metadata-heavy buffered I/O workloads: the journal wraps before the metadata blocks are written back to disk, and file system operations stall until the journal advances via synchronous writeback of metadata blocks. With a sufficiently large journal, all transactions will be written back before the journal wraps. For example, for a 190 MiB/s disk and a 30-second dirty timer, a journal size of 30 s × 190 MiB/s = 5,700 MiB will guarantee that when the journal wraps, the transactions at the tail will have been checkpointed. Having established our baseline, we move on to the evaluation of ext4-lazy.
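Written as a rule of thumb with the example numbers from the text:

\[
\text{journal size} \;\ge\; \text{dirty timer} \times \text{disk write throughput} \;=\; 30\ \text{s} \times 190\ \tfrac{\text{MiB}}{\text{s}} \;=\; 5{,}700\ \text{MiB}.
\]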

[Figure 3.7 charts: runtimes (min) of MakeDirs, ListDirs, TarDirs, and RemoveDirs (top) and CreateFiles, FindFiles, TarFiles, and RemoveFiles (bottom) on ext4-baseline and ext4-lazy.]

Figure 3.7: Microbenchmark runtimes on ext4-baseline and ext4-lazy.

[Figure 3.8 panels: (a) MakeDirs/ext4-baseline; (b) MakeDirs/ext4-lazy; (c) RemoveDirs/ext4-baseline; (d) RemoveDirs/ext4-lazy; each plots the disk offset (GiB) of metadata reads, metadata writes, and journal writes over time (s).]

Figure 3.8: Disk offsets of I/O operations during the MakeDirs and RemoveDirs microbenchmarks on ext4-baseline and ext4-lazy. Metadata reads and writes are spread out while journal writes are at the center. The dots have been scaled based on the I/O size. In part (d), journal writes are not visible due to low resolution. These are pure metadata workloads with no data writes.

3.4.2 Ext4-lazy on a CMR Drive

We first evaluate ext4-lazy on the CMR drive WD5000YS from Table 3.1 via a series of microbenchmarks and a file server macrobenchmark. We show that on a CMR drive, ext4-lazy provides a significant speedup for metadata-heavy workloads, and specifically for massive directory traversal workloads. On metadata-light workloads, however, ext4-lazy does not have much impact.

Microbenchmarks

We evaluate directory traversal and file/directory create operations using the following benchmarks. MakeDirs creates 800,000 directories in a directory tree of depth 10. ListDirs runs ls -lR on the directory tree. TarDirs creates a tarball of the directory tree, and RemoveDirs removes the directory tree. CreateFiles creates 600,000 4 KiB files in a directory tree of depth 20.

                               Metadata Reads      Metadata Writes    Journal Writes
MakeDirs/ext4-baseline         143.7±2.8 MiB       4,631±33.8 MiB     4,735±0.1 MiB
MakeDirs/ext4-lazy             144±4 MiB           0 MiB              4,707±1.8 MiB
RemoveDirs/ext4-baseline       4,066.4±0.1 MiB     322.4±11.9 MiB     1,119±88.6 MiB
RemoveDirs/ext4-lazy           4,066.4±0.1 MiB     0 MiB              472±3.9 MiB

Table 3.2: Distribution of I/O types for the MakeDirs and RemoveDirs benchmarks running on ext4-baseline and ext4-lazy.

FindFiles runs find on the directory tree. TarFiles creates a tarball of the directory tree, and RemoveFiles removes the directory tree. MakeDirs and CreateFiles—microbenchmarks from Filebench—run with 8 threads and execute sync at the end. All benchmarks start with a cold cache obtained by executing echo 3 > /proc/sys/vm/drop_caches as root. The benchmarks in the file/directory create category (MakeDirs, CreateFiles) complete 1.5–2× faster on ext4-lazy than on ext4-baseline, while the remaining benchmarks, which are in the directory traversal category, complete 3–5× faster, with the exception of TarFiles, as seen in Figure 3.7. We choose MakeDirs and RemoveDirs as representatives of each category and analyze their performance in detail.

MakeDirs on ext4-baseline results in ≈4,735 MiB of journal writes, which are transaction commits containing metadata blocks, as seen in the first row of Table 3.2 and at the center of Figure 3.8 (a); as the dirty timer on the metadata blocks expires, they are written to their static locations, resulting in a similar amount of metadata writeback. The block allocator is able to allocate large contiguous blocks for the directories because the file system is fresh. Therefore, in addition to the journal writes, metadata writeback is sequential as well. The write time dominates the runtime in this workload; hence, by avoiding metadata writeback and writing only to the journal, ext4-lazy halves the writes as well as the runtime, as seen in the second row of Table 3.2 and in Figure 3.8 (b). On an aged file system, the metadata writeback is more likely to be random, resulting in an even higher improvement on ext4-lazy. An interesting observation about Figure 3.8 (b) is that although the total volume of metadata reads—shown as periodic vertical spreads—is ≈140 MiB (3% of total I/O in the second row of Table 3.2), they consume over 30% of the runtime due to long seeks across the disk. In this benchmark, the metadata blocks are read from their static locations because we run the benchmark on a fresh file system, and the metadata blocks are still at their static locations. As we show next, once the metadata blocks migrate to the journal, reading them is much faster since no long seeks are involved.

In the RemoveDirs benchmark, on both ext4-baseline and ext4-lazy, the disk reads ≈4,066 MiB of metadata, as seen in the last two rows of Table 3.2. However, on ext4-baseline the metadata blocks are scattered all over the disk, resulting in long seeks, as indicated by the vertical spread in Figure 3.8 (c), while on ext4-lazy they are within the 10 GiB region of the journal, resulting in only short seeks, as Figure 3.8 (d) shows. Ext4-lazy also benefits from skipping metadata writeback, but most of the improvement comes from eliminating long seeks for metadata reads. The significant difference in the volume of journal writes between ext4-baseline and ext4-lazy seen in Table 3.2 is caused by metadata write coalescing: since ext4-lazy completes faster, there are more operations in each transaction, with many modifying the same metadata blocks, each of which is written only once to the journal.

                    Data Writes           Metadata Writes    Journal Writes
ext4-baseline       34,185±10.3 MiB       480±0.2 MiB        1,890±18.6 MiB
ext4-lazy           33,878±9.8 MiB        0 MiB              1,855±15.4 MiB

Table 3.3: Distribution of write types completed by the disk during the Postmark run on ext4-baseline and ext4-lazy. Metadata writes make up 1.3% of total writes in ext4-baseline, only 1/3 of which is unique.

The improvements in the remaining benchmarks are also due to reducing seeks to a small region and avoiding metadata writeback. We do not observe a dramatic improvement in TarFiles because, unlike the rest of the benchmarks, which read only metadata from the journal, TarFiles also reads the data blocks of files that are scattered across the disk. Massive directory traversal workloads are a constant source of frustration for users of most file systems [11, 72, 120, 144, 171]. One of the biggest benefits of consolidating metadata in a small region is an order-of-magnitude improvement in such workloads, which to our surprise was not noticed by previous work [145, 156, 224]. On the other hand, the above results are obtainable in the ideal case where all of the directory blocks are hot and therefore kept in the journal. If, for example, some part of the directory is cold and the policy decides to move those blocks to their static locations, removing such a directory will incur an expensive traversal.

File Server Macrobenchmark

We first show that ext4-lazy slightly improves the throughput of a metadata-light file server workload. Next we try to reproduce a result from previous work, without success. To emulate a file server workload, we first started with the Fileserver macrobenchmark from Filebench, but we encountered bugs for large configurations. Development on Filebench has recently been restarted, and the recommended version is still in alpha stage. Therefore, we decided to use Postmark [102], with some modifications. Like the Fileserver macrobenchmark from Filebench, Postmark first creates a working set of files and directories and then executes transactions like reading, writing, appending, deleting, and creating files on the working set. We modify Postmark to execute sync after creating the working set, so that the writeback of the working set does not interfere with the transactions. We also modify Postmark not to delete the working set at the end, but to run sync, to avoid high variance in runtime due to the race between deletion and writeback of data. Our Postmark configuration creates a working set of 10,000 files spread sparsely across 25,000 directories with file sizes ranging from 512 bytes to 1 MiB, and then executes 100,000 transactions with an I/O size of 1 MiB. During the run, Postmark writes 37.89 GiB of data and reads 31.54 GiB of data from user space. Because ext4-lazy reduces the amount of writes, to measure its effect we focus on writes. Table 3.3 shows the distribution of data writes completed by the disk while the benchmark is running on ext4-baseline and on ext4-lazy.


Figure 3.9: The top graph shows the throughput of the disk during a Postmark run on ext4-baseline and ext4-lazy. The bottom graph shows the offsets of the write types during the ext4-baseline run. The graph does not reflect the sizes of the writes, but only their offsets.

On ext4-baseline, metadata writes comprise 1.3% of total writes, all of which ext4-lazy avoids. As a result, the disk sees a 5% increase in throughput on ext4-lazy, from 24.24 MiB/s to 25.47 MiB/s, and the benchmark completes 100 seconds faster on ext4-lazy, as the throughput graph in Figure 3.9 shows. The increase in throughput is modest because the workload spreads the files across the disk, resulting in traffic that is highly non-sequential, as the data writes in the bottom graph of Figure 3.9 show. Therefore, it is not surprising that reducing the random writes of a non-sequential write workload by 1.3% results in a 5% throughput improvement. However, the same random writes result in extra cleaning work for DM-SMR drives (§ 3.3.1).

Previous work [145] that writes metadata only once reports performance improvements even for metadata-light workloads, like a kernel compile. This has not been our experience. We compiled Linux kernel 4.6 with all its modules on ext4-baseline and observed that it generated 12 GiB of data writes and 185 MiB of journal writes. At 98 MiB, metadata writes comprised only 0.77% of the total writes completed by the disk. This is expected, since metadata blocks are cached in memory, and because they are journaled, unlike data pages their dirty timer is reset whenever they are modified (§ 3.3), delaying their writeback. Furthermore, even on a system with 8 hardware threads running 16 parallel jobs, we found the kernel compile to be CPU-bound rather than disk-bound, as Figure 3.10 shows. Given that reducing writes by 1.3% on a workload that utilized the disk 100% resulted in only a 5% increase in throughput (Figure 3.9), it is not surprising that reducing writes by 0.77% on such a lightly utilized disk does not cause an improvement.

3.4.3 Ext4-lazy on DM-SMR Drives

We show that unlike on CMR drives, where ext4-lazy had a big impact only on metadata-heavy workloads, on DM-SMR drives it provides a significant improvement on both metadata-heavy and metadata-light workloads. We also identify the minimal sequential I/O size that triggers streaming writes on a popular DM-SMR drive. An additional critical factor for file systems when running on DM-SMR drives is the cleaning time after a workload.


Figure 3.10: Disk and CPU utilization sampled from iostat output every second, while compiling Linux kernel 4.6 including all its modules, with 16 parallel jobs (make -j16) on a quad-core Intel i7-3820 (Sandy Bridge) CPU with 8 hardware threads.


Figure 3.11: Microbenchmark runtimes and cleaning times on ext4-baseline and ext4-lazy running on a DM-SMR drive. Cleaning time is the additional time after the benchmark run during which the DM-SMR drive was busy cleaning.

A file system resulting in a short cleaning time gives the drive a better chance of emptying the persistent cache during the idle times of a bursty I/O workload, and it has a higher chance of continuously performing at persistent cache speed, whereas a file system resulting in a long cleaning time is more likely to force the drive to interleave cleaning with file system user work. In the next section we show microbenchmark results on just one of the DM-SMR drives—ST8000AS0002 from Table 3.1. At the end of every benchmark, we run a vendor-provided script that polls the disk until it has completed background cleaning and reports the total cleaning time, which we report in addition to the benchmark runtime. We achieve similar normalized results for the remaining drives.

Microbenchmarks

Figure 3.11 shows the results of the microbenchmarks (§ 3.4.2) repeated on ST8000AS0002 with a 2 TB partition, on ext4-baseline and ext4-lazy. MakeDirs and CreateFiles do not fill the persistent cache; therefore, they typically complete 2–3× faster than on the CMR drive. Similar to the CMR drive, MakeDirs and CreateFiles are 1.5–2.5× faster on ext4-lazy.

On the other hand, the remaining directory traversal benchmarks complete much faster; ListDirs, for example, completes 13× faster on ext4-lazy, compared to being 5× faster on the CMR drive. The cleaning times for ListDirs, FindFiles, TarDirs, and TarFiles are zero because they do not write to the disk. (TarDirs and TarFiles write their output to a different disk.) However, the cleaning time for MakeDirs on ext4-lazy is zero as well, compared to ext4-baseline's 846 seconds, despite having written over 4 GB of metadata, as Table 3.2 shows. Being a pure metadata workload, MakeDirs on ext4-lazy consists of journal writes only, as Figure 3.8 (b) shows, all of which are streamed, bypassing the persistent cache and resulting in zero cleaning time. Similarly, the cleaning times for RemoveDirs and RemoveFiles are 10–20 seconds on ext4-lazy compared to 90–366 seconds on ext4-baseline, because these too are pure metadata workloads resulting in only journal writes for ext4-lazy. During deletion, however, some journal writes are small and end up in the persistent cache, resulting in short cleaning times.

We confirmed that the drive was streaming journal writes in the previous benchmarks by repeating the MakeDirs benchmark on the DM-SMR drive with an observation window, shown in Figure 2.3, and watching the head movement. We first identified the physical location of the journal on the platter by observing the head while reading the journal blocks. We then observed that shortly after starting the MakeDirs benchmark, the head moved to the physical location of the journal on the platter and remained there until the end of the benchmark. This observation led to Test 15 for identifying the minimal sequential write size that triggers streaming. Using this test, we found that sequential writes of at least 8 MiB in size are streamed. We also observed that a single 4 KiB random write in the middle of a sequential write disrupted streaming and moved the head to the persistent cache; soon the head moved back and continued streaming.

Test 15: Identify the minimal sequential write size for streaming

1 Choose an identifiable location L on the platter.
2 Start with a large sequential write size S.
3 do
    Write S bytes sequentially at L.
    S = S − 1 MiB
  while Head moves to L and stays there until the end of the write
4 S = S + 1 MiB
5 Minimal sequential write size for streaming is S.

File Server Macrobenchmark

We show that on DM-SMR drives the benefit of ext4-lazy increases with the partition size and that ext4-lazy achieves a significant speedup on a variety of DM-SMR drives with different STLs and persistent cache sizes. Table 3.4 shows the distribution of write types completed by a ST8000AS0002 DM-SMR drive with a 400 GB partition during the file server macrobenchmark (§ 3.4.2). On ext4-baseline, metadata writes make up 1.6% of total writes. Although the unique amount of metadata is only ≈120 MiB, as the storage slows down, metadata writeback increases slightly, because each operation takes a long time to complete and the writeback of a metadata block occurs before the dirty timer is reset.

                    Data Writes          Metadata Writes    Journal Writes
ext4-baseline       32,917±9.7 MiB       563±0.9 MiB        1,212±12.6 MiB
ext4-lazy           32,847±9.3 MiB       0 MiB              1,069±11.4 MiB

Table 3.4: Distribution of write types completed by a ST8000AS0002 DM-SMR drive during a Postmark run on ext4-baseline and ext4-lazy. Metadata writes make up 1.6% of total writes in ext4-baseline, only 1/5 of which is unique.


Figure 3.12: The top graph shows the throughput of a ST8000AS0002 DM-SMR drive with a 400 GB partition during a Postmark run on ext4-baseline and ext4-lazy. The bottom graph shows the offsets of the write types during the run on ext4-baseline. The graph does not reflect the sizes of the writes, but only their offsets.

Unlike on the CMR drive, the effect is profound on the ST8000AS0002 DM-SMR drive. The benchmark completes more than 2× faster on ext4-lazy, in 461 seconds, as seen in Figure 3.12. On ext4-lazy, the drive sustains 140 MiB/s throughput and fills the persistent cache in 250 seconds, and then drops to a steady 20 MiB/s until the end of the run. On ext4-baseline, however, the large number of small metadata writes reduces throughput to 50 MiB/s, taking the drive 450 seconds to fill the persistent cache. Once the persistent cache fills, the drive interleaves cleaning and file system user work, and small metadata writes become prohibitively expensive, as seen, for example, between seconds 450–530. During this period we do not see any data writes, because the writeback thread alternates between the page cache and the buffer cache when writing dirty blocks, and it is the buffer cache's turn. We do, however, see journal writes, because jbd2 runs as a separate thread and continues to commit transactions.

The benchmark completes even more slowly on a full 8 TB partition, as seen in Figure 3.13 (a), because ext4 spreads the same workload over more bands. With a small partition, updates to different files are likely to update the same metadata region. Therefore, cleaning a single band frees more space in the persistent cache, allowing it to accept more random writes. With a full partition, however, updates to different files are likely to update different metadata regions; now the cleaner has to clean a whole band to free space for a single block in the persistent cache. Hence, after an hour of ultra-low throughput due to cleaning, throughput recovers slightly towards the end, and the benchmark completes 5.4× slower on ext4-baseline.

[Figure 3.13 panels: (a) Seagate ST8000AS0002; (b) Seagate ST4000LM016; (c) Seagate ST5000AS0011; (d) Western Digital WD40NMZW; each shows throughput (MiB/s) over time (s) on ext4-baseline and ext4-lazy (top) and the disk offsets (TiB) of data, metadata, and journal writes during the ext4-baseline run (bottom).]

Figure 3.13: The top graphs show the throughput of four DM-SMR drives with a full disk partition during a Postmark run on ext4-baseline and ext4-lazy. Ext4-lazy provides a speedup of 5.4×, 2×, 2×, and 1.7× in parts (a), (b), (c), and (d), respectively. The bottom graphs show the offsets of the write types during the ext4-baseline run. The graphs do not reflect the sizes of the writes, but only their offsets.

[Figure 3.14 panels: Seagate ST8000AS0002, Seagate ST4000LM016, Seagate ST5000AS0011, and Western Digital WD40NMZW; each shows the throughput (ops/s) of Create, Read, Append, and Delete transactions on ext4-baseline and ext4-lazy.]

Figure 3.14: Postmark-reported transaction throughput numbers for ext4-baseline and ext4-lazy running on four DM-SMR drives with a full disk partition. Only numbers from the transaction phase of the benchmark are included.

On the ST4000LM016 DM-SMR drive, the benchmark completes 2× faster on ext4-lazy, as seen in Figure 3.13 (b), because the drive throughput is almost always higher than on ext4-baseline. With ext4-baseline, the drive enters a long period of cleaning with ultra-low throughput at the 2,000th second and recovers around the 4,200th second, completing the benchmark with higher throughput. We observe a similar phenomenon on the ST5000AS0011 DM-SMR drive, as shown in Figure 3.13 (c). Unlike with ext4-baseline, which continues with low throughput until the end of the run, with ext4-lazy the cleaning cycle eventually completes and the workload finishes 2× faster.

The last DM-SMR drive in our list, the WD40NMZW model found in the My Passport Ultra from Western Digital [197], shows a different behavior from the previous disks, suggesting a different STL design. We think it is using an S-blocks-like architecture [29] with dynamic mapping that enables cheaper cleaning (§ 2.2). Unlike the previous drives, which clean only when idle or when the persistent cache is full, the WD40NMZW seems to regularly mix cleaning with file system user work. Therefore, its throughput is not as high as the Seagate drives' initially, but after the persistent cache becomes full, it does not suffer as sharp a drop and its steady-state throughput is higher. Nevertheless, with ext4-lazy the disk achieves a 1.4–2.5× increase in throughput over ext4-baseline, depending on the state of the persistent cache, and the benchmark completes 1.7× faster.

Figure 3.14 shows the Postmark transaction throughput numbers for the runs. All of the drives show a significant improvement with ext4-lazy. An interesting observation is that, while with ext4-baseline the WD40NMZW is 2× faster than the ST8000AS0002, with ext4-lazy the situation is reversed and the ST8000AS0002 is 2× faster than the WD40NMZW, and the fastest overall.

3.4.4 Performance Overhead of Ext4-Lazy

Indirection Overhead: To determine the overhead of in-memory jmap lookups, we populated jmap with 10,000 mappings pointing to random blocks in the journal and measured the total time to read all of the blocks in a fixed random order. We then measured the time to read the same random blocks directly, skipping the jmap lookup, in the same order. We repeated each experiment five times, starting with a cold cache every time, and found no difference in total read time—reading from disk dominated the total time of the operation.

Memory Overhead: A single jmap entry consists of a red-black tree node (3×8 bytes), a doubly-linked list node (2×8 bytes), a mapping (12 bytes), and a transaction id (4 bytes), occupying 56 bytes in memory. Hence, for example, a million-entry jmap that can map 3.8 GiB of hot metadata requires 53 MiB of memory. Although this is already a modest overhead for today's systems, it can be reduced further with memory-efficient data structures.

Seek Overhead: The rationale for introducing cylinder groups in FFS, which manifest themselves as block groups in ext4, was to create clusters of inodes that are spread over the disk close to the blocks that they reference, to avoid long seeks between an inode and its associated data [129]. Ext4-lazy, however, puts hot metadata in the journal located at the center of the disk, requiring a half-seek to read a file in the worst case. The TarFiles benchmark (§ 3.4.2) shows that when reading files from a large and deep directory tree, where directory traversal time dominates, putting the metadata at the center wins slightly over spreading it out. To measure the seek overhead on a shallow directory, we created a directory with 10,000 small files located at the outer diameter of the disk on ext4-lazy and, starting with a cold cache, we created a tarball of the directory.

We observed that since the files were created at the same time, their metadata was written sequentially to the journal. The code for reading metadata blocks in ext4 uses readahead since the introduction of flex_bgs. As a result, the metadata of all the files was brought into the buffer cache in just 3 seeks. After five repetitions of the experiment on ext4-baseline and ext4-lazy, the average times were 103 seconds and 101 seconds, respectively.

Cleaning Overhead: In our benchmarks, the 10 GiB journal always contained less than 10% live metadata. Therefore, most of the time the cleaner reclaimed space simply by advancing the tail. We kept reducing the journal size, and the first noticeable slowdown occurred with a journal size of 1.4 GiB, that is, when the live metadata was ≈70% of the journal.
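The per-entry memory cost quoted above can be double-checked with a small user-space model; the struct below mirrors the listed fields only in size (the kernel types differ), so the field names are assumptions.

/* Model of the 56-byte jmap entry on an LP64 system.
 * Build: cc -O2 -o jmapsize jmapsize.c */
#include <stdio.h>

struct rb_node_model   { void *parent, *left, *right; };   /* 3 x 8 bytes */
struct list_head_model { void *next, *prev; };              /* 2 x 8 bytes */

struct jmap_entry_model {
    struct rb_node_model   rb;      /* red-black tree linkage          */
    struct list_head_model list;    /* per-transaction live-block list */
    unsigned long long s_blk;       /* mapping: 8-byte static location */
    unsigned int j_blk;             /* mapping: 4-byte journal offset  */
    unsigned int tid;               /* transaction id                  */
};

int main(void)
{
    double entries = 1e6;
    printf("sizeof(jmap entry) = %zu bytes\n", sizeof(struct jmap_entry_model));
    printf("1M entries ~= %.1f MiB, mapping %.1f GiB of 4 KiB blocks\n",
           entries * sizeof(struct jmap_entry_model) / (1024.0 * 1024.0),
           entries * 4096 / (1024.0 * 1024.0 * 1024.0));
    return 0;
}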

3.5 Related Work

Researchers have tinkered with the idea of separating metadata from data and managing it differently in local file systems before. Like many other good ideas, it may have been ahead of its time because the technology that would benefit most from it did not exist yet, preventing adoption.

The Multi-Structured File System [133] (MFS) is the first file system proposing the separation of data and metadata. It was motivated by the observation that file system I/O was becoming a bottleneck because data and metadata exert different access patterns on storage, and a single storage system cannot respond to these demands efficiently. Therefore, MFS puts data and metadata on isolated disk arrays, and for each data type it introduces on-disk structures optimized for the respective access pattern. Ext4-lazy differs from MFS in two ways: (1) it writes metadata as a log, whereas MFS overwrites metadata in place; (2) facilitated by (1), ext4-lazy does not require a separate device for storing metadata in order to achieve performance improvements.

DualFS [145] is a file system influenced by MFS; it also separates data and metadata. Unlike MFS, however, DualFS uses well-known data structures for managing each data type. Specifically, it combines an FFS-like [128] file system for managing data and an LFS-like [158] file system for managing metadata. hFS [224] improves on DualFS by also storing small files in a log along with metadata, thus exploiting disk bandwidth for small files. Similar to these file systems, ext4-lazy separates metadata and data, but unlike them it does not confine metadata to a log; it uses a hybrid design where metadata can migrate back and forth between the file system and the log as needed. However, what really sets ext4-lazy apart is that it is not a new prototype file system; it is an evolution of a production file system, showing that a journaling file system can benefit from the metadata separation idea with a small set of changes that does not require on-disk format changes.

ESB [101] separates data and metadata on ext2, and puts them on a CMR drive and an SSD, respectively, to explore the effect of speeding up metadata operations on I/O performance. It is a virtual block device that sits below ext2 and leverages the fixed location of static metadata to forward metadata block requests to an SSD. The downside of this approach is that, unlike ext4-lazy, it cannot handle floating metadata, like directory blocks. The ESB authors conclude that for metadata-light workloads, speeding up metadata operations will not improve I/O performance on a CMR drive, which aligns with our findings (§ 3.4.2).

A separate metadata server is the norm in distributed storage systems like Ceph, Lustre [209], and Panasas [205]. TableFS [156] extends the idea to a local file system: it is a FUSE-based [185] file system that stores metadata in LevelDB [73] and uses ext4 as an object store for large files. Unlike ext4-lazy, TableFS is disadvantaged by FUSE overhead, but it still achieves a substantial speedup over production file systems on metadata-heavy workloads.

3.6 Summary

In this chapter we take an evolutionary approach to adapting general-purpose file systems to DM-SMR drives. Our work has three takeaways. First, it shows how effective a well-chosen small change can be. Second, it suggests that while three decades ago it was wise for file systems to scatter the metadata across the disk, today, with large memory sizes that cache metadata and with changing recording technology, putting metadata at the center of the disk and managing it as a log looks like a better choice. Third, it shows that we can reduce the block interface tax only so much using evolutionary changes to a mature file system: although we improved throughput for write-heavy workloads on DM-SMR drives, the overhead of garbage collection was still significant. It appears that the only option for avoiding the block interface tax is to take the revolutionary approach and design a new general-purpose file system for the zone interface.

At this point, it is worth taking a step back and asking the following questions: How appropriate are the abstractions provided by a general-purpose file system for a storage backend? Can a storage backend perform better if it bypasses the file system and runs on a raw device, and is it practical to do so? We answer these questions in the next chapter.

Chapter 4

Understanding and Quantifying the File System Tax in Ceph

In this chapter we study and quantify the file system tax (the overhead in code complexity, performance, and flexibility that stems from building a storage backend on top of a general-purpose file system) in Ceph, a widely used distributed storage system. To this end, we perform a longitudinal study of storage backend evolution in Ceph. We dissect the reasons that impede building high-performance storage backends on top of general-purpose file systems and describe the design of BlueStore, a special-purpose storage backend.

4.1 The State of Current Storage Backends

Traditionally, distributed storage systems have built their storage backends on top of general-purpose file systems, such as ext4 or XFS [74, 84, 87, 177, 202, 205][58, 154, 190, 209]. This approach has delivered reasonable performance, precluding questions on the suitability of file systems as a distributed storage backend. Several reasons have contributed to the success of file systems as the storage backend. First, they allow delegating the hard problems of data persistence and block allocation to well-tested and highly performant code. Second, they offer a familiar interface (POSIX) and abstractions (files, directories). Third, they enable the use of standard tools (ls, find) to explore disk contents.

The team behind the Ceph distributed storage system [202] also followed this convention for almost a decade. Hard-won lessons that the Ceph team learned using several popular file systems led them to question the fitness of file systems as storage backends. This is not surprising in hindsight. Stonebraker, after building the INGRES database for a decade, noted that "operating systems offer all things to all people at much higher overhead" [183]. Similarly, exokernels demonstrated that customizing abstractions to applications results in significantly better performance [61, 100].

In 2015 the Ceph project started designing and implementing BlueStore, a user space storage backend that stores data directly on raw storage devices, and metadata in a key-value store. By taking full control of the I/O path, BlueStore has been able to efficiently implement full data checksums, inline compression, and fast overwrites of erasure-coded data, while also improving performance on common customer workloads. In 2017, after just two years of development, BlueStore became the default production storage backend in Ceph.

A 2018 survey among Ceph users shows that 70% use BlueStore in production, with hundreds of petabytes in deployed capacity [125].

Our first contribution in this work is outlining and explaining in detail the technical reasons behind Ceph's decision to develop BlueStore. The first reason is that it is hard to implement efficient transactions on top of existing file systems. Transaction support in the storage backend simplifies implementing the strong consistency that many distributed storage systems provide [87, 202][154, 209]. A storage backend can seamlessly implement transactions if the backing file system already supports them [112, 162]. Yet, most file systems implement the POSIX standard, which lacks a transaction concept. A significant body of work aims to introduce transactions into file systems [85, 131, 137, 150, 162, 170, 180, 217], but none of these approaches have been adopted due to their high performance overhead, limited functionality, interface complexity, or implementation complexity. Therefore, distributed storage system developers typically resort to using inefficient or complex mechanisms, such as implementing a Write-Ahead Log (WAL) on top of a file system [154], or leveraging a file system's internal transaction mechanism [209]. The experience of the Ceph team shows that these options deliver subpar performance or result in a fragile system.

The second reason is that the file system's metadata does not scale, yet its performance can significantly affect the performance of the distributed storage system as a whole. The inability to efficiently enumerate large directory contents or handle small files at scale in file systems can cripple performance for both centralized [205][209] and distributed [202][154] metadata management designs. To address this problem, distributed storage system developers use metadata caching [154], deep directory hierarchies arranged by data hashes [202], custom databases [182], or patches to file systems [20, 21, 226]. The specific challenges that the Ceph team faced were enumerating directories with millions of entries quickly and coping with the lack of ordering in the returned results. Both the Btrfs- and XFS-based backends suffered from this problem, and directory splitting operations meant to distribute the metadata load were found to clash with file system policies, crippling overall system performance.

Our second contribution is to introduce the design of BlueStore, the challenges its design overcomes, and opportunities for future improvements. Novelties of BlueStore include (1) storing low-level file system metadata, such as extent bitmaps, in a key-value store, thereby avoiding on-disk format changes and reducing implementation complexity; (2) optimizing clone operations and minimizing the overhead of the resulting extent reference-counting through careful interface design; (3) BlueFS, a user space file system that enables RocksDB to run faster on a raw storage device; and (4) a space allocator with a fixed 35 MiB memory usage per terabyte of disk space.

Finally, we perform several experiments that evaluate the improvement of design changes from Ceph's previous production backend, FileStore, to BlueStore. We experimentally measure the performance effect of issues like the overhead of journaling file systems, double writes to the journal, inefficient directory splitting, and update-in-place mechanisms (as opposed to copy-on-write).

4.2 Background on the Ceph Distributed Storage System

Figure 4.1 shows the high-level architecture of Ceph. At the core of Ceph is the Reliable Autonomic Distributed Object Store (RADOS) service [204]. RADOS scales to thousands of Object Storage Devices (OSDs), providing self-healing, self-managing, replicated object storage with strong consistency.


Figure 4.1: High-level depiction of Ceph's architecture. A single pool with 3× replication is shown. Therefore, each placement group (PG) is replicated on three OSDs.

Ceph's librados library provides a transactional interface for manipulating objects and object collections in RADOS. Out of the box, Ceph provides three services implemented using librados: the RADOS Gateway (RGW), an object storage service similar to Amazon S3 [8]; the RADOS Block Device (RBD), a virtual block device similar to Amazon EBS [7]; and CephFS, a distributed file system with POSIX semantics.

Objects in RADOS are stored in logical partitions called pools. Pools can be configured to provide redundancy for the contained objects either through replication or erasure coding. Within a pool, the objects are sharded among aggregation units called placement groups (PGs). Depending on the replication factor, PGs are mapped to multiple OSDs using CRUSH, a pseudo-random data distribution algorithm [203]. Clients also use CRUSH to determine the OSD that should contain a given object, obviating the need for a centralized metadata service. PGs and CRUSH form an indirection layer between clients and OSDs that allows the migration of objects between OSDs to adapt to cluster or workload changes.

In every node of a RADOS cluster, there is a separate Ceph OSD daemon per local storage device. Each OSD processes client I/O requests from librados clients and cooperates with peer OSDs to replicate or erasure-code updates, migrate data, or recover from failures. Data is persisted to the local device via the internal ObjectStore interface, which provides abstractions for objects, object collections, a set of primitives to inspect data, and transactions to update data. A transaction combines an arbitrary number of primitives operating on objects and object collections into an atomic operation. In principle each OSD may run a different backend implementing the ObjectStore interface, although in practice clusters tend to run the same backend implementation.
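To make the transaction abstraction concrete, the sketch below shows what a minimal ObjectStore-like interface could look like; the type and method names are illustrative assumptions, greatly simplified from Ceph's actual C++ interface.

#include <cstdint>
#include <string>

// Illustrative stand-ins for Ceph's collection and object identifiers.
struct CollectionId { std::string name; };   // roughly one collection per PG
struct ObjectId     { std::string name; };

// A transaction queues an arbitrary number of primitives that the backend
// must apply atomically.
class Transaction {
public:
  void write(const CollectionId& c, const ObjectId& o,
             uint64_t offset, const std::string& data) { /* queue the op */ }
  void setattr(const CollectionId& c, const ObjectId& o,
               const std::string& key, const std::string& value) { /* queue the op */ }
  void remove(const CollectionId& c, const ObjectId& o) { /* queue the op */ }
};

// The narrow interface a storage backend (EBOFS, FileStore, BlueStore)
// implements; each OSD persists data through it.
class ObjectStore {
public:
  virtual ~ObjectStore() = default;
  virtual int read(const CollectionId& c, const ObjectId& o,
                   uint64_t offset, uint64_t length, std::string* out) = 0;
  virtual int queue_transaction(Transaction&& t) = 0;  // all-or-nothing commit
};

The essential property is that queue_transaction applies every queued primitive or none of them; the next section traces how successive backends tried to provide this guarantee.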

4.2.1 Evolution of Storage Backends in Ceph

The first implementation of the ObjectStore interface was in fact a user space file system called the Extent and B-Tree-based Object File System (EBOFS). In 2008, Btrfs was emerging with attractive features such as transactions, deduplication, checksums, and transparent compression, which were lacking in EBOFS. Therefore, as shown in Figure 4.2, EBOFS was replaced by FileStore, an ObjectStore implementation on top of Btrfs.


Figure 4.2: Timeline of storage backend evolution in Ceph (2004-2020). For each backend, the period of development and the period of being the default production backend are shown.

In FileStore, an object collection is mapped to a directory and object data is stored in a file. Initially, object attributes were stored in POSIX extended file attributes (xattrs), but were later moved to LevelDB when object attributes exceeded the size or count limitations of xattrs. FileStore on Btrfs was the production backend for several years, throughout which Btrfs remained unstable and suffered from severe data and metadata fragmentation. In the meantime, the ObjectStore interface evolved significantly, making it impractical to switch back to EBOFS. Instead, FileStore was ported to run on top of XFS, ext4, and later ZFS. Of these, FileStore on XFS became the de facto backend because it scaled better and had faster metadata performance [82].

While FileStore on XFS was stable, it still suffered from metadata fragmentation and did not exploit the full potential of the hardware. The lack of native transactions led to a user space WAL implementation that performed full data journaling and capped the speed of read-modify-write workloads (a typical Ceph workload) to the WAL's write speed. In addition, since XFS was not a copy-on-write file system, clone operations used heavily by snapshots were significantly slower.

NewStore was the first attempt at solving the metadata problems of file-system-based backends. Instead of using directories to represent object collections, NewStore stored object metadata in RocksDB, an ordered key-value store, while object data was kept in files. RocksDB was also used to implement the WAL, making read-modify-write workloads efficient due to a combined data and metadata log. Storing object data as files and running RocksDB on top of a journaling file system, however, introduced a high consistency overhead. This led to the implementation of BlueStore, which used raw disks. The following section describes the challenges BlueStore aimed to resolve. A complete description of BlueStore is given in § 4.4.

4.3 Building Storage Backends on Local File Systems is Hard

This section describes the challenges faced by the Ceph team while trying to build a distributed storage backend on top of local file systems.

4.3.1 Challenge 1: Efficient Transactions

Transactions simplify application development by encapsulating a sequence of operations into a single atomic unit of work. Thus, a significant body of work aims to introduce transactions into file systems [85, 131, 137, 150, 162, 170, 180, 217]. None of these works have been adopted by production file systems, however, due to their high performance overhead, limited functionality, interface complexity, or implementation complexity.

Hence, there are three tangible options for providing transactions in a storage backend running on top of a file system: (1) hooking into a file system's internal (but limited) transaction mechanism; (2) implementing a WAL in user space; and (3) using a key-value database with transactions as a WAL. Next, we describe why each of these options results in significant performance or complexity overhead.

Leveraging File System Internal Transactions

Many file systems implement an in-kernel transaction framework that enables performing compound internal operations atomically [34, 43, 179, 194]. Since the purpose of this framework is to ensure internal file system consistency, its functionality is generally limited, and thus unavailable to users. For example, a rollback mechanism is not available in file system transaction frameworks because it is unnecessary for ensuring the internal consistency of a file system.

Until recently, Btrfs was making its internal transaction mechanism available to users through a pair of system calls that atomically applied the operations between them to the file system [43]. The first version of FileStore that ran on Btrfs relied on these system calls, and suffered from the lack of a rollback mechanism. More specifically, if a Ceph OSD encountered a fatal event in the middle of a transaction, such as a software crash or a KILL signal, Btrfs would commit a partial transaction and leave the storage backend in an inconsistent state. Solutions attempted by the Ceph and Btrfs teams included introducing a single system call for specifying the entire transaction [198] and implementing rollback through snapshots [199], both of which proved costly. Btrfs authors recently deprecated the transaction system calls [26]. This outcome is similar to Microsoft's attempt to leverage NTFS's in-kernel transaction framework for providing an atomic file transaction API, which was deprecated due to its high barrier to entry [104].

These experiences strongly suggest that it is hard to leverage the internal transaction mechanism of a file system in a storage backend implemented in user space.

Implementing the WAL in User Space

An alternative to utilizing the file system's in-kernel transaction framework was to implement a logical WAL in user space. While this approach worked, it suffered from three major problems.

Slow Read-Modify-Write. Typical Ceph workloads perform many read-modify-write operations on objects, where preparing the next transaction requires reading the effect of the previous transaction. A user space WAL implementation, on the other hand, performs three steps for every transaction. First, the transaction is serialized and written to the log. Second, fsync is called to commit the transaction to disk. Third, the operations specified in the transaction are applied to the file system.

âç to the le system. e ešect of a transaction cannot be read by upcoming transactions until the third step completes, which is dependent on the second step. As a result, every read-modify-write operation incurred the full latency of the WAL commit, preventing e›cient pipelining.

Non-idempotent Operations. In FileStore, objects are represented by files, and collections are mapped to directories. With this data model, replaying a logical WAL after a crash is challenging due to non-idempotent operations. While the WAL is trimmed periodically, there is always a window of time when a committed transaction that is still in the WAL has already been applied to the file system. For example, consider a transaction consisting of three operations: (1) clone a→b; (2) update a; (3) update c. If a crash happens after the second operation, replaying the WAL corrupts object b. As another example, consider a transaction: (1) update b; (2) rename b→c; (3) rename a→b; (4) update d. If a crash happens after the third operation, replaying the WAL corrupts object a, which is now named b, and then fails because object a does not exist anymore.

FileStore on Btrfs solved this problem by periodically taking persistent snapshots of the file system and marking the WAL position at the time of the snapshot. Then, on recovery, the latest snapshot was restored, and the WAL was replayed from the position marked at the time of the snapshot.

When FileStore abandoned Btrfs in favor of XFS (§ 4.2.1), the lack of efficient snapshots caused two problems. First, on XFS the sync system call is the only option for synchronizing file system state to storage. However, in typical deployments with multiple drives per node, sync is too expensive because it synchronizes all file systems on all drives. This problem was resolved by adding the syncfs system call [200] to the Linux kernel, which synchronizes only a given file system.

The second problem was that with XFS there is no option to restore a file system to a specific state after which the WAL can be replayed without worrying about non-idempotent operations. To address this problem, guards (sequence numbers) were added to avoid replaying non-idempotent operations. The guards were hard to reason about, hard to test, and slowed operations down. Verifying the correctness of guards for complex operations was hard due to the large problem space. Tooling was written to generate random permutations of complex operation sequences, which was combined with failure injection to semi-comprehensively verify that all failure cases were correctly handled, but the code ended up fragile and hard to maintain.
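To illustrate the idea behind these guards, here is a minimal sketch (not FileStore's actual code) of per-object sequence numbers used to skip already-applied operations during WAL replay; the names and the in-memory map are assumptions for illustration.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical replay guard: each object remembers the sequence number of the
// last WAL entry whose effects are known to have reached the file system.
struct ObjectState {
  uint64_t applied_seq = 0;   // persisted next to the object in practice
};

struct WalOp {
  std::string object;   // object the operation targets
  uint64_t    seq;      // sequence number of the enclosing WAL entry
  // ... serialized operation payload (clone, rename, update, ...) ...
};

// During recovery, an operation is re-applied only if its object has not
// already seen this sequence number, so non-idempotent operations such as
// clone or rename are never applied twice.
void replay(const std::vector<WalOp>& log,
            std::unordered_map<std::string, ObjectState>& objects) {
  for (const WalOp& op : log) {
    ObjectState& st = objects[op.object];
    if (op.seq <= st.applied_seq)
      continue;                // guard: already applied before the crash
    // apply_to_file_system(op);  // placeholder for the real side effect
    st.applied_seq = op.seq;   // must persist atomically with the side effect
  }
}

Even in this simplified form, the guard has to be persisted atomically with the operation's side effect, which hints at why the real implementation proved hard to test and maintain.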

Double Writes. The final problem with the WAL in FileStore is that it writes data twice: first to the WAL, and then to the file system, effectively halving the disk bandwidth on write-intensive workloads. This is a well-known problem that leads most file systems to log only metadata changes, allowing data to be lost after a crash. It is possible to avoid the penalty of double writes for new data by first writing it to disk and then logging only the respective metadata. However, FileStore's approach of using the state of the file system to infer the namespace of objects and their states makes this method hard to use due to corner cases, such as partially written or temporary files. While FileStore's approach turned out to be problematic, it was originally chosen for a technically useful reason: the alternative required implementing an in-memory cache for data and metadata for any updates waiting on the FileStore journal, despite the kernel having a page and inode cache of its own.

Using a Key-Value Store as the WAL

With NewStore, the metadata was stored in RocksDB, an ordered key-value store, while the object data were still represented as files in a file system. Hence, metadata operations could be performed atomically; data overwrites, however, were logged into RocksDB and executed later. We first describe how this design addresses the three problems of a logical WAL, and then show that it introduces a high consistency overhead that stems from running atop a journaling file system.

First, slow read-modify-write operations are avoided because the key-value interface allows reading the new state of an object without waiting for the transaction to commit.

Second, the problem of non-idempotent operation replay is avoided because the read side of such operations is resolved at the time the transaction is prepared. For example, for clone a→b, if object a is small, it is copied and inserted into the transaction; if object a is large, a copy-on-write mechanism is used, which changes both a and b to point to the same data and marks the data read-only.

Finally, the problem of double writes is avoided for new objects because the object namespace is now decoupled from the file system state. Therefore, the data for a new object is first written to the file system and then a reference to it is atomically added to the database.

Despite these favorable properties, the combination of RocksDB and a journaling file system introduces a high consistency overhead, similar to the journaling-of-journal problem [97, 174]. Creating an object in NewStore entails two steps: (1) writing to a file and calling fsync, and (2) writing the object metadata to RocksDB synchronously [92], which also calls fsync. Ideally, the fsync in each step should issue one expensive FLUSH CACHE command [213] to the disk. With a journaling file system, however, each fsync issues two flush commands: one after writing the data, and one after committing the corresponding metadata changes to the file system journal. Hence, creating an object in NewStore results in four expensive flush commands to the disk.

We demonstrate the overhead of journaling using a benchmark that emulates a storage backend creating many objects. The benchmark has a loop in which each iteration first writes 0.5 MiB of data and then inserts a 500-byte metadata entry into RocksDB. We run the benchmark on two setups. The first setup emulates NewStore, issuing four flush operations for every object creation: data is written as a file to XFS, and the metadata is inserted into stock RocksDB running on XFS. The second setup emulates object creation on a raw disk, which issues two flush operations for every object creation: data is written to the raw disk and the metadata is inserted into a modified RocksDB that runs on the raw disk with a preallocated pool of WAL files. Figure 4.3 shows that the object creation throughput is 80% higher on the raw disk than on XFS when running on an HDD and 70% higher when running on an NVMe SSD.
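A minimal sketch of the NewStore-like side of this benchmark might look as follows; the paths, the object count, and the use of RocksDB's synchronous write option are assumptions based on the description above, not the original benchmark code.

// Build (assumed): g++ -std=c++17 bench.cc -lrocksdb -o bench
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <rocksdb/db.h>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  // Both the object files and RocksDB live on the same XFS mount (assumed path).
  rocksdb::DB::Open(opts, "/mnt/xfs/bench-db", &db);

  const std::string data(512 * 1024, 'x');   // 0.5 MiB of object data
  const std::string meta(500, 'm');          // 500-byte metadata entry

  rocksdb::WriteOptions wo;
  wo.sync = true;   // fsync the RocksDB WAL; on XFS this costs two flushes

  for (int i = 0; i < 10000; i++) {
    // Step 1: write the object data as a file and fsync it
    // (on a journaling file system this issues two more flushes).
    std::string path = "/mnt/xfs/objects/obj-" + std::to_string(i);
    int fd = ::open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC, 0644);
    ::write(fd, data.data(), data.size());
    ::fsync(fd);
    ::close(fd);

    // Step 2: insert the object metadata into RocksDB synchronously.
    db->Put(wo, "meta-" + std::to_string(i), meta);
  }
  delete db;
  return 0;
}

The raw-disk setup replaces step 1 with a direct write to the block device and uses a RocksDB modified to run on a preallocated pool of WAL files, halving the number of flushes per object.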

4.3.2 Challenge 2: Fast Metadata Operations

The inefficiency of metadata operations in file systems is a source of constant struggle for distributed storage systems [143][149, 226]. One of the key metadata challenges in Ceph with the FileStore backend stems from slow directory enumeration (readdir) operations on large directories, and the lack of ordering in the returned result [178]. Objects in RADOS are mapped to a PG based on a hash of their name and enumerated by hash order. Enumeration is necessary for operations like scrubbing [163], recovery, or for serving librados calls that list objects.


Figure 4.3: The overhead of running an object store workload on a journaling file system. Object creation throughput is 80% higher on a raw HDD (4 TB Seagate ST4000NM0023) and 70% higher on a raw NVMe SSD (400 GB Intel P3600).

For objects with long names (as is often the case with RGW), FileStore works around the file name length limitation in local file systems using extended attributes, which may require a stat call to determine the object name. FileStore follows a commonly adopted solution to the slow enumeration problem: a directory hierarchy with a large fan-out is created, objects are distributed among directories, and then the selected directories' contents are sorted after being read.

To sort them quickly and to limit the overhead of potential stat calls, directories are kept small (a few hundred entries) by splitting them when the number of entries within them grows. This process can be costly. While XFS tries to allocate a directory and its contents in the same allocation group [93], subdirectories are typically placed in different allocation groups to ensure there is space for future directory entries to be located close together [128]. Therefore, as the number of objects grows, directory contents spread out, and split operations take longer to complete. This can cripple performance if all Ceph OSDs start splitting in unison, and it has been affecting many Ceph users over the years [27, 2, 181].

To demonstrate this effect, we configure a 16-node Ceph cluster (§ 4.6) with roughly half the recommended number of PGs to increase the load per PG and accelerate splitting, and we insert millions of 4 KiB objects with a queue depth of 128 at the RADOS layer (§ 4.2). Figure 4.4 shows the effect of the splitting on FileStore for an all-SSD cluster. While the first split is not noticeable in the graph, the second split causes a precipitous drop that kills the throughput for 7 minutes on an all-SSD cluster and for 120 minutes on an all-HDD cluster (not shown), during which a large and deep directory hierarchy with millions of entries is scanned and an even deeper hierarchy is created. Splitting can dramatically affect workload performance on both HDDs and NVMe SSDs since it happens inline with workload operations, and it can spawn enough I/O to throttle even SSDs that can handle significant operation parallelism.

4.3.3 Other Challenges

Many public and private clouds rely on distributed storage systems like Ceph for providing storage services [139]. Without complete control of the I/O stack, it is hard for distributed storage systems to enforce storage latency SLOs in these clouds. And yet, running the storage backend on top of a file system cedes control of the I/O stack to the operating system's policies and mechanisms.


Figure 4.4: The effect of directory splitting on throughput with the FileStore backend. The workload inserts 4 KiB objects using 128 parallel threads at the RADOS layer to a 16-node Ceph cluster (setup explained in § 4.6). Directory splitting brings down the throughput for 7 minutes on an all-SSD cluster. Once the splitting is complete, the throughput recovers but does not return to the peak, due to a combination of deeper nesting of directories, the increased size of the underlying file system, and an imperfect implementation of the directory hashing code in FileStore.

One cause of high-variance request latencies in file-system-based storage backends is the operating system page cache. To improve user experience, most operating systems implement the page cache using a write-back policy, in which a write operation completes once the data is buffered in memory and the corresponding pages are marked as dirty. On a system with little I/O activity, the dirty pages are written back to disk at regular intervals, synchronizing the on-disk and in-memory copies of data. On a busy system, on the other hand, the write-back behavior is governed by a complex set of policies that can trigger writes at arbitrary times [14, 44, 220]. Hence, while the write-back policy results in a responsive system for users with lightly loaded machines, it complicates achieving predictable latency on busy storage backends. Even with periodic use of fsync, FileStore has been unable to bound the amount of deferred inode metadata write-back, leading to inconsistent performance.

Another challenge for file-system-based backends is implementing operations that work better with copy-on-write support, such as snapshots. If the backing file system is copy-on-write, these operations can be implemented efficiently. However, even if copy-on-write is supported, a file system may have other drawbacks, like the fragmentation in FileStore on Btrfs (§ 4.2.1). If the backing file system is not copy-on-write, then these operations require performing expensive full copies of objects, which makes snapshots and the overwriting of erasure-coded data prohibitively expensive in FileStore (§ 4.5.2).

4.4 BlueStore: A Clean-Slate Approach

BlueStore is a storage backend designed from scratch to replace backends that relied on file systems and faced the challenges outlined in the previous section. Some of the main goals of BlueStore were:

1. Fast metadata operations (§ 4.4.1)
2. No consistency overhead for object writes (§ 4.4.1)
3. Copy-on-write clone operation (§ 4.4.2)
4. No journaling double-writes (§ 4.4.2)
5. Optimized I/O patterns for HDD and SSD (§ 4.4.2)


BlueStore achieved all of these goals within just two years and became the default storage backend in Ceph. Two factors played a key role in why BlueStore matured so quickly compared to general-purpose POSIX file systems, which take a decade to mature [210, 211][60, 116]. First, BlueStore implements a small, special-purpose interface, and not a complete POSIX I/O specification. Second, BlueStore is implemented in user space, which allows it to leverage well-tested and high-performance third-party libraries. Finally, BlueStore's control of the I/O stack enables additional features whose discussion we defer to § 4.5.

The high-level architecture of BlueStore is shown in Figure 4.5. BlueStore runs directly on raw disks. A space allocator within BlueStore determines the location of new data, which is asynchronously written to disk using direct I/O. Internal metadata and user object metadata are stored in RocksDB, which runs on BlueFS, a minimal user space file system tailored to RocksDB. The BlueStore space allocator and BlueFS share the disk and periodically communicate to balance free space. The remainder of this section describes metadata and data management in BlueStore.

Figure 4.5: The high-level architecture of BlueStore. Data is written to the raw storage device using direct I/O. Metadata is written to RocksDB running on top of BlueFS. BlueFS is a user space library file system designed for RocksDB, and it also runs on top of the raw storage device.

4.4.1 BlueFS and RocksDB

BlueStore achieves its first goal, fast metadata operations, by storing metadata in RocksDB. BlueStore achieves its second goal of no consistency overhead with two changes. First, it writes data directly to the raw disk, resulting in one cache flush per data write. Second, it changes RocksDB to reuse WAL files as a circular buffer, resulting in one cache flush per metadata write, a feature that RocksDB has now officially adopted.

RocksDB itself runs on BlueFS, a minimal file system designed specifically for RocksDB that runs on a raw storage device.


Figure 4.6: A possible on-disk data layout of BlueFS. The metadata in BlueFS lives only in the journal. The journal does not have a fixed location; its extents are interleaved with file data. The WAL, LOG, and SST files are the write-ahead log file, the debug log file, and sorted-string table files, respectively, generated by RocksDB.

RocksDB abstracts its requirements from the underlying file system in the Env interface. BlueFS is an implementation of this interface in the form of a user space, extent-based, journaling file system. It implements the basic system calls required by RocksDB, such as open, mkdir, and pwrite. A possible on-disk layout of BlueFS is shown in Figure 4.6.

BlueFS maintains an inode for each file that includes the list of extents allocated to the file. The superblock is stored at a fixed offset and contains an inode for the journal. The journal has the only copy of all file system metadata, which is loaded into memory at mount time. On every metadata operation, such as directory creation, file creation, and extent allocation, the journal and the in-memory metadata are updated. The journal is not stored at a fixed location; its extents are interleaved with other file extents. The journal is compacted and written to a new location when it reaches a preconfigured size, and the new location is recorded in the superblock. These design decisions work because large files and periodic compactions limit the volume of metadata at any point in time.
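A minimal sketch of the kinds of structures this description implies is shown below; the field names are hypothetical and far simpler than BlueFS's actual on-disk format.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified BlueFS-like metadata structures (illustration only).
struct Extent {
  uint64_t offset;   // byte offset on the raw device
  uint64_t length;   // extent length in bytes
};

struct Inode {
  uint64_t ino;
  uint64_t size;                 // logical file size
  std::vector<Extent> extents;   // where the file's data lives on disk
};

// The superblock sits at a fixed device offset and points at the journal,
// whose extents are interleaved with ordinary file data.
struct Superblock {
  uint64_t version;
  Inode    journal;              // the journal holds the only copy of all metadata
};

// Journal records replayed into memory at mount time (illustrative subset).
struct JournalEntry {
  enum class Op { MkDir, CreateFile, AllocExtent } op;
  std::string path;
  Extent      extent;            // used by AllocExtent
};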

Metadata Organization. BlueStore keeps multiple namespaces in RocksDB, each storing a different type of metadata. For example, object information is stored in the O namespace (that is, RocksDB keys start with O and their values represent object metadata), block allocation metadata is stored in the B namespace, and collection metadata is stored in the C namespace. Each collection maps to a PG and represents a shard of a pool's namespace. The collection name includes the pool identifier and a prefix shared by the collection's object names. For example, a key-value pair C12.e4-6 identifies a collection in pool 12 with objects that have hash values starting with the 6 significant bits of e4. Hence, the object O12.e532 is a member, whereas the object O12.e832 is not. This organization allows a collection of millions of objects to be split into multiple collections merely by changing the number of significant bits. This collection splitting operation is necessary to rebalance data across OSDs when, for example, a new OSD is added to the cluster, and it was a costly operation with FileStore.
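The membership rule can be expressed in a few lines. The following sketch is illustrative only and does not reflect BlueStore's actual key-handling code; it simply mirrors the C12.e4-6 example.

#include <cstdint>
#include <cstdio>

// Does an object hash fall into a collection defined by a hash prefix and a
// number of significant bits? (Hashes are left-aligned in 32-bit values here.)
bool in_collection(uint32_t object_hash, uint32_t collection_prefix, int bits) {
  uint32_t mask = bits == 0 ? 0u : ~0u << (32 - bits);
  return (object_hash & mask) == (collection_prefix & mask);
}

int main() {
  // Collection C12.e4-6: pool 12, objects whose hash starts with the
  // 6 significant bits of e4.
  uint32_t coll = 0xe4000000;
  std::printf("e532 is a member: %d\n", in_collection(0xe5320000, coll, 6)); // 1
  std::printf("e832 is a member: %d\n", in_collection(0xe8320000, coll, 6)); // 0
  // Splitting the collection amounts to increasing `bits`; no data moves.
  return 0;
}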

4.4.2 Data Path and Space Allocation

BlueStore is a copy-on-write backend. For incoming writes larger than a minimum allocation size (64 KiB for HDDs, 16 KiB for SSDs) the data is written to a newly allocated extent. Once the data is persisted, the corresponding metadata is inserted into RocksDB. This allows BlueStore to provide efficient clone operations. A clone operation simply increments the reference count of the dependent extents, and writes are directed to new extents. It also allows BlueStore to avoid journal double-writes for object writes and partial overwrites that are larger than the minimum allocation size.

For writes smaller than the minimum allocation size, both data and metadata are first inserted into RocksDB as promises of future I/O, and then asynchronously written to disk after the transaction commits. This deferred write mechanism has two purposes. First, it batches small writes to increase efficiency, because new data writes require two I/O operations whereas an insert to RocksDB requires one. Second, it optimizes I/O based on the device type: 64 KiB (or smaller) overwrites of a large object on an HDD are performed asynchronously in place to avoid seeks during reads, whereas on SSDs in-place overwrites only happen for I/O sizes smaller than 16 KiB.

Space Allocation. BlueStore allocates space using two modules: the FreeList manager and the Allocator. The FreeList manager is responsible for a persistent representation of the parts of the disk currently in use. Like all metadata in BlueStore, this free list is also stored in RocksDB. The first implementation of the FreeList manager represented in-use regions as key-value pairs with offset and length. The disadvantage of this approach was that the transactions had to be serialized: the old key had to be deleted first before inserting a new key to avoid an inconsistent free list. The second implementation is bitmap-based. Allocation and deallocation operations use RocksDB's merge operator to flip the bits corresponding to the affected blocks, eliminating the ordering constraint. The merge operator in RocksDB performs a deferred atomic read-modify-write operation that does not change the semantics and avoids the cost of point queries [91].

The Allocator is responsible for allocating space for the new data. It keeps a copy of the free list in memory and informs the FreeList manager as allocations are made. The first implementation of the Allocator was extent-based, dividing the free extents into power-of-two-sized bins. This design was susceptible to fragmentation as disk usage increased. The second implementation uses a hierarchy of indexes layered on top of a single-bit-per-block representation to track whole regions of blocks. Large and small extents can be efficiently found by querying the higher and lower indexes, respectively. This implementation has a fixed memory usage of 35 MiB per terabyte of capacity.

Cache. Since BlueStore is implemented in user space and accesses the disk using direct I/O, it cannot use the operating system page cache. As a result, BlueStore implements its own write-through cache in user space, using the scan-resistant 2Q algorithm [99]. The cache implementation is sharded for parallelism. It uses a sharding scheme identical to that of Ceph OSDs, which shard requests to collections across multiple cores. This avoids false sharing, so that the same CPU context processing a given client request touches the corresponding 2Q data structures.
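To show how a merge operator removes the ordering constraint, here is a hedged sketch of a bit-flipping operator written against RocksDB's AssociativeMergeOperator interface; the key layout and chunk size are assumptions, not BlueStore's actual FreeList format.

#include <memory>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/merge_operator.h>

// Each key covers a fixed run of blocks; the value is a bitmap of in-use bits.
// A merge operand is a bitmap of bits to flip (allocate or free), so writers
// never have to read, modify, and rewrite the key themselves.
class BitmapXorOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing,
             const rocksdb::Slice& operand, std::string* new_value,
             rocksdb::Logger* /*logger*/) const override {
    std::string result = existing ? existing->ToString()
                                  : std::string(operand.size(), '\0');
    for (size_t i = 0; i < operand.size() && i < result.size(); i++)
      result[i] ^= operand[i];          // flip the requested bits
    *new_value = std::move(result);
    return true;
  }
  const char* Name() const override { return "BitmapXorOperator"; }
};

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.merge_operator = std::make_shared<BitmapXorOperator>();

  rocksdb::DB* db = nullptr;
  rocksdb::DB::Open(opts, "/tmp/freelist-sketch", &db);

  // Flip the first bit of the chunk tracking blocks [0, 1024): mark block 0 in use.
  std::string flip(128, '\0');          // 128-byte chunk = 1024 blocks (assumption)
  flip[0] = 0x01;
  db->Merge(rocksdb::WriteOptions(), "b-chunk-0", flip);
  delete db;
  return 0;
}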

4.5 Features Enabled by BlueStore

In this section we describe new features implemented in BlueStore. These features were previously lacking because implementing them efficiently requires full control of the I/O stack.

4.5.1 Space-Efficient Checksums

Ceph scrubs metadata every day and data every week. Even with scrubbing, however, if the data is inconsistent across replicas it is hard to be sure which copy is corrupt. Therefore, checksums are indispensable for distributed storage systems that regularly deal with petabytes of data, where bit flips are almost certain to occur.

Most local file systems do not support checksums. When they do, like Btrfs, the checksum is computed over 4 KiB blocks to make block overwrites possible. For 10 TiB of data, storing 32-bit checksums of 4 KiB blocks results in 10 GiB of checksum metadata. This makes it difficult to cache checksums in memory for fast verification.

On the other hand, most of the data stored in distributed storage systems is read-only and can be checksummed at a larger granularity. BlueStore computes a checksum for every write and verifies the checksum on every read. While multiple checksum algorithms are supported, crc32c is used by default because it is well-optimized on both x86 and ARM architectures, and it is sufficient for detecting random bit errors. With full control of the I/O stack, BlueStore can choose the checksum block size based on the I/O hints. For example, if the hints indicate that writes are from the S3-compatible RGW service, then the objects are read-only and the checksum can be computed over 128 KiB blocks, and if the hints indicate that objects are to be compressed, then a checksum can be computed after the compression, significantly reducing the total size of checksum metadata.
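As a sanity check on these numbers, here is a small sketch (illustration only, not BlueStore code) of how checksum metadata scales with the checksum block size.

#include <cstdint>
#include <cstdio>

// Bytes of checksum metadata needed when one 32-bit checksum (e.g., crc32c)
// covers each block of `block_bytes` within `data_bytes` of data.
uint64_t checksum_metadata_bytes(uint64_t data_bytes, uint64_t block_bytes) {
  uint64_t blocks = (data_bytes + block_bytes - 1) / block_bytes;
  return blocks * sizeof(uint32_t);
}

int main() {
  const uint64_t TiB = 1ull << 40, KiB = 1ull << 10;
  // 10 TiB checksummed at 4 KiB granularity: ~10 GiB of checksum metadata.
  std::printf("4 KiB blocks:   %llu MiB\n",
              (unsigned long long)(checksum_metadata_bytes(10 * TiB, 4 * KiB) >> 20));
  // The same data at 128 KiB granularity: 32x less, ~320 MiB.
  std::printf("128 KiB blocks: %llu MiB\n",
              (unsigned long long)(checksum_metadata_bytes(10 * TiB, 128 * KiB) >> 20));
  return 0;
}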

4.5.2 Overwrite of Erasure-Coded Data

Ceph has supported erasure-coded (EC) pools (§ 4.2) through the FileStore backend since 2014. However, until BlueStore, EC pools supported only object appends and deletions; overwrites were slow enough to make the system unusable. As a result, the use of EC pools was limited to RGW; for RBD and CephFS only replicated pools could be used.

To avoid the "RAID write hole" problem [189], where crashing during a multi-step data update can leave the system in an inconsistent state, Ceph performs overwrites in EC pools using two-phase commit. First, all OSDs that store a chunk of the EC object make a copy of the chunk so that they can roll back in case of failure. After all of the OSDs receive the new content and overwrite their chunks, the old copies are discarded. With FileStore on XFS, the first phase is expensive because each OSD performs a physical copy of its chunk. BlueStore, however, makes overwrites practical because its copy-on-write mechanism avoids full physical copies.

4.5.3 Transparent Compression

Transparent compression is crucial for large-scale distributed storage systems because 3× replication increases storage costs [69, 86]. BlueStore implements transparent compression, where written data is automatically compressed before being stored.

Getting the full benefit of compression requires compressing over large 128 KiB chunks, and compression works well when objects are written in their entirety. For partial overwrites of a compressed object, BlueStore places the new data in a separate location and updates the metadata to point to it. When a compressed object gets too fragmented due to multiple overwrites, BlueStore compacts the object by reading it and rewriting it. In practice, however, BlueStore uses hints and simple heuristics to compress only those objects that are unlikely to experience many overwrites.


Figure 4.7: Throughput of steady-state object writes to RADOS on a 16-node all-HDD cluster with different I/O sizes using 128 threads. Compared to FileStore, the throughput is 50-100% greater on BlueStore and has a significantly lower variance.

4.6 Evaluation

This section compares the performance of a Ceph cluster using FileStore, a backend built on a file system, and BlueStore, a backend using the storage device directly. We first compare the throughput of object writes to the RADOS distributed object storage (§ 4.6.1). Then, we compare the end-to-end throughput of random writes, sequential writes, and sequential reads to RBD, the Ceph virtual block device built on RADOS (§ 4.6.2). Finally, we compare the throughput of random writes to an RBD device allocated on an erasure-coded pool (§ 4.6.3).

We run all experiments on a 16-node Ceph cluster connected with a Cisco Nexus 3264-Q 64-port QSFP+ 40GbE switch. Each node has a 16-core Intel E5-2698Bv3 Xeon 2GHz CPU, 64 GiB RAM, a 400 GB Intel P3600 NVMe SSD, a 4 TB 7200 RPM Seagate ST4000NM0023 HDD, and a Mellanox MCX314A-BCCT 40GbE NIC. All nodes run Linux kernel 4.15 on the Ubuntu 18.04 distribution and the Luminous release (v12.2.11) of Ceph. We use the default Ceph configuration parameters.

4.6.1 Bare RADOS Benchmarks

We start by comparing the performance of object writes to RADOS when using the FileStore and BlueStore backends. We focus on write performance improvements because most BlueStore optimizations affect writes.

Figure 4.7 shows the throughput for different object sizes written with a queue depth of 128. At steady state, the throughput on BlueStore is 50-100% greater than on FileStore. The throughput improvement on BlueStore stems from avoiding the double writes (§ 4.3.1) and the consistency overhead of a journaling file system (§ 4.3.1).

Figure 4.8 shows the 95th and above percentile latencies of object writes to RADOS. BlueStore has an order of magnitude lower tail latency than FileStore.


Figure 4.8: 95th and above percentile latencies of object writes to RADOS on a 16-node all-HDD cluster with different sizes using 128 threads. BlueStore has an order of magnitude lower tail latency than FileStore.

In addition, with BlueStore the tail latency increases with the object size, as expected, whereas with FileStore even small-sized object writes may have a high tail latency, stemming from the lack of control over writes (§ 4.3.3).

The read performance on BlueStore (not shown) is similar to or better than on FileStore for I/O sizes larger than 128 KiB; for smaller I/O sizes FileStore is better because of the kernel readahead [12]. BlueStore does not implement readahead on purpose; it is expected that the applications implemented on top of RADOS will perform their own readahead.

BlueStore eliminates the directory splitting effect of FileStore by storing metadata in an ordered key-value store. To demonstrate this, we repeat the experiment that showed the splitting problem in FileStore (§ 4.3.2) on an identically configured Ceph cluster using the BlueStore backend. Figure 4.9 shows that the throughput on BlueStore does not suffer the precipitous drop, and in the steady state it is 2× higher than FileStore throughput on SSD (and 3× higher than FileStore throughput on HDD, not shown). Still, the throughput on BlueStore drops significantly before reaching a steady state due to RocksDB compaction, whose cost grows with the object corpus.

4.6.2 RADOS Block Device (RBD) Benchmarks

Next, we compare the performance of RBD, a virtual block device service implemented on top of RADOS, when using the BlueStore and FileStore backends. RBD is implemented as a kernel module that exports a block device to the user, which can be formatted and mounted like a regular block device. Data written to the device is striped into 4 MiB RADOS objects and written in parallel to multiple OSDs over the network.

For the RBD benchmarks we create a 1 TB virtual block device, format it with XFS, and mount it on the client. We use fio [13] to perform sequential and random I/O with a queue depth of 256 and I/O sizes ranging from 4 KiB to 4 MiB. For each test, we write about 30 GiB of data. Before starting every experiment, we drop the operating system page cache for FileStore, and we restart the OSDs for BlueStore to eliminate caching effects in read experiments. We first run all the experiments on a Ceph cluster installed with the FileStore backend. We then tear down the cluster, reinstall it with the BlueStore backend, and repeat all the experiments.


Figure 4.9: Throughput of 4 KiB RADOS object writes with a queue depth of 128 on a 16-node all-SSD cluster. At steady state, BlueStore is 2× faster than FileStore. BlueStore does not suffer from directory splitting. However, its throughput is gradually brought down by the RocksDB compaction overhead.


Figure 4.10: Sequential write, random write, and sequential read throughput with different I/O sizes and a queue depth of 256 on a 1 TB Ceph virtual block device (RBD) allocated on a 16-node all-HDD cluster. Results for an all-SSD cluster were similar.

Figure 4.10 shows the results for sequential writes, random writes, and sequential reads. For I/O sizes larger than 512 KiB, sequential and random write throughput is on average 1.7× and 2× higher with BlueStore, respectively, again mainly due to avoiding double writes. BlueStore also displays a significantly lower throughput variance because it can deterministically push data to disk. In FileStore, on the other hand, arbitrarily triggered metadata writeback (§ 4.3.3) conflicts with the foreground writes to the WAL and introduces long request latencies.

For medium I/O sizes (128-512 KiB) the throughput difference decreases for sequential writes because XFS masks out part of the cost of double writes in FileStore. With medium I/O sizes the writes to the WAL do not fully utilize the disk. This leaves enough bandwidth for another write stream to go through without a large impact on the foreground writes to the WAL. After writing the data synchronously to the WAL, FileStore then asynchronously writes it to the file system. XFS buffers these asynchronous writes and turns them into one large sequential write before issuing it to disk. XFS cannot do the same for random writes, which is why the large throughput difference persists even for medium-sized random writes.


Figure 4.11: IOPS observed from a client performing random 4 KiB writes with a queue depth of 256 to a Ceph virtual block device (RBD). The device is allocated on a 16-node all-HDD cluster.

Finally, for I/O sizes smaller than 64 KiB (not shown) the throughput of BlueStore is 20% higher than that of FileStore. For these I/O sizes BlueStore performs deferred writes by inserting the data into RocksDB first, and then asynchronously overwriting the object data to avoid fragmentation (§ 4.4.2).

The throughput of read operations in BlueStore is similar to or slightly better than that of FileStore for I/O sizes larger than 32 KiB. For smaller I/O sizes, as the rightmost graph in Figure 4.10 shows, FileStore throughput is better because of the kernel readahead. While RBD does implement a readahead, it is not as well tuned as the kernel readahead.

4.6.3 Overwriting Erasure-Coded (EC) Data

One of the features enabled by BlueStore is efficient overwrites of EC data. We measure the throughput of random overwrites for both BlueStore and FileStore. Our benchmark creates a 1 TB RBD using one client. The client mounts the block device and performs 5 GiB of random 4 KiB writes with a queue depth of 256. Since the RBD is striped into 4 MiB RADOS objects, every write results in an object overwrite. We repeat the experiment on a virtual block device allocated on a replicated pool and on EC pools with parameters k = 4 and m = 2 (EC4-2), and k = 5 and m = 1 (EC5-1).

Figure 4.11 compares the throughput of replicated and EC pools when using the BlueStore and FileStore backends. BlueStore EC pools achieve 6× more IOPS on EC4-2 and 8× more IOPS on EC5-1 than FileStore. This is due to BlueStore avoiding full physical copies during the first phase of the two-phase commit required for overwriting EC objects (§ 4.5.2). As a result, it is practical to use EC pools with applications that require data overwrite, such as RBD and CephFS, with the BlueStore backend.

4.7 Challenges of Building Storage Backends on Raw Storage

This section describes some of the challenges that the Ceph team faced when building a storage backend on raw storage devices from scratch.

4.7.1 Cache Sizing and Writeback

The operating system fully utilizes the machine's memory by dynamically growing or shrinking the size of the page cache based on the applications' memory usage. It writes back dirty pages to disk in the background, trying not to adversely affect foreground I/O, so that memory can be quickly reused when applications ask for it.

A storage backend based on a file system automatically inherits the benefits of the operating system page cache. A storage backend that bypasses the file system, however, has to implement a similar mechanism from scratch (§ 4.4.2). In BlueStore, for example, the cache size is a fixed configuration parameter that requires manual tuning. Building an efficient user space cache with the dynamic resizing functionality of the operating system page cache is an open problem shared by other projects, like PostgreSQL [46] and RocksDB [90]. With the arrival of fast NVMe SSDs, such a cache needs to be efficient enough that it does not incur overhead for write-intensive workloads, a deficiency that the current page cache suffers from [37].

4.7.2 Key-value Store Efficiency

The experience of the Ceph project demonstrates that moving all metadata to an ordered key-value store, like RocksDB, significantly improves the efficiency of metadata operations. However, the Ceph team has also found embedding RocksDB to be problematic in multiple ways: (1) RocksDB's compaction and high write amplification have been the primary performance limiters when using NVMe SSDs in OSDs; (2) since RocksDB is treated as a black box, data is serialized and copied in and out of it, consuming CPU time; (3) RocksDB has its own threading model, which limits the ability to do custom sharding. These and similar problems with RocksDB and other key-value stores keep the Ceph team searching for better solutions.

4.7.3 CPU and Memory Efficiency

Modern compilers align and pad basic datatypes in memory so that the CPU can fetch data efficiently, thereby increasing performance. For applications with complex structs, the default layout can waste a significant amount of memory [39, 134]. Many applications are rightly not concerned with this problem because they allocate short-lived data structures. A storage backend that bypasses the operating system page cache, on the other hand, runs continuously and controls almost all of a machine's memory. Therefore, the Ceph team spent a lot of time packing the structures stored in RocksDB to reduce the total metadata size and the compaction overhead. The main tricks used were delta and variable-integer encoding.

Another observation with BlueStore is that on high-end NVMe SSDs the workloads are becoming increasingly CPU-bound. For its next-generation backend, the Ceph community is exploring techniques that reduce CPU usage, such as minimizing data serialization and deserialization, and using the SeaStar framework [168], which avoids context switches due to locking by adopting a shared-nothing model.
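As an illustration of the second trick, the sketch below shows LEB128-style variable-integer encoding combined with delta encoding; it conveys the general idea only and is not BlueStore's encoder.

#include <cstdint>
#include <string>

// Encode an unsigned integer using 7 bits per byte; the high bit means
// "more bytes follow". Small values, the common case for lengths, offset
// deltas, and reference counts, take one or two bytes instead of a fixed 8.
void put_varint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Delta encoding composes naturally with varints: sorted offsets are stored
// as small differences rather than as full 64-bit values.
void put_deltas(std::string& out, const uint64_t* sorted, size_t n) {
  uint64_t prev = 0;
  for (size_t i = 0; i < n; i++) {
    put_varint(out, sorted[i] - prev);   // assumes sorted[i] >= prev
    prev = sorted[i];
  }
}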

4.8 Related Work

The primary motivators for BlueStore are the lack of transactions and unscalable metadata operations in file systems. In this section we compare BlueStore to previous research that aims to address these problems.

Transaction Support. Previous work has generally followed three approaches when introducing a transactional interface to file system users.

The first approach is to leverage the in-kernel transaction mechanism present in file systems. Examples of this are Btrfs's export of transaction system calls to user space [43], Transactional NTFS [103], Valor [180], and TxFS [85]. The drawbacks of this approach are the complexity and incompleteness of the interface, and a significant implementation complexity. For example, Btrfs and NTFS both recently deprecated their transaction interfaces [26, 104], citing the difficulty of guaranteeing correct or safe usage, which corroborates FileStore's experience (§ 4.3.1). Valor [180], while not tied to a specific file system, also has a nuanced interface that requires the correct use of a combination of seven system calls, and a complex in-kernel implementation. TxFS is a recent work that introduces a simple interface built on ext4's journaling layer; however, its implementation requires a non-trivial amount of change to the Linux kernel. BlueStore, informed by FileStore's experience, avoids using file systems' in-kernel transaction infrastructure.

The second approach builds a user space file system atop a database, utilizing existing transactional semantics. For example, Amino [217] relies on Berkeley DB [138] as the backing store, and Inversion [137] stores files in a POSTGRES database [184]. While these file systems provide seamless transactional operations, they generally suffer from high performance overhead because they accrue the overhead of the layers below. BlueStore similarly leverages a transactional database, RocksDB, but incurs zero overhead because it eliminates the file system and runs the database on a raw disk. (RocksDB in BlueStore runs on BlueFS, which is a lightweight user space file system and not a full POSIX file system like ext4 or XFS.)

The third approach provides transactions as a first-class abstraction in the operating system and implements all services, including the file system, using transactions. QuickSilver [162] is an example of such a system that uses built-in transactions for implementing a storage backend for a distributed file system. Similarly, TxOS [150] adds transactions to the Linux kernel and converts ext3 into a transactional file system. This approach, however, is too heavyweight for achieving file system transactions, and such a kernel is tricky to maintain [85].

Metadata Optimizations. A large body of work has produced a plethora of approaches to metadata optimizations in local file systems. BetrFS [96] introduces the Bε-tree as an indexing structure for efficient large scans. DualFS [145], hFS [224], and ext4-lazy [3] abandon the traditional FFS [128] cylinder group design and aggregate all metadata in one place to achieve significantly faster metadata operations. TableFS [156] and DeltaFS [225] store metadata in LevelDB running atop a file system and achieve orders of magnitude faster metadata operations than file systems. While BlueStore also stores metadata in RocksDB, a LevelDB derivative, to achieve a similar speedup, it differs from the above in two important ways. In BlueStore, RocksDB runs on a raw disk incurring zero overhead, and BlueStore keeps all metadata, including the internal metadata, in RocksDB as key-value pairs.
Storing internal metadata as variably-sized key-value pairs, as opposed to fixed-sized records on disk, scales more easily. For example, the Lustre distributed file system, which uses an ext4 derivative called LDISKFS for its storage backend, has changed its on-disk format twice in a short period to accommodate increasing disk sizes [20, 21].

4.9 Summary

Distributed storage system developers conventionally build their storage backends atop general-purpose file systems. This convention is attractive at first because general-purpose file systems provide most of the needed functionality, but in the long run it incurs a heavy file system tax, an overhead in code complexity, performance, and flexibility, on distributed storage systems. While developers are acutely aware of the file system tax, they continue to pay it because they believe that developing a special-purpose storage backend from scratch is an arduous process, akin to developing a new file system, which takes a decade to mature.

Relying on the Ceph team's experience, we show this belief to be inaccurate, and we demonstrate that BlueStore, a special-purpose storage backend developed from scratch, liberates Ceph from the file system tax. First, it reclaims the significant performance left on the table when building a storage backend on top of a file system. Second, it efficiently implements new features, not practically implementable otherwise, by gaining complete control of the I/O stack. Third, it frees Ceph from being locked into the hardware that the file system supports. In the next chapter, we leverage this freedom to liberate Ceph from the block interface tax as well.

Chapter 5

Freeing Ceph From the Block Interface Tax

The Ceph distributed storage system achieved freedom from the file system tax by implementing a clean-slate, special-purpose storage backend, BlueStore. In this chapter, we leverage this freedom and flexibility to extend BlueStore to work on zoned devices and liberate Ceph from the block interface tax as well. Since BlueStore stores metadata in the RocksDB key-value store, we first extend RocksDB to run on zoned devices. We then introduce new components to BlueStore that enable data management on zoned devices. Finally, we demonstrate how freedom from the block interface tax and the file system tax allows Ceph to achieve cost-effective data storage and predictable performance.

5.1 The Emergence of Zoned Storage

The hard drive industry paved the way to zoned storage by introducing Shingled Magnetic Recording (SMR). SMR increases hard drive capacity by over 20% by partially overlapping magnetic tracks on top of each other, constraining writes to large sequential chunks and preventing small random updates [75]. Hard drive manufacturers introduced Drive-Managed SMR (DM-SMR) drives, which expose the block interface through a translation layer (Chapter 2). Realizing the full potential of SMR without paying the block interface tax, however, is only possible with Host-Managed SMR (HM-SMR) drives. HM-SMR drives expose the new and backward-incompatible zone interface, technically known as ZBC/ZAC in the hard disk drive context, after the corresponding new command sets added to the SCSI and ATA standards [94, 95].

Despite the backward-incompatible interface, HM-SMR drives have seen widespread adoption. They are natively supported in the Linux kernel with a mature storage stack, and they were adopted by cloud storage providers [122, 123, 153] and storage server vendors [121, 31]. Over half of data center hard disk drives are expected to use SMR by 2023 [175].

While HDDs embraced sequential-write-only zones only recently through SMR, SSDs have always had them in the form of NAND flash erase blocks. SSD designers, however, avoided breaking backward compatibility in the past by exploiting the fast I/O performance of NAND flash, copious media overprovisioning [206], and sophisticated Flash Translation Layer (FTL) algorithms. Nevertheless, SSDs have not been able to completely avoid the block interface tax: the garbage collection performed by the FTL has long been identified as a source of unpredictable performance

and high tail latency in SSDs [80, 105, 221], and the extra hardware needed for the efficient operation of the FTL has significantly increased device cost. The Open-Channel SSD (OCSSD) initiative pioneered the elimination of the FTL by exposing erase blocks to the host to achieve improved and predictable performance, as well as a significant cost reduction [18]. Although OCSSD had major early backers [38, 70], the lack of standardization led to vendor-specific implementations, impeding widespread adoption.

Zoned Namespaces (ZNS) [19] is a new NVMe standard that draws inspiration from the experience of the OCSSD architecture as well as from the success of the ZBC/ZAC interface for HM-SMR drives. OCSSD takes, in some implementations, an extreme approach in exposing the raw NAND flash to the host, requiring the host to manage low-level details such as error correction and wear leveling. ZNS, on the other hand, exposes sequentially written erase blocks using a clean interface, hiding the complex details of raw flash management from the host. ZNS leverages the similarity of SMR zones to erase blocks and extends the ZBC/ZAC zoned storage model to manage NAND flash media, resulting in a single new interface, the zone interface, for managing both ZNS SSDs and HM-SMR HDDs. (Although ZNS is an NVMe extension and ZBC/ZAC are SCSI/ATA extensions, a library is in the works that hides this difference and enables the same application written for the zone interface to run on both ZNS SSDs and HM-SMR HDDs [208].) The zone interface aligns a zone with sequentially written erase blocks in ZNS SSDs and with a physical zone in HM-SMR HDDs, obviating the need for in-device garbage collection. As such, it moves the responsibility for data management and garbage collection to the host.

In this chapter, we demonstrate how the Ceph distributed storage system can avoid paying the block interface tax through the adoption of the zone interface. Ceph already avoids the file system tax by implementing a special-purpose, clean-slate storage backend, BlueStore. We extend BlueStore to work on zoned devices and demonstrate that Ceph can now (1) store data more cost-effectively by leveraging the extra 20% of capacity in HM-SMR drives, without paying the performance penalty of the translation layer in DM-SMR drives, and (2) reduce I/O tail latency by leveraging control over garbage collection and the redundancy of data in a distributed setting. Also, since BlueStore relies on RocksDB, a widely used key-value store, for storing metadata, as part of this work we extend RocksDB to work on zoned devices, and we demonstrate how RocksDB too avoids the block interface tax and eliminates in-device write amplification through intelligent data management.

5.2 Background

BlueStore stores metadata in RocksDB and data on a raw block device (Figure 4.5). Hence, we extend BlueStore to work on zoned devices in two steps: first, we handle the metadata path and extend RocksDB to run on zoned devices, and second, we handle the data path and extend BlueStore to store data on raw zoned devices. Before going into the details of how we accomplish these tasks, we introduce the zone interface and give an overview of RocksDB's architecture; an overview of BlueStore's architecture can be found in §4.4.

Figure 5.1: Zoned storage model. Figure 5.2: Zone state transition diagram.

5.2.1 Zoned Storage Overview

The zoned storage model partitions the device's logical block address (LBA) space into fixed-sized partitions called zones, as Figure 5.1 shows, and it introduces the following constraints: writes to a zone must be sequential in LBA order, and a zone must be reset through a zone reset command before the LBAs in it can be (sequentially) written again. Each zone also maintains a write pointer: writes to a zone increment the write pointer, and it points to the next writable LBA within the zone. A zone can be in one of the following states, as the simplified state diagram from the ZNS technical proposal [19] in Figure 5.2 shows:

• Empty: No writes have been issued since the last zone reset operation.
• Open: The zone is partially written.
• Full: All the LBAs in the zone have been written.
• Closed: The zone has been transitioned from the Open to the Closed state due to resource constraints.
• Read Only or Offline: A vendor-specific event has transitioned the zone to either the Read Only or the Offline state.

When a zone is in the Empty state, the write pointer points to the first LBA of the zone. When a zone is in the Open state, the write pointer points to an LBA within the zone, and when a zone is Full, the write pointer is invalid. If a write command attempts to write anywhere other than the LBA pointed to by the write pointer, an I/O error occurs.

In addition to the zone state transitions caused by writes and device-side events, zone commands are available to manipulate zone states. The open zone command transitions a zone in the Empty or Closed state to the Open state. The close zone command transitions a zone from the Open state to the Closed state. The reset zone command transitions a zone from the Open or Full state to the Empty state. If the transition is not valid, the command returns an I/O error.

We leave the full treatment of the zone interface to the respective specifications [19, 94, 95], and conclude our overview by highlighting two differences between HM-SMR HDDs and ZNS SSDs that are relevant to the discussion in the rest of this chapter.

The first difference concerns the zone size and the zone capacity. The zone size is fixed and equal for all zones in both HM-SMR HDDs and ZNS SSDs, and it is typically a power of two, a property used by the block layer [17] for fast serialization, boundary checks, and lookup of zone attributes. The zone capacity, on the other hand, identifies the usable capacity of a zone: it is equal to the zone size in HM-SMR

[Figure: a 64 MiB memtable in RAM above on-disk levels L0 (256 MiB), L1 (0.5 GiB), L2 (5 GiB), L3 (50 GiB), and L4 (500 GiB).]

Figure 5.3: Data organization in RocksDB. Green squares represent Sorted String Tables (SSTs). Blue squares represent SSTs selected for two different concurrent compactions.

HDDs, but it is variable and usually smaller than the zone size in ZNS SSDs due to inherently variable erase block sizes.

The second difference concerns how long a zone can stay in the Open state. Contrary to the zones in HM-SMR HDDs, the zones in ZNS SSDs cannot stay indefinitely in a partially written, that is, Open, state. A partially written zone contains partially written erase blocks, which are prone to errors from read disturbances or the physical properties of the media unless they are fully written. Thus, to maintain data reliability, the host can pad a partially written zone, or the SSD itself can detect and pad such zones and transition them to the Full state.
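The zone states and commands described above can be exercised directly from the host through the Linux zoned block device ioctls. The following sketch is illustrative and is not the code path BlueFS uses: it reports the first zone of a device (start, length, write pointer, condition) and notes where a zone reset would be issued.

#include <fcntl.h>
#include <linux/blkzoned.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  if (argc < 2) { fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]); return 1; }
  int fd = open(argv[1], O_RDONLY);
  if (fd < 0) { perror("open"); return 1; }

  // Ask the kernel for a report covering the first zone of the device.
  const unsigned nr = 1;
  size_t sz = sizeof(blk_zone_report) + nr * sizeof(blk_zone);
  auto* rep = static_cast<blk_zone_report*>(calloc(1, sz));
  rep->sector = 0;       // start reporting from LBA 0
  rep->nr_zones = nr;
  if (ioctl(fd, BLKREPORTZONE, rep) == 0 && rep->nr_zones > 0) {
    const blk_zone& z = rep->zones[0];
    printf("zone start=%llu len=%llu wp=%llu cond=%d\n",
           (unsigned long long)z.start, (unsigned long long)z.len,
           (unsigned long long)z.wp, (int)z.cond);
    // Resetting the zone returns it to the Empty state; this needs O_RDWR:
    //   blk_zone_range range = { z.start, z.len };
    //   ioctl(fd, BLKRESETZONE, &range);
  }
  free(rep);
  close(fd);
  return 0;
}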

5.2.2 RocksDB Overview

RocksDB [62] is an instance of a Log-Structured Merge-Tree (LSM-Tree), an indexing data structure that increases the effective use of bandwidth by reducing seeks on hard disk drives, compared to the alternative, the B-Tree [41]. Every key-value pair inserted into RocksDB is first individually written to a Write-Ahead Log (WAL) file using the pwrite system call, and then buffered in an in-memory data structure called the memtable. By default, RocksDB performs asynchronous inserts: pwrite returns as soon as the data is buffered in the OS page cache, and the actual transfer of data from the page cache to storage is done later by the kernel writeback threads. Hence, a machine crash may result in the loss of data for an insert that was acknowledged. For applications that require the durability and consistency of transactional writes, RocksDB also supports synchronous inserts that do not return until data is persisted on storage.

RocksDB stores data in files called Sorted String Tables (SSTs), which are organized in levels as shown in Figure 5.3. When the memtable reaches a preconfigured size, its content is written out to an SST in level 0, and a new memtable is created. The aggregate size of each level Li is a multiple of that of Li−1, starting with a fixed size at L1. When the number of L0 SSTs reaches a threshold, the compaction process selects all of the L0 SSTs, reads them into memory, sorts and merges them, and writes them out as new L1 SSTs. For higher levels, compactions are triggered when the aggregate size of the level exceeds a threshold, in which case one SST from the lower level and multiple SSTs from the higher level are compacted. For example, Figure 5.3 shows two concurrent compactions, with one of them happening between L1 and L2 and the other between L3 and L4. If memtable flushes or compactions cannot keep up with the rate of inserts, RocksDB stalls inserts to avoid filling storage and to prevent lookups from slowing down.
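The two insert modes described above map onto RocksDB's public API roughly as follows; this is a sketch of the standard interface rather than code from BlueStore.

#include <rocksdb/db.h>
#include <string>

// By default a Put() is acknowledged once the WAL write is buffered in the
// page cache; setting WriteOptions::sync forces the WAL to be persisted
// (fsync/fdatasync) before the call returns.
rocksdb::Status insert(rocksdb::DB* db, const std::string& key,
                       const std::string& value, bool durable) {
  rocksdb::WriteOptions wo;
  wo.sync = durable;
  return db->Put(wo, key, value);
}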

5.3 Challenges of RocksDB on Zoned Storage

RocksDB in BlueStore runs on top of BlueFS (§4.4.1), a minimal user space file system that runs on a block device. Every I/O system call issued by RocksDB is eventually issued by BlueFS to the raw storage device. Hence, to get RocksDB to run on a zoned device we need to adapt BlueFS to run on a zoned device, which comes with several challenges. Below we describe these challenges and our solutions to them.

5.3.1 Zone Cleaning

Placing SSTs produced by RocksDB one after another into the zones of a zoned device leads to the segment cleaning problem of the Log-Structured File System (LFS) [158]. Since SST sizes are much smaller than the zone size, after multiple compactions zones become fragmented: in addition to live SSTs, they end up containing dead SSTs that have been merged into a new SST written to another zone. Reclaiming the space occupied by the dead SSTs requires migrating live SSTs to another zone, which increases write amplification.

Recent work proposes a new data format and compaction algorithm to eliminate zone cleaning and achieve the ideal write amplification of 1 [222]. Cleaning can be eliminated, however, without making any changes to the compaction algorithm or data format, by simply matching the SST and zone sizes and aligning the start of an SST with the start of a zone. Hence, by mapping a complete SST to a zone, the space of a dead SST can be reclaimed by merely resetting the zone's write pointer, thereby eliminating zone cleaning and achieving a write amplification of 1. There are other compelling reasons for increasing the SST size, such as enabling disks to do streaming reads with fewer seeks, reducing expensive sync calls, and reducing the number of open file handles. This approach is similar to that of SMRDB [148]; however, unlike SMRDB we do not restrict the number of LSM-Tree levels and we do not introduce a new data format that breaks backward compatibility.
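A minimal sketch of the one-SST-per-zone mapping follows; the zone size and the reset callback are placeholders. The point is only that, once an SST starts at a zone boundary and fills the zone, deleting it reduces to resetting that zone's write pointer, with no data migration.

#include <cstdint>

constexpr uint64_t kZoneSize = 256ull << 20;  // assumed 256 MiB zones

inline uint64_t zone_of(uint64_t byte_offset) { return byte_offset / kZoneSize; }

// Called when an SST becomes dead after compaction; reset_zone stands in for
// the device zone-reset command (via libzbc or the kernel ioctl).
template <typename ResetFn>
void reclaim_sst(uint64_t sst_start_offset, ResetFn reset_zone) {
  reset_zone(zone_of(sst_start_offset));  // space reclaimed without migration
}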

5.3.2 Small SSTs

While we can specify the size of an SST as a configuration option to RocksDB, it is not guaranteed that all generated SSTs will be of the specified size. For example, if compaction takes nine SSTs from L3 and one SST from L2 and merges them, then unless there is no overlap among the merged data, the process may produce ten SSTs with the last one being smaller than the rest. Although having an SST smaller than the zone is not a problem for HM-SMR drives, it is a problem for ZNS SSDs because partially written zones may lose data after some time (§5.2.1). To address this issue, as a first step, when running on ZNS SSDs we null-pad the zones to which a small SST is written. Since padding consumes device bandwidth and is a side effect of the device's design, we count padding bytes as write amplification. Our experiments that insert semi-random data into RocksDB show that padding results in a write amplification of 1.2. We leave it as future work to fill the zones with useful data in the case of small SSTs and reach the ideal write amplification of 1.
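The accounting is simple: pad bytes are charged as write amplification, WA = (data bytes + pad bytes) / data bytes. The helper below is an illustrative sketch with an assumed zone capacity, not BlueFS code.

#include <cstdint>

constexpr uint64_t kZoneCapacity = 256ull << 20;  // assumed zone capacity

struct PaddedWrite {
  uint64_t data_bytes;
  uint64_t pad_bytes;
};

// Pads a small SST up to the zone capacity so the zone can transition to the
// Full state; assumes sst_bytes <= kZoneCapacity.
inline PaddedWrite pad_sst_to_zone(uint64_t sst_bytes) {
  return {sst_bytes, kZoneCapacity - sst_bytes};
}

inline double write_amplification(uint64_t data, uint64_t pad) {
  return static_cast<double>(data + pad) / static_cast<double>(data);
}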

5.3.3 Reordered Writes

Like most LSM-Tree implementations, RocksDB uses buffered I/O when writing compacted SSTs to disk. This keeps SSTs in the operating system page cache and improves performance significantly for two reasons: (1) during compaction, cached SSTs are read from memory, and (2) key lookups in cached SSTs are served from memory.

Using buffered I/O, however, does not guarantee the write ordering that is essential for zoned devices. Page writeback can happen from different contexts at the same time, and the pages picked up by each context may not be zone-aligned. Furthermore, there are no write-alignment constraints with buffered writes, so an application may write parts of a page across different operations. On a zoned device, however, that last page cannot be overwritten to add the remaining data when the sequential write stream resumes. Thus, to guarantee write ordering, we must (1) use direct I/O, which is enabled by passing the O_DIRECT flag to the open system call, (2) issue writes to a zone sequentially, and (3) use an I/O scheduler that preserves the ordering of writes issued from user space, such as the deadline or mq-deadline I/O schedulers [42]. While direct I/O does not cause performance problems for ZNS SSDs, due to their low latency and high internal parallelism, for HM-SMR drives it consumes a big chunk of storage bandwidth for reading SSTs during compaction. To avoid this, we implement whole-file caching in BlueFS, thereby serving SST files from memory during compaction.
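The write discipline this implies, direct I/O with an aligned buffer issued exactly at the zone's write pointer, looks roughly like the sketch below; error handling is abbreviated and the function and its parameters are illustrative.

#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

int write_at_write_pointer(const char* dev, off_t write_pointer,
                           const void* data, size_t len) {
  int fd = open(dev, O_WRONLY | O_DIRECT);
  if (fd < 0) return -1;

  // O_DIRECT requires the buffer, length, and offset to be aligned to the
  // device's logical block size (4096 assumed here).
  void* buf = nullptr;
  if (posix_memalign(&buf, 4096, len) != 0) { close(fd); return -1; }
  memcpy(buf, data, len);

  // Writing anywhere other than the zone's write pointer fails with an I/O error.
  ssize_t n = pwrite(fd, buf, len, write_pointer);

  free(buf);
  close(fd);
  return n == static_cast<ssize_t>(len) ? 0 : -1;
}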

5.3.4 Synchronous writes to the WAL

The libzbc [207] library is the de facto method for interacting with zoned devices [124, 222]. It provides the zbc_pwrite call for positional writes to the device, with semantics similar to the pwrite system call. Even though pwrite is a synchronous call, when used with buffered I/O, as in RocksDB (§5.2.2), it is effectively made asynchronous. With zoned devices, however, we have to use direct I/O, which means the zbc_pwrite call must wait for the device to acknowledge the write.

This has negligible overhead when data is written in large chunks, as is the case for memtable flushes and SST writes during compaction. Writes to the WAL, on the other hand, happen after each key-value insertion. As a result, every insertion must be acknowledged by the device, limiting the throughput to that of small synchronous writes to the device. To avoid this bottleneck, we switch to using the libaio library, the in-kernel asynchronous I/O framework, in BlueFS. This approach works as long as the asynchronous I/O operations are issued in order and the kernel I/O scheduler does not reorder I/Os [132]. We still use libzbc for resetting zones and flushing the drive.
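A sketch of the libaio submission path is shown below; the wrapper class and its parameters are illustrative, but io_setup, io_prep_pwrite, io_submit, and io_getevents are the standard libaio interface. Ordering is preserved as long as writes are submitted in LBA order and the scheduler (e.g., mq-deadline) does not reorder them.

#include <libaio.h>
#include <cstddef>

struct AsyncWriter {
  io_context_t ctx = 0;
  int fd;

  explicit AsyncWriter(int dev_fd, int depth = 32) : fd(dev_fd) {
    io_setup(depth, &ctx);
  }
  ~AsyncWriter() { io_destroy(ctx); }

  // Queue an aligned buffer at the zone's write pointer and return immediately;
  // the caller keeps ownership of the iocb until the completion is reaped.
  int submit(void* aligned_buf, size_t len, off_t write_pointer, iocb* cb) {
    io_prep_pwrite(cb, fd, aligned_buf, len, write_pointer);
    iocb* list[1] = {cb};
    return io_submit(ctx, 1, list);  // returns the number of iocbs submitted
  }

  // Reap up to max completions, typically from a separate completion thread.
  int reap(io_event* events, int max) {
    return io_getevents(ctx, 1, max, events, nullptr);
  }
};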

5.3.5 Misaligned writes to the WAL

Random and misaligned writes violate the zone interface and therefore cannot be used with zoned devices. In RocksDB, there are three sources of such writes. First, when the key and value sizes are smaller than 4 KiB, the last block of an SST may get overwritten when writing SSTs to disk. We found this to be a bug in RocksDB, which we reported and the RocksDB team has fixed [54]. Second, if an append operation to the WAL leaves empty space in the last written block,

the following append operation overwrites that block to fill the empty space. Third, RocksDB produces a handful of small files, such as the manifest file, that receive negligible amounts of I/O, which also requires page overwrites. In the following section we describe how we address the second and third sources of random writes.

5.4 Handling Metadata Path: RocksDB on Zoned Storage

All the I/O system calls issued by RocksDB are emulated by BlueFS, a user space file system that opens a raw block device and issues the actual I/O system calls. Among other functionality, BlueFS also implements a block allocator and journaling for metadata consistency. Below we describe several design changes we make to BlueFS so that it works on zoned devices.

5.4.1 File types and space allocation

There are three file types that are essential for the correct operation of RocksDB: SSTs, WAL files, and manifest files, which act as a transactional log of RocksDB state changes. SST and WAL files are large files that receive the bulk of the I/O. SSTs are the simplest to handle: they are append-only files, and we map them to individual zones upon creation and align their size to the zone size.

We handle misaligned writes to the WAL (§5.3.5) by wrapping every WAL write in a record with the write length stored inline and padded to a 4 KiB boundary. With this change, WAL reads can no longer directly know the data offset in an extent without first reading the record lengths. This, however, is not a problem because reading the WAL is not on the hot path: it is read only sequentially and only during crash recovery. In addition, we allocate a fixed number of zones for WAL files and use them as a circular buffer.

We handle last-block overwrites in manifest files by dedicating two zones to them, only one of which is active at a time. When a manifest file reaches a certain size, a new one is created and appended to the end of the zone, invalidating the older version. When a zone becomes full, the latest manifest file is written to the other zone, the write pointer of the full zone is reset, and these two zones are used as a circular buffer. During boot, the latest manifest file is identified by scanning both of the zones.
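The WAL framing can be sketched as follows; the exact on-disk layout here is an assumption for illustration, not BlueFS's actual record format.

#include <cstdint>
#include <cstring>
#include <string>

constexpr size_t kWalAlign = 4096;

// Wrap a WAL append in a record that stores its own length inline and is
// padded with zeros to a 4 KiB boundary, so the device-visible write stream
// stays sequential and block-aligned. Replay walks records by reading the
// length and then skipping the padding.
std::string frame_wal_record(const void* payload, uint32_t payload_len) {
  size_t total = sizeof(uint32_t) + payload_len;
  size_t padded = (total + kWalAlign - 1) / kWalAlign * kWalAlign;
  std::string rec(padded, '\0');
  memcpy(&rec[0], &payload_len, sizeof(payload_len));
  memcpy(&rec[sizeof(payload_len)], payload, payload_len);
  return rec;
}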

5.4.2 Journaling and Superblock

BlueFS maintains an inode for each file with a serial number, the list of extents allocated to the file (complete zones in the case of SST and WAL files), the actual size of the file, and so on. BlueFS uses an internal journal for consistency, which contains the only copy of all metadata. At mount, the journal is replayed and the inodes are loaded into memory. For every metadata operation, such as file creation or zone allocation, the journal and the in-memory metadata are updated. When the journal reaches a preconfigured size, it is compacted and written to a new zone, and the superblock is updated to point to it.

Since a superblock cannot be updated in place, we use the same two-zone circular buffer method that we used for manifest files and append the latest version of the superblock to the

end of the zone. At mount, the latest version of the superblock is found by scanning the first two zones of the drive.
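The two-zone scheme shared by the superblock and the manifest files can be sketched as below; the append and reset callbacks stand in for BlueFS I/O primitives, and the member names are illustrative.

#include <cstdint>
#include <functional>
#include <string>

struct TwoZoneLog {
  uint64_t zones[2];     // the two dedicated zone numbers
  int active = 0;        // index of the zone currently being appended to
  uint64_t next_seq = 1;

  // append(zone, seq, blob) returns false when the zone is full;
  // reset(zone) resets the zone's write pointer.
  void write_version(const std::string& blob,
                     std::function<bool(uint64_t, uint64_t, const std::string&)> append,
                     std::function<void(uint64_t)> reset) {
    if (!append(zones[active], next_seq, blob)) {
      int other = 1 - active;
      append(zones[other], next_seq, blob);  // write the latest version elsewhere
      reset(zones[active]);                  // reclaim the full zone
      active = other;
    }
    ++next_seq;   // at mount, the highest sequence number found in either zone wins
  }
};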

5.4.3 Caching

To reduce the number of SST files read from disk during compaction (§5.3.3), we implement a FIFO whole-file cache in BlueFS. Every SST file that is written is also placed into a cache of preconfigured size. A disadvantage of our cache implementation is that its unit of eviction is a file. While the operating system page cache evicts only part of a file when it runs out of space and continues to serve the remaining blocks from memory, by evicting whole files our implementation, for example, leaves 12.5% of the 2 GiB cache empty when using 256 MiB SST files. We leave a more efficient cache implementation as future work.
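A minimal sketch of such a write-through FIFO whole-file cache is given below; the class and member names are illustrative, and the real BlueFS implementation differs in detail.

#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>

class FifoFileCache {
 public:
  explicit FifoFileCache(uint64_t capacity_bytes) : capacity_(capacity_bytes) {}

  // Insert a newly written SST; evict the oldest whole files if over capacity.
  void insert(const std::string& name, std::string contents) {
    used_ += contents.size();
    order_.push_back(name);
    files_[name] = std::move(contents);
    while (used_ > capacity_ && !order_.empty()) {
      auto it = files_.find(order_.front());
      order_.pop_front();
      if (it != files_.end()) {
        used_ -= it->second.size();
        files_.erase(it);
      }
    }
  }

  // Returns the cached file contents, or nullptr on a miss.
  const std::string* lookup(const std::string& name) const {
    auto it = files_.find(name);
    return it == files_.end() ? nullptr : &it->second;
  }

 private:
  uint64_t capacity_;
  uint64_t used_ = 0;
  std::deque<std::string> order_;
  std::unordered_map<std::string, std::string> files_;
};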

5.5 Evaluation of RocksDB on HM-SMR HDDs

We evaluate RocksDB with BlueFS running on zoned devices in two parts. In this section, we evaluate it on an HM-SMR drive, where we incrementally show how each of our optimizations improves performance. In the next section, we take our optimized code and evaluate it on a prototype ZNS SSD.

5.5.1 Evaluation Setup

We run all experiments on a system with an AMD Opteron 6272 2.1 GHz CPU with 16 cores and 128 GiB of RAM, running Linux kernel 4.18 on the Ubuntu 18.04 distribution. For evaluation we use a 12 TB CMR drive (HGST HUH721212AL), a 14 TB HM-SMR drive (HGST HSH721414AL), and an 8 TB DM-SMR drive (Seagate ST8000AS0022). These drives are mechanically similar and have 200+ MiB/s of sequential I/O throughput at the outer diameter.

Unless otherwise noted, all our experiments run a benchmark that inserts 150 million pairs of 20-byte keys and 400-byte values using the db_bench tool that comes with RocksDB. During the benchmark run, RocksDB writes 200 GiB of data through memtable flushes, compaction, and WAL inserts, and reads 100 GiB of data due to compaction. The size of the database is 59 GiB uncompressed and 31 GiB compressed. To emulate a realistic environment where the amount of data in the operating system page cache is a small fraction of the data stored on a high-capacity drive, we limit the operating system memory to 6 GiB. This leaves slightly more than 2 GiB of RAM for the page cache after the memory used by the operating system and the benchmark application, resulting in a 1:15 ratio of cached to on-disk data. We repeat every experiment at least three times and report the average and the standard deviation.

We establish two baselines. The first baseline is RocksDB running on the XFS file system (recommended by the RocksDB team [63]) on a CMR drive. This is the baseline whose performance we want to match with RocksDB running on BlueFS on an HM-SMR drive, given that we are running RocksDB on a higher-capacity device with a more restrictive interface. The second baseline is RocksDB running on the XFS file system on a DM-SMR drive. This is a baseline we want to beat, given that it is the only viable option for running RocksDB on a high-capacity SMR drive but has suboptimal performance due to the block interface tax.

Figure 5.4: Benchmark runtimes of RocksDB with default and optimized settings on a CMR drive.

Figure 5.5: Memory used by the OS page cache and heap allocations, and the swap usage during the optimized RocksDB run.

5.5.2 Establishing CMR Baseline

We run RocksDB on XFS and tune the RocksDB settings for optimal performance on a CMR drive [64]. Figure 5.4 shows that with the optimized settings, the benchmark completes 34% faster. Figure 5.5 shows the heap memory allocated by RocksDB and db_bench, the memory used by the operating system page cache, and the swap usage. The two key observations are that the page cache almost always uses more than 2 GiB of RAM, and no swapping occurs. Next, we look at the settings we change to get this effect.

We focus on the two tunables that impact performance most: compaction_readahead_size and write_buffer_size. We increase the former from the default of 0 to 2 MiB. This setting determines the size with which the pread system call issues read requests for reading SSTs during compaction. With the default setting, RocksDB issues 4 KiB requests and prefetches 256 KiB at a time; however, given that the compaction threads read large SSTs sequentially into memory, a 256 KiB request size is suboptimal and incurs unnecessary seeks due to interruptions from other ongoing disk operations. Figure 5.6 (a) shows that by increasing the request size to 2 MiB, the number of seeks drops from an average of 110 seeks per second to almost 20 seeks per second. We also increase write_buffer_size from the default of 64 MiB to 256 MiB. This setting determines the memtable size that is buffered in memory before being flushed to disk. As expected, increasing the memtable size by 4× reduces the number of expensive fsync/fdatasync system calls by a similar amount, as Figure 5.6 (b) shows. These two configuration changes are the reason behind the 34% speedup shown in Figure 5.4.

We also experimented with the maximum number of concurrent background jobs and settled on 4 threads: lower or higher values result in worse performance. Finally, we increased the SST size from 64 MiB to 256 MiB, but this change did not have a noticeable effect on performance; it did, however, allow us to exactly map an SST to an HM-SMR zone. We choose RocksDB with optimized settings running on XFS on a CMR drive as our CMR baseline.
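Expressed as RocksDB options, the tuning above looks roughly like the following; the values mirror the text, and the option names are the standard ones in rocksdb::Options.

#include <rocksdb/options.h>

rocksdb::Options tuned_options() {
  rocksdb::Options opt;
  opt.compaction_readahead_size = 2 << 20;   // 2 MiB reads for compaction input
  opt.write_buffer_size = 256 << 20;         // 256 MiB memtable before flush
  opt.max_background_jobs = 4;               // concurrent flush/compaction threads
  opt.target_file_size_base = 256 << 20;     // 256 MiB SSTs, matching the zone size
  return opt;
}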

—Þ (a) (b) 140 10,000 120 8,000 100 80 Default 6,000 Optimized 60 4,000 40 Seeks per second 20 2,000 Number of fsync calls 0 0 0 500 1,000 1,500 2,000 2,500 3,000 Default Optimized Time (s)

Figure .â: (a) e number of seeks per second due to reads and (b) the number of fsync and fdatasync system calls during the benchmark run with default and optimized RocksDB settings on a CMR drive. settings running on XFS on a CMR drive as our CMR baseline.

5.5.3 Establishing DM-SMR Baseline

The optimized settings that we used for the CMR drive are beneficial to a DM-SMR drive for the same reasons. The settings, however, have an even bigger impact on the DM-SMR drive, where the benchmark completes 63% faster than when running with the default settings. This may sound surprising, given two facts: (1) LSM-Trees are known to produce large sequential writes [119, 195, 136], and (2) we have already established that DM-SMR drives handle sequential writes with no overhead (§3.4.3). Why, then, does RocksDB, which produces sequential writes, suffer a performance overhead on a DM-SMR drive?

It turns out that current DM-SMR drives can detect sequential writes only if there is a single stream. RocksDB, on the other hand, produces concurrent sequential streams due to memtable flushes and concurrent compactions (§5.2.2), and the block layer in the operating system splits large I/O requests into small chunks; for hard drives the size of this chunk is 512 KiB by default in Linux. Chunks from different streams get mixed and sent to the drive in an arbitrary order, which appears as non-sequential writes to the drive. Metadata writes of the underlying file system further complicate the detection of sequential streams. Hence, writes to otherwise append-only files end up in the persistent cache and cause garbage collection, incurring a performance penalty for RocksDB running on DM-SMR drives.

We choose RocksDB with optimized settings running on XFS on a DM-SMR drive as our DM-SMR baseline. The left two bars in Figure 5.7 show both of our baselines, where the CMR baseline is about 86% faster than the DM-SMR baseline. Our aim is to get RocksDB running on BlueFS on an HM-SMR drive to beat RocksDB on a DM-SMR drive and to perform at least as well as RocksDB on a CMR drive. For brevity, from now on we omit the file system when discussing the baselines and our work: RocksDB on CMR or DM-SMR drives implies that RocksDB is running on XFS, and RocksDB on an HM-SMR drive implies that RocksDB is running on BlueFS.

Figure 5.7: Benchmark runtimes of the baselines, RocksDB on DM-SMR and CMR drives (on the left), and RocksDB on BlueFS iterations on an HM-SMR drive. The benchmark asynchronously inserts 150 million key-value pairs with 20-byte keys and 400-byte values.

5.5.4 Getting RocksDB to Run on an HM-SMR Drive

In our first iteration of changes to BlueFS we use libzbc and perform synchronous I/O on the HM-SMR drive, which causes the writes to the WAL to become a bottleneck (§5.3.4). Figure 5.7 shows that on this iteration, denoted by HM-SMR (sync I/O), the benchmark completes in 4,800 seconds, slower than both the DM-SMR and CMR baselines.

Figure 5.8 (a) gives a detailed look at the first 50 seconds of the run. We see that in the first 7 seconds, during which only writes to the WAL happen, the write throughput is fixed at 40 MiB/s. In the next 3 seconds a memtable is flushed, which increases the write throughput to 80 MiB/s and reduces the insert throughput to 60 Kops/s. This is expected because insert throughput is determined by the speed of writes to the WAL, and during a memtable flush we have two threads sharing the bandwidth: one is flushing memtable data and the other is writing to the WAL. Once the memtable flush completes at the 10th second, the insert throughput jumps back and the write throughput is again at 40 MiB/s. Another memtable flush happens and the pattern repeats, and right at the end of the second memtable flush a compaction starts at the 17th second. This increases read and write throughput, because the compaction is done by a single thread that reads old SSTs and writes new ones, and reduces insert throughput to the same level as during the memtable flush, because we still have two threads sharing the bandwidth. Before the compaction completes, another memtable flush starts at the 23rd second. Now we have three threads sharing the disk bandwidth. The write bandwidth is the highest because all of them are writing, the read bandwidth is slightly lower, and most importantly, the insert throughput is at its lowest. During the whole run, RocksDB never stalls inserts (§5.2.2) because the slow writes to the WAL produce little enough work that memtable flushes and compactions can keep up.

5.5.5 Running Fast with Asynchronous I/O

In our second iteration we switch to libaio and perform asynchronous I/O on the HM-SMR drive; we still rely on libzbc for zone operations. Figure 5.7 shows that on this iteration, denoted by HM-SMR (async I/O), the benchmark completes in 3,000 seconds, 60% faster than when performing synchronous I/O.

—À (a) HM-SMR (sync I/O) (b) HM-SMR (async I/O) Insert Throughput Insert Throughput Stalls 100 200 75 150 50 100 Kops/s 25 Kops/s 50 0 0 80 Write Throughput Read Throughput 160 Write Throughput Read Throughput 60 120 40 80 MiB/s 20 MiB/s 40 0 0 2 Compaction Memtable Flush 2 Compaction Memtable Flush 1 1

Thread 0 Thread 0

0 5 10 15 20 25 30 35 40 45 50 0 5 10 15 20 25 30 35 40 45 50 Time (s) Time (s)

Figure .—: Insertion throughput, HM-SMR drive read/write throughput, and compaction and memtable žush operations during the rst ý seconds of the benchmark with RocksDB on HM- SMR drive using (a) synchronous and (b) asynchronous I/O. denoted by HM-SMR (async I/O)—the benchmark completes in ç,ýýý seconds, âý% faster than when performing synchronous I/O. At this point, RocksDB on HM-SMR is already çý% faster than RocksDB on DM-SMR, but it is not as fast as RocksDB on CMR because during compaction it reads SSTs from disk using direct I/O and consumes the drive bandwidth that could be used for writes. Figure .— (b) gives a detailed look at the rst ý seconds of the run. We see that writes to the WALduring the rst ç seconds are almost twice as fast, and memtable žush starts ç seconds earlier, compared to Figure .— (a). Fast inserts result in continuous memtables žushes, and similarly, compaction starts Þ seconds earlier and does not stop. Unable to keep up with the work, RocksDB stalls inserts almost every seconds, which results in a spiky throughput graph. Despite the stalls, the average throughput is higher and RocksDB completes the benchmark âý% faster than when using synchronous I/O. Figure .À shows the similar graph for the whole run. e most outstanding pattern in Fig- ure .À (a) are the long periods of very low throughput, such as the one between ¥ ýth and ââýth seconds. Figure .À (c) shows that during this period only compaction and no memtable žushing is happening, suggesting that RocksDB has stalled inserts until compaction backlog is cleared. During this period, ešectively only the compaction thread is running, and as the period between ¥ ýth and ââýth seconds shows in Figure .À (b), it divides the disk bandwidth between reading SSTs from the drive and writing them out. To speed up the reads and thereby compactions, in our next iteration we add whole le caching of SSTs to BlueFS.

. .â Running Faster with a Cache In our third iteration we add a simple write-through FIFO cache to BlueFS for caching the SST les produced at the end of compaction. Figure .Þ shows that on this iteration—denoted by HM-SMR (async I/O + cache)—the benchmark completes in ò, ýý seconds—òý% faster than the version

Àý (a) 180 150 Insert Throughput 120 90

Kops/s 60 30 0 (b) 180 150 Write Throughput Read Throughput 120 90

MiB/s 60 30 0 (c) Compaction Memtable Flush 3 2

Thread 1 0 (d) 250 200 150 100 50

Zones Allocated 0 0 300 600 900 1,200 1,500 1,800 2,100 2,400 2,700 3,000 Time (s)

Figure .À: (a) Insertion throughput, (b) read and write throughput, (c) compaction and memtable žush operations, and (d) number of zones allocated during the whole benchmark run of RocksDB on HM-SMR drive using asynchronous I/O.

Figure 5.10: (a) Insertion throughput and (b) read and write throughput during the whole benchmark run of RocksDB on an HM-SMR drive using asynchronous I/O and caching.

with no cache. In this run, we set the SST cache size to 2 GiB, the same amount of memory that the operating system page cache uses when we run the benchmark on the CMR drive (§5.5.2).

Figure 5.10 shows the detailed graph for the whole run. Comparing it to Figure 5.9 we see that, for example, up to the 450th second the read throughput is lower, and between the 450th and 660th seconds the read throughput stays below 30 MiB/s, compared to rising to 40 MiB/s, suggesting that some of the reads are being served from the cache. Similarly, the write throughput stays above 60 MiB/s, compared to staying at 40 MiB/s, and the insertion throughput is not flat at the bottom, indicating that writes to the WAL are still happening and the inserts are not stalled as badly as in the case with no file caching. Overall, Figure 5.10 has shorter periods of low insertion throughput than Figure 5.9 and therefore a higher average insertion throughput. Thus, with all our optimizations we almost match the CMR baseline, thereby avoiding the block interface tax; the small difference is caused by the inefficiency of whole-file caching, which can be improved with a better design (§5.4.3).

5.5.7 Space Efficiency

Figure 5.9 (d) shows that RocksDB on HM-SMR makes efficient use of disk space. Matching the figure with Figure 5.9 (c) shows that the number of allocated zones grows during long compaction processes and the zones are released at the end. This is because RocksDB first merges multiple SSTs into one large temporary file and then deletes the temporary file after creating new SSTs as a result of the merge. We see that the number of zones drops to 168, which corresponds to about 42 GiB, surprisingly larger than the 31 GiB database size (§5.5.1). This is the result of the benchmarking application, db_bench, shutting down as soon as it finishes the benchmark, without waiting for all compactions to complete, leaving some large temporary SSTs around. We modified db_bench to wait until all compactions have completed, and observed that the number of zones dropped to 135, which is about 33 GiB, only slightly larger than the database size because some of the space is occupied by the WAL files, and some of the SSTs produced as a result of compaction may end up being smaller than 256 MiB.

5.6 Evaluation of RocksDB on ZNS SSDs

In this section we compare RocksDB on BlueFS running on a ZNS SSD to RocksDB running on XFS on a conventional enterprise SSD with an FTL inside. We show two results. First, we demonstrate that, despite popular belief, LSM-Trees in general and RocksDB in particular are not ideal workloads for conventional SSDs: they can result in a device write amplification of 5 on enterprise SSDs. Second, we demonstrate that the zone interface is a natural match for LSM-Trees, and we show how RocksDB on BlueFS uses the zone interface to avoid the block interface tax and achieve a write amplification of 1.2.

5.6.1 Evaluation Setup

We perform all experiments on a system with a six-core Intel i7-5930K (Haswell-E) 3.5 GHz CPU, running Linux kernel 5.2 on the Ubuntu 18.04 distribution. We fix the memory size to 6 GiB using a kernel boot setting to emulate a real-world setting where data does not fit into memory. As the ZNS SSD we use a prototype device with custom firmware that implements ZNS as a shim layer inside a 1 TB Western Digital PC SN720 NVMe SSD [50], on top of the existing FTL. As the conventional SSD we use an enterprise data center SSD with 3.84 TB capacity.

Figure 5.11: Device write amplification of the enterprise SSD during RocksDB benchmarks. First, the fillseq benchmark sequentially inserts 7 billion key-value pairs; it completes in about two hours, filling the drive up to 80%, and the write amplification is 1 during this period. Next, the overwrite benchmark overwrites 7 billion key-value pairs in about 40 hours, during which the write amplification rises to about 5 and stays there.

5.6.2 Quantifying the Block Interface Tax of RocksDB on Conventional SSDs

Like most LSM-Trees, RocksDB generates large sequential writes when compacting SSTs and flushing memtables. While this pattern is ideal for low write amplification, when these operations result in concurrent sequential streams, they lead to surprisingly high write amplification on conventional SSDs.

To demonstrate the high write amplification caused by RocksDB when running on a conventional SSD, we perform the following experiment. We install the XFS file system on the enterprise SSD and configure RocksDB to use 512 MiB SST files. Since enterprise SSDs overprovision 28% of the NAND flash [206], they do not start garbage collection until most of the drive has been written. Therefore, we first fill 80% of the drive by inserting 7 billion key-value pairs of 20-byte keys and 400-byte values, using the fillseq benchmark of db_bench. We then start randomly overwriting these key-value pairs using the overwrite benchmark. We measure the write amplification of the enterprise SSD at 15-minute intervals, using proprietary tools obtained from the vendor.

Figure 5.11 shows the disk capacity usage and the write amplification of the conventional enterprise SSD under these workloads. The fillseq benchmark completes in about two hours, filling 80% of the drive. During fillseq, a memtable is buffered in memory and flushed to disk, and no compaction happens due to the ordered nature of the inserted keys: moving an SST to a higher level is a fast rename operation. As a result, the write amplification is 1 during the first two hours.

Once the fillseq benchmark completes, the overwrite benchmark starts, resulting in continuous concurrent compactions and taking almost 40 hours to complete. As Figure 5.11 shows, the write amplification gradually increases to around 5 and stays there until the end. The primary reason for this surprisingly high write amplification is the mixing of concurrent sequential streams. Due to NAND flash constraints, the SSD has to buffer a certain amount of data, either in NVRAM or in capacitor-backed RAM, before writing it to the NAND flash. Hence, although memtable flushes and compactions result in sequential writes, when they happen concurrently the SSD mixes these streams into a single internal buffer that it then writes across NAND flash dies to increase parallelism. The data from different streams ends up in different erase blocks, effectively resulting in random write behavior, which eventually leads to garbage collection and high write amplification.

Figure 5.12: Device write amplification of the prototype ZNS SSD during RocksDB benchmarks. First, the fillseq benchmark sequentially inserts 1.8 billion key-value pairs; it completes in about an hour, filling the drive up to 80%, and the write amplification is 1 during this period. Next, the overwrite benchmark overwrites 1.8 billion key-value pairs in about 26 hours, during which the write amplification stays around 1.

5.6.3 Avoiding the Block Interface Tax with RocksDB on ZNS SSDs

To demonstrate how RocksDB running on BlueFS avoids high write amplification, we repeat the same experiment on the ZNS SSD, but using fewer keys due to the prototype device's smaller capacity, and we measure the write amplification at the bottom of the shim layer.

Figure 5.12 shows the zone usage and the write amplification measured at the shim layer of the prototype ZNS SSD under a similar workload. The fillseq benchmark completes in about an hour, during which the write amplification is 1 because no compaction occurs and the generated SSTs exactly fill their zones. Once the fillseq benchmark completes, the overwrite benchmark starts, during which concurrent compactions produce SSTs smaller than the zone size. As described before (§5.3.2), we pad the zones containing small SSTs with nulls to avoid read disturbances (§5.2.1). Therefore, we compute the write amplification as (data bytes + pad bytes) / (data bytes). As Figure 5.12 shows, the write amplification on the ZNS SSD is close to 1.2 on average.

Finally, although the benchmark we run on the ZNS SSD has 3.8× fewer keys, it completes only 1.5× quicker than when running on the enterprise SSD with an FTL. There are two reasons for this.

First, the ZNS SSD prototype is based on a client SSD [50] that has less internal parallelism than the enterprise SSD. Second, and most importantly, the shim layer in the prototype still runs on top of the FTL of the client SSD. So while our measurements at the bottom of the shim layer demonstrate a significant reduction in write amplification, this reduction does not translate into a faster completion time because writes to zones are still handled by the FTL running below. On a real ZNS SSD, the reduced write amplification will result in a significantly faster completion time.

In summary, we demonstrate how LSM-Trees in general, and RocksDB in particular, can leverage the zone interface to avoid the block interface tax and achieve a significant reduction in write amplification. Work based on the contributions described here is currently being merged into the RocksDB project [62]. We leave it as future work, for when production ZNS SSDs become available, to demonstrate how the reduced write amplification translates into better performance.

5.7 Handling Data Path: BlueStore on Zoned Storage

The work described so far in this chapter covered handling the metadata path for BlueStore on zoned storage, which entailed getting RocksDB to run on zoned devices. In this section we describe handling the data path for BlueStore on zoned storage, which entails getting BlueStore to manage data on zoned devices.

As described before (§4.2), out of the box Ceph provides three services on top of RADOS: the RADOS Gateway (RGW), an object store similar to Amazon S3 [8]; the RADOS Block Device (RBD), a virtual block device similar to Amazon EBS [7]; and CephFS, a distributed file system with POSIX semantics. Although all of these services run on top of the RADOS layer, they result in different I/O patterns with varying complexity at the RADOS layer. In this work we target the service that results in the simplest I/O patterns, the RGW service, which provides operations for reading and writing variable-sized immutable objects. Choosing to support the RGW service restricts the object operations at the RADOS layer to creating, deleting, truncating, appending, and fully overwriting objects.

5.7.1 Additions and Modifications to BlueStore

To manage objects on zoned devices, we introduce a new space allocator, a new freelist manager, and a new garbage collector to BlueStore. Next, we give a high-level overview of these new components and the other changes that we introduced to BlueStore.

ZonedAllocator

BlueStore already comes with several space allocators optimized for different workloads (§4.4). None of these allocators, however, can work with zoned devices, due to their more restrictive interface. More specifically, since block devices allow in-place overwrites of arbitrary blocks, no region of a block device ever becomes stale due to deletion or overwrite; it only becomes free and immediately reusable. Thus, existing allocators designed for block devices do not track stale regions,

which regularly occur on zoned devices.

We introduce ZonedAllocator, which keeps two pieces of information in memory for each zone: a write pointer and the number of stale bytes within the zone. When ZonedAllocator receives an allocation request, it finds the first zone that can fit the allocation request, starting at the lowest-numbered zone. Once a zone is found, ZonedAllocator returns the value of the in-memory write pointer for that zone to the caller and advances the write pointer by the size of the allocation. The in-memory write pointer in ZonedAllocator is distinct from the actual write pointer within the device, which advances when data is written to the allocated space. When an object gets deleted or truncated, ZonedAllocator receives a deallocation request with an offset and size. In this case ZonedAllocator computes from the offset the zone in which the deallocation happens, and it increments the number of stale bytes for that zone by the passed-in size.

The concurrent nature of storage backend operations in Ceph creates a challenge that is specific to ZonedAllocator. As explained before (§4.2), RADOS objects are sharded among placement groups (PGs), and multiple PGs are associated with a single object storage device (OSD), which runs the storage backend, BlueStore in our case. BlueStore maintains a group of threads for handling object writes to all of the PGs associated with the OSD. Although all of the object writes within a PG are serialized, object writes to different PGs may happen concurrently in different threads. Furthermore, BlueStore writes an object using a transaction mechanism that goes through multiple states, where space allocation and data write are two separate states. This poses a problem for ZonedAllocator because two threads may allocate space from the same zone in one order, but they may get rescheduled and write data to the zone in a different order: for example, thread A may call into ZonedAllocator and receive offset 0 from a zone, and then thread B may receive offset 65,536 from the same zone. Later the threads may get rescheduled and thread B may run first and issue a write to offset 65,536, followed by thread A issuing a write to offset 0, a violation of the sequential write requirement.

One seemingly simple solution to this problem is to pin threads to zones, so that writes within a zone are always sequential. This solution, however, is costly for HM-SMR HDDs because, given the 256 MiB zone size, 8 threads, for example, may issue writes within a 2 GiB span, causing significant seek overhead. We implemented this solution and found over a 50% reduction in write throughput on a single OSD.

A better solution is to utilize the new ZONE APPEND command in the zone interface. This command, inspired by Nameless Writes [223], was introduced to avoid in-kernel lock contention when multiple writers are writing to the same zone in a high-end ZNS SSD [16]. The ZONE APPEND command works as follows: a writer issues the command, specifying the data and the zone number to which the data should be written. The drive (1) internally decides where within the zone to write the data, (2) writes the data, and (3) returns the offset of the write to the host. Thus, multiple threads can concurrently issue ZONE APPEND to the same zone, and the kernel does not have to serialize access to the zone using a lock. The ZONE APPEND command effectively moves space allocation to the device and further simplifies ZonedAllocator.
Now ZonedAllocator only tracks how much space is left in each zone and returns the zone number, instead of an offset, in which the space was allocated. Rescheduling of two threads performing concurrent space allocation and data writes is no longer a problem because the offsets of data writes are determined by the device at the time of the writes. Unfortunately, as of this writing the ZONE APPEND command has not stabilized in the Linux kernel;

therefore, we solved this problem by combining the space allocation and data write steps in BlueStore into a single atomic step, using a lock.
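The in-memory state that ZonedAllocator maintains can be sketched as follows; this mirrors the description above (first-fit allocation, stale-byte accounting on deallocation) but is a simplified illustration, not the actual Ceph code.

#include <cstdint>
#include <vector>

class ZonedAllocator {
 public:
  ZonedAllocator(uint64_t zone_size, uint64_t num_zones)
      : zone_size_(zone_size), zones_(num_zones) {}

  // Returns the byte offset of the allocation, or UINT64_MAX if no zone fits it.
  uint64_t allocate(uint64_t len) {
    for (uint64_t z = 0; z < zones_.size(); ++z) {
      if (zones_[z].write_pointer + len <= zone_size_) {
        uint64_t off = z * zone_size_ + zones_[z].write_pointer;
        zones_[z].write_pointer += len;   // in-memory pointer, not the device's
        return off;
      }
    }
    return UINT64_MAX;
  }

  // Deallocation only marks bytes as stale; the cleaner reclaims them later.
  void deallocate(uint64_t off, uint64_t len) {
    zones_[off / zone_size_].stale_bytes += len;
  }

 private:
  struct Zone {
    uint64_t write_pointer = 0;
    uint64_t stale_bytes = 0;
  };
  uint64_t zone_size_;
  std::vector<Zone> zones_;
};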

ZonedFreelistManager

The freelist manager module in BlueStore is responsible for a persistent representation of the parts of the disk currently in use (§5.7). Like the existing space allocators, the existing freelist manager is not adequate for zoned devices, for the same reason. We introduce ZonedFreelistManager, whose design mostly parallels that of ZonedAllocator: it also keeps a write pointer and the number of stale bytes for each zone. Unlike ZonedAllocator, however, ZonedFreelistManager stores these persistently in RocksDB in the Z namespace, using the zone number as the key and two concatenated 32-bit unsigned integers as the value. When the OSD boots, ZonedFreelistManager loads the state of each zone, the write pointer and the number of stale bytes, from RocksDB. ZonedAllocator then reads the persistent state of each zone from ZonedFreelistManager and initializes its in-memory state for each zone.

ZonedFreelistManager receives the same allocation request that ZonedAllocator has received, after the data has been safely written to disk in the allocated space and the in-device write pointer has advanced. At this point ZonedFreelistManager advances the write pointer in RocksDB to match the in-device write pointer. Similarly, ZonedFreelistManager receives the same deallocation request that ZonedAllocator has received, after the metadata of the object residing in the deallocated space has been deleted from RocksDB. At this point ZonedFreelistManager computes the corresponding zone and increments the number of stale bytes in RocksDB for that zone by the passed-in size. To avoid the cost of point queries, ZonedFreelistManager uses the merge operator in RocksDB [91] to update the write pointer and the number of stale bytes.
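A sketch of how this persistent state might be encoded and updated is shown below: the key is the zone number in the Z namespace, the value packs deltas for the write pointer and the stale-byte count as two 32-bit integers, and updates go through RocksDB's Merge() so they avoid a read-modify-write. The key layout is illustrative, not BlueStore's exact encoding, and the merge operator that adds the deltas (set via options.merge_operator) is omitted.

#include <rocksdb/db.h>
#include <cstdint>
#include <cstring>
#include <string>

static std::string zone_key(uint32_t zone) {
  std::string k = "Z";                       // Z namespace prefix
  k.append(reinterpret_cast<const char*>(&zone), sizeof(zone));
  return k;
}

static std::string zone_delta(uint32_t wp_advance, uint32_t stale_delta) {
  std::string v(8, '\0');
  memcpy(&v[0], &wp_advance, 4);             // delta to the write pointer
  memcpy(&v[4], &stale_delta, 4);            // delta to the stale-byte count
  return v;
}

// Called after data is durably written and the in-device write pointer advanced.
rocksdb::Status note_write(rocksdb::DB* db, uint32_t zone, uint32_t bytes) {
  return db->Merge(rocksdb::WriteOptions(), zone_key(zone), zone_delta(bytes, 0));
}

// Called after the object metadata in the deallocated space is deleted.
rocksdb::Status note_stale(rocksdb::DB* db, uint32_t zone, uint32_t bytes) {
  return db->Merge(rocksdb::WriteOptions(), zone_key(zone), zone_delta(0, bytes));
}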

ZonedCleaner

We introduce ZonedCleaner, a simple garbage collector that reclaims stale space from fragmented zones. We call it ZonedCleaner because BlueStore already contains a component called GarbageCollector, which serves an unrelated purpose: it defragments compressed objects that have been fragmented by too many overwrites (§4.5.3) by reading and rewriting them.

We make several changes to the I/O paths in BlueStore for the proper operation of ZonedCleaner. BlueStore implements a transactional interface where a single transaction may create, delete, or overwrite multiple objects. A transaction context tracks the state related to a single transaction; it is created when a transaction starts and destroyed when a transaction completes. We maintain an in-memory map per transaction context from an object identifier to a list of offsets, and we update this map as follows: when a new object is created, we append its offset to the list for that object identifier; when an object is deleted, we append the negative of its offset to the list for that object identifier; and when an object is overwritten, we append the negative of its previous offset and its new offset to the list for that object identifier. When a transaction completes, we use the in-memory map to update the persistent cleaning information in RocksDB, as described next.

In addition to the in-memory map, we also maintain persistent cleaning information about objects in RocksDB: for every object we store a key-value pair in the G namespace, where the key is the concatenation of the zone number in which the object is located and the object identifier,

ÀÞ and the value is the ošset of the object within that zone. When a transaction completes, we use the in-memory map from the transaction context to update the persistent cleaning information in RocksDB as follows: we go through the in-memory map and for every object identier, we process the list of ošsets; for negative ošsets we take their absolute value, compute the zone number, and remove the keys formed by the concatenation of the zone number and the object identier; for positive ošsets we compute the zone number and insert a new key-value pair where a key is the concatenation of the zone number and the object identier, and the value is the ošset. Hence, with these updates happening at the end of every transaction, the object identiers of all live objects within the zone ABC can be found by querying the G namespace for keys that have ABC as the prex. With all the necessary cleaning information in place, the operation of ZonedCleaner is straight- forward. It runs in a separate thread that starts when the OSD boots and goes to sleep until it is woken up by a trigger event—currently this event is the drive becoming —ý% full—to perform cleaning. When it wakes up, ZonedCleaner asks ZonedAllocator for a zone number to clean. ZonedAllocator sorts the zones by the number of stale bytes and returns the zone number with the most stale bytes. ZonedCleaner then queries the G namespace for the keys that have the zone number as the prex and obtains the object identiers of all live objects within a zone. It then reads those objects from the fragmented zone, writes them to a new zone obtained from ZonedAllocator, resets the fragmented zone, and informs ZonedAllocator and ZonedFreelistManager of a newly freed zone.

Summary of Changes to BlueStore

Our initial set of changes to BlueStore described in this section has been merged into the Ceph project [71]. In this first iteration of changes we aimed to keep things simple but correct: ZonedAllocator is a simple allocator that aims to improve throughput by filling the disk starting at the outer diameter, and ZonedCleaner implements the simplest of cleaning policies, the greedy cleaning policy [158]. We leave researching and designing more optimized space allocation, data placement, and cleaning policies in a distributed setting as future work.

Despite its simplicity, our initial set of changes is enough to demonstrate a key advantage of the zone interface: reducing tail latency by controlling garbage collection. We demonstrate this in the next section.

5.8 Evaluation of Ceph on HM-SMR HDDs

In this section we combine all the improvements described so far in the chapter and show how they collectively enable Ceph to avoid paying the block interface tax. To this end, we run a set of benchmarks on a Ceph cluster and demonstrate two things. First, Ceph is now more cost-effective—it can leverage the extra 20% capacity offered by SMR without suffering a performance loss [¥—]. Second, Ceph is now more performant—it can reduce the tail latency of I/O operations by controlling the timing of garbage collection. Unfortunately, the COVID-19 pandemic [òÔò] delayed the availability of production-quality ZNS SSDs; therefore, we are only able to demonstrate these improvements on HM-SMR HDDs.

Type     Vendor           Model                          Capacity          Throughput
CMR      Hitachi          HUA723030                      3 TB              160 MiB/s
DM-SMR   Western Digital  WD30EFAX, WD40EFAX, WD60EFAX   3 TB, 4 TB, 6 TB  186 MiB/s (avg.)
HM-SMR   HGST             HSH721414AL                    14 TB             220 MiB/s

Table 5.1: HDDs used in the evaluation and their bandwidth at the first 125 GiB of the LBA space. We measured the sequential read and sequential write throughput to be the same for all of the drives.

We run all experiments on an 8-node Ceph cluster connected with a Cisco Nexus 3264-Q 64-port QSFP+ 40GbE switch. Each node has a 16-core AMD Opteron 6272 2.1 GHz CPU, 128 GiB of RAM, and a Mellanox MCX314A-BCCT 40GbE NIC. All nodes run Linux kernel 5.5.9 on the Ubuntu 18.04 distribution and the Octopus release (v15.2.4) of Ceph. Although we developed our changes on the development version of Ceph, which is significantly ahead of the Octopus release, we backported them to v15.2.4 for a fair comparison. We use the default Ceph configuration parameters unless otherwise noted.

In the experiments described next, we use three different Ceph hardware configurations called CMR, DM-SMR, and HM-SMR. The configurations are named after the hard drives used in their storage nodes. Table 5.1 shows the capacity and sequential I/O bandwidth of these drives. The CMR and HM-SMR cluster configurations use HUA723030 and HSH721414AL drives, respectively, on all of their eight nodes. The DM-SMR cluster configuration uses WD30EFAX drives on three nodes, WD40EFAX drives on three nodes, and WD60EFAX drives on two nodes. (Unfortunately, we could not purchase a uniform set of drives for the DM-SMR configuration because there was a limit of three drives per customer per drive type due to the COVID-19 pandemic.) Each one of our experiments writes about 1 TB of aggregate data to the cluster, which roughly corresponds to 125 GiB on each drive. Hence, we measure the throughput of each drive by performing sequential I/O to the first 125 GiB of the drive and report it in Table 5.1. (For the DM-SMR configuration we measured the throughput of the WD30EFAX, WD40EFAX, and WD60EFAX drives as 180 MiB/s, 190 MiB/s, and 200 MiB/s, respectively, and report their weighted average.)

5.8.1 RADOS Write Throughput

Our first experiment compares the performance of writes to the RADOS layer in each of these cluster configurations. Since we target Amazon S3-like object storage (§ 5.7), we focus on large I/O sizes: we write objects of sizes 1 MiB, 2 MiB, and 4 MiB using 128 threads. Figure 5.13 (a) shows the throughput for all I/O sizes and cluster configurations: we observe that DM-SMR is 17%, 34%, and 34% slower than CMR for I/O sizes of 1 MiB, 2 MiB, and 4 MiB, respectively, although the raw sequential write throughput of the DM-SMR drives is 16% higher (Table 5.1).

This result is not surprising. As explained before (§ 3.4.3), DM-SMR drives can detect a single sequential write stream and bypass the persistent cache; therefore, in Table 5.1 we achieve the maximal drive throughput. In the presence of multiple sequential streams, however, the operating system mixes chunks from different streams and sends them to the drive, which forces the drive to write data into the persistent cache and introduces garbage collection overhead (§ 5.5.3). Therefore, even though the benchmark writes large objects, the DM-SMR drives suffer performance degradation.

[Figure: two charts—(a) write throughput in MiB/s and (b) random read IOPS—at I/O sizes of 1,024, 2,048, and 4,096 KiB for the CMR, DM-SMR, and HM-SMR configurations.]

Figure 5.13: (a) Write throughput and (b) random read IOPS, at steady state, to the RADOS layer on an 8-node Ceph cluster with CMR, DM-SMR, and HM-SMR configurations. The benchmark issues reads and writes using 128 threads with three different I/O sizes.

Hence, in this case the block interface tax manifests itself as garbage collection overhead, even though no garbage collection is needed for the workload. As Figure 5.13 (a) shows, HM-SMR does not suffer from any overhead because data is written directly to zones and no unnecessary garbage collection is performed. However, although Table 5.1 shows the HM-SMR drive to have 37% higher sequential write throughput than the CMR drive, this difference is not reflected in Figure 5.13 (a)—HM-SMR has only slightly higher throughput than CMR. We believe there are two reasons for this. First, the lock we added to the BlueStore transaction mechanism to turn two separate steps—space allocation and data write—into a single atomic step introduces overhead (§ 5.7.1). This overhead, however, is temporary, and it is likely to disappear once the ZONE APPEND command stabilizes in the kernel and the lock is eliminated. Second, the HM-SMR implementation is a first working version and has not been optimized to the degree that the CMR implementation has. We expect the write throughput of HM-SMR to increase in proportion to its throughput advantage over CMR as the HM-SMR implementation is optimized.
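To make the source of this overhead concrete, the sketch below contrasts the two write paths. It is an illustrative sketch under assumed names (Zone, submit_write, write_with_lock, write_with_zone_append), not BlueStore code: with regular zone writes, allocating space at the write pointer and submitting the write must happen atomically, so a lock is held across the I/O; with ZONE APPEND, the drive chooses the write location and returns it, which removes the need for that lock.

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <mutex>

// Stand-in for issuing a write at a known offset on the zoned device; a real
// implementation would perform the actual I/O here.
void submit_write(uint64_t off, const void* /*buf*/, uint64_t len) {
  std::cout << "wrote " << len << " bytes at offset " << off << "\n";
}

struct Zone {
  uint64_t start = 0;                  // byte offset of the zone on the device
  uint64_t write_pointer = 0;          // host's copy of the zone write pointer
  std::atomic<uint64_t> device_wp{0};  // stands in for the device's own pointer
  std::mutex lock;
};

// Regular zone write: advancing the write pointer (allocation) and submitting
// the write must form one atomic step; otherwise two threads could reach the
// drive out of write-pointer order and the later write would be rejected.
// The lock is therefore held across the I/O, serializing writers.
uint64_t write_with_lock(Zone& z, const void* buf, uint64_t len) {
  std::lock_guard<std::mutex> guard(z.lock);
  uint64_t off = z.start + z.write_pointer;
  z.write_pointer += len;
  submit_write(off, buf, len);
  return off;  // the caller records this offset in the object metadata
}

// ZONE APPEND: the host names only the zone; the device chooses the location
// and returns it, so concurrent appenders need no ordering and no lock. The
// atomic counter merely emulates the device-side choice for this sketch.
uint64_t write_with_zone_append(Zone& z, const void* buf, uint64_t len) {
  uint64_t off = z.start + z.device_wp.fetch_add(len);
  submit_write(off, buf, len);
  return off;
}

int main() {
  Zone z;
  char buf[4096] = {};
  write_with_lock(z, buf, sizeof(buf));
  write_with_zone_append(z, buf, sizeof(buf));
}
```

On real hardware the ordering guarantee comes from the device itself, which is what allows the host-side lock to be dropped once ZONE APPEND support is available in the kernel I/O path.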

5.8.2 RADOS Random Read IOPS

Our second experiment compares the performance of random reads at the RADOS layer in the same cluster configurations. We read the objects that were written in the previous experiment (§ 5.8.1) using 128 threads. Figure 5.13 (b) shows the random read IOPS for all object sizes and cluster configurations: we observe that DM-SMR is 29%, 14%, and 27% slower than CMR for I/O sizes of 1 MiB, 2 MiB, and 4 MiB, respectively, although the raw sequential read throughput of the DM-SMR drives is 16% higher (Table 5.1).

Unlike the previous result (§ 5.8.1), however, this result is surprising, because we do not expect garbage collection operations to interfere with read operations: we paused for a few hours after the write benchmark completed before starting the read benchmark, to ensure that all pending garbage collection operations had completed. We leave the exploration of this phenomenon as future work; for now, we speculate that to reduce garbage collection work, these DM-SMR drives end up fragmenting a single RADOS object into the individual writes they receive for that object, and the drives then incur seek overhead when reading the objects.

[Figure: CDF of read latency for CMR, DM-SMR, and HM-SMR—fraction of reads from 0.95 to 1 versus latency from 0 to 30 seconds.]

Figure 5.14: 95th and above percentile latencies of random 1 MiB object reads at the RADOS layer during garbage collection on an 8-node Ceph cluster with CMR, DM-SMR, and HM-SMR configurations. The benchmark issues reads using 128 threads. Garbage collection happens within the device in DM-SMR and in the host in HM-SMR. No garbage collection happens in CMR.

In any case, this is another instance of the block interface tax that should not happen in the first place. As Figure 5.13 (b) shows, HM-SMR does not suffer from any overhead because objects are read sequentially from the zones, exploiting the full throughput of the drive. HM-SMR has 23% and 11% higher IOPS than CMR for 1 MiB and 2 MiB objects, respectively, but its IOPS is similar to that of CMR for 4 MiB objects. Again, we expect the read IOPS of HM-SMR to increase in proportion to its throughput advantage over CMR as the HM-SMR implementation is optimized.

5.8.3 Tail Latency of RADOS Random Reads During Garbage Collection

Our third and last experiment compares the tail latency of reads during garbage collection in the same cluster configurations. We demonstrate that by moving garbage collection to the host, the zone interface allows a distributed storage system to reduce tail latency by leveraging control over the timing of garbage collection and the redundancy of data.

For DM-SMR and CMR we run the experiment as follows: we first write half a million 1 MiB objects using 128 threads. After these writes complete, we start writing another half a million objects using 128 threads while, at the same time, randomly reading the objects written earlier using another 128 threads. Since DM-SMR drives regularly perform garbage collection during writes, we expect the garbage collection to affect the tail latency of random reads. Obviously, no garbage collection happens in the CMR case.

For HM-SMR we run the experiment as follows: to trigger early garbage collection, we restrict the number of zones to 500 in each host. We then write one million 1 MiB objects, almost filling all of the zones in all of the nodes, and then delete half a million of these objects. To simulate a controlled garbage collection, we then start garbage collection on two of the nodes and start reading the half a million objects that were not deleted. We use Ceph's OSD affinity mechanism [çý] to redirect reads only to the nodes that are not performing garbage collection.

Figure 5.14 shows the 95th and above percentile latencies of random 1 MiB object reads from RADOS: we observe that the 99th and above tail latency of HM-SMR is markedly lower than that of DM-SMR. Furthermore, the HM-SMR latency is within 13% of the CMR latency, and we expect it to improve with future optimizations.

5.9 Summary

In this chapter, we leveraged the agility of BlueStore, a special-purpose storage backend in Ceph, to swiftly embrace the zone interface and liberate Ceph from the block interface tax. By adapting BlueStore to work on HM-SMR drives, which expose the zone interface, we demonstrated that Ceph can utilize the extra capacity offered by SMR at high throughput and low tail latency by avoiding the garbage collection overhead of DM-SMR drives, which expose the block interface through emulation.

As a part of this work, we also liberated RocksDB, a widely used key-value store powering many large-scale internet services, from the block interface tax. Specifically, we demonstrated that LSM-Trees in general, and RocksDB in particular, suffer unnecessary in-device garbage collection and high write amplification when emulating the block interface through a translation layer—all of which can be eliminated by adopting the zone interface.

Chapter 6

Conclusion and Future Work

6.1 Conclusion

In this dissertation, we propose a new approach to distributed storage system design. More specifically, we demonstrate that to unlock the full potential of modern data center storage devices, distributed storage systems should abandon the decades-old conventions in their storage backends—the reliance on the block interface and general-purpose file systems—and embrace the novel zone interface and a special-purpose storage backend design.

We first argue that the block interface is a poor match for modern data center storage devices. The block interface, which was designed for early hard disk drives, allows random updates of small blocks, and it has been the dominant interface for the past three decades. As a result, almost every file system in use today was developed for the block interface. Modern storage technologies such as solid-state drives or Shingled Magnetic Recording (SMR) hard drives, on the other hand, are different: unlike early storage media, which contained randomly writable bits, newer storage media consist of large regions that must be written sequentially. So that we can continue using our current file systems, we currently shoehorn the block interface onto modern storage devices by emulating it using a translation layer inside the drives. This emulation, however, introduces significant cost and performance overheads. These overheads can be eliminated by adopting the novel zone interface, which exposes sequential regions to the host. But since the zone interface is not compatible with the block interface, no current file system can run on devices that expose the zone interface.

The performance overhead of emulating the block interface using translation layers is well studied in solid-state drives. In our work, we study the emulation overhead in modern high-capacity hard drives that use SMR, which are also known as drive-managed SMR (DM-SMR) drives. To this end, we introduce Skylight, a novel methodology for measuring and characterizing DM-SMR drives. We develop a series of micro-benchmarks for this characterization and augment these timing measurements with a novel technique that tracks actual drive head movements using a high-speed camera. Using our approach, we fully reverse engineer how the translation layer emulates the block interface in DM-SMR drives. We discover that these drives can handle sustained sequential writes with no overhead, but they suffer orders-of-magnitude performance degradation in the presence of sustained random writes.

We then leverage the results of our characterization work to improve the performance of ext4, a general-purpose file system, on DM-SMR drives. We do so because ext4 is used by many distributed storage systems as a storage backend, and DM-SMR drives increase capacity by 20% or more over conventional drives. Therefore, alleviating the emulation overhead by optimizing ext4 can increase the cost-effectiveness of data storage without sacrificing performance in distributed storage systems. Consequently, we introduce ext4-lazy, an extension of ext4, which significantly improves performance over ext4 on key workloads when running on top of DM-SMR drives. Ext4-lazy achieves this by introducing a well-chosen, small but effective change that utilizes the existing journaling mechanism: unlike ext4, which writes metadata twice—first sequentially to the journal and then randomly to disk—ext4-lazy keeps often-updated metadata in the journal and avoids the second write, thereby significantly reducing random writes. Although ext4-lazy achieves remarkable performance improvements, it also shows that eliminating the emulation overhead using evolutionary changes is hard: on workloads with a non-trivial amount of random data writes, the throughput of DM-SMR drives running ext4-lazy still suffers compared to that of conventional drives running ext4. Hence, the only way to eliminate the emulation overhead is not to emulate at all and to adopt the backward-incompatible zone interface.

One approach to adopting the zone interface is to modify current file systems that were designed for the block interface to work with the zone interface. This approach did not pan out for major file systems because it required a major redesign. Hence, the only remaining approach for adopting the zone interface is to design a new general-purpose file system for the zone interface. However, before designing yet another file system to run a distributed storage backend on top of, we take a step back and ask the following question: how appropriate is the file system interface for building distributed storage backends?

To answer this question, we perform a longitudinal study of storage backends in Ceph, a widely used distributed storage system, over ten years. We find that for eight of these ten years, the Ceph project followed the conventional wisdom of building its storage backend on top of local file systems. This is the preferred choice for most distributed storage systems because it allows them to benefit from the convenience and maturity of battle-tested file system code. The experience of the Ceph team, however, shows that this choice comes with a high performance overhead. More specifically, first, it is hard to develop an efficient transaction mechanism on top of a general-purpose file system, and second, file system metadata management is too heavyweight—it does not scale to the needs of a distributed storage backend. Another disadvantage of relying on file systems is that they take a long time to mature, and once mature, they are averse to major changes. Effectively adopting new hardware, on the other hand, often does require major changes. As a result, after eight years the Ceph team departed from the conventional wisdom and in just two years developed BlueStore, a clean-slate special-purpose storage backend that outperformed existing backends.

Finally, we demonstrate that a distributed storage system that is not bound by the progress of general-purpose file systems can quickly adopt the zone interface and unlock the full potential of modern storage devices.
To this end, we adapt BlueStore to the zone interface in two steps. First, we handle the metadata path in BlueStore and extend RocksDB, the key-value store that BlueStore uses for storing metadata, to run on the zone interface. RocksDB is an instance of a data structure called the Log-Structured Merge Tree (LSM-Tree), and LSM-Trees are at the core of many large-scale online services and applications. As a result of this work, we demonstrate how RocksDB leverages the smart data placement enabled by the zone interface to reduce write amplification inside solid-state drives by 5×. Second, we handle the data path in BlueStore and introduce a garbage collector that the host can explicitly control. We combine this control with the redundancy of data in a distributed setting to eliminate garbage collection in some cases and to reduce the tail latency of I/O operations in others.

In 2015, Dave Chinner, the maintainer of XFS—a widely used high-performance file system—claimed that in 20 years all current file systems will be legacy file systems, and that since it takes a decade to mature a file system, we should start working on future file systems soon [çâ]. While this is a good call, we think it would be a mistake to build yet another monolithic POSIX file system that runs on everything, from our smartphones to the large-scale distributed storage systems powering the cloud. The key implication of our work is that when it comes to distributed storage systems—which power the enterprise and public clouds and are expected to host over 80% of all data [Ô ] by 2025—we should build specialized storage systems that leverage the available hardware and the domain knowledge to the fullest, delivering the best performance and the greatest user experience.

6.2 Future Work

This dissertation is the first step toward improving the cost-effectiveness and performance of distributed storage systems through the elimination of layers on top of the raw storage medium. It demonstrates that abandoning the block interface and general-purpose file systems is feasible and useful when designing a storage backend, but by starting from scratch on a raw storage device with a new interface, it raises new research questions, a couple of which we describe next.

6.2.1 Index Structures for Zoned Devices

BlueStore is the first distributed storage backend that stores all of its metadata, including low-level metadata such as extent bitmaps, in a key-value store—RocksDB. Although RocksDB was central to improving metadata performance in BlueStore, its compaction overhead has become the new performance bottleneck. In addition, with the low I/O latency of high-end NVMe SSDs, the CPU time spent serializing and deserializing data when reading from or writing to RocksDB is becoming significant. Finally, RocksDB compaction incurs a 10× or higher application-level write amplification, which is substantial considering the upcoming high-capacity SSDs based on QLC NAND, which is even more susceptible to wear. Hence, the research question is: how do we design an index structure that exploits the properties of storage backend workloads to reduce write amplification while also adhering to the zone interface constraints?

6.2.2 Shrinking-Capacity Zoned Devices

Most storage systems are built assuming that the raw capacity of the underlying storage device does not change. This assumption precludes the possibility of a partially failed device that can continue to provide usable storage. It was forged by early single-head HDDs that became unusable after a head crash—it does not hold for SSDs, or even for modern HDDs with multiple heads and actuators. Most importantly, it comes at a high cost.

Enterprise SSDs overprovision 28% of NAND flash for efficient garbage collection. As pages wear out, some of the overprovisioned flash gets used as replacement. This increases write amplification due to less efficient garbage collection and further accelerates the wear-out of the remaining pages. After losing the overprovisioned capacity, a drive declares itself dead, since it is unable to meet the vendor-specified QoS, despite having over 70% usable capacity.

The zone interface accounts for worn-out pages, allowing ZNS SSDs to shrink the zone capacity and otherwise continue to operate normally. Yet the file systems that are used as storage backends in most distributed storage systems, as well as BlueStore, are unable to cope with a shrinking storage device. A distributed storage system appears well suited to coping with shrinking device capacity, thanks to the redundancy of data and the availability of other storage devices for data migration, but identifying the right interface and mechanisms for this purpose requires further research.

Bibliography

[Ô] Abutalib Aghayev and Peter Desnoyers. Log-Structured Cache: Trading Hit-Rate for Stor- age Performance (and Winning) in Mobile Devices. In Proceedings of the Ôst Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads, INFLOW ’Ôç, New York, NY, USA, òýÔç. Association for Computing Machinery. ISBN Àޗԥ ýçò¥âò . doi: Ôý.ÔÔ¥ /ò òÞÞÀò.ò òÞÞÀÞ. URL https://doi.org/10.1145/2527792.2527797. [Cited on page òâ.] [ò] Abutalib Aghayev and Peter Desnoyers. Skylight—A Window on Shingled Disk Op- eration. In Ôçth USENIX Conference on File and Storage Technologies (FAST Ô ), pages Ôç –Ô¥À, Santa Clara, CA, USA, February òýÔ . USENIX Association. ISBN Àޗ-Ô-ÀçÔÀÞÔ-òýÔ. URL https://www.usenix.org/conference/fast15/technical- sessions/presentation/aghayev. [Cited on pageç.] [ç] Abutalib Aghayev, eodore Ts’o, Garth Gibson, and Peter Desnoyers. Evolving Ext¥ for Shingled Disks. In Ô th USENIX Conference on File and Storage Technologies (FAST ÔÞ), pages Ôý –Ôòý, Santa Clara, CA, òýÔÞ. USENIX Association. ISBN Àޗ-Ô-ÀçÔÀÞÔ- çâ-òÔý . URL https://www.usenix.org/conference/fast17/technical-sessions/ presentation/aghayev. [Cited on pageÞÞ.] [¥] Abutalib Aghayev, Sage Weil, Greg Ganger, and George Amvrosiadis. Reconciling LSM- Trees with Modern Hard Drives using BlueFS. Technical Report CMU-PDL-ÔÀ-Ôýò, CMU Parallel Data Laboratory, April òýÔÀ. URL http://www.pdl.cmu.edu/PDL-FTP/FS/CMU- PDL-19-102_abs.shtml. [Cited on page .] [ ] Abutalib Aghayev, Sage Weil, Michael Kuchnik, Mark Nelson, Gregory R. Ganger, and George Amvrosiadis. File Systems Unt as Distributed Storage Backends: Lessons from Ôý Years of Ceph Evolution. In Proceedings of the òÞth ACM Symposium on Operat- ing Systems Principles, SOSP ’ÔÀ, pages ç ç–çâÀ, New York, NY, USA, òýÔÀ. Association for Computing Machinery. ISBN Àޗԥ ýçâ—Þç . doi: Ôý.ÔÔ¥ /çç¥ÔçýÔ.çç Àâ â. URL https://doi.org/10.1145/3341301.3359656. [Cited on page¥.] [â] Abutalib Aghayev, Sage Weil, Michael Kuchnik, Mark Nelson, Gregory R. Ganger, and George Amvrosiadis. e Case for Custom Storage Backends in Distributed Storage Sys- tems. ACM Trans. Storage, Ôâ(ò), May òýòý. ISSN Ô ç-çýÞÞ. doi: Ôý.ÔÔ¥ /çç—âçâò. URL https://doi.org/10.1145/3386362. [Cited on page¥.] [Þ] Amazon.com, Inc. Amazon Elastic Block Store. https://aws.amazon.com/ebs/, òýÔÀ. [Cited on pages âÔ and À .]

ÔýÞ [—] Amazon.com, Inc. Amazon Sç. https://aws.amazon.com/s3/, òýÔÀ. [Cited on pages âÔ and À .] [À] Ahmed Amer, Darrell D. E. Long, Ethan L. Miller, Jehan-Francois Paris, and S. J. omas Schwarz. Design Issues for a Shingled Write Disk System. In Proceedings of the òýÔý IEEE òâth Symposium on Mass Storage Systems and Technologies (MSST), MSST ’Ôý, pages Ô– Ôò, Washington, DC, USA, òýÔý. IEEE Computer Society. ISBN Àޗ-Ô-¥ò¥¥-ÞÔ ò-ò. doi: Ôý.ÔÔýÀ/MSST.òýÔý. ¥ÀâÀÀÔ. URL http://dx.doi.org/10.1109/MSST.2010.5496991. [Cited on pagesÀ,ò—, and ç .] [Ôý] Andreas Dilger. Lustre Metadata Scaling. http://storageconference.us/2012/ Presentations/T01.Dilger.pdf, òýÔò. [Cited on pagesÔ and¥.] [ÔÔ] AskUbuntu. Is there any faster way to remove a directory than "rm -rf"? http:// askubuntu.com/questions/114969, òýÔò. [Cited on page ¥À.] [Ôò] . Queue les. https://www.kernel.org/doc/Documentation/block/ queue-sysfs.txt, February òýýÀ. [Cited on page Þç.] [Ôç] Jens Axboe. Flexible I/O Tester. git://git.kernel.dk/fio.git, òýÔâ. [Cited on pages Ô¥ and Þç.] [Ô¥] Jens Axboe. rottled Background Bušered Writeback. https://lwn.net/Articles/ 698815/, August òýÔâ. [Cited on page âÞ.] [Ô ] Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Garth R. Goodson, and Bianca Schroeder. An analysis of data corruption in the storage stack. Trans. Storage, ¥(ç):—:Ô–—:ò—, November òýý—. ISSN Ô ç-çýÞÞ. doi: Ôý.ÔÔ¥ /Ô¥ÔâÀ¥¥. Ô¥ÔâÀ¥Þ. URL http://doi.acm.org/10.1145/1416944.1416947. [Cited on page çÔ.] [Ôâ] Matias Bjørling. Zone Append: A New Way of Writing to Zoned Storage. Santa Clara, CA, February òýòý. USENIX Association. [Cited on pageÀâ.] [ÔÞ] Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. Linux block IO: Intro- ducing Multi-Queue SSD Access on Multi-Core Systems. In Proceedings of the âth interna- tional systems and storage conference (SYSTOR). ACM, òýÔç. [Cited on page —Ô.] [ԗ] Matias Bjørling, Javier Gonzalez, and Philippe Bonnet. LightNVM: e Linux Open- Channel SSD Subsystem. In Ô th USENIX Conference on File and Storage Technolo- gies (FAST ÔÞ), pages ç À–çÞ¥, Santa Clara, CA, òýÔÞ. USENIX Association. ISBN Àޗ-Ô-ÀçÔÀÞÔ-çâ-ò. URL https://www.usenix.org/conference/fast17/technical- sessions/presentation/bjorling. [Cited on page—ý.] [ÔÀ] Matias Bjørling et al. Zoned Namespaces. Technical Proposal ¥ý ç, NVM Express, June òýòý. Available from https://nvmexpress.org/wp-content/uploads/NVM-Express- 1.4-Ratified-TPs.zip. [Cited on pages—ý and —Ô.] [òý] Artem Blagodarenko. Scaling LDISKFS for the future. https://www.youtube.com/ watch?v=ubbZGpxV6zk, òýÔâ. [Cited on pagesâý and ޗ.] [òÔ] Artem Blagodarenko. Scaling LDISKFS for the future. Again. https://www.youtube. com/watch?v=HLfEd0_Dq0U, òýÔÞ. [Cited on pagesâý and ޗ.] [òò] Luc Bouganim, Bjorn Jónsson, and Philippe Bonnet. uFLIP: understanding žash IO pat-

Ôý— terns. In Proceedings of the Int’l Conf. on Innovative Data Systems Research (CIDR), Asilo- mar, California, USA, òýýÀ. [Cited on page Ôý.] [òç] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Mor- ris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang. Corey: An Operating System for Many Cores. In Proceedings of the —th USENIX Con- ference on Operating Systems Design and Implementation, OSDI’ý—, pages ¥ç– Þ, Berke- ley, CA, USA, òýý—. USENIX Association. URL http://dl.acm.org/citation.cfm? id=1855741.1855745. [Cited on page¥ý.] [ò¥] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An Analysis of Linux Scalability to Many Cores. In Proceedings of the Àth USENIX Conference on Operating Systems Design and Implementation, OSDI’Ôý, pages Ô–Ôâ, Berkeley, CA, USA, òýÔý. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1924943.1924944. [Cited on page¥ý.] [ò ] Frederick P Brooks Jr. No Silver Bullet—Essence and Accident in SoŸware Engineering, ÔÀ—â. [Cited on page¥.] [òâ] Btrfs. Btrfs Changelog. https://btrfs.wiki.kernel.org/index.php/Changelog, òýÔÀ. [Cited on pages âç andÞÞ.] [òÞ] David C. [ceph-users] Luminous | PG split causing slow requests. http://lists.ceph. com/pipermail/ceph-users-ceph.com/2018-February/024984.html, òýԗ. [Cited on pageââ.] [ò—] Remy Card, eodore Ts’o, and . Design and Implementation of the Sec- ond Extended Filesystem. In Proceedings of the rst Dutch International Symposium on Linux, volume Ô, ÔÀÀ¥. [Cited on page ¥Ô.] [òÀ] Yuval Cassuto, Marco A. A. Sanvido, Cyril Guyot, David R. Hall, and Zvonimir Z. Bandic. Indirection Systems for Shingled-recording Disk Drives. In Proceedings of the òýÔý IEEE òâth Symposium on Mass Storage Systems and Technologies (MSST), MSST ’Ôý, pages Ô– Ô¥, Washington, DC, USA, òýÔý. IEEE Computer Society. ISBN Àޗ-Ô-¥ò¥¥-ÞÔ ò-ò. doi: Ôý.ÔÔýÀ/MSST.òýÔý. ¥ÀâÀÞÔ. URL http://dx.doi.org/10.1109/MSST.2010.5496971. [Cited on pagesÀ, Ôý,Ôò, Ôç, Ô , Ôâ, ÔÀ,òç,ò—, ç , and .] [çý] ceph.io. Ceph: get the best of your SSD with primary a›nity. https://ceph.io/geen- categorie/ceph-get-the-best-of-your-ssd-with-primary-affinity/, òýÔ . [Cited on page ÔýÔ.] [çÔ] Luoqing Chao and under Zhang. Implement Object Storage with SMR based key-value store. https://www.snia.org/sites/default/files/SDC15_presentations/smr/ QingchaoLuo_Implement_Object_Storage_SMR_Key-Value_Store.pdf, òýÔ . [Cited on page ÞÀ.] [çò] Feng Chen, David A. Koufaty, and Xiaodong Zhang. Understanding Intrinsic Characteris- tics and System Implications of Flash Memory Based Solid State Drives. In Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’ýÀ, pages Ô—Ô–ÔÀò, New York, NY, USA, òýýÀ. ACM. ISBN Àޗ-Ô-âý —- ÔÔ- â. doi: Ôý.ÔÔ¥ /Ô ç¥À.Ô çÞÔ. URL http://doi.acm.org/10.1145/1555349.1555371.

ÔýÀ [Cited on page Ôý.] [çç] Stepehen P. Morgan Chi-Young Ku. An SMR-aware Append-only File System. In Storage Developer Conference, Santa Clara, CA, USA, September òýÔ . [Cited on page çÀ.] [ç¥] Dave Chinner. XFS Delayed Logging Design. https://www.kernel.org/doc/ Documentation/filesystems/xfs-delayed-logging-design.txt, òýÔý. [Cited on page âç.] [ç ] Dave Chinner. SMR Layout Optimization for XFS. http://xfs.org/images/f/f6/Xfs- smr-structure-0.2.pdf, March òýÔ . [Cited on pages and çÀ.] [çâ] Dave Chinner. XFS: ere and Back...... and ere Again? In Vault Linux Storage and File System Conference, Boston, MA, USA, April òýÔâ. [Cited on pages çÀ and Ôý .] [çÞ] Dave Chinner. Re: pagecache locking (was: bcachefs status update) merged). https: //lkml.org/lkml/2019/6/13/1794, June òýÔÀ. [Cited on page Þâ.] [ç—] Alibaba Clouder. Alibaba Deploys Alibaba Open Channel SSD for Next Generation Data Centers. https://www.alibabacloud.com/blog/alibaba-deploys-alibaba- open-channel-ssd-for-next-generation-data-centers_593802, July òýԗ. [Cited on page—ý.] [çÀ] William Cohen. How to avoid wasting megabytes of memory a few bytes at a time. https://developers.redhat.com/blog/2016/06/01/how-to-avoid- wasting-megabytes-of-memory-a-few-bytes-at-a-time/, òýÔâ. [Cited on page Þâ.] [¥ý] Jonathan Darrel Coker and David Robison Hall. Indirection memory architecture with reduced memory requirements for shingled magnetic recording devices, November òýÔç. US Patent —, ޗ,Ôòò. [Cited on pages Ôý, Ô , and ç .] [¥Ô] Douglas Comer. Ubiquitous B-Tree. ACM Comput. Surv., ÔÔ(ò):ÔòÔ–ÔçÞ, June ÔÀÞÀ. ISSN ýçâý-ýçýý. doi: Ôý.ÔÔ¥ /ç âÞÞý.ç âÞÞâ. URL http://doi.acm.org/10.1145/356770. 356776. [Cited on page —ò.] [¥ò] e Kernel Development Community. Switching Scheduler. https://www.kernel.org/ doc/html/latest/block/switching-sched.html, òýòý. [Cited on page—¥.] [¥ç] Jonathan Corbet. Supporting transactions in Btrfs. https://lwn.net/Articles/ 361457/, Nov òýýÀ. [Cited on pages âç andÞÞ.] [¥¥] Jonathan Corbet. No-I/O dirty throttling. https://lwn.net/Articles/456904/, August òýÔÔ. [Cited on page âÞ.] [¥ ] Jonathan Corbet. Kernel quality control, or the lack thereof. https://lwn.net/ Articles/774114/, December òýԗ. [Cited on page¥.] [¥â] Jonathan Corbet. PostgreSQL’s fsync() surprise. https://lwn.net/Articles/752063/, òýԗ. [Cited on page Þâ.] [¥Þ] Ješrey Dean and Luiz AndrÃľ Barroso. e Tail at Scale. Communications of the ACM, â: Þ¥–—ý, òýÔç. URL http://cacm.acm.org/magazines/2013/2/160173-the-tail-at- scale/fulltext. [Cited on page .] [¥—] Wido den Hollander. Do not use SMR disks with Ceph. https://blog.widodh.nl/2017/

ÔÔý 02/do-not-use-smr-disks-with-ceph/, òýÔÞ. [Cited on page À—.] [¥À] Linux Device-Mapper. Device-Mapper Resource Page. https://sourceware.org/dm/, òýýÔ. [Cited on page Ôý.] [ ý] Western Digital. Western Digital PC SNÞòý NVMe SSD. https://www.westerndigital. com/products/internal-drives/pc-sn720-ssd, òýÔÀ. [Cited on pages Àç and À .] [ Ô] Western Digital. Zoned Storage. http://zonedstorage.io, òýÔÀ. [Cited on page¥.] [ ò] Anton Dmitriev. [ceph-users] All OSD fails aŸer few requests to RGW. http://lists. ceph.com/pipermail/ceph-users-ceph.com/2017-May/017950.html, òýÔÞ. [Cited on pageââ.] [ ç] Elizabeth A Dobisz, Z.Z. Bandic, Tsai-Wei Wu, and T. Albrecht. Patterned Media: Nanofab- rication Challenges of Future Disk Drives. Proceedings of the IEEE, Àâ(ÔÔ):ԗçâ–ԗ¥â, November òýý—. ISSN ýýԗ-ÀòÔÀ. doi: Ôý.ÔÔýÀ/JPROC.òýý—.òýýÞâýý. [Cited on pageÀ.] [ ¥] Siying Dong. Direct I/O Close() shouldn’t rewrite the last page. https://github.com/ facebook/rocksdb/pull/4771, òýԗ. [Cited on page—¥.] [ ] DRAMeXchange. NAND Flash Spot Price, September òýÔ¥. URL http://dramexchange. com. http://dramexchange.com. [Cited on pageÀ.] [ â] e Economist. e world’s most valuable resource is no longer oil, but data. https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable- resource-is-no-longer-oil-but-data, òýÔÞ. [Cited on pageÔ.] [ Þ] Jake Edge. Ideas for supporting shingled magnetic recording (SMR). https://lwn.net/ Articles/592091/, Apr òýÔ¥. [Cited on page çÀ.] [ —] Jake Edge. e OrangeFS distributed lesystem. https://lwn.net/Articles/643165/, òýÔ . [Cited on pagesò,¥, and À.] [ À] Jake Edge. Filesystem support for SMR devices. https://lwn.net/Articles/637035/, Mar òýÔ . [Cited on page çÀ.] [âý] Jake Edge. XFS: ere and back ... and there again? https://lwn.net/Articles/ 638546/, Apr òýÔ . [Cited on pages¥ andâ—.] [âÔ] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. Exokernel: An Operating System Ar- chitecture for Application-level Resource Management. In Proceedings of the FiŸeenth ACM Symposium on Operating Systems Principles, SOSP ’À , pages ò Ô–òââ, New York, NY, USA, ÔÀÀ . ACM. ISBN ý-—ÀÞÀÔ-ÞÔ -¥. doi: Ôý.ÔÔ¥ /òò¥ý â.òò¥ýÞâ. URL http: //doi.acm.org/10.1145/224056.224076. [Cited on page À.] [âò] Facebook. RocksDB. https://github.com/facebook/rocksdb/, òýòý. [Cited on pages —ò and À .] [âç] Facebook Inc. Performance Benchmarks. https://github.com/facebook/rocksdb/ wiki/Performance-Benchmarks, òýԗ. [Cited on page—â.] [â¥] Facebook Inc. RocksDB Tuning Guide. https://github.com/facebook/rocksdb/ wiki/RocksDB-Tuning-Guide, òýԗ. [Cited on page —Þ.] [â ] Robert M Fallone and William B Boyle. Data storage device employing a run-length map-

ÔÔÔ ping table and a single address mapping table, May Ô¥ òýÔç. US Patent —,¥¥ç,ÔâÞ. [Cited on page ç .] [ââ] Tim Feldman. Personal communication, August òýÔ¥. [Cited on page Ô¥.] [âÞ] Tim Feldman. Host-Aware SMR, November òýÔ¥. Available from https://www.youtube. com/watch?v=b1yqjV8qemU. [Cited on page ԗ.] [â—] Timothy Richard Feldman. Dynamic storage regions, February Ô¥ òýÔÔ. US Patent App. Ôç/ýòâ, ç . [Cited on pages Ô ,ç¥, and ç .] [âÀ] Andrew Fikes. Storage Architecture and Challenges. https://cloud.google.com/ files/storage_architecture_and_challenges.pdf, òýÔý. [Cited on pageÞÔ.] [Þý] Mary Jo Foley. MicrosoŸ readies new cloud SSD storage spec for the Open Compute Project. https://www.zdnet.com/article/microsoft-readies-new-cloud-ssd- storage-spec-for-the-open-compute-project/, March òýԗ. [Cited on page—ý.] [ÞÔ] e Ceph Foundation. Ceph. https://github.com/ceph/ceph/, òýòý. [Cited on page À—.] [Þò] FreeNAS. ZFS and lots of les. https://forums.freenas.org/index.php?threads/ -and-lots-of-files.7925/, òýÔò. [Cited on page ¥À.] [Þç] Sanjay Ghemawat and Ješ Dean. LevelDB. https://github.com/google/leveldb, òýÔÀ. [Cited on page â.] [Þ¥] Sanjay Ghemawat, Howard Gobioš, and Shun-Tak Leung. e . In Pro- ceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’ýç, pages òÀ–¥ç, New York, NY, USA, òýýç. ACM. ISBN Ô- —ÔÔç-Þ Þ- . doi: Ôý.ÔÔ¥ /À¥ ¥¥ .À¥ ¥ ý. URL http://doi.acm.org/10.1145/945445.945450. [Cited on pagesò,¥, and À.] [Þ ] Garth Gibson and Greg Ganger. Principles of Operation for Shingled Disk Devices. Tech- nical Report CMU-PDL-ÔÔ-ÔýÞ, CMU Parallel Data Laboratory, April òýÔÔ. URL http: //repository.cmu.edu/pdl/7. [Cited on pagesÀ and ÞÀ.] [Þâ] Garth Gibson and Milo Polte. Directions for Shingled-Write and Two-Dimensional Mag- netic Recording System Architectures: Synergies with Solid-State Disks. Technical Report CMU-PDL-ýÀ-Ôý¥, CMU Parallel Data Laboratory, May òýýÀ. URL http://repository. cmu.edu/pdl/7. [Cited on page ç .] [ÞÞ] Jongmin Gim and Youjip Won. Extract and infer quickly: Obtaining sector geometry of modern hard disk drives. ACM Transactions on Storage (TOS), â(ò):â:Ô–â:òâ, July òýÔý. ISSN Ô ç-çýÞÞ. doi: Ôý.ÔÔ¥ /ԗýÞýâý.ԗýÞýâç. URL http://doi.acm.org/10.1145/1807060. 1807063. [Cited on pages Ôý andçâ.] [ޗ] David Hall, John H Marcos, and Jonathan D Coker. Data Handling Algorithms For Au- tonomous Shingled Magnetic Recording HDDs. IEEE Transactions on Magnetics, ¥—( ): ÔÞÞÞ–ÔޗÔ, òýÔò. [Cited on pagesÀ, Ôý,Ôò, and ç .] [ÞÀ] David Robison Hall. Shingle-written magnetic recording (SMR) device with hybrid E- region, April Ô òýÔ¥. US Patent —,â—Þ,çýç. [Cited on pages Ôç, Ô ,ò—, and ç .] [—ý] Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana-Hosekote, Andrew A. Chien,

ÔÔò and Haryadi S. Gunawi. e Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments. In Ô¥th USENIX Conference on File and Storage Tech- nologies (FAST Ôâ), pages òâç–òÞâ, Santa Clara, CA, òýÔâ. USENIX Association. ISBN Àޗ-Ô-ÀçÔÀÞÔ-ò—-Þ. URL https://www.usenix.org/conference/fast16/technical- sessions/presentation/hao. [Cited on pagesÔ,ç, , and—ý.] [—Ô] Weiping He and David H. C. Du. Novel Address Mappings for Shingled Write Disks. In Proceedings of the âth USENIX Conference on Hot Topics in Storage and File Systems, HotStorage’Ô¥, pages – , Berkeley, CA, USA, òýÔ¥. USENIX Association. URL http: //dl.acm.org/citation.cfm?id=2696578.2696583. [Cited on page ç .] [—ò] Christoph Hellwig. XFS: e Big Storage File System for Linux. USENIX ;login issue, ç¥( ), òýýÀ. [Cited on page âò.] [—ç] HGST. HGST Unveils Intelligent, Dynamic Storage Solutions To Transform e Data Cen- ter, September òýÔ¥. Available from http://www.hgst.com/press-room/. [Cited on pageÀ.] [—¥] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and M. West. Scale and Performance in a Distributed File System. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, SOSP ’—Þ, pages Ô–ò, New York, NY, USA, ÔÀ—Þ. ACM. ISBN ý-—ÀÞÀÔ-ò¥ò-X. doi: Ôý.ÔÔ¥ /¥Ô¥ Þ.çÞ ýý. URL http://doi. acm.org/10.1145/41457.37500. [Cited on pagesò,¥, and À.] [— ] Yige Hu, Zhiting Zhu, Ian Neal, Youngjin Kwon, Tianyu Cheng, Vijay Chidambaram, and Emmett Witchel. TxFS: Leveraging File-System Crash Consistency to Provide ACID Trans- actions. In òýԗ USENIX Annual Technical Conference (USENIX ATC ԗ), pages —ÞÀ–—ÀÔ, Boston, MA, òýԗ. USENIX Association. ISBN Àޗ-Ô-ÀçÔÀÞÔ-¥¥-Þ. URL https://www. usenix.org/conference/atc18/presentation/hu. [Cited on pagesâý, âç, andÞÞ.] [—â] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. Erasure Coding in Windows Azure Storage. In Presented as part of the òýÔò USENIX Annual Technical Conference (USENIX ATC Ôò), pages Ô –òâ, Boston, MA, òýÔò. USENIX. ISBN Àޗ-ÀçÔÀÞÔ-Àç- . URL https://www.usenix.org/ conference/atc12/technical-sessions/presentation/huang. [Cited on pageÞÔ.] [—Þ] Felix Hupfeld, Toni Cortes, Björn Kolbeck, Jan Stender, Erich Focht, Matthias Hess, Je- sus Malo, Jonathan Marti, and Eugenio Cesario. e XtreemFS Architecture – a Case for Object-based File Systems in Grids. Concurrency and Computation: Practice and Experi- ence, òý(ÔÞ):òý¥À–òýâý, December òýý—. ISSN Ô çò-ýâòâ. doi: Ôý.Ôýýò/cpe.vòý:ÔÞ. URL http://dx.doi.org/10.1002/cpe.v20:17. [Cited on pagesò,¥, À, andâý.] [——] IBM. IBM ç ý disk storage unit. https://www.ibm.com/ibm/history/exhibits/ storage/storage_350.html, òýòý. [Cited on pagesç andÀ.] [—À] Shuichi Ihara and Shilong Wang. Lustre/ldiskfs Metadata Performance Boost. https: //www.eofs.eu/_media/events/lad17/19_shuichi_ihara_lad17_ihara_1004.pdf, October òýÔÞ. [Cited on page¥.] [Àý] Facebook Inc. RocksDB Direct IO. https://github.com/facebook/rocksdb/wiki/ Direct-IO, òýÔÀ. [Cited on page Þâ.]

ÔÔç [ÀÔ] Facebook Inc. RocksDB Merge Operator. https://github.com/facebook/rocksdb/ wiki/Merge-Operator, òýÔÀ. [Cited on pages Þý andÀÞ.] [Àò] Facebook Inc. RocksDB Synchronous Writes. https://github.com/facebook/ rocksdb/wiki/Basic-Operations#synchronous-writes, òýÔÀ. [Cited on page â .] [Àç] Silicon Graphics Inc. XFS Allocation Groups. http://xfs.org/docs/xfsdocs-xml- dev/XFS_Filesystem_Structure/tmp/en-US/html/Allocation_Groups.html, òýýâ. [Cited on pageââ.] [À¥] INCITS TÔý Technical Committee. Information technology - Zoned Block Commands (ZBC). DraŸ Standard TÔý/BSR INCITS çâ, American National Standards Institute, Inc., September òýÔ¥. Available from https://www.t10.org/ftp/zbcr01.pdf. [Cited on pagesÔÔ, ÞÀ, and —Ô.] [À ] INCITS TÔç Technical Committee. Information technology - Zoned De- vice ATA Command Set (ZAC). DraŸ Standard TÔç/BSR INCITS çÞ, Amer- ican National Standards Institute, Inc., December òýÔ . Available from http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05- Zoned_Device_ATA_Command_Set_ZAC.pdf. [Cited on pages ÞÀ and —Ô.] [Àâ] William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, John Esmet, Yizheng Jiao, Ankur Mittal, Prashant Pandey, Phaneendra Reddy, Leif Walsh, Michael A. Bender, Martin Farach-Colton, Rob Johnson, Bradley C. Kuszmaul, and Donald E. Porter. Betrfs: Write- optimization in a kernel le system. Trans. Storage, ÔÔ(¥):ԗ:Ô–Ô—:òÀ, November òýÔ . ISSN Ô ç-çýÞÞ. doi: Ôý.ÔÔ¥ /òÞÀ—ÞòÀ. URL http://doi.acm.org/10.1145/2798729. [Cited on pageÞÞ.] [ÀÞ] Sooman Jeong, Kisung Lee, Seongjin Lee, Seoungbum Son, and Youjip Won. I/O Stack Optimization for Smartphones. In Proceedings of the òýÔç USENIX Annual Techni- cal Conference (USENIX ATC Ôç), pages çýÀ–çòý, San Jose, CA, òýÔç. USENIX. ISBN Àޗ-Ô-ÀçÔÀÞÔ-ýÔ-ý. URL https://www.usenix.org/conference/atc13/technical- sessions/presentation/jeong. [Cited on page â .] [À—] Chao Jin, Wei-Ya Xi, Zhi-Yong Ching, Feng Huo, and Chun-Teck Lim. HiSMRfs: a High Performance File System for Shingled Storage Array. In Proceedings of the òýÔ¥ IEEE çýth Symposium on Mass Storage Systems and Technologies (MSST), pages Ô–â, June òýÔ¥. doi: Ôý.ÔÔýÀ/MSST.òýÔ¥.◠çÀ. [Cited on pages ç and çÀ.] [ÀÀ] eodore Johnson and Dennis Shasha. òQ: A Low Overhead High Performance Bušer Management Replacement Algorithm. In Proceedings of the òýth International Conference on Very Large Data Bases, VLDB ’À¥, pages ¥çÀ–¥ ý, San Francisco, CA, USA, ÔÀÀ¥. Morgan Kaufmann Publishers Inc. ISBN Ô- —âý-Ô ç-—. URL http://dl.acm.org/citation. cfm?id=645920.672996. [Cited on page Þý.] [Ôýý] M. Frans Kaashoek, Dawson R. Engler, Gregory R. Ganger, Hector M. Briceño, Russell Hunt, David Mazières, omas Pinckney, Robert Grimm, John Jannotti, and Kenneth Mackenzie. Application Performance and Flexibility on Exokernel Systems. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles, SOSP ’ÀÞ, pages ò–â , New York, NY, USA, ÔÀÀÞ. ACM. ISBN ý-—ÀÞÀÔ-ÀÔâ- . doi: Ôý.ÔÔ¥ /òâ—ÀÀ—.òââ⥥. URL

ÔÔ¥ http://doi.acm.org/10.1145/268998.266644. [Cited on page À.] [ÔýÔ] Jurgen Kaiser, Dirk Meister, Tim Hartung, and Andre Brinkmann. ESB: Extò Split Block Device. In Proceedings of the òýÔò IEEE ԗth International Conference on Parallel and Dis- tributed Systems, ICPADS ’Ôò, pages Ô—Ô–Ô——, Washington, DC, USA, òýÔò. IEEE Computer Society. ISBN Àޗ-ý-ÞâÀ -¥Àýç-ç. doi: Ôý.ÔÔýÀ/ICPADS.òýÔò.ç¥. URL http://dx.doi. org/10.1109/ICPADS.2012.34. [Cited on page â.] [Ôýò] Ješrey Katcher. Postmark: A New File System Benchmark. Technical report, Technical Report TRçýòò, Network Appliance, ÔÀÀÞ. [Cited on page ¥À.] [Ôýç] John Kennedy and Michael Satran. About Transactional NTFS. https://docs. microsoft.com/en-us/windows/desktop/fileio/about-transactional-ntfs, May òýԗ. [Cited on pageÞÞ.] [Ôý¥] John Kennedy and Michael Satran. Alternatives to using Transactional NTFS. https:// docs.microsoft.com/en-us/windows/desktop/fileio/deprecation-of-txf, May òýԗ. [Cited on pages âç andÞÞ.] [Ôý ] Jaeho Kim, Donghee Lee, and Sam H. Noh. Towards SLO Complying SSDs rough OPS Isolation. In Ôçth USENIX Conference on File and Storage Technologies (FAST Ô ), pages ԗç–ԗÀ, Santa Clara, CA, òýÔ . USENIX Association. ISBN Àޗ-Ô-ÀçÔÀÞÔ- òýÔ. URL https://www.usenix.org/conference/fast15/technical-sessions/ presentation/kim_jaeho. [Cited on pagesç and—ý.] [Ôýâ] Jesung Kim, Jong Min Kim, S.H. Noh, Sang Lyul Min, and Yookun Cho. A space-e›cient žash translation layer for compactžash systems. Consumer Electronics, IEEE Transactions on, ¥—(ò):çââ–çÞ , May òýýò. ISSN ýýÀ—-çýâç. doi: Ôý.ÔÔýÀ/TCE.òýýò.ÔýÔýÔ¥ç. [Cited on page çò.] [ÔýÞ] Ryusuke Konishi, Yoshiji Amagai, Koji Sato, Hisashi Hifumi, Seiji Kihara, and Satoshi Moriai. e Linux Implementation of a Log-structured File System. SIGOPS Oper. Syst. Rev., ¥ý(ç):Ôýò–ÔýÞ, July òýýâ. ISSN ýÔâç- À—ý. doi: Ôý.ÔÔ¥ /ÔÔ ÔçÞ¥.ÔÔ ÔçÞ . URL http://doi.acm.org/10.1145/1151374.1151375. [Cited on page çÀ.] [Ôý—] Elie Krevat, Joseph Tucek, and Gregory R. Ganger. Disks Are Like Snowžakes: No Two Are Alike. In Proceedings of the Ôçth USENIX Conference on Hot Topics in Operating Systems, HotOS’Ôç, pages Ô¥–Ô¥, Berkeley, CA, USA, òýÔÔ. USENIX Association. URL http://dl. acm.org/citation.cfm?id=1991596.1991615. [Cited on pageçâ.] [ÔýÀ] Andrew Krioukov, Lakshmi N. Bairavasundaram, Garth R. Goodson, Kiran Srinivasan, Randy elen, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dussea. Parity lost and parity regained. In Proceedings of the âth USENIX Conference on File and Storage Tech- nologies, FAST’ý—, pages À:Ô–À:Ô , Berkeley, CA, USA, òýý—. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1364813.1364822. [Cited on page çÔ.] [ÔÔý] Mark Kryder, E.C. Gage, T.W.McDaniel, W.AChallener, R.E. Rottmayer, Ganping Ju, Yiao- Tee Hsia, and M.F. Erden. Heat Assisted Magnetic Recording. Proceedings of the IEEE, Àâ (ÔÔ):ԗÔý–ԗç , November òýý—. ISSN ýýԗ-ÀòÔÀ. doi: Ôý.ÔÔýÀ/JPROC.òýý—.òýý¥çÔ . [Cited on pageÀ.]

ÔÔ [ÔÔÔ] Aneesh Kumar KV, Mingming Cao, Jose R Santos, and Andreas Dilger. Ext¥ block and inode allocator improvements. In Proceedings of the Linux Symposium, volume Ô, òýý—. [Cited on pages ¥Ô and¥ò.] [ÔÔò] Butler Lampson and Howard E Sturgis. Crash recovery in a distributed data storage system. ÔÀÞÀ. [Cited on pageâý.] [ÔÔç] Rob Landley. Red-black Trees (rbtree) in Linux. https://www.kernel.org/doc/ Documentation/rbtree.txt, January òýýÞ. [Cited on page¥¥.] [ÔÔ¥] Quoc M. Le, Kumar SathyanarayanaRaju, Ahmed Amer, and JoAnne Holliday. Workload Impact on Shingled Write Disks: All-Writes Can Be Alright. In Proceedings of the òýÔÔ IEEE ÔÀth Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS ’ÔÔ, pages ¥¥¥–¥¥â, Washington, DC, USA, òýÔÔ. IEEE Computer Society. ISBN Àޗ-ý-ÞâÀ -¥¥çý-¥. doi: Ôý.ÔÔýÀ/MASCOTS.òýÔÔ. —. URL http://dx.doi.org/10.1109/MASCOTS.2011.58. [Cited on page ç .] [ÔÔ ] D. Le Moal, Z. Bandic, and C. Guyot. Shingled le system host-side management of Shin- gled Magnetic Recording disks. In Proceedings of the òýÔò IEEE International Conference on Consumer Electronics (ICCE), pages ¥ò –¥òâ, January òýÔò. doi: Ôý.ÔÔýÀ/ICCE.òýÔò. âÔâÔÞÀÀ. [Cited on pagesÀ,ÔÔ, ç , and çÀ.] [ÔÔâ] Adam Leventhal. APFS in Detail: Overview. http://dtrace.org/blogs/ahl/2016/06/ 19/apfs-part1/, òýÔâ. [Cited on pages¥ andâ—.] [ÔÔÞ] Libata FAQ. https://ata.wiki.kernel.org/index.php/Libata_FAQ, òýÔÔ. [Cited on page ÔÞ.] [Ôԗ] Chung-I Lin, Dongchul Park, Weiping He, and David H. C. Du. H-SWD: Incorporating Hot Data Identication into Shingled Write Disks. In Proceedings of the òýÔò IEEE òýth International Symposium on Modeling, Analysis and Simulation of Computer and Telecom- munication Systems, MASCOTS ’Ôò, pages çòÔ–ççý, Washington, DC, USA, òýÔò. IEEE Computer Society. ISBN Àޗ-ý-ÞâÀ -¥ÞÀç-ý. doi: Ôý.ÔÔýÀ/MASCOTS.òýÔò.¥¥. URL http://dx.doi.org/10.1109/MASCOTS.2012.44. [Cited on page ç .] [ÔÔÀ] Chen Luo and Michael J. Carey. LSM-based Storage Techniques: A Survey. CoRR, abs/ԗÔò.ýÞ òÞ, òýԗ. URL http://arxiv.org/abs/1812.07527. [Cited on page ——.] [Ôòý] Colm MacCárthaigh. Scaling Apache ò.x beyond òý,ýýý concurrent downloads. In ApacheCon EU, July òýý . [Cited on page ¥À.] [ÔòÔ] Peter Macko, Xiongzi Ge, John Haskins Jr., James Kelley, David Slik, Keith A. Smith, and Maxim G. Smith. SMORE: A Cold Data Object Store for SMR Drives (Extended Ver- sion). CoRR, abs/ÔÞý .ýÀÞýÔ, òýÔÞ. URL http://arxiv.org/abs/1705.09701. [Cited on page ÞÀ.] [Ôòò] Magic Pocket & Hardware Engineering Teams. Extending Magic Pocket Innovation with the rst petabyte scale SMR drive deployment. https://blogs.dropbox.com/tech/ 2018/06/extending-magic-pocket-innovation-with-the-first-petabyte- scale-smr-drive-deployment/, òýԗ. [Cited on page ÞÀ.] [Ôòç] Magic Pocket & Hardware Engineering Teams. SMR: What we learned in our rst

ÔÔâ year. https://blogs.dropbox.com/tech/2019/07/smr-what-we-learned-in-our- first-year/, òýÔÀ. [Cited on page ÞÀ.] [Ôò¥] Adam Manzanares, Noah Watkins, Cyril Guyot, Damien LeMoal, Carlos Maltzahn, and Zvonimr Bandic. ZEA, A Data Management Approach for SMR. In —th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage Ôâ), Denver, CO, USA, June òýÔâ. USENIX Association. URL https://www.usenix.org/conference/hotstorage16/ workshop-program/presentation/manzanares. [Cited on pages çÀ and—¥.] [Ôò ] Lars Marowsky-Brée. Ceph User Survey òýԗ results. https://ceph.com/ceph-blog/ ceph-user-survey-2018-results/, òýԗ. [Cited on pageâý.] [Ôòâ] Chris Mason. Compilebench. https://oss.oracle.com/~mason/compilebench/, òýýÞ. [Cited on page ¥ç.] [ÔòÞ] Avantika Mathur, Mingming Cao, Suparna Bhattacharya, Andreas Dilger, Alex Tomas, and Laurent Vivier. e new ext¥ lesystem: current status and future plans. In Proceedings of the Linux symposium, volume ò, òýýÞ. [Cited on page ¥Ô.] [Ôò—] Marshall K McKusick, William N Joy, Samuel J Leœer, and Robert S Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems (TOCS), ò(ç):Ô—Ô–ÔÀÞ, ÔÀ—¥. [Cited on pages ¥Ô, â,ââ, andÞÞ.] [ÔòÀ] Marshall Kirk McKusick, George V Neville-Neil, and Robert NM Watson. e Design and Implementation of the FreeBSD Operating System. Pearson Education, òýÔ¥. [Cited on page .] [Ôçý] Chris Mellor. Toshiba embraces shingling for next-gen MAMR HDDs. https: //blocksandfiles.com/2019/03/11/toshiba-mamr-statements-have-shingling- absence/, òýÔÀ. [Cited on pageç.] [ÔçÔ] Changwoo Min, Woon-Hak Kang, Taesoo Kim, Sang-Won Lee, and Young Ik Eom. Lightweight Application-Level Crash Consistency on Transactional Flash Storage. In òýÔ USENIX Annual Technical Conference (USENIX ATC Ô ), pages òòÔ–òç¥, Santa Clara, CA, òýÔ . USENIX Association. ISBN Àޗ-Ô-ÀçÔÀÞÔ-òò . URL https://www.usenix. org/conference/atc15/technical-session/presentation/min. [Cited on pagesâý and âç.] [Ôçò] Damien Le Moal. blk-mq support for ZBC disks. https://lwn.net/Articles/742159/, òýÔÞ. [Cited on page—¥.] [Ôçç] Keith Muller and Joseph Pasquale. A High Performance Multi-structured File System Design. In Proceedings of the irteenth ACM Symposium on Operating Systems Princi- ples, SOSP ’ÀÔ, pages â–âÞ, New York, NY, USA, ÔÀÀÔ. ACM. ISBN ý-—ÀÞÀÔ-¥¥Þ-ç. doi: Ôý.ÔÔ¥ /ÔòÔÔçò.ò—ââýý. URL http://doi.acm.org/10.1145/121132.286600. [Cited on page â.] [Ôç¥] Sumedh N. Coding for Performance: Data alignment and structures. https: //software.intel.com/en-us/articles/coding-for-performance-data- alignment-and-structures, òýÔç. [Cited on page Þâ.] [Ôç ] NetBSD-Wiki. How to install a server with a root LFS partition. https://wiki.netbsd.

org/tutorials/how_to_install_a_server_with_a_root_lfs_partition/, 2012. [Cited on page 39.]
[136] Juan Carlos Olamendy. An LSM-tree engine for MySQL. https://blog.toadworld.com/2017/11/15/an-lsm-tree-engine-for-mysql, 2017. [Cited on page 88.]
[137] Michael A. Olson. The Design and Implementation of the Inversion File System. In USENIX Winter, Berkeley, CA, USA, 1993. USENIX Association. [Cited on pages 60, 63, and 77.]
[138] Michael A. Olson, Keith Bostic, and Margo Seltzer. Berkeley DB. In Proceedings of the Annual Conference on USENIX Annual Technical Conference, ATEC ’99, pages 43–43, Berkeley, CA, USA, 1999. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1268708.1268751. [Cited on page 77.]
[139] OpenStack Foundation. 2017 Annual Report. https://www.openstack.org/assets/reports/OpenStack-AnnualReport2017.pdf, 2017. [Cited on page 66.]
[140] Adrian Palmer. SMRFFS-EXT4—SMR Friendly File System. https://github.com/Seagate/SMR_FS-EXT4, 2015. [Cited on pages 5 and 39.]
[141] Adrian Palmer. SMR in Linux Systems. In Vault Linux Storage and File System Conference, Boston, MA, USA, April 2016. [Cited on page 39.]
[142] Chanik Park, Wonmoon Cheon, Jeonguk Kang, Kangho Roh, Wonhee Cho, and Jin-Soo Kim. A reconfigurable FTL (flash translation layer) architecture for NAND flash-based applications. ACM Trans. Embed. Comput. Syst., 7(4):38:1–38:23, August 2008. ISSN 1539-9087. doi: 10.1145/1376804.1376806. URL http://doi.acm.org/10.1145/1376804.1376806. [Cited on page 32.]
[143] Swapnil Patil and Garth Gibson. Scale and Concurrency of GIGA+: File System Directories with Millions of Files. In Proceedings of the 9th USENIX Conference on File and Storage Technologies, FAST ’11, pages 13–13, Berkeley, CA, USA, 2011. USENIX Association. ISBN 978-1-931971-82-9. URL http://dl.acm.org/citation.cfm?id=1960475.1960488. [Cited on pages 1 and 65.]
[144] PerlMonks. Fastest way to recurse through VERY LARGE directory tree. http://www.perlmonks.org/?node_id=883444, 2011. [Cited on page 49.]
[145] Juan Piernas, Toni Cortes, and José M. García. DualFS: A new journaling file system without meta-data duplication. In Proceedings of the 16th International Conference on Supercomputing, ICS ’02, pages 137–146, New York, NY, USA, 2002. Association for Computing Machinery. ISBN 1581134835. doi: 10.1145/514191.514213. URL https://doi.org/10.1145/514191.514213. [Cited on pages 49, 50, 56, and 77.]
[146] S. N. Piramanayagam. Perpendicular recording media for hard disk drives. Journal of Applied Physics, 102(1):011301, July 2007. ISSN 0021-8979, 1089-7550. doi: 10.1063/1.2750414. [Cited on page 9.]
[147] Rekha Pitchumani, Andy Hospodor, Ahmed Amer, Yangwook Kang, Ethan L. Miller, and Darrell D. E. Long. Emulating a Shingled Write Disk. In Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS ’12, pages 339–346, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4793-0. doi: 10.1109/MASCOTS.2012.46. URL http://dx.doi.org/10.1109/MASCOTS.2012.46. [Cited on pages 13 and 35.]
[148] Rekha Pitchumani, James Hughes, and Ethan L. Miller. SMRDB: Key-value Data Store for Shingled Magnetic Recording Disks. In Proceedings of the 8th ACM International Systems and Storage Conference, SYSTOR ’15, pages 18:1–18:11, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3607-9. doi: 10.1145/2757667.2757680. URL http://doi.acm.org/10.1145/2757667.2757680. [Cited on page 83.]
[149] Poornima G and Rajesh Joseph. Metadata Performance Bottlenecks in Gluster. https://www.slideshare.net/GlusterCommunity/performance-bottlenecks-for-metadata-workload-in-gluster-with-poornima-gurusiddaiah-rajesh-joseph, 2016. [Cited on pages 1, 4, and 65.]
[150] Donald E. Porter, Owen S. Hofmann, Christopher J. Rossbach, Alexander Benn, and Emmett Witchel. Operating System Transactions. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09, pages 161–176, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-752-3. doi: 10.1145/1629575.1629591. URL http://doi.acm.org/10.1145/1629575.1629591. [Cited on pages 60, 63, and 77.]
[151] Sundar Poudyal. Partial write system, March 13 2013. US Patent App. 13/799,827. [Cited on page 22.]
[152] Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Analysis and Evolution of Journaling File Systems. In The Proceedings of the USENIX Annual Technical Conference (USENIX ’05), pages 105–120, Anaheim, CA, USA, April 2005. [Cited on pages 40 and 41.]
[153] Lee Prewitt. SMR and ZNS – Two Sides of the Same Coin. https://www.youtube.com/watch?v=jBxzO6YyMxU, 2019. [Cited on page 79.]
[154] Red Hat Inc. GlusterFS Architecture. https://docs.gluster.org/en/latest/Quick-Start-Guide/Architecture/, 2019. [Cited on pages 2, 4, 59, and 60.]
[155] David Reinsel, John Gantz, and John Rydning. Data Age 2025: The Evolution of Data to Life-Critical. https://www.seagate.com/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf, 2017. [Cited on pages 1 and 105.]
[156] Kai Ren and Garth Gibson. TABLEFS: Enhancing Metadata Efficiency in the Local File System. In Proceedings of the 2013 USENIX Annual Technical Conference (USENIX ATC 13), pages 145–156, San Jose, CA, USA, 2013. USENIX. ISBN 978-1-931971-01-0. URL https://www.usenix.org/conference/atc13/technical-sessions/presentation/ren. [Cited on pages 49, 56, and 77.]
[157] Drew Riley. Samsung’s SSD Global Summit: Samsung: Flexing Its Dominance In The NAND Market, August 2013. URL http://www.tomshardware.com/reviews/samsung-global-ssd-summit-2013,3570.html. [Cited on page 9.]
[158] Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-structured File System. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, SOSP ’91, pages 1–15, New York, NY, USA, 1991. ACM. ISBN 0-89791-447-3. doi: 10.1145/121132.121137. URL http://doi.acm.org/10.1145/121132.121137. [Cited on pages 11, 39, 56, 83, and 98.]
[159] Ricardo Santana, Raju Rangaswami, Vasily Tarasov, and Dean Hildebrand. A Fast and Slippery Slope for File Systems. In Proceedings of the 3rd Workshop on Interactions of NVM/FLASH with Operating Systems and Workloads, INFLOW ’15, pages 5:1–5:8, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3945-2. doi: 10.1145/2819001.2819003. URL http://doi.acm.org/10.1145/2819001.2819003. [Cited on page 39.]
[160] SATA-IO. Serial ATA Revision 3.1 Specification. Technical report, SATA-IO, July 2011. [Cited on page 12.]
[161] Steven W. Schlosser, Jiri Schindler, Stratos Papadomanolakis, Minglong Shao, Anastassia Ailamaki, Christos Faloutsos, and Gregory R. Ganger. On Multidimensional Data and Modern Disks. In Proceedings of the 4th USENIX Conference on File and Storage Technologies - Volume 4, FAST ’05, pages 17–17, Berkeley, CA, USA, 2005. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1251028.1251045. [Cited on page 36.]
[162] Frank Schmuck and Jim Wyllie. Experience with Transactions in QuickSilver. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles, SOSP ’91, pages 239–253, New York, NY, USA, 1991. ACM. ISBN 0-89791-447-3. doi: 10.1145/121132.121171. URL http://doi.acm.org/10.1145/121132.121171. [Cited on pages 60, 63, and 77.]
[163] Thomas J. E. Schwarz, Qin Xin, Ethan L. Miller, Darrell D. E. Long, Andy Hospodor, and Spencer Ng. Disk Scrubbing in Large Archival Storage Systems. In Proceedings of the IEEE Computer Society’s 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems, MASCOTS ’04, pages 409–418, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2251-3. URL http://dl.acm.org/citation.cfm?id=1032659.1034226. [Cited on page 65.]
[164] Seagate. Seagate Technology PLC Fiscal Fourth Quarter and Year End 2013 Financial Results Supplemental Commentary, July 2013. Available from http://www.seagate.com/investors. [Cited on page 9.]
[165] Seagate. Seagate Ships World’s First 8TB Hard Drives, August 2014. Available from http://www.seagate.com/about/newsroom/. [Cited on page 9.]
[166] Seagate Technology LLC. Seagate Desktop HDD: ST5000DM000, ST4000DM001. Product Manual 100743772, Seagate Technology LLC, December 2013. [Cited on page 9.]
[167] Seagate Technology PLC. Terascale HDD. Data sheet DS1793.1-1306US, Seagate Technology PLC, June 2013. [Cited on page 9.]
[168] Seastar. Shared-nothing Design. http://seastar.io/shared-nothing/, August 2019. [Cited on page 76.]
[169] Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin. An Implementation of a Log-structured File System for UNIX. In Proceedings of the USENIX Winter 1993 Conference, USENIX ’93, pages 3–3, Berkeley, CA, USA, 1993. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267303.1267306. [Cited on page 39.]

[170] Margo I. Seltzer. Transaction Support in a Log-Structured File System. In Proceedings of the Ninth International Conference on Data Engineering, pages 503–510, Washington, DC, USA, 1993. IEEE Computer Society. ISBN 0-8186-3570-3. URL http://dl.acm.org/citation.cfm?id=645478.654970. [Cited on pages 60 and 63.]
[171] ServerFault. Doing an rm -rf on a massive directory tree takes hours. http://serverfault.com/questions/46852, 2009. [Cited on page 49.]
[172] Mansour Shafaei, Mohammad Hossein Hajkazemi, Peter Desnoyers, and Abutalib Aghayev. Modeling SMR Drive Performance. SIGMETRICS Perform. Eval. Rev., 44(1):389–390, June 2016. ISSN 0163-5999. doi: 10.1145/2964791.2901496. URL https://doi.org/10.1145/2964791.2901496. [Cited on page 36.]
[173] Mansour Shafaei, Mohammad Hossein Hajkazemi, Peter Desnoyers, and Abutalib Aghayev. Modeling Drive-Managed SMR Performance. ACM Trans. Storage, 13(4), December 2017. ISSN 1553-3077. doi: 10.1145/3139242. URL https://doi.org/10.1145/3139242. [Cited on page 36.]
[174] Kai Shen, Stan Park, and Meng Zhu. Journaling of Journal Is (Almost) Free. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14), pages 287–293, Santa Clara, CA, 2014. USENIX. ISBN 978-1-931971-08-9. URL https://www.usenix.org/conference/fast14/technical-sessions/presentation/shen. [Cited on page 65.]
[175] A. Shilov. Western Digital: Over Half of Data Center HDDs Will Use SMR by 2023. https://www.anandtech.com/show/14099/western-digital-over-half-of-dc-hdds-will-use-smr-by-2023, 2019. [Cited on pages 3 and 79.]
[176] Anton Shilov. Seagate Ships 35th Millionth SMR HDD, Confirms HAMR-Based Drives in Late 2018. https://www.anandtech.com/show/11315/seagate-ships-35th-millionth-smr-hdd-confirms-hamrbased-hard-drives-in-late-2018, 2017. [Cited on page 3.]
[177] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-7152-2. doi: 10.1109/MSST.2010.5496972. URL http://dx.doi.org/10.1109/MSST.2010.5496972. [Cited on pages 2, 4, and 59.]
[178] Chris Siebenmann. About the order that readdir() returns entries in. https://utcc.utoronto.ca/~cks/space/blog/unix/ReaddirOrder, 2011. [Cited on page 65.]
[179] Chris Siebenmann. ZFS transaction groups and the ZFS Intent Log. https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSTXGsAndZILs, 2013. [Cited on page 63.]
[180] Richard P. Spillane, Sachin Gaikwad, Manjunath Chinni, Erez Zadok, and Charles P. Wright. Enabling transactional file access via lightweight kernel extensions. In Proceedings of the 7th Conference on File and Storage Technologies, FAST ’09, pages 29–42, USA, 2009. USENIX Association. [Cited on pages 60, 63, and 77.]
[181] Stas Starikevich. [ceph-users] RadosGW performance degradation on the 18 millions objects stored. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012983.html, 2016. [Cited on page 66.]
[182] Jan Stender, Björn Kolbeck, Mikael Högqvist, and Felix Hupfeld. BabuDB: Fast and Efficient File System Metadata Storage. In Proceedings of the 2010 International Workshop on Storage Network Architecture and Parallel I/Os, SNAPI ’10, pages 51–58, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-0-7695-4025-2. doi: 10.1109/SNAPI.2010.14. URL http://dx.doi.org/10.1109/SNAPI.2010.14. [Cited on page 60.]
[183] Michael Stonebraker. Operating System Support for Database Management. Communications of the ACM, 24(7):412–418, July 1981. ISSN 0001-0782. doi: 10.1145/358699.358703. URL http://doi.acm.org/10.1145/358699.358703. [Cited on page 59.]
[184] Michael Stonebraker and Lawrence A. Rowe. The Design of POSTGRES. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, SIGMOD ’86, pages 340–355, New York, NY, USA, 1986. ACM. ISBN 0-89791-191-1. doi: 10.1145/16894.16888. URL http://doi.acm.org/10.1145/16894.16888. [Cited on page 77.]
[185] Miklos Szeredi et al. FUSE: Filesystem in userspace. https://github.com/libfuse/libfuse/, 2020. [Cited on page 56.]
[186] Nisha Talagala, Remzi H. Arpaci-Dusseau, and D. Patterson. Microbenchmark-based Extraction of Local and Global Disk Characteristics. Technical Report UCB/CSD-99-1063, EECS Department, University of California, Berkeley, 1999. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/6275.html. [Cited on pages 10 and 36.]
[187] S. Tan, W. Xi, Z. Y. Ching, C. Jin, and C. T. Lim. Simulation for a Shingled Magnetic Recording Disk. IEEE Transactions on Magnetics, 49(6):2677–2681, June 2013. ISSN 0018-9464. doi: 10.1109/TMAG.2013.2245872. [Cited on page 35.]
[188] Vasily Tarasov, Erez Zadok, and Spencer Shepler. Filebench: A Flexible Framework for File System Benchmarking. USENIX ;login issue, 41(1), 2016. [Cited on page 45.]
[189] ZAR team. "Write hole" phenomenon. http://www.raid-recovery-guide.com/raid5-write-hole.aspx, 2019. [Cited on page 71.]
[190] ThinkParQ. An introduction to BeeGFS. https://www.beegfs.io/docs/whitepapers/Introduction_to_BeeGFS_by_ThinkParQ.pdf, 2018. [Cited on pages 2, 4, and 59.]
[191] D. A. Thompson and J. S. Best. The future of magnetic data storage technology. IBM Journal of Research and Development, 44(3):311–322, May 2000. ISSN 0018-8646. doi: 10.1147/rd.443.0311. [Cited on pages 9 and 10.]
[192] and Peter Zijlstra. __wb_calc_thresh. http://lxr.free-electrons.com/source/mm/page-writeback.c?v=4.6#L733, 2016. [Cited on page 45.]
[193] Theodore Ts’o. Release of e2fsprogs 1.43.2. http://www.spinics.net/lists/linux-ext4/msg53544.html, September 2016. [Cited on page 41.]
[194] Stephen C. Tweedie. Journaling the Linux ext2fs Filesystem. In The Fourth Annual Linux Expo, Durham, NC, USA, May 1998. [Cited on pages 42, 46, and 63.]
[195] Peng Wang, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. An Efficient Design and Implementation of LSM-tree Based Key-value Store on Open-channel SSD. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys ’14, pages 16:1–16:14, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2704-6. doi: 10.1145/2592798.2592804. URL http://doi.acm.org/10.1145/2592798.2592804. [Cited on page 88.]
[196] Sumei Wang, Yao Wang, and R. H. Victora. Shingled Magnetic Recording on Bit Patterned Media at 10 Tb/in². IEEE Transactions on Magnetics, 49(7):3644–3647, July 2013. ISSN 0018-9464. doi: 10.1109/TMAG.2012.2237545. [Cited on page 9.]
[197] WDC. My Passport Ultra. https://www.wdc.com/products/portable-storage/my-passport-ultra-new.html, July 2016. [Cited on page 5.]
[198] Sage Weil. [RFC] big fat transaction ioctl. https://lwn.net/Articles/361439/, 2009. [Cited on page 63.]
[199] Sage Weil. Re: [RFC] big fat transaction ioctl. https://lwn.net/Articles/361472/, 2009. [Cited on page 63.]
[200] Sage Weil. [PATCH v3] introduce sys_syncfs to sync a single file system. https://lwn.net/Articles/433384/, March 2011. [Cited on page 64.]
[201] Sage Weil. Goodbye XFS: Building a New, Faster Storage Backend for Ceph. https://www.snia.org/sites/default/files/SDC/2017/presentations/General_Session/Weil_Sage%20_Red_Hat_Goodbye_XFS_Building_a_new_faster_storage_backend_for_Ceph.pdf, 2017. [Cited on page 1.]
[202] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. Ceph: A Scalable, High-performance Distributed File System. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, pages 307–320, Berkeley, CA, USA, 2006. USENIX Association. ISBN 1-931971-47-1. URL http://dl.acm.org/citation.cfm?id=1298455.1298485. [Cited on pages 2, 4, 59, and 60.]
[203] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Carlos Maltzahn. CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, pages 122–es, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 0769527000. doi: 10.1145/1188455.1188582. URL https://doi.org/10.1145/1188455.1188582. [Cited on page 61.]
[204] Sage A. Weil, Andrew W. Leung, Scott A. Brandt, and Carlos Maltzahn. RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters. In Proceedings of the 2nd International Workshop on Petascale Data Storage: Held in Conjunction with Supercomputing ’07, PDSW ’07, pages 35–44, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-899-2. doi: 10.1145/1374596.1374606. URL http://doi.acm.org/10.1145/1374596.1374606. [Cited on page 60.]
[205] Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson, Brian Mueller, Jason Small, Jim Zelenka, and Bin Zhou. Scalable Performance of the Panasas Parallel File System. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST ’08, pages 2:1–2:17, Berkeley, CA, USA, 2008. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1364813.1364815. [Cited on pages 2, 4, 56, 59, and 60.]

[206] Western Digital—Jorge Campello De Souza. What is Zoned Storage and the Zoned Storage Initiative? https://blog.westerndigital.com/what-is-zoned-storage-initiative/, 2019. [Cited on pages 1, 3, 79, and 93.]
[207] Western Digital Inc. ZBC device manipulation library. https://github.com/hgst/libzbc, 2018. [Cited on page 84.]
[208] Western Digital Inc. Zoned block device manipulation library. https://github.com/westerndigitalcorporation/libzbd, 2020. [Cited on page 80.]
[209] Lustre Wiki. Introduction to Lustre Architecture. http://wiki.lustre.org/images/6/64/LustreArchitecture-v4.pdf, 2017. [Cited on pages 2, 4, 56, 59, and 60.]
[210] Wikipedia. Btrfs History. https://en.wikipedia.org/wiki/Btrfs#History, 2018. [Cited on pages 4 and 68.]
[211] Wikipedia. XFS History. https://en.wikipedia.org/wiki/XFS#History, 2018. [Cited on pages 4 and 68.]
[212] Wikipedia. COVID-19 pandemic. https://en.wikipedia.org/wiki/COVID-19_pandemic, 2019. [Cited on page 98.]
[213] Wikipedia. Cache flushing. https://en.wikipedia.org/wiki/Disk_buffer#Cache_flushing, 2019. [Cited on pages 12 and 65.]
[214] Wikipedia. Parallel ATA. https://en.wikipedia.org/wiki/Parallel_ATA#IDE_and_ATA-1, 2020. [Cited on page 3.]
[215] R. Wood, Mason Williams, A. Kavcic, and Jim Miles. The Feasibility of Magnetic Recording at 10 Terabits Per Square Inch on Conventional Media. IEEE Transactions on Magnetics, 45(2):917–923, February 2009. ISSN 0018-9464. doi: 10.1109/TMAG.2008.2010676. [Cited on pages 3, 9, and 11.]
[216] Bruce L. Worthington, Gregory R. Ganger, Yale N. Patt, and John Wilkes. On-line Extraction of SCSI Disk Drive Parameters. In Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’95/PERFORMANCE ’95, pages 146–156, New York, NY, USA, 1995. ACM. doi: 10.1145/223587.223604. [Cited on pages 10 and 36.]
[217] Charles P. Wright, Richard Spillane, Gopalan Sivathanu, and Erez Zadok. Extending ACID semantics to the file system. ACM Trans. Storage, 3(2):4–es, June 2007. ISSN 1553-3077. doi: 10.1145/1242520.1242521. URL https://doi.org/10.1145/1242520.1242521. [Cited on pages 60, 63, and 77.]
[218] F. Wu, Z. Fan, M. Yang, B. Zhang, X. Ge, and D. H. C. Du. Performance Evaluation of Host Aware Shingled Magnetic Recording (HA-SMR) Drives. IEEE Transactions on Computers, 66(11):1932–1945, 2017. [Cited on page 36.]
[219] Fenggang Wu, Ming-Chang Yang, Ziqi Fan, Baoquan Zhang, Xiongzi Ge, and David H. C. Du. Evaluating Host Aware SMR Drives. In Proceedings of the 8th USENIX Conference on Hot Topics in Storage and File Systems, HotStorage ’16, pages 31–35, USA, 2016. USENIX Association. [Cited on page 36.]
[220] Fengguang Wu. I/O-less Dirty Throttling. https://events.linuxfoundation.org/images/stories/pdf/lcjp2012_wu.pdf, June 2012. [Cited on page 67.]
[221] Shiqin Yan, Huaicheng Li, Mingzhe Hao, Michael Hao Tong, Swaminathan Sundararaman, Andrew A. Chien, and Haryadi S. Gunawi. Tiny-Tail Flash: Near-Perfect Elimination of Garbage Collection Tail Latencies in NAND SSDs. In 15th USENIX Conference on File and Storage Technologies (FAST 17), pages 15–28, Santa Clara, CA, 2017. USENIX Association. ISBN 978-1-931971-36-2. URL https://www.usenix.org/conference/fast17/technical-sessions/presentation/yan. [Cited on pages 1, 5, and 80.]
[222] Ting Yao, Jiguang Wan, Ping Huang, Yiwen Zhang, Zhiwen Liu, Changsheng Xie, and Xubin He. GearDB: A GC-free Key-Value Store on HM-SMR Drives with Gear Compaction. In 17th USENIX Conference on File and Storage Technologies (FAST 19), pages 159–171, Boston, MA, February 2019. USENIX Association. ISBN 978-1-939133-09-0. URL https://www.usenix.org/conference/fast19/presentation/yao. [Cited on pages 83 and 84.]
[223] Yiying Zhang, Leo Prasath Arulraj, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. De-Indirection for Flash-Based SSDs with Nameless Writes. In Proceedings of the 10th USENIX Conference on File and Storage Technologies, FAST ’12, page 1, USA, 2012. USENIX Association. [Cited on page 96.]
[224] Zhihui Zhang and Kanad Ghose. hFS: A Hybrid File System Prototype for Improving Small File and Metadata Performance. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, pages 175–187, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-636-3. doi: 10.1145/1272996.1273016. URL http://doi.acm.org/10.1145/1272996.1273016. [Cited on pages 49, 56, and 77.]
[225] Qing Zheng, Charles D. Cranor, Danhao Guo, Gregory R. Ganger, George Amvrosiadis, Garth A. Gibson, Bradley W. Settlemyer, Gary Grider, and Fan Guo. Scaling Embedded In-situ Indexing with deltaFS. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’18, pages 3:1–3:15, Piscataway, NJ, USA, 2018. IEEE Press. URL http://dl.acm.org/citation.cfm?id=3291656.3291660. [Cited on page 77.]
[226] Alexey Zhuravlev. ZFS: Metadata Performance. https://www.eofs.eu/_media/events/lad16/02_zfs_md_performance_improvements_zhuravlev.pdf, 2016. [Cited on pages 60 and 65.]
[227] Peter Zijlstra. sysfs-class-bdi. https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-class-bdi, January 2008. [Cited on page 45.]
