newsletter on pdl activities and events • spring 2018 • http://www.pdl.cmu.edu/

Massive Indexed Directories in DeltaFS
by Qing Zheng, George Amvrosiadis & the DeltaFS Group

an informal publication from academia's premiere storage systems research center devoted to advancing the state of the art in storage and information infrastructures.

CONTENTS
DeltaFS ...... 1
Director's Letter ...... 2
Year in Review ...... 4
Recent Publications ...... 5
PDL News & Awards ...... 8
3Sigma ...... 12
Defenses & Proposals ...... 14
Alumni News ...... 18
New PDL Faculty & Staff ...... 19

PDL CONSORTIUM MEMBERS
Alibaba Group
Broadcom, Ltd.
Dell EMC
Facebook
Google
Hewlett Packard Enterprise
Hitachi, Ltd.
IBM Research
Intel Corporation
Micron
MongoDB
NetApp, Inc.
Oracle Corporation
Salesforce
Samsung Information Systems America
Seagate Technology
Toshiba
Two Sigma
Veritas
Western Digital

Faster storage media, faster interconnection networks, and improvements in systems software have significantly mitigated the effect of I/O bottlenecks in HPC applications. Even so, applications that read and write data in small chunks are limited by the ability of both the hardware and the software to handle such workloads efficiently. Often, scientific applications partition their output using one file per process. This is a problem on HPC computers with hundreds of thousands of cores, and will only worsen with exascale computers, which will be an order of magnitude larger. To avoid wasting time creating output files on such machines, scientific applications are forced to use libraries that combine multiple I/O streams into a single file. For many applications where output is produced out-of-order, this must be followed by a costly, massive data sorting operation. DeltaFS allows applications to write to an arbitrarily large number of files, while also guaranteeing efficient data access without requiring sorting.

The first challenge when handling an arbitrarily large number of files is dealing with the resulting metadata load. We manage this using the transient and serverless DeltaFS file system [1]. The transient property of DeltaFS allows each program that uses it to individually control the amount of computing resources dedicated to the file system, effectively scaling metadata performance under application control. When combined with DeltaFS's serverless nature, file system design and provisioning decisions are decoupled from the overall design of the HPC platform. As a result, applications that create one file for each process are no longer tied to the platform storage system's ability to handle metadata-heavy workloads. The HPC platform can also provide scalable file creation rates without requiring a fundamental redesign of the platform's storage system.

The second challenge is guaranteeing both fast writing and reading for workloads that consist primarily of small I/O transfers. This work was inspired by interactions with cosmologists seeking to explore the trajectories of the highest energy particles in an astrophysics simulation using the VPIC plasma simulation code [2]. VPIC is a highly-optimized particle simulation code developed at Los Alamos National Laboratory (LANL). Each VPIC simulation proceeds in timesteps, and each process represents a bounding box in the physical simulation space that particles move through.
Every few timesteps the simulation stops, and each process creates a file and writes the data for the particles that are currently located within its bounding box. This is the default, file-per-process

Figure 1: DeltaFS in-situ indexing of particle data in an Indexed Massive Directory. While indexed particle data are exposed as one DeltaFS subfile per particle, they are stored as indexed log objects in the underlying storage. (The diagram contrasts traditional file-per-process output with DeltaFS file-per-particle output, from simulation processes through the file system API to a massive indexed object store answering trajectory queries.)

continued on page 11

FROM THE DIRECTOR'S CHAIR
GREG GANGER

THE PDL PACKET

THE PARALLEL DATA LABORATORY
School of Computer Science
Department of ECE
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3891
voice 412•268•6716
fax 412•268•3010

PUBLISHER
Greg Ganger

EDITOR
Joan Digney

The PDL Packet is published once per year to update members of the PDL Consortium. A pdf version resides in the Publications section of the PDL Web pages and may be freely distributed. Contributions are welcome.

THE PDL LOGO
Skibo Castle and the lands that comprise its estate are located in the Kyle of Sutherland in the northeastern part of Scotland. Both 'Skibo' and 'Sutherland' are names whose roots are from Old Norse, the language spoken by the Vikings who began washing ashore regularly in the late ninth century. The word 'Skibo' fascinates etymologists, who are unable to agree on its original meaning. All agree that 'bo' is the Old Norse for 'land' or 'place,' but they argue whether 'ski' means 'ships' or 'peace' or 'fairy hill.'

Although the earliest version of Skibo seems to be lost in the mists of time, it was most likely some kind of fortified building erected by the Norsemen. The present-day castle was built by a bishop of the Roman Catholic Church. Andrew Carnegie, after making his fortune, bought it in 1898 to serve as his summer home. In 1980, his daughter, Margaret, donated Skibo to a trust that later sold the estate. It is presently being run as a luxury hotel.

Hello from fabulous Pittsburgh!

25 years! This past fall, we celebrated 25 years of the Parallel Data Lab. Started by Garth after he defended his PhD dissertation on RAID at UC-Berkeley, PDL has seen growth and success that I can't imagine he imagined... from the early days of exploring new disk array approaches to today's broad agenda of large-scale storage and data center infrastructure research... from a handful of core CMU researchers and industry participants to a vibrant community of scores of CMU researchers and 20 sponsor companies. Amazing.

It has been another great year for the Parallel Data Lab, and I'll highlight some of the research activities and successes below. Others, including graduations, publications, awards, etc., can be found throughout the newsletter. But I can't not start with the biggest PDL news item of this 25th anniversary year: Garth has graduated ;). More seriously, 25 years after founding PDL, including guiding/nurturing it into a large research center with sustained success (25 years!), Garth decided to move back to Canada and take the reins (as President and CEO) of the new Vector Institute for AI. We wish him huge success with this new endeavor! Garth has been an academic role model, a mentor, and a friend to me and many others... we will miss him greatly, and he knows that we will always have a place for him at PDL events.

Because it overlaps in area with Vector, I'll start my highlighting of PDL activities with our continuing work at the intersection of machine learning (ML) and systems. We continue to explore new approaches to system support for large-scale machine learning, especially aspects of how ML systems should adapt and be adapted in cloud computing environments. Beyond our earlier focus on challenges around dynamic resource availability and time-varying resource interference, we continue to explore challenges related to training models over
geo-distributed data, training very large models, and how edge resources should be shared among inference applications using DNNs for video stream processing. We are also exploring how ML can be applied to make systems better, including even ML systems ;).

Indeed, much of PDL's expansive database systems research activity centers on embedding automation in DBMSs. With an eye toward simplifying administration and improving performance robustness, there are a number of aspects of Andy's overall vision of a self-driving database system being explored and realized. To embody them, and other ideas, a new open source DBMS called Peloton has been created and is being continuously enhanced. There also continue to be cool results and papers on better exploitation of NVM in databases, improved concurrency control mechanisms, and range query filtering. I thoroughly enjoy watching (and participating in) the great energy that Andy has infused into database systems research at CMU.

Of course, PDL continues to have a big focus on storage systems research at various levels. At the high end, PDL's long-standing focus on metadata scaling for scalable storage has led to continued research into the benefits of and approaches to allowing important applications to manage their own namespaces and metadata for periods of time.


In addition to bypassing traditional metadata bottlenecks entirely during the heaviest periods of activity, this approach promises opportunities for efficient in-situ index creation to enable fast queries for subsequent analysis activities. At the lower end, we continue to explore how software systems should be changed to maximize the value from NVM storage, including addressing read-write performance asymmetry and providing storage management features (e.g., page-level checksums, dedup, etc.) without yielding load/store efficiency. We're excited about continuing to work with PDL companies on understanding where storage hardware is (and should be) going and how it should be exploited in systems.

PDL continues to explore questions of resource scheduling for cloud computing, which grow in complexity as the breadth of application and resource types grows. Our cluster scheduling research continues to explore how job runtime estimates can be automatically generated and exploited to achieve greater efficiency. Our most recent work explores more robust ways of exploiting imperfectly-estimated runtime information, finding that providing full distributions of likely runtimes (e.g., based on the history of "similar" jobs) works quite well for real-world workloads, as reflected in real cluster traces. We are also exploring scheduling for adaptively-sized "virtual clusters" within public clouds, which introduces new questions about which machine types to allocate, how to pack them, and how aggressively to release them.

I continue to be excited about the growth and evolution of the storage systems and cloud classes created and led by PDL faculty — their popularity is at an all-time high again this year. These project-intensive classes prepare 100s of MS students to be designers and developers for future infrastructure systems. They build FTLs that store real data (in a simulated NAND Flash SSD), hybrid cloud file systems that work, cluster schedulers, efficient ML model training apps, etc. It's really rewarding for us and for them. In addition to our lectures and the projects, these classes each feature 3-5 corporate guest lecturers (thank you, PDL Consortium members!) bringing insight on real-world solutions, trends, and futures.

Many other ongoing PDL projects are also producing cool results. For example, to help our (and others') file systems research, we have developed a new file system aging suite, called Geriatrix.
Our key-value store research continues to expose new approaches to indexing and remote value access. This newsletter and the PDL website offer more details and additional research highlights.

I'm always overwhelmed by the accomplishments of the PDL students and staff, and it's a pleasure to work with them. As always, their accomplishments point at great things to come.

PARALLEL DATA LABORATORY

FACULTY
Greg Ganger (PDL Director), 412•268•1297, [email protected]
George Amvrosiadis, David Andersen, Lujo Bauer, Nathan Beckmann, Daniel Berger, Chuck Cranor, Lorrie Cranor, Christos Faloutsos, Kayvon Fatahalian, Rajeev Gandhi, Saugata Ghose, Phil Gibbons, Garth Gibson, Seth Copen Goldstein, Mor Harchol-Balter, Gauri Joshi, Todd Mowry, Onur Mutlu, Priya Narasimhan, David O'Hallaron, Andy Pavlo, Majd Sakr, M. Satyanarayanan, Srinivasan Seshan, Rashmi Vinayak, Hui Zhang

STAFF MEMBERS
Bill Courtright (PDL Executive Director), 412•268•5485, [email protected]
Karen Lindenfelser (PDL Administrative Manager), 412•268•6716, [email protected]
Jason Boles, Joan Digney, Chad Dougherty, Mitch Franzos, Alex Glikson, Charlene Zang

VISITING RESEARCHERS / POST DOCS
Rachata Ausavarungnirun, Hyeontaek Lim, Kazuhiro Saito

GRADUATE STUDENTS
Abutalib Aghayev, Joy Arulraj, Ben Blum, V. Parvathi Bhogaraju, Amirali Boroumand, Sol Boucher, Christopher Canel, Malhar Chaudhari, Dominic Chen, Haoxian Chen, Andrew Chung, Chris Fallin, Pratik Fegade, Ziqiang Feng, Samarth Gupta, Aaron Harlap, Kevin Hsieh, Fan Hu, Abhilasha Jain, Saksham Jain, Angela Jiang, Ellango Jothimurugesan, Saurabh Arun Kadekodi, Anuj Kalia, Rajat Kateja, Jin Kyu Kim, Thomas Kim, Vamshi Konagari, Jack Kosaian, Marcel Kost, Michael Kuchnik, Conglong Li, Kunmin Li, Yang Li, Yixin Luo, Lin Ma, Diptesh Majumdar, Ankur Mallick, Charles McGuffey, Prashanth Menon, Yuqing Miao, Wenqi Mou, Pooja Nilangekar, Yiqun Ouyang, Jun Woo Park, Aurick Qiao, Souptik Sen, Sivaprasad Sudhir, Aaron Tian, Dana Van Aken, Nandita Vijaykumar, Haoran Wang, Jianyu Wang, Justin Wang, Ziqi Wang, Jinliang Wei, Daniel Wong, Lin Xiao, Hao Zhang, Huanchen Zhang, Qing Zheng, Giulio Zhou

The CMU fence displays a farewell message to Garth.

YEAR IN REVIEW

May 2018
• 20th annual Spring Visit Day.
• Qing Zheng and Michael Kuchnik will be interning with LANL this summer.

April 2018
• Andy Pavlo received the 2018 Joel & Ruth Spira Teaching Award.
• Lorrie Cranor received the IAPP Leadership Award.
• Srinivasan Seshan was appointed Head of the Computer Science Dept. at CMU.
• Michael Kuchnik received an NDSEG Fellowship for his work on machine learning in HPC systems.
• Huanchen Zhang proposed his PhD research "Towards Space-Efficient High-Performance In-Memory Search Structures."
• Anuj Kalia proposed his thesis research "Efficient Networked Systems for Datacenter Fabrics with RPCs."
• Nathan Beckmann presented "LHD: Improving Cache Hit Rate by Maximizing Hit Density" at NSDI '18 in Renton, WA.
• Rajat Kateja presented "Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM" at NVMW '18 in San Diego, CA.
• Rachata Ausavarungnirun presented "MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency" at ASPLOS '18 in Williamsburg, VA.
• Jun Woo Park presented "3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty" at EuroSys '18 in Porto, Portugal.

March 2018
• Andy Pavlo won a Google Faculty Research Award for his research on automatic database management systems.

February 2018
• Andy Pavlo was awarded a Sloan Fellowship to continue his work on the study of database management systems, specifically main memory systems, non-relational systems (NoSQL), transaction processing systems (NewSQL) and large-scale data analytics.
• Yixin Luo successfully defended his PhD dissertation on "Architectural Techniques for Improving NAND Flash Memory Reliability."
• Six posters were presented at the 1st SysML Conference at Stanford U. on various work related to creating more efficient systems for machine learning.
• Charles McGuffey delivered his speaking skills talk on "Designing Algorithms to Tolerate Processor Faults."
• Qing Zheng gave his speaking skills talk "Light-Weight In-Situ Indexing for Scientific Workloads."

December 2017
• Mor Harchol-Balter and Onur Mutlu were made Fellows of the ACM. Mor was selected "for contributions to performance modeling and analysis of distributed computing systems." Onur, who is now at ETH Zurich, was chosen for "contributions to computer architecture research, especially in memory systems."
• Joy Arulraj proposed his PhD research "The Design & Implementation of a Non-Volatile Memory Database Management System."
• Dana Van Aken gave her speaking skills talk on "Automatic Database Management System Tuning Through Large-scale Machine Learning."

November 2017
• Qing Zheng presented "Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory" at PDSW-DISCS '17 in Denver, CO.

October 2017
• Lorrie Cranor was awarded the FORE Systems Chair of Computer Science.
• Lorrie Cranor won the top SIGCHI Award, given to individuals who promote the application of human-computer interaction research to pressing social needs.
• Qing Zheng gave his speaking skills talk on "Light-weight In-situ Analysis with Frugal Resource Usage."
• Rachata Ausavarungnirun presented "Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes" and Vivek Seshadri presented "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology" at MICRO '17 in Cambridge, MA.
• Timothy Zhu presented "WorkloadCompactor: Reducing Datacenter Cost while Providing Tail Latency SLO Guarantees" at SoCC '17 in Santa Clara, CA.
• 25th annual PDL Retreat.

September 2017
• Garth Gibson to lead new Vector Institute for AI in Toronto.
• Hongyi Xin delivered his speaking skills talk on "Improving DNA Read Mapping with Error-resilient Seeds."

Greg Ganger and PDL alums Hugo Patterson (Datrium) and Jiri Schindler (HPE) enjoy social time at the PDL Retreat.

continued on page 32

RECENT PUBLICATIONS

Geriatrix: Aging What You See and What You Don't See. A File System Aging Approach for Modern Storage Systems

Saurabh Kadekodi, Vaishnavh Nagarajan, Gregory R. Ganger & Garth A. Gibson

2018 USENIX Annual Technical Conference (ATC '18). July 11–13, 2018, Boston, MA.

File system performance on modern primary storage devices (Flash-based SSDs) is greatly affected by aging of the free space, much more so than were mechanical disk drives. We introduce Geriatrix, a simple-to-use profile-driven file system aging tool that induces target levels of fragmentation in both allocated files (what you see) and remaining free space (what you don't see), unlike previous approaches that focus on just the former. This paper describes and evaluates the effectiveness of Geriatrix, showing that it recreates both fragmentation effects better than previous approaches. Using Geriatrix, we show that measurements presented in many recent file system papers are higher than should be expected, by up to 30% on mechanical (HDD) and up to 75% on Flash (SSD) disks. Worse, in some cases, the performance rank ordering of the file system designs being compared is different from the published results. Geriatrix will be released as open source software with eight built-in aging profiles, in the hopes that it can address the need created by the increased performance impact of file system aging in modern SSD-based storage.

Figure: Aging impact on Ext4 atop SSD and HDD. The three bars for each device represent the FS freshly formatted (unaged), aged with Geriatrix, and aged with Impressions. Although relatively small differences are seen with the HDD, aging has a big impact on FS performance on the SSD. Although their file fragmentation levels are similar, the higher free space fragmentation produced by Geriatrix induces larger throughput reductions than for Impressions.
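The aging approach is easy to picture in code. The toy Python loop below is our own sketch of profile-driven aging, not the Geriatrix implementation; the mount point and file size distribution are assumed inputs:

```python
import os
import random

def age_filesystem(mount: str, target_gb: float, size_dist, seed: int = 42):
    """Toy profile-driven aging: create files with sizes drawn from a target
    distribution and randomly delete older ones, fragmenting both the files
    that remain (what you see) and the free space (what you don't see)."""
    rng = random.Random(seed)
    live = []                      # files currently kept on the file system
    written = 0
    i = 0
    while written < target_gb * 2**30:
        size = rng.choice(size_dist)                  # bytes for this file
        path = os.path.join(mount, f"age_{i}.dat")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        live.append(path)
        written += size
        i += 1
        if live and rng.random() < 0.5:               # delete ~half of creates,
            os.remove(live.pop(rng.randrange(len(live))))  # punching random holes

# e.g., age a scratch mount with a bimodal small/large file profile:
# age_filesystem("/mnt/scratch", target_gb=10, size_dist=[4096, 1 << 20])
```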
A Case for Packing and Indexing in Cloud File Systems

Saurabh Kadekodi, Bin Fan, Adit Madan, Garth A. Gibson & Gregory R. Ganger

10th USENIX Workshop on Hot Topics in Cloud Computing. July 9, 2018, Boston, MA.

Tiny objects are the bane of highly scalable cloud object stores. Not only do tiny objects cause massive slowdowns, but they also incur tremendously high costs due to current operation-based pricing models. For example, in Amazon S3's current pricing scheme, uploading 1GB of data by issuing tiny (4KB) PUT requests (at 0.0005 cents each) is approximately 57x more expensive than storing that same 1GB for a month. To address this problem, we propose client-side packing of files into gigabyte-sized blobs with embedded indices to identify each file's location. Experiments with a packing implementation in Alluxio (an open-source distributed file system) illustrate the potential benefits, such as simultaneously increasing file creation throughput by up to 61x and decreasing cost by over 99.99%.
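The packing idea itself is straightforward to prototype. The following minimal Python sketch (ours, not the paper's Alluxio implementation) packs small files into a single blob with an embedded index, turning thousands of tiny PUTs into one large upload:

```python
import json
import struct

class BlobPacker:
    """Pack small files into one blob.
    Layout: [file bytes ...][JSON index][8-byte index length]."""
    def __init__(self):
        self.buf = bytearray()
        self.index = {}                        # file name -> (offset, length)

    def add(self, name: str, data: bytes):
        self.index[name] = (len(self.buf), len(data))
        self.buf += data

    def seal(self) -> bytes:
        idx = json.dumps(self.index).encode()
        return bytes(self.buf) + idx + struct.pack("<Q", len(idx))

def read_file(blob: bytes, name: str) -> bytes:
    """Recover one small file from a sealed blob using the embedded index."""
    (idx_len,) = struct.unpack("<Q", blob[-8:])
    index = json.loads(blob[-8 - idx_len:-8].decode())
    offset, length = index[name]
    return blob[offset:offset + length]

# packer = BlobPacker()
# for i in range(250_000):          # ~1GB of 4KB files becomes one PUT
#     packer.add(f"tiny/{i}", b"x" * 4096)
# blob = packer.seal()
```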

SOAP: One Clean Analysis of All Age-Based Scheduling Policies

Ziv Scully, Mor Harchol-Balter & Alan Scheller-Wolf

Proceedings of ACM SIGMETRICS 2018 Conference on Measurement and Modeling of Computer Systems, Los Angeles, CA, June 2018.

We consider an extremely broad class of M/G/1 scheduling policies called SOAP: Schedule Ordered by Age-based Priority. The SOAP policies include almost all scheduling policies in the literature as well as an infinite number of variants which have never been analyzed, or maybe not even conceived. SOAP policies range from classic policies, like first-come, first-serve (FCFS), foreground-background (FB), class-based priority, and shortest remaining processing time (SRPT), to much more complicated scheduling rules, such as the famously complex Gittins index policy and other policies in which a job's priority changes arbitrarily with its age. While the response time of policies in the former category is well understood, policies in the latter category have resisted response time analysis. We present a universal analysis of all SOAP policies, deriving the mean and Laplace-Stieltjes transform of response time.

Towards Optimality in Parallel Job Scheduling

Ben Berg, Jan-Pieter Dorsman & Mor Harchol-Balter

Proceedings of ACM SIGMETRICS 2018 Conference on Measurement and Modeling of Computer Systems, Los Angeles, CA, June 2018.

To keep pace with Moore's law, chip designers have focused on increasing the number of cores per chip rather than single core performance. In turn, modern jobs are often designed to run on any number of cores. However, to effectively leverage these multi-core chips, one must address the question of how many cores to assign to each job.

Given that jobs receive sublinear speedups from additional cores, there is an obvious tradeoff: allocating more cores to an individual job reduces the job's runtime, but in turn decreases the efficiency of the overall system. We ask how the system should schedule jobs across cores so as to minimize the mean response time over a stream of incoming jobs.

To answer this question, we develop an analytical model of jobs running on a multi-core machine. We prove that EQUI, a policy which continuously divides cores evenly across jobs, is optimal when all jobs follow a single speedup curve and have exponentially distributed sizes. EQUI requires jobs to change their level of parallelization while they run. Since this is not possible for all workloads, we consider a class of "fixed-width" policies, which choose a single level of parallelization, k, to use for all jobs. We prove that, surprisingly, it is possible to achieve EQUI's performance without requiring jobs to change their levels of parallelization by using the optimal fixed level of parallelization, k*. We also show how to analytically derive the optimal k* as a function of the system load, the speedup curve, and the job size distribution.

In the case where jobs may follow different speedup curves, finding a good scheduling policy is even more challenging. In particular, we find that policies like EQUI which performed well in the case of a single speedup curve now perform poorly. We propose a very simple policy, GREEDY*, which performs near-optimally when compared to the numerically-derived optimal policy.

3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty

Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch & Gregory R. Ganger

EuroSys '18, April 23–26, 2018, Porto, Portugal. Supersedes CMU-PDL-17-107, Nov. 2017.

The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. But existing schedulers use single-point estimates (e.g., mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles with 8–23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime histories and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma greatly outperforms a state-of-the-art scheduler that uses point estimates from a state-of-the-art predictor; in fact, the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best effort jobs.

LHD: Improving Cache Hit Rate by Maximizing Hit Density

Nathan Beckmann, Haoxian Chen & Asaf Cidon

15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). April 9–11, 2018, Renton, WA.

Cloud application performance is heavily reliant on the hit rate of datacenter key-value caches. Key-value caches typically use least recently used (LRU) as their eviction policy, but LRU's hit rate is far from optimal under real workloads. Prior research has proposed many eviction policies that improve on LRU, but these policies make restrictive assumptions that hurt their hit rate, and they can be difficult to implement efficiently.

We introduce least hit density (LHD), a novel eviction policy for key-value caches. LHD predicts each object's expected hits-per-space-consumed (hit density), filtering objects that contribute little to the cache's hit rate. Unlike prior eviction policies, LHD does not rely on heuristics, but rather rigorously models objects' behavior using conditional probability to adapt its behavior in real time.

To make LHD practical, we design and implement RankCache, an efficient key-value cache based on memcached. We evaluate RankCache and LHD on commercial memcached and enterprise storage traces, where LHD consistently achieves better hit rates than prior policies. LHD requires much less space than prior policies to match their hit rate, on average 8x less than LRU and 2–3x less than recently proposed policies. Moreover, RankCache requires no synchronization in the common case, improving request throughput at 16 threads by 8x over LRU and by 2x over CLOCK.

Figure: Relative cache size needed to match LHD's hit rate on different traces. LHD requires roughly one-fourth of LRU's capacity, and roughly half of that of prior eviction policies.
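The hit density metric can be illustrated compactly. The sketch below is our paraphrase of the idea, not RankCache code: estimate, from age-binned hit and eviction counts, an object's expected hits divided by its expected remaining space-time, and evict the minimum.

```python
def hit_density(size, age, hits_by_age, evictions_by_age):
    """Expected future hits per byte*time for an object of this size and age,
    estimated from age-binned hit/eviction counts (conditional on age)."""
    n = len(hits_by_age)
    future_hits = sum(hits_by_age[age:])
    future_events = future_hits + sum(evictions_by_age[age:])
    if future_events == 0:
        return 0.0
    p_hit = future_hits / future_events        # P(object eventually hits | age)
    # Expected remaining lifetime, given the object already reached `age`.
    lifetime = sum((a - age) * (hits_by_age[a] + evictions_by_age[a])
                   for a in range(age, n)) / future_events
    return p_hit / (size * max(lifetime, 1.0))

def choose_victim(cache, hits_by_age, evictions_by_age):
    """cache: {key: (size, age)}; evict the key with the lowest hit density."""
    return min(cache, key=lambda k: hit_density(*cache[k],
                                                hits_by_age, evictions_by_age))
```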

Tributary: Spot-dancing for Elastic Services with Latency SLOs

Aaron Harlap, Andrew Chung, Alexey Tumanov, Gregory R. Ganger & Phillip B. Gibbons

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-102, Jan. 2018.

The Tributary elastic control system embraces the uncertain nature of transient cloud resources, such as AWS spot instances, to manage elastic services with latency SLOs more robustly and more cost-effectively. Such resources are available at lower cost, but with the proviso that they can be preempted en masse, making them risky to rely upon for business-critical services. Tributary creates models of preemption likelihood and exploits the partial independence among different resource offerings, selecting collections of resource allocations that will satisfy SLO requirements and adjusting them over time as client workloads change. Although Tributary's collections are often larger than required in the absence of preemptions, they are cheaper because of both lower spot costs and partial refunds for preempted resources. At the same time, the often-larger sets allow unexpected workload bursts to be handled without SLO violation. Over a range of web service workloads, we find that Tributary reduces cost for achieving a given SLO by 81–86% compared to traditional scaling on non-preemptible resources and by 47–62% compared to the high-risk approach of the same scaling with spot resources.

Figure: The Tributary Architecture.

MLtuner: System Support for Automatic Machine Learning Tuning

Henggang Cui, Gregory R. Ganger & Phillip B. Gibbons

arXiv:1803.07445v1 [cs.LG], 20 Mar. 2018.

MLtuner automatically tunes settings for training tunables — such as the learning rate, the momentum, the mini-batch size, and the data staleness bound — that have a significant impact on large-scale machine learning (ML) performance. Traditionally, these tunables are set manually, which is unsurprisingly error prone and difficult to do without extensive domain knowledge. MLtuner uses efficient snapshotting, branching, and optimization-guided online trial-and-error to find good initial settings as well as to re-tune settings during execution. Experiments show that MLtuner can robustly find and re-tune tunable settings for a variety of ML applications, including image classification (for 3 models and 2 datasets), video classification, and matrix factorization. Compared to state-of-the-art ML auto-tuning approaches, MLtuner is more robust for large problems and over an order of magnitude faster.
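The snapshot-and-branch loop at MLtuner's core can be sketched as follows (a toy rendition; `train_step` and `eval_loss` are assumed application callbacks, and the real system snapshots far more cheaply than a deep copy):

```python
import copy

def tune_once(model, candidates, train_step, eval_loss, trial_steps=100):
    """Trial-and-error tuning sketch: branch a short training run from a
    snapshot for each candidate setting, keep whichever branch makes the
    most progress, and resume training from that branch's state."""
    best_loss, best_setting, best_model = float("inf"), None, None
    for setting in candidates:               # e.g. learning rates to try
        branch = copy.deepcopy(model)        # "snapshot + branch"
        for _ in range(trial_steps):
            train_step(branch, setting)      # one training step with setting
        loss = eval_loss(branch)
        if loss < best_loss:
            best_loss, best_setting, best_model = loss, setting, branch
    return best_setting, best_model          # continue training from here
```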
Addressing the Long-Lineage Bottleneck in Apache Spark

Haoran Wang, Jinliang Wei & Garth Gibson

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-101, January 2018.

Apache Spark employs lazy evaluation [11, 6]; that is, in Spark, a dataset is represented as a Resilient Distributed Dataset (RDD), and a single-threaded application (driver) program simply describes transformations (RDD to RDD), referred to as lineage [7, 12], without performing distributed computation until output is requested. The lineage traces computation and dependency back to external (and assumed durable) data sources, allowing Spark to opportunistically cache intermediate RDDs, because it can recompute everything from external data sources. To initiate computation on worker machines, the driver process constructs a directed acyclic graph (DAG) representing computation and dependency according to the requested RDD's lineage. Then the driver broadcasts this DAG to all involved workers, requesting they execute their portion of the result RDD. When a requested RDD has a long lineage, as one would expect from iterative convergent or streaming applications [9, 15], constructing and broadcasting computational dependencies can become a significant bottleneck. For example, when solving matrix factorization using Gemulla's iterative convergent

continued on page 20
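The lineage problem is easy to reproduce with stock PySpark, as sketched below: each loop iteration appends a stage to the DAG, and periodic checkpointing (Spark's standard mitigation, not necessarily the report's proposed fix) truncates the lineage. The checkpoint directory and iteration counts are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="long-lineage-demo")
sc.setCheckpointDir("/tmp/spark-checkpoints")   # assumed scratch directory

rdd = sc.parallelize(range(1_000_000))
for i in range(200):                  # iterative-convergent style loop
    rdd = rdd.map(lambda x: x + 1)    # each iteration lengthens the lineage
    if i % 50 == 49:
        rdd.cache()                   # keep the data after materializing
        rdd.checkpoint()              # truncate lineage at the next action...
        rdd.count()                   # ...which this action forces immediately

print(rdd.count())                    # driver now broadcasts a short DAG
```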

AWARDS & OTHER PDL NEWS

April 2018
Andy Pavlo Receives 2018 Joel & Ruth Spira Teaching Award

The School of Computer Science honored outstanding faculty and staff members April 5 during the annual Founder's Day ceremony in Rashid Auditorium. It was the seventh year for the event, which was hosted by Dean Andrew Moore. Andy Pavlo, Assistant Professor in the Computer Science Department (CSD), was the winner of the Joel and Ruth Spira Teaching Award, sponsored by Lutron Electronics Co. of Coopersburg, Pa., in honor of the company's founders and the inventor of the electronic dimmer switch.
--CMU Piper, April 5, 2018

April 2018
Lorrie Cranor Receives IAPP Leadership Award

Lorrie Cranor has received the 2018 Leadership Award from The International Association of Privacy Professionals (IAPP). Cranor, a professor in the Institute for Software Research and the Department of Engineering and Public Policy, accepted the award at the IAPP's Global Privacy Summit on March 27. "Lorrie Cranor, for 20 years, has been a leading voice and a leader in the privacy field," said IAPP President and CEO Trevor Hughes. "She developed some of the earliest privacy enhancing technologies, she developed a groundbreaking program at Carnegie Mellon University to create future generations of privacy engineers, and she has been a steadfast supporter, participant and leader of the field of privacy for that entire time. Her merits as recipient for our privacy leadership award are unimpeachable. She's as great a person as we have in our world." The IAPP Leadership Award is given annually to individuals who demonstrate an "ongoing commitment to furthering privacy policy, promoting recognition of privacy issues and advancing the growth and visibility of the privacy profession." Cranor helped develop and is now co-director of CMU's MSIT-Privacy Engineering master's degree program, as well as director of the CyLab Usable Privacy and Security Laboratory.
--CMU Piper, April 5, 2018

April 2018
Welcome Baby Nora!

Pete and Laura Losi, and Grandma Karen Lindenfelser, are thrilled to announce that Nora Grace joined big sister Layla Anne and big cousin Landon Thomas to become a family of four (five if you count Rudy, the grand-dog). Nora was born Friday the 13th at 11:50 am, at 7 lbs and 19.5 inches.

April 2018
Srinivasan Seshan Appointed Head of CSD

Srinivasan Seshan has been appointed head of the Computer Science Department (CSD), effective July 1. He succeeds Frank Pfenning, who will return to full-time teaching and research. "We are all excited about Srini Seshan's new role as head of CSD," said School of Computer Science Dean Andrew Moore. "He is an outstanding researcher and teacher, and I'm confident that his expanded role in leadership will help the department reach even greater heights." Seshan joined the CSD faculty in 2000, and served as the department's associate head for graduate education from 2011 to 2015. His research focuses on improving the design, performance and security of computer networks, including wireless and mobile networks. He earned his bachelor's, master's and doctoral degrees in computer science at the University of California, Berkeley. He worked as a research staff member at IBM's T.J. Watson Research Center for five years before joining Carnegie Mellon.
--CMU Piper, April 5, 2018

March 2018
Andy Pavlo Wins Google Faculty Research Award

The CMU Database Group and the PDL are pleased to announce that Prof. Andy Pavlo has won a 2018 Google Faculty Research Award. This award was for his research on automatic database management systems. Andy was one of a total of 14 faculty members at Carnegie Mellon University selected for this award. The Google Faculty Research Awards program is an annual open call for proposals on computer science and related topics such as machine learning, machine perception, natural language processing, and quantum computing. Grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google researchers and engineers.

This round received 1033 proposals covering 46 countries and over 360 universities, from which 152 were chosen to fund. The subject areas that received the most support this year were human computer interaction, machine learning, machine perception, and systems.
-- Google and CMU Database Group News, March 20, 2018

February 2018
Lorrie Cranor Wins Top SIGCHI Award

Lorrie Cranor, a professor in the Institute for Software Research and the Department of Engineering and Public Policy, is this year's recipient of the Social Impact Award from the Association for Computing Machinery Special Interest Group on Computer Human Interaction (SIGCHI). The Social Impact Award is given to mid-level or senior individuals who promote the application of human-computer interaction research to pressing social needs, and includes an honorarium of $5,000, the opportunity to give a talk about the awarded work at the CHI conference, and lifetime invitations to the annual SIGCHI award banquet.

"Lorrie's work has had a huge impact on the ability of non-technical users to protect their security and privacy through her user-centered approach to security and privacy research and development of numerous tools and technologies," said Blase Ur, who prepared Lorrie's nomination. Ur is a former Ph.D. student of Lorrie's, and is now an assistant professor at the University of Chicago. In addition to Ur, three former students from Cranor's CyLab Usable Privacy and Security Lab – Michelle Mazurek, Florian Schaub and Yang Wang – supported Lorrie's nomination. "All four of us are currently assistant professors, spread out across the United States," said Ur, who received his doctorate degree in 2016. "In addition to this impact on end users, the four of us who jointly nominated her have also benefitted greatly from her mentorship."

A full summary of this year's SIGCHI award recipients can be found on the organization's website.
-- info from CyLab News, Daniel Tkacik, Feb. 23, 2018

February 2018
Andy Pavlo Awarded a Sloan Fellowship

"The Sloan Research Fellows represent the very best science has to offer," said Sloan President Adam Falk. "The brightest minds, tackling the hardest problems, and succeeding brilliantly — fellows are quite literally the future of 21st century science." Andrew Pavlo, an assistant professor of computer science, specializes in the study of database management systems, specifically main memory systems, non-relational systems (NoSQL), transaction processing systems (NewSQL) and large-scale data analytics. He is a member of the Database Group and the Parallel Data Laboratory. He joined the Computer Science Department in 2013 after earning a Ph.D. in computer science at Brown University. He won the 2014 Jim Gray Doctoral Dissertation Award from the Association for Computing Machinery's (ACM) Special Interest Group on the Management of Data.
-- Carnegie Mellon University News, Feb. 15, 2018

December 2017
Mor Harchol-Balter and Onur Mutlu Fellows of the ACM

Congratulations to Mor (Professor of CS) and Onur (adjunct Professor of ECE), who have been made Fellows of the ACM. From the ACM website: "To be selected as a Fellow is to join our most renowned member grade and an elite group that represents less than 1 percent of ACM's overall membership," explains ACM President Vicki L. Hanson. "The Fellows program allows us to shine a light on landmark contributions to computing, as well as the men and women whose hard work, dedication, and inspiration are responsible for groundbreaking work that improves our lives in so many ways." Mor was selected "for contributions to performance modeling and analysis of distributed computing systems." Onur, who is now at ETH Zurich, was chosen for "contributions to computer architecture research, especially in memory systems."
--with info from www.acm.org

December 2017
Welcome Baby Sebastian!

In not-unexpected news, David, Erica and big sister Aria are delighted to announce the arrival of a squirmy and very snuggly addition to their family. Sebastian Alexander Andersen-Fuchs was born December 11, 2017, at 11:47 am, at 8 lb 8 oz and 21" long. Mom and baby are healthy, and Aria is very excited to be a big sister.

November 2017
Welcome Baby Will!

Kevin Hsieh and his wife would like to share the news of their new baby! Will was born on November 15, 2017 at 11:15 am (not a typo...). He was born at 6 lb 7 oz and 20" long. Since then, he has been growing very well and keeping his family busy.

October 2017
Welcome Baby Jonas!

Jason & Chien-Chiao Boles are excited to announce the arrival of their son Jonas at 7:42 pm, October 18th. Jonas was born a few weeks early — a surprise for us all. Everyone is doing well so far.

October 2017
Lorrie Cranor Awarded FORE Systems Chair of Computer Science

We are very pleased to announce that, in addition to a long list of accomplishments, which has included a term as the Chief Technologist of the Federal Trade Commission, Lorrie Cranor has been made the FORE Systems Professor of Computer Science and Engineering & Public Policy at CMU. Lorrie provided information that "the founders of FORE Systems, Inc. established the FORE Systems Professorship in 1995 to support a faculty member in the School of Computer Science. The company's name is an acronym formed by the initials of the founders' first names. Before it was acquired by Great Britain's Marconi in 1998, FORE created technology that allows computer networks to link and transfer information at a rapid speed. Ericsson purchased much of Marconi in 2006." The chair was previously held by CMU University Professor Emeritus Edmund M. Clarke.

September 2017
Garth Gibson to Lead New Vector Institute for AI in Toronto

In January of 2018, PDL's founder, Garth Gibson, became President and CEO of the Vector Institute for AI in Toronto. Vector's website states that "Vector will be a leader in the transformative field of artificial intelligence, excelling in machine and deep learning — an area of scientific, academic, and commercial endeavour that will shape our world over the next generation." Frank Pfenning, Head of the Department of Computer Science, notes that "this is a tremendous opportunity for Garth, but we will sorely miss him in the multiple roles he plays in the department and school: Professor (and all that this entails), Co-Director of the MCDS program, and Associate Dean for Masters Programs in SCS." We are sad to see him go and will miss him greatly, but the opportunities presented here for world-level innovation are tremendous and we wish him all the best.

June 2017
Satya Honored for Creation of Andrew File System

The Association for Computing Machinery has named the developers of CMU's pioneering Andrew File System (AFS) the recipients of its prestigious 2016 Software System Award. AFS was the first distributed file system designed for tens of thousands of machines, and pioneered the use of scalable, secure and ubiquitous access to shared file data. To achieve the goal of providing a common shared file system used by large networks of people, AFS introduced novel approaches to caching, security, management and administration. The award recipients, including CS Professor Mahadev Satyanarayanan, built the Andrew File System in the 1980s while working as a team at the Information Technology Center (ITC) — a partnership between Carnegie Mellon and IBM.

The ACM Software System Award is presented to an institution or individuals recognized for developing a software system that has had a lasting influence, reflected in contributions to concepts, in commercial acceptance, or both. AFS is still in use today as both an open-source system and as a file system in commercial applications. It has also inspired several cloud-based storage applications. Many universities integrated AFS before it was introduced as a commercial application.
-- Byron Spice, The Piper, June 1, 2017

MASSIVE INDEXED DIRECTORIES IN DELTAFS
continued from page 1

mode of VPIC. For each timestep, 40 bytes of data is produced per particle, representing the particle's spatial location, velocity, energy, etc. We refer to the entire particle data written at the same timestep as a frame, because frame data is often used by domain scientists to construct false-color movies of the simulation state over time. Large-scale VPIC simulations have been conducted with up to trillions of particles, generating terabytes of data for each frame.

Domain scientists are often interested in a tiny subset of particles with specific characteristics, such as high energy, that is not known until the simulation ends. All data for each such particle is gathered for further analysis, such as visualizing its trajectory through space over time. Unfortunately, particle data within a frame is written out of order, since output order depends on the particles' spatial location. Therefore, in order to locate individual particles' data over time, all output data must be sorted before they can be analyzed.

For scientists working with VPIC, it would be significantly easier programmatically to create a separate file for each particle, and append a 40-byte data record on each timestep. This would reduce analysis queries to sequentially reading the contents of a tiny number of particle files. Attempting to do this in today's parallel file systems, however, would be disastrous for performance. Expecting existing HPC storage stacks and file systems to adapt to scientific needs such as this one, however, is lunacy. Parallel file systems are designed to be long-running, robust services that work across applications. They are typically kernel resident, mainly developed to manage the hardware, and primarily optimized for large sequential data access.
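To make the contrast concrete, the sketch below shows what file-per-particle output could look like from the application's point of view. The `deltafs` handle and its calls are hypothetical stand-ins rather than the real DeltaFS API; in DeltaFS, such tiny appends are buffered, partitioned, and indexed into large log objects behind the scenes.

```python
import struct

def write_frame(deltafs, dirh, particles, timestep):
    """Append one small record per particle into its own subfile."""
    for p in particles:
        # 36-byte record (the article's records are 40 bytes): timestep
        # plus position, velocity, and energy.
        rec = struct.pack("<q7f", timestep, p.x, p.y, p.z,
                          p.vx, p.vy, p.vz, p.energy)
        with deltafs.open(dirh, f"particle_{p.id}", "a") as f:
            f.write(rec)

def read_trajectory(deltafs, dirh, particle_id):
    """A trajectory query is now a sequential read of one tiny subfile."""
    with deltafs.open(dirh, f"particle_{particle_id}", "r") as f:
        return f.read()
```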
DeltaFS aims to provide this file-per-particle representation to applications, while ensuring that storage hardware is utilized to its full performance potential. A comparison of the file-per-process (current state-of-the-art) and file-per-particle (DeltaFS) representations is shown in Figure 1.

To improve the performance of applications with small I/O access patterns similar to VPIC, we propose an Indexed Massive Directory: a new technique for indexing data in-situ as it is written to storage. In-situ indexing allows massive amounts of data to be written to a single directory simultaneously, in an arbitrarily large number of files, with the goal of efficiently recalling data written to the same file without requiring any time-consuming data post-processing steps to reorganize it. This greatly improves the readback performance of applications, at the price of small overheads associated with partitioning and indexing the data during writing. We achieve this through a memory-efficient indexing mechanism for reordering and indexing data, and a log-structured storage layout that packs small writes into large log objects, all while ensuring compute node resources are used frugally.

We evaluated the efficiency of the Indexed Massive Directory on LANL's Trinity hardware (Figure 2). By applying in-situ partial sorting of VPIC's particle output, we demonstrated over 5000x speedup in reading a single particle's trajectory from a 48-billion particle simulation output using only a single CPU core, compared to post-processing the entire dataset (10 TiB) using the same amount of CPU cores as the original simulation. This speedup increases with simulation scale, while the total memory used for partial sorting is fixed at 3% of the memory available to the simulation code. The cost of this read acceleration is the increased work in the in-situ pipeline and the additional storage capacity dedicated to storing the indexes. These results are encouraging, as they indicate that the output write buffering stage of the software-defined storage stack can be leveraged for one or more forms of efficient in-situ analysis, and can be applied to more kinds of query workloads.

Figure 2: Results from real VPIC simulation runs with and without DeltaFS on LANL's Trinity computer: (a) query time, (b) output size, (c) frame write time.

For more information, please see [3] or visit our project page at www.pdl.cmu.edu/DeltaFS/

References

[1] Zheng, Q., Ren, K., Gibson, G., Settlemyer, B. W., and Grider, G. DeltaFS: Exascale File Systems Scale Better without Dedicated Servers. In Proceedings of the 10th Parallel Data Storage Workshop (PDSW 15), pp. 1–6.

[2] Byna, S., Sisneros, R., Chadalavada, K., and Koziol, Q. Tuning Parallel I/O on Blue Waters for Writing 10 Trillion Particles. In Cray User Group (CUG) (2015).

[3] Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth Gibson, Chuck Cranor, Brad Settlemyer, Gary Grider, Fan Guo. Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory. PDSW-DISCS 2017, Denver, CO, November 2017.

3SIGMA

3Sigma: Distribution-Based Cluster Scheduling for Runtime Uncertainty
Jun Woo Park, Greg Ganger and the PDL 3Sigma Group

Modern cluster schedulers face a daunting task. Modern clusters support a diverse mix of activities, including exploratory analytics, software development and test, scheduled content generation, and customer-facing services [2]. Pending work is typically mapped to heterogeneous resources to satisfy deadlines for business-critical jobs, minimize delays for interactive best-effort jobs, maximize efficiency, and so on. Cluster schedulers are expected to make that happen.

Knowledge of the runtimes of these pending jobs has been identified as a powerful building block for modern cluster schedulers. With it, a scheduler can pack jobs more aggressively in a cluster's resource assignment plan, for instance by allowing a latency-sensitive best-effort job to run before a high-priority batch job, provided that the priority job will still meet its deadline. Runtime knowledge also allows a scheduler to determine whether it is better to start a job immediately on suboptimal machine types with worse expected performance, wait for the jobs currently occupying the preferred machines to finish, or preempt them. Exploiting job runtime knowledge leads to better, more robust scheduler decisions than relying on hard-coded assumptions.

In most cases, the job runtime estimates are based on previous runtimes observed for similar jobs (e.g., from the same user or by the same periodic job script). When such estimates are accurate, the schedulers relying on them outperform those using other approaches.

However, we find that estimate errors, while expected in large, multi-use clusters, cover an unexpectedly large range. Applying a state-of-the-art ML-based predictor [1] to three real-world traces, including the well-studied Google cluster trace [2] and new traces from data analysis clusters used at a hedge fund and a scientific site, shows good estimates in general (e.g., 77–92% within a factor of two of the actual runtime, and most much closer). Unfortunately, 8–23% are not within that range, and some are off by an order of magnitude or more. Thus, a significant percentage of runtime estimates will be well outside the error ranges previously reported. Worse, we find that schedulers relying on runtime estimates cope poorly with such error profiles. Comparing the middle two bars of Fig. 1 shows one example of how much worse a state-of-the-art scheduler does with real estimate error profiles as compared to having perfect estimates.

Figure 1: Comparison of 3Sigma with three other scheduling approaches w.r.t. SLO (deadline) miss rate, for a mix of SLO and best effort jobs derived from the Google cluster trace [2] on a 256-node cluster. 3Sigma, despite estimating runtime distributions online with imperfect knowledge of job classification, approaches the performance of a hypothetical scheduler using perfect runtime estimates (PointPerfEst). Full historical runtime distributions and mis-estimation handling help 3Sigma outperform PointRealEst, a state-of-the-art point-estimate-based scheduler. The value of exploiting runtime information, when done well, is confirmed by comparison to a conventional priority-based approach (Prio).

Our 3Sigma cluster scheduling system uses all of the relevant runtime history for each job rather than just a point estimate derived from it: it uses expected runtime distributions (e.g., the histogram of observed runtimes), taking advantage of the much richer information (e.g., variance, possible multi-modal behaviors, etc.) to make more robust decisions. The first bar of Fig. 1 illustrates 3Sigma's efficacy. By considering the range of possible runtimes for a job, and their likelihoods, 3Sigma can explicitly consider the various potential outcomes from each possible plan and select a plan based on optimizing the expected outcome. For example, the predicted distribution for one job might have low variance, indicating that the scheduler can be aggressive in packing it in, whereas another job's high variance might suggest that it should be scheduled early (relative to its deadline). 3Sigma similarly exploits the runtime distribution to adaptively address the problem of point over-estimates, which may cause a scheduler to avoid scheduling a job based on the likelihood of missing its deadline.

In application, 3Sigma replaces the job scheduling component of a cluster manager (e.g., YARN). The cluster manager remains responsible for job and resource life-cycle management. Job requests are received asynchronously by 3Sigma from the cluster manager (Step 1 of Fig. 2). As is typical for such systems, the specification of the request includes a number of attributes, such as (1) the name of the job to be run, (2) the type of job to be run (e.g., MapReduce), (3) the user submitting the job, and (4) a specification of the resources requested.
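The distribution-based valuation at the heart of this approach can be sketched in a few lines of Python (a toy version with assumed utility values, not 3Sigma's actual planner): score each candidate plan by the probability-weighted utility over the job's predicted runtime histogram.

```python
def expected_utility(runtime_hist, start, deadline, value, penalty=0.0):
    """runtime_hist: {runtime_seconds: count} from similar historical jobs.
    Weight each possible completion time by its empirical probability."""
    total = sum(runtime_hist.values())
    return sum((count / total) * (value if start + rt <= deadline else penalty)
               for rt, count in runtime_hist.items())

def pick_plan(runtime_hist, candidate_starts, deadline, value):
    """Choose the candidate start time with the best expected outcome."""
    return max(candidate_starts,
               key=lambda s: expected_utility(runtime_hist, s, deadline, value))

# A low-variance job can be packed tightly; a high-variance one gets slack:
# pick_plan({3600: 8, 4000: 2}, [0, 1800, 3600], deadline=7200, value=1.0)
```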

Figure 2: End-to-end system integration. (Diagram: job submissions (1) flow from the Cluster Manager to 3σPredict, which uses a feature history and expert selector to attach a runtime distribution (2); 3σSched's Scheduling Option Generator, Optimization Compiler, and Optimization Solver produce job placements (3); measured runtimes (4) feed back into the history.)

The role of the predictor component 3σPredict is to provide the core scheduler with a probability distribution of the execution time of the submitted job. 3σPredict does this by maintaining a history of previously executed jobs, identifying a set of jobs that, based on their attributes, are similar to the current job, and deriving the runtime distribution from the selected jobs' historical runtimes (Step 2 of Fig. 2). Given a distribution of expected job runtimes and request specifications, the core scheduler, 3σSched, decides which jobs to place on which resources and when. The scheduler evaluates the expected utility of each option and the expected resource consumption and availability over the scheduling horizon. Valuations and computed resource capacity are then compiled into an optimization problem, which is solved by an external solver. 3σSched translates the solution into an updated schedule and submits the schedule to the cluster manager (Step 3 of Fig. 2). On completion, the job's actual runtime is recorded by 3σPredict (along with the attribute information from the job) and incorporated into the job history for future predictions (Step 4 of Fig. 2).
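A minimal sketch of that history-to-distribution step might look as follows (our illustration; the attribute names and similarity rule are assumptions rather than 3σPredict's actual feature set):

```python
from collections import defaultdict

class RuntimePredictor:
    """Bucket completed jobs by attribute combinations, most specific first,
    and answer with the runtime histogram of the first bucket with history."""
    def __init__(self, min_samples=5):
        self.history = defaultdict(list)        # feature tuple -> runtimes
        self.min_samples = min_samples

    @staticmethod
    def _features(attrs):
        # Ordered most-specific to least-specific.
        return [(attrs["user"], attrs["name"]),
                (attrs["user"], attrs["type"]),
                (attrs["user"],),
                (attrs["type"],)]

    def record(self, attrs, runtime):           # Step 4: fold results back in
        for feat in self._features(attrs):
            self.history[feat].append(runtime)

    def distribution(self, attrs):              # Step 2: distribution, not point
        for feat in self._features(attrs):
            runtimes = self.history[feat]
            if len(runtimes) >= self.min_samples:
                return runtimes                 # empirical distribution
        return None                             # fall back to a generic prior
```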

o

t

c

e

l Optimization Compiler well (i.e., comparably to PointPerfEst)

e

s

t r under a variety of conditions, such as

e

p

x E Optimization Solver varying cluster load, relative SLO job deadlines, and prediction inaccuracy. 3. Job placement Fifth, we show that the 3Sigma com- ponents (3σPredict and 3σSched) can Figure 2: End-to-end system integration scale to >10000 nodes. Overall, we see that 3Sigma robustly exploits runtime The role of the predictor component ments with production-derived work- 3σPredict is to provide the core sched- loads demonstrate 3Sigma’s effective- distributions to improve SLO attain- uler with a probability distribution of ness. Using its imperfect but automat- ment and best-effort performance, the execution time of the submitted ically-generated history-based runtime dealing gracefully with the complex job. 3σPredict does this by maintaining distributions, 3Sigma outperforms runtime variations seen in real cluster a history of previously executed jobs, both a state-of-the-art point-estimate- environments. identifying a set of jobs that, based based scheduler and a priority-based For more information, please see [3] on their attributes, are similar to the (runtime-unaware) scheduler, espe- or visit www.pdl.cmu.edu/TetriSched/ current job and deriving the runtime cially for mixes of deadline-oriented distribution the selected jobs’ historical jobs and latency-sensitive jobs on References runtimes (Step 2 of Fig. 2). Given a heterogeneous resources. 3Sigma si- distribution of expected job runtimes multaneously provides higher (1) SLO [1] Alexey Tumanov, Angela Jiang, Jun and request specifications, the core attainment for deadline-oriented jobs Woo Park, Michael A. Kozuch, and scheduler, 3σSched decides which jobs and (2) cluster goodput (utilization). Gregory R. Ganger. 2016. JamaisVu: to place on which resources and when. Our evaluation of 3Sigma, yielded five Robust Scheduling with AutoEsti- The scheduler evaluates the expected key takeaways. First, 3Sigma achieves mated Job Runtimes. Technical Report utility of each option and the expected significant improvement over the state- CMU-PDL-16-104. Carnegie Mellon resource consumption and availability of-the-art in SLO miss rate, best-effort University. over the scheduling horizon. Valua- job goodput, and best-effort latency in [2] Charles Reiss, Alexey Tumanov, tions and computed resource capacity a fully-integrated real cluster deploy- Gregory R. Ganger, Randy H. Katz, are then compiled into an optimization ment, approaching the performance and Michael A. Kozuch. 2012. Het- problem, which is solved by an external of the unrealistic PointPerfEst in SLO erogeneity and Dynamicity of Clouds at solver. 3σSched translates the solution miss rate and BE latency. Second, all into an updated schedule and submits Scale: Google Trace Analysis. In Proc. of the 3σSched component features of the 3nd ACM Symposium on Cloud the schedule to the cluster manager are important, as seen via a piecewise Computing (SOCC ’12). (Step 3 of Fig. 2). On completion, benefit attribution. Third, estimated the job’s actual runtime is recorded distributions are beneficial in sched- [3] Jun Woo Park, Alexey Tumanov, by 3σPredict (along with the attribute uling even if they are somewhat in- Angela Jiang, Michael A. Kozuch, information from the job) and incor- accurate, and such inaccuracies are Gregory R. Ganger. 3Sigma: Distri- porated into the job history for future better handled by distribution-based bution-based Cluster Scheduling for predictions (Step 4 of Fig. 2). scheduling than point-estimate-based Runtime Uncertainty. 
For more information, please see [3] or visit www.pdl.cmu.edu/TetriSched/

References

[1] Alexey Tumanov, Angela Jiang, Jun Woo Park, Michael A. Kozuch, and Gregory R. Ganger. JamaisVu: Robust Scheduling with Auto-Estimated Job Runtimes. Technical Report CMU-PDL-16-104, Carnegie Mellon University, 2016.

[2] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proc. of the 3rd ACM Symposium on Cloud Computing (SoCC '12), 2012.

[3] Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch, and Gregory R. Ganger. 3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty. In Proc. of EuroSys '18, April 23–26, 2018, Porto, Portugal.

DEFENSES & PROPOSALS

DISSERTATION ABSTRACT: Architectural Techniques for Improving NAND Flash Memory Reliability

Yixin Luo
Carnegie Mellon University, SCS
PhD Defense — February 9, 2018

Raw bit errors are common in NAND flash memory and will increase in the future. These errors reduce flash reliability and limit the lifetime of a flash memory device. This dissertation improves flash reliability with a multitude of low-cost architectural techniques. We show that NAND flash memory reliability can be improved at low cost and with low performance overhead by deploying various architectural techniques that are aware of higher-level application behavior and underlying flash device characteristics.

This dissertation analyzes flash error characteristics and workload behavior through rigorous experimental characterization, and designs new flash controller algorithms that use the insights gained from our analysis to improve flash reliability at low cost. We investigate four novel directions. (1) We propose a new technique called WARM that improves flash lifetime by 12.9 times by managing flash retention differently for write-hot data and write-cold data. (2) We propose a new framework that learns an online flash channel model for each chip and enables four new flash controller algorithms to improve flash write endurance by up to 69.9%. (3) We identify three new error characteristics in 3D NAND flash memory through comprehensive experimental characterization of real 3D NAND chips, and propose four new techniques that mitigate these new errors and improve the 3D NAND raw bit error rate by up to 66.9%. (4) We propose a new technique called HeatWatch that improves 3D NAND lifetime by 3.85 times by utilizing the self-healing effect to mitigate retention errors in 3D NAND.

Greg Ganger, PDL alum Michael Abd-El-Malek (Google), and Bill Courtright enjoy social time at the PDL Retreat.

DISSERTATION ABSTRACT: Fast Storage for File System Metadata

Kai Ren
Carnegie Mellon University, SCS
PhD Defense — August 8, 2017

In an era of big data, the rapid growth of the data that many companies and organizations produce and manage continues to drive efforts to improve the scalability of storage systems. The number of objects present in storage systems continues to grow, making metadata management critical to the overall performance of file systems. Many modern parallel applications are also shifting toward shorter durations and larger degrees of parallelism. Such trends expose storage systems to increasingly diverse metadata-intensive workloads.

The goal of this dissertation is to improve metadata management in both local and distributed file systems. The dissertation focuses on two aspects. One is to improve the out-of-core representation of file system metadata by exploring the use of log-structured, multi-level approaches to provide a unified and efficient representation for different types of secondary storage devices (e.g., traditional hard disks and solid state disks). We have designed and implemented TableFS and its improved version SlimFS, which are 50% to 10x faster than traditional Linux file systems. The other aspect is to demonstrate that such a representation can also be flexibly integrated with many namespace distribution mechanisms to scale the metadata performance of distributed file systems, and to provide better support for a variety of big data applications in data center environments. Our distributed metadata middleware IndexFS can help improve metadata performance for PVFS, Lustre and HDFS by scaling to as many as 128 metadata servers.
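The first aspect, an out-of-core metadata representation, packs file system metadata into a write-optimized key-value store. The sketch below illustrates the general idea of a TableFS-style schema, with directory entries keyed by (parent inode, name); the schema details and the plain dict standing in for an LSM-tree are illustrative assumptions, not the TableFS code.

    import stat

    class MetadataTable:
        """Toy TableFS-style metadata store: one key-value entry per
        directory entry, keyed by (parent inode number, entry name).
        A real implementation would back this with an LSM-tree
        (e.g., LevelDB), turning metadata inserts into sequential
        log writes."""

        def __init__(self):
            self.kv = {}          # stand-in for the LSM-tree
            self.next_ino = 2     # inode 1 is the root directory

        def create(self, parent_ino, name, mode=stat.S_IFREG | 0o644):
            ino = self.next_ino
            self.next_ino += 1
            # Embed the attributes (and small file data) in the value,
            # so a stat or lookup is a single point query.
            self.kv[(parent_ino, name)] = {"ino": ino, "mode": mode, "size": 0}
            return ino

        def lookup(self, parent_ino, name):
            return self.kv.get((parent_ino, name))

        def readdir(self, parent_ino):
            # Entries of one directory share the key prefix, so an LSM
            # range scan returns them together, in sorted order.
            return sorted(name for (p, name) in self.kv if p == parent_ino)

    t = MetadataTable()
    t.create(1, "a.txt")
    t.create(1, "b.txt")
    print(t.readdir(1))  # ['a.txt', 'b.txt']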
DISSERTATION ABSTRACT: Enabling Data-Driven Optimization of Quality of Experience in Internet Applications

Junchen Jiang
Carnegie Mellon University, SCS
PhD Defense — June 23, 2017

Today's Internet has become an eyeball economy dominated by applications such as video streaming and VoIP. With most applications relying on user engagement to generate revenues, maintaining high user-perceived Quality of Experience (QoE) has become crucial to ensure high user engagement. For instance, one short buffering interruption leads to 39% less time spent watching videos and causes significant revenue losses for ad-based video sites. Despite increasing expectations for high QoE, existing approaches are limited in their ability to achieve the QoE needed by today's applications. They either require costly re-architecting of the network core, or use suboptimal endpoint-based protocols that react to dynamic Internet performance based on limited knowledge of the network.

Industry guests and CMU folks board the bus to head to Bedford Springs for the PDL Retreat.

continued on page 15

DEFENSES & PROPOSALS continued from page 14

In this thesis, I present a new approach, which is inspired by the recent success of data-driven approaches in many fields of computing. I will demonstrate that data-driven techniques can improve Internet QoE by utilizing a centralized real-time view of performance across millions of endpoints (clients). I will focus on two fundamental challenges unique to this data-driven approach: the need for expressive models to capture the complex factors affecting QoE, and the need for scalable platforms to make real-time decisions with fresh data from geo-distributed clients.

Our solutions address these challenges in practice by integrating several domain-specific insights in networked applications with machine learning algorithms and systems, and achieve better QoE than many standard machine learning solutions. I will present end-to-end systems that yield substantial QoE improvement and higher user engagement for video streaming and VoIP. Two of my projects, CFA and VIA, have been used in industry by Conviva and Skype, companies that specialize in QoE optimization for video streaming and VoIP, respectively.

Shinya Matsumoto (Hitachi) talks about his company's research on "Risk-aware Data Replication against Widespread Disasters" at the PDL retreat industry poster session.

DISSERTATION ABSTRACT: Understanding and Improving the Latency of DRAM-Based Memory Systems

Kevin K. Chang
Carnegie Mellon University, ECE
PhD Defense — May 5, 2017

Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts and the emergence of increasingly data-intensive and latency-critical applications further stress the importance of providing low-latency memory accesses.

In this dissertation, we identify three main problems that contribute significantly to the long latency of DRAM accesses, and we present a series of new techniques to address them. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips, and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency.

First, while bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and high energy consumption. This dissertation introduces a new DRAM design, Low-cost Inter-linked SubArrays (LISA), which provides fast and energy-efficient bulk data movement across subarrays in a DRAM chip. We show that the LISA substrate is very powerful and versatile by demonstrating that it efficiently enables several new architectural mechanisms, including low-latency data copying, reduced DRAM access latency for frequently-accessed data, and reduced preparation latency for subsequent accesses to a DRAM bank.

Second, DRAM needs to be periodically refreshed to prevent data loss due to leakage. Unfortunately, while DRAM is being refreshed, a part of it becomes unavailable to serve memory requests, which degrades system performance. To address this refresh interference problem, we propose two access-refresh parallelization techniques that enable more overlapping of accesses with refreshes inside DRAM, at the cost of very modest changes to the memory controllers and DRAM chips. These two techniques together achieve performance close to that of an idealized system that does not require refresh.

Third, we find, for the first time, that there is significant latency variation in accessing different cells of a single DRAM chip due to irregularity in the DRAM manufacturing process. As a result, some DRAM cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation and use a fixed latency value based on the slowest cell across all DRAM chips. To exploit latency variation within a DRAM chip, we

Saurabh Kadekodi discusses his research on "Aging Gracefully with Geriatrix: A File System Aging Suite" at a PDL retreat poster session.

continued on page 16

DEFENSES & PROPOSALS continued from page 15

experimentally characterize and understand the behavior of the variation that exists in real commodity DRAM chips. Based on our characterization, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism to reduce DRAM latency by categorizing DRAM cells into fast and slow regions, and accessing the fast regions with a reduced latency, thereby improving system performance significantly. Our extensive experimental characterization and analysis of latency variation in DRAM chips can also enable the development of other new techniques to improve performance or reliability.

Fourth, this dissertation, for the first time, develops an understanding of the latency behavior due to another important factor—supply voltage, which significantly impacts DRAM performance, energy consumption, and reliability. We take an experimental approach to understanding and exploiting the behavior of modern DRAM chips under different supply voltage values. Our detailed characterization of real commodity DRAM chips demonstrates that memory access latency reduces with increasing supply voltage. Based on our characterization, we propose Voltron, a new mechanism that improves system energy efficiency by dynamically adjusting the DRAM supply voltage based on a performance model. Our extensive experimental data on the relationship between DRAM supply voltage, latency, and reliability can further enable the development of other new mechanisms that improve latency, energy efficiency, or reliability.

The key conclusion of this dissertation is that augmenting the DRAM architecture with simple and low-cost features, together with developing a better understanding of manufactured DRAM chips, leads to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and the detailed experimental data on real commodity DRAM chips presented in this dissertation will enable the development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.

Jiri Schindler (HPE), Bruce Wilson (Broadcom) and Rajat Kateja discuss PDL research at a retreat poster session.

THESIS PROPOSAL: Towards Space-Efficient High-Performance In-Memory Search Structures

Huanchen Zhang, SCS
April 30, 2018

This thesis seeks to address the challenge of building space-efficient yet high-performance in-memory search structures, including indexes and filters, to allow more efficient use of memory in OLTP databases. We show that we can achieve this goal by first designing fast static structures that leverage succinct data structures to approach the information-theoretic optimum in space, and then using the "hybrid index" architecture to obtain dynamicity with bounded and modest cost in space and performance.

To obtain space-efficient yet high-performance static data structures, we first introduce the Dynamic-to-Static rules that present a systematic way to convert existing dynamic structures to smaller immutable versions. We then present the Fast Succinct Trie (FST) and its application, the Succinct Range Filter (SuRF), to show how to leverage theories on succinct data structures to build static search structures that consume space close to the information-theoretic minimum while performing comparably to uncompressed indexes. To support dynamic operations such as inserts, deletes, and updates, we introduce the dual-stage hybrid index architecture that preserves the space efficiency brought by a compressed static index, while amortizing its performance overhead on dynamic operations by applying modifications in batches, as sketched below.

In the proposed work, we seek opportunities to further shrink the size of in-memory indexes by co-designing the indexes with the in-memory tuple storage. We also propose to complete the hybrid index work by extending the techniques to support concurrent indexes.
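The dual-stage idea can be rendered as a short sketch: a small dynamic stage absorbs writes, and a compact, immutable static stage serves most lookups, with batched merges from the former into the latter. The sorted list below is merely a stand-in for a compressed structure such as FST, and the merge threshold is an assumed parameter, not part of the proposal.

    import bisect

    class HybridIndex:
        """Toy dual-stage hybrid index: writes go to a small dynamic
        stage; a batched merge moves them into a space-efficient,
        immutable static stage (here a sorted list standing in for a
        compressed structure such as FST)."""

        def __init__(self, merge_threshold=4):
            self.dynamic = {}        # small, write-friendly stage
            self.static_keys = []    # immutable, sorted, compact stage
            self.static_vals = []
            self.merge_threshold = merge_threshold

        def insert(self, key, value):
            self.dynamic[key] = value
            if len(self.dynamic) >= self.merge_threshold:
                self._merge()        # amortize the cost over many inserts

        def get(self, key):
            if key in self.dynamic:  # newest data wins: check dynamic first
                return self.dynamic[key]
            i = bisect.bisect_left(self.static_keys, key)
            if i < len(self.static_keys) and self.static_keys[i] == key:
                return self.static_vals[i]
            return None

        def _merge(self):
            # Rebuild the static stage in one batch; in the real design
            # this is where Dynamic-to-Static conversion would happen.
            merged = dict(zip(self.static_keys, self.static_vals))
            merged.update(self.dynamic)
            self.dynamic.clear()
            self.static_keys = sorted(merged)
            self.static_vals = [merged[k] for k in self.static_keys]

    idx = HybridIndex()
    for k in ["e", "b", "a", "d", "c"]:
        idx.insert(k, k.upper())
    print(idx.get("c"))  # 'C'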
THESIS PROPOSAL: Efficient Networked Systems for Datacenter Fabrics with RPCs

Anuj Kalia, SCS
March 23, 2018

Datacenter networks have changed radically in recent years. Their bandwidth and latency have improved by orders of magnitude, and advanced network devices such as NICs with Remote Direct Memory Access (RDMA) capabilities and programmable switches have been deployed. The conventional wisdom is that to best use fast datacenter networks, distributed systems must be redesigned to offload processing from server CPUs to network devices. In this dissertation, we show that conventional, non-offloaded designs offer

Bill Bolosky (Microsoft Research) talks about his company's work on exciting new projects at the PDL retreat industry poster session.

continued on page 17

DEFENSES & PROPOSALS continued from page 16

better or comparable performance for a wide range of datacenter workloads, including key-value stores, distributed transactions, and highly-available replicated services.

We present the following principle: the physical limitations of networks must inform the design of high-performance distributed systems. Offloaded designs often require more network round trips than conventional CPU-based designs, and therefore have fundamentally higher latency. Since they require more network packets, they also have lower throughput. Realizing the benefits of this principle requires fast networking software for CPUs. To this end, we undertake a detailed exploration of datacenter network capabilities, CPU-NIC interaction over the system bus, and NIC hardware architecture. We use insights from this study to create high-performance remote procedure call implementations for use in distributed systems with active end-host CPUs.

We demonstrate the effectiveness of this principle through the design and evaluation of four distributed in-memory systems: a key-value cache, a networked sequencer, an online transaction processing system, and a state machine replication system. We show that our designs often simultaneously outperform the competition in performance, scalability, and simplicity.

Joan Digney and Garth Gibson celebrate 25 years of PDL research and retreats.

THESIS PROPOSAL: STRADS: A New Distributed Framework for Scheduled Model-Parallel Machine Learning

Jin Kyu Kim, SCS
May 15, 2017

Machine learning (ML) methods are used to analyze data collected from various sources. As problem sizes grow, we turn to distributed parallel computation to complete ML training in a reasonable amount of time. However, naive parallelization of ML algorithms often hurts the effectiveness of parameter updates due to the dependency structure among model parameters, and a subset of model parameters often bottlenecks the completion of ML algorithms due to uneven convergence rates. In this proposal, I propose two efforts: 1) STRADS, which improves training speed by an order of magnitude, and 2) STRADS-AP, which makes parallel ML programming easier.

In STRADS, I will first present a scheduled model-parallel approach with two specific scheduling schemes: 1) model parameter dependency checking, to avoid updating dependent parameters concurrently; and 2) parameter prioritization, to give more update chances to the parameters far from their convergence point. To efficiently run scheduled model-parallel training in a distributed system,

continued on page 18

Yixin Luo and Michael Kuchnik, ready to discuss their research on "Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives" and "Machine Learning Based Feature Tracking in HPC Simulations" at a PDL retreat poster session.

THESIS PROPOSAL: Design & Implementation of a Non-Volatile Memory Database Management System

Joy Arulraj, SCS
December 7, 2017

For the first time in 25 years, a new non-volatile memory (NVM) category is being created that is two orders of magnitude faster than current durable storage media. This will fundamentally change the dichotomy between volatile memory and durable storage in DB systems. The new NVM devices are almost as fast as DRAM, but all writes to them are potentially persistent even after power loss. Existing DB systems are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many components of legacy database systems are unnecessary and will degrade the performance of data-intensive applications.

This dissertation explores the implications of NVM for database systems. It presents the design and implementation of Peloton, a new database system tailored specifically for NVM. We focus on three aspects of a database system: (1) logging and recovery, (2) storage management, and (3) indexing. Our primary contribution in this dissertation is the design of a new logging and recovery protocol, called write-behind logging, that improves the availability of the system by more than two orders of magnitude compared to the ubiquitous write-ahead logging protocol. Besides improving availability, we found that write-behind logging improves the space utilization of the NVM device and extends its lifetime. Second, we propose a new storage engine architecture that leverages the durability and byte-addressability properties of NVM to avoid unnecessary data duplication. Third, the dissertation presents the design of a latch-free range index tailored for NVM that supports near-instantaneous recovery without requiring special-purpose recovery code.
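At a high level, the contrast between the two logging protocols is that write-ahead logging persists redo images in the log before the data, while write-behind logging persists the data itself before a small commit record, so recovery needs no replay. The sketch below is a deliberately simplified illustration of that difference under assumed semantics, not Peloton's actual protocol.

    class WriteBehindTable:
        """Toy write-behind logging: tuple writes are made durable on
        NVM before the commit record, so the log holds only commit
        timestamps (no redo images) and recovery does no replay."""

        def __init__(self):
            self.nvm = {}    # stand-in for byte-addressable NVM storage
            self.log = []    # tiny log: committed-timestamp records only
            self.ts = 0

        def commit(self, writes):
            self.ts += 1
            for key, value in writes.items():
                # Persist the data itself first (versioned by ts) ...
                self.nvm[key] = (value, self.ts)
            # ... then a single small commit record. With write-ahead
            # logging, a log entry carrying images of the data would
            # have to be persisted before the data.
            self.log.append(("COMMIT", self.ts))
            return self.ts

        def recover(self):
            # No redo pass: find the last committed timestamp and drop
            # any tuple versions newer than it (an interrupted commit).
            last = max((t for _, t in self.log), default=0)
            self.nvm = {k: (v, t) for k, (v, t) in self.nvm.items() if t <= last}

    db = WriteBehindTable()
    db.commit({"x": 1, "y": 2})
    db.recover()
    print(db.nvm)  # {'x': (1, 1), 'y': (2, 1)}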

DEFENSES & PROPOSALS continued from page 17

I implement a prototype framework called STRADS. STRADS improves parameter update throughput by pipelining iterations and overlapping update computations with the network communication used for parameter synchronization. With ML scheduling and system optimizations, STRADS improves ML training time by an order of magnitude. However, these performance gains come at the cost of an extra programming burden when writing ML schedules. In STRADS-AP, I will present a high-level programming library and a system infrastructure that automates ML scheduling. The STRADS-AP library consists of three programming constructs: 1) a set of distributed data structures (DDS); 2) a set of functional-style operators; and 3) an imperative-style loop operator. Once an ML programmer writes an ML program using the STRADS-AP library APIs, the STRADS-AP runtime automatically parallelizes the user program over a cluster while ensuring data consistency.

Dana Van Aken presents her research on "Automatic Database Management System Tuning Through Large-scale Machine Learning" at the PDL retreat.

THESIS PROPOSAL: Novel Computational Techniques for Mapping Next-Generation Sequencing Reads

Hongyi Xin, SCS
May 31, 2017

DNA read mapping is an important problem in bioinformatics. With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of many medical and genetic applications critically depends on computational methods that process the enormous amount of sequence data quickly and accurately. However, due to the repetitive nature of the human genome and the limitations of sequencing technology, current read mapping methods still fall short of achieving both high performance and high sensitivity.

In this proposal, I break down the DNA read mapping problem into four subproblems: intelligent seed extraction, efficient filtration of incorrect seed locations, high-performance extension, and accurate and efficient read cloud mapping. I provide novel computational techniques for each subproblem, including: 1) a novel seed selection algorithm that optimally divides a read into low-frequency seeds; 2) a novel SIMD-friendly bit-parallel filtering algorithm that quickly estimates whether two strings are highly similar (sketched below); 3) a generalization of a state-of-the-art approximate string matching algorithm that measures genetic similarities with more realistic metrics; and 4) a novel mapping strategy that utilizes characteristics of a new sequencing technology, read cloud sequencing, to map NGS reads with higher accuracy and efficiency.
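The fltering idea in technique 2 can be sketched compactly: encode base-by-base mismatches of two strings as a bit-vector and reject a candidate mapping location when too many mismatch bits survive. This is a simplified Hamming-distance-style filter in the spirit of the proposal; real pre-alignment filters also handle shifted (indel) alignments, and the parameters here are arbitrary.

    def mismatch_mask(read, ref):
        """Pack per-base mismatches of two equal-length strings into an
        integer bit-vector (bit i set = mismatch at position i). SIMD
        hardware would compare many bases per instruction this way."""
        mask = 0
        for i, (a, b) in enumerate(zip(read, ref)):
            if a != b:
                mask |= 1 << i
        return mask

    def passes_filter(read, ref, max_edits=2):
        """Cheap pre-alignment filter: if the mismatch bit-vector has
        more set bits than the edit budget, the expensive extension
        (dynamic programming) step can be skipped entirely."""
        return bin(mismatch_mask(read, ref)).count("1") <= max_edits

    print(passes_filter("ACGTACGT", "ACGTACGA"))  # True: one mismatch
    print(passes_filter("ACGTACGT", "TGCATGCA"))  # False: filtered out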

ALUMNI NEWS

Hugo Patterson (Ph.D., ECE '98)

We are pleased to pass on the news that Datrium (www.datrium.com/), where Hugo is a co-founder, won Gold in Search Storage's 2017 Product of the Year: "Datrium impresses judges and wins top honors with its DVX storage architecture, designed to sidestep latency and deliver performance and speed at scale." http://bit.ly/2Cl2mAR

Hugo received his Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, where he was a charter student in the PDL. He was advised by Garth Gibson, and his Ph.D. research focused on informed prefetching and caching. He was named a distinguished alumnus of the PDL in 2007.

Ted Wong (Ph.D., CS '04)

Ted joined 23andMe (www.23andme.com) as a Senior Software Engineer with the Machine Learning Engineering group back in January 2018, and reports he is incredibly happy to be there. He also wants to mention that they are hiring!

23andMe is a personal genomics and biotechnology company based in Mountain View, California. The company is named for the 23 pairs of chromosomes in a normal human cell.

NEW PDL FACULTY & STAFF

Gauri Joshi

The PDL would like to welcome Gauri Joshi to our family! Gauri is an Assistant Professor at CMU in the Department of Electrical and Computer Engineering. She is interested in stochastic modeling and analysis that provides sharp insights into the design of computing systems. Her favorite tools include probability, queueing, coding theory and machine learning.

Until August 2017 Gauri was a Research Staff Member at IBM T. J. Watson in Yorktown Heights, NY. In June 2016 she completed her PhD at MIT, working with Prof. Gregory Wornell and Prof. Emina Soljanin. Before that, Gauri spent five years at IIT Bombay, where she completed a dual degree (B.Tech + M.Tech) in Electrical Engineering. She also spent several summers interning at Google, Bell Labs, and Qualcomm.

Currently, Gauri is working on several projects. These include one on Distributed Machine Learning. In large-scale machine learning, training is performed by running stochastic gradient descent (SGD) in a distributed fashion using a central parameter server and multiple servers (learners). Using asynchronous methods to alleviate the problem of stragglers, the research goal is to design a distributed SGD algorithm that strikes the best trade-off between training time and errors in the trained model.

Her project on Straggler Replication in Parallel Computing develops insights into the best relaunching time, and the number of replicas to relaunch, to reduce latency without a significant increase in computing costs in jobs with hundreds of parallel tasks, where the slowest task becomes the bottleneck.

Unlike traditional file transfer, where only the total delay matters, Streaming Communication requires fast and in-order delivery of individual packets to the user. This project analyzes the trade-off between throughput and in-order delivery delay, and in particular how it is affected by the frequency of feedback to the source, and proposes a simple combination of repetition and greedy linear coding that achieves a close-to-optimal throughput-delay trade-off.

Rashmi Vinayak

We would also like to welcome Rashmi Vinayak! Rashmi is an assistant professor in the Computer Science Department at Carnegie Mellon University. She received her PhD in the EECS department at UC Berkeley in 2016, and was a postdoctoral researcher at AMPLab/RISELab and BLISS. Her dissertation received the Eli Jury Award 2016 from the EECS department at UC Berkeley for outstanding achievement in the area of systems, communications, control, or signal processing. Rashmi is the recipient of the IEEE Data Storage Best Paper and Best Student Paper Awards for the years 2011/2012. She is also a recipient of the Facebook Fellowship 2012-13, the Microsoft Research PhD Fellowship 2013-15, and the Google Anita Borg Memorial Scholarship 2015-16. Her research interests lie in building high-performance and resource-efficient big data systems based on theoretical foundations.

A recent project has focused on storage and caching, particularly on fault tolerance, scalability, load balancing, and reducing latency in large-scale distributed data storage and caching systems. She and her colleagues designed coding-theory-based solutions that were shown to be provably optimal. They also built systems and evaluated them on Facebook's data-analytics cluster and on Amazon EC2, showing significant benefits over the state-of-the-art. The solutions are now a part of Apache Hadoop 3.0 and are also being considered by several companies such as NetApp and Cisco.

Rashmi is also interested in machine learning: the research focus here has been on the generalization performance of a class of learning algorithms that are widely used for ranking. She collaborated on designing an algorithm building on top of Multiple Additive Regression Trees, and through empirical evaluation on real-world datasets showed significant improvement on classification, regression, and ranking tasks. This new algorithm is now deployed in production in Microsoft's data-analysis toolbox, which powers the Azure Machine Learning product.

Alex Glikson

Alex Glikson joined the Computer Science Department as a staff engineer, after spending the last 14 years at IBM Research in Israel, where he led a number of research and development projects in the area of systems management and cloud infrastructure. Alex is interested in resource and workload management in cloud computing environments, recently focusing on "Function-as-a-Service" platforms, infrastructure for Deep Learning workloads, and the combination of the two.

RECENT PUBLICATIONS continued from page 7

algorithm [3], and taking tens of data passes to converge, each data pass is slowed down by 30-40% relative to the prior pass, so the eighth data pass is 8.5X slower than the first. The current practice for avoiding such a performance penalty is to checkpoint frequently to a durable storage device, which truncates the lineage size. Checkpointing as a performance speedup is difficult for a programmer to anticipate, and it fundamentally contradicts Spark's philosophy that the working set should stay in memory and not be replicated across the network. Since Spark caches intermediate RDDs, one solution is to cache constructed DAGs and broadcast only new DAG elements. Our experiments show that with this optimization, per-iteration execution time is almost independent of growing lineage size and comparable to the execution time provided by optimal checkpointing. On 10 local machines using 240 cores in total, without checkpointing we observed a 3.4X speedup when solving matrix factorization and a 10X speedup for a streaming application provided in the Spark distribution.

3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning

Hyeontaek Lim, David G. Andersen & Michael Kaminsky

arXiv:1802.07389v1 [cs.LG], 21 Feb 2018.

The performance and efficiency of distributed machine learning (ML) depends significantly on how long it takes for nodes to exchange state changes. Overly-aggressive attempts to reduce communication often sacrifice final model accuracy and necessitate additional ML techniques to compensate for this loss, limiting their generality. Some attempts to reduce communication incur high computation overhead, which makes their performance benefits visible only over slow networks.

We present 3LC, a lossy compression scheme for state change traffic that strikes a balance between multiple goals: traffic reduction, accuracy, computation overhead, and generality. It combines three new techniques—3-value quantization with sparsity multiplication, quartic encoding, and zero-run encoding—to leverage the strengths of quantization and sparsification techniques while avoiding their drawbacks. It achieves a data compression ratio of up to 39–107X, almost the same test accuracy of trained models, and high compression speed. Distributed ML frameworks can employ 3LC without modifications to existing ML algorithms. Our experiments show that 3LC reduces the wall-clock training time of ResNet-110–based image classifiers for CIFAR-10 on a 10-GPU cluster by up to 16–23X compared to TensorFlow's baseline design.

Point-to-point tensor compression for two example layers in 3LC: (a) gradient pushes from workers to servers, where gradients are compressed, pushed, and decompressed at the server; (b) model pulls from servers to workers, where model deltas are compressed, pulled, and decompressed to update each worker's local model.
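A deterministic, simplified rendering of the 3-value quantization step, with the error feedback 3LC uses to preserve accuracy, is shown below. The threshold rule is an illustrative stand-in for the paper's scheme, which also applies quartic encoding and zero-run encoding to the quantized tensor.

    import numpy as np

    class ThreeValueQuantizer:
        """Simplified 3-value quantization with error feedback: each
        gradient entry is transmitted as one of {-m, 0, +m}, and the
        quantization error is carried into the next round."""

        def __init__(self, shape):
            self.residual = np.zeros(shape)  # error-feedback accumulator

        def compress(self, grad):
            t = grad + self.residual
            m = np.abs(t).max() or 1.0            # per-tensor scale
            q = np.where(np.abs(t) >= m / 2, np.sign(t), 0.0)
            self.residual = t - q * m             # remember what was dropped
            return q.astype(np.int8), m           # {-1, 0, 1} plus one float

        @staticmethod
        def decompress(q, m):
            return q * m

    quant = ThreeValueQuantizer(shape=4)
    g = np.array([0.9, -0.1, 0.4, -0.6])
    q, m = quant.compress(g)
    print(q, quant.decompress(q, m))  # [ 1  0  0 -1] [ 0.9  0.  0. -0.9]

The int8 tensor is what the subsequent quartic and zero-run encoders would shrink further before it crosses the network.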
It is to prefetch the estimated register depends significantly on how long achieves a data compression ratio of working-set from the main register file it takes for nodes to exchange state up to 39–107X, almost the same test to the register file cache under software changes. Overly-aggressive attempts to accuracy of trained models, and high control, at the beginning of each in- reduce communication often sacrifice compression speed. Distributed ML terval, and overlap the prefetch latency final model accuracy and necessitate frameworks can employ 3LC without with the execution of other warps. Our additional ML techniques to com- modifications to existing ML algo- experimental results show that LTRF pensate for this loss, limiting their rithms. Our experiments show that enables high-capacity yet long-latency generality. Some attempts to reduce 3LC reduces wall-clock training time main GPU register files, paving the communication incur high compu- of ResNet-110–based image classifiers way for various optimizations. As an tation overhead, which makes their for CIFAR-10 on a 10-GPU cluster example optimization, we implement performance benefits visible only over by up to 16–23X compared to Tensor- the main register file with emerging slow networks. Flow’s baseline design. continued on page 21

RECENT PUBLICATIONS continued from page 20

high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach & Onur Mutlu

ASPLOS '18, March 24–28, 2018, Williamsburg, VA, USA.

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate.

We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.

Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.

Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability

Maciej Besta, Syed Minhaj Hassan, Sudhakar Yalamanchili, Rachata Ausavarungnirun, Onur Mutlu & Torsten Hoefler

ASPLOS '18, March 24–28, 2018, Williamsburg, VA, USA.

Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area effi-

continued on page 22

MASK design overview: (1) TLB-fill tokens, (2) an address-translation-aware cache bypass, and (3) an address-space-aware memory scheduler.

RECENT PUBLICATIONS continued from page 21

ciency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes, such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.

Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach & Onur Mutlu

Proc. of the International Symposium on Microarchitecture (MICRO), Cambridge, MA, October 2017.

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page.

In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages.

We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging

Page allocation and coalescing behavior of GPU memory managers: (a) state-of-the-art, (b) Mosaic. 1: The GPU memory manager allocates base pages from both Applications 1 and 2. 2: As a result, the memory manager cannot coalesce the base pages into a large page without first migrating some of the base pages, which would incur a high latency. 3: Mosaic uses Contiguity-Conserving Allocation (CoCoA), a memory allocator which provides a soft guarantee that all of the base pages within the same large page range belong to only a single application, and 4: the In-Place Coalescer, a page size selection mechanism that merges base pages into a large page immediately after allocation.

continued on page 23

RECENT PUBLICATIONS continued from page 22

overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7% on average, respectively, coming within 6.8% and 15.4% of the performance of an ideal TLB where all TLB requests are hits.

Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory

Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth Gibson, Chuck Cranor, Brad Settlemyer, Gary Grider & Fan Guo

PDSW-DISCS 2017: 2nd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, held in conjunction with SC17, Denver, CO, Nov. 2017.

In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient indexing mechanism for reordering and indexing writes, and a log-structured storage layout to pack small data into large log objects, all while ensuring compute node resources are used frugally. We demonstrate the efficiency of this indexing mechanism through VPIC, a plasma simulation code that scales to trillions of particles. With the Indexed Massive Directory, we modify VPIC to create a file for each particle to receive writes of that particle's simulation output data. Dynamically indexing the directory's underlying storage, keyed on particle filename, allows us to achieve a 5000x speedup for a single particle trajectory query, which requires reading all data for a single particle. This speedup increases with application scale, while the overhead remains stable at 3% of the available memory.
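The write path described above can be rendered as a toy sketch: tiny per-particle writes are appended to a shared log object while a side index, keyed on the particle's file name, records where each particle's data landed, so a trajectory query reads only those extents. The names and structure here are illustrative, not the DeltaFS implementation.

    from collections import defaultdict

    class IndexedMassiveDirectory:
        """Toy DeltaFS-style indexed directory: small per-particle
        writes are packed into one append-only log object, and a side
        index maps each particle 'file' to its (offset, length) extents."""

        def __init__(self):
            self.log = bytearray()             # packed log object
            self.index = defaultdict(list)     # filename -> extents

        def write(self, filename, data: bytes):
            self.index[filename].append((len(self.log), len(data)))
            self.log.extend(data)              # small write, large object

        def read_trajectory(self, filename):
            # A trajectory query touches only this particle's extents
            # instead of scanning the entire simulation output.
            return b"".join(bytes(self.log[off:off + n])
                            for off, n in self.index[filename])

    d = IndexedMassiveDirectory()
    for step in range(3):
        d.write("particle-42", f"t{step}:x=0.{step};".encode())
        d.write("particle-7", b"...")
    print(d.read_trajectory("particle-42").decode())  # t0:x=0.0;t1:x=0.1;t2:x=0.2;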
A DRAM cell (wordline, access transistor, storage capacitor, and bitline) and its sense amplifier.

Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons & Todd C. Mowry

Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory). To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth.

Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits the existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus. Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation.

Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with the Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves the performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) a bit-vector-based implementation of sets, by 3X-7X compared to a

continued on page 24

RECENT PUBLICATIONS continued from page 23

state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that the large performance and energy improvements provided by Ambit can enable many other applications to use bulk bitwise operations.
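The two Ambit components compose neatly: triple-row activation computes a bitwise majority of three rows, from which AND and OR follow by fixing the third row to all zeros or all ones. The Python sketch below emulates that logic on integers standing in for DRAM rows; the in-DRAM command sequences themselves are, of course, not expressible in software.

    # Emulate Ambit's triple-row activation on Python integers, where
    # each integer is a bit-vector standing in for one DRAM row.

    def maj(a, b, c):
        """Bitwise majority: each result bit is 1 iff at least two of
        the three input bits are 1 -- what charge sharing across three
        simultaneously activated DRAM rows computes in analog."""
        return (a & b) | (b & c) | (a & c)

    ROW_WIDTH = 8
    ZEROS = 0                        # reserved all-zeros control row
    ONES = (1 << ROW_WIDTH) - 1      # reserved all-ones control row

    def bulk_and(a, b):
        return maj(a, b, ZEROS)      # MAJ(a, b, 0) = a AND b

    def bulk_or(a, b):
        return maj(a, b, ONES)       # MAJ(a, b, 1) = a OR b

    def bulk_not(a):
        return ~a & ONES             # via the sense amplifier's inverter

    a, b = 0b11001010, 0b10011001
    assert bulk_and(a, b) == a & b
    assert bulk_or(a, b) == a | b
    print(bin(bulk_and(a, b)), bin(bulk_or(a, b)))

With AND, OR, and NOT available, any bulk bitwise operation can be composed inside DRAM.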
Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content

Samira Khan, Chris Wilkerson, Zhe Wang, Alaa R. Alameldeen, Donghyuk Lee & Onur Mutlu

Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve the reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigating the failing cells with higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires knowledge of DRAM internals specific to each DRAM chip. As the internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge.

In this paper, we decouple the detection and mitigation of data-dependent failures from the physical DRAM organization, such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever a write access changes the content of memory. As runtime testing for failures has a high overhead, MEMCON selectively initiates a test on a write only when the time between two consecutive writes to that page (i.e., the write interval) is long enough to provide a significant benefit from lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts long write intervals based on our observation that write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle (see the sketch below). Our evaluation shows that, compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65-74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core system, and a 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system, using 8/16/32 Gb DRAM chips.
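The Pareto observation translates into a simple online rule: the longer a page has already stayed idle since its last write, the longer its remaining idle time is expected to be, so a content test pays off once the observed idle time crosses a threshold. The sketch below illustrates that rule; the threshold value and bookkeeping are illustrative assumptions, not MEMCON's exact policy.

    class WriteIntervalPredictor:
        """Toy MEMCON-style policy: test a page's content (so its
        refresh rate can be lowered) only when the page has already
        stayed idle long enough after a write. Under a Pareto
        distribution of write intervals, a page idle for time T so far
        is likely to remain idle much longer than T."""

        def __init__(self, test_threshold=1000):
            self.last_write = {}              # page -> time of last write
            self.threshold = test_threshold   # assumed tuning parameter

        def on_write(self, page, now):
            self.last_write[page] = now       # content changed: retest later

        def should_test(self, page, now):
            idle = now - self.last_write.get(page, now)
            # Heavy-tailed intervals: a long observed idle time predicts
            # a long remaining idle time, amortizing the test overhead.
            return idle >= self.threshold

    p = WriteIntervalPredictor()
    p.on_write(page=42, now=0)
    print(p.should_test(page=42, now=500))   # False: too soon to pay for a test
    print(p.should_test(page=42, now=1500))  # True: a long interval is expected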

Greg opens the 25th PDL Retreat at the Bedford Springs Resort.

Bigger, Longer, Fewer: What Do Cluster Jobs Look Like Outside Google?

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson, Elisabeth Baseman & Nathan DeBardeleben

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-104, October 2017.

In the last 5 years, a set of job scheduler logs released by Google has been used in more than 400 publications as the token cloud workload. While this is an invaluable trace, we think it is crucial that researchers evaluate their work under other workloads as well, to ensure the generality of their techniques. To aid them in this process, we analyze three new traces consisting of job scheduler logs from one private and

continued on page 25

RECENT PUBLICATIONS continued from page 24

two HPC clusters. We further release the two HPC traces, which we expect to be of interest to the community due to their unique characteristics. The new traces represent clusters 0.3-3 times the size of the Google cluster in terms of CPU cores, and cover a 3-60 times longer time span.

This paper presents an analysis of the differences and similarities between all the aforementioned traces. We discuss a variety of aspects: job characteristics, workload heterogeneity, resource utilization, and failure rates. More importantly, we review assumptions from the literature that were originally derived from the Google trace, and verify whether they hold true when the new traces are considered. For those assumptions that are violated, we examine affected work from the literature. Finally, we demonstrate the importance of dataset plurality in job scheduling research by evaluating the performance of JVuPredict, the job runtime estimate module of the TetriSched scheduler, using all four traces.

WorkloadCompactor: Reducing Datacenter Cost while Providing Tail Latency SLO Guarantees

Timothy Zhu, Michael A. Kozuch & Mor Harchol-Balter

ACM Symposium on Cloud Computing (SoCC '17), Santa Clara, CA, October 2017.

Service providers want to reduce datacenter costs by consolidating workloads onto fewer servers. At the same time, customers have performance goals, such as meeting tail latency Service Level Objectives (SLOs). Consolidating workloads while meeting tail latency goals is challenging, especially since workloads in production environments are often bursty. To limit the congestion when consolidating workloads, customers and service providers often agree upon rate limits. Ideally, rate limits are chosen to maximize the number of workloads that can be co-located while meeting each workload's SLO. In reality, neither the service provider nor the customer knows how to choose rate limits. Customers end up selecting rate limits on their own in some ad hoc fashion, and service providers are left to optimize given the chosen rate limits.

This paper describes WorkloadCompactor, a new system that uses workload traces to automatically choose rate limits simultaneously with selecting onto which server to place each workload. Our system meets customer tail latency SLOs while minimizing datacenter resource costs. Our experiments show that by optimizing the choice of rate limits, WorkloadCompactor reduces the number of required servers by 30-60% as compared to state-of-the-art approaches.

Token bucket rate limiters control the rate and burstiness of a stream of requests. When a request arrives at the rate limiter, tokens are used (i.e., removed) from the token bucket to allow the request to proceed. If the bucket is empty, the request must queue and wait until there are enough tokens. Tokens are added to the bucket at a constant rate r, up to a maximum capacity specified by the bucket size b. Thus, the token bucket rate limiter limits the workload to a maximum instantaneous burst of size b and an average rate r.
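Because (r, b) rate limits are the knob WorkloadCompactor optimizes over, a concrete token bucket, implemented exactly as the description above specifies, may help; this is a textbook formulation, not code from the paper.

    class TokenBucket:
        """Token bucket rate limiter: tokens accrue at rate r up to a
        capacity of b, so traffic is limited to bursts of at most b
        and a long-run average rate of r."""

        def __init__(self, r, b):
            self.r = float(r)      # token refill rate (tokens/second)
            self.b = float(b)      # bucket size (maximum burst)
            self.tokens = float(b)
            self.t = 0.0           # time of last update

        def allow(self, now, cost=1.0):
            # Refill for the elapsed time, capped at the bucket size.
            self.tokens = min(self.b, self.tokens + (now - self.t) * self.r)
            self.t = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True        # request proceeds immediately
            return False           # request must queue for more tokens

    tb = TokenBucket(r=100.0, b=10.0)  # 100 req/s average, bursts of 10
    print(sum(tb.allow(now=0.0) for _ in range(15)))  # 10: burst capped at b
    print(tb.allow(now=0.05))          # True: 5 tokens refilled by t=0.05

Given a workload trace, WorkloadCompactor's job is to pick the (r, b) pair per workload that keeps queueing, and hence tail latency, within the SLO at minimum server cost.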
Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo & Onur Mutlu

Proceedings of the IEEE, Volume 105, Issue 9, September 2017.

NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and its cost has continuously decreased over the decades. This positive growth is a result of two key trends: 1) effective process technology scaling; and 2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to 1) fewer electrons in the flash memory cell floating gate to represent the data; and 2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells. In this article, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including 1) cell-to-cell interference mitigation; 2) optimal multi-level cell sensing; 3) error correction using state-of-the-art algorithms and methods; and 4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve in the future.

continued on page 26

A Better Model for Job Redundancy: Decoupling Server Slowdown and Job Size

Kristen Gardner, Mor Harchol-Balter, Alan Scheller-Wolf & Benny Van Houdt

Transactions on Networking, September 2017.

Recent computer systems research has proposed using redundant requests to reduce latency. The idea is to replicate a request so that it joins the queue at multiple servers. The request is considered complete as soon as any one of its copies completes. Redundancy allows us to overcome server-side variability – the fact that a server might be temporarily slow due to factors such as background load, network interrupts, and garbage collection – to reduce response time. In the past few years, queueing theorists have begun to study redundancy, first via approximations, and, more recently, via exact analysis. Unfortunately, for analytical tractability, most existing theoretical analysis has assumed an Independent Runtimes (IR) model, wherein the replicas of a job each experience independent runtimes (service times) at different servers. The IR model is unrealistic and has led to theoretical results which can be at odds with computer systems implementation results. This paper introduces a much more realistic model of redundancy. Our model decouples the inherent job size (X) from the server-side slowdown (S), where we track both S and X for each job. Analysis within the S&X model is, of course, much more difficult. Nevertheless, we design a dispatching policy, Redundant-to-Idle-Queue (RIQ), which is both analytically tractable within the S&X model and has provably excellent performance.

Utility-Based Hybrid Memory Management

Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang & Onur Mutlu

In Proc. of the IEEE Cluster Conference (CLUSTER), Honolulu, HI, September 2017.

While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance.

In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.

Figure (panels: (a) alone request, (b) overlapped requests): Conceptual example showing that the MLP of a page influences how much effect its migration to fast memory has on the application stall time.
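UH-MEM's first step assigns each page a utility score. The following is a toy Python sketch of such a score, assuming hypothetical per-page counters; the actual mechanism's statistics and weighting are more involved than this.

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    accesses: int         # access frequency over the last interval
    row_buffer_hits: int  # accesses served from an already-open row
    mlp: float            # avg. concurrent outstanding requests (>= 1)

def migration_utility(p: PageStats, latency_gap_ns: float) -> float:
    """Rough stall-time reduction (ns) from moving this page to fast memory.
    Row-buffer hits are fast in either memory, so only misses benefit; high
    MLP dilutes the benefit because overlapped requests hide latency."""
    row_buffer_misses = p.accesses - p.row_buffer_hits
    return row_buffer_misses * latency_gap_ns / p.mlp

# Rank candidate pages; migrate the highest-utility ones first.
pages = [PageStats(900, 100, 1.0), PageStats(900, 100, 4.0), PageStats(50, 40, 1.0)]
ranked = sorted(pages, key=lambda p: migration_utility(p, latency_gap_ns=60.0),
                reverse=True)
```

The second page, despite identical access counts, scores a quarter of the first page's utility because its requests overlap, mirroring the figure above.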

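Returning to the S&X model above: its key departure from the IR model is that replicas share the job's inherent size X and differ only in server slowdown S. A small Monte Carlo sketch (distributions chosen arbitrarily for illustration, queueing ignored) shows why the IR assumption overstates the benefit of replication:

```python
import random

def replica_times(x, slowdowns, independent):
    # IR model: each replica draws a fresh, independent runtime.
    # S&X model: replicas share the inherent size x; only slowdowns differ.
    if independent:
        return [random.expovariate(1.0) * s for s in slowdowns]
    return [x * s for s in slowdowns]

def mean_completion(independent, replicas=2, trials=100_000):
    total = 0.0
    for _ in range(trials):
        x = random.expovariate(1.0)                  # inherent job size X
        slowdowns = [1.0 + random.expovariate(2.0)   # server slowdown S >= 1
                     for _ in range(replicas)]
        total += min(replica_times(x, slowdowns, independent))
    return total / trials

print("IR model :", mean_completion(independent=True))   # min of independent draws
print("S&X model:", mean_completion(independent=False))  # shared X limits the gain
```

Under IR, a large job can get lucky with a small independent draw at another server; under S&X, every replica of a large job is still large, so redundancy only buys the best slowdown.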
Scheduling for Efficiency and Fairness in Systems with Redundancy

Kristen Gardner, Mor Harchol-Balter, Esa Hyytiä & Rhonda Righter

Performance Evaluation, July 2017.

Server-side variability—the idea that the same job can take longer to run on one server than another due to server-dependent factors—is an increasingly important concern in many queueing systems. One strategy for overcoming server-side variability to achieve low response time is redundancy, under which jobs create copies of themselves and send these copies to multiple different servers, waiting for only one copy to complete service. Most of the existing theoretical work on redundancy has focused on developing bounds, approximations, and exact analysis to study the response time gains offered by redundancy. However, response time is not the only important metric in redundancy systems: in addition to providing low overall response time, the system should also be fair in the sense that no job class should have a worse mean response time in the system with redundancy than it did in the system before redundancy is allowed.

In this paper we use scheduling to address the simultaneous goals of (1) achieving low response time and (2) maintaining fairness across job classes. We develop new exact analysis for per-class response time under First-Come First-Served (FCFS) scheduling for a general type of system structure; our analysis shows that FCFS can be unfair in that it can hurt non-redundant jobs. We then introduce the Least Redundant First (LRF) scheduling policy, which we prove is optimal with respect to overall system response time, but which can be unfair in that it can hurt the jobs that become redundant. Finally, we introduce the Primaries First (PF) scheduling policy, which is provably fair and also achieves excellent overall mean response time.

Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM

Rajat Kateja, Anirudh Badam, Sriram Govindan, Bikash Sharma & Greg Ganger

ISCA '17, June 24-28, 2017, Toronto, ON, Canada.

Non-Volatile Memories (NVMs) can significantly improve the performance of data-intensive applications. A popular form of NVM is battery-backed DRAM, which is available and in use today with DRAM's latency and without the endurance problems of emerging NVM technologies. Modern servers can be provisioned with up to 4 TB of DRAM, and provisioning battery backup to write out such large memories is hard because of the large battery sizes and the added hardware and cooling costs. We present Viyojit, a system that exploits the skew in write working sets of applications to provision substantially smaller batteries while still ensuring durability for the entire DRAM capacity. Viyojit achieves this by bounding the number of dirty pages in DRAM based on the provisioned battery capacity and proactively writing out infrequently written pages to an SSD. Even for write-heavy workloads with less skew than we observe in analysis of real data center traces, Viyojit reduces the required battery capacity to 11% of the original size, with a performance overhead of 7-25%. Thus, Viyojit frees battery-backed DRAM from the stunted growth of battery capacities and enables servers with terabytes of battery-backed DRAM.

Figure: Flow chart describing Viyojit's implementation for tracking dirty pages and enforcing the dirty budget.

Litz: An Elastic Framework for High-Performance Distributed Machine Learning

Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson & Eric P. Xing

Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-17-103, June 2017.

Machine Learning (ML) is becoming an increasingly popular application in the cloud and data-centers, inspiring a growing number of distributed frameworks optimized for it. These frameworks leverage the specific properties of ML algorithms to achieve orders of magnitude performance improvements over generic data processing frameworks like Hadoop or Spark. However, they also tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of the multi-tenant environments in which they run. Furthermore, the programming models provided by these frameworks tend to be restrictive, narrowing their applicability even within the sphere of ML workloads.

Motivated by these trends, we present Litz, a distributed ML framework that achieves both elasticity and generality without giving up the performance of more specialized frameworks. Litz uses a programming model based on scheduling micro-tasks with parameter server access, which enables applications to implement key distributed ML techniques that have recently been introduced. Furthermore, we believe that the union of ML and elasticity presents new opportunities for job scheduling due to the dynamic resource usage of ML algorithms. We give examples of ML properties which give rise to such resource usage patterns and suggest ways to exploit them to improve resource utilization in multi-tenant environments. To evaluate Litz, we implement two popular ML applications that vary dramatically in terms of their structure and run-time behavior—they are typically implemented by different ML frameworks tuned for each. We show that Litz achieves competitive performance with the state of the art while providing low-overhead elasticity and exposing the underlying dynamic resource usage of ML applications.
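The essence of Litz's programming model is that work is expressed as many small micro-tasks reading and updating shared state through a parameter server, so the scheduler is free to spread them over however many executors are currently available. A deliberately simplified sketch of that idea (this is not Litz's actual API; synchronization of the shared state is elided):

```python
from concurrent.futures import ThreadPoolExecutor

params = {"w": 0.0}  # toy parameter server: shared model state

def micro_task(shard):
    # Each micro-task computes an update from its data shard and
    # applies it to the parameter server.
    grad = sum(shard) / len(shard)
    params["w"] -= 0.01 * grad

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Elasticity: the same micro-tasks run unchanged whether 2 or 20
# workers are available; here a local thread pool stands in for them.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(micro_task, shards))
```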

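Looking back at Viyojit above, its core invariant is simple: the number of dirty DRAM pages must never exceed what the provisioned battery can flush on a power failure. A toy sketch of budget enforcement follows; the names and the least-recently-written eviction policy are ours, and the real implementation (per the flow chart) is more involved.

```python
from collections import OrderedDict

class DirtyBudget:
    """Bound dirty pages to what the battery can flush on power failure;
    proactively clean the least-recently-written page when over budget."""

    def __init__(self, budget_pages: int):
        self.budget = budget_pages
        self.dirty = OrderedDict()  # page -> True, oldest write first

    def record_write(self, page: int, flush_to_ssd) -> None:
        self.dirty.pop(page, None)   # re-inserting moves page to newest
        self.dirty[page] = True
        while len(self.dirty) > self.budget:
            victim, _ = self.dirty.popitem(last=False)  # least recently written
            flush_to_ssd(victim)     # write back; page becomes clean
```

Because write working sets are skewed, the hot pages stay dirty in DRAM while cold pages are cleaned early, which is why a battery sized for a small budget can still cover terabytes of DRAM.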
Workload Analysis and Caching Strategies for Search Advertising Systems

Conglong Li, David G. Andersen, Qiang Fu, Sameh Elnikety & Yuxiong He

SoCC '17, September 24-27, 2017, Santa Clara, CA, USA.

Search advertising depends on accurate predictions of user behavior and interest, accomplished today using complex and computationally expensive machine learning algorithms that estimate the potential revenue gain of thousands of candidate advertisements per search query. The accuracy of this estimation is important for revenue, but the cost of these computations represents a substantial expense, e.g., 10% to 30% of the total gross revenue. Caching the results of previous computations is a potential path to reducing this expense, but traditional domain-agnostic and revenue-agnostic approaches to do so result in substantial revenue loss. This paper presents three domain-specific caching mechanisms that successfully optimize for both factors. Simulations on a trace from the Bing advertising system show that a traditional cache can reduce cost by up to 27.7% but has negative revenue impact as bad as −14.1%. On the other hand, the proposed mechanisms can reduce cost by up to 20.6% while capping revenue impact between −1.3% and 0%. Based on Microsoft's earnings release for FY16 Q4, the traditional cache would reduce the net profit of Bing Ads by $84.9 to $166.1 million in the quarter, while our proposed cache could increase the net profit by $11.1 to $71.5 million.

Figure: Simplified workflow of how the Bing advertising system serves ads to users. A search query flows from candidate selection (millions of ads in the ads pool) to scoring (thousands of ads) to the auction (tens of ads shown); the proposed cache fronts the scoring step and is bypassed on a miss or refresh.

Cachier: Edge-Caching for Recognition Applications

Utsav Drolia, Katherine Guo, Jiaqi Tan, Rajeev Gandhi & Priya Narasimhan

The 37th IEEE International Conference on Distributed Computing Systems (ICDCS 2017), June 5-8, 2017, Atlanta, GA, USA.

Recognition and perception-based mobile applications, such as image recognition, are on the rise. These applications recognize the user's surroundings and augment them with information and/or media. These applications are latency-sensitive: they have a soft-realtime nature, and late results are potentially meaningless. On the one hand, given the compute-intensive nature of the tasks performed by such applications, execution is typically offloaded to the cloud. On the other hand, offloading such applications to the cloud incurs network latency, which can increase the user-perceived latency. Consequently, edge-computing has been proposed to let devices offload intensive tasks to edge-servers instead of the cloud, to reduce latency.

In this paper, we propose a different model for using edge-servers. We propose to use the edge as a specialized cache for recognition applications and formulate the expected latency for such a cache. We show that using an edge-server like a typical web-cache, for recognition applications, can lead to higher latencies. We propose Cachier, a system that uses the caching model along with novel optimizations to minimize latency by adaptively balancing load between the edge and the cloud, leveraging the spatiotemporal locality of requests, offline analysis of applications, and online estimates of network conditions. We evaluate Cachier for image-recognition applications and show that our techniques yield a 3x speed-up in responsiveness, and perform accurately over a range of operating conditions. To the best of our knowledge, this is the first work that models edge-servers as caches for compute-intensive recognition applications, and Cachier is the first system that uses this model to minimize latency for these applications.
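Cachier's central comparison is whether answering at the edge beats forwarding to the cloud in expectation: a cache hit is fast, but a miss pays the edge lookup plus the cloud round-trip. A toy sketch of that decision follows; all parameter names and values are hypothetical, and the real system estimates them online.

```python
def expected_edge_latency(hit_rate: float, edge_lookup_ms: float,
                          cloud_rtt_ms: float, cloud_compute_ms: float) -> float:
    """Expected latency when the edge cache is consulted first:
    hits are served locally; misses pay the lookup plus the cloud path."""
    miss_rate = 1.0 - hit_rate
    return edge_lookup_ms + miss_rate * (cloud_rtt_ms + cloud_compute_ms)

def should_use_edge(hit_rate, edge_lookup_ms, cloud_rtt_ms, cloud_compute_ms):
    edge = expected_edge_latency(hit_rate, edge_lookup_ms,
                                 cloud_rtt_ms, cloud_compute_ms)
    cloud_only = cloud_rtt_ms + cloud_compute_ms
    return edge < cloud_only

# With strong spatiotemporal locality (high hit rate), the edge wins;
# as the hit rate drops or the network improves, the balance shifts.
print(should_use_edge(0.7, edge_lookup_ms=20, cloud_rtt_ms=80, cloud_compute_ms=50))  # True
print(should_use_edge(0.1, edge_lookup_ms=20, cloud_rtt_ms=30, cloud_compute_ms=50))  # False
```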

Carpool: A Bufferless On-Chip Network Supporting Adaptive Multicast and Hotspot Alleviation

Xiyue Xiang, Wentao Shi, Saugata Ghose, Lu Peng, Onur Mutlu & Nian-Feng Tzeng

In Proc. of the International Conference on Supercomputing (ICS), Chicago, IL, June 2017.

Modern chip multiprocessors (CMPs) employ on-chip networks to enable communication between the individual cores. Operations such as coherence and synchronization generate a significant amount of the on-chip network traffic, and often create network requests that have one-to-many (i.e., a core multicasting a message to several cores) or many-to-one (i.e., several cores sending the same message to a common hotspot destination core) flows. As the number of cores in a CMP increases, one-to-many and many-to-one flows result in greater congestion on the network. To alleviate this congestion, prior work provides hardware support for efficient one-to-many and many-to-one flows in buffered on-chip networks. Unfortunately, this hardware support cannot be used in bufferless on-chip networks, which are shown to have lower hardware complexity and higher energy efficiency than buffered networks, and thus are likely a good fit for large-scale CMPs.

We propose Carpool, the first bufferless on-chip network optimized for one-to-many (i.e., multicast) and many-to-one (i.e., hotspot) traffic. Carpool is based on three key ideas: it (1) adaptively forks multicast flit replicas; (2) merges hotspot flits; and (3) employs a novel parallel port allocation mechanism within its routers, which reduces the router critical path latency by 5.7% over a bufferless network router without multicast support. We evaluate Carpool using synthetic traffic workloads that emulate the range of rates at which multi-threaded applications inject multicast and hotspot requests due to coherence and synchronization. Our evaluation shows that for an 8×8 mesh network, Carpool reduces the average packet latency by 43.1% and power consumption by 8.3% over a bufferless network without multicast or hotspot support. We also find that Carpool reduces the average packet latency by 26.4% and power consumption by 50.5% over a buffered network with multicast support, while consuming 63.5% less area for each router.

Automatic Database Management System Tuning Through Large-scale Machine Learning

Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon & Bohan Zhang

ACM SIGMOD International Conference on Management of Data, May 14-19, 2017, Chicago, IL, USA.

Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration "knobs" that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized (i.e., two DBMSs use a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be suboptimal for another). Worse, information about the effects of the knobs typically comes only from (expensive) experience.

To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune and tested it on three DBMSs. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert.

Figure: Motivating examples. Figs. (a) to (c) show 99th-percentile latency (sec) for the YCSB workload running on MySQL (v5.6) under different configuration settings: (a) dependencies between the log file size and buffer pool size knobs; (b) continuous settings of the buffer pool size (MB); (c) non-reusable configurations across three workloads. Fig. (d) shows tuning complexity: the number of tunable knobs provided in MySQL and Postgres releases over time.
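Step (1) of OtterTune's pipeline, ranking knobs by impact, can be sketched with an off-the-shelf regularized regression. The following toy Python example uses synthetic data and illustrative knob names; the paper's actual feature-selection machinery is more elaborate.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
knob_names = ["buffer_pool_mb", "log_file_mb", "checkpoint_secs", "io_threads"]

# Synthetic observations: sampled knob settings and measured 99th-%ile latency,
# where only the first two knobs actually matter.
X = rng.uniform(0, 1, size=(200, len(knob_names)))
latency = 3.0 - 2.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.1, 200)

# L1 regularization drives coefficients of low-impact knobs to zero,
# leaving a ranking of the knobs that matter most for this workload.
model = Lasso(alpha=0.05).fit(StandardScaler().fit_transform(X), latency)
ranked = sorted(zip(knob_names, np.abs(model.coef_)), key=lambda kv: -kv[1])
print(ranked)  # buffer_pool_mb and log_file_mb dominate; the rest are ~0
```

Pruning the knob space this way is what makes the later steps, workload mapping and setting recommendation, tractable.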

Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms

Kevin K. Chang, A. Giray Yaglikçi, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan & Onur Mutlu

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

The energy consumption of DRAM is a critical concern in modern computing systems. Improvements in manufacturing process technology have allowed DRAM vendors to lower the DRAM supply voltage conservatively, which reduces some of the DRAM energy consumption. We would like to reduce the DRAM supply voltage more aggressively, to further reduce energy. Aggressive supply voltage reduction requires a thorough understanding of the effect voltage scaling has on DRAM access latency and DRAM reliability.

In this paper, we take a comprehensive approach to understanding and exploiting the latency and reliability characteristics of modern DRAM when the supply voltage is lowered below the nominal voltage level specified by DRAM standards. Using an FPGA-based testing platform, we perform an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured recently by three major DRAM vendors. We find that reducing the supply voltage below a certain point introduces bit errors in the data, and we comprehensively characterize the behavior of these errors. We discover that these errors can be avoided by increasing the latency of three major DRAM operations (activation, restoration, and precharge). We perform detailed DRAM circuit simulations to validate and explain our experimental findings. We also characterize the various relationships between reduced supply voltage and error locations, stored data patterns, DRAM temperature, and data retention.

Based on our observations, we propose a new DRAM energy reduction mechanism, called Voltron. The key idea of Voltron is to use a performance model to determine by how much we can reduce the supply voltage without introducing errors and without exceeding a user-specified threshold for performance loss. Our evaluations show that Voltron reduces the average DRAM and system energy consumption by 10.5% and 7.3%, respectively, while limiting the average system performance loss to only 1.8%, for a variety of memory-intensive quad-core workloads. We also show that Voltron significantly outperforms prior dynamic voltage and frequency scaling mechanisms for DRAM.

Efficient Redundancy Techniques for Latency Reduction in Cloud Systems

Gauri Joshi, Emina Soljanin & Gregory Wornell

ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), Volume 2, Issue 2, May 2017.

In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers and reduce latency. But adding redundancy may result in higher cost of computing resources, as well as an increase in queueing delay due to higher traffic load. This work helps in understanding when and how redundancy gives a cost-efficient reduction in latency. For a general task service time distribution, we compare different redundancy strategies in terms of the number of redundant tasks and the time when they are issued and canceled. We get the insight that the log-concavity of the task service time creates a dichotomy of when adding redundancy helps. If the service time distribution is log-convex (i.e., log of the tail probability is convex), then adding maximum redundancy reduces both latency and cost. And if it is log-concave (i.e., log of the tail probability is concave), then less redundancy, and early cancellation of redundant tasks, is more effective. Using these insights, we design a general redundancy strategy that achieves a good latency-cost trade-off for an arbitrary service time distribution. This work also generalizes and extends some results in the analysis of fork-join queues.

Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last

Prashanth Menon, Todd C. Mowry & Andrew Pavlo

Proceedings of the VLDB Endowment, Vol. 11, No. 1, 2017.

In-memory database management systems (DBMSs) are a key component of modern on-line analytic processing (OLAP) applications, since they provide low-latency access to large volumes of data. Because disk accesses are no longer the principal bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance. Recent systems have revived an older technique of using just-in-time (JIT) compilation to execute queries as native code instead of interpreting a plan. The state-of-the-art in query compilation is to fuse operators together in a query plan to minimize materialization overhead by passing tuples efficiently between operators. Our empirical analysis shows, however, that more tactful materialization yields better performance. We present a query processing model called "relaxed operator fusion" that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized. This allows the DBMS to take advantage of inter-tuple parallelism inherent in the plan using a combination of prefetching and SIMD vectorization to support faster query execution on data sets that exceed the size of CPU-level caches. Our evaluation shows that our approach reduces the execution time of OLAP queries by up to 2.2X and achieves up to 1.8X better performance compared to other in-memory DBMSs.
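The key move in relaxed operator fusion is inserting a staging point where a vector of intermediate tuples is materialized, so the next operator can process a dense batch with SIMD and prefetching instead of tuple-at-a-time calls. A schematic numpy sketch of a filter feeding a staged hash-join probe follows; a real engine would do this in compiled native code with explicit software prefetches, and all names here are ours.

```python
import numpy as np

BATCH = 4096  # staging buffer size, chosen to fit in CPU caches

def filter_then_probe(keys: np.ndarray, vals: np.ndarray, hash_table: dict):
    """Scan -> filter -> staging point -> join probe, one vector at a time."""
    results = []
    for start in range(0, len(keys), BATCH):
        k, v = keys[start:start + BATCH], vals[start:start + BATCH]
        mask = v > 100.0   # SIMD-style vectorized predicate over the batch
        staged = k[mask]   # staging point: materialize only the survivors
        # The staged batch is dense, so probes are cache-friendly; a native
        # engine would also issue prefetches for the hash buckets here.
        results.extend(hash_table[key] for key in staged if key in hash_table)
    return results

keys = np.arange(1_000_000) % 10_000
vals = np.random.rand(1_000_000) * 200.0
print(len(filter_then_probe(keys, vals, {42: "order-42", 7: "order-7"})))
```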

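The dichotomy in Joshi et al.'s redundancy analysis above hinges on whether log P(T > t) is convex or concave. A quick numeric check (illustrative only, with arbitrarily chosen parameters) for two classic cases: a hyper-exponential service time, whose tail is log-convex so full redundancy helps, versus a shifted exponential, whose tail is log-concave so less redundancy and early cancellation win.

```python
import numpy as np

t = np.linspace(0.1, 10, 200)

# Hyper-exponential tail: mixture of two exponentials (log-convex).
log_tail_hyper = np.log(0.5 * np.exp(-0.5 * t) + 0.5 * np.exp(-2.0 * t))

# Shifted exponential tail: constant delay delta, then Exp(mu) (log-concave).
delta, mu = 1.0, 1.0
log_tail_shift = np.where(t < delta, 0.0, -mu * (t - delta))

def second_diff_sign(y):
    d2 = np.diff(y, 2)  # discrete second derivative of the log-tail
    return "convex" if d2.min() >= -1e-9 else "concave"

print("hyper-exponential :", second_diff_sign(log_tail_hyper))   # convex
print("shifted exponential:", second_diff_sign(log_tail_shift))  # concave
```

Intuitively, a log-convex tail means a task that has run a long time is likely to run much longer, so launching more copies is cheap insurance; a log-concave tail means old tasks are close to finishing, so extra copies mostly waste capacity.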
EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding

K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica & Kannan Ramchandran

12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), November 2-4, 2016, Savannah, GA.

Data-intensive clusters and object stores are increasingly relying on in-memory object caching to meet their I/O performance demands. These systems routinely face the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. In this paper, we explore an alternative approach using erasure coding.

EC-Cache is a load-balanced, low-latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. EC-Cache employs erasure coding by: (i) splitting and erasure coding individual objects during writes, and (ii) late binding, wherein obtaining any k out of (k + r) splits of an object is sufficient, during reads. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.

Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms

Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri & Onur Mutlu

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of variation, such as manufacturing-process- or temperature-induced variation. We empirically demonstrate a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the DRAM chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation. In particular, we show that in most real DRAM chips, cells closer to the peripheral structures can be accessed much faster than cells that are farther. We call this phenomenon design-induced variation in DRAM. Our goals are to i) understand design-induced variation that exists in real, state-of-the-art DRAM chips, ii) exploit it to develop low-cost mechanisms that can dynamically find and use the lowest latency at which to operate a DRAM chip reliably, and, thus, iii) improve overall system performance while ensuring reliable system operation.

To this end, we first experimentally demonstrate and analyze design-induced variation in modern DRAM devices by testing and characterizing 96 DIMMs (768 DRAM chips). Our characterization identifies DRAM regions that are vulnerable to errors, if operated at lower latency, and finds consistency in their locations across a given DRAM chip generation, due to design-induced variation. Based on our extensive experimental analysis, we develop two mechanisms that reliably reduce DRAM latency. First, DIVA Profiling uses runtime profiling to dynamically identify the lowest DRAM latency that does not introduce failures. DIVA Profiling exploits design-induced variation and periodically profiles only the vulnerable regions to determine the lowest DRAM latency at low cost. It is the first mechanism to dynamically determine the lowest latency that can be used to operate DRAM reliably. DIVA Profiling reduces the latency of read/write requests by 35.1%/57.8%, respectively, at 55°C. Our second mechanism, DIVA Shuffling, shuffles data such that values stored in vulnerable regions are mapped to multiple error-correcting code (ECC) codewords. As a result, DIVA Shuffling can correct 26% more multi-bit errors than conventional ECC. Combined together, our two mechanisms reduce read/write latency by 40.0%/60.5%, which translates to an overall system performance improvement of 14.7%/13.7%/13.8% (in 2-/4-/8-core systems) across a variety of workloads, while ensuring reliable operation.

Figure: Design-induced variation due to row organization. Panels (a) conceptual bitline and (b) open bitline scheme show high-error and low-error regions along a 512-cell bitline, relative to the wordline drivers and local sense amplifiers.
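EC-Cache's read path (above) is a clean illustration of late binding: issue reads for all k + r splits and decode as soon as any k arrive, ignoring stragglers. A simplified Python sketch follows; the real system uses Reed-Solomon coding over a cluster, while here `fetch_split` and `decode` are caller-supplied stubs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def read_object(servers, fetch_split, decode, k: int):
    """Late-binding read: request all k + r splits, decode once any k arrive."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(fetch_split, s) for s in servers]
        splits = []
        for fut in as_completed(futures):
            try:
                splits.append(fut.result())
            except IOError:
                continue                  # a failed or slow server is tolerable
            if len(splits) == k:          # any k of the (k + r) splits suffice
                for f in futures:
                    f.cancel()            # stragglers are no longer needed
                return decode(splits)
    raise IOError("fewer than k splits were readable")
```

Because every read touches many servers for a small slice each, load spreads evenly even when one object is very hot, which is the source of the load-balancing gains over selective replication.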

SPRING 2018 31 YEAR IN REVIEW continued from page 4 ™™Conglong Li presented “Workload exascale file systems. Edge-caching for Recognition Analysis and Caching Strategies ™™Charles McGuffey interned with Applications” at ICDCS ‘17 in for Search Advertising Systems” at Google in Sunnyvale, CA, work- Atlanta, GA. SoCC ’17 in Santa Clara, CA. ing on cache partitioning systems May 2016 for Google infrastructure. August 2017 ™™Hongyi Xin proposed his disserta- ™™Kai Ren successfully defended his ™™Jinliang Wei interned with Saeed tion research “Novel Computa- PhD thesis on “Fast Storage for Maleki, Madan Musuvathi and tional Techniques for Mapping File System Metadata.” Todd Mytkowicz at Microsoft Re- Next-Generation Sequencing ™™Souptik Sen interned with Linke- search in Redmond WA, working Reads.” dIn’s Data group in Sunnyvale, on parallelizing and scaling out ™™Kevin K. Chang successfully de- working with Venkatesh Iyer and stochastic gradient descent with fended his PhD research on “Un- Subbu Sanka on a data tooling sequential semantics. derstanding and Improving the library in Scala which converts ge- June 2016 Latency of DRAM-Based Memory neric parameterized Hive queries ™™ M. Satyanarayanan and Col- System.” to Spark to create an optimized leagues Honored for Creation of ™™Jin Kyu Kim proposed his PhD workflow on LinkedIn’s advertis- Andrew File System research “STRADS: A New Dis- ing data pipeline. ™™Junchen Jiang successfully de- tributed Framework for Scheduled ™™Saurabh Kadekodi interned with fended his PhD dissertation Model-Parallel Machine Learn- Alluxio, Inc. in California, work- “Enabling Data-Driven Optimiza- ing.” ing on packing and indexing in tion of Quality of Experience in ™™Dana Van Aken presented “Au- cloud file systems. Internet.” tomatic Database Management System Tuning Through Large- ™™Aaron Harlap interned with Mi- ™ ™Rajat Kateja presented “Viyojit: scale Machine Learning” at ICMD crosoft Research in Seattle, WA, Decoupling Battery and DRAM ‘17 in Chicago, IL. working on “Scaling up Distrib- Capacities for Battery-Backed uted DNN Training.” DRAM” at ISCA ’17 in Toronto, ™™19th annual PDL Spring Visit Day. ™™Qing Zheng interned with LANL ON, Canada. in Los Alamos, NM, working on ™™Utsav Drolia presented “Cachier:

2017 PDL Workshop and Retreat.
