newsletter on pdl activities and events • spring 2018 • http://www.pdl.cmu.edu/

Massive Indexed Directories in DeltaFS
by Qing Zheng, George Amvrosiadis & the DeltaFS Group

an informal publication from academia's premiere storage systems research center devoted to advancing the state of the art in storage and information infrastructures.

CONTENTS
DeltaFS ...... 1
Director's Letter ...... 2
Year in Review ...... 4
Recent Publications ...... 5
PDL News & Awards ...... 8
3Sigma ...... 12
Defenses & Proposals ...... 14
Alumni News ...... 18
New PDL Faculty & Staff ...... 19

PDL CONSORTIUM MEMBERS
Alibaba Group
Broadcom, Ltd.
Dell EMC
Facebook
Google
Hewlett Packard Enterprise
Hitachi, Ltd.
IBM Research
Intel Corporation
Micron
MongoDB
NetApp, Inc.
Oracle Corporation
Salesforce
Samsung Information Systems America
Seagate Technology
Toshiba
Two Sigma
Veritas
Western Digital

Faster storage media, faster interconnection networks, and improvements in systems software have significantly mitigated the effect of I/O bottlenecks in HPC applications. Even so, applications that read and write data in small chunks are limited by the ability of both the hardware and the software to handle such workloads efficiently. Often, scientific applications partition their output using one file per process. This is a problem on HPC computers with hundreds of thousands of cores, and will only worsen with exascale computers, which will be an order of magnitude larger. To avoid wasting time creating output files on such machines, scientific applications are forced to use libraries that combine multiple I/O streams into a single file. For many applications where output is produced out-of-order, this must be followed by a costly, massive data sorting operation. DeltaFS allows applications to write to an arbitrarily large number of files, while also guaranteeing efficient data access without requiring sorting.

The first challenge when handling an arbitrarily large number of files is dealing with the resulting metadata load. We manage this using the transient and serverless DeltaFS file system [1]. The transient property of DeltaFS allows each program that uses it to individually control the amount of computing resources dedicated to the file system, effectively scaling metadata performance under application control. When combined with DeltaFS's serverless nature, file system design and provisioning decisions are decoupled from the overall design of the HPC platform. As a result, applications that create one file for each process are no longer tied to the platform storage system's ability to handle metadata-heavy workloads. The HPC platform can also provide scalable file creation rates without requiring a fundamental redesign of the platform's storage system.

The second challenge is guaranteeing both fast writing and reading for workloads that consist primarily of small I/O transfers. This work was inspired by interactions with cosmologists seeking to explore the trajectories of the highest energy particles in an astrophysics simulation using the VPIC plasma simulation code [2]. VPIC is a highly-optimized particle simulation code developed at Los Alamos National Laboratory (LANL). Each VPIC simulation proceeds in timesteps, and each process represents a bounding box in the physical simulation space that particles move through.
Every few timesteps the simulation stops, and each process creates a file and writes the data for the particles that are currently located within its bounding box. This is the default, file-per-process

Figure 1: DeltaFS in-situ indexing of particle data in an Indexed Massive Directory. While indexed particle data are exposed as one DeltaFS subfile per particle, they are stored as indexed log objects in the underlying storage. (The diagram contrasts traditional file-per-process output with DeltaFS file-per-particle output, from simulation processes through the file system API to a massive indexed object store answering trajectory queries.)

continued on page 11

FROM THE DIRECTOR'S CHAIR
GREG GANGER

THE PDL PACKET

THE PARALLEL DATA LABORATORY
School of Computer Science
Department of ECE
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3891
voice 412•268•6716
fax 412•268•3010

PUBLISHER
Greg Ganger

EDITOR
Joan Digney

The PDL Packet is published once per year to update members of the PDL Consortium. A pdf version resides in the Publications section of the PDL Web pages and may be freely distributed. Contributions are welcome.

THE PDL LOGO
Skibo Castle and the lands that comprise its estate are located in the Kyle of Sutherland in the northeastern part of Scotland. Both 'Skibo' and 'Sutherland' are names whose roots are from Old Norse, the language spoken by the Vikings who began washing ashore regularly in the late ninth century. The word 'Skibo' fascinates etymologists, who are unable to agree on its original meaning. All agree that 'bo' is the Old Norse for 'land' or 'place,' but they argue whether 'ski' means 'ships' or 'peace' or 'fairy hill.'

Although the earliest version of Skibo seems to be lost in the mists of time, it was most likely some kind of fortified building erected by the Norsemen. The present-day castle was built by a bishop of the Roman Catholic Church. Andrew Carnegie, after making his fortune, bought it in 1898 to serve as his summer home. In 1980, his daughter, Margaret, donated Skibo to a trust that later sold the estate. It is presently being run as a luxury hotel.

Hello from fabulous Pittsburgh!

25 years! This past fall, we celebrated 25 years of the Parallel Data Lab. Started by Garth after he defended his PhD dissertation on RAID at UC-Berkeley, PDL has seen growth and success that I can't imagine he imagined... from the early days of exploring new disk array approaches to today's broad agenda of large-scale storage and data center infrastructure research... from a handful of core CMU researchers and industry participants to a vibrant community of scores of CMU researchers and 20 sponsor companies. Amazing.

It has been another great year for the Parallel Data Lab, and I'll highlight some of the research activities and successes below. Others, including graduations, publications, awards, etc., can be found throughout the newsletter. But I can't not start with the biggest PDL news item of this 25th anniversary year: Garth has graduated ;). More seriously, 25 years after founding PDL, including guiding/nurturing it into a large research center with sustained success (25 years!), Garth decided to move back to Canada and take the reins (as President and CEO) of the new Vector Institute for AI. We wish him huge success with this new endeavor! Garth has been an academic role model, a mentor, and a friend to me and many others... we will miss him greatly, and he knows that we will always have a place for him at PDL events.

Because it overlaps in area with Vector, I'll start my highlighting of PDL activities with our continuing work at the intersection of machine learning (ML) and systems. We continue to explore new approaches to system support for large-scale machine learning, especially aspects of how ML systems should adapt and be adapted in cloud computing environments. Beyond our earlier focus on challenges around dynamic resource availability and time-varying resource interference, we continue to explore challenges related to training models over
geo-distributed data, training very large models, and how edge resources should be shared among inference applications using DNNs for video stream processing. We are also exploring how ML can be applied to make systems better, including even ML systems ;).

Indeed, much of PDL's expansive database systems research activity centers on embedding automation in DBMSs. With an eye toward simplifying administration and improving performance robustness, there are a number of aspects of Andy's overall vision of a self-driving database system being explored and realized. To embody them, and other ideas, a new open source DBMS called Peloton has been created and is being continuously enhanced. There also continue to be cool results and papers on better exploitation of NVM in databases, improved concurrency control mechanisms, and range query filtering. I thoroughly enjoy watching (and participating in) the great energy that Andy has infused into database systems research at CMU.

Of course, PDL continues to have a big focus on storage systems research at various levels. At the high end, PDL's long-standing focus on metadata scaling for scalable storage has led to continued research into the benefits of and approaches to allowing important applications to manage their own namespaces and metadata for periods of time.


In addition to bypassing traditional metadata bottlenecks entirely during the heaviest periods of activity, this approach promises opportunities for efficient in-situ index creation to enable fast queries for subsequent analysis activities. At the lower end, we continue to explore how software systems should be changed to maximize the value from NVM storage, including addressing read-write performance asymmetry and providing storage management features (e.g., page-level checksums, dedup, etc.) without yielding load/store efficiency. We're excited about continuing to work with PDL companies on understanding where storage hardware is (and should be) going and how it should be exploited in systems.

PDL continues to explore questions of resource scheduling for cloud computing, which grow in complexity as the breadth of application and resource types grows. Our cluster scheduling research continues to explore how job runtime estimates can be automatically generated and exploited to achieve greater efficiency. Our most recent work explores more robust ways of exploiting imperfectly-estimated runtime information, finding that providing full distributions of likely runtimes (e.g., based on the history of "similar" jobs) works quite well for real-world workloads, as reflected in real cluster traces. We are also exploring scheduling for adaptively-sized "virtual clusters" within public clouds, which introduces new questions about which machine types to allocate, how to pack them, and how aggressively to release them.

I continue to be excited about the growth and evolution of the storage systems and cloud classes created and led by PDL faculty — their popularity is at an all-time high again this year. These project-intensive classes prepare 100s of MS students to be designers and developers for future infrastructure systems. They build FTLs that store real data (in a simulated NAND Flash SSD), hybrid cloud file systems that work, cluster schedulers, efficient ML model training apps, etc. It's really rewarding for us and for them. In addition to our lectures and the projects, these classes each feature 3-5 corporate guest lecturers (thank you, PDL Consortium members!) bringing insight on real-world solutions, trends, and futures.

Many other ongoing PDL projects are also producing cool results. For example, to help our (and others') file systems research, we have developed a new file system aging suite, called Geriatrix.
Our key-value store research continues to expose new approaches to indexing and remote value access. This newsletter and the PDL website offer more details and additional research highlights.

I'm always overwhelmed by the accomplishments of the PDL students and staff, and it's a pleasure to work with them. As always, their accomplishments point at great things to come.

PARALLEL DATA LABORATORY

FACULTY
Greg Ganger (PDL Director), 412•268•1297, [email protected]
George Amvrosiadis, David Andersen, Lujo Bauer, Nathan Beckmann, Daniel Berger, Chuck Cranor, Lorrie Cranor, Christos Faloutsos, Kayvon Fatahalian, Rajeev Gandhi, Saugata Ghose, Phil Gibbons, Garth Gibson, Seth Copen Goldstein, Mor Harchol-Balter, Gauri Joshi, Todd Mowry, Onur Mutlu, Priya Narasimhan, David O'Hallaron, Andy Pavlo, Majd Sakr, M. Satyanarayanan, Srinivasan Seshan, Rashmi Vinayak, Hui Zhang

STAFF MEMBERS
Bill Courtright (PDL Executive Director), 412•268•5485, [email protected]
Karen Lindenfelser (PDL Administrative Manager), 412•268•6716, [email protected]
Jason Boles, Joan Digney, Chad Dougherty, Mitch Franzos, Alex Glikson, Charlene Zang

VISITING RESEARCHERS / POST DOCS
Rachata Ausavarungnirun, Hyeontaek Lim, Kazuhiro Saito

GRADUATE STUDENTS
Abutalib Aghayev, Joy Arulraj, Ben Blum, V. Parvathi Bhogaraju, Amirali Boroumand, Sol Boucher, Christopher Canel, Malhar Chaudhari, Dominic Chen, Haoxian Chen, Andrew Chung, Chris Fallin, Pratik Fegade, Ziqiang Feng, Samarth Gupta, Aaron Harlap, Kevin Hsieh, Fan Hu, Abhilasha Jain, Saksham Jain, Angela Jiang, Ellango Jothimurugesan, Saurabh Arun Kadekodi, Anuj Kalia, Rajat Kateja, Jin Kyu Kim, Thomas Kim, Vamshi Konagari, Jack Kosaian, Marcel Kost, Michael Kuchnik, Conglong Li, Kunmin Li, Yang Li, Yixin Luo, Lin Ma, Diptesh Majumdar, Ankur Mallick, Charles McGuffey, Prashanth Menon, Yuqing Miao, Wenqi Mou, Pooja Nilangekar, Yiqun Ouyang, Jun Woo Park, Aurick Qiao, Souptik Sen, Sivaprasad Sudhir, Aaron Tian, Dana Van Aken, Nandita Vijaykumar, Haoran Wang, Jianyu Wang, Justin Wang, Ziqi Wang, Jinliang Wei, Daniel Wong, Lin Xiao, Hao Zhang, Huanchen Zhang, Qing Zheng, Giulio Zhou

The CMU fence displays a farewell message to Garth.

YEAR IN REVIEW

May 2018
• 20th annual Spring Visit Day.
• Qing Zheng and Michael Kuchnik will be interning with LANL this summer.

April 2018
• Andy Pavlo received the 2018 Joel & Ruth Spira Teaching Award.
• Lorrie Cranor received the IAPP Leadership Award.
• Srinivasan Seshan was appointed Head of the Computer Science Dept. at CMU.
• Michael Kuchnik received an NDSEG Fellowship for his work on machine learning in HPC systems.
• Huanchen Zhang proposed his PhD research "Towards Space-Efficient High-Performance In-Memory Search Structures."
• Anuj Kalia proposed his thesis research "Efficient Networked Systems for Datacenter Fabrics with RPCs."
• Nathan Beckmann presented "LHD: Improving Cache Hit Rate by Maximizing Hit Density" at NSDI '18 in Renton, WA.
• Rajat Kateja presented "Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM" at NVMW '18 in San Diego, CA.
• Rachata Ausavarungnirun presented "MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency" at ASPLOS '18 in Williamsburg, VA.
• Jun Woo Park presented "3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty" at EuroSys '18 in Porto, Portugal.

March 2018
• Andy Pavlo won a Google Faculty Research Award for his research on automatic database management systems.

February 2018
• Andy Pavlo was awarded a Sloan Fellowship to continue his work on the study of database management systems, specifically main memory systems, non-relational systems (NoSQL), transaction processing systems (NewSQL) and large-scale data analytics.
• Yixin Luo successfully defended his PhD dissertation on "Architectural Techniques for Improving NAND Flash Memory Reliability."
• Six posters were presented at the 1st SysML Conference at Stanford U. on various work related to creating more efficient systems for machine learning.
• Charles McGuffey delivered his speaking skills talk on "Designing Algorithms to Tolerate Processor Faults."
• Qing Zheng gave his speaking skills talk "Light-Weight In-Situ Indexing for Scientific Workloads."

December 2017
• Mor Harchol-Balter and Onur Mutlu were made Fellows of the ACM. Mor was selected "for contributions to performance modeling and analysis of distributed computing systems." Onur, who is now at ETH Zurich, was chosen for "contributions to computer architecture research, especially in memory systems."
• Joy Arulraj proposed his PhD research "The Design & Implementation of a Non-Volatile Memory Database Management System."
• Dana Van Aken gave her speaking skills talk on "Automatic Database Management System Tuning Through Large-scale Machine Learning."

November 2017
• Qing Zheng presented "Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory" at PDSW-DISCS '17 in Denver, CO.

October 2017
• Lorrie Cranor was awarded the FORE Systems Chair of Computer Science.
• Lorrie Cranor won the top SIGCHI Award, given to individuals who promote the application of human-computer interaction research to pressing social needs.
• Qing Zheng gave his speaking skills talk on "Light-weight In-situ Analysis with Frugal Resource Usage."
• Rachata Ausavarungnirun presented "Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes" and Vivek Seshadri presented "Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology" at MICRO '17 in Cambridge, MA.
• Timothy Zhu presented "WorkloadCompactor: Reducing Datacenter Cost while Providing Tail Latency SLO Guarantees" at SoCC '17 in Santa Clara, CA.
• 25th annual PDL Retreat.

September 2017
• Garth Gibson to lead new Vector Institute for AI in Toronto.
• Hongyi Xin delivered his speaking skills talk on "Improving DNA Read Mapping with Error-resilient Seeds."

Greg Ganger and PDL alums Hugo Patterson (Datrium) and Jiri Schindler (HPE) enjoy social time at the PDL Retreat.

continued on page 32

RECENT PUBLICATIONS

Geriatrix: Aging What You See and What You Don't See. A File System Aging Approach for Modern Storage Systems

Saurabh Kadekodi, Vaishnavh Nagarajan, Gregory R. Ganger & Garth A. Gibson

2018 USENIX Annual Technical Conference (ATC '18). July 11–13, 2018, Boston, MA.

File system performance on modern primary storage devices (Flash-based SSDs) is greatly affected by aging of the free space, much more so than were mechanical disk drives. We introduce Geriatrix, a simple-to-use profile-driven file system aging tool that induces target levels of fragmentation in both allocated files (what you see) and remaining free space (what you don't see), unlike previous approaches that focus on just the former. This paper describes and evaluates the effectiveness of Geriatrix, showing that it recreates both fragmentation effects better than previous approaches. Using Geriatrix, we show that measurements presented in many recent file system papers are higher than should be expected, by up to 30% on mechanical (HDD) and up to 75% on Flash (SSD) disks. Worse, in some cases, the performance rank ordering of the file system designs being compared is different from the published results. Geriatrix will be released as open source software with eight built-in aging profiles, in the hopes that it can address the need created by the increased performance impact of file system aging in modern SSD-based storage.

Figure: Aging impact on Ext4 atop SSD and HDD. The three bars for each device represent the FS freshly formatted (unaged), aged with Geriatrix, and aged with Impressions. Although relatively small differences are seen with the HDD, aging has a big impact on FS performance on the SSD. Although their file fragmentation levels are similar, the higher free space fragmentation produced by Geriatrix induces larger throughput reductions than for Impressions.
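The aging approach is easy to picture in code. The toy Python loop below is our own sketch of profile-driven aging, not the Geriatrix implementation; the mount point and file size distribution are assumed inputs:

```python
import os
import random

def age_filesystem(mount: str, target_gb: float, size_dist, seed: int = 42):
    """Toy profile-driven aging: create files with sizes drawn from a target
    distribution and randomly delete older ones, fragmenting both the files
    that remain (what you see) and the free space (what you don't see)."""
    rng = random.Random(seed)
    live = []                      # files currently kept on the file system
    written = 0
    i = 0
    while written < target_gb * 2**30:
        size = rng.choice(size_dist)                  # bytes for this file
        path = os.path.join(mount, f"age_{i}.dat")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        live.append(path)
        written += size
        i += 1
        if live and rng.random() < 0.5:               # delete ~half of creates,
            os.remove(live.pop(rng.randrange(len(live))))  # punching random holes

# e.g., age a scratch mount with a bimodal small/large file profile:
# age_filesystem("/mnt/scratch", target_gb=10, size_dist=[4096, 1 << 20])
```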
A Case for Packing and Indexing in Cloud File Systems

Saurabh Kadekodi, Bin Fan, Adit Madan, Garth A. Gibson & Gregory R. Ganger

10th USENIX Workshop on Hot Topics in Cloud Computing. July 9, 2018, Boston, MA.

Tiny objects are the bane of highly scalable cloud object stores. Not only do tiny objects cause massive slowdowns, but they also incur tremendously high costs due to current operation-based pricing models. For example, in Amazon S3's current pricing scheme, uploading 1GB of data by issuing tiny (4KB) PUT requests (at 0.0005 cents each) is approximately 57x more expensive than storing that same 1GB for a month. To address this problem, we propose client-side packing of files into gigabyte-sized blobs with embedded indices to identify each file's location. Experiments with a packing implementation in Alluxio (an open-source distributed file system) illustrate the potential benefits, such as simultaneously increasing file creation throughput by up to 61x and decreasing cost by over 99.99%.
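The packing idea itself is straightforward to prototype. The following minimal Python sketch (ours, not the paper's Alluxio implementation) packs small files into a single blob with an embedded index, turning thousands of tiny PUTs into one large upload:

```python
import json
import struct

class BlobPacker:
    """Pack small files into one blob.
    Layout: [file bytes ...][JSON index][8-byte index length]."""
    def __init__(self):
        self.buf = bytearray()
        self.index = {}                        # file name -> (offset, length)

    def add(self, name: str, data: bytes):
        self.index[name] = (len(self.buf), len(data))
        self.buf += data

    def seal(self) -> bytes:
        idx = json.dumps(self.index).encode()
        return bytes(self.buf) + idx + struct.pack("<Q", len(idx))

def read_file(blob: bytes, name: str) -> bytes:
    """Recover one small file from a sealed blob using the embedded index."""
    (idx_len,) = struct.unpack("<Q", blob[-8:])
    index = json.loads(blob[-8 - idx_len:-8].decode())
    offset, length = index[name]
    return blob[offset:offset + length]

# packer = BlobPacker()
# for i in range(250_000):          # ~1GB of 4KB files becomes one PUT
#     packer.add(f"tiny/{i}", b"x" * 4096)
# blob = packer.seal()
```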

SOAP: One Clean Analysis of All Age-Based Scheduling Policies

Ziv Scully, Mor Harchol-Balter & Alan Scheller-Wolf

Proceedings of ACM SIGMETRICS 2018 Conference on Measurement and Modeling of Computer Systems, Los Angeles, CA, June 2018.

We consider an extremely broad class of M/G/1 scheduling policies called SOAP: Schedule Ordered by Age-based Priority. The SOAP policies include almost all scheduling policies in the literature as well as an infinite number of variants which have never been analyzed, or maybe not even conceived. SOAP policies range from classic policies, like first-come, first-serve (FCFS), foreground-background (FB), class-based priority, and shortest remaining processing time (SRPT), to much more complicated scheduling rules, such as the famously complex Gittins index policy and other policies in which a job's priority changes arbitrarily with its age. While the response time of policies in the former category is well understood, policies in the latter category have resisted response time analysis. We present a universal analysis of all SOAP policies, deriving the mean and Laplace-Stieltjes transform of response time.

Towards Optimality in Parallel Job Scheduling

Ben Berg, Jan-Pieter Dorsman & Mor Harchol-Balter

Proceedings of ACM SIGMETRICS 2018 Conference on Measurement and Modeling of Computer Systems, Los Angeles, CA, June 2018.

To keep pace with Moore's law, chip designers have focused on increasing the number of cores per chip rather than single core performance. In turn, modern jobs are often designed to run on any number of cores. However, to effectively leverage these multi-core chips, one must address the question of how many cores to assign to each job.

Given that jobs receive sublinear speedups from additional cores, there is an obvious tradeoff: allocating more cores to an individual job reduces the job's runtime, but in turn decreases the efficiency of the overall system. We ask how the system should schedule jobs across cores so as to minimize the mean response time over a stream of incoming jobs.

To answer this question, we develop an analytical model of jobs running on a multi-core machine. We prove that EQUI, a policy which continuously divides cores evenly across jobs, is optimal when all jobs follow a single speedup curve and have exponentially distributed sizes. EQUI requires jobs to change their level of parallelization while they run. Since this is not possible for all workloads, we consider a class of "fixed-width" policies, which choose a single level of parallelization, k, to use for all jobs. We prove that, surprisingly, it is possible to achieve EQUI's performance without requiring jobs to change their levels of parallelization by using the optimal fixed level of parallelization, k*. We also show how to analytically derive the optimal k* as a function of the system load, the speedup curve, and the job size distribution.

In the case where jobs may follow different speedup curves, finding a good scheduling policy is even more challenging. In particular, we find that policies like EQUI which performed well in the case of a single speedup curve now perform poorly. We propose a very simple policy, GREEDY*, which performs near-optimally when compared to the numerically-derived optimal policy.

3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty

Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch & Gregory R. Ganger

EuroSys '18, April 23–26, 2018, Porto, Portugal. Supersedes CMU-PDL-17-107, Nov. 2017.

The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. But existing schedulers use single-point estimates (e.g., mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles with 8–23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime histories and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma greatly outperforms a state-of-the-art scheduler that uses point estimates from a state-of-the-art predictor; in fact, the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best effort jobs.

LHD: Improving Cache Hit Rate by Maximizing Hit Density

Nathan Beckmann, Haoxian Chen & Asaf Cidon

15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18). April 9–11, 2018, Renton, WA.

Cloud application performance is heavily reliant on the hit rate of datacenter key-value caches. Key-value caches typically use least recently used (LRU) as their eviction policy, but LRU's hit rate is far from optimal under real workloads. Prior research has proposed many eviction policies that improve on LRU, but these policies make restrictive assumptions that hurt their hit rate, and they can be difficult to implement efficiently.

We introduce least hit density (LHD), a novel eviction policy for key-value caches. LHD predicts each object's expected hits-per-space-consumed (hit density), filtering objects that contribute little to the cache's hit rate. Unlike prior eviction policies, LHD does not rely on heuristics, but rather rigorously models objects' behavior using conditional probability to adapt its behavior in real time.

To make LHD practical, we design and implement RankCache, an efficient key-value cache based on memcached. We evaluate RankCache and LHD on commercial memcached and enterprise storage traces, where LHD consistently achieves better hit rates than prior policies. LHD requires much less space than prior policies to match their hit rate, on average 8x less than LRU and 2–3x less than recently proposed policies. Moreover, RankCache requires no synchronization in the common case, improving request throughput at 16 threads by 8x over LRU and by 2x over CLOCK.

Figure: Relative cache size needed to match LHD's hit rate on different traces. LHD requires roughly one-fourth of LRU's capacity, and roughly half of that of prior eviction policies.
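The hit density metric can be illustrated compactly. The sketch below is our paraphrase of the idea, not RankCache code: estimate, from age-binned hit and eviction counts, an object's expected hits divided by its expected remaining space-time, and evict the minimum.

```python
def hit_density(size, age, hits_by_age, evictions_by_age):
    """Expected future hits per byte*time for an object of this size and age,
    estimated from age-binned hit/eviction counts (conditional on age)."""
    n = len(hits_by_age)
    future_hits = sum(hits_by_age[age:])
    future_events = future_hits + sum(evictions_by_age[age:])
    if future_events == 0:
        return 0.0
    p_hit = future_hits / future_events        # P(object eventually hits | age)
    # Expected remaining lifetime, given the object already reached `age`.
    lifetime = sum((a - age) * (hits_by_age[a] + evictions_by_age[a])
                   for a in range(age, n)) / future_events
    return p_hit / (size * max(lifetime, 1.0))

def choose_victim(cache, hits_by_age, evictions_by_age):
    """cache: {key: (size, age)}; evict the key with the lowest hit density."""
    return min(cache, key=lambda k: hit_density(*cache[k],
                                                hits_by_age, evictions_by_age))
```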

Tributary: Spot-dancing for Elastic Services with Latency SLOs

Aaron Harlap, Andrew Chung, Alexey Tumanov, Gregory R. Ganger & Phillip B. Gibbons

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-102, Jan. 2018.

The Tributary elastic control system embraces the uncertain nature of transient cloud resources, such as AWS spot instances, to manage elastic services with latency SLOs more robustly and more cost-effectively. Such resources are available at lower cost, but with the proviso that they can be preempted en masse, making them risky to rely upon for business-critical services. Tributary creates models of preemption likelihood and exploits the partial independence among different resource offerings, selecting collections of resource allocations that will satisfy SLO requirements and adjusting them over time as client workloads change. Although Tributary's collections are often larger than required in the absence of preemptions, they are cheaper because of both lower spot costs and partial refunds for preempted resources. At the same time, the often-larger sets allow unexpected workload bursts to be handled without SLO violation. Over a range of web service workloads, we find that Tributary reduces cost for achieving a given SLO by 81–86% compared to traditional scaling on non-preemptible resources and by 47–62% compared to the high-risk approach of the same scaling with spot resources.

Figure: The Tributary Architecture.

MLtuner: System Support for Automatic Machine Learning Tuning

Henggang Cui, Gregory R. Ganger & Phillip B. Gibbons

arXiv:1803.07445v1 [cs.LG], 20 Mar. 2018.

MLtuner automatically tunes settings for training tunables — such as the learning rate, the momentum, the mini-batch size, and the data staleness bound — that have a significant impact on large-scale machine learning (ML) performance. Traditionally, these tunables are set manually, which is unsurprisingly error prone and difficult to do without extensive domain knowledge. MLtuner uses efficient snapshotting, branching, and optimization-guided online trial-and-error to find good initial settings as well as to re-tune settings during execution. Experiments show that MLtuner can robustly find and re-tune tunable settings for a variety of ML applications, including image classification (for 3 models and 2 datasets), video classification, and matrix factorization. Compared to state-of-the-art ML auto-tuning approaches, MLtuner is more robust for large problems and over an order of magnitude faster.
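The snapshot-and-branch loop at MLtuner's core can be sketched as follows (a toy rendition; `train_step` and `eval_loss` are assumed application callbacks, and the real system snapshots far more cheaply than a deep copy):

```python
import copy

def tune_once(model, candidates, train_step, eval_loss, trial_steps=100):
    """Trial-and-error tuning sketch: branch a short training run from a
    snapshot for each candidate setting, keep whichever branch makes the
    most progress, and resume training from that branch's state."""
    best_loss, best_setting, best_model = float("inf"), None, None
    for setting in candidates:               # e.g. learning rates to try
        branch = copy.deepcopy(model)        # "snapshot + branch"
        for _ in range(trial_steps):
            train_step(branch, setting)      # one training step with setting
        loss = eval_loss(branch)
        if loss < best_loss:
            best_loss, best_setting, best_model = loss, setting, branch
    return best_setting, best_model          # continue training from here
```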
Addressing the Long-Lineage Bottleneck in Apache Spark

Haoran Wang, Jinliang Wei & Garth Gibson

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-18-101, January 2018.

Apache Spark employs lazy evaluation [11, 6]; that is, in Spark, a dataset is represented as a Resilient Distributed Dataset (RDD), and a single-threaded application (driver) program simply describes transformations (RDD to RDD), referred to as lineage [7, 12], without performing distributed computation until output is requested. The lineage traces computation and dependency back to external (and assumed durable) data sources, allowing Spark to opportunistically cache intermediate RDDs, because it can recompute everything from external data sources. To initiate computation on worker machines, the driver process constructs a directed acyclic graph (DAG) representing computation and dependency according to the requested RDD's lineage. Then the driver broadcasts this DAG to all involved workers, requesting they execute their portion of the result RDD. When a requested RDD has a long lineage, as one would expect from iterative convergent or streaming applications [9, 15], constructing and broadcasting computational dependencies can become a significant bottleneck. For example, when solving matrix factorization using Gemulla's iterative convergent

continued on page 20
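The lineage problem is easy to reproduce with stock PySpark, as sketched below: each loop iteration appends a stage to the DAG, and periodic checkpointing (Spark's standard mitigation, not necessarily the report's proposed fix) truncates the lineage. The checkpoint directory and iteration counts are illustrative.

```python
from pyspark import SparkContext

sc = SparkContext(appName="long-lineage-demo")
sc.setCheckpointDir("/tmp/spark-checkpoints")   # assumed scratch directory

rdd = sc.parallelize(range(1_000_000))
for i in range(200):                  # iterative-convergent style loop
    rdd = rdd.map(lambda x: x + 1)    # each iteration lengthens the lineage
    if i % 50 == 49:
        rdd.cache()                   # keep the data after materializing
        rdd.checkpoint()              # truncate lineage at the next action...
        rdd.count()                   # ...which this action forces immediately

print(rdd.count())                    # driver now broadcasts a short DAG
```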

AWARDS & OTHER PDL NEWS

April 2018
Andy Pavlo Receives 2018 Joel & Ruth Spira Teaching Award

The School of Computer Science honored outstanding faculty and staff members April 5 during the annual Founder's Day ceremony in Rashid Auditorium. It was the seventh year for the event, which was hosted by Dean Andrew Moore. Andy Pavlo, Assistant Professor in the Computer Science Department (CSD), was the winner of the Joel and Ruth Spira Teaching Award, sponsored by Lutron Electronics Co. of Coopersburg, Pa., in honor of the company's founders and the inventor of the electronic dimmer switch.
--CMU Piper, April 5, 2018

April 2018
Lorrie Cranor Receives IAPP Leadership Award

Lorrie Cranor has received the 2018 Leadership Award from The International Association of Privacy Professionals (IAPP). Cranor, a professor in the Institute for Software Research and the Department of Engineering and Public Policy, accepted the award at the IAPP's Global Privacy Summit on March 27. "Lorrie Cranor, for 20 years, has been a leading voice and a leader in the privacy field," said IAPP President and CEO Trevor Hughes. "She developed some of the earliest privacy enhancing technologies, she developed a groundbreaking program at Carnegie Mellon University to create future generations of privacy engineers, and she has been a steadfast supporter, participant and leader of the field of privacy for that entire time. Her merits as recipient for our privacy leadership award are unimpeachable. She's as great a person as we have in our world." The IAPP Leadership Award is given annually to individuals who demonstrate an "ongoing commitment to furthering privacy policy, promoting recognition of privacy issues and advancing the growth and visibility of the privacy profession." Cranor helped develop and is now co-director of CMU's MSIT-Privacy Engineering master's degree program, as well as director of the CyLab Usable Privacy and Security Laboratory.
--CMU Piper, April 5, 2018

April 2018
Welcome Baby Nora!

Pete and Laura Losi, and Grandma Karen Lindenfelser, are thrilled to announce that Nora Grace joined big sister Layla Anne and big cousin Landon Thomas to become a family of four (five if you count Rudy, the grand-dog). Nora was born Friday the 13th at 11:50 am, at 7 lbs and 19.5 inches.

April 2018
Srinivasan Seshan Appointed Head of CSD

Srinivasan Seshan has been appointed head of the Computer Science Department (CSD), effective July 1. He succeeds Frank Pfenning, who will return to full-time teaching and research. "We are all excited about Srini Seshan's new role as head of CSD," said School of Computer Science Dean Andrew Moore. "He is an outstanding researcher and teacher, and I'm confident that his expanded role in leadership will help the department reach even greater heights." Seshan joined the CSD faculty in 2000, and served as the department's associate head for graduate education from 2011 to 2015. His research focuses on improving the design, performance and security of computer networks, including wireless and mobile networks. He earned his bachelor's, master's and doctoral degrees in computer science at the University of California, Berkeley. He worked as a research staff member at IBM's T.J. Watson Research Center for five years before joining Carnegie Mellon.
--CMU Piper, April 5, 2018

March 2018
Andy Pavlo Wins Google Faculty Research Award

The CMU Database Group and the PDL are pleased to announce that Prof. Andy Pavlo has won a 2018 Google Faculty Research Award. This award was for his research on automatic database management systems. Andy was one of a total of 14 faculty members at Carnegie Mellon University selected for this award. The Google Faculty Research Awards program is an annual open call for proposals on computer science and related topics such as machine learning, machine perception, natural language processing, and quantum computing. Grants cover tuition for a graduate student and provide both faculty and students the opportunity to work directly with Google researchers and engineers.

This round received 1033 proposals covering 46 countries and over 360 universities, from which 152 were chosen to fund. The subject areas that received the most support this year were human computer interaction, machine learning, machine perception, and systems.
-- Google and CMU Database Group News, March 20, 2018

February 2018
Lorrie Cranor Wins Top SIGCHI Award

Lorrie Cranor, a professor in the Institute for Software Research and the Department of Engineering and Public Policy, is this year's recipient of the Social Impact Award from the Association for Computing Machinery Special Interest Group on Computer Human Interaction (SIGCHI). The Social Impact Award is given to mid-level or senior individuals who promote the application of human-computer interaction research to pressing social needs, and includes an honorarium of $5,000, the opportunity to give a talk about the awarded work at the CHI conference, and lifetime invitations to the annual SIGCHI award banquet.

"Lorrie's work has had a huge impact on the ability of non-technical users to protect their security and privacy through her user-centered approach to security and privacy research and development of numerous tools and technologies," said Blase Ur, who prepared Lorrie's nomination. Ur is a former Ph.D. student of Lorrie's, and is now an assistant professor at the University of Chicago. In addition to Ur, three former students from Cranor's CyLab Usable Privacy and Security Lab – Michelle Mazurek, Florian Schaub and Yang Wang – supported Lorrie's nomination. "All four of us are currently assistant professors, spread out across the United States," said Ur, who received his doctorate degree in 2016. "In addition to this impact on end users, the four of us who jointly nominated her have also benefitted greatly from her mentorship."

A full summary of this year's SIGCHI award recipients can be found on the organization's website.
-- info from CyLab News, Daniel Tkacik, Feb. 23, 2018

February 2018
Andy Pavlo Awarded a Sloan Fellowship

"The Sloan Research Fellows represent the very best science has to offer," said Sloan President Adam Falk. "The brightest minds, tackling the hardest problems, and succeeding brilliantly — fellows are quite literally the future of 21st century science." Andrew Pavlo, an assistant professor of computer science, specializes in the study of database management systems, specifically main memory systems, non-relational systems (NoSQL), transaction processing systems (NewSQL) and large-scale data analytics. He is a member of the Database Group and the Parallel Data Laboratory. He joined the Computer Science Department in 2013 after earning a Ph.D. in computer science at Brown University. He won the 2014 Jim Gray Doctoral Dissertation Award from the Association for Computing Machinery's (ACM) Special Interest Group on the Management of Data.
-- Carnegie Mellon University News, Feb. 15, 2018

December 2017
Mor Harchol-Balter and Onur Mutlu Fellows of the ACM

Congratulations to Mor (Professor of CS) and Onur (adjunct Professor of ECE), who have been made Fellows of the ACM. From the ACM website: "To be selected as a Fellow is to join our most renowned member grade and an elite group that represents less than 1 percent of ACM's overall membership," explains ACM President Vicki L. Hanson. "The Fellows program allows us to shine a light on landmark contributions to computing, as well as the men and women whose hard work, dedication, and inspiration are responsible for groundbreaking work that improves our lives in so many ways." Mor was selected "for contributions to performance modeling and analysis of distributed computing systems." Onur, who is now at ETH Zurich, was chosen for "contributions to computer architecture research, especially in memory systems."
--with info from www.acm.org

December 2017
Welcome Baby Sebastian!

In not-unexpected news, David, Erica and big sister Aria are delighted to announce the arrival of a squirmy and very snuggly addition to their family. Sebastian Alexander Andersen-Fuchs was born December 11, 2017, at 11:47 am, at 8 lb 8 oz and 21" long. Mom and baby are healthy, and Aria is very excited to be a big sister.

November 2017
Welcome Baby Will!

Kevin Hsieh and his wife would like to share the news of their new baby! Will was born on November 15, 2017 at 11:15 am (not a typo...). He was born at 6 lb 7 oz and 20" long. Since then, he has been growing very well and keeping his family busy.

October 2017
Welcome Baby Jonas!

Jason & Chien-Chiao Boles are excited to announce the arrival of their son Jonas at 7:42 pm, October 18th. Jonas was born a few weeks early — a surprise for us all. Everyone is doing well so far.

October 2017
Lorrie Cranor Awarded FORE Systems Chair of Computer Science

We are very pleased to announce that, in addition to a long list of accomplishments, which has included a term as the Chief Technologist of the Federal Trade Commission, Lorrie Cranor has been made the FORE Systems Professor of Computer Science and Engineering & Public Policy at CMU. Lorrie provided information that "the founders of FORE Systems, Inc. established the FORE Systems Professorship in 1995 to support a faculty member in the School of Computer Science. The company's name is an acronym formed by the initials of the founders' first names. Before it was acquired by Great Britain's Marconi in 1998, FORE created technology that allows computer networks to link and transfer information at a rapid speed. Ericsson purchased much of Marconi in 2006." The chair was previously held by CMU University Professor Emeritus Edmund M. Clarke.

September 2017
Garth Gibson to Lead New Vector Institute for AI in Toronto

In January of 2018, PDL's founder, Garth Gibson, became President and CEO of the Vector Institute for AI in Toronto. Vector's website states that "Vector will be a leader in the transformative field of artificial intelligence, excelling in machine and deep learning — an area of scientific, academic, and commercial endeavour that will shape our world over the next generation." Frank Pfenning, Head of the Department of Computer Science, notes that "this is a tremendous opportunity for Garth, but we will sorely miss him in the multiple roles he plays in the department and school: Professor (and all that this entails), Co-Director of the MCDS program, and Associate Dean for Masters Programs in SCS." We are sad to see him go and will miss him greatly, but the opportunities presented here for world-level innovation are tremendous and we wish him all the best.

June 2017
Satya Honored for Creation of Andrew File System

The Association for Computing Machinery has named the developers of CMU's pioneering Andrew File System (AFS) the recipients of its prestigious 2016 Software System Award. AFS was the first distributed file system designed for tens of thousands of machines, and pioneered the use of scalable, secure and ubiquitous access to shared file data. To achieve the goal of providing a common shared file system used by large networks of people, AFS introduced novel approaches to caching, security, management and administration. The award recipients, including CS Professor Mahadev Satyanarayanan, built the Andrew File System in the 1980s while working as a team at the Information Technology Center (ITC) — a partnership between Carnegie Mellon and IBM.

The ACM Software System Award is presented to an institution or individuals recognized for developing a software system that has had a lasting influence, reflected in contributions to concepts, in commercial acceptance, or both. AFS is still in use today as both an open-source system and as a file system in commercial applications. It has also inspired several cloud-based storage applications. Many universities integrated AFS before it was introduced as a commercial application.
-- Byron Spice, The Piper, June 1, 2017

MASSIVE INDEXED DIRECTORIES IN DELTAFS
continued from page 1

mode of VPIC. For each timestep, 40 bytes of data is produced per particle, representing the particle's spatial location, velocity, energy, etc. We refer to the entire particle data written at the same timestep as a frame, because frame data is often used by domain scientists to construct false-color movies of the simulation state over time. Large-scale VPIC simulations have been conducted with up to trillions of particles, generating terabytes of data for each frame.

Domain scientists are often interested in a tiny subset of particles with specific characteristics, such as high energy, that is not known until the simulation ends. All data for each such particle is gathered for further analysis, such as visualizing its trajectory through space over time. Unfortunately, particle data within a frame is written out of order, since output order depends on the particles' spatial location. Therefore, in order to locate individual particles' data over time, all output data must be sorted before they can be analyzed.

For scientists working with VPIC, it would be significantly easier programmatically to create a separate file for each particle, and append a 40-byte data record on each timestep. This would reduce analysis queries to sequentially reading the contents of a tiny number of particle files. Attempting to do this in today's parallel file systems, however, would be disastrous for performance. Expecting existing HPC storage stacks and file systems to adapt to scientific needs such as this one, however, is lunacy. Parallel file systems are designed to be long-running, robust services that work across applications. They are typically kernel resident, mainly developed to manage the hardware, and primarily optimized for large sequential data access.
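To make the contrast concrete, the sketch below shows what file-per-particle output could look like from the application's point of view. The `deltafs` handle and its calls are hypothetical stand-ins rather than the real DeltaFS API; in DeltaFS, such tiny appends are buffered, partitioned, and indexed into large log objects behind the scenes.

```python
import struct

def write_frame(deltafs, dirh, particles, timestep):
    """Append one small record per particle into its own subfile."""
    for p in particles:
        # 36-byte record (the article's records are 40 bytes): timestep
        # plus position, velocity, and energy.
        rec = struct.pack("<q7f", timestep, p.x, p.y, p.z,
                          p.vx, p.vy, p.vz, p.energy)
        with deltafs.open(dirh, f"particle_{p.id}", "a") as f:
            f.write(rec)

def read_trajectory(deltafs, dirh, particle_id):
    """A trajectory query is now a sequential read of one tiny subfile."""
    with deltafs.open(dirh, f"particle_{particle_id}", "r") as f:
        return f.read()
```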
DeltaFS aims to provide this file-per-particle representation to applications, while ensuring that storage hardware is utilized to its full performance potential. A comparison of the file-per-process (current state-of-the-art) and file-per-particle (DeltaFS) representations is shown in Figure 1.

To improve the performance of applications with small I/O access patterns similar to VPIC, we propose an Indexed Massive Directory: a new technique for indexing data in-situ as it is written to storage. In-situ indexing allows massive amounts of data to be written to a single directory simultaneously, in an arbitrarily large number of files, with the goal of efficiently recalling data written to the same file without requiring any time-consuming data post-processing steps to reorganize it. This greatly improves the readback performance of applications, at the price of small overheads associated with partitioning and indexing the data during writing. We achieve this through a memory-efficient indexing mechanism for reordering and indexing data, and a log-structured storage layout that packs small writes into large log objects, all while ensuring compute node resources are used frugally.

We evaluated the efficiency of the Indexed Massive Directory on LANL's Trinity hardware (Figure 2). By applying in-situ partial sorting of VPIC's particle output, we demonstrated over 5000x speedup in reading a single particle's trajectory from a 48-billion particle simulation output using only a single CPU core, compared to post-processing the entire dataset (10 TiB) using the same amount of CPU cores as the original simulation. This speedup increases with simulation scale, while the total memory used for partial sorting is fixed at 3% of the memory available to the simulation code. The cost of this read acceleration is the increased work in the in-situ pipeline and the additional storage capacity dedicated to storing the indexes. These results are encouraging, as they indicate that the output write buffering stage of the software-defined storage stack can be leveraged for one or more forms of efficient in-situ analysis, and can be applied to more kinds of query workloads.

Figure 2: Results from real VPIC simulation runs with and without DeltaFS on LANL's Trinity computer: (a) query time, (b) output size, (c) frame write time.

For more information, please see [3] or visit our project page at www.pdl.cmu.edu/DeltaFS/

References

[1] Zheng, Q., Ren, K., Gibson, G., Settlemyer, B. W., and Grider, G. DeltaFS: Exascale File Systems Scale Better without Dedicated Servers. In Proceedings of the 10th Parallel Data Storage Workshop (PDSW 15), pp. 1–6.

[2] Byna, S., Sisneros, R., Chadalavada, K., and Koziol, Q. Tuning Parallel I/O on Blue Waters for Writing 10 Trillion Particles. In Cray User Group (CUG) (2015).

[3] Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth Gibson, Chuck Cranor, Brad Settlemyer, Gary Grider, Fan Guo. Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory. PDSW-DISCS 2017, Denver, CO, November 2017.

3SIGMA

3Sigma: Distribution-Based Cluster Scheduling for Runtime Uncertainty
Jun Woo Park, Greg Ganger and the PDL 3Sigma Group

Modern cluster schedulers face a daunting task. Modern clusters support a diverse mix of activities, including exploratory analytics, software development and test, scheduled content generation, and customer-facing services [2]. Pending work is typically mapped to heterogeneous resources to satisfy deadlines for business-critical jobs, minimize delays for interactive best-effort jobs, maximize efficiency, and so on. Cluster schedulers are expected to make that happen.

Knowledge of the runtimes of these pending jobs has been identified as a powerful building block for modern cluster schedulers. With it, a scheduler can pack jobs more aggressively in a cluster's resource assignment plan, for instance by allowing a latency-sensitive best-effort job to run before a high-priority batch job, provided that the priority job will still meet its deadline. Runtime knowledge also allows a scheduler to determine whether it is better to start a job immediately on suboptimal machine types with worse expected performance, wait for the jobs currently occupying the preferred machines to finish, or preempt them. Exploiting job runtime knowledge leads to better, more robust scheduler decisions than relying on hard-coded assumptions.

In most cases, the job runtime estimates are based on previous runtimes observed for similar jobs (e.g., from the same user or by the same periodic job script). When such estimates are accurate, the schedulers relying on them outperform those using other approaches.

However, we find that estimate errors, while expected in large, multi-use clusters, cover an unexpectedly large range. Applying a state-of-the-art ML-based predictor [1] to three real-world traces, including the well-studied Google cluster trace [2] and new traces from data analysis clusters used at a hedge fund and a scientific site, shows good estimates in general (e.g., 77–92% within a factor of two of the actual runtime, and most much closer). Unfortunately, 8–23% are not within that range, and some are off by an order of magnitude or more. Thus, a significant percentage of runtime estimates will be well outside the error ranges previously reported. Worse, we find that schedulers relying on runtime estimates cope poorly with such error profiles. Comparing the middle two bars of Fig. 1 shows one example of how much worse a state-of-the-art scheduler does with real estimate error profiles as compared to having perfect estimates.

Figure 1: Comparison of 3Sigma with three other scheduling approaches w.r.t. SLO (deadline) miss rate, for a mix of SLO and best effort jobs derived from the Google cluster trace [2] on a 256-node cluster. 3Sigma, despite estimating runtime distributions online with imperfect knowledge of job classification, approaches the performance of a hypothetical scheduler using perfect runtime estimates (PointPerfEst). Full historical runtime distributions and mis-estimation handling help 3Sigma outperform PointRealEst, a state-of-the-art point-estimate-based scheduler. The value of exploiting runtime information, when done well, is confirmed by comparison to a conventional priority-based approach (Prio).

Our 3Sigma cluster scheduling system uses all of the relevant runtime history for each job rather than just a point estimate derived from it: it uses expected runtime distributions (e.g., the histogram of observed runtimes), taking advantage of the much richer information (e.g., variance, possible multi-modal behaviors, etc.) to make more robust decisions. The first bar of Fig. 1 illustrates 3Sigma's efficacy. By considering the range of possible runtimes for a job, and their likelihoods, 3Sigma can explicitly consider the various potential outcomes from each possible plan and select a plan based on optimizing the expected outcome. For example, the predicted distribution for one job might have low variance, indicating that the scheduler can be aggressive in packing it in, whereas another job's high variance might suggest that it should be scheduled early (relative to its deadline). 3Sigma similarly exploits the runtime distribution to adaptively address the problem of point over-estimates, which may cause a scheduler to avoid scheduling a job based on the likelihood of missing its deadline.

In application, 3Sigma replaces the job scheduling component of a cluster manager (e.g., YARN). The cluster manager remains responsible for job and resource life-cycle management. Job requests are received asynchronously by 3Sigma from the cluster manager (Step 1 of Fig. 2). As is typical for such systems, the specification of the request includes a number of attributes, such as (1) the name of the job to be run, (2) the type of job to be run (e.g., MapReduce), (3) the user submitting the job, and (4) a specification of the resources requested.
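The distribution-based valuation at the heart of this approach can be sketched in a few lines of Python (a toy version with assumed utility values, not 3Sigma's actual planner): score each candidate plan by the probability-weighted utility over the job's predicted runtime histogram.

```python
def expected_utility(runtime_hist, start, deadline, value, penalty=0.0):
    """runtime_hist: {runtime_seconds: count} from similar historical jobs.
    Weight each possible completion time by its empirical probability."""
    total = sum(runtime_hist.values())
    return sum((count / total) * (value if start + rt <= deadline else penalty)
               for rt, count in runtime_hist.items())

def pick_plan(runtime_hist, candidate_starts, deadline, value):
    """Choose the candidate start time with the best expected outcome."""
    return max(candidate_starts,
               key=lambda s: expected_utility(runtime_hist, s, deadline, value))

# A low-variance job can be packed tightly; a high-variance one gets slack:
# pick_plan({3600: 8, 4000: 2}, [0, 1800, 3600], deadline=7200, value=1.0)
```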

Figure 2: End-to-end system integration. (Diagram: job submissions (1) flow from the Cluster Manager to 3σPredict, which uses a feature history and expert selector to attach a runtime distribution (2); 3σSched's Scheduling Option Generator, Optimization Compiler, and Optimization Solver produce job placements (3); measured runtimes (4) feed back into the history.)

The role of the predictor component 3σPredict is to provide the core scheduler with a probability distribution of the execution time of the submitted job. 3σPredict does this by maintaining a history of previously executed jobs, identifying a set of jobs that, based on their attributes, are similar to the current job, and deriving the runtime distribution from the selected jobs' historical runtimes (Step 2 of Fig. 2). Given a distribution of expected job runtimes and request specifications, the core scheduler, 3σSched, decides which jobs to place on which resources and when. The scheduler evaluates the expected utility of each option and the expected resource consumption and availability over the scheduling horizon. Valuations and computed resource capacity are then compiled into an optimization problem, which is solved by an external solver. 3σSched translates the solution into an updated schedule and submits the schedule to the cluster manager (Step 3 of Fig. 2). On completion, the job's actual runtime is recorded by 3σPredict (along with the attribute information from the job) and incorporated into the job history for future predictions (Step 4 of Fig. 2).
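A minimal sketch of that history-to-distribution step might look as follows (our illustration; the attribute names and similarity rule are assumptions rather than 3σPredict's actual feature set):

```python
from collections import defaultdict

class RuntimePredictor:
    """Bucket completed jobs by attribute combinations, most specific first,
    and answer with the runtime histogram of the first bucket with history."""
    def __init__(self, min_samples=5):
        self.history = defaultdict(list)        # feature tuple -> runtimes
        self.min_samples = min_samples

    @staticmethod
    def _features(attrs):
        # Ordered most-specific to least-specific.
        return [(attrs["user"], attrs["name"]),
                (attrs["user"], attrs["type"]),
                (attrs["user"],),
                (attrs["type"],)]

    def record(self, attrs, runtime):           # Step 4: fold results back in
        for feat in self._features(attrs):
            self.history[feat].append(runtime)

    def distribution(self, attrs):              # Step 2: distribution, not point
        for feat in self._features(attrs):
            runtimes = self.history[feat]
            if len(runtimes) >= self.min_samples:
                return runtimes                 # empirical distribution
        return None                             # fall back to a generic prior
```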

o

t

c

e

l Optimization Compiler well (i.e., comparably to PointPerfEst)

e

s

t r under a variety of conditions, such as

e

p

x E Optimization Solver varying cluster load, relative SLO job deadlines, and prediction inaccuracy. 3. Job placement Fifth, we show that the 3Sigma com- ponents (3σPredict and 3σSched) can Figure 2: End-to-end system integration scale to >10000 nodes. Overall, we see that 3Sigma robustly exploits runtime The role of the predictor component ments with production-derived work- 3σPredict is to provide the core sched- loads demonstrate 3Sigma’s effective- distributions to improve SLO attain- uler with a probability distribution of ness. Using its imperfect but automat- ment and best-effort performance, the execution time of the submitted ically-generated history-based runtime dealing gracefully with the complex job. 3σPredict does this by maintaining distributions, 3Sigma outperforms runtime variations seen in real cluster a history of previously executed jobs, both a state-of-the-art point-estimate- environments. identifying a set of jobs that, based based scheduler and a priority-based For more information, please see [3] on their attributes, are similar to the (runtime-unaware) scheduler, espe- or visit www.pdl.cmu.edu/TetriSched/ current job and deriving the runtime cially for mixes of deadline-oriented distribution the selected jobs’ historical jobs and latency-sensitive jobs on References runtimes (Step 2 of Fig. 2). Given a heterogeneous resources. 3Sigma si- distribution of expected job runtimes multaneously provides higher (1) SLO [1] Alexey Tumanov, Angela Jiang, Jun and request specifications, the core attainment for deadline-oriented jobs Woo Park, Michael A. Kozuch, and scheduler, 3σSched decides which jobs and (2) cluster goodput (utilization). Gregory R. Ganger. 2016. JamaisVu: to place on which resources and when. Our evaluation of 3Sigma, yielded five Robust Scheduling with AutoEsti- The scheduler evaluates the expected key takeaways. First, 3Sigma achieves mated Job Runtimes. Technical Report utility of each option and the expected significant improvement over the state- CMU-PDL-16-104. Carnegie Mellon resource consumption and availability of-the-art in SLO miss rate, best-effort University. over the scheduling horizon. Valua- job goodput, and best-effort latency in [2] Charles Reiss, Alexey Tumanov, tions and computed resource capacity a fully-integrated real cluster deploy- Gregory R. Ganger, Randy H. Katz, are then compiled into an optimization ment, approaching the performance and Michael A. Kozuch. 2012. Het- problem, which is solved by an external of the unrealistic PointPerfEst in SLO erogeneity and Dynamicity of Clouds at solver. 3σSched translates the solution miss rate and BE latency. Second, all into an updated schedule and submits Scale: Google Trace Analysis. In Proc. of the 3σSched component features of the 3nd ACM Symposium on Cloud the schedule to the cluster manager are important, as seen via a piecewise Computing (SOCC ’12). (Step 3 of Fig. 2). On completion, benefit attribution. Third, estimated the job’s actual runtime is recorded distributions are beneficial in sched- [3] Jun Woo Park, Alexey Tumanov, by 3σPredict (along with the attribute uling even if they are somewhat in- Angela Jiang, Michael A. Kozuch, information from the job) and incor- accurate, and such inaccuracies are Gregory R. Ganger. 3Sigma: Distri- porated into the job history for future better handled by distribution-based bution-based Cluster Scheduling for predictions (Step 4 of Fig. 2). scheduling than point-estimate-based Runtime Uncertainty. 
For more information, please see [3] or visit www.pdl.cmu.edu/TetriSched/

References

[1] Alexey Tumanov, Angela Jiang, Jun Woo Park, Michael A. Kozuch, and Gregory R. Ganger. JamaisVu: Robust Scheduling with Auto-Estimated Job Runtimes. Technical Report CMU-PDL-16-104, Carnegie Mellon University, 2016.

[2] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis. In Proc. of the 3rd ACM Symposium on Cloud Computing (SoCC '12), 2012.

[3] Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch, and Gregory R. Ganger. 3Sigma: Distribution-based Cluster Scheduling for Runtime Uncertainty. In Proc. of EuroSys '18, April 23–26, 2018, Porto, Portugal.

DEFENSES & PROPOSALS

DISSERTATION ABSTRACT: Architectural Techniques for Improving NAND Flash Memory Reliability

Yixin Luo
Carnegie Mellon University, SCS
PhD Defense — February 9, 2018

Raw bit errors are common in NAND flash memory and will increase in the future. These errors reduce flash reliability and limit the lifetime of a flash memory device. This dissertation improves flash reliability with a multitude of low-cost architectural techniques. We show that NAND flash memory reliability can be improved at low cost and with low performance overhead by deploying various architectural techniques that are aware of higher-level application behavior and underlying flash device characteristics.

This dissertation analyzes flash error characteristics and workload behavior through rigorous experimental characterization, and designs new flash controller algorithms that use the insights gained from our analysis to improve flash reliability at low cost. We investigate four novel directions. (1) We propose a new technique called WARM that improves flash lifetime by 12.9 times by managing flash retention differently for write-hot data and write-cold data. (2) We propose a new framework that learns an online flash channel model for each chip and enables four new flash controller algorithms to improve flash write endurance by up to 69.9%. (3) We identify three new error characteristics in 3D NAND flash memory through comprehensive experimental characterization of real 3D NAND chips, and propose four new techniques that mitigate these new errors and improve the 3D NAND raw bit error rate by up to 66.9%. (4) We propose a new technique called HeatWatch that improves 3D NAND lifetime by 3.85 times by utilizing the self-healing effect to mitigate retention errors in 3D NAND.

Greg Ganger, PDL alum Michael Abd-El-Malek (Google), and Bill Courtright enjoy social time at the PDL Retreat.

DISSERTATION ABSTRACT: Fast Storage for File System Metadata

Kai Ren
Carnegie Mellon University, SCS
PhD Defense — August 8, 2017

In an era of big data, the rapid growth of the data that many companies and organizations produce and manage continues to drive efforts to improve the scalability of storage systems. The number of objects present in storage systems continues to grow, making metadata management critical to the overall performance of file systems. Many modern parallel applications are also shifting toward shorter durations and larger degrees of parallelism. Such trends expose storage systems to increasingly diverse metadata-intensive workloads.

The goal of this dissertation is to improve metadata management in both local and distributed file systems. The dissertation focuses on two aspects. One is to improve the out-of-core representation of file system metadata by exploring the use of log-structured, multi-level approaches to provide a unified and efficient representation for different types of secondary storage devices (e.g., traditional hard disks and solid state disks). We have designed and implemented TableFS and its improved version SlimFS, which are 50% to 10x faster than traditional Linux file systems. The other aspect is to demonstrate that such a representation can also be flexibly integrated with many namespace distribution mechanisms to scale the metadata performance of distributed file systems, and to provide better support for a variety of big data applications in data center environments. Our distributed metadata middleware IndexFS can help improve metadata performance for PVFS, Lustre and HDFS by scaling to as many as 128 metadata servers.
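The first aspect, an out-of-core metadata representation, packs file system metadata into a write-optimized key-value store. The sketch below illustrates the general idea of a TableFS-style schema, with directory entries keyed by (parent inode, name); the schema details and the plain dict standing in for an LSM-tree are illustrative assumptions, not the TableFS code.

    import stat

    class MetadataTable:
        """Toy TableFS-style metadata store: one key-value entry per
        directory entry, keyed by (parent inode number, entry name).
        A real implementation would back this with an LSM-tree
        (e.g., LevelDB), turning metadata inserts into sequential
        log writes."""

        def __init__(self):
            self.kv = {}          # stand-in for the LSM-tree
            self.next_ino = 2     # inode 1 is the root directory

        def create(self, parent_ino, name, mode=stat.S_IFREG | 0o644):
            ino = self.next_ino
            self.next_ino += 1
            # Embed the attributes (and small file data) in the value,
            # so a stat or lookup is a single point query.
            self.kv[(parent_ino, name)] = {"ino": ino, "mode": mode, "size": 0}
            return ino

        def lookup(self, parent_ino, name):
            return self.kv.get((parent_ino, name))

        def readdir(self, parent_ino):
            # Entries of one directory share the key prefix, so an LSM
            # range scan returns them together, in sorted order.
            return sorted(name for (p, name) in self.kv if p == parent_ino)

    t = MetadataTable()
    t.create(1, "a.txt")
    t.create(1, "b.txt")
    print(t.readdir(1))  # ['a.txt', 'b.txt']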
DISSERTATION ABSTRACT: Enabling Data-Driven Optimization of Quality of Experience in Internet Applications

Junchen Jiang
Carnegie Mellon University, SCS
PhD Defense — June 23, 2017

Today's Internet has become an eyeball economy dominated by applications such as video streaming and VoIP. With most applications relying on user engagement to generate revenues, maintaining high user-perceived Quality of Experience (QoE) has become crucial to ensure high user engagement. For instance, one short buffering interruption leads to 39% less time spent watching videos and causes significant revenue losses for ad-based video sites. Despite increasing expectations for high QoE, existing approaches are limited in their ability to achieve the QoE needed by today's applications. They either require costly re-architecting of the network core, or use suboptimal endpoint-based protocols that react to dynamic Internet performance based on limited knowledge of the network.

Industry guests and CMU folks board the bus to head to Bedford Springs for the PDL Retreat.

continued on page 15

DEFENSES & PROPOSALS continued from page 14

In this thesis, I present a new approach, which is inspired by the recent success of data-driven approaches in many fields of computing. I will demonstrate that data-driven techniques can improve Internet QoE by utilizing a centralized real-time view of performance across millions of endpoints (clients). I will focus on two fundamental challenges unique to this data-driven approach: the need for expressive models to capture the complex factors affecting QoE, and the need for scalable platforms to make real-time decisions with fresh data from geo-distributed clients.

Our solutions address these challenges in practice by integrating several domain-specific insights in networked applications with machine learning algorithms and systems, and achieve better QoE than many standard machine learning solutions. I will present end-to-end systems that yield substantial QoE improvement and higher user engagement for video streaming and VoIP. Two of my projects, CFA and VIA, have been used in industry by Conviva and Skype, companies that specialize in QoE optimization for video streaming and VoIP, respectively.

Shinya Matsumoto (Hitachi) talks about his company's research on "Risk-aware Data Replication against Widespread Disasters" at the PDL retreat industry poster session.

DISSERTATION ABSTRACT: Understanding and Improving the Latency of DRAM-Based Memory Systems

Kevin K. Chang
Carnegie Mellon University, ECE
PhD Defense — May 5, 2017

Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts and the emergence of increasingly data-intensive and latency-critical applications further stress the importance of providing low-latency memory accesses.

In this dissertation, we identify three main problems that contribute significantly to the long latency of DRAM accesses, and we present a series of new techniques to address them. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips, and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency.

First, while bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and high energy consumption. This dissertation introduces a new DRAM design, Low-cost Inter-linked SubArrays (LISA), which provides fast and energy-efficient bulk data movement across subarrays in a DRAM chip. We show that the LISA substrate is very powerful and versatile by demonstrating that it efficiently enables several new architectural mechanisms, including low-latency data copying, reduced DRAM access latency for frequently-accessed data, and reduced preparation latency for subsequent accesses to a DRAM bank.

Second, DRAM needs to be periodically refreshed to prevent data loss due to leakage. Unfortunately, while DRAM is being refreshed, a part of it becomes unavailable to serve memory requests, which degrades system performance. To address this refresh interference problem, we propose two access-refresh parallelization techniques that enable more overlapping of accesses with refreshes inside DRAM, at the cost of very modest changes to the memory controllers and DRAM chips. These two techniques together achieve performance close to that of an idealized system that does not require refresh.

Third, we find, for the first time, that there is significant latency variation in accessing different cells of a single DRAM chip due to irregularity in the DRAM manufacturing process. As a result, some DRAM cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation and use a fixed latency value based on the slowest cell across all DRAM chips. To exploit latency variation within a DRAM chip, we

Saurabh Kadekodi discusses his research on "Aging Gracefully with Geriatrix: A File System Aging Suite" at a PDL retreat poster session.

continued on page 16

DEFENSES & PROPOSALS continued from page 15

experimentally characterize and understand the behavior of the variation that exists in real commodity DRAM chips. Based on our characterization, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism to reduce DRAM latency by categorizing DRAM cells into fast and slow regions, and accessing the fast regions with a reduced latency, thereby improving system performance significantly. Our extensive experimental characterization and analysis of latency variation in DRAM chips can also enable the development of other new techniques to improve performance or reliability.

Fourth, this dissertation, for the first time, develops an understanding of the latency behavior due to another important factor—supply voltage, which significantly impacts DRAM performance, energy consumption, and reliability. We take an experimental approach to understanding and exploiting the behavior of modern DRAM chips under different supply voltage values. Our detailed characterization of real commodity DRAM chips demonstrates that memory access latency reduces with increasing supply voltage. Based on our characterization, we propose Voltron, a new mechanism that improves system energy efficiency by dynamically adjusting the DRAM supply voltage based on a performance model. Our extensive experimental data on the relationship between DRAM supply voltage, latency, and reliability can further enable the development of other new mechanisms that improve latency, energy efficiency, or reliability.

The key conclusion of this dissertation is that augmenting the DRAM architecture with simple and low-cost features, together with developing a better understanding of manufactured DRAM chips, leads to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and the detailed experimental data on real commodity DRAM chips presented in this dissertation will enable the development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.

Jiri Schindler (HPE), Bruce Wilson (Broadcom) and Rajat Kateja discuss PDL research at a retreat poster session.

THESIS PROPOSAL: Towards Space-Efficient High-Performance In-Memory Search Structures

Huanchen Zhang, SCS
April 30, 2018

This thesis seeks to address the challenge of building space-efficient yet high-performance in-memory search structures, including indexes and filters, to allow more efficient use of memory in OLTP databases. We show that we can achieve this goal by first designing fast static structures that leverage succinct data structures to approach the information-theoretic optimum in space, and then using the "hybrid index" architecture to obtain dynamicity with bounded and modest cost in space and performance.

To obtain space-efficient yet high-performance static data structures, we first introduce the Dynamic-to-Static rules that present a systematic way to convert existing dynamic structures to smaller immutable versions. We then present the Fast Succinct Trie (FST) and its application, the Succinct Range Filter (SuRF), to show how to leverage theories on succinct data structures to build static search structures that consume space close to the information-theoretic minimum while performing comparably to uncompressed indexes. To support dynamic operations such as inserts, deletes, and updates, we introduce the dual-stage hybrid index architecture that preserves the space efficiency brought by a compressed static index, while amortizing its performance overhead on dynamic operations by applying modifications in batches, as sketched below.

In the proposed work, we seek opportunities to further shrink the size of in-memory indexes by co-designing the indexes with the in-memory tuple storage. We also propose to complete the hybrid index work by extending the techniques to support concurrent indexes.
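The dual-stage idea can be rendered as a short sketch: a small dynamic stage absorbs writes, and a compact, immutable static stage serves most lookups, with batched merges from the former into the latter. The sorted list below is merely a stand-in for a compressed structure such as FST, and the merge threshold is an assumed parameter, not part of the proposal.

    import bisect

    class HybridIndex:
        """Toy dual-stage hybrid index: writes go to a small dynamic
        stage; a batched merge moves them into a space-efficient,
        immutable static stage (here a sorted list standing in for a
        compressed structure such as FST)."""

        def __init__(self, merge_threshold=4):
            self.dynamic = {}        # small, write-friendly stage
            self.static_keys = []    # immutable, sorted, compact stage
            self.static_vals = []
            self.merge_threshold = merge_threshold

        def insert(self, key, value):
            self.dynamic[key] = value
            if len(self.dynamic) >= self.merge_threshold:
                self._merge()        # amortize the cost over many inserts

        def get(self, key):
            if key in self.dynamic:  # newest data wins: check dynamic first
                return self.dynamic[key]
            i = bisect.bisect_left(self.static_keys, key)
            if i < len(self.static_keys) and self.static_keys[i] == key:
                return self.static_vals[i]
            return None

        def _merge(self):
            # Rebuild the static stage in one batch; in the real design
            # this is where Dynamic-to-Static conversion would happen.
            merged = dict(zip(self.static_keys, self.static_vals))
            merged.update(self.dynamic)
            self.dynamic.clear()
            self.static_keys = sorted(merged)
            self.static_vals = [merged[k] for k in self.static_keys]

    idx = HybridIndex()
    for k in ["e", "b", "a", "d", "c"]:
        idx.insert(k, k.upper())
    print(idx.get("c"))  # 'C'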
THESIS PROPOSAL: Efficient Networked Systems for Datacenter Fabrics with RPCs

Anuj Kalia, SCS
March 23, 2018

Datacenter networks have changed radically in recent years. Their bandwidth and latency have improved by orders of magnitude, and advanced network devices such as NICs with Remote Direct Memory Access (RDMA) capabilities and programmable switches have been deployed. The conventional wisdom is that to best use fast datacenter networks, distributed systems must be redesigned to offload processing from server CPUs to network devices. In this dissertation, we show that conventional, non-offloaded designs offer

Bill Bolosky (Microsoft Research) talks about his company's work on exciting new projects at the PDL retreat industry poster session.

continued on page 17

DEFENSES & PROPOSALS continued from page 16

better or comparable performance for a wide range of datacenter workloads, including key-value stores, distributed transactions, and highly-available replicated services.

We present the following principle: the physical limitations of networks must inform the design of high-performance distributed systems. Offloaded designs often require more network round trips than conventional CPU-based designs, and therefore have fundamentally higher latency. Since they require more network packets, they also have lower throughput. Realizing the benefits of this principle requires fast networking software for CPUs. To this end, we undertake a detailed exploration of datacenter network capabilities, CPU-NIC interaction over the system bus, and NIC hardware architecture. We use insights from this study to create high-performance remote procedure call implementations for use in distributed systems with active end-host CPUs.

We demonstrate the effectiveness of this principle through the design and evaluation of four distributed in-memory systems: a key-value cache, a networked sequencer, an online transaction processing system, and a state machine replication system. We show that our designs often simultaneously outperform the competition in performance, scalability, and simplicity.

Joan Digney and Garth Gibson celebrate 25 years of PDL research and retreats.

THESIS PROPOSAL: STRADS: A New Distributed Framework for Scheduled Model-Parallel Machine Learning

Jin Kyu Kim, SCS
May 15, 2017

Machine learning (ML) methods are used to analyze data collected from various sources. As problem sizes grow, we turn to distributed parallel computation to complete ML training in a reasonable amount of time. However, naive parallelization of ML algorithms often hurts the effectiveness of parameter updates due to the dependency structure among model parameters, and a subset of model parameters often bottlenecks the completion of ML algorithms due to uneven convergence rates. In this proposal, I propose two efforts: 1) STRADS, which improves training speed by an order of magnitude, and 2) STRADS-AP, which makes parallel ML programming easier.

In STRADS, I will first present a scheduled model-parallel approach with two specific scheduling schemes: 1) model parameter dependency checking, to avoid updating dependent parameters concurrently; and 2) parameter prioritization, to give more update chances to the parameters far from their convergence point. To efficiently run scheduled model-parallel training in a distributed system,

continued on page 18

Yixin Luo and Michael Kuchnik, ready to discuss their research on "Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives" and "Machine Learning Based Feature Tracking in HPC Simulations" at a PDL retreat poster session.

THESIS PROPOSAL: Design & Implementation of a Non-Volatile Memory Database Management System

Joy Arulraj, SCS
December 7, 2017

For the first time in 25 years, a new non-volatile memory (NVM) category is being created that is two orders of magnitude faster than current durable storage media. This will fundamentally change the dichotomy between volatile memory and durable storage in DB systems. The new NVM devices are almost as fast as DRAM, but all writes to them are potentially persistent even after power loss. Existing DB systems are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many components of legacy database systems are unnecessary and will degrade the performance of data-intensive applications.

This dissertation explores the implications of NVM for database systems. It presents the design and implementation of Peloton, a new database system tailored specifically for NVM. We focus on three aspects of a database system: (1) logging and recovery, (2) storage management, and (3) indexing. Our primary contribution in this dissertation is the design of a new logging and recovery protocol, called write-behind logging, that improves the availability of the system by more than two orders of magnitude compared to the ubiquitous write-ahead logging protocol. Besides improving availability, we found that write-behind logging improves the space utilization of the NVM device and extends its lifetime. Second, we propose a new storage engine architecture that leverages the durability and byte-addressability properties of NVM to avoid unnecessary data duplication. Third, the dissertation presents the design of a latch-free range index tailored for NVM that supports near-instantaneous recovery without requiring special-purpose recovery code.
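At a high level, the contrast between the two logging protocols is that write-ahead logging persists redo images in the log before the data, while write-behind logging persists the data itself before a small commit record, so recovery needs no replay. The sketch below is a deliberately simplified illustration of that difference under assumed semantics, not Peloton's actual protocol.

    class WriteBehindTable:
        """Toy write-behind logging: tuple writes are made durable on
        NVM before the commit record, so the log holds only commit
        timestamps (no redo images) and recovery does no replay."""

        def __init__(self):
            self.nvm = {}    # stand-in for byte-addressable NVM storage
            self.log = []    # tiny log: committed-timestamp records only
            self.ts = 0

        def commit(self, writes):
            self.ts += 1
            for key, value in writes.items():
                # Persist the data itself first (versioned by ts) ...
                self.nvm[key] = (value, self.ts)
            # ... then a single small commit record. With write-ahead
            # logging, a log entry carrying images of the data would
            # have to be persisted before the data.
            self.log.append(("COMMIT", self.ts))
            return self.ts

        def recover(self):
            # No redo pass: find the last committed timestamp and drop
            # any tuple versions newer than it (an interrupted commit).
            last = max((t for _, t in self.log), default=0)
            self.nvm = {k: (v, t) for k, (v, t) in self.nvm.items() if t <= last}

    db = WriteBehindTable()
    db.commit({"x": 1, "y": 2})
    db.recover()
    print(db.nvm)  # {'x': (1, 1), 'y': (2, 1)}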

DEFENSES & PROPOSALS continued from page 17

I implement a prototype framework called STRADS. STRADS improves parameter update throughput by pipelining iterations and overlapping update computations with the network communication used for parameter synchronization. With ML scheduling and system optimizations, STRADS improves ML training time by an order of magnitude. However, these performance gains come at the cost of an extra programming burden when writing ML schedules. In STRADS-AP, I will present a high-level programming library and a system infrastructure that automates ML scheduling. The STRADS-AP library consists of three programming constructs: 1) a set of distributed data structures (DDS); 2) a set of functional-style operators; and 3) an imperative-style loop operator. Once an ML programmer writes an ML program using the STRADS-AP library APIs, the STRADS-AP runtime automatically parallelizes the user program over a cluster while ensuring data consistency.

Dana Van Aken presents her research on "Automatic Database Management System Tuning Through Large-scale Machine Learning" at the PDL retreat.

THESIS PROPOSAL: Novel Computational Techniques for Mapping Next-Generation Sequencing Reads

Hongyi Xin, SCS
May 31, 2017

DNA read mapping is an important problem in bioinformatics. With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of many medical and genetic applications critically depends on computational methods that process the enormous amount of sequence data quickly and accurately. However, due to the repetitive nature of the human genome and the limitations of sequencing technology, current read mapping methods still fall short of achieving both high performance and high sensitivity.

In this proposal, I break down the DNA read mapping problem into four subproblems: intelligent seed extraction, efficient filtration of incorrect seed locations, high-performance extension, and accurate and efficient read cloud mapping. I provide novel computational techniques for each subproblem, including: 1) a novel seed selection algorithm that optimally divides a read into low-frequency seeds; 2) a novel SIMD-friendly bit-parallel filtering algorithm that quickly estimates whether two strings are highly similar (sketched below); 3) a generalization of a state-of-the-art approximate string matching algorithm that measures genetic similarities with more realistic metrics; and 4) a novel mapping strategy that utilizes characteristics of a new sequencing technology, read cloud sequencing, to map NGS reads with higher accuracy and efficiency.
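The fltering idea in technique 2 can be sketched compactly: encode base-by-base mismatches of two strings as a bit-vector and reject a candidate mapping location when too many mismatch bits survive. This is a simplified Hamming-distance-style filter in the spirit of the proposal; real pre-alignment filters also handle shifted (indel) alignments, and the parameters here are arbitrary.

    def mismatch_mask(read, ref):
        """Pack per-base mismatches of two equal-length strings into an
        integer bit-vector (bit i set = mismatch at position i). SIMD
        hardware would compare many bases per instruction this way."""
        mask = 0
        for i, (a, b) in enumerate(zip(read, ref)):
            if a != b:
                mask |= 1 << i
        return mask

    def passes_filter(read, ref, max_edits=2):
        """Cheap pre-alignment filter: if the mismatch bit-vector has
        more set bits than the edit budget, the expensive extension
        (dynamic programming) step can be skipped entirely."""
        return bin(mismatch_mask(read, ref)).count("1") <= max_edits

    print(passes_filter("ACGTACGT", "ACGTACGA"))  # True: one mismatch
    print(passes_filter("ACGTACGT", "TGCATGCA"))  # False: filtered out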

ALUMNI NEWS

Hugo Patterson (Ph.D., ECE '98)

We are pleased to pass on the news that Datrium (www.datrium.com/), where Hugo is a co-founder, won Gold in Search Storage's 2017 Product of the Year: "Datrium impresses judges and wins top honors with its DVX storage architecture, designed to sidestep latency and deliver performance and speed at scale." http://bit.ly/2Cl2mAR

Hugo received his Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, where he was a charter student in the PDL. He was advised by Garth Gibson, and his Ph.D. research focused on informed prefetching and caching. He was named a distinguished alumnus of the PDL in 2007.

Ted Wong (Ph.D., CS '04)

Ted joined 23andMe (www.23andme.com) as a Senior Software Engineer with the Machine Learning Engineering group back in January 2018, and reports he is incredibly happy to be there. He also wants to mention that they are hiring!

23andMe is a personal genomics and biotechnology company based in Mountain View, California. The company is named for the 23 pairs of chromosomes in a normal human cell.

NEW PDL FACULTY & STAFF

Gauri Joshi

The PDL would like to welcome Gauri Joshi to our family! Gauri is an Assistant Professor at CMU in the Department of Electrical and Computer Engineering. She is interested in stochastic modeling and analysis that provides sharp insights into the design of computing systems. Her favorite tools include probability, queueing, coding theory and machine learning.

Until August 2017 Gauri was a Research Staff Member at IBM T. J. Watson in Yorktown Heights, NY. In June 2016 she completed her PhD at MIT, working with Prof. Gregory Wornell and Prof. Emina Soljanin. Before that, Gauri spent five years at IIT Bombay, where she completed a dual degree (B.Tech + M.Tech) in Electrical Engineering. She also spent several summers interning at Google, Bell Labs, and Qualcomm.

Currently, Gauri is working on several projects. These include one on Distributed Machine Learning. In large-scale machine learning, training is performed by running stochastic gradient descent (SGD) in a distributed fashion using a central parameter server and multiple servers (learners). Using asynchronous methods to alleviate the problem of stragglers, the research goal is to design a distributed SGD algorithm that strikes the best trade-off between training time and errors in the trained model.

Her project on Straggler Replication in Parallel Computing develops insights into the best relaunching time, and the number of replicas to relaunch, to reduce latency without a significant increase in computing costs in jobs with hundreds of parallel tasks, where the slowest task becomes the bottleneck.

Unlike traditional file transfer, where only the total delay matters, Streaming Communication requires fast and in-order delivery of individual packets to the user. This project analyzes the trade-off between throughput and in-order delivery delay, and in particular how it is affected by the frequency of feedback to the source, and proposes a simple combination of repetition and greedy linear coding that achieves a close-to-optimal throughput-delay trade-off.

Rashmi Vinayak

We would also like to welcome Rashmi Vinayak! Rashmi is an assistant professor in the Computer Science Department at Carnegie Mellon University. She received her PhD in the EECS department at UC Berkeley in 2016, and was a postdoctoral researcher at AMPLab/RISELab and BLISS. Her dissertation received the Eli Jury Award 2016 from the EECS department at UC Berkeley for outstanding achievement in the area of systems, communications, control, or signal processing. Rashmi is the recipient of the IEEE Data Storage Best Paper and Best Student Paper Awards for the years 2011/2012. She is also a recipient of the Facebook Fellowship 2012-13, the Microsoft Research PhD Fellowship 2013-15, and the Google Anita Borg Memorial Scholarship 2015-16. Her research interests lie in building high-performance and resource-efficient big data systems based on theoretical foundations.

A recent project has focused on storage and caching, particularly on fault tolerance, scalability, load balancing, and reducing latency in large-scale distributed data storage and caching systems. She and her colleagues designed coding-theory-based solutions that were shown to be provably optimal. They also built systems and evaluated them on Facebook's data-analytics cluster and on Amazon EC2, showing significant benefits over the state-of-the-art. The solutions are now a part of Apache Hadoop 3.0 and are also being considered by several companies such as NetApp and Cisco.

Rashmi is also interested in machine learning: the research focus here has been on the generalization performance of a class of learning algorithms that are widely used for ranking. She collaborated on designing an algorithm building on top of Multiple Additive Regression Trees, and through empirical evaluation on real-world datasets showed significant improvement on classification, regression, and ranking tasks. This new algorithm is now deployed in production in Microsoft's data-analysis toolbox, which powers the Azure Machine Learning product.

Alex Glikson

Alex Glikson joined the Computer Science Department as a staff engineer, after spending the last 14 years at IBM Research in Israel, where he led a number of research and development projects in the area of systems management and cloud infrastructure. Alex is interested in resource and workload management in cloud computing environments, recently focusing on "Function-as-a-Service" platforms, infrastructure for Deep Learning workloads, and the combination of the two.

RECENT PUBLICATIONS continued from page 7

algorithm [3], and taking tens of data passes to converge, each data pass is slowed down by 30-40% relative to the prior pass, so the eighth data pass is 8.5X slower than the first. The current practice for avoiding such a performance penalty is to checkpoint frequently to a durable storage device, which truncates the lineage size. Checkpointing as a performance speedup is difficult for a programmer to anticipate, and it fundamentally contradicts Spark's philosophy that the working set should stay in memory and not be replicated across the network. Since Spark caches intermediate RDDs, one solution is to cache constructed DAGs and broadcast only new DAG elements. Our experiments show that with this optimization, per-iteration execution time is almost independent of growing lineage size and comparable to the execution time provided by optimal checkpointing. On 10 local machines using 240 cores in total, without checkpointing we observed a 3.4X speedup when solving matrix factorization and a 10X speedup for a streaming application provided in the Spark distribution.

3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning

Hyeontaek Lim, David G. Andersen & Michael Kaminsky

arXiv:1802.07389v1 [cs.LG], 21 Feb 2018.

The performance and efficiency of distributed machine learning (ML) depends significantly on how long it takes for nodes to exchange state changes. Overly-aggressive attempts to reduce communication often sacrifice final model accuracy and necessitate additional ML techniques to compensate for this loss, limiting their generality. Some attempts to reduce communication incur high computation overhead, which makes their performance benefits visible only over slow networks.

We present 3LC, a lossy compression scheme for state change traffic that strikes a balance between multiple goals: traffic reduction, accuracy, computation overhead, and generality. It combines three new techniques—3-value quantization with sparsity multiplication, quartic encoding, and zero-run encoding—to leverage the strengths of quantization and sparsification techniques while avoiding their drawbacks. It achieves a data compression ratio of up to 39–107X, almost the same test accuracy of trained models, and high compression speed. Distributed ML frameworks can employ 3LC without modifications to existing ML algorithms. Our experiments show that 3LC reduces the wall-clock training time of ResNet-110–based image classifiers for CIFAR-10 on a 10-GPU cluster by up to 16–23X compared to TensorFlow's baseline design.

Point-to-point tensor compression for two example layers in 3LC: (a) gradient pushes from workers to servers, where gradients are compressed, pushed, and decompressed at the server; (b) model pulls from servers to workers, where model deltas are compressed, pulled, and decompressed to update each worker's local model.
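A deterministic, simplified rendering of the 3-value quantization step, with the error feedback 3LC uses to preserve accuracy, is shown below. The threshold rule is an illustrative stand-in for the paper's scheme, which also applies quartic encoding and zero-run encoding to the quantized tensor.

    import numpy as np

    class ThreeValueQuantizer:
        """Simplified 3-value quantization with error feedback: each
        gradient entry is transmitted as one of {-m, 0, +m}, and the
        quantization error is carried into the next round."""

        def __init__(self, shape):
            self.residual = np.zeros(shape)  # error-feedback accumulator

        def compress(self, grad):
            t = grad + self.residual
            m = np.abs(t).max() or 1.0            # per-tensor scale
            q = np.where(np.abs(t) >= m / 2, np.sign(t), 0.0)
            self.residual = t - q * m             # remember what was dropped
            return q.astype(np.int8), m           # {-1, 0, 1} plus one float

        @staticmethod
        def decompress(q, m):
            return q * m

    quant = ThreeValueQuantizer(shape=4)
    g = np.array([0.9, -0.1, 0.4, -0.6])
    q, m = quant.compress(g)
    print(q, quant.decompress(q, m))  # [ 1  0  0 -1] [ 0.9  0.  0. -0.9]

The int8 tensor is what the subsequent quartic and zero-run encoders would shrink further before it crosses the network.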
It is to prefetch the estimated register depends significantly on how long achieves a data compression ratio of working-set from the main register file it takes for nodes to exchange state up to 39–107X, almost the same test to the register file cache under software changes. Overly-aggressive attempts to accuracy of trained models, and high control, at the beginning of each in- reduce communication often sacrifice compression speed. Distributed ML terval, and overlap the prefetch latency final model accuracy and necessitate frameworks can employ 3LC without with the execution of other warps. Our additional ML techniques to com- modifications to existing ML algo- experimental results show that LTRF pensate for this loss, limiting their rithms. Our experiments show that enables high-capacity yet long-latency generality. Some attempts to reduce 3LC reduces wall-clock training time main GPU register files, paving the communication incur high compu- of ResNet-110–based image classifiers way for various optimizations. As an tation overhead, which makes their for CIFAR-10 on a 10-GPU cluster example optimization, we implement performance benefits visible only over by up to 16–23X compared to Tensor- the main register file with emerging slow networks. Flow’s baseline design. continued on page 21

RECENT PUBLICATIONS continued from page 20

high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 31% while reducing register file power consumption by 46%.

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

Rachata Ausavarungnirun, Vance Miller, Joshua Landgraf, Saugata Ghose, Jayneel Gandhi, Adwait Jog, Christopher J. Rossbach & Onur Mutlu

ASPLOS '18, March 24–28, 2018, Williamsburg, VA, USA.

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate.

We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.

Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.

Slim NoC: A Low-Diameter On-Chip Network Topology for High Energy Efficiency and Scalability

Maciej Besta, Syed Minhaj Hassan, Sudhakar Yalamanchili, Rachata Ausavarungnirun, Onur Mutlu & Torsten Hoefler

ASPLOS '18, March 24–28, 2018, Williamsburg, VA, USA.

Emerging chips with hundreds and thousands of cores require networks with unprecedented energy/area effi-

continued on page 22

MASK design overview: (1) TLB-fill tokens, (2) an address-translation-aware cache bypass, and (3) an address-space-aware memory scheduler.

RECENT PUBLICATIONS continued from page 21

ciency and scalability. To address this, we propose Slim NoC (SN): a new on-chip network design that delivers significant improvements in efficiency and scalability compared to the state-of-the-art. The key idea is to use two concepts from graph and number theory, degree-diameter graphs combined with non-prime finite fields, to enable the smallest number of ports for a given core count. SN is inspired by state-of-the-art off-chip topologies; it identifies and distills their advantages for NoC settings while solving several key issues that lead to significant overheads on-chip. SN provides NoC-specific layouts, which further enhance area/energy efficiency. We show how to augment SN with state-of-the-art router microarchitecture schemes, such as Elastic Links, to make the network even more scalable and efficient. Our extensive experimental evaluations show that SN outperforms both traditional low-radix topologies (e.g., meshes and tori) and modern high-radix networks (e.g., various Flattened Butterflies) in area, latency, throughput, and static/dynamic power consumption for both synthetic and real workloads. SN provides a promising direction in scalable and energy-efficient NoC topologies.

Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach & Onur Mutlu

Proc. of the International Symposium on Microarchitecture (MICRO), Cambridge, MA, October 2017.

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page.

In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages.

We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging

Page allocation and coalescing behavior of GPU memory managers: (a) state-of-the-art, (b) Mosaic. 1: The GPU memory manager allocates base pages from both Applications 1 and 2. 2: As a result, the memory manager cannot coalesce the base pages into a large page without first migrating some of the base pages, which would incur a high latency. 3: Mosaic uses Contiguity-Conserving Allocation (CoCoA), a memory allocator which provides a soft guarantee that all of the base pages within the same large page range belong to only a single application, and 4: the In-Place Coalescer, a page size selection mechanism that merges base pages into a large page immediately after allocation.

continued on page 23

RECENT PUBLICATIONS continued from page 22

overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7% on average, respectively, coming within 6.8% and 15.4% of the performance of an ideal TLB where all TLB requests are hits.

Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory

Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth Gibson, Chuck Cranor, Brad Settlemyer, Gary Grider & Fan Guo

PDSW-DISCS 2017: 2nd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, held in conjunction with SC17, Denver, CO, Nov. 2017.

In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient indexing mechanism for reordering and indexing writes, and a log-structured storage layout to pack small data into large log objects, all while ensuring compute node resources are used frugally. We demonstrate the efficiency of this indexing mechanism through VPIC, a plasma simulation code that scales to trillions of particles. With the Indexed Massive Directory, we modify VPIC to create a file for each particle to receive writes of that particle's simulation output data. Dynamically indexing the directory's underlying storage, keyed on particle filename, allows us to achieve a 5000x speedup for a single particle trajectory query, which requires reading all data for a single particle. This speedup increases with application scale, while the overhead remains stable at 3% of the available memory.
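The write path described above can be rendered as a toy sketch: tiny per-particle writes are appended to a shared log object while a side index, keyed on the particle's file name, records where each particle's data landed, so a trajectory query reads only those extents. The names and structure here are illustrative, not the DeltaFS implementation.

    from collections import defaultdict

    class IndexedMassiveDirectory:
        """Toy DeltaFS-style indexed directory: small per-particle
        writes are packed into one append-only log object, and a side
        index maps each particle 'file' to its (offset, length) extents."""

        def __init__(self):
            self.log = bytearray()             # packed log object
            self.index = defaultdict(list)     # filename -> extents

        def write(self, filename, data: bytes):
            self.index[filename].append((len(self.log), len(data)))
            self.log.extend(data)              # small write, large object

        def read_trajectory(self, filename):
            # A trajectory query touches only this particle's extents
            # instead of scanning the entire simulation output.
            return b"".join(bytes(self.log[off:off + n])
                            for off, n in self.index[filename])

    d = IndexedMassiveDirectory()
    for step in range(3):
        d.write("particle-42", f"t{step}:x=0.{step};".encode())
        d.write("particle-7", b"...")
    print(d.read_trajectory("particle-42").decode())  # t0:x=0.0;t1:x=0.1;t2:x=0.2;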
A DRAM cell (wordline, access transistor, storage capacitor, and bitline) and its sense amplifier.

Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons & Todd C. Mowry

Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory). To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth.

Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits the existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus. Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation.

Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with the Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves the performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) a bit-vector-based implementation of sets, by 3X-7X compared to a

continued on page 24

RECENT PUBLICATIONS continued from page 23

state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that the large performance and energy improvements provided by Ambit can enable many other applications to use bulk bitwise operations.
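The two Ambit components compose neatly: triple-row activation computes a bitwise majority of three rows, from which AND and OR follow by fixing the third row to all zeros or all ones. The Python sketch below emulates that logic on integers standing in for DRAM rows; the in-DRAM command sequences themselves are, of course, not expressible in software.

    # Emulate Ambit's triple-row activation on Python integers, where
    # each integer is a bit-vector standing in for one DRAM row.

    def maj(a, b, c):
        """Bitwise majority: each result bit is 1 iff at least two of
        the three input bits are 1 -- what charge sharing across three
        simultaneously activated DRAM rows computes in analog."""
        return (a & b) | (b & c) | (a & c)

    ROW_WIDTH = 8
    ZEROS = 0                        # reserved all-zeros control row
    ONES = (1 << ROW_WIDTH) - 1      # reserved all-ones control row

    def bulk_and(a, b):
        return maj(a, b, ZEROS)      # MAJ(a, b, 0) = a AND b

    def bulk_or(a, b):
        return maj(a, b, ONES)       # MAJ(a, b, 1) = a OR b

    def bulk_not(a):
        return ~a & ONES             # via the sense amplifier's inverter

    a, b = 0b11001010, 0b10011001
    assert bulk_and(a, b) == a & b
    assert bulk_or(a, b) == a | b
    print(bin(bulk_and(a, b)), bin(bulk_or(a, b)))

With AND, OR, and NOT available, any bulk bitwise operation can be composed inside DRAM.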
Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content

Samira Khan, Chris Wilkerson, Zhe Wang, Alaa R. Alameldeen, Donghyuk Lee & Onur Mutlu

Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve the reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigating the failing cells with higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires knowledge of DRAM internals specific to each DRAM chip. As the internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge.

In this paper, we decouple the detection and mitigation of data-dependent failures from the physical DRAM organization, such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever a write access changes the content of memory. As runtime testing for failures has a high overhead, MEMCON selectively initiates a test on a write only when the time between two consecutive writes to that page (i.e., the write interval) is long enough to provide a significant benefit from lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts long write intervals based on our observation that write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle (see the sketch below). Our evaluation shows that, compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65-74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core system, and a 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system, using 8/16/32 Gb DRAM chips.
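The Pareto observation translates into a simple online rule: the longer a page has already stayed idle since its last write, the longer its remaining idle time is expected to be, so a content test pays off once the observed idle time crosses a threshold. The sketch below illustrates that rule; the threshold value and bookkeeping are illustrative assumptions, not MEMCON's exact policy.

    class WriteIntervalPredictor:
        """Toy MEMCON-style policy: test a page's content (so its
        refresh rate can be lowered) only when the page has already
        stayed idle long enough after a write. Under a Pareto
        distribution of write intervals, a page idle for time T so far
        is likely to remain idle much longer than T."""

        def __init__(self, test_threshold=1000):
            self.last_write = {}              # page -> time of last write
            self.threshold = test_threshold   # assumed tuning parameter

        def on_write(self, page, now):
            self.last_write[page] = now       # content changed: retest later

        def should_test(self, page, now):
            idle = now - self.last_write.get(page, now)
            # Heavy-tailed intervals: a long observed idle time predicts
            # a long remaining idle time, amortizing the test overhead.
            return idle >= self.threshold

    p = WriteIntervalPredictor()
    p.on_write(page=42, now=0)
    print(p.should_test(page=42, now=500))   # False: too soon to pay for a test
    print(p.should_test(page=42, now=1500))  # True: a long interval is expected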

Greg opens the 25th PDL Retreat at the Bedford Springs Resort.

Bigger, Longer, Fewer: What Do Cluster Jobs Look Like Outside Google?

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson, Elisabeth Baseman & Nathan DeBardeleben

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-104, October 2017.

In the last 5 years, a set of job scheduler logs released by Google has been used in more than 400 publications as the token cloud workload. While this is an invaluable trace, we think it is crucial that researchers evaluate their work under other workloads as well, to ensure the generality of their techniques. To aid them in this process, we analyze three new traces consisting of job scheduler logs from one private and

continued on page 25

RECENT PUBLICATIONS continued from page 24

two HPC clusters. We further release the two HPC traces, which we expect to be of interest to the community due to their unique characteristics. The new traces represent clusters 0.3-3 times the size of the Google cluster in terms of CPU cores, and cover a 3-60 times longer time span.

This paper presents an analysis of the differences and similarities between all the aforementioned traces. We discuss a variety of aspects: job characteristics, workload heterogeneity, resource utilization, and failure rates. More importantly, we review assumptions from the literature that were originally derived from the Google trace, and verify whether they hold true when the new traces are considered. For those assumptions that are violated, we examine affected work from the literature. Finally, we demonstrate the importance of dataset plurality in job scheduling research by evaluating the performance of JVuPredict, the job runtime estimate module of the TetriSched scheduler, using all four traces.

WorkloadCompactor: Reducing Datacenter Cost while Providing Tail Latency SLO Guarantees

Timothy Zhu, Michael A. Kozuch & Mor Harchol-Balter

ACM Symposium on Cloud Computing (SoCC '17), Santa Clara, CA, October 2017.

Service providers want to reduce datacenter costs by consolidating workloads onto fewer servers. At the same time, customers have performance goals, such as meeting tail latency Service Level Objectives (SLOs). Consolidating workloads while meeting tail latency goals is challenging, especially since workloads in production environments are often bursty. To limit the congestion when consolidating workloads, customers and service providers often agree upon rate limits. Ideally, rate limits are chosen to maximize the number of workloads that can be co-located while meeting each workload's SLO. In reality, neither the service provider nor the customer knows how to choose rate limits. Customers end up selecting rate limits on their own in some ad hoc fashion, and service providers are left to optimize given the chosen rate limits.

This paper describes WorkloadCompactor, a new system that uses workload traces to automatically choose rate limits simultaneously with selecting onto which server to place each workload. Our system meets customer tail latency SLOs while minimizing datacenter resource costs. Our experiments show that by optimizing the choice of rate limits, WorkloadCompactor reduces the number of required servers by 30-60% as compared to state-of-the-art approaches.

Token bucket rate limiters control the rate and burstiness of a stream of requests. When a request arrives at the rate limiter, tokens are used (i.e., removed) from the token bucket to allow the request to proceed. If the bucket is empty, the request must queue and wait until there are enough tokens. Tokens are added to the bucket at a constant rate r, up to a maximum capacity specified by the bucket size b. Thus, the token bucket rate limiter limits the workload to a maximum instantaneous burst of size b and an average rate r.
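Because (r, b) rate limits are the knob WorkloadCompactor optimizes over, a concrete token bucket, implemented exactly as the description above specifies, may help; this is a textbook formulation, not code from the paper.

    class TokenBucket:
        """Token bucket rate limiter: tokens accrue at rate r up to a
        capacity of b, so traffic is limited to bursts of at most b
        and a long-run average rate of r."""

        def __init__(self, r, b):
            self.r = float(r)      # token refill rate (tokens/second)
            self.b = float(b)      # bucket size (maximum burst)
            self.tokens = float(b)
            self.t = 0.0           # time of last update

        def allow(self, now, cost=1.0):
            # Refill for the elapsed time, capped at the bucket size.
            self.tokens = min(self.b, self.tokens + (now - self.t) * self.r)
            self.t = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True        # request proceeds immediately
            return False           # request must queue for more tokens

    tb = TokenBucket(r=100.0, b=10.0)  # 100 req/s average, bursts of 10
    print(sum(tb.allow(now=0.0) for _ in range(15)))  # 10: burst capped at b
    print(tb.allow(now=0.05))          # True: 5 tokens refilled by t=0.05

Given a workload trace, WorkloadCompactor's job is to pick the (r, b) pair per workload that keeps queueing, and hence tail latency, within the SLO at minimum server cost.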
Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo & Onur Mutlu

Proceedings of the IEEE, Volume 105, Issue 9, September 2017.

NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and its cost has continuously decreased over the decades. This positive growth is a result of two key trends: 1) effective process technology scaling; and 2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to 1) fewer electrons in the flash memory cell floating gate to represent the data; and 2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells. In this article, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including 1) cell-to-cell interference mitigation; 2) optimal multi-level cell sensing; 3) error correction using state-of-the-art algorithms and methods; and 4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve in the future.

continued on page 26

A Better Model for Job Redundancy: Decoupling Server Slowdown and Job Size

Kristen Gardner, Mor Harchol-Balter, Alan Scheller-Wolf & Benny Van Houdt

Transactions on Networking, September 2017.

Recent computer systems research has proposed using redundant requests to reduce latency. The idea is to replicate a request so that it joins the queue at multiple servers. The request is considered complete as soon as any one of its copies completes. Redundancy allows us to overcome server-side variability – the fact that a server might be temporarily slow due to factors such as background load, network interrupts, and garbage collection – to reduce response time. In the past few years, queueing theorists have begun to study redundancy, first via approximations, and, more recently, via exact analysis. Unfortunately, for analytical tractability, most existing theoretical analysis has assumed an Independent Runtimes (IR) model, wherein the replicas of a job each experience independent runtimes (service times) at different servers. The IR model is unrealistic and has led to theoretical results which can be at odds with computer systems implementation results. This paper introduces a much more realistic model of redundancy. Our model decouples the inherent job size (X) from the server-side slowdown (S), where we track both S and X for each job. Analysis within the S&X model is, of course, much more difficult. Nevertheless, we design a dispatching policy, Redundant-to-Idle-Queue (RIQ), which is both analytically tractable within the S&X model and has provably excellent performance.

Utility-Based Hybrid Memory Management

Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang & Onur Mutlu

In Proc. of the IEEE Cluster Conference (CLUSTER), Honolulu, HI, September 2017.

While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance.

In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration. We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.

Figure (panels: (a) alone request, (b) overlapped requests): Conceptual example showing that the MLP of a page influences how much effect its migration to fast memory has on the application stall time.
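UH-MEM's first step assigns each page a utility score. The following is a toy Python sketch of such a score, assuming hypothetical per-page counters; the actual mechanism's statistics and weighting are more involved than this.

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    accesses: int         # access frequency over the last interval
    row_buffer_hits: int  # accesses served from an already-open row
    mlp: float            # avg. concurrent outstanding requests (>= 1)

def migration_utility(p: PageStats, latency_gap_ns: float) -> float:
    """Rough stall-time reduction (ns) from moving this page to fast memory.
    Row-buffer hits are fast in either memory, so only misses benefit; high
    MLP dilutes the benefit because overlapped requests hide latency."""
    row_buffer_misses = p.accesses - p.row_buffer_hits
    return row_buffer_misses * latency_gap_ns / p.mlp

# Rank candidate pages; migrate the highest-utility ones first.
pages = [PageStats(900, 100, 1.0), PageStats(900, 100, 4.0), PageStats(50, 40, 1.0)]
ranked = sorted(pages, key=lambda p: migration_utility(p, latency_gap_ns=60.0),
                reverse=True)
```

The second page, despite identical access counts, scores a quarter of the first page's utility because its requests overlap, mirroring the figure above.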

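Returning to the S&X model above: its key departure from the IR model is that replicas share the job's inherent size X and differ only in server slowdown S. A small Monte Carlo sketch (distributions chosen arbitrarily for illustration, queueing ignored) shows why the IR assumption overstates the benefit of replication:

```python
import random

def replica_times(x, slowdowns, independent):
    # IR model: each replica draws a fresh, independent runtime.
    # S&X model: replicas share the inherent size x; only slowdowns differ.
    if independent:
        return [random.expovariate(1.0) * s for s in slowdowns]
    return [x * s for s in slowdowns]

def mean_completion(independent, replicas=2, trials=100_000):
    total = 0.0
    for _ in range(trials):
        x = random.expovariate(1.0)                  # inherent job size X
        slowdowns = [1.0 + random.expovariate(2.0)   # server slowdown S >= 1
                     for _ in range(replicas)]
        total += min(replica_times(x, slowdowns, independent))
    return total / trials

print("IR model :", mean_completion(independent=True))   # min of independent draws
print("S&X model:", mean_completion(independent=False))  # shared X limits the gain
```

Under IR, a large job can get lucky with a small independent draw at another server; under S&X, every replica of a large job is still large, so redundancy only buys the best slowdown.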
Scheduling for Efficiency and Fairness in Systems with Redundancy

Kristen Gardner, Mor Harchol-Balter, Esa Hyytiä & Rhonda Righter

Performance Evaluation, July 2017.

Server-side variability—the idea that the same job can take longer to run on one server than another due to server-dependent factors—is an increasingly important concern in many queueing systems. One strategy for overcoming server-side variability to achieve low response time is redundancy, under which jobs create copies of themselves and send these copies to multiple different servers, waiting for only one copy to complete service. Most of the existing theoretical work on redundancy has focused on developing bounds, approximations, and exact analysis to study the response time gains offered by redundancy. However, response time is not the only important metric in redundancy systems: in addition to providing low overall response time, the system should also be fair in the sense that no job class should have a worse mean response time in the system with redundancy than it did in the system before redundancy is allowed.

In this paper we use scheduling to address the simultaneous goals of (1) achieving low response time and (2) maintaining fairness across job classes. We develop new exact analysis for per-class response time under First-Come First-Served (FCFS) scheduling for a general type of system structure; our analysis shows that FCFS can be unfair in that it can hurt non-redundant jobs. We then introduce the Least Redundant First (LRF) scheduling policy, which we prove is optimal with respect to overall system response time, but which can be unfair in that it can hurt the jobs that become redundant. Finally, we introduce the Primaries First (PF) scheduling policy, which is provably fair and also achieves excellent overall mean response time.

Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM

Rajat Kateja, Anirudh Badam, Sriram Govindan, Bikash Sharma & Greg Ganger

ISCA '17, June 24-28, 2017, Toronto, ON, Canada.

Non-Volatile Memories (NVMs) can significantly improve the performance of data-intensive applications. A popular form of NVM is battery-backed DRAM, which is available and in use today with DRAM's latency and without the endurance problems of emerging NVM technologies. Modern servers can be provisioned with up to 4 TB of DRAM, and provisioning battery backup to write out such large memories is hard because of the large battery sizes and the added hardware and cooling costs. We present Viyojit, a system that exploits the skew in write working sets of applications to provision substantially smaller batteries while still ensuring durability for the entire DRAM capacity. Viyojit achieves this by bounding the number of dirty pages in DRAM based on the provisioned battery capacity and proactively writing out infrequently written pages to an SSD. Even for write-heavy workloads with less skew than we observe in analysis of real data center traces, Viyojit reduces the required battery capacity to 11% of the original size, with a performance overhead of 7-25%. Thus, Viyojit frees battery-backed DRAM from the stunted growth of battery capacities and enables servers with terabytes of battery-backed DRAM.

Figure: Flow chart describing Viyojit's implementation for tracking dirty pages and enforcing the dirty budget.

Litz: An Elastic Framework for High-Performance Distributed Machine Learning

Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson & Eric P. Xing

Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-17-103, June 2017.

Machine Learning (ML) is becoming an increasingly popular application in the cloud and data-centers, inspiring a growing number of distributed frameworks optimized for it. These frameworks leverage the specific properties of ML algorithms to achieve orders of magnitude performance improvements over generic data processing frameworks like Hadoop or Spark. However, they also tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of the multi-tenant environments in which they run. Furthermore, the programming models provided by these frameworks tend to be restrictive, narrowing their applicability even within the sphere of ML workloads.

Motivated by these trends, we present Litz, a distributed ML framework that achieves both elasticity and generality without giving up the performance of more specialized frameworks. Litz uses a programming model based on scheduling micro-tasks with parameter server access, which enables applications to implement key distributed ML techniques that have recently been introduced. Furthermore, we believe that the union of ML and elasticity presents new opportunities for job scheduling due to the dynamic resource usage of ML algorithms. We give examples of ML properties which give rise to such resource usage patterns and suggest ways to exploit them to improve resource utilization in multi-tenant environments. To evaluate Litz, we implement two popular ML applications that vary dramatically in terms of their structure and run-time behavior—they are typically implemented by different ML frameworks tuned for each. We show that Litz achieves competitive performance with the state of the art while providing low-overhead elasticity and exposing the underlying dynamic resource usage of ML applications.
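The essence of Litz's programming model is that work is expressed as many small micro-tasks reading and updating shared state through a parameter server, so the scheduler is free to spread them over however many executors are currently available. A deliberately simplified sketch of that idea (this is not Litz's actual API; synchronization of the shared state is elided):

```python
from concurrent.futures import ThreadPoolExecutor

params = {"w": 0.0}  # toy parameter server: shared model state

def micro_task(shard):
    # Each micro-task computes an update from its data shard and
    # applies it to the parameter server.
    grad = sum(shard) / len(shard)
    params["w"] -= 0.01 * grad

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Elasticity: the same micro-tasks run unchanged whether 2 or 20
# workers are available; here a local thread pool stands in for them.
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(micro_task, shards))
```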

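Looking back at Viyojit above, its core invariant is simple: the number of dirty DRAM pages must never exceed what the provisioned battery can flush on a power failure. A toy sketch of budget enforcement follows; the names and the least-recently-written eviction policy are ours, and the real implementation (per the flow chart) is more involved.

```python
from collections import OrderedDict

class DirtyBudget:
    """Bound dirty pages to what the battery can flush on power failure;
    proactively clean the least-recently-written page when over budget."""

    def __init__(self, budget_pages: int):
        self.budget = budget_pages
        self.dirty = OrderedDict()  # page -> True, oldest write first

    def record_write(self, page: int, flush_to_ssd) -> None:
        self.dirty.pop(page, None)   # re-inserting moves page to newest
        self.dirty[page] = True
        while len(self.dirty) > self.budget:
            victim, _ = self.dirty.popitem(last=False)  # least recently written
            flush_to_ssd(victim)     # write back; page becomes clean
```

Because write working sets are skewed, the hot pages stay dirty in DRAM while cold pages are cleaned early, which is why a battery sized for a small budget can still cover terabytes of DRAM.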
Workload Analysis and Caching Strategies for Search Advertising Systems

Conglong Li, David G. Andersen, Qiang Fu, Sameh Elnikety & Yuxiong He

SoCC '17, September 24-27, 2017, Santa Clara, CA, USA.

Search advertising depends on accurate predictions of user behavior and interest, accomplished today using complex and computationally expensive machine learning algorithms that estimate the potential revenue gain of thousands of candidate advertisements per search query. The accuracy of this estimation is important for revenue, but the cost of these computations represents a substantial expense, e.g., 10% to 30% of the total gross revenue. Caching the results of previous computations is a potential path to reducing this expense, but traditional domain-agnostic and revenue-agnostic approaches to do so result in substantial revenue loss. This paper presents three domain-specific caching mechanisms that successfully optimize for both factors. Simulations on a trace from the Bing advertising system show that a traditional cache can reduce cost by up to 27.7% but has negative revenue impact as bad as −14.1%. On the other hand, the proposed mechanisms can reduce cost by up to 20.6% while capping revenue impact between −1.3% and 0%. Based on Microsoft's earnings release for FY16 Q4, the traditional cache would reduce the net profit of Bing Ads by $84.9 to $166.1 million in the quarter, while our proposed cache could increase the net profit by $11.1 to $71.5 million.

Figure: Simplified workflow of how the Bing advertising system serves ads to users. A search query flows from candidate selection (millions of ads in the ads pool) to scoring (thousands of ads) to the auction (tens of ads shown); the proposed cache fronts the scoring step and is bypassed on a miss or refresh.

Cachier: Edge-Caching for Recognition Applications

Utsav Drolia, Katherine Guo, Jiaqi Tan, Rajeev Gandhi & Priya Narasimhan

The 37th IEEE International Conference on Distributed Computing Systems (ICDCS 2017), June 5-8, 2017, Atlanta, GA, USA.

Recognition and perception-based mobile applications, such as image recognition, are on the rise. These applications recognize the user's surroundings and augment them with information and/or media. These applications are latency-sensitive: they have a soft-realtime nature, and late results are potentially meaningless. On the one hand, given the compute-intensive nature of the tasks performed by such applications, execution is typically offloaded to the cloud. On the other hand, offloading such applications to the cloud incurs network latency, which can increase the user-perceived latency. Consequently, edge-computing has been proposed to let devices offload intensive tasks to edge-servers instead of the cloud, to reduce latency.

In this paper, we propose a different model for using edge-servers. We propose to use the edge as a specialized cache for recognition applications and formulate the expected latency for such a cache. We show that using an edge-server like a typical web-cache, for recognition applications, can lead to higher latencies. We propose Cachier, a system that uses the caching model along with novel optimizations to minimize latency by adaptively balancing load between the edge and the cloud, leveraging the spatiotemporal locality of requests, offline analysis of applications, and online estimates of network conditions. We evaluate Cachier for image-recognition applications and show that our techniques yield a 3x speed-up in responsiveness, and perform accurately over a range of operating conditions. To the best of our knowledge, this is the first work that models edge-servers as caches for compute-intensive recognition applications, and Cachier is the first system that uses this model to minimize latency for these applications.
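Cachier's central comparison is whether answering at the edge beats forwarding to the cloud in expectation: a cache hit is fast, but a miss pays the edge lookup plus the cloud round-trip. A toy sketch of that decision follows; all parameter names and values are hypothetical, and the real system estimates them online.

```python
def expected_edge_latency(hit_rate: float, edge_lookup_ms: float,
                          cloud_rtt_ms: float, cloud_compute_ms: float) -> float:
    """Expected latency when the edge cache is consulted first:
    hits are served locally; misses pay the lookup plus the cloud path."""
    miss_rate = 1.0 - hit_rate
    return edge_lookup_ms + miss_rate * (cloud_rtt_ms + cloud_compute_ms)

def should_use_edge(hit_rate, edge_lookup_ms, cloud_rtt_ms, cloud_compute_ms):
    edge = expected_edge_latency(hit_rate, edge_lookup_ms,
                                 cloud_rtt_ms, cloud_compute_ms)
    cloud_only = cloud_rtt_ms + cloud_compute_ms
    return edge < cloud_only

# With strong spatiotemporal locality (high hit rate), the edge wins;
# as the hit rate drops or the network improves, the balance shifts.
print(should_use_edge(0.7, edge_lookup_ms=20, cloud_rtt_ms=80, cloud_compute_ms=50))  # True
print(should_use_edge(0.1, edge_lookup_ms=20, cloud_rtt_ms=30, cloud_compute_ms=50))  # False
```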

Carpool: A Bufferless On-Chip Network Supporting Adaptive Multicast and Hotspot Alleviation

Xiyue Xiang, Wentao Shi, Saugata Ghose, Lu Peng, Onur Mutlu & Nian-Feng Tzeng

In Proc. of the International Conference on Supercomputing (ICS), Chicago, IL, June 2017.

Modern chip multiprocessors (CMPs) employ on-chip networks to enable communication between the individual cores. Operations such as coherence and synchronization generate a significant amount of the on-chip network traffic, and often create network requests that have one-to-many (i.e., a core multicasting a message to several cores) or many-to-one (i.e., several cores sending the same message to a common hotspot destination core) flows. As the number of cores in a CMP increases, one-to-many and many-to-one flows result in greater congestion on the network. To alleviate this congestion, prior work provides hardware support for efficient one-to-many and many-to-one flows in buffered on-chip networks. Unfortunately, this hardware support cannot be used in bufferless on-chip networks, which are shown to have lower hardware complexity and higher energy efficiency than buffered networks, and thus are likely a good fit for large-scale CMPs.

We propose Carpool, the first bufferless on-chip network optimized for one-to-many (i.e., multicast) and many-to-one (i.e., hotspot) traffic. Carpool is based on three key ideas: it (1) adaptively forks multicast flit replicas; (2) merges hotspot flits; and (3) employs a novel parallel port allocation mechanism within its routers, which reduces the router critical path latency by 5.7% over a bufferless network router without multicast support. We evaluate Carpool using synthetic traffic workloads that emulate the range of rates at which multi-threaded applications inject multicast and hotspot requests due to coherence and synchronization. Our evaluation shows that for an 8×8 mesh network, Carpool reduces the average packet latency by 43.1% and power consumption by 8.3% over a bufferless network without multicast or hotspot support. We also find that Carpool reduces the average packet latency by 26.4% and power consumption by 50.5% over a buffered network with multicast support, while consuming 63.5% less area for each router.

Automatic Database Management System Tuning Through Large-scale Machine Learning

Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon & Bohan Zhang

ACM SIGMOD International Conference on Management of Data, May 14-19, 2017, Chicago, IL, USA.

Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration "knobs" that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized (i.e., two DBMSs use a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be suboptimal for another). Worse, information about the effects of the knobs typically comes only from (expensive) experience.

To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune and tested it on three DBMSs. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert.

Figure: Motivating examples. Figs. (a) to (c) show 99th-percentile latency (sec) for the YCSB workload running on MySQL (v5.6) under different configuration settings: (a) dependencies between the log file size and buffer pool size knobs; (b) continuous settings of the buffer pool size (MB); (c) non-reusable configurations across three workloads. Fig. (d) shows tuning complexity: the number of tunable knobs provided in MySQL and Postgres releases over time.
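Step (1) of OtterTune's pipeline, ranking knobs by impact, can be sketched with an off-the-shelf regularized regression. The following toy Python example uses synthetic data and illustrative knob names; the paper's actual feature-selection machinery is more elaborate.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
knob_names = ["buffer_pool_mb", "log_file_mb", "checkpoint_secs", "io_threads"]

# Synthetic observations: sampled knob settings and measured 99th-%ile latency,
# where only the first two knobs actually matter.
X = rng.uniform(0, 1, size=(200, len(knob_names)))
latency = 3.0 - 2.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 0.1, 200)

# L1 regularization drives coefficients of low-impact knobs to zero,
# leaving a ranking of the knobs that matter most for this workload.
model = Lasso(alpha=0.05).fit(StandardScaler().fit_transform(X), latency)
ranked = sorted(zip(knob_names, np.abs(model.coef_)), key=lambda kv: -kv[1])
print(ranked)  # buffer_pool_mb and log_file_mb dominate; the rest are ~0
```

Pruning the knob space this way is what makes the later steps, workload mapping and setting recommendation, tractable.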

Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms

Kevin K. Chang, A. Giray Yaglikçi, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan & Onur Mutlu

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

The energy consumption of DRAM is a critical concern in modern computing systems. Improvements in manufacturing process technology have allowed DRAM vendors to lower the DRAM supply voltage conservatively, which reduces some of the DRAM energy consumption. We would like to reduce the DRAM supply voltage more aggressively, to further reduce energy. Aggressive supply voltage reduction requires a thorough understanding of the effect voltage scaling has on DRAM access latency and DRAM reliability.

In this paper, we take a comprehensive approach to understanding and exploiting the latency and reliability characteristics of modern DRAM when the supply voltage is lowered below the nominal voltage level specified by DRAM standards. Using an FPGA-based testing platform, we perform an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured recently by three major DRAM vendors. We find that reducing the supply voltage below a certain point introduces bit errors in the data, and we comprehensively characterize the behavior of these errors. We discover that these errors can be avoided by increasing the latency of three major DRAM operations (activation, restoration, and precharge). We perform detailed DRAM circuit simulations to validate and explain our experimental findings. We also characterize the various relationships between reduced supply voltage and error locations, stored data patterns, DRAM temperature, and data retention.

Based on our observations, we propose a new DRAM energy reduction mechanism, called Voltron. The key idea of Voltron is to use a performance model to determine by how much we can reduce the supply voltage without introducing errors and without exceeding a user-specified threshold for performance loss. Our evaluations show that Voltron reduces the average DRAM and system energy consumption by 10.5% and 7.3%, respectively, while limiting the average system performance loss to only 1.8%, for a variety of memory-intensive quad-core workloads. We also show that Voltron significantly outperforms prior dynamic voltage and frequency scaling mechanisms for DRAM.

Efficient Redundancy Techniques for Latency Reduction in Cloud Systems

Gauri Joshi, Emina Soljanin & Gregory Wornell

ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), Volume 2, Issue 2, May 2017.

In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers and reduce latency. But adding redundancy may result in higher cost of computing resources, as well as an increase in queueing delay due to higher traffic load. This work helps in understanding when and how redundancy gives a cost-efficient reduction in latency. For a general task service time distribution, we compare different redundancy strategies in terms of the number of redundant tasks and the time when they are issued and canceled. We get the insight that the log-concavity of the task service time creates a dichotomy of when adding redundancy helps. If the service time distribution is log-convex (i.e., log of the tail probability is convex), then adding maximum redundancy reduces both latency and cost. And if it is log-concave (i.e., log of the tail probability is concave), then less redundancy, and early cancellation of redundant tasks, is more effective. Using these insights, we design a general redundancy strategy that achieves a good latency-cost trade-off for an arbitrary service time distribution. This work also generalizes and extends some results in the analysis of fork-join queues.

Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last

Prashanth Menon, Todd C. Mowry & Andrew Pavlo

Proceedings of the VLDB Endowment, Vol. 11, No. 1, 2017.

In-memory database management systems (DBMSs) are a key component of modern on-line analytic processing (OLAP) applications, since they provide low-latency access to large volumes of data. Because disk accesses are no longer the principal bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance. Recent systems have revived an older technique of using just-in-time (JIT) compilation to execute queries as native code instead of interpreting a plan. The state-of-the-art in query compilation is to fuse operators together in a query plan to minimize materialization overhead by passing tuples efficiently between operators. Our empirical analysis shows, however, that more tactful materialization yields better performance. We present a query processing model called "relaxed operator fusion" that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized. This allows the DBMS to take advantage of inter-tuple parallelism inherent in the plan using a combination of prefetching and SIMD vectorization to support faster query execution on data sets that exceed the size of CPU-level caches. Our evaluation shows that our approach reduces the execution time of OLAP queries by up to 2.2X and achieves up to 1.8X better performance compared to other in-memory DBMSs.
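The key move in relaxed operator fusion is inserting a staging point where a vector of intermediate tuples is materialized, so the next operator can process a dense batch with SIMD and prefetching instead of tuple-at-a-time calls. A schematic numpy sketch of a filter feeding a staged hash-join probe follows; a real engine would do this in compiled native code with explicit software prefetches, and all names here are ours.

```python
import numpy as np

BATCH = 4096  # staging buffer size, chosen to fit in CPU caches

def filter_then_probe(keys: np.ndarray, vals: np.ndarray, hash_table: dict):
    """Scan -> filter -> staging point -> join probe, one vector at a time."""
    results = []
    for start in range(0, len(keys), BATCH):
        k, v = keys[start:start + BATCH], vals[start:start + BATCH]
        mask = v > 100.0   # SIMD-style vectorized predicate over the batch
        staged = k[mask]   # staging point: materialize only the survivors
        # The staged batch is dense, so probes are cache-friendly; a native
        # engine would also issue prefetches for the hash buckets here.
        results.extend(hash_table[key] for key in staged if key in hash_table)
    return results

keys = np.arange(1_000_000) % 10_000
vals = np.random.rand(1_000_000) * 200.0
print(len(filter_then_probe(keys, vals, {42: "order-42", 7: "order-7"})))
```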

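The dichotomy in Joshi et al.'s redundancy analysis above hinges on whether log P(T > t) is convex or concave. A quick numeric check (illustrative only, with arbitrarily chosen parameters) for two classic cases: a hyper-exponential service time, whose tail is log-convex so full redundancy helps, versus a shifted exponential, whose tail is log-concave so less redundancy and early cancellation win.

```python
import numpy as np

t = np.linspace(0.1, 10, 200)

# Hyper-exponential tail: mixture of two exponentials (log-convex).
log_tail_hyper = np.log(0.5 * np.exp(-0.5 * t) + 0.5 * np.exp(-2.0 * t))

# Shifted exponential tail: constant delay delta, then Exp(mu) (log-concave).
delta, mu = 1.0, 1.0
log_tail_shift = np.where(t < delta, 0.0, -mu * (t - delta))

def second_diff_sign(y):
    d2 = np.diff(y, 2)  # discrete second derivative of the log-tail
    return "convex" if d2.min() >= -1e-9 else "concave"

print("hyper-exponential :", second_diff_sign(log_tail_hyper))   # convex
print("shifted exponential:", second_diff_sign(log_tail_shift))  # concave
```

Intuitively, a log-convex tail means a task that has run a long time is likely to run much longer, so launching more copies is cheap insurance; a log-concave tail means old tasks are close to finishing, so extra copies mostly waste capacity.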
EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding

K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica & Kannan Ramchandran

12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), November 2-4, 2016, Savannah, GA.

Data-intensive clusters and object stores are increasingly relying on in-memory object caching to meet their I/O performance demands. These systems routinely face the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. In this paper, we explore an alternative approach using erasure coding.

EC-Cache is a load-balanced, low-latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. EC-Cache employs erasure coding by: (i) splitting and erasure coding individual objects during writes, and (ii) late binding, wherein obtaining any k out of (k + r) splits of an object is sufficient, during reads. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.

Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms

Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri & Onur Mutlu

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of variation, such as manufacturing-process- or temperature-induced variation. We empirically demonstrate a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the DRAM chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation. In particular, we show that in most real DRAM chips, cells closer to the peripheral structures can be accessed much faster than cells that are farther. We call this phenomenon design-induced variation in DRAM. Our goals are to i) understand design-induced variation that exists in real, state-of-the-art DRAM chips, ii) exploit it to develop low-cost mechanisms that can dynamically find and use the lowest latency at which to operate a DRAM chip reliably, and, thus, iii) improve overall system performance while ensuring reliable system operation.

To this end, we first experimentally demonstrate and analyze design-induced variation in modern DRAM devices by testing and characterizing 96 DIMMs (768 DRAM chips). Our characterization identifies DRAM regions that are vulnerable to errors, if operated at lower latency, and finds consistency in their locations across a given DRAM chip generation, due to design-induced variation. Based on our extensive experimental analysis, we develop two mechanisms that reliably reduce DRAM latency. First, DIVA Profiling uses runtime profiling to dynamically identify the lowest DRAM latency that does not introduce failures. DIVA Profiling exploits design-induced variation and periodically profiles only the vulnerable regions to determine the lowest DRAM latency at low cost. It is the first mechanism to dynamically determine the lowest latency that can be used to operate DRAM reliably. DIVA Profiling reduces the latency of read/write requests by 35.1%/57.8%, respectively, at 55°C. Our second mechanism, DIVA Shuffling, shuffles data such that values stored in vulnerable regions are mapped to multiple error-correcting code (ECC) codewords. As a result, DIVA Shuffling can correct 26% more multi-bit errors than conventional ECC. Combined together, our two mechanisms reduce read/write latency by 40.0%/60.5%, which translates to an overall system performance improvement of 14.7%/13.7%/13.8% (in 2-/4-/8-core systems) across a variety of workloads, while ensuring reliable operation.

Figure: Design-induced variation due to row organization. Panels (a) conceptual bitline and (b) open bitline scheme show high-error and low-error regions along a 512-cell bitline, relative to the wordline drivers and local sense amplifiers.
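EC-Cache's read path (above) is a clean illustration of late binding: issue reads for all k + r splits and decode as soon as any k arrive, ignoring stragglers. A simplified Python sketch follows; the real system uses Reed-Solomon coding over a cluster, while here `fetch_split` and `decode` are caller-supplied stubs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def read_object(servers, fetch_split, decode, k: int):
    """Late-binding read: request all k + r splits, decode once any k arrive."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(fetch_split, s) for s in servers]
        splits = []
        for fut in as_completed(futures):
            try:
                splits.append(fut.result())
            except IOError:
                continue                  # a failed or slow server is tolerable
            if len(splits) == k:          # any k of the (k + r) splits suffice
                for f in futures:
                    f.cancel()            # stragglers are no longer needed
                return decode(splits)
    raise IOError("fewer than k splits were readable")
```

Because every read touches many servers for a small slice each, load spreads evenly even when one object is very hot, which is the source of the load-balancing gains over selective replication.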

SPRING 2018 31 YEAR IN REVIEW continued from page 4 ™™Conglong Li presented “Workload exascale file systems. Edge-caching for Recognition Analysis and Caching Strategies ™™Charles McGuffey interned with Applications” at ICDCS ‘17 in for Search Advertising Systems” at Google in Sunnyvale, CA, work- Atlanta, GA. SoCC ’17 in Santa Clara, CA. ing on cache partitioning systems May 2016 for Google infrastructure. August 2017 ™™Hongyi Xin proposed his disserta- ™™Kai Ren successfully defended his ™™Jinliang Wei interned with Saeed tion research “Novel Computa- PhD thesis on “Fast Storage for Maleki, Madan Musuvathi and tional Techniques for Mapping File System Metadata.” Todd Mytkowicz at Microsoft Re- Next-Generation Sequencing ™™Souptik Sen interned with Linke- search in Redmond WA, working Reads.” dIn’s Data group in Sunnyvale, on parallelizing and scaling out ™™Kevin K. Chang successfully de- working with Venkatesh Iyer and stochastic gradient descent with fended his PhD research on “Un- Subbu Sanka on a data tooling sequential semantics. derstanding and Improving the library in Scala which converts ge- June 2016 Latency of DRAM-Based Memory neric parameterized Hive queries ™™ M. Satyanarayanan and Col- System.” to Spark to create an optimized leagues Honored for Creation of ™™Jin Kyu Kim proposed his PhD workflow on LinkedIn’s advertis- Andrew File System research “STRADS: A New Dis- ing data pipeline. ™™Junchen Jiang successfully de- tributed Framework for Scheduled ™™Saurabh Kadekodi interned with fended his PhD dissertation Model-Parallel Machine Learn- Alluxio, Inc. in California, work- “Enabling Data-Driven Optimiza- ing.” ing on packing and indexing in tion of Quality of Experience in ™™Dana Van Aken presented “Au- cloud file systems. Internet.” tomatic Database Management System Tuning Through Large- ™™Aaron Harlap interned with Mi- ™ ™Rajat Kateja presented “Viyojit: scale Machine Learning” at ICMD crosoft Research in Seattle, WA, Decoupling Battery and DRAM ‘17 in Chicago, IL. working on “Scaling up Distrib- Capacities for Battery-Backed uted DNN Training.” DRAM” at ISCA ’17 in Toronto, ™™19th annual PDL Spring Visit Day. ™™Qing Zheng interned with LANL ON, Canada. in Los Alamos, NM, working on ™™Utsav Drolia presented “Cachier:

2017 PDL Workshop and Retreat.
