n e w s l e t t e r o n p d l a c t i v i t i e s a n d e v e n t s • s p r i n g 2 0 1 7 http://www.pdl.cmu.edu/

Automatic Database Management System an informal publication from academia’s premiere storage Tuning Through Large-scale Machine Learning systems research center devoted by Dana Van Aken to advancing the state of the Database management systems (DBMSs) are the most important component of art in storage and information any modern data-intensive application. But, this is historically a difficult task infrastructures. because DBMSs have hundreds of configuration “knobs” that control everything in the system, such as the amount of memory to use for caches and how often data CONTENTS is written to storage. This means that organizations often hire human experts to help with tuning activities, though this method can be prohibitively expensive. Automatic DBMS Tuning...... 1 OtterTune is a new automatic DBMS configuration tool that overcomes these challenges, making it easier for anyone to deploy a DBMS able to handle large Director’s Letter...... 2 amounts of data and more complex workloads without any expertise needed in Year in Review...... 4 database administration. The key feature of OtterTune is that it reduces the amount Recent Publications...... 5 of time and resources it takes to tune a DBMS for a new application by leverag- PDL News & Awards...... 8 ing knowledge gained from previous DBMS deployments. OtterTune maintains Big Learning in the Cloud...... 12 a repository of data collected from previous tuning sessions, and uses this data to build models of how the DBMS responds to different knob configurations. For a Defenses & Proposals...... 14 new application, these models are used to guide experimentation and recommend New PDL Faculty...... 19 optimal settings to the user to improve a target objective (e.g., latency, throughput). Alumni Update...... 19 In this article, we first provide a high-level example to demonstrate the tasks per- formed by each of the components in the OtterTune system, and show how they PDL CONSORTIUM interact with one another when automatically tuning a DBMS’s configuration. MEMBERS We next discuss each of the components in OtterTune’s ML pipeline. Finally, we conclude with an evaluation of OtterTune’s efficacy by comparing the performance Broadcom, Ltd. of its best configuration with ones selected by DBAs and other automatic tuning Citadel tools for MySQL and Postgres. All the experiments for our evaluation and data Dell EMC collection were conducted on Amazon EC2 Spot Instances. OtterTune’s ML al- Google gorithms were implemented using Google TensorFlow and Python’s scikit-learn. Hewlett-Packard Labs The OtterTune System Hitachi, Ltd. Intel Corporation Figure 1 shows an overview of the components in the OtterTune system. At the start of Research a new tuning session, the DBA tells OtterTune what metric to optimize when selecting MongoDB a configuration (e.g., latency, NetApp, Inc. throughput). The client-side Tuning Manager controller then connects to the Oracle Corporation Analysis Planning target DBMS and collects its Samsung Information Systems America Amazon EC2 instance type and Seagate Technology current knob configuration. Tintri ML Controlle r Models Data JDBC DBMS Repository Next, the controller starts Two Sigma its first observation period, Toshiba during which the controller Veritas Western Digital Figure 1: The OtterTune System continued on page 10 FROM THE DIRECTOR’S CHAIR THE PDL PACKET GREG GANGER THE PARALLEL DATA Hello from fabulous Pittsburgh! LABORATORY School of Computer Science It has been another great year for the Department of ECE Parallel Data Lab. Some highlights in- Carnegie Mellon University clude record enrollment in PDL’s stor- 5000 Forbes Avenue age systems and cloud classes, continued growth and success for PDL’s database systems and big-learning systems research, Pittsburgh, PA 15213-3891 and new directions for cloud and NVM research. Along the way, many students voice 412•268•6716 graduated and joined PDL Consortium companies, new students joined the fax 412•268•3010 PDL, and many cool papers have been published. Let me highlight a few things. PUBLISHER Since it headlines this newsletter, I’ll also lead off my highlights by talking about Greg Ganger PDL’s database systems research. The largest scope activities focus on automa- tion in DBMSs, such as in the front-page article, and creating a new open source EDITOR DBMS called Peloton to embody these and other ideas from the database group. Joan Digney There also continue to be cool results and papers on deduplication in databases, better exploitation of NVM in databases, and improved concurrency control The PDL Packet is published once per year to update members of the PDL mechanisms. The energy that Andy has infused into database systems research at Consortium. A pdf version resides in CMU makes it a lot of fun to watch and participate in. the Publications section of the PDL Web pages and may be freely distributed. In addition to applying machine learning (ML) to systems for automation pur- Contributions are welcome. poses, we continue to explore new approaches to system support for large-scale machine learning. As discussed in the second newsletter article, we have been THE PDL LOGO especially active in exploring how ML systems should adapt to cloud computing Skibo Castle and the lands that com­ environments. prise its estate are located in the Kyle of Sutherland in the northeastern part of Important questions include how such systems should address dynamic resource Scotland. Both ‘Skibo’ and ‘Sutherland’ availability, unpredictable levels of time-varying resource interference, and are names whose roots are from Old Norse, the language spoken by the geo-distributed data. And, in a fun twist, we are developing new approaches to Vikings who began washing ashore automatic tuning for ML systems—ML to improve ML. reg­ularly in the late ninth century. The word ‘Skibo’ fascinates etymologists, I continue to be excited about the growth and evolution of the storage systems who are unable to agree on its original and cloud classes created and led by PDL faculty... the popularity of both is at an meaning. All agree that ‘bo’ is the Old Norse for ‘land’ or ‘place,’ but they argue all-time high. For example, over 100 MS students completed the storage class last whether ‘ski’ means ‘ships’ or ‘peace’ Fall, and the project-intensive cloud classes serve even more. We refined the new or ‘fairy hill.’ myFTL project such that the students’ FTLs store real data in the NAND Flash Although the earliest version of Skibo SSD simulator we provide them, which includes basic interfaces for read-page, seems to be lost in the mists of time, write-page, and erase-block. The students write FTL code to maintain LBA-to- it was most likely some kind of fortified physical mappings, perform cleaning, and address wear leveling. In addition to building erected by the Norsemen. The our lectures and the projects, these classes feature 3-5 corporate guest lecturers present-day castle was built by a bishop of the Roman Catholic Church. Andrew (thank you, PDL Consortium members!) bringing insight on real-world solu- Carnegie, after making his fortune, tions, trends, and futures. bought it in 1898 to serve as his sum­mer home. In 1980, his daughter, Margaret,­ PDL continues to explore questions of resource scheduling for cloud computing, donated Skibo to a trust that later sold which grows in complexity as the breadth of application and resource types grows. the estate. It is presently being run as a luxury hotel. Our Tetrisched project has developed new ways of allowing users to express their per-job resource type preferences (e.g., machine locality or hardware accelera- tors) and then exploring the trade-offs among them to maximize utility of the public and/or private cloud infrastructure. Our most recent work explores more

2 THE PDL PACKET PARALLEL DATA LABORATORY FROM THE DIRECTOR’S CHAIR

FACULTY robust ways of exploiting estimated runtime information, finding that provid- Greg Ganger (PDL Director) ing full distributions of likely runtimes (e.g., based on history of “similar” jobs) 412•268•1297 [email protected] works quite well for real-world workloads as reflected in real cluster traces. In collaboration with the new Intel-funded visual cloud systems research center, David Andersen Seth Copen Goldstein Lujo Bauer Mor Harchol-Balter we are also exploring how processing of visual data should be partitioned and Nathan Beckmann Todd Mowry distributed across capture (edge) devices and core cloud resources. Chuck Cranor Onur Mutlu Lorrie Cranor Priya Narasimhan Our explorations of how systems should be designed differently to exploit and Christos Faloutsos David O’Hallaron Kayvon Fatahalian Andy Pavlo work with new underlying storage technologies, especially NVM and SMR, Eugene Fink Majd Sakr continues to expand on many fronts. Activities range from new search and sort Rajeev Gandhi M. Satyanarayanan Saugata Ghose Srinivasan Seshan algorithms that understand the asymmetry between read and write performance Phil Gibbons Alex Smola (i.e., assuming using NVM directly) to a modified Linux Ext4 file system imple- Garth Gibson Hui Zhang mentation that mitigates drive-managed SMR performance overheads. We’re STAFF MEMBERS excited about continuing to work with PDL companies on understanding where Bill Courtright, 412•268•5485 (PDL Executive Director) [email protected] the hardware is (and should be) going and how it should be exploited in systems. Karen Lindenfelser, 412•268•6716 (PDL Administrative Manager) [email protected] Naturally, PDL’s long-standing focus on scalable storage continues strongly. As Jason Boles always, a primary challenge is metadata scaling, and PDL researchers are explor- Joan Digney Chad Dougherty ing several approaches to dealing with scale along different dimensions. A cool Mitch Franzos new concept being explored is allowing high-end applications to manage their Charlene Zang own namespaces, for periods of time, bypassing traditional metadata bottlenecks VISITING RESEARCHERS / POST DOCS entirely during the heaviest periods of activity. George Amvrosiadis Hyeontaek Lim Daniel Berger Michael (Tieying) Zhang Many other ongoing PDL projects are also producing cool results. For example, Akira Kuroda to help our (and others’) file systems research, we have developed a new file system GRADUATE STUDENTS aging suite, called Geriatrix. Our key-value store research continues to expose Abutalib Aghayev Conglong Li new approaches to indexing and remote value access. Our continued operation Joy Arulraj Mu Li Rachata Ausavarungnirun Yang Li of private clouds in the Data Center Observatory (DCO) serves the dual purposes Ben Blum Haibin Lin of providing resources for real users (CMU researchers) and providing us with Amirali Boroumand Yixin Luo Sol Boucher Lin Ma invaluable Hadoop logs, instrumentation data, and case studies. This newslet- Christopher Canel Charles McGuffey ter and the PDL website offer more details and additional research highlights. Kevin Chang Prashanth Menon Andrew Chung Nathan Mickulicz I’m always overwhelmed by the accomplishments of the PDL students and staff, Henggang Cui Ravi Teja Mullapudi Wei Dai Jun Woo Park and it’s a pleasure to work with them. As always, their accomplishments point at Debanshu Das Akansha Patel great things to come. Utsav Drolia Matt Perron Chris Fallin Alex Poms Jian Fang Aurick Qiao Kristen Scholes Gardner Kai Ren Kiryong Ha Dana Van Aken Aaron Harlap Nandita Vijaykumar Kevin Hsieh Haoran Wang Saksham Jain Ziqi Wang Angela Jiang Jinliang Wei Junchen Jiang Lin Xiao Wesley Jin Pengtao Xie Abishek Joshi Hongyi Xin Ellango Jothimurugesan Lianghong Xu Saurabh Arun Kadekodi Jiajun Yao Anuj Kalia Yunfan Ye Rajat Kateja Huanchen Zhang Jin Kyu Kim Rui Zhang Industry guests gather with PDL faculty, staff Ben Blum shows some details of his work on Thomas Kim Qing Zheng and students in Roberts Hall for breakfast and “Systematic Testing with Data-Race Dimitris Konomis Dong Zhou an industry poster session at the opening of Preemption Points” with Jeff Heller of NetApp Timothy Zhu Michael Kuchnik the 2016 PDL Retreat. at the 2016 PDL Retreat.

SPRING 2017 3 YEAR IN REVIEW

May 2017 April 2017 January 2017 ™™19th annual Spring Visit Day. ™™Henggang Cui successfully ™™Adam Pennington receives CMU ™™Timothy Zhu defended his PhD defended his PhD research on Alumni Award. thesis “Meeting Tail Latency SLOs “Exploiting Application Char- ™™Abutalib Aghayev receives Hugo in Shared Networked Storage.” acteristics for Efficient System Patterson PDL Entrepreneurship Support of Data-Parallel Machine ™ Fellowship. ™Jinliang Wei presented his speak- Learning.” ing skills talk on “Managed ™™Andy Pavlo presented “Self-Driv- Communication and Consistency ™™Ben Blum proposed his thesis ing Database Management Systems” for Fast Data-Parallel Iterative research “Practical Concurrency CIDR ‘17 in Chaminade, CA. Analytics.” Testing.” December 2016 ™™Rachata Ausavarungnirun suc- ™™Priya Narasimhan honored with cessfully defended his PhD thesis Pittsburgh History Makers Award. ™™Todd Mowry became an ACM Fellow. “Techniques for Shared Resource ™™Aaron Harlap presented “Pro- November 2016 Management in Systems with teus: Agile ML Elasticity through Graphics Processing Units.” Tiered Reliability in Dynamic ™™Jiaqi Tan successfully defended Resource Markets” at EuroSys ‘17 his PhD dissertation “Prescriptive ™ ™Presentations at the SIGMOD in Belgrade, Serbia. Safety-Checks through Automated Int’l Conference on Data Manage- Proofs for Control-Flow Integrity.” ment 2017 included “Automatic March 2017 ™™Jun Woo Park gave his speaking Database Management System ™™Presentations at NSDI ‘17 in skills talk on “JVuSched: Robust Tuning Through Large-scale Boston, MA included “AdaptSize: Machine Learning” (Dana Van Orchestrating the Hot Object Scheduling with Auto-estimated Aken), and “Online Deduplica- Memory Cache in a Content De- Job Runtimes.” tion for Databases” (Lianghong livery Network” (Daniel Berger), ™™Anuj Kalia presented “FaSST: Xu). and “Gaia: Geo-Distributed Ma- Fast, Scalable and Simple Distrib- ™™Rajat Kateja will be interning at chine Learning Approaching LAN uted Transactions with Two-sided Google with their cloud platforms Speeds” (Kevin Hsieh). (RDMA) Datagram RPCs” at team this coming summer, work- ™™Utsav Drolia presented “Towards OSDI ‘17 in Savannah, GA. ing with Niket Agarwal. Edge-caching for Image Recogni- ™™Jiaqi Tan presented “AUSPICE-R: ™™Saksham Jain will be interning in tion” at SmartEdge ‘17 in Hawaii. Automatic Safety-Property Proofs VMware in Palo Alto this summer February 2017 for Realistic Features in Machine in their VMKernel Team. Code” at ASPLAS ‘16 in Hanoi, ™™Mu Li successfully defended his Vietnam. ™™Charles McGuffey will be intern- PhD thesis “Scaling Distributed ing with Google in Mountain Machine Learning with System October 2016 View, CA this summer. and Algorithm Co-design.” ™™24th annual PDL Retreat. ™™Aaron Harlap will be interning ™ ™Anuj Kalia was named a 2017 ™™Ben Blum presented “Stateless with Microsoft, working on ML Facebook Fellow. Model Checking with Data-Race for Big Data. ™™Abutalib Aghayev presented Preemption Points” at SPLASH “Evolving Ext4 for Shingled Disks” 2016 in Amsterdam, Netherlands. at FAST ‘17 in Santa Clara, CA. ™™Kiryong Ha successfully defended his ™™Two papers from Onur Mutlu’s PhD thesis “System Infrastructure group were presented at HPCA ‘17 for Mobile-Cloud Convergence.” in Austin, TX: “Vulnerabilities ™™Mu Li proposed his thesis re- in MLC NAND Programming: Experimental search “Scaling Distributed Ma- Analysis, Exploits, and Mitigation chine Learning with System and Techniques” Algorithm Co-design.” ™ ™™(Yu Cai), and “SoftMC: A Flex- ™Timothy Zhu presented “SNC- ible and Practical Open-Source Meister: Admitting More Tenants Infrastructure for Enabling with Tail Latency SLOs” at SoCC PDL staff members Jason Bole and Charlene ‘16 in Santa Clara, CA. Zang discuss PDL infrastructure while Experimental DRAM Studies” enjoying the PDL Retreat. (Hasan Hassan). continued on page 32

4 THE PDL PACKET RECENT PUBLICATIONS

Evolving Ext4 for Shingled Principled Workflow-centric designers to navigate the RDMA de- Disks Tracing of Distributed Systems sign space. Our guidelines emphasize paying attention to low-level details Abutalib Aghayev, Theodore Ts’o, Raja R. Sambasivan, Ilari Shafer, such as individual PCIe transactions Garth Gibson & Peter Desnoyers Jonathan Mace, Benjamin H. and NIC architecture. We empirically Sigelman, Rodrigo Fonseca & Gregory demonstrate how these guidelines can 15th USENIX Conference on File and R. Ganger be used to improve the performance Storage Technologies (FAST ‘17), Feb of RDMA-based systems: we design a ACM Symposium on Cloud Comput- 27–Mar 2, 2017. Santa Clara, CA. networked sequencer that outperforms ing 2016 Santa Clara, CA,October an existing design by 50x, and improve Drive-Managed SMR (Shingled Mag- 5 - 7, 2016. netic Recording) disks offer a plug- the CPU efficiency of a prior high- compatible higher-capacity replace- Workflow-centric tracing captures the performance key-value store by 83%. ment for conventional disks. For workflow of causally-related events We also present and evaluate several (e.g., work done to process a request) non-sequential workloads, these disks new RDMA optimizations and pitfalls, within and among the components of show bimodal behavior: After a short and discuss how they affect the design a distributed system. As distributed of RDMA systems. period of high throughput they enter a systems grow in scale and complexity, continuous period of low throughput. such tracing is becoming a critical tool SNC-Meister: Admitting More We introduce ext4-lazy, a small change for understanding distributed system Tenants with Tail Latency SLOs to the Linux ext4 file system that sig- behavior. Yet, there is a fundamental nificantly improves the throughput in lack of clarity about how such infra- Timothy Zhu, Daniel S. Berger & Mor both modes. We present benchmarks structures should be designed to pro- Harchol-Balter vide maximum benefit for important on four different drive-managed SMR ACM Symposium on Cloud Comput- management tasks, such as resource disks from two vendors, showing that ing 2016 (SoCC’16), Santa Clara, CA, ext4-lazy achieves 1.7-5.4X improve- accounting and diagnosis. Without October 5 - 7, 2016. ment over ext4 on a metadata-light research into this important issue, there is a danger that workflow-centric Meeting tail latency Service Level file server benchmark. On metadata- Objectives (SLOs) in shared cloud heavy benchmarks it achieves 2-13X tracing will not reach its full potential. To help, this paper distills the design networks is both important and chal- improvement over ext4 on drive- lenging. One primary challenge is de- managed SMR disks as well as on con- space of workflow-centric tracing and describes key design choices that can termining limits on the multitenancy ventional disks. help or hinder a tracing infrastruc- such that SLOs are met. Doing so ture’s utility for important tasks. Our involves estimating latency, which is design space and the design choices we difficult, especially when tenants ex- 1 J hibit bursty behavior as is common in S suggest are based on our experiences 2 Journal developing several previous workflow- production environments. Neverthe- Memory Disk centric tracing infrastructures. less, recent papers in the past two years (a) Journaling under ext4 (Silo, QJump, and PriorityMeister) Design Guidelines for High show techniques for calculating latency S J 2 1 J based on a branch of mathematical Performance RDMA Systems Map Journal modeling called Deterministic Net- Memory Disk Anuj Kalia, Michael Kaminsky & David work Calculus (DNC). The DNC (b) Journaling under ext4-lazy G. Andersen theory is designed for adversarial worst-case conditions, which is some- (a) Ext4 writes a metadata block to disk 2016 USENIX Annual Technical times necessary, but is often overly twice. It first writes the metadata block to the Conference (USENIX ATC’16), June conservative. Typical tenants do not journal at some location J and marks it dirty in 2016. Best Student Paper. require strict worst-case guarantees, memory. Later, the writeback thread writes Modern RDMA hardware offers the but are only looking for SLOs at lower the same metadata block to its static location potential for exceptional performance, percentiles (e.g., 99th, 99.9th). S on disk, resulting in a random write. (b) Ext4-lazy, writes the metadata block but design choices including which This paper describes SNC-Meister, a approximately once to the journal and inserts RDMA operations to use and how to new admission control system for tail a mapping (S,J) to an in-memory map so that use them significantly affect observed latency SLOs. SNC-Meister improves the file system can find the metadata block performance. This paper lays out in the journal. guidelines that can be used by system continued on page 6

SPRING 2017 5 RECENT PUBLICATIONS continued from page 5 and (ii) an unbounded heap-allocated offload intensive tasks to edge-servers shared memory with asymmetric read- instead of the cloud, to reduce latency. write costs, and show how the costs in In this paper, we propose a different the model map efficiently onto a more model for using edge-servers. We concrete machine model under a work- propose to use the edge as a special- stealing scheduler. We use the new ized cache for recognition applications model to design reduced-write, work- and formulate the expected latency for efficient, low-span parallel algorithms such a cache. We show that using an for a number of fundamental problems edge-server like a typical web-cache, such as reduce, list contraction, tree for recognition applications, can Admission numbers for state-of-the-art contraction, breadth-first search, or- admission control systems and SNC-Meister lead to higher latencies. We propose in 100 randomized experiments. In each dered filter, and planar convex hull. Cachier, a system that uses the caching experiment, 180 tenants, each submitting For the latter two problems, our algo- model along with novel optimizations hundreds of thousands of requests, arrive rithms are output-sensitive in that the to minimize latency by adaptively bal- in random order and seek a 99.9% SLO work and number of writes decrease ancing load between the edge and the randomly drawn from f10ms, 20ms, 50ms, with the output size. We also present cloud by leveraging spatiotemporal lo- 100msg. While all systems meet all SLOs, a reduced-write, low-span minimum cality of requests, using offline analysis SNC-Meister is able to support on average spanning tree algorithm that is nearly of applications, and online estimates 75% more tenants with tail latency SLOs than the next-best system. work-efficient (off by the inverse -Ack of network conditions. We evaluate ermann function). Our algorithms re- Cachier for image-recognition appli- upon the state-of-the-art DNC-based veal several interesting techniques for cations and show that our techniques systems by using a new theory, Stochas- significantly reducing shared memory yield 3x speed-up in responsiveness, tic Network Calculus (SNC), which is writes in parallel algorithms without and perform accurately over a range designed for tail latency percentiles. asymptotically increasing the number of operating conditions. To the best Focusing on tail latency percentiles, of shared memory reads. of our knowledge, this is the first work rather than the adversarial worst-case that models edge-servers as caches for DNC latency, allows SNC-Meister to Cachier: Edge-caching for compute-intensive recognition appli- pack together many more tenants: in Recognition Applications cations, and Cachier is the first system experiments with production traces, that uses this model to minimize la- Utsav Drolia, Katherine Guo, Jiaqi Tan, tency for these applications. SNC-Meister supports 75% more ten- Rajeev Gandhi & Priya Narasimhan ants than the state-of-the-art. The 37th IEEE International Con- Automatic Database Parallel Algorithms for ference on Distributed Computing Management System Tuning Asymmetric Read-Write Costs Systems (ICDCS 2017), June 5 – 8, Through Large-scale Machine 2017, Atlanta, GA, USA. Learning Naama Ben-David, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Recognition and perception-based Dana Van Aken, Andrew Pavlo, Yan Gu, Charles McGuffey & Julian Shun mobile applications, such as image rec- Geoffrey J. Gordon & Bohan Zhang ognition, are on the rise. These appli- SPAA 2016. 28th ACM Symposium on cations recognize the user’s surround- ACM SIGMOD International Con- Parallelism in Algorithms and Archi- ings and augment it with information ference on Management of Data, May tectures, July 11 - 13, 2016. Asilomar and/or media. These applications are 14-19, 2017. Chicago, IL, USA. State Beach, CA, USA. latency-sensitive. They have a soft- Database management system (DBMS) Motivated by the significantly higher realtime nature - late results are configuration tuning is an essential cost of writing than reading in emerg- potentially meaningless. On the one aspect of any data-intensive applica- ing memory technologies, we consider hand, given the compute-intensive tion effort. But this is historically a parallel algorithm design under such nature of the tasks performed by such difficult task because DBMSs have asymmetric read-write costs, with the applications, execution is typically hundreds of configuration “knobs” goal of reducing the number of writes offloaded to the cloud. On the other that control everything in the system, while preserving work-efficiency and hand, offloading such applications such as the amount of memory to use low span. We present a nested-parallel to the cloud incurs network latency, for caches and how often data is writ- model of computation that combines which can increase the user-perceived ten to storage. The problem with these (i) small per-task stack-allocated mem- latency. Consequently, edge-comput- ories with symmetric read-write costs ing has been proposed to let devices continued on page 7

6 THE PDL PACKET RECENT PUBLICATIONS continued from page 6 knobs is that they are not standardized Beyond block-level compression of A CDN is a large distributed system of (i.e., two DBMSs use a different name individual database pages or operation servers that caches and delivers con- for the same knob), not independent log (oplog) messages, as used in today’s tent to users. The first-level cache in (i.e., changing one knob can impact DBMSs, dbDedup uses byte-level delta a CDN server is the memory-resident others), and not universal (i.e., what encoding of individual records within Hot Object Cache (HOC). A major works for one application may be sub- the database to achieve greater savings. goal of a CDN is to maximize the object optimal for another). Worse, infor- dbDedup’s single-pass encoding meth- hit ratio (OHR) of its HOCs. But, the mation about the effects of the knobs od can be integrated into the storage small size of the HOC, the huge vari- typically comes only from (expensive) and logging components of a DBMS ance in the requested object sizes, and experience. to provide two benefits: (1) reduced the diversity of request patterns make size of data stored on disk beyond To overcome these challenges, we this goal challenging. what traditional compression schemes present an automated approach that We propose AdaptSize, the first adap- provide, and (2) reduced amount of tive, size-aware cache admission policy leverages past experience and col- data transmitted over the network for HOCs that achieves a high OHR, lects new information to tune DBMS for replication services. To evaluate even when object size distributions and configurations: we use a combination our work, we implemented dbDedup request characteristics vary significant- of supervised and unsupervised ma- in a distributed NoSQL DBMS and chine learning methods to (1) select analyzed its properties using four ly over time. At the core of AdaptSize is the most impactful knobs, (2) map real datasets. Our results show that a novel Markov cache model that seam- unseen database workloads to previous dbDedup achieves up to 37X reduc- lessly adapts the caching parameters to workloads from which we can trans- tion in the storage size and replication the changing request patterns. Using fer experience, and (3) recommend traffic of the database on its own and request traces from one of the largest knob settings. We implemented our up to 61X reduction when paired with CDNs in the world, we show that our techniques in a new tool called Otter- the DBMS’s blocklevel compression. implementation of AdaptSize achieves Tune and tested it on three DBMSs. dbDedup provides both benefits with significantly higher OHR than widely- Our evaluation shows that OtterTune negligible effect on DBMS throughput used production systems: 30-48% and recommends configurations that are as or client latency (average and tail). 47-91% higher OHR than Nginx and good as or better than ones generated Varnish, respectively. AdaptSize also by existing tools or a human expert. AdaptSize: Orchestrating the achieves 33-46% higher OHR than Hot Object Memory Cache in a state-of-the-art research systems. Online Deduplication for Content Delivery Network Further, AdaptSize is more robust to Databases changing request patterns than the Daniel S. Berger, Ramesh K. Sitaraman traditional tuning approach of hill Lianghong Xu, Andrew Pavlo, Sudipta & Mor Harchol-Balter climbing and shadow queues studied Sengupta & Gregory R. Ganger in other contexts. 14th USENIX Symposium on Net- ACM SIGMOD International Con- worked Systems Design and Imple- Improving the Reliability of ference on Management of Data, May mentation (NSDI ‘17). March 27–29, Chip-off Forensic Analysis of 14-19, 2017. Chicago, IL, USA. 2017, Boston, MA. NAND Flash Memory Devices dbDedup is a similarity-based dedu- Most major content providers use con- plication scheme for on-line data- tent delivery networks (CDNs) to serve Aya Fukami, Saugata Ghose, Yixin Luo, base management systems (DBMSs). web and video content to their users. Yu Cai & Onur Mutlu DFRWS Digital Forensics Research Conference Europe (DFRWS EU), Feature Index Source Delta March 21 - 23, 2017 Lake Constance, New extraction Sketch lookup Candidate selection Source compression Encoded Germany. record (top-K features) similar records record record (From local Digital forensic investigators often oplog) Database Remote Source Lossy update replication need to extract data from a seized record cache delta cache device that contains NAND flash Feature index Database memory. Many such devices are physi- cally damaged, preventing investiga- tors from using automated techniques dbDedup Workflow – (1) Feature Extraction, (2) Index Lookup, (3) Source Selection, and (4) Delta Compression. continued on page 20

SPRING 2017 7 AWARDS & OTHER PDL NEWS

April 2017 January 2017 dividuals have Priya Narasimhan Honored Anuj Kalia Named a 2017 been honored with History Makers Award Facebook Fellow through the program. Each year, the Congratula- Congratula- History Mak- tions to Anuj tions to Adam ers Award Din- Kalia, who has (CS 2001, E ner honors five been named to 2003), who the 2017 class men and wom- has been of Facebook en who have awarded a 2016 fellows! made a differ- CMU Alumni Award for Alumni Ser- ence through Founded in vice. He has been involved with various their work to 2010, the Carnegie Mellon University volunteer not only the Facebook Fel- roles since he graduated with a BS in Greater Pittsburgh area, but beyond. lowship program is designed to help Computer Science in 2001, and an MS Congratulations to Priya, Professor of foster ties with the academic commu- in Electrical and Computer Engineer- Electrical and Computer Engineering, nity, encourage and support promising ing in 2003. He is the current presi- and the CEO and founder of YinzCam Ph.D. students engaged in research dent of the Activities Board Technical who will receive this award at a banquet across computer science and engineer- Committee (ABTech) alumni interest on May 12 for her outstanding achieve- ing, and provide those students with group, which he founded in 2004. ments in technology and innovation. opportunities to work with Facebook In 2015, Adam started the ABTech Other awards are made in the fields on problems relevant to their research. Advancement Fund, which is dedicated of education, sports, healthcare and Since its inception, the program to helping current undergraduate community. has supported more than 50 Ph.D. members of ABTech attend industry candidates, whose research covers conferences. Adam also served on the -- information from heinzhistory- topics that range from power systems center.org Alumni Association Board from 2011 and microgrids to the intersection of until 2015, co-chairing both the Net- computer vision, machine learning works and Awards committees. January 2017 and cognitive science. Jun Woo Park and Sun Hee While at CMU, Adam was an execu- Anuj is a Ph.D. student in the Com- Baik Marry! tive in ABTech, a board member of puter Science Department, a member Scotch ‘n’ Soda, and an officer of both Congratulations to Jun Woo and Sun of the PDL, and an advisee of As- Hammerschlag tower clubs (Carnegie Hee, who were married on January sociate Professor David Anderson. Tech Radio Club [W3VC] and the ECE 7th, 2017 at home in Seoul, Korea! He received his B.Tech in computer Graduate Organization). He currently science from the Indian Institute of works for a nonprofit and spends as Technology-Dehli, and his research much time as possible SCUBA diving interests include networked systems, with his girlfriend, Lindsay Spriggs with a focus on designing efficient soft- (HS 2005, 2006). ware systems for high-speed networks. -- information from CMU Alumni -- information from CMU SCS News Association News, January 2017 Jan 31, 2017, Aisha Rashid January 2017 January 2017 Hugo Patterson PDL Adam Pennington Receives Entrepreneurship Fellowship CMU Alumni Award Awarded First presented in 1950, the Carnegie The Parallel Data Laboratory is pleased Mellon Alumni Awards recognize to announce that the Hugo Patterson alumni, students, and faculty for their PDL Entrepreneurship Fellowship has service to the university and its alumni, been Awarded to Abutalib Aghayev. for their achievements in the arts, This award provides academic (tuition humanities, sciences, technology, and business. To date more than 880 in- continued on page 9

8 THE PDL PACKET AWARDS & OTHER PDL NEWS continued from page 8 and stipend) support to a student December 2016 applications for researchers and prac- working within the umbrella of the Todd Mowry Named ACM titioners that go beyond the standard PDL who has had experience in the Fellow benchmarks. workforce prior to entering a graduate We built a crawler that finds applica- program. Congratula- tions to Todd tions hosted on public repositories Abutalib Aghayev is a Ph.D. student Mowry on be- (e.g., GitHub). We then created a in the Computer Science Department ing named an framework that automatically learns at Carnegie Mellon University. He ACM Fellow for how to deploy and execute an applica- received his M.S. degree from North- 2016! Mowry, tion inside a virtual machine sandbox. eastern University and B.S. degree a professor in You can then safely download the ap- from Bogazici University, Istanbul, the Computer plication on your local machine and Turkey. Before coming to United Science Depart- execute it to collect query traces and States for graduate studies, he was a co- ment, was cited other metrics. founder and CTO of a software startup “for contribu- in Istanbul. His research interests are tions to software prefetching and The CMDBAC currently contains over in storage and systems support for thread-level speculation.” His research 1000 applications of varying complex- machine learning. interests span the areas of computer ity. Web applications based on popular The endow- architecture, compilers, operating programming frameworks are targeted ment for the systems, parallel processing, database because (1) they are easier to find and fellowship was performance, and modular robotics. (2) we can automate the deployment He co-leads the Log-Based Architec- made by PDL process. We support applications that tures project and the Claytronics proj- alum Hugo use the Django, Ruby on Rails, Drupal, ect. In 2004-2007, he served as the Patterson, an director of Intel Labs Pittsburgh. Todd Node.js, and Grails frameworks. Find entrepreneur was among 53 members of the ACM, the CMDBAC website at: http://cmd- who wishes to the world’s leading computing society, bac.cs.cmu.edu and the source code at foster similar elevated to fellow status this year. ACM https://github.com/cmu-db/cmdbac. opportunities will formally recognize the 2016 fel- for PDL graduate students. He is lows at the annual Awards Banquet in June 2016 currently the CTO and Co-Founder San Francisco on June 24, 2017. Best Student Paper at ATC’16! of Datrium, where he has helped de- -- Carnegie Mellon University News, velop a Server Flash Storage System Congratulations to Anuj Kalia and Dec. 8, 2016, Byron Spice for private clouds. It’s a new kind of his co-authors Michael Kaminsky and David G. Andersen for receiving storage that leverages server cycles July 2016 the award for Best Student Paper at and server flash and combines them Announcing the Carnegie with an external NetShelf for durable the 2016 USENIX Annual Technical Mellon Database Application storage. Prior to Datrium, Hugo was Conference (USENIX ATC’16), in Catalog an EMC Fellow serving as CTO of Denver, CO. Their paper, “Design the EMC Backup Recovery Systems IE M The Car- Guidelines for High Performance Division, and the Chief Architect and EG EL N L negie Mellon RDMA Systems,” lays out guidelines CTO of Data Domain (acquired by R O that can be used by system designers N Database EMC in 2009), where he built the first A C Applica- to navigate the RDMA design space in deduplication storage system. Prior D g a tion Catalog order to take advantage of potential o t l to that he was the engineering lead at a a performance improvements. b t (CMDBAC) a a NetApp, developing SnapVault, the s e C A n is an on-line first snap-and-replicate disk-based p plic a tio backup product. Hugo holds a Ph.D. repository of in Electrical and Computer Engineer- open-source ing from Carnegie Mellon University, database applications that you can and was the inaugural recipient of the use for benchmarking and experi- CMU Parallel Data Lab Distinguished mentation. The goal of this project Alumni Award. is to provide ready-to-run real-world

SPRING 2017 9 AUTOMATIC DBMS TUNING continued from page 1 observes the DBMS and measures DBMS-independent external metrics Workload CharacterizationAKnob Identification utomatic Tuner Metrics Distinct Metrics Phase 1 chosen by the DBA, such as the latency Knobs Knobs and throughput. At the end of the s Importance Samples observation period, the controller col- Configs Metrics GPs

Factor K-means Sample Phase 2 lects additional DBMS-specific inter- Factors Mapped Data Analysis Clustering Workload Repository nal metrics, such as MySQL’s counters Lasso New for pages read from or written to disk. Metrics Observations The controller then returns both the external and internal metrics to the Figure 2: OtterTune Machine Learning Pipeline. tuning manager. When OtterTune’s tuning manager Following is a discussion of the com- In the first step, the system identi- receives the result of a new observa- ponents in the ML pipeline in more fies which workload from a previous tion period from the controller, it detail. tuning session is most representative first stores that information in its Workload Characterization: Otter- of the target workload based on the repository. From this, OtterTune Tune uses the DBMS’s internal runtime performance measurements for the computes the next configuration that metrics to characterize how a workload non-redundant set of metrics. It does the controller should install on the behaves. These metrics provide an ac- this by comparing the session’s metrics target DBMS. This configuration, as curate representation of a workload with those from previously seen work- well as an estimate of the expected because they capture many aspects of loads to see which ones react similarly improvement to be gained by running its runtime behavior. The problem with to different knob settings. this configuration, are returned to the these metrics is that many of them are In the next step, OtterTune chooses a controller. The estimate helps the DBA redundant: some are the same measure- configuration that is explicitly selected decide whether to iterate or terminate ment recorded in different units; oth- to maximize the target objective over the tuning session. ers are ones that represent independent the entire tuning session. Instead of We note that there are certain assump- components of the DBMS, but whose tuning all of the DBMS’s configura- tions made by the OtterTune system values are highly correlated. Pruning tion knobs, OtterTune selects only that may limit its capabilities for some these redundant metrics is important to the top-k from its ranked list of knobs users. For example, it assumes that reduce the complexity of the ML models to tune. OtterTune reuses the data the DBA has administrative privileges that use them. from the most similar workload in its repository and the data collected so far so that the controller can modify the Knob Identification: The knob iden- DBMS’s configuration. A complete from the target workload (for the top-k tification component determines which knobs and the target metric) to train a discussion of these assumptions and knobs have the strongest impact on the limitations is provided in our paper [1]. Gaussian Process model. This model DBA’s target objective function. DBMSs enables OtterTune to estimate how can have hundreds of knobs, but only a Machine Learning Pipeline well the target metric will perform for subset affect the DBMS’s performance. a given DBMS configuration. Figure 2 shows the processing path of OtterTune uses a popular feature selec- data as it moves through OtterTune’s tion technique for linear regression, Experimental Evaluation ML pipeline. All previous observa- called Lasso, to identify important OtterTune’s system is written in Python. tions reside in the OtterTune re- knobs and the dependencies that may The ML algorithms used in the Work- pository. This data is first passed into exist between them. Applying this tech- load Characterization and Knob Iden- the workload characterization component nique to the knob data and the data for tification components are implemented that identifies the most distinguish- the non-redundant set of metrics in Ot- using scikit-learn. These algorithms are ing DBMS metrics. Next, the knob terTune’s repository exposes the knobs frequently recomputed by background identification component generates a that have the strongest correlation to the processes as new data becomes avail- ranked list of the most important system’s overall performance. able in OtterTune’s repository. For The automated knobs. This information is then fed Automatic Tuner: instance, the algorithms from the Au- tuning component executes a two-step to the automatic tuner component where tomatic Tuning component are recom- it maps the target DBMS’s workload analysis after the completion of each puted after each observation period. We to a previously seen workload and observation period in the tuning pro- implemented these algorithms using generates improved tuning configu- cess to determine which configuration rations. it should recommend next. continued on page 11

10 THE PDL PACKET AUTOMATIC DBMS TUNING continued from page 10 TensorFlow as performance is more of MySQL a consideration here. We integrated Ot- 800 terTune’s controller with a benchmark- The results in Figure 3 show that 600 ing framework, called OLTP-Bench, to MySQL achieves 22-35% better (ms) throughput and approximately 60% collect data about the DBMS’s hardware, -tile 400 knob configurations, and runtime per- better latency when using the best formance metrics. configuration generated by OtterTune 200 versus ones generated by the tuning 99t h% We conducted all our experiments script and Amazon RDS for the TPC- 0 on Amazon EC2. Each experiment C workload. We also see that Otter-

) 1000 consists of two instances: OtterTune’s Tune generates a configuration that is controller and the target DBMS deploy- almost as good as the DBA. 750 ment. For these, we used the m4.large We find that MySQL has a small number (txn/sec and m3.xlarge instance types, respec- 500 tively. We deployed OtterTune’s tun- of knobs that have a large impact on its ing manager and repository on a local performance for the TPC-C workload. 250 server with 20 cores and 128GB RAM. The configurations generated by Otter- Tune and the DBA provided good set- Throughput 0 Our evaluation compares the perfor- tings for each of these knobs. Amazon mance of MySQL and Postgres when Default DBA RDS performs slightly worse because it OtterTune RDS-config using the best configuration selected provides a sub-optimal setting for one Tuning script by OtterTune versus ones selected by of the knobs. The tuning script’s con- human DBAs and open-source tuning figuration performs the worst because Figure 4: Efficacy Comparison (Postgres). advisor tools. We also compare Ot- it only modifies one knob. terTune’s configurations with those sions to tune new DBMS deployments. This drastically reduces tuning time as created for Amazon RDS-managed Postgres DBMSs deployed on the same EC2 OtterTune does not need to generate instance type as the rest of our experi- The latency measurements in our an initial data set for training its ML ments. results show that the configurations models. For this comparison, we use the TPC- generated by OtterTune, the tuning Our next step in this research is to en- C workload, which is the current tool, the DBA, and RDS all achieve able OtterTune to automatically detect industry standard for evaluating the similar improvements for TPC-C the hardware capabilities of the target performance of OLTP systems. over Postgres’ default settings. This is DBMS without having remote access likely because of the overhead of net- to the DBMS’s host machine. This re- 2000 work round-trips between the OLTP- striction is necessary due to the grow- Bench client and the DBMS. But the ing popularity of DBaaS deployments 1500 throughput measurements show that where such access is not available. (ms) Postgres has ~12% higher performance For more details about the OtterTune 1000 with OtterTune compared to the DBA

%-tile system, take a look at our paper [1] and the tuning script, and ~32% higher 500 or check out the code [2] on Github. 99th performance compared to RDS. We are currently building a website 0 Similar to MySQL, there are only a to provide OtterTune as an online- 1000 few knobs that impact Postgres’ per- tuning service. formance for the TPC-C workload. 750 But in this case, the configurations References txn/sec)

t( generated by OtterTune, the DBA, 500 [1] Automatic Database Management the tuning script and Amazon RDS all System Tuning Through Large-scale modified these knobs and most pro- 250 Machine Learning. Dana Van Aken, vided reasonably good settings. Andrew Pavlo, Geoffrey J. Gordon, Throughpu 0 Bohan Zhang. ACM SIGMOD Inter- Conclusion Default DBA national Conference on Management OtterTune RDS-config Tuning script OtterTune is an automatic DBMS of Data, May 14-19, 2017. Chicago, IL. configuration tool that reuses training [2] https://github.com/cmu-db/ot- Figure 3: Efficacy Comparison (MySQL). data gathered from previous tuning ses- tertune

SPRING 2017 11 BIG-LEARNING SYSTEMS MEET CLOUD COMPUTING

Greg Ganger and PDL Big-learning Group Statistical machine learning (ML) cation and synchronization challenges 120 3 has become a primary data process- Cost from ML app writers. Like MPI-based ing activity for business, science, and 100 RunTime HPC applications, these frameworks online services that attempt to extract 80 2 generally assume that the set of ma- insight from observation (training) chines is fixed, optimizing aggressively 60 for the no machine failure and no

data. Generally speaking, ML algo- Cost ($ ) rithms iteratively process training data 40 1 Time (Hrs) machine change case (and restarting to determine model parameter values 20 the entire computation from the last checkpoint on any failure). So, using that make an expert-chosen statistical 0 0 All On- Standard + Proteus revocable machines risks significant model best fit it. Once fit (trained), Demand Checkpointing such models can predict outcomes for rollback overhead, and adding newly new data items based on selected char- Cost and time benefits of Proteus. This graph available machines to a running com- acteristics (e.g., for recommendation shows average cost (left axis) and runtime putation is often not supported. systems), correlate effects with causes (right axis) for running the MLR application Proteus is a new parameter server (e.g., for genomic analyses of diseases), on the AWS EC2 US-EAST-1 Region. The three system that combines agile elasticity label new media files (e.g., which ones configurations shown are: 128 on-demand with aggressive acquisition strategies to machines, using 128 spot market machines are funny cat videos?), and so on. exploit transient revocable resources. with checkpoint/restart for dealing with Given ever-growing quantities of data evictions and a standard strategy of bidding As one example of how well this works, being collected and stored, combined the on-demand price, and Proteus using 3 on- we find that using three on-demand with ubiquitous attempts to discover demand and up to 189 spot market machines. instances and up to 189 spot market and employ new ways of leveraging it, Proteus reduces cost by 85% relative to using instances allows Proteus to reduce cost it is not surprising that there is work all on-demand machines and by ≈50% relative by 85% when compared to using only on ML systems at every scale. PDL to the checkpointing-based scheme. on-demand instances, even when ac- researchers have been at the forefront tunity: transient availability of cheap counting for spot market variation and of systems support for large-scale ML, but revocable resources. For example, revocations, while running 24% faster. which we refer to as “big-learning sys- Amazon EC2’s spot market and Google Proteus consists of two principal compo- tems,” and exciting new questions have Compute Engine’s preemptible in- nents: AgileML and BidBrain. The Ag- arisen around how best to adapt to and stances often allow customers to use ileML parameter-server system achieves exploit properties of cloud computing. machines at a 70–80% discount off agile elasticity by explicitly combining The convergence of big-learning sys- the regular price, but with the risk multiple reliability tiers, with core tems and cloud computing brings new that they can be taken away at any time. functionality residing on more reliable challenges, like the increased straggler Similarly, many cluster schedulers resources (non-transient resources, like effects addressed in our FlexRR work, allow lower-priority jobs to use re- on-demand instances on EC2) with most and many new opportunities. This sources provisioned but not currently work performed on transient resources. short article highlights two of our needed to support business-critical ac- This allows quick and efficient scaling, newest works in this space, focused tivities, taking the resources away when including expansion when resources on exploiting transient resource avail- those activities need them. ML model become available and bulk extraction of ability and on big-learning for geo- training could often be faster and/or revoked transient resources without big distributed data, respectively. cheaper by aggressively exploiting such delays for rolling back state or recovering revocable resources. lost work. AgileML transitions among Proteus: Agile ML Elasticity for Unfortunately, efficient modern different modes/stages as transient re- Dynamic Resource Markets frameworks for parallel ML, such as sources come and go. When the ratio ML model training is often quite re- TensorFlow, MxNet, and Petuum, of transient to non-transient is small source intensive, requiring hours on are not designed to exploit transient (e.g., 2-to-1), it simply distributes the 10s or 100s of cores to converge on a resources. Most use a parameter server parameter server functionality across solution. As such, it should be able to architecture, in which parallel workers only the non-transient machines, in- exploit any available extra resources or process training data independently stead of across all machines, as is the cost savings. Many modern compute and use a specialized key-value store usual approach. For much larger ratios infrastructures offer a great oppor- for shared state, offloading communi- continued on page 13

12 THE PDL PACKET BIG-LEARNING SYSTEMS MEET CLOUD COMPUTING continued from page 12 (e.g., 63-to-1), the one non-transient breaking down the sources of benefit, WAN bandwidth, while keeping the machine would be a bottleneck in that we find that both the agile elasticity of global ML model sufficiently consistent configuration. In that case, AgileML uses AgileML and the aggressive policies of across data centers to retain the accu- non-transient machine(s) as on-line BidBrain are needed; using either one racy and correctness guarantees of an backup parameter servers to active pri- alone (e.g., BidBrain with checkpoint- ML algorithm; and (2) is generic and mary parameter servers that run on tran- ing instead of AgileML) achieves half flexible enough to run a wide range of sient machines. Updates are coalesced or less the cost and runtime savings. ML algorithms, without requiring any and streamed from actives to backups in For more information, see [1]. changes to the algorithms themselves. the background at a rate that the network Gaia builds on the widely used param- bandwidth can accommodate. Gaia: Geo-distributed ML eter server architecture that provides BidBrain is Proteus’s resource alloca- Approaching LAN Speeds ML worker machines with a distributed tion component that decides when to Machine learning (ML) is widely used global shared memory abstraction for acquire and yield transient resources. to derive useful information from the ML model parameters they col- BidBrain is specialized for EC2, ex- large-scale data (such as user activities, lectively train until convergence to fit ploiting spot market characteristics in pictures, and videos) generated at in- the input data. The key idea of Gaia is its policies, but its general approach creasingly rapid rates, all over the world. to maintain an approximately-correct could apply to other environments Unfortunately, it is often infeasible to copy of the global ML model within with transient resources (e.g., private move all this globally-generated data to each data center, and dynamically elim- clusters or other cloud costing mod- a centralized data center before running inate any unnecessary communication els). It monitors current market prices an ML algorithm over it; moving large between data centers. Gaia enables for multiple instance types, which amounts of raw data over wide-area this by decoupling the synchronization move relatively independently, and networks (WANs) can be extremely slow, (i.e., communication/consistency) bids on new resources when their ad- and is also subject to the constraints of model within a data center from the dition to the current footprint would privacy and data sovereignty laws. This synchronization model between dif- increase work per dollar. Similarly, motivates the need for geo-distributed ferent data centers. This differentia- resources near the end of an hour ML systems that can run on data span- tion allows Gaia to run a conventional may be released if they have become ning multiple data centers. synchronization model that maximizes less cost-effective relative to others. Today’s large-scale distributed ML utilization of the more-freely-available As part of its considerations, BidBrain systems are designed and perform well LAN bandwidth within a data center. estimates the probability of getting free only for operation within a single data At the same time, across different compute cycles due to instance revoca- center. Our experiments using three data centers, Gaia employs a new syn- tion within the billing hour (with later state-of-the-art distributed ML sys- chronization model, the Approximate in the hour being better than earlier) tems (Bösen, IterStore, and GeePS) Synchronous Parallel (ASP) model, for different bids and spot market show that operating these systems which makes more efficient use of conditions. Simultaneously consider- across as few as two data centers (over the scarce and heterogeneous WAN ing costs (e.g., revocation and scaling WANs) can cause a slowdown of 1.8– bandwidth. By ensuring that each ML inefficiencies) and benefits (e.g., 53.7× relative to their performance model copy that exists in different data cheaper new resources), BidBrain within a data center (over LANs), as centers is approximately correct based finds a happy medium between aggres- the communication overwhelms the on a precise notion, ASP maintains sive bidding on transient resources and limited WAN bandwidth. Existing ML algorithm convergence guarantees. more conservative choices. systems that do address challenges in ASP is based on a key finding that the geo-distributed data analytics focus on Experiments with three real ML appli- vast majority of updates to the global bulk data querying rather than the so- cations confirm that Proteus achieves ML model parameters from each ML phisticated ML algorithms commonly significant cost and runtime reduc- worker machine are insignificant. For run on big-learning systems. tions. AgileML’s elasticity support example, our study of three classes of introduces negligible performance Gaia is a new ML system for geo- ML algorithms shows that more than overhead, scales well, and suffers min- distributed data that (1) employs an 95% of the updates produce less than a imal disruption during bulk addition intelligent communication mechanism or removal of transient machines. In over WANs to efficiently utilize scarce continued on page 13

SPRING 2017 13 DEFENSES & PROPOSALS

DISSERTATION ABSTRACT: meeting tail latency SLOs in networked of applications, ranging from typical Meeting Tail Latency SLOs in storage infrastructures. Our system graphics applications to general pur- Shared Networked Storage uses prioritization and rate limiting pose data parallel (GPGPU) applica- as tools for controlling the congestion tions. However, this success has been Timothy Zhu between workloads. We introduce a accompanied by new performance Carnegie Mellon University, SCS novel approach for intelligently con- bottlenecks throughout the memory figuring the workload priorities and hierarchy of GPU-based systems. This PhD Defense — May 3, 2017 rate limits using two different types of dissertation identifies and eliminates Shared computing infrastructures queueing analyses: Deterministic Net- such performance bottlenecks caused (e.g., cloud computing, enterprise work Calculus (DNC) and Stochastic by interference throughout the mem- datacenters) have become the norm to- Network Calculus (SNC). By inte- ory hierarchy. day due to their lower operational costs grating these mathematical analyses Specifically, we provide an in-depth and IT management costs. However, into our system, we are able to build analysis of inter- and intra-application resource sharing introduces challenges better algorithms for optimizing the as well as inter-address-space inter- in controlling performance for each of resource usage. Our implementation ference that significantly degrade the the workloads using the infrastructure. results using realistic workload traces performance and efficiency of GPU- For user-facing workloads (e.g., web on a physical cluster demonstrate that based systems. server, email server), one of the most our approach can meet tail latency To minimize such interference, we in- important performance metrics com- SLOs while achieving better resource troduce changes to the memory hierar- panies want to control is tail latency, utilization than the state-of-the-art. chy for systems with GPUs that allow the the time it takes to complete the most While this thesis focuses on schedul- memory hierarchy to be aware of both delayed requests. Ideally, companies ing policies, admission control, and CPU and GPU applications’ memory would be able to specify tail latency workload placement in storage and access characteristics. We introduce performance goals, also called Service networks, the ideas presented in our mechanisms to dynamically analyze dif- Level Objectives (SLOs), to ensure work can be applied to other related ferent applications’ characteristics and that almost all requests complete problems such as workload migration propose four major changes through- quickly. and datacenter provisioning. Our out the memory hierarchy. Meeting tail latency SLOs is challeng- theoretically grounded techniques for First, we introduce Memory Diver- ing for multiple reasons. First, tail controlling tail latency can also be ex- gence Correction (MeDiC), a cache latency is significantly affected by the tended beyond storage and networks to management mechanism that miti- burstiness that is commonly exhibited other contexts such as the CPU, cache, gates intra-application interference by production workloads. Burstiness etc. For example, in real-time CPU in GPGPU applications by allowing leads to transient queueing, which scheduling contexts, our DNC-based the shared L2 cache and the memory is a major cause of high tail latency. techniques could be used to provide controller to be aware of the GPU’s Second, tail latency is often due to I/O strict latency guarantees while account- warp-level memory divergence char- (e.g., storage, networks), and I/O de- ing for workload burstiness. acteristics. MeDiC uses this warp-level vices exhibit performance peculiarities memory divergence information to that make it hard to meet SLOs. Third, DISSERTATION ABSTRACT: give more cache and memory band- the end-to-end latency is affected by Techniques for Shared width resources to warps that benefit sum of latencies across multiple types Resource Management most from utilizing such resources. of resources such as storage and net- in Systems with Graphics Our evaluations show that MeDiC works. Most of the existing research, Processing Units significantly outperforms multiple however, have ignored burstiness and state-of-the-art caching policies pro- focused on a single resource. Rachata Ausavarungnirun posed for GPUs. This thesis introduces new techniques Carnegie Mellon University, ECE Second, we introduce the Staged for meeting end-to-end tail latency PhD Defense — May 2, 2017 Memory Scheduler (SMS), an ap- SLOs in both storage and networks while accounting for the burstiness The continued growth of the compu- plication-aware CPU-GPU memory that arises in production workloads. tational capabilities, along with con- request scheduler that mitigates inter- We address open questions in schedul- siderable improvements in the pro- application interference in hetero- geneous CPU-GPU systems. SMS ing policies, admission control, and grammability of Graphics Processing creates a fundamentally new approach workload placement. We build a new Units (GPUs), have made GPUs the Quality of Service (QoS) system for platform of choice for a wide variety continued on page 15

14 THE PDL PACKET DEFENSES & PROPOSALS continued from page 14 to memory controller design that de- DISSERTATION ABSTRACT: do not fit in GPU memory. MLtuner couples the memory controller into Exploiting Application is a system for automatically tuning three significantly simpler structures, Characteristics for Efficient the training tunables of ML tasks. It each of which has a separate task and System Support of Data- exploits the characteristic that the best all of which operate together to greatly tunable settings can often be decided Parallel Machine Learning improve both system performance and quickly with just a short trial time. By fairness. Our evaluations show that Henggang Cui making use of optimization-guided SMS not only reduces inter-applica- Carnegie Mellon University, ECE online trial-and-error, MLtuner can tion interference caused by the GPU, robustly find and re-tune tunable set- thereby improving heterogeneous PhD Defense — April 6, 2017 tings for a variety of machine learning system performance, but also provides Large scale machine learning has many applications, including image classifi- better scalability and power efficiency characteristics that can be exploited in cation, video classification, and matrix compared to multiple state-of-the-art the system designs to improve its effi- factorization, and is over an order memory schedulers. ciency. This dissertation demonstrates of magnitude faster than traditional Third, we redesign the GPU memory that, the characteristics of the ML hyper-parameter tuning approaches. management unit to efficiently handle computations can be exploited in the DISSERTATION ABSTRACT: new problems caused by the massive design and implementation of param- address translation parallelism pres- eter server systems, to greatly improve Scaling Distributed Machine ent in GPU computation units in the efficiency by an order of magnitude Learning with System and multi-GPU-application environments. or more. We support this thesis state- Algorithm Co-design Running multiple GPGPU applica- ment with three case study systems, tions concurrently induces significant Mu Li IterStore, GeePS, and MLtuner. Carnegie Mellon University, ML inter-core thrashing on the shared IterStore is an optimized parameter TLB, a new phenomenon that we call server system design that exploits the PhD Defense — February 2, 2017 inter-address-space interference. To repeated data access pattern char- reduce this interference, we introduce acteristic of ML computations. The Due to the rapid growth of data and Multi Address Space Concurrent Ker- designed optimizations allow IterStore the ever increasing model complexity, nels (MASK). MASK introduces TLB- to reduce the total run time of our which often manifests itself in the large awareness throughout the GPU memory ML benchmarks by up to 50X. GeePS number of model parameters, today, hierarchy and introduces TLB- and is a parameter server that is special- many important machine learning cache-bypassing techniques to increase ized for deep learning on distributed problems cannot be efficiently solved the effectiveness of a shared TLB. GPUs. By exploiting the layer-by-layer by a single machine. Distributed op- Finally, we introduce MOSAIC, a hard- data access and computation pattern timization and inference is becoming ware-software cooperative technique of deep learning, GeePS provides more and more inevitable for solving that further increases the effectiveness almost linear scalability from single- large scale machine learning problems of TLB by modifying the memory al- machine baselines (13 more training in both academia and industry. How- location policy in the system software. throughput with 16 machines), and ever, obtaining an efficient distributed MOSAIC introduces a high-throughput also supports neural networks that implementation of an algorithm, is far method to support large pages in multi- from trivial. Both intensive compu- GPU-application environments. Our tational workloads and the volume of evaluations show that the MASK-MO- data communication demand careful SAIC combination provides a simple design of distributed computation sys- mechanism that eliminates the perfor- tems and distributed machine learning mance overhead of address translation algorithms. In this thesis, we focus on in GPUs without significant changes to the co-design of distributed comput- GPU hardware, thereby greatly improv- ing systems and distributed optimiza- ing GPU system performance. tion algorithms that are specialized for large machine learning problems. The key conclusion of this disserta- In the first part, we propose two distrib- tion is that a combination of GPU- uted computing frameworks: Parameter aware cache and memory management Kevin Hsieh tells Michael Kozuch of Intel Server, a distributed machine learning techniques can effectively mitigate the about his research on “Gaia: Geo-Distributed framework that features efficient data memory interference on current and Machine Learning Approaching LAN future GPU-based systems. Speeds”at a retreat poster session. continued on page 16

SPRING 2017 15 DEFENSES & PROPOSALS continued from page 15 communication between the machines; part approach for CFI: (i) prescrib- enabling programmers to customize MXNet, a multi-language library that ing source-code safety checks, that them and observe that their software aims to simplify the development of prevent the root-causes of CFI, that is not inadvertently changed–with deep neural network algorithms. We programmers can insert themselves, machine-code proofs of CFI that can have witnessed the wide adoption of the and (ii) formally proving CFI for be automated, and that does not re- two proposed systems in the past two the machine-code of programs with quire a trusted or verified compiler to years. They have enabled and will con- source-code safety-checks. First, our ensure its proven properties hold in tinue to enable more people to harness prescribed safety-checks, when ap- machine-code. the power of distributed computing to plied, prevent the root-causes of CFI, We evaluate our CFI approach on re- design efficient large-scale machine thereby enabling software to recover alistic embedded software. We evaluate learning applications. from CFI violations in a customiz- our approach on the MiBench and In the second part, we examine a able way. In addition, our prescribed WCET benchmarks, implementations number of distributed optimization safety-checks are visible to program- of common file utilities, and programs problems in machine learning, lever- mers, empowering them to ensure interfacing with hardware inputs and aging the two computing platforms. We that the behavior of their software outputs on the Raspberry Pi single- present new methods to accelerate the is not inadvertently changed by the board-computer. The variety of our tar- training process, such as data parti- prescribed safety-checks. However, get programs, and our ability to support tioning with better locality properties, programmer-inserted safety-checks useful features such as file and hardware communication friendly optimization may be incomplete. Thus, current inputs and outputs, demonstrate the methods, and more compact statistical techniques for proving CFI, which as- wide applicability of our approach. models. We implement the new algo- sume that safety-checks are complete, rithms on the two systems and test on may not work. Second, this disserta- DISSERTATION ABSTRACT: large scale real data sets. We successfully tion develops a logic approach that System Infrastructure for demonstrate that careful co-design of automates formal proofs of CFI for the Mobile-Cloud Convergence computing systems and learning algo- machine-code of software containing rithms can greatly accelerate large scale both source-code CFI safety-checks Kiryong Ha distributed machine learning. and system calls. We extend an exist- Carnegie Mellon University, SCS ing trustworthy Hoare logic with new PhD Defense — October 20, 2016 DISSERTATION ABSTRACT: proof rules, proof tactics, and a novel Prescriptive Safety-Checks proof-search algorithm, which ex- The convergence of mobile comput- ploit the principle of local reasoning through Automated Proofs for ing and cloud computing enables new for safety properties to automatically Control-Flow Integrity mobile applications that are both re- generate CFI proofs for the machine- source-intensive and interactive. For Jiaqi Tan code of programs compiled with our these applications, end-to-end network Carnegie Mellon University, ECE prescribed source-code safety-checks. bandwidth and latency matter greatly To the best of our knowledge, our ap- when cloud resources are used to aug- PhD Defense — November 2016 proach to CFI is the first to combine ment the computational power and Embedded software today is pervasive: programmer-visible source-code battery life of a mobile device. This dis- they can be found everywhere, from enforcement mechanisms for CFI– sertation designs and implements a new coffee makers and medical devices, to architectural element called a cloudlet, cars and aircraft. Embedded software that arises from the convergence of mo- today is also open and connected to the bile computing and cloud computing. Internet, exposing them to external Cloudlets represent the middle tier of a attacks that can cause its Control-Flow 3-tier hierarchy, mobile device - cloudlet Integrity (CFI) to be violated. Control- - cloud, to achieve the right balance be- Flow Integrity is an important safety tween cloud consolidation and network property of software, which ensures that responsiveness. We first present quanti- the behavior of the software is not inad- tative evidence that shows cloud location vertently changed. The violation of CFI can affect the performance of mobile in software can cause unintended behav- applications and cloud consolidation. iors, and can even lead to catastrophic Abutalib Aghayev presents his research on We then describe an architectural solu- incidents in safety-critical systems. “Evolving ext4 for SMR Disks” at the 2016 tion using cloudlets that are a seamless This dissertation develops a two- PDL Retreat. continued on page 17

16 THE PDL PACKET DEFENSES & PROPOSALS continued from page 16 extension of today’s cloud comput- erences and flexibility in a general, DISSERTATION ABSTRACT: ing infrastructure. Finally, we define extensible way by using a declarative, Practical Data Compression for minimal functionalities that cloudlets composable, intuitive algebraic expres- Modern Memory Hierarchies must offer above/beyond standard sion structure. Second, building on cloud computing, and address corre- Gennady G. Pekhimenko sponding technical challenges. the generality of STRL, we propose an Carnegie Mellon University, SCS equally general STRL Compiler that PhD Defense — July 1, 2016 DISSERTATION ABSTRACT: automatically compiles STRL expres- Scheduling with Space- sions into Mixed Integer Linear Pro- Although compression has been widely Time Soft Constraints in gramming (MILP) problems that can be used for decades to reduce file sizes Heterogeneous Cloud aggregated and solved to maximize the (conserving storage capacity and network Datacenters overall value of shared cluster resources. bandwidth when moving files), there has been little to no use within modern Alexey Tumanov These theoretical contributions form memory hierarchies. Why not? As pro- Carnegie Mellon University, ECE the foundation for the system we grams become increasingly data-intensive, PhD Defense — July 25, 2016 architect, called TetriSched, that the capacity and bandwidth within the instantiates our conceptual contribu- memory hierarchy (including caches, Heterogeneity in modern datacenters is tions: (a) declarative soft constraints, main memory, and their associated inter- on the rise, in hardware resource char- (b) space-time soft constraints, (c) connects) are becoming increasingly im- acteristics, in workload characteristics, portant bottlenecks. If data compression and in dynamic characteristics (e.g., a combinatorial constraints, (d) order- less global scheduling, and (e) in situ were applied successfully to the memory memory-resident copy of input data). hierarchy, it could relieve pressure on preemption. We also propose a set of As a result, which machines are assigned these bottlenecks by increasing effective to a given job can have a significant im- mechanisms that extend the scope and capacity, increasing effective bandwidth, pact. For example, a job may run faster the practicality of TetriSched’s deploy- even reducing energy consumption. on the same machine as its input data or ment by analyzing and improving on In this thesis, I describe a new, practical with a given hardware accelerator, while its scalability, enabling and studying approach to integrating data compres- still being runnable on other machines, the efficacy of preemption, and fea- sion within the memory hierarchy, in- albeit less efficiently. Heterogeneity turing a set of runtime misestimation cluding on-chip caches, main memory, takes on more complex forms as sets of handling mechanisms to address run- and both on-chip and off-chip inter- resources differ in the level of perfor- time prediction inaccuracy. connects. This new approach is fast, mance they deliver, even if they consist simple, and effective in saving storage of identical individual units, such as In collaboration with Microsoft, we space. A key insight in our approach adapt some of these ideas as we design with rack-level locality. We refer to this is that access time (including decom- as combinatorial heterogeneity. Mixes and implement a heterogeneity-aware pression latency) is critical in modern of jobs with strict SLOs on completion resource reservation system called memory hierarchies. By combining in- time and increasingly available runtime Aramid with support for ordinal place- expensive hardware support with mod- estimates in production datacenters ment preferences targeting deployment est OS support, our holistic approach deepen the challenge of matching the in production clusters at Microsoft to compression achieves substantial im- right resources to the right workloads scale. A combination of simulation and provements in performance and energy at the right time. real cluster experiments with synthetic efficiency across the memory hierarchy. In this dissertation, we hypothesize that and production-derived workloads, a In addition to exploring compression- it is possible and beneficial to simulta- related issues and enabling practical range of workload intensities, degrees neously leverage all of this information solutions in modern CPU systems, of burstiness, preference strengths, in the form of declaratively specified we discover new problems in realizing spacetime soft constraints. To accom- and input inaccuracies support our hy- hardware-based compression for GPU- plish this, we first design and develop pothesis that leveraging space-time soft based systems and develop new solutions our principal building block—a novel constraints (a) significantly improves to solve these problems. Space-Time Request Language (STRL). scheduling quality and (b) is possible It enables the expression of jobs’ pref- to achieve in a practical deployment. continued on page 18

SPRING 2017 17 DEFENSES & PROPOSALS continued from page 17 THESIS PROPOSAL: For a lot of important machine learning THESIS PROPOSAL: Practical Concurrency Testing problems, due to the rapid growth of Architectural Techniques data and the ever increasing model com- for Improving NAND Flash Benjamin Blum, SCS plexity, which often manifests itself in Memory Reliability April 25, 2017 the large number of model parameters, Concurrent programming presents a no single machine can solve them fast Yixin Luo, SCS challenge to students and experts alike enough. Thus, distributed optimization August 5, 2016 because of the complexity of multi- and inference is becoming more in- Raw bit errors are common in NAND threaded interactions and the difficulty evitable for solving large scale machine flash memory and will increase in learning problems in both academia to reproduce and reason about bugs. the future. These errors reduce flash and industry. Obtaining an efficient Stateless model checking is a concurren- reliability and limit the lifetime of a distributed implementation of an algo- cy testing approach which forces a pro- flash memory device. This proposal gram to interleave its threads in many rithm, however, is far from trivial. Both aims to improve flash reliability with different ways, checking for bugs each intensive computational workloads and a multitude of low-cost architectural time. This technique is powerful, in the volume of data communication principle capable of finding any nonde- demand careful design of distributed techniques. Our thesis statement is: terministic bug in finite time, but suffers computation systems and distributed NAND flash memory reliability can be from exponential explosion as program machine learning algorithms. improved at low cost and with low per- size increases. Checking an exponential In this thesis, we focus on the co-design formance overhead by deploying various number of thread interleavings is not of distributed computing systems and architectural techniques that are aware a practical or predictable approach for distributed optimization algorithms of higher-level application behavior and programmers to find concurrency bugs that are specialized for large machine underlying flash device characteristics. before their project deadlines. learning problems. We propose two Our proposed approach is to under- In this thesis, I propose to make state- distributed computing frameworks: stand flash error characteristics and less model checking more practical a parameter server framework which workload behavior through charac- features efficient data communication, for human use by way of several new terization, and to design smart flash and MXNet, a multi-language library techniques. I have built Landslide, a controller algorithms that utilize this aiming to simplify the development of stateless model checker specializing understanding to improve flash reli- in student projects for undergraduate deep neural network algorithms. In less than two years, we have witnessed ability. We propose to investigate four operating systems classes. Landslide directions through this approach. (1) includes a novel algorithm for auto- the wide adoption of the proposed Our preliminary work proposes a new matically managing multiple state spac- systems. We believe that as we continue es according to their estimated comple- to develop these systems, they will en- technique that improves flash reli- tion times, which I will show quickly able more people to take advantage of ability by 12.9 times by managing flash finds bugs should they exist and also the power of distributed computing retention differently for write-hot data quickly verifies correctness otherwise. to design efficient machine learning and write-cold data. (2) We propose I will evaluate Landslide’s suitability applications to solve large-scale com- to characterize and model flash errors for inexpert use by presenting the re- putational problems. Leveraging the on new flash chips. (3) We propose to sults of many semesters providing it to two computing platforms, we examine develop a technique to construct a flash students in 15-410, CMU’s Operating a number of distributed optimization error model online and improve flash System Design and Implementation problems in machine learning. We lifetime by exploiting our online model. present new methods to accelerate the class. Finally, I will explore broader (4) We propose to understand and de- training process, such as data parti- impact by extending Landslide to test velop new techniques that utilize flash tioning with better locality properties, some real-world programs and to be self-healing effect. We hope that these used by students at other universities. communication friendly optimization methods, and more compact statistical four directions will allow us to achieve THESIS PROPOSAL: models. We implement the new algo- higher flash reliability at low cost. Scaling Distributed Machine rithms on the two systems and test on Learning with System and large scale real data sets. We successfully Algorithm Co-design demonstrate that careful co-design of computing systems and learning algo- Mu Li, SCS rithms can greatly accelerate large scale October 14, 2016 distributed machine learning.

18 THE PDL PACKET NEW PDL FACULTY

Nathan Beckmann nearby cache banks. This work spans System”, won the William A. Martin cache partitioning, dynamic NUCA, Memorial Thesis Award for an out- The PDL would like to welcome Na- heterogeneous memories, replacement standing Master’s thesis. He received than Beckmann to our family! Nathan policies, etc to reduce overall data his Bachelor of Science from the is an Assistant Professor at CMU in movement. A common theme is the University of California, Los Ange- the Department of Computer Science. use of analytical models in software to les, graduating summa cum laude in Nathan works in computer architec- control simple, efficient, but “dumb” Computer Science and Mathematics ture, with experience in distributed hardware. This analytical approach of Computation and was honored as operating systems and is currently yields robust performance and unex- the Bachelor of the Year (2008) in focused on reducing data movement pected insights. Following graduation, Computer Science. in multicores, which uses over half of he stayed for a one-year postdoc. the energy available in current chips, by redesigning caches so they dynami- Before his PhD, Nathan worked in cally adopt an application-specific Anant Agarwal’s CSAIL group on fos, organization. a distributed operating system for multicore and clouds (2008-2011). Nathan received his PhD from MIT’s Computer Science and Artificial In- He also briefly worked with Frans telligence Laboratory (CSAIL)(2012- Kaashoek and Nickolai Zeldovich in 2015), advised by Daniel Sanchez. His the Parallel & Distributed Operating PhD thesis, “Design and Analysis of Systems Group (2011-2012). A long Spatially-Partitioned Shared Caches,” time ago, he also worked on Graphite, received the 2015 Sprowls Doctoral a distributed multicore simulator. Thesis Prize for the “best PhD thesis Other interests include math and in computer science at MIT”. This cryptography. work exposes physical cache banks to Nathan received his masters degree in software, letting the operating system 2010 from MIT. His thesis, “Distrib- schedule applications’ working sets in uted Naming in a Factored Operating

ALUMNI NEWS

Vinaykumar Bhat (MS, INI ’15) in dual-degree programs with nearby Niraj Tolia (PhD, SCS ‘07) schools that do offer one. His role will PDL Alumni Vinaykumar Bhat (2014- be to provide Engineering experiences Niraj, who currently resides in the 15) and Preeti Murthy (also a CMU at Mount Holyoke that help develop San Francisco Bay area, has recently ECE graduate), who both took the interest, facilitate transition into the founded a new startup (Kasten, kas- much-in-demand PDL Storage Sys- dual-degree programs, and help inte- ten.io), working on container-based/ tems class, were married on March grate their skills back into the Mount cloud-native storage infrastructure. 16th. Vinay currently works for Ora- While they are still in stealth mode, cle’s in-memory database group. Holyoke College community. He also got a new dog. Her name is Hope. they are hiring and have other PDL alumni on board too! Peter Klemperer (PhD, ECE ‘14) Atreyee Maiti (MS, SCS ‘14) Starting this coming Fall, Peter will Amber Palekar (MS, INI ‘06) Atreyee, now a Senior MTS with Nu- be entering a new role as an Assistant Amber has moved on from his position Professor of Computer Science at tanix, recently had a paper accepted as a file system engineer at NetApp. He Mount Holyoke College in South Had- by the International Conference on is now working at , moving ley, MA. He will be leading a 3-year Cloud Engineering (IC2E) 2017 in pilot into whether Engineering has a the industrial track. The paper “Quest: over as a part of Nutanix’s acquisition role at Mount Holyoke. There is no Search-driven Management of Cloud- of PernixData. He says he is having Engineering Department at Mount Scale Data Centers” was presented in some startup fun at the moment! Holyoke yet, but many students enroll April in Vancouver, B.C.

SPRING 2017 19 RECENT PUBLICATIONS continued from page 7 to extract the data stored within the Proteus: Agile ML Elasticity An Empirical Evaluation of device. Instead, investigators turn to through Tiered Reliability in In-Memory Multi-Version chip-off analysis, where they use a Dynamic Resource Markets Concurrency Control thermal-based procedure to physically remove the NAND flash memory chip Aaron Harlap, Alexey Tumanov, Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran from the device, and access the chip Andrew Chung, Greg Ganger & Phil Xian & Andrew Pavlo Gibbons directly to extract the raw data stored Proceedings of the VLDB Endowment, on the chip. ACM European Conference on Com- vol. 10, iss. 7, pages. 781—792, March We perform an analysis of the er- puter Systems, 2017 (EuroSys’17), 2017. rors introduced into multi-level cell 23rd-26th April, 2017, Belgrade, Multi-version concurrency control (MLC) NAND flash memory chips Serbia. Supersedes Carnegie Mellon (MVCC) is currently the most popular after the device has been seized. We University Parallel Data Lab Techni- transaction management scheme in make two major observations. First, cal Report CMU-PDL-16-102. May modern database management systems between the time that a device is seized 2016. (DBMSs). Although MVCC was dis- and the time digital forensic investiga- Many shared computing clusters allow covered in the late 1970s, it is used in tors perform data extraction, a large users to utilize excess idle resources almost every major relational DBMS number of errors can be introduced at lower cost or priority, with the released in the last decade. Maintaining as a result of charge leakage from proviso that some or all may be taken multiple versions of data potentially the cells of the NAND flash memory away at any time. But, exploiting such increases parallelism without sacri- (known as data retention errors). dynamic resource availability and the ficing serializability when processing Second, when thermal-based chip often fluctuating markets for them transactions. But scaling MVCC in removal is performed, the number requires agile elasticity and effective a multi-core and in-memory setting of errors in the data stored within acquisition strategies. Proteus aggres- is non-trivial: when there are a large NAND flash memory can increase by sively exploits such transient revocable number of threads running in parallel, two or more orders of magnitude, as resources to do machine learning the synchronization overhead can out- the high temperature applied to the (ML) cheaper and/or faster. Its pa- weigh the benefits of multi-versioning. chip greatly accelerates charge leak- rameter server framework, AgileML, To understand how MVCC perform age. We demonstrate that the chip-off efficiently adapts to bulk additions when processing transactions in mod- analysis based forensic data recovery and revocations of transient machines, ern hardware settings, we conduct an procedure is quite destructive, and can through a novel 3-stage active-backup extensive study of the scheme’s four key often render most of the data within approach, with minimal use of more design decisions: concurrency control NAND flash memory uncorrectable, costly non-transient resources. Its protocol, version storage, garbage col- and, thus, unrecoverable. BidBrain component adaptively al- lection, and index management. We im- To mitigate the errors introduced locates resources from multiple EC2 plemented state-of-the-art variants of during the forensic recovery process, spot markets to minimize average cost all of these in an in-memory DBMS and we explore a new hardware-based ap- per work as transient resource avail- evaluated them using OLTP workloads. proach. We exploit a fine-grained read ability and cost change over time. Our Our analysis identifies the fundamental evaluations show that Proteus reduces reference voltage control mechanism bottlenecks of each design choice. cost by 85% relative to non-transient implemented in modern NAND flash pricing, and by 43% relative to previ- Gaia: Geo-Distributed Machine memory chips, called read-retry, which ous approaches, while simultaneously can compensate for the charge leakage Learning Approaching LAN reducing runtimes by up to 37%. that occurs due to (1) retention loss Speeds and (2) thermal-based chip removal. Kevin Hsieh, Aaron Harlap, Nandita The read-retry mechanism success- Vijaykumar, Dimitris Konomis, Gregory fully reduces the number of errors, R. Ganger, Phillip B. Gibbons & Onur such that the original data can be fully Mutlu recovered in our tested chips as long as the chips were not heavily used prior 14th USENIX Symposium on Net- to seizure. We conclude that the read- worked Systems Design and Imple- mentation (NSDI), March 27–29, retry mechanism should be adopted The Proteus architecture consists of the as part of the forensic data recovery resource allocation component, BidBrain, and 2017, Boston, MA. process. the elastic ML framework, AgileML. continued on page 21

20 THE PDL PACKET RECENT PUBLICATIONS continued from page 20 between data centers and eliminates the first work that models edge-servers Data Center 1 Data Center 2 the vast majority of insignificant com- as caches for compute-intensive recog- ❶ Global Global Data Model Copy Model Copy Shard Worker munication while still guaranteeing the nition applications. Machine Parameter Parameter Server Server … correctness of ML algorithms. Our Data Shard Worker experiments on our prototypes of Gaia SoftMC: A Flexible and Machine Parameter Parameter … Data Server ❷ ASP Server running across the 11 Amazon EC2 Practical Open-Source Shard Worker Machine ❸ BSP/SSP global regions and on a cluster that Infrastructure for Enabling emulates EC2 WAN bandwidth show Experimental DRAM Studies Gaia system overview. that Gaia provides 1.8-53.5X speed- up over a state-of-art distributed ML Hasan Hassan, Nandita Vijaykumar, Machine learning (ML) is widely used system, and is within 0.94-1.56X of Samira Khan, Saugata Ghose, Kevin to derive useful information from ML-on-LAN speeds. Chang, Gennady Pekhimenko, large-scale data (such as user activi- Donghyuk Lee, Oguz Ergin & Onur ties, pictures, and videos) generated Towards Edge-caching for Mutlu at increasingly rapid rates, all over the Image Recognition International Symposium on High- world. Unfortunately, it is infeasible Utsav Drolia, Katherine Guo, Jiaqi Tan, Performance Computer Architecture to move all this globally-generated (HPCA), February 2017. data to a centralized data center be- Rajeev Gandhi & Priya Narasimhan DRAM is the primary technology used fore running an ML algorithm over it First Workshop on Smart Edge Com- for main memory in modern sys- -- moving large amounts of raw data puting and Networking (SmartEdge) tems. Unfortunately, as DRAM scales over a wide-area network (WAN) can ‘17, held in conjunction with PerCom down to smaller technology nodes, be extremely slow, and is also subject 2017, March 13 - 17, 2017, Hawaii, it faces key challenges in both data to national privacy law constraints. USA. integrity and latency, which strongly This motivates the need for a geo-dis- With the available sensors on mobile tributed ML system spanning multiple affects overall system reliability and devices and their improved CPU and performance. To develop reliable data centers. Unfortunately, commu- storage capability, users expect their nicating over WANs can significantly and high-performance DRAM-based devices to recognize the surrounding main memory in future systems, it is degrade ML system performance (by as environment and to provide relevant much as 53.7X in our study) because critical to characterize, understand, information and/or content auto- and analyze various aspects (e.g., re- the communication overwhelms the matically and immediately. For such limited WAN bandwidth. liability, latency) of existing DRAM classes of real-time applications, user chips. To enable this, there is a strong Our goal in this work is to develop perception of performance is key. To need for a publicly-available DRAM a geo-distributed ML system that (1) enable a truly seamless experience for testing infrastructure that can flexibly employs an intelligent communication the user, responses to requests need and efficiently test DRAM chips in a mechanism over WANs to efficiently to be provided with minimal user- manner accessible to both software and utilize the scarce WAN bandwidth, perceived latency. hardware developers. while retaining the accuracy and cor- Current state-of-the-art systems for rectness guarantees of an ML algo- This paper develops the first such in- these applications require offloading frastructure, SoftMC (Soft Memory rithm; and (2) is generic and flexible requests and data to the cloud. This enough to run a wide range of ML Controller), an FPGA-based testing paper proposes an approach to allow platform that can control and test algorithms, without requiring any users’ devices and their onboard ap- changes to the algorithms. memory modules designed for the plications to leverage resources closer commonly-used DDR (Double Data To this end, we introduce a new, to home, i.e., resources at the edge of Rate) interface. SoftMC has two key general geo-distributed ML system, the network. We propose to use edge- properties: (i) it provides flexibility to Gaia, that decouples the communi- servers as specialized caches for image- thoroughly control memory behavior cation within a data center from the recognition applications. We develop or to implement a wide range of mech- communication between data centers, a detailed formula for the expected la- anisms using DDR commands; and (ii) enabling different communication tency for such a cache that incorporates it is easy to use as it provides a simple and consistency models for each. We the effects of recognition algorithms’ and intuitive high-level programming present a new ML synchronization computation time and accuracy. We interface for users, completely hiding model, Approximate Synchronous show that, counter-intuitively, large the low-level details of the FPGA. Parallel (ASP), which dynamically ad- cache sizes can lead to higher latencies. justs to the available WAN bandwidth To the best of our knowledge, this is continued on page 22

SPRING 2017 21 RECENT PUBLICATIONS continued from page 21 in existing DRAM chips, which dem- Vulnerabilities in MLC NAND onstrates the usefulness of SoftMC in Flash Memory Programming: testing new ideas on existing memory Experimental Analysis, modules. We discuss several other Exploits, and Mitigation use cases of SoftMC, including the Techniques ability to characterize emerging non- volatile memory modules that obey Yu Cai, Saugata Ghose, Yixin Luo, Ken the DDR standard. We hope that our Mai, Onur Mutlu & Erich F. Haratsch open-source release of SoftMC fills a 23rd IEEE Symposium on High Per- gap in the space of publicly-available formance Computer Architecture, (a) System experimental memory testing infra- Industrial session, February 2017. structures and inspires new studies, ideas, and methodologies in memory Modern NAND flash memory chips system design. provide high density by storing two bits of data in each flash cell, called a multi- An Evaluation of Distributed level cell (MLC). An MLC partitions Concurrency Control the threshold voltage range of a flash cell into four voltage states. When a Rachael Harding, Dana Van Aken, flash cell is programmed, a high voltage Andrew Pavlo & Michael Stonebraker is applied to the cell. Due to parasitic (b) Chip capacitance coupling between flash Proceedings of the VLDB Endowment, cells that are physically close to each vol. 10, iss. 5, pages. 553—564, Janu- other, flash cell programming can lead ary 2017. to cell-to-cell program interference, Increasing transaction volumes have which introduces errors into neigh- led to a resurgence of interest in dis- boring flash cells. In order to reduce tributed transaction processing. In the impact of cell-to-cell interference particular, partitioning data across sev- on the reliability of MLC NAND flash eral servers can improve throughput by memory, flash manufacturers adopt a allowing servers to process transactions two-step programming method, which in parallel. But executing transactions programs the MLC in two separate (c) Bank across servers limits the scalability and steps. First, the flash memory partially performance of these systems. programs the least significant bit of the DRAM-based memory system organization. MLC to some intermediate threshold In this paper, we quantify the ef- voltage. Second, it programs the most fects of distribution on concurrency We demonstrate the capability, flex- significant bit to bring the MLC up to control protocols in a distributed ibility, and programming ease of its full voltage state. environment. We evaluate six classic SoftMC with two example use cases. and modern protocols in an in-mem- In this paper, we demonstrate that First, we implement a test that char- ory distributed database evaluation two-step programming exposes new acterizes the retention time of DRAM framework called Deneva, providing reliability and security vulnerabilities. cells. Experimental results we obtain an apples-to-apples comparison be- We experimentally characterize the ef- using SoftMC are consistent with the tween each. Our results expose severe fects of two-step programming using findings of prior studies on retention limitations of distributed transaction contemporary 1X-nm (i.e., 15–19nm) time in modern DRAM, which serves processing engines. Moreover, in our flash memory chips. We find that a as a validation of our infrastructure. analysis, we identify several protocol- partially-programmed flash cell (i.e., Second, we validate two recently- specific scalability bottlenecks. We a cell where the second programming proposed mechanisms, which rely conclude that to achieve truly scalable step has not yet been performed) is on accessing recently-refreshed or operation, distributed concurrency much more vulnerable to cell-to-cell recently-accessed DRAM cells faster control solutions must seek a tighter interference and read disturb than a than other DRAM cells. Using our coupling with either novel network fully-programmed cell. We show that infrastructure, we show that the ex- hardware (in the local area) or applica- it is possible to exploit these vulner- abilities on solid-state drives (SSDs) pected latency reduction effect of tions (via data modeling and semanti- these mechanisms is not observable cally-aware execution), or both. continued on page 23

22 THE PDL PACKET RECENT PUBLICATIONS continued from page 22 to alter the partially-programmed FaSST is an RDMA-based system data, causing (potentially malicious) 90000 that provides distributed in-memory data corruption. Building on our ex- transactions with serializability and perimental observations, we propose (txn/s) durability. Existing RDMA-based several new mechanisms for MLC 45000 transaction processing systems use NAND flash memory that eliminate or one-sided RDMA primitives for their mitigate data corruption in partially- oughput Thr ability to bypass the remote CPU. This 0 programmed cells, thereby removing NVM-WBL NVM-WAL design choice brings several drawbacks. or reducing the extent of the vulner- LoggingProtocol First, the limited flexibility of one- abilities, and at the same time increas- (a) Throughput sided RDMA reduces performance ing flash memory lifetime by 16%. 1000 and increases software complexity

s) when designing distributed data stores.

Write-Behind Logging y( 100 Second, deep-rooted technical limita- 10 tions of RDMA hardware limit scal-

Joy Arulraj, Matthew Perron & Andrew Latenc ability in large clusters. FaSST eschews Pavlo 1 very one-sided RDMA for fast RPCs using 0.1 Proc. VLDB Endow., vol. 10, pp. 337- Reco two-sided unreliable datagrams, which NVM-WBL NVM-WAL 348, December, 2016. LoggingProtocol we show drop packets extremely rarely The design of the logging and recovery (b) Recovery Time on modern RDMA networks. This ap- components of database management proach provides better performance, systems (DBMSs) has always been 3.50 scalability, and simplicity, without influenced by the difference in the requiring expensive reliability mecha- nisms in software. In comparison with performance characteristics of volatile (GB) 1.75

(DRAM) and non-volatile storage de- ge published numbers, FaSST outper- vices (HDD/SSDs). The key assump- forms FaRM on the TATP benchmark tion has been that non-volatile storage Stora by almost 2x while using close to half 0.00 is much slower than DRAM and only NVM-WBL NVM-WAL the hardware resources, and it out- supports block-oriented read/writes. LoggingProtocol performs DrTM+R on the SmallBank But the arrival of new nonvolatile (c) Storage Footprint benchmark by around 1.7x without memory (NVM) storage that is almost making data locality assumptions. as fast as DRAM with fine-grained NVM-WBL NVM-WAL read/writes invalidates these previous Self-Driving Database design choices. WBL vs. WAL – The throughput, recovery Management Systems time, and storage footprint of the DBMS This paper explores the changes that A. Pavlo, G. Angulo, J. Arulraj, H. Lin, are required in a DBMS to leverage for the YCSB benchmark with the write- ahead logging and write-behind logging J. Lin, L. Ma, P. Menon, T. Mowry, the unique properties of NVM in sys- protocols. M. Perron, I. Quah, S. Santurkar, A. tems that still include volatile DRAM. Tomasic, S. Toor, D. V. Aken, Z. Wang, Y. We make the case for a new logging 1.5×. We also demonstrate that our Wu, R. Xian & T. Zhang and recovery protocol, called write- logging protocol is compatible with behind logging, that enables a DBMS standard replication schemes. In CIDR 2017, Conference on Inno- to recover nearly instantaneously vative Data Systems Research. January from system failures. The key idea is FaSST: Fast, Scalable 8-11, 2017, Chaminade, CA. that the DBMS logs what parts of the and Simple Distributed In the last two decades, both research- database have changed rather than how Transactions with Two-sided ers and vendors have built advisory it was changed. Using this method, (RDMA) Datagram RPCs tools to assist database administrators the DBMS flushes the changes to the (DBAs) in various aspects of system database before recording them in the Anuj Kalia, Michael Kaminsky & David tuning and physical design. Most of log. Our evaluation shows that this G. Andersen this previous work, however, is incom- protocol improves a DBMS’s transac- plete because they still require humans 12th USENIX Symposium on Operat- tional throughput by 1.3×, reduces the to make the final decisions about any recovery time by more than two orders ing Systems Design and Implementa- changes to the database and are re- of magnitude, and shrinks the storage tion November 2–4, 2016, Savannah, footprint of the DBMS on NVM by GA, USA. continued on page 24

SPRING 2017 23 RECENT PUBLICATIONS continued from page 23 actionary measures that fix problems than a serial C++ implementation. after they occur. tunable choice to try With a modest dataset (Netflix dataset What is needed for a truly “self- Searching Tunable containing 100 million ratings), the driving” database management system Logic Searcher PySpark implementation showed 5.5X Fork a convergence speed speedup from 1 to 8 machines, using (DBMS) is a new architecture that is branch to try designed for autonomous operation. tunable Progress 1 core per machine. But it failed to for trial me Summarizer This is different than earlier attempts MLtuner achieve further speedup with more because all aspects of the system are training progress machines or gain speedup from using controlled by an integrated planning Training System more cores per machine. While it’s still component that not only optimizes the ongoing investigation, we believed that system for the current workload, but shuffling as Spark’s only scalable mech- also predicts future workload trends Tunable searching procedure. anism of propagating model updates is so that the system can prepare itself responsible for much of the overhead zation-guided online trial-and-error and more efficient approaches for accordingly. With this, the DBMS to find good initial settings as well as can support all of the previous tuning communication could lead to much to re-tune settings during execution. better performance. techniques without requiring a human Experiments with three real ML tasks to determine the right way and proper show that MLtuner automatically en- Zorua: A Holistic Approach time to deploy them. It also enables ables performance within 40–178% new optimizations that are impor- to Resource Virtualization in of having oracle knowledge of the best GPUs tant for modern high-performance settings, and outperforms oracle when DBMSs, but which are not possible no single set of settings are best for the Nandita Vijaykumar, Kevin Hsieh, today because the complexity of man- entire execution. It also significantly Gennady Pekhimenko, Samira Khan, aging these systems has surpassed the outperforms most of the many feasible Saugata Ghose, Ashish Shrestha, abilities of human experts. settings that might get used in practice. Adwait Jog, Phillip B. Gibbons & Onur This paper presents the architecture of Mutlu Peloton, the first self-driving DBMS. Benchmarking Apache Spark 49th IEEE/ACM International Sym- Peloton’s autonomic capabilities are with Machine Learning posium on Microarchitecture (MI- now possible due to algorithmic ad- Applications CRO’16), October 15 - 19, 2016, vancements in deep learning, as well as Taipei, Taiwan. improvements in hardware and adap- Jinliang Wei, Jin Kyu Kim & Garth A. tive database architectures. Gibson This paper introduces a new resource virtualization framework, Zorua, that Carnegie Mellon University Parallel MLtuner: System Support for decouples the programmer-specified Data Lab Technical Report CMU- resource usage of a GPU applica- Automatic Machine Learning PDL-16-107 October 2016. Tuning tion from the actual allocation in the We benchmarked Apache Spark with on-chip hardware resources. Zorua Henggang Cui, Gregory R. Ganger & a popular parallel machine learning enables this decoupling by virtualizing Phillip B. Gibbons training application, Distributed Sto- each resource transparently to the pro- chastic Gradient Descent for Matrix grammer. The virtualization provided Carnegie Mellon University Parallel Factorization [5] and compared the by Zorua builds on two key concepts— Data Lab Technical Report CMU- Spark implementation with alternative dynamic allocation of the on-chip PDL-16-108, October 2016. approaches for communicating model resources and their oversubscription MLtuner automatically tunes settings parameters, such as scheduled pipe- using a swap space in memory. for training tunables—such as the lining using POSIX socket or MPI, Zorua provides a holistic GPU re- learning rate, the mini-batch size, and distributed shared memory (e.g. source virtualization strategy, designed and the data staleness bound—that parameter server [13]). We found that to (i) adaptively control the extent of have a significant impact on large-scale Spark performance suffers substantial oversubscription, and (ii) coordinate machine learning (ML) performance. overhead with only modest model size the dynamic management of multiple Traditionally, these tunables are set (rank of a few hundreds). For example, on-chip resources (i.e., registers, manually, which is unsurprisingly er- the PySpark implementation using scratchpad memory, and thread slots), ror prone and difficult to do without one single-core executor was about 3X to maximize the effectiveness of virtu- extensive domain knowledge. MLtuner slower than a serial out-of-core Python uses efficient snapshotting and optimi- implementation and 226X slower continued on page 25

24 THE PDL PACKET RECENT PUBLICATIONS continued from page 24 aging suite with built-in aging profiles Coordinator with the goal of making it easier to age, and harder to justify ignoring file sys- Oversubscribe Oversubscribe Oversubscribe tem aging in storage research. threads? scratchpad? registers? Thread/ Barrier Scratchpad Register Application Queue Queue Queue A Survey of Security Thread Vulnerabilities in Bluetooth Thread Blocks Schedulable Block Threads Low Energy Beacons 1 55 Scheduler Acquire Warp 9 Resources Scheduler Release 2 3 4 7 Hui Jun Tay, Jiaqi Tan & Priya Resources Narasimhan 10 6 To Compute Carnegie Mellon University Parallel Units Data Lab Technical Report CMU- Register Scratchpad Thread Mapping Table Mapping Table Mapping Table PDL-16-109. November 2016.

Phase Change 8 There are currently 4 million Blue- tooth Low Energy beacons, such as Apple’s iBeacons™, deployed world- Overview of Zorua in hardware. wide, with more being used in an in- creasingly wide variety of fields, from alization. Zorua employs a hardware- are already highly tuned to best utilize marketing to navigation. Yet, there is a software codesign, comprising the the hardware resources. The holistic lack of focus on the impact of security compiler, a runtime system and hard- virtualization provided by Zorua can on beacon deployments. This report ware-based virtualization support. The also enable other uses, including looks to capture an overview of cur- runtime system leverages information fine-grained resource sharing among rent beacon deployments, the current from the compiler regarding resource multiple kernels and low-latency pre- vulnerabilities and risks present in requirements of each program phase emption of GPU programs. using such platforms, and the security to (i) dynamically allocate/deallocate measures taken by vendors today. the different resources in the physi- Aging Gracefully with cally available on-chip resources or Geriatrix: A File System Aging A Model for Application their swap space, and (ii) manage the Suite Slowdown Estimation in On- tradeoff between higher thread-level Saurabh Kadekodi, Vaishnavh Chip Networks and Its Use for parallelism due to virtualization versus Nagarajan & Garth A. Gibson Improving System Fairness and the latency and capacity overheads of Performance swap space usage. Carnegie Mellon University Parallel We demonstrate that by providing Data Lab Technical Report CMU- Xiyue Xiang, Saugata Ghose, Onur the illusion of more resources than PDL-16-105. October, 2016. Mutlu & Nian-Feng Tzeng physically available via controlled and File system aging has been advocated International Conference on Com- coordinated virtualization, Zorua for thorough analysis of any design, puter Design (ICCD), October 3-5, offers several important benefits: (i) but it is cumbersome and often by- 2016, Phoenix, USA. Programming Ease. Zorua eases the passed. Our aging study re-evaluates In a network-on-chip (NoC) based burden on the programmer to pro- published file systems after aging using system, the NoC is a shared resource vide code that is tuned to efficiently the same benchmarks originally used. among multiple processor cores. Net- utilize the physically available on-chip We see significant performance degra- dation on HDDs and SSDs. With per- work requests generated by different resources. (ii) Portability. Zorua al- applications running on different cores leviates the necessity of re-tuning an formance of aged file systems on SSDs dropping by as much as 80% relative to can interfere with each other, leading application’s resource usage when to a slowdown in performance of each porting the application across GPU the recreated results of prior papers, aging is even more necessary in the era application. The degree of slowdown generations. (iii) Performance. By introduced by this interference varies dynamically allocating resources and of SSDs. Still more concerning, the rank ordering of compared file systems for each application, as it depends on carefully oversubscribing them when (1) the sensitivity of the application to necessary, Zorua improves or retains can change versus published results. the performance of applications that We offer Geriatrix, a simple-to-use continued on page 26

SPRING 2017 25 RECENT PUBLICATIONS continued from page 25 NoC performance, and (2) network (FAST), a mechanism that employs likely to complete in a fixed CPU traffic induced by other applications slowdown predictions to control the budget, and it incorporates data- running concurrently on the system. network injection rates of applica- race analysis to add new preemption In modern systems, NoC interference tions in a way that minimizes system points on the fly. Preempting threads is largely uncontrolled, and therefore unfairness. Our results over a variety during a data race’s instructions can some applications unfairly slow down of multiprogrammed workloads show automatically classify the race as buggy much more than others. This can lead that FAST improves average system or benign, and uncovers new bugs not to overall system performance degra- fairness and performance by 9.5% and reachable by prior model checkers. It dation, prevent fair progress of differ- 5.2%, respectively. also enables full verification of all pos- ent applications, and cause starvation sible schedules when every data race of unfairly-treated applications. Our Stateless Model Checking with is verified as benign within the CPU goal is to accurately model the slow- Data-Race Preemption Points budget. In our evaluation, QUICK- down of each application executing on SAND found 1.25x as many bugs and the system due to NoC interference at Ben Blum & Garth Gibson verified 4.3x as many tests compared runtime, and to use this information SPLASH 2016 OOPSLA, Oct. 30 - Nov. to prior model checking approaches. to improve system performance and 4, 2016, Amsterdam, Netherlands. reduce unfairness. JamaisVu: Robust Scheduling Stateless model checking is a powerful with Auto-Estimated Job To this end, we propose the NoC Ap- technique for testing concurrent pro- plication Slowdown (NAS) Model, grams, but suffers from exponential Runtimes the first online model that accurately state space explosion when the test in- Alexey Tumanov, Angela Jiang, Jun estimates how much network delays put parameters are too large. Several Woo Park, Michael A. Kozuch & due to interference contribute to the reduction techniques can mitigate Gregory R. Ganger overall stall time of each application. this explosion, but even after prun- The key idea of NAS is to determine ing equivalent interleavings, the state Carnegie Mellon University Parallel how the delays induced at each level space size is often intractable. Most Data Laboratory Technical Report of network data transmission overlap prior tools are limited to preempting CMU-PDL-16-104, September 2016. with each other, and to use the overlap only on synchronization APIs, which JamaisVu is a new end-to-end cluster information to calculate the net impact reduces the space further, but can miss scheduling system that automatically of the delays on application stall time. unsynchronized thread communica- generates and robustly exploits job Our model determines the application tion bugs. Data race detection, an- runtime predictions. Using runtime slowdowns at runtime with a very low other concurrency testing approach, knowledge allows it to more effectively error rate, averaging 4.2% over 90 focuses on suspicious memory access pack jobs with diverse time concerns multiprogrammed workloads for an 88 pairs during a single test execution. It (e.g., deadline vs. latency) and soft- mesh network. We use NAS to develop avoids concerns of state space size, but placement constraints on heteroge- Fairness- Aware Source Throttling may report races that do not lead to neous cluster resources. JamaisVu’s observable fail- job run time predictor, JVuPredict 20 ures, which jeop- uses a new black-box approach that ardizes a user’s tracks job run time history as a func- willingness to use tion of multiple job submission 15 the analysis. features (e.g., user ID and program We present name), and then adaptively uses the most effective feature subset for each 10 QUICKSAND, a new stateless submitted job. Analysis of a 1-month model checking Google cluster trace shows JVuPredict 5 framework which predicts reasonably well for complex manages the ex- real-world job mixes; for example, 90% of predictions are within a fac- 0 ploration of many state spaces us- tor of two of actual runtime. But, ing different pre- because predictions cannot be perfect, emption points. JamaisVu includes new techniques for Some data-race candidates may not be identified during a single program mitigating the effects of such real mis- execution. Using nondeterministic data races as PPs, QUICKSAND found It uses state space 128% (Limited HB) to 191% (Pure HB) as many data-race bugs compared estimation to pri- prediction profiles. Experiments with to using single-pass candidates alone. oritize jobs most continued on page 27

26 THE PDL PACKET RECENT PUBLICATIONS continued from page 26 CFI violations to allow recovery, and enables programmers to specify robust JVuPredict JVuSched 1. Job STRL Generator recovery actions by providing CFI via submission Feature history Sum Estimate source-code safety-checks. PCFIRE’s Selector 2. Job Max Min Job 14 John submission CFI can be formally proved automati- +Estimate Objective function John Runtime Supply/demand cally, and supports realistic features of YARN prediction 4. Job SortV constraints embedded software such as hardware runtime SortV STRL Compiler and I/O access. We showcase PCFIRE by providing, and automatically prov- MILP Solver Runtime core ing, CFI for: benchmark programs, text utilities containing I/O, and embedded programs with sensor 3. Schedule inputs and hardware outputs on the Raspberry Pi single-board computer. End-to-end system integration: JVuSched is integrated into Hadoop YARN—a popular open source cluster scheduling framework. μC-States: Fine-grained GPU Datapath Power Management workloads derived from the trace show tomatic safety property verification that JamaisVu performs nearly as well for unmodified executables, which Onur Kayıran, Adwait Jog, Ashutosh as a hypothetical scheduler with perfect extends an existing trustworthy Hoare Pattnaik, Rachata Ausavarungnirun, job runtime information, outper- logic for local reasoning, and pro- Xulong Tang, Mahmut T. Kandemir, forming runtime-unaware scheduling vides a novel proof tactic for selective Gabriel H. Loh, Onur Mutlu & Chita R. by reducing SLO miss rate, increasing composition. We demonstrate our Das goodput, and maintaining comparable fully automated proof technique on Proceedings of the The 25th Interna- latency for best effort jobs. synthetic and realistic programs, and tional Conference on Parallel Archi- our verification completes in 6 hours tectures and Compilation Techniques AUSPICE-R: Automatic Safety- for a realistic 533-instruction string (PACT 2016), Haifa, Israel, Septem- Property Proofs for Realistic search algorithm, demonstrating the ber 2016. feasibility of our approach. Features in Machine Code To improve the performance of Jiaqi Tan, Hui Jun Tay, Rajeev Gandhi, PCFIRE: Towards Provable, Graphics Processing Units (GPUs) beyond simply increasing core count, & Priya Narasimhan Preventative Control-Flow architects are recently adopting a scale- 14th Asian Symposium on Program- Integrity Enforcement for up approach: the peak throughput and ming Languages and Systems (ASP- Realistic Embedded Software individual capabilities of the GPU LAS), November, 2016. Jiaqi Tan, Hui Jun Tay, Utsav Drolia, cores are increasing rapidly. This big- Verification of machine-code pro- Rajeev Gandhi & Priya Narasimhan. core trend in GPUs leads to various grams using program logic has fo- challenges, including higher static cused on functional correctness, ACM SIGBED International Confer- power consumption and lower and and proofs have required manually- ence on Embedded Software (EM- imbalanced utilization of the data- provided program specifications. SOFT), October 2016. path components of a big core. As we Fortunately, the verification of shal- Control-Flow Integrity (CFI) is an show in this paper, two key problems low safety properties such as memory important safety property of software, ensue: (1) the lower and imbalanced and control-flow safety can be easier particularly in embedded and safety- datapath utilization can waste power as to automate, but past techniques for critical systems, where CFI violations an application does not always utilize automatically verifying machine-code have led to patient deaths and can all portions of the big core datapath, safety have required post-compilation render cars remotely controllable by and (2) the use of big cores can lead to application performance degrada- transformations, which can change attackers. Previous techniques for CFI tion in some cases due to the higher program behavior. In this work, we may reduce the robustness of embed- memory system contention caused by automatically verify safety properties ded and safety-critical systems, as they the more memory requests generated for unmodified machine-code pro- handle CFI violations by stopping by each big core. grams without requiring user-supplied programs. In this work, we present specifications. We present our novel PCFIRE, a preventative approach to This paper introduces a new analysis of logic framework, AUSPICE, for au- CFI that prevents the root-causes of continued on page 28

SPRING 2017 27 RECENT PUBLICATIONS continued from page 27 datapath component utilization in big- Recent computer systems research has necessarily preempting on every shared core GPUs based on queuing theory proposed using redundant requests to memory access, and ignores false-pos- principles. Building on this analysis, reduce latency. The idea is to replicate itive data race candidates arising from we introduce a fine-grained dynamic a request so that it joins the queue at certain heap allocation patterns. An power- and clock-gating mechanism multiple servers. The request is con- Iterative Deepening test that reaches for the entire datapath, called μC- sidered complete as soon as any one completion soundly verifies all pos- States, which aims to minimize power copy of the request completes. Redun- sible thread interleavings of that test. consumption by turning off or tuning- dancy is beneficial because it allows down datapath components that are us to overcome server-side variability Larger-than-Memory Data not bottlenecks for the performance of – the fact that the server we choose Management on Modern Storage the running application. Our experi- might be temporarily slow due to fac- Hardware for In-Memory OLTP mental evaluation demonstrates that tors such as background load, network Database Systems μC-States significantly reduces both interrupts, and garbage collection. static and dynamic power consumption When there is significant server-side Lin Ma, Joy Arulraj, Sam Zhao, Andrew in a big-core GPU, while also signifi- variability, replicating requests can Pavlo, Subramanya R. Dulloor, Michael cantly improving the performance of greatly reduce response times. J. Giardino, Jeff Parkhurst, Jason L. applications affected by high memory In the past few years, queueing theo- Gardner, Kshitij Dosh & Col. Stanley system contention. We also show that rists have begun to study redundancy, Zdonik our analysis of datapath component first via approximations, and, more DaMoN’16, June 26 - July 01 2016, utilization can guide scheduling and recently, via exact analysis. Unfortu- San Francisco, CA, USA design decisions in a GPU architecture nately, for analytical tractability, most In-memory database management that contains heterogeneous cores. existing theoretical analysis has as- sumed an Independent Runtimes (IR) systems (DBMSs) outperform disk- oriented systems for on-line transaction A Better Model for Job model, wherein the replicas of a job processing (OLTP) workloads. But this Redundancy: Decoupling each experience independent runtimes (service times) at different servers. The improved performance is only achievable Server Slowdown and Job Size IR model is unrealistic and has led to when the database is smaller than the Kristen Gardner, Mor Harchol-Balter & theoretical results which can be at odds amount of physical memory available in Alan Scheller-Wolf with computer systems implementation the system. To overcome this limitation, results. This paper introduces a much some in-memory DBMSs can move cold IEEE Modeling, Analysis and Simula- more realistic model of redundancy. data out of volatile DRAM to secondary tion of Computer and Telecommu- Our model allows us to decouple the storage. Such data appears as if it resides nication Systems (MASCOTS 2016), inherent job size (X) from the server- in memory with the rest of the database London, UK, September 2016. side slowdown (S), where we track both even though it does not. Although there S and X for each job. Analysis within have been several implementations pro- the S&X model is, of course, much posed for this type of cold data storage, λk arrivals/sec there has not been a thorough evaluation with inherent size ~X more difficult. Nevertheless, we design a policy, Redundant-to- Idle-Queue of the design decisions in implement- Dispatcher (RIQ) which is both analytically trac- ing this technique, such as policies for table within the S&X model and has when to evict tuples and how to bring provably excellent performance. them back when they are needed. These . . . choices are further complicated by the S S S S Soundness Proofs for Iterative varying performance characteristics of Deepening different storage devices, including fu- ture non-volatile memory technologies. Ben Blum We explore these issues in this paper and The S&X model. The system has k servers and discuss several approaches to solve them. jobs arrive as a Poisson process with rate λk. Carnegie Mellon University Parallel Each job has an inherent size X. When a job Data Lab Technical Report CMU- We implemented all of these approaches runs on a server it experiences slowdown S. PDL-16-103, September 6, 2016. in an in-memory DBMS and evaluated A job’s running time on a single server is R(1) them using five different storage tech- = X·S. When a job runs on multiple servers, its The Iterative Deepening algorithm al- nologies. Our results show that choosing inherent size X is the same on all these servers lows stateless model checkers to adjust the best strategy based on the hardware and it experiences a different, independently preemption points on-the-fly. It uses drawn instance of S on each server. dynamic data-race detection to avoid continued on page 29

28 THE PDL PACKET RECENT PUBLICATIONS continued from page 28 improves throughput by 92–340% over of the problem and the effectiveness places those pages in the memory stack a generic configuration. of FlexRR’s solution. Using FlexRR, closest to the offloaded code, to mini- we consistently observe near-ideal mize off-chip bandwidth consump- Addressing the Straggler run-times (relative to no performance tion. We call the combination of these Problem for Iterative jitter) across all real and injected strag- two programmer-transparent mecha- Convergent Parallel ML gler behaviors tested. nisms TOM: Transparent Offloading and Mapping. Aaron Harlap, Henggang Cui, Wei Dai, Transparent Offloading and Our extensive evaluations across a Jinliang Wei Gregory R. Ganger, Phillip B. Mapping (TOM): Enabling variety of modern memory-intensive Gibbons, Garth A. Gibson & Eric P. Xing Programmer-Transparent GPU workloads show that, without ACM Symposium on Cloud Computing Near-Data Processing in GPU requiring any program modification, 2016. October 5 - 7, Santa Clara. CA. Systems TOM significantly improves perfor- mance (by 30% on average, and up FlexRR provides a scalable, efficient Kevin Hsieh, Eiman Ebrahimi, to 76%) compared to a baseline GPU solution to the straggler problem for Gwangsun Kim, Niladrish Chatterjee, system that cannot offload computa- iterative machine learning (ML). The Mike O’Connor, Nandita Vijaykumar, tion to 3D-stacked memories. frequent (e.g., per iteration) barriers Onur Mutlu & Stephen W. Keckler used in traditional BSP-based distrib- Bridging the Archipelago uted ML implementations cause every Proceedings of the 43rd International between Row-Stores and transient slowdown of any worker Symposium on Computer Architec- thread to delay all others. FlexRR com- ture (ISCA), Seoul, South Korea, June Column-Stores for Hybrid bines a more flexible synchronization 18 - 22, 2016. Workloads model with dynamic peer-to-peer re- Main memory bandwidth is a critical Joy Arulraj, Andrew Pavlo & Prashanth assignment of work among workers to bottleneck for modern GPU systems Menon address straggler threads. Experiments due to limited off-chip pin bandwidth. with real straggler behavior observed 3D-stacked memory architectures SIGMOD’16, June 26 - July 01, 2016, on Amazon EC2 and Microsoft Azure, provide a promising opportunity to San Francisco, CA, USA as well as injected straggler behavior significantly alleviate this bottleneck Data-intensive applications seek to stress tests, confirm the significance by directly connecting a logic layer to obtain trill insights in real-time by the DRAM layers with high bandwidth analyzing a combination of historical Ok Fast Slow connections. Recent work has shown data sets alongside recently collected promising potential performance ben- data. This means that to support such efits from an architecture that connects hybrid workloads, database manage- multiple such 3D-stacked memories ment systems (DBMSs) need to handle Ignore and offloads bandwidth-intensive both fast ACID transactions and com- Do assignment #1 (I need help) plex analytical queries on the same (red work) computations to a GPU in each of the Started Working database. But the current trend is to use Do assignment #2 logic layers. An unsolved key challenge (green work) in such a system is how to enable com- specialized systems that are optimized putation offloading and data mapping for only one of these workloads, and to multiple 3D-stacked memories with- thus require an organization to main- out burdening the programmer such tain separate copies of the database. that any application can transparently This adds additional cost to deploying RapidReassignment example. The middle benefit from near-data processing ca- a database application in terms of both worker sends progress reports to the other pabilities in the logic layer. storage and administration overhead. two workers (its helper group). The worker on the left is running at a similar speed, Our paper develops two new mecha- To overcome this barrier, we pres- so it ignores the message. The worker on nisms to address this key challenge. ent a hybrid DBMS architecture that the right is running slower, so it sends a First, a compiler-based technique that efficiently supports varied workloads do-this message to re-assign an initial work automatically identifies code to offload on the same database. Our approach assignment. Once the faster worker finishes to a logic-layer GPU based on a simple differs from previous methods in that its own work and begins helping, it sends a we use a single execution engine that is begun-helping message to the slow worker. cost-benefit analysis. Second, a soft- ware/hardware cooperative mechanism oblivious to the storage layout of data Upon receiving this, the slow worker sends without sacrificing the performance a do-this with a follow-up work assignment that predicts which memory pages will to the fast worker. be accessed by offloaded code, and continued on page 30

SPRING 2017 29 RECENT PUBLICATIONS continued from page 29 benefits of the specialized systems. we propose using a two-stage index: KVSs; results demonstrated an order This obviates the need to maintain sep- The first stage ingests all incoming of magnitude increase in throughput arate copies of the database in multiple entries and is kept small for fast read and energy efficiency over stock mem- independent systems. We also present and write operations. The index peri- cached. Software-centric research re- a technique to continuously evolve odically migrates entries from the first visited the KVS application to address the database’s physical storage layout stage to the second, which uses a more fundamental software bottlenecks and by analyzing the queries’ access pat- compact, read-optimized data struc- to exploit the full potential of modern terns and choosing the optimal layout ture. Our first contribution is hybrid commodity hardware; these efforts too for different segments of data within index, a dual-stage index architecture showed orders of magnitude improve- the same table. To evaluate this work, that achieves both space efficiency ment over stock memcached. we implemented our architecture in and high performance. Our second We aim at architecting high perfor- an in-memory DBMS. Our results contribution is Dual-Stage Trans- mance and efficient KVS platforms, show that our approach delivers up formation (DST), a set of guidelines and start with a rigorous architectural to 3× higher throughput compared to for converting any order-preserving characterization across system stacks static storage layouts across different index structure into a hybrid index. over a collection of representative KVS workloads. We also demonstrate that Our third contribution is applying implementations. Our detailed full- our continuous adaptation mechanism DST to four popular order-preserving system characterization not only iden- allows the DBMS to achieve a near-op- index structures and evaluating them tifies the critical hardware/software timal layout for an arbitrary workload in both standalone microbenchmarks without requiring any manual tuning. ingredients for high-performance and a full in-memory DBMS using sev- KVS systems, but also suggests new op- eral transaction processing workloads. Reducing the Storage timizations to achieve record-setting Our results show that hybrid indexes throughput: 120 million requests Overhead of Main-Memory provide comparable throughput to per second (MRPS) (167 MRPS when OLTP Databases with Hybrid the original ones while reducing the with client-side batching) on a single Indexes memory overhead by up to 70%. commodity server. Our system deliv- ers the best performance and energy Huanchen Zhang, Andy Pavlo, David Achieving One Billion Key- efficiency (RPS/watt) demonstrated G. Andersen, Michael Kaminsky, Lin Ma Value Requests Per Second on & Rui Shen to date with existing KVSs—includ- a Single Server ing the best-published FPGA-based ACM SIGMOD International Con- Sheng Li, Hyeontaek Lim, Victor Lee, and GPU-based claims. We propose ference on Management of Data 2016 Jung Ho Ahn, Anuj Kalia, Michael a future manycore platform, and via (SIGMOD’16), June 2016. Kaminsky, David G. Andersen, Seongil detailed simulations demonstrate the Using indexes for query execution is O, Sukhan Lee & Pradeep Dubey capability of achieving a billion RPS crucial for achieving high performance with a single server constructed fol- in modern on-line transaction pro- IEEE Micro’s Top Picks from the lowing our principles. cessing databases. For a main-mem- Computer Architecture Conferences ory database, however, these indexes 2016, May/June 2016. Top Picks 2016! Efficient Algorithms with consume a large fraction of the total Distributed in-memory key-value Asymmetric Read and Write memory available and are thus a major stores (KVSs), such as memcached, Costs source of storage overhead of in-mem- have become a critical data serving layer ory databases. To reduce this overhead, in modern Internet-oriented datacen- Guy E. Blelloch, Jeremy T. Fineman, ter infrastructure. Their performance Phillip B. Gibbons, Yan Gu & Julian Shun O -chip GPU-to-memory link Main GPU SMs and efficiency directly affect the QoS of 24th European Symposium on Algo- } DRAM layers web services and the efficiency of data- rithms (ESA’16). August, 2016. Logic layer centers. Traditionally, these systems have had significant overheads from In several emerging technologies for computer memory (main memory), Main GPU inefficient network processing, OS kernel involvement, and concurrency the cost of reading is significantly 3D-stacked memory O -chip Logic layer cheaper than the cost of writing. Such (memory stack) cross-stack link SM control. Two recent research thrusts Crossbar switch have focused upon improving key- asymmetry in memory costs poses a Vault . . . Vault Ctrl Ctrl value performance. Hardware-centric fundamentally different model from research has started to explore special- the RAM for algorithm design. In Overview of an NDP GPU system. ized platforms including FPGAs for continued on page 31

30 THE PDL PACKET RECENT PUBLICATIONS continued from page 30 this paper we study lower and upper to achieve asymptotic improvements which would seem to be a diamond bounds for various problems under with cheaper reads when ω is bounded DAG, we can beat this lower bound such asymmetric read and write costs. by a polynomial in M. Moreover, there with an algorithm with only O(ωn2/ We consider both the case in which all is an asymptotic gap (of min(ω,log n)/ (M min(ω1/3,M1/2))) cost. To achieve but O(1) memory has asymmetric cost, ω log( M)) between the cost of sorting this we make use of a “path sketch” and the case of a small cache of sym- networks and comparison sorting in technique that is forbidden in a strict metric memory. We model both cases the model. This contrasts with the using the (M,ω)-ARAM, in which there RAM, and most other models, in DAG computation. Finally, we show is a small (symmetric) memory of size which the asymptotic costs are the several interesting upper bounds for M and a large unbounded (asymmetric) same. We also show a lower bound for shortest path problems, minimum memory, both random access, and computations on an n × n diamond spanning trees, and other problems. A where reading from the large memory DAG of Ω(ωn2/M) cost, which indicates common theme in many of the upper has unit cost, but writing has cost ω»1. no asymptotic improvement is achiev- bounds is that they require redundant For FFT and sorting networks we show able with fast reads. However, we show computation and a tradeoff between ω a lower bound cost of Ω( n logωMn), that for the minimum edit distance reads and writes. which indicates that it is not possible problem (and related problems), BIG-LEARNING SYSTEMS MEET CLOUD COMPUTING continued from page 13 communication to parameter server systems, Gaia: (1) Data Center Boundary the available WAN significantly improves performance, by Gaia bandwidth between 1.8–53.5×, (2) has performance within Gaia Parameter Server Parameter Worker Server pairs of data centers 0.94–1.40× that of running the same Local Machine Parameter Store Server and uses special se- ML algorithm on a LAN in a single data Worker Mirror lective barrier and center, and (3) significantly reduces the Machine Control Server … mirror clock control monetary cost of running the same ML Significance Queue Mirror Worker messages to ensure Filter Data Client algorithm on WANs by 2.6–59.0×. For Machine Queue algorithm conver- more information, see [2]. gence even during Key components of Gaia. a period of sudden References fall (negative spike) 1% change to the parameter value. With in available WAN bandwidth. [1] Proteus: Agile ML Elasticity ASP, these insignificant updates to the through Tiered Reliability in Dynamic In contrast to a state-of-the-art com- Resource Markets. Aaron Harlap, same parameter within a data center are munication-efficient synchronization Alexey Tumanov, Andrew Chung, aggregated (and thus not communicated model, Stale Synchronous Parallel Greg Ganger, Phil Gibbons. ACM to other data centers) until the aggre- (SSP), which bounds how stale (i.e., European Conference on Computer gated updates are significant. ASP al- old) a parameter can be, ASP bounds Systems, 2017 (EuroSys’17), 23rd- lows the ML programmer to specify the how inaccurate a parameter can be, 26th April, 2017, Belgrade, Serbia. function and the threshold to determine compared to the most up-to-date value. [2] Gaia: Geo-Distributed Machine the significance of updates for each Hence, it provides high flexibility in Learning Approaching LAN Speeds. ML algorithm, while providing default performing (or not performing) up- Kevin Hsieh, Aaron Harlap, Nan- configurations for unmodified ML pro- dates, as the server can delay synchro- dita Vijaykumar, Dimitris Konomis, grams. For example, the programmer nization indefinitely as long as the ag- Gregory R. Ganger, Phillip B. Gib- can specify that all updates that produce gregated update remains insignificant. bons, Onur Mutlu. 14th USENIX more than a 1% change are significant. Experiments with three popular classes Symposium on Networked Systems ASP ensures all significant updates are of ML algorithms on Gaia prototypes Design and Implementation (NSDI), synchronized across all model copies in across 11 Amazon EC2 regions show March 27–29, 2017, Boston, MA. a timely manner. It dynamically adapts that, compared to two state-of-the-art

SPRING 2017 31 YEAR IN REVIEW continued from page 4 ™™Nandita Vijaykumar presented ™™Gennady Pekhimenko successfully Memory Data Management on “Zorua: A Holistic Approach to defended his PhD thesis “Practi- Modern Storage Hardware for In- Resource Virtualization in GPUs” cal Data Compression for Modern Memory OLTP Database Systems” at at MICRO ‘16 in Taipei, Taiwan. Memory Hierarchies.” DaMoN ‘16 in San Francisco, CA. ™™Papers presented at SoCC ‘16 in ™™Phil Gibbons gave the keynote ™™Papers presented at SIGMOD ‘16 Santa Clara, CA included “Prin- talk on “How Emerging Memory in San Francisco were“Bridging cipled Workflow-centric Tracing of Technologies will Have You Re- the Archipelago between Row- Distributed Systems” (Raja R. Sam- thinking Algorithm Design” at Stores and Column-Stores for basivan) and “Addressing the Strag- PODC’16 in Chicago, IL. Hybrid Workloads” (Joy Arul- gler Problem for Iterative Conver- raj) and “Reducing the Storage gent Parallel ML” (Aaron Harlap). ™™Naama Ben-David presented Overhead of Main-Memory OLTP “Parallel Algorithms for Asym- Databases with Hybrid Indexes” September 2016 metric Read-Write Costs” at SPAA (Huanchen Zhang). ™™Kristen Gardner presented “A ‘16 in Asilomar, CA. Better Model for Job Redundancy: ™™Kevin Hsieh presented “Trans- Decoupling Server Slowdown and June 2016 parent Offloading and Mapping Job Size” at MASCOTS ‘16 in ™™ Anuj Kalia and co-authors won (TOM): Enabling Programmer- London, UK. Best Student Paper award at USE- Transparent Near-Data Processing NIX ATC’16 in Denver, CO for in GPU Systems” at ISCA ‘16 on August 2016 their work on “Design Guidelines Seoul, South Korea. ™™Yixin Luo proposed his thesis re- for High Performance RDMA May 2016 search “Architectural Techniques Systems.” for Improving NAND Flash ™™18th annual PDL Spring Visit Day. ™™Mor Harchol-Balter gave the Memory Reliability.” keynote presentation on “A Better ™™“Achieving One Billion Key-Value ™™Mor Harchol-Balter presented the Model for Task Assignment in Requests Per Second on a Single keynote presentation on “Queue- Server Farms: How Replication Server” by Sheng Li, Hyeontaek ing with Redundant Requests: A can Help” at Sigmetrics ‘16 in Lim, Victor Lee, Jung Ho Ahn, More Realistic Model” at Can- Antibes Juan-les-Pins. Anuj Kalia, Michael Kaminsky, Queue 2016 in London, ON. David G. Andersen, Seongil O, ™™Phil Gibbons gave the keynote Sukhan Lee and Pradeep Dubey July 2016 talk on “What’s So Special About was chosen as one of IEEE Micro’s ™™Alexey Tumanov successfully de- Big Learning?...A Distributed Top Picks from the Computer Ar- fended his PhD thesis “Scheduling Systems Perspective” at ICDCS’16 chitecture Conferences of 2016. with Space-Time Soft Constraints in in Nara, Japan. Heterogeneous Cloud Datacenters.” ™™Lin Ma presented “Larger-than-

2016 PDL Workshop and Retreat.

32 THE PDL PACKET