newsletter on pdl activities and events • fall 2017        http://www.pdl.cmu.edu/

PDL CONSORTIUM MEMBERS
Broadcom, Ltd.
Citadel
Dell EMC
Google
Hewlett-Packard Labs
Hitachi, Ltd.
Intel Corporation
MongoDB
NetApp, Inc.
Oracle Corporation
Salesforce
Samsung Information Systems America
Seagate Technology
Two Sigma
Toshiba
Veritas
Western Digital

CONTENTS
The PDL is 25!
PDL News & Awards
Defenses & Proposals
Recent Publications

THE PDL PACKET

EDITOR
Joan Digney

CONTACTS
Greg Ganger, PDL Director
Bill Courtright, PDL Executive Director
Karen Lindenfelser, PDL Administrative Manager
The Parallel Data Laboratory
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3891
tel 412-268-6716
fax 412-268-3010

THE PDL IS 25!
by Joan Digney, Greg Ganger & Bill Courtright

After a successful formative workshop in late 1992, Dr. Garth Gibson officially launched the PDL in 1993 with 7 students from CMU's CS and ECE Departments. Having recently finished his Ph.D. research, which defined the industry standard RAID terminology for redundant disk arrays, Gibson guided PDL researchers in advanced disk array research. The name "Parallel Data Lab" came from this initial focus on parallelism in storage systems. In the PDL's formative years, its researchers developed technologies for improving failure recovery performance (parity declustering) and maximizing performance in small-write intensive workloads (parity logging). They also developed an aggressive prefetching technology (transparent informed prefetching, or TIP) for converting serial access patterns into highly parallel workloads capable of exploiting large disk arrays.

The PDL first received actual lab space at CMU (Wean Hall 3607) to go with its name in January of 1994. As it grew, PDL became more spread out, with people in various areas of Wean Hall and the D-Level of Hamerschlag Hall. Today, PDL people are primarily located in the Robert Mehrabian Collaborative Innovation Center (RMCIC) and the Gates Center for Computer Science. The equipment populating PDL lab spaces is often state of the art, a rare case for academic research, and is upgraded frequently as technology advances. In large part, we have our sponsor companies to thank for this.

[Photo: Garth and the Scotch Parallel File System]

PDL's initial seed funding came from CMU's Data Storage Systems Center (DSSC), then directed by Mark Kryder, and from DARPA (from which much PDL funding has come over the years). Additional original funding came from the member companies of the PDL Consortium, whose initial members were AT&T Global Information Systems, Data General, IBM, Hewlett-Packard, Seagate and Storage Technology. Today, PDL funding comes from DoD, DoE, NSF, and the current PDL Consortium members. The list of consortium members has become long over the years, as companies join, merge, and change research focus. It is currently comprised of the following: Broadcom, Ltd., Citadel, Dell EMC, Google, Hewlett-Packard Labs, Hitachi, Ltd., Intel Corporation, Microsoft Research, MongoDB, NetApp, Inc., Oracle Corporation, Salesforce, Samsung Information Systems America, Seagate Technology, Two Sigma, Toshiba, Veritas, and Western Digital. VMware has also been generous with their financial support.

In 1995, Gibson and Dr. David Nagle launched a new PDL project called Network-Attached Secure Disks (NASD), a network-attached storage architecture for achieving cost-effective scalable bandwidth.
In addition to their research advances, Gibson founded and chaired an industry working group within the Information Storage Industry Consortium (INSIC) to transfer the new technology and move towards standardization of the NASD architecture.

[Photo: An early PDL group photo, ca. 1995.]

In 1999, Nagle took over as PDL Director when Gibson went on leave to co-found Panasas. In 2000, Dr. Greg Ganger, who joined the ECE faculty and the PDL in 1997, jointly directed the PDL with Nagle, then became Director in 2001 when Nagle went on leave. Early research initiatives under Ganger's leadership included Self-Securing Devices, PASIS (Survivable Storage), and Self-* (tuning, managing, ...) Storage.

In 2006, after several years of preparation, the PDL opened the Data Center Observatory (DCO), a machine room with over 1,875 square feet of space. As of May 2017, it is populated with 1077 computers, connected to 76 switches, 111 power distributors and 22 remote console servers with a total of 3,683 cables. The DCO provides a computation and storage utility to resource-hungry research activities such as data mining, design simulation, network intrusion detection, and visualization. The DCO houses several large research clusters including Susitna, Marmot and Narwhal.

Since then, each year has seen the development of many exciting ideas in the PDL. Here are a few examples:

»» DISC (data-intensive supercomputing) allowed applications to extract deep insight from huge and dynamically-changing datasets.

»» We explored the use of virtual machines using FSVAs (file system virtual appliances) to address the porting problems associated with the client-side component of most cluster-based designs (including ours).

»» The Perspective home storage system was designed to simplify data management and sharing among the many storage-enhanced devices (e.g., DVRs, iPods, laptops).

»» Led by Prof. David Andersen, the FAWN (fast array of wimpy nodes) project explored new cluster architectures able to provide data-intensive computing with order of magnitude improvements in energy efficiency.

»» We continue to research the technology advances needed for cloud computing. Our OpenCloud and OpenCirrus clouds provide resources for real users, as well as provide us with invaluable Hadoop logs, instrumentation data, and case studies.

»» We have explored exciting new storage technologies, such as NVM and Flash SSDs. Even the disk drive is changing, with technologies like shingled magnetic recording creating a need to reconsider usage patterns and interfaces.

»» Focus has reemerged on database systems, including work on automated database tuning, deduplication in databases, incremental computation, and more exploitation of NVM in databases. In particular, Andy Pavlo's Peloton DBMS project combines several of these activities into what he refers to as a "self-driving" database, seeking to achieve autonomous adaptation to workload and resource conditions.

»» The breadth of analytics frameworks and other cloud computing activities continues to grow and has led to resource scheduling challenges. Our TetriSched project developed new ways of allowing users to express their per-job resource type preferences (e.g., machine locality or hardware accelerators) and explored the trade-

[Photo: Garth and students; from L to R, Dan Stodolsky, Hugo Patterson, Garth, and Bill Courtright.]

continued on page 16

AWARDS & OTHER PDL NEWS

October 2017
Lorrie Cranor Awarded FORE Systems Chair of Computer Science

We are very pleased to announce that, in addition to a long list of accomplishments, which has included a term as the Chief Technologist of the Federal Trade Commission, Lorrie Cranor has been made the FORE Systems Professor of Computer Science and Engineering & Public Policy at CMU. Lorrie provided information that "the founders of FORE Systems, Inc. established the FORE Systems Professorship in 1995 to support a faculty member in the School of Computer Science. The company's name is an acronym formed by the initials of the founders' first names. Before it was acquired by Great Britain's Marconi in 1998, FORE created technology that allows computer networks to link and transfer information at a rapid speed. Ericsson purchased much of Marconi in 2006." The chair was previously held by CMU University Professor Emeritus, Edmund M. Clarke.

September 2017
Garth Gibson to Lead New Vector Institute for AI in Toronto

In January of 2018, PDL's founder, Garth Gibson, will become President and CEO of the Vector Institute for AI in Toronto. Vector's website states that "Vector will be a leader in the transformative field of artificial intelligence, excelling in machine and deep learning — an area of scientific, academic, and commercial endeavour that will shape our world over the next generation."

Frank Pfenning, Head of the Department of Computer Science, notes that "this is a tremendous opportunity for Garth, but we will sorely miss him in the multiple roles he plays in the department and school: Professor (and all that this entails), Co-Director of the MCDS program, and Associate Dean for Masters Programs in SCS."

We are sad to see him go and will miss him greatly, but the opportunities presented here for world-level innovation are tremendous and we wish him all the best.

June 2017
Dana Van Aken's SIGMOD Paper Featured in Amazon AI Blog

Please see Amazon's AI blog at https://aws.amazon.com/blogs/ai/tuning-your-dbms-automatically-with-machine-learning/ to read about Tuning Your DBMS Automatically with Machine Learning, an article by Dana Van Aken, based on the SIGMOD '17 paper "Automatic Database Management System Tuning Through Large-scale Machine Learning," which she co-authored with Andy Pavlo and Geoff Gordon.

June 2017
Satya and Colleagues Honored for Creation of Andrew File System

The Association for Computing Machinery has named the developers of Carnegie Mellon University's pioneering Andrew File System (AFS) the recipients of its prestigious 2016 Software System Award. AFS was the first distributed file system designed for tens of thousands of machines, and it pioneered the use of scalable, secure and ubiquitous access to shared file data. To achieve the goal of providing a common shared file system used by large networks of people, AFS introduced novel approaches to caching, security, management and administration.

The award recipients, including Computer Science Professor Mahadev Satyanarayanan, built the Andrew File System in the 1980s while working as a team at the Information Technology Center (ITC) — a partnership between Carnegie Mellon and IBM. The ACM Software System Award is presented to an institution or individuals recognized for developing a software system that has had a lasting influence, reflected in contributions to concepts, in commercial acceptance, or both.

AFS is still in use today as both an open-source system and as a file system in commercial applications. It has also inspired several cloud-based storage applications. Many universities integrated AFS before it was introduced as a commercial application.

In addition to Satya, the recipients of the award include former faculty member Alfred Z. Spector, alumnus Michael L. Kazar, Robert N. Sidebotham, David A. Nichols, Michael J. West, John H. Howard and Sherri M. Nichols. Many of them also contributed to two foundational AFS papers: "The ITC Distributed File System: Principles and Design," published in Proceedings of ACM SOSP 1985, and "Scale and Performance in a Distributed File System," published in Proceedings of ACM SOSP 1987. The Software System Award carries a prize of $35,000. Financial support for the Software System Award is provided by IBM.

-- Byron Spice, The Piper, June 1, 2017

DEFENSES & PROPOSALS

DISSERTATION ABSTRACT:
Understanding and Improving the Latency of DRAM-Based Memory Systems

Kevin K. Chang
Carnegie Mellon University, ECE
PhD Defense — May 5, 2017

Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts, and the emergence of increasingly more data-intensive and latency-critical applications, further stress the importance of providing low-latency memory accesses.

In this dissertation, we identify three main problems that contribute significantly to long latency of DRAM accesses. To address these problems, we present a series of new techniques. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency.

First, while bulk data movement is a key operation in many applications and operating systems, contemporary systems perform this movement inefficiently, by transferring data from DRAM to the processor, and then back to DRAM, across a narrow off-chip channel. The use of this narrow channel for bulk data movement results in high latency and high energy consumption. This dissertation introduces a new DRAM design, Low-cost Inter-linked SubArrays (LISA), which provides fast and energy-efficient bulk data movement across subarrays in a DRAM chip. We show that the LISA substrate is very powerful and versatile by demonstrating that it efficiently enables several new architectural mechanisms, including low-latency data copying, reduced DRAM access latency for frequently-accessed data, and reduced preparation latency for subsequent accesses to a DRAM bank.

Second, DRAM needs to be periodically refreshed to prevent data loss due to leakage. Unfortunately, while DRAM is being refreshed, a part of it becomes unavailable to serve memory requests, which degrades system performance. To address this refresh interference problem, we propose two access-refresh parallelization techniques that enable more overlapping of accesses with refreshes inside DRAM, at the cost of very modest changes to the memory controllers and DRAM chips. These two techniques together achieve performance close to an idealized system that does not require refresh.

Third, we find, for the first time, that there is significant latency variation in accessing different cells of a single DRAM chip due to the irregularity in the DRAM manufacturing process. As a result, some DRAM cells are inherently faster to access, while others are inherently slower. Unfortunately, existing systems do not exploit this variation and use a fixed latency value based on the slowest cell across all DRAM chips. To exploit latency variation within the DRAM chip, we experimentally characterize and understand the behavior of the variation that exists in real commodity DRAM chips. Based on our characterization, we propose Flexible-LatencY DRAM (FLY-DRAM), a mechanism to reduce DRAM latency by categorizing the DRAM cells into fast and slow regions, and accessing the fast regions with a reduced latency, thereby improving system performance significantly. Our extensive experimental characterization and analysis of latency variation in DRAM chips can also enable development of other new techniques to improve performance or reliability.

Fourth, this dissertation, for the first time, develops an understanding of the latency behavior due to another important factor—supply voltage, which significantly impacts DRAM performance, energy consumption, and reliability. We take an experimental approach to understanding and exploiting the behavior of modern DRAM chips under different supply voltage values. Our detailed characterization of real commodity DRAM chips demonstrates that memory access latency reduces with increasing supply voltage. Based on our characterization, we propose Voltron, a new mechanism that improves system energy efficiency by dynamically adjusting the DRAM supply voltage based on a performance model. Our extensive experimental data on the relationship between DRAM supply voltage, latency, and reliability can further enable development of other new mechanisms that improve latency, energy efficiency, or reliability.

The key conclusion of this dissertation is that augmenting the DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips, together lead to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and detailed experimental data on real commodity DRAM chips presented in this dissertation will enable development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.

[Photo: Thomas Kim presents a demo at the 2017 PDL Spring Visit Day.]

DISSERTATION ABSTRACT:
Meeting Tail Latency SLOs in Shared Networked Storage

Timothy Zhu
Carnegie Mellon University, SCS
PhD Defense — May 3, 2017

Shared computing infrastructures (e.g., cloud computing, enterprise datacenters) have become the norm today due to their lower operational costs and IT management costs. However, resource sharing introduces challenges in controlling performance for each of the workloads using the infrastructure. For user-facing workloads (e.g., web server, email server), one of the most important performance metrics companies want to control is tail latency, the time it takes to complete the most delayed requests. Ideally, companies would be able to specify tail latency performance goals, also called Service Level Objectives (SLOs), to ensure that almost all requests complete quickly.

Meeting tail latency SLOs is challenging for multiple reasons. First, tail latency is significantly affected by the burstiness that is commonly exhibited by production workloads. Burstiness leads to transient queueing, which is a major cause of high tail latency. Second, tail latency is often due to I/O (e.g., storage, networks), and I/O devices exhibit performance peculiarities that make it hard to meet SLOs. Third, the end-to-end latency is affected by a sum of latencies across multiple types of resources such as storage and networks. Most of the existing research, however, has ignored burstiness and focused on a single resource.

This thesis introduces new techniques for meeting end-to-end tail latency SLOs in both storage and networks while accounting for the burstiness that arises in production workloads. We address open questions in scheduling policies, admission control, and workload placement. We build a new Quality of Service (QoS) system for meeting tail latency SLOs in networked storage infrastructures. Our system uses prioritization and rate limiting as tools for controlling the congestion between workloads. We introduce a novel approach for intelligently configuring the workload priorities and rate limits using two different types of queueing analyses: Deterministic Network Calculus (DNC) and Stochastic Network Calculus (SNC). By integrating these mathematical analyses into our system, we are able to build better algorithms for optimizing the resource usage. Our implementation results using realistic workload traces on a physical cluster demonstrate that our approach can meet tail latency SLOs while achieving better resource utilization than the state-of-the-art.

While this thesis focuses on scheduling policies, admission control, and workload placement in storage and networks, the ideas presented in our work can be applied to other related problems such as workload migration and datacenter provisioning. Our theoretically grounded techniques for controlling tail latency can also be extended beyond storage and networks to other contexts such as the CPU, cache, etc. For example, in real-time CPU scheduling contexts, our DNC-based techniques could be used to provide strict latency guarantees while accounting for workload burstiness.

[Photo: Aaron Harlap talks about his research at the 2016 PDL retreat.]

THESIS PROPOSAL:
The Design & Implementation of a Non-Volatile Memory Database Management System

Joy Arulraj, SCS
October 20, 2017

For the first time in 25 years, a new non-volatile memory (NVM) category is being created that is two orders of magnitude faster than current durable storage media. This will fundamentally change the dichotomy between volatile memory and durable storage in DB systems. The new NVM devices are almost as fast as DRAM, but all writes to them are potentially persistent even after power loss. Existing DB systems are unable to take full advantage of this technology because their internal architectures are predicated on the assumption that memory is volatile. With NVM, many components of legacy database systems are unnecessary and will degrade the performance of data-intensive applications.

This dissertation explores the implications of NVM for database systems. It presents the design and implementation of Peloton, a new database system tailored specifically for NVM. We focus on three aspects of a database system: (1) logging and recovery, (2) storage management, and (3) indexing. Our primary contribution in this dissertation is the design of a new logging and recovery protocol, called write-behind logging, that improves the availability of the system by more than two orders of magnitude compared to the ubiquitous write-ahead logging protocol. Besides improving availability, we found that write-behind logging improves the space utilization of the NVM device and extends its lifetime. Second, we propose a new storage engine architecture that leverages the durability and byte-addressability properties of NVM to avoid unnecessary data duplication. Third, the dissertation presents the design of a latch-free range index tailored for NVM that supports near-instantaneous recovery without requiring special-purpose recovery code.
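To make the logging contrast concrete, here is a minimal sketch of the difference between write-ahead logging and the write-behind idea described above. It is an illustration only, not Peloton's implementation; the helper names (`append`, `flush`, `write`, `nvm_write`, `persist_barrier`) are hypothetical stand-ins for a log and a table stored on NVM.

```python
# Illustrative sketch only -- not Peloton's actual code.

def commit_write_ahead(txn, log, table):
    """Classic write-ahead logging: log the change durably, then apply it."""
    for key, new_value in txn.writes.items():
        log.append(("redo", txn.id, key, new_value))   # per-tuple redo record first
    log.flush()                                        # log must be durable...
    for key, new_value in txn.writes.items():
        table.write(key, new_value)                    # ...before the data is updated
    log.append(("commit", txn.id))
    log.flush()
    # Recovery must scan the log and replay (or undo) records.

def commit_write_behind(txn, log, nvm_table):
    """Write-behind idea: persist the data in NVM first, then log a small commit marker."""
    for key, new_value in txn.writes.items():
        nvm_table.nvm_write(key, new_value)            # byte-addressable durable writes
    nvm_table.persist_barrier()                        # ensure the writes reached NVM
    log.append(("commit", txn.id))                     # tiny record; no per-tuple redo data
    log.flush()
    # Recovery only needs the commit markers to decide which already-persistent
    # versions are visible -- no redo replay of data.
```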

RECENT PUBLICATIONS

Software-Defined Storage for Fast Trajectory Queries using a DeltaFS Indexed Massive Directory

Qing Zheng, George Amvrosiadis, Saurabh Kadekodi, Garth Gibson, Chuck Cranor, Brad Settlemyer, Gary Grider & Fan Guo

PDSW-DISCS 2017: 2nd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing System, held in conjunction with SC17, Denver, CO, Nov. 2017.

In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient indexing mechanism for reordering and indexing writes, and a log-structured storage layout to pack small data into large log objects, all while ensuring compute node resources are used frugally. We demonstrate the efficiency of this indexing mechanism through VPIC, a plasma simulation code that scales to trillions of particles. With Indexed Massive Directory, we modify VPIC to create a file for each particle to receive writes of that particle's simulation output data. Dynamically indexing the directory's underlying storage keyed on particle filename allows us to achieve a 5000x speedup for a single particle trajectory query, which requires reading all data for a single particle. This speedup increases with application scale, while the overhead remains stable at 3% of the available memory.

Bigger, Longer, Fewer: What Do Cluster Jobs Look Like Outside Google?

George Amvrosiadis, Jun Woo Park, Gregory R. Ganger, Garth A. Gibson, Elisabeth Baseman & Nathan DeBardeleben

Carnegie Mellon University Parallel Data Lab Technical Report CMU-PDL-17-104, October 2017.

In the last 5 years, a set of job scheduler logs released by Google has been used in more than 400 publications as the token cloud workload. While this is an invaluable trace, we think it is crucial that researchers evaluate their work under other workloads as well, to ensure the generality of their techniques. To aid them in this process, we analyze three new traces consisting of job scheduler logs from one private and two HPC clusters. We further release the two HPC traces, which we expect to be of interest to the community due to their unique characteristics. The new traces represent clusters 0.3-3 times the size of the Google cluster in terms of CPU cores, and cover a 3-60 times longer time span.

This paper presents an analysis of the differences and similarities between all aforementioned traces. We discuss a variety of aspects: job characteristics, workload heterogeneity, resource utilization, and failure rates. More importantly, we review assumptions from the literature that were originally derived from the Google trace, and verify whether they hold true when the new traces are considered. For those assumptions that are violated, we examine affected work from the literature. Finally, we demonstrate the importance of dataset plurality in job scheduling research by evaluating the performance of JVuPredict, the job runtime estimate module of the TetriSched scheduler, using all four traces.

[Photo: Daniel Beveridge, of VMware, Inc. discusses research with Daniel Berger at the 2017 PDL Spring Visit Day.]

Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes

Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach & Onur Mutlu

Proc. of the International Symposium on Microarchitecture (MICRO), Cambridge, MA, October 2017.

Contemporary discrete GPUs support rich memory management features such as virtual memory and demand paging. These features simplify GPU programming by providing a virtual address space abstraction similar to CPUs and eliminating manual memory management, but they introduce high performance overheads during (1) address translation and (2) page faults. A GPU relies on high degrees of thread-level parallelism (TLP) to hide memory latency. Address translation can undermine TLP, as a single miss in the translation lookaside buffer (TLB) invokes an expensive serialized page table walk that often stalls multiple threads. Demand paging can also undermine TLP, as multiple threads often stall while they wait for an expensive data transfer over the system I/O (e.g., PCIe) bus when the GPU demands a page.

In modern GPUs, we face a trade-off on how the page size used for memory management affects address translation and demand paging. The address translation overhead is lower when we employ a larger page size (e.g., 2MB large pages, compared with conventional 4KB base pages), which increases TLB coverage and thus reduces TLB misses. Conversely, the demand paging overhead is lower when we employ a smaller page size, which decreases the system I/O bus transfer latency. Support for multiple page sizes can help relax the page size trade-off so that address translation and demand paging optimizations work together synergistically. However, existing page coalescing (i.e., merging base pages into a large page) and splintering (i.e., splitting a large page into base pages) policies require costly base page migrations that undermine the benefits multiple page sizes provide. In this paper, we observe that GPGPU applications present an opportunity to support multiple page sizes without costly data migration, as the applications perform most of their memory allocation en masse (i.e., they allocate a large number of base pages at once). We show that this en masse allocation allows us to create intelligent memory allocation policies which ensure that base pages that are contiguous in virtual memory are allocated to contiguous physical memory pages. As a result, coalescing and splintering operations no longer need to migrate base pages.

We introduce Mosaic, a GPU memory manager that provides application-transparent support for multiple page sizes. Mosaic uses base pages to transfer data over the system I/O bus, and allocates physical memory in a way that (1) preserves base page contiguity and (2) ensures that a large page frame contains pages from only a single memory protection domain. We take advantage of this allocation strategy to design a novel in-place page size selection mechanism that avoids data migration. This mechanism allows the TLB to use large pages, reducing address translation overhead. During data transfer, this mechanism enables the GPU to transfer only the base pages that are needed by the application over the system I/O bus, keeping demand paging overhead low. Our evaluations show that Mosaic reduces address translation overheads while efficiently achieving the benefits of demand paging, compared to a contemporary GPU that uses only a 4KB page size. Relative to a state-of-the-art GPU memory manager, Mosaic improves the performance of homogeneous and heterogeneous multi-application workloads by 55.5% and 29.7% on average, respectively, coming within 6.8% and 15.4% of the performance of an ideal TLB where all TLB requests are hits.

[Figure: Heterogeneous workload performance of the GPU memory managers (GPU-MMU, Mosaic, Ideal TLB) for 2-5 concurrently-executing applications.]

Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology

Vivek Seshadri, Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali Boroumand, Jeremie Kim, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons & Todd C. Mowry

Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

Many important applications trigger bulk bitwise operations, i.e., bitwise operations on large bit vectors. In fact, recent works design techniques that exploit fast bulk bitwise operations to accelerate databases (bitmap indices, BitWeaving) and web search (BitFunnel). Unfortunately, in existing architectures, the throughput of bulk bitwise operations is limited by the memory bandwidth available to the processing unit (e.g., CPU, GPU, FPGA, processing-in-memory). To overcome this bottleneck, we propose Ambit, an Accelerator-in-Memory for bulk bitwise operations. Unlike prior works, Ambit exploits the analog operation of DRAM technology to perform bitwise operations completely inside DRAM, thereby exploiting the full internal DRAM bandwidth.

Ambit consists of two components. First, simultaneous activation of three DRAM rows that share the same set of sense amplifiers enables the system to perform bitwise AND and OR operations. Second, with modest changes to the sense amplifier, the system can use the inverters present inside the sense amplifier to perform bitwise NOT operations. With these two components, Ambit can perform any bulk bitwise operation efficiently inside DRAM. Ambit largely exploits existing DRAM structure, and hence incurs low cost on top of commodity DRAM designs (1% of DRAM chip area). Importantly, Ambit uses the modern DRAM interface without any changes, and therefore it can be directly plugged onto the memory bus. Our extensive circuit simulations show that Ambit works as expected even in the presence of significant process variation.

Averaged across seven bulk bitwise operations, Ambit improves performance by 32X and reduces energy consumption by 35X compared to state-of-the-art systems. When integrated with Hybrid Memory Cube (HMC), a 3D-stacked DRAM with a logic layer, Ambit improves performance of bulk bitwise operations by 9.7X compared to processing in the logic layer of the HMC. Ambit improves the performance of three real-world data-intensive applications, 1) database bitmap indices, 2) BitWeaving, a technique to accelerate database scans, and 3) bit-vector-based implementation of sets, by 3X-7X compared to a state-of-the-art baseline using SIMD optimizations. We describe four other applications that can benefit from Ambit, including a recent technique proposed to speed up web search. We believe that the large performance and energy improvements provided by Ambit can enable other applications to use bulk bitwise operations.

Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content

Samira Khan, Chris Wilkerson, Zhe Wang, Alaa R. Alameldeen, Donghyuk Lee & Onur Mutlu

Proceedings of the 50th International Symposium on Microarchitecture (MICRO), Boston, MA, USA, October 2017.

DRAM cells in close proximity can fail depending on the data content in neighboring cells. These failures are called data-dependent failures. Detecting and mitigating these failures online, while the system is running in the field, enables various optimizations that improve the reliability, latency, and energy efficiency of the system. For example, a system can improve performance and energy efficiency by using a lower refresh rate for most cells and mitigating the failing cells with higher refresh rates or error correcting codes. All these system optimizations depend on accurately detecting every possible data-dependent failure that could occur with any content in DRAM. Unfortunately, detecting all data-dependent failures requires knowledge of DRAM internals specific to each DRAM chip. As the internal DRAM architecture is not exposed to the system, detecting data-dependent failures at the system level is a major challenge.

In this paper, we decouple the detection and mitigation of data-dependent failures from physical DRAM organization such that it is possible to detect failures without knowledge of DRAM internals. To this end, we propose MEMCON, a memory content-based detection and mitigation mechanism for data-dependent failures in DRAM. MEMCON does not detect every possible data-dependent failure. Instead, it detects and mitigates failures that occur only with the current content in memory while the programs are running in the system. Such a mechanism needs to detect failures whenever there is a write access that changes the content of memory. As detecting failures with runtime testing has a high overhead, MEMCON selectively initiates a test on a write only when the time between two consecutive writes to that page (i.e., the write interval) is long enough to provide significant benefit by lowering the refresh rate during that interval. MEMCON builds upon a simple, practical mechanism that predicts long write intervals based on our observation that the write intervals in real workloads follow a Pareto distribution: the longer a page remains idle after a write, the longer it is expected to remain idle. Our evaluation shows that, compared to a system that uses an aggressive refresh rate, MEMCON reduces refresh operations by 65-74%, leading to a 10%/17%/40% (min) to 12%/22%/50% (max) performance improvement for a single-core system and a 10%/23%/52% (min) to 17%/29%/65% (max) performance improvement for a 4-core system using 8/16/32 Gb DRAM chips.
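A tiny sketch of the write-interval heuristic described above (purely illustrative; the constants, data structures, and helper names are hypothetical, not MEMCON's): because write intervals are roughly Pareto-distributed, a page that has already stayed idle for a while is predicted to stay idle longer, so testing it for its current content, and then relaxing its refresh rate, is worth the one-time cost.

```python
# Illustrative heuristic only; constants and structure are hypothetical.
IDLE_THRESHOLD = 10.0        # seconds a page must stay idle before betting on a long interval
TEST_COST = 0.5              # assumed one-time cost of testing a page

last_write_time = {}         # page id -> time of most recent write
tested_since_write = set()   # pages already tested for their current content

def on_write(page, now):
    """A write changes the page's content, so any earlier test result is stale."""
    last_write_time[page] = now
    tested_since_write.discard(page)

def maybe_test(page, now):
    """Called periodically: once a page has stayed idle past the threshold, the
    Pareto observation (the longer it has been idle, the longer it is expected
    to remain idle) predicts a long remaining interval, so the one-time test
    cost is worth the refresh-rate savings."""
    idle_so_far = now - last_write_time.get(page, now)
    if page not in tested_since_write and idle_so_far >= IDLE_THRESHOLD and idle_so_far > TEST_COST:
        tested_since_write.add(page)
        return "test page with current content, then lower its refresh rate"
    return "keep the conservative refresh rate"
```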
WorkloadCompactor: Reducing Datacenter Cost while Providing Tail Latency SLO Guarantees

Timothy Zhu, Michael A. Kozuch & Mor Harchol-Balter

ACM Symposium on Cloud Computing (SoCC '17), Santa Clara, October 2017.

Service providers want to reduce datacenter costs by consolidating workloads onto fewer servers. At the same time, customers have performance goals, such as meeting tail latency Service Level Objectives (SLOs). Consolidating workloads while meeting tail latency goals is challenging, especially since workloads in production environments are often bursty. To limit the congestion when consolidating workloads, customers and service providers often agree upon rate limits. Ideally, rate limits are chosen to maximize the number of workloads that can be co-located while meeting each workload's SLO. In reality, neither the service provider nor customer knows how to choose rate limits. Customers end up selecting rate limits on their own in some ad hoc fashion, and service providers are left to optimize given the chosen rate limits.

This paper describes WorkloadCompactor, a new system that uses workload traces to automatically choose rate limits simultaneously with selecting onto which server to place workloads. Our system meets customer tail latency SLOs while minimizing datacenter resource costs. Our experiments show that by optimizing the choice of rate limits, WorkloadCompactor reduces the number of required servers by 30-60% as compared to state-of-the-art approaches.

[Figure: Token bucket rate limiters control the rate and burstiness of a stream of requests. When a request arrives at the rate limiter, tokens are used (i.e., removed) from the token bucket to allow the request to proceed. If the bucket is empty, the request must queue and wait until there are enough tokens. Tokens are added to the bucket at a constant rate r up to a maximum capacity as specified by the bucket size b. Thus, the token bucket rate limiter limits the workload to a maximum instantaneous burst of size b and an average rate r.]
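The figure caption above fully specifies the token bucket mechanism, so a direct sketch is easy to give. The class below is a generic illustration of a token bucket with rate r and bucket size b; it is not WorkloadCompactor's code, and it simply reports how long a request would have to wait rather than actually queueing it.

```python
import time

class TokenBucket:
    """Generic token bucket rate limiter: average rate r tokens/sec,
    maximum instantaneous burst of b tokens."""

    def __init__(self, r: float, b: float):
        self.r = r                          # token refill rate
        self.b = b                          # bucket size (burst capacity)
        self.tokens = b                     # start with a full bucket
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.b, self.tokens + self.r * (now - self.last_refill))
        self.last_refill = now

    def admit(self, cost: float = 1.0) -> float:
        """Try to admit a request needing `cost` tokens. Returns 0.0 if it may
        proceed now, otherwise the time the request must wait in the queue
        until enough tokens accumulate."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return 0.0
        return (cost - self.tokens) / self.r

# Usage sketch: limit a workload to 100 requests/sec with bursts of up to 20.
# limiter = TokenBucket(r=100.0, b=20.0)
# wait = limiter.admit(); time.sleep(wait)
```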

Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives

Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo & Onur Mutlu

Proceedings of the IEEE, Volume 105, Issue 9, Sept. 2017.

NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: 1) effective process technology scaling; and 2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to 1) fewer electrons in the flash memory cell floating gate to represent the data; and 2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells.

In this article, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including 1) cell-to-cell interference mitigation; 2) optimal multi-level cell sensing; 3) error correction using state-of-the-art algorithms and methods; and 4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve into the future.

Utility-Based Hybrid Memory Management

Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang & Onur Mutlu

In Proc. of the IEEE Cluster Conference (CLUSTER), Honolulu, HI, September 2017.

While the memory footprints of cloud and HPC applications continue to increase, fundamental issues with DRAM scaling are likely to prevent traditional main memory systems, composed of monolithic DRAM, from greatly growing in capacity. Hybrid memory systems can mitigate the scaling limitations of monolithic DRAM by pairing together multiple memory technologies (e.g., different types of DRAM, or DRAM and non-volatile memory) at the same level of the memory hierarchy. The goal of a hybrid main memory is to combine the different advantages of the multiple memory types in a cost-effective manner while avoiding the disadvantages of each technology. Memory pages are placed in and migrated between the different memories within a hybrid memory system, based on the properties of each page. It is important to make intelligent page management (i.e., placement and migration) decisions, as they can significantly affect system performance.

In this paper, we propose utility-based hybrid memory management (UH-MEM), a new page management mechanism for various hybrid memories, that systematically estimates the utility (i.e., the system performance benefit) of migrating a page between different memory types, and uses this information to guide data placement. UH-MEM operates in two steps. First, it estimates how much a single application would benefit from migrating one of its pages to a different type of memory, by comprehensively considering access frequency, row buffer locality, and memory-level parallelism. Second, it translates the estimated benefit of a single application to an estimate of the overall system performance benefit from such a migration.

We evaluate the effectiveness of UH-MEM with various types of hybrid memories, and show that it significantly improves system performance on each of these hybrid memories. For a memory system with DRAM and non-volatile memory, UH-MEM improves performance by 14% on average (and up to 26%) compared to the best of three evaluated state-of-the-art mechanisms across a large number of data-intensive workloads.

[Figure: Conceptual example showing that the MLP of a page influences how much effect its migration to fast memory has on the application stall time. Panel (a): a request served alone; panel (b): overlapped requests.]
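As a loose illustration of the two-step utility estimate described above, one might combine a page's access frequency, row buffer locality, and memory-level parallelism into a per-application benefit estimate, then scale it by how much that application matters to overall system performance. The weights, fields, and formula below are invented for the example; this is not UH-MEM's actual model.

```python
from dataclasses import dataclass

# Invented constants for illustration; UH-MEM derives these from detailed models.
FAST_LATENCY_NS = 50.0
SLOW_LATENCY_NS = 300.0

@dataclass
class PageStats:
    access_count: int           # access frequency over an interval
    row_buffer_hit_rate: float  # fraction of accesses that hit the row buffer
    mlp: float                  # average number of overlapping outstanding requests

def application_benefit(page: PageStats) -> float:
    """Step 1: estimated stall-time reduction for the owning application
    if this page moves from slow to fast memory."""
    misses = page.access_count * (1.0 - page.row_buffer_hit_rate)
    per_access_saving = SLOW_LATENCY_NS - FAST_LATENCY_NS
    # High MLP hides latency: overlapped requests dilute the per-access saving.
    return misses * per_access_saving / max(page.mlp, 1.0)

def system_utility(page: PageStats, app_weight: float) -> float:
    """Step 2: translate one application's benefit into a system-level estimate,
    e.g., weighting by how much that application's progress limits the system."""
    return app_weight * application_benefit(page)

# The migration candidate with the highest system_utility would be moved first.
```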

Viyojit: Decoupling Battery and DRAM Capacities for Battery-Backed DRAM

Rajat Kateja, Anirudh Badam, Sriram Govindan, Bikash Sharma & Greg Ganger

ISCA '17, June 24-28, 2017, Toronto, ON, Canada.

Non-Volatile Memories (NVMs) can significantly improve the performance of data-intensive applications. A popular form of NVM is battery-backed DRAM, which is available and in use today with DRAM's latency and without the endurance problems of emerging NVM technologies. Modern servers can be provisioned with up to 4 TB of DRAM, and provisioning battery backup to write out such large memories is hard because of the large battery sizes and the added hardware and cooling costs. We present Viyojit, a system that exploits the skew in write working sets of applications to provision substantially smaller batteries while still ensuring durability for the entire DRAM capacity. Viyojit achieves this by bounding the number of dirty pages in DRAM based on the provisioned battery capacity and proactively writing out infrequently written pages to an SSD. Even for write-heavy workloads with less skew than we observe in analysis of real data center traces, Viyojit reduces the required battery capacity to 11% of the original size, with a performance overhead of 7-25%. Thus, Viyojit frees battery-backed DRAM from the stunted growth of battery capacities and enables servers with terabytes of battery-backed DRAM.

[Figure: Flow chart describing Viyojit's implementation for tracking dirty pages and enforcing the dirty budget.]
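A minimal sketch of the dirty-budget idea described above (hypothetical structure, not Viyojit's implementation): track dirty pages, and when the count would exceed the budget implied by the provisioned battery, proactively clean the least frequently written pages to the SSD so that the battery can always flush whatever remains dirty.

```python
# Hypothetical sketch of dirty-page budgeting; not Viyojit's implementation.
class DirtyBudgetTracker:
    def __init__(self, battery_flushable_pages, ssd):
        self.budget = battery_flushable_pages    # max dirty pages the battery can cover
        self.ssd = ssd
        self.dirty = {}                          # page id -> write count since last clean

    def on_write(self, page):
        if page not in self.dirty and len(self.dirty) >= self.budget:
            self._clean_coldest()                # make room before admitting a new dirty page
        self.dirty[page] = self.dirty.get(page, 0) + 1

    def _clean_coldest(self):
        """Proactively write back the least frequently written dirty page."""
        coldest = min(self.dirty, key=self.dirty.get)
        self.ssd.write_back(coldest)
        del self.dirty[coldest]

    def on_power_failure(self):
        """The battery only ever needs to cover at most `budget` dirty pages."""
        for page in list(self.dirty):
            self.ssd.write_back(page)
            del self.dirty[page]
```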

Scheduling for Efficiency and Fairness in Systems with Redundancy

Kristen Gardner, Mor Harchol-Balter, Esa Hyytiä & Rhonda Righter

Performance Evaluation, July 2017.

Server-side variability—the idea that the same job can take longer to run on one server than another due to server-dependent factors—is an increasingly important concern in many queueing systems. One strategy for overcoming server-side variability to achieve low response time is redundancy, under which jobs create copies of themselves and send these copies to multiple different servers, waiting for only one copy to complete service. Most of the existing theoretical work on redundancy has focused on developing bounds, approximations, and exact analysis to study the response time gains offered by redundancy. However, response time is not the only important metric in redundancy systems: in addition to providing low overall response time, the system should also be fair in the sense that no job class should have a worse mean response time in the system with redundancy than it did in the system before redundancy is allowed.

In this paper we use scheduling to address the simultaneous goals of (1) achieving low response time and (2) maintaining fairness across job classes. We develop new exact analysis for per-class response time under First-Come First-Served (FCFS) scheduling for a general type of system structure; our analysis shows that FCFS can be unfair in that it can hurt non-redundant jobs. We then introduce the Least Redundant First (LRF) scheduling policy, which we prove is optimal with respect to overall system response time, but which can be unfair in that it can hurt the jobs that become redundant. Finally, we introduce the Primaries First (PF) scheduling policy, which is provably fair and also achieves excellent overall mean response time.

A Better Model for Job Redundancy: Decoupling Server Slowdown and Job Size

Kristen Gardner, Mor Harchol-Balter, Alan Scheller-Wolf & Benny Van Houdt

Transactions on Networking, September 2017.

Recent computer systems research has proposed using redundant requests to reduce latency. The idea is to replicate a request so that it joins the queue at multiple servers. The request is considered complete as soon as any one of its copies completes. Redundancy allows us to overcome server-side variability – the fact that a server might be temporarily slow due to factors such as background load, network interrupts, and garbage collection – to reduce response time. In the past few years, queueing theorists have begun to study redundancy, first via approximations, and, more recently, via exact analysis. Unfortunately, for analytical tractability, most existing theoretical analysis has assumed an Independent Runtimes (IR) model, wherein the replicas of a job each experience independent runtimes (service times) at different servers. The IR model is unrealistic and has led to theoretical results which can be at odds with computer systems implementation results. This paper introduces a much more realistic model of redundancy. Our model decouples the inherent job size (X) from the server-side slowdown (S), where we track both S and X for each job. Analysis within the S&X model is, of course, much more difficult. Nevertheless, we design a dispatching policy, Redundant-to-Idle-Queue (RIQ), which is both analytically tractable within the S&X model and has provably excellent performance.
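The core mechanism both abstracts study (replicate a request to several servers, take the first copy that finishes, and cancel the rest) can be sketched in a few lines. This is a generic illustration of request-level redundancy, not the queueing models, scheduling policies, or RIQ dispatching rule analyzed in the papers; the simulated `fetch` function and server names are made up.

```python
import random, time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def fetch(server, request):
    """Simulated server call: a hypothetical stand-in with random server slowdown."""
    time.sleep(random.uniform(0.01, 0.2))       # server-side variability
    return f"{server} answered {request}"

def redundant_request(servers, request):
    """Send a copy of the request to every server; the request completes as soon
    as any one copy completes, and the remaining copies are cancelled."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        futures = [pool.submit(fetch, s, request) for s in servers]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()                           # best-effort cancellation of slower copies
        return next(iter(done)).result()

print(redundant_request(["server-a", "server-b", "server-c"], "GET /object/42"))
```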

Litz: An Elastic Framework for High-Performance Distributed Machine Learning

Aurick Qiao, Abutalib Aghayev, Weiren Yu, Haoyang Chen, Qirong Ho, Garth A. Gibson & Eric P. Xing

Carnegie Mellon University Parallel Data Laboratory Technical Report CMU-PDL-17-103, June 2017.

Machine Learning (ML) is becoming an increasingly popular application in the cloud and data-centers, inspiring a growing number of distributed frameworks optimized for it. These frameworks leverage the specific properties of ML algorithms to achieve orders of magnitude performance improvements over generic data processing frameworks like Hadoop or Spark. However, they also tend to be static, unable to elastically adapt to the changing resource availability that is characteristic of the multi-tenant environments in which they run. Furthermore, the programming models provided by these frameworks tend to be restrictive, narrowing their applicability even within the sphere of ML workloads.

Motivated by these trends, we present Litz, a distributed ML framework that achieves both elasticity and generality without giving up the performance of more specialized frameworks. Litz uses a programming model based on scheduling micro-tasks with parameter server access, which enables applications to implement key distributed ML techniques that have recently been introduced. Furthermore, we believe that the union of ML and elasticity presents new opportunities for job scheduling due to the dynamic resource usage of ML algorithms. We give examples of ML properties which give rise to such resource usage patterns and suggest ways to exploit them to improve resource utilization in multi-tenant environments. To evaluate Litz, we implement two popular ML applications that vary dramatically in terms of their structure and run-time behavior—they are typically implemented by different ML frameworks tuned for each. We show that Litz achieves competitive performance with the state of the art while providing low-overhead elasticity and exposing the underlying dynamic resource usage of ML applications.
These applica- search query search result scheduling micro-tasks with parameter w/ ads tions are latency-sensitive. They have server access which enables applica- Search Engine a soft-realtime nature - late results are click tions to implement key distributed ads potentially meaningless. On the one Publisher ads to show hand, given the compute-intensive ML techniques that have recently been (Bing Ads) Auction nature of the tasks performed by such introduced. Furthermore, we believe cache that the union of ML and elasticity hit applications, execution is typically Proposed Cache tens presents new opportunities for job Advertiser of offloaded to the cloud. On the other ads scheduling due to dynamic resource cache insert hand, offloading such applications no cache/ usage of ML algorithms. We give ex- cache miss/ Scoring to the cloud incurs network latency, ads refresh amples of ML properties which give keywords thousands of ads which can increase the user-perceived budgets rise to such resource usage patterns Candidate Selection latency. Consequently, edge-comput- thousands of and suggest ways to exploit them to million ads ing has been proposed to let devices improve resource utilization in multi- Ads partition partition offload intensive tasks to edge-servers tenant environments. To evaluate Litz, Pool instead of the cloud, to reduce latency. we implement two popular ML appli- In this paper, we propose a different cations that vary dramatically terms of Simplified workflow of how Bing advertising their structure and run-time behav- system serves ads to users. continued on page 12

Carpool: A Bufferless On-Chip Network Supporting Adaptive Multicast and Hotspot Alleviation

Xiyue Xiang, Wentao Shi, Saugata Ghose, Lu Peng, Onur Mutlu & Nian-Feng Tzeng

In Proc. of the International Conference on Supercomputing (ICS), Chicago, IL, June 2017.

Modern chip multiprocessors (CMPs) employ on-chip networks to enable communication between the individual cores. Operations such as coherence and synchronization generate a significant amount of the on-chip network traffic, and often create network requests that have one-to-many (i.e., a core multicasting a message to several cores) or many-to-one (i.e., several cores sending the same message to a common hotspot destination core) flows. As the number of cores in a CMP increases, one-to-many and many-to-one flows result in greater congestion on the network. To alleviate this congestion, prior work provides hardware support for efficient one-to-many and many-to-one flows in buffered on-chip networks. Unfortunately, this hardware support cannot be used in bufferless on-chip networks, which are shown to have lower hardware complexity and higher energy efficiency than buffered networks, and thus are likely a good fit for large-scale CMPs.

We propose Carpool, the first bufferless on-chip network optimized for one-to-many (i.e., multicast) and many-to-one (i.e., hotspot) traffic. Carpool is based on three key ideas: it (1) adaptively forks multicast flit replicas; (2) merges hotspot flits; and (3) employs a novel parallel port allocation mechanism within its routers, which reduces the router critical path latency by 5.7% over a bufferless network router without multicast support.

We evaluate Carpool using synthetic traffic workloads that emulate the range of rates at which multithreaded applications inject multicast and hotspot requests due to coherence and synchronization. Our evaluation shows that for an 8×8 mesh network, Carpool reduces the average packet latency by 43.1% and power consumption by 8.3% over a bufferless network without multicast or hotspot support. We also find that Carpool reduces the average packet latency by 26.4% and power consumption by 50.5% over a buffered network with multicast support, while consuming 63.5% less area for each router.

[Photo: Andy Pavlo regales the crowd with tales of his database research at the 2017 PDL Spring Visit Day.]

Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms

Kevin K. Chang, A. Giray Yaglikçi, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan & Onur Mutlu

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

The energy consumption of DRAM is a critical concern in modern computing systems. Improvements in manufacturing process technology have allowed DRAM vendors to lower the DRAM supply voltage conservatively, which reduces some of the DRAM energy consumption. We would like to reduce the DRAM supply voltage more aggressively, to further reduce energy. Aggressive supply voltage reduction requires a thorough understanding of the effect voltage scaling has on DRAM access latency and DRAM reliability.

In this paper, we take a comprehensive approach to understanding and exploiting the latency and reliability characteristics of modern DRAM when the supply voltage is lowered below the nominal voltage level specified by DRAM standards. Using an FPGA-based testing platform, we perform an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured recently by three major DRAM vendors. We find that reducing the supply voltage below a certain point introduces bit errors in the data, and we comprehensively characterize the behavior of these errors. We discover that these errors can be avoided by increasing the latency of three major DRAM operations (activation, restoration, and precharge). We perform detailed DRAM circuit simulations to validate and explain our experimental findings. We also characterize the various relationships between reduced supply voltage and error locations, stored data patterns, DRAM temperature, and data retention.

Based on our observations, we propose a new DRAM energy reduction mechanism, called Voltron. The key idea of Voltron is to use a performance model to determine by how much we can reduce the supply voltage without introducing errors and without exceeding a user-specified threshold for performance loss. Our evaluations show that Voltron reduces the average DRAM and system energy consumption by 10.5% and 7.3%, respectively, while limiting the average system performance loss to only 1.8%, for a variety of memory-intensive quad-core workloads. We also show that Voltron significantly outperforms prior dynamic voltage and frequency scaling mechanisms for DRAM.

Understanding Reduced-Voltage Operation in Modern DRAM Devices: Experimental Characterization, Analysis, and Mechanisms

Kevin K. Chang, A. Giray Yaglikçi, Saugata Ghose, Aditya Agrawal, Niladrish Chatterjee, Abhijith Kashyap, Donghyuk Lee, Mike O'Connor, Hasan Hassan & Onur Mutlu

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

The energy consumption of DRAM is a critical concern in modern computing systems. Improvements in manufacturing process technology have allowed DRAM vendors to lower the DRAM supply voltage conservatively, which reduces some of the DRAM energy consumption. We would like to reduce the DRAM supply voltage more aggressively, to further reduce energy. Aggressive supply voltage reduction requires a thorough understanding of the effect voltage scaling has on DRAM access latency and DRAM reliability. In this paper, we take a comprehensive approach to understanding and exploiting the latency and reliability characteristics of modern DRAM when the supply voltage is lowered below the nominal voltage level specified by DRAM standards. Using an FPGA-based testing platform, we perform an experimental study of 124 real DDR3L (low-voltage) DRAM chips manufactured recently by three major DRAM vendors. We find that reducing the supply voltage below a certain point introduces bit errors in the data, and we comprehensively characterize the behavior of these errors. We discover that these errors can be avoided by increasing the latency of three major DRAM operations (activation, restoration, and precharge). We perform detailed DRAM circuit simulations to validate and explain our experimental findings. We also characterize the various relationships between reduced supply voltage and error locations, stored data patterns, DRAM temperature, and data retention.

Based on our observations, we propose a new DRAM energy reduction mechanism, called Voltron. The key idea of Voltron is to use a performance model to determine by how much we can reduce the supply voltage without introducing errors and without exceeding a user-specified threshold for performance loss. Our evaluations show that Voltron reduces the average DRAM and system energy consumption by 10.5% and 7.3%, respectively, while limiting the average system performance loss to only 1.8%, for a variety of memory-intensive quad-core workloads. We also show that Voltron significantly outperforms prior dynamic voltage and frequency scaling mechanisms for DRAM.
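To make the Voltron idea concrete, the selection logic amounts to choosing the lowest supply voltage whose predicted slowdown stays under the user's limit. The sketch below illustrates only that logic; the voltages, latencies, and the toy predict_perf_loss model are hypothetical and are not taken from the paper.

# Illustrative sketch of a Voltron-style voltage selector (hypothetical values).
# For each candidate supply voltage we assume a characterized minimum
# error-free latency; a toy performance model predicts the slowdown that the
# added latency causes for the running workload.

def predict_perf_loss(extra_latency_ns, mem_intensity):
    # Toy linear model (assumption): loss grows with added latency and the
    # workload's memory intensity, expressed in percent.
    return extra_latency_ns * mem_intensity * 0.4

def pick_supply_voltage(candidates, nominal_latency_ns, mem_intensity,
                        max_loss_pct=2.0):
    """Return the lowest voltage whose predicted performance loss is acceptable."""
    for voltage, safe_latency_ns in sorted(candidates.items()):
        extra = safe_latency_ns - nominal_latency_ns
        if predict_perf_loss(extra, mem_intensity) <= max_loss_pct:
            return voltage            # lowest acceptable voltage
    return max(candidates)            # fall back to the nominal voltage

# Hypothetical characterization: voltage (V) -> minimum error-free latency (ns).
measured = {1.35: 13.125, 1.30: 15.0, 1.25: 16.875, 1.20: 18.75}
print(pick_supply_voltage(measured, nominal_latency_ns=13.125, mem_intensity=0.5))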

Automatic Database Management System Tuning Through Large-scale Machine Learning

Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon & Bohan Zhang

ACM SIGMOD International Conference on Management of Data, May 14-19, 2017, Chicago, IL, USA.

[Figure: Motivating examples – panels (a) Dependencies, (b) Continuous Settings, and (c) Non-Reusable Configurations show 99th-percentile latency (sec) for the YCSB workload running on MySQL (v5.6) under different configuration settings (buffer pool size and log file size, in MB); panel (d) Tuning Complexity shows the number of tunable knobs provided in MySQL and Postgres releases over time.]

Database management system (DBMS) configuration tuning is an essential aspect of any data-intensive application effort. But this is historically a difficult task because DBMSs have hundreds of configuration "knobs" that control everything in the system, such as the amount of memory to use for caches and how often data is written to storage. The problem with these knobs is that they are not standardized (i.e., two DBMSs use a different name for the same knob), not independent (i.e., changing one knob can impact others), and not universal (i.e., what works for one application may be suboptimal for another). Worse, information about the effects of the knobs typically comes only from (expensive) experience.

To overcome these challenges, we present an automated approach that leverages past experience and collects new information to tune DBMS configurations: we use a combination of supervised and unsupervised machine learning methods to (1) select the most impactful knobs, (2) map unseen database workloads to previous workloads from which we can transfer experience, and (3) recommend knob settings. We implemented our techniques in a new tool called OtterTune and tested it on three DBMSs. Our evaluation shows that OtterTune recommends configurations that are as good as or better than ones generated by existing tools or a human expert.
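The three steps above map naturally onto standard machine-learning building blocks. The following sketch illustrates one such pipeline on synthetic data (Lasso for knob ranking, nearest-neighbor matching on runtime metrics, and a Gaussian-process model for recommendation); it is an illustration of the general approach, not OtterTune's actual implementation.

# Minimal sketch of an OtterTune-style tuning pipeline on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
knobs = rng.uniform(0, 1, size=(200, 8))      # 200 past configurations, 8 knobs
latency = 3 * knobs[:, 0] - 2 * knobs[:, 3] + 0.1 * rng.standard_normal(200)
metrics = rng.uniform(0, 1, size=(5, 12))     # runtime metrics of 5 past workloads

# (1) Select the most impactful knobs: rank by Lasso coefficient magnitude.
lasso = Lasso(alpha=0.01).fit(knobs, latency)
top_knobs = np.argsort(-np.abs(lasso.coef_))[:2]

# (2) Map the unseen workload to the most similar past workload by Euclidean
#     distance over its observed runtime metrics.
new_metrics = rng.uniform(0, 1, size=12)
matched = np.argmin(np.linalg.norm(metrics - new_metrics, axis=1))

# (3) Recommend settings for the impactful knobs: fit a GP on the matched
#     workload's history (here, the single synthetic history) and pick the
#     best candidate from a random grid.
gp = GaussianProcessRegressor().fit(knobs[:, top_knobs], latency)
grid = rng.uniform(0, 1, size=(500, len(top_knobs)))
best = grid[np.argmin(gp.predict(grid))]
print("matched workload:", matched, "impactful knobs:", top_knobs,
      "recommended settings:", best)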

Efficient Redundancy Techniques for Latency Reduction in Cloud Systems

Gauri Joshi, Emina Soljanin & Gregory Wornell

ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS), Volume 2, Issue 2, May 2017.

In cloud computing systems, assigning a task to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers and reduce latency. But adding redundancy may result in higher cost of computing resources, as well as an increase in queueing delay due to higher traffic load. This work helps in understanding when and how redundancy gives a cost-efficient reduction in latency. For a general task service time distribution, we compare different redundancy strategies in terms of the number of redundant tasks and the time when they are issued and canceled. We get the insight that the log-concavity of the task service time creates a dichotomy of when adding redundancy helps. If the service time distribution is log-convex (i.e., log of the tail probability is convex), then adding maximum redundancy reduces both latency and cost. And if it is log-concave (i.e., log of the tail probability is concave), then less redundancy, and early cancellation of redundant tasks, is more effective. Using these insights, we design a general redundancy strategy that achieves a good latency-cost trade-off for an arbitrary service time distribution. This work also generalizes and extends some results in the analysis of fork-join queues.
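A compact reading of the dichotomy, with example distributions of our own choosing to illustrate the definitions used in the abstract:

  Tail of the task service time T:  F̄(t) = Pr[T > t]
  Log-convex service time:   log F̄(t) convex in t    (e.g., F̄(t) = e^{-√t})
      -> issue maximum redundancy; both latency and cost decrease.
  Log-concave service time:  log F̄(t) concave in t   (e.g., shifted exponential F̄(t) = e^{-μ(t-Δ)} for t ≥ Δ)
      -> use little redundancy, and cancel redundant copies early.
  The pure exponential (log F̄(t) = -μt, linear) sits exactly on the boundary, being both log-convex and log-concave.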

[Photo: Garth Gibson chats with Steve Heller of Two Sigma during a poster session at the 2017 PDL Visit Day.]

Design-Induced Latency Variation in Modern DRAM Chips: Characterization, Analysis, and Latency Reduction Mechanisms

Donghyuk Lee, Samira Khan, Lavanya Subramanian, Saugata Ghose, Rachata Ausavarungnirun, Gennady Pekhimenko, Vivek Seshadri & Onur Mutlu

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Vol. 1, No. 1, June 2017.

[Figure: Design-induced variation due to row organization – error probability varies from high to low along a 512-cell bitline relative to the local sense amplifiers, wordline driver, and row address logic, shown for (a) a conceptual bitline and (b) the open-bitline scheme.]

Variation has been shown to exist across the cells within a modern DRAM chip. Prior work has studied and exploited several forms of variation, such as manufacturing-process- or temperature-induced variation. We empirically demonstrate a new form of variation that exists within a real DRAM chip, induced by the design and placement of different components in the DRAM chip: different regions in DRAM, based on their relative distances from the peripheral structures, require different minimum access latencies for reliable operation. In particular, we show that in most real DRAM chips, cells closer to the peripheral structures can be accessed much faster than cells that are farther. We call this phenomenon design-induced variation in DRAM. Our goals are to i) understand design-induced variation that exists in real, state-of-the-art DRAM chips, ii) exploit it to develop low-cost mechanisms that can dynamically find and use the lowest latency at which to operate a DRAM chip reliably, and, thus, iii) improve overall system performance while ensuring reliable system operation.

To this end, we first experimentally demonstrate and analyze design-induced variation in modern DRAM devices by testing and characterizing 96 DIMMs (768 DRAM chips). Our characterization identifies DRAM regions that are vulnerable to errors, if operated at lower latency, and finds consistency in their locations across a given DRAM chip generation, due to design-induced variation. Based on our extensive experimental analysis, we develop two mechanisms that reliably reduce DRAM latency. First, DIVA Profiling uses runtime profiling to dynamically identify the lowest DRAM latency that does not introduce failures. DIVA Profiling exploits design-induced variation and periodically profiles only the vulnerable regions to determine the lowest DRAM latency at low cost. It is the first mechanism to dynamically determine the lowest latency that can be used to operate DRAM reliably. DIVA Profiling reduces the latency of read/write requests by 35.1%/57.8%, respectively, at 55°C. Our second mechanism, DIVA Shuffling, shuffles data such that values stored in vulnerable regions are mapped to multiple error-correcting code (ECC) codewords. As a result, DIVA Shuffling can correct 26% more multi-bit errors than conventional ECC. Combined together, our two mechanisms reduce read/write latency by 40.0%/60.5%, which translates to an overall system performance improvement of 14.7%/13.7%/13.8% (in 2-/4-/8-core systems) across a variety of workloads, while ensuring reliable operation.
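The profiling step can be pictured as a simple search: lower the latency one step at a time, testing only the known vulnerable regions, and stop just before any of them fails. The sketch below shows that control flow with hypothetical regions and latency values; it is not the paper's mechanism in detail.

# Rough sketch of a DIVA-Profiling-style search for the lowest safe latency.
# Only the design-induced "vulnerable regions" are tested, since errors appear
# there first; latencies are in DRAM clock cycles (hypothetical values).

def region_has_errors(region, latency_cycles):
    # Placeholder for writing test patterns to the region, reading them back
    # at the reduced latency, and comparing against the expected data.
    return latency_cycles < region["min_safe_latency"]

def lowest_safe_latency(vulnerable_regions, nominal=13, floor=7, guardband=1):
    """Walk latency down from nominal; stop before any vulnerable region fails."""
    latency = nominal
    while latency > floor:
        candidate = latency - 1
        if any(region_has_errors(r, candidate) for r in vulnerable_regions):
            break
        latency = candidate
    return latency + guardband      # keep a safety margin

regions = [{"row_range": (496, 512), "min_safe_latency": 9},
           {"row_range": (1008, 1024), "min_safe_latency": 10}]
print(lowest_safe_latency(regions))  # -> 11 with these made-up numbers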

Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last

Prashanth Menon, Todd C. Mowry & Andrew Pavlo

Proceedings of the VLDB Endowment, Vol. 11, No. 1, 2017.

In-memory database management systems (DBMSs) are a key component of modern on-line analytic processing (OLAP) applications, since they provide low-latency access to large volumes of data. Because disk accesses are no longer the principal bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance. Recent systems have revived an older technique of using just-in-time (JIT) compilation to execute queries as native code instead of interpreting a plan. The state of the art in query compilation is to fuse operators together in a query plan to minimize materialization overhead by passing tuples efficiently between operators. Our empirical analysis shows, however, that more tactful materialization yields better performance.

We present a query processing model called "relaxed operator fusion" that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized. This allows the DBMS to take advantage of inter-tuple parallelism inherent in the plan using a combination of prefetching and SIMD vectorization to support faster query execution on data sets that exceed the size of CPU-level caches. Our evaluation shows that our approach reduces the execution time of OLAP queries by up to 2.2X and achieves up to 1.8X better performance compared to other in-memory DBMSs.
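A staging point can be illustrated with a toy two-stage pipeline: rather than pushing one tuple at a time through fully fused operators, the first stage materializes a cache-sized batch that the second stage then processes as a vector (where SIMD and prefetching apply). This is a schematic illustration only, not the system's code.

# Toy illustration of relaxed operator fusion: a staging point materializes
# small batches between fused operator pipelines instead of passing single
# tuples end-to-end. A real engine would run the batched stage with SIMD and
# software prefetching; here the batch simply enables vectorized filtering.
import numpy as np

BATCH = 1024  # staging buffer size, chosen to fit in the CPU cache

def scan(table):                                  # pipeline 1: scan -> stage
    for start in range(0, len(table), BATCH):
        yield table[start:start + BATCH]          # materialize a cache-resident batch

def filter_then_aggregate(batches, threshold):    # pipeline 2: consume batches
    total = 0
    for batch in batches:
        total += batch[batch > threshold].sum()   # vectorized over the batch
    return total

table = np.arange(1_000_000, dtype=np.int64)
print(filter_then_aggregate(scan(table), threshold=999_000))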

EC-Cache: Load-Balanced, Low-Latency Cluster Caching with Online Erasure Coding

K. V. Rashmi, Mosharaf Chowdhury, Jack Kosaian, Ion Stoica & Kannan Ramchandran

12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), November 2-4, 2016, Savannah, GA.

[Figure: Roles in EC-Cache – (a) backend servers manage interactions between caches and persistent storage for client libraries; (b) EC-Cache clients perform encoding and decoding (splitting on writes, receiving on reads) during writes and reads.]

Data-intensive clusters and object stores are increasingly relying on in-memory object caching to meet the I/O performance demands. These systems routinely face the challenges of popularity skew, background load imbalance, and server failures, which result in severe load imbalance across servers and degraded I/O performance. Selective replication is a commonly used technique to tackle these challenges, where the number of cached replicas of an object is proportional to its popularity. In this paper, we explore an alternative approach using erasure coding.

EC-Cache is a load-balanced, low-latency cluster cache that uses online erasure coding to overcome the limitations of selective replication. EC-Cache employs erasure coding by: (i) splitting and erasure coding individual objects during writes, and (ii) late binding, wherein obtaining any k out of the (k + r) splits of an object is sufficient during reads. As compared to selective replication, EC-Cache improves load balancing by more than 3x and reduces the median and tail read latencies by more than 2x, while using the same amount of memory. EC-Cache does so using 10% additional bandwidth and a small increase in the amount of stored metadata. The benefits offered by EC-Cache are further amplified in the presence of background network load imbalance and server failures.
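The write and read paths described above can be sketched in a few lines: a write splits an object into k data splits plus r parity splits, and a read binds late, decoding as soon as any k splits arrive. The toy code below uses a trivial XOR parity with k = 2 and r = 1 purely to show the control flow; a real deployment would use a proper erasure code and parallel requests.

# Late-binding read sketch in the spirit of EC-Cache (toy XOR code, k=2, r=1).
import random

def encode(obj):
    half = len(obj) // 2
    a, b = obj[:half], obj[half:half * 2]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return {"a": a, "b": b, "p": parity}          # k + r = 3 splits

def decode(splits):
    if "a" in splits and "b" in splits:
        return splits["a"] + splits["b"]
    other = splits["a"] if "a" in splits else splits["b"]
    recovered = bytes(x ^ y for x, y in zip(other, splits["p"]))
    return (recovered + other) if "a" not in splits else (other + recovered)

def late_binding_read(servers, k=2):
    # Ask all k + r servers; keep whichever k answer first (random order here).
    arrived = {}
    for name in random.sample(list(servers), len(servers)):
        arrived[name] = servers[name]
        if len(arrived) == k:                      # bind late: any k suffice
            return decode(arrived)

obj = b"0123456789"
assert late_binding_read(encode(obj)) == obj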

Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth

KV Rashmi, Preetum Nakkiran, Jingyan Wang, Nihar B. Shah & Kannan Ramchandran

USENIX FAST, February 2015, Santa Clara, CA. Best Paper.

Erasure codes, such as Reed-Solomon (RS) codes, are increasingly being deployed as an alternative to data replication for fault tolerance in distributed storage systems. While RS codes provide significant savings in storage space, they can impose a huge burden on the I/O and network resources when reconstructing failed or otherwise unavailable data. A recent class of erasure codes, called minimum-storage-regeneration (MSR) codes, has emerged as a superior alternative to the popular RS codes, in that it minimizes network transfers during reconstruction while also being optimal with respect to storage and reliability. However, existing practical MSR codes do not address the increasingly important problem of I/O overhead incurred during reconstructions, and are, in general, inferior to RS codes in this regard. In this paper, we design erasure codes that are simultaneously optimal in terms of I/O, storage, and network bandwidth. Our design builds on top of a class of powerful practical codes, called the product-matrix-MSR codes. Evaluations show that our proposed design results in a significant reduction in the number of I/Os consumed during reconstructions (a 5× reduction for typical parameters), while retaining optimality with respect to storage, reliability, and network bandwidth.
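For context on why MSR codes cut network transfer during reconstruction, a standard result from the regenerating-codes literature (background, not a contribution of this paper): for an object of size M coded with parameters (n, k), repairing one lost block with an RS code requires transferring k full blocks, whereas an MSR code contacts d ≥ k helpers, each of which sends only a small function of its block.

  RS repair:   read k blocks              ->  transfer = k · (M/k) = M
  MSR repair:  each of d helpers sends    β = M / (k(d - k + 1))
               total transfer             γ = d · β = d · M / (k(d - k + 1))
  Example: (n, k) = (14, 10), d = n - 1 = 13  ->  γ = 13M/40 ≈ 0.33M, versus M for RS.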

THE PDL IS 25! continued from page 2

offs among them to maximize utility of the public and/or private cloud infrastructure.

»»We are exploring new approaches to system support for large-scale machine learning. We have been especially active in investigating how ML systems should adapt to cloud computing environments. Important questions include how such systems should address dynamic resource availability, unpredictable levels of time-varying resource interference, and geo-distributed data. And, in a fun twist, we are developing new approaches to automatic tuning for ML systems: ML to improve ML.

Who We Are

Over the past 25 years, many faculty, staff, and students have been active members of the PDL. The lab has grown to almost 90 current members (including staff, students, and faculty), and research funding and output have seen similar growth, while remaining focused on the PDL's and CMU's goals of delivering distinctive and top-quality education, fostering research, creativity, and discovery, and using the new knowledge created on campus to serve our larger society.

PDL's first PhD graduate was Dr. Mark Holland (1994), who wrote his dissertation on 'On-Line Data Reconstruction in Redundant Disk Arrays.' Since then, dozens of PDL students have graduated with PhDs, Masters degrees, and undergraduate degrees, and many have moved on to employment with PDL Consortium companies. We should also take a moment here to remember two PDL alumni who left us too soon: Howard Gobioff, PhD 1999, and Wittawat Tantisiriroj, PhD student.

Our PDL members have garnered many prestigious awards, including 24 best paper awards. Numerous fellowships have been awarded to our graduate students, many from our sponsor companies. There have also been many faculty research awards. One of our first student members, Hugo Patterson (PhD '97), now a successful entrepreneur, has endowed a fellowship to support up-and-coming PDL entrepreneurs. The award provides academic support to a student working within the umbrella of the PDL who has had experience in the workforce prior to entering a graduate program.

In fact, Hugo is one of four PDL alums who have been named "Distinguished Alumni." Also honored are Howard Gobioff, PhD '99, Erik Riedel, PhD '99, and Bianca Schroeder, PhD '07.

[Photo: Greg and some PDLers, from L to R: John Bucy, John Griffin, Andy Klosterman, Greg, and Jay Wylie.]

[Photo: Garth and Hugo Patterson at Hugo's graduation, 1997.]

The PDL Retreat

From the beginning, the PDL logo has included Skibo Castle, Andrew Carnegie's summer home near Dornoch, Scotland. In the past, it has represented "a fortress of storage" (like a redundant disk array). Later, it came to represent a "fortress of security" (à la self-securing devices). Perhaps, though, it is simply our vision of the ideal PDL Retreat venue.

The first official PDL Workshop and Retreat was held in October 1993 for the purpose of interacting with our industry sponsors, offering them a chance to get to know the PDL researchers, hear about their work, offer feedback, and give the students a chance to develop relationships with prospective employers. At the first retreat, PDL research on disk arrays, parity logging, and declustering was described by 20 CMU participants to 11 industry visitors from 6 sponsor companies.

One of the PDL students at the time recalls everyone wondering if they would have enough solid content to keep the industry attendees' attention throughout the retreat; of course, it was not a problem. Every year since then, the difficult problem has been what to leave out, as the PDL researchers generate more cool ideas than will fit into the available time.

The first Retreat was held at the Hidden Valley Resort, PA. A number of retreats were held at the Nemacolin Woodlands Resort in Farmington, PA. We now gather at the Omni Bedford Springs Resort in Bedford, PA, and these days retreats are usually attended by over 100 participants.

In 1999 we also began holding an annual Spring Industry Visit Day. This is a one-day event that evolved as a result of requests from industry for more frequent interaction with PDL researchers.

Since its inception, Carnegie Mellon's Parallel Data Lab (PDL) has established itself as academia's premier storage systems research center, consistently pushing the state of the art with new storage system architectures, technologies, and design methodologies. In the words of our director: "I'm always overwhelmed by the accomplishments of the PDL students and staff, and it's a pleasure to work with them. As always, their accomplishments point at great things to come." Let's see where the next 25 years take us!
