Structured Streaming: a Declarative API for Real-Time Applications in Apache Spark


Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark

Michael Armbrust†, Tathagata Das†, Joseph Torres†, Burak Yavuz†, Shixiong Zhu†, Reynold Xin†, Ali Ghodsi†, Ion Stoica†, Matei Zaharia†‡
†Databricks Inc., ‡Stanford University

Abstract

With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2× and Apache Kafka Streams by 90×. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.

ACM Reference Format:
M. Armbrust et al. 2018. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3183713.3190664

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SIGMOD'18, June 10–15, 2018, Houston, TX, USA. © 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery. ACM ISBN 978-1-4503-4703-7/18/06...$15.00. https://doi.org/10.1145/3183713.3190664

1 Introduction

Many high-volume data sources operate in real time, including sensors, logs from mobile applications, and the Internet of Things. As organizations have gotten better at capturing this data, they also want to process it in real time, whether to give human analysts the freshest possible data or drive automated decisions. Enabling broad access to streaming computation requires systems that are scalable, easy to use and easy to integrate into business applications.

While there has been tremendous progress in distributed stream processing systems in the past few years [2, 15, 17, 27, 32], these systems still remain fairly challenging to use in practice. In this paper, we begin by describing these challenges, based on our experience with Spark Streaming [37], one of the earliest stream processing systems to provide a high-level, functional API. We found that two challenges frequently came up with users. First, streaming systems often ask users to think in terms of complex physical execution concepts, such as at-least-once delivery, state storage and triggering modes, that are unique to streaming. Second, many systems focus only on streaming computation, but in real use cases, streaming is often part of a larger business application that also includes batch analytics, joins with static data, and interactive queries. Integrating streaming systems with these other workloads (e.g., maintaining transactionality) requires significant engineering effort.

Motivated by these challenges, we describe Structured Streaming, a new high-level API for stream processing that was developed in Apache Spark starting in 2016. Structured Streaming builds on many ideas in recent stream processing systems, such as separating processing time from event time and triggers in Google Dataflow [2], using a relational execution engine for performance [12], and offering a language-integrated API [17, 37], but aims to make them simpler to use and integrated with the rest of Apache Spark. Specifically, Structured Streaming differs from other widely used open source streaming APIs in two ways:

• Incremental query model: Structured Streaming automatically incrementalizes queries on static datasets expressed through Spark's SQL and DataFrame APIs [8], meaning that users typically only need to understand Spark's batch APIs to write a streaming query (a minimal sketch follows this list). Event time concepts are especially easy to express and understand in this model. Although incremental query execution and view maintenance are well studied [11, 24, 29, 38], we believe Structured Streaming is the first effort to adopt them in a widely used open source system. We found that this incremental API generally worked well for both novice and advanced users. For example, advanced users can use a set of stateful processing operators that give fine-grained control to implement custom logic while fitting into the incremental model.

• Support for end-to-end applications: Structured Streaming's API and built-in connectors make it easy to write code that is "correct by default" when interacting with external systems and can be integrated into larger applications using Spark and other software. Data sources and sinks follow a simple transactional model that enables "exactly-once" computation by default. The incrementalization-based API naturally makes it easy to run a streaming query as a batch job or develop hybrid applications that join streams with static data computed through Spark's batch APIs. In addition, users can manage multiple streaming queries dynamically and run interactive queries on consistent snapshots of stream output, making it possible to write applications that go beyond computing a fixed result to let users refine and drill into streaming data.
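To make the incremental query model concrete, here is a minimal sketch (not code from the paper) of the same DataFrame aggregation run first as a batch job and then as a streaming query; the input paths, the events schema, and the action column are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

object IncrementalQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("incremental-query-sketch")
      .getOrCreate()

    // Batch version: count events per action over a static dataset.
    // The path and the "action" column are hypothetical.
    val staticEvents = spark.read.json("/data/events")
    val batchCounts  = staticEvents.groupBy("action").count()
    batchCounts.show()

    // Streaming version: the same logical query. Because the source is
    // declared as a stream, the engine incrementalizes it automatically.
    val streamingEvents = spark.readStream
      .schema(staticEvents.schema)       // file streams need an explicit schema
      .json("/data/events-incoming")     // hypothetical directory watched for new files
    val streamingCounts = streamingEvents.groupBy("action").count()

    val query = streamingCounts.writeStream
      .outputMode("complete")            // emit the full updated counts on each trigger
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```

The query itself is identical in both cases; only the source and sink declarations change.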
Beyond these design decisions, we made several other design choices in Structured Streaming that simplify operation and increase performance. First, Structured Streaming reuses the Spark SQL execution engine [8], including its optimizer and runtime code generator. This leads to high throughput compared to other streaming systems (e.g., 2× the throughput of Apache Flink and 90× that of Apache Kafka Streams in the Yahoo! Streaming Benchmark [14]), as in Trill [12], and also lets Structured Streaming automatically leverage new SQL functionality added to Spark. The engine runs in a microbatch execution mode by default [37], but it can also use low-latency continuous operators for some queries because the API is agnostic to execution strategy [6].

Second, we found that operating a streaming application can be challenging, so we designed the engine to support failures, code updates and recomputation of already outputted data. For example, one common issue is that new data in a stream causes an application to crash, or worse, to output an incorrect result that users do not notice until much later (e.g., due to mis-parsing an input field). In Structured Streaming, each application maintains a write-ahead event log in human-readable JSON format that administrators can use to restart it from an arbitrary point. If the application crashes due to an error in a user-defined function, administrators can update the UDF and restart from where it left off, which happens automatically when the restarted application reads the log. If the application was outputting incorrect data instead, administrators can manually roll it back to a point before the problem started and recompute its results starting from there.

Our team has been running Structured Streaming applications for customers of Databricks' cloud service since 2016, as well as […]

[…] time discussing these challenges with users and designers of other streaming systems, including Spark Streaming, Truviso, Storm, Dataflow and Flink. This section details the challenges we saw.

2.1 Complex and Low-Level APIs

Streaming systems were invariably considered more difficult to use than batch ones due to complex API semantics. Some complexity is to be expected due to new concerns that arise only in streaming: for example, the user needs to think about what type of intermediate results the system should output before it has received all the data relevant to a particular entity, e.g., to a customer's browsing session on a website. However, other complexity arises due to the low-level nature of many streaming APIs: these APIs often ask users to specify applications at the level of physical operators with complex semantics instead of a more declarative level.

As a concrete example, the Google Dataflow model [2] has a powerful API with a rich set of options for handling event time aggregation, windowing and out-of-order data. However, in this model, users need to specify a windowing mode, triggering mode and trigger refinement mode (essentially, whether the operator outputs deltas or accumulated results) for each aggregation operator. Adding an operator that expects deltas after an aggregation that outputs accumulated results will lead to unexpected results. In essence, the raw API [10] asks the user to write a physical operator graph, not a logical query, so every user of the system needs to understand the intricacies of incremental processing.

Other APIs, such as Spark Streaming [37] and Flink's DataStream API [18], are also based on writing DAGs of physical operators and offer a complex array of options for managing state [20]. In addition, reasoning about applications becomes even more complex in systems that relax exactly-once semantics [32], effectively requiring the user to design and implement a consistency model.
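By way of contrast with the per-operator windowing, triggering and refinement modes described above, an event-time windowed aggregation in Structured Streaming is written like the equivalent batch query, with late data handled by one watermark declaration and physical concerns (sink, output mode, trigger, checkpoint location for the write-ahead log) specified separately. The following is a hedged sketch, not an example from the paper: the Kafka broker, topic and checkpoint path are placeholders, and the Kafka source assumes the spark-sql-kafka connector is on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.streaming.Trigger

object EventTimeWindowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("event-time-window-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical stream of click events: key = user, Kafka timestamp = event time.
    val clicks = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  // placeholder address
      .option("subscribe", "clicks")                      // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS user", "timestamp AS eventTime")

    // The logical query reads like a batch aggregation: a 10-minute tumbling
    // window over event time per user, with lateness bounded by one watermark.
    val counts = clicks
      .withWatermark("eventTime", "30 minutes")
      .groupBy(window($"eventTime", "10 minutes"), $"user")
      .count()

    // Physical concerns live outside the query: sink, output mode, trigger
    // (microbatch here), and the checkpoint directory used for recovery.
    val query = counts.writeStream
      .outputMode("update")
      .format("console")
      .trigger(Trigger.ProcessingTime("1 minute"))
      .option("checkpointLocation", "/tmp/checkpoints/clicks")  // placeholder path
      .start()
    query.awaitTermination()
  }
}
```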
Recommended publications
  • Is 'Distributed' Worth It? Benchmarking Apache Spark with Mesos
    Is 'Distributed' worth it? Benchmarking Apache Spark with Mesos. N. Satra (ns532), January 13, 2015. Abstract: A lot of research focus lately has been on building bigger distributed systems to handle 'Big Data' problems. This paper examines whether typical problems for web-scale companies really benefit from the parallelism offered by these systems, or can be handled by a single machine without synchronisation overheads. Repeated runs of a movie review sentiment analysis task in Apache Spark were carried out using Apache Mesos and Mesosphere, on Google Compute Engine clusters of various sizes. Only a marginal improvement in run-time was observed on a distributed system as opposed to a single node. 1 Introduction. Research in high performance computing and 'big data' has recently focussed on distributed computing. The reason is three-pronged: • The rapidly decreasing cost of commodity machines, compared to the slower decline in prices of high performance or supercomputers. • The availability of cloud environments like Amazon EC2 or Google Compute Engine that let users access large numbers of computers on-demand, without upfront investment. • The development of frameworks and tools that provide easy-to-use idioms for distributed computing and managing such large clusters of machines. Large corporations like Google or Yahoo are working on petabyte-scale problems which simply can't be handled by single computers [9]. However, smaller companies and research teams with much more manageable sizes of data have jumped on the bandwagon, using the tools built by the larger companies, without always analysing the performance tradeoffs. It has reached the stage where researchers are suggesting using the same MapReduce idiom hammer on all problems, whether they are nails or not [7].
  • Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks
    Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation: Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks. In Proceedings of the ACM Symposium on Cloud Computing (SOCC '14). ACM, New York, NY, USA, Article 6, 15 pages. As Published: http://dx.doi.org/10.1145/2670979.2670985. Publisher: Association for Computing Machinery (ACM). Version: Author's final manuscript. Citable link: http://hdl.handle.net/1721.1/101090. Terms of Use: Creative Commons Attribution-Noncommercial-Share Alike. Detailed Terms: http://creativecommons.org/licenses/by-nc-sa/4.0/. Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, Ion Stoica (University of California, Berkeley; MIT, Databricks). Abstract: Tachyon is a distributed file system enabling reliable data sharing at memory speed across cluster computing frameworks. While caching today improves read workloads, writes are either network or disk bound, as replication is used for fault-tolerance. Tachyon eliminates this bottleneck by pushing lineage, a well-known technique, into the storage layer. Even replicating the data in memory can lead to a significant drop in the write performance, as both the latency and throughput of the network are typically much worse than that of local memory. Slow writes can significantly hurt the performance of job pipelines, where one job consumes the output of another. These pipelines are regularly produced by workflow managers such as Oozie [4] and Luigi [7], e.g., to perform data […]
  • Discretized Streams: Fault-Tolerant Streaming Computation at Scale
    Discretized Streams: Fault-Tolerant Streaming Computation at Scale. Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica. University of California, Berkeley. Published in SOSP '13. Slide 1/30. Motivation: • Faults and stragglers are inevitable in large clusters running "big data" applications. • Streaming applications must recover from these quickly. • Current distributed streaming systems, including Storm, TimeStream and MapReduce Online, provide fault recovery in an expensive manner – it involves hot replication, which requires 2x hardware, or upstream backup, which has a long recovery time. Slide 2/30. Previous Methods: • Hot replication – two copies of each node, 2x hardware; a straggler will slow down both replicas. • Upstream backup – nodes buffer sent messages and replay them to a new node; stragglers are treated as failures, resulting in a long recovery step. • Conclusion: need for a system which overcomes these challenges. Slide 3/30. • Voilà! D-Streams. Slide 4/30. Computation Model: • Streaming computations are treated as a series of deterministic batch computations on small time intervals. • Data received in each interval is stored reliably across the cluster to form input datasets. • At the end of each interval the dataset is subjected to deterministic parallel operations and two things can happen – a new dataset representing program output, which is pushed out to stable storage, or intermediate state stored as resilient distributed datasets (RDDs). Slide 5/30. D-Stream processing model. Slide 6/30. What are D-Streams? • A sequence of immutable, partitioned datasets (RDDs) that can be acted on by deterministic transformations. • Transformations yield new D-Streams, and may create intermediate state in the form of RDDs. • Example: pageViews = readStream("http://...", "1s"); ones = pageViews.map(event => (event.url, 1)); counts = ones.runningReduce((a, b) => a + b). Slide 7/30. High-level overview of Spark Streaming system. Slide 8/30. Recovery: • D-Streams and RDDs track their lineage, that is, the graph of deterministic operations used to build them.
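The pageViews/ones/counts pseudocode in the slides maps onto Spark Streaming's DStream API roughly as follows. This is a hedged sketch under assumptions the slides do not state: the socket source, port and checkpoint path are placeholders, and updateStateByKey stands in for the slides' runningReduce.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PageViewCountsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-pageview-sketch")
    // One-second batch interval: each interval's input becomes an RDD.
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/dstream-checkpoint")  // required for stateful operators; placeholder path

    // Hypothetical text source emitting one URL per line.
    val pageViews = ssc.socketTextStream("localhost", 9999)
    val ones = pageViews.map(url => (url, 1))

    // Running count per URL across intervals; intermediate state is kept as RDDs.
    val counts = ones.updateStateByKey[Int] { (newOnes: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newOnes.sum)
    }

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```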
  • The Design and Implementation of Declarative Networks
    The Design and Implementation of Declarative Networks, by Boon Thau Loo. B.S. (University of California, Berkeley) 1999; M.S. (Stanford University) 2000. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Joseph M. Hellerstein, Chair; Professor Ion Stoica; Professor John Chuang. Fall 2006. Copyright © 2006 by Boon Thau Loo. Abstract: In this dissertation, we present the design and implementation of declarative networks. Declarative networking proposes the use of a declarative query language for specifying and implementing network protocols, and employs a dataflow framework at runtime for communication and maintenance of network state. The primary goal of declarative networking is to greatly simplify the process of specifying, implementing, deploying and evolving a network design. In addition, declarative networking serves as an important step towards an extensible, evolvable network architecture that can support flexible, secure and efficient deployment of new network protocols. Our main contributions are as follows. First, we formally define the Network Datalog (NDlog) language based on extensions to the Datalog recursive query language, and propose NDlog as a Domain Specific Language for programming network protocols.
  • Apache Spark: a Unified Engine for Big Data Processing
    Apache Spark: A Unified Engine for Big Data Processing. Contributed articles, DOI:10.1145/2934664. By Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications. […] for interactive SQL queries and Pregel11 for iterative graph algorithms. In the open source Apache Hadoop stack, systems like Storm1 and Impala9 are also specialized. Even in the relational database world, the trend has been to move away from "one-size-fits-all" systems.18 Unfortunately, most big data applications need to combine many different processing types. The very nature of "big data" is that it is diverse and messy; a typical pipeline will need MapReduce-like code for data loading, SQL-like queries, and iterative machine learning. Specialized engines can thus create both complexity and inefficiency; users must stitch together disparate systems, and some applications simply cannot be expressed efficiently in any engine. In 2009, our group at the University of California, Berkeley, started the Apache Spark project to design a unified engine for distributed data processing. Spark has a programming model similar to MapReduce but extends it with a data-sharing abstraction called "Resilient Distributed Datasets," or RDDs.25 Using this simple extension, Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing2,26,6 (see Figure 1).
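As a concrete illustration of the RDD abstraction the article describes, the sketch below expresses a word count as a chain of coarse-grained transformations over a hypothetical input file; the lineage of these transformations, rather than replication, is what lets a cached result be recomputed after a failure.

```scala
import org.apache.spark.sql.SparkSession

object RddWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdd-wordcount-sketch").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("/data/corpus.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                          // parallel, shuffled aggregation
      .cache()                                     // keep in memory for reuse across jobs

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```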
  • Aditya Akella
    Aditya Akella. [email protected]. Computer Science Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15232. http://www.cs.cmu.edu/∼aditya. Phone: 412-818-3779. Fax: 412-268-5576 (Attn: Aditya Akella). Education: PhD in Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 2005 (expected). Dissertation: "An Integrated Approach to Optimizing Internet Performance". Advisor: Prof. Srinivasan Seshan. Bachelor of Technology in Computer Science and Engineering, Indian Institute of Technology (IIT), Madras, India, May 2000. Honors and Awards: IBM PhD Fellowship, 2003-04 & 2004-05. Graduated third among all IIT Madras undergraduates, 2000. Institute Merit Prize, IIT Madras, 1996-97. 19th rank, All India Joint Entrance Examination for the IITs, 1996. Gold medals, Regional Mathematics Olympiad, India, 1993-94 & 1994-95. National Talent Search Examination (NTSE) Scholarship, India, 1993-94. Research Interests: Computer Systems and Internetworking. PhD Dissertation: An Integrated Approach to Optimizing Internet Performance. In my thesis research, I adopted a systematic approach to understand how to optimize the performance of well-connected Internet end-points, such as universities, large enterprises and data centers. I showed that constrained bottlenecks inside and between carrier networks in the Internet could limit the performance of such end-points. I observed that a clever end point-based strategy, called Multihoming Route Control, can help end-points route around these bottlenecks, and thereby improve performance. Furthermore, I showed that the Internet's topology, routing and trends in its growth may essentially worsen the wide-area congestion in the future. To this end, I proposed changes to the Internet's topology in order to guarantee good future performance.
  • Comparison of Spark Resource Managers and Distributed File Systems
    Comparison of Spark Resource Managers and Distributed File Systems. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom). Solmaz Salehian and Yonghong Yan, Department of Computer Science & Engineering, Oakland University, Rochester, MI 48309-4401, United States. Abstract— The rapid growth in volume, velocity, and variety of data produced by applications of scientific computing, commercial workloads and cloud has led to Big Data. Traditional solutions of data storage, management and processing cannot meet demands of this distributed data, so new execution models, data models and software systems have been developed to address the challenges of storing data in heterogeneous form, e.g. HDFS, NoSQL database, and for processing data in parallel and distributed fashion, e.g. MapReduce, Hadoop and Spark frameworks. This work comparatively studies the Apache Spark distributed data processing framework. Our study first discusses the resource management subsystems of Spark, and then reviews several of the distributed data storage options available to Spark. Keywords—Big Data, Distributed File System, Resource […] Section 4 provides information on the different resource management subsystems of Spark. In Section 5, four Distributed File Systems (DFSs) are reviewed. Then, the conclusion is included in Section 6. II. CHALLENGES IN BIG DATA. Although Big Data provides users with a lot of opportunities, creating an efficient software framework for Big Data has many challenges in terms of networking, storage, management, analytics, and ethics [5, 6]. Previous studies [6-11] have been carried out to review open challenges in Big Data.
  • Alluxio: A Virtual Distributed File System, by Haoyuan Li
    Alluxio: A Virtual Distributed File System, by Haoyuan Li. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Ion Stoica, Co-chair; Professor Scott Shenker, Co-chair; Professor John Chuang. Spring 2018. Copyright 2018 by Haoyuan Li. Abstract: The world is entering the data revolution era. Along with the latest advancements of the Internet, Artificial Intelligence (AI), mobile devices, autonomous driving, and Internet of Things (IoT), the amount of data we are generating, collecting, storing, managing, and analyzing is growing exponentially. To store and process these data has exposed tremendous challenges and opportunities. Over the past two decades, we have seen significant innovation in the data stack. For example, in the computation layer, the ecosystem started from the MapReduce framework, and grew to many different general and specialized systems such as Apache Spark for general data processing, Apache Storm and Apache Samza for stream processing, Apache Mahout for machine learning, Tensorflow and Caffe for deep learning, and Presto and Apache Drill for SQL workloads. There are more than a hundred popular frameworks for various workloads and the number is growing. Similarly, the storage layer of the ecosystem grew from the Apache Hadoop Distributed File System (HDFS) to a variety of choices as well, such as file systems, object stores, blob stores, key-value systems, and NoSQL databases to realize different tradeoffs in cost, speed and semantics.
  • Donut: a Robust Distributed Hash Table Based on Chord
    Donut: A Robust Distributed Hash Table based on Chord. Amit Levy, Jeff Prouty, Rylan Hawkins. CSE 490H Scalable Systems: Design, Implementation and Use of Large Scale Clusters, University of Washington, Seattle, WA. Donut is an implementation of an available, replicated distributed hash table built on top of Chord. The design was built to support storage of common file sizes in a closed cluster of systems. This paper discusses the two layers of the implementation and a discussion of future development. First, we discuss the basics of Chord as an overlay network and our implementation details; second, Donut as a hash table providing availability and replication; and thirdly how further development towards a distributed file system may proceed. Chord Introduction: Chord is an overlay network that maps logical addresses to physical nodes in a Peer-to-Peer system. Chord associates a range of keys with a particular node using a variant of consistent hashing. While traditional consistent hashing schemes require nodes to maintain state of the system as a whole, Chord only requires that nodes know of a fixed number of "finger" nodes, allowing massive scalability while maintaining efficient lookup time (O(log n)). Terms: Chord works by arranging keys in a ring, such that when we reach the highest value in the key-space, the following key is zero. Nodes take on a particular key in this ring. Successor: Node n is called the successor of a key k if and only if n is the closest node after or equal to k on the ring. Predecessor: Node n is called the predecessor of a key k if and only if n is the closest node before k in the ring.
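To make the successor relation above concrete, here is a minimal single-process sketch; the identifier size and node IDs are toy values, and a real Chord node resolves lookups through finger tables rather than a sorted vector of all nodes.

```scala
object ChordRingSketch {
  val m: Int = 16                 // identifier bits (toy value); ids live in [0, 2^m)
  val ringSize: Long = 1L << m

  // Successor of `key` among `nodeIds` (assumed sorted and non-empty): the first
  // node at or after the key, wrapping past the highest id back to the lowest.
  def successor(nodeIds: Vector[Long], key: Long): Long = {
    val k = key % ringSize
    nodeIds.find(_ >= k).getOrElse(nodeIds.head)
  }

  def main(args: Array[String]): Unit = {
    val nodes = Vector(10L, 2000L, 9000L, 40000L, 61000L)
    println(successor(nodes, 8500L))   // 9000 owns key 8500
    println(successor(nodes, 62000L))  // wraps around: node 10 owns key 62000
  }
}
```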
  • UC Berkeley Electronic Theses and Dissertations
    UC Berkeley UC Berkeley Electronic Theses and Dissertations Title Scalable Systems for Large Scale Dynamic Connected Data Processing Permalink https://escholarship.org/uc/item/9029r41v Author Padmanabha Iyer, Anand Publication Date 2019 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California Scalable Systems for Large Scale Dynamic Connected Data Processing by Anand Padmanabha Iyer A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Ion Stoica, Chair Professor Scott Shenker Professor Michael J. Franklin Professor Joshua Bloom Fall 2019 Scalable Systems for Large Scale Dynamic Connected Data Processing Copyright 2019 by Anand Padmanabha Iyer 1 Abstract Scalable Systems for Large Scale Dynamic Connected Data Processing by Anand Padmanabha Iyer Doctor of Philosophy in Computer Science University of California, Berkeley Professor Ion Stoica, Chair As the proliferation of sensors rapidly make the Internet-of-Things (IoT) a reality, the devices and sensors in this ecosystem—such as smartphones, video cameras, home automation systems, and autonomous vehicles—constantly map out the real-world producing unprecedented amounts of dynamic, connected data that captures complex and diverse relations. Unfortunately, existing big data processing and machine learning frameworks are ill-suited for analyzing such dynamic connected data and face several challenges when employed for this purpose. This dissertation focuses on the design and implementation of scalable systems for dynamic connected data processing. We discuss simple abstractions that make it easy to operate on such data, efficient data structures for state management, and computation models that reduce redundant work.
  • A Networking Abstraction for Distributed Data-Parallel Applications
    Coflow: A Networking Abstraction for Distributed Data-Parallel Applications, by N M Mosharaf Kabir Chowdhury. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Ion Stoica, Chair; Professor Sylvia Ratnasamy; Professor John Chuang. Fall 2015. Copyright © 2015 by N M Mosharaf Kabir Chowdhury. Abstract: Over the past decade, the confluence of an unprecedented growth in data volumes and the rapid rise of cloud computing has fundamentally transformed systems software and corresponding infrastructure. To deal with massive datasets, more and more applications today are scaling out to large datacenters. These distributed data-parallel applications run on tens to thousands of machines in parallel to exploit I/O parallelism, and they enable a wide variety of use cases, including interactive analysis, SQL queries, machine learning, and graph processing. Communication between the distributed computation tasks of these applications often results in massive data transfers over the network. Consequently, concentrated efforts in both industry and academia have gone into building high-capacity, low-latency datacenter networks at scale. At the same time, researchers and practitioners have proposed a wide variety of solutions to minimize flow completion times or to ensure per-flow fairness based on the point-to-point flow abstraction that forms the basis of the TCP/IP stack.
  • Heterogeneity and Load Balance in Distributed Hash Tables
    Heterogeneity and Load Balance in Distributed Hash Tables. P. Brighten Godfrey and Ion Stoica, Computer Science Division, University of California, Berkeley. {pbg,istoica}@cs.berkeley.edu. Abstract— Existing solutions to balance load in DHTs incur a high overhead either in terms of routing state or in terms of load movement generated by nodes arriving or departing the system. In this paper, we propose a set of general techniques and use them to develop a protocol based on Chord, called Y0, that achieves load balancing with minimal overhead under the typical assumption that the load is uniformly distributed in the identifier space. In particular, we prove that Y0 can achieve near-optimal load balancing, while moving little load to maintain the balance and increasing the size of the routing tables by at most a constant factor. Using extensive simulations based on real-world and synthetic capacity distributions, we show that Y0 reduces the load imbalance of Chord from O(log n) to a less than 3.6 without increasing […] imbalance to a constant factor. To handle heterogeneity, each node picks a number of virtual servers proportional to its capacity. Unfortunately, virtual servers incur a significant cost: a node with k virtual servers must maintain k sets of overlay links. Typically k = Θ(log n), which leads to an asymptotic increase in overhead. The second class of solutions uses just a single ID per node [26], [22], [20]. However, all such solutions must re-assign IDs to maintain the load balance as nodes arrive and depart the system [26]. This can result in a high overhead because it involves transferring objects and updating overlay links.