Introduction to Apache Flink

Technology Workshop "Big Data", FZI Karlsruhe, June 22, 2015
Christoph Boden | Deputy Project Coordinator, Berlin Big Data Center (BBDC) | TU Berlin DIMA | http://bbdc.berlin

Introducing the "Data Scientist" – "Jack of All Trades!"

The data scientist is expected to combine three skill sets that rarely coexist in a single person:

• Domain expertise, e.g. in Industry 4.0, medicine, physics, engineering, energy, or logistics.
• Data management: relational algebra/SQL, data warehousing/OLAP, NF²/XQuery, resource management, fault tolerance, memory management and memory hierarchies, parallelization and scalability, hardware adaptation, query optimization, indexing, data flow and control flow, real-time processing.
• Machine learning and statistics: mathematical programming, linear algebra, stochastic gradient descent, error estimation, active sampling, regression, Monte Carlo methods, sketches, hashing, convergence, iterative algorithms, and the curse of dimensionality.

Machine Learning + Data Management = Technology X

The goal is data analysis without system programming: a "Technology X" that lets users think about ML algorithms in a scalable way and process iterative algorithms in a scalable way. It combines ideas from both fields — declarative data analysis languages, automatic adaptation, compilation and query optimization, scalable and parallel processing, memory management, and isolation on the systems side; feature engineering and the representation of algorithms (SVMs, Gaussian processes, etc.) on the learning side.

The Berlin Big Data Center (http://bbdc.berlin) brings the required disciplines together:

• Stefan Edlich (Beuth Hochschule): software engineering
• Christof Schütte (ZIB Berlin): information-based medicine
• Alexander Reinefeld (ZIB Berlin): file systems, supercomputing
• Matthias Scheffler (Fritz-Haber-Institut, Max-Planck-Gesellschaft): material science
• At TU Berlin (Data Analytics Lab) and DFKI: Hans Uszkoreit (language technology), Volker Markl (data management), Klaus-Robert Müller (machine learning), Anja Feldmann (computer networks), Odej Kao (distributed systems), Thomas Wiegand (video mining)

X = Big Data Analytics – System Programming! ("What", not "How")

If the data analyst only describes "what" to compute — a declarative specification — and the machine decides "how" to execute it (the state of the art in scalable data analysis still means hand-written Hadoop or MPI programs), we gain a larger human base of data scientists, a reduction of "human" latencies, and a reduction of cost.
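To make "what, not how" concrete, here is a small illustrative sketch — not from the original slides; the sales data and field choices are invented — previewing Flink's DataSet API, which the rest of the talk introduces. The analyst declares a grouped aggregation; the engine chooses partitioning, algorithms, and parallelism.

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // "What": revenue per country. "How" (shipping strategy, sort- vs.
    // hash-based aggregation, degree of parallelism) is left to the system.
    DataSet<Tuple2<String, Double>> sales = env.fromElements(
        new Tuple2<>("DE", 10.0),
        new Tuple2<>("US", 20.0),
        new Tuple2<>("DE", 5.0));

    sales.groupBy(0)   // group by country (field 0)
         .sum(1)       // sum revenue (field 1)
         .print();     // sink; print() triggers execution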
Requirements for Technology X

Thinking about ML algorithms in a scalable way and processing iterative algorithms in a scalable way translates into concrete system requirements:

• declarative specification of programs, with automatic optimization, parallelization, and hardware adaptation
• fault tolerance for the algorithmic process
• iterative algorithms as first-class citizens
• dataflows and control flow with user-defined functions, iterations, and distributed state
• analysis of "data in motion" and consistent analysis of intermediate results
• multimodal data, numerical stability, software-defined networking, and scalable algorithms and debugging

Application examples: technology drivers and validation

• Health (society-based): multimodal data integration across video, images, and text.
• Marketplace for the information economy (economics-based): hierarchical text, data flows, and windowing.
• Material science (science-based): numerical simulation data and numerical stability.

Open-source data infrastructure

• Applications: Hive, Cascading, Mahout, Pig, …
• Data processing engines: MapReduce, Flink, Spark, Storm, Tez
• Application and resource management: YARN, Mesos
• Storage and streams: HDFS, HBase, Kafka, …

Engine paradigms and systems

• MapReduce (OSDI '04) → Apache Hadoop
• Dryad / Nephele (EuroSys '07) → Apache Tez
• PACTs (SOCC '10, VLDB '12) → Apache Flink
• RDDs (HotCloud '10, NSDI '12) → Apache Spark

Engine comparison

                  Hadoop MapReduce         Apache Spark                  Apache Flink
    API           MapReduce on k/v pairs   transformations on            iterative transformations
                                           k/v pair collections          on collections
    Paradigm      MapReduce                RDD                           cyclic dataflows
    Optimization  none                     optimization of SQL queries   optimization in all APIs
    Execution     batch with sorting       batch with memory pinning     streaming with out-of-core
                                                                         algorithms

APACHE FLINK

An open source platform for scalable batch and stream data processing.

• Libraries: Flink ML (machine learning), Gelly (graph processing), and the Table API (relational, for both batch and streaming)
• Core APIs: the DataSet API (batch processing) and the DataStream API (stream processing)
• Runtime: a distributed streaming dataflow
• Deploy: local (single JVM), cluster (standalone, YARN), cloud (GCE, EC2)
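The deployment modes surface directly in user code. A brief sketch — the host name, port, and jar path below are placeholders — showing that the same program runs in a single JVM or on a cluster depending only on how its execution environment is created:

    import org.apache.flink.api.java.ExecutionEnvironment;

    // Returns a local environment inside the IDE, or the cluster environment
    // when the program is submitted as a job — the program itself is unchanged.
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // Explicit alternatives (placeholder host, port, and jar path;
    // 6123 is Flink's default JobManager port):
    ExecutionEnvironment local  = ExecutionEnvironment.createLocalEnvironment();
    ExecutionEnvironment remote = ExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 6123, "/path/to/program.jar");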
Data sets and operators

A Flink program describes a dataflow of data sets and operators: DataSet A is transformed by operator X into DataSet B, which operator Y transforms into DataSet C. Execution is data-parallel: each operator runs as multiple parallel instances, each processing a partition of its input (A(1) → X → B(1) → Y → C(1); A(2) → X → B(2) → Y → C(2); …).

Rich operator and functionality set

Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter, FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex-Update, Accumulators. These compose into arbitrary dataflows, for example two sources feeding Map operators, whose outputs are reduced, joined, reduced again, and written to a sink, with an Iterate wrapped around part of the flow.

Base operators

• Single input: Map, Reduce
• Two inputs: Cross, Join, CoGroup

WordCount in Java

The program is the dataflow source (text) → flatMap → grouped reduce → sink (counts):

    import java.util.Arrays;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    DataSet<String> text = env.readTextFile(input);

    DataSet<Tuple2<String, Integer>> counts = text
        .map(l -> l.split("\\W+"))
        .flatMap((String[] tokens, Collector<Tuple2<String, Integer>> out) -> {
            Arrays.stream(tokens)
                  .filter(t -> t.length() > 0)
                  .forEach(t -> out.collect(new Tuple2<>(t, 1)));
        })
        .groupBy(0)
        .sum(1);

    counts.writeAsCsv(outputPath); // sink; outputPath is a placeholder, like input
    env.execute("Word Count Example");

WordCount in Scala

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment

    val input = env.readTextFile(textInput)

    val counts = input
      .flatMap { line => line.split("\\W+") }
      .filter { term => term.nonEmpty }
      .map { term => (term, 1) }
      .groupBy(0)
      .sum(1)

    counts.writeAsCsv(outputPath) // sink; outputPath is a placeholder, like textInput
    env.execute()

Long operator pipelines

Programs can be long pipelines of many operators; Flink plans and executes the whole pipeline as one job. The program below corresponds to the plan: large ⋈ medium, joined with small (⋈), followed by a grouped aggregation (γ):

    DataSet<Tuple...> large  = env.readCsv(...);
    DataSet<Tuple...> medium = env.readCsv(...);
    DataSet<Tuple...> small  = env.readCsv(...);

    DataSet<Tuple...> joined1 = large.join(medium)
        .where(3).equalTo(1)
        .with(new JoinFunction() { ... });

    DataSet<Tuple...> joined2 = small.join(joined1)
        .where(0).equalTo(2)
        .with(new JoinFunction() { ... });

    DataSet<Tuple...> result = joined2.groupBy(3)
        .max(2);

Beyond key/value pairs

Operators are not restricted to key/value pairs; they work directly on custom data types, and keys can be referenced by field name:

    // custom data types
    class Page {
        public String url;
        public String topic;
    }

    class Impression {
        public String url;
        public long count;
    }

    DataSet<Page> pages = ...;
    DataSet<Impression> impressions = ...;

    DataSet<Impression> aggregated = impressions
        .groupBy("url")
        .sum("count");

    pages.join(impressions).where("url").equalTo("url");

Flink's optimizer

• Inspired by the optimizers of parallel database systems: cost models and reasoning about interesting properties.
• Physical optimization follows a cost-based approach: it selects the data shipping strategy (forward, partition, broadcast) and the local execution strategy (sort-merge join vs. hash join), and keeps track of interesting properties such as sorting, grouping, and partitioning.
• Optimization of Flink programs is more difficult than in the relational case: there are no fully specified operator semantics due to UDFs, unknown UDFs complicate estimating intermediate result sizes, and no pre-defined schema is present.
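Two practical hooks into the optimizer, sketched here reusing the env, large, medium, and small data sets from the pipeline example above: the chosen physical plan can be dumped as JSON before a job is submitted, and join-size hints can steer the shipping strategy when UDFs hide cardinalities.

    // Dump the optimizer's chosen plan (shipping and local strategies) as JSON
    // instead of executing the program:
    System.out.println(env.getExecutionPlan());

    // Size hints for cases where the optimizer cannot estimate input sizes:
    large.joinWithTiny(small).where(0).equalTo(0);  // hint: second input is tiny
    small.joinWithHuge(large).where(0).equalTo(0);  // hint: second input is huge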
Optimization example

The following program (written in an early version of the Scala API) filters orders, joins them with items, and aggregates prices per (id, priority):

    case class Order(id: Int, priority: Int, ...)
    case class Item(id: Int, price: Double, ...)
    case class PricedOrder(id: Int, priority: Int, price: Double)

    val orders = DataSource(...)
    val items  = DataSource(...)

    val filtered = orders filter { ... }
    val prio = filtered join items where { _.id } isEqualTo { _.id } map
        { (o, li) => PricedOrder(o.id, o.priority, li.price) }
    val sales = prio groupBy { p => (p.id, p.priority) } aggregate ({ _.price }, SUM)

The optimizer enumerates alternative physical plans for this program. Both candidate plans filter Orders, partition on field 0, and join with Items on (0) = (0); they differ in where they sort. One plan sorts on (0) for the join and must still establish the sort order (0, 1) for the grouped aggregation; the other sorts on (0, 1) below the join, so that partitioning and sort order survive the join and the Grp/Agg operator requires no additional data movement.

Memory management

    public class WC {
        public String word;
        public int count;
    }

In a plain JVM program, such objects live on the unmanaged heap. Flink instead keeps user data in serialized form on pages of a self-managed pool inside the JVM heap, alongside the network buffers:

• Flink manages its own memory.
• User data is stored in serialized byte arrays.
• In-memory caching and data processing happen in a dedicated memory fraction.
• Flink never breaks the JVM heap.
• Disk spilling and network transfers are very efficient, since data is already serialized.

Built-in vs. driver-based iterations

• Loop outside the system, in the driver program (client): the iterative program looks like many independent jobs, step after step.
• Dataflows with feedback edges (e.g. a map feeding a join whose output loops back): the system is iteration-aware and can optimize the job.

The "Iterate" operator

• A built-in operator to support looping over data.
• Applies a step function to the partial solution until convergence, replacing the partial solution with the step result in each round.
• The step function can be an arbitrary Flink program (and may read other data sets).
• Convergence is reached after a fixed number of iterations or via a custom convergence criterion.

The "Delta Iterate" operator

• Maintains a workset and a (partial) solution set; the step function computes the next workset and a delta to the solution set.
• Iterates until the workset is empty; deltas are merged into the solution set.
• Generalizes the vertex-centric computing model of Pregel and GraphLab.

ReCap: What is Apache Flink?

Apache Flink is an open source platform for scalable batch and stream data processing.

• The core of Flink is a distributed streaming dataflow engine,
• executing dataflows in parallel on clusters and
• providing a reliable foundation for various workloads.
• The DataSet and DataStream programming abstractions are the foundation for user programs and higher layers.

http://flink.apache.org

Working on and with Apache Flink

• Flink homepage: https://flink.apache.org
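As a closing example of the Iterate operator, here is a minimal sketch along the lines of the Monte-Carlo pi-estimation example from the Flink documentation: the partial solution is a single counter, and the step function refines it for a fixed number of rounds.

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.operators.IterativeDataSet;

    public class PiEstimation {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            int iterations = 10000;

            // Partial solution: the number of sampled points inside the unit circle.
            IterativeDataSet<Integer> initial = env.fromElements(0).iterate(iterations);

            // Step function: sample one random point per round and update the count.
            DataSet<Integer> iteration = initial.map(new MapFunction<Integer, Integer>() {
                @Override
                public Integer map(Integer i) {
                    double x = Math.random(), y = Math.random();
                    return i + ((x * x + y * y < 1) ? 1 : 0);
                }
            });

            // Close the loop: feed the step result back as the next partial solution.
            DataSet<Integer> count = initial.closeWith(iteration);

            count.map(new MapFunction<Integer, Double>() {
                @Override
                public Double map(Integer c) {
                    return c / (double) iterations * 4; // estimate of pi
                }
            }).print(); // print() triggers execution
        }
    }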