The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing

Total Page:16

File Type:pdf, Size:1020Kb

The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, M. Tamer Özsu David R. Cheriton School of Computer Science University of Waterloo {s3sahu,amine.mhedhbi,semih.salihoglu,jimmylin,tamer.ozsu}@uwaterloo.ca ABSTRACT 52, 55], and distributed graph processing systems [17, 21, 27]. In Graph processing is becoming increasingly prevalent across many the academic literature, a large number of publications that study application domains. In spite of this prevalence, there is little re- numerous topics related to graph processing regularly appear across search about how graphs are actually used in practice. We conducted a wide spectrum of research venues. an online survey aimed at understanding: (i) the types of graphs Despite their prevalence, there is little research on how graph data users have; (ii) the graph computations users run; (iii) the types is actually used in practice and the major challenges facing users of graph software users use; and (iv) the major challenges users of graph data, both in industry and research. In April 2017, we face when processing their graphs. We describe the participants’ conducted an online survey across 89 users of 22 different software responses to our questions highlighting common patterns and chal- products, with the goal of answering 4 high-level questions: lenges. We further reviewed user feedback in the mailing lists, bug (i) What types of graph data do users have? reports, and feature requests in the source repositories of a large (ii) What computations do users run on their graphs? suite of software products for processing graphs. Through our re- (iii) Which software do users use to perform their computations? view, we were able to answer some new questions that were raised (iv) What are the major challenges users face when processing their by participants’ responses and identify specific challenges that users graph data? face when using different classes of graph software. The partici- Our major findings are as follows: pants’ responses and data we obtained revealed surprising facts Variety: about graph processing in practice. In particular, real-world graphs • Graphs in practice represent a very wide variety of enti- represent a very diverse range of entities and are often very large, ties, many of which are not naturally thought of as vertices and and scalability and visualization are undeniably the most pressing edges. Most surprisingly, traditional enterprise data comprised challenges faced by participants. We hope these findings can guide of products, orders, and transactions, which are typically seen as future research. the perfect fit for relational systems, appear to be a very common form of data represented in participants’ graphs. PVLDB Reference Format: • Ubiquity of Very Large Graphs: Many graphs in practice are Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. very large, often containing over a billion edges. These large Tamer Özsu. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. PVLDB, 11(4): 420 - 431, 2017. graphs represent a very wide range of entities and belong to DOI: https://doi.org/10.1145/3164135.3164139 organizations at all scales from very small enterprises to very large ones. This refutes the sometimes heard assumption that large graphs are a problem for only a few large organizations 1. INTRODUCTION such as Google, Facebook, and Twitter. Graph data representing connected entities and their relationships ap- • Challenge of Scalability: Scalability is unequivocally the most pear in many application domains, most naturally in social networks, pressing challenge faced by participants. The ability to process the web, the semantic web, road maps, communication networks, very large graphs efficiently seems to be the biggest limitation biology, and finance, just to name a few examples. There has been of existing software. a noticeable increase in the prevalence of work on graph process- • Visualization: Visualization is a very popular and central task ing both in research and in practice, evidenced by the surge in the in participants’ graph processing pipelines. After scalability, number of different commercial and research software for man- participants indicated visualization as their second most pressing aging and processing graphs. Examples include graph database challenge, tied with challenges in graph query languages. systems [3,8,14,35,48,53], RDF engines [38,64,67], linear algebra • Prevalence of RDBMSes: software [6,46], visualization software [13,16], query languages [28, Relational databases still play an important role in managing and processing graphs. Permission to make digital or hard copies of all or part of this work for Our survey also highlights other interesting facts, such as the preva- personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies lence of machine learning on graph data, e.g., for clustering vertices, bear this notice and the full citation on the first page. To copy otherwise, to predicting links, and finding influential vertices. republish, to post on servers or to redistribute to lists, requires prior specific We further reviewed user feedback in the mailing lists, bug re- permission and/or a fee. Articles from this volume were invited to present ports, and feature requests in the source code repositories of 22 their results at The 44th International Conference on Very Large Data Bases, software products between January and September of 2017 with August 2018, Rio de Janeiro, Brazil. two goals: (i) to answer several new questions that the participants’ Proceedings of the VLDB Endowment, Vol. 11, No. 4 Copyright 2017 VLDB Endowment 2150-8097/17/12... $ 10.00. responses raised; and (ii) to identify more specific challenges in DOI: https://doi.org/10.1145/3164135.3164139 different classes of graph technologies than the ones we could iden- 420 tify in participants’ responses. For some of the questions in our Table 1: Software products used for recruiting participants and the online survey, we also compared the graph data, computations, and number of active mailing list users from February to April 2017. software used by the participants with those studied in academic publications. For this, we reviewed 90 papers from 6 conferences Technology Software # Users across different academic venues. ArrangoDB [3] 40 In addition to discussing the insights we gained through our study, Caley [8] 14 Graph Database DGraph [14] 33 we discuss several directions about the future of graph processing. 233 We hope our study can inform research about real use cases and System JanusGraph [35] 32 important problems in graph processing. Neo4j [48] 69 OrientDB [53] 45 Apache Jena [38] 87 2. METHODOLOGY AND PARTICIPANTS RDF Engine Sparksee [64] 5 115 In this section, we first describe the format of our survey and then Virtuoso [67] 23 how we recruited the participants. Next we describe the demo- Apache Flink (Gelly) [17] 24 Distributed Graph graphic information of the participants, including the organizations Apache Giraph [21] 8 39 Processing Engine they come from and their roles in their organizations. Then we Apache Spark (GraphX) [27] 7 describe our methodology of reviewing academic publications. We Query Language Gremlin [28] 82 82 end this section by describing our methodology for reviewing the Graph for Scala [22] 4 user feedback in the mailing lists, bug reports, and feature requests GraphStream [24] 8 in the source code repositories of the software products. Graphtool [25] 28 Graph Library 97 NetworKit [50] 10 2.1 Survey Format NetworkX [51] 27 The survey was in the format of an online form. All of the questions SNAP [62] 20 were optional and participants could skip any number of questions. Cytoscape [13] 93 Graph Visualization 116 There were 2 types of questions: Elasticsearch 23 (i) Multiple Choice: There were 3 types of multiple choice ques- (X-Pack Graph) [16] tions: (a) yes-no questions; (b) questions that allowed only Graph Representation Conceptual Graphs [11] 6 6 a single choice as a response; and (c) questions that allowed multiple choices as a response. The participants could use an • Slack Channels: Two of the software products on our list, Neo4j Other option when their answers required further explanation and Cayley, had Slack channels for their users. We posted the or did not match any of the provided choices. We random- survey to these channels. ized the order of choices in questions about the computations • Twitter: A week after posting our survey to the mailing lists and participants run and the challenges they face. Slack channels and sending private emails, we posted a tweet (ii) Short Answer: For these questions, the participants entered with a link to our survey to 7 of the 22 software products that their responses in a text box. had an official Twitter account. Only Neo4j retweeted our tweet. There were 34 questions grouped into five categories: (i) demo- Participants that we recruited through different methods shared the graphic questions; (ii) graph datasets; (iii) graph and machine learn- same online link and we could not tell the number of participants ing computations; (iv) graph software; and (v) workload breakdown recruited from each method. In particular, we suspect that there and major challenges. were more users from graph database systems mainly because their mailing lists contained more active users, as can be seen in Ta- 2.2 Participant Recruitment ble 1. Moreover, 4 of the 5 mailing lists that allowed us to send We prepared a list of 22 popular software products for processing private emails and the Slack and Twitter channels belonged to graph graphs (see Table 1) that had public user mailing lists covering database systems. We note that after posting the survey on Twitter, 6 types of technologies: graph database systems, RDF engines, we received 12 responses. distributed graph processing systems (DGPSes), graph libraries In the end, there were 89 participants.
Recommended publications
  • Networkx Tutorial
    5.03.2020 tutorial NetworkX tutorial Source: https://github.com/networkx/notebooks (https://github.com/networkx/notebooks) Minor corrections: JS, 27.02.2019 Creating a graph Create an empty graph with no nodes and no edges. In [1]: import networkx as nx In [2]: G = nx.Graph() By definition, a Graph is a collection of nodes (vertices) along with identified pairs of nodes (called edges, links, etc). In NetworkX, nodes can be any hashable object e.g. a text string, an image, an XML object, another Graph, a customized node object, etc. (Note: Python's None object should not be used as a node as it determines whether optional function arguments have been assigned in many functions.) Nodes The graph G can be grown in several ways. NetworkX includes many graph generator functions and facilities to read and write graphs in many formats. To get started though we'll look at simple manipulations. You can add one node at a time, In [3]: G.add_node(1) add a list of nodes, In [4]: G.add_nodes_from([2, 3]) or add any nbunch of nodes. An nbunch is any iterable container of nodes that is not itself a node in the graph. (e.g. a list, set, graph, file, etc..) In [5]: H = nx.path_graph(10) file:///home/szwabin/Dropbox/Praca/Zajecia/Diffusion/Lectures/1_intro/networkx_tutorial/tutorial.html 1/18 5.03.2020 tutorial In [6]: G.add_nodes_from(H) Note that G now contains the nodes of H as nodes of G. In contrast, you could use the graph H as a node in G.
    [Show full text]
  • Apache Flink™: Stream and Batch Processing in a Single Engine
    Apache Flink™: Stream and Batch Processing in a Single Engine Paris Carboney Stephan Ewenz Seif Haridiy Asterios Katsifodimos* Volker Markl* Kostas Tzoumasz yKTH & SICS Sweden zdata Artisans *TU Berlin & DFKI parisc,[email protected][email protected][email protected] Abstract Apache Flink1 is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continu- ous data pipelines, historic data processing (batch), and iterative algorithms (machine learning, graph analysis) can be expressed and executed as pipelined fault-tolerant dataflows. In this paper, we present Flink’s architecture and expand on how a (seemingly diverse) set of use cases can be unified under a single execution model. 1 Introduction Data-stream processing (e.g., as exemplified by complex event processing systems) and static (batch) data pro- cessing (e.g., as exemplified by MPP databases and Hadoop) were traditionally considered as two very different types of applications. They were programmed using different programming models and APIs, and were exe- cuted by different systems (e.g., dedicated streaming systems such as Apache Storm, IBM Infosphere Streams, Microsoft StreamInsight, or Streambase versus relational databases or execution engines for Hadoop, including Apache Spark and Apache Drill). Traditionally, batch data analysis made up for the lion’s share of the use cases, data sizes, and market, while streaming data analysis mostly served specialized applications. It is becoming more and more apparent, however, that a huge number of today’s large-scale data processing use cases handle data that is, in reality, produced continuously over time.
    [Show full text]
  • Networkx: Network Analysis with Python
    NetworkX: Network Analysis with Python Salvatore Scellato Full tutorial presented at the XXX SunBelt Conference “NetworkX introduction: Hacking social networks using the Python programming language” by Aric Hagberg & Drew Conway Outline 1. Introduction to NetworkX 2. Getting started with Python and NetworkX 3. Basic network analysis 4. Writing your own code 5. You are ready for your project! 1. Introduction to NetworkX. Introduction to NetworkX - network analysis Vast amounts of network data are being generated and collected • Sociology: web pages, mobile phones, social networks • Technology: Internet routers, vehicular flows, power grids How can we analyze this networks? Introduction to NetworkX - Python awesomeness Introduction to NetworkX “Python package for the creation, manipulation and study of the structure, dynamics and functions of complex networks.” • Data structures for representing many types of networks, or graphs • Nodes can be any (hashable) Python object, edges can contain arbitrary data • Flexibility ideal for representing networks found in many different fields • Easy to install on multiple platforms • Online up-to-date documentation • First public release in April 2005 Introduction to NetworkX - design requirements • Tool to study the structure and dynamics of social, biological, and infrastructure networks • Ease-of-use and rapid development in a collaborative, multidisciplinary environment • Easy to learn, easy to teach • Open-source tool base that can easily grow in a multidisciplinary environment with non-expert users
    [Show full text]
  • Graph Database Fundamental Services
    Bachelor Project Czech Technical University in Prague Faculty of Electrical Engineering F3 Department of Cybernetics Graph Database Fundamental Services Tomáš Roun Supervisor: RNDr. Marko Genyk-Berezovskyj Field of study: Open Informatics Subfield: Computer and Informatic Science May 2018 ii Acknowledgements Declaration I would like to thank my advisor RNDr. I declare that the presented work was de- Marko Genyk-Berezovskyj for his guid- veloped independently and that I have ance and advice. I would also like to thank listed all sources of information used Sergej Kurbanov and Herbert Ullrich for within it in accordance with the methodi- their help and contributions to the project. cal instructions for observing the ethical Special thanks go to my family for their principles in the preparation of university never-ending support. theses. Prague, date ............................ ........................................... signature iii Abstract Abstrakt The goal of this thesis is to provide an Cílem této práce je vyvinout webovou easy-to-use web service offering a database službu nabízející databázi neorientova- of undirected graphs that can be searched ných grafů, kterou bude možno efektivně based on the graph properties. In addi- prohledávat na základě vlastností grafů. tion, it should also allow to compute prop- Tato služba zároveň umožní vypočítávat erties of user-supplied graphs with the grafové vlastnosti pro grafy zadané uži- help graph libraries and generate graph vatelem s pomocí grafových knihoven a images. Last but not least, we implement zobrazovat obrázky grafů. V neposlední a system that allows bulk adding of new řadě je také cílem navrhnout systém na graphs to the database and computing hromadné přidávání grafů do databáze a their properties.
    [Show full text]
  • Gephi Tools for Network Analysis and Visualization
    Frontiers of Network Science Fall 2018 Class 8: Introduction to Gephi Tools for network analysis and visualization Boleslaw Szymanski CLASS PLAN Main Topics • Overview of tools for network analysis and visualization • Installing and using Gephi • Gephi hands-on labs Frontiers of Network Science: Introduction to Gephi 2018 2 TOOLS OVERVIEW (LISTED ALPHABETICALLY) Tools for network analysis and visualization • Computing model and interface – Desktop GUI applications – API/code libraries, Web services – Web GUI front-ends (cloud, distributed, HPC) • Extensibility model – Only by the original developers – By other users/developers (add-ins, modules, additional packages, etc.) • Source availability model – Open-source – Closed-source • Business model – Free of charge – Commercial Frontiers of Network Science: Introduction to Gephi 2018 3 TOOLS CINET CyberInfrastructure for NETwork science • Accessed via a Web-based portal (http://cinet.vbi.vt.edu/granite/granite.html) • Supported by grants, no charge for end users • Aims to provide researchers, analysts, and educators interested in Network Science with an easy-to-use cyber-environment that is accessible from their desktop and integrates into their daily work • Users can contribute new networks, data, algorithms, hardware, and research results • Primarily for research, teaching, and collaboration • No programming experience is required Frontiers of Network Science: Introduction to Gephi 2018 4 TOOLS Cytoscape Network Data Integration, Analysis, and Visualization • A standalone GUI application
    [Show full text]
  • Evaluation of SPARQL Queries on Apache Flink
    applied sciences Article SPARQL2Flink: Evaluation of SPARQL Queries on Apache Flink Oscar Ceballos 1 , Carlos Alberto Ramírez Restrepo 2 , María Constanza Pabón 2 , Andres M. Castillo 1,* and Oscar Corcho 3 1 Escuela de Ingeniería de Sistemas y Computación, Universidad del Valle, Ciudad Universitaria Meléndez Calle 13 No. 100-00, Cali 760032, Colombia; [email protected] 2 Departamento de Electrónica y Ciencias de la Computación, Pontificia Universidad Javeriana Cali, Calle 18 No. 118-250, Cali 760031, Colombia; [email protected] (C.A.R.R.); [email protected] (M.C.P.) 3 Ontology Engineering Group, Universidad Politécnica de Madrid, Campus de Montegancedo, Boadilla del Monte, 28660 Madrid, Spain; ocorcho@fi.upm.es * Correspondence: [email protected] Abstract: Existing SPARQL query engines and triple stores are continuously improved to handle more massive datasets. Several approaches have been developed in this context proposing the storage and querying of RDF data in a distributed fashion, mainly using the MapReduce Programming Model and Hadoop-based ecosystems. New trends in Big Data technologies have also emerged (e.g., Apache Spark, Apache Flink); they use distributed in-memory processing and promise to deliver higher data processing performance. In this paper, we present a formal interpretation of some PACT transformations implemented in the Apache Flink DataSet API. We use this formalization to provide a mapping to translate a SPARQL query to a Flink program. The mapping was implemented in a prototype used to determine the correctness and performance of the solution. The source code of the Citation: Ceballos, O.; Ramírez project is available in Github under the MIT license.
    [Show full text]
  • Gephi-Poster-Sunbelt-July10.Pdf
    The Open Graph Viz Platform Gephi is a new open-source network visualization platform. It aims to create a sustainable software and technical Download Gephi at ecosystem, driven by a large international open-source community, who shares common interests in networks and http://gephi.org complex systems. The rendering engine can handle networks larger than 100K elements and guarantees responsiveness. Designed to make data navigation and manipulation easy, it aims to fulll the complete chain from data importing to aesthetics renements and interaction. Particular focus is made on the software usability and interoperability with other tools. A lot of eorts are made to facilitate the community growth, by providing tutorials, plug-ins development documentation, support and student projects. Current developments include Dynamic Network Analysis (DNA) and Goals spigots (Emails, Twitter, Facebook …) import. Create the Photoshop of network visualization, by combining a rich set of Gephi aims at being built-in features and a sustainable, open to friendly user interface. many kind of users, and creating a large, Design a modular and international and extensible software diverse open-source architecture, facilitate plug-ins development, community reuse and mashup. Build a large, international Features Highlight and diverse open-source community. Community Join the network, participate designing the roadmap, get help to quickly code plug-ins or simply share your ideas. Select, move, paint, resize, connect, group Just with the mouse Export in SVG/PDF to include your infographics Metrics Architecture * Betweenness, Eigenvector, Closeness The modular architecture allows developers adding and extending features * Eccentricity with ease by developing plug-ins. Gephi can also be used as Java library in * Diameter, Average Shortest Path other applications and build for instance, a layout server.
    [Show full text]
  • Plantuml Language Reference Guide (Version 1.2021.2)
    Drawing UML with PlantUML PlantUML Language Reference Guide (Version 1.2021.2) PlantUML is a component that allows to quickly write : • Sequence diagram • Usecase diagram • Class diagram • Object diagram • Activity diagram • Component diagram • Deployment diagram • State diagram • Timing diagram The following non-UML diagrams are also supported: • JSON Data • YAML Data • Network diagram (nwdiag) • Wireframe graphical interface • Archimate diagram • Specification and Description Language (SDL) • Ditaa diagram • Gantt diagram • MindMap diagram • Work Breakdown Structure diagram • Mathematic with AsciiMath or JLaTeXMath notation • Entity Relationship diagram Diagrams are defined using a simple and intuitive language. 1 SEQUENCE DIAGRAM 1 Sequence Diagram 1.1 Basic examples The sequence -> is used to draw a message between two participants. Participants do not have to be explicitly declared. To have a dotted arrow, you use --> It is also possible to use <- and <--. That does not change the drawing, but may improve readability. Note that this is only true for sequence diagrams, rules are different for the other diagrams. @startuml Alice -> Bob: Authentication Request Bob --> Alice: Authentication Response Alice -> Bob: Another authentication Request Alice <-- Bob: Another authentication Response @enduml 1.2 Declaring participant If the keyword participant is used to declare a participant, more control on that participant is possible. The order of declaration will be the (default) order of display. Using these other keywords to declare participants
    [Show full text]
  • Travels of Baron Munchausen
    Songs for Scribus Travels of Baron Munchausen Songs for Scribus Travels of Baron Munchausen Travels of Baron Munchausen How should I disengage myself? I was not much pleased with my awkward situation – with a wolf face to face; our ogling was not of the most pleasant kind. If I withdrew my arm, then the animal would fly the more furiously upon me; that I saw in his flaming eyes. In short, I laid hold of his tail, turned him inside out like a glove, and flung him to the ground, where I left him. On January 3 2011, the first working day of the year, OSP gathered around Scribus. We wanted to explore framerendering, bootstrapping Scribus and to play around with the incredible adventures of Baron von Munchausen. The idea was to produce some kind of experimental result, a follow- up of to our earlier attempts to turn a frog into a prince. One day we asked Scribus team-members what their favourite Scribus feature was. After some hesitation they pointed us to the magical framerender. Framerenderer The Framerender is an image frame with a wrapper, a GUI, and a configuration scheme. External programs are invoked from inside of Scribus, and their output is placed into the frame! By default, the current Scribus is configured to host the following friends: LaTeX, Lilypond, gnuplot, dot/GraphViz and POV-Ray Today we are working with Lilypond. We are also creating cus- tom tools that generate PostScript/PDF via Imagemagick and other command line tools. Lilypond LilyPond is a music engraving program, devoted to producing the highest-quality sheet music possible.
    [Show full text]
  • Performing Knowledge Requirements Analysis for Public Organisations in a Virtual Learning Environment: a Social Network Analysis
    chnology Te & n S o o ti ft a w a m r r e Fontenele et al., J Inform Tech Softw Eng 2014, 4:2 o f E Journal of n n I g f i o n DOI: 10.4172/2165-7866.1000134 l e a e n r r i n u g o J ISSN: 2165-7866 Information Technology & Software Engineering Research Article Open Access Performing Knowledge Requirements Analysis for Public Organisations in a Virtual Learning Environment: A Social Network Analysis Approach Fontenele MP1*, Sampaio RB2, da Silva AIB2, Fernandes JHC2 and Sun L1 1University of Reading, School of Systems Engineering, Whiteknights, Reading, RG6 6AH, United Kingdom 2Universidade de Brasília, Campus Universitário Darcy Ribeiro, Brasília, DF, 70910-900, Brasil Abstract This paper describes an application of Social Network Analysis methods for identification of knowledge demands in public organisations. Affiliation networks established in a postgraduate programme were analysed. The course was ex- ecuted in a distance education mode and its students worked on public agencies. Relations established among course participants were mediated through a virtual learning environment using Moodle. Data available in Moodle may be ex- tracted using knowledge discovery in databases techniques. Potential degrees of closeness existing among different organisations and among researched subjects were assessed. This suggests how organisations could cooperate for knowledge management and also how to identify their common interests. The study points out that closeness among organisations and research topics may be assessed through affiliation networks. This opens up opportunities for ap- plying knowledge management between organisations and creating communities of practice.
    [Show full text]
  • Portable Stateful Big Data Processing in Apache Beam
    Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC Software Engineer @ Google https://s.apache.org/ffsf-2017-beam-state [email protected] / @KennKnowles Flink Forward San Francisco 2017 Agenda 1. What is Apache Beam? 2. State 3. Timers 4. Example & Little Demo What is Apache Beam? TL;DR (Flink draws it more like this) 4 DAGs, DAGs, DAGs Apache Beam Apache Flink Apache Cloud Hadoop Apache Apache Dataflow Spark Samza MapReduce Apache Apache Apache (paper) Storm Gearpump Apex (incubating) FlumeJava (paper) Heron MillWheel (paper) Dataflow Model (paper) 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 Apache Flink local, on-prem, The Beam Vision cloud Cloud Dataflow: Java fully managed input.apply( Apache Spark Sum.integersPerKey()) local, on-prem, cloud Sum Per Key Apache Apex Python local, on-prem, cloud input | Sum.PerKey() Apache Gearpump (incubating) ⋮ ⋮ 6 Apache Flink local, on-prem, The Beam Vision cloud Cloud Dataflow: Python fully managed input | KakaIO.read() Apache Spark local, on-prem, cloud KafkaIO Apache Apex ⋮ local, on-prem, cloud Apache Java Gearpump (incubating) class KafkaIO extends UnboundedSource { … } ⋮ 7 The Beam Model PTransform Pipeline PCollection (bounded or unbounded) 8 The Beam Model What are you computing? (read, map, reduce) Where in event time? (event time windowing) When in processing time are results produced? (triggers) How do refinements relate? (accumulation mode) 9 What are you computing? Read ParDo Grouping Composite Parallel connectors to Per element Group
    [Show full text]
  • Apache Flink Fast and Reliable Large-Scale Data Processing
    Apache Flink Fast and Reliable Large-Scale Data Processing Fabian Hueske @fhueske 1 What is Apache Flink? Distributed Data Flow Processing System • Focused on large-scale data analytics • Real-time stream and batch processing • Easy and powerful APIs (Java / Scala) • Robust execution backend 2 What is Flink good at? It‘s a general-purpose data analytics system • Real-time stream processing with flexible windows • Complex and heavy ETL jobs • Analyzing huge graphs • Machine-learning on large data sets • ... 3 Flink in the Hadoop Ecosystem Libraries Dataflow Table API Table ML Library ML Gelly Library Gelly Apache MRQL Apache SAMOA DataSet API (Java/Scala) DataStream API (Java/Scala) Flink Core Optimizer Stream Builder Runtime Environments Embedded Local Cluster Yarn Apache Tez Data HDFS Hadoop IO Apache HBase Apache Kafka Apache Flume Sources HCatalog JDBC S3 RabbitMQ ... 4 Flink in the ASF • Flink entered the ASF about one year ago – 04/2014: Incubation – 12/2014: Graduation • Strongly growing community 120 100 80 60 40 20 0 Nov.10 Apr.12 Aug.13 Dec.14 #unique git committers (w/o manual de-dup) 5 Where is Flink moving? A "use-case complete" framework to unify batch & stream processing Data Streams • Kafka Analytical Workloads • RabbitMQ • ETL • ... • Relational processing Flink • Graph analysis • Machine learning “Historic” data • Streaming data analysis • HDFS • JDBC • ... Goal: Treat batch as finite stream 6 Programming Model & APIs HOW TO USE FLINK? 7 Unified Java & Scala APIs • Fluent and mirrored APIs in Java and Scala • Table API
    [Show full text]