Graph Algorithms Practical Examples in Apache Spark and Neo4j

Mark Needham and Amy E. Hodler

Beijing Boston Farnham Sebastopol Tokyo

Graph Algorithms
by Mark Needham and Amy E. Hodler

Copyright © 2019 Amy Hodler and Mark Needham. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisition Editor: Jonathan Hassell
Editor: Jeff Bleiel
Production Editor: Deborah Baker
Copy Editor: Tracy Brown
Proofreader: Rachel Head
Indexer: Judy McConville
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

May 2019: First Edition

Revision History for the First Edition
2019-04-15: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492047681 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Graph Algorithms, the cover image of a European garden spider, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk.
If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O’Reilly and Neo4j. See our statement of editorial independence.

978-1-492-05781-9
[LSI]

Table of Contents

Preface
Foreword

1. Introduction
  What Are Graphs?
  What Are Graph Analytics and Algorithms?
  Graph Processing, Databases, Queries, and Algorithms
  OLTP and OLAP
  Why Should We Care About Graph Algorithms?
  Graph Analytics Use Cases
  Conclusion

2. Graph Theory and Concepts
  Terminology
  Graph Types and Structures
  Random, Small-World, Scale-Free Structures
  Flavors of Graphs
  Connected Versus Disconnected Graphs
  Unweighted Graphs Versus Weighted Graphs
  Undirected Graphs Versus Directed Graphs
  Acyclic Graphs Versus Cyclic Graphs
  Sparse Graphs Versus Dense Graphs
  Monopartite, Bipartite, and k-Partite Graphs
  Types of Graph Algorithms
  Pathfinding
  Centrality
  Community Detection
  Summary

3. Graph Platforms and Processing
  Graph Platform and Processing Considerations
  Platform Considerations
  Processing Considerations
  Representative Platforms
  Selecting Our Platform
  Apache Spark
  Neo4j Graph Platform
  Summary

4. Pathfinding and Graph Search Algorithms
  Example Data: The Transport Graph
  Importing the Data into Apache Spark
  Importing the Data into Neo4j
  Breadth First Search
  Breadth First Search with Apache Spark
  Depth First Search
  Shortest Path
  When Should I Use Shortest Path?
  Shortest Path with Neo4j
  Shortest Path (Weighted) with Neo4j
  Shortest Path (Weighted) with Apache Spark
  Shortest Path Variation: A*
  Shortest Path Variation: Yen’s k-Shortest Paths
  All Pairs Shortest Path
  A Closer Look at All Pairs Shortest Path
  When Should I Use All Pairs Shortest Path?
  All Pairs Shortest Path with Apache Spark
  All Pairs Shortest Path with Neo4j
  Single Source Shortest Path
  When Should I Use Single Source Shortest Path?
  Single Source Shortest Path with Apache Spark
  Single Source Shortest Path with Neo4j
  Minimum Spanning Tree
  When Should I Use Minimum Spanning Tree?
  Minimum Spanning Tree with Neo4j
  Random Walk
  When Should I Use Random Walk?
  Random Walk with Neo4j
  Summary

5. Centrality Algorithms
  Example Graph Data: The Social Graph
  Importing the Data into Apache Spark
  Importing the Data into Neo4j
  Degree Centrality
  Reach
  When Should I Use Degree Centrality?
  Degree Centrality with Apache Spark
  Closeness Centrality
  When Should I Use Closeness Centrality?
  Closeness Centrality with Apache Spark
  Closeness Centrality with Neo4j
  Closeness Centrality Variation: Wasserman and Faust
  Closeness Centrality Variation: Harmonic Centrality
  Betweenness Centrality
  When Should I Use Betweenness Centrality?
  Betweenness Centrality with Neo4j
  Betweenness Centrality Variation: Randomized-Approximate Brandes
  PageRank
  Influence
  The PageRank Formula
  Iteration, Random Surfers, and Rank Sinks
  When Should I Use PageRank?
  PageRank with Apache Spark
  PageRank with Neo4j
  PageRank Variation: Personalized PageRank
  Summary

6. Community Detection Algorithms
  Example Graph Data: The Software Dependency Graph
  Importing the Data into Apache Spark
  Importing the Data into Neo4j
  Triangle Count and Clustering Coefficient
  Local Clustering Coefficient
  Global Clustering Coefficient
  When Should I Use Triangle Count and Clustering Coefficient?
  Triangle Count with Apache Spark
  Triangles with Neo4j
  Local Clustering Coefficient with Neo4j
  Strongly Connected Components
  When Should I Use Strongly Connected Components?
  Strongly Connected Components with Apache Spark
  Strongly Connected Components with Neo4j
  Connected Components
  When Should I Use Connected Components?
  Connected Components with Apache Spark
  Connected Components with Neo4j
  Label Propagation
  Semi-Supervised Learning and Seed Labels
  When Should I Use Label Propagation?
  Label Propagation with Apache Spark
  Label Propagation with Neo4j
  Louvain Modularity
  When Should I Use Louvain?
  Louvain with Neo4j
  Validating Communities
  Summary

7. Graph Algorithms in Practice
  Analyzing Yelp Data with Neo4j
  Yelp Social Network
  Data Import
  Graph Model
  A Quick Overview of the Yelp Data
  Trip Planning App
  Travel Business Consulting
  Finding Similar Categories
  Analyzing Airline Flight Data with Apache Spark
  Exploratory Analysis
  Popular Airports
  Delays from ORD
  Bad Day at SFO
  Interconnected Airports by Airline
  Summary

8. Using Graph Algorithms to Enhance Machine Learning
  Machine Learning and the Importance of Context
  Graphs, Context, and Accuracy
  Connected Feature Extraction and Selection
  Graphy Features
  Graph Algorithm Features
  Graphs and Machine Learning in Practice: Link Prediction
  Tools and Data
  Importing the Data into Neo4j
  The Coauthorship Graph
  Creating Balanced Training and Testing Datasets
  How We Predict Missing Links
  Creating a Machine Learning Pipeline
  Predicting Links: Basic Graph Features
  Predicting Links: Triangles and the Clustering Coefficient
  Predicting Links: Community Detection
  Summary
  Wrapping Things Up

A. Additional Information and Resources
Index

Preface

The world is driven by connections, from financial and communication systems to social and biological processes. Revealing the meaning behind these connections drives breakthroughs across industries in areas such as identifying fraud rings and optimizing recommendations to evaluating the strength of a group and predicting cascading failures.

As connectedness continues to accelerate, it’s not surprising that interest in graph algorithms has exploded because they are based on mathematics explicitly developed to gain insights from the relationships between data. Graph analytics can uncover the workings of intricate systems and networks at massive scales, for any organization.

We are passionate about the utility and importance of graph analytics as well as the joy of uncovering the inner workings of complex scenarios. Until recently, adopting graph analytics required significant expertise and determination, because tools and integrations were difficult and few knew how to apply graph algorithms to their quandaries. It is our goal to help change this. We wrote this book to help organizations better leverage graph analytics so that they can make new discoveries and develop intelligent solutions faster.
What’s in This Book

This book is a practical guide to getting started with graph algorithms for developers and data scientists who have experience using Apache Spark™ or Neo4j. Although our algorithm examples utilize the Spark and Neo4j platforms, this book will also be helpful for understanding more general graph concepts, regardless of your choice of graph technologies.

The first two chapters provide an introduction to graph analytics, algorithms, and theory. The third chapter briefly covers the platforms used in this book before we dive into three chapters focusing on classic graph algorithms: pathfinding, centrality, and community detection. We wrap up the book with two chapters showing how graph algorithms are used within workflows: one for general analysis and one for machine learning.

At the beginning of each category of algorithms, there is a reference table to help you quickly jump to the relevant