Exploratory Large Scale Graph Analytics in Arkouda

Zhihui Du, Oliver Alvarado Rodriguez and David A. Bader
{zhihui.du,oaa9,bader}@njit.edu
New Jersey Institute of Technology
Newark, New Jersey, USA

Michael Merrill and William Reus
[email protected]
[email protected]
Department of Defense
USA

ABSTRACT
Exploratory graph analytics helps maximize the informational value of a graph. However, increasing graph sizes make it impossible for existing popular exploratory data analysis tools to handle data sets of dozens of terabytes or more in the memory of a common laptop or personal computer. Arkouda is a framework under early development that brings together the productivity of Python at the user side with the high performance of Chapel at the server side. This paper presents preliminary work on overcoming the memory limit and the high performance computing coding roadblock that high level Python users face when performing large graph analysis. A simple and succinct graph data structure design and implementation at both the Python front-end and the Chapel back-end in the Arkouda framework are provided. A typical graph algorithm, Breadth-First Search (BFS), is used to show how Chapel can be used to develop high performance parallel graph algorithms productively. Two Chapel based parallel BFS algorithms, one high level version and one corresponding low level version, have been implemented in Arkouda to support analyzing large graphs. Multiple graph benchmarks are used to evaluate the performance of the provided graph algorithms. Experimental results show that performance can be optimized by tuning the selection of different Chapel high level data structures and parallel constructs. Our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda).

KEYWORDS
Exploratory graph analysis, Parallel graph algorithms, Breadth-First Search, Real-world graph data sets, High level parallel language

ACM Reference Format:
Zhihui Du, Oliver Alvarado Rodriguez and David A. Bader and Michael Merrill and William Reus. 2021. Exploratory Large Scale Graph Analytics in Arkouda. In Proceedings of CHIUW '21: Chapel Implementers and Users Workshop (CHIUW '21). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Unpublished working draft. Not for distribution.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CHIUW '21, June 04, 2021, Virtual
© 2021 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
A graph is a well defined mathematical model for formulating the relationships between different objects and is widely used in numerous domains such as social sciences, biological systems, and information systems. The edge distributions of many large scale real world problems tend to follow a power-law distribution [1, 11, 26]. Dense graph data structures and algorithms consume much more memory and cannot analyze very large sparse graphs efficiently. Therefore, parallel algorithms for sparse graphs [23] have become an important research topic for efficiently analyzing the large and sparse graphs that arise in different real-world problems.

Exploratory data analysis (EDA) [6, 13, 15] was proposed by Tukey [27] and his associates in the early 1960s. The basic idea of EDA is listening to the data in as many ways as possible until a plausible story of the data is apparent. Tukey linked EDA with "detective work" to summarize the main characteristics of data sets. Instead of checking a given hypothesis with data, EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. In this way EDA tries to maximize the value of data, and it has been widely used in different applications, such as COVID-19 [10] and Twitter [19].

Popular EDA methods and tools, which often run on laptops or common personal computers, cannot hold terabyte or even larger sparse graph data sets, let alone produce highly efficient analysis results. Arkouda [21, 24] is an EDA framework under early development that brings together the productivity of Python with world-class high-performance computing. Arkouda allows data scientists to combine the advantages of laptop computing and cloud/supercomputing. Currently, Arkouda cannot support graph analysis. In this work we provide a preliminary design and implementation that extends Arkouda's data structures and parallel algorithms to support large scale sparse graph analytics.

In this paper we provide a preliminary solution for integrating sparse graphs into Arkouda. The major contributions are as follows.

(1) An efficient and succinct Double-Index (DI) graph data structure and a large sparse graph partition method are developed to support parallel and distributed graph algorithm design.
(2) Two distributed parallel Breadth-First Search (BFS) graph algorithms are developed based on the high level parallel language Chapel. High productivity in algorithm design and quick optimization in performance improvement are the two major advantages of our Chapel based graph algorithm development method.
(3) Experimental results show that the proposed graph data structure and algorithm enable Arkouda to handle different kinds of large graphs. Based on the same algorithm framework, quick and small tuning in high level data structure and parallel construct selection can achieve more than 8 times speedup.

2 ARKOUDA FRAMEWORK FOR DATA SCIENCE
As a high level exploratory data analytics framework, Arkouda aims to support large scale data analysis that is not only flexible but also high performance. Python [25] is an interpreted, high-level and general-purpose programming language. Python consistently ranks as one of the most popular programming languages and has an ever-growing community. Python has become a very powerful EDA tool. However, performance and very large scale data processing are two bottlenecks of Python. Chapel [8] is a high level programming language designed for productive parallel computing at scale. Like Python, it is portable and open source. Furthermore, it offers the scalability and speed that Python lacks.

Arkouda therefore integrates its Python front-end with its Chapel back-end through a middle communication layer, ZeroMQ [14]. ZeroMQ is used for the exchange of data and instructions between Python users and back-end services. In this way, Arkouda can provide flexible and high performance large scale data analysis capability.

To break the data volume limit of Python, Arkouda provides a virtual data view for its Python users. However, the real or raw data are stored in Chapel.

or locating edges from a given vertex. DI has two features: (1) significant memory savings for large sparse graphs; (2) support for easy and high level array operators. The proposed DI data structure is a new innovation to support large sparse graph exploratory data analytics.

3.1 Directed Graphs
For directed graphs, the edge index array consists of two arrays with the same shape. One is the source vertex array and the other is the destination vertex array. If there are a total of $M$ edges and $N$ vertices, we will use the numbers from $0$ to $M-1$ to identify different edges and the numbers from $0$ to $N-1$ to identify different vertices.

For example, given edge $e = \langle i, j \rangle$, we will let $SRC[e] = i$ and $DST[e] = j$, where $SRC$ is the source vertex array and $DST$ is the destination vertex array; $e$ is the edge ID number. Both $SRC$ and $DST$ have the same size $M$. When all edges are stored into $SRC$ and $DST$, we will sort them based on their vertex ID value and remap the edge ID from $0$ to $M-1$. Based on the sorted edge index array, we can build the vertex index array, which also consists of two arrays of the same shape. For example, we let edge $e_{1000}$ have ID 1000. If $e_{1000} = \langle 50, 3 \rangle$, $e_{1001} = \langle 50, 70 \rangle$ and $e_{1002} = \langle 50, 110 \rangle$ are all the edges starting from vertex 50, then we will assign the entry of one vertex index array $STR[50] = 1000$ and another vertex index array $NEI[50] = 3$. This means that for given vertex 50, the edges