Exploratory Large Scale Graph Analytics in Arkouda

Zhihui Du, Oliver Alvarado Rodriguez and David A. Bader
{zhihui.du,oaa9,bader}@njit.edu
New Jersey Institute of Technology, Newark, New Jersey, USA

Michael Merrill and William Reus
[email protected], [email protected]
Department of Defense, USA

ABSTRACT
Exploratory graph analytics helps maximize the informational value of a graph. However, the increasing graph size makes it impossible for existing popular exploratory data analysis tools to handle dozens-of-terabytes or even larger data sets in the memory of a common laptop/personal computer. Arkouda is a framework under early development that brings together the productivity of Python at the user side with the high performance of Chapel at the server side. In this paper, the preliminary work on overcoming the memory limit and the high performance computing coding roadblock for high level Python users to perform large graph analysis is presented. A simple and succinct graph data structure design and implementation at both the Python front-end and the Chapel back-end in the Arkouda framework are provided. A typical graph algorithm, Breadth-First Search (BFS), is used to show how we can use Chapel to develop high performance parallel graph algorithms productively. Two Chapel based parallel BFS algorithms, one high level version and one corresponding low level version, have been implemented in Arkouda to support analyzing large graphs. Multiple graph benchmarks are used to evaluate the performance of the provided graph algorithms. Experimental results show that we can optimize the performance by tuning the selection of different Chapel high level data structures and parallel constructs. Our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda).

KEYWORDS
Exploratory graph analysis, Parallel graph algorithms, Breadth-First Search, Real-world graph data sets, High level parallel language

ACM Reference Format:
Zhihui Du, Oliver Alvarado Rodriguez, David A. Bader, Michael Merrill and William Reus. 2021. Exploratory Large Scale Graph Analytics in Arkouda. In Proceedings of CHIUW '21: Chapel Implementers and Users Workshop (CHIUW '21). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
A graph is a well defined mathematical model to formulate the relationship between different objects and is widely used in numerous domains such as social sciences, biological systems, and information systems. The edge distributions of many large scale real world problems tend to follow a power-law distribution [1, 11, 26]. Dense graph data structures consume much more memory and cannot be used to analyze very large sparse graphs efficiently. Therefore, parallel algorithms for sparse graphs [23] have become an important research topic to efficiently analyze the large and sparse graphs arising from different real-world problems.

Exploratory data analysis (EDA) [6, 13, 15] was proposed by Tukey [27] and his associates early in the 1960s. The basic idea of EDA is listening to the data in as many ways as possible until a plausible story of the data is apparent. Tukey linked EDA with "detective work" to summarize the main characteristics of data sets. Instead of checking a given hypothesis with data, EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. In this way EDA helps to maximize the value of data, and it has been widely used in different applications, such as COVID-19 [10], Twitter [19] and so on.

Popular EDA methods and tools, which often run on laptops or common personal computers, cannot hold terabyte or even larger sparse graph data sets, let alone produce highly efficient analysis results. Arkouda [21, 24] is an EDA framework under early development that brings together the productivity of Python with world-class high-performance computing. Arkouda allows data scientists to make use of the advantages of both laptop computing and cloud/supercomputing together. Currently, Arkouda cannot support graph analysis. In this work we provide the preliminary design and implementation on extending Arkouda's data structures and parallel algorithms to support large scale sparse graph analytics.

In this paper we provide the preliminary solution on integrating sparse graphs into Arkouda. The major contributions are as follows.

(1) An efficient and succinct Double-Index (DI) graph data structure and a large sparse graph partition method are developed to support parallel and distributed graph algorithm design.
(2) Two distributed parallel Breadth-First Search (BFS) graph algorithms are developed based on the high level parallel language Chapel. High productivity in algorithm design and quick optimization in performance improvement are the two major advantages of our Chapel based graph algorithm development method.
(3) Experimental results show that the proposed graph data structure and algorithms can support Arkouda in handling different kinds of large graphs. Based on the same algorithm framework, quick and small tuning in high level data structure and parallel construct selection can achieve more than 8 times performance improvement.

2 ARKOUDA FRAMEWORK FOR DATA SCIENCE
As a high level exploratory data analytics framework, Arkouda aims to support not only flexible but also high performance large scale data analysis. Python [25] is an interpreted, high-level and general-purpose programming language. Python consistently ranks as one of the most popular programming languages and has an ever growing community; it has become a very powerful EDA tool. However, performance and very large scale data processing are two bottlenecks of Python. Chapel [8] is a high level programming language designed for productive parallel computing at scale. It has the same advantages as Python, such as being portable and open source. Furthermore, it has the scalable and fast features that Python lacks.

So, Arkouda integrates its front-end Python with its back-end Chapel through a middle, communicative part, ZeroMQ [14]. ZeroMQ is used for the data and instruction exchanges between Python users and the back-end services. In this way, Arkouda can provide flexible and high performance large scale data analysis capability.

To break the data volume limit of Python, Arkouda provides a virtual data view for its Python users. However, the real or raw data are stored in Chapel. Python users can use the metadata to access the actual big data sets at the back-end. From the view of the Python programmers, all data is directly available just like on their local laptop device. This is why Arkouda can break the local memory capacity limit, while at the same time bringing traditional laptop users powerful computing capabilities that previously could only be provided by supercomputers.

When users are exploring their data, if only the metadata section is needed, then the operations can be completed locally and quickly. These actions are carried out just like in previous Python data processing workflows. If the operations have to be executed on raw data, the Python program will automatically generate an internal message and send the message to Arkouda's message processing pipeline for external and remote help. Arkouda's message processing center (ZeroMQ) is responsible for exchanging messages between its front-end and back-end. When the Chapel back-end receives the operation command from the front-end, it will execute the analysis tasks quickly on the powerful HPC resources and large memory that hold the corresponding raw data, and return the required information back to the front-end. Through this, Arkouda can support Python users in locally handling, on their personal devices, large scale data sets residing on powerful back-end servers without knowing all the detailed operations at the back-end.

3 DOUBLE-INDEX SPARSE GRAPH DATA STRUCTURE AND PARTITION
For sparse graph representations, compressed sparse row and compressed sparse column are two very important data structures. We borrow the basic idea of these existing data structures and propose our Double-Index (DI), edge index and vertex index, sparse graph data structure to enable directly locating vertices from a given edge or locating edges from a given vertex. DI has two features: (1) significant memory savings for large sparse graphs; (2) support for easy and high level array operators. The proposed DI data structure is a new approach to support large sparse graph exploratory data analytics.

3.1 Directed Graphs
For directed graphs, the edge index array consists of two arrays with the same shape. One is the source vertex array and the other is the destination vertex array. If there are a total of M edges and N vertices, we will use the numbers from 0 to M-1 to identify different edges and the numbers from 0 to N-1 to identify different vertices.

For example, given edge e = <i, j>, we will let SRC[e] = i and DST[e] = j, where SRC is the source vertex array and DST is the destination vertex array; e is the edge ID number. Both SRC and DST have the same size M. When all edges are stored into SRC and DST, we sort them based on their vertex ID values and remap the edge IDs from 0 to M-1. Based on the sorted edge index array, we can build the vertex index array, which also consists of two arrays of the same shape. For example, let edge e1000 have ID 1000. If e1000 = <50, 3>, e1001 = <50, 70> and e1002 = <50, 110> are all the edges starting from vertex 50, then we assign the entry of one vertex index array STR[50] = 1000 and of the other vertex index array NEI[50] = 3. This means that for the given vertex 50, the edges starting with vertex 50 are stored in the edge index array starting at position 1000, and there are in total 3 such edges. If there are no edges from vertex i, we let STR[i] = -1 and NEI[i] = 0. In this way, we can directly locate all the connected edges and neighbours of any given vertex.
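To make the DI representation concrete, the following minimal Chapel sketch (an illustration only, not part of the Arkouda source) builds the four index arrays by hand for a toy directed graph and reads off one adjacency list with a single array section:

    // Toy directed graph with N = 4 vertices and M = 5 edges, already sorted by source:
    //   <0,1>, <0,2>, <1,3>, <2,0>, <2,3>
    const M = 5, N = 4;
    var SRC: [0..M-1] int = [0, 0, 1, 2, 2];  // source vertex of each edge
    var DST: [0..M-1] int = [1, 2, 3, 0, 3];  // destination vertex of each edge
    var NEI: [0..N-1] int = [2, 1, 2, 0];     // out-degree of each vertex
    var STR: [0..N-1] int = [0, 2, 3, -1];    // first edge index of each vertex (-1 if none)

    // Neighbours of vertex 2 are the array section DST[STR[2]..STR[2]+NEI[2]-1].
    writeln(DST[STR[2]..STR[2]+NEI[2]-1]);    // prints: 0 3

In Arkouda itself these arrays are produced by sorting the edge list at the Chapel back-end; only the sorted arrays and the array-section operator above are assumed here.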
For a given array A, we use A[i..j] to express the elements of A from A[i] to A[j]. A[i..j] is also called an array section of A. So, a given vertex with index i will have NEI[i] neighbours, and their vertex IDs are DST[STR[i]] to DST[STR[i]+NEI[i]-1]. This can be expressed as the array section DST[STR[i]..STR[i]+NEI[i]-1] (here we assume the out-degree of i is not 0). For any vertex i, its adjacency list can be directly expressed as <i, x> where x is in DST[STR[i]..STR[i]+NEI[i]-1].

Figure 1: Double Index data structure for sparse graph.

Fig. 1 shows M sorted edges represented by the SRC and DST arrays. Any one of the N vertices can find its neighbours using the NEI and STR arrays with O(1) time complexity. This figure shows how the NEI and STR arrays can help us locate neighbours and adjacency lists quickly.

3.2 Undirected Graphs
For undirected graphs, an edge <i, j> means that we can also arrive at i from j. We can use the data structures SRC, DST, STR, NEI to search for the neighbours of j in SRC and create the adjacency list. However, this search cannot be done in O(1) time complexity. To achieve O(1) search time complexity for an undirected graph, we introduce another four arrays called the reversed index arrays SRCr, DSTr, STRr, NEIr. For any edge <i, j> in SRC and DST, we have the corresponding reverse edge <j, i> in SRCr and DSTr, where SRCr has exactly the same elements as DST and DSTr has exactly the same elements as SRC. SRCr and DSTr are also sorted, and NEIr and STRr are the index arrays of the number of neighbours and of the starting neighbour index, just like in the directed graph. So, for a given vertex i of an undirected graph, the neighbours of vertex i include the elements in DST[STR[i]..STR[i]+NEI[i]-1] and the elements in DSTr[STRr[i]..STRr[i]+NEIr[i]-1]. The adjacency list of vertex i is then <i, x> where x is in DST[STR[i]..STR[i]+NEI[i]-1] or in DSTr[STRr[i]..STRr[i]+NEIr[i]-1].
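Continuing the toy example above, the following sketch (again illustrative only) adds the reversed index arrays for the same five edges treated as an undirected graph. Vertex 3 has no outgoing edges, yet its neighbours are still reachable in O(1) through the reversed arrays:

    // Reversed edges <j,i>, sorted by their new source vertex:
    //   <0,2>, <1,0>, <2,0>, <3,1>, <3,2>
    const M = 5, N = 4;
    var DSTr: [0..M-1] int = [2, 0, 0, 1, 2];  // destinations of the reversed edges
    var NEIr: [0..N-1] int = [1, 1, 1, 2];     // in-degree of each vertex
    var STRr: [0..N-1] int = [0, 1, 2, 3];     // first reversed-edge index of each vertex

    // Undirected neighbours of vertex 3 come from the reversed section only,
    // since NEI[3] == 0 in the forward arrays.
    writeln(DSTr[STRr[3]..STRr[3]+NEIr[3]-1]); // prints: 1 2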

3.3 Graph Partition
For real-world power-law graphs, the edge and vertex distributions are highly skewed. A few vertices have very large degrees, but many vertices have very small degrees. If we partition the graph evenly based on the vertices, it is easy to cause a load balancing problem, because the processor which holds the vertices that have a large number of edges will often have a very heavy load. So, we instead divide the total number of edges equally among the different processors/computing nodes.

Figure 2: Edge based Sparse Graph Partition.

Fig. 2 shows the basic idea of our sparse graph partition method. If we could place an edge and its vertex entries in the index arrays NEI and STR onto the same processor, this method would increase the locality when we search from edge to vertex or from vertex to edge. However, this requires us to distribute NEI and STR in an irregular way, since the edge index array and the vertex index array have different shapes. Currently we cannot implement this distribution easily with existing Chapel methods, so we just partition the NEI and STR arrays evenly, in the same way as the edges.
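A minimal sketch of this even, edge-based partition, assuming Chapel's standard BlockDist distribution as it existed at the time (an illustration of the idea, not the actual Arkouda code):

    use BlockDist;

    config const M = 1_000_000;   // number of edges (example value)

    // Block-distribute the edge index arrays so that every locale owns an
    // (almost) equal share of the M edges.
    const edgeD = {0..M-1} dmapped Block(boundingBox={0..M-1});
    var SRC, DST: [edgeD] int;

    // Report how many edges each locale owns; the counts differ by at most one.
    for loc in Locales do on loc do
      writeln(here.id, " owns ", SRC.localSubdomain().size, " edges");

The vertex index arrays NEI and STR are partitioned evenly in the same way, which is why an edge and its endpoint entries may end up on different locales.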
4 PARALLEL BFS ALGORITHM
We select one typical graph algorithm, Breadth-First Search (BFS), to show how we can implement exploratory large graph analytics in Arkouda. Two significant features of our parallel BFS algorithm design are different from existing BFS algorithm designs: (1) Our BFS algorithm can exploit parallelism in the graph search easily and efficiently based on the proposed DI graph data structure. (2) We employ the high level parallel language Chapel to develop the BFS algorithm, so we can significantly improve the productivity of the design.

Especially large graphs cannot be held in one shared memory computer; however, they can be handled with distributed memory computers, such as computing clusters, to execute the BFS in parallel. In Chapel, the locale type refers to a unit of the machine resources on which your program is running. Locales have the capability to store variables and to run Chapel tasks. In practice, for a distributed memory architecture, a single multicore/SMP node is typically considered a locale, while a shared memory architecture is typically considered a single locale. We have developed two versions of parallel BFS algorithms in Arkouda. The first is the high level multi-locale version and the second is the corresponding low level version. We give the details in the following subsections.

4.1 High Level Multi-Locale BFS Algorithm
The standard level by level BFS algorithm works as follows. For each vertex at the current level, or frontier, we search its unvisited next level vertices. When all the vertices at the current level have been expanded, we switch the next level vertices to the current level and repeat the search until no vertices can be expanded in the current level.

The basic idea of our algorithm is that we take advantage of the multi-locale feature of Chapel to handle very large graphs in distributed memory. The distributed data are processed at their locales, i.e., in their local memory. Furthermore, each shared memory computing node can also process its owned data in parallel. Our multi-locale BFS algorithm exploits the following features. (1) The edges of the DI graph data have been distributed evenly onto the distributed memory to balance the load. (2) Each distributed node only expands the vertices it owns in the current frontier. This can be done in distributed memory in parallel.
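The Chapel idiom that both BFS versions rely on is shown in the short sketch below (illustrative only): coforall spawns one task per locale and the on clause moves that task to the locale, where it can work on the data stored there.

    coforall loc in Locales do on loc {
      writeln("locale ", here.id, " (", here.name, ") has ", here.numPUs(), " cores");
    }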

Algorithm 1: High level Chapel based parallel BFS for distributed memory supercomputers
Input: A graph G and the starting vertex root
Output: An array depth giving the visiting level of each vertex
 1  depth = -1                              // initialize the visiting level of all the vertices
 2  depth[root] = 0                         // set the starting vertex's level to 0
 3  current_level = 0                       // set the current level
 4  SetCurF = new DistBag(int, Locales)     // allocate a distributed bag to hold the current frontier
 5  SetNextF = new DistBag(int, Locales)    // allocate another bag to hold the next frontier
 6  SetCurF.add(root)                       // insert the starting vertex into the current frontier
 7  while (!SetCurF.isEmpty()) do
 8    coforall (loc in Locales) do
 9      // parallel search on each locale
10      forall (i in SetCurF) do
11        if (i is on the current locale) then
12          SetNeighbour = {k | k is a neighbour of i}
13          forall (j in SetNeighbour) do
14            if (depth[j] == -1) then
15              SetNextF.add(j)
16              depth[j] = current_level + 1
17            end
18          end
19        end
20      end
21    end
22    SetCurF <=> SetNextF                  // exchange the two bags
23    SetNextF.clear()
24    current_level += 1
25  end
26  return depth

Our method is described in Alg. 1. Line 1 initializes the return array. Line 2 sets the starting vertex's search level to 0. Line 3 initializes the current search level as 0. Lines 4 and 5 create two distributed bags to manage the current and next search frontiers. Line 6 adds the starting vertex into the current frontier. The parallel code is very simple and easy. From line 7 to line 25 we continue the standard loop while the current search frontier is not empty. From line 8 to line 21 we use the coforall parallel construct to execute the search on each locale in parallel. From line 10 to line 20 we execute the parallel search on each locale. On each locale, we check each vertex in the current frontier, but only the vertices on the current locale are expanded (line 11). In this way, we expand the current frontier in parallel on all the locales without any overlap. In line 12 we build the set of neighbours SetNeighbour of the current vertex i. Since some neighbours may have been visited before (line 14), we only expand the unvisited vertices and add them into the next frontier set SetNextF (line 15). At the same time, we assign visiting level current_level + 1 to the expanded vertices (line 16). After all locales have expanded their vertices in the current frontier, we exchange the values of the current frontier and the next frontier (line 22), clear the vertices in the next frontier (line 23), and advance current_level to the next level (line 24). Then the next loop iteration begins from the new frontier. When all vertices have been visited, we return the search array depth as the final result.

From this algorithm description, we can see that it is very simple and natural to describe the level by level parallel method of the BFS algorithm in Chapel. The DistBag data structure can be used to hold the current and the next frontier sets easily and efficiently. At the same time, the coforall and forall parallel constructs can express the parallel expansion in a very efficient way. At line 8, multiple locales can execute the search in parallel. At line 10, different vertices in the current frontier can be expanded in parallel. At line 13, different expanded vertices can be added into the next frontier in parallel. We can exploit the parallelism in a hierarchical way to improve the total performance.

At line 8, we use coforall instead of forall to implement distributed parallel computing on each distributed memory computing node. At line 11, we just select the vertices owned by the current locale. In this way, we can increase the access locality and avoid expanding the same edges on multiple locales.
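The following condensed Chapel sketch shows how Algorithm 1 can map onto real Chapel constructs. It is a simplified illustration rather than the exact Arkouda implementation: the DI arrays NEI, STR, DST and the depth array are assumed to be block-distributed already, and vertex ownership is checked through localSubdomain().

    use DistributedBag;

    proc bfsHighLevel(root: int, ref depth: [] int,
                      NEI: [] int, STR: [] int, DST: [] int) {
      var curLevel = 0;
      var SetCurF = new DistBag(int, Locales);    // current frontier
      var SetNextF = new DistBag(int, Locales);   // next frontier
      depth[root] = 0;
      SetCurF.add(root);

      while !SetCurF.isEmpty() {
        coforall loc in Locales do on loc {       // one task per locale
          forall i in SetCurF {                   // expand frontier vertices in parallel
            if NEI.localSubdomain().contains(i) { // only expand vertices owned here
              if NEI[i] > 0 {
                forall j in DST[STR[i]..STR[i]+NEI[i]-1] {
                  if depth[j] == -1 {             // unvisited neighbour
                    depth[j] = curLevel + 1;
                    SetNextF.add(j);
                  }
                }
              }
            }
          }
        }
        SetCurF <=> SetNextF;                     // swap the two frontiers
        SetNextF.clear();
        curLevel += 1;
      }
    }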
4.2 Low Level Multi-Locale BFS Algorithm
In the high level BFS algorithm, we use two DistBag instances to hold the current frontier and the next frontier. The communication between different locales is implicit, which makes our parallel program simple and easy. To evaluate the performance of such high level data structures in Chapel, we instead directly use arrays to hold the vertices in the current and next frontiers and explicitly implement the corresponding communication between different locales. In this way, we can check whether the high level data structure DistBag introduces significant performance overhead.

To optimize the performance, we use a distributed array curFAry to clearly distinguish the frontier elements owned by different locales. We also use a distributed array recvAry to hold the expanded vertices coming from different locales. In this way, we can exploit the locality and optimize the communication during the graph search. The low level algorithm is given in Alg. 2.

Algorithm 2: Low level parallel BFS for distributed memory supercomputers
Input: A graph G and the starting vertex root
Output: An array depth giving the visiting level of each vertex
 1  depth = -1                             // initialize the visiting level of all the vertices
 2  depth[root] = 0                        // set the starting vertex's level to 0
 3  current_level = 0                      // set the current level
 4  create a distributed array curFAry to hold the current frontier of each locale
 5  create a distributed array recvAry to receive expanded vertices from other locales
 6  put root into curFAry
 7  while (!curFAry.isEmpty()) do
 8    coforall (loc in Locales) do
 9      create SetNextFLocal to hold expanded vertices owned by the current locale
10      create SetNextFRemote to hold expanded vertices owned by other locales
11      myCurF ← the current locale's frontier in curFAry, and then clear curFAry
12      coforall (i in myCurF) do
13        SetNeighbour = {k | k is a neighbour of i}
14        forall (j in SetNeighbour) do
15          if (depth[j] == -1) then
16            if (j is local) then
17              SetNextFLocal.add(j)
18            end
19            else
20              SetNextFRemote.add(j)
21            end
22            depth[j] = current_level + 1
23          end
24        end
25      end
26      if (!SetNextFRemote.isEmpty()) then
27        scatter the elements in SetNextFRemote to recvAry
28      end
29      if (!SetNextFLocal.isEmpty()) then
30        move the elements in SetNextFLocal to curFAry
31      end
32    end
33    coforall (loc in Locales) do
34      curFAry ← collect elements from recvAry
35    end
36    current_level += 1
37  end
38  return depth

From line 1 to line 6 the low level BFS algorithm is just like the high level BFS algorithm, except that we replace SetCurF with curFAry and SetNextF with recvAry. The basic algorithm structure is similar to Alg. 1. From line 8 to line 32 we finish one level of vertex expansion. A set data structure SetNextFLocal is created to hold the expanded elements owned by the current locale (line 9), and its elements can be added in parallel. If expanded elements are not owned by the current locale, we create another set data structure SetNextFRemote to hold such elements (line 10). Instead of a parallel search over the whole current frontier, in the low level version each locale first gets its owned vertices (line 11). Then each locale uses the parallel construct coforall to expand the next frontier in parallel (line 12). The difference from Alg. 1 is that we put the expanded elements into different sets. If they are local, we put them into the SetNextFLocal set in parallel (lines 16 to 18). If they are not owned by the current locale, we put them into SetNextFRemote (lines 19 to 21). After the vertex expansion at each level, each locale scatters the next frontier elements in SetNextFRemote to their owners (lines 26 to 28). At the same time, elements in the next frontier owned by the current locale, held in SetNextFLocal, are merged into the distributed array curFAry (lines 29 to 31). All the above vertex operations can be done in parallel without data races. The parallel construct coforall has an implicit synchronization mechanism, so after line 32 we can be sure that all data communication has been completed and we can safely use the data in recvAry. From line 33 to line 35, each locale combines the next frontier elements generated by the current locale and by the other locales to form the new current frontier.

The low level BFS algorithm can exploit locality, avoid idle parallel threads, and use an aggregation method to optimize the communication performance. However, we have to take care of the data distribution and data communication ourselves. Even so, this optimization cannot beat the high level data structure implementation for our Delaunay benchmark test (see Section 5.2.2). This comparison shows the advantage of Chapel's high level data structures in easy programming and high performance.
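A simplified sketch of the per-locale scatter step (lines 26 to 28 of Algorithm 2) is shown below. It only illustrates the aggregation idea: recvBuf is a hypothetical stand-in for the distributed recvAry, holding one parallel-safe list per locale id, and is not the actual Arkouda code.

    use List;

    var recvBuf: [LocaleSpace] list(int, parSafe=true);

    // Ship all frontier vertices owned by locale `owner` in one aggregated batch.
    proc scatterToOwner(owner: int, remote: [] int) {
      on Locales[owner] {            // run at the owning locale
        const batch = remote;        // pull the whole batch over in one bulk copy
        for v in batch do
          recvBuf[owner].append(v);  // merge into the owner's receive buffer
      }
    }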
5 EXPERIMENTAL RESULTS
5.1 Experimental Setup
To evaluate the proposed integrated solution, we used two kinds of graphs. The first kind is generated with the R-MAT method [7]. The other kind of graphs come from standard benchmarks. We developed a simple bsf.py Python testing program to drive the experiments.

For the R-MAT graphs, we set the vertex counts of four different graphs to the following values: 32768, 65536, 131072, and 262144. The probability of the dense edge area is set to 0.75, and the other three parts share the remaining probability of 0.25 equally. Each vertex has 2 edges, so we generate 65536, 131072, 262144, and 524288 edges for the different R-MAT graphs. We generate both directed and undirected R-MAT graphs.

The graph benchmarks utilized for testing include the Delaunay, Kronecker (denoted KRON in the following), and Random Geometric graphs (denoted RGG in the following) from the tenth DIMACS implementation challenge [3]. The number of edges and vertices approximately doubles from one graph to the next in the same benchmark series. All data for these graphs can be found online. Table 1 summarizes some important information on the graphs selected and utilized for testing. All benchmark graphs utilized were undirected, and some were weighted. The number of connected components is listed under the CCs column. For those files where the number of connected components exceeded 1, 99%+ of the vertices in the graph were still found in the largest component. The diameter shown is a rough estimate taken by iterating over the first 100 all-pairs shortest paths created with the NetworkX Python graph tool; the actual diameter of these graphs may be bigger than what is shown.

Table 1: Important parameters for each graph benchmark file utilized.

Name              Vertices   Edges       Weighted  CCs  Biggest CC Size  Diameter(≥)
delaunay_n17      131072     393176      0         1    131072           163
delaunay_n18      262144     786396      0         1    262144           226
delaunay_n19      524288     1572823     0         1    524288           309
delaunay_n20      1048576    3145686     0         1    1048576          442
delaunay_n21      2097152    6291408     0         1    2097152          618
delaunay_n22      4194304    12582869    0         1    4194304          861
delaunay_n23      8388608    25165784    0         1    8388608          1206
delaunay_n24      16777216   50331601    0         1    16777216         1668
rgg_n_2_21_s0     2097148    14487995    0         4    2097142          1151
rgg_n_2_22_s0     4194301    30359198    0         2    4194299          1578
rgg_n_2_23_s0     8388607    63501393    0         4    8388601          2129
rgg_n_2_24_s0     16777215   132557200   0         1    16777215         3009
kron_g500-logn18  210155     10583222    1         8    210141           4
kron_g500-logn19  409175     21781478    1         27   409123           4
kron_g500-logn20  795241     44620272    1         45   795153           4
kron_g500-logn21  1544087    91042010    1         94   1543901          4

Testing was conducted on a 32 node cluster with FDR InfiniBand between the nodes; such interconnects are commonly found in high performance computers. Each node has two 10-core Intel Xeon E5-2650 v3 processors @ 2.30 GHz (20 cores per node) and 512 GB of DDR4 memory. Because Arkouda is designed primarily for data analysis in an HPC setting, an architecture setup that aptly fits an HPC scenario was chosen for testing.

For the R-MAT, Delaunay, KRON and RGG graphs, we first build the graphs into distributed memory based on our partition method and then execute the parallel BFS algorithm with different numbers of locales to check their performance (we cancel some tests whose execution time is too long, in order to keep the total experiment time reasonable). For the R-MAT graphs, we run our parallel implementation of the R-MAT algorithm to generate the graph each time. For the benchmark graphs, each locale reads the graph file in parallel using Chapel file IO and selects just the data that should be stored at that locale. After the graph data are ready in memory, we sort the edges and organize the graph based on our DI data structure. Furthermore, for the high level multi-locale algorithm, we show how a simple replacement of the data structure and parallel construct can affect the performance significantly.
5.2 Experimental Results
For large scale graph analytics, there are two major steps. The first is building the graph into memory at the Arkouda back-end. The second is conducting different analysis methods on the in-memory graph to gain insight from it. Here we use the parallel BFS algorithm to demonstrate how we can conduct analysis on large graphs. In this section, we provide the experimental results of our graph building and graph analyzing methods.

5.2.1 Graph Building. The experimental results in Fig. 3 to Fig. 10 show the graph building time and the building efficiency for different graphs in Arkouda. We can see in Fig. 7 and Fig. 9 that the building time increases linearly with the number of edges, no matter how many locales we use. However, it takes more time to handle the same number of edges with more locales. The reason lies in the data movement overhead among locales: more locales mean that more data movement between distributed memories is needed. The building efficiency curves in Fig. 8 and Fig. 10 are also almost perfectly flat. A flat line means that each core has the same efficiency no matter how many edges or how many cores are used. Our experiments show that the best construction efficiency, 1736 edges/second/core, is achieved on the RGG graphs. The lowest building efficiency is 255 edges/second/core for the largest R-MAT graph, because the R-MAT graphs need additional time to generate the graph itself.

Figure 3: Graph building time (R-MAT).
Figure 4: Graph building efficiency (R-MAT).
Figure 5: Graph building time (Delaunay).
Figure 6: Graph building efficiency (Delaunay).
Figure 7: Graph building time (KRON).
Figure 8: Graph building efficiency (KRON).
Figure 9: Graph building time (RGG).
Figure 10: Graph building efficiency (RGG).

We model the graph building time with the following multivariate nonlinear equation. Let E be the number of edges in the graph and L be the number of locales used to build the graph. The building time is

    T(E, L) = a × E/L + b × E × L + c

This model assumes that the computing time increases linearly with E/L and that the communication time increases linearly with the product E × L. For all the results, the RMSE (Root Mean Square Error) is less than 390 and the R-squared value is larger than 0.79, which means that more than 79% of the observed variation can be explained by the model's inputs. We can use the models to do some prediction. For example, for the com-friendster.ungraph.txt graph (https://snap.stanford.edu/data/com-Friendster.html), which has 1,806,067,135 edges, the predicted building time on 2 locales is 8.31 hours if we use the model fitted to the RGG data. It actually takes 8.5 hours to build the graph in memory, so the predicted value is very close to the measured value. However, if we use the models fitted to the Delaunay and KRON data, which have fewer edges, the predicted building time is much longer. The experimental results in Fig. 3 and Fig. 5 help explain what happens for different graphs. In Fig. 3, the curves for 8 and 16 locales show a good linear growth trend. However, the curve for 4 locales increases very fast as the number of edges grows, even though the total number of edges is smaller than in the KRON and RGG benchmarks. The major reason is that, compared with the benchmark graphs, the R-MAT graph building method has additional graph generation time. When a graph reaches the limit of its computing resources, or a heavy workload watermark, it cannot maintain a linear trend; a heavy load reduces each core's performance and efficiency. In Fig. 5, the Delaunay graphs' building time with 2 locales also increases fast when the number of edges is larger than 25,165,784. The reason is that the Delaunay graphs have many more vertices than the other benchmarks, so for the same number of edges the Delaunay benchmark needs more computation and memory. When a graph reaches the limit of the available hardware resources, the lack of resources also causes a loss in performance and efficiency; exceeding the resource bound is the major reason why the graph building efficiency decreases in Fig. 4 and Fig. 6. From all the figures, we can conclude that the graph building efficiency curve will first increase, then stay almost flat, and finally drop when the graph workload reaches the computing or memory limit of the given platform.

5.2.2 BFS Performance. In this part we focus on the performance comparison of different BFS implementations. We therefore use the deterministic graph benchmarks instead of R-MAT graphs, whose randomness can lead to different performance for the same method.
To evaluate the performance of the proposed high level multi-locale BFS algorithm and the low level algorithm, we implement the algorithm with different data structures and parallel constructs provided in Chapel to show the performance differences.

We use four different Delaunay benchmark graphs to show the performance of our BFS algorithms. Fig. 11 is used as an example to explain the meaning of the different algorithm implementations. On the x axis, M means the result of our manually optimized low level Alg. 2. BagL is the result of the high level multi-locale Alg. 1. BagG means that we remove line 11 of Alg. 1, so all locales search the whole frontier instead of only the vertices they own. SetL is the case where we just replace the high level data structure DistBag with Set in Alg. 1. Except for using the set data structure, SetG is similar to BagG. DomL and DomG are just like SetL and SetG except that we replace DistBag with Domain. In our high level multi-locale BFS algorithm framework, DistBag, Set and Domain can all provide the same function of holding the current frontier and the next frontier elements. At the same time, they also have the same or similar methods for using the data structure; for example, they all have an add function to add an element into the DistBag, Set or Domain.
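The kind of one-line substitution behind the BagL, SetL and DomL variants is sketched below (declarations only, illustrative rather than the exact Arkouda code); because all three containers support add, the rest of Algorithm 1 stays unchanged.

    use DistributedBag, Set;

    var bagFrontier = new DistBag(int, Locales);   // BagL: distributed bag
    var setFrontier: set(int, parSafe=true);       // SetL: parallel-safe set
    var domFrontier: domain(int);                  // DomL: associative domain (parallel-safe by default)

    bagFrontier.add(17);
    setFrontier.add(17);
    domFrontier.add(17);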

For all the high level multi-locale BFS methods in Fig. 11 to Fig. 14, the legend ForAll means that we use the forall parallel construct to expand the vertices at line 10 of Alg. 1, and the legend CoForAll means that we use the coforall parallel construct to expand the vertices. For the manually optimized low level method, the legend CoForAll means that we use the coforall parallel construct to expand the vertices owned by each locale at line 12 of Alg. 2, and the legend ForAll means that we use the forall parallel construct to expand the vertices owned by each locale.

Figure 11: BFS time (delaunay_n17).
Figure 12: BFS time (delaunay_n18).
Figure 13: BFS time (delaunay_n19).
Figure 14: BFS time (delaunay_n20).

From the experimental results in Fig. 11 to Fig. 14, we have the following observations. (1) For all the data structures DistBag, Set and Domain, the performance of the distributed parallel computing versions (BagL, SetL and DomL) is better than that of the shared computing versions (BagG, SetG and DomG). This is easy to understand: the shared computing versions perform a lot of duplicated computation and the distributed resources cannot be used efficiently. (2) For most of the distributed parallel computing versions, the performance of the forall parallel construct is better than that of the coforall parallel construct. The reason is that the size of our frontier (from hundreds to thousands and beyond) is relatively large compared with the number of parallel units (20 cores per node in our system). The coforall construct generates many parallel threads that cannot all run immediately, so the forall parallel construct, which only generates as many threads as there are cores, is more efficient. However, for our manually optimized low level version, the coforall implementation has better performance when the graph size is small, and the performance of forall catches up when the graph size becomes larger (see Fig. 14). The reason is that our low level implementation can avoid idle threads, and the number of parallel threads created by coforall is smaller than in the high level implementation (about 1/numLocales of the size of the current frontier). (3) For the different high level data structures (DistBag, Set and Domain), their optimized forall parallel performance is very close to each other. The major operation in our algorithm is adding an element into a set in parallel. Surprisingly, DistBag has not shown an obvious advantage in our preliminary tests. (4) Our manually optimized low level algorithm cannot achieve better performance than the high level algorithms. This means that Chapel's high level data structures (DistBag, Set and Domain) can implement the data insertion into a set and the communication among different locales with high performance.

The major advantage of our manually optimized low level implementation is that we can expand the vertices owned by different locales independently to build the next frontier without generating any idle threads. For the high level method, we have to create the same number of threads on each locale to check whether an element is owned by the local locale; if a vertex is not owned by the current locale, the corresponding thread becomes idle. The disadvantage of our low level method is that we have to create two additional sets to keep the local and remote elements, and we need to send the remote elements to their owners, which incurs additional cost.

The reverse Cuthill-McKee algorithm (RCM) [9] can reduce the bandwidth of a sparse matrix and improve the data access locality. We therefore employ the RCM method as a pre-processing step to relabel the vertices, and in this way we can improve the BFS performance. In Table 2 we give the experimental results with and without RCM pre-processing. In the "RCM" column, "N" means without RCM pre-processing and "Y" means with RCM pre-processing.

Table 2: Execution time of different BFS implementations.

Graph         Parallel Construct  RCM  M       BagL    BagG    SetL    SetG    DomL    DomG
delaunay_n17  CoForAll            N    22.20   16.87   32.28   18.84   33.05   17.18   32.06
delaunay_n17  CoForAll            Y    14.90   14.77   26.68   16.94   29.11   14.42   26.65
delaunay_n17  ForAll              N    63.42   14.28   44.14   13.97   26.99   14.20   27.02
delaunay_n17  ForAll              Y    24.28   10.85   33.75   12.02   21.85   12.16   21.85
delaunay_n18  CoForAll            N    48.57   33.76   64.58   43.08   70.55   34.25   63.84
delaunay_n18  CoForAll            Y    31.08   30.91   55.62   47.10   70.52   32.51   55.58
delaunay_n18  ForAll              N    155.39  28.37   87.79   27.59   53.58   28.26   54.07
delaunay_n18  ForAll              Y    37.56   23.37   73.45   25.28   43.52   25.58   44.05
delaunay_n19  CoForAll            N    110.93  68.72   131.04  102.32  156.55  69.08   128.39
delaunay_n19  CoForAll            Y    63.77   63.83   114.82  114.05  159.83  62.55   109.56
delaunay_n19  ForAll              N    453.23  56.54   175.88  55.62   107.17  56.49   107.56
delaunay_n19  ForAll              Y    69.90   46.23   141.92  49.65   86.68   50.27   86.50
delaunay_n20  CoForAll            N    259.44  139.16  265.08  255.28  361.99  138.98  258.44
delaunay_n20  CoForAll            Y    126.62  127.22  231.47  286.72  386.11  133.12  229.45
delaunay_n20  ForAll              N    305.01  125.89  387.61  120.19  236.20  123.91  236.66
delaunay_n20  ForAll              Y    172.16  92.87   293.59  99.46   176.49  101.05  176.03

We can see that the RCM method substantially improves the performance in almost all cases. The best performance is improved by about 1.24 times on the four benchmarks. We can also see that the best-performing configuration differs. Without RCM pre-processing, the Set data structure together with the forall parallel construct achieves the best performance among all the cases. With RCM pre-processing, the DistBag data structure together with the forall parallel construct achieves the best performance. It seems that the DistBag data structure is more sensitive to data locality.

Our experimental results also show that for very large graphs the coforall parallel construct can cause runtime errors because of limited resources. To avoid this problem, we can set a threshold value to switch between coforall and forall based on the total number of threads and the available resources. The performance results show that, within the same algorithm framework, we can select suitable data structures and parallel constructs to achieve much better performance in Chapel programming. So we can quickly optimize the performance, and this is the basic reason why we can develop parallel graph algorithms in Chapel in a productive and efficient way.

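A minimal sketch of such a coforall/forall switch is shown below; taskLimit is an assumed tunable (for example derived from here.maxTaskPar and the available memory), not a value taken from the paper.

    config const taskLimit = 1024;

    proc expandFrontier(frontier: [] int) {
      if frontier.size <= taskLimit {
        coforall v in frontier do processVertex(v);   // one task per vertex
      } else {
        forall v in frontier do processVertex(v);     // bounded number of tasks
      }
    }

    proc processVertex(v: int) {
      // expand v's neighbours as in Algorithm 1
    }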
6 RELATED WORK
Since BFS is a very basic graph algorithm, it has been investigated from many different aspects. For shared-memory BFS algorithms, Leiserson et al. [17] proposed a "bag" data structure to replace the shared queue and improve the parallelism in expanding the next frontier of vertices. Their "bag" is, of course, different from the DistBag in Chapel; however, we share the same idea of employing efficient data structures to support parallel algorithm design. They optimized the implementation of the reducer, and all their methods have been integrated into their Cilk++ compiler.

There are many approaches that exploit a specific hardware architecture. For example, Bader et al. [4] employed the fine-grained, low-overhead synchronization of the Cray MTA-2 computer to develop a load-balanced BFS algorithm using thousands of hardware threads. Mizell et al. [22] implemented the BFS algorithm on the 128-processor Cray XMT system.

There are also many BFS algorithms on GPUs. For example, Luo et al. [18] used a hierarchical data structure at the grid, block and warp levels to store and access the frontier vertices, and demonstrated that their algorithm is up to 10 times faster than the Harish-Narayanan algorithm on NVIDIA GPUs and low-diameter sparse graphs. Merrill et al. [20] used efficient prefix-sum computations to deliver high memory utilization and excellent performance on diverse graphs.
For very large graphs, distributed-memory BFS algorithms are necessary. Beamer et al. [5] proposed a hybrid strategy that combines "top-down" and "bottom-up" expansions. The basic idea is to employ the "top-down" expansion when the frontier has a small number of vertices, and otherwise use the "bottom-up" expansion to avoid searching too many edges. Azad et al. [2] employed a variant of the standard breadth-first search, the reverse Cuthill-McKee algorithm, to improve the performance; the Cuthill-McKee algorithm relabels the vertices of the graph to reduce the bandwidth of the adjacency matrix. Jiang et al. [16] used both the reverse Cuthill-McKee algorithm and SIMD execution to improve their BFS algorithm's performance. Fan et al. [12] employed several techniques, such as an asynchronous virtual ring method, a thread caching scheme and vertex ID reordering, to improve BFS performance.

The major difference between our work and the existing BFS algorithms is high algorithm design productivity and quick optimization. The basic idea of our graph algorithm design in Chapel is to take advantage of the high level data structures provided by Chapel to simplify the algorithm design, and to employ the parallel constructs provided by Chapel to exploit the parallelism. Our experimental results show that, with the same algorithm framework, a small change in data structure or parallel construct can cause very significant performance differences. The purpose of Chapel based graph algorithm design in Arkouda is productive algorithm design and quick performance optimization to support exploratory data analysis at scale.

7 CONCLUSION
Large graph analytics is a challenging problem, and in this paper we present our preliminary work to show how the open source framework Arkouda can be used to handle it. The advantage of Arkouda lies in two aspects: high productivity and high performance. High productivity means that end users can use a popular EDA language such as Python to gain insight from large scale graphs. High performance means that end users can break through the limits of their laptops' and personal computers' computing capabilities to handle very large graphs using Chapel at the server side. Taking advantage of the high level parallel programming language Chapel, the high performance solution is also highly productive. This means that we can design and develop high performance graph analysis algorithms with Chapel quickly and efficiently.

Based on the basic array data structure in Arkouda, we define a double-index graph data representation that exploits the different kinds of array operators in Arkouda. At the same time, our graph representation enables edge and vertex locating with O(1) time complexity. We have implemented the double-index graph data structure in Arkouda. We developed a typical graph algorithm, breadth-first search, in the high level parallel language Chapel to support basic graph analysis in Arkouda. The proposed BFS algorithms show that we can develop parallel algorithms and optimize their performance based on the same algorithm framework in Chapel efficiently. Selecting suitable data structures and parallel constructs can significantly improve the algorithm performance in Chapel.

This work shows that Arkouda is a promising framework to support large scale graph analytics. Of course, the reported work is only the first step in evaluating the feasibility and performance of Arkouda based large graph analytics. In future work, we will provide more graph algorithms and further optimize the performance of our algorithms in Arkouda. At the same time, we will compare our method with other approaches.

ACKNOWLEDGMENTS
We appreciate the help from Brad Chamberlain, Elliot Joseph Ronaghan, Engin Kayraklioglu, David Longnecker and the Chapel community when we integrated the algorithms into Arkouda. This research was funded in part by NSF grant number CCF-2109988.

REFERENCES
[1] Lada A Adamic, Bernardo A Huberman, AL Barabási, R Albert, H Jeong, and G Bianconi. 2000. Power-law distribution of the world wide web. Science 287, 5461 (2000), 2115–2115.
[2] Ariful Azad, Mathias Jacquelin, Aydin Buluç, and Esmond G Ng. 2017. The reverse Cuthill-McKee algorithm in distributed-memory. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 22–31.
[3] David A. Bader, Andrea Kappes, Henning Meyerhenke, Peter Sanders, Christian Schulz, and Dorothea Wagner. 2017. Benchmarking for Graph Clustering and Partitioning. Encyclopedia of Social Network Analysis and Mining (2017), 1–11. https://doi.org/10.1007/978-1-4614-7163-9_23-1
[4] David A Bader and Kamesh Madduri. 2006. Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2. In 2006 International Conference on Parallel Processing (ICPP'06). IEEE, 523–530.
[5] Scott Beamer, Aydin Buluc, Krste Asanovic, and David Patterson. 2013. Distributed memory breadth-first search revisited: Enabling bottom-up search. In 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum. IEEE, 1618–1627.
[6] John T Behrens. 1997. Principles and procedures of exploratory data analysis. Psychological Methods 2, 2 (1997), 131.
[7] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. 2004. R-MAT: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining. SIAM, 442–446.
[8] Bradford L Chamberlain, Elliot Ronaghan, Ben Albrecht, Lydia Duncan, Michael Ferguson, Ben Harshbarger, David Iten, David Keaton, Vassily Litvinov, Preston Sahabu, et al. 2018. Chapel comes of age: Making scalable programming productive. Cray User Group (2018).
[9] Elizabeth Cuthill and James McKee. 1969. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference. 157–172.
[10] Samrat K Dey, Md Mahbubur Rahman, Umme R Siddiqi, and Arpita Howlader. 2020. Analyzing the epidemiological outbreak of COVID-19: A visual exploratory data analysis approach. Journal of Medical Virology 92, 6 (2020), 632–638.
[11] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. 1999. On power-law relationships of the internet topology. ACM SIGCOMM Computer Communication Review 29, 4 (1999), 251–262.
[12] Dongrui Fan, Huawei Cao, Guobo Wang, Na Nie, Xiaochun Ye, and Ninghui Sun. 2020. Scalable and efficient graph traversal on high-throughput cluster. CCF Transactions on High Performance Computing (2020), 1–13.
[13] Irving J Good. 1983. The philosophy of exploratory data analysis. Philosophy of Science 50, 2 (1983), 283–295.
[14] Pieter Hintjens. 2013. ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc.
[15] Andrew T Jebb, Scott Parrigon, and Sang Eun Woo. 2017. Exploratory data analysis as a foundation of inductive research. Human Resource Management Review 27, 2 (2017), 265–276.
[16] Zite Jiang, Tao Liu, Shuai Zhang, Zhen Guan, Mengting Yuan, and Haihang You. 2020. Fast and Efficient Parallel Breadth-First Search with Power-law Graph Transformation. arXiv preprint arXiv:2012.10026 (2020).
[17] Charles E Leiserson and Tao B Schardl. 2010. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures. 303–314.
[18] Lijuan Luo, Martin Wong, and Wen-mei Hwu. 2010. An effective GPU implementation of breadth-first search. In Design Automation Conference. IEEE, 52–55.
[19] Theo Lynn, Pierangelo Rosati, Binesh Nair, and Ciarán Mac an Bhaird. 2020. An Exploratory Data Analysis of the #Crowdfunding Network on Twitter. Journal of Open Innovation: Technology, Market, and Complexity 6, 3 (2020), 80.
[20] Duane Merrill, Michael Garland, and Andrew Grimshaw. 2015. High-performance and scalable GPU graph traversal. ACM Transactions on Parallel Computing (TOPC) 1, 2 (2015), 1–30.
[21] Michael Merrill, William Reus, and Timothy Neumann. 2019. Arkouda: interactive data exploration backed by Chapel. In Proceedings of the ACM SIGPLAN 6th Chapel Implementers and Users Workshop. 28–28.
[22] D Mizell and K Maschhoff. 2009. Early experiences with large-scale XMT systems. In Proc. Workshop on Multithreaded Architectures and Applications (MTAAP'09).
[23] Michael J Quinn and Narsingh Deo. 1984. Parallel graph algorithms. ACM Computing Surveys (CSUR) 16, 3 (1984), 319–348.
[24] William Reus. 2020. CHIUW 2020 Keynote Arkouda: Chapel-Powered, Interactive Supercomputing for Data Science. In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 650–650.
[25] Guido Rossum. 1995. Python reference manual. (1995).
[26] Andrew T Stephen and Olivier Toubia. 2009. Explaining the power-law degree distribution in a social commerce network. Social Networks 31, 4 (2009), 262–270.
[27] John W Tukey. 1977. Exploratory Data Analysis. Vol. 2. Reading, MA.