Big Data Architecture in Radio Astronomy: The Effectiveness of the Hadoop/Hive/Spark Ecosystem in Data Analysis of Large Astronomical Data Collections

Geoffrey Duniam, B.App.Sci.

This thesis is presented for the degree of Master of Philosophy (Research) of The University of Western Australia

The School of Computer Science and Software Engineering
The International Centre for Radio Astronomy Research

July 20, 2017


Thesis Declaration

I, Geoffrey Duniam, certify that:

This thesis has been substantially accomplished during enrolment in the degree.

This thesis does not contain material which has been accepted for the award of any other degree or diploma in my name, in any university or other tertiary institution.

No part of this work will, in the future, be used in a submission in my name for any other degree or diploma in any university or other tertiary institution without the prior approval of The University of Western Australia and, where applicable, any partner institution responsible for the joint award of this degree.

This thesis does not contain any material previously published or written by another person, except where due reference has been made in the text.

The work(s) are not in any way a violation or infringement of any copyright, trademark, patent, or other rights whatsoever of any person.

This thesis contains published work and/or work prepared for publication, some of which has been co-authored.

Signature                Date


Abstract

In this study, alternatives to the classical High Performance Computing (HPC) environment (MPI[i]/OpenMP[ii]) are investigated for large-scale astronomy data analysis. Designing and implementing a classical HPC analysis using OpenMP and MPI can be a complex process, requiring advanced programming skills that many researchers may not have. Frameworks that offer access to very large datasets while abstracting away the complexities of designing parallel processing tasks allow researchers to concentrate on specific analysis problems without having to invest time in acquiring advanced programming skills. The Spark/Hive/Hadoop ecosystem is one such platform. Combined with astronomy-specific, Python-based machine learning libraries, this framework was tested with a range of benchmarking exercises over very large collections of data. The framework was found to be very effective: although it may not outperform MPI/OpenMP, it offers reliability, elasticity, scalability and ease of use.

[i] http://mpi-forum.org/
[ii] http://www.openmp.org/
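As an illustration of the abstract's central claim, the following minimal sketch shows how a clustering analysis can be expressed on the Spark/Hive stack without any explicit parallel programming; the equivalent MPI/OpenMP implementation would require hand-written data distribution and synchronisation. This sketch is not taken from the thesis: the Hive table name (detections) and column names (ra, dec, w_50) are hypothetical stand-ins, and the Spark 1.x-era HiveContext/MLlib API is assumed.

    # Minimal sketch (hypothetical table and columns): read a Hive table
    # into an RDD and cluster it with MLlib KMeans. Spark distributes the
    # map and training steps across the cluster automatically.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="kmeans_sketch")
    hc = HiveContext(sc)

    # Hypothetical detection table; ra, dec and w_50 stand in for the
    # real source-finder output columns.
    rows = hc.sql("SELECT ra, dec, w_50 FROM detections")

    # Convert each Row into a plain feature vector and cache the RDD so
    # the iterative KMeans training does not re-read the table.
    features = rows.rdd.map(lambda r: [float(r[0]), float(r[1]), float(r[2])])
    features.cache()

    model = KMeans.train(features, k=5, maxIterations=20)
    print(model.clusterCenters)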
Contents

Thesis Declaration
Abstract
Acknowledgements
Authorship Declaration
Dedication

1 Introduction
  1.1 Technical landscape of Big Data in astronomy
  1.2 High Performance Computing in Scientific Analysis
  1.3 Hadoop Ecosystem

2 Methodology
  2.1 Test methodology
  2.2 Datasets
  2.3 Cluster Architecture
  2.4 Hive tables
    2.4.1 External tables
    2.4.2 Internal tables
    2.4.3 Partitioning
    2.4.4 Table partitions
    2.4.5 Internal table formats and compression codecs
    2.4.6 Test table design
    2.4.7 Test table data extracts
    2.4.8 Hive user interfaces
  2.5 Python
  2.6 Test Framework
    2.6.1 KMeans
    2.6.2 Kernel Density Estimation (KDE)
    2.6.3 Principal Component Analysis (PCA)
    2.6.4 Non-embarrassingly parallel problems
    2.6.5 RDD Creation
    2.6.6 Spark process settings
  2.7 Benchmark framework
    2.7.1 RDD creation testing
    2.7.2 Full table scan testing
    2.7.3 Partition testing
    2.7.4 Correlation testing
    2.7.5 Java on Spark

3 Results
  3.1 Writing file data to HDFS
  3.2 RDD Creation
    3.2.1 HDFS and Hive baseline I/O read rates
    3.2.2 Full table scans
    3.2.3 Partition based table scan
    3.2.4 Partition based scan with grouping
  3.3 Python Machine Learning test programs
    3.3.1 KMeans
    3.3.2 Kernel Density Estimation
    3.3.3 Principal Component Analysis
    3.3.4 Correlation
  3.4 Java on Spark

4 Discussion
  4.1 RDD Creation - HDFS vs Hive context calls
  4.2 Snapshot Generation
  4.3 Correlation testing
  4.4 Data Compression
  4.5 Performance comparisons
    4.5.1 Cluster I/O comparisons
    4.5.2 Response times
  4.6 Hive Partitioning
    4.6.1 Hive explain plans
  4.7 Usability
  4.8 Tuning Spark jobs

5 Conclusions
  5.1 Findings
  5.2 Future work

Appendices
A Supplementary Material - Detection and Parameter File formats and raw data examples
  A.1 Detection file structure
  A.2 Parameter file structure
  A.3 Duchamp output file example
  A.4 Detection file example
  A.5 Parameter file example
B Supplementary Material - Final Virtual Cluster Configuration
C Supplementary Material - Hive test table definition
D Supplementary Material - Hive internal tables
  D.1 Creation scripts
    D.1.1 ORC format tables, zlib compression
    D.1.2 ORC format tables, snappy compression
    D.1.3 Parquet format tables
    D.1.4 RC format tables
    D.1.5 Text based internal table creation
  D.2 Population scripts
    D.2.1 ORC, RC File and text based tables
    D.2.2 Parquet tables
E Supplementary Material - Hive external tables
  E.1 Creating a non-partitioned Hive external table
  E.2 Creating a partitioned Hive external table
  E.3 Populating a partitioned Hive external table
  E.4 Compressing Hive external table data
F Supplementary Material - Hive explain plans
G Supplementary Material - Python Library Dependencies
H Supplementary Material - Python Code Listings
  H.1 KMeans analysis
  H.2 Kernel Density Estimation
  H.3 Principal Component Analysis
  H.4 Correlation analysis
I Supplementary Material - Hive QL for Correlation Analysis
  I.1 Creating and populating the base table for Correlation analysis
  I.2 Creating the baseline wavelength table
  I.3 Creating the fine grained position data
  I.4 Creating the Problem Space
  I.5 Creating the wavelength histogram data

Glossary
List of Acronyms
Bibliography


Acknowledgements

I would like to gratefully acknowledge the support and guidance I received from my supervisors, Prof. Amitava Datta and Prof. Slava Kitaeff.

I would like to acknowledge the Pawsey Supercomputing Centre and the National eResearch Collaboration Tools and Resources project (NeCTAR), which provided the infrastructure and support for this project.

Without the assistance and ongoing support of Mr. Chris Bording of the Faculty of Engineering, Computing and Mathematics, The University of Western Australia, Mr. Mark Gray of the Pawsey Supercomputing Centre, and the NeCTAR support staff, this study would not have been possible, and I gratefully acknowledge their support.

I thank Mr. Kevin Vinson at theSkyNet for providing access to the output files from the Duchamp source finder for the ASKAP Deep HI survey (DINGO), the Galactic ASKAP (GASKAP) survey and the HI Parkes All-Sky Survey (HIPASS) data used in this study, and for the Java application used to extract the discrete parameter and detection files.

I would also like to acknowledge the assistance and support of the International Centre for Radio Astronomy Research, its Data Intensive Astronomy Group, and the University of Western Australia.