Architecture in Radio Astronomy: The Effectiveness of the Hadoop/Hive/Spark ecosystem in data analysis of large astronomical data collections

Geoffrey Duniam, B.App.Sci.

This thesis is presented for the degree of Master of Philosophy (Research) of The University of Western Australia

The School of Computer Science and Software Engineering
The International Centre for Radio Astronomy Research

July 20, 2017

Thesis Declaration

I, Geoffrey Duniam, certify that:

This thesis has been substantially accomplished during enrolment in the degree.

This thesis does not contain material which has been accepted for the award of any other degree or diploma in my name, in any university or other tertiary institution.

No part of this work will, in the future, be used in a submission in my name, for any other degree or diploma in any university or other tertiary institution without the prior approval of The University of Western Australia and, where applicable, any partner institution responsible for the joint-award of this degree.

This thesis does not contain any material previously published or written by another person, except where due reference has been made in the text.

The work(s) are not in any way a violation or infringement of any copyright, trademark, patent, or other rights whatsoever of any person.

This thesis contains published work and/or work prepared for publication, some of which has been co-authored.

Signature Date

Abstract

In this study, alternatives to the classical High Performance Computing environment (MPIi/OpenMPii) are investigated for large scale astronomy data analysis. Designing and implementing a classical HPC analysis using OpenMP and MPI can be a complex process, requiring advanced programming skills that many researchers may not have. Frameworks that offer access to very large datasets while abstracting the complexities of designing parallel processing tasks allow researchers to concentrate on specific analysis problems without having to invest time in acquiring advanced programming skills. The Spark/Hive/Hadoop ecosystem is one such platform.

Combined with astronomy-specific, Python-based machine learning libraries, this framework was then tested with a range of benchmarking exercises in the context of the analysis of very large collections of data. The framework was found to be very effective: although it may not outperform MPI/OpenMP, it offers reliability, elasticity, scalability and ease of use.

i http://mpi-forum.org/
ii www.openmp.org/

Contents

Thesis Declaration i

Abstract ii

Acknowledgements vii

Authorship Declaration viii

Dedication x

1 Introduction 1

1.1 Technical landscape of Big Data in astronomy ...... 1

1.2 High Performance Computing in Scientific Analysis ...... 4

1.3 Hadoop Ecosystem ...... 4

2 Methodology 8

2.1 Test methodology ...... 8

2.2 Datasets ...... 9

2.3 Cluster Architecture ...... 11

2.4 Hive tables ...... 13

2.4.1 External tables ...... 14

2.4.2 Internal Tables ...... 14

2.4.3 Partitioning ...... 14

2.4.4 Table partitions ...... 14

2.4.5 Internal table formats and compression codecs ...... 15

2.4.6 Test table design ...... 16

2.4.7 Test table data extracts ...... 17

2.4.8 Hive user interfaces ...... 19

2.5 Python ...... 19

2.6 Test Framework ...... 19

2.6.1 KMeans ...... 20

2.6.2 Kernel Density Estimation (KDE) ...... 20

2.6.3 Principal Component Analysis (PCA) ...... 20

2.6.4 Non-embarrassingly parallel problems ...... 21

2.6.5 RDD Creation ...... 22

2.6.6 Spark process settings ...... 26

2.7 Benchmark framework ...... 26

2.7.1 RDD creation testing ...... 26

2.7.2 Full table scan testing ...... 27

2.7.3 testing ...... 27

2.7.4 Correlation testing ...... 27

2.7.5 Java on Spark ...... 27

3 Results 29

3.1 Writing file data to HDFS ...... 29

3.2 RDD Creation ...... 29

3.2.1 HDFS and Hive baseline I/O read rates ...... 29

3.2.2 Full table scans ...... 30

3.2.3 Partition based table scan ...... 31

3.2.4 Partition based scan with grouping ...... 32

3.3 Python Machine Learning test programs ...... 32

3.3.1 KMeans ...... 33

3.3.2 Kernel Density Estimation ...... 34

3.3.3 Principal Component Analysis ...... 37

3.3.4 Correlation ...... 39

3.4 Java on Spark ...... 49

4 Discussion 50

4.1 RDD Creation - HDFS Vs Hive context calls ...... 50

4.2 Snapshot Generation ...... 51

4.3 Correlation testing ...... 52

4.4 Data Compression ...... 53

4.5 Performance comparisons ...... 53

4.5.1 Cluster I/O comparisons ...... 53

4.5.2 Response times ...... 54

4.6 Hive Partitioning ...... 54

4.6.1 Hive explain plans ...... 55

4.7 Usability ...... 56

4.8 Tuning Spark jobs ...... 56

5 Conclusions 57

5.1 Findings ...... 57

5.2 Future work ...... 58

Appendices 61

A Supplementary Material - Detection and Parameter File formats and raw data examples 62

A.1 Detection file structure ...... 62

A.2 Parameter file structure ...... 63

A.3 Duchamp output file example ...... 65

A.4 Detection file example ...... 68

A.5 Parameter file example ...... 70

B Supplementary Material - Final Virtual Cluster Configuration 71

C Supplementary Material - Hive test table definition 73

D Supplementary Material - Hive internal tables 74

D.1 Creation scripts ...... 74

D.1.1 ORC format tables, zlib compression ...... 74

D.1.2 ORC format tables, snappy compression ...... 75

D.1.3 Parquet format tables ...... 75

D.1.4 RC Format tables ...... 76

D.1.5 Text based internal table creation ...... 77

D.2 Population Scripts ...... 77

D.2.1 ORC, RC File and text based tables ...... 77

D.2.2 Parquet tables ...... 78

E Supplementary Material - Hive External tables 81

E.1 Creating a non-partitioned Hive external table ...... 81

E.2 Creating a partitioned Hive external table ...... 83

E.3 Populating a partitioned Hive external table ...... 84

E.4 Compressing Hive External Table Data ...... 85

F Supplementary Material - Hive explain plans 88

G Supplementary Material - Python Library Dependencies 92

H Supplementary Material - Python Code Listings 94

H.1 KMeans analysis ...... 95

H.2 Kernel Density Estimation ...... 101

H.3 Principal Component Analysis ...... 111

H.4 Correlation analysis ...... 121

I Supplementary Material - Hive QL for Correlation Analysis 133

I.1 Creating and populating the base table for Correlation analysis ...... 133

I.2 Creating the baseline wavelength table ...... 134

I.3 Creating the fine grained position data ...... 134

I.4 Creating the Problem Space ...... 135

I.5 Creating the wavelength histogram data ...... 136

Glossary 138

List of Acronyms 140

Bibliography 141

Acknowledgements

I would like to gratefully acknowledge the support and guidance I received from my supervisors, Prof. Amitava Datta and Prof. Slava Kitaeff.

I would like to acknowledge the Pawsey Supercomputer Centre and the National eResearch Collaboration Tools and Resources project (NeCTAR), which provided the infrastructure and support for this project.

Without the assistance and ongoing support of Mr. Chris Bording at the Faculty of Engineering, Computing and Mathematics, The University of Western Australia, Mr. Mark Gray of the Pawsey Supercomputer Centre, and the Nectar support staff, this study would not have been possible, and I gratefully acknowledge their support.

I thank Mr. Kevin Vinson at theSkyNet for providing access to the output files from the Duchamp source finder for the ASKAP Deep HI survey (DINGO), the Galactic ASKAP (GASKAP) survey and the HI Parkes All-Sky Survey (HIPASS) data used in this study, and the Java application used to extract the discrete parameter and detection files.

I would like to acknowledge the assistance and support from the International Centre for Radio Astronomy Research, its Data Intensive Astronomy Group, the University of Western Australia and the School of Computer Science and Software Engineering.

Mr. Peter Ward provided a sounding board for some of the preliminary architectural ideas and over the course of a long discussion helped me work out what might eventually be possible.

Sunglasses.

Finally, none of this would have been possible without the wholehearted support and encouragement of my wife, Karen. Thank you.

Authorship Declaration

This thesis contains work that has been prepared for publication, some of which has been co-authored.

Details of the work: Abstract and introduction statements

Location in Thesis: Abstract, Introduction

Student contribution to work: Wrote the initial drafts. Co-author provided references for the JWST, LSST, source finders and references to computationally intensive machine learning papers in the introduction chapter.

Dr. Slava Kitaeff

Co-author Signature Date

Details of the work: Problem statement, experiment design, tests, analysis, discussion, and conclusions

Location in Thesis: Introduction, Methodology, Results, Discussion and Conclusions chapters

Student contribution to work: Wrote the sections on commercial VLDBs, described previous tests on Spark, defined the problem statement, designed and tested the experiments, collected the raw data logs, prepared the charts and graphs, analysed the results and wrote the conclusions.

Dr. Slava Kitaeff

Co-author Signature Date

Student Signature Date

I hereby certify that the student statements regarding their contribution to each of the works listed above are correct.

Coordinating Supervisor Signature Date

Dedication

To the three best managers I ever had the pleasure of working for - Geoff Hill, Peter Applegate and Nigel Ridgeon.

I’ve also had the privilege of working with some of the finest developers and analysts anywhere - Geoff Hoyle, Caroline Little, Stu Frater, Ash Aldridge, Jon Russell, Steve Pearson, Gary Perkins and Mark Springett. Thank you all.

And finally, to production DBAs and system administrators everywhere (specifically Sam, Steve, John, Salma and Muj). If we all listened to what you have to say, and designed our systems accordingly, the IT world would undoubtedly be a far, far better place.

List of Tables

2.1 Survey raw data used from theSkyNet ...... 10

2.2 Common compression formats for HDFS Files ...... 10

2.3 Survey file sizes in HDFS in GB ...... 11

2.4 Nectar Virtual Machine Configurations ...... 12

2.5 Test table storage formats and compression...... 16

2.6 Hive table formats and compression codecs - Size in GB ...... 17

2.7 DINGO raw survey record counts ...... 18

2.8 Test table record counts ...... 18

3.1 Spark extraction process test ...... 49

A.1 Raw Detection file structure ...... 63

A.2 Raw Parameter file structure ...... 64

B.1 Current Cluster Configuration - Master nodes ...... 71

B.2 Current Cluster Configuration - Worker nodes ...... 72

C.1 Test table definitions ...... 73

G.1 Python Machine learning library installation dependencies - (table format) . . . . 92

List of Figures

2.1 Evaluation Framework High Level Architecture ...... 12

2.2 Hive Table Partition Schematic ...... 15

2.3 Creating a Hive based RDD in Python ...... 23

2.4 Creating a HDFS based RDD in Python ...... 24

2.5 Hive QL statement for a partition based table scan with grouping...... 25

2.6 Hive RDD creation for HDFS based files with grouping...... 26

3.1 Baseline HDFS and Hive I/O Rates...... 30

3.2 Full scan response times in seconds...... 31

3.3 Partition based scan response time in seconds...... 31

3.4 Partition based scan with grouping response time in seconds...... 32

3.5 Hive RDD creation for Kmeans testing...... 33

3.6 KMeans analysis - total elapsed time in seconds...... 33

3.7 KMeans analysis - Network I/O traffic for a full scan of HDFS data...... 34

3.8 KMeans analysis - Network I/O lifecycle traffic for a full scan of HDFS data. . . . 34

3.9 Hive RDD creation for KDE full scan testing...... 35

3.10 KDE full scan analysis - total elapsed time in seconds (default and optimised Spark settings)...... 35

3.11 KDE analysis - Network Receive I/O traffic for a full table scan...... 36

3.12 KDE analysis - Network Transmit I/O traffic for a full table scan...... 36

3.13 KDE analysis - Network Receive I/O lifecycle traffic...... 37

3.14 KDE analysis - Network Transmit I/O lifecycle traffic...... 37

3.15 Hive RDD creation for PCA testing...... 38

3.16 PCA analysis, three partitions - total elapsed time in seconds (default and optimised Spark settings)...... 38

3.17 Correlation analysis elapsed time - thread pool size against histogram array size. . 39

3.18 Correlation analysis memory allocation - 10,000 element array against Pyspark thread pool size...... 40

3.19 Correlation analysis network I/O (receive) - 10,000 element array against Pyspark thread pool size...... 40

3.20 Correlation analysis network I/O (transmit) - 10,000 element array against Pyspark thread pool size...... 41

3.21 Correlation analysis CPU utilisation - 10,000 element array against Pyspark thread pool size...... 41

3.22 Correlation analysis elapsed time - 56 thread processes, array sizes for arrays 15,000 - 80,000 elements...... 42

3.23 Correlation analysis memory allocation - 56 thread processes, array sizes 15,000 - 30,000 elements...... 42

3.24 Correlation analysis network I/O (receive rate) - 56 thread processes, array sizes 15,000 - 30,000 elements...... 43

3.25 Correlation analysis thread network I/O (transmit rate) - 56 thread processes, array sizes 15,000 - 30,000 elements...... 43

3.26 Correlation analysis CPU utilisation - 56 thread processes, array sizes 15,000 - 30,000 elements...... 44

3.27 Correlation analysis memory - 56 thread processes, array sizes for arrays 40,000 - 80,000 elements...... 44

3.28 Correlation analysis network I/O (receive rate) - 56 thread processes, array sizes 40,000 - 80,000 elements...... 45

3.29 Correlation analysis thread network I/O (transmit rate) - 56 thread processes, array sizes 40,000 - 80,000 elements...... 45

3.30 Correlation analysis CPU utilisation - 56 thread processes, array sizes 40,000 - 80,000 elements...... 46

3.31 Correlation analysis thread pool testing - Elapsed times for 2k - 6k correlation comparisons...... 46

3.32 Correlation analysis thread pool testing - CPU Utilisations for 2k - 6k correlation comparisons...... 47

3.33 Correlation analysis thread pool testing - Memory allocation for 2k - 6k correlation comparisons...... 47

3.34 Correlation analysis thread pool testing - Network I/O (receive) for 2k - 6k correlation comparisons...... 48

3.35 Correlation analysis thread pool testing - Network I/O (transmit) for 2k - 6k correlation comparisons...... 48

D.1 Creation script - Hive internal table, ORC format, zlib compression ...... 75

D.2 Creation script - Hive internal table, ORC format, snappy compression ...... 75

D.3 Creation script - Hive internal table, Parquet format ...... 76

D.4 Creation script - Hive internal table, RCFile format ...... 77

D.5 Creation script - Hive internal table, text format ...... 77

D.6 Population script - Hive internal table, ORC, RC File and Text format ...... 78

D.7 Population script - Hive internal table, Parquet format ...... 79

E.1 Creation script - non partitioned Hive external table ...... 82

E.2 Creation script - partitioned Hive external table ...... 84

E.3 Adding a file definition to a Hive external table ...... 84

E.4 Pig compression script example for HDFS file data ...... 87

F.1 Hive query with no explicit partition call ...... 88

F.2 Explain plan extract for full table scan ...... 89

F.3 Hive query with explicit partition call ...... 89

F.4 Explain plan extract for explicit partition calls ...... 90

F.5 Hive query with explicit partition call ...... 90

F.6 Explain plan extract for explicit partition calls ...... 91

G.1 Python Machine learning library installation dependencies ...... 93

H.1 Python Test Program - KMeans Analysis ...... 100

H.2 Python Test Program - Kernel Density Estimation ...... 111

H.3 Python Test Program - Principal Component Analysis ...... 121

H.4 Python Test Program - Correlation Analysis ...... 132

I.1 Creation DDL for CorrelationTest table ...... 134

I.2 Population DML for CorrelationTest table ...... 134

I.3 Creation DDL for baseline wavelength table ...... 134

I.4 Creation DML for fine grained position data table ...... 135

I.5 Creation DML for the problem space table ...... 136

I.6 Hive QL statement to extract wavelength histogram data ...... 137

Chapter 1

Introduction

The planned commissioning of the Square Kilometre Array (SKA), the James Webb Space Telescope (JWST) [1], the Large Synoptic Survey Telescope (LSST) [2], and other very capable astronomy instruments brings astronomy into an era of big data, in which duplicating and moving the data for analysis carries an unacceptably high cost. Cornwell [3] and Alexander [4] estimate that the SKA will collect between 100 PB and 4 EB of data for a one-year redshifted hydrogen survey alone. Source finders [5–7] will extract the detected objects into catalogues, which are then to be analysed and knowledge extracted using various statistical and algorithmic methods, including machine learning and deep learning, which tend to be very computationally, I/O and memory intensive [8–11].

The influx of these very large datasets necessitates the development of storage and analysis frameworks that are robust, cost-effective and simple enough to use if the astronomical community is to maximise the extraction of information and knowledge from this data. Scientific researchers will, in many cases, be unable to use very large data archives to search and download data snapshots for local or offline processing, simply because of the size of the data products. In order to maximise the value of these very large datasets, an investigation is needed into processing and data storage frameworks that allow in-place processing and analysis of datasets at the peta and exa scale. This thesis examines one potential framework.

1.1 Technical landscape of Big Data in astronomy

Research into large scientific databases is ongoing, and a number of approaches have already been explored. For example, the Sloan Digital Sky Survey was initially built as an Object Oriented database [12] and then redesigned into a relational database system due to poor search and query performance [13, 14]. Similarly, an Object Oriented database system was proposed for the Large Hadron Collider [15]; however, a hybrid architecture was implemented, based around a massive Oracle RDBMS RAC clusteri [16] to hold and index the metadata, with the bulk raw data stored in the ROOT modular scientific framework [17]. Duellmann [18] describes the evolution of database architecture at CERN, where Object Oriented database design architectures, despite being a tight match with object programming models, were dropped due to stagnation of the Object Oriented database market.

Another very large scientific database for the LSST is being designed in a similar manner to the data systems at the Large Hadron Collider. The LSST will produce an estimated half petabyte of image data per month; image files will be stored in flat files, and metadata and catalogue data will be stored in a relational database [19]. The LSST design team also make the point that while an Object DBMS would be a better technical fit, the use of a technology that is not widely accepted, used or supported is too much of a risk.

The Netherlands based Low-Frequency Array (LOFAR)ii, completed in 2012, has generated a large data archive of at least 25 PB. The LOFAR Long Term Archive is an adaptation of the Astro-WISE distributed information system [20]. In a similar fashion to the Large Hadron Collider database architecture described above, catalogue and metadata is stored on an Oracle RDBMS RAC cluster [21] which is connected to a complex federated file server containing hundreds of terabytes of raw data [22]. The top level software in the Astro-WISE system is written in Python, which then has the ability to call C libraries [21]. Interestingly, the Astro-WISE system has adopted an object based model, with these models being stored on a relational database using object oriented user-defined types (which are supported by some commercial RDBMS systems) [21]. Data selection and analysis is accomplished with the Python based Astro-WISE Environment.

The Atacama Large Millimeter/Submillimeter Array (ALMA)iii and the Montage Image Mosaic Engine iv are examples of currently existing astronomical data archives. Montage is described as “a toolkit for assembling astronomical images into mosaics”. These mosaics are created in FITS v vi format from data released from sources including the Spitzer and Hubble space telescopes, the Infrared Astronomical Satellite (IRAS), the Sloan Digital Sky Survey and ground based telescopes. While a large distributed system, Montage is an archive primarily for the generation of astronomical images in FITS format.

The ALMA archive does provide some offline processing capabilities in regional centres; however, to all intents and purposes it is purely a passive archive, albeit one with quite sophisticated query capabilities [24].

i Oracle Real Application Cluster functionality built on the Oracle Relational Database Management System. Refer to the glossary.
ii http://www.lofar.org
iii http://almaobservatory.org/
iv http://montage.ipac.caltech.edu/
v https://fits.gsfc.nasa.gov/fits_home.html
vi http://montage.ipac.caltech.edu/docs/supported.html

Another important aspect is the protocols and interfaces for accessing the data in the archives, and the frameworks that enable the data analyses. The International Virtual Observatory Alliance (IVOA) [25] and the Virtual Observatory, founded in 2002, have developed such standards to facilitate the development of tools and systems that provide astronomers with a unified view of many astronomical data archivesvii in “a single transparent system”. Standardised data services provide consistent access to hundreds of data collections to search and extract data, which can then be analysed locally. This approach, however, will begin to encounter problems when the data collections to be analysed exceed the capability of the system resources on a local machine (I/O, memory/CPU, and bandwidth) to cope with the processing loads. As we have seen, some estimated survey sizes from the SKA will also be far too large to be successfully stored and analysed on a single database. Therefore, an investigation is necessary into the architecture of distributed storage and parallel analysis of very large astronomy databases.

Very large commercial data storage and analysis systems have been developed in recent yearsviii. However, commercial big data implementations may not have the same requirements as scientific and astronomical data analysis, as business big data implementations are mainly concerned with analysis and product development [26]. Much of the data processing in a commercial environment has short life-cycle value, so the processing speed of the data stream becomes vital in order to assist real-time decision making.

Many commercial applications of Very Large Data Base (VLDB) technology address the issue of the short effective life-cycleix of incoming data, as well as the rate at which the raw data is produced, which is sometimes described as the speed of the feedback loop [26]. This requires the use of sophisticated data streaming or event processing technologies to cope with the volume and velocity of the incoming raw data. Such a framework can be potentially problematic for a scientific database.

It is common commercial practice to hold fine grained, detailed data only for a limited period. As the data ages it is aggregated into summary datasets of a coarser granularity (deleting the underlying fine grained data after the aggregations have been produced). Once a set of data reaches a retention date where the data is no longer required (either legally or operationally), it is deleted.

Usually, scientific raw data has a much greater effective life cycle, and the aggregation of the underlying raw data products may not always be an appropriate strategy. Accurate and verified historical observations do not lose their validity with the passage of time. Therefore the data retention strategy and subsequent data storage requirements will be unique to the scientific environment.

Regardless of the data retention strategies adopted, data mining in both commercial and scientific databases faces similar issues in extracting information and knowledge from very large data sets, particularly as data sizes increase to peta and exa scale.

vii See http://www.ivoa.net/about/member-organizations.html for a list of member organisations with links to the various archives.
viii For example, AT&T, Yahoo, Google, YouTube and Amazon.
ix Which can be as little as hours or days, depending on the application. See Dumbill et al. [26]

1.2 High Performance Computing in Scientific Analysis

Traditional distributed High Performance Computing (HPC) systems and applications for scientific analysis have usually been written in low level programming languages with distributed processing extensions (e.g. C++ with OpenMP/MPI accessing file storage systems like Lustrex or HDF5xi).

While undoubtedly a very powerful parallel processing paradigm, it requires a complex set of design and programming skills and considerable expertise and experience to master, especially when designing memory sharing and distributed parallel processing across a cluster of nodes.

While some astronomical researchers will have these skill sets, many researchers and research students will not. Just as importantly, these researchers will not have the time required to acquire and cultivate these skills in order to maximise the amount of value extracted from the raw data. In order to maximise the benefit of the collected data and facilitate effective knowledge extraction, an easier-to-use framework for data analysis is needed.

From a reliability standpoint, the C++ OpenMP/MPI ecosystem is not fault tolerant (although for high priority analyses these programs are usually run on high-end, hardware-resilient, and very expensive supercomputing hardware). A large production system containing petabytes of data and potentially hosting hundreds of concurrent users would be best served by an architecture with automatic fault tolerance designed into the ecosystem, so that a node failure does not interrupt a process and extensive user-designed check-pointing within the analysis programs becomes unnecessary.

Modern distributed programming and storage paradigms have evolved into ecosystems with the capability to access powerful parallel processing systems on commodity hardware. Abstracting the complexity of distributed parallel programming algorithms into libraries of common protocols for commonly used languages could allow a greater community of researchers to extract information and knowledge from peta scale datasets.

1.3 Hadoop Ecosystem

One prominent system for very large database analysis is Apache Hadoop [27].

x Lustre is a scale-out distributed parallel filesystem. See http://lustre.org/
xi A versatile and portable file format with no limit on the size of the data objects in the collection, with an application programming interface implemented in C, C++, Fortran and Java. See https://support.hdfgroup.org/HDF5/. Parallel processing in HDF5 is MPI dependent - see https://support.hdfgroup.org/HDF5/Tutor/pprog.html

Hadoop is an open-source set of tools that provides a reliable distributed computing environment, and is deployed on commodity hardware. The Hadoop ecosystem has been in use in the commercial sector since 2005/2006, and is gaining acceptance in the astronomical research domain. Farivar et al. [28] demonstrated the use of Hadoop and MapReduce [29] to process large photometric surveys, and Wiley et al. [30] presented techniques for image co-addition using Hadoop and MapReduce.

MapReduce is the original Hadoop processing framework and, while it does abstract the complexity of writing distributed applications, it is still a relatively low-level interface requiring advanced programming skills.

The Hadoop ecosystem comprises various integrated components, the data storage component being the Hadoop Distributed File System (HDFS) [31]. HDFS is a distributed architecture for large scale data storage. It is scalable using commodity hardware, fault-tolerant and easy to expand. Although HDFS is primarily a “Write Once, Read Many” storage architecture optimised for batch processing, technologies have emerged in the last few years to enable ad hoc querying of data contained within HDFS. HDFS is capable of storing both structured and unstructured text-based data, as well as other data formats including images, JSON and XML documents.

HDFS clusters have been implemented with petabyte storage capacities as early as 2010 [32] and thus HDFS is an excellent candidate for astronomical data storage.

Accessing data in HDFS can be done at the file level, but more usable and effective protocols are available. One such system is Apache Hive [33], which is a data warehouse infrastructure built on Hadoop that enables ad hoc querying of data contained within the underlying HDFS data files. Hive creates metadata structures on HDFS data which represent the data in a very similar fashion to a relational database table, and allows SQL-like querying, analysis and summarisation of that data. Hive is of interest because the supported query language, Hive QL (HQL), shares many features with common ANSI standard SQL dialects and can easily create customised, subject-specific snapshots of data from larger datasets. Table structures in Hive can either be internally managed by the Hive Server (allowing creation of table data with different compression codecs) or externally managed, whereby existing HDFS datasets have a metadata abstraction layer created within the Hive server, which then allows Hive QL queries to be run against the data.
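As a small illustration of the snapshot capability mentioned above, the sketch below materialises a subject-specific subset of a larger table with a single Hive QL statement issued from Python through a Spark HiveContext (Spark 1.5 style API). The snapshot table name, column names and filter predicate are hypothetical; the Hive QL actually used to build the correlation snapshots is listed in Appendix I.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="HiveSnapshotSketch")
    hive_ctx = HiveContext(sc)

    # Materialise a subject-specific snapshot from a larger detections table.
    # The snapshot is a normal Hive-managed table and can itself be queried
    # or read into an RDD for analysis.
    hive_ctx.sql("""
        CREATE TABLE IF NOT EXISTS dingo_bright_snapshot STORED AS ORC AS
        SELECT obj_id, ra_deg, dec_deg, flux_int
        FROM   sparktestorczlib
        WHERE  flux_int > 1.0
    """)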

Another feature of great interest is Hive’s capability to partition and sub-partition table definitions, providing both coarse and fine granularity. Hive table data can also be stored in diverse data formats and compressed with a range of codecs, including non-columnar data stores.

Hive also supports the creation of indexes on Hive table data, in both standard BTree and Bitmap format. Index design needs careful attention, particularly with Bitmap indexes; however, properly applied indexes on Hive tables can vastly improve read performance over large datasets.

Hadoop also provides cluster computing platforms for distributed computing, which is believed to be essential for the expected data volumes. Apache Spark [34] is one such framework. It offers in-memory distributed processing, which accounts for the speed of the computations of which it is capable, and it is also more efficient than Map/Reduce on Hadoop. Spark offers simple APIs in Java, Scala, R, Python and SQL, and has a comprehensive selection of in-built libraries. Spark integrates easily with Hadoop and can access any Hadoop data source. Spark storage APIs will also integrate with Amazon S3xii, Cassandraxiii, Apache Hive and HBasexiv, and any relational database that supports Java Database Connectivity. Utilising the Hadoop InputFormat and OutputFormat Map/Reduce interfaces enables file formats on common storage systems to be made available for processing.

Commonly supported file formats include text, JSON documents, CSV files (both comma and tab delimited), Hadoop sequence files, Hive table formats (including ORC [35], Parquet [36, 37], AVROxv, RC [38] and text), as well as protocol buffers and object files. The core abstraction for Spark’s ability to work with data is the Resilient Distributed Dataset (RDD), which is simply a distributed collection of elements. The Spark engine automatically splits these elements into partitions and then distributes and parallelises operations on these elements across the cluster.
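To make the RDD abstraction concrete, the short sketch below reads a CSV detection file directly from HDFS, splits each record, filters it and counts the survivors; Spark distributes each transformation across the partitions of the RDD without any explicit parallel programming. The HDFS path and the field position of the flux value are illustrative only.

    from pyspark import SparkContext

    sc = SparkContext(appName="RDDSketch")

    # Each HDFS block becomes one or more RDD partitions; transformations are
    # lazy and only run when an action (count, collect, reduce, ...) is called.
    lines = sc.textFile("hdfs:///data/dingo_01/detections/*.csv")
    fields = lines.map(lambda line: line.split(","))
    bright = fields.filter(lambda f: float(f[7]) > 1.0)  # field 7 assumed to hold a flux value
    print(bright.count())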

Ryza et al. [39] describe the significant improvements of the Spark processing framework over its immediate predecessor, Map/Reduce. Rather than relying on a rigid map-then-reduce framework, they suggest using Spark’s more general directed acyclic graph of operators and its rich set of transformations, which can represent complex pipelines in a few lines of code.

Spark can run in local mode, as a stand alone cluster, or be incorporated into a Hadoop cluster. Running within a Hadoop cluster enables Spark to be run through the YARN ResourceManager [40], providing very efficient integration with the data sources contained within the underlying Hadoop ecosystem (of specific interest in this study, HDFS and Hive).

The Hadoop ecosystem includes a sophisticated resource manager to schedule and manage distributed processing jobs, Apache YARN. YARN decouples resource management from data processing requirements, which allows YARN to provide resources for any processing framework compatible with Hadoop. YARN consists of a Resource Manager, which resides on one of the dedicated service nodes within the cluster, and Node Manager daemons on the worker nodes within the cluster. The Resource Manager is simply a dedicated scheduler that assigns resources to requesting applications.

Although Spark on Hadoop offers significant benefits, there are performance penalties when compared to classic HPC OpenMP/MPI architectures.

xii https://aws.amazon.com/s3/
xiii cassandra.apache.org
xiv HBase is an open source non-relational distributed database - see https://hbase.apache.org/. HBase was not used in this study.
xv https://avro.apache.org/docs/1.2.0/
xvi Which contains 11 × 10^6 samples of simulated signal processes that produce Higgs bosons.

Liang and Lu [41] report a 2-3 times performance increase using MPI over Spark on Hadoop; however, the tests they describe are against small datasets and on a relatively small cluster. Reyes-Ortiz et al. [42] tested Spark on Hadoop against the classical HPC OpenMP/MPI on Beowulf with the Higgs Data Setxvi. They conclude that, for their experiments, while OpenMP/MPI is more powerful in terms of speed and scales better, Spark on Hadoop may be the preferred option because Spark on Hadoop offers:

• A distributed processing platform with fault-tolerant node failover and data replication capabilities, which OpenMP/MPI does not;

• The ability to dynamically add new nodes at runtime to scale the cluster up; and

• A set of tools for data analysis and management that is easy to use, deploy and manage.

While these findings are interesting, it is important to note that the performance difference for a production sized Spark/Hadoop cluster (with hundreds or thousands of optimally configured worker nodes storing and processing very large data sets at the tera and peta scale) may not be that significant. Also, unlike the batch job submission used in HPC, Hadoop offers the elasticity of resources that can be dynamically added or removed as needed, which can be a significant advantage.

The objectives for this study were:

• To develop and test a generic astronomy framework on Hadoop incorporating the most commonly used tools for astronomical data analysis;

• To investigate the impact of data compression on performance for sufficiently large (hundreds of gigabytes to terabyte sized) datasets; and

• To assess the overall “ease of use” and “user friendliness” of incorporating Hive calls into a Spark based Python analysis.

The methods and evaluation framework are described in Chapter 2, followed by the results in Chapter 3. Findings and conclusions are discussed in Chapters 4 and 5.

Chapter 2

Methodology

Any large data storage system exists only to provide a framework whereby the data can be accessed and analysed to provide information and insight into specific problems; in this case, astronomical research. The framework supported by this data storage system must be robust, flexible and easy to use, and must provide the end user with useful tools to accomplish knowledge extraction from data.

In order for any large data analysis framework to be successful, it must be accessible and usable by the end user community. Regardless of the efficiency and processing power of any storage platform, the value for users will be severely curtailed if the only tools available for analysis are complex low level programming languages aimed at a very narrow stratum of experienced programmers and analysts. Therefore, any platform for astronomical research must be able to support commonly understood tools in order to provide broad value to as wide a population of end users as possible.

It is also important to take into account the fact that the set of commonly used tools and analysis software in any one field of study is in a state of constant flux, as new tools become available and older tools become obsolete. Therefore any proposed framework architecture should be loosely coupled in order to incorporate new tools as seamlessly as possible.

Furthermore, any such framework must be robust and secure in a production environment. It should have defined and concise management and monitoring tools, and be able to scale effectively as data storage and processing demands increase with time.

2.1 Test methodology

The Hadoop platform may satisfy many of the requirements for the storage and analysis of very large astronomical data sets. The cluster-based architecture of Hadoop is scalable and fault-tolerant, provides an effective distributed processing framework with Spark (and newer parallel processing frameworks like Apache Drilli) and supports commonly used and popular programming languages; for example, Python, Java, Scala and R.

This investigation was limited to the Python language. Python was chosen because of its wide usage and acceptance within the astronomical and scientific community [43–46], with existing expertise amongst the end-user community. Python also has access to astronomy specific machine learning libraries, as discussed in Section 2.5.

The Hadoop framework abstracts the complexity of parallel processing algorithms so end-users can concentrate on knowledge extraction. To determine the suitability of the Hadoop ecosystem for astronomy research, the following areas were evaluated.

• Cluster architecture;

• Compression formats, storage formats and data structures;

• Machine learning libraries;

• Performance; and

• Integration with existing analysis tools and ease of use.

2.2 Datasets

In order to test a Hadoop-based framework, representative sample datasets are needed. Existing catalogue data was made available from theSkyNet project [47]. TheSkyNet is a public science project dedicated to radio astronomy whereby citizen scientists donate unused processing capacity of their personal computers to process small packets of data distributed from a central system, thereby making up a large network computer infrastructure. Survey datasets made available and used in this study were the Deep Investigation of Neutral Gas Origins (DINGO) deep HI surveyii, the HI Parkes All Sky Survey (HIPASS)iii and the Galactic ASKAP Spectral Line Survey (GASKAP)iv v. All of these involve observations of the 21cm hydrogen (HI) line, producing data cubes having two spatial and one spectral dimension.

Raw datasets (see Table 2.1) produced by the Duchamp source finder [5] were provided by theSkyNet. Source finding parameters were changed multiple times, and multiple outputs were produced in the form of Duchamp output files. These comprise data from specific source-finding runs. The data were in ASCII text format.

i https://drill.apache.org/
ii http://internal.physics.uwa.edu.au/~mmeyer/dingo/welcome.html
iii www.atnf.csiro.au/research/multibeam/lstavele/sky_abstract.html
iv http://www.atnf.csiro.au/projects/askap/ssps.html
v https://sites.google.com/site/gaskapproject/home

Table 2.1: Survey raw data used from theSkyNet

Dataset            Size in GB
DINGO 01           279
DINGO 02           279
GASKAP 01          15
GASKAP 02          15
HIPASS scaled 01   220
HIPASS scaled 02   206
HIPASS2 fixed 01   705
HIPASS2 fixed 02   706
TOTAL              2425

The Duchamp output files were then run through a custom Java program to extract the detection and parameter files in csv format. This data included processing and load dates, parameters passed into the source finding algorithms (including filtering and plotting parameters) and the subsequent detection data. The detection and parameter data was extracted, separated and formatted into structured .csv files. The file structure of the generated detection and parameter files can be found in Appendix A, tables A.1 (detections) and A.2 (parameters). Details of the raw Duchamp file output can be found in Appendix A.3 and details of the extracted detections and parameters files can be found in Appendix A.4 and A.5 respectively.

The extracted detection and parameter files were loaded into a predefined directory structure in the Hadoop File System.

Raw data files in HDFS can be compressed with various compression formats; for this study, the extracted detection and parameter .csv files were compressed with the gzip compression algorithm. Other compression formats are also available and, while not a comprehensive list, common compression formats for HDFS are shown in Table 2.2.

Table 2.2: Common compression formats for HDFS Files

Format    Comments
gzip      Native Hadoop support, very good compression sizes
bzip      Fast compression and decompression
LZO       Designed for very fast compression and decompression, compression size not as good as gzip
Snappy    Reasonable compression, high input speeds

The compressed file sizes are shown in Table 2.3. File sizes indicated are representative of the default HDFS block replication factor of threevi.

viHDFS stores files as a sequence of blocks, and each block is replicated by default three times across the cluster for fault tolerance. The replication factor can be modified according to specific requirements.

Table 2.3: Survey file sizes in HDFS in GB

Dataset            Detections   Parameters   TOTAL
DINGO 01           56.8         11           67.8
DINGO 02           54.8         10.3         65.1
GASKAP 01          0.021        0.997        1.018
GASKAP 02          0.0207       1.001        1.0217
HIPASS scaled 01   62.8         10.2         73
HIPASS scaled 02   55.8         8.5          64.3
HIPASS2 fixed 01   120.8        23.3         144.1
HIPASS2 fixed 02   120.3        22.8         143.1
TOTAL              471.3417     88.098       559.4397

These raw data files were the basis of the Hive external tables that were used to create subsets of data for testing, as described in Section 2.4.1.

It should be noted that while these dataset sizes are not excessivevii, they demonstrate the compression factors that could reasonably be expected as dataset sizes scale upward into the tera, peta and exa scale. It should also be noted that the evaluation cluster for this study, as implemented, can only be categorised as a small “proof of concept” platform.

2.3 Cluster Architecture

In order to perform this evaluation, an evaluation cluster was developed and populated with representative samples of data. The cluster was configured to incorporate astronomy-specific Python-based machine learning libraries. A simplified diagram of the evaluation framework is shown in Figure 2.1.

viiFor example, the DINGO 01 raw dataset size of 279 GB is reduced to ≈ 68 GB. This framework also enables the creation of subject-specific snapshots (discussed in Section 4.2) which has the capability to further reduce the dataset size for a specific analysis which will also reduce the associated I/O loads on the cluster.

[Figure: high-level architecture diagram. End users and system administrators access the cluster through a management server and a web server UI (including Hive and Pyspark editors). Service nodes (simplified) host the MySQL RDBMS holding the Hue, Oozie and Hive schemas, the Hive Server, the HDFS Name Node, the YARN Resource Manager and a management server. Source finding (e.g. Duchamp), catalogue extraction and data load feed the worker nodes, which run HDFS, Spark and the Python libraries and hold the raw data files, Hive table data and metadata, all within the Hadoop ecosystem.]

Figure 2.1: Evaluation Framework High Level Architecture

The Nectar Research Cloud [48] was utilised to create the evaluation clusters. The Cloudera CDH open-source platform (including the Hadoop, Hive and Spark ecosystem) was used to configure the test framework. The Cloudera implementation was selected for its comprehensive product bundles and the easy installation and configuration of the Hadoop ecosystem components and subsystems.

Standard virtual machine sizes used are illustrated in Table 2.4 below.

Table 2.4: Nectar Virtual Machine Configurations

VM Image     CPUs   RAM in GB   Primary disk size in GB
m2.large     4      12          30
m1.large     4      16          10
m2.xlarge    12     48          30
m1.xxlarge   16     64          10

A small evaluation cluster was initially set up running the Cloudera CDH 5.4.8 release. This initial cluster was used to perform initial program development and gain experience with the Spark process configuration. All machines were configured to run the Ubuntu 12.04 “Precise” operating system. The initial set up was configured with all service functions on one server and five worker nodes, with all servers configured as m1.large virtual instances.

While appropriate for evaluation and preliminary testing, the limitations of the initial configuration quickly became apparent. The Hadoop ecosystem management processes are most resilient and best performing when run across multiple servers - for example, it is generally recommended to run three instances of the Zookeeperviii configuration and management processes across three management servers, and an odd number (usually a minimum of three) of HDFS journal node processes, also across separate servers. Data/worker nodes should be configured with adequate memory. Recommendations vary, but data/worker nodes on a production system will typically have between 256 and 512 GB RAM. Spark executor processes require memory to process data, and small clusters whose worker nodes have limited memory will quickly be overwhelmed by large datasets, necessitating data swapping to disk.
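The executor memory and core settings that govern this behaviour are supplied when the Spark context is created (or on the spark-submit command line). The values below are placeholders, not the settings used in the tests, and simply show where these knobs sit in a Python program.

    from pyspark import SparkConf, SparkContext

    # Illustrative values only; appropriate settings depend on the worker
    # node RAM, core counts and the size of the dataset being processed.
    conf = (SparkConf()
            .setAppName("TunedAnalysis")
            .set("spark.executor.memory", "8g")
            .set("spark.executor.cores", "4")
            .set("spark.executor.instances", "10"))
    sc = SparkContext(conf=conf)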

The cluster was subsequently upgraded to run the Cloudera CDH 5.5.1 release on eleven machines, in order to support Spark version 1.5 and to more closely replicate the architecture of a production system. Retaining some servers from the previous incarnation of the cluster, five new worker nodes of m1.xxlarge capacity were created, and the service functions were distributed across multiple management servers. The Oozie, Hive Metaserver and Hue user database schemas were migrated off the cluster to a separate MySQL 5.5 database instance running on a dedicated server. All testing for this study was run on the upgraded cluster. The service layout of the upgraded cluster is illustrated in Appendix B, Table B.1 (master service nodes) and Table B.2 (worker/slave nodes).

2.4 Hive tables

Hive supports two main types of tables: internal and external. These table formats are discussed in the subsequent sections. Creation and modification of Hive tables and indexes is accomplished by the use of Hive DDL statements on any Hive interfaceix. These commands are explained in detail in the Hive DDL Language Manual [49]. Note that Hive indexes were not used in this study (see Section 5.2).

viii Zookeeper is a centralised service used to provide distributed configuration and synchronisation services as well as a naming registry. See https://zookeeper.apache.org
ix For example, the beeswax command line interface, or the Hue Hive editor interface.

13 2.4.1 External tables

External tables are created by defining a metadata layer across existing HDFS data files. The Hive metadata points to the specific location within HDFS, as well as defining columns and data types. For example, once raw data is loaded into HDFS, an external table definition can be created for the file, which can then be accessed with SQL statements via a Hive interface. Deleting a Hive external table removes only the Hive metadata, not the underlying file(s) on HDFS.

For the purposes of this study, external table functionality was utilised to create a metadata abstraction of the raw data files described in Section 2.2.

Partitioned external tables were created for the detection and parameter data extracted from each of the datasets shown in Table 2.3. These tables were the source data for the specific data subsets used to populate the partitioned internal test tables described in Section 2.4.6.
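As an illustration of the external table concept, the sketch below declares a metadata layer over an existing HDFS directory of CSV files with a single Hive QL statement, here issued through a Spark HiveContext for consistency with the other examples (in the study the DDL was run from a Hive interface). The table, column and path names are hypothetical; the actual creation scripts are given in Appendix E.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="ExternalTableSketch")
    hive_ctx = HiveContext(sc)

    # EXTERNAL plus a LOCATION clause means Hive records only metadata;
    # dropping the table leaves the underlying HDFS files untouched.
    hive_ctx.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS detections_ext (
            obj_id    BIGINT,
            ra_deg    DOUBLE,
            dec_deg   DOUBLE,
            flux_int  DOUBLE
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION 'hdfs:///data/dingo_01/detections'
    """)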

2.4.2 Internal Tables

Hive internal table life-cycles are managed by the Hiveserver. When an internal table is created, Hive creates the table schema as well as the file location and (if appropriate) partition metadata. When a Hive internal table is deleted, the data and the metadata are deleted. In this study, internal partitioned Hive tables were created to test storage formats and partitioning schemas, described in Section 2.4.6.

2.4.3 Partitioning

Partitioning is a design philosophy whereby the physical datasets that make up a logical dataset or table are stored in separate physical locations. Metadata making up the logical unit of storage (table or dataset) holds pointers to the physical locations of these subsets of data. Data partitioning has been in wide use for many years on large scale relational databases and is a well understood technology. Partitioning large tables and datasets can offer significant benefits on read operations, as only the physical partitions containing the data of interest are scanned. In many cases this is a small percentage of the total table size, and in these circumstances a full table scan would be unnecessary and inefficient. The benefits of partitioning Hive tables are discussed in Section 4.6.

2.4.4 Table partitions

Each partition in a partitioned Hive table has its own directory and file structure in the underlying Hadoop File System; partition calls within a Hive QL statement reduce I/O bottlenecks and can greatly improve read performance.

A simplified schematic of a Hive partitioned table is illustrated in Figure 2.2.

Figure 2.2: Hive Table Partition Schematic

This schema represents the table design that was tested initially for query performance; the table was partitioned around the input file name and then sub-partitioned on Right Ascension and Declination rangesx.
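A sketch of the kind of query this design favours is shown below, issued from Python through a HiveContext; the column and partition names are illustrative, following the schematic in Figure 2.2. Because the predicates name the partition columns, Hive reads only the directories for that input file and RA/Dec range instead of scanning the whole table (the explain plans in Appendix F show this pruning).

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="PartitionPruningSketch")
    hive_ctx = HiveContext(sc)

    # input_file and radec_range are assumed partition columns; restricting on
    # them limits the scan to the matching HDFS partition directories.
    df = hive_ctx.sql(
        "SELECT ra_deg, dec_deg, flux_int "
        "FROM sparktestorczlib "
        "WHERE input_file = 'dingo_01' AND radec_range = 12"
    )
    print(df.count())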

2.4.5 Internal table formats and compression codecs

Four internal Hive table formats were investigated – Parquet, ORC, Text and RC, with the compatible compression codecs.

Parquet [36, 37] is a columnar-based file format for Hadoop, designed to support complex nested data. Parquet was built to utilise efficient compression schemes and consists of row groups, column chunks and pages. Row groups are logical horizontal partitions of row data, consisting of a column chunk for each column in the dataset. Column chunks are then divided into pages. Multiple page types can be interleaved in column chunks. Parquet is supported by a plugin in Hive 0.10, 0.11, and 0.12 and natively in Hive 0.13 and later. In this study the Parquet file format was tested with the Snappy and Gzip compression codecs.

xThe sub partition schema was chosen arbitrarily to test the performance benefits of partitioning, and may not represent the most efficient schema for actual analysis. Partition schema designs under production conditions would be designed around best known use cases and quite possibly would change as requirements change. The use of Hive allows the creation of data snapshots with differing partition schemas which may more efficiently address specific analysis questions.

ORC (Optimized Row Columnar) [35] file format is a columnar-based file storage designed to address limitations in previous Hive file formats (for example, RCFile). ORC files consist of groups of stripes, each stripe containing index data, row data and a stripe footer. Auxiliary information is contained in the file footer, and the postscript holds compression parameters and the size of the compressed footer. The default stripe size is 250 MB, designed to enable fast reads from HDFS. ORC formats for Hive were introduced in Hive version 0.11, and can be compressed using the Snappy or Zlib codecs.

Hive text-based internal tables save data as text (for example, .csv format files). Partitioned test tables create the underlying partitions in separate directories.

RCFile (Record Columnar File) [38] is a flat file format consisting of binary key/value pairs. The RCFile format was added to Hive in version 0.6.0. File metadata of the row split is stored as the key portion of a record, and the data within a row split is stored as the value.

2.4.6 Test table design

Six Hive test tables were created and then populated with identical data, differing in table format and compression codec as shown in Appendix C, Table C.1. The tables, their storage types and compression codecs are shown in Table 2.5.

Table 2.5: Test table storage formats and compression.

Table name                Storage type   Compression
SPARKTESTORCZLIB          ORC            zlib
SPARKTESTORCSNAPPY        ORC            snappy
SPARKTESTPARQUETGZIP      Parquet        gzip
SPARKTESTPARQUETSNAPPY    Parquet        snappy
SPARKTESTTEXT             Hive text      None
SPARKTESTRC               RC File        None

All of the tables were created as partitioned Hive internal tables. The partition schema utilised is very similar to the schema illustrated in Figure 2.2. As the DINGO datasets comprise different file groups, each table was partitioned on the input filename as the primary partition, and then sub-partitioned based on a calculated value of the Right Ascension and Declination range.

We also created a Hive external non-partitioned table for comparison; however, testing against this table was limited, as there is a bug in Spark 1.5 that prevents data frames returning data to a Python on Spark program calling a text based partitioned external tablexi.

xiA text based external table that has not been partitioned does not, however, display this behaviour. The previous version of Spark used in initial testing (Spark 1.3) did not exhibit this behaviour either. It is believed that this bug is addressed in Spark version 1.6+, however this was not tested and this behaviour cannot be confirmed.

16 Data sizes of the test tables are illustrated in Table 2.6. These table sizes represent the Hadoop default block replication factor of 3.

Table 2.6: Hive table formats and compression codecs - Size in GB

Table Format       Snappy   Gzip    Zlib   None
Parquet            26.5     26.5    -      -
ORC                23.2     -       17.1   -
Hive Text          -        -       -      59.3
RC                 -        -       -      29.2
HDFS Detections    -        111.6   -      -
HDFS Parameters    -        21.3    -      -

Hive tables are created by simply running the appropriate DDL creation scriptxii. The creation scripts of Hive-based internal tables can be found in Appendix D.

• The Data Definition Languagexiii (DDL) creation scripts for ORC based tables can be found in Figures D.1 and D.2.

• The DDL creation scripts for Parquet based tables can be found in Figure D.3. Compression formats for Parquet tables are defined in the population scripts.

• The DDL creation script for RC based tables can be found in Figure D.4.

• The DDL creation script for text based tables can be found in Figure D.5.

The creation scripts of Hive-based external tables can be found in Appendix E.

• The DDL creation scripts for external Hive tables can be found in Figure E.1 (non-partitioned) and Figure E.2 (partitioned)xiv. Note that the DDL statement for a non-partitioned table includes a location directive indicating the HDFS directory where the data files will reside. A non-partitioned table will reference all files in the specified HDFS directory. These files can be added and removed.

2.4.7 Test table data extracts

In this study, all test tables defined in Table 2.5 were populated with an identical subset of data from the DINGO 01 and DINGO 02 surveys. The external table was populated with data from one partition of the DINGO 01 survey, to replicate one partition from the test tables.

xii Either in the Hue web interface using the Hive editor, or the beeswax command line interface.
xiii Database commands that define structures and objects in a DBMS. Common DDL commands are CREATE, ALTER and DROP.
xiv As discussed earlier, an external table in Hive is only a metadata layer for an existing HDFS dataset, so this statement will not populate the data into the table.

Data Manipulation Languagexv (DML) INSERT statements were used to populate the test tables. Dynamic partitioning was used to ensure that data was automatically loaded into the correct partition. Dynamic partitioning is discussed in detail in the online Hive Tutorial [50], and a minimal sketch of the pattern follows the list below. The Hive external tables discussed previously in Section 2.4.1 were used as the source for the data extracts.

• The population scripts for ORC, RC File and text based tables can be found in Appendix D, Section D.2, Figure D.6. These tables are similar in that the compression codec (or lack thereof) is defined on table creation.

• The population script for the Parquet tables can be found in Appendix D, Section D.2, Figure D.7.
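For orientation, the following is a minimal sketch of the kind of dynamic-partition INSERT used in those population scripts (the statements actually used are in Figures D.6 and D.7). The source table name and the expressions that derive the ra_range and dec_range values are hypothetical; the two SET commands enable dynamic partitioning for the session. A HiveContext as in Figure 2.3 is assumed.

# Hedged sketch; the source table and the ra_range/dec_range expressions
# are hypothetical. Assumes sqlCtx as in Figure 2.3.
sqlCtx.sql("use skynet")
sqlCtx.sql("set hive.exec.dynamic.partition=true")
sqlCtx.sql("set hive.exec.dynamic.partition.mode=nonstrict")

sqlCtx.sql("""
    insert into table sparktestorczlib
    partition (filename, ra_range, dec_range)
    select ra, dec, freq, wavelength, flux,
           filename,
           cast(ceil(ra / 20) * 20 as int)  as ra_range,
           cast(ceil(dec / 20) * 20 as int) as dec_range
    from dingo_01_external
""")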

Example code to populate a partitioned Hive external table is presented in Appendix E, Section E.3 and Figure E.3.

Compressing Hive external table data (or any text file stored on HDFS) is easily accomplished using a Pig [51] script and example code is presented in Appendix E, Section E.4 Figure E.4.

Test tables were populated from survey data from DINGO 01 and DINGO 02 datasets. Record counts are shown in Table 2.7.

Table 2.7: DINGO raw survey record counts

Dataset                 Record counts
DINGO 01 Detections     226,439,400
DINGO 01 Parameters     80,821,349
DINGO 02 Detections     218,683,422
DINGO 02 Parameters     75,751,922

Test tables were populated with a subset of data from the detection and parameter tables for the performance evaluations. Record counts of these tables are shown in Table 2.8.

Table 2.8: Test table record counts

Dataset                            Record counts
Sparktest Parquet Snappy           445,121,256
Sparktest Parquet Gzip             445,121,256
Sparktest ORC Snappy               445,121,256
Sparktest ORC Zlib                 445,121,256
Sparktest Hive Textfile            445,121,256
Sparktest RC Files                 445,121,256
Hive external table (CSV format)   16,235,604

An external non-partitioned table with data from one partition from the DINGO 01 survey was also created for comparison with single partition test runs against the test tables.

xvDatabase commands used for selecting, inserting, deleting and updating data in a database.

2.4.8 Hive user interfaces

The Hive tables were created via the Hive editor in the Hue interface [52]. Hue is a common industry standard user interface for Hadoop and is bundled with the Cloudera release of Hadoop utilised in this study. Hue provides editors for SQL over Hadoop (Hive and Impalaxvi), job design and scheduling (Oozie), Spark job submissionxvii and HDFS file browsers.

2.5 Python

The Python language is widely adopted by the scientific community, and there are many comprehensive scientific analysis libraries available. In this study, we used the machine learning and astronomy specific libraries ScikitLearn [53], astroPy [54] and astroML (astroML and astroML-addons [55]), as well as the Spark machine learning library (MLlib) [56]. Standard Python scientific and plotting libraries (numpy, scipy and matplotlib) are also required, as ScikitLearn, astroPy and astroML depend on these libraries being available. The machine learning library installation dependencies are presented in Appendix G.

These libraries were installed on all worker nodes within the cluster.
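A quick way to confirm that the libraries are actually visible to every executor (a convenience sketch, not a test from this study) is to run a trivial Spark job whose tasks import the libraries on the executors and report back the worker hostname and the versions they see. It assumes a SparkContext created as in Figure 2.3.

# Sanity-check sketch: each task imports the libraries on the executor
# it runs on and reports the hostname and versions it can see.
def check_libs(_):
    import socket
    import numpy, scipy, sklearn, astropy
    import astroML
    return (socket.gethostname(), numpy.__version__, scipy.__version__,
            sklearn.__version__, astropy.__version__)

# spread enough tasks across the cluster to touch every worker at least once
report = sc.parallelize(range(60), 60).map(check_libs).distinct().collect()
for entry in sorted(report):
    print(entry)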

2.6 Test Framework

Three types of statistical analysis implemented in Python were used for benchmarking the read performance of the Hive table types: a KMeans analysis, Kernel Density Estimation (KDE) and Principal Component Analysis (PCA). The programs are derivatives of the examples provided for the corresponding methods in the ScikitLearn or astroML libraries. The programs were modified to retrieve the data from Hive tables or HDFS via Spark context calls.

A Pearson correlation coefficient (PCC) test was also performed to exercise inter-process communications and parallel computation.

The test pyspark (Python on Spark) programs were submitted from bash command line scripts to the default spark-submit executable on one of the cluster worker nodes. While this would not be considered a secure option for a production environment, it is a simple method for development testing and evaluation. A more appropriate approach in a production environment would be to utilise a secure interface to submit these jobs.

In this study, a brief examination was made of three user interfaces with the capability of running Spark based Python jobs: the Hue interface, Apache Zeppelin [57, 58] and Jupyter/IPython [59]. While a detailed examination of appropriate user interfaces is beyond the scope of this thesis, it is nonetheless worthwhile noting the following points. At the time of writing, Hue appears to be the most robust, full featured and production ready interface. As well as providing a Hive editor, Hue provides secure ACL based access to the HDFS file system, secure job tracking and a workstream editor, and can also be configured with the Livy REST server [60], allowing remote submission of Python/Spark jobs via a cURL call [61]. However, the Hue Spark editor does not yet support the Matplotlib library and the SparkR editor is not yet supported (as at Cloudera release 5.5.1). Both IPython/Jupyter and Apache Zeppelin support the Matplotlib library but do not offer functionality beyond Python/Spark job submission.

xviApache Impala is an open source, analytical database for Apache Hadoop. Impala was not evaluated in this study. https://impala.incubator.apache.org/
xviiThe Spark editor is currently in Beta.

The Python test programs were used to record RDD creation duration as well as total elapsed time, number of records retrieved, and CPU seconds. Cluster logs (CPU usage and I/O statistics) were also extracted and matched to test program runs. Raw data were examined in scatter plots and obvious outliers were discarded before final analysis.

2.6.1 KMeans

The aim of a KMeans analysis is to group subsets of entities based on similarities, and it is often incorporated into a supervised learning pipeline. KMeans is an unsupervised learning algorithm that aims to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean. According to Pentreath [62],

“K-means tries to find clusters so as to minimize the sum of squared errors...within each cluster.”

Our KMeans program is an extension of the KMeans example provided by the Spark machine learning library MLlib [63]. The source code for this program can be found in Appendix H, Section H.1.
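The core of such a program reduces to a few lines. The sketch below is not the Appendix H.1 listing, but it shows the general shape: a Hive-backed RDD of (freq, wavelength, flux) rows is converted to numpy vectors and passed to MLlib's KMeans. The value of k and the row filter are illustrative only, and sc/sqlCtx are assumed to have been created as in Figure 2.3.

# Hedged sketch of the KMeans pattern (not the Appendix H.1 program).
from numpy import array
from pyspark.mllib.clustering import KMeans

sqlCtx.sql("use skynet")
rows = sqlCtx.sql("select freq, wavelength, flux from sparktestorczlib \
                   where filename='R1_dingo_00' \
                   and wavelength between 5500 and 5505")

# build dense numpy feature vectors from the Hive rows
points = rows.rdd.map(lambda r: array([float(r[0]), float(r[1]), float(r[2])]))

# k and the iteration count are illustrative values only
model = KMeans.train(points, 5, maxIterations=20)
print(model.clusterCenters)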

2.6.2 Kernel Density Estimation (KDE)

KDE is a non-parametric method for estimating the probability density function of a random variable; it is a data smoothing problem in which inferences are made from a finite data sample.

The code used for this program was based on the examples from Ivezić et al [55]xviii. Source code was obtained from the KDE example on the supporting astroML website [64] and modified to use a Spark RDD to extract the data from a Hive table on HDFS. Source code for the KDE analysis program can be found in Appendix H, Section H.2.
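For orientation, the essential KDE step can be sketched as below (this is not the Appendix H.2 listing): a single column is pulled back to the driver from a Hive-backed RDD and a Gaussian kernel density estimate is fitted with scipy. The table name, predicates and grid size are illustrative, and sc/sqlCtx are assumed as in Figure 2.3.

# Hedged sketch of the KDE step; assumes a result set small enough to
# collect to the driver.
import numpy as np
from scipy.stats import gaussian_kde

sqlCtx.sql("use skynet")
rows = sqlCtx.sql("select freq from sparktestparquetgzip \
                   where ra between 60 and 61 and dec between -1.5 and 1.5")

freqs = np.array(rows.rdd.map(lambda r: float(r[0])).collect())
kde = gaussian_kde(freqs)                          # bandwidth via Scott's rule
grid = np.linspace(freqs.min(), freqs.max(), 500)  # evaluation grid
density = kde(grid)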

2.6.3 Principal Component Analysis (PCA)

Pentreath [62] describes PCA as a type of dimensionality reduction unsupervised learning model. These models take an input set of data with dimension D and extract a representation of the data of dimension k, where k is usually significantly smaller than D. PCA models seek to extract a set of k principal components from a data matrix, where the principal components are uncorrelated with each other. The first principal component accounts for the largest variation in the input data, and each subsequent component is calculated to account for the largest remaining variation provided that it is independent of the components calculated so far. Each principal component has the same feature dimensionality as the original data matrix, and the k components returned are guaranteed to account for the highest variation in the input data. The original data is projected into a k-dimensional space represented by the principal components. This output can then be visualised in a two or three dimensional plot.

xviiiSee ch. 6, pp. 250-259.

This program is based on the scikit-learn PCA library as described in Section 6.2 of the Astronomy with scikit-learn tutorial [65] and modified to use Hive/Spark RDDs as the data source.

Source code for the PCA analysis program can be found in Appendix H, Section H.3.
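The PCA step itself reduces to a few scikit-learn calls once the feature matrix has been assembled. The sketch below is not the Appendix H.3 listing (which also trims the ragged catalogue arrays); the five-column feature matrix and the partition predicates are illustrative, and sc/sqlCtx are assumed as in Figure 2.3.

# Hedged sketch of the PCA step only.
import numpy as np
from sklearn.decomposition import PCA

sqlCtx.sql("use skynet")
rows = sqlCtx.sql("select ra, dec, freq, wavelength, flux \
                   from sparktestorcsnappy \
                   where filename='R1_dingo_00' \
                   and ra_range=80 and dec_range=20")

X = np.array(rows.rdd.map(lambda r: [float(r[0]), float(r[1]), float(r[2]),
                                     float(r[3]), float(r[4])]).collect())

pca = PCA(n_components=2)          # project onto the first two components
projected = pca.fit_transform(X)
print(pca.explained_variance_ratio_)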

2.6.4 Non-embarrassingly parallel problems

In order to test the environment on problems that require intensive inter-process communications, a Pearson correlation coefficient (PCC) test against a sample of astronomical data was set up, using the numpy corrcoef functionxix. A PCC test is a measure of linear correlation between two variables x and y, and returns a value between -1 and 1 inclusive, where -1 is total negative linear correlation, 0 is no linear correlation and 1 is total positive linear correlation. Wavelength and position (Right Ascension and Declination in degrees) data was extracted from the DINGO 1xx dataset and instantiated into an internal Hive table in Parquet format with the table name CorrelationTest. This table was physically partitioned by cubes 15 arcminutes square. Appendix I, Section I.1 illustrates the DDL and DML statements required to create the data table.

All distinct wavelengths were then extracted and instantiated in a separate Hive internal table to be used as the baseline for histogram generation; see Appendix I, Section I.2.

Setting the problem space

A correlation problem involves the simultaneous comparison of adjacent data elements (in this case, wavelength data of adjacent areas of sky), and the use of Hive was an interesting opportunity to investigate how best to facilitate this analysis. The approach used involved creating an instantiated Hive table containing all possible permutations and combinations of comparisons, in this case of all data cubes one arcminute square, which was then used within the Pyspark analysis program to extract the relevant wavelength data for these cubes for comparison.

A table of all distinct permutations and combinations of a given problem set can be created in SQL using inequality joins. The base data table (CorrelationTest) was physically partitioned into cubes 15 arcminutes square, and an additional table (CorrelationMinutes) was created to hold the fine-grained list of one arcminute square cubes within each physical partition of 15 arcminute square data. See Appendix I, Section I.3 for the DML statement used to create these tables.

xixhttps://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html.
xxDINGO data has two spatial dimensions, Right Ascension and Declination (hereafter referred to as RA and Dec) and one spectral dimension, wavelength.

Finally, the table of all distinct permutations and combinations of the problem set was created in SQL using inequality joins; see Appendix I, Section I.4.
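A hedged sketch of what such a statement can look like is given below; cube_id is a hypothetical identifier for each one arcminute cube, the output table name is illustrative, and the statements actually used are those in Appendix I, Section I.4. In Hive the inequality is typically expressed as a cross join filtered by a WHERE predicate, which keeps each unordered pair of cubes exactly once.

# Illustrative only; cube_id and the output table name are hypothetical.
# Assumes sqlCtx as in Figure 2.3.
sqlCtx.sql("use skynet")
sqlCtx.sql("""
    create table correlation_pairs stored as parquet as
    select a.cube_id as cube_a, b.cube_id as cube_b
    from correlationminutes a
    cross join correlationminutes b
    where a.cube_id < b.cube_id
""")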

This represents one approach to setting the problem space. Hive views could also be utilised instead of instantiating physical tables, but at the cost of query performance.

Prepare the histogram array

The internal Hive histogram generation function was used to create the wavelength arrays for correlation, based upon parameter data passed in to define the particular cube of interest. See Appendix I, Section I.5.
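Hive's built-in histogram_numeric() aggregate is the function referred to here. The sketch below shows one way it can be invoked from Python for a single cube of interest; the statements actually used are those in Appendix I, Section I.5, and the cube coordinates, bin count and table columns shown are illustrative.

# Hedged sketch; ra0/dec0 identify a hypothetical one-arcminute cube.
ra0, dec0 = 180.0, -46.5
bins = sqlCtx.sql("""
    select histogram_numeric(wavelength, 10000) as hist
    from correlationtest
    where ra  between {0} and {0} + 1.0/60
      and dec between {1} and {1} + 1.0/60
""".format(ra0, dec0)).collect()[0].hist

# each element of the returned array is a (bin centre x, count y) pair
counts = [b.y for b in bins]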

Thread pools in Spark

The Python program used ThreadPool from the Python multiprocessing library to run the correlation analyses in parallel. As the size of the cluster limited the number of CPU cores available to the YARN management process to 60, the number of parallel processes spawned by the program was limited to a maximum of 56.

The Python program listing used for these tests can be seen in Appendix H, Section H.4.
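For orientation, the thread pool pattern can be sketched as below; the pair table, the histogram table and the fetch_histogram() helper are hypothetical stand-ins for the structures built in Appendix I, and the full program is the Appendix H.4 listing.

# Hedged sketch of the ThreadPool pattern; cube_histograms,
# correlation_pairs and fetch_histogram() are hypothetical.
import numpy as np
from multiprocessing.pool import ThreadPool

def fetch_histogram(cube_id):
    # pull the pre-computed wavelength histogram for one cube
    row = sqlCtx.sql("select counts from cube_histograms "
                     "where cube_id = '%s'" % cube_id).collect()[0]
    return np.array(row.counts, dtype=float)

def correlate(pair):
    a, b = pair
    r = np.corrcoef(fetch_histogram(a), fetch_histogram(b))[0, 1]
    return (a, b, r)

pairs = [(p.cube_a, p.cube_b) for p in
         sqlCtx.sql("select cube_a, cube_b from correlation_pairs").collect()]

pool = ThreadPool(56)               # bounded by the 60 YARN vcores available
results = pool.map(correlate, pairs)
pool.close()
pool.join()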

2.6.5 RDD Creation

The Apache Spark core abstraction for working with data is the Resilient Distributed Dataset (RDD). An RDD is a distributed collection of elements, be they Python, Java, Scala or user defined objects. As explained by Karau et al [66], all work in Spark is expressed as creating new RDDs, transforming existing RDDs or calling operations on an RDD to compute a result.

Spark also automatically distributes and parallelises operations performed on the RDD across the cluster. Selecting data directly from HDFS files is possible; however, if the data can be represented as a Hive table, the benefits outweigh the overhead of creating and maintaining the Hive metadata layer. A full discussion of the comparison between creating a Spark RDD from HDFS and from Hive can be found in Section 4.1 below.

Creating a Spark RDD in Python from Hive table data is demonstrated in Figure 2.3.

# import the Spark context libraries
from pyspark import SparkContext, SparkConf, StorageLevel

# Create the spark configuration
conf = (SparkConf()
        .setAppName("HDFSTest")
        .setMaster("yarn-client"))

# Create the Spark context
sc = SparkContext(conf=conf)

# Import the Hive context library
from pyspark.sql import HiveContext

# Create the Hive context
sqlCtx = HiveContext(sc)

# "use skynet" sets the default Hive
# database where the tables are located
sqlCtx.sql("use skynet")

# Hive query to extract the required data.
# Note that this query will result in a
# full table scan.
data = sqlCtx.sql("select ra, dec, freq \
                   from sparktest \
                   where ra between 44 and 45 \
                   and dec between 15 and 16")

Figure 2.3: Creating a Hive based RDD in Python

Creating a Spark RDD in Python from HDFS data files is demonstrated in Figure 2.4.

# import the Spark context libraries
from pyspark import SparkContext, SparkConf, StorageLevel

# array() is used below to build the feature vectors
from numpy import array

conf = (SparkConf()
        .setAppName("HDFSTest")
        .setMaster("yarn-client"))

# Create the Spark context
sc = SparkContext(conf=conf)

# Multiple RDDs need to be created if the
# data required is located on more than one
# HDFS file
data1 = sc.textFile("hdfs:/.../part-m-00000.gz")
data2 = sc.textFile("hdfs:/.../part-m-00001.gz")
data3 = sc.textFile("hdfs:/.../part-m-00002.gz")
# ... truncated for brevity

# RDDs combined into one master RDD
AllData = data1
AllData = AllData.union(data2)
AllData = AllData.union(data3)
# ...

data = AllData.map(lambda line: line.split(",")) \
              .filter(lambda line: float(line[2]) >= 44
                                   and float(line[2]) <= 45
                                   and float(line[3]) >= 15
                                   and float(line[3]) <= 16) \
              .map(lambda line: (line[2], line[3], line[4]))

parsedData = data.map(lambda row: array([float(row[0]),
                                         float(row[1]),
                                         float(row[2])]))

Figure 2.4: Creating a HDFS based RDD in Python

In order to explicitly specify a partition within a Hive query (which makes up the main part of creating a Hive based Spark RDD), the partition name must be specified in the query predicate. For example, Figure 2.5 demonstrates a query statement in which data from the SPARKTESTORCZLIB table is explicitly extracted from the R1_dingo_00 partition, selecting only data where the ra_range value is 60 (ra values between 40 and 60) and the dec_range value is 20 (dec values between 0 and 20).

select ra, dec, freq from sparktestorczlib
-- Explicitly specify the 'R1_dingo_00' partition
where filename='R1_dingo_00'
-- Explicitly specify the ra_range sub-partition
and ra_range=60
-- and explicitly specify the dec_range sub-partition
and dec_range=20
-- and further, limit the data to specific ranges
-- of ra and dec
and ra between 44 and 45
and dec between 15 and 16
-- group the results
group by ra, dec, freq

Figure 2.5: Hive QL statement for a partition based table scan with grouping.

We can also see that this query narrows the selection criteria further by adding range based predicates to the partition selection. Contrast the above statement with the Python code necessary to group an RDD based on raw HDFS data, as illustrated in Figure 2.6.

# Create the RDDs from the files with the required
# data. These files need to be identified prior
# to submitting the code!
data1 = sc.textFile("hdfs:/.../part-m-00000.gz")
data2 = sc.textFile("hdfs:/.../part-m-00001.gz")

# Combine the individual RDDs into one
AllData = data1
AllData = AllData.union(data2)

# Extract the required fields, filter and group
# the data
data = AllData.map(lambda line: line.split(",")) \
              .filter(lambda line: float(line[2]) >= 44
                                   and float(line[2]) <= 45
                                   and float(line[3]) >= 15
                                   and float(line[3]) <= 16) \
              .map(lambda line: (line[2], line[3], line[4])) \
              .distinct()

Figure 2.6: RDD creation with grouping for HDFS based files.

2.6.6 Spark process settings

Two test regimes were run to evaluate the effect of different Spark job settings. Approaches to running and tuning Spark resource allocation are covered in detail by Ryza [67, 68]. Given the small size of the test cluster, resources that could be made available for the Pyspark test programs were restricted (compared to the resources that would be available in a production sized cluster).

The preliminary runs were submitted with a default Spark configuration of five executors, 2 GB memory for each executor and one CPU core per executor. This setting may not be appropriate for a production environment; however as the initial testing was to identify possible performance trends of the table formats tested, this was not considered a major limitation.

A secondary round of tests was performed with uprated Spark configurations calculated according to the guidelines provided by Ryza. Based on five m1.xxl worker nodes with 16 cores and 64 GB of RAM each, the optimum setting for a single job on this cluster, avoiding over utilisation of system resources, was calculated to be 14 executors (three per node, with the exception of the node running the Application Master), four CPU cores per executor for a total of 12 cores used per node, and 14 GB of memory for each executor for a total of 42 GB per node.
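Expressed as SparkConf properties, the optimised settings above correspond to something like the sketch below (in practice the tests were submitted with the equivalent spark-submit options).

# Sketch of the optimised resource settings calculated above.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("OptimisedTest")
        .setMaster("yarn-client")
        .set("spark.executor.instances", "14")   # 3 per node, minus the AM
        .set("spark.executor.cores", "4")        # 12 cores used per node
        .set("spark.executor.memory", "14g"))    # 42 GB used per node

sc = SparkContext(conf=conf)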

2.7 Benchmark framework

Although the three test programs run during the course of this investigation used unsupervised learning and required the creation of only one RDD, future work into supervised pipelines will require multiple data selections, incorporating training, validation and test datasets into appropriate RDDs. In order to assess the performance of RDD creation and its incorporation within the Python machine learning framework, a benchmark framework was set up.

2.7.1 RDD creation testing

Initial tests were set up to evaluate the response times needed to create RDD data sets within a Python program. Response times for three different scenarios were investigated:

• Creating an RDD from a full table scan;

• creating an RDD from a single specific partition; and

• creating an RDD from a specific partition where the results were grouped.

Results are detailed in Section 3.2.

2.7.2 Full table scan testing

The next phase of testing involved running a machine learning analysis with a full table scan. The KDE analysis program was chosen for this test and was run with a full scan of the Spark test tables as detailed in Table 2.8. The Hive external table was not used in these tests. Results of these tests are detailed in Section 3.3.2.

2.7.3 Partition testing

Two sets of tests were prepared to investigate the performance benefits of partitioning: firstly against one specific partition, and secondly against three specific partitions. The KMeans cluster analysis program was prepared to retrieve data from one partition of the target test tables detailed in Table 2.8. The Hive external table was included in these tests as well, as that table had been set up with data from one partition specifically for this comparison. Results of these tests are detailed in Section 3.3.1. The Principal Component Analysis program was also run against data selected from three specific partitions. Results of these tests are detailed in Section 3.3.3.

2.7.4 Correlation testing

Taking the data and problem space tables as discussed in Section 2.6.4, three main series of tests were run. While the total areas varied for these tests, the basic fine grain of data cubes one arcminute square was used throughout. Firstly, a series of tests was run for a set area four arcminutes square within the area of 180 degrees Right Ascension, -46.5 degrees Declination, resulting in 540 correlation tests. These tests were run with six different ThreadPool process counts, each tested against wavelength histograms of 1,000, 5,000 and 10,000 elements.

The next test series was run with a consistent array size of 15,000 elements and 56 ThreadPool processes against data cube areas of five, six and seven arcminutes square, resulting in 2,170, 3,572 and 6,072 correlation permutations.

Finally, the test suite ran analyses against a consistent 4 arcminute square area with varying array wavelength sizes of 15,000, 20,000, 25,000 and 30,000 elements. Results of these tests are detailed in Section 3.3.4.

2.7.5 Java on Spark

While not a primary focus of this investigation, tests of the Java program used to separate the detection and parameter files from the Duchamp output file were run on the Spark framework to see if any performance benefits could be gained compared to running the process locally on one server. Results of these tests are detailed in Section 3.4.

Chapter 3

Results

This chapter describes the results obtained from the test scenarios identified in Chapter 2. Section 3.1 illustrates the load elapsed time and I/O rates for a representative file of approximately 2 GB. Section 3.2 details the results of the Spark RDD creation programs. Section 3.3 details the results of the KMeans (Section 3.3.1), KDE (Section 3.3.2) and PCA (Section 3.3.3) analyses, as well as the results of the correlation analysis program in Section 3.3.4.

3.1 Writing file data to HDFS

Elapsed time and network I/O statistics were recorded for an example raw file, the 2.1 GB dingo 01 parameters file. Job elapsed time (which was consistent for all file loads of this size) was 00h 01m 19s. The network I/O receive rate was 14.1 Mbyte/sec and the network I/O transmit rate was 22.8 Mbyte/sec.

3.2 RDD Creation

Initial testing focussed on RDD creation only, in order to do a preliminary assessment on the read performance of the various Hive table formats. The performance of RDDs created directly from data files in the HDFS file system was also assessed.

This testing also illustrates the performance benefits of table partitioning in relation to the explain plans of Hive queries - refer to Section 4.6.1 for a discussion of the explain plan statistics.

3.2.1 HDFS and Hive baseline I/O read rates

As a baseline for comparison of I/O rates, full table scans were performed over the data files making up the dingo 01 detections external table (56.8 GB and 226,439,400 rows) and the Parquet test table with snappy compression filtered for dingo 01 survey data (26.5 GB and 226,438,015 rows). While a full table scan was performed of these two tables, filtering mechanisms were in place to restrict the final result set to 28,996 rows of data. Total elapsed time for the job scanning the data files making up the dingo 01 detections survey was 00:03:01. Total elapsed time for the job scanning the Parquet test table was 00:01:54. Significant variation in I/O demands was observed, as shown in Figure 3.1.

Figure 3.1: Baseline HDFS and Hive I/O Rates.

3.2.2 Full table scans

In this scenario, a subset of data was selected from the scanned tables, but no partition was explicitly specified, resulting in a full table scan. Test programs were run a minimum of seventy times for each table, and the average, minimum, maximum and interquartile ranges were recorded. Results of these tests are shown in Figure 3.2.

There was little variation in the response times for all tests (i.e. the interquartile ranges, IQR, were very narrow for all tests). However, the tests against the Parquet table with gzip compression (median response time 144.98, IQR 135.65–158.16 seconds) and Parquet with snappy compression (median response time 139.40, IQR 136.34–153.09 seconds) had faster median response times than all other test scenarios.

Figure 3.2: Full scan response times in seconds.

3.2.3 Partition based table scan

The tests were then repeated but with explicit partition selection in the query predicates for the data selected from the test tables. The results from test runs against the Hive external table were also included for comparison. This external table was created with the equivalent of one partition of data from the dingo 01 detection file to match the data extracted from the partitioned Hive tables. Test programs were run a minimum of eighty times. Results are shown in Figure 3.3.

Again, there was very little variation in the response times for all tests, but results for Parquet gzip (Median 41.40, IQR 40.37–43.36 seconds) and Parquet snappy (median 42.53, IQR 41.40–44.12 seconds) showed faster median response times.

Figure 3.3: Partition based scan response time in seconds.

3.2.4 Partition based scan with grouping

Lastly, RDD creation with explicit partition selection and grouping was tested. Test programs were run a minimum of eighty times. Results are shown in Figure 3.4.

There was little variation across all test scenarios. Parquet gzip (median 50.69, IQR 49.14–53.05 seconds) and Parquet snappy (median 51.48, IQR 50.15–53.41 seconds) consistently returned the fastest median response times.

Figure 3.4: Partition based scan with grouping response time in seconds.

3.3 Python Machine Learning test programs

Having completed the initial RDD creation testing, the three Python based analysis programs were then tested against varying selections of data. These programs were also tested with the different Spark process settings discussed in Section 2.6.6. Cluster network I/O statistics were recorded for two of the full scan job tests:

• A full read of data directly from the HDFS files making up the dingo 01 external table, which was then processed as a KMeans analysis; and

• a full scan on the Parquet test table with GZip compression, which was processed as a Kernel Density Estimation analysis.

These I/O statistics were included for comparison with the I/O statistics generated from the correlation tests described in Section 3.3.4.

3.3.1 KMeans

The KMeans analysis program was run against one coarse partition grain (the filename='R1_dingo_00' predicate) with additional predicates limiting the selection to wavelengths between 5500 and 5505; the Hive QL used to create the Spark RDD is illustrated below in Figure 3.5.

sqlCtx.sql("use skynet")
data = sqlCtx.sql("select freq, wavelength, flux \
                   from [table definition] \
                   where filename='R1_dingo_00' \
                   and wavelength between 5500 and 5505 \
                   group by freq, wavelength, flux")

Figure 3.5: Hive RDD creation for KMeans testing.

The preliminary runs were submitted with the default Spark configuration discussed in Section 2.6.6. This configuration was used as a preliminary starting point, and the setting should not be considered appropriate for a production environment. The tests were then re-run with the optimised settings. Both test regimes were run a minimum of fifty times. Results are shown in Figure 3.6.

Running a KMeans analysis requires multiple iterative calculations across the same set of data, and in this case the analysis was run through twenty iterations. As expected, running this program with the optimum Spark settings reduced the response times from the default Spark settings by at least 50% across all test scenarios. We also see the continuing trend of Parquet gzip (median 197.62, IQR 194.99–199.10 seconds) and Parquet snappy (median 195.92, IQR 191.98–198.81 seconds) returning the fastest median response times.

Figure 3.6: KMeans analysis - total elapsed time in seconds.

The KMeans analysis was then rerun for a full scan of data from the dingo 01 data files making up the external table, in order to retrieve network I/O traffic data from the cluster logs. As can be seen from Figure 3.7, network I/O traffic peaked at approximately 70 Mbytes/sec with an average traffic rate of approximately 30 Mbytes/sec. Traffic rates over the job's lifecycle can be seen in Figure 3.8.

Figure 3.7: KMeans analysis - Network I/O traffic for a full scan of HDFS data.

Figure 3.8: KMeans analysis - Network I/O lifecycle traffic for a full scan of HDFS data.

3.3.2 Kernel Density Estimation

The KDE analysis was tested against a full scan of the table, with both the default and optimised Spark settings, and again both test regimes were run a minimum of fifty times. Figure 3.9 illustrates the Hive query used to create the RDD.

sqlCtx.sql("use skynet")
data = sqlCtx.sql("select ra, freq from sparktestrc \
                   where ra between 60 and 61 \
                   and dec between -1.5 and 1.5 \
                   group by ra, freq")

Figure 3.9: Hive RDD creation for KDE full scan testing.

Results of these tests are shown in Figure 3.10.

There was a significant reduction in median response times for the Parquet tests with the default Spark settings, at least 50% lower than the times recorded by the next fastest tests, the RC File and Hive Text tests. The tests with optimised Spark settings show Parquet gzip (median 115.93, IQR 106.33–118.77 seconds) and Parquet snappy (median 115.83, IQR 107.27–118.59 seconds) again returning the fastest median response times.

Figure 3.10: KDE full scan analysis - total elapsed time in seconds (default and optimised Spark settings).

The KDE analysis was then rerun for a full scan of data from the Parquet test table (gzip compression) in order to retrieve network I/O traffic data from the cluster logs. The program was run four times, generating four different RDD sizes for analysis of 7,615, 200,467, 478,601 and 769,098 rows of data. As can be seen from Figures 3.11 and 3.12, network I/O traffic peaked at approximately 12 Mbytes/sec with an average traffic rate of approximately 10 Mbytes/sec, significantly lower than for a full HDFS scan.

Figure 3.11: KDE analysis - Network Receive I/O traffic for a full table scan.

Figure 3.12: KDE analysis - Network Transmit I/O traffic for a full table scan.

Traffic rates over the job's lifecycle can be seen in Figures 3.13 and 3.14.

Figure 3.13: KDE analysis - Network Receive I/O lifecycle traffic.

Figure 3.14: KDE analysis - Network Transmit I/O lifecycle traffic.

3.3.3 Principal Component Analysis

The Principal Component Analysis program is interesting because there is significant manipulation of data within the Python program to populate the arrays necessary for input into the astroML algorithm. These manipulations are necessary because the catalogue data used returns ragged arrays, which need to be trimmed for inclusion. A more efficient design solution would entail pivoting the raw data within Hive and presenting it as a pre-computed table. This program was run against three coarse grained filename partitions, and also against the finer grained RA and Dec range partitions, as shown in Figure 3.15.

sqlCtx.sql("use skynet")
rawData = sqlCtx.sql("select distinct ra, dec, \
                      object, freq, wavelength wavelengths, flux, \
                      s.threshold, rn tidx \
                      from sparktestorcsnappy s \
                      inner join threshold_lookup ti \
                      on s.threshold=ti.threshold \
                      where s.filename in ('R1_dingo_00', 'R1_dingo_01', \
                      'R1_dingo_02') \
                      and ra_range=80 \
                      and dec_range=20 \
                      and dec between -0.5 and 0.5")

Figure 3.15: Hive RDD creation for PCA testing.

Results are shown in Figure 3.16, and both test regimes were run a minimum of fifty times.

With a more complex set of predicates in the Hive query populating this RDD, we see that for the standard Spark settings, response times are similar across all test scenarios. However, with the optimised Spark process settings, we again see the Parquet gzip (median 247.96, IQR 241.94–263.62 seconds) and Parquet snappy (median 250.47, IQR 245.17–264.09 seconds) tables consistently returning the fastest response times.

Figure 3.16: PCA analysis, three partitions - total elapsed time in seconds (default and optimised Spark settings).

3.3.4 Correlation

Initial tests were run across a consistent sample of wavelength data in a 4 arcminute square area; this comprised 540 correlation comparisons on wavelength histogram array sizes of 1,000, 5,000 and 10,000 elements. Figure 3.17 shows that very little difference was observed in the elapsed time across all tests.

Figure 3.17: Correlation analysis elapsed time - thread pool size against histogram array size.

Further tests were run with a consistent histogram array size of 10,000 elements, against jobs with thread pool sizes of between 8 and 56 processes. Memory allocations (Figure 3.18) and network I/O traffic (Figures 3.19 and 3.20) remained consistent across all tests regardless of the number of Python/Spark thread pool processes, with a maximum cluster memory allocation of ≈ 89.5GB and maximum network I/O receive and transmit rates of ≈ 6MB/sec.

Figure 3.18: Correlation analysis memory allocation - 10,000 element array against Pyspark thread pool size.

Figure 3.19: Correlation analysis network I/O (receive) - 10,000 element array against Pyspark thread pool size.

Figure 3.20: Correlation analysis network I/O (transmit) - 10,000 element array against Pyspark thread pool size.

Cluster CPU utilisation for correlation tests with array sizes of 10,000 elements was on average between 25% and 30%, with the jobs launching 48 and 56 processes initially peaking at ≈ 40% (Figure 3.21).

Figure 3.21: Correlation analysis CPU utilisation - 10,000 element array against Pyspark thread pool size.

These tests were then repeated with wavelength histogram array sizes of 15,000, 20,000, 25,000 and 30,000 elements. Job elapsed times remained reasonably consistent at between 1h 16m and 1h 23m. This trend continued with jobs processing wavelength histogram array sizes of 40,000, 60,000, 75,000 and 80,000 elements with a thread pool size of 56 processes.

Figure 3.22: Correlation analysis elapsed time - 56 thread processes, array sizes for arrays 15,000 - 80,000 elements.

For jobs processing array sizes between 15,000 and 30,000 elements, maximum job memory allocation remained consistent at ≈ 89.5GB (Figure 3.23), while network I/O transmit and receive rates peaked slightly higher at ≈ 8MB/sec. The average transmit and receive rates were ≈ 6MB/sec (Figures 3.24 and 3.25).

Figure 3.23: Correlation analysis memory allocation - 56 thread processes, array sizes 15,000 - 30,000 elements.

Figure 3.24: Correlation analysis network I/O (receive rate) - 56 thread processes, array sizes 15,000 - 30,000 elements.

Figure 3.25: Correlation analysis thread network I/O (transmit rate) - 56 thread processes, array sizes 15,000 - 30,000 elements.

Cluster CPU utilisation for these jobs peaked just short of 60% for the job processing arrays of 30,000 elements, but the average CPU utilisation was between 30% and 40% (Figure 3.26).

Figure 3.26: Correlation analysis CPU utilisation - 56 thread processes, array sizes 15,000 - 30,000 elements.

For jobs processing array sizes between 40,000 and 80,000 elements, maximum job memory allocation again remained consistent at ≈ 89.5GB (Figure 3.27).

Figure 3.27: Correlation analysis memory - 56 thread processes, array sizes for arrays 40,000 - 80,000 elements.

Network I/O transmit and receive rates peaked higher at ≈ 14MB/sec and the average transmit and receive rates were ≈ 7MB/sec.

Figure 3.28: Correlation analysis network I/O (receive rate) - 56 thread processes, array sizes 40,000 - 80,000 elements.

Figure 3.29: Correlation analysis thread network I/O (transmit rate) - 56 thread processes, array sizes 40,000 - 80,000 elements.

Cluster CPU utilisation for these jobs peaked just short of 60% for the jobs processing the larger array sizes, but again the average CPU utilisation was between 30% and 40% (Figure 3.30).

Figure 3.30: Correlation analysis CPU utilisation - 56 thread processes, array sizes 40,000 - 80,000 elements.

Finally, tests were run on larger areas: a 6 x 6 arcminute area comprising 2,170 correlations, a 7 x 7 arcminute area comprising 3,752 correlations and an 8 x 8 arcminute area comprising 6,072 correlations. These tests were all run with 55 ThreadPool processes and a consistent wavelength histogram array size of 15,000. Job elapsed times are shown in Figure 3.31.

Figure 3.31: Correlation analysis thread pool testing - Elapsed times for 2k - 6k correlation comparisons.

Cluster CPU utilisation for jobs processing 2,170 and 3,752 correlations was similar at approximately 32% after initially peaking at just short of 40%. However, the job processing 6,072 correlations averaged approximately 25% CPU utilisation.

Figure 3.32: Correlation analysis thread pool testing - CPU Utilisations for 2k - 6k correlation comparisons.

Maximum memory allocation remained consistent over the majority of the job lifecycles at ≈ 89.5 GB, as shown in Figure 3.33.

Figure 3.33: Correlation analysis thread pool testing - Memory allocation for 2k - 6k correlation comparisons.

Network I/O traffic rates are shown in Figures 3.34 and 3.35.

Figure 3.34: Correlation analysis thread pool testing - Network I/O (receive) for 2k - 6k correlation comparisons.

Figure 3.35: Correlation analysis thread pool testing - Network I/O (transmit) for 2k - 6k correlation comparisons.

3.4 Java on Spark

As discussed in Section 2.7.5, the Java based detection and parameter extraction process was run both locally and on the Spark framework to investigate the potential performance benefits. The elapsed times in seconds of five Java extraction processes running in parallel against five separate output files from the Duchamp source finder were compared; one test was run locally, and the second with the processes running on the Spark cluster (Table 3.1).

Table 3.1: Spark extraction process test

Job process framework   Elapsed time (seconds)
Spark                   3086
Local Java              3418

Chapter 4

Discussion

This chapter discusses the implications and conclusions that can be derived from the test run results discussed in Chapter 3. Section 4.1 discusses the comparison between creating a Spark RDD with Hive and with direct HDFS file calls. Section 4.2 discusses the creation of subject specific data snapshots and the advantages that this technique offers. Section 4.3 examines the capability of the framework to perform multiple parallel processes during a correlation analysis. Section 4.4 looks in detail at the data compression that can be obtained using various storage mechanisms, and Section 4.5 examines response times and cluster I/O during the test runs. Section 4.6 discusses the capability and advantages of the data partitioning functions that the Hive structure offers. Finally, Section 4.7 examines the overall usability of the framework and Section 4.8 discusses the approaches that were used to tune the Spark jobs.

4.1 RDD Creation - HDFS Vs Hive context calls

As illustrated in Section 2.6.5, creating a Spark RDD from files stored on HDFS, while possible, can be an overly complex procedure. It should be immediately obvious that this is not the most efficient or user friendly method. Raw data files loaded into HDFS will usually be compressed (in our case, using the Gzip codec) and the compression procedure creates multiple files from the original data file. This problem will be exacerbated if the entire dataset is made up of a large number of files. The example in Figure 2.4 has been truncated for brevity; in fact there are thirty partitions to be loaded for a full scan of DINGO 01 data. Should only subsets of the raw datasets be required, these subset files will need to be identified in order to include them in the RDD creation process, and appropriate metadata will be necessary to identify which files contain the relevant data. What can be seen from the above example is that if the data required for a particular analysis extends beyond one file in HDFS, which is likely, then an RDD needs to be created for each file and the RDDs concatenated with the union transformation to create one RDD for processing. We also see that if the resultant data needs to be aggregated (for example, a distinct call or group) then additional processing is required via a Python .distinct(...) call in the RDD definition. This complexity can be avoided using Hive.

One of the primary advantages of using Hive within Python for astronomical analyses is the ability to quickly and easily tailor the RDD simply by using Hive QL statements to define the particular data extract required for the analysis. A properly designed Hive table is also self explanatory, indicating which elements of a dataset, as well as which partitions, contain the data required for a particular analysis. It should also be noted that, due to the wide adoption of SQL and its various derivativesi in the data analysis community, Hive will continue to be a core element of the Hadoop ecosystem and continue to evolveii. It is therefore reasonable to assume that Hive and other SQL based open source analysis products will continue to be a viable long term tool for very large data analysis.

Taking this concept to its logical conclusion, the creation of RDDs can be further simplified by using Hive to create instantiated subset tables of subject specific data within the Hive/HDFS repository. Performance can be further improved by incorporating data filtering and grouping at the table level, reducing the processing requirements on the Python analysis program.

4.2 Snapshot Generation

The Duchamp source finder and the associated Java extraction process generate files containing a considerable amount of data that is potentially not useful for analysis; for example, the parameter files contain data specific only to the parameters used for a single source finding run. A considerable portion of this data could therefore be discarded, leaving only subject-specific snapshots of data to be generated and retained (for example, position data in RA and Dec, as well as flux, wavelength and frequency data).

This is not to say, however, that all parameter data should be discarded, as data quality analysis snapshots could be designed based on the source finder input parameters, error counts and number of detections.

Such subject specific data snapshots would then be optimised in terms of the partitioning schema used, as well as potential index architectures, to give the most efficient processes possible for end users running machine learning programs from Python or R. These subject specific snapshots could also, by design, perform significant data manipulation before ingestion by analysis programs, and this would offer significant performance enhancements as the snapshots would be physically instantiated within the Hive/Hadoop infrastructure. A possible example of this is the PCA program, where the source data is manipulated within the Python program from a jagged array and converted into a static array. A far more efficient design would be to pivot the data within Hive and present an instantiated data snapshot specific to the PCA analysis. This would reduce the processing time required, as much of the data manipulation “heavy lifting” would already have been performed within the distributed data store itself.

iLeveraging the Hadoop ecosystem, we have for example Hive, Impala, Shark (https://github.com/amplab/shark/wiki) and Presto (prestodb.io/). This is not an exhaustive list. Other database vendors like Cassandra (cassandra.apache.org/) are also adopting variations of SQL.
iiFor example, the Hive Stinger.next initiative. https://hortonworks.com/blog/stinger-next-enterprise-sql-hadoop-scale-apache-hive/

On the assumption that this is a realistic scenario, the test tables were created based around this concept as detailed in Table 2.6.

It does need to be pointed out, however, that any final subject-specific table sizes will vary significantly in size based upon just how much data will need to be extracted from the initial raw datasets. It is also quite feasible that multiple subject-specific table snapshots will need to be created from the same raw data sets.
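As a concrete illustration of the snapshot idea (a sketch only; the pivoted column names and threshold values are hypothetical), a subject specific table can be instantiated directly inside Hive with a CREATE TABLE ... AS SELECT statement, so that the Python analysis program reads a pre-pivoted matrix rather than reshaping the data itself.

# Hedged sketch of instantiating a subject specific snapshot; the pivoted
# columns and threshold values are illustrative only. Assumes sqlCtx as in
# Figure 2.3.
sqlCtx.sql("use skynet")
sqlCtx.sql("""
    create table pca_snapshot stored as parquet as
    select ra, dec,
           max(case when threshold = 3 then flux end) as flux_t3,
           max(case when threshold = 4 then flux end) as flux_t4,
           max(case when threshold = 5 then flux end) as flux_t5
    from sparktestorcsnappy
    group by ra, dec
""")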

4.3 Correlation testing

Following on from the snapshot discussion in Section 4.2 above, a practical application is the generation of the problem space and the histogram arrays for a correlation analysis, as discussed in Section 2.6.4. The Python/Spark analysis program first extracts the problem space requirements and then uses that information to spawn the process threads to calculate the required correlations. A Hive query then extracts the relevant wavelength histogram arrays for a particular area (in this case, one arcminute by one arcminute square) for comparison.

Given the small size of the cluster and the available resources for Spark and YARN (60 CPU vcores and 240 GB of memory), thread pooling within the Pyspark program demonstrated a consistent memory usage of ≈ 89.5GB for all jobs. All jobs were run with no override of the default YARN process settings, resulting in YARN allocating a consistent 60 executors running one vcore per executor for the entire job lifecycle, with each executor having approximately 1.5 GB of memory allocated.

Cluster CPU utilisation for jobs processing 540 correlations on arrays of up to 10,000 elements showed a relatively consistent load of between 25% and 30%, differing only in a peak usage of approximately 40% for jobs running 48 and 56 thread processes on arrays of 10,000 elements. Network I/O traffic rates for these jobs appeared stable at between ≈ 5 and ≈ 6 Mbyte/sec for arrays sized up to 10,000 elements.

Jobs running tests on 540 correlations with arrays of 15,000 to 30,000 elements showed a slightly higher CPU utilisation of 30% to 38%, with a peak utilisation of approximately 55% for the job processing 30,000 element arrays. Network I/O traffic rates were 5 to 7 Mbyte/sec with temporary peaks to 8 Mbyte/sec.

Jobs running tests on 540 correlations with larger array sizes of 40,000 to 80,000 elements displayed very similar elapsed times. CPU utilisation averaged between ≈ 26% and ≈ 38% with peak load at ≈ 55%. Average network I/O transmit and receive rates for jobs processing these larger arrays were ≈ 7 Mbyte/sec, with a maximum rate of ≈ 14 Mbyte/sec for both transmit and receive. This leads to the conclusion that even a small cluster will not be constrained when processing arrays of at least up to 80,000 elements for any particular correlation test.

During these tests the YARN memory allocation across the cluster was ≈ 89.5GB, and given that on this cluster Spark and YARN have 240GB available then it seems likely that with appropriate job tuning very large array sizes for these analyses can be accommodated.

Finally, jobs running 2,170 to 3,752 correlation comparisons on 15,000 element arrays averaged a cluster CPU load of ≈ 30%, while the job running 6,072 comparisons averaged ≈ 25%. Because of the default Spark settings, memory allocation was consistent with the previous tests at ≈ 89.5GB. The median network I/O traffic rate for jobs processing 2,170 and 3,752 correlations was ≈ 5.5 Mbyte/sec, while the median traffic rate for the job running 6,072 correlations was ≈ 3.4 Mbyte/sec.

With the resources available on this cluster, and with appropriate Spark job tuning, it is apparent that the Spark platform is very capable of running large correlation analyses.

4.4 Data Compression

Given the expected data volumes, as well as the data reduction discussed in Section 4.2, data compression becomes a critical component and was investigated as a matter of priority. As we can see from Table 2.6, the ORC table format utilising the Zlib compression codec resulted in the most efficient compression. This also demonstrates the importance of properly designed data snapshots from the initial raw tables; comparing the total raw file sizes of the detection and parameter data for the dingo 01 and dingo 02 datasets (132.9 GB) against the subject specific table created in ORC format with Zlib compression (17.1 GB), we arrive at a final data product only 12% of the size of the initial raw data files.

4.5 Performance comparisons

4.5.1 Cluster I/O comparisons

As demonstrated in Section 3.2.1, significant variation was observed in the cluster I/O statistics for a full table read of the HDFS files making up the dingo 01 detections survey compared to a partial read of the Parquet test table SPARKTEST, filtered for data from the dingo 01 survey (recall that the internal test tables comprised data from both the dingo 01 and dingo 02 surveys). While the difference in data size certainly contributes to the lower I/O demands for the Parquet table scan (26.5 GB vs 56.8 GB for the HDFS data files), it can be seen that a columnar based storage format (Parquet) coupled with an appropriate compression format (in this case, Snappy) can significantly reduce I/O demands.

Examining the cluster network I/O logs for a full scan of data files directly stored in HDFS (see Figures 3.7 and 3.8) and a full scan of data from a Hive based internal table (see Figures 3.11, 3.12 and 3.13), but with specific partition selection for the Parquet table and a reduced selection criteria for the raw HDFS data files to match, network I/O traffic for the Hive internal table was significantly lower (on average, approximately 10 Mbytes/sec as opposed to approximately 30 Mbytes/sec).

This also illustrates the benefit of creating subject specific snapshots of data, which will require fewer system resources to process.

4.5.2 Response times

Across all the conducted tests, the Parquet table format consistently provided the fastest response times, with little difference between Parquet tables using either the gzip or Snappy compression codecs (Figures 3.2, 3.3, 3.4, 3.6, 3.10 and 3.16). From Table 2.6 we see that the ORC table format provides the best data compression, whereas the Parquet table formats provide superior response times and read performance. These trade-offs will need to be assessed against the priorities of any final production system in order to support the primary focus: storage efficiency or read performance.

4.6 Hive Partitioning

The obvious benefit of properly designed partitioned tables is that requests for specific subsets of data can be run against specific physical partitions of data, avoiding full table scans and potentially providing significant performance benefits, both in terms of response times and system resource usage.

While partitions can be added or dropped with standard DDL statements, the partition specification cannot be modified after table creation. Table and partition definitions should be carefully designed to address specific problems while supporting the widest possible use cases. Indexes, while not utilised in this study, can likewise be dropped and recreated on an existing table. New table definitions can be created very quickly, with table creation DDL statements usually completing within seconds. Any such modifications would need to be evaluated on a case-by-case basis.

As illustrated in Sections 3.2.3 and 3.2.4, table partitioning allows simple, efficient access to HDFS data for Spark based applications. Specific data can be retrieved by partition and sub-partition name, avoiding full table and file reads, which demonstrably improves job performance. Partition schemas in Hive can be designed for various requirements; for example, astronomical data could potentially be partitioned on position coordinate systems like HEALPixiii.

Some analyses will require a full table scan of the data; in these cases, attention to the Spark runtime configuration parameters on job submission will be required. However, careful attention to partitioning data tables will in many cases offer significant performance benefits and a reduction in unnecessary demand for system resources.

4.6.1 Hive explain plans

Examining the explain plans for both full table and partition based scans immediately illustrates the benefit of partitioning, as only the specified partitions are scanned to retrieve the relevant data. A full scan of any of the Hive based test tables defined in Table 2.8, with the exception of the Hive external table, scans the full table of 445,121,256 records, while the same query with partition specifications in the query predicates scans and returns far fewer rows, with an associated performance benefit.

The explain plans generated from SQL statements run against partitioned Hive test tables were examined. The RDD creation SQL statement for the KDE Full scan test (Figure 3.9) does not explicitly specify a partition in the query predicates and generates the explain plan extract shown in Appendix F, Figure F.2. We can see that we are scanning the full table exactly as expected, which in this case is comprised of 445,121,256 rows.

The SQL statement used to create the RDD for the KMeans analysis (see Figure 3.5) selected data from one specific partition, in this case the primary partition R1 dingo 00. We can see from the explain plan (see Appendix F, Figure F.4) that this query has reduced the scan to 16,885,586 rows, which is the row count for that particular partition.

The SQL statement used to create the PCA RDD (see Figure 3.15) narrows the selection criteria further: an inner join to a lookup table (threshold_lookup), together with explicit scanning of three primary partitions (R1_dingo_00, R1_dingo_01 and R1_dingo_02) and two secondary partitions (ra_range and dec_range), results in a scan of 2,309,811 rows (shown in Appendix F, Figure F.6).
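The plans in Appendix F were generated by Hive itself; a comparable, if less detailed, check can be made from a Python session with the DataFrame explain() method, as sketched below, to confirm what a given query will scan. The query shown is the KMeans selection from Figure 3.5 and sqlCtx is assumed as in Figure 2.3.

# Sketch only: print Spark's own view of the query plan for the KMeans query.
df = sqlCtx.sql("select freq, wavelength, flux \
                 from sparktestorczlib \
                 where filename='R1_dingo_00' \
                 and wavelength between 5500 and 5505")
df.explain(True)   # prints the parsed, analysed and physical plans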

iiiHEALPix is an algorithm for pixelisation of data on a sphere. See http://healpix.sourceforge.net/

4.7 Usability

Another important aspect of our investigation was the ease of use of the framework for the astronomer. Creating a Spark RDD from a Hive table is intuitive, easy to use and we believe provides the best environment for comparing different slices of data. This is due in no small part to the Hive query language, which shares enough similarity to ANSI standard SQL to be familiar to a very wide audience. Hive snapshots are, if properly designed, self documenting whereas raw HDFS data files would need an additional meta data library to identify file formats and file locations in order to create the appropriate RDD. Hive based table snapshots can also be optimised for subject specific data mining and analysis, and with properly designed partition schemas and appropriate indexes will offer the fastest response times for analysis of very large datasets.

This will also be important in a supervised machine learning pipeline. Creating the separate training and validation model datasets, as well as the test data, would be simple to implement in Python with Hive based RDDs.

4.8 Tuning Spark jobs

Tuning the various Spark configurations can have a significant impact on the performance of Python jobs running on Spark. Optimising the resources available to a particular Spark job (in terms of the number of executor tasks, the memory available for each executor and the number of CPU cores allocated per executor) can significantly improve response times for the analysis of large datasets. This can be seen in Figure 3.10, which demonstrates the response times for a full table scan of the test tables composed of the DINGO 01 and DINGO 02 datasets. Each of these tables has in excess of 430 million rows, and table sizes range from 17.1 GB for the ORC table with Zlib compression to the raw HDFS data files, which total 132.9 GB. Of course, on a small test cluster optimisation options are somewhat restricted; however, a full sized production cluster with hundreds of data and worker nodes would present a platform far more capable of tuning individual jobs.

Chapter 5

Conclusions

5.1 Findings

This thesis demonstrates that the Hadoop ecosystem can provide a very effective environment for astronomical data research. In comparison to the existing data archive systems discussed in Section 1.1, the Hadoop ecosystem (incorporating HDFS, Hive, YARN and Spark) provides all of the following capabilities:

• An open source, scalable and effective parallel processing platform capable of analysing very large datasets;

• Integrated support for the Python programming language on the Spark parallel processing framework;

• Seamless support for freely available (and astronomy specific) scientific and machine learning libraries;

• Significant data compression capability; and

• Abstraction of complex parallel processing algorithms, leaving researchers (and especially research students) free to focus on the data analysis and machine learning relevant to their fields of inquiry.

Python has been identified as one of the most popular programming languages used for astronomical data analysis, and this thesis demonstrates that Python integrates very well into the Spark framework, particularly with the Hive data warehouse capability on Hadoop. The machine learning libraries demonstrated in this thesis also provide an easy to use environment for astronomical data analysis on very large datasets while abstracting the design and implementation complexity of a traditional parallel processing framework (for example, Map/Reduce or OpenMP/MPI).

Data compression within this framework has also been shown to be very effective, with demonstrated potential storage reductions to 12% of the original raw data size (see Section 4.4) depending on use cases, table formats and compression codecs. Correct selection of Hive table formats for read operations will also offer significant read performance gains; testing described in Chapter 3 consistently demonstrated that the Parquet file format provided the fastest response times across the test regimesi.

Furthermore, the Hadoop ecosystem functionality is constantly evolving and can potentially be significantly extended as discussed in Section 5.2 on Future Work.

Going forward, this framework has demonstrated the potential to offer astronomers a very effective tool whereby

• Commonly available and well understood tools can be used;

• Very large datasets can be stored, compressed, structured and analysed; and

• Subsequently, the astronomical end user community can concentrate on scientific analysis relevant to their field of study rather than investing substantial time and effort acquiring additional technical programming skills.

5.2 Future work

While this study focuses on astronomical machine learning using Python, Spark and Hive, the Hadoop ecosystem has further features worthy of additional investigation. These features have the potential to offer further functionality in terms of performance, as well as offering extended analysis capability beyond what Python can offer.

Hive Indexing Tests performed in this thesis did not utilise indexes; however, the capability to index data contained in Hive tables was added in version 0.7.0 to facilitate faster data retrieval times for Hive queries. Hive currently supports both BTree and Bitmap indexes. In combination with an effective partitioning scheme design, data retrieval times could be vastly reduced. For very large datasets, however, these reduced response times must be balanced against the processing necessary to create the indexes, as well as the potentially substantial disk storage required to store the indexes. Further investigations into storage requirements and performance benefits would be appropriate in order to formulate a set of guidelines as to when indexing data would be appropriate.
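Purely as an illustration (indexes were not tested in this study, and support for this DDL through Spark's HiveContext cannot be assumed, so the statements would normally be issued from the Hive CLI or the Hue query editor), a bitmap index on the flux column of the sparktest table could be declared with HiveQL along the following lines; the index name is hypothetical, and the statements are held in Python strings only for consistency with the other listings.

# HiveQL sketch only - not executed in this study
create_index_hql = """
CREATE INDEX flux_idx ON TABLE sparktest (flux)
AS 'BITMAP'
WITH DEFERRED REBUILD
"""
rebuild_index_hql = "ALTER INDEX flux_idx ON sparktest REBUILD"
# The index (and its additional disk storage) is only materialised when the REBUILD statement is run.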

iNote, however, that the formats tested are not a complete list of currently available file formats, and that new formats will continue to be developed; moving any such system into a production environment would require additional testing to ensure the correct file formats and compression codecs are selected to best serve the individual use cases.

Hive Analytical Functions Hive provides support for windowed analytical and aggregation queries, including “GROUP BY GROUPING SETS”, “GROUP BY CUBE” and “GROUP BY ROLLUP” functionality. These functions provide interesting opportunities to create customised multi-purpose rollup and snapshot tables for further analysis.
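As a hedged sketch of the kind of rollup snapshot this enables (the grouping columns are taken from the test table definition in Appendix C, sqlCtx is the HiveContext used throughout Appendix H, and whether a particular Hive/Spark combination accepts this syntax through HiveContext should be verified before relying on it), row counts per (filename, ra_range) partition, per filename and in total could be produced as follows:

summary = sqlCtx.sql("""
    select filename, ra_range, count(*) as n_rows
    from sparktest
    group by filename, ra_range
    grouping sets ((filename, ra_range), (filename), ())
""")
summary.show()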

Using Pig, Oozie and Hive for large ETL processes Apache Pig [51] is a high level, data flow language that abstracts Map/Reduce programming and as such is well suited for very large Extract, Transform and Load processes into a data warehouse. Coupled with the Hadoop process scheduler Oozie [69], Pig could provide the basis of a very effective process mechanism to ingest data from a variety of sources and formats, transform this data and then load the results into Hive tables for further analysis.

Unstructured and binary data While outside the scope of this study, it should be noted that the Hadoop ecosystem provides support for the storage, processing and analysis of unstructured and binary data. Many effective procedures exist for the processing of unstructured data from differing sources, a selection of which are discussed by Pasupuleti [70]. This could extend further into the capability of ingesting unstructured data into Hive tables. While Hive applies structure to data, the Pig dataflow language handles semi-structured and unstructured data. Pig scripts can be called from Python programs; Gates [71] demonstrates how to embed Pig scripts into a Python program, roughly as sketched below.
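For illustration only (this follows the Pig embedding interface described by Gates [71]; such a script is executed by Pig's embedded Jython interpreter, for example via pig thisfile.py, rather than by CPython, and the input and output paths are placeholders):

# Run with: pig thisfile.py (Pig's embedded Jython interpreter)
from org.apache.pig.scripting import Pig

# A simple dataflow: read delimited text and write it back out under another path
script = Pig.compile("""
    raw = LOAD '$input_dir' USING PigStorage(',');
    STORE raw INTO '$output_dir' USING PigStorage(',');
""")

result = script.bind({'input_dir': '/user/example/in',
                      'output_dir': '/user/example/out'}).runSingle()
if not result.isSuccessful():
    raise RuntimeError('Pig job failed')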

Binary files can be stored directly into HDFS and metadata extracted for further analysis. In the case of FITS files, these could be analysed with Python and Spark, as the relevant Python libraryii could be installed on the worker nodes in the same manner as any other Python library.
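A hedged sketch of what this could look like is shown below, assuming astropy is installed on the worker nodes as described in Appendix G and that sc is an existing SparkContext; the HDFS path and the selected header keywords are placeholders, and FITS files substantially larger than executor memory would need a different approach.

import io
from astropy.io import fits

def extract_header(path_and_bytes):
    # Each element of the RDD is a (filename, raw bytes) pair produced by binaryFiles()
    path, raw = path_and_bytes
    with fits.open(io.BytesIO(raw)) as hdulist:
        header = hdulist[0].header
        return (path, header.get('OBJECT'), header.get('NAXIS'))

# Read whole FITS files from HDFS as binary blobs and pull out selected header keywords
fits_files = sc.binaryFiles('/user/example/fits')
headers = fits_files.map(extract_header).collect()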

R on Spark There is a growing astronomy community using R [72], which makes the capability of running R programs against very large datasets using Spark a very interesting proposition. SparkR (R on Spark) as delivered within the Hue interface was, at the time of writing, still a prototype and unstable, and was therefore not tested. However, we believe that further development resulting in a stable release of SparkR on Hue, accessing the same HDFS and Hive datasets, would be worth serious evaluation.

Apache Dremel/Drill Apache Dremel/Drill is a high performance ad-hoc query system for the analysis of read only data. Melnik et al. [73] report response times in seconds for aggregation queries run over trillion row tables. Dremel provides a SQL-like query language that is intuitive and easy to use and would certainly be of interest once support for astronomical machine learning libraries becomes available.

iihttp://docs.astropy.org/en/stable/io/fits/

Hadoop and Heterogeneous RDBMS integration For organisations and institutions already utilising RDBMS systems, methodologies are available to integrate data on Hive and HDFS with these systems. Apache SQOOP [74] provides batch data transfers between Hadoop and any ODBC enabled RDBMS. ODBC Connectors are available for the most commonly used RDBMS systems including MySQL, PostgreSQL, Microsoft SQL Server, Oracle and DB2.

Hive on Spark Tests performed in this thesis utilised the default setting of Hive on Map/Reduce. Configuring Apache Hive to run on Spark replaces Map/Reduce as the execution engine for data processing, and potentially offers performance benefits over Hive on Map/Reduce. Currently in Beta (at the time of writing), Hive on Spark could transform Hive into a near real-time SQL engine for interactive queries over peta-scale datasets.

Alternatives to HDFS An Intel white paper released in 2016 [75] reported that the Lustre file system on Intel ran 20 percent faster than the same data on HDFSiii. This opens interesting possibilities for file sharing between Hadoop and classical HPC systems, as well as potential performance gains.

It is also possible to configure the Hadoop ecosystem to include an in-memory file system, Alluxioiv v. Preliminary work by Lawrie [76] suggests significant performance improvements may be possible, but not under all scenarios, and careful evaluation and design will be required to maximise any potential performance gains.

iiiSee the Intel Lustre connector for Hadoop at https://github.com/intel-hpdd/lustre-connector-for-hadoop ivhttp://www.alluxio.org/ vSee http://www.alluxio.org/docs/1.1/en/Configuring-Alluxio-with-HDFS.html

Appendices

Appendix A

Supplementary Material - Detection and Parameter File formats and raw data examples

This appendix details the file structure, field names and data types of the Detection and Parameter files extracted from the output files generated by the Duchamp source finder. The file structures of the generated Detection and Parameter files are shown in Tables A.1 and A.2. An example of the raw data is included in Section A.3 (Duchamp output file), Section A.4 (Detections) and Section A.5 (Parameters).

A.1 Detection file structure

Table A.1: Raw Detection file structure

Field name   Data type       Field name        Data type
idmsb        bigint          y2                int
idlsb        bigint          z1                int
x            decimal(5,1)    z2                int
y            decimal(5,1)    npix              int
z            decimal(5,1)    flag              string
ra hms       string          x av              decimal(4,1)
dec dms      string          y av              decimal(4,1)
vel          double          z av              decimal(4,1)
w ra         decimal(5,1)    x cent            decimal(4,1)
w dec        decimal(5,1)    y cent            decimal(4,1)
w 50         decimal(12,3)   z cent            decimal(4,1)
w 20         decimal(12,3)   x peak            int
w vel        decimal(12,3)   y peak            int
f int        float           z peak            int
f tot        float           ra rad            double
f peak       float           dec rad           double
sn max       decimal(6,2)    parameternumber   int
x1           int             obj               int
x2           int             filename          string
y1           int

A.2 Parameter file structure

Table A.2: Raw Parameter file structure

Field name        Data type   Field name              Data type
idmsb             bigint      flagfdr                 int
idlsb             bigint      alphafdr                string
cubename          string      fdrnumcorchan           string
outputfilename    string      threshold               string
daterun           string      snrcut                  string
dateloaded        string      minpix                  int
searchtype        string      minchannels             int
blankpixelvalue   string      minvoxels               int
flagtrim          int         flaggrowth              int
flagnegative      int         growththreshold         string
flagmw            int         growthcut               string
minmw             string      flagadjacent            int
maxmw             string      threshspatial           string
beamarea          string      threshvelocity          float
beammaj           string      flagrejectbeforemerge   int
beammin           string      flagtwostagemerging     int
flagbaseline      int         spectralmethod          string
flagsmooth        int         pixelcentre             string
smoothtype        string      detectionthreshold      string
hanningwidth      string      noiselevel              string
kernmaj           string      noisespread             string
kernmin           string      mean                    string
kernpa            string      stddev                  string
flagatrous        int         median                  string
recondim          int         madfm                   string
scalemin          int         madfmstddev             string
scalemax          int         numberdetections        int
snrrecon          int         parameternumber         int
filtercode        int         jobname                 string
filtername        string      detectionsadded         int
flagrobuststats   int         errorcount              int
statsec           string      rangecheckerrorcount    int

A.3 Duchamp output file example

H001_C2.fits.gz_hipass2_run_00001.par results.txt Results of the Duchamp source finder v.1.1.13: Wed Jul 4 17:23:40 2012

---- Parameters ----

Image to be analysed...... [imageFile] = input.fits

Intermediate Logfile...... [logFile] = duchamp-Logfile.txt

Final Results file...... [outFile] = results.txt

Saving reconstructed cube?...... [flagOutputRecon] = false

Saving residuals from reconstruction?..[flagOutputResid] = false

Saving mask cube?...... [flagOutputMask] = false

Saving 0th moment to FITS file?...... [flagOutputMask] = false

------

Type of searching performed...... [searchType] = spatial

Trimming Blank Pixels?...... [flagTrim] = false

Searching for Negative features?...... [flagNegative] = false

Removing Milky Way channels?...... [flagMW] = false

Area of Beam...... = No beam

Removing baselines before search?...... [flagBaseline] = false

Smoothing data prior to searching?...... [flagSmooth] = false

Using A Trous reconstruction?...... [flagATrous] = true

Number of dimensions in reconstruction...... [reconDim] = 1
Minimum scale in reconstruction...... [scaleMin] = 1
SNR Threshold within reconstruction...... [snrRecon] = 2

Filter being used for reconstruction...... [filterCode] = 1 (B3 spline function)

Using Robust statistics?...... [flagRobustStats] = true

Using FDR analysis?...... [flagFDR] = false

Detection Threshold...... [threshold] = 0.009

Minimum # Pixels in a detection...... [minPix] = 5

Minimum # Channels in a detection...... [minChannels] = 3

Minimum # Voxels in a detection...... [minVoxels] = 4

Growing objects after detection?...... [flagGrowth] = true

Threshold for growth...... [growthThreshold] = 0.0045

Using Adjacent-pixel criterion?...... [flagAdjacent] = true

Max. velocity separation for merging....[threshVelocity] = 7

Reject objects before merging?...[flagRejectBeforeMerge] = true

Merge objects in two stages?...... [flagTwoStageMerging] = true

Method of spectral plotting...... [spectralMethod] = peak

Type of object centre used in results...... [pixelCentre] = centroid

------

------

Summary of statistics:

Detection threshold = 0.009 Jy/beam
Detections grown down to threshold of 0.0045 Jy/beam

Not calculating full stats since threshold was provided directly.

------

Total number of detections = 10

------

------

Obj# Name X Y Z RA DEC VEL w_RA w_DEC w_50 w_20 w_VEL F_int F_tot F_peak S/Nmax X1 X2 Y1 Y2 Z1 Z2 Npix Flag X_av Y_av Z_av X_cent Y_cent Z

_cent X_peak Y_peak Z_peak

[km/s] [arcmin] [arcmin] [km/s] [km/s] [km/s] [Jy/beam km/s] [Jy/beam] [Jy/

beam] [pix]

------

1 J122557-833843 3.7 5.5 143.8 12:25:57.03 -83:38:43.06 5688.896 31.53 28.70 48.318 106.777 287.676 6.755 0.493 0.028 3.10 1 7 2 9 135

156 32 N 3.8 5.4 144.5 3.7 5.5 143.8 3 7 143

2 J145312-861221 32.0 58.6 145.9 14:53:12.00 -86:12:21.69 5718.083 16.82 0.63 39.942 80.461 68.488 1.645 0.120 0.020 2.20 31 33 57 59

142 147 9 NE 31.9 58.7 145.7 32.0 58.6 145.9 32 59 146

3 J135001-842444 31.6 25.8 175.2 13:50:01.11 -84:24:44.25 6120.121 314.76 81.54 3272.453 3922.759 5478.838 17483.532 1272.764 0.113

12.51 0 62 0 59 0 398 91401 NES 30.9 27.0 178.8 31.6 25.8 175.2 48 16 59 67 4 J154316-844620 57.8 56.0 189.6 15:43:16.98 -84:46:20.09 6317.475 57.29 18.35 230.118 254.938 398.732 15.760 1.146 0.031 3.44 51 62 51

59 172 201 89 E 57.3 56.1 188.8 57.8 56.0 189.6 59 57 194

5 J154259-844401 58.2 55.6 214.2 15:42:59.74 -84:44:01.61 6656.920 46.69 11.65 102.781 285.171 372.236 19.380 1.406 0.027 3.04 54 62 52

59 203 230 93 NE 58.2 55.6 214.5 58.2 55.6 214.2 59 55 216

6 J125410-840205 13.9 13.2 249.2 12:54:10.60 -84:02:05.08 7140.600 16.55 16.67 9.970 82.620 207.469 2.820 0.204 0.037 4.10 13 15 12 16

245 260 14 N 13.9 13.4 249.6 13.9 13.2 249.2 14 12 247

7 J131312-853853 13.5 38.1 335.0 13:13:12.75 -85:38:53.03 8330.262 25.19 12.92 185.877 190.927 181.143 4.409 0.316 0.023 2.54 11 15 36

40 328 341 23 N 13.3 38.1 335.0 13.5 38.1 335.0 14 37 338

8 J150756-855606 37.5 58.4 345.6 15:07:56.43 -85:56:06.32 8478.567 25.07 9.23 158.885 310.231 306.924 3.893 0.279 0.025 2.73 35 40 57

59 337 359 20 NE 37.5 58.4 345.3 37.5 58.4 345.6 38 59 348

9 J152442-851626 48.2 55.6 345.9 15:24:42.11 -85:16:26.19 8482.426 36.48 7.08 29.322 162.814 376.389 4.973 0.357 0.024 2.71 45 51 54 59

326 353 29 NE 48.3 55.7 345.0 48.2 55.6 345.9 48 55 352

10 J131529-854732 13.4 40.3 392.4 13:15:29.68 -85:47:32.85 9132.964 78.22 25.84 42.800 105.551 238.062 13.476 0.962 0.030 3.32 4 19 37

48 381 398 88 NS 13.5 40.1 393.0 13.4 40.3 392.4 17 39 396

A.4 Detection file example

7766321971042470491, -5530087208626879168, 3.7, 5.5, 143.8, ’12:25:57.03’, ’-83:38:43.06’, 5688.896, 31.53, 28.70, 48.318, 106.777,

287.676, 6.755, 0.493, 0.028, 3.10, 1, 7, 2, 9, 135, 156, 32, ’N’, 3.8, 5.4, 144.5, 3.7, 5.5, 143.8, 3, 7, 143, 3.2548230704744903,

-1.4598857918556767, 1, 1

7766321971042470491, -5530087208626879168, 32.0, 58.6, 145.9, ’14:53:12.00’, ’-86:12:21.69’, 5718.083, 16.82, 0.63, 39.942, 80.461, 68.488, 68 1.645, 0.120, 0.020, 2.20, 31, 33, 57, 59, 142, 147, 9, ’NE’, 31.9, 58.7, 145.7, 32.0, 58.6, 145.9, 32, 59, 146, 3.8973202197033383, -1.5045789713065445, 1, 2

7766321971042470491, -5530087208626879168, 57.8, 56.0, 189.6, ’15:43:16.98’, ’-84:46:20.09’, 6317.475, 57.29, 18.35, 230.118, 254.938,

398.732, 15.760, 1.146, 0.031, 3.44, 51, 62, 51, 59, 172, 201, 89, ’E’, 57.3, 56.1, 188.8, 57.8, 56.0, 189.6, 59, 57, 194,

4.115848532022419, -1.4795548283423947, 1, 4

7766321971042470491, -5530087208626879168, 58.2, 55.6, 214.2, ’15:42:59.74’, ’-84:44:01.61’, 6656.920, 46.69, 11.65, 102.781, 285.171,

372.236, 19.380, 1.406, 0.027, 3.04, 54, 62, 52, 59, 203, 230, 93, ’NE’, 58.2, 55.6, 214.5, 58.2, 55.6, 214.2, 59, 55, 216,

4.114594803843068, -1.4788834583567945, 1, 5

7766321971042470491, -5530087208626879168, 13.9, 13.2, 249.2, ’12:54:10.60’, ’-84:02:05.08’, 7140.600, 16.55, 16.67, 9.970, 82.620,

207.469, 2.820, 0.204, 0.037, 4.10, 13, 15, 12, 16, 245, 260, 14, ’N’, 13.9, 13.4, 249.6, 13.9, 13.2, 249.2, 14, 12, 247,

3.3779829563619916, -1.4666829766275686, 1, 6

7766321971042470491, -5530087208626879168, 13.5, 38.1, 335.0, ’13:13:12.75’, ’-85:38:53.03’, 8330.262, 25.19, 12.92, 185.877, 190.927,

181.143, 4.409, 0.316, 0.023, 2.54, 11, 15, 36, 40, 328, 341, 23, ’N’, 13.3, 38.1, 335.0, 13.5, 38.1, 335.0, 14, 37, 338,

3.46104244824388, -1.4948407128195698, 1, 7

7766321971042470491, -5530087208626879168, 37.5, 58.4, 345.6, ’15:07:56.43’, ’-85:56:06.32’, 8478.567, 25.07, 9.23, 158.885, 310.231,

306.924, 3.893, 0.279, 0.025, 2.73, 35, 40, 57, 59, 337, 359, 20, ’NE’, 37.5, 58.4, 345.3, 37.5, 58.4, 345.6, 38, 59, 348,

3.9616377843008936, -1.4998502441051067, 1, 8

7766321971042470491, -5530087208626879168, 48.2, 55.6, 345.9, ’15:24:42.11’, ’-85:16:26.19’, 8482.426, 36.48, 7.08, 29.322, 162.814,

376.389, 4.973, 0.357, 0.024, 2.71, 45, 51, 54, 59, 326, 353, 29, ’NE’, 48.3, 55.7, 345.0, 48.2, 55.6, 345.9, 48, 55, 352,

4.03477289772363, -1.4883110482369142, 1, 9

7766321971042470491, -5530087208626879168, 13.4, 40.3, 392.4, ’13:15:29.68’, ’-85:47:32.85’, 9132.964, 78.22, 25.84, 42.800, 105.551,

238.062, 13.476, 0.962, 0.030, 3.32, 4, 19, 37, 48, 381, 398, 88, ’NS’, 13.5, 40.1, 393.0, 13.4, 40.3, 392.4, 17, 39, 396, 69 3.47100027884703, -1.4973608712967135, 1, 10

A.5 Parameter file example

7766321971042470491, -5530087208626879168, ’H001_C2’, ’ten_rec.out’, ’Wed Jul 4 17:23:40 2012’, ’2017-03-22 02:55:29’, ’spatial’, NULL, 0,

0, 0, NULL, NULL, NULL, NULL, NULL, 0, 0, NULL, NULL, NULL, NULL, NULL, 1, 1, 1, NULL, 2.0, 1, ’(B3 spline function)’, 1, NULL, 0, NULL

, NULL, 0.009, NULL, 5, 3, 4, 1, 0.0045, NULL, 1, NULL, 7.0, 1, 1, ’peak’, ’centroid’, 0.009, NULL, NULL, NULL, NULL, NULL, NULL, NULL,

10, 1, ’H001_C2.fits.gz_hipass2_run_00001.par’, 9, 0, 1

Appendix B

Supplementary Material - Final Virtual Cluster Configuration

This appendix describes the final server configuration of the Hadoop/HDFS/Spark/Hive cluster used in this study. Master service server layout is described in Table B.1 and worker node configuration is described in Table B.2.

Table B.1: Current Cluster Configuration - Master nodes

Master Nodes
Name       Type        Components
Artemis    m2.xlarge   HDFS JournalNode, NameNode, Hive Metastore Server *, HiveServer2, Hue Server *, Livy Server, Oozie Server *, Spark History Server, ZooKeeper Server
Ares       m2.xlarge   HDFS JournalNode, SecondaryNameNode, YARN ResourceManager, ZooKeeper Server
Dionysus   m1.medium   MySQL RDBMS (Metadata schemas for Hive, Hue and Oozie)
Hades      m2.xlarge   HDFS JournalNode, YARN JobHistory Server, YARN ResourceManager, ZooKeeper Server
Athena     m2.large    Cloudera Management Server, Hue Server, Sqoop2 Server

Components indicated with * have their metadata repositories located on the external MySQL database on Dionysus.

The two smaller servers, Athena and Perseus, were retained from the initial cluster - Athena for the Cloudera Management Server services, and Perseus to provide a host for the Apache Zeppelin service.

Table B.2: Current Cluster Configuration - Worker nodes

Worker Nodes
Agamemnon   m1.xxlarge   HDFS Data node, YARN Node Manager
Apollo      m1.xxlarge   HDFS Data node, YARN Node Manager
Hercules    m1.xxlarge   HDFS Data node, YARN Node Manager
Odysseus    m1.xxlarge   HDFS Data node, YARN Node Manager
Posiedon    m1.xxlarge   HDFS Data node, YARN Node Manager
Perseus     m1.large     HDFS Data node, Apache Zeppelin

Appendix C

Supplementary Material - Hive test table definition

This appendix details the Hive table definitions used in this study. DDL Creation scripts are detailed in Appendix D and Appendix E.

Table C.1: Test table definitions

Field Name    Data type
ra            decimal(5,1)
dec           decimal(5,1)
object        string
freq          decimal(5,1)
wavelength    double
flux          float
threshold     string
filename      string
ra range      int
dec range     int

Partition Information
Column name   Data type
filename      string
ra range      int
dec range     int

Appendix D

Supplementary Material - Hive internal tables

This appendix details the DDL table creation scripts, and DML population scripts for the Hive internal tables used in this study. It is highly recommended that the Apache Hive tutorial [77] is referenced when creating any Hive based table.

D.1 Creation scripts

D.1.1 ORC format tables, zlib compression

USE SkyNet;

drop table if exists SparkTestORCzlib; create table SparkTestORCzlib(

RA decimal(5,1),

Dec decimal(5,1),

Object string, freq decimal(5,1), wavelength double, flux float, threshold string

) comment ’this is the test partitioned hive table for testing Spark performance’ partitioned by (filename string, RA_Range int, Dec_Range int) stored as orc

TBLPROPERTIES ("orc.compress"="ZLIB");

Figure D.1: Creation script - Hive internal table, ORC format, zlib compression

D.1.2 ORC format tables, snappy compression

The creation script for an ORC format table with snappy compression is the same as for zlib compression, differing only on the orc.compress parameter.

USE SkyNet;

drop table if exists SparkTestORCsnappy; create table SparkTestORCsnappy(

RA decimal(5,1),

Dec decimal(5,1),

Object string, freq decimal(5,1), wavelength double, flux float, threshold string

) comment ’this is the test partitioned hive table for testing Spark performance’ partitioned by (filename string, RA_Range int, Dec_Range int) stored as orc

TBLPROPERTIES ("orc.compress"="SNAPPY");

Figure D.2: Creation script - Hive internal table, ORC format, snappy compression

D.1.3 Parquet format tables

Note that the compression format for the Parquet tables is defined in the population script

(see D.2.2).

USE SkyNet;

drop table if exists SparkTestParquetGzip;

create table SparkTestParquetGzip(

RA decimal(5,1),

Dec decimal(5,1),

Object string, freq decimal(5,1), wavelength double, flux float, threshold string

) comment ’this is the test partitioned hive table for testing Spark performance’ partitioned by (filename string, RA_Range int, Dec_Range int) row format delimited

fields terminated by ’,’ stored as parquet;

Figure D.3: Creation script - Hive internal table, Parquet format

D.1.4 RC Format tables

USE SkyNet;

drop table if exists SparkTestRC; create table SparkTestRC(

RA decimal(5,1),

Dec decimal(5,1),

Object string, freq decimal(5,1), wavelength double, flux float, threshold string

) comment ’this is the test partitioned hive table for testing Spark performance’ partitioned by (filename string, RA_Range int, Dec_Range int)

-- row format delimited

-- fields terminated by ’,’ stored as rcfile;

Figure D.4: Creation script - Hive internal table, RCFile format

D.1.5 Text based internal table creation

USE SkyNet;

drop table if exists SparkTestText; create table SparkTestText(

RA decimal(5,1),

Dec decimal(5,1),

Object string, freq decimal(5,1), wavelength double, flux float, threshold string

) comment ’this is the test partitioned hive table for testing Spark performance’ partitioned by (filename string, RA_Range int, Dec_Range int) clustered by(Object) sorted by(threshold) into 32 buckets

-- row format delimited

-- fields terminated by ’,’ stored as textfile;

Figure D.5: Creation script - Hive internal table, text format

D.2 Population Scripts

D.2.1 ORC, RC File and text based tables

A standard INSERT INTO SQL statement is used to populate these tables. Dynamic parti- tioning ensures that data is automatically loaded into the correct partitions. Enabling dynamic partitioning is described in detail on the Apache Hive Tutorial website [77].

Note the use of the Hive external tables as the source for the data extracts.

-- populate the table
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

SET hive.exec.parallel=true;

SET hive.vectorized.execution.enabled=true;

INSERT OVERWRITE table SparkTestOrcZlib PARTITION(filename, RA_Range , Dec_Range) select d.x,d.y, concat(cast(d.x as STRING), ’-’,cast(d.y as string )) ObjectPosition, d.z freq, vel wavelengths,f_peak flux, p.threshold, d.filename, case

when floor(d.x) between 0 and 20 then 20

when floor(d.x) between 21 and 40 then 40

when floor(d.x) between 41 and 60 then 60

when floor(d.x) between 61 and 80 then 80

else 100 end RA_Range, case

when floor(d.y) between 0 and 20 then 20

when floor(d.y) between 21 and 40 then 40

when floor(d.y) between 41 and 60 then 60

when floor(d.y) between 61 and 80 then 80

else 100 end Dec_Range from dingo_01_detections d

inner join dingo_01_parameters p

on d.idmsb=p.idmsb and d.idlsb=p.idlsb where d.filename=’R1_dingo_00’ and p.filename=’R1_dingo_00’

Figure D.6: Population script - Hive internal table, ORC, RC File and Text format

D.2.2 Parquet tables

Defining the compression codec for Parquet tables is normally applied in the population script. By default, Parquet tables use the “snappy” codec. In the example below, we see the use of the “SET PARQUET COMPRESSION CODEC” statement, specifying the gzip codec. Leaving this statement out, or specifying “SET PARQUET COMPRESSION CODEC=snappy;”, would result in the snappy compression codec being used.

-- populate the table
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

--set hive.exec.max.dynamic.partitions.pernode=1000;

--set hive.exec.max.dynamic.partitions=1000;

SET hive.exec.parallel=true;

SET hive.vectorized.execution.enabled=true;

SET PARQUET_COMPRESSION_CODEC=gzip;

INSERT OVERWRITE table SparkTestParquetGzip PARTITION(filename, RA_Range , Dec_Range) select d.x,d.y, concat(cast(d.x as STRING), ’-’,cast(d.y as string )) ObjectPosition, d.z freq, vel wavelengths,f_peak flux, p.threshold, d.filename, case

when floor(d.x) between 0 and 20 then 20

when floor(d.x) between 21 and 40 then 40

when floor(d.x) between 41 and 60 then 60

when floor(d.x) between 61 and 80 then 80

else 100 end RA_Range, case

when floor(d.y) between 0 and 20 then 20

when floor(d.y) between 21 and 40 then 40

when floor(d.y) between 41 and 60 then 60

when floor(d.y) between 61 and 80 then 80

else 100 end Dec_Range from dingo_01_detections d

inner join dingo_01_parameters p

on d.idmsb=p.idmsb and d.idlsb=p.idlsb where d.filename=’R1_dingo_00’ and p.filename=’R1_dingo_00’

Figure D.7: Population script - Hive internal table, Parquet format

Appendix E

Supplementary Material - Hive External tables

This appendix details the DDL table creation scripts, and DML population scripts as well as the Pigi compression script for the Hive external tables used in this study.

E.1 Creating a non-partitioned Hive external table

Figure E.1 shows the DDL statement to create a non partitioned Hive external table. The “location” clause defines the HDFS directory where the data files will be located.

use skynet; drop table if exists externalTest; create table externalTest (

idMsb bigint,

idLsb bigint,

X decimal(5,1),

Y decimal(5,1),

Z decimal(5,1),

RA_hms string,

DEC_dms string,

VEL double,

w_RA decimal(5,1),

w_DEC decimal(5,1),

w_50 decimal(12,3),

ihttps://pig.apache.org/

w_20 decimal(12,3),

w_VEL decimal(12,3),

F_int float,

F_tot float,

F_peak float,

SN_max decimal(6,2),

X1 int,

X2 int,

Y1 int,

Y2 int,

Z1 int,

Z2 int,

Npix int,

Flag string,

X_av decimal(4,1),

Y_av decimal(4,1),

Z_av decimal(4,1),

X_cent decimal(4,1),

Y_cent decimal(4,1),

Z_cent decimal(4,1),

X_peak int,

Y_peak int,

Z_peak int,

RA_rad double,

DEC_rad double,

parameterNumber int,

obj int

) comment ’this is the test non partitioned hive table for detections’ row format delimited

fields terminated by ’,’

stored as textfile location ’’;

Figure E.1: Creation script - non partitioned Hive external table

As we can see, the table definition accurately reflects the detection file definition shown in Table A.1 in Appendix A. Note that a non partitioned table will reference all files in the specified HDFS directory, and that files can be added and removed.

E.2 Creating a partitioned Hive external table

Figure E.2 shows the definition statement to create the metadata necessary for a Hive partitioned external table; this is defined by the “partitioned by” clause.

use skynet; drop table if exists dingo_01_detections; create table dingo_01_detections (

idMsb bigint,

idLsb bigint,

X decimal(5,1),

Y decimal(5,1),

Z decimal(5,1),

RA_hms string,

DEC_dms string,

VEL double,

w_RA decimal(5,1),

w_DEC decimal(5,1),

w_50 decimal(12,3),

w_20 decimal(12,3),

w_VEL decimal(12,3),

F_int float,

F_tot float,

F_peak float,

SN_max decimal(6,2),

X1 int,

X2 int,

Y1 int,

Y2 int,

Z1 int,

Z2 int,

Npix int,

Flag string,

X_av decimal(4,1),

Y_av decimal(4,1),

Z_av decimal(4,1),

X_cent decimal(4,1),

Y_cent decimal(4,1),

Z_cent decimal(4,1),

X_peak int,

Y_peak int,

Z_peak int,

RA_rad double,

DEC_rad double,

parameterNumber int,

obj int

) comment ’this is the test partitioned hive table for detections’ partitioned by (filename string) row format delimited

fields terminated by ’,’;

Figure E.2: Creation script - partitioned Hive external table

Recall from Section 2.4.1 that an external table in Hive is only a metadata layer for an existing HDFS dataset, so this statement will not populate the data into the table.

E.3 Populating a partitioned Hive external table

Populating an external table (or more accurately, attaching file pointers to the metadata definition) is achieved by the ALTER TABLE statement as shown in Figure E.3.

use skynet; alter table dingo_01_detections add partition (filename=’R1_dingo_00’) location ’’;

Figure E.3: Adding a file definition to a Hive external table

E.4 Compressing Hive External Table Data

Compressing Hive external table data (or for that matter, any text file stored on HDFS) is easily accomplished using a Pig script. Figure E.4 below illustrates an example of the Pig script we used to compress detection and external table data, using the gzip compression codec.

/*

** Pig Script: Compress HDFS data

**

** Purpose:

** Compress HDFS data while keeping the original folder structure

**

** Parameter

** $date - in the format YYYYMMDD (following HDFS folder structure)

**

** Example call:

** pig -param "FILNAME=R1_HIPASS2B_11" -dryrun -f Compress_HiPass2B_Detections.pig /*dryrun is a check

*/

-- set compression

set output.compression.enabled true;

set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- set large split size to merge small files together and then compress

set pig.maxCombinedSplitSize 2684354560;

-- load files and store them again (using compression codec)
inputFiles = LOAD ’$FILNAME’ using PigStorage();

STORE inputFiles INTO ’/user/oracle/SkyNet/Temp/’ USING PigStorage();

-- remove original folder and rename gzip folder to original

-- remove the $date folder, rename the $date_gz to $date

-- remember, use rm as below because it’s a Pig command, not a command line hdfs dfs command!

rm hdfs://bigdatalite.localdomain:8020$FILNAME

mv hdfs://bigdatalite.localdomain:8020/user/oracle/SkyNet/Temp hdfs://bigdatalite.localdomain:8020$FILNAME

Figure E.4: Pig compression script example for HDFS file data

In the above example, we pass in the HDFS directory containing the files we wish to compress - this is indicated by the $FILNAME parameter. The compression of these files is accomplished by the STORE...INTO...USING PigStorage(); directive. We then remove the original directory structure, and replace it with the compressed directory structure.

Appendix F

Supplementary Material - Hive explain plans

This appendix details and discusses the explain plans generated from SQL statements run against a partitioned Hive test table; the table read statistics were examined from the generated plans. For example, consider the select statement shown in Figure F.1, where the partition names are not explicitly used in the predicates of the query:

SELECT ra, dec, freq

FROM sparktest

WHERE ra BETWEEN 60 AND 61

AND dec BETWEEN 1 AND 1.5

GROUP BY ra, dec, freq

Figure F.1: Hive query with no explicit partition call

When run against the Hive table sparktest, this query results in the following explain plan, an extract of which is shown in Figure F.2:

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan alias: sparktest

Statistics: Num rows: 445121256 Data size: 3115848792

Basic stats: COMPLETE Column stats: NONE

Filter Operator predicate: (ra BETWEEN 60 AND 61

and dec BETWEEN 1 AND 1.5) (type: boolean)

Statistics: Num rows: 111280314 Data size: 778962198

Basic stats: COMPLETE Column stats: NONE

Group By Operator

...

...

Statistics: Num rows: 111280314 Data size: 778962198

Basic stats: COMPLETE Column stats: NONE

...

Figure F.2: Explain plan extract for full table scan

We can see from the explain plan extract that we are scanning the full table, which in this case is comprised of 445,121,256 rows.

If we explicitly specify a primary partition name in the query predicate, for example “filename=’R1_dingo_00’” as in Figure F.3,

SELECT ra, dec, freq

FROM sparktest

WHERE filename = ’R1_dingo_00’

AND ra BETWEEN 60 AND 61

AND dec BETWEEN 1 AND 1.5

GROUP BY ra, dec, freq

Figure F.3: Hive query with explicit partition call

we reduce the read to 16,885,586 rows as demonstrated in Figure F.4.

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan

alias: sparktest

Statistics: Num rows: 16885586 Data size: 118199102 Basic stats: COMPLETE Column stats: NONE

Filter Operator

predicate: wavelength BETWEEN 5500 AND 5505 (type: boolean)

Statistics: Num rows: 8442793 Data size: 59099551 Basic stats: COMPLETE Column stats: NONE

Group By Operator

...

...

Statistics: Num rows: 4221396 Data size: 29549772

Basic stats: COMPLETE Column stats: NONE

...

Figure F.4: Explain plan extract for explicit partition calls

Calling the same SQL statement, but explicitly defining the primary partition name as well as the secondary and tertiary partition names (as defined in Figure F.5) as below,

SELECT ra, dec, freq

FROM sparktest

WHERE filename = ’R1_dingo_00’

AND ra_range = 80

AND dec_range = 20

AND ra BETWEEN 60 AND 61

AND dec BETWEEN 1 AND 1.5

GROUP BY ra, dec, freq

Figure F.5: Hive query with explicit partition call

we see that the explain plan is only scanning the rows in the specified partitions; in this case, 787,377 rows, as shown in Figure F.6.

STAGE PLANS:

Stage: Stage-1

Map Reduce

Map Operator Tree:

TableScan alias: sparktestorczlib

Statistics: Num rows: 787336 Data size: 417288080

Basic stats: COMPLETE Column stats: NONE

Filter Operator predicate: (ra BETWEEN 60 AND 61

and dec BETWEEN 1 AND 1.5) (type: boolean)

Statistics: Num rows: 196834 Data size: 104322020

Basic stats: COMPLETE Column stats: NONE

Group By Operator

...

...

Statistics: Num rows: 196834 Data size: 104322020

Basic stats: COMPLETE Column stats: NONE

...

Figure F.6: Explain plan extract for explicit partition calls

Appendix G

Supplementary Material - Python Library Dependencies

In order to support library calls to the astroML and scikit-learn machine learning packages, certain dependencies need to be addressed, and this appendix defines these dependencies so that the packages can be successfully installed. These packages were manually installed on each worker node within the cluster; this is an acceptable approach for a small development cluster but will not be appropriate for a large production installation. Other approaches may be used (for example, installing the Anaconda parcel within the Cloudera CDH framework, or deploying via an infrastructure service like puppeti), however these approaches were not tested.

Table G.1: Python Machine learning library installation dependencies - (table format)

Package Dependency libpng-dev pkg config Freetype* packages Python Numpy Scipy Matplotlib libpng-dev pkg config Freetype* packages scikit learn Python Numpy Scipy astropy Python Numpy Scipy Matplotlib astroML scikit learn astropy astroML-Add ons astroML

ihttps://puppet.com/

Figure G.1: Python Machine learning library installation dependencies

Appendix H

Supplementary Material - Python Code Listings

Complete code listings are included in this appendix for the test programs used in this study. The KMeans analysis program is detailed in Section H.1, the Kernel Density Estimation program in Section H.2, the Principal Component Analysis program in Section H.3 and, finally, the Pearson correlation analysis in Section H.4.

H.1 KMeans analysis

# coding: utf-8

# In[1]:

import matplotlib

matplotlib.use("Agg")

import numpy as np

import pylab as pl

from math import sqrt

from numpy import array

import time

from pyspark.mllib.clustering import KMeans

from pyspark import SparkContext, SparkConf, StorageLevel

conf=(SparkConf()
    .setAppName("KMeansTest")

.setMaster("yarn-client"))

sc=SparkContext(conf=conf)

from pyspark.sql import HiveContext

from datetime import datetime

LogFile=datetime.now().strftime(’KMeansORCSnappy_%H_%M_%d_%m_%Y.log’)

PicFile=datetime.now().strftime(’KMeansORCSnappy_Pic_%H_%M_%d_%m_%Y’)

import logging

logger = logging.getLogger(’myapp’)

hdlr = logging.FileHandler(LogFile)

formatter = logging.Formatter(’%(asctime)s %(levelname)s %(message)s’)

hdlr.setFormatter(formatter)

logger.addHandler(hdlr)

logger.setLevel(logging.INFO)

start_elapsed=time.time()

start_cpu=time.clock()

sqlCtx=HiveContext(sc)

sqlCtx.sql("use skynet")

data = sqlCtx.sql("select freq, wavelength, flux \

from sparktestorcsnappy \

where filename=’R1_dingo_00’ and wavelength between 5500 and 5505 \

group by freq, wavelength, flux")

data.persist(StorageLevel.MEMORY_AND_DISK)

parsedData=data.map(lambda row: array([ float(row.freq), float(row.wavelength), float(row.flux)]))

parsedData.persist(StorageLevel.MEMORY_AND_DISK)

numRecs=parsedData.count()

rdd_elapsed=time.time()

rdd_cpu=time.clock()

logger.info("--- RDD Creation - Elapsed - %s seconds ---" % (rdd_elapsed - start_elapsed))

logger.info("--- RDD Creation - CPU - %s seconds ---" % (rdd_cpu - start_cpu))

logger.info("We have " + str(numRecs) + " rows in the RDD" )

def my_range(start, end, step):

while start <= end:

yield start

start += step

int=0

rlist=[]

for K in my_range(2, 20, 2):

try:

clusters=KMeans.train(parsedData, K, maxIterations=10, runs=30, initializationMode="random")

#logger.info "KMeans.train has run"

def error(point):

center=clusters.centers[clusters.predict(point)]

return sqrt(sum([x**2 for x in (point - center)]))

WSSSE=parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)

logger.info("WSSSE for " + str(K) + " clusters is " + str(WSSSE))

res=[K, WSSSE]

rlist.append(res)

int+=1

except Exception:
    logger.error("Error calculating KMeans for " + str(K))

logger.error( "Error is:", sys.exc_info()[0] )

#else code that executes if no error thrown in try-except block

#finally code the executes regardless

rlist=np.array(rlist)

x_axis1=np.array([c[0] for c in rlist])

y_axis1=np.array([c[1] for c in rlist])

x_axis=x_axis1[np.argsort(y_axis1)]

y_axis=y_axis1[np.argsort(y_axis1)]

pos=np.arange(len(x_axis))

width=0.01

if len(x_axis) > 0:

def plot_k_Xvalid(pos, width, x_axis, y_axis):

#ax=pl.axes()

#ax.set_xticks(pos + (width / 2))

#ax.set_xticklabels(x_axis)

#pl.bar(pos, y_axis, width, color=’lightblue’)
pl.plot(x_axis, y_axis, ’ro’)

pl.axis([min(x_axis)-2, max(x_axis)+2, min(y_axis)*0.75, max(y_axis)*1.2]) #[xmin, xmax, ymin, ymax]

#pl.axis([0, 35, 0, 70])

pl.xticks(rotation=30)

pl.xlabel(’Value of K’)

pl.ylabel(’Within Set Sum Squared Error’)

pl.title(’K value cross validation’)

#fig = matplotlib.pyplot.gcf()

#fig.set_size_inches(16, 10)

pl.savefig(PicFile)

plot_k_Xvalid(pos, width, x_axis, y_axis)

else:

logger.error (’No data to graph - see errors’)

end_elapsed=time.time()

end_cpu=time.clock()

logger.info("--- Elapsed - %s seconds ---" % (end_elapsed - start_elapsed))

logger.info("--- CPU - %s seconds ---" % (end_cpu - start_cpu))

Figure H.1: Python Test Program - KMeans Analysis

H.2 Kernel Density Estimation

# coding: utf-8

# In[26]:

import matplotlib

matplotlib.use("Agg")

# get_ipython().magic(u’matplotlib inline’)

# Author: Jake VanderPlas

# License: BSD

# The figure produced by this code is published in the textbook

# "Statistics, Data Mining, and Machine Learning in Astronomy" (2013)

# For more information, see http://astroML.github.com

# To report a bug or issue, use the following forum:

# https://groups.google.com/forum/#!forum/astroml-general

import numpy as np
from matplotlib import pyplot as plt

from matplotlib.colors import LogNorm

from scipy.spatial import cKDTree

from scipy.stats import gaussian_kde

from astroML.datasets import fetch_great_wall

# pyspark libraries for hive

from pyspark import SparkContext, SparkConf, StorageLevel

conf=(SparkConf()

.setAppName("KDE_Test")

.setMaster("yarn-client"))

sc=SparkContext(conf=conf)

from pyspark.sql import HiveContext

# Logging, etc

import time

from datetime import datetime

LogFile=datetime.now().strftime(’KDE_Parquet_Snappy_%H_%M_%d_%m_%Y.log’)

PicFile=datetime.now().strftime(’KDE_Parquet_Snappy_Pic_%H_%M_%d_%m_%Y’)

import logging

logger = logging.getLogger(’myapp’)

hdlr = logging.FileHandler(LogFile)

formatter = logging.Formatter(’%(asctime)s %(levelname)s %(message)s’)

hdlr.setFormatter(formatter)

logger.addHandler(hdlr)

logger.setLevel(logging.INFO)

# Scikit-learn 0.14 added sklearn.neighbors.KernelDensity, which is a very

# fast kernel density estimator based on a KD Tree. We’ll use this if

# available (and raise a warning if it isn’t).

try:

from sklearn.neighbors import KernelDensity

use_sklearn_KDE = True

logger.info ("Current KDE version enabled")

except:

import warnings

warnings.warn("KDE will be removed in astroML version 0.3. Please " "upgrade to scikit-learn 0.14+ and use "

"sklearn.neighbors.KernelDensity.", DeprecationWarning)

from astroML.density_estimation import KDE

use_sklearn_KDE = False

#------

# This function adjusts matplotlib settings for a uniform feel in the textbook.

# Note that with usetex=True, fonts are rendered with LaTeX. This may

# result in an error if LaTeX is not installed on your system. In that case,

# you can set usetex to False.

from astroML.plotting import setup_text_plots

setup_text_plots(fontsize=8, usetex=False)

#------

# Fetch the great wall data

#X = fetch_great_wall()

#Y = fetch_great_wall()

#------

start_elapsed=time.time()

start_cpu=time.clock()

sqlCtx=HiveContext(sc)

# Lets try this with skynet data

sqlCtx.sql("use skynet")

data=sqlCtx.sql("select ra, freq \

from sparktest \

where filename=’R1_dingo_00’ \

and ra_range in (60, 80) \

and dec_range=20 \

and dec between -1.5 and 1.5 \

group by ra, freq")

numRecs=data.count()

rdd_elapsed=time.time()

rdd_cpu=time.clock()

logger.info("--- RDD Creation - Elapsed - %s seconds ---" % (rdd_elapsed - start_elapsed))

logger.info("--- RDD Creation - CPU - %s seconds ---" % (rdd_cpu - start_cpu))

logger.info("We have " + str(numRecs) + " rows in the RDD" ) #and dec between -0.5 and 0.5 \

X = data.map(lambda row: np.array([float(row[0]), float(row[1])])).collect()

# X comes out as a list, now we have to convert it to an array

X=np.array(X)

#X=np.random.normal(size=(1000,2))

#------

# Create the grid on which to evaluate the results

Nx = 50

Ny = 125

#xmin, xmax = (-375, -175)

#ymin, ymax = (-300, 200)

xmin, xmax = (41, 81)

ymin, ymax = (0, 192)

#------

# Evaluate for several models
Xgrid = np.vstack(map(np.ravel, np.meshgrid(np.linspace(xmin, xmax, Nx),

np.linspace(ymin, ymax, Ny)))).T

kernels = [’gaussian’, ’tophat’, ’exponential’]

dens = []

if use_sklearn_KDE:

kde1 = KernelDensity(5, kernel=’gaussian’)

log_dens1 = kde1.fit(X).score_samples(Xgrid)

dens1 = X.shape[0] * np.exp(log_dens1).reshape((Ny, Nx))

kde2 = KernelDensity(5, kernel=’tophat’)

log_dens2 = kde2.fit(X).score_samples(Xgrid)

dens2 = X.shape[0] * np.exp(log_dens2).reshape((Ny, Nx))

kde3 = KernelDensity(5, kernel=’exponential’)

log_dens3 = kde3.fit(X).score_samples(Xgrid)

dens3 = X.shape[0] * np.exp(log_dens3).reshape((Ny, Nx))

else:

kde1 = KDE(metric=’gaussian’, h=5)
dens1 = kde1.fit(X).eval(Xgrid).reshape((Ny, Nx))

kde2 = KDE(metric=’tophat’, h=5)

dens2 = kde2.fit(X).eval(Xgrid).reshape((Ny, Nx))

kde3 = KDE(metric=’exponential’, h=5)

dens3 = kde3.fit(X).eval(Xgrid).reshape((Ny, Nx))

#------

# Plot the results

fig = plt.figure(figsize=(5, 2.2))

fig.subplots_adjust(left=0.12, right=0.95, bottom=0.2, top=0.9,

hspace=0.01, wspace=0.01)

fig.set_size_inches(16, 10)

# First plot: scatter the points

ax1 = plt.subplot(221, aspect=’equal’)

ax1.scatter(X[:, 1], X[:, 0], s=1, lw=0, c=’k’)

ax1.text(0.95, 0.9, "input", ha=’right’, va=’top’,

transform=ax1.transAxes, bbox=dict(boxstyle=’round’, ec=’k’, fc=’w’))

# Second plot: gaussian kernel

ax2 = plt.subplot(222, aspect=’equal’)

ax2.imshow(dens1.T, origin=’lower’, norm=LogNorm(),

extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

ax2.text(0.95, 0.9, "Gaussian $(h=5)$", ha=’right’, va=’top’,

transform=ax2.transAxes,

bbox=dict(boxstyle=’round’, ec=’k’, fc=’w’))

# Third plot: top-hat kernel

ax3 = plt.subplot(223, aspect=’equal’)

ax3.imshow(dens2.T, origin=’lower’, norm=LogNorm(),

extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

ax3.text(0.95, 0.9, "top-hat $(h=5)$", ha=’right’, va=’top’,

transform=ax3.transAxes,

bbox=dict(boxstyle=’round’, ec=’k’, fc=’w’))

ax3.images[0].set_clim(0.01, 0.8)

# Fourth plot: exponential kernel

ax4 = plt.subplot(224, aspect=’equal’)
ax4.imshow(dens3.T, origin=’lower’, norm=LogNorm(),

extent=(ymin, ymax, xmin, xmax), cmap=plt.cm.binary)

ax4.text(0.95, 0.9, "exponential $(h=5)$", ha=’right’, va=’top’,

transform=ax4.transAxes,

bbox=dict(boxstyle=’round’, ec=’k’, fc=’w’))

for ax in [ax1, ax2, ax3, ax4]:

ax.set_xlim(ymin, ymax - 0.01)

ax.set_ylim(xmin, xmax)

for ax in [ax1, ax2]:

ax.xaxis.set_major_formatter(plt.NullFormatter())

for ax in [ax3, ax4]:

ax.set_xlabel(’$y$ (Mpc)’)

for ax in [ax2, ax4]:

ax.yaxis.set_major_formatter(plt.NullFormatter())

for ax in [ax1, ax3]:

ax.set_ylabel(’$x$ (Mpc)’)

# plt.show()

plt.savefig(PicFile)

end_elapsed=time.time()

end_cpu=time.clock()

logger.info("--- Elapsed - %s seconds ---" % (end_elapsed - start_elapsed))

logger.info("--- CPU - %s seconds ---" % (end_cpu - start_cpu))

logger.info("Done!")

Figure H.2: Python Test Program - Kernel Density Estimation

H.3 Principal Component Analysis

import matplotlib

matplotlib.use(’agg’)

import numpy as np
import pylab as pl

# Masked array support

import numpy.ma as ma

from sklearn.preprocessing import normalize

# this is the full PCA

from sklearn.decomposition import PCA

# whereas this one uses randomised decomposition

from sklearn.decomposition import RandomizedPCA

import time

from datetime import datetime

from pyspark import SparkContext, SparkConf, StorageLevel

conf=(SparkConf()

.setAppName("PCATest")

.setMaster("yarn-client"))

sc=SparkContext(conf=conf)

from pyspark.sql import HiveContext
from datetime import datetime

LogFile=datetime.now().strftime(’PCA_ORC_SNAPPY_%H_%M_%d_%m_%Y.log’)

PicFile=datetime.now().strftime(’PCA_ORC_SNAPPY_Pic_%H_%M_%d_%m_%Y’)

import logging

logger = logging.getLogger(’myapp’)

hdlr = logging.FileHandler(LogFile)

formatter = logging.Formatter(’%(asctime)s %(levelname)s %(message)s’)

hdlr.setFormatter(formatter)

logger.addHandler(hdlr)

logger.setLevel(logging.INFO)

start_elapsed=time.time()

start_cpu=time.clock()

sqlCtx=HiveContext(sc)

sqlCtx.sql("use skynet")

rawData=sqlCtx.sql("select distinct ra, dec, object, freq, wavelength wavelengths, flux, s.threshold, rn tidx \
from sparktestorcsnappy s \

inner join threshold_lookup ti \

on s.threshold=ti.threshold \

where s.filename in (’R1_dingo_00’, ’R1_dingo_01’, ’R1_dingo_02’) \

and ra_range=80 \

and dec_range=20 \

and dec between -0.5 and 0.5")

rawData.persist(StorageLevel.MEMORY_AND_DISK)

numRecs=rawData.count()

#rawData.persist()

rdd_elapsed=time.time()

rdd_cpu=time.clock()

logger.info("--- RDD Creation - Elapsed - %s seconds ---" % (rdd_elapsed - start_elapsed))

logger.info("--- RDD Creation - CPU - %s seconds ---" % (rdd_cpu - start_cpu))

logger.info("We have " + str(numRecs) + " rows in the RDD" )

def mArray(b, max_entries):

# this dynamically creates the mask for the masked array
bmask = np.concatenate([np.zeros(len(b),dtype=bool), np.ones(max_entries-len(b), dtype=bool)])

# this creates the masked array

km = ma.masked_array(data=ma.array(np.resize(b, max_entries),mask=bmask),mask=bmask,fill_value=999)

# get the mean of the values

mean=km.mean()

# turn the masked array to a list, set empty values to -9999

kl=km.tolist(-9999)

# Replace the -9999 values with the mean of the row

kx=ma.masked_values(kl, -9999)

return kx.filled(mean)

# pivot the RDDs

# Flux data

UniqueFlux=rawData.map(lambda row: (row[2],row[5])).distinct()

FluxData=UniqueFlux.map(lambda nameTuple: (nameTuple[0], [ float(nameTuple[1]) ])) .reduceByKey(lambda a, b: a + b) .sortByKey(1,1)

# Wavelengths

UniqueWave=rawData.map(lambda row: (row[2],row[4])).distinct()
WavelengthData=UniqueWave.map(lambda nameTuple: (nameTuple[0], [ float(nameTuple[1]) ])) .reduceByKey(lambda a, b: a + b) .sortByKey(1,1)

# Frequency

UniqueFreq=rawData.map(lambda row: (row[2],row[3])).distinct()

FreqData=UniqueFreq.map(lambda nameTuple: (nameTuple[0], [ float(nameTuple[1]) ])) .reduceByKey(lambda a, b: a + b) .sortByKey(1,1)

# Thresholds - use this for label data

UniqueThresh=rawData.map(lambda row: (row[2],row[7])).distinct()

ThreshData=UniqueThresh.map(lambda nameTuple: (nameTuple[0], [ float(nameTuple[1]) ])) .reduceByKey(lambda a, b: a + b) .sortByKey(1,1)

# just extract the values, but they might be a jagged array

Frequencies=FreqData.map(lambda x: np.array(x[1]))

Wavelengths=WavelengthData.map(lambda x: np.array(x[1]))

Fluxes=FluxData.map(lambda x: np.array(x[1]))

Thresholds=ThreshData.map(lambda x: np.array(x[1]))

# threshold labels

ThresholdLabels=rawData.map(lambda row: (row[6],row[7])).distinct().sortByKey(1,1)
ThresholdLabels.take(7)

labels=np.asarray(ThresholdLabels.map(lambda row: row[0]).collect())

# set up the label value for each object

UniqueThresh=rawData.map(lambda row: (row[2],row[7])).distinct()

ThreshData=UniqueThresh.map(lambda nameTuple: (nameTuple[0], [ float(nameTuple[1]) ])) .reduceByKey(lambda a, b: a + b) .sortByKey(1,1)

Thresholds=ThreshData.map(lambda x: np.array(x[1]))

max_thresh=max([len(x) for x in Thresholds.collect()])

y=np.asarray(Thresholds.map(lambda b: mArray(b, max_thresh)).collect())

#

# so, lets take the maximum value recorded for each object, using the numpy nanmax directive

y=np.nanmax(y, axis=1)

# Set up the Flux data for the PCA analysis

# get the max number of elements for the flux data
#FC=Fluxes.collect()

max_flux=max([len(x) for x in Fluxes.collect()])

# Convert the Fluxes jagged array data in the RDD to a consistent array using masked arrays

# FluxPad=Fluxes.map(lambda b: mArray(b, max_flux))

# Convert the padded flux data in the RDD to a numpy array for use in the PCA functions

# We run the mArray function through the first map procedure to smooth the jagged arrays,

# add the collect() turns the result into a list which can them be turned into a numpy array

# using np.asarray

X=np.asarray(Fluxes.map(lambda b: mArray(b, max_flux)).collect())

# Set up the Frequency data

max_freq=max([len(x) for x in Frequencies.collect()])

f=np.asarray(Frequencies.map(lambda b: mArray(b, max_freq)).collect())

# Set up the wavelength data

max_wave=max([len(x) for x in Wavelengths.collect()])

w=np.asarray(Wavelengths.map(lambda b: mArray(b, max_wave)).collect())

# Set up the threshold link data

max_thresh=max([len(x) for x in Thresholds.collect()])

y=np.asarray(Thresholds.map(lambda b: mArray(b, max_flux)).collect())

# this could produce an array with more threshold values per object, which is not what we want

#

# so, lets take the maximum value recorded for each object, using the numpy nanmax directive

y=np.nanmax(y, axis=1)

# lets see if the PCA stuff can take a RDD

pca=PCA(n_components=4)

X=normalize(X)

X_projected = pca.fit_transform(X) # this could take a while, Horatio! Actually, not - maybe Spark on cluster!

# whereas this one uses randomised decomposition

rpca=RandomizedPCA(n_components=4, random_state=0)

# X is already the normalised data array from data from the step above, don’t need to do it again

X_proj = rpca.fit_transform(X)

# Here is the first plot function, using the randomized data
def plot_RPCA_projection(rpca):

#y = data[’y’]

# The formatter for our labels...sweet!

#labels = data[’labels’]

format = pl.FuncFormatter(lambda i, *args: labels[i].replace(’ ’, ’\n’))

X_prog = rpca.transform(X)

pl.scatter(X_prog[:, 0], X_prog[:, 1], c=y, s=4, lw=0, vmin=2, vmax=6, cmap=pl.cm.jet)

#pl.scatter(X_prog[:, 0], X_prog[:, 1], s=4, lw=0, vmin=2, vmax=6, cmap=pl.cm.jet)

#pl.colorbar(ticks=range(2, 7), format=format)

pl.xlabel(’Coefficient 1’)

pl.ylabel(’Coefficient 2’)

pl.title(’PCA Projection of Spectra - randomized PCA’)

pl.savefig(PicFile)

plot_RPCA_projection(rpca)

end_elapsed=time.time()
end_cpu=time.clock()

logger.info("--- Elapsed - %s seconds ---" % (end_elapsed - start_elapsed))

logger.info("--- CPU - %s seconds ---" % (end_cpu - start_cpu))

Figure H.3: Python Test Program - Principal Component Analysis

H.4 Correlation analysis

import numpy as np

import pylab as pl

import StringIO

import sys

# set the global variables from the agguments

gNumberBins=sys.argv[1]

gHistModulo=sys.argv[2]
gRA_ArcSegment=sys.argv[3]
gDEC_ArcSegment=sys.argv[4]

gMinuteRange=sys.argv[5]

gNumProcs=sys.argv[6]

gAppName=sys.argv[7]

from pyspark.mllib.stat import Statistics

from pyspark.mllib.linalg import Vectors

# import threading and queue libraries

from threading import Thread

from multiprocessing.pool import ThreadPool

import Queue

# import gc for manual calls to garbage collection to stop memory leaks in .collect()s

import gc

import time

import random

from datetime import datetime

from pyspark import SparkContext, SparkConf, StorageLevel

conf=(SparkConf()
    .setAppName(gAppName)

.setMaster("yarn-client"))

# try fair scheduling - see http://stackoverflow.com/questions/30214474/how-to-run-multiple-jobs-in-one-sparkcontext-from-separate-threads-in-pyspark

# and http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

conf.set("spark.scheduler.mode", "FAIR")

sc=SparkContext(conf=conf)

from pyspark.sql import HiveContext

from datetime import datetime

gLogFileString=’CTest_{}_{}Bins_mod{}_%H_%M_%d_%m_%Y.log’.format(gAppName,gNumberBins,gHistModulo)

#LogFile=datetime.now().strftime(’CorrelationTest_300Bins_mod250_%H_%M_%d_%m_%Y.log’)

LogFile=datetime.now().strftime(gLogFileString)

import logging

logger = logging.getLogger(’myapp’)

hdlr = logging.FileHandler(LogFile)

formatter = logging.Formatter(’%(asctime)s %(levelname)s %(message)s’)

hdlr.setFormatter(formatter)

logger.addHandler(hdlr)
logger.setLevel(logging.INFO)

def CleanExit():

logger.info("Clean Exit")

sc.stop()

sys.exit(0)

def CreateTheQuery(queryPredicate,modulo,numBins):

sqlStatement="with baseline as ("

sqlStatement=sqlStatement+" select "

sqlStatement=sqlStatement+" case when wavelength - pmod(wavelength,{}) < 0.1 then 0 ".format(modulo)

sqlStatement=sqlStatement+" else wavelength-pmod(wavelength,{}) end basewave ".format(modulo)

sqlStatement=sqlStatement+" from allcorrelationwavelengths "

sqlStatement=sqlStatement+" group by "

sqlStatement=sqlStatement+" case when wavelength - pmod(wavelength,{}) < 0.1 then 0 ".format(modulo)

sqlStatement=sqlStatement+" else wavelength-pmod(wavelength,{}) end ".format(modulo)

sqlStatement=sqlStatement+" ), "

sqlStatement=sqlStatement+" realStuff as ( "

sqlStatement=sqlStatement+" select wavelength-pmod(wavelength,250) realWave from correlationtest {} ".format(queryPredicate)

sqlStatement=sqlStatement+" ) "

sqlStatement=sqlStatement+" select "
sqlStatement=sqlStatement+" cast(hist.x as int) as bin_center, "

sqlStatement=sqlStatement+" cast(hist.y as bigint) as bin_height "

sqlStatement=sqlStatement+" from "

sqlStatement=sqlStatement+" ( "

sqlStatement=sqlStatement+" select "

sqlStatement=sqlStatement+" histogram_numeric(wavelength, {}) as salary_hist ".format(numBins)

sqlStatement=sqlStatement+" from"

sqlStatement=sqlStatement+" ( "

sqlStatement=sqlStatement+" select "

sqlStatement=sqlStatement+" case when realwave is null then basewave else realwave end wavelength "

sqlStatement=sqlStatement+" from baseline left join realStuff on basewave=realwave "

sqlStatement=sqlStatement+" ) b "

sqlStatement=sqlStatement+" ) a "

sqlStatement=sqlStatement+" lateral view explode(salary_hist) exploded_table as hist"

return sqlStatement

def DoTheSwim(sourcePredicate, targetPredicate):

numBins=gNumberBins

modulo=gHistModulo
sourceSql=CreateTheQuery(sourcePredicate,modulo,numBins)

targetSql=CreateTheQuery(targetPredicate,modulo,numBins)

# as per http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

# In Python, stored objects will always be serialized with the Pickle library

sourceData=sqlCtx.sql(sourceSql).persist(StorageLevel.MEMORY_ONLY)

targetData=sqlCtx.sql(targetSql).persist(StorageLevel.MEMORY_ONLY)

if sourceData.count() == 0 or targetData.count()==0:

logger.info("One of the source or target datasets has no rows, exiting DoThe Swim")

return

try:

result1=sourceData.map(lambda row: [float(row[1])]).collect()

x=np.asarray(result1).transpose()

logger.info(x.shape)

except Exception,e:

logger.info("Creating the X array failed - {}".format(str(e)))

126 return try:

result2=targetData.map(lambda row: [float(row[1])]).collect()

y=np.asarray(result2).transpose()

except Exception,e:

logger.info("Creating the Y array failed - {}".format(str(e)))

return

#logger.info("In DoTheSwim, doing the correlation analysis")

MoreBarney=np.corrcoef(x, y)

try:

q.put(MoreBarney)

except Exception,e:

logger.info("Adding correlation vector to queue failed - {}".format(str(e)))

else:

logger.info("Adding correlation vector to queue successful...")

finally:

try:

logger.info("Clearing the RDDs and Lists")

127 sourceData.unpersist() targetData.unpersist()

del result1

del result2

gc.collect()

except Exception,e:

logger.info("Error clearing RDD and lists - {} ".format(str(e)))

logger.info("And all done, exiting...")

return

def FunkyNassau(funkyArray):
    sourcePredicate="where ra_range={} and dec_range={} ".format(funkyArray[0],funkyArray[1])
    sourcePredicate=sourcePredicate+"and floor((ra*180/pi()-floor(ra*180/pi()))*60)={} ".format(funkyArray[4])
    sourcePredicate=sourcePredicate+"and floor((abs(dec)*180/pi()-floor(abs(dec)*180/pi()))*60)={} ".format(funkyArray[5])
    targetPredicate="where ra_range={} and dec_range={} ".format(funkyArray[2],funkyArray[3])
    targetPredicate=targetPredicate+"and floor((ra*180/pi()-floor(ra*180/pi()))*60)={} ".format(funkyArray[6])
    targetPredicate=targetPredicate+"and floor((abs(dec)*180/pi()-floor(abs(dec)*180/pi()))*60)={} ".format(funkyArray[7])

    # Random sleep to stagger the processes
    if random.randint(0,1):
        time.sleep(0.1)

    try:
        DoTheSwim(sourcePredicate, targetPredicate)
    except Exception,e:
        logger.info("======")
        logger.info("DoTheSwim has failed for - {} {}".format(sourcePredicate, targetPredicate))
        logger.info("======")
    finally:
        return 0

if __name__ == "__main__":

    start_elapsed=time.time()
    start_cpu=time.clock()

    defDatabase="use skynet"
    sqlCtx=HiveContext(sc)
    sqlCtx.sql(defDatabase)

    logger.info("--- here we go Freddie - setting the queue and thread pool---")

    gNumProcs=int(gNumProcs)
    q=Queue.Queue()
    tpool=ThreadPool(processes=gNumProcs)

    # ok in here we extract the problem space comparisons we need
    masterSQL="select ra_range_source,dec_range_source,ra_range_target,dec_range_target, \
               ra_source_minute,dec_source_minute, \
               ra_target_minute, dec_target_minute \
               from correlation_problem_space_1 \
               where ra_range_source={} \
               and dec_range_source={} \
               and ra_range_source=ra_range_target \
               and dec_range_source=dec_range_target \
               and ra_source_minute < {} \
               and dec_source_minute < {} \
               and ra_target_minute < {} \
               and dec_target_minute < {} ".format(gRA_ArcSegment,gDEC_ArcSegment,gMinuteRange,gMinuteRange,gMinuteRange,gMinuteRange)

    logger.info("masterSQL is {}".format(masterSQL))

    data=sqlCtx.sql(masterSQL)
    numCubes=data.count()

    if numCubes==0:
        logger.info("No master cubes")
        CleanExit()

    logger.info("We have {} cube comparisons to run on {} processes".format(numCubes,gNumProcs))

    masterLoop=data.map(lambda row: [float(row[0]),float(row[1]),float(row[2]),float(row[3]), \
                                     float(row[4]),float(row[5]),float(row[6]),float(row[7])])

    logger.info("starting all processes")
    billy = tpool.map(FunkyNassau, [x for x in masterLoop.toLocalIterator()])
    logger.info("Finished - all processes")

    logger.info("Loop through the queue")
    while True:
        try:
            func_value=q.get(False)
        except Queue.Empty:
            logger.info("Finished looping, breaking")
            break
        else:
            logger.info(func_value)

    logger.info("All threads finished...")
    logger.info("======")

    end_elapsed=time.time()
    end_cpu=time.clock()

    logger.info("--- Elapsed - %s seconds ---" % (end_elapsed - start_elapsed))
    logger.info("--- CPU - %s seconds ---" % (end_cpu - start_cpu))

    CleanExit()

Figure H.4: Python Test Program - Correlation Analysis
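For clarity, the concurrency pattern used in Figure H.4 can be reduced to a minimal, Spark-free sketch (Python 2, matching the listing): a ThreadPool maps a worker function over a list of parameter sets, each worker pushes its result onto a shared Queue, and the main thread drains the queue once the pool has finished. The worker and parameter values below are purely illustrative; in Figure H.4 the worker is FunkyNassau and the parameters come from the correlation_problem_space_1 table.

from multiprocessing.pool import ThreadPool
import Queue

q = Queue.Queue()

def worker(params):
    # Stand-in for FunkyNassau: do some work and push the result onto the shared queue.
    q.put(sum(params))
    return 0

# Stand-in for the rows of correlation_problem_space_1.
parameter_sets = [[1, 2], [3, 4], [5, 6]]

tpool = ThreadPool(processes=2)
tpool.map(worker, parameter_sets)

# Drain the queue, as the main block of Figure H.4 does.
while True:
    try:
        print q.get(False)
    except Queue.Empty:
        break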

Appendix I

Supplementary Material - Hive QL for Correlation Analysis

This appendix details the HiveQL statements used to set up the table structures for the Pearson correlation analysis program.

I.1 Creating and populating the base table for Correlation analysis

USE SkyNet;

drop table if exists CorrelationTest;

create table CorrelationTest(
    ra_hms string,
    dec_dms string,
    RA double,
    Dec double,
    freq decimal(5,1),
    wavelength double,
    flux float
)
comment 'this is the test partitioned hive table for testing Spark performance'
partitioned by (RA_Range decimal(5,2), Dec_Range decimal(5,2))
row format delimited
fields terminated by ','
stored as parquet
location '/user/prometheus/staging/CorrelationTest';

Figure I.1: Creation DDL for CorrelationTest table

INSERT OVERWRITE table CorrelationTest PARTITION(RA_Range, Dec_Range)
select
    ra_hms,
    dec_dms,
    ra_rad*(180/pi()) ra,
    dec_rad*(180/pi()) dec,
    z freq,
    vel wavelength,
    f_peak flux,
    ra_rad*(180/pi()) - pmod(ra_rad*(180/pi()), 0.25) ra_range,
    dec_rad*(180/pi()) - pmod(dec_rad*(180/pi()), 0.25) dec_range
from dingo_01_detections

Figure I.2: Population DML for CorrelationTest table

I.2 Creating the baseline wavelength table

This query creates the baseline table of all distinct wavelengths.

create table allcorrelationwavelengths
stored as parquet
as
select distinct(vel) wavelength
from dingo_01_detections

Figure I.3: Creation DDL for baseline wavelength table

I.3 Creating the fine grained position data

This query extracts the fine grained position data from the correlation test table, down to minute granularity.

drop table if exists CorrelationMinutes;

create table CorrelationMinutes
stored as parquet
as
select
    ra_range,
    dec_range,
    floor((ra-floor(ra))*60) raMin,
    floor((abs(dec)-floor(abs(dec)))*60) decMin
from CorrelationTest
group by
    ra_range,
    dec_range,
    floor((ra-floor(ra))*60),
    floor((abs(dec)-floor(abs(dec)))*60)

Figure I.4: Creation DML for fine grained position data table

I.4 Creating the Problem Space

Finally, we create the actual problem space table. Note the usage of Hive nested queries. Also note that the method used to create a non-repeating table of combinations - a cross join coupled with <= comparisons in the predicate - will not be particularly efficient. However, as this should be a one-off operation, this is probably not an issue. A short Python sanity check of the <= trick follows the listing (Figure I.5).

drop table if exists correlation_problem_space_1;

create table correlation_problem_space_1
stored as parquet
as
with CorrelationSourceMinutes as (
    select
        ra_range ra_range_source,
        dec_range dec_range_source,
        raMin ra_source_minute,
        decMin dec_source_minute
    from CorrelationMinutes
),
CorrelationTargetMinutes as (
    select
        ra_range ra_range_target,
        dec_range dec_range_target,
        raMin ra_target_minute,
        decMin dec_target_minute
    from CorrelationMinutes
)
select
    ra_range_source, dec_range_source,
    ra_source_minute, dec_source_minute,
    ra_range_target, dec_range_target,
    ra_target_minute, dec_target_minute
from CorrelationSourceMinutes
cross join CorrelationTargetMinutes
where ra_range_source <= ra_range_target
  and dec_range_source <= dec_range_target
  and ra_source_minute <= ra_target_minute
  and dec_source_minute <= dec_target_minute
order by
    ra_source_minute, dec_source_minute,
    ra_target_minute, dec_target_minute

Figure I.5: Creation DML for the problem space table
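The non-repeating behaviour of the <= predicates can be sanity-checked in one dimension with a few lines of Python. This snippet is illustrative only and is not part of the analysis pipeline; it shows that filtering a full cross product with a <= b keeps exactly one of each ordered pair, i.e. the unordered combinations with repetition. In Figure I.5 the comparison is applied to each coordinate column independently rather than to the full coordinate tuple.

from itertools import combinations_with_replacement

values = range(4)  # stand-in for, e.g., the distinct raMin values
pairs = [(a, b) for a in values for b in values if a <= b]

# The <= filter on the cross product keeps one of (a, b) and (b, a),
# i.e. the unordered combinations with repetition.
assert sorted(pairs) == sorted(combinations_with_replacement(values, 2))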

I.5 Creating the wavelength histogram data

This query demonstrates the use of the Hive histogram_numeric function to extract the histogram data. As we can see, the baseline wavelength table is used to standardise the histogram arrays, which are then built according to the number of bins required (in this case, 5000). The bin count is parametrised within the Python correlation program illustrated in Figure H.4. A short sketch of how such a query can be consumed from PySpark follows the listing (Figure I.6).

with baseline as (
    select
        case when wavelength < 0.1 then 0 else wavelength end basewave
    from allcorrelationwavelengths
    group by
        case when wavelength < 0.1 then 0 else wavelength end
),
realStuff as (
    select wavelength realWave
    from correlationtest
    where ra_range=185 and dec_range=-46.5
      and floor((ra*180/pi()-floor(ra*180/pi()))*60) = 6
      and floor((abs(dec)*180/pi()-floor(abs(dec)*180/pi()))*60) = 6
)
select
    cast(hist.x as int) as bin_center,
    cast(hist.y as bigint) as bin_height
from
(
    select
        histogram_numeric(wavelength, 5000) as salary_hist
    from
    (
        select case when realwave is null then basewave else realwave end wavelength
        from baseline left join realStuff on basewave=realwave
    ) b
) a
lateral view explode(salary_hist) exploded_table as hist;

Figure I.6: Hive QL statement to extract wavelength histogram data
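As a bridge back to Figure H.4, the following minimal sketch shows how a histogram query of this form might be run and its bin heights pulled into a NumPy array from PySpark. It assumes an existing SparkContext sc (as created in Figure H.4), a Spark 1.x HiveContext, and that the query text of Figure I.6 is held in a string variable named histogram_sql (a hypothetical name); it simply mirrors what the DoTheSwim function does for each pair of sky cubes.

import numpy as np
from pyspark.sql import HiveContext

sqlCtx = HiveContext(sc)     # sc: an existing SparkContext
sqlCtx.sql("use skynet")

# histogram_sql holds the HiveQL of Figure I.6 as a single string.
hist = sqlCtx.sql(histogram_sql)

# Column 1 is bin_height; collect it and shape it as DoTheSwim does
# before handing the arrays to numpy.corrcoef.
heights = np.asarray(hist.map(lambda row: [float(row[1])]).collect()).transpose()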

Glossary

ALMA Atacama Large Millimeter/submillimeter Array is a single telescope composed of 66 high precision antennas located on the Chajnantor plateau, 5000 meters altitude in northern Chile. 2

Astro-WISE The astronomical Wide-field Imaging System for Europe. A federated environment of software and hardware to scientifically exploit large volumes of data produced by scientific experiments. Initially geared towards astronomical research, it is now being used for projects from multiple disciplines. 2

CERN European Organization for Nuclear Research (French - Conseil Européen pour la Recherche Nucléaire). 1

HDFS Hadoop Distributed File System - a distributed, scalable and portable file system written in Java for the Hadoop system. 4–6, 10, 13, 14, 16–20, 22, 23, 25, 29, 32, 35, 50, 51, 53–56, 59

HQL Hive Query Language - a SQL-like query language that runs against the Apache Hive data warehouse platform; HQL queries are seamlessly translated into Map/Reduce jobs to process the query. See also SQL. 5

I/O The communication processes between information processing systems and external environments. In computer architecture, transfer of information to or from the CPU/memory from a disk drive is considered an I/O operation. 1

IVOA International Virtual Observatory Alliance is an organisation that debates and agrees the technical standards that are needed to make the Virtual Observatory possible. See also Virtual Observatory. 2

JSON JavaScript Object Notation - a lightweight data interchange format, based on a subset of the JavaScript programming language. 5

Large Hadron Collider As at September 2016, the largest and most powerful particle accelerator in the world. Operated by the European Organization for Nuclear Research (CERN). 1, 2

LOFAR The Low-Frequency Array, a large radio telescope based on an interferometric array of radio telescopes with stations located in the Netherlands, Germany, Great Britain, France and Sweden. Completed in 2012. 2

MPI Message Passing Interface - a standardized and portable message passing system designed to facilitate a variety of parallel computing architectures. See also OpenMP. 3, 4, 6, 57

ODBC Open Database Connectivity - an open standard programming interface for accessing a database. 59

OpenMP A shared-memory application programming interface to enable shared-memory parallel architectures in sequential programming languages with minimal code modification. See also MPI. 3, 4, 6, 57

RAC Oracle Real Application Clusters - an option for Oracle database software which allows the running of multiple database instances on different servers against a shared set of data files or database. 1, 2

RDD Resilient Distributed Dataset - the basic abstraction of data within the Spark cluster computing platform. It is an immutable distributed collection of objects that can be operated on in parallel. RDDs can contain any type of Java, Scala or Python object, including user defined classes and objects. 6, 20–26, 29, 32, 33, 35, 38, 50, 51, 55, 56

SQL Structured Query Language - a standard computer language for data manipulation and management, initially developed for relational database management. SQL is used to query, insert, update and delete data. 5, 14, 19, 21, 22, 51, 55, 56, 59

Virtual Observatory The vision that astronomical datasets and other resources should work as a seamless whole. See also IVOA. 2

XML Extensible Markup Language - a markup language that defines a set of rules for encoding documents in both human-readable and machine-readable form. 5

YARN Yet Another Resource Negotiator - the resource scheduler bundled with Apache Hadoop. Decouples resource management from data processing requirements. 6, 22, 52

List of Acronyms

HPC High Performance Computing. 3, 6, 7, 59

JWST James Webb Space Telescope. 1

LSST Large Synoptic Survey Telescope. 1, 2

RDBMS Relational Database Management System. 1, 2, 59

SKA Square Kilometre Array. 1, 3

VLDB Very Large Data Base. 3
