FITS Data Source for Apache Spark

Julien Peloton,∗ Christian Arnault, Stéphane Plaszczynski

LAL, Univ. Paris-Sud, CNRS/IN2P3, Université Paris-Saclay, Orsay, France

October 17, 2018

∗ [email protected]

Abstract

We investigate the performance of Apache Spark, a cluster computing framework, for analyzing data from future LSST-like galaxy surveys. Apache Spark's attempts to address big data problems have hitherto proved successful in industry, but its use in the astronomical community still remains limited. We show how to manage complex binary data structures handled in astrophysics experiments, such as binary tables stored in FITS files, within a distributed environment. To this purpose, we first designed and implemented a Spark connector to handle sets of arbitrarily large FITS files, called spark-fits. The user interface is such that a simple file "drag-and-drop" to a cluster gives full advantage of the framework. We demonstrate the very high scalability of spark-fits using the LSST fast simulation tool, CoLoRe, and present the methodologies for measuring and tuning the performance bottlenecks for the workloads, scaling up to terabytes of FITS data on the Cloud@VirtualData cluster located at Université Paris-Sud. We also evaluate its performance on Cori, a High-Performance Computing system located at NERSC and widely used in the scientific community.

1 Introduction

The volume of data recorded by current and future High Energy Physics & Astrophysics experiments, and their complexity, require a broad panel of knowledge in computer science, signal processing, statistics, and physics. Precise analysis of those data sets is a serious computational challenge, which cannot be done without the help of state-of-the-art tools. This requires sophisticated and robust analysis performed on many machines, as we need to process or simulate data sets several times. Among the future experiments, the Large Synoptic Survey Telescope (LSST [12]) will collect terabytes of data per observation night, and their efficient processing and analysis remains a major challenge.

In early 2010, a new tool entered the landscape of cluster computing: Apache Spark [13, 1, 2]. After starting as a research project at the University of California, Berkeley in 2009, it is now maintained by the Apache Software Foundation [14], and widely used in industry to deal with big data problems. Apache Spark is an open-source framework for data analysis mostly written in Scala, with application programming interfaces (API) for Scala, Python, R, and Java. It is based on the so-called MapReduce cluster computing paradigm [4], popularized by the Hadoop framework [15, 3], using implicit data parallelism and fault tolerance. In addition, Spark optimizes data transfers, memory usage and communications between processes based on an analysis of the directed acyclic graph (DAG) of the tasks to perform, relying largely on functional programming. To fully exploit the capability of cluster computing, Apache Spark also relies on a cluster manager and a distributed storage system. Spark can interface with a wide variety of distributed storage systems, including the Hadoop Distributed File System (HDFS), Apache Cassandra [16], Amazon S3 [17], or even Lustre [18], traditionally installed on High-Performance Computing (HPC) systems. Although Spark is not an in-memory technology per se¹, it allows users to efficiently use an in-memory Least Recently Used (LRU) cache, relying on disk only when the allocated memory is not sufficient, and its popularity largely comes from the fact that it can be used at interactive speeds for data mining on clusters.

¹ Spark has pluggable connectors for different persistent storage systems, but it does not have native persistence code.

When it started, usage of Apache Spark was often limited to naively structured data formats such as Comma Separated Values (CSV) or JavaScript Object Notation (JSON), driven by the need to analyze huge data sets from monitoring applications, log files from devices, or web data extraction, for example. Such simple data structures allow a relatively easy and efficient distribution of the data across machines.
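Reading such a simple format is indeed a one-liner with Spark's built-in data sources, and the resulting rows are automatically split into partitions spread over the executors. The minimal Scala sketch below assumes a SparkSession named spark, as in the examples of Sec. 2.4; the path is purely illustrative.

// Spark's built-in CSV data source: the rows of the files are
// distributed across the cluster without any extra code.
val logs = spark.read.option("header", "true").csv("hdfs://.../logs")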

However, these file formats are usually not suitable for describing complex data structures, and they often lead to poor performance. Over the past years, a handful of efficient data compression and encoding schemes, such as Avro [19] or Parquet [20], has been developed by different companies and interfaced with Spark. Each requires a specific implementation within the framework.

High Energy Physics & Astrophysics experiments typically describe their data with complex data structures that are stored in structured file formats such as, for example, Hierarchical Data Format 5 (HDF5) [21], ROOT files [22], or the Flexible Image Transport System (FITS) [23]. As far as connecting such formats directly to Spark is concerned, some efforts are being made to offer the possibility to load ROOT files [24] or HDF5 files [25, 26]. For the FITS format, indirect efforts have been made (e.g. [7, 6, 8]) to interface the data with Spark, but to the best of our knowledge no native Spark connector to distribute the data across machines is available.

In this work, we investigate how to leverage Apache Spark to process and analyse future data sets in astronomy. We study the question in the context of the FITS file format [5, 9], introduced more than thirty years ago with the requirement that developments to the format must remain backward compatible. This data format has been used successfully to store and manipulate the data of a wide range of astrophysical experiments over the years, hence it is considered as one of the possible data formats for future surveys like LSST. More specifically, we address a number of challenges in this paper, such as:

• How do we read a FITS file in Scala (Sec. 2.1)?
• How do we access the data of a distributed FITS file across machines (Sec. 2.2)?
• What is the effect of caching the data on IO performance in Spark (Secs. 3 & 5)?
• What is the behavior if there is insufficient memory across the cluster to cache the data (Sec. 3)?
• What is the impact of the underlying distributed file system (Sec. 5)?

To this purpose, we designed and implemented a native Scala-based Spark connector, called spark-fits, to handle sets of arbitrarily large FITS files. spark-fits is open source software released under an Apache-2.0 license and available as of April 2018 from https://github.com/astrolabsoftware/spark-fits.

The paper is organized as follows. We start with a review of the architecture of spark-fits in Sec. 2, and in particular the Scala FITSIO library to interpret FITS in Scala, the Spark connector to manipulate FITS data in a distributed environment, and the different API (Scala, Java, Python, and R) allowing users to easily access the set of tools. In Sec. 3 we evaluate the performance of spark-fits on a medium size cluster with various data set sizes, and we detail a few use cases using LSST-like simulated galaxy catalogs in Sec. 4. We discuss the impact of using spark-fits on a HPC system in Sec. 5, and conclude in Sec. 6.

2 FITS data source for Apache Spark

This section reviews the different building blocks of spark-fits. We first introduce a new library to interpret FITS data in Scala, and then detail the mechanisms of the Spark connector to manipulate FITS data in a distributed environment. Finally, we briefly give an example using the Scala API.

2.1 Scala FITSIO library

In this section we recall the main FITS format features of interest for this work. A FITS file is logically split into independent FITS structures called Header Data Units (HDU), as shown in the upper panel of Fig. 2. By convention, the first HDU is called primary, while the others (if any) are called extensions. Each HDU consists of one or more 2880-byte header blocks immediately followed by an optional sequence of associated 2880-byte data blocks. The header is used to define the types of data and protocols, and then the data is serialized into a more compact binary format. The header contains a set of ASCII text characters (decimal 32 through 126) which describes the following data blocks: types and structure (1D, 2D, or 3D+), size, physical information or comments about the data, and so on. Then the entire array of data values is represented by a continuous stream of bytes starting at the beginning of the first data block.

spark-fits is written in Scala, like Spark. Scala is a relatively recent general-purpose programming language (2004) providing many features of functional programming such as type inference, immutability, lazy evaluation, and pattern matching. While several packages are available to read and write FITS files in about 20 computer languages, none is written in Scala. There are Java packages that could have been used in this work, as Scala provides language interoperability with Java, but the structures of the different available Java packages are not meant to be used in a distributed environment and they lack functional programming features. Therefore we released within spark-fits a new Scala library to manipulate FITS files, compatible with Scala versions 2.10 and 2.11².

² Scala versions are mainly not backwards-compatible.
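To make the byte layout described above concrete, here is a minimal, self-contained Scala sketch (independent of the spark-fits internals) that reads the 2880-byte header blocks of the primary HDU and prints their 80-character keyword records; the file path is illustrative.

import java.io.RandomAccessFile

object ReadFitsHeader {
  def main(args: Array[String]): Unit = {
    val file = new RandomAccessFile("/path/to/catalog.fits", "r") // illustrative path
    val block = new Array[Byte](2880) // FITS header and data blocks are 2880 bytes long
    var done = false
    while (!done) {
      file.readFully(block)
      // A header block contains 36 records of 80 ASCII characters each.
      val records = new String(block, "US-ASCII").grouped(80).toList
      records.foreach(r => println(r.trim))
      // The END keyword closes the header; the data blocks follow.
      done = records.exists(_.trim == "END")
    }
    file.close()
  }
}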

The library is still in its infancy and only a handful of the features included in the popular and sophisticated library CFITSIO [27] are available. More specifically, we first focused on providing support for reading, and distributing across machines, header data and binary table HDU. At the time of writing, there is also support for image HDU (version 0.6.0), but this manuscript does not detail it. More features will come alongside the development of the package.

2.2 spark-fits connector

In Apache Spark, any complex data format needs to be described in terms of compression and decoding mechanisms. But we must also provide the keys to distribute the computation and access the data properly across the machines. We need to tell Spark how to read the data from the distributed file system, and describe the various operations to be performed in parallel, sent to the worker nodes and launched by the executors (usually one executor per worker node). Among the different distributed file systems available nowadays, this work focuses on the widely used Hadoop Distributed File System (HDFS).

Figure 1: On the one hand, the file blocks (red) are distributed and stored among the different DataNodes of the HDFS cluster. On the other hand, Spark's driver program, after requesting the data location from the NameNode, sends computation to worker nodes as close as possible to the DataNodes where the data are stored.

In a HDFS cluster, an entire data file is physically split (once) into same-size file blocks which are stored on the different DataNodes of the cluster. A DataNode hosts only a subset of all the blocks. Following the principle that moving computation is usually cheaper than moving data, Spark reads file blocks in a performant way: instead of copying file blocks to a central compute node, which can be expensive, the driver sends the computation to worker nodes close to the DataNodes where the data reside, as shown in Fig. 1. This is achieved after interacting with the NameNode, the master server that manages the file system namespace and regulates access to files by clients. This procedure ensures as much as possible the principle of data locality. Finally, the distributed code processes the file blocks simultaneously, which turns out to be very efficient. Note that for robustness, we use a replication factor of 3.

However, the file block sizes in HDFS are arbitrary in the sense that HDFS does not know about the logical structure of the file format. The file is split at regular intervals (typically a data block in HDFS has a size of 128 megabytes on disk), as shown in the lower panel of Fig. 2. This means, for example, that a file block can begin or end in the middle of a sequence of bytes or in the middle of a table row, with the remainder contained in the next file block.

Figure 2: Different views of a FITS file. Upper: The FITS format is defined in terms of structures (HDU, green), each containing a header in ASCII and binary data (HEADER #0, DATA #0, HEADER #1, DATA #1, ...). Middle: spark-fits distributes the data from one HDU among executors in nearly same-length partitions (blue). Bottom: The FITS file is physically split into same-size file blocks in HDFS regardless of its logical structure (red). Each HDFS DataNode will have a subset of all the blocks. See text for more information.

Therefore, in spark-fits the data is logically split into partitions following approximately the HDFS blocks but respecting the logical structure of the file, as highlighted in the middle panel of Fig. 2. This logical partitioning of the data, which happens at the level of Spark mappers (a mapper is typically a machine core), tends to break the model of data locality. A mapper will probably need to read remotely (from another DataNode) some part of the data missing from its local DataNode. However, the overhead can be small if the missing data size is small compared to that of a file block or a logical partition.
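As an illustration of this logical partitioning (and not the actual spark-fits implementation), the sketch below computes partition boundaries that follow a target block size while always ending on a complete row of a binary table; the numbers in the example are arbitrary.

object PartitionBoundaries {
  /** Split nRows rows of rowSize bytes, starting at byte offset dataStart,
    * into partitions of roughly blockSize bytes aligned on row boundaries. */
  def boundaries(dataStart: Long, nRows: Long, rowSize: Long,
                 blockSize: Long): Seq[(Long, Long)] = {
    val rowsPerPartition = math.max(1L, blockSize / rowSize)
    (0L until nRows by rowsPerPartition).map { firstRow =>
      val lastRow = math.min(firstRow + rowsPerPartition, nRows)
      // (start offset, end offset) in bytes, always falling on a row boundary
      (dataStart + firstRow * rowSize, dataStart + lastRow * rowSize)
    }
  }

  def main(args: Array[String]): Unit = {
    // Example: 10 million rows of 20 bytes each, 128 MB target partitions.
    boundaries(dataStart = 2880L, nRows = 10000000L, rowSize = 20L,
               blockSize = 128L * 1024 * 1024).foreach(println)
  }
}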

In order to keep this overhead as small as possible, each mapper loads and decodes the data, as driven by its logical partition, record after record, until the end of the partition. The size of the records is arbitrary and depends on what needs to be read (table rows, table columns, or image subsets for example), but it is typically smaller than the total partition size, to minimize the overhead due to data transfer between DataNodes.

2.3 RDD and DataFrame

At the end of the procedure, all the data from the HDU is available as a partitioned collection of records across all machines, called a Resilient Distributed Dataset (RDD) [2]. RDDs are distributed memory abstractions, fault-tolerant and immutable. They are designed for general reuse (e.g. performing interactive data mining) as, once the RDDs are materialized, Spark keeps them persistent in memory by default, and spills them to disk if there is not enough RAM available³. This means that if we have to query the data set several times and enough RAM is available in the cluster, the subsequent iterations will be faster compared to the initial one.

³ Users can specify which RDDs they want to reuse and the storage strategy to use: in-memory only, or shared between disk and memory.

The structuring of the files into header/data structures, and the standardisation of the format, is a valuable asset. The header of the FITS file contains all information concerning the data types, therefore one can automatically build the DataFrame⁴ schema, hence speeding up the conversion between RDD and DataFrame.

⁴ A DataFrame is a distributed collection of records, like an RDD, organized into named columns and including the benefits of Spark SQL's execution engine. This is a new interface added in Spark version 1.6.

2.4 Scala, Java, Python, and R API

We released high level API for Scala, Java, Python, and R to allow users to easily access the set of tools described in the previous sections, compatible with Apache Spark version 2.0 and later. The API are similar to those of other Spark built-in file formats (e.g. CSV, JSON, Parquet, Avro). For the sake of clarity we present only the Scala API in this section; information about the other API can be found in the documentation of spark-fits. Let us suppose we moved terabytes of data catalogs into a Hadoop distributed file system and we want to process them. We will use the spark-fits connector to connect to the data:

// Create DataFrame from the first HDU
val df = spark.read
  .format("fits")
  .option("hdu", 1)
  .load("hdfs://...")

No data transfer or work has been performed on the cluster, since no actions (in the sense of functional programming) have been called yet (laziness). The user can then use all the DataFrame generic commands to quickly inspect which data is contained in the HDU, thanks to the header information:

// The DataFrame schema is inferred from
// the FITS header
df.printSchema()
root
 |-- target: string (nullable = true)
 |-- RA: float (nullable = true)
 |-- Dec: float (nullable = true)
 |-- Index: long (nullable = true)

// Show the first four rows.
df.show(4)
+----------+--------+---------+-----+
|    target|      RA|      Dec|Index|
+----------+--------+---------+-----+
|NGC0000000|3.448297|-0.338748|    0|
|NGC0000001|4.493667|-1.441499|    1|
|NGC0000002|3.787274| 1.329837|    2|
|NGC0000003|3.423602|-0.294571|    3|
+----------+--------+---------+-----+

We can keep manipulating the data set by applying transformations onto it (map, filter, union, ...) and, only when the data exploration has reduced most of the data set size, we perform an action (count, reduce, collect, ...) which triggers the computation. There are two parts to the problem: loading of the data and computation. Ideally, the data will be loaded only the first time and the totality or some part of it will be kept in memory, such that the subsequent data explorations will be limited by the computation time:

// Select only 2 columns and cache
// the data set at the next action call
val sub_df = df.select($"RA", $"Dec")
  .persist(...)

// Perform data exploration iteratively
for (index <- 1 to nIteration) {
  val result = sub_df.filter(...)
    .map(...)
    .collect()
  ...
}
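As a concrete (and purely illustrative) instance of the skeleton above, using the catalog schema shown earlier, one could cache the two coordinate columns in memory and scan several declination bands. StorageLevel.MEMORY_ONLY is the standard Spark storage level for memory-only caching, and the $-column syntax assumes import spark.implicits._, as in the snippets above.

import org.apache.spark.storage.StorageLevel

// Keep only the coordinates and request in-memory caching.
val coords = df.select($"RA", $"Dec").persist(StorageLevel.MEMORY_ONLY)

// Iterate over declination bands: the first count triggers loading,
// decoding and caching; the subsequent ones read from the cache.
for (band <- -90 to 80 by 10) {
  val n = coords.filter($"Dec" >= band && $"Dec" < band + 10).count()
  println(s"Dec in [$band, ${band + 10}) deg: $n objects")
}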

The caching of the data will be done during the first iteration, hence cutting down the loading time for subsequent iterations. We present the benchmarks in Sec. 3.

3 Benchmarks

Unless specified otherwise, our computations are performed on a dedicated cluster installed at Université Paris-Sud (Cloud@VirtualData), France, with 9 Spark executors (machines), each having 17 cores and 30 gigabytes (GB) of RAM in total. The infrastructure follows the description provided in Sec. 2.2, with one executor per worker node. The amount of memory dedicated to the caching of data is 162 GB across the cluster⁵. We use a HDFS cluster for storage, with 9 DataNodes of 3.9 terabytes (TB) each and a 128 megabyte (MB) file block size. The data sets consist of FITS files generated by the simulation tool CoLoRe [28], containing the binary table HDU described in App. A. Before each test, the OS memory and disk buffer caches are cleared to measure IO performance accurately.

⁵ We set the total memory fraction dedicated to caching to 0.6, which corresponds to 18 GB of RAM in total per executor.

In order to assess the IO performance of spark-fits, we first implement a simple application to distribute, decode, and load FITS data, and count the total number of table rows in the data sets. We run the same operations several times, and at the first iteration we decide whether to keep the data in memory or not. The results are shown in Fig. 3 for various data set sizes and different strategies.

If no data is kept in memory, that is if at each iteration spark-fits creates partitions by reading the entire data set from disk and decoding it as described in Sec. 2.2, then all iterations over the data set perform at the same speed (dark green curve in Fig. 3). We note however that the running time increases quasi-linearly with the data set size⁶, which ranges from 11 GB to 1.2 TB. For a fixed cluster configuration (number of executors, cores per executor, memory per core, and so on), this linearity is expected, due to the fact that we simultaneously have more partitions than Spark mappers, partitions are independent, and each mapper processes one partition at a time. We measure the throughput per mapper to be around 15 MB/s (median), which corresponds to an IO bandwidth of 2.3 GB/s for the cluster. The throughput consists of the rate to transfer data from disk (not only transferring data from a local DataNode to the worker node, but also transferring data from remote DataNodes), and the rate to decode the data. After analyzing the durations for several job configurations, we find that the transfer step takes about 60% of the time, while the decoding step takes about 30%. We are currently trying to reduce both contributions to improve the throughput.

⁶ A break of data locality would provoke a departure from linearity.

Figure 3: Duration of iterations in Spark to load and read various FITS data sets from HDFS on 9 machines using spark-fits, as a function of the volume of data (GB). The running time at each iteration to read the data sets entirely from disk (no caching) is shown in dark green. The iteration durations when caching the data are represented with the lighter green curves: circle marks for the first iteration (loading and caching), and square marks for the later iterations. The vertical dotted line represents the maximum amount of cache memory. Error bars show standard deviations from 10 runs.

If we want to benefit from the Spark in-memory advantages, then we have to split the work into two phases: the first iteration over the data set, and the subsequent iterations. At the first iteration, spark-fits creates partitions by loading, reading and decoding the data from disk. In addition, partitions are put in cache by Spark until the dedicated memory fraction becomes full. The caching operation adds an extra overhead to the running time (around 20%), shown with circle marks in Fig. 3. Once this first phase ends, the later iterations are faster to perform since all or part of the data is present in the cache (that is, as close as possible to the computation). If all partitions fit into memory, then the later iteration times are almost two orders of magnitude smaller than the first iteration time. This is the case for the data sets up to 110 GB in our runs, as shown with square marks in Fig. 3.
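The simple application used for this benchmark boils down to a repeated row count with timing; a hedged sketch of such a driver program is given below (the HDU index, the number of iterations, and the path are illustrative).

import org.apache.spark.sql.SparkSession

object IoBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-fits-io-benchmark").getOrCreate()

    val df = spark.read
      .format("fits")
      .option("hdu", 1)             // illustrative HDU index
      .load("hdfs://.../catalogs")  // illustrative location of the FITS files
      .persist()                    // request caching at the first action

    for (iteration <- 1 to 10) {
      val start = System.nanoTime()
      val rows = df.count()         // triggers transfer + decoding (and caching once)
      val elapsed = (System.nanoTime() - start) / 1e9
      println(f"iteration $iteration: $rows rows in $elapsed%.1f s")
    }

    spark.stop()
  }
}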

These performances can be used to explore the data sets at interactive speed. If the data set size is bigger than the available cache memory, the user needs to decide whether the remaining partitions will be spilled to disk at the first iteration⁷. In these runs, we decide not to spill the remaining partitions to disk, hence the later iterations have to recompute these partitions from scratch (transferring and decoding). Unavoidably, the iteration time increases, but the performance degrades gracefully with smaller cache fraction, making spark-fits a robust tool against large data set sizes.

⁷ We note also that if the initial data set is bigger than the available memory in our cluster, we can decide to cache only subsets of the data (the most re-used, for example).

Figure 4: Duration of the first and later iterations in Spark to load and read a 110 GB data set from HDFS on 9 machines, using the Scala API (left) or the Python API (right). The first iteration includes the time to cache the data into memory. Although Scala performs slightly better at all iterations, the performances of the two API remain comparable. Error bars show standard deviations from 10 runs.

Finally, we compare the performance of the Scala API and the Python API. The results for loading and reading a 110 GB data set are shown in Fig. 4. We found that using the Scala API gives slightly better performance at all iterations, but the difference remains small, such that the two API can be used interchangeably depending on the user needs. Since the introduction of DataFrames in the latest versions of Spark, similar performances across API are expected.

4 Application on galaxy catalogs

We evaluated spark-fits on our cluster through a series of typical use cases from astronomy involving galaxy catalogs, described in App. A. Unless specified otherwise, the data set used in the following experiments is made of 33 FITS files, for a total data volume of 110 GB (6 × 10⁹ catalog objects). In the following experiments we make use of the Hierarchical Equal Area and iso-Latitude Pixelation, HEALPix [29, 11], and more specifically its Java package, for the discretisation of the sphere, the projection into sky maps, and functions on the sphere.

IO benchmark. Simply load and read the data, with no further operations. It corresponds to the benchmark described in Sec. 3.

Data split in redshift shells. We evaluate the cost of first splitting the catalogs data according to their redshift distance, and then projecting the data into separate sky maps (4 shells of width 0.1 in redshift distance).

Iterative search for neighbours in the sky. We first define a circular region of radius 1 degree centred on a location of interest. We then perform a coordinate transformation from RA/Dec in our set of catalogs to the corresponding HEALPix pixel index on the sky, and then we search for pixels belonging to the region of interest. We finally reduce the catalog objects found.

Cross-match of sets of catalogs. We look for objects in common between two sets of catalogs with sizes of 110 GB and 3.5 GB, respectively⁸. The cross-match is done according to the HEALPix pixel index of each object. The pixel size has been set to approximately 6.9 arcmin. This test includes a large amount of data shuffle, at all iterations. A sketch of this pixel-index join is given below, after Fig. 5.

⁸ This corresponds to 6 × 10⁹ and 0.4 × 10⁹ objects, respectively.

Figure 5: Per-iteration running time of four user applications on 9 machines: IO benchmark (a), split in redshift shells (b), search for neighbours in the sky (c), and cross-match of sets of catalogs (d). The data set size is 110 GB. Error bars show standard deviations from 10 runs.
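The sketch below illustrates the cross-match logic; it is not the code used for the benchmark, and for simplicity it replaces the HEALPix index used in the paper by a crude rectangular binning of the sky (the real runs use the HEALPix Java package with a pixel size of about 6.9 arcmin). The $-column syntax again assumes import spark.implicits._.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// Stand-in for the HEALPix pixel index of the paper: a crude rectangular
// binning of the sky in 0.1-degree cells, for illustration only.
val toCell = udf((ra: Float, dec: Float) =>
  math.floor(ra / 0.1).toLong * 10000L + math.floor((dec + 90.0) / 0.1).toLong)

def crossMatch(catA: DataFrame, catB: DataFrame): DataFrame = {
  // Tag each object with its cell index, then join the two catalogs on it.
  // The join shuffles both catalogs by cell index, as in the benchmark above.
  val a = catA.withColumn("cell", toCell($"RA", $"Dec"))
  val b = catB.withColumn("cell", toCell($"RA", $"Dec"))
    .withColumnRenamed("RA", "RA_b").withColumnRenamed("Dec", "Dec_b")
  a.join(b, "cell")
}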

The results of the experiments are summarized in Fig. 5. The first iterations are dominated by the loading and the caching of the data (case a). For the later iterations, the running times differ, but the time to read the data is often small compared to the computing time. Overall, all the experiments run here can be done at interactive speed once the data are in memory. We plan to release a companion paper with in-depth use cases for astronomy.

5 spark-fits on HPC systems

Spark can make use of HDFS and other related distributed file systems, but it is not strongly tied to them: since iterative computation can be done in memory given enough RAM, there is much less urgency in having the data local to the computation if the computation is long enough compared to the total time (including read and shuffle of the data). Spark's standalone mode and file system-agnostic approach also make it a candidate to process data stored in HPC-style shared file systems such as Lustre [18]. The use of Spark on HPC systems was explored for example in [10] in the context of the HDF5 data format.

We evaluate the performance of spark-fits on the supercomputer Cori, a Cray XC40 system located at NERSC [30]. The compute nodes and the storage nodes are separated on this machine. For the computing, the system has two different kinds of nodes: 2,388 Intel Xeon "Haswell" processor nodes and 9,688 Intel Xeon Phi "Knights Landing" nodes. In this work we make use of the Haswell compute nodes, each having 32 cores and 128 GB of RAM in total. For the storage, we use the scratch file system (Lustre) with an IO bandwidth of 744 GB/s (aggregate peak performance). Lustre differs from block-based distributed filesystems such as HDFS in many ways. Among those, we recall that the data is stored in a local disk filesystem.

We used a set of 330 FITS files containing galaxy catalogs, for a total data set size on disk of 1.2 TB (6 × 10¹⁰ objects). We used 40 Haswell nodes (1280 cores in total), and set the number of Spark partitions to 3000. We used Spark version 2.3.0 inside of Shifter [31]. In order to profile the performance of spark-fits on Cori, we ran the same benchmark as in Sec. 3, that is loading and reading the data and caching it at the first iteration.

We show the results of the benchmark in Fig. 6. The left panel presents the results using the default configuration of the system. At the first iteration, we load the 1.2 TB of data in about a minute, corresponding to an IO bandwidth of 17 GB/s. The subsequent iterations are much faster since all the data fits into memory, and we can process the entire data set in a few seconds (2.4 ± 0.8 s).

Figure 6: Per-iteration running time to load and read a 1.2 TB data set on the supercomputer Cori at NERSC. The first iteration also includes the time for caching the data in memory. For each case we use 1280 cores, but we vary the Lustre striping count (number of OSTs): 1 (left panel) and 8 (right panel). The corresponding IO bandwidths at the first iteration are 17 GB/s and 21 GB/s, respectively. Error bars show standard deviations from 10 runs.

To increase IO performance, Lustre also implements the concept of file striping. A file is said to be striped when read and write operations access multiple disks (Object Storage Targets, or OSTs) concurrently. The default on Cori is a number of OSTs equal to 1. We also tried a number of OSTs equal to 8, following the current striping recommendation on Cori for files between 1 and 10 GB. The results are shown in the right panel of Fig. 6. By increasing the number of OSTs we observe that the IO bandwidth also increases, reaching 21 GB/s. The later iterations are also performed in a few seconds (2.1 ± 1.3 s). We also tried other striping values between 8 and 72, but we saw no significant difference with respect to the 8 OSTs case.

These results show that the use of Spark on current HPC systems gives good performance, at least equivalent to Message Passing Interface (MPI)-based software [10]. Given enough resources, still reasonable and accessible to scientists, one can deal with large volumes of data in a few seconds once the loading phase is done.

6 Conclusion & future developments

We presented spark-fits, a new open-source tool to distribute, read, and manipulate arbitrarily large volumes of FITS data across several machines using Apache Spark. It contains a library to interpret FITS data in Scala, and a Spark connector to manipulate FITS data in a distributed environment. Taking advantage of Spark, spark-fits automatically handles most of the technical problems, such as computation distribution or data decoding, allowing the users to focus on the data analysis.

Although spark-fits is written in Scala, users can access it through several API (Scala, Python, Java, and R) without loss of performance, and the user interface is similar to other built-in Spark data sources. At the time of writing, it is still under active development, both in optimizing the current implementation and in releasing new features. For example, we are working on the possibility to distribute and manipulate image HDU, which would enable direct processing of raw data coming from telescope observations. We note that the procedure described in this paper to deal with the FITS format can be translated, with only a few minor changes, to many other file formats.

We performed a series of benchmarks to evaluate the performance of spark-fits against data volumes ranging from 11 GB to 1.2 TB. For a fixed cluster configuration, the duration of the first phase, which includes loading, decoding and caching the data, turns out to increase linearly with the data set size. Then, assuming that enough memory is available across the cluster, the data exploration and the data analysis at the subsequent iterations can be performed at interactive speed regardless of the initial data set size. These controlled behaviours – linear time increase, performance as a function of cached fraction – make spark-fits a robust tool against varying data set sizes, and for a wide range of astronomical applications.

We also tested the use of Apache Spark and spark-fits on the supercomputer Cori, located at NERSC. We found that the use of Spark on current HPC systems gives good performance. Using 40 Haswell compute nodes we managed to process 1.2 TB of FITS data (6 × 10¹⁰ objects) in a few seconds, once the loading and caching phase was done. During the loading phase we reached an IO bandwidth of about 20 GB/s, leaving some room for improvement.

Our experience with spark-fits indicates that cluster-computing frameworks such as Apache Spark are a compelling and complementary approach to current tools for processing the data volumes which will be collected by future astronomical telescopes. The intrinsic fault tolerance of Spark and its scalability with respect to data size make it a robust tool for many applications. spark-fits is publicly available, and it is the first step towards a wider project aimed at building a general purpose software organization for future galaxy surveys.

Acknowledgements

We would like to thank all Apache Spark and FITS contributors. This research uses VirtualData infrastructures at Université Paris-Sud. We would like to thank the VirtualData team, and especially Michel Jouvin, for making this work possible. We thank Cécile Germain and the Moyens de Recherche Mutualisés (MRM) and Equipements de Recherche Mutualisés (ERM) for their support, and Julien Nauroy for his early contributions to the computing infrastructures. We thank the Service Informatique at LAL, especially Adrien Ramparison and Guillaume Philippon for their technical support on the cluster, and Antoine Pérus and Hadrien Grasland for useful discussions and comments on this work. We would like to thank the members of LSST Calcul France for valuable feedback on this work, and the anonymous referee for providing insightful comments and directions for additional work. We acknowledge the use of HEALPix and CoLoRe. Some part of this research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

On behalf of all authors, the corresponding author states that there is no conflict of interest.

A Resources for benchmarks

The data sets consist of FITS files generated by the simulation tool CoLoRe [28], containing binary table HDU with 5 columns (single-precision floating-point). The binary tables are galaxy catalogs containing celestial object coordinates (at least Right Ascension (RA) and Declination (Dec) coordinates and redshift distance information). Each file has a size of around 3.5 GB on disk. The data are transferred (once) to our HDFS cluster prior to the runs.

The code used to perform the IO benchmarks (Scala/Python) can be found at https://github.com/astrolabsoftware/sparkioref. Note that this code also compares the performance of FITS with other data formats such as CSV and Parquet. The code used to benchmark the other use cases can be found at https://github.com/JulienPeloton/spark-fits-apps.

References

[1] Zaharia, Matei and Chowdhury, Mosharaf and Franklin, Michael J. and Shenker, Scott and Stoica, Ion, Spark: Cluster Computing with Working Sets, Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 10–10 (2010)

[2] Matei Zaharia and Mosharaf Chowdhury and Tathagata Das and Ankur Dave and Justin Ma and Murphy McCauly and Michael J. Franklin and Scott Shenker and Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), 15–28 (2012)

[3] Shvachko, Konstantin and Kuang, Hairong and Radia, Sanjay and Chansler, Robert, The Hadoop distributed file system, Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium, 1–10 (2010)

[4] Dean, Jeffrey and Ghemawat, Sanjay, MapReduce: simplified data processing on large clusters, Communications of the ACM, 107–113 (2008)

[5] Wells, D. C. and Greisen, E. W. and Harten, R. H., FITS - a Flexible Image Transport System, Astronomy and Astrophysics Supplement Series, 44 (1981)

[6] Zhao Zhang and Kyle Barbary and Frank Austin Nothaft and Evan R. Sparks and Oliver Zahn and Michael J. Franklin and David A. Patterson and Saul Perlmutter, Scientific Computing Meets Big Data Technology: An Astronomy Use Case, CoRR (2015)

[7] Wiley, K., Connolly, A., Krughoff, S., et al. 2011, Astronomical Data Analysis Software and Systems XX, 442, 93

[8] Brahem, Mariem and Lopes, Stephane and Yeh, Laurent and Zeitouni, Karine, AstroSpark: towards a distributed data server for big data in astronomy, Proceedings of the 3rd ACM SIGSPATIAL PhD Symposium, 3 (2016)

[9] Pence, William D and Chiappetti, L and Page, Clive G and Shaw, R. A. and Stobie, E, Definition of the flexible image transport system (FITS), version 3.0, Astronomy & Astrophysics, A42 (2010)

[10] Liu, Jialin and Racah, Evan and Koziol, Quincey and Canon, Richard Shane, H5Spark: bridging the I/O gap between Spark and scientific data formats on HPC systems, Cray User Group (2016)

[11] Gorski, K. M. and Hivon, Eric and Banday, A. J. and Wandelt, B. D. and Hansen, F. K. and Reinecke, M. and Bartelman, M., HEALPix - A Framework for high resolution discretization, and fast analysis of data distributed on the sphere, Astrophys. J., 759–771 (2005)

[12] LSST, https://www.lsst.org/

[13] Apache Spark, http://spark.apache.org/

[14] Apache Software Foundation, https://www.apache.org/

[15] Apache Hadoop, https://hadoop.apache.org/

[16] Apache Cassandra, https://cassandra.apache.org/

[17] Amazon S3, https://aws.amazon.com/fr/s3/

[18] Lustre, http://lustre.org/

[19] Apache Avro, https://avro.apache.org/

[20] Apache Parquet, https://parquet.apache.org/

[21] The HDF Group, https://www.hdfgroup.org/

[22] ROOT files, https://root.cern.ch/root-files

[23] The FITS Support Office, https://fits.gsfc.nasa.gov/

[24] spark-root, https://github.com/diana-hep/spark-root

[25] HDF5 Spark connector, https://www.hdfgroup.org/downloads/spark-connector

[26] h5spark, https://github.com/valiantljk/h5spark

[27] CFITSIO - A FITS File Subroutine Library, https://heasarc.gsfc.nasa.gov/docs/software/fitsio/fitsio.html

[28] CoLoRe, https://github.com/damonge/CoLoRe

[29] HEALPix, http://healpix.sourceforge.net/

[30] NERSC, https://www.nersc.gov

[31] Shifter at NERSC, https://www.nersc.gov/users/software/using-shifter-and-docker/using-shifter-at-nersc/
