MNRAS 000, 1–15 (2016) Preprint 6 June 2021 Compiled using MNRAS LaTeX style file v3.0

NebulOS: A Big Data Framework for Astrophysics

Nathaniel R. Stickley,1⋆ Miguel A. Aragon-Calvo1
1Department of Physics and Astronomy, University of California, 900 University Ave, Riverside, CA 92521, USA

Accepted XXX. Received YYY; in original form ZZZ

ABSTRACT
We introduce NebulOS, a platform that allows a cluster of machines to be treated as a single computer. With NebulOS, the process of writing a massively parallel program for a datacenter is no more complicated than writing a Python script for a desktop computer. The platform enables most pre-existing data analysis software to be used, at scale, in a datacenter without modification. The shallow learning curve and compatibility with existing software greatly reduce the time required to develop distributed data analysis pipelines. The platform is built upon industry-standard, open-source Big Data technologies, from which it inherits several fault tolerance features. NebulOS enhances these technologies by adding an intuitive user interface, automated task monitoring, and other usability features. We present a summary of the architecture, provide usage examples, and discuss the system's performance scaling.

Key words: methods: data analysis – virtual observatory tools

arXiv:1503.02233v3 [astro-ph.IM] 14 Sep 2016

1 INTRODUCTION

In recent decades, the volume of data coming from experiments, sensors, observations, simulations, and other sources has increased exponentially. Scientific disciplines that were once starved of data are now being flooded with data that is not only massive in size, but heterogeneous and, in some cases, highly interconnected. Consequently, scientific discoveries are increasingly being driven by large-scale data analysis. Although many data analysis and management patterns are shared among research groups, individual groups typically develop custom, application-specific, data analysis pipelines. This results in duplicated efforts, wasted resources, and incompatibility among projects that might otherwise complement one another.

In industry, the need to perform large-scale data analysis has resulted in the development and adoption of data-aware frameworks, such as Apache Hadoop. Largely driven by Internet and finance companies, these tools are most easily applied to Web and business data. Adapting these tools for scientific data analysis is oftentimes not straightforward. Analogous tools, designed specifically for large-scale scientific data analysis, management, and storage have not yet emerged, despite the increasing need. In astronomy, for example, many projects, including the Sloan Digital Sky Survey (York et al. 2000), the Hubble Space Telescope, and the Bolshoi simulation (Klypin et al. 2011) have each produced tens or hundreds of terabytes of data. Future projects, such as the LSST (Ivezic et al. 2008), will produce petabytes of data. While each of these projects produces different kinds of data and performs different types of data analysis, the data management and analysis patterns are shared.

1.1 I/O bottlenecks in conventional supercomputers

Most large-scale scientific data analysis is currently performed on supercomputers in which computing nodes are physically decoupled from data storage media. Such architectures are well-suited for compute-intensive applications, such as large simulations, where CPU, memory, and inter-node communication speed are the most important factors. However, they tend to perform poorly when applied to data-intensive problems that require high data throughput, such as large-scale signal processing, or analysing large ensembles of data. As the size of datasets approaches the petabyte scale, traditional supercomputing architectures quickly become I/O-bound, which limits their usefulness for data analysis.

1.2 The Big Data approach

An architecture known as the datacenter (Hoelzle & Barroso 2009) has emerged as the natural architecture for analysing the enormous quantities of data that have recently become available. Like most supercomputers, datacenters consist of many machines connected via a network. Unlike supercomputers, however, the data in most datacenters is stored close to the computing nodes.

⋆ E-mail: [email protected]
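The benefit of storing data close to the computing nodes can be sketched with a toy throughput model (the function and numbers below are our own illustration, not measurements from this work): the locally stored fraction of a dataset is read at disk speed, while the remainder must cross the network and is capped by the slower of the disk and the network link.

```python
def effective_throughput(disk_mb_s, network_mb_s, local_fraction):
    """Toy model of a single node's read throughput: local data is read
    at disk speed; remote data is capped by the network link."""
    remote_mb_s = min(disk_mb_s, network_mb_s)
    return local_fraction * disk_mb_s + (1.0 - local_fraction) * remote_mb_s

# A node with a 150 MB/s disk and a 50 MB/s share of the cluster network:
print(effective_throughput(150.0, 50.0, 1.0))  # all data local  -> 150.0
print(effective_throughput(150.0, 50.0, 0.0))  # all data remote -> 50.0
```

In this simple picture, throughput degrades linearly as the locally stored fraction shrinks, which is why data-aware placement of tasks matters at scale.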

© 2016 The Authors

Typically, the data storage media (hard disk drives or solid state drives) are directly attached to each computing node, which allows data to be transferred to CPUs at high speed without passing over a network. This configuration topology is known as direct-attached storage (DAS). Alternatively, a bank of hard drives can be located on each server rack. The computing nodes within each rack then share the local bank of disks, using a local high-performance network. Depending on the implementation details, this configuration is known as network-attached storage (NAS), a storage area network (SAN), or a hybrid of NAS and SAN.

In order to make sure that each node in a datacenter primarily analyses data stored locally, rather than needing to first transfer data over a cluster-scale network, special software frameworks have been designed. These frameworks are commonly referred to as "Big Data" frameworks. Big Data frameworks share three key features:

(i) They are aware of data placement. Most analysis is performed on machines which can read data locally; instructions are executed on nodes that contain relevant data.
(ii) They are fault tolerant. Since datacenters often contain thousands of nodes, hardware failures are frequent occurrences. Big Data tools automatically create redundant copies of the data stored in the datacenter so that hardware failures do not result in data loss. Additionally, tasks that are lost during a node failure are automatically rescheduled and performed on healthy nodes.
(iii) They provide an abstraction layer on top of the datacenter hardware. The typical user of a Big Data framework does not need to be aware of the details of the underlying hardware.

Big Data frameworks also often make use of new parallel programming models that efficiently use datacenter hardware. The prime example of this is the MapReduce model (Dean & Ghemawat 2008), which allows certain data analysis jobs to be automatically divided into many independent tasks and performed in parallel on the nodes of the datacenter (the map step). The results of the map step are then automatically combined into a final result (the reduce step).

Most existing Big Data frameworks require the user to have more software engineering expertise than most astronomers possess. The language of choice for most of the frameworks is Java, which is not a popular language among astronomers. These frameworks are also most easily used with text-based data. Reading and writing binary file formats, common in scientific research, requires extra effort and knowledge. Furthermore, the ability to use existing analysis software with these frameworks is limited. Software usually needs to be aware of the framework in some way in order to work properly. For a more detailed discussion of popular Big Data tools, refer to Appendix B.

1.3 This work

In this paper, we describe NebulOS1, a Big Data framework designed to allow a datacenter to be used in a seamless way by hiding the complexities of the datacenter's hardware architecture from the user. NebulOS allows users to launch any pre-existing command-line driven program on a datacenter; programs do not need to be made aware of NebulOS, or the datacenter's architecture, in order to work. Any data format can be read exactly as on a regular Linux-based workstation without extra effort. A Python module is provided, allowing scientists to use the system interactively or from a script, and a C++ library is also available.

In section 2, we describe the general architectural ideas and specific implementation details. Basic usage examples are included in section 3. We discuss system performance in section 4. Finally, in section 5, we summarise NebulOS' strengths and limitations and discuss planned work.

1 http://bitbucket.org/nebulos-project/

2 NebulOS ARCHITECTURE

The NebulOS platform consists of a distributed kernel, a distributed file system, and an application framework. The kernel and file system are industry-standard, open-source projects, managed by the Apache Software Foundation: Apache Mesos (Hindman et al. 2011) and the Hadoop Distributed File System (HDFS) (Shvachko et al. 2010). The HDFS is mounted on each node of the system, using FUSE-DFS (a tool included in the Hadoop software distribution), so that applications can access the HDFS like a local file system. Using industry-standard tools for the core of the platform allows us to focus our development efforts on the application framework.

The application framework provides a simple, intuitive interface for launching and managing programs on the distributed system. The user only needs to be familiar with the basic usage of a Python interpreter. No experience with distributed computing or multithreading is necessary. The framework automatically schedules tasks so that, whenever possible, execution occurs on CPUs that have local access to the data being used. This minimizes network traffic and maximizes the rate at which data can be read. Tasks that are lost due to hardware problems, such as network interface card failure or motherboard failure, are automatically rescheduled on other machines. The framework also provides the user with a means of automatically monitoring each task that runs on the distributed system; a user-defined task monitoring script can be executed at regular intervals to examine the detailed behaviour of each task.

In the remainder of this section, we describe the three main components of the NebulOS platform. The application framework is described in greatest detail because it is the component which makes NebulOS unique.

2.1 Distributed resource management with Apache Mesos

We chose Apache Mesos as NebulOS' distributed resource manager because of its scalability, resiliency, and generality. Mesos is a distributed operating system kernel that manages resources on a cluster of networked machines (e.g., a datacenter). Applications developed to run on top of Mesos are called frameworks. When the Mesos kernel detects that computational resources have become available on the cluster, it offers these resources to a framework. The framework is then responsible for selecting tasks to run on the resources offered by the kernel. A framework may also decline resource offers


Figure 1. An overview of the NebulOS architecture. Each worker node runs a Mesos Slave daemon and an HDFS DataNode daemon. A Mesos Master daemon and HDFS NameNode daemon each runs on a dedicated node. The scheduler of the NebulOS application framework runs on a login node.

Figure 2. Mesos architecture overview. The Mesos slave daemon runs on each worker node and communicates directly with framework executors. Note that multiple framework executors can share a single slave node. Each slave daemon communicates with the Mesos Master node via a network. Framework schedulers communicate with the Mesos Master. Optionally, the system can be configured so that backup master nodes can take over for the active master node, in the case of hardware failure.
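The offer cycle summarized in the Figure 2 caption can be sketched in a few lines of Python (a toy model of our own, not the actual Mesos API): the master offers a host's free CPUs to a framework's scheduler, which either launches a queued task on the offer or declines it.

```python
from collections import deque

class ToyFramework:
    """Toy stand-in for a Mesos framework scheduler with a FIFO task queue."""
    def __init__(self, tasks):
        self.queue = deque(tasks)   # entries: (task_name, cpus_needed)
        self.launched = []          # entries: (task_name, host)

    def on_offer(self, host, cpus):
        """Accept the offer if the next queued task fits; otherwise decline."""
        if self.queue and self.queue[0][1] <= cpus:
            task, _ = self.queue.popleft()
            self.launched.append((task, host))
            return True   # offer accepted; task launched on this host
        return False      # offer declined; the master may offer it elsewhere

fw = ToyFramework([("t1", 2), ("t2", 4)])
fw.on_offer("node0", 2)   # t1 launched on node0
fw.on_offer("node1", 2)   # declined: t2 needs 4 CPUs
fw.on_offer("node2", 8)   # t2 launched on node2
```

A real framework would also handle status updates and partial resource accounting; this sketch only shows the accept/decline decision that the text describes.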

that are not desired, in which case, the resources may be offered to another framework. Mesos allocates resources to frameworks with a high degree of granularity; the smallest allocatable resource unit is a single CPU thread. This makes it possible for tasks from multiple frameworks to share a single machine. Consequently, Mesos allows the computer hardware in a datacenter to be used efficiently, since there is no need to partition the datacenter into application-specific sections.

2.1.1 Mesos architecture

Mesos consists of two components: a master daemon and a slave daemon. The master runs on a master node of the system and an instance of the slave runs on each worker node. The master is responsible for global resource management and system monitoring, while the slaves are responsible for managing resources on individual nodes.

Each Mesos framework consists of two components, corresponding to the master-slave pair: a scheduler and an executor. The scheduler handles resource offers and other information provided by the master, such as task status update messages. The executors are responsible for performing the tasks assigned to them by the scheduler and providing their local slave daemon with task status updates. Refer to Figure 2 for a schematic overview of the Mesos architecture. For a discussion of Mesos' fault tolerance features, refer to Appendix A.

2.2 Data locality awareness with HDFS

HDFS is a robust, distributed file system, inspired by the Google File System architecture (Ghemawat et al. 2003). Files stored in HDFS are broken into blocks, which are then replicated on multiple machines, so that the failure of any individual hard drive or host machine does not result in data loss.

Suppose a user specifies that the block size for a particular file is 128 MB and that the block replication factor for the file is three. If the file contains 500 MB of data, it would be broken into four blocks and the HDFS would contain three copies of each block. The file could be distributed across as many as 12 separate machines. When the file is later accessed, as many as 12 machines could potentially send data to the machine that is accessing the file. If NebulOS is used to launch a program that reads this particular file, the program would automatically be launched on the machine that contains the largest fraction of the file's data.

2.2.1 HDFS architecture

HDFS consists of two components: a NameNode daemon and a DataNode daemon. These are, in many ways, analogous to Mesos' master and slave daemons. The NameNode stores the directory tree of the file system, tracks the physical location of each block, and maps file names to file blocks. It also ensures that each block is replicated as many times as specified by the user. Each DataNode daemon manages blocks of data, which are stored on the local file system of the host machine. DataNodes are responsible for performing actions requested by the NameNode. Data can be transferred between DataNodes (in order to create copies) and directly between a DataNode and client software, as shown in Figure 3.

To read data from the HDFS, client software first communicates with the NameNode, which provides the identities of the DataNodes containing the blocks of interest. The client can then retrieve the blocks of data directly from the relevant DataNodes. Transferring data to the HDFS proceeds similarly; the client communicates with the NameNode to determine which DataNodes will contain blocks of the file. The data is then transferred directly to the selected


DataNode that is nearest to the client, in terms of network distance. The nearest DataNode forwards packets of data to the second-closest DataNode that was selected to contain a replica of the current block. This process continues until the packet has been sent to all of the selected DataNodes.

We note that the HDFS is not a fully POSIX-compliant file system. For instance, once data has been written to a file, it cannot be modified. Files can, however, be appended with new data and they can be deleted and replaced with new files with the same names as the old files.

Figure 3. HDFS architecture overview. Dashed lines indicate the transfer of metadata and instructions, while solid lines indicate block data transfer. Client software sends instructions to, and obtains metadata from, the NameNode. Each DataNode receives instructions and metadata from the NameNode. File block data is transferred among DataNodes and between client software and DataNodes. Optionally, a backup NameNode can be configured so that the system can survive a NameNode failure.

Figure 4. An overview of the NebulOS application framework architecture. The Mesos Master communicates with the Mesos Slave daemon, running on each worker node. The NebulOS scheduler interacts with the Mesos Master daemon, using the Mesos scheduler driver. Similarly, each NebulOS executor interacts with its local Mesos Slave, using the Mesos executor driver. The executors run tasks assigned by the scheduler and send status updates to the scheduler by way of the Mesos Slave and Mesos Master daemons. Optionally, the executor can attach a task monitor to each task.

2.2.2 Standard file access with FUSE-DFS

The HDFS must be accessed using a program that is aware of the HDFS interface. In order to allow pre-existing software to access the HDFS without being modified, we use the FUSE-DFS utility, which is part of the Hadoop software project. FUSE-DFS is used to mount HDFS as a local file system on each node of the cluster. Any software can then access the HDFS as though it were an ordinary directory on the local file system. Thus, the user does not need to modify their software in order to take advantage of the features offered by NebulOS. However, FUSE-DFS imposes constraints on the usage patterns. Most importantly, there is no support for appending data to a file. The user is not presented with an error message when trying to append data to a file. Thus, the user must be careful to not confuse the mounted HDFS with a regular file system.

2.3 The NebulOS application framework

The NebulOS application framework is a custom Mesos framework that allows users to launch arbitrary software on a cluster. The user simply specifies the commands that they wish to execute on the cluster and the framework takes care of distributing and scheduling the tasks over the individual computing nodes. A Python module and C++ library are available so that the framework can be used interactively, via a script, or via a C++ application. The framework inherits the fault tolerance offered by Mesos and HDFS and adds an extra layer of fault tolerance, in the form of task monitoring scripts. Refer to Figure 4 for a graphical summary of the communication within the framework.

2.3.1 The NebulOS schedulers

The primary responsibility of the scheduler is to assign commands to appropriate host machines. NebulOS provides two schedulers: a chronological scheduler, which simply launches tasks in the same order in which they are provided by the user, and a DFS-aware scheduler, which schedules tasks based on HDFS data placement. The operation of the chronological scheduler is trivial; whenever it receives a resource offer, it accepts the offer and launches the next task in its internal FIFO queue of tasks.

The DFS-aware scheduler is more complex. Each command assigned to the DFS-aware scheduler is first inspected for the presence of HDFS file names. The scheduler then identifies which host machines contain data used by each command. When the Mesos Master offers resources to the DFS-aware scheduler, the scheduler checks to see which not-yet-launched tasks involve data on each host listed in the offer. Tasks are then assigned to the appropriate hosts. If the


scheduler is offered resources on a host that does not contain any blocks of relevant data, the offer is declined. Resource offers involving non-optimal hosts are only accepted if the scheduler has been unsuccessful in requesting resources on a more suitable host. Typically, resource requests are only unsuccessful if the appropriate hosts are being used by another framework. When the scheduler decides that a non-optimal host must be used, it preferentially launches tasks which read the smallest amount of data on these non-optimal hosts. This reduces network traffic and allows the scheduler to wait for more desirable hosts to become available without wasting time.

Decisions regarding which task to launch next are made with the aid of priority queues, as illustrated in Figure 5. For each worker node in the cluster, the DFS-aware scheduler maintains a priority queue which returns task information based on the fraction of the task's data that is located on the associated node. We call this fraction the residence fraction of the task. Tasks that involve reading files that are entirely stored on the host (i.e., tasks with a residence fraction of 1.0) are of highest priority, while files with the smallest residence fraction on the host are of lowest priority. This ensures that the scheduler efficiently takes advantage of resources when they become available. When no optimal hosts are available, the scheduler launches tasks which involve reading the smallest amount of data. In order to do this, the scheduler maintains an additional queue (not shown in Figure 5) that prioritizes tasks based on the amount of data that they need to read; tasks which read the smallest amounts of data are of highest priority in this queue. All of the priority queues are efficiently synchronized so that no tasks are launched twice. Information about all unfinished tasks is stored in a list; the queues only contain references to tasks in this list. When task information is returned from a priority queue, the scheduler checks the status of the task. If the task has already been selected for launch, the scheduler requests more tasks from the queue until it either encounters a task that has not been previously selected or the queue is empty.

Whenever the status of a task changes, the Mesos Master informs the scheduler of the change by sending a task status update message. The scheduler then makes a note of the change and takes appropriate actions. For instance, if a task status message indicates that a task has been lost, the scheduler assigns the task to a new host. If the status message indicates that a task has completed, the content of the message is parsed for extra details. For instance, the message may contain results of a computation performed by the task. The task status message may also indicate that a particular task was killed by the executor. In this case, the message may contain a list of new commands that should be launched on the cluster to replace the task that was killed.

Figure 5. Data structures used by the DFS-aware scheduler. All task information is stored in an array, labelled as "Task List." Gray shading indicates tasks which have already been launched, while white indicates that the tasks have not yet launched. For each worker node, the scheduler creates a priority queue, which is populated with references to tasks that use data stored on the associated node. The references in the priority queue for node Node 1 are shown explicitly, as dotted lines. Suppose the Mesos Master offers resources for three tasks on Node 1. The scheduler will check whether tasks 9, 2, and 1 have launched, by checking the task list. Since task 2 has already launched, the scheduler will continue popping elements off of the queue until it finds a task that has not been launched. Note that, during very long-running jobs, the scheduler periodically re-builds the task list and stores information about completed tasks in a database on the local disk.

2.3.2 The NebulOS executor

The executor is responsible for launching individual tasks on its host machine. It also sends status update messages to the Mesos slave daemon, which then forwards the messages to the master. The executor allows the standard output and error streams of each child task to be redirected. For instance, the standard output can be saved to a file or stored in a status update message. Note that the latter option is only practical for tasks that send a small amount of output to the standard output stream. This is because status update messages are transferred to the master; sending large amounts of data to the master would result in a communication bottleneck.

The executor can also execute a user-defined task monitoring script at regular intervals. The monitoring script is provided with the content of the standard output and error streams and the process identification number of the task that it is monitoring. If the script detects that the task is behaving in an undesirable way, it can instruct the executor to terminate the task. Instructions for re-starting terminated tasks can also be provided by defining a re-launch script. For example, if the user knows that the phrase "using defaults instead" in a particular application's standard error stream indicates that a non-fatal error has occurred, the user could write a script which searches the standard error stream for that particular phrase. When this phrase is encountered, the script can instruct the framework to terminate the task. If the user has defined a re-launch script, the executor then runs this script immediately after terminating the task. The re-launch script can be used to construct one or more new commands, which are then sent to the scheduler. For instance, if the "using defaults" message depends on the input parameters of the code of interest, the re-launch script could

be designed to slightly alter the input parameters of the task that was terminated.

3 NebulOS USE CASES

In this section, we illustrate the basic usage of NebulOS by discussing a few examples. All examples use the NebulOS Python interface.

Example I: Batch processing of images

The most basic NebulOS use case involves using the framework as a batch processor. Suppose a large number of images, stored in a datacenter, need to be processed using a program called img_proc, which requires two arguments: an output directory and an input filename. The img_proc program processes its input image, saves the resulting image to the specified output directory, and writes statistics about the operation to the standard output stream. In order to use img_proc with NebulOS to process a directory full of images, the user places the img_proc program in the HDFS file system, opens a Python interpreter, and types the following commands:

from nebulos import Processor
job = Processor()
results = job.glob_batch("/dfs/img_proc /dfs/out/ %f%",
                         "/dfs/images/*")

The first statement loads the NebulOS Processor class into the current namespace. This class provides an interface to the NebulOS scheduler. Since the import statement is always present, it will be omitted from subsequent examples.

The second statement creates a Processor object, named job, using default parameters to set up a batch job. On the third line, the Processor's glob_batch() method is used to construct a list of commands, which are then submitted to the NebulOS scheduler. The second argument of glob_batch() is a filename pattern, containing one or more wildcard characters (asterisks). All files within the /dfs/images/ directory will be matched. The first argument, which specifies the command to be launched, contains a file name placeholder, %f%. Suppose the directory /dfs/images/ contains files named img_00000, img_00001, img_00002, ..., img_50000. In this case, the glob_batch() method would submit the following commands to the scheduler:

/dfs/img_proc /dfs/out/ img_00000
/dfs/img_proc /dfs/out/ img_00001
/dfs/img_proc /dfs/out/ img_00002
...
/dfs/img_proc /dfs/out/ img_50000

By default, the standard output of each task is sent to the scheduler and is returned by the glob_batch() method as a Python list. Thus, results is a Python list containing the standard output of each task.

Note that the placeholder %f% indicates that the file is located in the HDFS, so the DFS-aware scheduler will be used in this case. There is another placeholder, %c%, which can be used for arbitrary parameters. If the %f% in this example had been replaced with %c%, the chronological scheduler would have been used, instead of the DFS-aware scheduler.

Example II: Interactive data analysis

The glob_batch() method, used in Example I, causes the current thread to be blocked; the user cannot work interactively with the batch job until all tasks have completed. In order to enable interactive data analysis and task management, a streaming mode of operation is available. In this mode, the user can inspect the status of each task, retrieve the output of completed tasks, add new tasks to the scheduler, and cancel specific tasks before the entire job has finished. This type of interactive usage is demonstrated in Example II (Figure 6).

job = Processor(threads_per_task=4)
# add some tasks:
job.glob_stream("/dfs/img_proc2 /dfs/out/ %f%",
                "/dfs/images/*")
# how many tasks have not completed?:
job.count_unfinished_tasks()
# get results from completed tasks:
results = job.get_results()
# add more tasks:
job.glob_stream("/dfs/img_proc2 %f%",
                "/dfs/more_images/*")
# block the thread until all tasks have completed:
job.wait()
# append remaining results to the list:
results += job.get_results()

Figure 6. Example II.

Example II begins by providing the Processor's constructor with a non-default parameter, specifying that each task will be allocated four CPU threads. The glob_stream() method on the second line works the same way as the glob_batch() method, discussed previously, except that glob_stream() does not block the thread. This makes it possible to retrieve a partial list of results, add more tasks to the job, and perform various other operations while tasks are still running on the cluster. The third command requests the number of tasks that have not yet finished executing, a useful indicator of the job's progress. With the fourth command, the standard output of each completed task is stored in a Python list, called results; the standard output of the tasks can be examined at this point. The fifth command assigns more tasks to the job. Calling job.wait() causes the thread to be blocked until all tasks have completed. Finally, we obtain the remaining results by calling get_results() once more. Note that individual results are only returned once by get_results(). Thus, the second invocation of get_results() only returns the output from tasks that completed after the first invocation of get_results().
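The command construction performed by glob_batch() can be mimicked with a few lines of plain Python (expand_commands below is an illustrative stand-in of our own, not part of NebulOS): match the wildcard pattern against the available file names, then substitute each match for the %f% placeholder.

```python
import fnmatch
import os

def expand_commands(template, pattern, filenames):
    """Build one command per file matching the wildcard pattern, substituting
    the matched file's name for the %f% placeholder, as in Example I."""
    matched = sorted(f for f in filenames if fnmatch.fnmatch(f, pattern))
    return [template.replace("%f%", os.path.basename(f)) for f in matched]

files = ["/dfs/images/img_00000", "/dfs/images/img_00001", "/dfs/notes.txt"]
print(expand_commands("/dfs/img_proc /dfs/out/ %f%", "/dfs/images/*", files))
# ['/dfs/img_proc /dfs/out/ img_00000', '/dfs/img_proc /dfs/out/ img_00001']
```

The real glob_batch() additionally submits the commands to the scheduler and, because %f% marks an HDFS file, records which hosts store each file's blocks.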

Example III: Simple MapReduce implementation

In streaming mode, multiple NebulOS schedulers can operate simultaneously and it is possible for the schedulers to interact. In this example, a version of MapReduce is implemented using two interacting schedulers.

Suppose a program called map analyses a single input file and saves the results of its analysis in a new file, whose file name is written to the standard output stream. Another program, called reduce, reads two files containing the output of the map program. It then summarizes the contents of the input files and saves the summary to a file whose name is written to the standard output. These map and reduce programs are used together in Example III (Figure 7) to obtain a single file which summarizes the contents of a directory full of input files.

    mapper = Processor(threads_per_task=4, name="mapper")
    reducer = Processor(threads_per_task=2, name="reducer")

    mapper.glob_stream("/dfs/map %f%", "/dfs/input_data/*")

    mapped_files = []

    while (mapper.count_unfinished_tasks() or
           reducer.count_unfinished_tasks() or
           len(mapped_files) > 1):
        mapped_files += mapper.get_results()
        if len(mapped_files) >= 2:
            reduction_inputs = [[mapped_files.pop(0)
                                 for i in [0, 1]]]
            reducer.template_stream("/dfs/reduce %f% %f%",
                                    reduction_inputs)
        mapped_files += reducer.get_results()

Figure 7. Example III

The example begins with the creation of two Processor objects, named mapper and reducer, which will be used to run the map and reduce programs, respectively. The mapper tasks are allocated twice as many threads as the reducer tasks because the map program requires more computational power than the reduce program. We have introduced the name parameter in the Processor constructor, which allows us to easily distinguish tasks belonging to different jobs. This allows us to easily distinguish between the mapper and reducer if we need to inspect the Mesos logs.

The glob_stream() method is then used to assign tasks to the mapper. As tasks from the mapper are completed, the results are used to assign new tasks to the reducer. Results from the reducer are recursively combined until only one output file name remains in the mapped_files list.

The template_stream() method, used by the reducer, constructs commands by substituting instances of the parameter placeholders in its first argument with entries of the Python list in its second argument. Multiple parameters can be specified by using a nested list as the second parameter. Suppose the second argument is the following nested list:

    [["/dfs/temp/file_1", "/dfs/temp/file_2"],
     ["/dfs/temp/file_3", "/dfs/temp/file_4"],
     ["/dfs/temp/file_5", "/dfs/temp/file_6"]]

The template_stream() method would then submit the following commands to the scheduler:

    /dfs/reduce /dfs/temp/file_1 /dfs/temp/file_2
    /dfs/reduce /dfs/temp/file_3 /dfs/temp/file_4
    /dfs/reduce /dfs/temp/file_5 /dfs/temp/file_6

Although it is not the most efficient implementation, this example hints at the ease with which MapReduce can be implemented using NebulOS. Note that the mapper and reducer streams are executed simultaneously and, because of the granularity offered by Mesos, tasks belonging to the mapper stream can be executed on the same host as tasks from the reducer stream.

4 PERFORMANCE CHARACTERISTICS

The performance achieved by NebulOS obviously depends upon the speed of the hardware on which it is running. Thus, in our performance analysis, we focused on identifying how the performance of NebulOS varies as the number of files and worker nodes increases. In the following analysis, we provide analytic estimates for the performance scaling, when possible, so that the reader can estimate the performance of NebulOS on arbitrary hardware configurations.

4.1 Methodology

We were primarily interested in determining the speed with which data, already present in the HDFS, could be analysed by software launched by NebulOS. We were also interested in the total overhead time required for NebulOS to launch tasks and retrieve the standard output of the tasks. The data reading speed, R, was determined using

    R = dataset size / total elapsed time,   (1)

where the denominator is the total time that elapsed between submitting the commands to the NebulOS framework and the arrival of the tasks' standard output at the Python interpreter. The dataset consisted of a set of galaxy simulation snapshot files, produced by an N-body galaxy simulation code. The files were dense binary files which did not benefit significantly from the automatic file compression employed by HDFS. The files were nearly uniform in size, with a mean size of 110.8 MB and a dispersion of 0.4 MB.

Each file in the dataset was analysed by a C++ program, called reader, which performed a trivial I/O-bound task. Specifically, reader interpreted each file as a list of 32-bit integers and counted how many of those integers were less than 2 × 10^6. The performance of reader was limited only by the speed of the data storage medium from which the file was being read. The reader program was installed on a local hard drive of each node in the cluster so that it would not need to be read from the HDFS. The snapshot file paths (file names) were stored in a Python list and we measured the time required for the NebulOS framework's template_batch() method to complete. For example, to measure the time required to read 1024 files using the DFS-aware scheduler, we measured the total time required for the following command

    output = job.template_batch("reader %f%", files[:1024])

to complete, where the variable files is a list containing all of our snapshot file names.

In order to determine the task-launching overhead introduced by using Mesos and the NebulOS framework to launch tasks on the nodes of the cluster, we launched each task exactly as described above, but, instead of actually reading the files, we only printed their names to standard output:

    output = job.template_batch("echo %f%", files[:1024])

where echo is a standard utility which prints its argument to the standard output stream.

4.2 Hardware configurations

In our tests, we used two distinct hardware configurations, which we will refer to as "MicroBlade" and "EC2." In both configurations, the roles of login node, HDFS NameNode, and Mesos Master were all performed by a single master node. The operating system used by all nodes was Ubuntu Linux 14.04.4, with the Linux 3.13-generic kernel. The HDFS block size was set to 128 MiB (≈ 134 MB), which means that the files that we used during most of our tests were single-block files.

MicroBlade cluster configuration

The MicroBlade cluster consisted of eight worker nodes, running Mesos version 0.20.0 and HDFS version 2.4.1. Each worker node was powered by a 4-core (8 simultaneous thread) Intel Xeon E3-1230 CPU, running at 3.3 GHz and containing 32 GiB of DDR3 RAM, running at 1,600 MHz. The hard drive configuration was heterogeneous; all systems contained two hard drives, but half contained a total of 7 TB of raw storage while the others contained 5 TB. The HDFS block replication factor was set to 2. Furthermore, the HDFS on the MicroBlade cluster was 89% full.

The master node was powered by an Intel Core i7-4770K CPU running at 3.5 GHz with 32 GiB of DDR3 RAM, running at 2,133 MHz. The cluster's network interconnect was Gigabit Ethernet, with one active Ethernet port per worker node (i.e., the HDFS and Mesos traffic shared the same Ethernet port). The maximum network throughput achieved by each individual link in the network was 942±1 Mbit s−1 (approximately 118 MB s−1). The network ping time between the name node and the worker nodes was 0.18±0.02 ms. The operating system on all nodes was installed directly on the hardware, rather than being installed in a virtual machine.

EC2 cluster configuration

The EC2 clusters consisted of Xen-based virtual machine instances running in Amazon's Elastic Compute Cloud (EC2). Due to the way that the cloud service is designed, some of the worker nodes potentially shared the same physical machines. The EC2 virtual machine instances were placed into what Amazon refers to as a "placement group," in order to ensure that the network throughput and latency were optimal. The mean network ping time between machines was 0.21±0.02 ms and the mean network throughput between worker nodes was 860 ± 12 Mbit s−1. Mesos version 0.22.0 and HDFS version 2.6.0 were used for these clusters. We used a block replication factor of 3 on the EC2 clusters. The only data stored in the HDFS was our test dataset; thus, the file systems were nearly empty.

The worker nodes used d2.xlarge EC2 instances, which were powered by 2 Xeon E5-2676 CPU cores (with 4 simultaneous threads) running at 2.4 GHz with 30.5 GiB of RAM. Amazon Web Services did not disclose the exact type of RAM used by their instances, but the measured memory throughput of the EC2 systems was approximately 88% as high as that measured in the MicroBlade worker nodes. Each worker node contained 3 hard drives for data storage, each with a capacity of 2 TB. The data drives on each node were combined into a RAID-0 array, using the software RAID functionality built into the B-tree file system (Btrfs). RAID-0 was used because it improved the performance when reading individual HDFS blocks.

For the master node, we used a c3.8xlarge instance, which was powered by 32 Xeon E5-2680 cores (64 simultaneous threads), running at 2.80 GHz. The master contained 60 GiB of RAM and its network interface was capable of approximately 10 Gbit s−1.

4.3 Chronological versus DFS-aware schedulers

Recall that the chronological scheduler launches tasks in the same order in which they are submitted to the NebulOS framework; tasks are launched as quickly as possible, but the physical location of the data being read by the tasks is ignored. On the other hand, the DFS-aware scheduler takes data locality into account when launching tasks, so that tasks can be launched on machines that contain the data. Figure 8 shows a comparison of the overhead required by these two schedulers on a cluster with 8 worker nodes, when allocating 4 simultaneous tasks per node. The DFS-aware scheduler requires more time to launch tasks because it first has to request file location data from the HDFS NameNode.

Although the DFS-aware scheduler requires more overhead, it leads to a higher data reading speed once more than a handful of tasks are being launched, because it allows task software to read most of the required data directly from disk, rather than transferring the data over the network. This is demonstrated in Figure 9. The performance of the DFS-aware scheduler also tends to be more consistent than that of the chronological scheduler. Note that the DFS-aware scheduler can also take advantage of faster hard drives (or additional hard drives) more effectively than the chronological scheduler, which is primarily limited by network throughput.

Data caching

The current version of NebulOS does not have a built-in mechanism for caching data in memory for faster performance when the same files need to be read repeatedly. However, since the Linux kernel automatically caches data that is read from disks, NebulOS is able to read at a somewhat faster rate when a set of files is repeatedly accessed. The DFS-aware scheduler is able to take advantage of the Linux kernel's automatic caching more effectively than the chronological scheduler because individual nodes are more likely to read the same files repeatedly when the DFS-aware scheduler is used.
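The essential difference between the two scheduling policies can be sketched in a few lines of Python. The assign() function below is a toy model, not NebulOS code: given a map from each file to the nodes holding its block replicas, it places a task on a data-local node when one is known and falls back to round-robin placement (the chronological behaviour) otherwise:

```python
from itertools import cycle

def assign(files, block_locations, nodes):
    """Toy DFS-aware assignment: prefer a node that holds the data.

    block_locations maps a file name to the set of nodes storing its
    block replicas; files with no known location are assigned
    round-robin, as a purely chronological scheduler would do.
    """
    fallback = cycle(nodes)
    placement = {}
    for f in files:
        local = block_locations.get(f, set())
        # min() just makes the choice among replicas deterministic.
        placement[f] = min(local) if local else next(fallback)
    return placement

locations = {"img_00000": {"node2"},
             "img_00001": {"node1", "node3"}}
plan = assign(["img_00000", "img_00001", "img_00002"],
              locations, ["node1", "node2", "node3"])
# img_00000 is placed on node2, where its block resides; img_00002
# has no recorded replica, so it falls back to round-robin.
```

The real DFS-aware scheduler obtains block_locations by querying the HDFS NameNode, which is exactly the extra round trip responsible for its higher launch overhead.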


Figure 8. The total batch launching overhead on the MicroBlade system with 8 worker nodes. The DFS-aware scheduler (red curve) requires more overhead than the chronological scheduler (black curve), due primarily to the fact that the DFS-aware scheduler communicates with the HDFS NameNode before launching tasks. Shaded regions indicate the 1-σ scatter in overhead time.

Figure 9. A comparison of the read speed offered by the DFS-aware (red curves) and chronological (black curves) schedulers on the MicroBlade system. Solid lines indicate the performance when the caches are clean and the files are being read for the first time. Dashed lines indicate the performance when the same files are read 6 times without dropping the Linux kernel's automatic caching between trials. The shaded regions indicate the 1-σ scatter in the reading speed. Note that the DFS-aware scheduler generally outperforms the chronological scheduler. Also note that the benefit of automatic caching diminishes once the volume of data being read exceeds the size of the cache.

This performance advantage is clearly visible in Figure 9. Once the amount of data being read exceeds the cache size, the advantage of caching vanishes.

4.4 Varied cluster size

The main point of building a cluster for data analysis is to improve the speed with which data can be analysed. Ideally, the speed of the cluster should scale linearly as the number of computers increases; N identical computers should be able to process data N times faster than a single computer. This ideal is typically not achieved, however. Even in the case of an embarrassingly parallel data processing task, there is typically some sort of overhead involved in setting up a parallel processing task on a cluster. For instance, the data first needs to be divided up among the individual computers in a cluster. Then the data and instructions need to be sent to the nodes, via a network, and the results of the tasks need to be collected. In this section, we explore how NebulOS scales as the number of worker nodes increases. Since we know that the DFS-aware scheduler performs better than the chronological scheduler, the analysis in this section only involves the DFS-aware scheduler.

4.4.1 Overhead

When data is stored in a distributed file system, such as HDFS, the overhead involved in distributing the data among the various nodes is reduced, particularly when data locality is exploited to minimize network traffic. However, there is still overhead; the tasks need to be launched and the results need to be gathered. In Figures 10 and 11, we present the overhead time, measured as described in Section 4.1, using EC2 clusters of various sizes, running 4 simultaneous tasks per node. We found that the scaling of the overhead time (Toverhead) was well-modelled by the relation

    Toverhead ≈ τ0 + τ1 Nnodes + τ2 Ncycles,   (2)

where Nnodes is the number of worker nodes and τ0, τ1, and τ2 are time constants, which depend upon the speed of the machines, the speed of the network, and details of the software configuration. The mean number of batch cycles per worker node is

    Ncycles = Ntasks / (ν Nnodes),   (3)

where Ntasks is the total number of tasks submitted and ν is the number of simultaneous tasks per worker node. For example, if the CPU(s) on each worker node of a cluster can execute 8 threads simultaneously and the user specifies that each task should be allocated 2 threads (by setting threads_per_task=2), then NebulOS will execute 8/2 = 4 tasks on each node simultaneously (i.e., ν = 4). If the user then submitted 256 tasks to a cluster with 8 worker nodes, then each worker node would execute 256/8 = 32 tasks, on average, under the assumption that all tasks require approximately the same amount of time to complete. Since each worker node can process 4 tasks simultaneously, each node has to go through 32/4 = 8 cycles. The time constant, τ2, is the average overhead required for each cycle.

From Eq. 2 and Figures 10 and 11, we see that the overhead time associated with launching a small number of tasks (i.e., Ntasks ≲ νNnodes) increases as the number of nodes increases.
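The overhead model of Eqs. 2 and 3 can be written directly in Python; the time constants in the sketch below are placeholders rather than fitted values, and the worked check reproduces the ν = 4 example from the text:

```python
# Overhead model of Eqs. 2 and 3.
def n_cycles(n_tasks, n_nodes, nu):
    """Mean number of batch cycles per worker node (Eq. 3)."""
    return n_tasks / (nu * n_nodes)

def overhead(n_tasks, n_nodes, nu, tau0, tau1, tau2):
    """Total launch overhead T_overhead (Eq. 2)."""
    return tau0 + tau1 * n_nodes + tau2 * n_cycles(n_tasks, n_nodes, nu)

# Worked example from the text: 8 threads per node and 2 threads per
# task give nu = 4; 256 tasks on 8 nodes means 8 cycles per node.
assert n_cycles(256, 8, nu=4) == 8.0
```

Fitting τ0, τ1, and τ2 to measured overhead times on a given cluster then allows the curves in Figures 10 and 11 to be reproduced for arbitrary task counts.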


Figure 10. The total overhead time as a function of cluster size when launching various numbers of tasks. The error bars show actual measurements using the EC2 hardware configuration with four simultaneous tasks per node (i.e., ν = 4), while the curves show the overhead estimated using Eq. 2, with appropriately-tuned time constants. The overhead for launching a small number of tasks increases as the cluster grows, but for large numbers of tasks, the overhead decreases as the cluster grows. Here, "small" and "large" are relative to the number of tasks that can simultaneously run on the cluster.

Figure 11. The total overhead time as a function of the number of tasks launched for various cluster sizes. The error bars show actual measurements using the EC2 hardware configuration with four simultaneous tasks per node (i.e., ν = 4), while the curves show the overhead estimated using Eq. 2, with appropriately-tuned time constants. The overhead is roughly constant until the number of tasks exceeds the number of available execution slots in the cluster (i.e., Ntasks > νNnodes); it then grows in proportion to the number of tasks. Note that larger clusters offer lower overhead for large numbers of tasks, but have relatively high overhead for small numbers of tasks.

However, when many tasks (i.e., Ntasks > νNnodes) are submitted to the cluster, larger clusters offer lower overhead than smaller ones. We also see that the overhead is roughly constant until the number of tasks exceeds the number of execution slots available in the cluster.

4.4.2 Read speed

In Figure 12, we present the net data reading rate for EC2 clusters of various sizes. This demonstrates that the net throughput of the cluster scales linearly with the number of worker nodes (up to at least 64 worker nodes), provided that the number of files being read is sufficiently large. Note that NebulOS clusters tend not to reach their maximum sustained reading speed until each node contains ∼ 300 blocks, on average. Below this point, the dataset is not distributed evenly enough among the nodes to reach optimal performance. More generally, suppose the block replication factor is ρ and we are interested in reading single-block files. The dataset then needs to contain ≳ 300 Nnodes/ρ files in order for the data to be sufficiently well distributed across the nodes of the cluster to reach the optimal reading speed.

From Figure 12, we also see that, when the number of files in the dataset is smaller than the number of execution slots (νNnodes) in the cluster, adding more nodes to the cluster tends to negatively impact the performance. This happens because the overhead increases as the cluster grows, while the number of nodes that can read the data remains constant. Note that decreasing ν can help to improve this situation somewhat, by forcing the tasks to be distributed over more nodes.

It is possible to estimate the maximum sustained read speed, Rmax, achieved by NebulOS on a cluster, if we make the following simplifying assumptions:

(i) The dataset consists of single-block files.
(ii) Each task launched on the cluster reads only one file.
(iii) All files are approximately the same size.
(iv) The number of files in the dataset is much larger than the number of nodes in the cluster.
(v) The hardware of the worker nodes is essentially uniform, so that no individual nodes are significantly faster or slower than the mean, and each worker node supports the same number of simultaneous threads.

Under these assumptions, the approximate size of the dataset is D Ntasks, where D is the average size of the files being read. The total time required to read the dataset is then

    Ttotal ≈ Toverhead + D Ntasks / (Nnodes Rnode),   (4)

where Rnode is the average sustained reading speed of a single worker node. The maximum reading speed is

    Rmax ≈ D Ntasks / Ttotal.   (5)

In the limit Ntasks ≫ Nnodes, this becomes

    Rmax ≈ Nnodes / (τ2/(ν D) + 1/Rnode).   (6)
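Eq. 6 is easy to evaluate numerically. The sketch below uses illustrative parameter values (not fitted constants) and checks two qualitative predictions of the model: the read speed grows with the number of nodes, and small files (small D) reduce it:

```python
# Maximum sustained read speed in the limit N_tasks >> N_nodes (Eq. 6).
def r_max(n_nodes, tau2, nu, d, r_node):
    """R_max ~ N_nodes / (tau2/(nu*D) + 1/R_node).

    tau2  : overhead per batch cycle (s)
    nu    : simultaneous tasks per node
    d     : mean file size (MB)
    r_node: sustained per-node read speed (MB/s)
    """
    return n_nodes / (tau2 / (nu * d) + 1.0 / r_node)

# Illustrative values only: 64 nodes, tau2 = 0.1 s, nu = 4,
# 110.8 MB files, 250 MB/s sustained per-node read speed.
speed = r_max(64, tau2=0.1, nu=4, d=110.8, r_node=250.0)

# Small files inflate the tau2/(nu*D) term and lower R_max,
# matching the discussion of Eq. 6 in the text.
assert r_max(64, 0.1, 4, 1.0, 250.0) < speed
```

With the values above the model gives a throughput of roughly Nnodes × Rnode, i.e. overhead is negligible for large files, which is the regime shown on the right-hand side of Figure 12.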


Figure 12. The total read speed for datasets of various sizes, as a function of cluster size when the EC2 system configuration is used. Dashed lines indicate the mean read speed, while the shaded regions indicate the 1-σ scatter. Note that the read speed is proportional to the number of worker nodes, provided that a sufficient number of files is being read. When the number of execution slots available in the system (4Nnodes, in this case) exceeds the number of files in the dataset, the performance no longer increases; it tends to decline slightly as the cluster grows. Also note that the total read speed generally exceeds the total network throughput. For example, the network throughput for each node in the 64-node cluster was roughly 110 MB s−1 while the maximum sustained read speed achieved by each node was approximately 250 MB s−1, even when overhead was taken into account (the raw disk read speed was approximately 420 MB s−1).

Figure 13. The read speed for three datasets of the same size, but distributed over different numbers of files. The first dataset consisted of 1000 single-block files, the second consisted of 500 two-block files, and the third consisted of 250 four-block files. All measurements were made on the MicroBlade cluster. Note that the DFS-aware scheduler consistently outperformed the chronological scheduler and that the dataset consisting of single-block files was read more efficiently than the multi-block datasets.

Eq. 6 agrees well with measurements of real systems and it can be used to guide decisions about optimal usage and configuration settings. For instance, it shows us that reading many small files (small D) negatively impacts the read speed. This happens because the task-launching overhead is more significant when many small files are being read, compared to a smaller number of larger files. Conversely, increasing the HDFS block size on large files will increase the read performance. Launching more simultaneous tasks per node (increasing ν) will also increase the performance, provided that the tasks are I/O bound and not computationally intensive. In general, increasing the speed of the data storage media on the nodes (Rnode) will increase the performance. However, due to overhead, the speed-up factor realized by the cluster is not proportional to the node speed-up factor.

Once the number of worker nodes becomes sufficiently large, the computational speed of the Mesos Master and HDFS NameNode host machines (or the speed of their network interfaces) will prevent the scaling from being purely proportional to the number of worker nodes. For instance, as the number of worker nodes increases, the value of τ2 will eventually increase, due to increased network traffic at the Mesos Master host. This likely only becomes an issue in clusters with more than ∼1,000 nodes with current hardware. Note that Mesos and HDFS both scale well to at least ∼10,000 nodes and we have no reason to suspect that the NebulOS Application Framework would hinder the scalability of the system.

4.5 Multi-block files

In the analysis above, we have only examined the speed with which files consisting of a single HDFS block can be read. In this section, we investigate NebulOS' performance when reading multi-block files on the MicroBlade cluster. Recall that this cluster only had 8 worker nodes. We began by reading a dataset containing 1000 single-block galaxy simulation snapshot files, with a total size of 110.8 GB. We then combined these files into a set of 500 two-block files of the same total size and then combined the 500 files into a set of 250 four-block files. The resulting read speeds achieved by the DFS-aware and chronological schedulers are presented in Figure 13.

Keep in mind that the HDFS block size can be specified on a per-file basis. Thus, with a small amount of effort, the user can typically ensure that even multi-gigabyte files consist of only a single HDFS block. Figure 13 demonstrates why a user may want to take this extra step; the read speed of multi-block files tends to be slower than that of single-block files. The cause of this behaviour is straightforward; when files are broken into blocks, the blocks are distributed across different nodes, so a portion of the data typically needs to be transferred over the network. Whenever possible, the DFS-aware scheduler will try to launch tasks on machines that contain all of the blocks associated with a file. However, this is not always possible. Typically, the DFS-aware scheduler will assign tasks to nodes which contain only a portion of a file's constituent blocks.

This still results in better performance than the chronological scheduler, which relies heavily upon network transfers.

It should also be noted that, as the number of blocks in a file increases, the chance of multiple blocks being stored on the same node increases. The number of nodes that can read a portion of the data from local storage also increases. This effect is more significant for small clusters than large ones.

In our analysis, there was an additional effect that partially compensated for the increased reliance upon the network as the number of blocks increased. Since the number of files decreased as the number of blocks increased, there was less task-launching overhead for the multi-block files. This reduced overhead can be seen from Eqs. 2 and 3; the number of cycles decreased when the number of tasks decreased, which led to lower overhead.

5 DISCUSSION

NebulOS combines the strengths of Big Data tools and classic batch processors. Like a classic batch processor, NebulOS can launch arbitrary software on a cluster. In contrast, most software used by popular Big Data frameworks must be made aware of the framework in some way. The tasks launched by NebulOS can operate on arbitrary data formats with little or no extra effort from the user, whereas most Big Data tools require extra effort for each specific data format that is used. Unlike classic batch schedulers, NebulOS tasks can be scheduled in a data locality-aware manner in order to improve data throughput. New tasks can be incrementally added to a batch job, results can be accessed programmatically while the job is running, and multiple batch jobs can be used in a single analysis routine, either in parallel or sequentially. Additionally, NebulOS provides an automated way to monitor the behaviour of each task and take actions based upon the observed behaviour. This is a feature that no other Big Data tool currently offers.

Since NebulOS is based upon industry-standard tools, it benefits from the efforts of a large community of engineers. This also means that the platform is compatible with many existing tools. For instance, Apache Hadoop MapReduce, Apache Hama, Apache Spark, Apache Storm, TORQUE, and MPICH can all run on the NebulOS platform alongside the NebulOS application framework because all of these tools are compatible with the Apache Mesos kernel. It is even possible to use the NebulOS application framework and Apache Spark together in the same Python script.

When locality-aware distributed file systems, like HDFS, were designed, most datacenters relied on 1 Gbps networks. Modern datacenters use 40 Gbps and 100 Gbps networks. These high-performance networks have alleviated the network bottleneck that motivated the design of locality-aware file systems. At the same time, though, the speed and cost of solid-state drives has improved, which means that locality-aware frameworks, like NebulOS, are still able to outperform systems that do not make use of data locality. Locality-aware systems can also be built in a more cost-effective manner because less expensive networking hardware can be used without significantly reducing the I/O performance.

NebulOS as a cloud service

Cloud computing services allow users to pay for computing resources as needed, rather than building their own clusters. These services are especially useful when computing resources are only needed for short-term projects. Installing NebulOS on cloud services is straightforward, once the constituent parts have been compiled. We have developed a system image and installation scripts for Amazon Web Services, which allows users to build a NebulOS cluster in Amazon's Elastic Compute Cloud in less than 15 minutes. Since the NebulOS source is openly available, users are free to create their own installers for other cloud services.

Although it is useful to treat HDFS as a local file system, as is done in NebulOS, the user must be aware of certain complications that can arise. In particular, since FUSE-DFS does not support appending data to files and HDFS does not allow data to be modified once it is written, programs that depend upon the ability to append and modify data do not work properly without user intervention. This can cause errors that are difficult to diagnose. Also, storing a large number of small files (much smaller than the HDFS block size) unnecessarily burdens the HDFS NameNode. This problem can be alleviated by combining small files into larger files, but this requires extra effort. Dealing with very large files (requiring tens of blocks) is also inefficient because the entire file must be read by each task that uses it. When possible, splitting large files into smaller, independent pieces can help to resolve this problem. However, this also requires extra effort by the user. For maximal throughput, programs sometimes need to directly access the HDFS using the HDFS library. FUSE-DFS also sometimes fails to write large files to the HDFS in a timely manner; the user has to instruct the system to synchronize if a large output file will be read soon after it is written.

When implementing an algorithm that requires the same data to be accessed repeatedly by subsequent tasks, it is beneficial to cache the repeatedly-used data by saving it on a local hard drive or in a RAM disk. This improves performance because there is no need to repeatedly read the same files from the HDFS. In its current form, NebulOS does not provide an optimal method of caching data. The user can launch tasks which store data in a specific location on the local machine; however, the user is then responsible for deleting such files when they are no longer needed. Furthermore, there is no easy way to ensure that subsequent tasks are launched on machines containing cached data.

NebulOS offers no automatic way to facilitate inter-task communication. In order for tasks to communicate, the user must either save files to the HDFS or include network communication capabilities in the task software. This requires extra thought and effort on the part of the user.

Future work

We plan to make the NebulOS application framework even easier to use by providing automated methods for handling additional usage scenarios. We will also work to address the limitations of the NebulOS application framework, discussed above. In particular, we plan to add an easy, automated method for persistently storing certain data on local disks (including RAM disks) for use by subsequent tasks.


Cached data will automatically be cleaned up so that manual intervention is unnecessary. We plan to provide an automated means of enabling inter-task communication so that a broader variety of algorithms can be implemented easily. The NebulOS application framework will rely less heavily upon FUSE-DFS and there will be facilities for automatically improving the performance of tasks that use large numbers of small files, as well as some tasks that make use of very large files. It may eventually be possible to completely replace HDFS with a modified version of CephFS (Weil et al. 2006) that allows client software to query the physical location of data (CephFS does not currently provide this feature).

6 ACKNOWLEDGEMENTS

This research was funded by a Big Data seed grant from the Vice Chancellor for Research and Development, UC Riverside.

REFERENCES

Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/
Apache Hadoop, http://hadoop.apache.org/
Apache Hama, http://hama.apache.org/
Apache Mesos, http://mesos.apache.org/
Apache Spark, http://spark.apache.org/
Apache Storm, http://storm.apache.org/
B-tree file system, https://btrfs.wiki.kernel.org
CephFS, The Distributed Filesystem, http://ceph.com/
Dean J., Ghemawat S., 2008, Communications of the ACM, 51, 107
Ghemawat S., Gobioff H., Leung S.-T., 2003, SIGOPS Oper. Syst. Rev., 37, 29
Hindman B., Konwinski A., Zaharia M., Ghodsi A., Joseph A. D., Katz R., Shenker S., Stoica I., 2011, in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. NSDI '11. USENIX Association, Berkeley, CA, USA, pp 295–308, http://dl.acm.org/citation.cfm?id=1972457.1972488
Hoelzle U., Barroso L. A., 2009, The Datacenter As a Computer: An Introduction to the Design of Warehouse-Scale Machines, 1st edn. Morgan and Claypool Publishers
Ivezic Z., et al., 2008, preprint (arXiv:0805.2366)
Klypin A. A., Trujillo-Gomez S., Primack J., 2011, ApJ, 740, 102
MPICH, http://www.mpich.org/
Shvachko K., Kuang H., Radia S., Chansler R., 2010, in Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). MSST '10. IEEE Computer Society, Washington, DC, USA, pp 1–10, doi:10.1109/MSST.2010.5496972
Slurm, http://slurm.schedmd.com/
TORQUE, http://www.adaptivecomputing.com/products/open-source/torque/
Valiant L. G., 1990, Commun. ACM, 33, 103
Weil S. A., Brandt S. A., Miller E. L., Long D. D. E., Maltzahn C., 2006, in Proceedings of the 7th Symposium on Operating Systems Design and Implementation. OSDI '06. USENIX Association, Berkeley, CA, USA, pp 307–320, http://dl.acm.org/citation.cfm?id=1298455.1298485
Weil S. A., Leung A. W., Brandt S. A., Maltzahn C., 2007, in Proceedings of the 2nd International Workshop on Petascale Data Storage: Held in Conjunction with Supercomputing '07. PDSW '07. ACM, New York, NY, USA, pp 35–44, doi:10.1145/1374596.1374606
Xen Project, http://www.xenproject.org/
York D. G., et al., 2000, The Astronomical Journal, 120, 1579
Zaharia M., et al., 2012, in Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). USENIX, San Jose, CA, pp 15–28, https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia

APPENDIX A: FAULT TOLERANCE

In this appendix, we describe the fault tolerance features of Mesos and HDFS in greater detail.

Mesos

If the Mesos Master daemon detects that a slave has become unreachable (for instance, due to hardware failure), it notifies the relevant framework schedulers that the tasks running on the unreachable machine have been lost. The schedulers can then assign the lost tasks to other hosts. Thus, the system seamlessly handles node failures. Additionally, it is possible to configure redundant Mesos Master daemons, so that the system is resilient to master node failure.

HDFS

The HDFS DataNode daemon sends status messages, called "heartbeats," to the NameNode at regular intervals to inform the NameNode that it is still alive and reachable. If the NameNode stops receiving heartbeats from a particular DataNode, that DataNode is assumed to be dead. The NameNode then instructs the remaining DataNodes to make additional copies of the data that was stored on the dead node so that the required replication factor of each block is maintained. Additionally, each time a DataNode reads a block of data, it computes a checksum. If the checksum does not match the original checksum, stored in the block's associated metadata file, the NameNode is informed that the block has been corrupted. A new copy is then created from the uncorrupted copies on other machines. As in the case of the Mesos Master daemon, it is possible to configure redundant, backup NameNodes so that the file system remains intact when the machine hosting the primary NameNode experiences a failure.

APPENDIX B: SUMMARY OF EXISTING TOOLS

In this appendix, we briefly summarize the features of several popular tools: Apache Hadoop MapReduce, Apache Hama, Apache Spark, Slurm, and TORQUE. We also comment on aspects of these tools that limit their usefulness for large-scale scientific data analysis.

Hadoop MapReduce

The Hadoop MapReduce framework is an implementation of the MapReduce programming model. The user provides a mapper function and a reducer function, typically in the form of Java class methods. The user then provides a set of

MNRAS 000,1–15 (2016) 14 N. R. Stickley & M. A. Aragon-Calvo key-value pairs (for instance, file names and file contents) for Spark the framework to operate upon. The mapper transforms the Apache Spark is framework based upon the concept of Re- initial set of key-value pairs into a second set of key-value silient Distributed Datasets (RDDs) (Zaharia et al. 2012). pairs. The intermediate key-value pairs are then globally As the name suggests, RDDs are distributed across a clus- sorted by key and transformed into a third set of key-value ter of machines. For added performance, the contents of an pairs by the reducer. RDD can be stored in memory. Their resiliency lies to the fact that only the initial content of the RDDs and trans- formations performed on them need to be stored in a dis- Usability notes tributed file system; the memory-resident version of an RDD can be automatically recreated upon node failure by repeat- The mapper and reducer almost always need to be designed ing transformations on the initial data. with Hadoop in mind. A feature called Hadoop Streaming Compared with Hadoop MapReduce, Spark offers more makes it possible to use pre-existing executable files as the flexibility. There is no need to use a particular program- mapper and reducer. However, the executables need to be ming model; it is possible to implement MapReduce, BSP, able to read and write key-value pairs via the standard input and other models with Spark. Spark is also more interac- and output streams. In many situations, using pre-existing tive. Once an RDD is created, operations can easily be per- software with Streaming requires the software to be invoked formed on the data by issuing commands from an inter- from a script which formats the input and output streams preter. Spark works natively with four programming lan- appropriately. In other situations, the mapper and reducer guages: Scala, Java, Python, and R. 
It also provides a mod- programs have to be modified in order to work properly. ule allowing queries to be written in SQL. Handling non-trivial data formats efficiently requires special care. If the data format being used with MapRe- duce is more complicated than a text file that can be split Usability notes into single-line records, then a custom file reader must be Any executable file can be invoked with the RDD pipe() defined in order for the MapReduce framework to properly transformation, which sends data to the executable via a read the data. Reading binary files produced by scientific Unix pipe and then stores the standard output of the exe- instruments or simulations is even more cumbersome than cutable in an RDD. However, in order to be useful, the pro- reading formatted text. gram must be able to read data from its input stream and Using Hadoop MapReduce with languages other than send output data to the standard output stream. Programs Java requires a bit more work, since the framework is pri- that do not behave in this way need to first be modified in marily intended to be used by Java . In order order to be compatible with Spark. to use Hadoop MapReduce with languages other than Java, Using data that is more complicated than plain text re- one can use Hadoop Pipes, which makes it possible to write quires the user to define a file format reader. Thus, working mappers and reducers in C++. The C++ code can be ex- directly with raw scientific data formats is not straightfor- tended to make use of other languages, such as Python. ward.
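To make the Streaming contract concrete, the sketch below implements a word count as a single Python script that reads and writes tab-separated key-value pairs on the standard streams. This is an illustrative example, not part of NebulOS or Hadoop itself; the script name and the map/reduce command-line convention are our own.

```python
#!/usr/bin/env python
# wordcount.py -- illustrative Hadoop Streaming mapper and reducer.
# Streaming expects key-value pairs as tab-separated lines on the
# standard input and output streams.
import sys

def run_mapper(records):
    # Emit one intermediate pair per word: word -> 1
    for record in records:
        for word in record.split():
            print("%s\t%d" % (word, 1))

def run_reducer(records):
    # Streaming delivers mapper output sorted by key, so all
    # occurrences of a given word arrive on consecutive lines.
    current, total = None, 0
    for record in records:
        word, count = record.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mode = sys.argv[1] if len(sys.argv) > 1 else "map"
    if not sys.stdin.isatty():  # only consume stdin when input is piped in
        (run_mapper if mode == "map" else run_reducer)(sys.stdin)
```

Assuming the script is present and executable on every node, a job would be launched with something like `hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" -reducer "wordcount.py reduce"`; the exact jar path and options depend on the Hadoop installation.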
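Similarly, the following minimal sketch shows the kind of stdin-to-stdout filter that pipe() can invoke. The catalog format and column index here are hypothetical, chosen only to illustrate the required behaviour.

```python
#!/usr/bin/env python
# extract_redshift.py -- a minimal stdin/stdout filter of the kind that
# the RDD pipe() transformation can invoke. (Hypothetical example:
# pull the third column, assumed to hold a redshift, out of
# whitespace-delimited catalog rows.)
import sys

def extract_column(records, index=2):
    """Yield whitespace-delimited column `index` from each record."""
    for record in records:
        fields = record.split()
        if len(fields) > index:
            yield fields[index]

if __name__ == "__main__":
    for value in extract_column(sys.stdin):
        print(value)
```

An RDD of text records could then be processed with, roughly, `rdd.pipe("./extract_redshift.py")`, provided the script is available on every worker node.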

Hama

Apache Hama is a framework for Big Data analytics which uses the Bulk Synchronous Parallel (BSP) computing model (Valiant 1990), in which a distributed computation proceeds as a series of super-steps, each consisting of three stages: (1) concurrent, independent computation on worker nodes, (2) communication among processes running on worker nodes, and finally (3) barrier synchronization. Individual processes stay alive for multiple super-steps, so data can easily be kept in RAM between steps. This allows Hama to perform very well on iterative computations that repeatedly access the same data; on such tasks, Hama can outperform Hadoop MapReduce by two orders of magnitude.

Usability notes

Like Hadoop MapReduce, Hama is primarily intended to be used with Java, but it is possible to write programs in C++ using Hama Pipes. Hama handles data formats in exactly the same way as Hadoop MapReduce, so working with raw scientific data is not straightforward in most cases.

TORQUE

TORQUE is a distributed resource manager designed for submitting and managing batch jobs on a cluster. With TORQUE, it is possible to launch arbitrary programs and operate on arbitrary data. Batch jobs are launched by first writing a batch submission script and then submitting the script to the scheduler. It is possible to monitor the status of a batch job and its individual tasks while the job is running.

Usability notes

Unlike the Big Data tools discussed above, TORQUE is not aware of data placement, so data throughput depends heavily upon the speed of the network. It is also not straightforward to create data analysis routines consisting of multiple batch jobs; TORQUE is primarily intended for manually launching batch jobs, rather than for creating jobs programmatically.

Slurm

The Simple Linux Utility for Resource Management (Slurm) is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. It is used by some of the world's fastest supercomputers. The architecture of Slurm is quite similar to that of Mesos combined with the NebulOS application framework.

Usability notes Like NebulOS, Slurm allows jobs to be launched via a script or interactively, and it can be used to launch arbitrary software on a cluster. Slurm offers several different schedulers, but none of them is optimized to take advantage of data locality. However, it would likely be fairly straightforward to add a data-locality-aware scheduler to Slurm.

CephFS The Ceph Distributed File System (CephFS; Weil et al. 2006) is an open-source implementation of the reliable autonomic distributed object store architecture (RADOS; Weil et al. 2007). It is a POSIX-compliant, high-performance, distributed file system with automatic data replication for fault tolerance. CephFS is able to scale to multiple exabytes because it distributes its metadata service across many machines (whereas the metadata in HDFS is stored on a single NameNode host). It also offers many performance enhancements, such as automatically increasing the replication factor of files that are frequently accessed (HDFS requires this to be done manually). CephFS is implemented as a Linux kernel module, so it runs in kernel space, just like a native file system, rather than in user space.

Usability notes CephFS does not provide a straightforward way for client software to query the physical location of data. Thus, it relies heavily upon network transfers.

This paper has been typeset from a TEX/LATEX file prepared by the author.
