Is Apache Spark Scalable to Seismic Data Analytics and Computations?

Is Apache Spark Scalable to Seismic Data Analytics and Computations? Yuzhong Yan, Lei Huang Liqi Yi Department of Computer Science Intel Corporation Prairie View A&M University 2111 NE 25th Ave. Prairie View, TX Hillsboro, OR Email: [email protected], [email protected] Email: [email protected] Abstract—High Performance Computing (HPC) has been a In many the data- and technology-driven industries, big dominated technology used in seismic data processing at the data analytics platforms and cloud computing technologies petroleum industry. However, with the increasing data size and have made great progress in recent years toward meeting varieties, traditional HPC focusing on computation meets new challenges. Researchers are looking for new computing platforms the requirements of handling fast-growing data volumes and with a balance of both performance and productivity, as well as varieties. Hadoop [1] and Spark [2] are currently the most featured with big data analytics capability. Apache Spark is a new popular open source big data platforms that provide scalable big data analytics platform that supports more than map/reduce solutions to store and process big data, which deliver dynamic, parallel execution mode with good scalability and fault tolerance. elastic and scalable data storage and analytics solutions to In this paper, we try to answer the question that if Apache Spark is scalable to process seismic data with its in-memory tackle the challenges in the big data era. These platforms allow computation and data locality features. We use a few typical data scientists to explore massive datasets and extract valuable seismic data processing algorithms to study the performance and information with scalable performance. Many technologies productivity. Our contributions include customized seismic data advances in statistics, machine learning, NoSQL database, distributions in Spark, extraction of commonly used templates and in-memory computing from both industry and academia for seismic data processing algorithms, and performance analysis of several typical seismic processing algorithms. continue to stimulate new innovations in the data analytics Index Terms—Parallel Computing; Big Data Analytics; Seismic field. Data; Stencil Computing; Geophysicists need an ease-to-use and scalable platform that allows them incorporate the latest big data analytics I. INTRODUCTION technology with the geoscience domain knowledge to speed up their innovations in the exploration phase. Although there Petroleum is a traditional industry where massive seismic are some big data analytics platforms available in the market, data sets are acquired for exploration using land-based or they are not widely deployed in the petroleum industry since marine surveys. Huge amount of seismic data has already been there is a big gap between these platforms and the special generated and processed for several decades in the industry, needs of the industry. For example, the seismic data formats although there was no the big data concept at that time. High are not supported by any of these platforms, and the machine Performance Computing (HPC) has been heavily used in the learning algorithms need to be integrated with geology and industry to process the pre-stack seismic data in order to create geophysics knowledge to make the findings meaningful. 3D seismic property volumes for interpretation. Are these big data analytics platforms suitable in the The emerging challenges in petroleum domain are the petroleum industry? Because of lack of domain knowledge, burst increase of the volume size of acquired data and high- these platforms have been difficult to use in some traditional speed streaming data from sensors in wells that need to be industry sectors such as petroleum, energy, security, and analyzed on time. For instance, the volume size of high others. They need to be integrated and customized to meet dimension such as 3D/4D seismic data and high density the specific requirements of these traditional industry sectors. seismic data are growing exponentially. The seismic data The paper targets to discuss the gap between the general processing becomes both computation- and data- intensive functionality of the big data analytics platforms and the special applications. The traditional HPC programming model is good requirements from the petroleum industry, and to experiment a at handling computation-intensive applications, however, with prototype of Seismic Analytics Cloud platform (SAC for short) the continuously increasing sizes and varieties of petroleum [3, 4]. The goal of SAC is to deliver a scalable and productive data, HPC was not designed to handle the emerging big data cloud Platform as a Service (PaaS) to seismic data analytics problems. Moreover, HPC platforms has been an obstacle for researchers and developers. SAC has two main characteristics: most geophysicists to implement their algorithms on such one is its scalability to process big seismic data, and the other platforms directly, who demand a productive and scalable is its ease-to-use feature for geophysicists. In this paper, we platform to accelerate their innovations. describe our implementation of SAC, experiment with a few typical algorithms in seismic data analytics and computations, which combines the batch, interactive and streaming pro- and discuss the performance in details. cessing models into a single computing engine. It provides a highly scalable, memory-efficient, in-memory computing, II. RELATED WORK real-time streaming-capable big data processing engine for The big data problem requires a reliable and scalable cluster high-volume, high-velocity and high-variety data. Moreover, it computing or cloud computing support, which has been a supports high-level language Scala that combines both object- longstanding challenge to scientists and software developers. oriented programming and functional programming into a sin- Traditional High Performance Computing (HPC) researches gle programming language. The innovative designed Resilient have put significant efforts in parallel programming models in- Distributed Dataset (RDD) [20] and its parallel operations cluding MPI [5], OpenMP [6], and PGAS languages [7, 8, 9], provide a scalable and extensible internal data structure to compiler parallelization and optimizations, runtime support, enable in-memory computing and fault tolerance. There is a performance analysis, auto-tuning, debugging, scheduling and very active, and fast-growing research and industry community more. However, these efforts mostly focused on scientific that builds their big data analytics projects on top of Spark. computing, which are computation-intensive, while big data However, all these frameworks are built for general propose problems have both computation- and data-intensive chal- cases and are focused on data parallelism with improved lenges. Hence, these traditional HPC programming models MapReduce model, and there is no communication mechanism are not suitable to big data problems anymore. Besides between workers, which does not fit to some iterative seis- scalable performance, tackling big data problems requires a mic algorithms requiring frequent data communication among fault-tolerant framework with high-level programming models, workers. Both traditional seismic data storage and processing highly scalable I/O or database, and batch, interactive and algorithms need big changes to run on MapReduce platform. streaming tasks support for data analytics. With the growing exponentially of the seismic data volumes, MapReduce [10] is one of the major innovations that created how to store and manage the seismic data becomes a very a high-level, fault-tolerant and scalable parallel programming challenge problem. The HPC applications need to distribute framework to support big data processing. The Hadoop [1] data to every worker node, which will consume more time package encloses Hadoop Distributed File System (HDFS), on data transferring. The trend toward big data is leading MapReduce parallel processing framework, job scheduling to transitions in the computing paradigm, and in particular and resource management (YARN), and a list of data query, to the notion of moving computation to data, also called processing, analysis and management systems to create a near-data-processing(NDP) [21]. In Data Parallel System such big data processing ecosystem. Hadoop Ecosystem is fast MapReduce [22], clusters are built with commodity hardware growing to provide an innovative big data framework for big and each node takes the roles of both computation and storage, data storage, processing, query and analysis. Seismic Hadoop which makes it possible to bring computation to data. In [11] combines Seismic Unix [12] with Cloudera’s Distribution [23], it presented an optimized implementation of RTM by including Apache Hadoop to make it easy to execute common experiments with different data partitioning, keeping data seismic data processing tasks on a Hadoop cluster. In [13], locality, and reducing data movement. In [24], it proposed a [14] and [15], some traditional signal processing and migration remote visualization solution by introducing GPU computing algorithms are already implemented on MapReduce platform. into cluster, which could overcome problems of dataset size by [16] built a large-scale multimedia data mining platform by reducing data movements to local desktop. In [25], it evaluated using MapReduce framework, where the processing dataset is the suitability of MapReduce framework to implement

Is Apache Spark Scalable to Seismic Data Analytics and Computations?

Orchestrating Big Data Analysis Workflows in the Cloud: Research Challenges, Survey, and Future Directions

Building Machine Learning Inference Pipelines at Scale

Evaluation of SPARQL Queries on Apache Flink

Flare: Optimizing Apache Spark with Native Compilation

Large Scale Querying and Processing for Property Graphs Phd Symposium∗

Apache Spark Solution Guide

HDP 3.1.4 Release Notes Date of Publish: 2019-08-26

Debugging Spark Applications a Study on Debugging Techniques of Spark Developers

Adrian Florea Software Architect / Big Data Developer at Pentalog [email protected]

Big Data Analysis Using Hadoop Lecture 4 Hadoop Ecosystem

Mobile Storm: Distributed Real-Time Stream Processing for Mobile Clouds

Analysis of Real Time Stream Processing Systems Considering Latency Patricio Córdova University of Toronto [email protected]