Technical White Paper
Dell EMC PowerStore: Apache Spark Solution Guide

Abstract This document provides a solution overview for Apache Spark running on a Dell EMC™ PowerStore™ appliance.

June 2021

H18663

Revisions

Date        Description
June 2021   Initial release

Acknowledgments

Author: Henry Wong

This document may contain certain words that are not consistent with Dell's current language guidelines. Dell plans to update the document over subsequent future releases to revise these words accordingly.

This document may contain language from third party content that is not under Dell's control and is not consistent with Dell's current guidelines for Dell's own content. When such third party content is updated by the relevant third parties, this document will be revised accordingly.

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © 2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [6/9/2021] [Technical White Paper] [H18663]


Table of contents

Revisions
Acknowledgments
Table of contents
Executive summary
Audience
1 Introduction
1.1 PowerStore overview
1.2 Apache Spark overview
1.3 Apache Hadoop Distributed File System overview
1.4 The advantages of Spark and Hadoop on PowerStore
1.4.1 AppsON brings applications closer to the infrastructure and storage
1.4.2 Agile infrastructure, flexible scaling on a high-performing storage and compute platform
1.4.3 Mission-critical high availability and fault-tolerant platform
1.4.4 PowerStore inline data reduction reduces storage consumption and cost
1.4.5 Efficient and convenient snapshot data backup
1.4.6 Secure data protection with peace of mind
1.4.7 Unified infrastructure and services management
1.4.8 Spark value and future expansion
1.5 Terminology
2 Sizing considerations
3 Deploying a Spark cluster with HDFS
3.1 Planning for the virtual machines that run Spark and Hadoop
3.1.1 PowerStore X model appliance
3.1.2 PowerStore storage containers and virtual volumes
3.1.3 Creating virtual machines on PowerStore X model appliance
3.2 Installation and configuration of Apache Hadoop
3.2.1 Installing Hadoop
3.2.2 Configuring Hadoop HDFS cluster
3.3 Installation and configuration of Apache Spark
3.3.1 Installing Spark
3.3.2 Configuring a Spark standalone cluster
3.3.3 Configuring Spark History Server
4 Testing Spark with Spark-bench
4.1 Installing Spark-bench tool
4.1.1 Installation prerequisites
4.1.2 Installing Spark-bench
4.2 Running Spark-bench workloads
4.2.1 Generate KMeans dataset
4.2.2 Run KMeans workload
4.2.3 Spark memory and CPU cores
4.2.4 Spark network timeout
4.2.5 Monitoring Spark applications
5 Interactive analysis of PowerStore metrics with Jupyter notebook
5.1 Installing prerequisite software
5.1.1 JupyterLab
5.1.2 Python modules
5.1.3 PowerStore command-line interface (CLI)
5.2 Extract PowerStore space metrics
5.3 Import PowerStore space metrics into HDFS
5.4 Perform analysis on the PowerStore space metrics
6 Automation
7 Data protection
7.1 Snapshots and thin clones
7.2 AppSync
7.3 RecoverPoint for Virtual Machines
7.4 Hadoop distributed copy and HDFS snapshots
A Configure passwordless SSH
B Python codes
B.1 Import .csv files into HDFS
B.2 Analyze PowerStore space metrics
C Additional resources
C.1 Technical support and resources
C.2 Other resources
C.3 Ansible resources


Executive summary

Apache® Spark® has seen tremendous growth in the past few years. It is the leading platform for distributed processing because of its innovation, speed, and developer-friendly framework. This document offers a high-level overview of the Dell EMC™ PowerStore™ appliance and the benefits of running Apache Spark and Hadoop® HDFS on PowerStore. The document also provides installation, configuration, testing, and a simple use case for Spark and HDFS on PowerStore.

Audience

This document is intended for IT administrators, storage architects, partners, and Dell Technologies™ employees. This audience also includes individuals who may evaluate, acquire, manage, operate, or design a Dell EMC networked storage environment using PowerStore systems.


1 Introduction This document was developed using the PowerStore X model appliance, Apache Spark, Apache HDFS, and Red Hat® Enterprise Linux®. This section provides an overview of PowerStore, Apache Spark, and Apache HDFS.

1.1 PowerStore overview PowerStore achieves new levels of operational simplicity and agility. It uses a container-based microservices architecture, advanced storage technologies, and integrated machine learning to unlock the power of your data. PowerStore is a versatile platform with a performance-centric design that delivers multidimensional scale, always-on data reduction, and support for next-generation media.

PowerStore brings the simplicity of public cloud to on-premises infrastructure, streamlining operations with an integrated machine-learning engine and seamless automation. It also offers predictive analytics to easily monitor, analyze, and troubleshoot the environment. PowerStore is highly adaptable, providing the flexibility to host specialized workloads directly on the appliance and modernize infrastructure without disruption. It also offers investment protection through flexible payment solutions and data-in-place upgrades.

The PowerStore platform is available in two different product models: PowerStore T models and PowerStore X models. PowerStore T models are bare-metal, unified storage arrays which can service block, file, and VMware® vSphere® Virtual Volumes™ (vVols) resources along with numerous data services and efficiencies. PowerStore X model appliances enable running applications directly on the appliance through the AppsON capability. A native VMware ESXi™ layer runs embedded applications alongside the PowerStore operating system, all in the form of virtual machines. This feature adds to the traditional storage functionality of PowerStore X model appliances, and supports serving external block and vVol storage to servers with multiple protocols.

For more information about PowerStore T models and PowerStore X models, see the documents Dell EMC PowerStore: Introduction to the Platform and Dell EMC PowerStore Virtualization Infrastructure Guide.

1.2 Apache Spark overview Apache Spark is an open-source distributed processing engine designed to be high performing, scalable, and capable of processing massive amounts of data. It can perform a wide range of analytic tasks such as SQL queries, streaming, and machine learning.

Spark supports several popular programming languages such as Java, Scala, Python, and R, and provides a unified and consistent set of APIs for these programming languages. Also, it has an extensive set of libraries for SQL (DataFrames), machine learning (MLlib), Spark Streaming, and GraphX. These capabilities allow developers to easily build Spark applications by combining different APIs, libraries, and functions.

Spark is built for speed and high performance. Spark loads the entire dataset in memory on the cluster and performs computation on it. The data is kept in memory to minimize disk access. Spark performs exceptionally well for iterative computations that require passing the same data multiple times. Machine learning is a great example of such iterative computations.


Spark supports a wide range of storage systems such as local file systems, Apache Hadoop HDFS, Apache Hive, Apache HBase, Cassandra, and more. Figure 1 shows the Spark components in blue. For more information, see the corresponding documentation on https://spark.apache.org.

Spark components: language APIs (SQL, Java, Python, Scala, R), libraries (DataFrames, Streaming, GraphX, MLlib), the Spark Core execution engine, and storage/data sources (local file system, HDFS, HBase, Hive, Cassandra, and others)

Spark supports several cluster managers including the Spark standalone cluster manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes. This paper focuses on the Spark standalone cluster.


A Spark standalone cluster (see Figure 2) consists of one master node and multiple worker nodes. The cluster manager on the master node manages the cluster resources, such as CPUs and memory, and assigns application tasks to the worker nodes. A Spark application is a driver program that establishes a Spark session with the cluster manager and requests resources to perform multiple tasks on the worker nodes. Executors are Java virtual machine (JVM) processes on the worker nodes that perform the tasks and report the status and results back to the driver program.

Spark standalone cluster overview: the application driver program establishes a Spark session with the cluster manager on the Spark master node, and executors on the Spark worker nodes run the assigned tasks
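As a brief illustration of this flow, launching an interactive shell against a standalone cluster starts a driver program that registers with the cluster manager and is assigned executors on the worker nodes. This is a minimal sketch; <master-host> is a placeholder for the Spark master node configured later in section 3.3.2.

$ spark-shell --master spark://<master-host>:7077

The Executors tab of the application web UI (port 4040 on the driver by default) then shows the executors that the worker nodes launched for this driver.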

1.3 Apache Hadoop Distributed File System overview Apache Hadoop is an open-source software suite and framework for big-data processing. Hadoop Distributed File System (HDFS) is one of the core components of Hadoop. It is a distributed file system designed to be massively scalable and fault tolerant, and to provide high throughput. HDFS can scale up to hundreds of servers and supports large files. It is well suited for applications, such as Spark, that require access to large datasets. HDFS files are divided into blocks and stored on multiple servers. Data block replication (the replication factor) places replicas of each block across the cluster to increase data availability and read performance.

Other Hadoop core components include the following:

• Hadoop YARN: A cluster and resource manager
• Hadoop MapReduce: A distributed parallel data-processing system
• Hadoop Common: Core common utilities shared by other modules

HDFS provides the persistent storage and data source for Spark. The configuration in this document does not use Hadoop YARN and MapReduce.

A Hadoop cluster consists of a NameNode and multiple DataNodes. The NameNode maintains the files and directory information of the distributed file system and tracks where the data blocks are located within the cluster. The DataNodes store the data blocks and the replicas in the local file systems.


1.4 The advantages of Spark and Hadoop on PowerStore Both Spark and HDFS share a similar distributed architecture that requires a powerful, highly scalable, and flexible infrastructure. PowerStore is performance-optimized for any workload, and its adaptable scale-up and scale-out architecture complements the distributed model of these applications. This section highlights the PowerStore features that benefit and extend the application environment.

1.4.1 AppsON brings applications closer to the infrastructure and storage Bringing applications closer to data increases density and simplifies infrastructure operations. The PowerStore AppsON capability integrates with VMware vSphere®, resulting in streamlined management in which storage resources plug directly into the virtualization layer. Using VMware as the onboard application environment results in unmatched simplicity, since support is inherently available for any standard VM-based applications. When a new PowerStore X model is deployed, the VASA provider is automatically registered, and the datastore is created, eliminating manual steps and saving time. PowerStore seamlessly integrates the VMware ESXi software into the same hardware. Two ESXi nodes are embedded inside the appliance and have direct access to the same storage resources. This close integration allows applications such as Spark and Hadoop to take full advantage of server and storage virtualization with simplified deployment and management. AppsON is available exclusively on the PowerStore X model.

1.4.2 Agile infrastructure, flexible scaling on a high-performing storage and compute platform PowerStore provides flexible scaling with ease of management that complements the Spark and Hadoop scale-up and scale-out distribution model. The integrated hypervisor dynamically scales up the cluster nodes when the workload requires it, while you can rapidly provision new nodes on the same appliance or on other appliances in a different location.

Big-data applications require large amounts of data and computational power for analytics, machine learning, model training, and other workloads. With a PowerStore appliance, administrators can scale up the storage capacity by adding disks and disk expansion enclosures without service interruption at any time. You can also configure multiple PowerStore appliances into a cluster to increase CPUs, memory, storage capacity, and front-end connectivity. Clustering simplifies and centralizes the management of multiple appliances from PowerStore Manager, a single HTML5-based management interface. A cluster can consist of up to four PowerStore T appliances or four PowerStore X appliances. Each appliance within the cluster can have different configurations of CPUs, memory, NVMe drives, and expansion enclosures.

The NVMe architecture is designed for next-generation NVMe-based storage and takes advantage of a low-overhead NVRAM cache. PowerStore is engineered to handle the most demanding workloads.

1.4.3 Mission-critical high availability and fault-tolerant platform PowerStore provides a high level of stability and reliability for Spark and HDFS. At the hardware level, PowerStore is highly available and fault tolerant. It monitors the storage devices continuously, and it automatically relocates data from failing devices to avoid data loss. The PowerStore X model appliance includes two ESXi nodes and redundant hardware components. The nondisruptive upgrade (NDU) feature further increases overall PowerStore availability. The updates take place on the nodes in a rolling fashion. NDU supports PowerStore software releases, hotfixes, and hardware and disk firmware.

The dynamic resiliency engine (DRE) feature automatically protects and repairs the underlying storage from drive failures. Administrators are not required to manually configure or manage the protection settings for the drives.


To support high-value business workloads and service requirements on the application level, it is essential to protect and ensure the availability of the Spark and HDFS nodes. The Spark master node and the HDFS NameNode are central to all operations of the application in the cluster. When they become inaccessible, all applications and storage operations are affected. Also, if a DataNode is not reachable for an extended period, the NameNode determines the blocks on the failed node and starts making copies of the blocks from other replicas until the replication factor is met.

With standard VMware vSphere High Availability (HA) integrated into PowerStore, the embedded VMware ESXi™ hypervisor automatically restarts or migrates failed Spark and HDFS servers to a different ESXi node. This helps restore Spark and HDFS to their full operational capacity and minimizes the chance of DataNodes being marked dead.

To achieve an even higher level of redundancy and application availability, you can deploy the Spark cluster and HDFS cluster across multiple PowerStore appliances in different racks, floors, or locations. PowerStore improves application availability and provides unparalleled flexibility and mobility to relocate and move across data centers and appliances.

1.4.4 PowerStore inline data reduction reduces storage consumption and cost Data science and big-data applications continuously pull in a tremendous amount of data from various sources. To help reduce storage consumption and cost, the PowerStore inline data-reduction feature maximizes space savings by combining both software data deduplication and hardware compression. Data reduction works seamlessly in the background, is always enabled, and cannot be disabled. Since data reduction is always active in PowerStore, enabling application or operating system compression may not provide additional savings.

1.4.5 Efficient and convenient snapshot data backup PowerStore provides Spark and HDFS with extra data protection through array-based snapshots. A PowerStore snapshot is a point-in-time copy of the data. The snapshots are space efficient and require seconds to create. Snapshot data are exact copies of the source data and can be used for application testing, backup, or DevOps. Because of the tight integration with VMware vSphere, PowerStore can take vVol VM snapshots directly from PowerStore Manager using a protection policy schedule or on demand. You can view the VM snapshot information in PowerStore and vCenter.

1.4.6 Secure data protection with peace of mind With high-value data driving business applications, data security is a top concern for all organizations. Lost or stolen data can seriously damage the reputation of an organization and result in huge financial costs and loss of customer trust. Dell Technologies engineered PowerStore with Data at Rest Encryption (D@RE) which uses self-encrypting drives and supports array-based, self-managed keys. When D@RE is activated, data is encrypted as it is written to disk using the 256-bit Advanced Encryption Standard (AES). PowerStore D@RE provides this data security benefit to Spark applications while eliminating application overhead, performance penalties, and administrative overhead that is typically associated with software-based solutions.

1.4.7 Unified infrastructure and services management PowerStore provides deep integration with VMware management tools and services with Dell EMC Virtual Storage Integrator (VSI), VMware vRealize® Operations Manager (vROps), VMware vRealize Orchestrator (vRO), and VMware Storage Replication Adapter (SRA). You can easily incorporate ESXi on PowerStore X models into your existing vCenter and manage all VMware infrastructure and services from a unified management platform.


1.4.8 Spark value and future expansion Big-data platforms such as Spark and Hadoop create enormous value for organizations. As the value and scale of this data grows, it is critical to have a future-proof platform that is easy to manage. The platform must also provide technical innovation for future growth, and support the application architecture. Spark and Hadoop on PowerStore bring IT organizations the ability to be agile, efficient, and responsive to business demands.

1.5 Terminology The following terms are used with PowerStore.

Appliance: Solution containing a base enclosure and attached expansion enclosures. The size of an appliance could be only the base enclosure or the base enclosure plus expansion enclosures.

PowerStore node: Storage controller that provides the processing resources for performing storage operations and servicing I/O between storage and hosts. Each PowerStore appliance contains two nodes.

Base enclosure: Enclosure containing both nodes (node A and node B) and 25 NVMe drive slots.

Expansion enclosure: Enclosures that can be attached to a base enclosure to provide additional storage.

Fibre Channel (FC) protocol: Protocol used to perform SCSI commands over a Fibre Channel network.

iSCSI: Provides a mechanism for accessing block-level data storage over network connections.

NDU: A nondisruptive upgrade (NDU) updates PowerStore and maximizes its availability by performing rolling updates. This includes updates for PowerStore software releases, hotfixes, and hardware and disk firmware.

NVMe: Non-Volatile Memory Express is a communication interface and driver for accessing nonvolatile storage media such as solid-state drives (SSD) and SCM drives through the PCIe bus.

NVMe over Fibre Channel (NVMe-FC): Allows hosts to access storage systems across a network fabric with the NVMe protocol using Fibre Channel as the underlying transport.

NVRAM: Nonvolatile random-access memory is persistent random-access memory that retains data without an electrical charge. NVRAM drives are used in PowerStore appliance as additional system write caching.

Volume: A block-level storage device that can be shared out using a protocol such as iSCSI or Fibre Channel.

Snapshot: A point-in-time view of data that is stored on a storage resource. You can recover files from a snapshot, restore a storage resource from a snapshot, or provide access to a host.

Storage container: A VMware term for a logical entity that consists of one or more capability profiles and their storage limits. This entity is known as a vVol datastore when it is mounted in vSphere.

PCIe: Peripheral Component Interconnect Express is a high-speed serial computer expansion bus standard.

PowerStore Manager: An HTML5 management interface for creating storage resources and configuring and scheduling protection of stored data on PowerStore. PowerStore Manager can be used for all management of PowerStore native replication.


PowerStore T model: Container-based storage system that is running on purpose-built hardware. This storage system supports unified (block and file) workloads, or block-optimized workloads.

PowerStore X model: Container-based storage system that runs inside a virtual machine that is deployed on a VMware hypervisor. Besides offering block-optimized workloads, PowerStore also allows you to deploy applications directly on the array.

RecoverPoint for Virtual Machines: Protects VMs in a VMware environment with VM-level granularity and provides local or remote replication for any point-in-time recovery. This feature is integrated with VMware vCenter and has integrated orchestration and automation capabilities.

SCM: Storage-class memory, also known as persistent memory, is an extremely fast storage technology supported by PowerStore appliance.


Storage Policy Based Management (SPBM): Using policies to control storage-related capabilities for a VM and ensure compliance throughout its life cycle.

Thin clone: A read/write copy of a thin block storage resource (volume, volume group, or vSphere VMFS datastore) that shares blocks with the parent resource.

User snapshot: Snapshot that is created manually by the user or by a protection policy with an associated snapshot rule. This snapshot type is different than an internal snapshot, which is taken automatically by the system with asynchronous replication.

Virtual machine (VM): An operating system running on a hypervisor, which is used to emulate physical hardware.

vCenter: VMware vCenter server provides a centralized management platform for VMware vSphere environments.

VMware vSphere Virtual Volumes (vVols): A VMware storage framework which allows VM data to be stored on individual vVols. This ability allows for data services to be applied at a VM-level of granularity and according to SPBM. vVols can also refer to the individual storage objects that are used to enable this functionality.

vSphere API for Array Integration (VAAI): A VMware API that improves ESXi host utilization by offloading storage-related tasks to the storage system.

vSphere API for Storage Awareness (VASA): A VMware vendor-neutral API that enables vSphere to determine the capabilities of a storage system. This feature requires a VASA provider on the storage system for communication.


2 Sizing considerations Before you select the PowerStore model, storage media, capacity, and connectivity options, you must first understand the target Spark and HDFS environment. There are many factors to consider, including but not limited to the following:

• Consider the amount of data to keep and future data growth. Spark typically requires minimal storage. It performs computation in memory and only requires temporary disk storage when the data does not fit in memory. HDFS NameNodes require a small amount of storage for storing the HDFS metadata and transaction logs. HDFS DataNodes require a large amount of storage because they manage and store the data on disk.
• Consider the HDFS replication requirement. The HDFS replication factor determines how many copies of the data are replicated across the DataNodes in the cluster (see the worked example after this list).
• Understand the workload patterns. For Spark, computational power and memory are the most important factors for performance. For Hadoop, disk space, I/O bandwidth, and computational power are important factors.
• Use a 10 Gb or 25 Gb network to provide sufficient network bandwidth and reduce latency, especially for HDFS replication.
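As a simplified, hypothetical illustration of how the replication factor drives capacity: storing a 10 TB dataset with the default replication factor of 3 consumes roughly 10 TB × 3 = 30 TB of raw DataNode capacity, before accounting for temporary files, data growth, and free-space headroom. Planning for 20 to 30 percent headroom would suggest roughly 36 TB to 39 TB of usable DataNode storage for that dataset.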

Also, consider the following resources for the Spark and Hadoop nodes on a PowerStore X appliance:

• Determine how many VMs are needed, and the CPU and memory requirements of each VM.
• While it is possible to run Spark and Hadoop on separate VMs, we recommend placing the data source, HDFS, close to the Spark worker nodes for best performance. If it is not possible to co-locate the Spark worker node and the HDFS DataNode on the same VM, ensure the Spark and Hadoop VMs are close together and connected over a fast network.
• Prepare for the event in which one of the ESXi nodes fails, and decide whether CPU and memory resources should be reserved on an ESXi node to maintain full performance for the applications.
• Do not overcommit CPU and memory resources on PowerStore ESXi nodes in any production or mission-critical environment. However, this practice might be acceptable in test or development environments where a guaranteed performance level is not a concern.

Review the Spark documentation and Hadoop documentation to learn about other software and hardware requirements.

The Dell Technologies account team has access to a suite of tools, such as LiveOptics and CloudIQ, that are designed to help gather and analyze workload and performance data in an existing environment. The account team can use the PowerStore sizer to estimate the storage needs.


3 Deploying a Spark cluster with HDFS The following sections describe what a Spark cluster environment looks like and demonstrate how to set up a simple Spark cluster with HDFS on a PowerStore X appliance.

3.1 Planning for the virtual machines that run Spark and Hadoop Spark has several deployment modes. The standalone cluster deployment mode, which includes a cluster manager, is the simplest way to deploy Spark in a private cluster. You can also deploy Spark on top of other cluster managers including Apache Hadoop YARN, Apache Mesos, and Kubernetes. This paper presents the Spark standalone cluster deployment in a private cluster on a PowerStore X appliance.

Spark is a compute and analytics engine and does not handle the storage of the data. A Spark job reads the data from various sources into memory, processes it and keeps it in memory, and optionally writes the result to storage systems to persist the data. Spark supports many storage systems such as Linux file systems, Hadoop Distributed File System (HDFS), Apache Hive, Cassandra, and others. This paper focuses on setting up a Spark cluster with HDFS.

A Spark standalone cluster consists of one master node and multiple worker nodes. The master node manages the cluster resources and coordinates running Spark applications across the worker nodes. Spark applications are split into multiple tasks which are performed by executors (Java processes) on the worker nodes. One or multiple executors (Java processes) might be launched on each worker node.

HDFS requires a NameNode and multiple DataNodes. The NameNode maintains the files and directory information of the distributed file system and tracks where the data is located within the cluster in the DataNodes.

To increase Spark processing power, you can add more worker nodes to the cluster which increases the total computational power and memory available for executors. This addition enables more parallel tasks to be performed across the cluster. For storage, more DataNodes provide more storage processing capability and storage capacity in the cluster.

With a PowerStore X appliance, you can deploy both Spark nodes and Hadoop nodes as virtual machines on the appliance. This simplifies the provisioning and management of the virtual machines and storage.

In this example, the Spark cluster consists of five virtual machines, and the Hadoop cluster consists of six virtual machines. While Spark nodes and Hadoop nodes can run on different virtual machines, co-locating them on the same virtual machines brings the data closer to Spark and reduces the data access time. Table 1 summarizes the roles and software installed on each virtual machine.

Spark and HDFS virtual machine specifications

Virtual machine        CPU   RAM     Software                      Role
hadoop-namenode-vm10   16    32 GB   Hadoop                        HDFS NameNode
spark-prim-vm10        16    32 GB   Spark, Hadoop                 Spark primary server, HDFS DataNode
spark-wrk-vm10         24    64 GB   Spark, Hadoop                 Spark worker node, HDFS DataNode
spark-wrk-vm11         24    64 GB   Spark, Hadoop                 Spark worker node, HDFS DataNode
spark-wrk-vm12         24    64 GB   Spark, Hadoop                 Spark worker node, HDFS DataNode
spark-wrk-vm13         24    64 GB   Spark, Hadoop                 Spark worker node, HDFS DataNode
spark-bench-vm10       16    32 GB   Spark, Spark-bench, Hadoop    Spark-bench driver, JupyterLab server,
                                                                   PowerStore CLI client, HDFS client

3.1.1 PowerStore X model appliance AppsON is a unique PowerStore X model feature where a VMware hypervisor running vSphere ESXi v6.7 is embedded on the two internal hosts. This feature allows applications to run in VMs directly on the appliance. The appliance offers deep integration with vSphere and is fully compatible with VMware tools. During the initial configuration of the appliance, the internal ESXi hosts are configured to register with a vCenter provided by the customer. The initialization automatically applies performance optimizations, or you can apply them manually afterward. These optimizations include the following:

• Create multiple iSCSI targets on the appliance
• Configure additional network ports
• Optimize ESXi multipath settings for the appliance
• Increase ESXi queue depths
• Configure jumbo frames for cluster and iSCSI networks

For details about the performance best practices, see the following documents.

• PowerStore: PowerStore X Performance Best Practice Tuning
• Dell EMC PowerStore Virtualization Guide
• Dell EMC PowerStore: VMware vSphere Best Practices

Dell Technologies offers various media options and hardware specifications to choose from. To learn more about the full PowerStore family, go to the PowerStore product page.


3.1.2 PowerStore storage containers and virtual volumes On PowerStore X models, the VASA provider is automatically registered with vSphere, and the default storage container is mounted automatically on the internal ESXi nodes through the iSCSI protocol. See Figure 3. For external ESXi hosts, PowerStore can serve block volumes using Fibre Channel (FC), iSCSI, or NVMe over Fabrics (NVMe-OF). PowerStore can also serve vVol storage containers to the external hosts using FC or iSCSI. However, you must manually register the VASA provider, and you must mount the storage containers manually on the external ESXi hosts.

PowerStore automatically tracks the vVols that belong to each VM. The PowerStore Manager UI shows these vVols objects under the Virtual Machines view. See Figure 4 and Figure 5.

For more information about vVols, storage containers, and vSphere VASA, see the Dell EMC PowerStore Virtualization Guide.

vVol-based storage container automatically mounted in vSphere on the PowerStore X model

Default storage container on the PowerStore X model appliance


Listing vVols objects that are associated with a VM


3.1.3 Creating virtual machines on PowerStore X model appliance Using vCenter, the virtual machines for the Spark nodes and the Hadoop nodes are deployed directly on the PowerStore X internal ESXi hosts based on the information in Table 1 (virtual machine specifications) and Table 2 (file system layout). The two PowerStore X internal ESXi hosts are presented and managed in vCenter like other external ESXi hosts (see Figure 6). You can also view the virtual machines in the PowerStore Manager UI (see Figure 7).

We recommend creating the virtual machines from a template which ensures consistency and faster setup. You can import a virtual machine or template from an existing environment to speed up the deployment process. In this example, a Red Hat Enterprise Linux 7.9 template is established with the packages and configuration outlined in section 3.1.3.1.

PowerStore X internal ESXi hosts and virtual machines in vCenter, showing the internal ESXi nodes, the PowerStore controller VMs, and the application VMs


Virtual machines in PowerStore Manager

3.1.3.1 Guest virtual machine operating system Spark runs on Windows, macOS, and Linux. Apache Hadoop supports Linux and Windows but is mostly deployed on Linux. In this example, Red Hat Enterprise Linux (RHEL) is used for all applications. A VM template is created with RHEL plus the following software and configurations. All application VMs are created from the template to ease deployment and ensure consistency:

• Red Hat Enterprise Linux 7.9 Server with Graphical Desktop
• chrony
• open-vm-tools
• lsscsi
• sg3_utils
• autofs
• iscsi-initiator-utils
• java-1.8.0-openjdk
• java-1.8.0-openjdk-devel
• python3
• python3-pip
• python3-setuptools
• zlib
• zlib-devel
• ncurses
• ncurses-devel
• gcc
• openssl-devel
• -devel
• Latest updates from Red Hat software repository
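For convenience, the supporting packages in this list can be installed in a single step. The following is a minimal sketch that assumes the standard RHEL 7 package names; the server-with-GUI environment is selected during the operating-system installation, and the truncated -devel entry above is omitted:

# yum install -y chrony open-vm-tools lsscsi sg3_utils autofs iscsi-initiator-utils \
  java-1.8.0-openjdk java-1.8.0-openjdk-devel python3 python3-pip python3-setuptools \
  zlib zlib-devel ncurses ncurses-devel gcc openssl-devel
# yum update -y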


This example applies the following configurations to Red Hat Enterprise Linux:

• Disable optional services

for service in firewalld avahi-daemon irqbalance iptables ip6tables
do
  systemctl disable $service
  systemctl stop $service
done

• Configure the virtual machines to use network time servers. It is a best practice to keep the system clock synchronized across the cluster nodes. chrony is a common time synchronization service available on Linux. Add the network time server IP address in /etc/chrony.conf and enable the service.

server $time_server_1_ip iburst
server $time_server_2_ip iburst

# systemctl enable chronyd --now

• Ensure the applications have enough system resources to run on the VMs by increasing the ulimit limits. Use the following as a starting point, and adjust the settings if necessary. Set the following in /etc/security/limits.conf.

* soft nofile 128000
* hard nofile 128000
* hard nproc 16000
* hard fsize -1
* soft core unlimited
* soft data unlimited
* hard data unlimited
* soft stack unlimited
* hard stack unlimited

3.1.3.2 File system layout Each virtual machine is configured with one or more paravirtual SCSI controllers and several vVol-based virtual disks provisioned from the storage container on the PowerStore appliance. The virtual disks are formatted with the XFS file system on RHEL.


Table 2 shows an example of the file system layout for each application VM. It is a best practice to separate application data from the operating system. One or more file systems are dedicated on each VM for HDFS use.

Spark and Hadoop file system layout

hadoop-namenode-vm10 (2 paravirtual SCSI controllers):
- /dev/sda, 50 GB, / and swap: used for the operating system and application binaries
- /dev/sdb, 100 GB, /data/1: used for the NameNode

spark-prim-vm10, spark-wrk-vm10, spark-wrk-vm11, spark-wrk-vm12, spark-wrk-vm13 (3 paravirtual SCSI controllers):
- /dev/sda, 50 GB, / and swap: used for the operating system and application binaries
- /dev/sdb, 100 GB, /data/1: used for data on the DataNodes
- /dev/sdc, 100 GB, /data/2: used for data on the DataNodes

spark-bench-vm10 (1 paravirtual SCSI controller):
- /dev/sda, 50 GB, / and swap: used for the operating system and application binaries
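As a hedged example of preparing one of the data disks, assuming /dev/sdb is the 100 GB virtual disk mounted at /data/1 as in the layout above:

# mkfs.xfs /dev/sdb
# mkdir -p /data/1
# mount /dev/sdb /data/1
# echo "/dev/sdb  /data/1  xfs  defaults  0 0" >> /etc/fstab

Using the UUID reported by blkid in /etc/fstab, instead of the device name, is more robust if the device ordering changes across reboots.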

3.1.3.3 Networking The PowerStore X model creates a vSphere distributed switch (vDS) and a set of preconfigured distributed port groups for internal communications during the initial configuration process. Each internal ESXi node has two 10 Gb connections for vDS uplinks and one 1 Gb connection for the management network. The preconfigured port groups are reserved for PowerStore use only. To allow guest VMs to communicate with other systems on the network, create a new distributed virtual switch (DVS) port group and assign a different VLAN to the port group. Configure each VM with a virtual network adapter on this user-defined DVS port group. See Figure 8.


User defined DVS port groups for VM network communication

vSphere distributed switch and distributed port groups for PowerStore X

3.2 Installation and configuration of Apache Hadoop This section shows the basic installation and configuration of a Hadoop cluster. For more advanced topics and configuration, see the documentation at https://hadoop.apache.org/docs/current/.

3.2.1 Installing Hadoop The Apache Hadoop project offers several binary versions and the source code on the project website. For simplicity and ease of installation, download one of the prebuilt binaries from http://hadoop.apache.org/releases.html. To decide which Hadoop version to use, check the Spark download site at http://spark.apache.org/downloads.html to verify which version of Hadoop is supported. In this example, Hadoop release 3.2.2 is chosen because it is supported by the Spark prebuilt version 3.0.2. The following steps show an example of installing a prebuilt version of Hadoop.

Perform the following steps on each Hadoop node as the root user:

1. Install Java JDK.

# yum install java-1.8.0-openjdk # yum install java-1.8.0-openjdk-devel

2. Install Python3.

# yum install python3 python3-pip python3-setuptools


3. Create a hdfs user and group.

When deploying Hadoop on multiple VMs, ensure the hdfs user id (UID) and group id (GID) are the same across all cluster nodes.

# groupadd -g 3000 hdfs
# useradd -u 3000 -g 3000 -d /home/hdfs hdfs
# passwd hdfs

4. Download Hadoop from http://hadoop.apache.org/releases.html and save the installation file in /usr/local.
5. Extract the software into a subdirectory in /usr/local.

# cd /usr/local
# tar xzvf hadoop-3.2.2.tar.gz

6. Assign ownership to hdfs user.

# chown -R hdfs:hdfs /usr/local/hadoop-3.2.2

7. Optionally, create a symbolic link to the software directory. Configure a symbolic link to point to the active version of the software. This action ensures a consistent path to the Hadoop program and configuration files between different versions of the software.

# ln -s /usr/local/hadoop-3.2.2 /usr/local/hadoop

8. Configure the following environment variables for the hdfs user in $HOME/.bashrc.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export _JAVA_OPTIONS="-Xmx4g -Djava.awt.headless=true"
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${PATH}
export HADOOP_MAPRED_HOME=/usr/local/hadoop

HADOOP_HOME          Set the location of the Hadoop software
HADOOP_CONF_DIR      Set the location of the Hadoop configuration files
JAVA_HOME            Set the location of the Java software
_JAVA_OPTIONS        Set the Java heap size and other Java options
PATH                 Add the locations of the Hadoop programs to the search path
HADOOP_MAPRED_HOME   Set the location of the MapReduce programs
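After sourcing the new environment as the hdfs user, a quick sanity check is to print the Hadoop version, which confirms that the binaries are on the search path (the version reported depends on the release installed):

$ source ~/.bashrc
$ hadoop version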


3.2.2 Configuring Hadoop HDFS cluster You can deploy Apache Hadoop on a single node as a standalone instance or across multiple nodes in a cluster setting. The standalone setup is great for performing quick tests or debugging without the overhead of bringing up a full cluster. This paper focuses on setting up a Hadoop cluster with multiple VMs.

1. Set up the Hadoop worker file /usr/local/hadoop/etc/hadoop/workers. This file contains a list of the HDFS DataNodes in the cluster.

$ cd /usr/local/hadoop/etc/hadoop
$ cat workers
spark-prim-vm10
spark-wrk-vm10
spark-wrk-vm11
spark-wrk-vm12
spark-wrk-vm13

2. Configure the Hadoop environment settings in /usr/local/hadoop/etc/hadoop/hadoop-env.sh. This file contains environment variables for the Hadoop daemons such as the Java process options and Hadoop software location.

export HDFS_NAMENODE_USER=hdfs
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_HEAPSIZE_MAX=4g
export HDFS_NAMENODE_OPTS="-Xmx4g -Djava.awt.headless=true -XX:+UseParallelGC"
export HDFS_DATANODE_OPTS="-Xmx4g -Djava.awt.headless=true -XX:+UseParallelGC"

3. Configure the Hadoop core configuration settings in /usr/local/hadoop/etc/hadoop/core-site.xml. This file contains core site settings such as I/O, security, and others. There are hundreds of configurable attributes, and many have default values that are not listed in the core-site.xml file. To see the complete list of these attributes and their descriptions, go to http://hadoop.apache.org and search for the core-default.xml documentation. In this example, the fs.defaultFS and hadoop.http.staticuser.user attributes are defined in the file.

- fs.defaultFS sets the default file system uniform resource identifier (URI) of your environment.
- hadoop.http.staticuser.user sets the username that is used to browse the content of the file system in the HDFS web UI.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-namenode-vm10:9000</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hdfs</value>
  </property>
</configuration>


4. Configure the HDFS configuration settings in /usr/local/hadoop/etc/hadoop/hdfs-site.xml. For a complete list of HDFS attributes, search for the hdfs-default.xml documentation on http://hadoop.apache.org. The following attributes are set in this example.

- dfs.replication specifies the number of block replications for all files in HDFS. The default for block replication is 3.
- dfs.namenode.name.dir specifies the local file systems to store the NameNode name table (fsimage).
- dfs.datanode.data.dir specifies the local file systems on the DataNode to store the data blocks.
- dfs.datanode.max.transfer.threads specifies the maximum number of threads for transferring data in and out of the DataNode.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
  </property>
  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>4096</value>
  </property>
</configuration>

5. Configure passwordless SSH between the NameNode and the DataNodes. This allows the NameNode to transfer files, and to start and stop the Hadoop daemons remotely, without supplying the password for each node. See appendix A for instructions to set up passwordless SSH.
6. Sync the configuration files in the /usr/local/hadoop/etc/hadoop directory on the NameNode to all DataNodes. Use the scp or rsync command to transfer the files between the VMs (a sketch of steps 6 and 7 appears after the script descriptions below).
7. Start the Hadoop daemons as the hdfs user.

Hadoop provides a set of scripts in /usr/local/hadoop/sbin to start and stop the daemons and cluster.

- start-dfs.sh, stop-dfs.sh – start and stop the HDFS daemons (NameNode and DataNodes).
- start-all.sh, stop-all.sh – start and stop the HDFS daemons and the YARN daemons.

To use these cluster-wide scripts, set up passwordless SSH properly as described in step 5.
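The following is a minimal sketch of steps 6 and 7, run as the hdfs user on the NameNode. It assumes that passwordless SSH is in place and that the workers file lists the DataNode hostnames; on a brand-new cluster, the NameNode metadata directory must also be initialized once with hdfs namenode -format before the first start.

for node in $(cat /usr/local/hadoop/etc/hadoop/workers)
do
  rsync -a /usr/local/hadoop/etc/hadoop/ ${node}:/usr/local/hadoop/etc/hadoop/
done

start-dfs.sh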


8. Validate the Hadoop cluster with the hdfs command or web UI.

a. As the hdfs user, run hdfs dfsadmin -report to show the status of the cluster and each node.

$ hdfs dfsadmin -report
Configured Capacity: 1073217536000 (999.51 GB)
Present Capacity: 1072615031814 (998.95 GB)
DFS Remaining: 673199080454 (626.97 GB)
DFS Used: 399415951360 (371.99 GB)
DFS Used%: 37.24%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

------Live datanodes (5):

Name: 100.88.XX.XX:9866 (spark-prim-vm10.techsol.local)
Hostname: spark-prim-vm10.techsol.local
Decommission Status : Normal
Configured Capacity: 214643507200 (199.90 GB)
DFS Used: 64006270976 (59.61 GB)
Non DFS Used: 69025792 (65.83 MB)
DFS Remaining: 150439768579 (140.11 GB)
DFS Used%: 29.82%
DFS Remaining%: 70.09%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 3
Last contact: Thu May 13 09:42:36 CDT 2021
Last Block Report: Thu May 13 07:21:20 CDT 2021
Num of Blocks: 1298

-----Repeat for other nodes-----


b. Go to the Hadoop web UI in a browser: http://$NAMENODE_IP:9870.

Hadoop web UI > Cluster overview


Hadoop web UI > DataNode Information
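As an additional spot check, you can copy a small file into HDFS and confirm its block replication with hdfs fsck. This is a hedged example with a hypothetical test path; with the dfs.replication value of 3 shown earlier, each block should report three replicas.

$ hdfs dfs -mkdir -p /user/hdfs/test
$ hdfs dfs -put /etc/hosts /user/hdfs/test/hosts
$ hdfs fsck /user/hdfs/test/hosts -files -blocks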

3.3 Installation and configuration of Apache Spark This section shows the installation and configuration of a Spark cluster including the following software dependencies:

• Java JDK: Spark runs on Java virtual machines (JVMs). Spark 3.0 requires Java 8 or 11.
• Programming language interpreter: Spark is written in Scala and ships with a Scala interpreter. Spark also works with Python, Java, and R. For Spark to work from any of these programming languages, install the corresponding interpreter on the system. According to the Apache Spark project, Python is now the most widely used language with Spark.

3.3.1 Installing Spark Spark offers several prebuilt versions and source code on the project site. For simplicity and ease of installation, download one of the prebuilt binaries from http://spark.apache.org/downloads.html. If you prefer to build Spark from source for advanced customization, follow the information on https://spark.apache.org/docs/latest/building-spark.html. The following example shows installing a prebuilt version of Spark.


Perform the following steps as the root user on the Spark nodes:

1. Install Java JDK.

# yum install java-1.8.0-openjdk # yum install java-1.8.0-openjdk-devel

2. Install the programming language interpreters.

# yum install python3 python3-pip python3-setuptools

3. Create a spark user and group as the root user. When deploying Spark on multiple VMs, ensure the Spark user id (UID) and group id (GID) are the same across all cluster member VMs.

# groupadd -g 3004 spark
# useradd -u 3004 -g 3004 -d /home/spark spark
# passwd spark

4. Download Spark from http://spark.apache.org/downloads.html to the /usr/local directory. In this example, the Spark release is 3.0.2, and the package type is Pre-built for Apache Hadoop 3.2 and later (see Figure 11). Click the download link to download the file.

Apache Spark download page

Note: The download site is updated periodically with new releases, and older releases may be archived to another location. Ensure that the prebuilt Spark version is compatible with the Hadoop version that you have chosen.

5. Extract the software in a subdirectory in /usr/local.

# cd /usr/local
# tar xzvf spark-3.0.2-bin-hadoop3.2.tgz

6. Assign ownership to spark user.

# chown -R spark:spark /usr/local/spark-3.0.2-bin-hadoop3.2


7. Optionally, create a symbolic link to the software directory. Configure a symbolic link to point to the active version of the software. This action ensures a consistent path to the Spark program and configuration files between different versions of the software.

# ln -s /usr/local/spark-3.0.2-bin-hadoop3.2 /usr/local/spark

8. Configure the following environment variables for the spark user in $HOME/.bashrc.

export SPARK_HOME=/usr/local/spark
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export _JAVA_OPTIONS="-Xmx4g -XX:+UseParallelGC"
export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:${SPARK_HOME}/python/:$PYTHONPATH"
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

SPARK_HOME              Set the location of the Spark software
JAVA_HOME               Set the location of the Java software
_JAVA_OPTIONS           Set the Java heap size and other Java options
PATH                    Add the locations of the Spark programs to the search path
PYTHONPATH              Set the locations of the Spark Python libraries for pyspark
PYSPARK_PYTHON          Set the Python interpreter for pyspark
PYSPARK_DRIVER_PYTHON   Set the Python interpreter for the pyspark driver

9. Verify the Spark installation using the spark-shell interactive tool as spark user.

$ source ~/.bashrc
$ spark-shell

A Spark session is successfully created and waiting for user command.

3.3.2 Configuring a Spark standalone cluster Spark can run on a single host or on multiple hosts in a cluster setting. To form a Spark standalone cluster, add the cluster nodes to the Spark configuration file and synchronize the environment settings and cluster configuration across all cluster nodes.

Configure and update these files on the master node first, and sync them to all worker nodes.

1. To configure the Spark standalone cluster, add the worker-node information in the /usr/local/spark/conf/slaves file. In this example, the following worker nodes are added to the Spark cluster configuration:

$ cd /usr/local/spark/conf
$ cat /usr/local/spark/conf/slaves
spark-wrk-vm10
spark-wrk-vm11
spark-wrk-vm12
spark-wrk-vm13


2. Configure Spark logging in /usr/local/spark/conf/log4j.properties.

Spark uses log4j for logging. Configure log4j by copying the template file in the /usr/local/spark/conf directory. The default settings in the template are a good starting point without any changes. Adjust the parameters if necessary.

$ cd /usr/local/spark/conf
$ cp log4j.properties.template log4j.properties

3. Configure Spark environment settings in /usr/local/spark/conf/spark-env.sh.

The following variables are chosen as a baseline. These variables configure the Spark cluster such as the java classpath and the web portal UI. To see the complete list of variables, see https://spark.apache.org/docs/latest/spark-standalone.html.

# cat spark-env.sh
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
export SPARK_MASTER_HOST=spark-prim-vm10
export SPARK_MASTER_WEBUI_PORT=9090
export SPARK_WORKER_CORES=22

HADOOP_HOME Set the location of the Hadoop programs and configuration files so that Spark can access the HDFS.

SPARK_DIST_CLASSPATH Include the Hadoop classpath.

SPARK_MASTER_HOST Set the master node ip/hostname.

SPARK_MASTER_WEBUI_PORT Set the master web UI port (default is 8080). It might be necessary to change the default port due to conflicts with other applications on the same system.

SPARK_WORKER_CORES Set how many CPU cores Spark applications are allowed to use on the worker. The default is all available CPU cores.

4. Configure the Spark application properties in /usr/local/spark/conf/spark-defaults.conf.

This configuration file contains properties that control most of the application settings. We recommend reviewing these properties to understand what they do and how they change the behavior of Spark. See https://spark.apache.org/docs/latest/configuration.html for the comprehensive list of properties, their default values, and description.

The following lists the application properties used in this example:

$ cat spark-defaults.conf
spark.master spark://spark-prim-vm10:7077
spark.sql.debug.maxToStringFields 1000
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.shuffleTracking.enabled true
spark.dynamicAllocation.executorIdleTimeout 600s
spark.driver.log.persistToDfs.enabled true
spark.driver.log.dfsDir /user/spark/driverlogs
spark.network.timeout 600000
spark.executor.heartbeatInterval 100000

# Enable history server
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory
spark.history.fs.logDirectory hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory

spark.master  Set to the cluster manager IP/hostname and port. For a Spark standalone cluster, the URI is in the form spark://$SPARK_MASTER_IP:$PORT.

spark.sql.debug.maxToStringFields  Set the maximum number of fields that can be converted to strings in debug output.

spark.dynamicAllocation.enabled  Enable dynamic resource allocation. This allows dynamic scaling of executors.

spark.dynamicAllocation.shuffleTracking.enabled  Enable shuffle file tracking for executors without the need for an external shuffle service.

spark.dynamicAllocation.executorIdleTimeout  Increase the executor idle timeout from the default 60 s to 600 s. This prevents executors from being removed prematurely during long-running tasks.

spark.driver.log.persistToDfs.enabled  Enable applications to write the driver logs to persistent storage. The default is to not persist the driver logs.

spark.driver.log.dfsDir  Set the persistent storage location where the Spark driver stores its logs. In this example, it is set to an HDFS directory.

spark.network.timeout  Set the default timeout for all network connections. The default is 120 s. Increase the timeout for long-running tasks, for example, to 600000 ms (10 minutes).

spark.executor.heartbeatInterval  Set the interval for the executor heartbeat. It must be significantly less than spark.network.timeout. The default is 10 s.


spark.eventLog.enabled  Enable Spark to log events for use with the Spark History Server. See section 3.3.3.

spark.eventLog.dir  Set the location where the Spark event logs are stored. In this example, it is set to an HDFS directory. See section 3.3.3.

spark.history.fs.logDirectory  Specify the persistent storage location from which the Spark History Server loads the event logs. This should be set to the same location as spark.eventLog.dir. See section 3.3.3.

5. Configure passwordless SSH between the master node and the worker nodes. This allows the master node to transfer files and to start and stop the Spark daemons remotely without supplying the password for each node. See appendix A for instructions to set up passwordless SSH.

6. Sync the configuration files in the /usr/local/spark/conf directory on the Spark master node to all worker nodes. Use the scp or rsync command to transfer the files between the nodes.

7. Start the Spark processes as the spark user.

Spark provides a set of scripts in /usr/local/spark/sbin to start and stop the cluster or individual processes.

start-all.sh and stop-all.sh start and stop all Spark processes on the master and worker nodes. These scripts do not start or stop the Spark History Server.

To use these cluster-wide scripts, you must properly set up passwordless ssh as described in step 5.
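The following is a minimal sketch of steps 6 and 7, assuming the worker hostnames used elsewhere in this paper (spark-wrk-vm10 through spark-wrk-vm13) and that passwordless SSH is already in place:

for server in spark-wrk-vm10 spark-wrk-vm11 spark-wrk-vm12 spark-wrk-vm13
do
rsync -av /usr/local/spark/conf/ $server:/usr/local/spark/conf/
done

$ /usr/local/spark/sbin/start-all.sh
$ jps

The jps command, included with the JDK, should list a Master process on the master node and a Worker process on each worker node.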


8. Validate the Spark standalone cluster on the Spark master web UI.

Go to the Spark master web UI in a browser: http://spark-prim-vm10:9090. Verify that the worker node status is alive.

Spark master web UI

3.3.3 Configuring Spark History Server

Spark History Server is a web front end that accesses and displays the event logs generated from the Spark applications across all nodes. You must enable and save the event logs in a centralized location where the Spark History Server can access them. While the Spark master web UI also provides application logs, they are not persisted across Spark restarts. It is useful to have the event logs available for troubleshooting or tuning the application performance. See section 4.2.5 for an example of using the Spark History Server.

To enable the Spark History Server, use the following procedure:

1. Create a directory for the event logs where all Spark worker nodes have write access. In this example, the log directory resides in the HDFS cluster.

On any of the HDFS DataNodes, perform the following commands as the hdfs or spark user to create a new directory and set the ownership to spark user:

$ hdfs dfs -mkdir /user/spark/loghistory
$ hdfs dfs -chown spark /user/spark/loghistory


2. Add the following entries in /usr/local/spark/conf/spark-defaults.conf on all Spark nodes. These entries enable the event logging to the specified HDFS directory from all Spark nodes.

# Enable history server
spark.eventLog.enabled             true
spark.eventLog.dir                 hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory
spark.history.fs.logDirectory      hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory

3. As spark user, start the Spark History Server on the Spark master node.

$ /usr/local/spark/sbin/start-history-server.sh

4. Verify the status of Spark History Server. In a browser, connect to the Spark History Server at http://$SPARK_MASTER_IP:18080

Spark History Server web UI


4 Testing Spark with Spark-bench

Spark-bench is an open-source benchmarking tool that is designed to test various application workloads. The tool provides an integrated data generator and several application workloads including machine learning, streaming, graph processing, and SQL. The goal of the project is to provide developers with a comprehensive Spark-specific benchmark tool that is also easy to use and configure. Developers use the tool to test and validate their configurations, compare performance between different platforms, and identify system bottlenecks in the environment.

4.1 Installing Spark-bench tool

The original Spark-bench code can be downloaded from the GitHub site https://github.com/CODAIT/spark-bench. The code was last updated in November 2018. Since then, other developers have forked the code to fix issues or add enhancements to the tool. In this example, a forked version is used from https://github.com/ch2994/spark-bench because it updates the Scala version to 2.12. The original Spark-bench code is compiled with Scala version 2.11, which is incompatible with the prebuilt version of Spark 3.x because Spark 3.x is compiled with Scala version 2.12.

The Spark-bench tool can be installed on one of the Spark nodes or on a dedicated node. If Spark-bench is installed on a dedicated node, it is a best practice to keep the Spark-bench node close to the Spark cluster to avoid high network latency.

4.1.1 Installation prerequisites

Spark-bench requires the following software:

• Java 8 or above
• Python 3
• Spark software: Spark-bench launches application workloads by making spark-submit calls to the cluster. Spark software is required on the Spark-bench node.
• Hadoop software: To configure the Spark-bench driver to save the logs to an HDFS directory, the system requires Hadoop software.
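As a quick sanity check (a minimal sketch that assumes the installation locations used in this paper), verify the prerequisites on the Spark-bench node:

$ java -version
$ python3 --version
$ /usr/local/spark/bin/spark-submit --version
$ /usr/local/hadoop/bin/hadoop version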

Follow the steps in section 3.3.1 and section 3.3.2 to install and configure Spark on the Spark-bench VM, except for configuring /usr/local/spark/conf/slaves. In this example, the Spark-bench VM functions as a dedicated driver and not a part of the Spark cluster. There is no requirement to add the Spark-bench node in the configuration file. Also, there is no requirement to start the Spark daemons on the Spark-bench VM.

Follow the steps in section 3.2 to install and configure Hadoop, except for configuring /usr/local/hadoop/etc/hadoop/workers. There is also no requirement to start any Hadoop daemon on the Spark-bench VM.


4.1.2 Installing Spark-bench

The following procedure installs Spark-bench on a dedicated virtual machine running on the same PowerStore X appliance:

1. Download Spark-bench from https://github.com/ch2994/spark-bench. Select the scala-2.12 branch, click the Code drop-down menu, and click Download ZIP. Save the installation file in /usr/local.


2. As root user, extract the software into a subdirectory in /usr/local.

# cd /usr/local
# unzip spark-bench-scala-2.12.zip

3. Create a symbolic link to the software directory as root user.

# ln -s /usr/local/spark-bench-scala-2.12 /usr/local/spark-bench

4. Assign ownership to the spark user.

# chown -R spark:spark /usr/local/spark-bench-scala-2.12

5. Download the sbt tool from github to compile the Spark-bench code.

# cd /usr/local
# wget https://github.com/sbt/sbt/releases/download/v1.4.8/sbt-1.4.8.zip
# unzip sbt-1.4.8.zip

6. Compile the Spark-bench code with sbt tool as spark user. For more information about compiling Spark-bench, see https://codait.github.io/spark-bench/compilation/.

$ cd /usr/local/spark-bench
$ /usr/local/sbt/bin/sbt assembly
$ mkdir lib
$ cp -p target/assembly/* lib

7. When sbt assembly is complete, two jar files are generated in the target/assembly directory. Move or copy them to the lib directory.

spark-bench-2.3.0_0.4.0-RELEASE.jar
spark-bench-launch-2.3.0_0.4.0-RELEASE.jar

4.2 Running Spark-bench workloads

This section demonstrates running the Spark-bench KMeans workload, and using the Spark master web UI and Spark History Server to monitor the applications. Spark-bench provides data generators for KMeans, Linear Regression, and Graph data. It also includes workloads for KMeans, SparkPi, and others. For the complete list of workloads and their definitions, see https://codait.github.io/spark-bench/workloads/. As noted on that page, even though the project aims to provide comprehensive workloads supported by Spark, some of these workloads have not been fully implemented. For instance, while the tool can generate data for Linear Regression, the Linear Regression workload has not been implemented yet.

This paper focuses on the KMeans workload, a machine learning workload, because Spark-bench supports both generating the data and exercising the KMeans workload against the generated data. The workload reads the data from the storage, performs computation in memory, and writes the results to the storage.

Spark-bench also provides example configuration files for different workloads. These examples are good starting points for beginners to explore Spark-bench and the workload configuration files. Before you attempt to implement the KMeans workload, we recommend reading about these examples on https://codait.github.io/spark-bench/examples/.


4.2.1 Generate KMeans dataset

The following data-generation configuration is based on an example configuration from Spark-bench. The example configuration files are in /usr/local/spark-bench/examples. Make a copy of data-generation.conf and modify it to fit your Spark environment. At a minimum, adjust the parameters called out in the comments below (such as master, executor-memory, and output) to reflect your environment.

$ cat data-generation-8p.conf
spark-bench = {
  spark-submit-parallel = false
  spark-submit-config = [{
    spark-args = {
      // Specify the Spark master address in your env
      master = "spark://spark-prim-vm10:7077"
      // Specify how much memory to request for the executor. Must be less than the avail memory on the worker node
      executor-memory = "4G"
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generating data for the benchmarks to use"
        parallel = false
        repeat = 1 // generate once and done!
        benchmark-output = console
        workloads = [
          {
            name = "data-generation-kmeans"
            // The generated data is written to the HDFS filesystem in the parquet format
            output = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8-50mil/kmeans-data.parquet"
            save-mode = "overwrite"
            // Size of the dataset, 50 million rows total
            rows = 50000000
            cols = 24
            partitions = 8
          }
        ]
      }
    ]
  }]
}

To generate the dataset, run the following command as spark user. Ensure that the spark-bench.sh is invoked by the same user that owns the output directory. These examples assign the spark user the ownership of the HDFS directories.

$ cd /usr/local/spark-bench/examples
$ /usr/local/spark-bench/bin/spark-bench.sh data-generation-8p.conf


When the application completes, verify the dataset in HDFS. In a web browser, go to the HDFS NameNode web UI at http://$HADOOP_NAMENODE_IP:9870. Click Utilities > Browse the file system. The number of files created is based on the partition parameters that are defined in the configuration file.
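Alternatively, list the generated files from the command line (a minimal sketch using the output path from the data-generation configuration above):

$ hdfs dfs -ls /user/spark/testdata-p8-50mil/kmeans-data.parquet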

Browse KMeans data directory in HDFS

4.2.2 Run KMeans workload

Make a copy of the KMeans configuration file in the example directory. Modify the configuration to fit your environment. The Spark-bench configuration syntax is very flexible and allows defining multiple workloads, repeating workloads, and sequential and parallel executions of workloads to simulate various mixed workload patterns. The following example shows three workloads with different Spark settings, and each workload is repeated five times. The goal is to test the effect of using a different number of CPU cores for each executor. The configuration also overrides the default executor memory and driver memory settings.

spark-bench = {
  spark-home = "/usr/local/spark"
  spark-submit-parallel = false
  spark-submit-config = [{
    spark-args = {
      master = "spark://spark-prim-vm10:7077"
      executor-cores = 4
      executor-memory = 2g
      driver-memory = 4g
    }
    suites-parallel = false
    // Workload 1
    workload-suites = [
      {
        descr = "Run kmeans "
        parallel = false
        repeat = 5
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8-50mil/kmeans-data.parquet"
            k = 10
          }
        ]
      }
    ]
  },
  {
    spark-args = {
      master = "spark://spark-prim-vm10:7077"
      executor-cores = 8
      executor-memory = 2g
      driver-memory = 4g
    }
    // Workload 2
    suites-parallel = false
    workload-suites = [
      {
        descr = "Run kmeans 2 "
        parallel = false
        repeat = 5
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8-50mil/kmeans-data.parquet"
            k = 10
          }
        ]
      }
    ]
  },
  {
    spark-args = {
      master = "spark://spark-prim-vm10:7077"
      executor-cores = 12
      executor-memory = 2g
      driver-memory = 4g
    }
    // Workload 3
    suites-parallel = false
    workload-suites = [
      {
        descr = "Run kmeans 3 "
        parallel = false
        repeat = 5
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8/kmeans-data.parquet"
            k = 10
          }
        ]
      }
    ]
  }]
}

4.2.3 Spark memory and CPU cores

Because Spark is an in-memory computing engine, proper memory configuration is critical to its performance. One of the most common errors Spark applications encounter is the Out Of Memory (OOM) error. This is typically related to the Java heap setting, the executor memory setting, or the driver memory setting. Every application has different requirements, and there is no universal setting that works for every application; the memory configuration differs from application to application. The general guideline is that the memory setting should be large enough to hold the dataset. Adjust and experiment with these values when the application encounters memory errors. For a complete list of tunable Spark settings, see https://spark.apache.org/docs/latest/configuration.html.

To adjust the amount of Java Heap space for Spark daemons, set the following environment settings:

• _JAVA_OPTIONS = -Xmx4g in the spark user $HOME/.bashrc file
• SPARK_DAEMON_JAVA_OPTS = -Xmx4g in the $SPARK_HOME/conf/spark-env.sh file

To change the default settings for the Spark driver memory and Spark executor memory, set the following parameters in $SPARK_HOME/conf/spark-defaults.conf.

• spark.driver.memory
• spark.executor.memory

Each Spark application might set its own memory settings that are different from the defaults. To override the default values, specify the executor-memory and driver-memory parameters as in the example in section 4.2.2.
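For example, the following is a minimal sketch of overriding the defaults for a single run with spark-submit, using the SparkPi example that ships with Spark (the memory and core values are illustrative only, not recommendations):

$ /usr/local/spark/bin/spark-submit \
--master spark://spark-prim-vm10:7077 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 8 \
/usr/local/spark/examples/src/main/python/pi.py 100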


When a Spark application is submitted to the cluster manager, it launches several executors on the Spark worker nodes to process the tasks. By default, one executor is requested on each Spark worker node with 1 GB of memory and all CPU cores available on the node. However, when Spark is co-located with other applications, like Hadoop in this example, some CPU cores should be reserved for the Hadoop daemons. To limit the total number of CPU cores that Spark applications are allowed to use on the system, set SPARK_WORKER_CORES to the total number of CPU cores on the system minus the number of CPU cores reserved for the other applications in /usr/local/spark/conf/spark-env.sh.
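For example, a minimal sketch of the spark-env.sh entry, assuming each worker VM exposes 24 vCPUs and two cores are reserved for the Hadoop daemons (which matches the 22 cores per worker node used in this example):

# /usr/local/spark/conf/spark-env.sh
SPARK_WORKER_CORES=22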

Also, the Spark application might explicitly request the number of cores allowed for each executor. Instead of all available CPU cores for a single executor on a worker node, set the executor-cores parameter in the application to override the default. The example in section 4.2.2 compares the duration of the KMeans workloads using a different number of executor-cores. Figure 15 shows the status summary of the application runs and their durations on the Spark master web UI. In this example, there are four Spark worker nodes. Each node has 22 available CPU cores for Spark applications. When the executor-cores is specified explicitly, Spark automatically calculates the number of executors that it can run on each worker node based on the available CPU cores.

Click the application id link to see the executors details like in Figure 16. The status of the executors might show KILLED even though they are completed successfully because the driver asks the workers to terminate the executors after they finish processing.

Spark application status in Spark master web UI (workload groups annotated with executor-cores = 12, 8, and 4)


Executor status of an application

4.2.4 Spark network timeout

It might be necessary to increase the timeout values for Spark network communication. The default network timeout is 120 s, which might not be long enough for long-running tasks. If the application fails with a timeout error, increase the spark.network.timeout value incrementally to find the optimal value. spark.network.timeout is defined in $SPARK_HOME/conf/spark-defaults.conf. See section 3.3.2.
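For reference, the values used in this paper (see the spark-defaults.conf listing in section 3.3.2) raise both properties well above the defaults:

spark.network.timeout              600000
spark.executor.heartbeatInterval   100000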

4.2.5 Monitoring Spark applications

Use the Spark master web UI to monitor the applications that are running or recently completed. The application information does not persist after restarting Spark daemons. To retain information about completed applications, configure and use the Spark History Server. See section 3.3.3 for information about configuring the Spark History Server.

4.2.5.1 Spark master web UI

The Spark master web UI shows the worker status, a summary of CPU cores and memory available and used on each worker, and a list of completed and incomplete applications. It also displays Spark application information including core and memory usage, environment settings, stages and tasks, and executor information. Click the application id to see the runtime information and log messages. See Figure 15 and Figure 16.

Note: The application information does not persist after restarting the Spark daemons.


4.2.5.2 Spark History Server

We recommend configuring the Spark History Server to retain the application information on persistent storage. It is useful to be able to recall the information for troubleshooting or comparing performance with different environment settings. To store the application information, configure the Spark worker nodes to write events to a persistent storage location like HDFS. See section 3.3.3 about configuring the Spark History Server.

To access the Spark History Server, in a web browser, go to http://$SPARK_MASTER_IP:18080. The main page shows a list of completed and incomplete applications. See Figure 17.

Spark History Server web UI

Click the app id link to see more details like the event timeline and the status of the jobs. The jobs and stages sections provide useful information about their runtimes, the functions performed, and the detailed logs specific to each job and stage. This allows users to easily see the end-to-end job flow, identify trouble areas, and investigate the cause of bottlenecks or issues. See Figure 18, Figure 19, and Figure 20.

Spark jobs event timeline


Completed jobs status


Show all stages and review the detail log of a specific stage

The environment section, Figure 21, shows the application environment settings and is useful to review how adjusting these settings might affect the performance of the application. For instance, execute the same application with different executor memory settings and compare their performance.

Application environment settings


5 Interactive analysis of PowerStore metrics with Jupyter notebook

This section demonstrates a simple use case using Spark to perform analytic tasks on PowerStore storage usage. This use case requires the following additional software:

• JupyterLab notebook: JupyterLab is an open-source, web-based development environment for Jupyter notebooks. Developers can interact with live code, equations, and text, and create visualizations such as graphs and tables in a notebook. It supports various programming languages such as Python, Scala, R, Julia, and many more. For more information about JupyterLab, go to https://jupyter.org/.
• pandas, matplotlib, and statsmodels Python modules: The pandas module is a popular data-analysis library that is easy to use. The matplotlib module provides a data visualization and graphical plotting library for Python. The statsmodels module provides classes and functions for different statistical models. More information about these modules is available on https://pandas.pydata.org, https://matplotlib.org, and https://www.statsmodels.org.
• PowerStore command-line interface (CLI) client: The PowerStore CLI client, pstcli, enables administrators to manage and automate tasks on PowerStore appliances from Windows or Linux systems. Administrators can run pstcli commands interactively or in a batch script to automate various tasks such as extracting the various PowerStore metrics. For more information about the PowerStore CLI client, go to http://www.dell.com/support and search for pstcli.

5.1 Installing prerequisite software

For simplicity, all prerequisite software is installed on the same system where it has access to the Spark programs and libraries. In this example, the software is installed on spark-bench-vm10.

5.1.1 JupyterLab

Perform the following steps to install JupyterLab on a Linux system:

1. Install JupyterLab as spark user.

$ pip3 install jupyterlab

2. The software is installed in the user $HOME/.local/bin directory. Add this location to the user PATH variable.

$ export PATH=$HOME/.local/bin:$PATH

3. Launch JupyterLab from a directory where the notebooks reside. By default, JupyterLab allows access from localhost only. To allow access from other systems, include the --ip 0.0.0.0 argument.

Note: Make note of the URLs in the following output for accessing the JupyterLab server in a browser.

$ cd /stage/spark/notebooks
$ jupyter-lab --ip 0.0.0.0

[I 2021-05-10 16:46:53.992 ServerApp] jupyterlab | extension was successfully linked.


[I 2021-05-10 16:46:54.010 LabApp] JupyterLab extension loaded from /home/spark/.local/lib/python3.6/site-packages/jupyterlab
[I 2021-05-10 16:46:54.010 LabApp] JupyterLab application directory is /home/spark/.local/share/jupyter/lab
[I 2021-05-10 16:46:54.013 ServerApp] jupyterlab | extension was successfully loaded.
[I 2021-05-10 16:46:54.013 ServerApp] Serving notebooks from local directory: /stage/spark/notebooks
[I 2021-05-10 16:46:54.013 ServerApp] Jupyter Server 1.7.0 is running at:
[I 2021-05-10 16:46:54.013 ServerApp] http://spark-bench-vm10:8888/lab?token=8813cbd9ad29c724df329089dadd068427e3d07c0c617811
[I 2021-05-10 16:46:54.013 ServerApp] http://127.0.0.1:8888/lab?token=8813cbd9ad29c724df329089dadd068427e3d07c0c617811
[I 2021-05-10 16:46:54.013 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 2021-05-10 16:46:54.018 ServerApp] No web browser found: could not locate runnable browser.
[C 2021-05-10 16:46:54.018 ServerApp]

    To access the server, open this file in a browser:
        file:///home/spark/.local/share/jupyter/runtime/jpserver-31142-open.html
    Or copy and paste one of these URLs:
        http://spark-bench-vm10:8888/lab?token=8813cbd9ad29c724df329089dadd068427e3d07c0c617811
        http://127.0.0.1:8888/lab?token=8813cbd9ad29c724df329089dadd068427e3d07c0c617811

5.1.2 Python modules

Run the following command as the root user to install the pandas, matplotlib, and statsmodels Python modules:

# pip3 install pandas matplotlib statsmodels

5.1.3 PowerStore command-line interface (CLI)

Perform the following steps to install the PowerStore CLI client, pstcli, on a Linux system:

1. Download the pstcli rpm package to the /usr/local directory. On www.dell.com/support, search for pstcli and follow the link to download the rpm package.

2. Issue the following command as root user to install the package.

# rpm -ihv /usr/local/

3. The programs are installed in /opt/dellemc/pstcli- directory. Add this location to the PATH environment variable for the user.

# export PATH=/opt/dellemc/pstcli-:$PATH


5.2 Extract PowerStore space metrics

Use the following commands to extract the space metrics from two appliances and save the data in csv-format files.

pstcli -d $POWERSTORE_MANAGEMENT_IP_PS12 -u admin -p $PASSWORD metrics \
generate -entity space_metrics_by_appliance -entity_id A1 -interval One_Hour \
-output csv > space-metrics-ps12-2-18-2021.csv

pstcli -d $POWERSTORE_MANAGEMENT_IP_PS14 -u admin -p $PASSWORD metrics \
generate -entity space_metrics_by_appliance -entity_id A1 -interval One_Hour \
-output csv > space-metrics-ps14-2-17-2021.csv

The amount of historic data extracted depends on the collection interval specified in the argument.

• For the Five_Mins interval, 1 day of historical data is available.
• For the One_Hour interval, 30 days of historical data is available.
• For the One_Day interval, 2 years of historical data is available.

Examine the .csv files. If Success appears in the first line, manually delete the line, but do not modify the header that contains the column names.
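For example, the following is a minimal sketch of removing a leading Success line without touching the header row (GNU sed is assumed; substitute your own .csv filename):

$ sed -i '1{/^Success/d}' space-metrics-ps14-2-17-2021.csv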

pstcli has a rich set of functions. For more information about pstcli, go to http://www.dell.com/support and search for pstcli.

5.3 Import PowerStore space metrics into HDFS

The following example demonstrates using a JupyterLab notebook and pyspark to import the .csv files generated in section 5.2 and convert them into parquet format in HDFS for future access.

1. Access JupyterLab in a web browser by copying and pasting the URLs provided in the above command output.


2. Create a Python notebook with the code provided in appendix B.1. The code creates a Spark session and imports the .csv files into HDFS in parquet format. Run each cell interactively to see the results of each step.


3. Confirm the imported data in HDFS. Go to http://$HADOOP_NAMENODE_IP:9870, and click Utilities > Browse the file system. Enter the HDFS directory in the text box, and click Go.

5.4 Perform analysis on the PowerStore space metrics

Create a new Python notebook with the code provided in appendix B.2. The code creates a Spark session, reads the PowerStore space metrics from the parquet files on HDFS, transforms the data, performs calculations and aggregations, and creates bar charts of the results. Run each cell interactively to see the results of each step.


6 Automation

One of the main challenges in creating a reliable cluster environment is building and maintaining the cluster consistently. When changes are introduced to the environment, it is important to apply them to all systems consistently and track these changes over time. Inconsistency in the cluster environment causes unexpected or intermittent issues which are difficult to troubleshoot. We recommend adopting an automation tool to ensure a high-quality, consistent environment. PowerStore offers several ways to program against the appliance, including REST APIs, PowerStore pstcli, and the Ansible module for PowerStore. Automation tools such as Ansible are typically easier to learn and implement compared to writing programs with the other options. However, all options are available and equally capable of managing the PowerStore appliance. Another benefit of using Ansible is the vast number of modules contributed by the community at https://galaxy.ansible.com. Administrators can easily create an end-to-end deployment and update workflow.

For instance, an Ansible playbook may use the following modules to set up and maintain an environment:

• Use the PowerStore Ansible module to create volumes and protection policies and to manage storage operations.
• Use the VMware Ansible module to create virtual machines from a template and update VM configurations on the PowerStore appliance.
• Use the Ansible built-in modules to perform many operating system tasks, such as applying updates, configuring a time server, creating SSH keys, creating file systems, installing and configuring applications, and many more.
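For example, a minimal sketch of getting started with these collections (the collection names are published on Ansible Galaxy; inventory.yml and site.yml are user-provided placeholders):

$ ansible-galaxy collection install dellemc.powerstore community.vmware
$ ansible-playbook -i inventory.yml site.yml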

For more links and resources related to Ansible and PowerStore, see appendix C.3.


7 Data protection

PowerStore includes native data-protection features with snapshot and thin clone technologies. Also, Dell Technologies has a wide range of products to enhance data protection and enable disaster recovery beyond a local storage system.

7.1 Snapshots and thin clones

A PowerStore snapshot is a point-in-time copy of the data of a storage resource such as a volume, volume group, virtual machine, or file system. You can take manual snapshots in the PowerStore Manager UI or create snapshot policies within protection policies to automatically take snapshots on a predefined schedule or frequency. Snapshots are not directly accessible to hosts.

To access the data on a snapshot of a storage resource, except for a VM, you can create a read-writable thin clone from the snapshot and map it to the same host or a different host. A thin clone is a space-efficient copy that shares data blocks with its parent object. Multiple thin clones are allowed from a snapshot. Changes to one thin clone do not affect the parent object or other associated thin clones, and the reverse is also true.

For VMs, PowerStore creates snapshots at the VM level, and the snapshot information is reflected in vCenter automatically. Use PowerStore snapshots to protect the guest operating system and applications in VMs. Due to the distributed nature of HDFS, we recommend taking snapshots of all cluster nodes as close together in time as possible.

Some of the PowerStore snapshot and thin clone use cases are as follows:

• Provide quick and easy rollback on operating system or application updates.
• Clone the production environment to development or test environments with ease and without a full-size copy of the data.
• Reduce the complexity and time to refresh the data in a clone environment from the latest snapshot or thin clone.

To perform snapshot operations in PowerStore Manager, you can take the following actions. The clone, refresh, and restore functions do not apply to VMs.

• Take a snapshot from the Protection card on the Overview page of the storage resource.
• Clone a snapshot from the Protection card on the Overview page of the storage resource.
• Delete a snapshot from the Protection card on the Overview page of the storage resource.
• Refresh the data of a storage resource from the More Actions menu of the storage resource or snapshot.
• Restore a storage resource from the More Actions menu of the storage resource or snapshot.
• Configure, view, and manage snapshot rules from the Policies page.

To recover data from a snapshot of a storage resource, except for VM, you can perform the following actions in PowerStore Manager:

• Create a thin clone from a snapshot and map it to a host.
• Use the refresh operation to replace existing data in the volume with the data from a snapshot or thin clone related to the parent storage resource.
• Use the restore operation to replace the data of a parent storage resource with data from an associated snapshot. The restore operation resets the data in the parent storage resource to the point in time the snapshot was taken.


PowerStore also offers PowerStore pstcli and REST APIs that can be used to automate operations in scripts and other programming languages.

To recover a VM from a VM snapshot in vCenter, revert the VM in vCenter using the VM Manage Snapshots operation.

For more information about PowerStore data protection, see the document Dell EMC PowerStore Protecting Your Data.

7.2 AppSync

Dell EMC AppSync™ is optional software that enhances the overall protection of supported applications. With its deep integration with PowerStore and applications such as Oracle® and Microsoft® SQL Server®, AppSync uses the native PowerStore asynchronous replication, snapshot, and thin clone technologies to create and manage local and remote copies of applications. However, AppSync supports only block-storage resources on PowerStore and does not have application integration with Spark or Hadoop. For vVol storage resources, see the Dell EMC RecoverPoint™ for Virtual Machines information in section 7.3.

For more information about PowerStore AppSync integration, see the document Dell EMC PowerStore: AppSync.

7.3 RecoverPoint for Virtual Machines

Dell EMC RecoverPoint for Virtual Machines is optional software that extends data protection and enables disaster recovery for VMware virtualized environments to on-premises or cloud environments. RecoverPoint for Virtual Machines is a software-only solution that protects VMs with local and remote replication. It is storage and application agnostic and supports both synchronous and asynchronous replication on all storage types supported by VMware. It also allows replicating multiple VMs in a consistency group.

PowerStore does not support vVol replication or VM consistency groups. RecoverPoint for Virtual Machines is a great addition if these features are required. Using RecoverPoint, all application cluster nodes can be protected and replicated collectively in a consistency group.

For more information about RecoverPoint for VMs, see the document RecoverPoint for Virtual Machines Administrator’s Guide on Dell Support.

7.4 Hadoop distributed copy and HDFS snapshots

For native Hadoop solutions, consider using Hadoop DistCp and HDFS snapshots. DistCp is a tool that allows Hadoop to copy data from one cluster to another. HDFS snapshots are read-only point-in-time copies of the HDFS file system. More information about DistCp and HDFS snapshots is available at https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html and https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html.
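For example, a minimal sketch of taking an HDFS snapshot and copying a directory to a second cluster (backup-namenode is a placeholder for the destination NameNode; run the commands as a user with the appropriate HDFS permissions):

$ hdfs dfsadmin -allowSnapshot /user/spark
$ hdfs dfs -createSnapshot /user/spark spark-snap-1
$ hadoop distcp hdfs://hadoop-namenode-vm10:9000/user/spark hdfs://backup-namenode:9000/user/spark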


A Configure passwordless SSH

Use the following procedure to configure passwordless SSH for an application user:

1. Generate an SSH key for the spark user.

$ ssh-keygen -b 1024

Press Enter twice to accept the default key file and an empty passphrase.

2. Push the SSH public key to the worker nodes.

for server in spark-wrk-vm10 spark-wrk-vm11 spark-wrk-vm12 spark-wrk-vm13
do
ssh-copy-id $server
done

3. Test the passwordless SSH connection.

for server in spark-wrk-vm10 spark-wrk-vm11 spark-wrk-vm12 spark-wrk-vm13
do
ssh $server uname -a
done

This makes an SSH connection to each worker node without prompting for a password, queries the system hostname with the uname command, and returns the output.


B Python codes

The following code is used in this paper.

B.1 Import .csv files into HDFS

Save the code to a file, for example, import-powerstore-space-metrics.py. Alternatively, copy and paste the code into a JupyterLab notebook.

#!/usr/bin/env python3
# Import modules and create Spark session
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Import csv data") \
    .getOrCreate()

# Define the schema columns and datatypes
schema = StructType() \
    .add("appliance_id", StringType(), True) \
    .add("timestamp", TimestampType(), True) \
    .add("last_logical_provisioned", LongType(), True) \
    .add("last_logical_used", LongType(), True) \
    .add("last_physical_total", LongType(), True) \
    .add("last_physical_used", LongType(), True) \
    .add("max_logical_provisioned", LongType(), True) \
    .add("max_logical_used", LongType(), True) \
    .add("max_physical_total", LongType(), True) \
    .add("max_physical_used", LongType(), True) \
    .add("last_data_physical_used", LongType(), True) \
    .add("max_data_physical_used", LongType(), True) \
    .add("last_efficiency_ratio", DoubleType(), True) \
    .add("last_data_reduction", DoubleType(), True) \
    .add("last_snapshot_savings", DoubleType(), True) \
    .add("last_thin_savings", DoubleType(), True) \
    .add("max_efficiency_ratio", DoubleType(), True) \
    .add("max_data_reduction", DoubleType(), True) \
    .add("max_snapshot_savings", DoubleType(), True) \
    .add("max_thin_savings", DoubleType(), True) \
    .add("last_shared_logical_used", LongType(), True) \
    .add("max_shared_logical_used", LongType(), True) \
    .add("last_logical_used_volume", LongType(), True) \
    .add("last_logical_used_file_system", LongType(), True) \
    .add("last_logical_used_vvol", StringType(), True) \
    .add("max_logical_used_volume", LongType(), True) \
    .add("max_logical_used_file_system", LongType(), True) \
    .add("max_logical_used_vvol", LongType(), True) \
    .add("repeat_count", IntegerType(), True) \
    .add("entity", StringType(), True)

# Read data from csv file into dataframe
df3 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schema) \
    .option("timestampFormat", "MM/dd/yyyy hh:mm:ss a") \
    .load("file:///stage/spark/csv/space-metrics-ps14-2-17-2021.csv")

df3.printSchema()
df3.toPandas()

# Write data to HDFS directory in parquet format
df3.write.parquet("hdfs://hadoop-namenode-vm10:9000/user/spark/pstcli/space-metrics-ps14-2-17-2021.parquet")

# Read data from csv file into dataframe
df4 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schema) \
    .option("timestampFormat", "MM/dd/yyyy hh:mm:ss a") \
    .load("file:///stage/spark/csv/space-metrics-ps12-2-18-2021.csv")

df4.printSchema()

# Show data in pretty format
df4.toPandas()

# Write data to HDFS directory in parquet format
df4.write.parquet("hdfs://hadoop-namenode-vm10:9000/user/spark/pstcli/space-metrics-ps12-2-18-2021.parquet")

B.2 Analyze PowerStore space metrics

Save the code to a file, for example, analyze-space-metrics-2appliances.py. Alternatively, copy and paste the code into a JupyterLab notebook.

#!/usr/bin/env python

# Import pyspark and pandas modules
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, DateType, TimestampType, DoubleType
from pyspark.sql.functions import unix_timestamp
import pyspark.sql.functions as F
import pandas as pd

# Create Spark session
spark = SparkSession \
    .builder \
    .appName("Analyzing Storage Metrics") \
    .getOrCreate()

# Read data into dataframe
df1 = spark.read \
    .format("parquet") \
    .option("header", "true") \
    .load("hdfs://hadoop-namenode-vm10:9000/user/spark/pstcli/space-metrics-ps14-2-17-2021.parquet")

# Read data into dataframe
df2 = spark.read \
    .format("parquet") \
    .option("header", "true") \
    .load("hdfs://hadoop-namenode-vm10:9000/user/spark/pstcli/space-metrics-ps12-2-18-2021.parquet")

# Extract the columns we need and add the appliance name
data1 = df1[["timestamp", "last_physical_used"]].withColumn('appliance', F.lit('WX-0001'))
data2 = df2[["timestamp", "last_physical_used"]].withColumn('appliance', F.lit('WX-0002'))

# Merge the dataframes into one dataframe
df = data1.union(data2)

# Cast datatype on columns
df1 = df.withColumn("last_physical_used", df["last_physical_used"].cast(LongType())) \
    .withColumn("timestamp", unix_timestamp("timestamp", 'MM/dd/yyyy hh:mm:ss a').cast(TimestampType()))

df1.show(5)

# Define date range
dates = ("2021-01-24", "2021-02-12")
date_from, date_to = [F.to_date(F.lit(s)).cast(TimestampType()) for s in dates]
df_range = df1.where((df1.timestamp >= date_from) & (df1.timestamp < date_to))

df_range.show(5)

# Perform aggregations on data from date range
from pyspark.sql.functions import sum, avg, max, min, mean, count

df = df_range.where((df_range.timestamp > date_from) & (df_range.timestamp < date_to)) \
    .groupBy("appliance") \
    .agg(avg("last_physical_used").alias("avg_physical_used"),
         max("last_physical_used").alias("max_physical_used"))

df.show()

# Plot bar chart on average physical used for two appliances
import matplotlib.pyplot as plt

# Data to plot
x_list = [row.appliance for row in df.select('appliance').collect()]
y_list = [(row.avg_physical_used)/1024/1024/1024 for row in df.select('avg_physical_used').collect()]

print(x_list)
print(y_list)

plt.style.use('ggplot')
plt.figure(figsize=(70, 20))
plt.bar(x_list, y_list)

plt.title('Avg Physical Used between dates', fontsize=70)
plt.xlabel('Appliance', fontsize=70)
plt.ylabel('GB', fontsize=70)

plt.xticks(rotation=90, fontsize=60)
plt.yticks(fontsize=70)

plt.autoscale()
plt.show()

# Perform transformation and aggregation on average physical used
from pyspark.sql.functions import sum, avg, max, min, mean, count

df = df_range.where((df_range.timestamp > date_from) & (df_range.timestamp < date_to)) \
    .groupBy(df_range.timestamp.substr(0, 10), df_range.appliance) \
    .agg(avg("last_physical_used").alias("avg_physical_used"),
         max("last_physical_used").alias("max_physical_used")) \
    .withColumnRenamed("substring(timestamp, 0, 10)", "day") \
    .orderBy("day", "appliance")

df.show(10)

# Prepare the data series and remove duplicated day entries for plotting the chart
pdf = df.select("*").toPandas()
dates = df.select("day").toPandas()["day"].drop_duplicates().to_list()
series1 = pdf.loc[pdf["appliance"] == "WX-0001"]["max_physical_used"].to_list()
series2 = pdf.loc[pdf["appliance"] == "WX-0002"]["max_physical_used"].to_list()

# Plot bar chart on max physical used on two appliances over time
from matplotlib import pyplot as plt

plt.rcParams["figure.figsize"] = [16, 8]

plotdata = pd.DataFrame({
    "WX-0001": series1,
    "WX-0002": series2
    }, index=dates
)
plotdata.plot(kind="bar")
plt.title("Max Physical Used")
plt.xlabel("Dates")
plt.ylabel("Max Physical Used in bytes")


C Additional resources

C.1 Technical support and resources

Dell.com/support is focused on meeting customer needs with proven services and support.

Storage technical documents and videos provide expertise that helps to ensure customer success on Dell EMC storage platforms.

The PowerStore Info Hub provides detailed documentation on how to install, configure, and manage PowerStore systems.

C.2 Other resources

• http://spark.apache.org/
• https://hadoop.apache.org/
• https://github.com/CODAIT/spark-bench
• https://codait.github.io/spark-bench/
• https://github.com/dell/ansible-powerstore
• https://docs.ansible.com/ansible/latest/collections/community/vmware/index.html
• https://docs.ansible.com/ansible/latest/
• https://galaxy.ansible.com/
• https://jupyter.org/
• https://pandas.pydata.org
• https://matplotlib.org
• https://www.statsmodels.org

C.3 Ansible resources

The Ansible module for PowerStore and documentation is available on the Dell GitHub page https://github.com/dell/ansible-powerstore. For downloadable example codes, go to https://github.com/dell/ansible-storage-automation/tree/master/powerstore. Also, go to the Dell EMC Automation Community, https://www.dell.com/community/Automation/bd-p/Automation, where customers can participate in discussions, share their knowledge, ask questions, and provide feedback.

For information about Ansible and VMware modules, go to https://docs.ansible.com/ansible/latest/collections/community/vmware/index.html and https://docs.ansible.com/ansible/latest/.

For information about the PowerStore command line interface, see the CLI reference guide at https://downloads.dell.com/manuals/common/pwrstr-clirefg_en-us.pdf.


For information about the PowerStore REST API, see the Developers Guide at https://downloads.dell.com/manuals/common/pwrstr-apig_en-us.pdf. Also, PowerStore has an integrated online REST API interface which can be accessed in a web browser by going to https://$POWERSTORE_MANAGEMENT_IP/swaggerui. See Figure 22.

PowerStore online REST API interface
