Technical White Paper
Dell EMC PowerStore: Apache Spark Solution Guide

Abstract This document provides a solution overview for Apache Spark running on a Dell EMC™ PowerStore™ appliance.

June 2021

H18663

Revisions

Date        Description
June 2021   Initial release

Acknowledgments

Author: Henry Wong

This document may contain certain words that are not consistent with Dell's current language guidelines. Dell plans to update the document over subsequent future releases to revise these words accordingly.

This document may contain language from third party content that is not under Dell's control and is not consistent with Dell's current guidelines for Dell's own content. When such third party content is updated by the relevant third parties, this document will be revised accordingly.

The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

Copyright © 2021 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners. [6/9/2021] [Technical White Paper] [H18663]


Table of contents

Revisions
Acknowledgments
Table of contents
Executive summary
Audience
1 Introduction
1.1 PowerStore overview
1.2 Apache Spark overview
1.3 Apache Hadoop Distributed File System overview
1.4 The advantages of Spark and Hadoop on PowerStore
1.4.1 AppsON brings applications closer to the infrastructure and storage
1.4.2 Agile infrastructure, flexible scaling on a high-performing storage and compute platform
1.4.3 Mission-critical high availability and fault-tolerant platform
1.4.4 PowerStore inline data reduction reduces storage consumption and cost
1.4.5 Efficient and convenient snapshot data backup
1.4.6 Secure data protection with peace of mind
1.4.7 Unified infrastructure and services management
1.4.8 Spark value and future expansion
1.5 Terminology
2 Sizing considerations
3 Deploying a Spark cluster with HDFS
3.1 Planning for the virtual machines that run Spark and Hadoop
3.1.1 PowerStore X model appliance
3.1.2 PowerStore storage containers and virtual volumes
3.1.3 Creating virtual machines on PowerStore X model appliance
3.2 Installation and configuration of Apache Hadoop
3.2.1 Installing Hadoop
3.2.2 Configuring Hadoop HDFS cluster
3.3 Installation and configuration of Apache Spark
3.3.1 Installing Spark
3.3.2 Configuring a Spark standalone cluster
3.3.3 Configuring Spark History Server
4 Testing Spark with Spark-bench
4.1 Installing Spark-bench tool
4.1.1 Installation prerequisites
4.1.2 Installing Spark-bench
4.2 Running Spark-bench workloads
4.2.1 Generate KMeans dataset
4.2.2 Run KMeans workload
4.2.3 Spark memory and CPU cores
4.2.4 Spark network timeout
4.2.5 Monitoring Spark applications
5 Interactive analysis of PowerStore metrics with Jupyter notebook
5.1 Installing prerequisite software
5.1.1 JupyterLab
5.1.2 Python modules
5.1.3 PowerStore command-line interface (CLI)
5.2 Extract PowerStore space metrics
5.3 Import PowerStore space metrics into HDFS
5.4 Perform analysis on the PowerStore space metrics
6 Automation
7 Data protection
7.1 Snapshots and thin clones
7.2 AppSync
7.3 RecoverPoint for Virtual Machines
7.4 Hadoop distributed copy and HDFS snapshots
A Configure passwordless SSH
B Python codes
B.1 Import .csv files into HDFS
B.2 Analyze PowerStore space metrics
C Additional resources
C.1 Technical support and resources
C.2 Other resources
C.3 Ansible resources


Executive summary

Apache® Spark® has seen tremendous growth in the past few years. It is the leading platform for distributed processing because of its innovation, speed, and developer-friendly framework. This document offers a high-level overview of the Dell EMC™ PowerStore™ appliance and the benefits of running Apache Spark and Hadoop® HDFS on PowerStore. The document also provides installation, configuration, testing, and a simple use case for Spark and HDFS on PowerStore.

Audience

This document is intended for IT administrators, storage architects, partners, and Dell Technologies™ employees. This audience also includes individuals who may evaluate, acquire, manage, operate, or design a Dell EMC networked storage environment using PowerStore systems.


1 Introduction This document was developed using the PowerStore X model appliance, Apache Spark, Apache HDFS, and Red Hat® Enterprise Linux®. This section provides an overview of PowerStore, Apache Spark, and Apache HDFS.

1.1 PowerStore overview PowerStore achieves new levels of operational simplicity and agility. It uses a container-based microservices architecture, advanced storage technologies, and integrated machine learning to unlock the power of your data. PowerStore is a versatile platform with a performance-centric design that delivers multidimensional scale, always-on data reduction, and support for next-generation media.

PowerStore brings the simplicity of public cloud to on-premises infrastructure, streamlining operations with an integrated machine-learning engine and seamless automation. It also offers predictive analytics to easily monitor, analyze, and troubleshoot the environment. PowerStore is highly adaptable, providing the flexibility to host specialized workloads directly on the appliance and modernize infrastructure without disruption. It also offers investment protection through flexible payment solutions and data-in-place upgrades.

The PowerStore platform is available in two different product models: PowerStore T models and PowerStore X models. PowerStore T models are bare-metal, unified storage arrays which can service block, file, and VMware® vSphere® Virtual Volumes™ (vVols) resources along with numerous data services and efficiencies. PowerStore X model appliances enable running applications directly on the appliance through the AppsON capability. A native VMware ESXi™ layer runs embedded applications alongside the PowerStore operating system, all in the form of virtual machines. This feature adds to the traditional storage functionality of PowerStore X model appliances, and supports serving external block and vVol storage to servers with multiple protocols.

For more information about PowerStore T models and PowerStore X models, see the documents Dell EMC PowerStore: Introduction to the Platform and Dell EMC PowerStore Virtualization Infrastructure Guide.

1.2 Apache Spark overview Apache Spark is an open-source distributed processing engine designed to be high performing, scalable, and capable of processing massive amounts of data. It can perform a wide range of analytic tasks such as SQL queries, streaming, and machine learning.

Spark supports several popular programming languages such as Java, Scala, Python, and R, and provides a unified and consistent set of APIs for these programming languages. Also, it has an extensive set of libraries for SQL (DataFrames), machine learning (MLlib), Spark Streaming, and GraphX. These capabilities allow developers to easily build Spark applications by combining different APIs, libraries, and functions.

Spark is built for speed and high performance. Spark loads the entire dataset in memory on the cluster and performs computation on it. The data is kept in memory to minimize disk access. Spark performs exceptionally well for iterative computations that require passing the same data multiple times. Machine learning is a great example of such iterative computations.


Spark supports a wide range of storage systems such as local file systems, Apache Hadoop HDFS, Apache Hive, Apache HBase, Cassandra, and more. Figure 1 shows the Spark components in blue. For more information, see the corresponding documentation on https://spark.apache.org.

Spark components: language APIs (SQL, Java, Python, Scala, R), libraries (DataFrames, Streaming, GraphX, MLlib), the Spark Core execution engine, and storage/data sources (local file system, HDFS, HBase, Hive, Cassandra, and others)

Spark supports several cluster managers including the Spark standalone cluster manager, Apache Hadoop YARN, Apache Mesos, and Kubernetes. This paper focuses on the Spark standalone cluster.


A Spark standalone cluster (see Figure 2) consists of one master node and multiple worker nodes. The cluster manager on the master node manages the cluster resources, such as CPUs and memory, and assigns application tasks to the worker nodes. A Spark application is a driver program that establishes a Spark session with the cluster manager and requests resources to perform multiple tasks on the worker nodes. Executors are Java virtual machine (JVM) processes on the worker nodes that perform the tasks and report the status and results back to the driver program.

Spark standalone cluster overview: the application driver program establishes a Spark session with the cluster manager on the Spark master node, and executors on the Spark worker nodes run the assigned tasks
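As a brief illustration of this flow, launching an interactive shell against a standalone cluster starts a driver program that registers with the cluster manager and is assigned executors on the worker nodes. This is a minimal sketch; <master-host> is a placeholder for the Spark master node configured later in section 3.3.2.

$ spark-shell --master spark://<master-host>:7077

The Executors tab of the application web UI (port 4040 on the driver by default) then shows the executors that the worker nodes launched for this driver.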

1.3 Apache Hadoop Distributed File System overview Apache Hadoop is an open-source software suite and framework for big-data processing. Hadoop Distributed File System (HDFS) is one of the core components of Hadoop. It is a distributed file system designed to be massively scalable and fault tolerant, and to provide high throughput. HDFS can scale up to hundreds of servers and supports large files. It is well suited for applications, such as Spark, that require access to large datasets. HDFS files are divided into blocks and stored on multiple servers. Data block replication (the replication factor) places replicas of each block across the cluster to increase data availability and read performance.

Other Hadoop core components include the following:

• Hadoop YARN: A cluster and resource manager
• Hadoop MapReduce: A distributed parallel data-processing system
• Hadoop Common: Core common utilities shared by other modules

HDFS provides the persistent storage and data source for Spark. The configuration in this document does not use Hadoop YARN and MapReduce.

A Hadoop cluster consists of a NameNode and multiple DataNodes. The NameNode maintains the files and directory information of the distributed file system and tracks where the data blocks are located within the cluster. The DataNodes store the data blocks and the replicas in the local file systems.


1.4 The advantages of Spark and Hadoop on PowerStore Both Spark and HDFS share a similar distributed architecture that requires a powerful, highly scalable, and flexible infrastructure. PowerStore is performance-optimized for any workload, and its adaptable scale-up and scale-out architecture complements the distributed model of these applications. This section highlights the PowerStore features that benefit and extend the application environment.

1.4.1 AppsON brings applications closer to the infrastructure and storage Bringing applications closer to data increases density and simplifies infrastructure operations. The PowerStore AppsON capability integrates with VMware vSphere®, resulting in streamlined management in which storage resources plug directly into the virtualization layer. Using VMware as the onboard application environment results in unmatched simplicity, since support is inherently available for any standard VM-based applications. When a new PowerStore X model is deployed, the VASA provider is automatically registered, and the datastore is created, eliminating manual steps and saving time. PowerStore seamlessly integrates the VMware ESXi software into the same hardware. Two ESXi nodes are embedded inside the appliance and have direct access to the same storage resources. This close integration allows applications such as Spark and Hadoop to take full advantage of server and storage virtualization with simplified deployment and management. AppsON is available exclusively on the PowerStore X model.

1.4.2 Agile infrastructure, flexible scaling on a high-performing storage and compute platform PowerStore provides flexible scaling with ease of management that complements the Spark and Hadoop scale-up and scale-out distribution model. The integrated hypervisor dynamically scales up the cluster nodes when the workload requires it, while you can rapidly provision new nodes on the same appliance or on other appliances in a different location.

Big-data applications require large amounts of data and computational power for analytics, machine learning, model training, and other workloads. With a PowerStore appliance, administrators can scale up the storage capacity by adding disks and disk expansion enclosures without service interruption at any time. You can also configure multiple PowerStore appliances into a cluster to increase CPUs, memory, storage capacity, and front-end connectivity. Clustering simplifies and centralizes the management of multiple appliances from PowerStore Manager, a single HTML5-based management interface. A cluster can consist of up to four PowerStore T appliances or four PowerStore X appliances. Each appliance within the cluster can have different configurations of CPUs, memory, NVMe drives, and expansion enclosures.

The NVMe architecture is designed for next-generation NVMe-based storage and takes advantage of a low-overhead NVRAM cache. PowerStore is engineered to handle the most demanding workloads.

1.4.3 Mission-critical high availability and fault-tolerant platform PowerStore provides a high level of stability and reliability for Spark and HDFS. At the hardware level, PowerStore is highly available and fault tolerant. It monitors the storage devices continuously, and it automatically relocates data from failing devices to avoid data loss. The PowerStore X model appliance includes two ESXi nodes and redundant hardware components. The nondisruptive upgrade (NDU) feature further increases overall PowerStore availability. The updates take place on the nodes in a rolling fashion. NDU supports PowerStore software releases, hotfixes, and hardware and disk firmware.

The dynamic resiliency engine (DRE) feature automatically protects and repairs the underlying storage from drive failures. Administrators are not required to manually configure or manage the protection settings for the drives.


To support high-value business workloads and service requirements on the application level, it is essential to protect and ensure the availability of the Spark and HDFS nodes. The Spark master node and the HDFS NameNode are central to all operations of the application in the cluster. When they become inaccessible, all applications and storage operations are affected. Also, if a DataNode is not reachable for an extended period, the NameNode determines the blocks on the failed node and starts making copies of the blocks from other replicas until the replication factor is met.

With standard VMware vSphere High Availability (HA) integrated into PowerStore, the embedded VMware ESXi™ hypervisor automatically restarts or migrates failed Spark and HDFS servers to a different ESXi node. This helps restore Spark and HDFS to their full operational capacity and minimizes the chance of DataNodes being marked dead.

To achieve an even higher level of redundancy and application availability, you can deploy the Spark cluster and HDFS cluster across multiple PowerStore appliances in different racks, floors, or locations. PowerStore improves application availability and provides unparalleled flexibility and mobility to relocate and move across data centers and appliances.

1.4.4 PowerStore inline data reduction reduces storage consumption and cost Data science and big-data applications continuously pull in a tremendous amount of data from various sources. To help reduce storage consumption and cost, the PowerStore inline data-reduction feature maximizes space savings by combining both software data deduplication and hardware compression. Data reduction works seamlessly in the background, is always enabled, and cannot be disabled. Since data reduction is always active in PowerStore, enabling application or operating system compression may not provide additional savings.

1.4.5 Efficient and convenient snapshot data backup PowerStore provides Spark and HDFS with extra data protection through array-based snapshots. A PowerStore snapshot is a point-in-time copy of the data. The snapshots are space efficient and require seconds to create. Snapshot data are exact copies of the source data and can be used for application testing, backup, or DevOps. Because of the tight integration with VMware vSphere, PowerStore can take vVol VM snapshots directly from PowerStore Manager using a protection policy schedule or on demand. You can view the VM snapshot information in PowerStore and vCenter.

1.4.6 Secure data protection with peace of mind With high-value data driving business applications, data security is a top concern for all organizations. Lost or stolen data can seriously damage the reputation of an organization and result in huge financial costs and loss of customer trust. Dell Technologies engineered PowerStore with Data at Rest Encryption (D@RE) which uses self-encrypting drives and supports array-based, self-managed keys. When D@RE is activated, data is encrypted as it is written to disk using the 256-bit Advanced Encryption Standard (AES). PowerStore D@RE provides this data security benefit to Spark applications while eliminating application overhead, performance penalties, and administrative overhead that is typically associated with software-based solutions.

1.4.7 Unified infrastructure and services management PowerStore provides deep integration with VMware management tools and services with Dell EMC Virtual Storage Integrator (VSI), VMware vRealize® Operations Manager (vROps), VMware vRealize Orchestrator (vRO), and VMware Storage Replication Adapter (SRA). You can easily incorporate ESXi on PowerStore X models into your existing vCenter and manage all VMware infrastructure and services from a unified management platform.


1.4.8 Spark value and future expansion Big-data platforms such as Spark and Hadoop create enormous value for organizations. As the value and scale of this data grows, it is critical to have a future-proof platform that is easy to manage. The platform must also provide technical innovation for future growth, and support the application architecture. Spark and Hadoop on PowerStore bring IT organizations the ability to be agile, efficient, and responsive to business demands.

1.5 Terminology The following terms are used with PowerStore.

Appliance: Solution containing a base enclosure and attached expansion enclosures. The size of an appliance could be only the base enclosure or the base enclosure plus expansion enclosures.

PowerStore node: Storage controller that provides the processing resources for performing storage operations and servicing I/O between storage and hosts. Each PowerStore appliance contains two nodes.

Base enclosure: Enclosure containing both nodes (node A and node B) and 25 NVMe drive slots.

Expansion enclosure: Enclosures that can be attached to a base enclosure to provide additional storage.

Fibre Channel (FC) protocol: Protocol used to perform SCSI commands over a Fibre Channel network.

iSCSI: Provides a mechanism for accessing block-level data storage over network connections.

NDU: A nondisruptive upgrade (NDU) updates PowerStore and maximizes its availability by performing rolling updates. This includes updates for PowerStore software releases, hotfixes, and hardware and disk firmware.

NVMe: Non-Volatile Memory Express is a communication interface and driver for accessing nonvolatile storage media such as solid-state drives (SSD) and SCM drives through the PCIe bus.

NVMe over Fibre Channel (NVMe-FC): Allows hosts to access storage systems across a network fabric with the NVMe protocol using Fibre Channel as the underlying transport.

NVRAM: Nonvolatile random-access memory is persistent random-access memory that retains data without an electrical charge. NVRAM drives are used in PowerStore appliance as additional system write caching.

Volume: A block-level storage device that can be shared out using a protocol such as iSCSI or Fibre Channel.

Snapshot: A point-in-time view of data that is stored on a storage resource. You can recover files from a snapshot, restore a storage resource from a snapshot, or provide access to a host.

Storage container: A VMware term for a logical entity that consists of one or more capability profiles and their storage limits. This entity is known as a vVol datastore when it is mounted in vSphere.

PCIe: Peripheral Component Interconnect Express is a high-speed serial computer expansion bus standard.

PowerStore Manager: An HTML5 management interface for creating storage resources and configuring and scheduling protection of stored data on PowerStore. PowerStore Manager can be used for all management of PowerStore native replication.


PowerStore T model: Container-based storage system that is running on purpose-built hardware. This storage system supports unified (block and file) workloads, or block-optimized workloads.

PowerStore X model: Container-based storage system that runs inside a virtual machine that is deployed on a VMware hypervisor. Besides offering block-optimized workloads, PowerStore also allows you to deploy applications directly on the array.

RecoverPoint for Virtual Machines: Protects VMs in a VMware environment with VM-level granularity and provides local or remote replication for any point-in-time recovery. This feature is integrated with VMware vCenter and has integrated orchestration and automation capabilities.

SCM: Storage-class memory, also known as persistent memory, is an extremely fast storage technology supported by PowerStore appliance.


Storage Policy Based Management (SPBM): Using policies to control storage-related capabilities for a VM and ensure compliance throughout its life cycle.

Thin clone: A read/write copy of a thin block storage resource (volume, volume group, or vSphere VMFS datastore) that shares blocks with the parent resource.

User snapshot: Snapshot that is created manually by the user or by a protection policy with an associated snapshot rule. This snapshot type is different than an internal snapshot, which is taken automatically by the system with asynchronous replication.

Virtual machine (VM): An operating system running on a hypervisor, which is used to emulate physical hardware.

vCenter: VMware vCenter server provides a centralized management platform for VMware vSphere environments.

VMware vSphere Virtual Volumes (vVols): A VMware storage framework which allows VM data to be stored on individual vVols. This ability allows for data services to be applied at a VM-level of granularity and according to SPBM. vVols can also refer to the individual storage objects that are used to enable this functionality.

vSphere API for Array Integration (VAAI): A VMware API that improves ESXi host utilization by offloading storage-related tasks to the storage system.

vSphere API for Storage Awareness (VASA): A VMware vendor-neutral API that enables vSphere to determine the capabilities of a storage system. This feature requires a VASA provider on the storage system for communication.


2 Sizing considerations Before you select the PowerStore model, storage media, capacity, and connectivity options, you must first understand the target Spark and HDFS environment. There are many factors to consider, including but not limited to the following:

• Consider the amount of data to keep and future data growth. Spark typically requires minimal storage. It performs computation in memory and only requires temporary disk storage when the data does not fit in memory. HDFS NameNodes require a small amount of storage for storing the HDFS metadata and transaction logs. HDFS DataNodes require a large amount of storage because they manage and store the data on disk.
• Consider the HDFS replication requirement. The HDFS replication factor determines how many copies of the data are replicated across the DataNodes in the cluster (see the worked example after this list).
• Understand the workload patterns. For Spark, computational power and memory are the most important factors for performance. For Hadoop, disk space, I/O bandwidth, and computational power are important factors.
• Use a 10 Gb or 25 Gb network to provide sufficient network bandwidth and reduce latency, especially for HDFS replication.
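As a simplified, hypothetical illustration of how the replication factor drives capacity: storing a 10 TB dataset with the default replication factor of 3 consumes roughly 10 TB × 3 = 30 TB of raw DataNode capacity, before accounting for temporary files, data growth, and free-space headroom. Planning for 20 to 30 percent headroom would suggest roughly 36 TB to 39 TB of usable DataNode storage for that dataset.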

Also, consider the following resources for the Spark and Hadoop nodes on a PowerStore X appliance:

• Determine how many VMs are needed, and the CPU and memory requirements of each VM.
• While it is possible to run Spark and Hadoop on separate VMs, we recommend placing the data source, HDFS, close to the Spark worker nodes for best performance. If it is not possible to co-locate the Spark worker node and the HDFS DataNode on the same VM, ensure the Spark and Hadoop VMs are close together and connected over a fast network.
• Prepare for the event in which one of the ESXi nodes fails, and decide whether CPU and memory resources should be reserved on an ESXi node to maintain full performance for the applications.
• Do not overcommit CPU and memory resources on PowerStore ESXi nodes in any production or mission-critical environment. However, this practice might be acceptable in test or development environments where a guaranteed performance level is not a concern.

Review the Spark documentation and Hadoop documentation to learn about other software and hardware requirements.

The Dell Technologies account team has access to a suite of tools, such as LiveOptics and CloudIQ, that are designed to help gather and analyze workload and performance data in an existing environment. The account team can use the PowerStore sizer to estimate the storage needs.


3 Deploying a Spark cluster with HDFS The following sections describe what a Spark cluster environment looks like and demonstrate how to set up a simple Spark cluster with HDFS on a PowerStore X appliance.

3.1 Planning for the virtual machines that run Spark and Hadoop Spark has several deployment modes. The standalone cluster deployment mode, which includes a cluster manager, is the simplest way to deploy Spark in a private cluster. You can also deploy Spark on top of other cluster managers including Apache Hadoop YARN, Apache Mesos, and Kubernetes. This paper presents the Spark standalone cluster deployment in a private cluster on a PowerStore X appliance.

Spark is a compute and analytics engine and does not handle the storage of the data. A Spark job reads the data from various sources into memory, processes it and keeps it in memory, and optionally writes the result to storage systems to persist the data. Spark supports many storage systems such as Linux file systems, Hadoop Distributed File System (HDFS), Apache Hive, Cassandra, and others. This paper focuses on setting up a Spark cluster with HDFS.

A Spark standalone cluster consists of one master node and multiple worker nodes. The master node manages the cluster resources and coordinates running Spark applications across the worker nodes. Spark applications are split into multiple tasks which are performed by executors (Java processes) on the worker nodes. One or multiple executors (Java processes) might be launched on each worker node.

HDFS requires a NameNode and multiple DataNodes. The NameNode maintains the files and directory information of the distributed file system and tracks where the data is located within the cluster in the DataNodes.

To increase Spark processing power, you can add more worker nodes to the cluster which increases the total computational power and memory available for executors. This addition enables more parallel tasks to be performed across the cluster. For storage, more DataNodes provide more storage processing capability and storage capacity in the cluster.

With a PowerStore X appliance, you can deploy both Spark nodes and Hadoop nodes as virtual machines on the appliance. This simplifies the provisioning and management of the virtual machines and storage.

In this example, the Spark cluster consists of five virtual machines, and the Hadoop cluster consists of six virtual machines. While Spark nodes and Hadoop nodes can run on different virtual machines, co-locating them on the same virtual machines brings the data closer to Spark and reduces the data access time. Table 1 summarizes the roles and software installed on each virtual machine.

Spark and HDFS virtual machine specifications

Virtual machine        CPU   RAM     Software                      Role
hadoop-namenode-vm10   16    32 GB   Hadoop                        HDFS NameNode
spark-prim-vm10        16    32 GB   Spark, Hadoop                 Spark primary server, HDFS DataNode
spark-wrk-vm10         24    64 GB   Spark, Hadoop                 Spark worker node, HDFS DataNode
spark-wrk-vm11         24    64 GB   Spark, Hadoop                 Spark worker node, HDFS DataNode
spark-wrk-vm12         24    64 GB   Spark, Hadoop                 Spark worker node, HDFS DataNode
spark-wrk-vm13         24    64 GB   Spark, Hadoop                 Spark worker node, HDFS DataNode
spark-bench-vm10       16    32 GB   Spark, Spark-bench, Hadoop    Spark-bench driver, JupyterLab server,
                                                                   PowerStore CLI client, HDFS client

3.1.1 PowerStore X model appliance AppsON is a unique PowerStore X model feature where a VMware hypervisor running vSphere ESXi v6.7 is embedded on the two internal hosts. This feature allows applications to run in VMs directly on the appliance. The appliance offers deep integration with vSphere and is fully compatible with VMware tools. During the initial configuration of the appliance, the internal ESXi hosts are configured to register with a vCenter provided by the customer. The initialization automatically applies performance optimizations, or you can apply them manually afterward. These optimizations include the following:

• Create multiple iSCSI targets on the appliance
• Configure additional network ports
• Optimize ESXi multipath settings for the appliance
• Increase ESXi queue depths
• Configure jumbo frames for cluster and iSCSI networks

For details about the performance best practices, see the following documents.

• PowerStore: PowerStore X Performance Best Practice Tuning
• Dell EMC PowerStore Virtualization Guide
• Dell EMC PowerStore: VMware vSphere Best Practices

Dell Technologies offers various media options and hardware specifications to choose from. To learn more about the full PowerStore family, go to the PowerStore product page.


3.1.2 PowerStore storage containers and virtual volumes On PowerStore X models, the VASA provider is automatically registered with vSphere, and the default storage container is mounted automatically on the internal ESXi nodes through the iSCSI protocol. See Figure 3. For external ESXi hosts, PowerStore can serve block volumes using Fibre Channel (FC), iSCSI, or NVMe over Fabrics (NVMe-OF). PowerStore can also serve vVol storage containers to the external hosts using FC or iSCSI. However, you must manually register the VASA provider, and you must mount the storage containers manually on the external ESXi hosts.

PowerStore automatically tracks the vVols that belong to each VM. The PowerStore Manager UI shows these vVols objects under the Virtual Machines view. See Figure 4 and Figure 5.

For more information about vVols, storage containers, and vSphere VASA, see the Dell EMC PowerStore Virtualization Guide.

vVol-based storage container automatically mounted in vSphere on the PowerStore X model

Default storage container on the PowerStore X model appliance


Listing vVols objects that are associated with a VM


3.1.3 Creating virtual machines on PowerStore X model appliance Using vCenter, the virtual machines for the Spark nodes and the Hadoop nodes are deployed directly on the PowerStore X internal ESXi hosts based on the information in Table 1 (virtual machine specifications) and Table 2 (file system layout). The two PowerStore X internal ESXi hosts are presented and managed in vCenter like other external ESXi hosts (see Figure 6). You can also view the virtual machines in the PowerStore Manager UI (see Figure 7).

We recommend creating the virtual machines from a template which ensures consistency and faster setup. You can import a virtual machine or template from an existing environment to speed up the deployment process. In this example, a Red Hat Enterprise Linux 7.9 template is established with the packages and configuration outlined in section 3.1.3.1.

PowerStore X internal ESXi hosts and virtual machines in vCenter, showing the internal ESXi nodes, the PowerStore controller VMs, and the application VMs


Virtual machines in PowerStore Manager

3.1.3.1 Guest virtual machine operating system Spark runs on Windows, macOS, and Linux. Apache Hadoop supports Linux and Windows but is mostly deployed on Linux. In this example, Red Hat Enterprise Linux (RHEL) is used for all applications. A VM template is created with RHEL plus the following software and configurations. All application VMs are created from the template to ease deployment and ensure consistency:

• Red Hat Enterprise Linux 7.9 Server with Graphical Desktop
• chrony
• open-vm-tools
• lsscsi
• sg3_utils
• autofs
• iscsi-initiator-utils
• java-1.8.0-openjdk
• java-1.8.0-openjdk-devel
• python3
• python3-pip
• python3-setuptools
• zlib
• zlib-devel
• ncurses
• ncurses-devel
• gcc
• openssl-devel
• -devel
• Latest updates from Red Hat software repository
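For convenience, the supporting packages in this list can be installed in a single step. The following is a minimal sketch that assumes the standard RHEL 7 package names; the server-with-GUI environment is selected during the operating-system installation, and the truncated -devel entry above is omitted:

# yum install -y chrony open-vm-tools lsscsi sg3_utils autofs iscsi-initiator-utils \
  java-1.8.0-openjdk java-1.8.0-openjdk-devel python3 python3-pip python3-setuptools \
  zlib zlib-devel ncurses ncurses-devel gcc openssl-devel
# yum update -y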


This example applies the following configurations to Red Hat Enterprise Linux:

• Disable optional services

for service in firewalld avahi-daemon irqbalance iptables ip6tables
do
  systemctl disable $service
  systemctl stop $service
done

• Configure the virtual machines to use network time servers. It is a best practice to keep the system clock synchronized across the cluster nodes. chrony is a common time synchronization service available on Linux. Add the network time server IP address in /etc/chrony.conf and enable the service.

server $time_server_1_ip iburst
server $time_server_2_ip iburst

# systemctl enable chronyd --now

• Ensure the applications have enough system resources to run on the VMs by increasing the ulimit limits. Use the following as a starting point, and adjust the settings if necessary. Set the following in /etc/security/limits.conf.

* soft nofile 128000
* hard nofile 128000
* hard nproc 16000
* hard fsize -1
* soft core unlimited
* soft data unlimited
* hard data unlimited
* soft stack unlimited
* hard stack unlimited

3.1.3.2 File system layout Each virtual machine is configured with one or more paravirtual SCSI controllers and several vVol-based virtual disks provisioned from the storage container on the PowerStore appliance. The virtual disks are formatted with the XFS file system on RHEL.


Table 2 shows an example of the file system layout for each application VM. It is a best practice to separate application data from the operating system. One or more file systems are dedicated on each VM for HDFS use.

Spark and Hadoop file system layout

hadoop-namenode-vm10 (2 paravirtual SCSI controllers):
- /dev/sda, 50 GB, / and swap: used for the operating system and application binaries
- /dev/sdb, 100 GB, /data/1: used for the NameNode

spark-prim-vm10, spark-wrk-vm10, spark-wrk-vm11, spark-wrk-vm12, spark-wrk-vm13 (3 paravirtual SCSI controllers):
- /dev/sda, 50 GB, / and swap: used for the operating system and application binaries
- /dev/sdb, 100 GB, /data/1: used for data on the DataNodes
- /dev/sdc, 100 GB, /data/2: used for data on the DataNodes

spark-bench-vm10 (1 paravirtual SCSI controller):
- /dev/sda, 50 GB, / and swap: used for the operating system and application binaries
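As a hedged example of preparing one of the data disks, assuming /dev/sdb is the 100 GB virtual disk mounted at /data/1 as in the layout above:

# mkfs.xfs /dev/sdb
# mkdir -p /data/1
# mount /dev/sdb /data/1
# echo "/dev/sdb  /data/1  xfs  defaults  0 0" >> /etc/fstab

Using the UUID reported by blkid in /etc/fstab, instead of the device name, is more robust if the device ordering changes across reboots.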

3.1.3.3 Networking The PowerStore X model creates a vSphere distributed switch (vDS) and a set of preconfigured distributed port groups for internal communications during the initial configuration process. Each internal ESXi node has two 10 Gb connections for vDS uplinks and one 1 Gb connection for the management network. The preconfigured port groups are reserved for PowerStore use only. To allow guest VMs to communicate with other systems on the network, create a new distributed virtual switch (DVS) port group and assign a different VLAN to the port group. Configure each VM with a virtual network adapter on this user-defined DVS port group. See Figure 8.


User defined DVS port groups for VM network communication

vSphere distributed switch and distributed port groups for PowerStore X

3.2 Installation and configuration of Apache Hadoop This section shows the basic installation and configuration of a Hadoop cluster. For more advanced topics and configuration, see the documentation at https://hadoop.apache.org/docs/current/.

3.2.1 Installing Hadoop The Apache Hadoop project offers several binary versions and the source code on the project website. For simplicity and ease of installation, download one of the prebuilt binaries from http://hadoop.apache.org/releases.html. To decide which Hadoop version to use, check the Spark download site at http://spark.apache.org/downloads.html to verify which version of Hadoop is supported. In this example, Hadoop release 3.2.2 is chosen because it is supported by the Spark prebuilt version 3.0.2. The following steps show an example of installing a prebuilt version of Hadoop.

Perform the following steps on each Hadoop node as the root user:

1. Install Java JDK.

# yum install java-1.8.0-openjdk # yum install java-1.8.0-openjdk-devel

2. Install Python3.

# yum install python3 python3-pip python3-setuptools


3. Create a hdfs user and group.

When deploying Hadoop on multiple VMs, ensure the hdfs user id (UID) and group id (GID) are the same across all cluster nodes.

# groupadd -g 3000 hdfs
# useradd -u 3000 -g 3000 -d /home/hdfs hdfs
# passwd hdfs

4. Download Hadoop from http://hadoop.apache.org/releases.html and save the installation file in /usr/local.
5. Extract the software into a subdirectory in /usr/local.

# cd /usr/local
# tar xzvf hadoop-3.2.2.tar.gz

6. Assign ownership to hdfs user.

# chown -R hdfs:hdfs /usr/local/hadoop-3.2.2

7. Optionally, create a symbolic link to the software directory. Configure a symbolic link to point to the active version of the software. This action ensures a consistent path to the Hadoop program and configuration files between different versions of the software.

# ln -s /usr/local/hadoop-3.2.2 /usr/local/hadoop

8. Configure the following environment variables for the hdfs user in $HOME/.bashrc.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export _JAVA_OPTIONS="-Xmx4g -Djava.awt.headless=true"
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${PATH}
export HADOOP_MAPRED_HOME=/usr/local/hadoop

HADOOP_HOME          Set the location of the Hadoop software
HADOOP_CONF_DIR      Set the location of the Hadoop configuration files
JAVA_HOME            Set the location of the Java software
_JAVA_OPTIONS        Set the Java heap size and other Java options
PATH                 Add the locations of the Hadoop programs to the search path
HADOOP_MAPRED_HOME   Set the location of the MapReduce programs
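After sourcing the new environment as the hdfs user, a quick sanity check is to print the Hadoop version, which confirms that the binaries are on the search path (the version reported depends on the release installed):

$ source ~/.bashrc
$ hadoop version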


3.2.2 Configuring Hadoop HDFS cluster You can deploy Apache Hadoop on a single node as a standalone instance or across multiple nodes in a cluster setting. The standalone setup is great for performing quick tests or debugging without the overhead of bringing up a full cluster. This paper focuses on setting up a Hadoop cluster with multiple VMs.

1. Set up the Hadoop worker file /usr/local/hadoop/etc/hadoop/workers. This file contains a list of the HDFS DataNodes in the cluster.

$ cd /usr/local/hadoop/etc/hadoop
$ cat workers
spark-prim-vm10
spark-wrk-vm10
spark-wrk-vm11
spark-wrk-vm12
spark-wrk-vm13

2. Configure the Hadoop environment settings in /usr/local/hadoop/etc/hadoop/hadoop-env.sh. This file contains environment variables for the Hadoop daemons such as the Java process options and Hadoop software location.

export HDFS_NAMENODE_USER=hdfs
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_HEAPSIZE_MAX=4g
export HDFS_NAMENODE_OPTS="-Xmx4g -Djava.awt.headless=true -XX:+UseParallelGC"
export HDFS_DATANODE_OPTS="-Xmx4g -Djava.awt.headless=true -XX:+UseParallelGC"

3. Configure the Hadoop core configuration settings in /usr/local/hadoop/etc/hadoop/core-site.xml. This file contains core site settings such as I/O, security, and others. There are hundreds of configurable attributes, and many have default values that are not listed in the core-site.xml file. To see the complete list of these attributes and their descriptions, go to http://hadoop.apache.org and search for the core-default.xml documentation. In this example, the fs.defaultFS and hadoop.http.staticuser.user attributes are defined in the file.

- fs.defaultFS sets the default file system uniform resource identifier (URI) of your environment.
- hadoop.http.staticuser.user sets the username that is used to browse the content of the file system in the HDFS web UI.

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop-namenode-vm10:9000</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hdfs</value>
  </property>
</configuration>


4. Configure the HDFS configuration settings in /usr/local/hadoop/etc/hadoop/hdfs-site.xml. For a complete list of HDFS attributes, search for the hdfs-default.xml documentation on http://hadoop.apache.org. The following attributes are set in this example.

- dfs.replication specifies the number of block replications for all files in HDFS. The default for block replication is 3.
- dfs.namenode.name.dir specifies the local file systems to store the NameNode name table (fsimage).
- dfs.datanode.data.dir specifies the local file systems on the DataNode to store the data blocks.
- dfs.datanode.max.transfer.threads specifies the maximum number of threads for transferring data in and out of the DataNode.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
  </property>
  <property>
    <name>dfs.datanode.max.transfer.threads</name>
    <value>4096</value>
  </property>
</configuration>

5. Configure passwordless SSH between the NameNode and the DataNodes. This allows the NameNode to transfer files, and to start and stop the Hadoop daemons remotely, without supplying the password for each node. See appendix A for instructions to set up passwordless SSH.
6. Sync the configuration files in the /usr/local/hadoop/etc/hadoop directory on the NameNode to all DataNodes. Use the scp or rsync command to transfer the files between the VMs (a sketch of steps 6 and 7 appears after the script descriptions below).
7. Start the Hadoop daemons as the hdfs user.

Hadoop provides a set of scripts in /usr/local/hadoop/sbin to start and stop the daemons and cluster.

- start-dfs.sh, stop-dfs.sh – start and stop the HDFS daemons (NameNode and DataNodes).
- start-all.sh, stop-all.sh – start and stop the HDFS daemons and the YARN daemons.

To use these cluster-wide scripts, set up passwordless SSH properly as described in step 5.
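The following is a minimal sketch of steps 6 and 7, run as the hdfs user on the NameNode. It assumes that passwordless SSH is in place and that the workers file lists the DataNode hostnames; on a brand-new cluster, the NameNode metadata directory must also be initialized once with hdfs namenode -format before the first start.

for node in $(cat /usr/local/hadoop/etc/hadoop/workers)
do
  rsync -a /usr/local/hadoop/etc/hadoop/ ${node}:/usr/local/hadoop/etc/hadoop/
done

start-dfs.sh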


8. Validate the Hadoop cluster with the hdfs command or web UI.

a. As the hdfs user, run hdfs dfsadmin -report to show the status of the cluster and each node.

$ hdfs dfsadmin -report
Configured Capacity: 1073217536000 (999.51 GB)
Present Capacity: 1072615031814 (998.95 GB)
DFS Remaining: 673199080454 (626.97 GB)
DFS Used: 399415951360 (371.99 GB)
DFS Used%: 37.24%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

------Live datanodes (5):

Name: 100.88.XX.XX:9866 (spark-prim-vm10.techsol.local)
Hostname: spark-prim-vm10.techsol.local
Decommission Status : Normal
Configured Capacity: 214643507200 (199.90 GB)
DFS Used: 64006270976 (59.61 GB)
Non DFS Used: 69025792 (65.83 MB)
DFS Remaining: 150439768579 (140.11 GB)
DFS Used%: 29.82%
DFS Remaining%: 70.09%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 3
Last contact: Thu May 13 09:42:36 CDT 2021
Last Block Report: Thu May 13 07:21:20 CDT 2021
Num of Blocks: 1298

-----Repeat for other nodes-----


b. Go to the Hadoop web UI in a browser: http://$NAMENODE_IP:9870.

Hadoop web UI > Cluster overview


Hadoop web UI > DataNode Information
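As an additional spot check, you can copy a small file into HDFS and confirm its block replication with hdfs fsck. This is a hedged example with a hypothetical test path; with the dfs.replication value of 3 shown earlier, each block should report three replicas.

$ hdfs dfs -mkdir -p /user/hdfs/test
$ hdfs dfs -put /etc/hosts /user/hdfs/test/hosts
$ hdfs fsck /user/hdfs/test/hosts -files -blocks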

3.3 Installation and configuration of Apache Spark This section shows the installation and configuration of a Spark cluster including the following software dependencies:

• Java JDK: Spark runs on Java virtual machines (JVMs). Spark 3.0 requires Java 8 or 11.
• Programming language interpreter: Spark is written in Scala and ships with a Scala interpreter. Spark also works with Python, Java, and R. For Spark to work from any of these programming languages, install the corresponding interpreter on the system. According to the Apache Spark project, Python is now the most widely used language with Spark.

3.3.1 Installing Spark Spark offers several prebuilt versions and source code on the project site. For simplicity and ease of installation, download one of the prebuilt binaries from http://spark.apache.org/downloads.html. If you prefer to build Spark from source for advanced customization, follow the information on https://spark.apache.org/docs/latest/building-spark.html. The following example shows installing a prebuilt version of Spark.


Perform the following steps as the root user on the Spark nodes:

1. Install Java JDK.

# yum install java-1.8.0-openjdk # yum install java-1.8.0-openjdk-devel

2. Install the programming language interpreters.

# yum install python3 python3-pip python3-setuptools

3. Create a spark user and group as the root user. When deploying Spark on multiple VMs, ensure the Spark user id (UID) and group id (GID) are the same across all cluster member VMs.

# groupadd -g 3004 spark
# useradd -u 3004 -g 3004 -d /home/spark spark
# passwd spark

4. Download Spark from http://spark.apache.org/downloads.html to the /usr/local directory. In this example, the Spark release is 3.0.2, and the package type is Pre-built for Apache Hadoop 3.2 and later (see Figure 11). Click the download link to download the file.

Apache Spark download page

Note: The download site is updated periodically with new releases, and older releases may be archived to another location. Ensure that the prebuilt Spark version is compatible with the Hadoop version that you have chosen.

5. Extract the software in a subdirectory in /usr/local.

# cd /usr/local
# tar xzvf spark-3.0.2-bin-hadoop3.2.tgz

6. Assign ownership to spark user.

# chown -R spark:spark /usr/local/spark-3.0.2-bin-hadoop3.2


7. Optionally, create a symbolic link to the software directory. Configure a symbolic link to point to the active version of the software. This action ensures a consistent path to the Spark program and configuration files between different versions of the software.

# ln -s /usr/local/spark-3.0.2-bin-hadoop3.2 /usr/local/spark

8. Configure the following environment variables for the spark user in $HOME/.bashrc.

export SPARK_HOME=/usr/local/spark
export JAVA_HOME=/usr/lib/jvm/java-1.8.0
export _JAVA_OPTIONS="-Xmx4g -XX:+UseParallelGC"
export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.7-src.zip:${SPARK_HOME}/python/:$PYTHONPATH"
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3

SPARK_HOME              Set the location of the Spark software
JAVA_HOME               Set the location of the Java software
_JAVA_OPTIONS           Set the Java heap size and other Java options
PATH                    Add the locations of the Spark programs to the search path
PYTHONPATH              Set the locations of the Spark Python libraries for pyspark
PYSPARK_PYTHON          Set the Python interpreter for pyspark
PYSPARK_DRIVER_PYTHON   Set the Python interpreter for the pyspark driver

9. Verify the Spark installation using the spark-shell interactive tool as spark user.

$ source ~/.bashrc
$ spark-shell

A Spark session is successfully created and waiting for user command.

3.3.2 Configuring a Spark standalone cluster Spark can run on a single host or on multiple hosts in a cluster setting. To form a Spark standalone cluster, add the cluster nodes to the Spark configuration file and synchronize the environment settings and cluster configuration across all cluster nodes.

Configure and update these files on the master node first, and sync them to all worker nodes.

1. To configure the Spark standalone cluster, add the worker-node information in the /usr/local/spark/conf/slaves file. In this example, the following worker nodes are added to the Spark cluster configuration:

$ cd /usr/local/spark/conf
$ cat /usr/local/spark/conf/slaves
spark-wrk-vm10
spark-wrk-vm11
spark-wrk-vm12
spark-wrk-vm13


2. Configure Spark logging in /usr/local/spark/conf/log4j.properties.

Spark uses log4j for logging. Configure log4j by copying the template file in the /usr/local/spark/conf directory. The default settings in the template are a good starting point without any changes. Adjust the parameters if necessary.

$ cd /usr/local/spark/conf
$ cp log4j.properties.template log4j.properties

3. Configure Spark environment settings in /usr/local/spark/conf/spark-env.sh.

The following variables are chosen as a baseline. These variables configure the Spark cluster such as the java classpath and the web portal UI. To see the complete list of variables, see https://spark.apache.org/docs/latest/spark-standalone.html.

# cat spark-env.sh
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/
export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
export SPARK_MASTER_HOST=spark-prim-vm10
export SPARK_MASTER_WEBUI_PORT=9090
export SPARK_WORKER_CORES=22

HADOOP_HOME Set the location of the Hadoop programs and configuration files so that Spark can access the HDFS.

SPARK_DIST_CLASSPATH Include the Hadoop classpath.

SPARK_MASTER_HOST Set the master node ip/hostname.

SPARK_MASTER_WEBUI_PORT Set the master web UI port (default is 8080). It might be necessary to change the default port due to conflicts with other applications on the same system.

SPARK_WORKER_CORES Set how many CPU cores Spark applications are allowed to use on the worker. The default is all available CPU cores.

4. Configure the Spark application properties in /usr/local/spark/conf/spark-defaults.conf.

This configuration file contains properties that control most of the application settings. We recommend reviewing these properties to understand what they do and how they change the behavior of Spark. See https://spark.apache.org/docs/latest/configuration.html for the comprehensive list of properties, their default values, and description.

The following lists the application properties used in this example:

$ cat spark-defaults.conf
spark.master spark://spark-prim-vm10:7077
spark.sql.debug.maxToStringFields 1000
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.shuffleTracking.enabled true
spark.dynamicAllocation.executorIdleTimeout 600s
spark.driver.log.persistToDfs.enabled true
spark.driver.log.dfsDir /user/spark/driverlogs
spark.network.timeout 600000
spark.executor.heartbeatInterval 100000

# Enable history server
spark.eventLog.enabled true
spark.eventLog.dir hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory
spark.history.fs.logDirectory hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory

spark.master  Set to the cluster manager IP/hostname and port. For a Spark standalone cluster, the URI is in the form spark://$SPARK_MASTER_IP:$PORT.

spark.sql.debug.maxToStringFields  Set the maximum number of fields that can be converted to strings in debug output.

spark.dynamicAllocation.enabled  Enable dynamic resource allocation. This allows dynamic scaling of executors.

spark.dynamicAllocation.shuffleTracking.enabled  Enable shuffle file tracking for executors without the need for an external shuffle service.

spark.dynamicAllocation.executorIdleTimeout  Increase the executor idle timeout from the default 60 s to 600 s. This prevents executors from being removed prematurely during long-running tasks.

spark.driver.log.persistToDfs.enabled  Enable applications to write the driver logs to persistent storage. The default is to not persist the driver logs.

spark.driver.log.dfsDir  Set the persistent storage location where the Spark driver stores its logs. In this example, it is set to an HDFS directory.

spark.network.timeout  Set the default timeout for all network connections. The default is 120 s. Increase the timeout for long-running tasks, for example, to 600000 ms (10 minutes).

spark.executor.heartbeatInterval  Set the interval for the executor heartbeat. It must be significantly less than spark.network.timeout. The default is 10 s.


spark.eventLog.enabled  Enable Spark to log events for use with the Spark History Server. See section 3.3.3.

spark.eventLog.dir  Set the location where the Spark event logs are stored. In this example, it is set to an HDFS directory. See section 3.3.3.

spark.history.fs.logDirectory  Specify the persistent storage location from which the Spark History Server loads the event logs. This should be set to the same location as spark.eventLog.dir. See section 3.3.3.

5. Configure passwordless SSH between the master node and the worker nodes. This allows the master node to transfer files and to start and stop the Spark daemons remotely without supplying the password for each node. See appendix A for instructions to set up passwordless SSH.

6. Sync the configuration files in the /usr/local/spark/conf directory on the Spark master node to all worker nodes. Use the scp or rsync command to transfer the files between the nodes.

7. Start the Spark processes as the spark user.

Spark provides a set of scripts in /usr/local/spark/sbin to start and stop the cluster or individual processes.

start-all.sh and stop-all.sh start and stop all Spark processes on the master and worker nodes. These scripts do not start or stop the Spark History Server.

To use these cluster-wide scripts, you must properly set up passwordless ssh as described in step 5.
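The following is a minimal sketch of steps 6 and 7, assuming the worker hostnames used elsewhere in this paper (spark-wrk-vm10 through spark-wrk-vm13) and that passwordless SSH is already in place:

for server in spark-wrk-vm10 spark-wrk-vm11 spark-wrk-vm12 spark-wrk-vm13
do
rsync -av /usr/local/spark/conf/ $server:/usr/local/spark/conf/
done

$ /usr/local/spark/sbin/start-all.sh
$ jps

The jps command, included with the JDK, should list a Master process on the master node and a Worker process on each worker node.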


8. Validate the Spark standalone cluster on the Spark master web UI.

Go to the Spark master web UI in a browser: http://spark-prim-vm10:9090. Verify that the worker node status is alive.

Spark master web UI

3.3.3 Configuring Spark History Server

Spark History Server is a web front end that accesses and displays the event logs generated from the Spark applications across all nodes. You must enable and save the event logs in a centralized location where the Spark History Server can access them. While the Spark master web UI also provides application logs, they are not persisted across Spark restarts. It is useful to have the event logs available for troubleshooting or tuning the application performance. See section 4.2.5 for an example of using the Spark History Server.

To enable the Spark History Server, use the following procedure:

1. Create a directory for the event logs where all Spark worker nodes have write access. In this example, the log directory resides in the HDFS cluster.

On any of the HDFS DataNodes, perform the following commands as the hdfs or spark user to create a new directory and set the ownership to spark user:

$ hdfs dfs -mkdir /user/spark/loghistory
$ hdfs dfs -chown spark /user/spark/loghistory


2. Add the following entries in /usr/local/spark/conf/spark-defaults.conf on all Spark nodes. These entries enable the event logging to the specified HDFS directory from all Spark nodes.

# Enable history server
spark.eventLog.enabled             true
spark.eventLog.dir                 hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory
spark.history.fs.logDirectory      hdfs://hadoop-namenode-vm10:9000/user/spark/loghistory

3. As spark user, start the Spark History Server on the Spark master node.

$ /usr/local/spark/sbin/start-history-server.sh

4. Verify the status of Spark History Server. In a browser, connect to the Spark History Server at http://$SPARK_MASTER_IP:18080

Spark History Server web UI


4 Testing Spark with Spark-bench

Spark-bench is an open-source benchmarking tool that is designed to test various application workloads. The tool provides an integrated data generator and several application workloads including machine learning, streaming, graph processing, and SQL. The goal of the project is to provide developers with a comprehensive Spark-specific benchmark tool that is also easy to use and configure. Developers use the tool to test and validate their configurations, compare performance between different platforms, and identify system bottlenecks in the environment.

4.1 Installing Spark-bench tool

The original Spark-bench code can be downloaded from the GitHub site https://github.com/CODAIT/spark-bench. The code was last updated in November 2018. Since then, other developers have forked the code to fix issues or add enhancements to the tool. In this example, a forked version is used from https://github.com/ch2994/spark-bench because it updates the Scala version to 2.12. The original Spark-bench code is compiled with Scala version 2.11, which is incompatible with the prebuilt version of Spark 3.x because Spark 3.x is compiled with Scala version 2.12.

The Spark-bench tool can be installed on one of the Spark nodes or on a dedicated node. If Spark-bench is installed on a dedicated node, it is a best practice to keep the Spark-bench node close to the Spark cluster to avoid high network latency.

4.1.1 Installation prerequisites

Spark-bench requires the following software:

• Java 8 or above
• Python 3
• Spark software: Spark-bench launches application workloads by making spark-submit calls to the cluster. Spark software is required on the Spark-bench node.
• Hadoop software: To configure the Spark-bench driver to save the logs to an HDFS directory, the system requires Hadoop software.
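As a quick sanity check (a minimal sketch that assumes the installation locations used in this paper), verify the prerequisites on the Spark-bench node:

$ java -version
$ python3 --version
$ /usr/local/spark/bin/spark-submit --version
$ /usr/local/hadoop/bin/hadoop version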

Follow the steps in section 3.3.1 and section 3.3.2 to install and configure Spark on the Spark-bench VM, except for configuring /usr/local/spark/conf/slaves. In this example, the Spark-bench VM functions as a dedicated driver and not a part of the Spark cluster. There is no requirement to add the Spark-bench node in the configuration file. Also, there is no requirement to start the Spark daemons on the Spark-bench VM.

Follow the steps in section 3.2 to install and configure Hadoop, except for configuring /usr/local/hadoop/etc/hadoop/workers. There is also no requirement to start any Hadoop daemon on the Spark-bench VM.


4.1.2 Installing Spark-bench

The following procedure installs Spark-bench on a dedicated virtual machine running on the same PowerStore X appliance:

1. Download Spark-bench from https://github.com/ch2994/spark-bench. Select the scala-2.12 branch, click the Code drop-down menu, and click Download ZIP. Save the installation file in /usr/local.


2. As root user, extract the software into a subdirectory in /usr/local.

# cd /usr/local
# unzip spark-bench-scala-2.12.zip

3. Create a symbolic link to the software directory as root user.

# ln -s /usr/local/spark-bench-scala-2.12 /usr/local/spark-bench

4. Assign ownership to the spark user.

# chown -R spark:spark /usr/local/spark-bench-scala-2.12

5. Download the sbt tool from github to compile the Spark-bench code.

# cd /usr/local
# wget https://github.com/sbt/sbt/releases/download/v1.4.8/sbt-1.4.8.zip
# unzip sbt-1.4.8.zip

6. Compile the Spark-bench code with sbt tool as spark user. For more information about compiling Spark-bench, see https://codait.github.io/spark-bench/compilation/.

$ cd /usr/local/spark-bench
$ /usr/local/sbt/bin/sbt assembly
$ mkdir lib
$ cp -p target/assembly/* lib

7. When sbt assembly is complete, two jar files are generated in the target/assembly directory. Move or copy them to the lib directory.

spark-bench-2.3.0_0.4.0-RELEASE.jar
spark-bench-launch-2.3.0_0.4.0-RELEASE.jar

4.2 Running Spark-bench workloads

This section demonstrates running the Spark-bench KMeans workload, and using the Spark master web UI and Spark History Server to monitor the applications. Spark-bench provides data generators for KMeans, Linear Regression, and Graph data. It also includes workloads for KMeans, SparkPi, and others. For the complete list of workloads and their definitions, see https://codait.github.io/spark-bench/workloads/. As noted on that page, even though the project aims to provide comprehensive workloads supported by Spark, some of these workloads have not been fully implemented. For instance, while the tool can generate data for Linear Regression, the Linear Regression workload has not been implemented yet.

This paper focuses on the KMeans workload, a machine learning workload, because Spark-bench supports both generating the data and exercising the KMeans workload against the generated data. The workload reads the data from the storage, performs computation in memory, and writes the results to the storage.

Spark-bench also provides example configuration files for different workloads. These examples are good starting points for beginners to explore Spark-bench and the workload configuration files. Before you attempt to implement the KMeans workload, we recommend reading about these examples on https://codait.github.io/spark-bench/examples/.


4.2.1 Generate KMeans dataset

The following data-generation configuration is based on an example configuration from Spark-bench. The example configuration files are in /usr/local/spark-bench/examples. Make a copy of data-generation.conf and modify it to fit your Spark environment. At a minimum, adjust the parameters called out in the comments below (such as master, executor-memory, and output) to reflect your environment.

$ cat data-generation-8p.conf
spark-bench = {
  spark-submit-parallel = false
  spark-submit-config = [{
    spark-args = {
      // Specify the Spark master address in your env
      master = "spark://spark-prim-vm10:7077"
      // Specify how much memory to request for the executor. Must be less than the avail memory on the worker node
      executor-memory = "4G"
    }
    suites-parallel = false
    workload-suites = [
      {
        descr = "Generating data for the benchmarks to use"
        parallel = false
        repeat = 1 // generate once and done!
        benchmark-output = console
        workloads = [
          {
            name = "data-generation-kmeans"
            // The generated data is written to the HDFS filesystem in the parquet format
            output = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8-50mil/kmeans-data.parquet"
            save-mode = "overwrite"
            // Size of the dataset, 50 million rows total
            rows = 50000000
            cols = 24
            partitions = 8
          }
        ]
      }
    ]
  }]
}

To generate the dataset, run the following command as spark user. Ensure that the spark-bench.sh is invoked by the same user that owns the output directory. These examples assign the spark user the ownership of the HDFS directories.

$ cd /usr/local/spark-bench/examples
$ /usr/local/spark-bench/bin/spark-bench.sh data-generation-8p.conf


When the application completes, verify the dataset in HDFS. In a web browser, go to the HDFS NameNode web UI at http://$HADOOP_NAMENODE_IP:9870. Click Utilities > Browse the file system. The number of files created is based on the partition parameters that are defined in the configuration file.
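Alternatively, list the generated files from the command line (a minimal sketch using the output path from the data-generation configuration above):

$ hdfs dfs -ls /user/spark/testdata-p8-50mil/kmeans-data.parquet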

Browse KMeans data directory in HDFS

4.2.2 Run KMeans workload

Make a copy of the KMeans configuration file in the example directory. Modify the configuration to fit your environment. The Spark-bench configuration syntax is very flexible and allows defining multiple workloads, repeating workloads, and sequential and parallel executions of workloads to simulate various mixed workload patterns. The following example shows three workloads with different Spark settings, and each workload is repeated five times. The goal is to test the effect of using a different number of CPU cores for each executor. The configuration also overrides the default executor memory and driver memory settings.

spark-bench = {
  spark-home = "/usr/local/spark"
  spark-submit-parallel = false
  spark-submit-config = [{
    spark-args = {
      master = "spark://spark-prim-vm10:7077"
      executor-cores = 4
      executor-memory = 2g
      driver-memory = 4g
    }
    suites-parallel = false
    // Workload 1
    workload-suites = [
      {
        descr = "Run kmeans "
        parallel = false
        repeat = 5
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8-50mil/kmeans-data.parquet"
            k = 10
          }
        ]
      }
    ]
  },
  {
    spark-args = {
      master = "spark://spark-prim-vm10:7077"
      executor-cores = 8
      executor-memory = 2g
      driver-memory = 4g
    }
    // Workload 2
    suites-parallel = false
    workload-suites = [
      {
        descr = "Run kmeans 2 "
        parallel = false
        repeat = 5
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8-50mil/kmeans-data.parquet"
            k = 10
          }
        ]
      }
    ]
  },
  {
    spark-args = {
      master = "spark://spark-prim-vm10:7077"
      executor-cores = 12
      executor-memory = 2g
      driver-memory = 4g
    }
    // Workload 3
    suites-parallel = false
    workload-suites = [
      {
        descr = "Run kmeans 3 "
        parallel = false
        repeat = 5
        benchmark-output = "console"
        workloads = [
          {
            name = "kmeans"
            input = "hdfs://hadoop-namenode-vm10:9000/user/spark/testdata-p8/kmeans-data.parquet"
            k = 10
          }
        ]
      }
    ]
  }]
}

4.2.3 Spark memory and CPU cores

Because Spark is an in-memory computing engine, proper memory configuration is critical to its performance. One of the most common errors Spark applications encounter is the Out Of Memory (OOM) error. This is typically related to the Java heap setting, the executor memory setting, or the driver memory setting. Every application has different requirements, and there is no universal setting that works for every application; the memory configuration differs from application to application. The general guideline is that the memory setting should be large enough to hold the dataset. Adjust and experiment with these values when the application encounters memory errors. For a complete list of tunable Spark settings, see https://spark.apache.org/docs/latest/configuration.html.

To adjust the amount of Java Heap space for Spark daemons, set the following environment settings:

• _JAVA_OPTIONS = -Xmx4g in the spark user $HOME/.bashrc file
• SPARK_DAEMON_JAVA_OPTS = -Xmx4g in the $SPARK_HOME/conf/spark-env.sh file

To change the default settings for the Spark driver memory and Spark executor memory, set the following parameters in $SPARK_HOME/conf/spark-defaults.conf.

• spark.driver.memory
• spark.executor.memory

Each Spark application might set its own memory settings that are different from the defaults. To override the default values, specify the executor-memory and driver-memory parameters as in the example in section 4.2.2.
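For example, the following is a minimal sketch of overriding the defaults for a single run with spark-submit, using the SparkPi example that ships with Spark (the memory and core values are illustrative only, not recommendations):

$ /usr/local/spark/bin/spark-submit \
--master spark://spark-prim-vm10:7077 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 8 \
/usr/local/spark/examples/src/main/python/pi.py 100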


When a Spark application is submitted to the cluster manager, it launches several executors on the Spark worker nodes to process the tasks. By default, one executor is requested on each Spark worker node with 1 GB of memory and all CPU cores available on the node. However, when Spark is co-located with other applications, like Hadoop in this example, some CPU cores should be reserved for the Hadoop daemons. To limit the total number of CPU cores that Spark applications are allowed to use on the system, set SPARK_WORKER_CORES to the total number of CPU cores on the system minus the number of CPU cores reserved for the other applications in /usr/local/spark/conf/spark-env.sh.
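For example, a minimal sketch of the spark-env.sh entry, assuming each worker VM exposes 24 vCPUs and two cores are reserved for the Hadoop daemons (which matches the 22 cores per worker node used in this example):

# /usr/local/spark/conf/spark-env.sh
SPARK_WORKER_CORES=22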

Also, the Spark application might explicitly request the number of cores allowed for each executor. Instead of all available CPU cores for a single executor on a worker node, set the executor-cores parameter in the application to override the default. The example in section 4.2.2 compares the duration of the KMeans workloads using a different number of executor-cores. Figure 15 shows the status summary of the application runs and their durations on the Spark master web UI. In this example, there are four Spark worker nodes. Each node has 22 available CPU cores for Spark applications. When the executor-cores is specified explicitly, Spark automatically calculates the number of executors that it can run on each worker node based on the available CPU cores.

Click the application id link to see the executors details like in Figure 16. The status of the executors might show KILLED even though they are completed successfully because the driver asks the workers to terminate the executors after they finish processing.

Spark application status in Spark master web UI (workload groups annotated with executor-cores = 12, 8, and 4)


Executor status of an application

4.2.4 Spark network timeout

It might be necessary to increase the timeout values for Spark network communication. The default network timeout is 120 s, which might not be long enough for long-running tasks. If the application fails with a timeout error, increase the spark.network.timeout value incrementally to find the optimal value. spark.network.timeout is defined in $SPARK_HOME/conf/spark-defaults.conf. See section 3.3.2.
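For reference, the values used in this paper (see the spark-defaults.conf listing in section 3.3.2) raise both properties well above the defaults:

spark.network.timeout              600000
spark.executor.heartbeatInterval   100000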

4.2.5 Monitoring Spark applications

Use the Spark master web UI to monitor the applications that are running or recently completed. The application information does not persist after restarting Spark daemons. To retain information about completed applications, configure and use the Spark History Server. See section 3.3.3 for information about configuring the Spark History Server.

4.2.5.1 Spark master web UI

The Spark master web UI shows the worker status, a summary of CPU cores and memory available and used on each worker, and a list of completed and incomplete applications. It also displays Spark application information including core and memory usage, environment settings, stages and tasks, and executor information. Click the application id to see the runtime information and log messages. See Figure 15 and Figure 16.

Note: The application information does not persist after restarting the Spark daemons.


4.2.5.2 Spark History Server

We recommend configuring the Spark History Server to retain the application information on persistent storage. It is useful to be able to recall the information for troubleshooting or comparing performance with different environment settings. To store the application information, configure the Spark worker nodes to write events to a persistent storage location like HDFS. See section 3.3.3 about configuring the Spark History Server.

To access the Spark History Server, in a web browser, go to http://$SPARK_MASTER_IP:18080. The main page shows a list of completed and incomplete applications. See Figure 17.

Spark History Server web UI

Click the app id link to see more details like the event timeline and the status of the jobs. The jobs and stages sections provide useful information about their runtimes, the functions performed, and the detailed logs specific to each job and stage. This allows users to easily see the end-to-end job flow, identify trouble areas, and investigate the cause of bottlenecks or issues. See Figure 18, Figure 19, and Figure 20.

Spark jobs event timeline


Completed jobs status


Show all stages and review the detail log of a specific stage

The environment section, Figure 21, shows the application environment settings and is useful to review how adjusting these settings might affect the performance of the application. For instance, execute the same application with different executor memory settings and compare their performance.

Application environment settings


5 Interactive analysis of PowerStore metrics with Jupyter notebook

This section demonstrates a simple use case using Spark to perform analytic tasks on PowerStore storage usage. This use case requires the following additional software:

• JupyterLab notebook: JupyterLab is an open-source, web-based development environment for Jupyter notebooks. Developers can interact with live code, equations, and text, and create visualizations such as graphs and tables in a notebook. It supports various programming languages such as Python, Scala, R, Julia, and many more. For more information about JupyterLab, go to https://jupyter.org/.
• pandas, matplotlib, and statsmodels Python modules: The pandas module is a popular data-analysis library that is easy to use. The matplotlib module provides a data visualization and graphical plotting library for Python. The statsmodels module provides classes and functions for different statistical models. More information about these modules is available on https://pandas.pydata.org, https://matplotlib.org, and https://www.statsmodels.org.
• PowerStore command-line interface (CLI) client: The PowerStore CLI client, pstcli, enables administrators to manage and automate tasks on PowerStore appliances from Windows or Linux systems. Administrators can run pstcli commands interactively or in a batch script to automate various tasks such as extracting the various PowerStore metrics. For more information about the PowerStore CLI client, go to http://www.dell.com/support and search for pstcli.

5.1 Installing prerequisite software

For simplicity, all prerequisite software is installed on the same system where it has access to the Spark programs and libraries. In this example, the software is installed on spark-bench-vm10.

5.1.1 JupyterLab

Perform the following steps to install JupyterLab on a Linux system:

1. Install JupyterLab as spark user.

$ pip3 install jupyterlab

2. The software is installed in the user $HOME/.local/bin directory. Add this location to the user PATH variable.

$ export PATH=$HOME/.local/bin:$PATH

3. Launch JupyterLab from a directory where the notebooks reside. By default, JupyterLab allows access from localhost only. To allow access from other systems, include the --ip 0.0.0.0 argument.

Note: Make note of the URLs in the following output for accessing the JupyterLab server in a browser.

$ cd /stage/spark/notebooks
$ jupyter-lab --ip 0.0.0.0

[I 2021-05-10 16:46:53.992 ServerApp] jupyterlab | extension was successfully linked.


[I 2021-05-10 16:46:54.010 LabApp] JupyterLab extension loaded from /home/spark/.local/lib/python3.6/site-packages/jupyterlab
[I 2021-05-10 16:46:54.010 LabApp] JupyterLab application directory is /home/spark/.local/share/jupyter/lab
[I 2021-05-10 16:46:54.013 ServerApp] jupyterlab | extension was successfully loaded.
[I 2021-05-10 16:46:54.013 ServerApp] Serving notebooks from local directory: /stage/spark/notebooks
[I 2021-05-10 16:46:54.013 ServerApp] Jupyter Server 1.7.0 is running at:
[I 2021-05-10 16:46:54.013 ServerApp] http://spark-bench-vm10:8888/lab?token=8813cbd9ad29c724df329089dadd068427e3d07c0c617811
[I 2021-05-10 16:46:54.013 ServerApp] http://127.0.0.1:8888/lab?token=8813cbd9ad29c724df329089dadd068427e3d07c0c617811
[I 2021-05-10 16:46:54.013 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 2021-05-10 16:46:54.018 ServerApp] No web browser found: could not locate runnable browser.
[C 2021-05-10 16:46:54.018 ServerApp]

    To access the server, open this file in a browser:
        file:///home/spark/.local/share/jupyter/runtime/jpserver-31142-open.html
    Or copy and paste one of these URLs:
        http://spark-bench-vm10:8888/lab?token=8813cbd9ad29c724df329089dadd068427e3d07c0c617811
        http://127.0.0.1:8888/lab?token=8813cbd9ad29c724df329089dadd068427e3d07c0c617811

5.1.2 Python modules

Run the following command as the root user to install the pandas, matplotlib, and statsmodels Python modules:

# pip3 install pandas matplotlib statsmodels

5.1.3 PowerStore command-line interface (CLI)

Perform the following steps to install the PowerStore CLI client, pstcli, on a Linux system:

1. Download the pstcli rpm package to the /usr/local directory. On www.dell.com/support, search for pstcli and follow the link to download the rpm package.

2. Issue the following command as root user to install the package.

# rpm -ihv /usr/local/

3. The programs are installed in /opt/dellemc/pstcli- directory. Add this location to the PATH environment variable for the user.

# export PATH=/opt/dellemc/pstcli-:$PATH


5.2 Extract PowerStore space metrics

Use the following commands to extract the space metrics from two appliances and save the data in csv-format files.

pstcli -d $POWERSTORE_MANAGEMENT_IP_PS12 -u admin -p $PASSWORD metrics \
generate -entity space_metrics_by_appliance -entity_id A1 -interval One_Hour \
-output csv > space-metrics-ps12-2-18-2021.csv

pstcli -d $POWERSTORE_MANAGEMENT_IP_PS14 -u admin -p $PASSWORD metrics \
generate -entity space_metrics_by_appliance -entity_id A1 -interval One_Hour \
-output csv > space-metrics-ps14-2-17-2021.csv

The amount of historic data extracted depends on the collection interval specified in the argument.

• For the Five_Mins interval, 1 day of historical data is available.
• For the One_Hour interval, 30 days of historical data is available.
• For the One_Day interval, 2 years of historical data is available.

Examine the .csv files. If Success appears in the first line, manually delete the line, but do not modify the header that contains the column names.
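For example, the following is a minimal sketch of removing a leading Success line without touching the header row (GNU sed is assumed; substitute your own .csv filename):

$ sed -i '1{/^Success/d}' space-metrics-ps14-2-17-2021.csv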

pstcli has a rich set of functions. For more information about pstcli, go to http://www.dell.com/support and search for pstcli.

5.3 Import PowerStore space metrics into HDFS

The following example demonstrates using a JupyterLab notebook and pyspark to import the .csv files generated in section 5.2 and convert them into parquet format in HDFS for future access.

1. Access JupyterLab in a web browser by copying and pasting the URLs provided in the above command output.


2. Create a Python notebook with the code provided in appendix B.1. The code creates a Spark session and imports the .csv files into HDFS in parquet format. Run each cell interactively to see the results of each step.


3. Confirm the imported data in HDFS. Go to http://$HADOOP_NAMENODE_IP:9870, and click Utilities > Browse the file system. Enter the HDFS directory in the text box, and click Go.

5.4 Perform analysis on the PowerStore space metrics

Create a new Python notebook with the code provided in appendix B.2. The code creates a Spark session, reads the PowerStore space metrics from the parquet files on HDFS, transforms the data, performs calculations and aggregations, and creates bar charts of the results. Run each cell interactively to see the results of each step.


6 Automation

One of the main challenges in creating a reliable cluster environment is building and maintaining the cluster consistently. When changes are introduced to the environment, it is important to apply them to all systems consistently and track these changes over time. Inconsistency in the cluster environment causes unexpected or intermittent issues which are difficult to troubleshoot. We recommend adopting an automation tool to ensure a high-quality, consistent environment. PowerStore offers several ways to program against the appliance, including REST APIs, PowerStore pstcli, and the Ansible module for PowerStore. Automation tools such as Ansible are typically easier to learn and implement compared to writing programs with the other options. However, all options are available and equally capable of managing the PowerStore appliance. Another benefit of using Ansible is the vast number of modules contributed by the community at https://galaxy.ansible.com. Administrators can easily create an end-to-end deployment and update workflow.

For instance, an Ansible playbook may use the following modules to set up and maintain an environment:

• Use the PowerStore Ansible module to create volumes and protection policies and to manage storage operations.
• Use the VMware Ansible module to create virtual machines from a template and update VM configurations on the PowerStore appliance.
• Use the Ansible built-in modules to perform many operating system tasks, such as applying updates, configuring a time server, creating SSH keys, creating file systems, installing and configuring applications, and many more.
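For example, a minimal sketch of getting started with these collections (the collection names are published on Ansible Galaxy; inventory.yml and site.yml are user-provided placeholders):

$ ansible-galaxy collection install dellemc.powerstore community.vmware
$ ansible-playbook -i inventory.yml site.yml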

For more links and resources related to Ansible and PowerStore, see appendix C.3.


7 Data protection

PowerStore includes native data-protection features with snapshot and thin clone technologies. Also, Dell Technologies has a wide range of products to enhance data protection and enable disaster recovery beyond a local storage system.

7.1 Snapshots and thin clones

A PowerStore snapshot is a point-in-time copy of the data of a storage resource such as a volume, volume group, virtual machine, or file system. You can take manual snapshots in the PowerStore Manager UI or create snapshot policies within protection policies to automatically take snapshots on a predefined schedule or frequency. Snapshots are not directly accessible to hosts.

To access the data on a snapshot of a storage resource, except for a VM, you can create a read-writable thin clone from the snapshot and map it to the same host or a different host. A thin clone is a space-efficient copy that shares data blocks with its parent object. Multiple thin clones are allowed from a snapshot. Changes to one thin clone do not affect the parent object or other associated thin clones, and the reverse is also true.

For VMs, PowerStore creates snapshots at the VM level, and the snapshot information is reflected in vCenter automatically. Use PowerStore snapshots to protect the guest operating system and applications in VMs. Due to the distributed nature of HDFS, we recommend taking snapshots of all cluster nodes as close together in time as possible.

Some of the PowerStore snapshot and thin clone use cases are as follows:

• Provide quick and easy rollback on operating system or application updates.
• Clone the production environment to development or test environments with ease and without a full-size copy of the data.
• Reduce the complexity and time to refresh the data in a clone environment from the latest snapshot or thin clone.

To perform snapshot operations in PowerStore Manager, you can take the following actions. The clone, refresh, and restore functions do not apply to VMs.

• Take a snapshot from the Protection card on the Overview page of the storage resource.
• Clone a snapshot from the Protection card on the Overview page of the storage resource.
• Delete a snapshot from the Protection card on the Overview page of the storage resource.
• Refresh the data of a storage resource from the More Actions menu of the storage resource or snapshot.
• Restore a storage resource from the More Actions menu of the storage resource or snapshot.
• Configure, view, and manage snapshot rules from the Policies page.

To recover data from a snapshot of a storage resource, except for VM, you can perform the following actions in PowerStore Manager:

• Create a thin clone from a snapshot and map it to a host.
• Use the refresh operation to replace existing data in the volume with the data from a snapshot or thin clone related to the parent storage resource.
• Use the restore operation to replace the data of a parent storage resource with data from an associated snapshot. The restore operation resets the data in the parent storage resource to the point in time the snapshot was taken.


PowerStore also offers PowerStore pstcli and REST APIs that can be used to automate operations in scripts and other programming languages.

To recover a VM from a VM snapshot in vCenter, revert the VM in vCenter using the VM Manage Snapshots operation.

For more information about PowerStore data protection, see the document Dell EMC PowerStore Protecting Your Data.

7.2 AppSync

Dell EMC AppSync™ is optional software that enhances the overall protection of supported applications. With its deep integration with PowerStore and applications such as Oracle® and Microsoft® SQL Server®, AppSync uses the native PowerStore asynchronous replication, snapshot, and thin clone technologies to create and manage local and remote copies of applications. However, AppSync supports only block-storage resources on PowerStore and does not have application integration with Spark or Hadoop. For vVol storage resources, see the Dell EMC RecoverPoint™ for Virtual Machines information in section 7.3.

For more information about PowerStore AppSync integration, see the document Dell EMC PowerStore: AppSync.

7.3 RecoverPoint for Virtual Machines

Dell EMC RecoverPoint for Virtual Machines is optional software that extends data protection and enables disaster recovery for VMware virtualized environments to on-premises or cloud environments. RecoverPoint for Virtual Machines is a software-only solution that protects VMs with local and remote replication. It is storage and application agnostic and supports both synchronous and asynchronous replication on all storage types supported by VMware. It also allows replicating multiple VMs in a consistency group.

PowerStore does not support vVol replication or VM consistency groups. RecoverPoint for Virtual Machines is a great addition if these features are required. Using RecoverPoint, all application cluster nodes can be protected and replicated collectively in a consistency group.

For more information about RecoverPoint for VMs, see the document RecoverPoint for Virtual Machines Administrator’s Guide on Dell Support.

7.4 Hadoop distributed copy and HDFS snapshots

For native Hadoop solutions, consider using Hadoop DistCp and HDFS snapshots. DistCp is a tool that allows Hadoop to copy data from one cluster to another. HDFS snapshots are read-only point-in-time copies of the HDFS file system. More information about DistCp and HDFS snapshots is available at https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html and https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html.
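For example, a minimal sketch of taking an HDFS snapshot and copying a directory to a second cluster (backup-namenode is a placeholder for the destination NameNode; run the commands as a user with the appropriate HDFS permissions):

$ hdfs dfsadmin -allowSnapshot /user/spark
$ hdfs dfs -createSnapshot /user/spark spark-snap-1
$ hadoop distcp hdfs://hadoop-namenode-vm10:9000/user/spark hdfs://backup-namenode:9000/user/spark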


A Configure passwordless SSH

Use the following procedure to configure passwordless SSH for an application user:

1. Generate an SSH key for the spark user.

$ ssh-keygen -b 1024

Press Enter twice to accept the default key file and an empty passphrase.

2. Push the SSH public key to the worker nodes.

for server in spark-wrk-vm10 spark-wrk-vm11 spark-wrk-vm12 spark-wrk-vm13
do
ssh-copy-id $server
done

3. Test the passwordless SSH connection.

for server in spark-wrk-vm10 spark-wrk-vm11 spark-wrk-vm12 spark-wrk-vm13
do
ssh $server uname -a
done

This makes an SSH connection to each worker node without prompting for a password, queries the system hostname with the uname command, and returns the output.


B Python codes

The following code is used in this paper.

B.1 Import .csv files into HDFS

Save the code to a file, for example, import-powerstore-space-metrics.py. Alternatively, copy and paste the code into a JupyterLab notebook.

#!/usr/bin/env python3
# Import modules and create Spark session
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Import csv data") \
    .getOrCreate()

# Define the schema columns and datatypes
schema = StructType() \
    .add("appliance_id", StringType(), True) \
    .add("timestamp", TimestampType(), True) \
    .add("last_logical_provisioned", LongType(), True) \
    .add("last_logical_used", LongType(), True) \
    .add("last_physical_total", LongType(), True) \
    .add("last_physical_used", LongType(), True) \
    .add("max_logical_provisioned", LongType(), True) \
    .add("max_logical_used", LongType(), True) \
    .add("max_physical_total", LongType(), True) \
    .add("max_physical_used", LongType(), True) \
    .add("last_data_physical_used", LongType(), True) \
    .add("max_data_physical_used", LongType(), True) \
    .add("last_efficiency_ratio", DoubleType(), True) \
    .add("last_data_reduction", DoubleType(), True) \
    .add("last_snapshot_savings", DoubleType(), True) \
    .add("last_thin_savings", DoubleType(), True) \
    .add("max_efficiency_ratio", DoubleType(), True) \
    .add("max_data_reduction", DoubleType(), True) \
    .add("max_snapshot_savings", DoubleType(), True) \
    .add("max_thin_savings", DoubleType(), True) \
    .add("last_shared_logical_used", LongType(), True) \
    .add("max_shared_logical_used", LongType(), True) \
    .add("last_logical_used_volume", LongType(), True) \
    .add("last_logical_used_file_system", LongType(), True) \
    .add("last_logical_used_vvol", StringType(), True) \
    .add("max_logical_used_volume", LongType(), True) \
    .add("max_logical_used_file_system", LongType(), True) \
    .add("max_logical_used_vvol", LongType(), True) \
    .add("repeat_count", IntegerType(), True) \
    .add("entity", StringType(), True)

# Read data from csv file into dataframe
df3 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schema) \
    .option("timestampFormat", "MM/dd/yyyy hh:mm:ss a") \
    .load("file:///stage/spark/csv/space-metrics-ps14-2-17-2021.csv")

df3.printSchema()
df3.toPandas()

# Write data to HDFS directory in parquet format
df3.write.parquet("hdfs://hadoop-namenode-vm10:9000/user/spark/pstcli/space-metrics-ps14-2-17-2021.parquet")

# Read data from csv file into dataframe
df4 = spark.read.format("csv") \
    .option("header", True) \
    .schema(schema) \
    .option("timestampFormat", "MM/dd/yyyy hh:mm:ss a") \
    .load("file:///stage/spark/csv/space-metrics-ps12-2-18-2021.csv")

df4.printSchema()

# Show data in pretty format
df4.toPandas()

# Write data to HDFS directory in parquet format
df4.write.parquet("hdfs://hadoop-namenode-vm10:9000/user/spark/pstcli/space-metrics-ps12-2-18-2021.parquet")

B.2 Analyze PowerStore space metrics

Save the code to a file, for example, analyze-space-metrics-2appliances.py. Alternatively, copy and paste the code into a JupyterLab notebook.

#!/usr/bin/env python

# Import pyspark and pandas modules
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, DateType, TimestampType, DoubleType
from pyspark.sql.functions import unix_timestamp
import pyspark.sql.functions as F
import pandas as pd

# Create Spark session
spark = SparkSession \
    .builder \
    .appName("Analyzing Storage Metrics") \
    .getOrCreate()

# Read data into dataframe
df1 = spark.read \
    .format("parquet") \
    .option("header", "true") \
    .load("hdfs://hadoop-namenode-vm10:9000/user/spark/pstcli/space-metrics-ps14-2-17-2021.parquet")

# Read data into dataframe
df2 = spark.read \
    .format("parquet") \
    .option("header", "true") \
    .load("hdfs://hadoop-namenode-vm10:9000/user/spark/pstcli/space-metrics-ps12-2-18-2021.parquet")

# Extract the columns we need and add the appliance name
data1 = df1[["timestamp", "last_physical_used"]].withColumn('appliance', F.lit('WX-0001'))
data2 = df2[["timestamp", "last_physical_used"]].withColumn('appliance', F.lit('WX-0002'))

# Merge the dataframes into one dataframe
df = data1.union(data2)

# Cast datatype on columns
df1 = df.withColumn("last_physical_used", df["last_physical_used"].cast(LongType())) \
    .withColumn("timestamp", unix_timestamp("timestamp", 'MM/dd/yyyy hh:mm:ss a').cast(TimestampType()))

df1.show(5)

# Define date range
dates = ("2021-01-24", "2021-02-12")
date_from, date_to = [F.to_date(F.lit(s)).cast(TimestampType()) for s in dates]
df_range = df1.where((df1.timestamp >= date_from) & (df1.timestamp < date_to))

df_range.show(5)

# Perform aggregations on data from date range
from pyspark.sql.functions import sum, avg, max, min, mean, count

df = df_range.where((df_range.timestamp > date_from) & (df_range.timestamp < date_to)) \
    .groupBy("appliance") \
    .agg(avg("last_physical_used").alias("avg_physical_used"),
         max("last_physical_used").alias("max_physical_used"))

df.show()

# Plot bar chart on average physical used for two appliances
import matplotlib.pyplot as plt

# Data to plot
x_list = [row.appliance for row in df.select('appliance').collect()]
y_list = [(row.avg_physical_used)/1024/1024/1024 for row in df.select('avg_physical_used').collect()]

print(x_list)
print(y_list)

plt.style.use('ggplot')
plt.figure(figsize=(70, 20))
plt.bar(x_list, y_list)

plt.title('Avg Physical Used between dates', fontsize=70)
plt.xlabel('Appliance', fontsize=70)
plt.ylabel('GB', fontsize=70)

plt.xticks(rotation=90, fontsize=60)
plt.yticks(fontsize=70)

plt.autoscale()
plt.show()

# Perform transformation and aggregation on average physical used
from pyspark.sql.functions import sum, avg, max, min, mean, count

df = df_range.where((df_range.timestamp > date_from) & (df_range.timestamp < date_to)) \
    .groupBy(df_range.timestamp.substr(0, 10), df_range.appliance) \
    .agg(avg("last_physical_used").alias("avg_physical_used"),
         max("last_physical_used").alias("max_physical_used")) \
    .withColumnRenamed("substring(timestamp, 0, 10)", "day") \
    .orderBy("day", "appliance")

df.show(10)

# Prepare the data series and remove duplicated day entries for plotting the chart
pdf = df.select("*").toPandas()
dates = df.select("day").toPandas()["day"].drop_duplicates().to_list()
series1 = pdf.loc[pdf["appliance"] == "WX-0001"]["max_physical_used"].to_list()
series2 = pdf.loc[pdf["appliance"] == "WX-0002"]["max_physical_used"].to_list()

# Plot bar chart on max physical used on two appliances over time
from matplotlib import pyplot as plt

plt.rcParams["figure.figsize"] = [16, 8]

plotdata = pd.DataFrame({
    "WX-0001": series1,
    "WX-0002": series2
    }, index=dates
)
plotdata.plot(kind="bar")
plt.title("Max Physical Used")
plt.xlabel("Dates")
plt.ylabel("Max Physical Used in bytes")


C Additional resources

C.1 Technical support and resources

Dell.com/support is focused on meeting customer needs with proven services and support.

Storage technical documents and videos provide expertise that helps to ensure customer success on Dell EMC storage platforms.

The PowerStore Info Hub provides detailed documentation on how to install, configure, and manage PowerStore systems.

C.2 Other resources

• http://spark.apache.org/
• https://hadoop.apache.org/
• https://github.com/CODAIT/spark-bench
• https://codait.github.io/spark-bench/
• https://github.com/dell/ansible-powerstore
• https://docs.ansible.com/ansible/latest/collections/community/vmware/index.html
• https://docs.ansible.com/ansible/latest/
• https://galaxy.ansible.com/
• https://jupyter.org/
• https://pandas.pydata.org
• https://matplotlib.org
• https://www.statsmodels.org

C.3 Ansible resources

The Ansible module for PowerStore and documentation is available on the Dell GitHub page https://github.com/dell/ansible-powerstore. For downloadable example codes, go to https://github.com/dell/ansible-storage-automation/tree/master/powerstore. Also, go to the Dell EMC Automation Community, https://www.dell.com/community/Automation/bd-p/Automation, where customers can participate in discussions, share their knowledge, ask questions, and provide feedback.

For information about Ansible and VMware modules, go to https://docs.ansible.com/ansible/latest/collections/community/vmware/index.html and https://docs.ansible.com/ansible/latest/.

For information about the PowerStore command line interface, see the CLI reference guide at https://downloads.dell.com/manuals/common/pwrstr-clirefg_en-us.pdf.


For information about the PowerStore REST API, see the Developers Guide at https://downloads.dell.com/manuals/common/pwrstr-apig_en-us.pdf. Also, PowerStore has an integrated online REST API interface which can be accessed in a web browser by going to https://$POWERSTORE_MANAGEMENT_IP/swaggerui. See Figure 22.

PowerStore online REST API interface
