DELL EMC ISILON DATA LAKE WITH POWEREDGE SERVERS Recommended Configurations

Kris Applegate, Solution Architect, EMC Customer Solution Centers

Boni Bruno, Principal Solution Architect, Dell EMC Emerging Technology Team

Armando Acosta, Product Manager, Dell EMC Converged Platform Division

Sai Devulapalli, Data Analytics Practice Lead, Dell EMC Emerging Technology Team

ABSTRACT

This white paper details the validated configuration for connecting Dell EMC Isilon to Dell EMC PowerEdge servers. It also describes recommended configurations and provides guidance on optional modifications for tailoring the design to each customer’s use case.

December 2016

TABLE OF CONTENTS

EXECUTIVE SUMMARY
  AUDIENCE
HADOOP IN THE ENTERPRISE
SHARED STORAGE HADOOP VS. DISTRIBUTED STORAGE HADOOP
DELL EMC ISILON
  Dell EMC Isilon X-Series Nodes
DELL EMC POWEREDGE
  Dell EMC PowerEdge FX2, PowerEdge FC630, and PowerEdge FD332
  Dell EMC PowerEdge R630
HADOOP ROLES
  Compute Node(s)
  Infrastructure Nodes
  Manager Node(s)
  Edge Node(s)
RECOMMENDED CONFIGURATIONS
  Modular Infrastructure
    Network Diagram
    Configuration
  Rack Server Infrastructure
    Network Diagram
    Configuration
  As-Tested Configuration
    Network Diagram
    Configuration
  Considerations
    Sizing Compute Nodes and Isilon Nodes
    Isilon Platform
    Server Platform
    Server CPU
    Server Memory
    Server Local Storage
    Network

Dell - Internal Use - Confidential 2

DELL EMC CUSTOMER SOLUTION CENTERS
LINKS


EXECUTIVE SUMMARY

Analytics play a crucial role in any modern enterprise. Hadoop is one technology that makes it possible to process the data needed to extract insights at the required scale, speed, and price point. Dell EMC provides solutions ranging from pure do-it-yourself reference architectures to complete turnkey appliances, accommodating almost any project budget.

This white paper details the validated configurations for running Hadoop distributions, including Cloudera, on Dell EMC Isilon arrays and Dell EMC PowerEdge servers. Additionally, recommended configurations are outlined along with potential variances that can be used to accommodate differing use cases. Configurations are provided for both modular and rack server infrastructures in order to provide the most flexibility to adapt to customer requirements.

AUDIENCE

This white paper is intended for customers looking to leverage validated configurations when customizing their own Hadoop clusters, as well as Dell EMC sales makers and partners who are looking to propose solutions to customers.


Hadoop in the Enterprise

Hadoop is an open source platform designed to store and process large datasets in a distributed computing environment. It has two main sub-projects: Hadoop Distributed File System (HDFS) for data storage and MapReduce for data processing. Hadoop distributes large datasets across servers or shared storage so that the data can be processed in parallel.

Organizations turn to Hadoop for both business and technology advantages. At a business level, Hadoop offers a compelling value proposition from a total cost of ownership standpoint. Hadoop uses industry-standard servers and storage, decreasing the cost to store and process huge datasets compared with traditional business intelligence (BI) and analytics solutions. In addition, cost efficiencies can be achieved via a data lake using scale-out NAS, such as Dell EMC Isilon.

To help organizations overcome their expertise and skills shortages and accelerate the deployment of Hadoop environments that grow with the business, Dell EMC provides a portfolio of validated reference architectures and scalable solutions for Hadoop deployments. Dell EMC backs these offerings with a wide range of professional consulting and support services.

There is no single “right answer” for success with data analytics. It is a journey of continual growth. Every organization’s data is unique, and must be treated as such. Solutions that are perfect for one company may not address the needs of another.

With this thought in mind, Dell EMC offers a wide range of products and solutions to address diverse big data and analytics challenges — from starter bundles and validated reference architectures to integrated appliances and engineered solutions, or even completely customized solutions for your specific environment.

Shared Storage Hadoop vs. Distributed Storage Hadoop

It’s a testament to Hadoop’s flexibility that it supports multiple deployment models to account for varying budget, performance, capacity, and density requirements. The Dell EMC Isilon solution is a shared storage model in which the persistent filesystem data for Hadoop is stored in an Isilon NAS cluster. In the distributed model, by contrast, data is spread across the local storage of the Hadoop nodes themselves.

Figure 1. Shared Vs. Distributed Topologies

These two approaches offer varying advantages:


Shared Storage Hadoop | Distributed Storage Hadoop
Single source of data | Massive Scale (100s of PB+)
Reduced datacenter footprint (Storage Density) | Commodity platforms
Leverage storage platform features (Performance Tiering, Alternative RAID, Multi-Protocol) | Linear scaling
Independent scaling of storage/compute | Flexible replica model

Table 1. Shared Vs. Distributed Comparison

While this white paper focuses on the Dell EMC Isilon configurations, Dell EMC offers solutions built around both shared and distributed models in order to accommodate the breadth of our customers’ use cases. Please contact your Dell EMC sales team or your Dell EMC Customer Solution Centers Solution Architect if you wish to discuss the distributed solution in detail.

Dell EMC Isilon

Dell EMC® Isilon® scale-out storage solutions are designed for enterprises that want to manage their data, not their storage. Isilon storage systems are powerful yet simple to install, manage, and scale to virtually any size. Unlike traditional enterprise storage, Isilon solutions stay simple no matter how much storage capacity is added, how much performance is required, or how business needs change in the future. We’re challenging enterprises to think differently about their storage, because when they do, they’ll recognize there’s a better, simpler way with Isilon.

Dell EMC Isilon X-Series Nodes

The Isilon X-Series, our most flexible and comprehensive storage product line, strikes the right balance between large capacity and high-performance storage. The highly versatile X-Series is an ideal solution for high-throughput and high-concurrency applications. With SSD technology for file system metadata and file-based storage workflows, the Isilon X-Series significantly accelerates namespace-intensive operations. To meet rigorous data security and compliance requirements, Isilon also offers Data at Rest Encryption (DARE) with self-encrypting drive (SED) options with the X-Series platform.

Figure 2. Dell EMC Isilon X-Series

Dell EMC PowerEdge

Dell EMC’s server portfolio is broad enough to offer many different options for delivering the compute aspect of a Hadoop solution. With models that accommodate so many differing requirements around price, density, and management capabilities, it would take far too much time to list every possible option. We’ll start the conversation here with two recommended configurations: one from the modular infrastructure portfolio and one from the traditional rack infrastructure portfolio. Use these as a starting point and work with your Dell EMC server specialist to customize them to your exact specifications.

Dell EMC PowerEdge FX2, PowerEdge FC630, and PowerEdge FD332

The PowerEdge FX2 family is a fully modular ecosystem with enough configuration options to be tailored to the demands of any workload. For Isilon Data Lake designs, we need flexible internal storage options as well as robust network capability, both from server to Isilon and from server to client. The Dell EMC PowerEdge FC630 compute node with a PowerEdge FD332 disk shelf is a great way to pack a lot of punch into an easily manageable footprint.

Figure 3. Dell EMC PowerEdge FX2, PowerEdge FC630, PowerEdge FD332

Dell EMC PowerEdge R630

As our most popular server platform, the R630 has been battle-tested in almost every possible use case. In this configuration, we take advantage of plenty of drive slots for either rotational media or solid-state drives, as well as plenty of network bandwidth (both data and client-facing).

Figure 4. Dell EMC PowerEdge R630

Hadoop Roles

Compute Node(s)

With all shared filesystem responsibilities taken care of by the Isilon, these nodes’ primary role is to provide the computational horsepower to comb through all the data. However, they still need some local storage to help cache or accelerate those operations. With the drastic cost reductions in flash in recent years, some customers choose to build this local space from Solid State Drives (SSDs). The use of SSDs isn’t a hard-and-fast requirement, but it is becoming a common request as SSD prices continue to fall.

Function | Disks | Type
Operating System | 2 | RAID 1 (Mirror)
Spark Scratch / MapReduce Spill | 2-10 | Non-RAID or RAID 0 (Optionally SSD)

Table 2. Data Node Disk Layout
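As a rough illustration of how the non-RAID scratch disks in Table 2 are typically consumed, the sketch below builds the comma-separated directory lists that YARN and Spark accept for local scratch space. The property names `yarn.nodemanager.local-dirs` and `spark.local.dir` are standard YARN and Spark settings; the mount points (`/data/1`, `/data/2`, ...) and subdirectory names are hypothetical and are not taken from this paper.

```python
def scratch_dirs(disk_count, mount_prefix="/data", subdir="yarn/nm"):
    """Return one scratch directory per non-RAID data disk.

    mount_prefix and subdir are illustrative assumptions, not values
    from the validated configuration.
    """
    if not 2 <= disk_count <= 10:
        raise ValueError("Table 2 lists 2-10 scratch disks per compute node")
    return [f"{mount_prefix}/{i}/{subdir}" for i in range(1, disk_count + 1)]


def local_dir_settings(disk_count):
    """Build comma-separated values for the YARN and Spark scratch settings."""
    return {
        "yarn.nodemanager.local-dirs": ",".join(scratch_dirs(disk_count)),
        "spark.local.dir": ",".join(scratch_dirs(disk_count, subdir="spark")),
    }


if __name__ == "__main__":
    # Example: a compute node with 8 scratch disks.
    for key, value in local_dir_settings(8).items():
        print(f"{key}={value}")
```

Spreading scratch directories across individual disks lets the frameworks distribute spill I/O without a RAID layer. When Spark runs on YARN, the NodeManager's local directories are generally used in place of `spark.local.dir`.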

Infrastructure Nodes

The number of infrastructure servers will vary from customer to customer. In our recommended configuration we allocate 4 nodes, but fewer may suffice depending on your requirements for service high availability.

Manager Node(s)

The manager nodes in the cluster are responsible for running things like Cloudera Manager (Cloudera Hadoop), Ambari (Hortonworks Hadoop), and the principal roles for services like Hive, Oozie, and ZooKeeper. We need 3 of them in order to provide a quorum for high availability in case of a node failure. These boxes don’t need high-end configurations and are a ripe area for cost optimization. For the sake of our recommended configurations we’ll use the same chassis and server types as our compute nodes in order to keep a common platform, but this is by no means required. Additionally, if your requirements allow it, you can co-locate these roles on compute or edge nodes.
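The reason 3 manager nodes are the minimum for a quorum can be shown with a little arithmetic. The sketch below assumes a simple majority quorum, as used by ZooKeeper-style ensembles; the function names are illustrative only.

```python
def quorum_size(members):
    """Smallest majority of an ensemble with the given member count."""
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can fail while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this assumption, two nodes tolerate no failures at all, while three nodes tolerate one, which is why three is the smallest count that keeps the services available through a single node failure.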

Edge Node(s)

The role of the edge nodes is to be the primary interface for funneling data into a cluster as well as for pushing result data out of the cluster. They are most often multi-homed to the Isilon network as well as the datacenter network. The configuration of these nodes can vary drastically depending on the customer’s use case. For example, if you are staging batch jobs into a cluster, you’ll need a larger amount of local storage for that data to land on before you copy it into HDFS. If you are streaming data into the cluster, you won’t need a large amount of space but rather faster storage (like SSDs) to keep that data moving quickly. Much like the Manager Node(s), this is an area ripe for optimization depending on use case. Our recommendations keep the same configuration as the Manager Nodes just to keep some platform commonality. Lastly, as with the Manager Node(s), you can co-locate this role onto compute or manager nodes if your use case allows.


Recommended Configurations

Modular Infrastructure

Network Diagram

Figure 5. Modular Infrastructure – Network Diagram


Configuration

Isilon Data Lake Array

Isilon Node | 4x Dell EMC Isilon X410 102TB HDD / 3.2TB SSD 256GB 2x10GE and 2x1GE
Isilon Switch | 2x QDR IB Switch - 8 Port, 1U, 1PS

Table 3. Modular Infrastructure Configuration - Isilon Data Lake Array

Networking

Data Network Switches | 2x Dell EMC Networking S4048-ON 10GbE Switches
Management Network Switches | 1x Dell EMC Networking S3048-ON 1GbE Switch

Table 4. Modular Infrastructure Configuration – Networking

Compute Chassis

Compute Chassis | 3x Dell EMC PowerEdge FX2s
Chassis I/O Module | 2x (Per-chassis) Dell EMC FX2 10 GbE Pass-through Module
Compute Platform | 2x (Per-chassis) Dell EMC FC630 w/ 2x 2.5” disk slots
Compute Storage | 2x (Per-chassis) Dell EMC FD332 w/ 16x 2.5” disk slots

Table 5. Modular Infrastructure Configuration – Compute Chassis

Compute Servers

Compute Platform Processor | 2x (Per Sled) Intel Xeon E5-2698v4 (20C)
Compute Platform Memory | 256 GB (Per Sled) - 16x 16GB 2400MHz RDIMM
Compute Platform Disks | (OS) – 2x (Per Sled) 200GB Boot MLC 2.5” Intel S3610 Solid State Drives
Compute Platform Network Cards | 1x (Per Sled) Intel X710 Dual Port 10GbE Network Daughter Card

Table 6. Modular Infrastructure Configuration – Compute Servers

Compute Storage Shelves

Compute Storage Shelves | 8x (Per Sled) 1.2TB 10K RPM 2.5” HDD

Table 7. Modular Infrastructure Configuration – Compute Storage Shelves

Infrastructure Nodes Chassis

Infrastructure Chassis | 2x Dell EMC PowerEdge FX2s
Infrastructure Chassis I/O Module | 2x Dell EMC FX2 10 GbE Pass-through Module

Table 8. Modular Infrastructure Configuration – Infrastructure Nodes Chassis

Infrastructure Nodes Servers

Infrastructure Node Platform | 2x (Per-chassis) Dell EMC FC630 w/ 2x 2.5” disk slots
Infrastructure Node Processor | 2x (Per Sled) Intel Xeon E5-2640v4 (10C)
Infrastructure Node Memory | 128 GB (Per Sled) - 8x 16GB 2400MHz RDIMM
Infrastructure Node Disks | (OS) – 2x (Per Sled) 200GB Boot MLC 2.5” Intel S3610 Solid State Drives
Infrastructure Node Network Cards | 1x (Per Sled) Intel X710 Dual Port 10GbE Network Daughter Card

Table 9. Modular Infrastructure Configuration – Infrastructure Nodes Servers

Infrastructure Node Storage Shelves

Infrastructure Node Storage Shelves | 3x (Per Sled) 1.2TB 10K RPM 2.5” HDD

Table 10. Modular Infrastructure Configuration – Infrastructure Storage Shelves


Rack Server Infrastructure

Network Diagram

Figure 6. Rack Server Infrastructure – Network Diagram

Configuration

Isilon Data Lake Array

Isilon Node | 4x Dell EMC Isilon X410 102TB HDD / 3.2TB SSD 256GB 2x10GE and 2x1GE
Isilon Switch | 2x QDR IB Switch - 8 Port, 1U, 1PS

Table 11. Rack Server Infrastructure Configuration - Isilon Data Lake Array

Networking

Data Network Switches | 2x Dell EMC Networking S4048-ON 10GbE Switches
Management Network Switches | 1x Dell EMC Networking S3048-ON 1GbE Switch

Table 12. Rack Server Infrastructure Configuration – Networking

Compute Servers

Compute Platform | 6x Dell EMC PowerEdge R630 10-drive chassis
Compute Platform Processor | 2x Intel Xeon E5-2698v4 (20C)
Compute Platform Memory | 256 GB - 16x 16GB 2400MHz RDIMM
Compute Platform Disks | (OS) – 2x 200GB Boot MLC 2.5” Intel S3610 Solid State Drives; (Data) – 8x 1.2TB 10K RPM 2.5” HDD
Compute Platform Network Daughter Card | Intel X710 Dual Port 10GbE Network Daughter Card

Table 13. Rack Server Infrastructure Configuration – Compute Servers

Infrastructure Nodes Servers

Infrastructure Node Platform | 4x Dell EMC PowerEdge R630 10-drive chassis
Infrastructure Node Processor | 2x Intel Xeon E5-2640v4 (10C)
Infrastructure Node Memory | 128 GB - 8x 16GB 2400MHz RDIMM
Infrastructure Node Disks | (OS) – 2x 200GB Boot MLC 2.5” Intel S3610 Solid State Drives; (Data) – 3x 1.2TB 10K RPM 2.5” HDD
Infrastructure Node Network Daughter Cards | Intel X710 Dual Port 10GbE Network Daughter Card

Table 14. Rack Server Infrastructure Configuration – Infrastructure Nodes Servers

As-Tested Configuration

The configuration below only documents what was stood up in the Dell EMC Customer Solution Center in order to validate basic functionality. We encourage customers to leverage the same capabilities that the Customer Solution Centers provide by executing their own proofs of concept with us at no cost.

The compute and infrastructure roles were shared across the same nodes. This isn’t recommended in production, but it is acceptable at a small scale in a proof of concept.

Network Diagram


Figure 7. As-Tested Configuration – Network Diagram

Configuration

Isilon Data Lake Array

Isilon Node | 3x Dell EMC Isilon S210 19.8TB HDD / 1.6TB SSD 256GB 2x10GE and 2x1GE (OneFS v8.0.0.2)
Isilon Switch | 1x QDR IB Switch - 8 Port, 1U, 1PS

Table 15. As-Tested Configuration - Isilon Data Lake Array

Networking

Data Network Switches | 1x Dell EMC Networking S4048-ON 10GbE Switch
Management Network Switches | 1x Dell Force10 S60 1GbE Switch

Table 16. As-Tested Configuration – Networking

Compute Chassis

Compute Chassis | 1x Dell EMC PowerEdge FX2s
Chassis I/O Module | 2x (Per-chassis) Dell EMC FX2 10 GbE Pass-through Module
Compute Platform | 4x (Per-chassis) Dell EMC FC630 w/ 8x 1.8” SSD slots

Table 17. As-Tested Configuration – Compute Chassis

Compute Servers

Compute Platform Processor | 2x (Per Sled) Intel Xeon E5-2680v3 (12C)
Compute Platform Memory | 256 GB (Per Sled) - 16x 16GB 2400MHz RDIMM
Compute Platform Disks | (OS) – 2x (Per Sled) 480GB Intel S3610 MLC 1.8” Solid State Drives; (Data) – 6x (Per Sled) 480GB Intel S3610 MLC 1.8” Solid State Drives
Compute Platform Network Cards | 1x (Per Sled) Intel X520k Dual Port 10GbE Network Daughter Card
Compute Platform Operating System | Red Hat Enterprise Linux 7.2.1511

Table 18. As-Tested Configuration – Compute Servers

Considerations


Sizing Compute Nodes and Isilon Nodes

Many different factors go into sizing your cluster. It’s important to work with your Dell EMC account teams and Dell EMC Customer Solution Centers Solution Architects to make sure you’re appropriately accounting for as many variables as possible. Variables that may need to be accounted for include:

• Amount of Initial Data
• Number of Replicas
• Rate of Ingest
• Duration of Retention
• Scratch Space
• Compression
• Read/Write I/O Mix

Our initial guidance for the ratio of compute nodes to Isilon nodes is 2:1. However, this is only initial guidance, and we strongly recommend a more formal discussion with your Customer Solution Centers Solution Architect to arrive at a recommendation tailored to your requirements for capacity, performance, and any additional functions that the Isilon may be serving.
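To make the sizing inputs more concrete, the following sketch combines a few of the variables listed above into a first-pass estimate. The linear growth model (initial data plus ingest rate times retention, reduced by a compression ratio) is an assumption made purely for illustration and is not a Dell EMC sizing formula; only the 2:1 compute-to-Isilon starting ratio comes from this paper.

```python
import math


def projected_capacity_tb(initial_tb, ingest_tb_per_day, retention_days,
                          compression_ratio=1.0):
    """Estimate stored data once the retention window has filled.

    compression_ratio is uncompressed size divided by compressed size
    (1.0 means no compression). The linear model is an assumption.
    """
    uncompressed_tb = initial_tb + ingest_tb_per_day * retention_days
    return uncompressed_tb / compression_ratio


def starting_compute_nodes(isilon_nodes, ratio=2.0):
    """Apply the initial 2:1 compute-to-Isilon guidance, rounding up."""
    return math.ceil(isilon_nodes * ratio)


if __name__ == "__main__":
    # Hypothetical inputs: 100 TB on day one, 1 TB/day for a year, 2:1 compression.
    print(projected_capacity_tb(100, 1, 365, compression_ratio=2.0), "TB")
    print(starting_compute_nodes(4), "compute nodes for 4 Isilon nodes")
```

A real sizing exercise would also weigh the read/write mix, scratch space, and any other workloads the Isilon serves, which is why the estimate is only a starting point for the discussion with a Solution Architect.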

Isilon Platform

Isilon clusters simplify storage by combining the file system, volume manager, and data protection into the EMC Isilon OneFS® operating system. Through the clustered use of EMC Isilon high-performance X-Series nodes, high-capacity NL-Series nodes, and high-density HD-Series nodes, a single Isilon cluster can contain a mix of tiers that provide the best economics, throughput, or I/Os per second into the petabyte range. With over 80 percent storage utilization, Isilon clusters need less raw capacity than most storage systems. Compared to traditional direct-attached storage (DAS) Hadoop, Isilon can store the same data at a third of the storage capacity while providing more protection. Consolidating your unstructured data on Isilon results in greater efficiency, simplified management, and cost savings.
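The capacity comparison can be explored with a small calculation. The sketch below assumes the DAS cluster keeps 3 copies of each block (the HDFS default replication factor) and treats the 80 percent utilization figure as a usable-to-raw ratio. Both are assumptions, so the output is illustrative and is not a reproduction of the figures quoted in this paper.

```python
def das_raw_tb(usable_tb, replicas=3):
    """Raw capacity a DAS Hadoop cluster needs when every block is replicated."""
    return usable_tb * replicas


def isilon_raw_tb(usable_tb, utilization=0.80):
    """Raw capacity needed when only `utilization` of raw space is usable."""
    return usable_tb / utilization


if __name__ == "__main__":
    usable = 100  # hypothetical usable dataset size in TB
    print(f"DAS raw capacity:    {das_raw_tb(usable):.0f} TB")
    print(f"Isilon raw capacity: {isilon_raw_tb(usable):.0f} TB")
```

Changing the replica count or utilization shows how sensitive the comparison is to those two inputs.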

Server Platform

There are plenty of options for compute and infrastructure nodes in the Dell EMC PowerEdge portfolio. We’ve detailed two recommended configurations above, but many others can be discussed with your Dell EMC Customer Solution Centers Solution Architects. Options include:

PowerEdge Rack / Tower Servers – These R- and T-series servers are among the most popular options for customers looking for traditional 1U and 2U options. Either the PowerEdge R630 at 1U for density or the PowerEdge R730/XD for drive options is a great choice.

Modular Servers – Customers looking for robust manageability and integrated networking can look to the Dell EMC modular infrastructure portfolio. The Dell EMC PowerEdge M1000 Blade chassis and the Dell EMC PowerEdge FX families are great choices. Just make sure that you have enough drive slots or disk capacity to accommodate the local storage/scratch space that is needed. These are also great for instances where high datacenter density is required (co-location / hosting).

Server CPU

The server core and frequency requirements for each customer can vary wildly. We recommend working closely with your Dell EMC Customer Solution Centers Solution Architects to identify the right processor given your unique workload. You can also execute a proof of concept in the Customer Solution Centers at no charge in order to get an accurate characterization of your expected performance.


Server Memory

As with the Server CPUs, this can vary from customer to customer and use case to use case. Generally, we recommend starting at 256GB and going up from there as your utilization of in-memory technologies (Spark, Impala, Alluxio, etc.) increases.

Server Local Storage

You’ll need some host-side cache / scratch space for your compute nodes. Approximately 5-8TB per node is common, on either rotational or flash storage. The total scratch space across your compute nodes should equal approximately 25% of your usable Hadoop capacity. With the rapidly falling prices of flash memory, it can make sense to utilize those technologies to get fast local storage in ever-increasing amounts. If you do opt for SSDs, this local scratch space can be SSDs either in drive bays or in PCIe form factors.
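The 25% guideline above translates into a simple calculation. The sketch below is a minimal illustration; the function names and the example cluster (200 TB usable, 8 compute nodes) are hypothetical.

```python
def total_scratch_tb(usable_hadoop_tb, fraction=0.25):
    """Cluster-wide scratch space suggested by the ~25% guideline."""
    return usable_hadoop_tb * fraction


def scratch_per_node_tb(usable_hadoop_tb, compute_nodes, fraction=0.25):
    """Scratch space each compute node contributes if spread evenly."""
    return total_scratch_tb(usable_hadoop_tb, fraction) / compute_nodes


if __name__ == "__main__":
    usable_tb, nodes = 200, 8  # hypothetical cluster
    print(f"Total scratch: {total_scratch_tb(usable_tb):.1f} TB")
    print(f"Per node:      {scratch_per_node_tb(usable_tb, nodes):.2f} TB")
```

For this hypothetical cluster the per-node figure lands inside the 5-8TB range mentioned above; a result well outside that range may suggest revisiting the node count or the drive configuration.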

Network

At a minimum, you’ll want dual 10GbE from each host to the Isilon data nodes. As your bandwidth needs increase, you’ll want to consider either segmenting front-side traffic (client to compute nodes) onto its own network cards, or increasing the number and/or speed of the links to each node. Prices on 25GbE and 40GbE cards are becoming very affordable, and you may want to consider investing in those early in order to reduce complexity (no need for complicated bonding) and to prepare for the ever-increasing bandwidth needs of emerging workloads. The Dell EMC Networking S6100 switch is an excellent switch for high-bandwidth needs either at the host level or at the aggregation tier linking multiple racks together.

It’s also worth noting that as the Dell EMC Isilon product evolves, investing in 40GbE networking will be very wise for both compute-node connectivity as well as datanode-to-datanode connectivity.
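A quick way to reason about link counts and speeds is to compare aggregate line-rate bandwidth on the compute side with that on the Isilon side. The sketch below uses nominal link speeds only and ignores protocol overhead, bonding behavior, and switch limits, so it is a rough planning aid under those assumptions. The example node counts match the rack server configuration (6 compute servers, 4 Isilon nodes, each with dual 10GbE).

```python
def aggregate_gbps(nodes, links_per_node, link_gbps):
    """Nominal aggregate bandwidth across all nodes, in Gb/s."""
    return nodes * links_per_node * link_gbps


def compute_to_storage_ratio(compute_nodes, isilon_nodes,
                             links_per_node=2, link_gbps=10):
    """Ratio of compute-side to Isilon-side nominal bandwidth."""
    compute = aggregate_gbps(compute_nodes, links_per_node, link_gbps)
    storage = aggregate_gbps(isilon_nodes, links_per_node, link_gbps)
    return compute / storage


if __name__ == "__main__":
    print("Compute side:", aggregate_gbps(6, 2, 10), "Gb/s")
    print("Isilon side: ", aggregate_gbps(4, 2, 10), "Gb/s")
    print("Ratio:       ", compute_to_storage_ratio(6, 4))
```

A ratio above 1 means the compute nodes can nominally request more bandwidth than the Isilon nodes can serve over their front-end links, which is one signal that faster links on the storage side may be worth considering.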

Dell EMC Customer Solution Centers

The Dell EMC Customer Solution Centers are a global network of connected labs that allow Dell to help customers architect, validate, and build solutions. With multiple footprints in every region, they can help you understand anything from simple hardware platforms to more complex solutions. These engagements range from an informal 30-60 minute briefing, through a longer half-day workshop, to a proof of concept that allows customers to kick the tires of their solution prior to signing on the dotted line. Customers may engage with their account team and have them submit a request to take advantage of these services at no charge.

Links

Dell Customer Solution Centers – http://www.dell.com/customersolutioncenter
Dell EMC FX PowerEdge Server FX Architecture – http://www.dell.com/en-us/work/learn/fx-server-solutions
Dell EMC Isilon Info Hub For Hadoop – https://community.emc.com/docs/DOC-39529
Isilon Hadoop Tools – https://github.com/Isilon/isilon_hadoop_tools
Cloudera – http://cloudera.com
Hortonworks – http://hortonworks.com
