Red Hat Data Analytics Infrastructure Solution Multi-Tenant Workload Isolation with a Shared Context

Brief


Benefits:

• Deploy and decommission analytics clusters on demand with Red Hat private cloud infrastructure.
• Share large analytics datasets through Amazon S3 object storage, avoiding unnecessary duplication and eliminating data access delays.
• Accommodate bare-metal, virtualized, or containerized environments using Red Hat Ceph Storage.
• Use space-efficient erasure coding for data protection, saving up to 50% of per-cluster storage costs over HDFS 3x replication.1

Introduction

As more organizations embrace analytics, the need for scalable shared data storage has grown exponentially. Artificial intelligence, machine learning, and deep learning have provided new ways to extract more value from accumulated information. Storing the necessary large datasets, however, can be a challenge. Effectively sharing multipetabyte datasets is more difficult still.

Traditional storage arrays lack the necessary cloud scale to handle large analytics datasets. They can also lead to vendor lock-in and high total cost of ownership. For Apache Hadoop users, the Hadoop Distributed File System (HDFS) has been a popular choice, but HDFS suffers from the close coupling of compute and storage. HDFS clusters cannot scale storage without also scaling compute resources. I/O performance can also suffer as clusters grow and data locality to compute becomes suboptimal.

The Red Hat® data analytics infrastructure solution offers a novel approach based on accepted cloud models and open source technology. By integrating key components of the Red Hat stack, the solution provides the ability to rapidly spin up or spin down analytics clusters while giving them access to the same shared data repository (also commonly known as a shared data lake).

The Red Hat data analytics infrastructure solution

With traditional Hadoop, each separate analytics cluster typically has its own dedicated HDFS datastore. To provide access to the same data for different Hadoop/HDFS clusters, data platform teams frequently must copy large datasets between the clusters, trying to keep them consistent and up to date. To cope, firms often maintain many separate, fixed analytics clusters (more than 50 in one company Red Hat interviewed2), each with its own redundant data copy in HDFS containing potentially petabytes of data. Keeping datasets updated between clusters requires an error-prone maze of scripts and delays for time-consuming data hydration. The cost of maintaining 5, 10, or 20 copies of multipetabyte datasets on multiple clusters is prohibitive to many organizations in terms of capital expenses (CapEx) and operating expenses (OpEx).
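The capacity arithmetic behind duplicated HDFS copies and erasure coding can be sketched as follows. This is a minimal illustration, not a sizing tool: the function names, the 1 PB dataset, the 10-cluster count, and the 4+2 erasure-coding profile are assumptions chosen for the example. The brief itself states only 3x HDFS replication and savings of up to 50%.

```python
def raw_replicated(usable_pb: float, replicas: int = 3) -> float:
    """Raw capacity (PB) needed to hold usable_pb with N-way replication."""
    return usable_pb * replicas


def raw_erasure_coded(usable_pb: float, k: int, m: int) -> float:
    """Raw capacity (PB) needed with k data chunks plus m coding chunks."""
    return usable_pb * (k + m) / k


# Hypothetical inputs, chosen only to make the arithmetic concrete.
dataset_pb = 1.0
clusters = 10

# One 3x-replicated HDFS copy per cluster.
per_cluster_hdfs = raw_replicated(dataset_pb)
duplicated_total = clusters * per_cluster_hdfs

# One shared copy protected by an assumed 4+2 erasure-coding profile.
shared_ec = raw_erasure_coded(dataset_pb, k=4, m=2)

# Fraction of raw capacity saved by 4+2 erasure coding versus 3x replication.
savings_vs_3x = 1 - shared_ec / per_cluster_hdfs

print(f"Raw capacity, {clusters} HDFS copies: {duplicated_total} PB")
print(f"Raw capacity, one shared EC copy: {shared_ec} PB")
print(f"Savings of EC over 3x replication: {savings_vs_3x:.0%}")
```

Under these assumptions, a 4+2 profile needs 1.5 times the usable capacity instead of 3 times, which is where a figure of 50% can come from. Other erasure-coding profiles give different ratios.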

The growing popularity of public cloud-based solutions has introduced data scientists and analysts to the ability to deploy and decommission clusters on demand. By design, public clouds like Amazon Web Services (AWS) have provided access to shared datasets, avoiding time-consuming data hydration periods after initializing a new cluster or destaging cycles upon cluster termination. Many analysts now expect these capabilities on premises.

The Red Hat data analytics infrastructure solution implements a software-defined shared datastore. The solution supports Amazon Simple Storage Service (Amazon S3) compatibility. With the S3A filesystem client connector, Apache Spark and Hadoop jobs and queries can run directly against data held within a shared S3-compatible datastore.
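As a rough sketch of what pointing Spark at such a datastore can look like, the helper below assembles Hadoop S3A properties and renders them as spark-submit flags. The property names (fs.s3a.endpoint, fs.s3a.access.key, fs.s3a.secret.key, fs.s3a.path.style.access, fs.s3a.connection.ssl.enabled) come from the Hadoop S3A connector, and the spark.hadoop. prefix is how Spark forwards settings to Hadoop. The helper names, endpoint URL, and credentials are hypothetical placeholders, and the brief does not prescribe these particular settings.

```python
from typing import Dict, List


def s3a_spark_conf(endpoint: str, access_key: str, secret_key: str,
                   use_ssl: bool = True) -> Dict[str, str]:
    """Build Spark properties that point the S3A connector at an endpoint."""
    hadoop_props = {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Path-style requests are often needed for non-AWS S3 endpoints
        # (assumption; depends on how the gateway is set up).
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }
    # Spark copies every "spark.hadoop.*" property into the Hadoop config.
    return {f"spark.hadoop.{key}": value for key, value in hadoop_props.items()}


def as_submit_flags(conf: Dict[str, str]) -> List[str]:
    """Render the properties as alternating --conf / key=value arguments."""
    flags: List[str] = []
    for key in sorted(conf):
        flags.extend(["--conf", f"{key}={conf[key]}"])
    return flags


# Hypothetical endpoint and credentials, for illustration only.
conf = s3a_spark_conf("http://object-gateway.example.internal:8080",
                      "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY",
                      use_ssl=False)
print(" ".join(as_submit_flags(conf)))
```

With settings like these in place, a job would address data through s3a:// URIs (for example a hypothetical s3a://analytics-data/events/). In practice, credentials would normally come from a credential provider rather than command-line flags.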

1 Testing by Red Hat and QCT, 2017-2018, https://www.redhat.com/en/blog/why-spark-ceph-part-3-3.
2 Red Hat conversations with early adopters, 2017-2018.

facebook.com/redhatinc
@redhat
linkedin.com/company/red-hat

Based on thorough testing, the solution integrates key components of the Red Hat stack:

• Red Hat OpenStack® Platform is a cloud computing platform that virtualizes resources from industry-standard hardware, organizes those resources into clouds, and manages them. In the context of this solution, it provides on-demand provisioning of virtualized analytics clusters.

• Red Hat OpenShift® Container Platform is an optional element of the solution for those who are interested in containerizing Spark clusters. It provides a reliable, enterprise-class platform that combines the industry-leading Kubernetes container orchestration engine with advanced application development and delivery automation features.

• Red Hat Ceph® Storage is an open and massively scalable software-defined storage solution for modern cloud workloads, optimized to deliver the scale and performance that demanding analytics applications require.

The benefits of a shared data repository

The Red Hat data analytics infrastructure solution is ideal for organizations that want to provide an S3-compatible shared data repository experience to their data scientists and data analysts. Supporting Spark or Hadoop analytics provides several benefits over traditional HDFS, including:

• Lower CapEx through reduced duplication. Petabytes of redundant storage capacity can be reduced or eliminated while allowing access to the same datasets by multiple clusters.

• Lower CapEx through improved data durability efficiency. Using available erasure coding for data protection potentially reduces CapEx of purchased storage capacity by 50% over typical 3x HDFS replication.3

• Right-sized CapEx for infrastructure. Shared object storage promotes right-sizing of compute needs and avoids overprovisioning of either compute or storage resources.

• Lower OpEx and risk. With shared storage, clusters can retain access to the same data without costly, time-consuming scripting and scheduling of dataset copies between HDFS instances.

• Accelerated insights and better compliance. Analyzing data in place within a shared Ceph data repository helps reduce time to insight while maintaining compliance within established deadlines.

• Support for different tool and version needs of diverse teams. With a shared datastore, cluster users can choose the toolsets and versions appropriate to their jobs without disrupting users from other teams requiring different tools and versions.

Conclusion

With the Red Hat data analytics infrastructure solution, data scientists and analysts can derive insights more quickly because they do not have to wait for infrastructure or compete with other teams. They can spin up clusters on demand with a shared datastore, eliminating the need to duplicate large datasets or wait for duplicate clusters to be hydrated and brought online. They can choose the tools and versions that best suit their specific needs without impacting other teams. By adopting software-defined storage models from the public cloud, organizations can simplify how their data storage resources are deployed and managed. Full support for Amazon S3 means that Hadoop and Spark jobs run unaltered, lowering both cost and risk.

Further information is available in the Red Hat data analytics infrastructure solution technology detail.

3 Testing by Red Hat and QCT, 2017-2018, https://www.redhat.com/en/blog/why-spark-ceph-part-3-3.

About Red Hat

Red Hat is the world’s leading provider of enterprise open source software solutions, using a community-powered approach to deliver reliable and high-performing Linux, hybrid cloud, container, and Kubernetes technologies. Red Hat helps customers integrate new and existing IT applications, develop cloud-native applications, standardize on our industry-leading operating system, and automate, secure, and manage complex environments. Award-winning support, training, and consulting services make Red Hat a trusted adviser to the Fortune 500. As a strategic partner to cloud providers, system integrators, application vendors, customers, and open source communities, Red Hat can help organizations prepare for the digital future.

North America: 1 888 REDHAT1, www.redhat.com
Europe, Middle East, and Africa: 00800 7334 2835
Asia Pacific: +65 6490 4200
Latin America: +54 11 4329 7300

Copyright © 2019 Red Hat, Inc. Red Hat, Red Hat Enterprise Linux, OpenShift, Ceph, and the Red Hat logo are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries. The OpenStack word mark and the Square O Design, together or apart, are trademarks or registered trademarks of OpenStack Foundation in the United States and other countries, and are used with the OpenStack Foundation’s permission. Red Hat, Inc. is not affiliated with, endorsed by, or sponsored by the OpenStack Foundation or the OpenStack community.

redhat.com #F19438_0919