Solution overview Cisco public

Refresh Your Data Lake to Cisco Data Intelligence Platform

The evolving Hadoop landscape

In early 2019, Hortonworks and Cloudera, the providers of the leading Hadoop distributions, merged. The merger raised the bar on innovation in the big data space, and the new “Cloudera” launched Cloudera Data Platform (CDP), which combines the best of Hortonworks’ and Cloudera’s technologies to deliver the industry’s first enterprise data cloud. Recently, Cloudera released CDP Private Cloud Base, the on-premises version of CDP. This unified distribution brought in several new features, optimizations, and integrated analytics.

CDP Private Cloud Base is built on the Hadoop 3.x distribution. Hadoop has gained many capabilities since its inception, and Hadoop 3.0 is an eagerly awaited major release with several new features and optimizations. Upgrading from Hadoop 2.x to 3.0 is a paradigm shift because it enables diverse computing resources (CPU, GPU, and FPGA) to work on data and leverage AI/ML methodologies. It supports flexible and elastic containerized workloads managed by a scheduler such as YARN, distributed deep learning, GPU-enabled Spark workloads, and more. In addition, Hadoop 3.0 offers better reliability and availability of metadata through multiple standby name nodes, disk balancing for evenly utilized data nodes, enhanced workload scheduling with YARN 3.0, and overall improved operational efficiency.

Going forward, the Ozone initiative lays the foundation of the next generation of storage architecture for HDFS, where data blocks are organized in storage containers for higher scale and better handling of small objects in HDFS. The Ozone project also includes an object store implementation to support several new use cases.

Consideration in the journey of a Hadoop refresh

Despite the capability gap between Hadoop 2.x and 3.x, it is estimated that more than 80 percent of the Hadoop installed base is still on either HDP2 or CDH5, which are built on Apache Hadoop 2.0 and reach end of support by the end of 2020.

Amid those feature enrichments, specialized computing resources, and end of support, a Hadoop upgrade is a value-added refresh. Considering these enhancements, it is imperative to take a more holistic approach when refreshing your data lake, such as conjoining various frameworks and open-source technologies with the Hadoop ecosystem.

© 2020 Cisco and/or its affiliates. All rights reserved. Solution overview Cisco public

As the Hadoop journey continues, more capable software frameworks and technologies are being introduced for processing big data, and they will continue to evolve and integrate in a modular fashion. Furthermore, specialized hardware such as GPUs and FPGAs is becoming the de facto standard for deep learning on very large datasets. Figure 1 shows how AI/ML frameworks and containerization are augmenting the Hadoop ecosystem.

Figure 1. Hadoop 3.0 refresh with AI included

[Figure 1 diagram: Cisco UCS enabling these next-generation workloads. Recoverable labels: YARN Scheduler; CPU, memory, NVIDIA GPU, and FPGA resources; AI support in a data lake; two components marked Tech Preview.]

Containerization

Hadoop 3.0 introduces production-ready container support on YARN with GPU isolation and scheduling. This opens up a plethora of opportunities for modern applications, such as microservices and distributed application frameworks comprising thousands of containers, to execute AI/ML algorithms on petabytes of data quickly and easily.

Distributed deep learning with Apache Submarine

The Hadoop community initiated the Apache Submarine project to make distributed deep learning/machine learning applications easy to launch, manage, and monitor. These improvements make running distributed deep learning/machine learning applications (TensorFlow, PyTorch, MXNet, etc.) on Apache Hadoop YARN as simple as running them locally, which lets data scientists focus on algorithms instead of the underlying infrastructure.

Apache Spark 3.0

Apache Spark 3.0 is a release that data scientists and data engineers have been eagerly anticipating. Spark is no longer limited to CPU for its workloads; it now offers GPU isolation and acceleration. To easily manage the deep learning environment, YARN launches Spark 3.0 applications with GPU. This paves the way for other Spark workloads, such as machine learning and ETL, to also be accelerated by GPU. Learn more by reading the Cisco Blog on Apache Spark 3.0.

Cloudera Data Platform Private Cloud Base (PvC)

With the merger of Cloudera and Hortonworks, the new “Cloudera” released Cloudera Data Platform (CDP), which combines the best of Hortonworks’ and Cloudera’s technologies to deliver the industry’s first enterprise data cloud. CDP Private Cloud Base is the on-premises version of CDP. This unified distribution is a scalable and customizable platform where workloads can be securely provisioned. CDP gives a clear path for extending or refreshing your existing HDP and CDH deployments and sets the stage for cloud-native architecture.

Cloudera Data Platform Private Cloud

Shadow IT can now be eliminated when CDP Private Cloud is implemented in Cisco® Data Intelligence Platform. CDP Private Cloud offers a cloud-like experience in a customer’s on-premises environment. With disaggregated compute and storage, a complete multi-tenant self-service analytics environment can be implemented, thereby offering better infrastructure utilization.

Also, CDP Private Cloud offers data scientist, data engineer, and data analyst personas, bringing the right tools to each user and improving time to value.

Red Hat OpenShift Container Platform (RHOCP) cluster

Cloudera selected Red Hat OpenShift as the preferred container platform for CDP Private Cloud. With RHOCP, CDP Private Cloud delivers powerful, self-service analytics and enterprise-grade performance with the granular security and governance policies that IT leaders demand.

Apache Hadoop Ozone object store

Apache Hadoop Ozone is a scalable, redundant, and distributed object store for Hadoop. Apart from scaling to billions of objects of varying sizes, Ozone can function effectively in containerized environments such as Kubernetes and YARN. Applications using frameworks like Apache Spark, YARN, and Hive work natively without any modifications. Ozone is built on a highly available, replicated block storage layer called Hadoop Distributed Data Store (HDDS). Ozone is a scale-out architecture with minimal operational overhead and long-term maintenance effort. Ozone can be co-located with HDFS under a single set of security and governance policies for easy data exchange or migration. It also offers seamless application portability. Ozone enables separation of compute and storage via the S3 API. Similar to HDFS, it also supports data locality for applications that choose to use it.
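As a minimal sketch of the S3 API path to Ozone described above, and assuming Spark reaches Ozone through Hadoop's S3A connector and an Ozone S3 gateway, the following Python helper assembles the relevant Spark properties. The host name `ozone-s3g.example.internal`, the bucket and object key, and both helper names are hypothetical; port 9878 is the gateway's usual default and should be checked against the actual deployment. Credentials are deliberately left out and would normally come from a credential provider.

```python
from typing import Dict


def ozone_s3a_spark_conf(endpoint: str) -> Dict[str, str]:
    """Return Spark properties that route s3a:// paths to an S3-compatible endpoint.

    The property names are standard Hadoop S3A settings with the
    "spark.hadoop." prefix that Spark uses to forward them to Hadoop.
    """
    if not endpoint.startswith(("http://", "https://")):
        raise ValueError("endpoint must include the http:// or https:// scheme")
    return {
        # Where the S3A connector sends requests (assumed: an Ozone S3 gateway).
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        # Address buckets as <endpoint>/<bucket> instead of <bucket>.<endpoint>.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


def s3a_uri(bucket: str, key: str) -> str:
    """Build the s3a:// URI an application would read from or write to."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical gateway address; replace with the real one.
    conf = ozone_s3a_spark_conf("http://ozone-s3g.example.internal:9878")
    for name, value in sorted(conf.items()):
        print(f"--conf {name}={value}")
    print(s3a_uri("datalake", "/raw/events.parquet"))
```

The printed `--conf` pairs could be appended to a `spark-submit` command line. Path-style access is assumed here because it avoids the need for per-bucket DNS names.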

The design of Ozone was guided by the key principles listed in Figure 2.

Figure 2. Ozone design principles

[Figure 2 panels: Highly scalable; Secure; Multi-protocol; Layered architecture; Data locality; Highly available; Cloud native]

Kubernetes

Extracting intelligence from data lakes in a timely fashion is key to finding emerging business opportunities, accelerating time-to-market efforts, gaining market share, and increasing overall business agility.

In today’s fast-paced digitization, Kubernetes enables enterprises to rapidly deploy new updates and features at scale while maintaining consistency across testing, development, and production environments. Kubernetes lays the foundation for cloud-native apps, which can be packaged in container images and ported to diverse platforms. Containers with a microservice architecture, managed and orchestrated by Kubernetes, help organizations embark on a modern development pattern. Moreover, Kubernetes has become a de facto standard for container orchestration and offers the core of an on-premises container cloud for enterprises. Kubernetes is a single cloud-agnostic infrastructure with a rich open-source ecosystem. It allocates, isolates, and manages resources across many tenants at scale, elastically and as needed, thereby making efficient use of infrastructure resources. Figure 3 shows how Kubernetes is transforming the use of compute and becoming the de facto standard for running applications.

Figure 3. Compute on Kubernetes is exciting

[Figure 3 panels: Environmental Consistency; Application Portability; Personalities; Utilization; Agility; Hybridity; Observability and Monitoring]

Hybrid architecture

Red Hat OpenShift is the preferred container cloud platform for CDP Private Cloud and is the market-leading Kubernetes-powered container platform. This combination completes the vision of the very first enterprise data cloud, with a powerful hybrid architecture that decouples compute and storage for greater agility, ease of use, and more efficient use of private and multi-cloud infrastructure resources. With Cloudera’s Shared Data Experience (SDX), security and governance policies can be easily and consistently enforced across data and analytics in private as well as multi-cloud deployments. This hybridity will open myriad opportunities for multi-function integration with other frameworks such as streaming data, batch workloads, analytics, data pipelining/engineering, and machine learning.

Spark on Kubernetes

The introduction of Spark 2.3 brought native support for running Apache Spark on Kubernetes, enabling a Kubernetes cluster to act as compute for the data lake, much of which is used in Cloudera Private Cloud applications. Spark 2.3 applications can also be submitted to a Kubernetes cluster.

Spark on Kubernetes is a great stride in the Hadoop ecosystem, as it opened the door for many public cloud-specific applications and framework use cases to be deployed on premises, thus providing hybridity to stretch to the cloud anywhere. Kubernetes addresses gaps that existed in YARN, such as a lack of isolation and reproducibility. Kubernetes also allows workloads to be packaged in Docker images. Spark on Kubernetes inherits all other built-in features, such as auto-scaling, detailed metrics, advanced container networking, security, and so on.

Cloud-native architecture for data lakes and AI

Cisco Data Intelligence Platform with CDP Private Cloud accelerates the journey to becoming cloud-native for your data lake and AI/ML workloads. By leveraging a Kubernetes-powered container cloud, enterprises can now quickly break the silos in monolithic application frameworks and embrace continuous innovation through a microservices architecture with a CI/CD approach. With a cloud-native ecosystem, enterprises can build scalable and elastic modern applications that extend the boundaries from a private cloud to a hybrid infrastructure.
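To make the Spark on Kubernetes submission path described earlier more concrete, here is a minimal Python sketch that assembles a `spark-submit` command line targeting a Kubernetes API server. The flags and `spark.kubernetes.*` properties are the ones Spark documents for its Kubernetes mode; the helper name, API server address, image name, namespace, and application path are hypothetical placeholders, and the sketch only builds the command without running it.

```python
import shlex
from typing import List


def spark_submit_k8s_args(
    api_server: str,
    image: str,
    app_resource: str,
    main_class: str,
    executors: int = 2,
    namespace: str = "default",
) -> List[str]:
    """Build a spark-submit argument list for cluster mode on Kubernetes."""
    if not api_server.startswith(("https://", "http://")):
        raise ValueError("api_server must include the http:// or https:// scheme")
    return [
        "spark-submit",
        # The k8s:// prefix tells Spark to schedule onto a Kubernetes cluster.
        "--master", f"k8s://{api_server}",
        # In cluster mode the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        # local:// means the file is already present inside the image.
        app_resource,
    ]


if __name__ == "__main__":
    args = spark_submit_k8s_args(
        api_server="https://k8s-api.example.internal:6443",  # hypothetical
        image="registry.example.internal/spark:latest",      # hypothetical
        app_resource="local:///opt/spark/examples/jars/spark-examples.jar",
        main_class="org.apache.spark.examples.SparkPi",
    )
    print(" ".join(shlex.quote(a) for a in args))
```

Keeping the command as a list rather than a single string means it can be handed to `subprocess.run` without shell quoting issues, should the command actually be executed.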

Upgrade your Data Lake with Cisco Data Intelligence Platform (CDIP)

Platform upgrades are exciting as they bring long-awaited features and capabilities. However, successful upgrades require planning, and that planning involves the underlying infrastructure as much as the refreshed software.

Given all these long-awaited frameworks and functionalities (unified distribution with CDP, S3-compatible object store, CDP Private Cloud) and, most importantly, end of support for older releases, now is the time to consider a risk-free system refresh with Cisco Data Intelligence Platform.

Enterprises broadly agree that data science initiatives are effectively driving business value. However, exponential data growth and the need to analyze enormous amounts of data, whether at rest or in motion, at higher rates create several challenges, such as I/O bottlenecks, multiple management touchpoints, growing cluster complexity, and performance degradation. Cisco Data Intelligence Platform is thoughtfully designed and engineered with an ecosystem-driven data strategy, thereby providing high-quality collaboration between data science and IT. Figure 4 illustrates how CDP and HDP integrate with Cisco Data Intelligence Platform.

Figure 4. Ecosystem-driven data strategy

[Figure 4 diagram: Cisco Data Intelligence Platform with Cloudera Data Platform; HDP (Hortonworks Data Platform, powered by Apache Hadoop); CDH (Cloudera’s open source platform distribution, including Apache Hadoop)]

The Cisco Data Intelligence Platform (CDIP) delivers:

• The latest generation of CPUs from Intel (2nd Generation Intel Xeon Scalable family, with Cascade Lake CLXR) and AMD (EPYC Rome CPUs)
• Cloud scale and a fully modular architecture, where big data, AI/compute farm, and massive storage tiers work together as a single entity and each CDIP component can also scale independently to address the IT issues in the modern data center
• World-record Hadoop performance, for both MapReduce and Spark frameworks, published at the TPCx-HS benchmark
• AI compute farm, which offers different types of AI frameworks and compute types (GPU, CPU, FPGA) to work on data for analytics
• A massive storage tier that enables customers to gradually retire data and quickly retrieve it when needed on storage-dense sub-systems with a lower cost per TB for a better TCO
• Data compression with FPGA, which allows customers to offload compute-heavy compression tasks to FPGA, relieve the CPU to perform other tasks, and gain significant performance
• Seamless scale of the architecture to thousands of nodes
• Single-pane-of-glass management with Cisco Intersight™
• An ISV partner ecosystem, a best-in-class ecosystem of vendors that offers best-of-breed, end-to-end validated architectures
• A pre-validated and fully supported platform
• Disaggregated architecture, supporting separation of storage and compute for a data lake

Figure 5 illustrates how the ecosystem of vendors’ technologies work together.

Figure 5. Partner ecosystem

[Figure 5 diagram. Recoverable labels: Pre-validated (fully supported; architectural innovations); World record performance (TPCx-HS, 20 plus; proven linear scaling; only vendor to publish a 300 TB test); Scaling (independently scale storage and compute; data tiering); Centralized management (infrastructure management); Data-intensive workloads; Massive storage.]

Cisco Intersight manages CDIP, which offers a powerful experience of cloud-based management

Cisco Intersight is a lifecycle management platform for Cisco Data Intelligence Platform, regardless of where it resides (see Figure 4). Cisco Intersight features SaaS-based management, proactive guidance, security and extensibility, enhanced support with connected Cisco Technical Assistance Center (TAC) integration, visibility anywhere, and much more.

Figure 6. Cisco Intersight features

[Figure 6 panels: Platform compliance (HW/FW compatibility checks); Connected TAC (Technical Assistance Center); Unified management (dashboard and data collection); Cisco Security Advisories (CVEs); SaaS delivered (hosted management or connected appliance); Centralized, comprehensive management; Automation; Global policies; Single pane of glass; Connect everything (UCS Director, UCS Central, UCS Manager, and IMC)]

Organizations that are deploying a container cloud in production need a platform strategy that encompasses key elements such as security, governance, monitoring and logging, data protection and persistence, container lifecycle management, process priority and isolation, and end-to-end automation and orchestration from the application layer to the infrastructure layer. Cisco Data Intelligence Platform delivers a cloud infrastructure strategy for data and applications that is aligned with long-term business strategy and has a clear vision of becoming cloud-native, from on-premises to hybrid, in the long run. Table 1 outlines CDIP features.

Table 1. Cisco Data Intelligence Platform Features

Features Phase 1 Phase 2

Infrastructure
  Hardware support: CPU/GPU; FPGA
  Security: Cloudera security (authorization, authentication, RBAC, encryption)
  Networking: 25G/40G – thousands of nodes; 100G – thousands of nodes
  Compute: 2nd Gen Intel Xeon Scalable family, AMD EPYC Rome
  Storage (data locality + decoupled compute and storage): HDFS; S3 storage for Hadoop - Apache Ozone; Minio, S3-compatible storage
  FPGA compression: HDFS compression with Xilinx
  Centralized management: Cisco Intersight™

AI/ML – Deep learning on Kubernetes
  Deep learning: Cloudera machine learning
  Distributed deep learning: Apache Submarine

AI/ML – Inference and model management
  Inference on CPU: Cloudera machine learning
  Inference on GPU: Triton Inference Server
  Inference on FPGA: Deploying models to Xilinx FPGA

Data processing and AI/ML
  Application and services: Spark 2.x/Spark 3.0

Monitoring for Kubernetes
  Application and k8 infra: AppDynamics

Workload Optimizer for Kubernetes
  Workload optimizer: Intersight Workload Optimizer (IWO)

Kubernetes
  Kubernetes: Red Hat OpenShift

Multi-cloud
  Hybrid/multi-cloud: AWS/Google

Governance
  Governance: Cloudera SDX

Backup and disaster recovery
  Backup/point-in-time recovery: Cloudera BDR and Ozone Hybrid

Experiences (or personas-driven)
  Data warehouse (data analyst): Cloudera data warehouse
  Data lake (data engineer): Cloudera data engineer
  Data science (data scientist): Cloudera machine learning
  Edge use cases: Cloudera data flow

Conclusion

Cisco Data Intelligence Platform is a robust platform that lays the foundation for all the exciting new architectural and technological innovation happening in the data lake world. It sets the stage for Cloudera Data Platform Private Cloud and for all of the upcoming enhancements for increased flexibility. By design, Cisco Data Intelligence Platform is a disaggregated architecture, which makes the big data journey easy and removes complexity from refresh or upgrade cycles. With Cisco Data Intelligence Platform, each component can not only scale independently but also be refreshed or upgraded independently.

For more information

• Optimizing Analytics Workloads with the Cisco Data Intelligence Platform
• To learn more about Cisco Data Intelligence Platform, visit https://www.cisco.com/c/dam/en/us/products/servers-unified-computing/ucs-c-series-rack-servers/solution-overview-c22-742432.pdf
• To find out more about Cisco UCS® big data solutions, visit https://www.cisco.com/go/bigdata
• To find out more about Cisco UCS big data validated designs, visit https://www.cisco.com/go/bigdata_design
• To find out more about Cisco UCS AI/ML solutions, visit https://www.cisco.com/go/ai-compute
• To find out more about Cisco validated solutions based on Cloudera Data Platform, visit https://www.cisco.com/c/en/us/td/docs/unified_computing/ucs/UCS_CVDs/cisco_ucs_cdip_cloudera.html
• To learn more about Cisco Intersight, visit https://www.cisco.com/c/en/us/products/servers-unified-computing/intersight/index.html

© 2020 Cisco and/or its affiliates. All rights reserved. Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: https://www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (1110R) C22-744005-00 09/20