Refresh Your Data Lake to Cisco Data Intelligence Platform
Solution overview Cisco public

The evolving Hadoop landscape

In early 2019, the providers of the two leading Hadoop distributions, Hortonworks and Cloudera, merged. This merger raised the bar on innovation in the big data space, and the new "Cloudera" launched Cloudera Data Platform (CDP), which combined the best of Hortonworks' and Cloudera's technologies to deliver the industry's first enterprise data cloud. Recently, Cloudera released CDP Private Cloud Base, the on-premises version of CDP. This unified distribution brought in several new features, optimizations, and integrated analytics.

CDP Private Cloud Base is built on the Hadoop 3.x distribution. Hadoop has gained many capabilities since its inception, but Hadoop 3.0 is an eagerly awaited major release with several new features and optimizations. Upgrading from Hadoop 2.x to 3.0 is a paradigm shift: it enables diverse computing resources (CPU, GPU, and FPGA) to work on data and leverage AI/ML methodologies. It supports flexible and elastic containerized workloads, managed either by the Hadoop scheduler (YARN) or by Kubernetes, distributed deep learning, GPU-enabled Spark workloads, and more. In addition, Hadoop 3.0 offers better reliability and availability of metadata through multiple standby NameNodes, disk balancing for evenly utilized DataNodes, enhanced workload scheduling with YARN 3.0, and overall improved operational efficiency.

Going forward, the Ozone initiative lays the foundation of the next generation of storage architecture for HDFS, in which data blocks are organized in storage containers for higher scale and better handling of small objects. The Ozone project also includes an object store implementation to support several new use cases.

Considerations in the journey of a Hadoop refresh

Despite the capability gap between Hadoop 2.x and 3.x, it is estimated that more than 80 percent of the Hadoop installed base is still on either HDP 2 or CDH 5, which are built on Apache Hadoop 2.0 and are approaching end of support at the end of 2020. Amid these feature enrichments, specialized computing resources, and end of support, a Hadoop upgrade is a value-added refresh. Considering these enhancements, it is imperative to take a more holistic approach when refreshing your data lake, such as conjoining various frameworks and open-source technologies with the Hadoop ecosystem.

© 2020 Cisco and/or its affiliates. All rights reserved.

As the journey continues in Hadoop, ever more impressive software frameworks and technologies are being introduced for crunching big data, and they will continue to evolve and integrate in a modular fashion. Furthermore, specialized hardware such as GPUs and FPGAs is becoming the de facto standard for expeditiously running deep learning on gigantic datasets. Figure 1 shows how AI/ML frameworks and containerization are augmenting the Hadoop ecosystem.

Figure 1. Hadoop 3.0 refresh with AI included: Hadoop meets AI with Hadoop 3.0, with Cisco UCS enabling these next-generation workloads (Apache Hadoop 3.1, Apache Spark 3.0, Apache Submarine (Tech Preview), Apache Ozone (Tech Preview); AI support in a data lake).
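The containerized YARN workloads mentioned above are requested per container through environment variables recognized by Hadoop 3's Docker container runtime. The sketch below, a minimal illustration rather than a complete job submission, builds that environment in Python; the image name is an illustrative placeholder, and the variable names follow the Hadoop 3 Docker-on-YARN documentation:

```python
# Sketch: the per-container environment that switches YARN's container
# runtime from the default process runtime to Docker (Hadoop 3.x).
# The image name below is a placeholder, not from the source document.

def docker_container_env(image: str) -> dict:
    """Build the environment variables that ask YARN's
    LinuxContainerExecutor to launch a container with Docker."""
    return {
        # Select the Docker runtime for this container
        "YARN_CONTAINER_RUNTIME_TYPE": "docker",
        # The Docker image the container should be launched from
        "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": image,
    }

env = docker_container_env("library/ubuntu:20.04")
print(env["YARN_CONTAINER_RUNTIME_TYPE"])  # docker
```

In practice these variables are passed to a YARN application (for example, via a distributed-shell job's shell environment), and the cluster must be configured to trust the image's registry.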
Containerization

Hadoop 3.0 introduces production-ready Docker container support on YARN with GPU isolation and scheduling. This opens up a plethora of opportunities for modern applications, such as microservices and distributed application frameworks comprising thousands of containers, to execute AI/ML algorithms on petabytes of data easily and quickly.

Distributed deep learning with Apache Submarine

The Hadoop community initiated the Apache Submarine project to make distributed deep learning/machine learning applications easy to launch, manage, and monitor. These improvements make distributed deep learning/machine learning applications (TensorFlow, PyTorch, MXNet, etc.) run on Apache Hadoop YARN as simply as they run locally, letting data scientists focus on algorithms instead of the underlying infrastructure.

Cloudera Data Platform Private Cloud

Shadow IT can now be eliminated when CDP Private Cloud is implemented on the Cisco® Data Intelligence Platform. CDP Private Cloud offers a cloud-like experience in a customer's on-premises environment. With disaggregated compute and storage, a complete multi-tenant, self-service analytics environment can be implemented, thereby offering better infrastructure utilization. CDP Private Cloud also offers data scientist, data engineer, and data analyst personas, bringing the right tools to each user and improving time to value.

Red Hat OpenShift Container Platform (RHOCP) cluster

Cloudera selected Red Hat OpenShift as the preferred container platform for CDP Private Cloud. With RHOCP, CDP Private Cloud delivers powerful, self-service analytics and enterprise-grade performance with the granular security and governance policies that IT leaders demand.

Apache Spark 3.0

Apache Spark 3.0 is something that every data scientist and data engineer has been waiting for with anticipation.
Spark is no longer limited to the CPU for its workloads; it now offers GPU isolation and acceleration. To ease management of the deep learning environment, YARN launches Spark 3.0 applications with GPUs. This paves the way for other workloads, such as machine learning and ETL, to be GPU-accelerated as Spark workloads too. Learn more by reading the Cisco blog on Apache Spark 3.0.

Cloudera Data Platform Private Cloud Base (PvC)

With the merger of Cloudera and Hortonworks, a new "Cloudera" software offering named Cloudera Data Platform (CDP) combined the best of Hortonworks' and Cloudera's technologies to deliver the industry's first enterprise data cloud. CDP Private Cloud Base is the on-premises version of CDP. This unified distribution is a scalable and customizable platform where workloads can be securely provisioned. CDP gives a clear path for extending or refreshing your existing HDP and CDH deployments and sets the stage for a cloud-native architecture.

Apache Hadoop Ozone object store

Apache Hadoop Ozone is a scalable, redundant, and distributed object store for Hadoop. Apart from scaling to billions of objects of varying sizes, Ozone can function effectively in containerized environments such as Kubernetes and YARN. Applications using frameworks like Apache Spark, YARN, and Hive work natively on it without any modifications. Ozone is built on a highly available, replicated block storage layer called Hadoop Distributed Data Store (HDDS). Ozone is a scale-out architecture with minimal operational overhead and long-term maintenance effort. It can be co-located with HDFS under a single set of security and governance policies for easy data exchange or migration, and it offers seamless application portability. Ozone enables separation of compute and storage via the S3 API and, like HDFS, supports data locality for applications that choose to use it.
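Ozone organizes data in a volume, bucket, and key hierarchy, which Hadoop-compatible clients address with `ofs://` URIs. A minimal sketch of that addressing scheme follows; the service ID, volume, bucket, and key names are illustrative assumptions, not values from this document:

```python
# Sketch: building an ofs:// URI for Ozone's volume -> bucket -> key
# hierarchy, as used by Hadoop-compatible clients. All names below are
# illustrative placeholders.

def ozone_ofs_path(service_id: str, volume: str, bucket: str, key: str) -> str:
    """Format an ofs:// URI addressing a single key in Ozone."""
    return f"ofs://{service_id}/{volume}/{bucket}/{key}"

path = ozone_ofs_path("ozone1", "vol1", "bucket1", "data/events.parquet")
print(path)  # ofs://ozone1/vol1/bucket1/data/events.parquet
```

A Spark or Hive job would typically consume such a path directly (for example, as a read location), which is what lets those frameworks work against Ozone without modification.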
The design of Ozone was guided by the key principles listed in Figure 2.

Figure 2. Ozone design principles: highly scalable (tens of billions of files and blocks); secure (access control and on-wire encryption); multi-protocol (Hadoop FS and S3); layered architecture (separate namespace and block management layers; data locality aware); side-by-side (shares storage disks with HDFS); highly available (fully replicated, survives multiple failures); cloud native (works in containerized environments like YARN and Kubernetes).

Kubernetes

Extracting intelligence from data lakes in a timely and speedy fashion is key to finding emerging business opportunities, accelerating time-to-market efforts, gaining market share, and increasing overall business agility. In today's fast-paced digitization, Kubernetes enables enterprises to rapidly deploy new updates and features at scale while maintaining consistency across testing, development, and production environments. Kubernetes lays the foundation for cloud-native applications, which can be packaged in container images and ported to diverse platforms. Containers with a microservice architecture, managed and orchestrated by Kubernetes, help organizations embark on a modern development pattern. Moreover, Kubernetes has become the de facto standard for container orchestration and forms the core of on-premises container clouds for enterprises. Kubernetes provides a single, cloud-agnostic infrastructure with a rich open-source ecosystem. It allocates, isolates, and manages resources across many tenants at scale, elastically and as needed, thereby delivering efficient utilization of infrastructure resources. Figure 3 shows how Kubernetes is transforming the use of compute and becoming the de facto standard for running applications.
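The declarative, self-healing model that makes Kubernetes attractive for these workloads can be illustrated with a minimal Deployment manifest. The sketch below builds one as a plain Python dict (the same structure you would write in YAML and apply with kubectl); the application name, image, and replica count are illustrative placeholders:

```python
# Sketch: a minimal Kubernetes apps/v1 Deployment manifest built as a
# Python dict. All names and the image below are illustrative placeholders.

def deployment_manifest(name: str, image: str, replicas: int) -> dict:
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Declarative scale: Kubernetes keeps this many pods running,
            # replacing failed ones automatically (self-healing).
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": name, "image": image}
                    ]
                },
            },
        },
    }

m = deployment_manifest("analytics-worker", "registry.example.com/worker:1.0", 3)
print(m["spec"]["replicas"])  # 3
```

Because the manifest declares desired state rather than imperative steps, the same definition behaves consistently across testing, development, and production clusters, which is the consistency property described above.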
Figure 3. Compute on Kubernetes is exciting: hybrid architecture, environmental consistency, application portability, and personalities.

Red Hat OpenShift is the preferred container cloud platform for CDP Private Cloud and is the market-leading Kubernetes-powered container platform. This combination completes the vision of the very first enterprise data