CDP DATA CENTER 7.1

Speakers:
• Laurent Edel: Solution Engineer
• Jacques Marchand: Solution Engineer
• Mael Ropars: Principal Solution Engineer

30 June 2020

© 2019 Cloudera, Inc. All rights reserved.

AGENDA
• CDP DATA CENTER OVERVIEW
• DETAILS ABOUT MAJOR COMPONENTS
• PATH TO CDP DC & SMART MIGRATION
• Q/A

CLOUDERA DATA PLATFORM

TARGET ARCHITECTURE: ENTERPRISE DATA CLOUD
• CDP Public Cloud (platform-as-a-service)
• CDP On-Prem (installable software)

CDP DATA CENTER OVERVIEW
CDP Data Center (installable software): enterprise analytics and data management platform, built for hybrid cloud, optimized for bare metal and ready for private cloud.
Diagram labels: Cloudera Manager, Cloudera Runtime
NEW CDP Data Center features include:
• High-performance SQL analytics
• Real-time stream processing, analytics, and management
• Fine-grained security, enterprise metadata, and scalable data lineage
• Support for object storage (tech preview)
• Single pane of glass for management - multi-cluster support

A NEW OPEN SOURCE DISTRIBUTION FOR BETTER CAPABILITY
Cloudera Runtime - created from the best of CDH and HDP
• Deprecate competitive technologies
• Merge overlapping technologies
• Keep complementary technologies
• Upgrade shared technologies
COMPONENT LIST
CDP Data Center 7.1 (May 2020)
• Cloudera Manager 7.1
• HBase 2.2
• Key HSM 7.1
• Kafka Schema Registry 0.8
• Hadoop 3.1
• Phoenix 5.0
• Knox 1.3
• Streams Messaging Mgr 1.0
• Spark 2.4 / Spark 3 (b2)
• Kudu 1.12
• Livy 0.7
• Streams Replication Mgr 2.1
• Hive 3.1
• Sqoop 1.4.7
• Navigator Encrypt 7.1
• Ozone (Beta) 0.6
• Impala 3.4
• Parquet 1.10
• Ranger KMS 7.1
• Kafka Connect 2.4
• Oozie 5.1
• Avro 1.8
• Zeppelin
• Cruise Control 2.0
• Hue 4.5
• ORC 1.5
• Hive Warehouse Connector 1.0
• Tez 0.9
• Ranger 2.0
• Zookeeper 3.5
• Kafka 2.4
• Key Trustee Server 7
• Atlas 2.0
• Solr 8.4
• RHEL/CENTOS/OEL 7.7
• MySQL 5.7
• Upgrades from CDP DC 7.0
• Postgres 10
• Oracle DB 12 (Fresh Install Only)
• Upgrades from CDH 5.13-5.16
• JDK 8
• PostgreSQL 10
• Upgrades from HDP 2.6.5
• JDK 11 Runtime
• Maria DB 10.2

CDP DATA CENTER OVERVIEW
[What is the scope of CDP Data Center]
[Diagram: five stages - 01 Collect, 02 Enrich, 03 Report, 04 Serve, 05 Predict. Components shown: Impala, Spark, Hive, Zeppelin, Kudu, HBase, Phoenix, Solr.]
SECURITY | GOVERNANCE | LINEAGE | MANAGEMENT | AUTOMATION

KAFKA COMPUTE CLUSTERS WITH CLOUDERA MANAGER
Kafka Clusters using Shared Security & Governance Data Lake with Atlas and Ranger
• Kafka 2.4
• Ranger & Atlas Integration
• Support of Kafka Connect, Kafka Streams
• Cruise Control for load balancing
• Create multiple Kafka compute clusters using shared Security Data Lake with Ranger & Atlas
KAFKA MANAGEMENT SERVICES
Kafka Services for Schema Management, Replication and Monitoring
• Schema Registry: New Kafka Schema Governance
• Streams Messaging Manager (SMM): New Kafka Monitoring Service
• Streams Replication Manager (SRM): New Kafka Replication Engine powered by MirrorMaker2

SPARK
• Spark 2.4
• Integration with Ranger for Fine Grained Authorizations
• Coming soon: Spark 3!
  – Better performance
  – Enhanced support for Deep Learning
  – New modules
  – MLlib replaced with SparkML
  – Tech Preview available

SQL USER EXPERIENCE: HUE

DATA WAREHOUSE
Hive 3
Apache Hive 3
• Comprehensive ANSI SQL 2016 coverage
• GDPR: new ACID v2 as fast as regular tables, transactions, UPDATE/DELETE/MERGE
• Cloud-ready: optimized for S3/WASB/GCP
• Support for JDBC/Kafka/Druid out of the box
• EDW offload:
  – "DBA" tooling: surrogate keys, materialized views, constraints
  – information schema
• Performance:
  – workload management
  – query result cache
DATA WAREHOUSE
Impala and Kudu

Apache Impala
• Leading MPP SQL Engine for DW - optimized for Parquet/Kudu
• Ideal for Data Mart Implementations that require Interactive/Ad-hoc BI
• 1000+ enterprise customers - many running on 10s of PBs and 100s of nodes
• Certified with leading BI tools with broad SQL coverage
• Latest release adds improvements for resiliency, concurrency, and metadata

Apache Kudu
• Leading columnar storage engine for fast analytics on fast data
• Ideal for low-latency time series data ingest and analytics (with Impala SQL engine)
• Combines fast single-row ingest, like HBase, with large scans, like HDFS
• ACID (insert/update/delete) semantics with single rows

WORKLOAD MANAGER
• Global view on analytic processing
• Deep Dive Query analysis

HBASE + PHOENIX

HBASE
Flexible, scale-out, NoSQL database

    Put put = new Put(Bytes.toBytes(rowKey));
    put.addColumn(COLUMN_FAMILY_NAME, COLUMN_NAME, Bytes.toBytes(GREETINGS));
    table.put(put);

• Maximally flexible & customizable
• SQL only for data remediation
• All advanced functionality available
• New async client
• JDK8/G1GC
• Off-Heap read path
• API clean-up, HBCK2

PHOENIX
RDBMS-like, scale-out database

    // A single executeUpdate call runs the statement; no separate execute() is needed.
    stmt.executeUpdate("UPSERT INTO TABLE_NAME VALUES(rowKey, GREETINGS)");

• Programmatic ANSI SQL support
• RDBMS-like data architecture
• Auto-applies performance best practices
• Can co-exist with HBase apps
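The Phoenix snippet on the slide passes values inside the SQL string. As a minimal sketch of the same write with bound parameters, the class below builds a parameterized UPSERT and runs it through standard JDBC. The table name GREETINGS_TABLE, the column names ROW_KEY and GREETING, and the class and method names are hypothetical, chosen only for illustration; the Connection is assumed to come from a Phoenix JDBC driver configured elsewhere.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PhoenixUpsertSketch {
    // Hypothetical table name, used for illustration only.
    static final String TABLE = "GREETINGS_TABLE";

    /** Builds a parameterized UPSERT; ROW_KEY and GREETING are assumed column names. */
    public static String upsertSql(String table) {
        return "UPSERT INTO " + table + " (ROW_KEY, GREETING) VALUES (?, ?)";
    }

    /** Writes one row through a JDBC connection (assumed to be a Phoenix connection). */
    public static int upsertGreeting(Connection conn, String rowKey, String greeting)
            throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(upsertSql(TABLE))) {
            stmt.setString(1, rowKey);    // binds the first placeholder
            stmt.setString(2, greeting);  // binds the second placeholder
            int rows = stmt.executeUpdate();
            // Commit explicitly when the connection is not in auto-commit mode.
            if (!conn.getAutoCommit()) {
                conn.commit();
            }
            return rows;
        }
    }

    public static void main(String[] args) {
        // Prints the statement text only; no cluster is contacted here.
        System.out.println(upsertSql(TABLE));
    }
}
```

Binding values with placeholders keeps the row key and greeting out of the SQL text, which is the usual JDBC practice; obtaining the connection and creating the table are left out of this sketch.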
CDP SEARCH
Scalable and Robust Index Storage with SOLR 8.4
• Scalable, cost-efficient index storage
• High availability, integrated security with Atlas/Ranger
• Shared data store with other processing tools (Spark, Impala, ...)
• Search AND process data in one platform
[Diagram labels: Querying API, Indexing API, Solr Cloud, Distributed processing coordinator, Solr Extraction Mapping, Indexing engine (Lucene), Shared Data Storage]

DATA SCIENCE AND ENGINEERING TOOLS
• CLOUDERA DATA SCIENCE WORKBENCH
• APACHE ZEPPELIN

SIMPLIFIED MANAGEMENT
Cloudera Manager
• Management of multiple clusters
• Knox, Ranger, Atlas, Hive-on-Tez, DAS
• Cluster-level configuration history
• Improved global search
• Resume after errors while enabling Kerberos
• Scalability improvements
• Improved alerts configuration
• Upgrade Support
• Support for Private Cloud (Beta)
CONSISTENT SECURITY AND GOVERNANCE
Built for multi-functional analytics anywhere
• Data Catalog: a comprehensive catalog of all data sets, spanning on-premises, cloud object stores, structured, unstructured, and semi-structured
• Schema: automatic capture and storage of all schema and metadata definitions as they are used and created by platform workloads
• Replication: deliver data, together with its data policies, wherever the enterprise needs to work, with complete consistency and security
• Security: role-based access control applied consistently across the platform, including full-stack encryption and key management
• Governance: enterprise-grade auditing, lineage, and governance capabilities applied across the platform with rich extensibility for partner integrations

SECURITY AND GOVERNANCE
• Identity & Perimeter: Validate users in enterprise directory
• Access: Defining what users and applications can do with data
• Visibility: Reporting on where data came from and how it's being used
• Data Protection: Shielding data in the cluster from unauthorized visibility
Technical Concepts:
