CDP DATA CENTER 7.1

30 June 2020

SPEAKERS
• Laurent Edel: Solution Engineer
• Jacques Marchand: Solution Engineer
• Mael Ropars: Principal Solution Engineer

© 2019 Cloudera, Inc. All rights reserved.

AGENDA

• CDP DATA CENTER OVERVIEW
• DETAILS ABOUT MAJOR COMPONENTS
• PATH TO CDP DC AND SMART MIGRATION
• Q/A

CLOUDERA DATA PLATFORM

TARGET ARCHITECTURE: ENTERPRISE DATA CLOUD

• CDP Public Cloud (platform-as-a-service)
• CDP On-Prem (installable software)

CDP DATA CENTER OVERVIEW

CDP Data Center (installable software)

Enterprise analytics and data management platform, built for hybrid cloud, optimized for bare metal and ready for private cloud.

Building blocks: Cloudera Manager and Cloudera Runtime.

NEW CDP Data Center features include:
• High-performance SQL analytics
• Real-time stream processing, analytics, and management
• Fine-grained security, enterprise metadata, and scalable data lineage
• Support for object storage (tech preview)
• Single pane of glass for management, with multi-cluster support

A NEW OPEN SOURCE DISTRIBUTION FOR BETTER CAPABILITY
Cloudera Runtime: created from the best of CDH and HDP

• Deprecate competitive technologies
• Merge overlapping technologies
• Keep complementary technologies
• Upgrade shared technologies

COMPONENT LIST

CDP Data Center 7.1 (May 2020)

• Cloudera Manager 7.1
• HBase 2.2
• Key HSM 7.1
• Kafka Schema Registry 0.8
• Hadoop 3.1
• Phoenix 5.0
• Knox 1.3
• Streams Messaging Mgr 1.0
• Spark 2.4 / Spark 3 (b2)
• Kudu 1.12
• Livy 0.7
• Streams Replication Mgr 2.1
• Hive 3.1
• 1.4.7
• Navigator Encrypt 7.1
• Ozone (Beta) 0.6
• Impala 3.4
• Parquet 1.10
• Ranger KMS 7.1
• Kafka Connect 2.4
• Oozie 5.1
• Avro 1.8
• Zeppelin
• Cruise Control 2.0
• Hue 4.5
• ORC 1.5
• Hive Warehouse Connector 1.0
• Tez 0.9
• Ranger 2.0
• Zookeeper 3.5
• Kafka 2.4
• Key Trustee Server 7
• Atlas 2.0
• Solr 8.4

• Operating system: RHEL/CentOS/OEL 7.7
• Java: JDK 8, JDK 11 Runtime
• Databases: MySQL 5.7, PostgreSQL 10, MariaDB 10.2, Oracle DB 12 (Fresh Install Only)
• Upgrades: from CDP DC 7.0, from CDH 5.13-5.16, from HDP 2.6.5


CDP DATA CENTER OVERVIEW [What is the scope of CDP Data Center]

• 01 Collect
• 02 Enrich: Spark, Hive
• 03 Report: Impala, Hive, Kudu
• 04 Serve: HBase, Phoenix, Solr
• 05 Predict: Spark, Zeppelin

SECURITY | GOVERNANCE | LINEAGE | MANAGEMENT | AUTOMATION


KAFKA COMPUTE CLUSTERS WITH CLOUDERA MANAGER
Kafka Clusters using Shared Security & Governance Data Lake with Atlas and Ranger

• Kafka 2.4
• Ranger & Atlas integration
• Support for Kafka Connect and Kafka Streams
• Cruise Control for load balancing
• Create multiple Kafka compute clusters using a shared Security Data Lake with Ranger & Atlas

KAFKA MANAGEMENT SERVICES
Kafka Services for Schema Management, Replication and Monitoring

• Schema Registry: new Kafka schema governance
• Streams Messaging Manager (SMM): new Kafka monitoring service
• Streams Replication Manager (SRM): new Kafka replication engine powered by MirrorMaker2


SPARK

• Spark 2.4
• Integration with Ranger for fine-grained authorization
• Coming soon: Spark 3
  – Better performance
  – Enhanced support for deep learning
  – New modules
  – MLlib replaced with Spark ML
  – Tech Preview available


SQL USER EXPERIENCE: HUE

DATA WAREHOUSE
Hive 3

Apache Hive 3
• Comprehensive ANSI SQL 2016 coverage
• GDPR: new ACID v2 as fast as regular tables, transactions, UPDATE/DELETE/MERGE
• Cloud-ready: optimized for S3/WASB/GCP
• Support for JDBC/Kafka/Druid out of the box
• EDW offload:
  – “DBA” tooling: surrogate keys, materialized views, constraints
  – information schema
• Performance:
  – workload management
  – query result cache

DATA WAREHOUSE
Impala and Kudu

Apache Impala
• Leading MPP SQL engine for data warehousing, optimized for Parquet/Kudu
• Ideal for data mart implementations that require interactive/ad-hoc BI
• 1000+ enterprise customers, many running on 10s of PBs and 100s of nodes
• Certified with leading BI tools with broad SQL coverage
• Latest release adds improvements for resiliency, concurrency, and metadata

Apache Kudu
• Leading columnar storage engine for fast analytics on fast data
• Ideal for low-latency time series data ingest and analytics (with the Impala SQL engine)
• Combines fast single-row ingest like HBase with large scans like HDFS
• ACID (insert/update/delete) semantics with single rows

WORKLOAD MANAGER

• Global view on analytic processing
• Deep dive query analysis


HBASE + PHOENIX

HBASE: flexible, scale-out, NoSQL

    Put put = new Put(Bytes.toBytes(rowKey));
    put.addColumn(COLUMN_FAMILY_NAME, COLUMN_NAME, Bytes.toBytes(GREETINGS));
    table.put(put);

• Maximally flexible & customizable
• SQL only for data remediation
• All advanced functionality available
• New async client
• JDK8/G1GC
• Off-heap read path
• API clean-up, HBCK2

PHOENIX: RDBMS-like, scale-out database

    stmt.executeUpdate("UPSERT INTO TABLE_NAME VALUES(rowKey, GREETINGS)");
    stmt.execute();

• Programmatic ANSI SQL support
• RDBMS-like data architecture
• Auto-applies performance best practices
• Can co-exist with HBase apps

CDP SEARCH
Scalable and Robust Index Storage with SOLR 8.4

• Scalable, cost-efficient index storage
• High availability, integrated security with Atlas/Ranger
• Shared data store with other processing tools (Spark, Impala..)
• Search AND process data in one platform

Architecture layers: Querying API and Indexing API; SolrCloud (distributed processing coordinator); Solr (extraction, mapping); indexing engine (Lucene); shared data storage.


DATA SCIENCE AND ENGINEERING TOOLS

• CLOUDERA DATA SCIENCE WORKBENCH
• APACHE ZEPPELIN


SIMPLIFIED MANAGEMENT
Cloudera Manager

• Management of multiple clusters
• Knox, Ranger, Atlas, Hive-on-Tez, DAS
• Cluster-level configuration history
• Improved global search
• Resume after errors when enabling Kerberos
• Scalability improvements
• Improved alerts configuration
• Upgrade support
• Support for Private Cloud (Beta)

CONSISTENT SECURITY AND GOVERNANCE
Built for multi-functional analytics anywhere

• Data Catalog: a comprehensive catalog of all data sets, spanning on-premises, cloud object stores, structured, unstructured, and semi-structured

• Schema: automatic capture and storage of any and all schema and metadata definitions as they are used and created by platform workloads

• Replication: deliver data and data policies wherever the enterprise needs to work, with complete consistency and security

• Security: role-based access control applied consistently across the platform. Includes full stack encryption and key management

• Governance: enterprise-grade auditing, lineage, and governance capabilities applied across the platform with rich extensibility for partner integrations

SECURITY AND GOVERNANCE

• Identity & Perimeter: validate users in the enterprise directory
  – Technical concepts: authentication, user/group mapping
  – Components: Kerberos, Apache Knox
• Access: defining what users and applications can do with data
  – Technical concepts: permissions, authorization
  – Components: Apache Ranger
• Visibility: reporting on where data came from and how it’s being used
  – Technical concepts: auditing, lineage
  – Components: Apache Atlas
• Data Protection: shielding data in the cluster from unauthorized visibility
  – Technical concepts: encryption, key management
  – Components: SSL/TLS, HDFS TDE, Apache Ranger (KMS, Masking, Filtering)

VISIBILITY SECURITY AND AUDITING
Apache Atlas

• Lineage
  – What data do I consume?
  – What consumes my data?
• Who uses my data?
  – Audit who accessed what
  – Track access events from Apache Ranger
  – Metadata audit and versioning from Apache Atlas
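The two lineage questions above amount to walking a dependency graph in opposite directions. A minimal sketch in plain Java, assuming a hypothetical in-memory graph: the `LineageSketch` class, its method names, and the dataset names are illustrative only and are not the Apache Atlas API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy lineage graph (hypothetical, not the Atlas API). Assumes flows form no cycles. */
final class LineageSketch {
    // dataset -> datasets that read from it
    private final Map<String, List<String>> consumers = new HashMap<>();
    // dataset -> datasets it is built from
    private final Map<String, List<String>> producers = new HashMap<>();

    /** Records that 'target' is derived from 'source'. */
    void addFlow(String source, String target) {
        consumers.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
        producers.computeIfAbsent(target, k -> new ArrayList<>()).add(source);
    }

    /** "What consumes my data?": everything downstream, direct or indirect. */
    Set<String> whatConsumes(String dataset) {
        return reach(dataset, consumers);
    }

    /** "What data do I consume?": everything upstream, direct or indirect. */
    Set<String> whatFeeds(String dataset) {
        return reach(dataset, producers);
    }

    /** Breadth-first walk that collects every node reachable from 'start'. */
    private static Set<String> reach(String start, Map<String, List<String>> edges) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            for (String next : edges.getOrDefault(node, Collections.<String>emptyList())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        LineageSketch lineage = new LineageSketch();
        lineage.addFlow("raw_orders", "clean_orders");
        lineage.addFlow("clean_orders", "sales_report");
        lineage.addFlow("clean_orders", "ml_features");
        System.out.println("Consumes raw_orders: " + lineage.whatConsumes("raw_orders"));
        System.out.println("Feeds sales_report: " + lineage.whatFeeds("sales_report"));
    }
}
```

Keeping both edge directions makes each question a single traversal, at the cost of storing every flow twice.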

ACCESS CONTROL
Apache Ranger

• Maintain one set of data and control access centrally with fine-grained policies, down to the column and row level.

• Anonymize PII with Dynamic column masking

• Customize views for users with Dynamic row filtering

• Manage user access with Role-based Access Control

• Unify policies across many data sets with Attribute-based Access Control
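To make dynamic column masking and row filtering concrete, here is a small sketch of their effect on a query result. It is a toy model in plain Java and not the Apache Ranger API: the `AccessPolicySketch` class, the `Row` type, the masking format, and the sample data are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of policy-driven masking and row filtering (hypothetical, not Ranger code). */
final class AccessPolicySketch {

    /** Hypothetical customer row: a country code plus one PII column. */
    static final class Row {
        final String country;
        final String email;

        Row(String country, String email) {
            this.country = country;
            this.email = email;
        }
    }

    /** Hides the local part of an address and keeps the domain for analytics. */
    static String maskEmail(String email) {
        int at = email.indexOf('@');
        if (at < 0) {
            return "****"; // not email-shaped: hide the whole value
        }
        return "****" + email.substring(at);
    }

    /**
     * Returns the rows a user would see. Rows outside allowedCountry are dropped
     * (row filter); the email column is masked unless canSeePii is true (column mask).
     * The stored table is never modified, so there is still one set of data.
     */
    static List<Row> view(List<Row> table, String allowedCountry, boolean canSeePii) {
        List<Row> visible = new ArrayList<>();
        for (Row row : table) {
            if (!row.country.equals(allowedCountry)) {
                continue; // row filter
            }
            String email = canSeePii ? row.email : maskEmail(row.email);
            visible.add(new Row(row.country, email));
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Row> table = Arrays.asList(
                new Row("FR", "alice@example.com"),
                new Row("DE", "bob@example.com"));
        for (Row row : view(table, "FR", false)) {
            System.out.println(row.country + " " + row.email);
        }
    }
}
```

The point of the sketch is that both controls are applied at read time, per user, so different users get different views without copies of the data.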

OBJECT STORAGE
Apache Ozone

• Ozone is the next generation of HDFS
  – Based on HDFS architecture, but with some fundamental shifts
  – Preserves and reuses the good parts of HDFS
  – Addresses HDFS scale limits and the small file problem
• Uses an object store architecture to achieve scale
• Provides a native Hadoop File System API as well as a native S3 API

PATHS TO CDP

THREE PATHS TO CDP

• Migrate to Public Cloud: copy data and metadata to a public cloud; implement new, or migrate existing, workloads on CDP Public Cloud.
• Migrate to CDP DC: build a new CDP Data Center cluster on-premises; copy data and metadata from the existing classic cluster; and migrate existing workloads.
• Upgrade to CDP DC: upgrade from a classic cluster to CDP Data Center in-place on the same hardware infrastructure.

SMARTUPGRADE TO CDP DC

1. PLAN: PATH2CDP (1+ weeks)
2. LAUNCH: UPGRADE2CDP (3+ weeks)
3. PRODUCTION: CONSUME CDP (custom estimate)

CDP Data Center

GENERAL AND CDP TRAINING
CDH/HDP TO CDP DELTA COURSES

• ADMINISTRATOR: AWS Fundamentals for CDP Public Cloud Admin; CDP Security (Q2-Q3); CDP Pub. Cloud (Q2); CDP Data Governance (Q2)
• DATA ANALYST: CDW Hive/Impala (DC - Q1)
• DEVELOPER: Spark (DC - Q1); Spark Performance Workshop; Flow Management with NiFi (CDF); Stream Processing (Q1) (CDF); Kafka Operations (CDF)
• DATA SCIENTIST: CML Data Science Wkshp (Q2); DS/ML Modules

cloudera.com/training.html

WHAT IS COMING NEXT?
CDP PRIVATE CLOUD: BASED ON CDP DATA CENTER

• New set of data analytics applications, featuring use-case optimized interfaces
• Running on a container cloud: fast provisioning & scaling, efficient, simple
• With access to a shared data lake that is secured and governed

Architecture layers:
• Management Console: Data Catalog, Workload Manager, Replication Manager
• Experiences: DataFlow, Machine Learning, Data Warehouse, Data Engineering, on Kubernetes
• CDP Data Center: SDX (Security, Metadata, Governance) and Workloads, on Bare Metal

Questions and Answers
Over to you...