Hadoop Essentials: HDP and HDI Overview

Glenn Koehler – Enterprise Account Manager
Piotr Pruski – Solutions Engineer
Eyad Garelnabi – Solutions Engineer

§ Introduction
§ The Connected Data Era
§ Hortonworks Data Platform (HDP) Overview
§ Core Hadoop
§ Data Access & Consumption
§ Data Security & Governance
§ Operations, Cloud and HDI
§ Hortonworks Data Flow (HDF) Overview
§ Demo of HDP & HDF
§ Q&A

Founded in 2011
1000+ employees across 20+ countries
ONLY 100% open source Apache™ Hadoop data platform
1ST Hadoop provider to go public: IPO 4Q14 (NASDAQ: HDP)
1,700+ technology partners

Fastest company to reach $100 M in revenue

[Diagram: Architect, Develop, Distribute, Service. Participants shown: Co-Dev Partners & Customers, Apache Software Foundation, Customers, ODPi]

THE INNOVATION ADVANTAGE

Innovation happens best not in isolation but in collaboration

[Chart: innovation over TIME, OPEN COMMUNITY vs. PROPRIETARY APPROACH]

MAXIMUM COMMUNITY INNOVATION

We Employ the Committers: one third of all committers to the Apache® Hadoop™ project, and a majority in Apache NiFi and other important projects
Our Committers Innovate and improve Connected Data Platforms
We Influence the Hadoop Roadmap by communicating important requirements to the community through our leaders

APACHE HADOOP COMMITTERS

© Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hortonworks Customer Momentum Spans Industries
Canadian Companies partnered with Hortonworks

Financial Services: 55% of the U.S. F100
Retail: 75% of the U.S. F100
Telecommunications: 8 of the top 9 in North America
Automotive: 8 of the global top 20

Learn more at hortonworks.com/customers

The Connected Era A New Way of Business

Digital Personalization, Cloud Computing

Data Doubles Every Two Years: 44ZB By 2020

The Old Way       The New Way
System-Centric    User-Centric
Procedural        Agile
Hierarchical      Dynamic
Scheduled         Real-Time
Monolithic        Contextual

DATA IN MOTION
• Capture streaming data
• Deliver perishable insights
• Combine new & old data

DATA AT REST
• Store data forever
• Access a multi-tenant data lake
• Model with artificial intelligence

Data in Motion or at Rest, Across Data Center and Cloud
Data Plane: Cloud and Data Center, Centrally Managed

[Figure: grid of example customer use cases, including sentiment analysis, payment tracking, next product recommendations, basket analysis, customer segments, cross-sell, customer retention, supply chain, vendor scorecards, inventory predictions, risk modeling, defect detection, ad placement, proactive repair, disaster mitigation, investment planning and fraud prevention. Foundation use cases: sensor and device data ingest, OPEX reduction, data as a service, historical records, geo tagging, text enrichment, mainframe and RDBMS offloads, rapid reporting, digital protection and public data capture.]

Hortonworks® customers leverage our technology to transform their businesses, either by achieving new business objectives or by reducing costs. The journey typically involves both of those goals in combination, across many use cases.

12 Month Results at TrueCar
• Six Production Hadoop Applications
• Sixty nodes/2PB data
• Storage Costs/Compute Costs from $19/GB to $0.12/GB

"We addressed our data platform capabilities strategically as a pre-cursor to IPO."

12 months execution plan (timeline labels: June 2013, July 2013, Aug 2013, Nov 2013, Dec 2013, Jan 2014, Feb 2014, May '14). Milestones, in order of growing data platform capabilities:
• Begin Hadoop Execution
• Hortonworks Partnership
• Training & Dev Begins
• Production Cluster (60 Nodes, 2 PB)
• Three Production Apps (3 total)
• 40% Dev Staff Proficient
• Three More Production Apps (6 total)
• IPO (May '14)

[Figure: healthcare use-case map. Categories: PREDICTIVE ANALYTICS, DATA ENRICHMENT, ETL ONBOARD, SINGLE VIEW, ACTIVE ARCHIVE, DATA DISCOVERY; stages: TRANSFORM, OPTIMIZE, EXPLORE. Example use cases: proactive maintenance, medication diversion, inventory management, drug safety, genomics & omics, preventative medicine, staffing predictions, population health, OR optimization, patient throughput, patient experience, patient outreach, 360° patient view, HCAHPS scores, STARS ratings, supply chain management, net promoter score, sentiment analysis, insurance claims, physician notes, device monitoring, cohort selection, care-path quality benchmarks, best practices. Data sources: sensor data, pharmacy data, patient locations, social media, clinical records (EMR), physician notes, patient satisfaction data, intra-network data, wearables, claims data, lab data.]

SINGLE VIEW OF PATIENT, REAL-TIME VITAL SIGN MONITORING, BILLING & REIMBURSEMENTS, EMR OPTIMIZATION, SUPPLY CHAIN OPTIMIZATION

SITUATION
• Existing platform impeded goals
• Data enrichment needed for 1 million patients
• Move to Clarity wouldn't enable real-time analytics
• Extracting data from Epic to Clarity took a day

RESULTS
• 3-5 minutes to move data off Epic to Clarity with HDP
• $1M additional annual revenue from improved billing process
• From "Never" to "Seconds": accelerated researcher insight
• 900x more data ingest of ICU vital signs

Use case tiles (PREDICTIVE ANALYTICS, DATA DISCOVERY, SINGLE VIEW, ACTIVE ARCHIVE, ETL OFFLOAD, DATA ENRICHMENT): Preventive Care, Single Patient Record, OPEX Efficiency, Billing, Vital Signs, Device Data Ingest, Medical Decision Support, Lab Notes Enrichment, Epic Replication, EMR Privacy, Epic Database

"[HDP] provides us a place and way to leverage our Epic data in addition to other data that comes from outside of Epic."
Paul Boal, Director of Data Management

Arizona State Genomics Researchers Do The Work of Entire Teams

SITUATION
• Cancer is a complicated and complex disease
• Genomic data was too vast for legacy platforms
• Cancer risk patterns hidden within genomic data
• "Lamp-posting" around some genes made research slow & incremental

RESULTS
• NGCC for Genomics Analysis: HDP with HPC for analytics on 1000s of human genomes
• Federated Framework for Data Sharing follows "National Cancer Moonshot" guidelines
• 1-2 Minutes on 20 Billion Rows: query response time from "never" to minutes

Use case tiles (ACTIVE ARCHIVE, ETL OFFLOAD, DATA DISCOVERY): Research & Lab Data Storage, Sensor Data Ingest, Genomics

"[HDP] has sped our time to insight infinitely in some cases. Some questions were not possible before, and now they return results in a day."
Dr. Kenneth Buetow, Director of Computational Sciences and Informatics

Progressive Rewards Safe Drivers and Improves Traffic Safety

SITUATION
• Usage-Based "Snapshot" Insurance Program
• In-Car Sensor Captures IoT Data
• Existing Data Systems Did Not Scale Efficiently
• ~7 Days to Transform Only 25% of UBI Data

RESULTS
• 2-3 Days: 100% of driving detail captured from Snapshot, in HDF
• +12 Billion miles driven stored
• Web App-Enabled: customers see driving detail and improve safety
• $2.6 Billion in 2014 Premiums

Use case tiles:
• PREDICTIVE ANALYTICS: Usage-Based Insurance (UBI)
• DATA DISCOVERY: Web Log Analysis, Online Ad Placement, Claim Notes Mining
• ETL OFFLOAD: Sensor Data Ingest
• ACTIVE ARCHIVE: Individual Driving Histories

"We're looking at datasets that we never dreamed we could look at…It's joining dots that in the past we didn't even know we could join."
Pawan Divakarla, Data & Analytics Business Leader

Improving Service at the UK's Royal Mail, a Centuries-Old Business

SITUATION
• Supports more than 29 million addresses
• Spent 90% of time moving data to/from warehouse
• Wanted to redefine data for business decisions and compliance

RESULTS
• Time moving data: from 90 to 10%
• EDW: freeing valuable analytic capacity
• Customer churn reduced by gathering edge data with HDF
• Analytic velocity improved, delivering insight in days, not months
• Governance simplified and centralized

Use case tiles:
• PREDICTIVE ANALYTICS: New Data Products, Investment Planning
• SINGLE VIEW: Parcel Distribution, Customer Acquisition, Customer Support
• ACTIVE ARCHIVE, ETL OFFLOAD, DATA ENRICHMENT: OPEX Savings, EDW Offload, Rapid Reporting, Public Data Capture, Data-as-a-Service

"We're accelerating that whole process, we're not having to spin up projects just to get data. We are able to accomplish a huge amount of work with single individuals. We see Hortonworks as our advanced analytics platform."
Thomas Lee-Warren, Director of the Technology Data Group

Data Science Speeds Time to Cyber Security Protection

SITUATION
• Network has +57M attack sensors in 157 countries
• Data streams from 75M users on 120M devices
• 3-4 hr processing latencies to analyze digital threats
• Long open windows of exposure to cyber attacks

RESULTS
• From 4 hours to 2-seconds threat detection latency
• 5000x improved time-to-protection
• 10s of PBs of historical data for machine learning
• Cloud Flexibility to meet peak demand for analysis

Use case tiles:
• PREDICTIVE ANALYTICS: Threat Predictions
• DATA DISCOVERY, PREDICTIVE ANALYTICS: Protective Security Logs, Threat Detection, Unified Security, Attacker Identification, Safeguard
• ETL OFFLOAD: Device Data Ingest, Greenplum Offload, Metadata Capture
• ACTIVE ARCHIVE: Threat Archive

"On any given day, we'll be processing 40 billion messages into our system…It used to be that queues would back up. We would see times to analysis on the order of 4 hours. On average, we've gotten that down to two or two and a half seconds."
David Lin, Senior Director of Engineering, Symantec Cloud Platform

Hortonworks Delivers Connected Data Platforms

MODERN DATA APPS

ACTIONABLE INTELLIGENCE

PERISHABLE INSIGHTS, HISTORICAL INSIGHTS

INTERNET OF ANYTHING, DATA IN MOTION, DATA AT REST

What is Apache Hadoop?

Hadoop was designed for Big Data
• Cost Effective
  – Commodity hardware
  – Open source software
• Scalable
  – Efficiently store and process petabytes of data
  – Grows linearly by adding commodity computers
• Reliable
  – Self healing as hardware fails or is added
• Flexible
  – Store all types of data in many formats
  – Schema-on-read

HDFS – Hadoop Distributed File System
• Data broken into blocks and replicated 3x
• Automatically replaces lost data / computers

YARN – Distributed Computation Layer
• Distributed execution on HDFS
• Many programming models
  – MapReduce, SQL, Streaming, ML…
• Multi-users, with queues, priorities, etc…

HDP 2.6* 1.2.1+ 5.5.1 1.6.3+ 2.7.3 0.16.0 0.9.2 0.7.0 0.7.0 0.91.0 1.1.2 4.7.0 1.7.0 1.1.0 0.10.0 0.8.0 1.4.6 1.5.2 0.10.1.0 2.5.0 3.4.6 4.2.0 0.11.0 0.7.0 1H2017 2.1*** **** 2.1**

HDP 2.5 1.2.1+ 1.6.2+ 2.7.3 0.16.0 0.7.0 5.5.1 0.6.0 0.91.0 1.1.2 4.7.0 1.7.0 1.0.1 0.10.0 0.7.0 1.4.6 1.5.2 0.10.0 2.4.0 3.4.6 4.2.0 0.9.0 0.6.0 Aug 2016 2.1*** 2.0**

HDP 2.4 2.7.1 0.15.0 1.2.1 0.7.0 5.2.1 1.6.0 0.80.0 1.1.2 4.4.0 1.7.0 0.10.0 0.6.1 0.5.0 1.4.6 1.5.2 0.9.0 2.2.1 3.4.6 4.2.0 0.6.0 0.5.0 Mar 2016

HDP 2.3 2.7.1 0.15.0 1.2.1 0.7.0 5.2.1 1.4.1 0.80.0 1.1.2 4.4.0 1.7.0 0.10.0 0.6.1 0.5.0 1.4.6 1.5.2 0.8.2 2.1.0 3.4.6 4.2.0 0.6.0 0.5.0 Oct 2015

HDP 2.2 2.6.0 0.14.0 0.14.0 0.5.2 4.10.2 1.2.1 0.60.0 0.98.4 4.2.0 1.6.1 0.9.3 0.6.0 1.4.5 1.5.2 0.8.1 2.0.0 3.4.6 4.1.0 0.5.0 0.4.0 Dec 2014

HDP 2.1 2.4.0 0.12.1 0.13.0 0.4.0 4.7.2 0.98.0 4.0.0 1.5.1 0.9.1 0.5.0 1.4.4 1.4.0 1.5.1 3.4.5 4.0.0 0.4.0 April 2014

HDP 2.0 2.2.0 0.12.0 0.12.0 0.96.1 1.4.4 1.3.1 1.4.4 3.4.5 3.3.2 Oct 2013

Components (column labels for the version rows above): ZooKeeper, Hadoop & YARN, Druid, Phoenix, Accumulo, Storm, Falcon, Atlas, Sqoop, Flume, Kafka, Ambari, Knox, Ranger, Oozie, Hive, Tez, Pig, Solr, Spark, Zeppelin, Slider, HBase

Component groups: DATA MGMT, DATA ACCESS, GOVERNANCE & INTEGRATION, OPERATIONS, SECURITY

HORTONWORKS DATA PLATFORM

* HDP 2.6 – Shows current Apache branches being used. Final component version subject to change based on Apache release process.
** Spark 1.6.3+ Spark 2.1 – HDP 2.6 supports both Spark 1.6.3 and Spark 2.1 as GA.
*** Hive 2.1 is GA within HDP 2.6.
**** Apache Solr is available as an add-on product HDP Search.

Core Hadoop

Centralized Architecture, Highly Scalable, Cost Effective

YARN: Data Operating System (Cluster Resource Management)
HDFS (Hadoop Distributed File System)

Name Nodes
§ Data Catalog, a "Namespace"
§ Active & Standby Nodes for HA
§ Manual or Automated Failover
§ Quorum Journaling (States & Edits)

Data Nodes
§ "Where the data lives"
§ Data stored in replicas of 3 (default)
§ Rack aware
§ Heterogeneous Storage Options
§ Large block sizes (128MB default in Hadoop 2.x; 64MB in Hadoop 1.x)

Edge & Utility Nodes
§ Cluster Administration
§ 3rd Party Tools

Data Nodes
§ If a Data Node goes down, the Name Node will instruct the other Data Nodes to re-replicate the lost data.
§ HDFS is rack-aware, which helps avoid loss of data from a full rack failure.
§ When data is deleted, it is moved to the .Trash directory under the user's home directory. The Name Node will eventually delete the data based on an expiration setting.
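The block and replica behaviour described for HDFS can be illustrated with a little arithmetic. The following is a minimal sketch, not HDFS code: the function names, node names and block IDs are hypothetical, and the 128 MiB block size is an assumed cluster setting (pass a different `block_bytes` for other configurations).

```python
import math


def hdfs_footprint(file_bytes, block_bytes=128 * 1024**2, replication=3):
    """Return (block_count, raw_bytes) for one file under the stated assumptions."""
    blocks = math.ceil(file_bytes / block_bytes)
    # The last block is not padded, so raw usage is file size times replica count.
    raw_bytes = file_bytes * replication
    return blocks, raw_bytes


def blocks_to_rereplicate(placement, failed_node, replication=3):
    """List (block_id, missing_replicas) for blocks that lost a copy on failed_node."""
    under_replicated = []
    for block_id, nodes in placement.items():
        live = [n for n in nodes if n != failed_node]
        if len(live) < replication:
            under_replicated.append((block_id, replication - len(live)))
    return under_replicated


# A 1 GiB file with the assumed defaults.
blocks, raw = hdfs_footprint(1024**3)

# Hypothetical block placement; Data Node "dn1" then fails.
placement = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}
lost = blocks_to_rereplicate(placement, "dn1")
```

Here the 1 GiB file splits into 8 blocks and occupies 3 GiB of raw disk, and only `blk_1` needs one new replica after the failure.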

YARN: A Data Operating System
Yet Another Resource Negotiator

YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.)

[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). Engines: Script (Pig, on Tez), SQL (Hive, on Tez), Java/Scala (Cascading, on Tez), NoSQL (HBase and Accumulo, on Slider), Stream (Storm, on Slider), In-Memory (Spark), Search (Solr), Others (ISV Engines)]

YARN in Production
• Yahoo: ~40,000 nodes, multiple clusters running YARN across over 365PB of data
• Spotify, Progressive, Kohls, UHG, Sprint, JPMC, Target, AIG, Samsung

YARN Ready Applications
• Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing "YARN Ready" solutions

Application
§ Application is a temporal job or a service submitted to YARN
§ Examples
  – Map Reduce Job (job)
  – HBase Cluster (service)

Container
§ Basic unit of allocation
§ Allocation of cluster resources (RAM, CPU)
  – container_0 = 2GB, 1CPU
  – container_1 = 1GB, 6 CPU
§ Replaces the fixed map/reduce slots

YARN Architectural Components

Resource Manager
§ Global resource scheduler
§ Hierarchical queues

Node Manager
§ Per-machine agent
§ Manages the life-cycle of container
§ Container resource monitoring

Application Master
§ Per-application
§ Manages application scheduling and task execution
§ E.g. MapReduce Application Master
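To make the container idea concrete, here is a toy sketch of how a scheduler might place containers of different sizes on Node Managers. It is not YARN code. The `Node` class, the `allocate` function and the node names are hypothetical, and first-fit placement is a deliberate simplification (real YARN schedulers also apply queues, priorities and locality).

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of one Node Manager's free resources."""
    name: str
    free_mb: int
    free_vcores: int


def allocate(nodes, mem_mb, vcores):
    """First-fit: place the container on the first node with enough RAM and CPU."""
    for node in nodes:
        if node.free_mb >= mem_mb and node.free_vcores >= vcores:
            node.free_mb -= mem_mb
            node.free_vcores -= vcores
            return node.name
    return None  # nothing fits; the request would stay pending


nodes = [
    Node("nm1", free_mb=2048, free_vcores=2),
    Node("nm2", free_mb=4096, free_vcores=8),
]
c0 = allocate(nodes, mem_mb=2048, vcores=1)  # container_0 = 2GB, 1 CPU
c1 = allocate(nodes, mem_mb=1024, vcores=6)  # container_1 = 1GB, 6 CPU
c2 = allocate(nodes, mem_mb=8192, vcores=1)  # larger than any single node
```

Because each request states its own RAM and CPU needs, differently shaped containers can share a node, which fixed map/reduce slots did not allow.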

SQL Capabilities and Applications
• Batch SQL: ETL, Reporting, Data Mining, Deep Analytics
• OLAP / Cube: Multidimensional Analytics, MDX Tools, Excel
• Interactive SQL: Reporting, BI Tools: Tableau, Microstrategy, Cognos
• Sub-Second SQL: Ad-Hoc, Drill-Down, BI Tools: Tableau, Excel
• ACID / MERGE: Continuous Ingestion from Operational DBMS, Slowly Changing Dimensions

Core SQL Engine
• Apache Tez: Scalable Distributed Processing
• Advanced Cost-Based Optimizer
• Comprehensive SQL:2011 Coverage

Connectivity
• JDBC / ODBC
• Advanced Security

Platform
• Scale-Out Storage
• Petabyte Scale Processing

Legend: Existing, Core Hive Development

Hive LLAP GA: Interactive query in seconds, 10X fast join performance

Ease of Use and Adoption: SQL Standard ACID Merge

Enterprise Readiness: Supports all TPC-DS Queries

Streamlined Operations: Hive Views

[Diagram: Hive (Web UI, Metastore DB, Executor) submitting a MapReduce YARN job to a Hadoop NodeManager]

1 User issues SQL query
2 Hive parses and plans query
3 Query converted to MapReduce and executed on Hadoop

Apache Pig is a high-level platform for transforming or analyzing large datasets.

Pig includes a scripted, procedural-based language that is used:
§ To build data pipelines to aggregate and add structure to data
§ To analyze data

Pig scripts are automatically converted to MapReduce jobs.

• Pig Latin abstracts the Java MapReduce idiom into a form similar to SQL, but one that lets you write a data flow describing the data transformation step by step.

• Pig uses a lazy execution model:
  • During execution, each statement is processed by the Pig interpreter
  • If a statement is valid, it gets added to a logical plan built by the interpreter
  • The steps in the logical plan do not actually execute until a DUMP or STORE command is used

• Pig Latin is extensible:
  • Custom UDFs can be written in Java, Python, JavaScript, Ruby and Groovy.
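The lazy execution model can be mimicked in a few lines of Python. This is only an illustration of the idea, not the Pig API: the `LazyPlan` class and its method names are hypothetical stand-ins for Pig Latin statements and the interpreter's logical plan.

```python
class LazyPlan:
    """Toy stand-in for a logical plan: steps accumulate and run only on dump()."""

    def __init__(self, rows):
        self.rows = rows
        self.steps = []     # the logical plan, in statement order
        self.executed = 0   # how many times the plan has actually run

    def filter(self, predicate):
        # Like FILTER ... BY: recorded in the plan, not executed yet.
        self.steps.append(lambda rows: [r for r in rows if predicate(r)])
        return self

    def foreach(self, fn):
        # Like FOREACH ... GENERATE: also only recorded.
        self.steps.append(lambda rows: [fn(r) for r in rows])
        return self

    def dump(self):
        # Like DUMP: the accumulated steps execute only now.
        rows = self.rows
        for step in self.steps:
            rows = step(rows)
        self.executed += 1
        return rows


plan = LazyPlan([("alice", 34), ("bob", 17), ("carol", 52)])
plan.filter(lambda r: r[1] >= 18).foreach(lambda r: r[0].upper())
ran_before_dump = plan.executed  # still 0: nothing has executed
result = plan.dump()
```

Deferring execution until `dump()` is what gives an engine the chance to look at the whole plan before running any of it.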

§ Pig & Hive are based on the MapReduce processing framework.

§ Tez addresses inefficiencies in MapReduce:
  – HDFS and local storage use
  – Requirement of map phase before reduce phase
  – YARN containers

§ Performance Optimization:
  – Persistent query execution servers with intelligent in-memory caching
  – Columnar-based, compression optimized to fit more in memory
  – Vectorized queries executed by Tez
§ Row Level and Column Level Security for SparkSQL

What is Apache Spark?
• Apache open source project originally developed at AMPLab (University of California Berkeley)
• Unified data processing engine that operates across varied data workloads and platforms

Why Spark?
• Elegant Developer APIs
  – Single environment for data munging and Machine Learning (ML)
• In-memory computation model
  – Fast!
  – Effective for iterative computations and ML
• Machine Learning
  – Implementation of distributed ML algorithms
• YARN Ready

Features
• Ad-hoc experimentation
• Deeply integrated with Spark + Hadoop
• Interactive snippet-at-a-time experience
• "Modern Data Science Studio"

Use Cases
• Data exploration and discovery
• Visualization
• Interactive data ingestion, data exploration, visualization, sharing and collaboration

Data engineers, data analysts and data scientists

• NoSQL database:
  • Open source key-value data store for Hadoop based on Google's Bigtable paper [2006]
  • Implemented as a sparse, consistent, distributed, multi-dimensional, persistent, sorted map
  • Key and value are byte arrays.
• HBase was created for hosting very large tables with billions of rows and millions of columns with very sparse data.
• Region Servers can collocate with DataNodes
• HBase vs RDBMS:
  • No data types
  • No schema
  • No multi-row transaction
  • Not optimized for N-way joins with large scans.
• HBase Advantages:
  • Provides random, low latency real-time data access
  • Allows table inserts, updates, and deletes
  • Runs on top of the Hadoop distributed file system
  • SQL Support through Apache Phoenix

CustomerID  FirstName  LastName  Street          Birthday
328         John       Wayne     123 BelAir      null
126         null       Smith     1211 Sycamore   null
449         Sara       Fox       840 Highview    null
65          Cary       Grant     12232 Main St   1929-02-21
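As a rough illustration of how the relational rows above map onto a sparse, sorted map of byte arrays, consider the sketch below. It is not the HBase client API. The `put` and `scan` helpers are hypothetical, and a single column family named "info" is assumed.

```python
def put(table, row_key, family, qualifier, value):
    """Store one cell; keys and values are kept as bytes."""
    row = table.setdefault(row_key.encode(), {})
    row.setdefault(family.encode(), {})[qualifier.encode()] = value.encode()


def scan(table):
    """Yield rows in byte-wise sorted row-key order."""
    for key in sorted(table):
        yield key, table[key]


customers = [
    ("328", {"FirstName": "John", "LastName": "Wayne", "Street": "123 BelAir"}),
    ("126", {"LastName": "Smith", "Street": "1211 Sycamore"}),
    ("449", {"FirstName": "Sara", "LastName": "Fox", "Street": "840 Highview"}),
    ("65", {"FirstName": "Cary", "LastName": "Grant",
            "Street": "12232 Main St", "Birthday": "1929-02-21"}),
]

table = {}
for row_key, cells in customers:
    for qualifier, value in cells.items():
        # Null columns are never written, which is what makes the map sparse.
        put(table, row_key, "info", qualifier, value)

row_order = [key.decode() for key, _ in scan(table)]
smith_cells = sorted(q.decode() for q in table[b"126"][b"info"])
```

Row keys sort as bytes rather than numbers, so "65" comes after "449" in a scan. That is one reason row-key design matters in this model.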

HBase

RowKey  Column Family  "Column Qualifier" : Cell
65      "info"         "FirstName" : Cary
                       "LastName" : Grant
                       "Street" : 12232 Main St
                       "Birthday" : 1929-02-21
126     "info"         "LastName" : Smith
                       "Street" : 1211 Sycamore
328     (cells not shown)
449     (cells not shown)

Apache Phoenix

• SQL Skin for HBase
• Enables low-latency OLTP and analytics in Hadoop
• Can be run on top of existing HBase Apps or as Stand-alone
• ANSI SQL-92
• ODBC/JDBC/.NET/Python/etc
• Secondary indexes, joins, aggregation pushdowns

[Diagram: applications within the datacenter use native interfaces (DB-API, DBI, etc.), while business analysts and standard SQL tools connect through edge servers. Both speak Protobuf over HTTP to Phoenix Query Servers, which sit alongside the HBase RegionServers.]

Authentication: Authenticate users and systems. Who are you? Prove it.

Authorization: Provision access to data. What are you allowed to do?

Audit: Maintain a record of data access. Who did what and when?

Data Protection: Protect data at rest and in motion. How to secure data at rest and over the wire?

Administration (central management and consistent security): Single administrative console to set policy across the entire cluster: Apache Ranger

Authentication (authenticate users and systems): Authentication for perimeter and cluster; integrates with existing Active Directory and LDAP solutions: Kerberos | Apache Knox

Authorization (provision access to data): Consistent authorization controls across all Apache components within HDP: Apache Ranger

Audit (maintain a record of data access): Record of data access events across all components that is consistent and accessible: Apache Ranger

Data Protection (protect data at rest and in motion): Secure data in motion and data at rest: HDFS TDE w/ Ranger KMS + HSM, Ranger Data Masking + Row Filtering, Wire encryption + Partner Solutions

Apache Knox

HDFS Encryption

LDAP/AD

Central Security Administration with Ranger

Apache Ranger
• Delivers a 'single pane of glass' for the security administrator.
• Centralizes administration of security policy:
  • Create and manage security policies to define authorizations and permissions to access Hadoop components.
• Provides centralized auditing.
• Ensures consistent coverage across the entire Hadoop stack.

• Fine grained security policies:
  • User/Groups
  • Hive DB authorizations, Column based.
  • Attributes Based Access Control
  • Dynamic Rules: Tag based, time based, geolocation based, prohibition rules.

Row Level Security in Hive
• Control Access to Rows in Hive Tables based on Context!
• Improve reliability and robustness of HDP by providing Row Level Security to Hive tables and reducing surface area of security system
• Restrict data row access based on user characteristics (e.g. group membership) AND runtime context
• Use Cases:
  – A hospital can create a security policy that allows doctors to view data rows only for their own patients
  – A bank can create a policy to restrict access to rows of financial data based on the employee's business division, locale or based on the employee's role
  – A multi-tenant application can create logical separation of each tenant's data, so each tenant can see only its data rows.

Dynamic Data Masking of Hive Columns
• Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation!
• Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output
• Benefits
  – Sensitive information never leaves database
  – No changes are required at the application or Hive layer
  – No need to produce additional protected duplicate versions of datasets
  – Simple & easy to setup masking policies

DGI Community becomes Apache Atlas

Global Financial Company

First kickoff to GA in 7 months

Timeline (Dec 2014 to July 2015), milestones in order: DGI Kickoff, Prototype, Apache Atlas Incubation, HDP 2.3 GA Release

Business Taxonomy (Catalog)
The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication.
Benefits: A view of data assets organized by business language

Data Lineage (Provenance)
Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.
Benefits: Impact analysis, Compliance, Acceptable use

Tags: Traits vs. Labels
Atlas has Tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A tag PII can be used in HR as well as Finance or Sales.
Benefits: Common tag through Hadoop components

[Diagram: Apache Atlas integrations across HDP 2.3, HDP 2.5 and beyond HDP 2.6: NiFi, Hive, Kafka, Spark, Storm, HBase, Sqoop, Falcon, Ranger]

• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single Interface point for Metadata Exchange with platforms outside of HDP

• Basic Tag policy – PII example. Access and entitlements must be tag based ABAC and scalable in implementation.

• Geo-based policy – Policy based on IP address; proxy IP substitution may be required. The rule enforcement must be geo aware.

• Time-based policy – Timer for data access, de-coupled from deletion of data.

• Prohibitions – Prevention of combination of Hive tables that may pose a risk together.
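The four dynamic policy types above can be pictured as a chain of checks applied to each access request. The sketch below is illustrative only and does not use any Ranger or Atlas API. The `Request` fields, the group name, the country list, the cut-off date and the table names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Request:
    """Hypothetical description of one data access attempt."""
    user_groups: set
    column_tags: set
    country: str
    when: datetime
    tables: frozenset


def is_allowed(req):
    # Tag-based: PII-tagged columns are visible only to the assumed "hr" group.
    if "PII" in req.column_tags and "hr" not in req.user_groups:
        return False
    # Geo-based: assumed allow-list of countries resolved from the client IP.
    if req.country not in {"CA", "US"}:
        return False
    # Time-based: access expires on an assumed date, independent of data deletion.
    if req.when >= datetime(2018, 1, 1):
        return False
    # Prohibition: these two assumed tables may not be combined in one query.
    if {"patients", "billing"} <= req.tables:
        return False
    return True


ok = Request({"hr"}, {"PII"}, "CA", datetime(2017, 6, 1), frozenset({"patients"}))
no_tag_access = Request({"sales"}, {"PII"}, "CA", datetime(2017, 6, 1), frozenset({"patients"}))
risky_join = Request({"hr"}, set(), "US", datetime(2017, 6, 1), frozenset({"patients", "billing"}))
decisions = [is_allowed(ok), is_allowed(no_tag_access), is_allowed(risky_join)]
```

Because the tag check keys on the tag rather than on a specific table or column, one rule covers every asset carrying that tag.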

A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Apache Ambari takes the guesswork out of operating Hadoop.

• Simplified Installation, Configuration and Management
• Centralized Security Setup
• Full Visibility into Cluster Health
• Highly Extensible and Customizable

Ambari Stacks + Blueprints Together

A Stack Definition plus a Component Layout & Configuration make up a BLUEPRINT.
A BLUEPRINT plus HOSTS is used to INSTANTIATE a CLUSTER.
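The split between a blueprint (stack plus component layout) and the host mapping used to instantiate a cluster can be sketched as two JSON documents. The example below builds both as Python dictionaries. The field names follow the Ambari Blueprints JSON layout as best understood here and should be treated as assumptions to verify against the Ambari documentation. The blueprint name, host group names and host names are hypothetical.

```python
import json

# Blueprint: stack definition + component layout (no concrete hosts yet).
blueprint = {
    "Blueprints": {"blueprint_name": "demo", "stack_name": "HDP", "stack_version": "2.6"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}]},
        {"name": "worker", "cardinality": "2",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
    "configurations": [],
}

# Cluster template: maps each host group of the blueprint onto real hosts.
cluster_template = {
    "blueprint": "demo",
    "host_groups": [
        {"name": "master", "hosts": [{"fqdn": "m1.example.com"}]},
        {"name": "worker", "hosts": [{"fqdn": "w1.example.com"},
                                     {"fqdn": "w2.example.com"}]},
    ],
}


def unmapped_groups(blueprint, template):
    """Return host group names that appear in only one of the two documents."""
    defined = {g["name"] for g in blueprint["host_groups"]}
    mapped = {g["name"] for g in template["host_groups"]}
    return sorted(defined ^ mapped)


missing = unmapped_groups(blueprint, cluster_template)
payload = json.dumps(blueprint, indent=2)  # what would be sent to the management API
```

Keeping hosts out of the blueprint is what lets the same layout be reused for several clusters.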

Primary Use Cases

Deploy on Public or Private Clouds Dynamically configure and manage clusters on public or private clouds (Amazon Web Services, Microsoft Azure, Google Cloud Platform and OpenStack)

Automated Scaling Seamlessly manage elasticity requirements as cluster workloads change

Secured Cluster Access Supports configuration for Kerberos, defining network boundaries and configuring security groups

On-Premises
• HDP on Linux: Full control of HW and software configs

Cloud
• HDInsight: Managed Hadoop Service, built on Azure storage
• HDP on Azure: Your deployment of Hadoop hosted as a VM in Azure

Flow Management
• Data acquisition and delivery
• Simple transformation and data routing
• Simple event processing
• End to end provenance
• Edge intelligence and bi-directional comms.

Stream Processing
• Scalable data broker for streaming apps
• Scale out complex transformation
• Complex analytics
• Stream development and runtime environment
• Continuous insights

Enterprise Services
• Provisioning
• Management
• Monitoring
• Security
• Audit
• Compliance
• Governance
• Multi-tenancy

StreamLine

[Diagram labels: FLOW MANAGEMENT, STREAM PROCESSING, ENTERPRISE SERVICES]

• Streaming Analytics: Capture perishable insights from data-in-motion
• Visual Control Over Data Flows: to manage who can see and touch data in transit
• End-to-End Security: to encrypt, decrypt, and filter data on its journey
• Real-Time Traceability: Rich metadata and contextual detail helps troubleshoot security issues

Schema Registry

• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
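Backpressure and prioritized queuing from the list above can be illustrated with a bounded priority queue. This is a conceptual sketch rather than NiFi code. The class name, the threshold of two items and the sample payloads are hypothetical.

```python
import heapq


class BoundedPriorityQueue:
    """Toy connection queue: refuses new items past a threshold, releases by priority."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._heap = []
        self._seq = 0  # tie-breaker keeps first-in-first-out order within a priority

    def offer(self, priority, item):
        """Return False (backpressure) instead of accepting data past the threshold."""
        if len(self._heap) >= self.max_items:
            return False
        heapq.heappush(self._heap, (priority, self._seq, item))
        self._seq += 1
        return True

    def poll(self):
        """Release the most urgent item (lowest priority number first)."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = BoundedPriorityQueue(max_items=2)
accepted = [queue.offer(5, "bulk log"), queue.offer(1, "alert"), queue.offer(3, "metric")]
first_out = queue.poll()
accepted_after_release = queue.offer(3, "metric")
```

The third offer is refused while the queue is full, the "alert" leaves first despite arriving second, and the retry succeeds once space has been released.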

[Figure: examples of available processors and protocols: FTP, SFTP, HTTP, UDP, WebSocket, Email, Syslog, AMQP, HL7, XML, JSON, HTML, Image, Hash, Encrypt, GeoEnrich, Merge, Tail, Scan, Extract, Evaluate, Replace, Duplicate, Execute, Translate, Split, Fetch, Convert, Route Text, Route Content, Route Context, Distribute Load, Generate Table Fetch, Jolt Transform, Control Rate, Prioritized Delivery]

All Apache project logos are trademarks of the ASF and the respective projects.

A word on Kafka
• High throughput publish-subscribe Distributed Messaging System:
  – Kafka maintains feeds of messages in categories called topics.
  – Processes that publish messages to a Kafka topic are producers.
  – Processes that subscribe to topics and process the feed of published messages are consumers.
  – Kafka is run as a cluster comprised of one or more servers each of which is called a broker.

[Diagram: producers publish to the Kafka Cluster; consumers read from it]

• For each topic, Kafka maintains partitions across the Kafka cluster:
  – Ordered, immutable, sequence of messages.
  – Messages are retained for a configurable period of time.
  – Partitioning can be round-robin, or key based.
  – Topics can be replicated for fault tolerance purposes.
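The two partitioning styles can be sketched with an in-memory topic. This is not the Kafka client API. The `Topic` class is hypothetical, and CRC32 is used here only as a convenient stable hash (an assumption for illustration, not a claim about Kafka's own partitioner).

```python
import zlib
from itertools import count


class Topic:
    """Toy topic: one append-only list per partition."""

    def __init__(self, partitions):
        self.logs = [[] for _ in range(partitions)]
        self._round_robin = count()

    def send(self, value, key=None):
        """Append a message and return its (partition, offset)."""
        if key is None:
            # Round-robin: spread keyless messages evenly over partitions.
            partition = next(self._round_robin) % len(self.logs)
        else:
            # Key-based: a stable hash sends the same key to the same partition.
            partition = zlib.crc32(key.encode()) % len(self.logs)
        self.logs[partition].append(value)
        return partition, len(self.logs[partition]) - 1


topic = Topic(partitions=3)
round_robin_partitions = [topic.send(f"event-{i}")[0] for i in range(4)]
p1, o1 = topic.send("login", key="user-42")
p2, o2 = topic.send("logout", key="user-42")
```

Ordering holds only within a partition, so key-based partitioning is the usual way to keep all messages about one entity in sequence.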

• Distributed Computation framework for real-time analytics:
  • Production Application log monitoring
  • Real-time fraud detection on credit card/banking transactions
• Storm Components:
  • Topology: Storm application made of:
    • Stream: Flow of tuples
    • Spouts: Generate streams
    • Bolt: Contain actual data processing, persistence, and alerting logic.
  • Tuple: Fundamental data unit.
• Storm Cluster Architecture:
  • Similar to a Hadoop cluster: Highly-parallel processing cluster to reliably process massive amounts of data.
  • Storm and Hadoop can share the same cluster, and can also run on YARN.
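The spout, bolt and tuple vocabulary can be illustrated with plain Python generators. This is a single-process sketch and not the Storm API. The function names, the sample transactions and the 5000.0 threshold are hypothetical.

```python
from collections import Counter


def transaction_spout():
    """Spout: emits a stream of tuples (card_id, amount)."""
    yield from [
        ("card-1", 20.0),
        ("card-2", 9500.0),
        ("card-1", 35.5),
        ("card-2", 8700.0),
    ]


def flag_large_bolt(stream, threshold=5000.0):
    """Bolt: passes on only tuples whose amount exceeds the assumed threshold."""
    for card_id, amount in stream:
        if amount > threshold:
            yield (card_id, amount)


def count_bolt(stream):
    """Bolt: counts flagged tuples per card, standing in for alerting logic."""
    counts = Counter()
    for card_id, _amount in stream:
        counts[card_id] += 1
    return counts


# Topology: spout -> flag_large_bolt -> count_bolt
alerts = count_bolt(flag_large_bolt(transaction_spout()))
```

In a real topology each stage would run as many parallel tasks across the cluster; the chain here only shows how tuples flow from a spout through successive bolts.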

• Download HDP Sandbox: http://hortonworks.com/products/sandbox/

• Deploy the HDP Sandbox on Azure: http://hortonworks.com/hadoop-tutorial/deploying-hortonworks-sandbox-on-microsoft-azure/

• Learning the ropes: http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox

Thank You
