Hadoop Essentials: HDP and HDI Overview
Glenn Koehler – Enterprise Account Manager
Piotr Pruski – Solutions Engineer
Eyad Garelnabi – Solutions Engineer
© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
§ Introduction
§ The Connected Data Era
§ Hortonworks Data Platform (HDP) Overview
§ Core Hadoop
§ Data Access & Consumption
§ Data Security & Governance
§ Operations, Cloud and HDI
§ Hortonworks Data Flow (HDF) Overview
§ Demo of HDP & HDF
§ Q&A
Introduction
About Hortonworks
• Founded in 2011
• 1000+ employees across 20+ countries
• ONLY 100% open source Apache™ Hadoop data platform
• 1ST Hadoop provider to go public: IPO 4Q14 (NASDAQ: HDP)
• 1,700+ technology partners
Fastest company to reach $100 M in revenue
The Hortonworks Approach
[Diagram: a cycle of Architect, Develop, Distribute and Service, surrounded by Co-Dev Partners & Customers, the Apache Software Foundation, Customers and ODPi]
Innovation happens best not in isolation but in collaboration
[Chart: INNOVATION (vertical axis) over TIME (horizontal axis), comparing the OPEN COMMUNITY curve with the PROPRIETARY APPROACH curve. The gap between them is labeled THE INNOVATION ADVANTAGE; the chart is captioned MAXIMUM COMMUNITY INNOVATION.]
Hortonworks Influences the Apache Community
• We Employ the Committers: one third of all committers to the Apache® Hadoop™ project, and a majority in Apache NiFi and other important projects
• Our Committers Innovate and improve Connected Data Platforms
• We Influence the Hadoop Roadmap by communicating important requirements to the community through our leaders
APACHE HADOOP COMMITTERS
Hortonworks Customer Momentum Spans Industries
Canadian Companies partnered with Hortonworks
• Financial Services: 55% of the U.S. F100
• Retail: 75% of the U.S. F100
• Telecommunications: 8 of the top 9 in North America
• Automotive: 8 of the global top 20
Learn more at hortonworks.com/customers
Connected Data Era
The Connected Era: A New Way of Business
• Smart Mobility
• The Internet of Things
• Digital Personalization
• Cloud Computing
Data Doubles Every Two Years
44ZB By 2020
The Old Way → The New Way
• System-Centric → User-Centric
• Procedural → Agile
• Hierarchical → Dynamic
• Scheduled → Real-Time
• Monolithic → Contextual
Connected Data Platforms
ACTIONABLE INTELLIGENCE
DATA IN MOTION (Perishable Insights)
• Capture streaming data
• Deliver perishable insights
• Combine new & old data
DATA AT REST (Historical Insights)
• Store data forever
• Access a multi-tenant data lake
• Model with artificial intelligence
The Data Plane: For All Data, Anytime and Anywhere
• Data in Motion or at Rest
• Across Data Center and Cloud
• Centrally Managed
[Diagram: a Data Plane spanning Cloud and Data Center]
[Diagram: grid of example customer use cases. Labels that can be recovered include Sentiment Analysis, Payment Tracking, Basket Analysis, Call Analysis, Risk Modeling, Inventory Predictions, Vendor Scorecards, Supply Chain, Cross-Sell, OPEX Reduction, Sensor Data Ingest, Data as a Service, Historical Records, Geo Tagging, Text Enrichment, Public Data Capture, RDBMS Offloads, Sales Reporting and Digital Protection.]
[Diagram: grid of customer use cases. Labels that can be recovered include Payment Tracking, Due Diligence, Sentiment Analysis, Social Mapping, Customer Support, Optimize Inventories, Next Product Recs, Store Design, Machine Data, Product Design, Ad Placement, Basket Analysis, Proactive Repair, Disaster Mitigation, Investment Planning, Call Analysis, M & A, Segments, Factory Yields, Defect Detection, Cross-Sell, Customer Retention, Vendor Scorecards, Inventory Predictions, Risk Modeling, Supply Chain, OPEX Reduction, Device Data Ingest, Data as a Service, Historical Records, Fraud Prevention, Mainframe Offloads, Rapid Reporting, Digital Protection and Public Data Capture.]
Hortonworks® customers leverage our technology to transform their businesses, either by achieving new business objectives or by reducing costs. The journey typically involves both of those goals in combination, across many use cases.
Use Case: 12 month Hadoop evolution at TrueCar
12 Month Results at TrueCar
• Six Production Hadoop Applications
• Sixty nodes/2PB data
• Storage Costs/Compute Costs from $19/GB to $0.12/GB
“We addressed our data platform capabilities strategically as a pre-cursor to IPO.”
12 months execution plan (Data Platform Capabilities over time):
• June 2013: Begin Hadoop Execution
• July 2013: Hortonworks Partnership
• Aug 2013: Training & Dev Begins
• Nov 2013: Production Cluster, 60 Nodes, 2 PB
• Dec 2013: Three Production Apps (3 total)
• Jan 2014: 40% Dev Staff Proficient
• Feb 2014: Three More Production Apps (6 total)
• May ‘14: IPO
EXAMPLE HEALTHCARE JOURNEY: RENOVATE → INNOVATE
[Diagram: healthcare use cases arranged by stage (EXPLORE, OPTIMIZE, TRANSFORM) and by category (ACTIVE ARCHIVE, ETL ONBOARD, DATA DISCOVERY, DATA ENRICHMENT, SINGLE VIEW, PREDICTIVE ANALYTICS). Labels that can be recovered include Medication Diversion, Drug Safety, Inventory Management, Staffing Predictions, Preventative Medicine, Population Health, Genomics & Omics, OR Optimization, Patient Throughput, Seasonal Staffing, Patient Experience, Patient Outreach, HCAHPS Scores, STARS Ratings, 360° Patient View, Supply Chain Management, Net Promoter Score, Sentiment Analysis, Insurance Claims, Physician Notes, Device Monitoring, Cohort Selection, Care-Path Best Practices, Quality Benchmarks, Social Sentiment and R & D.]
Actionable Intelligence Makes Healthcare Precise and Personal
Data sources:
• Sensor Data
• Patient Locations
• Pharmacy Data
• Patient Records (EMR)
• Social Media Data
• Clinical Physician Notes
• Patient Satisfaction Data
• Intra-Network Data
• Wearables
• Claims Data
• Lab Data
Outcomes:
• SINGLE VIEW OF PATIENT
• REAL-TIME VITAL SIGN MONITORING
• BILLING & REIMBURSEMENTS OPTIMIZATION
• EMR OPTIMIZATION
• SUPPLY CHAIN OPTIMIZATION
Mercy Transforms Healthcare Through “One Patient, One Record”
SITUATION
• Existing platform impeded goals
• Data enrichment needed for 1 million patients
• Move to Clarity wouldn’t enable real-time analytics
• Extracting data from Epic to Clarity took a day
RESULTS
• 3-5 Minutes to move data off Epic to Clarity with HDP
• $1M Additional Annual Revenue from improved billing process
• From “Never” to “Seconds”: accelerated researcher insight
• 900x more data ingest of ICU vital signs
[Diagram: use-case tiles grouped under PREDICTIVE ANALYTICS, DATA DISCOVERY, SINGLE VIEW, ACTIVE ARCHIVE, ETL OFFLOAD and DATA ENRICHMENT. Labels that can be recovered include Preventive Care, Single Patient Record, Device Data Ingest, OPEX Efficiency, Billing, Vital Signs, Medical Decision Support, Lab Notes Enrichment, Epic EMR Replication and Privacy.]
“[HDP] provides us a place and way to leverage our Epic data in addition to other data that comes from outside of Epic.”
Paul Boal, Director of Data Management
Arizona State Genomics Researchers Do The Work of Entire Teams
SITUATION
• Cancer is a complicated and complex disease
• Genomic data was too vast for legacy platforms
• Cancer risk patterns hidden within genomic data
• “Lamp-posting” around some genes made research slow & incremental
RESULTS
• NGCC for Genomics Analysis: HDP with HPC for analytics on 1000s of human genomes
• Federated Framework for Data Sharing: follows “National Cancer Moonshot” guidelines
• 1-2 Minutes on 20 Billion Rows: query response time from “never” to minutes
[Diagram: use-case tiles under ACTIVE ARCHIVE, ETL OFFLOAD and DATA DISCOVERY: Research & Lab Data Storage, Sensor Data Ingest, Genomics]
“[HDP] has sped our time to insight infinitely in some cases. Some questions were not possible before, and now they return results in a day.”
Dr. Kenneth Buetow, Director of Computational Sciences and Informatics
Progressive Rewards Safe Drivers and Improves Traffic Safety
SITUATION
• Usage-Based Insurance “Snapshot” Program
• In-Car Sensor Captures IoT Data
• Existing Data Systems Did Not Scale Efficiently
• ~7 Days to Transform Only 25% of UBI Data
• $2.6 Billion in 2014 Premiums
RESULTS
• 100% in 2-3 Days: driving detail captured from Snapshot, in HDF
• +12 Billion miles driven stored
• Web App-Enabled: customers see driving detail and improve safety
[Diagram: use-case tiles under PREDICTIVE ANALYTICS, DATA DISCOVERY, ETL OFFLOAD and ACTIVE ARCHIVE: Usage-Based Insurance (UBI), Web Log Analysis, Online Ad Placement, Claim Notes Mining, Sensor Data Ingest, Individual Driving Histories]
“We’re looking at datasets that we never dreamed we could look at…It’s joining dots that in the past we didn’t even know we could join.”
Pawan Divakarla, Data & Analytics Business Leader
Improving Service at the UK’s Royal Mail, a Centuries-Old Business
SITUATION
• Supports more than 29 million addresses
• Spent 90% of time moving data to/from warehouse
• Wanted to redefine data for business decisions
RESULTS
• Time moving EDW data from 90 to 10%: freeing valuable analytic capacity
• Customer churn reduced: by gathering edge data with HDF
• Analytic velocity improved: delivering insight in days, not months
• Governance & compliance: simplified and centralized
[Diagram: use-case tiles. PREDICTIVE ANALYTICS: New Data Products, Investment Planning. SINGLE VIEW: Parcel Distribution, Customer Acquisition, Customer Support. ACTIVE ARCHIVE: OPEX Savings, EDW Offload. ETL OFFLOAD: Rapid Reporting. DATA ENRICHMENT: Public Data Capture, Data-as-a-Service.]
“We’re accelerating that whole process, we’re not having to spin up projects just to get data. We are able to accomplish a huge amount of work with single individuals. We see Hortonworks as our advanced analytics platform.”
Thomas Lee-Warren, Director of the Technology Data Group
Data Science Speeds Time to Cyber Security Protection
SITUATION
• Network has +57M attack sensors in 157 countries
• Data streams from 75M users on 120M devices
• 3-4 hr processing latencies to analyze digital threats
• Long open windows of exposure to cyber attacks
RESULTS
• From 4 hours to 2-Seconds: threat detection latency
• 5000x improved time-to-protection
• 10s of PBs of historical data for machine learning
• Cloud Flexibility: to meet peak demand for analysis
[Diagram: use-case tiles. PREDICTIVE ANALYTICS: Threat Predictions, Unified Security Safeguard, Attacker Identification. DATA DISCOVERY: Protective Security Logs, Threat Detection. ETL OFFLOAD: Device Data Ingest, Greenplum Offload, Metadata Capture. ACTIVE ARCHIVE: Threat Archive.]
“On any given day, we’ll be processing 40 billion messages into our system…It used to be that queues would back up. We would see times to analysis on the order of 4 hours. On average, we’ve gotten that down to two or two and a half seconds.”
David Lin, Senior Director of Engineering, Symantec Cloud Platform
Hortonworks Delivers Connected Data Platforms
MODERN DATA APPS
ACTIONABLE INTELLIGENCE
PERISHABLE HISTORICAL INSIGHTS INSIGHTS
INTERNET OF DATA IN MOTION DATA AT REST ANYTHING
HDP Overview
What is Apache Hadoop?
Hadoop was designed for Big Data
→ Cost Effective
 – Commodity hardware
 – Open source software
→ Scalable
 – Efficiently store and process petabytes of data
 – Grows linearly by adding commodity computers
→ Reliable
 – Self healing as hardware fails or is added
→ Flexible
 – Store all types of data in many formats
 – Schema-on-read
HDFS – Hadoop Distributed File System
• Data broken into blocks and replicated 3x
• Automatically replaces lost data / computers
YARN – Distributed Computation Layer
• Distributed execution on HDFS
• Many programming models
 • MapReduce, SQL, Streaming, ML…
• Multi-users, with queues, priorities, etc…
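Schema-on-read can be pictured with a small sketch. This is plain Python rather than any Hadoop API, and the sample records, the `read_with_schema` helper and both schemas are hypothetical names made up for illustration. The point it tries to show: the stored text never changes, and each reader decides how to interpret it at read time.

```python
import csv
import io

# Raw data is stored exactly as it arrived; no schema is enforced on write.
RAW_LINES = "65,Cary,Grant,1929-02-21\n126,,Smith,\n"


def read_with_schema(raw, schema):
    """Apply a list of (name, converter) pairs to raw text at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        row = {}
        # zip stops at the shorter sequence, so a narrow schema reads fewer fields.
        for (name, convert), value in zip(schema, record):
            row[name] = convert(value) if value != "" else None
        rows.append(row)
    return rows


# Two readers interpret the same stored text differently.
customer_schema = [("id", int), ("first", str), ("last", str), ("birthday", str)]
id_only_schema = [("id", int)]

customers = read_with_schema(RAW_LINES, customer_schema)
ids = read_with_schema(RAW_LINES, id_only_schema)
```

A schema-on-write system would have rejected or reshaped the second record when it was loaded; here the missing fields simply read back as `None`.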
Hortonworks Data Platform
Ongoing Innovation in Apache
HDP 2.6* 1.2.1+ 5.5.1 1.6.3+ 2.7.3 0.16.0 0.9.2 0.7.0 0.7.0 0.91.0 1.1.2 4.7.0 1.7.0 1.1.0 0.10.0 0.8.0 1.4.6 1.5.2 0.10.1.0 2.5.0 3.4.6 4.2.0 0.11.0 0.7.0 1H2017 2.1*** **** 2.1**
HDP 2.5 1.2.1+ 1.6.2+ 2.7.3 0.16.0 0.7.0 5.5.1 0.6.0 0.91.0 1.1.2 4.7.0 1.7.0 1.0.1 0.10.0 0.7.0 1.4.6 1.5.2 0.10.0 2.4.0 3.4.6 4.2.0 0.9.0 0.6.0 Aug 2016 2.1*** 2.0**
HDP 2.4 2.7.1 0.15.0 1.2.1 0.7.0 5.2.1 1.6.0 0.80.0 1.1.2 4.4.0 1.7.0 0.10.0 0.6.1 0.5.0 1.4.6 1.5.2 0.9.0 2.2.1 3.4.6 4.2.0 0.6.0 0.5.0 Mar 2016
HDP 2.3 2.7.1 0.15.0 1.2.1 0.7.0 5.2.1 1.4.1 0.80.0 1.1.2 4.4.0 1.7.0 0.10.0 0.6.1 0.5.0 1.4.6 1.5.2 0.8.2 2.1.0 3.4.6 4.2.0 0.6.0 0.5.0 Oct 2015
HDP 2.2 2.6.0 0.14.0 0.14.0 0.5.2 4.10.2 1.2.1 0.60.0 0.98.4 4.2.0 1.6.1 0.9.3 0.6.0 1.4.5 1.5.2 0.8.1 2.0.0 3.4.6 4.1.0 0.5.0 0.4.0 Dec 2014
HDP 2.1 2.4.0 0.12.1 0.13.0 0.4.0 4.7.2 0.98.0 4.0.0 1.5.1 0.9.1 0.5.0 1.4.4 1.4.0 1.5.1 3.4.5 4.0.0 0.4.0 April 2014
HDP 2.0 2.2.0 0.12.0 0.12.0 0.96.1 1.4.4 1.3.1 1.4.4 3.4.5 3.3.2 Oct 2013
Zookeeper Hadoop & YARN Druid Phoenix Accumulo Storm Falcon Atlas Sqoop Flume Kafka Ambari Knox Ranger Oozie Hive Tez Pig Solr Spark Zeppelin Slider HBase
DATA MGMT DATA ACCESS GOVERNANCE & INTEGRATION OPERATIONS SECURITY
HORTONWORKS DATA PLATFORM
* HDP 2.6 – Shows current Apache branches being used. Final component version subject to change based on Apache release process.
** Spark 1.6.3+ Spark 2.1 – HDP 2.6 supports both Spark 1.6.3 and Spark 2.1 as GA.
*** Hive 2.1 is GA within HDP 2.6.
**** Apache Solr is available as an add-on product HDP Search.
Core Hadoop
Core Hadoop: HDFS + YARN
Distributed Computing of Storage, CPU and Memory
• Centralized Architecture
• Highly Scalable
• Cost Effective
[Diagram: YARN: Data Operating System (Cluster Resource Management) running on top of HDFS (Hadoop Distributed File System) across nodes 1 to N]
HDFS: Name Nodes & Data Nodes
Data and its Namespace
Name Nodes
§ Data Catalog, a “Namespace”
§ Active & Standby Nodes for HA
§ Manual or Automated Failover
§ Quorum Journaling (States & Edits)
Data Nodes
§ “Where the data lives”
§ Data stored in replicas of 3 (default)
§ Rack aware
§ Heterogeneous Storage Options
§ Large block sizes (128MB default in Hadoop 2.x)
Edge & Utility Nodes
§ Cluster Administration
§ 3rd Party Tools
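The effect of block size and replication on a single file can be worked through with a little arithmetic. The sketch below is plain Python, not an HDFS client; `hdfs_footprint` is a hypothetical helper, and the 1000 MB file and both block sizes are illustrative inputs. It assumes the last block only occupies the bytes it actually holds, so raw storage is simply file size times replication.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb, replication=3):
    """Return (blocks, block_replicas, raw_storage_mb) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    # A partially filled last block does not consume a full block on disk.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


small_blocks = hdfs_footprint(1000, block_size_mb=64)
large_blocks = hdfs_footprint(1000, block_size_mb=128)
```

Larger blocks leave the raw storage unchanged but halve the number of block entries the Name Node has to track, which is one reason large block sizes are preferred.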
HDFS: Fault Tolerant
Data Nodes
§ If a Data Node goes down, the Name Node will instruct the other data nodes to replicate the lost data.
§ HDFS is Rack-Aware, which is helpful in avoiding loss of data from a full rack failure.
§ When data is deleted through the file system shell with trash enabled, it is moved to the .Trash directory under the user's home directory. The Name Node will eventually delete the data based on an expiration setting.
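The re-replication behaviour described above can be simulated in a few lines. This is a toy model, not the Name Node's actual algorithm: the block IDs, node names and both helper functions are hypothetical, and real placement also weighs racks, disk usage and load, none of which is modelled here.

```python
def under_replicated(block_locations, live_nodes, target=3):
    """Return {block: missing_replica_count} for blocks below the target."""
    report = {}
    for block, nodes in block_locations.items():
        alive = [n for n in nodes if n in live_nodes]
        if len(alive) < target:
            report[block] = target - len(alive)
    return report


def re_replicate(block_locations, live_nodes, target=3):
    """Copy each under-replicated block to live nodes that do not hold it yet."""
    healed = {}
    for block, nodes in block_locations.items():
        alive = [n for n in nodes if n in live_nodes]
        candidates = sorted(live_nodes - set(alive))
        needed = max(0, target - len(alive))
        healed[block] = alive + candidates[:needed]
    return healed


locations = {"blk_1": ["dn1", "dn2", "dn3"], "blk_2": ["dn2", "dn3", "dn4"]}
live = {"dn1", "dn3", "dn4", "dn5"}  # dn2 has failed

missing = under_replicated(locations, live)
healed = re_replicate(locations, live)
```

After the simulated failure every block is back at three replicas and none of them lives on the failed node.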
YARN: A Data Operating System
Yet Another Resource Negotiator
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (ex. SAS, Syncsort, Actian, etc.)
[Diagram: BATCH, INTERACTIVE & REAL-TIME DATA ACCESS engines running on YARN: Data Operating System (Cluster Resource Management), over HDFS (Hadoop Distributed File System) on nodes 1 to N. Engines shown: Script (Pig, on Tez), SQL (Hive, on Tez), Java/Scala (Cascading, on Tez), NoSQL (HBase and Accumulo, on Slider), Stream (Storm, on Slider), In-Memory (Spark), Search (Solr), Others (ISV Engines).]
YARN in Production
• Yahoo: ~40,000 nodes, multiple clusters running YARN across over 365PB of data
• Spotify, Progressive, Kohls, UHG, Sprint, JPMC, Target, AIG, Samsung
YARN Ready Applications
• Facilitates ongoing innovation and enterprise adoption via ecosystem of new and existing “YARN Ready” solutions
YARN Concepts
Application
§ Application is a temporal job or a service submitted to YARN
§ Examples
 – Map Reduce Job (job)
 – HBase Cluster (service)
Container
§ Basic unit of allocation
§ Allocation of cluster resources (RAM, CPU)
 – container_0 = 2GB, 1CPU
 – container_1 = 1GB, 6 CPU
§ Replaces the fixed map/reduce slots
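The container idea can be sketched as a first-fit allocation over per-node free memory and CPU. This is a simplification for illustration, not the YARN scheduler: the `Node` class, the `allocate` function and the node sizes are hypothetical, and real schedulers add queues, priorities and locality.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_mb: int
    free_vcores: int


def allocate(nodes, request_mb, request_vcores):
    """Place one container on the first node with enough free RAM and CPU."""
    for node in nodes:
        if node.free_mb >= request_mb and node.free_vcores >= request_vcores:
            node.free_mb -= request_mb
            node.free_vcores -= request_vcores
            return node.name
    return None  # nothing fits right now; the request would stay pending


nodes = [Node("nm1", 4096, 4), Node("nm2", 8192, 8)]
c0 = allocate(nodes, 2048, 1)   # container_0 = 2GB, 1 CPU
c1 = allocate(nodes, 1024, 6)   # container_1 = 1GB, 6 CPU
c2 = allocate(nodes, 16384, 1)  # larger than any node, cannot be placed
```

Because each request names its own memory and CPU, a memory-heavy container and a CPU-heavy container can share a cluster without the waste that fixed map/reduce slots caused.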
YARN Architectural Components
Resource Manager
§ Global resource scheduler
§ Hierarchical queues
Node Manager
§ Per-machine agent
§ Manages the life-cycle of container
§ Container resource monitoring
Application Master
§ Per-application
§ Manages application scheduling and task execution
§ E.g. MapReduce Application Master
Data Access & Consumption
Hive: Scalable Modern Data Warehousing with HDP
Applications, by SQL capability:
• Batch SQL: ETL, Reporting, Data Mining, Deep Analytics
• Interactive SQL: Reporting, BI Tools: Tableau, Microstrategy, Cognos
• Sub-Second SQL: Ad-Hoc, Drill-Down, BI Tools: Tableau, Excel
• ACID / MERGE: Continuous Ingestion from Operational DBMS, Slowly Changing Dimensions
• OLAP / Cube: Multidimensional Analytics, MDX Tools, Excel
[Diagram: supporting layers labeled Core SQL Engine, Platform and Connectivity, with boxes for Comprehensive SQL:2011 Coverage, Advanced Cost-Based Optimizer, Apache Tez: Scalable Distributed Processing, Advanced Security, Petabyte Scale Processing, Scale-Out Storage and JDBC / ODBC. Legend: Existing, Development, Core Hive.]
Hive Support For HDP 2.6
• Hive LLAP GA: Interactive query in seconds, 10X faster join performance
• Ease of Use and Adoption: SQL Standard ACID Merge
• Enterprise Readiness: Supports all TPC-DS Queries
• Streamlined Operations: Hive Views
Hive Architecture
[Diagram: clients (Hive CLI/Beeline, JDBC / ODBC, Web UI) send SQL to HiveServer2. Inside Hive, the Compiler, Optimizer and Executor work with the Metastore DB. The resulting MapReduce work runs on Hadoop as a YARN job on NodeManagers.]
1 User issues SQL query
2 Hive parses and plans query
3 Query converted to MapReduce and executed on Hadoop
Apache Pig
Apache Pig is a high-level platform for transforming or analyzing large datasets.
Pig includes a scripted, procedural-based language that is used:
§ To build data pipelines to aggregate and add structure to data
§ To analyze data
Pig scripts are automatically converted to MapReduce jobs.
Pig Latin
• Pig Latin abstracts the Java MapReduce idiom into a form similar to SQL, but one in which you write a data flow that describes the data transformation step by step.
• Pig uses a lazy execution model:
 • During execution, each statement is processed by the Pig interpreter
 • If a statement is valid, it gets added to a logical plan built by the interpreter
 • The steps in the logical plan do not actually execute until a DUMP or STORE command is used
• Pig Latin is extensible:
 • Custom UDFs can be written in Java, Python, JavaScript, Ruby and Groovy.
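The lazy execution model can be imitated in plain Python. The sketch below is not Pig and does not use any Pig API; the `LazyPlan` class and its `filter`, `foreach` and `dump` methods are hypothetical stand-ins for FILTER, FOREACH and DUMP. Each call only appends a step to a plan, and nothing touches the data until `dump()` is called.

```python
class LazyPlan:
    """Collects steps into a logical plan; nothing runs until dump()."""

    def __init__(self, source):
        self.source = source
        self.steps = []     # the logical plan, in order
        self.executed = 0   # how many steps have actually run

    def filter(self, predicate):
        self.steps.append(("FILTER", lambda rows: [r for r in rows if predicate(r)]))
        return self

    def foreach(self, func):
        self.steps.append(("FOREACH", lambda rows: [func(r) for r in rows]))
        return self

    def dump(self):
        # Only now is the plan walked and each step applied to the data.
        rows = list(self.source)
        for _name, step in self.steps:
            rows = step(rows)
            self.executed += 1
        return rows


plan = LazyPlan([("a", 1), ("b", 5), ("c", 9)])
plan.filter(lambda r: r[1] > 2).foreach(lambda r: r[0].upper())
steps_run_before_dump = plan.executed
result = plan.dump()
```

Deferring execution this way is what gives an engine the chance to look at the whole plan and optimize it before any job is launched.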
A word on Tez
§ Pig & Hive are based on the MapReduce processing framework.
§ Tez addresses inefficiencies in MapReduce:
 – HDFS and local storage use
 – Requirement of a map phase before each reduce phase
 – YARN containers
Hive LLAP: Performance Gains + Security
§ Performance Optimization:
 – Persistent query execution servers with intelligent in-memory caching
 – Columnar-based, compression optimized to fit more in memory
 – Vectorized queries executed by Tez
§ Row Level and Column Level Security for SparkSQL
Apache Spark
What is Apache Spark?
→ Apache open source project originally developed at AMPLab (University of California Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms
Why Spark?
→ Elegant Developer APIs
 – Single environment for data munging and Machine Learning (ML)
→ In-memory computation model – Fast!
 – Effective for iterative computations and ML
→ Machine Learning
 – Implementation of distributed ML algorithms
→ YARN Ready
Apache Zeppelin GA: The Data Science Notebook
Features
• Ad-hoc experimentation
• Deeply integrated with Spark + Hadoop
• Interactive snippet-at-a-time experience
• “Modern Data Science Studio”
Use Cases
• Data exploration and discovery
• Visualization
• Interactive data ingestion, data exploration, visualization, sharing and collaboration
For data engineers, data analysts and data scientists
A word on HBase
• NoSQL database:
 • Open source key-value data store for Hadoop based on Google’s Bigtable paper [2006]
 • Implemented as a sparse, consistent, distributed, multi-dimensional, persistent, sorted map
 • Key and value are byte arrays.
 • HBase was created for hosting very large tables with billions of rows and millions of columns with very sparse data.
 • Region Servers can collocate with DataNodes
• HBase vs RDBMS:
 • No data types
 • No schema
 • No multi-row transaction
 • Not optimized for N-way joins with large scans.
• HBase Advantages:
 • Provides random, low latency real-time data access
 • Allows table inserts, updates, and deletes
 • Runs on top of the Hadoop distributed file system
 • SQL Support through Apache Phoenix
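The "sparse, multi-dimensional, sorted map of byte arrays" description can be made concrete with an in-memory toy. This is not the HBase client API; `put`, `get` and `scan` here are hypothetical helper functions over a Python dict, and the sample rows are illustrative. It shows two consequences of the model: absent cells cost nothing, and row keys sort as bytes rather than as numbers.

```python
table = {}  # row key (bytes) -> {(family, qualifier): value}, all byte arrays


def put(row_key, family, qualifier, value):
    table.setdefault(row_key, {})[(family, qualifier)] = value


def get(row_key, family, qualifier):
    # A cell that was never written is simply absent (sparse storage).
    return table.get(row_key, {}).get((family, qualifier))


def scan(start_row, stop_row):
    """Yield rows with start_row <= key < stop_row, in byte order."""
    for key in sorted(table):
        if start_row <= key < stop_row:
            yield key, table[key]


put(b"65", b"info", b"FirstName", b"Cary")
put(b"65", b"info", b"LastName", b"Grant")
put(b"126", b"info", b"LastName", b"Smith")
put(b"328", b"info", b"FirstName", b"John")

row_order = sorted(table)
scanned_keys = [key for key, _cells in scan(b"1", b"4")]
```

Note that `b"65"` sorts after `b"328"` because comparison is byte by byte, which is why numeric row keys are usually zero-padded or otherwise encoded when range scans matter.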
HBase Data Storage – Relational vs. HBase

Relational Database
CustomerID  FirstName  LastName  Street          Birthday
328         John       Wayne     123 BelAir      null
126         null       Smith     1211 Sycamore   null
449         Sara       Fox       840 Highview    null
65          Cary       Grant     12232 Main St   1929-02-21

HBase
RowKey  Column Family  “Column Qualifier” : Cell
65      “info”         “FirstName” : Cary
                       “LastName” : Grant
                       “Street” : 12232 Main St
                       “Birthday” : 1929-02-21
126     “info”         “LastName” : Smith
                       “Street” : 1211 Sycamore
328
449

Apache Phoenix
• Enables low-latency OLTP and analytics in Hadoop
• SQL Skin for HBase
• Can be run on top of existing HBase Apps or as Stand-alone
• ANSI SQL-92
• ODBC/JDBC/.NET/Python/etc
• Secondary indexes, joins, aggregation pushdowns
[Diagram: Applications within the Datacenter use Native Interfaces (DB-API, DBI, etc.) and Business Analysts use Standard SQL Tools; both speak Protobuf over HTTP to Phoenix Query Servers, shown on Edge Servers and next to the HBase RegionServers.]
Data Security & Governance
Comprehensive Approach to Hadoop Security
• Administration: Central management and consistent security. How do I set policies across the entire cluster?
• Authentication: Authenticate users and systems. Who are you? Prove it.
• Authorization: Provision access to data. What are you allowed to do?
• Audit: Maintain a record of data access. Who did what and when?
• Data Protection: Protect data at rest and in motion. How to secure data at rest and over the wire?
HDP Security: Comprehensive, Complete, Extensible
• Administration (Central management and consistent security): Single administrative console to set policy across the entire cluster: Apache Ranger
• Authentication (Authenticate users and systems): Authentication for perimeter and cluster; integrates with existing Active Directory and LDAP solutions: Kerberos | Apache Knox
• Authorization (Provision access to data): Consistent authorization controls across all Apache components within HDP: Apache Ranger
• Audit (Maintain a record of data access): Record of data access events across all components that is consistent and accessible: Apache Ranger
• Data Protection (Protect data at rest and in motion): Secure data in motion and data at rest: HDFS TDE w/ Ranger KMS + HSM, Ranger Data Masking + Row Filtering, Wire encryption + Partner Solutions
Security: Protecting the Elephant in the Castle…
[Diagram: layered defenses around the cluster, labeled Network Segmentation, Firewalls; Apache Knox; Kerberos, Wire Encryption; Apache Ranger; HDFS Encryption]
Central Security Administration with Ranger
[Diagram label: LDAP/AD]
Apache Ranger
• Delivers a ‘single pane of glass’ for the security administrator.
• Centralizes administration of security policy:
 • Create and manage security policies to define authorizations and permissions to access Hadoop components.
• Provides centralized auditing.
• Ensures consistent coverage across the entire Hadoop stack.
• Fine grained security policies:
 • User/Groups
 • Hive DB authorizations, column based.
 • Attribute Based Access Control
 • Dynamic Rules: tag based, time based, geolocation based, prohibition rules.
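A column-based authorization check of the kind listed above can be sketched as follows. This does not use Ranger's API or its policy format; the `Policy` class, the `is_allowed` function and the sample databases, tables and groups are all hypothetical. The sketch only illustrates the shape of the decision: a request is allowed when some policy covers the resource, the user's groups and the access type, and denied otherwise.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    database: str
    table: str
    columns: set   # column names, or {"*"} for every column
    groups: set    # groups the policy applies to
    access: set = field(default_factory=lambda: {"select"})


def is_allowed(policies, user_groups, database, table, column, access="select"):
    """Allow only if some policy covers the resource, a user group and the access type."""
    for p in policies:
        if (p.database == database and p.table == table
                and (column in p.columns or "*" in p.columns)
                and p.groups & set(user_groups)
                and access in p.access):
            return True
    return False  # default deny


policies = [
    Policy("sales", "orders", {"order_id", "amount"}, {"analysts"}),
    Policy("hr", "employees", {"*"}, {"hr_admins"}),
]

analyst_reads_amount = is_allowed(policies, ["analysts"], "sales", "orders", "amount")
analyst_reads_ssn = is_allowed(policies, ["analysts"], "hr", "employees", "ssn")
admin_reads_ssn = is_allowed(policies, ["hr_admins"], "hr", "employees", "ssn")
analyst_updates = is_allowed(policies, ["analysts"], "sales", "orders", "amount", "update")
```

Keeping the policies in one list that every check consults mirrors the "single pane of glass" idea: one place to change a rule, with the same answer wherever it is evaluated.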
Unique Security Features within HDP for SQL Users
RANGER + HIVE
Row Level Security in Hive
→ Control Access to Rows in Hive Tables based on Context!
→ Improve reliability and robustness of HDP by providing Row Level Security to Hive tables and reducing surface area of security system
→ Restrict data row access based on user characteristics (e.g. group membership) AND runtime context
→ Use Cases:
 – A hospital can create a security policy that allows doctors to view data rows only for their own patients
 – A bank can create a policy to restrict access to rows of financial data based on the employee's business division, locale or based on the employee's role
 – A multi-tenant application can create logical separation of each tenant's data so that each tenant can see only its data rows.
Dynamic Data Masking of Hive Columns
→ Protect Sensitive Data in real-time with Dynamic Data Masking/Obfuscation!
→ Mask or anonymize sensitive columns of data (e.g. PII, PCI, PHI) from Hive query output
→ Benefits
 – Sensitive information never leaves database
 – No changes are required at the application or Hive layer
 – No need to produce additional protected duplicate versions of datasets
 – Simple & easy to setup masking policies
DGI Community becomes Apache Atlas
Global Financial Company
First kickoff to GA in 7 months
[Timeline, Dec 2014 to July 2015: DGI Kickoff → Prototype → Apache Atlas Incubation → HDP 2.3 GA Release]
Governance Key Concepts
Business Taxonomy (Catalog)
The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication.
Benefits: A view of data assets organized by business language.

Data Lineage (Provenance)
Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.
Benefits: Impact analysis, Compliance, Acceptable use.

Tags: Traits vs. Labels
Atlas has Tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A PII tag can be used in HR as well as in Finance or Sales.
Benefits: Common tag through Hadoop components.
Apache Atlas: Cross-Component Dataset Lineage
NiFi Hive Kafka Spark Storm HBase Sqoop Falcon Ranger HDP 2.3 Apache Atlas HDP 2.5 Beyond HDP 2.6
• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single Interface point for Metadata Exchange with platforms outside of HDP
Atlas + Ranger: Metadata driven security model
Tag-based Access Policy Requirements
• Basic Tag policy – PII example. Access and entitlements must be tag based ABAC and scalable in implementation.
• Geo-based policy – Policy based on IP address; proxy IP substitution may be required. The rule enforcement must be geo aware.
• Time-based policy – Timer for data access, de-coupled from deletion of data.
• Prohibitions – Prevention of combination of Hive tables that may pose a risk together.
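Three of these requirements (tag based, time based and prohibitions) can be sketched together. The code below is an illustration only and does not reflect how Atlas or Ranger store or evaluate policies; the tag assignments, the rule table, the prohibited combination and both functions are hypothetical.

```python
from datetime import date

# Hypothetical metadata: which tags are attached to which columns.
COLUMN_TAGS = {
    ("hr", "employees", "ssn"): {"PII"},
    ("sales", "leads", "email"): {"PII"},
    ("sales", "leads", "region"): set(),
}

# Hypothetical tag rule: who may read tagged data, and until when.
TAG_RULES = {
    "PII": {"groups": {"compliance"}, "valid_until": date(2017, 12, 31)},
}

# Hypothetical prohibition: these two tables must not be combined in one query.
PROHIBITED_COMBINATIONS = [{("hr", "employees"), ("sales", "leads")}]


def tag_access_allowed(column, user_groups, today):
    """Every tag on the column must grant access to the user and still be valid."""
    for tag in COLUMN_TAGS.get(column, set()):
        rule = TAG_RULES.get(tag)
        if rule is None:
            return False  # tagged but no rule defined: deny
        if not (rule["groups"] & set(user_groups)):
            return False  # tag-based check
        if today > rule["valid_until"]:
            return False  # time-based check; the data itself is not deleted
    return True


def combination_allowed(tables):
    """Reject a query that touches every table of a prohibited combination."""
    requested = set(tables)
    return not any(combo <= requested for combo in PROHIBITED_COMBINATIONS)
```

For example, `tag_access_allowed(("hr", "employees", "ssn"), ["compliance"], date(2017, 6, 1))` is allowed, the same call dated 2018 is denied, and a query over both `hr.employees` and `sales.leads` fails `combination_allowed`. Because the rule hangs off the tag rather than the column, tagging a new column as PII brings it under the same policy without writing a new rule.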
Operations, Cloud & HDI
Apache Ambari: Cluster Operations
A completely open source management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Apache Ambari takes the guesswork out of operating Hadoop.
• Simplified Installation, Configuration and Management
• Centralized Security Setup
• Full Visibility into Cluster Health
• Highly Extensible and Customizable
Simplified Installation, Configuration and Management
Ambari Stacks + Blueprints Together
• Stack Definition + Component Layout & Configuration = BLUEPRINT
• BLUEPRINT + HOSTS → INSTANTIATE CLUSTER
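The "blueprint plus hosts" idea can be sketched as two JSON documents. The field names below follow the Ambari Blueprint layout as best recalled and should be checked against the Ambari documentation before use; the blueprint name, host names and the `unmapped_host_groups` helper are hypothetical. The blueprint says which components go in which host group, and the cluster template maps real hosts onto those groups.

```python
import json

# Blueprint: stack definition plus component layout (field names are an assumption).
blueprint = {
    "Blueprints": {
        "blueprint_name": "small-hdp",
        "stack_name": "HDP",
        "stack_version": "2.6",
    },
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"},
                        {"name": "ZOOKEEPER_SERVER"}]},
        {"name": "worker", "cardinality": "2",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}

# Cluster template: maps concrete hosts onto the blueprint's host groups.
cluster_template = {
    "blueprint": "small-hdp",
    "host_groups": [
        {"name": "master", "hosts": [{"fqdn": "master1.example.com"}]},
        {"name": "worker", "hosts": [{"fqdn": "worker1.example.com"},
                                     {"fqdn": "worker2.example.com"}]},
    ],
}


def unmapped_host_groups(blueprint, template):
    """Return blueprint host groups that have no hosts assigned in the template."""
    defined = {g["name"] for g in blueprint["host_groups"]}
    mapped = {g["name"] for g in template["host_groups"] if g["hosts"]}
    return sorted(defined - mapped)


blueprint_json = json.dumps(blueprint, indent=2)
missing_groups = unmapped_host_groups(blueprint, cluster_template)
```

Since the blueprint carries no host names, the same layout can be reused to stand up several clusters by pairing it with different templates.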
Cloudbreak: Cloud operations
Primary Use Cases
Deploy on Public or Private Clouds Dynamically configure and manage clusters on public or private clouds (Amazon Web Services, Microsoft Azure, Google Cloud Platform and OpenStack)
Automated Scaling Seamlessly manage elasticity requirements as cluster workloads change
Secured Cluster Access Supports configuration for Kerberos, defining network boundaries and configuring security groups
Strong Co-engineering partnership
Flexible Deployment Options with Azure
On-Premises
• HDP on Linux: Full control of HW and software configs
Cloud
• HDInsight: Managed Hadoop Service, built on Azure storage
• HDP on Azure: Your deployment of Hadoop hosted as a VM in Azure
HDF Overview
Key Components of Streaming Analytics in Enterprise Environments
Flow Management
• Data acquisition and delivery
• Simple transformation and data routing
• Simple event processing
• End to end provenance
• Edge intelligence and bi-directional comms.
Stream Processing
→ Scalable data broker for streaming apps
→ Scale out complex transformation
→ Complex analytics
→ Stream development and runtime environment
→ Continuous insights
Enterprise Services
→ Provisioning
→ Management
→ Monitoring
→ Security
→ Audit
→ Compliance
→ Governance
→ Multi-tenancy
Hortonworks DataFlow
Powered by Apache NiFi, Storm & Kafka
StreamLine
• Streaming Analytics: Capture perishable insights from data-in-motion
• Visual Control Over Data Flows: to manage who can see and touch data in transit
• End-to-End Security: to encrypt, decrypt, and filter data on its journey
• Real-Time Traceability: Rich metadata and contextual detail helps troubleshoot security issues
[Diagram: FLOW MANAGEMENT and STREAM PROCESSING on top of ENTERPRISE SERVICES]
Schema Registry
Apache NiFi Key Features
• Guaranteed delivery
• Data buffering
 - Backpressure
 - Pressure release
• Prioritized queuing
• Flow specific QoS
 - Latency vs. throughput
 - Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
Connecting Data Between Ecosystems Without Coding: 180+ Processors
[Diagram: sample of processor capabilities. Protocols and formats: FTP, SFTP, HTTP, UDP, WebSocket, Email, Syslog, AMQP, HL7, XML, JSON, HTML, Image. Actions: Hash, Encrypt, GeoEnrich, Merge, Tail, Scan, Extract, Evaluate, Replace, Duplicate, Execute, Translate, Split, Fetch, Convert, Route Text, Route Content, Route Context, Distribute Load, Generate Table Fetch, Jolt Transform JSON, Control Rate, Prioritized Delivery.]
All Apache project logos are trademarks of the ASF and the respective projects.
A word on Kafka
• High throughput publish-subscribe Distributed Messaging System:
 – Kafka maintains feeds of messages in categories called topics.
 – Processes that publish messages to a Kafka topic are producers.
 – Processes that subscribe to topics and process the feed of published messages are consumers.
 – Kafka is run as a cluster comprised of one or more servers each of which is called a broker.
[Diagram: producers → Kafka Cluster → consumers]
• For each topic, Kafka maintains partitions across the Kafka cluster:
 – Ordered, immutable sequence of messages.
 – Messages are retained for a configurable period of time.
 – Partitioning can be round-robin, or key based.
 – Topics can be replicated for fault tolerance purposes.
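Round-robin versus key-based partitioning can be shown with an in-memory toy. This is not a Kafka client and makes no claim about Kafka's own hashing: the `Partitioner` and `Topic` classes are hypothetical, and `zlib.crc32` is used here only as a convenient stand-in hash. Each partition is modelled as an append-only list, so the position of a message in the list plays the role of its offset.

```python
import itertools
import zlib


class Partitioner:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def partition(self, key=None):
        if key is None:
            return next(self._round_robin)              # round-robin
        return zlib.crc32(key) % self.num_partitions    # key based (stand-in hash)


class Topic:
    def __init__(self, num_partitions):
        self.partitioner = Partitioner(num_partitions)
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, value, key=None):
        p = self.partitioner.partition(key)
        self.partitions[p].append(value)        # append-only, so order is kept
        return p, len(self.partitions[p]) - 1   # (partition, offset)


topic = Topic(num_partitions=3)
keyless = [topic.send(f"event-{i}")[0] for i in range(4)]
first = topic.send("login", key=b"user-42")
second = topic.send("logout", key=b"user-42")
```

Messages with the same key always land in the same partition, which is what preserves their relative order; keyless messages are spread evenly but carry no ordering guarantee across partitions.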
A word on Apache Storm
• Distributed Computation framework for real-time analytics:
 • Production Application log monitoring
 • Real-time fraud detection on credit card/banking transactions
• Storm Components:
 • Topology: Storm application made of:
  • Stream: Flow of tuples
  • Spouts: Generate streams
  • Bolt: Contain actual data processing, persistence, and alerting logic.
  • Tuple: Fundamental data unit.
• Storm Cluster Architecture:
 • Similar to a Hadoop cluster: Highly-parallel processing cluster to reliably process massive amounts of data.
 • Storm and Hadoop can share the same cluster, and can also run on YARN.
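The spout, bolt, stream and tuple vocabulary can be illustrated with Python generators. This is a single-process analogy, not the Storm API; the function names and sample sentences are hypothetical. The spout emits tuples, each bolt consumes one stream and emits another, and chaining them forms a small word-count topology.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream, emitting one tuple per sentence."""
    for sentence in ["storm processes streams", "storm runs on yarn"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps running counts and emits (word, count) after every tuple."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))

final_counts = {}
for word, count in topology:
    final_counts[word] = count
```

In a real cluster each spout and bolt would run as many parallel tasks spread over worker nodes; the generator chain only shows how tuples flow from one stage to the next.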
Q&A
Next Steps: Try Hortonworks Data Platform
• Download HDP Sandbox: http://hortonworks.com/products/sandbox/
• Deploy the HDP Sandbox on Azure: http://hortonworks.com/hadoop-tutorial/deploying-hortonworks-sandbox-on-microsoft-azure/
• Learning the ropes: http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox
Thank You