Apache Hadoop Today & Tomorrow
Eric Baldeschwieler, CEO Hortonworks, Inc. twitter: @jeric14 (@hortonworks) www.hortonworks.com
© Hortonworks, Inc. All Rights Reserved.
Agenda
Brief Overview of Apache Hadoop
Where Apache Hadoop is Used
Apache Hadoop Core: Hadoop Distributed File System (HDFS) and Map/Reduce
Where Apache Hadoop Is Going
Q&A
What is Apache Hadoop?
A set of open source projects owned by the Apache Foundation that transforms commodity computers and networks into a distributed service
• HDFS – Stores petabytes of data reliably
• Map-Reduce – Allows huge distributed computations
Key Attributes
• Reliable and redundant – Doesn't slow down or lose data even as hardware fails
• Simple and flexible APIs – Our rocket scientists use it directly!
• Very powerful – Harnesses huge clusters, supports best-of-breed analytics
• Batch processing centric – Hence its great simplicity and speed; not a fit for all use cases
What is it used for?
Internet-scale data
• Web logs – years of logs at many TB/day
• Web search – all the web pages on earth
• Social data – all message traffic on Facebook
Cutting-edge analytics
• Machine learning, data mining…
Enterprise apps
• Network instrumentation, mobile logs
• Video and audio processing
• Text mining
And lots more!
Apache Hadoop Projects
Programming languages: Pig (data flow), Hive (SQL)
Computation: MapReduce (distributed programming framework)
Table storage: HCatalog (metadata), HBase (columnar storage)
Object storage: HDFS (Hadoop Distributed File System)
Management: HMS; Coordination: ZooKeeper
Core Apache Hadoop Related Apache Projects
Where Hadoop is Used
Everywhere!
(Figure: timeline of Hadoop adoption across companies, 2006–2010. Source: The Datagraph Blog)
HADOOP @ YAHOO!
• 40K+ servers
• 170 PB storage
• 5M+ monthly jobs
• 1,000+ active users
CASE STUDY YAHOO! HOMEPAGE
Personalized for each visitor. Result: twice the engagement
• Recommended links: +79% clicks vs. randomly selected
• News interests: +160% clicks vs. one size fits all
• Top searches: +43% clicks vs. editor selected
CASE STUDY YAHOO! HOMEPAGE
• Science Hadoop cluster – machine learning builds ever-better user categorization models from user behavior (weekly)
• Production Hadoop cluster – identifies user interests using the categorization models, producing serving maps (every 5 minutes)
• Serving systems – build customized home pages with the latest data (thousands of pages/second); engaged users generate the behavior that feeds back into the models
CASE STUDY YAHOO! MAIL
Enabling quick response in the spam arms race
• 450M mailboxes
• 5B+ deliveries/day
• Anti-spam models retrained every few hours on the science Hadoop cluster, then pushed to production
"40% less spam than Hotmail and 55% less spam than Gmail"
A Brief History
Apache Hadoop adoption:
• Yahoo! and other early adopters (2006 – present): scale and productize Hadoop
• Other Internet companies (2008 – present): add tools/frameworks, enhance Hadoop
• Service providers (2010 – present): provide training, support, hosting
• Wide enterprise adoption (nascent, 2011): funds further development and enhancements
Traditional Enterprise Architecture Data Silos + ETL
• Serving applications: web serving, NoSQL, RDBMS, …
• Traditional data warehouses, BI & analytics: ETL & message buses, EDW, data marts, BI/analytics
• Unstructured systems: serving logs, social media, sensor data, text systems, …
Hadoop Enterprise Architecture Connecting All of Your Big Data
• Serving applications: web serving, NoSQL, RDBMS, …
• Traditional data warehouses, BI & analytics: ETL & message buses, EDW, data marts, BI/analytics
• Apache Hadoop: EsTsL (s = store), custom analytics
• Unstructured systems: serving logs, social media, sensor data, text systems, …
Hadoop Enterprise Architecture Connecting All of Your Big Data
• Serving applications: web serving, NoSQL, RDBMS, …
• Traditional data warehouses, BI & analytics: ETL & message buses, EDW, data marts, BI/analytics
• Apache Hadoop: EsTsL (s = store), custom analytics
• 80–90% of data produced today is unstructured; Gartner predicts 800% data growth over the next 5 years
• Unstructured systems: serving logs, social media, sensor data, text systems, …
What is Driving Adoption?
Business drivers
• Identified high-value projects that require use of more data
• Belief that there is great ROI in mastering big data
Financial drivers
• Growing cost of data systems as a proportion of IT spend
• Cost advantage of commodity hardware + open source
• Enables departmental-level big data strategies
Technical drivers
• Existing solutions failing under growing requirements: 3Vs – volume, velocity, variety
• Proliferation of unstructured data
Big Data Platforms Cost per TB, Adoption
(Bubble chart: cost per TB vs. adoption for big data platforms; bubble size = cost effectiveness of solution)
Apache Hadoop Core
Overview
Frameworks share commodity hardware: storage (HDFS) and processing (MapReduce)
Typical cluster layout:
• Core network, with 2 × 10 GigE uplinks from each rack switch
• 20–40 nodes per rack, 1–2 GigE to each node
• 1–2U servers: 16 cores, 48 GB RAM, 6–12 × 2 TB disks
Map/Reduce
Map/Reduce is a distributed computing programming model. It works like a Unix pipeline:
    cat input | grep | sort | uniq -c > output
    Input | Map | Shuffle & Sort | Reduce | Output
Strengths:
• Easy to use! The developer just writes a couple of functions
• Moves compute to data: schedules work on an HDFS node holding the data if possible
• Scans through data, reducing seeks
• Automatic reliability and re-execution on failure
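The pipeline analogy above can be made concrete with the classic word-count job. This is a local sketch in the Hadoop Streaming style: the mapper and reducer are plain functions over text streams, and `sorted()` stands in for the framework's shuffle & sort phase (under real Hadoop Streaming each function would read stdin and write stdout).

```python
def mapper(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    """Reduce: sum the counts for each key. Input must be sorted by key,
    which is exactly what Hadoop's shuffle & sort phase guarantees."""
    current, total = None, 0
    for key, value in pairs:
        if key != current:
            if current is not None:
                yield (current, total)
            current, total = key, 0
        total += value
    if current is not None:
        yield (current, total)

def run_job(lines):
    """Simulate Input | Map | Shuffle & Sort | Reduce | Output locally."""
    return dict(reducer(sorted(mapper(lines))))

counts = run_job(["the quick brown fox", "the lazy dog", "the fox"])
# counts["the"] == 3, counts["fox"] == 2
```

On a real cluster the framework splits the input across HDFS blocks, runs many mapper and reducer tasks in parallel, and re-executes any task whose node fails.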
HDFS: Scalable, Reliable, Manageable
Scale IO, storage, CPU
• Add commodity servers & JBODs
• 4K nodes in cluster, 80…
Fault tolerant & easy management
• Built-in redundancy; tolerates disk and node failures
• Automatically manages addition/removal of nodes
• One operator per 8K nodes!!
Storage servers used for computation
• Move computation to data
• Not a SAN, but high-bandwidth network access to data via Ethernet
Immutable file system
• Read, write, sync/flush; no random writes
HDFS Use Cases
• Petabytes of unstructured data for parallel, distributed analytics processing using commodity hardware
• Solves problems that cannot be solved with traditional systems, at a lower price
• Large storage capacity (>100 PB raw)
• Large IO/computation bandwidth (>4K servers): >4 terabits of bandwidth to disk (conservatively)
• Scales by adding commodity hardware
• Cost per GB ≈ $1.50, including the MapReduce cluster
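The ">4 terabit" figure is just aggregate disk bandwidth summed over the cluster. A rough sanity check, where the per-disk sequential rate of ~50 MB/s is an assumption not stated on the slide:

```python
# Rough sanity check of the slide's aggregate-bandwidth claim.
# Assumption (not from the slide): each disk sustains ~50 MB/s sequential.

servers = 4000            # ">4K servers" from the slide
disks_per_server = 6      # low end of the 6-12 disk configuration
mb_per_sec_per_disk = 50  # assumed conservative sequential throughput

total_bits_per_sec = servers * disks_per_server * mb_per_sec_per_disk * 1e6 * 8
terabits = total_bits_per_sec / 1e12
# 4000 * 6 * 50 MB/s * 8 bits = 9.6 Tbit/s, comfortably above the
# "> 4 terabit" figure even with these conservative assumptions.
```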
HDFS Architecture
Namenode
• Persistent namespace metadata & journal (on NFS)
• In-memory hierarchical namespace state: file name → block IDs
• In-memory block map: block ID → block locations
Datanodes
• Send heartbeats & block reports to the namenode
• Store block replicas on JBOD disks: block ID → data
• Horizontally scale IO and storage
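The two mappings the architecture diagram shows (namespace: file name → block IDs; block map: block ID → locations) can be sketched as a toy model. This is illustrative only; real HDFS adds permissions, leases, generation stamps, and much more.

```python
# Simplified model of the namenode's two in-memory structures.

namespace = {}   # hierarchical namespace: file path -> ordered block IDs
block_map = {}   # block ID -> set of datanodes holding a replica

def create_file(path, block_ids):
    """Record a new file and its blocks in the namespace."""
    namespace[path] = list(block_ids)
    for b in block_ids:
        block_map.setdefault(b, set())

def block_report(datanode, block_ids):
    """Datanodes periodically report which block replicas they store."""
    for b in block_ids:
        block_map.setdefault(b, set()).add(datanode)

def locate(path):
    """Resolve a file to (block ID, replica locations) pairs - the
    information a client needs to read directly from datanodes."""
    return [(b, sorted(block_map[b])) for b in namespace[path]]

create_file("/logs/2011-10-01", ["b1", "b2"])
block_report("dn1", ["b1", "b2"])
block_report("dn2", ["b1"])
# locate("/logs/2011-10-01") -> [("b1", ["dn1", "dn2"]), ("b2", ["dn1"])]
```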
Client Read & Write Directly from Closest Server
Read path: (1) the client sends open to the namenode, which returns block locations from the namespace state; (2) the client reads blocks directly from the closest datanode, with end-to-end checksums.
Write path: (1) the client sends create to the namenode; (2) the client writes blocks directly to datanodes.
• Horizontally scale IO and storage
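The "end-to-end checksum" on the read path means the writer stores a checksum alongside the data and the reader verifies it after fetching a block, so corruption anywhere in between is caught. A minimal sketch using one CRC per block (real HDFS checksums every 512-byte chunk; one CRC keeps the sketch short):

```python
import zlib

def write_block(store, block_id, data):
    """Writer stores the data together with its checksum."""
    store[block_id] = (data, zlib.crc32(data))

def read_block(store, block_id):
    """Reader re-computes the checksum and verifies it end to end."""
    data, expected = store[block_id]
    if zlib.crc32(data) != expected:
        # A real client would retry from another replica and report
        # the corrupt block to the namenode.
        raise IOError("checksum mismatch on block %s" % block_id)
    return data

datanode = {}
write_block(datanode, "b1", b"hello hdfs")
assert read_block(datanode, "b1") == b"hello hdfs"

# Simulate on-disk corruption: the reader now detects it on the next read.
data, crc = datanode["b1"]
datanode["b1"] = (b"hellX hdfs", crc)
```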
Actively maintain data reliability
1. Datanodes periodically check block checksums; a bad or lost block replica is detected
2. The namenode directs a datanode with a good replica to copy the block to another datanode
3. The receiving datanode reports blockReceived to the namenode, updating the block map
HBase
• Hadoop ecosystem database, based on Google BigTable
• Goal: hosting of very large tables (billions of rows × millions of columns) on commodity hardware
• A multidimensional sorted map: Table → Row → Column → Version → Value
• Distributed, column-oriented store
• Scales: sharding etc. done automatically
• No SQL; a CRUD-style API (get, put, delete, scan)
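The "multidimensional sorted map" model above can be illustrated with nested maps. This models the semantics only (sorted row keys, versioned cells); it is not the HBase client API, and the names here are made up for the sketch.

```python
# Toy illustration of HBase's data model:
# row -> column -> version (timestamp) -> value.

class ToyTable:
    def __init__(self):
        self.rows = {}  # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, value, ts):
        """Write a new version of a cell at timestamp ts."""
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column):
        """Return the latest version of a cell (HBase's default read)."""
        versions = self.rows[row][column]
        return versions[max(versions)]

    def scan(self, start, stop):
        """Rows are kept sorted by key, so range scans are cheap."""
        return [r for r in sorted(self.rows) if start <= r < stop]

t = ToyTable()
t.put("row1", "cf:name", "alice", ts=1)
t.put("row1", "cf:name", "bob", ts=2)   # newer version shadows the old one
t.put("row2", "cf:name", "carol", ts=1)
# t.get("row1", "cf:name") -> "bob"
# t.scan("row1", "row3")   -> ["row1", "row2"]
```

Sorting by row key is what lets HBase shard a table into contiguous key ranges (regions) automatically.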
What’s Next
About Hortonworks – Basics
Founded – July 1st, 2011, with 22 architects & committers from Yahoo!
Mission – Architect the future of Big Data: revolutionize and commoditize the storage and processing of Big Data via open source
Vision – Half of the world's data will be stored in Hadoop within five years
Game Plan
• Support the growth of a huge Apache Hadoop ecosystem
• Invest in ease of use, management, and other enterprise features
• Define APIs for ISVs, OEMs, and others to integrate with Apache Hadoop
• Continue to invest in advancing the Hadoop core; remain the experts
• Contribute all of our work to Apache
• Profit by providing training & support to the Hadoop community
Lines of Code Contributed to Apache Hadoop
Apache Hadoop Roadmap
Phase 1 – Making Apache Hadoop Accessible 2011 • Release the most stable version of Hadoop ever (Hadoop 0.20.205) • Frequent sustaining releases • Release directly usable code via Apache (RPMs, .debs…) • Improve project integration (HBase support)
Phase 2 – Next-Generation Apache Hadoop 2012 • Address key product gaps (HA, Management…) (Alphas in Q4 2011) • Enable partner innovation via open APIs • Enable community innovation via modular architecture
Next-Generation Hadoop
Core
• HDFS federation – scale-out and innovation via new APIs
• Will run on 6,000-node clusters with 24 TB of disk per node = 144 PB in the next release
• Next-gen MapReduce – support for MPI and many other programming models
• HA (no SPOF) and wire compatibility
Data – HCatalog 0.3
• Pig, Hive, MapReduce, and Streaming as clients
• HDFS and HBase as storage systems
• Performance and storage improvements
Management & ease of use
• Ambari – an Apache Hadoop management & monitoring system
• Stack installation and centralized config management
• REST and GUI for user & administrator tasks
Thank You!
Questions
Twitter: @jeric14 (@hortonworks) www.hortonworks.com