The Evolving Apache Hadoop Eco-System – What it means for Big Analytics and Storage

Sanjay Radia, Architect/Founder, Hortonworks Inc.

Outline

• Hadoop and Big Data Analytics
  – Data platform drivers and how Hadoop changes the game
  – Data Refinery
  – Hadoop in enterprise data architecture
• Hadoop (HDFS) and Storage
  – RAID disk or not?
  – HDFS's generic storage layer
  – Archival
  – Upcoming

Next Generation Data Platform Drivers

• Business Drivers
  – Need for new business models & faster growth (20%+)
  – Find insights for competitive advantage & optimal returns
• Technical Drivers
  – Data continues to grow exponentially
  – Data is increasingly everywhere and in many formats
  – Legacy solutions unfit for new requirements and growth
• Financial Drivers
  – Cost of data systems, as a % of IT spend, continues to grow
  – Cost advantages of commodity hardware & open source software

Big Data Changes the Game

Diagram: Transactions + Interactions + Observations = BIG DATA. Data variety and volume grow from megabytes (ERP) through gigabytes (CRM) and terabytes (WEB) to petabytes (BIG DATA), spanning purchase detail, purchase and payment records, customer touches, segmentation, offer details and history, support contacts, web logs, user click stream, A/B testing, behavioral targeting, dynamic pricing, search marketing, affiliate networks, dynamic funnels, sentiment, SMS/MMS, speech-to-text, social interactions & feeds, spatial & GPS coordinates, sensors/RFID/devices, user-generated content, external demographics, business data feeds, HD video, audio, images, and product/service logs.

Increasing data variety and complexity.

How Hadoop Changes the Game

• Cost
  – Commodity servers, JBOD disks
  – Horizontal scaling – add new servers as needed
  – Ride the technology curve
• Scale from small to very large
  – Size of data or size of computation
• The Data Refinery
  – Traditional datamarts optimize for time and space
  – No need to pre-select a subset of the data or a "schema"
  – Can experiment to gain insights into your whole data set
  – Your insights can change over time
  – No need to throw your old data away
• Open and growing eco-system
  – You aren't locked into a vendor
  – Ever-growing choice of tools

Vertical-Specific Uses of Hadoop

Vertical use cases – Analytical vs. Operational:
• Retail & Web – Analytical: Social Network Analysis, Log Analysis/Site Optimization; Operational: Session & Content Optimization, Dynamic Pricing
• Retail – Analytical: Brand and Sentiment Analysis, Loyalty Program Optimization; Operational: Dynamic Pricing/Targeted Offers
• Intelligence – Analytical: Person of Interest Discovery, Threat Identification; Operational: Cross-Jurisdiction Queries
• Finance – Analytical: Risk Modeling & Fraud Identification, Trade Performance Analytics; Operational: Surveillance and Fraud Detection, Customer Risk Analysis
• Energy – Analytical: Smart Grid Production Optimization, Grid Failure Prevention; Operational: Smart Meters, Individual Power Grid
• Manufacturing – Analytical: Customer Churn Analysis, Supply Chain Optimization; Operational: Dynamic Delivery, Replacement Parts
• Healthcare & Payer – Analytical: Electronic Medical Records (EMPI), Clinical Trials Analysis; Operational: Insurance Premium Determination

Transactions + Interactions + Observations

Diagram: business transactions & interactions (Web, Mobile, CRM, ERP, SCM, …) and observations (web logs & clicks; social graph & feeds; sensors, devices, RFID; spatial & GPS; events; audio, video, images; docs, text, XML) flow into the Big Data Refinery, which stores, aggregates, and transforms the multi-structured data to unlock value, shares refined data and runtime models with classic ETL processing, retains runtime models and historical data for ongoing refinement & analysis, and retains historical data to unlock additional value; the results feed Business Intelligence & Analytics (dashboards, reports, visualization, …).

Next-Generation Big Data Architecture

Diagram: the same business transactions & interactions and observations (audio, video, images; docs, text, XML; web logs & clicks; social graph & feeds; sensors, devices, RFID; spatial & GPS; events; …) feed the Big Data Refinery, which sits alongside SQL, NoSQL, and NewSQL stores and the EDW/MPP systems, all of which feed Business Intelligence & Analytics (dashboards, reports, visualization, …).

Hadoop in Enterprise Data Architecture

Diagram: data sources (transactions, observations, interactions) – logs, files, CRM, ERP, financials, social media, and exhaust data – flow into the Hadoop platform (HDFS, MapReduce, Pig, Hive, HBase, HCatalog, Templeton, WebHDFS, FlumeNG, Oozie, Ambari, ZooKeeper, HA), which connects to the existing business infrastructure – ODS & datamarts, EDW, applications, Business Intelligence, low-latency/NoSQL stores, discovery tools, custom and existing web applications, operations – and to newer tools such as Datameer, Tableau, Karmasphere, Splunk, spreadsheets, and IDEs & dev tools.

Metadata Services

Apache HCatalog provides flexible metadata services across tools and external access:
• Consistency of metadata and data models across tools (MapReduce, Pig, HBase, and Hive)
• Accessibility: share data as tables in and out of HDFS
• Availability: enables flexible, thin-client access via a REST API

HCatalog provides shared table and schema management: instead of raw Hadoop data with inconsistent, unknown, tool-specific access, it offers table access, aligned metadata, and a REST API that opens the platform (a sketch follows below).
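As a minimal sketch of that thin-client REST access: the snippet below lists the tables HCatalog knows about in a database through its REST interface (Templeton/WebHCat). The gateway host, the 50111 port, the database name, and the user are illustrative assumptions, not values from this deck.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HCatalogRestSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical gateway host and user; 50111 is the usual Templeton/WebHCat port.
        URL url = new URL("http://hcat-gateway.example.com:50111"
                + "/templeton/v1/ddl/database/default/table?user.name=analyst");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The response is a JSON document listing the tables registered in HCatalog.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}
```

Any tool that honors the same metadata – Pig, Hive, or a MapReduce job – sees the same table definitions, which is the consistency point above.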

Outline

• Hadoop and Big Data Analytics
  – Data platform drivers and how Hadoop changes the game
  – Data Refinery
  – Hadoop in enterprise data architecture
• Hadoop (HDFS) and Storage
  – RAID disk or not?
  – HDFS's generic storage layer
  – Archival
  – Upcoming

HDFS: Scalable, Reliable, Manageable

Scale I/O, storage, and CPU:
• Add commodity servers & JBODs
• 6K nodes in a cluster, 120 PB
• Storage servers are also used for computation – move the computation to the data
• Not a SAN, but high-bandwidth network access to the data via Ethernet

Fault tolerant & easy to manage:
• Built-in redundancy; tolerates disk and node failures
• Automatically manages the addition/removal of nodes
• One operator per 3K nodes!

File system semantics:
• Scalable
• Read, write, rename, append (see the sketch after this list)
• No random writes

Simplicity of design is why a small team could build such a large system in the first place.
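A minimal sketch of those semantics using the standard Java FileSystem API – write sequentially, close, then rename into place; the NameNode address and paths are illustrative assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteOnceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);

        Path tmp = new Path("/data/events/_incoming.tmp");
        Path dst = new Path("/data/events/part-00000");

        // Files are written sequentially and then closed; there is no random update in place.
        try (FSDataOutputStream out = fs.create(tmp, true)) {
            out.writeBytes("event-1\n");
            out.writeBytes("event-2\n");
        }

        // Rename is the usual way to publish a finished file atomically.
        fs.rename(tmp, dst);
        fs.close();
    }
}
```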

HDFS Architecture: Two Key Layers

• Namespace layer
  – Consists of directories, files, and blocks
  – Operations: create, delete, modify, and list directories/files
  – Multiple independent namespaces, which can be mounted using client-side mount tables (see the sketch below)
• Block Storage layer – a generic block service
  – DataNode cluster membership
  – Block Pool: the set of blocks belonging to a Namespace Volume; storage is shared across all volumes – no partitioning
  – Block operations: create/delete/modify/getBlockLocation, plus read and write access to blocks
  – Detects and tolerates faults: monitors node/disk health, periodic block verification, replica placement, and recovery from failures
• Namespace Volume = Namespace + Block Pool
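A hedged sketch of the client-side mount tables mentioned above (ViewFs): a client stitches several namespace volumes into one view purely through configuration. The cluster name, NameNode addresses, and mount points below are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Present a single federated view to the client.
        conf.set("fs.defaultFS", "viewfs://clusterX/");
        // Hypothetical mount points, each backed by a different NameNode (namespace volume).
        conf.set("fs.viewfs.mounttable.clusterX.link./user",
                 "hdfs://nn1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.clusterX.link./data",
                 "hdfs://nn2.example.com:8020/data");

        FileSystem viewFs = FileSystem.get(conf);
        // Paths are resolved through the mount table to the namespace that owns them.
        for (FileStatus status : viewFs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath());
        }
    }
}
```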

File Systems Background: Leading to GFS and HDFS

• Separation of metadata from data – 1978, 1980
  – "Separating Data from Function in a Distributed File System" (1978) by J. E. Israel, J. G. Mitchell, H. E. Sturgis
  – "A Universal File Server" (1980) by A. D. Birrell, R. M. Needham
• Horizontal scaling of storage nodes and I/O bandwidth (1999–2003)
  – Several startups building scalable NFS – late 1990s
  – Lustre (1999–2002)
  – Google FS (2003)
  – (Hadoop's HDFS, pNFS)
• Commodity HW with JBODs, replication, non-POSIX semantics
  – Google FS (2003) – not using RAID was a very important decision
• Computation close to the data
  – Parallel DBs
  – Google FS / MapReduce

Significance of Not Using Disk RAID

• The key new idea was to not use RAID on the local disks and instead replicate the data
  – Uniformly deals with device failures, media failures, node failures, etc.
  – Nodes and clusters continue to run, with slightly lower capacity
  – Nodes and disks can be fixed when convenient
• Recovery from failures happens in parallel
  – A RAID-5 disk takes over half a day to recover from a 1 TB failure
  – HDFS recovers 12 TB in minutes because recovery is spread across the cluster – faster for larger clusters (see the estimate below)
• Operational advantage: the system recovers automatically and very rapidly from node and disk failures
  – 1 operator for 4K servers at mature Hadoop installations
• Yes, there is a storage overhead
  – File-RAID is available (1.4–1.6x efficiency)
  – Practical to use for a portion of the data; otherwise node recovery takes long
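To make the parallel-recovery claim concrete, here is a back-of-the-envelope estimate; the node count and per-node re-replication bandwidth are illustrative assumptions, not figures from this deck:

```latex
% Illustrative: re-replicating the 12 TB of a failed node in parallel across
% ~1000 surviving nodes, each re-writing at roughly 50 MB/s.
t_{\text{recover}} \approx \frac{12 \times 10^{12}\ \text{bytes}}
                                {1000 \times 50 \times 10^{6}\ \text{bytes/s}}
                   = 240\ \text{s} \approx 4\ \text{minutes}
```

A RAID-5 rebuild, by contrast, is bounded by the write bandwidth of the single replacement disk, which is why it takes over half a day for a 1 TB drive.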

HDFS's Generic Storage Service – Opportunities for Innovation

• Federation – distributed (partitioned) namespace
  – Simple and robust due to independent masters
  – Scalability, isolation, availability
• New services – independent block pools allow alternate NameNode implementations
  – A new FS that keeps only a partial namespace in memory
  – MR tmp storage and HBase running directly on block storage
  – A shadow file system that caches HDFS, NFS, S3
• Future: move block management into the DataNodes
  – Simplifies namespace/application implementations
  – A distributed namenode becomes significantly simpler

Archival Data – Where Should It Sit?

• Hadoop encourages keeping old data for future analysis
  – Should it sit in a separate cluster with lower computing power?
  – Spindles are one of the bottlenecks in a Big Data system – can't waste precious spindles on another cluster where they are underutilized
  – Better to keep archival data in the main cluster where the spindles can be actively used
  – As data grows over time, a Big Data cluster may have a smaller percentage of hot data
  – Should we arrange data on the disk platter based on access patterns?
• Challenge – growing data
  – Do tapes play a role?

HDFS in Hadoop 1 and Hadoop 2

Hadoop 1 (GA):
• Security
• Append/fsync (HBase)
• WebHDFS + SPNEGO (see the REST sketch below)
• Write improvements; local write optimization
• Performance improvements
• Disk-fail-in-place
• HA NameNode
• Full Stack HA

Hadoop 2 (alpha):
• New append
• Federation
• Wire compatibility
• Edit logs rewrite
• Faster startup
• HA NameNode (hot)
• Full Stack HA – in progress
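A minimal sketch of the WebHDFS REST interface listed under Hadoop 1: reading a file over plain HTTP. The NameNode host, the default HTTP port 50070, the path, and the user are illustrative assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode; 50070 is the usual NameNode HTTP port in Hadoop 1.
        URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/data/events/part-00000?op=OPEN&user.name=analyst");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The NameNode redirects the OPEN call to a DataNode that holds the data;
        // HttpURLConnection follows that redirect automatically for GET requests.
        conn.setRequestMethod("GET");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}
```

On a secured cluster the same endpoints authenticate via SPNEGO/Kerberos, which is what the "WebHDFS + SPNEGO" item refers to.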

Hadoop Full Stack HA Architecture

Diagram: jobs run on the slave servers of the Hadoop cluster; the master daemons (NameNode and JobTracker) run on an N+K HA cluster of servers and fail over to a standby server on failure, with the JobTracker going into safemode during failover while applications running outside the cluster continue.

Upcoming

• Continued improvements
  – Performance
  – Monitoring and management
  – Rolling upgrades
  – Disaster recovery
• Snapshots (prototype already published)
• Support for heterogeneous storage on DataNodes
  – Will allow flash disks to be used for specific use cases
• Block grouping – allow a large number of smaller blocks
• Working set of the namespace in memory – a large number of files
• Other protocols – NFS
• …

Which Apache Hadoop Distro?

• Stable
• Reliable
• Well supported
• But without being locked into a vendor

Hortonworks Data Platform

• Simplify deployment to get started quickly and easily

• Monitor, manage any size cluster with familiar console and tools

• Only platform to include data integration services to interact with any data source

• Metadata services open the platform for integration with existing applications

• Dependable high availability

Hortonworks Data Platform delivers enterprise-grade functionality on a proven Apache Hadoop distribution to ease management, simplify use, and ease integration into the enterprise architecture.

The only 100% open source data platform for Apache Hadoop

Hortonworks Data Platform

Hadoop 1.0.x = stable “kernel”

Integrates and tests all necessary Apache component projects

• Most stable and compatible versions of all components are chosen: Hadoop (1.0.3), HCatalog (0.4.0), Pig (0.9.2), Hive (0.9.0+), HBase (0.92.1+), Sqoop, Oozie (3.1.3), ZooKeeper (3.3.4), Ambari (beta), and Talend (5.1.1)

• Closely aligned with each Apache component project's code line, with a closedown process that provides the necessary buffer

• QE/beta process that produced every single stable Apache Hadoop release
• Apache Hadoop 2 is being QE'ed and alpha/beta tested through the same process
• Tested, hardened & proven – the Hortonworks Distribution reduces risk


Summary

• Hadoop changes the Big Data game in fundamental ways
  – Cost, storage, and compute capacity
  – Horizontally scales from small to very, very large
  – Data Refinery – keep all your data, gain new business insights
  – Hooks to integrate with the existing enterprise data architecture
  – Open, growing eco-system – no lock-in

• Hortonworks and the HDP Distribution
  – The team that originally created Apache Hadoop at Yahoo
  – The team that is driving key developments in Apache Hadoop
  – QE'ed and alpha/beta tested every stable Hadoop release, including Hadoop 1 and Hadoop 2 (alpha)
  – HDP is fully open source, contributed to Apache – nothing held back – close/identical to the Apache releases

Thank You! Questions & Answers
