Introduction to Big Data (Hadoop) Eco-System
The Modern Data Platform for Innovation and Business Transformation
Roger Ding, Cloudera
February 3rd, 2018
© Cloudera, Inc. All rights reserved.

Agenda
• Hadoop History
• Introduction to Apache Hadoop Eco-System
• Transition from Legacy Data Platform to Hadoop
• Resources, Q & A

Legacy RDBMS Quick Check
• Centralized Storage
• Centralized Computing
  • Send data to compute
  • Bottleneck
    • Network bandwidth
    • Slow disk I/O
• Time to Data
  • Structured data
  • Up-front modeling
  • Schema-on-write
  • Transforms lose data
  • No agility
• High Cost
  • High-end processing and storage
  • Hard to plan
• Scale-Up
  • Add more memory, upgrade CPU, replace server every several years

Google 1999: Indexing the Web

The Original Inspirations for Hadoop
• 2003, 2004

The Beginning: Building Hadoop
• 2006: Core Hadoop (HDFS, MapReduce)

Hadoop Eco-System Primer
• Hadoop consists of 3 core components
  • HDFS (Hadoop Distributed File System): self-healing, distributed storage framework
  • MapReduce: distributed computing framework
  • YARN (Yet Another Resource Negotiator): distributed resource management framework
• Many other projects are based around core Hadoop
  • Referred to as the "Hadoop Ecosystem" projects
  • Spark, Pig, Hive, Impala, HBase, Flume, Sqoop, etc.
• A set of machines running Hadoop software is known as a Hadoop cluster
  • Individual machines are known as "nodes"

HDFS: Economically Feasible to Store More Data
Self-healing, high bandwidth clustered storage.
• Affordable & Attainable: $300-$1,000 per TB
• HDFS breaks incoming files into blocks and stores them redundantly across the cluster.
[Figure: a file split into numbered blocks (1-5), with copies of each block spread across the cluster nodes]

MapReduce: Power to Predictably Process Large Data
Distributed computing framework.
• Processes large jobs in parallel across many nodes and combines the results.
[Figure: the same numbered blocks (1-5) on the cluster nodes, processed in place by MapReduce]

A Decade of Hadoop – A Platform That Won't Stop Growing
[Figure: timeline from 2006 to 2016 showing the stack growing from Core Hadoop (HDFS, MapReduce) to include Solr, Pig, ZooKeeper, HBase, Mahout, Hive, Avro, Sqoop, Hue, HCatalog, Oozie, Bigtop, Flume, Drill, Kafka, Impala, Tez, Spark, YARN, Sentry, Parquet, Flink, Knox, Falcon, Ibis, RecordService, and Kudu]

Some Hadoop Eco-System Projects
• Data Storage
  • HDFS, HBase, Kudu
• Computing Framework
  • MapReduce, Spark, Flink
• Data Ingestion
  • Sqoop, Flume, Kafka
• Data Serialization in HDFS
  • Avro, Parquet
• Analytics
  • Pig, Hive, Impala
• Orchestration
  • ZooKeeper
• Workflow, Coordination
  • Oozie
• Security (Authorization)
  • Sentry
• Search
  • Solr
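
The MapReduce slide describes processing a large job in parallel across many nodes and then combining the results. A minimal sketch of that idea in plain Python follows. It is not the Hadoop API: the sample `blocks`, the function names, and the use of threads as stand-ins for cluster nodes are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: each string stands in for one HDFS block of a larger file.
blocks = [
    "hadoop stores data in blocks",
    "spark and hadoop process data",
    "blocks are replicated across nodes",
]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(input_blocks):
    # Threads stand in for cluster nodes; each one maps a block independently.
    with ThreadPoolExecutor(max_workers=3) as pool:
        mapped = list(pool.map(map_phase, input_blocks))
    return reduce_phase(shuffle(mapped))


counts = word_count(blocks)
print(counts["hadoop"], counts["nodes"])
```

The map step never needs to see more than one block, which is what lets a real cluster run it on the node that already stores that block. Only the much smaller grouped output has to move between nodes.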
Hadoop Eco-System – Storage Engine
• HDFS (2006): Large files, block storage
• HBase (2008): Key-value store
• Kudu (2016): Stores structured data

Hadoop Eco-System – Computing Framework
• Spark (2012)
  • Originated at UC Berkeley AMPLab
  • In-memory computing framework
  • Processes data in memory vs. the MapReduce two-stage paradigm
  • Can perform 10 to 100 times faster than MapReduce for certain applications
  • Flexible (Scala, Java, Python APIs) vs. MapReduce (Java)
  • Includes 4 components on top of Core Spark: Spark Streaming, GraphX, MLlib, Spark SQL

Hadoop Eco-System – Analytics
• Hive (2010)
  • Originated at Facebook
  • Compiles SQL queries to MapReduce or Spark jobs
  • Data warehouse tool in the Hadoop Eco-System
  • Good for ETL, batch, and long-running jobs
• Impala (2013)
  • Originated at Cloudera
  • MPP (Massively Parallel Processing) SQL engine
  • Much faster than Hive queries or Spark SQL; supports high concurrency, but has no fault tolerance
  • Good for short-running, BI-style ad-hoc queries
  • BI tools such as Tableau and MicroStrategy connect to Impala through ODBC/JDBC

Hadoop Data Processing Pattern
• Distributed Storage
• Distributed Computing
  • Send compute to data
• Scale-Out
  • Add more nodes
• Cost Effective
  • Commodity hardware
• Time to Data
  • No up-front modeling
  • Schema-on-read
  • 100% fidelity of original data
  • Data agility

Data Silos
[Figure: separate silos for Engineering, Marketing, Sales, HR, and Customer Service]
• Slow down your company
• Limit communication and collaboration
• Decrease the quality and credibility of data
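
The Hadoop Data Processing Pattern slide lists schema-on-read and 100% fidelity of original data. The sketch below illustrates that idea in plain Python, under the assumption that raw records are kept as JSON lines. The sample events, the two schemas, and the `read_with_schema` helper are hypothetical names invented for this example, not part of any Hadoop tool.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (nothing dropped).
raw_lines = [
    '{"user": "a1", "action": "click", "ms": "120"}',
    '{"user": "b2", "action": "view", "ms": "85", "referrer": "ad"}',
    '{"user": "c3", "action": "click"}',
]

# In this sketch a schema is just: field name -> (type converter, default).
latency_schema = {"user": (str, None), "ms": (int, 0)}
marketing_schema = {"user": (str, None), "referrer": (str, "direct")}


def read_with_schema(lines, schema):
    """Parse raw lines and project them onto a schema at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


# Two consumers read the same untouched data through different schemas.
latency_rows = list(read_with_schema(raw_lines, latency_schema))
marketing_rows = list(read_with_schema(raw_lines, marketing_schema))
print(latency_rows)
print(marketing_rows)
```

Because the raw lines are never rewritten, a field that nobody modeled up front (here `referrer`) is still available when a new question comes along. With schema-on-write, a transform would typically have discarded it at load time.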
Cloudera Enterprise Data Hub
Making Hadoop Fast, Easy, and Secure
A new kind of data platform:
• One place for unlimited data
• Unified, multi-framework data access
Cloudera makes it:
• Fast for business
• Easy to manage
• Secure without compromise
[Figure: platform stack]
• PROCESS, ANALYZE, SERVE
  • BATCH: Spark, Hive, Pig, MapReduce
  • STREAM: Spark
  • SQL: Impala
  • SEARCH: Solr
  • OTHER: Kite
• UNIFIED SERVICES
  • RESOURCE MANAGEMENT: YARN
  • SECURITY: Sentry, RecordService
• OPERATIONS and DATA MANAGEMENT: Cloudera Manager, Cloudera Navigator, Cloudera Director, Encrypt and KeyTrustee, Optimizer
• STORE
  • FILESYSTEM: HDFS
  • RELATIONAL: Kudu
  • NoSQL: HBase
  • OTHER: Object Store
• INTEGRATE
  • STRUCTURED: Sqoop
  • STREAMING: Kafka, Flume

Data Mgmt. Chain
[Figure: Data Sources → Data Ingest → Data Storage & Processing → Serving, Analytics & Machine Learning]
• Data Sources
  • Connected Things / Data Sources
  • Structured Data Sources
• Data Ingest
  • Apache Flume: stream ingestion
  • Apache Kafka: stream ingestion
  • Apache Sqoop: ingestion of data from relational sources
• Data Storage & Processing
  • Apache Hive: batch processing, ETL
  • Apache Spark: batch, stream & iterative processing, ML
  • Apache Hadoop: storage (HDFS) & deep batch processing
  • Apache HBase: NoSQL data store for real-time applications
  • Apache Kudu: storage & serving for fast-changing data
• Serving, Analytics & Machine Learning
  • Apache Impala: MPP SQL for fast analytics
  • Cloudera Search: real-time search
• ENTERPRISE DATA HUB: Security, Scalability & Easy Management
• Deployment Flexibility: Datacenter, Cloud

The best-in-class organizations use Cloudera
• Over 150 health & life science organizations use enterprise-class Cloudera software.
• #1 largest payer in the US will be covering 123 million lives and pay out $950B to providers in 2015.
• #1 largest biotech in the world.
• #1 hospital chain worldwide.
• #1 commercial, largest global genomic repository.
• Largest health data company, with 500M+ anonymous patient records.
• This hospital was one of the first four to receive Stage 7 status from HIMSS, the highest possible distinction in electronic medical records implementation. It uses Cloudera to host a variety of data and was awarded a Gold Medal of Honor by US DHHS.
• #1 largest health IT company in the world, with $3B+ in revenue, has 1000's of nodes of Cloudera.
• 7 out of the top 10 cancer drugs by 2020 are being made by Cloudera customers.
• #1 most utilized Patient Centered Medical Home program.

Broad Institute's industry-standard GATK pipeline: the new version is based on Apache Spark, and over 20,000 global users may migrate to Spark
Thanks to the contributions of Cloudera engineers, GATK4 now uses Apache Spark for both traditional local multithreading and for parallelization on Spark-capable compute infrastructure and services, such as Google Dataproc.
"It has been a privilege collaborating with the Broad Institute over the last two years to ensure that GATK4 can use the power of Apache Spark to make genomics workflows more scalable than previous approaches," said Tom White, principal data scientist at Cloudera.

Seattle Children's Research Institute
• 200+ PIs at Seattle Children's Research Institute
• 9 research centers, including cancer, brain, birth, and infectious disease
• There was no integrated data platform across the 9 centers
• Evaluated multiple packaged applications, all costing multiple millions of dollars
• Selected Cloudera as the platform and created their own web user interface
Benefit
Today, a single lab at SCRI can evaluate and diagnose a single patient per week after receiving the whole exome and clinical record. After implementation, the lab could diagnose 4-5 patients per week.

Start Your Big Data Journey
• Download Cloudera QuickStart Virtual Machine today
• Practice!
• Practice!!
• Practice!!!
Meetups
• AI + Big Data Healthcare
  • https://www.meetup.com/AI-and-Big-Data-Healthcare-Meetup/
  • 1600+ members
• Washington DC Area Apache Spark Interactive
  • http://www.meetup.com/Washington-DC-Area-Spark-Interactive/
  • 2,700+ members

Thank you!
[email protected]
