BRKCLD-2501.Pdf

Architecting and Delivering Big Data in Cloud
Abhi Singh, Technical Solutions Architect, Cisco

Agenda
• Business use case
• Ecosystem and technology
• Data Models and architectures
• Enterprise Grade
• Hadoop and OpenStack
• Summary, further learning, Q&A

Demonstration:
• Demo 1: Import and query structured data in Hadoop
• Demo 2: Import and analyze semi structured data in Hadoop
• Demo 3: Spark for analytics on object relationships
• Demo 4: Big Data as a service in cloud

From Data to information to actionable intelligence
[Diagram: DATA, LOGIC, OUTCOME layers on Bare Metal & Virtualized Infrastructure, Private & Public Cloud]

Big Data is …
Data sets too large/complex for traditional data processing applications
• Volume
• Velocity
• Variety
• Veracity
Big Data driven spending 2012-2016: $232 B (Gartner 2012)
Data quality problems cost U.S. businesses more than $600 billion a year (DW-institute 2002)

Big Data is an Enabler
Decrease cost
• Open source technology
• Integrate different data sources
Streamline business
• Landing zone for all data
• Data driven business decisions
• Improved outcomes of campaigns
Increase revenue
• Improve customer satisfaction
• Reduce customer churn
• Understand patterns
• Preventative maintenance

Big Data in Action - Waze
• Real time GPS Data
• Map Archives
• Social media activity
• Location based Ads
• Budget based Ad campaigns
• Community shared Gas prices

Ecosystem and Technology

Data lives in RDBMS, NoSQL, and Hadoop
[Diagram: Bare Metal & Virtualized Infrastructure, Private & Public Cloud]

Conventional – Databases and Data Warehouses
RDBMS – A relational database is a set of tables containing data fitted into predefined categories (schema/blueprint). Allows for ACID properties.
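As a small illustration of what "ACID properties" buys in practice, the sketch below uses Python's built-in sqlite3 module to run a money transfer as a single transaction. The table, the account names and the `transfer` helper are hypothetical and are not taken from the session; the point is only that a transfer which violates the schema's constraint is rolled back as a whole.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as one transaction (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failing debit below leaves a partial change to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; nothing was applied.
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "bob", "alice", 500)  # would overdraw bob, so it is rolled back
```

After the second call the credit to alice has been undone along with the rejected debit, which is the atomicity and consistency part of ACID.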
Operational Databases
• OLTP
• Data retrieval/update
• Relational
• Current

Data Warehouse
• OLAP
• Data analysis and decision making
• Relational and multi-dimensional
• Historic

[Diagram label: Scaling]

Distribution across cluster is challenging (CAP)
[Diagram: Consistency or Availability (latency), with a network Partition]
• Is that hotel room still available?
• What's the account $ balance?

NoSQL (Non-relational) databases
Attributes:
• Schema-less (?), cluster friendly (Sharding), open source
• Aggregates (Transaction boundary), Developer friendly
• Flavors (Aggregates: Key-Value, Document, Columnar; Relationship: Graph)
[Diagram label: Scaling]

Hadoop History – reinventing the Google wheel
From Google to Yahoo to open source (Apache Hadoop) to Monetized
• 2003-2004: Google published papers on GFS (Google File System) and MapReduce (Data Processing on Large Clusters)
• 2006: Apache Hadoop is born @ Yahoo. Hadoop name came from Doug Cutting (Hadoop's creator) son's plush toy elephant
• 2008: Cloudera founded. Doug is a Chief architect @ Cloudera. Intel (740M for 18%)
• 2009: MapR founded. CTO Srivas is from Google and worked on GFS, BigTable and MapReduce. Google (110M)
• 2011: Hortonworks founded. 24 engineers from Hadoop team @ Yahoo. HDP (Market Cap 850M)

Hadoop ecosystem
[Diagram: ecosystem components on top of Core Hadoop (MapReduce/YARN & HDFS), captioned as follows]
• Kafka: Distributed Messaging
• Highly reliable distributed coordination
• Workflow scheduler
• Realtime Computation
• Faster Computations
• HDFS <-> RDBMS
• Machine learning apps
• Streaming event data ingestion
• HiveQL: SQL-Like to MapReduce
• Data analytics
• NoSQL DB

NoSQL and Hadoop are now widely accepted

Data Models and architectures

New Data models have emerged (NoSQL)
[Diagram: four data models (Key-Value, Column based, Document, Graph) with sample records such as customer name, phone, email, address, orders and payment]

Core Hadoop: MapReduce/YARN & HDFS
MapReduce: Ability to take a dataset, divide it, and run it over parallel nodes.
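As a rough illustration of that definition, the sketch below counts words in plain Python using separate map, shuffle and reduce steps. It runs sequentially in one process and does not use any Hadoop API; the function names and the `n_chunks` parameter are hypothetical. On a real cluster each chunk would be handled by a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: summarize each group, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines, n_chunks=2):
    """Divide the input into chunks, map each chunk, then shuffle and reduce."""
    # Assumption for this sketch: chunks are processed one after another here;
    # a cluster would run map_phase on each chunk in parallel.
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data", "Big cluster data data"])
```

The list of (word, 1) pairs is the intermediate result, and the dictionary of totals is the summarized output.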
Input data is processed and transformed into an intermediate stage and then summarized into the final stage.
• MapReduce 2.0: Parallel processing for large data sets
• YARN - Resource Negotiator (ResourceMgr, NodeMgr, ApplicationMaster, Container): Job scheduling and cluster resource management
• HDFS - Distributed file system (NameNode, DataNode): High throughput access to application data

MapReduce 2.0 (YARN)
• Global Resource Manager (Job Tracking and resource management)
• Per-application (MapReduce or DAG) Application Master (job scheduling/monitoring): Negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

HDFS - Hadoop Distributed File System
For large data sets, runs on Clusters, Failure is assumed
hadoop.apache.org
• DFS in Java, runs on Linux
• write-once-read-many
• hierarchical file organization
• stores each file as a sequence of blocks
• block size and replication factor are configurable per file
• NameNode maintains the file system namespace
• communication protocols are layered on top of TCP/IP

Demo – Import and query structured data in Hadoop

Data is getting complex
Volume, Velocity, Variety, Veracity
Velocity:
• Macrobatch: >15 min
• Microbatch: 2-15 min
• Near Real-time Decision support: 2 sec – 2 min
• Near Real-time Event Processing: 100 millisec-2 min
• Real-time: <100 millisec
Variety:
• Sensor data
• Social Media activity
• Database
• Videos
• Audio
• Images
• Archives

Data serialization and De-serialization
Process of translating an object into a stream of bytes in order to store or transmit in memory, a database, or a file
[Diagram: Sensor data, Social Media activity, Music, Videos, Audio, Images and Archives serialized into a byte stream and stored in a Database, File or Memory]
Avro
• Schema based system
• uses JSON for defining data types and protocols, and serializes data in a compact binary format
Parquet
• Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language
Others
• Thrift (Facebook)
• Protocol Buffers (Google)
• JSON
• BSON

Demo – Import and analyze semi structured data in Hadoop

Big Data solutions need a lot more than just Hadoop
Lambda: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
Kappa: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

Apache Spark deserves special attention
• 10x to 100x faster than MapReduce
• DAG execution engine: cyclic data flow and in-memory computing
• Java, Scala, Python, R
• IBM: 300M + 3500 people
spark.apache.org

Demo - Spark for analytics on object relationships

When to use NoSQL vs Hadoop for Big Data
Both: Cluster friendly, scale out horizontally, Handle large data sets, Flexible Data formats, Developer friendly
NoSQL Databases
• Read/Write/Modify
• Interactive
• Real-time
Hadoop
• Batch processing
• Process large data sets
• Historic (Data Lakes)
Maintain two different copies of data?
[Diagram caption: random, real-time read/write access to HDFS]

Industry acceptance: NoSQL and Hadoop
Magic Quadrant for Operational Database Management Systems

Enterprise Grade

Simplified view of Hadoop cluster and add-ons

Open source is free ... if your time has no value
• Management
• Skills
• Operations
• Scale & Fault Tolerance
• Security
• Governance

Management

Scalability and Fault tolerance
In the world of cluster computing
• Hadoop was designed to be scalable and robust: built into HDFS and MapReduce
• Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

Security
By default Hadoop runs in non-secure mode
Authentication
• Kerberos principals for Hadoop Daemons and Users
• Hadoop Key Management Server
Data Encryption
• Data Encryption on RPC, Block data transfer, and HTTP
• Transparent Encryption in HDFS
Service Level Authorization
• ensure clients are authorized to access the Hadoop service
sentry.apache.org
• Role based authorization to data and metadata stored on a Hadoop cluster
• Started by Cloudera

Data governance in Hadoop, ready for compliance?
Audit data access, Track data Lineage, Data/Metadata lifecycle management

Hadoop and OpenStack

Hadoop can run on OpenStack
[Diagram: Big Data Software (Hadoop) on Operating System (Linux) on Infrastructure – Physical or Virtual]
Traditional Hadoop
• Designed for Bare Metal (Lack of agility)
• Underutilized resources
• Maintenance (Operations can become a bottleneck)
• Fixed Capacity
• Setting up POC takes too much time
Hadoop on OpenStack
• Agile, automated deployment (both Physical (Ironic) and Virtual) - Sahara
• Shared platform
• Repeatable operations with template based provisioning (Hadoop on demand)
• Scale up/down
• Quick POC

Big Data needs Infrastructure and it can be virtual
Virtual infrastructure benefits:
• Clone and repeat: lower operations costs
• On-demand Hadoop clusters (Short lived clusters)
• Physical infrastructure can be reused and shared
• Flexible cluster scaling: Grow/Shrink
• Pay per usage (in public cloud)
Hadoop was designed for physical servers:
• Understand the virtualization overhead
• Local on-server HDD storage (fast sequential reads): use Block device driver for Cinder, utilize full disk for Cinder volume and keep Cinder volume on same host as the VM
• Dedicated CPUs and homogeneous hardware: keep VM flavors consistent
• Dedicated network: isolate physical networks supporting Tenant networks
• Replication and redundant services (ZooKeeper, Journal mgr, NameNode) land on different data nodes for fault tolerance: Anti-Affinity
• For Spark (memory intensive): do NOT oversubscribe RAM

Summary
[Diagram: DATA, LOGIC, OUTCOME layers with Cisco Data & Analytics and Cisco Data Virtualization; data sources: XML, Packaged Apps, RDBMS, Excel Files, Data Warehouse]
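The anti-affinity guideline from the virtualization slide (redundant services such as ZooKeeper or NameNode should not share a host) can be sketched as a simple placement routine and check. This is an illustration only: it does not call any OpenStack or Hadoop API, and the function names, service names and host names are hypothetical.

```python
def place_with_anti_affinity(services, hosts):
    """Assign each replica of each service to a host.

    services: dict of service name -> number of replicas
    hosts: list of host names
    Returns a dict of (service, replica_index) -> host in which no two
    replicas of the same service share a host.
    """
    placement = {}
    for offset, (service, replicas) in enumerate(services.items()):
        if replicas > len(hosts):
            raise ValueError(
                f"{service}: {replicas} replicas cannot be spread over {len(hosts)} hosts"
            )
        for i in range(replicas):
            # Consecutive indices modulo len(hosts) are distinct while
            # replicas <= len(hosts); the offset spreads services around.
            placement[(service, i)] = hosts[(offset + i) % len(hosts)]
    return placement


def violates_anti_affinity(placement):
    """Return True if any service has two replicas on the same host."""
    seen = set()
    for (service, _), host in placement.items():
        if (service, host) in seen:
            return True
        seen.add((service, host))
    return False


placement = place_with_anti_affinity(
    {"zookeeper": 3, "namenode": 2}, ["host1", "host2", "host3"]
)
```

With three hosts, the three ZooKeeper replicas land on three different hosts and the two NameNode replicas on two different hosts, so losing one host never removes more than one replica of a service.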
