Architecting and Delivering Big Data in cloud
Abhi Singh, Technical Solutions Architect, Cisco

Agenda
• Business use case
• Ecosystem and technology
• Data Models and architectures
• Enterprise Grade
• Hadoop and Openstack
• Summary, further learning, Q&A

Demonstration:
• Demo 1: Import and query structured data in Hadoop
• Demo 2: Import and analyze semi-structured data in Hadoop
• Demo 3: Spark for analytics on object relationships
• Demo 4: Big Data as a service in cloud

From Data to information to actionable intelligence
[Diagram: OUTCOME / LOGIC / DATA layers on Bare Metal & Virtualized Infrastructure, Private & Public Cloud]

Big Data is …
Data sets too large/complex for traditional data processing applications
• Volume
• Velocity
• Variety
• Veracity
Big Data driven spending 2012-2016: $232 B (Gartner 2012)
Data quality problems cost U.S. businesses more than $600 billion a year (DW-institute 2002)

Big Data is an Enabler
Decrease cost
• Open source technology
• Integrate different data sources
Streamline business
• Landing zone for all data
• Data driven business decisions
• Improved outcomes of campaigns
Increase revenue
• Improve customer satisfaction
• Reduce customer churn
• Understand patterns
• Preventative maintenance

Big Data in Action - Waze
[Diagram labels: real time GPS data, location based ads, budget based ad campaigns, map archives, social media activity, community shared gas prices]

Ecosystem and Technology

Data lives in RDBMS, NoSQL, and Hadoop
[Diagram: data stores on Bare Metal & Virtualized Infrastructure, Private & Public Cloud]

Conventional – Databases and Data Warehouses
RDBMS – A relational database is a set of tables containing data fitted into predefined categories (schema/blueprint). It supports ACID transactions (atomicity, consistency, isolation, durability).

Operational Databases
• OLTP
• Data retrieval/update
• Relational
• Current

Data Warehouse
• OLAP
• Data analysis and decision making
• Relational and multi-dimensional
• Historic

[Diagram label: Scaling]
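
To make the ACID point concrete, here is a minimal sketch of atomicity using Python's built-in sqlite3 module. The `accounts` table, the names, and the amounts are hypothetical and not from the talk; the idea is only that a failed transfer leaves no partial update behind.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

# Hypothetical schema and data, kept in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # both updates are rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, neither account has changed, which is the behavior an OLTP workload relies on.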

Distribution across cluster is challenging (CAP)
When the network partitions: Consistency or Availability (latency)
Example questions:
• Is that hotel room still available?
• What's the account $ balance?
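
The trade-off can be sketched with a toy two-replica store. This is an illustration under simplifying assumptions, not a real database: the class names, the "CP"/"AP" mode labels, and the single hotel room are all hypothetical.

```python
class Replica:
    """One copy of the data: how many rooms are still free."""
    def __init__(self):
        self.rooms_available = 1

class TwoNodeStore:
    """Two replicas of one counter; `partitioned` simulates a broken link."""
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def book_room(self, replica):
        if self.partitioned and self.mode == "CP":
            # Prefer consistency: refuse to answer rather than risk a double booking.
            return "unavailable"
        if replica.rooms_available == 0:
            return "sold out"
        replica.rooms_available -= 1
        if not self.partitioned:
            # Healthy network: keep the other replica in sync.
            other = self.b if replica is self.a else self.a
            other.rooms_available = replica.rooms_available
        return "booked"

def run(mode):
    """Book the last room on both replicas while they cannot talk to each other."""
    store = TwoNodeStore(mode)
    store.partitioned = True
    return [store.book_room(store.a), store.book_room(store.b)]
```

In this sketch `run("CP")` refuses both requests, while `run("AP")` accepts both and double-books the room, which the application must then reconcile.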

NoSQL (Non-relational) databases
Attributes:
• Schema-less (?), cluster friendly (Sharding), open source
• Aggregates (Transaction boundary), Developer friendly
• Flavors
  - Aggregates: Key-Value, Document, Columnar
  - Relationship: Graph
[Diagram label: Scaling]
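
Sharding, mentioned above as what makes NoSQL stores cluster friendly, can be sketched as hash-based routing of keys to nodes. This is a minimal illustration, assuming a fixed shard count and SHA-256 as the hash; real products typically use more elaborate schemes (for example consistent hashing or range partitioning).

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index in a stable, process-independent way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def distribute(keys, num_shards):
    """Group keys by the shard that would own them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards

# Hypothetical customer keys spread over three shards.
placement = distribute(["Tom", "Scott", "Jon", "Dan"], num_shards=3)
```

Because each aggregate lives entirely on one shard, a single-key read or write touches only one node.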

Hadoop History – reinventing the Google wheel
From Google to Yahoo to open source (Apache Hadoop) to monetized

2003-2004
• Google published papers on GFS (Google File System) and MapReduce (Data Processing on Large Clusters)
2006
• Apache Hadoop is born @ Yahoo
• Hadoop name came from Doug Cutting (Hadoop's creator) son's plush toy elephant
2008
• Cloudera founded
• Doug is a Chief architect @ Cloudera
• Intel (740M for 18%)
2009
• MapR founded
• CTO Srivas is from Google and worked on GFS, BigTable and MapReduce
• Google (110M)
2011
• Hortonworks founded
• 24 engineers from Hadoop team @ Yahoo
• HDP (Market Cap 850M)

Hadoop ecosystem
[Diagram: project logos with these roles]
• Kafka: distributed messaging
• Highly reliable distributed coordination
• Workflow scheduler
• Realtime computation
• Faster computations
• HDFS <-> RDBMS transfer
• Machine learning apps
• Streaming event data ingestion
• HiveQL: SQL-like to MapReduce
• Data analytics
• NoSQL DB
• Core Hadoop: MapReduce/YARN & HDFS

NoSQL and Hadoop are now widely accepted

Data Models and architectures

New Data models have emerged (NoSQL)
[Diagram: four NoSQL data models]
• Key-Value / Column based: sample rows "Tom 123456 London" and "Scott 999999 Malvern"; fields Name, phone, email, ….
• Document: Customer with Orders, Address (City, State), Payment (Credit Card: Number, expiry), ….
• Graph: Jon, pancakes; Dan, Football
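
The four models can be sketched with plain Python structures, reusing the sample values from the diagram. The field names (`phone`, `city`), the "|" value encoding, and the `likes` edge label are assumptions made for illustration; the slide does not say how the values map to fields.

```python
# Key-value: the store sees only a key and an opaque value.
key_value = {
    "Tom": "123456|London",
    "Scott": "999999|Malvern",
}

# Column based: row key -> column name -> value; rows may carry different columns.
column_based = {
    "Tom": {"phone": "123456", "city": "London"},
    "Scott": {"phone": "999999", "city": "Malvern"},
}

# Document: one nested, self-describing aggregate per customer.
document = {
    "name": "Tom",
    "address": {"city": "London"},
    "payment": {"credit_card": {"number": None, "expiry": None}},
    "orders": [],
}

# Graph: relationships are first-class, stored as labeled edges between nodes.
edges = [("Jon", "likes", "pancakes"), ("Dan", "likes", "Football")]

def neighbors(edges, node):
    """Return everything `node` points to, regardless of edge label."""
    return [dst for src, _label, dst in edges if src == node]
```

The first three models group data into aggregates that are read and written as a unit; the graph model instead optimizes for traversing relationships.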

Core Hadoop: MapReduce/YARN & HDFS
MapReduce: Ability to take a dataset, divide it, and run it over parallel nodes. Input data is processed and transformed into an intermediate stage and then summarized into the final stage.

• MapReduce 2.0: Parallel processing for large data sets
• YARN (Resource Negotiator): Job scheduling and cluster resource management
• HDFS: Distributed file system with high throughput access to application data
Components shown: ResourceMgr, NodeMgr, ApplicationMaster, Container, NameNode, DataNode
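
The map, intermediate, and summarize stages can be sketched in a few lines of single-process Python. This is only an illustration of the programming model with a word count; it does not use Hadoop, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Intermediate stage: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: summarize the values of one key."""
    return key, sum(values)

def word_count(splits):
    """Run map over every line of every input split, then shuffle and reduce."""
    mapped = chain.from_iterable(
        map_phase(line) for split in splits for line in split
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Two input splits, as if the dataset had been divided across two nodes.
counts = word_count([["big data", "Big cluster"], ["data data"]])
```

On a real cluster each split would be mapped on a different node and the shuffle would move data over the network; the logic per key stays the same.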

MapReduce 2.0 (YARN)
• Global ResourceManager (job tracking and resource management)
• Per-application (MapReduce or DAG) ApplicationMaster (job scheduling/monitoring): negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

HDFS - Hadoop Distributed File System
For large data sets, runs on clusters, failure is assumed (hadoop.apache.org)
• DFS in Java, runs on Linux
• write-once-read-many
• hierarchical file organization
• stores each file as a sequence of blocks
• block size and replication factor are configurable per file
• NameNode maintains the file system namespace
• communication protocols are layered on top of TCP/IP
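
Since block size and replication factor are configurable per file, it can help to estimate what a file costs the cluster. The sketch below assumes the common defaults of a 128 MB block size and a replication factor of 3; the function name is hypothetical and the numbers should be replaced with the values configured on a given cluster.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=128 * 1024**2, replication=3):
    """Estimate block count and raw storage for one file.

    The last block only occupies as much disk as it holds data,
    so raw storage is the file size times the replication factor.
    """
    blocks = -(-file_size_bytes // block_size_bytes)  # ceiling division
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        "raw_bytes": file_size_bytes * replication,
    }

one_gib = hdfs_footprint(1024**3)
```

Under these assumptions a 1 GiB file is split into 8 blocks, stored as 24 block replicas, and consumes 3 GiB of raw disk.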

Demo – Import and query structured data in Hadoop

Data is getting complex
Volume, Velocity, Variety, Veracity
Data sources: Sensor data, Social Media activity, Database, Videos, Audio, Images, Archives
Processing latency:
• Macrobatch: >15 min
• Microbatch: 2-15 min
• Near Real-time (Decision support): 2 sec – 2 min
• Near Real-time (Event Processing): 100 millisec – 2 min
• Real-time: <100 millisec

Data serialization and De-serialization
Serialization is the process of translating an object into a stream of bytes so that it can be stored in or transmitted to memory, a database, or a file. De-serialization reverses the process.
[Diagram: Sensor data, Social Media activity, Music, Videos, Audio, Images, Archives → byte stream → Database, File, Memory]

Avro
• Schema based system
• uses JSON for defining data types and protocols, and serializes data in a compact binary format

Parquet
• Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language

Others
• Thrift (Facebook)
• Protocol Buffers (Google)
• JSON
• BSON
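
The difference between a self-describing text format and a schema based binary format can be sketched with the Python standard library alone. This is not Avro: it uses `struct` with a hand-written schema as a stand-in, and the record fields (`sensor_id`, `timestamp`, `temperature`) are hypothetical.

```python
import json
import struct

# Schema known to both writer and reader: field name and struct type code.
# "I" = unsigned 32-bit int, "Q" = unsigned 64-bit int, "d" = 64-bit float.
SCHEMA = [("sensor_id", "I"), ("timestamp", "Q"), ("temperature", "d")]
FORMAT = "<" + "".join(code for _, code in SCHEMA)  # little-endian, no padding

def to_binary(record):
    """Serialize: field names live in the schema, only values go on the wire."""
    return struct.pack(FORMAT, *(record[name] for name, _ in SCHEMA))

def from_binary(payload):
    """De-serialize: use the same schema to turn bytes back into a record."""
    values = struct.unpack(FORMAT, payload)
    return {name: value for (name, _), value in zip(SCHEMA, values)}

record = {"sensor_id": 42, "timestamp": 1700000000, "temperature": 21.5}
as_json = json.dumps(record).encode("utf-8")  # repeats field names in every record
as_binary = to_binary(record)                 # 4 + 8 + 8 = 20 bytes
restored = from_binary(as_binary)
```

The binary form is smaller because the schema travels separately from the data, which is the same idea schema based formats apply at scale.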

Demo – Import and analyze semi-structured data in Hadoop

Big Data solutions need a lot more than just Hadoop
• Lambda architecture: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
• Kappa architecture: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
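
As a rough sketch of the Lambda idea, the toy class below keeps an append-only log, a batch view recomputed from the full log, and a speed view covering only events that arrived since the last batch run; queries merge the two. The class and method names are hypothetical, and everything runs in one process purely for illustration.

```python
from collections import Counter

class LambdaCounter:
    """Toy event counter with a batch layer and a speed layer."""

    def __init__(self):
        self.master_log = []         # immutable, append-only source of truth
        self.batch_view = Counter()  # recomputed from the whole log
        self.speed_view = Counter()  # incremental, covers events since last batch

    def ingest(self, key):
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch and discard the speed view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        """Serving layer: merge the batch result with recent increments."""
        return self.batch_view[key] + self.speed_view[key]

counter = LambdaCounter()
for event in ["click", "click", "view"]:
    counter.ingest(event)
before_batch = counter.query("click")
counter.run_batch()
counter.ingest("click")
after_batch = counter.query("click")
```

A Kappa-style design would drop the separate batch path and rebuild views by replaying the log through the same streaming code.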

Apache Spark deserves special attention
• 10x to 100x faster than MapReduce
• DAG execution engine: cyclic data flow and in-memory computing (spark.apache.org)
• Java, Scala, Python, R
• IBM: 300M + 3500 people

Demo - Spark for analytics on object relationships

When to use NoSQL vs Hadoop for Big Data
Cluster friendly, scale out horizontally, handle large data sets, flexible data formats, developer friendly

NoSQL Databases
• Read/Write/Modify
• Interactive
• Real-time

Hadoop
• Batch processing
• Process large data sets
• Historic (Data Lakes)

Maintain two different copies of data?
Random, real-time read/write access to HDFS

Industry acceptance: NoSQL and Hadoop
Magic Quadrant for Operational Database Management Systems

Enterprise Grade

Simplified view of Hadoop cluster and add-ons

Open source is free... if your time has no value
[Diagram labels: Management, Skills, Scale & Operations, Fault Tolerance, Security, Governance]

Management

Scalability and Fault tolerance
In the world of cluster computing
• Hadoop was designed to be scalable and robust: built into HDFS and MapReduce
• Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

Security
By default Hadoop runs in non-secure mode

Authentication
• Kerberos principals for Hadoop Daemons and Users
• Hadoop Key Management Server

Data Encryption
• Data Encryption on RPC, Block data transfer, and HTTP
• Transparent Encryption in HDFS

Service Level Authorization
• ensure clients are authorized to access the Hadoop service

sentry.apache.org
• Role based authorization to data and metadata stored on a Hadoop cluster
• Started by Cloudera

Data governance in Hadoop, ready for compliance?
Audit data access, track data lineage, data/metadata lifecycle management

Hadoop and Openstack

Hadoop can run on Openstack
Traditional Hadoop
• Designed for Bare Metal (Lack of agility)
• Underutilized resources
• Maintenance (Operations can become a bottleneck)
• Fixed Capacity
• Setting up POC takes too much time

Hadoop on Openstack
• Agile, automated deployment (both Physical (Ironic) and Virtual) - Sahara
• Shared platform
• Repeatable operations with template based provisioning (Hadoop on demand)
• Scale up/down
• Quick POC

[Diagram: Big Data Software (Hadoop) / Operating System (Linux) / Infrastructure – Physical or Virtual]

Big Data needs Infrastructure and it can be virtual

Virtual infrastructure benefits:
• Clone and repeat - lower operations costs
• On-demand Hadoop clusters (Short lived clusters)
• Physical infrastructure can be reused and shared
• Flexible cluster scaling – Grow/Shrink
• Pay per usage (in public cloud)

Hadoop was designed for physical servers:
• Understand the virtualization overhead
• Local on-server HDD storage (fast sequential reads) – use Block device driver for Cinder, utilize full disk for Cinder volume and keep Cinder volume on same host as the VM
• Dedicated CPUs and homogeneous hardware – keep VM flavors consistent
• Dedicated network – isolate physical networks supporting Tenant networks
• Replication and redundant services (ZooKeeper, Journal mgr, NameNode) land on different data nodes for fault tolerance – Anti-Affinity
• For Spark (memory intensive) – do NOT oversubscribe RAM
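
Two of the guidelines above, anti-affinity for redundant services and no RAM oversubscription, are easy to check mechanically before deploying. The sketch below is a standalone validator with hypothetical VM and host names; it does not call any OpenStack API.

```python
from collections import defaultdict

def check_placement(placement, vm_ram_gb, host_ram_gb, anti_affinity_group):
    """Return a list of problems found in a VM-to-host placement plan.

    placement:           {vm_name: host_name}
    vm_ram_gb:           {vm_name: RAM requested by the VM flavor}
    host_ram_gb:         {host_name: physical RAM available to VMs}
    anti_affinity_group: VMs that must all land on different hosts
    """
    problems = []

    # Rule 1: the RAM promised to VMs must not exceed the host's physical RAM.
    used = defaultdict(int)
    for vm, host in placement.items():
        used[host] += vm_ram_gb[vm]
    for host, ram in sorted(used.items()):
        if ram > host_ram_gb[host]:
            problems.append(f"RAM oversubscribed on {host}: {ram} GB > {host_ram_gb[host]} GB")

    # Rule 2: redundant services must not share a host.
    vms_on_host = defaultdict(list)
    for vm in anti_affinity_group:
        vms_on_host[placement[vm]].append(vm)
    for host, vms in sorted(vms_on_host.items()):
        if len(vms) > 1:
            problems.append(f"anti-affinity violated on {host}: {', '.join(vms)}")

    return problems

# Hypothetical plan: both NameNodes on host1, and host1 short on RAM.
plan = {"namenode-1": "host1", "namenode-2": "host1", "spark-1": "host2"}
ram = {"namenode-1": 32, "namenode-2": 32, "spark-1": 64}
hosts = {"host1": 48, "host2": 64}
issues = check_placement(plan, ram, hosts, ["namenode-1", "namenode-2"])
```

In this example the plan fails both checks; moving `namenode-2` to a third host with enough RAM would clear them.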

Summary
[Diagram: OUTCOME / LOGIC / DATA]
• LOGIC: Cisco Data & Analytics, Cisco Data Virtualization
• DATA: Packaged Apps, RDBMS, Excel Files, Data Warehouse, OLAP Cubes, Hadoop/Big Data, XML Docs, Flat Files, Web Services
• Cisco Infrastructure and Services

Demo – Big Data as a service in public Cloud

Call to Action
• Follow up on the following related sessions
  • IoT, Hadoop and SAP HANA on UCS [BRKDCT-1016]
  • Advanced - Turn on the Lights with Big Data Security Analytics [TECSEC-3900]
• Visit the World of Solutions for
  • MapR
• DevNet zone related sessions
  • DevNet Workshop: Big Data as a Service [DevNet-1608]
  • DevNet Zone demo pod: Big Data on Cisco Cloud

Complete Your Online Session Evaluation
• Please complete your online session evaluations after each session. Complete 4 session evaluations & the Overall Conference Evaluation (available from Thursday) to receive your Cisco Live T-shirt.
• All surveys can be completed via the Cisco Live Mobile App or the Communication Stations

Thank you