BRKCLD-2501.Pdf

Architecting and Delivering Big Data in cloud
Abhi Singh, Technical Solutions Architect, Cisco

Agenda

• Business use case
• Ecosystem and technology
• Data Models and architectures
• Enterprise Grade
• Hadoop and Openstack
• Summary, further learning, Q&A

Demonstration:

• Demo 1: Import and query structured data in Hadoop
• Demo 2: Import and analyze semi structured data in Hadoop
• Demo 3: Spark for analytics on object relationships
• Demo 4: Big Data as a service in cloud

From Data to information to actionable intelligence

OUTCOME

LOGIC

DATA

Bare Metal & Virtualized Infrastructure, Private & Public Cloud

Big Data is …

Data sets too large/complex for traditional data processing applications

• Volume
• Velocity
• Variety
• Veracity

Big Data driven spending 2012-2016: $232 B (Gartner 2012)

Data quality problems cost U.S. businesses more than $600 billion a year (DW-institute 2002)

Big Data is an Enabler

Decrease Cost
• Open source technology
• Integrate different data sources

Streamline business
• Landing zone for all Data
• Data driven business decisions
• Improved outcomes of campaigns

Increase revenue
• Improve customer satisfaction
• Reduce customer churn
• Understand patterns
• Preventative maintenance

Big Data in Action - Waze

• Real time GPS Data
• Map Archives
• Social media activity
• Location based Ads
• Budget based Ad campaigns
• Community shared Gas prices

Ecosystem and Technology

Data lives in RDBMS, NoSQL, and Hadoop

Bare Metal & Virtualized Infrastructure Private & Public Cloud

Conventional – Databases and Data Warehouses

RDBMS – A relational database is a set of tables containing data fitted into predefined categories (schema/blueprint). Allows for ACID properties.

Operational Databases
• OLTP
• Data retrieval/update
• Relational
• Current

Data Warehouse
• OLAP
• Data analysis and decision making
• Relational and multi-dimensional
• Historic

Scaling
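The ACID properties mentioned above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for an operational RDBMS; the `accounts` table, the account names, and the `transfer` helper are assumptions made for illustration and do not come from the session. The second transfer fails halfway through, and the transaction is rolled back so that no partial update survives (atomicity).

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint when funds are short.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and commits
failed = transfer(conn, "alice", "bob", 500)  # credit is undone by the rollback
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The credit is deliberately applied before the debit so that the failing statement arrives after a successful one, which makes the rollback visible.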

Distribution across cluster is challenging (CAP)

• Consistency or Availability (latency)
• Network Partition

Example questions:
• Is that hotel room still available?
• What's the account $ balance?

NoSQL (Non-relational) databases

Attributes:
• Schema-less (?), cluster friendly (Sharding), Open source
• Aggregates (Transaction boundary), Developer friendly
• Flavors (Aggregates: Key-Value, Document, Columnar; Relationship: Graph)

Scaling
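The "cluster friendly (Sharding)" and "Aggregates" attributes above can be sketched in a few lines. This is a toy model, not any particular NoSQL product: the `ShardedStore` class, the `shard_for` function, the key names, and the sample values are all hypothetical. Each whole aggregate is stored under one key, and a stable hash of that key decides which node (here, a plain dict) holds it.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (same key, same shard)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class ShardedStore:
    """Toy key-value store; each dict stands in for one node of the cluster."""

    def __init__(self, num_shards: int):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, aggregate: dict) -> None:
        # The whole aggregate lands on one shard, so it can be read or
        # written without coordinating across nodes.
        self.shards[shard_for(key, len(self.shards))][key] = aggregate

    def get(self, key: str):
        return self.shards[shard_for(key, len(self.shards))].get(key)

store = ShardedStore(num_shards=3)
store.put("customer:tom", {"name": "Tom", "city": "London", "orders": ["o-1"]})
store.put("customer:scott", {"name": "Scott", "city": "Malvern", "orders": []})
tom = store.get("customer:tom")
print(tom)
```

Simple modulo placement like this reshuffles most keys when the number of shards changes; production systems typically use consistent hashing or range partitioning instead.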

Hadoop History – reinventing the Google wheel

From Google to Yahoo to Opensource (Apache Hadoop) to Monetized

2003-2004
• Google published papers on GFS (Google File System) and MapReduce (Data Processing on Large Clusters)

2006
• Apache Hadoop is born @ Yahoo
• Hadoop name came from Doug Cutting (Hadoop’s creator) son’s plush toy elephant

2008
• Cloudera founded
• Doug is a Chief architect @ Cloudera
• Intel (740M for 18%)

2009
• MapR Founded
• CTO Srivas is from Google and worked on GFS, BigTable and Mapreduce
• Google (110M)

2011
• Hortonworks founded
• 24 engineers from Hadoop team @ Yahoo
• HDP (Market Cap 850M)

Hadoop ecosystem

[Diagram: ecosystem components, labeled by function, layered on top of Core Hadoop]
• Kafka: Distributed Messaging
• Highly reliable distributed coordination
• Workflow scheduler
• Realtime Computation
• Faster Computations
• HDFS <-> RDBMS
• Machine learning apps
• Streaming event data ingestion
• HiveQL: SQL-Like to Mapreduce
• Data analytics
• NoSQL DB

Core Hadoop: MapReduce/YARN & HDFS

NoSQL and Hadoop are now widely accepted

Data Models and architectures

New Data models have emerged (NoSQL)

[Diagram: four NoSQL data models with sample data]
• Key-Value
• Column based
• Document
• Graph

Sample values shown: Tom, 123456, London; Scott, 999999, Malvern; Jon, pancakes; Dan, Football.
Sample fields shown: Customer (Name, phone, email), Address (City, State), Orders, Payment (Credit Card Number, expiry).

Core Hadoop: MapReduce/YARN & HDFS

MapReduce: Ability to take a dataset, divide it, and run it over parallel nodes. Input data is processed and transformed into an intermediate stage and then summarized into the final stage.

• MapReduce 2.0: Parallel Processing for large data sets
• YARN - Resource Negotiator: Job scheduling and cluster resource management (ResourceMgr, NodeMgr, ApplicationMaster, Container)
• HDFS: Distributed file system, high throughput access to application data (NameNode, DataNode)
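The MapReduce flow described above (divide the dataset, transform it into an intermediate stage, summarize into the final stage) can be sketched as a single-process word count. This is only an illustration of the programming model in plain Python, not Hadoop's API; the function names and the sample input are assumptions, and the splits are processed one after another here rather than on parallel nodes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit an intermediate (word, 1) pair for every word in a split."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as done between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: summarize each key's values into the final result."""
    return {key: sum(values) for key, values in grouped.items()}

def map_reduce(lines, num_splits=2):
    # Divide the dataset; on a cluster each split would go to a different node.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    intermediate = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(intermediate))

counts = map_reduce(["big data in cloud", "big data in Hadoop", "Spark in cloud"])
print(counts)
```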

MapReduce 2.0 (YARN)

• Global ResourceManager (job tracking and resource management)
• Per-application (MapReduce or DAG) ApplicationMaster (job scheduling/monitoring): negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the tasks

https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

HDFS - Hadoop Distributed File System

For large data sets, runs on clusters, failure is assumed (hadoop.apache.org)

• DFS in Java, runs on Linux
• write-once-read-many
• hierarchical file organization
• stores each file as a sequence of blocks
• block size and replication factor are configurable per file
• NameNode maintains the file system namespace
• communication protocols are layered on top of TCP/IP
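To make the block and replication points above concrete, the following sketch estimates how a file is laid out. The function name is hypothetical, and the default values (128 MB blocks, replication factor 3) are assumed here as the common Hadoop 2.x defaults; as noted above, both are configurable per file.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def hdfs_footprint(file_size, block_size=128 * MB, replication=3):
    """Return (blocks, block replicas, raw bytes stored) for one file.

    Assumed defaults: 128 MB blocks and replication factor 3.
    """
    blocks = -(-file_size // block_size)  # ceiling division
    replicas = blocks * replication
    # The last block only occupies as much disk as it actually holds,
    # so raw usage is the file size times the replication factor.
    raw_bytes = file_size * replication
    return blocks, replicas, raw_bytes

blocks, replicas, raw_bytes = hdfs_footprint(1 * GB)
print(blocks, replicas, raw_bytes // GB)
```

Under these assumptions a 1 GB file becomes 8 blocks, stored as 24 block replicas that consume 3 GB of raw disk across the DataNodes.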

Demo – Import and query structured data in Hadoop

Data is getting complex

Volume, Velocity, Variety, Veracity

Data sources:
• Sensor data
• Social Media activity
• Database
• Videos
• Audio
• Images
• Archives

Processing latency:
• Macrobatch: >15 min
• Microbatch: 2-15 min
• Near Real-time Decision support: 2 sec – 2 min
• Near Real-time Event Processing: 100 millisec-2 min
• Real-time: <100 millisec

Data serialization and De-serialization

Serialization is the process of translating an object into a stream of bytes in order to store it in, or transmit it to, memory, a database, or a file.

[Diagram: data sources (Sensor data, Social Media activity, Music, Videos, Audio, Images, Archives) are serialized into a byte stream and written to a Database, a File, or Memory]

Avro
• Schema based system
• uses JSON for defining data types and protocols, and serializes data in a compact binary format

Parquet
• Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language

Others
• Thrift (Facebook)
• Protocol Buffers (Google)
• JSON
• BSON
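A rough sketch of why schema-based binary formats are compact: when writer and reader agree on a schema, field names do not need to travel with every record. The example below uses only Python's standard json and struct modules; it does not reproduce Avro's or Parquet's actual encodings, and the record, its field names, and `RECORD_FORMAT` are assumptions made for illustration.

```python
import json
import struct

# Hypothetical sensor record.
record = {"sensor_id": 42, "temperature": 21.5, "ok": True}

# The "schema": little-endian uint32, float64, bool, with no padding.
RECORD_FORMAT = "<Id?"

def to_json_bytes(rec):
    """Self-describing text: field names are repeated in every record."""
    return json.dumps(rec).encode("utf-8")

def to_binary(rec):
    """Schema-based binary: only the values are written, in schema order."""
    return struct.pack(RECORD_FORMAT, rec["sensor_id"], rec["temperature"], rec["ok"])

def from_binary(data):
    """De-serialization: rebuild the object from bytes using the same schema."""
    sensor_id, temperature, ok = struct.unpack(RECORD_FORMAT, data)
    return {"sensor_id": sensor_id, "temperature": temperature, "ok": ok}

as_json = to_json_bytes(record)
as_binary = to_binary(record)
print(len(as_json), len(as_binary), from_binary(as_binary))
```

The trade-off is that the binary bytes are unreadable without the schema, which is why schema-based systems store or exchange the schema alongside the data.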

Demo – Import and analyze semi structured data in Hadoop

Big Data solutions need a lot more than just Hadoop

Lambda

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Kappa

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

Apache Spark deserves special attention

10x to 100x faster than MapReduce

• DAG execution engine: cyclic data flow and in-memory computing (spark.apache.org)
• Java, Scala, Python, R
• IBM: 300M + 3500 people

Demo - Spark for analytics on object relationships

When to use NoSQL vs Hadoop for Big Data

Both: Cluster friendly, scale out horizontally, Handle large data sets, Flexible Data formats, Developer friendly

NoSQL Databases
• Read/Write/Modify
• Interactive
• Real-time

Hadoop
• Batch processing
• Process large data sets
• Historic (Data Lakes)

Maintain two different copies of data?
• random, real-time read/write access to HDFS

Industry acceptance: NoSQL and Hadoop

Magic Quadrant for Operational Database Management Systems

Enterprise Grade

Simplified view of Hadoop cluster and add-ons

Opensource is free ... if your time has no value

[Diagram labels: Management, Skills, Scale & Operations, Fault Tolerance, Security, Governance]

Management

Scalability and Fault tolerance

In the world of cluster computing:
• Hadoop was designed to be scalable and robust: built into HDFS and MapReduce
• ZooKeeper: Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

Security

By default Hadoop runs in non-secure mode

Authentication
• Kerberos principals for Hadoop Daemons and Users
• Hadoop Key Management Server

Data Encryption
• Data Encryption on RPC, Block data transfer, and HTTP
• Transparent Encryption in HDFS

Service Level Authorization
• ensure clients are authorized to access the Hadoop service

sentry.apache.org
• Role based authorization to data and metadata stored on a Hadoop cluster
• Started by Cloudera

Data governance in Hadoop, ready for compliance?

Audit data access, Track data Lineage, Data/Metadata lifecycle management

Hadoop and Openstack

Hadoop can run on Openstack

Traditional Hadoop
• Designed for Bare Metal (Lack of agility)
• Underutilized resources
• Maintenance (Operations can become a bottleneck)
• Fixed Capacity
• Setting up POC takes too much time

Hadoop on Openstack
• Agile, automated deployment (both Physical (Ironic) and Virtual) - Sahara
• Shared platform
• Repeatable operations with template based provisioning (Hadoop on demand)
• Scale up/down
• Quick POC

[Diagram: stack of Big Data Software (Hadoop), Operating System (Linux), Infrastructure – Physical or Virtual]

Big Data needs Infrastructure and it can be virtual

Virtual infrastructure benefits:
• Clone and repeat - lower operations costs
• On-demand Hadoop clusters (Short lived clusters)
• Physical infrastructure can be reused and shared
• Flexible cluster scaling – Grow/Shrink
• Pay per usage (in public cloud)

Hadoop was designed for physical servers:
• Understand the virtualization overhead
• Local on-server HDD storage (fast sequential reads) – use Block device driver for Cinder, utilize full disk for Cinder volume and keep Cinder volume on same host as the VM
• Dedicated CPUs and homogeneous hardware – keep VM flavors consistent
• Dedicated network – isolate physical networks supporting Tenant networks
• Replication and redundant services (zookeeper, Journal mgr, Namenode) land on different data nodes for fault tolerance – Anti-Affinity
• For Spark (memory intensive) – do NOT oversubscribe RAM

Summary

OUTCOME

LOGIC
• Cisco Data & Analytics
• Cisco Data Virtualization

DATA
• Packaged Apps
• RDBMS
• Excel Files
• Data Warehouse
• OLAP Cubes
• Hadoop/Big Data
• XML Docs
• Flat Files
• Web Services

Cisco Infrastructure and Services

Demo – Big Data as a service in public Cloud

Call to Action

• Follow up on the following related sessions
  • IoT, Hadoop and SAP HANA on UCS [BRKDCT-1016]
  • Advanced - Turn on the Lights with Big Data Security Analytics [TECSEC-3900]
• Visit the World of Solutions for
  • MapR
• DevNet zone related sessions
  • DevNet Workshop: Big Data as a Service [DevNet-1608]
  • DevNet Zone demo pod: Big Data on Cisco Cloud


Thank you