APACHE CASSANDRA AND SPARK STREAMING FOR REAL TIME ANALYTICS

Rohit Bhardwaj

Principal Cloud Engineer

[email protected] : rbhardwaj1

AGENDA

• Big data characteristics

• Real time analytics

• Cassandra no and

• Cassandra Data Model

• Spark with Cassandra

BIG DATA

BIG DATA CHARACTERISTICS

https://imasaikirangeek.files.wordpress.com/2014/05/defining-big-data1.png

REAL TIME ANALYTICS USELESS

http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=IMW14704USEN

APACHE SPARK FOR DATA IN MOTION

WHY SPARK?

• Readability

• Expressiveness

• Fast

• Testability

• Interactive

• Fault Tolerant

• Unify Big Data

MAP-REDUCE EXPLOSION

LOCS

As of April 2015

HISTORY OF SPARK

WHO IS USING SPARK?

SPARK CORE MAINTAINERS

SPARK DEMO

• http://demo.gethue.com

SPARK CONTEXT

• Task creator

• Scheduler

• Data locality

• Fault tolerance

RESILIENT DISTRIBUTED DATASETS (RDD)

• Immutable

• Re-computable

• Fault tolerant

• Reusable

DAG: DIRECTED ACYCLIC GRAPH

SPARK GENERAL FLOW

SPARK MECHANICS
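The RDD properties listed above (immutable, re-computable, fault tolerant, reusable) and the DAG idea can be illustrated without a cluster. What follows is a minimal pure-Python sketch, not Spark's implementation: `MiniRDD` and its methods are hypothetical names chosen for illustration. Each transformation returns a new object that records its parent and function (its lineage), and nothing is computed until `collect()` is called, so a lost result can be rebuilt by replaying the lineage.

```python
from typing import Callable, Iterable, List, Optional


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source: Optional[List] = None,
                 parent: Optional["MiniRDD"] = None,
                 op: Optional[Callable[[Iterable], Iterable]] = None,
                 name: str = "source"):
        # The input is copied into a tuple so it cannot be mutated later.
        self._source = tuple(source) if source is not None else None
        self._parent = parent
        self._op = op
        self.name = name

    def map(self, fn):
        # Transformations are lazy: they only record what to do.
        return MiniRDD(parent=self, op=lambda rows: (fn(r) for r in rows), name="map")

    def filter(self, pred):
        return MiniRDD(parent=self, op=lambda rows: (r for r in rows if pred(r)), name="filter")

    def lineage(self):
        # Walk back to the source; this chain is a (linear) DAG.
        node, steps = self, []
        while node is not None:
            steps.append(node.name)
            node = node._parent
        return list(reversed(steps))

    def collect(self):
        # The action: recompute from the source by replaying the lineage.
        if self._parent is None:
            return list(self._source)
        return list(self._op(self._parent.collect()))


numbers = MiniRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print(evens_squared.lineage())  # ['source', 'filter', 'map']
print(evens_squared.collect())  # [4, 16, 36]
print(numbers.collect())        # [1, 2, 3, 4, 5, 6] (the parent is unchanged and reusable)
```

In real Spark the lineage is tracked per partition and scheduled across executors; the sketch keeps only the record-and-replay idea.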

INPUT

SPARK WORKFLOW

SPARK EXECUTION MODEL

SPARK SQL EXAMPLE

BATCH VS REAL TIME PROCESSING

ACID

• Atomicity

• Consistency

• Isolation

• Durability

• Would ACID work with big data?

REPLICATION

DB SHARDING

BASE

• Basically Available

• Soft State

• Eventual Consistency
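To make "soft state" and "eventual consistency" concrete, here is a minimal sketch that assumes a last-write-wins rule keyed on a write timestamp, which is how Cassandra resolves conflicting versions of a value. The `Replica` class and the `anti_entropy` function are hypothetical names for illustration only; real replication involves hinted handoff, read repair, and more.

```python
from typing import Dict, List, Optional, Tuple

Versioned = Tuple[int, str]  # (write timestamp, value)


class Replica:
    """Hypothetical replica holding one versioned value per key."""

    def __init__(self, name: str):
        self.name = name
        self.data: Dict[str, Versioned] = {}

    def apply(self, key: str, version: Versioned) -> None:
        # Last write wins: keep the version with the newest timestamp.
        current = self.data.get(key)
        if current is None or version[0] > current[0]:
            self.data[key] = version

    def read(self, key: str) -> Optional[str]:
        version = self.data.get(key)
        return version[1] if version else None


def anti_entropy(replicas: List[Replica]) -> None:
    """Exchange versions so every replica ends with the newest one per key."""
    for key in {k for r in replicas for k in r.data}:
        newest = max(r.data[key] for r in replicas if key in r.data)
        for r in replicas:
            r.apply(key, newest)


a, b, c = Replica("a"), Replica("b"), Replica("c")
for r in (a, b, c):
    r.apply("city", (1, "Sacramento"))

a.apply("city", (2, "San Jose"))  # a later write reaches only replica a
stale = [r.read("city") for r in (a, b, c)]

anti_entropy([a, b, c])
converged = [r.read("city") for r in (a, b, c)]

print(stale)      # ['San Jose', 'Sacramento', 'Sacramento']
print(converged)  # ['San Jose', 'San Jose', 'San Jose']
```

Between the write and the exchange, readers can see different answers depending on which replica they hit (soft state); once the replicas have exchanged versions they agree (eventual consistency).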

CAP THEOREM

BASE (Basically Available, Soft-State, Eventual Consistency) data store https://www.facebook.com/notes/facebook-engineering/cassandra-a-structured-storage-system-on-a-p2p-network/24413138919/

Netflix case study

• Component microservices

• Chaos Gorilla

• Cassandra maintenance

• Isolated Regions

Cassandra

10x more read throughput

8x faster read latency (up to 100x faster)

8x more write throughput

10x slower write latency (with the default configuration; that is, no write durability for HBase)

8x faster scan latency

4x more scan throughput

INSTALLATION

CONFIGURATION FILES

CASSANDRA.YAML

CASSANDRA CLUSTER

• Node: one Cassandra instance

• Rack: a logical set of nodes

• Data Center: a logical set of Racks

• Cluster: a ring of nodes
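The "ring of nodes" can be sketched as consistent hashing: each node owns a token, a partition key is hashed to a token, and the key lives on the first node whose token is at or past that value, plus the next nodes clockwise for its replicas. This is a simplified sketch under stated assumptions: it uses MD5 from the standard library rather than Cassandra's default Murmur3 partitioner, one token per node (no virtual nodes), and made-up node names. Replica placement only mimics the idea behind SimpleStrategy and ignores racks and data centers.

```python
import bisect
import hashlib
from typing import List

RING_SIZE = 2 ** 32  # assumed token space for this sketch


def token_for(value: str) -> int:
    """Hash a string to a position on the ring (MD5 is an assumption here)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class HashRing:
    """Hypothetical ring with one token per node."""

    def __init__(self, nodes: List[str]):
        self._ring = sorted((token_for(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, partition_key: str, replication_factor: int) -> List[str]:
        rf = min(replication_factor, len(self._ring))
        # First node at or after the key's token, wrapping around the ring.
        start = bisect.bisect_left(self._tokens, token_for(partition_key)) % len(self._ring)
        # Walk clockwise to pick the remaining replicas.
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(rf)]


ring = HashRing(["node1", "node2", "node3", "node4"])
owners = ring.replicas("Smith", replication_factor=3)
print(owners)  # three distinct node names; the same key always maps to the same nodes
```

Because placement depends only on the key's hash and the node tokens, any node can work out where a row lives without a central directory.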

HASH RING

CASSANDRA TERMINOLOGY

SIMPLE STRATEGY

• CREATE KEYSPACE nfjs WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

NETWORK TOPOLOGY STRATEGY

• CREATE KEYSPACE nfjs WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 1 };

CASSANDRA DATA MODEL

QUERIES

KILLRVIDEO.COM

CONCEPTUAL DATA MODELING

PURPOSE

RELATIONSHIP KEYS

ENTITY TYPE HIERARCHY

APPLICATION WORKFLOW MODEL

MAPPING CONCEPTUAL TO LOGICAL

CHEBOTKO DIAGRAMS

CHEBOTKO DIAGRAM NOTATION

EXAMPLE CHEBOTKO DIAGRAM

CASSANDRA DATA MODELING PRINCIPLES

• Know your data

• Know your queries

• Nest data

• Duplicate data

MAPPING RULES

For the query-driven methodology

• Mapping rules ensure that a logical data model is correct

• Each query has a corresponding table

• Tables are designed to allow queries to execute properly

• Tables return data in the correct order

MAPPING RULES

• Mapping Rule 1: Entities and relationships

• Mapping Rule 2: Equality search attributes

• Mapping Rule 3: Inequality search attributes

• Mapping Rule 4: Ordering attributes

• Mapping Rule 5: Key attributes

MR1: ENTITIES AND RELATIONSHIPS

MR2: EQUALITY SEARCH ATTRIBUTES

MR3: INEQUALITY SEARCH ATTRIBUTES

MR4: ORDERING ATTRIBUTES

MR5: KEY ATTRIBUTES

APPLYING MAPPING RULES

PHYSICAL DATA MODEL

WHAT TO ANALYZE

• Finding the problem

• Partition size

• Data redundancy

• Data consistency

• Application-side joins

• Referential integrity constraints

• Transactions

• Data aggregation

DESIGNING THE MODEL

CREATING TABLES

READS AND WRITES IN CASSANDRA

CONSISTENCY LEVEL

WRITE PATH
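As a rough illustration of the write path, the sketch below models a single node: a write is appended to a commit log, applied to an in-memory memtable, and the memtable is flushed to an immutable, sorted SSTable once it is full. `ToyNode`, its fields, and the tiny `memtable_limit` are hypothetical and chosen for illustration; in Cassandra the commit log and SSTables are on-disk files and flushing is driven by memory thresholds.

```python
from typing import Dict, List, Tuple


class ToyNode:
    """Hypothetical single-node model of the write path (not Cassandra code)."""

    def __init__(self, memtable_limit: int = 2):
        self.commit_log: List[Tuple[str, str]] = []  # append-only, replayed after a crash
        self.memtable: Dict[str, str] = {}           # in memory, latest value per key
        self.sstables: List[Tuple[Tuple[str, str], ...]] = []  # immutable, sorted by key
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. append to the commit log
        self.memtable[key] = value            # 2. update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when the memtable is full

    def flush(self) -> None:
        # Write the memtable out as a sorted, immutable table.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Simplification: flushed writes no longer need replay, so the log is cleared.
        self.commit_log = []


node = ToyNode(memtable_limit=2)
node.write("Smith", "Sacramento")
node.write("Doe", "Beverly Hills")  # the second key triggers a flush
node.write("Byrne", "San Diego")

print(node.sstables)  # [(('Doe', 'Beverly Hills'), ('Smith', 'Sacramento'))]
print(node.memtable)  # {'Byrne': 'San Diego'}
```

Because SSTables are never modified after a flush, repeated writes to the same key accumulate across several tables over time, which is the situation compaction later cleans up.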

COMPACTION

READ PATH

READ PROCESSING IN NODE

DEMO

• Cassandra query language

• http://www.planetcassandra.org/try-cassandra/

DEMO CQL SCRIPT

• CREATE KEYSPACE gids WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

• USE gids;

• CREATE TABLE users ( firstname text, lastname text, age int, email text, city text, PRIMARY KEY (lastname) );

• INSERT INTO users (firstname, lastname, age, email, city) VALUES ('John', 'Smith', 46, '[email protected]', 'Sacramento');

• INSERT INTO users (firstname, lastname, age, email, city) VALUES ('Jane', 'Doe', 36, '[email protected]', 'Beverly Hills');

• INSERT INTO users (firstname, lastname, age, email, city) VALUES ('Rob', 'Byrne', 24, '[email protected]', 'San Diego');

• SELECT * FROM users;

• UPDATE users SET city = 'San Jose' WHERE lastname = 'Doe';

• SELECT * FROM users WHERE lastname = 'Doe';

• DELETE FROM users WHERE lastname = 'Doe';

• SELECT * FROM users;

COMBINING SPARK AND CASSANDRA

• Spark

• Great for analyzing large amounts of data

• Cassandra

• Great for storing large amounts of data

SPARK AND CASSANDRA ARCHITECTURE

CASSANDRA AND SPARK

SPARK AND CASSANDRA

REFERENCES

• http://docs.datastax.com/en/landing_page/doc/landing_page/current.html

• http://spark.apache.org/

• http://cassandra.apache.org/

• http://www.planetcassandra.org/

• http://demo.gethue.com/home