Cassandra and Spark Streaming for Real Time Analytics

APACHE CASSANDRA AND SPARK STREAMING FOR REAL TIME ANALYTICS
Rohit Bhardwaj
Principal Cloud Engineer
[email protected]
Twitter: rbhardwaj1

AGENDA
• Big data characteristics
• Real time analytics
• Apache Spark
• Cassandra NoSQL database
• Cassandra Data Model
• Spark with Cassandra

BIG DATA

BIG DATA CHARACTERISTICS
https://imasaikirangeek.files.wordpress.com/2014/05/defining-big-data1.png

REAL TIME ANALYTICS
USELESS
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=IMW14704USEN

APACHE SPARK FOR DATA IN MOTION

WHY SPARK?
• Readability
• Expressiveness
• Fast
• Testability
• Interactive
• Fault Tolerant
• Unify Big Data

MAP-REDUCE EXPLOSION

LOCS
As of April 2015

HISTORY OF SPARK

WHO IS USING SPARK?

SPARK CORE MAINTAINERS

SPARK DEMO
• http://demo.gethue.com

SPARK CONTEXT
• Task creator
• Scheduler
• Data locality
• Fault tolerance

RESILIENT DISTRIBUTED DATASETS (RDD)
• Immutable
• Re-computable
• Fault tolerant
• Reusable

DAG: DIRECTED ACYCLIC GRAPH

SPARK GENERAL FLOW

SPARK MECHANICS

INPUT

SPARK WORKFLOW

SPARK EXECUTION MODEL

SPARK SQL EXAMPLE

BATCH VS REAL TIME PROCESSING

ACID
• Atomicity
• Consistency
• Isolation
• Durability
• Would ACID work with big data?
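The RDD properties listed on the RESILIENT DISTRIBUTED DATASETS slide (immutable, re-computable, reusable) can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark's API: the class name MiniRDD and its methods are hypothetical, and there is no distribution or caching here. It shows one idea only: a transformation returns a new object that remembers its parent, so a result can always be rebuilt from that lineage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)  # frozen: instances cannot be modified after creation
class MiniRDD:
    """Hypothetical toy stand-in for an RDD; this is not Spark's API."""
    source: Tuple[int, ...] = ()
    parent: Optional["MiniRDD"] = None
    fn: Optional[Callable[[int], int]] = None

    def map(self, fn: Callable[[int], int]) -> "MiniRDD":
        # A transformation never changes this object; it returns a new one
        # that records its parent and the function to apply (the lineage).
        return MiniRDD(parent=self, fn=fn)

    def collect(self) -> List[int]:
        # Results are re-computed by walking the lineage back to the source.
        if self.parent is None:
            return list(self.source)
        return [self.fn(x) for x in self.parent.collect()]


base = MiniRDD(source=(1, 2, 3))
doubled = base.map(lambda x: x * 2)      # reuses base, leaves it unchanged
plus_one = doubled.map(lambda x: x + 1)  # chain of two transformations
```

Calling plus_one.collect() walks the chain back to base and re-applies both functions, which is the sense in which a lost result is re-computable rather than restored from a copy.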
REPLICATION

DB SHARDING

BASE
• Basically Available
• Soft State
• Eventual Consistency

CAP THEOREM

BASE (Basically Available, Soft-State, Eventual Consistency) data store
https://www.facebook.com/notes/facebook-engineering/cassandra-a-structured-storage-system-on-a-p2p-network/24413138919/

Netflix case study
• Component microservices
• Chaos Gorilla
• Cassandra maintenance
• Isolated Regions

Cassandra
• 10x more read throughput
• 8x faster read latency (up to 100x faster)
• 8x more write throughput
• 10x slower write latency (with the default configuration; that is, no write durability for HBase)
• 8x faster scan latency
• 4x more scan throughput

INSTALLATION

CONFIGURATION FILES

CASSANDRA.YAML

CASSANDRA CLUSTER
• Node: One Cassandra instance
• Rack: a logical set of nodes
• Data Center: a logical set of Racks
• Cluster: a ring of nodes

HASH RING

CASSANDRA TERMINOLOGY

SIMPLE STRATEGY
• CREATE KEYSPACE nfjs WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };

NETWORK TOPOLOGY STRATEGY
• CREATE KEYSPACE nfjs WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 1 };

CASSANDRA DATA MODEL

QUERIES

KILLRVIDEO.COM

CONCEPTUAL DATA MODELING

PURPOSE

RELATIONSHIP KEYS

ENTITY TYPE HIERARCHY

APPLICATION WORKFLOW MODEL

MAPPING CONCEPTUAL TO LOGICAL

CHEBOTKO DIAGRAMS

CHEBOTKO DIAGRAM NOTATION

EXAMPLE CHEBOTKO DIAGRAM

CASSANDRA DATA MODELING PRINCIPLES
• Know your data
• Know your queries
• Nest data
• Duplicate data

MAPPING RULES
For the query-driven methodology
• Mapping rules ensure that a logical data model is correct
• Each query has a corresponding table
• Tables are designed to allow queries to execute properly
• Tables return data in the correct order

MAPPING RULES
• Mapping Rule 1: Entities and relationships
• Mapping Rule 2: Equality search attributes
• Mapping Rule 3: Inequality search attributes
• Mapping Rule 4: Ordering attributes
• Mapping Rule 5: Key attributes

MR1: ENTITIES AND RELATIONSHIPS

MR2: EQUALITY SEARCH ATTRIBUTES

MR3: INEQUALITY SEARCH ATTRIBUTES

MR4: ORDERING ATTRIBUTES

MR5: KEY ATTRIBUTES

APPLYING MAPPING RULES

PHYSICAL DATA MODEL

WHAT TO ANALYZE
• Finding the problem
• Partition size
• Data redundancy
• Data consistency
• Application-side joins
• Referential integrity constraints
• Transactions
• Data aggregation

DESIGNING THE MODEL

CREATING TABLES

READS AND WRITES IN CASSANDRA

CONSISTENCY LEVEL

WRITE PATH

COMPACTION

READ PATH

READ PROCESSING IN NODE

DEMO
• Cassandra query language
• http://www.planetcassandra.org/try-cassandra/

DEMO CQL SCRIPT
• CREATE KEYSPACE gids WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
• USE gids;
• CREATE TABLE users ( firstname text, lastname text, age int, email text, city text, PRIMARY KEY (lastname));
• INSERT INTO users (firstname, lastname, age, email, city) VALUES ('John', 'Smith', 46, '[email protected]', 'Sacramento');
• INSERT INTO users (firstname, lastname, age, email, city) VALUES ('Jane', 'Doe', 36, '[email protected]', 'Beverly Hills');
• INSERT INTO users (firstname, lastname, age, email, city) VALUES ('Rob', 'Byrne', 24, '[email protected]', 'San Diego');
• SELECT * FROM users;
• UPDATE users SET city = 'San Jose' WHERE lastname = 'Doe';
• SELECT * FROM users WHERE lastname = 'Doe';
• DELETE FROM users WHERE lastname = 'Doe';
• SELECT * FROM users;

COMBINING SPARK AND CASSANDRA
• Spark
  • Great for analyzing large amounts of data
• Cassandra
  • Great for storing large amounts of data

SPARK AND CASSANDRA ARCHITECTURE

CASSANDRA AND SPARK LAMBDA ARCHITECTURE

SPARK AND CASSANDRA

REFERENCES
• http://docs.datastax.com/en/landing_page/doc/landing_page/current.html
• http://spark.apache.org/
• http://cassandra.apache.org/
• http://www.planetcassandra.org/
• http://demo.gethue.com/home
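The HASH RING and SIMPLE STRATEGY slides can be made concrete with a rough sketch of how a partition key might map to replica nodes. Several simplifications are assumed here: the node names are hypothetical, each node gets a single token (no virtual nodes), and MD5 stands in for the hash (Cassandra's default partitioner is Murmur3Partitioner). The placement rule mirrors the idea behind SimpleStrategy: find the first node at or after the key's token, then continue clockwise until replication_factor nodes are chosen. The quorum helper uses the usual formula, floor(RF / 2) + 1.

```python
import hashlib
from bisect import bisect_left
from typing import List


def token(value: str) -> int:
    # Stand-in hash for illustration; not the Murmur3 hash Cassandra uses by default.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas(key: str, nodes: List[str], replication_factor: int) -> List[str]:
    # Place every node on the ring at the position given by its token.
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    # The first replica is the first node whose token is >= the key's token,
    # wrapping around to the start of the ring if there is none.
    start = bisect_left(tokens, token(key)) % len(ring)
    # Remaining replicas are the next nodes clockwise on the ring.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def quorum(replication_factor: int) -> int:
    # Number of replicas that must respond at consistency level QUORUM.
    return replication_factor // 2 + 1


# Hypothetical four-node cluster and the 'Smith' key from the demo script.
cluster = ["node1", "node2", "node3", "node4"]
owners = replicas("Smith", cluster, replication_factor=3)
```

With replication_factor 3, owners holds three distinct node names and quorum(3) is 2, so a QUORUM read or write would need two of those three replicas to respond.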
