
Spark and Cassandra

Solving classical data analytics tasks with modern distributed databases

Artem Aliev, Software Developer, DataStax

Agenda:

http://lambda-architecture.net

Apache Cassandra

Apache Cassandra™ is a massively scalable NoSQL OLTP database.

• Masterless architecture with read/write anywhere design.
• Continuous availability with no single point of failure.
• Multi-data center and cloud availability zone support.
• Linear scale performance with online capacity expansion.
• CQL – SQL-like language.

[Diagram: linear scalability – adding nodes grows cluster throughput from 100,000 to 200,000 to 400,000 txns/sec]

Cassandra Architecture Overview

• Cassandra was designed with the understanding that system/hardware failures can and do occur
• Peer-to-peer, distributed system
• All nodes the same
• Multi-data center support out of the box
• Configurable replication factor
• Configurable data consistency per request
• Active-Active replication architecture

[Diagram: replication across two data centers (DC: USA, DC: EU) – each write is stored as a 1st, 2nd, and 3rd copy on different nodes in each DC]

Cassandra Query Language

CREATE TABLE sporty_league (
  team_name varchar,
  player_name varchar,
  jersey int,
  PRIMARY KEY (team_name, player_name)
);

SELECT * FROM sporty_league WHERE team_name = 'Mighty Mutts' AND player_name = 'Lucky';

INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('Mighty Mutts', 'Felix', 90);

Apache Spark

• Distributed computing framework
• Created at UC Berkeley AMPLab in 2009
• Open source since 2010, now an Apache project
• Solves problems Hadoop is bad at:
  • Iterative algorithms
  • Interactive machine learning
• More general purpose than MapReduce
• Streaming!
• In memory!
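This is why iterative workloads benefit. A minimal sketch (assuming an existing SparkContext sc and a hypothetical HDFS file with one number per line) of keeping data in memory across iterations:

import org.apache.spark.SparkContext._  // implicit RDD conversions (Spark 1.x)

// parsed once, kept in memory thanks to cache()
val data = sc.textFile("hdfs://...").map(_.toDouble).cache()

var total = 0.0
for (i <- 1 to 10) {
  // every further iteration reuses the cached RDD instead of re-reading HDFS
  total += data.map(_ * i).sum()
}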

Fast

[Chart: Logistic Regression performance – running time (s) vs. number of iterations (1–30). Hadoop: ~110 sec per iteration; Spark: ~80 sec for the first iteration, ~1 sec for further iterations]

Simple

Hadoop WordCount (Java):

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}

The same word count in Spark (Python):

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

A Quick Comparison to Hadoop

[Diagram: Hadoop writes intermediate results between map() and reduce() stages through HDFS; Spark chains transform() steps in memory and can cache() intermediate data]

Hadoop API: map, reduce
Spark API: map, reduce, sample, filter, count, take, groupBy, fold, first, sort, reduceByKey, partitionBy, union, groupByKey, mapWith, join, cogroup, pipe, leftOuterJoin, cross, save, rightOuterJoin, zip, ...

API

*Resilient Distributed Datasets (RDD)

* Collections of objects spread across a cluster, stored in RAM or on disk
* Built through parallel transformations
* Automatically rebuilt on failure
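A minimal sketch (assuming an existing SparkContext sc) of building an RDD from a local collection; the map/reduceByKey and collect calls preview the transformations and actions listed next:

import org.apache.spark.SparkContext._  // implicit pair-RDD functions (Spark 1.x)

val words  = sc.parallelize(Seq("spark", "cassandra", "spark"))  // distribute a local collection
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)           // transformations are lazy: nothing runs yet
counts.collect().foreach(println)                                // an action triggers the actual computation
// lost partitions are rebuilt by replaying this chain of transformations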

*Operations

* Transformations (e.g. map, filter, groupBy)
* Actions (e.g. count, collect, save)

Operator Graph: Optimization and Fault Tolerance

[Diagram: operator graph – RDDs A–F connected by map, filter, groupBy, and join; the graph is split into Stage 1, Stage 2, and Stage 3 at shuffle boundaries (legend: RDD, cached partition)]

Why Spark on Cassandra?

*Data model independent queries

*Cross-table operations (JOIN, UNION, etc.)

*Complex analytics (e.g. machine learning)

*Data transformation, aggregation, etc.

*Stream processing

How to Spark on Cassandra?

*DataStax Cassandra Spark driver
 * Open source: https://github.com/datastax/cassandra-driver-spark
*Compatible with
 * Spark 0.9+
 * Cassandra 2.0+
 * DataStax Enterprise 4.5+

Cassandra Spark Driver

*Cassandra tables exposed as Spark RDDs

*Read from and write to Cassandra

*Mapping of C* tables and rows to Scala objects

*All Cassandra types supported and converted to Scala types

*Server side data selection

*Spark Streaming support

*Scala and Java support

Connecting to Cassandra

// Import Cassandra-specific functions on SparkContext and RDD objects
import com.datastax.driver.spark._

// Spark connection options
val conf = new SparkConf(true)
  .setMaster("spark://192.168.123.10:7077")
  .setAppName("cassandra-demo")
  .set("cassandra.connection.host", "192.168.123.10") // initial contact
  .set("cassandra.username", "cassandra")
  .set("cassandra.password", "cassandra")

val sc = new SparkContext(conf)

Accessing Data

CREATE TABLE test.words (word text PRIMARY KEY, count int);

INSERT INTO test.words (word, count) VALUES ('bar', 30);
INSERT INTO test.words (word, count) VALUES ('foo', 20);

*Accessing the table above as an RDD:

// Use table as RDD
val rdd = sc.cassandraTable("test", "words")
// rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

rdd.toArray.foreach(println)
// CassandraRow[word: bar, count: 30]
// CassandraRow[word: foo, count: 20]

rdd.columnNames // Stream(word, count)
rdd.size        // 2

val firstRow = rdd.first
// firstRow: CassandraRow = CassandraRow[word: bar, count: 30]
firstRow.getInt("count") // Int = 30

Saving Data

val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
// newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

newRdd.saveToCassandra("test", "words", Seq("word", "count"))

*RDD above saved to Cassandra:

SELECT * FROM test.words;

 word | count
------+-------
  bar |    30
  foo |    20
  cat |    40
  fox |    50

(4 rows)

Type Mapping

CQL Type        Scala Type
ascii           String
bigint          Long
boolean         Boolean
counter         Long
decimal         BigDecimal, java.math.BigDecimal
double          Double
float           Float
inet            java.net.InetAddress
int             Int
list            Vector, List, Iterable, Seq, IndexedSeq, java.util.List
map             Map, TreeMap, java.util.HashMap
set             Set, TreeSet, java.util.HashSet
text, varchar   String
timestamp       Long, java.util.Date, java.sql.Date, org.joda.time.DateTime
timeuuid        java.util.UUID
uuid            java.util.UUID
varint          BigInt, java.math.BigInteger

*Nullable values are mapped to Option

Mapping Rows to Objects

*Mapping rows to Scala Case Classes

*CQL underscore_case columns are mapped to Scala camelCase properties

*Custom mapping functions (see docs)

CREATE TABLE test.cars (
  id text PRIMARY KEY,
  model text,
  fuel_type text,
  year int
);

maps to

case class Vehicle(
  id: String,
  model: String,
  fuelType: String,
  year: Int
)

sc.cassandraTable[Vehicle]("test", "cars").toArray
// Array(Vehicle(KF334L, Ford Mondeo, Petrol, 2009),
//       Vehicle(MT8787, Hyundai x35, Diesel, 2011))

Server Side Data Selection

*Reduce the amount of data transferred

*Selecting columns

sc.cassandraTable("test", "users").select("username").toArray.foreach(println) // CassandraRow{username: john} // CassandraRow{username: tom}

*Selecting rows (by clustering columns and/or secondary indexes)

sc.cassandraTable("test", "cars").select("model").where("color = ?", "black").toArray.foreach(println) // CassandraRow{model: Ford Mondeo} // CassandraRow{model: Hyundai x35} Spark SQL

[Diagram: Spark stack – Spark SQL, Streaming, ML, and Graph libraries on top of the Spark general execution engine, running over Cassandra/HDFS]

Spark SQL

*SQL query engine on top of Spark

*Hive compatible (JDBC, UDFs, types, metadata, etc.)

*Support for in-memory processing

*Pushdown of predicates to Cassandra when possible

Spark SQL Example

import com.datastax.spark.connector._

// Connect to the Spark cluster
val conf = new SparkConf(true)...
val sc = new SparkContext(conf)

// Create Cassandra SQL context
val cc = new CassandraSQLContext(sc)

// Execute SQL query
val df = cc.sql("SELECT * FROM keyspace.table WHERE ...")

Spark Streaming

[Diagram: Spark stack – Spark SQL, Streaming, ML, and Graph libraries on top of the Spark general execution engine, running over Cassandra/HDFS]

Spark Streaming

*Micro batching
*Each batch represented as an RDD (see the sketch below)
*Fault tolerant
*Exactly-once processing
*Unified stream and batch processing framework
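A minimal sketch (hypothetical socket source on localhost:9999, assuming a SparkConf named conf as in the earlier examples) showing that every micro-batch arrives as an ordinary RDD:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc   = new StreamingContext(conf, Seconds(1))   // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // hypothetical text stream
lines.foreachRDD { rdd =>
  // each batch interval hands over a plain RDD
  println("lines in this batch: " + rdd.count())
}
ssc.start()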

[Diagram: a continuous data stream is chopped into per-batch RDDs, which together form a DStream]

Streaming Example

import com.datastax.spark.connector.streaming._

// Spark connection options
val conf = new SparkConf(true)...

// streaming with a 1-second batch window
val ssc = new StreamingContext(conf, Seconds(1))

// stream input
val lines = ssc.socketTextStream(serverIP, serverPort)

// count words
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

// stream output
wordCounts.saveToCassandra("test", "words")

// start processing
ssc.start()
ssc.awaitTermination()

Spark MLlib

[Diagram: Spark stack – Spark SQL, Streaming, ML, and Graph libraries on top of the Spark general execution engine, running over Cassandra/HDFS]

Spark MLlib

*Classification (SVM, LogisticRegression, NaiveBayes, RandomForest, …)

*Clustering (Kmeans, LDA, PIC)

*Linear Regression

*Collaborative Filtering

*Dimensionality reduction (SVD, PCA)

*Frequent pattern mining

*Word2Vec

Spark MLlib Example

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.classification.NaiveBayes

// Spark connection options
val conf = new SparkConf(true)...

// read data
val data = sc.cassandraTable[Iris]("test", "iris")

// convert data to LabeledPoint
val parsedData = data.map { i =>
  LabeledPoint(class2id(i.species), Array(i.petal_l, i.petal_w, i.sepal_l, i.sepal_w))
}

// train the model
val model = NaiveBayes.train(parsedData)

// predict
model.predict(Array(5, 1.5, 6.4, 3.2))
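The Iris case class and the class2id helper are not shown on the slide; a minimal sketch matching the column names used above (the numeric labels are an assumption):

case class Iris(species: String, petal_l: Double, petal_w: Double,
                sepal_l: Double, sepal_w: Double)

// hypothetical mapping from species name to a numeric class label for MLlib
val class2id = Map("setosa" -> 0.0, "versicolor" -> 1.0, "virginica" -> 2.0)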

http://www.datastax.com/dev/blog/interactive-advanced-analytic-with-dse-and-spark-mllib

Spark API languages:
• Scala – native
• Python – popular
• Java – ugly
• R – no closures

Questions?