Apache Kafka

SmartSquare Project – Data Analytics Pipeline
CityScienceLab | HafenCity University Hamburg

[Figure: pipeline architecture in three stages: Ingestion (Twitter and other external sources, Kafka, Akka actors), Data Management & Analytics (Spark master and workers, Cassandra, MLlib), and Visualization & Data Providing (Akka HTTP REST services, CityScope API, Zeppelin notebooks, dashboards). Kafka topics shown include tweets, filtered and detections; Cassandra keyspaces shown are master_dataset, batch_views and realtime_views. Source: https://public.centerdevice.de/download/4c90efbb-fcfe-4d71-a677-674fbb332319.05014a90-5077-a7b6-9116-0c09af347982]

Apache Kafka and Akka
• Kafka can be described as a publish-subscribe message broker that handles ingestion processes in a distributed and replicated manner (cluster).
• Kafka organizes messages in topics, splits them into one or more partitions, and supports processing chains (executed in parallel on the Spark cluster).
• Example: use Spark to stream raw Twitter data into one topic, filter the data of that topic into a new topic, count related terms, and publish the results into another topic.
• Apache Kafka and Akka together (Akka Streams Kafka) complement the architecture by implementing non-blocking, asynchronous transmission (dynamic message rate) between producers and consumers in order to improve resilience during streaming.

Apache Cassandra (NoSQL)
• Apache Cassandra provides the underlying column-oriented database management system for all ETL processes (Extract, Transform, Load) as well as for parallel and distributed analytics on top of the Spark/Cassandra cluster.
• It persists and replicates its data across a masterless, scalable, time-series-optimized cluster.
• Columns are stored sorted by their column keys, and rows are split and distributed over the cluster according to their row keys.
• Data modelling has to follow the queries expected later in order to achieve the best read/write performance while serving ETL and analytics processes.

Apache Zeppelin
• Apache Zeppelin is used as a data scientist's playground (notebooks) to explore, analyze and visualize data and findings.
• It integrates Apache Spark and its Cassandra and Kafka connectors to create batch and real-time views.
• It offers Scala, Java, Python and R interpreters out of the box.
• It supports standard visualizations such as pivot tables, diagrams, maps and heat maps out of the box, and allows other JavaScript frameworks to be integrated.
• It is also used to build admin and stakeholder dashboards.

Cluster Computing
[Figure source: http://silverpond.com.au/2016/10/06/balancing-spark.html]
• Apache Spark relies on Resilient Distributed Datasets (RDDs) and uses fast in-memory technologies on each of its worker nodes.
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
• Cassandra and Kafka connectors are used to read, write and stream to and from Cassandra and Kafka within Spark applications.

Machine Learning
• Frameworks such as Spark's MLlib or Google's TensorFlow are embedded within the pipeline to perform complex tasks on distributed datasets in parallel through Spark's distributed worker instances.
• Natural Language Processing with OpenNLP and UIMA.

Detection & Tracking
• NVIDIA's Jetson TX2 embedded module, together with FLIR's Blackfly cameras and Tamron lenses, is used to record the movement trajectories of people and vehicles.
• State-of-the-art detection and tracking algorithms (mostly based on CNNs or other types of neural network) are used and tested with regard to their capability to perform in near real time on the Jetson TX2 and its GPU (runtime vs. accuracy).

Next Steps and further Questions
• Connect more data sources besides Twitter's Streaming API and implement their ingestion processes within Kafka using Akka and Scala.
• What are the relevant indicators and suitable analytical models, techniques and algorithms to support stakeholders' decision making?
• How to define, measure and interpret the "meaning" of a plaza in different contexts, like business or culture?
• Implement the analytics pipeline accordingly and build up stakeholder dashboards to visualize relevant data and findings.
• Find the detection and tracking algorithms with the best tradeoff between runtime and accuracy to record real-time trajectories using the NVIDIA Jetson TX2.

November 2017 - Marc-André Vollstedt
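
The processing-chain example on the poster (raw Twitter data into one topic, a filtered copy into a second topic, term counts into a third) can be illustrated with a small sketch. This is not the project's code and uses neither a Kafka client nor Spark: `InMemoryBroker`, `filter_stage` and `count_stage` are hypothetical names, the broker is a plain in-memory stand-in, and the topic name `term_counts`, the sample tweets, and the use of CRC32 for choosing a partition are assumptions made only for illustration.

```python
import zlib
from collections import Counter, defaultdict


class InMemoryBroker:
    """Toy stand-in for a Kafka cluster (hypothetical, not a Kafka client).

    Each topic is a fixed number of partitions; each partition is an
    append-only list of (key, value) messages.
    """

    def __init__(self, partitions_per_topic=2):
        self.partitions_per_topic = partitions_per_topic
        self.topics = defaultdict(
            lambda: [[] for _ in range(self.partitions_per_topic)]
        )

    def publish(self, topic, key, value):
        # Assumption: CRC32 of the key picks the partition, so messages
        # with the same key always land in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % self.partitions_per_topic
        self.topics[topic][index].append((key, value))
        return index

    def consume(self, topic):
        # Order is preserved within a partition, not across partitions.
        for partition in self.topics[topic]:
            yield from partition


def filter_stage(broker, source, target, keyword):
    """Copy messages whose text contains `keyword` from source to target."""
    for key, text in list(broker.consume(source)):
        if keyword.lower() in text.lower():
            broker.publish(target, key, text)


def count_stage(broker, source, target):
    """Count terms in the source topic and publish one message per term."""
    counts = Counter()
    for _, text in list(broker.consume(source)):
        counts.update(word.strip(".,!?#").lower() for word in text.split())
    counts.pop("", None)  # drop tokens that were pure punctuation
    for term, n in counts.items():
        broker.publish(target, term, n)
    return counts


# Sample data is invented for the sketch.
broker = InMemoryBroker()
raw_tweets = [
    ("user1", "Busy plaza in Hamburg today"),
    ("user2", "Nothing to see here"),
    ("user3", "Hamburg plaza market is open"),
]
for user, text in raw_tweets:
    broker.publish("tweets", user, text)

filter_stage(broker, "tweets", "filtered", keyword="plaza")
counts = count_stage(broker, "filtered", "term_counts")
print(counts.most_common(2))
```

Each stage reads from one topic and writes to another, so stages stay decoupled: a further consumer, such as a dashboard, could read `term_counts` without knowing how the counts were produced. In the pipeline described on the poster, the stages would run on the Spark cluster against real Kafka topics rather than in a single process.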
