Sub-Second Analytics for User-Facing Applications with Apache Spark™ and Rockset
Venkat Venkataramani
CEO and Co-Founder, Rockset

About me
Venkat Venkataramani, Co-Founder & CEO
2016 - present
2007 - 2015
2002 - 2007

Agenda
▪ Large-scale data applications
▪ Rockset: designed to serve data applications
▪ Reference architectures with Apache Spark and Rockset
▪ Conclusion

Large-scale data applications

Building apps on Apache Spark
[Diagram: BI, Machine Learning, Data Pipelines; Apache Spark; Data Lake]

Building apps on Apache Spark
[Diagram: BI, Machine Learning, Data Pipelines, Data Apps (low latency, high concurrency); Serving Tier (MySQL, Postgres); Apache Spark; Data Lake]

What happens when we get to large scale?
[Diagram: BI, Machine Learning, Data Pipelines, Large-Scale Data Apps (TBs of data, low latency, high concurrency); Serving Tier (MySQL, Postgres); Apache Spark; Data Lake]

[Diagram: BI and SQL Analytics, Data Science and Machine Learning, Data Engineering, Real-Time Data Applications; Apache Spark; Rockset; Data Lake]

Example: Investment decisions at Sequoia Capital
• Data sets from multiple vendors loaded regularly into data lake
• Run data enrichment in Apache Spark/Databricks
• Entity 360: combine with internal data sources for complete view of potential investments
• Investment team and data scientists use app to help make investment decisions

Example: Personalized recommendations at Ritual
• Health technology company selling multivitamins online
• Customer data from Segment loaded into data warehouse and data lake
• Machine learning modeling in Apache Spark/Databricks
• Build personalized offers and bundles for their online portal and checkout page

Challenges when building large-scale data apps
• Speed
  ▪ Cannot power fast analytics on OLTP
  ▪ Cannot build indexes on data lakes/warehouses
• Scale
  ▪ Single-node systems cannot scale horizontally
  ▪ Bulk loads take too long
• Operational complexity
  ▪ Extensive performance engineering required
  ▪ Periodic reloads result in downtime

Modern data apps
Modern data applications demand speed and scale
But existing solutions force you to pick one
• Pick speed ➔ hard to scale (OLTP: MySQL, Postgres)
OR
• Pick scale ➔ slow and expensive (data warehouse, data lake)

Rockset: Designed to serve data apps

What is Rockset?
Real-time indexing database for modern data apps at massive scale without operational overhead

Speed - Converged Index
• All fields are indexed in inverted, columnar and row indexes
• Accelerates search, aggregation and join queries
• No index definition required
• Low latency for both highly selective queries and large scans
• Optimizer picks between
  ▪ inverted index (Index Filter operator)
  ▪ columnar format (Column Scan operator)
  ▪ inverted index (Index Scan operator)

Scale - Disaggregated, cloud-native architecture
[Architecture diagram]

Simplicity - Bulk ingestion
• Bulk ingest mode for large-scale data
• Scale out number of ingesters, use larger leaf pods
• Write to S3 instead of log store
• No downtime
[Diagram: Scale ingest]

Reference architectures with Apache Spark and Rockset

Building data apps with Apache Spark and Rockset
[Architecture diagram]

Data architecture for personalized recommendations
[Architecture diagram]

Conclusion

Speed: Sub-second analytics for user-facing apps
• Star Schema Benchmark
  ▪ Industry-standard benchmark to measure database performance for analytical apps
  ▪ All queries ran in <1 sec
  ▪ Median runtime of 254 ms

Scale: Lower cost with higher compute efficiency
• Customer reduced overall bill by 75%
• Data app running 24x7
• Converged indexing increases storage cost but significantly reduces compute required to run queries

Zero Ops Overhead: Compress development time
• Customer reduced development from 6 months to 3 days
• SQL analytics on semi-structured data without data prep
• No performance engineering required
• Serverless for low ops

“Our users want to search on any field, anywhere, and we needed to give them that ability. To have this unique capability offered as a service was exactly what we needed to deliver real-time search months ahead of plan.”
- Todd McPartlin, Command Alkon

Rockset: Serving tier to complement Apache Spark
[Diagram: BI, Machine Learning, Data Pipelines, Large-Scale Data Apps (TBs of data, low latency, high concurrency); Serving Tier (sub-second analytics, scale compute/storage as needed, serverless); Apache Spark; Data Lake]

Learn more
• Stop by the Rockset booth, or
• Request Demo at Rockset.com ($300 in trial credits)

Thank you
Venkat Venkataramani
[email protected]
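
Illustration: the Converged Index idea
The "Speed - Converged Index" slides say that every field is indexed in inverted, columnar and row indexes, and that an optimizer picks an access path depending on how selective the query is. The toy Python sketch below illustrates that idea only. It is not Rockset's implementation: the class ToyConvergedIndex, its method names and the 10% selectivity threshold are all hypothetical, chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


class ToyConvergedIndex:
    """Toy model: every field of every document goes into three structures."""

    def __init__(self):
        # Row index: doc_id -> whole document.
        self.rows = {}
        # Columnar index: field -> {doc_id: value}.
        self.columns = defaultdict(dict)
        # Inverted index: field -> value -> set of doc_ids.
        self.inverted = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, doc):
        """Index a document; no per-field index definition is needed."""
        self.rows[doc_id] = doc
        for field, value in doc.items():
            self.columns[field][doc_id] = value
            self.inverted[field][value].add(doc_id)

    def choose_access_path(self, field, value, selectivity_threshold=0.1):
        """Hypothetical heuristic: inverted index if few rows match, else column scan."""
        matches = len(self.inverted[field].get(value, ()))
        if self.rows and matches / len(self.rows) <= selectivity_threshold:
            return "inverted"
        return "column_scan"

    def find(self, field, value):
        """Return (access path used, sorted matching doc_ids) for field == value."""
        path = self.choose_access_path(field, value)
        if path == "inverted":
            ids = self.inverted[field].get(value, set())
        else:
            ids = {d for d, v in self.columns[field].items() if v == value}
        return path, sorted(ids)

    def average(self, field):
        """Aggregations read a single column instead of whole rows."""
        return mean(list(self.columns[field].values()))


index = ToyConvergedIndex()
for i in range(20):
    index.add(i, {"user": f"u{i}", "country": "US" if i % 2 == 0 else "IN", "spend": i})

print(index.find("user", "u7"))        # selective lookup, served by the inverted index
print(index.find("country", "US")[0])  # matches half the rows, so a column scan is chosen
print(index.average("spend"))          # aggregation over one column
```

Keeping three copies of each field is what the conclusion slide means by higher storage cost in exchange for less compute per query: a selective lookup touches only the matching entries, and an aggregation touches only one column.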

