Big Data Landscape for

Bob Baran, Senior Sales Engineer
[email protected]
May 12, 2015

Typical Workloads

OLTP Applications
• Typical databases: MySQL, Oracle
• Use cases: ERP, CRM, Supply Chain
• Workload strengths: real-time updates; ACID transactions; high concurrency of small reads/writes; range queries

Real-Time Web, Mobile, and IoT Applications
• Typical databases: MongoDB, Cassandra, MySQL, Oracle
• Use cases: web, mobile, social, IoT
• Workload strengths: real-time updates; high ingest rates; high concurrency of small reads/writes; range queries

Real-Time Operational Reporting
• Typical databases: MySQL, Oracle
• Use cases: operational datastores, Crystal Reports
• Workload strengths: real-time updates; canned, parameterized reports; range queries

Ad-Hoc Analytics
• Typical databases: Paraccel
• Use cases: exploratory analytics, data mining
• Workload strengths: complex queries requiring full table scans; append only

Enterprise Data Warehouses
• Typical databases: Oracle, Sybase IQ
• Use cases: enterprise reporting
• Workload strengths: parameterized reports against historical data

Operational ←→ Analytical

Recent History of RDBMSs

▪ RDBMS Definition
  ▪ Relational with joins
  ▪ ACID transactions
  ▪ Secondary indexes
  ▪ Typically row-oriented
  ▪ Operational and/or analytical workloads
▪ By early 2000s
  ▪ Limited innovation
  ▪ Looked like Oracle and Teradata won…

Hadoop Shakes Up Batch Analytics

▪ Data processing framework
▪ Cheap distributed file system
▪ Brute force, batch processing through MapReduce
▪ Great for batch analytics
▪ Great place to dump data to look at later

NoSQL Shakes Up Operational DBs

▪ NoSQL wave
  ▪ Companies like Google and LinkedIn needed greater scale and schema flexibility
  ▪ New databases developed by developers, not database people
  ▪ Provided scale-out, but lost SQL
▪ Worked well at web startups because:
  ▪ In some cases, use cases did not need ACID
  ▪ Willing to handle exceptions at app level

Convoluted Evolution of Databases

[Figure: database evolution plotted on axes of scalability (scale-up to scale-out) vs. functionality]
• Indexed Files (ISAM) – 1960s
• Hierarchical/Network Databases – 1970s
• Traditional RDBMSs (scale-up) – 1980s-2000s
• Hadoop – 2005
• NoSQL Databases – 2010
• Scale-Out SQL Databases – 2013

Mainstream user changes

▪ Driven by web, social, mobile, and Internet of Things
▪ Major increases in scale – 30% annual data growth
▪ Significant requirements for semi-structured data
  ▪ Though relatively little unstructured
▪ Technology adoption continuum

[Figure: technology adoption continuum – “What is it?” → “Should I use it?” → “Why wouldn’t I use it?” – showing scale-out SQL DBs for operational apps, NoSQL for web apps, Hadoop technologies for analytics, and Cloud]

Schema on Ingest vs. Schema on Read

[Diagram: a data stream feeding an application either through Schema on Ingest or Schema on Read]
• Structured data should always remain structured
• Schema on Read if you only use data a few times a year
• Add schema if data is used regularly

▪ Even “schemaless” MongoDB requires a “schema”
  ▪ “10 Things You Should Know About Running MongoDB At Scale,” by Asya Kamsky, Principal Solutions Architect at MongoDB
  ▪ Item #1 – “have a good schema and indexing strategy”
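To make the “add schema if data is used regularly” guidance concrete, here is a minimal SQL sketch contrasting the two approaches. The table and column names are hypothetical and not part of the original deck.

```sql
-- Schema on read: land raw payloads untyped, for data queried only a few times a year.
CREATE TABLE raw_events (
    payload VARCHAR(32000)
);

-- Schema on ingest: the same data typed up front because it is used regularly,
-- so structured data stays structured and queries stay simple.
CREATE TABLE events (
    event_id   BIGINT      NOT NULL PRIMARY KEY,
    device_id  VARCHAR(36) NOT NULL,
    event_time TIMESTAMP   NOT NULL,
    reading    DECIMAL(10,2)
);
```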

Scale-out is the future of databases

How do I scale?

Scale Up vs. Scale Out

Scale-out approaches:
• NoSQL
• NewSQL
• SQL-on-Hadoop: Hadoop RDBMS and analytic engines
• MPP

NoSQL

Pros
▪ Easy scale-out
▪ Flexible schema
▪ Easier web development with hierarchical data structures (MongoDB)
▪ Cross-data center replication (Cassandra)

Cons
▪ No SQL – requires retraining and app rewrites
▪ No joins – i.e., no cross-row/document dependencies
▪ No reliable updates through transactions across rows/tables
▪ Eventual consistency (Cassandra)
▪ Not designed to do aggregations required for analytics

NewSQL

Pros
▪ Easy scale-out
▪ ANSI SQL – eliminates retraining and app rewrites
▪ Reliable updates through ACID transactions
▪ RDBMS functionality
▪ Strong cross-data center replication (NuoDB)

Cons
▪ Proprietary scale-out, unproven into petabytes
▪ Must manage another distributed infrastructure beyond Hadoop
▪ Cannot leverage Hadoop ecosystem of tools

NewSQL – In-Memory

Pros
▪ Easy scale-out
▪ High performance because everything is in memory
▪ ACID transactions within nodes

Cons
▪ Memory 10-20x more expensive
▪ Limited SQL
▪ Limited cross-node transactions
▪ Proprietary scale-out, unproven into petabytes
▪ Must manage another distributed infrastructure beyond Hadoop
▪ Cannot leverage Hadoop ecosystem

Operational RDBMS on Hadoop

Pros
▪ Easy scale-out
▪ Scale-out infrastructure proven into petabytes
▪ ANSI SQL – eliminates retraining and app rewrites
▪ Reliable updates through ACID transactions
▪ Leverages Hadoop distributed infrastructure and tool ecosystem

Cons
▪ Full table scans slower than MPP DBs, but faster than traditional RDBMSs
▪ Existing HDFS data must be re-loaded through the SQL interface (see the import sketch below)
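For context, “re-loaded through the SQL interface” typically means a bulk-import call rather than hand-written MapReduce. The sketch below uses Apache Derby's standard import procedure (Splice Machine is Derby-based, but its own import procedure name and arguments differ); the table name and file path are hypothetical.

```sql
-- Derby-style bulk import of a delimited file into an existing table
-- (illustrative only; Splice Machine's import procedure has a different signature).
CALL SYSCS_UTIL.SYSCS_IMPORT_TABLE(
    NULL,                 -- schema (NULL = current schema)
    'ORDERS',             -- target table
    '/data/orders.csv',   -- source file (hypothetical path)
    ',',                  -- column delimiter
    '"',                  -- character delimiter
    NULL,                 -- codeset (NULL = platform default)
    0);                   -- 0 = append to existing rows, non-zero = replace
```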

MPP Analytical Databases

Pros
▪ Easy scale-out
▪ Very fast performance for full table scans
▪ Highly parallelized, shared-nothing architectures
▪ May have columnar storage
▪ No maintenance of indexes (Netezza)

Cons
▪ Poor concurrency models prevent support of real-time apps
▪ Poor performance for range queries
▪ Need to redistribute all data to add nodes (hash partitioning)
▪ May require specialized hardware (Netezza)
▪ Proprietary scale-out – cannot leverage Hadoop ecosystem of tools

SQL-on-Hadoop – Analytical Engines

Pros
▪ Easy scale-out
▪ Scale-out proven into petabytes
▪ Leverages Hadoop distributed infrastructure
▪ Can leverage Hadoop ecosystem of tools

Cons
▪ Relatively immature, especially compared to MPP DBs
▪ Limited SQL
▪ Poor concurrency models prevent support of real-time apps
▪ No reliable updates through transactions
▪ Intermediate results must fit in memory (Presto)

Future: Hybrid In-Memory Architectures

▪ Memory cache with disk – unsophisticated memory management
▪ Pure in-memory – very expensive
▪ Hybrid in-memory – flexible, cost-effective; controlled by optimizer; in-memory materialized views?

Summary – Future of Databases

▪ Predicted Trends
  ▪ Scale-out dominates databases
  ▪ Developers stop worrying about data size and develop new data-driven apps
  ▪ Hybrid in-memory architecture becomes mainstream
▪ Predicted Winners
  ▪ Hadoop becomes de facto distributed file system
  ▪ NoSQL used for simple web apps
  ▪ Scale-out SQL RDBMSs replace traditional RDBMSs

Questions?

Bob Baran, Senior Sales Engineer
[email protected]
May 12, 2015

Powering Real-Time Apps on Hadoop

Bob Baran, Senior Sales Engineer
[email protected]
May 12, 2015

Who Are We?

THE ONLY HADOOP RDBMS
Power operational applications on Hadoop
▪ Affordable, Scale-Out – Commodity hardware
▪ Elastic – Easy to expand or scale back
▪ Transactional – Real-time updates & ACID transactions
▪ ANSI SQL – Leverage existing SQL code, tools, & skills
▪ Flexible – Support operational and analytical workloads
10x Better Price/Perf

What People are Saying…
Recognized as a key innovator in databases

Quotes
• “Scaling out on Splice Machine presented some major benefits over Oracle ... automatic balancing between clusters ... avoiding the costly licensing issues.”
• “An alternative to today’s RDBMSes, Splice Machine effectively combines traditional relational database technology with the scale-out capabilities of Hadoop.”
• “The unique claim of … Splice Machine is that it can run transactional applications as well as support analytics on top of Hadoop.”

Awards

Advisory Board
Advisory Board includes luminaries in databases and technology

Mike Franklin – Computer Science Chair, UC Berkeley; Director, UC Berkeley AmpLab; Founder of Apache Spark
Roger Bamford – Former Principal Architect at Oracle; Father of Oracle RAC

Marie-Anne Neimat – Co-Founder, TimesTen Database; Former VP, Database Eng. at Oracle
Ken Rudin – Head of Analytics at Facebook; Former GM of Oracle Data Warehousing

Combines the Best of Both Worlds

Hadoop
▪ Scale-out on commodity servers
▪ Proven to 100s of petabytes
▪ Efficiently handle sparse data
▪ Extensive ecosystem

RDBMS
▪ ANSI SQL
▪ Real-time, concurrent updates
▪ ACID transactions
▪ ODBC/JDBC support

Focused on OLTP and Real-Time Workloads

OLTP Applications
• Typical databases: MySQL, Oracle
• Use cases: ERP, CRM, Supply Chain
• Workload strengths: real-time updates; ACID transactions; high concurrency of small reads/writes; range queries

Real-Time Web, Mobile, and IoT Applications
• Typical databases: MySQL, Oracle, MongoDB, Cassandra
• Use cases: web, mobile, social, IoT
• Workload strengths: real-time updates; high ingest rates; high concurrency of small reads/writes; range queries

Real-Time Operational Reporting
• Typical databases: MySQL, Oracle
• Use cases: operational datastores, Crystal Reports
• Workload strengths: real-time updates; canned, parameterized reports; range queries

Ad-Hoc Analytics
• Typical databases: Greenplum, Paraccel, Netezza
• Use cases: exploratory analytics, data mining
• Workload strengths: complex queries requiring full table scans; append only

Enterprise Data Warehouses
• Typical databases: Teradata, Oracle, Sybase IQ
• Use cases: enterprise reporting
• Workload strengths: parameterized reports against historical data

OLTP Campaign Management: Harte-Hanks

Overview
• Digital marketing services provider
• Unified Customer Profile
• Real-time campaign management
• OLTP environment with BI reports

Challenges
• Oracle RAC too expensive to scale
• Queries too slow – even up to ½ hour
• Getting worse – expect 30-50% data growth
• Looked for 9 months for a cost-effective solution

Solution Diagram
• Cross-channel campaigns
• Real-time personalization
• Real-time actions

Initial Results
• 10-20x price/perf with no application, BI, or ETL rewrites
• ¼ cost with commodity scale-out
• 3-7x faster through parallelized queries

Reference Architecture: Operational Data Lake
Offload real-time reporting and analytics from expensive OLTP and DW systems

[Diagram: OLTP systems (ERP, CRM, Supply Chain, HR, …) feed an operational data lake via data streams or batch updates; the data lake serves operational reports & analytics and real-time, event-driven apps, and feeds the data warehouse via ETL for executive business reports, ad-hoc analytics, and datamarts]

Streamlining the Structured Data Pipeline in Hadoop

Traditional Hadoop Pipeline
• Source systems (ERP, CRM, …) → Sqoop → stored as flat files → apply inferred schema → SQL query engines → BI tools

vs.

Streamlined Hadoop Pipeline
• Source systems (ERP, CRM, …) → existing ETL tool → stored in the same schema → BI tools

Advantages
• Reduced operational costs with less complexity
• Reduced processing time and errors with fewer translations
• Real-time updates for data cleansing
• Better SQL support

Complementing Existing Hadoop-Based Data Lakes
Optimizing storage and querying of structured data as part of ELT or Hadoop query engines

[Diagram: OLTP systems (ERP, CRM, Supply Chain, HR, …) feeding structured and unstructured data into a Hadoop-based data lake with HCatalog and Pig]
1. Schema on Ingest – streamlined, structured-to-structured integration
2. Schema Before Read – repository for structured data or metadata from the ELT process
3. Schema on Read – ad-hoc Hadoop queries across structured and unstructured data

Proven Building Blocks: Hadoop and Derby

APACHE DERBY
▪ ANSI SQL-99 RDBMS
▪ Java-based
▪ ODBC/JDBC compliant

APACHE HBASE/HDFS
▪ Auto-sharding
▪ Real-time updates
▪ Fault tolerance
▪ Scalability to 100s of PBs
▪ Data replication

HBase: Proven Scale-Out

▪ Auto-sharding
▪ Scales with commodity hardware
▪ Cost-effective from GBs to PBs
▪ High availability through failover and replication
▪ LSM-trees

Splice Optimizations to HBase

▪ Splice storage is optimized over raw HBase
  ▪ We use bitmap indexes to store data in packed byte arrays
  ▪ This approach allows us to store data in a much smaller footprint than traditional HBase
  ▪ With a TPC-H schema, we found a 10X reduction in data size
  ▪ Requires far less hardware and resources to perform the same workload
▪ Asynchronous write pipeline
  ▪ HBase writes (puts) are not pipelined and block while the call is being made
  ▪ Splice’s write pipeline allows us to reach speeds of over 100K writes/second per HBase node
  ▪ This allows extremely high ingest speeds without requiring more hardware and custom code
▪ Transactions
  ▪ As scalability increases, the likelihood of failures increases
  ▪ We utilize Snapshot Isolation to make sure that a failure does not corrupt existing data
▪ RDBMS capabilities (see the sketch below)
  ▪ The use of SQL vs. custom scans, and the ability for an optimizer to choose the best access path to the data
  ▪ Core data management functions (indexes, constraints, typed columns, etc.)
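As a rough illustration of those core data management functions (not taken from the deck; table and column names are made up), typed columns, constraints, and a secondary index give the optimizer an access path to choose instead of a hand-coded HBase scan:

```sql
-- Illustrative only: typed columns, constraints, and a secondary index.
CREATE TABLE orders (
    order_id    BIGINT        NOT NULL PRIMARY KEY,
    customer_id BIGINT        NOT NULL,
    status      VARCHAR(16)   NOT NULL,
    order_total DECIMAL(12,2) NOT NULL CHECK (order_total >= 0),
    created_at  TIMESTAMP     NOT NULL
);

CREATE INDEX idx_orders_customer ON orders (customer_id, created_at);

-- The optimizer can satisfy this with an index lookup rather than a full scan.
SELECT order_id, status, order_total
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC;
```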

Distributed, Parallelized Query Execution

▪ Parallelized computation across the cluster
▪ Moves computation to the data
▪ Utilizes HBase co-processors
▪ No MapReduce

[Diagram: HBase co-processor executing within the HBase server memory space]

ANSI SQL-99 Coverage

▪ Data types – e.g., INTEGER, REAL, CHARACTER, DATE, BOOLEAN, BIGINT
▪ DDL – e.g., CREATE TABLE, CREATE SCHEMA, ALTER TABLE
▪ Predicates – e.g., IN, BETWEEN, LIKE, EXISTS
▪ DML – e.g., INSERT, DELETE, UPDATE, SELECT
▪ Query specification – e.g., SELECT DISTINCT, GROUP BY, HAVING
▪ SET functions – e.g., UNION, ABS, MOD, ALL, CHECK
▪ Aggregation functions – e.g., AVG, MAX, COUNT
▪ String functions – e.g., SUBSTRING, concatenation, UPPER, LOWER, POSITION, TRIM, LENGTH
▪ Conditional functions – e.g., CASE, searched CASE
▪ Privileges – e.g., privileges for SELECT, DELETE, UPDATE, INSERT, EXECUTE
▪ Cursors – e.g., updatable, read-only, positioned DELETE/UPDATE
▪ Joins – e.g., INNER JOIN, LEFT OUTER JOIN
▪ Transactions – e.g., COMMIT, ROLLBACK, READ COMMITTED, REPEATABLE READ, READ UNCOMMITTED, Snapshot Isolation
▪ Sub-queries
▪ Triggers
▪ User-defined functions (UDFs)
▪ Views – including grouped views
▪ Window functions (RANK, ROW_NUMBER, …)
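A short, hypothetical query (illustrative table names, not from the deck) that exercises several of the listed features: an inner join, a sub-query predicate, GROUP BY/HAVING, aggregates, and a searched CASE expression.

```sql
SELECT c.region,
       COUNT(*) AS order_count,
       AVG(o.order_total) AS avg_total,
       -- searched CASE over an aggregate
       CASE WHEN AVG(o.order_total) > 500 THEN 'high' ELSE 'normal' END AS segment
FROM customers c
INNER JOIN orders o
        ON o.customer_id = c.customer_id
WHERE o.customer_id IN (SELECT customer_id FROM active_accounts)  -- sub-query predicate
GROUP BY c.region
HAVING COUNT(*) > 10;
```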

Window Functions (Advanced Analytics Functions)

▪ Analytics such as running totals, moving averages, and Top-N queries
▪ Performs calculations across a set of table rows related to the current row in the window
▪ Similar to aggregate functions, with two significant differences:
  ▪ Outputs one row for each input value it operates upon
  ▪ Groups rows with window partitioning and frame clauses vs. GROUP BY
▪ Splice Machine currently supports: RANK, DENSE_RANK, ROW_NUMBER, AVG, SUM, COUNT, MAX, MIN (see the example below)
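A minimal sketch using two of the supported functions against a hypothetical orders table: RANK for a Top-N style ranking within each region, and SUM as a running total.

```sql
SELECT region,
       customer_id,
       order_date,
       order_total,
       -- rank customers by spend within each region
       RANK() OVER (PARTITION BY region ORDER BY order_total DESC) AS spend_rank,
       -- running total by date within each region
       SUM(order_total) OVER (PARTITION BY region ORDER BY order_date) AS running_total
FROM orders;
```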

Lockless, ACID transactions

• Adds multi-row, multi-table transactions to HBase with rollback
• Fast, lockless, high concurrency
• Extends research from Google Percolator, Yahoo Labs, and the University of Waterloo
• Patent-pending technology
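A minimal sketch of what a multi-row, multi-table transaction looks like to the SQL developer, assuming autocommit has been turned off on the connection; the accounts and transfer_log tables are hypothetical.

```sql
-- Move funds between two rows and record the transfer; either all three
-- statements become visible together, or none do.
UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100.00 WHERE account_id = 2;
INSERT INTO transfer_log (from_id, to_id, amount) VALUES (1, 2, 100.00);

COMMIT;       -- make the changes durable and visible
-- ROLLBACK;  -- alternatively, undo all three if any step failed
```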

Customer Performance Benchmarks
Typically 10x price/performance improvement

• Speed: 3-7x, 20x, and 7x faster
• Price/performance: 10-20x, 10x, and 30x lower

Applications, BI / SQL tool support via ODBC/JDBC

Splice Machine Safe Journey Process

Initial Overview (1 day)
• Splice Machine overview
• Set the stage for Rapid Assessment

Rapid Assessment (5 days, including prep)
• Half-day workshop
• Assess Splice Machine fit
• Identify target use cases
• Risk assessment of use cases
• Agree upon success criteria

Proof of Concept (2 weeks)
• Prove client use case on Splice Machine hosted environment
• Benchmark using customer queries and schema
• On customer data or generated data that resembles customer data
• Prove Splice Machine against key requirements

Pilot Project (3-6 weeks)
• Identify paid pilot use case with limited change management impact
• Install Splice Machine on client environment
• Deploy use case/application on client data

Enterprise Implementation (3-10 months)
• Kickstart
• Requirements
• Design/Dev
• QA Test
• Cutover
• Hypercare

Safe Journey Enterprise Implementation Stages

Kickstart → Requirements → Design/Dev → QA Test → Parallel Ops → Cutover → Hypercare

Kickstart – Packaged 2-week program to get the new client off to a strong start on a solid foundation. Incorporates: Splice Architecture & Development courses; Risk Assessment Workshop; Implementation Blueprint.

Requirements – Establish a clear functional and performance requirements document. Can be a “refresh only” if the project is a port of an existing app to Splice.

Design/Dev – Based on Agile method. The phase is divided into 2-week sprints. Stories covering a set of capabilities are assigned to each developer. A design doc is created, code is written, and unit tests are written and executed until they pass.

QA Test – The QA test period includes: Performance Test; End-to-End System Integration Test; User Acceptance Test. Depending on the scale of the project, there may be multiple iterations of each test with break/fix cycles in between.

Parallel Ops (optional) – Used when an existing system is being ported to Splice Machine from another system. The new Splice Machine-based system runs side by side with the old system for a period of time.

Cutover – Formal period in which the Splice-based solution goes live and the pre-existing database is deprecated.

Hypercare (optional) – Period of on-site support during cutover and for a period immediately following go-live.

Common Risks and Mitigation Strategies

Data migration
• Risk: Clients are typically migrating very large data sets to Splice Machine. Issues with the migration of certain data types, such as dates, can waste a lot of time reloading large amounts of data.
• Solution: First migrate a small subset of tables that contain all required data types. Ensure these migrate successfully before migrating the entire database.

Changes to source schema during implementation
• Risk: Changes to the schema of the source database during the course of the implementation will lead to a significant amount of rework and reloading of data, adding unplanned time to the project.
• Solution: All stakeholders agree up front to freeze the schema as of an agreed-upon date prior to the Design/Development stage.

Stored procedure conversion
• Risk: Stored procedures need to be converted from the original language (e.g., PL/SQL) to Java. Complex stored procedures may include significant amounts of procedural code as well as multiple SQL statements.
• Solution: Carefully review the function and design of SPs to be converted. Leverage an automated conversion tool where appropriate (a conversion sketch follows below).
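As a sketch of what a converted stored procedure can look like (class, method, and table names are illustrative, not from the deck): the procedural logic is rewritten as a Java method and then registered through Derby-style DDL, since Splice Machine is built on Apache Derby per the “Proven Building Blocks” slide. Exact procedure options may vary by deployment.

```sql
-- Register a Java method as a stored procedure (names are hypothetical).
CREATE PROCEDURE update_order_status (IN order_id BIGINT, IN new_status VARCHAR(16))
    LANGUAGE JAVA
    PARAMETER STYLE JAVA
    MODIFIES SQL DATA
    EXTERNAL NAME 'com.example.procs.OrderProcs.updateOrderStatus';

-- Existing callers invoke it the same way they called the old stored procedure.
CALL update_order_status(1001, 'SHIPPED');
```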

SQL compatibility
• Risk: Even though Splice Machine conforms to the ANSI 99+ SQL standard, virtually every database has unique syntax, and some queries may need to be modified. Additionally, SQL generated by packaged applications may not be modifiable.
• Solution: Formal review of SQL syntax during the Requirements phase. Modify relevant queries during the Design/Dev phase. If a query is not modifiable, an enhancement request for Splice Machine to support the required syntax out of the box may be needed.

Indexing
• Risk: Proper indexing is usually important to maximize the performance of Splice Machine. Splice Machine indexes are likely to differ from the indexes required for a traditional RDBMS.
• Solution: Ensure that query performance SLAs are clearly defined in the Requirements phase. Incorporate proper index design early in the Design/Dev phase. Assume some iteration will be required to achieve the optimal indexes.

Hadoop knowledge
• Risk: Project stakeholders often have limited knowledge of Hadoop and the distributed computing paradigm. This can lead to confusion about the Splice Machine value proposition and the advantages of moving to a scale-out architecture.
• Solution: Include the Splice Machine Kickoff Program at the beginning of the implementation project. This includes essential training on Hadoop and related fundamental concepts critical to realizing value from a Splice Machine deployment.

Summary

THE ONLY HADOOP RDBMS
Power operational applications on Hadoop
▪ Affordable, Scale-Out – Commodity hardware
▪ Elastic – Easy to expand or scale back
▪ Transactional – Real-time updates & ACID transactions
▪ ANSI SQL – Leverage existing SQL code, tools, & skills
▪ Flexible – Support operational and analytical workloads
10x Better Price/Perf

Questions?

Bob Baran, Senior Sales Engineer
[email protected]
May 12, 2015