WHITE PAPER

The Guide to Scaling Operational Applications on Hadoop Cost-effective ways to power your web, mobile, OLTP, and Internet of Things applications

www.splicemachine.com

The content of this white paper, including the ideas and concepts contained within, is the property of Splice Machine, Inc. This document is considered proprietary and confidential, and should not be reproduced or reused in any way without the permission of Splice Machine. Published 03/15

© Copyright 2015 Splice Machine. All rights reserved.

Table of Contents

The Guide to Scaling Operational Applications on Hadoop ...... 2
The Challenge ...... 3
Traditional Scale-Up RDBMSs ...... 3
NoSQL ...... 4
NewSQL ...... 5
SQL-on-Hadoop and Hadoop RDBMSs ...... 5
Four Reasons to Choose Hadoop RDBMS ...... 6
Harte Hanks Case Study ...... 8
About Splice Machine ...... 10

Guide to Scaling | 1

The Guide to Scaling Operational Applications on Hadoop

Cost-effective Ways to Power Your Web, Mobile, OLTP and Internet of Things Applications

Powering real-time applications involves scaling not only existing enterprise applications, but also new applications that have emerged from the web, social media, and mobile devices. Overwhelmed by massive data growth, businesses must carefully select cost-effective technologies that enable applications to manage both today's data volumes and their exponential growth in the future.

DATA-INTENSIVE APPLICATIONS REQUIRE A DATABASE CAPABLE OF:
• Complex, interactive queries
• Data updates in real time
• High concurrency of small reads and writes

Saving the Applications of Yesterday

Business applications commonly access data using the Structured Query Language (SQL) through legacy relational database management systems (RDBMSs) such as Oracle, IBM DB2, or MySQL. However, many existing SQL applications are being overwhelmed by exponential data growth, and the legacy RDBMSs that support them often hit a wall, from either a performance or a cost perspective. Scaling up these systems can mean expensive database hardware replacements, while moving to NoSQL technology usually requires cost-prohibitive application rewrites.

Supporting the Applications of Today and Tomorrow

The modern world of web, social, mobile, and Internet of Things (IoT) applications is also very demanding on databases, requiring them to ensure real-time responses while handling massively increasing data volumes at ever-greater velocity.

These new applications extract information from a plethora of sources within the Internet of Things, including countless sensors, social platforms, and mobile devices. They serve a number of use cases, such as digital marketing, healthcare, and fraud detection, which share these workload characteristics:

• Complex, interactive queries
• Data updates in real time
• High concurrency of small reads and writes

The key to riding the wave of data growth and enabling digital applications in real time is to select a database that can support massive data growth without bursting IT budgets.

The Scalability Challenge

Scalability is the ability of a system to accommodate a growing amount of data and/or workload. There are generally two ways to scale:

• Scaling up - adding more resources to a single server
• Scaling out - adding more servers that cooperate as a single system

THE PAST                                     THE FUTURE
Small data volumes; purge often              Massive data volumes; retain forever
Slow data velocity                           Rapid data velocity
Rigid, static data of similar structure      Flexible, fluid data of many structures
One-to-one, shared disk architecture         Many-to-many, shared nothing architecture
Primary storage                              Scale both writes and reads
Scale-up on proprietary hardware             Scale-out on commodity hardware

Traditional Scale-Up RDBMSs The decision ultimately comes down to price/performance. If money is no object, scale-up platforms (e.g., Oracle Exadata, IBM DB2) work well because they often do not require changes to applications, and many businesses appreciate the reliability and security of a proprietary system. However, note that scaling up requires a hardware migration every time a new larger server replaces the old one, a process which suffers from the law of diminishing returns: costs will rise significantly faster than performance, and eventually technological innovation will plateau to the point when higher performance cannot be achieved no matter the price.

NoSQL Databases

For new web, social, mobile, and IoT applications, NoSQL databases, such as MongoDB and Cassandra, have become popular options. They require less migration over the lifetime of an application, because they automatically redistribute load as new nodes are added and do not require discrete hardware upgrade projects.

QUESTIONS TO CONSIDER WHEN SELECTING YOUR NEXT DATABASE:
• Can it scale operational applications?
• Can it support ACID transactions?
• Is it integrated with Hadoop?
• Is it real-time?
• Is it cost-effective?

However, NoSQL databases by design lack SQL, joins, and ACID transactions, so they have difficulty handling data that has dependencies across multiple rows, documents, or tables. This might seem to be an academic issue, but it can have major impacts on application design.

Consider a mobile app tracking contacts who work for multiple companies. If fifty of the contacts work at a particular address and that address changes, the application would have to update only one row in an address table as part of a SQL relational model using joins. In a NoSQL database, which has no joins (and thus denormalizes data), the application must query to find all fifty instances of the address across the contacts and then update each corresponding row or document. The application must also ensure that all fifty addresses are updated, even in the face of failures. In a SQL database, the application would use an ACID transaction to update all fifty rows, which either all succeed or all fail together.

Therefore, if a web, social, mobile, or IoT application has no data dependencies across multiple rows, documents, or tables, NoSQL databases will work well. Otherwise, application developers must implement join-like and transaction-like capabilities in each application, which is time-consuming and error-prone for such complex database functions. Many companies that chose NoSQL, especially those that need real-time reporting aggregating across multiple attributes or dimensions, have discovered later that they needed to reimplement SQL functionality in each of their applications.
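The normalized-versus-denormalized trade-off described above can be sketched with a tiny relational example. The snippet below uses SQLite purely as a stand-in relational database; the company/contact schema, table names, and data are hypothetical illustrations, not taken from the white paper or any particular product:

```python
import sqlite3

# With a normalized relational model, fifty contacts that share one
# company address are fixed by updating a single row; the join resolves
# the current address at query time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, address TEXT);
    CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT,
                          company_id INTEGER REFERENCES company(id));
    INSERT INTO company VALUES (1, '100 Old St');
""")
conn.executemany("INSERT INTO contact VALUES (?, ?, 1)",
                 [(i, f"contact-{i}") for i in range(50)])

# One UPDATE, wrapped in a transaction, changes the address for all
# fifty contacts at once.
with conn:
    conn.execute("UPDATE company SET address = '200 New St' WHERE id = 1")

# Every contact now sees the new address through the join.
rows = conn.execute("""
    SELECT DISTINCT co.address
    FROM contact c JOIN company co ON c.company_id = co.id
""").fetchall()
print(rows)  # -> [('200 New St',)]
```

In a joinless, denormalized store, the same change would require finding and rewriting all fifty copies of the address, plus application-level logic to guarantee that no copy is missed if a failure occurs partway through.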

NewSQL

Because the market is once again realizing the importance of SQL, NewSQL solutions have emerged to address the limitations of NoSQL databases. NewSQL platforms are diverse: some are focused on distribution in the cloud (e.g., NuoDB), while others run in-memory (e.g., MemSQL, VoltDB). These scale-out databases best serve applications that require high throughput and hundreds of concurrent connections.

Although NewSQL databases maintain SQL support, there may be limitations to their ACID transactions (e.g., ACID only in a single node, not across nodes). They may be more affordable than traditional RDBMSs, but most still use a proprietary scale-out architecture. This might be disadvantageous to companies that have started to standardize on Hadoop and do not want to maintain another scale-out architecture.

SQL-on-Hadoop and Hadoop RDBMSs

SQL-on-Hadoop databases (e.g., Hive, Impala) differ from their NewSQL counterparts by possessing native Hadoop integration. Most are designed for ad-hoc and exploratory analytics, allowing data scientists to use Hadoop without the need to write MapReduce. However, most SQL-on-Hadoop solutions are currently unable to support real-time operational applications, which often require transactional updates and a high concurrency of small reads and writes.

What many businesses are finding is that they truly need to intertwine the proven performance and functionality of a traditional SQL database with the flexibility of a Hadoop scale-out solution. For organizations that are looking to scale affordably with a proven scale-out technology but still maintain full SQL support and RDBMS functionality, a Hadoop RDBMS (e.g. Splice Machine) is an attractive alternative to consider.

Four Reasons to Choose Hadoop RDBMS for Powering Real-Time Applications

When it comes to powering real-time big data applications, here are four key reasons why a Hadoop RDBMS can be a compelling solution:

MAKE SURE YOUR NEXT DATABASE HAS THIS SQL FUNCTIONALITY:
✓ Joins
✓ Secondary indexes
✓ Aggregations
✓ Stored procedures
✓ Window functions

1. General-purpose platform. A Hadoop RDBMS is a general-purpose database, capable of handling mixed operational and analytical workloads (i.e., OLTP and OLAP) with real-time queries. Unlike NoSQL solutions, which effectively handle simple web applications but do not perform well with transactional operations, a Hadoop RDBMS easily handles the data updates across multiple rows and tables required by mission-critical business applications.

2. Full SQL support. By combining full ANSI SQL support with the Hadoop ecosystem, businesses can scale out from gigabytes to petabytes using a Hadoop RDBMS without needing to rewrite their existing SQL applications or retrain their IT staff, as NoSQL solutions require.

It's also important to note that although NewSQL and emerging SQL-on-Hadoop solutions claim to support SQL, their capabilities are often limited. In a Hadoop RDBMS, enterprises will find all of the key functionality they currently have in their legacy SQL databases:

• Joins
• Secondary indexes
• Aggregations
• Stored procedures
• Window functions
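To make the last checklist item concrete: a window function computes a value over a group of related rows without collapsing them the way GROUP BY does. The sketch below again uses SQLite (3.25+ is required for window functions) as an illustrative stand-in; the orders table and its data are hypothetical:

```python
import sqlite3

# Hypothetical sales data: two regions, two orders each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('east', 100), ('east', 300), ('west', 200), ('west', 50);
""")

# RANK() OVER (PARTITION BY ...) ranks orders by amount within each
# region while still returning every individual row -- something a
# plain aggregation cannot express.
rows = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY region, rnk
""").fetchall()
print(rows)
# -> [('east', 300, 1), ('east', 100, 2), ('west', 200, 1), ('west', 50, 2)]
```

Without window-function support in the database, this per-group ranking logic has to be reimplemented in application code.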

3. Real-time updates with transactions. The issue with many NoSQL solutions is that they sacrifice transactions, which makes it very difficult to update data across multiple rows or tables in real time. A Hadoop RDBMS supports full ACID transactions, allowing thousands of concurrent users to access and alter data simultaneously. Transactional integrity is vital to powering real-time applications, and more enterprises are requiring it as a foundational capability in their databases.
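The all-or-nothing guarantee at the heart of ACID transactions can be sketched in a few lines. This is a generic illustration using SQLite as a stand-in ACID database (not Splice Machine's API); the account names and the transfer scenario are invented for the example:

```python
import sqlite3

# Two hypothetical accounts; a transfer must debit one and credit the
# other, or do neither.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # sqlite3's connection context manager commits on success and
    # rolls back if an exception escapes the block.
    with conn:
        conn.execute("UPDATE account SET balance = balance - 60 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transaction")
        # The matching credit below is never reached...
        conn.execute("UPDATE account SET balance = balance + 60 "
                     "WHERE name = 'bob'")
except RuntimeError:
    pass  # ...and the partial debit was rolled back automatically.

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # -> {'alice': 100, 'bob': 0}
```

Without transactions, the simulated failure would have left the debit applied and the credit lost, and detecting and repairing that inconsistency would fall to application code.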

4. Developer framework support. Application developers can be reassured that a Hadoop RDBMS supports a number of developer frameworks, including .NET, Java, Python, and Ruby on Rails, as well as applications written in JavaScript/AngularJS. This allows developers to build applications quickly and easily using the tools with which they are most productive.

When NoSQL emerged in the database market, many companies selected it prematurely, believing that NoSQL was the only way to effectively scale their applications. Over time, many of these companies learned that NoSQL was not necessarily the right fit. For many applications, NoSQL databases force developers to reinvent the wheel, implementing joins and transactions at the application level.

Today, technology has evolved to provide massive scalability without sacrificing functionality and performance. For many applications, a Hadoop RDBMS effectively combines the reliability and functionality of an RDBMS with the scale-out of NoSQL.

FOUR REASONS TO CHOOSE HADOOP RDBMS:
1. General-purpose platform
2. Full SQL support
3. Real-time updates with transactions
4. Developer framework support

Harte Hanks Case Study: Powering Digital Marketing Applications with Splice Machine

Optimizing Campaign Management. Marketing services provider Harte Hanks was experiencing serious challenges with the databases powering its digital marketing applications. Harte Hanks used Oracle RAC to provide its clients with a 360° view of their customers, but its queries were getting slower, in some cases taking over half an hour to complete. Expecting 30-50% future data growth as more data sources, such as mobile and social interactions, were added, the company was concerned that performance issues would only worsen.

Harte Hanks needed to provide deeper insights to its clients by drastically increasing the amount of data at its disposal, without any rewrites to its applications. It also wanted more effective personalization through faster queries that power cross-channel campaigns.

Requirements. To reduce costs and handle the increased data load, Harte Hanks recognized that it needed a new scale-out SQL database that could leverage the proven scale-out of Hadoop while ensuring seamless integration with its complex environment of existing applications and tools, which require ACID compliance and ODBC/JDBC standards:

• IBM Unica for campaign management • IBM Cognos for business intelligence • Harte Hanks Trillium for data quality • Ab Initio for ETL

"Splice Machine's Hadoop RDBMS delivers all of the functionality and performance we need. We are delighted with our initial results, where queries execute several times faster on a significantly less expensive cluster."

— Rob Fuller, Managing Director of Product Innovation, Harte Hanks

Results. The company chose Splice Machine, a Hadoop RDBMS, to support its mixed-workload (OLAP and OLTP) applications. It saw a 75% cost savings along with a 3-7x increase in query speeds, and can now easily scale out to hundreds of terabytes by adding commodity servers.

Harte Hanks can now cost-effectively add significantly more data to improve the quality of the services it provides to its clients:

• A 360° view through a customer relationship management system
• Personalized campaign execution
• Cross-channel campaign analytics with real-time and mobile access, so customer insights can be rapidly shared across organizations

Ultimately, by replacing Oracle RAC with Splice Machine, Harte Hanks has experienced a 10-20x improvement in price/performance, without any rewrites to its IBM Unica campaign management software, Cognos business intelligence reports, Ab Initio ETL scripts, or Trillium data quality software.

RESULTS:
• 3-7x increase in query speeds
• 75% cost savings
• 10-20x price/performance improvement

About Splice Machine

Designed to power real-time applications that are overwhelmed by data growth, Splice Machine is The Hadoop RDBMS. Splice Machine offers an ANSI-SQL database with support for ACID transactions on the distributed computing infrastructure of Hadoop. Like Oracle and MySQL, it is an operational database that can handle operational (OLTP) or analytical (OLAP) workloads, while scaling out cost-effectively from terabytes to petabytes on inexpensive commodity servers.

[Architecture diagram: operational apps and SQL & BI tools connect through ODBC/JDBC and Splice bulk import/export to Splice Machine, whose Derby-based parser, planner, optimizer, and executor run on top of HBase and HDFS within Hadoop.]

Splice Machine marries two proven technology stacks: Apache Derby and HBase/Hadoop. With over 15 years of development, Apache Derby is a popular, Java-based SQL database. Splice Machine chose Derby because it is a full-featured ANSI SQL database, lightweight (<3 MB), and easy to embed into the HBase/Hadoop stack.

Splice Machine chose HBase and Hadoop because of their proven auto-sharding, replication, and failover technology. HBase also enables real-time, incremental writes on top of the immutable Hadoop file system, and since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution. Supported Hadoop distributions include Cloudera, MapR and Hortonworks.

www.splicemachine.com (415) 857-2111 | [email protected]

© Copyright 2015 Splice Machine. All rights reserved.