WHITE PAPER

The Guide to Scaling Operational Applications on Hadoop Cost-effective ways to power your web, mobile, OLTP, and Internet of Things applications

www.splicemachine.com

The content of this white paper, including the ideas and concepts contained within, is the property of Splice Machine, Inc. This document is considered proprietary and confidential, and should not be reproduced or reused in any way without the permission of Splice Machine. Published 03/15

© Copyright 2015 Splice Machine. All rights reserved.

Table of Contents

The Guide to Scaling Operational Applications on Hadoop ...... 2
The Challenge ...... 3
Traditional Scale-Up RDBMSs ...... 3
NoSQL ...... 4
NewSQL ...... 5
SQL-on-Hadoop and Hadoop RDBMSs ...... 5
Four Reasons to Choose Hadoop RDBMS ...... 6
Harte Hanks Case Study ...... 8
About Splice Machine ...... 10

Guide to Scaling | 1

The Guide to Scaling Operational Applications on Hadoop

Cost-effective Ways to Power Your Web, Mobile, OLTP and Internet of Things Applications

Powering real-time applications involves scaling not only existing enterprise applications, but also new applications that have emerged from the web, social media, and mobile devices. Overwhelmed by massive data growth, businesses must carefully select cost-effective technologies that enable applications to manage both today's data volumes and their exponential growth in the future.

DATA-INTENSIVE APPLICATIONS REQUIRE A DATABASE CAPABLE OF:
• Complex, interactive queries
• Data updates in real time
• High concurrency of small reads and writes

Saving the Applications of Yesterday

Business applications commonly access data using the Structured Query Language (SQL) through legacy relational database management systems (RDBMSs) such as Oracle, IBM DB2, or MySQL. However, many existing SQL applications are being overwhelmed by exponential data growth, and the legacy RDBMSs that support them often hit a wall, from either a performance or a cost perspective. Scaling up these systems can mean expensive database hardware replacements, while moving to NoSQL technology usually requires cost-prohibitive application rewrites.

Supporting the Applications of Today and Tomorrow

The modern world of web, social, mobile, and Internet of Things (IoT) applications is also very demanding on databases, requiring them to ensure real-time responses while handling massively increasing data volumes at ever-greater velocity.

These new applications extract information from a plethora of sources within the Internet of Things, including countless sensors, social platforms, and mobile devices. They serve a number of use cases, such as digital marketing, healthcare, and fraud detection, which share these workload characteristics:

• Complex, interactive queries
• Data updates in real time
• High concurrency of small reads and writes

The key to riding the wave of data growth and enabling digital applications in real time is to select a database that can support massive data growth without bursting IT budgets.

The Scalability Challenge

Scalability is the ability of a system to accommodate a growing amount of data and/or workload. There are generally two ways to scale:

• Scaling up - adding more resources to a single server
• Scaling out - adding more servers that cooperate as a single system

THE PAST                                     THE FUTURE
Small data volumes; purge often              Massive data volumes; retain forever
Slow data velocity                           Rapid data velocity
Rigid, static data of similar structure      Flexible, fluid data of many structures
One-to-one, shared disk architecture         Many-to-many, shared nothing architecture
Primary storage                              Scale both writes and reads
Scale-up on proprietary hardware             Scale-out on commodity hardware

Traditional Scale-Up RDBMSs The decision ultimately comes down to price/performance. If money is no object, scale-up platforms (e.g., Oracle Exadata, IBM DB2) work well because they often do not require changes to applications, and many businesses appreciate the reliability and security of a proprietary system. However, note that scaling up requires a hardware migration every time a new larger server replaces the old one, a process which suffers from the law of diminishing returns: costs will rise significantly faster than performance, and eventually technological innovation will plateau to the point when higher performance cannot be achieved no matter the price.

NoSQL Databases

For new web, social, mobile, and IoT applications, NoSQL databases, such as MongoDB and Cassandra, have become popular options. They require less migration over the lifetime of an application, because they automatically redistribute load as new nodes are added and do not require discrete hardware upgrade projects.

QUESTIONS TO CONSIDER WHEN SELECTING YOUR NEXT DATABASE:
• Can it scale operational applications?
• Can it support ACID transactions?
• Is it integrated with Hadoop?
• Is it real-time?
• Is it cost-effective?

However, NoSQL databases by design lack SQL, joins, and ACID transactions, so they have difficulty handling data that has dependencies across multiple rows, documents, or tables. This might seem to be an academic issue, but it can have major impacts on application design.

Consider a mobile app tracking contacts who work for multiple companies. If fifty of the contacts work at a particular address and that address changes, the application would have to update only one row in an address table as part of a SQL relational model using joins. In a NoSQL database, which has no joins (and thus denormalizes data), the application must query to find all fifty instances of the address across the contacts and then update each corresponding row or document. The application must also ensure that all fifty addresses are updated, even in the face of failures. In a SQL database, the application would use an ACID transaction to update all fifty rows, which either all succeed or all fail together.

Therefore, if a web, social, mobile, or IoT application has no data dependencies across multiple rows, documents, or tables, NoSQL databases will work well. Otherwise, application developers must implement join-like and transaction-like capabilities in each application, which is time-consuming and error-prone for such complex database functions. Many companies that chose NoSQL, especially those that need real-time reporting aggregating across multiple attributes or dimensions, have discovered later that they needed to reimplement SQL functionality in each of their applications.
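The normalized-versus-denormalized trade-off described above can be sketched with a tiny relational example. The snippet below uses SQLite purely as a stand-in relational database; the company/contact schema, table names, and data are hypothetical illustrations, not taken from the white paper or any particular product:

```python
import sqlite3

# With a normalized relational model, fifty contacts that share one
# company address are fixed by updating a single row; the join resolves
# the current address at query time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, address TEXT);
    CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT,
                          company_id INTEGER REFERENCES company(id));
    INSERT INTO company VALUES (1, '100 Old St');
""")
conn.executemany("INSERT INTO contact VALUES (?, ?, 1)",
                 [(i, f"contact-{i}") for i in range(50)])

# One UPDATE, wrapped in a transaction, changes the address for all
# fifty contacts at once.
with conn:
    conn.execute("UPDATE company SET address = '200 New St' WHERE id = 1")

# Every contact now sees the new address through the join.
rows = conn.execute("""
    SELECT DISTINCT co.address
    FROM contact c JOIN company co ON c.company_id = co.id
""").fetchall()
print(rows)  # -> [('200 New St',)]
```

In a joinless, denormalized store, the same change would require finding and rewriting all fifty copies of the address, plus application-level logic to guarantee that no copy is missed if a failure occurs partway through.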

NewSQL

Because the market is once again realizing the importance of SQL, NewSQL solutions have emerged to address the limitations of NoSQL databases. NewSQL platforms are diverse: some are focused on distribution in the cloud (e.g., NuoDB), while others run in-memory (e.g., MemSQL, VoltDB). These scale-out databases best serve applications that require high throughput and hundreds of concurrent connections.

Although NewSQL databases maintain SQL support, there may be limitations to their ACID transactions (e.g., ACID only in a single node, not across nodes). They may be more affordable than traditional RDBMSs, but most still use a proprietary scale-out architecture. This might be disadvantageous to companies that have started to standardize on Hadoop and do not want to maintain another scale-out architecture.

SQL-on-Hadoop and Hadoop RDBMSs

SQL-on-Hadoop databases (e.g., Hive, Impala) differ from their NewSQL counterparts by possessing native Hadoop integration. Most are designed for ad-hoc and exploratory analytics, allowing data scientists to use Hadoop without the need to write MapReduce. However, most SQL-on-Hadoop solutions are currently unable to support real-time operational applications, which often require transactional updates and a high concurrency of small reads and writes.

What many businesses are finding is that they truly need to intertwine the proven performance and functionality of a traditional SQL database with the flexibility of a Hadoop scale-out solution. For organizations that are looking to scale affordably with a proven scale-out technology but still maintain full SQL support and RDBMS functionality, a Hadoop RDBMS (e.g. Splice Machine) is an attractive alternative to consider.

Four Reasons to Choose Hadoop RDBMS for Powering Real-Time Applications

When it comes to powering real-time big data applications, here are four key reasons why a Hadoop RDBMS can be a compelling solution:

MAKE SURE YOUR NEXT DATABASE HAS THIS SQL FUNCTIONALITY:
✓ Joins
✓ Secondary indexes
✓ Aggregations
✓ Stored procedures
✓ Window functions

1. General-purpose platform. A Hadoop RDBMS is a general-purpose database, capable of handling mixed operational and analytical workloads (i.e., OLTP and OLAP) with real-time queries. Unlike NoSQL solutions, which effectively handle simple web applications but do not perform well with transactional operations, a Hadoop RDBMS easily handles the data updates across multiple rows and tables required by mission-critical business applications.

2. Full SQL support. By combining full ANSI SQL support with the Hadoop ecosystem, businesses can scale out from gigabytes to petabytes using a Hadoop RDBMS without needing to rewrite their existing SQL applications or retrain their IT staff, as NoSQL solutions require.

It's also important to note that although NewSQL and emerging SQL-on-Hadoop solutions claim to support SQL, their capabilities are often limited. In a Hadoop RDBMS, enterprises will find all of the key functionality they currently have in their legacy SQL databases:

• Joins
• Secondary indexes
• Aggregations
• Stored procedures
• Window functions
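To make the last checklist item concrete: a window function computes a value over a group of related rows without collapsing them the way GROUP BY does. The sketch below again uses SQLite (3.25+ is required for window functions) as an illustrative stand-in; the orders table and its data are hypothetical:

```python
import sqlite3

# Hypothetical sales data: two regions, two orders each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('east', 100), ('east', 300), ('west', 200), ('west', 50);
""")

# RANK() OVER (PARTITION BY ...) ranks orders by amount within each
# region while still returning every individual row -- something a
# plain aggregation cannot express.
rows = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY region, rnk
""").fetchall()
print(rows)
# -> [('east', 300, 1), ('east', 100, 2), ('west', 200, 1), ('west', 50, 2)]
```

Without window-function support in the database, this per-group ranking logic has to be reimplemented in application code.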

3. Real-time updates with transactions. The issue with many NoSQL solutions is that they sacrifice transactions, which makes it very difficult to update data across multiple rows or tables in real time. A Hadoop RDBMS supports full ACID transactions, allowing thousands of concurrent users to access and alter data simultaneously. Transactional integrity is vital to powering real-time applications, and more enterprises are requiring it as a foundational capability in their databases.
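The all-or-nothing guarantee at the heart of ACID transactions can be sketched in a few lines. This is a generic illustration using SQLite as a stand-in ACID database (not Splice Machine's API); the account names and the transfer scenario are invented for the example:

```python
import sqlite3

# Two hypothetical accounts; a transfer must debit one and credit the
# other, or do neither.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # sqlite3's connection context manager commits on success and
    # rolls back if an exception escapes the block.
    with conn:
        conn.execute("UPDATE account SET balance = balance - 60 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated failure mid-transaction")
        # The matching credit below is never reached...
        conn.execute("UPDATE account SET balance = balance + 60 "
                     "WHERE name = 'bob'")
except RuntimeError:
    pass  # ...and the partial debit was rolled back automatically.

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)  # -> {'alice': 100, 'bob': 0}
```

Without transactions, the simulated failure would have left the debit applied and the credit lost, and detecting and repairing that inconsistency would fall to application code.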

4. Developer framework support. Application developers can be reassured that a Hadoop RDBMS supports a number of developer frameworks, including .NET, Java, Python, and Ruby on Rails, as well as applications written in JavaScript/AngularJS. This allows developers to build applications quickly and easily using the tools with which they are most productive.

When NoSQL emerged in the database market, many companies selected it prematurely, believing that NoSQL was the only way to effectively scale their applications. Over time, many of these companies learned that NoSQL was not necessarily the right fit. For many applications, NoSQL databases force developers to reinvent the wheel, implementing joins and transactions at the application level.

Today, technology has evolved to provide massive scalability without sacrificing functionality and performance. For many applications, a Hadoop RDBMS effectively combines the reliability and functionality of an RDBMS with the scale-out of NoSQL.

FOUR REASONS TO CHOOSE HADOOP RDBMS:
1. General-purpose platform
2. Full SQL support
3. Real-time updates with transactions
4. Developer framework support

Harte Hanks Case Study: Powering Digital Marketing Applications with Splice Machine

Optimizing Campaign Management. Marketing services provider Harte Hanks was experiencing serious challenges with the databases powering its digital marketing applications. Harte Hanks used Oracle RAC to provide its clients with a 360° view of their customers, but its queries were getting slower, in some cases taking over half an hour to complete. Expecting 30-50% future data growth as more data sources, such as mobile and social interactions, were added, the company was concerned that performance issues would only worsen.

Harte Hanks needed to provide deeper insights to its clients by drastically increasing the amount of data at its disposal, without any rewrites to its applications. It also wanted more effective personalization through faster queries that power cross-channel campaigns.

Requirements. To reduce costs and handle the increased data load, Harte Hanks recognized that it needed a new scale-out SQL database that could leverage the proven scale-out of Hadoop while ensuring seamless integration with its complex environment of existing applications and tools, which require ACID compliance and ODBC/JDBC standards:

• IBM Unica for campaign management • IBM Cognos for business intelligence • Harte Hanks Trillium for data quality • Ab Initio for ETL

"Splice Machine's Hadoop RDBMS delivers all of the functionality and performance we need. We are delighted with our initial results, where queries execute several times faster on a significantly less expensive cluster."

— Rob Fuller, Managing Director of Product Innovation, Harte Hanks

Results. The company chose Splice Machine, a Hadoop RDBMS, to support its mixed-workload (OLAP and OLTP) applications. It saw a 75% cost savings along with a 3-7x increase in query speeds, and can now easily scale out to hundreds of terabytes by adding commodity servers.

Harte Hanks can now cost-effectively add significantly more data to improve the quality of the services it provides to its clients:

• A 360° view through a customer relationship management system
• Personalized campaign execution
• Cross-channel campaign analytics with real-time and mobile access, so customer insights can be rapidly shared across organizations

Ultimately, by replacing Oracle RAC with Splice Machine, Harte Hanks has experienced a 10-20x improvement in price/performance, without any rewrites to its IBM Unica campaign management software, Cognos business intelligence reports, Ab Initio ETL scripts, or Trillium data quality software.

RESULTS:
• 3-7x increase in query speeds
• 75% cost savings
• 10-20x price/performance improvement

About Splice Machine

Designed to power real-time applications that are overwhelmed by data growth, Splice Machine is The Hadoop RDBMS. Splice Machine offers an ANSI-SQL database with support for ACID transactions on the distributed computing infrastructure of Hadoop. Like Oracle and MySQL, it is an operational database that can handle operational (OLTP) or analytical (OLAP) workloads, while scaling out cost-effectively from terabytes to petabytes on inexpensive commodity servers.

[Architecture diagram: operational apps and SQL & BI tools connect through ODBC/JDBC and Splice bulk import/export to Splice Machine, whose Derby-based parser, planner, optimizer, and executor run on top of HBase and HDFS within Hadoop.]

Splice Machine marries two proven technology stacks: Apache Derby and HBase/Hadoop. With over 15 years of development, Apache Derby is a popular, Java-based SQL database. Splice Machine chose Derby because it is a full-featured ANSI SQL database, lightweight (<3 MB), and easy to embed into the HBase/Hadoop stack.

Splice Machine chose HBase and Hadoop because of their proven auto-sharding, replication, and failover technology. HBase also enables real-time, incremental writes on top of the immutable Hadoop file system, and since Splice Machine does not modify HBase, it can be used with any standard Hadoop distribution. Supported Hadoop distributions include Cloudera, MapR and Hortonworks.

www.splicemachine.com (415) 857-2111 | [email protected]

© Copyright 2015 Splice Machine. All rights reserved.