
In-Memory & Column DBs and how they support Big Data

Orestes Appel, Senior Principal

Contents

• The Basics of Big Data / Analytics
• What is needed to support Big Data & real-time analytics?
• What kind of DB systems are required?
• The (new) Technology
• Some commercial examples
• Conclusions
• Q&A

© Noah Consulting, LLC 2015

The Dawn of Big Data*

Exabyte (binary definition) = 2^60 bytes = 1,152,921,504,606,846,976 bytes

From 10,200 B.C. to 2003 A.D.: 5 exabytes of data

*Source: Google’s Eric Schmidt.

Is Big Data always successful?

Typical Big Data Projects

[Chart: percentage of typical Big Data projects that fail, run over schedule, or run over budget; axis from 0% to 120%]

Source: Jim Kaskade, VP of Big Data & Analytics, CSC

• Failure to operationalize data insights.
• Inefficient tools and processes.
• Lack of tie into business goals.

© Noah Consulting, LLC 2015 Page | 4 Why is Everyone Talking about Big Data

The sheer volume of data that must be processed (e.g. seismic, etc.). The rate at which the data arrives (e.g. streaming data, SCADA, etc.). The variation in data and the computational complexity of the required analytics. The industry is starting to implement a technology that they intuitively know can help them store and analyze large diverse data sets.

Source: Jim Crompton, Noah Consulting. “Big Data and the Internet of Things Meet the Oil & Gas Industry”, paper presented at the 19th International Conference on Petroleum Integration, Information & Data Management, held at the Marriott Westchase Hotel, Houston, TX, USA, May 19-21, 2015.

© Noah Consulting, LLC 2015 Page | 5 Just How Much Data are we talking about?* In the next minute…

• 98,000 tweets will be created on Twitter
• 695,000 status updates will be generated on Facebook
• 11 million instant messages will be sent
• 698,445 Google searches will be conducted
• 100 hours of video will be uploaded to YouTube
• 168 million emails will be sent
• 1,820 TB of data will be created
• 217 new mobile web users will be added

According to Cisco, nearly 1 million minutes of video will travel the internet every second by 2018.

*Source: Jim Crompton, Noah Consulting. “Big Data and the Internet of Things Meet the Oil & Gas Industry”

© Noah Consulting, LLC 2015 Page | 6 Another view of Big Data (the 5 Vs)

• Volume: Size of the data
• Velocity: Speed at which data is generated
• Variety: Different types of data
• Veracity: Level of accuracy of the data
• Value: The data can be transformed into value

© Noah Consulting, LLC 2015 Page | 7 Information Intensity: Seismic, Drilling, Production, Subsurface Characterization

March 2010, WesternGeco’s UniQ system acquired and quality checked one TB of data/hr. The Transocean Clear Leader drill ship is instrumented with 30,000 sensors with a 1-5 sec. sample rate capability

[Figure: information intensity for a Large Offshore Oil Facility, a Large Onshore Oil Field, and Field Simulation Models. Values shown: 10-15,000 measurement points; 1-3000 I/O points with 10+ GB/day data stream; 50+ TB; 100-million-cell earth model; 2-million-cell simulation model.]

3D Seismic Cubes = 400 TB+
• 2 TB / cube raw data
• 100 GB / cube processed

*Source: Jim Crompton, Noah Consulting, LLC (2015)

© Noah Consulting, LLC 2015 Page | 8 The Promise of Analytics

• Descriptive Analytics: What has happened?
• Predictive Analytics: What will happen next?
• Prescriptive Analytics: How can I keep good things going?

Deeper insight at the speed of business.

Historical DB Models

DB Models and some history … before Relational

https://en.wikipedia.org/wiki/Database

• No standard access
• Performance varies with quality of code
• Difficult to program
• Systems Programmers are indispensable
• Requires an application expert

Relational DBs meet the corporation (SQL)

• Logical Model NOT Equal to Physical Model
• Geologists, Accountants, Technicians use it alike
• Performance is a function of size, operation’s complexity and compiler/optimizer
• Standard model, standard access

“Relational processing entails treating whole relationships as operands. Its primary purpose is loop-avoidance, an absolute requirement for end users to be productive at all, and a clear productivity booster for application programmers.” -E.F. Codd

Reliable, sound model that protects your data

© Noah Consulting, LLC 2015 Page | 11 What does a traditional (Relational) DB bring to the table?

• Persistent Storage in Record/Row Format • Structured-data modelled through a Schema • Powerful/Declarative Query Language (SQL) • Transactions Consistency/Safety (ACID)

But there are disadvantages, too!

© Noah Consulting, LLC 2015 Page | 12 Relational DB Systems, SQL & ACID

ACID (Atomicity, Consistency, Isolation, Durability): a set of properties that guarantee that transactions are processed reliably.

• Atomicity: Transactions are all or nothing.
• Consistency: Only valid data is saved.
• Isolation: Transactions do not affect each other.
• Durability: Written data will not be lost.
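As an illustration of the "all or nothing" property, here is a minimal sketch using Python's built-in sqlite3 module. The `accounts` table, the `transfer` function and the sample balances are assumptions made up for this example; they do not come from the presentation.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically: both updates apply, or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a later failure has something to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects a negative balance (consistency).
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def demo():
    """Run one valid and one invalid transfer on a hypothetical accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
    conn.commit()
    ok = transfer(conn, "a", "b", 30)       # valid transfer
    failed = transfer(conn, "a", "b", 500)  # would overdraw "a"
    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    conn.close()
    return ok, failed, balances
```

In the second transfer the credit to "b" has already been applied when the debit fails, and the rollback undoes it, so the balances stay as they were after the first transfer.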

Advantages

• Ease of use: seeing information as tables (rows and columns)
• Flexibility: tables can be easily manipulated by simple operators (project, join, etc.)
• Precision: use of relational algebra and relational calculus in the manipulation of the relations
• Security: security control and entitlements can be implemented more easily
• Data Independence: achieved more easily with normalization than with other models
• Data Manipulation Language: addressing queries through a structured language based on relational algebra and relational calculus (SQL)

Disadvantages

• Performance: a major constraint. If the number of tables is large and relations complex, performance suffers with SQL queries
• Physical Storage Usage: join operations depend upon the physical storage (which is often tuned to satisfy the more common queries, impacting physical storage)
• Slow extraction: one row is one record. To access one attribute only, the whole row must be found and extracted

© Noah Consulting, LLC 2015 Page | 13 Main characteristics of Relational DBs

• Locking at Row level • Replication • Data stored in Disk • Main memory Buffer pool • Query Optimization

© Noah Consulting, LLC 2015 Page | 14 Database: Persistent collection of records

Stock Quote Record

Symbol   Price    Quantity   Market   Date
AAPL     125.00   20,000     NYSE     01/Jan/2014
. . .
MSFT     85.00    10,000     NYSE     01/Jan/2014

SQL

SELECT price FROM tablename WHERE symbol = 'AAPL' AND date = '01/Jan/2014'
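A minimal sketch of running this kind of lookup, assuming Python's built-in sqlite3 module and a hypothetical table name `quotes` (the slide does not name the table). The two rows are the ones shown in the Stock Quote Record above; the helper names `build_quotes` and `price_on` are also assumptions for illustration.

```python
import sqlite3


def build_quotes():
    """Create an in-memory table holding the two sample stock quote records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quotes ("
        "symbol TEXT, price REAL, quantity INTEGER, market TEXT, date TEXT)"
    )
    conn.executemany(
        "INSERT INTO quotes VALUES (?, ?, ?, ?, ?)",
        [
            ("AAPL", 125.00, 20000, "NYSE", "01/Jan/2014"),
            ("MSFT", 85.00, 10000, "NYSE", "01/Jan/2014"),
        ],
    )
    conn.commit()
    return conn


def price_on(conn, symbol, date):
    """Return the price for a symbol on a date, or None if no record matches."""
    row = conn.execute(
        "SELECT price FROM quotes WHERE symbol = ? AND date = ?",
        (symbol, date),
    ).fetchone()
    return None if row is None else row[0]
```

The query is declarative: it states which record is wanted and leaves the access path to the database engine.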

© Noah Consulting, LLC 2015 Page | 15 Access and Performance of Database in Disk

SELECT avg (price) FROM tablename WHERE symbol = AAPL and date = a_date

Only need price, date & symbol, but we must scan and read all columns.

1B records at 100 bytes per record = 100 GB. If the disk can be read at 100 MB/sec, the query takes 1,000 seconds, or 16.67 minutes.
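The estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `full_scan_seconds` is an assumption for illustration, and the model ignores seeks, caching and parallelism.

```python
def full_scan_seconds(num_records, bytes_per_record, disk_bytes_per_sec):
    """Time to read every byte of a row-oriented table sequentially."""
    return num_records * bytes_per_record / disk_bytes_per_sec


# 1B records of 100 bytes each, read at 100 MB/sec (1 MB taken as 10**6 bytes).
seconds = full_scan_seconds(1_000_000_000, 100, 100 * 10**6)
minutes = seconds / 60
```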

© Noah Consulting, LLC 2015 Page | 16 Column-oriented DBs

• Column representation reduces Scan Time • Each column is stored in a separate file • Only need to read 3 columns • Compression can be very efficient

1B records at 100 bytes each, reading only 3 of the 5 columns (assuming equal-width columns) = 100 GB * 3/5 = 60 GB. At 100 MB/sec, the scan takes 600 seconds = 10 minutes.
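A minimal sketch of the idea, assuming an in-memory stand-in where each column is a Python list (a real column store keeps each column in its own file). The third sample row, the function names and the equal-width-column assumption in the timing model are all illustrative additions, not part of the presentation.

```python
from statistics import mean

COLUMNS = ("symbol", "price", "quantity", "market", "date")

# Row layout: one tuple per record. The third row is invented sample data.
rows = [
    ("AAPL", 125.00, 20000, "NYSE", "01/Jan/2014"),
    ("MSFT", 85.00, 10000, "NYSE", "01/Jan/2014"),
    ("AAPL", 127.00, 15000, "NYSE", "01/Jan/2014"),
]


def to_columns(rows):
    """Pivot row tuples into one list per column (the column-store layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(COLUMNS)}


def avg_price(columns, symbol, date):
    """Average price for a symbol and date, touching only 3 of the 5 columns."""
    prices = [
        p
        for s, p, d in zip(columns["symbol"], columns["price"], columns["date"])
        if s == symbol and d == date
    ]
    return mean(prices) if prices else None


def column_scan_seconds(total_bytes, columns_read, columns_total, disk_bytes_per_sec):
    """Scan time when only some of the (equally sized) columns are read."""
    return total_bytes * columns_read / columns_total / disk_bytes_per_sec
```

With the figures from the slide, `column_scan_seconds(100 * 10**9, 3, 5, 100 * 10**6)` gives the 600 seconds quoted above; the `quantity` and `market` columns are never read.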

© Noah Consulting, LLC 2015 Page | 17 Big Data: New Capabilities Needed*

• High performance o Million of users o Thousands of transactions/second • Distributed computing & High Availability o Many processors, many servers Fast data comes from o Multi-node replication • Humans • New Programming Models • Internet of Things (IoT) o Not all data is relational (JSON and many more) • Both o SQL may get complex and … slow o Current data may not conform to a schema

*Source: “Big Data Storage”, by Samuel Madden, Professor and Director of Big Data at CSAIL, Massachusetts Institute of Technology (2014)

© Noah Consulting, LLC 2015 Page | 19 No SQL

Key Value Stores

Key   Value
K1    V1
K2    V2
Kn    Vn

Cassandra, Dynamo, Riak, Others…

Document Stores

Key   JSON/XML Documents
K1    JSON/XML Doc1
K2    JSON/XML Doc2
Kn    JSON/XML Docn

Couch DB, Mongo DB, Others…

Operations
• Get (key)
• Put (key, value)

• Non-Relational Approach
• Much faster than Relational
• Easy to program

… at least at the beginning of the game…
• No ACID
• No High-level language
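The Get/Put interface is small enough to sketch in a few lines. The classes below are an in-memory illustration only; the names `KeyValueStore` and `DocumentStore` are assumptions for this example and are not the API of Cassandra, Dynamo, Riak, CouchDB or MongoDB.

```python
import json


class KeyValueStore:
    """Opaque values addressed by key; the only operations are get and put."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class DocumentStore(KeyValueStore):
    """Same interface, but each value is a JSON document kept in serialized form."""

    def put(self, key, document):
        super().put(key, json.dumps(document))

    def get(self, key, default=None):
        raw = super().get(key)
        return default if raw is None else json.loads(raw)
```

Note what is absent: there is no schema, no query language and no transaction spanning several keys, which is where the speed and the ease of programming come from.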

© Noah Consulting, LLC 2015 Page | 21 New Wave DBs Systems Some Examples

Column Store In-Memory Store

Paraccel • MemSQL • Hewlett-Packard • Microsoft Hekaton/SQL • SAP HANA Server In-Memory OLTP • SAP Sybase/IW • SAP HANA • SQLFire • VoltDB (40,000 trans. per core/sec) Note: we are not covering complex event processing (CEP)

© Noah Consulting, LLC 2015 Page | 23 Relational Too Expensive, Too Slow*

Four major sources of overhead Only 12% actual work? Yes! Useful Work Buffer 12% Manageme nt Index 29% Management • Disk-based (buffer pool overhead) 11% • Record-level locking too expensive • Aries-style write-ahead logging too expensive • Multi-threading latches are a killer Latching Logging 10% • Limited to a few thousand 20% transactions per second Locking 18% *Source: “The Expert Guide to Fast Data”, by Michael Stonebraker, Professor, Massachusetts Institute of Technology and John Hugg, Senior Software Engineer, VoltDB (2015)

© Noah Consulting, LLC 2015 Page | 24 New OLAP Approach (e.g. HP-Vertica) – Similar for Paraccel, HANA*

• Table is mapped into a collection of views, stored by column and sorted on all attributes, left to right • Column stored in 64k chunklets • 1 Tbyte of main memory cost less than 30k (e.g. 64 Gbytes per server in 16 servers) • First attribute stored uncompressed, the rest are compressed • Chunklets decompressed only when needed • Key Operation: process a column • To increase speed of loading, a main-memory row-store is utilized • Newly loaded tuples go there • In bulk, groups or rows are sorted and converted to column format, and compressed • Then, written to new disk segments • Segment merge makes these segments bigger … and bigger • Queries go to both places *Source: “Tackling The Challenges of Big Data - Big Data Storage”, by Michael Stonebraker, Professor, Massachusetts Institute of Technology (2015)

© Noah Consulting, LLC 2015 Page | 25 The Old Way vs The New Way (OLTP)*

• Main memory not disk • Anti-caching not caching • Command logging not data logging (snapshots) • Failover not recovery from a log • MVCC or timestamp order not dynamic locking (MVCC = Multi Version Concurrency Control) • Single threaded not multi-threaded

*Source: “Tackling The Challenges of Big Data - Big Data Storage”, by Michael Stonebraker, Professor, Massachusetts Institute of Technology (2015)
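To make "command logging not data logging" concrete, here is a minimal sketch under stated assumptions: procedures are deterministic, the log is a plain Python list standing in for durable storage, and the names `CommandLoggedStore`, `deposit` and `transfer` are invented for this example. It is not how any particular product implements the feature.

```python
def deposit(state, account, amount):
    """Deterministic procedure: add `amount` to an account."""
    state[account] = state.get(account, 0) + amount


def transfer(state, src, dst, amount):
    """Deterministic procedure: move `amount` between two accounts."""
    state[src] = state.get(src, 0) - amount
    state[dst] = state.get(dst, 0) + amount


PROCEDURES = {"deposit": deposit, "transfer": transfer}


class CommandLoggedStore:
    def __init__(self):
        self.state = {}     # live in-memory data
        self.snapshot = {}  # last checkpoint of the data
        self.log = []       # commands executed since that checkpoint

    def execute(self, name, *args):
        # Log the command and its arguments, not the rows it changes.
        self.log.append((name, args))
        PROCEDURES[name](self.state, *args)

    def checkpoint(self):
        """Take a snapshot; commands before it no longer need to be kept."""
        self.snapshot = dict(self.state)
        self.log.clear()

    def recover(self):
        """Rebuild the state by replaying logged commands on the snapshot."""
        state = dict(self.snapshot)
        for name, args in self.log:
            PROCEDURES[name](state, *args)
        return state
```

A command entry stays the same size no matter how many records the procedure touches, whereas a data log grows with every changed record. The price is that replay only works if the procedures are deterministic.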

© Noah Consulting, LLC 2015 Page | 26 The H-Store System (e.g. Volt-DB)*

An experimental main-memory, parallel • No Disk database management system that is • Partition DB into RAM-sized chunks optimized for on-line transaction • Distribute Across a Cluster of Machines processing (OLTP) applications • No Concurrency Control • highly distributed • One Transaction at a Time Per Partition • row-store-based relational database • No stalls due to disk, network, etc • runs on a cluster on shared-nothing • Most transactions run on a single partition • main memory executor nodes • No Disk-based logging • Collaboration of MIT, Brown • Recover from replicas University, Carnegie Mellon • By copying state on crash University, Yale University and Intel • Asynchronously checkpoint to disk • Use transaction logging to recover from crash

*Source: “Tackling The Challenges of Big Data - Big Data Storage”, by Samuel Madden, Professor and Director of Big Data at CSAIL, Massachusetts Institute of Technology (2015)
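The "one transaction at a time per partition" idea can be sketched as follows. This is a single-process illustration, not H-Store or VoltDB code: the class names, the CRC32-based routing and the `increment` helper are all assumptions, and a real system would run each partition on its own core or machine.

```python
import zlib
from collections import deque


class Partition:
    """One RAM-resident slice of the data with its own serial work queue."""

    def __init__(self):
        self.data = {}
        self.queue = deque()

    def submit(self, txn):
        self.queue.append(txn)

    def run(self):
        # Transactions run one at a time, so no locks or latches are needed.
        results = []
        while self.queue:
            txn = self.queue.popleft()
            results.append(txn(self.data))
        return results


class PartitionedStore:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def partition_for(self, key):
        # Deterministic routing: the same key always lands on the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def submit(self, key, txn):
        """Queue a single-partition transaction on the partition owning `key`."""
        self.partition_for(key).submit(txn)

    def run_all(self):
        return [partition.run() for partition in self.partitions]


def increment(key, amount):
    """Build a transaction that adds `amount` to `key` and returns the new value."""
    def txn(data):
        data[key] = data.get(key, 0) + amount
        return data[key]
    return txn
```

The sketch covers only transactions that touch a single partition, which the slide describes as the common case; multi-partition transactions need coordination that is not shown here.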

© Noah Consulting, LLC 2015 Page | 27 Other (new) DB Models

• Graph-oriented DBs (Oracle Spatial and Graph, Oracle NoSQL DB, Neo4J (ACID), …) • Array DBs (Oracle GeoRaster, PostGIS, SciDB, …) • Hadoop Stack

Hive/Pig (HL Language)

Hadoop

HDFS File System Fully Scalable

© Noah Consulting, LLC 2015 Page | 28 Conclusions

If you don’t need speed, stick to traditional RDBMs

If you do need speed: o Don’t give up on the good stuff . SQL or SQL-equivalent . ACID o For OLTP, focus on In-Memory DBs . According to M. Stonebraker, because they are a factor of 50 – 100 faster o For OLAP (Data Warehousing), go with Columnar Store DBs . Because they are faster, too

Ask your vendors what they are doing in terms of In-Memory / Columnar DBs or New Technology DBs

© Noah Consulting, LLC 2015 Page | 29 So, you think we are relying too much in Academia?

• Linux From: [email protected] (Linus Benedict Torvalds) Newsgroups: comp.os.minix Subject: What would you like to see most in minix? Date: 25 Aug 91 20:57:08 GMT Hello everybody out there using minix – I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones… • Android core is based on Linux • Apple iOS and MacOS (derived from FreeBSD Unix) • Edgar (Ted) Codd • Invented the Relational Model while working for IBM • Mike Stonebraker (MIT) • Ingres • Postgres • H-Store © Noah Consulting, LLC 2015 Page | 30 Going forward

• Keep your Data in good standing
• Select DB Systems that can cope with “fast” in a consistent fashion
• Data Analytics Techniques

The Bedrock

Data Management is still king!

Questions & Answers

Orestes Appel

Senior Principal
+1-403-370-1814
[email protected]

Noah Consulting, LLC
Two Allen Center
1200 Smith Street, 16th Floor
Houston, Texas 77002

www.noah-consulting.com

Booth #10

Noah Information Management Consulting
Banker's Hall
888 3rd St SW Ste. 1000
Calgary, AB T2P 5C5