Best Practices for Big Data : Visualizing Billions of Rows with Rapid Response Times “Big Data Analytics at the Speed of Thought”

Best Practices for Big Data : Visualizing billions of rows with rapid response times “Big Data Analytics at the Speed of Thought” Anthony Maresco 1 Copyright © 2019 MicroStrategy Incorporated. All Rights Reserved . Safe Harbor Notice This presentation describes features that are under development by MicroStrategy. The objective of this presentation is to provide insight into MicroStrategy’s technology direction. The functionalities described herein may or may not be released as shown. This presentation contains statements that may constitute “forward-looking statements” for purposes of the safe harbor provisions under the Private Securities Litigation Reform Act of 1995, including descriptions of technology and product features that are under development and estimates of future business prospects. Forward-looking statements inherently involve risks and uncertainties that could cause actual results of MicroStrategy Incorporated and its subsidiaries (collectively, the “Company”) to differ materially from the forward-looking statements. Factors that could contribute to such differences include: the Company’s ability to meet product development goals while aligning costs with anticipated revenues; the Company’s ability to develop, market, and deliver on a timely and cost-effective basis new or enhanced offerings that respond to technological change or new customer requirements; the extent and timing of market acceptance of the Company’s new offerings; continued acceptance of the Company’s other products in the marketplace; the timing of significant orders; competitive factors; general economic conditions; and other risks detailed in the Company’s Form 10-Q for the three months ended September 30, 2018 and other periodic reports filed with the Securities and Exchange Commission. By making these forward-looking statements, the Company undertakes no obligation to update these statements for revisions or changes after the date of this presentation. Copyright © 2017Copyright MicroStrategy © 2019 MicroStrategyIncorporated. Incorporated. All All Rights Rights Reserved Reserved. Topics • Our Performance Scenario and Objective • First Order Basic Practices • A Note on Benchmarks • Hadoop OLAP Type Tools • Intelligent Cubes for Concurrency • Last Things • Summary • Q&A Special Thanks To…. HF Chadeisson Principal Solutions Architect 4 Copyright © 2019 MicroStrategy Incorporated. All Rights Reserved . Our Performance Scenario and Objectives Copyright © 2019 MicroStrategy Incorporated. All Rights Reserved . Objective • Compare performance impact of changes with a simple drill-down scenario using TPC-H SSB benchmark data • Progress from slow to fast using a combination of Hadoop features and MicroStrategy features • “Speed of Thought” Big Data Analytics with billions of rows • Get an introduction of new developments that improve the picture • Look at how some of these capabilities optimize Dossier interactive performance Minimize this…. Johannes Kepler Johannes Kepler writing to Galileo Galilei in 1609 “Ships and sails proper for the heavenly air should be fashioned. Then there will also be people, who do not shrink from the dreary vastness of space.” Giordano Bruno was burned at the stake in 1600 for declaring the universe was infinite with countless suns and earths. Performance – Concurrency - Structure • “Initially, having any way to use SQL against the Hadoop data was the goal, now there is an increasing requirement to connect business users … and give them the performance they expect with high levels of concurrency.” • “Note that to meet this requirement, it is likely that users will need to have structured data stored in Hadoop (along with the original unstructured data), as good performance is more likely if a transformation is done once rather than per-query” https://dzone.com/articles/sql-and-hadoop déjà vu All Over Again • ROLAP vs. MOLAP • In-memory • Columnar • Indexing • Pre-Aggregation • MPP Features added to the data stores • And more… • All taken to the next level to deal with Big Data Live Access to Data Sources with Unparalleled Performance Quickly generate multi-pass SQL and leverage push down functionality for optimized performance MicroStrategy Analytics Platform Dynamic Sourcing Generate Multipass SQL Pushdown architecture MicroStrategy lets users seamlessly run queries MicroStrategy is able to easily and quickly generate multi-pass MicroStrategy leverages the database to its full extent by pushing and dynamically drill across multiple sources. The SQL queries to provide greater analytical power and minimize data joins and analytic calculations to the database when possible. server is able to auto-recognize tasks and the amount of data that is pulled back to the mid-tier. The Every connector is optimized for high performance by pushing intelligently direct queries against in-memory sophisticated SQL engine can deliver high performance for the down functions to leverage the power of the database in cubes when possible. most complex SQL computations. conjunction with the server. Cube Cube Database 11 Deliver a single version of the truth on top of your big data Leverage the unified metadata repository to design reusable objects for rapid development, governance and scale Clients Desktop Web iPhone iPad Android Applications SCALABLE EXTENSIBLE Documents Reusable Objects Data Model CENTRALIZED CENTRALIZED ADMINISTRATION SECURITY Relational Hadoop & Cloud- Enterprise Personal or Databases Big Data Based Data Applications Departmental Enterprise Data & Business Applications What is “Speed of Thought” Analytics ? • 3 seconds ? • 7 seconds ? • 10 seconds ? • 15 seconds ? • 30 seconds ? • More ??? • All have been quoted! Sometimes you have to wait….but you can set and manage SLA’s for X% of your workload…and evolve… Data – Hardware – Software - Facts • TPC-H SSB at various scales up to 1Terabyte • ~6 Billion rows in the largest fact table – also 1.2 and 3.3 billion tests • Add in semi-structured and data mining • 17 Billion Row Demo by Indexima • 30 B Row demo provided by Kyvos during the conference • 1.2 B Row Taxi used to show Dossier performance with several techniques and products referenced • 10 Node clusters with 8 data/worker nodes • Worker nodes have 8 vCPU and 61 GB • Use ORC file types • Hortonworks HDP 2.6.3 • 10 Node cluster with HDP 3.1.0 provided preliminary information Considerations • Plenty of new turf and it’s continually changing • SQL is still key as NoSQL or NewSQL • Add in semi-structured and data mining • SQL on Hadoop products are evolving quickly with rapid increases in functionality and performance. • Performance can double, triple or more in some cases in 6 month periods. • Ultimate decisions depend on testing with your data • Schema on read is key for iteration and flexibility in the back end • Schemas are still important to operationalize analytics to large number of users • Use of aggregation, agg-aware sql generation, caching, cubes, and dynamic sourcing are keys to performance and scalability • Memory Caching is a key infrastructure component • Agility means “Everything is finished…..Nothing is finished…” Prescription • Use the tools in the platform • Use MicroStrategy to make them better • Add additional tools & components where warranted • Use MicroStrategy to make them better • Use of aggregation, aggregation-aware sql generation, caching, cubes, and dynamic sourcing are tools for performance and scalability • Ensure you have the resources, knowledge, and time allocation for initial and ongoing tuning – sizing – monitoring - benchmarking • With MicroStrategy you get performance, scalability, governance, and concurrency to securely distribute insight with Big Data Analytics to 10’s of 1000’s of people • Additional Big Data Techniques and Tools are required when volume – velocity – variety get to a tipping point First-Order Best Practices Copyright © 2019 MicroStrategy Incorporated. All Rights Reserved . Preliminaries : Minimize Joins TPC-H Schema SSB Schema LINENUMBER • SSB Version of TPC-H eliminates joins but is missing lookup tables • Snowflake is preferred with both dimension and lookup tables Preliminaries : Partitioning • Hive and Impala recommend partitioning to minimize scanned rows • This requires including a filter based on the partition in every query • It is important to pick partition columns so there are the right number of partitions • Partitioning can have other side effects • In an ad-hoc environment, you may not always be using the partitioning filter • In this study, partitioning was not used to see the relative impact of various actions • Bucketing can also be used in conjunction with Partitioning Preliminaries : Statistics • Always collect statistics on tables and columns in order to leverage cost based optimization • Example for Table ANALYZE TABLE table1 COMPUTE STATISTICS; • Example for Columns ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS; Preliminaries : Optimized File Types • Parquet • ORC • AVRO Use the Latest Versions and Features • Example: move from HiveServer2 to HiveServer2 Interactive LLAP • At 550 Scale – 3.3 Billion Rows • Times are mm:ss Hive Server 2 Hive Server 2 Interactive LLAP 7:58 0:25 6:01 0:25 5:41 0:26 SQL and Row Counts select a14.d_year d_year, Table Rows a13.c_region c_region, a12.p_category p_category, lineorders 5,999,989,709 a12.p_brand1 p_brand1, dates 2556 a12.p_mfgr p_mfgr, sum(a11.lo_revenue) WJXBFS1 customers 30,000,000 from lineorder a11 parts 200,000,000 join part a12 on (a11.lo_partkey = a12.p_partkey) Supplier 2,000,000 join

Best Practices for Big Data : Visualizing Billions of Rows with Rapid Response Times “Big Data Analytics at the Speed of Thought”

Java Linksammlung

Declarative Languages for Big Streaming Data a Database Perspective

Apache Apex: Next Gen Big Data Analytics

Vulnerability Summary for the Week of July 10, 2017

Chainsys-Platform-Technical Architecture-Bots

Informatica 10.2 Hotfix 2 Release Notes April 2019

Sphinx: Empowering Impala for Efficient Execution of SQL Queries

HDP 3.1.4 Release Notes Date of Publish: 2019-08-26

Apache Calcite: a Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Yellowbrick Versus Apache Impala

Supplement for Hadoop Company

Hortonworks Data Platform for Enterprise Data Lakes Delivers Robust, Big Data Analytics That Accelerate Decision Making and Innovation