How to build and run a big data platform in the 21st century

Ali Dasdan - Atlassian - [email protected]
Dhruba Borthakur - Rockset - [email protected]

This tutorial was presented at the IEEE BigData Conference in Los Angeles, CA, USA on Dec 9th, 2019.

1 Disclaimers

● This tutorial presents the opinions of the authors. It does not necessarily reflect the views of our employers.
● This tutorial presents content that is already available in the public domain. It does not contain any confidential information.

2 Speaker: Ali Dasdan

Ali Dasdan is the head of engineering for Confluence Cloud at Atlassian. Prior to Atlassian, he worked as the head of engineering and CTO at three startups (Turn in real-time online advertising, Vida Health in chronic disease management, and Poynt in payments platforms) and as an engineering leader at four leading tech companies (Synopsys for electronic design automation; Yahoo! for web search and big data; eBay for e-commerce search, discovery, and recommendation; and Tesco for big data in retail). He has been working on big data and large distributed systems since 2006, releasing the first production application on Hadoop in 2006. He led the building of three big data platforms on-prem from scratch (100PB+ capacity at Turn, multi-PB capacity for a leading telco, and 30PB+ capacity at Tesco). He is active in multiple areas of research related to big data, namely big data platforms, distributed systems, and machine learning. He received his PhD in Computer Science from the University of Illinois at Urbana-Champaign in 1999. During part of his PhD, he worked as a visiting scholar at the University of California, Irvine on embedded real-time systems.

Linkedin: https://www.linkedin.com/in/dasdan/

Scholarly publications: https://dblp.org/pers/hd/d/Dasdan:Ali

Citations: https://scholar.google.com/citations?user=Bol1-tMAAAAJ&hl=en

Courses taught: Multiple graduate-level courses, many paper presentations at conferences, a 3-day training on embedded real-time systems at a SoCal company, and two tutorials at the 18th and 19th International World Wide Web Conferences.

3 Speaker: Dhruba Borthakur

Dhruba Borthakur is cofounder and CTO at Rockset (https://rockset.com/), a company building software to enable data-powered applications. Dhruba was the founding engineer of the open source RocksDB [1] database when he was an engineer at Facebook. Prior to that, he was one of the founding engineers of the Hadoop File System [2] at Yahoo!. Dhruba was also an early contributor to the open source Apache HBase project. Previously, he was a senior engineer at Veritas Software, where he was responsible for the development of VxFS and the Veritas SanPointDirect storage system; was the cofounder of Oreceipt.com, an ecommerce startup based in Sunnyvale; and was a senior engineer at IBM-Transarc Labs, where he contributed to the development of the Andrew File System (AFS). Dhruba holds an MS in computer science from the University of Wisconsin-Madison.

Linkedin: https://www.linkedin.com/in/dhruba/

Scholarly publications: https://dblp.uni-trier.de/pers/hd/b/Borthakur:Dhruba

Citations: https://scholar.google.com/citations?user=NYgWyv0AAAAJ&hl=en

Rockset: https://rockset.com/

RocksDB: https://en.wikipedia.org/wiki/RocksDB

Apache HDFS: https://svn.apache.org/repos/asf/hadoop/common/tags/release-0.10.0/docs/hdfs_design.pdf

4 Abstract

We want to show that building and running a big data platform for both streaming and bulk data processing, for all kinds of applications involving analytics, data science, reporting, and the like, can be as easy as following a checklist in today’s world. We live in a fortunate time in that many of the components needed are already available in open source or as a service from commercial vendors. We show how to put these components together at multiple sophistication levels to cover the spectrum from a basic reporting need to a full-fledged operation across geographically distributed regions with business continuity measures in place. We plan to provide enough information that this tutorial can also serve as a high-level go-to reference during the actual process of building and running such a platform.

5 Motivation

We have been working on big data platforms for more than a decade now. Building and running a big data platform has become relatively manageable. Yet our interactions at multiple forums have shown us that a comprehensive, checklist-style guideline for building and running big data platforms does not exist in a practical form. With this tutorial we hope to close this gap, within the limits of the tutorial length.

6 Targeted audience

● Targets
○ Our first target is developers and practitioners of software, analytics, or data science. However, we also provide multiple research issues and a review of the research literature, so researchers and students can benefit significantly from this tutorial too.
● Content level
○ 25% beginner, 50% intermediate, 25% advanced
● Audience prerequisites
○ Introductory software engineering or data science knowledge or experience, as an individual contributor or as a manager of such individuals. Interest in building and running big data platforms.

7 Terminology and notation

● API: Application Programming Interface
● BCP: Business Continuity Planning
● DAG: Directed Acyclic Graph
● DB: Database
● DR: Disaster Recovery
● GDPR: General Data Protection Regulation
● ML: Machine Learning
● On premise (on-prem): Computing services offered by the 1st party
● Public cloud: Computing services offered by a 3rd party over the public internet
○ Possibly from Alibaba, Amazon, Google, IBM, Microsoft, Oracle, etc.
● RPO: Recovery Point Objective; RTO: Recovery Time Objective
● SLA: Service Level Agreement; SLI: Service Level Indicator; SLO: Service Level Objective
● TB (terabyte); PB (petabyte) = 1000 TB; EB (exabyte) = 1000 PB

8 Tutorial outline

Section                          Time (min)  Presenter
1  Introduction                  15          Ali
2  Architecture                  30          Ali
3  Input layer                   10          Dhruba
4  Storage layer                 10          Dhruba
5  Data transformation           10          Ali
6  Compute layer                 10          Ali
7  Access layer                  10          Dhruba
8  Output layer                  10          Dhruba
9  Management and operations     10          Ali
10 Examples, lessons learned     5           Ali

9 Introduction

Section 1 of 10

10 What is big data?

● Clive Humby, UK mathematician, founder of Dunnhumby, and an architect of Tesco’s Clubcard, 2006 ○ “Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”

● The Economist (2019): “Data is the new oil.”
● Forbes (2018), Wired (2019): “No, data is not the new oil.”
● Ad Exchanger (2019): “Consent, not data, is the new oil.”

https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data

11 What is big data?

● No standard definition
● According to my then 6-yr-old daughter
○ “You get data; then you get more data; and you get more and more data; finally you get big data.”
● Another definition
○ If the size of your data is part of the problem, you may have big data.
● “Big” along one or more of the 3 V’s

Data volumes
● 10s of terabytes
○ A high-end server
● 10s of petabytes
○ US Library of Congress (10PB?)
● 100s of petabytes
○ Turn (now Amobee) (2016)
○ (2016)
○ Netflix (2017)
○ Twitter (2017)
● 1+ exabytes
○ Amazon, Google, Facebook
○ 1 exabyte = 1000 petabytes
● 1+ zettabytes
○ Global internet traffic (per Cisco)

12 How big is a petabyte of data?

13 How big is a petabyte of data?

● Length of US land and water boundary: 19,937mi = 32,087km
○ Length of US-Canada land and water boundary: 5,525mi = 8,892km
○ Length of US-Mexico land and water boundary: 1,933mi = 3,112km
○ Length of US coastline: 12,479mi = 20,083km
● Recall 1 PB of text = 50M file cabinets or 23,000km of file cabinets
● So 1PB of text covers most of the US land and water boundary

● https://fas.org/sgp/crs/misc/RS21729.pdf

14 Where to fit a petabyte of data in a datacenter?

● A high-end server in 2017
○ Main memory size: 0.5TB
○ Hard disk size: 45TB
○ Number of cores: 40
● 2 racks with 15 servers each
○ Total main memory size: 15TB
○ Total hard disk size: ~1.4PB
○ Total number of cores: 1,200
● Now consider the total number of racks in the row in this picture
● Then consider the total number of rows in a typical datacenter
● Finally consider the total number of data centers owned by one or more companies
● That is a lot of data capacity
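The rack-level totals above are straightforward multiplications; a quick sketch of the arithmetic, using the server specs quoted on the slide:

```python
# Capacity math for the 2-rack example above (specs as quoted on the slide).
servers_per_rack = 15
racks = 2
servers = servers_per_rack * racks            # 30 servers

memory_tb_per_server = 0.5
disk_tb_per_server = 45
cores_per_server = 40

total_memory_tb = servers * memory_tb_per_server      # 15 TB
total_disk_pb = servers * disk_tb_per_server / 1000   # 1.35 PB, i.e. ~1.4 PB
total_cores = servers * cores_per_server              # 1,200 cores

print(total_memory_tb, total_disk_pb, total_cores)    # 15.0 1.35 1200
```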

15 Generic big data platform architecture (1 of 3)

16 Generic big data platform architecture (2 of 3)

17 Generic big data platform architecture (3 of 3)

18 Generic big data platform

● To discuss
○ how this generic architecture is realized in the real world, or
○ how real-world architectures fit into this generic architecture
● To discuss the seven key components of a big data platform

Seven key components to discuss

1. Input
2. Compute (or process)
3. Data transformations
4. Storage
5. Access
6. Output
7. Management and operations

19 Gartner: Analytics maturity (2017)

20 Gartner: Types of analytics (2017)

21 Gartner: Analytics maturity (2017)

https://www.gartner.com/en/newsroom/press-releases/2018-02-05-gartner-survey-shows-organizations-are-slow-to-advance-in-data-and-analytics

22 Cons of a big data platform

● Hope that the pros are obvious already
● Cons and how to counter them
○ Costly
■ But the benefits are immense; just a little more patience
○ Not many users
■ But they will come, guaranteed
○ Centralized
■ But this opens up more innovation opportunities through sometimes surprising connections between data sources and applications
○ Single point of failure
■ But the platform can be made more reliable using disaster recovery and other availability and reliability enhancement techniques
○ Big security vulnerability
■ But the platform can be made more secure using many security enhancement techniques

23 Public cloud vs on premise?

Public cloud
● Can start now
● Elastic with resources and cost
● Inexpensive for small scale
● Expensive for huge scale
● Ease of operations

On premise
● Wait until it is built (a few months)
● Inelastic with resources, fixed cost
● Expensive for small scale
● Relatively inexpensive for huge scale
● Cost of operations

24 Apache ecosystem to build and run a big data platform

https://www.apache.org/

25 Big data platform architecture: Self built

Section 2 of 10

26 Airbnb: Big data platform (2016)

https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c

27 Airbnb: Big data platform (2018)

https://medium.com/airbnb-engineering/druid-airbnb-data-platform-601c312f2a4c

28 Netflix: Big data platform (2013)

https://medium.com/netflix-techblog/hadoop-platform-as-a-service-in-the-cloud-c23f35f965e7

29 Netflix: Big data platform (2017)

https://www.youtube.com/watch?v=CSDIThSwA7s

30 Facebook: Big data platform (2008)

https://research.fb.com/wp-content/uploads/2010/06/data-warehousing-and-analytics-infrastructure-at-facebook.pdf

31 Facebook: Real-time processing (2016)

● Scribe: Message bus, transport mechanism for sending data to batch and real-time systems
● Puma: Stream processing system with apps written in a SQL-like language
● Swift: Stream processing engine that provides checkpointing for Scribe
● Stylus: Low-level stream processing framework
● Laser: K/V store built on RocksDB
● Scuba: Slice-and-dice analysis data store
● Hive: Exabyte-scale data warehouse

https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf

32 Slack: Big data platform (2016)

https://slack.engineering/data-wrangling-at-slack-f2e0ff633b69

33 Uber: Big data platform (2016)

https://eng.uber.com/tech-stack-part-two/

34 Uber: Real-time processing (2016)

https://eng.uber.com/ureplicator/

35 Uber: Big data platform (2019)

“The third generation of our Big Data platform incorporates faster, incremental data ingestion (using our open source Marmaray framework), as well as more efficient storage and serving of data via our open source Hudi library.”

https://eng.uber.com/uber-big-data-platform/

36 Big data platform architecture: Cloud built

Section 2 of 10

37 IBM Cloud: Big data platform

https://www.ibm.com/cloud/garage/architectures/dataAnalyticsArchitecture/reference-architecture

38 Microsoft Cloud: Big data platform

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

39 Microsoft Cloud: Batch processing

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

40 Microsoft Cloud: Real-time streaming

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

41 Google Cloud: Big data platform

https://cloud.google.com/solutions/build-a-data-lake-on-gcp

42 Google Cloud: Batch processing

https://cloud.google.com/solutions/build-a-data-lake-on-gcp

43 Google Cloud: Real-time processing

https://cloud.google.com/solutions/build-a-data-lake-on-gcp

44 Amazon AWS: Big data platform

https://d1.awsstatic.com/architecture-diagrams/ArchitectureDiagrams/svoc_data_platform_mdm_ra.pdf?did=wp_card&trk=wp_card

45 Big data platform architecture: Consulting built

Section 2 of 10

46 Cloudera: Big data using Hadoop ecosystem

https://www.cloudera.com/products/open-source/apache-hadoop.html

47 Cloudera: Real-time processing

https://blog.cloudera.com/introducing-cloudera-dataflow-cdf/

48 ThoughtWorks: Big data platform

49 NIST Big data reference architecture (NBDRA)

https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-6r1.pdf

50 NIST NBDRA: Top level roles and fabrics

https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-6r1.pdf

51 NIST NBDRA: Activities view

https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-6r1.pdf

52 NIST NBDRA: Functional view

https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-6r1.pdf

53 Big data platform architecture: Batch vs Streaming

Section 2 of 10

54 Generic: Lambda architecture

Nathan Marz, creator of Storm

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

55 Generic: Kappa architecture

Jay Kreps, creator of Kafka

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

56 Facebook: Aggregator Leaf Tailer Architecture

https://engineering.fb.com/production-engineering/serving-facebook-multifeed-efficiency-performance-gains-through-redesign/

57 Streaming: Design decisions and their impact

● Language paradigms
○ Declarative, functional, procedural
● Data transfer between nodes of the DAG
○ Direct
○ Broker based
○ Persistent storage based
● Processing (state vs output) semantics
○ At least once, at most once, exactly once
● State-saving mechanism
○ Replication, persistence via local DB or remote DB, upstream backup, global consistent snapshot
● Reprocessing
○ Stream only, separate stream and batch, stream also as batch

https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf

58 Uber: Generic big data architecture for future

“Building a more extensible data transfer platform allowed us to easily aggregate all data pipelines in a standard way under one service as well as support any-to-any connectivity between any data source and data sink.”

https://eng.uber.com/uber-big-data-platform/

59 Conclusion on batch vs streaming

In an established company interfacing with the systems of many partners, it is important to support both batch and streaming for data transfer and operations.

The main reason is that this company needs to adapt to the available external interfaces and get the job done, as it will be almost impossible to force every external interface to adapt to what this company wants to support.

One way to minimize effort is to ensure an application can run both in batch and streaming mode to a large extent.
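One way to realize the "run in both modes" idea above is to write the application logic once against a generic iterable, so the same function consumes a bounded batch (a list) or an unbounded stream (a generator). A minimal sketch; the event fields and function names are invented for illustration and do not come from any particular framework:

```python
from typing import Iterable, Iterator

def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Application logic, written once against any iterable of events."""
    for e in events:
        yield {**e, "amount_usd": e["amount_cents"] / 100}

# Batch mode: a bounded collection processed in one run.
batch = [{"amount_cents": 199}, {"amount_cents": 250}]
print([e["amount_usd"] for e in enrich(batch)])   # [1.99, 2.5]

# Streaming mode: the same function over an unbounded generator,
# consuming one event at a time as it arrives.
def stream():
    yield {"amount_cents": 500}
    yield {"amount_cents": 75}

for e in enrich(stream()):
    print(e["amount_usd"])
```

Real frameworks make the same move at a larger scale, e.g., by treating a bounded dataset as a special case of a stream.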

Having said that, the future is far more real-time, i.e., far more streaming but it may take years for some companies.

60 Input layer

Section 3 of 10

61 Input sources

62 Top 5 sources of input

https://www.allerin.com/blog/top-5-sources-of-big-data

63 Getting data from a database

● Open source tools
○ Apache
○ Debezium
● Commercial tools
○ Elastic Beats
○ ??
● Public cloud tools
○ Internal

64 Getting files / logs

● Stream the file into a streaming system
○
● Get files / logs directly
○
● Public cloud tools
○ Internal

65 Getting from streaming

● Open source tools
○ Apache Flume
○ Kafka connector for HDFS
○ Kafka connector to
● Public cloud tools
○ Internal

66 Getting data in batch vs streaming

Batch
● Volume: Huge
● Velocity: Non-real time
● Variety: N/A
● Cost: Lower

Streaming
● Volume: Medium to large
● Velocity: Real time or near real time
● Variety: N/A
● Cost: Higher

67 Storage layer

Section 4 of 10

68 Choices for public cloud vs on-prem

Public cloud
● AWS
○ S3
● Google Public Cloud
○ Google Cloud Storage
● Azure
○ Azure Storage
● Also see other vendors

On premise
● Apache Hadoop Distributed File System (HDFS)

69 Public cloud vs on-prem storage-only comparison (2017)

https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html

70 A comparison of the complete package

● At a minimum, the complete package includes
○ Storage cost
○ Compute cost
○ Network cost
○ Operations cost (cost to operate daily and time to first query)
○ People cost
● For simplicity, let us consider only the compute and storage costs of building a big data cluster on a public cloud and on prem
○ Number of servers: 800, equally divided into two data centers
○ Server details: 512GB memory, 45TB disk, 2 CPUs with 20 cores each
○ Cluster details: 400TB memory, 36PB disk, 32,000 CPU cores
○ On prem: 800 * $12,500 = $10M with 3-year amortization
○ Public cloud: depends, but $1M or more per month
● For large big data clusters, public cloud seems more expensive, but it has its own advantages, as covered before
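The compute-and-storage comparison above is simple arithmetic; a sketch using the figures quoted on the slide (it deliberately ignores power, cooling, networking, and people costs, which would narrow the gap):

```python
# Rough cost comparison using the figures quoted on the slide.
servers = 800
cost_per_server = 12_500          # USD per server, as quoted
amortization_years = 3

on_prem_total = servers * cost_per_server                # $10M
on_prem_monthly = on_prem_total / (amortization_years * 12)

cloud_monthly = 1_000_000         # "depends, but $1M or more per month"

print(on_prem_total)              # 10000000
print(round(on_prem_monthly))     # 277778 per month, amortized
ratio = cloud_monthly / on_prem_monthly
print(ratio)                      # cloud is roughly 3.6x monthly for this config
```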

71 Refresher: Row vs column storage format

● Row format
○ Good for reading all fields in a record
○ Good for updates
○ Good for more transactional needs
● Columnar format
○ Good for reading few fields in a scan
○ Good for compression
○ Not ideal for updates
○ Good for more analytics needs
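The access-pattern difference can be illustrated with plain Python containers: a list of dicts stands in for row layout and a dict of lists for columnar layout. This is only the intuition, not a real storage format such as Parquet or ORC:

```python
# Row layout: each record is contiguous — good for whole-record reads and updates.
rows = [
    {"user": "a", "clicks": 3, "spend": 1.5},
    {"user": "b", "clicks": 7, "spend": 0.0},
]

# Columnar layout: each field is contiguous — good for single-field scans
# and compression (similar values sit next to each other).
cols = {
    "user": ["a", "b"],
    "clicks": [3, 7],
    "spend": [1.5, 0.0],
}

# An analytics-style scan of one field touches only one array in columnar form...
total_clicks = sum(cols["clicks"])                    # 10
# ...but must walk every full record in row form.
total_clicks_rows = sum(r["clicks"] for r in rows)    # 10
```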

https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-6r1.pdf

72 Data organization choices

https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-6r1.pdf

73 Data organization comparison

https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-6r1.pdf

74 Data formats

● Row format
○
● Columnar
○
○ Apache ORC
● Columnar and in-memory optimization
○ Apache Arrow

75 Encryption

● Data encryption possibilities
○ Data at rest: Inactive data stored physically in persistent storage
○ Data in process: Active data in memory manipulated by an application
○ Data in transit: Active data traveling between servers or systems
● Key management options
○ Customers manage their own keys
○ You manage all keys
● Key management tools
○ Use the relevant public cloud solutions
○ Use the solution provided by the open source distribution partner
● Processing over encrypted data
○ Decrypt using keys, managed by the application
○ Run processing over encrypted data without decrypting
■ Limited scope
■ Research solutions

76 Data transformation

Section 5 of 10

77 Four kinds of data

Raw data
● Data or logs ‘as is’
● “Source of truth”
● Data source for tables, profiles, and graphs
● Archival
● Transformed for unified formats
● Ex: All event logs

Profiles
● For each key object in the space
● 360-degree view of each object
● For fast access to relevant fields
● Having given and derived fields
● Ex (advertising): Anonymized users
● Ex (retail): Customer, product, supplier

Tables
● For focused, fast access
● Ex (advertising): Clicks, actions
● Ex (retail): Sales

(Knowledge) Graphs
● Modeling relationships between profiles
● Having given and derived fields for both nodes and edges
● Reveals new observations
● Ex (retail): Customers to products to suppliers to ...

78 Pipelines to create four kinds of data

● Pull from source systems of raw data into the platform ‘as is’
● Transform raw data into a unified format for unified and easy access
● Stream new data into existing tables, or create new tables if needed
● Stream new data into existing profiles, or create new profiles if needed
● Derive new fields or enhance field values for profiles
● Stream changes into graphs, or create new graphs if needed
● Enhance properties of graph nodes and edges
● Push data and insights into destination systems

(Diagram stages: raw data → raw data in “Parquet” → tables → profiles → graphs)

79 Other types of data pipelines

● ETL
● Reporting
● Analytics
● Aggregates for serving DB or tools
● ML feature engineering
● ML model building
● … and other application-specific pipelines
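A toy end-to-end sketch of the raw-to-table-to-profile flow described in this section; the event schema and field names are invented for illustration:

```python
from collections import defaultdict

# Raw data: event logs "as is" (here, already parsed into dicts).
raw = [
    {"type": "click", "user": "u1", "item": "p9"},
    {"type": "sale",  "user": "u1", "item": "p9", "amount": 20.0},
    {"type": "sale",  "user": "u2", "item": "p3", "amount": 5.0},
]

# Table: a focused extract for fast access, e.g., sales only.
sales = [e for e in raw if e["type"] == "sale"]

# Profiles: a 360-degree view per key object (here, per user),
# with derived fields aggregated across all event types.
profiles = defaultdict(lambda: {"events": 0, "revenue": 0.0})
for e in raw:
    p = profiles[e["user"]]
    p["events"] += 1
    p["revenue"] += e.get("amount", 0.0)

print(len(sales))        # 2
print(profiles["u1"])    # {'events': 2, 'revenue': 20.0}
```

A graph stage would then model relationships between these profiles (e.g., users to items), following the same pattern.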

80 Compute & Access layers

Section 6 and 7 of 10

81 Custom applications

● Multiple languages and tools are available for creating custom applications.
● Java and later Pig were more common in the past.
○ There is probably a large number of applications in these languages that still work just fine.
○ They were built on top of Apache Hadoop’s MapReduce functionality.
○ If large, migrating them to a more popular language or tool can be costly.
● Python is more common now.
● Prototyping can also be done in SQL or Python.
● For input data flows, Apache NiFi can be used.
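As noted above, prototyping can be done in SQL; a self-contained sketch using Python's built-in sqlite3 module, with an invented schema, to prototype an aggregation before porting the same SQL to a big data engine:

```python
import sqlite3

# In-memory database: enough to prototype the query shape.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks (user TEXT, page TEXT)")
con.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("u1", "home"), ("u1", "search"), ("u2", "home")],
)

# The same aggregation would later run, largely unchanged, on an engine
# like Hive or Spark SQL over the full dataset.
rows = con.execute(
    "SELECT page, COUNT(*) AS n FROM clicks GROUP BY page ORDER BY n DESC"
).fetchall()
print(rows)   # [('home', 2), ('search', 1)]
```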

82 Applications via systems

● Apache Spark is the preferred choice right now.
○ Has support for graphs, ML, SQL, and streaming
○ Can use Python, Scala, or SQL on Spark
● is another choice for a SQL interface.
● Prototyping can be done in SQL on Spark and can later move to Python or Scala for speed and ease of expressiveness.
● Notebooks via Python or R are also very popular for their ease of use and integrated approach in data science applications.
● Established tools like Matlab and SPSS also support Hadoop and Spark now.
● Lots of commercial tools with SQL-based access

83 Apache Spark

https://spark.apache.org/

84 Workflow management

● Many tools are available
○ Apache Airflow (from Airbnb)
○
○ Netflix Genie
○ Spotify Luigi
○ LinkedIn’s Azkaban
● Public clouds have their own tools.
● Each of these tools uses a DAG, with each node representing a job and each directed edge representing precedence or dependence between jobs.
● These tools are compared based on their expressiveness, input language, ease of use with or without a GUI, reliability, etc.
● Apache Airflow seems to be the preferred choice now.
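All of these tools share the DAG model described above. A minimal sketch of the core scheduling idea, computing a valid job execution order from dependencies, using Python's standard graphlib rather than any specific tool's API (the job names are invented):

```python
from graphlib import TopologicalSorter

# Each job maps to the jobs it depends on (its upstream predecessors).
dag = {
    "load_raw": [],
    "build_tables": ["load_raw"],
    "build_profiles": ["build_tables"],
    "report": ["build_tables"],
}

# A workflow manager runs jobs in some topological order of this DAG,
# so every job starts only after its dependencies have finished.
order = list(TopologicalSorter(dag).static_order())
print(order)   # e.g. ['load_raw', 'build_tables', 'build_profiles', 'report']
```

Real tools add scheduling, retries, backfills, and monitoring on top of this ordering, which is where they differ most.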

85 A comparison of workflow management tools

https://medium.com/@xunnan.xu/workflow-processing-engine-overview-2018-airflow-vs-azkaban-vs-conductor-vs-oozie-vs-amazon-step-90affc54d53b

86 Metadata and access management

● Many tools
○ Apache Atlas
○ Apache Ranger
○ Apache Knox
○ Netflix Metacat
○ Ab Initio (commercial)
○ Alation (commercial)
○ Collibra (commercial)
● Many other tools have their own internal management functionality
○ E.g., Rockset

87 Output layer

Section 8 of 10

88 Refresher: Types of databases

Transactional applications Analytics applications Operational applications

https://rockset.com/blog/operational-analytics-what-every-software-engineer-should-know/

89 Refresher: Types of databases

90 Output choices

● Keep the output in the platform
● Push the output to a serving DB
○ Reporting
○ Timeseries
○ Key/value
○ Search and/or document
○ Graph
● Push the output to another system
○ Another data platform
○ Streaming system
○ Another destination, possibly external

91 Serving DB choices

● Two key use cases
○ Reporting
○ Operational
● Reporting use case
○ Usually for analytics
○ More periodic, slower pace
○ Connected to more static dashboards
● Operational use case
○ Usually for operations
○ More real-time
○ Connected to more dynamic / live dashboards
● Data stores for reporting
○ Postgres, MariaDB, Druid
○ Commercial solutions: Oracle, Teradata
○ Public cloud solutions: Internal
● Data stores for operational
○ Tsdb
○ Commercial solutions: Rockset, Elasticsearch, Redis, InfluxDB
○ Public cloud solutions: Internal

92 Business intelligence interfaces

● Operational
○ Grafana
○
● Reporting
○ Looker
○ Microstrategy
○ Tableau

93 Serving DB interface example: Overview

94 Serving DB interface example: Metadata

95 Serving DB interface example: Query

96 Serving DB interface example: Management

97 Management and operations

Section 9 of 10

98 Service Level Indicators, Objectives, and Agreements

● Data platforms usually start service without any guarantees.
● Users also do not expect any guarantees as they usually experiment or prototype.
● However, within six or so months, as more production jobs are moved to the data platform, service level guarantees will be demanded.
● In our experience, a sign of success with data platformization is that more and more users start complaining about such service level experiences as the speed of job execution, the number of users, and the reliability and availability of data sources.
● Service level indicators are common for all users.
○ Every user expects reliable operation.
● Service level objectives may be user or task dependent.
○ One user may require her job to execute once a day, another once every hour.

99 Service Level Indicators (SLIs) to guarantee

● Latency
● Frequency
● Data correctness, exactly-once semantics
● Availability
● Durability
● Reliability
● Mean time to resolve (MTTR) incidents
● Mean time to detect (MTTD) incidents
● Mean time to communicate incidents and outages
● Change management
● Resource usage, cost
● Ease of development
● Ease of deployment globally
● Ease of experimentation
● Ease of operation
● Ease of metadata management
○ for data, jobs, models, and features
● Ease of data ingress, data egress
● Availability of tools for access and insights
● Number of users supported

Usually the durability objective for data is far more stringent than the availability of the platform.

In other words, the platform may be down but should not lose data.
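Availability objectives like those above translate directly into downtime budgets; a quick sketch of the arithmetic (the SLO values are illustrative, not recommendations):

```python
# Downtime budget implied by an availability SLO over a 30-day month.
def monthly_downtime_budget_minutes(slo: float, days: int = 30) -> float:
    return (1 - slo) * days * 24 * 60

print(monthly_downtime_budget_minutes(0.999))    # ~43.2 minutes per month
print(monthly_downtime_budget_minutes(0.9999))   # ~4.3 minutes per month
```

A far more stringent durability objective (e.g., "eleven nines" for data) is what makes "the platform may be down but should not lose data" achievable in practice.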

100 Multi-cluster management

● Need for multiple clusters
○ Different sets of users or applications, e.g., financial data with insiders vs not
○ Disaster recovery
○ Load reduction or balancing
○ Cluster size limitations due to, e.g., Hadoop software scaling limitations
● Explanation using the disaster recovery example: Given two clusters A and B
○ The same data on both at any given time
■ How? Constantly replicate data from A to B and vice versa
■ For improved data consistency, ingest data to A and then replicate to B, or vice versa
○ The same applications ready to go on both at any given time
■ How? Deploy to both continuously and test at regular intervals
○ Each is sized to support the desired traffic load if the other goes down
○ Due to the cost of these clusters, it is not recommended to run them as active-passive. Both can be active for different sets of applications and users while each is ready to take the load from the other in case of a disaster.

101 Disaster recovery

● RPO: Recovery Point Objective
○ Time during which data got lost due to an incident
● RTO: Recovery Time Objective
○ Time to recover from an incident
● RPO and RTO may depend on the cluster, the data, the applications, etc.
● One way to minimize RPO
○ If raw data comes from sources outside the data platform, have these sources keep a history longer than the RPO
● RTO will include multiple activities
○ Ingesting the “lost” data from the sources
○ Starting the applications
○ Running the applications to produce the “lost” output
○ Running the applications to process the current incoming data

https://en.wikipedia.org/wiki/Disaster_recovery

102 Operations

● Users just want to focus on getting their reports and insights.
● Need full operations support to ensure users are happy.
● As a data platform gets more production usage and dependence, the expectations from and pressure on the operations team will increase significantly, so start early to build the necessary operations muscles.
● Need an operations team just for the data platform.
○ If the team is small and/or inexperienced, also get maintenance and service deals from vendors for hardware and software support.
● Make sure the data, the pipelines, and the applications are owned by non-operations teams; the operations team needs to focus on ensuring that the platform runs well as an enabler for its users.
● One sign of success of a data platform: Users complain about limited data platform resources, how long their applications wait in queues, how long their applications take to run, etc.

103 The rest of management concerns

Other concerns

● Data governance ● Security ● Data locality ● Privacy ● GDPR

We will not cover these in this tutorial due to time and focus limitations.

104 Example: Machine Learning

Section 10 of 10

105 Microsoft Cloud: Big data architecture for IoT

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

106 Google Cloud: Machine learning processing

https://cloud.google.com/solutions/build-a-data-lake-on-gcp

107 Apple: Machine learning processing

https://dl.acm.org/citation.cfm?id=3314050

108 Apple: ML from conventional to integrated

https://dl.acm.org/citation.cfm?id=3314050

109 Apple: ML data platform architecture

https://dl.acm.org/citation.cfm?id=3314050

110 Example: Web Search Log Analysis

Section 10 of 10

The 1st production application on Hadoop, internally called “key site monitoring”, deployed in the summer of 2006 at Yahoo!

111 Refresher: Web Search Engine Architecture

https://www.slideshare.net/adas0693

112 Crawler Log Analysis: Key Site Monitoring

● A web search engine gets content by crawling the web sites on the WWW.
● Key sites were the most important sites that our crawlers were configured to always crawl for these sites’ quality content.
○ Key sites were derived using a PageRank-like algorithm on the web graph.
● Key site monitoring (KSM) was to ensure that these sites were always crawled satisfactorily every day.
● The (simplified) KSM process worked as follows:
○ Pull in a day’s log files from a sample of crawler nodes into HDFS
○ Scan these log files for daily statistics for key sites in parallel using MapReduce on Hadoop
○ Add daily statistics to the existing time series for key sites on a DB outside Hadoop
○ Provide graphical visualization and alerting over these time series
● v1.0 vs v2.0
○ v1.0: Move large log files from a sample of crawler nodes into HDFS; scan them in parallel on Hadoop; send tiny stats data into monitoring DB
○ v2.0: Scan log files in each crawler node in parallel for daily stats; send tiny stats data into monitoring DB

113 Key Site Monitoring Architecture
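The daily-stats scan at the heart of the KSM process is word-count-shaped. A toy sketch of the map and reduce steps, with an invented two-field log format standing in for the real crawler logs:

```python
from collections import Counter

# Map: each crawler log line -> (site, 1) if the fetch succeeded.
# Toy log format (invented): "<site> <http-status>"
def map_line(line: str):
    site, status = line.split()
    if status == "200":
        yield (site, 1)

# Two "crawler nodes" worth of logs; the real system scanned these
# in parallel via MapReduce on Hadoop.
logs = [
    ["siteA.com 200", "siteB.com 200", "siteA.com 404"],
    ["siteA.com 200", "siteB.com 500"],
]

# Reduce: sum the per-site counts into one day's statistics,
# which would then be appended to time series in the monitoring DB.
daily_stats = Counter()
for node_log in logs:
    for line in node_log:
        for site, n in map_line(line):
            daily_stats[site] += n

print(daily_stats["siteA.com"], daily_stats["siteB.com"])   # 2 1
```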

(Diagram label: Crawler Logs)

https://www.linkedin.com/pulse/first-production-application-hadoop-ali-dasdan/

114 Example: Online Real-Time Advertising

Section 10 of 10

115 Resources to learn the details: Systems

● How does online advertising work?
○ https://www.linkedin.com/pulse/how-does-online-real-time-advertising-work-ali-dasdan/
● Architecture of a generic online advertising platform
○ https://www.linkedin.com/pulse/architecture-generic-online-real-time-advertising-platform-ali-dasdan/
● From 0.5 Million to 2.5 Million: Efficiently Scaling up Real-Time Bidding
○ https://ieeexplore.ieee.org/document/7373421
○ http://www0.cs.ucl.ac.uk/staff/w.zhang/rtb-papers/turn-throatling.pdf
● Overview of Turn Data Management Platform for Digital Advertising
○ http://www.vldb.org/pvldb/vol6/p1138-elmeleegy.pdf
● A High Performance, Custom Data Warehouse on Top of MapReduce (similar to Hive)
○ https://www.vldb.org/pvldb/vldb2010/pvldb_vol3/I08.pdf
● Leveraging Materialized Views For Very Low-Latency Analytics Over High-Dimensional Web-Scale Data
○ http://www.vldb.org/pvldb/vol9/p1269-liu.pdf

116 Resources to learn the details: Data Science

● Estimating Conversion Rate in Display Advertising from Past Performance Data
○ https://dl.acm.org/citation.cfm?id=2339651
○ https://s.yimg.com/ge/labs/v2/uploads/kdd20122.pdf
● Online Model Evaluation in a Large-Scale Computational Advertising Platform
○ https://arxiv.org/abs/1508.07678
● Multi-Touch Attribution Based Budget Allocation in Online Advertising
○ https://arxiv.org/abs/1502.06657
● User Clustering in Online Advertising via Topic Models
○ https://arxiv.org/abs/1501.06595
● Scalable Audience Reach Estimation in Real-time Online Advertising
○ https://arxiv.org/abs/1305.3014
● Real Time Bid Optimization with Smooth Budget Delivery in Online Advertising
○ https://arxiv.org/abs/1305.3011

117 Final: Lessons learned

118 Lessons learned

1. Start small, build it, and they will come
2. Put all data in one place
3. Build a general platform
4. Do not overbuild as you will rebuild
5. Use your own infrastructure
6. Ship early and often
7. A/B test relentlessly
8. Be patient: Innovation takes time to get accepted

119 References

120 References (1 of 12)

https://www.ibm.com/cloud/garage/architectures/dataAnalyticsArchitecture/reference-architecture

https://bigdatawg.nist.gov/_uploadfiles/NIST.SP.1500-6r1.pdf

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

https://www.oracle.com/assets/oracle-wp-big-data-refarch-2019930.pdf

http://eprints.bournemouth.ac.uk/30020/1/SKIMA2016_paper_76.pdf

https://core.ac.uk/download/pdf/83943361.pdf

https://www.emc.com/collateral/white-paper/h12878-rsa-pivotal-security-big-data-reference-architecture-wp.pdf

https://docs.cloudera.com/documentation/other/reference-architecture.html

https://www.dellemc.com/resources/en-us/asset/white-papers/solutions/cdh-architecture-guide.pdf

https://h20195.www2.hpe.com/v2/GetPDF.aspx/4AA5-9086ENW.pdf

https://www.cisco.com/c/dam/m/en_sg/dc-innovation/assets/pdfs/cloudera-enterprise-data-lake-presentation.pdf

https://lenovopress.com/tips1329.pdf

https://eng.uber.com/uber-big-data-platform/

121 References (2 of 12)

https://eng.uber.com/managing-data-workflows-at-scale/

https://medium.com/netflix-techblog/hadoop-platform-as-a-service-in-the-cloud-c23f35f965e7

https://www.youtube.com/watch?v=CSDIThSwA7s

https://medium.com/netflix-techblog/genie-2-0-second-wish-granted-d888d79455c6

https://medium.com/netflix-techblog/evolving-the-netflix-data-platform-with-genie-3-598021604dda

https://medium.com/netflix-techblog/announcing-suro-backbone-of-netflixs-data-pipeline-5c660ca917b6

https://medium.com/netflix-techblog/how-data-inspires-building-a-scalable-resilient-and-secure-cloud-infrastructure-at-netflix-c14ea9f2d00c

https://slack.engineering/data-wrangling-at-slack-f2e0ff633b69

https://medium.com/airbnb-engineering/data/home

https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c

https://cloud.google.com/data-catalog/

https://landing.google.com/sre/workbook/chapters/data-processing/

https://cloud.google.com/solutions/build-a-data-lake-on-gcp

https://ai.google/research/pubs/pub37795

122 References (3 of 12)

https://ai.google/research/pubs/pub48449

https://ai.google/research/pubs/pub47600

https://ai.google/research/pubs/pub46178

https://ai.google/research/pubs/pub48388

https://ai.google/research/pubs/pub36256

https://ai.google/research/pubs/pub36257

https://ai.google/research/pubs/pub44915

https://ai.google/research/pubs/pub41318

https://ai.google/research/pubs/pub41344

https://ai.google/research/pubs/pub47224

https://ai.google/research/pubs/pub38125

https://ai.google/research/pubs/pub37217

https://ai.google/research/pubs/pub37252

123 References (4 of 12)

https://ai.google/research/pubs/pub47669

https://ai.google/research/pubs/pub47828

https://ai.google/research/pubs/pub45394

https://ai.google/research/pubs/pub44686

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42851.pdf

https://ai.google/research/pubs/pub43287

https://ai.google/research/pubs/pub43462

https://ai.google/research/pubs/pub43985

https://ai.google/research/pubs/pub40465

https://arxiv.org/pdf/1905.12133.pdf

https://dl.acm.org/citation.cfm?id=3314050

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/partly-cloudy-architecture.html

124 References (5 of 12)

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/democratizing-data-analysis-with-google-bigquery.html

https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/interactive-analytics-at-mopub.html

http://www.vldb.org/pvldb/vol6/p1138-elmeleegy.pdf

http://www.vldb.org/pvldb/vol9/p1269-liu.pdf

https://www.vldb.org/pvldb/vldb2010/pvldb_vol3/I08.pdf

https://research.fb.com/publications/realtime-data-processing-at-facebook/

https://research.fb.com/publications/presto-sql-on-everything/

https://research.fb.com/publications/a-large-scale-study-of-data-center-network-reliability/

https://research.fb.com/publications/providing-streaming-joins-as-a-service-at-facebook/

https://research.fb.com/publications/sve-distributed-video-processing-at-facebook-scale/

https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/

125 References (6 of 12)

https://research.fb.com/publications/dqbarge-improving-data-quality-tradeoffs-in-large-scale-internet-services-2/

https://research.fb.com/publications/cubrick-indexing-millions-of-records-per-second-for-interactive-analytics/

https://research.fb.com/publications/one-trillion-edges-graph-processing-at-facebook-scale/

https://research.fb.com/publications/cubrick-a-scalable-distributed-molap-database-for-fast-analytics/

https://research.fb.com/publications/wormhole-reliable-pub-sub-to-support-geo-replicated-internet-services/

https://research.fb.com/publications/an-analysis-of-facebook-photo-caching/

https://research.fb.com/publications/unicorn-a-system-for-searching-the-social-graph/

https://research.fb.com/publications/tao-facebooks-distributed-data-store-for-the-social-graph-2/

https://research.fb.com/publications/tao-facebooks-distributed-data-store-for-the-social-graph/

https://research.fb.com/publications/apache-hadoop-goes-realtime-at-facebook/

https://research.fb.com/publications/finding-a-needle-in-haystack-facebooks-photo-storage/

https://research.fb.com/publications/data-warehousing-and-analytics-infrastructure-at-facebook/

https://research.fb.com/publications/hive-a-warehousing-solution-over-a-map-reduce-framework/

https://research.fb.com/publications/applied-machine-learning-at-facebook-a-datacenter-infrastructure-perspective/

126 References (7 of 12)

https://research.fb.com/publications/facebooks-data-center-network-architecture/

https://research.fb.com/wp-content/uploads/2016/11/realtime_data_processing_at_facebook.pdf

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

https://engineering.linkedin.com/blog/topic/stream-processing

https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin

https://engineering.linkedin.com/blog/2016/06/voices--a-text-analytics-platform-for-understanding-member-feedb

https://engineering.linkedin.com/blog/topic/espresso

https://engineering.linkedin.com/teams/data/projects/data-management

https://engineering.linkedin.com/teams/data/projects/wherehows

https://engineering.linkedin.com/blog/2016/03/open-sourcing-wherehows--a-data-discovery-and-lineage-portal

https://engineering.linkedin.com/teams/data/projects/pinot

https://engineering.linkedin.com/teams/data/projects/xlnt

https://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin

127 References (8 of 12)

https://engineering.linkedin.com/teams/data/projects/dr-elephant

https://engineering.linkedin.com/teams/data/projects/gobblin

https://engineering.linkedin.com/teams/data/projects/dali

https://engineering.linkedin.com/blog/topic/Spark

https://engineering.linkedin.com/blog/topic/hadoop

https://docs.microsoft.com/en-us/azure/architecture/data-guide/

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/batch-processing

https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/real-time-processing

https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/analytical-data-stores

https://azure.microsoft.com/en-us/solutions/architecture/advanced-analytics-on-big-data/

https://azure.microsoft.com/en-us/solutions/architecture/modern-data-warehouse/

https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/

https://blogs.msdn.microsoft.com/azurecat/2018/12/24/azure-data-architecture-guide-blog-8-data-warehousing/

128 References (9 of 12)

https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data

http://www.vldb.org/pvldb/vol12/p2170-tan.pdf

https://www.vldb.org/pvldb/vldb2010/pvldb_vol3/I07.pdf

http://www.vldb.org/pvldb/vol12/p2143-antonopoulos.pdf

https://azure.microsoft.com/en-us/blog/microsoft-azure-data-welcomes-attendees-to-vldb-2018/

http://www.vldb.org/pvldb/vol9/p1245-lang.pdf

http://www.vldb.org/pvldb/vol8/p401-chandramouli.pdf

http://www.vldb.org/pvldb/vol6/p1178-lomet.pdf

http://www.cs.ucf.edu/~kienhua/classes/COP5711/Papers/MSazure2017.pdf

https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf

https://www.microsoft.com/en-us/research/uploads/prod/2019/05/QuickInsights-camera-ready-final.pdf

http://pages.cs.wisc.edu/~remzi/Classes/739/Spring2004/Papers/p215-dageville-snowflake.pdf

http://www.vldb.org/pvldb/vol11/p1317-mahajan.pdf

http://www.vldb.org/pvldb/vol12/p681-zou.pdf

129 References (10 of 12)

http://www.benstopford.com/2012/10/28/the-best-of-vldb-2012/

http://pages.cs.wisc.edu/~jignesh/publ/Quickstep.pdf

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.640.7744&rep=rep1&type=pdf

https://www.betterevaluation.org/sites/default/files/data_cleaning.pdf

http://www.vldb.org/pvldb/vol6/p1009-aji.pdf

https://www.scitepress.org/papers/2017/65871/65871.pdf

https://db.in.tum.de/~kipf/papers/fastdata.pdf

https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf

https://cs.stanford.edu/~matei/papers/2013/sigmod_shark.pdf

https://hal.archives-ouvertes.fr/tel-02059437v2/document

https://arxiv.org/pdf/1907.00050.pdf

http://www.cs.cornell.edu/~ragarwal/pubs/zipg.pdf

http://dbgroup.cs.tsinghua.edu.cn/ligl/papers/sigmod19-tutorial-paper.pdf

https://homes.cs.washington.edu/~magda/papers/wang-cidr17.pdf

130 References (11 of 12)

https://arxiv.org/pdf/1702.01596.pdf

https://www.comp.nus.edu.sg/~memepic/pdfs/mem-survey.pdf

http://www.vldb.org/pvldb/vol11/p1414-agrawal.pdf

http://da.qcri.org/rheem/pdf/paper/rheemtutorial.pdf

http://www.ijsrp.org/research-paper-0613/ijsrp-p18143.pdf

https://www.slideshare.net/YunyaoLi/enterprise-search-in-the-big-data-era-recent-developments-and-open-challenges-38801004

http://cidrdb.org/cidr2017/papers/p123-barber-cidr17.pdf

https://adalabucsd.github.io/papers/Slides_2017_Tutorial_SIGMOD.pdf

http://static.druid.io/docs/druid.pdf

http://www.vldb.org/pvldb/vol11/p1822-cai.pdf

http://cs.brown.edu/~xhu3/paper/diffprivacy.pdf

https://e.huawei.com/us/solutions/cloud-computing/big-data

http://www.vldb.org/pvldb/vol7/p1754-huawei.pdf

https://dl.acm.org/citation.cfm?id=3275550

131 References (12 of 12)

http://conferences.cis.umac.mo/icde2019/wp-content/uploads/2019/06/icde-2019-keynote-jianjun-chen.pdf

https://astro.temple.edu/~tua86772/zhang14SIGMOD_DEMO.pdf

https://users.wpi.edu/~yli15/Includes/SIGMOD15Telco.pdf

https://15721.courses.cs.cmu.edu/spring2016/papers/p337-soliman.pdf

https://aws.amazon.com/big-data/datalakes-and-analytics/

http://www.vldb.org/pvldb/vol11/p1781-schelter.pdf

https://www.allthingsdistributed.com/files/p1041-verbitski.pdf

https://dl.acm.org/citation.cfm?id=3196937

https://lovvge.github.io/Forecasting-Tutorial-VLDB-2018/

http://lunadong.com/

https://databricks.com/wp-content/uploads/2018/12/adam-sigmod-2015.pdf

132 End of tutorial -- Thank you
