Presto: the Definitive Guide
Total Page:16
File Type:pdf, Size:1020Kb
Presto The Definitive Guide SQL at Any Scale, on Any Storage, in Any Environment Compliments of Matt Fuller, Manfred Moser & Martin Traverso Virtual Book Tour Starburst presents Presto: The Definitive Guide Register Now! Starburst is hosting a virtual book tour series where attendees will: Meet the authors: • Meet the authors from the comfort of your own home Matt Fuller • Meet the Presto creators and participate in an Ask Me Anything (AMA) session with the book Manfred Moser authors + Presto creators • Meet special guest speakers from Martin your favorite podcasts who will Traverso moderate the AMA Register here to save your spot. Praise for Presto: The Definitive Guide This book provides a great introduction to Presto and teaches you everything you need to know to start your successful usage of Presto. —Dain Sundstrom and David Phillips, Creators of the Presto Projects and Founders of the Presto Software Foundation Presto plays a key role in enabling analysis at Pinterest. This book covers the Presto essentials, from use cases through how to run Presto at massive scale. —Ashish Kumar Singh, Tech Lead, Bigdata Query Processing Platform, Pinterest Presto has set the bar in both community-building and technical excellence for lightning- fast analytical processing on stored data in modern cloud architectures. This book is a must-read for companies looking to modernize their analytics stack. —Jay Kreps, Cocreator of Apache Kafka, Cofounder and CEO of Confluent Presto has saved us all—both in academia and industry—countless hours of work, allowing us all to avoid having to write code to manage distributed query processing. We’re so grateful to have a high-quality open source distributed SQL engine to start from, enabling us to focus on innovating in new areas instead of reinventing the wheel for each new distributed data system project. —Daniel Abadi, Professor of Computer Science, University of Maryland, College Park Presto: The Definitive Guide SQL at Any Scale, on Any Storage, in Any Environment Matt Fuller, Manfred Moser, and Martin Traverso Beijing Boston Farnham Sebastopol Tokyo Presto: The Definitive Guide by Matt Fuller, Manfred Moser, and Martin Traverso Copyright © 2020 Matt Fuller, Martin Traverso, and Simpligility Technologies Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected]. Acquisition Editor: Jonathan Hassell Indexer: Potomac Indexing, LLC Development Editor: Michele Cronin Interior Designer: David Futato Production Editor: Elizabeth Kelly Cover Designer: Karen Montgomery Copyeditor: Sharon Wilkey Illustrator: Rebecca Demarest Proofreader: Piper Editorial April 2020: First Edition Revision History for the First Edition 2020-04-03: First release See http://oreilly.com/catalog/errata.csp?isbn=9781492044277 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Presto: The Definitive Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. This work is part of a collaboration between O’Reilly and Starburst. See our statement of editorial inde‐ pendence. 978-1-49208-403-7 [LSI] Table of Contents Foreword. xiii Preface. xv Part I. Getting Started with Presto 1. Introducing Presto. 3 The Problems with Big Data 3 Presto to the Rescue 4 Designed for Performance and Scale 5 SQL-on-Anything 6 Separation of Data Storage and Query Compute Resources 7 Presto Use Cases 7 One SQL Analytics Access Point 7 Access Point to Data Warehouse and Source Systems 8 Provide SQL-Based Access to Anything 9 Federated Queries 10 Semantic Layer for a Virtual Data Warehouse 10 Data Lake Query Engine 11 SQL Conversions and ETL 11 Better Insights Due to Faster Response Times 11 Big Data, Machine Learning, and Artificial Intelligence 12 Other Use Cases 12 Presto Resources 12 Website 12 Documentation 13 Community Chat 13 v Source Code, License, and Version 14 Contributing 14 Book Repository 15 Iris Data Set 15 Flight Data Set 16 A Brief History of Presto 16 Conclusion 17 2. Installing and Configuring Presto. 19 Trying Presto with the Docker Container 19 Installing from Archive File 20 Java Virtual Machine 20 Python 21 Installation 21 Configuration 22 Adding a Data Source 23 Running Presto 24 Conclusion 24 3. Using Presto. 25 Presto Command-Line Interface 25 Getting Started 25 Pagination 28 History 28 Additional Diagnostics 28 Executing Queries 29 Output Formats 30 Ignoring Errors 30 Presto JDBC Driver 30 Downloading and Registering the Driver 32 Establishing a Connection to Presto 32 Presto and ODBC 35 Client Libraries 35 Presto Web UI 35 SQL with Presto 36 Concepts 37 First Examples 37 Conclusion 40 vi | Table of Contents Part II. Diving Deeper into Presto 4. Presto Architecture. 43 Coordinator and Workers in a Cluster 43 Coordinator 45 Discovery Service 46 Workers 46 Connector-Based Architecture 47 Catalogs, Schemas, and Tables 48 Query Execution Model 48 Query Planning 53 Parsing and Analysis 54 Initial Query Planning 54 Optimization Rules 57 Predicate Pushdown 57 Cross Join Elimination 58 TopN 58 Partial Aggregations 59 Implementation Rules 60 Lateral Join Decorrelation 60 Semi-Join (IN) Decorrelation 61 Cost-Based Optimizer 62 The Cost Concept 62 Cost of the Join 64 Table Statistics 65 Filter Statistics 66 Table Statistics for Partitioned Tables 67 Join Enumeration 68 Broadcast Versus Distributed Joins 68 Working with Table Statistics 70 Presto ANALYZE 70 Gathering Statistics When Writing to Disk 71 Hive ANALYZE 71 Displaying Table Statistics 72 Conclusion 72 5. Production-Ready Deployment. 73 Configuration Details 73 Server Configuration 73 Logging 75 Node Configuration 76 JVM Configuration 77 Table of Contents | vii Launcher 77 Cluster Installation 79 RPM Installation 80 Installation Directory Structure 81 Configuration 82 Uninstall Presto 82 Installation in the Cloud 82 Cluster Sizing Considerations 83 Conclusion 84 6. Connectors. 85 Configuration 86 RDBMS Connector Example PostgreSQL 87 Query Pushdown 88 Parallelism and Concurrency 90 Other RDBMS Connectors 90 Security 92 Presto TPC-H and TPC-DS Connectors 92 Hive Connector for Distributed Storage Data Sources 93 Apache Hadoop and Hive 94 Hive Connector 95 Hive-Style Table Format 96 Managed and External Tables 97 Partitioned Data 98 Loading Data 100 File Formats and Compression 102 MinIO Example 103 Non-Relational Data Sources 104 Presto JMX Connector 104 Black Hole Connector 106 Memory Connector 107 Other Connectors 107 Conclusion 108 7. Advanced Connector Examples. 109 Connecting to HBase with Phoenix 109 Key-Value Store Connector Example: Accumulo 110 Using the Presto Accumulo Connector 113 Predicate Pushdown in Accumulo 115 Apache Cassandra Connector 117 Streaming System Connector Example: Kafka 118 Document Store Connector Example: Elasticsearch 120 viii | Table of Contents Overview 120 Configuration and Usage 121 Query Processing 121 Full-Text Search 122 Summary 122 Query Federation in Presto 122 Extract, Transform, Load and Federated Queries 129 Conclusion 129 8. Using SQL in Presto. 131 Presto Statements 132 Presto System Tables 134 Catalogs 136 Schemas 137 Information Schema 138 Tables 139 Table and Column Properties 141 Copying an Existing Table 142 Creating a New Table from Query Results 143 Modifying a Table 144 Deleting a Table 144 Table Limitations from Connectors 144 Views 145 Session Information and Configuration 146 Data Types 147 Collection Data Types 149 Temporal Data Types 150 Type Casting 154 SELECT Statement Basics 155 WHERE Clause 157 GROUP BY and HAVING Clauses 158 ORDER BY and LIMIT Clauses 159 JOIN Statements 160 UNION, INTERSECT, and EXCEPT Clauses 161 Grouping Operations 162 WITH Clause 164 Subqueries 165 Scalar Subquery 165 EXISTS Subquery 166 Quantified Subquery 166 Deleting Data from a Table 167 Conclusion 167 Table of Contents | ix 9. Advanced SQL. 169 Functions and Operators Introduction 169 Scalar Functions and Operators 170 Boolean Operators 171 Logical Operators 172 Range Selection with the BETWEEN Statement 173 Value Detection with IS (NOT) NULL 174 Mathematical Functions and Operators 174 Trigonometric Functions 175 Constant and Random Functions 176 String Functions and Operators 176 Strings and Maps 177 Unicode 178 Regular Expressions 179 Unnesting Complex Data Types 182 JSON Functions 183 Date and Time Functions and Operators 184 Histograms 186 Aggregate Functions 187 Map Aggregate Functions 187 Approximate Aggregate Functions 189 Window Functions