Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale
Google BigQuery: The Definitive Guide
Data Warehousing, Analytics, and Machine Learning at Scale
Valliappa Lakshmanan & Jordan Tigani

Praise for Google BigQuery: The Definitive Guide

This book is essential to the rapidly growing list of businesses that are migrating their existing enterprise data warehouses from legacy technology stacks to Google Cloud. Lak and Jordan provide a comprehensive coverage of BigQuery so that you can use it not only as your Enterprise Data Warehouse, for business analytics, but also use SQL to query real-time data streams; access BigQuery from managed Hadoop and Spark clusters; and use machine learning to automatically categorize and run forecasting and predictions on your data.
- Thomas Kurian, CEO, Google Cloud

Every once in a great while a piece of software or service comes along that changes everything. BigQuery has changed the way enterprises can think about their data, all of it. Designed from the beginning to handle the world’s largest datasets, BigQuery has gone on to be one of the best platforms for analyzing and learning from data. Announced in June 2016, “Standard SQL” is one of the most clean, complete, powerful, implementations of SQL ever designed. Powerful features include deeply nested data, user defined functions in JavaScript and SQL, geospatial data, integrated machine learning, and URL addressable data sharing, just to name a few. There is no better place to learn about BigQuery than from this book by Jordan and Lak, two of the people who know BigQuery best.
- Lloyd Tabb, Cofounder and CTO, Looker

Even though I’ve been using BigQuery for over seven years, I was pleased to discover that this book taught me things I never knew about it! It provides invaluable insights into best practices and techniques, and explains concepts in an easy to understand fashion. The code examples are a great way to follow the content in a practical, hands-on manner, and they kept the book fun and engaging. This book will undoubtedly become the go-to reference for BigQuery users.
- Graham Polley, Managing Consultant, Servian

BigQuery can handle a lot of data very fast and at a low cost. The platform is there to help you get all your data in one place for faster insights. This book is a deep dive into key parts of BigQuery. In this quest along with two prominent legendary Googlers, Lak Lakshmanan and Jordan Tigani, you’ll learn the essentials of BigQuery as well as advanced topics like machine learning. I’m a huge BigQuery advocate. Having used the tool firsthand, I can say that it will easily make your big data life a lot easier. This was an amazing read and now the BigQuery journey starts for you! Jump in!
- Mikhail Berlyant, SVP Technology, Viant Inc.

Google BigQuery: The Definitive Guide
Data Warehousing, Analytics, and Machine Learning at Scale
Valliappa Lakshmanan and Jordan Tigani
Beijing, Boston, Farnham, Sebastopol, Tokyo

Google BigQuery: The Definitive Guide
by Valliappa Lakshmanan and Jordan Tigani

Copyright © 2020 Valliappa Lakshmanan and Jordan Tigani. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editor: Nicole Taché
Production Editor: Kristen Brown
Copyeditor: Octal Publishing, LLC
Proofreader: Arthur Johnson
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest

October 2019: First Edition

Revision History for the First Edition
2019-10-23: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492044468 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc.
Google BigQuery: The Definitive Guide, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the authors, and do not represent the publisher’s views. While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-04446-8
[LSI]

Table of Contents

Preface

1. What Is Google BigQuery?
    Data Processing Architectures
    Relational Database Management System
    MapReduce Framework
    BigQuery: A Serverless, Distributed SQL Engine
    Working with BigQuery
    Deriving Insights Across Datasets
    ETL, EL, and ELT
    Powerful Analytics
    Simplicity of Management
    How BigQuery Came About
    What Makes BigQuery Possible?
    Separation of Compute and Storage
    Storage and Networking Infrastructure
    Managed Storage
    Integration with Google Cloud Platform
    Security and Compliance
    Summary

2. Query Essentials
    Simple Queries
    Retrieving Rows by Using SELECT
    Aliasing Column Names with AS
    Filtering with WHERE
    SELECT *, EXCEPT, REPLACE
    Subqueries with WITH
    Sorting with ORDER BY
    Aggregates
    Computing Aggregates by Using GROUP BY
    Counting Records by Using COUNT
    Filtering Grouped Items by Using HAVING
    Finding Unique Values by Using DISTINCT
    A Brief Primer on Arrays and Structs
    Creating Arrays by Using ARRAY_AGG
    Array of STRUCT
    TUPLE
    Working with Arrays
    UNNEST an Array
    Joining Tables
    The JOIN Explained
    INNER JOIN
    CROSS JOIN
    OUTER JOIN
    Saving and Sharing
    Query History and Caching
    Saved Queries
    Views Versus Shared Queries
    Summary

3. Data Types, Functions, and Operators
    Numeric Types and Functions
    Mathematical Functions
    Standard-Compliant Floating-Point Division
    SAFE Functions
    Comparisons
    Precise Decimal Calculations with NUMERIC
    Working with BOOL
    Logical Operations
    Conditional Expressions
    Cleaner NULL-Handling with COALESCE
    Casting and Coercion
    Using COUNTIF to Avoid Casting Booleans
    String Functions
    Internationalization
    Printing and Parsing
    String Manipulation Functions
    Transformation Functions
    Regular Expressions
    Summary of String Functions
    Working with TIMESTAMP
    Parsing and Formatting Timestamps
    Extracting Calendar Parts
    Arithmetic with Timestamps
    Date, Time, and DateTime
    Working with GIS Functions
    Summary

4. Loading Data into BigQuery
    The Basics
    Loading from a Local Source
    Specifying a Schema
    Copying into a New Table
    Data Management (DDL and DML)
    Loading Data Efficiently
    Federated Queries and External Data Sources
    How to Use Federated Queries
    When to Use Federated Queries and External Data Sources
    Interactive Exploration and Querying of Data in Google Sheets
    SQL Queries on Data in Cloud Bigtable
    Transfers and Exports
    Data Transfer Service
    Exporting Stackdriver Logs
    Using Cloud Dataflow to Read/Write from BigQuery
    Moving On-Premises Data
    Data Migration Methods
    Summary

5. Developing with BigQuery
    Developing Programmatically
    Accessing BigQuery via the REST API
    Google Cloud Client Library
    Accessing BigQuery from Data Science Tools
    Notebooks on Google Cloud Platform
    Working with BigQuery, pandas, and Jupyter
    Working with BigQuery from R
    Cloud Dataflow
    JDBC/ODBC drivers
    Incorporating BigQuery Data into Google Slides (in G Suite)
    Bash Scripting with BigQuery
    Creating Datasets and Tables
    Executing Queries
    BigQuery Objects
    Summary

6. Architecture of BigQuery
    High-Level Architecture
    Life of a Query Request
    BigQuery Upgrades
    Query Engine (Dremel)
    Dremel Architecture
    Query Execution
    Storage
    Storage Data
    Metadata
    Summary

7. Optimizing Performance and Cost
    Principles of Performance
    Key Drivers of Performance
    Controlling Cost
    Measuring and Troubleshooting
    Measuring Query Speed Using REST API
    Measuring Query Speed Using BigQuery Workload Tester
    Troubleshooting Workloads Using Stackdriver
    Reading Query Plan Information
    Increasing Query Speed
    Minimizing I/O
    Caching the Results of Previous Queries
    Performing Efficient Joins
    Avoiding Overwhelming a Worker
    Using Approximate Aggregation Functions
    Optimizing How Data Is Stored and Accessed
    Minimizing Network Overhead
    Choosing an Efficient Storage Format
    Partitioning Tables to Reduce Scan Size
    Clustering Tables Based on High-Cardinality Keys
    Time-Insensitive Use Cases
    Batch Queries
    File Loads
    Summary
    Checklist

8. Advanced Queries
    Reusable Queries
    Parameterized Queries
    SQL User-Defined Functions
    Reusing Parts of Queries
    Advanced SQL
    Working with Arrays
    Window Functions
    Table Metadata
    Data Definition Language and Data Manipulation Language
    Beyond SQL
    JavaScript UDFs
    Scripting
    Advanced Functions