Proprietary + Confidential

2020-04-30

[Figure: timeline with year marks 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015, 2016 and three rows. Papers: GFS, MapReduce, Dremel, Flume Java, Millwheel, PubSub, Dataflow, Tensorflow. Open Source. Google Cloud Products: BigQuery, Pub/Sub, Dataflow, Bigtable.]

Fully managed storage & database services

● Cloud Storage: binary or object data. Images, media serving, backups.
● App Engine Memcache: web/mobile applications, gaming. Game state, user sessions.
● Cloud Firestore: hierarchical, mobile, web. User profiles, game state.
● Cloud Bigtable: heavy read + write, events. AdTech, financial, IoT.
● Cloud SQL: web frameworks. CMS, eCommerce.
● Cloud [name missing in source]: RDBMS+scale, HA, HTAP. Transactions, Ad/Fin/MarTech.
● BigQuery: enterprise data warehouse. Analytics, dashboards.

Google’s Smart Analytics Platform

Collect, process, store, analyze and visualize data and insights

[Figure: platform diagram organized by stage: Collect, Process, Store, Analyze, Understand. Components shown: Pub/Sub (Messaging), Dataflow (Streaming), Dataproc (Hadoop/Spark), BigQuery (SQL), BigQuery Streaming, Migration Service, Data Transfer Service, Cloud Storage, Data Fusion (Data Integration), Dataprep (Wrangling), IoT Core, Databases, Batch, Looker, Partner BI Tools. Spanning all stages: Data Catalog (Metadata Mgmt) and Composer (Workflow Orchestration).]

Smart Analytics as a Service: Fully Managed. Serverless. Enterprise class. Globally Distributed. Secure.

Providing choice to customers

● Cloud Native Services (Differentiation): BigQuery, Dataflow, Pub/Sub, Data Catalog
● Managed Open Source Services (Familiarity): Composer, Dataproc, Data Fusion
● Partner Services (Completeness): Dataprep

[Figure: reference architecture. Sources: Salesforce, Stores, SAP, Partners, Apps & 3rd Party, Customers, Internal applications, Ecommerce, GA, Apps and bots, arriving in real time or in batch, with an API Manager. Ingest: Cloud Pub/Sub. Pipelines: Cloud Dataflow, Cloud Dataproc. Real time store: Cloud Bigtable. Analytics store: BigQuery. Data Lake: Cloud Storage. BI & reporting: Data Studio, any BI tool. Analytics & AI: AI Platform, Dataiku or other. Users consume the results. Governance: Cloud Composer, Data Catalog, DLP, Logging, Monitoring.]

[Figure: reference architecture. Sources: Salesforce, SAP, Apps & 3rd Party, Ecommerce, GA, arriving in real time or in batch. Ingest: Cloud Pub/Sub. Pipelines (ETL, ELT): Cloud Dataflow, Cloud Dataprep. Analytics store: BigQuery. Data Lake: Cloud Storage. BI & reporting: Looker, Data Studio, Power BI. Analytics & AI: AI Platform. Governance: Cloud Composer, Data Catalog, Logging, Monitoring.]

BigQuery

Google Cloud’s enterprise data warehouse for analytics

● Gigabyte- to petabyte-scale storage and SQL queries
● Encrypted, durable, and highly available
● Fully managed and serverless for maximum agility and scale (Unique)
● Real-time insights from streaming data (Unique)
● Built-in ML for out-of-the-box predictive insights (Unique)
● High-speed, in-memory BI Engine for faster reporting and analysis (Unique)

BigQuery | Architecture

Decoupled storage and compute for maximum flexibility

[Figure: BigQuery architecture]

● Replicated, Distributed Storage (99.9999999999% durability)
● High-Available Cluster Compute (Dremel)
● Distributed Memory Shuffle Tier
● Petabit Network
● SQL:2011 Compliant
● Streaming Ingest
● Free Bulk Loading
● REST API
● Web UI, CLI
● ODBC/JDBC
● Client Libraries In 7 languages

[Figure: a query of the form SELECT ... WHERE year... GROUP BY state runs on top of Distributed Storage as a SHUFFLE BY state followed by COUNT(*) per state]

● Faster performance for complex queries
● Join and aggregate more data
● Better scalability
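The shuffle step named above can be illustrated with a small in-memory sketch. It is not how Dremel is implemented; it only shows the idea that rows filtered by independent workers are repartitioned by the grouping key so that each partition can compute its own COUNT(*). The predicate `year >= min_year`, the function names and the partition count are assumptions made for the illustration, since the slide elides the actual WHERE clause.

```python
import zlib
from collections import Counter, defaultdict


def map_phase(rows, min_year):
    """One worker: apply the (hypothetical) WHERE year >= min_year filter
    and emit the grouping key of every surviving row."""
    return [row["state"] for row in rows if row["year"] >= min_year]


def partition_of(key, num_partitions):
    """Pick a partition from the key. crc32 is stable across runs,
    unlike Python's built-in hash() for strings."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def count_by_state(worker_inputs, min_year, num_partitions=4):
    """SELECT state, COUNT(*) ... GROUP BY state, done in three steps."""
    # Shuffle: every worker routes its keys to the partition owning that key.
    partitions = defaultdict(list)
    for rows in worker_inputs:
        for state in map_phase(rows, min_year):
            partitions[partition_of(state, num_partitions)].append(state)

    # Reduce: each partition counts independently. A given state lives in
    # exactly one partition, so the partial results never collide.
    result = {}
    for states in partitions.values():
        result.update(Counter(states))
    return result


if __name__ == "__main__":
    worker_inputs = [
        [{"state": "CA", "year": 2001}, {"state": "NY", "year": 1990}],
        [{"state": "CA", "year": 2005}, {"state": "TX", "year": 2010}],
    ]
    print(count_by_state(worker_inputs, min_year=2000))
```

Because the partitions are independent, adding partitions lets more data be joined or aggregated in parallel, which is the scalability point the slide makes.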

Processing: Cloud Dataflow, BigQuery, Cloud Dataproc, Cloud ML Engine

Storage: Cloud Storage, BigQuery Storage, Cloud Bigtable (NoSQL)

BigQuery Storage API

● Use BigQuery Storage like GCS for Dataflow and Dataproc, break down the Data Warehouse storage wall
● Run high-performance dataframes on BigQuery

[Figure: Cloud Dataflow and Cloud Dataproc reading BigQuery storage through the BQ Storage API]

Query Federation

Cloud SQL and Cloud Bigtable Federation

Query your Cloud SQL and Cloud Bigtable instances directly from BigQuery, without moving data around.

Parquet & ORC Federation

Query Parquet and ORC files directly in GCS

[Figure: BigQuery Federation reaching Cloud SQL, Cloud Bigtable, and Parquet & ORC in GCS]
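As a rough illustration of Cloud SQL federation, the sketch below builds the text of a federated query using BigQuery's `EXTERNAL_QUERY` table function, which takes a connection ID and the statement to run in the external database. The helper name, the connection ID `my-project.us.my-connection` and the `customers` table are hypothetical; submitting the query (for example through a client library) is left out so the example stays self-contained.

```python
def external_query_sql(connection_id: str, cloud_sql_statement: str) -> str:
    """Return BigQuery SQL that runs `cloud_sql_statement` inside Cloud SQL
    and exposes the result as a table in BigQuery.

    This sketch deliberately rejects characters that would need escaping
    inside a double-quoted SQL string instead of trying to escape them.
    """
    for value in (connection_id, cloud_sql_statement):
        if any(ch in value for ch in ('"', "\\", "\n")):
            raise ValueError("double quotes, backslashes and newlines are not supported here")
    return (
        "SELECT *\n"
        f'FROM EXTERNAL_QUERY("{connection_id}", "{cloud_sql_statement}")'
    )


if __name__ == "__main__":
    # Hypothetical connection and table names, for illustration only.
    print(external_query_sql("my-project.us.my-connection",
                             "SELECT id, name FROM customers"))
```

The inner statement runs in the external database's own SQL dialect, and only its result set is read by BigQuery, which is what "without moving data around" refers to.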

Enterprise-grade workload management with Reservations

BigQuery Reservations allows customers to:

● Control flat-rate spend
● Buy slots in Web UI in seconds
● Efficiently manage workloads in BigQuery
● Automatically share any unused capacity

Introducing Flex slots

● A new commitment type
  ○ Alongside monthly & annual
● Pricing
  ○ $30 per slot per month*
● More flexible
  ○ 60 second minimum
● Combine with monthly/annual
● Available in all BQ Reservations regions!
● Available in BigQuery Reservations today!

BigQuery Commitment Types and Use Cases

Time scale: Seconds, Minutes, Hours, Days, Months, Years

● Flex Slots: Rapidly respond to business demands and evaluate performance.
● Flex Slots: Plan for business-critical calendar events.
● Monthly Commitments: Ideal balance between flexibility and cost.
● Annual Commitments: The most cost-effective option for steady-state workloads.

BigQuery workload management

Customers can programmatically perform workload management using Reservations:

● Create and delete reservations
● Move projects between reservations
● Move slots between reservations

Idle slots are seamlessly and automatically shared in real-time

Example

At 3am an important workload in project_d needs to run

At 3am we create a reservation
● Move 1000 slots to the reservation
● Move project_d into reservation

At 6am we delete the reservation
● Move 1000 slots back
● Move project_d back

Project_d was guaranteed 1000 slots 3am-6am

[Figure: reservations labeled Default, BI, best-effort and On-Demand, with slot counts of 1000 slots, 1500 slots and 30 slots; idle slots seamlessly shared]
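The 3am to 6am example can be walked through with a toy in-memory model. This is not the BigQuery Reservations API: the `Reservations` class, the reservation names `default` and `nightly`, and the starting state are all hypothetical. The cost helper prorates the quoted $30 per slot per month over an assumed 730-hour month and applies the 60 second minimum; actual billing terms (the asterisk on the slide) may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Reservation:
    slots: int = 0
    projects: set = field(default_factory=set)


class Reservations:
    """Toy model of reservations holding slots and projects."""

    def __init__(self):
        self.pool = {}

    def create(self, name, slots=0, projects=()):
        self.pool[name] = Reservation(slots, set(projects))

    def delete(self, name):
        res = self.pool[name]
        if res.slots or res.projects:
            raise ValueError(f"{name} still holds slots or projects")
        del self.pool[name]

    def move_slots(self, src, dst, count):
        if self.pool[src].slots < count:
            raise ValueError(f"{src} has fewer than {count} slots")
        self.pool[src].slots -= count
        self.pool[dst].slots += count

    def move_project(self, project, src, dst):
        self.pool[src].projects.remove(project)
        self.pool[dst].projects.add(project)


def flex_cost_usd(slots, seconds, monthly_price=30.0, hours_per_month=730):
    """Rough Flex Slots cost. Assumes the monthly price can be prorated
    per hour; usage shorter than 60 seconds is billed as 60 seconds."""
    billed_seconds = max(seconds, 60)
    return slots * (billed_seconds / 3600) * (monthly_price / hours_per_month)


# Hypothetical starting state: project_d lives in a shared reservation.
r = Reservations()
r.create("default", slots=1000, projects={"project_d"})

# 3am: carve out dedicated capacity for the important workload.
r.create("nightly")
r.move_slots("default", "nightly", 1000)
r.move_project("project_d", "default", "nightly")
guaranteed = r.pool["nightly"].slots  # slots reserved for project_d

# 6am: hand everything back. This toy model only deletes an empty
# reservation, so the delete comes last.
r.move_slots("nightly", "default", 1000)
r.move_project("project_d", "nightly", "default")
r.delete("nightly")

if __name__ == "__main__":
    print(guaranteed, sorted(r.pool))
    print(round(flex_cost_usd(1000, 3 * 3600), 2))
```

Under these assumptions, buying the same 1000 slots as Flex Slots for the three-hour window would cost on the order of $123, which is the kind of short commitment the Flex Slots use cases describe.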

[Figure: Project_a through Project_f assigned across the reservations]

ETL / Scale streaming analytics pipelines with Cloud Dataflow

Streaming analytics service that minimizes processing time and cost with autoscaling while blending batch and stream processing.

● Fastest stream and batch processing on one service
● Lower TCO for streaming analytics
● Automatically burst resources when data spikes
● Build and monitor pipelines

Dataflow: Stream Analytics as a managed service

Google Cloud Dataflow Service

[Figure: data flows from Sources through the Dataflow service to Sinks]

● Streaming Engine
● Dataflow SQL
● Exactly-Once Processing
● Handling Late Data
● Dataflow Shuffle
● FlexRS cost savings
● Resource Autoscaling
● Java/Python/Go/SQL OSS SDK (Beam)
● GCP services shown, some marked optional: Cloud IAM, Compute Engine, Cloud Network, Virtual Private Cloud, Key Management Service, Stackdriver Monitoring

Demo: A simple Streaming reference architecture

Scales seamlessly to petabytes to let you focus on bringing actual value

[Figure: streaming reference architecture. Sources: devices, connected directly or through a gateway. Ingest: Cloud IoT Core, Cloud Pub/Sub. Pipelines: Cloud Dataflow. Analytics store: BigQuery. Data Lake: Cloud Storage. Analytics: Data Studio, any BI tool, AI Platform. Governance layer: Data Catalog, Logging, Monitoring.]
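In an architecture like this one, the pipeline stage usually groups device events into event-time windows, and "Handling Late Data" means deciding what to do with events that arrive after their window should have closed. The sketch below shows that idea in plain Python. It is not the Apache Beam or Dataflow API: the function names, the 60-second window, the 30-second allowed lateness and the simplified watermark (the largest event time seen so far) are all assumptions for illustration.

```python
from collections import defaultdict


def window_start(event_time, window_size):
    """Start of the fixed window that contains event_time (in seconds)."""
    return event_time - (event_time % window_size)


def windowed_counts(events, window_size=60, allowed_lateness=30):
    """Count events per (window, key) in arrival order.

    events: iterable of (event_time_seconds, key) tuples.
    Returns (counts, dropped): counts maps (window_start, key) to a count,
    dropped lists the events that arrived too late to be included.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = 0  # simplification: largest event time observed so far

    for event_time, key in events:
        watermark = max(watermark, event_time)
        start = window_start(event_time, window_size)
        window_end = start + window_size

        # An event is too late once the watermark has passed the end of
        # its window by at least the allowed lateness.
        if watermark >= window_end + allowed_lateness:
            dropped.append((event_time, key))
            continue
        counts[(start, key)] += 1

    return dict(counts), dropped


if __name__ == "__main__":
    arrivals = [(5, "device-1"), (62, "device-1"), (50, "device-1"),
                (130, "device-1"), (10, "device-1")]
    print(windowed_counts(arrivals))
```

In the sample run, the event stamped 50 arrives after one stamped 62 but is still counted in the first window because it falls within the allowed lateness; the event stamped 10 arrives after the watermark has reached 130 and is dropped.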

Simplify Data Analytics or Machine Learning with Cloud Dataprep

Automate the data lifecycle

[Figure: an Analyst surrounded by a cycle of steps labeled Discover, Validate, Structure, Enrich, Clean]

Legacy data preparation

● Business users not empowered to transform data samples
● Must hire an IT/Data ops team and manage a Hadoop cluster
● Negotiate org-wide software licenses, arrange billing and manage seats
● Integrate application permissions with infrastructure permissions

Modern data preparation on Cloud Dataprep:

● Business users push the “Run Job” button to apply transformations to datasets of any size
● No need to create or manage infrastructure
● No need to provision software licenses
● Integrated, and highly scalable

Serverless Simplicity, Fast Exploration, Easy Preparation

Process diverse datasets - structured or unstructured

Prepare datasets of any size, PB or MB, with equal ease

Leverages Cloud Dataflow without needing to write any scripts

Auto-scalable and can easily handle processing massive data sets


Sources

● BigQuery tables
● Cloud Storage or local upload using common file formats:
  ○ CSV
  ○ LOG
  ○ JSON
  ○ GZIP
  ○ TXT
  ○ BZIP

Targets

● BigQuery tables
● Cloud Storage:
  ○ CSV (compressed or not)
  ○ JSON (compressed or not)
  ○ Avro

Cloud Dataproc

Combining the best of open source and cloud.

Open source data and analytics processing at scale on Cloud Dataproc

Build data and analytics processing jobs using the open source software you love with the scale, security, and governance of the cloud.

● Autoscale SQL, batch, streaming, and machine learning open source processing (Apache MapReduce, Apache Spark, Presto, etc.)
● Lower TCO of running OSS
● Build Spark jobs on

The benefits of Hadoop/Spark on Cloud

[Figure: responsibility comparison across three deployments: On premises, On Compute Engine, Cloud Dataproc. Each column lists the same layers: Custom code, Monitoring/Health, Dev integration, Scaling, Job submission, GCP connectivity, Deployment, Creation. A legend marks each layer as Self-managed or Google managed.]

Build code-free data pipelines with Data Fusion

Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines.

● Use a prebuilt open source library of connectors
● Execute data pipelines in Apache Spark
● Metadata and lineage integrations
● Build Apache Kafka pipelines

That’s a wrap.
