Proprietary + Confidential
2020-04-30

Figure: timeline with year marks 2002, 2004, 2005, 2006, 2008, 2010, 2012, 2014, 2015 and 2016, relating Google papers and open source projects (GFS, MapReduce, BigTable, Dremel, Flume Java, Millwheel, PubSub, Dataflow, Tensorflow) to Google Cloud products (BigQuery, Pub/Sub, Dataflow, Bigtable).

Fully managed storage & database services
● Cloud Storage: binary or object data (images, media serving, backups)
● App Engine Memcache: web/mobile applications, gaming (game state, user sessions)
● Cloud Firestore: hierarchical, mobile, web (user profiles, game state)
● Cloud Bigtable: heavy read + write, events (AdTech, financial, IoT)
● Cloud SQL: web frameworks (CMS, eCommerce)
● Cloud Spanner: RDBMS+scale, HA, HTAP (transactions, Ad/Fin/MarTech)
● BigQuery: enterprise data warehouse (analytics, dashboards)
Google’s Smart Analytics Platform
Collect, process, store, analyze and visualize data and insights

Figure: platform components arranged across five stages (Collect, Process, Store, Analyze, Understand), with streaming and batch paths. Components shown: Data Catalog (Metadata Mgmt), Pub/Sub (Messaging), Dataflow (Streaming), Migration Service, Data Transfer Service, IoT Core, Databases, Dataproc (Hadoop/Spark), Dataprep (Wrangling), Data Fusion (Data Integration), Cloud Storage, BigQuery (SQL), Looker, Partner BI Tools, and Composer (Workflow Orchestration).
Smart Analytics as a Service: Fully Managed. Serverless. Enterprise class. Globally Distributed. Secure.

Providing choice to customers
● Cloud Native Services (Differentiation): BigQuery, Dataflow, Pub/Sub, Data Catalog
● Managed Open Source Services (Familiarity): Composer, Dataproc, Data Fusion
● Partner Services (Completeness): Dataprep
Figure: reference architecture. Sources (Salesforce, stores, SAP, partners, apps & 3rd party, customers, internal applications, ecommerce, GA) are ingested in real time through Cloud Pub/Sub or in batch through Cloud Storage, processed by Cloud Dataflow and Cloud Dataproc pipelines, and stored in BigQuery (analytics store), Cloud Bigtable (real-time store) and Cloud Storage (data lake). Users are served through an API Manager, BI & reporting (Data Studio, any BI tool), analytics & AI (AI Platform, Dataiku or other), and apps and bots. Cloud Composer, Data Catalog, DLP, Logging and Monitoring span the platform.

Figure: variant of the same architecture. Sources (Salesforce, SAP, apps & 3rd party, ecommerce, GA) are ingested in real time through Cloud Pub/Sub or in batch through Cloud Storage, processed by Cloud Dataprep and Cloud Dataflow pipelines (ETL, ELT), and stored in BigQuery (analytics store) and Cloud Storage (data lake). BI & reporting uses Looker, Data Studio and Power BI; analytics & AI uses AI Platform. Cloud Composer, Data Catalog, Logging and Monitoring span the platform.
BigQuery

Google Cloud Platform’s enterprise data warehouse for analytics
● Fully managed and serverless for maximum agility and scale (Unique)
● Gigabyte- to petabyte-scale storage and SQL queries
● Encrypted, durable, and highly available
● Real-time insights from streaming data (Unique)
● Built-in ML for out-of-the-box predictive insights (Unique)
● High-speed, in-memory BI Engine for faster reporting and analysis (Unique)

BigQuery | Architecture
Decoupled storage and compute for maximum flexibility
Figure: replicated, distributed storage (99.9999999999% durability) and a high-available compute cluster (Dremel), connected by a petabit network and a distributed memory shuffle tier.
● SQL:2011 compliant
● Streaming ingest
● Free bulk loading
● REST API, Web UI, CLI, ODBC/JDBC
● Client libraries in 7 languages

Figure: a query of the form SELECT ... WHERE year... GROUP BY state runs as a SHUFFLE BY state followed by COUNT(*) per state, on top of distributed storage.
● Faster performance for complex queries
● Join and aggregate more data
● Better scalability
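The shuffle step named on the slide (SHUFFLE BY state, then COUNT(*)) can be illustrated with a small plain-Python sketch. It is a toy model of the map, shuffle, reduce pattern, not BigQuery's actual implementation; the function name, the sample rows and the `num_reducers` parameter are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def shuffle_count_by_state(partitions, year, num_reducers=3):
    """Count rows per state for one year using a map, shuffle, reduce pattern."""
    # Map stage: each partition is filtered independently (the WHERE clause).
    # Shuffle stage: every surviving row is routed to a bucket chosen by
    # hashing the grouping key, so all rows for one state land in one bucket.
    buckets = defaultdict(list)
    for partition in partitions:
        for row in partition:
            if row["year"] == year:
                buckets[hash(row["state"]) % num_reducers].append(row["state"])

    # Reduce stage: each bucket is counted on its own (COUNT(*) per state).
    result = {}
    for states in buckets.values():
        result.update(Counter(states))
    return result


# Hypothetical input, split across three storage partitions.
partitions = [
    [{"state": "CA", "year": 2019}, {"state": "NY", "year": 2019}],
    [{"state": "CA", "year": 2019}, {"state": "TX", "year": 2018}],
    [{"state": "NY", "year": 2019}, {"state": "CA", "year": 2019}],
]
counts = shuffle_count_by_state(partitions, year=2019)
print(sorted(counts.items()))  # [('CA', 3), ('NY', 2)]
```

Because a state always hashes to the same bucket, the reduce stage never has to merge partial counts across buckets, which is the property that lets this kind of aggregation scale out.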
Processing: Cloud Dataflow, BigQuery, Cloud Dataproc, Cloud ML Engine
Storage: Cloud Storage (files), BigQuery Storage (tables), Cloud Bigtable (NoSQL)
BigQuery Storage API
● Use BigQuery Storage like GCS for Dataflow and Dataproc, and break down the data warehouse storage wall
● Run high-performance dataframes on BigQuery
Figure: Cloud Dataflow and Cloud Dataproc reading through the BQ Storage API.

Query Federation
● Cloud SQL and Cloud Bigtable federation: query your Cloud SQL and Cloud Bigtable instances directly from BigQuery, without moving data around.
● Parquet & ORC federation: query Parquet and ORC files directly in GCS.
Figure: BigQuery federation reaching Cloud SQL, Cloud Bigtable, and Parquet & ORC in GCS.
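For the Cloud SQL case, a federated query is written with BigQuery's `EXTERNAL_QUERY` table function, which takes a connection ID and the statement to run on the Cloud SQL side. The sketch below only builds the query text; the helper name, the connection ID and the `orders` table are hypothetical, and Cloud Bigtable and Parquet/ORC federation use external table definitions rather than this function.

```python
def build_federated_query(connection_id: str, cloud_sql_statement: str) -> str:
    """Wrap a Cloud SQL statement in BigQuery's EXTERNAL_QUERY table function."""
    # A triple-quoted SQL string lets the inner statement contain single quotes.
    if '"""' in cloud_sql_statement:
        raise ValueError("inner statement must not contain triple double quotes")
    return (
        "SELECT *\n"
        f"FROM EXTERNAL_QUERY('{connection_id}',\n"
        f'  """{cloud_sql_statement}""")'
    )


# Hypothetical connection ID and table, for illustration only.
sql = build_federated_query(
    "my-project.us.my-cloudsql-connection",
    "SELECT customer_id, status FROM orders WHERE status = 'open'",
)
print(sql)
```

The resulting text can be pasted into the BigQuery console or passed to a client library; the inner statement runs in Cloud SQL and only its result set is returned to BigQuery.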
Enterprise-grade workload management with Reservations

BigQuery Reservations allows customers to:
● Control flat-rate spend
● Buy slots in Web UI in seconds
● Efficiently manage workloads in BigQuery
● Automatically share any unused capacity

Introducing Flex slots
● A new commitment type
  ○ Alongside monthly & annual
● Pricing
  ○ $30 per slot per month*
● More flexible
  ○ 60 second minimum
● Combine with monthly/annual
● Available in all BQ Reservations regions!
● Available in BigQuery Reservations today!

BigQuery Commitment Types and Use Cases
Timescale: Seconds, Minutes, Hours, Days, Months, Years
● Flex Slots: Rapidly respond to business demands and evaluate performance.
● Flex Slots: Plan for business-critical calendar events.
● Monthly Commitments: Ideal balance between flexibility and cost.
● Annual Commitments: The most cost-effective option for steady-state workloads.

BigQuery workload management
Customers can programmatically perform workload management using Reservations:
● Create and delete reservations
● Move projects between reservations
● Move slots between reservations
● Idle slots are seamlessly and automatically shared in real-time

Example: an important workload in project_d needs to run at 3am.
● At 3am we create a reservation
  ○ Move 1000 slots to the reservation
  ○ Move project_d into reservation
● At 6am we delete the reservation
  ○ Move 1000 slots back
  ○ Move project_d back
● Project_d was guaranteed 1000 slots 3am-6am
● Idle slots seamlessly shared

Figure: reservations labeled Default, BI, best-effort and On-Demand, with sizes of 1000 slots, 1500 slots and 30 slots.
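The 3am to 6am example can be traced with a toy in-memory model. This is not the BigQuery Reservations API: the `SlotPool` class, the reservation name `urgent`, the starting slot sizes and the starting assignment of project_d are all assumptions chosen to make the sequence of moves concrete.

```python
class SlotPool:
    """Toy model of reservations and project assignments (illustrative only)."""

    def __init__(self, reservations, assignments):
        self.reservations = dict(reservations)  # reservation name -> slots
        self.assignments = dict(assignments)    # project -> reservation name

    def create_reservation(self, name):
        self.reservations[name] = 0

    def delete_reservation(self, name):
        # In this model a reservation must be empty before it is deleted.
        if self.reservations[name] != 0 or name in self.assignments.values():
            raise ValueError(f"{name} still holds slots or projects")
        del self.reservations[name]

    def move_slots(self, source, target, slots):
        if self.reservations[source] < slots:
            raise ValueError(f"{source} has fewer than {slots} slots")
        self.reservations[source] -= slots
        self.reservations[target] += slots

    def move_project(self, project, target):
        if target not in self.reservations:
            raise KeyError(target)
        self.assignments[project] = target

    def guaranteed_slots(self, project):
        return self.reservations[self.assignments[project]]


# Hypothetical starting state.
pool = SlotPool(
    reservations={"default": 1000, "bi": 1500, "best-effort": 30},
    assignments={"project_d": "best-effort"},
)

# 3am: create the reservation, move 1000 slots in, move project_d in.
pool.create_reservation("urgent")
pool.move_slots("bi", "urgent", 1000)
pool.move_project("project_d", "urgent")
guaranteed_at_3am = pool.guaranteed_slots("project_d")

# 6am: move project_d and the slots back, then delete the reservation.
pool.move_project("project_d", "best-effort")
pool.move_slots("urgent", "bi", 1000)
pool.delete_reservation("urgent")
guaranteed_after_6am = pool.guaranteed_slots("project_d")

print(guaranteed_at_3am, guaranteed_after_6am)  # 1000 30
```

The model guarantees capacity only; it does not attempt to represent the sharing of idle slots between reservations that the slide describes.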
Figure labels: Project_a, Project_b, Project_c, Project_d, Project_e, Project_f

ETL / Scale streaming analytics pipelines with Cloud Dataflow
Streaming analytics service that minimizes processing time and cost with autoscaling while blending batch and stream processing.
● Fastest stream and batch processing on one service
● Lower TCO for streaming analytics
● Automatically burst resources when data spikes
● Build and monitor Apache Beam pipelines

Dataflow: Stream Analytics as a managed service

Figure: the Google Cloud Dataflow Service sits between sources and sinks.
● Service features: Streaming Engine, Dataflow SQL, Dataflow Shuffle, Exactly-Once Processing, Handling Late Data, Resource Autoscaling, FlexRS cost savings
● OSS SDK (Beam): Java/Python/Go/SQL
● Surrounding GCP services, some marked optional: Cloud IAM, Compute Engine, Cloud Network, Virtual Private Cloud, Key Management Service, Stackdriver Monitoring

Demo: A simple Streaming reference architecture
Scales seamlessly to petabytes to let you focus on bringing actual value
Figure: streaming reference architecture. Sources (devices and a gateway) send data through Cloud IoT Core to Cloud Pub/Sub (ingest). Cloud Dataflow pipelines write to BigQuery (analytics store) and Cloud Storage (data lake). Analytics uses Data Studio, any BI tool and AI Platform. A governance layer provides Data Catalog, Logging and Monitoring.
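The pipeline stage of the reference architecture above typically groups device events into event-time windows before writing to the analytics store, and has to decide what to do with data that arrives late. The following plain-Python sketch shows that idea only; it is not the Apache Beam API, and the function name, the simplistic watermark (newest event time seen so far), the window size and the allowed lateness are all assumptions.

```python
from collections import defaultdict


def window_sums(events, window_seconds=60, allowed_lateness=30):
    """Sum values per fixed event-time window, dropping data that is too late."""
    sums = defaultdict(float)
    dropped = []
    watermark = float("-inf")  # simplistic: the newest event time seen so far
    for event_time, value in events:  # events are listed in arrival order
        watermark = max(watermark, event_time)
        window_start = event_time - event_time % window_seconds
        window_end = window_start + window_seconds
        # A window stops accepting data once the watermark has passed
        # its end by more than the allowed lateness.
        if window_end + allowed_lateness <= watermark:
            dropped.append((event_time, value))
            continue
        sums[window_start] += value
    return dict(sums), dropped


# Hypothetical (event_time_seconds, value) pairs; two of them arrive out of order.
events = [(5, 1.0), (42, 2.0), (61, 4.0), (58, 8.0), (130, 16.0), (20, 32.0)]
sums, dropped = window_sums(events)
print(sums)     # {0: 11.0, 60: 4.0, 120: 16.0}
print(dropped)  # [(20, 32.0)]
```

The event at time 58 arrives after the one at 61 but is still accepted, because its window closed less than 30 seconds before the watermark; the event at time 20 arrives after the watermark reached 130 and is dropped.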
Simplify Data Analytics or Machine Learning with Cloud Dataprep

Figure: the data lifecycle as a cycle around the analyst: Discover, Structure, Clean, Enrich, Validate, Automate.

Legacy data preparation compared with modern data preparation on Cloud Dataprep:
● Legacy: Business users not empowered to transform data samples. Modern: Business users push the “Run Job” button to apply transformations to datasets of any size.
● Legacy: Must hire an IT/Data ops team and manage a Hadoop cluster. Modern: No need to create or manage infrastructure.
● Legacy: Negotiate org-wide software licenses, arrange billing and manage seats. Modern: No need to provision software licenses.
● Legacy: Integrate application permissions with infrastructure permissions. Modern: Integrated, and highly scalable.

Serverless Simplicity. Fast Exploration. Easy Preparation.
Process diverse datasets - structured or unstructured
Prepare datasets of any size, PB or MB, with equal ease
Leverages Cloud Dataflow without needing to write any scripts
Auto-scalable and can easily handle processing massive data sets
Sources
● BigQuery tables
● Cloud Storage or local upload, using common file formats: CSV, LOG, JSON, GZIP, TXT, BZIP

Targets
● BigQuery tables
● Cloud Storage: CSV (compressed or not), JSON (compressed or not), Avro

Cloud Dataproc
Combining the best of open source and cloud.

Open source data and analytics processing at scale on Cloud Dataproc
Build data and analytics processing jobs using the open source software you love with the scale, security, and governance of the cloud.
● Autoscale SQL, batch, streaming, and machine learning open source processing (Apache MapReduce, Apache Spark, Presto, etc.)
● Lower TCO of running OSS
● Build Spark jobs on Kubernetes

The benefits of Hadoop/Spark on Cloud

Figure: three deployment options (On premises, On Compute Engine, Cloud Dataproc), each showing the same stack of responsibilities marked as either self-managed or Google managed: Custom code, Monitoring/Health, Dev integration, Scaling, Job submission, GCP connectivity, Deployment, Creation.

Build code free data pipelines with Data Fusion
Cloud Data Fusion is a fully managed, cloud-native data integration service that helps users efficiently build and manage ETL/ELT data pipelines.
● Use a prebuilt open source library of connectors
● Execute data pipelines in Apache Spark
● Metadata and lineage integrations
● Build Apache Kafka pipelines

That’s a wrap.