Stomping on big data using BigQuery

Pablo Caif | Shine Solutions | YOW! Data | 22.09.2016

About me

Software engineer

Shine Solutions

Google Developer Expert (cloud)

Work a lot with BigQuery

What is BigQuery?

Fully managed, No-Ops data warehouse

Petabyte-scale and fast

SQL interface

Highly available

Dremel == BigQuery

Google engineers tired of waiting for MR jobs

Dremel in production since 2006; whitepaper published 2010

Same codebase

Inspiration for Apache Drill

Under the hood

Massively parallel (MPP), in-memory execution

Always performs a full table scan (no indexes)

Runs on a petabit network

Columnar storage

Demo: RUN QUERY

What just happened?

Just rented ~2000 cores from Google

Paid $20 for biggest query (4TB, 100B rows)

Complexity is completely abstracted

Focus on insights, not infrastructure

Deep dive

High level architecture

Storage Engine

Storage Engine: Capacitor

SELECT play_count FROM songs WHERE name = "Here Comes The Sun";
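A minimal sketch of how a filter like the one in this query could run over encoded column data without decoding every row. It assumes dictionary encoding, which the slides do not name. The function names and sample rows are hypothetical, and this is not Capacitor's actual implementation.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with a small integer code (assumed encoding)."""
    dictionary: List[str] = []
    code_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def filter_encoded(dictionary: List[str], codes: List[int], wanted: str) -> List[int]:
    """Return row indexes where the column equals `wanted`, comparing codes only."""
    if wanted not in dictionary:
        return []
    target = dictionary.index(wanted)  # one string lookup, then integer compares
    return [row for row, code in enumerate(codes) if code == target]


# Made-up rows standing in for the `songs` table.
names = ["Yesterday", "Here Comes The Sun", "Let It Be", "Here Comes The Sun"]
play_counts = [10, 20, 30, 40]

dictionary, codes = dictionary_encode(names)
rows = filter_encoded(dictionary, codes, "Here Comes The Sun")
matching_play_counts = [play_counts[row] for row in rows]
```

Only the `name` and `play_count` columns are touched, which is the point of columnar storage: the other columns of the table are never read.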

[Figure: compressed column blocks passing through Decompress, Filter, and Compress steps to find "Here Comes The Sun"]

Dynamic Query Execution

DremelX Architecture
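A rough sketch of the scatter and gather idea behind a serving tree: a master splits the work, shards aggregate their own slice, and the master combines the partial results. This is a single-level toy, with threads standing in for machines. The real tree is multi-level and dynamic, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def shard_partial_sum(rows: List[int]) -> int:
    """Leaf work: each shard aggregates only its own slice of the column."""
    return sum(rows)


def master_query(column: List[int], shard_count: int) -> int:
    """Root work: split the column, fan out to shards, combine partial sums."""
    slices = [column[i::shard_count] for i in range(shard_count)]
    with ThreadPoolExecutor(max_workers=shard_count) as pool:
        partials = list(pool.map(shard_partial_sum, slices))
    return sum(partials)


# Made-up column of 1,000 values spread over 12 shards.
play_counts = list(range(1, 1001))
total = master_query(play_counts, shard_count=12)
```

Because each shard returns a small partial result, the master only combines a handful of numbers no matter how large the column is.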

[Diagram: a Master fans out to layers of Shards, which read from Distributed Storage]

Dynamic Serving Tree

Columnar Storage

Multitenancy

Resources are shared across accounts

[Diagram: "My query" and "Your query" running on the same shared resources]

Getting Data In

Loading the data

1. Direct Load: local upload; GCS (CSV, JSON, AVRO)

2. Batch ETL: Dataflow/Beam; Hadoop connector

3. Other: Analytics (premium); Cloud Datastore; Cloud access logs

Streaming

Costs

Three Costs
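The three prices quoted in this talk ($5 per TB queried, $0.02 per GB stored, $0.01 per 200MB streamed) can be combined into a rough estimator. This is a sketch using the 2016 figures from the slides. The function names are hypothetical, the storage price is assumed to be per month, and current pricing may differ.

```python
# Prices as quoted on the slides (2016); treat them as assumptions.
QUERY_USD_PER_TB = 5.00
STORAGE_USD_PER_GB_MONTH = 0.02  # assumed to be a monthly rate
STREAMING_USD_PER_200MB = 0.01


def query_cost(tb_scanned: float) -> float:
    """Cost of a query that scans `tb_scanned` terabytes."""
    return tb_scanned * QUERY_USD_PER_TB


def storage_cost(gb_stored: float, months: float = 1.0) -> float:
    """Cost of keeping `gb_stored` gigabytes for `months` months."""
    return gb_stored * STORAGE_USD_PER_GB_MONTH * months


def streaming_cost(mb_streamed: float) -> float:
    """Cost of streaming `mb_streamed` megabytes into a table."""
    return mb_streamed / 200 * STREAMING_USD_PER_200MB


# The demo's biggest query scanned 4TB.
demo_query_usd = query_cost(4)
```

With these rates the demo's 4TB query comes to $20, which matches the figure quoted for it in the talk.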

Queries: $5 per TB

Storage: $0.02 per GB per month

Streaming: $0.01 per 200MB

Core features

SDKs

RESTful API

Table Partitioning

SELECT … FROM sales WHERE _PARTITIONTIME BETWEEN TIMESTAMP("20160101") AND TIMESTAMP("20160131")

[Diagram: table sales split into daily partitions, 20160101, 20160102, through 20160131]

Query billing illustrated

Column:  c1     c2      c3     c4     c5
Size:    60 GB  125 GB  80 GB  45 GB  99 GB

Partitioned query billing illustrated
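The billing slides come down to two rules: a query is billed only for the columns it references, and a partition filter reduces that further to the partitions it reads. A small sketch using the column sizes from the slides follows. It assumes the five daily partitions are equal in size and that 1 TB = 1024 GB, and the function names are hypothetical.

```python
from typing import Iterable

# Column sizes as shown on the slides.
COLUMN_SIZES_GB = {"c1": 60, "c2": 125, "c3": 80, "c4": 45, "c5": 99}
PARTITION_COUNT = 5  # 20160101 to 20160105, assumed equal in size
USD_PER_TB = 5.00  # query price quoted in the talk


def gb_billed(columns: Iterable[str], partitions_read: int = PARTITION_COUNT) -> float:
    """GB scanned: only the referenced columns, only the partitions read."""
    selected = sum(COLUMN_SIZES_GB[name] for name in columns)
    return selected * partitions_read / PARTITION_COUNT


def cost_usd(gb: float) -> float:
    """Convert scanned GB to dollars (assumes 1 TB = 1024 GB)."""
    return gb / 1024 * USD_PER_TB


select_star_gb = gb_billed(COLUMN_SIZES_GB)            # SELECT *
two_columns_gb = gb_billed(["c1", "c3"])               # SELECT c1, c3
one_day_gb = gb_billed(["c1", "c3"], partitions_read=1)  # plus a one-day filter
```

Under these assumptions `SELECT *` scans all 409 GB, naming two columns scans 140 GB, and adding a one-day partition filter brings that down to 28 GB.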

Column:  c1     c2      c3     c4     c5
Size:    60 GB  125 GB  80 GB  45 GB  99 GB

Partitions: 20160101, 20160102, 20160103, 20160104, 20160105

Federated Queries

Query data without loading first

Read data from GCS, Drive, Sheets

Automatic schema detection

AVRO format now supported

[Diagram labels: BigQuery, Storage, GCS, GDrive, Google Sheets]

Security

Third party audits

ISO and PCI compliance

Data encrypted at rest

ACLs with predefined roles

My Tips

Useful tips

Use caching

Use table decorators

Start small

Cost control (beta)

Caveats

Watch out for...
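One caveat in this talk is streaming errors, with the advice to back off and retry. A minimal sketch of exponential backoff with jitter follows. `TransientStreamingError` and the `send` callable are hypothetical stand-ins, not part of any BigQuery SDK, and the delays are illustrative.

```python
import random
import time
from typing import Any, Callable, List


class TransientStreamingError(Exception):
    """Hypothetical stand-in for a retryable streaming-insert failure."""


def insert_with_backoff(
    send: Callable[[List[Any]], Any],
    rows: List[Any],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `send(rows)`, doubling the wait after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return send(rows)
        except TransientStreamingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential delay plus random jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the function easy to test: a test can record the delays instead of actually waiting.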

No data centre in Australia yet

Query execution times are not deterministic

Streaming errors - back off and retry

Be aware of ‘SELECT * …’

Wrapping up

Bleeding edge, no-ops DW

Incredibly fast even at petabyte scale

Bang for buck

Great for streaming data directly into it

Thank You

pablocaif@.com