Stomping on big data using Google BigQuery
Pablo Caif | Shine Solutions | YOW! Data | 22.09.2016

About me
Software engineer
Shine Solutions
Google Developer Expert (cloud)
Work a lot with BigQuery

What is BigQuery?
Fully managed, No-Ops data warehouse
Petabyte-scale and fast
SQL interface
Highly available

Dremel == BigQuery
Google engineers were tired of waiting for MapReduce (MR) jobs
Dremel in production at Google since 2006 (whitepaper published 2010)
Same codebase
Inspiration for Apache Drill

Under the hood
MPP (massively parallel processing), in-memory execution
Always performs a full table scan (FTS); there are no indexes
Runs on a petabit network
Columnar storage

Demo
RUN QUERY

What just happened?
Just rented ~2000 cores from Google
Paid $20 for the biggest query: 4TB, 100B rows
Complexity is completely abstracted
Focus on insights, not infrastructure

Deep dive

High level architecture

Storage Engine
Storage Engine: Capacitor
SELECT play_count FROM songs WHERE name = "Here Comes The Sun";
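The query above filters one column on a string value and returns another. As a rough illustration of how a columnar engine can apply such a filter to encoded data rather than to fully decompressed rows, here is a toy dictionary-encoding sketch. It is not Capacitor's actual format, which the slides do not describe in detail; the function names and sample rows are hypothetical.

```python
from typing import Dict, List, Tuple


def dictionary_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with an integer code pointing into a dictionary."""
    dictionary: List[str] = []
    code_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def filter_encoded(dictionary: List[str], codes: List[int], wanted: str) -> List[int]:
    """Return the row positions whose value equals `wanted`.

    The string comparison runs once per distinct value; individual rows
    are matched on their integer codes and are never decoded.
    """
    matching_codes = {i for i, value in enumerate(dictionary) if value == wanted}
    return [row for row, code in enumerate(codes) if code in matching_codes]


# Hypothetical sample data standing in for the `songs` table.
names = ["Yesterday", "Here Comes The Sun", "Yesterday", "Let It Be", "Here Comes The Sun"]
play_counts = [10, 42, 7, 3, 8]

dictionary, codes = dictionary_encode(names)
rows = filter_encoded(dictionary, codes, "Here Comes The Sun")
result = [play_counts[row] for row in rows]
```

Only the `name` and `play_count` columns are touched, which is the point of columnar storage: other columns of the table would not be read at all.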
[Figure: blocks of compressed column data, with Decompress, Filter and Compress stages; matching value "Here Comes The Sun"]

Dynamic Query Execution
DremelX Architecture
[Diagram: a Master above layers of Shards, above Distributed Storage]
● Dynamic Serving Tree
● Columnar Storage

Multitenancy
Resources are shared across accounts
[Diagram labels: My query, Your query]

Getting Data In
Loading the data
1. Direct Load: Local upload; GCS (CSV, JSON, AVRO)
2. Batch ETL: Dataflow/Beam; Hadoop connector
3. Other: Analytics (premium); Cloud Datastore; Cloud access logs; Streaming

Costs
Three Costs
Queries: $5 per TB processed
Storage: $0.02 per GB per month
Streaming: $0.01 per 200MB

Core features
SDKs
RESTful API

Table Partitioning
SELECT … FROM sales WHERE _PARTITIONTIME BETWEEN TIMESTAMP("20160101") AND TIMESTAMP("20160131")
[Diagram: table sales split into daily partitions 20160101, 20160102, …, 20160131]

Query billing illustrated
Column:  c1     c2      c3     c4     c5
Size:    60 GB  125 GB  80 GB  45 GB  99 GB
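The column sizes in the billing illustration (c1 to c5), combined with the $5 per TB query price from the Three Costs slide, allow a small worked example: a query is billed for the columns it reads, not for the whole table. The sketch below is only an approximation under stated assumptions. It treats 1 TB as 1024 GB, and the choice of c1 and c3 as the queried columns is hypothetical.

```python
from typing import Iterable

PRICE_PER_TB_USD = 5.0  # query price quoted in the slides
GB_PER_TB = 1024        # assumption: binary units

# Column sizes from the billing slide, in GB.
COLUMN_SIZES_GB = {"c1": 60, "c2": 125, "c3": 80, "c4": 45, "c5": 99}


def cost_for_gb(gb_scanned: float) -> float:
    """Cost in USD of a query that scans `gb_scanned` gigabytes."""
    return gb_scanned / GB_PER_TB * PRICE_PER_TB_USD


def billed_gb(columns: Iterable[str]) -> int:
    """Total size of the columns a query references."""
    return sum(COLUMN_SIZES_GB[name] for name in columns)


def query_cost_usd(columns: Iterable[str]) -> float:
    """Cost in USD of a query that reads only the given columns."""
    return cost_for_gb(billed_gb(columns))


two_columns = query_cost_usd(["c1", "c3"])      # reads 140 GB
select_star = query_cost_usd(COLUMN_SIZES_GB)   # reads all 409 GB
demo_query = cost_for_gb(4 * GB_PER_TB)         # the 4TB demo query
```

Under these assumptions the 4TB demo query comes to the $20 quoted earlier, and selecting every column costs roughly three times as much as selecting c1 and c3 alone.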

Partitioned query billing illustrated
Partition   c1     c2      c3     c4     c5
20160101
20160102
20160103
20160104
20160105
Size:       60 GB  125 GB  80 GB  45 GB  99 GB

Federated Queries
Federated Query
Query data without loading first
Read data from GCS, Drive, Sheets
Automatic schema detection
AVRO format now supported
[Diagram: BigQuery storage alongside the external sources GCS, GDrive and Google Sheets]
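To make the federated query idea concrete, the sketch below builds the kind of request body one might send to the BigQuery REST API to define a table backed by files in GCS, with schema auto-detection switched on. The field names (`externalDataConfiguration`, `sourceUris`, `sourceFormat`, `autodetect`) follow the REST API's table resource as I understand it and should be checked against the current reference. The project, dataset, table and bucket names are hypothetical.

```python
import json
from typing import Any, Dict, List


def external_table_body(
    project_id: str,
    dataset_id: str,
    table_id: str,
    source_uris: List[str],
    source_format: str = "CSV",
    autodetect: bool = True,
) -> Dict[str, Any]:
    """Build a table definition whose data stays in external storage."""
    return {
        "tableReference": {
            "projectId": project_id,
            "datasetId": dataset_id,
            "tableId": table_id,
        },
        "externalDataConfiguration": {
            "sourceUris": list(source_uris),  # e.g. gs:// paths
            "sourceFormat": source_format,    # e.g. CSV, NEWLINE_DELIMITED_JSON, AVRO
            "autodetect": autodetect,         # let BigQuery infer the schema
        },
    }


# Hypothetical names, for illustration only.
body = external_table_body(
    "my-project", "my_dataset", "sales_external",
    ["gs://my-bucket/sales/*.csv"],
)
payload = json.dumps(body, indent=2)
```

Once such a table exists, it can be queried like any other table, while the data itself remains in GCS.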

Security
Third party audits
ISO and PCI compliance
Data encrypted at rest
ACLs with predefined roles

My Tips
Useful tips
Use caching
Use table decorators
Start small
Cost control (beta)
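As an example of the table decorators tip: in BigQuery's legacy SQL, a decorator appended to a table name restricts the query to a snapshot or a time range of recent data, which reduces the bytes scanned. The helper below only builds the decorated table reference as a string, using relative offsets in milliseconds. The helper names and the table name are hypothetical, and the syntax should be verified against the legacy SQL documentation.

```python
from typing import Optional


def snapshot_decorator(table: str, ms_ago: int) -> str:
    """Reference the table as it was `ms_ago` milliseconds ago."""
    if ms_ago <= 0:
        raise ValueError("ms_ago must be positive")
    return f"[{table}@-{ms_ago}]"


def range_decorator(table: str, start_ms_ago: int, end_ms_ago: Optional[int] = None) -> str:
    """Reference only data added between two relative points in time.

    With no end given, the range runs from `start_ms_ago` up to now.
    """
    if start_ms_ago <= 0 or (end_ms_ago is not None and end_ms_ago <= 0):
        raise ValueError("offsets must be positive")
    end = "" if end_ms_ago is None else f"-{end_ms_ago}"
    return f"[{table}@-{start_ms_ago}-{end}]"


ONE_HOUR_MS = 60 * 60 * 1000

# Hypothetical table name.
last_hour = range_decorator("my_dataset.events", ONE_HOUR_MS)
query = f"SELECT COUNT(*) FROM {last_hour}"
```

A query built this way reads only the last hour of data instead of the full table, which fits the "start small" tip as well.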

Caveats
Watch out for...
No data centre in Australia yet
Query execution times are not deterministic
Streaming errors - back off and retry
Be wary of 'SELECT * …': it reads, and bills for, every column

Wrapping up
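On the streaming errors caveat, "back off and retry" can be sketched as exponential backoff with a little random jitter. This is a generic pattern rather than code from the talk: `TransientStreamingError` and the `send` callable are hypothetical stand-ins for whatever client call performs the streaming insert and however it reports a retryable failure.

```python
import random
import time
from typing import Any, Callable, List


class TransientStreamingError(Exception):
    """Hypothetical stand-in for a retryable streaming failure."""


def insert_with_backoff(
    send: Callable[[List[Any]], Any],
    rows: List[Any],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `send(rows)`, retrying transient failures with growing delays."""
    for attempt in range(max_attempts):
        try:
            return send(rows)
        except TransientStreamingError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter exists so the waiting can be replaced in tests. In real use the default `time.sleep` applies.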
Bleeding edge, no-ops DW
Incredibly fast even at petabyte scale
Bang for buck
Great for streaming data directly into it

Thank You
pablocaif@gmail.com