
Stomping on big data using Google BigQuery
Pablo Caif | Shine Solutions | YOW! Data | 22.09.2016

About me
● Software engineer at Shine Solutions
● Google Developer Expert (cloud)
● Work a lot with BigQuery

What is BigQuery?
● Fully managed, No-Ops data warehouse
● Petabyte-scale and fast
● SQL interface
● Highly available

Dremel == BigQuery
● Google engineers tired of waiting for MapReduce (MR) jobs
● Dremel in production since 2006; whitepaper published 2010
● Same codebase
● Inspiration for Apache Drill

Under the hood
● Massively parallel (MPP), in-memory execution
● Always performs a full table scan (FTS); no indexes
● Runs on a petabit network
● Columnar storage

Demo: RUN QUERY

What just happened?
● Just rented ~2000 cores from Google
● Paid $20 for biggest query (4 TB)
● 100B rows
● Complexity is completely abstracted
● Focus on insights, not infrastructure

Deep dive

High level architecture

Storage Engine: Capacitor
    SELECT play_count FROM songs WHERE name = "Here Comes The Sun";
[Figure: encoded column data passing through Decompress, Filter and Compress stages; the filter matches "Here Comes The Sun"]
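As a rough illustration of the query on the Capacitor slide, the sketch below stores a songs table column by column and answers the query by scanning only the name and play_count columns, with no index. The sample rows and the column_scan helper are hypothetical, and the sketch does not model Capacitor's compression or BigQuery's actual execution.

```python
# Toy columnar table: each column is stored as its own list.
# The rows are made-up sample data for illustration only.
songs = {
    "name": ["Here Comes The Sun", "Yesterday", "Here Comes The Sun"],
    "artist": ["The Beatles", "The Beatles", "Nina Simone"],
    "play_count": [1500, 900, 42],
}


def column_scan(table, select, where_column, equals):
    """Full scan of one filter column, then pick matching rows from `select`.

    Only the two named columns are touched; other columns are never read.
    """
    matches = [i for i, value in enumerate(table[where_column]) if value == equals]
    return [table[select][i] for i in matches]


# Equivalent in spirit to:
#   SELECT play_count FROM songs WHERE name = "Here Comes The Sun";
result = column_scan(songs, "play_count", "name", "Here Comes The Sun")
print(result)
```

The artist column is never read here, which is the property that makes columnar storage cheap for queries that name only a few columns.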
Dynamic Query Execution

DremelX Architecture
● Dynamic Serving Tree
● Columnar Storage
[Figure: a Master node over several tiers of Shards, on top of Distributed Storage]

Multitenancy
● Resources are shared across accounts
[Figure: "My query" and "Your query" sharing the same resources]

Getting Data In

Loading the data
1. Direct Load: Local upload; GCS (CSV, JSON, AVRO)
2. Batch ETL: Dataflow/Beam; Hadoop connector
3. Other: Analytics (premium); Cloud Datastore; Cloud access logs
● Streaming

Costs

Three Costs
● Queries: $5 per TB
● Storage: $0.02 per GB
● Streaming: $0.01 per 200 MB

Core features
● SDKs
● RESTful API

Table Partitioning
    SELECT … FROM sales
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP("20160101")
                             AND TIMESTAMP("20160131")
[Figure: table sales split into daily partitions 20160101, 20160102, … 20160131]

Query billing illustrated
Column:  c1     c2      c3     c4     c5
Size:    60 GB  125 GB  80 GB  45 GB  99 GB

Partitioned query billing illustrated
Column:      c1     c2      c3     c4     c5
Size:        60 GB  125 GB  80 GB  45 GB  99 GB
Partitions:  20160101, 20160102, 20160103, 20160104, 20160105

Federated Queries
● Query data without loading first
● Read data from GCS, Drive, Sheets
● Automatic schema detection
● AVRO format now supported
[Figure labels: BigQuery, Storage, GCS, GDrive, Google Sheets]

Security
● Third party audits
● ISO and PCI compliance
● Data encrypted at rest
● ACLs with predefined roles

My Tips

Useful tips
● Use caching
● Use table decorators
● Start small
● Cost control (beta)

Caveats

Watch out for...
● No data centre in Australia yet
● Query execution times are not deterministic
● Streaming errors: back off and retry
● Be aware of 'SELECT * …'

Wrapping up
● Bleeding edge, no-ops DW
● Incredibly fast even at petabyte scale
● Bang for buck
● Great for streaming data directly to it

Thank You
[email protected]
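To make the billing slides concrete, here is a minimal sketch of the arithmetic. It assumes a query is billed for the full size of every column it references, at the $5 per TB rate from the costs slide, and that 1 TB is counted as 1024 GB (an assumption; the slides do not say). The column sizes are the ones shown on the billing slide. The query_cost helper and the even split of data across five daily partitions are hypothetical simplifications.

```python
# Column sizes in GB, taken from the "Query billing illustrated" slide.
COLUMN_SIZES_GB = {"c1": 60, "c2": 125, "c3": 80, "c4": 45, "c5": 99}

PRICE_PER_TB = 5.0  # USD per TB scanned, from the "Three Costs" slide
GB_PER_TB = 1024    # assumption: binary terabytes


def query_cost(columns, partition_fraction=1.0):
    """Estimate query cost in USD for the referenced columns.

    partition_fraction is the share of the table's partitions the query
    reads (1.0 means the whole table). This assumes data is spread evenly
    across partitions, which real tables may not satisfy.
    """
    scanned_gb = sum(COLUMN_SIZES_GB[c] for c in columns) * partition_fraction
    return scanned_gb / GB_PER_TB * PRICE_PER_TB


two_columns = query_cost(["c1", "c3"])             # 140 GB scanned
select_star = query_cost(COLUMN_SIZES_GB)          # all 409 GB scanned
one_day_of_five = query_cost(["c1", "c3"], 1 / 5)  # one of five partitions

print(f"c1 + c3:              ${two_columns:.2f}")
print(f"SELECT *:             ${select_star:.2f}")
print(f"c1 + c3, 1 of 5 days: ${one_day_of_five:.2f}")
```

On these numbers, SELECT * costs roughly three times as much as reading c1 and c3 alone, and restricting the same two columns to a single daily partition cuts the estimate to a fifth.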