High Performance Computing & Machine Learning Google Cloud Platform
Wyatt Gorman HPC & ML Specialist Google Cloud
Confidential & Proprietary
$29 billion: Google investment in the last 3 years
Over 1 billion: unique IP addresses served daily
5,000: caching points across the globe
Compute Engine guarantees 99.99% uptime
HPC in Google Cloud
Compute needs are growing
Top500.org Compute Growth
[Chart: Top500.org compute growth, with series for the sum of the entire list, #1, and #500]
HPC Powers Many Industries
Manufacturing, Oil & Gas, Weather & Climatology, Financial Services, Healthcare & Genomics, Media
Machine Learning
Data and Compute are Core to Self Driving Cars
Why choose Google Cloud for HPC?
Fast Provisioning: Start thousands of VMs in seconds, scale down instantaneously
Latest Hardware: First with Intel Skylake, first with NVIDIA V100 & P4, Google Cloud TPUs, 160 vCPU VMs
Secure: Automatic patches, defense in depth security, 750+ security experts
Global Resources: Collaborate globally, instantly over Google’s best-in-class network
Lower Cost: Per Second billing, Custom Machine Types, Preemptible VMs, Resource Aggregation
Storage for Every Use Case: Ready for the heaviest I/O with GCS, Filestore, PD, Local SSD, and storage partners
HPC Partnerships: Partnered to bring your favorite workload managers, applications, and tools to GCP
Reduced Complexity: Live Migration, 24x7 Managed Services, Enterprise Support
Infrastructure
HPC Requirements
Computing power, Storing large amounts of data, High performance networking, Partner Solutions
Google Compute Engine
Highly Configurable Resources
● Latest Intel Skylake Processors
● Customizable Instances
○ 1 to 160 cores
○ Up to 3844 GB RAM
○ Rightsizing Recommendations
● GPUs & TPUs
Cloud-Friendly Economics
● Customizable Instances
● Sustained Use Discounts
● Resource Aggregation
● Committed Use Discounts
● Per Second Billing
● Preemptible VMs
Preemptible Virtual Machines
● Up to 80% cost savings
● 24 Hour Lifetime, 30 second notice
● Supports all Standard VM features
● Preemptible GPUs & Local SSDs
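As a rough illustration of the preemptible VM figures above, the following Python sketch estimates cost and the minimum number of VM launches for a long job. The 80% ceiling and the 24-hour lifetime come from the slide; the $1.00 hourly price, the function names, and the assumption that the job checkpoints and is never preempted early are hypothetical.

```python
import math

MAX_LIFETIME_HOURS = 24   # from the slide: 24 hour lifetime
MAX_DISCOUNT = 0.80       # from the slide: up to 80% cost savings


def preemptible_cost(on_demand_hourly: float, hours: float,
                     discount: float = MAX_DISCOUNT) -> float:
    """Estimated cost of `hours` of preemptible time at the given discount."""
    if not 0.0 <= discount <= MAX_DISCOUNT:
        raise ValueError("discount must be between 0 and 0.80")
    return on_demand_hourly * (1.0 - discount) * hours


def min_vm_launches(job_hours: float) -> int:
    """Fewest VM launches for a checkpointed job, ignoring early preemption."""
    return max(1, math.ceil(job_hours / MAX_LIFETIME_HOURS))


if __name__ == "__main__":
    # Hypothetical inputs: $1.00/hour on-demand price, 60-hour job.
    print(f"estimated cost: ${preemptible_cost(1.00, 60):.2f}")
    print(f"minimum launches: {min_vm_launches(60)}")
```

In practice a VM can be preempted well before the 24-hour limit, so the launch count is a lower bound, and the actual discount depends on machine type and region.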
Google Networking
Network / Bandwidth
Performance
● 7,000 VMs per VPC
● Predictable, low latency (~20 - 40 µs)
● Scalable bandwidth
○ 2 Gbps per vCPU
○ Up to 16 Gbps per VM
● Tailoring latency-sensitive tools to GCP
● Open-Sourcing high-performance internal protocols and tools (gRPC)
Google Software Defined Networking
● Clos topology: A collection of smaller custom switches arranged to provide the properties of a much larger logical switch.
● Centralized software management stack.
● Implementing custom protocols tailored to the high performance data center.
Global Network
● Thousands of POPs around the world
● Google Backbone between datacenters
● Multiple interconnect options to on-prem
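The scalable bandwidth figures on the networking slide (2 Gbps per vCPU, up to 16 Gbps per VM) amount to a simple capped linear rule. A minimal Python sketch, assuming the cap applies exactly as stated on the slide; the function name is hypothetical:

```python
GBPS_PER_VCPU = 2      # from the slide: 2 Gbps per vCPU
MAX_GBPS_PER_VM = 16   # from the slide: up to 16 Gbps per VM


def bandwidth_cap_gbps(vcpus: int) -> int:
    """Per-VM bandwidth ceiling implied by the slide's two figures."""
    if vcpus < 1:
        raise ValueError("a VM needs at least one vCPU")
    return min(vcpus * GBPS_PER_VCPU, MAX_GBPS_PER_VM)


if __name__ == "__main__":
    for vcpus in (1, 4, 8, 160):
        print(vcpus, "vCPUs ->", bandwidth_cap_gbps(vcpus), "Gbps")
```

Under this rule the cap is reached at 8 vCPUs, so larger VMs gain cores without gaining per-VM bandwidth.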
Google Cloud Network
The largest cloud network, comprised of >100 points of presence
[Map: Google network, network sea cable investments, and edge points of presence (>100)]
● Unity (US, JP) 2010
● SJC (JP, HK, SG) 2013
● FASTER (US, JP, TW) 2016
● Monet (US, BR) 2017
● Junior (Rio, Santos) 2017
● Tannat (BR, UY, AR) 2017
● PLCN (HK, LA) 2019
● Indigo (SG, ID, AU) 2019
Storage options on GCP
GOOGLE CLOUD STORAGE
Exabyte-scale, feature-rich object storage
Automatically scaling throughput
PERSISTENT DISK
SSD/HDD Persistent Disk
High-performance, replicated block storage
LOCAL STORAGE
Local SSD (NVMe) for scratch & fast access
Physically attached to node via PCI
FILESTORE
Highly-available, durable, POSIX compliant shared storage across tens of thousands of nodes
HYBRID & PARTNER SOLUTIONS
Partner storage solutions from NetApp, Elastifile, DDN, etc.
Move petabytes to GCS with the Data Transfer Appliance
Partners & Solutions
Storage: Coming Soon
HPC Solutions: Coming Soon
Applications & Platforms: Coming Soon
Slurm for Google Cloud Platform
Partnership Announced at Supercomputing 2017
● Cloud Auto-Scaling: Automatic elastic scaling of instances on-demand according to queue depth and job requirements. Spins resources down once idle.
● Burst to Cloud: Dynamically create virtual machines to offload jobs from your on-premise cluster to Google Cloud. Leverages Cloud Auto-Scaling functionality.
● Federate to Cloud: Federate jobs between your on-premise Slurm cluster and your Google Cloud Slurm cluster(s).
● Open Source: Scripts found in SchedMD’s GitHub repo, under “Slurm-gcp”: https://github.com/schedmd/slurm-gcp
Burst to Cloud: Auto-Scaling
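The Cloud Auto-Scaling behavior described in the Slurm bullets (scale up according to queue depth and job requirements, spin down once idle) can be sketched as two small decisions. This is an illustrative Python sketch only, not the slurm-gcp scripts or any Slurm API; the `Job` type, function names, node cap, and idle timeout are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    nodes_required: int  # hypothetical: nodes a pending job asks for


def nodes_to_add(pending: List[Job], idle_nodes: int,
                 total_nodes: int, max_nodes: int) -> int:
    """Nodes to create so pending jobs can start, within the cluster cap."""
    demand = sum(job.nodes_required for job in pending)
    shortfall = max(0, demand - idle_nodes)       # idle nodes absorb demand first
    headroom = max(0, max_nodes - total_nodes)    # never exceed the cap
    return min(shortfall, headroom)


def nodes_to_remove(idle_seconds: Dict[str, float],
                    idle_timeout: float) -> List[str]:
    """Names of nodes that have been idle for at least the timeout."""
    return sorted(name for name, secs in idle_seconds.items()
                  if secs >= idle_timeout)


if __name__ == "__main__":
    queue = [Job(nodes_required=4), Job(nodes_required=2)]
    print(nodes_to_add(queue, idle_nodes=1, total_nodes=3, max_nodes=6))
    print(nodes_to_remove({"worker-1": 700.0, "worker-2": 30.0}, 600.0))
```

A real deployment would take the queue and node states from the scheduler rather than from in-memory lists; the sketch only shows the shape of the scale-up and scale-down rules.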
Architecture: Slurm Auto-Scaling Cluster
[Diagram: on-prem HPC users reach the cloud HPC cluster through Cloud VPN / Interconnect. The cluster runs on Compute Engine as Login, Controller, and Worker Nodes, with the labels Slurmd; SlurmCtld, NFS; and Autoscaled. Storage is a storage bucket in Cloud Storage.]
Burst to Cloud: Federated
Architecture: Federated Slurm Clusters
[Diagram: the on-prem side shows HPC Users, Login Node, Controller Node, Worker Nodes, and Storage System, connected through Cloud VPN / Interconnect to a cloud HPC cluster. The cloud cluster runs on Compute Engine as Login, Controller, and Worker Nodes, with the labels Slurmd; SlurmCtld, NFS; and Autoscaled. Cloud storage: GCS, PD, Lustre, ...]
Customer Case Studies
SUNY Downstate Medical Center
Neurological Simulations
● 5000 Cores maximum
● 1 week of execution
● Slurm Auto-Scaling Cluster
● Preemptible VMs
[Chart: # Cores]
University of South Carolina
Microbiome Sequencing
● 4000 Nodes
● 125,000 Cores
● 16 hrs of execution
● Single Job
● Slurm Auto-Scaling Cluster
● Preemptible VMs
● R/O Persistent Disk / GCS
[Chart: # Cores and # Nodes]
University of South Carolina - Architecture
But wait, there’s more!
● Genomics & Life Sciences: The Broad Institute performed genome analysis with 50k cores
● Computational Chemistry: Using GCP for massive drug discovery virtual screening
● Computational Mathematics: 220,000 cores and counting: MIT math professor breaks record for largest ever Compute Engine job (Note: Now 580k cores!)
● Physics: Google Cloud, Fermilab, HEPCloud and probing the nature of Nature
● Satellite Image Analysis: Descartes Labs monitors planet Earth’s resources with Google Compute Engine
And also: Financial Services, Media Rendering & Transcoding, and many, many others...
Machine Learning
“Machine learning is a core, transformative way by which we’re rethinking how we’re doing everything.”
– Sundar Pichai, CEO, Google
Products powered by Machine Learning
Search: Search ranking, Speech recognition
Android: Keyboard & speech input
Play: App recommendations, Game developer experience
Gmail: Smart Reply, Spam classification
Drive: Suggested visualization and insights
Chrome: Search by Image
Photos: Photos search
YouTube: Video recommendations, Better thumbnails
Maps: Street View image parsing, Local Search
Translate: Text, graphic, and speech translations
Cardboard: Smart stitching
Ads: Richer Text Ads, Automated Bidding
© 2017 Google Inc. All rights reserved.
AlphaGo
Go has 1.74 x 10^172 possible board configurations, more than the number of atoms in the known universe.
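The figure quoted for Go matches 3^361: each of the 361 points on a 19x19 board is empty, black, or white. This is an upper bound that also counts illegal positions. A short Python check, where the 10^80 atom count is a commonly cited rough estimate and is an assumption, not a figure from the slide:

```python
BOARD_POINTS = 19 * 19    # 361 intersections on a standard Go board
STATES_PER_POINT = 3      # empty, black, or white

configurations = STATES_PER_POINT ** BOARD_POINTS  # exact integer, 3^361


def scientific(n: int) -> str:
    """Format a large positive integer as 'd.dd x 10^e' (truncated mantissa)."""
    digits = str(n)
    mantissa = int(digits[:3]) / 100
    return f"{mantissa:.2f} x 10^{len(digits) - 1}"


ATOMS_ESTIMATE = 10 ** 80  # assumption: rough order-of-magnitude estimate

if __name__ == "__main__":
    print(scientific(configurations))
    print(configurations > ATOMS_ESTIMATE)
```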
Google DC Ops
Applying ML led to a 40% reduction in cooling energy in Google datacenters.
Machine Learning services from Google Cloud
Cloud ML Engine & TensorFlow: Scalable, powerful, flexible Machine Learning tools
Cloud ML APIs: Pre-trained, easy to use
Cloud AutoML: Customizable, powerful, just add data
What is TensorFlow?
Open source software for ML development
#1 Project on GitHub
tensorflow.org launched in Nov 2015
Created by Google Brain team
Standard tool for ML development at Google
“People who are really serious about software should make their own hardware”
- Alan Kay
Tensor Processing Unit (TPU)
Custom ASIC built and optimized for TensorFlow
180 Teraflops & 30x faster than CPU
Like fast-forwarding 7 years into the future
Q/A
Thank you
https://cloud.google.com/hpc https://cloud.google.com/products/ai/