High Performance Computing & Machine Learning
Google Cloud Platform

Wyatt Gorman, HPC & ML Specialist, Google Cloud

Confidential & Proprietary

● $29 billion: Google investment in the last 3 years
● Over 1 billion: unique IP addresses served daily
● 5,000: caching points across the globe
● Compute Engine guarantees 99.99% uptime

HPC in Google Cloud

Compute needs are growing

[Chart: Top500.org Compute Growth. Series: Sum of Entire List, #1, #500]

HPC Powers Many Industries

● Manufacturing
● Oil & Gas
● Weather & Climatology
● Financial Services
● Healthcare & Genomics
● Media

Machine Learning

Data and Compute are Core to Self-Driving Cars

Why choose Google Cloud for HPC?

● Fast Provisioning: Start thousands of VMs in seconds, scale down instantaneously
● Latest Hardware: First with Intel Skylake, first with NVIDIA V100 & P4, Google Cloud TPUs, 160 vCPU VMs
● Secure: Automatic patches, defense in depth security, 750+ security experts
● Global Resources: Collaborate globally, instantly over Google’s best-in-class network
● Lower Cost: Per Second billing, Custom Machine Types, Preemptible VMs, Resource Aggregation
● Storage for Every Use Case: Ready for the heaviest I/O with GCS, Filestore, PD, Local SSD, and storage partners
● HPC Partnerships: Partnered to bring your favorite workload managers, applications, and tools to GCP
● Reduced Complexity: Live Migration, 24x7 Managed Services, Enterprise Support

Infrastructure

HPC Requirements

● Computing power
● Storing large amounts of data
● High performance networking
● Partner Solutions

Google Compute Engine

Highly Configurable Resources
● Latest Intel Skylake Processors
● Customizable Instances
  ○ 1 to 160 cores
  ○ Up to 3844 GB RAM
  ○ Rightsizing Recommendations
● GPUs & TPUs

Cloud-Friendly Economics
● Customizable Instances
● Sustained Use Discounts
● Resource Aggregation
● Committed Use Discounts
● Per Second Billing
● Preemptible VMs

Preemptible Virtual Machines

● Up to 80% cost savings
● 24 Hour Lifetime, 30 second notice
● Supports all Standard VM features
● Preemptible GPUs & Local SSDs
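As a rough illustration of the preemptible figures above, the sketch below estimates the best-case cost of a long job on one preemptible VM and the minimum number of VM lifetimes it would span. The hourly rate is a made-up placeholder rather than a real GCP price, the 80% discount is the slide's upper bound, and actual preemptions can arrive well before the 24 hour limit.

```python
import math

MAX_LIFETIME_HOURS = 24   # from the slide: 24 hour lifetime
MAX_DISCOUNT = 0.80       # from the slide: "up to 80%" (an upper bound)


def plan_preemptible_run(job_hours, on_demand_rate, discount=MAX_DISCOUNT):
    """Return (min_vm_lifetimes, on_demand_cost, preemptible_cost) for one VM.

    on_demand_rate is a hypothetical hourly price supplied by the caller.
    """
    if job_hours <= 0 or on_demand_rate < 0:
        raise ValueError("job_hours must be positive, on_demand_rate non-negative")
    if not 0 <= discount <= MAX_DISCOUNT:
        raise ValueError("discount must be between 0 and 0.80")
    # A VM lives at most 24 hours, so a longer job must be checkpointed
    # and resumed on at least this many VM lifetimes.
    min_lifetimes = math.ceil(job_hours / MAX_LIFETIME_HOURS)
    on_demand_cost = job_hours * on_demand_rate
    preemptible_cost = on_demand_cost * (1 - discount)
    return min_lifetimes, on_demand_cost, preemptible_cost


if __name__ == "__main__":
    lifetimes, full, cheap = plan_preemptible_run(job_hours=60, on_demand_rate=1.00)
    print(f"minimum restarts: {lifetimes - 1}")
    print(f"on-demand: ${full:.2f}, preemptible (best case): ${cheap:.2f}")
```

The restart count is a lower bound only, so a job run this way needs checkpointing that fits inside the 30 second notice.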

Google Networking

Performance / Bandwidth
● 7,000 VMs per VPC Network
● Predictable, low latency (~20 - 40 µs)
● Scalable bandwidth
  ○ 2 Gbps per vCPU
  ○ Up to 16 Gbps per VM
● Tailoring latency-sensitive tools to GCP
● Open-sourcing high-performance internal protocols and tools (gRPC)

Google Software Defined Networking
● Clos topology: a collection of smaller custom switches arranged to provide the properties of a much larger logical switch.
● Centralized software management stack.
● Implementing custom protocols tailored to the high performance data center.

Global Network
● Thousands of POPs around the world
● Google Backbone between datacenters
● Multiple interconnect options to on-prem
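The bandwidth figures above (2 Gbps per vCPU, up to 16 Gbps per VM) can be read as a simple ceiling per VM. The sketch below encodes that reading and uses it for an ideal-case transfer time; the function names are illustrative, and real throughput depends on protocol overhead, the remote endpoint, and storage speed.

```python
PER_VCPU_GBPS = 2     # from the slide: 2 Gbps per vCPU
PER_VM_CAP_GBPS = 16  # from the slide: up to 16 Gbps per VM


def vm_bandwidth_gbps(vcpus):
    """Bandwidth ceiling for one VM under the per-vCPU and per-VM limits."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return min(vcpus * PER_VCPU_GBPS, PER_VM_CAP_GBPS)


def transfer_seconds(gigabytes, vcpus):
    """Ideal-case seconds to move `gigabytes` of data at the VM's ceiling."""
    gigabits = gigabytes * 8
    return gigabits / vm_bandwidth_gbps(vcpus)


if __name__ == "__main__":
    for vcpus in (1, 4, 8, 160):
        print(f"{vcpus:>3} vCPUs: {vm_bandwidth_gbps(vcpus)} Gbps ceiling")
    print(f"100 GB on a 4 vCPU VM: {transfer_seconds(100, 4):.0f} s at best")
```

Under these numbers the ceiling stops growing at 8 vCPUs, so larger VMs gain compute but not per-VM network bandwidth.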

Google Cloud Network

The largest cloud network, comprised of >100 points of presence

[Map: Google network, network sea cable investments, and edge points of presence (>100)]

Network sea cable investments:
● FASTER (US, JP, TW) 2016
● Unity (US, JP) 2010
● PLCN (HK, LA) 2019
● SJC (JP, HK, SG) 2013
● Monet (US, BR) 2017
● Junior (Rio, Santos) 2017
● Tannat (BR, UY, AR) 2017
● Indigo (SG, ID, AU) 2019

Storage options on GCP

GOOGLE CLOUD STORAGE
● Exabyte-scale, feature-rich object storage
● Automatically scaling throughput

PERSISTENT DISK
● SSD/HDD Persistent Disk
● High-performance, replicated block storage

LOCAL STORAGE
● Local SSD (NVMe) for scratch & fast access
● Physically attached to node via PCI

FILESTORE
● Highly-available, durable, POSIX compliant shared storage across tens of thousands of nodes

HYBRID & PARTNER SOLUTIONS
● Partner storage solutions from NetApp, Elastifile, DDN, etc.
● Move petabytes to GCS with the Data Transfer Appliance

Partners & Solutions

● Storage: Coming Soon
● HPC Solutions: Coming Soon
● Applications & Platforms: Coming Soon

Slurm for Google Cloud Platform

Partnership Announced at Supercomputing 2017

● Cloud Auto-Scaling: Automatic elastic scaling of instances on-demand according to queue depth and job requirements. Spins resources down once idle.

● Burst to Cloud: Dynamically create virtual machines to offload jobs from your on-premise cluster to Google Cloud. Leverages Cloud Auto-Scaling functionality.

● Federate to Cloud: Federate jobs between your on-premise Slurm cluster and your Google Cloud Slurm cluster(s).

● Open Source: Scripts found in SchedMD’s GitHub repo, under “Slurm-gcp”: https://github.com/schedmd/slurm-gcp

Burst to Cloud: Auto-Scaling
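The Cloud Auto-Scaling behaviour described in the Slurm bullets (grow with queue depth and job requirements, spin down once idle) can be sketched as a small decision function. This is an illustrative model only, not the logic of the slurm-gcp scripts; `PendingJob`, `target_node_count`, and `scaling_action` are hypothetical names, and it assumes jobs are sized purely by core count.

```python
import math
from dataclasses import dataclass


@dataclass
class PendingJob:
    cores: int  # cores requested by a queued job (hypothetical field)


def target_node_count(pending_jobs, busy_nodes, cores_per_node, max_nodes):
    """Nodes to keep: busy ones plus enough to cover all queued cores."""
    queued_cores = sum(job.cores for job in pending_jobs)
    extra_nodes = math.ceil(queued_cores / cores_per_node)
    return min(busy_nodes + extra_nodes, max_nodes)


def scaling_action(current_nodes, target_nodes):
    """Translate the gap between current and target into one action."""
    delta = target_nodes - current_nodes
    if delta > 0:
        return ("create", delta)
    if delta < 0:
        return ("delete", -delta)  # idle nodes are spun down
    return ("none", 0)


if __name__ == "__main__":
    queue = [PendingJob(cores=10), PendingJob(cores=6)]
    target = target_node_count(queue, busy_nodes=2, cores_per_node=8, max_nodes=10)
    print(scaling_action(current_nodes=2, target_nodes=target))
```

With an empty queue and no busy nodes the target drops to zero, which corresponds to the "spins resources down once idle" behaviour.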

Architecture: Slurm Auto-Scaling Cluster

[Diagram: HPC users on-prem connect over Cloud VPN / Interconnect to an HPC cluster made of three Compute Engine groups: Login, Controller, and Worker Nodes. Component labels: Slurmd; SlurmCtld, NFS; Autoscaled. Storage: a Cloud Storage bucket.]

Burst to Cloud: Federated

Architecture: Federated Slurm Clusters

[Diagram: On-prem, HPC users reach a login node, controller node, worker nodes, and storage system. Cloud VPN / Interconnect links this cluster to an HPC cluster on Google Cloud made of three Compute Engine groups (Login, Controller, Worker Nodes) with component labels Slurmd; SlurmCtld, NFS; Autoscaled. Cloud-side storage: GCS, PD, Lustre, ...]

Customer Case Studies

SUNY Downstate Medical Center: Neurological Simulations

● 5000 Cores maximum
● 1 week of execution
● Slurm Auto-Scaling Cluster
● Preemptible VMs

[Chart: # Cores]

University of South Carolina: Microbiome Sequencing

● 4000 Nodes
● 125,000 Cores
● 16 hrs of execution
● Single Job
● Slurm Auto-Scaling Cluster
● Preemptible VMs
● R/O Persistent Disk / GCS

[Chart: # Cores and # Nodes]

University of South Carolina - Architecture

But wait, there’s more!

● Genomics & Life Sciences: The Broad Institute performed genome analysis with 50k cores
● Computational Chemistry: Using GCP for massive drug discovery virtual screening
● Computational Mathematics: 220,000 cores and counting: MIT math professor breaks record for largest ever Compute Engine job (Note: Now 580k cores!)
● Physics: Google Cloud, Fermilab, HEPCloud and probing the nature of Nature
● Satellite Image Analysis: Descartes Labs monitors planet Earth’s resources with Google Compute Engine

And also: Financial Services, Media Rendering & Transcoding, and many, many others...

Machine Learning

“Machine learning is a core, transformative way by which we’re rethinking how we’re doing everything.”
– Sundar Pichai, CEO, Google

Products powered by Machine Learning

● Search: Search ranking, Speech recognition
● Android: Keyboard & speech input
● Play: App recommendations, Game developer experience
● Gmail: Smart Reply, Spam classification
● Drive: Suggested visualization and insights
● Chrome: Search by Image
● Photos: Photos search
● YouTube: Video recommendations, Better thumbnails
● Maps: Street View image parsing, Local Search
● Translate: Text, graphic, and speech translations
● Cardboard: Smart stitching
● Ads: Richer Text Ads, Automated Bidding

Go has 1.74 x 10^172 possible moves, more than the number of atoms in the known universe.

Google DC Ops

Applying ML led to a 40% reduction in cooling energy in Google datacenters.

Machine Learning services from Google Cloud

● Cloud ML Engine & TensorFlow: Scalable, powerful, flexible Machine Learning tools
● Cloud ML APIs: Pre-trained, easy to use
● Cloud AutoML: Customizable, powerful, just add data

What is TensorFlow?

Open source software for ML development

#1 Project on GitHub

tensorflow.org launched in Nov 2015

Created by Google Brain team

Standard tool for ML development at Google

“People who are really serious about software should make their own hardware”
- Alan Kay

Tensor Processing Unit (TPU)

Custom ASIC built and optimized for TensorFlow

● 180 Teraflops & 30x faster than CPU
● Like fast-forwarding 7 years into the future

Q/A

Thank you

https://cloud.google.com/hpc
https://cloud.google.com/products/ai/