
HPC: High Performance Computing in Infrastructure Clouds

Kate Keahey
Argonne National Laboratory
Computation Institute, University of Chicago

15/06/12

Clouds of All Types

Software-as-a-Service (SaaS)

Community-specific tools, applications and portals

Platform-as-a-Service (PaaS)

Infrastructure-as-a-Service (IaaS)

[Figure axes: Control, Specialization]

Technology Trends

Virtualization

. System-level virtualization
  – Emulates a computer similar to a real one
  – The VM runs a full OS
. VM images
  – Snapshots
  – Migration
. Scheduling

[Figure: Type 1 virtualization on hardware, with a privileged domain and guest domains, each running its own OS and userland]

Widespread Virtualization

. From IBM in 1967…
. …to Xen in 2003
  – The Need for Speed
  – Open Source
. Other
  – KVM
  – Palacios, KHVMM

From Pratt et al., SOSP 2003

Virtualization: Game-changing Benefits

. Benefits to users
  – Control over environment
    • Touches all aspects from convenience to privilege
  – Convenient packaging/distribution mechanism
    • No makefiles, no validation and revalidation, facilitates sharing
. Benefits to providers
  – Security implications of isolation
  – Speed of deployment, switching between environments (critical difference from reimaging)
. Migration and snapshotting

Infrastructure Clouds: Game-Changing Benefits

. Benefits to users
  – On-demand access: manage peaks in demand
  – Pay-as-you-go: manage “valleys” in demand
  – Viable infrastructure outsourcing model
. Benefits to providers
  – Consolidation and economies of scale
. Retail versus wholesale resources
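The peak/valley argument above can be made concrete with a small calculation. The sketch below is illustrative only: the hourly demand trace and the price are hypothetical, and it assumes an owned core-hour and a rented core-hour cost the same, which real pricing rarely does.

```python
def owned_core_hours(demand):
    """Core-hours paid for when capacity is sized for the peak hour."""
    return max(demand) * len(demand)


def on_demand_core_hours(demand):
    """Core-hours paid for when only the cores in use are billed."""
    return sum(demand)


# Hypothetical demand in cores for six consecutive hours (not from the talk).
demand = [10, 10, 80, 100, 20, 10]
price_per_core_hour = 0.10  # assumed, identical for both models

owned = owned_core_hours(demand)
rented = on_demand_core_hours(demand)
print(f"owned:     {owned} core-hours, ${owned * price_per_core_hour:.2f}")
print(f"on demand: {rented} core-hours, ${rented * price_per_core_hour:.2f}")
print(f"utilization of owned capacity: {rented / owned:.0%}")
```

The gap between the two totals is the cost of the "valleys": capacity that is owned for the peak but sits idle the rest of the time.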

Clouds and HPC

Can HPC workloads be run in a cloud?

HPC Cloud

Can a supercomputer operate as a cloud?

The Nimbus Project @ ANL

High-quality, extensible, customizable, open source implementation

Nimbus Platform: Context Broker, Cloudinit.d, Elastic Scaling Tools
Enable users to use IaaS clouds

Nimbus Infrastructure: Workspace Service, Cumulus
Enable providers to build IaaS clouds

Enable developers to extend, experiment and customize

The Red Question (RQ)

Can HPC workloads be run in a cloud?

Napper et al., “Will Cloud Computing Reach Top500?” 2009 (The answer was: “No”)

Virtual Cluster from AWS: #146 on Top500 in 2010 (currently #42)

Total time to change your mind: 1 year

RQ: Performance

. I/O Performance
  – Bandwidth
  – Latency
. Instability

From Santos et al., Xen Summit 2007

How good are clouds for HPC? Resources:
- The Magellan Report
- “Comparisons: not as Odious as Once Thought”, see www.scienceclouds.org/blog

From Ramakrishnan et al., PMBS’11

RQ: Storage

. Performance challenges
  – I/O impact
  – Buffer caching not enabled
  – Bandwidth contention
  – Variability
. Getting the custom infrastructure
  – Storage options of all shapes and sizes
  – How do we combine offerings?
. Price performance considerations
  – Instance service levels versus HPC needs

RQ: Overall Performance

Overall impact on tightly-coupled applications

[Figure panels: PARATEC, MILC]
From Ramakrishnan et al., PMBS’11
Also see Jackson et al., CloudCom’10

RQ: Size and Scale

. “Utility Supercomputing”
. Record (CycleComputing)
  – Computational chemistry app
  – ~6,000 instances
  – ~50,000 cores
  – ~$5,000 per hour
. Nimbus for OOI
  – BigData analytics
  – Elastic scaling and HA
  – ~2,000 instances
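The approximate record figures above imply a unit price that is easy to derive. The sketch below only divides the rounded numbers quoted on the slide, so the results are rough: about ten cents per core-hour and about eight cores per instance. The function names are illustrative.

```python
def cost_per_core_hour(dollars_per_hour, cores):
    """Hourly price of the whole virtual cluster divided by its core count."""
    return dollars_per_hour / cores


def cores_per_instance(cores, instances):
    """Average number of cores provided by each instance."""
    return cores / instances


# Rounded figures quoted on the slide for the CycleComputing record run.
instances, cores, dollars_per_hour = 6_000, 50_000, 5_000

print(f"${cost_per_core_hour(dollars_per_hour, cores):.2f} per core-hour")
print(f"{cores_per_instance(cores, instances):.1f} cores per instance")
```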

RQ: “Other Considerations”

. Special hardware
  – E.g., Amazon GPU instances
. Noise
  – Significant variability
. MTBF and failure rates
  – Both a hardware and software consideration
  – Significant lack of reliability
  – Ask Franck…

See Jackson et al., CloudCom’10

How good are clouds for HPC? “Mohammad and the Mountain”, see www.scienceclouds.org/blog
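Why MTBF matters more as a virtual cluster grows can be shown with a standard reliability model. The sketch below assumes node failures are independent and exponentially distributed with the same per-node MTBF, and that any single node failure interrupts the job. The node MTBF and node count are hypothetical, not measurements from any cloud.

```python
import math


def cluster_mtbf(node_mtbf_hours, nodes):
    """MTBF of the cluster when the first failure of any node counts."""
    return node_mtbf_hours / nodes


def survival_probability(node_mtbf_hours, nodes, job_hours):
    """Probability that no node fails while a job of job_hours runs."""
    return math.exp(-job_hours * nodes / node_mtbf_hours)


# Hypothetical values chosen only for illustration.
node_mtbf_hours = 10_000
nodes = 2_000

print(f"cluster MTBF: {cluster_mtbf(node_mtbf_hours, nodes):.1f} hours")
for job_hours in (1, 5, 24):
    p = survival_probability(node_mtbf_hours, nodes, job_hours)
    print(f"{job_hours:>2}-hour job completes without a failure: {p:.1%}")
```

Under these assumptions the cluster MTBF shrinks linearly with the node count, which is one reason tightly-coupled jobs need checkpointing or failure-aware programming models at scale.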

Programming Models

. Leverage on-demand, large provider pool
. Adapt to failures
  – Master/Slave example

[Figure labels: Auto-scale; Redeploy; Role transfer + leader election]
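The three reactions named above (role transfer with leader election, redeploy, auto-scale) can be sketched with a small in-memory simulation. This is a toy under stated assumptions, not a distributed protocol: the `Cluster` class, its methods, and the "lowest live id wins" election rule are all hypothetical choices made for illustration.

```python
class Cluster:
    """Toy master/worker cluster that survives node failures."""

    def __init__(self, node_ids):
        self.alive = set(node_ids)
        self.master = self.elect()

    def elect(self):
        """Leader election: the lowest live node id becomes master."""
        if not self.alive:
            raise RuntimeError("no live nodes left to elect")
        return min(self.alive)

    def fail(self, node_id):
        """Remove a failed node; transfer the master role if needed."""
        self.alive.discard(node_id)
        if node_id == self.master:
            self.master = self.elect()

    def autoscale(self, target_size, next_id):
        """Redeploy fresh nodes until the cluster is back at target size.

        Returns the next unused node id.
        """
        while len(self.alive) < target_size:
            self.alive.add(next_id)
            next_id += 1
        return next_id


cluster = Cluster(range(4))            # nodes 0..3, node 0 is master
cluster.fail(0)                        # master fails, role moves to node 1
next_id = cluster.autoscale(4, next_id=4)  # node 4 is deployed as replacement
print(cluster.master, sorted(cluster.alive), next_id)
```

In a real deployment the election would need agreement among nodes that cannot see each other's state directly; the point here is only the order of reactions after a failure.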

RQ: Open Issues

. More understanding of performance
  – Still need understanding of external connectivity, storage/performance tradeoffs, MTBF & noise
  – Ongoing issue as the offerings change dynamically
. Performance characterization
  – Lightweight, easy to run, as conclusive as possible
. Storage space/options still largely unexplored
. Programming models

The Blue Question (BQ)

. Is it feasible to turn a supercomputer into a cloud?
. Challenges:
  – Hypervisor challenges
  – Deployment speed
  – Resource management model

BQ: Hypervisor Issues

. Challenges: architecture, I/O drivers, and performance
. New hypervisors
  – KHVMM (Blue Gene/P)
. Xen-IB (since ~2006)
. “Symbiotic virtualization”
  – Passthrough I/O
  – Preemption control
  – Optimized paging
. Research software

[Figure: CTH (shockwave simulation) within ~5% of native performance on Cray XT4 (Sandia)]
From Lange et al., VEE’11

BQ: Deployment Scale and Speed

. Moving images is the main component of VM deployment
. Challenge: make image deployment faster
. LANTorrent: the principle on a LAN
  – Streaming
  – Minimizes congestion at the switch
  – Detecting and eliminating duplicate transfers
. Evaluation using the Magellan resource at Argonne National Laboratory
. Bottom line: a thousand VMs in 10 minutes
. Other approaches:
  – Nicolae et al., HPDC’11
  – Riteau et al., QCOW
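Why streaming helps can be seen from a back-of-the-envelope model. The sketch below compares a repository that sends a full image copy to each node in turn with a chain in which every node forwards each chunk to the next node as soon as it arrives. It is a toy model, not LANTorrent's implementation: it assumes full-duplex links of equal speed, ignores latency and disk I/O, and all sizes and speeds are hypothetical.

```python
def sequential_seconds(image_mb, nodes, link_mbps):
    """Repository sends one complete copy to each node, one after another."""
    return nodes * image_mb * 8 / link_mbps


def chained_streaming_seconds(image_mb, nodes, link_mbps, chunk_mb):
    """Each node forwards a chunk downstream as soon as it has received it.

    Classic pipeline timing: the first chunk needs `nodes` hops to reach
    the last node, then one more chunk arrives per chunk time.
    """
    chunk_seconds = chunk_mb * 8 / link_mbps
    chunks = -(-image_mb // chunk_mb)  # ceiling division on integers
    return (chunks + nodes - 1) * chunk_seconds


# Hypothetical setup: 1000 MB image, 1000 nodes, 1000 Mbit/s links, 1 MB chunks.
image_mb, nodes, link_mbps, chunk_mb = 1000, 1000, 1000, 1

print(f"sequential: {sequential_seconds(image_mb, nodes, link_mbps):.0f} s")
print(f"streaming:  "
      f"{chained_streaming_seconds(image_mb, nodes, link_mbps, chunk_mb):.1f} s")
```

In this model the sequential time grows with the product of node count and image size, while the streamed time grows with their sum, and each link carries the image only once.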

BQ: Resource Management

. Challenge: high performance versus cost
. Multi-tenancy
  – Noisy neighbor
  – Interleaving needs
  – Not suitable for HPC with current methods
. Single-tenancy
  – Utilization challenge
  – Preemptible instances and spot instances: increase utilization without sacrificing the ability to respond to on-demand requests
  – Preemptible with snapshotting: increase utilization without sacrificing the ability to resume computation

Paper: Sotomayor et al., HPDC’08
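The effect of preemptible instances on utilization can be illustrated with a minimal model. In the sketch below on-demand requests always get cores first, and preemptible work fills whatever is left, shrinking again as soon as on-demand demand rises. The function name, the demand trace, and the capacity are hypothetical; this does not reproduce any published experiment.

```python
def utilization_trace(on_demand, capacity, preemptible_wanted=0):
    """Fraction of cores busy at each time step.

    on_demand: cores requested by on-demand users at each step.
    preemptible_wanted: cores of preemptible work available to backfill.
    """
    trace = []
    for demand in on_demand:
        served = min(demand, capacity)              # on-demand always wins
        idle = capacity - served
        backfilled = min(idle, preemptible_wanted)  # preempted when idle shrinks
        trace.append((served + backfilled) / capacity)
    return trace


def average(values):
    return sum(values) / len(values)


# Hypothetical on-demand load on a 16-core resource.
on_demand = [4, 8, 6, 2]
capacity = 16

without = utilization_trace(on_demand, capacity)
with_backfill = utilization_trace(on_demand, capacity, preemptible_wanted=8)
print(f"preemption disabled: {average(without):.2%}")
print(f"preemption enabled:  {average(with_backfill):.2%}")
```

On-demand users see the same service in both runs; only the idle cores are put to work, which is the property the single-tenancy bullets above rely on.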

BQ: Resource Management (2)

                      Preemption Disabled   Preemption Enabled
Average utilization   36.36%                83.82%
Maximum utilization   43.75%                100%

[Figure: two plots of Infrastructure Utilization (%), one per configuration]

From Marshall et al., CCGrid’11

BQ: Open Questions

. Hypervisors
  – Rapid progress but still many tradeoffs to examine
  – Need robust, open source solutions
. Resource management
  – Combine various concerns to allow users to build better virtual clusters
  – Utilization, policies, energy saving, etc.
. Storage management
  – Service levels
  – Combining storage and compute clouds
. Image management

Why Do We Care?

. Makeup of Top500 by Franck:
  – 87% commodity clusters (similar to Amazon CC)
  – 13% luxury computing
. Cloud computing creates a “critical mass”
  – Attracts applications and investment
. Can we afford to not support it?

Open Issues

. Performance characterization
  – Conclusive, easy to run, and lightweight
  – Reflecting both application requirements and platform idiosyncrasies
  – “Cloud500”
. Filling the gap between theory and practice
. “Shining star” demo
. Models
. Opportunities: big data

Build-a-Collaboration

. KerData team (Rennes)
. Myriads team (Rennes)
. Avalon team (Lyon)
. LBNL
. Northwestern
. University of Colorado
. ISI
. OpenCirrus (international)
