Powering the Road to National HPC Leadership

Jack C. Wells, Director of Science, Oak Ridge Leadership Computing Facility / Oak Ridge National Laboratory



2018 OpenPOWER Summit, Las Vegas, 19 March 2018

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Some of the work presented here is from the TOTAL and Oak Ridge National Laboratory collaboration, which is done under CRADA agreement NFE-14-05227. Some of the experiments were supported by an allocation of advanced computing resources provided by the National Science Foundation. The computations were performed on Nautilus at the National Institute for Computational Sciences.

ORNL is managed by UT-Battelle for the US Department of Energy.

A Little About ORNL…

Oak Ridge National Laboratory is the largest US Department of Energy (DOE) open science laboratory.

Oak Ridge, Tennessee

What is a Leadership Computing Facility (LCF)?

• Collaborative DOE Office of Science user-facility program at ORNL and ANL
• Mission: provide the computational and data resources required to solve the most challenging problems.
• 2 centers / 2 architectures to address the diverse and growing computational needs of the scientific community
• Highly competitive user allocation programs (INCITE, ALCC).
• Projects receive 10x to 100x more resource than at other generally available centers.
• LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts).

ORNL has systematically delivered a series of leadership-class systems
On scope • On budget • Within schedule

[Timeline: a 1000-fold improvement in 8 years]
• 2004: Phoenix, Cray X1E, 18.5 TF
• 2005: Jaguar, Cray XT3, 25 TF
• 2006: Jaguar, Cray XT3, 54 TF
• 2007: Jaguar, Cray XT4, 62 TF
• 2008: Jaguar, Cray XT4, 263 TF
• 2008: Jaguar, Cray XT5 (OLCF-1), 1 PF
• 2009: Jaguar, Cray XT5 (OLCF-2), 2.5 PF
• 2012: Titan, Cray XK7 (OLCF-3), 27 PF

Titan, five years old in October 2017, continues to deliver world-class science research in support of our user community. We will operate Titan through 2019, when it will be decommissioned.

We are building on this record of success to enable exascale in 2021

[Timeline: a 500-fold improvement in 9 years]
• 2012: Titan, Cray XK7, 27 PF
• 2018: Summit, IBM (OLCF-4), 200 PF
• 2021: Frontier (OLCF-5), ~1 EF

Coming in 2018: Summit will replace Titan as the OLCF's leadership supercomputer

Summit, slated to be more powerful than any other existing supercomputer, is the Department of Energy's Oak Ridge National Laboratory's newest supercomputer for open science.

Summit Overview

Compute System
• 256 compute racks; 4,608 compute nodes
• 200 PFLOPS; ~13 MW
• 10.2 PB total memory
• Mellanox EDR IB fabric
• Warm water (70°F) direct-cooled components; RDHX for air-cooled components

Compute Rack
• 18 compute servers
• 39.7 TB memory/rack
• 55 kW max power/rack

Compute Node
• 2 x POWER9 CPUs; 6 x GV100 GPUs
• 512 GB DRAM (DDR4); 96 GB HBM2 (3D stacked); coherent shared memory
• 1600 GB NVMe-compatible PCIe SSD
• 25 GB/s EDR IB (2 ports)

Components
• IBM POWER9: 22 cores, 4 threads/core, NVLink
• NVIDIA GV100: 7 TF, 16 GB @ 0.9 TB/s, NVLink

GPFS File System
• 250 PB storage
• 2.5 TB/s read, 2.5 TB/s write
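As a quick sanity check on these numbers (my arithmetic, not part of the original slides), the per-node and system peaks are consistent:

$$ 6 \times 7\ \mathrm{TF} = 42\ \mathrm{TF/node}, \qquad 4{,}608\ \text{nodes} \times 42\ \mathrm{TF} \approx 193.5\ \mathrm{PF}, $$

in line with the quoted 200 PFLOPS once the POWER9 CPUs' contribution and the rounding of the 7 TF per-GPU figure are included.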

Summit Node Overview

[Node schematic: two POWER9 sockets, each with 256 GB DDR4 (135 GB/s) and three 7 TF GV100 GPUs (16 GB HBM2 @ 900 GB/s each); 50 GB/s NVLink connections among each socket's CPU and GPUs; a 64 GB/s X-Bus (SMP) between the sockets; a 16 GB/s PCIe Gen4 link to the EDR IB NIC (2 x 12.5 GB/s); node-local NVM at 6.0 GB/s read, 2.2 GB/s write.]

Per-node totals (from the diagram legend):
• TF: 42 TF (6 x 7 TF)
• HBM: 96 GB (6 x 16 GB)
• DRAM: 512 GB (2 x 16 x 16 GB)
• NET: 25 GB/s (2 x 12.5 GB/s)
• MMsg/s: 83

HBM and DRAM speeds are aggregate (read + write). All other speeds (X-Bus, NVLink, PCIe, IB) are bidirectional.
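As a minimal sketch of how an application might discover this topology at run time, the following CUDA host program (my illustration, not OLCF-provided code) enumerates the visible GPUs and checks pairwise peer access; on a full Summit node one would expect six devices, with direct peer paths among each socket's GPU triple. It assumes only a standard CUDA toolkit (compile with, e.g., nvcc topology.cu -o topology).

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-alone probe of a multi-GPU node's layout.
int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "no CUDA devices visible\n");
        return 1;
    }
    printf("%d GPUs visible\n", n);  // six on a full Summit node

    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        // GV100: ~16 GB of HBM2 behind the ~900 GB/s figure above
        printf("GPU %d: %s, %.1f GB\n", i, p.name, p.totalGlobalMem / 1e9);
    }

    // Pairwise peer access: devices reachable directly (e.g., over
    // NVLink) report 1; others need a hop through host memory.
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            printf("peer %d -> %d: %s\n", i, j, ok ? "yes" : "no");
        }
    return 0;
}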

Coming in 2018: Summit will replace Titan as the OLCF's leadership supercomputer

Feature                 | Titan                             | Summit
Application performance | Baseline                          | 5-10x Titan
Number of nodes         | 18,688                            | 4,608
Node performance        | 1.4 TF                            | 42 TF
Memory per node         | 32 GB DDR3 + 6 GB GDDR5           | 512 GB DDR4 + 96 GB HBM2
NV memory per node      | 0                                 | 1600 GB
Total system memory     | 710 TB                            | >10 PB DDR4 + HBM2 + non-volatile
System interconnect     | Gemini (6.4 GB/s)                 | Dual-rail EDR IB (25 GB/s)
Interconnect topology   | 3D torus                          | Non-blocking fat tree
Bi-section bandwidth    | 15.6 TB/s                         | 115.2 TB/s
Processors              | 1 AMD Opteron™ + 1 NVIDIA Kepler™ | 2 IBM POWER9™ + 6 NVIDIA Volta™
File system             | 32 PB, 1 TB/s, Lustre®            | 250 PB, 2.5 TB/s, GPFS™
Power consumption       | 9 MW                              | 13 MW

Summit vs. Titan:
• Many fewer nodes
• Much more powerful nodes
• Much more memory per node and total system memory
• Faster interconnect
• Much higher bandwidth between CPUs and GPUs
• Much larger and faster file system

What is CORAL? The program through which Summit & Sierra are procured.
• Several DOE labs have strong supercomputing programs and facilities.
• To bring the next generation of leading supercomputers to these labs, DOE created CORAL (the Collaboration of Oak Ridge, Argonne, and Livermore) to jointly procure these systems and, in so doing, align strategy and resources across the DOE enterprise.
• Collaboration grouping of DOE labs was done based on common acquisition timings. Collaboration is a win-win for all parties.

“Summit” System · “Sierra” System

OpenPOWER Technologies: IBM POWER CPUs, NVIDIA Tesla GPUs, Mellanox EDR 100 Gb/s InfiniBand

Paving the Road to Exascale Performance

OLCF Program to Ready Application Developers and Users
• We are preparing users through:
  – Application Readiness and Early Science through the Center for Accelerated Application Readiness (CAAR)
  – Training and web-based documentation
  – Early access on SummitDev and the Summit Phase I system (already accepted)
  – Access for the broader user base on the final, accepted Phase II system
• Goals:
  – Early science achievements
  – Demonstrate application readiness
  – Prepare INCITE & ALCC proposals
  – Harden Summit for full-user operations

Summit Early Science Program (ESP)

• We put out a Call for Proposals in December 2017
  – Resulting in 62 Letters of Intent (LOI) received by year's end:
    • 27 are from PIs at universities
    • 32 are from PIs at national laboratories or research institutions (DOE, NASA)
    • 14 are CAAR project-related LOIs
    • 27 have had past INCITE allocations
    • 9 have had past ALCC allocations
    • 15 have connections to the US DOE Exascale Computing Project
    • 9 are AI or deep learning-related
  – Proposals are due at the beginning of June
  – ESP users will gain full access to Summit for early science later this year

Summit will be the world's smartest supercomputer for open science
But what makes a supercomputer smart?
Summit provides unprecedented opportunities for the integration of artificial intelligence (AI) and scientific discovery. Here's why:

• GPU Brawn: Summit links more than 27,000 deep-learning optimized NVIDIA GPUs with the potential to deliver exascale-level performance (a billion-billion calculations per second) for AI applications.
• High-speed Data Movement: NVLink high-bandwidth technology built into all of Summit's processors supplies the next-generation "information superhighways" needed to train deep learning algorithms for challenging science problems quickly.
• Memory Where it Matters: Summit's sizable local memory gives AI researchers a convenient launching point for data-intensive tasks, an asset that allows for faster AI training and greater algorithmic accuracy.

[Photo caption: One of Summit's 4,600 IBM AC922 nodes. Each node contains six NVIDIA Volta GPUs and two IBM POWER9 CPUs, giving scientists new opportunities to automate, accelerate, and drive understanding using artificial intelligence techniques.]
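The coherent CPU-GPU shared memory noted in the node overview is part of what makes that "launching point" convenient: host and device can work on one allocation. As a generic, hedged sketch using standard CUDA managed memory (my illustration, not OLCF code), the program below lets the CPU initialize a buffer, a GPU kernel update it, and the CPU read the result with no explicit copies:

#include <cstdio>
#include <cuda_runtime.h>

// Scale each element of x by a (a simple stand-in for real work).
__global__ void scale(float *x, float a, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const size_t n = 1 << 20;
    float *x = nullptr;
    // One allocation visible to both CPU and GPU; on NVLink-coherent
    // systems such as Summit's AC922 nodes, pages migrate on demand.
    cudaMallocManaged(&x, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;   // CPU writes
    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);  // GPU updates
    cudaDeviceSynchronize();                      // wait for the kernel
    printf("x[0] = %.1f (expect 2.0)\n", x[0]);   // CPU reads back
    cudaFree(x);
    return 0;
}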

Summit will be the world's smartest supercomputer for open science
But what can a smart supercomputer do?

Science challenges for a smart supercomputer:

Identifying Next-Generation Materials
By training AI algorithms to predict material properties from experimental data, longstanding questions about material behavior at atomic scales could be answered for better batteries, more resilient building materials, and more efficient semiconductors.

Predicting Fusion Energy
Predictive AI software is already helping scientists anticipate disruptions to the volatile plasmas inside experimental reactors. Summit's arrival allows researchers to take this work to the next level and further integrate AI with fusion technology.

Deciphering High-Energy Physics Data
With AI supercomputing, physicists can lean on machines to identify important pieces of information: data that's too massive for any single human to handle and that could change our understanding of the universe.

Combating Cancer
Through the development of scalable deep neural networks, scientists at the US Department of Energy and the National Cancer Institute are making strides in improving cancer diagnosis and treatment.

Summit is still under construction

• We expect to accept the machine in the summer of 2018, bring early users on this year, and allocate our first users through the INCITE program in January 2019.
• We are continuing node and file-storage installation and software testing.

Questions?
Jack Wells
[email protected]