The Operational Impact of GPUs on ORNL's XK7

Jim Rogers, Director of Operations, National Center for Computational Sciences, Oak Ridge National Laboratory

Office of Science | Session Description, Session ID S4670

The Operational Impact of GPUs on ORNL's Cray XK7 Titan. With a peak computational capacity of more than 27 PF, Oak Ridge National Lab's Cray XK7, Titan, is currently the largest computing resource available to the US Department of Energy. Titan contains 18,688 individual compute nodes, where each node pairs one commodity x86 processor with a single NVIDIA Kepler GPU. When compared to a typical multicore solution, the ability to offload substantive amounts of work to the GPUs provides benefits with significant operational impacts. Case studies show time-to-solution and energy-to-solution that are frequently more than 5 times more efficient than the non-GPU-enabled case. The need to understand how effectively the Kepler GPUs are being used by these applications is augmented by changes to the Kepler device driver and the Cray Resource Utilization software, which now provide a mechanism for reporting valuable GPU usage metrics for scheduled work and memory use, on a per-job basis.

2 Presenter Overview

Jim Rogers is the Director of Operations for the National Center for Computational Sciences at Oak Ridge National Laboratory. The NCCS provides full facility and operations support for three petaFLOP-scale systems including Titan, a 27 PF Cray XK7. Jim has a BS in Computer Engineering and has worked in high performance computing systems acquisition, integration, and operation for more than 25 years.

3 Content
• The OLCF's Cray XK7 Titan
  – Hardware Description
  – Mission Need
  – INCITE Allocation Program
    • Competitive Allocations
    • Computationally Ready Science
• The Operational Need to Understand Usage
  – ALTD (the early years)
  – NVIDIA's Role: Δ to the Kepler Driver, API, and NVML
  – Cray's Resource Utilization (RUR)
• Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
  – Lattice QCD
  – LAMMPS
  – NAMD
• Operational Impact of GPUs on Titan
  – Case Study: WL-LSMS
  – Examples among Domains
  – Edge Case: HPL
  – Assessing the Operational Impact to Delivered Science: Time- and Energy-to-Solution
• Takeaways…

4 ORNL’s Cray XK7 Titan - A Hybrid System with 1:1 AMD Opteron CPU and NVIDIA Kepler GPU

Footprint: 4,352 ft2 / 404 m2

SYSTEM SPECIFICATIONS:
• Peak performance of 27 PF
• Sustained performance of 17.59 PF
• 18,688 compute nodes, each with:
  • 16-core AMD Opteron CPU
  • NVIDIA K20x (Kepler) GPU
  • 32 + 6 GB memory
• 512 service and I/O nodes
• 200 cabinets
• 710 TB total system memory
• Cray Gemini 3D torus interconnect
• 8.9 MW peak energy measurement

Electrical distribution:
• (4) transformers
• (200) 480V/100A circuits
• (48) 480V/20A circuits

5 Cray XK7 Compute Node

XK7 Compute Node Characteristics

• AMD Opteron 6274 16-core processor: 141 GF
• Tesla K20x (attached via PCIe Gen2): 1,311 GF
• Host memory: 32 GB 1600 MHz DDR3 (HT3 links)

• Tesla K20x memory: 6 GB GDDR5
• Gemini high-speed interconnect (3D torus: X, Y, Z)

Slide courtesy of Cray, Inc.

6 Science Challenges for the OLCF in the Next Decade
ASCR Mission: "…discover, develop, and deploy computational and networking capabilities to analyze, model, simulate, and predict complex phenomena important to DOE."

Climate Change Science: Understand the dynamic ecological and chemical evolution of the climate system with uncertainty quantification of impacts on regional and decadal scales.
Combustion Science: Increase efficiency by 25%-50% and lower emissions from internal combustion engines using advanced fuels and new, low-temperature combustion concepts.

Biomass to Biofuels: Enhance the understanding and production of biofuels for transportation and other bio-products from biomass.
Fusion Energy/ITER: Develop predictive understanding of plasma properties, dynamics, and interactions with surrounding materials.

Globally Optimized Accelerator Designs: Optimize designs as the next generations of accelerators are planned; detailed models will be needed to provide proof of principle and efficient designs of new light sources.
Solar Energy: Improve photovoltaic efficiency and lower cost for organic and inorganic materials.

7 Innovative and Novel Computational Impact on Theory and Experiment
INCITE is an annual, peer-reviewed allocation program that provides unprecedented computational and data science resources.

• 5.8 billion core-hours awarded for 2014 on the 27-petaflop Cray XK7 "Titan" and the 10-petaflop IBM BG/Q "Mira"
• Average award: 78 million core-hours on Titan and 88 million core-hours on Mira in 2014
• INCITE is open to any science domain
• INCITE seeks computationally intensive, large-scale research campaigns

Call for Proposals: The INCITE program seeks proposals for high-impact science and technology research challenges that require the power of the leadership-class systems. Allocations will be for calendar year 2015. April 16 – June 27, 2014.
Contact information: Julia C. White, INCITE Manager, [email protected]

8 Diversity of INCITE Science

• Determining protein structures and designing proteins that block influenza virus infection. – David Baker, University of Washington
• High-fidelity simulation of complex suspension flow for practical rheometry. – William George, National Institute of Standards and Technology
• Simulating a flow of healthy (red) and diseased (blue) blood cells. – George Karniadakis, Brown University
• Modeling of ubiquitous weak intermolecular bonds using Quantum Monte Carlo to develop benchmark energies. – Dario Alfé, University College London, UK
• Calculating an improved probabilistic seismic hazard forecast for California. – Thomas Jordan, University of Southern California
• Providing new insights into the dynamics of turbulent combustion processes in internal-combustion engines. – Jacqueline Chen and Joseph Oefelein, Sandia National Laboratories

Other INCITE research topics: glimpse into dark matter, global climate, membrane channels, nano-devices, supernovae ignition, accelerator design, protein folding, batteries, carbon sequestration, chemical catalyst design, solar cells, creation of biofuels, turbulent flow, plasma physics, reactor design, replicating enzyme functions, propulsor systems, algorithm development, nuclear structure.

9 Allocation Programs for the OLCF's Cray XK7 Titan

10 Content (agenda repeated)

11 Monitoring GPU Usage on Titan – The Early Years

• Requirement: Detect, on a per-job basis, if/when jobs use accelerator-equipped nodes.
• Initial solution:
  – Leverage ORNL's Automatic Library Tracking Database (ALTD)
    • At link time, a list of libraries linked against is stored in a database.
    • When the resulting program is executed via aprun, a new ALTD record is written that contains the specific executable to be run, the batch job id, and other info.
  – Batch jobs are compared against ALTD to see if they were linked against an accelerator-specific library:
    • libacc*, libOpenCL*, libmagma*, libhmpp*, libcuda*, libcupti*, libcula*, libcublas*
    • Jobs whose executables are linked against one of the above are deemed to have used the accelerator (the matching heuristic is sketched below).
• Outliers:
  – Jobs run outside of the batch system
    • ALTD knows about them, but we can't tie them to usage because there's no job record.
  – ALTD is enabled by default, but if it's disabled we won't capture link/run info.

Making sense of an example link statement:

% lsms /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o
/opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/crtbegin.o
libLSMS.aSystemParameters.o libLSMS.aread_input.o
libLSMS.aPotentialIO.o
libLSMS.abuildLIZandCommLists.o
libLSMS.aenergyContourIntegration.o
libLSMS.asolveSingleScatterers.o libLSMS.acalculateDensities.o
libLSMS.acalculateChemPot.o
/lustre/widow0/scratch/larkin/lsms3-trunk/lua/lib/liblua.a
…
-lcublas /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcublas.so
-lcupti /opt/nvidia/cudatoolkit/5.0.28.101/extras/CUPTI/lib64/libcupti.so
-lcudart /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcudart.so
-lcuda /opt/cray/nvidia/default/lib64/libcuda.so
/opt/cray/atp/1.4.4/lib//libAtpSigHCommData.a -lAtpSigHandler
/opt/cray/atp/1.4.4/lib//libAtpSigHandler.so -lgfortran
/opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgfortran.so -lhdf5_hl_cpp_gnu
...
/opt/cray/pmi/3.0.1-1.0000.9101.2.26.gem/lib64/libpmi.so -lalpslli
/usr/lib/alps/libalpslli.so -lalpsutil /usr/lib/alps/libalpsutil.so
/lib64/libpthread.so.0 -lstdc++
/lib64/ld-linux-x86-64.so.2 -lgcc_s
/opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgcc_s.so
/opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/crtend.o
/usr/lib/../lib64/crtn.o
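The accelerator test described above amounts to a prefix match against the basenames of the libraries recorded at link time. A minimal, hypothetical sketch of that heuristic in C (illustrative only, not ALTD source; the prefix list is taken from the slide):

/* Classify a linked library as "accelerator" if its basename starts
 * with one of the accelerator library prefixes listed above.
 * Illustrative sketch only; ALTD's internals are not shown here. */
#include <stdio.h>
#include <string.h>

static const char *accel_prefixes[] = {
    "libacc", "libOpenCL", "libmagma", "libhmpp",
    "libcuda", "libcupti", "libcula", "libcublas"
};

/* Return 1 if the basename of libpath begins with a known prefix. */
static int is_accelerator_lib(const char *libpath) {
    const char *base = strrchr(libpath, '/');
    base = base ? base + 1 : libpath;
    for (size_t i = 0; i < sizeof(accel_prefixes) / sizeof(*accel_prefixes); i++)
        if (strncmp(base, accel_prefixes[i], strlen(accel_prefixes[i])) == 0)
            return 1;
    return 0;
}

int main(void) {
    /* Two entries from the example link statement above: */
    const char *libs[] = {
        "/opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcublas.so",
        "/opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgfortran.so"
    };
    for (int i = 0; i < 2; i++)
        printf("%s -> %s\n", libs[i],
               is_accelerator_lib(libs[i]) ? "accelerator" : "not accelerator");
    return 0;
}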

12 Assessing GPU Usage with ALTD

[Chart: Titan core-hours delivered daily, in millions (0–18), from 5/31/13 through 2/25/14, broken out as Core Hours (Unknown), Core Hours (CPU), and Core Hours (CPU+GPU).]

Rocky start using ALTD… lots of edge cases escaped. Great apparent use of the GPU by the workflow, but no way to quantify it. Unknowns are 14% of total delivered hours since May 31, 2013.

13 NVIDIA's Role – Δ to the Kepler Driver, API, and NVML

The previous NVML is cool. You can spot check…
  – Driver version
  – pstate
  – Memory use
  – Compute mode
  – GPU utilization
  – Temperature
  – Power
  – Clock

But we needed…
• GPU utility (not point-in-time utilization) for the life of a process.
• Persistent state of that GPU and memory data.
• Ability to retrieve that data, by apid, using a predefined API.

And we conceded…
• If there is work on any of the 14 SMs, we are accumulating GPU utility.

• NVIDIA products containing these new features:
  • Kepler (GK110) or better
  • Kepler driver 319.82 or later
  • NVML API 5.319.43 or later
  • The CUDA 5.5 release cadence

14 nvidia-smi Output (truncated) from a Single Titan Kepler GPU

======NVSMI LOG======
Timestamp                 : Mon Mar 18 16:51:15 2013
Driver Version            : 304.47.13
Attached GPUs             : 1
GPU 0000:02:00.0
    Product Name          : Tesla K20X
    Display Mode          : Disabled
    Persistence Mode      : Enabled
    Performance State     : P8
    Clocks Throttle Reasons
        Idle                : Active
        User Defined Clocks : Not Active
        SW Power Cap        : Not Active
        HW Slowdown         : Not Active
        Unknown             : Not Active
    Memory Usage
        Total             : 5759 MB
        Used              : 37 MB
        Free              : 5722 MB
    Compute Mode          : Exclusive_Process
    Utilization
        Gpu               : 0 %
        Memory            : 0 %
    Ecc Mode
        Current           : Enabled
        Pending           : Enabled
    ECC Errors            : (truncated)
    Temperature
        Gpu               : 25 C
    Power Readings
        Power Management  : Supported
        Power Draw        : 18.08 W
        Power Limit       : 225.00 W
        Default Power Limit : 225.00 W
        Min Power Limit   : 55.00 W
        Max Power Limit   : 300.00 W
    Clocks
        Graphics          : 324 MHz
        SM                : 324 MHz
        Memory            : 324 MHz
    Applications Clocks
        Graphics          : 732 MHz
        Memory            : 2600 MHz
    Max Clocks
        Graphics          : 784 MHz
        SM                : 784 MHz
        Memory            : 2600 MHz
    Compute Processes     : None

Slide callouts:
• Driver version for XK is no less than 304.47.13
• Kepler – the K20X
• Kepler has either a p-state of 0 (busy) or 8 (idle)
• 6 GB GDDR5
• Instantaneous GPU utilization: this is a point-in-time sample and has no temporal quality.

NVML is a C-based API for monitoring and managing various states of the NVIDIA GPU devices. nvidia-smi is an existing application that uses the NVML API.
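The same instantaneous fields can be pulled programmatically. A minimal sketch against the C NVML API (illustrative only; error handling abbreviated, link with -lnvidia-ml):

/* Spot-check sketch: query the instantaneous values shown in the
 * nvidia-smi output above via NVML. Not production code. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlDevice_t dev;
    nvmlUtilization_t util;      /* point-in-time SM and memory activity, percent */
    nvmlMemory_t mem;            /* total/used/free framebuffer, bytes */
    nvmlComputeMode_t mode;
    unsigned int power_mw, temp_c;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) != NVML_SUCCESS) { nvmlShutdown(); return 1; }

    nvmlDeviceGetUtilizationRates(dev, &util);
    nvmlDeviceGetMemoryInfo(dev, &mem);
    nvmlDeviceGetComputeMode(dev, &mode);
    nvmlDeviceGetPowerUsage(dev, &power_mw);                      /* milliwatts */
    nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c);

    printf("GPU util: %u %%  Mem util: %u %%\n", util.gpu, util.memory);
    printf("FB memory used: %llu / %llu MB\n",
           (unsigned long long)(mem.used >> 20), (unsigned long long)(mem.total >> 20));
    printf("Compute mode: %d  Power: %.2f W  Temp: %u C\n",
           (int)mode, power_mw / 1000.0, temp_c);

    nvmlShutdown();
    return 0;
}

Like nvidia-smi, this is a point-in-time sample; the per-process accumulation the OLCF needed comes from the driver changes described on the following slides.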

15 Caution: Default Mode versus Exclusive Process

• The default GPU compute mode on Titan is EXCLUSIVE_PROCESS. However, we do not preclude users from using DEFAULT compute mode, and some applications demonstrate slightly better performance in DEFAULT compute mode.
• In EXCLUSIVE_PROCESS compute mode, the current release of the Kepler device driver acts exactly like you would expect (see the per-process accounting sketch below).
• However, in DEFAULT mode, the aggregation of GPU seconds across multiple contexts can be misinterpreted by third-party software using the new API.
  – Look for updates to the way that GPU seconds are accumulated across multiple contexts in DEFAULT mode as the CUDA 6.5 cadence nears.
• Kepler compute modes:
  – NVML_COMPUTEMODE_DEFAULT: default compute mode – multiple contexts per device.
  – NVML_COMPUTEMODE_EXCLUSIVE_THREAD: compute-exclusive-thread mode – only one context per device, usable from one thread at a time.
  – NVML_COMPUTEMODE_PROHIBITED: compute-prohibited mode – no contexts per device.
  – NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: compute-exclusive-process mode – only one context per device, usable from multiple threads at a time.
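A minimal sketch of the per-process view the revised driver exposes, using NVML's accounting queries (illustrative only, not the Cray/RUR implementation; assumes accounting mode has already been enabled on the device, e.g. via nvmlDeviceSetAccountingMode, and uses the accounting-stats fields as documented by NVML):

/* Illustrative sketch: read per-process accounting stats from NVML.
 * In EXCLUSIVE_PROCESS mode there is at most one compute process per
 * GPU, so its accumulated stats describe the whole apid on that node.
 * Link with -lnvidia-ml. */
#include <stdio.h>
#include <nvml.h>

static void report_accounting(nvmlDevice_t dev) {
    unsigned int pids[16];
    unsigned int count = 16;

    if (nvmlDeviceGetAccountingPids(dev, &count, pids) != NVML_SUCCESS)
        return;

    for (unsigned int i = 0; i < count; i++) {
        nvmlAccountingStats_t st;
        if (nvmlDeviceGetAccountingStats(dev, pids[i], &st) != NVML_SUCCESS)
            continue;
        /* gpuUtilization/memoryUtilization are averaged over the process
         * lifetime; time is the lifetime in milliseconds, so accumulated
         * "GPU seconds" is roughly utilization x lifetime. */
        double gpu_seconds = (st.gpuUtilization / 100.0) * (st.time / 1000.0);
        printf("pid %u: gpu %u%%, mem %u%%, maxMem %llu MB, life %.1f s, gpu-seconds %.1f\n",
               pids[i], st.gpuUtilization, st.memoryUtilization,
               (unsigned long long)(st.maxMemoryUsage >> 20),
               st.time / 1000.0, gpu_seconds);
    }
}

int main(void) {
    nvmlDevice_t dev;
    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS)
        report_accounting(dev);
    nvmlShutdown();
    return 0;
}

The DEFAULT-mode caveat above is exactly the multi-context case: several processes (and contexts) can accumulate GPU seconds on one device, and third-party tooling must be careful not to double-count or misattribute them.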

16 Cray RUR, and the NVIDIA API

• At the conclusion of every job, Cray uses the revised NVIDIA API to query every compute node associated with a job, extracting the accumulated GPU usage and memory usage statistics on each individual node.
• By aggregating that information with data from the job scheduler, statistics can then be generated that describe the GPU usage, on a per-job basis.

[Diagram: Compute nodes → GPU-seconds across all apids (per node) → aggregate GPU-seconds across all apids → GPU seconds + run time + scheduler data.]
1. Collect GPU utility, per node, for a specific apid.
2. Aggregate GPU seconds (utility × run time) across an entire aprun; reference scheduler data for run time and other data.
3. Provide specific job information that allows determination of GPU usage, per job (the roll-up arithmetic is sketched below).
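A sketch of the roll-up arithmetic implied by the diagram (illustrative; the record layout and the helper names are assumptions, not RUR's actual format):

/* Per-job roll-up sketch: given one accumulated GPU-seconds value per
 * compute node plus the job's wall time from the scheduler, the job-wide
 * figure reported on the following slides is:
 *   aggregate GPU usage = sum(node GPU-seconds) / (nodes * wall seconds) */
#include <stdio.h>

double aggregate_gpu_usage(const double *node_gpu_seconds,
                           int num_nodes, double wall_seconds) {
    double total = 0.0;
    for (int i = 0; i < num_nodes; i++)
        total += node_gpu_seconds[i];
    return total / ((double)num_nodes * wall_seconds);   /* fraction 0..1 */
}

int main(void) {
    /* Hypothetical 4-node, 1-hour aprun: */
    double per_node[] = { 1800.0, 1750.0, 1900.0, 1850.0 };  /* GPU-seconds */
    double usage = aggregate_gpu_usage(per_node, 4, 3600.0);
    printf("Aggregate GPU usage: %.1f%%\n", usage * 100.0);   /* ~50.7% */
    return 0;
}

This is the quantity plotted as "Aggregate GPU Usage via RUR" in the Lattice QCD, LAMMPS, and NAMD examples that follow.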

17 OLCF Acquisition and Operational Costs – Where Cycles are Free through a Competitive Process

• 2014 allocation model: 125M Cray XK7 node hours
  – INCITE: 75M node hours among 40 projects
  – ALCC: 37.5M node hours among 10–20 projects
  – DD: 12.5M node hours
• Titan acquisition and operational costs (5-year life): facilities, power, cooling, asset (purchase, taxes, maintenance, lease), staff
• Total asset cost: $1/node hour
• The cost of the computer time dominates everything else.
• How can we effectively use the XK7 architecture to minimize time to solution?
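At that rate, the single-run opportunity costs quoted with the WL-LSMS case study later in this deck follow directly from run size × runtime (figures rounded as on those slides):

  CPU-only run:    18,561 nodes × 4:11:44 ≈ 77,900 node-hours → ≈ $77,800
  GPU-enabled run: 18,561 nodes × 0:27:43 ≈ 8,575 node-hours  → ≈ $8,575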

18 Content (agenda repeated)

19 GPU Usage by Lattice QCD on OLCF's Cray XK7 Titan
NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

• Lattice QCD calculations aim to understand the physical phenomena encompassed by quantum chromodynamics (QCD), the fundamental theory of the strong forces of subatomic physics.

[Chart: aggregate GPU usage via RUR (0–100%) vs. Q1CY14 sample number, each sample using 800 Cray XK7 compute nodes. Lattice QCD GPU usage – average: 52.50%, std. dev.: 0.0406.]

20 GPU Usage by LAMMPS on OLCF's Cray XK7 Titan
Mixed Mode (OpenMP + MPI), NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

• LAMMPS – classical molecular dynamics software used in simulations for biology, materials science, granular, mesoscale, etc.
• Coarse-grained MD simulation of phase separation of a 1:1 weight ratio P3HT (electron donor) / PCBM (electron acceptor) mixture into donor (white) and acceptor (blue) domains.

[Chart: aggregate GPU usage via RUR (0–100%) vs. Q1CY14 sample number, each sample using 64 Cray XK7 compute nodes. This series: a sample of all Mixed Mode (OpenMP + MPI) LAMMPS runs in Q1CY14. Average GPU usage: 49.28%.]

21 GPU Usage by NAMD on OLCF's Cray XK7 Titan
NVML_COMPUTEMODE_EXCLUSIVE_PROCESS

• NAMD is a parallel molecular dynamics code designed for high-performance simulation of large bio-molecular systems.
• The availability of systems like Titan has rapidly expanded the domain of bio-molecular simulation from isolated proteins in solvent to complex aggregates, often in a lipid environment. Such systems routinely comprise 100,000 atoms, and several published NAMD simulations have exceeded 10,000,000 atoms.

[Chart: aggregate GPU usage via RUR (0–100%) vs. Q1CY14 sample number, each sample using 768 Cray XK7 compute nodes. NAMD GPU usage – average: 26.89%, std. dev.: 0.062.]

22 Content (agenda repeated)

23 Application Power Efficiency on the Cray XK7 – The Behavior of Magnetic Systems with WL-LSMS

CPU-only power (energy) consumption trace for a WL-LSMS run that simulates 1024 Fe atoms as they reach their Curie temperature. Run size: 18,561 Titan nodes (99% of Titan). Run signature: initialization, followed by 20 Monte Carlo steps. Computation dominated by double complex matrix-matrix multiplication (ZGEMM); for each step, update the density of states (DOS).

[Chart: WL-LSMS v3.0, CPU-only vs. GPU-enabled energy consumption, kW instantaneous, Cray XK7 (Titan), 18,561 compute nodes, elapsed time 0:00:00–4:10:34.]

Application: WL-LSMS
  Runtime (hh:mm:ss): 04:11:44
  Avg. inst. power: 6,160 kW
  Energy consumed: 25,700 kW-hr
  Mech. (1.23 PUE): 5,900 kW-hr
  Total energy: 31,600 kW-hr
  Energy/cooling cost: $3,500
  Single-run opportunity cost (runtime × asset cost): $77,800

24 Application Power Efficiency on the Cray XK7 – Comparing CPU-Only and GPU-Enabled WL-LSMS

The identical WL-LSMS run (1024 Fe atoms on 18,561 Titan nodes), comparing the runtime and power consumption of the GPU-enabled version versus the CPU-only version.
• Runtime is 9X faster for the accelerated code -> 9X less opportunity cost. Same science output.
• Total energy consumed is 7.3X less.

[Chart: WL-LSMS v3.0, (CPU-only) kW instantaneous vs. (GPU-enabled) kW instantaneous, Cray XK7 (Titan), 18,561 compute nodes, elapsed time 0:00:00–4:10:34.]

App: GPU-enabled WL-LSMS
  Runtime (hh:mm:ss): 00:27:43
  Avg. inst. power: 7,070 kW
  Energy consumed: 3,500 kW-hr
  Mech. (1.23 PUE): 800 kW-hr
  Total energy: 4,300 kW-hr
  Energy/cooling cost: $475
  Single-run opportunity cost (runtime × asset cost): $8,575
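The headline ratios follow from the two runs' statistics (rounded):

  Runtime:          4:11:44 / 0:27:43 ≈ 9.1× ("9X faster")
  Total energy:     31,600 kW-hr / 4,300 kW-hr ≈ 7.3×
  Opportunity cost: $77,800 / $8,575 ≈ 9.1× (proportional to runtime at $1/node hour)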

25 Operational Impact

• From Titan's introduction to production (May 31, 2013) through the end of the year:
  – Titan delivered 2.6B core-hours to science
  – Maintained 90% utilization
  – Was stable: 98.7% scheduled availability and 468-hour MTTF
• Titan provides significant time-to-solution benefits for applications that can take advantage of the 1:1 architecture.
• Many applications have been restructured to expose additional parallelism. This positively impacts not only the XK7 but other large parallel compute systems as well.
• Considerable performance gains on the XK7 continue to outpace its traditional competitors.
• Faster solution produces additional opportunity.

Cray XK7 vs. Cray XE6 application performance ratio*:
  – LAMMPS (molecular dynamics): 7.4
  – S3D (turbulent combustion): 2.2
  – Denovo (3D neutron transport for nuclear reactors): 3.8
  – WL-LSMS (statistical mechanics of magnetic materials): 3.8

26 Content (agenda repeated)

27 April 15, 2012 Top 500 Submission: Cray XK6 (Jaguar)

Cray XK6 HPL run statistics:
• 1.941 PF sustained (73.6% of 2.634 PF peak)
• Run duration: 24.6 hours
• Power (consumption only):
  – System idle: 2,935 kW
  – Max kW: 5,275 kW
  – Mean kW: 5,142 kW
  – Total: 126,281 kW-h

System configuration:
• Cray XK blade upgrade complete (no Fermi or Kepler accelerators)
• 18,688 AMD Opteron 6274 processors (Interlagos)
• Cray Gemini interconnect

28 ORNL's Cray XK7 Titan | HPL

[Chart: MW instantaneous and cumulative consumption (kW-hours) vs. run-time duration (hh:mm:ss), 0:00:53–0:54:53. Instantaneous measurements: 8.93 MW, 21.42 PF, 2,397.14 MF/Watt. Cumulative: 7,545.56 kW-hr. RmaxPower = 8,296.53 kW. Custom NVIDIA HPL binary – NVML driver reported 99% GPU usage.]

Cray XK7 HPL run statistics:
• 17.59 PF sustained, vs. 1.941 PF on Opteron-only
• Run duration: 0:54:53, vs. more than 24.6 hours
• Power (consumption only):
  – Max kW: 8,930 kW
  – Mean kW: 8,296 kW
  – Total: 7,545 kW-h, vs. 126,281 kW-h

System configuration (November 2012):
• Titan upgrade complete
• 18,688 AMD Opteron 6274 processors, 1:1 with NVIDIA GK110 (Kepler) GPUs in SXM form factor

29 Takeaways

• Revisions by NVIDIA to the Kepler device driver now expose an ability for system accounting methods to capture, per node and per job, important usage and high-water statistics concerning the K20X SM and memory subsystems.
• GPU-enabled applications on the Cray XK7 can frequently realize time-to-solution savings of 5X or more versus traditional CPU-only applications.

• Invitation: the OLCF welcomes R&D applications focused on maximizing scientific application efficiency
  – Application performance benchmarking, analysis, modeling, and scaling studies. End-to-end workflow, visualization, data analytics.
  – https://www.olcf.ornl.gov/support/getting-started/olcf-director-discretion-project-application/

30 #GTC14 Questions?

[email protected]

The activities described herein were performed using resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC0500OR22725.

31