The Operational Impact of GPUs on ORNL's Cray XK7 Titan
Jim Rogers, Director of Operations, National Center for Computational Sciences, Oak Ridge National Laboratory
Office of Science Session Description, Session ID S4670
The Operational Impact of GPUs on ORNL's Cray XK7 Titan. With a peak computational capacity of more than 27 PF, Oak Ridge National Lab's Cray XK7, Titan, is currently the largest computing resource available to the US Department of Energy. Titan contains 18,688 individual compute nodes, where each node pairs one commodity x86 processor with a single NVIDIA Kepler GPU. When compared to a typical multicore solution, the ability to offload substantive amounts of work to the GPUs provides benefits with significant operational impacts. Case studies show time-to-solution and energy-to-solution results that are frequently more than 5 times more efficient than the non-GPU-enabled case. The need to understand how effectively the Kepler GPUs are being used by these applications is addressed by changes to the Kepler device driver and the Cray Resource Utilization software, which now provide a mechanism for reporting valuable GPU usage metrics for scheduled work and memory use, on a per-job basis.
Presenter Overview
Jim Rogers is the Director of Operations for the National Center for Computational Sciences at Oak Ridge National Laboratory. The NCCS provides full facility and operations support for three petaFLOP-scale systems, including Titan, a 27 PF Cray XK7. Jim has a BS in Computer Engineering, and has worked in high performance computing systems acquisition, integration, and operation for more than 25 years.
Content
• The OLCF's Cray XK7 Titan
  – Hardware Description
  – Mission Need
  – INCITE Allocation Program
    • Competitive Allocations
    • Computationally Ready Science
• The Operational Need to Understand Usage
  – ALTD (the early years)
  – NVIDIA's Role
    • Δ to the Kepler Driver, API, and NVML
  – Cray's Resource Utilization (RUR)
• Examples of NVML_COMPUTEMODE_EXCLUSIVE_PROCESS Measurement
  – Lattice QCD
  – LAMMPS
  – NAMD
• Operational Impact of GPUs on Titan
  – Case Study: WL-LSMS
  – Examples among Domains
  – Edge Case: HPL
  – Assessing the Operational Impact to Delivered Science
  – Time- and Energy-to-Solution
• Takeaways…
ORNL's Cray XK7 Titan - A Hybrid System with 1:1 AMD Opteron CPU and NVIDIA Kepler GPU
SYSTEM SPECIFICATIONS:
• Peak performance of 27 PF
• Sustained performance of 17.59 PF
• 18,688 compute nodes, each with:
  • 16-core AMD Opteron CPU
  • NVIDIA K20x (Kepler) GPU
  • 32 + 6 GB memory
• 512 service and I/O nodes
• 200 cabinets; 4,352 ft² (404 m²)
• 710 TB total system memory
• Cray Gemini 3D torus interconnect
• 8.9 MW peak energy measurement
Electrical distribution:
• (4) transformers
• (200) 480V/100A circuits
• (48) 480V/20A circuits

Cray XK7 Compute Node
XK7 Compute Node Characteristics
• AMD Opteron 6274 16-core processor - 141 GF
• NVIDIA Tesla K20x - 1,311 GF
• Host memory: 32 GB 1600 MHz DDR3 (HT3 to the CPU)
• Tesla K20x memory: 6 GB GDDR5 (PCIe Gen2 to the host)
• Gemini high-speed interconnect (X, Y, Z torus links)
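The headline 27 PF peak follows directly from the per-node ratings above; a quick arithmetic check (all numbers from the slides):

```python
# Peak-performance check from the per-node ratings quoted above.
NODES = 18_688
CPU_GF = 141      # AMD Opteron 6274, 16 cores
GPU_GF = 1_311    # NVIDIA Tesla K20x

node_gf = CPU_GF + GPU_GF             # 1,452 GF per hybrid node
system_pf = NODES * node_gf / 1e6     # GF -> PF

print(f"{system_pf:.2f} PF peak")            # ~27.13 PF, quoted as "27 PF"
print(f"GPU share: {GPU_GF / node_gf:.1%}")  # ~90% of node peak is the K20x
```

Note that roughly 90% of each node's peak sits in the GPU, which is why unused accelerators were an operational concern worth measuring.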
Slide courtesy of Cray, Inc.

Science challenges for the OLCF in the next decade
ASCR Mission: "…discover, develop, and deploy computational and networking capabilities to analyze, model, simulate, and predict complex phenomena important to DOE."
Climate Change Science: Understand the dynamic ecological and chemical evolution of the climate system, with uncertainty quantification of impacts on regional and decadal scales.
Combustion Science: Increase efficiency by 25%-50% and lower emissions from internal combustion engines using advanced fuels and new, low-temperature combustion concepts.
Biomass to Biofuels: Enhance the understanding and production of biofuels for transportation and other bio-products from biomass.
Fusion Energy/ITER: Develop predictive understanding of plasma properties, dynamics, and interactions with surrounding materials.
Globally Optimized Accelerator Designs: Optimize designs as the next generations of accelerators are planned; detailed models will be needed to provide proof of principle and efficient designs of new light sources.
Solar Energy: Improve photovoltaic efficiency and lower cost for organic and inorganic materials.
Innovative and Novel Computational Impact on Theory and Experiment
INCITE is an annual, peer-reviewed allocation program that provides unprecedented computational and data science resources.
• 5.8 billion core-hours awarded for 2014 on the 27-petaflop Cray XK7 "Titan" and the 10-petaflop IBM BG/Q "Mira"
• Average award: 78 million core-hours on Titan and 88 million core-hours on Mira in 2014
• INCITE is open to any science domain
• INCITE seeks computationally intensive, large-scale research campaigns

Call for Proposals: The INCITE program seeks proposals for high-impact science and technology research challenges that require the power of the leadership-class systems. Allocations will be for calendar year 2015. April 16 - June 27, 2014.
Contact information: Julia C. White, INCITE Manager, [email protected]
Diversity of INCITE Science
• Determining protein structures and designing proteins that block influenza virus infection. - David Baker, University of Washington
• High-fidelity simulation of complex suspension flow for practical rheometry. - William George, National Institute of Standards and Technology
• Modeling of ubiquitous weak intermolecular bonds using Quantum Monte Carlo to develop benchmark energies. - Dario Alfé, University College London, UK
• Simulating a flow of healthy (red) and diseased (blue) blood cells. - George Karniadakis, Brown University
• Calculating an improved probabilistic seismic hazard forecast for California. - Thomas Jordan, University of Southern California
• Providing new insights into the dynamics of turbulent combustion processes in internal-combustion engines. - Jacqueline Chen and Joseph Oefelein, Sandia National Laboratories

Other INCITE research topics:
• Glimpse into dark matter • Supernovae ignition • Protein structure • Creation of biofuels • Replicating enzyme functions • Global climate • Accelerator design • Protein folding • Turbulent flow • Propulsor systems • Membrane channels • Batteries • Carbon sequestration • Chemical catalyst design • Plasma physics • Algorithm development • Nano-devices • Solar cells • Reactor design • Nuclear structure

[Background figure: a Nature letter page on protein-structure determination (Rosetta + Autobuild method comparison), shown behind the project list.]

Allocation Programs for the OLCF's Cray XK7 Titan
Monitoring GPU Usage on Titan - The Early Years
• Requirement: Detect, on a per-job basis, if/when jobs use accelerator-equipped nodes.
• Initial Solution
  – Leverage ORNL's Automatic Library Tracking Database (ALTD)
    • At link time, a list of libraries linked against is stored in a database
    • When the resulting program is executed via aprun, a new ALTD record is written that contains the specific executable to be run, the batch job id, and other info
  – Batch jobs are compared against ALTD to see if they were linked against an accelerator-specific library
    • libacc*, libOpenCL*, libmagma*, libhmpp*, libcuda*, libcupti*, libcula*, libcublas*
    • Jobs whose executables are linked against one of the above are deemed to have used the accelerator
• Outliers
  – Jobs run outside of the batch system: ALTD knows about them, but we can't tie them to usage because there's no job record
  – ALTD is enabled by default, but if it's disabled we won't capture link/run info

Making sense of an example link statement:
  % lsms /usr/lib/../lib64/crt1.o /usr/lib/../lib64/crti.o
  /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/crtbegin.o
  libLSMS.aSystemParameters.o libLSMS.aread_input.o
  libLSMS.aPotentialIO.o libLSMS.abuildLIZandCommLists.o
  libLSMS.aenergyContourIntegration.o libLSMS.asolveSingleScatterers.o
  libLSMS.acalculateDensities.o libLSMS.acalculateChemPot.o
  /lustre/widow0/scratch/larkin/lsms3-trunk/lua/lib/liblua.a
  …
  -lcublas /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcublas.so
  -lcupti /opt/nvidia/cudatoolkit/5.0.28.101/extras/CUPTI/lib64/libcupti.so
  -lcudart /opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcudart.so
  -lcuda /opt/cray/nvidia/default/lib64/libcuda.so
  /opt/cray/atp/1.4.4/lib//libAtpSigHCommData.a -lAtpSigHandler
  /opt/cray/atp/1.4.4/lib//libAtpSigHandler.so -lgfortran
  /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgfortran.so -lhdf5_hl_cpp_gnu
  ...
  /opt/cray/pmi/3.0.1-1.0000.9101.2.26.gem/lib64/libpmi.so -lalpslli
  /usr/lib/alps/libalpslli.so -lalpsutil /usr/lib/alps/libalpsutil.so
  /lib64/libpthread.so.0 -lstdc++
  /lib64/ld-linux-x86-64.so.2 -lgcc_s
  /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/../../../../lib64/libgcc_s.so
  /opt/gcc/4.7.0/snos/lib/gcc/x86_64-suse-linux/4.7.0/crtend.o
  /usr/lib/../lib64/crtn.o
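The ALTD heuristic above can be sketched in a few lines. The function name and record shape here are hypothetical; only the accelerator-library prefix list comes from the slide:

```python
# Sketch of the ALTD accelerator heuristic: scan the library paths recorded at
# link time for accelerator-specific name patterns. Matching any pattern marks
# the executable (and hence jobs that run it) as a GPU user.
from fnmatch import fnmatch
import os

ACCEL_PATTERNS = ["libacc*", "libOpenCL*", "libmagma*", "libhmpp*",
                  "libcuda*", "libcupti*", "libcula*", "libcublas*"]

def used_accelerator(linked_libs):
    """True if any linked library matches an accelerator pattern."""
    names = (os.path.basename(lib) for lib in linked_libs)
    return any(fnmatch(n, pat) for n in names for pat in ACCEL_PATTERNS)

print(used_accelerator([
    "/opt/nvidia/cudatoolkit/5.0.28.101/lib64/libcublas.so",
    "/lib64/libpthread.so.0",
]))                                                          # True
print(used_accelerator(["/lib64/libpthread.so.0"]))          # False
```

Notice the limitation the slide calls out: this classifies the binary, not the run, so it can say a job *could* use the GPU but never how much it actually did.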
Assessing GPU Usage with ALTD
[Chart: Titan Core Hours Delivered (Daily), in millions, May 31, 2013 through February 25, 2014, split into Core Hours (Unknown), Core Hours (CPU), and Core Hours (CPU+GPU).]
• Rocky start using ALTD… lots of edge cases escaped.
• Great apparent use of the GPU by the workflow, but no way to quantify it.
• Unknowns are 14% of total delivered hours since May 31, 2013.

NVIDIA's Role - Δ to the Kepler Driver, API, and NVML
The previous NVML is cool. You can spot check:
  – Driver version
  – pstate
  – Memory use
  – Compute mode
  – GPU utilization
  – Temperature
  – Power
  – Clock

But we needed…
• GPU utility (not point-in-time utilization) for the life of a process
• Persistent state of that GPU and memory data
• Ability to retrieve that data, by apid, using a predefined API

And we conceded…
• If there is work on any of the 14 SMs, we are accumulating GPU utility.
• NVIDIA products containing these new features:
  • Kepler (GK110) or better;
  • Kepler driver 319.82 or later;
  • NVML API 5.319.43 or later;
  • the CUDA 5.5 release cadence.

nvidia-smi Output (truncated) from a Single Titan Kepler GPU
======NVSMI LOG====== [truncated screen capture; per-GPU fields include the driver version and ECC error counters]
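The distinction between a point-in-time utilization spot check (what nvidia-smi shows) and accumulated GPU utility for the life of a process can be modeled as below. This is a hypothetical sketch of the accounting semantics, not the driver's actual implementation; per the concession above, an interval counts toward utility if *any* of the 14 SMs has work:

```python
# Hypothetical model of accumulated GPU "utility": sum the sampling intervals
# in which any of the K20x's 14 SMs had scheduled work, then normalize by
# elapsed time. Sample format and interval length are illustrative only.

def gpu_utility(samples, interval_s=1.0):
    """samples: per-interval counts of busy SMs (0..14).
    Returns (busy_seconds, utility_fraction)."""
    busy = sum(1 for sm_busy in samples if sm_busy > 0) * interval_s
    elapsed = len(samples) * interval_s
    return busy, busy / elapsed

# A 10-second run: the GPU is "busy" in 6 of 10 intervals, even where
# only 1 of the 14 SMs has work.
busy, frac = gpu_utility([14, 14, 8, 1, 0, 0, 3, 14, 0, 0])
print(busy, frac)   # 6.0 0.6
```

This is also why the accumulated utility figure is a coarse upper bound on how hard the GPU worked: a single active SM counts the same as all fourteen.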
Caution: Default Mode versus Exclusive Process
• The default GPU compute mode on Titan is EXCLUSIVE_PROCESS. However, we do not preclude users from using DEFAULT compute mode, and some applications demonstrate slightly better performance in DEFAULT compute mode.
• In EXCLUSIVE_PROCESS compute mode, the current release of the Kepler device driver acts exactly like you would expect.
• However, in DEFAULT mode, the aggregation of GPU seconds across multiple contexts can be misinterpreted by third-party software using the new API.
  – Look for updates to the way that GPU seconds are accumulated across multiple contexts in DEFAULT mode as the CUDA 6.5 cadence nears.
• Kepler compute modes:
  – NVML_COMPUTEMODE_DEFAULT: multiple contexts per device.
  – NVML_COMPUTEMODE_EXCLUSIVE_THREAD: only one context per device, usable from one thread at a time.
  – NVML_COMPUTEMODE_PROHIBITED: no contexts per device.
  – NVML_COMPUTEMODE_EXCLUSIVE_PROCESS: only one context per device, usable from multiple threads at a time.

Cray RUR, and the NVIDIA API
• At the conclusion of every job, Cray uses the revised NVIDIA API to query every compute node associated with a job, extracting the accumulated GPU usage and memory usage statistics on each individual node.
• By aggregating that information with data from the job scheduler, statistics can then be generated that describe the GPU usage on a per-job basis.
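The per-job aggregation can be sketched as follows. The record shapes and function name are illustrative, not RUR's actual plugin output format; the normalization assumes one K20x per XK7 node, as on Titan:

```python
# Sketch of per-job GPU usage: sum the per-node accumulated GPU-seconds that
# the driver reports, then divide by the job's total GPU-time capacity
# (node count x scheduler-reported runtime, one GPU per node).

def job_gpu_usage(per_node_gpu_seconds, node_count, runtime_s):
    """Fraction of the job's available GPU time that was actually used."""
    total_gpu_s = sum(per_node_gpu_seconds)
    capacity_s = node_count * runtime_s
    return total_gpu_s / capacity_s

# A 4-node job with a 1000 s runtime; each node's GPU accumulated
# roughly 500 utility-seconds:
usage = job_gpu_usage([520.0, 480.0, 510.0, 490.0],
                      node_count=4, runtime_s=1000.0)
print(f"{usage:.1%}")   # 50.0%
```

The Lattice QCD, LAMMPS, and NAMD percentages later in the deck are exactly this kind of per-job aggregate.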
Per-node flow: compute node → collect GPU utility, per node, for a specific apid → aggregate GPU-seconds (utility × runtime) across all apids of an entire aprun → reference scheduler data for runtime and other data → provide specific job information that allows determination of GPU usage, per job.

OLCF Acquisition and Operational Costs - Where Cycles are Free through a Competitive Process
• 2014 Allocation Model: 125M Cray XK7 node hours
  – INCITE: 75M node hours among 40 projects
  – ALCC: 37.5M node hours among 10-20 projects
  – DD: 12.5M node hours
• Titan acquisition and operational costs (5-year life): facilities, power, cooling, asset (purchase, taxes, maintenance, lease), staff.
• The cost of the computer time dominates everything else.
• How can we effectively use the XK7 architecture to minimize time to solution?
• Total asset cost: ~$1/node-hour.
GPU Usage by Lattice QCD on OLCF's Cray XK7 Titan
NVML_COMPUTEMODE_EXCLUSIVE_PROCESS
• Lattice QCD calculations aim to understand the physical phenomena encompassed by quantum chromodynamics (QCD), the fundamental theory of the strong forces of subatomic physics.
[Chart: Aggregate GPU Usage via RUR (0-100%) versus Q1CY14 sample number; each sample used 800 Cray XK7 compute nodes.]
Lattice QCD GPU usage. Average: 52.50%; StdDev: 0.0406

GPU Usage by LAMMPS on OLCF's Cray XK7 Titan
Mixed Mode (OpenMP + MPI), NVML_COMPUTEMODE_EXCLUSIVE_PROCESS
• LAMMPS: classical molecular dynamics software used in simulations for biology, materials science, granular, mesoscale, etc.
[Figure: coarse-grained MD simulation of phase separation of a 1:1 weight ratio P3HT (electron donor) / PCBM (electron acceptor) mixture into donor (white) and acceptor (blue) domains.]
[Chart: Aggregate GPU Usage via RUR (0-100%) versus Q1CY14 sample number; each sample used 64 Cray XK7 compute nodes.]
This series: a sample of all Mixed Mode (OpenMP + MPI) LAMMPS runs in Q1CY14. Average GPU usage: 49.28%
GPU Usage by NAMD on OLCF's Cray XK7 Titan
NVML_COMPUTEMODE_EXCLUSIVE_PROCESS
• NAMD is a parallel molecular dynamics code designed for high-performance simulation of large bio-molecular systems.
• The availability of systems like Titan has rapidly expanded the domain of bio-molecular simulation from isolated proteins in solvent to complex aggregates, often in a lipid environment. Such systems routinely comprise 100,000 atoms, and several published NAMD simulations have exceeded 10,000,000 atoms.
[Chart: Aggregate GPU Usage via RUR (0-100%) versus Q1CY14 sample number; each sample used 768 Cray XK7 compute nodes.]
NAMD GPU usage. Average: 26.89%; StdDev: 0.062
Application Power Efficiency on the Cray XK7
The Behavior of Magnetic Systems with WL-LSMS

CPU-only power (energy) consumption trace for a WL-LSMS run that simulates 1024 Fe atoms as they reach their Curie temperature. Run size: 18,561 Titan nodes (99% of Titan). Run signature: initialization, followed by 20 Monte Carlo steps; for each step, update the density of states (DOS). Computation dominated by double-complex matrix-matrix multiplication (ZGEMM).

[Chart: WL-LSMS v3.0, CPU-only energy consumption, kW instantaneous versus elapsed time (0:00:00 to 4:11:44), Cray XK7 (Titan), 18,561 compute nodes.]

Application: WL-LSMS (CPU-only)
• Runtime (hh:mm:ss): 04:11:44
• Avg. inst. power: 6,160 kW
• Energy consumed: 25,700 kW-hr
• Mech. (1.23 PUE): 5,900 kW-hr
• Total energy: 31,600 kW-hr
• Energy/cooling cost: $3,500
• Single-run opportunity cost (runtime × asset cost): $77,800

Application Power Efficiency on the Cray XK7
Comparing CPU-Only and GPU-Enabled WL-LSMS
The identical WL-LSMS run (1024 Fe atoms on 18,561 Titan nodes), comparing the runtime and power consumption of the GPU-enabled version versus the CPU-only version.
• Runtime is 9X faster for the accelerated code -> 9X less opportunity cost. Same science output.
• Total energy consumed is 7.3X less.

[Chart: CPU-only and GPU-enabled kW-instantaneous traces versus elapsed time (0:00:00 to 4:11:44), Cray XK7 (Titan), 18,561 compute nodes.]

App: GPU-enabled WL-LSMS
• Runtime (hh:mm:ss): 00:27:43
• Avg. inst. power: 7,070 kW
• Energy consumed: 3,500 kW-hr
• Mech. (1.23 PUE): 800 kW-hr
• Total energy: 4,300 kW-hr
• Energy/cooling cost: $475
• Single-run opportunity cost (runtime × asset cost): $8,575
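The 9X and 7.3X figures, and both opportunity costs, can be reproduced directly from the stated run statistics (using the $1/node-hour asset cost and the PUE-inclusive energy totals from the slides):

```python
# Reproduce the WL-LSMS CPU-only vs. GPU-enabled comparison from the
# run statistics quoted on the slides.
NODES = 18_561
ASSET_COST_PER_NODE_HOUR = 1.0           # dollars, from the cost slide

cpu_runtime_h = 4 + 11/60 + 44/3600      # 04:11:44
gpu_runtime_h = 27/60 + 43/3600          # 00:27:43
cpu_total_kwh = 25_700 + 5_900           # IT + mechanical (1.23 PUE) = 31,600
gpu_total_kwh = 3_500 + 800              # = 4,300

print(f"speedup:      {cpu_runtime_h / gpu_runtime_h:.1f}x")   # 9.1x, quoted as 9X
print(f"energy ratio: {cpu_total_kwh / gpu_total_kwh:.1f}x")   # 7.3x

cpu_cost = NODES * cpu_runtime_h * ASSET_COST_PER_NODE_HOUR
gpu_cost = NODES * gpu_runtime_h * ASSET_COST_PER_NODE_HOUR
print(f"CPU-only opportunity cost: ${cpu_cost:,.0f}")   # ~$77,874 (slide: $77,800)
print(f"GPU-run opportunity cost:  ${gpu_cost:,.0f}")   # ~$8,574  (slide: $8,575)
```

The same arithmetic underlies the deck's larger point: on a machine costing roughly $1 per node-hour, a 9X runtime reduction is worth tens of thousands of dollars per large run.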
Operational Impact
• From Titan's introduction to production (May 31, 2013) through the end of the year:
  – Titan delivered 2.6B core-hours to science
  – Maintained 90% utilization
  – Was stable: 98.7% scheduled availability and 468-hour MTTF
• Titan provides significant time-to-solution benefits for applications that can take advantage of the 1:1 architecture.
• Many applications have been restructured to expose additional parallelism. This positively impacts not only the XK7 but other large parallel compute systems as well.
• Considerable performance gains on the XK7 continue to outpace its traditional competitors.
• Faster solution produces additional opportunity.

Cray XK7 vs. Cray XE6 application performance ratio*:
  – LAMMPS (molecular dynamics): 7.4
  – S3D (turbulent combustion): 2.2
  – Denovo (3D neutron transport for nuclear reactors): 3.8
  – WL-LSMS (statistical mechanics of magnetic materials): 3.8
April 15, 2012 Top 500 Submission: Cray XK6 (Jaguar)
Cray XK6 HPL Run Statistics
• 1.941 PF sustained (73.6% of 2.634 PF peak)
• Run duration: 24.6 hours
• System configuration: Cray XK blade upgrade complete (no Fermi or Kepler accelerators); 18,688 AMD Opteron 6274 (Interlagos) processors; Cray Gemini interconnect
• Power (consumption only): system idle 2,935 kW; max 5,275 kW; mean 5,142 kW; total 126,281 kW-h
ORNL's Cray XK7 Titan | HPL

[Chart: MW instantaneous and cumulative kW-hours versus run time (hh:mm:ss, 0:00:53 to 0:54:53); cumulative consumption 7,545.56 kW-hr; RmaxPower = 8,296.53 kW.]

Instantaneous measurements:
• 8.93 MW
• 21.42 PF
• 2,397.14 MF/Watt
Custom NVIDIA HPL binary; NVML driver reported 99% GPU usage.

Cray XK7 HPL Run Statistics
• 17.59 PF sustained (November 2012), vs. 1.941 PF on the Opteron-only system
• Run duration: 0:54:53, vs. more than 24.6 hours
• System configuration: Titan upgrade complete; 18,688 AMD Opteron 6274 processors, 1:1 with NVIDIA GK110 (Kepler) GPUs in SXM form factor
• Power (consumption only): max 8,930 kW; mean 8,296 kW; total 7,545 kW-h, vs. 126,281 kW-h
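The XK6-to-XK7 HPL gain can be put in energy terms directly from the two submissions' statistics. This uses sustained PF over mean power for both runs (the 2,397 MF/Watt headline above instead pairs the instantaneous 21.42 PF and 8.93 MW readings):

```python
# Energy efficiency of the two HPL submissions, from the quoted statistics.
xk6_pf, xk6_mean_kw, xk6_kwh = 1.941, 5_142, 126_281   # Jaguar, Opteron-only
xk7_pf, xk7_mean_kw, xk7_kwh = 17.59, 8_296, 7_545     # Titan, Opteron + K20x

xk6_mf_per_w = xk6_pf * 1e9 / (xk6_mean_kw * 1e3)      # PF -> MF, kW -> W
xk7_mf_per_w = xk7_pf * 1e9 / (xk7_mean_kw * 1e3)

print(f"XK6: {xk6_mf_per_w:,.0f} MF/W")                        # ~377 MF/W
print(f"XK7: {xk7_mf_per_w:,.0f} MF/W")                        # ~2,120 MF/W
print(f"efficiency gain: {xk7_mf_per_w / xk6_mf_per_w:.1f}x")  # ~5.6x
print(f"energy per run:  {xk6_kwh / xk7_kwh:.1f}x less")       # ~16.7x
```

The per-run energy drops much faster than the per-watt efficiency rises because the GPU machine also finishes in a fraction of the time, which is the deck's central operational argument.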
Takeaways
• Revisions by NVIDIA to the Kepler device driver now expose an ability for system accounting methods to capture, per node and per job, important usage and high-water statistics concerning the K20X SM and memory subsystems.
• GPU-enabled applications on the Cray XK7 can frequently realize time-to-solution savings of 5X or more versus traditional CPU-only applications.
• Invitation: the OLCF welcomes R&D applications focused on maximizing scientific application efficiency
  – Application performance benchmarking, analysis, modeling, and scaling studies. End-to-end workflow, visualization, data analytics.
  – https://www.olcf.ornl.gov/support/getting-started/olcf-director-discretion-project-application/
#GTC14 Questions?
The activities described herein were performed using resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.