Scientific Simulations on Thousands of GPUs with Performance Portability


Scientific Simulations on Thousands of GPUs with Performance Portability
Alan Gray and Kevin Stratford
EPCC, The University of Edinburgh

CORAL procurement
• Three "pre-exascale" machines have been announced in the US, each in the region of 100-300 petaflops.
• Summit at ORNL and Sierra at LLNL will use NVIDIA GPUs (with IBM CPUs).
• Aurora at Argonne will use Intel Xeon Phi many-core CPUs (a Cray system).
• Performance portability is the key issue for the programmer.

Outline
• Applications: Ludwig and MILC
• Performance portability with targetDP
• Performance results on GPU, CPU and Xeon Phi, using the same source code on each
• Scaling to many nodes with MPI+targetDP

Ludwig application
• Soft matter substances, or complex fluids, are all around us.
• Ludwig uses lattice Boltzmann and finite difference methods to simulate a wide range of such systems.
• Improving the understanding of, and the ability to manipulate, liquid crystals is a very active research area.
• But the required simulations can be extremely computationally demanding, due to the range of scales involved.
• targetDP was developed in co-design with Ludwig.
• Gray, A., Hart, A., Henrich, O. and Stratford, K., "Scaling soft matter physics to thousands of graphics processing units in parallel", IJHPCA (2015).
• Stratford, K., Gray, A. and Lintuvuori, J. S., "Large Colloids in Cholesteric Liquid Crystals", Journal of Statistical Physics 161.6 (2015): 1496-1507.

MILC application
• Lattice QCD simulations provide numerical studies that help us understand how quarks and gluons interact to form protons, neutrons and other elementary particles.
• The Unified European Application Benchmark Suite (UEABS, http://www.prace-ri.eu/ueabs/) is a set of 12 application codes designed to be representative of EU HPC usage, including a Lattice QCD component derived from the MILC codebase.
• targetDP was applied to this application benchmark to enable it for GPU and Xeon Phi.

Multi-valued data
• For most scientific simulations the bottleneck is memory bandwidth.
• Simulation data consists of multiple values at each lattice site.
• In memory, we have a choice of how to store this:
  - |rgb|rgb|rgb|rgb| (Array of Structs, AoS)
  - |rrrr|gggg|bbbb| (Struct of Arrays, SoA)
  - The most general case is an Array of Structs of (short) Arrays (AoSoA); e.g. ||rr|gg|bb|||rr|gg|bb|| has a short-array length of 2.
  - The layout has a major effect on bandwidth, and the best layout is architecture-specific.
• Solution: de-couple the memory layout from the application source code. This can be done simply with a macro, e.g. field[INDEX(iDim,iSite)], as sketched below.
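The slides do not show the macro itself; the following is a minimal sketch of the idea, with hypothetical names (NDIM, NSITES, VVL, LAYOUT_*) that are illustrative rather than taken from the targetDP or Ludwig sources. The application always writes field[INDEX(iDim,iSite)], and the layout is selected at compile time.

    /* Illustrative layout-switching macro (a sketch, not the actual
     * targetDP/Ludwig definitions). */
    #define NDIM   3        /* values per lattice site (e.g. r,g,b)   */
    #define NSITES 1000000  /* total number of lattice sites          */
    #define VVL    4        /* short-array length for the AoSoA case  */

    #if defined(LAYOUT_AOS)
      /* |rgb|rgb|rgb| ... */
      #define INDEX(iDim, iSite) ((iSite)*NDIM + (iDim))
    #elif defined(LAYOUT_SOA)
      /* |rrr...|ggg...|bbb...| */
      #define INDEX(iDim, iSite) ((iDim)*NSITES + (iSite))
    #else
      /* AoSoA: blocks of VVL sites, e.g. ||rr|gg|bb|| for VVL=2 */
      #define INDEX(iDim, iSite) \
        (((iSite)/VVL)*NDIM*VVL + (iDim)*VVL + (iSite)%VVL)
    #endif

Changing the layout then requires only a different compile-time definition, leaving every kernel unchanged.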
targetDP
• Simple serial code example: a loop over N grid points, with some operation ... at each point:

    int iSite;
    for (iSite = 0; iSite < N; iSite++) {
      ...
    }

• The same loop parallelised with OpenMP:

    int iSite;
    #pragma omp parallel for
    for (iSite = 0; iSite < N; iSite++) {
      ...
    }

• Expressed with targetDP:

    __targetEntry__ void scale(double* field) {
      int iSite;
      __targetTLP__(iSite, N) {
        ...
      }
      return;
    }

• And with CUDA:

    __global__ void scale(double* field) {
      int iSite;
      iSite = blockIdx.x*blockDim.x + threadIdx.x;
      if (iSite < N) {
        ...
      }
      return;
    }

• A complete targetDP kernel, scaling the three components at each site:

    __targetEntry__ void scale(double* t_field) {
      int iSite;
      __targetTLP__(iSite, N) {
        int iDim;
        for (iDim = 0; iDim < 3; iDim++) {
          t_field[INDEX(iDim,iSite)] = t_a*t_field[INDEX(iDim,iSite)];
        }
      }
      return;
    }

• PROBLEM: to fully utilise modern CPUs, the compiler must vectorize the innermost loops to create vector instructions.
• SOLUTION: thread-level parallelism (TLP) can be strided, such that each thread operates on a chunk of VVL lattice sites.
  - VVL must be 1 for the above example to work.
  - But we can set VVL>1 and add a new innermost loop:

    __targetEntry__ void scale(double* t_field) {
      int baseIndex;
      __targetTLP__(baseIndex, N) {
        int iDim, vecIndex;
        for (iDim = 0; iDim < 3; iDim++) {
          __targetILP__(vecIndex) \
            t_field[INDEX(iDim,baseIndex+vecIndex)] = \
              t_a*t_field[INDEX(iDim,baseIndex+vecIndex)];
        }
      }
      return;
    }

• Instruction-level parallelism (ILP) can map to a loop over the chunk of lattice sites, with an OpenMP SIMD directive, which the compiler can easily vectorize.
• VVL can be tuned specifically for the hardware, e.g. VVL=8 creates a single IMCI instruction for the 8-way double-precision vector unit on Xeon Phi; without this, performance is several times worse on Xeon Phi.
• We can simply map __targetILP__ to an empty macro when we don't want ILP; a sketch of possible mappings follows.
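The slides describe the mappings (TLP to OpenMP or CUDA threads over chunks of VVL sites, ILP to a vectorizable SIMD loop or an empty macro) but do not show the macro definitions. The following is a simplified sketch of what such definitions could look like under those assumptions; it is not the actual targetDP header.

    /* Illustrative expansions of the targetDP-style abstractions
     * (a sketch only, not the real targetDP implementation). */

    #ifdef __NVCC__  /* CUDA target: one GPU thread per chunk of VVL sites */

      #define __targetEntry__  __global__
      #define __targetTLP__(baseIndex, n) \
        baseIndex = VVL*(blockIdx.x*blockDim.x + threadIdx.x); \
        if (baseIndex < (n))
      #define __targetILP__(vecIndex) \
        for (vecIndex = 0; vecIndex < VVL; vecIndex++)
      /* On the GPU, VVL is typically 1 (SoA layout), so the ILP loop
       * collapses to a single iteration. */

    #else            /* CPU / Xeon Phi target: OpenMP threads plus SIMD */

      #define __targetEntry__  /* plain host function */
      #define __targetTLP__(baseIndex, n) \
        _Pragma("omp parallel for") \
        for (baseIndex = 0; baseIndex < (n); baseIndex += VVL)
      #define __targetILP__(vecIndex) \
        _Pragma("omp simd") \
        for (vecIndex = 0; vecIndex < VVL; vecIndex++)

    #endif

With definitions along these lines, the scale kernel above compiles to an OpenMP-threaded, SIMD-vectorized loop nest on the CPU and Xeon Phi, and to a standard CUDA kernel on the GPU, from a single source.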
Extract" 200" " 100" "" " " " "" "" 0" " " Intel"Ivy1 Intel"Haswell" AMD" Intel"Xeon" NVIDIA"K20X"NVIDIA"K40" bridge"121 81core"CPU" Interlagos" Phi"" GPU" GPU" core"CPU" 161core"CPU" Best%% AoSoA,%% %%AoS,%% %%AoS,%% AoSoA,%% %%SoA,%% %%SoA,%% Config:% VVL=4% VVL=1% VVL=1% VVL=8% VVL=1% VVL=1% 14 Comparing with capability of hardware • Use “Roofline” model • It can be shown that all our kernels are memory-bandwidth bound ¡ Compare kernel bandwidth with STREAM benchmark 140" Intel"IvyPbridge"(Es.mated)" Intel"Xeon"Phi"(Es.mated)" NVIDIA"K40"GPU"(Actual)" 120" 100" 80" 60" 40" Percentage)of)STREAM) 20" 0" ShiN"(0.00)" Extract"(0.07)" Insert"(0.10)" Collision"(1.08)" LC"Update"(0.79)"Advec.on"(0.13)" Propaga.on"(0.00)" Chemical"Stress"(2.97)" Advect."Bound."(0.05)" Order"Par."Grad."(0.15)" Extract"and"Mult."(0.38)"Insert"and"Mult."(0.38)" Scalar"Mult."Add"(0.07)" 15 Ludwig" MILC" MPI+targetDP Supercomputer Scaling 16 Ludwig$Liquid$Crystal:$128x128x128$ 1000" Ludwig$Liquid$Crystal:$128x128x128$ 1000" Titan"CPU"" Titan"CPU""(One"160core" (One"160core" 100" Interlagos"per"node)""" 100" Interlagos"per"node)""" Archer"CPU"" Archer"CPU""(Two"120core"Ivy0 (Two"120core"Ivy0bridge"per"node)" !me$(s)$ bridge"per"node)" !me$(s)$ Titan"GPU"" 10" Titan"GPU""(One"K20X"per"node)" 10" (One"K20X"per"node)" 1" 1" 1" 10" 100" 1000" 1" 10" nodes$100" 1000" nodes$ 17 Ludwig$Liquid$Crystal:$128x128x128$ 1000" Ludwig$Liquid$Crystal:$1024x1024x512$ 1000# Titan#CPU##Titan"CPU"" (One#160core#(One"160core" Interlagos#per#node)### 100" Interlagos"per"node)""" Archer#CPU##Archer"CPU"" (Two#120core#Ivy0(Two"120core"Ivy0 bridge#per#node)#bridge"per"node)" 100#!me$(s)$ !me$(s)$ Titan"GPU"" 10" Titan#GPU#(One"K20X"per"node)" #(One#K20#per#node)# 1" 10# 1" 10" 100" 1000" 100# 1000#nodes$ 10000# nodes$ 18 Ludwig$Liquid$Crystal:$128x128x128$ 1000" MILC$Conjugate$Gradient:$64x64x32x8$ 1000" Titan"CPU""Titan"CPU"" (one"160core"Interlagos"(One"160core" per"node)" 100" Interlagos"per"node)""" 100" Archer"CPU""Archer"CPU"" (one"120core"Ivy0bridge"(Two"120core"Ivy0 per"node)"bridge"per"node)" !me$(s)$ !me$(s)$ Titan"GPU"" 10" Titan"GPU""(One"K20X"per"node)" 10" (one"K20X"per"node)" 1" 1" 1" 10" 100" 1000" 1" 10" nodes$100" 1000" nodes$ 19 MILC$Conjugate$Gradient:$64x64x64x192$Ludwig$Liquid$Crystal:$128x128x128$ 1000" 1000" Titan"CPU"" (one"160core"Interlagos"Titan"CPU"" per"node)"(One"160core" 100" Interlagos"per"node)""" 100" Archer"CPU"Archer"CPU"" (one"120core"Ivy0bridge"(Two"120core"Ivy0 per"node)"bridge"per"node)" !me$(s)$ !me$(s)$ Titan"GPU""Titan"GPU"" 10" (one"K20X"per"node)"(One"K20X"per"node)" 10" 1" 1" 1" 10" 100" 1000" 10" 100" nodes$1000" 10000" nodes$ 20 Summary • targetDP is a simplis&c framework that allows grid-based codes to perform well on modern mul&/many-core CPUs as well as GPUs ¡ By abstrac&ng parallelism and memory spaces ¡ Express TLP and ILP. We can see that exposing ILP is crucial on Xeon Phi today, and vector units will con&nue to get wider on future CPUs ¡ It is also crucial to de-couple memory layout by abstrac&ng memory accesses. 
Results
• The same performance-portable targetDP source code was run on all three architectures: CPU, Xeon Phi and GPU.

[Chart: Full Ludwig liquid crystal 128x128x128 test case. Time (s) broken down by component (Propagation, Collision, Order Par. Grad., Chemical Stress, LC Update, Advection, Advect. Bound., Ludwig remainder) on an Intel Ivy-bridge 12-core CPU, Intel Haswell 8-core CPU, AMD Interlagos 16-core CPU, Intel Xeon Phi, NVIDIA K20X GPU and NVIDIA K40 GPU. Best configurations, in the same order: AoSoA VVL=4; AoS VVL=1; AoS VVL=1; AoSoA VVL=8; SoA VVL=1; SoA VVL=1.]

[Chart: Full MILC conjugate gradient 64x64x32x8 test case. Time (s) broken down by component (Shift, Scalar Mult. Add, Insert, Insert & Mult., Extract & Mult., Extract, MILC remainder) on the same six architectures, with the same best configurations.]

Comparing with the capability of the hardware
• We use the "roofline" model.
• It can be shown that all our kernels are memory-bandwidth bound, so we compare each kernel's achieved bandwidth with the STREAM benchmark.

[Chart: Percentage of STREAM bandwidth achieved on Intel Ivy-bridge (estimated), Intel Xeon Phi (estimated) and the NVIDIA K40 GPU (actual) for each Ludwig kernel (Propagation (0.00), Collision (1.08), Order Par. Grad. (0.15), Chemical Stress (2.97), LC Update (0.79), Advection (0.13), Advect. Bound. (0.05)) and each MILC kernel (Shift (0.00), Extract (0.07), Insert (0.10), Extract and Mult. (0.38), Insert and Mult. (0.38), Scalar Mult. Add (0.07)).]

MPI+targetDP: supercomputer scaling

[Charts: Time (s) against number of nodes on Titan CPU (one 16-core Interlagos per node), Archer CPU (12-core Ivy-bridge nodes) and Titan GPU (one K20X per node), for the Ludwig liquid crystal 128x128x128 and 1024x1024x512 cases and the MILC conjugate gradient 64x64x32x8 and 64x64x64x192 cases, on up to thousands of nodes.]

Summary
• targetDP is a simple framework that allows grid-based codes to perform well on modern multi-core and many-core CPUs as well as GPUs, by abstracting parallelism and memory spaces.
  - It expresses both TLP and ILP; exposing ILP is crucial on Xeon Phi today, and vector units will continue to get wider on future CPUs.
  - It is also crucial to de-couple the memory layout by abstracting memory accesses.
• We demonstrated performance portability across multiple modern architectures.
• GPUs and Xeon Phi are significantly faster than CPUs because they offer higher memory bandwidth, with GPUs holding the advantage over Xeon Phi.
• MPI+targetDP is suitable for large-scale supercomputing; NVLink should help with strong multi-GPU scaling.
• We have been concentrating on structured grid-based applications, but similar thinking may be fruitful for other areas.
• targetDP is freely available: http://ccpforge.cse.rl.ac.uk/svn/ludwig/trunk/targetDP/README

Acknowledgements