Computer architecture in the past and next quarter century
CNNA 2010, Berkeley, Feb 3, 2010
H. Peter Hofstee, Cell/B.E. Chief Scientist, IBM Systems and Technology Group

Page 2 - CMOS Microprocessor Trends, The First ~25 Years (the good old days)
Chart: log(performance) over time. Single-thread performance grows through more active transistors and higher frequency, up to about 2005.

Page 3 - SPECINT performance history
Chart from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006: SPECINT performance relative to the VAX-11/780, 1978 to 2006. VAX: 25%/year from 1978 to 1986. RISC + x86: 52%/year from 1986 to 2002. The post-2002 rate is marked ??%/year, with a 3X annotation on the chart.

Page 4 - CMOS devices hit a scaling wall
Chart (Isaac et al., IBM): power density (W/cm^2) versus gate length (microns), roughly 1994 to 2004. Both active power and passive power rise steeply as gate length shrinks and approach the air-cooling limit.

Page 5 - Microprocessor Trends
Chart: log(performance) over time; single-thread performance, driven by more active transistors and higher frequency, levels off around 2005.

Page 6 - Microprocessor Trends: Multi-Core
Chart: multi-core designs keep total performance growing with more active transistors (and higher frequency) past 2005, projected out to 2015(?).

Page 7 - Multicore Power Server Processors
Power 4 (2001): introduces dual core. Power 5 (2004): dual core, 4 threads. Power 7 (2009): 8 cores, 32 threads.

Page 8 - Why are (shared memory) CMPs dominant?
A new system delivers nearly twice the throughput performance of the previous one without application-level changes.
Applications do not degrade in performance when ported to a next-generation processor. This is an important factor in markets where it is not possible to rewrite all applications for a new system, a common case.
Applications benefit from more memory capacity and more memory bandwidth when ported, even if they do not (optimally) use all the available cores.
Even when a single application must be accelerated, large portions of code can be reused.
Design cost is reduced, at least relative to the scenario where all available transistors are used to build a single processor.

Page 9 - PDSOI optimization results
Chart (D. Frank, C. Tyberg): estimated clock frequency (GHz) versus date, 2002 to 2020, for the 90nm through 11nm technology nodes. Maintaining a constant performance improvement of 20% per generation by optimizing each core for maximum performance requires power densities rising from about 50 W/cm^2 at 65nm, through 65 W/cm^2 at 45nm and 32nm, 110 W/cm^2 at 22nm, and 260 W/cm^2 at 15nm, to about 750 W/cm^2 at 11nm. Holding power density constant at 25 W/cm^2 instead limits the achievable clock frequency.

Page 10 - Microprocessor Trends
Chart: recap of single-thread scaling to about 2005 and multi-core scaling to about 2015(?).

Page 11 - Microprocessor Trends: Hybrid/Heterogeneous
Chart: hybrid/heterogeneous designs extend the performance curve beyond multi-core, from about 2015(?) toward 2025(??).
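The growth rates quoted on page 3 compound multiplicatively, which is why the chart spans four orders of magnitude. As a quick illustration (not part of the original deck), the short C program below plugs in the rates and year ranges from the slide and prints the cumulative speedups: roughly 6x over the 25%/year VAX era and several hundred times over the 52%/year RISC + x86 era, which together account for the chart reaching thousands of times the VAX-11/780.

    #include <math.h>
    #include <stdio.h>

    /* Relative performance after a number of years of compound annual growth. */
    static double growth(double rate_per_year, int years)
    {
        return pow(1.0 + rate_per_year, years);
    }

    int main(void)
    {
        /* Rates and eras taken from the Hennessy & Patterson chart on page 3. */
        double vax_era  = growth(0.25, 1986 - 1978);   /* 25%/year, 1978-1986 */
        double risc_era = growth(0.52, 2002 - 1986);   /* 52%/year, 1986-2002 */

        printf("1978-1986 at 25%%/year: %7.1fx\n", vax_era);
        printf("1986-2002 at 52%%/year: %7.1fx\n", risc_era);
        printf("combined              : %7.1fx vs. VAX-11/780\n", vax_era * risc_era);
        return 0;
    }

(Compile with any C compiler and link the math library, e.g. cc growth.c -lm.)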
Page 12 - Heterogeneous Power Architecture Processors
Xilinx Virtex II-Pro (2002): 1-2 Power cores plus FPGA. Cell Broadband Engine (2005): 1 Power core plus 8 accelerator cores. Rapport Kilocore (2006): 1 Power core plus 256 accelerator cores.

Page 13 - Cell Broadband Engine
Heterogeneous multiprocessor: a Power processor plus Synergistic Processing Elements.
Power Processor Element (PPE): general purpose, runs full-fledged OSs, 64-bit Power Architecture with VMX, two levels of globally coherent cache (L1 in the PPU plus L2).
Synergistic Processor Element (SPE): SPU optimized for computation density, 128-bit-wide SIMD, fast local memory (local store, LS), globally coherent DMA through the Memory Flow Controller (MFC).
Block diagram: eight SPEs (SPE0-SPE7, each SPU + LS + AUC + MFC) and the PPE attach to the Element Interconnect Bus (EIB, up to 96B/cycle) through 16B/cycle ports, together with the memory interface controller (MIC, dual XDR) and the bus interface controller (BIC, I/O).

Page 14 - Memory Managing Processor vs. Traditional General Purpose Processor
Die-photo comparison of the Cell/B.E. (IBM) against AMD, IBM, and Intel general-purpose processors.

Page 15 - (no text)

Page 16 - Current Cell: Integer Workloads
Breadth-first search: Villa, Scarpazza, Petrini, Peinador, IPDPS 2007. Sort: Gedik, Bordawekar, Yu (IBM). MapReduce: Sankaralingam, De Kruijf, Oct. 2007.

Page 17 - Three Generations of Cell/B.E. (Takahashi et al.)
90nm SOI: 236 mm^2. 65nm SOI: 175 mm^2. 45nm SOI: 116 mm^2.

Cell/B.E. chip
  Generation   W (mm)   H (mm)   Area (mm^2)   Scaling from 90nm   Scaling from 65nm
  90nm         19.17    12.29    235.48        100.0%              -
  65nm         15.59    11.20    174.61         74.2%              100.0%
  45nm         12.75     9.06    115.46         49.0%               66.1%

Synergistic Processor Element (SPE)
  Generation   W (mm)   H (mm)   Area (mm^2)   Scaling from 90nm   Scaling from 65nm
  90nm          2.54     5.81     14.76        100.0%              -
  65nm          2.09     5.30     11.08         75.0%              100.0%
  45nm          1.59     4.09      6.47         43.9%               58.5%

Power Processor Element (PPE)
  Generation   W (mm)   H (mm)   Area (mm^2)   Scaling from 90nm   Scaling from 65nm
  90nm          4.44     6.05     26.86        100.0%              -
  65nm          3.50     5.60     19.60         73.0%              100.0%
  45nm          2.66     4.26     11.32         42.1%               57.7%

Page 18 - Cell Broadband Engine-based CE Products
Sony Playstation 3 and PS3. Toshiba Regza Cell.

Page 19 - IBM and its Partners are Active Users of Cell Technology
Three generations of server blades, accompanied by 3 SDK releases: IBM QS20, QS21, QS22. Two generations of PCIe Cell accelerator boards: CAB (Mercury) and PXCAB (Mercury/Fixstars/Matrix Vision). 1U form factor: Mercury Computer, TPlatforms. Custom boards: Hitachi Medical (ultrasound), other medical and defense. World's first 1 PFlop computer: LANL Roadrunner. Top 7 green systems on the Green 500 list.

System diagrams:
Playstation 3 high-level organization and PS3 cluster: Cell Broadband Engine with XDR memory, an IO bridge, 1Gb Ethernet, and the RSX GPU.
Roadrunner accelerated node and system: two IBM PowerXCell 8i processors (each with DDR2 memory) connected through PCIe and IO bridges to AMD Opteron processors (with DDR2 memory) and an InfiniBand (IB) adapter.
QPACE PowerXCell 8i node card and system: the processor node card combines an IBM PowerXCell 8i with DDR2 memory, a Xilinx Virtex 5 FPGA, and the Rialto component.
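The "fast local memory" and "globally coherent DMA" bullets on page 13 are what SPE programming revolves around in practice: the SPU computes only on data that has been staged into its 256 KB local store by the Memory Flow Controller. The SPU-side sketch below is not taken from the presentation; it assumes the IBM Cell SDK headers (spu_mfcio.h) and a PPE-side program that passes the effective address of a suitably aligned buffer in the argp parameter.

    /* Minimal SPU-side DMA sketch: pull a block from main memory into the
     * local store, process it, and push it back. */
    #include <spu_mfcio.h>

    #define N 2048                      /* 8 KB, below the 16 KB single-DMA limit */
    static volatile float buf[N] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        const unsigned int tag = 1;

        /* DMA: main memory -> local store (globally coherent, per page 13). */
        mfc_get(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();      /* block until the transfer completes */

        for (int i = 0; i < N; i++)     /* compute entirely out of the local store */
            buf[i] *= 2.0f;

        /* DMA: local store -> main memory. */
        mfc_put(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        (void)speid; (void)envp;
        return 0;
    }

In real code the double-buffering idiom (overlapping the next mfc_get with computation on the current buffer) is what exploits the SPE's computation density; this sketch only shows the basic get/compute/put cycle.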
Green 500 (Green500.org).

Page 24 - Performance and Productivity Challenges require a Multi-Dimensional Approach
Highly scalable systems, highly productive systems, multi-core and hybrid systems, POWER, and comprehensive (holistic) system innovation & optimization.

Page 25 - Next Era of Innovation: Hybrid Computing
Chart contrasting the symmetric multiprocessing era (driven by cores/threads, "technology out") with the hybrid computing era (driven by workload consolidation, "market in"), spanning today's p6/p7, Cell, and BlueGene lines toward pNext 1.0 and pNext 2.0, across traditional, throughput, and computational designs.

(Disclaimer repeated on several of the following slides: all statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only. Any reliance on these Statements of General Direction is at the relying party's sole risk and will not create liability or obligation for IBM.)

Page 26 - HPC Cluster Directions
Roadmap chart (performance versus year: 2007, 2008, 2009/2010, 2011, 2012, 2013, 2018-19): BG/P, Roadrunner, PERCS, and BG/Q lead toward ExaScale (ExaF); accelerators appear at every tier; capability machines (Power, AIX/Linux) and capacity clusters (Linux clusters; Power, x86-64; less demanding communication) move toward systems with extended configurability and targeted configurability, scaling up from PF-class systems.

Page 27 - Converging Software (Much Harder!)
Software-hardware efficiency is driven by: the number of operations (we all learn this in school), the degree of thread parallelism, the degree of data parallelism, the degree of locality (code and data, data more important), and the degree of predictability (code and data, data more important).
We need a new portable framework: allow a portable format to retain enough information for run-time optimization, and allow run-time optimization to heterogeneous hardware for each of the parameters above.

Page 28 - Two Standards for Programming the Node
Two standards are evolving from different sides of the market. OpenMP: CPUs, MIMD, scalar code bases, parallel for loops, shared memory model. OpenCL: GPUs, SIMD, scalar/vector code bases, data parallel, distributed shared memory model. Both must address concurrency and locality across CPU, SPE, and GPU.

Page 29 - Hybrid/Accelerator Node/Server Model (2 address spaces)
Host cores (base cores 1 ... N) sit in host memory (system memory); compute cores (user-mode cores, UMC 1 ... M) sit in a separate device global memory (device memory).

Page 30 - Heterogeneous Node/Server Model (single memory system, 1 address space)
The same host cores and compute cores, but host memory and device global memory are one system memory.

Page 31 - Cell Broadband Engine
Unified host and device memory; zero copies between them. The PPE is the host; SPE 1 ... N are the compute units (work-items 1 ... N) of the compute device, each with its local store serving as local/private memory; host memory and device global memory are the same system memory.

Page 32 - Finite
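The OpenMP/OpenCL contrast on page 28 is easiest to see on a trivial vector addition. The sketch below is illustrative only and not from the deck: the OpenMP side expresses the work as a parallel loop over one shared address space, while the equivalent OpenCL data-parallel kernel (shown in a comment, with the host-side buffer and queue setup omitted) is instanced once per work-item, matching the SPE/work-item mapping on page 31.

    /* Compile with a C compiler supporting OpenMP, e.g. cc -fopenmp vecadd.c;
     * without OpenMP support the pragma is simply ignored. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static float a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        /* OpenMP: shared memory model; the runtime splits the iteration
         * space of the parallel for loop across CPU threads. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        /* The OpenCL counterpart is a data-parallel kernel, launched with a
         * global work size of N; each work-item computes one element:
         *
         *   __kernel void add(__global const float *a,
         *                     __global const float *b,
         *                     __global float *c)
         *   {
         *       int i = get_global_id(0);
         *       c[i] = a[i] + b[i];
         *   }
         */
        printf("c[42] = %f\n", c[42]);
        return 0;
    }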
Recommended publications
  • SEVENTH FRAMEWORK PROGRAMME Research Infrastructures
    SEVENTH FRAMEWORK PROGRAMME Research Infrastructures INFRA-2007-2.2.2.1 - Preparatory phase for 'Computer and Data Treatment' research infrastructures in the 2006 ESFRI Roadmap. PRACE: Partnership for Advanced Computing in Europe. Grant Agreement Number: RI-211528. D8.3.2 Final technical report and architecture proposal, Final Version 1.0. Author(s): Ramnath Sai Sagar (BSC), Jesus Labarta (BSC), Aad van der Steen (NCF), Iris Christadler (LRZ), Herbert Huber (LRZ). Date: 25.06.2010.
    Project and Deliverable Information Sheet: Project Ref. №: RI-211528; Project Title: Final technical report and architecture proposal; Project Web Site: http://www.prace-project.eu; Deliverable ID: <D8.3.2>; Deliverable Nature: Report; Deliverable Level: PU*; Contractual Date of Delivery: 30/06/2010; Actual Date of Delivery: 30/06/2010; EC Project Officer: Bernhard Fabianek. (* Dissemination levels are indicated as follows: PU - Public; PP - Restricted to other participants, including the Commission Services; RE - Restricted to a group specified by the consortium, including the Commission Services; CO - Confidential, only for members of the consortium, including the Commission Services.)
    Document Control Sheet: Title: Final technical report and architecture proposal; Document ID: D8.3.2; Version: 1.0; Status: Final; Available at: http://www.prace-project.eu; Software Tool: Microsoft Word 2003; File(s): D8.3.2_addn_v0.3.doc
  • Recent Developments in Supercomputing
    John von Neumann Institute for Computing. Recent Developments in Supercomputing. Th. Lippert, published in NIC Symposium 2008, G. Münster, D. Wolf, M. Kremer (Editors), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 39, ISBN 978-3-9810843-5-1, pp. 1-8, 2008. © 2008 by John von Neumann Institute for Computing. Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above. http://www.fz-juelich.de/nic-series/volume39
    Recent Developments in Supercomputing. Thomas Lippert, Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany. E-mail: [email protected]
    Status and recent developments in the field of supercomputing on the European and German level as well as at the Forschungszentrum Jülich are presented. Encouraged by the ESFRI committee, the European PRACE Initiative is going to create a world-leading European tier-0 supercomputer infrastructure. In Germany, the BMBF formed the Gauss Centre for Supercomputing, the largest national association for supercomputing in Europe. The Gauss Centre is the German partner in PRACE. With its new Blue Gene/P system, the Jülich supercomputing centre has realized its vision of a dual system complex and is heading for Petaflop/s already in 2009. In the framework of the JuRoPA-Project, in cooperation with German and European industrial partners, the JSC will build a next generation general purpose system with very good price-performance ratio and energy efficiency.
  • Introduction to the Cell Multiprocessor
    Introduction to the Cell multiprocessor. J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, D. Shippy. This paper provides an introductory overview of the Cell multiprocessor. Cell represents a revolutionary extension of conventional microprocessor architecture and organization. The paper discusses the history of the project, the program objectives and challenges, the design concept, the architecture and programming models, and the implementation.
    Introduction: History of the project. Initial discussion on the collaborative effort to develop Cell began with support from CEOs from the Sony and IBM companies: Sony as a content provider and IBM as a leading-edge technology and server company. Collaboration was initiated among SCEI (Sony Computer Entertainment Incorporated), IBM, for microprocessor development, and Toshiba, as a development and high-volume manufacturing technology partner. This led to high-level architectural discussions among the three companies during the summer of 2000. During a critical meeting in Tokyo, it was determined that traditional architectural organizations would not deliver the computational power that SCEI sought … processors in order to provide the required computational density and power efficiency. After several months of architectural discussion and contract negotiations, the STI (SCEI–Toshiba–IBM) Design Center was formally opened in Austin, Texas, on March 9, 2001. The STI Design Center represented a joint investment in design of about $400,000,000. Separate joint collaborations were also set in place for process technology development. A number of key elements were employed to drive the success of the Cell multiprocessor design. First, a holistic design approach was used, encompassing processor architecture, hardware implementation, system structures, and software programming models.
  • The QPACE Supercomputer Applications of Random Matrix Theory in Two-Colour Quantum Chromodynamics Dissertation
    The QPACE Supercomputer. Applications of Random Matrix Theory in Two-Colour Quantum Chromodynamics. Dissertation for the doctoral degree in natural sciences (Dr. rer. nat.) at the Faculty of Physics of the University of Regensburg, submitted by Nils Meyer from Kempten im Allgäu, May 2013. The work was supervised by Prof. Dr. T. Wettig. The doctoral application was submitted on 07.05.2013; the doctoral colloquium took place on 14.06.2016. Examination committee: chair Prof. Dr. C. Strunk, first reviewer Prof. Dr. T. Wettig, second reviewer Prof. Dr. G. Bali, additional examiner Prof. Dr. F. Evers. For Denise.
    Contents: List of Figures; List of Tables; List of Acronyms; Outline. Part I, The QPACE Supercomputer: 1 Introduction (1.1 Supercomputers; 1.2 Contributions to QPACE); 2 Design Overview (2.1 Architecture; 2.2 Node-card; 2.3 Cooling; 2.4 Communication networks: 2.4.1 Torus network - characteristics, communication concept, partitioning; 2.4.2 Ethernet network; 2.4.3 Global signals network; 2.5 System setup: 2.5.1 Front-end system; 2.5.2 Ethernet networks; 2.6 Other system components: 2.6.1 Root-card; 2.6.2 Superroot-card); 3 The IBM Cell Broadband Engine (3.1 The Cell Broadband Engine and the supercomputer league; 3.2 PowerXCell 8i overview; 3.3 Lattice QCD on the Cell BE: 3.3.1 Performance model ...)
  • Measuring Power Consumption on IBM Blue Gene/P
    Comput Sci Res Dev, DOI 10.1007/s00450-011-0192-y. Special Issue Paper: Measuring power consumption on IBM Blue Gene/P. Michael Hennecke, Wolfgang Frings, Willi Homberg, Anke Zitz, Michael Knobloch, Hans Böttiger. © The Author(s) 2011. This article is published with open access at Springerlink.com.
    Abstract: Energy efficiency is a key design principle of the IBM Blue Gene series of supercomputers, and Blue Gene systems have consistently gained top GFlops/Watt rankings on the Green500 list. The Blue Gene hardware and management software provide built-in features to monitor power consumption at all levels of the machine's power distribution network. This paper presents the Blue Gene/P power measurement infrastructure and discusses the operational aspects of using this infrastructure on Petascale machines. We also describe the integration of Blue Gene power monitoring capabilities into system-level tools like LLview, and highlight some results of analyzing the production workload at Research Center Jülich (FZJ).
    Top10 supercomputers on the November 2010 Top500 list [1] alone (which coincidentally are also the 10 systems with an Rpeak of at least one PFlops) are consuming a total power of 33.4 MW [2]. These levels of power consumption are already a concern for today's Petascale supercomputers (with operational expenses becoming comparable to the capital expenses for procuring the machine), and addressing the energy challenge clearly is one of the key issues when approaching Exascale. While the Flops/Watt metric is useful, its emphasis on LINPACK performance and thus computational load neglects the fact that the energy costs of memory references at
  • QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine
    Novel Architectures: QPACE - Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine.
    Application-driven computers for Lattice Gauge Theory simulations have often been based on system-on-chip designs, but the development costs can be prohibitive for academic project budgets. An alternative approach uses compute nodes based on a commercial processor tightly coupled to a custom-designed network processor. Preliminary analysis shows that this solution offers good performance, but it also entails several challenges, including those arising from the processor's multicore structure and from implementing the network processor on a field-programmable gate array.
    Gottfried Goldrian, Thomas Huth, Benjamin Krill, Jack Lauritsen, and Heiko Schick (IBM Research and Development Lab, Böblingen, Germany); Ibrahim Ouda (IBM Systems and Technology Group, Rochester, Minnesota); Simon Heybrock, Dieter Hierl, Thilo Maurer, Nils Meyer, Andreas Schäfer, Stefan Solbrig, Thomas Streuer, and Tilo Wettig (University of Regensburg, Germany); Dirk Pleiter, Karl-Heinz Sulanke, and Frank Winter (Deutsches Elektronen Synchrotron, Zeuthen, Germany); Hubert Simma (University of Milano-Bicocca, Italy). 1521-9615/08/$25.00 © 2008 IEEE. Copublished by the IEEE CS and the AIP.
    Quantum chromodynamics (QCD) is a well-established theoretical framework to describe the properties and interactions of quarks and gluons, which are the building blocks of protons, neutrons, and other particles. In some physically interesting dynamical regions, researchers can study QCD perturbatively—that is, they can work out physical observables as a systematic expansion in the strong coupling constant. In other (often more interesting) regions, such an expansion isn't possible, so researchers must find other approaches. The most systematic and widely used nonperturbative approach is lattice QCD (LQCD), a discretized, computer-friendly version of the theory Kenneth Wilson proposed more than 30 years ago.
  • HPC Hardware & Software Development at IBM Research & Development Lab
    Heiko J Schick, IBM Deutschland R&D GmbH, August 2010. HPC HW & SW Development at IBM Research & Development Lab. © 2010 IBM Corporation.
    Agenda: Section 1: Hardware and Software Development Process; Section 2: Hardware and Firmware Development; Section 3: Operating System and Device Driver Development; Section 4: Performance Tuning; Section 5: Cluster Management; Section 6: Project Examples.
    IBM Deutschland Research & Development GmbH overview: one of IBM's largest Research & Development sites; founded 1953; ~2,000 employees; headquartered in Böblingen (with locations in Berlin, Mainz, Walldorf, and München); Managing Director: Dirk Wittkopp. Focus areas: skills in hardware, firmware, operating systems, software and services; more than 60 hardware and software projects; technology consulting; cooperation with research institutes and universities.
    World maps of IBM locations: Research (Zürich, Watson, Almaden, China, Tokio, Haifa, Austin, India); Hardware Development (Greenock, Rochester, Boulder, Böblingen, Toronto, Fujisawa, Endicott, Burlington, La Gaude, San Jose, East Fishkill, Poughkeepsie, Yasu, Tucson, Haifa, Yamato, Austin, Raleigh, Bangalore); Software Development (Krakau, Moskau, Vancouver, Dublin, Hursley, Minsk, Rochester, Böblingen, Beaverton, Toronto, Paris, Endicott, Santa Teresa, Foster City, Rom, Lenexa, Littleton, Beijing, Poughkeepsie, Haifa, Yamato, Austin, Raleigh, Costa Mesa, Kairo, Schanghai, Taipei, Pune, Bangalore, São Paolo, Golden Coast, Perth, Sydney).
  • Presentación De Powerpoint
    Towards Exaflop Supercomputers. Prof. Mateo Valero, Director of BSC, Barcelona. (Cartagena, Colombia, May 18-20.)
    National U. of Defense Technology (NUDT) Tianhe-1A. Hybrid architecture. Main node: two Intel Xeon X5670 2.93 GHz 6-core Westmere (12 MB cache) and one Nvidia Tesla M2050 448-ALU (16 SIMD units) 1150 MHz Fermi GPU, with 32 GB memory per node; plus 2048 Galaxy "FT-1000" 1 GHz 8-core processors. Number of nodes and cores: 7168 main nodes * (2 sockets * 6 CPU cores + 14 SIMD units) = 186368 cores (not including 16384 Galaxy cores). Peak performance (DP): 7168 nodes * (11.72 GFLOPS per core * 6 CPU cores * 2 sockets + 36.8 GFLOPS per SIMD unit * 14 SIMD units per GPU) = 4701.61 TFLOPS. Linpack performance: 2.507 PF (53% efficiency). Power consumption: 4.04 MWatt. Source: http://blog.zorinaq.com/?e=36
    Top10 (Rmax and Rpeak in GFlops):
    Rank  Site                                      Computer                                    Procs    Rmax     Rpeak
    1     Tianjin, China                            Xeon X5670 + NVIDIA                         186368   2566000  4701000
    2     Oak Ridge Nat. Lab.                       Cray XT5, 6 cores                           224162   1759000  2331000
    3     Shenzhen, China                           Xeon X5670 + NVIDIA                         120640   1271000  2984300
    4     GSIC Center, Tokyo                        Xeon X5670 + NVIDIA                          73278   1192000  2287630
    5     DOE/SC/LBNL/NERSC                         Cray XE6, 12 cores                          153408   1054000  1288630
    6     Commissariat a l'Energie Atomique (CEA)   Bull bullx super-node S6010/S6030           138368   1050000  1254550
    7     DOE/NNSA/LANL                             QS22/LS21 Cluster, PowerXCell 8i / Opteron, 122400   1042000  1375780
                                                    Infiniband
    8     National Institute for Computational      Cray XT5-HE, 6 cores                         98928    831700  1028850
          Sciences / University of Tennessee
    9     Forschungszentrum Juelich (FZJ)           Blue Gene/P Solution                        294912    825500   825500
    10    DOE/NNSA/LANL/SNL                         Cray XE6, 8-core                            107152    816600  1028660
    Looking at the Gordon Bell Prize: 1 GFlop/s, 1988, Cray Y-MP, 8 processors - static finite element analysis. 1 TFlop/s, 1998, Cray T3E, 1024 processors - modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method.
  • Multigrid Algorithms on QPACE
    FRIEDRICH-ALEXANDER-UNIVERSITÄT ERLANGEN-NÜRNBERG, INSTITUT FÜR INFORMATIK (MATHEMATISCHE MASCHINEN UND DATENVERARBEITUNG), Lehrstuhl für Informatik 10 (Systemsimulation). Multigrid algorithms on QPACE. Daniel Brinkers. Diploma Thesis. Advisor: Prof. Dr. U. Rüde. Supervisor: Dipl.-Inf. M. Stürmer. Working period: 01.09.2009 - 01.03.2010.
    Declaration: I certify that I produced this work without outside help and without using sources other than those indicated, that the work has not been submitted in the same or a similar form to any other examination authority and accepted by it as part of an examination, and that all passages taken over literally or in substance are marked as such. Erlangen, March 1, 2010.
    Abstract: QPACE is a cluster computer developed for lattice QCD calculations. The computing nodes consist of IBM PowerXCell 8i processors. The interconnect is a custom three-dimensional torus network. The application-dependent design goals were a latency smaller than 1 µs and a bandwidth of about 1 GiB/s. The cluster is very energy efficient and reached the 1st place on the Green500 list of November 2009. With this work a multigrid solver should be implemented and investigated on this hardware. It was concluded that it is possible to use this hardware for multigrid calculations. Another important result is the conclusion that the analysis of communication patterns is important to understand the performance of programs on this hardware. Methods and parameters for this analysis are introduced.
    Zusammenfassung (German abstract): QPACE is a cluster developed for lattice QCD calculations. The cluster nodes consist of IBM PowerXCell 8i processors.
  • QPACE 2 and Domain Decomposition on the Intel Xeon Phi
    QPACE 2 and Domain Decomposition on the Intel Xeon Phi. Paul Arts(a), Jacques Bloch(b), Peter Georg(b), Benjamin Glässle(b), Simon Heybrock(b), Yu Komatsubara(c), Robert Lohmayer(b), Simon Mages(b), Bernhard Mendl(b), Nils Meyer(b), Alessio Parcianello(a), Dirk Pleiter(b,d), Florian Rappl(b), Mauro Rossi(a), Stefan Solbrig(b), Giampietro Tecchiolli(a), Tilo Wettig(b) (speaker), Gianpaolo Zanier(a). (a) Eurotech HPC, Via F. Solari 3/A, 33020 Amaro, Italy; (b) Department of Physics, University of Regensburg, 93040 Regensburg, Germany; (c) Advanet Inc., 616-4 Tanaka, Kita-ku, Okayama 700-0951, Japan; (d) JSC, Jülich Research Centre, 52425 Jülich, Germany. E-mail: [email protected]. Work supported by DFG in the framework of SFB/TRR-55.
    We give an overview of QPACE 2, which is a custom-designed supercomputer based on Intel Xeon Phi processors, developed in a collaboration of Regensburg University and Eurotech. We give some general recommendations for how to write high-performance code for the Xeon Phi and then discuss our implementation of a domain-decomposition-based solver and present a number of benchmarks.
    arXiv:1502.04025v1 [cs.DC] 13 Feb 2015. The 32nd International Symposium on Lattice Field Theory, 23-28 June, 2014, Columbia University, New York, NY. © Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it/
    1. Introduction. After the invention of lattice QCD in 1974 [1] it quickly became clear that (i) it had great potential to yield physically relevant nonperturbative results and (ii) enormous computational resources would be required to obtain them.
  • PRACE-3IP D8.3.4
    SEVENTH FRAMEWORK PROGRAMME Research Infrastructures INFRA-2012-2.3.1 - Third Implementation Phase of the European High Performance Computing (HPC) service PRACE. PRACE-3IP: PRACE Third Implementation Phase Project. Grant Agreement Number: RI-312763. D8.3.4 Technical lessons learnt from the implementation of the joint PCP for PRACE-3IP, Final Version 1.5. Author(s): Stephen Booth, EPCC. Date: 10.01.2018.
    Project and Deliverable Information Sheet: Project Ref. №: RI-312763; Project Title: PRACE Third Implementation Phase Project; Project Web Site: http://www.prace-project.eu; Deliverable ID: D8.3.4; Deliverable Nature: Report; Deliverable Level: PU*; Contractual Date of Delivery: 31/12/2017; Actual Date of Delivery: 15/01/2018; EC Project Officer: Leonardo Flores Añover. (* Dissemination levels are indicated as follows: PU - Public; PP - Restricted to other participants, including the Commission Services; RE - Restricted to a group specified by the consortium, including the Commission Services; CO - Confidential, only for members of the consortium, including the Commission Services.)
    Document Control Sheet: Title: Technical lessons learnt from the implementation of the joint PCP for PRACE-3IP; Document ID: D8.3.4; Version: 1.5; Status: Final; Available at: http://www.prace-project.eu; Software Tool: Microsoft Word 2010; File(s): D8.3.4.docx. Written by: Stephen Booth, EPCC. Contributors: Fabio Affinito, CINECA; Eric Boyer, GENCI; Carlo Cavazzoni,
  • Keir, Paul G (2012) Design and Implementation of an Array Language for Computational Science on a Heterogeneous Multicore Architecture
    Keir, Paul G (2012) Design and implementation of an array language for computational science on a heterogeneous multicore architecture. PhD thesis. http://theses.gla.ac.uk/3645/
    Copyright and moral rights for this thesis are retained by the author. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the Author. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the Author. When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given. Glasgow Theses Service, http://theses.gla.ac.uk/, [email protected]
    Design and Implementation of an Array Language for Computational Science on a Heterogeneous Multicore Architecture. Paul Keir. Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, School of Computing Science, College of Science and Engineering, University of Glasgow, July 8, 2012. © Paul Keir, 2012.
    Abstract: The packing of multiple processor cores onto a single chip has become a mainstream solution to fundamental physical issues relating to the microscopic scales employed in the manufacture of semiconductor components. Multicore architectures provide lower clock speeds per core, while aggregate floating-point capability continues to increase. Heterogeneous multicore chips, such as the Cell Broadband Engine (CBE) and modern graphics chips, also address the related issue of an increasing mismatch between high processor speeds, and huge latency to main memory.