QPACE: Energy-Efficient High Performance Computing



Sebastian Rinke, Research Centre Jülich, Germany
Hans Böttiger, IBM Germany Research and Development
Benjamin Krill, IBM Germany Research and Development

Abstract: QPACE is a special purpose supercomputer for Lattice Quantum Chromodynamics (LQCD) calculations, designed with a focus on low power consumption. To open QPACE to a broader range of High Performance Computing (HPC) applications, the High Performance Linpack benchmark (HPL), whose calculations are common to a class of HPC applications, is chosen for porting to the QPACE system. In this respect, this paper discusses features and limitations of the QPACE architecture. It starts by analysing the requirements of HPL and, based on that, presents different approaches to meet these requirements. The paper describes the different stages of porting HPL to QPACE, ranging from the low-level communication interface to MPI collective operations. Finally, HPL benchmark results show to what extent HPC applications other than LQCD could benefit from QPACE.

1 Introduction

During the last decade, energy efficiency has gained increasing interest in High Performance Computing (HPC). Thus, besides computational performance, power efficiency is an additional factor by which parallel systems are evaluated. QPACE (QCD Parallel Computing on Cell Broadband Engine) is a parallel computer which was designed for Lattice Quantum Chromodynamics (LQCD) calculations with special attention to low power consumption. Despite the fact that QPACE is a special purpose supercomputer, its system architecture and performance suggest using it for a broader range of parallel high performance applications. In this respect, this paper discusses features and limitations of QPACE as an energy-efficient general purpose supercomputer. After an overview of the architecture, the high performance torus network and its corresponding API (Application Programming Interface) are discussed in Section 2. Then, in Section 3, the requirements of the High Performance Linpack (HPL) benchmark are identified, as they are common to a class of HPC applications. In Section 4, these requirements are related to the features of QPACE, and approaches for dealing with certain limitations are investigated. In this respect, the implementation of HPL on QPACE is described. Finally, benchmark results of HPL on QPACE are presented. Section 5 concludes.

2 QPACE Architecture

QPACE [4, 5, 6] is a distributed-memory parallel computer based on the IBM PowerXCell 8i [12] processor. This hybrid processor contains one PowerPC Processor Element (PPE) and 8 Synergistic Processor Elements (SPE). Each SPE has 256 KB of on-chip memory, called local store (LS), and runs one single thread. Memory transfers between main memory and local store are possible via DMA operations. The memory controller as well as an I/O interface (Rambus FlexIO) is integrated into the processor. Regarding double precision (DP) floating point performance, one SPE can execute up to 4 operations per cycle. Thus, clocked at 3.2 GHz, the 8 SPEs on a PowerXCell 8i processor yield a DP peak performance of 102.4 GFlops. Each compute node, called node card, comprises one PowerXCell 8i CPU, 4 GiB of DDR2 main memory, and a network processor (NWP) which is implemented on a Xilinx V5-LX110T FPGA (Field Programmable Gate Array). The PowerXCell 8i is directly connected to the NWP via its I/O interface. The network processor provides access to 3 different interconnects: Gigabit Ethernet, a signal tree network, and a 3D torus network. The power consumption of a single node card is about 130 W. One QPACE rack houses 256 node cards (35 kW) and yields 26 TFlops DP peak performance. This leads to a power efficiency of about 1.5 W/GFlops (peak).
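As a quick cross-check of these figures (the arithmetic below is only what the quoted numbers imply, not an additional measurement):

  8 SPEs × 4 flops/cycle × 3.2 GHz = 102.4 GFlops DP peak per node card
  256 node cards × 102.4 GFlops ≈ 26.2 TFlops DP peak per rack
  256 node cards × 130 W ≈ 33 kW, in line with the quoted 35 kW per rack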
2.1 Torus Network

The 3D torus network is a high-speed point-to-point interconnect with 1 GB/s bandwidth (peak) per torus direction. The aggregate bandwidth of the 6 links of a node card is 6 × 1 GB/s (peak). The torus provides direct SPE-to-SPE communication between neighbouring node cards. That means, data from one SPE's local store can be transferred to the local store of another SPE on one of the 6 neighbours in the torus. However, a routing mechanism for communication between non-neighbours is not implemented. Since each node card contains 8 SPEs and there is only one torus link between 2 neighbouring node cards, each link is logically divided into 8 virtual channels. This provides 8 separate, reliable communication channels per link.

2.1.1 Torus Low-Level Communication

For a better understanding of the remaining part, this section briefly discusses the torus communication API and presents a ping-pong message transfer with the resulting benchmark results. As mentioned above, the torus network allows message transfers between SPEs of neighbouring node cards. That means, both the sender and the receiver are threads running on different SPEs. Basically, a thread on an SPE can access the torus network by means of the following operations:

tnw_put(link, channel, sbuf, offset, size, tag)
Initiates a non-blocking message transfer on the sender. That means, tnw_put() posts the send request to the network and returns immediately. The tag parameter identifies the request and is later used for testing it for completion. link and channel denote the torus link (0, ..., 5) and the virtual channel (0, ..., 7), respectively. The address of the send buffer of size bytes is provided in sbuf. The offset is also transmitted to the receiver and defines an offset in the receive buffer.

tnw_test_put(tag)
Is called by the sender in order to test a send request, identified by tag, for completion. It returns true if the send buffer can be modified.

tnw_notify_base(link, channel, nbuf)
Since receive operations are non-blocking as well, i.e. they return immediately, the receiver needs a means by which it can test for received messages. In this respect, nbuf provides a pointer to a so-called notification buffer. This buffer holds entries where each entry identifies a previously posted receive operation. Before a receive can be posted for link and channel, a notification buffer nbuf must be provided by a call to tnw_notify_base(link, channel, nbuf). See below for details about the use of the notification buffer.

tnw_credit_base(link, channel)
This call declares the whole local store of the SPE as receive buffer for messages over link and channel. However, the actual receive operation is called tnw_credit() and is described in the following.

tnw_credit(link, channel, rbuf, size, nindex)
As stated above, the receive operation is non-blocking. In particular, tnw_credit() posts a receive of size bytes for link and channel, where rbuf is the destination address in the local store. The input parameter nindex is the index of an entry in the notification buffer (see tnw_notify_base() above) which refers to this receive operation.

tnw_wait_recv(nbuf, nindex)
This call blocks until the receive operation identified by the notification buffer nbuf and the index nindex (in the notification buffer) is complete.

Note that a receive operation (tnw_credit()) must be posted before the corresponding send operation (tnw_put()). That means, a message transfer is a credit-based mechanism: (i) the receiver issues a credit for the message which it is to receive; (ii) the sender sends the message; (iii) the incoming message fills the credit. Listings 1 and 2 show the steps for a ping-pong message transfer between two nodes A and B.

  /* Initialization */
  tnw_credit_base(link_to_B, channel);
  tnw_notify_base(link_to_B, channel, nbuf);

  /* Give credit for response from B */
  tnw_credit(link_to_B, channel, buf, size, nindex);

  /* Send to B */
  tnw_put(link_to_B, channel, buf, 0, size, tag);

  /* Wait for response from B */
  tnw_wait_recv(nbuf, nindex);

Listing 1: Ping-pong message transfer: Node A

  /* Initialization */
  tnw_credit_base(link_to_A, channel);
  tnw_notify_base(link_to_A, channel, nbuf);

  /* Give credit for incoming message from A */
  tnw_credit(link_to_A, channel, buf, size, nindex);

  /* Wait for incoming message from A */
  tnw_wait_recv(nbuf, nindex);

  /* Send back to A */
  tnw_put(link_to_A, channel, buf, 0, size, tag);

Listing 2: Ping-pong message transfer: Node B
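Listings 1 and 2 do not show the sender-side completion test. The following minimal sketch illustrates how tnw_test_put() could be combined with tnw_put() to overlap communication and computation; the header name, the wrapper function, and the parameter types are assumptions for illustration and are not taken from the QPACE sources.

  #include "tnw.h"   /* assumed header declaring the tnw_* torus API described above */

  /* Start a non-blocking torus send, do unrelated work, then poll for
   * completion before the send buffer is reused.  Buffer address and size
   * must obey the 128-byte alignment and size restrictions noted below. */
  static void send_with_overlap(int link, int channel, void *buf,
                                unsigned size, unsigned tag)
  {
      tnw_put(link, channel, buf, 0, size, tag);   /* returns immediately */

      /* ... independent computation here hides the transfer latency ... */

      while (!tnw_test_put(tag)) {
          /* busy-wait: tnw_test_put() returns true once the send
           * buffer may safely be modified again */
      }
  }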
Figure 1(a) depicts the corresponding bandwidth benchmark results between two node cards. The timings between sending and receiving a message are shown in Fig. 1(b). Here, a 128 B message has a latency of 3 µsec. Note that the size of a message transmitted via tnw_put() must be a multiple of 128 B. Furthermore, the send and the receive buffer must be aligned to 128 B boundaries. The effect of these restrictions is further discussed in Section 4.

[Figure 1: Message transfer from local store to local store over the torus network: (a) bandwidth (MiB/s) and (b) latency (µsec) as functions of message size (Byte).]

3 HPL Requirements

The Linpack benchmarks [1] represent a collection of codes for numerical algebra which have become an industry standard for measuring the performance of computer systems. HPL [1, 2] is a portable and scalable version of Linpack for distributed-memory computers that solves a random dense linear system of equations by Gaussian elimination with partial pivoting.

The corresponding matrix (N × N) is distributed onto a two-dimensional grid of (P × Q) processes in a block-cyclic fashion to ensure good load balancing. With the right-looking algorithm [1], the LU factorization proceeds with the decomposition of one panel at a time, as shown in Fig.

... routing of MPI messages as HPL offers a ring algorithm with next neighbour pattern.
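To make the block-cyclic layout concrete, the sketch below computes which coordinates of the P × Q process grid own a given global matrix entry for block size NB. It is purely illustrative and not part of HPL; the parameter names follow common HPL usage, and the values in main() are arbitrary.

  #include <stdio.h>

  /* Process-grid coordinates owning global entry (i, j) under a 2-D
   * block-cyclic distribution with block size nb on a p x q grid. */
  static void owner(long i, long j, int nb, int p, int q,
                    int *prow, int *pcol)
  {
      *prow = (int)((i / nb) % p);   /* process row owning global row i    */
      *pcol = (int)((j / nb) % q);   /* process column owning global col j */
  }

  int main(void)
  {
      int prow, pcol;
      owner(5000, 300, 128, 4, 4, &prow, &pcol);   /* e.g. NB = 128, 4 x 4 grid */
      printf("entry (5000, 300) is owned by process (%d, %d)\n", prow, pcol);
      return 0;
  }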
Recommended publications
  • Examining the Viability of FPGA Supercomputing
Stephen D. Craven and Peter Athanas, Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061 USA. Email: {scraven,athanas}@vt.edu

Abstract—For certain applications, custom computational hardware created using field programmable gate arrays (FPGAs) produces significant performance improvements over processors, leading some in academia and industry to call for the inclusion of FPGAs in supercomputing clusters. This paper presents a comparative analysis of FPGAs and traditional processors, focusing on floating-point performance and procurement costs, revealing economic hurdles in the adoption of FPGAs for general High-Performance Computing (HPC).

Index Terms—computational accelerator, digital arithmetic, field programmable gate arrays, high-performance computing, supercomputers.

I. Introduction. Supercomputers have experienced a recent resurgence, fueled by government research dollars and the development of low-cost supercomputing clusters. Unlike the Massively Parallel Processor (MPP) designs found in Cray and CDC machines of the 70s and 80s, featuring proprietary processor architectures, many modern supercomputing clusters are constructed from commodity PC processors, significantly reducing procurement costs. In an effort to improve performance, several companies offer machines that place one or more FPGAs in each node of the cluster. Configurable logic devices, of which FPGAs are one example, permit the device's hardware to be programmed multiple times after manufacture. A wide body of research over two decades has repeatedly demonstrated significant performance improvements for certain classes of applications when implemented within an FPGA's configurable logic [1]. Applications well suited to speed-up by FPGAs typically exhibit massive parallelism and small integer or fixed-point data types.
  • Overview of the SPEC Benchmarks
Chapter 9: Overview of the SPEC Benchmarks. Kaivalya M. Dixit, IBM Corporation.

"The reputation of current benchmarketing claims regarding system performance is on par with the promises made by politicians during elections."

Standard Performance Evaluation Corporation (SPEC) was founded in October 1988 by Apollo, Hewlett-Packard, MIPS Computer Systems and SUN Microsystems in cooperation with E. E. Times. SPEC is a nonprofit consortium of 22 major computer vendors whose common goals are "to provide the industry with a realistic yardstick to measure the performance of advanced computer systems" and to educate consumers about the performance of vendors' products. SPEC creates, maintains, distributes, and endorses a standardized set of application-oriented programs to be used as benchmarks.

9.1 Historical Perspective. Traditional benchmarks have failed to characterize the system performance of modern computer systems. Some of those benchmarks measure component-level performance, and some of the measurements are routinely published as system performance. Historically, vendors have characterized the performances of their systems in a variety of confusing metrics. In part, the confusion is due to a lack of credible performance information, agreement, and leadership among competing vendors. Many vendors characterize system performance in millions of instructions per second (MIPS) and millions of floating-point operations per second (MFLOPS). All instructions, however, are not equal. Since CISC machine instructions usually accomplish a lot more than those of RISC machines, comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek.

9.1.1 Simple CPU Benchmarks. Truth in benchmarking is an oxymoron because vendors use benchmarks for marketing purposes.
  • SEVENTH FRAMEWORK PROGRAMME Research Infrastructures
SEVENTH FRAMEWORK PROGRAMME, Research Infrastructures. INFRA-2007-2.2.2.1 - Preparatory phase for 'Computer and Data Treatment' research infrastructures in the 2006 ESFRI Roadmap. PRACE: Partnership for Advanced Computing in Europe. Grant Agreement Number: RI-211528.

D8.3.2 Final technical report and architecture proposal. Final Version: 1.0. Authors: Ramnath Sai Sagar (BSC), Jesus Labarta (BSC), Aad van der Steen (NCF), Iris Christadler (LRZ), Herbert Huber (LRZ). Date: 25.06.2010.

Project and Deliverable Information Sheet: Project Ref. No.: RI-211528; Project Web Site: http://www.prace-project.eu; Deliverable ID: D8.3.2; Deliverable Nature: Report; Dissemination Level: PU (Public); Contractual Date of Delivery: 30/06/2010; Actual Date of Delivery: 30/06/2010; EC Project Officer: Bernhard Fabianek.

Document Control Sheet: Title: Final technical report and architecture proposal; Document ID: D8.3.2; Version: 1.0; Status: Final; Available at: http://www.prace-project.eu; Software Tool: Microsoft Word 2003; File(s): D8.3.2_addn_v0.3.doc
  • Recent Developments in Supercomputing
John von Neumann Institute for Computing. Recent Developments in Supercomputing. Th. Lippert. Published in NIC Symposium 2008, G. Münster, D. Wolf, M. Kremer (Editors), John von Neumann Institute for Computing, Jülich, NIC Series, Vol. 39, ISBN 978-3-9810843-5-1, pp. 1-8, 2008. © 2008 by John von Neumann Institute for Computing. Permission to make digital or hard copies of portions of this work for personal or classroom use is granted provided that the copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise requires prior specific permission by the publisher mentioned above. http://www.fz-juelich.de/nic-series/volume39

Thomas Lippert, Jülich Supercomputing Centre, Forschungszentrum Jülich, 52425 Jülich, Germany. E-mail: [email protected]

Status and recent developments in the field of supercomputing on the European and German level as well as at the Forschungszentrum Jülich are presented. Encouraged by the ESFRI committee, the European PRACE Initiative is going to create a world-leading European tier-0 supercomputer infrastructure. In Germany, the BMBF formed the Gauss Centre for Supercomputing, the largest national association for supercomputing in Europe. The Gauss Centre is the German partner in PRACE. With its new Blue Gene/P system, the Jülich supercomputing centre has realized its vision of a dual system complex and is heading for Petaflop/s already in 2009. In the framework of the JuRoPA project, in cooperation with German and European industrial partners, the JSC will build a next-generation general purpose system with very good price-performance ratio and energy efficiency.
  • Performance of a Computer (Chapter 4), Vishwani D. Agrawal & Victor P. Nelson
ELEC 5200-001/6200-001 Computer Architecture and Design, Fall 2013. Performance of a Computer (Chapter 4). Vishwani D. Agrawal & Victor P. Nelson, Department of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849.

What is performance? Response time: the time between the start and completion of a task. Throughput: the total amount of work done in a given time. Some performance measures: MIPS (million instructions per second); MFLOPS (million floating point operations per second), also GFLOPS, TFLOPS (10^12), etc.; SPEC (System Performance Evaluation Corporation) benchmarks; LINPACK benchmarks, floating point computing, used for supercomputers; synthetic benchmarks.

Small and large numbers: milli (10^-3), micro (10^-6), nano (10^-9), pico (10^-12), femto (10^-15), atto (10^-18), zepto (10^-21), yocto (10^-24); kilo (10^3), mega (10^6), giga (10^9), tera (10^12), peta (10^15), exa (10^18), zetta (10^21), yotta (10^24).

Computer memory size: 2^10 = 1,024 (K, Kb, KB); 2^20 = 1,048,576 (M, Mb, MB); 2^30 = 1,073,741,824 (G, Gb, GB); 2^40 = 1,099,511,627,776 (T, Tb, TB).

Units for measuring performance: time in seconds (s), microseconds (µs), nanoseconds (ns), or picoseconds (ps). Clock cycle: the period of the hardware clock; for example, one clock cycle means 1 nanosecond for a 1 GHz clock frequency (1 GHz clock rate). CPU time = (CPU clock cycles)/(clock rate). Cycles per instruction (CPI): the average number of clock cycles used to execute a computer instruction.
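As a quick worked example of the CPU-time formula above (the numbers are invented for illustration): a program that executes 2 × 10^9 instructions with an average CPI of 1.5 on a 3 GHz clock needs 2 × 10^9 × 1.5 = 3 × 10^9 clock cycles, so CPU time = (3 × 10^9 cycles) / (3 × 10^9 cycles/s) = 1 s.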
  • The QPACE Supercomputer / Applications of Random Matrix Theory in Two-Colour Quantum Chromodynamics (Dissertation)
The QPACE Supercomputer / Applications of Random Matrix Theory in Two-Colour Quantum Chromodynamics. Dissertation submitted for the degree of Doctor of Natural Sciences (Dr. rer. nat.) to the Faculty of Physics of the University of Regensburg, presented by Nils Meyer from Kempten im Allgäu, May 2013. The work was supervised by Prof. Dr. T. Wettig. Doctoral application submitted: 07.05.2013. Date of the doctoral colloquium: 14.06.2016. Examination committee: chair: Prof. Dr. C. Strunk; first referee: Prof. Dr. T. Wettig; second referee: Prof. Dr. G. Bali; further examiner: Prof. Dr. F. Evers. For Denise.

Contents (excerpt): List of Figures; List of Tables; List of Acronyms; Outline. Part I: The QPACE Supercomputer. 1 Introduction: 1.1 Supercomputers; 1.2 Contributions to QPACE. 2 Design Overview: 2.1 Architecture; 2.2 Node-card; 2.3 Cooling; 2.4 Communication networks (2.4.1 Torus network: characteristics, communication concept, partitioning; 2.4.2 Ethernet network; 2.4.3 Global signals network); 2.5 System setup (2.5.1 Front-end system; 2.5.2 Ethernet networks); 2.6 Other system components (2.6.1 Root-card; 2.6.2 Superroot-card). 3 The IBM Cell Broadband Engine: 3.1 The Cell Broadband Engine and the supercomputer league; 3.2 PowerXCell 8i overview; 3.3 Lattice QCD on the Cell BE; 3.3.1 Performance model ...
  • Measuring Power Consumption on IBM Blue Gene/P
Comput Sci Res Dev, DOI 10.1007/s00450-011-0192-y. Special Issue Paper: Measuring power consumption on IBM Blue Gene/P. Michael Hennecke, Wolfgang Frings, Willi Homberg, Anke Zitz, Michael Knobloch, Hans Böttiger. © The Author(s) 2011. This article is published with open access at Springerlink.com.

Abstract: Energy efficiency is a key design principle of the IBM Blue Gene series of supercomputers, and Blue Gene systems have consistently gained top GFlops/Watt rankings on the Green500 list. The Blue Gene hardware and management software provide built-in features to monitor power consumption at all levels of the machine's power distribution network. This paper presents the Blue Gene/P power measurement infrastructure and discusses the operational aspects of using this infrastructure on Petascale machines. We also describe the integration of Blue Gene power monitoring capabilities into system-level tools like LLview, and highlight some results of analyzing the production workload at Research Center Jülich (FZJ).

The Top10 supercomputers on the November 2010 Top500 list [1] alone (which coincidentally are also the 10 systems with an Rpeak of at least one PFlops) are consuming a total power of 33.4 MW [2]. These levels of power consumption are already a concern for today's Petascale supercomputers (with operational expenses becoming comparable to the capital expenses for procuring the machine), and addressing the energy challenge clearly is one of the key issues when approaching Exascale. While the Flops/Watt metric is useful, its emphasis on LINPACK performance and thus computational load neglects the fact that the energy costs of memory references
  • QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine
Novel Architectures. QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine.

Application-driven computers for Lattice Gauge Theory simulations have often been based on system-on-chip designs, but the development costs can be prohibitive for academic project budgets. An alternative approach uses compute nodes based on a commercial processor tightly coupled to a custom-designed network processor. Preliminary analysis shows that this solution offers good performance, but it also entails several challenges, including those arising from the processor's multicore structure and from implementing the network processor on a field-programmable gate array.

Gottfried Goldrian, Thomas Huth, Benjamin Krill, Jack Lauritsen, and Heiko Schick (IBM Research and Development Lab, Böblingen, Germany); Ibrahim Ouda (IBM Systems and Technology Group, Rochester, Minnesota); Simon Heybrock, Dieter Hierl, Thilo Maurer, Nils Meyer, Andreas Schäfer, Stefan Solbrig, Thomas Streuer, and Tilo Wettig (University of Regensburg, Germany); Dirk Pleiter, Karl-Heinz Sulanke, and Frank Winter (Deutsches Elektronen Synchrotron, Zeuthen, Germany); Hubert Simma (University of Milano-Bicocca, Italy).

Quantum chromodynamics (QCD) is a well-established theoretical framework to describe the properties and interactions of quarks and gluons, which are the building blocks of protons, neutrons, and other particles. In some physically interesting dynamical regions, researchers can study QCD perturbatively—that is, they can work out physical observables as a systematic expansion in the strong coupling constant. In other (often more interesting) regions, such an expansion isn't possible, so researchers must find other approaches. The most systematic and widely used nonperturbative approach is lattice QCD (LQCD), a discretized, computer-friendly version of the theory Kenneth Wilson proposed more than 30 years ago.
  • Investigations of Various HPC Benchmarks to Determine Supercomputer Performance Efficiency and Balance
Investigations of Various HPC Benchmarks to Determine Supercomputer Performance Efficiency and Balance. Wilson Lisan, August 24, 2018. MSc in High Performance Computing, The University of Edinburgh. Year of Presentation: 2018.

Abstract: This dissertation project is based on participation in the Student Cluster Competition (SCC) at the International Supercomputing Conference (ISC) 2018 in Frankfurt, Germany, as part of a four-member Team EPCC from The University of Edinburgh. There are two main projects: the team-based project and a personal project. The team-based project focuses on the optimisations and tweaks of the HPL, HPCG, and HPCC benchmarks to meet the competition requirements. At the competition, Team EPCC suffered hardware issues that shaped the cluster into an asymmetrical system with mixed hardware. Unthinkable and extreme methods were carried out to tune the performance and successfully drove the cluster back to its ideal performance. The personal project focuses on testing the SCC benchmarks to evaluate the performance efficiency and system balance of several HPC systems. The ratio of the HPCG fraction of peak over HPL was used to determine system performance efficiency from peak and actual performance. It was analysed through the HPCC benchmark that the fraction-of-peak ratio could indicate the memory and network balance relative to the processor or GPU raw performance, as well as the possibility of a memory or network bottleneck.
  • Presentation: Computer Architecture in the Past and Next Quarter Century
Computer architecture in the past and next quarter century. CNNA 2010, Berkeley, Feb 3, 2010. H. Peter Hofstee, Cell/B.E. Chief Scientist, IBM Systems and Technology Group.

CMOS microprocessor trends, the first ~25 years ("good old days"): single-thread performance grew through more active transistors and higher frequency, up to about 2005. SPECINT performance history (from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006): roughly 25%/year for the VAX from 1978 to 1986, and 52%/year for RISC and x86 from 1986 to 2002. CMOS devices then hit a scaling wall: active and passive power density approach the air-cooling limit as gate lengths shrink (Isaac et al., IBM). Since about 2005 the trend is multi-core: more active transistors, with performance growth coming from additional cores rather than higher frequency. Multicore POWER server processors: POWER4 (2001) introduced the dual core; POWER5 (2004) provided two cores with 4 threads; POWER7 (2009) provides 8 cores with 32 threads. Why are (shared memory) CMPs dominant? A new system delivers nearly twice the throughput performance of the previous one without application-level changes, and applications do not degrade in performance when ported to a next-generation processor.
  • HPC Hardware & Software Development at IBM Research & Development Lab
Heiko J Schick, IBM Deutschland R&D GmbH, August 2010. HPC Hardware & Software Development at the IBM Research & Development Lab.

Agenda: Section 1: Hardware and Software Development Process; Section 2: Hardware and Firmware Development; Section 3: Operating System and Device Driver Development; Section 4: Performance Tuning; Section 5: Cluster Management; Section 6: Project Examples.

IBM Deutschland Research & Development GmbH overview: one of IBM's largest research and development sites; founded in 1953; roughly 2,000 employees; headquartered in Böblingen, with further German locations including Berlin, Mainz, Walldorf, and München; managing director: Dirk Wittkopp. Focus areas: skills in hardware, firmware, operating systems, software, and services; more than 60 hardware and software projects; technology consulting; cooperation with research institutes and universities. The remaining slides list IBM's worldwide research, hardware development, and software development locations.
  • PowerPoint Presentation: Towards Exaflop Supercomputers
Towards Exaflop Supercomputers. Prof. Mateo Valero, Director of BSC, Barcelona.

National University of Defense Technology (NUDT) Tianhe-1A, a hybrid architecture. Main node: two Intel Xeon X5670 2.93 GHz 6-core Westmere (12 MB cache), one NVIDIA Tesla M2050 448-ALU (16 SIMD units) 1150 MHz Fermi GPU, and 32 GB of memory per node; in addition, 2048 Galaxy "FT-1000" 1 GHz 8-core processors. Number of nodes and cores: 7168 main nodes × (2 sockets × 6 CPU cores + 14 SIMD units) = 186,368 cores (not including the 16,384 Galaxy cores). Peak performance (DP): 7168 nodes × (11.72 GFLOPS per core × 6 CPU cores × 2 sockets + 36.8 GFLOPS per SIMD unit × 14 SIMD units per GPU) = 4701.61 TFLOPS. Linpack performance: 2.507 PF (53% efficiency). Power consumption: 4.04 MW. Source: http://blog.zorinaq.com/?e=36

Top10 (rank; site; computer; processors; Rmax and Rpeak in GFlops):
1. Tianjin, China; Xeon X5670 + NVIDIA; 186368; 2566000; 4701000
2. Oak Ridge Nat. Lab.; Cray XT5, 6 cores; 224162; 1759000; 2331000
3. Shenzhen, China; Xeon X5670 + NVIDIA; 120640; 1271000; 2984300
4. GSIC Center, Tokyo; Xeon X5670 + NVIDIA; 73278; 1192000; 2287630
5. DOE/SC/LBNL/NERSC; Cray XE6, 12 cores; 153408; 1054000; 1288630
6. Commissariat a l'Energie Atomique (CEA); Bull bullx super-node S6010/S6030; 138368; 1050000; 1254550
7. DOE/NNSA/LANL; QS22/LS21 Cluster, PowerXCell 8i / Opteron, Infiniband; 122400; 1042000; 1375780
8. National Institute for Computational Sciences/University of Tennessee; Cray XT5-HE, 6 cores; 98928; 831700; 1028850
9. Forschungszentrum Juelich (FZJ); Blue Gene/P Solution; 294912; 825500; 825500
10. DOE/NNSA/LANL/SNL; Cray XE6, 8-core; 107152; 816600; 1028660

Looking at the Gordon Bell Prize: 1 GFlop/s, 1988, Cray Y-MP, 8 processors (static finite element analysis); 1 TFlop/s, 1998, Cray T3E, 1024 processors (modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method) ...