Using the TOP500 to Trace and Project Technology and Architecture Trends

Peter M. Kogge, Univ. of Notre Dame, 384 Fitzpatrick Hall, Notre Dame, IN 46556, [email protected]
Timothy J. Dysart, Univ. of Notre Dame, 384 Fitzpatrick Hall, Notre Dame, IN 46556, [email protected]

ABSTRACT

The TOP500 is a treasure trove of information on the leading edge of high performance computing. It was used in the 2008 DARPA Exascale technology report to isolate the effects of architecture and technology on high performance computing, and to lay the groundwork for projecting how current systems might mature through the coming years. Two particular classes of architectures were identified: "heavyweight" (based on high end commodity microprocessors) and "lightweight" (primarily BlueGene variants), and projections were made on performance, concurrency, memory capacity, and power. This paper updates those projections, and adds a third class of "heterogeneous" architectures (leveraging the emerging class of GPU-like chips) to the mix.

Categories and Subject Descriptors
C.0 [General]: System Architectures; C.4 [Performance of Systems]: Design studies, Performance attributes

General Terms
Design, Performance

Keywords
TOP500, Exascale, heterogeneous systems, system projections

1. INTRODUCTION

With 18 years of semi-annual reporting, the TOP500 list has become perhaps the longest-lived organized source of data on computer architectures and systems, at least as they apply to a specific benchmark. As such, mining these records can provide excellent insight into trends in architecture and technology, especially at the high end.

This paper performs such an analysis. It is certainly not the first to do so; indeed the TOP500 website itself routinely produces graphs of top level characteristics such as performance, architecture style, manufacturer, etc. Earlier papers such as [7] and [12] have used the data to analyze trends and make predictions about reaching various milestones such as a petaflop/s¹. More recently, a DARPA-funded working group in 2007 [13]² used the data to extrapolate whether or not reaching an exaflop/s was feasible in a short time frame. The answer from that report then heavily influenced followup efforts such as UHPC.

¹ We use "flop" for a single floating point operation, "flops" to refer to multiple floating point operations without a time reference, and "flop/s" to refer to some number of them executed per second.
² Hereafter referred to as the "Exascale report" or "Exascale study."
This paper updates the analysis from the Exascale report in both data covered and metrics analyzed. In addition, while the prior report identified two generic classes of systems, "heavyweight" and "lightweight," we introduce here a third class, "heterogeneous," to reflect new types of entrants that have appeared recently on the list. Further, we have done this analysis with an eye towards repeating it after each list announcement, and towards normalizing definitions and measurement characteristics to help guide new list entrants as to what parameters would be of most help in understanding the underlying trends.

The paper is organized as follows. Section 2 standardizes the key definitions we will use in the paper. Section 3 discusses the much more detailed database we have developed about members of the various TOP500 lists. Section 4 overviews the three classes of architectures we considered. Section 5 develops some key metrics from the historical data. Section 6 develops the technology roadmaps and trend assumptions we use. Section 7 uses all this material to analyze the observed trends and make projections as to the future. Section 8 concludes.

2. TERMINOLOGY

From its beginning, the TOP500 has focused on the performance of LINPACK on the leading parallel computers of the day. To allow for year-over-year comparison, the rankings have used a few simple terms to specify both the hardware architecture and the performance. While satisfactory for most of the early years, with the rise of SMPs, multi-cores, and now heterogeneous architectures, making "apples-to-apples" comparisons has become more difficult, especially when detailed analysis and projections are desired, as was done for the Exascale report. In the following subsections we review the terminology used in this paper, and suggest that the use of some variants in future TOP500 reports may simplify future analysis.

2.1 Basic System Terms

Until the November 2008 list, the key reported parameter used to describe the architecture of a system was "Processor Count." As discussed above, this became quite fuzzy with the rise of multi-core, and was replaced with "core count." Even that, however, is still inadequate for the kinds of projections done here. Consequently, the following is the list of terms used in this paper:

• Core: for a CPU, a set of logic that is capable of independent execution of a program thread. For GPUs, the terms SIMD core and Streaming Multiprocessor (SM) have been used by AMD and NVIDIA respectively. Each such unit has a number (typically 16-48) of replicated simple processors (unified shaders) which contain the ALUs and increasingly FPUs.

• Socket: a chip that represents the main site where conventional computing is performed. This term reflects that when one looks at a modern HPC system, the biggest area on a board is devoted to the heatsink and socket that sandwich the microprocessor chip. In earlier rankings, a socket was primarily a single-core microprocessor chip, and thus counting sockets was equivalent to counting "processors." Today, a socket typically supports a large number of cores.
• Node: the set of sockets, associated memory, and NICs that represent the basic replicable unit of the system, as seen by application software. Several years ago the term "node" came into vogue, and was used interchangeably with "processor" and "socket." With the rise first of chip sets that couple multiple identical microprocessor chips into a single unified SMP, and then the rise of multi-socket machines with mixes of heterogeneous microprocessors (as in Roadrunner), it became important to make this distinction.

• Core Clock: the clock rate of a core. In systems with multiple types of cores, we will use here the one that most contributes to overall computation.

• Memory Capacity: the aggregate physical memory (typically DRAM) that is directly addressable by programs running in cores within sockets, within nodes, not including caches.

• Power: total power consumed by the system. This metric is today often inconsistent across system descriptions, depending on whether or not file systems (disk drives) are included, and/or whether the power required for cooling and power conditioning is counted. In the future it may be more consistent to quote perhaps "CEC power" as the power actually drawn by the computing electronics, "system power" to include secondary storage and local power conditioning, and "facility power" to include cooling, power conversion, and redundancy effects.

In the days of single microprocessor-type systems, aggregate numbers for these parameters were adequate. However, with heterogeneous systems, there has to be separate accounting for each type of processor in the system. This is especially true for memory, where different "sockets" with different processor types have different memory interfaces.

2.2 Metrics

A second set of terms is important from the standpoint of comparing systems. The most obvious ones are Rpeak and Rmax, which drive the TOP500 rankings [8]. Rpeak is the theoretical peak performance, computed in Gflop/s, as the number of FPUs times the peak flop rate of each FPU. Rmax is the performance, in Gflop/s, for the largest problem run. The Exascale report added the following metrics:

• Thread Level Parallelism (TLP): the number of distinct hardware-supported concurrent threads that makes up the execution of a program. None of the top systems to date have been explicitly multi-threaded, although newer chips do support at least a low level of "Hyper Threading." Consequently, for this study each core as reported in the TOP500 list is assumed to correspond to a single thread of execution.

• Thread Level Concurrency (TLC): an attempt to measure the number of separate operations of interest that can be executed per cycle. For the TOP500 the operation of interest has been floating point. In homogeneous systems, it is computed as the performance metric (Rmax or Rpeak) divided by the product of the number of "threads" (i.e. cores) and the clock rate. TLC is meant to be similar to the Instruction Level Parallelism (ILP) term used in computer architecture to measure the number of instructions from a single thread that can either be issued per cycle within a microprocessor core (akin to a "peak" measurement), or the number of instructions that are actually completed and retired per second (akin to the sustained or "max" numbers of the current discussion).

• Total Concurrency (TC): the total number of separate operations of interest that can be computed in a system at each clock cycle. For the TOP500 such a measure reflects (within a factor of 2 to account for fused multiply-add) the total number of distinct hardware units capable of computing those operations. This metric can be computed as the number of cores times the peak TLC, and is important because it reflects the explicit amount of concurrency that must be expressed by, and extracted from, a program to utilize the hardware efficiently.

• Flop/s per watt (FPW): the performance of the system divided by system power.

• Energy per flop (EPF): the reciprocal of the above, in units of Joules (or more conveniently in most cases "picoJoules" (pJ), 10⁻¹² Joules).

We note that many of these metrics may actually be computable in two forms depending on whether Rmax or Rpeak is used. We also note that the original goal of the Exascale study was for an exaflop/s at 20MW, which is equivalent to an energy expended per flop of 20pJ.
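To make the relationships among these metrics concrete, the short Python sketch below computes TLC, TC, FPW, and EPF from the quantities discussed above. It is only an illustration of the definitions; the function names are ours, not part of the TOP500 reporting.

```python
def tlc(r_gflops, cores, clock_ghz):
    """Thread Level Concurrency: flops per cycle per core, from a
    performance metric (Rmax or Rpeak) in Gflop/s and the core clock."""
    return r_gflops / (cores * clock_ghz)

def total_concurrency(cores, peak_tlc):
    """Total Concurrency: flops that must be in flight every cycle."""
    return cores * peak_tlc

def flops_per_watt(r_gflops, power_mw):
    """FPW: flop/s delivered per watt of system power."""
    return (r_gflops * 1.0e9) / (power_mw * 1.0e6)

def energy_per_flop_pj(r_gflops, power_mw):
    """EPF in picojoules per flop: the reciprocal of FPW."""
    return 1.0e12 / flops_per_watt(r_gflops, power_mw)

# Sanity check against the Exascale goal quoted above:
# an exaflop/s (1e9 Gflop/s) at 20 MW is 20 pJ per flop.
assert abs(energy_per_flop_pj(1.0e9, 20.0) - 20.0) < 1e-6
```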
3. THE TOP500 DATABASE

We started our construction of a database from the spreadsheets available from the web. Depending on the year, the base information varied, but always included the performance metrics, the processor/core count, and a text string with some more detail. Often missing was not only enough detail to describe the system at the socket/board/rack level but also other key parameters such as power, memory, and disk space. Also missing was quantifiable insight into evolutions of the same system over time (nodes missing during test runs, the addition or upgrade of nodes, etc.).

Consequently, we spent considerable time researching each and every system ever listed in one of the top 10 positions on the list (to do all 500 was beyond our resources). Whenever possible, we added missing data and resolved discrepancies. We separated out "sites" (physical facilities) from ranking positions in the top 10 so that we could look at just unique facilities when appropriate, rather than see duplicates show up simply because of longevity on the list. Within systems, we also separated out processor information to a common list, so that as attributes of a particular chip were uncovered (e.g., Vdd, power dissipation, properties of their cache hierarchy, or the number of memory ports they supported), those attributes could be rolled up in analyzing all systems that used them.

Reflecting the sea change from single core to multi-core to heterogeneous multi-core multi-chip nodes, we also kept separate columns of data for different classes of sockets in the same node (as in a conventional microprocessor paired with a GPU), and for sockets whose cores are themselves heterogeneous (as in a Cell).

To the maximum extent possible, these data sets are augmented with references.
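The normalization described above, separating sites, systems, and processor chips so that chip attributes can be "rolled up" across every system that uses them, maps naturally onto a small relational layout. The sketch below is our own illustrative rendering of that idea, not the actual file format of our database; all field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Site:
    """A physical facility, kept separate from ranking entries so that
    longevity on the list does not create duplicate 'systems'."""
    name: str
    country: str

@dataclass
class ProcessorChip:
    """Attributes uncovered for a chip are stored once and shared by
    every system that uses it."""
    name: str
    feature_size_nm: int
    cores: int
    clock_ghz: float
    vdd: Optional[float] = None
    watts: Optional[float] = None

@dataclass
class SystemEntry:
    """One top-10 entry on a particular list."""
    list_date: str                                # e.g. "2010-11"
    site: Site
    compute_chip: ProcessorChip
    accelerator_chip: Optional[ProcessorChip]     # None for homogeneous nodes
    sockets: int
    cores: int
    rmax_gflops: float
    rpeak_gflops: float
    memory_gb: Optional[float] = None
    power_mw: Optional[float] = None
```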
4. SYSTEM ARCHITECTURES

To baseline where "business-as-usual" might take us, this section defines three generic "strawmen" system architectures that are emblematic of where evolution of today's leading edge HPC data center class systems might lead. The architectures used as the departure points for this study are:

• "heavyweight node" Jaguar-class machines that use commodity leading edge microprocessors at their maximum clock (and power) limits,

• "lightweight node" Blue Gene-class machines where special processor chips are run at less than maximum power so that very high physical packaging densities can be achieved, and

• "heterogeneous node" systems where there is a mix of processing chip types in each node.

The first two are extensions of what appeared in the Exascale report [13]; the last reflects the rise of systems using accelerators such as the Cell processor in Roadrunner [3] and GPUs in systems like Tianhe-1A [15]. In particular, we assumed as a baseline NVIDIA's Fermi chip, which has 16 SMs, each with 32 shaders, and a fused multiply add (FMA) unit shared between each pair of shaders.

4.1 Heavyweight Architectures

Many machines in the current TOP500, such as Jaguar, and most systems of the early years, are implemented around the leading edge of microprocessor technology, and consist of multiple large boards that are packaged in racks. Each board includes:

• Multiple (today typically 4) leading edge microprocessors, each of which is the heart of a node. A substantial metal heat sink is needed for each microprocessor, thus leading to the name "heavyweight" node architecture.

• For compatibility with the Exascale report we assume a baseline machine of this type used a 90nm high performance logic technology in a dual core chip. This corresponds to a 2004 chip in a 2006 system such as Red Storm.

• Each node includes a set of relatively commodity daughter memory cards holding commodity DRAM chips (DDRx DIMMs or FB-DIMMs are most common).

• Multiple specialized router chips provide interconnection between the on-board nodes and other boards in the same and other racks. These chips also have "heavy" heat sinks on them.

In summary, our baseline matches the Exascale report, namely 4 sockets per board and nominally 24 boards per rack. A 2.5 MW system consisting of 155 racks contained 12,960 sockets for compute and 640 sockets for system, or one system socket for approximately every 20 compute sockets. At 40 TB of storage and an Rpeak of 127 teraflop/s, the baseline had a ratio of about 0.31 bytes per peak flop/s.

4.2 Lightweight Architectures

The basis for this system class is the Blue Gene series of supercomputers, both the "/L" [1, 4, 5] and the "/P" [10] series. Here the basic unit of packaging is not a large board with multiple, high powered chips needing large heat sinks, but small "DIMM-like" compute cards that include both memory and multiple small, low-power, multi-core compute chips. For BlueGene/L, each chip anchored a node and a compute card held two chips. 16 compute cards and up to two I/O cards were plugged into a node board, and 16 node boards would then be plugged into a midplane. A rack then consisted of two midplanes (32 node boards, 512 compute cards, 1024 compute chips).

The key to this high density stacking is keeping the processing core power dissipation low enough so that only small heat sinks and simple cooling are needed. This is achieved by keeping the architecture of the individual cores simple and keeping the clock rate down. The latter has the side-effect of reducing the cycle level latency penalty of cache misses, meaning that simpler (and thus lower power) memory systems can be used.

The desired properties for the Blue Gene/L core were accomplished by using a 7 stage pipeline with a two instruction issue to 3 pipelines (load/store, integer, and other). Also available is a floating point unit with 2 fused multiply-adders and an extended ISA that supports simple SIMD operations at up to the equivalent of 4 flops per cycle. The compute chips also included complete memory controllers along with interface and routing functions.

For compatibility with the Exascale report we assume our study baseline machine of this type started with a 130 nm low power logic technology (as was used in the BG/L).
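As a quick consistency check on the baselines just described, the short calculation below reproduces the ratios quoted in the text (the compute-to-system socket ratio and bytes per peak flop/s for the heavyweight strawman, and the per-rack chip count for BlueGene/L), and derives the peak flops per cycle implied by the Fermi description in the section introduction. This is only illustrative arithmetic, not data drawn from our database.

```python
# Heavyweight baseline (Sec. 4.1)
compute_sockets, system_sockets = 12_960, 640
print(compute_sockets / system_sockets)          # ~20 compute sockets per system socket

storage_bytes = 40e12                            # 40 TB
rpeak_flops   = 127e12                           # 127 teraflop/s
print(storage_bytes / rpeak_flops)               # ~0.31 bytes per peak flop/s

# Lightweight baseline (Sec. 4.2): BlueGene/L packaging per rack
chips_per_card, cards_per_board = 2, 16
boards_per_midplane, midplanes_per_rack = 16, 2
print(chips_per_card * cards_per_board * boards_per_midplane * midplanes_per_rack)  # 1024 chips

# Heterogeneous baseline (Sec. 4): Fermi as described above
sms, shaders_per_sm, flops_per_fma = 16, 32, 2
fmas_per_sm = shaders_per_sm // 2                # one FMA shared per pair of shaders
print(sms * fmas_per_sm * flops_per_fma)         # 512 peak flops per cycle per chip
```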

4.3 Heterogeneous Architectures

These systems have only recently appeared in the top 10 of the TOP500 list. Roadrunner [3] was the first to appear and debuted at number 1 on the June 2008 list using the Cell processor as its accelerator. Systems using GPUs as their accelerators first appeared in this portion of the TOP500 list in Nov. 2009 and, on the most recent TOP500 list, held three of the top 10 slots (positions 2, 4, and 5). Descriptions of these systems can be found in [19, 14, 15, 9]. In general, the nodes for each of these systems consist of heavyweight processors, with accelerators on attached boards. The heavyweight processors are responsible for managing the accelerators.

While these heterogeneous systems are quite new, we anticipate that two major system organizations may appear in the future; in the long term, convergence between these organizations is likely. The first organization utilizes multi-socket nodes where each socket is a homogeneous device. The second organization utilizes uni- or multi-socket nodes where each socket contains identical heterogeneous devices. Clearly, the GPU systems fit into the first category, whereas Roadrunner is a bit of a blend given that the Cell processor is itself a heterogeneous device. However, based on the identical device requirement of the second system type, Roadrunner is a better fit for the first system type.

Organization 1: Here, not only are chips of varying types utilized, but we also anticipate that the regular processing cores will continue to be responsible for managing the accelerator resources. Options for the accelerator boards include devices like GPUs, Cell processors, FPGAs or other reconfigurable devices, or Intel's Many Integrated Core (MIC) architecture. The Intel MIC chips contain a large number of x86 based cores for use as coprocessors. A demonstration version, Knight's Ferry, is discussed in [17, 18]. Systems utilizing the first two types of accelerators are in use today and more systems based on these will be developed. We are unaware of any recent systems in the TOP500 utilizing FPGAs as accelerators. Since the Intel MIC architecture is a recent development, no large systems using this architecture exist yet. Should Intel continue developing the architecture and necessary software tools, we anticipate systems using it will appear in the TOP500 list.

Organization 2: Systems of this type, while not yet in existence, will utilize a number of identical chips in their nodes. These chips will include both regular processing cores and other cores that are architecturally similar to the accelerators listed above. Current examples of such chips combine CPU and GPU capabilities and include Intel's Sandy Bridge architecture [11], AMD's Fusion architecture [2], and NVIDIA's Project Denver [6].

This paper focuses on systems of the first type since they are in use today and will probably be the dominant form in the time frame under consideration.

5. OBSERVATIONS

The simplest kind of observations about the TOP500 are found in virtually all discussions, namely fitting simple trend lines to the top level system metrics of the listed systems. Fig. 1 shows Rpeak, Rmax (sustained performance), and system memory capacity as a scatter plot versus time, for just the top 10 systems from each list.
Also included are simple curves based on a constant compound annual growth rate that extend from the peak of the current data out through 2016. These historical curves grow at 1.85X and 1.92X per year for Rpeak and Rmax, and 1.79X for memory.³

³ The trend lines for Rpeak and Rmax merge on the chart because, as discussed later, there has been a convergence of their ratio at about 80%.
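The trend curves above are simple constant-growth fits. A minimal sketch of how such a compound annual growth rate can be estimated and extrapolated is shown below, using a least-squares fit of log-performance against time; the data points in the example are made up for illustration and are not entries from the list.

```python
import math

def fit_cagr(years, values, ref_year):
    """Least-squares fit of log(value) against (year - ref_year).
    Returns (growth factor per year, fitted value at ref_year)."""
    xs = [y - ref_year for y in years]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(slope), math.exp(my - slope * mx)

def extrapolate(cagr, value_at_ref, year, ref_year):
    return value_at_ref * cagr ** (year - ref_year)

# Illustrative (fabricated) top-10 Rpeak points in Gflop/s.
years  = [2004.5, 2006.5, 2008.5, 2010.5]
rpeaks = [4.0e4, 1.4e5, 4.8e5, 1.7e6]
cagr, base = fit_cagr(years, rpeaks, ref_year=2004.5)
print(round(cagr, 2))                              # growth factor per year (~1.87 here)
print(extrapolate(cagr, base, 2016.0, 2004.5))     # projected Gflop/s in 2016
```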

[Figure 1: Top Level Characteristics (Rpeak, Rmax, and memory versus time, with trend lines).]

While these trend lines are useful, they can be a bit misleading since they only consider the top performing systems. Distinguishing between system organizations as in Fig. 2, we note that heavyweight systems were at the top of the list until 2004, when the first lightweight systems appeared. These lightweight systems quickly took over many of the top rankings, but since 2008 they are being replaced by the heterogeneous systems.

[Figure 2: System Organizations since 2004.]

Fig. 3 shows the percentage of peak flop/s (Rmax/Rpeak) that is sustained for LINPACK over time. This value varies without discernible trends between about 0.5 and 0.9 up until 2004. From 2004 to 2008 there was a tightening around 0.8. However, the introduction of heterogeneous systems seems to imply a rapid decrease in efficiency. This latter point may be partially explained by the difficulty in keeping large numbers of shaders busy at the same time. Also, during execution only the GPU cores are really used, leaving the heavyweight host capabilities idle.

[Figure 3: Floating Point Efficiency (Rmax/Rpeak).]
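One way to see why the heterogeneous entries sit so much lower in Fig. 3 is to note that if only the accelerators contribute to Rmax, the host's share of Rpeak is a direct loss. The toy calculation below is our own illustration of that point; the split of Rpeak and the utilization figure are made-up placeholders, not measured values.

```python
def heterogeneous_efficiency(rpeak_host_frac, rpeak_gpu_frac, gpu_utilization):
    """Rmax/Rpeak under a toy model in which only the GPUs contribute
    useful flops and sustain only a fraction of their own peak."""
    rmax = rpeak_gpu_frac * gpu_utilization
    return rmax / (rpeak_host_frac + rpeak_gpu_frac)

# Made-up split: 20% of Rpeak in the hosts, 80% in the GPUs, 65% shader utilization.
print(round(heterogeneous_efficiency(0.2, 0.8, 0.65), 2))   # ~0.52, in line with Fig. 3

# A classical homogeneous system sustaining ~80% of peak, for comparison.
print(round(heterogeneous_efficiency(0.0, 1.0, 0.80), 2))   # 0.8
```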

Fig. 4 displays the bytes of storage provided per flop/s for which the system is designed at peak. As with the efficiency ratio, this value varies widely up until about 2005; since then, there is a distinct trend towards lower ratios, especially for lightweight and heterogeneous designs. We have also plotted this value against Rmax to more accurately reflect application performance, and have found that this downward trend remains obvious.

[Figure 4: Bytes per Flop/s (Rpeak).]

Fig. 5 gives a timeline for the growth in sockets and cores. Up until 2004 the core count overlaps the socket count: a core per socket was the norm. Since then, however, there is a growing divergence, reflecting the growth in multi-core. The "core count" is equivalent to the TLP measure introduced earlier. If we plotted "FPU count," we would see additional peaks for vector machines such as the Earth Simulator and in particular for GPUs, with dozens of shaders per "core."

[Figure 5: Growth in Socket and Core Count.]

Perhaps of even more interest are details of the way performance is obtained. Fig. 6 graphs the maximum clock rate of the cores in a system most responsible for computation (in heterogeneous systems the GPU rates are often slower than those of the host processors). This graph demonstrates several interesting observations. First, the clock rate did in fact flatten around 2004. Second, there is a clear bifurcation: the lightweight machines chose a significantly lower clock rate than the commodity heavyweights, as did the new heterogeneous systems.

[Figure 6: Trends in Clock Rate.]

Thread level concurrency (TLC) is a good metric of the complexity of individual cores. As was observed in the Exascale report, there have not been radical changes in this metric until recently. In fact, excluding some vector machines, a value of 2 or 4 has been almost the default for most of the time. The emergence of the GPU-driven heterogeneous architectures has begun to significantly raise this number, since each core in a GPU consists of numerous (16-48) shaders, with each contributing some amount of flops per cycle. Even at 0.5 flop/cycle, a GPU core with 16 shaders would contribute 8 to TLC.

Next, total concurrency (TC) is the total number of operations that must be "in the air" at each and every machine cycle. Fig. 7 shows that for the long stretch of time when technology permitted clock rates to rise, the TC value stayed between a few thousand and a few tens of thousands. However, once the clock rate flattened (about 2004), the only way to get additional performance was through brute parallelism, and this number then began to take off.

[Figure 7: Total Concurrency (TC).]
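The TC numbers in Fig. 7 follow directly from the definition in Sec. 2.2 (cores times peak TLC). The fragment below, with made-up system parameters, shows how the GPU shader contribution inflates both TLC and TC relative to a classical multi-core design.

```python
def peak_tlc(shaders_per_core, flops_per_shader_per_cycle):
    """Peak flops per cycle contributed by one GPU 'core' (SM)."""
    return shaders_per_core * flops_per_shader_per_cycle

# Classical heavyweight cores: TLC of 4 (fused multiply-add pipelines).
classic_cores, classic_tlc = 200_000, 4
print(classic_cores * classic_tlc)            # 800,000 operations in flight per cycle

# GPU SMs as in the text: 16 shaders at 0.5 flop/cycle each -> TLC of 8.
gpu_cores = 100_000
gpu_tlc = peak_tlc(16, 0.5)
print(gpu_tlc)                                # 8
print(gpu_cores * gpu_tlc)                    # 800,000 from half as many "cores"
```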

Fig. 8(a) diagrams the trends in total system power. The trends here are a bit harder to see, other than what appears to be the beginning of a big uptick starting in 2009. Fig. 8(b) is perhaps more revealing, computing total system power divided by the number of sockets. Here the heavyweights clearly fall in the 200-300 watts per socket range, the lightweights in the 20-30 range, and the heterogeneous in the 100+ range. We note that in the latter case, the socket count includes both the microprocessors and the GPUs.

[Figure 8: System power (a) in total and (b) per socket.]

Finally, Fig. 9 gives a feeling for the overall size of systems in terms of the number of racks. Heavyweight systems have for the most part stayed in the 150-200 range, while the heterogeneous have grown closer to 300.

[Figure 9: Rack Count.]

6. TECHNOLOGY ROADMAPS

This section describes the roadmaps used to project future trends in the three classes of systems described previously. For the most part, the trend-generation approaches assumed are updates to those described in the Exascale report, as are the baseline systems against which the trends are projected. While trend lines could have been drawn from more current systems, we deliberately used the baselines for the heavyweight and lightweight strawmen from the report. Using these strawmen, which are now several years old, allows us to gauge the accuracy of the projection process based on how well recent machines have fit the projected trends.

We did, however, assume a more modern baseline point for the heterogeneous class, since the first system of this type, Roadrunner, debuted in the top 10 in June 2008, with GPU-based systems appearing in later lists.

6.1 Technology Trends

The Exascale report includes a set of charts projecting a variety of key chip parameters. Fig. 10 shows the updated chart for Vdd based on the most recent ITRS [16] projections. This figure includes both high performance and low power logic, along with projections on the supply voltage for commodity DRAM chips. Vdd is an important parameter since, assuming a constant clock rate, dynamic power for logic is often approximated as CV², where C is the aggregate capacitance of the circuit. To a reasonable first approximation, the capacitance for the same circuit scaled to a new technology scales with feature size. Fig. 11 then plots this relative improvement in dynamic power.

[Figure 10: Vdd (high performance logic, low power logic, and DRAM).]

[Figure 11: Dynamic Power Decrease over Time (high performance vs. 90nm, low power vs. 130nm).]

The above curve assumed a flat clock rate. However, the inherent speed of devices (and thus a maximum clock rate) will continue to improve, roughly in line with the reciprocal of capacitance, and thus the reciprocal of feature size. If such rates were actually used as core clocks, and we kept die size constant through time, then the total power of the die would actually grow inversely with feature size (halving the feature size doubles the power dissipated by the chip). If we instead constrain the power dissipation of the chip to some limit, this then constrains the operating clock rate. Fig. 12 diagrams the relative change in clock rates assuming chip dissipation is bounded.

[Figure 12: Power Constrained Clock Rate (peak clock and power-limited clock vs. 90nm).]

Finally, the capacity of commodity DRAM chips is key. Again we assume growth as forecast in the ITRS. We do not consider here the potential changes brought about by newly emerging technology such as chip stacking⁴.

⁴ See for example Micron's "Hybrid Memory Cube": http://www.micron.com/innovations/hmc.html
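The scaling argument above can be made concrete with a few lines of arithmetic: with capacitance proportional to feature size, dynamic power per die goes as C·Vdd²·f, and holding die power flat forces the usable clock below the device-limited peak. The sketch below is our own illustration of that reasoning; the voltage and feature-size values are placeholders, not ITRS data.

```python
def relative_dynamic_power(feature_nm, vdd, clock, ref=(90.0, 1.1, 1.0)):
    """Dynamic power of a fixed design relative to a 90nm reference,
    assuming capacitance scales with feature size (P ~ C * Vdd^2 * f)."""
    ref_nm, ref_vdd, ref_clock = ref
    return (feature_nm / ref_nm) * (vdd / ref_vdd) ** 2 * (clock / ref_clock)

def power_limited_clock(feature_nm, vdd, ref=(90.0, 1.1, 1.0)):
    """Largest relative clock that keeps die power at the reference level
    for a constant die size (so core count grows as 1/feature^2)."""
    ref_nm, ref_vdd, _ = ref
    cores_growth = (ref_nm / feature_nm) ** 2
    per_core_cap = feature_nm / ref_nm
    return 1.0 / (cores_growth * per_core_cap * (vdd / ref_vdd) ** 2)

# Placeholder numbers for a 45nm high-performance node with Vdd ~ 1.0V:
print(round(relative_dynamic_power(45.0, 1.0, 1.0), 2))  # same clock: ~0.41x the power per core
print(round(power_limited_clock(45.0, 1.0), 2))          # ~0.61x the clock for a full, same-size die
```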

6.2 Heavyweight Strawman Scaling Assumptions

The assumptions we make for this paper about how the heavyweight architectures would mature over time from the 2004 baseline are drawn from the Exascale report and are listed below:

1. The microarchitecture for each core in the microprocessor will be approximately unchanged in complexity (i.e. transistor count) over the baseline.

2. The die size for the microprocessor chip will remain approximately constant, meaning that the number of cores on each microprocessor chip may grow roughly as the transistor density grows.

3. We do not account for relatively larger L3 caches to reduce pressure on off-chip contacts. This actually results in a possible overestimate of peak chip performance because we end up with more cores per die.

4. Such chips will continue to use high performance logic, with Vdd flattening as in Fig. 10.

5. The power dissipation per die will grow very slowly, as projected in the ITRS.

6. Per core, the microarchitecture improves from a peak of 2 flops per cycle in 2004 to a peak of 4 flops per cycle in 2006, and 8 flops per cycle in 2013.

7. The system will want to maintain the same ratio of bytes of main memory to peak flops as today. This will be done by using whatever natural increase in density comes about from commodity DRAM, coupled with additional memory cards as necessary if that intrinsic growth is insufficient.

8. The maximum number of sockets (i.e. nodes) per board will double a few times. This is assumed possible because of a possible move to liquid cooling, for example, where more power can be dissipated, allowing the white space to be used and/or the size of the heat sinks to be reduced. This projection assumes that this may happen at roughly five year intervals.

9. The maximum number of boards per rack will increase by perhaps 33% over the 2006 baseline, again because of assumed improvements in physical packaging and cooling, and a reduction in volume for support systems. For this projection we assume this may happen once.

10. The maximum power per rack will increase by at best a power of 16, to somewhere around 200-300KW. We assume this is a doubling every 3 years.

11. The maximum number of racks in a system was set to 155 in 2006 (to match then current systems), with an increase of 50 every year, up to a maximum of 600.

12. Secondary storage (and its growth) for scratch, file, or archival purposes is ignored. This storage must also undergo significant increases.

The above assumptions are reasonable for estimating power in the microprocessor parts of a system. There is, however, concern for how accurate such scaling rules would be for other parts of the baseline, particularly the memory system and the routers. This is because a significant amount of their power comes not from internal processing but from I/O - the transfer of data across chip boundaries. While such increases can be projected for commodity memory, the number of memory ports per socket is something that is changing. Also, both the protocols and the bandwidth of the router chips, and how they tie into the memory and microprocessors in a node, may change in a complex fashion.

To simplify our projections, we thus continue the approach suggested in the Exascale report by adopting two energy models. The Scaled model assumes that the power per microprocessor chip grows as the ITRS roadmap has predicted, and that the power for the memory associated with each socket grows only linearly with the number of memory chips needed (i.e. the power per memory chip is "constant"). We also assume that the power associated with both the routers and other common logic remains constant. This is the same as the "Simplistically Scaled" model in the Exascale report. In a real sense, we are assuming here that both memory access energy and the energy cost of moving a bit, either across a chip boundary or between racks, will decrease (or "scale down") with technology advances at least as fast as the increase in flops, with the total power constant because a concurrent increase in the flop rate of the microprocessor most probably requires a higher number of references to memory and higher traffic through the routers. This is clearly optimistic.

In contrast, the Constant model assumes the microprocessor power grows as above, but that both the memory and router power scale linearly with the peak flops potential of the multi-core microprocessors. This naively assumes that the total energy expended in these two subsystems is used in accessing and moving data, and that the energy to handle one bit (at whatever rate) is constant through time (i.e. no improvement in I/O energy protocol). This is clearly an over-estimate, and liable to be highly pessimistic. It is similar to the "Fully Scaled" model from the Exascale report.

Neither model takes into account the effect of only a finite number of signal I/Os available from a microprocessor die, and the power effects of trying to run the available pins at a rate consistent with the I/O demands of the multiple cores on the die.
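A minimal sketch of how the two energy models differ is given below. It is our own paraphrase of the rules just described, with invented per-socket baseline numbers: under the Scaled model memory and router power track chip counts, while under the Constant model they track peak flops.

```python
def socket_power_scaled(cpu_watts, memory_chips, watts_per_memory_chip, router_watts):
    """'Scaled' model: CPU power follows the ITRS projection (passed in),
    memory power grows only with the number of memory chips,
    router/common-logic power stays constant."""
    return cpu_watts + memory_chips * watts_per_memory_chip + router_watts

def socket_power_constant(cpu_watts, flops_ratio, base_memory_watts, base_router_watts):
    """'Constant' (fixed energy per bit) model: memory and router power grow
    linearly with the peak flops of the socket relative to the baseline."""
    return cpu_watts + flops_ratio * (base_memory_watts + base_router_watts)

# Invented baseline: 100W CPU, 8 memory chips at 5W each, 20W of router/common logic.
print(socket_power_scaled(100.0, 8, 5.0, 20.0))          # 160W baseline socket

# A future node with 4x the memory chips and 16x the peak flops:
print(socket_power_scaled(120.0, 32, 5.0, 20.0))         # 300W  (optimistic)
print(socket_power_constant(120.0, 16.0, 40.0, 20.0))    # 1080W (pessimistic)
```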
6.3 Lightweight Strawman Scaling Assumptions

For consistency, the scaling assumptions used for the lightweight strawman are modified only slightly from those above for the heavyweight, and again are based on the trends used in the Exascale report. In particular, we reuse assumptions 1, 2, 3, 8, 10, and 12 in total. The list below slightly modifies assumptions 4, 6, 7, 9, and 11, in order, from the heavyweight list:

• Vdd flattens as for low power, not high performance, logic.

• Per core, the microarchitecture will improve from a peak of 2 flops per cycle to a peak of 8 flops per cycle in two steps.

• The bytes of main memory per flop ratio will match that of BlueGene/L.

• There is one doubling of compute cards per node board.

• The number of racks in 2005 and 2006 is set to the same as observed in real systems in order to validate the model. After 2006, the model assumes growth in racks as for the heavyweight.

The following assumptions are specific to these systems:

• The power dissipation per chip will be allowed to increase gradually to twice what it is for BlueGene/L.

• The overhead of the rack for power and cooling will be the same percentage as in the Blue Gene/L.

• For this sizing, we will ignore what all this means in terms of system topology.

6.4 Heterogeneous Scaling Assumptions

For the time being, we will focus on heterogeneous systems utilizing GPUs for their accelerated performance. Since the leading chip for each GPU family tends to have a high transistor count, large area, and high power dissipation, the assumptions below are similar in nature to those of the heavyweight case. It is assumed that these accelerators will remain on daughter cards.

• There will be at most a 50% growth in shaders per core, and a doubling of FMAs per shader.

• The conventional microprocessors that provide the non-computational support for each node will evolve as in the heavyweight case.

• The power dissipation of the GPU will grow as for the heavyweight processor.

• Local memory for individual SMs will double every few years, and the L2 and above caches will grow so as to use the same amount of space they do now.

• The number of memory ports will be pin limited as for the heavyweight, and thus will change at best modestly.

• The directly attached memory for each GPU will grow in density in step with the ITRS predictions, and no faster.

• The ratio of GPU sockets to heavyweight sockets per node will remain the same as today.

• The density of nodes per rack will increase at the same rate as for the heavyweight, and with the same power limitations.

• The number of racks in a full system starts with the number in today's biggest system (to validate the model) and increases by 50 per year as in the other models, up to 600 at most.

Given that details of energy expenditure in different parts of a GPU chip are not yet available, only one model is projected here - the Scaled model.

7. PROJECTIONS

Using the constraints described in the last section, we have overlayed projections for all three architecture classes over the historical data, with time running out to 2024 (the end of the ITRS roadmap).
There are five projections: for each of the heavyweight and lightweight classes there are two curves, one for the "scaled" energy model and one for the "constant" model as defined in Sec. 6.2, plus a "scaled" model for the heterogeneous class as discussed above.

Figs. 13 and 14 give projections for multi-rack systems, where the number of racks is allowed to grow as discussed earlier, up to a maximum of 600. The flattening of the constant projections after 2016 is due to the fact that the I/O and memory power costs have taken over (in this model they don't improve on a per bit basis with technology), and once we hit a maximum power per rack and a maximum number of racks per system, there is no room to support additional flop/s. Fig. 14 shows this flattening clearly. As we said earlier, such a model is "pessimistic."
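The flattening can be reproduced directly from the packaging assumptions of Sec. 6.2: racks grow by 50 per year from 155 in 2006 up to a cap of 600, and power per rack at best doubles every three years up to a cap. The fragment below encodes those two rules (the 20KW starting value and 300KW ceiling are our own illustrative placeholders) to show how the maximum system power, and hence constant-model performance, stops growing.

```python
def racks(year, start=155, start_year=2006, per_year=50, cap=600):
    return min(cap, start + per_year * max(0, year - start_year))

def power_per_rack_kw(year, start_kw=20.0, start_year=2006, cap_kw=300.0):
    """At best a doubling every 3 years, up to roughly 200-300KW (Sec. 6.2, item 10)."""
    return min(cap_kw, start_kw * 2 ** ((year - start_year) / 3.0))

for year in (2008, 2012, 2016, 2020, 2024):
    mw = racks(year) * power_per_rack_kw(year) / 1000.0
    print(year, racks(year), round(mw, 1))   # rack count and max system power in MW
```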

[Figure 13: System Rpeak.]

[Figure 14: Peak System Power.]

Fig. 15 diagrams what is probably the single biggest metric that came out of the Exascale report, namely the energy per flop (using Rpeak here). Fig. 16 is the same graph but limited to the region between 2004 and 2016.

[Figure 15: Energy per Flop - Historical and Projected.]

[Figure 16: Focused Energy per Flop.]

It should be noted that the projections assume a technology refresh each and every year using the most up-to-date technology. Clearly, this does not occur in reality at such intervals, but as a model it does tend to bound the realm of what is possible. Also, the number of racks in a deployed system may not increase at the same rate as in the model. As can be seen, by starting our extrapolation from earlier technology points, we find that the historical data does in fact tend to stay between the projected bounds, indicating that the models are reasonable, at least in the short haul.

There is also some "jitter" in the projections, especially for some of the constant energy models. This is simply due to the numerology chosen for the model⁵. In real life no one would actually design a system with "slightly worse" characteristics for one year than for the previous. Again, however, the overall shape of the curves is consistent and bounds the historical data.

⁵ For example, an anomalous dip in lightweight energy appears in 2008, which was due to assuming a switch from 2 chips per compute card to one.
8. CONCLUSIONS

The historical record presented here is more complete and up to date than the data used in the Exascale report, but the overall conclusions are the same. Intrinsic capability, Rpeak, and effective performance against LINPACK, Rmax, have increased at a remarkably flat compound annual growth rate dating back well before the start of the TOP500 records. There was, however, a sea change starting around 2004 when the clock rate of commodity microprocessors flattened out, and raw parallelism started to take over as the driver of increased performance. Total concurrency, in terms of the sheer number of FPUs that need to be directed by the program in each clock cycle, has grown by more than 10X since 2004, after having taken over a decade for the prior 10X.

This same inflection point also heralded the emergence of processor architectures that provided more such parallelism in less area and less power. These were the "lightweight" and, more recently, the "heterogeneous" designs. Both of these architectures, however, had native clock rates 1/2 to 1/4 those of leading heavyweights, meaning that parallelism must have been increased by 2 or 4X just to make up the gap and still increase overall system performance.

Internally, the only real change in Thread Level Concurrency (a property of the basic cores) has been with the emergence of GPUs, where dozens of simple shader engines can be under the direction of a single instruction stream. Unfortunately, early data suggests that the efficiency with which such flops can be applied to LINPACK is perhaps around 50%, down considerably from the 70 to 80% seen for the more classical design styles. Looking forward to more complex applications where efficiencies of 10% or less are not uncommon, this raises a serious concern as to whether GPU-driven designs will, in fact, be able to keep up.

Another major concern is the obvious decline in bytes per flop/s, again first appearing around the 2004 inflection point, and accelerating with the non-heavyweight design styles. It will be quite interesting to see if this trend continues, and if such "memory-starved" systems become limited in the classes of applications that they can handle.

In terms of the models (originally formulated in late 2007), the new data points seem to fit reasonably well between the scaled and the constant projections. For the lightweight systems we have, however, seen only two BlueGene/P sites, with the rest dating back to BlueGene/L. It will be interesting to see where BlueGene/Q systems lie when they come on line.

Looking forward, the same power dissipation limits that caused the inflection will continue to have an effect. Using the same amount of silicon area to increase the intrinsic computation power of a chip will likely require the basic clock rate of such chips to not just flatten, but actually decline, meaning that even more brute parallelism will be needed to make advances in high end performance. Matching this performance increase will require increased memory bandwidth, which in turn may not scale as well as logic. In fact, when we look at Fig. 14, the difference between the scaled and constant models is almost totally driven by the memory access and I/O power differences, with the flattening of peak performance due to the models running against both a maximum power per rack and a maximum number of racks.

It is also interesting to note that with these power constraints, only the heterogeneous model has a chance at a peak (not sustained) exaflop/s by 2020, and then at a power in excess of 100MW.

Finally, the need to raise energy efficiency and decrease dissipation will have a profound effect on architecture, with the trend towards lightweight and now heterogeneous systems clearly needed to move power in the right direction. Also clear is that there is a rich design space to explore. The extrapolations made here used just one set of assumptions, which often resulted internally in inefficient solutions. Future work is needed to explore these alternatives. To aid in such efforts, we intend to make our data files available on-line and, as resources allow, to permit on-line changes to model assumptions to explore alternatives.

9. ACKNOWLEDGMENTS

The original version of the study discussed here was funded in part by DARPA as part of the Exascale Technology Working Group study. The update described here was funded in part by the Univ. of Notre Dame, and in part by DARPA through Sandia National Labs as part of the UHPC program. Much of the paper was written while the first author was on sabbatical at Cal Tech.

10. REFERENCES

[1] G. Almasi, S. Chatterjee, A. Gara, J. Gunnels, M. Gupta, A. Henning, J. E. Moreira, and B. Walkup. Unlocking the performance of the BlueGene/L supercomputer. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC '04, page 57, Washington, DC, USA, 2004. IEEE Computer Society.

[2] AMD. The AMD Fusion family of APUs. Website. http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx.
[3] K. J. Barker, K. Davis, A. Hoisie, D. J. Kerbyson, M. Lang, S. Pakin, and J. C. Sancho. Entering the petaflop era: the architecture and performance of Roadrunner. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pages 1:1-1:11, Piscataway, NJ, USA, 2008. IEEE Press.

[4] A. A. Bright, M. R. Ellavsky, A. Gara, R. A. Haring, G. V. Kopcsay, R. F. Lembach, J. A. Marcella, M. Ohmacht, and V. Salapura. Creating the BlueGene/L supercomputer from low-power SoC ASICs. In Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC 2005 IEEE International, pages 188-189, San Francisco, CA, 2005.

[5] P. Coteus, H. R. Bickford, T. M. Cipolla, P. G. Crumley, A. Gara, S. A. Hall, G. V. Kopcsay, A. P. Lanzetta, L. S. Mok, R. Rand, R. Swetz, T. Takken, P. L. Rocca, C. Marroquin, P. R. Germann, and M. J. Jeanson. Packaging the Blue Gene/L supercomputer. IBM J. Res. & Dev., 49(2/3):213-248, 2005.

[6] B. Dally. Project Denver processor to usher in new era of computing. Online. http://blogs.nvidia.com/2011/01/project-denver-processor-to-usher-in-new-era-of-computing/.

[7] J. Dongarra. Trends in high performance computing: a historical overview and examination of future developments. IEEE Circuits and Devices Magazine, 22(1):22-27, Jan.-Feb. 2006.

[8] J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK benchmark: Past, present, and future. Online, Dec. 2001. http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf.

[9] GSIC, Tokyo Institute of Technology. Tsubame-2.0 website. http://tsubame.gsic.titech.ac.jp/en.

[10] IBM Blue Gene team. Overview of the IBM Blue Gene/P project. IBM J. Res. & Dev., 52(1/2):199-220, 2008.

[11] Intel. Overview of Sandy Bridge microarchitecture. Website. http://www.intel.com/technology/architecture-silicon/2ndgen/index.htm.

[12] P. Kogge. The challenges of petascale architectures. Computing in Science & Engineering, 11(5):10-16, Sept.-Oct. 2009.

[13] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems, 2008.

[14] R. Merritt. China taps NVIDIA for world's second biggest computer. EETimes online article, May 2010. http://www.eetimes.com/electronics-news/4199798/China-taps-Nvidia-for-world-s-second-biggest-computer.

[15] NVIDIA. GPUs power world's fastest supercomputer. Press release, online, Oct. 2010. http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=liveprid=678988&releasejsp=release 157.

[16] SIA. International Technology Roadmap for Semiconductors, 2010 Update, 2010.

[17] K. Skaugen. Petascale to Exascale: Extending Intel's HPC Commitment. Keynote presentation at the International Supercomputing Conference, 2010. http://download.intel.com/pressroom/archive/reference/ISC 2010 Skaugen keynote.pdf.

[18] M. Wolfe. Compilers and more: Knights Ferry versus Fermi. HPCwire online article, Aug. 2010. http://www.hpcwire.com/features/Compilers-and-More-Knights-Ferry-v-Fermi-100051864.html.

[19] X. Yang, X. Liao, W. Xu, J. Song, Q. Hu, J. Su, L. Xiao, K. Lu, Q. Dou, J. Jiang, and C. Yang. TH-1: China's first petaflop supercomputer. Frontiers of Computer Science in China, 4:445-455, 2010. DOI 10.1007/s11704-010-0383-x.