EECS 594 Spring 2009 Lecture 6: Overview of High-Performance Computing
High-Performance Computing Today

In the past decade, the world has experienced one of the most exciting periods in computer development. Microprocessors have become smaller, denser, and more powerful. The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering.

[Chart: peak performance (Flop/s) versus year, 1950-2010, spanning the scalar, vector, super scalar, parallel, and super scalar/special purpose/parallel eras, with machines including the EDSAC 1, UNIVAC 1, IBM 7090, CDC 6600, CDC 7600, IBM 360/195, Cray 1, Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, IBM RoadRunner, and Cray Jaguar; transistors per chip double every 1.5 years.]

Milestones from the chart:
- 1941: 1 Flop/s
- 1945: 100 Flop/s
- 1949: 1,000 Flop/s (1 KFlop/s)
- 1951: 10,000 Flop/s
- 1961: 100,000 Flop/s
- 1964: 1,000,000 Flop/s (1 MFlop/s)
- 1968: 10,000,000 Flop/s
- 1975: 100,000,000 Flop/s
- 1987: 1,000,000,000 Flop/s (1 GFlop/s)
- 1992: 10,000,000,000 Flop/s
- 1993: 100,000,000,000 Flop/s
- 1997: 1,000,000,000,000 Flop/s (1 TFlop/s)
- 2000: 10,000,000,000,000 Flop/s
- 2007: 478,000,000,000,000 Flop/s (478 TFlop/s)
- 2009: 1,100,000,000,000,000 Flop/s (1.1 PFlop/s)

Technology Trends: Microprocessor Capacity

- Gordon Moore (co-founder of Intel), Electronics Magazine, 1965: the number of devices per chip doubles every 18 months. This is called "Moore's Law".
- Microprocessors have become smaller, denser, and more powerful.
- Not just processors: bandwidth, storage, etc. 2X transistors per chip every 1.5 years; 2X memory and processor speed, and size, cost, and power, every 18 months.

Moore's "Law"

- Something doubles every 18-24 months.
- That something was originally the number of transistors; it is also often taken to mean performance.
- Moore's Law is an exponential, and exponentials cannot last forever.
- However, Moore's Law has held remarkably true for ~30 years.

Something's Happening Here…

(From K. Olukotun, L. Hammond, H. Sutter, and B. Smith)
- In the "old days": each year processors would simply become faster.
- Today the clock speed is fixed or getting slower, yet things are still doubling every 18-24 months.
- Moore's Law reinterpreted: the number of cores doubles every 18-24 months.
- A hardware issue just became a software problem.

Power Cost of Frequency

[Two slides of charts showing the power cost as a function of clock frequency.]

No Free Lunch for Traditional Software

[Chart: operations per second for serial code. On the old trajectory (3 GHz, 6 GHz, 12 GHz, 24 GHz, one core) traditional software got a free lunch: it just ran twice as fast every 18 months with no change to the code. On the new trajectory (3 GHz with 1, 2, 4, 8 cores) there is no free lunch: without highly concurrent software it won't get any faster; the additional operations per second are available only if the code can take advantage of concurrency.]

What's Next?

- Many floating-point cores + 3D stacked memory.
- Different classes of chips: home, games/graphics, business, scientific.

Percentage of Peak

- A rule of thumb that often applies: a contemporary processor, across a spectrum of applications, delivers (i.e., sustains) about 10% of peak performance.
- There are exceptions to this rule, in both directions.
- Why such low efficiency?
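As a rough illustration of how the 10%-of-peak rule of thumb is applied, here is a minimal C sketch. The clock rate, flops per cycle, core count, and the "sustained" figure are made-up example numbers, not measurements from the lecture.

```c
/* Sketch: computing percentage of peak from assumed example numbers. */
#include <stdio.h>

int main(void)
{
    double clock_ghz     = 2.4;  /* assumed clock rate (GHz)          */
    double flops_per_cyc = 4.0;  /* assumed FP ops per cycle per core */
    double cores         = 4.0;  /* assumed number of cores           */

    double peak_gflops = clock_ghz * flops_per_cyc * cores;  /* 38.4 */

    /* Suppose a profiler reports the application sustaining 3.8 Gflop/s
       (an invented number for illustration). */
    double sustained_gflops = 3.8;

    printf("Theoretical peak : %6.1f Gflop/s\n", peak_gflops);
    printf("Sustained        : %6.1f Gflop/s (%.0f%% of peak)\n",
           sustained_gflops, 100.0 * sustained_gflops / peak_gflops);
    return 0;
}
```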
Why Fast Machines Run Slow

- Latency: waiting for access to memory or other parts of the system.
- Overhead: extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform.
- Starvation: not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources.
- Contention: delays due to fighting over which task gets to use a shared resource next. Network bandwidth is a major constraint.

Memory Hierarchy

[Table of typical latencies for today's technology.]

Processor-DRAM Memory Gap

- Processor performance grows about 60% per year (2X every 1.5 years): "Moore's Law".
- DRAM performance grows about 9% per year (2X every 10 years).
- The processor-memory performance gap therefore grows about 50% per year.

Principles of Parallel Computing

- Parallelism and Amdahl's Law
- Granularity
- Locality
- Load balance
- Coordination and synchronization
- Performance modeling
All of these things make parallel programming even harder than sequential programming.

Here's Your Problem

- Say 2.26 GHz and 2 double-precision ops per cycle: 4.52 Gflop/s peak.
- FSB at 533 MHz with a 32-bit data path (4 bytes): 2.132 GB/s.
- With 8 bytes per double-precision word: 266.5 MWords/s from memory.

Intel Clovertown

- Quad-core processor; each core does 4 floating-point ops per cycle.
- Say 2.4 GHz: 4 cores x 4 flops/cycle x 2.4 GHz = 38.4 Gflop/s peak.
- FSB at 1.066 GHz: 1.066 GHz x 4 bytes / 8 bytes per word = 533 MWords/s.
- There's your problem. (A short sketch of this balance arithmetic appears after the cache slides below.)

Solving the Memory Bottleneck

Since we cannot make memories fast enough, we invented the memory hierarchy:
- L1 cache (on chip)
- L2 cache
- Optional L3 cache
- Main memory
- Hard drive

Cache Memories

- Cache memories are small, fast SRAM-based memories managed automatically in hardware. They hold frequently accessed blocks of main memory.
- The CPU looks first for data in L1, then in L2, then in main memory.
- [Diagram of a typical bus structure: the CPU chip (register file, ALU, L1 cache) connected over the cache bus to the L2 cache, and over the system bus, I/O bridge, and memory bus to main memory.]

Three Types of Cache Misses

- Compulsory (or cold-start) misses: first access to data. Can be reduced via bigger cache lines or some prefetching.
- Capacity misses: misses due to the cache not being big enough. Can be reduced via a bigger cache.
- Conflict misses: misses due to some other memory line having evicted the needed cache line. Can be reduced via higher associativity.

Write Policy: Write-Through

What happens when the processor modifies memory that is in the cache?
- Option #1: write-through. The write goes BOTH to the cache and to main memory.
- Memory and cache are always consistent.
- [Diagram: CPU stores go to the cache and to memory; loads are served from the cache.]

Write Policy: Write-Back

- Option #2: write-back. The write goes only to the cache; cache lines are written back to memory when evicted.
- Requires a "dirty" bit to indicate whether a cache line has been written to or not.
- Memory is not always consistent with the cache.
- [Diagram: CPU stores go to the cache; dirty lines are written back to memory on eviction; loads are served from the cache.]

Cache Basics

- Cache hit: a memory access that is found in the cache -- cheap.
- Cache miss: a memory access that is not in the cache -- expensive, because we need to get the data from elsewhere.
- Consider a tiny cache (for illustration only) with addresses X000 through X111; each address is split into tag | line | offset fields.
- Cache line length: the number of bytes loaded together in one entry.
- Direct mapped: only one address (line) in a given range can be in the cache.
- Associative: two or more lines with different addresses can exist in the cache.

Direct-Mapped Cache

A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is a direct mapping from any block address in memory to a single location in the cache.
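Returning to the bandwidth arithmetic on the "Here's Your Problem" and Intel Clovertown slides, here is a minimal C sketch of the same machine-balance calculation. The rates come straight from those slides; the ratio it prints (roughly 72 peak flops per double-precision word delivered) is the back-of-the-envelope point of the slides, not a measured value.

```c
/* Sketch: machine balance for the Clovertown numbers on the slide above. */
#include <stdio.h>

int main(void)
{
    double peak_gflops  = 4 * 4 * 2.4;        /* 4 cores x 4 flops/cycle x 2.4 GHz = 38.4 Gflop/s */
    double bus_gbytes_s = 1.066 * 4.0;        /* 1.066 GHz FSB x 4-byte data path = 4.26 GB/s     */
    double words_per_s  = bus_gbytes_s / 8.0; /* 8 bytes per DP word = 0.533 GWords/s (533 MW/s)  */

    printf("Peak:             %.1f Gflop/s\n", peak_gflops);
    printf("Memory bandwidth: %.3f GWords/s\n", words_per_s);
    printf("Peak flops per word delivered from memory: %.0f\n",
           peak_gflops / words_per_s);
    return 0;
}
```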
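To make the tag | line | offset split on the Cache Basics slide concrete, here is a minimal C sketch of how a direct-mapped cache decomposes an address. The line size, line count, and example address are assumed values for illustration, not parameters given in the lecture.

```c
/* Sketch: splitting an address into tag | line | offset for a
 * direct-mapped cache with assumed example parameters. */
#include <stdio.h>

#define LINE_BYTES 64u   /* assumed cache line length (bytes)    */
#define NUM_LINES   8u   /* assumed number of lines in the cache */

int main(void)
{
    unsigned int addr = 0x1234ABCDu;  /* arbitrary example address */

    unsigned int offset = addr % LINE_BYTES;                /* byte within the cache line   */
    unsigned int line   = (addr / LINE_BYTES) % NUM_LINES;  /* which line the block maps to */
    unsigned int tag    = addr / (LINE_BYTES * NUM_LINES);  /* identifies the memory block  */

    printf("addr = 0x%08X -> tag = 0x%X, line = %u, offset = %u\n",
           addr, tag, line, offset);
    return 0;
}
```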
Set Associative Cache

- The middle range of designs between a direct-mapped cache and a fully associative cache is called a set-associative cache. In an N-way set-associative cache, a block from main memory can go into N (N > 1) locations in the cache.
- [Diagram: 2-way set-associative cache and main memory.]

Fully Associative Cache

A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.

Here assume the cache has 8 blocks, while memory has 32 blocks:
- Fully associative: block 12 can go anywhere.
- Direct mapped: block 12 can go only into cache block 4 (12 mod 8).
- Set associative: block 12 can go anywhere in set 0 (12 mod 4).

Tuning for Caches

1. Preserve locality.
2. Reduce cache thrashing.
3. Loop blocking when out of cache. (A blocking sketch appears at the end of these notes.)
4. Software pipelining.

Registers

- Registers are the source and destination of most CPU data operations. They hold one element each.
- They are made of static RAM (SRAM), which is more expensive.
- The access time is usually 1-1.5 CPU clock cycles.
- Registers are at the top of the memory subsystem.

The Principle of Locality

- The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
- Two different types of locality:
  - Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  - Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
- For the last 15 years, hardware has relied on locality for speed.

Principles of Locality

- Temporal: an item referenced now will be referenced again soon.
- Spatial: an item referenced now causes its neighbors to be referenced soon.
- Lines, not words, are moved between memory levels, so both principles are satisfied.
- There is an optimal line size based on the properties of the data bus and the memory subsystem design.
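As a small illustration of spatial locality, here is a minimal C sketch that sums the same array in row-major and then column-major order. The array size is an assumed example value; the point is only that the traversal order determines how well cache lines are reused.

```c
/* Sketch: spatial locality and array traversal order. */
#include <stdio.h>

#define N 2048
static double a[N][N];   /* zero-initialized example array */

int main(void)
{
    double sum = 0.0;

    /* Row-major order: consecutive accesses fall in the same cache
     * line, so spatial locality is exploited. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major order: consecutive accesses are N*8 bytes apart,
     * so nearly every access touches a different cache line. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```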
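Here is a minimal sketch of loop blocking (item 3 on the Tuning for Caches list), using a blocked matrix-matrix multiply as the example kernel. The matrix size, block size, and the kernel itself are illustrative assumptions; the lecture does not specify a particular code.

```c
/* Sketch: loop blocking (tiling) of a matrix-matrix multiply so that
 * each tile is reused while it is still resident in cache. */
#include <stdio.h>

#define N 512   /* assumed matrix dimension                      */
#define B  64   /* assumed block size, chosen to fit in cache    */

static double A[N][N], Bm[N][N], C[N][N];   /* zero-initialized examples */

int main(void)
{
    /* Blocked triple loop: work on B x B tiles of A, Bm, and C instead
     * of streaming whole rows and columns through memory. */
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + B; k++)
                            sum += A[i][k] * Bm[k][j];
                        C[i][j] = sum;
                    }

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}
```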