High Performance Computing

Course #: CSI 440/540, High Performance Scientific Computing I, Fall '09

Mark R. Gilder, Email: [email protected]

This course investigates the latest trends in high-performance computing (HPC) evolution and examines key issues in developing algorithms capable of exploiting these architectures.

Grading: Your grade in the course will be based on completion of assignments (40%), a course project (35%), a class presentation (15%), and class participation (10%).

Course Goals
• Understanding of the latest trends in HPC architecture evolution
• Appreciation for the complexities of efficiently mapping algorithms onto HPC architectures
• Familiarity with various program transformations for improving performance
• Hands-on experience in the design and implementation of algorithms for both shared- and distributed-memory parallel architectures using Pthreads, OpenMP, and MPI
• Experience in evaluating the performance of parallel programs

Grades

 40% Homework assignments  35% Final project  15% Class presentations  10% Class participation

Homework

• Usually weekly, with some exceptions
• Must be turned in on time; no late homework assignments will be accepted
• All work must be your own; cheating will not be tolerated
• All references must be cited
• Assignments may consist of problems, programming, or a combination of both

Homework (continued)

• Detailed discussion of your results is expected; the program is only a small part of the problem

• Homework assignments will be posted on the class website along with all of the lecture notes:
◦ http://www.cs.albany.edu/~gilder

Project

• Topic of general interest to the course
• Read three or four papers to get some ideas of the latest research in the HPC field
• Identify an application that can be implemented on our RIT cluster
• Implement it and write a final report describing your application and results
• Present your results in class

Other Remarks

• Would like the course to be very interactive

• Willing to accept suggestions for changes in content and/or form

Material

• Book(s) optional
◦ Introduction to Parallel Computing, 2nd Edition, by Grama et al. ISBN: 0-201-64865-2

◦ The Sourcebook of Parallel Computing (The Morgan Kaufmann Series) by J. Dongarra, I. Foster, G. Fox, et al. ISBN: 1558608710

◦ Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers by B. Wilkinson and M. Allen ISBN: 0136717101

Material

• Lecture notes will be provided online, either before or just after class

• Other reading material may be assigned

• Course website: http://www.cs.albany.edu/~gilder/

Course Overview

• Learning about:
◦ High-Performance Computing (HPC)
◦ Parallel Computing
◦ Performance Analysis
◦ Computational Techniques
◦ Tools to aid in parallel programming
◦ Developing programs using MPI, Pthreads, maybe OpenMP, maybe CUDA

What You Should Learn

• In-depth understanding of:
◦ When parallel computing is useful
◦ Parallel computing options
◦ Programming models (overview)
◦ Performance analysis and tuning

Background

• Strong C programming experience

• Understanding of operating systems

• Some background in numerical computing

Computer Accounts

• For most of the class we will be using the RIT computer cluster. See the following link for more info about the hardware: http://www.rit.albany.edu/wiki/IBM_pSeries_Cluster

• Accounts will be made available by the second week of class

Homework #1

Implement a version of each of the following operations:

1) Matrix-vector multiplication:

$$c_i = \sum_{j=1}^{n} A_{i,j}\, x_j, \qquad i = 1, \ldots, m$$

2) Matrix multiplication:

$$C_{i,j} = \sum_{k=1}^{n} A_{i,k}\, B_{k,j}, \qquad i, j = 1, \ldots, n$$

The point of this assignment is not to focus on writing software, but rather to look at the performance of each of your implementations and try to explain the observed behavior. You should run several experiments on various systems and provide an analysis of your results, including plots of your data for various values of n between, say, 10 and 5000. Make sure you provide a write-up along with your plots, and be sure to demonstrate that your implementation also generates correct results. Information on various processors may be found at:
http://www.cpu-world.com/CPUs/index.html
http://www.cpu-world.com/sspec/index.html
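As a rough starting point, here is a minimal C sketch of both kernels with simple clock()-based timing. The size n, fill values, and timing approach are illustrative assumptions, not requirements of the assignment:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* c = A * x, where A is m x n (row-major) and x has n entries. */
static void matvec(int m, int n, const double *A, const double *x, double *c)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[(size_t)i * n + j] * x[j];
        c[i] = sum;
    }
}

/* C = A * B, all n x n (row-major), naive triple loop. */
static void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = sum;
        }
}

int main(void)
{
    int n = 1000;  /* one sample size; the assignment asks for a sweep, e.g. 10..5000 */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *B = malloc((size_t)n * n * sizeof *B);
    double *C = malloc((size_t)n * n * sizeof *C);
    double *x = malloc(n * sizeof *x);
    double *c = malloc(n * sizeof *c);
    if (!A || !B || !C || !x || !c) return 1;

    for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0; B[i] = 0.5; }
    for (int i = 0; i < n; i++) x[i] = 2.0;

    /* For small n, wrap each kernel in a repetition loop so the time is measurable. */
    clock_t t0 = clock();
    matvec(n, n, A, x, c);
    double tv = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    matmul(n, A, B, C);
    double tm = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Each sum term costs one multiply + one add: 2*n^2 and 2*n^3 flops. */
    printf("matvec n=%d: %.4f s, %.1f Mflop/s (c[0]=%g, expect %g)\n",
           n, tv, 2.0 * n * n / tv / 1e6, c[0], 2.0 * n);
    printf("matmul n=%d: %.4f s, %.1f Mflop/s (C[0]=%g, expect %g)\n",
           n, tm, 2.0 * n * n * n / tm / 1e6, C[0], 0.5 * n);

    free(A); free(B); free(C); free(x); free(c);
    return 0;
}

Compile with, e.g., gcc -O2 -std=c99; the printed "expect" values give a cheap correctness check for the chosen fill values, and the Mflop/s figures are what you would plot against n.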

Lecture 1 Outline:
◦ HPC Introduction
◦ Motivation
◦ General Computing Trends

Units of High Performance Computing

Term       Expanded        Actual Performance
1 Kflop/s  1 Kiloflop/s    10^3 Flop/sec
1 Mflop/s  1 Megaflop/s    10^6 Flop/sec
1 Gflop/s  1 Gigaflop/s    10^9 Flop/sec
1 Tflop/s  1 Teraflop/s    10^12 Flop/sec
1 Pflop/s  1 Petaflop/s    10^15 Flop/sec

Data
1 KB   1 Kilobyte   10^3 Bytes
1 MB   1 Megabyte   10^6 Bytes
1 GB   1 Gigabyte   10^9 Bytes
1 TB   1 Terabyte   10^12 Bytes
1 PB   1 Petabyte   10^15 Bytes

HPC Highlights

High Performance Computing: an overloaded term, but one that typically refers to the use of computer clusters and/or custom supercomputers to solve large-scale scientific computations. It essentially covers computers designed for large computations and/or data-intensive tasks. These systems rely on parallel processing to increase algorithm performance (speed-up).

Example applications include:
• Computational Fluid Dynamics (CFD)
• Large-scale modeling / simulation
• Bioinformatics
• Molecular dynamics
• Financial

HPC Highlights

HPC Attributes:
• Multiple processors: 10s, 100s, 1000s
• High-speed interconnect network, e.g., InfiniBand, GigE, etc.
• Clusters typically built from COTS / commodity components
• Supercomputers built from a mix of commodity and custom components
• Performance typically in the Teraflop range (10^12 floating-point operations / sec)

HPC Data Example

• Let's say you can print:
◦ 5 columns of 100 numbers each, on both sides of the page = 1,000 numbers (1 Kflop) in one second (1 Kflop/s)

• 10^6 numbers (1 Mflop) = 1,000 pages (about 10 cm)
◦ 2 reams of paper / second
◦ 1 Mflop/s

• 10^9 numbers (1 Gflop) = 10,000 cm = a 100 m stack
◦ the height of the Statue of Liberty, printed per second: 1 Gflop/s


• 10^12 numbers (1 Tflop) = a 100 km stack; the altitude achieved by SpaceShipOne, printed per second: 1 Tflop/s


• 10^15 numbers (1 Pflop) = a 100,000 km stack printed per second: 1 Pflop/s
◦ about 1/4 the distance to the moon


• 10^16 numbers (10 Pflop) = a 1,000,000 km stack printed per second: 10 Pflop/s
◦ the distance to the moon and back, and then a bit
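All of the stack heights above follow from two constants in the setup: 10^3 numbers per sheet and roughly 0.1 mm per sheet (1,000 pages ≈ 10 cm):

$$h(N) = \frac{N}{10^{3}} \times 0.1\ \text{mm}; \qquad \text{e.g. } h(10^{12}) = 10^{9} \times 0.1\ \text{mm} = 100\ \text{km}.$$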

High Performance Computing Today

• In the past decade, the world has experienced one of the most exciting periods in computer development

• Microprocessors have become smaller, denser, and more powerful

• The result is that microprocessor-based supercomputing is rapidly becoming the technology of choice for attacking some of the most important problems in science and engineering

Lecture 1 Outline:
◦ HPC Introduction
◦ Motivation
◦ General Computing Trends

HPC Motivation

Currently:
• Increases in performance have been accomplished by increases in clock speed

• Power and heat dissipation limits have clock frequencies stagnating

• Legacy software investments are at risk

HPC Motivation

• Legacy software is based on a single thread of execution

• We need to start thinking in parallel

• The lack of compiler/language tools means more painful and costly software development cycles

HPC Motivation

• Parallel or concurrent designs lead to multicore
• However, multicore may not be enough for some problems
• Heterogeneous systems to the rescue: these consist of a collection of processors designed for specific problems, tied together
• Heterogeneous multicore: the same thing on a chip

Result: a diversity of computing architectures with limited tools for exploiting their capabilities.

HPC Motivation

In the past, parallel computing efforts have shown great promise; however, in the end, uniprocessor computing has always won.

Why? We could just wait 12-18 months and the same algorithms would perform twice as fast! ... UNTIL NOW!

This shift towards increasing parallelism is not due to any new breakthroughs in software and/or parallel architectures but instead due to greater challenges in further improving uniprocessor architectures.

General Purpose Computing is taking an irreversible step toward parallel architectures.

Lecture 1 Outline:
◦ HPC Introduction
◦ Motivation
◦ General Computing Trends

Transistor Growth – Moore’s Law

[Figure: transistors per chip vs. year, 1970-2010, on a log scale from about 10^3 to 10^9: 8008, 8080, 8086, 286, Pentium, Pentium II, Pentium III, Itanium, Itanium 2-9M, Montecito]

• Moore’s law states that transistor densities will double every 2 years. Traditionally this has held and, in fact, the cost per function has dropped, on average, 27% per year.
• Many other components of computer systems have followed similar exponential growth curves, including data storage capacity and transmission speeds. It really has been unprecedented technological and economic growth.
• Smaller transistors have lots of nice effects, including the ability to switch faster. This, combined with other component speedups, causes many to relate Moore’s law to a 2x improvement in performance every two years.

Moore’s “Law”

Gordon Moore (co-founder of Intel), Electronics Magazine, 1965: the number of devices per chip doubles every 18 months.

• Moore’s Law is an exponential
◦ Exponentials cannot last forever
◦ However, Moore’s Law has held remarkably true for almost 30 years!
◦ Note: not really a law, but rather an observed / empirical result

Transistor Growth – Moore’s Law

Fun Facts
• The original transistor built by Bell Labs in 1947 could be held in your hand, while hundreds of Intel’s new 45nm transistors can fit on the surface of a single red blood cell.
• Intel’s first processor, the 4004, debuted in 1971 and consisted of 2,300 transistors. Compare that to the 800 million transistors found in today’s Intel quad-core processors (an increase by a factor of 350,000).
• The price of a transistor in one of Intel’s forthcoming next-generation processors, codenamed Penryn, will be about one millionth the average price of a transistor in 1968. If car prices had fallen at the same rate, a new car today would cost about 1 cent.
• You could fit more than 2,000 45nm transistors across the width of a human hair.
• A 45nm transistor can switch on and off approximately 300 billion times a second. A beam of light travels less than a tenth of an inch during the time it takes a 45nm transistor to switch on and off.
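Taking "today" as 2008 and the fun-fact numbers at face value, the 4004-to-quad-core growth is consistent with the two-year doubling period quoted above:

$$\frac{8 \times 10^{8}}{2300} \approx 3.5 \times 10^{5} \approx 2^{18.4}, \qquad \frac{2008 - 1971}{18.4} \approx 2\ \text{years per doubling}.$$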

Where Has The Speed Come From?

Processor architecture improvements

[Timeline figure, 1988-2006, processors from the 486 and Pentium through POWER3, AMD K7, and Athlon 64 X2: after bipolar technology hit its thermal limit, performance increases came from caches and on-board FPUs, pipelining, superscalar architectures, branch prediction, out-of-order execution, MMX, SSE, SSE2, SSE3, HyperThreading, and dual core]

Instruction Level Parallelism (ILP)

[Figure: relative performance per cycle vs. year, 1985-2005]

The figure shows how effective real Intel processors have been at extracting instruction parallelism over time. There is a flat region before instruction-level parallelism was pursued intensely, then a steep rise as parallelism was utilized usefully, followed by a tapering off in recent years as the available parallelism has become fully exploited.

Today’s Processors

• Equivalences for today’s microprocessors:
◦ Voltage level: a flashlight (~1 volt)
◦ Current level: an oven (~250 amps)
◦ Power level: a light bulb (~100 watts)
◦ Area: a postage stamp (~1 square inch)

Power Density

Power Densities

Processor Voltages

[Figure: processor core voltage (Vdd) vs. year, 1970-2020; data courtesy Intel]

Core voltage has been reduced from 18 V in 1970 to 1.2 V today. Currently, a conventional silicon-based transistor requires a minimum voltage of 0.7 V to perform a transition.

Power

$$P = A C V^{2} f + \tau A V I_{short} + V I_{leak}$$

The three terms:
1. Dynamic power consumption: charging and discharging of the capacitive load on each gate’s output. It is proportional to the frequency of the system’s operation, f, the activity of the gates in the system, A, the total capacitance seen by the gates’ outputs, C, and the square of the supply voltage, V.
2. Short-circuit power: the short-circuit current, I_short, which momentarily flows (for a time τ) between the supply voltage and ground when a logic gate’s output switches.
3. Power loss due to the leakage current, I_leak.

In today’s circuits the first term dominates; therefore, reducing the supply voltage is the most effective way to reduce power consumption. The savings can be significant: halving the voltage reduces the power consumption to one-fourth its original value.
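As a quick check of that claim, keep only the dominant dynamic term:

$$P_{dyn} = A C V^{2} f \quad\Longrightarrow\quad A C \left(\frac{V}{2}\right)^{2} f = \frac{1}{4}\, A C V^{2} f = \frac{P_{dyn}}{4}.$$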

CPU Clock Speed Stagnation

[Figure: clock speed development of AMD and Intel processors from 1993 until the end of 2005.] Between 1993 and 1999, the average clock speed increased tenfold. Then stagnation set in; over the past four years, frequencies haven't even doubled.

$$f_{max} \propto \frac{(V - V_{threshold})^{2}}{V}$$

Unfortunately, the maximum frequency of operation is roughly linear in V. Reducing V limits the circuit to a lower frequency: reducing the power to one-fourth its original value only halves the maximum frequency. These two equations have an important corollary:

Parallel processing, which involves splitting a computation in two and running it as two parallel independent tasks, has the potential to cut the power in half without slowing the computation.
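Spelling that corollary out: cutting a core’s power to one-fourth halves its frequency, so two such cores together deliver the original throughput at half the original power:

$$P_{core} = \frac{P}{4} \;\Rightarrow\; f_{core} \approx \frac{f}{2}; \qquad P_{total} = 2 \cdot \frac{P}{4} = \frac{P}{2}, \qquad \text{throughput} = 2 \cdot \frac{f}{2} = f.$$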

Power Cost of Frequency

• Power ≈ Voltage² × Frequency (V²F)
• Frequency ≈ Voltage
• Power ≈ Frequency³

                    Cores   V       Freq    Perf    Power   PE (perf/power)
Superscalar         1       1       1       1       1       1
“New” Superscalar   1X      1.5X    1.5X    1.5X    3.3X    0.45X
Multicore           2X      0.75X   0.75X   1.5X    0.8X    1.88X
(A larger PE is better.)

• The example illustrates that we can achieve 50% more performance with 20% less power
• The multicore row demonstrates that multiple slower devices can be better than one superfast device
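These entries are just the Power ≈ Frequency³ rule, applied and rounded as on the slide:

$$1.5^{3} = 3.375 \approx 3.3X, \quad \frac{1.5}{3.3} \approx 0.45X; \qquad 2 \times 0.75^{3} = 0.84 \approx 0.8X, \quad \frac{1.5}{0.8} \approx 1.88X.$$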


Power Cost of Frequency (continued)

                                 Cores   V      Freq   Perf   Power    PE (perf/power)
Superscalar                      1       1      1      1      1        1
Multicore (2 cores, per core)    1       0.5X   0.5X   0.5X   0.125X   4 (combined)

• Much better performance efficiency; however, it requires that the software can be parallelized!
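Again the numbers follow from the cubic rule, here for the halved-voltage case:

$$P_{core} = 0.5^{2} \times 0.5 = 0.125X; \qquad P_{total} = 0.25X, \quad Perf_{total} = 2 \times 0.5X = 1X, \quad PE = \frac{1X}{0.25X} = 4.$$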

4 GHz Barrier

• Can’t make reasonable computers out of current technology past 4 GHz

• Cooling the devices becomes very difficult

• We just can’t get the heat out of the cores efficiently

• Power density on today’s chips is staggering

Summary

• Power is a primary design constraint
• Clock frequencies are stagnating
• ILP has reached a roadblock
