High Performance Computing

Course #: CSI 441/541 High Perf Sci Comp II Spring ‘10

Mark R. Gilder Email: [email protected] [email protected] CSI 441/541

This course investigates the latest trends in high-performance computing (HPC) evolution and examines key issues in developing algorithms capable of exploiting these architectures.

Grading: Your grade in the course will be based on completion of assignments (40%), a course project (35%), a class presentation (15%), and class participation (10%).

Course Goals
 Understanding of the latest trends in HPC architecture evolution
 Appreciation for the complexities in efficiently mapping algorithms onto HPC architectures
 Hands-on experience in design and implementation of algorithms for both shared & distributed memory parallel architectures using MPI
 Experience in evaluating performance of parallel programs

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 2 Grades

 40% Homework assignments  35% Final project  15% Class presentations  10% Class participation

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 3 Homework

 Usually weekly, with some exceptions
 Must be turned in on time – no late homework assignments will be accepted
 All work must be your own – cheating will not be tolerated
 All references must be cited
 Assignments may consist of problems, programming, or a combination of both

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 4 Homework (continued)

 Detailed discussion of your results is expected ◦ the actual program is only a small part of the problem

 Homework assignments will be posted on the class website along with all of the lecture notes ◦ http://www.cs.albany.edu/~gilder

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 5 Project

 Topic of general interest to the course
 Read three or four papers to get some ideas of the latest research in the HPC field
 Identify an application that can be implemented on our RIT cluster
 Implement it and write a final report describing your application and results
 Present your results in class

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 6 Other Remarks

 Would like the course to be very interactive

 Willing to accept suggestions for changes in content and/or form

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 7 Material

 Book(s) optional ◦ Introduction to Parallel Computing, 2nd Edition, by Grama et al. ISBN: 0-201-64865-2

◦ The Sourcebook of Parallel Computing (The Morgan Kaufmann Series) by J. Dongarra, I. Foster, G. Fox et al. ISBN: 1558608710

◦ Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers by B. Wilkinson and M. Allen ISBN: 0136717101

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 8 Material

 Lecture notes will be provided online either before or just after class

 Other reading material may be assigned

 Course website: http://www.cs.albany.edu/~gilder/

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 9 Course Overview

 Learning about: ◦ High-Performance Computing (HPC) ◦ Parallel Computing ◦ Performance Analysis ◦ Computational Techniques ◦ Tools to aid in parallel programming ◦ Developing programs using MPI, Pthreads, maybe OpenMP, maybe CUDA
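
To give a concrete feel for the MPI programming we will use on the cluster, here is a minimal sketch of an MPI "hello world" in C (nothing course-specific is assumed; on most installations it would be built with mpicc and launched with mpirun, though the exact commands depend on the local setup):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                /* start the MPI runtime           */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id (0..size-1)   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes       */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                        /* shut down the MPI runtime       */
    return 0;
}

Every process runs the same program; the rank is what distinguishes the work each one does.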

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 10 What You Should Learn

 In depth understanding of: ◦ When parallel computing is useful ◦ Understanding of parallel computing options ◦ Overview of programming models ◦ Performance analysis and tuning

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 11 Background

 Strong C-programming experience

 Knowledge of parallel programming

 Some background in numerical computing

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 12 Computer Accounts

 For most of the class we will be using the RIT computer cluster. See the following link for more info about the hardware: http://www.rit.albany.edu/wiki/IBM_pSeries_Cluster

 Accounts will be made available by the second week of class

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 13 Lecture 1 Outline: ◦ General Computing Trends ◦ HPC Introduction ◦ Why Parallel Computing

◦ Note: some portions of the following slides were taken from Jack Dongarra w/permission

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 15 Units of High Performance Computing

Term       Expanded        Actual Performance
1 Kflop/s  1 Kiloflop/s    10^3 Flop/sec
1 Mflop/s  1 Megaflop/s    10^6 Flop/sec
1 Gflop/s  1 Gigaflop/s    10^9 Flop/sec
1 Tflop/s  1 Teraflop/s    10^12 Flop/sec
1 Pflop/s  1 Petaflop/s    10^15 Flop/sec

Data
1 KB   1 Kilobyte   10^3 Bytes
1 MB   1 Megabyte   10^6 Bytes
1 GB   1 Gigabyte   10^9 Bytes
1 TB   1 Terabyte   10^12 Bytes
1 PB   1 Petabyte   10^15 Bytes

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 16 HPC Highlights High Performance Computing: Overloaded term but typically refers to the use of computer clusters and/or custom supercomputers to solve large-scale scientific computations. Essentially covers computers which are designed for large computations and/or data intensive tasks. These systems rely on parallel processing to increase algorithm performance (speed-up).

Example Applications Include:  Computational Fluid Dynamics (CFDs)  Large Scale Modeling / Simulation  Bioinformatics  Molecular Dynamics  Financial

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 17 HPC Highlights

HPC Attributes:
 Multiple processors – 10's, 100's, 1000's
 High-speed interconnect network, e.g., InfiniBand, GigE, etc.
 Clusters typically built from COTS / commodity components
 Supercomputers built from a mix of both commodity and custom components
 Performance typically in the Teraflop range (10^12 floating point operations / sec)

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 18 HPC Trends

 HPC at an all-time high of $11.6 billion in 2007 http://insidehpc.com/2008/02/21/idc-releases-preliminary-hpc-growth-figures/
  and Linux are the current winners
 Clustering continues to drive change
◦ Price/performance reset
◦ Dramatic increase in units
◦ Growth in standardization of CPU, OS, interconnect
 Grid technologies will become pervasive
 New challenges for datacenters:
◦ Power, cooling, system management, system consolidation, virtualization?
 Storage and data management are growing in importance

Software will be the #1 roadblock – and hence will provide major opportunities at all levels. Source: IDC Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 19 HPC Data Example

 Let's say you can print ◦ 5 columns of 100 numbers each, on both sides of the page = 1,000 numbers (1 Kflop) in one second (1 Kflop/s)

 10^6 numbers (1 Mflop) = 1,000 pages (about 10 cm) ◦ 2 reams of paper / second ◦ 1 Mflop/s

 10^9 numbers (1 Gflop) = 10,000 cm = 100 m stack ◦ height of the Statue of Liberty (printed / second) ◦ 1 Gflop/s

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 20 HPC Data Example
 Let's say you can print ◦ 5 columns of 100 numbers each, on both sides of the page = 1,000 numbers (1 Kflop) in one second (1 Kflop/s)
 10^6 numbers (1 Mflop) = 1,000 pages (about 10 cm) ◦ 2 reams of paper / second ◦ 1 Mflop/s
 10^9 numbers (1 Gflop) = 10,000 cm = 100 m stack ◦ height of the Statue of Liberty (printed / second) ◦ 1 Gflop/s

 10^12 numbers (1 Tflop) = 100 km stack; the altitude achieved by SpaceShipOne ◦ 1 Tflop/s

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 21 HPC Data Example
 Let's say you can print ◦ 5 columns of 100 numbers each, on both sides of the page = 1,000 numbers (1 Kflop) in one second (1 Kflop/s)
◦ 10^6 numbers (1 Mflop) = 1,000 pages (about 10 cm); 2 reams paper / sec
◦ 10^9 numbers (1 Gflop) = 10,000 cm = 100 m stack; height of the Statue of Liberty per sec
◦ 10^12 numbers (1 Tflop) = 100 km stack; SpaceShipOne's distance to space per sec

 10^15 numbers (1 Pflop) = 100,000 km stack printed per second; 1 Pflop/s ◦ about ¼ of the distance to the moon

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 22 HPC Data Example
 Let's say you can print ◦ 5 columns of 100 numbers each, on both sides of the page = 1,000 numbers (1 Kflop) in one second (1 Kflop/s)
◦ 10^6 numbers (1 Mflop) = 1,000 pages (about 10 cm); 2 reams paper / sec
◦ 10^9 numbers (1 Gflop) = 10,000 cm = 100 m stack; height of the Statue of Liberty per sec
◦ 10^12 numbers (1 Tflop) = 100 km stack; SpaceShipOne's distance to space per sec
◦ 10^15 numbers (1 Pflop) = 100,000 km stack printed per sec

 10^16 numbers (10 Pflop) = 1,000,000 km stack printed per second; 10 Pflop/s ◦ the distance to the moon and back and then a bit
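
The arithmetic behind these slides is easy to reproduce. Below is a small illustrative C program that uses only the slide's assumptions (1,000 numbers per page, and 1,000 pages ≈ 10 cm of paper) to convert a printing rate in numbers/second into pages and stack height per second:

#include <stdio.h>

/* Slide's assumptions: 1,000 numbers fit on one page (5 columns x 100
 * numbers, both sides), and 1,000 pages stack about 10 cm high,
 * i.e. 1e-4 m of paper per page. */
int main(void)
{
    const double numbers_per_page = 1000.0;
    const double metres_per_page  = 0.10 / 1000.0;   /* 10 cm per 1000 pages */
    const double rates[] = { 1e6, 1e9, 1e12, 1e15 }; /* Mflop/s ... Pflop/s  */

    for (int i = 0; i < 4; i++) {
        double pages  = rates[i] / numbers_per_page;
        double height = pages * metres_per_page;
        printf("%8.0e numbers/s -> %14.0f pages/s, stack %12.0f m/s\n",
               rates[i], pages, height);
    }
    return 0;
}

Running it reproduces the slides' figures: 10 cm/s at 1 Mflop/s, 100 m/s at 1 Gflop/s, 100 km/s at 1 Tflop/s, and 100,000 km/s at 1 Pflop/s.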

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 23 High Performance Computing Today

 In the past decade, the world has experienced one of the most exciting periods in computer development

 Microprocessors have become smaller, denser, and more powerful

 The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 24 Transistor Growth – Moore's Law

[Chart: transistors per chip vs. year, 1970-2010, log scale from 1.E+03 to 1.E+10 – 8008, 8080, 8086, 286, Pentium, Pentium II, Pentium III, Itanium, Itanium 2-9M, Montecito]

 Moore's law states that transistor densities will double every 2 years. Traditionally, this has held and, in fact, the cost per function has dropped, on average, 27% per year.
 Many other components of computer systems have followed similar exponential growth curves, including data storage capacity and transmission speeds. It really has been unprecedented technological and economic growth.
 Smaller transistors have lots of nice effects, including the ability to switch transistors faster. This, combined with other component speedups, causes many to relate Moore's law to a 2x improvement in performance every two years.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 25 Moore’s “Law”

 Moore's Law is an exponential
◦ Exponentials cannot last forever
◦ However, Moore's Law has held remarkably true for almost 30 years!
◦ Note: not really a law, rather an observed / empirical result
Gordon Moore (co-founder of Intel), Electronics Magazine, 1965: the number of devices per chip doubles every 18 months

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 26 Transistor Growth – Moore’s Law

Fun Facts
 The original transistor built by Bell Labs in 1947 could be held in your hand, while hundreds of Intel's new 45nm transistors can fit on the surface of a single red blood cell.
 Intel's first processor, the 4004, debuted in 1971 and consisted of 2,300 transistors. Compare that to the 800 million transistors found in today's Intel Quad Core (an increase by a factor of roughly 350,000).
 The price of a transistor in one of Intel's forthcoming next-generation processors – codenamed Penryn – will be about 1 millionth the average price of a transistor in 1968. If car prices had fallen at the same rate, a new car today would cost about 1 cent.
 You could fit more than 2,000 45nm transistors across the width of a human hair.
 A 45nm transistor can switch on and off approximately 300 billion times a second. A beam of light travels less than a tenth of an inch during the time it takes a 45nm transistor to switch on and off.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 27 Where Has The Speed Come From?

Processor architecture improvements

[Timeline, 1988-2006: bipolar designs hit their thermal limit around 1988; performance then increases through caches and on-board FPUs, pipelining, superscalar architectures, branch prediction, out-of-order execution, MMX, SSE, SSE2, SSE3, HyperThreading, and dual-core x86. Representative processors: 486, Pentium, POWER3, AMD K7, Athlon 64 X2.]

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 28 Instruction Level Parallelism (ILP)

[Chart: relative performance per cycle vs. year, 1985-2005, for Intel processors]

The figure shows how effective real Intel processors have been at extracting instruction parallelism over time. There is a flat region before instruction-level parallelism was pursued intensely, then a steep rise as parallelism was utilized usefully, followed by a tapering off in recent years as the available parallelism has become fully exploited.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 29 Today’s Processors

 Equivalences for today’s microprocessors ◦ Voltage Level  A flashlight (~1 volt) ◦ Current Level  An Oven (~250 amps) ◦ Power Level  A light bulb (~100 watts) ◦ Area  A postage stamp (~ 1 square inch)

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 30 Power Density

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 31 Power Densities

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 32 Processor Voltages

[Chart: processor supply voltage (Vdd) vs. year, 1970-2020, falling from roughly 20 V toward about 1 V. Data courtesy Intel]

Core voltage has been reduced from about 18 V in 1970 to roughly 1.2 V today. Currently, a conventional silicon-based transistor still requires a minimum voltage of about 0.7 V to perform one transition.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 33 Power

P ≈ ACV^2f + τAVIshort + VIleak

1. Dynamic power consumption – charging and discharging of the capacitive load on each gate's output. It is proportional to the frequency of the system's operation, f, the activity of the gates in the system, A, the total capacitance seen by the gates' outputs, C, and the square of the supply voltage, V.
2. Short-circuit power – the short-circuit current, Ishort, which momentarily flows (for a time τ) between the supply voltage and ground when a logic gate's output switches.
3. Power loss due to the leakage current, Ileak.

In today's circuits the first term dominates; therefore, reducing the supply voltage is the most effective way to reduce power consumption. The savings can be significant, e.g., halving the voltage reduces the power consumption to one-fourth its original value.
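
As a quick numerical illustration of why the V^2 term matters, here is a tiny C sketch that evaluates only the dynamic term ACV^2f; the parameter values are made up for illustration and are not from the slide:

#include <stdio.h>

/* Dynamic power term P_dyn = A*C*V^2*f from the slide; the parameter
 * values below are illustrative only. */
static double p_dyn(double A, double C, double V, double f)
{
    return A * C * V * V * f;
}

int main(void)
{
    double A = 0.2;       /* fraction of gates switching each cycle */
    double C = 1e-9;      /* total switched capacitance, farads     */
    double f = 3e9;       /* clock frequency, Hz                    */

    double p_full = p_dyn(A, C, 1.2, f);   /* V = 1.2 V */
    double p_half = p_dyn(A, C, 0.6, f);   /* V = 0.6 V */
    printf("P_dyn at 1.2 V: %.3f W, at 0.6 V: %.3f W (ratio %.2f)\n",
           p_full, p_half, p_half / p_full);   /* ratio comes out to 0.25 */
    return 0;
}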

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 34 CPU Clock Speed Stagnation

This is a graph of the clock speed development of AMD and Intel processors from 1993 until the end of 2005. Between 1993 and 1999, the average clock speed increased tenfold. Then stagnation set in; over the past four years, frequencies haven't even doubled.

fmax ∝ (V − Vthreshold)^2 / V

Unfortunately, the maximum frequency of operation is roughly linear in V. Reducing V limits the circuit to a lower frequency; reducing the power to one-fourth its original value only halves the maximum frequency. These two equations have an important corollary:

Parallel processing, which involves splitting a computation in two and running it as two parallel independent tasks, has the potential to cut the power in half without slowing the computation: each of the two processors can run at roughly one-fourth of the original power and half of the original frequency, so together they match the original throughput at about half the total power.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 35 Power Cost of Frequency

 Power ≈ Voltage^2 x Frequency (V^2F)  Frequency ≈ Voltage  Power ≈ Frequency^3

                    Cores   V      Freq   Perf   Power   PE (freq/power)
Superscalar         1       1      1      1      1       1
"New" Superscalar   1X      1.5X   1.5X   1.5X   3.3X    0.45X

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 36 Power Cost of Frequency

 Power ≈ Voltage^2 x Frequency (V^2F)  Frequency ≈ Voltage  Power ≈ Frequency^3

                    Cores   V       Freq    Perf   Power   PE (perf/power)
Superscalar         1       1       1       1      1       1
"New" Superscalar   1X      1.5X    1.5X    1.5X   3.3X    0.45X
Multicore           2X      0.75X   0.75X   1.5X   0.8X    1.88X
(Larger # is better)

 This example illustrates that we can achieve 50% more performance with 20% less power  The multicore row demonstrates that multiple slower devices can be better than one super-fast device (a sketch of the arithmetic follows)
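
The table's numbers follow directly from the two rules above: frequency ∝ voltage, so power ∝ frequency^3 per core, while performance ∝ cores × frequency. A small illustrative C sketch that reproduces them:

#include <stdio.h>

/* Reproduce the slide's scaling argument: power ~ frequency^3 per core,
 * performance ~ cores * frequency (all values relative to the baseline). */
static void row(const char *name, double cores, double freq)
{
    double perf  = cores * freq;
    double power = cores * freq * freq * freq;
    printf("%-20s cores=%.0f freq=%.2fx perf=%.2fx power=%.2fx perf/power=%.2fx\n",
           name, cores, freq, perf, power, perf / power);
}

int main(void)
{
    row("Superscalar",         1, 1.00);   /* baseline                    */
    row("\"New\" superscalar", 1, 1.50);   /* ~1.5x perf, ~3.4x power     */
    row("Multicore",           2, 0.75);   /* ~1.5x perf, ~0.8x power     */
    return 0;
}

The output matches the table to within rounding (the slide rounds intermediate values, hence 3.3X and 1.88X rather than 3.38X and 1.78X).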

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 39 Percentage of Peak

 A rule of thumb that often applies ◦ A contemporary RISC processor, for a spectrum of applications, delivers (i.e., sustains) 10% of peak performance

 Why such low efficiency?

 There are two primary reasons behind the disappointing percentage of peak ◦ IPC (in)efficiency ◦ Memory (in)efficiency

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 40 IPC (Instructions / Cycle)

 Today the theoretical IPC (instructions per cycle) is 4 in most contemporary RISC processors (6 in Itanium)

 Detailed analysis for a spectrum of applications indicates that the average IPC is 1.2–1.4

 We are leaving ~75% of the possible performance on the table…

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 41 Why Fast Machines Run Slow

 Latency ◦ Waiting for access to memory or other parts of the system  Overhead ◦ Extra work that has to be done to manage program concurrency and parallel resources for the real work to be performed  Starvation ◦ Not enough work to do due to insufficient parallelism or poor load balancing among distributed resources  Contention ◦ Delays due to fighting over what task gets to use a shared resource next. Network bandwidth is a major constraint.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 42 Latency In A Single System

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 43 Memory Hierarchy

 Typical latencies for today’s data access technologies

Hierarchy         Processor Clocks
Register          1
L1 Cache          2-3
L2 Cache          6-12
L3 Cache          14-40
Near Memory       100-300
Far Memory        300-900
Remote Memory     O(10^3)
Message-Passing   O(10^3)-O(10^4)

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 44 Memory Hierarchy

 Most programs have a high degree of locality in their accesses ◦ Spatial Locality: accessing things nearby previous accesses ◦ Temporal Locality: reusing an item that was previously accessed  Memory hierarchy tries to exploit locality
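
A standard way to see spatial locality and the memory hierarchy at work (a generic demonstration, not specific to this course) is to traverse a large 2D array in row-major versus column-major order. The illustrative C sketch below does just that; on most machines the stride-1 loop is several times faster:

#include <stdio.h>
#include <time.h>

#define N 4096

/* 4096 x 4096 doubles = 128 MB, far larger than any cache. */
static double a[N][N];

static double sum_rows(void)          /* row-major: stride-1, good locality */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

static double sum_cols(void)          /* column-major: stride-N, poor locality */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;

    clock_t t0 = clock();
    double s1 = sum_rows();
    clock_t t1 = clock();
    double s2 = sum_cols();
    clock_t t2 = clock();

    printf("row-major: %.3fs  column-major: %.3fs  (sums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    return 0;
}

Exact timings depend on the machine and compiler flags; the point is only the relative difference caused by cache-line reuse.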

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 45 Memory Bandwidth

 To provide bandwidth to the processor, the bus either needs to be faster or wider
 Busses are limited to perhaps 400-800 MHz
 Links (point-to-point) are faster
◦ Increased link frequencies increase error rates, requiring coding and redundancy, thus increasing power and die size and not helping bandwidth
 Making things wider requires pin-out (Si real estate) and power
◦ Both power and pin-out are serious issues

From www.naplestech.com Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 46 Processor-DRAM Memory Gap

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 47 Processor in Memory (PIM)

 PIM merges logic with memory ◦ Wide ALUs next to the row buffer ◦ Optimized for memory throughput, not ALU utilization  PIM has the potential of riding Moore’s Law while ◦ Greatly increasing effective memory bandwidth ◦ Providing many more concurrent execution threads ◦ Reducing latency ◦ Reducing power ◦ Increasing overall system efficiency  It may also simplify programming and system design

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 48 Internet – 4th Revolution in Telecommunications  Telephone, Radio, Television  Growth in Internet outstrips the others  Exponential growth since 1985  Traffic doubles every 100 days

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 49 Peer-to-Peer Computing

 Peer-to-peer is a style of networking in which a group of computers communicate directly with each other
 Wireless communication
 Home computer in the utility room, next to the water heater and furnace
 Web tablets
 BitTorrent ◦ http://en.wikipedia.org/wiki/Bittorrent
 Embedded computers in things all tied together ◦ Books, furniture, milk cartons, etc.
 Smart appliances ◦ Refrigerator, scale, etc.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 50 Internet On Everything

Smart Refrigerator Internet controlled refrigerator-oven

Smart laundry room appliances

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 51 SETI@home: Global Distributed Computing  Running on 500,000 PCs, ~1000 CPU Years per Day ◦ 485,821 CPU Years so far  Sophisticated Data & Signal Processing Analysis  Distributes Datasets from Arecibo Radio Telescope

Berkeley Open Infrastructure for Network Computing

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 52 SETI@home

 Use thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence
 When a participant's computer is idle, the software downloads a 300-kilobyte chunk of data for analysis; each client performs about 3 Tflop of work over roughly 15 hours
 The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants
 Largest distributed computation project in existence ◦ Averaging 40 Tflop/s
 Today a number of companies are trying this for profit

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 53
 Google query attributes
◦ 150M queries/day (2,000/second)
◦ 100 countries
◦ 8B documents in the index
 Data centers
◦ 100,000 Linux systems in data centers around the world
 15 TFlop/s and 1,000 TB total capability
 40-80 1U/2U servers per cabinet
 100 Mbit Ethernet switches per cabinet with gigabit Ethernet uplink
◦ Growth from 4,000 systems (June 2000)
 18M queries then
 Performance and operation
◦ Simple reissue of failed commands to new servers
◦ No performance debugging – problems are not reproducible

Source: Monika Henzinger, Google & Cleve Moler

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 54 General Computing Trends

In the past, parallel computing efforts have shown great promise; however, in the end, uniprocessor computing has always won.

Why? Because we could just wait 12-18 months and the same algorithms would perform twice as fast ... until now!

This shift towards increasing parallelism is not due to any new breakthroughs in software and/or parallel architectures but instead due to greater challenges in further improving uniprocessor architectures.

General Purpose Computing is taking an irreversible step toward parallel architectures.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 55 Why Parallel Computing

• Power is a primary design constraint • Clock frequencies are stagnating • ILP has reached a roadblock • Can’t make reasonable computers out of current technology past 4GHz • Cooling the devices becomes very difficult – just can’t get the heat out of the cores efficiently • Power density on today’s chips is staggering • Fundamental limits are being approached

• Desire to solve bigger, more realistic applications problems • More cost effective solution

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 56 MultiCore

[Chart: threads per socket vs. year, 2002-2017, log scale from 1 to 10,000 – the multi-core era of scalar and parallel applications, followed by massively parallel applications. Data courtesy Intel]

However, going multi-core introduces a whole new set of problems!

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 57 Considerations

 Rising software development costs  Increased compute system complexity  Lack of compiler technology solutions  Large volume of legacy code  Future changes that will make us start over again

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 58 Current Environment

 Lack of robust cross-platform development frameworks
 Parallel programming support implemented as an afterthought via language extensions and/or pragmas
 Developers *must* understand low-level details of both the algorithm and the architecture in order to be successful
◦ Processor interconnect network
◦ I/O performance
◦ Cache coherency
◦ Memory hierarchy

Parallel Programming is still mostly an art

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 59 Principles of Parallel Computing

 Parallelism and Amdahl’s Law  Granularity  Locality  Load balance  Coordination and synchronization  Performance modeling

All of these make parallel programming even harder than sequential programming

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 60 “Implicit” Parallelism In Current Machines  Bit level parallelism ◦ within floating point operations, etc.  Instruction level parallelism (ILP) ◦ multiple instructions execute per clock cycle  Memory system parallelism ◦ overlap of memory operations with computation  OS parallelism ◦ multiple jobs run in parallel on commodity SMPs

There are limits to all of these – for very high performance, the user needs to identify, schedule, and coordinate parallel tasks

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 61 Finding Enough Parallelism

 Suppose only part of an application can be parallelized  Amdahl’s law

◦ Let fs be the fraction of work done sequentially; then (1 − fs) = fp is the fraction that can be parallelized ◦ Also let N = the number of processors  Even if the parallel part speeds up perfectly, the sequential part can severely limit the performance gains

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 62 Amdahl’s Law

Amdahl’s Law:

 Amdahl's law states that if fs is the fraction of a calculation that is sequential, and (1 − fs) is the fraction that can be parallelized (fp ), then the maximum speedup that can be achieved by using N processors is

S = 1 / (fs + (1 − fs)/N)            – effect of multiple processors on speedup

tN = (fs + (1 − fs)/N) * t1          – effect of multiple processors on run time

 In the limit, as N tends to infinity, the maximum speedup tends to 1/ fs.

 As an example, if fs is only 10%, the problem can be sped up by at most a factor of 10, no matter how large the value of N used (see the sketch below).
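
A direct transcription of the speedup formula into C, evaluated for fs = 10% (nothing beyond the formula itself is assumed):

#include <stdio.h>

/* Amdahl's law from the slide: S(N) = 1 / (fs + (1 - fs)/N),
 * where fs is the sequential fraction and N the number of processors. */
static double amdahl(double fs, double n)
{
    return 1.0 / (fs + (1.0 - fs) / n);
}

int main(void)
{
    const double fs = 0.10;                        /* 10% sequential work */
    const int procs[] = { 1, 2, 4, 16, 64, 1024 };

    for (int i = 0; i < 6; i++)
        printf("N = %4d  speedup = %6.2f\n", procs[i], amdahl(fs, procs[i]));
    printf("limit as N -> infinity: %.2f\n", 1.0 / fs);   /* = 10 */
    return 0;
}

Even with 1,024 processors the speedup is only about 9.9: the 10% sequential fraction caps it at 10.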

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 63 Illustration of Amdahl’s Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 64 Overhead of Parallelism

 Given enough parallel work, this is the biggest barrier to getting desired speedup
 Parallelism overheads include:
◦ cost of starting a thread or process
◦ cost of communicating shared data
◦ cost of synchronizing
◦ extra (redundant) computation
 Each of these can be in the range of milliseconds (= millions of flops) on some systems
 Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work to do (see the sketch below)
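
A toy model of the granularity tradeoff described in the last bullet (all numbers are made up for illustration): a fixed amount of work is split into equal tasks over a set of processors, and every task pays a fixed startup/communication overhead. Too few tasks leaves processors idle; too many drowns the useful work in overhead.

#include <stdio.h>

static double run_time(double total_flops, double flops_per_sec,
                       int procs, int tasks, double overhead_s)
{
    double per_task     = total_flops / tasks / flops_per_sec;  /* compute time per task */
    int tasks_per_proc  = (tasks + procs - 1) / procs;          /* ceiling               */
    return tasks_per_proc * (per_task + overhead_s);
}

int main(void)
{
    const double work = 1e12, rate = 1e9, overhead = 1e-3;      /* 1 ms per task */
    const int procs = 64;
    const int grain[] = { 16, 64, 1024, 1000000 };

    for (int i = 0; i < 4; i++)
        printf("%8d tasks -> %8.2f s\n", grain[i],
               run_time(work, rate, procs, grain[i], overhead));
    return 0;
}

With these numbers, 16 tasks leave 48 of 64 processors idle (~62 s), 64-1,024 tasks run near the ideal ~15.6 s, and a million tasks spend as long on overhead as on computation (~31 s).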

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 65 Locality And Parallelism

 Large memories are slow, fast memories are small  Storage hierarchies are large and fast on average  Parallel processors, collectively, have large, fast performance capabilities ◦ the slow accesses to “remote” data we call “communication”  Algorithm should do most work on local data

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 66 Load Imbalance

 Load imbalance is the time that some processors in the system are idle due to ◦ insufficient parallelism (during that phase) ◦ unequal size tasks  Examples of the latter ◦ adapting to “interesting parts of a domain” ◦ tree-structured computations ◦ fundamentally unstructured problems  Algorithm needs to balance load

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 67 What’s Ahead

 Greater instruction level parallelism?  Bigger caches?  Multiple processors per chip?  Complete systems on a chip?  High performance LAN, Interface, and Interconnect

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 68 Directions

 Move toward shared memory ◦ SMPs and Distributed Shared Memory ◦ Shared address space w/deep memory hierarchy

 Clustering of shared memory machines for scalability

 Efficiency of message passing and data parallel programming ◦ Helped by standards efforts such as MPI and HPF

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 69 High Performance Computers

 ~ 20 years ago
◦ 1x10^6 Floating Point Ops/sec (Mflop/s)
  Scalar based
 ~ 10 years ago
◦ 1x10^9 Floating Point Ops/sec (Gflop/s)
  Vector & shared memory computing, bandwidth aware
  Block partitioned, latency tolerant
 ~ Today
◦ 1x10^12 Floating Point Ops/sec (Tflop/s)
  Highly parallel, distributed processing, message passing, network based
  Data decomposition, communication/computation
 ~ 1 year away
◦ 1x10^15 Floating Point Ops/sec (Pflop/s)
  Many more levels of memory hierarchy, combination/grids & HPC
  More adaptive, bandwidth aware, fault tolerant, extended precision, attention to SMP nodes

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 70 Top 500 Computers

 Listing of the 500 most powerful Computers in the World

 Yardstick: Rmax from LINPACK MPP ◦ Ax=b, dense problem  Updated twice a year ◦ SuperComputing Conference (SC‘xy) in the US every November

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 71 What Is A Supercomputer
 A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
 Over the last 10 years the range for the Top500 has increased faster than Moore's Law
◦ 1993: #1 = 59.7 GFlop/s, #500 = 422 MFlop/s
◦ 2007: #1 = 478 TFlop/s, #500 = 5.9 TFlop/s
Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security (in particular cryptanalysis and simulating nuclear weapons), to name a few.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 72 Architecture/Systems Continuum

Top 500 Rmax by system type

[Chart: share of Top 500 Rmax by system type (custom, hybrid, commodity), June 1993 – June 2004, ranging from tightly coupled to loosely coupled systems]

 Custom processor with custom interconnect
◦ Cray X1
◦ NEC SX-8
◦ IBM Regatta
◦ IBM Blue Gene/L
 Commodity processor with custom interconnect (hybrid)
◦ SGI Altix (Intel Itanium 2)
◦ Cray XT3, XD1 (AMD Opteron)
 Commodity processor with commodity interconnect
◦ Clusters – Pentium, Itanium, Opteron, Alpha; GigE, Infiniband, Myrinet, Quadrics
◦ NEC TX7
◦ IBM eServer
◦ Dawning

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 73 The Top-10 Supercomputers November 2008

Rank / Site / Computer, Year, Vendor / Cores / Rmax / Rpeak / Power
1. DOE/NNSA/LANL, United States – Roadrunner: BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband, 2008, IBM – 129600 cores, Rmax 1105.00, Rpeak 1456.70, Power 2483.47
2. Oak Ridge National Laboratory, United States – Jaguar: Cray XT5 QC 2.3 GHz, 2008, Cray Inc. – 150152 cores, Rmax 1059.00, Rpeak 1381.40, Power 6950.60
3. NASA/Ames Research Center/NAS, United States – Pleiades: SGI Altix ICE 8200EX, QC 3.0/2.66 GHz, 2008, SGI – 51200 cores, Rmax 487.01, Rpeak 608.83, Power 2090.00
4. DOE/NNSA/LLNL, United States – BlueGene/L: eServer Blue Gene Solution, 2007, IBM – 212992 cores, Rmax 478.20, Rpeak 596.38, Power 2329.60
5. Argonne National Laboratory, United States – Blue Gene/P Solution, 2007, IBM – 163840 cores, Rmax 450.30, Rpeak 557.06, Power 1260.00
6. Texas Advanced Computing Center/Univ. of Texas, United States – Ranger: SunBlade x6420, Opteron QC 2.3 Ghz, Infiniband, 2008, Sun Microsystems – 62976 cores, Rmax 433.20, Rpeak 579.38, Power 2000.00
7. NERSC/LBNL, United States – Franklin: Cray XT4 QuadCore 2.3 GHz, 2008, Cray Inc. – 38642 cores, Rmax 266.30, Rpeak 355.51, Power 1150.00
8. Oak Ridge National Laboratory, United States – Jaguar: Cray XT4 QuadCore 2.1 GHz, 2008, Cray Inc. – 30976 cores, Rmax 205.00, Rpeak 260.20, Power 1580.71
9. NNSA/Sandia National Laboratories, United States – Red Storm: Sandia/Cray Red Storm, XT3/4, 2.4/2.2 GHz dual/quad core, 2008, Cray Inc. – 38208 cores, Rmax 204.20, Rpeak 284.00, Power 2506.00
10. Shanghai Supercomputer Center, China – Dawning 5000A: QC Opteron 1.9 Ghz, Infiniband, Windows HPC 2008, 2008, Dawning – 30720 cores, Rmax 180.60, Rpeak 233.47

Rmax and Rpeak values are in Tflop/s; Power is in KW for the entire system.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 74 Performance Development

[Chart: Top500 performance development over time, annotated "~10 years", "6-8 years", and "My Laptop 8-10 years"]

http://www.top500.org/lists/2008/11/performance_development Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 75 Performance Projection

[Chart: Top500 performance projection, annotated "My Laptop 8-10 years"]

http://www.top500.org/lists/2008/11/performance_development Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 76 Top 500 Conclusions

 Microprocessor-based supercomputers have brought a major change in accessibility and affordability  MPPs continue to account for more than half of all installed high-performance computers worldwide

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 77 Distributed And Parallel Systems

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 78 Highly Parallel Supercomputing Where Are We?

 Performance: ◦ Sustained performance has dramatically increased during the last year. ◦ On most applications, sustained performance per dollar now exceeds that of conventional supercomputers. But... ◦ Conventional systems are still faster on some applications.

 Languages and compilers: ◦ Standardized, portable, high-level languages such as HPF, PVM and MPI are available. But ... ◦ Initial HPF releases are not very efficient. ◦ Message passing programming is tedious and hard to debug. ◦ Programming difficulty remains a major obstacle to usage by mainstream scientists.

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 79 Highly Parallel Supercomputing Where Are We?

 Operating Systems: ◦ Robustness and reliability are improving. ◦ New system management tools improve system utilization. But... ◦ Reliability still not as good as conventional systems.

 I/O Subsystems: ◦ New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance. But… ◦ I/O remains a major bottleneck

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 80 The Importance Of Standards Software  Writing programs for MPP is hard ...  But ... they need not be one-off efforts if written in a standard language  Past lack of parallel programming standards ... ◦ ... has restricted uptake of the technology (to "enthusiasts") ◦ ... reduced portability (over a range of current architectures and between future generations)  Now standards exist (PVM, MPI & HPF), which ... ◦ ... allow users & manufacturers to protect their software investment ◦ … encourage growth of a "third party" parallel software industry & parallel versions of widely used codes

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 81 The Importance Of Standards Hardware  Processors ◦ commodity RISC processors  Interconnects ◦ high bandwidth, low latency communications protocol ◦ no de-facto standard yet (ATM, Fibre Channel, HiPPI, FDDI)  Growing demand for total solution: ◦ robust hardware + usable software  HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 82 The Future of HPC

 The expense of being different is being replaced by the economics of being the same  HPC needs to lose its "special purpose" tag  Still has to bring about the promise of scalable general purpose computing ...  ... but it is dangerous to ignore this technology  Final success when MPP technology is embedded in desktop computing  Yesterday's HPC is today's mainframe is tomorrow's workstation

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 83 Achieving TeraFlops (10^12 Flops/sec)

 In 1991, 1 Gflop/sec
 1000-fold increase in performance from:
◦ Architecture – exploiting parallelism (processor, communication, memory)
◦ Moore's Law (hopefully!)
◦ Algorithmic improvements – block-partitioned algorithms

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 84 Future: Petaflops (10^15 Flops/sec)

 10^15 flops, for current workstations, is about a year of computing: a Pflop for 1 second ≈ a typical workstation computing for 1 year
 From an algorithmic standpoint:

 Concurrency
 Data locality
 Latency & synchronization
 Floating point accuracy
 Dynamic redistribution of workload
 New language constructs
 Role of numerical libraries
 Algorithm adaptation to hardware failure

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 85 A Petaflop Computer System – Today? 2 GHz processor (O(10^9) ops/s) ◦ 0.5 million PCs ◦ $0.5B ($1K each) ◦ 100 MWatts ◦ 5 acres ◦ 0.5 million Windows licenses!! ◦ a PC failure every second

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 86 A Petaflop Computer System

 1 Pflop/s sustained computing  Between 10,000 and 1,000,000 processors  Between 10 TB and 1PB main memory  Commensurate I/O bandwidth, mass store, etc.  May be feasible and “affordable” by the year 2010

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 87 IDC – HPC Survey

 The following depict recent trends in HPC growth ◦ copied w/Permission for distribution within class only ◦ DO NOT REDISTRIBUTE WITHOUT PERMISSION

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 88 HPC Growth

[Chart: all HPC server revenue in $ billions (axis 0-12), by year, 1999-2006]

Source IDC Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 89 HPC – By Processor Type

[Chart: revenue percentage by CPU type – EPIC, RISC, vector, x86-32, x86-64 – by year, 2000-2006]

Source IDC Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 90 HPC – By OS

[Chart: revenue percentage by O/S – Linux, Unix, W/NT, Other – by year, 2000-2006]

Source IDC Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 91 HPC – Clusters Cluster Market Penetration

[Chart: cluster vs. non-cluster share of the HPC market by quarter, Q1 2003 – Q4 2006]

Source IDC Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 92 HPC – Industry Trends

                                 2000         2005         2010
Bio-sciences                     $585,358     $1,433,807   $2,218,965
CAE                              $839,669     $1,108,510   $1,978,185
Chemical engineering             $324,601     $222,466     $447,418
DCC & distribution               $113,464     $513,684     $627,877
Economics/financial              $203,067     $254,967     $483,848
EDA                              $457,630     $648,477     $1,036,687
Geosciences and geo-engineering  $181,863     $489,452     $800,670
Mechanical design and drafting   $114,185     $155,843     $198,902
Defense                          $760,065     $811,335     $1,651,183
Government lab                   $688,555     $1,375,964   $1,671,932
Software engineering             $93,634      $19,774      $18,531
Technical management             $275,562     $101,561     $73,023
University/academic              $933,386     $1,699,966   $2,498,767
Weather                          $171,127     $358,978     $494,431
Other                            $153,769     $3,238       $64,744
Total revenue                    $5,895,934   $9,198,020   $14,265,164
(Revenue figures are in $ thousands.)

Source IDC Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 93 HPC Issues
 Clusters are still hard to use and manage
◦ Power, cooling and floor space are major issues
◦ Third party software costs
◦ Weak interconnect performance at all levels
◦ Applications & programming — hard to scale beyond a node
◦ RAS is a growing issue
◦ Storage and data management
◦ Multi-processor type support and accelerator support
 Requirements are diverging
◦ High-end — need more, but is a shrinking segment
◦ Mid and lower end — the mainstream will look more for complete solutions
◦ New entrants — ease-of-use will drive them, plus need applications
• Parallel software is missing for most users
• And will get weaker in the near future — software will be the #1 roadblock
• Multi-core will cause many issues to "hit-the-wall"

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 Source IDC 94 Custom HPC Architectures

 IBM BlueGene/L  IBM Cell Processor Sony/IBM/Toshiba

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 95 HPC – Current Top Performer BlueGene/L
Top HPC Performer – BlueGene/L (as of 8/2007), from http://www.top500.org/, located at Lawrence Livermore National Laboratory
 BlueGene/L boasts a peak speed of over 360 teraFLOPS, a total memory of 32 terabytes, total power of 1.5 megawatts, and machine floor space of 2,500 square feet. The full system has 65,536 dual-processor compute nodes.
 Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communications
 A global reduction tree supports fast global operations such as global max/sum in a few microseconds over 65,536 nodes
 Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds
 1,024 gigabit-per-second links to a global parallel file system support fast input/output to disk

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 96 BlueGene/L

http://www.llnl.gov/asc/computing_resources/bluegenel/ Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 97 BlueGene/L ASIC

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 98 HPC – IBM Cell Processor Sony/IBM/Toshiba

 Summer 2000: Sony, Toshiba, and IBM came together to design Sony's new PlayStation 3 processor
 March 2001: STI Design Center opened in Austin – a joint investment of $400,000,000
 Sony figured out that it's not really FLOPS that are the problem, it's the data: if you can keep it fed and balanced, the Cell looks very promising
 Heterogeneous model can be very difficult to program
 1 Power Processing Unit (PPU)
 8 Synergistic Processing Units (SPU)
 Element Interconnect Bus (EIB)
 Memory Interface Controller (MIC)
 2 Configurable I/O Connections (FlexIO)

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 99 Cell Power Processing Unit (PPU)

• Simple 64-bit Power PC core (PPE)  In-order, dual-issue Pipeline  Symmetric Multi-Threaded (2-threads)  VMX (AltiVec) Vector Unit  512KB L2 Cache • Responsible for running the OS and coordinating SPUs

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 100 Synergistic Processing Unit (SPU)

Consists of…
• Synergistic Processing Element (SPE)
◦ SIMD vector processor, single-precision FP
◦ 128-entry, 128-bit register file
◦ No branch prediction; uses software hints
• 256KB Local Store (LS)
◦ NOT a cache – holds instructions & data
• Memory Flow Controller (MFC)
◦ Handles DMA & mailbox communications

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 101 Element Interconnect Bus (EIB)

 Connects PPU, MIC, SPUs, IO ports  4 Rings ◦ 2 Clockwise, 2 CCW  Peak Aggregate Bandwidth of 204.8GB/s  Sustained: 197GB/s

Mark R. Gilder CSI 441/541 – SUNY Albany Spring '10 102