High Performance Computing
Course #: CSI 441/541, High Performance Scientific Computing II, Spring '10
Mark R. Gilder, Email: [email protected]
This course investigates the latest trends in high-performance computing (HPC) evolution and examines key issues in developing algorithms capable of exploiting these architectures.
Grading: Your grade in the course will be based on completion of assignments (40%), a course project (35%), a class presentation (15%), and class participation (10%).
Course Goals
◦ Understanding of the latest trends in HPC architecture evolution
◦ Appreciation for the complexities of efficiently mapping algorithms onto HPC architectures
◦ Hands-on experience in design and implementation of algorithms for both shared- and distributed-memory parallel architectures using MPI
◦ Experience in evaluating performance of parallel programs
Mark R. Gilder, CSI 441/541, SUNY Albany, Spring '10

Grades
◦ 40% Homework assignments
◦ 35% Final project
◦ 15% Class presentations
◦ 10% Class participation
Homework
◦ Usually weekly, with some exceptions
◦ Must be turned in on time; no late homework assignments will be accepted
◦ All work must be your own; cheating will not be tolerated
◦ All references must be cited
◦ Assignments may consist of problems, programming, or a combination of both
Homework (continued)
Detailed discussion of your results is expected ◦ the actual program is only a small part of the problem
Homework assignments will be posted on the class website along with all of the lecture notes ◦ http://www.cs.albany.edu/~gilder
Project
◦ Topic of general interest to the course
◦ Read three or four papers to get ideas about the latest research in the HPC field
◦ Identify an application that can be implemented on our RIT cluster
◦ Implement it and write a final report describing your application and results
◦ Present your results in class
Other Remarks
Would like the course to be very interactive
Willing to accept suggestions for changes in content and/or form
Material
Book(s) optional
◦ Introduction to Parallel Computing, 2nd Edition, by Grama et al. ISBN: 0-201-64865-2
◦ The Sourcebook of Parallel Computing (The Morgan Kaufmann Series) by J. Dongarra, I. Foster, G. Fox, et al. ISBN: 1558608710
◦ Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers by B. Wilkinson and M. Allen. ISBN: 0136717101
Material
Lecture notes will be provided online either before or just after class
Other reading material may be assigned
Course website: http://www.cs.albany.edu/~gilder/
Course Overview
Learning about: ◦ High-Performance Computing (HPC) ◦ Parallel Computing ◦ Performance Analysis ◦ Computational Techniques ◦ Tools to aid in parallel programming ◦ Developing programs using MPI, Pthreads, maybe OpenMP, maybe CUDA
What You Should Learn
In depth understanding of: ◦ When parallel computing is useful ◦ Understanding of parallel computing options ◦ Overview of programming models ◦ Performance analysis and tuning
Background
Strong C-programming experience
Knowledge of parallel programming
Some background in numerical computing
Computer Accounts
For most of the class we will be using the RIT computer cluster. See the following link for more info about the hardware: http://www.rit.albany.edu/wiki/IBM_pSeries_Cluster
Accounts will be made available by the second week of class
Lecture 1 Outline:
◦ General Computing Trends
◦ HPC Introduction
◦ Why Parallel Computing
◦ Note: some portions of the following slides were taken from Jack Dongarra w/permission
Units of High Performance Computing
Performance:
  1 Kflop/s = 1 Kiloflop/s = 10^3 Flop/sec
  1 Mflop/s = 1 Megaflop/s = 10^6 Flop/sec
  1 Gflop/s = 1 Gigaflop/s = 10^9 Flop/sec
  1 Tflop/s = 1 Teraflop/s = 10^12 Flop/sec
  1 Pflop/s = 1 Petaflop/s = 10^15 Flop/sec

Data:
  1 KB = 1 Kilobyte = 10^3 Bytes
  1 MB = 1 Megabyte = 10^6 Bytes
  1 GB = 1 Gigabyte = 10^9 Bytes
  1 TB = 1 Terabyte = 10^12 Bytes
  1 PB = 1 Petabyte = 10^15 Bytes
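As a quick sanity check on these prefixes, here is a tiny conversion helper, sketched in Python for brevity (the course itself uses C/MPI); the `format_flops` name and its thresholds are illustrative, not part of the course material:

```python
def format_flops(flops_per_sec):
    """Render a raw Flop/sec figure using the prefixes in the table above."""
    units = [("Pflop/s", 1e15), ("Tflop/s", 1e12), ("Gflop/s", 1e9),
             ("Mflop/s", 1e6), ("Kflop/s", 1e3)]
    for name, scale in units:
        if flops_per_sec >= scale:
            return f"{flops_per_sec / scale:g} {name}"
    return f"{flops_per_sec:g} Flop/s"

# 478.2 Tflop/s appears later in these slides as BlueGene/L's Rmax
print(format_flops(478.2e12))  # -> 478.2 Tflop/s
```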
HPC Highlights

High Performance Computing: an overloaded term, but it typically refers to the use of computer clusters and/or custom supercomputers to solve large-scale scientific computations. Essentially it covers computers designed for large computations and/or data-intensive tasks. These systems rely on parallel processing to increase algorithm performance (speed-up).
Example applications include:
◦ Computational Fluid Dynamics (CFD)
◦ Large-scale modeling / simulation
◦ Bioinformatics
◦ Molecular dynamics
◦ Financial modeling
HPC Highlights
HPC attributes:
◦ Multiple processors – 10s, 100s, 1000s
◦ High-speed interconnect network, e.g., InfiniBand, GigE, etc.
◦ Clusters typically built from COTS / commodity components
◦ Supercomputers built from a mix of commodity and custom components
◦ Performance typically in the Teraflop range (10^12 floating point operations / sec)
HPC Trends
◦ HPC spending at an all-time high of $11.6 billion in 2007 (http://insidehpc.com/2008/02/21/idc-releases-preliminary-hpc-growth-figures/)
◦ x86 and Linux are the current winners
◦ Clustering continues to drive change: price/performance reset, dramatic increase in units, growth in standardization of CPU, OS, interconnect
◦ Grid technologies will become pervasive
◦ New challenges for datacenters: power, cooling, system management, system consolidation, virtualization
◦ Storage and data management are growing in importance
◦ Software will be the #1 roadblock – and hence will provide major opportunities at all levels (Source: IDC)

HPC Data Example
Let's say you can print 5 columns of 100 numbers each, on both sides of a page: 1,000 numbers (1 Kflop) in one second = 1 Kflop/s
◦ 10^6 numbers (1 Mflop) = 1,000 pages (about 10 cm), i.e., 2 reams of paper per second = 1 Mflop/s
◦ 10^9 numbers (1 Gflop) = 10,000 cm = 100 m stack, the height of the Statue of Liberty, printed per second = 1 Gflop/s
◦ 10^12 numbers (1 Tflop) = 100 km stack, the altitude achieved by SpaceShipOne, printed per second = 1 Tflop/s
◦ 10^15 numbers (1 Pflop) = 100,000 km stack printed per second (1 Pflop/s): about ¼ of the distance to the moon
◦ 10^16 numbers (10 Pflop) = 1,000,000 km stack printed per second (10 Pflop/s): the distance to the moon and back, and then a bit
High Performance Computing Today
In the past decade, the world has experienced one of the most exciting periods in computer development
Microprocessors have become smaller, denser, and more powerful
The result is that microprocessor-based supercomputing is rapidly becoming the technology of preference in attacking some of the most important problems of science and engineering
Transistor Growth – Moore's Law

[Chart: transistors per chip vs. year (1970–2010, log scale), from the 8008 and 8080 through the 8086, 286, 386, 486, Pentium, Pentium II/III/4, Itanium, Itanium 2, and Montecito, spanning roughly 10^3 to 10^9 transistors.]
Moore's law states that transistor densities will double every 2 years. Traditionally this has held, and in fact the cost per function has dropped on average 27% per year. Many other components of computer systems have followed similar exponential growth curves, including data storage capacity and transmission speeds. It really has been unprecedented technological and economic growth. Smaller transistors have lots of nice effects, including the ability to switch transistors faster. This, combined with other component speedups, causes many to relate Moore's law to a 2x improvement in performance every two years.
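The two-year doubling is easy to check numerically. A back-of-the-envelope sketch in Python, assuming a clean doubling every 2 years from the 4004's 2,300 transistors in 1971 (figures quoted later in these slides; the function name is mine):

```python
def transistors(year, base_year=1971, base_count=2300):
    """Idealized Moore's-law projection: transistor count doubles every 2 years."""
    return base_count * 2 ** ((year - base_year) / 2)

# 37 years of doubling takes the 4004's 2,300 transistors to ~8.5e8,
# consistent with the ~800 million transistors of a 2008-era quad core
print(f"{transistors(2008):.2e}")  # -> 8.53e+08
```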
Moore's "Law"
Gordon Moore (co-founder of Intel), Electronics Magazine, 1965: the number of devices per chip doubles every 18 months
◦ Moore's Law is an exponential, and exponentials cannot last forever
◦ However, Moore's Law has held remarkably true for almost 30 years!
◦ Note: not really a law, rather an observed / empirical result
Transistor Growth – Moore's Law
Fun Facts
◦ The original transistor, built by Bell Labs in 1947, could be held in your hand, while hundreds of Intel's new 45 nm transistors can fit on the surface of a single red blood cell.
◦ Intel's first processor, the 4004, debuted in 1971 and consisted of 2,300 transistors. Compare that to the 800 million transistors found in today's Intel quad-core processors (an increase by a factor of roughly 350,000).
◦ The price of a transistor in one of Intel's forthcoming next-generation processors -- codenamed Penryn -- will be about 1 millionth the average price of a transistor in 1968. If car prices had fallen at the same rate, a new car today would cost about 1 cent.
◦ You could fit more than 2,000 45 nm transistors across the width of a human hair.
◦ A 45 nm transistor can switch on and off approximately 300 billion times a second. A beam of light travels less than a tenth of an inch during the time it takes a 45 nm transistor to switch on and off.
Where Has The Speed Come From?
Processor architecture improvements
[Chart: increasing x86 performance vs. year (1988–2006), from the 486 and Pentium through POWER3, AMD K7, Pentium 4, and Athlon 64 X2, approaching the bipolar thermal limit. Contributing features along the way: on-chip cache and FPU, pipelining, superscalar architectures, branch prediction, out-of-order execution, MMX/SSE/SSE2/SSE3, HyperThreading, and dual-core designs.]
Instruction Level Parallelism (ILP)

[Chart: relative performance per cycle vs. year, 1985–2005.]
The figure shows how effective real Intel processors have been at extracting instruction parallelism over time. There is a flat region before instruction-level parallelism was pursued intensely, then a steep rise as parallelism was utilized usefully, followed by a tapering off in recent years as the available parallelism has become fully exploited.
Today's Processors
Equivalences for today's microprocessors
◦ Voltage level: a flashlight (~1 volt)
◦ Current level: an oven (~250 amps)
◦ Power level: a light bulb (~100 watts)
◦ Area: a postage stamp (~1 square inch)
Power Density
Power Densities
Processor Voltages

[Chart: processor core voltage (Vdd) vs. year, 1970–2020, falling from roughly 18 V toward ~1 V. Data courtesy Intel.]
Core voltage has been reduced from 18 V (1970) to 1.2 V today. Currently, a conventional silicon-based transistor requires a minimum voltage of 0.7 V to perform one transition.
Power

    P = A·C·V²·f + τ·A·V·I_short·f + V·I_leak
          (1)           (2)            (3)

1. Dynamic power consumption: charging and discharging of the capacitive load on each gate's output. It is proportional to the frequency of the system's operation f, the activity of the gates in the system A, the total capacitance seen by the gates' outputs C, and the square of the supply voltage V.
2. Short-circuit power: the short-circuit current I_short that momentarily flows (for a time τ) between the supply voltage and ground when a logic gate's output switches.
3. Power loss due to the leakage current I_leak.
In today's circuits the first term dominates, therefore reducing the supply voltage is the most effective way to reduce power consumption. The savings can be significant: halving the voltage reduces the power consumption to one-fourth its original value.
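A numeric check of the dominant first term (a sketch in Python; the component values here are invented purely for illustration):

```python
def dynamic_power(A, C, V, f):
    """Term 1 of the power equation above: P_dyn = A * C * V^2 * f."""
    return A * C * V**2 * f

# invented example values: activity 0.5, 1 nF switched capacitance, 2 GHz clock
p_full = dynamic_power(A=0.5, C=1e-9, V=1.2, f=2e9)
p_half = dynamic_power(A=0.5, C=1e-9, V=0.6, f=2e9)
print(round(p_half / p_full, 10))  # -> 0.25: halving V quarters dynamic power
```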
CPU Clock Speed Stagnation
This is a graph of the clock speed development of AMD and Intel processors from 1993 until the end of 2005. Between 1993 and 1999, the average clock speed increased tenfold. Then stagnation set in; over the past four years, frequencies haven't even doubled.
    f_max ∝ (V − V_threshold)² / V
Unfortunately, the maximum frequency of operation is roughly linear in V, so reducing V limits the circuit to a lower frequency: reducing the power to one-fourth its original value only halves the maximum frequency. These two equations have an important corollary:
Parallel processing, which involves splitting a computation in two and running it as two parallel independent tasks, has the potential to cut the power in half without slowing the computation.
Power Cost of Frequency
Power ≈ Voltage² × Frequency (V²F)
Frequency ≈ Voltage
⇒ Power ≈ Frequency³
                    Cores   V      Freq   Perf   Power   PE (perf/power)
Superscalar         1       1      1      1      1       1
"New" Superscalar   1X      1.5X   1.5X   1.5X   3.3X    0.45X
Power Cost of Frequency
Power ≈ Voltage² × Frequency (V²F)
Frequency ≈ Voltage
⇒ Power ≈ Frequency³
                    Cores   V      Freq   Perf   Power   PE (perf/power)
Superscalar         1       1      1      1      1       1
"New" Superscalar   1X      1.5X   1.5X   1.5X   3.3X    0.45X
Multicore           2X      0.75X  0.75X  1.5X   0.8X    1.88X
(larger PE is better)
This example illustrates that we can achieve 50% more performance with 20% less power. Multicore demonstrates that multiple slower devices can outperform one super-fast device.
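The table's rows can be reproduced from the P ≈ V²F rule. A sketch in Python (the table's 3.3X and 0.8X entries are the rounded versions of the exact values computed here):

```python
def relative_power(v, f, cores=1):
    """P ~ cores * V^2 * F, relative to a one-core baseline with V = F = 1."""
    return cores * v**2 * f

# "New" superscalar: one core at 1.5x frequency needs ~1.5x voltage
p_fast  = relative_power(v=1.5,  f=1.5)            # 3.375   (table: 3.3X)
# Multicore: two cores at 0.75x voltage and frequency; perf = 2 * 0.75 = 1.5x
p_multi = relative_power(v=0.75, f=0.75, cores=2)  # 0.84375 (table: 0.8X)

perf = 1.5                        # both designs deliver 1.5x performance
print(round(perf / p_fast, 2))    # -> 0.44 perf/power (table: 0.45X)
print(round(perf / p_multi, 2))   # -> 1.78 perf/power
```

The table's 1.88X comes from dividing by the rounded 0.8X power figure; either way, the multicore design is about 4x more power-efficient than the higher-frequency single core.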
Percentage of Peak
A rule of thumb that often applies ◦ A contemporary RISC processor, for a spectrum of applications, delivers (i.e., sustains) 10% of peak performance
Why such low efficiency?
There are two primary reasons behind the disappointing percentage of peak ◦ IPC (in)efficiency ◦ Memory (in)efficiency
IPC (Instructions / Cycle)
Today the theoretical IPC (instructions per cycle) is 4 in most contemporary RISC processors (6 in Itanium)
Detailed analysis for a spectrum of applications indicates that the average IPC is 1.2–1.4
We are leaving ~75% of the possible performance on the table…
Why Fast Machines Run Slow
Latency ◦ Waiting for access to memory or other parts of the system Overhead ◦ Extra work that has to be done to manage program concurrency and parallel resources for the real work to be performed Starvation ◦ Not enough work to do due to insufficient parallelism or poor load balancing among distributed resources Contention ◦ Delays due to fighting over what task gets to use a shared resource next. Network bandwidth is a major constraint.
Latency In A Single System
Memory Hierarchy
Typical latencies for today’s data access technologies
Hierarchy          Processor Clocks
Register           1
L1 Cache           2-3
L2 Cache           6-12
L3 Cache           14-40
Near Memory        100-300
Far Memory         300-900
Remote Memory      O(10^3)
Message-Passing    O(10^3)-O(10^4)
Memory Hierarchy
Most programs have a high degree of locality in their accesses ◦ Spatial Locality: accessing things nearby previous accesses ◦ Temporal Locality: reusing an item that was previously accessed Memory hierarchy tries to exploit locality
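Traversal order is the classic example of spatial locality. In a row-major layout (C arrays), the first loop below walks adjacent memory while the second strides a full row per access; in pure Python the cache effect is largely hidden by interpreter overhead, so this sketch only makes the two access patterns explicit:

```python
N = 256
a = [[i * N + j for j in range(N)] for i in range(N)]

def sum_row_major(m):
    # inner index j walks adjacent elements: good spatial locality
    return sum(m[i][j] for i in range(len(m)) for j in range(len(m[0])))

def sum_col_major(m):
    # inner index i jumps a whole row per access: poor spatial locality
    return sum(m[i][j] for j in range(len(m[0])) for i in range(len(m)))

# identical result, very different memory behavior in a compiled language
assert sum_row_major(a) == sum_col_major(a) == N * N * (N * N - 1) // 2
```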
Memory Bandwidth
To provide bandwidth to the processor, the bus needs to be either faster or wider
◦ Buses are limited to perhaps 400-800 MHz
◦ Links (point-to-point) are faster, but increased link frequencies increase error rates, requiring coding and redundancy, thus increasing power and die size without helping bandwidth
◦ Making things wider requires pin-out (Si real estate) and power; both power and pin-out are serious issues
(Source: www.naplestech.com)

Processor-DRAM Memory Gap
Processor in Memory (PIM)
PIM merges logic with memory ◦ Wide ALUs next to the row buffer ◦ Optimized for memory throughput, not ALU utilization PIM has the potential of riding Moore’s Law while ◦ Greatly increasing effective memory bandwidth ◦ Providing many more concurrent execution threads ◦ Reducing latency ◦ Reducing power ◦ Increasing overall system efficiency It may also simplify programming and system design
Internet – 4th Revolution in Telecommunications
◦ Telephone, radio, television
◦ Growth in the Internet outstrips the others
◦ Exponential growth since 1985
◦ Traffic doubles every 100 days
Peer-to-Peer Computing
Peer-to-peer is a style of networking in which a group of computers communicate directly with each other
◦ Wireless communication
◦ Home computer in the utility room, next to the water heater and furnace
◦ Web tablets
◦ BitTorrent: http://en.wikipedia.org/wiki/Bittorrent
◦ Embedded computers in things all tied together: books, furniture, milk cartons, etc.
◦ Smart appliances: refrigerator, scale, etc.
Internet On Everything
Smart Refrigerator Internet controlled refrigerator-oven
Smart laundry room appliances
SETI@home: Global Distributed Computing
◦ Running on 500,000 PCs, ~1000 CPU years per day (485,821 CPU years so far)
◦ Sophisticated data & signal processing analysis
◦ Distributes datasets from the Arecibo radio telescope
Berkeley Open Infrastructure for Network Computing
SETI@home
Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence. When a participant's computer is idle, the software downloads a 300-kilobyte chunk of data for analysis; each client performs about 3 Tflop of computation over roughly 15 hours. The results of this analysis are sent back to the SETI team and combined with those of thousands of other participants.
◦ Largest distributed computation project in existence, averaging 40 Tflop/s
◦ Today a number of companies are trying this for profit
Google query attributes
◦ 150M queries/day (2,000/second)
◦ 100 countries
◦ 8B documents in the index
Data centers
◦ 100,000 Linux systems in data centers around the world
  15 TFlop/s and 1,000 TB total capability
  40-80 1U/2U servers per cabinet
  100 Mb Ethernet switches per cabinet with gigabit Ethernet uplink
◦ Growth from 4,000 systems (June 2000), 18M queries then
Performance and operation
◦ Simple reissue of failed commands to new servers
◦ No performance debugging; problems are not reproducible
Source: Monika Henzinger, Google & Cleve Moler
General Computing Trends
In the past, parallel computing efforts have shown great promise; however, in the end, uniprocessor computing has always won.
Why? Because we could just wait 12-18 months and the same algorithms would perform twice as fast! ... UNTIL NOW!
This shift towards increasing parallelism is not due to any new breakthroughs in software and/or parallel architectures but instead due to greater challenges in further improving uniprocessor architectures.
General Purpose Computing is taking an irreversible step toward parallel architectures.
Why Parallel Computing
• Power is a primary design constraint • Clock frequencies are stagnating • ILP has reached a roadblock • Can’t make reasonable computers out of current technology past 4GHz • Cooling the devices becomes very difficult – just can’t get the heat out of the cores efficiently • Power density on today’s chips is staggering • Fundamental limits are being approached
• Desire to solve bigger, more realistic applications problems • More cost effective solution
MultiCore

[Chart: threads per socket vs. year (2002–2017, log scale), growing from ~1 toward thousands: first a multi-core era of scalar and parallel applications, followed by an era of massively parallel applications. Data courtesy Intel.]
However, going multi-core introduces a whole new set of problems!
Considerations
◦ Rising software development costs
◦ Increased compute system complexity
◦ Lack of compiler technology solutions
◦ Large volume of legacy code
◦ Future changes that will make us start over again
Current Environment
◦ Lack of robust cross-platform development frameworks
◦ Parallel programming support implemented as an afterthought via language extensions and/or pragmas
◦ Developers *must* understand low-level details of both the algorithm and the architecture in order to be successful: the processor interconnect network, I/O performance, cache coherency, and the memory hierarchy
Parallel Programming is still mostly an art
Principles of Parallel Computing
Parallelism and Amdahl’s Law Granularity Locality Load balance Coordination and synchronization Performance modeling
All of these make parallel programming even harder than sequential programming
"Implicit" Parallelism In Current Machines
◦ Bit-level parallelism: within floating point operations, etc.
◦ Instruction-level parallelism (ILP): multiple instructions execute per clock cycle
◦ Memory system parallelism: overlap of memory operations with computation
◦ OS parallelism: multiple jobs run in parallel on commodity SMPs
Limits to all of these – for very high performance, need user to identify, schedule and coordinate parallel tasks
Finding Enough Parallelism
Suppose only part of an application can be parallelized Amdahl’s law
◦ Let fs be the fraction of work done sequentially; then fp = (1 − fs) is the fraction that is parallelizable
◦ Also let N = number of processors
Even if the parallel part speeds up perfectly, the sequential part can severely limit the performance gains
Amdahl's Law
Amdahl's law states that if fs is the fraction of a calculation that is sequential, and (1 − fs) is the fraction that can be parallelized (fp), then the maximum speedup that can be achieved by using N processors is

    S = 1 / (fs + (1 − fs)/N)          (effect of multiple processors on speedup)

    tN = fs·t1 + (1 − fs)·t1/N         (effect of multiple processors on run-time)
In the limit, as N tends to infinity, the maximum speedup tends to 1/ fs.
As an example, if fs is only 10%, the problem can be sped up by only a maximum of a factor of 10, no matter how large the value of N used.
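The law is one line of code. A sketch in Python (the function name is mine) reproducing the fs = 10% example:

```python
def amdahl_speedup(fs, n):
    """Maximum speedup with serial fraction fs on n processors."""
    return 1.0 / (fs + (1.0 - fs) / n)

for n in (4, 16, 256, 10**6):
    print(n, round(amdahl_speedup(0.10, n), 2))
# speedup creeps toward, but never reaches, 1/fs = 10
```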
Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors
Overhead of Parallelism
Given enough parallel work, this is the biggest barrier to getting desired speedup Parallelism overheads include: ◦ cost of starting a thread or process ◦ cost of communicating shared data ◦ cost of synchronizing ◦ extra (redundant) computation Each of these can be in the range of milliseconds (=millions of flops) on some systems Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work to do
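The granularity trade-off can be seen in a toy cost model (a sketch; all numbers are invented for illustration): each task pays a fixed overhead, so once the work is spread evenly across processors, chopping it finer only adds cost.

```python
import math

def parallel_time(work, n, overhead_per_task, tasks):
    """Toy model: `work` split into `tasks` equal chunks on n processors;
    every chunk pays a fixed startup/communication overhead."""
    tasks_per_proc = math.ceil(tasks / n)
    return tasks_per_proc * (work / tasks + overhead_per_task)

work, n, ovh = 1e9, 8, 1e6   # flop of work, processors, flop-equivalent overhead
for tasks in (8, 64, 4096):
    print(tasks, parallel_time(work, n, ovh, tasks))
```

In this model 8 tasks on 8 processors is already balanced, so finer decompositions only pay more overhead; with unequal task sizes, finer granularity would instead help load balancing, which is the other side of the trade-off.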
Locality And Parallelism
Large memories are slow, fast memories are small Storage hierarchies are large and fast on average Parallel processors, collectively, have large, fast performance capabilities ◦ the slow accesses to “remote” data we call “communication” Algorithm should do most work on local data
Load Imbalance
Load imbalance is the time that some processors in the system are idle due to ◦ insufficient parallelism (during that phase) ◦ unequal size tasks Examples of the latter ◦ adapting to “interesting parts of a domain” ◦ tree-structured computations ◦ fundamentally unstructured problems Algorithm needs to balance load
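A small sketch of the unequal-task-sizes case; the greedy longest-processing-time heuristic used here is a standard balancing strategy, chosen only for illustration:

```python
def finish_time(task_times, n):
    """Greedy LPT schedule: biggest task first, always to the least-loaded
    of n processors; finish time is the most-loaded processor's total."""
    loads = [0.0] * n
    for t in sorted(task_times, reverse=True):
        loads[loads.index(min(loads))] += t
    return max(loads)

tasks = [5, 4, 3, 3]                   # unequal task sizes, 15 units of work
t = finish_time(tasks, 2)
print(t, "vs. ideal", sum(tasks) / 2)  # -> 8.0 vs. ideal 7.5
```

The 0.5-unit gap is idle time on the less-loaded processor; with perfectly equal tasks the gap vanishes.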
What's Ahead
Greater instruction level parallelism? Bigger caches? Multiple processors per chip? Complete systems on a chip? High performance LAN, Interface, and Interconnect
Directions
Move toward shared memory ◦ SMPs and Distributed Shared Memory ◦ Shared address space w/deep memory hierarchy
Clustering of shared memory machines for scalability
Efficiency of message passing and data parallel programming ◦ Helped by standards efforts such as MPI and HPF
High Performance Computers
~20 years ago: 1×10^6 floating point ops/sec (Mflop/s); scalar based
~10 years ago: 1×10^9 floating point ops/sec (Gflop/s); vector & shared-memory computing, bandwidth aware; block partitioned, latency tolerant
Today: 1×10^12 floating point ops/sec (Tflop/s); highly parallel, distributed processing, message passing, network based; data decomposition, communication/computation
~1 year away: 1×10^15 floating point ops/sec (Pflop/s); many more levels of memory hierarchy; combination of grids & HPC; more adaptive, bandwidth aware, fault tolerant, extended precision, attention to SMP nodes
Top 500 Computers
Listing of the 500 most powerful Computers in the World
Yardstick: Rmax from LINPACK MPP ◦ Ax=b, dense problem Updated twice a year ◦ SuperComputing Conference (SC‘xy) in the US every November
What Is A Supercomputer?

A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.

Over the last 10 years, the performance range of the Top500 has increased faster than Moore's Law:
◦ 1993: #1 = 59.7 GFlop/s; #500 = 422 MFlop/s
◦ 2007: #1 = 478 TFlop/s; #500 = 5.9 TFlop/s

Why do we need them? Almost all of the technical areas that are important to the well-being of humanity use supercomputing in fundamental and essential ways: computational fluid dynamics, protein folding, climate modeling, and national security, in particular for cryptanalysis and for simulating nuclear weapons, to name a few.
Architecture/Systems Continuum
Top 500 Rmax by system type (June 1993 through 2004):

Tightly coupled, custom: custom processor with custom interconnect
◦ Cray X1
◦ NEC SX-8
◦ IBM Regatta
◦ IBM Blue Gene/L

Hybrid: commodity processor with custom interconnect
◦ SGI Altix (Intel Itanium 2)
◦ Cray XT3, XD1 (AMD Opteron)

Loosely coupled, commodity: commodity processor with commodity interconnect
◦ Clusters (Pentium, Itanium, Opteron, Alpha; GigE, Infiniband, Myrinet, Quadrics)
◦ NEC TX7
◦ IBM eServer
◦ Dawning

The Top-10 Supercomputers, November 2008
1. Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband / 2008, IBM - DOE/NNSA/LANL, United States - 129,600 cores; Rmax 1105.00; Rpeak 1456.70; Power 2483.47
2. Jaguar - Cray XT5 QC 2.3 GHz / 2008, Cray Inc. - Oak Ridge National Laboratory, United States - 150,152 cores; Rmax 1059.00; Rpeak 1381.40; Power 6950.60
3. Pleiades - SGI Altix ICE 8200EX, Xeon QC 3.0/2.66 GHz / 2008, SGI - NASA/Ames Research Center/NAS, United States - 51,200 cores; Rmax 487.01; Rpeak 608.83; Power 2090.00
4. BlueGene/L - eServer Blue Gene Solution / 2007, IBM - DOE/NNSA/LLNL, United States - 212,992 cores; Rmax 478.20; Rpeak 596.38; Power 2329.60
5. Blue Gene/P Solution / 2007, IBM - Argonne National Laboratory, United States - 163,840 cores; Rmax 450.30; Rpeak 557.06; Power 1260.00
6. Ranger - SunBlade x6420, Opteron QC 2.3 Ghz, Infiniband / 2008, Sun Microsystems - Texas Advanced Computing Center/Univ. of Texas, United States - 62,976 cores; Rmax 433.20; Rpeak 579.38; Power 2000.00
7. Franklin - Cray XT4 QuadCore 2.3 GHz / 2008, Cray Inc. - NERSC/LBNL, United States - 38,642 cores; Rmax 266.30; Rpeak 355.51; Power 1150.00
8. Jaguar - Cray XT4 QuadCore 2.1 GHz / 2008, Cray Inc. - Oak Ridge National Laboratory, United States - 30,976 cores; Rmax 205.00; Rpeak 260.20; Power 1580.71
9. Red Storm - Sandia/Cray Red Storm, XT3/4, 2.4/2.2 GHz dual/quad core / 2008, Cray Inc. - NNSA/Sandia National Laboratories, United States - 38,208 cores; Rmax 204.20; Rpeak 284.00; Power 2506.00
10. Dawning 5000A - QC Opteron 1.9 Ghz, Infiniband, Windows HPC 2008 / 2008, Dawning - Shanghai Supercomputer Center, China - 30,720 cores; Rmax 180.60; Rpeak 233.47

Rmax and Rpeak values are in Tflop/s; power in kW for the entire system.

Performance Development
[Chart: Top500 performance development over time. Annotations: ~10 years; 6-8 years; "My Laptop" 8-10 years.]
http://www.top500.org/lists/2008/11/performance_development

Performance Projection
[Chart: Top500 performance projection. Annotation: "My Laptop" 8-10 years.]
http://www.top500.org/lists/2008/11/performance_development

Top 500 Conclusions
◦ Microprocessor-based supercomputers have brought a major change in accessibility and affordability
◦ MPPs continue to account for more than half of all installed high-performance computers worldwide
Distributed And Parallel Systems
Highly Parallel Supercomputing: Where Are We?
Performance: ◦ Sustained performance has dramatically increased during the last year. ◦ On most applications, sustained performance per dollar now exceeds that of conventional supercomputers. But... ◦ Conventional systems are still faster on some applications.
Languages and compilers:
◦ Standardized, portable, high-level standards such as HPF, PVM and MPI are available. But...
◦ Initial HPF releases are not very efficient
◦ Message-passing programming is tedious and hard to debug
◦ Programming difficulty remains a major obstacle to usage by mainstream scientists
Highly Parallel Supercomputing: Where Are We?
Operating Systems:
◦ Robustness and reliability are improving.
◦ New system management tools improve system utilization.
But...
◦ Reliability is still not as good as on conventional systems.
I/O Subsystems:
◦ New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance.
But…
◦ I/O remains a major bottleneck.
The Importance Of Standards: Software
Writing programs for MPPs is hard ...
But ... programs need no longer be one-off efforts if written in a standard language.
Past lack of parallel programming standards ...
◦ ... has restricted uptake of the technology (to "enthusiasts")
◦ ... reduced portability (over a range of current architectures and between future generations)
Now standards exist (PVM, MPI & HPF), which ...
◦ ... allow users & manufacturers to protect their software investment
◦ ... encourage growth of a "third party" parallel software industry & parallel versions of widely used codes
The Importance Of Standards: Hardware
Processors
◦ commodity RISC processors
Interconnects
◦ high-bandwidth, low-latency communications protocols
◦ no de facto standard yet (ATM, Fibre Channel, HiPPI, FDDI)
Growing demand for total solutions:
◦ robust hardware + usable software
◦ HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops
The Future of HPC
The expense of being different is being replaced by the economics of being the same.
HPC needs to lose its "special purpose" tag.
It still has to deliver on the promise of scalable general-purpose computing ...
... but it is dangerous to ignore this technology.
Final success will come when MPP technology is embedded in desktop computing.
Yesterday's HPC is today's mainframe is tomorrow's workstation.
Achieving TeraFlops (10^12 Flops/sec)
In 1991: ~1 Gflop/sec. Reaching a Tflop/sec required a 1000-fold increase in performance, from:
◦ Architecture: exploiting parallelism (processor, communication, memory)
◦ Moore's Law (hopefully!)
◦ Algorithmic improvements: block-partitioned algorithms
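The "block-partitioned algorithms" bullet can be illustrated with a cache-blocking sketch. This is a plain-Python toy of the idea only (real codes use tuned BLAS kernels): the matrices are multiplied block by block so each sub-block can stay resident in fast memory while it is reused.

```python
# Block-partitioned matrix multiply: work on bs x bs sub-blocks so each
# block of A, B, and C fits in (and is reused from) cache. Illustrative
# sketch only; production codes call tuned BLAS routines instead.
def blocked_matmul(A, B, n, bs):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):          # block row of C
        for jj in range(0, n, bs):      # block column of C
            for kk in range(0, n, bs):  # block of the inner dimension
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The loop order and blocking change only the memory-access pattern, not the arithmetic, so the result matches the textbook triple loop exactly.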
Future: Petaflops (10^15 Flops/sec)
Today: on the order of 10^7-10^8 flops/sec for current workstations.
A Pflop for 1 second ≈ a typical workstation computing for 1 year.
From an algorithmic standpoint:
◦ Concurrency
◦ Data locality
◦ Latency & synchronization
◦ Floating point accuracy
◦ Dynamic redistribution of workload
◦ New language constructs
◦ Role of numerical libraries
◦ Algorithm adaptation to hardware failure
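The Pflop-vs-workstation comparison above is simple arithmetic; a quick check (the ~3×10^7 flop/s sustained workstation rate is an assumed figure, chosen to be consistent with the slide's one-year equivalence):

```python
# One Pflop-second = 1e15 floating-point operations.
# A workstation sustaining ~3e7 flop/s (assumed figure) for one year:
seconds_per_year = 365 * 24 * 3600           # 31,536,000 s, ~3.15e7
workstation_flops = 3e7                      # assumed sustained rate
ops_in_a_year = workstation_flops * seconds_per_year
print(ops_in_a_year)                         # ~9.5e14, i.e. about one Pflop-second
```

So one second of a sustained Pflop/s machine really does correspond to roughly a workstation-year of that era.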
A Petaflop Computer System – Today?
With 2 GHz processors (O(10^9) ops/sec each):
◦ 0.5 million PCs
◦ $0.5B ($1K each)
◦ 100 MWatts
◦ 5 acres
◦ 0.5 million Windows licenses!!
◦ a PC failure every second
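The back-of-envelope above multiplies out directly:

```python
# Back-of-envelope for the slide above: how many ~2 GHz PCs
# (~2e9 ops/s each) does it take to aggregate 1 Pflop/s?
target = 1e15            # 1 Pflop/s
per_pc = 2e9             # O(10^9) ops/s per 2 GHz processor
pcs = target / per_pc
print(pcs)               # 500000.0 -> ".5 Million PCs"
cost = pcs * 1_000       # at $1K each -> $0.5B
print(cost)              # 500000000.0
```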
A Petaflop Computer System
1 Pflop/s sustained computing
Between 10,000 and 1,000,000 processors
Between 10 TB and 1 PB of main memory
Commensurate I/O bandwidth, mass store, etc.
May be feasible and "affordable" by the year 2010
IDC – HPC Survey
The following charts depict recent trends in HPC growth
◦ copied w/permission for distribution within class only
◦ DO NOT REDISTRIBUTE WITHOUT PERMISSION
HPC Growth
[Chart: revenue for all HPC servers, $ billions (0-12), 1999-2006]
Source: IDC

HPC – By Processor Type
[Chart: revenue percentage by CPU type (x86-64, x86-32, Vector, RISC, EPIC), 2000-2006]
Source: IDC

HPC – By OS
[Chart: revenue percentage by O/S (Linux, Unix, W/NT, Other), 2000-2006]
Source: IDC

HPC – Clusters: Cluster Market Penetration
[Chart: cluster vs. non-cluster share of the HPC market, Q1'03-Q4'06]
Source: IDC

HPC – Industry Trends
Revenue by segment ($ thousands):
Segment                          | 2000       | 2005       | 2010
Bio-sciences                     | $585,358   | $1,433,807 | $2,218,965
CAE                              | $839,669   | $1,108,510 | $1,978,185
Chemical engineering             | $324,601   | $222,466   | $447,418
DCC & distribution               | $113,464   | $513,684   | $627,877
Economics/financial              | $203,067   | $254,967   | $483,848
EDA                              | $457,630   | $648,477   | $1,036,687
Geosciences and geo-engineering  | $181,863   | $489,452   | $800,670
Mechanical design and drafting   | $114,185   | $155,843   | $198,902
Defense                          | $760,065   | $811,335   | $1,651,183
Government lab                   | $688,555   | $1,375,964 | $1,671,932
Software engineering             | $93,634    | $19,774    | $18,531
Technical management             | $275,562   | $101,561   | $73,023
University/academic              | $933,386   | $1,699,966 | $2,498,767
Weather                          | $171,127   | $358,978   | $494,431
Other                            | $153,769   | $3,238     | $64,744
Total revenue                    | $5,895,934 | $9,198,020 | $14,265,164
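The table's totals imply a steady growth rate; the compound annual growth rate (CAGR) below is our derived figure, not from the slide:

```python
# Implied compound annual growth rate of total HPC revenue from the
# IDC table above ($K): 2000 -> 2005 actuals, 2005 -> 2010 projection.
def cagr(start, end, years):
    return (end / start) ** (1 / years) - 1

total_2000, total_2005, total_2010 = 5_895_934, 9_198_020, 14_265_164
print(f"2000-2005: {100 * cagr(total_2000, total_2005, 5):.1f}%/yr")
print(f"2005-2010: {100 * cagr(total_2005, total_2010, 5):.1f}%/yr")
```

Both periods work out to roughly 9% per year, consistent with the revenue chart earlier in the deck.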
Source: IDC

HPC Issues
Clusters are still hard to use and manage
◦ Power, cooling, and floor space are major issues
◦ Third-party software costs
◦ Weak interconnect performance at all levels
◦ Applications & programming: hard to scale beyond a node
◦ RAS is a growing issue
◦ Storage and data management
◦ Multi-processor type support and accelerator support
Requirements are diverging
◦ High end: needs more, but is a shrinking segment
◦ Mid and lower end: the mainstream will look more for complete solutions
◦ New entrants: ease-of-use will drive them, plus they need applications
• Parallel software is missing for most users
• And will get weaker in the near future: software will be the #1 roadblock
• Multi-core will cause many issues to "hit the wall"
Custom HPC Architectures
◦ IBM BlueGene/L
◦ IBM Cell Processor (Sony/IBM/Toshiba)
HPC – Current Top Performer: BlueGene/L
Top HPC performer - BlueGene/L (as of 8/2007), from: http://www.top500.org/
Located at Lawrence Livermore National Laboratory.
◦ BlueGene/L boasts a peak speed of over 360 teraFLOPS, a total memory of 32 terabytes, total power of 1.5 megawatts, and machine floor space of 2,500 square feet.
◦ The full system has 65,536 dual-processor compute nodes.
◦ Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communications.
◦ A global reduction tree supports fast global operations such as global max/sum in a few microseconds over 65,536 nodes.
◦ Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds.
◦ 1,024 gigabit-per-second links to a global parallel file system support fast input/output to disk.
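The 32 x 32 x 64 torus above means every node, even one on an "edge", has exactly six nearest neighbors, with coordinates wrapping around in each dimension. A minimal sketch of the neighbor computation (our own illustration, not BlueGene/L system code):

```python
# Nearest neighbors of a node in a 32 x 32 x 64 3D torus like BlueGene/L's:
# one neighbor in each of the six directions, with wraparound at the edges.
DIMS = (32, 32, 64)

def torus_neighbors(x, y, z):
    nbrs = []
    for axis, delta in ((0, 1), (0, -1), (1, 1), (1, -1), (2, 1), (2, -1)):
        coord = [x, y, z]
        coord[axis] = (coord[axis] + delta) % DIMS[axis]  # wraparound
        nbrs.append(tuple(coord))
    return nbrs

# A corner node still has 6 neighbors, e.g. via wraparound to (31,0,0) and (0,0,63).
print(torus_neighbors(0, 0, 0))
```

Note that 32 x 32 x 64 = 65,536, matching the node count quoted on the slide.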
BlueGene/L
http://www.llnl.gov/asc/computing_resources/bluegenel/

BlueGene/L ASIC
HPC – IBM Cell Processor (Sony/IBM/Toshiba)
Summer 2000: Sony, Toshiba, and IBM came together to design the processor for Sony's new PlayStation 3.
March 2001: STI Design Center opens in Austin, a joint investment of $400,000,000.
Sony figured out that it's not really FLOPS that are the problem, it's the data: if you can keep the processor fed and balanced, the Cell looks very promising.
The heterogeneous model can be very difficult to program.
Components:
◦ 1 Power Processing Unit (PPU)
◦ 8 Synergistic Processing Units (SPU)
◦ Element Interconnect Bus (EIB)
◦ Memory Interface Controller (MIC)
◦ 2 Configurable I/O Connections (FlexIO)
Cell Power Processing Unit (PPU)
• Simple 64-bit PowerPC core (PPE)
◦ In-order, dual-issue pipeline
◦ Symmetric Multi-Threaded (2 threads)
◦ VMX (AltiVec) vector unit
◦ 512 KB L2 cache
• Responsible for running the OS and coordinating the SPUs
Synergistic Processing Unit (SPU)
Consists of...
• Synergistic Processing Element (SPE)
◦ SIMD vector processor
◦ Single-precision FP
◦ 128-entry, 128-bit register file
◦ No branch prediction; uses software hints
• 256 KB Local Store (LS)
◦ NOT a cache
◦ Holds instructions & data
• Memory Flow Controller (MFC)
◦ Handles DMA & mailbox communications
Element Interconnect Bus (EIB)
Connects the PPU, MIC, SPUs, and I/O ports
4 rings
◦ 2 clockwise, 2 counter-clockwise
Peak aggregate bandwidth of 204.8 GB/s
Sustained: 197 GB/s