High Performance Computing
Course #: CSI 440/540 – High Performance Scientific Computing I, Fall '09
Mark R. Gilder, Email: [email protected]
This course investigates the latest trends in high-performance computing (HPC) evolution and examines key issues in developing algorithms capable of exploiting these architectures.
Grading: Your grade in the course will be based on completion of assignments (40%), a course project (35%), a class presentation (15%), and class participation (10%).
Course Goals
◦ Understanding of the latest trends in HPC architecture evolution
◦ Appreciation for the complexities in efficiently mapping algorithms onto HPC architectures
◦ Familiarity with various program transformations in order to improve performance
◦ Hands-on experience in design and implementation of algorithms for both shared- & distributed-memory parallel architectures using Pthreads, OpenMP, and MPI
◦ Experience in evaluating the performance of parallel programs
Mark R. Gilder, CSI 440/540 – SUNY Albany, Fall '09

Multi-Core To The Rescue???
[Figure: threads per socket vs. year, 2002–2017, contrasting the multi-core era of scalar and parallel applications with the coming multi-core era of massively parallel applications. Data courtesy Intel.]
However, going multi-core introduces a whole new set of problems!
Considerations
◦ Rising software development costs
◦ Increased compute system complexity
◦ Lack of compiler technology solutions
◦ Large volume of legacy code
◦ Future changes that will make us start over again
Current Environment
Lack of robust cross-platform development frameworks
Parallel programming support implemented as an afterthought via language extensions and/or pragmas
Developers *must* understand low-level details of both the algorithm and the architecture in order to be successful:
◦ Processor interconnect network
◦ I/O performance
◦ Cache coherency
◦ Memory hierarchy
Parallel Programming is still mostly an art
HPC Trends
HPC at an all-time high of $10 billion in 2006
x86 and Linux are the current winners
Clustering continues to drive change
◦ Price/performance reset
◦ Dramatic increase in units
◦ Growth in standardization of CPU, OS, interconnect
Grid technologies will become pervasive
New challenges for datacenters:
◦ Power, cooling, system management, system consolidation, virtualization?
Storage and data management are growing in importance
Software will be the #1 roadblock – and hence will provide major opportunities at all levels.
Source: IDC

HPC Growth
[Figure: all HPC servers, revenue in $ billions, 1999–2006, rising to roughly $10 billion.]
Source: IDC

HPC – By Processor Type
[Figure: revenue percentage by CPU type (x86-64, x86-32, vector, RISC, EPIC), 2000–2006.]
Source: IDC

HPC – By OS
[Figure: revenue percentage by O/S (Linux, Unix, W/NT, other), 2000–2006.]
Source: IDC

HPC – Clusters: Cluster Market Penetration
[Figure: cluster vs. non-cluster share of HPC revenue by quarter, Q1 2003 – Q4 2006.]
Source: IDC

HPC – Industry Trends
Revenue by industry/application segment ($ thousands):

Segment                              2000         2005         2010
Bio-sciences                     $585,358   $1,433,807   $2,218,965
CAE                              $839,669   $1,108,510   $1,978,185
Chemical engineering             $324,601     $222,466     $447,418
DCC & distribution               $113,464     $513,684     $627,877
Economics/financial              $203,067     $254,967     $483,848
EDA                              $457,630     $648,477   $1,036,687
Geosciences and geo-engineering  $181,863     $489,452     $800,670
Mechanical design and drafting   $114,185     $155,843     $198,902
Defense                          $760,065     $811,335   $1,651,183
Government lab                   $688,555   $1,375,964   $1,671,932
Software engineering              $93,634      $19,774      $18,531
Technical management             $275,562     $101,561      $73,023
University/academic              $933,386   $1,699,966   $2,498,767
Weather                          $171,127     $358,978     $494,431
Other                            $153,769       $3,238      $64,744
Total revenue                  $5,895,934   $9,198,020  $14,265,164
Source: IDC

HPC Issues
Clusters are still hard to use and manage
◦ Power, cooling, and floor space are major issues
◦ Third-party software costs
◦ Weak interconnect performance at all levels
◦ Applications & programming – hard to scale beyond a node
◦ RAS is a growing issue
◦ Storage and data management
◦ Multi-processor type support and accelerator support
Requirements are diverging
◦ High end – needs more, but is a shrinking segment
◦ Mid and lower end – the mainstream will look more for complete solutions
◦ New entrants – ease-of-use will drive them, plus they need applications
• Parallel software is missing for most users
• And will get weaker in the near future – software will be the #1 roadblock
• Multi-core will cause many issues to "hit the wall"
Source: IDC

The Top-10 Supercomputers – November 2008
1. Roadrunner – BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband (2008, IBM); DOE/NNSA/LANL, United States; 129,600 cores; Rmax 1105.00; Rpeak 1456.70; 2483.47 kW
2. Jaguar – Cray XT5 QC 2.3 GHz (2008, Cray Inc.); Oak Ridge National Laboratory, United States; 150,152 cores; Rmax 1059.00; Rpeak 1381.40; 6950.60 kW
3. Pleiades – SGI Altix ICE 8200EX, Xeon QC 3.0/2.66 GHz (2008, SGI); NASA/Ames Research Center/NAS, United States; 51,200 cores; Rmax 487.01; Rpeak 608.83; 2090.00 kW
4. BlueGene/L – eServer Blue Gene Solution (2007, IBM); DOE/NNSA/LLNL, United States; 212,992 cores; Rmax 478.20; Rpeak 596.38; 2329.60 kW
5. Blue Gene/P Solution (2007, IBM); Argonne National Laboratory, United States; 163,840 cores; Rmax 450.30; Rpeak 557.06; 1260.00 kW
6. Ranger – SunBlade x6420, Opteron QC 2.3 GHz, Infiniband (2008, Sun Microsystems); Texas Advanced Computing Center/Univ. of Texas, United States; 62,976 cores; Rmax 433.20; Rpeak 579.38; 2000.00 kW
7. Franklin – Cray XT4 QuadCore 2.3 GHz (2008, Cray Inc.); NERSC/LBNL, United States; 38,642 cores; Rmax 266.30; Rpeak 355.51; 1150.00 kW
8. Jaguar – Cray XT4 QuadCore 2.1 GHz (2008, Cray Inc.); Oak Ridge National Laboratory, United States; 30,976 cores; Rmax 205.00; Rpeak 260.20; 1580.71 kW
9. Red Storm – Sandia/Cray Red Storm, XT3/4, 2.4/2.2 GHz dual/quad core (2008, Cray Inc.); NNSA/Sandia National Laboratories, United States; 38,208 cores; Rmax 204.20; Rpeak 284.00; 2506.00 kW
10. Dawning 5000A – QC Opteron 1.9 GHz, Infiniband, Windows HPC 2008 (2008, Dawning); Shanghai Supercomputer Center, China; 30,720 cores; Rmax 180.60; Rpeak 233.47
Rmax and Rpeak values are in TFLOPS; power is in kW for the entire system.

Performance Development
http://www.top500.org/lists/2008/11/performance_development
[Figure, repeated over several slides with added annotations: "~10 years", "6-8 years", "My Laptop, 8-10 years".]

Performance Projection
My Laptop, 8-10 years
http://www.top500.org/lists/2008/11/performance_development

Top 500 Conclusions
Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
MPPs continue to account for more than half of all installed high-performance computers worldwide.
HPC – Current Top Performer: BlueGene/L
Top HPC performer – BlueGene/L (as of 8/2007), from http://www.top500.org/
Located at Lawrence Livermore National Laboratory.
◦ BlueGene/L boasts a peak speed of over 360 teraFLOPS, a total memory of 32 terabytes, total power of 1.5 megawatts, and machine floor space of 2,500 square feet
◦ The full system has 65,536 dual-processor compute nodes
◦ Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communications
◦ A global reduction tree supports fast global operations such as global max/sum in a few microseconds over 65,536 nodes
◦ Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds
◦ 1,024 gigabit-per-second links to a global parallel file system support fast input/output to disk
BlueGene/L
http://www.llnl.gov/asc/computing_resources/bluegenel/

BlueGene/L ASIC
HPC – IBM Cell Processor (Sony/IBM/Toshiba)
Summer 2000: Sony, Toshiba, and IBM came together to design Sony's new PlayStation 3 processor.
March 2001: STI Design Center opened in Austin, a joint investment of $400,000,000.
Sony figured out that it's not really FLOPS that are the problem, it's the data: if you can keep the processor fed and balanced, the Cell looks very promising.
The heterogeneous model can be very difficult to program.
Cell components:
◦ 1 Power Processing Unit (PPU)
◦ 8 Synergistic Processing Units (SPUs)
◦ Element Interconnect Bus (EIB)
◦ Memory Interface Controller (MIC)
◦ 2 configurable I/O connections (FlexIO)
Cell Power Processing Unit (PPU)
• Simple 64-bit PowerPC core (PPE)
◦ In-order, dual-issue pipeline
◦ Symmetric multi-threaded (2 threads)
◦ VMX (AltiVec) vector unit
◦ 512KB L2 cache
• Responsible for running the OS and coordinating the SPUs
Synergistic Processing Unit (SPU)
Consists of:
• Synergistic Processing Element (SPE)
◦ SIMD vector processor
◦ Single-precision FP
◦ 128-entry, 128-bit register file
◦ No branch prediction; uses software hints
• 256KB Local Store (LS)
◦ NOT a cache
◦ Holds instructions & data
• Memory Flow Controller (MFC)
◦ Handles DMA & mailbox communications
Element Interconnect Bus (EIB)
Connects the PPU, MIC, SPUs, and I/O ports
◦ 4 rings: 2 clockwise, 2 counter-clockwise
◦ Peak aggregate bandwidth of 204.8 GB/s; sustained: 197 GB/s
Lecture 2 Outline:
◦ Why Parallel Computing
◦ Flynn's Taxonomy
◦ Communication Models
◦ Parallel Programming Terminology
◦ Amdahl's Law
Outline
◦ Why Parallel Computing
◦ Flynn's Taxonomy
◦ Communication Models
◦ Parallel Programming Terminology
◦ Amdahl's Law
Parallel Computing
Motivated by the high computational complexity and memory requirements of large applications
Two approaches:
◦ Shared memory
◦ Distributed memory
The majority of modern systems are clusters (distributed-memory architecture)
◦ Many simple machines connected with a powerful interconnect
◦ ASCI Red, ASCI White, …
A hybrid approach can also be used
◦ IBM Blue Gene
Programming Models
Multiprogramming
◦ Multiple programs running simultaneously
Shared Address
◦ Global address space available to all processors
◦ Shared data is written to this global space
Message Passing
◦ Data is sent directly to processors using "messages"
Data Parallel
Dichotomy of Parallel Computing Platforms
An explicitly parallel program must specify concurrency and interaction between concurrent subtasks.
The former is sometimes also referred to as the control structure and the latter as the communication model.
Control Structure
Parallelism can be expressed at various levels of granularity - from instruction level to processes.
Between these extremes exist a range of models, along with corresponding architectural support.
Control Structure of Parallel Programs
Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.
If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical MIMD architecture (b).
SIMD Processors
Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines.
SIMD relies on the regular structure of computations (such as those in image processing).
It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines whether a processor should participate in a computation or not.
Conditional Execution in SIMD Processors
Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
MIMD Processors
In contrast to SIMD processors, MIMD processors can execute different programs on different processors. A variant of this, called single program, multiple data (SPMD), executes the same program on different processors. It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support. Examples of such platforms include current-generation Sun Ultra servers, SGI Origin servers, multiprocessor PCs, workstation clusters, and the IBM SP.
SIMD-MIMD Comparison
SIMD computers require less hardware than MIMD computers (single control unit).
However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.
Not all applications are naturally suited to SIMD processors.
In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.
Outline
◦ Why Parallel Computing
◦ Flynn's Taxonomy
◦ Communication Models
◦ Parallel Programming Terminology
◦ Amdahl's Law
Classification of Parallel Systems: Flynn's Taxonomy
The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) streams and data streams available in the architecture:

Flynn's Taxonomy    Single Instruction    Multiple Instruction
Single Data         SISD                  MISD
Multiple Data       SIMD                  MIMD

SISD – Single Instruction, Single Data
◦ Normal instructions
SIMD – Single Instruction, Multiple Data
◦ Vector operations, MMX, SSE, AltiVec
MISD – Multiple Instructions, Single Data
MIMD – Multiple Instructions, Multiple Data
◦ SPMD – Single Program, Multiple Data: tasks are run simultaneously on multiple processors with different input in order to obtain results faster
Classification of Parallel Systems: Flynn's Taxonomy
Single Instruction, Single Data stream (SISD)
A sequential computer which exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are traditional uniprocessor machines like a PC or old mainframes.
Flynn's Taxonomy - Continued
Multiple Instruction, Single Data stream (MISD)
Uncommon, because multiple instruction streams generally require multiple data streams to be effective. However, this type is used for redundant parallelism: for example, airplanes need several backup systems in case one fails. Some theoretical computer architectures have also been proposed which make use of MISD, but none have entered mass production.
Flynn's Taxonomy - Continued
Single Instruction, Multiple Data streams (SIMD)
A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized. For example, an array processor or GPU.
Flynn's Taxonomy - Continued
Multiple Instruction, Multiple Data streams (MIMD)
Multiple autonomous processors simultaneously executing different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, exploiting either a single shared memory space or a distributed memory space.
Flynn's Taxonomy - Recap
Outline
◦ Why Parallel Computing
◦ Flynn's Taxonomy
◦ Communication Models
◦ Parallel Programming Terminology
◦ Amdahl's Law
Communication Model of Parallel Platforms
There are two primary forms of data exchange between parallel tasks - accessing a shared data space and exchanging messages.
Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
Platforms that support messaging are also called message passing platforms or multicomputers.
Shared Memory Systems
Memory resources are shared among processors
Relatively easy to program for since there is a single unified memory space
Scales poorly with system size due to the need for cache coherency
Example:
◦ Symmetric multiprocessors (SMP)
  Each processor has equal access to RAM
  4-way motherboards are MUCH more expensive than 2-way
Distributed Memory Systems
Individual nodes consist of a CPU, RAM, and a network interface
◦ A hard disk is not necessary; mass storage can be supplied using NFS
Information is passed between nodes using the network
No need for special cache coherency hardware
More difficult to write programs for distributed memory systems since the programmer must keep track of memory usage
Shared-Address-Space Platforms
Part (or all) of the memory is accessible to all processors.
Processors interact by modifying data objects stored in this shared-address-space.
If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as uniform memory access (UMA); otherwise, it is a non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
NUMA and UMA Shared-Address-Space Platforms
The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance.
Programming shared-address-space platforms is easier since reads and writes are implicitly visible to other processors.
However, read-write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
Shared-Address-Space vs. Shared Memory Machines
It is important to note the difference between the terms shared address space and shared memory.
We refer to the former as a programming abstraction and to the latter as a physical machine attribute.
It is possible to provide a shared address space using a physically distributed memory.
Message-Passing Platforms
These platforms comprise a set of processors, each with its own (exclusive) memory.
Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.
These platforms are programmed using (variants of) send and receive primitives.
Libraries such as MPI and PVM provide such primitives.
Message Passing vs. Shared Address Space Platforms
Message passing requires little hardware support, other than a network.
Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).
Outline
◦ Why Parallel Computing
◦ Flynn's Taxonomy
◦ Communication Models
◦ Parallel Programming Terminology
◦ Amdahl's Law
Parallel Programming Terminology
Efficiency: the execution time using a single processor divided by the product of the execution time using a multiprocessor and the number of processors.
Parallel Overhead: the extra work associated with a parallel version compared to its sequential code; mostly the extra CPU time and memory space requirements from synchronization, data communications, parallel environment creation and cleanup, etc.
Synchronization: the coordination of simultaneous tasks to ensure correctness and avoid unexpected race conditions.
Speedup: also called parallel speedup; defined as the wall-clock time of the best serial execution divided by the wall-clock time of the parallel execution. Amdahl's law gives a maximum speedup factor.
Scalability: a parallel system's ability to gain a proportionate increase in parallel speedup with the addition of more processors.
Task: a logically high-level, discrete, independent section of computational work. A task is typically executed by a processor as a program.
Speed-Up / Efficiency
Speedup refers to how much faster a parallel algorithm performs than its corresponding sequential algorithm. It is defined as:

    S_p = T_1 / T_p

where
  p   – number of processors
  T_1 – sequential execution time
  T_p – parallel execution time with p processors

Efficiency provides a performance metric indicating how well utilized the processors are, typically between 0 and 1:

    E_p = S_p / p
Speed-Up / Efficiency Example
Given:
  T_1 = 150 secs
  T_p = 50 secs
  p = 4

Speedup:    S_p = T_1 / T_p = 150 / 50 = 3.0
Efficiency: E_p = S_p / p = 3.0 / 4 = 0.75
Super-linear Speedup
Super-linear speedup occurs when a speedup of more than N is observed when using N processors.
In practice, super-linear speedup rarely happens, but it can occur in some cases.
It typically occurs due to cache effects resulting from the different memory hierarchies. In parallel computing, not only does the number of processors change, but so does the size of the accumulated cache across the processors. With the larger accumulated cache size, more or even all of the core data set can fit into cache, and memory access time drops dramatically, causing extra speedup beyond that from the actual computation.
Outline
◦ Why Parallel Computing
◦ Flynn's Taxonomy
◦ Communication Models
◦ Parallel Programming Terminology
◦ Amdahl's Law
Amdahl's Law
Need to understand the potential of multicore systems.
Amdahl's law states that if F is the fraction of a calculation that is sequential, and (1 − F) is the fraction that can be parallelized, then the maximum speedup that can be achieved by using N processors is:

    Speedup = 1 / (F + (1 − F) / N)
In the limit, as N tends to infinity, the maximum speedup tends to 1/F.
As an example, if F is only 10%, the problem can be sped up by at most a factor of 10, no matter how large the value of N used.