
High Performance Scientific Computing I

Course #: CSI 440/540 High Perf Sci Comp I Fall ‘09

Mark R. Gilder Email: [email protected] [email protected] CSI 440/540

This course investigates the latest trends in high-performance computing (HPC) evolution and examines key issues in developing software capable of exploiting these architectures.

Grading: Your grade in the course will be based on completion of assignments (40%), course project (35%), class presentation (15%), and class participation (10%).

Course Goals
 Understanding of the latest trends in HPC architecture evolution
 Appreciation for the complexities of efficiently mapping algorithms onto HPC architectures
 Familiarity with various program transformations used to improve performance
 Hands-on experience in the design and implementation of algorithms for both shared-memory and distributed-memory parallel architectures using Pthreads, OpenMP, and MPI
 Experience in evaluating the performance of parallel programs

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 2 Multi-Core To The Rescue???

[Chart: threads per socket on a log scale (1 to 10,000), projected from 2002 to 2017, marking the transition into the multi-core era of scalar and parallel applications.]

However, going multi-core introduces a whole new set of problems!

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 3 Considerations

 Rising development costs
 Increased compute system complexity
 Lack of technology solutions
 Large volume of legacy code
 Future changes that will make us start over again

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 4 Current Environment

 Lack of robust cross-platform development frameworks
 Parallelism in programming languages implemented as an afterthought via language extensions and/or pragmas
 Developers *must* understand low-level details of both the application and the architecture in order to be successful:
◦ Interconnect network
◦ I/O performance
◦ Coherency
◦ …

Parallel Programming is still mostly an art

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 5 HPC Trends

 The HPC market hit an all-time high of $10 billion in 2006
 … and Linux are the current winners
 Clustering continues to drive change
◦ Price/performance reset
◦ Dramatic increase in units
◦ Growth in standardization of CPU, OS, interconnect
 Grid technologies will become pervasive
 New challenges for datacenters:
◦ Power, cooling, system management, system consolidation, virtualization?
 Storage and data management are growing in importance

Software will be the #1 roadblock – and hence provide major opportunities at all levels Source IDC Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 6 HPC Growth

[Chart: All HPC Servers, revenue in $ billions, 1999–2006.]

Source IDC Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 7 HPC – By Processor Type

[Chart: HPC revenue share by CPU type (EPIC, RISC, Vector, x86-32, x86-64), 2000–2006.]

Source IDC Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 8 HPC – By OS

[Chart: HPC revenue share by operating system (Linux, Unix, Windows/NT, Other), 2000–2006.]

Source IDC Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 9 HPC – Clusters: Cluster Market Penetration

[Chart: cluster vs. non-cluster share of the HPC market by quarter, Q1 2003 – Q4 2006.]

Source IDC Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 10 HPC – Industry Trends

Revenue by industry segment, 2000 / 2005 / 2010:

Bio-sciences                      $585,358    $1,433,807   $2,218,965
CAE                               $839,669    $1,108,510   $1,978,185
Chemical engineering              $324,601      $222,466     $447,418
DCC & distribution                $113,464      $513,684     $627,877
Economics/financial               $203,067      $254,967     $483,848
EDA                               $457,630      $648,477   $1,036,687
Geosciences and geo-engineering   $181,863      $489,452     $800,670
Mechanical design and drafting    $114,185      $155,843     $198,902
Defense                           $760,065      $811,335   $1,651,183
Government lab                    $688,555    $1,375,964   $1,671,932
(segment not captured)             $93,634       $19,774      $18,531
Technical management              $275,562      $101,561      $73,023
University/academic               $933,386    $1,699,966   $2,498,767
Weather                           $171,127      $358,978     $494,431
Other                             $153,769        $3,238      $64,744
Total revenue                   $5,895,934    $9,198,020  $14,265,164

Source IDC Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 11 HPC Issues
 Clusters are still hard to use and manage
◦ Power, cooling and floor space are major issues
◦ Third-party software costs
◦ Weak interconnect performance at all levels
◦ Applications & programming: hard to scale beyond a node
◦ RAS is a growing issue
◦ Storage and data management
◦ Multi-processor type support and accelerator support
 Requirements are diverging
◦ High-end: need more, but is a shrinking segment
◦ Mid and lower end: the mainstream will look more for complete solutions
◦ New entrants: ease-of-use will drive them, plus need applications
• Parallel software is missing for most users
• And will get weaker in the near future: software will be the #1 roadblock
• Multi-core will cause many issues to “hit the wall”

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 Source IDC 12 The Top-10 November 2008

Rank | Site | Computer / Year / Vendor | Cores | Rmax | Rpeak | Power
1 | DOE/NNSA/LANL, United States | Roadrunner - BladeCenter QS22/LS21 Cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire Infiniband / 2008 / IBM | 129600 | 1105.00 | 1456.70 | 2483.47
2 | Oak Ridge National Laboratory, United States | Jaguar - XT5 QC 2.3 GHz / 2008 / Cray Inc. | 150152 | 1059.00 | 1381.40 | 6950.60
3 | NASA/Ames Research Center/NAS, United States | Pleiades - SGI Altix ICE 8200EX, Xeon QC 3.0/2.66 GHz / 2008 / SGI | 51200 | 487.01 | 608.83 | 2090.00
4 | DOE/NNSA/LLNL, United States | BlueGene/L - eServer Blue Gene Solution / 2007 / IBM | 212992 | 478.20 | 596.38 | 2329.60
5 | Argonne National Laboratory, United States | Blue Gene/P Solution / 2007 / IBM | 163840 | 450.30 | 557.06 | 1260.00
6 | Texas Advanced Computing Center/Univ. of Texas, United States | Ranger - SunBlade x6420, Opteron QC 2.3 GHz, Infiniband / 2008 / Sun Microsystems | 62976 | 433.20 | 579.38 | 2000.00
7 | NERSC/LBNL, United States | Franklin - Cray XT4 QuadCore 2.3 GHz / 2008 / Cray Inc. | 38642 | 266.30 | 355.51 | 1150.00
8 | Oak Ridge National Laboratory, United States | Jaguar - Cray XT4 QuadCore 2.1 GHz / 2008 / Cray Inc. | 30976 | 205.00 | 260.20 | 1580.71
9 | NNSA/Sandia National Laboratories, United States | Red Storm - Sandia/Cray Red Storm, XT3/4, 2.4/2.2 GHz dual/quad core / 2008 / Cray Inc. | 38208 | 204.20 | 284.00 | 2506.00
10 | Shanghai Supercomputer Center, China | Dawning 5000A - QC Opteron 1.9 GHz, Infiniband, Windows HPC 2008 / 2008 / Dawning | 30720 | 180.60 | 233.47 | –

Rmax and Rpeak values are in TFLOPS; power is in kW for the entire system. Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 13 Performance Development

[Charts (slides 14-18): TOP500 performance development and performance projection, annotated with “~10 years”, “6-8 years”, and “My Laptop 8-10 years”.]

http://www.top500.org/lists/2008/11/performance_development Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 18 Top 500 Conclusions

 Microprocessor-based supercomputers have brought a major change in accessibility and affordability

 MPPs continue to account for more than half of all installed high-performance computers worldwide

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 19 HPC – Current Top Performer: BlueGene/L

Top HPC performer - BlueGene/L (as of 8/2007), from http://www.top500.org/, located at Lawrence Livermore National Laboratory
 BlueGene/L boasts a peak speed of over 360 teraFLOPS, a total memory of 32 terabytes, total power of 1.5 megawatts, and machine floor space of 2,500 square feet. The full system has 65,536 dual-processor compute nodes.
 Nodes are configured as a 32 x 32 x 64 3D torus; each node is connected in six different directions for nearest-neighbor communications.
 A global reduction tree supports fast global operations such as global max/sum in a few microseconds over 65,536 nodes.
 Multiple global barrier and interrupt networks allow fast synchronization of tasks across the entire machine within a few microseconds.
 1,024 gigabit-per-second links to a global parallel file system support fast input/output to disk.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 20 BlueGene/L

http://www.llnl.gov/asc/computing_resources/bluegenel/ Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 21 BlueGene/L ASIC

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 22 HPC – IBM Cell Processor (Sony/IBM/Toshiba)

 Summer 2000: Sony, Toshiba, and IBM came together to design Sony’s new PlayStation 3 processor.
 March 2001: the STI Design Center opened in Austin, a joint investment of $400,000,000.
 Sony figured out it’s not really FLOPS that are the problem, it’s the data: if you can keep it fed and balanced, the Cell looks very promising.
 The heterogeneous model can be very difficult to program.
 1 Power Processing Unit (PPU)
 8 Synergistic Processing Units (SPU)
 Element Interconnect Bus (EIB)
 Memory Interface Controller (MIC)
 2 Configurable I/O Connections (FlexIO)

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 23 Cell Power Processing Unit (PPU)

• Simple 64-bit PowerPC core (PPE)
◦ In-order, dual-issue
◦ Symmetric multi-threaded (2 threads)
◦ VMX (AltiVec) vector unit
◦ 512KB L2 cache
• Responsible for running the OS and coordinating the SPUs

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 24 Synergistic Processing Unit (SPU)

Consists of…
• Synergistic Processing Element (SPE)
◦ SIMD, single-precision FP
◦ 128-entry, 128-bit register file
◦ No branch prediction; uses software hints
• 256KB Local Store (LS)
◦ NOT a cache
◦ Holds instructions & data
• Memory Flow Controller (MFC)
◦ Handles DMA & mailbox communications

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 25 Element Interconnect Bus (EIB)

 Connects PPU, MIC, SPUs, IO ports  4 Rings ◦ 2 Clockwise, 2 CCW  Peak Aggregate of 204.8GB/s  Sustained: 197GB/s

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 26 Lecture 2 Outline: ◦ Why Parallel Computing ◦ Flynn’s Taxonomy ◦ Communication Models ◦ Parallel Programming Terminology ◦ Amdahl’s Law

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 27 Outline

◦ Why Parallel Computing ◦ Flynn’s Taxonomy ◦ Communication Models ◦ Parallel Programming Terminology ◦ Amdahl’s Law

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 28 Parallel Computing

 Motivated by the high computational and memory requirements of large applications
 Two approaches:
◦ Shared Memory
◦ Distributed Memory
 The majority of modern systems are clusters (distributed memory)
◦ Many simple machines connected with a powerful interconnect
◦ ASCI Red, ASCI White, …
 A hybrid approach can also be used
◦ IBM Blue Gene

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 29 Programming Models

 Multiprogramming
◦ Multiple programs running simultaneously
 Shared Address Space
◦ Global memory available to all processors
◦ Shared data is written to this global space
 Message Passing
◦ Data is sent directly to processors using “messages”
 Data Parallel
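To make the shared-address-space and data-parallel models concrete, here is a minimal OpenMP sketch (the array sizes and the loop body are arbitrary placeholders, not course material): every thread sees the same global arrays, and the loop iterations are divided among the threads.

/* Shared-address-space / data-parallel sketch with OpenMP (compile with -fopenmp).
   All threads see the same arrays a, b, c; iterations of the loop are split among them. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N], c[N];            /* shared, globally visible data */

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    #pragma omp parallel for               /* each thread handles a chunk of iterations */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];                /* results written directly into shared memory */

    printf("c[N-1] = %f (max threads: %d)\n", c[N - 1], omp_get_max_threads());
    return 0;
}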

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 30 Dichotomy of Parallel Computing Platforms  An explicitly parallel program must specify concurrency and interaction between concurrent subtasks.

 The former is sometimes also referred to as the control structure and the latter as the communication model.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 31 Control Structure

 Parallelism can be expressed at various levels of granularity - from instruction level to processes.

 Between these extremes exist a range of models, along with corresponding architectural support.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 32 Control Structure of Parallel Programs  Processing units in parallel computers either operate under the centralized control of a single control unit or work independently.

 If there is a single control unit that dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).

 If each processor has its own control unit, each processor can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 33 SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 34 SIMD Processors

 Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines.

 SIMD relies on the regular structure of computations (such as those in image processing).

 It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an “activity mask”, which determines whether a processor should participate in a computation or not.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 35 Conditional Execution in SIMD Processors

Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
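The two-step execution above can be mimicked in plain C. This is only an illustrative sketch of the activity-mask idea (the array contents and the condition are made up), not actual SIMD hardware or intrinsics.

/* Activity-mask illustration: emulate SIMD conditional execution.
   Step 1: elements where the condition holds take the "then" branch.
   Step 2: the mask is inverted and the remaining elements take the "else" branch. */
#include <stdio.h>

#define WIDTH 4                      /* four processing elements, as in the figure */

int main(void)
{
    int a[WIDTH] = { 5, -3, 7, -1 };
    int c[WIDTH];
    int mask[WIDTH];

    for (int i = 0; i < WIDTH; i++)  /* build the activity mask from the condition */
        mask[i] = (a[i] > 0);

    for (int i = 0; i < WIDTH; i++)  /* step 1: active PEs execute the "then" branch */
        if (mask[i]) c[i] = a[i];

    for (int i = 0; i < WIDTH; i++)  /* step 2: inactive PEs execute the "else" branch */
        if (!mask[i]) c[i] = -a[i];

    for (int i = 0; i < WIDTH; i++)
        printf("c[%d] = %d\n", i, c[i]);
    return 0;
}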

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 36 MIMD Processors

 In contrast to SIMD processors, MIMD processors can execute different programs on different processors.  A variant of this, called single program multiple data streams (SPMD) executes the same program on different processors.  It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.  Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
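A minimal sketch of the SPMD style using MPI (the division of labor shown is a placeholder): one program runs on every processor, and each copy inspects its rank to decide what work to do.

/* SPMD sketch: one program, many processes, behaviour selected by rank.
   Compile with mpicc and launch with mpirun -np <p>. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("Rank 0 of %d: coordinating the computation\n", size);
    else
        printf("Rank %d of %d: working on my share of the data\n", rank, size);

    MPI_Finalize();
    return 0;
}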

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 37 SIMD-MIMD Comparison

 SIMD computers require less hardware than MIMD computers (single control unit).

 However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.

 Not all applications are naturally suited to SIMD processors.

 In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 38 Outline

◦ Why Parallel Computing ◦ Flynn’s Taxonomy ◦ Communication Models ◦ Parallel Programming Terminology ◦ Amdahl’s Law

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 39 Classification Of Parallel Systems Flynn’s Taxonomy

The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) and data streams available in the architecture:

Flynn's Taxonomy   Single Instruction   Multiple Instruction
Single Data        SISD                 MISD
Multiple Data      SIMD                 MIMD

 SISD – Single Instruction, Single Data
◦ Normal instructions
 SIMD – Single Instruction, Multiple Data
◦ Vector operations, MMX, SSE, AltiVec
 MISD – Multiple Instructions, Single Data
 MIMD – Multiple Instructions, Multiple Data
◦ SPMD – Single Program, Multiple Data: tasks are run simultaneously on multiple processors with different input in order to obtain results faster.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 40 Classification Of Parallel Systems Flynn’s Taxonomy

 Single Instruction, Single Data stream (SISD) A sequential computer which exploits no parallelism in either the instruction or data streams. Examples of SISD architecture are the traditional uniprocessor machines like a PC or old mainframes.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 41 Flynn’s Taxonomy - Continued

 Multiple Instruction, Single Data stream (MISD) Unusual due to the fact that multiple instruction streams generally require multiple data streams to be effective. However, this type is used when it comes to redundant parallelism, as for example on airplanes that need to have several backup systems in case one fails. Some theoretical computer architectures have also been proposed which make use of MISD, but none have entered mass production.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 42 Flynn’s Taxonomy - Continued

 Single Instruction, Multiple Data streams (SIMD) A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized. For example, an array processor or GPU.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 43 Flynn’s Taxonomy - Continued

 Multiple Instruction, Multiple Data streams (MIMD) Multiple autonomous processors simultaneously executing different instructions on different data. Distributed systems are generally recognized to be MIMD architectures; either exploiting a single shared memory space or a distributed memory space.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 44 Flynn’s Taxonomy - Recap

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 45 Outline

◦ Why Parallel Computing ◦ Flynn’s Taxonomy ◦ Communication Models ◦ Parallel Programming Terminology ◦ Amdahl’s Law

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 46 Communication Model of Parallel Platforms  There are two primary forms of data exchange between parallel tasks - accessing a shared data space and exchanging messages.

 Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.

 Platforms that support messaging are also called message-passing platforms or multicomputers.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 47 Shared Memory Systems

 Memory resources are shared among processors

 Relatively easy to program for since there is a single unified memory space

 Scales poorly with system size due to the need for cache coherency

 Example: ◦ Symmetric Multiprocessors (SMP)  Each processor has equal access to RAM  4-way motherboards MUCH more expensive than 2-way
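To illustrate how natural the single unified memory space is to program, the Pthreads sketch below has several threads sum disjoint slices of one global array; the array size and thread count are arbitrary assumptions for illustration.

/* Shared-memory sketch with Pthreads: every thread reads the same global array.
   Compile with -pthread. */
#include <stdio.h>
#include <pthread.h>

#define N        1000000
#define NTHREADS 4

static double data[N];               /* lives in the single shared address space */
static double partial[NTHREADS];     /* one slot per thread, so no locking is needed */

static void *sum_slice(void *arg)
{
    long id    = (long)arg;
    long chunk = N / NTHREADS;
    long lo    = id * chunk;
    long hi    = (id == NTHREADS - 1) ? N : lo + chunk;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    double total = 0.0;

    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    for (int t = 0; t < NTHREADS; t++)
        total += partial[t];
    printf("total = %f\n", total);
    return 0;
}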

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 48 Distributed Memory Systems

 Individual nodes consist of a CPU, RAM, and a network interface ◦ A hard disk is not necessary; mass storage can be supplied using NFS

 Information is passed between nodes using the network

 No need for special cache coherency hardware

 More difficult to write programs for distributed memory systems since the programmer must keep track of memory usage

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 49 Shared-Address-Space Platforms

 Part (or all) of the memory is accessible to all processors.

 Processors interact by modifying data objects stored in this shared-address-space.

 If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 50 NUMA and UMA Shared-Address- Space Platforms

Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 51 NUMA and UMA Shared-Address-Space Platforms  The distinction between NUMA and UMA platforms is important from the point of view of algorithm design. NUMA machines require locality from underlying algorithms for performance.

 Programming shared-address-space platforms is easier since reads and writes are implicitly visible to other processors.

 However, read-write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
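As a small preview of that coordination (a minimal sketch, not the full treatment given later), two Pthreads threads incrementing one shared counter must wrap the read-modify-write in a mutex, otherwise updates can be lost.

/* Coordinated writes to shared data with a Pthreads mutex (compile with -pthread). */
#include <stdio.h>
#include <pthread.h>

static long counter = 0;                           /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                 /* coordinate the read-modify-write */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}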

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 52 Shared-Address-Space vs. Shared Memory Machines  It is important to note the difference between the terms shared address space and shared memory.

 We refer to the former as a programming abstraction and to the latter as a physical machine attribute.

 It is possible to provide a shared address space using a physically distributed memory.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 53 Message-Passing Platforms

 These platforms comprise a set of processors, each with its own (exclusive) memory.

 Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.

 These platforms are programmed using (variants of) send and receive primitives.

 Libraries such as MPI and PVM provide such primitives.
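A minimal sketch of those primitives using MPI (the message tag and the value sent are arbitrary choices): rank 0 sends one double to rank 1, which receives it.

/* Message-passing sketch: explicit MPI_Send / MPI_Recv between two ranks.
   Compile with mpicc and run with at least 2 processes. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int    rank;
    double value = 3.14;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);      /* dest = 1, tag = 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}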

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 54 Message Passing vs. Shared Address Space Platforms  Message passing requires little hardware support, other than a network.

 Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 55 Outline

◦ Why Parallel Computing ◦ Flynn’s Taxonomy ◦ Communication Models ◦ Parallel Programming Terminology ◦ Amdahl’s Law

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 56 Parallel Programming Terminology

 Efficiency: the execution time using a single processor divided by the product of the number of processors and the execution time using the multiprocessor.
 Parallel Overhead: the extra work associated with the parallel version compared to its sequential code, mostly the extra CPU time and memory space requirements from synchronization, data communications, parallel environment creation and cleanup, etc.
 Synchronization: the coordination of simultaneous tasks to ensure correctness and avoid unexpected race conditions.
 Speedup: also called parallel speedup, defined as the wall-clock time of the best serial execution divided by the wall-clock time of the parallel execution. Amdahl's law can be used to give a maximum speedup factor.
 Scalability: a parallel system's ability to gain a proportionate increase in parallel speedup with the addition of more processors.
 Task: a logically high-level, discrete, independent section of computational work. A task is typically executed by a processor as a program.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 57 Speed-Up / Efficiency

 Speedup refers to how much faster a parallel algorithm performs over its corresponding sequential algorithm. It is defined as

Sp = T1 / Tp

where
p  – number of processors
T1 – sequential execution time
Tp – parallel execution time with p processors

 Efficiency provides a performance metric indicating how well utilized the processors are, typically between 0 and 1:

Ep = Sp / p
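These definitions can be measured directly. The OpenMP sketch below times a placeholder summation kernel once with one thread and once with p threads, then reports Sp and Ep; the kernel and the thread count are assumptions chosen only for illustration.

/* Measuring speedup Sp = T1 / Tp and efficiency Ep = Sp / p with OpenMP.
   The summation kernel is only a placeholder workload; compile with -fopenmp. */
#include <stdio.h>
#include <omp.h>

#define N 50000000L

static double timed_sum(int nthreads, double *result)
{
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum) num_threads(nthreads)
    for (long i = 1; i <= N; i++)
        sum += 1.0 / (double)i;
    *result = sum;                       /* keep the result so the work is not optimized away */
    return omp_get_wtime() - t0;         /* wall-clock seconds for this run */
}

int main(void)
{
    int    p = 4;                        /* number of threads to test */
    double r1, rp;
    double T1 = timed_sum(1, &r1);       /* serial (one-thread) time */
    double Tp = timed_sum(p, &rp);
    double Sp = T1 / Tp;                 /* speedup */
    double Ep = Sp / p;                  /* efficiency */

    printf("sum = %.6f, T1 = %.3f s, Tp = %.3f s (p = %d), speedup = %.2f, efficiency = %.2f\n",
           rp, T1, Tp, p, Sp, Ep);
    return 0;
}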

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 58 Speed-Up / Efficiency Example

Given: T1 = 150 secs, Tp = 50 secs, p = 4

Speedup:    Sp = T1 / Tp = 150 / 50 = 3.0

Efficiency: Ep = Sp / p = 3.0 / 4 = 0.75

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 59 Super-linear Speedup

 Super-linear speedup occurs when a speedup of more than N is observed when using N processors.

In practice super linear speedup rarely happens but can occur in some cases.

Typically it occurs due to cache effects resulting from the different memory hierarchies. In parallel computing, not only does the number of processors change, but so does the size of the accumulated caches from the different processors. With the larger accumulated cache size, more (or even all) of the core data set can fit into cache, and memory access time drops dramatically, causing extra speedup beyond that from the computation alone.

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 60 Outline

◦ Why Parallel Computing ◦ Flynn’s Taxonomy ◦ Communication Models ◦ Parallel Programming Terminology ◦ Amdahl’s Law

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 61 Amdahl’s Law

Need to understand the potential of multi-core systems. Amdahl’s Law:
 Amdahl's law states that if F is the fraction of a calculation that is sequential, and (1 − F) is the fraction that can be parallelized, then the maximum speedup that can be achieved by using N processors is

S(N) = 1 / (F + (1 − F) / N)

 In the limit, as N tends to infinity, the maximum speedup tends to 1/F.

 As an example, if F is only 10%, the problem can be sped up by only a maximum of a factor of 10, no matter how large the value of N used.
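A small sketch that simply evaluates the formula for several processor counts, using the 10% sequential fraction from the example (the chosen N values are arbitrary).

/* Amdahl's law: maximum speedup S(N) = 1 / (F + (1 - F) / N), with limit 1/F as N grows. */
#include <stdio.h>

int main(void)
{
    double F   = 0.10;                            /* sequential fraction of the work */
    int    n[] = { 2, 4, 16, 256, 65536 };
    int    count = sizeof(n) / sizeof(n[0]);

    for (int i = 0; i < count; i++) {
        double s = 1.0 / (F + (1.0 - F) / n[i]);
        printf("N = %6d  ->  max speedup = %6.2f\n", n[i], s);
    }
    printf("Limit as N -> infinity: 1/F = %.2f\n", 1.0 / F);
    return 0;
}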

Mark R. Gilder CSI 440/540 – SUNY Albany Fall '09 62