The Interconnection Network 6

ARCHITECTURES FOR PARALLEL COMPUTATION 1. Why Parallel Computation 2. Parallel Programs 3. A Classification of Computer Architectures 4. Performance of Parallel Architectures 5. The Interconnection Network 6. SIMD Computers: Array Processors 7. MIMD Computers 9. Multicore Architectures 10. Multithreading 11. General Purpose Graphic Processing Units 12. Vector Processors 13. Multimedia Extensions to Microprocessors Datorarkitektur Fö 11-12 1 of 82 The Need for High Performance Two main factors contribute to high performance of modern processors: Fast circuit technology Architectural features: - large caches - multiple fast buses - pipelining - superscalar architectures (multiple functional units) Datorarkitektur Fö 11-12 2 of 82 The Need for High Performance Two main factors contribute to high performance of modern processors: Fast circuit technology Architectural features: - large caches - multiple fast buses - pipelining - superscalar architectures (multiple functional units) However Computers running with a single CPU, often are not able to meet performance needs in certain areas: - Fluid flow analysis and aerodynamics - Simulation of large complex systems, for example in physics, economy, biology, technic - Computer aided design -Multimedia - Machine learning Datorarkitektur Fö 11-12 3 of 82 A Solution: Parallel Computers One solution to the need for high performance: architectures in which several CPUs are running in order to solve a certain application. Such computers have been organized in different ways. Some key features: number and complexity of individual CPUs availability of common (shared memory) interconnection topology performance of interconnection network I/O devices - - - - - - - - - - - - - To efficiently use parallel computers you need to write parallel programs. Datorarkitektur Fö 11-12 4 of 82 Parallel Programs Parallel sorting Unsorted-1Unsorted-2 Unsorted-3 Unsorted-4 Sort-1Sort-2 Sort-3 Sort-4 Sorted-1Sorted-2 Sorted-3 Sorted-4 Merge SORTED Datorarkitektur Fö 11-12 5 of 82 Parallel Programs var t: array [1..1000] of integer; Parallel sorting - - - - - - - - - - - procedure sort (i, j:integer); - - sort elements between t[i] and t[j] - - Unsorted-1Unsorted-2 Unsorted-3 Unsorted-4 end sort; procedure merge; Sort-1Sort-2 Sort-3 Sort-4 - - merge the four sub-arrays - - end merge; Sorted-1Sorted-2 Sorted-3 Sorted-4 - - - - - - - - - - - begin - - - - - - - - Merge cobegin sort (1,250) | SORTED sort (251,500) | sort (501,750) | sort (751,1000) coend; merge; - - - - - - - - end; Datorarkitektur Fö 11-12 6 of 82 Parallel Programs Matrix addition: ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a11 a12 a1m b11 b12 b1m c11 c12 c1m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a21 a22 a2m b21 b22 b2m c21 c22 c2m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a31 a32 a3m +=b31 b32 b3m c31 c32 c3m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ an1 an2 anm bn1 bn2 bnm cn1 cn2 cnm Datorarkitektur Fö 11-12 7 of 82 Parallel Programs Matrix addition: ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a11 a12 a1m b11 b12 b1m c11 c12 c1m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a21 a22 a2m b21 b22 b2m c21 c22 c2m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a31 a32 a3m +=b31 b32 b3m c31 c32 c3m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ an1 an2 anm bn1 bn2 bnm cn1 cn2 cnm Sequential version: var a: array [1..n, 1..m] of integer; b: array [1..n, 1..m] of integer; c: array [1..n, 1..m] of integer; i: integer - - - - - - - - - - - begin - - - - - - - - for i:=1 to n do for j:= 1 to m do c[i,j]:=a[i, j] + b[i, j]; end for end for - - - - - - - - end; Datorarkitektur Fö 11-12 8 of 82 Parallel Programs Matrix addition: ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a11 a12 a1m b11 b12 b1m c11 c12 c1m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a21 a22 a2m b21 b22 b2m c21 c22 c2m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ Parallel version: a31 a32 a3m +=b31 b32 b3m c31 c32 c3m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ var a: array [1..n, 1..m] of integer; an1 an2 anm bn1 bn2 bnm cn1 cn2 cnm b: array [1..n, 1..m] of integer; c: array [1..n, 1..m] of integer; Sequential version: i: integer var a: array [1..n, 1..m] of integer; - - - - - - - - - - - b: array [1..n, 1..m] of integer; procedure add_vector(n_ln: integer); c: array [1..n, 1..m] of integer; var j: integer begin i: integer for j:=1 to m do - - - - - - - - - - - c[n_ln, j]:=a[n_ln, j] + b[n_ln, j]; begin end for - - - - - - - - end add_vector; for i:=1 to n do begin for j:= 1 to m do - - - - - - - - c[i,j]:=a[i, j] + b[i, j]; cobegin for i:=1 to n do end for add_vector(i); end for coend; - - - - - - - - - - - - - - - - end; end; Datorarkitektur Fö 11-12 9 of 82 Parallel Programs Matrix addition: ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a11 a12 a1m b11 b12 b1m c11 c12 c1m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a21 a22 a2m b21 b22 b2m c21 c22 c2m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ Parallel version: a31 a32 a3m +=b31 b32 b3m c31 c32 c3m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ var a: array [1..n, 1..m] of integer; an1 an2 anm bn1 bn2 bnm cn1 cn2 cnm b: array [1..n, 1..m] of integer; c: array [1..n, 1..m] of integer; Vector computation version 1: i: integer var a: array [1..n, 1..m] of integer; - - - - - - - - - - - b: array [1..n, 1..m] of integer; procedure add_vector(n_ln: integer); c: array [1..n, 1..m] of integer; var j: integer begin i: integer for j:=1 to m do - - - - - - - - - - - c[n_ln, j]:=a[n_ln, j] + b[n_ln, j]; begin end for - - - - - - - - end add_vector; for i:=1 to n do begin c[i,1:m]:=a[i,1:m] +b [i,1:m]; - - - - - - - - end for; cobegin for i:=1 to n do - - - - - - - - add_vector(i); end; coend; - - - - - - - - end; Datorarkitektur Fö 11-12 10 of 82 Parallel Programs Matrix addition: ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a11 a12 a1m b11 b12 b1m c11 c12 c1m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a21 a22 a2m b21 b22 b2m c21 c22 c2m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ a31 a32 a3m +=b31 b32 b3m c31 c32 c3m ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ an1 an2 anm bn1 bn2 bnm cn1 cn2 cnm Vector computation version 1: Vector computation version 2: var a: array [1..n, 1..m] of integer; var a: array [1..n, 1..m] of integer; b: array [1..n, 1..m] of integer; b: array [1..n, 1..m] of integer; c: array [1..n, 1..m] of integer; c: array [1..n, 1..m] of integer; i: integer - - - - - - - - - - - - - - - - - - - - - - begin begin - - - - - - - - - - - - - - - - c[1:n,1:m]:=a[1:n,1:m]+b[1:n,1:m]; for i:=1 to n do - - - - - - - - c[i,1:m]:=a[i,1:m] +b [i,1:m]; end; end for; - - - - - - - - end; Datorarkitektur Fö 11-12 11 of 82 Parallel Programs Pipeline model computation: x y y =455 × + logx Datorarkitektur Fö 11-12 12 of 82 Parallel Programs Pipeline model computation: x y y =455 × + logx x a y a = 45 + logx y = 5 × a Datorarkitektur Fö 11-12 13 of 82 Parallel Programs Pipeline model computation: channel ch:real; - - - - - - - - - x y y =455 × + logx cobegin var x: real; while true do read(x); send(ch, 45+log(x)); end while | x a y a = 45 + logx y = 5 × a var v: real; while true do receive(ch, v); write(5 * sqrt(v)); end while coend; - - - - - - - - - Datorarkitektur Fö 11-12 14 of 82 Flynn’s Classification of Computer Architectures Flynn’s classification is based on the nature of the instruction flow executed by the computer and that of the data flow on which the instructions operate. Datorarkitektur Fö 11-12 15 of 82 Flynn’s Classification of Computer Architectures Single Instruction stream, Single Data stream (SISD) CPU Control instr. stream Processing unit unit Memory stream data Datorarkitektur Fö 11-12 16 of 82 Flynn’s Classification of Computer Architectures Single Instruction stream, Multiple Data stream (SIMD) SIMD with shared memory Processing DS1 unit_1 Processing DS2 unit_2 Control IS Shared unit Memory Processing DSn Interconnection Network unit_n Datorarkitektur Fö 11-12 17 of 82 Flynn’s Classification of Computer Architectures Single Instruction stream, Multiple Data stream (SIMD) SIMD with no shared memory LM1 DS1 Processing unit_1 LM2 DS2 LM Processing unit_2 Control IS unit Interconnection Network LMn DSn Processing unit_n Datorarkitektur Fö 11-12 18 of 82 Flynn’s Classification of Computer Architectures Multiple Instruction stream, Multiple Data stream (MIMD) MIMD with shared memory IS1 LM1 CPU_1 DS1 Control Processing unit_1 unit_1 IS2 LM2 CPU_2 DS2 Control Processing unit_2 unit_2 Shared Memory ISn Interconnection Network LMn CPU_n DSn Control Processing unit_n unit_n Datorarkitektur Fö 11-12 19 of 82 Flynn’s Classification of Computer Architectures Multiple Instruction stream, Multiple Data stream (MIMD) MIMD with no shared memory IS1 LM1 CPU_1 DS1 Control Processing unit_1 unit_1 IS2 LM2 CPU_2 DS2 Control Processing unit_2 unit_2 ISn Interconnection Network LMn CPU_n DSn Control Processing unit_n unit_n Datorarkitektur Fö 11-12 20 of 82 Performance of Parallel Architectures Important questions: How fast runs a parallel computer at its maximal potential? How fast execution can we expect from a parallel computer for a concrete application? How do we measure the performance of a parallel computer and the performance improvement we get by using such a computer? Datorarkitektur Fö 11-12 21 of 82 Performance Metrics Peak rate: the maximal computation rate that can be theoretically achieved when all modules are fully utilized. The peak rate is of no practical significance for the user. It is mostly used by vendor companies for marketing

The Interconnection Network 6

Processamento Paralelo Em Cuda Aplicado Ao Modelo De Geração De Cenários Sintéticos De Vazões E Energias - Gevazp

Radio Frequency Identification Based Smart

Andrzej Nowak - Bio

COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1)

FPL-2019 Why-I-Love-Latency

A Survey of Microarchitectural Timing Attacks and Countermeasures on Contemporary Hardware

Parallel Implementation of Sorting Algorithms

IJMEIT// Vol.04 Issue 08//August//Page No:1729-1735//ISSN-2348-196X 2016

Simultaneous Multithreading on X86 64 Systems: an Energy Efﬁciency Evaluation

(19) United States %

PCAP Assignment IV.Pages

UC Riverside UC Riverside Previously Published Works