<<

9-1 Chap. 9 and Vector Processing

 9-1 Parallel Processing  Simultaneous data processing tasks for the purpose of increasing the = computational speed  Perform concurrent data processing to achieve faster execution time  Multiple Functional Unit : Fig. 9-1 Parallel Processing Example  Separate the execution unit into eight functional units operating in parallel  Computer Architectural Classification Adder-subtractor  Data-Instruction Stream : Flynn Integer multiply  Serial versus Parallel Processing : Feng  Parallelism and Pipelining : Händler Logic unit

 Flynn’s Classification Shift unit To Memory  1) SISD (Single Instruction - Single Data stream) Incrementer Processor registers » for practical purpose: only one processor is useful Floatint-point add-subtract » Example systems : Amdahl 470V/6, IBM 360/91 Floatint-point IS multiply

Floatint-point divide

CU PU MM IS DS

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-2

 2) SIMD Shared memmory

(Single Instruction - Multiple Data stream) DS 1 PU 1 MM1 » vector or array operations 에 적합한 형태

 one vector operation includes many DS 2 operations on a data stream PU 2 MM2 IS » Example systems : CRAY -1, ILLIAC-IV CU

DS n PU n MMn

IS  3) MISD (Multiple Instruction - Single Data stream)

» Data Stream에 Bottle neck으로 인해 DS

IS1 IS1 CU1 PU 1 실제로 사용되지 않음

IS2 IS2 CU2 PU 2 MMn MM2 MM1

ISn ISn CUn PU n

DS

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-3

 4) MIMD (Multiple Instruction - Multiple Data stream) » 대부분의 Multiprocessor Shared memory IS1 IS1 DS System에서 사용됨 CU1 PU1 MM1 » Shared memory or IS2 IS2 Message passing : Chap. 13 CU2 PU2 MM2

v v

ISn ISn CUn PUn MMn

 Main topics in this Chapter  Pipeline processing : Sec. 9-2 » Arithmetic pipeline : Sec. 9-3 » Instruction pipeline : Sec. 9-4 (Sec. 9-5 : RISC Instruction Pipeline)  Vector processing :adder/multiplier pipeline 이용, Sec. 9-6 Large vector, Matrices,  Array processing :별도의 array processor 이용, Sec. 9-7 그리고 Array Data 계산 » Attached array processor : Fig. 9-14 » SIMD array processor : Fig. 9-15

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-4

 9-2 Pipelining  Pipelining의원리  Decomposing a sequential into suboperations  Each subprocess is executed in a special dedicated segment concurrently  Pipelining의예제: Fig. 9-2  Multiply and Add Operation : A i * B i  C i ( for i = 1, 2, …, 7 )  3 개의 Suboperation Segment로분리 » 1) R 1  Ai , R 2  Bi : Input Ai and Bi » 2) R 3  R 1 * R 2 , R 4  Ci : Multiply and input Ci » 3) R 5  R 3  R 4 : Add Ci  Content of registers in pipeline example : Tab. 9-1  General considerations  4 segment pipeline : Fig. 9-3 » S : Combinational circuit for Suboperation » R : Register(intermediate results between the segments)  Space-time diagram : Fig. 9-4 Segment » Show segment utilization as a function of time versus Clock-cycle  Task : T1, T2, T3,…, T6 » Total operation performed going through all the segment

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-5

S : Nonpipeline / Pipeline

 S = n • tn / ( k + n - 1 ) • tp = 6 • 4 tp / ( 4 + 6 - 1 ) • tp = 24 tp / 9 tp = 2.67 »n : task number ( 6 ) Pipeline에서의처리시간= 9 clock cycles

»tn : time to complete each task in nonpipeline

k + n - 1  n »tp : clock cycle time ( 1 clock cycle ) »k : segment number ( 4 ) Clock cycles 1 293 4 5 6 7 8  If n이면, S = tn / tp 1 T 1 T2 T3 T4 T5 T6  한개의task를 처리하는 시간이 같을 때

T1 T2 T3 T4 T5 T6 즉, nonpipeline ( tn ) = pipeline ( k • tp ) 2 이라고 가정하면, Segment 3 T1 T2 T3 T4 T5 T6 S = tn / tp = k • tp / tp = k T1 T2 T3 T4 T5 T6 따라서 이론적으로 k 배 (segment 개수) 4 만큼 처리 속도가 향상된다.  Pipeline에는 Arithmetic Pipeline(Sec. 9-3)과 Instruction Pipeline(Sec. 9-4)이있다  Sec. 9-3 Arithmetic Pipeline  Floating-point Adder Pipeline Example : Fig. 9-6  Add / Subtract two normalized floating-point binary number » X = A x 2a = 0.9504 x 103 » Y = B x 2b = 0.8200 x 102

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-6

 4 segments suboperations Exponents Mantissas » 1) Compare exponents by subtraction : abA B 3 - 2 = 1 R R  X = 0.9504 x 103  Y = 0.8200 x 102 Compare Difference Segment 1 : exponents » 2) Align mantissas by subtraction  X = 0.9504 x 103  Y = 0.08200 x 103 R » 3) Add mantissas Segment 2 : Choose exponent Align mantissas  Z = 1.0324 x 103 » 4) Normalize result R

 4 Z = 0.1324 x 10 Add or subtract Segment 3 : mantissas

R R

Adjust Normalize Segment 4 : exponent result

R R

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-7

Segment 1 : Fetch instruction from memory  9-4 Instruction Pipeline Decode instruction and calculate  Instruction Cycle effective address 1) Fetch the instruction from memory Segment 2 : Branch ? 2) Decode the instruction Fetch operand 3) Calculate the effective address from memory

4) Fetch the operands from memory Execute instruction 5) Execute the instruction Interrupt Interrupt ? 6) Store the result in the proper place handling Segment 3 :

 Example : Four-segment Instruction Pipeline Update PC

Segment 4 :  Four-segment CPU pipeline : Fig. 9-7 Empty pipe » 1) FI : Instruction Fetch » 2) DA : Decode Instruction & calculate EA Step : 1 2 3 49125 6 7 8 1011 13 Instruction : 1 FIDA FO EX

» 3) FO : Operand Fetch 2 FIDA FO EX

» 4) EX : Execution (Branch) 3 FIDA FO EX

 Timing of Instruction Pipeline : Fig. 9-8 4 FI FIDA FO EX » Instruction 3 에서 Branch 명령 실행 5 FIDA FO EX 6 FIDA FO EX

7 FIDA FO EX No Branch Branch

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-8

 Pipeline Conflicts : 3 major difficulties  1) Resource conflicts » memory access by two segments at the same time  2) Data dependency » when an instruction depend on the result of a previous instruction, but this result is not yet available  3) Branch difficulties » branch and other instruction (interrupt, ret, ..) that change the value of PC  Data Dependency 해결 방법  Hardware 적인 방법 » Hardware Interlock  previous instruction의 결과가 나올 때 까지 Hardware 적인 Delay를강제삽입 »  previous instruction의 결과를 곧바로 ALU 로전달(정상적인 경우, register를 경유함)  Software 적인 방법 » Delayed Load : Fig. 9-9, Sec. 9-5  previous instruction의 결과가 나올 때 까지 No-operation instruction 을삽입  Handling of Branch Instructions  Prefetch target instruction » Conditional branch에서 branch target instruction (조건 맞음) 과다음instruction (조건 안 맞음) 을모두fetch

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-9

 Branch Target Buffer : BTB » 1) Associative memory를 이용하여 branch target address 이후에 몇 개에 instruction 을 미리 BTB에 저장한다. » 2) 만약 branch instruction이면 우선 BTB를 검사하여 BTB에 있으면 곧바로 가져온다(Cache 개념 도입) Clock cycles : 1 2 3 4 5 6 7108 9  Loop Buffer 1. Load IEA

» 1) small very high speed register file (RAM) 을 2. Increment IEA

이용하여 프로그램에서 loop를 detect한다. 3. Add IEA

» 2) 만약 loop가 발견되면 loop 프로그램 전체를 4. Subtract IEA Loop Buffer에 load 하여 실행하면 외부 5. Branch to X IEA 메모리를 access 하지 않는다. 6. No-operation IEA 7. No-operation IEA  Branch Prediction 8. Instruction in X IEA » Branch를 predict하는 additional hardware logic 사용 (a) Using no-operation instructions » Correct guess eliminates the branch difficulty Clock cycles : 2 3 4 5 6 7 8 1  Delayed Branch 해결 방법 1. Load IEA 2. Increment IEA  Fig. 9-8 에서와 같이 branch instruction이 3. Branch to X IEA pipeline operation을 지연시키는 경우 4. Add IEA

 예제 : Fig. 9-10, p. 318, Sec. 9-5 5. Subtract IEA IEA » 1) No-operation instruction 삽입 6. Instruction in X » 2) Instruction Rearranging : Compiler 지원 (b) Rearranging the instructions Fig. 9-10 Example of delayed branch

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-10

 9-5 RISC Pipeline  RISC CPU 의특징 Conflict 발생  Instruction Pipeline 을 이용함 Clock cycles : 1 2 3 4 5 6  Single-cycle instruction execution 1. Load R1 IEA  Compiler support 2. Load R2 IEA

 Example : Three-segment Instruction Pipeline 3. Add R1+R2 IEA  3 Suboperations Instruction Cycle 4. Store R3 IEA » 1) I : Instruction fetch » 2) A : Instruction decoded and ALU operation (a) Pipeline timing with data conflict » 3) E : Transfer the output of ALU to a register, Clock cycles : 1 2 3 4 5 6 7 memory, or PC(Program control Inst.=JMP/CALL) 1. Load R1 IEA

 Delayed Load : Fig. 9-9(a) 2. Load R2 IEA »3 번째 Instruction(ADD R1 + R2)에서 Conflict 발생 3. No-operation IEA  4 번째 clock cycle에서 2 번째 Instruction (LOAD R2) 실행과 동시에 3 번째 instruction에서 R2 를연산 4. Add R1+R2 IEA » Delayed Load 해결 방법 : Fig. 9-9(b) 5. Store R3 IEA

 No-operation 삽입 (b) Pipeline timing with delayed load  Delayed Branch : Sec. 9-4에서 이미 설명 Fig. 9-9 Three-segment pipeline timing

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-11

 9-6 Vector Processing  Science and Engineering Applications  Long-range weather forecasting, Petroleum explorations, Seismic data analysis, Medical diagnosis, Aerodynamics and space flight simulations, Artificial intelligence and expert systems, Mapping the human genome, Image processing  Vector Operations  Arithmetic operations on large arrays of numbers  Conventional scalar processor » Machine language » Fortran language Initialize I = 0 DO 20 I = 1, 100 20 Read A(I) 20 C(I) = A(I) + B(I) Read B(I) Store C(I) = A(I) + B(I) Increment I = I + 1 If I  100 go to 20 Continue

» Single vector instruction C(1:100) = A(1:100) + B(1:100)

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-12

 Vector Instruction Format : Fig. 9-11

ADD A B C 100  Matrix Multiplication  3 x 3 matrices multiplication : n2 = 9 inner product

a11 a12 a13  b11 b12 b13  c11 c12 c13  a a a   b b b   c c c   21 22 23   21 22 23   21 22 23  a31 a32 a33  b31 b32 b33  c31 c32 c33 

»:c11 a11 b11 a12 b21  a13 b31 이와 같은 inner product가 9 개

 Cumulative multiply-add operation : n3 = 27 multiply-add c  c  a  b

»:c11 c11 a11 b11 a12 b21  a13 b31 이와 같은 multiply-add가 3 개   따라서 9 X 3 multiply-add = 27

C11의 초기값 = 0

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-13

 Pipeline for calculating an inner product : Fig. 9-12  Floating point multiplier pipeline : 4 segment  Floating point adder pipeline : 4 segment

 예제 ) C A1B1 A2B 2 A3B 3   Ak Bk » after 1st clock input » after 4th clock input Source Source A A

A1B1 A4B4 A3B3 A2B2 A1B1

Source Multiplier Adder Source Multiplier Adder B pipeline pipeline B pipeline pipeline » after 8th clock input » after 9th, 10th, 11th ,... Source Source A A

A8B8 A7B7 A6B6 A5B5 A4B4 A3B3 A2B2 A1B1 A8B8 A7B7 A6B6 A5B5 A4B4 A3B3 A2B2 A1B1

Source Multiplier Adder Source Multiplier Adder B pipeline pipeline B pipeline pipeline » Four section summation

A B  A B A1B1  A5B5 C  A1B1  A5B5  A9B9  A13B13  2 2 6 6  A B  A B  A B  A B  2 2 6 6 10 10 14 14  , , ,    A3B3  A7B7  A11B11  A15B15 

 A4B4  A8B8  A12B12  A16B16 

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-14

 Memory Interleaving : Fig. 9-13 Address bus

 Pipeline and vector processors often require AR AR AR AR simultaneous access to memory from two or

Memory Memory Memory Memory more source using one memory bus system array array array array  AR 의하위2 bit를 사용하여 4 개중 1 개의 memory module 선택 DR DR DR DR  예제 ) Even / Odd Address Memory Access Data bus   Supercomputer = Vector Instruction + Pipelined floating-point arithmetic  Performance Evaluation Index » MIPS : Million Instruction Per Second » FLOPS : Floating-point Operation Per Second  megaflops : 106, gigaflops : 109, teraflops : 1012  Cray supercomputer : Cray Research » Clay-1 : 80 megaflops, 4 million 64 bit words memory » Clay-2 : 12 times more powerful than the clay-1  VP supercomputer : Fujitsu » VP-200 : 300 megaflops, 32 million memory, 83 vector instruction, 195 scalar instruction » VP-2600 : 5 gigaflops

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm. 9-15

 9-7 Array Processors  Performs computations on large arrays of data Vector processing : Adder/Multiplier pipeline 이용 Array processing :별도의 array processor 이용

 Array Processing  Attached array processor : Fig. 9-14 » Auxiliary processor attached to a general purpose computer  SIMD array processor : Fig. 9-15 » Computer with multiple processing units operating in parallel

 Vector 계산 C = A + B 에서 ci = ai + bi 를

각각의 PEi에서 동시에 실행

PE 1 M1

Master control PE 2 M2 General-purpose Input-Output Attached array unit computer interface Processor

PE 3 M3

High-speed memory to- Main memory Local memory memory bus

Main memory PE n Mn

© Korea Univ. of Tech. & Edu. Computer System Architecture Chap. 9 Pipeline and Vector Processing Dept. of Info. & Comm.