Chap. 9 Pipeline and Vector Processing


9-1 Parallel Processing

Parallel processing means performing simultaneous data-processing tasks for the purpose of increasing computational speed, i.e. performing concurrent data processing to achieve a faster execution time.

Multiple functional units (Fig. 9-1, parallel processing example): the execution unit is separated into eight functional units operating in parallel, all fed from the processor registers:
  » Adder-subtractor
  » Integer multiply
  » Logic unit
  » Shift unit
  » Incrementer
  » Floating-point add-subtract
  » Floating-point multiply
  » Floating-point divide

Computer architectural classifications:
  » Data and instruction streams : Flynn
  » Serial versus parallel processing : Feng
  » Parallelism and pipelining : Händler

Flynn's classification

1) SISD (Single Instruction - Single Data stream)
  » One control unit (CU), one processing unit (PU), one memory module (MM): a single instruction stream operating on a single data stream
  » For practical purposes only one processor is useful
  » Example systems : Amdahl 470V/6, IBM 360/91

2) SIMD (Single Instruction - Multiple Data stream)
  » One control unit drives many processing units (PU1..PUn) with their memory modules (MM1..MMn) connected through a shared memory
  » Well suited to vector or array operations: one vector operation performs many operations on a data stream
  » Example systems : CRAY-1, ILLIAC-IV

3) MISD (Multiple Instruction - Single Data stream)
  » Not used in practice, because the single data stream becomes a bottleneck

4) MIMD (Multiple Instruction - Multiple Data stream)
  » Used by most multiprocessor systems
  » Shared memory or message passing : Chap. 13

Main topics in this chapter
  » Pipeline processing : Sec. 9-2
    - Arithmetic pipeline : Sec. 9-3
    - Instruction pipeline : Sec. 9-4 (Sec. 9-5 : RISC instruction pipeline)
  » Vector processing : uses an adder/multiplier pipeline, Sec. 9-6
  » Array processing : uses a separate array processor for large vectors, matrices, and array data, Sec. 9-7
    - Attached array processor : Fig. 9-14
    - SIMD array processor : Fig. 9-15

9-2 Pipelining

Principle of pipelining
  » Decompose a sequential process into suboperations
  » Each suboperation is executed in a special dedicated segment, and all segments operate concurrently

Pipelining example : Fig. 9-2
  » Multiply-and-add operation : Ai * Bi + Ci ( for i = 1, 2, ..., 7 )
  » Separated into 3 suboperation segments :
    1) R1 ← Ai, R2 ← Bi : input Ai and Bi
    2) R3 ← R1 * R2, R4 ← Ci : multiply and input Ci
    3) R5 ← R3 + R4 : add Ci
  » Content of the registers in the pipeline example : Tab. 9-1

General considerations
  » 4-segment pipeline : Fig. 9-3
    - S : combinational circuit performing a suboperation
    - R : register holding the intermediate result between segments
  » Space-time diagram : Fig. 9-4
    - Shows segment utilization as a function of time (clock cycles); a small simulation sketch follows below
  » Task : T1, T2, T3, ..., T6
    - A task is the total operation performed by going through all the segments
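The space-time diagram can be reproduced with a short simulation. The Python sketch below is not part of the original slides; it assumes an ideal pipeline in which every segment takes exactly one clock cycle, and the function name space_time_diagram is our own. With k = 4 segments and n = 6 tasks it prints the schedule of Fig. 9-4 and confirms that the pipeline finishes in k + n - 1 = 9 clock cycles.

    # Minimal sketch (assumption: ideal pipeline, every segment takes one clock cycle).
    def space_time_diagram(k, n):
        """Print which task occupies each of the k segments during each clock cycle."""
        total_cycles = k + n - 1                      # n tasks finish in k + n - 1 cycles
        print("cycle:", *range(1, total_cycles + 1))
        for seg in range(1, k + 1):                   # segment i holds task j during cycle j + i - 1
            row = []
            for cycle in range(1, total_cycles + 1):
                task = cycle - seg + 1
                row.append(f"T{task}" if 1 <= task <= n else "--")
            print(f"seg {seg}:", *row)
        return total_cycles

    if __name__ == "__main__":
        cycles = space_time_diagram(k=4, n=6)         # 4 segments, 6 tasks -> 9 clock cycles
        print("total clock cycles:", cycles)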
Speedup

S = (nonpipeline time) / (pipeline time) = n · tn / ( (k + n − 1) · tp )

  » n : number of tasks ( 6 )
  » tn : time to complete each task in the nonpipelined machine
  » tp : clock cycle time of the pipeline ( 1 clock cycle )
  » k : number of segments ( 4 )

For the example above, S = 6 · 4tp / ( (4 + 6 − 1) · tp ) = 24 tp / 9 tp = 2.67; the pipeline needs k + n − 1 = 9 clock cycles to finish the 6 tasks (space-time diagram, Fig. 9-4).

As n → ∞, S → tn / tp. If we further assume that one task takes the same time on both machines, i.e. tn = k · tp, then

S = tn / tp = k · tp / tp = k

so in theory the speedup equals k, the number of segments.

Pipelines are divided into arithmetic pipelines (Sec. 9-3) and instruction pipelines (Sec. 9-4).

9-3 Arithmetic Pipeline

Floating-point adder pipeline example : Fig. 9-6
Add / subtract two normalized floating-point binary numbers:
  » X = A × 2^a = 0.9504 × 10^3
  » Y = B × 2^b = 0.8200 × 10^2

4 segment suboperations, with a register R latching the result between segments (Fig. 9-6):
  Segment 1 : compare exponents by subtraction
      3 − 2 = 1 ; X = 0.9504 × 10^3, Y = 0.8200 × 10^2
  Segment 2 : choose exponent, align mantissas
      X = 0.9504 × 10^3, Y = 0.0820 × 10^3
  Segment 3 : add or subtract mantissas
      Z = 1.0324 × 10^3
  Segment 4 : normalize result, adjust exponent
      Z = 0.10324 × 10^4

9-4 Instruction Pipeline

Instruction cycle:
  1) Fetch the instruction from memory
  2) Decode the instruction
  3) Calculate the effective address
  4) Fetch the operands from memory
  5) Execute the instruction
  6) Store the result in the proper place

Example : four-segment instruction pipeline (Fig. 9-7; the flowchart also shows the branch test, interrupt handling, PC update, and emptying of the pipe):
  1) FI : fetch the instruction from memory
  2) DA : decode the instruction and calculate the effective address
  3) FO : fetch the operand from memory
  4) EX : execute the instruction

Timing of the instruction pipeline : Fig. 9-8
  » Instruction 3 is a branch instruction: the pipeline must wait until the branch is resolved in the EX segment before fetching further instructions, so instructions 4-7 are delayed (steps 1-13 in the figure, compared with the no-branch case)

Pipeline conflicts : 3 major difficulties
  1) Resource conflicts
     » two segments try to access memory at the same time
  2) Data dependency
     » an instruction depends on the result of a previous instruction, but that result is not yet available
  3) Branch difficulties
     » branches and other instructions (interrupt, return, ...) that change the value of the PC

Ways to resolve data dependency

Hardware methods
  » Hardware interlock
    - hardware inserts a delay until the result of the previous instruction is available
  » Operand forwarding
    - the result of the previous instruction is passed directly to the ALU (normally it would first go through a register)

Software method (see the sketch below)
  » Delayed load : Fig. 9-9, Sec. 9-5
    - the compiler inserts no-operation instructions until the result of the previous instruction is available
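To make the delayed-load idea concrete, here is a rough Python sketch (not from the slides) of the compiler-side transformation: it scans a simple instruction list and inserts a no-operation whenever an instruction reads a register that the immediately preceding load has not yet delivered. The tuple encoding and the helper insert_nops are assumptions made for illustration; they mirror the three-segment I/A/E pipeline of Fig. 9-9, where a loaded value can only be used by the second instruction after the load.

    # Rough sketch (assumption: one delay slot after a load, as in the I/A/E pipeline of Fig. 9-9).
    def insert_nops(program):
        """program: list of (mnemonic, dest_register, source_registers)."""
        fixed = []
        for instr in program:
            mnemonic, dest, sources = instr
            if fixed:
                prev_mnemonic, prev_dest, _ = fixed[-1]
                # data dependency: the previous instruction is a load whose destination
                # is read here before it has been written back
                if prev_mnemonic == "LOAD" and prev_dest in sources:
                    fixed.append(("NOP", None, ()))
            fixed.append(instr)
        return fixed

    program = [
        ("LOAD",  "R1", ()),            # R1 <- memory
        ("LOAD",  "R2", ()),            # R2 <- memory
        ("ADD",   "R3", ("R1", "R2")),  # needs R2 one cycle too early
        ("STORE", None, ("R3",)),       # memory <- R3
    ]
    for op in insert_nops(program):
        print(op)

Running the sketch prints the rearranged sequence Load R1, Load R2, No-operation, Add, Store, which is exactly the delayed-load timing of Fig. 9-9(b).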
Handling of branch instructions

Prefetch target instruction
  » For a conditional branch, prefetch both the branch target instruction (condition satisfied) and the next sequential instruction (condition not satisfied)

Branch target buffer (BTB)
  1) An associative memory stores the branch target address together with the first few instructions following the target
  2) When a branch instruction enters the pipeline the BTB is searched first; on a hit the target instructions are taken directly from the buffer (the same idea as a cache)

Loop buffer
  1) A small, very high-speed register file (RAM) is used to detect loops in the program
  2) Once a loop is detected, the entire loop is loaded into the loop buffer and executed from there, so external memory is not accessed

Branch prediction
  » Additional hardware logic predicts the outcome of the branch before it is executed
  » A correct guess eliminates the branch difficulty

Delayed branch
  » As in Fig. 9-8, a branch instruction delays the pipeline operation
  » Example : Fig. 9-10, p. 318, Sec. 9-5 (program: Load, Increment, Add, Subtract, Branch to X, Instruction in X)
    1) Insert no-operation instructions after the branch : Fig. 9-10(a), 10 clock cycles
    2) Rearrange the instructions with compiler support, moving Add and Subtract into the slots after the branch : Fig. 9-10(b), 8 clock cycles

9-5 RISC Pipeline

Characteristics of a RISC CPU
  » Uses an instruction pipeline
  » Single-cycle instruction execution
  » Compiler support

Example : three-segment instruction pipeline
  » The instruction cycle has 3 suboperations :
    1) I : instruction fetch
    2) A : instruction decode and ALU operation
    3) E : transfer the output of the ALU to a register, to memory, or to the PC (for program-control instructions such as JMP/CALL)

Delayed load : Fig. 9-9
  » Conflict in Fig. 9-9(a) (Load R1; Load R2; Add R1+R2; Store R3): at the 4th clock cycle the 2nd instruction (Load R2) is still completing while the 3rd instruction (Add) already needs R2
  » Solution, Fig. 9-9(b) : insert a no-operation instruction between Load R2 and Add (pipeline timing with delayed load, 7 clock cycles instead of 6)

Delayed branch : already described in Sec. 9-4

9-6 Vector Processing

Science and engineering applications
  » Long-range weather forecasting, petroleum exploration, seismic data analysis, medical diagnosis, aerodynamics and space flight simulations, artificial intelligence and expert systems, mapping the human genome, image processing

Vector operations
  » Arithmetic operations on large arrays of numbers

Conventional scalar processor
  » Fortran loop :
        DO 20 I = 1, 100
     20 C(I) = A(I) + B(I)
  » Equivalent machine-language sequence :
        Initialize I = 0
     20 Read A(I)
        Read B(I)
        Store C(I) = A(I) + B(I)
        Increment I = I + 1
        If I <= 100 go to 20
        Continue

Vector processor
  » A single vector instruction : C(1:100) = A(1:100) + B(1:100)
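As a final illustration (not part of the slides), the following Python sketch contrasts the scalar-processor loop with the single vector operation; NumPy's element-wise array addition stands in for the vector instruction C(1:100) = A(1:100) + B(1:100), and the array names a, b, c are arbitrary.

    import numpy as np

    a = np.arange(1, 101, dtype=float)        # A(1:100)
    b = np.arange(101, 201, dtype=float)      # B(1:100)

    # Scalar-processor style: one element per loop iteration,
    # like the Fortran DO 20 loop above.
    c_scalar = np.empty(100)
    for i in range(100):
        c_scalar[i] = a[i] + b[i]

    # Vector-processor style: one "vector instruction" over all 100 elements,
    # analogous to C(1:100) = A(1:100) + B(1:100).
    c_vector = a + b

    assert np.array_equal(c_scalar, c_vector)

On a real vector machine the whole addition is issued as one instruction to a pipelined adder, whereas the scalar loop issues a separate read-add-store-increment-test sequence for every element.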