Chap. 9 Pipeline and Vector Processing


9-1 Parallel Processing

Parallel processing means performing simultaneous data-processing tasks for the purpose of increasing computational speed, i.e. performing concurrent data processing to achieve a faster execution time.

Multiple functional units (Fig. 9-1, parallel processing example): the execution unit is separated into eight functional units operating in parallel, all fed from the processor registers:
  » Adder-subtractor
  » Integer multiply
  » Logic unit
  » Shift unit
  » Incrementer
  » Floating-point add-subtract
  » Floating-point multiply
  » Floating-point divide

Computer architectural classifications:
  » Data and instruction streams : Flynn
  » Serial versus parallel processing : Feng
  » Parallelism and pipelining : Händler

Flynn's classification

1) SISD (Single Instruction - Single Data stream)
  » One control unit (CU), one processing unit (PU), one memory module (MM): a single instruction stream operating on a single data stream
  » For practical purposes only one processor is useful
  » Example systems : Amdahl 470V/6, IBM 360/91

2) SIMD (Single Instruction - Multiple Data stream)
  » One control unit drives many processing units (PU1..PUn) with their memory modules (MM1..MMn) connected through a shared memory
  » Well suited to vector or array operations: one vector operation performs many operations on a data stream
  » Example systems : CRAY-1, ILLIAC-IV

3) MISD (Multiple Instruction - Single Data stream)
  » Not used in practice, because the single data stream becomes a bottleneck

4) MIMD (Multiple Instruction - Multiple Data stream)
  » Used by most multiprocessor systems
  » Shared memory or message passing : Chap. 13

Main topics in this chapter
  » Pipeline processing : Sec. 9-2
    - Arithmetic pipeline : Sec. 9-3
    - Instruction pipeline : Sec. 9-4 (Sec. 9-5 : RISC instruction pipeline)
  » Vector processing : uses an adder/multiplier pipeline, Sec. 9-6
  » Array processing : uses a separate array processor for large vectors, matrices, and array data, Sec. 9-7
    - Attached array processor : Fig. 9-14
    - SIMD array processor : Fig. 9-15

9-2 Pipelining

Principle of pipelining
  » Decompose a sequential process into suboperations
  » Each suboperation is executed in a special dedicated segment, and all segments operate concurrently

Pipelining example : Fig. 9-2
  » Multiply-and-add operation : Ai * Bi + Ci ( for i = 1, 2, ..., 7 )
  » Separated into 3 suboperation segments :
    1) R1 ← Ai, R2 ← Bi : input Ai and Bi
    2) R3 ← R1 * R2, R4 ← Ci : multiply and input Ci
    3) R5 ← R3 + R4 : add Ci
  » Content of the registers in the pipeline example : Tab. 9-1

General considerations
  » 4-segment pipeline : Fig. 9-3
    - S : combinational circuit performing a suboperation
    - R : register holding the intermediate result between segments
  » Space-time diagram : Fig. 9-4
    - Shows segment utilization as a function of time (clock cycles); a small simulation sketch follows below
  » Task : T1, T2, T3, ..., T6
    - A task is the total operation performed by going through all the segments
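The space-time diagram can be reproduced with a short simulation. The Python sketch below is not part of the original slides; it assumes an ideal pipeline in which every segment takes exactly one clock cycle, and the function name space_time_diagram is our own. With k = 4 segments and n = 6 tasks it prints the schedule of Fig. 9-4 and confirms that the pipeline finishes in k + n - 1 = 9 clock cycles.

    # Minimal sketch (assumption: ideal pipeline, every segment takes one clock cycle).
    def space_time_diagram(k, n):
        """Print which task occupies each of the k segments during each clock cycle."""
        total_cycles = k + n - 1                      # n tasks finish in k + n - 1 cycles
        print("cycle:", *range(1, total_cycles + 1))
        for seg in range(1, k + 1):                   # segment i holds task j during cycle j + i - 1
            row = []
            for cycle in range(1, total_cycles + 1):
                task = cycle - seg + 1
                row.append(f"T{task}" if 1 <= task <= n else "--")
            print(f"seg {seg}:", *row)
        return total_cycles

    if __name__ == "__main__":
        cycles = space_time_diagram(k=4, n=6)         # 4 segments, 6 tasks -> 9 clock cycles
        print("total clock cycles:", cycles)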
Speedup

S = (nonpipeline time) / (pipeline time) = n · tn / ( (k + n − 1) · tp )

  » n : number of tasks ( 6 )
  » tn : time to complete each task in the nonpipelined machine
  » tp : clock cycle time of the pipeline ( 1 clock cycle )
  » k : number of segments ( 4 )

For the example above, S = 6 · 4tp / ( (4 + 6 − 1) · tp ) = 24 tp / 9 tp = 2.67; the pipeline needs k + n − 1 = 9 clock cycles to finish the 6 tasks (space-time diagram, Fig. 9-4).

As n → ∞, S → tn / tp. If we further assume that one task takes the same time on both machines, i.e. tn = k · tp, then

S = tn / tp = k · tp / tp = k

so in theory the speedup equals k, the number of segments.

Pipelines are divided into arithmetic pipelines (Sec. 9-3) and instruction pipelines (Sec. 9-4).

9-3 Arithmetic Pipeline

Floating-point adder pipeline example : Fig. 9-6
Add / subtract two normalized floating-point binary numbers:
  » X = A × 2^a = 0.9504 × 10^3
  » Y = B × 2^b = 0.8200 × 10^2

4 segment suboperations, with a register R latching the result between segments (Fig. 9-6):
  Segment 1 : compare exponents by subtraction
      3 − 2 = 1 ; X = 0.9504 × 10^3, Y = 0.8200 × 10^2
  Segment 2 : choose exponent, align mantissas
      X = 0.9504 × 10^3, Y = 0.0820 × 10^3
  Segment 3 : add or subtract mantissas
      Z = 1.0324 × 10^3
  Segment 4 : normalize result, adjust exponent
      Z = 0.10324 × 10^4

9-4 Instruction Pipeline

Instruction cycle:
  1) Fetch the instruction from memory
  2) Decode the instruction
  3) Calculate the effective address
  4) Fetch the operands from memory
  5) Execute the instruction
  6) Store the result in the proper place

Example : four-segment instruction pipeline (Fig. 9-7; the flowchart also shows the branch test, interrupt handling, PC update, and emptying of the pipe):
  1) FI : fetch the instruction from memory
  2) DA : decode the instruction and calculate the effective address
  3) FO : fetch the operand from memory
  4) EX : execute the instruction

Timing of the instruction pipeline : Fig. 9-8
  » Instruction 3 is a branch instruction: the pipeline must wait until the branch is resolved in the EX segment before fetching further instructions, so instructions 4-7 are delayed (steps 1-13 in the figure, compared with the no-branch case)

Pipeline conflicts : 3 major difficulties
  1) Resource conflicts
     » two segments try to access memory at the same time
  2) Data dependency
     » an instruction depends on the result of a previous instruction, but that result is not yet available
  3) Branch difficulties
     » branches and other instructions (interrupt, return, ...) that change the value of the PC

Ways to resolve data dependency

Hardware methods
  » Hardware interlock
    - hardware inserts a delay until the result of the previous instruction is available
  » Operand forwarding
    - the result of the previous instruction is passed directly to the ALU (normally it would first go through a register)

Software method (see the sketch below)
  » Delayed load : Fig. 9-9, Sec. 9-5
    - the compiler inserts no-operation instructions until the result of the previous instruction is available
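To make the delayed-load idea concrete, here is a rough Python sketch (not from the slides) of the compiler-side transformation: it scans a simple instruction list and inserts a no-operation whenever an instruction reads a register that the immediately preceding load has not yet delivered. The tuple encoding and the helper insert_nops are assumptions made for illustration; they mirror the three-segment I/A/E pipeline of Fig. 9-9, where a loaded value can only be used by the second instruction after the load.

    # Rough sketch (assumption: one delay slot after a load, as in the I/A/E pipeline of Fig. 9-9).
    def insert_nops(program):
        """program: list of (mnemonic, dest_register, source_registers)."""
        fixed = []
        for instr in program:
            mnemonic, dest, sources = instr
            if fixed:
                prev_mnemonic, prev_dest, _ = fixed[-1]
                # data dependency: the previous instruction is a load whose destination
                # is read here before it has been written back
                if prev_mnemonic == "LOAD" and prev_dest in sources:
                    fixed.append(("NOP", None, ()))
            fixed.append(instr)
        return fixed

    program = [
        ("LOAD",  "R1", ()),            # R1 <- memory
        ("LOAD",  "R2", ()),            # R2 <- memory
        ("ADD",   "R3", ("R1", "R2")),  # needs R2 one cycle too early
        ("STORE", None, ("R3",)),       # memory <- R3
    ]
    for op in insert_nops(program):
        print(op)

Running the sketch prints the rearranged sequence Load R1, Load R2, No-operation, Add, Store, which is exactly the delayed-load timing of Fig. 9-9(b).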
Handling of branch instructions

Prefetch target instruction
  » For a conditional branch, prefetch both the branch target instruction (condition satisfied) and the next sequential instruction (condition not satisfied)

Branch target buffer (BTB)
  1) An associative memory stores the branch target address together with the first few instructions following the target
  2) When a branch instruction enters the pipeline the BTB is searched first; on a hit the target instructions are taken directly from the buffer (the same idea as a cache)

Loop buffer
  1) A small, very high-speed register file (RAM) is used to detect loops in the program
  2) Once a loop is detected, the entire loop is loaded into the loop buffer and executed from there, so external memory is not accessed

Branch prediction
  » Additional hardware logic predicts the outcome of the branch before it is executed
  » A correct guess eliminates the branch difficulty

Delayed branch
  » As in Fig. 9-8, a branch instruction delays the pipeline operation
  » Example : Fig. 9-10, p. 318, Sec. 9-5 (program: Load, Increment, Add, Subtract, Branch to X, Instruction in X)
    1) Insert no-operation instructions after the branch : Fig. 9-10(a), 10 clock cycles
    2) Rearrange the instructions with compiler support, moving Add and Subtract into the slots after the branch : Fig. 9-10(b), 8 clock cycles

9-5 RISC Pipeline

Characteristics of a RISC CPU
  » Uses an instruction pipeline
  » Single-cycle instruction execution
  » Compiler support

Example : three-segment instruction pipeline
  » The instruction cycle has 3 suboperations :
    1) I : instruction fetch
    2) A : instruction decode and ALU operation
    3) E : transfer the output of the ALU to a register, to memory, or to the PC (for program-control instructions such as JMP/CALL)

Delayed load : Fig. 9-9
  » Conflict in Fig. 9-9(a) (Load R1; Load R2; Add R1+R2; Store R3): at the 4th clock cycle the 2nd instruction (Load R2) is still completing while the 3rd instruction (Add) already needs R2
  » Solution, Fig. 9-9(b) : insert a no-operation instruction between Load R2 and Add (pipeline timing with delayed load, 7 clock cycles instead of 6)

Delayed branch : already described in Sec. 9-4

9-6 Vector Processing

Science and engineering applications
  » Long-range weather forecasting, petroleum exploration, seismic data analysis, medical diagnosis, aerodynamics and space flight simulations, artificial intelligence and expert systems, mapping the human genome, image processing

Vector operations
  » Arithmetic operations on large arrays of numbers

Conventional scalar processor
  » Fortran loop :
        DO 20 I = 1, 100
     20 C(I) = A(I) + B(I)
  » Equivalent machine-language sequence :
        Initialize I = 0
     20 Read A(I)
        Read B(I)
        Store C(I) = A(I) + B(I)
        Increment I = I + 1
        If I <= 100 go to 20
        Continue

Vector processor
  » A single vector instruction : C(1:100) = A(1:100) + B(1:100)
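As a final illustration (not part of the slides), the following Python sketch contrasts the scalar-processor loop with the single vector operation; NumPy's element-wise array addition stands in for the vector instruction C(1:100) = A(1:100) + B(1:100), and the array names a, b, c are arbitrary.

    import numpy as np

    a = np.arange(1, 101, dtype=float)        # A(1:100)
    b = np.arange(101, 201, dtype=float)      # B(1:100)

    # Scalar-processor style: one element per loop iteration,
    # like the Fortran DO 20 loop above.
    c_scalar = np.empty(100)
    for i in range(100):
        c_scalar[i] = a[i] + b[i]

    # Vector-processor style: one "vector instruction" over all 100 elements,
    # analogous to C(1:100) = A(1:100) + B(1:100).
    c_vector = a + b

    assert np.array_equal(c_scalar, c_vector)

On a real vector machine the whole addition is issued as one instruction to a pipelined adder, whereas the scalar loop issues a separate read-add-store-increment-test sequence for every element.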