Announcement

Computer Architecture (CSC-3501) Lecture 25 (24 April 2008)

Seung-Jong Park (Jay) http://www.csc.lsu.edu/~sjpark


Chapter 9 Objectives

• Learn the properties that often distinguish RISC from CISC architectures.
• Understand how multiprocessor architectures are classified.
• Appreciate the factors that create complexity in multiprocessor systems.
• Become familiar with the ways in which some architectures transcend the traditional von Neumann paradigm.

9.1 Introduction

• We have so far studied only the simplest models of computer systems: classical single-processor von Neumann systems.
• This chapter presents a number of different approaches to computer organization and architecture.
• Some of these approaches are in place in today's commercial systems. Others may form the basis for the computers of tomorrow.


9.2 RISC Machines

• The underlying philosophy of RISC machines is that a system is better able to manage program execution when the program consists of only a few different instructions that are the same length and require the same number of clock cycles to decode and execute.
• RISC systems access memory only with explicit load and store instructions.
• In CISC systems, many different kinds of instructions access memory, making instruction length variable and fetch-decode-execute time unpredictable.
• The difference between CISC and RISC becomes evident through the basic computer performance equation:

    time/program = time/cycle × cycles/instruction × instructions/program

• RISC systems shorten execution time by reducing the clock cycles per instruction.
• CISC systems improve performance by reducing the number of instructions per program.
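To make the trade-off concrete, here is a worked instance of that equation with purely illustrative numbers (they are not from the slide); each design attacks a different factor of the product:

    % Hypothetical figures for one program compiled two ways.
    % CISC: fewer instructions, more cycles each, slower clock.
    % RISC: more instructions, one cycle each, faster clock.
    \[
      T_{\mathrm{CISC}} = 100 \times 4 \times 10\,\mathrm{ns} = 4000\,\mathrm{ns},
      \qquad
      T_{\mathrm{RISC}} = 250 \times 1 \times 8\,\mathrm{ns} = 2000\,\mathrm{ns}
    \]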


9.2 RISC Machines

• The simple instruction set of RISC machines enables control units to be hardwired for maximum speed.
• The more complex, variable instruction set of CISC machines requires microcode-based control units that interpret instructions as they are fetched from memory. This translation takes time.
• With fixed-length instructions, RISC lends itself to pipelining and speculative execution.
• Consider the following program fragments, which both compute 10 × 5:

    CISC                RISC
    mov ax, 10          mov ax, 0
    mov bx, 5           mov bx, 10
    mul bx, ax          mov cx, 5
                        Begin:  add ax, bx
                                loop Begin

• The total clock cycles for the CISC version might be:
  (2 movs × 1 cycle) + (1 mul × 30 cycles) = 32 cycles
• While the clock cycles for the RISC version are:
  (3 movs × 1 cycle) + (5 adds × 1 cycle) + (5 loops × 1 cycle) = 13 cycles
• With the RISC clock cycle also being shorter, RISC gives us much faster execution speeds.
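The cycle arithmetic above can be checked with a small C sketch; the per-instruction costs (1 cycle for mov/add/loop, 30 cycles for mul) are the slide's illustrative assumptions, not measurements of any real processor:

    #include <stdio.h>

    int main(void) {
        /* CISC-style: two moves followed by one expensive multiply. */
        int cisc_cycles = 2 * 1 + 1 * 30;            /* = 32 */

        /* RISC-style: multiply 10 by 5 as five additions in a loop. */
        int ax = 0, bx = 10, cx = 5;
        int risc_cycles = 3 * 1;                     /* three movs */
        while (cx-- > 0) {
            ax += bx;                                /* add:  1 cycle */
            risc_cycles += 1 + 1;                    /* add + loop    */
        }

        printf("product     = %d\n", ax);            /* 50 */
        printf("CISC cycles = %d\n", cisc_cycles);   /* 32 */
        printf("RISC cycles = %d\n", risc_cycles);   /* 13 */
        return 0;
    }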


9.2 RISC Machines

• Because of their load-store ISAs, RISC architectures require a large number of CPU registers.
• These registers provide fast access to data during sequential program execution.
• They can also be employed to reduce the overhead typically caused by passing parameters to subprograms.
• Instead of pulling parameters off of a stack, the subprogram is directed to use a subset of registers.
• It is becoming increasingly difficult to distinguish RISC architectures from CISC architectures.
• Some RISC systems provide more extravagant instruction sets than some CISC systems.
• Some systems combine both approaches.
• The following two slides summarize the characteristics that traditionally typify the differences between these two architectures.
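As a toy model of passing parameters through registers rather than the stack, the C sketch below simulates overlapping register windows; the sizes, layout, and helper names are invented for illustration and only loosely resemble SPARC-style windows:

    #include <stdio.h>

    #define WINDOW 8           /* registers visible to one routine (assumed)  */
    #define PHYS   64          /* size of the shared physical register file   */

    static int regfile[PHYS];  /* physical registers shared by all windows    */
    static int base = 0;       /* base index of the current window            */

    /* The caller writes arguments into its "out" registers, which overlap   */
    /* the callee's "in" registers once the window shifts: no stack traffic. */
    static void set_out(int r, int v) { regfile[(base + WINDOW / 2 + r) % PHYS] = v; }
    static int  get_in(int r)         { return regfile[(base + r) % PHYS]; }
    static void call(void)            { base = (base + WINDOW / 2) % PHYS; }
    static void ret_(void)            { base = (base - WINDOW / 2 + PHYS) % PHYS; }

    static int callee(void) {
        return get_in(0) + get_in(1);   /* parameters arrive in registers */
    }

    int main(void) {
        set_out(0, 10);                 /* "pass" 10 and 5 to the callee  */
        set_out(1, 5);
        call();
        int sum = callee();
        ret_();
        printf("sum = %d\n", sum);      /* 15 */
        return 0;
    }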


9.2 RISC Machines

    RISC                                            CISC
    Multiple register sets.                         Single register set.
    Three operands per instruction.                 One or two register operands per instruction.
    Parameter passing through register windows.     Parameter passing through memory.
    Single-cycle instructions.                      Multiple-cycle instructions.
    Hardwired control.                              Microprogrammed control.
    Highly pipelined.                               Less pipelined.
    Simple instructions, few in number.             Many complex instructions.
    Fixed-length instructions.                      Variable-length instructions.
    Complexity in compiler.                         Complexity in microcode.
    Only LOAD/STORE instructions access memory.     Many instructions can access memory.
    Few addressing modes.                           Many addressing modes.



9.3 Flynn's Taxonomy

• Many attempts have been made to come up with a way to categorize computer architectures.

• Flynn's Taxonomy has been the most enduring of these, despite having some limitations.

• Flynn's Taxonomy takes into consideration the number of processors and the number of data paths incorporated into an architecture.

• A machine can have one or many processors that operate on one or many data streams.

• The four combinations of multiple processors and multiple data paths are described by Flynn as:
  SISD: single instruction stream, single data stream
  SIMD: single instruction stream, multiple data streams
  MISD: multiple instruction streams, single data stream
  MIMD: multiple instruction streams, multiple data streams
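The classification logic boils down to two yes/no questions, which the small C helper below makes explicit (the function and its name are mine, added purely for illustration):

    #include <stdio.h>

    /* Classify an architecture by its number of instruction streams and
     * data streams, following Flynn's four categories. */
    static const char *flynn(int instr_streams, int data_streams) {
        if (instr_streams == 1)
            return (data_streams == 1) ? "SISD" : "SIMD";
        else
            return (data_streams == 1) ? "MISD" : "MIMD";
    }

    int main(void) {
        printf("%s\n", flynn(1, 1));    /* classic von Neumann machine -> SISD */
        printf("%s\n", flynn(1, 64));   /* vector/array processor      -> SIMD */
        printf("%s\n", flynn(16, 16));  /* multiprocessor              -> MIMD */
        return 0;
    }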


9.3 Flynn's Taxonomy

• Flynn's Taxonomy falls short in a number of ways:
  First, there appears to be no need for MISD machines.
  Second, parallelism is not homogeneous. This assumption ignores the contribution of specialized processors.
  Third, it provides no straightforward way to distinguish architectures of the MIMD category. One idea is to divide these systems into those that share memory and those that don't, as well as whether the interconnections are bus-based or switch-based.

• Symmetric multiprocessors (SMP) and massively parallel processors (MPP) are MIMD architectures that differ in how they use memory.
• SMP systems share the same memory and MPP systems do not.
• An easy way to distinguish SMP from MPP is:
  MPP ⇒ many processors + distributed memory + communication via network
  SMP ⇒ fewer processors + shared memory + communication via memory
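A minimal sketch of "communication via memory" on an SMP, using POSIX threads as stand-ins for processors that all see one shared counter (the thread function, counts, and names are invented for illustration; an MPP program would instead exchange explicit messages over a network):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static long counter = 0;                         /* shared memory      */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; ++i) {
            pthread_mutex_lock(&lock);
            ++counter;                               /* every thread sees the same location */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; ++i)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);          /* 400000 */
        return 0;
    }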


9.3 Flynn's Taxonomy

• Other examples of MIMD architectures are found in distributed computing, where processing takes place collaboratively among networked computers.
  A network of workstations (NOW) uses otherwise idle systems to solve a problem.
  A collection of workstations (COW) is a NOW where one workstation coordinates the actions of the others.
  A dedicated cluster parallel computer (DCPC) is a group of workstations brought together to solve a specific problem.
  A pile of PCs (POPC) is a cluster of (usually) heterogeneous systems that form a dedicated parallel system.

• Flynn's Taxonomy has been expanded to include SPMD (single program, multiple data) architectures.
• Each SPMD processor has its own data set and program memory. Different nodes can execute different instructions within the same program using instructions similar to:
  If myNodeNum = 1 do this, else do that
• Yet another idea missing from Flynn's is whether the architecture is instruction driven or data driven.
• The next slide provides a revised taxonomy.
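The slide's pseudocode maps naturally onto an SPMD message-passing program; here is a minimal MPI sketch (MPI is my choice of example API, it assumes an MPI implementation is installed, and the variable name myNodeNum is taken from the slide):

    #include <mpi.h>
    #include <stdio.h>

    /* Every process runs this same program but branches on its own rank. */
    int main(int argc, char **argv) {
        int myNodeNum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myNodeNum);

        if (myNodeNum == 1)
            printf("node %d: do this\n", myNodeNum);
        else
            printf("node %d: do that\n", myNodeNum);

        MPI_Finalize();
        return 0;
    }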


9.3 Flynn's Taxonomy

[figure: a revised taxonomy of parallel and distributed architectures]

Current Supercomputer 500

[figure: the current list of the top 500 supercomputers]


BlueGene (1)

[figure: The BlueGene/L custom processor chip.]

BlueGene (2)

[figure: The BlueGene/L. (a) Chip. (b) Card. (c) Board. (d) Cabinet. (e) System.]

A Growth of High Performance Computing

[figure by Thomas Sterling, Caltech & JPL: timeline of peak computing performance from one OPS to PetaOPS (10^3, 10^6, 10^9, 10^12, 10^15), spanning the Babbage Difference Engine (1823), Harvard Mark 1 (1943), EDSAC (1949), Univac 1 (1951), IBM 7094 (1959), CDC 6600 (1964), Cray 1 (1976), Cray XMP (1982), Cray YMP (1988), Intel Delta (1991), T3E (1996), ASCI Red (1997), Earth Simulator (2001), and Cray X1 (2003)]

9.4 Parallel and Multiprocessor Architectures

• Parallel processing is capable of economically increasing system throughput while providing better fault tolerance.
• The limiting factor is that no matter how well an algorithm is parallelized, there is always some portion that must be done sequentially. Additional processors sit idle while the sequential work is performed.
• Thus, it is important to keep in mind that an n-fold increase in processing power does not necessarily result in an n-fold increase in throughput.
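This limit is usually formalized as Amdahl's law, which is not stated on the original slide; a minimal statement, with f the fraction of the work that must run sequentially and n the number of processors:

    \[
      S(n) \;=\; \frac{1}{\,f + \dfrac{1 - f}{n}\,},
      \qquad
      \lim_{n \to \infty} S(n) \;=\; \frac{1}{f}
    \]
    % Example: f = 0.10 and n = 16 give S = 1 / (0.10 + 0.90/16) ≈ 6.4,
    % far short of a 16-fold speedup.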


9.4 Superpipelining

• Recall that pipelining divides the fetch-decode-execute cycle into stages that each carry out a small part of the process on a set of instructions.
• Ideally, an instruction exits the pipeline during each tick of the clock.
• Superpipelining occurs when a pipeline has stages that require less than half a clock cycle to complete. The pipeline is equipped with a separate clock running at a frequency that is at least double that of the main system clock.
• Superpipelining is only one aspect of superscalar design.

9.4 Superscalar (Dynamic multiple-issue processors)

• Superscalar architectures include multiple execution units such as specialized integer and floating-point adders and multipliers.
• A critical component of this architecture is the instruction fetch unit, which can simultaneously retrieve several instructions from memory.
• A decoding unit determines which of these instructions can be executed in parallel and combines them accordingly.
• This architecture also requires compilers that make optimum use of the hardware.
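To see why these techniques pay off, the small C sketch below compares idealized cycle counts; it ignores stalls, hazards, and memory delays, and the depth, width, and instruction count are made-up figures:

    #include <stdio.h>

    int main(void) {
        int N = 100;   /* instructions to execute */
        int k = 5;     /* pipeline depth (stages) */
        int w = 2;     /* superscalar issue width */

        int unpipelined = N * k;                     /* one instruction at a time    */
        int pipelined   = k + (N - 1);               /* one new instruction per tick */
        int superscalar = k + (N + w - 1) / w - 1;   /* w new instructions per tick  */

        printf("unpipelined: %d cycles\n", unpipelined);   /* 500 */
        printf("pipelined:   %d cycles\n", pipelined);     /* 104 */
        printf("superscalar: %d cycles\n", superscalar);   /* 54  */
        return 0;
    }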


Pentium 4 Block Diagram

[figure: block diagram of the Pentium 4 microarchitecture]

9.4 VLIW (Static multiple-issue processors)

• Very long instruction word (VLIW) architectures differ from superscalar architectures because the VLIW compiler, instead of a hardware decoding unit, packs independent instructions into one long instruction that is sent down the pipeline to the execution units. The compiler takes greater responsibility for exploiting parallelism.
• One could argue that this is the best approach because the compiler can better identify instruction dependencies.
• However, compilers tend to be conservative and cannot have a view of the run-time code.
• E.g., Intel Itanium and Itanium 2 for the IA-64 ISA, known as EPIC (Explicitly Parallel Instruction Computing).


Example of VLIW - IA-64

[figure: the IA-64 compiler translates original source code into parallel machine code; the compiler views a wider scope and makes more efficient use of execution resources, while the hardware supplies multiple functional units. Each 128-bit bundle holds Instruction 2, Instruction 1, and Instruction 0 (41 bits each) plus a 5-bit template, e.g. Memory (M), Memory (M), Integer (I) for an MMI bundle.]

CISC vs RISC vs SS vs VLIW

                      CISC                   RISC                  Superscalar             VLIW
    Instr size        variable size          fixed size            fixed size              fixed size (but large)
    Instr format      variable format        fixed format          fixed format            fixed format
    Registers         few, some special      many GP               GP and rename (RUU)     many, many GP
    Memory reference  embedded in many       load/store            load/store              load/store
                      instructions
    Key issues        decode complexity      data forwarding,      hardware dependency     (compiler) code
                                             hazards               resolution              scheduling
    Instruction flow  [pipeline diagrams of IF ID EX M WB stages for each style]
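The bundle layout in the figure can be sketched in C; the bit positions below (5-bit template in the low bits, then three 41-bit slots) follow the slide's 3 × 41 + 5 = 128 arithmetic, and the type and function names are mine:

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t lo;   /* bits  0..63  of the 128-bit bundle */
        uint64_t hi;   /* bits 64..127 of the 128-bit bundle */
    } Bundle;

    /* Template: bits 0..4 (assumed placement for illustration). */
    static unsigned template_of(const Bundle *b) {
        return (unsigned)(b->lo & 0x1F);
    }

    /* Instruction slot i: 41 bits starting at bit 5 + 41*i. */
    static uint64_t slot_of(const Bundle *b, int i) {
        uint64_t value = 0;
        for (int bit = 0; bit < 41; ++bit) {
            int pos = 5 + 41 * i + bit;
            uint64_t src = (pos < 64) ? (b->lo >> pos) : (b->hi >> (pos - 64));
            value |= (src & 1u) << bit;
        }
        return value;
    }

    int main(void) {
        Bundle b = { 0x12u, 0 };   /* template = 0x12, empty instruction slots */
        printf("template = 0x%02X\n", template_of(&b));
        printf("slot 0   = 0x%011llX\n", (unsigned long long)slot_of(&b, 0));
        return 0;
    }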

