CPS 303 High Performance Computing

Wensheng Shen, Department of Computational Science, SUNY Brockport

Chapter 2: Architecture of Parallel Computers

• Hardware
• Software

2.1.1 Flynn's taxonomy

• Single-instruction single-data (SISD)
• Single-instruction multiple-data (SIMD)
• Multiple-instruction single-data (MISD)
• Multiple-instruction multiple-data (MIMD)

Michael Flynn classified systems according to the number of instruction streams and the number of data streams.

Instruction streams and data streams

• Data stream: a sequence of digitally encoded signals (data packets) used to transmit or receive information.

• Instruction stream: a sequence of instructions.

Instruction set architecture

Stored-program computer: memory stores programs as well as data. Thus, a program must be moved from memory to the CPU before it can be executed.

Programs consist of many instructions which are the 0's and 1's that tell the CPU what to do. The format and semantics of the instructions are defined by the ISA (instruction set architecture).

Instructions reside in memory because a CPU can hold only a small amount of on-chip storage; the more storage the CPU carries, the slower it runs. Thus, memory and CPU are separate chips.

2.1.2 SISD --- the classic von Neumann machine

[Figure: SISD: a single processor P draws from one instruction pool (Load X; Load Y; Add Z, X, Y; Store Z) and one data pool. The machine consists of memory, an arithmetic logic unit, a control unit, input/output devices, and external storage.]

A single processor executes a single instruction stream to operate on data stored in a single memory. During any CPU cycle, only one data stream is used. The performance of a von Neumann machine can be improved by caching.

Steps to run a single instruction

• IF (instruction fetch): the instruction is fetched from memory. The address of the instruction is taken from the program counter (PC), and the instruction is copied from memory into the instruction register (IR).
• ID (instruction decode): decode the instruction and fetch the operands.
• EX (execute): perform the operation; this is done by the arithmetic logic unit (ALU).
• MEM (memory access): normally occurs only during load and store instructions.
• WB (write back): write the result of the operation in the EX step back to a register in the register file.
• PC (update program counter): update the value in the program counter, normally PC ← PC + 4.

[Figure: three instructions executed one after another; each occupies the five stages IF, ID, EX, MEM, WB in turn before the next begins.]

Subscalar CPUs: since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next instruction. As a result, the subscalar CPU gets "hung up" on instructions which take more than one clock cycle to complete execution. This is an inherent inefficiency.

It takes 15 cycles to complete three instructions (3 instructions × 5 stages = 15 cycles).

2.1.3 Pipeline and vector architecture

[Figure: five-stage pipeline; successive instructions overlap, each offset by one cycle, so the IF, ID, EX, MEM, and WB stages of different instructions execute simultaneously.]

Scalar CPUs: with this five-stage pipeline, the processor can achieve at best one instruction per CPU clock cycle.

[Figure: simple superscalar pipeline; two instructions enter the IF stage in the same cycle, giving two parallel sequences of IF ID EX MEM WB.]

Superscalar CPUs: in the simple superscalar pipeline, two instructions are fetched and dispatched at the same time, so a maximum of two instructions per CPU clock cycle can be achieved.

Example: floating-point vector addition

    float x[100], y[100], z[100];
    for (i = 0; i < 100; i++)
        z[i] = x[i] + y[i];

The addition z[i] = x[i] + y[i] is carried out in stages: fetch the operands from memory, compare exponents, shift one operand, add, normalize the result, and store the result in memory.

• The functional units are arranged in a pipeline: the output of one functional unit is the input to the next. While x[0] and y[0] are being added, one of x[1] and y[1] can be shifted, the exponents of x[2] and y[2] can be compared, and x[3] and y[3] can be fetched. Once the pipeline is full, it produces one result per cycle, roughly six times faster than without pipelining.

    clock | fetch  | comp   | shift  | add    | norm   | store
      1   | x0,y0  |        |        |        |        |
      2   | x1,y1  | x0,y0  |        |        |        |
      3   | x2,y2  | x1,y1  | x0,y0  |        |        |
      4   | x3,y3  | x2,y2  | x1,y1  | x0,y0  |        |
      5   | x4,y4  | x3,y3  | x2,y2  | x1,y1  | x0,y0  |
      6   | x5,y5  | x4,y4  | x3,y3  | x2,y2  | x1,y1  | z0

Fortran 77:
      do i = 1, 100
         z(i) = x(i) + y(i)
      enddo

Fortran 90:
      z(1:100) = x(1:100) + y(1:100)

• By adding vector instructions to the basic machine instruction set, we can further improve performance. Without vector instructions, each of the basic instructions has to be issued 100 times; with vector instructions, each basic instruction is issued only once.
• Using multiple memory banks. Operations that access main memory (fetch and store) are several times slower than CPU-only operations (add). For example, suppose we can execute a CPU operation once every CPU cycle, but a memory access only once every four cycles. If we use four memory banks and distribute the data so that z[i] lives in memory bank i mod 4, we can execute one store operation per cycle, as sketched below.
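The bank-interleaving idea can be made concrete with a small sketch. This is not from the slides: the bank count NUM_BANKS and the bank_of() helper are illustrative assumptions, showing only how consecutive elements land in bank i mod 4 so that successive stores hit different banks.

    #include <stdio.h>

    #define N 16
    #define NUM_BANKS 4   /* assumed number of interleaved memory banks */

    /* Hypothetical helper: which bank holds element i under interleaving. */
    static int bank_of(int i) {
        return i % NUM_BANKS;
    }

    int main(void) {
        /* Successive elements of z fall in different banks, so a store can be
           issued every cycle even though each bank needs four cycles per access. */
        for (int i = 0; i < N; i++)
            printf("z[%2d] -> bank %d\n", i, bank_of(i));
        return 0;
    }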

2.1.4 SIMD

[Figure: SIMD: one instruction pool drives several processors P; in lockstep, each processor executes Load X[i], Load Y[i], Add Z[i], X[i], Y[i], Store Z[i] on its own slice of the data pool.]

A type of parallel computer. Single instruction: all processing units execute the same instruction at any given clock cycle. Multiple data: each processing unit can operate on a different data element. A SIMD machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity processing units. It is best suited to specialized problems characterized by a high degree of regularity, e.g., image processing.

A single CPU to control and a large collection of subordinate ALUs, each with its own memory. During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors, and each subordinate processor either executes the instruction or is idle.

    for (i = 0; i < 100; i++)
        if (y[i] != 0.0)
            z[i] = x[i] / y[i];
        else
            z[i] = x[i];

Time step 1: test local_y != 0
Time step 2: if local_y != 0, z[i] = x[i]/y[i]; if local_y == 0, idle
Time step 3: if local_y != 0, idle; if local_y == 0, z[i] = x[i]

Disadvantage: in a program with many conditional branches, or long segments of code whose execution depends on conditionals, it is likely that many processors will remain idle for long periods, as illustrated below.
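A minimal sketch of this lockstep behavior, not from the slides: the mask array and the two "time step" loops below are illustrative assumptions that emulate how every SIMD processor walks through both branches while only the lanes whose mask matches actually write a result.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double y[N] = {2, 0, 4, 0, 5, 0, 7, 8};   /* zeros force branch divergence */
        double z[N];
        int mask[N];

        /* Time step 1: every lane evaluates the test. */
        for (int i = 0; i < N; i++)
            mask[i] = (y[i] != 0.0);

        /* Time step 2: lanes with the mask set divide; the others are idle. */
        for (int i = 0; i < N; i++)
            if (mask[i]) z[i] = x[i] / y[i];

        /* Time step 3: lanes with the mask clear copy; the others are idle. */
        for (int i = 0; i < N; i++)
            if (!mask[i]) z[i] = x[i];

        for (int i = 0; i < N; i++)
            printf("z[%d] = %g\n", i, z[i]);
        return 0;
    }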

2.1.5 MISD

[Figure: MISD: a single data pool feeds multiple processors P, each driven by its own instruction stream from the instruction pool.]

A single data stream is fed into multiple processing units. Each processing unit operates on the data independently via an independent instruction stream. Very few actual machines were built; one example is CMU's C.mmp computer (1971).

[Figure: three instruction streams operating on the same data element X[1]:
    Load X[1]             Load X[1]             Load X[1]
    Mul Y[1], A, X[1]     Mul Y[2], B, X[1]     Mul Y[3], C, X[1]
    Add Z[1], X[1], Y[1]  Add Z[2], X[1], Y[2]  Add Z[3], X[1], Y[3]
    Store Z[1]            Store Z[2]            Store Z[3]]

2.1.6 MIMD

Each processor has both a control unit and an ALU, and is capable of executing its own program at its own pace.
Multiple instruction streams: every processor may execute a different instruction stream.
Multiple data streams: every processor may work with a different data stream.
Execution can be synchronous or asynchronous, deterministic or nondeterministic.
Examples: most current supercomputers, grids, networked parallel computers, multiprocessor SMP computers.

[Figure: three independent instruction streams, one per processor:
    Load X[1]             Load A          Load X[1]
    Load Y[1]             Mul Y, A, 10    Load C[2]
    Add Z[1], X[1], Y[1]  Sub B, Y, A     Add Z[1], X[1], C[2]
    Store Z[1]            Store B         Sub B, Z[1], X[1]]

2.1.7 Shared-memory MIMD

• Bus-based architecture
• Switch-based architecture
• Cache coherence

[Figure: generic shared-memory architecture: CPUs and memory modules connected through an interconnection network.]

Shared-memory systems are sometimes called multiprocessors.

Bus-based architecture

[Figure: bus-based architecture: each CPU has its own cache, and all CPUs and memory modules are attached to a single shared bus.]

The interconnection network is bus-based. The bus becomes saturated if multiple processors simultaneously attempt to access memory, so each processor is given access to a fairly large cache. These architectures do not scale well to large numbers of processors because of the limited bandwidth of the bus.

Switch-based architecture

[Figure: switch-based (crossbar) architecture: CPUs along one edge and memory modules along the other, connected through a rectangular mesh of switched wires.]

The interconnection network is switch-based. A crossbar can be visualized as a rectangular mesh of wires with switches at the points of intersection and terminals on its left and top edges. Each switch can either allow a signal to pass through in both the vertical and horizontal directions simultaneously, or redirect a signal from vertical to horizontal or vice versa. Any processor can access any memory module while, at the same time, any other processor accesses any other memory module.

• The crossbar switch-based architecture is very expensive: a total of mn hardware switches are needed for an m × n crossbar.

• The crossbar system is a NUMA (nonuniform memory access) system: when a processor accesses memory attached to another crossbar, the access time is greater.

Cache coherence

• The caching of shared variables must ensure cache coherence.
• Basic idea: each processor has a cache controller, which monitors the bus traffic.
• When a processor updates a shared variable, it also updates the corresponding main-memory location.
• The cache controllers on the other processors detect the write to main memory and mark their copies of the variable as invalid.
• This scheme is not well suited to other types of shared-memory machines.

2.1.8 Distributed-memory MIMD

[Figure: distributed-memory architecture: each CPU is paired with its own memory, and the CPU/memory pairs are connected by an interconnection network.]

In a distributed-memory system, each processor has its own private memory.

[Figure: a static network (mesh) and a dynamic network (crossbar).]

A node is a vertex corresponding to a processor/memory pair. In a static network, all vertices are nodes. In a dynamic network, some vertices are nodes and the other vertices are switches.

Fully connected interconnection network

The ideal interconnection network is a fully connected network, in which each node is directly connected to every other node. Each node can therefore communicate directly with every other node, and communication involves no forwarding delay. The cost, however, is too high to be practical.

Question: how many connections are needed for a 10-processor machine? (Each pair of nodes needs its own wire, so p(p-1)/2 = 45 connections for p = 10.)

Crossbar interconnection network

Question: for a machine with p processors, how many switches do we need? (From the m × n count above, a p × p crossbar needs p² switches.)

Multistage switching network

For a machine of p nodes, an omega network uses p·log2(p)/2 switches, far fewer than the p² switches of a crossbar; the sketch below compares the two counts.
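A small sketch, not from the slides, that simply evaluates the two switch counts just quoted; the function names are illustrative.

    #include <math.h>
    #include <stdio.h>

    /* Switches in a p x p crossbar (mn switches for an m x n crossbar). */
    static double crossbar_switches(double p) { return p * p; }

    /* Switches in an omega network: p * log2(p) / 2. */
    static double omega_switches(double p) { return p * log2(p) / 2.0; }

    int main(void) {
        /* For p = 8: the crossbar needs 64 switches, the omega network only 12. */
        for (int p = 8; p <= 1024; p *= 4)
            printf("p = %4d  crossbar = %8.0f  omega = %8.0f\n",
                   p, crossbar_switches(p), omega_switches(p));
        return 0;
    }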

[Figure: an omega network.]

Static interconnection networks

[Figure: a linear array and a ring.]

For a system of p processors, a linear array needs p-1 wires and a ring needs p wires. They scale well, but the communication cost is high: in a linear array, two communicating processors may have to forward a message along as many as p-1 wires, and in a ring it may be necessary to forward the message along as many as p/2 wires.

[Figure: hypercubes of dimension 1, 2, and 3.]

For a hypercube network of dimension d, the number of processors is p = 2^d. The maximum number of wires a message needs to be forwarded across is d = log2(p), which is much better than the linear array or ring. However, a hypercube does not scale well: each time we wish to increase the machine size, we must double the number of nodes and add a new wire to each node. A sketch of the hypercube wiring rule follows.
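The hypercube wiring rule can be illustrated with a short sketch, not from the slides: in a d-dimensional hypercube, the neighbors of node i are the nodes whose labels differ from i in exactly one bit, so flipping each of the d bits enumerates them. The neighbors_of() helper is an illustrative name.

    #include <stdio.h>

    #define D 3                 /* hypercube dimension: p = 2^D = 8 nodes */

    /* Print the D neighbors of node i: flip one bit of its label at a time. */
    static void neighbors_of(int i) {
        printf("node %d:", i);
        for (int bit = 0; bit < D; bit++)
            printf(" %d", i ^ (1 << bit));
        printf("\n");
    }

    int main(void) {
        for (int i = 0; i < (1 << D); i++)
            neighbors_of(i);
        return 0;
    }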

[Figure: a two-dimensional mesh and a three-dimensional mesh.]

If a mesh has dimensions d1 × d2 × … × dn, then the maximum number of wires a message will have to traverse is

    sum over i = 1..n of (di - 1).

If the mesh is square, d1 = d2 = … = dn, this maximum is n(p^(1/n) - 1). A mesh becomes a torus if "wrap-around" wires are added; for a torus, the maximum is n·p^(1/n)/2.

Mesh and torus scale better than hypercubes. If we increase the size of a q × q mesh, we simply add a q × 1 mesh and q wires. To increase the size of a square n-dimensional mesh or torus, we need to add p^((n-1)/n) nodes.

Characteristics of static networks

Diameter: the maximum distance between any two nodes in the network.

Arc connectivity: the minimum number of arcs that must be removed from the network to break it into two disconnected networks.

Bisection width: the minimum number of communication links that must be removed to partition the network into two equal halves.

Number of links: the total number of links in the network.

Characteristics of static networks

Network          | Diameter    | Bisection width | Arc connectivity | Number of links
Fully connected  | 1           | p²/4            | p-1              | p(p-1)/2
Star             | 2           | 1               | 1                | p-1
Linear array     | p-1         | 1               | 1                | p-1
Ring (p > 2)     | ⌊p/2⌋       | 2               | 2                | p
Hypercube        | log p       | p/2             | log p            | (p log p)/2
2D mesh          | 2(√p - 1)   | √p              | 2                | 2(p - √p)
2D torus         | 2⌊√p/2⌋     | 2√p             | 4                | 2p

2.1.9 Communication and routing

• If two nodes are not directly connected, or if a processor is not directly connected to a memory module, how is data transmitted between the two?
• If there are multiple routes joining the two nodes (or the processor and memory), how is the route decided on?
• Is the route chosen always the shortest?

Store-and-forward routing

[Figure: time steps 0-8 of a four-word message (w, x, y, z) travelling from node A to node C through node B; the whole message accumulates at B before any of it moves on to C.]

Store-and-forward routing: A sends a message to C through B; B reads the entire message and then sends it to node C. This takes more time and memory.

Cut-through routing

[Figure: time steps 0-5 of the same four-word message under cut-through routing; words stream through node B toward C as soon as they arrive.]

Cut-through routing: A sends a message through B to C; B immediately forwards each identifiable piece (packet) of the message to C.

Communication unit

A message is a contiguous group of bits that is transferred from source terminal to destination terminal.

A packet is the basic unit of a message. Its size is on the order of hundreds to thousands of bytes, and it consists of header flits and data flits.

Flit: a flit is the smallest unit of information at the link layer; its size is a few words.

Phit: a phit is the smallest physical unit of information at the physical layer, the amount transferred across one physical link in one cycle.

Communication cost

Startup time (ts): the time required to handle a message at the sending and receiving nodes. It includes (1) preparing the message (adding header, trailer, and error-correction information), (2) executing the routing algorithm, and (3) establishing an interface between the local node and the router. Note: this latency is incurred only once for a single message transfer.

Per-hop time (th): the time taken by the header of a message to travel between two directly connected nodes in the network. Note: the per-hop time is also called node latency.

Per-word transfer time (tw): the time taken for one word to traverse one link; it is the reciprocal of the channel bandwidth.

When a message traverses a path with multiple links, each intermediate node on the path forwards the message to the next node only after it has received and stored the entire message. The total communication cost for a message of size m to traverse a path of l links is

tcomm = ts + (m·tw + th)·l

Example: communication time for a linear array

[Figure: a linear array of five nodes, 0 - 1 - 2 - 3 - 4.]

(1) Store-and-forward routing: tcomm ≈ ts + m·l·tw, since in modern parallel computers the per-hop time th is very small compared to the per-word time.

(2) Cut-through routing: tcomm = ts + l·th + m·tw; the product of message size and number of links no longer appears. A small sketch comparing the two models follows.
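A minimal sketch, not from the slides, that just evaluates the two cost formulas above; the parameter values for ts, th, tw, and l are made-up numbers chosen only to illustrate how the two models diverge as the message grows.

    #include <stdio.h>

    /* Store-and-forward cost: ts + (m*tw + th) * l */
    static double t_sf(double ts, double th, double tw, double m, double l) {
        return ts + (m * tw + th) * l;
    }

    /* Cut-through cost: ts + l*th + m*tw */
    static double t_ct(double ts, double th, double tw, double m, double l) {
        return ts + l * th + m * tw;
    }

    int main(void) {
        /* Illustrative (assumed) parameters, in microseconds. */
        double ts = 50.0, th = 1.0, tw = 0.5, l = 4.0;

        for (double m = 10; m <= 10000; m *= 10)
            printf("m = %6.0f words   store-and-forward = %9.1f us   cut-through = %9.1f us\n",
                   m, t_sf(ts, th, tw, m, l), t_ct(ts, th, tw, m, l));
        return 0;
    }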

2.2 Software issues

• A program is parallel if, at any time during its execution, it can comprise more than one process.

• We will see how processes can be specified, created, and destroyed.

2.2.1 Shared-memory programming

• Private and shared variables

    int private_x;
    shared int sum = 0;
    ...
    sum = sum + private_x;

The statement sum = sum + private_x compiles to: fetch sum into register A; fetch private_x into register B; add the contents of register B to register A; store the contents of register A in sum.

    Time | Process 0            | Process 1
      0  | Fetch sum = 0        | Finish calculation of private_x
      1  | Fetch private_x = 2  | Fetch sum = 0
      2  | Add 2 + 0            | Fetch private_x
      3  | Store sum = 2        | Add 3 + 0
      4  |                      | Store sum = 3

The final value of sum is 3, although it should be 2 + 3 = 5: the two updates race on the shared variable. A runnable illustration of this race follows.
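A minimal sketch of the race, not from the slides: it uses POSIX threads rather than the slide's "shared int" pseudocode, and the unprotected update of sum can lose a contribution exactly as in the table above.

    /* Compile with: cc race.c -lpthread  (assumed POSIX threads environment) */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2

    static int sum = 0;                 /* shared variable */

    static void *work(void *arg) {
        int private_x = *(int *)arg;    /* each process's private value */
        sum = sum + private_x;          /* unprotected read-modify-write: a race */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int x[NTHREADS] = {2, 3};

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, &x[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        /* Usually prints 5, but can print 2 or 3 if the updates interleave badly. */
        printf("sum = %d\n", sum);
        return 0;
    }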

Mutual exclusion, critical section, binary semaphore, barrier

    int private_x;
    shared int sum = 0;
    shared int s = 1;
    ...
    while (!s);   /* wait until s == 1 */
    s = 0;
    sum = sum + private_x;
    s = 1;

Problem: the operations on s are not atomic. One process can read s as 1 while another process, which has also read 1 but has not yet stored 0, is already inside the critical section.

    void P(int* s /* in/out */);    /* binary semaphore "wait"   */
    void V(int* s /* out */);       /* binary semaphore "signal" */

    P(int* s)                       V(int* s)
    {                               {
        while (!*s);                    *s = 1;
        *s = 0;                     }
    }

    /* compute private_x */
    P(&s);
    sum = sum + private_x;
    V(&s);
    Barrier();
    if (I'm process 0)
        printf("sum = %d\n", sum);

Here P and V must themselves be executed atomically by the hardware or run-time system; a sketch using a real mutex follows.
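A minimal sketch, not from the slides, of the same protected update using a POSIX mutex in place of the binary semaphore P/V; joining the threads stands in for Barrier(). All names beyond the slide's sum and private_x are assumptions.

    /* Compile with: cc mutex_sum.c -lpthread  (assumed POSIX threads environment) */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2

    static int sum = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* plays the role of s */

    static void *work(void *arg) {
        int private_x = *(int *)arg;

        pthread_mutex_lock(&lock);      /* P(&s): enter the critical section */
        sum = sum + private_x;
        pthread_mutex_unlock(&lock);    /* V(&s): leave the critical section */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int x[NTHREADS] = {2, 3};

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, &x[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);   /* stands in for Barrier() */

        printf("sum = %d\n", sum);      /* now always 5 */
        return 0;
    }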

2.2.2 Message passing

• The most commonly used method of programming a distributed-memory MIMD system is message passing, or one of its variants.

• We focus on the Message-Passing Interface (MPI).

MPI_Send() and MPI_Recv()

int MPI_Send(void* buffer /* in */, int count /* in */, MPI_Datatype datatype /* in */,
             int destination /* in */, int tag /* in */, MPI_Comm communicator /* in */)

int MPI_Recv(void* buffer /* in */, int count /* in */, MPI_Datatype datatype /* in */,
             int source /* in */, int tag /* in */, MPI_Comm communicator /* in */,
             MPI_Status* status /* out */)

Process 0 sends a float x to process 1:

MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

Process 1 receives the float x from process 0:

MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);

Different programs or a single program? SPMD (Single-Program-Multiple-Data) model

    if (my_process_rank == 0)
        MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (my_process_rank == 1)
        MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
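The SPMD fragment above can be wrapped into a complete program. This is a sketch rather than code from the slides; it assumes a standard MPI installation and at least two processes (e.g., mpicc spmd.c && mpirun -np 2 ./a.out).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char* argv[]) {
        int my_process_rank;
        float x = 0.0f;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_process_rank);

        if (my_process_rank == 0) {
            x = 3.14f;                                   /* value to send */
            MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (my_process_rank == 1) {
            MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process 1 received x = %f\n", x);
        }

        MPI_Finalize();
        return 0;
    }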

Buffering

• 0 (A) → "request to send"; 1 (B) → "ready to receive".
• We can buffer the message: the content of the message is copied into a system-controlled block of memory (on A, on B, or on both), and process 0 can continue executing.
• Synchronous communication: process 0 waits until process 1 is ready.
• Buffered communication: the message is buffered into the appropriate memory location controlled by process 1.
• Advantage: the sending process can continue to do useful work if the receiving process is not ready, and process 0 will not block forever even if process 1 never executes a receive.
• Disadvantage: it uses additional memory, and if the receiving process is ready, the communication actually takes longer because of copying the data between the buffer and the user program's memory locations.

• Blocking communication: a process remains idle until the message is available, as in MPI_Recv(). In blocking communication, it may not be necessary for process 0 to receive permission to go ahead with the send.
• Nonblocking receive: MPI_Irecv() takes an additional request parameter. The call notifies the system that process 1 intends to receive a message from process 0 with the properties indicated by the arguments; the system initializes the request argument and the call returns immediately. Process 1 can then perform other useful work and check back later (see the sketch below) to see whether the message has arrived.
• Nonblocking communication can provide dramatic improvements in the performance of message-passing programs.
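A minimal sketch of the nonblocking receive just described, not from the slides; it assumes a standard MPI installation and two processes. MPI_Test is used here for the "check back later" step, with MPI_Wait as the final synchronization.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char* argv[]) {
        int rank, flag = 0;
        float x = 0.0f;
        MPI_Request request;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 2.71f;
            MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Post the receive and return immediately. */
            MPI_Irecv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &request);

            /* ... do other useful work here, checking back occasionally ... */
            MPI_Test(&request, &flag, &status);

            /* Before using x, make sure the message has actually arrived. */
            MPI_Wait(&request, &status);
            printf("process 1 received x = %f\n", x);
        }

        MPI_Finalize();
        return 0;
    }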

2.2.3 Data-parallel languages

          program add_arrays
    !HPF$ PROCESSORS p(10)
          real x(1000), y(1000), z(1000)
    !HPF$ ALIGN y(:) WITH x(:)
    !HPF$ ALIGN z(:) WITH x(:)
    !HPF$ DISTRIBUTE x(BLOCK) ONTO p
    C     initialize x and y
          ....
          z = x + y
          end

(1) PROCESSORS specifies a collection of 10 abstract processors.
(2) The real declarations define the arrays.
(3) ALIGN y(:) WITH x(:) specifies that y should be mapped to the abstract processors in the same way that x is.
(4) ALIGN z(:) WITH x(:) specifies that z should be mapped to the abstract processors in the same way that x is.
(5) DISTRIBUTE specifies which elements of x will be mapped to which abstract processors.
(6) BLOCK specifies that x will be mapped by blocks onto the processors: the first 1000/10 = 100 elements are mapped to the first processor.

2.2.4 RPC and active messages

• RPC (remote procedure call) and active messages are two other methods of programming parallel systems, but we are not going to discuss them in this course.

2.2.5 Data mapping

• Optimal data mapping is about assigning data elements to processors so that communication is minimized.
• Our array is A = (a0, a1, a2, …, an-1); our processors are P = (q0, q1, q2, …, qp-1).
• If the number of processors equals the number of array elements, element ai is simply assigned to processor qi.
• Block mapping: partition the array elements into blocks of consecutive entries and assign the blocks to the processors. If p = 3 and n = 12:
    a0, a1, a2, a3    → q0
    a4, a5, a6, a7    → q1
    a8, a9, a10, a11  → q2
• Cyclic mapping: assign the first element to the first processor, the second element to the second processor, and so on, wrapping around:
    a0, a3, a6, a9    → q0
    a1, a4, a7, a10   → q1
    a2, a5, a8, a11   → q2
• Block-cyclic mapping: partition the array into blocks of consecutive elements as in the block mapping, but the blocks are not necessarily of size n/p; the blocks are then mapped to the processors in the same way that the elements are mapped in the cyclic mapping. With a block size of 2:
    a0, a1, a6, a7    → q0
    a2, a3, a8, a9    → q1
    a4, a5, a10, a11  → q2

A sketch that computes these three mappings follows.
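A small sketch, not from the slides, that computes the owning processor of each array element under the three mappings; the function names and the block size of 2 for the block-cyclic case are illustrative assumptions matching the example above.

    #include <stdio.h>

    #define N 12    /* array elements */
    #define P 3     /* processors     */
    #define B 2     /* block size for the block-cyclic mapping */

    static int block_owner(int i)        { return i / (N / P); }
    static int cyclic_owner(int i)       { return i % P; }
    static int block_cyclic_owner(int i) { return (i / B) % P; }

    int main(void) {
        printf("element  block  cyclic  block-cyclic\n");
        for (int i = 0; i < N; i++)
            printf("a%-7d q%-5d q%-6d q%d\n",
                   i, block_owner(i), cyclic_owner(i), block_cyclic_owner(i));
        return 0;
    }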

How about matrices?

[Figure: examples of mapping a matrix onto processors p0, p1, …, including a two-dimensional (grid) decomposition.]