CPS 303 High Performance Computing

Wensheng Shen, Department of Computational Science, SUNY Brockport

Chapter 2: Architecture of Parallel Computers

• Hardware
• Software

2.1.1 Flynn's taxonomy

• Single-instruction single-data (SISD)
• Single-instruction multiple-data (SIMD)
• Multiple-instruction single-data (MISD)
• Multiple-instruction multiple-data (MIMD)

Michael Flynn classified systems according to the number of instruction streams and the number of data streams.

Instruction streams and data streams

• Data stream: a sequence of digitally encoded signals (data packets) used to transmit or receive information.

• Instruction stream: a sequence of instructions.

Instruction set architecture

Stored-program computer: memory stores programs as well as data. Thus, a program must be moved from memory to the CPU before it can be executed.

Programs consist of many instructions which are the 0's and 1's that tell the CPU what to do. The format and semantics of the instructions are defined by the ISA (instruction set architecture).

Instructions reside in memory because a CPU can hold only a small amount of on-chip storage; the more storage the CPU carries, the slower it runs. Thus, memory and CPU are separate chips.

2.1.2 SISD --- the classic von Neumann machine

[Figure: SISD: a single processor P draws from one instruction pool (Load X; Load Y; Add Z, X, Y; Store Z) and one data pool. The machine consists of memory, an arithmetic logic unit, a control unit, input/output devices, and external storage.]

A single processor executes a single instruction stream to operate on data stored in a single memory. During any CPU cycle, only one data stream is used. The performance of a von Neumann machine can be improved by caching.

Steps to run a single instruction

• IF (instruction fetch): the instruction is fetched from memory. The address of the instruction is taken from the program counter (PC), and the instruction is copied from memory into the instruction register (IR).
• ID (instruction decode): decode the instruction and fetch the operands.
• EX (execute): perform the operation; this is done by the arithmetic logic unit (ALU).
• MEM (memory access): normally occurs only during load and store instructions.
• WB (write back): write the result of the operation in the EX step back to a register in the register file.
• PC (update program counter): update the value in the program counter, normally PC ← PC + 4.

[Figure: three instructions executed one after another; each occupies the five stages IF, ID, EX, MEM, WB in turn before the next begins.]

Subscalar CPUs: since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next instruction. As a result, the subscalar CPU gets "hung up" on instructions which take more than one clock cycle to complete execution. This is an inherent inefficiency.

It takes 15 cycles to complete three instructions (3 instructions × 5 stages = 15 cycles).

2.1.3 Pipeline and vector architecture

[Figure: five-stage pipeline; successive instructions overlap, each offset by one cycle, so the IF, ID, EX, MEM, and WB stages of different instructions execute simultaneously.]

Scalar CPUs: with this five-stage pipeline, the processor can achieve at best one instruction per CPU clock cycle.

[Figure: simple superscalar pipeline; two instructions enter the IF stage in the same cycle, giving two parallel sequences of IF ID EX MEM WB.]

Superscalar CPUs: in the simple superscalar pipeline, two instructions are fetched and dispatched at the same time, so a maximum of two instructions per CPU clock cycle can be achieved.

Example: floating-point vector addition

    float x[100], y[100], z[100];
    for (i = 0; i < 100; i++)
        z[i] = x[i] + y[i];

The addition z[i] = x[i] + y[i] is carried out in stages: fetch the operands from memory, compare exponents, shift one operand, add, normalize the result, and store the result in memory.

• The functional units are arranged in a pipeline: the output of one functional unit is the input to the next. While x[0] and y[0] are being added, one of x[1] and y[1] can be shifted, the exponents of x[2] and y[2] can be compared, and x[3] and y[3] can be fetched. Once the pipeline is full, it produces one result per cycle, roughly six times faster than without pipelining.

    clock | fetch  | comp   | shift  | add    | norm   | store
      1   | x0,y0  |        |        |        |        |
      2   | x1,y1  | x0,y0  |        |        |        |
      3   | x2,y2  | x1,y1  | x0,y0  |        |        |
      4   | x3,y3  | x2,y2  | x1,y1  | x0,y0  |        |
      5   | x4,y4  | x3,y3  | x2,y2  | x1,y1  | x0,y0  |
      6   | x5,y5  | x4,y4  | x3,y3  | x2,y2  | x1,y1  | z0

Fortran 77:
      do i = 1, 100
         z(i) = x(i) + y(i)
      enddo

Fortran 90:
      z(1:100) = x(1:100) + y(1:100)

• By adding vector instructions to the basic machine instruction set, we can further improve performance. Without vector instructions, each of the basic instructions has to be issued 100 times; with vector instructions, each basic instruction is issued only once.
• Using multiple memory banks. Operations that access main memory (fetch and store) are several times slower than CPU-only operations (add). For example, suppose we can execute a CPU operation once every CPU cycle, but a memory access only once every four cycles. If we use four memory banks and distribute the data so that z[i] lives in memory bank i mod 4, we can execute one store operation per cycle, as sketched below.
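The bank-interleaving idea can be made concrete with a small sketch. This is not from the slides: the bank count NUM_BANKS and the bank_of() helper are illustrative assumptions, showing only how consecutive elements land in bank i mod 4 so that successive stores hit different banks.

    #include <stdio.h>

    #define N 16
    #define NUM_BANKS 4   /* assumed number of interleaved memory banks */

    /* Hypothetical helper: which bank holds element i under interleaving. */
    static int bank_of(int i) {
        return i % NUM_BANKS;
    }

    int main(void) {
        /* Successive elements of z fall in different banks, so a store can be
           issued every cycle even though each bank needs four cycles per access. */
        for (int i = 0; i < N; i++)
            printf("z[%2d] -> bank %d\n", i, bank_of(i));
        return 0;
    }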

2.1.4 SIMD

[Figure: SIMD: one instruction pool drives several processors P; in lockstep, each processor executes Load X[i], Load Y[i], Add Z[i], X[i], Y[i], Store Z[i] on its own slice of the data pool.]

A type of parallel computer. Single instruction: all processing units execute the same instruction at any given clock cycle. Multiple data: each processing unit can operate on a different data element. A SIMD machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity processing units. It is best suited to specialized problems characterized by a high degree of regularity, e.g., image processing.

A single CPU to control and a large collection of subordinate ALUs, each with its own memory. During each instruction cycle, the control processor broadcasts an instruction to all of the subordinate processors, and each subordinate processor either executes the instruction or is idle.

    for (i = 0; i < 100; i++)
        if (y[i] != 0.0)
            z[i] = x[i] / y[i];
        else
            z[i] = x[i];

Time step 1: test local_y != 0
Time step 2: if local_y != 0, z[i] = x[i]/y[i]; if local_y == 0, idle
Time step 3: if local_y != 0, idle; if local_y == 0, z[i] = x[i]

Disadvantage: in a program with many conditional branches, or long segments of code whose execution depends on conditionals, it is likely that many processors will remain idle for long periods, as illustrated below.
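A minimal sketch of this lockstep behavior, not from the slides: the mask array and the two "time step" loops below are illustrative assumptions that emulate how every SIMD processor walks through both branches while only the lanes whose mask matches actually write a result.

    #include <stdio.h>

    #define N 8

    int main(void) {
        double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        double y[N] = {2, 0, 4, 0, 5, 0, 7, 8};   /* zeros force branch divergence */
        double z[N];
        int mask[N];

        /* Time step 1: every lane evaluates the test. */
        for (int i = 0; i < N; i++)
            mask[i] = (y[i] != 0.0);

        /* Time step 2: lanes with the mask set divide; the others are idle. */
        for (int i = 0; i < N; i++)
            if (mask[i]) z[i] = x[i] / y[i];

        /* Time step 3: lanes with the mask clear copy; the others are idle. */
        for (int i = 0; i < N; i++)
            if (!mask[i]) z[i] = x[i];

        for (int i = 0; i < N; i++)
            printf("z[%d] = %g\n", i, z[i]);
        return 0;
    }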

2.1.5 MISD

[Figure: MISD: a single data pool feeds multiple processors P, each driven by its own instruction stream from the instruction pool.]

A single data stream is fed into multiple processing units. Each processing unit operates on the data independently via an independent instruction stream. Very few actual machines were built; one example is CMU's C.mmp computer (1971).

[Figure: three instruction streams operating on the same data element X[1]:
    Load X[1]             Load X[1]             Load X[1]
    Mul Y[1], A, X[1]     Mul Y[2], B, X[1]     Mul Y[3], C, X[1]
    Add Z[1], X[1], Y[1]  Add Z[2], X[1], Y[2]  Add Z[3], X[1], Y[3]
    Store Z[1]            Store Z[2]            Store Z[3]]

2.1.6 MIMD

Each processor has both a control unit and an ALU, and is capable of executing its own program at its own pace.
Multiple instruction streams: every processor may execute a different instruction stream.
Multiple data streams: every processor may work with a different data stream.
Execution can be synchronous or asynchronous, deterministic or nondeterministic.
Examples: most current supercomputers, grids, networked parallel computers, multiprocessor SMP computers.

[Figure: three independent instruction streams, one per processor:
    Load X[1]             Load A          Load X[1]
    Load Y[1]             Mul Y, A, 10    Load C[2]
    Add Z[1], X[1], Y[1]  Sub B, Y, A     Add Z[1], X[1], C[2]
    Store Z[1]            Store B         Sub B, Z[1], X[1]]

2.1.7 Shared-memory MIMD

• Bus-based architecture
• Switch-based architecture
• Cache coherence

[Figure: generic shared-memory architecture: CPUs and memory modules connected through an interconnection network.]

Shared-memory systems are sometimes called multiprocessors.

Bus-based architecture

[Figure: bus-based architecture: each CPU has its own cache, and all CPUs and memory modules are attached to a single shared bus.]

The interconnection network is bus-based. The bus becomes saturated if multiple processors simultaneously attempt to access memory, so each processor is given access to a fairly large cache. These architectures do not scale well to large numbers of processors because of the limited bandwidth of the bus.

Switch-based architecture

[Figure: switch-based (crossbar) architecture: CPUs along one edge and memory modules along the other, connected through a rectangular mesh of switched wires.]

The interconnection network is switch-based. A crossbar can be visualized as a rectangular mesh of wires with switches at the points of intersection and terminals on its left and top edges. Each switch can either allow a signal to pass through in both the vertical and horizontal directions simultaneously, or redirect a signal from vertical to horizontal or vice versa. Any processor can access any memory module while, at the same time, any other processor accesses any other memory module.

• The crossbar switch-based architecture is very expensive: a total of mn hardware switches are needed for an m × n crossbar.

• The crossbar system is a NUMA (nonuniform memory access) system: when a processor accesses memory attached to another crossbar, the access time is greater.

Cache coherence

• The caching of shared variables must ensure cache coherence.
• Basic idea: each processor has a cache controller, which monitors the bus traffic.
• When a processor updates a shared variable, it also updates the corresponding main-memory location.
• The cache controllers on the other processors detect the write to main memory and mark their copies of the variable as invalid.
• This scheme is not well suited to other types of shared-memory machines.

2.1.8 Distributed-memory MIMD

[Figure: distributed-memory architecture: each CPU is paired with its own memory, and the CPU/memory pairs are connected by an interconnection network.]

In a distributed-memory system, each processor has its own private memory.

[Figure: a static network (mesh) and a dynamic network (crossbar).]

A node is a vertex corresponding to a processor/memory pair. In a static network, all vertices are nodes. In a dynamic network, some vertices are nodes and the other vertices are switches.

Fully connected interconnection network

The ideal interconnection network is a fully connected network, in which each node is directly connected to every other node. Each node can therefore communicate directly with every other node, and communication involves no forwarding delay. The cost, however, is too high to be practical.

Question: how many connections are needed for a 10-processor machine? (Each pair of nodes needs its own wire, so p(p-1)/2 = 45 connections for p = 10.)

Crossbar interconnection network

Question: for a machine with p processors, how many switches do we need? (From the m × n count above, a p × p crossbar needs p² switches.)

Multistage switching network

For a machine of p nodes, an omega network uses p·log2(p)/2 switches, far fewer than the p² switches of a crossbar; the sketch below compares the two counts.
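A small sketch, not from the slides, that simply evaluates the two switch counts just quoted; the function names are illustrative.

    #include <math.h>
    #include <stdio.h>

    /* Switches in a p x p crossbar (mn switches for an m x n crossbar). */
    static double crossbar_switches(double p) { return p * p; }

    /* Switches in an omega network: p * log2(p) / 2. */
    static double omega_switches(double p) { return p * log2(p) / 2.0; }

    int main(void) {
        /* For p = 8: the crossbar needs 64 switches, the omega network only 12. */
        for (int p = 8; p <= 1024; p *= 4)
            printf("p = %4d  crossbar = %8.0f  omega = %8.0f\n",
                   p, crossbar_switches(p), omega_switches(p));
        return 0;
    }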

[Figure: an omega network.]

Static interconnection networks

[Figure: a linear array and a ring.]

For a system of p processors, a linear array needs p-1 wires and a ring needs p wires. They scale well, but the communication cost is high: in a linear array, two communicating processors may have to forward a message along as many as p-1 wires, and in a ring it may be necessary to forward the message along as many as p/2 wires.

[Figure: hypercubes of dimension 1, 2, and 3.]

For a hypercube network of dimension d, the number of processors is p = 2^d. The maximum number of wires a message needs to be forwarded across is d = log2(p), which is much better than the linear array or ring. However, a hypercube does not scale well: each time we wish to increase the machine size, we must double the number of nodes and add a new wire to each node. A sketch of the hypercube wiring rule follows.
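The hypercube wiring rule can be illustrated with a short sketch, not from the slides: in a d-dimensional hypercube, the neighbors of node i are the nodes whose labels differ from i in exactly one bit, so flipping each of the d bits enumerates them. The neighbors_of() helper is an illustrative name.

    #include <stdio.h>

    #define D 3                 /* hypercube dimension: p = 2^D = 8 nodes */

    /* Print the D neighbors of node i: flip one bit of its label at a time. */
    static void neighbors_of(int i) {
        printf("node %d:", i);
        for (int bit = 0; bit < D; bit++)
            printf(" %d", i ^ (1 << bit));
        printf("\n");
    }

    int main(void) {
        for (int i = 0; i < (1 << D); i++)
            neighbors_of(i);
        return 0;
    }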

[Figure: a two-dimensional mesh and a three-dimensional mesh.]

If a mesh has dimensions d1 × d2 × … × dn, then the maximum number of wires a message will have to traverse is

    sum over i = 1..n of (di - 1).

If the mesh is square, d1 = d2 = … = dn, this maximum is n(p^(1/n) - 1). A mesh becomes a torus if "wrap-around" wires are added; for a torus, the maximum is n·p^(1/n)/2.

Mesh and torus scale better than hypercubes. If we increase the size of a q × q mesh, we simply add a q × 1 mesh and q wires. To increase the size of a square n-dimensional mesh or torus, we need to add p^((n-1)/n) nodes.

Characteristics of static networks

Diameter: the maximum distance between any two nodes in the network.

Arc connectivity: the minimum number of arcs that must be removed from the network to break it into two disconnected networks.

Bisection width: the minimum number of communication links that must be removed to partition the network into two equal halves.

Number of links: the total number of links in the network.

Characteristics of static networks

Network          | Diameter    | Bisection width | Arc connectivity | Number of links
Fully connected  | 1           | p²/4            | p-1              | p(p-1)/2
Star             | 2           | 1               | 1                | p-1
Linear array     | p-1         | 1               | 1                | p-1
Ring (p > 2)     | ⌊p/2⌋       | 2               | 2                | p
Hypercube        | log p       | p/2             | log p            | (p log p)/2
2D mesh          | 2(√p - 1)   | √p              | 2                | 2(p - √p)
2D torus         | 2⌊√p/2⌋     | 2√p             | 4                | 2p

2.1.9 Communication and routing

• If two nodes are not directly connected, or if a processor is not directly connected to a memory module, how is data transmitted between the two?
• If there are multiple routes joining the two nodes (or the processor and memory), how is the route decided on?
• Is the route chosen always the shortest?

Store-and-forward routing

[Figure: time steps 0-8 of a four-word message (w, x, y, z) travelling from node A to node C through node B; the whole message accumulates at B before any of it moves on to C.]

Store-and-forward routing: A sends a message to C through B; B reads the entire message and then sends it to node C. This takes more time and memory.

Cut-through routing

[Figure: time steps 0-5 of the same four-word message under cut-through routing; words stream through node B toward C as soon as they arrive.]

Cut-through routing: A sends a message through B to C; B immediately forwards each identifiable piece (packet) of the message to C.

Communication unit

A message is a contiguous group of bits that is transferred from source terminal to destination terminal.

A packet is the basic unit of a message. Its size is on the order of hundreds to thousands of bytes, and it consists of header flits and data flits.

Flit: a flit is the smallest unit of information at the link layer; its size is a few words.

Phit: a phit is the smallest physical unit of information at the physical layer, the amount transferred across one physical link in one cycle.

Communication cost

Startup time (ts): the time required to handle a message at the sending and receiving nodes. It includes (1) preparing the message (adding header, trailer, and error-correction information), (2) executing the routing algorithm, and (3) establishing an interface between the local node and the router. Note: this latency is incurred only once for a single message transfer.

Per-hop time (th): the time taken by the header of a message to travel between two directly connected nodes in the network. Note: the per-hop time is also called node latency.

Per-word transfer time (tw): the time taken for one word to traverse one link; it is the reciprocal of the channel bandwidth.

When a message traverses a path with multiple links, each intermediate node on the path forwards the message to the next node only after it has received and stored the entire message. The total communication cost for a message of size m to traverse a path of l links is

tcomm = ts + (m·tw + th)·l

Example: communication time for a linear array

[Figure: a linear array of five nodes, 0 - 1 - 2 - 3 - 4.]

(1) Store-and-forward routing: tcomm ≈ ts + m·l·tw, since in modern parallel computers the per-hop time th is very small compared to the per-word time.

(2) Cut-through routing: tcomm = ts + l·th + m·tw; the product of message size and number of links no longer appears. A small sketch comparing the two models follows.
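A minimal sketch, not from the slides, that just evaluates the two cost formulas above; the parameter values for ts, th, tw, and l are made-up numbers chosen only to illustrate how the two models diverge as the message grows.

    #include <stdio.h>

    /* Store-and-forward cost: ts + (m*tw + th) * l */
    static double t_sf(double ts, double th, double tw, double m, double l) {
        return ts + (m * tw + th) * l;
    }

    /* Cut-through cost: ts + l*th + m*tw */
    static double t_ct(double ts, double th, double tw, double m, double l) {
        return ts + l * th + m * tw;
    }

    int main(void) {
        /* Illustrative (assumed) parameters, in microseconds. */
        double ts = 50.0, th = 1.0, tw = 0.5, l = 4.0;

        for (double m = 10; m <= 10000; m *= 10)
            printf("m = %6.0f words   store-and-forward = %9.1f us   cut-through = %9.1f us\n",
                   m, t_sf(ts, th, tw, m, l), t_ct(ts, th, tw, m, l));
        return 0;
    }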

2.2 Software issues

• A program is parallel if, at any time during its execution, it can comprise more than one process.

• We will see how processes can be specified, created, and destroyed.

2.2.1 Shared-memory programming

• Private and shared variables

    int private_x;
    shared int sum = 0;
    ...
    sum = sum + private_x;

The statement sum = sum + private_x compiles to: fetch sum into register A; fetch private_x into register B; add the contents of register B to register A; store the contents of register A in sum.

    Time | Process 0            | Process 1
      0  | Fetch sum = 0        | Finish calculation of private_x
      1  | Fetch private_x = 2  | Fetch sum = 0
      2  | Add 2 + 0            | Fetch private_x
      3  | Store sum = 2        | Add 3 + 0
      4  |                      | Store sum = 3

The final value of sum is 3, although it should be 2 + 3 = 5: the two updates race on the shared variable. A runnable illustration of this race follows.
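A minimal sketch of the race, not from the slides: it uses POSIX threads rather than the slide's "shared int" pseudocode, and the unprotected update of sum can lose a contribution exactly as in the table above.

    /* Compile with: cc race.c -lpthread  (assumed POSIX threads environment) */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2

    static int sum = 0;                 /* shared variable */

    static void *work(void *arg) {
        int private_x = *(int *)arg;    /* each process's private value */
        sum = sum + private_x;          /* unprotected read-modify-write: a race */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int x[NTHREADS] = {2, 3};

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, &x[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        /* Usually prints 5, but can print 2 or 3 if the updates interleave badly. */
        printf("sum = %d\n", sum);
        return 0;
    }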

Mutual exclusion, critical section, binary semaphore, barrier

    int private_x;
    shared int sum = 0;
    shared int s = 1;
    ...
    while (!s);   /* wait until s == 1 */
    s = 0;
    sum = sum + private_x;
    s = 1;

Problem: the operations on s are not atomic. One process can read s as 1 while another process, which has also read 1 but has not yet stored 0, is already inside the critical section.

    void P(int* s /* in/out */);    /* binary semaphore "wait"   */
    void V(int* s /* out */);       /* binary semaphore "signal" */

    P(int* s)                       V(int* s)
    {                               {
        while (!*s);                    *s = 1;
        *s = 0;                     }
    }

    /* compute private_x */
    P(&s);
    sum = sum + private_x;
    V(&s);
    Barrier();
    if (I'm process 0)
        printf("sum = %d\n", sum);

Here P and V must themselves be executed atomically by the hardware or run-time system; a sketch using a real mutex follows.
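A minimal sketch, not from the slides, of the same protected update using a POSIX mutex in place of the binary semaphore P/V; joining the threads stands in for Barrier(). All names beyond the slide's sum and private_x are assumptions.

    /* Compile with: cc mutex_sum.c -lpthread  (assumed POSIX threads environment) */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 2

    static int sum = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* plays the role of s */

    static void *work(void *arg) {
        int private_x = *(int *)arg;

        pthread_mutex_lock(&lock);      /* P(&s): enter the critical section */
        sum = sum + private_x;
        pthread_mutex_unlock(&lock);    /* V(&s): leave the critical section */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int x[NTHREADS] = {2, 3};

        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, work, &x[i]);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);   /* stands in for Barrier() */

        printf("sum = %d\n", sum);      /* now always 5 */
        return 0;
    }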

2.2.2 Message passing

• The most commonly used method of programming a distributed-memory MIMD system is message passing, or one of its variants.

• We focus on the Message-Passing Interface (MPI).

MPI_Send() and MPI_Recv()

int MPI_Send(void* buffer /* in */, int count /* in */, MPI_Datatype datatype /* in */,
             int destination /* in */, int tag /* in */, MPI_Comm communicator /* in */)

int MPI_Recv(void* buffer /* in */, int count /* in */, MPI_Datatype datatype /* in */,
             int source /* in */, int tag /* in */, MPI_Comm communicator /* in */,
             MPI_Status* status /* out */)

Process 0 sends a float x to process 1:

MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

Process 1 receives the float x from process 0:

MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);

Different programs or a single program? SPMD (Single-Program-Multiple-Data) model

    if (my_process_rank == 0)
        MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (my_process_rank == 1)
        MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
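The SPMD fragment above can be wrapped into a complete program. This is a sketch rather than code from the slides; it assumes a standard MPI installation and at least two processes (e.g., mpicc spmd.c && mpirun -np 2 ./a.out).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char* argv[]) {
        int my_process_rank;
        float x = 0.0f;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_process_rank);

        if (my_process_rank == 0) {
            x = 3.14f;                                   /* value to send */
            MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (my_process_rank == 1) {
            MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process 1 received x = %f\n", x);
        }

        MPI_Finalize();
        return 0;
    }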

Buffering

• 0 (A) → "request to send"; 1 (B) → "ready to receive".
• We can buffer the message: the content of the message is copied into a system-controlled block of memory (on A, on B, or on both), and process 0 can continue executing.
• Synchronous communication: process 0 waits until process 1 is ready.
• Buffered communication: the message is buffered into the appropriate memory location controlled by process 1.
• Advantage: the sending process can continue to do useful work if the receiving process is not ready, and process 0 will not block forever even if process 1 never executes a receive.
• Disadvantage: it uses additional memory, and if the receiving process is ready, the communication actually takes longer because of copying the data between the buffer and the user program's memory locations.

• Blocking communication: a process remains idle until the message is available, as in MPI_Recv(). In blocking communication, it may not be necessary for process 0 to receive permission to go ahead with the send.
• Nonblocking receive: MPI_Irecv() takes an additional request parameter. The call notifies the system that process 1 intends to receive a message from process 0 with the properties indicated by the arguments; the system initializes the request argument and the call returns immediately. Process 1 can then perform other useful work and check back later (see the sketch below) to see whether the message has arrived.
• Nonblocking communication can provide dramatic improvements in the performance of message-passing programs.
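A minimal sketch of the nonblocking receive just described, not from the slides; it assumes a standard MPI installation and two processes. MPI_Test is used here for the "check back later" step, with MPI_Wait as the final synchronization.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char* argv[]) {
        int rank, flag = 0;
        float x = 0.0f;
        MPI_Request request;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            x = 2.71f;
            MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Post the receive and return immediately. */
            MPI_Irecv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &request);

            /* ... do other useful work here, checking back occasionally ... */
            MPI_Test(&request, &flag, &status);

            /* Before using x, make sure the message has actually arrived. */
            MPI_Wait(&request, &status);
            printf("process 1 received x = %f\n", x);
        }

        MPI_Finalize();
        return 0;
    }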

2.2.3 Data-parallel languages

          program add_arrays
    !HPF$ PROCESSORS p(10)
          real x(1000), y(1000), z(1000)
    !HPF$ ALIGN y(:) WITH x(:)
    !HPF$ ALIGN z(:) WITH x(:)
    !HPF$ DISTRIBUTE x(BLOCK) ONTO p
    C     initialize x and y
          ....
          z = x + y
          end

(1) PROCESSORS specifies a collection of 10 abstract processors.
(2) The real declarations define the arrays.
(3) ALIGN y(:) WITH x(:) specifies that y should be mapped to the abstract processors in the same way that x is.
(4) ALIGN z(:) WITH x(:) specifies that z should be mapped to the abstract processors in the same way that x is.
(5) DISTRIBUTE specifies which elements of x will be mapped to which abstract processors.
(6) BLOCK specifies that x will be mapped by blocks onto the processors: the first 1000/10 = 100 elements are mapped to the first processor.

2.2.4 RPC and active messages

• RPC (remote procedure call) and active messages are two other methods of programming parallel systems, but we are not going to discuss them in this course.

2.2.5 Data mapping

• Optimal data mapping is about assigning data elements to processors so that communication is minimized.
• Our array is A = (a0, a1, a2, …, an-1); our processors are P = (q0, q1, q2, …, qp-1).
• If the number of processors equals the number of array elements, element ai is simply assigned to processor qi.
• Block mapping: partition the array elements into blocks of consecutive entries and assign the blocks to the processors. If p = 3 and n = 12:
    a0, a1, a2, a3    → q0
    a4, a5, a6, a7    → q1
    a8, a9, a10, a11  → q2
• Cyclic mapping: assign the first element to the first processor, the second element to the second processor, and so on, wrapping around:
    a0, a3, a6, a9    → q0
    a1, a4, a7, a10   → q1
    a2, a5, a8, a11   → q2
• Block-cyclic mapping: partition the array into blocks of consecutive elements as in the block mapping, but the blocks are not necessarily of size n/p; the blocks are then mapped to the processors in the same way that the elements are mapped in the cyclic mapping. With a block size of 2:
    a0, a1, a6, a7    → q0
    a2, a3, a8, a9    → q1
    a4, a5, a10, a11  → q2

A sketch that computes these three mappings follows.
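A small sketch, not from the slides, that computes the owning processor of each array element under the three mappings; the function names and the block size of 2 for the block-cyclic case are illustrative assumptions matching the example above.

    #include <stdio.h>

    #define N 12    /* array elements */
    #define P 3     /* processors     */
    #define B 2     /* block size for the block-cyclic mapping */

    static int block_owner(int i)        { return i / (N / P); }
    static int cyclic_owner(int i)       { return i % P; }
    static int block_cyclic_owner(int i) { return (i / B) % P; }

    int main(void) {
        printf("element  block  cyclic  block-cyclic\n");
        for (int i = 0; i < N; i++)
            printf("a%-7d q%-5d q%-6d q%d\n",
                   i, block_owner(i), cyclic_owner(i), block_cyclic_owner(i));
        return 0;
    }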

How about matrices?

[Figure: examples of mapping a matrix onto processors p0, p1, …, including a two-dimensional (grid) decomposition.]