Vector Processors

A vector processor is a pipelined processor with special instructions designed to keep the (floating point) pipeline(s) full. These special instructions are vector instructions.

Terminology:

· Scalar – a single quantity (number).

· Vector – an ordered series of scalar quantities – a one-dimensional array.

Scalar quantity:   [Data]

Vector quantity:   [Data][Data][Data][Data][Data][Data][Data][Data]

Five basic types of vector operations:

1. V ← V         Example: Complement all elements

2. S ← V         Examples: Min, Max, Sum

3. V ← V x V     Examples: Vector addition, multiplication, division

4. V ← V x S     Examples: Multiply or add a scalar to a vector

5. S ← V x V     Example: Calculate an element of a matrix product (a dot product)

One instruction says, in effect, do the same thing on all the elements of the vector(s).
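As a concrete illustration, here is a minimal C sketch (the function and names are only illustrative) of the scalar loop that a single type-4 vector instruction (V ← V x S) replaces:

#include <stddef.h>

/* Scalar code: one floating-point multiply, plus loop overhead, per element. */
void scale_scalar(double *v, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * s;    /* a vector processor does all n of these with one instruction */
}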

The generic vector processor:

[Figure: a pipelined processor reads operand streams A and B from a multiport memory system and returns the result stream = A x B.]

Many large-scale scientific and engineering problems can be solved by operations on large vectors or matrices of floating point numbers. Vector processors are designed to efficiently work on these problems.

Performance of these machines is measured in:

· FLOPS – Floating Point Operations per Second,

· MegaFLOPS – a million FLOPS, or

· GigaFLOPS – a billion FLOPS.

The extremely high performance is achieved only for problems that can be expressed as operations on large vectors. These processors are also called supercomputers, a class popularized by the Cray series.

The cost/performance ratio of vector processors can be impressive, but the initial cost is high (few of them are built). NEC's SX-4 series, which NEC claims was the most successful supercomputer, sold just 134 systems in 3 years. NEC reports that the SX-5, introduced in June 1998, has received orders for 22 systems over the last year.

We also see the attached vector processor – an optional vector processing unit attached to a standard scalar processor.

Matrix multiplication: Suppose we want to calculate the product of two N x N matrices, C := A x B. We must perform this calculation:

$$c_{ij} := \sum_{k=0}^{N-1} a_{ik}\, b_{kj}$$

Inner loop of a scalar processor performing the matrix multiply

The following loop calculates a single element of the matrix C. We must execute this loop N² times to get A x B:

        ------                    ; Instructions to initialize 1 iteration of kLoop
        ------                    ; (initialize RC, RN, Rk, Ri, Rj)
kLoop   ADD    Ri, Stride-i       ; Increment column of A
        ADD    Rj, Stride-j       ; Increment row of B
        LOAD   RA, A(Ri)          ; Get value of Matrix A row
        LOAD   RB, B(Rj)          ; Get value of Matrix B column
        FMPY   RA, RB             ; Floating multiply
        FADD   RC, RA             ; Floating add
        INC    Rk                 ; Increment k
        CMP    Rk, RN             ; At end of Row x Column?
        BNE    kLoop              ; No -- Repeat for R x C
        STORE  RC, C(r, c)        ; Yes -- Store C element
        ------                    ; Continue with all Rows/Columns of C
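For reference, here is a C sketch of the full computation (the dimension N is an assumed example value); the kLoop above corresponds to the innermost loop:

#define N 100                              /* assumed matrix dimension */

/* C := A x B for N x N matrices. */
void matmul(const double a[N][N], const double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)            /* each row of C */
        for (int j = 0; j < N; j++) {      /* each column of C */
            double sum = 0.0;              /* RC */
            for (int k = 0; k < N; k++)    /* the kLoop: row of A x column of B */
                sum += a[i][k] * b[k][j];  /* FMPY + FADD */
            c[i][j] = sum;                 /* STORE RC, C(r, c) */
        }
}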

Vector Processor Operation

With a vector processor, we have minimal instructions to set up the vector operation, and the entire inner loop (kLoop) consists of three vector instructions:

        ------                           ; Instructions to initialize vector operation
        VLOAD    V1, A(r), N, Stride-i   ; Vector load row of A with stride i
        VLOAD    V2, B(c), N, Stride-j   ; Vector load column of B with stride j
        VMPYADD  V1, V2, RC              ; Vector multiply + add to C
        STORE    RC, C(r, c)             ; Store C element
        ------                           ; Continue with all Rows/Columns of C

The special vector instruction allows us to calculate each element of C in a single vector floating point instruction (VMPYADD) rather than 2N scalar floating point instructions (FMPY and FADD) and 5N loop control and addressing instructions.

In addition, the special vector instruction can keep the floating point pipeline full and generate one result output per clock. For example, if we have a 4-stage floating point addition pipe and a 10-stage floating point multiply pipe:

· Do we ever get more than one instruction in the pipelines at a time with the kLoop sequence of the scalar processor?

· We will keep both pipelines full with successive multiply/adds on the vector processor.

With P independent pipes, we can operate on P elements of C in parallel.

FORTRAN is still the preferred language for the majority of the users of vector processors, because the majority of users are scientists and engineers and because there is a large amount of scientific software available in FORTRAN.

Example FORTRAN:

      DO 100 I=1,N
         A(I) = B(I) + C(I)
         B(I) = 2 * A(I+1)
  100 CONTINUE

If we unwind this “DO Loop”, we get:

      A(1) = B(1) + C(1)
      B(1) = 2 * A(2)
      A(2) = B(2) + C(2)
      B(2) = 2 * A(3)
      . . .

In “Vector FORTRAN”, the loop becomes:

      TEMP(1:N) = A(2:N+1)
      A(1:N)    = B(1:N) + C(1:N)
      B(1:N)    = 2 * TEMP(1:N)

The temporary vector TEMP captures the old values of A(2:N+1) before the first vector assignment overwrites them, preserving the data dependency of the original loop. Also, some optimizing FORTRAN compilers automatically generate vector code from the original “DO Loop.”

For example, DEC VAX FORTRAN supports the automatic generation of vector operations.

· [NO]VECTOR – controls whether or not the compiler checks the source code for data dependencies and generates code for the vector hardware when the code is eligible.

An example vector processor:

NEC announced the SX-4 supercomputer in November 1994. It is the third in the SX series of supercomputers and is upward compatible with the SX-3R vector processor, with enhancements for scalar processing, short vector processing, and parallel processing. The SX-4 has an 8.0 ns clock cycle and a peak performance of 2 GFLOPS per processor. Each SX-4 processor contains a vector unit and a superscalar unit. The vector unit is built from eight vector pipeline processor VLSI chips. Each chip is a self-contained vector unit with registers holding 32 vector elements. The eight chips are connected by a crossbar and together comprise 32 vector pipelines, arranged as sets of eight add/shift, eight multiply, eight divide, and eight logical pipes. Each set of eight pipes serves a single vector instruction, and all sets of pipes can operate concurrently. With a vector add and vector multiply operating concurrently, the pipes provide 2 GFLOPS peak performance.

The memory and the processors within each SX-4 node are connected by a nonblocking crossbar. Each processor has a 16-Gbyte-per-second port into the crossbar. The main memory can have up to 1024 banks of 64-bit-wide synchronous static RAM (SSRAM). The SSRAM is composed of 4-Mbit, 15-ns components. Bank cycle time is only two clocks. (Note: NEC has subsequently changed to Synchronous Dynamic RAM (SDRAM) instead of static RAM.) A 32-processor node has 512 gigabytes per second of sustainable memory bandwidth. Conflict-free unit-stride as well as stride-2 access is guaranteed for all 32 processors simultaneously. Higher strides and list-vector access benefit from the very short bank cycle time.

Note: The SX-4 achieves the stated 2 GFLOPS by feeding a multiply directly into an add, and concurrently doing this on 8 parallel pipelines. 8 ns per clock = 125 MHz. 125 MHz x 2 FLOPS/clock x 8 pipes = 2 GFLOPS.

NEC SX-5 Organization

[Figure: NEC SX-5 organization, showing the CPUs, MM (Main Memory Unit), IOP (Input-Output Processor), VR (Vector Register file), and SR (Scalar Register file).]

The SX-5 Series employs a 0.25µ CMOS LSI technology. This enables the SX-5 to achieve a clock cycle of 4.0ns, which is half that of the SX-4 Series.

                   SX-4BA Server   SX-4A Single Node   SX-4AM Multi Node
  CPUs             1-4             1-16                8-256
  CPU Peak         1.8 GF          2 GF                2 GF
  System Peak      7.2 GF          32 GF               512 GF
  Clock            8.8 ns          8.0 ns              8.0 ns
  Memory Type      SDRAM           SDRAM               SDRAM
  Max. Capacity    16 GB           32 GB               512 GB
  Max. Banking     4,096           8,192               131,072
  IOP (max)        1.6 GB/s        3.2 GB/s            25.6 GB/s
  XMU              Optional        Optional            Optional
   Max. Bandwidth  3.6 GB/s        8 GB/s              128 GB/s
   Max. Capacity   8 GB            16 GB               64 GB

  Table 1: SX-4A Models Overview

SX-4 Vector Unit

Substantial effort has been made to provide significant vector performance for short vector lengths. The crossover between scalar and vector performance is a short 8 elements in most cases.

The vector unit has 8 operational registers from which all operations can be started. In addition, there are 64 vector data registers that support a subset of the instructions and that can receive results from the pipelines concurrently with the 8 operational registers; the vector data registers serve as a high-performance vector cache, which significantly reduces memory traffic in most cases. Ganging the 8 vector pipeline processor VLSI chips yields visible vector registers that each hold 256 vector elements. Therefore the vector unit is described as 72 registers of 256 elements of 64 bits each.

Revisit the definition of speedup

Recall that the speedup of a pipeline measures how much more quickly a workload is completed by the pipeline processor than by a non-pipeline processor.

$$\text{Speedup} = \frac{\text{Best Serial Time}}{\text{Parallel Execution Time}}$$

A k-stage pipeline with all stages of equal duration (one clock period) has a theoretical speedup of k – a non-pipelined processor takes k clocks for each operation, while the full pipeline retires one operation every clock.

We will now look at the actual speedup of a pipeline in a vector processor, considering how full we can keep it.

Several tasks (operations on the elements of a vector) may be simultaneously active in a pipeline.

[Figure: space-time diagram of a four-stage pipeline. Successive tasks T1, T2, T3, ... enter stage S1 one cycle apart and move up through the stages, so once the pipe is full a task completes every cycle. Horizontal axis: time (pipeline cycles); vertical axis: space (pipeline stages).]

Suppose there are:

· k stages in the pipeline, and

· n tasks to be executed.

The first result takes k clocks to emerge from the pipeline (filling it at the start), and the remaining n – 1 results then retire at one per clock, for a total of k + (n – 1) clocks. So, the speedup S(k) that is achieved when we account for the time it takes to fill the pipeline is given by:

$$S(k) = \frac{nk}{k + (n - 1)}$$

As n (number of tasks) approaches infinity, the speedup approaches k (number of stages). Therefore, short vectors get little speedup and long vectors approach maximum speedup.
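The effect of vector length on speedup is easy to see numerically. A small C sketch (the pipeline depth and vector lengths are arbitrary example values):

#include <stdio.h>

/* Speedup of a k-stage pipeline processing n tasks: S(k) = nk / (k + n - 1). */
static double speedup(int k, int n)
{
    return (double)n * k / (k + n - 1);
}

int main(void)
{
    int k = 10;                                /* e.g., a 10-stage multiply pipe */
    int lengths[] = { 1, 8, 64, 256, 10000 };  /* assumed vector lengths */

    for (size_t i = 0; i < sizeof lengths / sizeof lengths[0]; i++)
        printf("n = %5d   S(k) = %6.2f\n", lengths[i], speedup(k, lengths[i]));
    /* The printed speedup approaches k = 10 as n grows;
       a vector of length 1 gets no speedup at all.       */
    return 0;
}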

It may be possible to partially overlap finishing one vector operation with starting up another vector operation.

[Figure: space-time diagram showing the drain of one vector operation overlapped with the fill of the next.]

Vector instructions must be able to specify the stride for a vector.

· The elements of a vector may not be stored in consecutive memory locations. For example, in our N x N matrix multiplication, vector A has a stride of 1 (the row) and vector B has a stride of N (the column).

· A constant stride may be specified so that, for example, every other element (stride = 2) or every third element (stride = 3) is loaded or stored (sketched in C after this list).

· Many problems involve sparse matrices, where the needed elements are scattered irregularly rather than at a constant stride. In such cases, gather/scatter instructions are used to load and store data under the control of a vector register that contains the locations of the needed data – indirect addressing.

· An arithmetic operation need not be performed on every element of the vector. In such a case, a mask register is constructed that controls which elements of a vector are loaded, operated on, and stored.
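To make these access patterns concrete, here is a minimal C sketch (the names and the matrix dimension N are illustrative assumptions, not part of any particular vector instruction set) of what strided, gathered, and masked accesses compute:

#include <stddef.h>

#define N 100   /* assumed matrix dimension */

/* Strided access: column j of a row-major N x N matrix has stride N. */
void load_column(const double *b, size_t j, double *v)
{
    for (size_t k = 0; k < N; k++)
        v[k] = b[k * N + j];              /* stride-N load */
}

/* Gather: load elements whose locations are listed in an index vector. */
void gather(const double *mem, const size_t *index, double *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = mem[index[i]];             /* indirect (list-vector) addressing */
}

/* Masked operation: subtract only where the mask bit is 1. */
void masked_sub(double *a, const double *b, const unsigned char *mask, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            a[i] = a[i] - b[i];
}

On a vector machine, each of these loops corresponds to a single vector instruction once the stride, index vector, or mask register has been set up.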

Assuming that we get all of the pipeline and logical operations worked out, the main problem with vector processors is feeding them. How much memory bandwidth do we need to feed an SX-4 processor with 64-bit operands?

If we had to feed the pipeline directly from interleaved memory, as Stone shows in figure 5.4:

[Figure (Stone, fig. 5.4): reservation table for an eight-way interleaved memory feeding a four-stage pipeline. Each memory module spends two clocks on each access; the modules take turns reading operands of A and B (RA0-RA7, RB0-RB7) and writing results (W0-W6), so that the pipeline stages receive elements 0-7 in successive clock periods. Horizontal axis: time (clock periods 0-12).]

The pipeline is running at 8 ns per clock and each operand is given two clocks, so the memory modules must each have an access time of 16 ns. This is a reasonable SRAM access time.

Problems:

· Three of these modules need to transfer their 64-bit data words concurrently to/from the processor pipeline on every clock, requiring three 125 MHz busses into the processor, similar to figure 5.2 in Stone.

· The three vectors must be stored in the modules as in figure 5.3 such that the access to the memory modules is perfectly synchronized.

Back to Interleaved Memory

How can we organize memory to provide sequential access faster than any one module cycle time?

Recall that interleaved memory places consecutive words of memory in different memory modules:

  Memory Module 0: words with addresses = 0 (mod 4)
  Memory Module 1: words with addresses = 1 (mod 4)
  Memory Module 2: words with addresses = 2 (mod 4)
  Memory Module 3: words with addresses = 3 (mod 4)

· Since a read or write to one module can be started before a read/write to another module finishes, reads/writes can be overlapped.

· Only the leading bits of the address are used to determine the address within the module. The least-significant bits (in the diagram above, the two least-significant bits) determine the memory module.

· Thus, by loading a single address into the memory-address register (MAR) and saying “read” or “write”, the processor can read/write M words of memory. We say that memory is M-way interleaved.

· Low-order interleaving distributes the addresses so that consecutive addresses are located within consecutive modules. For example, for 8-way interleaving:

  Module:    0    1    2    3    4    5    6    7
             0    1    2    3    4    5    6    7
             8    9   10   11   12   13   14   15
            16   17   18   19   20   21   22   23
            24   25   26   27   28   29   30   31

Interleaved-memory designs: Interleaved memory divides an address into two portions: one selects the module, and the other selects an address within the module.

Each module has a separate MAR and a separate MDR.

· When an address is presented, a decoder determines which MAR should be loaded with this address. It uses the low-order m = log₂M bits to decide this.

· The high-order n–m bits are actually loaded into the MAR. They select the proper location within the module.

[Figure: interleaved-memory organization. The n-bit address from the CPU is split into an (n–m)-bit address within the memory module and an m-bit module number; a decoder uses the m bits to select one of the 2^m memory units, each of which has its own MAR and MDR, and all MDRs share the data bus.]
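A minimal C sketch of this address split, assuming M = 2^m modules and low-order interleaving:

#include <stdio.h>

#define M 8    /* number of modules; M = 2^m with m = 3 */

int main(void)
{
    unsigned addr   = 27;        /* example address                               */
    unsigned module = addr % M;  /* low-order m bits select the module            */
    unsigned offset = addr / M;  /* high-order n-m bits select the word within it */

    printf("address %u -> module %u, word %u\n", addr, module, offset);
    /* prints: address 27 -> module 3, word 3   (27 = 3*8 + 3) */
    return 0;
}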

An alternative to feeding a vector processor directly from external storage is to provide a hierarchical memory system similar to cache memory. Memory on the processor chip is called register storage rather than L1 cache, and is managed directly by the programmer rather than automatically by the hardware.

A vector processor with high-speed register storage:

[Figure: a vector processor with high-speed register storage. Vector load/store units move data between main memory and the vector registers; the vector and scalar registers feed FP add/subtract, FP multiply, FP divide, integer, and Boolean pipelines.]

The vector registers are large – 64 to 256 floating point numbers each. 256 floating point numbers at 64 bits each times 8 registers is equivalent to a 16k byte internal data cache.
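Because these registers are managed by the programmer (or the compiler) rather than automatically by hardware, a vector longer than a register is normally processed in register-sized pieces, a standard technique often called strip mining. A C sketch, assuming a maximum register length of 256 elements:

#include <stddef.h>

#define MAXVL 256   /* assumed vector register length, in elements */

/* Add two vectors of arbitrary length n in register-sized strips. */
void add_vectors(double *c, const double *a, const double *b, size_t n)
{
    for (size_t start = 0; start < n; start += MAXVL) {
        size_t vl = (n - start < MAXVL) ? (n - start) : MAXVL;
        /* On a vector machine, each strip below would be one VLOAD / VADD /
           VSTORE sequence operating on vl elements held in vector registers. */
        for (size_t i = 0; i < vl; i++)
            c[start + i] = a[start + i] + b[start + i];
    }
}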

Masking

If statements in loops get in the way of vector processors. For example, consider an operation on a vector where you want to do something if the element is not 0. You might code it as the following loop for a scalar processor:

      for i := 1 to n do
          if A[i] ≠ 0 then A[i] := A[i] - B[i];

This does not work well with a vector processor. We would like to specify an operation on the entire vector A.

A vector mask register (VM) holds a boolean vector that can be set to specify whether the operation on the corresponding vector element should be performed. The operation on the vector element takes place only if the corresponding mask bit in the VM is 1.

For example, the following sequence could be used with the mask register:

        VLOAD  V1, A, N, Stride-i   ; Vector load row of A with stride i
        VLOAD  V2, B, N, Stride-j   ; Vector load column of B with stride j
        SLOAD  S0, #0               ; Scalar floating point constant 0
        VMSNE  S0, V1               ; Sets VM bit to 0 if V1[i] = S0
        VSUB   V1, V2               ; Vector subtract V2 from V1
        VMC                         ; Clear vector mask to all 1
        STORE  A, V1                ; Store vector A
