Vector Processors

A vector processor is a pipelined processor with special instructions designed to keep the (floating point) pipeline(s) full. These special instructions are vector instructions.

Terminology:

· Scalar – a single quantity (number).

· Vector – an ordered series of scalar quantities – a one-dimensional array.

Scalar quantity:   [Data]

Vector quantity:   [Data][Data][Data][Data][Data][Data][Data][Data]

Five basic types of vector operations:

1. V ← V         Example: Complement all elements

2. S ← V         Examples: Min, Max, Sum

3. V ← V x V     Examples: Vector addition, multiplication, division

4. V ← V x S     Examples: Multiply or add a scalar to a vector

5. S ← V x V     Example: Calculate an element of a matrix product (a dot product)

One instruction says, in effect, do the same thing on all the elements of the vector(s).
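As a concrete illustration, here is a minimal C sketch (the function and names are only illustrative) of the scalar loop that a single type-4 vector instruction (V ← V x S) replaces:

#include <stddef.h>

/* Scalar code: one floating-point multiply, plus loop overhead, per element. */
void scale_scalar(double *v, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = v[i] * s;    /* a vector processor does all n of these with one instruction */
}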

The generic vector processor:

[Figure: a pipelined processor reads operand streams A and B from a multiport memory system and returns the result stream = A x B.]

Many large-scale scientific and engineering problems can be solved by operations on large vectors or matrices of floating point numbers. Vector processors are designed to efficiently work on these problems.

Performance of these machines is measured in:

· FLOPS – Floating Point Operations per Second,

· MegaFLOPS – a million FLOPS, or

· GigaFLOPS – a billion FLOPS.

The extremely high performance is achieved only for problems that can be expressed as operations on large vectors. These processors are also called supercomputers, a class popularized by the Cray series.

The cost/performance ratio of vector processors can be impressive, but the initial cost is high (few of them are built). NEC's SX-4 series, which NEC claims was the most successful supercomputer, sold just 134 systems in 3 years. NEC reports that the SX-5, introduced in June 1998, has received orders for 22 systems over the last year.

We also see the attached vector processor – an optional vector processing unit attached to a standard scalar processor.

Matrix multiplication: Suppose we want to calculate the product of two N x N matrices, C := A x B. We must perform this calculation:

$$c_{ij} := \sum_{k=0}^{N-1} a_{ik}\, b_{kj}$$

Inner loop of a scalar processor performing the matrix multiply

The following loop calculates a single element of the matrix C. We must execute this loop N² times to get A x B:

        ------                    ; Instructions to initialize 1 iteration of kLoop
        ------                    ; (initialize RC, RN, Rk, Ri, Rj)
kLoop   ADD    Ri, Stride-i       ; Increment column of A
        ADD    Rj, Stride-j       ; Increment row of B
        LOAD   RA, A(Ri)          ; Get value of Matrix A row
        LOAD   RB, B(Rj)          ; Get value of Matrix B column
        FMPY   RA, RB             ; Floating multiply
        FADD   RC, RA             ; Floating add
        INC    Rk                 ; Increment k
        CMP    Rk, RN             ; At end of Row x Column?
        BNE    kLoop              ; No -- Repeat for R x C
        STORE  RC, C(r, c)        ; Yes -- Store C element
        ------                    ; Continue with all Rows/Columns of C
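For reference, here is a C sketch of the full computation (the dimension N is an assumed example value); the kLoop above corresponds to the innermost loop:

#define N 100                              /* assumed matrix dimension */

/* C := A x B for N x N matrices. */
void matmul(const double a[N][N], const double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)            /* each row of C */
        for (int j = 0; j < N; j++) {      /* each column of C */
            double sum = 0.0;              /* RC */
            for (int k = 0; k < N; k++)    /* the kLoop: row of A x column of B */
                sum += a[i][k] * b[k][j];  /* FMPY + FADD */
            c[i][j] = sum;                 /* STORE RC, C(r, c) */
        }
}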

Vector Processor Operation

With a vector processor, we have minimal instructions to set up the vector operation, and the entire inner loop (kLoop) consists of three vector instructions:

        ------                           ; Instructions to initialize vector operation
        VLOAD    V1, A(r), N, Stride-i   ; Vector load row of A with stride i
        VLOAD    V2, B(c), N, Stride-j   ; Vector load column of B with stride j
        VMPYADD  V1, V2, RC              ; Vector multiply + add to C
        STORE    RC, C(r, c)             ; Store C element
        ------                           ; Continue with all Rows/Columns of C

The special vector instruction allows us to calculate each element of C in a single vector floating point instruction (VMPYADD) rather than 2N scalar floating point instructions (FMPY and FADD) and 5N loop control and addressing instructions.

In addition, the special vector instruction can keep the floating point pipeline full and generate one result output per clock. For example, if we have a 4-stage floating point addition pipe and a 10-stage floating point multiply pipe:

· Do we ever get more than one instruction in the pipelines at a time with the kLoop sequence of the scalar processor?

· We will keep both pipelines full with successive multiply/adds on the vector processor.

With P independent pipes, we can operate on P elements of C in parallel.

FORTRAN is still the preferred language for the majority of the users of vector processors, because the majority of users are scientists and engineers and because there is a large amount of scientific software available in FORTRAN.

Example FORTRAN:

      DO 100 I=1,N
         A(I) = B(I) + C(I)
         B(I) = 2 * A(I+1)
  100 CONTINUE

If we unwind this “DO Loop”, we get:

      A(1) = B(1) + C(1)
      B(1) = 2 * A(2)
      A(2) = B(2) + C(2)
      B(2) = 2 * A(3)
      . . .

In “Vector FORTRAN”, the loop becomes:

      TEMP(1:N) = A(2:N+1)
      A(1:N)    = B(1:N) + C(1:N)
      B(1:N)    = 2 * TEMP(1:N)

The temporary vector TEMP captures the old values of A(2:N+1) before the first vector assignment overwrites them, preserving the data dependency of the original loop. Also, some optimizing FORTRAN compilers automatically generate vector code from the original “DO Loop.”

For example, DEC VAX FORTRAN supports the automatic generation of vector operations.

· [NO]VECTOR – controls whether or not the compiler checks the source code for data dependencies and generates code for the vector hardware when the code is eligible.

An example vector processor:

NEC announced the SX-4 supercomputer in November 1994. It is the third in the SX series of supercomputers and is upward compatible with the SX-3R vector processor, with enhancements for scalar processing, short vector processing, and parallel processing. The SX-4 has an 8.0 ns clock cycle and a peak performance of 2 GFLOPS per processor. Each SX-4 processor contains a vector unit and a superscalar unit. The vector unit is built from eight vector pipeline processor VLSI chips. Each chip is a self-contained vector unit with registers holding 32 vector elements. The eight chips are connected by a crossbar and together comprise 32 vector pipelines, arranged as sets of eight add/shift, eight multiply, eight divide, and eight logical pipes. Each set of eight pipes serves a single vector instruction, and all sets of pipes can operate concurrently. With a vector add and vector multiply operating concurrently, the pipes provide 2 GFLOPS peak performance.

The memory and the processors within each SX-4 node are connected by a nonblocking crossbar. Each processor has a 16-Gbyte-per-second port into the crossbar. The main memory can have up to 1024 banks of 64-bit-wide synchronous static RAM (SSRAM). The SSRAM is composed of 4-Mbit, 15-ns components. Bank cycle time is only two clocks. (Note: NEC has subsequently changed to Synchronous Dynamic RAM (SDRAM) instead of static RAM.) A 32-processor node has 512 gigabytes per second of sustainable memory bandwidth. Conflict-free unit-stride as well as stride-2 access is guaranteed for all 32 processors simultaneously. Higher strides and list-vector access benefit from the very short bank cycle time.

Note: The SX-4 achieves the stated 2 GFLOPS by feeding a multiply directly into an add, and concurrently doing this on 8 parallel pipelines. 8 ns per clock = 125 MHz. 125 MHz x 2 FLOPS/clock x 8 pipes = 2 GFLOPS.

NEC SX-5 Organization

[Figure: NEC SX-5 organization, showing the CPUs, MM (Main Memory Unit), IOP (Input-Output Processor), VR (Vector Register file), and SR (Scalar Register file).]

The SX-5 Series employs a 0.25µ CMOS LSI technology. This enables the SX-5 to achieve a clock cycle of 4.0ns, which is half that of the SX-4 Series.

                   SX-4BA Server   SX-4A Single Node   SX-4AM Multi Node
  CPUs             1-4             1-16                8-256
  CPU Peak         1.8 GF          2 GF                2 GF
  System Peak      7.2 GF          32 GF               512 GF
  Clock            8.8 ns          8.0 ns              8.0 ns
  Memory Type      SDRAM           SDRAM               SDRAM
  Max. Capacity    16 GB           32 GB               512 GB
  Max. Banking     4,096           8,192               131,072
  IOP (max)        1.6 GB/s        3.2 GB/s            25.6 GB/s
  XMU              Optional        Optional            Optional
   Max. Bandwidth  3.6 GB/s        8 GB/s              128 GB/s
   Max. Capacity   8 GB            16 GB               64 GB

  Table 1: SX-4A Models Overview

SX-4 Vector Unit

Substantial effort has been made to provide significant vector performance for short vector lengths. The crossover between scalar and vector performance is a short 8 elements in most cases.

The vector unit has 8 operational registers from which all operations can be started. In addition, there are 64 vector data registers that support a subset of the instructions and that can receive results from the pipelines concurrently with the 8 operational registers; the vector data registers serve as a high-performance vector cache, which significantly reduces memory traffic in most cases. Ganging the 8 vector pipeline processor VLSI chips yields visible vector registers that each hold 256 vector elements. Therefore the vector unit is described as 72 registers of 256 elements of 64 bits each.

Revisit the definition of speedup

Recall that the speedup of a pipeline measures how much more quickly a workload is completed by the pipeline processor than by a non-pipeline processor.

$$\text{Speedup} = \frac{\text{Best Serial Time}}{\text{Parallel Execution Time}}$$

A k-stage pipeline with all stages of equal duration (one clock period) has a theoretical speedup of k – a non-pipelined processor takes k clocks for each operation, while the full pipeline retires one operation every clock.

We will now look at the actual speedup of a pipeline in a vector processor, considering how full we can keep it.

Several tasks (operations on the elements of a vector) may be simultaneously active in a pipeline.

[Figure: space-time diagram of a four-stage pipeline. Successive tasks T1, T2, T3, ... enter stage S1 one cycle apart and move up through the stages, so once the pipe is full a task completes every cycle. Horizontal axis: time (pipeline cycles); vertical axis: space (pipeline stages).]

Suppose there are:

· k stages in the pipeline, and

· n tasks to be executed.

The first result takes k clocks to emerge from the pipeline (filling it at the start), and the remaining n – 1 results then retire at one per clock, for a total of k + (n – 1) clocks. So, the speedup S(k) that is achieved when we account for the time it takes to fill the pipeline is given by:

$$S(k) = \frac{nk}{k + (n - 1)}$$

As n (number of tasks) approaches infinity, the speedup approaches k (number of stages). Therefore, short vectors get little speedup and long vectors approach maximum speedup.
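The effect of vector length on speedup is easy to see numerically. A small C sketch (the pipeline depth and vector lengths are arbitrary example values):

#include <stdio.h>

/* Speedup of a k-stage pipeline processing n tasks: S(k) = nk / (k + n - 1). */
static double speedup(int k, int n)
{
    return (double)n * k / (k + n - 1);
}

int main(void)
{
    int k = 10;                                /* e.g., a 10-stage multiply pipe */
    int lengths[] = { 1, 8, 64, 256, 10000 };  /* assumed vector lengths */

    for (size_t i = 0; i < sizeof lengths / sizeof lengths[0]; i++)
        printf("n = %5d   S(k) = %6.2f\n", lengths[i], speedup(k, lengths[i]));
    /* The printed speedup approaches k = 10 as n grows;
       a vector of length 1 gets no speedup at all.       */
    return 0;
}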

It may be possible to partially overlap finishing one vector operation with starting up another vector operation.

[Figure: space-time diagram showing the drain of one vector operation overlapped with the fill of the next.]

Vector instructions must be able to specify the stride for a vector.

· The elements of a vector may not be stored in consecutive memory locations. For example, in our N x N matrix multiplication, vector A has a stride of 1 (the row) and vector B has a stride of N (the column).

· A constant stride may be specified so that, for example, every other element (stride = 2) or every third element (stride = 3) is loaded or stored (sketched in C after this list).

· Many problems involve sparse matrices, where the needed elements are scattered irregularly rather than at a constant stride. In such cases, gather/scatter instructions are used to load and store data under the control of a vector register that contains the locations of the needed data – indirect addressing.

· An arithmetic operation need not be performed on every element of the vector. In such a case, a mask register is constructed that controls which elements of a vector are loaded, operated on, and stored.
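To make these access patterns concrete, here is a minimal C sketch (the names and the matrix dimension N are illustrative assumptions, not part of any particular vector instruction set) of what strided, gathered, and masked accesses compute:

#include <stddef.h>

#define N 100   /* assumed matrix dimension */

/* Strided access: column j of a row-major N x N matrix has stride N. */
void load_column(const double *b, size_t j, double *v)
{
    for (size_t k = 0; k < N; k++)
        v[k] = b[k * N + j];              /* stride-N load */
}

/* Gather: load elements whose locations are listed in an index vector. */
void gather(const double *mem, const size_t *index, double *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v[i] = mem[index[i]];             /* indirect (list-vector) addressing */
}

/* Masked operation: subtract only where the mask bit is 1. */
void masked_sub(double *a, const double *b, const unsigned char *mask, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            a[i] = a[i] - b[i];
}

On a vector machine, each of these loops corresponds to a single vector instruction once the stride, index vector, or mask register has been set up.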

Assuming that we get all of the pipeline and logical operations worked out, the main problem with vector processors is feeding them. How much memory bandwidth do we need to feed an SX-4 processor with 64-bit operands?

If we had to feed the pipeline directly from interleaved memory, as Stone shows in figure 5.4:

[Figure (Stone, fig. 5.4): reservation table for an eight-way interleaved memory feeding a four-stage pipeline. Each memory module spends two clocks on each access; the modules take turns reading operands of A and B (RA0-RA7, RB0-RB7) and writing results (W0-W6), so that the pipeline stages receive elements 0-7 in successive clock periods. Horizontal axis: time (clock periods 0-12).]

The pipeline is running at 8 ns per clock and each operand is given two clocks, so the memory modules must each have an access time of 16 ns. This is a reasonable SRAM access time.

Problems:

· Three of these modules need to transfer their 64-bit data words concurrently to/from the processor pipeline on every clock, requiring three 125 MHz busses into the processor, similar to figure 5.2 in Stone.

· The three vectors must be stored in the modules as in figure 5.3 such that the access to the memory modules is perfectly synchronized.

Back to Interleaved Memory

How can we organize memory to provide sequential access faster than any one module cycle time?

Recall that interleaved memory places consecutive words of memory in different memory modules:

  Memory Module 0: words with addresses = 0 (mod 4)
  Memory Module 1: words with addresses = 1 (mod 4)
  Memory Module 2: words with addresses = 2 (mod 4)
  Memory Module 3: words with addresses = 3 (mod 4)

· Since a read or write to one module can be started before a read/write to another module finishes, reads/writes can be overlapped.

· Only the leading bits of the address are used to determine the address within the module. The least-significant bits (in the diagram above, the two least-significant bits) determine the memory module.

· Thus, by loading a single address into the memory-address register (MAR) and saying “read” or “write”, the processor can read/write M words of memory. We say that memory is M-way interleaved.

· Low-order interleaving distributes the addresses so that consecutive addresses are located within consecutive modules. For example, for 8-way interleaving:

  Module:    0    1    2    3    4    5    6    7
             0    1    2    3    4    5    6    7
             8    9   10   11   12   13   14   15
            16   17   18   19   20   21   22   23
            24   25   26   27   28   29   30   31

Interleaved-memory designs: Interleaved memory divides an address into two portions: one selects the module, and the other selects an address within the module.

Each module has a separate MAR and a separate MDR.

· When an address is presented, a decoder determines which MAR should be loaded with this address. It uses the low-order m = log₂M bits to decide this.

· The high-order n–m bits are actually loaded into the MAR. They select the proper location within the module.

[Figure: interleaved-memory organization. The n-bit address from the CPU is split into an (n–m)-bit address within the memory module and an m-bit module number; a decoder uses the m bits to select one of the 2^m memory units, each of which has its own MAR and MDR, and all MDRs share the data bus.]
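A minimal C sketch of this address split, assuming M = 2^m modules and low-order interleaving:

#include <stdio.h>

#define M 8    /* number of modules; M = 2^m with m = 3 */

int main(void)
{
    unsigned addr   = 27;        /* example address                               */
    unsigned module = addr % M;  /* low-order m bits select the module            */
    unsigned offset = addr / M;  /* high-order n-m bits select the word within it */

    printf("address %u -> module %u, word %u\n", addr, module, offset);
    /* prints: address 27 -> module 3, word 3   (27 = 3*8 + 3) */
    return 0;
}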

An alternative to feeding a vector processor directly from external storage is to provide a hierarchical memory system similar to cache memory. Memory on the processor chip is called register storage rather than L1 cache, and is managed directly by the programmer rather than automatically by the hardware.

A vector processor with high-speed register storage:

[Figure: a vector processor with high-speed register storage. Vector load/store units move data between main memory and the vector registers; the vector and scalar registers feed FP add/subtract, FP multiply, FP divide, integer, and Boolean pipelines.]

The vector registers are large – 64 to 256 floating point numbers each. 256 floating point numbers at 64 bits each times 8 registers is equivalent to a 16k byte internal data cache.
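Because these registers are managed by the programmer (or the compiler) rather than automatically by hardware, a vector longer than a register is normally processed in register-sized pieces, a standard technique often called strip mining. A C sketch, assuming a maximum register length of 256 elements:

#include <stddef.h>

#define MAXVL 256   /* assumed vector register length, in elements */

/* Add two vectors of arbitrary length n in register-sized strips. */
void add_vectors(double *c, const double *a, const double *b, size_t n)
{
    for (size_t start = 0; start < n; start += MAXVL) {
        size_t vl = (n - start < MAXVL) ? (n - start) : MAXVL;
        /* On a vector machine, each strip below would be one VLOAD / VADD /
           VSTORE sequence operating on vl elements held in vector registers. */
        for (size_t i = 0; i < vl; i++)
            c[start + i] = a[start + i] + b[start + i];
    }
}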

Masking

If statements in loops get in the way of vector processors. For example, consider an operation on a vector where you want to do something if the element is not 0. You might code it as the following loop for a scalar processor:

      for i := 1 to n do
          if A[i] ≠ 0 then A[i] := A[i] - B[i];

This does not work well with a vector processor. We would like to specify an operation on the entire vector A.

A vector mask register (VM) holds a boolean vector that can be set to specify whether the operation on the corresponding vector element should be performed. The operation on the vector element takes place only if the corresponding mask bit in the VM is 1.

For example, the following sequence could be used with the mask register:

        VLOAD  V1, A, N, Stride-i   ; Vector load row of A with stride i
        VLOAD  V2, B, N, Stride-j   ; Vector load column of B with stride j
        SLOAD  S0, #0               ; Scalar floating point constant 0
        VMSNE  S0, V1               ; Sets VM bit to 0 if V1[i] = S0
        VSUB   V1, V2               ; Vector subtract V2 from V1
        VMC                         ; Clear vector mask to all 1
        STORE  A, V1                ; Store vector A
