Vector Machines

Hakam Zaidan
Stephen Moore

Outline
- Vector Architectures
- Properties
- Applications
- History
- Westinghouse Solomon
- ILLIAC IV
- CDC STAR 100
- Cray-1
- Other Cray Vector Machines
- Vector Machines Today

Introduction
- A vector processor is a CPU that can run one instruction on an entire vector of data.
- The number of instructions fetched is small.
- Vector processors also exploit data parallelism in large scientific and multimedia applications.

Styles of Vector Architectures
Based on how the operands are fetched, vector processors fall into two categories:
- Memory-memory architecture.
- Vector-register architecture.

Vector Processor Elements
- Vector registers: fixed length, each holding a single vector, with ports for reading and writing. Usually 8 to 32 registers, each holding 64 to 128 elements of 64 bits.
- Vector functional units (FUs): usually 4-8 units: FP multiply, FP add, and FP divide, plus integer add and logical shift.
- Vector load-store units (LSUs).
- Scalar registers.
- Crossbar.

Vector Processor Properties
- Results within a vector operation are independent of one another.
- Vector instructions access memory in a known pattern.
- Branches, and the pipeline problems they cause, are reduced.
- A single vector instruction specifies a large amount of computation (e.g., an entire loop).

Disadvantages
- Relatively slow on scalar instructions.
- Precise exceptions are difficult to implement.
- On-chip vector memory systems are costly.
- Code complexity.

Applications
- Lossy compression.
- Lossless compression.
- Multimedia processing.
- Standard benchmarking kernels.
- Handwriting recognition.
- Speech recognition.
- Cryptography.
- Operating systems and networking.
- Databases.
- Language run-time support.

History
- 1962: the Illinois Automatic Computer (ILLIAC) series of supercomputers: ILLIAC I, ILLIAC II, ILLIAC III, and ILLIAC IV (with 64 ALUs, 100-150 Mflops).
- 1973: TI's Advanced Scientific Computer (ASC), 20-80 Mflops.
- 1975: the Cray-1 (80-240 Mflops), the first supercomputer to hold vector operands in registers instead of keeping the data in memory.
- 1976 to 1999: Cray X-MP, Cray Y-MP, NEC SX/2, Cray C-90, NEC SX/4, Cray J-90, Cray T-90, NEC SX/5.

Westinghouse Solomon Project
- Used an array of processing elements (PEs).
- Applied the same instruction to all processors, with different data per processor.
- Funded by a research contract with the US Air Force.
- Prototype built in 1964.
- Development ended after the contract expired.

ILLIAC IV
- Parallel machine.
- One control unit (CU) controlled the PEs.
- Only one of the four planned CUs was built.
- 64 PEs available per CU.
- Each PE had a private memory unit.
- Expected 1000 MFLOPS.
- Achieved 100-250 MFLOPS.
- Fastest machine until 1981.

CDC STAR 100
- Designed to operate at 100 MFLOPS.
- Long pipelines.
- Long vector setup time.
- Vectors needed at least 50 elements to run faster than on competing machines.
- Scalar performance was slow.

Cray-1
- Supercomputer built in 1976.
- 138 MFLOPS, or 250 MFLOPS in bursts.
- Fast vector and scalar computation.
- Smaller than other computers.

Cray-1 Architecture
- Uses registers to increase speed:
  - 8 24-bit address registers
  - 64 24-bit address-save registers
  - 8 64-bit scalar registers
  - 64 64-bit scalar-save registers
  - 8 64-word vector registers
- "Chains" functional units together.
- Functional unit groups: address, scalar, vector, and floating point.

Other Cray Computers
- Cray X-MP (1983): shared memory, faster clock, more memory bandwidth, 2 CPUs, 400-800 MFLOPS.
- Cray-2 (1985): new architecture, fast memory, 1.9 GFLOPS.
- Cray Y-MP (1988): 2, 4, or 8 vector processors, 2.67 GFLOPS.
- Cray X1 (2003): unification of multiple architectures, 12.8 GFLOPS, not financially successful.

Vector Machines Today
- Vector supercomputers:
  - Very expensive to build.
  - Smaller speedup compared to using multiple processors.
  - Processors with many sequential cores are preferred.
- Vector machine concepts are still used.

IBM ViVA
- Virtual Vector Architecture.
- Uses multiple functional units.
- Acts as a vector processor.

Vector Intelligent RAM (VIRAM)
- Architecture developed at UC Berkeley.
- Full vector microprocessor and DRAM on a single chip.
- Memory latency 5-10X lower and bandwidth 50-100X higher.
- High I/O bandwidth, 0.5-2 GB/sec.
- Energy efficiency improves 2X-4X because there is no off-chip bus.
- Adjustable memory size.
- Lower cost and power than traditional vector supercomputers.

Clustered Organization for Decoupled Execution (CODE)
- Developed at UC Berkeley.
- A proposed vector architecture that aims to overcome the limitations of conventional vector processors.
- Organizes the vector registers into clusters of 4-8 registers each.
- Allows partial completion of an instruction when an exception occurs.
- Supports precise exceptions using a history buffer.
- Can hide communication latency.

Conclusion
- Vector supercomputers are not practical because of their high cost.
- To improve cost-performance, vector supercomputers are adopting commodity technology such as SMT.
- Superscalar microprocessor designs began to absorb some of the techniques made popular in earlier vector computer systems (e.g., the Intel MMX extension).
- Vector processors are useful for embedded and multimedia applications, which require low power, small code size, and high performance.

References
- C. Kozyrakis, D. Patterson, "Overcoming the Limitations of Conventional Vector Processors", in ISCA, 2003.
- C. Kozyrakis, D. Patterson, "Vector vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks", in MICRO, 2002.
- W. J. Bouknight, et al., "The Illiac IV System", Proceedings of the IEEE, Vol. 60, No. 4, April 1972.
- R. M. Russell, "The Cray-1 Computer System", Communications of the ACM, Vol. 21, No. 1, Jan. 1978.
- J. Gebis, et al., "Improving Memory Subsystem Performance using ViVA: Virtual Vector Architecture", in ARCS '09: Proceedings of the 22nd International Conference on Architecture of Computing Systems, 2009, pp. 146-158.
- D. L. Slotnick, et al., "The Solomon Computer", Westinghouse Electric Corporation, Baltimore, MD, 1962.

Questions?
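
Backup sketch: one instruction per vector versus one per element

The Introduction states that a vector processor runs one instruction on an entire vector of data, so few instructions are fetched. The rough Python model below illustrates that point only. It is not a simulator of any real machine: the function names and the `issued` counters are hypothetical, and the strip length of 64 is an assumption borrowed from the Cray-1's 64-word vector registers. It counts add instructions and ignores loads, stores, and loop overhead.

```python
# Toy model (assumption: one "instruction" per loop-body execution).
VLEN = 64  # elements per vector register, assumed from the Cray-1's 64-word registers


def scalar_add(a, b):
    """Scalar style: one add instruction is issued for every element."""
    out = []
    issued = 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def vector_add(a, b, vlen=VLEN):
    """Vector style: each add instruction covers a strip of up to `vlen` elements."""
    out = []
    issued = 0
    for start in range(0, len(a), vlen):
        strip_a = a[start:start + vlen]
        strip_b = b[start:start + vlen]
        # Stands in for a single vector add over the whole strip.
        out.extend(x + y for x, y in zip(strip_a, strip_b))
        issued += 1
    return out, issued


a = list(range(200))
b = [2 * x for x in a]
s_out, s_count = scalar_add(a, b)
v_out, v_count = vector_add(a, b)
print(s_count, v_count)  # 200 4
```

Both functions produce the same result, but the vector version issues 4 add instructions for 200 elements (three full strips of 64 and one strip of 8) where the scalar version issues 200. Splitting a long loop into register-sized strips in this way is commonly called strip-mining.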

