Hakam Zaidan
Stephen Moore

Outline
• Vector Architectures
• Properties
• Applications
• History
• Westinghouse Solomon
• ILLIAC IV
• CDC STAR 100
• Cray‐1
• Other Cray Vector Machines
• Vector Machines Today

Introduction
• A vector processor is a CPU that can run one instruction on an entire vector of data.

• The number of instructions fetched is small.
• Vector processors also exploit data parallelism in large scientific and multimedia applications.

Styles of Vector Architectures
• Based on how the operands are fetched, vector processors can be divided into two categories:

  • Memory‐Memory Architecture.
  • Vector‐Register Architecture.

Vector Processor Elements
• Vector registers:
  • Fixed length, each holding a single vector, with ports for reading and writing.
  • Usually 8 to 32 registers, each holding 64 or 128 elements.
• Vector functional units (FUs):
  • Usually 4‐8 functional units: FP multiply, FP add, and FP divide, in addition to integer add and logical shift.
• Vector load/store units (LSUs).
• Scalar registers.
• Crossbar.

Vector Processor Properties
• The results within a vector are independent of one another.
• Vector instructions access memory in a known pattern.
• Fewer branches, so branch‐related pipeline problems are reduced.
• A single vector instruction specifies a large amount of computation (e.g. an entire loop).

Disadvantages
• Relatively slow on scalar instructions.

• Precise exceptions are difficult to implement.
• High cost of on‐chip vector memory systems.
• Code complexity.

Applications
• Lossy compression.
• Lossless compression.
• Multimedia processing.
• Standard benchmarking kernels.
• Handwriting recognition.
• Speech recognition.
• Cryptography.
• Operating systems and networking.
• Databases.
• Language run‐time support.

History
• In 1962, the Illinois Automatic Computer (ILLIAC) series of supercomputers: ILLIAC I, ILLIAC II, ILLIAC III, ILLIAC IV (with 64 ALUs, 100‐150 MFLOPS).
• In 1973, TI’s Advanced Scientific Computer (ASC), 20‐80 MFLOPS.
• In 1975, the Cray‐1 (80‐240 MFLOPS) was the first supercomputer to have vector registers instead of keeping data in memory.
• From 1976 to 1999: Cray X‐MP, Cray Y‐MP, NEC SX/2, Cray C‐90, NEC SX/4, Cray J‐90, Cray T‐90, NEC SX/5.

Westinghouse Solomon Project
• Used an array of processing elements (PEs)
• Applied the same instruction to all processors, with different data per processor
• Research contract with the US Air Force
• Prototype built in 1964
• Development ended after the contract expired
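
The Solomon execution model, one instruction applied to every processing element while each PE works on its own data, can be sketched in a few lines of Python. This is only an illustration: `ProcessingElement` and `broadcast` are hypothetical names, the array size is arbitrary, and the loop stands in for hardware that would run all PEs in lockstep.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProcessingElement:
    """Hypothetical PE holding one private data value."""
    value: float


def broadcast(pes: List[ProcessingElement], op: Callable[[float], float]) -> None:
    """Apply the same operation to every PE's private data.

    Real hardware would do this in lockstep; the loop only models it.
    """
    for pe in pes:
        pe.value = op(pe.value)


# Eight PEs, each with different data (illustrative size, not Solomon's).
pes = [ProcessingElement(float(i)) for i in range(8)]
broadcast(pes, lambda x: 2.0 * x + 1.0)  # one "instruction" for all PEs
results = [pe.value for pe in pes]
print(results)
```

The control flow lives entirely in `broadcast`; the PEs hold only data, which mirrors the single instruction stream described on the slide.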

ILLIAC IV
• Parallel machine
• One control unit (CU) controlled the PEs
• Only one of the four planned CUs was built
• 64 PEs per CU
• Each PE had a private memory unit
• Expected 1000 MFLOPS
• Achieved 100‐250 MFLOPS
• Fastest machine until 1981
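
A back-of-envelope check relates the expected and achieved figures. The sketch below assumes (this is an assumption, not something the slides state) that the 1000 MFLOPS target was for the full four-CU machine and that throughput scales linearly with the number of PEs.

```python
# Figures taken from the ILLIAC IV slide.
planned_cus = 4
built_cus = 1
pes_per_cu = 64
expected_mflops = 1000.0

# Assumption (not from the source): throughput scales linearly with PE count.
planned_pes = planned_cus * pes_per_cu
built_pes = built_cus * pes_per_cu
scaled_mflops = expected_mflops * built_pes / planned_pes

print(f"{built_pes} of {planned_pes} PEs -> about {scaled_mflops:.0f} MFLOPS")
```

Under that assumption the single CU that was built would top out near the upper end of the achieved range.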

CDC STAR 100
• Designed to operate at 100 MFLOPS
• Long pipelines
• Long vector setup time
• Vectors needed at least 50 elements to be faster than competitors
• Slow scalar performance

Cray‐1
• Built in 1976
• 138 MFLOPS, or 250 MFLOPS in bursts
• Fast vector and scalar computation
• Smaller than other computers
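
The STAR 100's break-even point of roughly 50 elements can be understood with a simple timing model: a vector operation pays a fixed setup cost and then a small cost per element, while a competing machine pays no setup but more per element. The sketch below is illustrative only. The function names and the cycle counts (100 cycles of setup, 1 cycle per vector element, 3 cycles per element on the competitor) are hypothetical values chosen so that the crossover falls at 50 elements; they are not measured STAR 100 figures.

```python
def vector_time(n: int, startup: float, per_element: float) -> float:
    """Cycles for a vector operation: fixed setup plus per-element cost."""
    return startup + n * per_element


def competitor_time(n: int, per_element: float) -> float:
    """Cycles on a machine with no setup cost but a higher per-element cost."""
    return n * per_element


def break_even_length(startup: float, t_vec: float, t_comp: float) -> int:
    """Smallest vector length at which the vector machine is strictly faster."""
    if t_comp <= t_vec:
        raise ValueError("vector machine never wins if it is slower per element")
    n = 1
    while vector_time(n, startup, t_vec) >= competitor_time(n, t_comp):
        n += 1
    return n


# Hypothetical cycle counts, chosen only to place the crossover at 50 elements.
n_min = break_even_length(startup=100, t_vec=1, t_comp=3)
print(n_min)
```

With these numbers the two machines tie at exactly 50 elements, and the vector machine wins only for longer vectors. A shorter setup time moves the crossover toward shorter vectors.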

Cray‐1 Architecture
• Uses registers to increase speed
  • 8 24‐bit address registers
  • 64 24‐bit address‐save registers
  • 8 64‐bit scalar registers
  • 64 64‐bit scalar‐save registers
  • 8 64‐word vector registers
• “Chains” together functional units
  • Address, Scalar, Vector, and Floating Point
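
The benefit of chaining can be seen with a simplified cycle-count model. Without chaining, a dependent vector operation waits for the whole first result vector before it starts. With chaining, each element is forwarded to the second functional unit as soon as it is produced, so the two pipelines overlap. The function names and latencies below are assumptions for illustration; the model ignores issue and memory effects and is not an exact description of Cray‐1 timing.

```python
def unchained_cycles(n: int, latency_a: int, latency_b: int) -> int:
    """Second operation starts only after the first has produced all n results."""
    return (latency_a + n) + (latency_b + n)


def chained_cycles(n: int, latency_a: int, latency_b: int) -> int:
    """Results stream from unit A into unit B, so only the latencies add up."""
    return latency_a + latency_b + n


# One full 64-element vector register; the latencies are illustrative values.
n = 64
mul_latency = 7
add_latency = 6

slow = unchained_cycles(n, mul_latency, add_latency)
fast = chained_cycles(n, mul_latency, add_latency)
print(slow, fast)
```

In this model chaining nearly halves the time of a dependent multiply-add pair on a full vector register, because the per-element work of the two units overlaps.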

Other Cray Computers
• Cray X‐MP (1983)
  • Used shared memory, faster clock, more memory bandwidth, 2 CPUs, 400‐800 MFLOPS
• Cray‐2 (1985)
  • New architecture, fast memory, 1.9 GFLOPS
• Cray Y‐MP (1988)
  • 2, 4, or 8 vector processors, 2.67 GFLOPS
• Cray X1 (2003)
  • Unification of multiple architectures, 12.8 GFLOPS
  • Not financially successful

Vector Machines Today

• Very expensive to build
• Smaller speedup compared to using multiple processors
• Processors with many sequential cores are preferred
• Vector machine concepts are still used
• IBM ViVA
  • Virtual Vector Architecture
  • Uses multiple functional units
  • Acts as a vector processor

Vector Intelligent RAM (VIRAM)
• Architecture developed at UC Berkeley.
• Full vector microprocessor and DRAM on a single chip.
• Memory latency 5‐10X lower and bandwidth 50‐100X higher.
• High I/O bandwidth, 0.5‐2 GB/sec.
• Energy efficiency improved 2X‐4X, since there is no off‐chip bus.
• Adjustable memory size.
• Lower cost and power than traditional vector supercomputers.

Clustered Organization for Decoupled Execution (CODE)
• Developed at UC Berkeley.
• CODE is a proposed vector architecture that addresses the limitations of conventional vector processors.
• CODE organizes the vector registers in clusters, with 4‐8 registers in each cluster.
• CODE allows partial completion of an instruction in case of an exception.
• CODE supports precise exceptions using a history buffer.
• CODE can hide communication latency.

Conclusion
• Vector supercomputers are not practical due to their high cost.
• To improve cost‐performance, vector supercomputers are adopting commodity technology like SMT.
• Superscalar microprocessor designs began to absorb some of the techniques made popular in earlier vector computer systems (e.g. the Intel MMX extension).
• Vector processors are useful for embedded and multimedia applications, which require low power, small code size, and high performance.

References
• C. Kozyrakis, D. Patterson, “Overcoming the Limitations of Conventional Vector Processors”, in ISCA, 2003.
• C. Kozyrakis, D. Patterson, “Vector vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks”, in MICRO, 2002.
• W. J. Bouknight, et al., “The Illiac IV System”, Proceedings of the IEEE, Vol. 60, No. 4, April 1972.
• R. M. Russell, “The Cray‐1 Computer System”, Communications of the ACM, Vol. 21, No. 1, Jan. 1978.
• J. Gebis, et al., “Improving Memory Subsystem Performance using ViVA: Virtual Vector Architecture”, in ARCS '09: Proceedings of the 22nd International Conference on Architecture of Computing Systems, 2009, pp. 146‐158.
• D. L. Slotnick, et al., “The Solomon Computer”, Westinghouse Electric Corporation, Baltimore, MD, 1962.

Questions?