The ILLIAC IV Computer 1

Chapter 27 The ILLIAC IV computer 1 George H. Barnes / Richard M. Brown / Maso Kato David J. Kuck / Daniel L. Slotnick / Richard A. Stokes Summary The structure of ILLIAC IV, a parallel-array computer con- control unit so that a single instruction stream sequenced taining 256 processing elements, is described. Special features include the processing of many data streams. multiarray processing, multiprecision arithmetic, and fast data-routing 2 addresses and data common to all of the data 6 Memory interconnections. Individual processing elements execute 4 X 10 instruc- were broadcast from the central control. 9 processing tions per second to yield an effective rate of 10 operations per second. 3 amount of local control at the individual Index terms Array, computer structure, look-ahead, machine lan- Some processing thin-film element level was obtained each guage, parallel processing, speed, memory. by permitting element to enable or disable the execution of the common instructions according to local tests. Introduction 4 Processing elements in the array had nearest-neighbor con- The study of a number of well-formulated but computationally nections to provide moderate coupling for data exchange. massive problems is limited by the computing power of currently available or proposed computers. Some involve manipulations of Studies with the original SOLOMON computer indicated that such a both feasible very large matrices (e.g., linear programming); others, the solution parallel approach was and applicable to a sets of areas. of of partial differential equations over sizable grids (e.g., variety important computational The advent of LSI cir- weather models); and others require extremely fast data correlation cuitry, or at least medium-scale versions, with gate times of the techniques (phased array signal processing). Substantive progress order of 2 to 5 ns, suggested that a SOLOMON-type array of 9 in these areas requires computing speeds several orders of magni- potentially 10 word operations per second could be realized. In tude greater than conventional computers. addition, memory technology had advanced sufficiently to indicate 6 At the same time, signal propagation speeds represent a serious that 10 words of memory with 200 to 500-ns cycle times could barrier to increasing the speed of strictly sequential computers. be produced at acceptable cost. The ILLIAC IV Phase I design Thus, in recent years a variety of techniques have been introduced study during the latter part of 1966 resulted in the design discussed in in this to be fabricated the Defense to overlap the functions required sequential processing, e.g., paper. The machine, by Space multiphased memories, program look-ahead, and pipeline arith- and Special Systems Division of Burroughs Corporation, Paoli, Pa., metic units. Incremental speed gains have been achieved but at is scheduled for installation in early 1970. considerable cost in hardware and complexity with accompanying problems in machine checkout and reliability. Summary of the ILLIAC IV The use of explicit parallelism of operation rather than over- IV lapping of subfunctions offers the possibility of speeds which in- The ILLIAC main structure consists of 256 processing elements in four crease linearly with the number of gates, and consequently has arranged reconfigurable SOLOMON-type arrays of 64 been explored in several designs [Slotnick et al., 1962; Unger, 1958; processors each. The individual processors have a 240-ns ADD Holland, 1959; Murtha, 1966]. The SOLOMON computer [Slotnick time and a 400-ns MULTIPLY time for 64-bit operands. Each 4 et al., 1962], which introduced a large degree of overt parallelism processor requires approximately 10 ECL gates and is provided into its structure, had four principal features. with 2048 words of 240-ns cycle time thin-film memory. Instruction and control 1 A large array of arithmetic units was controlled by a single addressing The ILLIAC IV array possesses a common control unit which HEEE Trans., C-17, vol. 8, pp. 746-757, August, 1968. decodes the instructions and generates control signals for all 320 Chapter 27 The ILLIAC IV computer 321 in the This eliminates the cost and of constants, and (2) it overlap of common processing elements array. storage program permits in each element. fetches with other complexity for decoding and timing circuits operand operations. are In addition, an index register and address adder provided the final address Processor with each processing element, so that operand partitioning for element i is determined as follows: Oj Many computations do not require the full 64-bit precision of the To make more efficient use of the hardware and speed . processors. a = a + (b) + (Cj) up computations, each processor may be partitioned into either where a is the base address in the instruction, (b) is the specified two 32-bit or eight 8-bit subprocessors, to yield 512 32-bit or contents of a central index in the control unit, and (c,) register 2048 8-bit subprocessors for the entire ILLIAC IV set. is the contents of the local index of the ele- register processing The subprocessors are not completely independent in that they effective ment i. This in is very independence operand addressing share a common index register and the 64-bit data routing paths. for rows and columns of matrices and other multidimen- handling The 32-bit subprocessors have separate enabled/disabled modes sional data structures [Kuck, 1968]. for indexing and data routing; the 8-bit subprocessors do not. Mode control and data conditional operations Array partitioning is to be able to Although the goal of the ILLIAC IV structure The 256 elements of ILLIAC IV are grouped into four separate a control the processing of a number of data streams with single subarrays of 64 processors, each subarray having its own control instruction stream, it is sometimes necessary to exclude some data unit and capable of independent processing. The subarrays may streams or to process them differently. This is accomplished by be dynamically united to form two arrays of 128 processors or one whose value providing each processor with an ENABLE flip-flop array of 256 processors. The following advantages are obtained. controls the instruction execution at the processor level. The bit is of a test result in each ENABLE part register 1 Programs with moderately dimensioned vector or matrix holds the results of tests conditional on local data. processor which variables can be more efficiently matched to the array size. Thus in ILLIAC IV the data conditional jumps of conventional 2 Failure of any subarray does not preclude continued proc- computers are accomplished by processor tests which enable or essing by the others. disable local execution of subsequent commands in the instruction stream. This paper summarizes the structure of the entire ILLIAC IV system. Programming techniques and data structures for ILLIAC Routing IV are covered in a paper by Kuck [1968]. i has data Each processing element in the ILLIAC IV routing — connections to 4 of its neighbors, processors i + 1, i 1, i + 8, for and i — 8. End connection is end around so that, a single array, ILLIAC IV structure 63 connects to 0, 62, 7, and 55. processor processors The organization of the ILLIAC IV system is indicated in Fig. 1. data transmissions of arbitrary distance are ac- Interprocessor The individual processing elements (PEs) are grouped in four a sequence of routings within a single instruction. complished by arrays, each containing 64 elements and a control unit (CU). The of For a array the maximum number routing steps control to 64-processor four arrays may be connected together under program is 4. actual is 7; the average overall possible distances In required permit multiprocessing or single-processing operation. The system distance 1 is most common and distances programs, routing by program resides in a general-purpose computer, a Burroughs than 2 are rare. greater B 6500, which supervises program loading, array configuration changes, and I/O operations internal to the ILLIAC IV system Common operand broadcasting and to the external world. To provide backup memory for the disk 109 Constants or other operands used in common by all the processors ILLIAC IV arrays, a large parallel-access system (10 bits, bit access 40-ms maximum is are fetched and stored locally by the central control and broadcast per second rate, latency) directly to the There is also for real-time data to the processors in conjunction with the instruction using them. coupled arrays. provision for connections to the ILLIAC IV This has several advantages: (1) it reduces the memory used directly arrays. Section 2 Processors for data 322 Part 4 The instruction-set processor level: special-function processors array Chapter 27 The ILLIAC IV computer 323 324 Part 4 The instruction-set processor level: special-function processors Section 2 Processors for array data words (16 instructions), fetch of the next block is initiated; the Column different is If the Array possibility of pending jumps to blocks ignored. next block is found to be already resident in the buffer, no further (12) action is taken; else fetch of the next block from the array memory is initiated. On arrival of the requested block, the instruction buffer is cyclically filled; the oldest block is assumed to be the least required block in the buffer and is overwritten. Jump instructions initiate the same procedures. Fetch of a new instruction block from memory requires a delay of approximately three memory cycles to cover the signal trans- mission times between the array memory and the control unit. On execution of a straight line program, this delay is overlapped with the execution of the 8 instructions remaining in the current block. In a multiple-array configuration, instructions are fetched from the array memory specified by the program counter, and broadcast simultaneously to all the participating control units.

The ILLIAC IV Computer 1

Performing the Shuffle with the PM2I and Illiac SIMD Interconnection Networks

Online Sec 6.15.Indd

SIMD1 Ñ Illiac IV

Vector Machines  Vector Machines Today Introduction  a Vector Processor Is a CPU That Can Run One Instructiononanentire Vector of Data

The CRAY- 1 Computer System

ILLIAC IV Is the Most Powerful by As Much As a Factor of Four

Puters. Large-Scale Computer Systems Have the Potential to Achieve Two to Three Orders of Magnitude Speed Improvement Over the Next Decade

The CRAY-1 Computer System^

Illiac IV History First Massively Parallel Computer Three Earlier Designs

A Survey of Concurrent Architectures Technical Report: CSL-TR-86-307

COMPUTERS on NASTRAN James L. Rogers, Jr. NASA Langley

The Illiac IV System