<<

PARALLEL OPERATION IN THE CONTROL DATA 6600 James E. Thornton ,

HISTORY survived early difficulties to become the basis for a nice jump in circuit performance. About four years ago, in the summer of 1960, Control Data began a project which cul­ SYSTEM ORGANIZATION minated last month in the delivery of the first 6600 . In 1960 it was apparent that The computing system envisioned in that brute force circuit performance and parallel project, and now called the 6600, paid special operation were the two main approaches to attention to two kinds of use, the very large any advanced computer. scientific problem and the time sharing of This paper presents some of the consid­ smaller problems. For the large problem, a erations having to do with the parallel opera­ high-speed floating point central processor with tions in the 6600. A most important and access to a large central memory was obvious. fortunate event coincided with the beginning Not so obvious, but important to the 6600 of the 6600 project. This was the appearance system idea, was the isolation of this central of the high-speed silicon , which arithmetic from any peripheral activity.

4096 WORD 4096 WORD 4096 WORD 4096 WORD CORE MEMORY CORE MEMORY ~ ~ CORE MEMORY CORE MEMORY PERIPHERAL PERIPHERAL PERIPHERAL PERIPHERAL & CONTROL & CONTROL & CONTROL & CONTROL PROCESSOR PROCESSOR PROCESSOR PROCESSOR

6600 CENTRAL MEMORY 4096 WORD 4096 WORD CORE MEMORY CORE MEMORY PERIPHERAL 6600 CENTRAL PROCESSOR PERIPHERAL & CONTROL & CONTROL PROCESSOR PROCESSOR 6600 CENTRAL MEMORY

4096 WORD 4096 WORD 4096 WORD 4096 WORD ~ CORE MEMORY - CORE MEMORY +- CORE MEMORY CORE MEMORY PHERIPHERAl PERIPHERAL PERIPHERAL PERIPHERAL & CONTROL & CONTROl. & CONTROL & CONTROL PROCESSOR PROCESSOR PROCESSOR PROCESSOR

Figure 1. Control Data 6600. 33

From the collection of the Computer History Museum (www.computerhistory.org) 34 PROCEEDINGS-SPRING JOINT COMPUTER CONFERENCE, 1964

It was from this general line of reasoning instruction formats to provide for direct, in­ that the idea Of a multiplicity of peripheral direct, and relative addressing. Instructions processors was formed (FIg. 1). Ten such provide logical, addition, subtraction, shift, and peripheral processors have access to the central conditional branching. Instructions also pro­ memory on one side and the peripheral channels vide single word or block transfers to and from on the other. The executive control of the any of twelve peripheral channels, and single system is always in one of these peripheral word or block transfers to and from central processors, with the others operating on as­ memory. Central memory words of 60 signed peripheral or control tasks. All ten length are assembled from five consecutive processors have access to twelve input-output peripheral words. Each processor has instruc­ channels and may "change hands," monitor tions to interrupt the central processor and to channel activity, and perform other related monitor the central program address. jobs. These processors have access to central To get this much processing power with memory, and may pursue independent transfers reasonable economy and space, a time-sharing to and from this memory. design was adopted (Fig. 2). This design Each of the ten peripheral processors contains a register "barrel" around which is contains its own memory for program and moving the dynamic information for all ten buffer areas, thereby isolating and protecting processors. Such things as program address, the more critical system control operations in accumulator contents, and other pieces of in­ the separate processors. The central processor formation totalling 52 bits are shifted around operates from the central memory with relocat­ the barrel. Each complete trip around requires ing register and file protection for each program one major cycle or one thousand nanoseconds. in central memory. A "slot" in the barrel contains adders, assembly networks, distribution network, and intercon­ PERIPHERAL AND CONTROL PROCESSORS nections to perform one step of any peripheral The peripheral and control processors instruction. The time, to perform this step or, are housed in one chassis of the main frame. in other words, the time through the slot, is Each processor contains 4096 memory words / one minor cycle o:!'" one hundred nanoseconds. of 12 bits length. There are 12- and 24- Each of the ten processors, therefore, is allowed

PROCESSOR TIME-SHARED PROCESSOR REGISTERS INSTRUCTION MEMORIES CONTROL

READ PYRAM 10

CENTRAL CENTRAL MEMORY MEMORY (60) (60)

(12) REAL TIME

1/0 CHANNELS EXTERNAL EQUIPMENT

Figure 2. 6600 Peripheral and Control Processors.

From the collection of the Computer History Museum (www.computerhistory.org) PARALLEL OPERATION IN THE CONTROL DATA 6600 35 one minor cycle of every ten to perform one of five references to be "nested" in each network its steps. A peripheral instruction may require during any major cycle. The central memory one or more of these steps, depending on the is organized in independent banks with the abil­ kind of instruction. ity to transfer central words every minor cycle. In effect, the single arithmetic and the The peripheral processors, therefore, introduce single' distribution and assembly network are at most about 2 % interference at the central made to appear as ten. Only the memories are memory address control. kept truly independent. Incidentally, the A. single real time clock, continuously memory read-write cycle time is equal to one running, is available to all peripheral proces­ complete trip around the barrel, or one thousand sors. nanoseconds. Input-output channels are bi-directional, CENTRAL PROCESSOR 12-bit paths. One 12-bit word may move in one direction every major cycle, or 1000 nano­ The 6600 central processor may be con­ seconds, on each channel. Therefore, a maxi­ sidered the high-speed arithmetic unit of the mum burst rate of 120 million bits per second system (Fig. 3). Its program, operands, and is possible using all ten peripheral processors. results are held in the central memory. It has A sustained rate of about 50 million bits per no connection to the peripheral processors ex­ second can be maintained in a practical operat­ cept through memory and except for two single ing system. Each channel may service several controls. These are the exchange jump, which peripheral devices and may interface to other starts or interrupts the central processor from systems, such as satellite . a peripheral processor, and the central program Peripheral and control processors access address which can be monitored by a peripheral central memory through an assembly network processor. and a dis-assembly network. Since five periph­ A key description of the 6600 central eral memory references are required to make processor, as you will see in later discussion, is up one central memory word, a natural assem­ "parallel by function." This means that a num­ bly network of five levels is used. This allows ber of arithmetic functions may be performed PERIPHERAL AND CENTRAL PROCESSOR CONTROL PROCESSORS

24 OPERATING REGISTERS

12 INPUT OUTPUT CHANNELS

Figure 3. Block Diagram of 6600.

From the collection of the Computer History Museum (www.computerhistory.org) 36 PROCEEDINGS-SPRING JOINT COMPUTER CONFERENCE, 1964

concurrently. To this end, there are ten func­ As an example,· a 15-bit instruction may call tional units within the central processor. These for an ADD, designated by the f and m octal are the two increment units, floating add unit, digits, from registers designated by the j and k fixed add unit, shift unit, two multiply units, octal digits, the result going to the register divide unit, boolean unit, and branch unit. In designated by the i octal digit. In this example, a general way, each of these units is a three the addresses of the three-address, floating add address unit. As an example, the floating add unit are only three bits in length, each address unit obtains two 60-bit operands from the cen­ referring to one of the eight floating point regis­ tral registers and produces a 60-bit result which ters. The 30-bit format follows this same form is returned to a register. Information to and but substitutes for the k octal digit an 18-bit from these units is held in the central registers, constant K which serves as one of the input of which there are twenty-four. Eight of these operands. These two formats provide a highly are considered index registers, are of 18 bits efficient control of concurrent operations. length, and one of which always contains zero. As a background, consider the essential Eight are considered address registers, are of difference between a general purpose device and 18 bits length, and serve to address the five read a special device in which high speeds are re­ central memory trunks and the two store cen­ quired. The designer of the special device can tral memory trunks. Eight are considered float­ generally improve on the traditional general ing point registers, are of 60 bits length, and purpose device by introducing some form of are the only central registers to access central concurrency. For example, some activities of memory during a central prograrn. a housekeeping nature may be performed sepa­ In a sense, just as the whole central proc­ rate from the main sequence of operations in essor is hidden behind central memory from separate hardware. The to,tal time to complete the peripheral processors, so, too, the ten func­ a job is then optimized to the main sequence tional units are hidden behind the central regis­ and excludes the housekeeping. The two cate­ ters from central memory. As a consequence, gories operate concurrently. a considerable instruction efficiency is obtained It would be, of course, most attractive to and an interesting form of concurrency is feasi­ provide in a general purpose device some gen­ ble and practical. The fact that a small number eralized scheme to do the same kind of thing. of bits can give meaningful definition to any The organization of the 6600 central processor function makes it possible to develop forms of provides just this kind of scheme. With a multi­ operand and unit reservations needed for a plicity of functional units, and of operand reg­ general scheme of concurrent arithmetic. isters and with a simple and highly efficient Instructions are organized in two for­ addressing system, a generalized queue and res­ mats, a 15-bit format and a 30-bit format, and ervation scheme is practical. This is called the may be mixed in an instruction word (Fig. 4). scoreboard.

m k The scoreboard maintains a running file of each central register, of each functional unit, 3 15 BITS and of each of the three operand trunks to and from each unit. Typically, the scoreboard file

14 l'---__--I o is made up of two-, three-, and four-bit quan­ , tities identifying the nature of register and OPERATION CODE unit usage. As each new instruction is brought - 60 BITS up, the conditions at the instant of issuance are o RESULT REG. set into the scoreboard. A snapshot i~ taken, (1 of 8) so to speak, of the pertinent conditions. If no 1st OPERAND waiting is required, the execution of the instruc­ REG. (1 of 8) tion is begun immediately under control of the unit itself. If waiting is required (for example,

2nd OPERAND an input operand may not yet be available in REG. (1 of 8) the central registers), the scoreboard controls Figure 4. 15-Bit Instruction Format the delay, and when released, allows the unit to

From the collection of the Computer History Museum (www.computerhistory.org) PARALLEL OPERATION IN THE CONTROL DATA 6600 37 begin its execution. Most important, this activ­ conflicts are no longer involved, and a consider­ ity -is accomplished in the scoreboard and the able speed increase can be had. functional unit, and does not necessarily limit Five memory trunks are provided from later instructions from being brought up and memory into the central processor to five of the issued. floating point registers (Fig. 6). One address In this manner, it is possible to issue a register is assigned to each trunk (and there­ series of instructions, some related, some not, fore to the floating point register). Any in­ until no functional units are left free or until struction calling for address reg-ister result a specific register is to be assigned more than implicitly initiates a memory reference on that one result. With just those two restrictions trunk. These instructions are handled through on issuing (unit free and no double result), the scoreboard and therefore tend to overlap several independent chains of instructions may memory access with arithmetic. For example, proceed concurrently. Instructions may issue a new memory word to be loaded in a floating every minor cycle in the absence of the two point register can be brought in from memory restraints. The instruction executions, in com­ but may not enter the register until all previous parison, range from three minor cycles for fixed uses of that register are completed. The central add, 10 minor cycles for floating multiply, to registers, therefore, provide all of the data to 29 minor cycles for floating divide. the ten functional units, and receive all of the To provide a relatively continuous source unit results. No storage is maintained in any of instructions, one buffer register of 60 bits is unit. located at the bottom of an instruction stack Central memory is organized in 32 banks capable of holding 32 instructions (Fig. 5). of 4096 words. Consecutive addresses call for a Instruction words from memory enter the bot­ different bank; therefore, adjacent addresses in tom register of the stack pushing up the old one bank are in reality separated by 32. Ad­ instruction words. In straight line programs, dresses may be issued every 100 nanoseconds. only the bottom two registers are in use, the A typical central memory information transfer bottom being refilled as quickly as memory con­ rate is about 250 million bits per second. flicts allow. In programs which branch back As mentioned before, the functional units to an instruction in the upper stack registers, are hidden behind the registers. Although the no refills are allowed after the branch, thereby units might appear to increase hardware dupli­ holding the program loop completely in the cation, a pleasant fact emerges from this design. stack. As a result, memory access or memory Each unit may be trimmed to perform its func-

t INSTRUCTION - REGISTERS INSTRUCTION t STACK 8 60-BIT t WORDS t t t t t

FROM CENTRAL MEMORY

Figure 5. 6600 Instruction Stack Operation.

From the collection of the Computer History Museum (www.computerhistory.org) 38 PROCEEDINGS-SPRING JOINT COMPUTER CONFERENCE, 1964

OPERANDS (60-Bln XO .. Xl X2 OPERANDS .. X3 X4 ...... X5 f+- RESULTS r- X6 ADDRESSES {l8-Bln L X7 AO

~ Al OPERAND A2 CENTRAL A3 10 FUNCTIONAL MEMORY ---- ADDRESSES A4 .----.... UNITS ...... A5 RESULT r A6 I INSTRUCTION ADDRESSES L- A7 INCREMENT (18-'BI1) REGISTERS BO ~ B1 B2 ..- 1NSTRUCTION B3 STACK B4 (UP TO 8 WORDS B5 60-BIT) B6 ~ B7 INSTRUCTIONS + FIgure 6. Central Processor Operating Registers. tion without regard to others. Speed increases able concurrency. The program need not be are had from this simplified design. written in a particular way, although certainly As an example of special functional unit some optimization can be done. The specific design, the floating multiply accomplishes the method of accomplishing this concurrency in­ coefficient multiplication in nine minor cycles volves issuing as many instructions as possible' plus one minor cycle to put away the result for while handling most of the conflicts during a total of 10 minor cycles, or 1000 nanoseconds. execution. Some of the essential requirements The multiply uses layers of carry save adders for such a scheme include: grouped in two halves. Each half concurrently 1. Many functional units forms a partial product, and the two partial 2. Units with three address properties products finally merge while the long carries 3. Many transient registers with many propagate. Although this is a fairly large com­ trunks to and from the units plex of circuits, the resulting device was suffi­ 4. A simple and efficient instruction set ciently smaller than originally planned to allow two multiply units to be included in the final CONSTRUCTION design. To sum up the characteristics of the Circuits in the 6600 computing system central processor, remember that the broad­ use all-transistor logic (Fig. 7). The silicon brush description is "concurrent operation." transistor operates in saturation when switched In other words, any program operating within "on" and averages about five nanoseconds of the central processor utilizes some of the avail- stage delay. Logic' circuits are constructed in

From the collection of the Computer History Museum (www.computerhistory.org) PARALLEL OPERATION IN THE CONTROL DATA 6600 39

Figure 7. 6600 Printed Circuit Module. a cordwood plug-in module of about 21/2 inches bit drive circuits plus address translation are by 2112 inches by 0.8 inch. An average of about contained in the module. One such module is 50 are contained in these modules. used for each peripheral processor, and five Memory circuits are constructed in a modules make up one bank of central memory. plug-in module of about six inches by six inches Logic modu12s and memory modules are by 2112 inches (Fig. 8). Each memory module held in upright hinged chassis in an X shaped contains a coincident current memory of 4096 cabinet (Fig. 9). Interconnections between 12-bit words. All read-write drive circuits and modules on the chassis are made with twisted pair transmission lines. Interconnections be­ tween chassis are made with coaxial cables. Both maintenance and operation are ac­ complished at a programmed display console (Fig. 10). More than one of these consoles may be included in a system if desired. Dead start facilities bring the ten peripheral processors to a condition which allows information to enter from any chosen peripheral device. Such loads normally bring in an which provides a highly sophisticated capability for multiple users, maintenance, and so on. The 6600 Computer has taken advantage of certain technology advances, but more par­ ticularly, logic organization advances which now appear to be quite successful. Control Data is exploring advances in technology upward within the same compatible structure, and iden­ tical technology downward, also within the Figure 8. 6600 Memory Module. same compatible structure.

From the collection of the Computer History Museum (www.computerhistory.org) 40 PROCEEDINGS-SPRING JOINT COMPUTER CONFERENCE, 1964

Figure 9. 6600 Main Frame Section.

Figure 10. 6600 Display Console.

From the collection of the Computer History Museum (www.computerhistory.org)