Architecture and Applications of the Connection Machine

Lewis W. Tucker and George G. Robertson, Thinking Machines Corp.

As the use of computers affects increasingly broader segments of society, the demands placed on computer systems increase steadily. Fortunately, the technology base for the last twenty years has continued to improve at a steady rate, increasing in capacity and speed while decreasing in cost for performance. However, the demands outpace the technology. This raises the question: can we make a quantum leap in performance while the rate of technology improvement remains relatively constant?

We describe here the architecture, evolution, and applications of the Connection Machine system.

Computer architects have followed two general approaches in response to this question. The first uses exotic technology in a fairly conventional architecture. This approach suffers from manufacturing and maintenance problems and high costs. The second approach exploits the parallelism inherent in many problems. The parallel approach seems to offer the best long-term strategy because, as the problems grow, more and more opportunities arise to exploit the parallelism inherent in the data itself.

Where do we find the inherent parallelism and how do we exploit it? Most computer programs consist of a control sequence (the instructions) and a collection of data elements. Large programs have tens of thousands of instructions operating on tens of thousands or even millions of data elements. We can find opportunities for parallelism in both the control sequence and in the collection of data elements.

In the control sequence, we can identify threads of control that could operate independently, thus on different processors. This approach, known as control parallelism, is used for programming most multiprocessor computers. The primary problems with this approach are the difficulty of identifying and synchronizing these independent threads of control.

Alternatively, we can perform operations on the data elements in parallel. This approach, known as data parallelism, works best for large amounts of data. For many applications, it proves the most natural programming approach, leading to significant decreases in execution time as well as simplified programming.

Architectures containing tens of thousands or even millions of processing elements support this "data-parallel" programming model. Early examples of this kind of architecture are ICL's Distributed Array Processor (DAP), Goodyear's Massively Parallel Processor (MPP) [3], Columbia University's Non-Von, and others. Each of these machines has some elements of the desired architecture, but lacks others. For example, the MPP has 16K (K = 1,024) processing elements arranged in a two-dimensional grid, but interprocessor communication is supported only between neighboring processors.

The Connection Machine provides 64K physical processing elements, millions of virtual processing elements with its virtual-processor mechanism, and general-purpose, reconfigurable communications networks.
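The distinction between the two styles can be made concrete with a small sketch of our own (not from the article): in the data-parallel style, a single operation is applied, conceptually all at once, to every element of a large collection.

```python
# Data-parallel style: one operation, applied uniformly to every element.
# A serial loop merely emulates what a data-parallel machine performs in
# a single parallel step.

def scale_all(elements, k):
    """Apply the same operation to every data element 'simultaneously'."""
    return [k * x for x in elements]

data = list(range(8))          # stand-in for millions of data elements
result = scale_all(data, 2)
print(result)                  # [0, 2, 4, 6, 8, 10, 12, 14]
```

The control-parallel style would instead split the instruction stream itself across processors, which is where the synchronization difficulties mentioned above arise.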

0018-9162/88/0800-0026$01.00 © 1988 IEEE

Figure 1. Connection Machine system organization. (Diagram: front ends 0-3, each a DEC VAX or Symbolics machine with a bus interface, connect through the nexus to the Connection Machine parallel processor unit, four sections of 16,384 processors with sequencers, and to the Connection Machine I/O system linking DataVaults, a graphic display, and a network.)

The Connection Machine encompasses a fully integrated architecture designed for data-parallel computing.

Architecture of the Connection Machine

The Connection Machine is a data-parallel computing system with integrated hardware and software. Figure 1 shows the hardware elements of the system. One to four front-end computer systems (right side of Figure 1) provide the development and execution environments for the system software. They connect through the nexus (a 4 x 4 cross-point switch) to from one to four sequencers. Each sequencer controls up to 16,384 individual processors executing parallel operations. A high-performance, data-parallel I/O system (bottom of Figure 1) connects processors to peripheral mass storage (the DataVault) and graphic display devices.

At the heart of the Connection Machine system lies the parallel-processing unit, consisting of thousands of processors (up to 64K), each with thousands of bits of memory (four kilobits on the CM-1 and 64 kilobits on the CM-2). As well as processing the data stored in memory, these processors, when logically interconnected, can exchange information. All operations happen in parallel on all processors. Thus, the Connection Machine hardware directly supports the data-parallel programming model.

The system software is based upon the operating system or environment of the front-end computer, with minimal visible software extensions. Users can program using familiar languages and programming constructs, with all development tools provided by the front end. Programs have normal sequential control flow and do not need new synchronization structures. Thus, users can easily develop programs that exploit the power of the Connection Machine hardware.
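The execution model described here, in which one sequencer broadcasts each instruction and every processor applies it to its own local memory, can be sketched in a few lines. This is our serial emulation, not Connection Machine code:

```python
# Minimal sketch of the single-instruction, multiple-data model: a
# sequencer broadcasts one instruction; every processor executes it on
# its own local memory.  The loop emulates what happens simultaneously.

class Processor:
    def __init__(self, value):
        self.memory = value

def broadcast(processors, instruction):
    """One instruction stream drives every processor at once."""
    for p in processors:                        # conceptually simultaneous
        p.memory = instruction(p.memory)

procs = [Processor(i) for i in range(16)]       # stand-in for 16,384
broadcast(procs, lambda m: m + 1)               # one instruction, all procs
assert [p.memory for p in procs] == list(range(1, 17))
```

The front end retains ordinary sequential control flow; only the operand of each step is "all processors at once."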

CM-1: First implementation of the CM concept. Hillis originally conceived the Connection Machine architecture while at MIT and described it in his thesis [6]. The design of the Connection Machine began in 1980 at the MIT AI Laboratory, where the basic architectural design and prototype custom integrated circuits were developed. It became clear that private enterprise would have to get involved to actually build the machine, so Thinking Machines was founded in 1983.

The Connection Machine Model CM-1 was designed at Thinking Machines during 1983 and the first half of 1984. By the end of 1984, with funding from the Defense Advanced Research Projects Agency, Thinking Machines had built the first 16K-processor CM-1 prototype. A demonstration of its capabilities took place in May 1985, and by November the company had constructed and successfully demonstrated a full 64K-processor machine. Thinking Machines commercially introduced the machine in April 1986. The first machines went to MIT and Perkin-Elmer in the summer of 1986.

As illustrated in Figure 1, the CM-1 contains the following system components: up to 64K data processors, an interprocessor communications network, one to four sequencers, and one to four front-end computer interfaces. Although part of the original Connection Machine design, the I/O system was not implemented until the introduction of the CM-2.

CM-1 data processors and memory. The CM-1 parallel-processing unit contains from 16K to 64K data processors. As shown in Figure 2, each data processor contains an arithmetic-logic unit and associated latches, four kilobits of bit-addressable memory, eight one-bit flag registers, a router interface, and a two-dimensional-grid interface.

Figure 2. CM-1 data processors.

The data processors are implemented using two chip types. A proprietary custom chip contains the ALU, flag bits, and communications interface for 16 data processors. Memory consists of commercial static RAM chips, with parity protection. A fully configured parallel-processing unit contains 64K data processors, consisting of 4,096 processor chips and 32 megabytes of RAM.

A CM-1 ALU consists of a three-input, two-output logic element and associated latches and memory interface (see Figure 2). The basic conceptual ALU cycle first reads two data bits from memory and one data bit from a flag. The logic element then computes two result bits from the three input bits. Finally, one of the two results is stored in memory and the other result, in a flag. The entire operation is conditional on the value of a context flag; if the flag is zero, then the results for that data processor are not stored.

The logic element can compute any two Boolean functions on three inputs. This simple ALU suffices to carry out all the operations of a virtual-machine instruction set. Arithmetic is carried out in a bit-serial fashion, requiring 0.75 microsecond per bit plus instruction decoding and overhead. Hence, a 32-bit Add takes about 24 microseconds. With 64K processors computing in parallel, this yields an aggregate rate of 2,000 million instructions per second (that is, two billion 32-bit Adds per second).

The CM processing element is a reduced-instruction-set-computer processor. Each ALU cycle breaks down into subcycles. On each cycle, data processors execute one low-level instruction (called a nanoinstruction) issued by the sequencer, while the memories can perform one read or write operation. The basic ALU cycle for a two-operand integer Add consists of three nanoinstructions: LoadA to read memory operand A, LoadB to read memory operand B, and Store to store the result of the ALU operation. Other nanoinstructions direct the router and NEWS (north-east-west-south) grid, and perform diagnostic functions.

CM-1 communications. Designers typically use data-structuring techniques to express important relationships between data elements. For example, an image-understanding system usually employs a two-dimensional grid to represent the individual pixels of the image. At a later stage in the processing, however, a tree data structure or relational graph might represent more abstract relationships, such as those between objects and their parts (see Figure 3).

Figure 3. Complex problems change topology.

On a serial machine with sufficient random-access memory, pointers to memory elements are used to implement complex data structures. In a data-parallel architecture, however, individual data elements are assigned to individual processors, and interprocessor communication expresses the relationships between the elements of very large data structures.

The CM-1 was designed with flexible interprocessor communication in mind and supports several distinct communication mechanisms.

Broadcast communications allow immediate data to be broadcast from the front-end computer or the sequencer to all data processors at once.

Global OR is a logical OR of the ALU carry output from all data processors, which makes it possible to quickly discover unusual or termination conditions.

Hypercube communication forms the basis for the router and numerous parallel primitives supported by the virtual-machine model. The topology of the network consists of a Boolean n-cube. For a fully configured CM-1, the network is a 12-cube connecting 4,096 processor chips (that is, each 16-processor chip lies at a vertex of a 12-cube). An example of a parallel primitive implemented with the Hypercube is Sort, which runs in logarithmic time; sorting 64K 32-bit keys takes about 30 milliseconds.

The router directly implements general pointer following with switched message packets containing processor addresses (the pointers) and data. The router controller, implemented in the CM processor chips, uses the Hypercube for data transmission. It provides heavily overlapped, pipelined message switching with routing decisions, buffering, and combining of messages directed to the same address, all implemented in hardware.

The NEWS grid is a two-dimensional Cartesian grid that provides a direct way to perform nearest-neighbor communication. Since all processors communicate in the same direction (north, east, west, or south), addresses are implicit and no collisions occur, making NEWS communication much faster (by about a factor of six) than router communication for simple regular message patterns.
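The Boolean n-cube addressing that underlies the router is easy to sketch: two chips are neighbors exactly when their addresses differ in one bit, so the neighbor along dimension d is found by flipping bit d, and the minimum routing distance is the Hamming distance between addresses. A toy version of ours, not Connection Machine code:

```python
# Hypercube topology sketch: 4,096 chips form a 12-cube; the neighbor
# along dimension d is obtained by flipping address bit d.

DIMENSIONS = 12
CHIPS = 1 << DIMENSIONS              # 4,096 processor chips

def neighbor(addr, d):
    """Address of the chip one hop away along dimension d."""
    return addr ^ (1 << d)

def hops(a, b):
    """Minimum routing distance = Hamming distance of the addresses."""
    return bin(a ^ b).count("1")

assert CHIPS == 4096
assert neighbor(0, 3) == 8
assert all(hops(a, neighbor(a, d)) == 1
           for a in (0, 1000, 4095) for d in range(DIMENSIONS))
assert hops(0, 4095) == 12           # the farthest chips are 12 hops apart
```

The logarithmic diameter (12 hops between any two of 4,096 chips) is what makes log-time primitives such as Sort practical on this network.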
CM-1 sequencer, nexus, and front-end interface. The CM-1 sequencer, a specially designed microcomputer used to implement the CM virtual machine, is implemented as an Advanced Micro Devices 2901/2910 bit-sliced machine with 16K words of microcode storage. A Connection Machine contains from one to four sequencers. The sequencer's input is a stream of high-level, virtual-machine instructions and arguments, transmitted on a synchronous 32-bit parallel data path from the nexus. The sequencer outputs a stream of nanoinstructions that controls the timing and operation of the CM data processors and memory.

The CM-1 nexus, a 4 x 4 cross-point switch, connects from one to four front-end computers to from one to four sequencers. The connections to front-end computers are via high-speed, 32-bit, parallel, asynchronous data paths, while the connections to sequencers are synchronous. The nexus provides a partitioning mechanism so that the CM can be configured as up to four partitions under front-end control. This allows isolation of parts of the machine for different users or purposes (such as diagnosis and repair of a failure in one partition while other partitions continue to run). When more than one sequencer is connected to the same front end through the nexus, they are synchronized by a common clock generated by the nexus.

The front-end bus interface (FEBI) supports a 32-bit, parallel, asynchronous data path between the front-end computer and the nexus. The FEBI is the only part of the CM-1 that lies outside of the main cabinet. It resides as a board in the system bus of the front end (any DEC VAX containing a VAXBI I/O bus and running Ultrix, or a Symbolics 3600-series Lisp machine).

CM virtual-machine model. The CM virtual-machine parallel instruction set, called Paris, presents the user with an abstract machine architecture very much like the physical Connection Machine hardware architecture, but with two important extensions: a much richer instruction set and a virtual-processor abstraction.

Paris. Paris provides a rich set of parallel primitives ranging from simple arithmetic and logical operations to high-level APL-like reduction (parallel prefix) operations, sorting, and communications operations. The interface between the front end and the rest of the Connection Machine reduces to a simple stream of operation codes and arguments. The arguments usually describe fields to operate on, in the form of a start address and bit length. Arguments can also be immediate data, broadcast to all data processors.

Most of Paris is implemented in firmware and runs on the sequencers, where the opcode/argument stream is parsed and expanded to the appropriate sequence of nanoinstructions for the data processors. Since Paris defines the virtual-machine instruction set, we use the same name for the assembly language of the Connection Machine.

Virtual processors. Data-parallel applications often call for many more individual processors than are physically available on a given machine. Connection Machine software provides for this through its virtual-processor mechanism, supported at the Paris level and transparent to the user. When we initialize the Connection Machine system, the number of virtual processors required by the application is specified. If this number exceeds the number of available physical processors, the local memory of each processor splits into as many regions as necessary, with the processors automatically time-sliced among the regions.

For example, if an application needed to process a million pieces of data, it would request V = 2^20 virtual processors. Assume the available hardware to have P = 2^16 physical processors, each with M = 2^16 bits of memory (the size for CM-2 memory; M = 2^12 bits of memory for the CM-1). Then each physical processor would support V/P = 16 virtual processors. This ratio V/P, usually denoted N, is called the virtual-processor ratio, or VP ratio. In this example, each virtual processor would have M/N = 2^12 bits of memory and would appear to execute code at about 1/N = 1/16 the speed of a physical processor. In fact, virtual processors often exceed this execution rate, since instruction decoding by the sequencer can be amortized over the number of virtual processors.

CM software environment. The Connection Machine system software uses existing programming languages and environments as much as possible. Languages are based on well-known standards. Minimal extensions support data-parallel constructs so that users need not learn a new programming style. The Connection Machine front-end operating system (either Unix or Lisp) remains largely unchanged.

Fortran on the Connection Machine system uses the array extensions in the draft Fortran 8x standard (proposed by American National Standards Institute Technical Committee X3J3) to express data-parallel operations. The remainder of the language is standard Fortran 77. No extension is specific to the Connection Machine; the Fortran 8x array extensions map naturally onto the underlying data-parallel hardware.

The *Lisp and CM-Lisp languages are data-parallel dialects of Common Lisp (a version of Lisp currently being standardized by ANSI Technical Committee X3J13). *Lisp gives programmers fine control over the CM hardware while maintaining the flexibility of Lisp. CM-Lisp is a higher-level language that adds small syntactic changes to the language interface and creates a powerful data-parallel programming language.

The C* language is a data-parallel extension of the C programming language (as described in the draft C standard proposed by ANSI Technical Committee X3J11). C* programs can be read and written like serial C programs. The extensions are unobtrusive and easy to learn.

The assembly language of the Connection Machine, Paris, is the target language of the high-level-language compilers. Paris logically extends the instruction set of the front end and masks the physical implementation of the CM processing unit.

Figure 4. Connection Machine Model CM-2 and DataVault.

Evolution of the CM-2. Experience gained during the first year of in-house use of the CM-1 led to the initiation of a project to build an improved version of the machine: the Connection Machine Model CM-2.

The design team established several goals for the CM-2: increasing memory capacity, performance, and overall reliability while maintaining or improving ease of manufacturing. To continue to support the CM-1 customer base, the designers wanted the CM-2 to be program-compatible with the previous machine. Finally, the designers wanted the CM-2 to incorporate a high-speed I/O system for peripheral data storage and display devices.

To satisfy these goals, the CM-2 kept the basic architecture of the CM-1 (see Figure 1). To increase performance, the CM-2 incorporates a redesigned sequencer and CM processor chip and an optional floating-point accelerator. A four-fold increase in microcode storage in the sequencer allowed for improvements in performance and functionality in the virtual-machine implementation.
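The virtual-processor arithmetic described earlier (V = 2^20 requested, P = 2^16 physical, M = 2^16 bits each) can be checked with a few lines of our own:

```python
# Checking the virtual-processor worked example from the text.

V = 2 ** 20        # virtual processors requested by the application
P = 2 ** 16        # physical processors (fully configured machine)
M = 2 ** 16        # bits of memory per physical processor (CM-2 size)

N = V // P         # the virtual-processor ratio, or VP ratio
assert N == 16
assert M // N == 2 ** 12        # bits of memory per virtual processor
# Each virtual processor appears to run at about 1/N the physical speed,
# often better, since sequencer decoding is amortized over the N slices.
assert abs(1 / N - 0.0625) < 1e-12
```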

A sixteen-fold increase in memory capacity brought total memory capacity up to 512 megabytes. Enhanced reliability resulted from adding error correction to the memory system, and diagnostic capability was improved by increasing the number of data paths with error detection (parity). A redesigned NEWS grid increased functionality and ease of manufacture. The CM-2 implemented an I/O system to support a massively parallel disk system (called the DataVault) and a high-speed color graphics system.

The CM processor chip underwent redesign in late 1985 and early 1986. The first prototype of the CM-2 was working by the end of 1986. The company commercially introduced the CM-2 (see Figure 4) in April 1987, and delivered about a dozen machines to customers in the fall of 1987. The first DataVault was delivered at the end of 1987.

CM-2 data processors and memory. The CM-2 data processor strongly resembles the CM-1 data processor. The major differences are 64 kilobits of bit-addressable memory instead of four kilobits, four one-bit flag registers instead of eight, an optional floating-point accelerator, a generalized NEWS-grid interface to support n-dimensional grids, an I/O interface, and increased error-detection circuitry.

The CM-2 data processors are implemented using four chip types. A proprietary custom chip contains the ALU, flag bits, router interface, NEWS-grid interface, and I/O interface for 16 data processors, and part of the Hypercube network controller. The memory consists of commercial dynamic RAM chips, with single-bit error correction and double-bit error detection. The floating-point accelerator consists of a custom floating-point interface chip and a floating-point execution chip; one of each is required for every 32 data processors. A fully configured 64K-processor system contains 4,096 processor chips, 2,048 floating-point interface chips, 2,048 floating-point execution chips, and half a gigabyte of RAM.

CM-2 floating-point accelerator. In addition to the bit-serial data processors described above, the CM-2 parallel-processing unit has an optional floating-point accelerator closely integrated with the processing unit. This accelerator has two options: single precision or double precision. Both options support IEEE standard floating-point formats and operations, and increase the rate of floating-point calculations by a factor of more than 20. Taking advantage of this speed increase requires no change in user software.

The hardware associated with each of these options consists of two special-purpose very-large-scale-integration chips: a memory-interface unit and a floating-point execution unit for each pair of CM-2 processor chips. Because the floating-point units access memory in a manner orthogonal to the CM-2 processors, the memory-interface unit transposes 32-bit words before passing them to the floating-point unit.

Firmware that drives the floating-point accelerator stages the data through the memory-interface unit to the floating-point execution unit. In general, it takes 5N stages to implement a floating-point operation, for a virtual-processor ratio of N. However, the firmware is pipelined so as to require only 3N + 2 stages instead of 5N stages.

CM-2 communications. The CM-2 communications are basically the same as those of the CM-1, with two exceptions. First, we redesigned the router to improve reliability, diagnostic capability, and performance. For example, we enhanced its performance by providing hardware for en route combining of messages directed to the same destination. The combining operations supported include Sum, Logical OR, Overwrite, Max, and Min.

The second major difference lies in the nature of grid communications, which we completely redesigned for the CM-2. We wanted to increase flexibility and functionality while simplifying the overall system architecture and increasing manufacturing ease and reliability. We accomplished this by replacing the two-dimensional NEWS grid with a more general n-dimensional grid implemented on top of the Hypercube (by grey-encoding addresses). We enhanced flexibility and functionality by supporting high-level-language concepts with n-dimensional-grid nearest-neighbor communication. Thus, programmers can employ one-dimensional to sixteen-dimensional nearest-neighbor grid communication according to the requirements of the task. Manufacturing ease and reliability are enhanced because we removed the separate set of cables for the NEWS grid.

CM-2 I/O structure. The Connection Machine I/O structure moves data into or out of the parallel-processing unit at aggregate peak rates as high as 320 megabytes per second using eight I/O controllers. All transfers are parity-checked on a byte-by-byte basis.

A Connection Machine I/O bus runs from each I/O controller to the devices it controls. This bus is 80 bits wide (64 data bits, eight parity bits, and eight control bits). The I/O controller multiplexes and demultiplexes between 256-bit processor chunks and 64-bit I/O-bus chunks. The controller also acts as arbitrator, allocating bus access to the various devices on the bus.

DataVault. Since standard peripheral devices do not operate at the speeds that the CM system itself can sustain, we had to design a mass-storage system capable of operating at very high speed. The DataVault combines high reliability with fast transfer rates for large blocks of data. It holds five gigabytes of data, expandable to ten gigabytes, and transfers data at a rate of 40 megabytes per second. Eight DataVaults, operating in parallel, offer a combined data transfer rate of 320 megabytes per second and hold up to 80 gigabytes of data.

The design philosophy followed for the Connection Machine architecture served for the DataVault as well: we used standard technology in a parallel configuration. Each DataVault unit stores data spread across an array of 39 individual disk drives. Each 64-bit data chunk received from the Connection Machine I/O bus is split into two 32-bit words. After verifying parity, the DataVault controller adds seven bits of error-correcting code to each word and stores the resulting 39 bits on 39 individual drives. Subsequent failure of any one of the 39 drives does not impair reading of the data, since the ECC allows detection and correction of any single-bit error.
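The 32-data-bit, 7-check-bit scheme can be sketched with a standard Hamming code: six positional check bits plus one overall parity bit give 39 bits, one per drive, from which any single failed bit is reconstructed. This is our illustration of the idea, not the DataVault's actual code:

```python
# Sketch of 32 data bits + 7 check bits = 39 bits, one bit per 'drive'.
# Six Hamming check bits sit at power-of-two positions 1..32; index 0
# holds an overall parity bit (double-error detection is omitted here).

CHECK_POSITIONS = [1, 2, 4, 8, 16, 32]
DATA_POSITIONS = [p for p in range(1, 39) if p not in CHECK_POSITIONS]

def encode(word):
    """Spread a 32-bit word over a list of 39 bits."""
    code = [0] * 39
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (word >> i) & 1
    for c in CHECK_POSITIONS:        # parity over the positions bit c covers
        code[c] = sum(code[p] for p in range(1, 39)
                      if (p & c) and p != c) % 2
    code[0] = sum(code[1:]) % 2      # overall parity bit
    return code

def decode(code):
    """Recover the word, correcting any single flipped bit."""
    syndrome = 0
    for c in CHECK_POSITIONS:        # nonzero syndrome names the bad position
        if sum(code[p] for p in range(1, 39) if p & c) % 2:
            syndrome |= c
    if syndrome:
        code = code.copy()
        code[syndrome] ^= 1          # repair the failed drive's bit
    return sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))

word = 0xDEADBEEF
stored = encode(word)
assert len(stored) == 39
damaged = stored.copy()
damaged[17] ^= 1                     # one drive fails
assert decode(damaged) == word       # data fully recovered
```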

Although operation is possible with a single failed drive, three spare drives can replace failed units until repairs are made. The ECC provides 100-percent recovery of the data on the failed disk, allowing a new copy of this data to be reconstructed and written onto the replacement disk. Once this recovery is complete, the database is considered healed. This mass-storage system architecture leads to high transfer rates, reliability, and availability.

Graphics display. Visualization of scientific information is becoming increasingly important in areas such as modeling fluid flows or materials under stress. The enormous amount of information resulting from such simulations is often best communicated to the user through high-speed graphic displays. We therefore designed a real-time, tightly coupled graphic display for the Connection Machine. This system consists of a 1,280 x 1,024-pixel frame-buffer module with 24-bit color and four overlays (with hardware pan and zoom) and a high-resolution, 19-inch color monitor. The frame buffer is a single module that resides in the Connection Machine backplane in place of an I/O controller. This direct backplane connection allows the frame buffer to receive data from the Connection Machine processors at rates up to one gigabit per second.

CM-2 engineering and physical characteristics. The cube-shaped Connection Machine measures 1.5 meters on a side and is made up of eight subcubes. Each subcube contains 16 matrix boards, a sequencer board, and two I/O boards, arranged vertically. This vertical arrangement allows air cooling of the machine. Power dissipation is 28 kilowatts. Each matrix board has 512 processors and four megabytes of memory. The matrix board has 32 custom chips implementing the processors and router, 16 floating-point chips, 16 custom floating-point memory-interface chips, and 176 RAM chips. The nexus board occupies the space between the subcubes. Each front end has one front-end interface board. Red lights on the front and back, with one light for each CM chip (4,096 altogether), assist troubleshooting.

Conservative engineering throughout ensures that the machine is manufacturable, maintainable, and reliable. The power of the machine arises from its novel architecture rather than from exotic engineering. It incorporates few types of boards; one board in particular, the matrix board, is replicated 128 times in a 64K machine. The matrix board is a 10-layer board with 9-mil trace widths.

The chip technology is also current state of the art and conservative. The CM-1 chip was implemented on a 10,000-gate, two-micron, complementary-metal-oxide-semiconductor gate array. The CM-2 chip is implemented on two-micron CMOS standard cells and has about 14,000 gate equivalents.

The massive parallelism of the machine makes it possible to provide a particularly powerful and fast set of hardware diagnostics. For example, the entire memory (which accounts for a large percentage of the silicon area of the machine) can be tested in parallel. Diagnostics can isolate a failure to a particular chip, or pair of chips and one wire connecting them, more than 98 percent of the time. Diagnostics, in combination with the error-detection hardware on all data paths, lead to a reliable and maintainable system, with mean time to repair well under one hour.

Table 1. Example applications.

Field | Application
Geophysics | Modeling geological strata using reverse-time migration techniques
VLSI Design | Circuit simulation and optimization of cell placement in standard-cell or gate-array circuits
Particle Simulation | N-body interactions, such as modeling defect movement in crystals under stress and modeling galaxy collisions
Fluid-Flow Modeling | Cellular automata and Navier-Stokes-based simulation, such as turbulence simulation in helicopter rotor-wake analysis and fluid flow through pipes
Computer Vision | Stereo matching, object recognition, and image processing
Protein-Sequence Matching | Large database searching for matching protein sequences
Information Retrieval | Document retrieval from large databases, analysis of English text, and memory-based reasoning
Machine Learning | Neural-net simulation, conceptual clustering, classifier systems, and genetic algorithms
Computer Graphics | Computer-generated graphics for animation and scientific visualization

CM-2 performance. We can measure the Connection Machine's performance in a number of ways. Since the machine uses bit-serial arithmetic, the speed of integer arithmetic and logical operations varies with word length; the languages implemented on the machine take advantage of this and use small fields whenever possible. For example, 32-bit integer arithmetic and logical operations run at 2,500 million instructions per second, while eight-bit arithmetic runs at 4,000 MIPS.

The speed of the machine also depends on how many processors take part in a particular calculation, and on the virtual-processor ratio. In some cases, higher virtual-processor ratios lead to higher instruction rates because the physical processors are better utilized.

Sustained floating-point performance has been shown to exceed 20 gigaflops (billions of floating-point operations per second) for polynomial evaluation using 32-bit floating-point operands. When the function to be computed involves interprocessor communication, such as a 4K x 4K-element matrix multiply, sustained performance typically exceeds five gigaflops.

We can express the performance of the Connection Machine communications systems in either of two ways: bandwidth or time per operation. Grid-communications performance varies with the choice of grid dimensionality, grid shape, and virtual-processor ratio. A two-dimensional-grid Send operation takes about three microseconds per bit. For 32-bit, two-dimensional-grid operations, that translates into 96 microseconds, or 20 billion bits per second of communications bandwidth.

Router communications performance is somewhat harder to measure because it depends on the complexity of the addressing pattern in the message mix, and thus the number of message collisions. For a typical message mix, 32-bit general Send operations take about 600 microseconds. This translates into about three billion bits per second of communications bandwidth for typical message mixes (peak bandwidth exceeds 50 billion bits per second).

The performance of the Connection Machine I/O system depends on the number of channels in use. Each of up to eight I/O channels has a bandwidth of 40 million bytes per second, for a total bandwidth of 320 million bytes per second. This peak bandwidth has been observed on an installed DataVault disk system. The typical sustained bandwidth on the DataVault is 210 million bytes per second, which makes it possible to copy the entire contents of the Connection Machine memory (512 megabytes) to disk in about 2.4 seconds.

The preceding discussions show that the Connection Machine achieves significant gains in performance over conventional computers through the use of a data-parallel model in a fine-grained, massively parallel architecture, without using any exotic technology. But how broadly applicable is this data-parallel model? In the remainder of this article, we illustrate the breadth of applications by describing a range of applications already developed on the Connection Machine.

Applications of the Connection Machine

Although data-parallel programming requires a different approach toward computation, programmers quickly adapted. In fact, they often found that many systems are naturally expressed in a data-parallel programming model. Fears that parallel programming would require a massive reeducation effort proved unfounded.

The partial list of applications given in Table 1 illustrates the range of applications developed for the Connection Machine. In general, most applications required less than a person-year of effort and were prototyped in a matter of months. The list includes examples from engineering, materials science, geophysics, artificial intelligence, document retrieval, and computer graphics. Far from being an architecture designed for special domains, the general-purpose nature of the Connection Machine permits it to be applied equally well to numeric and symbolic processing.

We include here discussions of three applications from the fields of VLSI design, materials science, and computer vision. These examples illustrate the use of parallelism in a range of areas and the role interprocessor communication plays in supporting the data-structure requirements of each application. Grid-based communication finds primary application in regularly structured problems such as particle simulations, while general routing supports the differing topologies of circuit simulation and computer vision.

To understand materials more fully, designers often perform simulation studies. Unfortunately, macro-level experiments, whether direct or computer-simulated, have not successfully explained important behaviors such as metal fatigue. That requires simulation at the molecular level. Here a major problem arises. Studies of perfect crystals generally offer little help in understanding real-world materials, as evidenced by the fact that the strengths predicted by such studies often exceed those measured in actual metals by twenty to fifty times. Defects in the crystalline structure alter its properties and dramatically increase the computational complexity of simulation studies. Often the interactions of ten thousand to one million atoms need to be simulated to accurately represent the real-world behavior of materials.

Molecular-dynamics simulation is extremely computation-intensive, and the necessity of computing high-order interactions on systems of millions of particles poses major problems for conventional machines. Data-parallel architectures permit the investigator to see in minutes what would take hours on traditional hardware. The Connection Machine virtual-processor mechanism allows the investigator to simulate systems many times larger than would a physical-processor system alone.

A commonly used model for molecular dynamics uses the Lennard-Jones 6/12 potential to describe the interactions between uncharged, nonbonded atoms. Two approaches permit the study of this behavior on the Connection Machine. In both cases, processors are assigned to individual particles. Each processor contains the (x, y, z) position of the particle, the velocity, and the force on the particle from a previous time step of the simulation. All calculations are performed using 32-bit floating-point precision.

In the first approach, we consider interactions between all particles. Particles are arbitrarily assigned to processors. The NEWS grid circulates the information about each particle throughout the CM such that each processor can compute the force its particle receives from every other particle in the system.
Connection Machine achieves significant tecture designed for special domains, the Two approaches permit the study of this gains in performance over conventional general-purposenature of the Connection behavior on the Connection Machine. In computers through the use of a data- Machine permits it to be applied equally both cases, processors are assigned to indi- parallel model in a fine-grained, massively well to numeric and symbolic processing. particles. Each processor contains parallel architecture without using any We include here discussions of three :E$,y,z) position of the particle, the exotic technology. But, how broadly applications from the fields of VLSI velocity, and the force on the particle from applicable is this data-parallel model? In design, materials science, and computer a previous time step of the simulation. All the remainder of this article, we will illus- vision. These examples illustratethe use of calculations are performed using 32-bit trate the breadth of applications by parallelism in a range of areas and the role floating-point precision. describing a range of applications already interprocessor communication plays in In the first approach, we consider inter- developed on the Connection Machine. supporting the data-structure require- actions between all particles. Particles are ments of each application. Grid-based arbitrarily assigned to processors. The communication finds primary application NEWS grid circulates the information in regularly structured problems such as about each particle throughout the CM Applications of the particle simulations, while general routing such that each processor can compute the Connection Machine supports the differing topologies of circuit force its particle receives from every other simulation and computer vision. particle in the system. 
All processors use Performance measures of massively Consult Waltz' for detailed descrip- the result of these forces to update their parallel architectures tell only part of the tions of several other applications. own (x,y,z) position in parallel. Depend- story. One of the significant revelations ing upon the purpose of the simulation,we that occurred with the introduction of the Molecular dynamics. Since the 196os, might observe several hundred time steps. Connection Machine concerned the sur- materials science has been a key technol- For large systems, each processor com- prising number of different application ogy in designing jet-engine turbine blades putes tens of thousands of force calcula- areas suitable for this technology. and other high-technology products. TO tions during a single time step.
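The circulate-and-accumulate pattern of the first approach can be sketched as follows. This is our own illustration, not the article's code: a serial NumPy `np.roll` stands in for the NEWS-grid circulation, each array row plays the role of one processor's particle, and the `eps`/`sigma` constants are arbitrary illustrative values.

```python
import numpy as np

# Sketch of the all-pairs approach: a traveling copy of the particle data is
# shifted one slot per step, so after n-1 steps every "processor" has seen
# every other particle exactly once and accumulated its Lennard-Jones force.

def lj_force(r_vec, eps=1.0, sigma=1.0):
    """Lennard-Jones 6/12 pair force on the particle at displacement r_vec."""
    r2 = np.sum(r_vec * r_vec, axis=-1, keepdims=True)
    inv_r6 = (sigma * sigma / r2) ** 3
    # Standard gradient of the 6/12 potential: 24*eps*(2/r^12 - 1/r^6)/r^2 * r
    return (24.0 * eps * (2.0 * inv_r6 * inv_r6 - inv_r6) / r2) * r_vec

def all_pairs_forces(pos):
    force = np.zeros_like(pos)
    traveling = pos.copy()
    for _ in range(len(pos) - 1):               # n-1 circulation steps
        traveling = np.roll(traveling, 1, axis=0)   # shift copies one slot
        force += lj_force(pos - traveling)          # all rows "in parallel"
    return force

pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]])
f = all_pairs_forces(pos)
print(np.allclose(f.sum(axis=0), 0.0))   # Newton's third law: forces cancel
```

The position and velocity updates of a real time step are omitted; the point is the systolic circulation that lets every processor see every particle.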

August 1988 33

This approach makes no assumption about which particles lie near others and hence dominate the overall force calculation. Dramatic improvements in execution time are possible if we know the location of nearby particles in advance. If we assume that the particles of the simulation have a regular structure and do not drift far from their initial relative positions (as in modeling solid-phase materials at low temperatures), we can take an alternative approach.

In the second approach, particles are assigned to processors according to their spatial configuration. In a manner akin to convolution, nearest-neighbor communication is used to calculate the forces from a surrounding K x K neighborhood of particles. Particles are free to drift, but we assume they stay roughly ordered relative to each other. Since the force calculation is dominated by these near neighbors, we can impose a cutoff in the force calculation. We simply ignore interactions of distant particles.

Even in large-scale simulations, the number of near neighbors remains relatively small. Thus, we need to compute only a fraction of the total number of force calculations. As expected, this results in a significant reduction in the time required for simulation. Systems as large as a million molecules, when tested, showed sustained execution rates in excess of 2.6 gigaflops.

Researchers have applied this data-parallel approach to studying defect motion in two-dimensional crystals of 16,384 atoms. They introduce stress by gradually shifting the top and bottom rows of atoms in opposite directions. They then observe the resulting movement, or percolation, of the defects. This work has contributed greatly to the study of stress-failure modes in various materials.

VLSI design and circuit simulation. Computer-aided design tools are routinely used in VLSI design, but as the size of circuits increases, two aspects of the design process become increasingly important and computationally expensive: cell placement and circuit simulation.

Cell placement. Semicustom VLSI circuits with more than 10,000 cells or parts have become quite common. Placing this number of parts on a chip while simultaneously minimizing the area taken up by interconnecting wire has proved a difficult and time-consuming problem. Minimization of area occupied is important because, in general, the smaller the area, the higher the expected yield. Designers have applied various automated methods to this problem. One method, employing simulated annealing [9], produces good optimization results on a broad spectrum of placement problems. Unfortunately, for circuits having 15,000 parts, simulated annealing may require 100 to 360 hours on a conventional computer. On a 64K-processor Connection Machine, the time required reduces to less than two hours.

Simulated annealing is an optimization procedure related to an analogous process in materials science wherein the strength of a metal is improved by first raising it to a high temperature and then allowing it to cool slowly. When the metal is hot, the energy associated with its molecules causes them to roam widely. As the metal cools, molecules lose their energy and become more restricted in their movements, settling into a compact, stable configuration. This general optimization procedure, already applied to a variety of problems, supports parallel implementation.

Designers apply simulated annealing to VLSI placement in the following way. Given an initial arbitrary placement of cells for a VLSI circuit, the designer first computes the size of the silicon chip required. Improvement in the layout results from swapping cells to reduce the total wire length (see Figure 5). If the only exchanges permitted are those that reduce total wire length, the placement pattern would seek a local minimum, not necessarily an optimal one. Simulated annealing, however, accepts a certain percentage of "bad" moves in order to expose potentially better solutions, avoiding the classic trap of local minimums. The percentage of such "bad" moves the system will accept is governed by a parameter usually expressed as a temperature. Starting with a high temperature, a large number of cells are permitted to change position over great distances. Gradually, the temperature parameter is lowered, restricting the movement of cells and the number of exchanges. Ultimately, the system converges to a near-optimal solution as only nearby "good" exchanges are permitted.

Figure 5. VLSI cell placement. Given an initial placement of cells, individual cells swap positions to reduce the total wire length.

Implementation of simulated annealing in parallel on the Connection Machine is straightforward. Individual cells, potential locations for cells, and nodes where wires between cells connect are represented in individual processors. The need to exchange information between cells at arbitrary locations illustrates the utility of the general router-based communications mechanism of the Connection Machine.

Simulated annealing takes place as follows. According to the given temperature parameter, a certain percentage of cells initiate a swap, calculate the expected change in wire length, and accept or reject the move. Potential exchanges are chosen by randomly generating the addresses of other cells. (Exchanges in general are not independent when performed in parallel, so restrictions on cell movements are enforced to ensure the consistency of the cell rearrangement.) This process repeats as the temperature parameter slowly lowers, restricting cell movement until the solution is acceptable to the designer.
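The temperature-controlled swap loop just described can be sketched on a toy problem. This is our own construction, not the article's placer: cells sit on a one-dimensional line, wire length is the sum of |pos(a) - pos(b)| over two-cell nets, a plain loop stands in for the per-cell processors, and the cooling schedule is illustrative.

```python
import math
import random

def wire_length(pos, nets):
    """Total wire length for a placement: sum of net spans."""
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def anneal(n_cells, nets, t=10.0, cooling=0.999, steps=2000, seed=1):
    rng = random.Random(seed)
    pos = list(range(n_cells))
    rng.shuffle(pos)                            # initial arbitrary placement
    cur = wire_length(pos, nets)
    for _ in range(steps):
        a, b = rng.randrange(n_cells), rng.randrange(n_cells)
        pos[a], pos[b] = pos[b], pos[a]         # tentative swap
        new = wire_length(pos, nets)
        # Always keep improving moves; keep "bad" moves with probability
        # exp(-delta / t), so a high temperature tolerates more of them.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
        else:
            pos[a], pos[b] = pos[b], pos[a]     # rejected: undo the swap
        t *= cooling                            # lower the temperature
    return pos, cur

nets = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
pos, length = anneal(6, nets)
print(length)   # chain nets: the optimal placement has total length 5
```

On the Connection Machine, many such swaps proceed simultaneously, with the router carrying wire-length queries between arbitrarily placed cells; the consistency restrictions mentioned above prevent two parallel swaps from moving the same cell.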


Figure 6. Simulation output of VLSI circuit. Waveforms from multiple probe points are displayed for a given input.

Circuit simulation. A second area of importance to the electronics industry is VLSI circuit simulation. Unfortunately, our ability to simulate large circuits has not kept pace with the increasing number of components. Again, this results from the computation-intensive nature of simulation. Consequently, often only circuits with several hundred transistors are simulated at a time. Just as computer-aided design tools have become essential for successful implementation of VLSI, designers need simulation at the level of analog waveforms to verify circuit design.

Real circuits operate according to the characteristics of each component and through enforcement of Kirchhoff's law, which states that at each node the current entering the node is exactly balanced by the current leaving the node. In computer simulation, this leads to the application of an iterative relaxation method at each time step, which brings the circuit into equilibrium.

A Gauss-Jacobi relaxation algorithm for solving a system of nonlinear differential equations has been employed to simulate VLSI circuits [10]. As implemented on a Connection Machine, individual components (such as transistors, resistors, or capacitors) and nodes are each assigned to individual processors. Pointers express the connectivity of the circuit. During simulation, the pattern of communication directly reflects the topology of the circuit.

A simulation time step starts with all nodes sending their voltages to the terminals of the components to which they connect. In the initial time step, we know only the potentials at the power supply and ground nodes (others are assumed zero). Each element computes its response and sends this information back to connecting nodes. Each node processor checks to see if all currents balance. If so, the time step is complete; otherwise, each node processor updates its expected potential according to the relaxation rule, and the cycle repeats. In practice, each time step typically requires three to four iterations and accounts for one-tenth of a nanosecond of simulated time. Figure 6 shows a typical simulation.

Because all operations take place in parallel, simulations of circuits with more than 10,000 devices can run in a few minutes on a 64K-processor Connection Machine. Simulations of circuits having more than 100,000 transistors have run in less than three hours.

Computer vision. Problems in computer vision pose a major challenge for today's systems not only because the problem is large (256K pixels in a typical video image), but also because we need a variety of data structures to meet the differing representational requirements of vision. We can easily see how to apply massively parallel systems to pixel-level image processing, but it is often less clear how parallelism applies to higher-level concerns, such as the recognition of objects in a scene.
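Before turning to the vision examples, the circuit relaxation loop above can be made concrete. This toy sketch is ours, not the article's simulator: a chain of equal linear resistors connects a 1 V supply node to ground, so with equal conductances the Jacobi update at each free node is simply the average of its two neighbors, iterated until Kirchhoff's current law is satisfied everywhere.

```python
# Toy nodal relaxation: five nodes in a resistor chain; nodes 0 and 4 are
# held at the supply and ground potentials, the rest relax to equilibrium.

def jacobi_step(v, fixed):
    """One synchronous update: every free node averages its two neighbors.
    Boundary nodes must be in `fixed`, so v[i-1] and v[i+1] always exist."""
    new = v[:]
    for i in range(len(v)):
        if i not in fixed:
            new[i] = 0.5 * (v[i - 1] + v[i + 1])
    return new

v = [1.0, 0.0, 0.0, 0.0, 0.0]         # node 0 = 1 V supply, node 4 = ground
fixed = {0, 4}
for _ in range(200):                   # iterate to equilibrium
    v = jacobi_step(v, fixed)

print([round(x, 3) for x in v])        # -> [1.0, 0.75, 0.5, 0.25, 0.0]
```

The converged values are the familiar resistor-divider voltages; in the real simulator each node processor performs this update with nonlinear device models and stops when all currents balance.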

Therefore, we show here two examples in computer vision: production of a depth map from stereo images, and two-dimensional object recognition.

Depth map from binocular stereo. Binocular stereo is a method for determining three-dimensional distances based on comparisons between two different views of the same scene. It is used by biological and machine vision systems alike. Drumheller and Poggio [11] have implemented a program that solves this problem in near-real time and illustrates aspects of performing image analysis on the Connection Machine. Given a stereo image pair, the program produces a depth map expressed in the form of a topographic display.

First, two images (left and right) are digitized and loaded into the Connection Machine, with pixel pairs from the right and left images assigned to individual processors. Next, a difference-of-Gaussians filter is applied to each image to detect edge points, which form the primitive tokens to be matched in the two images. All pixels perform this difference-of-Gaussians convolution in parallel.

To determine the disparity of features between the left and right images, the Connection Machine system performs a local cross-correlation between the two images. It slides the right image over the left using NEWS communications. At a given relative shift, the two edge-point patterns are ANDed in parallel, producing a new image that has the Boolean value 1 only at points where the left and right images both contain an edge. Then each processor counts the number of 1's in a small neighborhood, effectively computing the degree of local correlation at each point. This occurs at several relative shifts, with each processor keeping a record of the correlation score for each shift. Finally, each processor selects the shift (disparity) at which the maximum correlation took place.

Since the above process determines disparity (and distance) only at edge pixels, we must "fill in" the non-edge pixels by interpolation using a two-dimensional heat-diffusion model. The resulting depth map is displayed either as a gray-scale image with density as a function of height from the surface, or in standard topographic-map format. The entire process, from loading in a pair of 256 x 256-pixel stereo images to production of a contour map, completes in less than two seconds on a 64K-processor Connection Machine (see Figure 7).

Figure 7. Using a stereo pair of left and right terrain images, the system corrects for any geometric distortions, detects edge points, computes elevations, and generates a contour map for the given terrain.
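The shift, AND, and count steps of the disparity search can be sketched as follows. This is our own illustration, not the Drumheller-Poggio code: one-dimensional binary "edge maps" with a known two-pixel offset stand in for real images, `np.roll` stands in for the NEWS-grid slide, and the window size is arbitrary.

```python
import numpy as np

# Sketch of the disparity search: AND the edge maps at each relative shift,
# box-count matches for a local correlation score, keep the best shift.

def disparity(left, right, max_shift=4, window=3):
    n = len(left)
    best_score = np.zeros(n, dtype=int)
    best_shift = np.zeros(n, dtype=int)
    kernel = np.ones(window, dtype=int)
    for s in range(max_shift + 1):
        shifted = np.roll(right, s)            # slide right image over left
        matches = left & shifted               # parallel AND of edge points
        score = np.convolve(matches, kernel, mode="same")  # local counts
        better = score > best_score            # strictly better shifts win
        best_score[better] = score[better]
        best_shift[better] = s                 # remember winning disparity
    return best_shift

left = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
right = np.roll(left, -2)                      # right view offset by 2 pixels
print(disparity(left, right))                  # pixels near edges report 2
```

Pixels with no edges in their window never score and keep disparity 0, which is why the heat-diffusion interpolation described above is needed to fill in the non-edge pixels.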

Figure 9. Object recognition example showing original image (a), feature extraction (b), hypothesis generation (c), and recognition and labeling of objects (d).

Object recognition. Recognition of objects in natural scenes remains an unsolved problem in computer science despite attracting research interest for many years. It is important in many applications, including robotics, automated inspection, and aerial photo-interpretation.

In contrast to many of the systems designed to date, the object-recognition system described in this section was designed to work with object databases containing hundreds of models and to search each scene in parallel for instances of each model object [12]. While this system is limited to recognition of two-dimensional objects having rigid shapes, it is being extended to the three-dimensional object domain.

The general framework for the approach taken here is massive parallel hypothesis generation and test. Whereas object-recognition algorithms traditionally use some form of constraint-based tree search, here searching is effectively replaced by hypothesis generation and parameter-space clustering. In this scheme, image features in the scene serve as events, while features of each model serve as expectations waiting to be satisfied. Hypotheses arise whenever an event satisfies an expectation.

On the Connection Machine, objects known to the system are represented simply as a collection of features, generally straight line segments and their intersections at corners. Each feature is assigned to its own processor. A single object is therefore distributed over a number of processors, enabling each feature to actively participate in the solution.

Features useful for hypothesis generation are those that constrain the position and orientation of a matching model object. A single point, line, or patch of color is by itself not sufficient. In a two-dimensional world, however, the intersection of two lines that matches an expected model corner can be used to generate a hypothesis that an instance of the model object exists, albeit translated and rotated such that the corresponding features come into alignment.

Figure 8 illustrates this parallel hypothesis generation and verification method. An unknown scene is first digitized and loaded into the Connection Machine memory as an array of pixels, one per processor. Edge points are marked, and straight line segments are fitted to the edge points using a least-squares estimate. Intersecting lines are grouped into corner features and matched in parallel with corresponding corner features in the model database.

Figure 8. Parallel object-recognition steps.

Whenever a match between an image and a model feature occurs, a hypothetical instance of the corresponding model object is created and projected into the image plane. A hypothesis-clustering scheme, related to the Hough transform, is next applied to order hypotheses according to the support offered by mutually supporting hypotheses. Although this still leaves many possible interpretations for each image feature, the Connection Machine can quickly accept or reject thousands of hypotheses in parallel using a template-like verification step.

Verification results from having each instance check the image for evidence supporting each of its features in parallel. Hypotheses having strong support for their expected features are accepted over those with little support. Expected features of an object that are occluded or obscured in the scene do not rule out the hypothesis; they only weaken the confidence the system has in the object's existence. Competitive matching between instances for each image feature finally resolves any conflicts that arise in the overall scene interpretation. Results appear in Figure 9.
1, Machine can quickly accept or reject thou- Computing with the Distributed Array 1987, pp. 85-97. sands of hypotheses in parallel using a Processor,” High-speed Computer and 9. C.D. Kirkpatrick, C.D. Gelatt, and M.P. template-like verification step. Algorithm Organization, Kuch, Lawrie, Vecchi, “Optimization by Simulated Verification results from having each and Sameh, eds., Academic Press, New Annealing,” Science, Vol. 220, May 1983, York, 1977. instance check the image for evidence sup- pp. 671-680. porting each of its features in parallel. 3. K.E. Batcher, “Design of a Massively Par- 10. R.A. Newton, and A.L. Sangiovanni- allel Processor,” IEEE Trans. Computers, Vincentelli, “Relaxation-Based Electrical Hypotheses having strong support for Vol. C-29, No. 9, 1980. Simulation,” IEEE Trans. Computer- their expected features are accepted over 4. D.E. Shaw, The Non- Von , Aided Design, Vol. CA-D-3, No. 4, 1984. those with little support. Expected features tech. report, Dept. of Computer Science, 11. M. Drumheller andT. Poggio, “On Parallel of an object that are occluded or obscured ColumbiaUniv., New York, August 1982. Stereo,” Proc. 1986 IEEE Int’l Conf. on in the scene do not rule out the hypothe- Robotics and Automation, April 1986, pp. 5. L.S. Haynes et al., “A Survey of Highly 1,439-1.488. sis, they only weaken the confidence the Parallel Computing,” Computer, Vol. 15, 12. L.W. Tucker, C.R. Feynman, and D.M. system has in the object’s existence. Com- NO. 1, 1982, pp. 9-24. Fritzsche, “Object Recognition Using the petitive matching between instances for 6. W.D. Hillis, The Connecfion Machine, Connection Machine,” Proc. IEEE Con$ each image feature finally resolves any MIT Press, Cambridge, Mass., 1985. Computer Vision and Pattern Recognition, conflicts that arise in the overall scene 7. G.E. Blelloch, “Scans as Primitive Paral- June 1988, pp. 871-877. interpretation. Results appear in Figure 9. 
The time required from initial image acquisition to display of recognized objects is approximately six seconds on a 64K-processor Connection Machine. Experiments have shown this time to be constant over a range of object database sizes from 10 to 100. We expect optimiza- tion of this task to reduce the time to less than one second.

The Connection Machine represents an exciting approach to dealing with the large and complex problems that a growing number of people wish to solve with computer systems. The size of problems has grown at a rate faster than improvements in technology, forcing us to find innovative ways of using current technology to achieve quantum leaps in performance. We have described the architecture and evolution of the Connection Machine, which directly implements a data-parallel model in hardware and software. From the applications described here and elsewhere, it seems clear to us that data-level parallelism has broad applicability. We expect that its applicability will continue to grow as people continue to require solutions to larger and more complex problems.

Acknowledgments

Numerous people have contributed to the development of the architecture and applications of the Connection Machine. We wish to offer special thanks to Brewster Kahle, Guy Steele, Dick Clayton, Dave Waltz, Rolf-Dieter Fiebrich, Bernie Murray, and Mike Drumheller for their help in preparing this article. This work was partially sponsored by the Defense Advanced Research Projects Agency under contract no. N00039-84-C-0638.

References

1. W.D. Hillis and G.L. Steele, "Data-Parallel Algorithms," Comm. ACM, Vol. 29, No. 12, 1986, pp. 1,170-1,183.
2. P.M. Flanders et al., "Efficient High Speed Computing with the Distributed Array Processor," High-Speed Computer and Algorithm Organization, Kuck, Lawrie, and Sameh, eds., Academic Press, New York, 1977.
3. K.E. Batcher, "Design of a Massively Parallel Processor," IEEE Trans. Computers, Vol. C-29, No. 9, 1980.
4. D.E. Shaw, The Non-Von Supercomputer, tech. report, Dept. of Computer Science, Columbia Univ., New York, August 1982.
5. L.S. Haynes et al., "A Survey of Highly Parallel Computing," Computer, Vol. 15, No. 1, 1982, pp. 9-24.
6. W.D. Hillis, The Connection Machine, MIT Press, Cambridge, Mass., 1985.
7. G.E. Blelloch, "Scans as Primitive Parallel Operations," Proc. Int'l Conf. Parallel Processing, Aug. 1986, pp. 355-362.
8. D.L. Waltz, "Applications of the Connection Machine," Computer, Vol. 20, No. 1, 1987, pp. 85-97.
9. S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, "Optimization by Simulated Annealing," Science, Vol. 220, May 1983, pp. 671-680.
10. A.R. Newton and A.L. Sangiovanni-Vincentelli, "Relaxation-Based Electrical Simulation," IEEE Trans. Computer-Aided Design, Vol. CAD-3, No. 4, 1984.
11. M. Drumheller and T. Poggio, "On Parallel Stereo," Proc. 1986 IEEE Int'l Conf. Robotics and Automation, April 1986, pp. 1,439-1,488.
12. L.W. Tucker, C.R. Feynman, and D.M. Fritzsche, "Object Recognition Using the Connection Machine," Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 1988, pp. 871-877.

Lewis W. Tucker is a senior scientist at Thinking Machines Corp., where he is engaged in the application of parallel architectures to machine vision. His interests include image understanding, parallel algorithms, computer architecture, and machine learning. Tucker earned his BA from Cornell University in 1972 and a PhD in computer science from the Polytechnic Institute of New York in 1984.

George G. Robertson is a principal engineer at Xerox PARC, working on user interface research. Previously, at Thinking Machines, he worked on massively parallel systems and artificial intelligence. His publications are primarily in the areas of user interfaces, programming language design, operating systems design, distributed systems, and machine learning. Robertson received an MS in computer science from Carnegie Mellon University and a BA from Vanderbilt University.

Readers may write to Tucker at Thinking Machines Corp., 245 First St., Cambridge, MA 02142.
