Transputers-Past, Present, and Future

High-performance enhancements in transputers signal a trend toward general-purpose computing. We present the progress, products, and results of ongoing work in transputer development.

Co/in Wbitby-Strevens transputer transformation is under- cation designer directly controls the transputer A way. Exploiting the hardware without the need for a resident oper- lnmos Limited potential of the transputer, ESPRIT ating system or runtime kernel. Applications for Q projects continue to develop higher transputers include office systems (fax, performance, lower cost parallel processing videophones, laser printers, terminals), digital computers. In advancing the transputer toward telecommunications, military systems and hand- general-purpose computing, Inmos Limited is held satellite navigation systems (see box on also improving the transputer’s suitability in sample applications), industrial control systems. embedded systems with the upcoming release and music synthesizers. of new products. Initial promotion of the transputer focused The first transputer emerged at a time when on its use as a general-purpose component for very large scale integration (VLSI) technology special-purpose systems. A number of applica- permitted a combination of a small, fast pro- tions-particularly in graphics and image cessor (20,000 ) and local memory and processing-clearly required high-performance, communications facilities on silicon. At the same floating-point operations. Inmos achieved this time, the effective exploitation of VLSI techno- capability very effectively by adding a floating- ogy also resulted in the development of reduced point unit within the transputer chip (in con- instruction-set computing (RISC) processors. Both trast to the conventional. but more cumbersome, the transputer and conventional RISC proces- coprocessor approach). This advancement sors reevaluate architecture trade-offs in the opened up the exciting prospect of construct- context of VLSI capabilities. However, devel- ing parallel processing machines using arrays opers of the transputer created a design at the of transputers to produce -level instruction-set level to support multiprocessing performance. across a number of transputers and within a Applications requirements for simulation and single transputer without the overheads usu- modeling (for example, quantum chromody- ally associated with complex runtime software. namics and fluid-flow analysis) provided the In embedded systems, the transputer appli- incentive to establish the ESPRIT Supernode

16 IEEE Micro 0272.1732/90/1200-0016$01.00 0 1990 IEEE Two sample applications

The transputer‘s suitability to a wide range of embed- ded systems is illustrated by considering applications at both extremes. The first application-a handheld naviga- tional system-uses just one transputer to address such issues as high performance, low cost. and low power. The second application-a long-range, three-dimensional rd- dar system-uses up to 4,000 transputers to provide supercomputer levels of performance. but in :I real-time application.

Hand-held satellite navigation system Some 18 satellites, in six orbits, operate as part of a Global Positioning System, Each satellite continuously transmits its position in an encoded form, and by listen- ing to four satellites one can calculate the position (in three dimensions) of the receiving station. The traditional approach involves the use of complex dedicated signal processing hardware. However, the Gypsy hand-held receiver, developed by Columbus Positioning Limited, performs the complex signal processing and mathematics required in one transputer, which also ser- vices a keyboard and controls a display (see Figure A). A large, delreloping market for the system includes marine navigation, transport control systems (reporting delivery truck positions). and automotive applications (dashboard navigation systems). The advantages of the transputer-based solution over the traditional approach include relatively fast acquisi- tion time (which also saves battery power). the elimina- tion of custom logic for signal processing. the minimal amount of glue logic required. the lo= specification re- quired on radio oscillators (transputer software performs Figure A. Gypsy hand-held receiver. frequency compensation). and quick implementation of future product changes. ferent board types). With radar technology now moving so fast, harcl\viring IS doul~ly expensive because boards Long-range radar system are quickly rendered out of date. By contrast, in the The Martello long-range, three-dimensional radar sys- transputer system. sof&2re performs the entire signal pro- tem under construction by Marconi Radar Systems uses up ccssing task, and the same basic computer remains usable to 4,000 transputers for signal processing of the radar re- throughout a \vholc generation of racla~-s. turns.’ The transputers combine to form a very fast parallel The signal processing algorithms take place in Ireal-time computer that accepts digitized radar returns at its input by using pipeline parallelism to spread the parts of the and defines the range, bearing, and height of every target algorithms across an al-~-a).L\ ith computation overlapping at its output. (Some environmental information is also added communications (see Figure I3 on the next p;ige). Simula- to the output.) The processing load amounts to roughly tion devised the optimum map. making it necessav to three billion operations per second through each radar build only a sn~all part of an array to prove the concept. In channel. an array node of 50 transputers. housekeeping uses two The alternative to transputers-hardnriring-is inherently transputers, while the rest are a\2ilable for data processing. expensive (previous Martello systems used about 50 dif-

December 1990 17 Transputers

Two sample applications cor2tinuedfkm p. 17 HDLC Dual Digitized Radar Processed Monitor Timing Timing serial Ethernet video control video video in out interface interface

t t t t t t t Typical Master + application control Control interface unit Interface Video Waveform and Control buffer generator status processor interface

tf t t t tt

57400 array in a I Radar I Extracted typical application as a radar signal

Node

Figure B. Diagram of Martello radar system architecture.

Geometric parallelism achieves the processing rate The vast number of processors in the Martello com- required for the Martello radar. The radar coverage breaks puter areIS-bit T222 transputers, with some 32-bit T425 into range bands, and a particular array node directs all transputers. A follow-on project proposes to constntct a samples from a given band. Each array node executes multifunctional radar for the Italian navy using floating- an identical program, but on different data. The scalable point transputers. However, no fundamental change to performance of the radar is proportional to the number the architecture is necessary to take advantage of new of array nodes. generations of transputers. project (see ESIWT Supernode box). This project led to the project contributed greatly in establishing the need for development of many new commercial hardware and soft- reconfigurable systems and in creating paradigms for par- ware products, including the Inmos T800 transputer, the allel programming. Parsys Supernode machine, the Telmat T-node machine, The personal also opened up another mar- and the N.A. Software Limited’s parallel processing librar- ket for the transputer, since it provides the basis for work- ies for Fortran numerical routines. More significantly, the station accelerators. More than 50 companies now market

18 IEEE Micro The ESPRIT Supernode Project Current ESPRIT projects

Over about a four-year period (Dec. 1985-Nov. 19891, The Supernode 2 project studies software issues- the Supernode Project aimed to create a low-cost, high- particularly those concerned with advanced operating performance computer system from transputers. This systems-in the context of the Supemode machine. Such project incorporated the development of the T800 an converts the machine from one that floating-point transputer’ and the Supernode RTP runs one application into a more general-purpose facility. (Reconfigurable Transputer Processor). The two-year PUMA (Parallel Universal Message- One Supernode consists of 16 transputers connected passing Architecture) project explores the possibilities via a crossbar switch (also developed during the project). offered by the new virtual communications architecture Using crossbar switches, multiple Supernodes combine (see the “The next-generation transputer” section in the to provide systems of up to about 1,000 transputers with main article). Besides studying the performance offered no limitations on topology (except those implied by the by various topologies and the implications for computer four links on each transputer). Constructing larger sys- architecture, the project examines high-level models of tems remains possible with minor constraints on topology. parallelism and their implementation using a virtual The wide range of applications demonstrated on the message-passing system. Supemode included finite-element analysis, logic simu- The three-year GPMIMD (General Purpose, Multiple- lation, luminosity simulation, and real-time image analy- Instruction, Multiple-Data) project aims to develop a sis. Development of parallel algorithms and parallel standard European MIMD architecture based on Hl numerical libraries also occurred. transputers and virtual communications. Four main Eu- The main results demonstrated the feasibility of the ropean transputer-supercomputer manufacturers- system, and the surprising ease in programming appli- Meiko, Parsys, , and Telmat--currently work as cations to take advantage of concurrency. A large num- key collaborators on the project, which is led by Inmos. ber of European research projects base their development The OMI MAP (Open Microsystems Initiative Micro- of techniques on the now commer- processor Architecture Project) forms the core of the cially exploited Supernode. European-led Open Microsystems Initiative. It aims to define the architecture of a new family designed to exploit the VLSI capabilities anticipated in accelerator cards, offering a range of hardware and soft- the latter half of the 1990s. Collaborators include Bull, ware capabilities. Initially these cards targeted the develop- Inmos, Olivetti, Siemens, and Thomson. ing transputer embedded systems market. However, as Several other projects exploit the transputer architec- application software for the transputer grew, the market- ture for specific application areas. For example, Padmavati place widened-first to engineering design and (Parallel Associative Development Machine as a Vehicle more recently to commercial and financial workstations. for Artificial Intelligence) uses transputers and associa- Several workstations based entirely on transputers are now tive memory for symbolic processing. available. ESPRIT projects continue to attack the difficult problems of providing very high performance systems (see box for current projects). The original issues concerned the feasibil- allel computer executes a range of applications without ity of constructing such systems (possibly containing thou- concern for the number of processors employed or the to- sands of processors) and determining the methods of pology in which they are connected. Indeed, whether the programming individual applications. The Supemode project underlying hardware architecture is based on message passing developed a machine that could effectively execute a wide (such as transputer-based and hypercube-style systems) or range of applications, and in many cases nearly achieved shared memory (such as Sequent’s Balance system or Cedar the theoretical maximum performance from the machine. systems) is of no concern to the programmer. Current ESPRIT project issues involve “scalability,” “port- The expertise derived from ESPRIT and Supemode projects ability,” and programming ease. Scalability refers to the ease continues to spin off other applications. Migration of pro- of achieving increased performance from an application by cessing techniques and applications from general-purpose using more processors. Portability relates to the ease of computers into special-purpose systems signifies a definite transferring programs between parallel machines of differ- computing trend. For example, high-performance laser printers ent architectures, The term “general-purpose” summarizes incorporating Postscript processing and intelligent terminals these issues of parallel computing. A general-purpose par- continued on p. 76

December 1990 19 TransDuters

A system is constructed from one or more transputers Transputers operating concurrently and communicating through stan- continuedfmmp. 19 dard links. The programming language Occam (see box) supporting windowing systems enjoy increasing popularity. formalizes the computational model. Occam describes a In both cases, the workstation formerly carried out the corre- system as a collection of processes and communications sponding processing. Similarly, one expects techniques for that operate concurrently and communicate through channels. high-performance systems developed on machines such as Supernode to migrate into next-generation embedded Transputer processing systems. The transputer directly implements the Occam model of concurrency. A hardware scheduler allows any number of Transputer architecture Occam processes to share a single processor, and transputer The development of transputer architecture involved four instructions implement Occam message passing. An appli- main objectives: cation designer can configure a collection of processes ready for execution on a network of transputers. Each transputer

l it created a commercial product range that set new executes a component process and transputer links imple- standards in ease of programming and engineering; ment Occam channels. l it provided maximum performance to the user; Roth internal and external communications use the same

l it allowed the exploitation of future developments in instructions, allowing for Occam program reconfiguration VLSI technology within a compatible family; and (such as using a different processes-to-processors alloca-

l it created a programmable component for building tion) without recompilation. In particular, an application systems with large numbers of concurrent computing designer can configure an Occam program to execute on a components. small number of transputers for low cost or on a larger number of transputers for high performance. A transputer contains a processor. memory, and a num- The transputer processor supports fast interrupt response ber of standard point-to-point communications links-all by providing two levels of priority. (Typically, the interrupt integrated into one silicon chip (as shown in Figure 1). An response is less than 1 microsecond on a 20-MHz clock external memory interface extends the on-chip memory. ‘When transputer; worst case, it is less than 4 microseconds). Using appropriate, transputers also incorporate special-purpose the ALT construct (a key word in Occam meaning alterna- processing and/or interfacing capabilities. Separating the tive), a high priority process waits for the first of several external memory interface (for local memory) from the inputs to become ready. It then executes the specific piece communications optimizes performance and minimizes of code to respond to the particular interrupt. contention The processor treats access to a timer as an input. In a delayed input, the process waits until the timer reaches an appropriate value. The processor supports an arbitrary number of timer inputs. A programmer can also use a timer input Processor within an ALT construct as a time-out on a communication. Developing the transputer instruction set involved five 11 design objectives: Link in l to implement Occam effectively, so that high-level lan- Link out guage usage results in the effective use of silicon capa- bility, and that highly concurrent programs execute with minimum overheads;

l to implement Occam simply and directly to faciliate easy, straightfonvard program compilation, and to ensure that lower level programming is unnecessary;

l to provide word-length independence, so that a pro- Applications/external memory interface gram executes using processors of different word lengths without recompilation;

l to provide position independence, so that programs and workspaces are allocated anywhere in memory after Figure 1. Processing and interfaces in transputer recompilation; and architecture. continued on p. 78

76 IEEE Micro Basic Occam concepts

The Occam language’ programs concurrent, distributed implements as a loop and is equivalent to systems. The word distributed emphasizes the unsuitabil- ity of previous languages in this area. Occam describes a SEQ system as a collection of concurrent processes that com- a[basel := base municate with each other and with peripheral devices a[hdse + 11 := base + 1 through channels. Concurrent processes do not communi- ., cate via shared variables; thus Occam is particularly suit- albase + count - 11 := base + count - 1 able for programming systems with no memory sharing between processors. Replication used with PAR provides arrays of similar Occam provides three primitive processes: processes. AL?‘ and IF can also be replicated. Using a construct as a component in another con- u:= e Assign expression e to variable LJ struct makes it possible to design a system as a set of ! e Output expression e to channel c nested processes (for example, by using PAR within c? u Input from channel c to variable u PAR as shown in Figure C). The messages input and output on the channels of a process fully specify the Occam provides constructs that combine primitive process, and this completely hides its internal structure processes: from the outside world. Internally, the programmer can structure the process SEQ Components execute one after itself as a set of nested processes. At any level of design. another (sequential) the designer works with a small and manageable set of PAR Components execute together processes. (parallel) ALT First ready component executes (alternative)

The language also provides IF and WHILE constructs. A construct is itself a process and it may be used as a component of another construct. Occam syntax uses indentation to indicate program structure. A programmer writes parallel programs by using channels, inputs, and outputs combined in parallel and alternative constructs. Each Occam channel provides a communication path between two processes. Commu- nication is synchronized and takes place when both the inputting and the outputting processes are ready. Data to be communicated is copied from the outputting process to the inputting process, and both processes continue. An AL?‘ process waits for input from any one of a number of channels. ALT takes input from the first to be used for output by another process.

Occam provides a replicated constructor. For example

SEQ i = base FOR count a[il := i Figure C. Design of nested Occam language processes.

December 1990 77 Transputers

l to pl-ovide lo\\.-latency response to communications with more than 4 microseconds to switch between low priority I external devices. and high priority. A het\veen processes executing at low pri- The resulting design’ uses a simple linear address space. ority occurs only when the evaluation stack contains no use- I six functional registers for sequential programming (see Fig- ful contents. With minimal need to save and restore registers, ~ ure 2), and additional registers as qiwue pointers to support the processor implements concurrency very efficiently. concurrency. The instruction format uses very compact encoding based on l-byte instructions. Prefixing instructions are used to form , Regrrs , ( Locals ( 1 Program / long operands. The instruction size is independent of the word length. In general, a program requires much less slor- age to hold it than an equivalent program in a conventional Ic1 I I or RISC microprocessor. Since a program requires less star- Workspace age to represent it. fetching instructions use less memory bandwidth. As the transputer accessesmemory one word at Next instruction L a time, the processor receives several instructions for every Ooerand fetch (depending upon the number of bytes in a word). In addition to Occam, high-performance compilers for C, Fortran, Pascal. and Ada have been implemented for the Figure 2. Function of transputer registers for sequential transputer. programming. Transputer communications The six registers for sequential programming are the A link between two transputers implement? a pair of Occam channels, one in each direction. Two one-directional signal

l Lvorkspace pointer that points to a storage area contain- lines connect a link interface on one transputer to a link ing local variables. interface on the other transputer. Each signal line carries data

l instruction pointer that points I0 the next instruction to and control information. lx executed. Communication through a link involv-es a simple protocol,

l operand register used to form instnlcrion operands, and which supports the synchronized communicarion of Occam.

l A, I3, and C registers that form an evaluation stack. This The protocol provides for the transmission of an arbitrary stack holds the operands and intermediate results for sequence of bytes, which allows trdnsputers of different \vorcl expression e~luation. lengths to communicate. Each byte transmits as a start bit. fohWed by a one bit. 8 The hardware scheduler allows for the combined execu- data hits, and a stop bit (see Figure 3a). After transmitting a tion of any number of processes through the sharing of pro- data byte, the sender waits until receiving an acknowledgment. cessor time. ATany time. a concurrent process is active (either which consists of a start bit followed by a zero bit (see Figure currently executing or on a list awaiting execution) or inac- 3b). The acknowledgment signifies both that a process re- tive (either ready to input. ready to output. or waiting until a ceived rhe acknowledged byte. and that the receiving link is specified time). ready to receive another byte. The sending process proceeds A list holds active processes awaiting execution. This linked only after receiving acknowledgment for the final byte. list of process workspaces uses tno registers in implemen?d- Data bytes and acknowledgments multiplex down each tion--one that points to the first process on the list and one signal line. An acknowledgment transmits as soon as recep- that points to the last process. The hardware scheduler main- tains ho such liats+)ne for high-priority processes, the other for low-priority processes. 1 1 Data 0 The implementation of the instrucrion set uses a single I I I I I I I level of . Many instruclions execute in one cycle (4 (50 ns on a X-MHz transputer); many of the rest execute in two cycles. Some complex functions (such as block move) take an arbitrary number of cycles. These instructions still provide higher perfoi-mance than possible with software. 04 To limit the latency figure for switching between low and high priority, time-consuming instructions allow a switch during execution. Consequently, the processor never takes Figure 3. Formats of data links (a) and acknowledgment (b).

78 IEEE Micro tion of a data byte begins (provided a process is waiting for resulting in essentially a system design activity.’ it and room to buffer another data byte is available). Conse- To configure a program for a network of transputers, the quently, transmission may continue without delays between applications designer identifies the parallelism. The designer data bytes. subsequently maps the parallel processes onto the transputer As the transputer uses -transistor-logic-compatible network to optimize the system according to the design crite- (‘ITL) signals, the applications engineer can extend the links ria (such as maximized performance and minimized latency). by inserting industry-standard line drivers and receivers. Experience gained to date with parallel systems-particu- The design of the links makes engineering of transputer larly with ESPRIT projects-identified a number of program- systems easy. Irrespective of internal performance, all trans- ming paradigms that help to structure systems designs.” These puters use a i-MHz clock for frequency reference. The low paradigms can be written in languages such as Ada,‘ or can frequency simplifies the clock distribution in large transputer employ parallel extensions or libraries in C or Fortran. How- systems. The communications system does not require a phase ever, Occamj describes these paradigms most conveniently. reference. Therefore, it is not necessary for all transputers to Descriptions of the paradigms for algorithmic parallelism. operate on the same clock. The flexibility to use a number of geometric parallelism, and farming appear below: clocks enables interworking between independently designed (sub)systems. 1) Algorithmicparallelism. The designer splits the applica- The use of point-to-point serial communications, instead tion into functional units. In simple cases. these units of buses, offers the following advantages: can form a pipeline; in general, they can form more complex structures such as feedback loops. The various

l simplified board layout and backplane design; stages potentially operate in parallel. With a modest

l increased communications bandwidth, as many links in amount of buffering, the communication of data or par- a system operate concurrently; and tially computed results between stages will, in many cases,

l easy interconnection of devices with different word completely overlap processing (eliminating communi- lengths and performance. cations overhead). The structure maps onto one or more transputers, up to the number of components in the Transputers with different word lengths and performance structure, with the communications performed internally all interwork together, ensuring the easy upgrading of sys- or via transputer links, as appropriate. For a pipe- tems as the technology advances. It is not necessary to line structure, its slowest stage limits the maximum downgrade a connected set of components to the perfor- performance. mance of the slowest component! 2) Geometricparallelism. Designers partition the design on Each transputer contains a separate communications en- a regular baSiS. For example, they can divide a screen gine, allowing communications to proceed in parallel with image into quadrants, or a matrix operation into the execution of processor instructions. Indeed. many appli- submatrices. A separate process performs the computa- cations completely overlap communications and processing, tion for each partition. The processes often operate in- maximizing overall system throughput. dependently or interact only with immediate neighbors. Two link adapters and a link switch add to the flexibility A particularly beneficial use of geometric parallelism in- and use of communications. The link adapters provide an volves scaling up the size of a problem to be solved interface between a link and a byte-wide port. The Inmos (such as performing weather forecasting on a finer mesh). COO4link switch provides a crossbar switch between 32 links, Designers usually implement geometric parallelism by controlled by a separate configuration link. The COO4 is allocating processes straightforwardly onto a physical cascddable, allowing for the construction of arbitrary networks regular network. Geometric parallelism usually attains of transputers (limited only by the number of links on each maximum performance by considering granularity issues. transputer; most transputers contain four links). The processes communicate very efficiently with their immediate neighbors when executing on the same pro- Programming paradigms cessor. The processes communicate less efficiently with While designing transputer-based hardware to perform at immediate neighbors when executing on separate pro- any desired level of performance is easy, one must ensure cessors. Where one process is allocated to each proces- that the software configuration exploits the hardware archi- sor (see Figure 4a on the next page). communication tecture. Software structured as a sequential program with costs dominate performance in most instances. With all conventional compilation operates no hster on 10 transputers processes on a single processor, computation time than on one! dominates performance. For embedded systems, the design of software architec- Computation and communication achieve balance at ture and hardware architecture optimally occurs hand-in-hand, an intermediate level of granularity (see Figure 4b). This

December 1990 79 ro\v and easily defined interface specifications allow for easy reasoning about an application’s correctness.

The next-generation transputer For the past two years. a team at Inmos’s bistol, Engjand. design center has heen working to enhance the tmnsputer‘s performance and suitability for emheddcd systems. From this work, a new product family-based around a new processor code-named Hl-will be launchecl in Spring 1991.” The team‘s design goal called for establishing a new stan- (4 W dard in single-processor performance while enhancing the trdnsputer family‘s position as the premier multiprocessing IJ Processor 0 Process - Occam channel microprocessor. It also required maintaining upwartl coin- patihility with existing transputer products. To meet these goals, the team developed a new micro- Figure 4. A simple allocation of geometric parallelism architecture that implements the same instruction set as the granularity (a); a balanced allocation of geometric existing Inmos T8Oi transputer. The Hl provides an order- parallelism (b). of-magnitude increase in performance, combined with en- hdncecl capabilities to support the sofbxire standards emerging in the enibedclecl systems marketplace. balance results from ;I boundary-to-area effect (for two The HI architecture includes such key features as a dimensional gritls) in \vhich the amount of communica- pipelined. combined with on-chip cache tion at the boundary varies linearly with tk grain size. RAM. and improved communications that provide a new ancl the area of computation varies as the square of the degree of freedom in multiprocessor programming. grain size. In comparing the two figures. the boundary To complement the HI transputer. Inmos is now design- to-are3 effect in Figure 4b results in four times the amount ing a range of network communication products based on a of computation, but only twice the communication 3s new lOO-Mbit/s link protocol. The protocol supports the dy- found in l?gwe 421. A similar sur~dce-to-voluime effect namic routing of messages between processors. occurs for three-diinension~~l grids. 3) F6Wn7ilg. Designers &vi& the application into small, Hl performance similar pieces (for example, when needecl to process a The Hl provides a peak performance in excess of 150 large number of similar data items). Each transputer in a MIPS (million ) ancl 20 Mflops (mil- net\vork provides :I process. and a master pro- lion floating-point operations per second) and ;L sustained cess allocates (or “farms out”) work to the servers as performance exceeding 60 MIPS and 10 Mflops. It maintains they become free. Farming provides two major benefits. instruction-set compatibility \\;ith the T8Oi. First, it automatically balances the load--a server corn- A number of design features contribute to the achieve- pleting one piece of work immecliately proceeds to the ment of these performance levels. The processor itself uses a next. Second, it functions relatively independent of the pipelined. superscalar architecture, \vhich executes up to eight topology--any reasonably linked nemork works well. instructions on each clock cycle and operates at 3 clock speed The limit to performance is the rate of dispensing work of 50 MHz. The number of cycles required to execute many and handling results. instructions-such as integer and floating-point multiply, and logical shift--decreases significantly. The Ocam model of par:&zlism offers a significant hen- I:nlike other superscalar machines, the Hl architecture does cfit: Messages sent and received completely &fine a process. not require an advancecl compiler to schedule the different The designer can structure ;1 process internally as a set of fWKtiOnal units in the processor. Hardware controls the flow processes. thereby using any clesired level of nesting. A very of multiple instructions through the pipeline. It is not neces- powerful technique, therefore, combines the above paradigms sary to modify existing compilers or recompile source code. in one application, such as a farm of geometric-array servers An advanced suhmicron, CMOS (complementary metal- functioning with pipeline components. (The earlier Occam oxide semicontluctor) process allows a high transistor count box diagrams the Occam program that encapsulates all three and high clock frequency operations. This process enables paradigms within a simple top-level structure.) the implementation of a sophisticated processor and pro- The designer can use these models to easily structure ap- vides 16 Kbytes of on-chip cache memory. plications tc) define a large amount of par:tllelism. The nar- The move to a cached architecture is 3 radical develop-

80 IEEE Micro ment. The 16.Kbyte cache is sufficiently large to achieve high In contrast. hut in keeping with its intended market, addi- hit rates for most applications. The HI still allows for directly tional Hl transputer enhancements allow programmers to addressed. on-chip RAM for applications containing only SlTldll write more efficient real-time kernels. These enhancements amounts of memory. or ones intolerant of indeterminate access and control the state of the machine, the process and performance c;iiised hy cache-line misses. timer queues, and time slicing and interruptdhility mechanisms. The design team took great care to ensure that the Hl transputer \vill provide high performance levels in low-com- New freedom ponent-count systems. For example, the Hl will provide a A limitation in exploiting existing tuansputer nemorks is programmable memory interface with a @bit data sus- the need to match the parallel structure of the algorithms taining high data transfer rates for cache-line refill. The inter- used to the interconnectivity provided. In the worst case, a face supports four independent banks of external memory, specific machine possesses a fixed topology. In the best case and the timing for each hank is configured independently (21 totally reconfigurahle architecture sucl~ as the Supernode). from software. For example. an application designer can the limitation of four links per transputer restricts mapping. choose to fill two hanks with dynamic RAM-one bank with This limitation can result in poor software portability. virtual RAM, the other with peripherals. Such a system fre- nonoptimum design, and scalahili~ prohlcms. quently requires no external support logic. The Hl product family largely eliminates this problem h) providing hard\~are to allow transputer connections via a Hl error-handling and user-mode processes low-latency communication network. It supports communiG The Hl transputer hardware supports the same scheduling cation channels l,et\veen any two processes anywhere in the algorithms used by current-generation transputers. such as nemork. the T80i. In addition. on the 111transputer each process may The hard\\are simplifies programming hecause designers use ;I second process (known as a trap handler). When an do not need to consider how to allocate processes to the error-such as an integer overflolv or ;I tloating-point er- transputer network until after completion of the program ro--occurs. control transfers to the trap handler. The trap lvriting. They can USC different allocations on different ma- handler copes with the error in soft\vdl-e in all cases before it chines and they can change the allocation to optimize per- (in most instances) returns control to the process in which formance. In addition. it is possible-at least in principle-to the error occurred. let the compiler make this allocation. effecti\.ely removing all The Hl also supports a separate user mode. This mode configuration tlet& from the program I’ountain” discusses the prevents privileged instructions (including communications coiniii~lnications capabilities of the 111 product fdmily in and scheduling instructions) from executing. It checks and greater detail. ‘I’he following paragraphs sum up thrse translates all memory accesses from 3 logical to the physical capabilities. address space. The Hl transputer itself contains a separate communica- tions processor, which multiplexes a large number of logical communication links (virtual links) along each of its physical H 1 transputer enhancements links. Each virtual link supports one Occam channel in each direction. allow programmers to write The communications processor transmits messages as a sequence of packets. \vhich all contain 32 bytes of data ex- more efficient real- time kernels. cept the last packet. Each message packet starb with ;I header, bvhich routes the packet through the communication net- work and identifies the destination virtual link on the remote Memory protection and address-translation mechanisms transputer. specifically support secure programming and &hugging in Designers construct a separate communications nehvork embedded systems. For dedicated (single-user) systems. the using the Inmos Cl04 routing device and can use one Cl04 protection aids the detection of programming errors. For to connect small numbers of Hl transputers. In larger systems. multiuser. genera-purpose computing systems, it protects designers can use Cl04 connections to form a hypercuhe, a users and the operating system from erroneous programs. mllltidilnensional grid. or a tree network. The protection and translation mechanisms are optimized Each Cl04 provides 32 bidirectional links. The header of for the requirements of embedded systems. These mecha- each packet arriving on a link input determines the link on nisms allow the processor to execute code in protected (user) which to output the packet. As soon as the link output is mode at the same speed 21s normal processes without the free. the whole packet transmits through it. performance overhead involved in supporting page-based. An algorithm known as interval labeling decides through . which link to send a packet. In intenlal labeling. each output

December 1990 81 Transmters

link is associated with a continuous set of header values (an interval). The header of an incoming packet lies within only Colin Whitby-Strevens manages the one range. and the packet transmits to the associated link. Central Technology Group in the Micro- Optimum, deadlock-free labeling schemes exist for each of processor Design Division of Inmos Lim- the common network topologies. ited. His interests include microprocessor The Cl04 provides additional facilities to connect networks architecture, parallel programming and together and reduce the impact of message congestion on system design, and programming lan- worst-case latency and bandwidth in heavily loaded networks. @ldgeS. He recently chaired a series of Commission of European Countries workshops-involving more than 70 companies THETRANSI ’UlER IS WELL ESTABLISHEDas a highly cost- and organizations-to establish the Open Microsystems Ini- effective processor, particularly in embedded applications. It tiative. Prior to joining Inmos, he was a lecturer in computer provides unique advantages for applications that require more science at the University of Warwick, where he developed a than one processor, and serves as the basis for many research multiprocessor operating system for the Modular One com- programs in parallel computing. puter. He subsequently established the Warwick Distributed Inmos developed the current generation of transputers as Computing Research Project, which developed some of the general-purpose components for special-purpose machines. key concepts subsequently exploited in the Inmos transputer The introduction of a higher performance transputcr-sup- architecture. porting virtual communications, memory protection, and other Whitby-Strevens holds a BSc degree in mathematics from advances-represents a significant step toward the develop- Hull University, and the Diploma in computer science and ment of general-purpose, multiprocessing computing systems. PhD from Cambridge University. Research continues on the architecture to effectively exploit IIe authored 30 technical papers and coauthored BCPL- the full capabilities of VLSI technology during the 1990s. no Language and Its Compiler. He is an affiliate of the IEEE Meanwhile, the HI provides highly efficient implementations Computer Society, a member of the Association of Comput- of conventional operating systems and real-time kernels. It ing Machinery, and an associate member of the Rritish Corn- greatly reduces the cost of porting existing software and up- puter Society. grading existing applications to take advantage of the trdnsputer’s capabilities. /JJ Address questions concerning this article to Colin Whitby- Strevens, Inmos Limited. 1000, Aztec West, Almondsbury, Bristol, IS12 4SQ, United Kingdom.

References 1. A. Baker, “A Signal Achievement,” Parallelogram Infemat/ona/, Vol. 2, No. 29, Aug. 1990, pp. 1O-1 1. 2. M. Homewood et al., “The IMS T800 Transputer,” /EEf Macro, Vol. 7, No. 5, Oct. 1987, pp. 1O-26. 3. lnmos Limited, Occam 2 Reference Manual, Prentice-Hall, London, 1988. 4. lnmos Llmited, Transputer instruction Set-A Compiler Writer’s Guide, Prentice-Hall, 1988. 5. lnmos Llmited, Communicating Process Architecture, Prentice- Hall, 1988 6 A.J.G. Hey, D.J Pritchard, and C. Whltby-Strevens, “Multlpara- digm Parallel Programming,” froc. Hawalr ht’l Conf. SyIrem Sciences, IEEE, New York, 1989, pp. 716-725. 7. J Barnesand C Whltby-Strevens, “High Performance Ada Using Transputers,” Defense Computing, Vol. 1, No. 5, Sept./Ott. 1988, pp. 45-49. Reader Interest Survey 8. C Dyson, “lnmos HI Architecture Revealed,” New Electron& Indicate your interest in this article by circling the appropriate Vol. 23, No. 8, Sept. 1990, pp. 2 I-24. number on the Reader Service Card. 9. D. Pountain, “Virtual Channels The Next Generation of Transputers,” Bfle(European and World Edition), Vol. 15, No. 4, Low 156 Medium 157 High 1%’ Apr. 1990, pp. 3-12.

82 IEEE Micro