

PROGRAMMABLE DIGITAL MICROCIRCUITS - A SURVEY WITH EXAMPLES OF USE

C. Verkerk CERN, Geneva, Switzerland

1. Introduction

For most readers the title of these lecture notes will evoke the microprocessor. The fixed instruction set microprocessors are however not the only programmable digital microcircuits and, although a number of pages will be dedicated to them, the aim of these notes is also to draw attention to other useful microcircuits. A complete survey of programmable circuits would fill several books and a selection had therefore to be made. The choice has been to treat a variety of devices rather than to give an in-depth treatment of a particular circuit. The selected devices have all found useful applications in high-energy physics, or hold promise for future use.

The microprocessor is very young : just over eleven years old. An advertisement, announcing a new era of integrated electronics, which appeared in the November 15, 1971 issue of Electronic News, is generally considered its birth certificate. The advertisement was for the 4004 and its three support chips. The history leading to this announcement merits recalling. Intel, then a very young company, was working on the design of a chip-set for a high-performance calculator, for and in collaboration with a Japanese firm, Busicom. One of the Intel engineers found the Busicom design of 9 different chips too complicated and tried to find a more general and programmable solution. His design, the 4004 microprocessor, was finally adopted by Busicom, and after further negotiation Intel acquired the marketing rights for its new invention. The firm's marketing department, however, was not interested in the product, arguing that the new chip would conquer at most 10% of the minicomputer market, which was then 40'000 units

yearly. History does not record what happened to the marketing manager. What happened to the microprocessor is well known : the 4-bit 4004 was superseded after one year by the first 8-bit microprocessor, the 8008, followed in 1974 by the 8080 and the 6800, and by PACE, the first 16-bit microprocessor. The years 1978/79 saw the birth of the well-known 16-bit devices : 8086, Z8000 and 68000, and also the highly capable 8-bit 6809. Sales also rose with astounding speed : in 1979, 75 million microprocessors were sold and at the end of that year the total number sold was 110 million. This rate of growth has been maintained since.

What happened to all these processors ? Very few were in fact used as direct replacements of minicomputers, at least in the beginning. Many were used in a large variety of control applications, replacing random logic. The overwhelming majority however ended up in new applications which even the most imaginative marketing manager could not foresee : arcade games, car ignition systems, home appliances, hobby computers, office systems, etc.

The microprocessor was invented to replace random logic, and the first to apply it were electronics engineers, accustomed to logic design but not well versed in programming. The programs these engineers developed were in general short and dedicated. As soon as they were considered to be working, these programs were "petrified" in a Read-Only Memory (ROM). In the early days most programming was done in "hexadecimal", e.g. directly in machine language. The majority of the users were unaware of the possibilities better programming tools could provide. Even assembly language programming was believed to be beyond reach, both for the engineers and for the microcomputer systems (there is a book, published in 1976/77, which dedicated exactly two of its 300 or more pages to programming). It should therefore be no surprise that the early generations of microprocessors had rather primitive architectures. Technological limitations obviously were also an important obstacle to making supermicros, but they were not the only one, contrary to popular belief.

In more recent applications, such as office automation systems and personal computers, a better architecture of the processor is required, in order to support higher-level language programming. These systems are in fact used in much the same way as mainframes and larger minicomputers are. They therefore run operating systems, have real peripherals attached to them (hard or flexible disc, alphanumeric or graphic display, keyboard and high-quality printer) and the user expects to find high-level languages in addition to special application packages. Part of these notes will attempt to describe to what extent the modern 16-bit microprocessors satisfy the requirements of efficient implementation of high-level languages and operating systems.

Besides making it easier to program microprocessors, semiconductor manufacturers have also constantly strived to improve their performance. Working with longer words, as the 16-bit processors do, in itself improves the performance considerably. Progress in MOS device technology has resulted in speeding up the internal operations by more than an order of magnitude. Performance improvements were also sought in two other directions : the use of bipolar device technology to overcome the speed limitations of earlier MOS devices, and the use of arithmetic attachments to increase the numeric processing power. The first road led to the bit-slice microprocessor, whereas the second ended in the design of a few interesting floating-point arithmetic devices. Both will receive some attention in these lectures.

The modern microprocessors, where the designer succeeded in putting of the order of 10⁵ transistors on a little chip of silicon, usually 6×6 mm², are perfect examples of the so-called "Very Large Scale Integration" (VLSI) technique. That this has been at all possible is, in addition to the progress in device technology mentioned before, due to the availability of Computer-Aided Design systems and techniques. The latter, and the need to train design engineers for the semiconductor industry, have spurred the work on VLSI which is now undertaken at a number of universities and research laboratories. Some interesting results have been obtained, and the last part of these notes will briefly review a few. The research is mainly oriented towards investigating novel computer architectures, but some devices for practical use have also been designed in universities.

A microprocessor chip is not a viable object, generally speaking, without its complement of support chips : memory in its different appearances, interface adapters, error detection circuits, analogue-to-digital converters, etc. Many of these support circuits have wide possibilities for control by the microprocessor and should therefore be included in the class of programmable digital microcircuits. A few will be mentioned, mainly because they have found uses in high-energy physics experiments.

Table I

Classification of the processor chips

CALCULATOR CHIPS

FIXED INSTRUCTION SET MICROPROCESSORS :
- 1-bit
- 4-bit
- *8-bit
- *16-bit
- 32-bit

SINGLE CHIP MICROPROCESSORS

*BIT-SLICES :
- RALU
- SEQUENCER
- SUPPORT (MMU, DMA)
- SPECIAL (FFT)

VLSI :
- SPECIAL PURPOSE (CIPHER, *GRAPHICS, *SYSTOLIC, MOUSE)
- TREE MACHINES
- SIMPLE INSTRUCTION SET (*LISP, *RISC)
- NON-VON NEUMANN

Table II

Examples of support chips

MEMORY :
- RAM
- ROM, EPROM (firmware chips)

TIMERS/EVENT COUNTERS

ARITHMETIC :
- MULTIPLIERS
- FLOATING POINT PROCESSOR SUPPORT

PLAs, PALs

I/O SUPPORT :
- INTERRUPT CONTROLLER
- DMA CONTROLLER
- GENERAL INTERFACE (PARALLEL, SERIAL)
- KEYBOARD, LED DISPLAY
- DISC-, CASSETTE INTERFACE
- CRT CONTROLLER, CHARACTER GENERATOR
- DOT-MATRIX PRINTER CONTROL
- COMMUNICATION (HDLC, ETC.)
- CYCLIC REDUNDANCY CHECK
- ENCRYPTION, DATA SECURITY
- INTERCONNECTION, TRANSCEIVERS
- IEEE-488 TRANSCEIVERS
- BURST-ERROR DETECTION
- STEPPER MOTOR CONTROL
- DIGITAL --> ANALOG
- ANALOG --> DIGITAL
- SPEECH SYNTHESIS
- SPEECH ANALYSIS

Finally, if we were to make a classification scheme of the whole area of programmable digital microcircuits, we could distinguish between the processors proper and the support chips. The processors can be further subdivided into user-programmable and fixed-program processors (e.g. the marvels you find inside your electronic watch). Still finer subdivisions of the user-programmable processors and of the support chips are shown in Tables I and II. The devices which will receive some attention in the following pages are marked with an asterisk.

2. Bit-Slice Microprocessors.

The Field Effect Transistor, implemented in MOS technology, made it possible to put a complete processing unit on a single chip : first (1970) 4 bits wide, soon (~1973) 8 bits and later (1978) 16 bits. Constant improvements in packing density and power dissipation of the transistors made this progress possible. The earlier microprocessors lacked however dramatically in power when compared to even the lowest range of the minicomputers then available. The processing speed could only be improved by using bipolar technology (signal propagation times through a gate of 1-3 ns, compared to ~10 ns for MOS technology around 1974), but this did not allow the packing density and the power dissipation needed. Instead of a complete 8-bit processor, including instruction decoding and control, interrupt handling etc., which was possible in MOS technology, either TTL or ECL technologies allowed at best the implementation of a 4-bit wide Arithmetic and Logic Unit with its control and a few registers. To make use of such an ALU in a processor, external control circuitry is needed to ensure proper execution of the instructions in a program. On the other hand, longer data words could easily be handled by concatenating several chips, as many as the number of bits in the data word required : eight 4-bit chips put together produce a 32-bit processor. The circuit represents therefore a 4-bit wide vertical "slice" of the 32-bit processor, including its internal registers used in arithmetic operations (such as accumulators or general registers and condition code registers).

One should note that building a 16- or 32-bit wide processor from 4-bit wide slices is different from the way processors used to be built. The ALU used to be a purely combinatorial device, without internal storage capability. The accumulators or general-purpose register files used to be built up from chips containing flip-flops. The parts indicated in Figure 1 in the block marked Arithmetic Processor Unit, consisting of ALU and shifter, were separate items. The control unit saw to it that the operands were picked up from the register file, transported to the inputs of the ALU, and that the result was correctly routed back to its destination in the register file (for simplicity we consider only register-to-register operations for the moment). This necessitated not only control signals to be sent to the ALU, to indicate what operation (add, subtract, AND, OR, etc.) to perform, but also control of the register file, of the data paths, and timing signals.

The bit-slice RALU simplifies this control in the following way : the external control signals specify the operation and its operands; the setting-up of the data paths, of the ALU control and the necessary clocking is done internally to the slice. All slices in the processor therefore receive the same control signals (see figure 2). The slices cannot operate completely independently of each other, however. Carries generated in arithmetic operations must be passed from one slice to another, and the same must happen to bits shifted out of a slice in shift operations. In short, one can thus say that slices combine ALU and registers on a single chip, split the overall data busses, share control lines and propagate status.
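The carry chaining just described can be made concrete with a small model. The following C fragment is purely illustrative (the function names and the choice of an addition are ours; no real chip is described) : it builds a 16-bit adder from four 4-bit "slices", each receiving the same operation and passing its carry to the next, as in figure 2.

#include <stdint.h>

/* One 4-bit ALU slice performing an addition: same operation code for
 * every slice, only the carry ripples from slice to slice.            */
static unsigned slice_add(unsigned a4, unsigned b4, unsigned cin,
                          unsigned *cout)
{
    unsigned sum = (a4 & 0xF) + (b4 & 0xF) + cin;
    *cout = sum >> 4;            /* carry out, to the next slice  */
    return sum & 0xF;            /* 4-bit result of this slice    */
}

/* Four concatenated slices form a 16-bit ALU (cf. figure 2). */
static uint16_t add16(uint16_t a, uint16_t b)
{
    unsigned carry = 0;
    uint16_t result = 0;
    for (int s = 0; s < 4; s++) {
        unsigned r = slice_add(a >> 4*s, b >> 4*s, carry, &carry);
        result |= (uint16_t)(r << 4*s);
    }
    return result;
}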

[Figure 1 : block diagram — the Central Processing Unit (CPU) comprises the Computer Control Unit (CCU), a register file, the Arithmetic Processor Unit (APU) and the Program Control Unit (PCU); interrupt and DMA controllers and the I/O connect it to the external world.]

Fig. 1 Schematic diagram of a computer, consisting of a CPU, Memory and Input-Output.

[Figure 2 : four 4-bit ALU slices side by side, all receiving the same operation code OP; the carry output Cout of each slice feeds the carry input Cin of its neighbour, forming a 16-bit ALU.]

Fig. 2 Diagram showing how a 16-bit ALU can be built from four 4-bit slices.

So far we have only spoken about the RALU slices, leaving the overall control aside. The task of the control unit is to provide a sequence of control signals, corresponding to the sequential execution of program instructions. The control unit should thus fetch the next instruction and generate the necessary control signals for its execution. It should however also be able to deviate from sequential execution by making conditional or unconditional branches, by making subroutine calls and returns, and possibly by handling external and internal interrupts. The control unit thus becomes a complicated device and the number of bits it can manipulate will again be limited by technological constraints. Control units - usually called sequencers - can however be concatenated to obtain wider words. The words handled by the sequencer are instruction addresses, and longer words in this case mean that larger programs can be accommodated.

[Figure 3 : a bit-sliced processor, divided into a control section (sequencer slices) and a processing section (RALU slices), with status flags fed back from the processing to the control section, and an I/O unit connecting to the peripherals.]

Fig. 3 Diagram of a processor built from bit-slices. Both the arithmetic (RALU) and the control (sequencer) parts are made up from slices.

Taking a number of RALU and sequencer slices we can build a processor (see figure 3) which is not only faster than the microprocessors of the 70's, but which in addition can have nearly unlimited word length and program size. Apart from the number of chips needed, there is a more fundamental difference with the normal microprocessor : the instruction set the processor will recognize is not uniquely defined by the manufacturer. The designer has considerable liberty in putting the parts together and in deciding which bits of an instruction control what. Moreover, the pieces of instructions the manufacturer has defined (e.g. the control codes for the RALU and the control of flow in the sequencer) are at a more primitive level than most conventional machine instructions to which the assembly language programmer is accustomed. For this reason code at this level is said to be composed of micro-instructions. Another important fact to realize is thus that bit-slices require microprogramming, a level of programming well below normal assembly languages. (This does not exclude that other levels of programming can be put on top of the microcode, as we will see later.)

To make the distinction between an assembly language program and microcode clear, let us consider the following example : assume we have a machine with a number of general-purpose registers, say R0-R7, and we want to add the contents of R0 to the contents of a memory location. The address L of the memory location is held in register R2 : (R2) = L, where (R2) means "contents of R2". In assembly language this could be written as ADD @R2,R0.

Translated into, for instance, PDP-11 machine code, this would result in a 16-bit instruction, which would do exactly what was described above. In microcode, this simple operation would have to be broken down into several small steps - the same steps indeed that occur internally in the PDP-11. Assuming that the same 16-bit instruction code is fed into a bit-slice machine, the steps to be executed by the microprogram could be :

Fetch and Decode sequence, common to all instructions :

1. (PC) --> MAR ; initiate memory read : transfer contents of Program Counter to Memory Address Register, to fetch a new instruction.
2. (MDR) --> IR : transfer contents of Memory Data Register to Instruction Register.
3. Decode contents of IR ; (PC)+1 --> PC : decode operation code and addressing modes; the result is an address M in microstore. Update the Program Counter.
4. Transfer microprogram control to the sequence of micro-instructions which will execute the instruction held in IR (e.g. transfer control to microcode address M found in step 3).

Now follow the steps of the ADD sequence for this particular instruction :

5. (R2) --> MAR ; initiate memory read : contents of R2 to MAR.
6. (MDR) --> A input of ALU ; (R0) --> B input of ALU ; set up ALU control for "ADD" : contents of the memory data register to one input and of R0 to the other input of the ALU.
7. Output of ALU --> MDR ; initiate memory write : the result goes back to memory.
8. Go back to step 1 : go back to fetch the next instruction.

Many steps are needed for such a simple operation. Note that not all steps take an equal amount of time. Depending on the speed of the memory there may be considerable delays between steps 1 and 2 and between steps 5 and 6. Also note that it is assumed that there are sufficient data paths available, so that the actions in step 6 may be done simultaneously. What do we really have to do to execute these atomic actions ? Look at steps 1 and 5. We see that the source for loading the MAR can be the PC or one of the general registers. This means that the correct register must be selected and its output gated to the inputs of the MAR. This implies enabling gates and setting up multiplexers, in order to route the data to its destination. An important task for a micro-instruction is thus to set up the route for the data, in much the same way as a railway signal man sets the points to direct an incoming train to the correct platform. Another task is to enable the MAR inputs, so that the next clock pulse will actually strobe the data into the register. All this control information is provided by the micro-instruction's fields. The micro-instruction itself can also contain the address of the next micro-instruction to be executed and thus define the flow of control of the microprogram. What has been described so far is exactly the microprogramming model proposed in 1951 by Wilkes and shown in figure 4. Each clock pulse causes a new micro-instruction to be strobed into the decoder and thus new control information to become available, together with the next address.

[Figure 4 : Wilkes' model — a control store (diode matrix), addressed through a decoder, delivers on each row the control information (output, state transitions) together with sequencing information (the next address); status from the ALU steers conditional branches, and the instruction register supplies the target address.]

Fig. 4 Wilkes' microprogramming model. Each microinstruction contains control information together with the address of the next microinstruction.
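The fetch-decode-execute steps listed above can be summarized in a functional sketch. The following C fragment is a minimal model and not a description of any real bit-slice design; the names mem, MAR, MDR, IR, PC and the eight-register file are assumptions made for illustration only.

#include <stdint.h>

static uint16_t mem[65536];      /* main memory                     */
static uint16_t R[8];            /* general registers R0..R7        */
static uint16_t PC, MAR, MDR, IR;

/* Steps 1-4: fetch and decode, common to all instructions. */
static void fetch(void)
{
    MAR = PC;                    /* step 1: (PC) -> MAR             */
    MDR = mem[MAR];              /*         memory read             */
    IR  = MDR;                   /* step 2: (MDR) -> IR             */
    PC  = PC + 1;                /* step 3: update program counter  */
    /* step 4: decode IR and dispatch to the microcode sequence     */
}

/* Steps 5-8: microcode sequence for "ADD @R2,R0". */
static void exec_add_indirect(void)
{
    MAR = R[2];                  /* step 5: (R2) -> MAR, read       */
    MDR = mem[MAR];
    MDR = MDR + R[0];            /* step 6: ALU adds (MDR) and (R0) */
    mem[MAR] = MDR;              /* step 7: result back to memory   */
    /* step 8: fall through to the next fetch                       */
}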

We have only touched the surface of microprogramming so far, but it should already be clear to everyone that writing microcode is not a simple task and that it requires very intimate knowledge of the hardware of the machine. We can learn still more from the example : the first four steps are necessary to fetch (from main memory) and decode instructions of a higher level than the micro-instructions themselves. In the example a PDP-11 machine instruction was fetched and then executed. Writing short microcode sequences, one for each conventional machine instruction, is therefore equivalent to defining a set of such machine instructions. In other words, we have created the possibility to write programs in assembly language, from where we can build up to higher-level languages. By writing other, longer sequences of micro-instructions, we could define a set of commands of a higher level; for instance a set particularly well suited for the interpretation of Pascal P-code. The task of writing microcode for the execution of all defined instructions has to be performed only once (by an expert !). Once it is done, the machine presents itself to a programmer just as any other machine. The microcode can be stored in a Read-Only Memory, in which case we have obtained a fixed instruction set processor. When the microcode is stored in a Read-Write memory, the possibility exists to add new instructions or sequences of instructions : the machine is user-microprogrammable.

The designer of a bit-slice machine is free to define the instruction set his processor will execute. He can define none : in that case the processor can only be programmed by writing microcode. This can be done once and for all, so that a fixed-program machine will be the result. The solution generally adopted is to leave it to the user to program the processor. Several examples of such processors-to-be-microprogrammed exist and are used in high-energy physics experiments : ESOP 2), CAB 3), MONICA 4).

The designer can also choose the instruction set of another, hopefully well-known and widely used, computer, e.g. PDP-11 or IBM 370/168. In that case he creates an "emulator". Emulators are also used in high-energy physics experiments ; examples are MICE 5),6) and the 370/E 7).

The last choice the designer has is to define an instruction set of his own liking, which has as a consequence that all software must be developed from scratch. There are many examples of commercial machines 8) in this category : Nanodata QM-1, Control Data 5600, Burroughs B1700, etc. GESPRO 9) is an example from high-energy physics.

Of all these possibilities, emulators have the invaluable advantage that software running on the emulated machine will also run on the emulator (if it does not, then there is something wrong with the emulator). This means in practice that high-level languages are available, and that programs can be compiled and debugged on a large machine with excellent facilities. The final program can however run in a small, cheap machine (the emulator), without fancy peripherals but dedicated to its task. Emulators tend however to be slower than directly microprogrammed machines. A user-microprogrammable emulator makes the best of both worlds.

For completeness we must mention that there is another way to make an "emulator". The processor can be made in such a way that it can only run microcode, which is obtained by translating machine code for the emulated computer into microcode for the emulator, instruction by instruction. With a proper design of the bit-slice processor this translation can be simple and a program can be written to do it automatically. There are a few problems, which can be solved in practice. Normal programs mix instructions and data; the emulator must keep the microcode separate from the data, so the translator program must do some more work. The other problem is that the microprograms generated are long, generally longer than the original machine code, so the microstore must be large to contain them. And as the microstore must also be made of fast memory chips (the speed of operation depends directly on the access time of the microstore), this type of emulator is generally expensive. The well-known example (and actually the only one known) of this type of emulator is the 168/E 10),11),12),13), of which between 20 and 30 are in operation in high-energy physics.

We noted already that writing microcode is difficult and tedious and that it requires expertise. It is therefore important to use good tools when writing microcode. Several good microcode assemblers now exist. In fact they are meta-assemblers 14), which means that the code to be generated is not pre-defined inside the assembler, but has to be defined by the user. A meta-assembler works in two (or three) phases :
- the definition phase. In this phase the format of the micro-instruction is defined, symbolic names are given to fields and default values are attributed. Macro definitions are also made in this phase.
- the assembly phase. During this phase the symbolic micro-instructions (which use the field names and the macro definitions) are assembled into binary code.

- a post-processing phase, in which the binary code may be re-formatted for programming of PROM chips or for use by a loader.

A good meta-assembler will allow for almost any width of micro-instruction, so that horizontal as well as vertical microcode may be assembled. It will also have macro facilities and will allow nesting of macros. Error detection is another important feature.
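To give an idea of what the definition phase fixes, here is a C rendering of a hypothetical micro-instruction format; the 24-bit width, the field names, the field positions and the field values are all invented for illustration and do not correspond to any particular meta-assembler or chip set.

#include <stdint.h>

/* Hypothetical 24-bit micro-instruction layout, as a meta-assembler's
 * definition phase might fix it (all names and widths invented):
 *   bits 23-20  ALU operation        bits 19-16  A-operand register
 *   bits 15-12  B-operand register   bits 11-0   next microcode address */
#define UI(op, a, b, next) \
    (((uint32_t)(op) << 20) | ((uint32_t)(a) << 16) | \
     ((uint32_t)(b) << 12) | ((uint32_t)(next) & 0xFFF))

enum { ALU_ADD = 1, ALU_SUB = 2 };   /* invented field values */

/* The "assembly phase" then turns symbolic micro-instructions into
 * binary words such as these:                                     */
static const uint32_t microcode[] = {
    UI(ALU_ADD, 2, 0, 0x001),        /* R2 + R0, then go to 0x001 */
    UI(ALU_SUB, 3, 1, 0x000),        /* R3 - R1, then go to 0x000 */
};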

Another software tool which is very important for debugging microcode is a simulator. Instead of writing an ad hoc simulator for every processor built from bit-slices, it is preferable to use a general tool. The ISPS system (Instruction Set Processor System) 14),15) is an example. The user writes in a hardware description language a definition of the machine (which as a matter of fact constitutes an excellent piece of documentation). ISPS compiles the description and the user can then interactively simulate the behaviour of the hardware. He can set breakpoints, inspect the contents of registers or memory, etc. All micro-instructions and also sequences can thus be tested and the hardware verified before building it. The interested reader is referred to ref. 14 for more details.

A number of bit-slice families exist, but the AMD 2900 family is by far the most popular. The RALU slice is 4 bits wide and contains a file of 16 registers, of which two can be accessed (for reading) simultaneously. There is an extra register connected to the output of the ALU, which can be shifted, and there is an additional shift register. The chip is controlled by 9 bits, divided into three encoded fields. A few sequencer chips belong to the family; the AMD 2909 is one of them. It contains a microprogram counter register (the equivalent of a normal program counter) with its increment logic. External addresses can be strobed into the microprogram counter, so that jumps and branches can be made, and a small stack, 4 deep, is used for storing subroutine return addresses. The sequencer slice is also 4 bits wide. A bit-slice family which is used for its high speed is the Motorola 10800, implemented in ECL. The essential difference with the AMD 2900 is that the ALU slice does not contain a register file. It is bus-oriented and a separate register file slice can easily be combined with the ALU slice. The sequencer is much more involved than the AMD 2909. The family also comprises a Memory Control slice, which contains a 4-bit ALU. Address calculations may thus be made elsewhere than in the main ALU of the processor. Figure 5 shows a

[Figure 5 : a 1K × 112-bit micro store feeds the microinstruction register, which controls MC10800 ALU slices with register file, an MC10801 microprogram control function and an MC10806, connected by the I- and O-busses, together with condition code logic, auxiliary address control, interrupt logic and target instruction decoding.]

Fig. 5 Simplified block diagram of a processor built from the Motorola 10800 bit-slice family.

block diagram of a processor constructed with the 10800 family. This processor emulates the PDP-11 instruction set and has been described elsewhere 6).

A large body of literature exists on microprogramming 16),17) and on bit-slices 18),19), which shows that there is much more to be said on the topic than can be covered in these few pages. For detailed descriptions of the various chips, the reader is referred to the manufacturers' literature.

Bit-slices are components for building powerful processors. They require an integrated hardware/firmware/software design. Designing with bit-slice chips requires specialists and good tools, but the result can be an excellent machine, as some of the examples from high-energy physics have amply shown.

3. Fixed Instruction Set Microprocessors.

The class of fixed instruction set microprocessors comprises all the popular devices which made the microprocessor revolution : 8080, 8086, 6800, 6809, 68000, Z80, Z8000, MOS Technology 6502 and many, many others. As the name indicates, the designer and programmer have no control over the instruction set, which has been fixed once and for all by the manufacturer. For a critical application the designer will have to make a careful selection of the microprocessor chip which has the best chance of providing a reasonable solution to the problem. Really critical applications are however extremely rare - and then in many cases solved by using a bit-sliced processor - and the designer's choice is therefore mostly guided by other criteria : familiarity and experience with a particular type, or with a member of the same family, cost, availability of monoboard computers and of suitable support chips, etc.

For industrial applications and for consumer products, where cost is the overriding factor, the simple 4-bit microprocessors have not been abandoned. The single chip microprocessors, combining a CPU and memory on the same piece of silicon, are also very popular in products which are sold in large quantities. In these applications, the processor runs a simple program, which need not be changed after suitable debugging. The development cost, including the software development, is quickly amortized and there is no need for "sophisticated" tools. High-level languages are therefore not used at all, and the processors also do not need features which make programming easier, execution faster, or even code more compact. Pushing this argument to the extreme, a 1-bit microprocessor has been manufactured for industrial control applications. Intended as a - cheap - replacement for relay logic, this processor, the Motorola MC14500B, has found an interesting application in an experimental vector processor, to which we will come back in one of the following chapters.

There are however many applications where the expected volume of sales does not reach millions, but where adaptability of the product is of great importance. For a long time this has been the realm of the 8-bit microprocessors, until progress in device technology made the 16-bit microprocessor possible. With the 16-bit processor came also a breakthrough in the architecture of the machines, turning them into real computers, which stand comparison with the typical minicomputer of the mid-seventies. This does however not mean that industry has abandoned the 8-bit micro. Its lower cost and the fact that it is perfectly adequate for a large percentage of all applications (80% ?) account for its lasting popularity. Enormous progress has also been made to overcome the initial limitations of the 8-bit micros, and the 6809 is a perfect example of how added features can enhance a processor and make it much more suitable for programming in a high-level language, without completely losing compatibility with the other members of the family.

Table III

Typical examples of the uses to which the various fixed instruction set microprocessors are put

DOMAINS OF USE :

- 1-bit (MC14500B) : industrial control and ... vector processor
- 4-bit (Intel 4004, Texas TMS 1000) : consumer market
- 8-bit (Intel 8080A, Motorola 6800, 6802, 6809, Zilog Z80, MOS Techn. 6502, etc.) : most widely used, in control, character handling, home computers, terminals, etc.
- 16-bit (Intel 8086, Zilog Z8000, Motorola 68000, National 16032, Texas TMS 9900) : professional applications where (re)programming and throughput are important
- 32-bit (Intel iAPX 432) : "micro mainframe"

Table III summarizes the fields of application of the different classes of microprocessors. In the rest of this chapter we will review some of the features which make the 16-bit processors adult devices, apt to support an operating system and an acceptable programming environment.

The limitations of the 8-bit processors are largely due to the word size, to restricted addressing modes and to the need to keep the CPU simple. These limitations can be summarized as follows :
i) arithmetic operations :
- limited precision ; operations on reasonably sized operands are slow.
- hardware multiply/divide do not exist.
ii) the number of internal registers is limited, resulting in :
- slow operation, as intermediate results must be stored in memory.
- restricted indexing operations, which result in explicit address calculations, slowing down the overall operation.
iii) the majority of instructions require more than one byte, again slowing down execution.
iv) the total address space is limited to 64 Kbytes, precluding the running of large programs and also, to a large extent, the use of high-level languages.
v) limitations in the implemented addressing modes impede elegant solutions for parameter passing in high-level language procedure calls.
vi) the use of absolute addresses is practically unavoidable (ROMs and I/O devices sit at fixed addresses, forcing programs to occupy the holes left). In the absence of universally accepted conventions, software is therefore not easy to transport between systems.
vii) advanced features, such as multilevel interrupts or protection mechanisms, are absent or very primitive.

The consequences of these restrictions are that software tools on 8-bit microprocessors are primitive. Programs written in a high-level language and built up from separately compiled and relocatable modules, linked together and to library routines, are the exception and not the rule.

It took industry a while to realize the impact of these restrictions and their consequences. With advancing technology, emphasis was first put on controller applications, e.g. applications where the total number of chips must be reduced. It is typical for this trend that the 6802, which is a 6800 with 128 bytes of read/write memory on chip and thus particularly suited for controller applications, was developed a few years before the 6809. Both processors have approximately the same number of transistors; the 6809, although still an 8-bit machine, has overcome many of the limitations listed before, the 6802 has not.

The 6800 (and the 6802) are in fact poor in registers, as shown in fig. 6. Such a register set is typical for the 8-bit processors, which are all accumulator-based machines. When we compare with the register set of, for instance, the Z8000 (figure 7), we are struck by two things : there are many more registers and - with a few exceptions - no special roles are attributed to each register. This symmetric use of general-purpose registers is observed for most 16-bit processors, as we will see. The larger number of registers opens a number of possibilities, enhancing the capabilities of the machine. Most of the 16-bit processors that we will see have improvements over their 8-bit predecessors in all of the following aspects :
i) a larger address space is available. Although the length of an address is still 16 bits, as before, all processors (except one) have built-in or external facilities to increase the address space from 64K to a maximum of 16 Mbytes. How this is done we will see in more detail later.
ii) more resources are put at the programmer's disposal :
- more registers, which are often not limited to a special function, but can be used to hold addresses, address pointers or data.

[Figure 6 : 6800 registers — accumulators ACCA and ACCB, index register IX, program counter PC, stack pointer SP and condition code register.]

Fig. 6 Programming model of the 8-bit 6800 microprocessor, showing the registers accessible to the program.

[Figure 7 : Z8000 registers — sixteen 16-bit general-purpose registers R0-R15, pairable as RR0-RR14 and quadruple registers RQ0-RQ12, with R0-R7 byte-addressable as RH0/RL0-RH7/RL7; R14/R15 double as system and normal stack pointers; program status (flags, control word, PC segment number and offset), program status area pointer and refresh counter.]

Fig. 7 Register structure of the 16-bit Z8000 microprocessor, showing the registers accessible to a program.

- better arithmetic capabilities. The word size in itself is a great improvement, but most processors also possess hardware multiply and/or divide capability.
- operations are defined on data types of different lengths : bytes (characters), words, double-words, etc.
- powerful instructions have been added, for instance block move.
iii) all processors have enhanced possibilities for multiprogramming and multitasking :
- task switching and context saving and restoring are eased, and a much greater range of interrupts and traps is available.
- memory protection schemes are sometimes implemented.
- segmentation or paging is implemented, either on chip (this is the case for the 8086), off-chip (68000), or at the user's choice (Z8000).
- all processors have a privileged way of running, reserved for the operating system. These user/supervisor states go hand in hand with the protection mechanisms : a user cannot run in supervisor state and thus cannot corrupt the operating system.
iv) a judicious selection of addressing modes provides elegant ways for parameter passing in procedure calls, which largely satisfy the requirements of block-structured high-level languages.

How is this achieved ? The methods vary from one processor to another, so we will examine a few examples.

Intel 8086.

The 8086, one of the earlier 16-bit processors, bears great resemblance to the 8080, as was intended by its designers 20). Most of the 8080 instructions are in fact compatible with the 8086. A number of improvements have been made, apart from the longer data words and enhanced arithmetic capabilities : signed and unsigned hardware multiply and divide instructions are implemented, accepting words and bytes as operands. The processor consists of two rather separate parts : the Execution Unit (EU) and the Bus Interface Unit (BIU) (see fig. 8). The purpose of the EU is obvious; the function of the BIU is to generate addresses. Both units have their own set of registers, which are shown in fig. 9. The program counter resides in the BIU and is called the instruction pointer (IP). The registers in the EU are mostly special purpose and their roles cannot be interchanged. Thus AX is the main accumulator, also used in multiply and divide operations, BX is a base register, CX is used for counts and loop control, etc.

[Figure 8 : the Execution Unit (general registers, operands) and the Bus Interface Unit (segment registers, instruction pointer, address generation and bus control, instruction queue) communicate over an internal bus; the BIU drives the multiplexed external bus.]

Fig. 8 The structure of the 8086 microprocessor is broken up into two parts : the Execution Unit and the Bus Interface Unit.

[Figure 9 : Execution Unit registers — AX (AH/AL), BX (BH/BL), CX (CH/CL), DX (DH/DL), stack pointer SP, base pointer BP, source index SI, destination index DI and flags PSW; Bus Interface Unit registers — code segment CS, stack segment SS, data segment DS, extra segment ES and instruction pointer IP.]

Fig. 9 The register structure of the Intel 8086.

[Figure 10 : a 16-bit segment base (selected from CS, SS, DS or ES), with four zero bits appended, is added to the 16-bit offset coming from the instruction or from the ALU/A-bus; the 20-bit sum is strobed into the memory address latch.]

Fig. 10 Logical to Physical Address translation in the Intel 8086. This translation is done on-chip.

The segment registers are used to extend the address space beyond the inherent 64 Kbytes. The way this is accomplished is shown in fig. 10 : from the 16-bit address defined in the instruction, a 20-bit physical address is formed by adding a segment base address, which is shifted four places left before the addition. In this way a 1 Mbyte address range is obtained. The segment base register is usually selected by the processor : instructions are always fetched using the contents of CS and IP to calculate the physical address ; stack operations use SS, and data are obtained using DS. ES is used in operations on character strings and determines the destination of the string. This selection of segment base registers can be overridden by attaching a prefix byte to the relevant instruction, but this is done in an asymmetric way : the programmer is not free in his choice. The existence of the segment registers implies that the programmer has at any moment four blocks of 64 Kbytes of memory at his disposal. These blocks may be entirely disjoint or they may overlap. The contents of the segment registers can be changed under program control, of course. It can thus be said that the 8086 has a memory management scheme incorporated in the processor, although it does not have all the facilities one would want for a multiprogramming system. For instance, if two small programs, owned by programmers A and B, are loaded in the same block of 64K, then the program (say A's) that uses the lower value of CS can have undisturbed access to the other's (B's) instructions.
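The translation of fig. 10 is easily expressed in code. A minimal sketch (the function name is ours; the shift-and-add itself is exactly the 8086 scheme) :

#include <stdint.h>

/* 8086 logical-to-physical translation (cf. fig. 10): the 16-bit
 * segment register, shifted four places left, is added to the 16-bit
 * offset, giving a 20-bit physical address.                          */
static uint32_t physical_address(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFF;
}

/* Example: CS = 0x1234, IP = 0x0010 gives physical address 0x12350. */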

On the other hand, the use of segment registers allows a program to be bound to a physical place in memory only at the last moment. A program may also be dynamically relocated, if the move is accompanied by the appropriate change to the contents of the segment registers, and the program or its data do not exceed 64K.

The addressing modes implemented on the 8086 are sufficiently extended that position-independent or re-entrant code can be written without difficulty. The use of the stack pointer and base register allows parameter passing in procedure calls. Another feature of the 8086 is its string handling capability; the string instructions are fully interruptible, which is of course very important for very long strings. The 8086 has however no privileged mode of operation. The reader who wants to know more details of this processor is referred to Intel's literature, or to the article by Morse et al. 20). Consultation of Wakerly's excellent book 21) is recommended, as it provides complete descriptions of a number of modern microprocessors in a unified format, making comparisons very easy.

Zilog Z8000.

The designers of this processor had a number of objectives in mind, which they summarized themselves 22) : increase capabilities, provide architectural compatibility over a range of capabilities, and increase clarity. The first objective is obvious, but the other two merit some explanation. Clarity means that all registers should play the same role and that operations are not implicitly linked to the use of special-purpose registers. A general register architecture, with 16 general-purpose registers of 16 bits each, is the result (see fig. 7). The registers can hold addresses or operands. Architectural compatibility means in fact that there are two compatible models of the Z8000 : the Z8002 or unsegmented version and the Z8001 or segmented version. The models have exactly the same architecture, but their internal structure is different. The unsegmented version has an address space of 64K; the Z8001 can address 8 Mbytes of memory, if an external Memory Management Unit is used. The MMU calculates the physical address from a 16-bit offset value and a 7-bit segment number. The way this is done is shown in fig. 11. Note that the segment number is converted into a segment base address by means of a look-up table. Fig. 12 shows an example of this mapping operation. Note again that logical segments may well overlap physically.

[Figure 11 : the 7-bit segment number of the logical address selects a base address from a table in the Memory Management Unit; eight zero bits are appended to the 16-bit base, which is then added to the 16-bit offset to give a 24-bit physical address.]

Fig. 11 Logical to Physical Address translation for the Zilog Z8000. A Memory Management Unit, external to the processor chip, must be used.

Fig. 12 An example of logical to physical address translation (Z8000). Note that segments may overlap and that the order in which logical segments are stored in physical memory is arbitrary.
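For comparison with the 8086 scheme, the table-driven translation of fig. 11 can be sketched as follows (the table name and its handling are illustrative; the shift by eight places corresponds to the eight zero bits appended to the base address in the figure) :

#include <stdint.h>

/* Z8000/MMU translation (cf. fig. 11): the 7-bit segment number
 * selects a 16-bit base (in units of 256 bytes) from a table loaded
 * by the operating system; the base is added to the 16-bit offset
 * to form a 24-bit physical address.                                */
static uint16_t seg_table[128];      /* segment base addresses      */

static uint32_t z8000_physical(uint8_t segment, uint16_t offset)
{
    uint32_t base = (uint32_t)seg_table[segment & 0x7F] << 8;
    return (base + offset) & 0xFFFFFF;
}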

The Z8000 has two stack pointer registers, which are part of the general-purpose register file ; they constitute a deviation from the clarity principle. One contains the user stack pointer, the other the system stack pointer, which is only accessible when the machine operates in the privileged supervisor state. Certain instructions, including I/O commands, can only be executed in supervisor state. The possible I/O transactions include block transfers between a device and memory. The addressing modes of the Z8000 form a rather complete set, making parameter passing, re-entrancy, relocation etc. possible. In addition, the Z8000 has a three-level vectored priority interrupt and trap system. In short, it provides all the facilities an operating system needs. The Z8000 was in fact the first microprocessor which could be put on the same level as a larger minicomputer. Interested readers, who want to know more details, should consult the literature, particularly references 21 and 22.

Motorola 68000.

The 68000, more recent than the two preceding processors, has become very popular in a short time. It is not only widely used in high-energy physics, but also in many different Personal Work Stations, including IBM's ! This is not astonishing, considering that the 68000 provides all the facilities required for a modern computing system 21),23). Internally it is a 32-bit machine; it interfaces to the external world over 16 data lines. It has two sets of 8 32-bit registers : one set contains data registers, the other address registers. The CPU contains two additional special-purpose registers : the program counter and the 16-bit status register (see fig. 13). In the first version of the 68000 the PC has 24 bits, but this can be extended in later models. In fact, the 68000 should be considered as the first member of an architectural family which, as technology advances, can grow with new members having increased capabilities.

[Figure 13 : eight 32-bit data registers, eight 32-bit address registers, a 24-bit program counter PC and the 16-bit status register SR.]

Fig. 13 The register structure of the Motorola 68000. Note that the general registers are 32-bit wide.

The development of such a family is greatly helped by the fact that the 68000 is a microprogrammed machine. It has even two levels of control : micro- and nano-control (see figure 14). By separating the control of program flow from the control of the combinatorial circuits in the CPU, a saving in the total size of the control store has been obtained 24). The 68000 can operate in one of two states : user or supervisor. The supervisor state uses a separate stack pointer, which cannot be corrupted by a user. The instruction set, modeled after the PDP-11, is consistent, in the sense that any instruction which specifies an operand in memory may use any of the addressing modes.

[Figure 14 : instruction decode feeds the address control of the micro control store (~640 words × 10 bits); the micro words in turn address the nano control store (~280 words × 70 bits), which drives the execution unit; register and function selection, branch selection and conditionals from the ALU modify the sequence.]

Fig. 14 The microprogrammed control of the 68000 has two levels. In addition to the microstore, a so-called nanostore is present. This structure minimizes the number of bits needed for the microprogram.

The 68000 stands out for its large number of addressing modes : 12. Parameter passing is therefore not a problem in a 68000 program. The processor has even gone a step further by introducing two instructions which make parameter passing particularly easy. Besides the stack pointer, another important aid, the stack frame pointer, is introduced explicitly. A stack frame pointer facilitates access to parameters on the stack, by referring to the fixed position of the frame pointer FP instead of to the often varying position of the stack pointer itself. An FP could be defined by the programmer in the processors mentioned so far (including the 6809), but instructions to manipulate the FP, and thus facilitate parameter passing, had not been implemented before. The 68000 LINK and UNLK (unlink) instructions do precisely what is needed. Any address register may be used by LINK and UNLK, together with an offset. The LINK An,#displacement instruction does the following :
i) the present value of An (= address of the old FP) is pushed onto the stack;
ii) the new value of the stack pointer is put into An;
iii) the displacement is added to SP, reserving space on the stack for local variables.
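A functional model of this instruction pair may help. The sketch below is illustrative only : a word-indexed C array stands in for the real byte-addressed stack, and plain variables for the stack pointer and the address register An used as FP.

#include <stdint.h>

static uint32_t mem32[1024];         /* toy stack memory            */
static uint32_t SP = 1024;           /* stack pointer (word index)  */
static uint32_t FP;                  /* address register used as FP */

/* LINK An,#displacement (displacement negative: stack grows down). */
static void link(int32_t displacement)
{
    mem32[--SP] = FP;   /* i)   push old frame pointer              */
    FP = SP;            /* ii)  new FP = current stack pointer      */
    SP += displacement; /* iii) reserve room for local variables    */
}

/* UNLK An: discard the locals and restore the caller's FP. */
static void unlk(void)
{
    SP = FP;
    FP = mem32[SP++];
}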

Figures 15 and 16 (from reference 23) show how parameter passing can be accomplished, while at the same time re-entrant code is produced (space for local variables is reserved on the stack). The translation from the Pascal program into assembly language is straightforward (figure 15). From figure 16 it is seen that the LINK instruction forms a linked list of frame pointers FP, thereby also creating the possibility of accessing data local to the calling program. Note that the procedure call in fig. 15 passes one parameter by value, the other by address.

SAMPLE PROGRAM :

PROGRAM EXAMPLE;
VAR PARAM1, PARAM2 : INTEGER;
PROCEDURE PROC (X : INTEGER; VAR Y : INTEGER);
VAR A, B : INTEGER;
BEGIN END;
BEGIN
PROC (PARAM1, PARAM2)
END.

PROGRAM BODY :

MOVE PARAM1 TO -SP@              "push first parameter"
PEA PARAM2                       "push address of 2nd parameter"
JSR PROC                         "call the procedure"
ADD #6 TO SP                     "pop parameters from the stack"

PROCEDURE BODY :

LINK FP,#4                       "link and allocate the local variables"
MOVEM <registerlist> TO -SP@     "push some register contents"
...
MOVEM <registerlist> FROM SP@    "restore registers"
UNLK FP                          "restore stack"
RETURN                           "return to calling procedure"

Fig. 15 An example of a Pascal program, calling a procedure. The figure shows how the program and the procedure are translated into 68000 assembly language.

[Figure 16 : stack layout (low to high memory) before the call, and after pushing the parameters, the call, LINK, and the saving of registers; the saved FP values form a linked list of stack frames.]

Fig. 16 Use of the Stack Frame Pointer (FP) in procedure calls to facilitate access to the parameters passed to the procedure. Results can be passed back to the calling program by the same mechanism.

At present, the address space of the 68000 is 2²⁴ = 16 Mbyte ; future expansion to 2³² = 4 Gbyte is foreseen. Addressable units are 1, 8, 16, or 32 bits wide. An external memory management unit can be used. The 68000 has memory-mapped I/O, so no isolated I/O instructions exist. A bus request/grant protocol has been implemented on the chip, thus facilitating DMA transfers. The machine has 8 levels of vectored interrupts. Apart from the procedure calls explained above, a few other features facilitating high-level language support have been implemented. These include bounds checking in accesses to array variables, traps on special conditions in arithmetic operations and some loop constructs which closely match Pascal FOR loops.

The reader may consult the manufacturer's literature or ref. 21 for more details on the 68000.

Texas 9900/99000.

The 9900 is a rather old processor, which can be replaced by the faster and fully backward-compatible 99000. The 99000 is one of the fastest microprocessors on the market. The 9900 is singular in the sense that it has no internal registers on which to perform operations. There is only a program counter, a status register and the workspace pointer WP. WP points to a block of 16 locations in main memory which act as the 16 general-purpose registers of the machine. Generally this slows down the operation of the machine, except when a context switch must be performed following an interrupt. To change the context of the machine it is sufficient to load a new value into WP, after the old value has been preserved.

The machine has a few serious shortcomings. For instance, subroutine return addresses are stored in a dedicated register (inside the workspace, of course). Subroutines can therefore not be nested, unless the return address is moved to another place before the new call is made. The memory space is limited to 64 Kbytes, with no possibility of extension. The designers of the 99000 even refused to contemplate the introduction of some memory management scheme in their design 25), giving absolute priority to the principle of backwards compatibility.

National 16008/16016 and 16032.

National announced rather recently a new family of processors, all with the same architecture, but distinguished by the width of the data path 26). These processors appear to be very fast. They are further distinguished by their arithmetic capabilities (built-in floating point) and by the variety of data structures they can handle.

The characteristics of the preceding 16-bit microprocessors are summarized in Table IV, which is adapted from ref. 27. Table V, from the same source, lists execution times for a few typical instructions. The reader should be cautious in using these tables to select the "best" microprocessor for a given application. Ticking off the desired characteristics and comparing the scores may well lead to a disastrous result, if a more profound study of a few processors is not made before the selection is attempted. Prevailing standards in the working environment and availability of subsystems and - most important ! - of software should have a much greater weight than the extra addressing mode, the additional instruction or the slightly faster addition. What the author hopes to have achieved with this rather long incursion into the field of processor structures is that the reader has become aware of the potential of these modern machines. In particular their suitability for solving real-life problems using high-level languages and modern programming practice should be noted.

Table IV

Characteristics of a few 16-bit microprocessors

                                9900/9995   8086        Z8000       68000      16016/32

YEAR AVAILABLE                  1976/81     1978        1979        1980       1981
IMPLEMENTATION                              µPROG       RANDOM      µPROG
CLOCK FREQUENCY (MHz)           3           5 (4-8)     2.5-3.9     4-8        10
No. OF BASIC INSTRUCTIONS       69/73       95          110         61         100
GEN. PURPOSE + OTHER REGISTERS  (16)+3      4+10        16+8        8+8+3      8+8
PIN COUNT                       64/40       40          40/48       64         40/48
DIRECT ADDRESS RANGE (BYTES)    64K         1M          48M         16M/64M    16M
No. OF ADDRESSING MODES         8           24          8           14         9
I/O SPACE (BYTES)               0.5/4K      64K, SEP    2x64K, SEP  MEM.MAP.   SEP
DATA TYPES :
  BITS                          +                       +           +          +
  INTEGER BYTE/WORD             +           +           +           +          +
  CHARACTER STRINGS                         +           +           +          +
  BCD BYTE                                  +           +           +          +
  FLOATING POINT                                                               +
DATA STRUCTURES :
  STACKS                                    +           +           +          +
  ARRAYS                                                                       +
  RECORDS                                   +           +           +          +
  STRINGS                                   +           +                      +
CONTROL STRUCTURE :
  TRAPS/INTERRUPTS              +           +           +           +          +
  SUPERVISOR CALL                                       +           +          +

Table V

Execution speeds for a few typical instructions on different 16-bit microprocessors

EXECUTION SPEEDS (µs)           9900      8086      Z8000     68000     16016

REGISTER -> REGISTER MOVE        4.60      0.40      0.75      0.50      0.30
                                 9.80      0.80      1.25      0.50      0.30
MEMORY -> MEMORY MOVE            9.90      7.00      7.00      2.50      1.60
                                19.80     14.00      8.50      3.75      2.40
ADD MEMORY TO REGISTER           7.32      3.60      3.75      1.50      1.10
                                21.30      7.20      5.25      2.25      1.50
MULTIPLY (MEM -> MEM)           21.90     23.00     16.00      8.75      4.60
                               180.64    115.20     85.75     43.00      7.60
CONDITIONAL BRANCH               3.60      1.60      1.50      1.25      1.40
                                 2.90      0.80      1.50      1.00      0.70
MODIFY INDEX, BRANCH IF=0        7.60      2.20      2.75      1.25      1.30
BRANCH TO SUBROUTINE             7.90      3.80      3.75      2.25      2.50

4. Support chips

The variety of microprocessor support chips is astounding (see Table II for what is probably a "short list"). We will limit ourselves to a few devices which have found interesting applications in high-energy physics.

4.1 Memory

Memory devices are undoubtedly the most widely used support chips. Here we will only mention a few unusual applications of memory.

4.1.1 Look-up tables.

Look-up tables are used more and more, as they make very rapid evaluation of complex expressions possible. They can be used to evaluate Boolean expressions, such as

F = ABC + ACD + BCD

or arithmetic expressions, such as

R = √(X² + Y²) and eˣ cos Y

The principle is simple : the input values are concatenated to form an address. At that address in memory the output value is stored. The answer is obtained in a single memory access time.
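As a concrete sketch, the Boolean expression above can be tabulated once and then evaluated in a single access (C is used here for illustration; in the hardware the table would simply be a small memory addressed by the four input signals) :

#include <stdint.h>

/* Look-up table for F = ABC + ACD + BCD: the four input bits are
 * concatenated into a 4-bit address; the table is filled once and
 * each evaluation is then a single memory access.                 */
static uint8_t f_table[16];

static void fill_table(void)
{
    for (unsigned adr = 0; adr < 16; adr++) {
        unsigned A = (adr >> 3) & 1, B = (adr >> 2) & 1,
                 C = (adr >> 1) & 1, D = adr & 1;
        f_table[adr] = (A & B & C) | (A & C & D) | (B & C & D);
    }
}

/* One memory access replaces the whole gate network. */
static uint8_t F(unsigned A, unsigned B, unsigned C, unsigned D)
{
    return f_table[A << 3 | B << 2 | C << 1 | D];
}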

For Boolean expressions a memory of size 2ⁿ is needed for n input variables. One therefore quickly runs out of memory chips if, for instance, one were to use directly the outputs of a large number of scintillation counters to evaluate whether a valid trigger condition occurred. In many cases an FPLA (Field Programmable Logic Array) can be used instead, but the possibility of dynamically reconfiguring the expression is then lost. A memory can be quickly reloaded at any time.

In the early seventies a special memory chip was produced for physics applications. Instead of decoding on chip 5 bits to provide a column address and 5 bits to produce a row address, these decoders were simply bypassed. The 32 row and 32 column lines were brought out directly (see figure 17 for the principle of a 1K memory).

A 32 × 32 coincidence matrix is the result, which can still be used if multiple counters are hit. Multiple hits would go unnoticed if a normal encoding scheme of the 32 counters into a 5-bit pattern had been used.

[Figure 17 : address lines A0-A4 and A5-A9 of a 1024 × 1 bit memory drive the row and column decoders of the 32 × 32 array of storage elements.]

Fig. 17 Principle of a 1K × 1 bit memory, showing that 32 rows and 32 columns of storage elements are used.

An added advantage of using memory modules in a trigger or event-selection set-up is that logical and arithmetic operations may be freely mixed. For arithmetic operations the operand sizes are the main problem. Three 16-bit input operands would need the whole IBM Mass Store to hold the result for every possible combination of inputs. Here too a solution can sometimes be found, as discussed in another lecture course 28). Separation of variables and - in the case of linear relations - splitting of long variables into subfields can provide solutions. For example, eˣ cos Y needs a table with 64K entries if X and Y have 8 bits each and are simply concatenated. If eˣ is evaluated separately from cos Y, two tables with 256 entries each are all that is needed. The penalty is one multiplication to be performed afterwards. The multiplication process itself provides another example of a possible reduction of the table size : for two 16-bit operands a table with

232 = 4.10s entries is required. Splitting the operands a and ß into 2 fields of 8-bits each, yields :

a.ß = (A.28+a)(B.28+b) = AB.216 - (Ab+aB)2s + ab

Four identical tables with 64 K entries are now all that is needed. All linear functions can be treated this way. Counting the number of bits set in a word is a third example where a look-up table is faster than any other method. Again the table size can be reduced at the cost of a few additions.
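Both reductions are easily tried out in software. In the sketch below (C ; arrays stand in for memory chips) one 64 K-entry table, looked up four times, plays the role of the four identical hardware tables, and a 256-entry table counts the bits set in one byte :

#include <stdio.h>
#include <stdint.h>

/* Product of two 8-bit fields in one access ; four look-ups in this
   table reconstruct a full 16 x 16-bit multiplication. */
static uint16_t mul8[256][256];
/* 256-entry table of bit counts for one byte. */
static uint8_t popcnt8[256];

static void fill_tables(void)
{
    for (unsigned i = 0; i < 256; i++) {
        for (unsigned j = 0; j < 256; j++)
            mul8[i][j] = (uint16_t)(i * j);
        unsigned n = 0;
        for (unsigned b = i; b != 0; b >>= 1)
            n += b & 1;
        popcnt8[i] = (uint8_t)n;
    }
}

/* alpha.beta = AB.2^16 + (Ab + aB).2^8 + ab, with A,a,B,b the 8-bit fields. */
static uint32_t mul16(uint16_t alpha, uint16_t beta)
{
    uint8_t A = alpha >> 8, a = alpha & 0xFF;
    uint8_t B = beta >> 8,  b = beta & 0xFF;
    return ((uint32_t)mul8[A][B] << 16)
         + (((uint32_t)mul8[A][b] + mul8[a][B]) << 8)
         + mul8[a][b];
}

/* Bits set in a 16-bit word : two look-ups and one addition. */
static unsigned bits_set(uint16_t w)
{
    return popcnt8[w >> 8] + popcnt8[w & 0xFF];
}

int main(void)
{
    fill_tables();
    printf("%lu\n", (unsigned long)mul16(12345, 54321));  /* 670592745 */
    printf("%u\n", bits_set(0xF0F0));                     /* 8 */
    return 0;
}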

It is even possible to construct an arithmetic processor using memories only. The principle is the use of Residue Arithmetic.29),30) The principle can be explained very simply : assume we take 4 numbers (we could take more) B1, B2, B3 and B4, which are mutually prime (they do not contain common factors). 7, 11, 13 and 15 is a reasonable choice for the example. A number N can then be represented by the four remainders obtained by dividing N by Bk, k = 1, ..., 4 :

Rk = N modulo Bk

It can be proved (Chinese remainder theorem)29) that this representation is unique for N < B1.B2.B3.B4. We have thus obtained a 15-bit representation of the numbers 0 - 15014. Why 15-bit ? Three of the remainders in our example are less than 16 and one is less than 8, so they fit in 4- and 3-bit fields respectively.

The three basic arithmetic operations can be performed with ease on this representation :

i) addition. M = N + N'
The representation of M is found by simply adding Rk + R'k for k = 1, ..., 4 and taking each result modulo Bk.

ii) subtraction. As above : Rk - R'k, taken modulo Bk, gives the result.

iii) multiplication. P = N.N'
Writing N = ak.Bk + Rk and N' = a'k.Bk + R'k :

(ak.Bk + Rk)(a'k.Bk + R'k) = {ak.a'k.Bk + ak.R'k + a'k.Rk}.Bk + Rk.R'k

So the k-th residue of P equals Rk.R'k modulo Bk : the result is obtained by multiplying the four residues separately and taking the remainder.

These three basic arithmetic operations can be performed by using small look-up tables. In our example each table is 256 x 4 bits and 12 of them are needed (addition, subtraction and multiplication for 4 different bases). It can also be shown that conversion between ASCII code and residue number representation can be performed with look-up tables again. Also negative numbers or fractions can be represented. A processor can thus be built entirely from memory.31) Such a processor would be very fast, but it would have one very serious shortcoming : it can by no means perform a meaningful division by a variable (division by a constant can of course be replaced by a multiplication).
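The sketch below (C, with the bases 7, 11, 13 and 15 of the example) builds the addition and multiplication tables, each corresponding to one of the 256 x 4-bit memories (subtraction is analogous), and checks one product against ordinary arithmetic. The table layout and names are assumptions for the illustration :

#include <stdio.h>

#define K 4
static const unsigned B[K] = {7, 11, 13, 15};   /* mutually prime bases */

/* One addition and one multiplication table per base ; each stands for
   a 256 x 4-bit memory addressed by two packed 4-bit residues. */
static unsigned char add_tab[K][16][16], mul_tab[K][16][16];

static void fill_tables(void)
{
    for (int k = 0; k < K; k++)
        for (unsigned r = 0; r < B[k]; r++)
            for (unsigned s = 0; s < B[k]; s++) {
                add_tab[k][r][s] = (unsigned char)((r + s) % B[k]);
                mul_tab[k][r][s] = (unsigned char)((r * s) % B[k]);
            }
}

/* Rk = N modulo Bk */
static void to_residues(unsigned long n, unsigned r[K])
{
    for (int k = 0; k < K; k++)
        r[k] = (unsigned)(n % B[k]);
}

int main(void)
{
    unsigned x[K], y[K], p[K];
    fill_tables();
    to_residues(123, x);
    to_residues(89, y);
    for (int k = 0; k < K; k++)            /* one table access per base */
        p[k] = mul_tab[k][x[k]][y[k]];
    /* 123 * 89 = 10947 < 15015, so the representation is unique. */
    for (int k = 0; k < K; k++)
        printf("base %2u : residue %2u (direct : %2lu)\n",
               B[k], p[k], (123UL * 89UL) % B[k]);
    return 0;
}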

4.2 Content Addressable Memory and Associative Processors.

A Content Addressable Memory (CAM) is a storage device which returns a signal on one or more address lines whenever the word presented on its data lines is found amongst the words stored in the memory. We present data to the device and obtain an answer indicating whether the data are present or not and, if present, the address where the data are stored. In general a mask can be applied to the data and the search for a match made on a selected set of bits. A CAM must also be able to operate as a normal read/write memory, as data must be written into it in the first place. Content Addressable Memories are ideal devices when arrays must be searched for the presence of a particular datum. A CAM avoids the need for a sequential search and gives the desired answer in a single memory cycle. CAMs would greatly improve the speed of many processes which involve searching long lists or tables. Track following is an example of a procedure that would benefit from it ; CAMs have in fact been used in a processor specialized for this task.4)
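The behaviour of a CAM, including the mask and the possibility of multiple matches, can be mimicked as follows (a sequential simulation only : the real device compares all stored words simultaneously and answers in one cycle ; the names are illustrative) :

#include <stdio.h>
#include <stdint.h>

#define WORDS 8

/* Report every stored word that matches `data` on the bits selected by
   `mask` : the multiple-match capability mentioned in the text. */
static int cam_search(const uint16_t mem[WORDS], uint16_t data,
                      uint16_t mask, int addr_out[WORDS])
{
    int n = 0;
    for (int a = 0; a < WORDS; a++)        /* in hardware : one cycle */
        if (((mem[a] ^ data) & mask) == 0)
            addr_out[n++] = a;
    return n;                              /* number of matches found */
}

int main(void)
{
    const uint16_t mem[WORDS] = {0x1234, 0xBEEF, 0x12FF, 0x0000,
                                 0xBEEF, 0x4321, 0xFFFF, 0x12AB};
    int addr[WORDS];
    /* Search on the high byte only : all words beginning with 0x12. */
    int n = cam_search(mem, 0x1200, 0xFF00, addr);
    for (int i = 0; i < n; i++)
        printf("match at address %d\n", addr[i]);
    return 0;
}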

One would therefore expect to find large sized CAM chips on the market. This is however not the case : a CAM chip is heavily pin-limited. Whereas a normal RAM of 2^k words x N bits needs only k + N pins (plus a few more for power, R/W control and chip enable), a CAM of the same size needs in addition N pins for a mask and 2^k pins to signal the matches found. These 2^k lines cannot be encoded, as the possibility would then be lost to find multiple matches in a single cycle. A large CAM can of course be built from many small capacity chips, but a more elegant solution is offered by associative processors. An associative processor searches all elements of an array simultaneously for a match. Generally this is done bit-by-bit. The array to be searched may typically have 1000 entries. As long as the ratio of the word length over the number of entries to be searched is small, a considerable gain in speed is obtained. In an associative processor each memory word has a simple processing element attached to it. A memory word contains in general many bits, so that several attributes can be stored along with the search field. For instance, if the memory contained personal data, a search could be made for all the Smiths who are between 45 and 55 years old. Once located, their address and telephone number could then be read from those memory locations where a match was found.

The memory for an associative processor can be built from normal RAM chips, for instance as indicated in figure 18. The normal address lines are used to select one bit for all 1024 elements of the vector. These 1024 bits are treated in the 1024 processing elements. The processing elements (PEs) must be cheap to make an associative processor a viable structure. Since one bit is treated at a time, a single-bit microprocessor would be indicated. An associative processor using the single-bit Motorola MC14500B has been built at the University of Toronto.32) In addition to the PEs and the working store there is also a backing store. The different elements communicate as shown in figure 19 for a single horizontal slice through the machine. The shift register is used for communication between the different elements of the vector. This communication is needed for the execution of operations which are more complex than the simple searches. A block diagram of the microprocessor is shown in figure 20. The MC14500B has a set of 7 boolean and 9 other instructions. Three of the instructions automatically enable the write line. Reference 32 gives more details on this cheap associative processor, in particular examples of search and other operations. A simulation sketch of the bit-serial search is given after the figures.

Fig. 18 A 1K x 256 bit memory for an associative processor can be built from 128 256 x 8 bit memory chips (each chip serving 8 PEs, for 1024 PEs in total). Broadcasting the same address to all 128 chips results in selecting a single bit for all 1024 elements of the vector.

Fig. 19 Block diagram of an associative processor. The diagram shows the structure implemented for each element of the array : a processing element (PE) with its working store (RAM), its backing store (CCD) and a shift register, with write control and address lines common to all array words.


Fig. 20 Diagram of the MC14500B, the single-bit microprocessor used in the associative processor of the preceding figure.
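The bit-serial search performed by such a machine can be simulated as below : at every step one bit of the key is broadcast and every PE updates its one-bit match flag, so the search time depends on the word length and not on the number of elements. This is a sketch of the principle, not of the MC14500B instruction set :

#include <stdio.h>
#include <stdint.h>

#define ELEMS 16   /* 1024 in the machine described in the text */

int main(void)
{
    uint32_t word[ELEMS];    /* one memory word per processing element */
    uint8_t  match[ELEMS];   /* the one-bit state held in each PE      */

    for (int e = 0; e < ELEMS; e++)
        word[e] = (uint32_t)(e * 5);   /* some test contents */

    uint32_t key = 35;       /* search field to look for */

    /* All PEs start out matching ; one bit of every word is examined
       per step. */
    for (int e = 0; e < ELEMS; e++)
        match[e] = 1;
    for (int bit = 0; bit < 32; bit++) {
        uint32_t kb = (key >> bit) & 1;          /* broadcast key bit  */
        for (int e = 0; e < ELEMS; e++)          /* all PEs in parallel */
            match[e] &= (((word[e] >> bit) & 1) == kb);
    }

    for (int e = 0; e < ELEMS; e++)
        if (match[e])
            printf("element %d holds %u\n", e, (unsigned)word[e]);
    return 0;
}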

4.3 Arithmetic attachments.

We have seen that the arithmetic capabilities of microprocessors have been greatly improved, but processors with floating-point arithmetic on chip practically do not exist yet. A number of floating point chips which can be interfaced to a microprocessor are available on the market to augment the power of the processor, if needed. The oldest, the Amd 9511, was terribly slow, taking 57 µs for a 32-bit floating point multiplication or division. The 8087, a co-processor for the 8086, improves greatly on this figure, bringing it down to 16 µs (or 24 µs for multiplication of two 64-bit real numbers). The question arises if it is feasible to replace large number crunching computers by a reasonable number of 16-bit microprocessors, each with an attached floating point processor. Can a reasonable performance be obtained at an affordable cost ? At the Brookhaven National Laboratory this problem was investigated.33) An experimental processor was built from a 68000 and an Amd 9511. A Fortran compiler was developed for the 68000 and a piece of pattern recognition code for events in the multiparticle spectrometer was run. Performance measurements were made using this code. The results were then extrapolated to an 8 MHz 68000 (instead of the 4 MHz version used in the test) and to a NS 16081 (instead of the 9511). It was found that the combination of a 68000 and a NS 16081 would have 1/30 of the power of a CDC 7600, or approximately the power of a DEC-10 (with a KA10 processor). One can conclude from this result that single microprocessors do not yet provide a solution for applications where number-crunching is essential.

5. VLSI

VLSI (Very Large Scale Integration) has become a tool for research in novel computer architectures and special purpose machines. The availability of Computer Aided Design tools, the existence of small semiconductor firms specializing in the fabrication of custom-designed chips ("silicon foundries") and the need to educate specialists have led to a flourishing research activity in applications of VLSI.34) Examples of successful projects are :

- the geometry engine, developed at Stanford35) for clipping, scaling and coordinate transformation of objects to be displayed on a graphics screen ;

- Scheme-79, designed at MIT36), which is a chip for direct execution of the programming language LISP ;
- RISC, the Reduced Instruction Set Computer, developed at Berkeley, which we will briefly describe below.

Besides these projects, data flow machines, tree machines, systolic arrays and a number of other non-von Neumann structures are the object of study. Industry tends to remain on the safe side and its VLSI developments are essentially limited to fabricating larger memories and more complex (micro)processors (e.g. the iAPX 432 of Intel37),38)).

5.1 Systolic Arrays

The concept of systolic computing was introduced by Kung.39),40) He observed that for certain calculations the throughput could be improved by connecting several processing elements in a pipe-line fashion and "pumping" the data through this pipe-line, instead of cycling the data through a central memory (see figures 21a and b).

Fig. 21 Principle of systolic arrays. a) The throughput in a normal processor is limited (here to 5 MOPS maximum with a 100 ns memory), because intermediate results must be stored back in memory. b) When intermediate results are passed to the next processing element in a chain of PEs, the throughput is improved (to 30 MOPS with the same 100 ns memory).

The analogy with the heart gave the name to this type of computing structure. A structure as depicted in figure 21b is of course not suited for general computing ; in most instances we would not know what to do with the array of PEs. Systolic arrays are however very well suited for signal processing and pattern matching. We will show two examples of how a convolution integral may be evaluated using two differently structured systolic arrays. Kung himself gives four more structures to compute the convolution integral.40)

We will approximate the convolution integral

Y(t) = ∫_0^∞ W(t-τ).X(τ) dτ

by the discrete sum Y_i = Σ_j W_j.X_{i+j-1}, j = 1, ..., k, where the k weights W_j sample the function W.

This can be done with a structure as shown in figure 22, where each square box is a processing element, which performs only one specific operation, as shown in the figure. In this example the weights are attached to the PEs and do not change or move through the array. The input sequence is broadcast to all PEs, one element at a time. At the next beat of the clock, a new element of the input sequence is presented. The Y's move systolically through the array from left to right. At every beat the output of a PE becomes the input to its righthand neighbour. It is easily verified that after an initial period in which the pipe-line becomes filled, the output of the array is exactly the sequence {Y_1, Y_2, ..., Y_{n+1-k}} (for n input values and k weights) and that one new element of the sequence is produced every beat. The same result can be obtained with a different structure and a different PE. This is shown in figure 23. The W's are again fixed in each PE, but this time the

Fig. 22 Example of a systolic array to calculate the convolution integral. The X's are broadcast, the W's stay and the Y's move systolically ; each PE computes Y_out = Y_in + W.X_in, as shown in the inset. For each beat of the clock, new input data is presented and intermediate results move on.

Fig. 23 Another configuration of a systolic array for the convolution integral. The W's stay, while the X's and Y's move systolically in opposite directions ; each PE again computes Y_out = Y_in + W.X_in, but passes X on with a delay of one beat. A new value of the input sequence is now presented every second beat of the clock.

X and Y move systolically, in opposite directions. The PE transmits the X unchanged, but with a delay of one beat. It is again easy to verify that the correct result is obtained when a new X is presented every second beat, as indicated in the figure.
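The array of figure 22 is easily simulated (a sketch only ; the inner loop plays the role of the k PEs firing in parallel) and its output can be checked against the directly computed sum Y_i = W_1.X_i + ... + W_k.X_{i+k-1} :

#include <stdio.h>

#define K 3    /* number of weights / processing elements */
#define N 8    /* length of the input sequence            */

int main(void)
{
    const int W[K] = {1, 2, 3};
    const int X[N] = {4, 1, 0, 2, 5, 3, 1, 2};
    int y[K + 1] = {0};     /* partial sums travelling between the PEs */
    int out[N];             /* collected outputs                       */
    int nout = 0;

    for (int t = 0; t < N; t++) {        /* one clock beat per input    */
        int x = X[t];                    /* X[t] broadcast to all PEs   */
        for (int j = K; j >= 1; j--)     /* every PE fires in parallel, */
            y[j] = y[j - 1] + W[j - 1] * x;  /* updated right to left   */
        y[0] = 0;                        /* a fresh Y enters every beat */
        if (t >= K - 1)                  /* pipe-line full : one result */
            out[nout++] = y[K];          /* per beat from now on        */
    }

    for (int i = 0; i < nout; i++)       /* check against the direct sum */
        printf("Y%d = %d (direct : %d)\n", i + 1, out[i],
               W[0]*X[i] + W[1]*X[i+1] + W[2]*X[i+2]);
    return 0;
}

As the text states, the simulation produces n + 1 - k output values, one per beat once the pipe-line is filled.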

A concrete example of a systolic array is the pattern matching chip developed at Carnegie-Mellon University by Foster and Kung.41) A character string of arbitrary length is searched for the occurrence of a pattern (figure 24). The pattern may contain wildcard characters, which match any character (the X in the figure). The pattern moves over the character string and when a match is found a logical 1 is output, otherwise the result is "0".

Fig. 24 The principle of the pattern matching chip, which is a practical implementation of a systolic array. The host supplies the pattern (AXC) and the string (ABCAACCQQ...) ; the chip returns the result (001001100...).

5.2 The Geometry Engine.

The geometry engine, designed by Clark,35) is intended to transform objects to be displayed on a graphics screen. A complete system requires 12 identical chips, used in slightly different ways. The functions performed are :
- coordinate transformation of the object
- clipping to the boundaries of the viewing window
- scaling to the viewport.

The chip itself is an arithmetic device, capable of performing floating point operations on numbers with an 8-bit exponent and a 20-bit mantissa. There are 4 floating point units per chip ; a multiplication is performed in 12 µs. The overall performance of the 12-chip system is remarkable : 3500 lines or 900 polygons can be transformed, clipped and scaled every 1/30 second, the refresh time of the display. At the time of publication of the description of the geometry engine, a part of the final chip had actually been implemented, proving the feasibility of the device.

5.3 RISC

Before we describe the Reduced Instruction Set Computer of Patterson and Séquin,42) we want to repeat some of the arguments used in discussions on optimum instruction sets. The fundamental question is : should the instruction set of a computer be simple or complex, i.e. should the set contain powerful instructions, such as block moves, which keep the processor busy for many cycles and which require complex logic, or should it not ?

Some of the arguments in favour of CISCs (Complex Instruction Set Computers) are :
- the implementation of a compiler is simplified on a CISC.
- some functions of the operating system can be migrated to the hardware, with increased performance.
- the code for a CISC is compact.

As we have seen in chapter 3, the present trend is towards CISC. The Z8000, MC 68000 and NS 16032 are examples. The Intel iAPX 432 and the Hewlett Packard HP 9000 (430 000 transistors !) have gone even further in this direction. By contrast, the Texas TMS 99000 has remained rather simple.

The opponents of this trend advance the following arguments against CISCs :
- maybe the code generation part of a compiler will be simplified, but this is only a small part of the total job. The lexical and syntax analysis, parsing, optimization and also loaders and exception handling remain unchanged.
- complex instructions may do the wrong thing for languages other than the one they were designed for.
- the cost of memory goes down, so why worry about compact code ?
- compact programs are not necessarily the fastest.

They further argue that the design of the instruction set should be based on the actual use of the instructions. So LOAD, STORE and BRANCH instructions should be made faster. The designer can forget about the rest, the instructions which are seldom used : they should be replaced by software. The compiler should be given the task of optimizing the hardware/software mix for highest speed.

The RISC-I is an example of a Simple Instruction Set Computer. The design goals were the following :
- instructions should be executed in a single cycle (register -> operation -> register).
- higher level instructions should be implemented by software, using the basic instructions.
- all instructions should have the same word length.
- high-level languages should be supported through fast execution of CALL and RETURN, together with an effective means of parameter passing and allocation of storage for local variables.
RISC-I has a very elegant scheme of "register windows" to implement the latter requirement. A register window is shown in figure 25a. It consists of a storage area for global variables and three other areas : high, local and low. When a procedure A calls procedure B, the low area of A becomes the high area of B (see figure 25b). Parameters are passed through this area. In the course of execution of a program the register window slides upward and downward, according to the nesting of the procedures. A software sketch of this mechanism is given after the figure.

Fig. 25 Register windows used in RISC-I. a) The components of a register window : a global area and the high, local and low areas. b) Parameters are passed to procedures by sliding the register window up and down : the low area of procedure A overlaps the high area of procedure B, and the low area of B overlaps the high area of C.
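The following sketch gives a rough software picture of the sliding windows. The sizes and the overlap chosen here are invented for the illustration and do not describe the actual RISC-I register file ; the point is that the caller's low registers coincide with the callee's high registers, so parameters pass without copying :

#include <stdio.h>

/* Invented layout : 6 high + 6 local + 6 low registers per window.
   The low area of a caller is the high area of its callee, so
   consecutive windows overlap by 6 and CALL advances the base by 12. */
#define NWIN   8
#define OVER   6
#define STEP  12
static int regfile[NWIN * STEP + OVER];
static int wp = 0;                 /* current window base */

static int *high(int i)  { return &regfile[wp + i]; }
static int *local(int i) { return &regfile[wp + OVER + i]; }
static int *low(int i)   { return &regfile[wp + STEP + i]; }

static void call(void)   { wp += STEP; }   /* slide the window  */
static void ret(void)    { wp -= STEP; }   /* ... and back up   */

static void callee(void)
{
    call();
    /* The caller's low registers are now visible as high registers. */
    *local(0) = *high(0) + *high(1);
    *high(0) = *local(0);          /* pass the result back the same way */
    ret();
}

int main(void)
{
    *low(0) = 30;                  /* parameters placed in low ...    */
    *low(1) = 12;
    callee();                      /* ... no copying on the call      */
    printf("result = %d\n", *low(0));   /* 42 */
    return 0;
}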

It is interesting to see how well RISC-I performs compared to other processors.

Some results have been published.43) The comparison concerns a pattern matching instruction, MATCHP, implemented in microcode on the VAX 11/750. The instruction counts the occurrences (m) of a 16-bit pattern in a stream of n bits. The instruction was rewritten in C and in assembly language for the VAX 11/750, the VAX 11/780 and for RISC-I. The microcoded implementation is the fastest ; RISC-I hand-coded is 5.1 times slower, the C version on RISC-I 5.6 times, the hand-coded version on the VAX 11/780 8.6 times and on the VAX 11/750 14.3 times slower. In spite of the slower cycle of RISC-I (1.25 times slower than the 11/750), its performance is considerably better than the VAX's, in this particular case at least. Note also that the code produced by a portable, non-optimizing C compiler can hardly be improved by hand-coding on RISC-I. These are certainly encouraging results for the proponents of simple instruction set computers.

6. Conclusion

In these lecture notes the author has paced rapidly over a wide field, picking a flower here and there, without bending down to inspect more closely the large variety of herbs growing in the field. He hopes nevertheless that he has succeeded in giving a glimpse of a few topics of interest to experimental physicists. The bit slices are the devices to be used when speed is at a premium or when an existing machine must be emulated. The fixed instruction set microprocessors are catching up very rapidly, not only in speed and arithmetic capabilities, but especially in their high-level language and other software support facilities. Luckily, the times are gone when an engineer painfully aligned zeroes and ones to be burnt into a PROM, hoping that they would make his microprocessor chip do something useful. The new developments in VLSI may produce some further happy surprises in the years to come. But even without new devices, clever use of the old ones can be of great help in physics experiments.

7. Acknowledgements.

I would like to thank Mrs C. Gentet for preparing the manuscript and making it fit to print and Mrs. O. Marais for her usual fast and accurate production of figures.

8. References.

1. R.N. Noyce and M.E. Hoff Jr., A History of Microprocessor Development at Intel, IEEE Micro, Vol. 1, no. 1 (Feb. 1981), pp. 8-21.
2. T. Lingjaerde, A Fast Microprogrammable Processor, CERN DD/75/17 (1975).
3. E. Barrelet, R. Marbot and P. Matricon, A Versatile Microcomputer for High Rate Camac Data Acquisition, in Real-time Data Handling and Process Control, Proc. 1st European Symposium, Berlin 1979 (ed. H. Meyer) (North Holland, Amsterdam, 1980), p. 77.
4. P. Schildt, H.J. Stuckenberg and N. Wermes, MONICA - a Programmable Microprocessor for Track Recognition in an e+e- Experiment at Petra, in Proc. Topical Conference on the Application of Microprocessors to High-Energy Physics Experiments, Geneva 1981, CERN 81-07, p. 38.
5. C. Halatsis, A. Van Dam, J. Joosten and M.F. Letheren, Architectural Considerations for a Microprogrammable Emulating Engine Using Bit-slices, CERN DD/79/7 (1979), in Proc. 7th Int. Symposium on Computer Architecture, La Baule, 1980, IEEE publ. 80 CH 1494-4C (1980).
6. J. Anthonioz-Blanc, C. Halatsis, J. Joosten, M.F. Letheren, A. van Dam, A. van Praag and C. Verkerk, MICE, a Fast User-Microprogrammable Emulator of the PDP-11, CERN DD/80/14 (1980).
7. The 370/E was developed at the Weizmann Institute by H. Brafman, R. Fall and R. Yaari.
8. A.B. Salisbury, Microprogrammable Computer Architectures (Elsevier, New York, 1976).
9. J. Lecoq, Thèse d'Etat, Mulhouse, 1982.
10. P.F. Kunz, The LASS Hardware Processor, Nucl. Instr. Methods 135, 435 (1976).
11. P.F. Kunz, R.F. Fall, M.F. Gravina and H. Brafman, The LASS Hardware Processor, in Proc. 11th Annual Microprogramming Workshop, Pacific Grove, Cal., 1978, IEEE pub. 78 CH 1411-8 (1978).
12. P.F. Kunz, R.M. Fall, M.F. Gravina, J.H. Halpering, L.J. Levinson, G.J. Oxoby and Q.H. Trang, Experience Using the 168/E Microprocessor for Off-line Data Analysis, IEEE Trans. Nucl. Science NS-27, 582 (1980).
13. D. Lord et al., The 168/E at CERN and the Mark 2, an improved 168/E, in Proc. Topical Conf. on the Application of Microprocessors to High-Energy Physics Experiments, Geneva 1981, CERN 81-07, p. 341.
14. C. Halatsis, Software Tools for Microprocessor Based Systems, in Proc. 1980 CERN School of Computing, Vraona, 1980, CERN 81-03, p. 241.
15. M.R. Barbacci and A.W. Nagle, An ISPS Simulator, Technical Report, Dept. of Computer Science, Carnegie-Mellon University (1978).
16. A.K. Agrawala and T.G. Rauscher, Foundations of Microprogramming : Architecture, Software and Applications (Academic Press, New York, 1976).
17. S.S. Husson, Microprogramming : Principles and Practices (Prentice-Hall, Englewood Cliffs, NJ, 1970).
18. G.J. Myers, Digital System Design with LSI Bit-Slice Logic (Wiley Interscience, 1980).
19. A. van Dam, Introduction to Bit Slices and Microprogramming, in Proc. 1980 CERN School of Computing, Vraona, CERN 81-03, p. 220.
20. S.P. Morse, B.W. Ravenel, S. Mazor and W.B. Pohlman, Intel Microprocessors - 8008 to 8086, Computer, Vol. 13, no. 10 (October 1980), p. 42.
21. J.F. Wakerly, Microcomputer Architecture and Programming (John Wiley and Sons, New York, 1981).
22. B.L. Peuto, Architecture of a New Microprocessor, Computer, Vol. 12, no. 2 (Feb. 1979), p. 10.
23. S. Stritter and T. Gunter, A Microprocessor Architecture for a Changing World : the Motorola 68000, Computer, Vol. 12, no. 2 (Feb. 1979), p. 43.
24. S. Stritter and N. Tredennick, Microprogrammed Implementation of a Single Chip Microprocessor, in Proc. 11th Annual Microprogramming Workshop, Pacific Grove, 1978, IEEE pub. 78 CH 1411-8, p. 8.
25. R.V. Orlando and T.L. Anderson, An Overview of the 9900 Microprocessor Family, IEEE Micro, Vol. 1, no. 3 (August 1981), p. 38.
26. S. Bal et al., The NS 16000 Family - Advances in Architecture and Hardware, Computer, Vol. 15, no. 6 (June 1982), p. 58.
27. Hoo-min D. Toong and Amar Gupta, An Architectural Comparison of Contemporary 16-bit Microprocessors, IEEE Micro, Vol. 1, no. 2 (May 1981), p. 26.
28. C. Verkerk, Use of Intelligent Devices in High-Energy Physics Experiments, in Proc. 1980 CERN School of Computing, Vraona, CERN 81-03, p. 282.
29. H.L. Garner, The Residue Number System, IRE Trans. Electr. Computers, Vol. EC-8 (1959), p. 140.
30. N.S. Szabo and R.J. Tanaka, Residue Arithmetic and Its Applications to Computer Technology (McGraw-Hill, New York, 1967).
31. A. Huang, Number Theoretic Processors, Thesis, Dept. of Electrical Eng., 1980.
32. W.M. Loucks, M. Snelgrove and S.G. Zaky, A Vector Processor Based on One-bit Microprocessors, IEEE Micro, Vol. 2, no. 1 (Feb. 1982), p. 53.
33. H. Bernstein et al., A Microprocessor-based Single Board Computer for High Energy Physics Event Pattern Recognition, in Proc. Topical Conf. on the Application of Microprocessors to High-Energy Physics Experiments, Geneva 1981, CERN 81-07, p. 479.
34. P.C. Treleaven, VLSI Processor Architectures, Computer, Vol. 15, no. 6 (June 1982), p. 33.
35. J. Clark, A VLSI Geometry Processor for Graphics, Computer, Vol. 13, no. 7 (July 1980), p. 59.
36. G.J. Sussman, J. Holloway, G.L. Steele Jr. and A. Bell, Scheme-79 - LISP on a Chip, Computer, Vol. 14, no. 7 (July 1981), p. 10.
37. Introduction to the iAPX 432 Architecture, Intel Corporation, Santa Clara, Cal., 1981.
38. S. Zeigler, N. Allègre, R. Johnson, J. Morris and G. Burns, Ada for the Intel 432 Microcomputer, Computer, Vol. 14, no. 6 (June 1981), p. 47.
39. H.T. Kung, Let's Design Algorithms for VLSI Systems, in Proc. Conf. Very Large Scale Integration : Architecture, Design, Fabrication, Cal. Inst. of Technology, Los Angeles, Jan. 1979, pp. 65-90.
40. H.T. Kung, Why Systolic Architectures ?, Computer, Vol. 15, no. 1 (Jan. 1982), p. 37.
41. M.J. Foster and H.T. Kung, The Design of Special Purpose VLSI Chips, Computer, Vol. 13, no. 1 (Jan. 1980), p. 26.
42. D.A. Patterson and C.H. Séquin, A VLSI RISC, Computer, Vol. 15, no. 9 (Sept. 1982), p. 8.
43. J.R. Larus, A Comparison of Microcode, Assembly Code and High-Level Languages on the VAX-11 and RISC-I, Computer Architecture News (ACM Special Interest Group), Vol. 10, no. 5 (Sept. 1982), p. 10.