CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

DIGITAL SIGNAL PROCESSING INTEGRATED CIRCUITS

(DSP ICS)

A project submitted in partial satisfaction of requirements for the degree of Master of Science in

Electrical Engineering

by

Trung Dien Vo

January 1988

The Project of Trung Dien Vo is approved:

California State University, Northridge

TABLE OF CONTENTS

List of Figures ...... iv

List of Tables ...... vii

Abstract ...... viii

Chapter 1 INTRODUCTION ...... 1

Chapter 2 THE NEC 7720 ...... 3

Chapter 3 THE TMS320XX DSP FAMILY ...... 15

Chapter 4 THE ANALOG DEVICES ADSP-2100 ...... 30

Chapter 5 THE ZORAN ZR34161 VECTOR SIGNAL PROCESSOR ...... 47

Chapter 6 THE MOTOROLA DSP56000 ...... 57

Chapter 7 THE AT&T WE-DSP32 ...... 92

Chapter 8 THE TEXAS INSTRUMENTS TMS320C30 ...... 104

Chapter 9 THE ZORAN ZR34325 ...... 117

Chapter 10 THE EFFECTS OF WORD LENGTH ON DYNAMIC RANGE ...... 129

Chapter 11 COMPARISONS OF DIFFERENT DSP ICS ...... 151

Chapter 12 APPLICATIONS ...... 166

Chapter 13 CONCLUSION ...... 172

References ...... 174

Appendix A THE PIPELINE AND HARVARD ARCHITECTURES ...... 178

Appendix B THE REDUCED-INSTRUCTION-SET-COMPUTING (RISC) ARCHITECTURE ...... 185

Appendix C FLOATING-POINT VERSUS FIXED-POINT ARITHMETIC ...... 188

Appendix D TERMINOLOGY ...... 197

Appendix E LIST OF VENDORS ...... 201

LIST OF FIGURES

Chapter 2

Figure 2.1 NEC 7720 Block Diagram ...... 4
Figure 2.2 Data RAM & Peripherals ...... 6
Figure 2.3 ALU Block Diagram ...... 8
Figure 2.4 Serial Input ...... 11
Figure 2.5 Serial Output ...... 11
Figure 2.6 Parallel I/O ...... 13

Chapter 3

Figure 3.1a TMS320 Family Generation ...... 16
Figure 3.1b TMS320 Family Generation (Technology) ...... 17
Figure 3.2 TMS320C25 Block Diagram ...... 25

Chapter 4

Figure 4.1 ADSP-2100 Block Diagram ...... 31
Figure 4.2 ALU Architecture ...... 33
Figure 4.3 MAC Block Diagram ...... 39
Figure 4.4 Shifter Block Diagram ...... 41
Figure 4.5 Data Address Generator (DAG) ...... 42
Figure 4.6 Program Sequencer ...... 44
Figure 4.7 PMD-DMD Bus Exchange Unit (BEU) ...... 46

Chapter 5

Figure 5.1 ZR34161 VSP Block Diagram ...... 48
Figure 5.2 VSP Block Diagram ...... 52

Chapter 6

Figure 6.1 DSP56000 Block Diagram ...... 59
Figure 6.2 Program Controller ...... 63
Figure 6.3 ...... 70
Figure 6.4 Operating Mode Register ...... 70
Figure 6.5 Stack Pointer (SP) Format ...... 72
Figure 6.6 Stack Pointer Values ...... 72
Figure 6.7 Data ALU ...... 77
Figure 6.8 Address ALU ...... 84

Chapter 7

Figure 7.1 WE-DSP32 Block Diagram ...... 93
Figure 7.2 Control Arithmetic Unit ...... 95
Figure 7.3 Memory Configuration ...... 101

Chapter 8

Figure 8.1 TMS320C30 Block Diagram ...... 105
Figure 8.2 (CPU) ...... 107
Figure 8.3 DMA Controller ...... 109
Figure 8.4 Memory Organization ...... 111
Figure 8.5 Peripherals ...... 113

Chapter 9

Figure 9.1 ZR34325 Block Diagram ...... 118
Figure 9.2 Execution Unit ...... 123
Figure 9.3 Internal Memory ...... 123
Figure 9.4 Internal Registers ...... 124

Chapter 10

Figure 10.1 Truncation Effect on Frequency Response ...... 132
Figure 10.2 Sampling ...... 135
Figure 10.3 Additive Errors Due to Quantization ...... 137
Figure 10.4 Dynamic Range Versus Data Word Length ...... 141
Figure 10.5 Mse Versus Word Length (d, t, k) ...... 145
Figure 10.6a Mse Versus Data Word Length (d) ...... 146
Figure 10.6b Mse Versus Intermediate Word Length (t) ...... 147
Figure 10.6c Mse Versus Kernel Word Length (k) ...... 148

Appendix A

Figure A.1 Pipeline Operation of Floating-point Addition ...... 180
Figure A.2 Von Neumann Operation ...... 182
Figure A.3 Harvard Operation ...... 182
Figure A.4 3-Bus Harvard Operation ...... 184
Figure A.5 Pipelined Harvard Operation ...... 184

Appendix C

Figure C.1 Dynamic Range Versus Word Length ...... 194
Figure C.2 Precision Versus Word Length ...... 196

LIST OF TABLES

Chapter 6

Table 6.1 Accumulator Shifter Functions ...... 81

Chapter 10

Table 10.1 Dynamic Range Limiting Factor ...... 150

Chapter 11

Table 11.1 Comparison of Benchmark Performance of 16-bit DSP ICs ...... 157

Table 11.2 Comparison of 16-bit DSP ICs' Features ...... 159

Table 11.3 Comparison of Benchmark Performance of 24- and 32-bit DSP ICs ...... 164

Table 11.4 Comparison of 24- and 32-bit DSP ICs' Features ...... 165

ABSTRACT

DIGITAL SIGNAL PROCESSING INTEGRATED CIRCUITS (DSP ICS)

by

Trung D. Vo

Master of Science in Engineering

Digital signal processing integrated circuits (DSP ICs), a new breed of microelectronic device, have changed the picture of signal processing. Advances in VLSI (Very Large Scale Integrated) circuit density and speed have been so successful in the past few years that today a single chip can perform tasks that previously required bulky circuit boards full of power-hungry devices. These DSP chips are readily available from many vendors, and more advanced ones are currently in the development cycle. This paper discusses and compares certain popular DSP ICs with respect to their architectures, performance, and applications.

CHAPTER 1 INTRODUCTION

The single-chip digital signal processing IC is a very significant development for digital signal processing. Currently, there are over 20 manufacturers competing in the DSP IC market. Most of these manufacturers are major, well-established IC makers that have been successful with general-purpose microprocessors. This paper covers eight DSP ICs from six different vendors: the NEC 7720, the Analog Devices ADSP-2100, the Texas Instruments TMS320XX family, the Zoran ZR34161, the Motorola DSP56000, the AT&T WE-DSP32, the Texas Instruments TMS320C30, and the Zoran ZR34325. The first four of the group are 16-bit devices. The Motorola DSP56000 is a 24-bit device, and the last three are 32-bit devices. The TI TMS320XX family is the largest DSP IC family, consisting of about fourteen versions. The devices with word lengths greater than 16 bits, except the AT&T WE-DSP32, are still in the development stages and were not yet in production at the time of this writing. They are, however, expected to be available in the very near future.

Most of these DSP ICs employ CMOS technology, which offers very low power consumption. They are fast, with instruction cycles as low as 50 ns, and their operation is general-purpose microprocessor-like. This means that the chips fetch and execute instructions from memory just like any other computer. However, the detailed internal architectures of DSP ICs are significantly different from the architectures of microprocessors. Most DSP ICs employ a Harvard architecture [see Appendix A], in which the data and program memories are separate, thus allowing data and program instructions to be fetched simultaneously. Also, to maximize operating speed, these DSP ICs are based largely on reduced-instruction-set computing (RISC) [see Appendix B] and pipeline architectures [see Appendix A].

The eight DSP ICs are classified into two groups for comparison. One group consists of the 16-bit devices, and the other consists of devices with data word lengths greater than 16 bits. The architectures of the devices are discussed in detail to provide a basis for comparison, and guidelines for selecting a suitable DSP IC are developed. Since the data word lengths of these devices vary (16, 24, and 32 bits), the effects of data word length on system dynamic range are studied in detail, and results from simulations of these effects are included.
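As a rough guide to why word length matters, the ideal dynamic range of an N-bit fixed-point word is often estimated as 20 log10(2^N), or about 6.02 dB per bit. The sketch below applies this textbook approximation to the three word lengths named above. It is an assumption made for illustration, not the simulation method used later in this paper, and the function name is hypothetical.

```python
import math


def dynamic_range_db(bits: int) -> float:
    """Ideal dynamic range of an N-bit fixed-point word in dB.

    Assumption: the range is taken as the ratio of full scale to one
    least-significant bit, i.e. 2**bits, which gives about 6.02 dB per bit.
    """
    if bits <= 0:
        raise ValueError("bits must be positive")
    return 20.0 * math.log10(2 ** bits)


if __name__ == "__main__":
    for bits in (16, 24, 32):
        print(f"{bits}-bit word: {dynamic_range_db(bits):.1f} dB")
```

Under this approximation each additional bit of word length adds roughly 6 dB of dynamic range.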

A chapter is devoted to comparing the DSP ICs based on their architectures and performance. Typical applications using these DSP ICs are discussed in another chapter to provide some examples. Finally, it is concluded that, technically, all eight DSP ICs are suitable for most DSP applications. The selection of the best one for a particular application depends on that application's unique requirements.

CHAPTER 2 THE NEC 7720

1. Introduction

The NEC 7720 (sometimes written as uPD7720) is a 16-bit microprogrammable device designed to be a flexible, versatile number cruncher able to function as a stand-alone processor. It is aimed mainly at telecommunications, where accuracy, speed, and interface compatibility with existing systems are of great importance. The NEC 7720, however, has more general features that make it suitable for a broader range of applications.

2. Architecture

The NEC 7720 employs a Harvard architecture [Figure 2.1], which provides separate data and instruction buses. The processor is controlled by directly executable 23-bit horizontal microinstructions. The fetching and decoding of the next instruction proceed concurrently with the execution of the present instruction. One external 8 MHz clock is divided internally into slower clocks for internal pipeline and parallel operation control, giving a 250 ns instruction cycle time. These clock signals are also used for testing purposes.

Figure 2.1 NEC 7720 Block Diagram

Memory configuration

The memories can be classified into four sections:

a. Instruction-ROM

The Instruction-ROM is a 512 x 23-bit Read Only Memory. The 23-bit word length reflects the 7720's multi-operation capability. There is also a four-level stack that allows the implementation of nested program loops.

b. Data-ROM

The Data-ROM is a 512 x 13-bit Read Only Memory. This ROM is used to store fixed data such as filter coefficients, FFT windows, address bit-reversal data, and sine-cosine look-up tables. It is indirectly addressable through one pointer, which can be post-decremented as part of an instruction.

c. Data-RAM

The Data-RAM is a 128 x 16-bit Random Access Memory. This RAM is partitioned into two halves, both addressed by the same Data Pointer (DP) [Figure 2.2]. The MSB (Most Significant Bit) decides which of the two halves interfaces to the internal data bus. In both cases, the high RAM value is always made available for input to the K-register of the multiplier. This unique feature allows two RAM values to be sent to the multiplier in the same instruction. An additional sub-bus is available for direct access to the ALU (Arithmetic Logic Unit) input. The RAM is indirectly addressable, and the Data Pointer can be post-modified (decremented or incremented) during the execution of an instruction.

Figure 2.2 Data RAM & Peripherals

As part of the same instruction effecting a load, multiply, and add, two distinct parts of the DP can be modified in different ways. The 3 high bits (DPH) are modified by being exclusive-ORed with a three-bit pattern in the instruction word. The 4 low bits (DPL) can be incremented, decremented, cleared, or left unaffected, depending on two other bits in the instruction word. This partitioning of the DP gives the RAM a column/row structure, where the row is defined by the DPH and the column by the DPL. An additional benefit arises from this structure: by placing all the values of an individual calculation in the same column (or row) position, a whole array of values requiring the same calculation can share one or two basic routines. The controllable flags DPL0 and DPLF, which detect the beginning and end of a row, together with the post-increment/decrement of the DPL, can be used as loop counters.
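The row/column behaviour of the Data Pointer can be illustrated with a short model. This is a minimal sketch under stated assumptions: the 7-bit pointer is taken as DPH in bits 6..4 and DPL in bits 3..0, the DPL is assumed to wrap modulo 16, and the operation names ("nop", "inc", "dec", "clr") are hypothetical labels, since the actual two-bit encoding is not given in the text.

```python
def update_dp(dp: int, dph_xor: int, dpl_op: str) -> int:
    """Return the post-modified 7-bit data pointer.

    dp      -- current pointer; bits 6..4 are DPH (row), bits 3..0 are DPL (column)
    dph_xor -- 3-bit pattern exclusive-ORed into DPH
    dpl_op  -- hypothetical operation label: "nop", "inc", "dec" or "clr"
    """
    dph = (dp >> 4) & 0x7
    dpl = dp & 0xF

    # Row select: exclusive-OR with the 3-bit pattern from the instruction word.
    dph ^= dph_xor & 0x7

    # Column select: assumed to wrap within the 16 columns of a row.
    if dpl_op == "inc":
        dpl = (dpl + 1) & 0xF
    elif dpl_op == "dec":
        dpl = (dpl - 1) & 0xF
    elif dpl_op == "clr":
        dpl = 0
    elif dpl_op != "nop":
        raise ValueError(f"unknown DPL operation: {dpl_op}")

    return (dph << 4) | dpl


if __name__ == "__main__":
    # Row 2, column 5: flip the top row bit and step to the next column.
    print(hex(update_dp(0x25, 0b100, "inc")))
```

Because the row bits are changed by exclusive-OR rather than by addition, one instruction can toggle between two rows while the column index steps through a block of data.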

d. Scratch-pad memory

The scratch-pad memory is provided by one 16-bit transition register (TR), the input registers of the multiplier (K and L), and the data register (DR). A special mnemonic, DRNF, is used for DR in order not to affect the DMA (Direct Memory Access) request or the handshaking flag RQM (Request for Memory).

Arithmetic Logic Unit (ALU)

A two's complement fixed-point 16-bit ALU is provided with two accumulators, A and B [Figure 2.3]. This ALU can handle complex data or double-precision calculations on 32-bit data. Direct inputs from the Data-RAM, the multiplier outputs (M and N), and the left/right shifter for data scaling offer multiple sources to the ALU and avoid overloading the Internal Data Bus (IDB). The accumulator not involved in an ALU operation remains available for data transfers, which increases the flexibility of this unit.

Figure 2.3 ALU Block Diagram

To handle ALU overflow, the 7720 provides "look-back" software overflow control. It often happens that an overflow is canceled by a subsequent underflow, or vice versa, so that the end result in the accumulator is correct. Four flags for each accumulator keep track of the overflow status for up to three consecutive ALU operations. The flags generate the correct saturation constant (+max or -max) in the sign register (SGN) after each operation. By testing the overflow status, the program can replace an erroneous accumulator result with the SGN register content during the following instruction. This technique avoids the distortion generated by the automatic replacement of every overflowed result with a saturation constant, which is the normal practice in other device architectures.

Multiplier

The 7720 has a hardware multiplier that takes two 16-bit inputs in parallel and produces a 32-bit product in one instruction cycle. The multiplication employs a carry-look-ahead technique using a modified Booth's algorithm. The input data range is [-1, +1]; the hypothetical binary point is therefore located after the first bit (the sign bit), and this unit cannot overflow. The main bus (IDB) and the four sub-buses that constitute the multiple-bus structure of the 7720 allow simultaneous loading of the multiplier's input registers (K and L) from various locations. The full-precision product in the M and N registers may be accumulated in two instruction steps by using another sub-bus.

Input/Output (I/O) Port

The 7720 has a flexible and fast I/O port system. Its I/O port is software-configurable to accommodate 8- or 16-bit word lengths. Flags are provided for internal checking of the I/O status, which allows efficient synchronization with the internal program flow.

The serial I/O channels, with their corresponding handshaking lines, are separate and double-buffered [Figure 2.4 and Figure 2.5]. They can operate at over 2 megabits per second (Mbps) under both the CCITT and AT&T standard specifications for full-duplex Pulse Code Modulation (PCM) data links. The capability of reversing the order of the received data during the transfer between the IDB and the Serial Output (SO) and Serial Input (SI) registers increases the number of serially oriented devices (ADC, DAC, codec, FIFO, etc.) that can be connected to the 7720.

Figure 2.4 Serial Input

Figure 2.5 Serial Output

The parallel I/O port is 8 bits wide [Figure 2.6] and may be used for data transfer or for reading the status of the processor. Data transfer is handled through a 16-bit data register (DR), which is software-configurable for single or double transfers. Flags for handshaking and control lines for DMA operation define the two transfer modes. A special mnemonic, DRNF, allows exit from a transfer loop by inhibiting the setting of the RQM flag or a new DMA request.
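The double-transfer option can be pictured as moving one 16-bit DR value across the 8-bit port in two byte-wide steps. The sketch below is illustrative only: the function names are hypothetical, and the low-byte-first order is an assumption, since the text does not state which byte is transferred first.

```python
from typing import Tuple


def split_word(word: int) -> Tuple[int, int]:
    """Split a 16-bit word into (low_byte, high_byte) for two 8-bit transfers.

    Assumption: the low byte is sent first; the real device may differ.
    """
    word &= 0xFFFF
    return word & 0xFF, (word >> 8) & 0xFF


def join_bytes(low: int, high: int) -> int:
    """Reassemble a 16-bit word from two bytes received over the 8-bit port."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


if __name__ == "__main__":
    low, high = split_word(0x12AB)
    print(hex(low), hex(high), hex(join_bytes(low, high)))
```

A host on the other side of the port would perform the inverse operation, so both ends must agree on the byte order.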

There is also a 2-bit general-purpose output port (P0, P1) that may be used for efficient signaling to external hardware.

Figure 2.6 Parallel I/O

Interrupt

The 7720 has only a one-level external interrupt. This interrupt uses a fixed vector pointing to the middle of the instruction-ROM. An interrupt enable flag (EI) makes it possible to ignore an interrupt request until a background routine is completed. This is used to synchronize the processor with the sampling rate.

3. The Instruction Set

Each instruction in the 7720 instruction set requires only one 250 ns instruction cycle for execution (with an 8 MHz clock). This is possible because of the 7720's pipeline architecture: during the execution of the present instruction, the fetching and decoding of the next instruction are done in parallel. Input and output functions are carried out independently of the internal program flow. The 7720 instruction set may be classified into three basic categories:

a. The arithmetic-move instruction type. This type is the workhorse of the processor. When used efficiently, it allows up to seven tasks to be done in one instruction cycle, concurrent with the I/O transfers.

b. The branch and jump instruction type. This type includes CALL, JUMP, and 32 conditional jumps based on the accumulator flags, the I/O status, or the RAM pointer value.

c. The load immediate data (LDI) instruction type. This type loads immediate data to a destination using the IDB, which implies that the instruction ROM can also be used to store fixed data and addresses.

CHAPTER 3 THE TI TMS320 DSP IC FAMILY

1. Introduction

The Texas Instruments TMS320 product line is a family of high-performance 16/32-bit single-chip microprocessors/microcomputers designed primarily for real-time digital signal processing (DSP) applications. The digital signal processors in the family combine the flexibility of a high-speed controller with the numerical capability of an array processor. This combination offers an inexpensive alternative to a multichip bit-slice processor or an expensive commercial array processor.

The TMS320 family made quite an impressive entry into the DSP market following the disclosure of the TMS32010 and the TMS320M10, the first microprocessor and microcomputer of the family, at the ISSCC in San Francisco in February 1982. This disclosure set the first mark on Texas Instruments' DSP IC family road map. The current TMS320 family consists of three generations of digital signal processors [Figure 3.1a and Figure 3.1b]. The first generation consists of the TMS32010 and its offspring: the TMS320C10, TMS32011, TMS320C10-25, TMS32010-14, TMS320C15/E15, TMS320C15-25, TMS320C17-25, and TMS320C17/E17. The second generation consists of the TMS32020 and TMS320C25. The third generation consists of the TMS320C3X, which is still being developed. This chapter is devoted to the first two generations, the TMS320C1X and TMS320C2X. The third-generation member of the family, the TMS320C3X, is covered in a separate chapter since it is a 32-bit design.

Figure 3.1a TMS320 Family Generation

Figure 3.1b TMS320 Family Generation (Technology)

To provide DSP system designers with easier, more flexible system integration and better cost/performance tradeoffs, Texas Instruments has also included peripheral circuits in the DSP devices to reduce chip count, board space, power consumption, and system cost.

2. TMS320 Family Description

There are many common features among these generations. Software compatibility is maintained throughout the family to protect the user's investment in the family.

Architecturally, the TMS320 utilizes a modified Harvard architecture for speed and flexibility. In the strict Harvard architecture, the program and data memories lie in two separate spaces, permitting a full overlap of instruction fetch and execution. The TMS320 family's modified Harvard architecture allows transfers between the program and data spaces. This increases the flexibility of the device and eliminates the need for a separate coefficient ROM, while still retaining processing power by having separate program and data bus structures for full-speed execution.

TMS320C1X

The first-generation device group, TMS320C1X, includes the following specific members:

o TMS32010, the first 20-MHz digital signal processor
o TMS32010-14, a 14-MHz version of the TMS32010
o TMS32010-25, a 25-MHz version of the TMS32010
o TMS32011, a TMS32010 with serial port
o TMS320C10, a CMOS 20-MHz version of the TMS32010
o TMS320C10-25, a 25-MHz version of the TMS320C10
o TMS320C15, a TMS320C10 with expanded ROM and RAM
o TMS320E15, an EPROM version of the TMS320C15
o TMS320C15-25, a 25-MHz version of the TMS320C15
o TMS320C17, a TMS320C15 with serial and coprocessor ports
o TMS320E17, an EPROM version of the TMS320C17
o TMS320C17-25, a 25-MHz version of the TMS320C17

Key Features

Some of the key features of the TMS320C1X devices include:

o Instruction cycle timing:
  - 160 ns (TMS32010-25/C10-25/C15-25/C17-25)
  - 200 ns (TMS32010/C10/11/C15/E15/C17/E17)
  - 280 ns (TMS32010-14)
o 32-bit ALU/accumulator
o 16-bit bidirectional data bus at 50-Mbps (megabits per second) transfer rate
o 16 x 16-bit parallel multiplier with a 32-bit product
o 0 to 16-bit barrel shifter
o 144/256-word on-chip data RAM
o 1.5K/4K-word on-chip program ROM
o 4K-word on-chip program EPROM (TMS320E15/E17)
o 4K-word total external memory at full speed
o On-chip clock generator
o 8 input/8 output channels
o Dual-channel serial port with timer (TMS32011/C17/E17)
o Direct interface to combo-codecs (TMS32011/C17/E17)
o On-chip u-law/A-law companding hardware (TMS32011/C17/E17)
o 16-bit coprocessor interface
o Single +5V supply voltage required
o Technology:
  - NMOS (TMS32010/11)
  - CMOS (TMS320C10/C15/E15/C17/E17)

TMS320C2X

The TMS320C2X digital signal processor is the second-generation member of the TMS320 family of VLSI digital signal processors and peripherals. Its architecture is based upon that of the TMS32010, the first member of the DSP microprocessor family. Its instruction set is a superset of that of the TMS32010, thus maintaining upward software compatibility. Some of the major enhancements of the TMS320C2X over the TMS32010 are as follows:

o Program execution from on-chip RAM
o 544 words of on-chip data RAM, of which 256 may be configured as either program or data memory
o 128K words of external memory expansion space (64K words of program memory, 64K words of data memory)
o Single-cycle multiply/accumulate instructions
o Instruction set support for floating-point operations
o Multiprocessor and global data memory interfaces
o Synchronization capability for synchronous multiprocessor configurations
o Block moves for data/program management
o Wait-state capability for communication to slow off-chip memories/peripherals

o On-chip timer and serial port
o Five auxiliary registers with their own arithmetic unit
o Bit-manipulation instructions
o Three external, maskable user interrupts

2.1 Architectural Description

Since the TI TMS320 family consists of more than 14 versions whose architectures are essentially the same, it is practical to select one member of the family for discussion in this paper. The TMS320C25 is chosen for this purpose since it is a second-generation device and is considered the most advanced member in this 16-bit category.

The TMS320C25 is a pin-compatible CMOS version of the TMS32020 with a faster instruction cycle time and additional hardware and software features. The TMS320C25 is completely object-code compatible with the TMS32020, so that TMS32020 programs run unmodified on the TMS320C25. Some of the major enhancements of the TMS320C25 are as follows:

o Faster instruction cycle time: 100 ns
o Low-power CMOS technology with power-down mode
o 4K words of on-chip masked ROM
o Eight auxiliary registers with a dedicated arithmetic unit
o Eight-level hardware stack

o Fully static double-buffered serial port
o Concurrent DMA using an extended hold operation
o Bit-reversed addressing modes for radix-2 FFTs
o Extended-precision arithmetic and adaptive filtering support
o Full-speed operation of MAC/MACD from external memory
o Accumulator carry bit and related instructions

The TMS320C25 high-performance digital signal processor implements a single-accumulator, Harvard-type architecture in which program and data memory reside in separate address spaces. This allows a full overlap of instruction fetch and execution. Instructions are included to provide data transfers between the two spaces. Externally, the program and data memory spaces are multiplexed over the same bus so as to maximize the address range for both spaces while minimizing the pin count of the device. Internally, the TMS320C25 architecture maximizes processing power by maintaining two separate bus structures, program and data, for full-speed execution. Increased flexibility in system design is provided by two large on-chip data RAM blocks, one of which is configurable as either program or data memory.

The TMS320C25 incorporates a separate level of pipelining for instruction decoding. The instruction fetch-decode-execute pipeline is essentially invisible to the user, except in some cases where the pipeline must be broken (such as for branch instructions). In these cases, the instructions have slightly different timing characteristics than on the TMS32020. Other instructions, such as those that operate with external data memory, have improved cycle timings compared to the TMS32020.

The device executes the majority of its instructions in a single machine cycle when sufficiently fast memory is utilized. The device may also communicate with slower off-chip memories or peripherals by utilizing the READY signal. In those cases, the instructions become multicycle instructions.

The functional block diagram of the TMS320C25 [Figure 3.2] outlines the principal blocks and data paths within the processor. The TMS320C25 architecture consists of the following major elements:

o Central Arithmetic Logic Unit (CALU)

o Memory Organization
o System Control
o External Interfaces

Figure 3.2 TMS320C25 Block Diagram

2.1.1 Central Arithmetic Logic Unit (CALU)

The TMS320C25 CALU contains the following elements:

a. A 16-bit scaling shifter, which has a 16-bit input connected to the data bus and a 32-bit output connected to the ALU. The scaling shifter produces a left shift of 0 to 16 bits on the input data, as programmed in the instruction. Additional shift capabilities enable the processor to perform numerical scaling, bit extraction, extended arithmetic, and overflow prevention.

b. A 32-bit ALU and accumulator, which perform a wide range of arithmetic and logical instructions, the majority of which execute in a single clock cycle. The 32-bit accumulator is split into two 16-bit segments for storage in data memory: ACCH (Accumulator High) and ACCL (Accumulator Low).

c. A 16 x 16 hardware multiplier, which is capable of computing a 32-bit product during every machine cycle.

2.1.2 Memory Organization

The TMS320C25 provides a total of 544 16-bit words of on-chip data RAM, divided into three separate blocks (B0, B1, and B2). Of the 544 words, 256 words (block B0) are configurable as either data or program memory by the CNFD or CNFP instructions provided for that purpose; 288 words (blocks B1 and B2) are always data memory. In addition, 64K words of off-chip directly addressable data memory space are provided.

The device also has a 4096-word on-chip ROM that is mask-programmable at the factory with customers' programs. This ROM may be mapped in or out of the TMS320C25's memory space by an external pin on the device, MicroProcessor/MicroComputer select (MP/MC). This permits the designer to accelerate time-to-market with a TMS320C25-based product by using external ROM, and to cost-reduce it later with the large 4K internal ROM on the device without any board redesign.

2.1.3 System Control

Control operations are provided on the TMS320C25 by the following components: an on-chip timer, a repeat counter, three external maskable user interrupts, and internal interrupts generated by serial port operations or by the timer.

The memory-mapped 16-bit timer (TIM) register is used for control operations and for synchronously sampling or writing to peripherals. The Repeat Counter (RPTC) allows a single instruction to be performed up to 256 times. The repeat feature can be used with instructions such as multiply/accumulates, block moves, I/O transfers, and table read/writes. Instructions that are normally multicycle are pipelined when using the repeat feature and effectively become single-cycle instructions.

The three external maskable user interrupts (INT2-INT0) are available for external devices that interrupt the processor. Internal interrupts are generated by the serial port, the timer, or the software interrupt instruction. Interrupts are prioritized, with reset having the highest priority and the serial port transmit interrupt having the lowest priority.
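The combination of the repeat counter with the multiply/accumulate path described above can be modelled as a sum of 16 x 16 products held in a 32-bit accumulator. The sketch below is a simplified model, not the device's instruction semantics: the function names are hypothetical, and the accumulator is assumed to wrap around in two's complement on overflow rather than saturate.

```python
MAX_REPEAT = 256  # per the text: one instruction can be performed up to 256 times


def to_signed32(value: int) -> int:
    """Wrap an integer to a 32-bit two's complement value."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def repeated_mac(coeffs, samples) -> int:
    """Model a repeated multiply/accumulate over 16-bit signed operands.

    Each step forms a 32-bit product and adds it to a 32-bit accumulator.
    Assumption: overflow wraps (no saturation).
    """
    n = len(coeffs)
    if n != len(samples) or not 1 <= n <= MAX_REPEAT:
        raise ValueError("need equal-length inputs of 1 to 256 values")
    acc = 0
    for c, x in zip(coeffs, samples):
        if not (-32768 <= c <= 32767 and -32768 <= x <= 32767):
            raise ValueError("operands must fit in 16 bits")
        acc = to_signed32(acc + c * x)
    return acc


def split_accumulator(acc: int):
    """Return the (ACCH, ACCL) 16-bit halves used when storing to data memory."""
    acc &= 0xFFFFFFFF
    return (acc >> 16) & 0xFFFF, acc & 0xFFFF


if __name__ == "__main__":
    total = repeated_mac([1, 2, 3], [4, 5, 6])
    print(total, split_accumulator(total))
```

A finite impulse response filter tap loop has exactly this shape, which is why a single repeated multiply/accumulate instruction is attractive for filtering.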

2.1.4 I/O Interface

The TMS320C25 supports a wide range of system interfacing requirements. Three separate address spaces (program, data, and I/O) provide interfacing to memory and I/O, thus maximizing system throughput. I/O design is simplified by having I/O treated the same way as memory: I/O devices are mapped into the I/O address space using the processor's external address and data buses in the same manner as memory-mapped devices. Interfacing to memory and I/O devices of varying speeds is accomplished by using the READY line.

The TMS320C25 I/O space consists of 16 input and 16 output ports. These ports provide the full 16-bit parallel I/O interface via the data bus on the device. A single input or output operation typically takes two cycles; however, when used with the repeat counter, the operation becomes single-cycle.

An on-chip serial port provides direct communication with serial devices such as codecs, serial A/D converters, and other serial systems. The interface signals are compatible with codecs and many other serial devices with a minimum of external hardware. The two serial port memory-mapped registers (the data transmit/receive registers) may be operated in either an 8-bit byte or a 16-bit word mode. The transmit framing synchronization pulse can be generated internally or externally. The maximum speed of the serial port is 5 MHz. The primary enhancements of the TMS320C25's serial port over that of the TMS32020 are:

o Double-buffering for both receive and transmit operations, thus allowing a continuous bit stream even if FSX is an output
o No minimum CLKR/CLKX frequency (fmin = 0 Hz)
o Frame sync mode (FSM) bit, which allows continuous operation with no frame synchronization pulses

The frame sync mode is useful in communicating with "PCM highways." For AT&T T1 and CCITT G.711/712 lines, the TMS320C25 can easily be made to communicate directly in these formats by counting the transmitted/received bits in software and performing SFSM/RFSM instructions as needed to set/reset the FSM bit.

CHAPTER 4 THE ANALOG DEVICES ADSP-2100

1. Introduction

The ADSP-2100 is a programmable single-chip microprocessor optimized for digital signal processing (DSP) and other high-speed numeric processing applications. In contrast to the architecture of typical DSP microcomputers, where some portion of system memory is included on chip, the ADSP-2100 contains no primary on-chip memory. Instead, it is designed to optimize external memory accesses, allowing much greater bus utilization and computational efficiency. The ADSP-2100 is fabricated in a high-speed 1.5-um double-layer metal CMOS process to offer low-power, high-throughput operation.

2. Architecture

The ADSP-2100 architecture is shown in Figure 4.1. This architecture is composed of seven (7) major components supported by a network of five internal buses. The major components and the internal buses are listed as follows:

a. Major Components:

o Arithmetic/logic unit
o Multiplier/accumulator
o Barrel shifter
o Two data address generators

Figure 4.1 ADSP-2100 Block Diagram

o Program sequencer
o Instruction cache memory
o PMD-DMD bus exchange

b. Internal Buses:

o Program Memory Address (PMA) bus
o Program Memory Data (PMD) bus
o Data Memory Address (DMA) bus
o Data Memory Data (DMD) bus
o Result (R) bus (which interconnects the computational units).

All of these buses except the Result (R) bus are extended off-chip for direct interface to external memories.

2.1 Arithmetic Logic Unit (ALU)

The Arithmetic Logic Unit (ALU) provides a standard set of arithmetic and logical functions. The arithmetic functions are add, subtract, negate, increment, decrement, and absolute value. These are supplemented by two division primitives with which multiple-cycle division can be constructed. The logic functions include AND, OR, XOR (exclusive OR) and NOT.

A block diagram of the ALU is shown in Figure 4.2. As shown, the ALU is 16 bits wide with two 16-bit input ports, X and Y, and one output port, R. The ALU accepts a carry-in signal (CI) which is the carry bit from the processor arithmetic status register (ASTAT). The ALU generates six status signals: the zero (AZ) status, the negative (AN) status, the carry (AC)

Figure 4.2 ALU Architecture

status, the overflow (AV) status, the X-input sign (AS) status, and the quotient (AQ) status. All arithmetic status signals are latched into the arithmetic status register (ASTAT) at the end of the cycle.

The X input port of the ALU can accept data from two sources: the AX register file or the result (R) bus. The R bus connects the output registers of all the computational units, permitting them to be used as input operands directly. The AX register file is dedicated to the X input port and consists of two registers, AX0 and AX1. These AX registers are readable and writable from the DMD bus. The AX register file outputs are dual-ported so that one register can provide input to the ALU while either one simultaneously drives the DMD bus.

The Y input port of the ALU can also accept data from two sources: the AY register file and the ALU feedback (AF) register. The AY register file is dedicated to the Y input port and consists of two registers, AY0 and AY1. These registers are readable and writable from the DMD bus and writable from the PMD bus. The AY register file outputs are also dual-ported: one AY register can provide input to the ALU while either one simultaneously drives the DMD bus.

The output of the ALU is loaded into either the ALU feedback (AF) register or the ALU result (AR) register. The AF register is an ALU internal register which allows

the ALU result to be used directly as the ALU Y input. The AR register can drive both the DMD bus and the R bus. It is also loadable directly from the DMD bus.

All the registers surrounding the ALU can be both read and written in the same cycle. When this happens, the value read from the register is the old one. The new value will be loaded into the register at the end of the cycle and therefore cannot be read out until the next cycle. This feature allows an input register to provide an operand into the ALU while simultaneously being loaded with the next operand from memory. It also allows the result register to be stored in memory while simultaneously being loaded with the result of a new computation.

The ALU section contains a duplicate bank of registers, shown in Figure 4.2 as a "shadow" behind the primary registers. There are actually two AR and AF registers and two sets of AX and AY register files. Only one bank is accessible at a time. The additional bank of registers can be activated during an interrupt service routine for extremely fast context switching. A new task, such as an interrupt service routine, can be executed without transferring current states to storage.

The selection of the primary or alternate bank of registers is controlled by a bit in the processor mode status register (MSTAT). Toggling this bit switches back

and forth between the two register banks.

2.2 Multiplier/Accumulator (MAC)

The Multiplier/Accumulator (MAC) provides high-speed multiplication, multiplication with cumulative addition, multiplication with cumulative subtraction and clear-to-zero functions. A feedback function allows part of the accumulator output to be directly used as one of the multiplicands on the next cycle.

A block diagram of the MAC is shown in Figure 4.3. The MAC contains two 16-bit input ports, X and Y, and a 32-bit product output port, P. The 32-bit product is passed to a 40-bit adder/subtractor which adds the new product to, or subtracts it from, the content of the multiplier result (MR) register. The MR register is 40 bits wide. The entire register is referred to here as MR; it actually consists of three smaller registers: MR0 and MR1, which are 16 bits wide, and MR2, which is 8 bits wide. The adder/subtractor is wider than 32 bits to allow for intermediate overflow in a series of multiply/accumulate operations. The multiply overflow (MV) status bit is set when the accumulator has overflowed beyond the 32-bit boundary, that is, when there are significant (non-sign) bits in the top nine bits of the MR register.

The input/output registers of the MAC section are similar to those of the ALU. The X input port can accept data from either the MX register file or from any register on the result (R) bus.

The R bus connects the output registers of all the computational units, permitting them to be used as input operands directly. There are two registers in the MX register file, MX0 and MX1. These registers can be read and written from the DMD bus. The MX register file outputs are dual-ported so that a single register can drive the DMD bus at the same time it supplies operands to the multiplier.

The Y input port can accept data from either the MY register file or the MF register. The MY register file has two registers, MY0 and MY1; these registers can be read and written from the DMD bus and written from the PMD bus. The MY register file outputs are also dual-ported so that a single register can drive the DMD bus at the same time it supplies operands to the multiplier.
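As a rough illustration of the multiply/accumulate behavior described in this section, the following Python sketch models a 40-bit accumulator split into MR2, MR1 and MR0, with an MV flag that is set once the result no longer fits in 32 bits. The function names (`to_signed`, `mac_step`, `split_mr`) are hypothetical, and the sketch assumes plain signed-integer products; any product format adjustment performed by the actual device is not modeled.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac_step(mr, x, y, subtract=False):
    """One multiply/accumulate step on a 40-bit accumulator (assumed model).

    mr       : current accumulator value (signed, 40-bit range)
    x, y     : 16-bit operands, interpreted as signed
    subtract : subtract the product instead of adding it
    Returns (new_mr, mv); mv is True when the top nine bits of the
    40-bit result are not all copies of the sign bit.
    """
    product = to_signed(x, 16) * to_signed(y, 16)      # 32-bit product
    total = mr - product if subtract else mr + product
    new_mr = to_signed(total, 40)                      # wrap to 40 bits
    mv = not (-(1 << 31) <= new_mr < (1 << 31))        # beyond 32 bits?
    return new_mr, mv


def split_mr(mr):
    """Split the 40-bit accumulator into (MR2, MR1, MR0) bit fields."""
    raw = mr & ((1 << 40) - 1)
    return (raw >> 32) & 0xFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


if __name__ == "__main__":
    mr = 0
    for step in range(3):
        mr, mv = mac_step(mr, 0x7FFF, 0x7FFF)
        print(step, hex(mr), mv)
    print([hex(field) for field in split_mr(mr)])
```

Accumulating the largest positive product three times pushes the sum past the 32-bit boundary, which is the situation the eight extra bits in MR2 are meant to absorb.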

The output of the adder/subtractor goes to either the MF register or the MR register. The MF register is a feedback register which allows bits 16-31 of the result to be used directly as the multiplier Y input on a subsequent cycle. The 40-bit adder/subtractor register (MR) is divided into three sections: MR2, MR1, and MR0. Each of these registers can be preloaded directly from the DMD bus and output to either the DMD bus or the R bus.

All of the registers surrounding the MAC have the capability of being both read and written in the same

cycle. When this happens, the value read from the register is the old one. The new value will be loaded into the register at the end of the cycle and therefore cannot be read out until the next cycle. This feature allows an input register to provide an operand into the MAC while simultaneously being loaded with the next operand from memory. It also allows a result register to be stored in memory while simultaneously being loaded with the result of a new computation.

The MAC section contains a duplicate bank of registers, shown in Figure 4.3 as a "shadow" behind the primary registers. There are actually two MR and MF registers and two sets of MX and MY register files. Only one bank is accessible at a time. The additional bank of registers can be activated during an interrupt service routine for extremely fast context switching. A new task, such as an interrupt service routine, can be executed without transferring current states to storage.

The selection of the primary or alternate bank of registers is controlled by a bit in the processor mode status register (MSTAT). Toggling this bit switches back and forth between the register banks.

2.3 Barrel Shifter

The shifter unit provides a complete set of shifting functions for 16-bit inputs, yielding a 16-bit or 32-bit output. These include arithmetic shift, logical shift,

Figure 4.3 MAC Block Diagram

normalization, derivation of exponent, and derivation of common exponent for an entire block of numbers. These basic functions can be combined to efficiently implement any degree of numerical format control, including full floating-point representation. A block diagram of the ADSP-2100 Barrel Shifter is provided in Figure 4.4.

2.4 Data Address Generators (DAGs)

The ADSP-2100 contains two independent data address generators (DAGs) so that both program and data memories can be accessed simultaneously. The DAGs provide indirect addressing capabilities. Both perform automatic address modification. For circular buffers, the DAGs can perform modular address modification. The two DAGs differ in a few respects. DAG1 can only generate data memory addresses, but provides an optional bit-reversal capability. DAG2 can generate both data memory and program memory addresses, but has no bit-reversal capability. A block diagram for the DAGs is shown in Figure 4.5.

2.5 Program Sequencer and Status

The program sequencer generates a stream of instruction addresses, providing flexible control of program flow. It provides for zero-overhead looping, single-cycle branching (both conditional and unconditional) and sophisticated interrupt processing.
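The modular and bit-reversed address modification mentioned for the DAGs in Section 2.4 can be sketched in Python as follows. This is only an illustration of the two addressing ideas; the function names and the explicit `base`, `length` and `bits` parameters are assumptions made for the example, not the register set or exact wrap rule of the ADSP-2100.

```python
def circular_next(address, modifier, base, length):
    """Advance `address` by `modifier`, wrapping inside [base, base + length).

    This is the modular update a circular buffer needs: the pointer never
    leaves the buffer, whatever the sign of the modifier.
    """
    return base + (address - base + modifier) % length


def bit_reverse(address, bits):
    """Reverse the order of the low `bits` bits of an address.

    Radix-2 FFTs produce (or consume) data in bit-reversed order, so a
    hardware bit-reversal option removes a separate reordering pass.
    """
    result = 0
    for _ in range(bits):
        result = (result << 1) | (address & 1)
        address >>= 1
    return result


if __name__ == "__main__":
    # Walk a 4-word circular buffer located at 0x100.
    addr = 0x100
    for _ in range(6):
        print(hex(addr))
        addr = circular_next(addr, 1, base=0x100, length=4)

    # Bit-reversed order for an 8-point FFT (3 address bits).
    print([bit_reverse(i, 3) for i in range(8)])
```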

Figure 4.4 Shifter Block Diagram

Figure 4.5 Data Address Generator

Figure 4.6 shows a block diagram for the program sequencer and status sections of the ADSP-2100. It is useful to be aware that the ADSP-2100 instruction set includes the following flow control instructions:

o JUMP
o CALL
o RETURN FROM SUBROUTINE (RTS)
o RETURN FROM INTERRUPT (RTI)
o DO UNTIL

2.6 Instruction Cache

The instruction cache memory stores a short history of up to sixteen previously executed instructions. When the next instruction is already contained in the cache, the cache supplies it directly, freeing up the Program Memory Data (PMD) bus for data transfers. The cache is an important part of the efficient bus utilization of the ADSP-2100. Cache operation is transparent. No maintenance or overhead is required for either the storage or use of instructions in cache memory.

2.7 PMD-DMD Bus Exchange

The PMD-DMD (P-D) bus exchange unit couples the program memory data bus and the data memory data bus, allowing them to transfer data in both directions. Since the program memory data (PMD) bus is 24 bits wide, while

Figure 4.6 Program Sequencer

the data memory data (DMD) bus is 16 bits wide, only the upper 16 bits of PMD can be directly transferred. An internal register (PX) contains the additional 8 bits. This register can be directly loaded and read when the full 24 bits are important. Figure 4.7 is a block diagram of the PMD-DMD Bus Exchange.
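The width mismatch handled by the bus exchange unit can be pictured with a short Python sketch: a 24-bit PMD word is split into the upper 16 bits, which travel on the DMD bus, and the lower 8 bits, which are held in PX. The function names are hypothetical and the sketch only shows the bit fields, not the timing of the transfer.

```python
def pmd_to_dmd(pmd_word):
    """Split a 24-bit PMD word into (dmd_word, px).

    dmd_word : upper 16 bits, transferred directly to the DMD bus
    px       : lower 8 bits, kept in the PX register
    """
    pmd_word &= 0xFFFFFF
    return pmd_word >> 8, pmd_word & 0xFF


def dmd_to_pmd(dmd_word, px):
    """Rebuild a 24-bit PMD word from a 16-bit DMD word and the PX byte."""
    return ((dmd_word & 0xFFFF) << 8) | (px & 0xFF)


if __name__ == "__main__":
    dmd, px = pmd_to_dmd(0xABCDEF)
    print(hex(dmd), hex(px))
    print(hex(dmd_to_pmd(dmd, px)))
```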

Figure 4.7 PMD-DMD Bus Exchange Unit

CHAPTER 5 THE ZORAN ZR34161

1. Introduction

The ZORAN ZR34161 Vector Signal Processor (VSP) is a programmable digital signal processor with an architecture optimized for efficient and fast execution of digital signal processing (DSP) algorithms. It is designed as a peripheral signal processor working in conjunction with a host controller. The device is programmable through a high-functionality, vector-oriented instruction set. The block floating-point arithmetic used for FFT calculations provides a significant improvement in the signal-to-computation-noise ratio over the traditional fixed-point calculations employed by most 16-bit, single-chip digital signal processors. This device is fabricated with CMOS technology and is available in a 48-pin DIP package.

2. Architecture

The VSP architecture consists of three major components [Figure 5.1]:

o The Bus Interface Unit (BIU)
o The Execution Unit (EU)
o Memory/Registers

2.1 Bus-Interface Unit (BIU)

The Bus-Interface Unit comprises everything on the left side of Figure 5.1. Included in the BIU are:

o Data Bus Buffers

Figure 5.1 ZR34161 VSP Block Diagram

o Address Generator
o Instruction Fetch Unit
o Bus-interface Control

The BIU is responsible for executing all bus-related operations including instruction fetching and decoding, data I/O, and communication among the internal and external memory devices. DMA activities between the VSP and external memory are also controlled within the BIU. An instruction FIFO (first-in, first-out buffer), able to store up to four VSP instructions, is present in the instruction fetch block.

The VSP transfers data and fetches instructions in a manner similar to traditional microprocessor peripherals with a DMA interface. Because of the "high-level" vector nature of the instruction set and the on-chip FIFO, both data I/O and instruction fetching may be done in blocks. Hence, the VSP has a simple DMA structure using the BRQ and BACK pins for effecting the block move operations. The DMA structure may operate in conjunction with a bus arbiter in the host system. Each data I/O instruction may transfer as many as 256 16-bit words over the bus.

The RD, WR, and DSTB pins are bidirectional depending on whether the VSP is using them for input or output. When the CS input pin is enabled, the pins are used as inputs. When the VSP is reading or writing to or from external memory in the master mode, they are output pins. When neither operation is taking place, the pins assume a high-impedance condition. It may be necessary to provide external pull-ups on these pins depending on the system configuration.

2.2 Execution Unit (EU)

The execution unit shown on the right side of Figure 5.1 is responsible for all ALU-intensive operations. Included in the EU are:

o 17 x 17-bit Multiplier
o Adder/Subtractor
o Two 25-bit Accumulators
o Logic Circuit for FFT Butterfly Implementation

The EU performs DSP operations with an architecture designed specifically to efficiently implement FFT butterflies. Because of the inherent structure of these types of operations, this same architecture can also be used to perform all of the other signal processing tasks provided by the VSP. The calculations are performed with an internal accuracy of 17 bits; both fixed and block floating-point FFT arithmetic are supported.

The EU performs the following types of DSP vector operations using a single "high-level" instruction:

o Real or complex dot product
o Real or complex addition
o Real or imaginary vector accumulation

o Scalar multiplication
o Absolute value
o Magnitude square/accumulate
o Complex conjugate
o Fast Fourier Transform

The architecture of the execution unit is shown in Figure 5.2. Additional data paths from the outputs of internal points in this architecture make it easy for the VSP to implement the signal processing instructions other than the FFT. The 25-bit accumulators allow accumulation with no overflow of the 256 17-bit words which are required for the 128 complex-point dot products. One of the accumulators is provided for the real part of the complex results and the second for the imaginary part.
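The accumulator headroom claim above can be checked with simple arithmetic. The Python sketch below assumes that each accumulated term is itself a signed 17-bit two's-complement value and that the accumulator is a signed 25-bit register; the function name is hypothetical.

```python
def fits_without_overflow(terms, word_bits=17, acc_bits=25):
    """Return True if `terms` worst-case words can be summed without overflow.

    Both the words and the accumulator are treated as signed
    two's-complement quantities of the given widths.
    """
    word_min = -(1 << (word_bits - 1))
    word_max = (1 << (word_bits - 1)) - 1
    acc_min = -(1 << (acc_bits - 1))
    acc_max = (1 << (acc_bits - 1)) - 1
    return acc_min <= terms * word_min and terms * word_max <= acc_max


if __name__ == "__main__":
    print(fits_without_overflow(256))   # 256 terms: 8 extra bits are enough
    print(fits_without_overflow(257))   # one more worst-case term overflows
```

Under these assumptions the eight bits of headroom (25 minus 17) cover exactly 2^8 = 256 worst-case terms, which matches the figure quoted for the 128 complex-point dot product.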

Figure 5.2 VSP Execution Unit Block Diagram

The EU hardware supports both fixed-point and block floating-point operations for FFT computations.

Using fixed-point arithmetic, the results of the ALU are scaled (divided by 2) at the end of each FFT pass. This eliminates the possibility of overflow during the addition required for the FFT butterfly.

With block floating-point arithmetic, the results of ALU operations at the end of each FFT pass are scaled only if an overflow has occurred. The vector scale factor is determined by the size of the maximum overflow in that pass (up to 2 bits). The block floating-point exponent then keeps track of the number of shifts performed during the FFT instruction. Thus, the block floating-point capability retains the precision of the input signal during FFT computations by not performing right shifts if no overflow has occurred. The VSP uses a four-bit scale factor which, when using block floating-point arithmetic, allows scaling of up to 2^15.
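The difference between the two scaling policies can be sketched in Python. The code below is a simplified model, not the VSP's actual data path: it assumes 17-bit signed results, at most two bits of overflow per pass, and plain truncating right shifts, whereas the device is described as using an unbiased rounding scheme. The function names are hypothetical.

```python
def fits(value, bits):
    """True if `value` is representable as a signed `bits`-bit integer."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def fixed_scale(values):
    """Fixed-point policy: always divide by 2 at the end of a pass."""
    return [v >> 1 for v in values], 1


def block_scale(values, word_bits=17, guard_bits=2):
    """Block floating-point policy: shift the whole block only as far as
    needed (0, 1 or 2 bits) for every element to fit in `word_bits`."""
    shift = 0
    while shift < guard_bits and not all(fits(v >> shift, word_bits) for v in values):
        shift += 1
    return [v >> shift for v in values], shift


if __name__ == "__main__":
    exponent = 0
    for block in ([100, -200], [70000, -3], [200000, 0]):
        scaled, shift = block_scale(block)
        exponent += shift           # block exponent counts the total shifts
        print(block, "->", scaled, "shift", shift, "exponent", exponent)
```

A block that stays within range keeps all of its bits under the block floating-point policy, whereas the fixed-point policy would discard one bit per pass regardless.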

The length of the data vectors on which the EU can operate in a single instruction is limited to the VSP RAM size of 128 complex samples. When performing signal processing operations which use data arrays larger than this, the operations must be factored into multiple smaller-size operations of 128 samples or less. The appropriate instruction is then used as a "kernel" that is called multiple times to complete the larger operation. For instance, applications requiring FFTs of more than 128 points must call the FFT instruction multiple times in order to complete the transform.

2.3 Memory and Registers

The ZR34161 memory consists of a 128-complex-word by 38-bit data RAM, a 64 by 4-bit scale RAM, and a 256-word by 17-bit sine and cosine look-up table. Each word of the data RAM consists of a 19-bit real part and a 19-bit imaginary part. The upper two bits are guard bits to prevent overflow when using block floating-point FFT computations and are not available externally. The LSB is carried internally to provide extra precision for the unbiased rounding scheme used by the VSP and is not available externally either.

The data RAM can be configured as two independent sections, each containing 64 complex words. One of the independent RAM sections can be accessed by the BIU while the other section is being accessed simultaneously by the EU. This powerful feature allows I/O instructions to be performed concurrently with ALU operations in many cases. This is particularly efficient in applications that require continuous FFT calculations in real time. While the EU is performing FFT butterfly calculations in one RAM section, the BIU can first store the results of the previous FFT instruction to external memory and then read the data to be transformed next into the other RAM section.

The Sine/Cosine look-up table (LUT) is an on-chip ROM that contains 256 16-bit data values representing the sines and cosines of the angles from 0 to 90 degrees in increments of 0.35 degrees. The LUT coefficients are used to compute FFTs of up to 1024 points in length without external coefficients being supplied.

In addition to the RAM and ROM, the VSP also contains the following registers:

1. Scale register - This register is 16 bits wide and is used to store up to four scale factors for FFT computations.

2. Maximum scale register - This is a 4-bit register and is used for keeping track of the largest scale factor generated during a sequence of FFT instructions.

3. Old maximum scale register - This register is also 4 bits wide and is updated with the current value in the maximum scale register by the execution of the LDSM (Load Scale/Mode Register) instruction with the parameters MD=0 and UP=1.

4. Status register - This register is 14 bits wide and contains all status information about the VSP except for scaling which is stored in scale registers.

5. Mode register - This register is 16 bits wide and is used for programming the operating mode of the VSP.

6. Next fetch address register - This is a 16-bit readable register. It holds the address of the next instruction to be fetched by the VSP in the master mode.

7. Instruction Base/Start register - This is a 16-bit register that is used by the host processor to write a program base address that points to external memory. Writing an address to this register commands the VSP to begin instruction fetch and execution in the master mode at the base address pointed to by the register.

3. Instruction Set

There are 23 instructions oriented for DSP applications. The instruction word length varies from one to three words. The VSP is programmed at the functional level rather than at the assembly-language level usually found in other processors, because each instruction in the VSP is analogous to a subroutine kernel in a signal processing library. Most of the instructions contain parameters whose values control the instruction execution. The 23 instructions in the VSP are categorized as follows:

a. Data Block Movement

b. ALU/Memory Two-Vector Data Instructions

c. ALU Single-Vector Data Instructions

d. Control Instructions

CHAPTER 6 THE MOTOROLA DSP56000

1. Introduction

The DSP56000 is the first member of a new family of special-purpose microprocessors designed by Motorola in an attempt to enter the digital signal processing (DSP) market. It is a high-performance, user-programmable digital signal processing IC implemented in 1.5-micron HCMOS (H stands for high density) technology. The highly parallel processor core consists of three execution units: the program controller, the data Arithmetic Logic Unit (ALU), and the address ALU. This core is capable of executing 10.25 million instructions per second (MIPS). To overcome the limitation posed on other 16-bit devices by the precision and dynamic range requirements, the DSP56000 offers a unique 24-bit data word size. Its ability to perform a 24-by-24-bit multiplication and a 56-bit accumulation in a single 97.5-ns machine cycle is one of its most impressive characteristics, since multiplication and accumulation are among the most heavily used sequences in DSP algorithms. Presently, the DSP56000 is the only 24-bit single-chip digital signal processing IC announced on the market.

2. Architecture

A block diagram of the DSP56000 architecture is shown in Figure 6.1. This architecture is composed of a core processor and other on-chip peripherals including an 8-bit parallel Host Interface (HI), Synchronous Serial Interface (SSI), Serial Communication Interface (SCI) and on-chip program and data memory. The core processor is used to execute the processor instruction set while other resources provide storage and input/output (I/O) capability for the core processor. On-chip memory and peripherals are not considered part of the core and may vary from one family member to another.

For discussion, the major components of the DSP56000 are listed as follows:

o Buses:
   a. Data Buses
   b. Address Buses

o Core Processor:
   a. Program Control and System Stack
   b. Data ALU
   c. Address ALU

o On-chip memory:
   a. X Data Memory
   b. Y Data Memory
   c. Program Memory

o Interfaces:
   a. External Bus Interface
   b. Host Interface
   c. Serial Communication Interface (SCI)
   d. Synchronous Serial Interface (SSI)

Figure 6.1 DSP56000 Block Diagram

2.1 The DSP56000 Buses

2.1.1 Data Buses

There are three (3) 24-bit bidirectional data buses, X Data Bus (XD), Y Data Bus (YD), and Global Data Bus (GD), for data movement in the DSP56000. The XD and YD may be treated as 24-bit buses individually, or as one 48-bit data bus by concatenation. Data transfers between the data ALU and the X Data Memory and Y Data Memory occur over XD and YD. These transfers are kept local on the chip to maximize speed and minimize power. All other data transfers occur over the Global Data Bus (GD).

2.1.2 Address Buses

There are two (2) 16-bit unidirectional buses, X Address Bus (XA) and Y Address Bus (YA), for internal X Data Memory and Y Data Memory address specification. The internal address bus sizes depend on the amount of internal memory implemented. In addition, there is also a Program Address Bus (PA) for addressing the Program Memory. External memory spaces are addressed through a single 16-bit unidirectional address bus driven by a three (3) input multiplexer that selects either the X, Y or Program Address Bus. When external memory is used, one additional instruction cycle is required for each external memory access.

2.2 The DSP56000 Core Processor

The core of the DSP56000 processor is composed of three major, separate execution units:

a. The Program Controller (PC)
b. The Data ALU
c. The Address ALU

These units are interconnected by a multiple-bus architecture, and each execution unit works on the same instruction at the same time in a parallel rather than a pipelined fashion as in some of the previously discussed devices. This parallel operation allows the three execution units to provide and combine all the resources to execute most DSP56000 instructions in a single instruction cycle (97.5 ns using a 20.5 MHz processor clock). Each execution unit is itself single-cycle and non-pipelined. Also, each one contains a set of registers, arithmetic elements, and executable control code for operations. These operations are register-oriented as opposed to memory-oriented as in general microprocessors. Each execution unit operates on its own local registers. The source operands are read from registers within the execution unit. These operands are then modified by arithmetic element operations and the results are stored in the registers located in that same execution unit. Data transfers between execution units or

between execution units and memory occur in parallel with internal execution unit operations. One notable feature of this architecture is that ALU results for conditional branching are available after each cycle without delay. This is a key advantage over a pipelined architecture, in which the programmer faces the complex task of reestablishing data flow whenever an operation is interrupted.

2.2.1 The Program Controller

The program controller unit controls the processor instruction flow, decodes the instructions, and performs exception processing. It consists of six directly addressable registers [Figure 6.2]: the program counter (PC), the loop address (LA), the loop count (LC), the status register (SR), the operating mode register (OMR), and the stack pointer (SP). In addition, there is a 15-level by 32-bit system stack memory.

1. Program Counter (PC)

The program counter is a 16-bit register containing the address of the next location to be fetched from program memory space. The PC may point to instructions, data operands, or the addresses of operands. References to this register are always inherent and implied by most instructions. This special-purpose address register is stacked when a program loop is initiated or a jump to subroutine (JSR) is performed and when interrupts occur

Figure 6.2 Program Controller

(except for fast interrupts, during which the PC is never updated; an interrupt is considered fast if none of the instructions of the interrupt service routine causes a change of flow).

2. Loop Counter (LC)

The loop counter is a special-purpose 16-bit counter used to specify the number of times to repeat a hardware program loop. This register is stacked by a DO instruction and unstacked by end-of-loop processing or by execution of an ENDDO instruction. When the end of a hardware program loop is reached, the contents of the loop counter register are tested for one. If the loop counter is one, the program loop is terminated and the LC register is loaded with the previous LC contents stored on the stack. If the counter is not one, its value is decremented by one and the program loop is repeated. The number of times a loop has been executed can be determined during execution by reading the LC under program control.

3. Loop Address Register (LA)

The loop address register indicates the location of the last instruction in a program loop. This register is stacked by a DO instruction and unstacked by an ENDDO instruction or by end-of-loop processing. When the instruction at the address contained in this register is fetched, the contents of LC are checked. If LC is one, the PC is incremented, the loop flag is restored from the stack, the stack is then purged, the LA and LC registers are pulled from the stack and restored, and instruction execution continues normally. If LC is not one, it is decremented, and the next instruction is taken from the address at the top of the system stack. In general, the LA register is a read/write register which is written into by a DO instruction and is read by the system stack for stacking the register.
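The interplay of LC, LA and the system stack described above can be mimicked with a small Python model. This is a sketch of the test-for-one rule only: the class and method names are hypothetical, the stacking of the PC and loop flag is omitted, and the instruction addresses are arbitrary illustrative values.

```python
class LoopHardware:
    """Toy model of the LC/LA registers and the stack used by DO loops."""

    def __init__(self):
        self.lc = 0        # loop counter
        self.la = 0        # address of the last instruction in the loop
        self.stack = []    # stands in for the system stack

    def do(self, count, last_address):
        """DO: stack the current LA/LC, then load the new loop parameters."""
        self.stack.append((self.la, self.lc))
        self.lc, self.la = count, last_address

    def end_of_loop(self):
        """Called when the instruction at LA is fetched.

        Returns True if the loop body must run again, False if the loop
        terminates (in which case the previous LA/LC are restored).
        """
        if self.lc == 1:
            self.la, self.lc = self.stack.pop()
            return False
        self.lc -= 1
        return True


def count_nested(outer, inner):
    """Count executions of the innermost body for two nested DO loops."""
    hw = LoopHardware()
    executed = 0
    hw.do(outer, last_address=20)
    while True:
        hw.do(inner, last_address=10)
        while True:
            executed += 1
            if not hw.end_of_loop():
                break
        if not hw.end_of_loop():
            break
    return executed, hw.lc, hw.stack


if __name__ == "__main__":
    print(count_nested(3, 2))
```

Because the previous LC and LA are stacked by each DO and restored at loop termination, the outer loop count survives the inner loop without any explicit save and restore by the program.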

4. Status Register (SR)

The status register is a 16-bit register. It consists of an 8-bit Condition Code Register (CCR) and an 8-bit Mode Register (MR). The MR is the high-order 8 bits and the CCR is the low-order 8 bits of the Status Register.

The MR bits are only affected by the processor reset, exception processing, the DO, ENDDO, RTI (Return From Interrupt), and SWI (Software Interrupt) instructions, and by other instructions which directly reference the MR register. During processor reset, the interrupt mask bits of the MR will be set, and the scaling mode bits, loop flag, and trace bit will be cleared.

The CCR is a special-purpose control register which defines the current user state of the processor at any given time. The CCR bits are affected by data ALU operations, parallel move operations, and by instructions which directly reference the CCR register. The CCR bits are not affected by XD (X Data Bus) and YD (Y Data Bus) bus data transfers unless data limiting occurs when reading the A or B accumulators. During the processor reset, the limit bit will be cleared while all other CCR bits remain unchanged. The condition code portion of the status register consists of seven (7) defined bits [Figure 6.3]:

C - Carry
V - Overflow
Z - Zero
N - Negative
U - Unnormalized
E - Extension
L - Limit

The first six CCR bits, C, V, Z, N, U, and E, are true condition code bits that reflect the condition of the results of data ALU operations. They are not affected by address ALU calculations or by data transfers over the X, Y, or global data buses. The L bit is a latching overflow bit which indicates that an overflow has occurred in the Data ALU, or that data limiting has occurred when reading a Data ALU register. This limiting occurs when a data bus move operation passes accumulator data through the data shifter/limiter. The standard definition of the condition codes is as follows:

C - (Carry)

This bit is always cleared except in the following cases:

o It is set if a carry is generated out of the most significant bit of the result of an addition.

o It is also set if a borrow is generated in a subtraction. Since the DSP56000 accumulator is 56 bits wide, the carry or borrow is generated out of bit 55 of the result.

V - (Overflow)

This bit is always cleared except when an arithmetic overflow occurs in the 56-bit result, indicating that the result is not representable in the accumulator register and that the accumulator register has overflowed.

Z - (Zero)

This bit is set only if the result equals zero and is cleared otherwise.

N - (Negative)

This bit is set if the result is a negative number or, equivalently, if the most significant bit (bit 55) of the result is set.

U - (Unnormalized)

This bit is set if the two MSBs of the MSP portion of the result are the same, and cleared otherwise. The MSP portion is defined by the scaling mode (bits S1 and S0, which are bit 11 and bit 10 of the SR), and the U bit is computed as follows:

S1  S0  Scaling Mode   U Bit Computation
0   0   No scaling     U = NOT (bit 47 XOR bit 46)
0   1   Scale down     U = NOT (bit 48 XOR bit 47)
1   0   Scale up       U = NOT (bit 46 XOR bit 45)

E - (Extension)

The extension (E) bit is cleared if all the bits of the integer portion of the 56-bit result are the same (all 0's or all 1's). It is set otherwise. The integer portion is defined by the scaling mode and the E bit is computed as follows:

S1  S0  Scaling Mode   Integer Portion
0   0   No scaling     Bits 55,...,47
0   1   Scale down     Bits 55,...,48
1   0   Scale up       Bits 55,...,46

If bit E is cleared, the low-order fraction portion contains all the significant bits (i.e., the high-order integer portion is just sign extension), and the accumulator extension register can be ignored.

If E is set, the extension accumulator is in use.

L - (Limit)

The limit bit is set in two cases:

1. If the overflow bit (V) is set.

2. If the data shifter/limiters perform a limiting operation.

After being set, the L bit is cleared only by a processor reset or by instructions which specifically clear it. Note that L is affected by data movement operations which read the A or B accumulator registers. Otherwise, it is not affected.
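The N, Z, U, and E definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and table names are hypothetical, the U bit is taken to be the complement of the exclusive-OR of the two bits listed in the U table (which agrees with "set if the two MSBs are the same"), and the C, V, and L bits are left out because they depend on the operation that produced the result.

```python
MASK56 = (1 << 56) - 1

# (S1, S0) -> lowest bit of the integer portion; the U test uses this bit
# and the one below it (47/46, 48/47, 46/45), as in the tables above.
INTEGER_LOW_BIT = {
    (0, 0): 47,  # no scaling
    (0, 1): 48,  # scale down
    (1, 0): 46,  # scale up
}


def bit(value, n):
    """Return bit n of value."""
    return (value >> n) & 1


def condition_codes(result, s1=0, s0=0):
    """Compute N, Z, U and E for a 56-bit accumulator result."""
    result &= MASK56
    low = INTEGER_LOW_BIT[(s1, s0)]
    width = 56 - low                      # size of the integer portion
    integer = result >> low               # bits 55..low
    return {
        "N": bit(result, 55),
        "Z": int(result == 0),
        "U": int(bit(result, low) == bit(result, low - 1)),
        "E": int(integer not in (0, (1 << width) - 1)),
    }


# 0.5 in the accumulator (bit 46 set): normalized, extension not in use.
print(condition_codes(0x00400000000000))
```

With no scaling, a result of 0.5 gives U = 0 and E = 0, while a result of 2.0 (bit 48 set) gives E = 1 because the integer portion is no longer pure sign extension.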

5. Operating Mode Register (OMR)

The Operating Mode Register (OMR) is an 8-bit register which defines the current operating mode of the processor [Figure 6.4]. The OMR bits are affected only by processor reset, during which the chip operating mode is loaded from the external Mode Select pins, and by instructions which directly reference the register.

The chip operating mode bits, M1 and M0, indicate the bus expansion mode of the DSP56000. These bits are loaded from the external Mode Select pins (MODA and MODB) on processor reset. After the processor leaves the reset state, M1 and M0 may be changed under program control. The chip operating modes are shown below:

M1  M0  Chip Operating Mode
0   0   Undefined
0   1   Undefined
1   0   Normal Expansion
1   1   Development Expansion

Additionally, the other OMR bits (bits 2 through 7) are reserved for future expansion and will be read as zero during read operations.

Figure 6.3 Status Register (MR fields: Loop Flag, Trace Mode, Scaling Mode S1-S0, Interrupt Mask I1-I0; CCR fields: L, E, U, N, Z, V, C)

Figure 6.4 Operating Mode Register

6. The Stack Pointer (SP)

The Stack Pointer is a 6-bit register that indicates the location of the top of the system stack and the status of the stack (underflow, full, and overflow conditions). The Stack Pointer is referenced implicitly by instructions such as DO, REP, JSR, and RTI, or directly by the MOVEC (Move Immediate Short To Control Register) instruction. The SP format [Figure 6.5] and possible stack values [Figure 6.6] are described with their bit positions as follows:

Stack Pointer

The Stack Pointer portion consists of bits 0, 1, 2, and 3 of the SP register. It points to the last used location on the stack. Immediately after processor reset these bits are cleared (SP=0) to indicate that the stack is empty. Data is pushed onto the stack by incrementing SP and then writing the item at the stack location indicated by SP. Conversely, an item is popped off the stack by reading it from location SP and then decrementing SP by one.

Stack Error Flag (SE)

The Stack Error flag (SE) occupies bit 4 of the SP register. It indicates that a stack error has occurred, and the transition of SE from 0 to 1 causes the priority level 3 stack error exception. When the stack is completely full, the Stack Pointer reads 001111, and any operation that pushes data onto the stack will cause a stack error exception and the SP will read 010000 (or 010001 if an implied double push occurs). Any implied pull operation with SP=0 will cause a stack error exception and the SP will read all ones (or 111110 if an implied double pull occurs). In all of these cases the SE bit is set. When SP=0 (stack is empty), instructions which read the stack without SP post-decrement and instructions which write the stack without SP pre-increment do not cause a stack error exception. Examples are DO SSL,xxxx; REP SSL (Repeat Next Instruction); and MOVEC (Move Immediate Short To Control Register) or MOVEP (Move Peripheral Data) when SSL is specified as a source or destination.

Underflow Flag (UF)

The Underflow Flag (UF) occupies bit 5 of the SP register and is set when a stack underflow occurs.

Unimplemented Stack Pointer Register Bits

The remaining unimplemented stack pointer register bits are reserved for future expansion and will be read as zeros during processor read operations.

Figure 6.5 Stack Pointer (SP) format (fields: Underflow Flag, Stack Error Flag, Stack Pointer)

UF  SE  P3  P2  P1  P0
1   1   1   1   1   0   <- Stack underflow condition after double pull
1   1   1   1   1   1   <- Stack underflow condition
0   0   0   0   0   0   <- Stack empty (reset). Pull causes underflow.
0   0   0   0   0   1   <- Stack location 1
...
0   0   1   1   1   0   <- Stack location 14
0   0   1   1   1   1   <- Stack location 15. Push causes overflow.
0   1   0   0   0   0   <- Stack overflow condition
0   1   0   0   0   1   <- Stack overflow condition after double push

Figure 6.6 Stack Pointer Values
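The stack pointer values quoted above are what one would get from a plain 6-bit up/down counter whose bits 4 and 5 double as the SE and UF flags. The sketch below models the register that way. Treating the SP as a simple wrapping counter is an assumption made for illustration (it reproduces the values listed in Figure 6.6, but it does not model the exception itself or any recovery behaviour), and the function names are hypothetical.

```python
SP_MASK = 0x3F      # the SP is a 6-bit register
SE_BIT = 1 << 4     # Stack Error flag
UF_BIT = 1 << 5     # Underflow flag


def push(sp, count=1):
    """SP after an implied push (count=2 models an implied double push)."""
    return (sp + count) & SP_MASK


def pull(sp, count=1):
    """SP after an implied pull (count=2 models an implied double pull)."""
    return (sp - count) & SP_MASK


def describe(sp):
    """Split the 6-bit SP value into pointer field and flags."""
    return {
        "pointer": sp & 0x0F,
        "SE": bool(sp & SE_BIT),
        "UF": bool(sp & UF_BIT),
    }


full = 0b001111
print(format(push(full), "06b"))      # overflow after a push on a full stack
print(format(pull(0), "06b"))         # underflow after a pull on an empty stack
print(describe(pull(0, 2)))
```

Pushing on a full stack (001111) yields 010000 with SE set, and pulling from an empty stack yields 111111 with both SE and UF set, matching the cases described in the text.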

7. System Stack (SS)

The System Stack is a separate internal RAM which is 32 bits wide and 15 levels deep. It is used to store the contents of the Program Counter (PC) and the Status Register (SR) for subroutine calls and long interrupts. It is also used to store the contents of the Loop Counter (LC) and the Loop Address register (LA), in addition to the SR and PC, for program looping operations. The System Stack internal RAM is located in Stack Memory Space, and its address is always inherent and implied by the current instruction. When a subroutine call or long interrupt occurs, the contents of the PC and SR registers are pushed onto the top location in the System Stack. When a return from subroutine occurs, the contents of the top location in the System Stack are popped to the PC. When a return from interrupt occurs, the contents of the top location in the System Stack are popped to the PC and SR.

The System Stack is also used to implement no-overhead hardware program loops. When a program loop is initiated by the execution of a DO instruction, the following steps occur:

o The current 16-bit Loop Counter (LC) and 16-bit Loop Address register (LA) are pushed onto the system stack to allow nested loops.

o The LC and LA registers are initialized with the values specified in the DO instruction.

o The address of the first instruction in the program loop and the current status register contents are pushed onto the System Stack.

o The Loop Flag bit in the Status Register is set.

The Loop Flag bit is set while a program loop is in progress and enables end-of-loop detection. A program loop begins execution after the DO instruction and continues until the program address fetched equals the loop address register contents. The contents of the Loop Counter are then tested for one. If the LC is not one, the following steps are performed:

o The LC is decremented.

o The top location in the Stack RAM is read (but not pulled) into the PC to return to the start of the loop.

If the LC is one, the following events occur:

o The program loop is terminated by incrementing the PC.

o The previous loop flag bit from the top location in the System Stack is read into the Status Register.

o The stack is purged (the contents of the top location are pulled and discarded).

o The LA and LC registers are pulled off the stack and their respective registers are restored.

o The loop flag as well as the Stack Pointer are restored. The loop flag is always pulled from the system stack when a loop is terminated, which indicates whether the terminated loop was a nested loop.
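The DO loop sequence described above can be followed step by step in a small simulation. The sketch below is a simplified model, not the hardware: the function name and arguments are hypothetical, only the loop flag of the SR is modelled, the end-of-loop test is made after the instruction at LA is executed rather than when it is fetched, and a loop count of at least one is assumed.

```python
def run_do_loop(count, first, last, body):
    """Run body(pc) over addresses first..last, `count` times.

    Returns (final PC, loop flag, stack depth) after the loop ends.
    """
    if count < 1:
        raise ValueError("this sketch assumes a loop count of at least one")

    stack = []
    la, lc, loop_flag = 0, 0, False      # state before the DO instruction

    # DO: save outer LA/LC, load new ones, save loop start and old SR.
    stack.append((la, lc))
    la, lc = last, count
    stack.append((first, loop_flag))
    loop_flag = True

    pc = first
    while True:
        body(pc)
        if pc != la:
            pc += 1
            continue
        if lc != 1:
            lc -= 1
            pc = stack[-1][0]            # top of stack is read, not pulled
        else:
            pc += 1                      # leave the loop
            _, loop_flag = stack.pop()   # restore loop flag, purge the entry
            la, lc = stack.pop()         # restore outer LA and LC
            break
    return pc, loop_flag, len(stack)


trace = []
print(run_do_loop(3, 10, 12, trace.append), trace)
```

Running a three-instruction loop body three times visits addresses 10, 11, 12 three times, then continues at address 13 with the loop flag cleared and the stack empty again, which is what makes nesting possible.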

2.2.2 Data Arithmetic Logic Unit (ALU)

The Data ALU performs all the arithmetic and logical operations on the DSP56000 data operands. The Data ALU is capable of multiplication, multiply-accumulate with positive or negative accumulation, addition, subtraction, shifting, and logical operations. All of these operations can be done in one instruction cycle.

A block diagram of the DSP56000 Data ALU architecture is shown in Figure 6.7. There are six (6) major components in the Data ALU:

1. Data ALU Input Registers

2. Data ALU Accumulator Registers

3. Multiply-Accumulator (MAC) Unit

4. Bit Manipulation Unit

5. Accumulator Shifter

6. Data Shifter/Limiters

2.2.2.1 Data ALU Input Registers (X1, X0, Y1, Y0)

The Data ALU Input Registers are X1, X0, Y1, and Y0, which are 24-bit latches. These latches serve as input pipeline registers for the Data ALU. Each register may be read or written by the XD or YD data bus. These registers are used as source operands for most data ALU operations.

2.2.2.2 Data ALU Accumulator Registers (A2, A1, A0, B2, B1, B0)

The Data ALU Accumulator Registers are composed of the 24-bit latches A1, A0, B1, and B0. The A2 and B2 latches serve as extension registers. Each register may be read or written by the X Data Bus (XD) or the Y Data Bus (YD) as a complete word operand.

The accumulator registers are treated as two 56-bit registers, A (A2, A1, A0 combined) and B (B2, B1, B0 combined), for data ALU operations. These accumulator registers receive the EXT:MSP:LSP portions (Extension, Most Significant Portion, Least Significant Portion) of the Multiply-Accumulator unit output and serve as source and/or destination operands.

Figure 6.7 Data ALU

When A2 or B2 is read, the register contents occupy the low-order portion (bits 0-7) of the word; the high-order portion (bits 8-23) is sign-extended. When A2 or B2 is written, the register receives the low-order portion of the word; the high-order portion is not used.

Automatic sign extension is provided when the A or B register is written with a smaller operand. This can occur when writing A or B from the XD and/or YD data buses or with the results of certain data ALU operations such as Tcc (Transfer Conditionally) or TFR (Transfer Data ALU Register). If a word operand is written to an accumulator register (A or B), the MSP portion of the accumulator receives the word, the LSP portion is zeroed, and the EXT portion is sign-extended from the MSP. Long word operands are written into the low-order portion (MSP, LSP) of the accumulator register and the EXT portion is sign-extended from the MSP. No sign extension is performed when an individual 24-bit register is written.

Test logic is included in each accumulator register to support the operation of the data shifter/limiters. The test logic detects overflows out of the data shifter so that the limiter can substitute one of several constants to minimize the error due to the overflow. This is called saturation arithmetic.

2.2.2.3 Multiplier-Accumulator (MAC) Unit

The Multiplier-Accumulator (MAC) unit is the heart of the Data ALU. This unit performs all of the calculations on the data operands. It accepts up to three (3) 24-bit input operands and produces a 56-bit result of the form EXT:MSP:LSP. The MAC unit operates independently of, and in parallel with, XD and YD data bus activity. The Data ALU registers provide pipelining for both data ALU inputs and outputs. All ALU operations occur in one instruction cycle. To avoid race conditions, latches are provided on the MAC unit input operand buses. There are two major components in the MAC unit:

1. Multiply-Accumulator Array and Logical Unit

The multiply-accumulator (MAC) array is a 24 x 24 bit parallel multiply-accumulator with 56-bit accumulation. The MAC array structure is based on the modified Booth's algorithm. The MAC array is used in all arithmetic and logical operations. The array performs signed arithmetic with a fractional data representation. The MAC array also performs rounding if specified in the DSP56000 instruction. The type of rounding is specified by the scaling mode bits in the Status Register (SR).

The MAC array can perform full 56-bit addition or subtraction of the two full accumulators. The MAC array also performs the logical operations AND, OR, EXOR, and NOT on data ALU registers. The LU (Logical Unit) is 24 bits wide and operates on data in the MSP portion of the accumulator. The LSP and EXT portions of the accumulator are not affected.
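A short Python sketch can make the 24 x 24 bit fractional multiply with 56-bit accumulation more concrete. It is a numerical illustration only: the helper names are hypothetical, rounding and the condition codes are not modelled, and the one-bit left shift of the product is the usual convention for signed fractional multiplication (it aligns the product with the 47 fraction bits of MSP:LSP), assumed here rather than taken from the text.

```python
MASK56 = (1 << 56) - 1


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc, x, y, negate=False):
    """Return acc +/- x*y, with x and y 24-bit fractions and acc 56 bits wide."""
    product = (to_signed(x, 24) * to_signed(y, 24)) << 1   # align to 47 fraction bits
    if negate:
        product = -product
    return (to_signed(acc, 56) + product) & MASK56


def acc_to_float(acc):
    """Value of a 56-bit accumulator read as EXT:MSP:LSP with 47 fraction bits."""
    return to_signed(acc, 56) / (1 << 47)


acc = mac(0, 0x400000, 0x400000)          # 0.5 * 0.5
print(acc_to_float(acc))
print(acc_to_float(mac(0, 0x800000, 0x800000)))   # (-1.0) * (-1.0)
```

The second call shows why the 8-bit extension exists: (-1.0) x (-1.0) = +1.0 cannot be held in the 48-bit MSP:LSP fraction alone, but fits once the extension bits are included.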

2. Multiplier Control Recoder

The Multiplier Control Recoder directs the operation of the MAC array and recodes the multiplier operands for the modified Booth's algorithm multiplication.

2.2.2.4 Bit Manipulation Unit

The Bit Manipulation Unit performs bit manipulation operations on memory operands. There are two basic types of bit manipulation instructions. One group, BCLR (Bit Test and Clear), BSET (Bit Test and Set), BCHG (Bit Test and Change), and BTST (Bit Test on Memory), tests the state of any single bit in a memory location and then optionally sets, clears, or inverts the bit. For this group, the carry bit of the condition code register will contain the result of the bit test. The other group tests the state of any single bit in a memory location and jumps (or jumps to subroutine) if the bit is set or clear.
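The first group of bit manipulation instructions follows a common test-then-modify pattern, which might be sketched as below. The function name and the string-based operation selector are hypothetical conveniences; the sketch only shows the effect on a 24-bit memory word and on the carry bit.

```python
def bit_test_and_modify(word, n, op):
    """Test bit n of a 24-bit word, then set, clear or invert it.

    Returns (new word, carry), where carry holds the tested bit's old state.
    """
    if not 0 <= n < 24:
        raise ValueError("bit number must be in 0..23")
    carry = (word >> n) & 1          # result of the bit test
    mask = 1 << n
    if op == "BSET":
        word |= mask
    elif op == "BCLR":
        word &= ~mask
    elif op == "BCHG":
        word ^= mask
    elif op != "BTST":               # BTST only tests
        raise ValueError("unknown operation: " + op)
    return word & 0xFFFFFF, carry


print(bit_test_and_modify(0b0100, 2, "BCLR"))
```

Because the old bit state lands in the carry, a single instruction can both claim a flag and report whether it was already taken.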

2.2.2.5 Accumulator Shifter

The Accumulator Shifter is a 56-bit parallel shifter (56-bit input and 56-bit output). The accumulator shifting functions are shown in Table 6.1.

Table 6.1. Accumulator Shifter Functions

Operation Type                             Instruction
No Shift (Unmodified)
Left Shift 1 Bit (Arithmetic/Logical)      ASL, LSL, ROL
Right Shift 1 Bit (Arithmetic/Logical)     ASR, LSR, ROR

2.2.2.6 Data Shifter and Limiter

The data shifter/limiters provide special post-processing on data ALU accumulator registers when they are read out to the XD or YD data buses. Two shifter/limiters are used and each consists of a shifter followed by a limiting circuit.

The data shifters are capable of shifting data one bit to the left or to the right as well as passing the data unshifted. Each data shifter has a 24-bit output and an overflow output indicator. The data shifters are controlled by the scaling mode bits in the Status Register. These bits permit dynamic scaling of fixed-point data using the same program code. This operation allows block floating-point algorithms to be implemented in a regular fashion. FFT routines would typically use this feature to selectively scale each butterfly pass.

Saturation arithmetic is provided to selectively limit overflow when a data ALU accumulator register is read by the XD or YD data bus. The limiting is performed on the output of the data shifter. Two word-operand data limiters are used, one for the XD bus and one for the YD bus. This allows two word operands to be limited independently in the same instruction cycle. The two data limiters can also be combined to form one 48-bit data limiter for long word operands. If the contents of the selected source accumulator can be represented in the destination operand size without overflow, the data limiter is disabled and the operand is not modified. If the contents of the selected source accumulator cannot be represented without overflow in the destination operand size, the data limiter will substitute a data value having maximum magnitude and the same sign as the source accumulator. The value of the accumulator is not changed.
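As a rough illustration of the limiting just described, the sketch below reads a 56-bit accumulator out as a 24-bit word for the no-scaling case only. The function name is hypothetical, the shifter is treated as a pass-through, and the overflow test simply checks whether bits 55 through 47 are pure sign extension, which is the same condition the E bit reports.

```python
MASK56 = (1 << 56) - 1
MAX_POSITIVE = 0x7FFFFF   # largest positive 24-bit fraction
MAX_NEGATIVE = 0x800000   # most negative 24-bit fraction


def read_accumulator_word(acc):
    """Return (24-bit word, limited) for a 56-bit accumulator, no scaling.

    The accumulator itself is never modified; only the value placed on
    the data bus is limited.
    """
    acc &= MASK56
    integer = acc >> 47                       # bits 55..47
    if integer in (0, 0x1FF):                 # pure sign extension: no overflow
        return (acc >> 24) & 0xFFFFFF, False
    negative = acc >> 55
    return (MAX_NEGATIVE if negative else MAX_POSITIVE), True


print(read_accumulator_word(0x00400000000000))   # 0.5 passes through
print(read_accumulator_word(0x01000000000000))   # 2.0 is limited
```

A value of 2.0 held in the accumulator is delivered to the bus as 0x7FFFFF, the closest representable positive fraction, and the second return value corresponds to the event that would set the L bit.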

2.2.3 Address Arithmetic and Logic Unit (ALU)

The Address ALU performs the address calculations necessary for addressing tasks in the DSP56000. The Address ALU operates in parallel with other chip resources to minimize address generation overhead. The Address ALU registers may be read or written by the Global Data Bus as 16-bit operands. The ALU itself can generate two 16-bit addresses every instruction cycle. It can directly address 65536 locations on each of the X Address Bus (XA) and the Y Address Bus (YA). This gives a total addressing capability of 131,072 24-bit data words for the processor. In addition, the Address ALU can directly address 65536 locations on the Program Address Bus (PA).

A block diagram of the DSP56000 Address ALU is shown in Figure 6.8. There are five (5) major components in the Address ALU:

1. Address Register Files

2. Offset Register Files

3. Modifier Register Files

4. Address Arithmetic Units

5. Address Output Multiplexers

Figure 6.8 Address ALU

2.2.3.1 Address Register Files

Each of the two Address Register Files consists of four 16-bit registers. The two files contain the address registers R0-R3 and R4-R7, respectively, and usually contain addresses used as pointers to memory. Each register may be read or written by the Global Data Bus. When read by the Global Data Bus, the contents of a register are zero-extended. Each address register may be used as input to its associated Address Arithmetic Unit for a register update calculation. Each register may be written by the Global Data Bus or by the output of its respective Address Arithmetic Unit. When a register is written by the Global Data Bus, only the least significant 16 bits of the bus are used; the upper portion is not used. The registers accessed by the Global Data Bus and the Address Arithmetic Unit are not required to be the same.

Due to pipelining, if an address register (M, N, or R) is changed with a MOVE instruction, the new contents will not be available for use as a pointer until the second following instruction. The DSP56000 assembler protects the user from this hazard by flagging it with a warning message.

2.2.3.2 Offset Register Files

Each of the two Offset Register Files consists of four 16-bit registers. The two files contain the offset registers N0-N3 and N4-N7, respectively, and usually contain offset values used to update address pointers. Each offset register may be read or written by the Global Data Bus. When read by the Global Data Bus, the contents of a register are zero-extended. When a register is written, only the least significant 16 bits of the Global Data Bus are used; the upper portion is not used. Each offset register is read when the same-numbered address register is read and used as input to its associated Address Arithmetic Unit. A read address selects the offset register to be read to the Address Arithmetic Unit during an instruction cycle. The registers accessed by the Global Data Bus and the Address Arithmetic Unit are not required to be the same.

2.2.3.3 Modifier Register Files

Each of the two Modifier Register Files consists of four 16-bit registers. The two files contain the modifier registers M0-M3 and M4-M7, respectively, and usually specify the type of modification made to an address register during address register update calculations. Each modifier register may be read or written by the Global Data Bus. When read by the Global Data Bus, the contents of a register are zero-extended. When a register is written, only the least significant 16 bits of the Global Data Bus are used; the upper portion is not used. Each modifier register is read when the same-numbered address register is read and used as input to its associated Address Arithmetic Unit. A read address selects the modifier register to be read to the Address Arithmetic Unit during an instruction cycle. The registers accessed by the Global Data Bus and the Address Arithmetic Unit are not required to be the same. Each modifier register is preset during a processor reset.

2.2.3.4 Address Arithmetic Units The two Address Arithmetic Units are identical. Each is capable of performing the address arithmetic calculations required by the DSP56000 addressing modes and addressing modifiers.

Each Address Arithmetic Unit can update one address register, Rn, from its respective address register file during one instruction cycle. It is capable of performing linear, reverse carry, and modulo arithmetic. The contents of the selected modifier register specify the type of arithmetic required in an address register update calculation. The modifier value is decoded in the Address Arithmetic Unit and affects the unit's operation.

Linear address arithmetic is used for standard microprocessor-type addressing, e.g., Rn+1 or Rn+N. Reverse carry arithmetic is useful for 2^K-point FFT addressing. For modulo arithmetic, the Address Arithmetic Unit performs the function (Rn+N) modulo M, where N can be one, minus one, or the contents of the offset register Nn.
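The three kinds of address update can be compared side by side in a short sketch. This is an illustration under assumptions, not a model of the hardware: the function names are hypothetical, the modulo case is written as a simple circular buffer of M locations starting at an assumed `base` address, and reverse carry is implemented by bit-reversing the operands, adding, and reversing the sum, which is equivalent to propagating the carry from the most significant bit toward the least significant bit.

```python
def linear(r, n):
    """Linear update: Rn + N in 16-bit arithmetic."""
    return (r + n) & 0xFFFF


def modulo(r, n, m, base=0):
    """Modulo update: step by n inside a circular buffer of m locations."""
    return base + ((r - base + n) % m)


def reverse_carry(r, n, bits=16):
    """Reverse-carry update: add with the carry running from MSB to LSB."""
    def rev(value):
        return int(format(value, "0{}b".format(bits))[::-1], 2)

    return rev((rev(r) + rev(n)) & ((1 << bits) - 1))


# Bit-reversed order for an 8-point FFT: offset N is half the FFT size.
address, order = 0, []
for _ in range(8):
    order.append(address)
    address = reverse_carry(address, 4)
print(order)
```

Stepping from 0 with an offset of 4 visits 0, 4, 2, 6, 1, 5, 3, 7, the bit-reversed order an 8-point FFT needs, while the modulo update wraps a pointer around a buffer without any explicit compare-and-branch.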

2.2.3.5 Address Output Multiplexers

The address output multiplexers select the source for the XA, YA, and PA (Program Address) buses. They allow the XA, YA, or PA address outputs to originate from R0-R3, R4-R7, or the Address Arithmetic Units.

2.3 On-chip Memory

There are over 9500 bytes of memory provided onboard the DSP56000 in three separate memory spaces. These memories include X Data Memory, Y Data Memory, and Program Memory.

2.3.1 X Data Memory

The X Data Memory may contain both data RAM and ROM. The X Data RAM is a 24-bit wide internal memory and occupies the lowest 256 locations in X Memory Space. The X Data ROM is also a 24-bit wide internal memory and occupies the next lowest 256 locations in X Memory Space. Addresses are received from the XA bus and data transfers occur on the XD bus. X memory may be expanded off-chip.

2.3.2 Y Data Memory

The Y Data Memory may contain both data RAM and ROM. The Y Data RAM is a 24-bit wide internal memory and occupies the lowest 256 locations in Y Memory Space. The Y Data ROM is also a 24-bit wide internal memory and occupies the next lowest 256 locations in Y Memory Space. Addresses are received from the YA bus and data transfers occur on the YD bus. Y memory may be expanded off-chip.

2.3.3 Program Memory

The Program Memory consists of a 512 location by 24-bit storage area or a 2048 location by 24-bit storage area. Addresses are received from the program control logic (usually the PC). The program memory may contain instructions, constants, and data tables which are fixed at assembly time. Program memory may be expanded off-chip to 65536 program locations.

2.4 Interfaces

The DSP56000 provides 24 programmable I/O pins which are available in all chip operating modes. These 24 pins are separate from the DSP56000 address and data buses and are grouped as two I/O ports, B and C. Port B is a 15-bit I/O interface which can be used as general I/O pins or as Host MPU/DMA Interface pins. Port C is a 9-bit I/O interface which may be used either as general I/O pins or as Serial Communication Interface and Synchronous Serial Interface pins. The Host MPU/DMA Interface provides a dedicated 8-bit parallel port to a host microprocessor or a DMA controller. The Serial Communication Interface provides standard serial I/O using asynchronous character protocols. The Synchronous Serial Interface provides high-speed synchronous serial data communication. All of the configurations may be set up under software control.

Three on-chip peripherals are provided on the DSP56000: an 8-bit parallel Host MPU/DMA Interface, a Serial Communications Interface (SCI), and a Synchronous Serial Interface (SSI). In addition, there is a common external bus interface to access external memory or I/O devices.

2.4.1 Host Interface

The Host Interface is a byte-wide parallel port which may be connected directly to the data bus of a host processor. The host processor may be any of a number of popular microcomputers or microprocessors, another DSP chip, or DMA hardware. The DSP56000 host interface has an 8-bit bidirectional data bus, H0-H7 (PB0-PB7), and seven dedicated control pins, HA0, HA1, HA2, HR/W, HEN, HREQ, and HACK (PB8-PB14), to control data transfers. The Host Interface appears as a memory-mapped peripheral occupying 8 bytes in the host processor address space. Separate transmit and receive data registers are double-buffered to allow the DSP56000 and host processor to transfer data efficiently at high speed. Host processor communication with the DSP56000 Host Interface is accomplished by using standard host processor data move instructions and addressing modes. The DSP56000 interrupt response is sufficiently fast that most host microprocessors can load or store data at their maximum programmed I/O (non-DMA) instruction rate without testing the handshake flags for each transfer. If the full handshake is not needed, the host processor can treat the DSP56000 as fast memory, and data can be transferred between the host and the DSP56000 at the fastest host processor rate. DMA hardware may be used with the handshake flags to transfer data at the maximum DSP56000 interrupt rate.

2.4.2 Serial Communication Interface (SCI)

The Serial Communication Interface consists of separate transmitters and receivers. Their operation is asynchronous with respect to program execution in the DSP56000, and they may be either synchronous or asynchronous with respect to each other. A baud rate generator is included to generate the clock signals for the transmitters and receivers. The SCI can operate as a full-duplex serial port to communicate with other digital signal processors and/or microprocessors. This communication can take place either directly or via modems. Facilities are included for communicating at standard asynchronous bit rates and protocols as well as for high-speed synchronous data transmission and reception.

2.4.3 Synchronous Serial Interface (SSI)

The Synchronous Serial Interface is a full-duplex serial port which allows the DSP56000 to communicate with a variety of serial devices. These devices include industry-standard codecs, other DSPs or microprocessors, and peripherals using a subset of the Motorola Serial Peripheral Interface. The SSI consists of independent transmitters and receivers as well as a common SSI clock generator.

2.4.4 External Bus Interface (EBI)

The External Bus Interface is used to access external Data Memory, Program Memory, and/or external I/O devices. Separate select lines control access to the memory spaces. External Program Memory may be loaded into the Instruction Prefetch Register without going over the Global Data Bus. This also avoids interfering with the decode or execution phases of previous instructions.

CHAPTER 7

THE AT&T WEDSP-32

1. Introduction

The WEDSP32 Digital Signal Processor is a 32-bit, high-speed, mask-programmable digital signal processor. The device is available in two package types, either a standard 40-pin DIP or a 100-pin rectangular pin-grid-array (PGA). With the 100-pin PGA package, memory can be expanded externally with 56K bytes of directly accessible data. The WEDSP32 was first introduced in NMOS technology. AT&T has recently announced an enhanced CMOS version (WEDSP32C) that is faster and consumes less power than the current NMOS device. A militarized version of the CMOS device is also planned within the next two years. This chapter discusses the NMOS device in more detail since it is currently available in production.

2. Architecture

A block diagram of the WEDSP32 is shown in Figure 7.1. There are six (6) major components which constitute the architecture of the DSP32:

1. Control Arithmetic Unit (CAU)

2. Data Arithmetic Unit (DAU)

3. Parallel I/O Section (PIO)

4. Serial I/O Section (SIO)

5. On-chip Memory (ROM and RAM)

6. Complex Pipeline Control Section

Figure 7.1 WEDSP-32 Block Diagram (signals marked * are available on the 100-pin PGA package only)

2.1 Control Arithmetic Unit (CAU) The Control Arithmetic Unit is used to generate addresses to memory and execute 16-bit integer

instructions [Figure 7.2]. This unit contains twenty one (21) 16-bit general purpose registers, one 16-bit Program Counter (PC) and one Arithmetic Logic Unit (ALU). All CAU registers are static which do not require to be refreshed. The general purpose registers are used as follows: o Registers R1 to R14 are used as general purpose registers in CA (Control Arithmetic) instructions. In DA (Data Arithmetic) instructions, they are used as memory pointers (RP). o Registers r15 to r19 are used as general purpose registers in CA instructions and as increment registers

(ri) in DA instructions. When used as increment registers, r15 - r19 hold values that will post-modify addresses in the memory pointers. o Register r20 is used as the serial I/O (SIO) DMA input pointer and is called the Pointer In (PIN). o Register r21 is used as the SIO DMA output pointer and is called the Pointer Out (POUT).

o When r20 and r21 are used as general-purpose registers their impact on DMA must be considered. The CAU is capable of executing a full set of 16-bit integer instructions. Two modes of operations are embedded 95

CAU

16

ALU

Al A~

PC

Rl - R19

Pin

Pout

A~ A~

16 ,,.

Figure 7.2 Control Arithmetic Unit 96

in the CAU: address generation for the DAU and CAU operands, and execution of the CA instructions. The CAU processes 16-bit data as two's complement, fixed-point integers or as 16-bit unsigned numbers. For addressing the DAU operands, the CAU can generate one address in each of the four states (the instruction fetch, the two memory operand reads, and the memory write) using the post-modified register indirect mode. In each state the CAU adds the contents of two registers: a pointer selected from r1 to r14 and an increment selected from r15 to r19. The preceding pointer is updated while the next pointer and increment are being accessed.

2.2 Data Arithmetic Unit (DAU)

The Data Arithmetic Unit (DAU) is the main execution unit for signal processing algorithms in the DSP32 processor. This unit performs the multiply/accumulate operations on data and the data type conversions. It is composed of a floating-point multiplier and adder, four (4) static 40-bit accumulators, and a data arithmetic unit control register (DAUC). The floating-point format is what sets the AT&T device apart, since floating point improves the dynamic range and precision required for intermediate steps in signal processing algorithms. With this format, concerns such as scaling and quantization error are largely eliminated, which makes development easier.

The DAU is capable of performing four million instructions per second (MIPS) with a 16 MHz clock input. The multiply/accumulate structure is flexible in the way operands are loaded into its three inputs. The inputs for the DAU multiplier can come from memory, from I/O, or from an accumulator. Multiplier inputs are 32-bit floating-point numbers with a 24-bit mantissa and an 8-bit exponent (except for data type conversions, for which there are three data formats: floating point, integer, and companded data). The adder inputs can come from memory, from I/O, or from an accumulator and can be 8, 16, 32, or 40 bits wide. The 40-bit adder inputs (32 bits plus 8 mantissa guard bits) can only come from one of the four accumulators, a0 to a3, or from the multiplier.

The data type conversions are also done by hardware in the DAU, which frees user programs from the overhead of performing them. The conversions include floating point to and from 16-bit integers, and floating point to and from companded quantities (A-law and u-law). The DAU multiplier and adder operate in parallel, each requiring one processor cycle for execution. Their pipelined operation is designed to work in conjunction with the Control Arithmetic Unit (CAU).
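The cooperation between the CAU and the DAU described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the register names follow the text, but the word-indexed memory list, the helper names post_modify and multiply_accumulate, and the sample data are hypothetical and do not reproduce the DSP32 instruction set or its floating-point format.

```python
# Illustrative model only. Register names follow the text; the word-indexed
# memory list, the helper names, and the sample data are assumptions.

def post_modify(regs, pointer, increment):
    """Return the address held in `pointer`, then add `increment` to it."""
    address = regs[pointer]
    regs[pointer] = (address + regs[increment]) & 0xFFFF  # 16-bit CAU arithmetic
    return address


def multiply_accumulate(memory, regs, count):
    """Sum `count` products, stepping both operand pointers after each use."""
    accumulator = 0.0
    for _ in range(count):
        x = memory[post_modify(regs, "r1", "r15")]  # first operand read
        y = memory[post_modify(regs, "r2", "r16")]  # second operand read
        accumulator += x * y                        # multiplier feeds the adder
    return accumulator


memory = [1.0, 2.0, 3.0, 4.0, 0.5, 0.25, 0.125, 0.0625]
regs = {"r1": 0, "r2": 4, "r15": 1, "r16": 1}
result = multiply_accumulate(memory, regs, count=4)
print(result, regs["r1"], regs["r2"])
```

In the model, each pointer is read first and updated afterwards, which is the essence of the post-modified register indirect mode: the program never spends a separate instruction on advancing an operand address.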

2.3 Parallel I/O Section (PIO)

The parallel I/O (PIO) is composed of the following elements:
o Three 16-bit registers which function as the parallel address register (PAR), the parallel data register (PDR), and the parallel interrupt register (PIR).
o One 10-bit register which functions as an error mask register (EMR).
o One 8-bit register which functions as a PIO control register (PCR).
o One 6-bit register which functions as an error source register.

These registers can be used to interface with a microprocessor and to detect errors. The 8-bit parallel I/O data bus makes bidirectional communication between the DSP32 and an external microprocessor straightforward. Data transfers can be made under either program or DMA control. The PIO DMA allows a microprocessor to download a program without interrupting the execution of a DSP32 program in progress. When the DMA mode is selected, the microprocessor sets the starting memory address for the DMA transfer in the parallel address register (PAR). The control pins allow the microprocessor to access the PDR and cause the DMA transfer to proceed.

2.4 Serial I/O Section (SIO)

The serial I/O (SIO) is used to interface the DSP32 with external serial devices. It converts serial input data to parallel format and converts parallel output data to serial format for transmission. The serial input stream is loaded into the input shift register (ISR) and then into the input buffer (IBUF). Outputs from the DSP32 are loaded into the output buffer (OBUF) and then into the output shift register (OSR). This double buffering makes back-to-back transfers possible on the serial interface: a second serial transmission can begin before the first has been processed. Data widths of 8, 16, or 32 bits can be selected with the I/O control word (IOC). The external control signals allow the DSP32 to interface directly to serial devices such as a CODEC or a TDM line, or to other DSP32 processors for multiprocessor applications.

A DMA option is also included for transfers between the IBUF, the OBUF, and memory without program intervention. In this case, PIN and POUT function as dedicated pointers for the DMA transfers, and these pointers are set by the user. The serial DMA option is useful for multichannel applications, since input and output buffers for an entire frame of samples can be established in memory.

2.5 On-chip Memory

The DSP32 on-chip memory consists of 2048 bytes of ROM and 4096 bytes of RAM. The ROM is mask-programmed with application programs and is used to store program instructions and fixed operands. The RAM is dynamic and can be refreshed automatically or under program control. Typically, this RAM is used to store variable operands.

The DSP32 instructions are all 32 bits wide and can be fetched in one memory access cycle. The data are of the three types described earlier (floating point, fixed point, and companded) and can be 8, 16, or 32 bits wide. The memory is uniformly byte-addressable and is organized as an 8-, 16-, or 32-bit addressable structure. Each memory location is 32 bits wide and can be addressed as one 32-bit word, two 16-bit integer words, or four 8-bit words. Addresses for 32-bit locations must be divisible by 4, and addresses for 16-bit locations must be divisible by 2. If a 32-bit word is addressed with a 16-bit address not divisible by 4, the low-order two bits are truncated and an address exception error occurs. Similarly, if a 16-bit integer is addressed with a 16-bit address not divisible by 2, an address exception error occurs.
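The alignment rules above can be summarized in a short sketch. The function name resolve_access and the returned (address, error flag) pair are hypothetical, chosen only for illustration. The text does not say which address is used for a misaligned 16-bit access, so the sketch assumes the address is left unchanged in that case.

```python
def resolve_access(address, width_bits):
    """Return (effective_address, address_exception) for a 16-bit byte address.

    Hypothetical helper: it models only the alignment rules stated in the text.
    """
    if not 0 <= address <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    if width_bits == 32:
        # The low-order two bits are truncated and an exception is flagged.
        return address & 0xFFFC, (address & 0x3) != 0
    if width_bits == 16:
        # Assumption: the address is left as-is; only the exception is modeled.
        return address, (address & 0x1) != 0
    if width_bits == 8:
        # Byte accesses have no alignment requirement.
        return address, False
    raise ValueError("width must be 8, 16, or 32 bits")


for addr, width in [(0x0104, 32), (0x0106, 32), (0x0101, 16), (0x0101, 8)]:
    effective, error = resolve_access(addr, width)
    print(f"{width:2d}-bit access at 0x{addr:04X}: uses 0x{effective:04X}, exception={error}")
```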

The DSP32 memory can be expanded off-chip (this option is, however, available with the PGA package only) with up to 56 Kbytes of directly accessible memory. The on-chip address and data latches and the memory control signals allow a zero-chip interface to a standard byte-wide memory chip.

The on-chip ROM, RAM, and external memory may be arranged in four different configurations [Figure 7.3]. The configurations are set by using the memory control pins

Figure 7.3 Memory Configuration

MMD0, MMD1. The DSP32 memory is generally divided into two banks, an upper bank (1) and a lower bank (0). Memory accesses can be made without regard to the bank. However, to achieve maximum throughput, memory accesses should alternate between the two banks, so that while one bank is being accessed the other is being addressed. This is a form of pipelining, and it reduces the effective memory access time by one-half [See Appendix A]. Only the lower memory bank can be expanded off-chip, so the external address and data buses need to operate at only half the rate of the on-chip buses.

Off-chip expansion allows the DSP32 to access up to 56 Kbytes of external memory. This memory is accessed by a 14-bit address bus (AB00 - AB13) and four byte-select lines (MSN0 - MSN3). The address bus selects a 32-bit word, and the byte-select lines select the proper bytes within that 32-bit word. The external memory can only be assigned to the lower memory bank (0).

Although the maximum range of memory addresses that can be accessed by the DSP32 is 64 Kbytes, the actual amount of memory available is less and depends on the mode of operation. The maximum number of accessible addresses is 62 Kbytes (for mode 2 operation). Memory locations 0xE7FF-0xEFFF can never be accessed regardless of mode.

2.6 Complex Pipeline Control Section

The control section sequences the instructions through a lengthy and complex pipeline that includes branching on DA, CA, and I/O conditions.

The DSP32 control section is also capable of entering an exceptional (wait) state when the CAU produces an address that conflicts with the memory access in progress. These wait states allow operands or instructions to be located in either memory bank without regard to interleaving. Throughput is degraded only incrementally when the control section automatically inserts the wait states. This capability further enhances the flexibility of the memory organization as well as the power of the instruction set.

The control section also monitors various error conditions. Errors can be masked by the error mask register (EMR). Depending on the EMR, an error asserts an output pin in the parallel I/O (PIO) unit. Also depending on the setting of the EMR, the DSP32 either halts or continues when an unmasked error is detected.

CHAPTER 8 THE TI TMS320C30

1. Introduction

The TMS320C30 is a CMOS 32-bit Digital Signal Processor that has recently been announced by Texas Instruments. This provisional new member of TI's popular TMS320XX Digital Signal Processor family is claimed to be the industry's first programmable DSP capable of performing a 1024-point FFT with 32-bit floating-point precision on a single chip. It is also claimed to be the first designed specifically for high-level language support in the form of a C compiler. Wish-list items for an advanced, very high-performance DSP, such as very fast instruction cycles (approaching 50 ns), floating-point mathematics, on-chip instruction caches, and multiple address generators, are intended to be part of the TMS320C30 design.

Presently, detailed technical information on the chip is not available to the public but only to customers who sign a non-disclosure agreement (NDA) with Texas Instruments. This information (including a user's manual) is also preliminary and subject to modification during the chip development phase.

2. Architecture

The TMS320C30 hardware architecture is built around a 40 MHz input clock and includes the following key elements [Figure 8.1]:


Figure 8.1 TMS320C30 Block Diagram

a. A Central Processing Unit (CPU)
b. A Direct Memory Access (DMA) controller
c. Memory
d. Peripherals
   o Timers/Counters
   o Serial Ports
   o Format Converter
e. External Interfaces

2.1 The Central Processing Unit (CPU)

The CPU is the heart of the TMS320C30 architecture [Figure 8.2]. This unit is largely responsible for the high performance, accuracy, and precision of the system. The CPU consists of the following elements:

a. An ALU capable of both integer and floating-point operations. This ALU performs single-cycle 32-bit integer, 32-bit logical, and 40-bit floating-point operations, including single-cycle conversions between integer and floating-point data.
b. A multiplier capable of single-cycle floating-point and integer multiplication. Floating-point multiplication takes 32-bit floating-point inputs and produces 40-bit floating-point products. Integer multiplication takes 24-bit integer inputs and produces 32-bit integer products.
c. A 32-bit barrel shifter that can shift up to 32 bits either left or right in a single cycle.

Figure 8.2 Central Processing Unit (CPU)

d. Eight (8) extended-precision registers that support operations on 40-bit floating-point and 32-bit integer numbers.
e. Eight (8) auxiliary registers whose primary function is the generation of 24-bit addresses. Two auxiliary register arithmetic units (ARAU0 and ARAU1) can generate two addresses in a single cycle. These units operate in parallel with the multiplier and ALU. They support displacements and indexes for linear, circular, and bit-reversed addressing. In addition, the auxiliary registers may also be used as 32-bit general-purpose registers.
f. Twelve (12) remaining registers that support other system functions such as addressing, stack management, processor status, block repeat, and interrupts.
g. Internal buses, CPU1/CPU2 and REG1/REG2, which carry two operands from memory and two operands from the register file. These buses make it possible for the CPU to perform parallel multiplications and additions/subtractions on four integer or floating-point operands in a single cycle.

2.2 Direct Memory Access (DMA)

The TMS320C30 possesses an on-chip Direct Memory Access controller which can read from or write to any location in the memory map without interfering with the CPU operation [Figure 8.3]. The DMA controller contains

Figure 8.3 DMA Controller

its own address generators, source and destination registers, and transfer counter. Dedicated DMA address and data buses prevent conflicts between the CPU and the DMA controller. A DMA operation can be a block or a single-word transfer to or from memory. The DMA controller can respond to interrupts in the same manner as the CPU, so its transfers can be triggered by interrupts. I/O transfers normally performed by the CPU can therefore be performed by the DMA controller instead, and the CPU may continue processing data while the DMA controller brings in or sends out data. The DMA controller makes it possible to interface the TMS320C30 with slow external memories and peripherals (A/D converters, serial ports, etc.) without reducing the throughput of the CPU. This helps improve system performance and decrease system cost.

2.3 Memory

The total memory space of the TMS320C30 is 16 mega-words (32-bit words). Program, data, and I/O space are contained within this single logical 16 mega-word memory space, which consists of on-chip ROM, RAM, and memory cache [Figure 8.4] and externally expandable memory. A machine word is 32 bits, and all addressing is done by word. The memory map depends on whether the processor is running in the microprocessor mode or the microcomputer mode (controlled by the MC/MP* pin).

Figure 8.4 Memory Organization

On-chip memory includes RAM blocks 0 and 1, which are 1K x 32 bits each, a 4K x 32-bit ROM block, and a 64 x 32-bit instruction cache. Each RAM and ROM block is capable of supporting two accesses in a single cycle. The separate program, data, and DMA buses allow parallel program fetches, data reads and writes, and DMA operations. In a single cycle, the CPU can access two data values in one RAM block and perform an external program fetch in parallel with the DMA controller loading another RAM block.

The 64 x 32-bit instruction cache holds the most often repeated code sections to reduce the number of off-chip accesses required. This feature not only improves system speed but also allows code to be stored off-chip in slower, lower-cost memory.

2.4 Peripherals

The TMS320C30 peripherals are controlled through memory-mapped registers located on a dedicated peripheral bus [Figure 8.5]. This peripheral bus allows straightforward addition, removal, and creation of peripheral modules. The peripherals include two timers, two serial ports, and a format converter.

a. Timers

The two timers are general-purpose 32-bit timer/event counters with two signaling modes and internal or external

Figure 8.5 Peripheral (blocks shown: Serial Port 0, Serial Port 1, Timer 0, Timer 1, each with its control registers on the peripheral bus)

clocking. An I/O pin is available to each timer. This pin can be used as an external input clock to the timer, or as an output signal driven by the timer. It may also be configured as a general-purpose I/O pin.

b. Serial Ports

The TMS320C30 offers two modular and totally independent serial ports. Both serial ports are identical, and each is independently controlled by a complementary set of control registers. Each serial port can be configured to transfer 8-, 16-, 24-, or 32-bit data words. The clock for each serial port can be supplied either internally or externally. The internal clock is generated by dividing the chip's internal clock. The maximum operating speed of a serial port is 8 megabits per second (Mbps). The pins of the serial ports can also be configured as general-purpose I/O pins. A special handshake mode allows two TMS320C30s to communicate over their serial ports with guaranteed synchronization. Finally, the serial ports may also be configured to operate as timers.

c. Format Converter

To support inputs from external devices that may use different format standards (integer, floating-point, and companded data), a format conversion module is included in the TMS320C30 to transform data to and from the internal representations. The external formats supported by the TMS320C30 are the u-law and A-law formats common in pulse code modulation (PCM) systems and the IEEE single-precision floating-point number representation.
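To make the idea of companded data concrete, the sketch below implements the textbook continuous u-law characteristic and its inverse. It is an illustration only: the function names are hypothetical, the value mu = 255 is the one commonly used in PCM telephony and is assumed here, and the hardware converter works on segmented 8-bit codes rather than on this continuous curve.

```python
import math

MU = 255.0  # assumed compression parameter (common in PCM telephony)


def mu_law_compress(x, mu=MU):
    """Map a linear sample in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress: recover the linear sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


for sample in (0.001, 0.01, 0.1, 1.0):
    companded = mu_law_compress(sample)
    restored = mu_law_expand(companded)
    print(f"{sample:6.3f} -> {companded:.4f} -> {restored:.6f}")
```

Small samples are expanded toward larger companded values, which is why companding preserves relative precision for weak signals when only a few bits are available.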

2.5 External Interfaces

Two external interfaces are provided in the TMS320C30: the parallel interface and the I/O interface. The parallel interface consists of a 32-bit data bus, a 24-bit address bus, along with a set of control signals.

The I/O interface consists of a 32-bit data bus, a 13-bit address bus, and a set of control signals. Both interfaces support an external ready signal (RDY*) for hardware- or software-controlled wait-state generation.

The TMS320C30 busing capabilities include four effective internal buses and two external buses. The internal buses effectively perform two independent data fetches, one program fetch, and one DMA fetch, all in a single cycle. The external buses are also capable of simultaneous program, data, and DMA fetches.

The TMS320C30 interrupt handling capability supports four external interrupts, a number of internal interrupts, and a nonmaskable external reset (RESET*) signal. Two external I/O flags, XF0 and XF1, may be configured as input or output pins under software control. These pins are also used by the interlocked operations to support multiprocessor communication.

3. Instruction Set

The TMS320C30 instruction set is designed for digital signal processing and other numerically intensive applications. In addition, there is a full complement of general-purpose instructions to support purposes other than multiply-accumulate operations. In all, the set is made up of 110 instructions, organized into the following groups:
o Load and store
o Two-operand arithmetic
o Two-operand logical
o Three-operand arithmetic
o Three-operand logical
o Parallel operation
o Arithmetic/logical instruction with store
o Program control
o Interlocked-operation

CHAPTER 9 THE ZORAN ZR34325 VECTOR SIGNAL PROCESSOR

1. INTRODUCTION

The ZR34325 Vector Signal Processor (VSP-325) is, in many respects, an extension of the VSP-161, which was introduced by ZORAN in 1986. Its architecture is optimized to efficiently execute a wide variety of DSP algorithms, including one- and two-dimensional Fast Fourier Transforms, digital filters, matrix operations, and many others. All arithmetic conforms to the IEEE 754-1985 floating-point specification.

2. ARCHITECTURE

A block diagram of the ZR34325 architecture is shown in Figure 9.1. It consists of four major units:
a. Bus-Interface Unit
b. Execution Unit
c. Control Unit
d. Memory and Registers

2.1 Bus-Interface Unit (BIU)

The bus-interface unit is responsible for interfacing the VSP-325 to the external memory and other peripherals in the system. The interface includes a demultiplexed address and data bus, over which both instructions and data are transferred. The 24-bit address space spans 16 million floating-point words. The control bus consists of


Figure 9.1 ZR34325 Block Diagram

standard bus signals such as RD, WR, CS, DSTB, and RDY. The VSP uses these signals when it reads and writes external memory. Additionally, certain control pins are bidirectional to allow external devices in the system to read and write the memory-mapped internal registers and memory. As part of the "high-level" nature of the VSP-325 instruction set, it is able to read and write "blocks" or "vectors" of data in single instructions. To accomplish this, the VSP-325 contains a simple DMA mechanism controlled by the BRQ and BACK pins. The VSP-325 activates the BRQ pin when it requires the bus; after receiving a BACK in response, it uses the bus for transferring instructions and/or data.

2.2 Execution Unit (EU)

The EU is responsible for all floating-point arithmetic operations inside the VSP. As shown in Figure 9.2, it contains a floating-point multiplier, separate floating-point adders and subtracters, real and imaginary accumulators, and some additional arithmetic registers. The data buses entering and leaving the EU are complex, i.e., they consist of both real and imaginary paths. Two operands can enter the EU simultaneously, one from the data RAM and the second from the vector unit/coefficient table.

Operands from external memory enter the EU via the vector unit which operates as a FIFO buffer between the EU and the external data memory.

The EU is especially efficient at performing "vector" arithmetic, consisting of repetitive operations on multiple data samples. The length of data vectors operated upon by the EU is controlled by the NMPT (number of points) parameter in each instruction. In conjunction with the REPEAT parameter contained in many instructions, the length of data vectors operated upon by a single instruction can vary from as few as 1 sample up to 65536 samples.

2.3 Move Unit (MU)

The move unit in the VSP is responsible for the interface between internal memory and the BIU for the transfer of data. It is also responsible for address generation through the BIU. Three address generation modes are provided for accessing external memory: direct, indirect, and indexed. The MU also contains a "bit-reversal" address generator for use with FFT operations. It can bit-reverse or unbit-reverse the order of input or output data "on-the-fly" as the data is moved through the bus-interface unit.

2.4 Fetch Unit (FU)

The fetch unit is responsible for all control related to fetching VSP-325 instructions. Instructions fetched are stored in the four-instruction FIFO, out of which they are subsequently executed. The program counter is managed by the FU, as is all subroutine and branching control.

2.5 Vector Unit (VU)

The vector unit operates as a 4-word FIFO between the move unit and the execution unit. External data entering the EU first goes through the VU. This allows the movement of external data via the MU to take place independently of the EU.

2.6 Control Unit

The control unit manages all of the internal VSP-325 registers which ultimately control all VSP-325 activity.

2.7 Internal Memory

The VSP-325 contains an internal RAM memory space in which data for arithmetic operations is stored. An internal coefficient look-up table (CLUT) is also provided, which contains floating-point coefficients used by many of the instructions. Seventeen registers are also provided for a variety of conditions to be discussed. All of the memory and registers are memory-mapped into the external address space so they can be read or written by other devices as appropriate.

2.8 Data Memory

Data memory in the VSP-325 consists of 128 32-bit floating-point words organized as an array of 64 complex words. Each complex word consists of separate real and imaginary parts. The memory can be used as a single array of 64 complex words, or alternatively as two separate arrays of 32 complex words each, as illustrated in Figure 9.3. In the latter case, the memory operates in a dual-ported fashion, allowing data I/O instructions to be overlapped with internal arithmetic instructions. Data I/O instructions can load or store as few as one sample or as many as 64 complex or 128 real samples in a single instruction.

2.9 Coefficient Look-Up Table (CLUT)

The CLUT in the VSP contains 1024 complex floating-point coefficients used for performing FFTs, DFTs, modulation, and demodulation. It is automatically sequenced by the execution unit when performing any of the aforementioned operations.

2.10 Internal Registers

The VSP-325 contains 17 registers, three of which are arithmetic and 14 of which are for control and information, as shown in Figure 9.4. The arithmetic registers are 32 bits in width; the control and information registers vary in width up to 24 bits.

2.10.1 Real/Imaginary Accumulators:

Separate 32-bit accumulators are provided for both real and imaginary arithmetic results.

2.10.2 Min/Max Value Register:

Figure 9.2 Execution Unit

Figure 9.3 Internal Memory (RAM sections RS:0 and RS:1)

Figure 9.4 Internal Register (registers shown include Accumulator Real, Accumulator Imaginary, Min-Max Values, Program Counter, and Stack Pointer)

The 32-bit min/max value register contains the minimum or maximum floating-point value encountered after the execution of the last MIN or MAX instruction.

2.10.3 Program Counter:

The 24-bit program counter provides the address from which the next instruction will be fetched.

2.10.4 Stack Pointer:

The 24-bit stack pointer points to the location of the stack in external memory. Because the register is a full 24 bits, there is practically no limit to the depth of the stack or the nesting of subroutines. The stack pointer is automatically decremented during CALL and PUSH instructions and automatically incremented during RET and POP instructions. The stack pointer can be updated using the LDR (load registers) or RAL (register arithmetic/logic) instructions and stored using the STR (store registers) instruction. It can also be used as a third base register for indexed addressing.
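The stack pointer behavior described above can be modeled in a few lines. This is a sketch under stated assumptions: the class and method names are hypothetical, the external memory is represented by a dictionary, and the choice of decrementing before the write (rather than after) is an assumption that the text does not settle.

```python
ADDRESS_MASK = 0xFFFFFF  # the stack pointer is a 24-bit register


class StackModel:
    """Hypothetical model of a stack kept in external memory."""

    def __init__(self, stack_pointer):
        self.sp = stack_pointer & ADDRESS_MASK
        self.memory = {}  # sparse stand-in for external memory

    def push(self, value):
        # Assumption: the pointer is decremented before the value is written.
        self.sp = (self.sp - 1) & ADDRESS_MASK
        self.memory[self.sp] = value

    def pop(self):
        value = self.memory[self.sp]
        self.sp = (self.sp + 1) & ADDRESS_MASK
        return value

    def call(self, return_address, target):
        """Save the return address on the stack and return the new PC."""
        self.push(return_address)
        return target

    def ret(self):
        """Restore the saved return address as the new PC."""
        return self.pop()


stack = StackModel(0x001000)
stack.push(7)
pc = stack.call(return_address=0x000200, target=0x000300)
print(hex(pc), hex(stack.sp))
```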

2.10.5 A and B Base Registers:

The two 24-bit base registers are used by all arithmetic, data movement, and sequence instructions for indexed addressing of external memory. The effective address is computed as:

address = base + displacement

where the "base" is the content of the appropriate base register, and the "displacement" is the value specified in the EMA (external memory address) field of the instruction.

2.10.6 SAR Counter Register:

The 24-bit SAR (Store arithmetic registers) register provides a pointer to external memory. It is used during arithmetic instructions such as FIR, IIR, MTX, MIN, MAX and others when storage of the accumulator value to external memory is required. SAR is automatically incremented after the accumulator is stored.

2.10.7 Loop Counter:

The 24-bit loop counter register is used for looping programs in the internal instruction FIFO. Because the register is 24 bits in width, very long loops can be implemented efficiently. Both the LOOP and JMPC instructions can automatically decrement the loop counter.

2.10.8 Extra Register:

The 24-bit extra register is for general programmable use. It can be both updated and tested by a number of different instructions.

2.10.9 Min/Max Index Register:

The 20-bit min/max index register indicates the offset address of the minimum or maximum value encountered at the end of the MIN or MAX instruction. It can be updated and tested by a number of different instructions.

2.10.10 Mask Chain Register:

The 17-bit mask chain register is used in control register instructions. It determines which registers will be affected in the operation.

2.10.11 Mode Register:

The 24-bit mode register contains programmable bits which configure the operating conditions of the VSP-325. Some of the programmable conditions include the speed of external memory, whether instructions are executed sequentially or concurrently, the number of internal RAM sections, and the rounding mode for arithmetic operations.

2.10.12 Status Register:

The 24-bit status register indicates the state of all internal conditions and processing units in the processor. The contents of the register are readable by external devices; the VSP-325 can also store its contents to external memory using a number of different instructions.

2.10.13 Interrupt Pointer:

The 24-bit interrupt pointer register contains the address of the interrupt routine in external memory. When the VSP-325 is interrupted on the INT pin, the processor effects a CALL to the address pointed to by this register.

2.10.14 Interrupt Mask Register:

The VSP-325 allows the generation of an outgoing interrupt upon a variety of conditions enabled by bits in this register. If a mask bit in the 24-bit interrupt mask register (corresponding to an interrupt condition) has been enabled, an interrupt will be generated if the condition becomes true.

2.10.15 Interrupt Status Register:

The 24-bit interrupt status register contains the status of all conditions which have the potential to generate an interrupt. Some of these status conditions include illegal instructions, overflows, and trap conditions such as underflows and inexact arithmetic results.

CHAPTER 10 WORDLENGTH EFFECT ON COMPUTATIONAL ACCURACY AND DYNAMIC RANGE

1. Introduction

This chapter addresses the effects of using finite register lengths to represent relevant digital parameters used in DSP applications. While infinite precision is usually assumed and used in theoretical analysis for the parameters, it does not exist in practical hardware. Therefore, when dealing with digital hardware, the effects of finite precision on data representations are of tremendous importance and must be thoroughly studied. This is especially true when the system designer wishes to select a suitable DSP chip for a particular application. Spectral analysis, particularly the FFT (Fast Fourier Transform), is chosen to be the subject of study.

2. Error Sources

We define the Fourier transform by a pair of integral operators of the form:

Fourier transform:

$$S(f) = \int_{-\infty}^{\infty} s(t)\, e^{2\pi jft}\, dt = F\{s(t)\} \tag{1}$$

Inverse Fourier transform:

$$s(t) = \int_{-\infty}^{\infty} S(f)\, e^{-2\pi jft}\, df = F^{-1}\{S(f)\} \tag{2}$$

This transform pair is very useful and arises naturally in the analysis of many physical phenomena. It is highly desirable that digital computers be able to evaluate this transform. It is clear, however, that exact computation can never be achieved by numerical methods, since there are several inherent sources of error. The four main sources of error are identified as follows:

1. The signal s(t) is known only for a finite time interval (time truncation)
2. The signal s(t) is known only at discrete instants in time (sampling process)
3. The signal s(t) is known only with limited precision at the sampling instants (quantization)
4. Only finite precision arithmetic is possible on any machine computation (computational error)

This chapter discusses the four sources of error above. Since this paper concentrates on digital signal processors, which is where the computation is performed, Item 4 (computational error) is studied in more detail to develop guidelines for selecting appropriate hardware for a DSP system.

2.1 Time Truncation Effect

The resolution of spectral analysis is limited by the effect of time truncation. If the observation interval is T seconds, two spectral lines cannot be resolved if they are located less than 1/T Hz apart. The exact spectrum of a unit amplitude signal of infinite duration is a unit impulse at the origin. If this signal is observed only for $-T/2 \le t \le T/2$, the resulting transform is given as:

$$S_T(f) = \int_{-T/2}^{T/2} e^{2\pi jft}\, dt = \frac{\sin(\pi fT)}{\pi f} \tag{3}$$

If B is defined as the 4-dB width of the sinc function, it can be shown that:

$$B \approx 1/T \tag{4}$$

From the translation theorem of Fourier analysis, it is clear that the same result would be obtained for a time truncated sinusoid of any frequency $f_0$, and is given as:

$$S_T(f) = \frac{\sin[\pi(f-f_0)T]}{\pi(f-f_0)} \tag{5}$$

The sinc function above has infinite extent and must be truncated for practical problems. However, simply truncating the sinc function causes errors in the frequency response, as illustrated in Figure 10.1.
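The 4-dB width quoted in expression (4) can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function name and the value chosen for T are illustrative assumptions. It evaluates the magnitude of expression (3), relative to its peak value T, at f = 1/(2T), which is half of the claimed width B = 1/T.

```python
import math

def truncated_spectrum_db(f, T):
    """Magnitude of expression (3) relative to its peak value T, in dB.

    Hypothetical helper for illustration; not meaningful exactly at the
    nulls of the sinc function, where the magnitude is zero.
    """
    x = math.pi * f * T
    ratio = 1.0 if x == 0 else abs(math.sin(x) / x)
    return 20.0 * math.log10(ratio)

if __name__ == "__main__":
    T = 0.5                    # observation interval in seconds (arbitrary choice)
    edge = 1.0 / (2.0 * T)     # half of the width B = 1/T
    # The level at the band edge is close to -4 dB (about -3.92 dB).
    print(round(truncated_spectrum_db(edge, T), 2))
```

The level at f = 1/(2T) is about 3.9 dB below the peak regardless of T, which is why the full width between these two points is approximately 1/T.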

[Figure 10.1 Truncation Effect on Frequency Response: S(f) versus f, (a) Before Truncation, (b) After Truncation]

2.2 Sampling Effect

Sampling is the process of observing a continuous signal only at discrete instants in time. In practice, these samples are usually taken to be equidistant, such that the kth sampling instant, $t_k$, is described in terms of the (k-1)th sampling instant as:

$$t_k = t_{k-1} + \Delta t \tag{6}$$

The sampling process may be represented mathematically by the following expression, assuming ideal sampling over an infinite time interval:

$$g(t) = \sum_{k=-\infty}^{\infty} \delta(t - k\Delta t) \tag{7}$$

where $\delta(t)$ is the Dirac delta function.

The sampled version n(t) of the continuous signal s(t) is:

$$n(t) = s(t) \cdot g(t) \tag{8}$$

and, in the frequency domain, if F denotes the Fourier transform, we have:

$$F\{n(t)\} = F\{s(t)\} * F\{g(t)\} \tag{9}$$

where * denotes the convolution operator.

The sampling process in the frequency domain is

illustrated in Figure 10.2.

The signal s(t), therefore, may be recovered from n(t) by passing n(t) through a filter H(f), where:

$$H(f) = \begin{cases} 1/\Delta t, & |f| < W/2 \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

provided that $1/\Delta t > W$. Thus, under ideal conditions, the sampling process produces no error if the sampling interval satisfies the condition:

$$\Delta t < 1/W \tag{11}$$

where the signal being sampled is band limited, such that:

$$S(f) = 0, \quad |f| > W/2 \tag{12}$$

This condition is based on Nyquist's theorem, which states that the lowest adequate sampling frequency $f_0$ to avoid aliasing is twice the highest frequency present in the signal, with:

$$f_0 = 1/\Delta t \tag{13}$$

That is, if the signal being sampled is not strictly band limited, sampling will cause error; if it is band limited and condition (11) is satisfied, sampling causes no error.

[Figure 10.2 Sampling Process: panels for S(f) and for the sampled signal n(t)]
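The error caused by violating condition (11) appears as aliasing: a component above half the sampling frequency is indistinguishable from a lower-frequency one. A minimal sketch of this folding is given below; the function name is a hypothetical helper, and the frequencies used are arbitrary illustrative values.

```python
def apparent_frequency(f_hz, f0_hz):
    """Frequency at which a real sinusoid of f_hz appears after ideal
    sampling at f0_hz = 1/dt (hypothetical helper for illustration)."""
    folded = f_hz % f0_hz                 # position within one period of the replicated spectrum
    return min(folded, f0_hz - folded)    # fold into the band 0 .. f0/2

if __name__ == "__main__":
    f0 = 10.0                             # sampling frequency in Hz (arbitrary)
    for f in (3.0, 7.0, 12.0):
        print(f, "->", apparent_frequency(f, f0))
```

With a 10 Hz sampling frequency, a 3 Hz component is preserved, while 7 Hz and 12 Hz components fold to 3 Hz and 2 Hz respectively, so they can no longer be separated from genuine low-frequency content.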

2.3 Quantization Effect

Since it has been shown that proper sampling does not produce any error (the information contained in the signals being sampled is recovered perfectly), it is possible to study the gross effects of quantization by studying the quantization of a continuous signal s(t). To detect a small signal reliably, the number of quantization levels needs to be large. However, it is desirable to keep the number of levels as small as possible so that the hardware implementation is less complex and can satisfy critical requirements such as operating speed.

The quantization process may be viewed as the addition of an error signal q(t) to the signal s(t) as shown in Figure 10.3.

$$s_q(t) = s(t) + q(t) \tag{14}$$

where $s_q(t)$ is the quantized signal.

[Figure 10.3 Additive Errors Due to Quantization: s(t) and q(t) summed to give $s_q(t)$]

Also, it is known that the signal q(t) can be represented closely as broad-band noise for fine quantization of random signals.

$$T^2 = q^2/12 \approx N_q \tag{15}$$

where q is the quantizer step size.

The dynamic range (DR) of the quantizer is a rough measure of the quantizer's ability to pass relative amplitude information contained in the input signal. It is defined, in dB, as:

$$DR = -10 \log_{10}\left(q^2/12\right) \approx -20 \log_{10}(q) + 10 \tag{16}$$

If the quantizer output is an x-bit binary number (including the sign bit), then the quantizer step size q is:

$$q = (1/2)^{x-1} \tag{17}$$

Substituting (17) into (16), the following result is obtained:

$$DR \approx 6x + 4 \tag{18}$$

In practice, the dynamic range is usually represented by a quantity known as the mean-squared error (mse) of the difference between the spectra of s(t) and $s_q(t)$. That is, if:

$$F\{s(t)\} = L(f), \qquad F\{s_q(t)\} = L_q(f)$$

then

$$F\{q(t)\} = L_q(f) - L(f) = E(f)$$

and

$$\mathrm{ave}\,[L(f) - L_q(f)]^2 = \mathrm{ave}\,[E(f)]^2 = T^2$$

or, in dB,

$$\mathrm{mse} = 10 \log_{10}(T^2) = -6x - 4 = -DR \tag{19}$$

From expression (18), the number of binary bits that must be used to represent the data to satisfy a known dynamic range requirement can be determined.
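The use of expressions (16) through (18) for choosing a word length can be sketched as follows. The function names are hypothetical, introduced only for illustration. Note that the exact constant in (16) is 10 log10(12), about 10.8, so the unrounded result is roughly 6.02x + 4.77 dB, about 1 dB above the rounded rule DR = 6x + 4 used in the text.

```python
import math

def dynamic_range_db(bits):
    """Unrounded value of expression (16) with q = (1/2)**(bits - 1)."""
    q = 0.5 ** (bits - 1)
    return -10.0 * math.log10(q * q / 12.0)

def dynamic_range_rule_of_thumb(bits):
    """Rounded form given by expression (18): DR = 6x + 4."""
    return 6 * bits + 4

def bits_for_dynamic_range(dr_db):
    """Smallest word length x for which 6x + 4 meets the required DR (dB)."""
    return max(1, math.ceil((dr_db - 4) / 6))

if __name__ == "__main__":
    for x in (8, 12, 16):
        print(x, round(dynamic_range_db(x), 2), dynamic_range_rule_of_thumb(x))
    print(bits_for_dynamic_range(70))
```

Under the rounded rule, a 70 dB requirement calls for an 11-bit quantizer, and each additional bit adds about 6 dB.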

Figure 10.4 illustrates the dynamic range versus data word length. This is only the theoretical result. In practice, specifying the desired quantization accuracy is much more complicated and is not a straightforward matter. State-of-the-art analog-to-digital converters (ADCs) for high speed operation are available only up to about 16 bits. The choice of x in excess of 16 bits is, therefore, impractical. Moreover, the environmental noise that always mixes with the signals of interest is frequently more of a limiting factor than the quantization error. Usually, a reasonable choice for x is in the range from 6 to 16 bits. Experience has shown that hardware complexity is directly proportional to x and computing speed is inversely proportional to x. The system designer would, therefore, be better off choosing a smaller, yet adequate, x for a device.

A typical engineering compromise might be 10 or 12 bits.

2.4 Computational Error

As mentioned earlier, the Fast Fourier Transform (FFT) is an efficient way to calculate the Discrete Fourier Transform (DFT) of a time-limited sampled signal:

$$S(n) = \sum_{k=0}^{N-1} s(k)\, W_N^{kn} \tag{20}$$

for n = 0, 1, ..., N-1, where s(k) is the kth sample value.

[Figure 10.4 Dynamic Range Versus Data Word Length: dynamic range (dB) plotted against data word length]

In order to employ the FFT, the modulus N of the transform must be a composite number, or:

$$N = N_1 \cdot N_2 \tag{21}$$

Using the transformations:

$$k = k_0 + k_1 N_1, \qquad n = n_1 + n_0 N_2 \tag{22}$$

where

$$k_0, n_0 = 0, 1, \ldots, N_1 - 1$$

$$k_1, n_1 = 0, 1, \ldots, N_2 - 1$$

and substituting (21) and (22) into (20), the result is:

$$S(n_1, n_0) = \sum_{k_0=0}^{N_1-1} W_N^{k_0 n} \left\{ \sum_{k_1=0}^{N_2-1} s(k_0, k_1)\, W_N^{k_1 n_1 N_1} \right\} \tag{23}$$

The term in braces depends only on $k_0$ and $n_1$ and is defined as an intermediate transform, and the term $W_N$ is called the FFT kernel.
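The index mapping of (22) and the nested sums of (23) can be checked against the direct DFT of (20). The sketch below is illustrative only. It assumes the common kernel convention W_N = exp(-2*pi*j/N), which the text does not state explicitly; the decomposition holds equally for the opposite sign. The function names are hypothetical.

```python
import cmath

def direct_dft(s):
    """Expression (20), assuming the kernel W_N = exp(-2*pi*j/N)."""
    N = len(s)
    W = cmath.exp(-2j * cmath.pi / N)
    return [sum(s[k] * W ** (k * n) for k in range(N)) for n in range(N)]

def two_factor_dft(s, N1, N2):
    """Expression (23): an intermediate transform over k1 (inner sum),
    followed by an outer sum over k0, for N = N1 * N2."""
    N = N1 * N2
    assert len(s) == N
    W = cmath.exp(-2j * cmath.pi / N)
    # Intermediate transform: depends only on k0 and n1.
    inner = [[sum(s[k0 + k1 * N1] * W ** (k1 * n1 * N1) for k1 in range(N2))
              for n1 in range(N2)]
             for k0 in range(N1)]
    S = [0j] * N
    for n0 in range(N1):
        for n1 in range(N2):
            n = n1 + n0 * N2              # output index mapping from (22)
            S[n] = sum(W ** (k0 * n) * inner[k0][n1] for k0 in range(N1))
    return S

if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5, 6]
    a = direct_dft(samples)
    b = two_factor_dft(samples, 2, 3)
    print(max(abs(x - y) for x, y in zip(a, b)) < 1e-9)
```

The intermediate transform is computed once per (k0, n1) pair and reused for every n0, which is the source of the savings over the direct sum. In fixed-point hardware these intermediate values are the quantities stored with the intermediate word length t discussed below.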

However, the evaluation of the FFT contains accumulated computational error that is caused by the imprecision with which the kernels, data, and intermediate transforms are represented in digital computers. In special-purpose hardware, the word lengths that represent the quantities involved need not be the same.

Therefore, the mean square error (mse) is investigated as a function of three variables: the data word length d, the kernel word length k, and the intermediate transform word length t. These are considered as independent sources of error. By and large, floating-point representations can provide much greater accuracy than fixed-point. However, the floating-point implementation is more complex and reduces computation rates. There are many schemes that offer DSP system designers compromises, permitting accuracy close to that of floating-point arithmetic while maintaining the simplicity inherent in fixed-point. One such technique is the array-scaling (AS) scheme, in which entire arrays are forced to share a common characteristic (exponent). This is described next.

Array-scaling

The simplest AS technique, automatic array-scaling (AAS), always scales all numbers in the array downward before any operation that might cause any number in the array to overflow. The advantages of AAS are speed and simplicity. The disadvantage is the inaccuracy caused by unnecessary forced scaling. A more sophisticated technique is conditional array-scaling (CAS), in which the data are tested prior to each operation. Scaling downward is performed only when that operation is liable to produce at least one overflow. However, if CAS is employed, provisions must be made for counting the scaling operations performed so that the correct magnitude of the final results can be determined. This step is not necessary in AAS, since the number of times the data are scaled is known a priori. Simulated AAS results for various d, k, and t are shown in Figure 10.5. Representative results for each of the individual word lengths d, t, and k are plotted in Figures 10.6a, b, and c. From these plots, it is concluded that there is not much merit in choosing the kernel word length (k) greater than the data word length (d). Additionally, if the intermediate transform word length t is properly selected, at about 6 bits greater than d or k in AAS, the expected mse in dB can be computed roughly as:

$$\mathrm{mse} \approx \begin{cases} -6d, & d < k \\ -6k, & d > k \end{cases}$$
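The difference between the two array-scaling schemes can be sketched on integer (fixed-point) data. This is an illustrative sketch only, not the simulation used for the figures. The function names are hypothetical, and the overflow test assumes that the operation following the scaling step is a single addition or subtraction of two array elements, so that a value at or above one quarter of the signed range could overflow.

```python
def aas_stage(block):
    """Automatic array-scaling: always halve every value before the stage.
    The >> operator is an arithmetic shift (rounds toward minus infinity)."""
    return [v >> 1 for v in block]

def cas_stage(block, word_bits, shifts):
    """Conditional array-scaling: halve the whole block only if some value is
    large enough that adding two such values could overflow word_bits.
    Returns the (possibly scaled) block and the updated shift count."""
    guard = 1 << (word_bits - 2)          # half of the signed full-scale magnitude
    if any(abs(v) >= guard for v in block):
        return [v >> 1 for v in block], shifts + 1
    return block, shifts

if __name__ == "__main__":
    small = [100, -200, 50]
    large = [20000, -5]
    print(cas_stage(small, 16, 0))        # no value near full scale: left unscaled
    print(cas_stage(large, 16, 0))        # 20000 >= 16384: block halved, count becomes 1
    print(aas_stage(small))               # AAS halves the block regardless
```

With CAS, the final results must be multiplied by 2 raised to the accumulated shift count to restore the correct magnitude. With AAS, that factor is fixed by the number of stages, so no counter is needed, at the cost of discarding low-order bits even when no overflow was possible.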

From the figures and by inspection, only t needs to be increased to 16 bits, while d and k can be as low as 10 bits, for an mse of -60 dB. The same performance is obtained even when d, t, and k are all increased to 16 bits. Therefore, choosing nonuniform word lengths may allow considerable savings in hardware and/or speed.

[Figure 10.5: 10 log10(mse) versus d, t, k (bits), simulated AAS results]

[Figure 10.6a Mse Versus Data Word Length: 10 log10(mse) versus d (bits)]

[Figure 10.6b Mse Versus Intermediate Word Length: 10 log10(mse) versus t (bits)]

[Figure 10.6c Mse Versus Kernel Word Length: 10 log10(mse) versus k (bits)]

2.5 Word Length Selection Guideline

To provide a guideline on what word length should be chosen, a list of the quantizer (ADC) word length, the intermediate word length, and the FFT data word length for various mse (-DR) levels is computed under AAS. The results are presented in Table 10.1. The item marked by a star is the limiting factor of the system. The limiting factor is defined as the item whose word length prevents the DR from increasing when the other items' word lengths are increased. For a DR level of 58 dB, for example, the first row of Table 10.1 indicates that the 16-bit FFT data word length is the limiting factor. To increase the DR level to 59 dB, an increased FFT data word length is required. A suitable DSP IC for this example should have a data word length of 18 bits or greater.

ADC      INTERMEDIATE   FFT      MSE
(BITS)   (BITS)         (BITS)   (dB)

10       10             16*      -58
10       10*            18       -59
10*      12             18       -60
12       12*            18       -70
12       14             18*      -71
12*      14             20       -72
14       14*            20       -80
14*      16             20       -81
16*      16*            20*      -83

Table 10.1 Dynamic Range Limiting Factor

CHAPTER 11 COMPARISONS OF DSP ICS

1. Introduction

The eight DSP ICs are compared in this chapter. To make the comparison more convenient, the DSP ICs are categorized into two groups: a 16-bit group and a more-than-16-bit group.

The 16-bit group consists of the four 16-bit DSP ICs: the NEC 77C20, the TI TMS320C25, the Analog Devices ADSP-2100, and the Zoran ZR34161. Note that the NEC 77C20 is compared instead of the NEC 7720, even though the latter was discussed in Chapter 2. Architecturally, the 77C20 is exactly the same as the 7720, except that it is a CMOS device.

The more-than-16-bit group consists of the other four DSP ICs: the Motorola DSP56000, the AT&T WEDSP-32, the TI TMS320C30, and the Zoran ZR34325. In this group, only the DSP56000 is a 24-bit device; the rest are all 32-bit.

The criteria for comparison are based on the device architectures and other factors, including:

o The speed capability that is unattainable or too costly with other DSP ICs.

o The cost of designs based on the selected device.

o The space and power impact.

o The device availability.

o The availability of software and hardware development tools.

o Long term support from the device manufacturers.

These criteria are then used as guidelines for selecting a DSP IC.

2. Comparisons

2.1 16-bit DSP ICs

All four DSP ICs in this group employ a Harvard architecture, in which the memory spaces for instructions and data are separate. The implementations of such memory structures are, however, different in each device.

Unlike the other three DSP ICs, the ADSP-2100 carries no on-chip program or data memory. Instead, the memory resides externally to make room for processing logic. The ADSP-2100 provides separate buses for the separate data and instruction memories. However, its Harvard architecture differs from the others in its ability to store data values in the program memory. This memory mixing scheme gives a speed advantage in operations such as window calculations. In window processing, the processor multiplies a set of data by window coefficients stored in a look-up table. After fetching the multiply instruction from the program memory, the CPU can simultaneously fetch the raw data from the 16K x 16 external data memory and the coefficients from the program memory. Such parallel operation speeds up the windowing process.

Normally, after each instruction is executed, the processor must return to the program memory for the next instruction fetch, which slows the operation down. This is still true for the ADSP-2100. However, the ADSP-2100 includes an instruction cache memory to recover the time lost in fetching the next instruction. This cache stores the previous 16 instructions that the CPU executed. The CPU can, therefore, fetch an instruction from the on-chip cache instead of from the external program memory. The cache thus frees the bus from instruction-transfer activities and allows data to be transferred from both the data and program memories during every CPU cycle. Consequently, the ADSP-2100 is very efficient in tasks that require running a loop, provided the loop contains 16 instructions or fewer, since the cache is only 16 levels deep. A major drawback of the ADSP-2100 architecture is that it contains no on-chip memory besides the cache. All programs and data are stored externally, and expensive high speed memories are required even for small systems.

The NEC 77C20 is very suitable for communication applications due to its three communication ports. Its two serial I/O ports, each equipped with its own control lines, are convenient for connecting to a CODEC or any serial device. The 8-bit parallel I/O port is suitable for operating with 8- or 16-bit microprocessors such as the Intel 808X family. The 77C20 is ideal for use in low-end telecommunication modems, or as a local processor in each node of a communication network, since it has sufficient on-chip memory and peripherals to support these operations.

The Zoran ZR34161 is very fast in processing the FFT. Its ability to perform a 1024-point, radix-2 integer complex FFT in 2.6 ms is the fastest in this group. This is possible because the ZR34161 has the unique property that its 23 high-level instructions operate on complex vectors or arrays of data instead of on scalars as in other DSP ICs. Due to this capability, the ZR34161 is the best device in this group for applications that require very high-performance execution of DSP algorithms. However, the ZR34161 lacks some general-purpose microprocessor features, such as an on-chip stack for storing the return address from a branch instruction. It also has only one interrupt level and no on-chip I/O port.

Therefore, it must rely on external I/O peripherals for communication.

The TI TMS320C25 is perhaps the best DSP IC in this group when a compromise among trade-off factors such as system speed, design simplicity, performance, and cost is needed. The TMS320C25 architecture includes many DSP-oriented features as well as general-purpose microprocessor features such as 8 interrupt levels, an 8-level on-chip stack, a fast cycle time, and external buses. Its ability to perform a 256-point, radix-2 complex FFT in 1.2 ms is slow compared to the ZR34161 and the ADSP-2100 but adequate for a general DSP application.

All four DSP ICs in this 16-bit category are fabricated with CMOS technology to minimize power consumption (the NEC 77C20 is the CMOS version of the 7720). The maximum power consumptions of these devices are listed as follows:

1. NEC 77C20: 200 mW
2. Zoran ZR34161: less than 300 mW
3. Analog Devices ADSP-2100: 600 mW
4. TI TMS320C25: 1.0 W

Note that even though the power consumption ratings of the NEC 77C20 and the Zoran ZR34161 appear to be better than those of the other two, these devices do not offer a power-down mode. The TI TMS320C25 and the Analog Devices ADSP-2100, on the other hand, provide a power-down mode, or "idle state", during which the power drain is negligibly low, thus saving more power. The power consumption figures actually reflect the different levels of architectural complexity of these devices. Within this group, the TMS320C25 and the ADSP-2100 are the two with a more complicated architecture than the rest, resulting in greater power dissipation.

The TMS320C25 maximum clock frequency is 40 MHz, which is the highest clock frequency in the group. Its minimum instruction cycle is 100 ns, which is four clock periods. In this group, it is one of the two fastest devices. The ZR34161 has a maximum clock frequency of 20 MHz; its instruction cycle takes two clock cycles, or 100 ns, which is the same as the TMS320C25. The ADSP-2100 maximum operating frequency is 32 MHz. However, its instruction cycle requires four clock pulses, or 125 ns. The 77C20 is the slowest device of the four. Its maximum operating clock frequency is only 8.196 MHz, resulting in a 250 ns instruction cycle. It is stressed that the comparisons made above are based purely on the clock speeds of the devices. In a system, the operating speed depends not only on the processor clock speed but also on other characteristics such as memory access time. To provide a means for speed comparison, the benchmark performance of each DSP IC is given in Table 11.1.

BENCHMARK               77C20        TMS320C25    ADSP-2100     ZR34161

FIR FILTER              0.75us/TAP   0.1us/TAP    0.125us/TAP   NA
IIR BIQUAD FILTER       2.25us       1us          0.880us       NA
64-PT FFT (COMPLEX)     1.6ms        NA           NA            0.164ms
256-PT FFT (COMPLEX)    NA           1.2ms        NA            NA
1024-PT FFT (COMPLEX)   77ms         NA           7.2ms         2.6ms
ADAPTIVE FIR FILTER     NA           0.4us/TAP    0.25us/TAP    NA
DIVISION                26us         NA           NA            NA
4X4 MATRIX MULTIPLY     NA           NA           NA            33us
4096-PT FFT (COMPLEX)   NA           NA           NA            33.3ms
240-PT RECT. WINDOW     NA           NA           NA            0.56ms

Table 11.1 Comparison of Benchmark Performance of 16-bit DSP ICs.

The ADSP-2100 is available in a 100-pin, 1.332" x 1.332" PGA (Pin Grid Array) package, which is the biggest package of the four. The ZR34161 is the next biggest device and is available in a 48-pin, 0.598" x 2.45" DIP (Dual In-line Package). The TMS320C25 is third in size and is available in a 0.995" x 0.995" 68-pin PLCC package. The 77C20 is available in two package types, a 0.6" x 1.5" 28-pin DIP and a 0.689" x 0.689" 44-pin PLCC. It also has the smallest package of the four.

Programming the TMS320C25 is easy for someone who is familiar with its earlier family members, since their instruction sets are very similar and the code is upward compatible. However, for someone who is new to the device, programming in the DSP IC assembly language requires some practice. This is true for the other DSP ICs as well. In this group, the TMS320C25 has the most complete set of software and hardware development tools. The TMS320 family also has the largest group of third party software and hardware support. Hardware and software development tools for the other three DSP ICs are roughly comparable to one another. Most other major features of the four 16-bit DSP ICs are compared in Table 11.2.

DEVICE             uPD77C20*#   TMS320C25    ADSP-2100    ZR34161

MANUFACTURER       NEC          TI           ANALOG       ZORAN
TECHNOLOGY         CMOS         CMOS         CMOS         CMOS
ARCHITECTURE       HARVARD      HARVARD      HARVARD      HARVARD
PIPELINE           YES          YES          YES          YES
DISSIPATION (W)    0.2          1.0          0.6          0.3
CYCLE TIME (ns)    250          100          125          100
EXTERNAL BUS       1DAT         1DAT,1ADR    2DAT,2ADR    1DAT,1ADR
LOW POWER MODE     NO           YES          YES          NO
DMA                YES          YES          YES          YES
FLOATING-POINT     NO           NO           NO           NO
DATA TYPE          INTEGER      INTEGER      INTEGER      INTEGER
DATA LENGTH        16 BITS      16 BITS      16 BITS      16 BITS
MAC INPUT (BIT)    16X16        16X16        16X16        17X17
MAC OUTPUT         32-BIT       32-BIT       40-BIT       17-BIT
INTERRUPT LEVEL    1            8            4            1
STACK LEVEL        4            8            16           EXTERNAL
PARALLEL I/O *     YES          NO           NO           NO
SERIAL I/O *       YES          YES          NO           NO
INTERNAL RAM       128X16       256X16       NONE         256X17
INT. DATA ROM      512X13       NONE         NONE         256X16
INT. PROG. ROM     512X23       4KX16        NONE         NONE
EXT. DATA MEM      NONE         64KX16       16KX16       64KX16
EXT. PROG MEM      NONE         64KX16       32KX24       64KX16
ON-CHIP CACHE      NO           NO           16X24        NO
AVAILABILITY       NOW          NOW          NOW          NOW
SECOND SOURCE      2 **         NONE         NONE         NONE
MILITARY           NO           YES          YES          YES
PACKAGE            DIP/PLCC     PLCC         PGA          DIP
UNIT PRICE ***     $15          $37          $337         $700

*   ON-CHIP I/O
**  OKI, GOULD
*** PRICE IN 1986
*#  7720 CMOS VERSION

Table 11.2 Comparison of 16-bit DSP ICs Features

2.2 More-than-16-bit DSP ICs

The Motorola DSP56000 is the only 24-bit device and the only fixed-point DSP IC within this group. Its architecture is designed to be a compromise for situations where 16 bits are inadequate and 32 bits are overkill. With 24-bit data, it can deliver 144 dB of dynamic range externally. The internal dynamic range is much greater (336 dB) due to the 56-bit capability of the Multiply/Accumulator in the ALU. This internal accuracy is also the highest in this group. The DSP56000 has a very fine memory structure, in which data values in pairs can reside in independent data memories. This is ideal for complex operations, where real and imaginary values can be stored in pairs in separate memory banks. However, the DSP56000 furnishes only one set of external buses, so the processor must access one set of external memories at a time. This slows the memory operation down. The DSP56000's program memory space can be expanded to 64K, which is sufficient for most medium scale systems. Instead of using the data-ready input signal frequently found in general-purpose microprocessors, the DSP56000 controls wait states for slow memories and peripherals internally, by software. Each wait state lasts half an instruction cycle, and as many as 15 wait states can be generated. This software-controllable wait state ability is very convenient for interfacing the DSP56000 with slower devices. The DSP56000 also has the most flexible I/O port arrangement, with 24 programmable I/O pins which can be configured to perform various parallel and serial I/O operations. For parallel operations, 15 pins can be configured as general-purpose I/Os or as a bidirectional communication and control bus for interfacing with a host computer or a DMA controller. The other 9 pins can be

configured either as general-purpose I/Os or as two serial ports that support both synchronous and asynchronous serial operations. Given the fact that hardware for a DSP system, such as A/D or D/A converters, is available only with resolutions up to 16 bits, the DSP56000 appears to be the best general-purpose DSP IC in terms of performance and cost trade-off.

The AT&T WEDSP-32 also has very flexible I/O ports which can support various parallel and serial operations. Its parallel I/O port is capable of bidirectional data transfer, as in the case of the DSP56000. The serial I/O port is double-buffered, allowing back-to-back serial data transfers. This means that a second serial transmission can begin even before the first has been processed. The WEDSP-32's serial I/O port can also support transfers of 8-, 16-, or 32-bit data words. Provided with external control signals, the WEDSP-32 can be directly interfaced to a time-division-multiplexing line, to a CODEC, or to other processors. This ability makes the WEDSP-32 an ideal DSP IC for telecommunication applications. The WEDSP-32 is a floating-point DSP IC. Therefore, it can offer a large dynamic range and reduce rounding errors. However, it is the slowest of the three floating-point devices in this group. It is also the most power-hungry DSP IC in this group because it is an NMOS device. In addition, the WEDSP-32 cannot handle IEEE standard floating-point formatted numbers, making it incompatible with most existing applications.

The TMS320C30 has the most desirable features of all the DSP ICs covered in this paper. Its ALU's ability to perform single-cycle 32-bit integer, 32-bit logical, and 40-bit floating-point operations on a single chip is fairly advanced and unique in this group. The TMS320C30 also has the largest on-chip RAM and ROM, which can be accessed by both the CPU and the on-chip DMA controller. In addition, the TMS320C30 has an on-chip cache memory that can hold 128 words of the most frequently used instructions. The cache is updated automatically using an internal LRU (least-recently-used) algorithm. This, coupled with the fact that the TMS320C30 offers a 60-ns single-cycle execution time, makes it the fastest floating-point DSP IC.

The Zoran ZR34325 architecture is designed as a concurrent processor. The ability of the six main processing units within the ZR34325 to operate concurrently and independently makes it ideal for executing DSP mathematical algorithms such as the FFT. In FFT operations, executing the FFT instruction, storing the previous FFT results, and loading the next frame of data are done simultaneously. The ZR34325 also has a unique instruction set which is very efficient for programming DSP algorithms. The instruction set includes DSP-oriented instructions such as FFT (internally compute the FFT), FIR (perform an FIR filter), and LD (load FIR coefficients from external memory). This system-level programming ability and the vectorized operation are two of the most advantageous features of the ZR34325. They are especially useful in computation-intensive DSP applications such as spectrum analysis and digital filtering.

Among the four DSP ICs in this group, only the AT&T WEDSP-32 is presently available in production. Therefore, it has the most complete set of available software and hardware development support tools. However, the other DSP ICs are not far behind. In fact, better development tools for these devices are likely, since many features such as high-level language support (usually C) will be incorporated. The benchmark performance of the four DSP ICs in this group is compared in Table 11.3. Other major features are compared in Table 11.4.
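The 144 dB and 336 dB figures quoted for the DSP56000 in this section are consistent with the rule of roughly 6 dB of dynamic range per bit of word length. The short sketch below makes that arithmetic explicit; the function name and the default of 6 dB per bit are assumptions chosen to match the rounded figures in the text (the unrounded value is 20 log10(2), about 6.02 dB per bit).

```python
import math

def word_length_dynamic_range_db(bits, db_per_bit=6.0):
    """Approximate dynamic range of a fixed-point word, at db_per_bit per bit."""
    return db_per_bit * bits

if __name__ == "__main__":
    print(word_length_dynamic_range_db(24))    # 24-bit data path
    print(word_length_dynamic_range_db(56))    # 56-bit accumulator
    # Unrounded figure for comparison, using 20*log10(2) dB per bit.
    print(round(word_length_dynamic_range_db(24, 20 * math.log10(2)), 1))
```

The same rule gives 96 dB for the 16-bit devices of the first group, which is the gap the 24-bit format is intended to close.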

BENCHMARK               DSP56000     WEDSP-32     TMS320C30    ZR34325

FIR FILTER              0.1us/TAP    0.16us/TAP   0.06us/TAP   NA
IIR BIQUAD FILTER       0.4us        0.8us        0.36us       0.4us
64-PT FFT (COMPLEX)     0.147ms      NA           NA           NA
256-PT FFT (COMPLEX)    0.713ms      1.9ms        NA           NA
1024-PT FFT (COMPLEX)   5.0ms        8.96ms       NA           1.732ms
ADAPTIVE FIR FILTER     0.2us/TAP    0.32us/TAP   NA           NA
DIVISION                2.7us        2.9us        NA           0.96us
4X4 MATRIX MULTIPLY     NA           10.2us       NA           NA
SQUARE ROOT             NA           NA           NA           1.52us
LOGARITHM               NA           3.1us        NA           2.16us

Table 11.3 Comparison of Benchmark Performance of 24- and 32-bit DSP ICs.
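One way to read the FIR benchmark rows of Tables 11.1 and 11.3 is as a real-time budget: the time available between input samples, divided by the time per tap, bounds the filter length a device can sustain. The sketch below is illustrative only. The per-tap times are taken from the two tables, while the 8 kHz sampling rate, the function name, and the neglect of I/O and loop overhead are assumptions.

```python
# FIR filter time per tap, in nanoseconds, from Tables 11.1 and 11.3.
FIR_NS_PER_TAP = {
    "77C20": 750,
    "TMS320C25": 100,
    "ADSP-2100": 125,
    "DSP56000": 100,
    "WEDSP-32": 160,
    "TMS320C30": 60,
}

def max_fir_taps(device, sample_rate_hz):
    """Upper bound on FIR length per sample period, ignoring all overhead."""
    sample_period_ns = 1_000_000_000 // sample_rate_hz
    return sample_period_ns // FIR_NS_PER_TAP[device]

if __name__ == "__main__":
    rate = 8000                           # assumed telephone-band sampling rate
    for name in FIR_NS_PER_TAP:
        print(name, max_fir_taps(name, rate))
```

At the assumed 8 kHz rate, the sample period is 125 us, so the bound ranges from 166 taps for the 77C20 to 2083 taps for the TMS320C30. A real design would leave margin for data movement and any other processing in the loop.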

DEVICE             DSP56000     WEDSP-32     TMS320C30    ZR34325

MANUFACTURER       MOTOROLA     AT&T         TI           ZORAN
TECHNOLOGY         HCMOS        NMOS*        CMOS         CMOS
ARCHITECTURE       HARVARD      **           HARVARD      HARVARD
RISC               YES          YES          YES          YES
DISSIPATION (W)    <1.0         2.7***       1.0          NA
CYCLE TIME (ns)    97           160          60           100
DMA                YES          YES          YES          YES
FLOATING-POINT     NO           YES          YES          YES
DATA LENGTH        24           24M/8E       16,32,40     24M/8E
MAC INPUT          24X24        32X32        32X32        32X32
MAC OUTPUT         56           40           40           32
INTERRUPT LEVEL    4            1            12           NA
STACK LEVEL        EXTERNAL     EXTERNAL     4            NA
EXTERNAL BUS       1DAT,1ADR    1DAT,1ADR    1DAT,1ADR    1DAT,1ADR
LOW POWER MODE     YES          NO           YES          NA
PARALLEL I/O       YES          YES          YES          YES
SERIAL I/O         YES          YES          YES          YES
INTERNAL RAM       512X24       1KX32        2KX32        128X32
INTERNAL ROM       2560X24      512X32       4KX32        1KX32
EXT. DATA MEM      128KX24      14KX32       NA           NA
EXT. PROG. MEM     64KX24       14KX32       NA           NA
TOTAL EXT. MEM     192KX24      28KX32       512KX32      NA
ON-CHIP CACHE      NO           NO           128X32       NO
AVAILABILITY       SAMPLE       NOW          4Q88         4Q88
SECOND SOURCE      NONE         NONE         NONE         NONE
MILITARY           YES          YES          YES          YES
PACKAGE            PGA          DIP/PGA      PGA          PGA
UNIT PRICE ****    $650         $355         NA           $700
AVAIL. OF INFO.    YES          YES          NDA*#        LIMITED

*    CMOS VERSION UNDER DEVELOPMENT
**   NON-HARVARD
***  <2.0W FOR CMOS VERSION
**** 1986 PRICING
*#   NDA = NON-DISCLOSURE AGREEMENT

Table 11.4 Comparison of 24- and 32-bit DSP ICs Features

CHAPTER 12 APPLICATIONS

1. Application Areas

DSP ICs have been found to be suitable in many application areas such as:

o Instrumentation
o Telecommunications
o Control
o Industrial
o Graphics/Imaging
o Voice/Speech
o Consumer
o Medical
o Automotive
o Military

2. General-purpose DSP

General-purpose DSP applications include the following:

o Fast Fourier Transforms (FFT)
o Windowing
o Convolution
o Digital Filtering
o Correlation
o Hilbert Transforms
o Adaptive Filtering
o Waveform Generation

3. Instrumentation

Typical instrumentation applications include the following:

o Spectrum Analysis
o Function Generation
o Pattern Matching
o Seismic Processing
o Transient Analysis
o Digital Filtering
o Phase-locked Loops (PLLs)

4. Telecommunication

Telecommunication is one of the primary areas in which DSP ICs have been used. Typically, the following applications in this large category have employed DSP ICs:

o Echo Cancellation
o ADPCM Transcoders
o Digital PBXs
o Line Repeaters
o Channel Multiplexing
o Modems
o Adaptive Equalizers
o DTMF Encoding/Decoding
o Data Encryption
o FAX
o Cellular Telephones
o Speaker Phones
o Digital Speech Interpolation (DSI)
o X.25 Packet Switching
o Video Conferencing
o Spread Spectrum Communications

5. Control

o Disk Control
o Servo Control
o Robot Control
o Laser Printer Control
o Engine Control
o Motor Control
o Numeric Control

6. Industrial

o Robotics
o Numeric Control
o Security Access
o Power Line Monitors

7. Graphics/Imaging

Graphics and digital imaging are finding DSP ICs increasingly important, especially with the availability of the advanced 32-bit devices. Typical applications include:

o 3-D Rotation
o Robot Vision
o Image Transmission/Compression
o Pattern Recognition/Scene Matching
o Image Enhancement
o Homomorphic Processing
o Workstations
o Animation/Digital Map

8. Voice/Speech

DSP ICs have found a niche in this area. Typical applications include:

o Voice Mail
o Speech Vocoding
o Speech Recognition
o Speaker Verification
o Speech Enhancement
o Speech Synthesis
o Text to Speech

9. Consumer

Consumer electronics has also become one of the areas in which DSP ICs are increasingly employed. Typical applications include:

o Radar Detectors
o Power Tools
o Digital Audio/TV
o Music Synthesizer
o Toys

10. Medical

o Hearing Aids
o Patient Monitoring
o Ultrasound Equipment
o Diagnostic Tools
o Prosthetics
o Fetal Monitors

11. Automotive

o Engine Control
o Vibration Analysis
o Antiskid Brakes
o Adaptive Ride Control
o Global Positioning
o Navigation
o Voice Commands
o Digital Radio
o Cellular Telephones

12. Military

Military is, perhaps, one of the most important areas in which DSP ICs are applied. The real-time processing and speed capability of DSP ICs have made them very suitable for military applications. In fact, due to the increasing military demand, most DSP vendors have plans for militarizing their DSP ICs. Typical applications in this area are listed as follows:

o Secure Communications
o Radar Processing
o Sonar Processing
o Image Processing

o Navigation o Missile Guidance o Radio Frequency Modems CHAPTER 13 CONCLUSION Eight DSP ICs are covered in this paper. They may be considered as representatives of the current single- chip DSP IC technology as well as its future trend. The 16-bit DSP IC technology, or the first and second generation technology as some vendors put it, has become a mature technology. In this 16-bit category four DSP ICs are discussed: the NEC 7720, the TMS320C25, the ADSP-2100, and the ZR34161. All of the four devices employ Harvard architecture, pipeline operations and a certain degree of RISC.

The more-than-16-bit DSP IC technology is represented by the DSP56000, the WE-DSP32, the TMS320C30, and the ZR34325. In this category, many advanced features for DSP systems may be found. The trend is toward CMOS technology, floating-point capability, a high degree of parallelism, and high speed. The technology, however, is still in the development stage, and a few more years are necessary before its maturity can be judged. Overall, its prospects are highly promising.

Technically, all eight DSP ICs are suitable for most DSP applications. The selection of the best one for a particular application naturally depends on that application's unique requirements. Selecting the best DSP IC is not a trivial task. The choice that was once much simpler may now be more complicated because of the ever-growing variety of devices. Besides the technical aspects, other factors should be considered as well. These include the quality and availability of development systems, software, product support, second sources, the health of the vendor, and cost. Fortunately, because of the similarities between DSP ICs and microprocessors, experience with and knowledge of the latter will ease the task significantly.

APPENDIX A

THE PIPELINE AND HARVARD ARCHITECTURE

Pipeline Processing

Pipeline processing as opposed to parallel processing is a technique frequently used in digital signal processing ICs. This technique decomposes a sequential process into subprocesses which are to be executed concurrently in specially dedicated segments. Each segment performs partial processing dictated by the way the task is partitioned. The result obtained from the computation in each segment is transferred to the next segment in the pipeline. After passing through all the segments, this result becomes the final result. The term "pipeline" is used to imply an information flow analogous to water flow in a pipeline.

A simple way to view the basic pipeline structure is to consider that each segment in the pipeline consists of an input register holding the data and a logic circuit performing the subprocess in that segment. The output of the logic circuit in that segment is connected to the input of the register in the next segment. The information flow in the pipeline is performed one step at a time according to the clock pulse supplied to each register.

Note that simultaneous operations are performed in a pipeline processor. Once the pipe is full, it takes only one clock pulse to obtain an output. This is true and
does not depend on how many segments there are in the system. If the time it takes to perform a subprocess in each segment is an interval t, and there are m segments in the system, then each computation requires a time of m x t. However, this time is required only as setup time to fill the pipe. After the pipe is full, successive operations are overlapped in the pipeline and a result is produced every interval t.

Any operation that can be decomposed into a sequence of subprocesses of similar complexity can be implemented by a pipeline processor. This technique is very useful, and extremely efficient, in applications where the same computation must be repeated on a stream of input data, such as in an FFT processor.

Most floating-point arithmetic operations are performed in a pipeline fashion since they are easily decomposed into consecutive suboperations. For example, a typical floating-point addition performed in a pipelined processor can be divided into three consecutive subprocesses: aligning the mantissas, adding the mantissas, and normalizing the result. These three subprocesses can be structured as three segments in a pipeline. This is illustrated in Figure A.1.

Pipeline processing can be applied not only to data but to the instruction stream as well. Consecutive instructions from memory may be fetched by the processor
while previous instructions are being executed. This technique is also known as the instruction look-ahead buffer technique. The main drawback of this scheme is that, once in a while, an instruction in the sequence may be a program control type that causes a branch out of the normal sequence. In this case, the pending operations in the last segments are completed, the pipe must be emptied, and all the instructions that have been read from memory must be discarded.

Other difficulties that hinder a pipelined processor from operating at maximum speed include the different times that different segments may take to operate on incoming data. Some segments may be skipped for certain instructions due to access conflicts. For example, two or more segments that require memory access at the same time may cause one segment to wait until the other is finished with the memory.

    Inputs              A = 0.777 x 1E3      B = 0.999 x 1E2
                                  |
    ALIGN MANTISSAS     A = 0.777 x 1E3      B = 0.0999 x 1E3
                                  |
    ADD MANTISSAS       A + B = 0.8769 x 1E3
                                  |
    NORMALIZE RESULT    A + B = 0.8769 x 1E3

Figure A.1 Pipeline Operation of Floating-point Addition

Harvard Bus Architecture

Pipeline processing is closely associated with the Harvard architecture, which is common to most DSP processors. The Harvard architecture is significantly different from the widely known Von Neumann architecture employed by almost all general-purpose microprocessors. In the Von Neumann architecture, instructions and data are located in the same memory space. During the first clock cycle of an instruction execution, the microprocessor fetches the instruction, and then the data [Figure A.2].

[Figure A.2: timing diagram. Rows: instruction fetches (READ X, READ C, MULTIPLY (X)(C), WRITE Y) and data flow (X, C, (X)(C), Y). Horizontal axis: time.]

Figure A.2 Von Neumann Operation

[Figure A.3: timing diagram. Rows: instruction fetches (READ X, READ C, MULTIPLY (X)(C), WRITE Y) and data flow (X, C, (X)(C), Y). Horizontal axis: time.]

Figure A.3 Harvard Operation

The sequential nature of instruction execution in the Von Neumann architecture prevents parallel data and instruction acquisition and thus limits the processing rate. The Harvard architecture, on the other hand, allows data and instruction fetches at the same time [Figure A.3]. This is done by having separate data and instruction memories in the processor. Harvard processors, therefore, have multiple buses for data and instruction operations. Depending on the number of buses built in, the throughput of the processor may vary significantly. With two buses, simultaneous fetching is possible, but an extra bus cycle is required for storing the results. With three buses, one for each of the two operands and one for the result, the processing speed is improved further [Figure A.4]. Further improvements in throughput can be achieved by combining pipeline operation with a multiple-bus architecture [Figure A.5].
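The throughput gain from pipelining described earlier in this appendix can be sketched numerically. The following is a minimal illustration, assuming m segments that each take one interval t and ignoring branches and access conflicts; the function names and the sample values are illustrative and do not come from any vendor documentation.

```python
def sequential_time(n_items, m_segments, t):
    """Total time when each item passes through all m segments before the next starts."""
    return n_items * m_segments * t


def pipelined_time(n_items, m_segments, t):
    """Total time with overlap: m*t to fill the pipe, then one result every t."""
    if n_items == 0:
        return 0
    return (m_segments + n_items - 1) * t


if __name__ == "__main__":
    m, t = 3, 1  # illustrative values: three segments, one time unit each
    for n in (1, 10, 1000):
        seq = sequential_time(n, m, t)
        pipe = pipelined_time(n, m, t)
        print(f"n={n}: sequential={seq}, pipelined={pipe}, speedup={seq / pipe:.2f}")
```

Under these assumptions the speedup approaches m for long input streams, which is why the technique suits repeated computations on a stream of data.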

[Figure A.4: block diagram. Labels: instruction fetch, read bus 1, read bus 2, multiplier, data flow (X)(C), write bus.]

Figure A.4 3-Bus Harvard Operation

[Figure A.5: timing diagram. Rows: instruction fetches (IN0 to IN3), read bus 1 (X0 to X3), read bus 2 (C0 to C3), multiply first stage (X0C0 to X3C3), multiply second stage (X0C0 to X3C3), write bus (Y0 to Y3).]

Figure A.5 Pipelined Harvard Operation

APPENDIX B

REDUCED INSTRUCTION SET COMPUTING (RISC)

Introduction

This appendix briefly describes the reduced instruction set computing (RISC) architectures that are being used in an increasing number of recent high-performance processor implementations. By omitting unnecessary functions, RISC architectures offer increased system speed while keeping down equipment costs. The RISC-based design approach is ideal for digital signal processors, and almost all the advanced DSP chips employ RISC in their structures in some way.

Description of RISC

Strictly speaking, it is very difficult to provide a precise definition of a RISC architecture. A simplified design philosophy, however, can be provided thanks to the original RISC research efforts pioneered by IBM's Thomas J. Watson Research Laboratory (project 801), by Stanford University (the Microprocessor without Interlocked Pipeline Stages, or MIPS), and by the University of California at Berkeley (the RISC project). The simplified RISC design philosophy can be stated as follows:

1. Analyze target applications to determine which operations are used most frequently.

2. Optimize the data path design to execute these instructions as fast as possible.

3. Include other instructions only if they fit into the previously developed data path, are relatively frequent, and their inclusion will not slow the execution of more frequent instructions.

4. Apply a similar strategy to the design of other processor resources. A resource is included only if its frequent use is justified and its inclusion will not slow other, more frequently used resources.

5. Push as much complexity as reasonable from the run-time hardware into the compile-time software.

When a design is based upon the RISC philosophy, the resulting architecture typically has features in common with other RISC designs. Some of the most commonly seen features in RISC computers are:

o Single-cycle execution of most instructions
o Load and store instruction set included
o Instruction set designed for a specific application class
o Relatively few instructions and addressing modes
o Highly pipelined data path for much concurrency
o Large register set
o Fixed instruction format for simple decoding
o Hardwired instruction decoding
o Complexity pushed into optimizing compiler
o Many levels of

APPENDIX C

FIXED-POINT VERSUS FLOATING-POINT ARITHMETIC

Introduction

Binary number representations in digital computers usually include a sign, a radix point, and a magnitude. The sign indicates whether the number is positive or negative. The radix point separates the integer and fractional parts of the number.

Number Representations

The sign of a binary number can be represented with one bit. In most representations, the sign bit is '0' for a positive number and '1' for a negative number. The sign bit is conventionally located in the leftmost position (most significant bit) of the number.

Several formats are used to represent negative numbers, including signed-magnitude, one's complement, and two's complement. The most common is two's complement. The advantage of the two's-complement format is that it provides a unique representation for zero, whereas the other formats have both a positive and a negative zero. In two's-complement format, zero is considered positive. Therefore, the magnitude of the largest negative number that can be represented with a given number of bits is one greater than the magnitude of the largest positive number. A two's-complement number of k+1 bits (one bit indicates the sign and k bits indicate the magnitude) can represent the range of numbers from $2^{k} - 1$ to $-2^{k}$.

The two's complement of a binary number may be computed by two methods:

a. Method 1 - Invert all the bits, then add 1 to the LSB of the number. This is illustrated in the example below:

    Binary +79          0 1 0 0 1 1 1 1
    Invert              1 0 1 1 0 0 0 0
    Add 1               0 0 0 0 0 0 0 1
    Two's complement    1 0 1 1 0 0 0 1    (equivalent to binary -79)

b. Method 2 - Invert all the bits to the left of the least significant 1. This is illustrated in the example below:

    Binary +79                          0 1 0 0 1 1 1 1
    Locate LS 1                         0 1 0 0 1 1 1 1
    Invert all the bits left of LS 1    1 0 1 1 0 0 0 1    (equivalent to binary -79)
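Both methods can be checked with a short sketch. The code below is an illustration only, assuming an 8-bit word as in the +79 example; the function names are hypothetical.

```python
def twos_complement_invert_add(value, bits=8):
    """Method 1: invert every bit, then add 1 (result kept to `bits` bits)."""
    mask = (1 << bits) - 1
    return ((value ^ mask) + 1) & mask


def twos_complement_scan(value, bits=8):
    """Method 2: keep the least significant 1 and the bits to its right,
    and invert every bit to its left."""
    mask = (1 << bits) - 1
    value &= mask
    if value == 0:
        return 0  # zero contains no 1 bit; its two's complement is zero
    lowest_one = value & -value       # isolates the least significant 1
    keep = (lowest_one << 1) - 1      # covers that bit and everything to its right
    return ((value ^ mask) & ~keep & mask) | (value & keep)


if __name__ == "__main__":
    print(format(twos_complement_invert_add(79), "08b"))  # 10110001
    print(format(twos_complement_scan(79), "08b"))        # 10110001
```

The two functions agree for every 8-bit input, which is a quick way to confirm that the methods are equivalent.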

The radix point separates the integer part of a number from the fractional part. The bits to the left of the radix point are the integer part and the bits to the right are the fractional part. There are two ways to specify the location of the radix point:

1. Fixed-point format
2. Floating-point format

Fixed-point Format

Fixed-point format places the radix point at a single, predetermined location. This location is to the left of all the bits if they all represent a fraction, and to the right of all the bits if they all represent an integer. The location of the radix point is a software convention and therefore does not need to be represented explicitly. The radix-point position may be changed by arithmetic operations such as multiplication, so shifting may be necessary to keep the number in the same fixed-point format.

While this fixed-point convention simplifies numeric operations and conserves memory space, it places a limit on the magnitude and the precision of the number representations. In applications that require a large range of numbers and high resolution, fixed-point format is at a disadvantage since its radix point cannot be relocated. The problems with fixed-point arise from several sources, such as:

o High system noise accumulated from round-off errors
o Imprecisely placed filter poles and zeros, which alter the transfer function and may cause instability (oscillation)
o Limited dynamic range
o Saturation-induced signal distortion

Very large and very small numbers, therefore, cannot be represented efficiently in fixed-point format. To overcome these drawbacks, floating-point arithmetic was introduced.

Floating-point Format

Floating-point format is a form of scientific notation. A floating-point number consists of a mantissa and an exponent. Each part of the floating-point number is stored in a fixed-point format: the mantissa is usually a pure fraction and the exponent a pure integer. In some cases, a constant such as an excess code or bias is added to the exponent so that it is always positive.

A floating-point number is always interpreted as a number of the following form:

$$m \times r^{e}$$

In this expression, only the mantissa m and the exponent e (including their signs) are physically represented in the register. The radix r and the radix-point position of the mantissa are always assumed by software.
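The mantissa-and-exponent form can be explored with Python's standard `math.frexp`, which splits a value into a fractional mantissa and a radix-2 exponent. This is only a sketch of the representation, not the register layout of any particular DSP IC; the helper names and the bias value are illustrative assumptions.

```python
import math


def decompose(x):
    """Return (m, e) with x == m * 2**e and 0.5 <= |m| < 1 for nonzero x."""
    return math.frexp(x)


def compose(m, e, r=2):
    """Rebuild the value m x r^e from its mantissa and exponent."""
    return m * float(r) ** e


def biased(e, bias):
    """Stored exponent when a bias (excess code) is added to keep it non-negative."""
    return e + bias


if __name__ == "__main__":
    m, e = decompose(23.4567)
    print(m, e)                # fractional mantissa and binary exponent
    print(compose(m, e))       # reconstructs the original value
    print(biased(e, bias=64))  # a bias of 64 is an arbitrary illustrative choice
```

Only m and e need to be stored; the radix and the position of the radix point live in the `compose` convention, mirroring the assumption made by software.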

The circuits that manipulate floating-point numbers in registers must conform to these assumptions about the radix and the radix-point position in order to achieve correct computational results.

A floating-point number is said to be normalized if the most significant position of the mantissa contains a nonzero digit. For example, the number 055 is not normalized but the number 550 is. When the mantissa is normalized, it has no leading zeros and contains the maximum possible number of significant digits (all bits are significant, with no redundant sign bits). Normalization therefore provides the highest precision for the number of bits available. It also simplifies the comparison of magnitudes, because the number with the greater exponent has the greater magnitude. The fractions need to be compared only if the exponents are equal.

To see the advantage of normalization, consider a register that can accommodate a mantissa of 6 decimal digits and a sign. The number $+.234567 \times 10^{2} = 23.4567$ is normalized since its mantissa has a nonzero digit in the most significant position. This number can be represented in unnormalized form as $.002345 \times 10^{4} = 23.45$. Clearly, the two zeros occupying the two most significant positions leave only 4 significant positions for the mantissa. The digits 6 and 7 that were present in the normalized form have been lost in this representation, since the register can accommodate only 6 digits.

Floating-point numbers are inherently inexact due to
the fact that each number has multiple representations that differ only in precision. This introduces error into floating-point arithmetic relative to the exact result. The error is less severe in multiplication and division than in addition and subtraction. The associative law, therefore, does not always hold for floating-point calculations.

Advantages of floating-point arithmetic include the elimination, in most cases, of the check for overflow. This advantage is especially important if multiple passes are performed on the same sample, as in FFT computations because of the multiple stages and in IIR (Infinite Impulse Response) filters because of feedback. Floating-point arithmetic operations are more complicated than fixed-point operations; they usually take longer to execute and require more complex hardware. However, floating-point representation is a must for scientific calculations because of the scaling problems involved with fixed-point computations.

Telecommunications is one of many application areas that benefit from floating-point arithmetic. The Adaptive Differential Pulse Code Modulation (ADPCM) algorithm of the Consultative Committee for International Telegraphy and Telephony has several floating-point variables that are handled slowly in software on a fixed-point system and much faster on a floating-point system. Workstations which model tremendously large numbers of extremely small fractions and quantities cannot do without floating-point arithmetic.

In general, floating-point format offers better dynamic range and better precision than fixed-point format. For example, a 22-bit floating-point data format offers better dynamic range (476 dB) than even a 56-bit fixed-point format (330 dB). With 22-bit floating-point, proper normalization offers 15-bit precision over a dynamic range of 384 dB. A 32-bit fixed-point format offers 15-bit precision over 102 dB, and even a 56-bit format offers the same precision over only 287 dB. Figure C.1 compares dynamic range versus number of bits for the fixed-point and floating-point formats. As shown, a dynamic range of more than 1600 dB is possible with a 32-bit floating-point chip, while only about 200 dB is achieved with 32-bit fixed-point.

For systems which demand high dynamic range and precision, floating-point format is, therefore, necessary. However, since a floating-point implementation is slower and more expensive than a fixed-point one because of its complexity, the trade-offs should always be studied thoroughly to select the best format for a system. A comparison of the effect of fixed-point and floating-point formats on precision is shown in Figure C.2.

[Figure C.1: chart titled "Floating-Point Versus Fixed-Point Dynamic Range". Vertical axis: dynamic range in dB, 0 to 1800. Horizontal axis: word lengths of 16, 22, 32, and 56 bits. Series: floating-point and fixed-point.]

Figure C.1 Dynamic Range Versus Wordlength
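The order of magnitude of these dynamic-range figures can be estimated with the usual rule of about 6 dB per bit. The sketch below is a rough model under stated assumptions: fixed-point range is taken as 20 log10(2^bits), and floating-point range as the exponent span plus the mantissa resolution. The split of a word into exponent and mantissa bits (6 and 15, or 8 and 24) is assumed here for illustration and is not given in the text.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB for each binary digit


def fixed_point_range_db(bits):
    """Approximate dynamic range of a fixed-point field of `bits` bits."""
    return DB_PER_BIT * bits


def floating_point_range_db(exponent_bits, mantissa_bits):
    """Approximate range: 2**exponent_bits binary orders of magnitude
    from the exponent, plus the resolution of the mantissa."""
    return DB_PER_BIT * (2 ** exponent_bits + mantissa_bits)


if __name__ == "__main__":
    for bits in (16, 32, 56):
        print(f"{bits}-bit fixed-point: {fixed_point_range_db(bits):.0f} dB")
    # Assumed splits: 6-bit exponent with 15-bit mantissa, 8-bit exponent with 24-bit mantissa
    print(f"22-bit floating-point: {floating_point_range_db(6, 15):.0f} dB")
    print(f"32-bit floating-point: {floating_point_range_db(8, 24):.0f} dB")
    # Range left over when 15 bits of a 32-bit fixed-point word are reserved for precision
    print(f"32-bit fixed-point at 15-bit precision: {fixed_point_range_db(32 - 15):.0f} dB")
```

The model is deliberately coarse, so its results land near, rather than exactly on, the figures quoted above; sign-bit and normalization conventions shift the numbers by a few dB.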

[Figure C.2: chart. Vertical axis: precision, 1E1 down to 1E-11. Horizontal axis: number of bits, 4 to 40. Curves: floating-point and fixed-point.]

Figure C.2 Precision Versus Wordlength

APPENDIX D

TERMINOLOGY

Arithmetic-and-logic unit: hardware that performs data operations such as addition, subtraction, and logical AND and OR.

Barrel shifter: hardware that allows arbitrary shifting of data.

Bit reversal: an addressing technique in which the order of bits in an address is reversed during a computation (such as in Fast Fourier Transform).
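As an illustration of bit reversal, the following sketch reverses the bits of an index in software. DSP ICs that support this mode do it in the address generation hardware; the function name here is hypothetical.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # move the current low bit into the result
        index >>= 1
    return result


if __name__ == "__main__":
    # Access order for an 8-point transform (3 address bits)
    print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```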

Benchmark: a specification that tells how well a product does a specified task; ideally, a benchmark simplifies a comparison of competing products by presenting a single piece of data that tells which product is better for a given application.

Cache: a small, fast memory positioned between a larger, slower memory and a processor; the cache's control maintains blocks of data adjacent to recently accessed data or instructions from the larger memory, thereby making access faster than with the larger memory alone.

Multiplier-accumulator: hardware dedicated to multiplying numbers and adding the results; the very fast units are typically able to multiply two numbers and add the product to previously accumulated results in a single clock period.
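The multiply-accumulate step is the inner loop of most DSP algorithms. The sketch below computes one output sample of a FIR filter as a sequence of such steps; it is a software illustration with hypothetical names, whereas a hardware multiplier-accumulator would perform each loop iteration in one clock period.

```python
def fir_output(coefficients, samples):
    """One filter output: the sum of coefficient * sample products."""
    accumulator = 0
    for c, x in zip(coefficients, samples):
        accumulator += c * x  # one multiply-accumulate step per filter tap
    return accumulator


if __name__ == "__main__":
    print(fir_output([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```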

Pipelining: starting the execution of a new task before a preceding one has been completed.

Round-off and truncation noise: in digital signal processing, noise introduced when the word length of the computation exceeds that of the DSP chip and the extra bits are discarded.
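The difference between truncation and rounding can be seen when a double-length product is reduced to the word length of the chip. The sketch below assumes integers stand in for the registers and that 15 low-order bits are discarded; the function names are illustrative.

```python
def truncate_low_bits(value, discard):
    """Drop the `discard` least significant bits (rounds toward minus infinity)."""
    return value >> discard


def round_low_bits(value, discard):
    """Add half the weight of the last kept bit, then drop the low bits."""
    return (value + (1 << (discard - 1))) >> discard


if __name__ == "__main__":
    product = 16384 * 3  # double-length product of two 16-bit operands
    print(truncate_low_bits(product, 15))  # 1
    print(round_low_bits(product, 15))     # 2
```

Truncation always errs in the same direction, so its error accumulates as a bias, while the rounding error is centered on zero.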

CA - Control arithmetic

CAU - Control arithmetic unit

Companding - The process of compression and expansion of a signal.
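One common companding characteristic is the mu-law curve used in telephony. The sketch below assumes that curve with mu = 255 and input samples scaled to the range -1 to 1; the text does not specify which law is meant, so the choice of mu-law and the function names are assumptions made for illustration.

```python
import math

MU = 255.0  # assumed value; the usual choice for mu-law telephony


def compress(x, mu=MU):
    """Compress a sample in [-1, 1] so that small amplitudes are boosted."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def expand(y, mu=MU):
    """Inverse of compress: recover the original sample."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


if __name__ == "__main__":
    for x in (0.01, 0.1, 0.5, 1.0):
        y = compress(x)
        print(f"x={x}: compressed={y:.4f}, expanded={expand(y):.4f}")
```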

DA - Data arithmetic

DAU - Data arithmetic unit

DAUC - Data arithmetic unit control

DMA - Direct memory access

EMR - Error mask register

ESR - Error source register

Fetch unit - The elements in this unit handle the instruction stream and perform memory-based operand access.

FFT - Fast Fourier Transform.

Fixed-point arithmetic - A method of arithmetic in which operations take place in an unvarying manner, and in which the computer does not consider the location of the radix point.
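Because the hardware does not track the radix point, the programmer must. The sketch below assumes a 16-bit fractional format with 15 bits to the right of the radix point (often called Q15); the format choice and the helper names are assumptions for illustration.

```python
FRACTION_BITS = 15  # assumed format: 1 sign bit and 15 fractional bits


def to_fixed(x):
    """Convert a fraction in [-1, 1) to its integer register value."""
    return int(round(x * (1 << FRACTION_BITS)))


def from_fixed(n):
    """Interpret an integer register value as a fraction."""
    return n / (1 << FRACTION_BITS)


def fixed_multiply(a, b):
    """Integer multiply; the product has 30 fractional bits, so shift
    right by 15 to return to the original format."""
    return (a * b) >> FRACTION_BITS


if __name__ == "__main__":
    a, b = to_fixed(0.5), to_fixed(0.25)
    print(a, b)                               # 16384 8192
    print(from_fixed(fixed_multiply(a, b)))   # 0.125
```

The multiplication itself is plain integer arithmetic; only the shift, chosen by the programmer, keeps the radix point where the format expects it.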

FB - Frame boundary

IBUF - Input buffer

Indexed mode - An addressing mode in which the address part of the instruction is modified by an auxiliary (index) register during the execution of that instruction. This mode is achieved with the instruction rD = rS + N followed by *rD.

ISR - Input shift register

IOC - I/O control word

I/O - Input/output

Link editor - Software support tool used for maintaining libraries, defining sections of on-chip and off-chip memory, and recognizing holes in memory.

OBUF - Output buffer

OSR - Output shift register

PDR - Parallel data register

PCR - PIO control register

PINT - Parallel interrupt

PIO - Parallel I/O

PIR - Parallel interrupt register

Relative mode - An addressing mode in which the absolute address is obtained by means of address modification. Address modification is performed by the addition of a given number to the address part of an instruction known as the relative address. This mode is attained by moving the program counter (pc) to another CAU register then performing a register indirect operation. In relative addressing the address contained in the second byte of the instruction is added to the program counter's lowest eight bits plus one.

Simulator - A highly specific program that allows accurate simulations of the logical operation of the DSP32 device.

SIO - Serial I/O

TDM - Time division multiplexing

APPENDIX E

List of Vendors

Analog Devices Inc. Box 280 Norwood, MA 02062 (617) 329-4700

Fujitsu Microelectronics Inc. 3320 Scott Boulevard Santa Clara, CA 95054 (408) 727-1700

General Instrument Corp. Microelectronics Div. 2355 West Chandler Blvd. Chandler, AZ 85224 (602) 963-7373

Gould Inc. Gould AMI Semiconductor Div. 3800 Homestead Rd. Santa Clara, CA 95051 (408) 246-0330

... Distribution Center Box 20924 ...

National Semiconductor Corp. 2900 Semiconductor Dr. Santa Clara, CA 95051 (408) 721-5000

NEC Electronics Inc. 401 Ellis St. Mountain View, CA 94039 (415) 960-6000

Oki Semiconductor 650 N. Mary Ave. Sunnyvale, CA 94086 (408) 720-1900

Signetics Corp. 811 E. Arques Ave. Sunnyvale, CA 94086 (408) 991-2000

Texas Instruments Inc. Semiconductor Group (SC-612) Box 809066 Dallas, TX 75380 (800) 232-3200

Thomson Components-Mostek Corp. 1310 Electronics Dr. Carrollton, TX 75006 (214) 466-6000

Zoran Corp. 3450 Central Expressway Santa Clara, CA 95051 (408) 720-0444