€6<>i aamuoKm

m n INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrougb, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

University Microfilms International A Belt & Howell Information Company 300 North Zeeb Road. Ann Arbor. Ml 48106-1346 USA 313/761-4700 BOO,'521-0600

Order Number 0825444

The VAMPIRE chip: A Vector-quantising Associative Memory Processor Implementing Real-time Encoding

Adkins, Kenneth Christopher, Ph.D.

The Ohio State University, 1093

Copyright ©1093 by Adkins, Kenneth Christopher. All rights reserved.

UMI 300 N. Zeeb Rd. Ann Aibor, MI 48106

T h e VAMPIRE C h ip : A V e c t o r - q u a n t iz in g A sso c ia t iv e M em o r y P r o c e sso r Implementing R e a l - t im e E n c o d in g

dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate

School of The Ohio State University

By

Kenneth Christopher Adkins, B.S., M.S.

$ $ $ $ $

The Ohio State University

1993

Dissertation Committee: Approved by

Steven B. Bibyk Steven Bibyk Adviser / y Stanley Ahalt Department of Electrical Mohammed Ismail Engineering © Copyright by

Kenneth Christopher Adkins

1993 To Cirnnny anti Gnmtlntl

ii A cknowledgements

I express sincere appreciation to Dr. Steven Bihyk for his continued support and guidance over the past five years. Thanks go to the other members of my advisory committee, Dr. Stanley C. Alialt and Dr. Mohammed Ismail, for their comments, time, and flexibility. Gratitude is expressed to Mary Jo Shnlkhauscr and Joe Harrold at NASA Lewis for support via grant NAG3-1802 and funding of the VAMPIRE chip. The assistance of Changku Hwang for running SPICE simulations and of Jim

Fowler for providing the satisfaction of Hccing the chip run in a real-time application is gratefully acknowledged. I thank my parents and family for their ever-present support and encouragement. To Thcrese Patnesky and our stray cat Sadie, I offer sincere thanks for enduring the fortunes and misfortunes that have characterized the push to complete this research. Vour devotion, encouragement, and cooking will not be forgotten. V i t a

December 28, 19GC ...... Horn * Cincinnati, Ohio

June, 1988 ...... B.S. in Electrical Engineering, The Ohio Slate University, Columbus, Ohio

1988-1991 ...... Teaching and Research Assistant, Department of Electrical Engineering, The Ohio Slate University

March, 1991 ...... M.S. in Electrical Engineering, The Ohio State University, Columbus, Ohio

1991-1992 ...... Teaching and Research Assistant, Department of Electrical Engineering, The Ohio State University

P ublications

(Bibliographic entries of previously published material)

F i e l d s O f S t u d y

Major Field: Electrical Engineering

Studies in VLSI Circuit Design, Communication Systems T a b l e o f C o n t e n t s

DEDICATION ......

ACKNOWLEDGEMENTS

VITA ......

LIST OF FIGURES . . .

LIST OF TABLES ...... xiii

LIST OF PLATES ...... xiv

CHAPTER PAGE

I Introduction ...... 1

1.1 Overview ...... 1 1.2 Nature of the Problem ...... 1 1.3 Problem Statement ...... 4 1.4 Dissertation Structure ...... 5

II Background ...... 6

2.1 M o tiv a tio n ...... 6 2.2 Previous Work ...... 10 2.2.1 Associative M em o ries ...... 10 2.2.2 Vector Quantization ...... 15 2.2.3 Hardware Implementations of VQ ...... 22 2.3 S u m m a ry ...... 24 III Design Considerations ...... '...... 26

3.1 Analog vs. D ig ita l ...... 26 3.2 Calculation ...... 28 3.3 Absolute Value ...... 35 3.4 Parallel vs. S erial...... 38 3.5 Carry Propagation ...... 42 3.6 Winner Selection ...... 44 3.7 VQ Codcbook Design ...... 48 3.8 Summary ...... 49

IV VAMPIRE Architecture ...... 51

4.1 Structural O verview ...... 51 4.2 Computation Cell ...... 55 4.2.1 Codeword S to ra g e ...... 56 4.2.2 “Grcatcr-Than” C irc u it ...... 59 4.2.3 Absolute Difference V alu e ...... 62 4.2.4 Component Sum s ...... 64 4.2.5 Global Compare Circuit ...... 66 4.3 Overflow C e ll ...... 69 4.4 Right-End C e ll ...... 72 4.5 Priority Encoder ...... 72 4.6 Intcr-Chip Circuitry ...... 75 4.7 Address Decoder ...... 80 4.8 Pad FVame ...... 80 4.9 Reset ...... 81

V Implementations ...... 82

5.1 Computation Cell ...... 83 5.2 Tiny Chip ...... 85 5.3 fiill-Scalc Chip ...... 87

VI Chip A nalysis ...... 89

6.1 Equipment and Procedure ...... 89 6.1.1 Data Acquisition and Control Adapter ...... 90 6.1.2 Evaluation Board ...... 92

vi 6.1.3 Testing Procedure and Sample O p e ra tio n ...... 96 6.2 Experimental Results ...... 101 6.2.1 Functional/Steady-Statc T esting ...... 101 6.2.2 High-Speed Testing ...... 102 6.2.3 Baseline Performance ...... 106 6.2.4 Initial Conditions ...... 114 6.2.5 Global Comparison/Winner Selection ...... 118 6.2.6 Location Dependence ...... 124 6.2.7 Multiple C h ip s ...... 133 6.3 Real-Time T e s t ...... 134 6.4 Discussion ...... 135

VII Alternative Applications ...... 142

7.1 Maximum Correlation Detection ...... 142 7.2 Symbol D e te cto r ...... 144 7.3 Symbol Synchronization ...... 145 7.4 Beam F o rm in g ...... 147 7.5 Artificial Neural Networks ...... 151 7.6 Non-Destructive Evaluation ...... 154 7.7 S u m m ary ...... 155

VIII Recommendat ions and Conclusions...... 157

APPENDICES

A The Mathematics of Information Theory ...... 165

A.l Information Theory ...... 165 A.2 Vector Quantization ...... 168

B Program Listing ...... 171

C VAMPIRE Chip User’s Guide ...... 181

C.l Introduction ...... 181 C.2 Wiring the C h ip ...... 183 C.3 Chip Operation ...... 185

vii C.3.1 R e s e t ...... 185 C.3.2 Store ...... 186 C.3.3 M atc h ...... 187 C.4 Using Multiple Chips ...... 189 C.5 Chip Pinout ...... 192

BIBLIOGRAPHY ...... 193

viii L i s t o f F i g u r e s

FIGURE PAGE

1 The original modified DPCM algorithm ...... 7

2 Block diagram of a DVQ system ...... 9

3 Typical CAM cell architecture ...... 13

4 Analog winncr-takc-all structure ...... 15

5 Enablcd-NOR structure ...... 16

6 An n-stagc MSVQ design. Each stage uses /»,• hits to encode the corre­ sponding codeword label 19

7 A Scparating-Mcan Vector Quantizer ...... 19

8 Graph of percent error (from Zrj-norm) versus angle 6 from basis axis. 31

9 The absolutp distance (/^-norm) calculation broken down into essential components ...... 34

10 A synchronous processing element design ...... 40

11 An asynchronous processing element design ...... 41

12 Circuit schematic for a traditional full-adder cell ...... 43

13 An unsuccessful example of conveying ROM I’A It K line information across chip boundaries ...... 47

14 The general structure of the VAMPIRE chip ...... 52

15 Internal structure of a word-unit ...... 53 16 Detailed floorplan of the VAM PIRE chip ...... 54

17 Structure of a single computation cell ...... 55

18 (a) a sample vector, (b) binary representation of this vector, (c) the bits of the vector as they are stored in the compulation cells ...... 57

19 Static RAM schematic ...... 57

20 Schematic of a standard static RAM cell ...... 58

21 Schematic for generating the “Grenlrr-Tliau” signal ...... 60

22 The GT circuits as they appear across a single word-unit ...... 61

23 Generating intermediate signals X and XOR ...... 63

24 Generating the outgoing carry bit and the difference bit ...... 64

25 A schematic showing-all the circuit components discussed to this point. 65

26 The full adder cell used in the VAM PIRE chip ...... 66

27 One bit of the global compare circuitry...... 67

28 A bit-slice of the on-chip winner selection circuitry...... 68

29 Schematic of the Overflow Cell ...... 71

30 Schematic of the righl-cnd and priority encoder cells ...... 73

31 Inter-chip winner selection circuitry ...... 76

32 Daisy chain connection of IN and OUT lines ...... 78

33 Various I/O signals ...... 79

34 Reset circuitry. The inverter chain helps eliminate spurious resets. . . 81

35 Diagram of the test equipment ...... 90

36 Schematic of the evaluation board used to test the performance of the VAMPIRE chip ...... 94

37 Timing diagram for some evaluation board signals...... 96 38 Portion of Lcnna image used for experiment. Tliin fragment represents 900 vectors ...... 103

39 Performance of the AM measured on different days ...... 104

40 Response of the absolute difference bit when the input (solid line) goes from 0V to 5V ...... 109

41 Response of the absolute difference bit when the input (solid line) goes from 5V to 0V ...... 110

42 Response of the winning codeword ...... I l l

43 Response of the winning codeword using properly sealed PMOS tran­ sistors ...... 113

44 Effect of memory contents on match time ...... 116

45 Typical response of the cillP.VALID line with respect to the two clock pulses ...... 120

46 Summary of results for Co = Mo = (0,0,0,0) and Cj = Mi = (0,0,0,x).122

47 The effect of storing repetitive codewords in the memory...... 125

48 One bit of the MRU ...... 127

49 Illustration of charge sharing on a LA I) EL bit line ...... 128

50 Settling time versus location index m ...... 132

51 Correcting the arbitration circuitry externally...... 137

52 Example of using comparators to determine which of the four chips contains the winning codcvcctor ...... 139

53 The selection codcbook evaluates the input data and selects the ap­ propriate memory chip to perform the VQ association ...... 140

54 A two-stage predictive MSVQ system using the VAM PIRE chips in their current state ...... 141

55 Signal space of 16*ary amplitude and phase shift keying (APK) system. 146

xi 56 Codewords arc derived by sampling the expected waveform in four ' locations at incrementally different time delays, as illustrated with the sliding window ...... 148

57 Example of using an array to direct radiated power ...... 150

58 Partition of the codeword space into two intertwined spirals using 80 codewords total ...... 153

59 Serial processing element with programmable vector dimension. . . . 160

60 The shaded region represents the distribution of adjacent pixel values x j,x 3 ...... 169

61 Four VAMPIRE chips wired together ...... 183

62 Timing diagram for a S T o n E operation ...... 187

63 Timing diagram for a MATCH operation ...... 188

xii L i s t o f T a b l e s

TABLE PAGE

1 Summary of error versus the // 2-norm ...... 32

2 lYuth table for forming absolute difference ...... 37

3 Description of the DACA’s 32 digital channels ...... 91

4 Baseline performance test results. Co = Mo — (0,0,0,0) and C\ = Mi = (0,0,0,1)...... 107

5 Effect of memory contents on settling time. Cq = Mo = (0,0,0,0) and C\ — M\ — (0,0,0.x) ...... 115

6 Effect of memory contents on settling time. C« = Mo = (0,0,0,0) and Ci = Mi varies...... 118

7 Summary of results for Co = Mo = (0,0,0,0) and Ci = Mi = (0,0,0,x).121

8 Results from matching to a codewords stored in location 0 and m. Mo = Co and Mi = C„ 126

9 Results from matching to codewords stored in location m and m + 1 Mo = Cm = (0) and M, = Cm+i = (0,0,0,1) ...... 130

10 Results from matching to codewords stored in location m and m + 1 Mo = Cm+, = (0) and M, = Cm = (0,0,0,1)...... 131 L i s t o f P l a t e s

PLATE PAGE

I The layout of a single computation cell ...... 84

II Photograph of the Tiny Chip die ...... 86

III Photograph of the full-scale VAMPIRE chip ...... 88

xiv CHAPTER I

Introduction

1.1 Overview

An encoding technique known as vector quantization (VQ) is introduced, and the problems facing real-time VQ image encoding arc explained. Previous research in this area is presented and compared with the approach taken for this dissertation.

The VAMPIHE chip, which stands for Vector-quantizing Associative Memory Processor

Implementing Real-time Encoding, was successfully designed and fabricated to meet the requirements of a real-time video-rate vector quantization encoder. Experimental rcsultB arc presented, as well as suggestions for alternative applications for this chip.

1.2 Nature of the Problem

Vector quantization is a lossy data compression technique which incorporates funda­ mental principles of information theory, resulting in a more cflicicnt quantization of the data space. In VQ, the vector space defined by a block (or vector) of data is divided into a finite number of distinct regions. The centroid of each region can be considered a vector representative of that region and is referred to as a codeword.

The collective body of all codewords is termed the codebook. It is assumed that the codewords are defined apriort to the encoding process and that a unique label (the encoded output) has been assigned to each one. The mapping from any given input vector to the correspondingly closest codeword label is termed vector quantization.

Compression is achieved since the number of bits required to send the codeword label is less than the number of bits required to represent the vector itself.

Stated in more formal terms1, let / be a /.'-component source vector with joint pdt /

C3 if I € pj. Two necessary conditions that must be satisfied for an optimal k- dimcnsional quantizer arc:

• the partition must be a Diriclilcl partition (also called a Voronoi region), that

is,

p, = {/: ||f- C 'll < ||/- C V ||, V ; 5< i) (1.1)

• and the output points must be centroids of their respective regions, so

Ci = min{C : [ \\I- £ |[ 7 ( /) dl) (1.2) Jpi

If each component of the source vector can be represented with b bits, then the resulting compression ratio is bx k: log3 N.

The merits of vector quantization have been recognized for quite some time. In

1948, Claude Shannon (“The Father of Modern Communications”) spawned much

Adapted from [1]. 3

of the work in a field now known as Information Theory with his landmark paper

(and subsequent book) entitled “A Mathematical Theory of Communications” [2, 3].

Shannon devised a method of quantifying the amount of information a discrete source

could produce based on the probability distribution of the symbols in its alphabet.

In this work, Shannon was able to show that the average uncertainty, or entropy,

of an event X is generally reduced when coupled with the observance of another

event Y. A second observation derived from his work was the development of the

ratC'distortion function, which defines the minimum information rate necessary to

represent an output with a given average distortion. A fundamental conclusion of these two findings (which lie stated in a later paper [4]) is that belter performance can always be achieved by coding vectors of data rather than scalars. For a more detailed and rigorous treatment of this subject, refer to Appendix A.

In spite of Shannon’s revolutionary work almost a half a century ago, VQ has only recently emerged as a promising approach for speech and image data compression [5,

6]. One of the main reasons these finds have not yet been implemented is that the theory offers no insight on how to design a vector coder. Another reason is that traditional scalar coders often yield satisfactory performance with enough adaptation and fine tuning. These are two explanations offered by Robert Gray in his oft- referenced treatise on vector quantization [5]. The fact is that vector-quantizers arc an order of magnitude more complex than the scalar coder counterpart; perhaps only recently has the technology to implement VQ become economical. However, now that the information age has begun to mature, competition in the business of data communications is more fierce. The bottom dollar will drive the state of the art to incorporate any technological advantages that could be used to increase revenues, such as those offered by vector quantization.

1.3 Problem Statement

The problem addressed by this dissertation is one of implementing VQ encoding for a video image application. The primary difficulty with vector coding iB that optimal quantization involves the computation of some distortion measure between the vector to be encoded and every codcvcctor in the codcbook. The “winning” codeword is the one that results in the minimum distortion value. It can be seen that the number of operations required to execute such a search is proportional to the number of codewords. Thus, an implementation utilizing a large codcbook entails a cumbersome computational load. Matters arc complicated even more when the system is required to operate at very high data rates.

The platform for which the proposed vector quantizer will operate delivers one vector approximately every 280 nanoseconds (ns), which equates to a processing rate of over 3.5 million vectors per second (vps). Furthermore, the encoder must be capable of using a codcbook size of at least 25G codewords, which yields a computing rate of almost one billion distortion measurements every second. If the candidate vectors were to be serially evaluated against these 256 codewords, each evaluation would have to take place in just over one nanosecond. Since each distortion computation involves a number of subcalculations, the effective operating speed of the vector coder will need to be on the order of several billion operations per second. 5

Clearly, this is not an algorithm that can be implemented on a traditional SISD

(single-input single-data) processor. Special-purpose hardware must be designed in order to implement such a computationally intensive encoder. This document de­ scribes the hardware built to complete such a task.

1.4 Dissertation Structure

Chapter I has introduced the topic of vector quantization and presented the main difficulties associated with implementing vector coding. The remaining chapters of this dissertation arc organized in the following manner. Chapter II contains the moti­ vation for this project, aside from the more general problem of VQ encoding. Related work from past research is presented here, and I explain how this body of work dif­ fers from the approach described in this document. In Chapter III many algorithmic details of the VAMPIRE chip design arc given. I explore many of the issues that in­ fluenced the particular path that was finally decided upon. Chapter IV describes the general associative memory (AM) architecture, providing circuit schematics which explain the operation of various components. Chapter V details specific characteris­ tics of each of three AM implementations: the prototype chip, the full-scale VAMPIRE chip, and a design created specifically for NDE applications. In Chapter VI, I give experimental test resultB from the fabricated implementations, including real-time performance in a video source coding system. Chapter VII contains a number of po­ tential alternative applications for the VQ chip. Finally, I conclude this dissertation work in Chapter VIII, giving some recommendations for future research. A number of appendices are included to supplement information contained in the body of text. CHAPTER II

Background

This chapter discusses some of the motivating factors which led us toward attacking the difficulties of VQ encoding. Work that other researchers have done in the areas of vector quantization, associative memories, and the hardware implementation of real-time vector coders is reviewed in subsequent sections.

2.1 Motivation

The vector quantization algorithm for which the VAMPIRE chip was developed origi­ nated as an attempt to improve upon the differential pulse code modulation (DPCM)- bascd design [7,8,9] shown in Figure 1. NASA developed this algorithm by modifying the standard predictive DPCM encoder with an additional “non-adaptivc” predictor

(NAP). The goal of the design was to compress an NTSC television picture signal for digital transmission, while retaining its broadcast quality characteristics in the resulting decoded image. If successful, such a project could be used for HDTV (High

Definition Television) image transmission or other space satellite image transmis­ sions [10].

In the original DPCM algorithm, a standard television image line is sampled at four times the color subcarrier frequency, resulting in a pixel rate of l/70ns. The 7

NAP Look-Up Dclay NAP Data Output PIX +, DIF Huffman Quantization Encoder Level QL, FV.

RP Four Sample NAP Delay PV

Two Line HUH(J)?- Delay

Figure 1: The original modified DPCM algorithm. 8

difference between each digitized pixel and a prediction value formed from previously

encoded pixels is quantized using a 13-level non-uniform scalar quantizer and then

entropy encoded with a Huffman coder, yielding an overall compression ratio of nearly

4:1, Though the system produces very high quality images under laboratory condi­ tions, it possesses some fundamental disadvantages which would impede a successful, practical implementation. For example, the entropy code, while giving the system ap­ proximately half of its compression capabilities, requires a complex buffering scheme to handle the variablc-Icngth nature of the code. More importantly though, Huffman codes are not robust against bit errors, which can greatly complicate error detection and correction processes. On the other hand, the Huffman coder docs have the advan­ tage of being a lossless coding technique. However, because the characteristics of satellite communications make them susceptible to channel errors, source coding methods that do not use entropy or variable length coding arc more favorable [11].

This becomes especially true given that the success of the system depends heavily on obtaining very high quality reconstructed images.

Block codes exhibit superior error detection and correction abilities as compared with variable length codes. Unfortunately, eliminating the Huffman code from the

NASA-based encoder would cut the compression capability in half, rendering it in­ effective as a data compression device. One method of realizing comparable per­ formance to the original DPCM system without an entropy code, is to improve the efficiency of the quantizer through the use of differential (or predictive) vector quanti­ zation (DVQ) techniques [12,13,14]. Instead of processing pixels in a scalar manner, 9 an entire block of predicted pixels would be subtracted from the corresponding block of image pixels, resulting in a vector of difference values (sec Figure 2). This residual vector could then be quantized with the appropriate vector coder.

Digitized Pixels To Channel Scalar-lo-Vcctor VQ Buffer

Vector Predictor

Figure 2: Block diagram of a DVQ system.

Associative Memories (AM's), or Content Addressable Memories (CAM’s) as they arc frequently called in the literature, are data storage devices that have the property of being accessed via the contents of the memory cells rather than the address of those cells. This property gives AM’s inherent search capabilities not existent in conventional memories. Vector quantization is one problem that can take advantage of AM’s unique features, and therefore maps well to associative techniques [5]. An AM that could search its contents based on a function governed by Equations 1 .1 and 1.2 would represent a direct mapping of VQ into hardware. This type of associative memory is a potentially feasible solution in this case because the address space of a standard look-up table is far too large to be economical, and serial search techniques take too long for real-time operation. In fact, the use of AM techniques is one of the few viable alternatives for encoding video-rate VQ. Unfortunately, an associative 10

memory with this type of search capacity does not currently exist; this research was

ultimately motivated to develop such a device.

2.2 Previous Work

Volumes have been written about various types of associative memories and about

the many derivatives of vector quantization. Note that this section is not intended to

give a broad survey of past work in these fields, but rather its purpose is to provide

a sampling of more recent developments. In particular, I will attempt to include here

only research that directly addresses vector coding implementation through the use

of associative techniques.

2.2.1 Associative Memories

Throughout the literature, the terms associative memories, content-addressable mem­

ories, catalog memories, search memories, distributed logic memories, etc. have all

been used to describe basically the same idea; namely, data manipulation operations

that select specific memory locations based on the content of the memory, or on data

“associated" with that memory. The early development of AM’s (since the 1950’s)

indicates to some degree their importance; many problems map more naturally to

an associative processor (AP) than a traditional Von Neumann machine [15, 16]. In­

deed, another clue that attests to the importance of associative recall is found in the

remarkably robust and useful biological (human) memory recall [17, 18]. For cer­ tain applications, content addressing can result in much higher system performance compared to conventional methods of addressing. However, the idea is not a recent innovation, as demonstrated by the fact that Hanlon’s 1966 survey article on CAM’s

and AM's [19] contains 125 references.

In spite of this, relatively few CAM’s have been designed and implemented. Early

attempts at using associative processing techniques were hampered by their high cost,

low speed, and limited technological advances. The complex internal control logic and

the lower storage capacity as compared to SRAM’s arc CAM characteristics that have discouraged its use. But contrary to SRAM’s, associative memories arc not intended

for only storing data. A number of functions for data processing can be implemented,

resulting in very powerful devices [15]. The required functions can be very different

from one application to another, and consequently associative memories tend to be

more application specific than their SRAM counterparts. Thus, one goal of AM design

is to develop an architecture that can perform a wide variety of functions.

Because of problems associated with hardware designs, CAM realizations through

software techniques arc popular when applicable. Hash coding is probably the most

widely used software implementation of content-addressing. In hashing, data is stored

in a memory location whose address is a function of the data itself. For example, if

the data is a set of names, let each letter be assigned a value between 1 and 26,

corresponding to A through Z, respectively. An appropriate hashing function might

be to multiply the value of the first letter by 26 and add to it the value of the second letter. Of course the main drawback to this set-up is that it is possible to have two different names that start with the same two letters. This phenomenon is termed collision, since both names will get mapped to the same location. Hashing does not usually make complete use of its allocated memory; however, since the memory being used is SRAM memory, the overall storage capability is still higher than that of CAM hardware. Hash codes are used because the software can be used to program existing hardware. In some cases, hash tables exhibit the same effectiveness as specialized hardware CAM’s. This is one reason hardware realizations have not been pursued heavily. The extra cost of CAM storage on a per bit basis may not offset the slight decrease in search time it provides versus hashing techniques. Unfortunately, VQ is not a good candidate for hashing methods. For a general set of codewords, the hashing function would likely not be able to achieve a high enough degree of accuracy within a reasonable amount of memory space.

The oldest hardware CAM design known is a cryotron catalog memory dating to

195G [20]. The 20-bit CAM unit consisted of 4 words at 5 bits each, and it occupied a 4 inch board. System speed was 10/is for a match and 500/j s for a write operation.

At the time, large scale integration of semiconductor devices had not been realized, and much effort was placed on superconducting (cryogenic) [ 2 1 , 2 2 ], magnetic [23], and even optical (holographic) implementations.

Improvements in LSI technology soon led to semiconductor CAM designs [24, 25,

26]. One of the first commercially available integrated CAM packages was Intel's

16-bit 3104 chip, released in 1971 [27], It offered 4 words of 4 bits each and op­ eration speeds of 30ns. Since that time chip capacity has increased tremendously, and an experimental CAM that can hold up to 1G0 kilobits of information has been reported [28]. Other high-capacity CAM’s have been reported in [29, 30, 31, 32], 13

Kowarik ct al have even proposed a 1-Mbit associative DRAM design [33]. The

Am99ClO is a commercially available chip from AMD that can hold 1 2 -kb (256 words by 48-bits) of content-addressable memory [34]. A more comprehensive review of

CAM developments can be found in Kohoncn [35], which lists over 1200 references.

A common trait among nearly all previous CAM designs is that they can only detect an exact match between a stored word and the reference word, where the reference word is a masked version of the input data. They have limited capacity to perform “soft” (inexact) matches. Many of the designs arc based on a circuit similar to the one shown in Figure 3 [36, 37]. This schematic shows a conventional six-

WORD

_ MATCH n r ™

Figure 3: Typical CAM cell architecture. transistor SRAM cell modified by transistors M7 - M9. Using appropriate operating procedures, the MATCH line will be pulled low if there are any mismatches between the stored word and the reference word. In order to conduct magnitude searches with this type of cell, the memory must perform a bit-wise serial Hamming search.

First, all bits but the MSB arc masked. In subsequent steps, additional bits arc revealed. If at any time a word docs not match, it is disqualified from interacting in later steps. In this manner, an entire memory can be scanned to reveal the largest or smallest member. However, due to the serial nature of this operation, exact-match type CAM cells arc not appropriate for real-time video-rate VQ matches. Also, the search function required for a VQ system relics on mathematical computations not available from exacL-malch cells.

Recent work in neural networks have led to the development of several hardware implementations of associative memories [38] which arc better suited for VQ appli­ cations. Based on Ifopficld networks [39], these circuits store information in the interconnect weights between simple [40]. These circuit structures have the advantage of being able to locate the best match even when a perfect match docs not exist; however, they also typically have a very low storage capacity, and they take many cycles to learn a word [41],

A digital/analog hybrid design such as the one found in [36] represents an interest­ ing compromise between the analog neural net structures and the traditional digital designs. This design relics on analog current summing coupled with winner-takc-all

(WTA) structures like those shown in Figures 4 and 5 [42, 43]. The word that best matches the input has the most pull-down transistors on at once, allowing it to more quickly shut down the match circuitry of other words. With the proper scaling of transistors, the distance between words can be measured as a Hamming distance or as a weighted (mathematical) distance. The main drawback to these designs is their limited accuracy.

MA. MA, MA. HI Hi Hi

tu n MB tutn

MD, □ i-J h c '«Ini MO, II-Hl 'iin2 'nhlXjp v MC, r i mc, \ 2 t - r - J " 0 * / T > Figure 4: Analog winner-take-all structure.

2.2.2 Vector Quantization

Due to its computational intensity, video vector quantization was originally an off-line algorithm, which could not be used for real-time applications. Though at first hard­ ware designs of vector coders were generally too complex or too slow to merit their implementation, the promise of improved encoder performance motivated an increas­ ing amount of interest in VQ, Within the last decade alone, a number of research groups using various techniques have closed the gap on realizing a real-time video­ rate vector quantizer. Both advances in VLSI technology and innovative derivatives of the algorithm itself are responsible for the improved performance. Though it is not possible to survey all of the research currently being conducted in VQ related areas, 16

=H

V ol

Vo2

Vo3

Vo4

Figure 5: Enabled-NOR structure. 17 this section presents some of the major focuses of effort. The papers found in [5, 6 ] give a much more comprehensive review of the subject.

It is important to realize that the full-scarch vector quantizer represents the up* per bound of VQ performance. For a given vector dimension (data block size) and a given set of “test” image statistics, it is possible to find an optimal codcbook set (for a predetermined amount of distortion) using an algorithm known as the Lindc-Buzo-

Gray (LBG) or generalized Lloyd algorithm [44]. No vector quantizing algorithm can match the performance of fully searching an optimized LBG codcbook. Unfortu­ nately, the full-scarch approach represents the upper limit of encoder complexity. For this reason, various “sub-optimal” VQ techniques have been proposed which greatly simplify the encoding procedure at the expense of some performance. Though ideally the codcbook selection process is completely independent of the actual VQ encoding operation, for some of the VQ methods listed below the two processes are somewhat interrelated.

The first set of sub-optimal encoders attempt to direct the codeword search. This directed search can serve to either prioritize or limit the search space. One of the most common of these methods is the trce-scarch (TSVQ) algorithm, first proposed by [45]. Here one imposes successive binary hyperplanc tests on the input vector to determine whether one descends to the left or right child of each node. This approach imposes stringent requirements on the codebook, as a suitable test must exist at each node of the tree, Riskin cl al. [46] use the fact some codewords occur more frequently than others to design a variable-height tree structure for maximizing overall 18 performance (termed Pruned Tree-Structure VQ). Ideally, one would like a general design technique for obtaining a tree search into an arbitrary VQ codcbook, thereby restoring the general-purpose nature of VQ. Examples of on-going work in this area can be found in [47, 48]. Another type of directed search algorithm is lattice or trellis

VQ [49, 50). Here the object is to allow the encoder to follow multiple paths before deciding on the optimal set of encoding decisions; however, good quantizing codebooks produced using the LBG algorithm generally cannot be well approximated using a lattice structure. One iinal type of directed search algorithm is a technique known as cache codcbook (CCB) search [51]. The CCB method exploits the correlation of neighboring code vectors by storing the most recently used ones in a cache. The cache codcbook is searched first, and the remainder of the codcbook is searched only if the minimum distortion of the CCB vectors is larger than some threshold value.

Another set of sub-optimal VQ methods attempts to use rcduccd-sizcd codcbooks by breaking the quantization process into distinct steps. The first of these techniques is called multi-stage VQ (MSVQ), shown in Figure 6 . The input vector is quantized in the first Btage with a small codebook, and the label of the appropriate codeword is recorded. The quantized vector is subtracted from the original input, and the result­ ing residual vector becomes the input to the next stage. The codeword labels from each stage are concatenated to form the final channel symbol [52]. Because of the minimal size of the codcbooks at each stage, MSVQ has the advantage of searching a smaller codeword space. A variant of MSVQ is called Address-VQ (AVQ) [53, 54]; here a block of pixels is vector quantized and subsequently a block of the codeword 19

VQVQ VQ

Stage 1 Staged

Figure G; An n-stagc MSVQ design. Each stage uses 6 ; bits to encode the correspond­ ing codeword label.

addresses is vector quantized using a separate codcbook. This is another example in which correlation between adjacent pixel blocks (and therefore codeword labels) is ex­ ploited to improve performance. Other examples of multi-step quantization methods are Separating-Mean VQ (SMVQ) [55], also known as Mean/Residual VQ (MRVQ), and Gain/Shape VQ (GSVQ) [45]. In MRVQ, the mean of an image vector is com-

VQ

Sample Mean Scalar Quantizer

Figure 7: A Separating-Mean Vector Quantizer. puted, scalar quantized, and subtracted from each component of the image vector

(see Figure 7). The quantized mean and the vector coded residual vector are trans­ mitted separately. GSVQ is a similar, yet superior, quantizer that uses separate, but 20 interdependent, codebooks to code the shape and gain of the waveform. The re­ constructed vector can be obtained by multiplying the quantized shape with the quantized gain.

The sub-optimal techniques described above reduce the computational intensity compared to that of a full-search algorithm by diminishing the search space, or by restructuring it for a more ordered search. In cither case, the best that any of these methods can hope to achieve is that of the full-search approach. They arc still impor­ tant, however, since at some point of encoder complexity a full-search algorithm will not be feasible to implement. Another property these methods share is that they arc all “memorylcss" techniques. That is, they do not rely on any history of the channel or symbols that have been sent through it. Though this makes the design somewhat simpler, we will sec that incorporating memory into the design can improve encoder performance.

A separate branch of VQ research focuses on methods of pre-processing the input data, such as transform coding and predictive coding techniques. Predictive methods are said to use the “memory" of the system by incorporating knowledge of previ­ ously transmitted pixels to form an estimate of the unknown data. The estimate is subtracted from the input data to produce an error vector, which will be the input to the vector quantizer. Predictive coding techniques, such as DPCM, exploit the high degree of correlation among adjacent image pixels to reduce the variance of the residual vector. The smaller the variance of the input data, the more efficiently a

VQ algorithm can partition the vector space. Thus, regardless of the type of search 21 algorithm used to implement VQ, placing an accurate prediction circuit in front of the quantizer will improve the coding performance [12,56,57, 58,9]. Note that prediction methods can be contained completely within a still image (intra-frame encoding), or the prediction can be generated between frames (inter-frame) as in the case of mo­ tion picture images. Motion estimation can offer a significant improvement of encoder performance, and it has been incorporated into the standard (MPEG, Motion Picture

Experts Group) for compressing moving video.

The other major form of pre-processing is known as transform coding. By far the most popular type of transform code for video compression applications is the

Discrete Cosine Transform (DCT). Usually implemented in two-dimensional blocks,

DCT can be executed in a “fast” format, and it effectively concentrates the majority of image block information into the low frequency coefficients of the transform domain.

Compression is achieved by only sending the coefficients which exceed some threshold, or by roughly quantizing the less important coefficients, or both. DCT performs so well in image compression systems that it has been incorporated into the standards set by the Joint Photographic Experts Group (JPEG) for still and moving images

(MPEG). Other transform techniques include the Fast Fourier Transform (FFT), the wavelet transform, and the Karhuncn-Loeve Transform (KLT). KLT, also known as the method of principle components, is by definition the most effective transform for any given set of statistics, as it computes the optimal set of basis vectors based on those statistics. The wavelet transform has also shown a great deal of potential, but neither of these transforms can be computed easily or quickly, especially in a 22 compact hardware implementation. Only recently has VQ been used to quantize

DCT coefficients; examples can be found in [59].

2.2.3 Hardware Implementations of VQ

This section reviews designs from various research groups that have specifically tar* gctcd hardware implementations of VQ. Early designers, because of the great degree of computational complexity, attacked lower rate applications such as speech cod­

ing [60,61]. These designs all used very high speed processing elements in a pipelined

architecture; however, building a real-time video-rale vector coders requires some­

what more acumen. The following architecture have been proposed to address this

problem.

Dczhgosha et al [62] proposed a design based on the MRVQ algorithm using a

block size of <1x4. The sampled mean is quantized with five bits and subtracted from each component of the image block. The sixteen residual values arc then scalar quantized into 17 non-uniform levels, before the residual block as a whole is vector quantized. Due to the extra quantization step, they refer to this method as the Mean

Quantized-Residual VQ (MQRVQ). The required codebook size was 7244 codevcctors, so the system was designed for a 13-bit (8192) codeword space, resulting in a bit rate of

1.125 bpp. The proposed architecture consists of 64 codebook/distortion/comparator

(CDC) chips, four middle-compare (MC) chips, and a final compare (FC) chip. Inter­ nally, the CDC chips are organized as parallel pipeline arrays capable of computing the absolute distortion measure for each of the 128 stored codewords. The resulting distortion measures pass through two stages of comparators to yield a 7-bit index and 23 a 6 -bit distortion measure. The outputs of the 64 CDC chips are sequenced through the set of four MC chips, which in turn feed the FC chip. The authors estimate that by digitizing only the active portion of the video image (90ns/pixcl), they can achieve operational performance by running the pipeline at 25MHz. The number of chips needed was approximated using a l/zm feature size.

Ramamoorthy ct al [63] propose a bit-serial systolic array utilizing MSVQ. For simulations they use an image block size of 4x4, and quantize this using 128 codevcc- tors in the first stage and 64 codcvcctors in the second stage, resulting in a bit rate of .8125 bpp. The bit-serial design is used to compute the mcan-squarc error and to generate the appropriate address over of period of 25 clock cycles, independent of codcbook size. Some of the functional blocks were designed and fabricated using

3/im CMOS technology. A single inner product processor (IPP) requires a layout area of approximately 62mm3; a codcbook size of 128 codewords and a vector size of 64 would require 256 such IPPs. Quoting the authors: “The VQ encoder system with the described architecture is well suited for wafcr-scalc interconnection and packaging.”

Panchanalhan and Goldberg [64] propose a novel CAM-based architecture for VQ image encoding. Using an exact-match CAM equipped with a mask register, they describe a serial procedure called gating that will produce the closest stored word to the input word even when an exact match does not exist. To implement a vector based search, they use the gating procedure and a number of CAM chips equal to the vector dimension to And the vector with the minimum Loo-norm. Unfortunately, to make up for extra processing time allotted to the serial nature of the gating procedure, 24 they propose to invert the roles of the image and the codcbook by storing the image in the GAM and matching the codcbook to the image. Though this design shows innovation and deserves some merit, it is not a practical implementation for real-time video image encoding.

Shcu cl a! [65, 6 6 ], as well as Tuttle ct al [67], have both proposed analog imple­ mentations of a real-time image encoder. The prototype chip from Shcu ct al holds

64 5x5 codewords in an area of 4.6mm x 6 .8 mm and quantizes one pattern every

500ns. The chip from Tbttlc ct al contains 256 4x4 codewords on a die size of 6 .8 mm x 6.9mm and operates at 5MIIz. Both use compact circuits to compute the Euclidean distance between the input vector and the stored vectors, but comparisons were han­ dled differently. Shcu's chip used a multi-input analog WTA circuit, whereas Tuttle’s chip had a binary comparison tree structure. Though both groups originally claim to have reached adequate precision in their analog hardware, neither set of researchers obtained results that quite matched simulations.

2.3 Summary

Much work has been accomplished in research areas relevant to AM design and VQ; however, with recent exceptions, most of this cfTort has not been directly applicable to real-time hardware image encoding structures. A number of the papers have presented proposals for hardware architectures, but few have actually fabricated their designs, and to date none have conclusively demonstrated operational behavior. The work described in the remainder of this document, namely the development of a full-search architecture for video-rate data, represents original work in the field of VQ image 25 compression. The proposed design must be able to search a 256-word codcbook at a rate of 3.57 Mvps. Each vector represents four 8 -bit difference components, for a total data width of thirty-two bits. CHAPTER III

Design Considerations

The associative memory device described in the preceding chapters may be realized with a variety of circuit designs; however, some designs arc more amenable to hard* ware implementations than others. Both a bottom-up and top-down approach were used in the planning stages of the VAMPIRE chip. Because the overall design needed to be highly integrated, individual circuits were laid out in a compact, cellular fash­ ion that offered high performance. A good low-level design was essential to help insure system-level operation. This chapter explores different design considerations and algorithmic issues to help determine which features combine to produce the most reliable, efficient, and usilicon-rcadyn layout. References to specific schematic details are reserved for Chapter IV.

3.1 Analog vs. Digital

Digital signal processing (DSP) has steadily overtaken analog processing since the advent of the digital computer. DSP offers a pre-defined amount of precision and an ability to reliably store data at the expense of increased power consumption, lim­ itations on operation speed, and overall less efficient use of silicon. As circuits are designed with faster clock speeds using smaller areas, however, the disadvantages of

26 digital implementations diminish with respect to their inherent advantages. Addi­ tionally, the number of operations that cannot be implemented with digital circuitry continually grows smaller.

In the planning stages of the VAMPIRE chip, many of these design issues were con­ sidered. First of all, the distortion measure calculation could be implemented much more compactly and at higher speeds using analog circuit techniques. Furthermore, the development of multiple-input analog winncr-takc-all (WTA) circuits such as the ones in Figures 4 and 5, allow for the smallest distance value to be selected. Given a compact enough implementation, the entire codcbook could be fabricated on a single chip. No doubt these were the thoughts of researchers that set out to build such an analog chip [ 6 G, 67, 6 8 ].

Of course, the limiting factor on the operation of an analog implementation is the precision. The codcvcctors, which must be stored digitally, would each be associated with a simple D/A. The transistor parameters of the D/A circuitry would have to be well-matched across the entire chip in order to obtain consistent performance.

The WTA circuits also rely heavily on matched transistor pairs. These problems are accentuated if the entire memory cannot be placed on a single die. The transfer of analog information between chips would present design difllcultics, and matdiing transistors on separate chips would involve a fairly elaborate compensation scheme, if it is even possible.

Successful data compression/decompression depends on consistent performance from both the encoder and decoder. Bit errors arc magnified since the information 28 content per bit is so much higher in a compressed data stream. I felt that the lim* itations involved with an analog AM were too great to justify its implementation.

A digital design would ofTer a definite amount of precision, more direct methods of layout verification, easier testing, and a high degree of expandability. Additionally, a digital AM design such as the VAM PIRE chip had never been successfully implemented.

For these reasons, I chose a fully digital format.

3.2 Distortion Calculation

The stored codewords and input data all have a vector format. The eventual goal of the AM is to determine which of the stored words is most like the input vector.

Stated another way, if d(>) is some distortion measure then the codeword Ci chosen to represent a particular input vector / is the one that minimizes d(C*t I).

Seemingly, the most logical method of measuring distortion between two vectors is to calculate the distance that separates them. The distance between two vectors

A and B in a Euclidean space is given as:

de = ||/ 1 - B\\ = D « (3.1) 1=0 where ds is the Euclidean distance, A = (0 1 , 0 3 ,...= (6 1 ,6 3 ,...,6 *), and k is the dimension of the vector space. Though the Euclidean distance represents the

“true” measure of geometric space separating two vectors, it is by no means the only method used for evaluating vector distances. Given the same vectors A and J3, a more general distortion metric can be formulated using the L r family of equations Larger values of r tend to give increasing emphasis to the larger component differences.

The La-norm is equivalent to Euclidean distance; however, the L\ and Loo-norms arc popular distortion measures as well [57, 64]. Setting r = oo represents the special ease in which the largest component difference becomes the distortion value itself.

The primary reason for seeking a distortion measure other than the "logical *1 choice is that computing the Euclidean distance involves multiplications, which do not map cflicicntly into silicon. Implemented serially, multiplications require excessive clock cycles; implemented asynchronously, they occupy a prohibitively large amount of sili­ con. The L\ and LM-norms, coinciding with their popularity, share the characteristic of being the only two measures in the entire Lr family that do not require expensive multiply operations. The Lj-norm is also known as the Minkowski meter, absolute distance, cilyblock measure, or Manhattan distance. The last two names refer to the characteristic of the Li-norm of only being allowed to move along the edges of the grid space defined by discrete axis points; in two dimensions, this grid resembles the layout of a city.

Both of these norms represent reasonable approximations to the Euclidean dis­ tance, with the error decreasing as the difference vector approaches one of the basis axes. In the case where only a single component differs between the vectors A and

B above, L\ and are identical to La. The following equation illustrates the 30 relationship among the three measures:

Loo < L 2 < L x (3.3)

Since the Z>M*norm is comprised only of a single vector component, clearly it can never be larger than the vector norm. The absolute distance, on the other hand, represents an alternative path from the origin to the vector end point. Since the Lj- norm represents the straight line distance to the same point, the cityblock measure must always be larger.

The graph in Figure 8 shows the amount of error incurred by using either the L x or Lt,o-norm instead of the Euclidean distance in a two-dimensional vector space. The x-axis represents 0 , the angle of the difference vector from either basis axis, expressed as fractions of pi radians. In the graph, G varies from 0 to £, which is sufficient to cover every possible case in two dimensions. The y-axis represents the percent error of the respective norm versus Euclidean distance. The third and fourth curves on the plot show the performance of two composite signals; the first composite is made up of an average of the L\ and Xoo-norms, while the second is composed of a weighted average (L x + 2 • loo) of these two signals. Since the L\ and Z/oo-mcasurcs always fall on opposite sides of the Lj-norm, the composite values are much more accurate in general. Table 1 summarizes the results shown in the graph which were generated with a simple computer program. In two dimensions it is relatively straightforward to verify these findings; however, as the vector dimension grows, analytically solving for the maximum composite error, for example, is not a trivial task. Figure Figure Percent Error 40.00 20.00 10.00 25.00 30.00 35.00 15.00 5.00 8 : Graph of percent error (from ij*norm) versus angle 0 from basis axis. basis from 0 angle versus ij*norm) (from error percent of Graph : 0.05 r ntError s eta h T vs. r o r r E t en erc P Theta (Radians) LI Norm LI Wgtd Composite LinfNonm 0.20 Composite .5 xPi 0.25 31 32

Tabic 1: Summary of error versus the // 2-norm.

Method Average Maximum Used % Error % Error Z/t*norm 27.26 41.42 Lao-norm 1 0 .0 2 29.29 Composite 8.62 11.80 Weighted Composite 3.43 5.72

The main disadvantage to using a distortion measure that docs not closely follow the Euclidean distance is that the computed image quality is reduced. The primary method for evaluating the fidelity of a reconstructed video image is the normalized mcan-squarc error (NMSE), given by:

NMSE = S/g-ZS)2, (3 .4 ) where x,- represents an image pixel, and £,• represents the reconstructed image pixel after decompression. Because the image fidelity criterion is based on an Li type of calculation, using a similar measure to select quantization vectors in the encoding process will minimize the resulting NMSE.

Though the NMSE is a commonly computed indicator of fidelity, it Jb a well-known fact among video compression researchers that this error figure is not necessarily a true measure of perceived image quality [69]. The human visual system (IiVS) processes visual information with far more complexity than any single fidelity computation [18].

The context within which an error occurs strongly influences how noticeable the distortion is to a human observer. Incongruences along sharply contrasting transitions 33 or among random patterns cannot be detected as easily as errors within a smooth field [70]. Many researchers have attempted to quantify errors in a more human fashion with little success. Consequently, the NMSE is used almost universally to provide a consistent, relatively robust fidelity measure that implies an overall image quality.

At the time of design, the only information available for the vector DPCM system that considered any alternative distortion measure to Euclidean distance involved the

Zi-norm [71]. Simulations showed that no visible degradation could be detected as a result of using the £i-norm in lieu of the Euclidean distance. The fact that other researchers had also used the cityblock measure lended credence to its validity. It can be implemented as a series of subtractions and additions, which arc relatively straightforward to layout. Thus, this was the measure I chose to implement in the design of the VAM PIRE chip. As implied above, using the absolute distance will result in an image with higher NMSE, but the simulations confirmed that these errors will not be visible. Computing the Zi-norm can be broken down into three sets of functions.

1. Subtract each input component from the corresponding stored component.

2. Convert the differences to absolute values.

3. Sum the absolute values.

Figure 9 graphically illustrates these steps.

Figure 9: The absolute distance (L1-norm) calculation broken down into essential components.
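A behavioral sketch of these three sets of functions, organized the same way as Figure 9 (pairwise component sums followed by a final sum), is shown below; it is illustrative Python, not a description of the chip's circuitry.

    def l1_distance(input_vec, stored_vec):
        # Step 1: subtract each input component from the corresponding stored component
        diffs = [s - i for s, i in zip(stored_vec, input_vec)]
        # Step 2: convert the differences to absolute values
        abs_diffs = [abs(d) for d in diffs]
        # Step 3: sum the absolute values (two component sums, then the final sum)
        cs_12 = abs_diffs[0] + abs_diffs[1]
        cs_34 = abs_diffs[2] + abs_diffs[3]
        return cs_12 + cs_34          # final distance metric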

3.3 Absolute Value

As is readily evident from Figure 9, one of the primary operations within the distortion computation is the absolute value. In this implementation of the AM, all four components of both the stored codewords and the input vectors are 8-bit positive integers. In order to perform the correct association, the absolute value of the difference between the input word and each stored word needs to be calculated. First I will review how this calculation would be accomplished using 1's complement arithmetic.

Mathematically, the 1's complement of an 8-bit positive integer is 255 minus the integer; digitally, it is a bitwise inversion. Let I_i represent the ith component of an input vector, and C_i^j represent the ith component of the jth stored vector. The sum of this input component with the 1's complement of the stored component gives I_i + (255 - C_i^j), which can be rewritten as

d = 255 - (C_i^j - I_i).  (3.5)

There are two cases to consider when evaluating this expression: (a) either C_i^j > I_i, or (b) I_i > C_i^j. In the first case, where C_i^j is larger, the latter part of Equation 3.5 is positive, and the expression as a whole is a positive number less than 255. To retrieve the absolute difference, simply invert the whole expression once more to obtain 255 - (255 - (C_i^j - I_i)) = C_i^j - I_i. In the second case, rewrite the equation for d as:

d = 255 + (I_i - C_i^j).  (3.6)

Since I_i > C_i^j, the expression in Equation 3.6 will be larger than 255 but less than 511. By adding one to this result and discarding the MSB, the effect of the 255 term is nullified (because the eight LSBs of 256 are zero), resulting in the difference I_i - C_i^j.

In summary, the 1's complement method of determining the absolute difference of an input component I_i and a stored vector component C_i^j is outlined in the following steps: (1) invert one of the values and add it to the other, (2) if there is no output carry from the MSB (case a), invert the result, (3) otherwise (case b), add one to the result and discard the ninth bit. Bit-wise inversion (the 1's complement) is an extremely simple circuit procedure; however, adding one to a result is not nearly as straightforward. In the worst case, we must wait until the output carry settles, and then add one to the final result.
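These steps can be checked with a few lines of Python; the sketch below models the 1's complement procedure for 8-bit operands (the function name and bit-masking style are my own, chosen only for illustration).

    def abs_diff_ones_complement(inp, stored, bits=8):
        mask = (1 << bits) - 1
        s = inp + (~stored & mask)    # step 1: add the input to the 1's complement of the stored value
        if s >> bits:                 # output carry from the MSB: case (b), input > stored
            return (s + 1) & mask     # step 3: add one and discard the ninth bit
        return ~s & mask              # step 2: no carry (case a), invert the result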

A different way to calculate |C_i^j - I_i| is to first determine which value is greater, C_i^j or I_i. Invert the larger and add this value to the smaller. Finally, invert the result. For example, suppose the stored component is 101 and the input component is 011.

(i) Original problem: |011 - 101|
(ii) Complement the larger and add: 011 + 010 = 101
(iii) Complement the result: 101 → 010

Thus, the absolute value of 3 minus 5 is 2. The main drawback to this implementation is the fact that it is necessary to know which integer is larger. Nonetheless, this is the method I chose to implement the absolute value function. Simulations show that the time required to first compute which value is larger is roughly the same as that used by the addition of one, as in case (b) above; however, the amount of layout area was significantly reduced by implementing the unconventional approach. The circuit details can be found in the next chapter.
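A corresponding sketch of this alternative method (determine the larger value first, complement it, add, then complement the sum) is given below; the final assertion checks it against Python's built-in abs() for all pairs of 8-bit values. Again, the helper name is illustrative.

    def abs_diff_invert_larger(inp, stored, bits=8):
        mask = (1 << bits) - 1
        smaller, larger = (inp, stored) if stored >= inp else (stored, inp)
        partial = (smaller + (~larger & mask)) & mask   # complement the larger and add
        return ~partial & mask                          # complement the result

    assert all(abs_diff_invert_larger(a, b) == abs(a - b)
               for a in range(256) for b in range(256))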

Table 2: Truth table for forming the absolute difference.

Stored Bit (STR)  Input Bit (INP)  GT  Cin |  X  DIFF  Cout  XOR
        0                0          0   0  |  0    0     0    0
        0                0          0   1  |  0    1     1    0
        0                0          1   0  |  1    0     0    0
        0                0          1   1  |  1    1     1    0
        0                1          0   0  |  1    1     1    1
        0                1          0   1  |  1    0     1    1
        0                1          1   0  |  0    1     0    1
        0                1          1   1  |  0    0     0    1
        1                0          0   0  |  0    1     0    1
        1                0          0   1  |  0    0     0    1
        1                0          1   0  |  1    1     1    1
        1                0          1   1  |  1    0     1    1
        1                1          0   0  |  0    0     0    0
        1                1          0   1  |  0    1     1    0
        1                1          1   0  |  1    0     0    0
        1                1          1   1  |  1    1     1    0

Table 2 shows the truth table used to generate the absolute difference function. The column labeled GT (for Greater-Than) indicates which vector component is larger in the following manner:

GT = \begin{cases} 1 & \text{if input} > \text{stored} \\ 0 & \text{otherwise} \end{cases} \qquad (3.7)

The column labeled Cin in Table 2 represents the incoming carry bit of the sum shown in step ii of the above example. Likewise, the column labeled Cout represents the outgoing carry bit of the same sum. The DIFF entries of Table 2 represent the final absolute difference bits of step iii. The two other signals in Table 2 help to simplify the implementation of the necessary signals: XOR = Stored Bit ⊕ Input Bit, and X = Input Bit ⊕ GT. Using these signals, DIFF = Cin ⊕ XOR and Cout = Cin·¬XOR + X·XOR.
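The bit-slice relations above can be exercised in software. The sketch below is a behavioral reconstruction, not the circuit netlist: it takes X as the input bit XORed with GT (an assumption, chosen for consistency with the Cout expression above), ripples the carry from the LSB upward, and produces |stored - input| for 8-bit operands.

    def abs_diff_bitslice(stored, inp, bits=8):
        gt = 1 if inp > stored else 0        # Equation 3.7
        cin, diff = 0, 0
        for i in range(bits):                # LSB toward MSB
            s, p = (stored >> i) & 1, (inp >> i) & 1
            xor = s ^ p                      # XOR = stored bit ^ input bit
            x = p ^ gt                       # assumed: X = input bit ^ GT
            d = cin ^ xor                    # DIFF = Cin xor XOR
            cout = (cin & (xor ^ 1)) | (x & xor)   # Cout = Cin*not(XOR) + X*XOR
            diff |= d << i
            cin = cout
        return diff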

3.4 Parallel vs. Serial

Another issue that had to be resolved prior to the design of a hardware vector coder addressed the question, "What degree of parallelism should be implemented?" In particular, this refers to the amount of concurrency needed to compute the distortion measures and select the winning vectors. I hinted in an earlier chapter that an SISD approach cannot be reasonably implemented with today's technology; the following section attempts to justify this assertion.

Figure 9 explicitly illustrates the eleven functions needed to compute a single distortion value. There are four subtractions, four absolute values, and the remaining three functions are additions. If we assume that a general purpose (GP), serial processor can implement any one of these functions during a single instruction (block) cycle, then the calculation of 256 distortion values involves 11 × 256 = 2816 processing steps.

At an input rate of one vector every 280 ns, the SISD processor must execute over 10 billion (giga-) instructions per second (GIPS), which is approximately 200-500 times faster than today's readily available microprocessors. This does not even take into consideration the amount of time necessary to find the global minimum (256 steps) or fetch codewords from memory (an additional 256 steps); the extra steps require a processor that can run at almost 12 GIPS.

The most obvious method for achieving a 200-times speed-up is to attack the problem with a number of processors equal to the number of codewords, which in this case is 256. Now, given the same general-purpose processor as in the previous example, each one is responsible for processing only eleven steps concurrent with the other 255 computing elements. The operation speed of each microprocessor is reduced to 39.3 MIPS (11/280 ns), which is within the range of available microprocessors; furthermore, the computation time is now independent of codebook size. Though this "solution" is in theory a workable one, it involves a gross misuse of processing capabilities. The GP serial processor was designed to perform such a wide array of instructions that merely using it for additions and subtractions would be a waste of power and silicon. Additionally, interconnecting the separate microprocessors would prove to be a non-trivial task in its own right. Clearly, a more practical solution can be effected by custom designing the distributed processing elements such that they only compute the required calculations and that many processors reside on a single chip.
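The throughput figures quoted above follow from simple arithmetic; the snippet below merely re-derives them and is not part of any design tool.

    steps_per_distance = 11              # 4 subtractions + 4 absolute values + 3 additions
    codewords = 256
    vector_period = 280e-9               # one input vector every 280 ns

    serial_steps = steps_per_distance * codewords            # 2816 steps
    print(serial_steps / vector_period / 1e9, "GIPS for a single serial processor")

    extra_steps = 2 * codewords                               # minimum search plus codeword fetches
    print((serial_steps + extra_steps) / vector_period / 1e9, "GIPS including the extra steps")

    print(steps_per_distance / vector_period / 1e6, "MIPS per processor with 256 processors")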

Given that we've decided to build a set of application-specific integrated circuits (ASICs) containing a total of N = 256 distributed processors in order to implement the VQ encoder, there is still another degree of concurrency to consider. The first paradigm described will be referred to as a "serial," "clocked," or "synchronous" approach; the second design will be labeled a "parallel" or an "asynchronous" paradigm. Note that throughout this discussion I only consider the distortion measure calculation; the winner selection function will be discussed in a following section.

The clocked approach breaks down the eleven computations into a series of four steps. Illustrated in Figure 10, the synchronous design incorporates a difference circuit, an absolute value circuit, a summing accumulator, and some overhead circuitry at each codeword. Each of the four stored vector components is sequenced into the computing circuitry along with the corresponding input component. The subtraction, absolute value, and addition are all performed in a single clock cycle while the accumulator (initially cleared) holds the intermediate values of the distortion measure. After four clock cycles the processor will have computed the final distance value, which will be available from the accumulator.

Figure 10: A synchronous processing element design.
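A behavioral sketch of the clocked processing element is given below: a single subtract/absolute-value/add datapath is reused over four clock cycles while the accumulator holds the partial distortion. This is illustrative Python, not the state machine itself.

    def clocked_distance(stored, inputs):
        accumulator = 0                       # cleared before the first cycle
        for cycle in range(4):                # one vector component per clock cycle
            diff = stored[cycle] - inputs[cycle]
            accumulator += abs(diff)          # subtract, absolute value, and add in one cycle
        return accumulator                    # final distance available after four cycles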

The parallel approach maps each of the eleven computation steps onto a distinct piece of silicon circuitry. Since no circuits are re-used, the entire processing element can be set up as a feed-forward, asynchronous design (see Figure 11). Although this approach cannot be considered truly parallel, since certain values will not be valid until the logic levels of preceding circuitry first settle, the design is essentially a single, large combinational circuit.

Figure 11: An asynchronous processing element design.

The synchronous design has the advantage that only SRAM cell area scales with the vector dimension k. The asynchronous design must incorporate a corresponding number of adders and absolute value circuits as k gets larger. Furthermore, the parallel design provides a fixed architecture; a clever synchronous design could offer a user-selectable vector dimension. For these reasons, and the fact that the pin count is limited, the synchronous design is the only practical choice for large k.

On the other hand, the asynchronous approach makes for a faster, easier-to-operate chip. Design simulation and chip testing/operation are much more convenient without the added complication of a state machine. The synchronous design must be clocked through multiple states before the status of its operation can be ascertained, which makes debugging and testing much more difficult. Also, unless the last stage of the distortion measure computation and the entire winner-selection process can settle within a single instruction cycle, the clock driving a synchronous chip must operate at a faster rate than the arriving pixel rate. This implies that buffering circuitry and additional glue logic must accompany the sequentially operated design. Alternatively, the final distortion value can be pipelined into a register so that the next distance computation can begin; this incurs a one-pixel latency delay.

Because the vector dimension for the targeted system was small (k = 4), I opted to implement the asynchronous design, based on its ease of operation and testing advantages. In Chapter VIII, many of the issues raised here will be discussed further.

3.5 Carry Propagation

Because of the large number of adders to be incorporated into the design, the timing characteristics and circuit area allotted to the full adder (FA), and to the carry propagation in particular, become an important issue. A "typical" ripple-carry full adder is shown in Figure 12. The numerals drawn inside each of the gates indicate the number of transistors required in full complementary MOS logic. The carry circuitry alone accounts for 4 gates (18 transistors) and entails two gate delays. In order to make more efficient use of silicon and achieve a slight speed-up in operation time, a Manchester-type carry circuit was implemented (see the schematics in the next chapter). The basic principle of a Manchester design is to categorize the output carry condition at each bit location, without knowledge of the incoming carry bit. The output carry condition can be classified into one of three categories: generate, kill, or propagate. The generate mode indicates that both bits to be summed are 1's, and thus an output carry should be generated regardless of the input carry. Similarly, when the two bits to be summed are 0's, the output carry should be killed (set to 0). The propagate mode indicates that the input bits differ from one another, and the output carry is simply set equal to the input carry.
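The generate/kill/propagate classification can be captured in a short behavioral model; the sketch below ripples a carry through an 8-bit addition using exactly these three cases. It models the principle only, not the transistor-level Manchester chain.

    def manchester_add(a, b, bits=8):
        carry, total = 0, 0
        for i in range(bits):
            x, y = (a >> i) & 1, (b >> i) & 1
            generate = x & y              # both ones: force the outgoing carry high
            kill = (x | y) ^ 1            # both zeros: force the outgoing carry low
            propagate = x ^ y             # bits differ: pass the incoming carry through
            total |= (propagate ^ carry) << i
            if generate:
                carry = 1
            elif kill:
                carry = 0
            # otherwise (propagate) the carry is left unchanged
        return total | (carry << bits)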

Figure 12: Circuit schematic for a traditional full-adder cell.

The proposed carry architecture can be implemented in a compact, bit-slice manner, making for a modular, efficient layout. Many other types of advanced carry circuits, such as look-ahead or multi-path designs, involve a prohibitive number of gates to justify using for the VAMPIRE chip. Additionally (no pun intended), the circuitry for finding the final sum will undergo modifications so that it can be implemented in a more compact design than that shown in Figure 12.

3.6 Winner Selection

Once the distortion measures have been computed, the next step of a VQ process involves selecting the minimum of these distances, many of which will be contained on different chips. This winner selection process is one of the most important design considerations of the entire chip implementation. The logistics of comparing N 10-bit digital values in a compact, efficient manner has eluded many previous designers. Often this is the bottleneck of the whole process.

For example, one paradigm for performing this N-input comparison is to set up a log2 N-level binary tree structure. This is the approach taken by Dezghosha [62]. Though some of the comparison tree can be integrated into each memory chip, inevitably a multiple-chip set requires that some of the comparisons take place outside the memory ASIC. A number of middle compare (MC) chips and a final compare (FC) chip are used to sequence through the outputs of the 64 memory chips. The entire structure is cumbersome to implement.

A more elegant solution consists of a bit-wise comparison of final codeword distances by means of wired-NOR circuitry. The basic idea is to examine the MSBs of all the distance measures simultaneously. If at least one of the codewords has a zero in the MSB of its corresponding distance metric, then all the codewords with a one in the same bit position can be eliminated from further contention. If every distance measure has a one in the MSB, then all of the codewords are still considered "valid." This process is repeated for all remaining codewords at each less significant bit position, until ultimately the LSB is reached. By this point only one codeword should remain as a valid, or "winning," codeword. Communication among the different memory locations takes place along a single wire, called a COMPARE line, for each bit location. In the process described above, valid distances having a zero at a given position drive the COMPARE line low. When none of the contending codewords has a zero in the final distance measure at a particular bit position, the respective COMPARE line floats high via an internal pull-up transistor. In this manner, a wired-NOR function is achieved (see the schematic in Chapter IV for details).
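The following sketch mimics this bit-serial elimination in software: one pass from the MSB down to the LSB, with the COMPARE line modeled as "pulled low if any still-valid codeword has a zero at that bit." It is a behavioral illustration of the selection rule, not of the wired-NOR circuit itself.

    def select_winners(distances, bits=10):
        valid = list(range(len(distances)))          # every codeword starts as a contender
        for bit in range(bits - 1, -1, -1):          # MSB down to LSB
            # COMPARE is driven low if any valid codeword has a 0 at this bit position
            compare_low = any(((distances[i] >> bit) & 1) == 0 for i in valid)
            if compare_low:
                # codewords with a 1 at this position are eliminated from contention
                valid = [i for i in valid if ((distances[i] >> bit) & 1) == 0]
        return valid                                 # survivors share the minimum distance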

This approach works well for comparing codewords that are on the same chip, but further thought is required to make this design operate across chip boundaries. The problem arises from having to quickly transmit information about the state of one AM chip to all the other chips, while simultaneously receiving feedback from those chips. For example, one approach involves merely adding an analog pad for each COMPARE line. Connecting COMPARE lines from other chips together would extend the wired-NOR structure of an individual chip to include all chips. Unfortunately, this design is akin to adding a very large capacitor (associated with the IC pin and other external connections) onto each COMPARE line. This capacitance is many times greater than the capacitances encountered on-chip. The same pull-down transistor that was adequate to drive an integrated/monolithic COMPARE line may not be able to handle an extension of this line to off-chip components, which would decisively degrade chip performance. As the number of chips grows, the problem is exacerbated.

Another example uses separate, fully digital input and output pads to convey the COMPARE line information. In this example, the architecture, which is depicted in Figure 13, illustrates an inappropriate use of feedback. The figure shows one of the internal COMPARE lines from one AM chip feeding an external, multi-input NAND gate. The other inputs to this NAND gate come from the corresponding COMPARE lines of other chips. In the original design, the NAND was intended to relay information about the state of a particular bit location in the following manner: if any of the COMPARE lines from any of the AM chips is low, then the NAND will output a high, turning on the indicated pull-down transistor of all other chips; if the COMPARE lines from all the AM chips are high, then the NAND output will be low, and the state of the internal transistors will be left unaffected. The problem, of course, is that the loop linking the NAND output to the pull-down transistor back out to the NAND input is a positive feedback loop. Once a COMPARE line is low, the corresponding NAND output will be high which, in turn, ensures that the COMPARE line will remain low. The circuit will latch into a "zero state." The only way to make such a scheme work is to insert timing circuitry that can break the feedback loop until all of the COMPARE lines have settled into the proper state.

The ideal solution seems to be one that duplicates the internal wired-NOR structure at the output pin level, essentially creating two levels of wired-NOR. The pin-level circuitry would be used to select from among valid chips, rather than from among individual codewords. This approach isolates the on-chip circuitry from the more capacitive off-chip components while still retaining the wired-NOR topology. Furthermore, it provides a convenient method of connecting chips without prioritizing the order of those chips. Architectural details and circuit schematics can be found in Chapter IV.

Figure 13: An unsuccessful example of conveying COMPARE line information across chip boundaries.

3.7 VQ Codebook Design

One last design issue which deserves consideration is the VQ codebook which will eventually be loaded into the VAMPIRE chip. Many of the sub-optimal VQ techniques discussed in Chapter II required a specially adapted codebook so that the particular VQ algorithm (e.g., tree-searched or trellis-coded) functioned correctly. As was mentioned, these techniques preclude the use of certain general codeword sets which may be typical of optimal image compression codebooks. One major advantage of the VAMPIRE chip is that it was designed as a general-purpose, full-search associative processor; there are absolutely no restrictions on the codebook structure.

Given that any codebook can be used in conjunction with the VAMPIRE chip, a logical choice would simply be to use the optimal codebook as determined with the LBG algorithm. Since the technique to generate an LBG codebook is relatively straightforward, the problem was not considered during this dissertation endeavor.

However, another research group at Ohio State University has developed a far less computationally intensive algorithm, termed frequency sensitive competitive learning (FSCL), which very closely approximates the optimal algorithm [72]. There is considerable interest in developing such an algorithm, as it lends itself much more easily to adaptive codebook modifications. Thus, the codebook generated with the FSCL algorithm was the one eventually used in the VAMPIRE chip for real-time image coding. At this point in time, no provisions have been made on-chip that would allow the AM to adaptively update the codebook.

3.8 Summary

This section restates the major design issues which have been decided upon. The specific implementation of these issues is considered in the next chapter.

• A digital approach will be taken, as this allows a definite amount of precision, and (to date) no digital implementation of a video-rate VQ encoder has been successfully completed.

• The distortion measure used will be the L1-norm. Though closer approximations to the Euclidean distance exist, the L1-norm can be implemented in a straightforward manner, and simulation results reveal no noticeable degradation as compared with the L2-norm.

• The absolute value circuitry will differ from the traditional one's or two's complement circuitry by first determining which value is larger. This result can be implemented in a more compact design for no loss of performance.

• The entire memory will operate in an asynchronous fashion. This approach allows straightforward testing and operation.

• Again, carry propagation will differ from the traditional design so that a slightly faster, more compact method can be implemented.

• The winner selection design may be the most crucial of all design aspects. Here a simple, elegant wired-NOR solution is offered which is not interconnect intensive.

• The codebook will be generated by another research group. The structure of the VAMPIRE chip is such that any codebook can be loaded.

CHAPTER IV

VAMPIRE Architecture

4.1 Structural Overview

This chapter reveals in great detail the internal structure of the VAMPIRE associative memory chip. The description will be organized hierarchically in a top-down approach. For the following discussion, I assume that the chip will operate under the conditions described in the Problem Statement (Section 1.3), and that it is designed to meet the specific criteria set forth in Chapter III. As Figure 14 shows, there are four major components that make up the VAMPIRE chip: the pad frame, an address decoder, I/O circuitry, and several word-units. The word-units, which comprise the majority of the VAMPIRE chip as shown in Figure 14, are responsible for the storage of each codeword and the computations associated with winner selection. There is a one-to-one correspondence between the vector storage capacity of the VAMPIRE chip and the number of word-units. Since it is this block that contains the most interesting circuitry, the majority of this chapter will be devoted to its description. The decoder, the I/O circuitry, and the pad frame will be explained in later sections.

Figure 14: The general structure of the VAMPIRE chip.

Just as the chip could be subdivided into several major components, each word-unit can be further broken down (see Figure 15) into four types of cells: an overflow cell, the computation cell (CC), the right-end cell, and a priority encoder (PE) or multiple-response resolver (MRR). The short solid lines that extend from each cell in Figure 15 represent paths that facilitate the flow of data between adjacent cells and/or word-units. There are eight computation cells in a word-unit, corresponding to the number of bits in one component of a vector. The computation cells are arranged left to right from the most significant bit (MSB) to the least significant bit (LSB). The computation cell is so named since it performs the brunt of the calculations of the memory. The overflow cell processes carry bits that spill over from the MSB computation cell, and it initializes signals that originate at the MSB of the stored vector. The right-end cell resides at the LSB (right end) of the CCs; this cell buffers signals going into the priority encoder and initializes signals that propagate from the LSB of the stored vector. Finally, the priority encoder or MRR cell interprets winner selection signals from the computation cells and places the appropriate label on the output bus. The number of MRR/ROM cells in a word-unit equals log2 N, where N is the number of codewords on a chip. Figure 16 gives a more detailed floorplan of the VAMPIRE chip, combining Figures 14 and 15.

Figure 15: Internal structure of a word-unit.

Figure 16: Detailed floorplan of the VAMPIRE chip.

4.2 Computation Cell

The circuitry contained within the computation cell is by far the most complex of any of the four cells that comprise a word-unit. Almost all of the processing performed by the VAMPIRE chip takes place in the computation cells. Collectively, the computation cells are responsible for codeword storage, distance computation, and winner selection.

Figure 17: Structure of a single computation cell.

Figure 17 shows the general floorplan of the computation cell. The information flow in and out of the computation cell, which was hinted at in Figures 15 and 16, can be seen in detail in Figure 17. There are three types of signals that interact with the computation cell: input signals (represented by arrows pointing into the block), output signals (represented by arrows pointing out of the block), and tapped signals (having no arrow). The directed signals, such as carry propagation, are acted upon by the various components of a cell and may change state as they pass through; the tapped signals (e.g., WORD) are unaffected. Each region within the "bit-slice" architecture of the computation cell will be described in the following sub-sections.

4.2.1 Codeword Storage

Notice that each computation cell contains four bits of static RAM storage. These bits represent the same bit position from each of the four vector components. When eight computation cells are laid out side by side, the resulting architecture can hold four 8-bit components. Figure 18 gives an example to show how the bits are explicitly arranged. Figure 18(a) shows a sample codeword vector (127, 128, 63, 0), Figure 18(b) displays the binary representation of that vector, and in Figure 18(c) we can see how the bits are distributed throughout the computation cells.
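The arrangement in Figure 18 can be expressed as a small slicing routine: computation cell CC7 holds bit 7 of every component, CC6 holds bit 6, and so on down to CC0. The helper below is purely illustrative.

    def slice_into_cells(components, bits=8):
        cells = []
        for bit in range(bits - 1, -1, -1):          # CC7 (MSB) first, CC0 (LSB) last
            cells.append([(c >> bit) & 1 for c in components])
        return cells

    for idx, cell in enumerate(slice_into_cells([127, 128, 63, 0])):
        print(f"CC{7 - idx}: {cell}")                # four stored bits per computation cell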

The storage circuitry consists of a standard cross-coupled inverter structure with complementary pass transistors along the input and feedback paths (see Figure 19). When the WORD line is asserted, the feedback path (PMOS transistor) is broken, and the value on the input BIT line passes into the inverter pair. When the WORD line is low, the input path (NMOS transistor) is cut off, and the stored bit is allowed to regenerate through the feedback path. The WORD line is asserted only when the appropriate address decode line is high and the chip is in a mode that allows the address decoder to be enabled.

Figure 18: (a) A sample vector, (b) the binary representation of that vector, and (c) the bits of the vector as they are stored in the computation cells.

Figure 19: Static RAM schematic.

Note that the value previously stored in a bit location effectively has no impact on the storage of new information. It may take some extra time to charge the STR and STR-bar signals to the appropriate level if the new bit is different from the old bit; however, we do not have to be concerned with the ability of the input BIT line to override the existing state of the inverter pair. The reason for this is that the input BIT line shown in Figure 19 is connected only to the input of a single inverter. Notice the differences between this cell and the conventional static RAM cell shown in Figure 20.

In the standard cell, BIT and BIT-bar lines are used to force the inverter pair into the correct state. Both lines are needed because the feedback between the two inverters remains intact throughout the write process. Single-ended designs (having only one BIT line) generally are not used in conventional RAM cells because they don't allow a secure margin of operation. The RAM cell of Figure 19 does not have this problem since the feedback is removed during the write.

Figure 20: Schematic of a standard static RAM cell.

The main advantage of the new RAM cell design over the conventional static RAM cell is a savings in layout area stemming from the single-ended aspect. Each BIT line consists of a metal line four lambda (4λ) wide, separated from other BIT lines by an additional 4λ. The current design requires one BIT line per vector bit, or thirty-two BIT lines. Doubling the number of BIT lines would add 32 × 8λ = 256λ to the width of each word-unit. A second advantage of the new design is an added safety margin of circuit operation. Because there is no feedback from the storage cells while a bit is being written, it is possible to drive the BIT lines with minimum-sized transistors and still be relatively certain that the circuit will properly store the correct information.

The main drawback to the redesigned RAM cell is that it is not possible to directly read its contents. However, unlike the fabled Write-Only Memory, for which there exists no way to retrieve data once it is stored, the RAM cell's contents may be inferred via associative matching operations. Note that this is not a significant drawback, as there is no reason to use this chip as a conventional RAM chip. The random access feature exists merely as a convenience for storing codewords. Once the codeword information has been loaded, the VAMPIRE's sole objective is to identify which codeword matches closest (not what the contents of that stored word are). The only use a direct read may have is for debugging and failure analysis.

4.2.2 “Greater-Than” Circuit

In this section I will describe the "Greater-Than" block found in the computation cell. Recall from the discussion in Section 3.3 that each of the four components of the stored codewords and of the input vectors is an 8-bit positive integer in the range [0, 255]. In order to determine the absolute value of the distance between a stored codeword component and an input component, we must first determine which is larger. The block labeled "Greater-Than Circuitry" in Figure 17 contains (most of) the circuitry required to make this determination.

mentary pass transistor pairs terminating in aground connection at the LSB side and

an inverter on the MSB side. JuBt as the memory cells were distributed throughout

the bit-slicc architecture of the word-unit, the GT circuitry too is divided amongst

the CC’s. For each vector component, there exists a complementary transistor pair in each computation cell. Figure 22 clearly displays the four GT circuits as they appear

throughout a single word-unit. (Of course, the other circuitry within the CC has been

omitted.)

Figure 21: Schematic for generating the "Greater-Than" signal.

The GT circuit operates on the principle that the larger of two unsigned binary numbers can be determined by examining the bit positions that differ between the two (similar to the principles discussed in Section 3.6 on Winner Selection). The number that has a "1" in the most significant of these differing locations is the one that is larger. Referring again to Figure 21, the XOR signal represents the exclusive-OR between a stored bit and an input bit. At the first position in which the stored and input words differ, we know that XOR=1 (XOR-bar is "0") and that all the more significant XORs (i.e., to the left) have a low value. The effect of this is to create a conducting path from the input of the inverter, through one or more n-type transistors, through a single p-type transistor, to the appropriate stored bit. The circuit is structured so that the more significant bit locations are given precedence over bits with lower significance. Thus, the signal reaching the input side of the inverter is the stored bit from the first point at which XOR=1. If the stored word is identical to the input word, then all the n-channel transistors are on, and a "0" propagates to the inverter. It is clear that this circuit implements Equation 3.7, which has been redisplayed here:

GT = \begin{cases} 1 & \text{if input} > \text{stored} \\ 0 & \text{otherwise} \end{cases} \qquad (4.1)

Figure 22: The GT circuits as they appear across a single word-unit.
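A behavioral model of this principle is given below: scan from the MSB toward the LSB and report the input bit at the first position where the two words differ. It reproduces Equation 4.1 for 8-bit operands; the function name is mine.

    def greater_than(inp, stored, bits=8):
        for i in range(bits - 1, -1, -1):            # MSB toward LSB
            s, p = (stored >> i) & 1, (inp >> i) & 1
            if s != p:                               # first position where XOR = 1
                return p                             # 1 exactly when input > stored
        return 0                                     # identical words: GT = 0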

The drawbacks of this design are its low noise margin and a potentially long propagation time. The noise margins are small since signals may have to pass through both p- and n-type transistors, causing the voltage level that reaches the inverter to be a threshold voltage (Vt) drop from either power rail. However, the asynchronous design of the chip should help to reduce noise overall. The main advantage of the design is its compact cellular layout. The savings in area versus a traditional adder using one's or two's complement arithmetic is tremendous.

4.2.3 Absolute Difference Value

Figure 23 shows the schematic for generating both X and XOR. Recall from the discussion in Section 3.3 that these two signals are used to generate the Cout and DIFF for the absolute value circuit. The simple two-transistor structure found in both circuits provides the exclusive-OR (XOR) function. It is apparent from the figure that only one transistor is on at any given time (assuming a digital voltage on the gates). When the gate voltage is high, the n-channel transistor conducts the inverse of the second signal; when the gate voltage is low, the p-channel transistor conducts the second signal. Thus, when both voltages are the same, a low signal is passed to the output; when the signals are different, a high signal is passed. One drawback to using pass-transistor logic, as is done here, is that the output voltage may be one threshold voltage level (Vt) above the lower rail or below the upper rail. In other words, the output of the two-transistor exclusive-OR is not a fully restored signal. Care was taken in the design of these circuits that unrestored signals do not propagate another Vt drop.

Figure 23: Generating intermediate signals X and XOR.

Figure 24 shows the schematic for generating Cout and DIFF. The carry circuitry was designed based on a Manchester carry architecture. In a traditional Manchester carry design, there exist a GENERATE and a KILL signal. The GENERATE signal is asserted when the outgoing carry bit will be high regardless of the incoming carry bit (i.e., both bits to be summed are "1"). Likewise, the KILL signal is asserted when the outgoing carry bit will definitely be low (i.e., both bits to be summed are "0"). In all other situations, the outgoing carry equals the incoming carry: neither the KILL nor the GENERATE signal will be asserted, and the carry condition is said to be in a PROPAGATE mode. The schematic in Figure 24 incorporates these design principles into the absolute value function, in which one bit is added to the inverse of the other bit (because we are actually subtracting). For example, in a situation where the stored bit and the input bit are the same, XOR=0; yet, one of these bits will be inverted for the actual addition. Thus, the two bits to be summed will actually be different. When the two bits are different, neither a GENERATE nor a KILL signal will be asserted, and Cout will equal Cin. This feature is reflected in Figure 24; when XOR=0, Cin passes straight through to Cout. Likewise, when XOR=1, the value of the outgoing carry bit can be determined from the value of X, independently of Cin. Again, the value of X is determined by which signal is greater. Table 2 can be used in conjunction with Figure 24 to verify that the schematic performs the appropriate carry-out and difference functions. Figure 25 shows a schematic of all the circuit components discussed up to this point.

Figure 24: Generating the outgoing carry bit and the difference bit.

4.2.4 Component Sums

The last steps depicted in the algorithm of Figure 9 are the sums of the absolute difference values. The circuits that perform these component sums are not fundamentally different from the circuits just presented in the previous section. The first two sums are carried out asynchronously in parallel, followed by the final sum. The final sum represents the final distance metric. Figure 26 illustrates the schematic of the full adder which was used to compute these sums.

Figure 25: A schematic showing all the circuit components discussed to this point.

Figure 26: The full adder cell used in the VAMPIRE chip.

4.2.5 Global Compare Circuit

Once the final distortion measures have been computed, the VAMPIRE chip must select the smallest one, as this corresponds to the winning codeword. In keeping with the design philosophy of the chip, we would like the winner selection operation to be quick, area efficient, and asynchronous. It also must be a feedforward design, since the distortion measures will become valid at varied and unknown times.

The chosen design uses wired-NOR techniques to meet the requirements. Figure 27 shows one bit of the winner selection circuitry. There is one global compare circuit (GCC) cell for each bit of the final distortion measure per codeword. Thus, a word-unit consisting of 8-bit vector components will need ten GCCs (the distortion measure is a ten-bit number). Each computation cell contains one GCC, and there are two GCCs in the overflow cell. The function of the global compare circuitry is to simultaneously determine the smallest distance metric amongst all stored codewords. The PROPAGATE signal is set up to ripple from the most to the least significant bit of the final distance metric. As long as PROPAGATE is asserted, that particular word has not been eliminated from the winner selection process. The COMPARE line in Figure 27 is connected to the same bit position of every word, and it is also connected to a single pull-up resistor (see Figure 28). Aspects of comparing values across chip boundaries will be detailed in a later section.

Figure 27: One bit of the global compare circuitry.

It is clear that this circuit executes a global comparison like that described in Section 3.6. The following example illustrates this operation.

1. The PROPAGATE line of the MSB is high, or asserted. If any word has the MSB of the corresponding distance metric low (METRIC-bar is high), then the two consecutive NMOS pass transistors will be on, and the COMPARE line will be driven low.

2. If any word has the MSB of the corresponding distance metric high, and the COMPARE line has been driven low by some other word, then a "1" will propagate to the NOR gate, eliminating that word from any further interaction in the global compare function. Once the PROPAGATE line is un-asserted, it remains low for all the less significant bits.

3. As the PROPAGATE lines settle towards the LSBs, more and more words are eliminated. If at any point the COMPARE line of a given bit position remains high, then all words not eliminated from the compare must have a "1" at that bit position (else see step 1). In this case, no additional words are eliminated; the PROPAGATE line remains high at that position for all valid words.

4. Any word having its PROPAGATE line asserted coming out of the LSB is a winner. The voltage levels left on the COMPARE lines represent the winning distance metric. If there is more than one winner, their distance metrics must be equal.

Figure 28: A bit-slice of the on-chip winner selection circuitry.

Thus, the GCC is able to compactly select on-chip winners. The inter-chip circuitry needed to arbitrate comparisons across multiple chips will be described in Section 4.6.

4.3 Overflow Cell

The Overflow Cell resides at the MSB end of the computation cells and is responsible for the following tasks:

• Generating the GT signal.

• Summing the MSB carry bits.

• Determining whether the memory location is empty or not.

• Controlling the two MSB COMPARE lines and the associated PROPAGATE line.

Shown in Figure 29, the overflow cell contains a collection of circuits that cannot be distributed throughout the entire word-unit. These circuits serve to tie up the loose ends at the MSB side. For example, the inverters shown in Figures 21 and 22 responsible for generating the GT signal are contained here in the overflow cell. Also, the result of adding four 8-bit numbers together is a 10-bit number; the carry bits from the component sums are fed into the overflow cell for processing.

One major function of the overflow cell is to hold the state of the memory location in terms of being empty or full. When the chip is reset, the INTERNAL RESET signal goes high momentarily (see Figure 29). This causes the feedback of the cross-coupled inverter pair to be cut off, and a 0 is placed at the input of the SRAM bit. When the INTERNAL RESET signal returns to low, the 0 is held by the storage cell. Since the state of the cross-coupled inverter pair represents the most significant PROPAGATE signal, a RESET effectively eliminates all words from contention. The memory locations are termed empty.

The reset process is undone for a particular codeword when data is stored. During a store, the WORD line of the given location must be asserted. In the overflow cell, the effect of WORD going high is to load a 1 into the "reset register," which then allows that cell to participate in subsequent match processes.

Figure 29: Schematic of the Overflow Cell.

4.4 Right-End Cell

The right-end (RE) cell can be found next to the LSB computation cell. Aside from initializing all the input carries to ground, the RE cell also serves as a bridge between the PROPAGATE signals out of the computation cells and the priority encoder. Figure 30 shows several circuit components; three RE cells can be found in the lower left quadrant of the diagram. Each SELECT line driver cell contains a fraction of a large, multi-input NOR gate. The distributed NOR gate is only connected to the PROPAGATE lines on one side of the memory chip for address encoding purposes. If the NOR has a low output, then the winning codeword for that particular chip must be on the left-hand side (locations 0-15) of the chip. The SELECT line drivers buffer the PROPAGATE signals. This is necessary because the pass transistors of the MRR are very large (capacitive) and require a strong, differential signal.

4.5 Priority Encoder

Also shown in Figure 30 is the multiple-response resolver (MRR) or priority encoder (PE). A PE is needed because it is possible for multiple codewords to be equidistant from an input vector, meaning that there is more than one winner. For the AM described here, we would like the MRR to select one of these winners (i.e., prioritize the winners) and discard the rest. Given that one or more of the SELECT lines from Figure 30 is high, the PE must choose one of the codewords and deliver the proper address (or label) to the output bus. The design goal was to build a modular encoder so that the number of bits in the output label or the number of words on a chip could be changed without affecting the architecture of the individual MRR cells. The PE that was developed exhibits these characteristics.

Figure 30: Schematic of the right-end and priority encoder cells.

For each codeword, there is one MRR cell per LABEL line; generally, there will be just enough lines to uniquely specify each codevector within the memory. When a SELECT line is low (indicating that it is not a winning codeword), the vertical set of pass transistors turns on to allow the memory locations below it to drive the LABEL bus. The corresponding pass gate which links the ROM bit to the LABEL bus is turned off. Conversely, when a SELECT line is high, its ROM bit is allowed to drive the output, and the lower portion of the LABEL bus is cut off. Thus, if there are two winning codevectors on the same chip, the one that is below the other will not be able to affect the output. Priority has been given to codewords with smaller addresses. Though this assignment has been ordained somewhat arbitrarily, it should not adversely affect performance since winning codewords are equidistant to the input, and we assume that any winning codeword can be used with equal effectiveness.
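In software, the priority rule amounts to returning the smallest asserted address; the sketch below captures that behavior (it does not model the pass-transistor LABEL bus).

    def priority_encode(select_lines):
        for address, selected in enumerate(select_lines):
            if selected:
                return address       # codewords with smaller addresses take precedence
        return None                  # no SELECT line asserted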

Figure 14 suggests that the VAMPIRE is laid out as a 16-codeword memory that has been mirrored to create a definite left and right side. When the two contending codewords are on different sides of the same chip, we no longer have the case where one codeword is above the other. For these instances, the NOR gate from the RE cell can be used to choose one vector (actually, one side) over the other. This is accomplished via a multiplexor (MUX) at the top of the design; the particular set of four LABEL lines that will be passed on to the output pads (left or right) is determined by the NOR gate. The fifth output LABEL line is also derived from this signal. In the design described by Figure 30, the data written to the output bus has been hardwired into the chip. The ROM bit is simply a piece of metal routed to either power or ground. However, it would be quite conceivable to make the output LABEL programmable. This approach would use slightly more layout area and add more complexity to the design, but it would also give the operator more on-chip flexibility.

In some of the applications discussed in Chapter VII, the added flexibility could be of considerable value.

Now, there is one more level of priority encoding to resolve, which will be considered in the next section: the case in which two different AM chips contain winning codewords.

4.6 Inter-Chip Circuitry

There are two aspects of inter-chip communication that need to be considered: (1) finding the codeword with the winning distance over many chips, and (2) prioritizing the output when two different chips contain winning codewords.

The approach that will be taken was discussed in Section 3.6. Essentially, a chip-level wired-NOR circuit will be used to compare each chip's internal minimum distortion versus the overall minimum distance as broadcast between chips. One bit of the circuit to accomplish this is shown in Figure 31; it represents the block shown at the top of Figure 28. Basically, the circuit operates as follows:

Figure 31: Inter-chip winner selection circuitry (low-true logic).

• Every chip originally has its most significant CHIP_VALID_IN* line (the asterisk denotes negative true logic (NTL)) set low, indicating that it contains the codeword with the overall minimum distortion.

• If an internal COMPARE line is low, then that chip can drive the external COMPARE pin low. If the internal COMPARE line is high, then the pad becomes an input (high-impedance state).

• At any of the progressively lower bit positions, a chip is disqualified from competition if the internal and external compare states differ (i.e., if the internal COMPARE line is high and the external COMPARE is low). Disqualification occurs by setting CHIP_VALID_OUT* high.

• Once a chip has been disqualified, it cannot affect the external comparisons for the less significant bits. Furthermore, the disqualified chip cannot enable its output LABEL pins.

Note that throughout this process, the internal state of the chip is not affected. Thus, each chip finds the codeword in its memory most closely associated with the input vector; only if it is the overall winner can it drive the output bus.

The second problem is one of prioritizing multiple winners on different AM chips. With the scheme described above, winners on separate chips will try to simultaneously access the LABEL bus. To prevent this, an IN and an OUT line were added to the VAMPIRE chip. The IN and OUT of successive chips are connected in a daisy-chain fashion (see Figure 32). With this design, three conditions must be met in order for a chip to control the output bus: (1) the IN line must be high, (2) the least significant CHIP_VALID_OUT* must be low, and (3) the MATCH line must be low. When a chip has captured the LABEL bus, it sets the CHIP_VALID line high (an external pin) and the OUT line low. In this way, the winning chip can be identified, and chips further down the chain cannot usurp control of the output. Figure 33 shows schematics for these signals. The ENABLE line is used to control the SELECT of the LABEL pads (see the top right quadrant of Figure 30).

Figure 32: Daisy-chain connection of IN and OUT lines.

Figure 33: Various I/O signals.

Note that chip speed should not be significantly degraded under the IN/OUT paradigm, as each chip determines a winner independently of the state of these lines. Figure 32 also shows how the COMPARE lines are pinwheeled, or directly connected together, rather than daisy-chained. The only delay each IN to OUT transition should incur (once the internal state of the AM chips has settled) is a single gate propagation time, namely the NOR gate delay shown in Figure 33.

4.7 Address Decoder

The address decoder for the VAMPIRE chip consisted of two standard 4-to-16 decoders with enable. Each decoder controlled the WORD lines on one side of the chip. The enable for each decoder was controlled by (1) the state of the fifth input address line (A4), and (2) the state of the STR line. Assuming that these two signals are in the proper state to enable a particular decoder, then the values on address lines A0-A3 determine which WORD line gets asserted, and thus which codeword gets loaded.

4.8 Pad Frame

The pads represent a buffer between the monolithic, or integrated, circuit and the external world. Pad circuitry is generally used for two purposes: (1) amplify signals that are being sent off-chip, and (2) provide protection from voltage spikes and other potentially hazardous conditions. Six different types of pads were used for the design of the VAMPIRE chip: corner, power, ground, input, output, and I/O pads. All of these pads were standard designs obtained from the MOSIS service (described in more detail in the next chapter). The function of each is relatively straightforward, as described by its name. The I/O pad is the one that plays a key role in the global comparison/winner selection process. Depending on the state of its ENABLE line, the I/O pad can act as either an input or an output (refer to Figure 28).

4.9 Reset

The last circuit to be discussed is the reset circuitry. Externally, the RESET line uses negative true logic; internally, it is positive true logic. However, rather than just inverting the signal, a feature was added to eliminate possible spurious resets. As shown in Figure 34, the internal RESET is simply taken from the output of a NOR gate, whose inputs are the external RESET signal and a delayed version of the external RESET. The delay is induced by routing the signal through 80 inverters. The VAMPIRE chip is reset when the RESET pin is held low for approximately 30 ns, according to simulations.

Figure 34: Reset circuitry. The inverter chain helps eliminate spurious resets.

CHAPTER V

Implementations

Based on the architecture described in Chapter IV, the VAMPIRE chip was laid out using a computer-aided design (CAD) package called MAGIC. MAGIC uses what is called a "lambda-based" grid and design rule checker. The layout is painted onto a grid whose dimensions are specified in λ, and the design rules are specified with respect to this unit. The minimum feature size is defined as 2λ. An example of a design rule is: "Transistors must have a minimum length of 2λ and a minimum width of 3λ." The advantage of the λ-based system is that the design can be laid out independently of the technology used to implement the design. Thus, if the chip is being fabricated at a facility that can handle a minimum feature size of 2 µm, then set λ = 1 µm.

The technology used for the VAMPIRE chip was a 2 µm CMOS n-well process, and the designs were submitted to MOSIS (Metal-Oxide Semiconductor Implementation Service). The MOSIS Service is a prototyping service offering fast-turnaround standard-cell and full-custom VLSI circuit development at very low cost. MOSIS has developed a methodology that allows the merging of many different projects from various organizations onto a single wafer. Therefore, instead of paying for the cost of mask-making, fabrication, and packaging for a complete run (currently between $50,000 and $80,000), MOSIS users pay only for the fraction of the silicon that they use, which can cost as little as $400. MOSIS' ease of access, quick turnaround, and cost-effectiveness have afforded designers opportunities for frequent prototype iterations that otherwise might not even have been considered. The following sections discuss various aspects of VAMPIRE chip implementations.

5.1 Computation Cell

The basic building block for the associative memory is the computation cell. The size of this cell basically determines the density of the AM as a whole. Thus, it is of utmost importance to design the computation cell as compactly as possible. The layout for this cell is shown in Plate I.

As can be seen from the plot, the layout is extremely dense. The dimensions of the cell are 316λ × 219λ. If we consider just the computation cell alone, it takes 2528λ × 219λ to hold one 32-bit codeword. A square of 2528λ × 2528λ will hold 11.5 codewords, or 368 bits.

Plate I: The layout of a single computation cell.

5.2 Tiny Chip

The first chip implementation of the VAMPIRE was actually a prototype design. It was fabricated on a MOSIS Tiny Chip, which has a payload area measuring 2220 µm × 2250 µm. The Tiny Chip was selected primarily for its low cost; $400 buys four packaged ICs, contained in 40-pin DIPs (Dual In-line Packages). However, because of the limited amount of area and pins, a full-scale codeword would not fit on the Tiny Chip. A 32-bit vector requires eight computation cells in a word-unit; this structure alone spans a length of 2818 µm after including the overflow, right-end, and MRR cells. Thus, for the prototype chip, a 16-bit vector (four components at 4 bits each) was implemented. Eight such codewords could fit on the chip. A photograph of the Tiny Chip can be found in Plate II. This photo was taken at OSU on Polaroid film.

Most aspects of the Tiny Chip are the same as those which have been discussed in previous chapters. The exception is that this version of the chip utilized analog pads for the COMPARE lines, and thus no inter-chip circuitry was required to enable multi-chip operation. The Tiny Chip was only subjected to preliminary testing. A simple breadboard circuit was built to store and match data to the chip, and the DAS9200 (Digital Acquisition System from Tektronix) was used to measure circuit performance. It was found that for several simple test cases, the AM chip performed correct associations in approximately 140ns. However, problems with the DAS and faults in the breadboard (adjacent columns shorting together) limited the number of tests that were run. Furthermore, the design of the full-scale VAMPIRE chip was well underway; the basic concept had been proven.

Plate II: Photograph of the Tiny Chip die.

5.3 Full-Scale Chip

The full-scale VAMPIRE chip was also fabricated in 2 µm n-well technology. Submitted on a MOSIS Small Chip, the total die area for this chip measures 4.6mm × 6.8mm. The cost of a Small Chip run is approximately $2200, and for that price you receive twelve chips packaged in a 65-pin PGA (Pin Grid Array) carrier and eight unpackaged die. A photograph of this chip can be found in Plate III. The pinouts are described in Appendix C. Testing for this chip will be described extensively in the next chapter.

Plate III: Photograph of the full-scale VAMPIRE chip.

CHAPTER VI

Chip Analysis

Upon receipt of the packaged IC chips containing the VAMPIRE design, the only remaining task was to determine whether or not the associative memory functioned correctly and, if so, how fast. Because the 65-pin PGA package did not allow any extra pins for testing purposes, a set of evaluation vectors was devised to extract as much information as possible concerning operational details. In this chapter, the test equipment and procedure used to measure chip performance are described, and the final results, including those from an actual VQ encoding system, are presented. The chapter concludes with a discussion of results.

6.1 Equipment and Procedure

The primary equipment used to evaluate the performance of the VAMPIRE chip included a Data Acquisition and Control Adapter (DACA) and a custom-built evaluation board (EB). The DACA, which is a special-purpose I/O adapter board connected to an IBM PC, was used to control input to and monitor output from the EB. The evaluation board is a wire-wrapped design consisting mainly of buffers and latches used for determining operational behavior and performance speed of the VAMPIRE chip. A function generator was also used as a support device. The main components of the experimental testing station are shown in Figure 35.


Figure 35: Diagram of the test equipment.

6.1.1 Data Acquisition and Control Adapter

The DACA board is a commercially available adapter which facilitates data acquisition and control through programming modules written in C, Pascal, or BASIC. A programming module represents a set of functions which simplify the process of reading data from or writing data to the adapter. Though the DACA possesses both analog and digital capabilities, only the sixteen binary input channels (BI0 - BI15) and sixteen binary output channels (BO0 - BO15) were used to access the evaluation board.

Table 3: Description of the DACA's 32 digital channels.

Channel       Type    Description
BO0 - BO7     Output  DATA
BO8           Output  LOAD
BO9           Output  EXECUTE
BO10          Output  MODE
BO11 - BO15   Output  ADDRESS
BI0 - BI4     Input   LABEL
BI5           Input   CMP_VALID
BI6 - BI15    Input   COMPARE

For ease of implementation, PC-BASICA (interpreted BASIC) was chosen as the programming interface language. The program itself can be found in Appendix B; the first 100 lines of code consist of a header which initializes the adapter board and places appropriate functions into the computer's memory. The remainder of the program provides a menu-driven method for executing all the steps necessary to operate the chip. Table 3 details how the DACA's 32 digital channels were utilized. Note that the type category from the table (Input/Output) is with respect to the DACA, not the chip. Three channels that deserve further explanation are output channels eight through ten: LOAD, EXECUTE, and MODE.

Load  Because the DACA can only present eight bits at a time, the LOAD signal is used in conjunction with the address lines to store data into intermediate "holding" registers.

Execute  The EXECUTE line controls the clock input of the "presentation" registers. Once all thirty-two bits of vector data have been placed into the holding registers, they are transferred in parallel to the presentation registers, which feed the inputs of the VAMPIRE chip. EXECUTE is also used with the LOAD signal to reset the chip.

Mode  Output number ten (MODE) establishes whether the chip will be in a store or match mode.

The description of the EB in the next section will help clarify details concerning the various DACA I/O lines. Note that control of the DACA is extremely slow; for example, when a BO line is strobed high then low on successive lines of the BASICA program, the resulting pulse width can be as much as 20ms in duration. This issue is addressed in the next section.

6.1.2 Evaluation Board

In this section the operation of the evaluation board as it interacts with the DACA will be described. As mentioned earlier, the evaluation board (EB) is a wire-wrap board designed to interface with the DACA so that all aspects of the VAMPIRE chip could be tested, including complete "steady-state" functionality and simulation of real-time operation. Three limitations of the DACA necessitated the implementation of the EB:

1. The DACA is not capable of simultaneously presenting all 32 data bits to the associative memory.

2. The adapter cannot approach the real-time operating speed (3.57MHz) required of the chip.

3. Parasitic capacitance and resistance associated with the DACA degrade performance characteristics of the VAMPIRE chip.

A schematic of the EB is shown in Figure 36. All signals shown in this schematic are inputs or outputs from the DACA with the exception of the CLOCK line, which is derived from the HP-8116A Pulse/Function Generator. The EXECUTE line is tied to the input of the 8116A so that on a rising edge, the function generator delivers exactly two pulses to the CLOCK line. The spacing between the pulses is user selectable, allowing variable rate chip operation. Note that the CLOCK signal is used only to control the latching of data registers; it should not be misconstrued as an input to the VAMPIRE chip, which is asynchronous.

The AM's three phases of operation (reset, store, and match) are described below as they relate to the evaluation board. References to specific components in Figure 36 are given where possible, denoted as Uxy.

Reset  The VAMPIRE chip is reset when the signal on its RESET pin is held low for approximately 30 nanoseconds. This condition occurs on the EB when both the LOAD and EXECUTE signals are high, since the RESET pin is connected to the output of a NAND (U13C in Figure 36) between these two signals.

Store  As was hinted at earlier, the store operation is accomplished in a two-step process. First, the four intermediate registers (U1 - U4) are each loaded with eight-bit data, one register at a time. Each latch is loaded by placing the appropriate ADDRESS line high and strobing the LOAD line high then low. The output of the respective NAND gate (U12A-D) pulses low then high, causing the corresponding register to latch the data byte. Also, in order to accommodate two chips, the state of the two E flip-flops (U11) must be set. The MODE line is placed in a high state, A4 is set to choose the appropriate AM chip, and again the LOAD line is strobed. The timing diagram in Figure 37 illustrates the loading process. Once all 32 bits of data have been loaded into the four intermediate registers, the second stage of the store operation can proceed. With the MODE line still in a high state, the EXECUTE line is pulsed, which causes the CLOCK line to pulse. The presentation registers (U5 - U8) latch the outputs of the intermediate registers on the pulse of the CLOCK line, and the appropriate STORE_ (the underscore represents NTL) line is held low for the duration of the EXECUTE pulse. Finally, the MODE line is returned to a low value.

Match  For the match operation, data is loaded into the intermediate registers just as with the store operation. Here though, the MODE line is kept low, which prevents either of the STORE_ lines from being activated. Again on the rising edge of the EXECUTE line, the CLOCK will pulse twice. The first pulse is primarily used to load data to the presentation registers. During the second pulse, we are interested in latching the LABEL coming out of the VAMPIRE. Note that in actuality both latches get clocked on each pulse of the CLOCK line; however, in the case of the presentation registers, the clock is merely latching identical data, and in the case of the output latch, we simply ignore the first set of data. See Figure 37 for timing details.


Figure 37: Timing diagram for some evaluation board signals.

Appendix C contains a user’s guide for the VAMPIRE (including a pinout listing) which details the operation of the chip as it would be used in a VQ system. The next section will further describe the interaction between the DACA, EB, and AM chips.

6.1.3 Testing Procedure and Sample Operation

A careful study of the schematic provided in Figure 36 along with the BASIC program in Appendix B should supply enough information to anyone who would like in-depth knowledge of the experimental test set-up. However, this section has been included to give a "quick-and-dirty" explanation of how to operate the VAMPIRE chip test equipment.

Before turning on any other piece of equipment, the evaluation board should first be powered on to 5 volts and ground. Then, the PC may be turned on. The reason for this ordering is that upon power-up, the DACA's output channels are not guaranteed to be in a grounded state. Finally, the 8116A Function Generator may be turned on.

This piece of equipment has a protection feature such that its output is disabled on power-up and must be manually enabled.

Before testing begins, the 8116A must be configured properly. This is easily accomplished using the pushbuttons on the function generator's front face. Note that on power-up, the 8116A initializes its settings to match those on the last shutdown.

The evaluation program relies on the function generator's "Burst Mode" in order to obtain correct results. To set the 8116A in the proper mode, select E.BUR (which stands for externally-triggered burst) as the mode and PULSE as the output waveform. Next configure the 8116A as follows: DUR = number of bursts = 2, WID = pulse width = 25ns, AMP = one-half of the peak-to-peak voltage = 2.5V, OFS = half of what the offset voltage should be = 1.25V, and FRQ = rate at which the chip will be tested. The input of the function generator should be taken directly off the EXECUTE line, and the output should be wired to the CLOCK line (see Figure 36). Refer to the Operating and Service Manual for the 8116A Programmable Pulse/Function Generator 50MHz for more information.

Now the evaluation board and program will operate as they were designed. At the PC prompt, type:

C:\ BASICA VAMPIRE

After running for a few seconds, the following menu should appear:

FUNCTION (0-Exit, 1-Reset, 2-Store, 3-Match, 4-File Mode)?

Generally, it is a good idea to reset the chip before commencing any performance tests. After doing so the same menu will again appear.

Next we wish to load a codeword. After choosing option 2, the program will prompt the user for the pertinent data:

STORE OPERATION
ADDRESS? 37
BYTE0? 127
BYTE1? 148
BYTE2? 151
BYTE3? 133

The program stores the vector (127,148,151,133) into location 5 (37 mod 32) of the second chip. Of course, all the intermediate steps required to do so are invisible to the user. For example, first the address is set to 00001, 01111111 (127) is placed on the DATA lines, and LOAD is strobed high then low. Likewise, the other three bytes are loaded according to the timing diagram of Figure 37. To indicate that the second chip should be loaded, the address is set to 10000, and again LOAD is strobed. Finally, the EXECUTE line is strobed high.
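The sequence of control operations can be summarized in a short sketch. The real test program is the BASICA listing in Appendix B; here, write_daca() and its keyword fields are hypothetical stand-ins for the adapter calls, and the routing of the 5-bit memory location to the chip is not modeled.

# Illustrative sketch of the byte-at-a-time load sequence described above.
def store_vector(address, vector, write_daca):
    chip, location = address // 32, address % 32       # 37 -> second chip, location 5
    for i, byte in enumerate(vector):
        # one-hot ADDRESS (00001, 00010, ...) selects holding register i; LOAD is strobed high then low
        write_daca(data=byte, address_lines=1 << i, load=1)
        write_daca(data=byte, address_lines=1 << i, load=0)
    # A4 (10000) with MODE high selects which AM chip will perform the store
    write_daca(address_lines=0b10000, mode=1, load=1)
    write_daca(address_lines=0b10000, mode=1, load=0)
    write_daca(execute=1)                               # EXECUTE fires the two CLOCK pulses
    return chip, location

ops = []
print(store_vector(37, (127, 148, 151, 133), lambda **kw: ops.append(kw)), len(ops))
# (1, 5) 11   -> eleven adapter writes for one store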

When option 3, a match, is selected, the program responds with:

MATCH OPERATION
BEFORE MATCH - 31
BYTE0? 130
BYTE1? 130
BYTE2? 130
BYTE3? 130
AFTER MATCH - 7
SS MATCH - 5

First the program reads the current state of the LABEL lines and reports to the user. Then the user is prompted for the information to be matched against. In this example we use the vector (130,130,130,130). After loading the data into registers on the EB, the pulse generator is triggered with an EXECUTE pulse. The resulting state of the LABEL latch is then read. Next the program performs another match to obtain steady-state information. The example shows a common response for the case when the frequency generator is set at too high a frequency. The chip is not able to produce the correct match in the time allowed, evidenced by the fact that it differs from the steady-state result. Note that during high-speed testing, the COMPARE lines are not connected to the DACA to ensure that they do not affect chip performance. Thus, the state of the COMPARE lines is not printed on the PC's screen. These were probed individually to ascertain their state.

The last option of the menu indicates a "File Mode." In the file mode the program can load an entire codebook, or match the chip against a data file containing hundreds or thousands of test vectors. In the case of file vector matching, the program asks if the user would like to log the output labels to a file. With this option, it is possible to identify where errors occur. For both of these operations, the file format is very simple: the four components of a vector are listed on a single line separated by a space. In either case, the program continues reading data until the end of file is reached.

One last capability of the file mode option is termed "Continuous Pattern". In this mode, the program will match the chips against two alternating vectors indefinitely:

FUNCTION (0-Exit, 1-Reset, 2-Store, 3-Match, 4-File Mode)? 4
(0-Main Menu, 1-Ld Cdbk, 2-Match Templates, 3-Continuous Pattern)? 3
Enter first 4-byte pattern:
BYTE0? 0
BYTE1? 0
BYTE2? 0
BYTE3? 0
Enter second 4-byte pattern:
BYTE0? 87
BYTE1? 119
BYTE2? 132
BYTE3? 128
** PRESS F1 TO STOP **
PATTERN 1
AFTER MATCH - 31
SS MATCH - 0
PATTERN 2
AFTER MATCH - 1
SS MATCH - 1
PATTERN 1

This option is extremely handy for measuring the settling time for a specific pair of vectors. The frequency setting of the 8116A can be adjusted until the chip just functions correctly/incorrectly, thereby defining the operating threshold for that particular case.

6.2 Experimental Results

The purpose of the following experiments is two-fold: first, the tests should establish the operating characteristics of the VAMPIRE chip; and second, these tests should indicate how chip performance could be improved. Ideally, given knowledge of the chip's architecture, we could devise a set of test vectors which would systematically isolate individual components of the design and identify the response time for each. The aggregate of these individual response figures would make up the total response time of the chip. Using this information, we could easily suggest methods for improving overall performance.

In reality, the various computational blocks are heavily dependent on each other, making the task of performance characterization much more difficult. Nonetheless, an experimental procedure was developed to test as many individual areas of chip operation as possible. More specifically, tests were structured to determine how the overall match time can be broken down into distinct delays associated with the distortion measure, global compare circuitry, priority encoder, and chip I/O functions.

6.2.1 Functional/Steady-State Testing

Preliminary chip testing was intended to first determine whether or not the chip functioned correctly. For this initial set of tests, the COMPARE and LABEL lines were connected directly to the DACA input terminals. Using the PC-controlled testing station, the VAMPIRE chip was stepped through its fundamental operations.

After executing a reset, the COMPARE lines all registered a high value, indicating that the pins were in a high-impedance state and that the pull-up resistors dictated the state of the output. The CHIP_VALID line measured high as well, inferring that the chip contained a winning codeword. My first impression was that this was the incorrect response; however, the two criteria for asserting the CHIP_VALID signal are (1) the IN line must be high, and (2) the internal COMPARE lines must all match the external COMPARE pins. As both of these conditions were met, the chip responded properly.

The store operation proved equally successful. Several distinct vectors were individually loaded into the AM, and after each operation, the COMPARE lines all went low, indicating an exact match was present. The corresponding address appeared on the LABEL bus, identifying the location of the match. Eventually every memory location was tested by storing a vector into its contents. The priority encoder was also found to be operational. When two different memory locations were loaded with the identical vector, the smaller address appeared at the output.

As a final test, the chip was presented with an input vector different from any of the stored vectors. Again, the memory exhibited correct operation by settling on the closest codeword vector, and the state of the COMPARE lines reflected the corresponding distance.

6.2.2 High-Speed Testing

The first true performance test of the VAMPIRE chip involved simply loading a 32-word codebook into the memory and applying sample input vectors. The codebook represented a genuine set of codewords generated using a standard collection of training images developed at Ohio State University. The sample difference vectors used for the experiment were also generated from actual image data run through the VQ predictor. In particular, the test data was taken as a portion of the now (in)famous Lenna image. The image fragment (shown in Figure 38) represents 3600 pixels (60 × 60), or 900 vectors.

Figure 38: Portion of Lenna image used for experiment. This fragment represents 900 vectors.

The results of the test are summarized in the graph of Figure 39. Performance was measured as the percentage of incorrect associations versus settling time. Errors were detected by matching to the same vector twice in succession. The label clocked into the output latch on the first matching operation is recorded and compared with the data latched in the second matching operation. Because the time between successive matches was on the order of milliseconds and since the input data remained constant during that time, the label resulting from the second match is considered steady-state data. Just to ensure that the steady-state data is correct, the program gives you the option of storing the final label values in a data file. Comparisons between this data file and computer-generated labels revealed no discrepancies.
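The error-counting procedure amounts to a simple loop; in the sketch below, match_chip() is a hypothetical stand-in for the DACA/EB calls that present a vector and read back the latched label.

# Illustrative sketch of the double-match error test.
def error_rate(vectors, match_chip):
    errors = 0
    for v in vectors:
        first = match_chip(v)      # label latched at the test frequency
        steady = match_chip(v)     # same vector again, milliseconds later: steady-state label
        if first != steady:
            errors += 1
    return 100.0 * errors / len(vectors)

# e.g. error_rate(lenna_vectors, match_chip) -> percentage of incorrect associations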


Figure 39: Performance of the AM measured on different days.

Two major results can be gathered from the data in Figure 39. First is the temperature dependence of chip performance, which is demonstrated by the two different curves. The original set of performance data was taken during the first warm week of Spring, before the air conditioning was turned on in the lab. Thus the "Warm" data indicates an ambient temperature of over 80°F. The next set of tests did not occur for another week; an abnormally low setting on the thermostat resulted in a lab temperature of close to 65°F.¹ While room temperature was not a consideration during the design phase of the chip, these results stress the importance of keeping experiments localized in time so that extraneous conditions such as temperature variations do not obscure more important information.

A second, more pertinent observation resulting from this test is the fact that the chip does not operate at the target rate of one association per 280ns. The fastest the VAMPIRE chip seems to run error-free is one vector association every 380ns, or 2.7MHz. Furthermore, this does not necessarily represent a worst-case analysis; the data here is only indicative of 900 sample vectors. For selected frequencies, namely at full (3.57MHz) and half (1.78MHz) speed, over 12,000 image vectors were tested, all from the same image (Lenna). The results from the more extensive testing agreed well with the 900-sample case.

¹Note that because this test was not performed intentionally, accurate temperature information is not available.

6.2.3 Baseline Performance

In an effort to determine why the VAMPIRE chip did not meet design specifications, the absolute best-case performance was ascertained using very tightly controlled test conditions. This baseline performance measure was meant to represent the upper bound of chip performance; the sole restriction being that the chip output (the winning codeword label) must change state during the experiment. This requirement was imposed simply because there is no way to tell if the chip has settled into the correct state unless the output label reflects it. After determining maximum chip speed, subsequent trials can then be used to measure the effects of various, less ideal test conditions. The eventual goal is to identify which factors have the most detrimental effect on performance.

Unable to exhaustively search for optimum test conditions, knowledge of the circuit structure and good judgment/common sense helped define the first set of baseline tests. It is reasonable to assume that the chip will operate most quickly when very few internal nodes change state. Thus, the first experiments involved storing all zeros in one memory location and the vector (0,0,0,1) in a different memory location. The test program was then used to match to an alternating sequence of M0 = (0,0,0,0) and M1 = (0,0,0,1). The following notation will be used to describe many of the experiments that follow: Cx = (0) and Cy = (0,0,0,1) denote that memory location x is loaded with the zero vector, and that memory location y receives the vector 0001 (all other memory locations are left unused); likewise, Mx = Cx represents the fact that the matching vectors (M0 and M1) will be the same as the stored codewords.

Table 4: Baseline performance test results. C0 = M0 = (0,0,0,0) and C1 = M1 = (0,0,0,1).

          Match M0             Match M1
Trial #   Freq (MHz)  T (ns)   Freq (MHz)  T (ns)
1         13.0        76.9     18.0        55.6
2         13.0        76.9     18.2        55.0
3         14.1        70.9     17.0        58.8
4         13.8        72.5     16.4        61.0
5         13.6        73.5     16.6        60.2
6         13.0        76.9     17.8        56.2
Ave.      13.4        74.5     17.3        57.8

Because only exact associations were used throughout this experiment, the COMPARE lines were externally grounded to eliminate any possible contribution they may have towards the total delay time.

As mentioned in Section 6.1.3, one option of the program listed in Appendix B allows the computer to cycle between two different vector templates. After each template is matched against the chip contents, the immediate and steady-state output labels are displayed on the screen while the next vector is being matched. This immediate feedback allows the operating frequency to be changed "on the fly". By incrementally changing the pulse frequency of the function generator, the operating threshold between correct and incorrect association can be determined.

Table 4 displays the results of the original baseline performance tests run six different times over the period of several days. The two columns marked "Match M0" and "Match M1" refer to the process of matching to vectors M0 and M1, respectively. As can be seen from the table, both match operations are executed at an extremely high rate; however, as the input lines transition from M0 → M1, the chip speed is noticeably faster than the corresponding M1 → M0 transition. Before attempting to analyze the discrepancy between the M0 and M1 settling times, consider first the results of two other experiments. In the first of these experiments, the contents of memory locations 0 and 1 are reversed to eliminate any bias that could be attributed to the particular positioning within memory. In the second experiment, C0 = M0 = (0,0,0,255) and C1 = M1 = (0,0,0,254). In both of these tests, the settling times were nominally equivalent to those in Table 4; the slower case occurred when the smaller vector component (e.g. 0 or 254) was being matched.

The common denominator in all three tests boils down to the state of the LSB.

Cases in which the LSB transitions from 0 to 1 settle on average 16 to 17ns faster than the equivalent 1 → 0 transition. SPICE simulations of the absolute difference circuitry agree with the observed results. The circuit schematic of this block was given back in Figure 25; Figures 40 through 42 show the results of the SPICE simulations.

The first of these figures (Fig. 40) depicts the transient response of the absolute difference LSB when the input bit (solid line) changes from 0V to 5V. Recall from the original experiment description that memory location 1 contained a "1" in its LSB. Thus its difference drops to zero when the LSB of the input goes high. Conversely, the corresponding absolute difference bit of memory location 0 changes to a "1". Likewise, Figure 41 shows the transient response for the case when the input bit changes from high to low. Figure 42 illustrates just the response of the winning codeword from each of the previous two figures. We can measure the response time of the two curves by finding the point where they cross 2.5V; in both cases the input bit crosses the 2.5V threshold at 5.5ns. The output transient curve for memory location 0 (marked with pluses (+)) crosses the 2.5V threshold at 10.0ns, for a response time of 4.5ns. The corresponding curve for memory location 1 (marked with 'x's) reaches 2.5V at 7.2ns, yielding a response time of just 1.7ns.
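The response-time measurement used on the simulated waveforms reduces to finding 2.5V threshold crossings, as in the sketch below (illustrative only; the time and voltage arrays would come from the SPICE transient output).

# Illustrative sketch: response time = threshold crossing of the output minus that of the input.
def first_crossing(times, volts, threshold=2.5):
    for t0, t1, v0, v1 in zip(times, times[1:], volts, volts[1:]):
        if (v0 - threshold) * (v1 - threshold) <= 0 and v0 != v1:
            # linear interpolation between the two bracketing samples
            return t0 + (threshold - v0) * (t1 - t0) / (v1 - v0)
    return None

def response_time(t, v_in, v_out):
    return first_crossing(t, v_out) - first_crossing(t, v_in)

# e.g. response_time(t, input_bit, abs_diff_bit) -> about 4.5 ns for memory location 0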

Figure 40: Response of the absolute difference bit when the input (solid line) goes from 0V to 5V.

Figure 41: Response of the absolute difference bit when the input (solid line) goes from 5V to 0V.

Figure 42: Response of the winning codeword.

One explanation (at least in part) of why matching to codeword 0 (the 1 → 0 transition) yields a slower response time is a result of the chip layout. Throughout the computation cell, the gate widths of the PMOS and NMOS transistors were not sized in the proper ratio. In the interest of saving layout area, the p-type transistors were laid out the same size as the n-type devices. Of course, since electron mobility is approximately twice that of hole mobility, equivalent fall and rise times of complementary logic devices are achieved only when the PMOS transistors are scaled about twice as large (wide) as the NMOS transistors. Figure 43 shows the results of the same SPICE simulation of the absolute difference circuit except using properly scaled (2:1 ratio) p-type transistors. The response time for the case in which location 1 is the winner remains the same at 1.7ns. However, the response time for the case in which location 0 wins drops from 4.5ns to 3.1ns.

Intuitively it makes sense that matching to different cases of zeros and ones results in dissimilar response times. But is it reasonable to attribute a 16ns timing difference to the characteristics of the p- and n-type transistors when simulations show only a 4.5ns difference? The answer is "yes" for the following reasons. First of all, the SPICE simulations do not take into account all of the capacitance associated with the on-chip circuitry. Additional capacitance serves to slow the circuit more than the simulations reveal. Secondly, timing delays do not necessarily propagate in a linear fashion. A four nanosecond delay in one stage of the chip will not simply result in a four nanosecond overall delay. The dynamics of the asynchronous circuitry are such that internal conditions constantly change. A delay from an earlier stage could mean that the state of the circuitry in latter stages will be different. This effect will be demonstrated in a subsequent experiment.

Figure 43: Response of the winning codeword using properly scaled PMOS transistors.

The savings in layout area as a result of using "small" PMOS transistors was significant. Using the correctly sized PMOS devices would result in a 7.4% increase in the height of the computation cell (from 216λ to 232λ). Thus the overall set of sixteen word-units (see Figure 14) would expand from a height of 3456 µm to 3712 µm. For the chip size used to fabricate the VAMPIRE chip, the payload area allowed approximately 3912 µm of circuit height after the pads were added. The amount of free space which could be used for I/O circuitry and interconnects would be cut in half, from 456 µm to 200 µm. It is not clear that the AM circuitry would have fit into this space.
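The area trade-off can be checked with a line of arithmetic; the figures in this sketch are simply the ones quoted above.

# Quick check of the layout-area trade-off discussed above.
pad_limited_height = 3912            # um of circuit height available inside the pads
height_small_pmos  = 3456            # sixteen word-units with minimum-size PMOS
height_scaled_pmos = 3712            # same stack with 2:1 PMOS (about a 7.4% taller cell)
print(pad_limited_height - height_small_pmos,    # 456 um of free space as fabricated
      pad_limited_height - height_scaled_pmos)   # 200 um had the PMOS been scaled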

6.2.4 Initial Conditions

The last section pointed out how two virtually identical situations resulted in two very different match times due to the initial state of the codeword circuitry. In both cases the winning codeword was an exact match, and both codewords were only a distance of 1 away when it was not the winning codeword. In this section, a more comprehensive group of experiments will be presented in order to gain further insight into the operation of the absolute difference and component-sum circuitry. Other factors will be negated by (1) grounding the compare lines, making sure to only use exact matches, and (2) storing the pair of codewords in locations 0 and 1. Table 5 shows the outcome when the zero vector is stored in location 0 and vector (0,0,0,x) is loaded in location 1 (C0 = (0), C1 = (0,0,0,x), and Mx = Cx). As can be seen from the table, x is first varied as a power of two (x = 2^i), then as a power of two minus one (x = 2^i - 1). The data is graphically illustrated in Figure 44.

Table 5: Effect of memory contents on settling time. C0 = M0 = (0,0,0,0) and C1 = M1 = (0,0,0,x).

       Match M0             Match M1
x      Freq (MHz)  T (ns)   Freq (MHz)  T (ns)
1      13.65       73.3     16.55       60.4
2      12.75       78.4     17.15       58.3
4      12.05       83.0     16.05       62.3
8      11.45       87.3     14.65       68.3
16     10.15       98.5     11.55       86.6
32     9.65        103.6    10.90       91.7
64     8.40        119.0    9.95        100.5
128    5.50        181.8    9.20        108.7
3      8.35        119.8    9.75        102.6
7      6.35        157.5    6.83        146.4
15     5.20        192.3    5.30        188.7
31     4.40        227.3    4.50        222.2
63     3.75        266.7    3.95        253.2
127    3.20        312.5    3.55        281.7
255    3.10        322.6    3.15        317.5

Figure 44: Effect of memory contents on match time.

While the data clearly demonstrates a definite pattern, the significance of this trend is not so well-defined. It seems reasonable to expect that mathematically larger data will take longer to process, since many of the computations depend on propagation delays directed from LSB to MSB. Unfortunately, the results shown here do not indicate precisely which part of the circuit is causing the delay. For example, is the case in which x = 31 slow because the absolute difference is slow, or because the final metric is initially set to 31? (Remember that we are only performing exact matches, so these references pertain to the state of the winning codewords' internal nodes immediately after the input changes state.)

In an attempt to answer the questions raised in the previous experiment, several tests were performed using a variety of vectors for C1. A sample of these trials can be found in Table 6. Each different vector represents a separate test in which C0 = M0 = (0) and C1 = M1 is defined in the table. For the first four trials, the four vector components sum to a total of 64. Thus, when M1 is being matched against the memory contents, the internal state of codeword C0 registers a distortion value of 64. However, depending on how this total is distributed across the vector, the match time varies accordingly.

It is difficult to extract any definite conclusions from the experiments presented in this section. If anything, these tests demonstrate that the total chip delay cannot be attributed to any single circuit component (e.g. absolute difference, component sum). One issue that the data does address is that of degraded chip performance.

Table 6: Effect of memory contents on settling time. C0 = M0 = (0,0,0,0) and C1 = M1 varies.

                 Match M0            Match M1
C1               Freq (MHz)  T (ns)  Freq (MHz)  T (ns)
(0,0,0,64)       5.83        172     10.4        96.2
(16,16,16,16)    9.80        102     10.8        92.6
(0,32,0,32)      7.33        136     10.6        94.3
(1,0,0,63)       5.37        186     7.35        136
(0,0,0,63)       3.76        266     3.96        253

Between the baseline results from the last section and the results shown here, we can rule out pad I/O as the sole source of delay; excessive chip delay must be due (at least in part) to the integrated design. Unfortunately, because of the complex and interdependent relationship between various components, it is difficult to draw any more specific results than this.

6.2.5 Global Comparison/Winner Selection

There is one aspect of chip operation that may have been overlooked during the previous set of tests. Because the test vectors were identical to the stored vectors, the trials all ended up with a zero final distance. Thus, it was assumed that the COMPARE lines did not play a role in determining settling time. Perhaps it is possible, though, that while the internal sums were in a state of transition, the COMPARE lines were temporarily driven high by the internal pull-up resistors. For example, in Figure 41 there is a period of time where both difference bits are high, which could allow such a scenario to happen. Not only would the COMPARE line corresponding to that bit float high momentarily, but also the circuits toward the LSB could conceivably be affected in a chain-reaction type of effect. Under these conditions, the total delay time to produce an exact match would be a non-linear function of individual component delays.

Though the scenario just described further complicates the overall picture, measuring this effect can be accomplished in a relatively straightforward manner. The CHIP_VALID signal indicates whether or not a particular chip contains the global winning codeword. This is basically accomplished by comparing the internal COMPARE lines to the external COMPARE pins. By grounding the COMPARE pins and using exact matches, the internal state of the chip can be indirectly monitored via the CHIP_VALID signal.

This was the procedure used for the following test. Again, C0 = M0 = (0) and C1 = M1 = (0,0,0,x), where x could be any integer less than 256. These two vectors were presented to the AM chip in an alternating fashion while the CHIP_VALID line was monitored. Figure 45 shows a typical waveform for the CHIP_VALID signal (solid line), relative to the two clock pulses (dashed line). Two parameters were measured to gauge the chip response for this experiment: the time between the first rising clock pulse edge and the falling CHIP_VALID edge (variable a), and the width of the CHIP_VALID trough (variable b). Table 7 lists the values of a and b for various values of x measured during a match to vector M0. The subsequent figure (46) graphically illustrates this data.

Figure 45: Typical response of the CHIP_VALID line with respect to the two clock pulses.

Table 7: Summary of results for C0 = M0 = (0,0,0,0) and C1 = M1 = (0,0,0,x).

x    a (ns)  b (ns)  a+b (ns)  T (ns)  Δ (ns)
6    92.4    27.6    120       149     29
7    84.0    28.8    113       150     37
9    82.4    8.0     90.4      119     29
10   92.8    26.4    119       149     30
11   84.0    37.2    121       155     34
12   94.4    67.5    162       187     25
13   84.4    47.6    132       167     35
14   86.4    70.0    156       187     21
15   85.6    69.6    155       188     33
16   97.6    6.80    104       106     2
17   82.8    17.2    100       128     28
18   84.4    44.4    129       160     31
19   86.0    46.0    132       165     33
20   93.6    67.2    161       191     30
21   84.0    77.6    162       196     34
22   86.4    77.6    164       196     32
23   86.0    79.2    165       198     33
24   96.0    104     200       224     24
25   85.0    100     185       217     32
26   86.0    105     191       220     29
27   86.0    112     198       230     32
28   88.0    109     197       225     28
29   86.0    112     198       229     31
30   87.0    110     197       228     31
31   87.0    114     201       234     33
32   97.6    13.6    111       123     12
33   83.2    22.4    106       135     29
34   84.0    50.0    134       167     33
35   85.0    50.8    136       172     36
36   86.0    80.4    166       197     31
37   84.0    82.4    166       197     31

Figure 46: Summary of results for C0 = M0 = (0,0,0,0) and C1 = M1 = (0,0,0,x).

From the data it is clear that the scenario hypothesized earlier has indeed come true. The CHIP_VALID signal should theoretically remain high throughout the experiment, as only exact matches are being used; the fact that it temporarily dips low indicates that the internal COMPARE lines go through some state of transition. This explains why for larger values of x the VAMPIRE chip generally takes longer to settle. Larger values have more significant bits enter a transition state. Since higher bits are "more upstream" with respect to the CHIP_VALID and PROPAGATE signals, it takes longer for these cases to reach their steady state.

Figure 46 contains a great deal of information concerning this transition process. In addition to the general trend just described, it can be seen that other factors affect the final settling time. For example, the sharp dips in overall time at x = 16 and x = 32 indicate that the number of bits which change state has a noticeable impact. When the number of bit positions that differ from the previous vector (i.e. Hamming distance) is larger, the chip takes longer to settle; when the Hamming distance is smaller, the chip takes less time to reach steady state.

One interesting fact concerning Figure 46 involves the uppermost curve. Labeled "Chip Speed" and marked with asterisks (*), this curve represents the maximum speed at which the chip will match to M0 for a given value of x. The data used to generate this curve was taken completely independent of the data defined by the other curves. The difference between the chip speed curve and the curve marked "a+b" represents the amount of time it takes for the output LABEL to become valid once the internal computations have been completed. This data is listed in Table 7 under the heading "Δ".

To conclusively demonstrate that it is in fact the internal COMPARE lines which are responsible for much of the chip delay, the same experiment was performed except that two memory locations were used to store each codeword. The effect of having two codewords simultaneously working towards a match is that the pull-down transistor attached to each COMPARE line is essentially doubled in size. In reality, there are simply two pull-down transistors working in parallel. The results of this experiment are shown in Figure 47. As can be seen in the graph, the settling time of the new configuration is almost half of the original case.

Figure 47: The effect of storing repetitive codewords in the memory.

6.2.6 Location Dependence

Using the baseline results as a point of comparison, the next set of tests attempted to determine what effect memory location had on the VAMPIRE chip’s performance.

Recall from Chapter IV that the Priority Encoder of the output label gives the AM a definite sense of top and bottom. For example, locations 1 through 15 are all "below" location 0; each increasing step in address number corresponds to an extra transmission gate that its label must pass through (refer to Figure 30). It seems logical, then, that locations lower in memory would require more time to encode their addresses. Notice that codevectors 16 - 31 are "beside" location 0 since there are two distinct MRRs, one for each set of 16 codewords.

Table 8 shows the results of such a test. This table represents fourteen separate experiments in which M0 = C0 = (0) and M1 = Cm = (0,0,0,1). For each experiment, m was incremented by one; as in the baseline tests, all other memory locations were empty. As expected, the M0 match data stays fairly constant regardless of the location of Cm. This is because C0 is at the top of the priority encoder and is unaffected by events beneath it. However, the M1 matching speeds do not seem to agree with our anticipated result. In fact, the chip seems to perform better as m approaches 15, the lowest codeword in memory.

Table 8: Results from matching to codewords stored in locations 0 and m. M0 = C0 and M1 = Cm.

     Match M0             Match M1
m    Freq (MHz)  T (ns)   Freq (MHz)  T (ns)
2    13.45       74.3     17.50       57.1
3    13.45       74.3     17.45       57.3
4    13.45       74.3     17.45       57.3
5    13.45       74.3     17.45       57.3
6    13.45       74.3     17.45       57.3
7    13.45       74.3     18.00       55.5
8    13.45       74.3     18.70       53.5
9    13.45       74.3     18.20       54.9
10   13.25       75.5     18.00       55.5
11   13.30       75.2     18.30       54.6
12   13.25       75.5     18.60       53.8
13   13.35       74.9     18.50       54.1
14   13.35       74.9     18.60       53.8
15   13.45       74.3     18.75       53.3

The apparent mystery can be readily explained upon closer examination of the encoding process. Figure 48 shows the structure for one bit of the MRR; associated with each node of the priority encoder is a parasitic capacitor. When memory location 15 contains the winning codeword, every transmission gate along the output label bus is turned on, charging each node capacitor to 5V. When memory location 0 becomes the winning codevector, the first transmission gate shuts off, trapping the charge on the previously active bus. On the ensuing match cycle when location 15 is again the winner, the residual charge helps to more quickly change the output voltage.


Figure 48: One bit of the MRR.

Figure 49 depicts two cases that should help to illustrate the scenario: case 1 shows what happens when C1 and C0 are the alternating winners, and case 2 shows the corresponding situation for C15 and C0. Only one bit of the output address label is considered. In the figure, transmission gates are represented as simple switches, and parasitic capacitors are denoted Cp.


Figure 49: Illustration of charge sharing on a LABEL bit line.

Case 1

In the first time frame (a), location 1 charges the parasitic capacitor (Cp) and any output capacitance (Cout) to 5V. The amount of charge on Cp is 5Cp. In the second time frame (b), codeword 0 is the winner, and the bus connecting locations 1 - 15 is left floating. The charge on Cp is redistributed across the entire bus, resulting in a net voltage of V1 = 5Cp/15Cp = 0.33V; V0 = 0V. In the final time frame (c), the memory is in transition from 0 to 1. The entire bus is connected to the output, and no codeword has yet been declared the winner. Here, the output line gets charged to 5Cp/(15Cp + Cout) < 0.33V.

Case 2

The second case shows a much different result, though the chip is taken through the same exercises. First (a), the output is charged to 5V, saturating all 15 Cp of parasitic capacitance. Next (b), the bus floats, trapping 5 × 15Cp = 75Cp of charge. Assuming no charge leakage, then in the last sequence (c), the output node is charged to 75Cp/(15Cp + Cout) before a winner is even declared. If Cout is on the same order as Cp, then Vout ≈ 4.8V. Thus, the output is given a "head-start" on its charging time relative to Case 1.
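The two cases reduce to a single charge-sharing expression, sketched below with all capacitances expressed in units of Cp; the specific value assumed for Cout is for illustration only.

# Back-of-the-envelope sketch of the charge-sharing argument above.
def bus_voltage(winner_location, c_out=1.0, v_dd=5.0, n_nodes=15):
    charge = v_dd * winner_location          # node capacitors charged while that location drove the bus
    return charge / (n_nodes + c_out)        # redistributed when the bus is reconnected to the output

print(round(bus_voltage(1), 2))    # Case 1: C1 was the previous winner  -> about 0.3 V head start
print(round(bus_voltage(15), 2))   # Case 2: C15 was the previous winner -> about 4.7 V head start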

With these results in mind, a different experiment was devised to measure the true effect of memory location. In this new test, we assume that each winning codeword will have to charge the entire bus line above it for at least one of the bit positions. Because each output bit settles independently of the others, worst-case operation is governed by the slowest component. Table 9 summarizes the results of the revised test in which location m gets loaded with the zero vector, and location m + 1 gets loaded with (0,0,0,1). In a subsequent test, location m + 1 gets loaded with the zero vector, and location m gets loaded with (0,0,0,1); Table 10 summarizes these results.

Table 9: Results from matching to codewords stored in locations m and m + 1. M0 = Cm = (0) and M1 = Cm+1 = (0,0,0,1).

     Match M0             Match M1
m    Freq (MHz)  T (ns)   Freq (MHz)  T (ns)
0    13.25       75.7     16.45       60.8
1    13.05       76.6     13.95       71.7
2    12.75       78.4     15.50       64.5
3    12.25       81.6     13.45       74.3
4    12.35       81.0     14.65       68.2
5    11.65       85.8     13.15       76.0
6    11.95       83.7     14.35       69.7
7    11.85       84.4     13.15       76.0
8    11.95       83.7     14.25       70.2
9    11.85       84.4     12.65       79.0
10   11.75       85.1     13.15       76.0
11   11.15       89.7     12.15       82.3
12   10.95       91.3     12.25       81.6
13   10.25       97.6     11.25       88.9
14   10.25       97.6     11.25       88.9

Figure 50 graphically illustrates the two sets of tabulated data.

Table 10: Results from matching to codewords stored in locations m and m + 1. M0 = Cm+1 = (0) and M1 = Cm = (0,0,0,1).

     Match M0             Match M1
m    Freq (MHz)  T (ns)   Freq (MHz)  T (ns)
0    12.40       80.6     16.15       61.9
1    12.05       83.0     15.70       63.7
2    11.90       84.0     15.70       63.7
3    11.70       85.5     14.50       69.0
4    11.75       85.1     14.80       67.6
5    11.70       85.5     13.90       71.9
6    12.65       79.0     14.20       70.4
7    11.75       85.1     13.60       73.5
8    12.65       79.0     13.80       72.5
9    11.50       87.0     13.40       74.6
10   12.05       83.0     13.45       74.3
11   10.85       92.2     12.75       78.4
12   10.75       93.0     12.75       78.4
13   9.65        103.6    12.05       83.0
14   9.45        105.8    11.95       83.7

Figure 50: Settling time versus location index m.

The outcome of this experiment conforms to expectations; codewords lower in memory generally require a longer time to charge the internal label bus. The exact delay is governed by transmission line characteristics of the bus, which can be modeled as a distributed RC network. The sawtooth structure of the data implies that odd and even memory locations have different relative response times. Again, this is a factor of the individual p- and n-type transistors. In summary, this experiment has shown that memory location is indeed a factor in determining operation speed of the chip, though not the limiting factor.

6.2.7 Multiple Chips

The last set of experiments was devised to test the operation of the VAMPIRE when multiple chips (two, exactly) were connected together. The object was to determine how the presence of other chips degraded the overall performance, if there was any degradation at all.

The first test involved exact matches with vectors on separate chips. The operation went smoothly, and no significant loss of performance was noticed. The next test involved inexact matches. This is when I discovered that something was wrong. Only certain cases of inexact matches produced correct results. By probing the COMPARE pins, it was ascertained that they rarely contained the proper values. After a series of diagnostic tests, it was discovered that in most cases both of the CHIP_VALID lines were low. Somehow the presence of an additional AM chip caused both chips to be eliminated from contention.

Further tests and an examination of the original layout revealed the source of the problem. The NOR gate shown in Figure 31 had actually been laid out as an inverter connected only to the COMPARE line (see Figure 51). Because there was no link between the CHIP_VALID_IN* signal and the I/O pad's enable line, the VAMPIRE chip would always drive the COMPARE pin if its corresponding internal COMPARE line was low. Thus, a chip that should be eliminated from further participation in the winner selection process can still affect the outcome. Note that the internal CHIP_VALID_IN and CHIP_VALID_OUT lines for each chip still function correctly; because of this, interconnected chips can effectively eliminate each other without either claiming the overall victory.

For example, if the winning codeword on one chip has a distance of 12 (1100₂), and the winning distance on a second chip is 9 (1001₂), then neither chip can drive the output. The third bit position of the nine inhibits the first chip from winning, and likewise the LSB of the twelve inhibits the other chip. On the other hand, if the winning distance of the second chip had been an 8 (1000₂), then the resulting chip operation would appear correct.
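A short sketch reproduces the fault numerically. It assumes the faulty behavior described above: each chip unconditionally pulls a shared, pulled-up COMPARE line low wherever its own distance bit is a zero, and a chip asserts its validity only when the resulting bus value exactly equals its internal distance.

# Illustrative model of the flawed inter-chip arbitration (not the actual pad circuitry).
def compare_bus(distances, bits=4):
    bus = 0
    for b in range(bits):
        # a line stays high only if every chip has a 1 in this bit position
        if all((d >> b) & 1 for d in distances):
            bus |= 1 << b
    valid = [d == bus for d in distances]
    return bus, valid

print(compare_bus([12, 9]))   # (8, [False, False]) -> both chips eliminated from contention
print(compare_bus([12, 8]))   # (8, [False, True])  -> appears correct, as noted above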

6.3 Real-Time Video Test

Despite the limitations of the VAMPIRE chip, it was still possible to use the chips within a real-time video compression framework. Instead of using 256 quantization levels (which requires multi-chip operation), the VQ codebook was re-trained for only 32 codewords, thereby avoiding the problem of interconnecting chips. The second issue dealt with overcoming the slow operation of the associative memory. A method was devised which involves running two chips in parallel. Each is loaded with an identical codebook, and the data input alternates between the two memory chips. Referred to as "ping-ponging" the operation of each AM, this approach effectively cuts the required system speed in half, which provides ample time for the VAMPIRE to reach the correct output.

With these modifications in place, the AM chip was plugged into the VQ hardware³. This hardware was built to implement the block diagram shown in Figure 2. The result was the first all-digital implementation of a real-time video-rate vector quantizer. The picture quality was acceptable given the 32 to 5 compression ratio (6.4). It is expected that 256 codewords would yield a broadcast-quality image; 32 codewords do work when ping-ponged, and work-arounds for the chip connection problem are discussed in the following section.

6.4 Discussion

Chip testing has indicated that there are two flaws in the VAMPIRE chip implementation which prevent the ASIC from operating at its design specifications: (1) the internal COMPARE line pull-down transistors do not possess adequate current drive capacity, and (2) the inter-chip communication circuitry prevents multiple chips from being directly connected, disallowing a codebook size greater than 32 codewords. Of course, both of these errors can be easily rectified for future versions of the chip. The question is, "How can the circuitry be modified to effectively use the chips which are readily available?"

³Built by Jim Fowler in the SPANN lab, OSU.

By far and away the less serious of these two design errors is the first one: inadequate current drive. Two relatively simple work-arounds exist for this problem. The first solution was the "ping-pong" method implemented for the real-time video tests.

A second solution almost makes the problem seem trivial. It simply involves storing each codeword into two distinct memory locations on the same chip. The result of storing duplicate codewords is to double the effective size of the pull-down transistors on the COMPARE lines. Both of these methods cut the storage capacity of the chip set in half.

The second design error, which prevents multi-chip circuit operation, presents a major obstacle towards the implementation of a large codebook. Though potential solutions exist to correct the oversight, they are much more involved than those for the previous problem. The proposed solutions are described below.

One way to view the situation is to consider that the chip does operate correctly in a single-chip mode. In this mode, the values on the COMPARE pins simply mirror the corresponding internal voltages. Thus, one approach is to externally duplicate the circuitry that was supposed to internally provide inter-chip arbitration. Figure 51 illustrates the concept for one bit-slice of the COMPARE lines. Two problems plague this type of solution. The first is the logistics of wiring this much circuitry for every COMPARE pin of every chip. Though the gates could be easily coded into a programmable logic device (PLD), the device would quickly become pin limited, as each additional AM chip requires ten inputs. Furthermore, if all chips cannot be accommodated with a single PLD, then a multi-level hierarchy of PLDs will be needed.


Figure 51: Correcting the arbitration circuitry externally.

The other problem with this solution is that the internal CHIP_VALID signal will no longer be correct. Since the associative memories are essentially operating independently, each chip assumes it contains the winning codeword. The IN line could be used to control which of the chips is allowed to write to the output label bus, but then its original function (that of arbitrating multiple winners on different chips) is lost. Alternatively, the I/O pad that is used for the LABEL lines could also be duplicated external to the chip, with its function governed by the externally-generated CHIP_VALID.

Another approach for allowing multi-chip operation goes back to the binary compare tree structure discussed in Section 2.2.3. Shown in Figure 52, this solution appears more feasible than the last. It requires log₂ N levels of 10-bit comparators for N AM chips. Depending on the state of the comparators, the appropriate IN line will be set high for the chip containing the winning codeword.
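The idea can be sketched as a small tournament; the code below is illustrative only and simply models each 10-bit comparator as a pairwise minimum over the chips' winning distances.

# Illustrative sketch of the external compare tree: the surviving index selects the IN line to drive high.
def select_winning_chip(distances):
    contenders = list(range(len(distances)))
    while len(contenders) > 1:
        nxt = []
        for i in range(0, len(contenders), 2):
            pair = contenders[i:i + 2]
            nxt.append(min(pair, key=lambda c: distances[c]))  # one comparator per pair, per level
        contenders = nxt
    return contenders[0]

print(select_winning_chip([37, 12, 55, 12]))   # chip 1 wins (ties resolve to the lower index)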

A wholly different approach incorporates principles from some of the sub-optimal VQ techniques. This approach takes advantage of the current operational chip behavior to produce a functional 128-codeword DVQ system. Eight of the VAMPIRE chips will be loaded with codebook vectors, each chip storing only 16 different codewords. The contents of each odd-numbered memory location will be a duplicate of the preceding even-numbered memory location. As discussed earlier, this will ensure that each AM chip will be capable of running at 3.57MHz. Two more VAMPIRE chips will be set up in the "ping-pong" configuration to serve as "selection chips." The codewords on these selection chips are chosen such that the overall codeword space is divided into 8 supersets of the original Voronoi regions. Each superset must represent a union of exactly 16 of these Voronoi regions, and every Voronoi region must be a member of only one superset. Thirty-two codewords can be used to define the boundaries of the eight supersets, for an average of four codewords per superset (though the distribution of codewords per superset need not be uniform). The sixteen codevectors that correspond to a single superset would all be stored on a single codebook chip.

Figure 52: Example of using comparators to determine which of the four chips contains the winning codevector.

Assuming the codevector space topology allowed such a division with an adequate amount of precision, the system would process data in two stages (see Figure 53).

In the first stage, the selection memory decides which of the eight codebook chips contains the winning codeword for the given input. The output label from the first set of VAMPIRE chips is run into a RAM look-up table which defines the selection codeword to superset mapping. Depending on the superset that is chosen, the IN line of the corresponding AM chip is set high, allowing the output label bus to be controlled by that chip. The key to making this type of approach work is being able to compute the appropriate superset-codeword relationship. Perhaps using the original training data to adapt eight versus 128 codewords would be a good start.
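A software model of this two-stage lookup is sketched below. The function names, the table sizes, and the plain L1 distance (standing in for the chip's approximate distance measure) are illustrative assumptions; the superset-to-chip mapping is exactly the role of the RAM look-up table described above.

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define DIM 4

static unsigned l1_dist(unsigned char *a, unsigned char *b)
{
    unsigned d = 0;
    for (int k = 0; k < DIM; k++)
        d += (unsigned)abs((int)a[k] - (int)b[k]);
    return d;
}

static int nearest(unsigned char cw[][DIM], int n, unsigned char *x)
{
    int best = 0;
    unsigned best_d = UINT_MAX;
    for (int i = 0; i < n; i++) {
        unsigned d = l1_dist(cw[i], x);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}

/* Stage 1: 32 selection codewords; a RAM look-up table maps the selection
 * label to one of 8 supersets.  Stage 2: the 16 codewords of the chosen
 * codebook chip.  The final label is (superset << 4) | within-chip index. */
static int dvq_encode(unsigned char sel_cw[32][DIM], unsigned char superset_lut[32],
                      unsigned char book[8][16][DIM], unsigned char *diff_vec)
{
    int sel   = nearest(sel_cw, 32, diff_vec);
    int chip  = superset_lut[sel];             /* asserts that chip's IN line */
    int label = nearest(book[chip], 16, diff_vec);
    return (chip << 4) | label;                /* 7-bit, 128-codeword label   */
}

int main(void)
{
    static unsigned char sel_cw[32][DIM], lut[32], book[8][16][DIM];
    unsigned char v[DIM] = { 12, 200, 7, 90 };
    /* with all-zero tables this only exercises the control flow */
    printf("label = %d\n", dvq_encode(sel_cw, lut, book, v));
    return 0;
}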

Figure 53: The selection codebook evaluates the input data and selects the appropriate memory chip to perform the VQ association.

One last suggestion for uses of the VAMPIRE chip in its current state is a two-stage predictive MSVQ implementation. Shown in Figure 54, two stages of 16-vector codebooks would result in an eight-bit quantization, which is the original target value. This design could be implemented at the 280ns video rate without any extraneous wiring; simply load duplicate codewords in adjacent locations and ignore the LSB of the output labels to obtain the encoded channel data.

Figure 54: A two-stage predictive MSVQ system using the VAMPIRE chips in their current state.

CHAPTER VII

Alternative Applications

Though the VAMPIRE chip was originally designed and fabricated as a specialized processor capable of determining the closest match in a vector quantization application, the associative memory ASIC also has more general characteristics useful for a variety of alternative applications. Its versatility is due in part to the fact that the chip behaves as a maximum likelihood detector, which can be used in many forms of digital communication systems, ranging from matched filtering and symbol synchronization to source coding. In the first section, I will explain how the VAMPIRE chip can be used for a special type of maximum likelihood estimation known as correlation detection, and point out other (more general) uses of the chip in subsequent sections.

7.1 Maximum Correlation Detection

When considering many types of communication signals, it is known that the correlation coefficient between the signals is an important quantity in determining how well one signal can be distinguished from another. The correlation coefficient that relates signal s_i(t) to s_j(t) is given by

    \rho_{ij} = \frac{1}{\sqrt{E_i E_j}} \int_0^T s_i(t)\, s_j(t)\, dt \qquad (7.1)

where E_i and E_j are the energies of the respective signals over the period T. When the continuous-time signals s_i(t) and s_j(t) are symbols from a digital communication system, it is usually more convenient to represent them in a vector format, the elements of the vector being a discrete set of basis function coefficients. Expressed in this manner, the signals s_i(t) and s_j(t) are written as K-dimensional vectors, and Equation 7.1 becomes

    \rho_{ij} = \frac{\sum_{k=1}^{K} s_{ik}\, s_{jk}}{\sqrt{E_i E_j}} = \frac{\mathbf{s}_i \cdot \mathbf{s}_j}{\|\mathbf{s}_i\|\,\|\mathbf{s}_j\|} \qquad (7.2)

The integral reduces to a sum, and the signal energies become vector magnitudes.

One example of signal detection involves correlating a set of N allowable channel symbols against the received signal. The member of the symbol set which is most correlated to the received signal is taken to be the transmitted symbol. This process involves the calculation of N correlation coefficients, followed by a global comparison to determine the largest coefficient. The symbol that corresponds to the maximum correlation represents the most likely symbol.

Under certain conditions, the procedure described above is analogous to the vector quantization process performed by the VAMPIRE chip. By normalizing the set of stored symbols (codewords) such that they all have the same magnitude, it is possible to show that the designed memory structure executes the appropriate steps. In other words, we wish to show that

    Q(I) = \{\, j : \rho_j = \max_i \rho_i \,\} \qquad (7.3)

where Q(I) represents vector quantization of the input vector I, and ρ_j is the correlation coefficient between the input vector and the j-th codeword vector C_j. Here,

    \rho_j = \frac{I \cdot C_j}{\|I\|\,\|C_j\|} \qquad (7.4)

Since we have specified that the codeword vectors all have equal magnitudes, then for a given input vector the denominator of Equation 7.4 is a constant. The stored word which maximizes the dot product corresponds to the most correlated codeword.

Under these same conditions, the Euclidean distance measure can be expressed as

    d_j^2 = \|I - C_j\|^2 = \|I\|^2 + \|C_j\|^2 - 2\, I \cdot C_j \qquad (7.5)

Again, because the magnitude of all C_j is constant, the first two terms in Equation 7.5 sum to a fixed value. Thus, the index j that maximizes the expression I · C_j corresponds to the stored vector having the largest correlation coefficient ρ_j and the smallest Euclidean distance d_j with respect to a given input vector. Note that in reality the VAMPIRE chip only approximates the Euclidean distance measure, and likewise only approximates the correlation function.
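The equivalence argued above is easy to check numerically. The following small C program, with an arbitrary set of unit-magnitude codewords chosen only for illustration, confirms that the codeword maximizing the dot product is also the one minimizing the Euclidean distance.

#include <math.h>
#include <stdio.h>

#define K 4
#define N 3

int main(void)
{
    /* three codewords, each scaled to unit magnitude */
    double c[N][K] = { {0.5, 0.5, 0.5, 0.5}, {1, 0, 0, 0}, {0, 0.6, 0, 0.8} };
    double in[K]   = { 0.9, 0.3, 0.1, 0.2 };
    int best_dot = 0, best_dist = 0;
    double max_dot = -1e30, min_dist = 1e30;

    for (int j = 0; j < N; j++) {
        double dot = 0.0, dist = 0.0;
        for (int k = 0; k < K; k++) {
            dot  += in[k] * c[j][k];                       /* correlation term  */
            dist += (in[k] - c[j][k]) * (in[k] - c[j][k]); /* squared distance  */
        }
        if (dot > max_dot)   { max_dot = dot;   best_dot  = j; }
        if (dist < min_dist) { min_dist = dist; best_dist = j; }
    }
    printf("max dot product: codeword %d, min distance: codeword %d\n",
           best_dot, best_dist);
    return 0;
}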

7.2 Symbol Detector

The example of a symbol detector described in the previous section can be effective for decoding a large number of modulation schemes. A broader class of signals for which the associative memory can be used to detect symbols is called Amplitude and Phase Shift Keying (APK) [73]. APK, or Quadrature Amplitude Modulation (QAM), is a general class of M-ary signals in which both the amplitude and the phase of the signal can be changed by discrete amounts. Figure 55 shows the signal space of a 16-ary APK, generated using four amplitude levels and four phase shifts. The i-th signal can be described by the following equation:

    s_i(t) = S_i \cos(\omega_0 t + \theta_i), \qquad 0 \le t \le T \qquad (7.6)

The significance of this type of code is that Euclidean distance, not strictly a correlation value, is used for decoding the signal. Figure 55 illustrates a received vector and the distances associated with the five closest allowable symbols. The VAMPIRE chip is ideally suited in this case to provide maximum-likelihood detection due to its general distance detection processing capabilities. A cost-effective decoder is important since APK provides lower error rates than other M-ary modulation systems operating at the same rate [74].
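As a rough illustration of this kind of decoding, the C fragment below builds a hypothetical 16-APK constellation from four amplitude levels and four phase shifts and detects a received sample by a nearest-neighbor (minimum Euclidean distance) search, which is the operation the chip performs over its stored codewords. The constellation spacing is an assumption and not the one plotted in Figure 55.

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double pi = 3.14159265358979;
    double I[16], Q[16];                       /* constellation points        */
    for (int a = 0; a < 4; a++)                /* four amplitude levels       */
        for (int p = 0; p < 4; p++) {          /* four phase shifts           */
            int i = 4 * a + p;
            I[i] = (a + 1) * cos(p * pi / 2.0);
            Q[i] = (a + 1) * sin(p * pi / 2.0);
        }

    double rI = 1.8, rQ = 2.3;                 /* received (noisy) sample     */
    int best = 0;
    double best_d = 1e30;
    for (int i = 0; i < 16; i++) {             /* minimum-distance decoding   */
        double d = (I[i] - rI) * (I[i] - rI) + (Q[i] - rQ) * (Q[i] - rQ);
        if (d < best_d) { best_d = d; best = i; }
    }
    printf("detected symbol %d (amplitude %d, phase %d x 90 degrees)\n",
           best, best / 4 + 1, best % 4);
    return 0;
}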

7.3 Symbol Synchronization

Now that the chip has been established as a useful tool for symbol detection, another application for such a device is symbol synchronization (SS) in high-speed digital communications. Symbol synchronization refers to the process of determining the locations of symbol boundaries so that efficient detection can be performed. This is necessary for receiving terminals that can be switched on and off asynchronously with respect to the transmitted symbol stream. Upon power-up, a receiver must first synchronize to the symbol boundaries before detection can occur. Even after power-up, SS is needed to periodically correct drift due to clock discrepancies between transmitter and receiver. Proper SS is imperative in order for digital communications to be effective, especially for higher-order modulation schemes like M-ary Phase Shift Keying (PSK).

Figure 55: Signal space of a 16-ary amplitude and phase shift keying (APK) system, showing the codeword vectors and a received signal vector.

One method of synchronizing symbols using the VAMPIRE chip is to load the different memory locations with the expected incoming waveform sampled and shifted by incremental fractions of the symbol period. (Again, the stored samples are normalized.) The stored vector that is selected as the closest match to the input vector is most correlated to the received symbol, and the index of that vector corresponds to the amount of time between the sample points and the symbol boundary.

For example, Figure 56 shows two adjacent binary PSK signals (“1” then “0”). The dashed box denotes a time frame corresponding to one period; in this case the expected waveform is sampled four times over a symbol period. The progression of the box from left to right represents incremental time shifts in which the (1,0) sample waveform will be sampled and normalized. The resulting set of four-dimensional vectors is then stored into the AM. To perform SS, the received signal is also quantized at four locations, and these values are presented as the input to the AM. The codeword index out of the chip can be mapped to a corresponding phase delay from the symbol boundary. Note that a similar procedure would likely be used for the (0,1) transition as well.
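A minimal sketch of building such a synchronization codebook is shown below. It assumes a carrier-modulated (1,0) BPSK transition, four samples per symbol, and eight incremental window offsets; the waveform shape, the offset count, and the names are illustrative only.

#include <math.h>
#include <stdio.h>

#define DIM    4                      /* samples per symbol period      */
#define NSHIFT 8                      /* codebook resolution: T/8 steps */

/* expected waveform: bit "1" for t in [0,T), bit "0" for t in [T,2T),
 * modulated onto one carrier cycle per symbol (an illustrative choice) */
static double expected(double t, double T)
{
    const double pi = 3.14159265358979;
    double bit = (t < T) ? 1.0 : -1.0;
    return bit * cos(2.0 * pi * t / T);
}

int main(void)
{
    double T = 1.0, cw[NSHIFT][DIM];

    for (int s = 0; s < NSHIFT; s++) {
        double off = s * T / NSHIFT, mag = 0.0;
        for (int k = 0; k < DIM; k++) {
            cw[s][k] = expected(off + k * T / DIM, T);   /* slide the window  */
            mag += cw[s][k] * cw[s][k];
        }
        mag = sqrt(mag);
        for (int k = 0; k < DIM; k++)
            cw[s][k] /= mag;          /* normalize before storing        */
    }
    /* codeword index s now maps to a phase delay of s*T/8 from the boundary */
    printf("codeword 3: %.2f %.2f %.2f %.2f\n",
           cw[3][0], cw[3][1], cw[3][2], cw[3][3]);
    return 0;
}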

7.4 Beam Forming

Another application of the VAMPIRE chip which exploits its correlation and winner-selection properties emanates from a branch of antenna theory called array theory, or beam forming (BF).

Figure 56: Codewords are derived by sampling the expected waveform at four locations at incrementally different time delays, as illustrated with the sliding window.

The basic concept of beam forming is to use a linear array of identical, equally-spaced antennae to achieve a highly directional radiation pattern.

The power of using BF techniques becomes evident when it is realized that one of the most basic and commonly-used types of antenna is the dipole, which (laterally) radiates energy equally in all directions.

A common example of an antenna array can be illustrated with radio and TV transmitters in cities that border large bodies of water. Because there is no listenership in the direction of the water, a single dipole antenna would waste a significant amount of energy broadcasting to desolate areas. However, a simple arrangement of two dipole antennae can be used to redirect the energy to populated regions (see Figure 57).

Due to the reciprocity between signal transmission and reception, array theory applies equally to the directionality of a receiving antenna. In many applications characteristics of the transmitted signal are known, but the location of the source may be unknown. For example, guided tracking systems, submarine sonar, and homing beacons all exhibit this property. In these applications, an AM like the VAMPIRE chip could be loaded with samples of the expected signal as detected from many different angles with respect to the antenna array. The vector dimension would correspond to the number of antenna elements, and the number of codewords would define the resolution of the array. Note that an actual system implementation of this application may also need a synchronizing controller to optimize the sampling times.

Also, further research would likely be needed for qualifying the received signal to avoid false positives; the strongest signal isn't necessarily the correct one.
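The following C sketch illustrates the idea for a hypothetical eight-element, half-wavelength-spaced linear array: each codeword is the expected array snapshot for one candidate arrival angle, and the closest match to a received snapshot gives the direction estimate. All of the geometry and resolution choices here are assumptions for illustration, not part of the proposed system.

#include <math.h>
#include <stdio.h>

#define M    8                        /* antenna elements (vector dimension) */
#define NANG 31                       /* codewords = angular resolution      */

int main(void)
{
    const double pi = 3.14159265358979;
    double cw[NANG][M];

    for (int a = 0; a < NANG; a++) {                 /* -75 to +75 degrees   */
        double theta = (-75.0 + 5.0 * a) * pi / 180.0;
        for (int m = 0; m < M; m++)                  /* half-wavelength spacing */
            cw[a][m] = cos(pi * m * sin(theta));
    }

    /* pretend the received snapshot comes from 20 degrees */
    double rx[M];
    for (int m = 0; m < M; m++)
        rx[m] = cos(pi * m * sin(20.0 * pi / 180.0));

    int best = 0;
    double best_d = 1e30;
    for (int a = 0; a < NANG; a++) {                 /* closest-match search */
        double d = 0.0;
        for (int m = 0; m < M; m++)
            d += (rx[m] - cw[a][m]) * (rx[m] - cw[a][m]);
        if (d < best_d) { best_d = d; best = a; }
    }
    printf("estimated arrival angle: %.0f degrees\n", -75.0 + 5.0 * best);
    return 0;
}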

Figure 57: Example of using an antenna array to direct radiated power.

7.5 Artificial Neural Networks

A significant amount of research has explored the use of artificial neural networks (ANNs), specifically Hopfield nets, as associative memories. However, this implementation of an AM will probably never be as efficient as a dedicated piece of AM hardware, since ANNs possess much more general characteristics which are not necessarily optimized for associative matching. Furthermore, the traditional drawbacks to ANN implementations (analog weight storage, extensive multiplications) further detract from using a neural net in this way. Conversely, there is seemingly no literature which suggests using associative memories as ANNs (except maybe [68]). Since no such precedent exists, I will attempt to show here why this approach might yield some interesting and applicable results.

Interest in ANNs first became popular during the 1950s, but work by Minsky and Papert later brought much of the field to a halt. They pointed out that the single-layer perceptron (a class of feed-forward networks widely studied at the time) was incapable of adapting its weight values to solve the relatively simple exclusive-or (XOR) classification problem. Later it was discovered that a multi-layered network is needed to perform the XOR and other non-linear classification problems. In general, the more complex the classification is, the more involved the network structure gets. As of yet there is no straightforward method for determining the optimal structure of a network to solve a given problem. An even more difficult problem is to find the weight assignments based only on a small view of the data.

One example of the on-going research in this key field can be found in [75].

On the other hand, linear and non-linear classification problems are both relatively straightforward to implement with the AM. There are no complex weight values to compute; simply place a codeword at the desired location. For example, the problem of classifying an input point to one of two intertwined spirals, an order of magnitude more difficult than the XOR, is illustrated in Figure 58. Forty codewords were simply placed along each spiral's radial arm to define the regions depicted in the graph. In the graph, codewords are denoted with asterisks (*) and circles (o); the pluses (+) mark the boundary that partitions the two spirals. Imagine what type of NN structure and corresponding set of weights are needed to achieve equivalent resolution.
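A small C sketch of the two-spirals example follows: codewords are simply laid along each spiral arm, and a point is classified by the label of its nearest codeword, with no weight training involved. The spiral parameterization below is illustrative and not the one used to generate Figure 58.

#include <math.h>
#include <stdio.h>

#define NCW 40                                   /* codewords per spiral arm */

static double cx[2 * NCW], cy[2 * NCW];
static int    cls[2 * NCW];

static int classify(double x, double y)          /* nearest-codeword label   */
{
    int best = 0;
    double best_d = 1e30;
    for (int i = 0; i < 2 * NCW; i++) {
        double d = (x - cx[i]) * (x - cx[i]) + (y - cy[i]) * (y - cy[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return cls[best];
}

int main(void)
{
    const double pi = 3.14159265358979;
    for (int i = 0; i < NCW; i++) {              /* lay codewords along arms */
        double t = 0.5 + 3.0 * pi * i / (NCW - 1);
        cx[i]       =  t * cos(t);  cy[i]       =  t * sin(t);  cls[i]       = 0;
        cx[NCW + i] = -t * cos(t);  cy[NCW + i] = -t * sin(t);  cls[NCW + i] = 1;
    }
    printf("point (2, 1) is classified as spiral %d\n", classify(2.0, 1.0));
    return 0;
}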

The general relationship between the AM architecture and the NN structure is that the number of input neurons must equal the vector dimension of the AM. For the NN, the number of output neurons corresponds to the number of desired responses, and the number of hidden layers/neurons needed is dependent on the complexity of the final classification space. In the case of the AM, the number of codewords represents the number of distinct Voronoi regions in the classification space. If a particular region does not meet the Dirichlet boundary conditions (Chapter I), it must be broken down into a union of (possibly overlapping) Voronoi regions.

One advantage of the AM implementation of an ANN classifier is that the codeword configuration can be generated quickly and analytically. NN weight computations are still generally quite intensive, and they can often settle into local energy minima, which don't necessarily represent the optimal solution. Though an implementation like the VAMPIRE chip has no provisions for modifying the effective classification space (codeword values), this capability is not out of the question. For example, the associative memory proposed by Fang and Sheu [68] can supposedly modify its stored contents using the FSCL algorithm.

Figure 58: Partition of the codeword space into two intertwined spirals using 80 codewords total.

7.6 Non-Destructive Evaluation

There are many industrial processes in which a product moves continuously through a manufacturing plant and must be visually inspected to determine its quality. Examples of such web processing lines are those for certain types of glass, metals, polymers, and fabrics. Defective areas of the web need to be removed from the line (or at least marked to be removed later) so that the amount of faulty material reaching the final product is minimized. A variety of nondestructive evaluation (NDE) methods produce an image which has features that must be categorized as either defects or false positives.

Human inspection of a web processing line is not suitable under many industrial conditions for a variety of reasons. Among them are:

• The manufacturing equipment is likely able to operate at rates much higher than a human could possibly handle.

• Because of the tedious, repetitive nature of the job, the probability of error (missing defects) is relatively high.

• Some manufacturing lines may be hazardous to human health.

On the other hand, when it comes to a detailed examination of a potential defect, human expertise often offers the best judgment of how the anomaly should be classified.

One potentially feasible solution to the NDE process is to scan across the web at a very high speed and classify the resulting spot images. The spot being scanned would be broken down into a number of regions. Each region would be represented as some numerical value (based on image intensity, for example), and the collection of values would then be the image vector. The idea is that defects in the material would result in certain types of vectors while acceptable material would result in other types of vectors. Thus, this problem maps directly to the video source coding problem, in which each piece of the image is classified as to its category. For the NDE case, the index of the category identifies the defect that is being screened.

A next-generation version of the associative memory is under development for use in high-speed image pattern classification. In a variety of manufacturing environments, the classifications must be done in real time. In such cases, processing rates of 10 million measurements per second are expected. It is conceivable that specialized processors such as the VAMPIRE chip will be utilized as a chip set to achieve Teraop performance.

7.7 Summary

The maximum likelihood detection/classification characteristics of the VAMPIRE chip lend themselves well to many alternative applications. Among the uses examined here were: symbol detection, correlation filters, symbol synchronization, beam forming, neural networks, and NDE applications. Further research is required for many of the described applications; this chapter was meant merely to introduce their feasibility.

Certainly there exist many more potential applications for which the VAMPIRE chip could be used. Probably the ones with the most potential are those that require real-time, high-speed classifications.

CHAPTER VIII

Recommendations and Conclusions

The vector-quantizing VAMPIRE chip reflects a growing trend in modern commercial IC production, namely the integration of memory and processing capacity [76, 77].

In fact, the name of the chip itself includes both the words memory and processor just to stress this duality. The distinction between specialized memory products and high-performance processors has become increasingly more blurred. Many recent RISC (Reduced Instruction Set Computer) architectures combine larger on-chip caches with a less complex processing unit to achieve better performance. Likewise, ASIC memories such as the Video RAM (VRAM) implement a certain degree of autonomy to help improve overall system speed. The Connection Machine from the Thinking Machines Corporation is a prime example of memory-processor integration [78]. Here, literally thousands of simple processors, each associated with a small amount of "dedicated" memory, are linked together to form a general purpose computer capable of operating in the Teraop region for certain applications. In many ways the VAMPIRE chip uses these same concepts to achieve its performance.

Chips like the VAMPIRE chip will almost certainly be needed when VQ becomes widely used in video compression systems. Because VQ offers improved quantization performance, it can often be used to enhance existing algorithms; regardless of the manner in which video data is processed, at some point it must be quantized. Though the AM design presented here has been somewhat area intensive, this approach lends itself well to wafer-scale integration techniques. Because location is not important within an AM structure, some degree of fault tolerance could be built into a chip of this type to route around wafer defects [79].

The work presented in this dissertation has provided a solid foundation for video-rate VQ chip design. If I were able to completely redo the VAMPIRE chip from scratch, I would want to incorporate the following ideas:

• Move to a synchronous design for its added flexibility. The basic circuit operation has been proven with the current design; a next-generation chip should offer features that were not available with the design presented here. Some of these issues of increased flexibility are addressed below.

• Incorporate enough SRAM so that the chip can be programmed to expand the number of VAMPIRE components to at least 16. Because the SRAM is much more dense than the computation cells, not much layout area will be sacrificed even if the extra storage goes unused.

• Explore the use of dynamic RAM for the codeword storage and dynamic logic for much of the computation circuitry. These may offer even higher performance in less area.

• Pipeline the distortion computation with the winner selection process. Though it adds one vector delay of latency, a pipelined approach means that the chip clock could operate at the same rate as the system clock. If adequate speed-up in processing time can be obtained through the use of pipelining and dynamic logic, perhaps a single processing element could be used to perform two consecutive distortion computations, effectively doubling the capacity of the chip. Of course, the winner selection would have to be completed in half the time.

• Along with the accumulated L1 distortion, maintain a separate register for the L∞-norm. Make the final distortion measure a programmable function of the L1 and L∞ norms to obtain a better L2 approximation (a software sketch of such a combined measure follows this list).

• Explore the possibility of implementing the winner selection process in a bit-serial fashion as in [80]. This approach relies on the same type of wired-NOR comparison, yet only a single line is needed to interconnect chips. As the distance measure is clocked through its bits, a register is used to indicate whether a codeword is still valid or whether it has been eliminated from competition.

• Obviously, fix the two design bugs that prevented the current VAMPIRE chip from operating up to design specifications.
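As a minimal sketch of the combined L1/L∞ measure suggested in the list above, the following C fragment accumulates both quantities and blends them with programmable weights. The particular weights are placeholders; suitable values would have to be fitted to the vector dimension and the data, which is left open here.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define DIM 4

int main(void)
{
    int x[DIM] = { 120, 40, 200, 10 };         /* input vector                 */
    int c[DIM] = { 100, 60, 190, 30 };         /* stored codeword              */

    int l1 = 0, linf = 0;
    for (int k = 0; k < DIM; k++) {
        int d = abs(x[k] - c[k]);
        l1 += d;                               /* accumulated L1 distortion    */
        if (d > linf) linf = d;                /* separate L-infinity register */
    }

    double a = 0.5, b = 0.5;                   /* programmable weights         */
    double approx = a * l1 + b * linf;         /* cheap stand-in for L2        */

    double l2 = 0.0;                           /* true L2, for comparison      */
    for (int k = 0; k < DIM; k++)
        l2 += (double)(x[k] - c[k]) * (x[k] - c[k]);
    l2 = sqrt(l2);

    printf("L1 = %d, Linf = %d, a*L1 + b*Linf = %.1f, L2 = %.1f\n",
           l1, linf, approx, l2);
    return 0;
}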

Figure 59 shows what a processing element might look like that incorporates some of these ideas. Note that there is a tradeoff between area efficiency and the number of features implemented. Each added degree of flexibility adds some amount of overhead area.

There is also a great deal of research that could be done to further enhance real-time video-rate VQ by working in related areas. I would recommend the following fields of specialization:

Figure 59: Serial processing element with programmable vector dimension, comprising INPUT and PROGRAMMABLE CONTROL blocks, an array of RAM cells, an ACCUMULATOR, an L-INF register, and a WINNER SELECT block.

• Explore the trade-offs of using larger block sizes coupled with some of the sub-optimal VQ techniques discussed in Chapter II versus using the full search approach and smaller vector dimensions. This is an important issue since the number of codewords needed increases exponentially with vector dimension. Not only will extremely large codebooks be difficult to search, but at some point it will become prohibitively expensive to arrive at the optimal set of codewords. In this case a technique such as MSVQ could be beneficial, since this process requires much smaller codebooks. The issue now becomes which method provides superior performance: MSVQ on an 8x8 block of pixels, or full-search VQ on a 2x2 block.

• Explore more efficient pre- and post-VQ processing steps. The quantizing step is always more efficient when the input variance is reduced. Also, the output is most random (incompressible) when the codewords of a vector quantizer are equally used. Thus, much effort should be focused on better predictors (for systems that use memory), better ways to quantize transform coefficients, and methods of incorporating knowledge-based algorithms into the compression system. For example, given a certain fixed codeword distribution, it may be possible to use information from previous pixels to adjust the vector region centroids. Generally, a received codeword index indicates only that the encoded vector falls somewhere within the Voronoi region associated with the codeword. Regardless of where the original input vector lay within this region, it will be decoded only as the centroid (i.e., the code vector). However, perhaps a more detailed study would reveal that in certain situations performance can be enhanced by adjusting the decoded vector from the fixed point that corresponds to the region's center. To take this technique one step further, you could use the fact that you know how the decoder will be adjusting the region centroids to essentially predistort the codeword space and improve compression performance even more. Other methods include using an entropy code (or a vector-entropy code, like the double-level Huffman code used in the NASA DPCM algorithm) on the codeword indices, if it turns out that the codeword distribution is not even.

• Another area which deserves further study is an analog implementation of the VAMPIRE chip. Analog circuitry allows a much higher degree of integration density, possibly enough to fit an entire codebook on-chip. The two implementations reported in [67, 66] purport to have circuit structures which allow a high enough degree of precision, though the results do not back up their claim. However, it is conceivable that self-calibrating devices like the ones in [67] may be able to provide adequate accuracy with a good IC process.

• The alternative applications described in Chapter VII deserve considerably more attention than what could be provided in this research. Further study into potential applications of the VAMPIRE chip would almost certainly reveal more uses. For example, Sigma-Delta (Σ-Δ) converters, which are so popular now, use scalar quantizers; there is no reason to believe that they could not benefit from vector quantization techniques. Another example is block matching algorithms, which are used in motion compensation video compression schemes. Selecting the most similar block from a previous frame is analogous to VQ encoding.

While some of these recommendations emphasize work in the domain of VQ, others pertain mostly to chip design. This illustrates the inter-dependence between algorithm and hardware, a relationship I have attempted to stress throughout this work.

This dissertation has introduced the Vector-quantizing Associative Memory Processor Implementing Real-time Encoding (VAMPIRE) chip. It was successfully designed and fabricated to meet the requirements of a real-time video-rate vector quantization encoder. Experimental results demonstrated the operational behavior of the chip, and a hardware implementation of a real-time video system provided conclusive evidence of its performance. This research has resulted in a novel architecture which may prove to be at the forefront of emerging hardware video compression technology.

Because of their inherent advantages, VQ techniques will play an increasingly larger role in the coming era of multi-media networks, increased cable-TV subscribership, and the phone companies' entrance into the video services market. ASIC chips like the VAMPIRE will be essential to obtain the type of compression ratios and resulting image quality that is being targeted [81]. The VAMPIRE chip has been designed in a general enough fashion to make it a useful device for a number of other high-speed applications as well. Subsequent generations of the chip should enhance overall performance and provide greater flexibility.


“There are two things which I am confident I can do very well: one is an introduction to any literary work, stating what it is to contain, and how it should be executed in the most perfect manner; the other is a conclusion, shewing from various causes why the execution has not been equal to what the author promised to himself and to the public.”

— Samuel Johnson

“All you need in this life is ignorance and confidence, and then success is sure

— Mark Twain

Appendix A

The Mathematics of Information Theory

A.1 Information Theory

No study of data compression and information theory would be complete without including some of the traditional measures of information content. Most of modern communication theory is based on the work of Claude E. Shannon, made famous in his 1948 paper "A Mathematical Theory of Communication" [2]. Shannon devised a method of quantifying the amount of information a discrete source could produce based on the probability distribution of the symbols in its alphabet. The following equations form the foundation of his theorems. For easier display of the equations, the following shorthand notations have been adopted to represent probability functions: P(X = x_i) => p_i, P(X = x_i | Y = y_j) => p_{i|j}, P(X = x_i, Y = y_j) => p_{ij}, etc. Also, unless otherwise stated, all logarithms are to the base 2, and all quantities are measured in bits.

Self-Information - the amount of information the single event X = x_i conveys about itself.

    I(x_i) = \log \frac{1}{p_i} = -\log p_i \qquad (A.1)


Average Self-Information - the average amount of information the occurrence X conveys about itself.

    H(X) = \sum_i p_i \log \frac{1}{p_i} = -\sum_i p_i \log p_i \qquad (A.2)

The H term refers to entropy, or uncertainty, and is derived from Boltzmann's famous H-theorem in thermodynamics [82]. From Equation A.1, more information is conveyed when an event has a lower probability of occurring. Likewise, it can be shown from Equation A.2 that the average self-information, or entropy, is highest when the events in X are equiprobable. In this case, each event conveys the same amount of information, and thus it is no more certain that one event will occur than another. The next two equations describe interactions between consecutive events.
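A small numerical illustration of Equations A.1 and A.2: the C fragment below computes the entropy of a four-symbol source for a skewed distribution and for the equiprobable case, showing that the uniform distribution attains the maximum of 2 bits.

#include <math.h>
#include <stdio.h>

static double entropy(const double *p, int n)
{
    double h = 0.0;
    for (int i = 0; i < n; i++)
        if (p[i] > 0.0)
            h -= p[i] * log2(p[i]);     /* -sum p_i log2 p_i, in bits */
    return h;
}

int main(void)
{
    double skewed[4]  = { 0.5, 0.25, 0.125, 0.125 };
    double uniform[4] = { 0.25, 0.25, 0.25, 0.25 };
    printf("H(skewed)  = %.3f bits\n", entropy(skewed, 4));   /* 1.750 */
    printf("H(uniform) = %.3f bits\n", entropy(uniform, 4));  /* 2.000 */
    return 0;
}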

Mutual Information - the amount of information the event Y = y_j gives about the event X = x_i.

    I(x_i; y_j) = \log \frac{p_{i|j}}{p_i} \qquad (A.3)

Average Mutual Information¹ - the average amount of information the event Y gives about the occurrence X.

    I(X; Y) = \sum_i \sum_j p_{ij} \log \frac{p_{i|j}}{p_i} \qquad (A.4)

An important characteristic of the average mutual information is that I(X; Y) ≥ 0.

Finally, two more equations are presented:

¹Shannon also uses an H term to describe this equation; however, more recent authors ([83, 84, 74], for example) use the I convention shown.

Conditional Entropy - the uncertainty of X after Y is observed.

    H(X|Y) = -\sum_i \sum_j p_{ij} \log p_{i|j} \qquad (A.5)

Through simple manipulations of Equations A.2 through A.5,

    I(X; Y) = H(X) - H(X|Y) \ge 0 \qquad (A.6)

It follows that H(X) ≥ H(X|Y), with the equality holding only when X and Y are statistically independent. In other words, the uncertainty of event X, given some knowledge of observed events Y, is always less than or equal to the uncertainty of X by itself.

Shannon also developed what he termed fidelity evaluation functions, or rate distortion functions [3, 4, 83]. The rate distortion function, R(D), defines the minimum information rate necessary to represent the output with an average distortion, D. In particular, for a Gaussian source using the mean-square-error distortion measure,

    R_g(D) = \frac{1}{2} \log \frac{\sigma_x^2}{D}, \qquad 0 \le D \le \sigma_x^2

where \sigma_x^2 is the variance of the Gaussian source. The formula implies that the larger the distortion is allowed to be, the lower the rate can be. This makes sense intuitively; when more errors can be tolerated, less information needs to be sent. In more general terms, R(D) can be expressed as the minimum of the average mutual information between image data points, U, and their reconstructions, \hat{U}:

    R(D) = \min_{E\{d(U, \hat{U})\} \le D} I(U; \hat{U})

The dependencies between rate and distortion measure can be reversed to give the distortion rate function. Again, for the Gaussian source,

    D_g(R) = 2^{-2R} \sigma_x^2

Put this way, D_g(R) is the distortion that must be tolerated for a given rate, R, and variance, \sigma_x^2. It turns out that the Gaussian source gives an upper bound on the rate distortion. Thus, for a given distortion measure, the Gaussian source requires the maximum data rate.
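The Gaussian rate-distortion pair above can be exercised numerically. The short C program below evaluates D_g(R) for a unit-variance source at a few rates and verifies that R_g(D) recovers the same rate.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double sigma2 = 1.0;                         /* unit-variance source     */
    for (int r = 1; r <= 4; r++) {               /* rates of 1..4 bits       */
        double d = pow(2.0, -2.0 * r) * sigma2;  /* distortion at that rate  */
        double rback = 0.5 * log2(sigma2 / d);   /* recovers the same rate   */
        printf("R = %d bits  ->  D = %.5f,  R(D) = %.1f bits\n", r, d, rback);
    }
    return 0;
}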

A.2 Vector Quantization

As the name implies, vector quantization (VQ) algorithms group several data points (a vector of values) into a single entity. A selected number of vectors divides the quantization space into a representative set that is referred to as the codebook. Every group of data points is encoded by selecting from the codebook the correspondingly closest vector (in Euclidean space). The index of that codeword is the data that is transmitted to the decoder.

Makhoul et al. [85] show how VQ can be used to divide the quantization space more efficiently than its scalar counterpart. Suppose the distribution of adjacent pixel values resembles the shaded region of Figure 60 (in reality, imagine that the shaded area represents the most dense and statistically significant part of the distribution).

Figure 60: The shaded region represents the distribution of adjacent pixel values; N_1 and N_2 intervals of width Δ lie along the edges of the surrounding box.

With scalar quantization, every value between the edges of the (larger) surrounding box must be accounted for, since either x1 or x2 can individually take on any of these values. Let N1 be the number of intervals along the x1 axis and N2 the number of intervals along the x2 axis into which the box is divided. Further, let Δ be the uniform interval spacing. Then:

    N_1 = N_2 = \frac{a + b}{\sqrt{2}\,\Delta}

for the case shown in Figure 60. The total number of subdivisions is N1 × N2. On the other hand, using VQ we can take advantage of the relationship between x1 and x2 by just quantizing the shaded area. Now, Na = a/Δ and Nb = b/Δ, where Na is the number of intervals along the shaded rectangle's a side, and Nb is the number of intervals along the shaded rectangle's b side. Again, Δ is the uniform interval spacing.

Thus the total number of subdivisions for scalar and vector quantization are:

    \text{SQ:}\quad N_1 \times N_2 = \frac{(a+b)^2}{2\Delta^2}

    \text{VQ:}\quad N_a \times N_b = \frac{ab}{\Delta^2}

and the ratio of these two quantities is:

    \frac{\#\text{ of subdivisions for SQ}}{\#\text{ of subdivisions for VQ}} = \frac{(a+b)^2}{2ab} = \frac{a^2 + b^2}{2ab} + 1 \ge 2

for a fixed Δ. There are two ways of viewing these results: 1) for a given fineness of quantization, Δ, VQ will never produce more total subdivisions than scalar quantization, or 2) for a given number of subdivisions, VQ produces finer quantization.

Thus, VQ always performs at least as well as scalar quantization.

Appendix B

Program Listing

The following program is the BASICA program listed to interface with the Data Acquisition and Control Adapter.

10 'NAME: Data Acquisition And Control (DAAC) 11 * HEADER for BASICA

12 » 13 'FILE NAKE: DACHDR.BAS 14 • 15 'DOS DEVICE NAME: DAAC 16 • 17 'RESERVED FUNCTION NAMES: 18 ' AINM, AINS, AINSC, AOUH, AOUS, 19 ' BINM. BINS, BITINS, BITOUS, BOUM, BOUS, 20 ' CINM, CINS, CSET, DELAY 21 'RESERVED DEF SEG VALUE NAME: DSEG 22 ' 23 'NAMES DEFINED AND USED BY HEADER: 24 ' ADAPT*, AI, COUNT. FOUND*, 25 ' HNAMEf, SGX, STATX 26 » 27 » 28 'When using the BASICA Interpreter, this header 29 'oust be executed before any function calls are 30 'made that access the DAAC adapter. It initializes 31 'a number of variables for each function call. These 32 'variables are reserved and should not be used except 33 'to access the DAAC adapter. This routine also does a 34 *DEF SEG to the segment where the DAAC Device Driver 35 '(DAC.COM) is loaded. If you execute a DEF SEG to 36 'access other hardware, you must DEF SEG to the segment

171 172

37 'of the DAAC Device Driver before any subsequent 38 'calls to access the DAAC adapter. 39 » 40 » 41 FOUNDX - 0 42 SOX - AH2E 43 'Start searching the Interrupt vectors until you find 44 'one that points to the DAAC device driver. 45 'Do a DEF SEO to that segment. 46 WHILE ((SOX <- AH3E) AND (FOUNDX - 0)) 47 DEF SEO - 0 48 DSEO - PEEK(SOX) * PEEK(SOX + 1) ♦ 256 49 DEF SEO - DSEO 50 HNAME*-"" 51 FOR AI-10 TO 17 52 HNAME* - HNAME* 4 CHR*(PEEK(AI)) 53 NEXT AI 54 IF HNAME* - "DAAC 11 AND PEEK(18) 4 PEEK(19) <> 0 THEN FOUNDX - 1 55 SGX - SOX 4 4 56 WEND 57 IF FOUNDX - 0 THEN PRINT "ERROR: DEVICE DRIVER DAC.COM NOT FOUND" : END 58 'Nov initialize all function name variables for calls 59 'to access the device driver. 60 AINM - PEEK(AH13) * 256 ♦ PEEK(AH12) 61 AINS ■ PEEK(AHIE) 4 256 ♦ PEEK(AH14) 62 AINSC ■ PEEK(AH17) 4 256 4 PEEK(AH16) 63 AOUH - PEEK(AH19) 4 256 4 PEEK(A1I18) 64 AOUS - PEEK(AHIB) 4 256 4 PEEK(AHIA) 65 BINN - PEEK(AHID) 4 256 ♦ PEEK(AH1C) 66 BINS - PEEK(AKIF) 4 256 ♦ PEEK(AHIE) 67 BITINS - PEEK(AH21) 4 256 + PEEK(AH20) 68 BITOUS - PEEK(AH23) 4 256 4 PEEK(AH22) 69 BOUM - PEEK(AH25) 4 256 4 PEEK(AK24) 70 BOUS - PEEK(AH27) 4 256 4 PEEK(AK26) 71 CINM - PEEK(AH29) 4 256 4 PEEK(AH28) 72 CINS ■ PEEK(AH2B) 4 258 4 PEEK(AH2A) 73 CSET - PEEK(AH2D) 4 256 4 PEEK(AH2C) 74 DELAY - PEEK(AH2F) 4 256 ■4 PEEK(AH2E) 75 'Finally, execute any call to re-initialize the 76 'device driver from any former invocation of BASIC. 77 ADAPTX - 0 78 COUNT » 1 79 STATX * 0 80 CALL DELAY (ADAPTX, COUNT, STATX) 81 » 82 ’End of DAAC BASICA Header 83 * 100 120 NODULE: MAIN Author: K.C. Adkins 140 160 PURPOSE: This program allows interactive control of 180 the DACA, in particular to control the AM. 200 220 VARIABLES: LOOPX, ILOOPX, VALUE*, BYTEX, MASKXO, 240 (GLOBAL) SAMPLENUMX, CDATA(,), COLLECTLOOPX, 260 CBITX, TEMPIX, TEMP2X 2B0 300 320 MAX* ■ SOO 340 DIM CDATA(2,HAXX) :REH Allows HAX/2 clock cycloB of storage 360 DIN KASKC16) 380 NASK(O) - 1: HASK(l) - 2: HASK(2) - 4: KASK(3) - 8 400 MASK(4) -16: HASK(5) -32: HASK(6) -64: MASK(7) -128 420 HASK(8)>256: NASK(9)-512: HASK(10)-1024: HASK(U)-2048 440 MASK(12)-4096: HASK(13)-8192: HASK(14)-16384: HASK(15)-32768I 460 ADAPTX - 1: DEVICE* - 8: ISTOPX - 0: VALUE* - 0 480 FOR BITNUMX - 0 TO IE 500 CALL BITOUS(ADAPT*,DEVICE*,BITNUH*.VALUE*,STAT*) 520 NEXT 540 PRINT "FUNCTION (0-Exit, 1-Reset, 2-Store, 3-Hatch"j 560 INPUT ", 4-File Mode)";CHOICEX 570 IF CHOICEX<0 THEN STOP 580 ON CHOICEX GOSUB 880,1180,1720,3400 590 FILEINPUTX-O: LFNNX-0 600 IF CHOICEX-O THEN SYSTEM 640 * IF ISTOPX-1 THEN PRINT"NEHORY FULL":GOTO 360 660 GOTO 540 680 * INPUT "Do you wish to save data (0-No, 1-Yes)";CHOICE* 700 * IF CKOICEXOI GOTO 440 720 • INPUT "File Nane";F$ 740 ’ OPEN F$ FOR OUTPUT AS fl 760 * FOR LOOPX - 0 TO SAMPLENUMX-1 780 * PRINT#1,CDATA(0,LOOPX)iCDATACl.LOOPX) 800 1 NEXT 820 » CLOSE!1 840 SYSTEH 860 * 880 ' NODULE: RESET Author: K.C. Adkins 900 » 920 ’ PURPOSE: Issue a RESET to the VAMPIRE. This 1b aeconplishad 940 * by strobing LOAD and EXE high. 960 * 980 VALUEX - 1 1000 FOR BXTNUHX - 8 TO 9 1020 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX,STATX) 1040 NEXT 1060 VALUEX * 0 1080 FOR BITNUMX - 8 TO 9 1100 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX,STATX) 1120 NEXT 1140 RETURN 1160 » 1180 * MODULE: STORE Author: K.C. Adkins 1200 * 1220 * PURPOSE: This routine performs the necessary steps 1240 ' to store a vector into the VAMPIRE. 1260 » 1280 PRINT "STORE OPERATION" 1300 GOSUB 2200 . 
1320 BITNUMX - 10: VALUEX - 1: REM Set STORE Mode 1340 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX.STATX) 1360 INPUT " ADDRESS";ADDRESSX: IF ADDRESSX>63 GOTO 1360 1380 GOSUB 2800 1400 FOR LOOPX ■ 0 TO 4 1420 VALUEX - 0 1440 IF (ADDRESSX AND NASK(LOOPX)) <> 0 THEN VALUEX-1 1460 TEHP1X - LOOPX + 11 1480 CALL BITOUSCADAPTX.DEVICEX,TEMPlX,VALUEX.STATX) 1500 NEXT 1620 VALUEX - 1: BITNUMX - 9 1640 CALL BITOUSCADAPTX.DEVICEX.BITNUMX,VALUEX.STATX) 1560 VALUEX " 0 1680 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX,STATX) 1600 BITNUMX - 10: VALUEX • 0: REM Kill STORE Mode 1620 CALL BITOUSCADAPTX,DEVICEX.BITNUMX,VALUEX.STATX) 1640 GOSUB 2200 1660 IF STATXOO THEN PRINT "STORE: STAT - STATX 1680 RETURN 1700 * 1720 * MODULE: MATCH Author: K.C. AdkinB 1740 » 1760 1 PURPOSE: This routine performs the necessary steps 1760 1 to match a vector against the contents of 1800 * the AM. 1820 1 1840 * INPUT "Do you wish IN to be low-0 or high-1";B3X 1860 * IF B3%<>0 THEN B3X - 1 1880 * BITNUMX ■ 3: CALL BITOUS(ADAPTX,DEVICEX.BITNUMX,B3X,STATX) 1900 PRINT "MATCH OPERATION" 1920 PRINT " BEFORE MATCH -'";:GOSUB 2200 1940 GOSUB 2800 1960 VALUEX - 1: BITNUMX - 9 1980 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX,STATX) 2000 VALUEX » 0 2020 CALL BITOUS(ADAPTX,DEVICEX,BITNUMX.VALUEX»STATX) 2040 PRINT " AFTER MATCH GOSUB 2200 2060 VALUEX - 1 2080 CALL BITOUSCADAPTX,DEVICEX,BITNUMX,VALUEX.STATX) 2100 VALUEX - 0 2120 CALL BITOUS(ADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 2140 PRINT " SS MATCH -";:GOSUB 2200 2160 IF STATXOO THEN PRINT "MATCH: STAT » "jSTATX 2180 RETURN

2200 * 2220 • MODULE: COLLECT Author: K.C. Adkins 2240 ’ 2260 * PURPOSE: This routine performs the necessary steps 2280 1 to read all 16 digital input lines. 2320 * 2340 TEMP ■ 0: COMPARE ■ 0: LABELX - 0 2360 FOR COLLECTLOOPX • 0 TO 4 2380 CALL BITINS(ADAPTX,DEVICEX,COLLECTLOOPX,CBITX,STATX) 2400 CBIT1 - CBITX 2420 IF STATXOO THEN PRINT "COLLECT: STAT - ";STATX 2440 TEMP - TEMP + CBIT1 * MASK(COLLECTLOOPX) 2460 NEXT 2480 * IF FILEINPUTX-0 THEN PRINT" CHIP STATE: C0MPARE(4-O) - ";TEHP 2500 TEMP « 0 2S20 FOR COLLECTLOOPX - 11 TO 16 2640 CALL BITINS(ADAPTX,DEVICEX,COLLECTLOOPX,CBITX.STATX) 2660 CBIT1 - CBITX 2580 IF STATXOO THEN PRINT "COLLECT: STAT - STATX 2600 TEMP - TEMP + CBIT1 * MASK(COLLECTLOQPX-ll) 2620 NEXT 2640 LABELX - TEMP 2660 IP FILEINPUTX-0 THEN PRINT" LABEL - LABELX 2680 ' IF SAMPLENUMX>(MAXX-20) THEN ISTOPX - 1 2700 * CDATA(0,SAMPLENUMX) - TEMP :TEMP - 0 2720 ’ CDATA(1,SAMPLENUMX) - ADDRESSX ♦ DATAVAL 2740 • SAMPLENUMX - SAMPLENUMX + 1 2760 IF STATXOO THEN PRINT "COLLECT: STAT ■ STATX 2780 RETURN 2800 * 2820 * MODULE: LOAD-DATA Author: K.C. Adkins 2840 * 2860 * PURPOSE: This routine performs the necessary steps 2880 * to load the four 8-bit bytes of data into 2900 * the appropriate registers. 2920 ’ 2940 VALUEX - 0 2960 FOR LOOPX - 11 TO 15 2980 CALL BITOUSCADAPTX,DEVICEX,LOOPX,VALUEX,STATX) 3000 NEXT 3020 FOR LOOPX ■ 0 TO 4 3025 IF LOOPX<4 GOTO 3040 3030 IF (ADDRESSX AND MASK(5)) - 0 GOTO 3360 3035 GOTO 3200 3040 IF FILEINPUTX-i THEN INPUT II, BYTEX: GOTO 3100 3045 IF COPAX-1 THEN BYTEX-PAT1(LOOPX): GOTO 3100 3050 IF COPAX-2 THEN BYTEX-PAT2(L0QPX): GOTO 3100 3060 PRINT " BYTE";LOOPX; 3080 INPUT BYTEX 3100 FOR ILOOPX - 0 TO 7 3120 VALUEX-0 3140 IF (BYTEX AND MASK(ILOOPX)) <> 0 THEN VALUEX-1 3160 CALL BITOUS(ADAPTX,DEVICEX,ILOOPX.VALUEX,STATX) 3180 NEXT 3200 BITNUMX - LOOPX +11: VALUEX-1 3220 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX,STATX) 3240 BITNUMX - 8 3260 CALL BITOUSCADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 3280 VALUEX-0 3300 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX.STATX) 3320 BITNUMX - LOOPX * 11 3340 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX.STATX) 3360 NEXT 3380 RETURN 3400 3420 * MODULE: FILE-MODE Author: K.C. Adkins 3440 3460 PURPOSE: This routing allows oporatlons in a 'batch* 3480 mods. A codsbook can bo loadod from a filo, 3S00 a sot of voctor templates can bo matched 3520 from a file, or two alternating patterns can 3540 be matched against the AN contents. 3550 3560 PRINT " CO-Main Menu, 1-Load Codobook,"; 3580 INPUT " 2-Match Templates 3-Continuous Pattern)";CK0ICE2X 3590 IF CHOICE2X<0 THEN CH0ICE2X-0 3600 ON CH0ICE2X GOSUB 3660,4420,5000 3620 IF CH0ICE2X-0 THEN RETURN 3640 GOTO 3560 3650 3660 ' MODULE: L0AD-C0DEB00K Author: K.C. Adkins 3680 3700 PURPOSE: Vectors can be stored from a file directly 3720 into the memory starting at a user selectable 3740 location. 
The vector format is: 3760 COMPO C0MP1 C0HP2 C0MP3 3780 3790 FILEINPUTX - 1: LFNMX»0 3800 INPUT 11 File Name";Ft 3820 OPEN FI FOR INPUT AS II 3840 INPUT 11 Starting Location?";NCWX 3860 PRINT 3880 PRINT " Reading ";F|;" 3900 IF NCWX<0 THEN NCWX-0 3920 BITNUMX - 10: VALUEX - 1: REM Set STORE Mode 3940 CALL BITOUSCADAPTX.DEVICEX,BITNUMX,VALUEX.STATX) 3960 1 BEGIN GOTO-LOOP 3970 ADDRESSX " NCHX 3980 GOSUB 2800 4000 FOR LOOPX - 0 TO 4 4020 VALUEX * 0 4040 IF (NCWX AND MASK(LOOPX)) <> 0 THEN VALUEX-1 4060 TEMP1X - LOOPX ♦ 11 4080 CALL BITOUSCADAPTX,DEVICEX.TEMPlX.VALUEX,STATX) 4100 NEXT 4120 VALUEX - 1: BITNUMX - 9 4140 CALL BITOUSCADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 4160 VALUEX - 0 4180 CALL BITOUSCADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 4200 PRINT 4220 NCUX - NCWX + 1 4240 IF EOFCD OR NCVX-64 THEN 4280 4260 GOTO 3960 4280 CLOSE 4300 PRINT "DONE" 4320 PRINT « ";NCWX;" codewords loadad into memory." 4330 PRINT 4340 BITNUMX - 10: VALUEX - 0: REM Kill STORE Noda 4360 CALL BITOUSCADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 4380 IF STATXOO THEN PRINT "STORE: STAT - STATX 4390 FILEINPUTX-0 4400 RETURN 4410 * 4420 > MODULE: MATCK-TEHPLATES Author: K.C. Adkins 4440 ’ 4460 ' PURPOSE: Match VAMPIRE contents against tha vectors 4480 * found in a usar-spacifiad file. The format 4600 * is the same as the LOAD-CODEBOOK format. 4520 * 4530 FILEINPUTX - 1 4535 IF LFNMX-1 THEN OPEN Ft FOR INPUT AS «1: GOTO 4564 4537 LFNMX-1 4540 INPUT " File Name";Ft 4560 OPEN Ft FOR INPUT AS II 4562 INPUT " Do you want to log the output labels CO-N, l-Y)" 4564 IF LFX-0 GOTO 4590 4568 INPUT 11 Log File Name";Lt 4570 OPEN Lt FOR OUTPUT AS 12 4590 PRINT: PRINT" Matching to ";Ft;" 4600 NTMPS - 0: NERRS - 0: TCTX - 1: TLAB2X - -1: LERRS ■ -1 4620 1 BEGIN GOTO-LOOP 179

4640 GOSUB 2800 4660 VALUEX - 1: BITNUMX - 9 4680 CALL BITOUSCADAPTX.DEVICEX.BITNUMX.VALUEX.STATX) 4700 VALUEX - 0 4720 CALL BITOUSCADAPTX,DEVICEX,BITNUMX.VALUEX,STATX) 4740 GOSUB 2200 4760 TLABX - LABELX 4770 IF TLAB2XOLABELX THEN LEHRS - LERRS + 1 4780 VALUEX - 1 4800 CALL BITOUSCADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 4820 VALUEX • 0 4840 CALL BITOUSCADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 4860 GOSUB 2200 4870 TLAB2X - LABELX 4880 IF TLABXOLABELX THEN NERRS - NERRS + 1 4890 IF LFX<>0 THEN PRINT 12,TLABX 489B PRINT TLABX,TLAB2X 4900 NTMPS ■ NTMPS + 1 4910 IF CNTMPS AND 15) - 0 THEN PRINT"*'*; 4915 IF CNTMPS AND 511) ■ 0 THEN PRINTsPRINT TCTX;" ";:TCTX-TCTX+1 4920 IF EOFCl) THEN 4950 4940 GOTO 4640 4950 CLOSE 4952 PRINT CHRIC7);CHR$(7);CHR$C7): PRINT 4955 PRINT ” Number of Templates - NTMPS;", Errors > ";NERRS 4957 PRINT " Latency Figure ■ ";LERRS 4960 IF STATXOO THEN PRINT "MATCH: STAT - ";STATX 4970 FILEINPUTX-0 4980 RETURN 4990 * 5000 ' MODULE: Continuous-Pattern Author: K.C. Adkins 5010 » 5020 1 PURPOSE: ThiB routine is used to iteratively match against 5030 * two alternating patterns, defined by the user. 5035 * Use FI to stop the cycle. 5040 * 5060 COPAX-1 5080 PRINT" Enter first 4-byte pattern:" 6000 FOR LOOPX - 0 TO 3 6020 PRINT " BYTE";LOOPX; 6040 INPUT PAT1CL00PX) 6060 NEXT 60S0 PRINT" Enter second 4-byte pattern:" 6100 FOR LOOPX - 0 TO 3 6120 PRINT " BYTE";LOOPX; 6140 INPUT PAT2(LOOPX) 6160 NEXT 6180 PRINT" ** PRESS FI TO STOP **" 6190 ON KEY(l) GOSUB 6500 6200 * GOTO-LOOP 6210 KEY(l) STOP 6220 PRINT " PATTERN ";COPAX 6240 GOSUB 2800 6260 VALUEX - 1: BITNUMX - 9 6280 CALL BITOUS(ADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 6300 VALUEX - 0 6320 CALL BITOUS(ADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 6340 PRINT" AFTER HATCH :GOSUB 2200 6360 VALUEX - 1 6380 CALL BITOUS(ADAPTX,DEVICEX,BITNUMX,VALUEX.STATX) 6400 VALUEX - 0 6420 CALL BITOUS(ADAPTX,DEVICEX,BITNUMX,VALUEX,STATX) 6440 PRINT " SS MATCH GOSUB 2200 6450 COPAX - 3 - COPAX 6460 KEY(l) ON 6480 GOTO 6200 6500 RETURN 6520 6520 COPAX-0 6530 PRINT 6540 RETURN Appendix C

VAMPIRE Chip User’s Guide

C.1 Introduction

This manual describes the operation of the special-purpose Vector-quantizing Associative Memory Processor Implementing Real-time Encoding (VAMPIRE) chip, which was developed as part of a Vector Quantization (VQ) video-rate encoder. VQ is an encoding technique that divides (quantizes) a vector space into distinct regions. The centroid of each region is a vector representative of that region, and it is referred to as a codeword. Thus, associated with all possible vectors is a finite set of codewords. The mapping from any vector in the vector space (the input vector) to the correspondingly closest codeword is called vector quantization. Since it is assumed that the codewords are known a priori, a unique label (the encoded output) can be assigned to each one. In general the number of codewords is much less than the number of possible input vectors, and therefore the output label can be represented more compactly than the input vector. This is what gives VQ its data compression characteristic.

The Differential Vector Quantization (DVQ) algorithm being developed uses pixels sampled from the incoming video signal at a rate of four times the colorburst. Four neighboring pixels are grouped together to form the incoming vector. A prediction of the incoming vector is calculated from neighboring, previously processed vectors and subtracted from the incoming vector on a pixel by pixel basis. The resulting set of four difference values form the differential vector supplied to the VAMPIRE chips. The VAMPIRE chips simultaneously calculate the distance from the differential vector to each of the stored codewords and return the label of the closest one. The codewords are determined offline and could number as many as 256. In summary, the DVQ algorithm places the following requirements on the VAMPIRE chips:

1. The vector space spans four dimensions.

2. Each component of the vector has eight bits of resolution.

3. Vector throughput is that of the NTSC television colorburst rate (3.6MHz, or one vector every 280ns).

4. Codebook size should be expandable to at least 256 codewords.

The VAMPIRE chip that was designed to perform these tasks was laid out and fabricated in 2μm CMOS technology. The chips can operate in a standalone mode, or they can be linked together to accommodate an unlimited-size codebook. This document describes how the VAMPIRE works and how it can be wired to accomplish the DVQ lookup.

C.2 Wiring the Chip

The schematic shown in Figure 61 demonstrates how four VAMPIRE chips would be connected to allow storage and matching capabilities across a codebook of 128 vectors.

Figure 61: Four VAMPIRE chips wired together.

Each VAMPIRE has 64 active pins: 47 of them are inputs, 7 are outputs (5 of which may be tri-stated), and 10 can be used bidirectionally. The inputs can be broken down as follows: 32 data lines (DATA), 5 input address lines (ADDR_IN), 3 ground, 3 power, and 4 control lines (IN, STORE, MATCH, and RESET). The output pins consist of the following: 5 output address lines (ADDR_OUT or LABEL) and 2 control lines (OUT and CHIP-VALID). The remaining ten pins are the bidirectional COMPARE lines which are responsible for interchip communication.

Each chip is capable of holding 32 codewords. Thus, the 5 LSBs of the storage address, ADDR_IN(4-0), are wired directly to each VAMPIRE chip. The more significant bits of the address lines must be coupled with additional logic so that only the desired chip is selected and loaded with data (see Figure 61). The DATA lines are common to all chips and are wired together. The COMPARE lines are critical for interchip operation; the value on these lines is the actual distance between the DATA lines and the closest stored vector. The COMPARE lines can operate as inputs or outputs, but when operating as outputs they can only drive the pins low. The user need only connect the COMPARE lines of all chips together, placing one pull-up resistor on each line.
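In software terms, the address split described above looks like the following sketch. Four chips and a 7-bit address are assumed, as in Figure 61; the variable names are illustrative, and the per-chip STORE line is modeled as active-low, matching the chip's STORE behavior described later.

#include <stdio.h>

#define N_CHIPS 4

int main(void)
{
    unsigned addr     = 0x53;                  /* 7-bit address: chip 2, word 19 */
    unsigned addr_in  = addr & 0x1F;           /* 5 LSBs, wired to every chip    */
    unsigned chip_sel = (addr >> 5) & 0x3;     /* 2 MSBs drive the decoder       */

    int store_n[N_CHIPS];                      /* active-low STORE per chip      */
    for (int c = 0; c < N_CHIPS; c++)
        store_n[c] = (c == (int)chip_sel) ? 0 : 1;

    printf("ADDR_IN = %u\n", addr_in);
    for (int c = 0; c < N_CHIPS; c++)
        printf("chip %d STORE* = %d\n", c, store_n[c]);
    return 0;
}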

As far as the various control lines of each chip are concerned, corresponding MATCH and RESET lines are simply wired together. The STORE lines are coupled with additional logic, as is shown in Figure 61. The IN and OUT pins control the priority in which codewords may be decoded, though this will be explored in further detail in a subsequent section. The CHIP-VALID lines are encoded to give the more significant bits of the output label.

C.3 Chip Operation

This chip was designed as an asynchronous circuit. Except for the actual storage elements, there is no feedback in the design. Thus, for a given set of stored codewords, data on the input pins completely determines the state of the device. Below, the three main modes of operation are described.

C.3.1 Reset

The RESET pin is a negative-true input. A low logic level on this line causes a flag to be set at each memory location on the chip. The effect of this flag is to prevent that stored word from competing with other stored words during a MATCH operation, which essentially makes that memory location empty. Note that data may still be stored in that location, but it will have no effect on the rest of the memory. The only way the empty-flag at a particular location can be reset is to store a new set of data in that location.

It is only necessary to strobe the RESET line low when all the memory cells on a chip will not be loaded with data. For example, when the chip is first powered up, the state of the memory (including the empty-flags) must be assumed to be random. Resetting the chip ensures that all the empty-flags are set; if only half of the 32 locations get data written to them, the other half can be assured to be empty. Note that if all locations will be loaded, resetting the chip is not necessary, as the act of loading each vector resets the empty-flag and fills the memory locations.

Because the RESET pin can instantly mark the entire memory as empty, care must be taken that no glitches appear on this line. To help prevent aberrant noise from resetting the chip, delay circuitry was built into the reset logic to implement the following logical equation:

SET-EMPTY-FLAGS = RESET* · DELAYED-RESET*

(The asterisk (*) represents negation.) Internal to the chip, the RESET pin is routed through two paths. One path includes a series of inverters causing a delay of several nanoseconds (DELAYED-RESET), and the other path has no delay. When both paths have a low logic level, the chip is reset.

C.3.2 Store

In order to load the codebook into memory, the chip must be capable of storing data into specified locations. When the STORE pin is at a low logic level, the value on the DATA pins is loaded into the memory location indicated by the ADDR_IN lines.

This process overwrites any data previously stored in that location, and it resets the empty-flag for that address. Note that in Figure 61, the system STORE line controls the enable of the decoder. The two MSBs of the address determine which of the four VAMPIRE STORE lines will be asserted when the decoder is enabled.

Figure 62 shows the timing diagram of a STORE operation; three times are indicated: T_aih, T_dms, and T_sc. T_aih represents the hold time for the ADDR_IN lines. These address lines must be stable before the STORE line is strobed low. Thus, T_aih should be at least a few nanoseconds, which is a typical hold time for CMOS devices. T_dms is the amount of time needed for data to settle into the memory locations after the STORE line goes low. It is impossible to measure this time directly, so we rely on simulations to make an estimate. T_sc is the time it takes for the COMPARE lines to settle after the STORE line goes low.

Earlier it was stated that the value on the COMPARE lines was the distance between the DATA pins and the closest stored vector. As a word gets stored into memory, that word exists both in memory and on the DATA lines. Thus, the COMPARE lines should all drop to a low logic level when a word is being stored. Note that it only takes a small fraction of T_sc to actually store the data in the appropriate location (T_dms); however, we can be guaranteed that the data has been stored when the COMPARE lines go to zero. Note that the timing diagrams are not necessarily to scale.

---- »» Tdm* M ----- COMPARE INDETKRMI HINT J VALID (ZERO) )

ADDR IN INDETBRHINANT ^ VALID

___ ^ Taih ------Taca -----—► STORE

T in a

Figure 62: Timing diagram for a STO R E operation.

C.3.3 Match

The primary function of the VAMPIRE is to match the incoming data to the closest stored codeword. In fact, the VAMPIRE is always performing this function. It is only during an actual MATCH operation that the winning chip is allowed to drive the ADDR-OUT or LABEL bus. Figure 63 shows the timing diagram for a MATCH operation.

Figure 63: Timing diagram for a MATCH operation. (The diagram shows the DATA, COMPARE, ADDR-OUT, MATCH, and IN waveforms, with the intervals Tdcs and Tcaos indicated.)

Two of the featured times in this diagram are critical for the DVQ application: Tdcs and Tcaos. It is essentially the sum of these two times that determines the operating speed of the VAMPIRE. Tdcs (Data to Compare Settling time) represents the amount of time it takes the system to find the minimum distance and identify the winning word(s). Once the winning words have been determined, it takes Tcaos (Compare to Address-Out Settling time) to priority encode the address of the winner and drive the output bus. Actual values for these times are difficult to estimate because they are so heavily dependent on the input data and the contents of memory. Simulation results give a total settling time (Tdcs + Tcaos) that varies from 120 ns to 210 ns. The other times shown in Figure 63 are displayed mainly to show the dependence of the ADDR-OUT bus on the control signals MATCH and IN.
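Functionally, a MATCH operation on a single chip can be summarized by the following sketch (Python, behavioral only; an absolute-difference distortion measure is assumed here for concreteness, and the lower-address tie-break models the Multiple Response Resolver discussed in Section C.4):

# Behavioral sketch of one chip during a MATCH: empty locations do not
# compete, the minimum distance between the DATA vector and the stored
# codewords is found, and ties are resolved in favor of the lower address.
def match(codebook, empty_flags, data):
    best_addr, best_dist = None, None
    for addr, (word, empty) in enumerate(zip(codebook, empty_flags)):
        if empty:
            continue                                    # empty-flag set: skip
        dist = sum(abs(a - b) for a, b in zip(word, data))
        if best_dist is None or dist < best_dist:       # strict '<' keeps the
            best_addr, best_dist = addr, dist           # lower address on a tie
    return best_addr, best_dist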

C.4 Using Multiple Chips

One of the most difficult tasks for an associative memory to perform is the association across chip boundaries. In order to select a single winner across several chips, each chip must broadcast information to and receive information from every other chip.

To accomplish the necessary information transfer, the VAMPIRE relies on wired-NOR logic to drive a common, bidirectional "comparison" bus (COMPARE lines). The bits of the COMPARE bus settle toward the smallest difference between the input vector and any stored vector. Internally, each VAMPIRE chip determines which of the codewords stored on that chip is closest to the input vector. Only the chip that has the global winner is allowed to drive the output bus (ADDR-OUT lines). Chips not containing the global winner place the ADDR-OUT drivers into a high impedance state.
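A purely behavioral model of this chip-boundary association (ignoring the circuit-level wired-NOR settling; the function and variable names are illustrative only) is sketched below:

# Each chip reports the distance of its local winner; the shared COMPARE bus
# settles to the global minimum, and only chips whose local minimum equals the
# bus value assert WINNER-ON-CHIP and may drive ADDR-OUT.
def global_compare(local_minima):
    """local_minima: one (address, distance) pair per chip."""
    compare_bus = min(dist for _, dist in local_minima)
    winner_on_chip = [dist == compare_bus for _, dist in local_minima]
    return compare_bus, winner_on_chip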

One of the potential shortcomings with this scheme occurs when two different codewords are equidistant to the input vector and these codewords reside on different chips. Under these conditions, each of the two chips containing a "winning" codeword will attempt to drive the ADDR-OUT bus with its corresponding address.

Note that this problem does not occur when the two winning codewords are on the same chip. In this case, the VAMPIRE's built-in Multiple Response Resolver (MRR) prioritizes the memory locations such that the codeword with the lower address is chosen. Either codeword could have been chosen, since it is assumed that when two (or more) codewords are equidistant to an input vector, either can be used with the same effectiveness.

To eliminate situations in which more than one chip attempts to drive the output bus, the following solutions may be employed:

1. Select codewords such that no input vector can be equidistant to two or more codewords.

2. Place all codewords that may potentially be equidistant to some input vector on the same chip.

3. Multiplex the ADDR-OUT lines.

4. Use the IN and OUT control lines of the VAMPIRE.

The methods suggested in #1 and #2 above detract from the general-purpose nature of the VAMPIRE, require additional computational complexity, and place constraints on the system that may be impossible to satisfy. Additionally, neither of these solutions addresses the problem directly. Method #1 attempts to avoid the real problem, and method #2 tries to localize the problem to the point that the MRR on the VAMPIRE chooses one of the codewords. These solutions certainly may be effective in specific circumstances, but they are not acceptable as general solutions.

Method #3 is a general solution, but it requires additional logic and parts. For example, if there were eight VAMPIRE chips, each with five ADDR-OUT lines, then five 8-to-1 multiplexers would be needed. The three select lines for the multiplexers would come from the output of an 8-to-3 priority encoder, which uses the eight CHIP-VALID lines as inputs. To eliminate the need for extra hardware, the IN and OUT control lines can be used (method #4). The effect of these lines is to assign a priority to each of the VAMPIRE chips. When IN is high, that chip is allowed to drive the ADDR-OUT lines, provided it contains a winning codeword, and in this case OUT is low. If the chip does not contain a winning codeword, then OUT is high, signifying that a winner has not yet been selected. When IN is low, then OUT is also low. To use these control lines, the OUT line of one chip is fed into the IN pin of the next chip, as is shown in Figure 61. These connections form a chain in which chips found earlier in the chain have higher priority. OUT and CHIP-VALID are given by the following equations:

OUT = IN · WINNER-ON-CHIP*

CHIP-VALID = IN · WINNER-ON-CHIP

The ADDR-OUT line drivers become enabled under the following conditions:

ENABLE-OUTPUT = MATCH* · CHIP-VALID

The WINNER-ON-CHIP signal is internal to each chip, and it is asserted when the difference value on the global COMPARE lines matches the difference value of the on-chip winner.
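These equations can be combined into a short behavioral sketch of the daisy chain (Python; the assumptions that the first chip's IN pin is tied high and that MATCH is a negative-true control follow from the discussion above but are not spelled out there):

# Sketch of the IN/OUT priority chain: each chip's OUT feeds the next chip's
# IN, so the first chip in the chain holding a global winner claims ADDR-OUT.
def resolve_chain(winner_on_chip, match_n):
    """winner_on_chip: one boolean per chip, in chain order;
    match_n: negative-true MATCH control (0 during a MATCH operation)."""
    in_sig, chip_valid, enable_output = 1, [], []
    for won in winner_on_chip:
        out_sig = bool(in_sig) and not won      # OUT = IN · WINNER-ON-CHIP*
        valid   = bool(in_sig) and won          # CHIP-VALID = IN · WINNER-ON-CHIP
        chip_valid.append(valid)
        enable_output.append(match_n == 0 and valid)   # ENABLE-OUTPUT = MATCH* · CHIP-VALID
        in_sig = out_sig                        # chain on to the next chip
    return chip_valid, enable_output

# Two chips tie for the global winner; only the first of them drives ADDR-OUT.
print(resolve_chain([False, True, True, False], match_n=0))
# -> ([False, True, False, False], [False, True, False, False])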

C.5 Chip Pinout

The following table gives a pin-out description of the 65-pin grid array for the VAMPIRE chip.

 #  Loc  Desc     #  Loc  Desc     #  Loc  Desc     #  Loc  Desc
 1  A1   OUT     17  K1   Cj      33  K10  lh 2    49  A10  B j 0
 2  C2   MAT     18  J3   CB      34  H9   Bo 3    50  B8   B 2l
 3  B1   IN      19  K2   VDD     35  J10  B40     51  A9   GND
 4  C1   CV      20  K3   Co      36  H10  Ba\     52  A8   B 2 2
 5  D2   Lq      21  J4   DiO     37  G9   Da2     53  B7   Bo 3
 6  D1   h       22  K4   B j\    38  G10  Ba 3    54  A7   BoZ
 7  E1   £rj     23  K5   Dj 2    39  F10  STR     55  A6   B02
 8  E2   La      24  J5   B73     40  F9   A,      56  B6   Bol
 9  F1   La      25  K6   GND     41  E10  Aa      57  A5   VDD
10  F2   Co      26  J6   Be0     42  E9   Aa      58  B5   Bo0
11  G1   c x     27  K7   /?ol    43  D10  A,      59  A4   B x 3
12  G2   c 2     28  J7           44  D9   Ao      60  B4   B x2
13  H1   C3      29  K8   no 3    45  C10  Bo 0    61  A3   B x\
14  H2   Ca      30  J8   BnQ     46  CO   /?31    62  B3   5i0
15  J1   c 6     31  K9   GND     47  B10  Bo 2    63  A2   VDD
16  J2   Co      32  J9   Br,\    48  no   Bo 3    64  B2   RST


[71] M. R. Carbonara, “Differential Vector Quantization of Digital Image Data Using Artificial Neural Network Codebook Design,” Master's thesis, The Ohio State University, 1992. [72] S. C. Ahalt, P. Chen, and A. K. Krishnamurthy, “Performance analysis of two image vector quantization techniques,” International Conference on Neural Net­ works, vol. 1, pp. 169-175, 1989. [73] G. R. Cooper and C. D. McGillcm, Modem Communications and Spread Spec­ trum. McGraw Hill, 1986. [74] A. B. Carlson, Communication Systems: An Introduction to Signals and Noise in Electrical Communication. McGraw Hilt, 1986. [75] K. Cios, R. Langcndcrfcr, R. Tjia, and N. Liu, “Recognition of defects in glass ribbons using neural networks," in Proceedings of the 1991 NSF Design and Manufacturing Systems Conference , (Austin, TX), pp. 203 - 206, 1991. [76] S. Szirom, “Intelligent memories promise product and market niches,” in WESCON 89, pp. 24 - 28, 1989. [77] K. Waldschmidt, “Editorial: Special section on associative processors and mem­ ories,” IEE Proceedings Part E, vol. 136, pp. 341 - 342, September 1989. [78] Thinking Machines Corporation, Cambridge, MA, Programming the Connection Machine , 1989. [79] H. Bcrgh, J. Enel and, and L.-E. Lundstrom, “A fault-tolerant associative mem­ ory witb high-speed operation,” IEEE Journal of Solid-State Circuits, vol. 25, pp. 912 - 919, August 1990. [80] R. Haul, K. Adkins, and S. Bibyk, “An all digital implementation of a modified hamming net for video compression with prediction and quantization circuits,” in ICSE, (Dayton, OH), 1991. [81] M. Leonard, “CODEC sends wideband video over phone lines,” Electronic De­ sign, pp. 129 - 130, June 28 1990. [82] E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, vol. 106, pp. 620 - 630, May 1957. [83] J. Proakis, Digital Communications. McGraw Hill, 1989. [84] R. E. Blaluit, Principles and Practice of Information Theory. Addison Wesley Publishing Co., 1987. [85] J. Makhoul, S. Roucos, and II. Gish, “Vector quantization in speech coding,” Proceeding of the IEEE, vol. 73, pp. 1551 - 1588, November 1985.