An Optical Delay Line Memory Model with Efficient Algorithms

John H. Reif†, Department of Computer Science, Duke University, Durham, NC 27706
Akhilesh Tyagi‡, Department of Computer Science, Iowa State University, Ames, IA 50011

Abstract

The extremely high rates of optical computing technology (100 MWords per second and upwards) present unprecedented challenges in dynamic memory design. An optical fiber loop, used as a delay line, is the best candidate for primary, dynamic memory at this time. However, such loops pose special problems in the design of algorithms due to synchronization requirements between the loop data and the processor. We develop a theoretical model, which we call the loop memory model (LMM), to capture the relevant characteristics of a loop based memory. An important class of algorithms, ascend/descend, which includes algorithms for merging, sorting, DFT, matrix transposition and multiplication, and data permutation, can be implemented without any time loss due to memory synchronization. We develop both sequential and parallel implementations of ascend/descend algorithms and some matrix computations. Some lower bounds are also demonstrated.

1 Introduction

1.1 The Success of VLSI Theory

Over the last 15 years, VLSI has moved from being a theoretical abstraction to being a practical reality. As VLSI design tools and VLSI fabrication facilities (such as MOSIS) became widely available, algorithm design paradigms such as systolic algorithms [Kun79], that were thought to be of theoretical interest only, have been used in high performance VLSI hardware. Along the same lines, the theoretical limitations of VLSI predicted by area-time tradeoff lower bounds [Tho79] have been found to be important limitations in practice.



A preliminary version of this paper appeared in Proc. of the ADVANCED RESEARCH in VLSI: International Conference 1991, MIT Press, March 1991.

† Supported in part by DARPA/ARO contract DAAL03-88-K-0195, Air Force Contract AFOSR-87-0386, DARPA/ISTO contract N00014-88-K-0458, NASA subcontract 550-63 of prime contract NAS5-30428 and by NSF Grant NSF-IRI-91-00681.

‡ Supported by NSF Grant MIP-8806169, NCBS&T Grant 89SE04 and a Junior Faculty Development award from UNC, Chapel Hill. This work was performed when the second author was with UNC, Chapel Hill.

1.2 The Promise of Optical Computing

The optical computing technology is considered to be one of the several technologies that could provide a boost of 2-3 orders of magnitude over the currently used semiconductor (silicon) based technology in computing speed. The field of electro-optical computing, however, is at its infancy stage, comparable to the state of VLSI technology, say, 10 years ago. Fabrication facilities for many key electro-optical components are not widely available; instead, the crucial electro-optical devices must be specially made in the laboratories. However, a number of prototype electro-optical computing systems have been built recently, perhaps most notably at Bell Laboratories under Huang [Hua80, Hua84], and also at Boulder under Jordan (see below). In addition, several optical message routing devices have been built at Boulder [MJR89], Stanford and USC [Lou91], [MG90, MMG90], [SS84]. Thus optical computing technology has matured to the point that prototypical computing machines can be built in the research laboratories. But it certainly has not attained the maturity of VLSI technology. What is likely to occur in the future? The technology for electro-optical computing is likely to advance rapidly through the 90s, just as VLSI technology advanced in the late 70s and 80s. Therefore, following our past experience with VLSI, it seems likely that the theoretical underpinnings for optical computing technology, namely the discovery of efficient algorithms and of resource lower bounds, are crucial to guide its development. This seems to us to be the right moment for the algorithm designers to get involved in this enterprise. A careful interaction between the architects and the algorithm designers can lead to a better thought-out design. This paper is an attempt in that direction.

1.3 Dynamic Storage: A Key Problem in Optical Computing

The optical computing technology can obtain extremely high data rates, beyond what can be obtained by current semiconductor technology. Therefore, in order to sustain these data rates, the dynamic storage must be based on new technologies, which are likely to be completely or partially optical. Jordan at the Colorado Optoelectronic Computing Systems Center [Jor89b] and some other groups have proposed and used optical delay line loops for dynamic storage. In these data storage systems, an optical fiber, whose characteristics match the operating wavelength, is used to form a delay line loop. In particular, the system sends a sequence of optically encoded bits down one end of the loop, and after a certain delay, which depends on the length and optical characteristics of the loop, the optically encoded bits appear at the end of the loop, to be either utilized at that time and/or once again sent down the loop. This idea of using propagation delay for data storage dates back to the use of mercury delay loops in early electronic computing systems before the advent of large primary or secondary memory storage. Jordan [Jor89b] has been able to store 10^4 bits per fiber loop with a fiber length of approximately one kilometer. This was achieved in a small, low cost prototype system with a synchronous loop without very precise temperature control. Jordan used such a delay loop system to build the second known purely optical computer (after Huang's), which can simulate a counter. Note that this does not represent the ultimate limitations on the storage capacity of optical delay loops, which could in principle provide very large storage using higher performance electro-optical transducers and multiple loops. Maguire and Prucnal [JP89] suggest using optical delay lines to form disk storage.

The main problem with such a dynamic storage is that it is not a random-access memory. A delay line loop cannot be tapped at many points, since a larger number of taps leads to excessive signal degradation. This implies that if an algorithm is not designed around this shortcoming of the dynamic storage, it might have to wait for the whole length of the loop for each data access. Systolic algorithms [Kun79] also exhibit such a tight inter-dependence between the dynamic storage and the data access pattern.

1.4 Loop Memory Model and Our Results

In this paper, we propose the loop-memory model (LMM) as a model of sequential electro-optical computing with delay line based dynamic storage. The loop-memory model contains the basic features that current delay loop systems use, as well as the features that systems in the future are likely to use. It would seem that the restrictive discipline imposed on the data access patterns by a loop memory would degrade the performance of most algorithms, because the processor might have to idle waiting for data. However, we demonstrate that an important class of algorithms, ascend/descend algorithms, can be realized in the loop memory model without any loss of efficiency. Note that many problems, including merging, sorting, DFT, matrix transposition and multiplication, and data permutation, are solvable with an ascend/descend algorithm (described in Section 3). In fact, we give sequential algorithms covering a broad range for the number of loops required. A parallel implementation performing the optimal amount of work is also shown. The work performed by a parallel algorithm running on p processors in time T is given by p x T.

An ascend or descend phase takes time O(n log n) in LMM using log n loops of sizes 1, 2, 4, ..., n. Note that a straightforward emulation of a butterfly network with O(n log n) time performance requires O(n) loops: n loops of size 1, n/2 of size 2, n/4 of size 4, ..., 1 of size n. It can be implemented in time n^1.5 just with two loops of sizes n and sqrt(n) each. This can be generalized into an ascend-descend scheme with time nk + n^1.5/2^(k/2) with 2 <= k <= log n loops. At this point in time, a loop is a precious resource in optical technology, and hence tailoring an algorithm around the number of available loops is an important capability. The k-loop adaptation of the ascend/descend algorithm provides just this capability. A single loop processor takes n^2 time. A matching lower bound also exists for this case, which is derived in Section 6 from one tape Turing machine crossing sequence arguments. Matrix multiplication and matrix transposition can also be performed in LMM without any loss of time.

We also consider a butterfly network with p log p LMM processors, where 1 <= p <= n. The (number of processors) x time product of this network for ascend-descend algorithms is shown to be O(n log n). Note that a butterfly network performs n log n work. This shows that the ascend-descend algorithms can be redesigned in such a way as not to incur any work loss due to the restrictive nature of the loop memories.

1.5 Related Work

Several models for memory have been considered recently. RAM was proposed as a simple model of sequential computation without any virtual memory in Aho, Hopcroft, Ullman [AHU74]. Aggarwal et al. [[AACS87] and [ACS87]] introduced the first hierarchical memory model which also incorporates block transfer. They [ACS89] extended it to the parallel computation model (PRAM). Vitter and Shriver [VS90] consider the PRAM case where parallel block transfers are permitted. In all these models, the algorithm optimization consists of exploiting either temporal or spatial locality given that the access to a block of size b at address x costs time f(x) + b, where f(x) is the seek time. In our model, the synchronization of the loops (primary storage) with the processor is the main objective in algorithm design. In a more fully developed optical computing system, it almost seems imperative that a memory hierarchy will exist, with either fast disks or laser disks at the bottom of the hierarchy. An analysis of the traffic between the loop memory and secondary memory could be similar to the previous work. In the early days of computing, acoustic delay lines were used as mass storage. But using a delay line as a secondary storage medium does not pose the problems which are encountered when it is used as the primary storage. Its use as a secondary storage can be analyzed by the techniques developed for the hierarchical memory model (HMM) [AACS87]. The related work in the theory of optical computing is very sparse. Barakat and Reif [BR87] introduced VLSIO (electro-optical VLSI), a model of optical computing. They considered volume-time trade-offs and lower bounds in this model. We [TR90] demonstrated energy and energy-time product lower and upper bounds for optical computations. We [RT90] introduce the DFT-PRAM model, where a DFT can be computed in unit time, to model the power of optical computing. We develop several O(1) time algorithms in this model.

The work most related to this paper is that of Jordan [Jor89b], [HJP92], [Jor89a], [Jor91]. They describe an optical delay loop system and show that a number of networks could be simulated by the use of optical delay loops. However, their work does not imply our results; in particular, they did not look at the algorithmic aspects of the loop-memory systems.

Optical wave guides have been used in several optical computing systems, but primarily for computing and not storage. Jordan et al. use delay lines for time slot permutation and sorting [LLJ93a, JLL93+, LLJ93b]. Barbarossa et al. [BL92] use optical delay lines for processing. [WKG+96] performs adaptive array processing with recirculating loops. There are several applications of microwave processing with delay line loops [DTD+96].

The algorithms presented in this paper schedule the memory loops for their data needs. However, all the problems solvable with an ascend-descend algorithm entertain structured data, and hence data scheduling can be built into the algorithm, as in the systolic algorithms case. For random data, memory loop scheduling similar to register allocation in current-day compilers [Cha82] poses an interesting problem. Tyagi [Tya97] proposes some algorithms for scheduling data on loops.

1.6 Organization

The rest of the paper is organized as follows. Section 2 introduces both the sequential processor and the parallel loop memory model. The various sequential processor implementations of the ascend-descend algorithms are described in Section 3. Section 4 sketches the p processor version of ascend-descend algorithms. A brief sketch of matrix transpose and multiplication is given in Section 5. Some lower bounds are shown in Section 6. An alternate model for parallel optical computation, where the delay line loops serve as both a communication link and a dynamic memory, is presented in Section 7. Section 8 concludes the paper.

2 Model

The intent here is to develop a model for a processor with the primary dynamic storage in a delay line loop memory. The processor can have an instruction set similar to that of the Random Access Machine (RAM) in Aho et al. [AHU74]. In order to make it a little more reasonable, we will assume that the processor also contains a constant number of registers, each capable of holding one word of data. Note that a register operating at a gigahertz rate (10^9 per second) can be built using a very short delay line with a directional coupler [Jor89b].

Memory Loops: A processor can have several delay line memory loops of pre-specified lengths associated with it. Let there be k loops L_i of lengths n_i for 0 <= i < k. Note that the length of the fiber as a multiple of the wavelength determines the number of bits stored in it. We assume that the size of data stored in each loop L_i is a pair of words, i.e., it is a word-parallel implementation. This could mean that each conceptual loop L_i is implemented using 64 physical loops for a 32-bit word. An alternative method could be to encode two bits into each wave, thereby requiring only 32 physical loops.

Figure 1: A Processor with Three Delay Line Memory Loops (a CPU with local registers tapping loops 1, 2 and 3)

In either case, L_i can store n_i word pairs. The processor has a tap into each of these loops. A simple conceptual model is to assume that the current word under the tap is stored in a register associated with the loop. In practice, a memory loop could be directly coupled to a computational device without any need for an intermediate register. In the following, we make the assumption that, for all practical purposes, the CPU accesses a loop through its tap register. A write into the loop is also performed from its tap register. We refer to the value in the tap register of the loop L_i by v(L_i). We will use the notation v(L_i) = e to denote a write of the result of an expression e into L_i. Similarly, x = e(v(L_i)) denotes a read of L_i for the evaluation of the expression e.
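To make the tap register semantics concrete, here is a minimal Python sketch of a delay line loop, under the assumption of one rotation step per machine cycle. It is our illustration, not part of any system described here; the names DelayLoop, step, read and write are ours.

class DelayLoop:
    """A circulating store; the word under the read/write tap changes every cycle."""
    def __init__(self, length):
        self.words = [None] * length   # the n_i words circulating in the fiber
        self.pos = 0                   # index of the word currently under the tap

    def step(self):
        # One machine cycle elapses: the next word arrives at the tap.
        self.pos = (self.pos + 1) % len(self.words)

    def read(self):
        # v(L): the word in the tap register.
        return self.words[self.pos]

    def write(self, value):
        # v(L) = e: overwrite the word in the tap register.
        self.words[self.pos] = value

# A word written now is visible again only after a full rotation,
# which is the source of all the synchronization constraints below.
L = DelayLoop(4)
L.write("a")
for _ in range(4):
    L.step()
assert L.read() == "a"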

Note that the digital optical computer at Colorado [Jor91, HJP92, SJ90] uses a bit-serial design. Each loop stores 64 words of 16 bits each in a bit-serial manner. The primary reason for choosing a bit-serial design was the cost of the optical switches. In their memory loop implementation [SJ90], each loop has a distinct read and write port. A Lithium Niobate directional coupler is used at each of these ports. Each data word recirculates serially through a delay line, passing the input/output ports once each memory cycle (20.5 microseconds, the time to go through the loop once). A memory counter keeps track of the address of the memory word at the read/write port. There are no latches in the system. The entire processor is designed with "time-of-flight" synchronization.

Read/Write Delays: The delay to read or write from a memory loop is a fraction of the time required for computation. The situation is similar to the register access time being a fraction of the cycle time. In this paper, we will assume that both read and write require unit time. Recall that the Colorado optical computer [Jor91] has a memory cycle of 20.5 microseconds with a loop storing 64 16-bit words. If these loops were to be made word-parallel, then the time between successive words is 20 ns, which corresponds to a 50 MHz processing rate. This indeed is the processing rate of the Colorado digital optical computer. Hence our assumption about each loop access taking a machine cycle is not unrealistic. We also blur the distinction between an instruction cycle and a machine cycle in the algorithms presented in Section 3. We assume that instructions finish in one machine cycle as well, or that the CPI (cycles per instruction) [HP96] is nearly 1. The algorithms in this paper can be easily modified to fit any other delay model.

The operands for the instructions can be any of the local registers or one of the loop tap registers. The parallel butterfly model is a direct extension of the sequential processor model. The interesting scenario occurs only when an n-input problem instance needs to be solved on a p log p processor butterfly network for p < n. In such a case, the loop memory at each processor is used to maintain the copies of multiple data resulting from folding. For a description of a butterfly network, the reader is referred to Ullman [Ull84].

Note that the prototype of the electro-optical computing system being built at Colorado by Jordan [Jor89b] resembles our model. All the register and main memory is realized with fiber delay line storage loops. The processing logic is also derived from the optical technology in the form of a directional coupler, where the electric field between the wave guides is used as the control mechanism [Jor89b].

3 Ascend-Descend Algorithms

The CASCADE algorithms (Ascend-Descend), introduced in Preparata and Vuillemin [PV81], are an important class of algorithms. Consider an algorithm with n = 2^k input data items. Let the ith input be located at address i for all 0 <= i <= n-1 initially. An algorithm is an ascend algorithm if it operates successively on pairs of data items that are located 2^0, 2^1, 2^2, ..., 2^(k-1) = n/2 distance apart. This involves ascending right to left on the k address bits. A descend algorithm is defined similarly to operate on pairs of data items 2^(k-1), 2^(k-2), ..., 2^0 apart. We define CASCADE to be the class of algorithms that are composed of a sequence of Ascend and Descend algorithms. Many problems, such as merging, sorting, DFT, matrix transposition and multiplication, and data permutation, have CASCADE algorithms. Hence, an efficient implementation of Ascend and Descend algorithms translates into an efficient implementation of all these problems. Besides, the topology of many interconnection networks such as Butterfly, Perfect-shuffle, Hypercube and Cube-connected-cycles is ideally suited for Ascend/Descend algorithms.
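For reference, the random-access (non-loop) semantics of an ascend phase can be stated in a few lines of Python. This sketch of ours only illustrates the access pattern; f is a placeholder for the operation applied to each pair.

def ascend(x, f):
    n = len(x)                          # n must be a power of 2
    span = 1
    while span < n:                     # distances 1, 2, 4, ..., n/2
        y = list(x)
        for i in range(n):
            y[i] = f(x[i], x[i ^ span])   # partner differs in one address bit
        x, span = y, 2 * span
    return x

# Example: ascend([3, 1, 4, 1, 5, 9, 2, 6], min) leaves the global minimum
# in every position. A descend phase is identical except that the span
# runs n/2, n/4, ..., 1.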

In this section, we show single processor implementations of an ascend-descend algorithm with the following resources: (1) time O(n log n), log n loops of sizes 1, 2, 4, ..., n; (2) time n^1.5, two loops of sizes n and sqrt(n); and (3) time nk + n^1.5/2^(k/2), k loops, with 2 <= k <= log n, of sizes sqrt(n/2^k) and n/2^k, n/2^(k-1), ..., n. Note that the first implementation performs the optimal amount of work, Theta(n log n),

Figure 2: Loop Words (loop L_i holds words L_i^(n_i - 1), ..., L_i^2, L_i^1, L_i^0 circulating past the read/write head)

demonstrating that an ascend/descend algorithm can be realized without any loss in time performance due to the high latency evident in loop memories. Also note that the number of loops in a real optical processor might be limited. The k-loop realization is helpful in this situation, as it provides an implementation of ascend/descend algorithms on any optical processor.

Loop Primitive Operations:

In the following algorithms, we will perform the same set of actions on loops many times. Hence we encapsulate these actions into loop primitives. We will use the notation |L_i| to denote the length of the loop L_i, also given by n_i. Recall that we have assumed that each location in the loop is capable of holding two words, as if there were two conceptual tracks on the loop. We will refer to these words by v_u(L_i) and v_l(L_i) for the upper and lower track words respectively. We will also refer to the words v_u(L_i^0) and v_l(L_i^0) by the notation v_u(L_i) and v_l(L_i) respectively. The notation L_i^j denotes the word that would be at the read/write head (tap register) of L_i in j steps from the current time. Figure 2 shows a loop L_i with data circulating left to right. The right end of the loop is connected to the left end. Sometimes, we will need to tag a value in a loop L_i for reasoning about the correctness of our algorithms. In that case, we will assume that we can associate a memory counter with each word in the loop L_i, with values between 0 and n_i - 1. We will refer to the upper (lower) word with memory counter value j by the notation L_i^u[j] (L_i^l[j]).

Input copying: Our initial assumption is that the primary program input values reside in some other memory. We would copy different input values into a given loop of size 1 over time. The primitive copy_input(x_j, L_i) copies the first argument x_j to the upper track of the loop L_i, at the word aligned with the write port when this primitive is invoked, i.e., v_u(L_i^0) = x_j. We assume that copy_input takes one cycle.

Copying: In the ascend-descend algorithms in the next section, many copying steps involve copying each value from a loop L_i to two locations in L_j for |L_j| = 2|L_i|. In particular, v(L_i^0) needs to be copied to v(L_j^0) and v(L_j^(n_i)). Figure 3 shows the copy pattern.

Figure 3: Loop Copying for Ascend-Descend Algorithms (each word of L_i is copied to two positions, k and n_i + k, of L_j)

copy_upper(L_i, L_j)
{L_j has twice as many words as L_i.}
{Copy upper track words of L_i into the upper track of L_j.}
    for all k, 0 <= k <= n_i - 1, v_u(L_j^k) = v_u(L_i^k);            (1)
    for all k, 0 <= k <= n_i - 1, v_u(L_j^(n_i + k)) = v_u(L_i^k);    (2)
end copy_upper

The second copy of L_i (through the second for-all at Line 2) is synchronized for L_i and L_j. After Line 1 has completed copying L_i into the first half of L_j, the word that was at L_j^(n_i) at the entry into copy_upper is now at the read/write port (the current L_j^0). L_i^0 at the entry into copy_upper is again at the read/write port. We need another similar primitive for copying the upper track words from L_i to the lower track of L_j. Note that this copying takes time n_j = |L_j|.

j j

copy_lower(L_i, L_j)
{L_j has twice as many words as L_i.}
{Copy upper track words of L_i into the lower track of L_j.}
    for all k, 0 <= k <= n_i - 1, v_l(L_j^k) = v_u(L_i^k);            (1)
    for all k, 0 <= k <= n_i - 1, v_l(L_j^(n_i + k)) = v_u(L_i^k);    (2)
end copy_lower

Loop computation: Another step in the ascend-descend algorithms is to apply a function f of two arguments to the words on the upper and lower tracks for an entire loop.

compute_loop(L_i, f)
{Apply f to words in L_i.}
    for all j, 0 <= j <= n_i - 1, v_u(L_i^j) = f(v_u(L_i^j), v_l(L_i^j));
end compute_loop

An assumption made here is that the reads of the two words v_u(L_i^0) and v_l(L_i^0), the computation of f, and the write of the result v_u(L_i^0) all take one cycle. If need be, the result could be shifted by a few positions determined by the latency c of the computation, i.e., v_u(L_i^(k+c)) = f(v_u(L_i^k), v_l(L_i^k)). The time taken by this computation is n_i = |L_i|, assuming one cycle read-compute-write.
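The following Python sketch renders these primitives under the stated assumptions (two tracks per location, one read-compute-write per cycle). Every loop in the machine rotates on every cycle, mimicking the time-of-flight discipline; the class and helper names are ours, not the paper's.

class TrackLoop:
    def __init__(self, length):
        self.upper = [None] * length
        self.lower = [None] * length
        self.pos = 0                       # read/write port position

def tick(loops):
    # One machine cycle: every loop in the machine advances by one word.
    for L in loops:
        L.pos = (L.pos + 1) % len(L.upper)

def copy_upper(Li, Lj, loops):
    # Copy the upper track of Li twice around the upper track of Lj.
    # Takes |Lj| = 2|Li| cycles; Li simply rotates past its port twice.
    for _ in range(2 * len(Li.upper)):
        Lj.upper[Lj.pos] = Li.upper[Li.pos]
        tick(loops)

def copy_lower(Li, Lj, loops):
    # Same, but into the lower track of Lj.
    for _ in range(2 * len(Li.upper)):
        Lj.lower[Lj.pos] = Li.upper[Li.pos]
        tick(loops)

def compute_loop(Li, f, loops):
    # Apply f to the (upper, lower) pair at the port, once per word.
    for _ in range(len(Li.upper)):
        Li.upper[Li.pos] = f(Li.upper[Li.pos], Li.lower[Li.pos])
        tick(loops)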

log n Loop Version:

We number the levels of a butterfly network from 0 through log n bottom-up, as in Figure 4, such that the cross-connections between level i and i + 1 have a span 2^i. The processors on the same level are numbered 0 through n - 1 from left to right. Let x_{i,j} for 0 <= i <= n - 1 and 1 <= j <= log n be the value computed in the processor p_{i,j}. The input value x_i corresponds to x_{i,0} for 0 <= i <= n - 1. Note that for the ascend phase x_{i,j} = f(x_{i,j-1}, x_{i+2^(j-1),j-1}) when the jth bit from the right in the binary representation of i is 0, and x_{i,j} = f(x_{i,j-1}, x_{i-2^(j-1),j-1}) otherwise. Here f is some operation performed in the processors. Similarly, for the descend phase, x_{i,j}, for 0 <= i <= n - 1 and 0 <= j <= log n - 1, is given by the following expressions: x_{i,j} = g(x_{i,j+1}, x_{i+2^j,j+1}) when the (j+1)st bit from the right in the binary representation of i is 0, and x_{i,j} = g(x_{i,j+1}, x_{i-2^j,j+1}) otherwise. The skeleton of the implementation is as follows. A butterfly computation graph is scanned in the order shown in Figure 4. We will only demonstrate the ascend phase of the algorithm. The descend phase is very similar.

We assume that log n memory loops L_1, L_2, L_4, ..., L_n of sizes 1, 2, 4, 8, ..., n respectively are available. Recall that the word i positions away from the read/write port in the loop L_k is referred to as L_k^i. The 0th position of a loop L_k is the position scanned by the tap register currently. The following implementation of the ascend algorithm assigns the physical loop L_{2^i} to the storage of the values associated with butterfly level i, for 0 <= i <= log n. For instance, loop L_1 is time-multiplexed to store all the butterfly level 0 values, while loop L_2 stores all the butterfly level 1 values. Figure 5 presents the ascend phase algorithm.

The ascend phase is implemented by a call to Process-Ascend(0, n) (Figure 5), where n is assumed to be a power of 2. It takes time 3n log n + n as:

    T(n) = 2T(n/2) + 3n;  T(1) = 1.

The two recursive calls to Process-Ascend in Steps 2 and 4 of Figure 5 give the term 2T(n/2), and the two copying steps in Steps 3 and 5 lead to a 2n term. The computation in Step 6 adds another n to this equation. These equations have a solution in T(n) = 3n log n + n.
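Unrolling the recurrence makes the claimed solution explicit; each of the log n levels of the recursion contributes 3n:

    T(n) = 2\,T(n/2) + 3n = 2^{i}\,T(n/2^{i}) + 3ni
         = 2^{\log n}\,T(1) + 3n\log n = 3n\log n + n .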

The main concern in determining the correctness of this implementation and its timing analysis is the synchronization of all the loop accesses. For instance, at the return from the call Process-Ascend(0, n/2) at Step 2, are the loops L_{n/2} and L_n synchronized so that the copying takes place in n

Figure 4: Butterfly Network: Thick Lines Show the Order of Computing

Process-Ascend(i, k)
{Process the k input bit positions starting at x_i for the ascend phase.}
{k is assumed to be a power of 2.}
if k = 1 then copy_input(x_i, L_1);  {copy input bit x_i to L_1}      (1)
else
    Process-Ascend(i, k/2);                                           (2)
    copy_upper(L_{k/2}, L_k);                                         (3)
    Process-Ascend(i + k/2, k/2);                                     (4)
    copy_lower(L_{k/2}, L_k);                                         (5)
    compute_loop(L_k, f);                                             (6)
    {compute the values x_{i,k} = f(x_{i,k/2}, x_{i XOR k/2, k/2}), for 0 <= i <= k - 1.}
end if
{Note that x_{i,k/2} and x_{i XOR k/2, k/2} are available in L_k at position i due to the previous two copying steps.}
end Process-Ascend

Figure 5: Recursive Version of Ascend Phase Algorithm

steps? Note that the usage pattern of the loops by this algorithm has the following form: L_1, L_2, L_1, L_2, L_4, L_1, L_2, L_1, L_2, L_4, L_8, and so on. The following inductive proof establishes that the timing of the algorithm is correct, as the synchronization is done by "time-of-flight". Note that we have not defined the term "synchronization" used in the statement of the following theorem.

Theorem 1 The ascend phase algorithm in Figure 5 has synchronized data transfer and computation to emulate the behavior of a butterfly network.

Proof: The proof is inductive. Assume that all the data transfers and computation are synchronized for Process-Ascend(i, k) for all values of k < n. The base case, k = 1, involves only the transfer of x_i into L_1. The inductive step is to prove that Process-Ascend(0, n) is synchronized given that Process-Ascend(0, n/2) and Process-Ascend(n/2, n/2) are synchronized.

Assuming that n > 1, Process-Ascend(0, n) first calls Process-Ascend(0, n/2), copies L_{n/2} into L_n, calls Process-Ascend(n/2, n/2), copies L_{n/2} into L_n, and finally computes L_n. After Process-Ascend(0, n/2) is done, the assumption part of the inductive hypothesis is that L_{n/2} is scanning x_{0,log n - 1}, i.e., v_u(L_{n/2}^0) = x_{0,log n - 1}. In addition to showing that all the copying steps work correctly, we also need to prove that at the end of Process-Ascend(0, n), v_u(L_n^0) = x_{0,log n}.

Let T = 0 when the copy_upper at Step 3 starts copying L_{n/2} into L_n. Let the memory counter of the word copied into L_n at 0 <= T = k <= n - 1 be k. Similarly, let the memory counter in L_{n/2} for the word copied at 0 <= T = k <= n - 1 be k mod (n/2). At T = 0, L_{n/2}^u[0] = x_{0,log n - 1}, which implies that L_n^u[j] = L_n^u[j + n/2] = x_{j,log n - 1} for 0 <= j < n/2. The copying is completed at T = n - 1, when Process-Ascend(n/2, n/2) is called. This call is completed in (3n/2)(log n - 1) + n/2 steps. Once again, at time T = n + (3n/2)(log n - 1) + n/2, v_u(L_{n/2}^0) = x_{n/2,log n - 1} by the inductive hypothesis. The key question from the synchronization perspective is: which word is at the read/write port of L_n at T = n + (3n/2)(log n - 1) + n/2? We would like v_u(L_n) to be either L_n^u[0] or L_n^u[n/2], which holds if T = n + (3n/2)(log n - 1) + n/2 is an integral multiple of n/2. Since n/2 divides the time elapsed since the last access to memory counter 0 in L_n, our synchronization holds for this copying step. After the copy_lower of Step 5 is completed at T = 5n/2 + (3n/2)(log n - 1), we have L_n^l[j] = L_n^l[j + n/2] = x_{n/2 + j,log n - 1} for 0 <= j < n/2. Recall that L_n^u[j] = L_n^u[j + n/2] = x_{j,log n - 1}. Hence, at T = 5n/2 + (3n/2)(log n - 1), v_u(L_n^j) = x_{j,log n - 1} and v_l(L_n^j) = x_{n/2 + j,log n - 1} for 0 <= j < n/2, as expected by the ascend algorithm, resulting in L_n^u[j] = x_{j,log n}. Similarly, v_u(L_n^j) = x_{j - n/2,log n - 1} and v_l(L_n^j) = x_{j,log n - 1} for n/2 <= j < n, resulting in L_n^u[j] = x_{j,log n} by the inductive hypothesis. At time T = 7n/2 + (3n/2)(log n - 1), when compute_loop is done, v_u(L_n^0) = x_{0,log n}, establishing the second part of the inductive hypothesis.

for (d = 1; d <= n; d = d + 1)                                    (1)
    {d is the diagonal number.}
    for (k = 1; k <= d; k = 2k)                                   (2)
        {k is the span of the data being processed.}
        if ((i = d - k) mod k == 0)                               (3)
            {i is the starting position. All valid (k, i) combinations have i mod k = 0.}
            if (k == 1) copy_input(x_i, L_1)                      (4)
            else compute_loop(L_k, f);                            (5)
            if (i mod (2k) == 0) copy_upper(L_k, L_{2k})          (6)
            else copy_lower(L_k, L_{2k});                         (7)

Figure 6: Iterative Version of Ascend Phase Algorithm

Figure 5 gives a recursive version of the ascend phase implementation. We give an iterative version in Figure 6, as it helps understand the timing of the ascend phase from a different perspective. The key point to note here is that Process-Ascend(i, k) calls one computing primitive (compute_loop usually, or copy_input for an input bit with k = 1) and one copying primitive (copy_upper or copy_lower) for each feasible value of (i, k). The temporal order in which these two primitives for different (i, k) pairs are called is as follows: (0, 1), (1, 1), (0, 2), (2, 1), (3, 1), (2, 2), (0, 4), (4, 1), (5, 1), (4, 2), (6, 1), (7, 1), (6, 2), (4, 4), (0, 8), ..., (0, n). Note that i + k denotes a diagonal in the butterfly network of Figure 4. Also note that all the (i, k) pairs are invoked in the increasing order of their diagonal d = i + k. Of course, some of the diagonal entries are never invoked; for instance, (i = 1, k = 2) is not used, due to the nature of this computation. Line 3 in Figure 6 checks for the validity of each (i, k) pair. Lines 1 and 2 set up these (i, k) pairs going over diagonals d from 1 to n.
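This diagonal order can be checked mechanically. The short Python sketch below is our illustration of Lines 1-3 of Figure 6; for n = 8 it reproduces exactly the sequence quoted above.

def schedule(n):
    out = []
    for d in range(1, n + 1):      # Line 1: diagonal number d = i + k
        k = 1
        while k <= d:              # Line 2: spans 1, 2, 4, ...
            i = d - k
            if i % k == 0:         # Line 3: validity of the (i, k) pair
                out.append((i, k))
            k *= 2
    return out

print(schedule(8))
# [(0,1), (1,1), (0,2), (2,1), (3,1), (2,2), (0,4), (4,1), (5,1), (4,2),
#  (6,1), (7,1), (6,2), (4,4), (0,8)]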

The above description assumes that the n input words are available in some other memory (not the optical loops). This is not a realistic assumption, since the processor contains only a small, O(1) sized local storage in addition to the memory loops. The only other place to hold the input is the slow secondary memory. In the following, we show that if the input words are resident in the loop L_n initially, then they can be trickled down to the smaller loops in synchrony with Process-Ascend without any loss of efficiency. Here, we assume that either there is a separate third track on each loop to hold the input values, or there is a separate set of log n input loops. In either case, let I_k denote the input loop of size k, which is either part of the loop L_k or is an independent loop. Let the n input words be in I_n initially. In fact, we will assume that the memory counter for the input word x_j is j, i.e., I_n[j] = x_j for 0 <= j < n. We use two input copying primitives, copy_input_left and copy_input_right. The primitive copy_input_left(I_k, I_{k/2}, i) copies I_k's left half (k/2 words) to I_{k/2}, starting with input word x_i, which would be at memory counter value 0.

copy_input_left(I_k, I_{k/2}, i)
{Argument i is only for reading convenience. Whenever this primitive is called, we maintain the invariant that I_k[0] = x_i.}
    for all j, 0 <= j < k/2, I_{k/2}[j] = I_k[j];
end copy_input_left

We define copy_input_right(I_k, I_{k/2}, i + k/2) to copy the second (right) half of I_k (k/2 words) into I_{k/2}. This task could have been performed by copy_input_left, but again for presentation clarity we also define copy_input_right. Once again, the specification of i + k/2 is really redundant. The expectation is to copy I_k[k/2 + j] = x_{i + k/2 + j} for 0 <= j < k/2 into I_{k/2}. Whenever this primitive is called, v(I_k^0) = I_k[k/2] = x_{i + k/2}.

copy_input_right(I_k, I_{k/2}, i + k/2)
{Argument i + k/2 is only for reading convenience. Whenever this primitive is called, v(I_k^0) = I_k[k/2] = x_{i + k/2}.}
    for all j, 0 <= j < k/2, I_{k/2}[j] = I_k[k/2 + j];
end copy_input_right

The procedure Process-Ascend also needs to be modified as follows:

Process-Ascend(i, k)
{Process the k input bit positions starting at x_i for the ascend phase.}
{k is assumed to be a power of 2.}
if k = 1 then return;                                             (1)
else
    copy_input_left(I_k, I_{k/2}, i);                             (2)
    Process-Ascend(i, k/2);                                       (3)
    copy_upper(L_{k/2}, L_k);                                     (4)
    copy_input_right(I_k, I_{k/2}, i + k/2);                      (5)
    Process-Ascend(i + k/2, k/2);                                 (6)
    copy_lower(L_{k/2}, L_k);                                     (7)
    compute_loop(L_k, f);                                         (8)
    {compute the values x_{i,k} = f(x_{i,k/2}, x_{i XOR k/2, k/2}), for 0 <= i <= k - 1.}
end if
{Note that x_{i,k/2} and x_{i XOR k/2, k/2} are available in L_k at position i due to the previous two copying steps.}
end Process-Ascend

We now need to argue that this new version of Process-Ascend works correctly and that the input words are also synchronized. All that we have modified in Process-Ascend is to insert copy_input_left(I_k, I_{k/2}, i) at Line 2 and copy_input_right(I_k, I_{k/2}, i + k/2) at Line 5. Also, the action taken when k = 1 at Line 1 has changed from copying the input word x_i into L_1 to doing nothing. This version of the ascend algorithm takes time 4n log n. Note that:

    T(n) = 4n + 2T(n/2);  T(1) = 0.

The input copying steps at Lines 2 and 5 each take time n/2. This results in an extra n term in the expression for T(n). Since nothing is done for the k = 1 case, the expression for T(1) is 0. These equations have a solution in T(n) = 4n log n.

We now give a proof sketch for the timing of this version. The key invariant is that right before Process-Ascend(i, k) is invoked, I_k contains the input words x_i, x_{i+1}, ..., x_{i+k-1} with v(I_k^0) = x_i. It is true for the top-level call Process-Ascend(0, n) as an assumption. For all the other recursive calls of Process-Ascend, this invariant is maintained by Lines 2 and 5.

Lemma 1 The revised version of Process-Ascend is synchronized.

Proof: The proof is by induction on k, as in Theorem 1. The inductive hypotheses include: (1) when Process-Ascend(i, k) is called, I_k contains x_{i+j} for 0 <= j < k with v(I_k^0) = x_i, (2) at the return from Process-Ascend(i, k), v_u(L_k^0) = x_{i,log k}, (3) all the copying steps in Lines 2, 4, 5 and 7 are synchronized.

Let us establish the hypothesis for Process-Ascend(0, n) assuming it is true for all k < n. Note that by assumption v(I_n^0) = x_0, and hence the copying step at Line 2 copies x_0, x_1, ..., x_{n/2-1} into I_{n/2}, establishing the same invariant for the Process-Ascend(0, n/2) call on Line 3. Line 2 (copy_input_left) takes n/2 steps. Hence at the entry into Process-Ascend(0, n/2), v(I_n^0) = x_{n/2} = I_n[n/2]. Process-Ascend(0, n/2) takes time 2n(log n - 1), which is an integral multiple of n. copy_upper(L_{n/2}, L_n) is synchronized to the beginning of L_{n/2} by hypothesis (2). This copying at Line 4 takes n steps. At the end of this copying, v(I_n^0) is still x_{n/2} (2n(log n - 1) + n steps since the entry into Line 3, Process-Ascend(0, n/2)). Hence copy_input_right at Line 5 is entered with the correct invariant, copying x_{n/2}, ..., x_{n-1} into I_{n/2}, establishing hypothesis (1) for the following Process-Ascend(n/2, n/2) call. copy_lower is synchronized by the argument used in Theorem 1, since the number of steps elapsed from the end of copy_upper (Line 4) is 2n(log n - 1) + n/2, which is an integral multiple of n/2. Since copy_lower takes n steps, compute_loop is also synchronized, leaving v_u(L_n^0) = x_{0,log n} and establishing hypothesis (2).

Figure 7: A Square Shifter (a sqrt(n) x sqrt(n) array holding x_0, ..., x_{n-1} column by column)

Two Loop Version:

An ascend-descend algorithm can be implemented with fewer loops (two), but then it takes a longer time, O(n^1.5). There seems to be a trade-off between the number of loops deployed and the time taken by the algorithm. The early optical processors are likely to have a small number of loops, and hence this trade-off provides a welcome opportunity.

The basic idea for this algorithm comes from the square implementation of a barrel shifter as described in Ullman [[Ull84], page 69]. It is shown in Figure 7. The n bits to be shifted are stored along a ceil(sqrt(n)) x ceil(sqrt(n)) array. We number the rows bottom-up from 0 to ceil(sqrt(n)) - 1 and the columns left-to-right from 0 to ceil(sqrt(n)) - 1. The least significant input bit x_0 is stored at the (0, 0) position, x_{ceil(sqrt(n))-1} at (ceil(sqrt(n)) - 1, 0), x_{ceil(sqrt(n))} at (0, 1), and x_{n-1} at (ceil(sqrt(n)) - 1, ceil(sqrt(n)) - 1). The log n control bits c_{log n - 1} c_{log n - 2} ... c_1 c_0 form the word specifying the shift amount 0 <= s < n. The least significant half of the control bits, c_{(log n)/2 - 1} ... c_1 c_0, encode the vertical shift amount 0 <= s_v < sqrt(n). Note that any bits shifted out of column j for 0 <= j < sqrt(n) - 1 move into the appropriate entry of column j + 1. Similarly, the most significant half of the control bits, c_{log n - 1} ... c_{(log n)/2}, specify the horizontal shift amount. Note that each one-bit shift horizontally corresponds to a shift by 2^{(log n)/2} = sqrt(n) in the original word. This shifting takes time O(sqrt(n)), in particular at most 2 sqrt(n).
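The decomposition of a shift into its vertical and horizontal parts is just a division with remainder. A small Python sketch of this bookkeeping (ours; it assumes n is a perfect square):

import math

def shift_decompose(s, n):
    r = math.isqrt(n)          # sqrt(n): the side of the square array
    return s % r, s // r       # (vertical shift, horizontal column shift)

# With n = 16, a shift by 7 is a vertical shift by 3 followed by a
# horizontal shift of 1 column, since 7 = 1 * 4 + 3.
assert shift_decompose(7, 16) == (3, 1)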

For an ascend-descend algorithm, the processing at Level j as shown in Figure 4 requires communication between the words x_{i,j-1} and x_{i +/- 2^(j-1),j-1}. This communication is achieved through a square-shifter emulating mechanism in the 2-loop version. We use one loop of size n to store all the n words, the entire array of the square shifter. This loop is called the main loop. Another loop of size sqrt(n) is used to process one column or row of the square shifter array at a time. We refer to the sqrt(n) size loop as the context loop. The input data is initially stored in the main loop L_n. Note that any row or column can be copied from the main loop L_n to the context loop L_{sqrt(n)} within O(n) steps.

We process the sqrt(n) columns first, followed by the sqrt(n) rows. We copy each column and row from the main loop into the context loop in turn and process it. The processing of column i for 0 <= i < sqrt(n) corresponds to processing the lower (log n)/2 levels (Levels 1 through (log n)/2) for the columns i sqrt(n) through (i + 1) sqrt(n) - 1 of the butterfly network. Similarly, the computation of the jth row in the square shifter for 0 <= j < sqrt(n) processes the upper (log n)/2 levels (Levels (log n)/2 + 1 through log n) for the columns j, j + sqrt(n), ..., (n - sqrt(n)) + j. Note that if sqrt(n) is chosen to be a power of 2, all the words needed by the Level i computation for 1 <= i <= (log n)/2 are contained in loop L_{sqrt(n)}. If sqrt(n) were not a power of 2, the context loop would need to be 2 sqrt(n) long to be able to contain entries for columns i and i + 1 when column i is being processed. For instance, for n = 9, the context loop has 3 entries. When Level 1 (see Figure 4) is computed, x_{0,0} and x_{1,0} need to communicate (columns 0 and 1 of the butterfly) and x_{2,0} and x_{3,0} need to communicate (columns 2 and 3 of the butterfly). However, the context loop only contains the values of x_{0,0}, x_{1,0}, and x_{2,0}. If we did use the context loop of size 2 sqrt(n) (which is 6 in this case), the context loop would have the values x_{0,0}, x_{1,0}, ..., x_{5,0}. When Column 0 of the square shifter is processed, the values for x_{0,1}, x_{1,1}, x_{2,1}, x_{3,1} will be computed. The values for x_{4,1} and x_{5,1} are computed when Column 1 of the square shifter is handled. In general, in this case 2^{ceil(log n)/2} > sqrt(n). We compute k = 2^{ceil(log n)/2} values for each iteration of the context loop for a square shifter column, except the last one. For simplicity of illustration, we assume in the following that n is a power of 2.

For the two-loop algorithm, we are assuming that the synchronization of the loops is done through memory counters for the loops. Let us assume that the ith butterfly column value x_{i,l} for all levels l is at memory counter i for L_n. We use the upper track on L_{sqrt(n)} to keep the level l + 1 values computed from the level l values in the lower track. For the square shifter column i, resulting in the butterfly columns i sqrt(n), i sqrt(n) + 1, ..., (i + 1) sqrt(n) - 1 being copied into L_{sqrt(n)}, the memory counter values are 0, 1, ..., sqrt(n) - 1 respectively. The primitives are:

copy_column(i, L_n, L_{sqrt(n)})
{copies square shifter column i from L_n to L_{sqrt(n)}. Either track can be used on L_n and the lower track is used on L_{sqrt(n)}.}
    for all j, 0 <= j < sqrt(n), L_{sqrt(n)}^l[j] = L_n[i sqrt(n) + j];
end copy_column

The primitive copy_column can take up to n + sqrt(n) - 1 steps. L_{sqrt(n)} may have to wait for up to n - 1 steps for the L_n memory counter to get to i sqrt(n), and sqrt(n) steps are needed to copy after that. We also need a primitive to move the computed values back from the upper track of L_{sqrt(n)} to L_n.

move_column_to_main(i, L_n, L_{sqrt(n)})
{copies the upper track of L_{sqrt(n)} to column i of L_n.}
    for all j, 0 <= j < sqrt(n), L_n[i sqrt(n) + j] = L_{sqrt(n)}^u[j];
end move_column_to_main

move_column_to_main also takes up to n + sqrt(n) - 1 steps. The following primitive, copy_row(i, L_n, L_{sqrt(n)}), copies the ith row from L_n to L_{sqrt(n)}.

copy_row(i, L_n, L_{sqrt(n)})
{copies square shifter row i from L_n to L_{sqrt(n)}. Either track can be used on L_n and the lower track is used on L_{sqrt(n)}.}
    for (j = 0; j < sqrt(n); j = j + 1)
        L_{sqrt(n)}^l[j] = L_n[i + j sqrt(n)];
end copy_row

copy_row can take up to 2n - 1 steps: n - 1 for the L_n memory counter synchronization and n steps for the copying. Again, we need a reverse primitive to move the upper track of a row from L_{sqrt(n)} back to L_n.

move_row_to_main(i, L_n, L_{sqrt(n)})
{copies the computed values in row i from the upper track of L_{sqrt(n)} to L_n.}
    for (j = 0; j < sqrt(n); j = j + 1)
        L_n[i + j sqrt(n)] = L_{sqrt(n)}^u[j];
end move_row_to_main

move_row_to_main can also take up to 2n - 1 steps. The following primitive computes a level l value from two level l - 1 values in L_{sqrt(n)}.

compute_value(x_{j,l}, k, L_{sqrt(n)}, j, f)
{compute f(x_{j,l-1}, x_{j XOR k,l-1}) and put it into L_{sqrt(n)}^u[j] as x_{j,l}, for l = log k.}
    if (j mod 2k == 0)
        L_{sqrt(n)}^u[j] = f(L_{sqrt(n)}^l[j], L_{sqrt(n)}^l[j + k]);
    else L_{sqrt(n)}^u[j] = f(L_{sqrt(n)}^l[j], L_{sqrt(n)}^l[j - k]);
end compute_value

This version of compute_value takes up to sqrt(n) steps, collecting the two level l - 1 values. Figure 8 shows the algorithm for the two-loop version.

This algorithm takes time n^1.5 log n + 6n^1.5. The 6n^1.5 term is due to the copying of the context to and from the main loop. This is because, for each level of the butterfly, each row/column takes time proportional

for i = 0 up to sqrt(n) - 1 do                                        (1)
    {Handle the ith column.}
    copy_column(i, L_n, L_{sqrt(n)});                                 (2)
    for l = 1 up to (log n)/2 do                                      (3)
        {Each (lth) level of the butterfly network is processed in turn.}
        {First deal with the lower (log n)/2 levels, corresponding to column operations.}
        for j = 0 up to sqrt(n) - 1 do                                (4)
            {The value at the jth position interacts with the value at j +/- 2^(l-1).}
            compute_value(x_{i sqrt(n) + j,l}, 2^(l-1), L_{sqrt(n)}, j, f);   (5)
            {this is done in one rotation of L_{sqrt(n)}.}
        end for
    end for
    move_column_to_main(i, L_n, L_{sqrt(n)});                         (6)
    {copy the updated values back to the main loop before getting the next column.}
end for

{Now deal with the upper (log n)/2 levels, corresponding to row operations.}
for i = 0 up to sqrt(n) - 1 do                                        (7)
    copy_row(i, L_n, L_{sqrt(n)});                                    (8)
    for l = (log n)/2 + 1 up to log n do                              (9)
        {process the upper (log n)/2 levels.}
        for j = 0 up to sqrt(n) - 1 do                                (10)
            {The value at the jth position interacts with the value at j +/- 2^(l-1-(log n)/2).}
            compute_value(x_{j sqrt(n) + i,l}, 2^(l-1-(log n)/2), L_{sqrt(n)}, j, f);   (11)
            {this is done in one rotation of L_{sqrt(n)}.}
        end for
    end for
    move_row_to_main(i, L_n, L_{sqrt(n)});                            (12)
end for

Figure 8: Two Loop Version of Ascend Phase Algorithm



to n. It can be easily converted into an O(n^1.5) algorithm as follows. In compute_value, waiting for one rotation to access the partner value x_{j +/- k} is wasteful. If we pipeline this computation, then in one rotation of the context loop L_{sqrt(n)}, sqrt(n)/(2k) values rather than one value can be computed. For instance, for the lth level with Column 0, in the first rotation x_{0,l}, x_{2^(l-1),l}, x_{2^l,l}, ... are computed. Thus for a column (row) computation, the total computation time per column (row) is the sum over i = 1, ..., (log n)/2 of 2^(i-1) sqrt(n), which is at most n. This gives a total time of 2n^1.5 + 6n^1.5. The term 2n^1.5 in the pipelined case compares favorably with the term n^1.5 log n.

In order to implement the pipelined version, the algorithm in Figure 8 would have to be modified at two places. Lines 4 and 5 would be replaced by:

for j = 0 up to 2^(l-1) do                                            (4)
    compute_pipelined_value(x_{i sqrt(n) + j,l}, 2^(l-1), L_{sqrt(n)}, j, f);   (5)
end for

Similarly, Lines 10 and 11 are replaced by:

for j = 0 up to 2^(l-1-(log n)/2) do                                  (10)
    compute_pipelined_value(x_{j sqrt(n) + i,l}, 2^(l-1-(log n)/2), L_{sqrt(n)}, j, f);   (11)
end for

The new primitive compute_pipelined_value(x_{j,l}, k, L_{sqrt(n)}, j, f) computes f(x_{j,l-1}, x_{j XOR k,l-1}), f(x_{j+k,l-1}, x_{(j+k) XOR k,l-1}), f(x_{j+2k,l-1}, x_{(j+2k) XOR k,l-1}), ... in one rotation of the loop L_{sqrt(n)}. It is defined as follows:

compute_pipelined_value(x_{j,l}, k, L_{sqrt(n)}, j, f)
    for (start = j, mc = j; mc < sqrt(n); mc = mc + k, start = start + k)
        if (start mod 2k == 0)
            L_{sqrt(n)}^u[mc] = f(L_{sqrt(n)}^l[mc], L_{sqrt(n)}^l[mc + k]);
        else L_{sqrt(n)}^u[mc] = f(L_{sqrt(n)}^l[mc], L_{sqrt(n)}^l[mc - k]);
    end for
end compute_pipelined_value

Note that this algorithm also provides an O(n^2) one loop implementation. The processor contains only one loop of length n. The ith level of the butterfly takes time 2^(i-1) n for 1 <= i <= log n.

A k Loop Version:

What if we only had 2 <= k + 1 < log n loops available? A hybrid algorithm based on the preceding two algorithms works in time nk + n^1.5/2^(k/2). The k loops of sizes n/2^k, n/2^(k-1), n/2^(k-2), ..., n, and one loop of size sqrt(n/2^k), are required.

Figure 9: A k Loop Algorithm (the top k levels of the butterfly sit above meshes of size sqrt(n/2^k) x sqrt(n/2^k))

Figure 9 demonstrates the scheme behind the algorithm. The bottom log n - k levels of the butterfly network are processed with the two-loop, mesh based algorithm, with the loop of size n/2^k as the main loop and the loop of size sqrt(n/2^k) as the context loop. Each mesh is sqrt(n/2^k) x sqrt(n/2^k). The two-loop algorithm for each mesh, requiring a main loop of size n/2^k and a context loop of size sqrt(n/2^k), takes time 8(n/2^k)^1.5 = 8n^1.5/2^(1.5k). Hence the total time taken in the two-loop phase for all the 2^k meshes is 8n^1.5/2^(k/2). The top k levels of the butterfly are processed using the diagonal scan algorithm mentioned earlier. This phase takes 4nk steps. Thus the total time is bounded by 4nk + 8n^1.5/2^(k/2) steps.
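For the record, the arithmetic behind the bound:

    2^{k}\cdot 8\left(\frac{n}{2^{k}}\right)^{3/2} \;=\; \frac{8\,n^{3/2}}{2^{k/2}},
    \qquad \text{total time} \;\le\; 4nk + \frac{8\,n^{3/2}}{2^{k/2}} .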

4 Parallel Ascend-Descend Algorithm

All the algorithms presented so far have assumed that a single processor is used to solve the problem. In this section, we are interested in using the optical fiber loops for local storage for processors that are part of a parallel computer. In particular, we assume that we have a p-processor machine organized with butterfly connectivity. The problem occurs when an ascend-descend problem with n input words such that n > p is solved on this butterfly network of p nodes. In such a case, multiple instances of data need to be stored at each processor. One loop of size n/p per processor seems to suffice to hold this data, except for the bottom level processors, which require log n - log p loops.

Recall that a p-node butterfly has log p + 1 rows and each row has p processors. As stated earlier, we number the ith processor in the jth row p_{i,j} for 0 <= i < p and 0 <= j <= log p. Note that we

Figure 10: A Systolic Algorithm for Matrix Multiplication (rows of A enter from the left, columns of B from the top; each cell C[i,j] accumulates in place)

have a (log p + 1)-level, p-node butterfly, for 1 <= p < n, where n is a multiple of p. Let p'_{i,j} refer to the (i, j)th processor in the n-node computation graph for 0 <= i < n and 0 <= j <= log n. The following mapping from p'_{i,j} to p_{k,l} will be used in our algorithm. This mapping has been used in earlier work on parallel algorithms. p'_{i,j} for j <= log(n/p) [the lower log n - log p levels] maps into p_{floor(i/(n/p)),0}. For j > log(n/p), p'_{i,j} maps into p_{floor(i/(n/p)),j - log n + log p}. The bottom log n - log p levels of the n-word ascend-descend problem are solved sequentially at one of the Level 0 processors. Levels l + log n - log p of the ascend-descend problem are mapped into Level l of the butterfly network.

Levels l + log n log p of the ascend-descend problem are mapp ed into Level l of the butter y network.

Once again, let us consider the implementation of the ascend phase. Each pro cessor in the b ottom

level of the p-no de butter y simulates an n=p-no de butter y sequentially, as shown in Section 3,

Algorithm 1. Thus, each of these p pro cessors contains log n log p lo ops of sizes 1; 2; 4; :::; n=p.

n n

The time taken for this pro cessing is 4 log . The top log p levels of the n-no de butter y are

p p

pip elined in the p-no de butter y. The j th level of the p-no de butter y pro cesses the log n=p+ jth

level of the n-no de butter y. Each pro cessor p contains n=p consecutive data values as dictated

k;l

0

to p . The connectivity of the p-no de butter y is adequate to emulate by the mapping from p

k;l

i;j

the connections of the n-no de butter y.To see this, note that the span of the connections b etween

j

level j and j + 1 in an n-no de butter y is 2 . The j th level is mapp ed into the i = j log n=pth 22

j

2

level of the p-no de butter y. The span of connections b etween the ith and i + 1st levels is .Thus

n=p

0

 has a cross-connection with the the l th elementinp whichwould have b een in p

j;i

j n=p+ l; i+logn=p

0

. This is the right cross-connection. l th elementofp whichwould b e p

i

i

j2;i+1

n=p[j 2 ]+l; i+1+logn=p

The lo ops b etween two consecutive levels are also synchronized. This can b e proven formally with an

n

induction argument. The top log p levels take time log p + .

p

n n n

The total time taken by this algorithm is 4 log + log p + . Then the work of this algorithm

p p p

is O n log n, which is the work p erformed byann-no de butter y network as well.
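Spelled out, the processor-time product is

    p\left(4\,\frac{n}{p}\log\frac{n}{p} + \log p + \frac{n}{p}\right)
        \;=\; 4n\log\frac{n}{p} + p\log p + n \;=\; O(n\log n) \quad (1 \le p \le n),

since p log p <= n log n.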

5 Matrix Computations

The matrix multiplication of two n x n matrices can be accomplished in O(n^3) steps, while the transpose takes O(n^2) time. Most matrix computations are very similar to their systolic counterparts [Kun79]. In a systolic algorithm, there are several active data streams interacting with each other. Each element of a data stream could be interacting with a distinct data stream in parallel. For instance, a commonly used systolic algorithm for matrix multiplication uses the data streams as shown in Figure 10. Two n x n matrices A and B are multiplied in time O(n) by this algorithm. Processor p_{i,j} accumulates the value for C[i, j] such that C = A x B. The ith row of A is input from the left edge of the array (Figure 10), such that the (i + 1)st row is one time step behind the ith row. For instance, A[0, 0] arrives at p_{0,0} at T = 0, but A[1, 0] arrives at p_{1,0} at T = 1. Similarly, the jth column of B arrives at p_{0,j} at time T = j. The rows of A and the columns of B keep marching down horizontally and vertically respectively, while p_{i,j} collects the product terms A[i, k] x B[k, j] one by one. This systolic algorithm could be emulated in loop based processors by allocating a loop to each data stream (2n data streams in this case: one for each row of A and one for each column of B). However, the values for C[i, j] are static, sitting in one place. Where should the matrix C be mapped? In general, all the systolic algorithms consisting of only moving data streams can be mapped to a loop algorithm. Most of the time, a static data stream (such as matrix C in this example) can also be mapped on a loop with proper synchronization. Several transformations for systolic algorithms exist to convert spatial data streams such as C into temporal moving data streams or vice versa [Qui84], [QvD89], [Che85], [LM82].

There is one problem that seems to occur in the synchronization of data streams for several matrix operations. In the following, we present the problem along with a solution. We use matrix transposition as the example for the illustration of this problem. It appears as if, for transposition, one loop of size n^2 and n loops of size n each would give an obvious n^2 time algorithm. The outline of the algorithm could be as follows. Initially, the n x n matrix A is stored in row-major order in the loop L_{n^2}. Each n-size loop accumulates one row of the transposed matrix A^T (which is a column of A) during the rotation of the main loop, L_{n^2}. On the surface, this step seems to be trivial. Let us illustrate it with Figure 11.



Figure 11: A Flawed Scheme for Transposition (the main loop L_{n^2} holds A in row-major order; loop L_{n,i} accumulates row i of A^T)

The matrix A is stored in row-major order on the loop L_{n^2} initially. Let us assume that L_{n^2}[0] = A[0, 0], with the data recirculating right to left. There are n loops of size n each, for accumulating the ith row of A^T in L_{n,i} (Figure 11). Let us evaluate the synchronization requirements of this problem. Assume at T = 0, L_{n^2} scans A[0, 0] (L_{n^2}[0] = A[0, 0]), which can be copied to L_{n,0}[0] at T = 0. Similarly, at T = k for 0 <= k < n, we can copy L_{n^2}[k] into L_{n,k}[0]. The problem occurs at the copy of the next row of A, starting at time T = n. At T = n, L_{n^2} is scanning A[1, 0], which goes to L_{n,0}[1]. However, at T = n, L_{n,0} is scanning L_{n,0}[0]. The same problem occurs at T = n + j for 0 <= j < n: L_{n^2} and L_{n,j} are not synchronized. A solution for it may be to read from L_{n^2} at time T = j and write it into L_{n,j}[1] at time T = n + j + 1. However, the problem gets worse for the third entry, at time T = 2n + j for 0 <= j < n. In general, at time T = i n + j, we need to scan L_{n,j}[i], but we scan L_{n,j}[0]. In other words, L_{n^2} runs ahead of L_{n,j} by i words for T = i n + j for 0 <= i, j < n. One solution for this problem is to introduce n - 1 delay loops of sizes 1, 2, 3, ..., n - 1. The other solution is to have just one delay loop of size n - 1 with n - 1 taps into it. Each of these taps requires a directional coupler switch. However, a delay loop seems to be very useful for synchronization problems encountered in matrix computations. We assume that we have a single delay loop D_{n-1} with n - 1 taps. The word at tap i is referred to by D_{n-1}[i] for 0 <= i < n - 1. The word D_{n-1}[0] is the word written at the current time, D_{n-1}[1] is the word written one time unit earlier, and so on. This loop acts as a shifting loop. Figure 12 shows this loop. Now the transposition can be done as follows:

Figure 12: A Delay Loop for Time-Shifting Words. (The loop D_{n−1} carries the words D_{n−1}[0], D_{n−1}[1], ..., D_{n−1}[n−2], with a tap at each of the n − 1 positions.)

for (i = 0; i < n; i = i + 1)
    for (j = 0; j < n; j = j + 1)
        if (i == 0) copy L_{n^2}[j] to L_{n,j}[0];
        else
            copy L_{n^2}[i·n + j] to D_{n−1}[0];
            copy D_{n−1}[i] to L_{n,j}[j];
        end if
    end for
end for
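The timing claim behind this pseudocode can be checked mechanically. The following C sketch is our own verification, assuming the port alignments stated in the text: A[i,j] surfaces from L_{n^2} at T = i·n + j, and L_{n,j} scans position (T − j) mod n. It confirms that delaying each word of row i by i ticks (reading tap i of the delay line) lands every word at position i of loop L_{n,j}:

    #include <assert.h>
    #include <stdio.h>

    #define N 4

    int main(void) {
        int A[N][N], At[N][N];          /* At[j] models loop L_{n,j} */

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = 10 * i + j;

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int T    = i * N + j;    /* A[i][j] at the main loop's port    */
                int Tw   = T + i;        /* surfaces at tap i, i ticks later   */
                int port = (Tw - j) % N; /* position under L_{n,j}'s port then */
                assert(port == i);       /* the delay realigns the two loops   */
                At[j][port] = A[i][j];
            }

        for (int j = 0; j < N; j++)      /* At now holds the transpose */
            for (int i = 0; i < N; i++)
                assert(At[j][i] == A[i][j]);
        printf("transpose verified for n=%d\n", N);
        return 0;
    }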

This version of transposition takes time n^2 to get all the rows of A^T into the n loops of size n. If these rows were to be copied back into an n^2-size loop, this would take another n^2 steps. An alternative is to use a Butterfly network with 2 log n loops. As shown in Stone [Sto70], for a matrix stored in row-major order, the transpose corresponds to log n shuffle steps. This can be accomplished in log n ascend-descend steps. The sequential version of ascend-descend can then transpose a matrix in 4n^2 log n steps.
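Stone's correspondence is easy to check in software. The C sketch below (our own demo) shuffles a row-major 2^k × 2^k matrix k = log n times and verifies that the result is the transpose; each perfect shuffle cyclically rotates the 2k address bits left by one position, and k such rotations swap the row and column bit fields:

    #include <assert.h>
    #include <stdio.h>

    #define K 3                    /* the matrix is 2^K x 2^K           */
    #define SIZE (1 << (2 * K))    /* 2^(2K) words, stored in row-major */

    /* one perfect-shuffle step: interleave the two halves of the vector */
    void shuffle(int *x) {
        int y[SIZE], m = SIZE / 2;
        for (int i = 0; i < m; i++) {
            y[2 * i]     = x[i];       /* first half  -> even positions */
            y[2 * i + 1] = x[m + i];   /* second half -> odd positions  */
        }
        for (int i = 0; i < SIZE; i++) x[i] = y[i];
    }

    int main(void) {
        int n = 1 << K, a[SIZE];
        for (int i = 0; i < SIZE; i++) a[i] = i;  /* a[r*n + c] = r*n + c */

        for (int s = 0; s < K; s++) shuffle(a);   /* log n shuffle steps  */

        for (int r = 0; r < n; r++)   /* now position c*n + r holds r*n + c */
            for (int c = 0; c < n; c++)
                assert(a[c * n + r] == r * n + c);
        printf("transpose via %d shuffles verified for n=%d\n", K, n);
        return 0;
    }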

Matrix multiplication can similarly be performed in n^3 steps. For the computation of C = A × B, where the n × n matrices A and B are initially stored in row-major order on loops L_{n^2,A} and L_{n^2,B} respectively, the following straightforward method can be used. Let us assume that C is to be stored on L_{n^2,C} in row-major order as well. First transpose B into n n-size loops L_{n,0}, L_{n,1}, ..., L_{n,n−1} as described earlier. The values of C are computed in the order of columns, with column 0 first. We need to use the delay loop D_{n−1} again, to delay the writes of column j by j time steps for 1 ≤ j ≤ n − 1. The following steps describe this multiplication algorithm:

Transpose B so that the i-th column of B is in L_{n,i} for 0 ≤ i < n;
for (j = 0; j < n; j = j + 1)
    for (i = 0; i < n; i = i + 1)
        X = 0;
        for (k = 0; k < n; k = k + 1)
            X = X + L_{n^2,A}[i·n + k] × L_{n,j}[k];
        end for
        if (j == 0) L_{n^2,C}[n·i] = X;
        else
            D_{n−1}[0] = X; L_{n^2,C}[i·n + j] = D_{n−1}[j];
        end if
    end for
end for

This matrix multiplication takes time O(n^3), with n n-size and three n^2-size loops. Most of the systolic matrix algorithms in Mead and Conway [MC80, pages 271-285] (the chapter by Kung and Leiserson) map into loop algorithms very nicely.
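A functional transcription of the method in C (our own sketch; it reproduces the data movement but elides the port timing and the delay-loop staggering of the writes into L_{n^2,C}) is given below. The check at the end confirms that reading A in row-major order against the transposed copies of B's columns computes C = A × B in about n^3 multiply-add ticks:

    #include <assert.h>
    #include <stdio.h>

    #define N 4

    int main(void) {
        int A[N * N], B[N * N], C[N * N];
        int Bt[N][N];                      /* Bt[j] models loop L_{n,j} */

        for (int i = 0; i < N * N; i++) { A[i] = i; B[i] = i + 1; }

        /* transpose B into the n small loops, as described earlier */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                Bt[j][i] = B[i * N + j];   /* column j of B sits in loop j */

        for (int j = 0; j < N; j++)        /* columns of C, column 0 first */
            for (int i = 0; i < N; i++) {
                int x = 0;
                for (int k = 0; k < N; k++)
                    x += A[i * N + k] * Bt[j][k];  /* one product per tick */
                C[i * N + j] = x;   /* the delay loop staggers this write  */
            }

        for (int i = 0; i < N; i++)        /* check against C = A * B */
            for (int j = 0; j < N; j++) {
                int x = 0;
                for (int k = 0; k < N; k++) x += A[i * N + k] * B[k * N + j];
                assert(C[i * N + j] == x);
            }
        printf("C = A*B verified for n=%d (about n^3 = %d ticks)\n",
               N, N * N * N);
        return 0;
    }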

6 Lower Bounds

This model offers a rich milieu for lower bounds. As we saw, a higher number of loops seems to lead to a lower time. It is not just the number of loops; the period of data circulation in the loops is also important. Ideally, we need a combinatorial technique that can decompose the data access pattern of a problem into some canonical periods. Unfortunately, at this time, we do not have such a technique. However, we do have some simple lower bounds which are tight for their subcases. Let us first consider the single loop version.

Note that a one-loop processor is like a single tape Turing machine where the head is forced to move in the same direction at each time step; the tape is also wrapped around to form a simple loop. The crossing sequence arguments given in Hennie [Hen65], [Hen66] can be used to show that ascend-descend algorithms need Ω(n^2) time on a one-loop processor. The number of distinct input patterns is 2^n, assuming a binary input; n crossing sequences can then be shown to have length Ω(n) each, leading to the lower bound.
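Under our reading of this (partially garbled) counting argument, with CS(p) denoting the crossing sequence at loop position p, the bound follows by summing the crossing-sequence lengths over the n positions:

    T \;\ge\; \sum_{p=0}^{n-1} \lvert \mathrm{CS}(p) \rvert
      \;\ge\; n \cdot \Omega(n) \;=\; \Omega(n^2).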

The second technique quantifies the lower bound in terms of the size of the smallest loop. The assumption is that there is only a constant amount of "register" storage available to the processor in addition to the loop memories. Note that all these loops have exactly one read/write port (unlike the delay loop D_{n−1} for matrix operations). The following lower bounds hold only if there is exactly one observation point into the loop memory (no loops of the D_{n−1} type are available). We can derive a tight lower bound for the k-loop version with this method. Let the smallest loop be of size s. The computation of the lower log s levels of the ascend/descend phases can be shown to require Ω(sn) time. For the k-loop version, s = √n / 2^{k/2}. This gives a lower bound of Ω(n^{1.5} / 2^{k/2}), which matches the time of the algorithm. Similarly, for the 2-loop version, the smallest loop has size √n and hence the lower bound translates into Ω(n^{1.5}). This is also a tight bound.

Theorem 2  Let an optical computing system consisting of k > 1 loops have the smallest loop of size s. The computation of an n-word ascend-descend algorithm requires time Ω(sn + n log n).

Proof: We assume that the data words at level l are stored in a memory loop in the butterfly column order x_{0,l}, x_{1,l}, ..., x_{n−1,l}. For the lower bound purposes, any order that is a permutation of 0, 1, ..., n−1 will do, but the proof is more understandable with this order. Another claim is that the lower m = ⌊log s⌋ levels of the Butterfly network are best computed in the smallest-sized (s) loop. The following discussion will prove this point.

The Level l computation for the ascend phase consists of x_{i,l} = f(x_{i,l−1}, x_{i⊕2^{l−1},l−1}). There are n/2^l groups of words, consisting of 2^l words each, that define the period of access required by the computation. A group of words consists of x_{i·2^l+j,l} for 0 ≤ j < 2^l. The words x_{i·2^l+j,l} for 0 ≤ j < 2^{l−1} need the values of x_{i·2^l+j,l−1} and x_{i·2^l+2^{l−1}+j,l−1}. We call this communication backward communication with respect to the direction of data circulation in the loops. Similarly, the words x_{i·2^l+2^{l−1}+j,l} for 0 ≤ j < 2^{l−1} need the values of x_{i·2^l+j,l−1} and x_{i·2^l+2^{l−1}+j,l−1}. This is the forward communication.

Both forward and backward communication pose different sets of problems. For instance, they have different periods. For forward communication, x_{i·2^l+j,l−1} is needed at x_{i·2^l+2^{l−1}+j,l} with a period of 2^{l−1} and a size of 2^{l−1}. Backward communication, on the other hand, requires x_{i·2^l+2^{l−1}+j,l−1} to be available the next time we scan x_{i·2^l+j,l}. The next scan of x_{i·2^l+j,l} happens in n − 2^{l−1} time steps, which is the period of backward communication. Exactly n − 2^{l−1} such values have to be remembered for the next scan of the Level l data, and hence that is also the size of the backward communication.
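For reference, the two patterns at level l consolidate as follows (our own summary of the passage above, in its notation):

    \text{forward:}\quad \text{period} = \text{size} = 2^{l-1};
    \qquad
    \text{backward:}\quad \text{period} = \text{size} = n - 2^{l-1}.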

If we compute the lower m = ⌊log s⌋ levels of the Butterfly, for 1 ≤ l ≤ m, on the smallest loop L_s, the forward communication requirement limits us as follows. We need loops of sizes 1, 2, 4, ..., 2^{m−1} so that forward communication does not require additional scans of the Level l−1 and Level l data in L_s. However, s = 2^m is the smallest size loop available. What is the penalty of this limitation? Since only a constant amount of register storage is available on the processor, for each level l we can remember at most a constant number of words to carry forward. Let this constant be 1, without loss of generality. Hence, we can compute at most n/2^l Level l values in each scan of L_s. In fact, this is exactly what is done by the two-loop, pipelined computation described earlier. Hence we need at least Σ_{l=1}^{log s} 2^l scans of L_s in order to compute s of the n lower log s level values. This gives at least 2s scans of L_s for the computation of s values. Hence, the computation of n words for the lower log s levels requires (n/s) · 2s = 2n scans of L_s. Since each scan of L_s takes s time steps, the time needed for 2n scans is at least 2sn.
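Spelling out the scan-count arithmetic used in this step (a restatement of the bounds above, not a new claim):

    \sum_{l=1}^{\log s} 2^{l} \;=\; 2s - 2 \;\approx\; 2s \ \text{scans per batch of } s \text{ values},
    \qquad
    \frac{n}{s} \cdot 2s \;=\; 2n \ \text{scans},
    \qquad
    2n \ \text{scans} \times s \ \text{ticks per scan} \;=\; 2sn.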


Figure 13: Use of Memory Loops as Communication Links

Is this the lowest time for computing the lower log s levels of the ascend phase? The answer is that any other loop of larger size would take at least 2sn steps as well. Let us consider computing these lower log s levels in a loop L_{s'} for s' > s. The number of scans for computing s' values from the lower log s levels would again be at least 2s'. The number of iterations for computing all the n words would be n/s', and hence the total time would be at least 2ns', which is no lower than 2ns.

Note that we did not even consider the backward communication requirements to get this lower bound. We can derive the n log n lower bound from the structure of the butterfly network. Since we are dealing with a sequential computation model, and there are n log n computation nodes in the butterfly graph for an ascend phase, the lower bound follows trivially. □

7 Alternate Parallel Loop Model

In Section 4, we used the loop memory for holding the data blocks arising from problem folding. The communication links between processors in a butterfly network are also implemented in fiber technology, to keep up with the data rates. We propose another way of using the fiber loops, so that they serve as both the data holders and the communication links. The disadvantage of this approach is that it requires many taps into a fiber; consequently, at this point in time, this approach is not technically feasible. Figure 13 shows the scheme. Between the two levels shown, the 4 left-hand processors need to communicate with the 4 processors directly below them and with the 4 right-hand processors. The two loops shown serve this purpose. Each processor drops a data value into either the vertical connection loop or the cross-connection loop. This value is picked up by the target processor after 2^i ticks, for the loops connecting levels i and i + 1.
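A toy C simulation (our own sketch; the constants are illustrative) of a loop used as a link between levels i and i+1: a word dropped into the loop at time T circulates and is picked up by the target processor exactly 2^i ticks later:

    #include <assert.h>
    #include <stdio.h>

    #define LEVEL 3
    #define DELAY (1 << LEVEL)   /* 2^i ticks between levels i and i+1 */

    int main(void) {
        int link[DELAY] = {0};   /* the loop: one slot per tick of delay */
        int sent = 42, got = -1;

        for (int t = 0; t < 2 * DELAY; t++) {
            if (t == 5)         link[t % DELAY] = sent;  /* source drops word  */
            if (t == 5 + DELAY) got = link[t % DELAY];   /* target picks it up */
        }
        assert(got == sent);     /* delivered after exactly 2^i ticks */
        printf("delivered after %d ticks\n", DELAY);
        return 0;
    }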

8 Conclusions

The optical computing technology has matured to the point that prototypical computing machines can be built in the research laboratories. This is also the right moment for algorithm designers to get involved in this enterprise. A careful interaction between the architects and the algorithm designers can lead to a better thought-out design. This paper is an attempt in that direction. We have studied the repercussions of the use of memory loops on algorithm design. The use of delay line loops as memories is necessitated by the required data rates, upwards of 100 MWords per second. We develop a computational model, the loop memory model (LMM), to capture the relevant characteristics of the memory loops.

It would seem that the restrictive discipline imposed on the data access patterns by a loop memory would degrade the performance of most algorithms, because the processor might have to idle waiting for data. We demonstrate that an important class of algorithms, the ascend/descend algorithms, can be realized in the loop memory model without any loss of efficiency. In fact, the sequential realizations span a broad range in the number of loops required. A parallel implementation performing the optimal amount of work is also shown. Some matching lower bounds are illustrated as well.

Optical computing is an emerging field. There are a myriad of open questions. We have only covered one class of algorithms in the LMM, the ascend/descend class. Many more problems and algorithms need to be investigated in this model. As we mentioned earlier, a large secondary memory, either semiconductor based or optical technology based, would almost certainly be needed. In fact, the memory hierarchy is likely to have many more levels in this case, due to the large mismatch in the speeds of the dynamic and secondary storage. A model similar to the HMM [AACS87], capturing the hierarchical nature of memory, would help.

The lower bounds in the LMM also pose many interesting problems. A new set of techniques seems to be required: the data access pattern of a problem has to be classified in terms of some basic frequencies. In summary, we believe that this field will serve as a fertile ground for future research.

9 Acknowledgements

The research of J. Reif was supported in part by DARPA/ARO contract DAAL03-88-K-0195, Air Force contract AFOSR-87-0386, DARPA/ISTO contract N00014-88-K-0458, and NASA subcontract 550-63 of prime contract NAS5-30428. A. Tyagi was supported by NSF Grant MIP-8806169, NCBS&T Grant 89SE04, and a Junior Faculty Development award from UNC, Chapel Hill. The support from these agencies is gratefully acknowledged.

References

[AACS87] A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A Model for Hierarchical Memory. In Proceedings of ACM Symposium on Theory of Computing, pages 305-314. ACM-SIGACT, 1987.

[ACS87] A. Aggarwal, A. Chandra, and M. Snir. Hierarchical Memory with Block Transfer. In Proceedings of IEEE Symposium on Foundations of Computer Science, pages 204-216. IEEE, 1987.

[ACS89] A. Aggarwal, A. Chandra, and M. Snir. On Communication Latency in PRAM Computations. In Proceedings of ACM Symposium on Parallel Algorithms and Architectures. ACM, 1989.

[AHU74] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass., 1974.

[BL92] Giovanni Barbarossa and Peter J. Laybourn. Novel architecture for optical guided-wave recirculating delay lines. Proc. of SPIE, Advances in Optical Information Processing V, 1704:138-144, 1992.

[BR87] R. Barakat and J. Reif. Lower Bounds on the Computational Efficiency of Optical Computing Systems. Applied Optics, 26:1015-1018, March 1987.

[Cha82] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of ACM Symposium on Compiler Construction, pages 98-105, 1982.

[Che85] M. C. Chen. The Generation of a Class of Multipliers: A Synthesis Approach to the Design of Highly Parallel Algorithms in VLSI. In Proceedings of International Conference on Computer Design. IEEE, 1985.

[DTD+96] Daniel Dolfi, J. Tabourel, O. Durand, Vincent Laude, Jean-Pierre Huignard, and Jean Chazelas. Optical architectures for programmable filtering of microwave signals. Proc. of SPIE, Radar Processing, Technology, and Applications, 2845:276-286, 1996.

[Hen65] F. C. Hennie. One-Tape, Off-Line Turing Machine Computations. Information and Control, 8:553-578, 1965.

[Hen66] F. C. Hennie. On-Line Turing Machine Computations. IEEE Transactions on Electronic Computers, EC-15:35-44, 1966.

[HJP92] V. P. Heuring, H. F. Jordan, and J. Pratt. A Bit-Serial Architecture for Optical Computing. Applied Optics, 31:3213-3224, 1992.

[HP96] J. L. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan-Kaufmann, 2nd edition, 1996.

[Hua80] A. Huang. Design for an Optical General Purpose Digital Computer. In Proc. of International Optical Computing Conference, volume 232, pages 119-127. SPIE - The International Society for Optical Engineering, 1980.

[Hua84] A. Huang. Architectural Considerations Involved in the Design of an Optical Digital Computer. Proc. of IEEE, 72(7):780-786, 1984.

[JLL93] H. F. Jordan, Kyungsook Y. Lee, and Daeshik Lee. Multichannel time-slot permuters. Proc. of SPIE, High-Speed Fiber Networks and Channels II, 1784:252-261, 1993.

[Jor89a] H. F. Jordan. Digital Optical Computing with Fibers and Directional Couplers. In Optical Computing, OSA 1989 Technical Digest Series, Volume 9, Optical Computing Topical Meeting, pages 352-355. OSA, February 1989.

[Jor89b] H. F. Jordan. Pipelined Digital Optical Computing. OCS Technical Report 89-34, Optoelectronic Computing Center, University of Colorado, Boulder, 1989.

[Jor91] H. F. Jordan. Digital optical computers at Boulder. Proc. of SPIE, Optics for Computers: Architectures and Technologies, 1505:87-98, 1991.

[JP89] G. Q. Maguire Jr. and P. R. Prucnal. High-Density Using Optical Delay Lines: Use of Optical Delay Lines as a Disk. In Medical Imaging III, volume 1093, pages 571-577. SPIE - The International Society for Optical Engineering, January 1989.

[Kun79] H. T. Kung. Let's Design Algorithms for VLSI Systems. In Proceedings of the Caltech Conference on Advanced Research in VLSI: Architecture, Design, Fabrication, pages 65-90. Caltech, January 1979.

[LLJ93a] Daeshik Lee, Kyungsook Y. Lee, and H. F. Jordan. Generalized lambda time-slot permuters. Proc. of SPIE, High-Speed Fiber Networks and Channels II, 1784:262-269, 1993.

[LLJ93b] Kyungsook Y. Lee, Daeshik Lee, and H. F. Jordan. Time-slot sorter. Proc. of SPIE, High-Speed Fiber Networks and Channels II, 1784:242-251, 1993.

[LM82] T. Lin and C. Mead. The Application of Group Theory in Classifying Systolic Arrays. Technical Report 5006:DF:82, California Institute of Technology, Pasadena, California, April 1982.

[Lou91] A. Louri. Three-Dimensional Optical Architecture and Data-Parallel Algorithms for Massively Parallel Computing. IEEE Micro, pages 24-81, April 1991.

[MC80] C. Mead and L. Conway. Introduction to VLSI Systems. Addison-Wesley, Reading, Mass., 1980.

[MG90] L. R. McAdams and J. W. Goodman. Liquid Crystal 1 × n Optical Switch. Optics Letters, 15:1150-1152, 1990.

[MJR89] E. S. Maniloff, K. M. Johnson, and J. Reif. Holographic Routing Network for Parallel Processing Machines. In Proceedings of SPIE International Congress on Optical Science and Engineering. SPIE, April 1989.

[MMG90] L. R. McAdams, R. N. McRuer, and J. W. Goodman. Liquid Crystal Optical Routing Switch. Applied Optics, 29:1304-1307, 1990.

[PV81] F. P. Preparata and J. Vuillemin. The Cube-Connected Cycles: A Versatile Network for Parallel Computation. Communications of the ACM, 24(5):300-309, May 1981.

[Qui84] P. Quinton. Automatic Synthesis of Systolic Arrays from Uniform Recurrent Equations. In Proceedings of the 11th Annual Symposium on Computer Architecture. IEEE, 1984.

[QvD89] P. Quinton and V. van Dongen. The Mapping of Linear Recurrence Equations on Regular Arrays. Journal of VLSI Signal Processing, 1:95-113, 1989.

[RT90] J. Reif and A. Tyagi. Efficient Parallel Algorithms for Optical Computing with the DFT Primitive. In Proceedings of the Tenth Conference on Foundations of Software Technology & Theoretical Computer Science, pages 149-160. Lecture Notes in Computer Science 472, Springer-Verlag, December 1990.

[SJ90] D. B. Sarrazin and H. F. Jordan. Delay line memory systems. Applied Optics, 29:627-637, 1990. Also OCSC TR 92-18.

[SS84] A. Sawchuk and T. Strand. Digital Optical Computing. Proc. of IEEE, 72(7):758-779, 1984.

[Sto70] H. S. Stone. Parallel Processing with the Perfect Shuffle. IEEE Transactions on Computers, pages 153-161, February 1970.

[Tho79] C. D. Thompson. Area-Time Complexity for VLSI. In Proceedings of ACM Symposium on Theory of Computing, pages 81-88. ACM-SIGACT, 1979.

[TR90] A. Tyagi and J. Reif. Energy Complexity of Optical Computations. In Proceedings of the Second IEEE Symposium on Parallel and Distributed Processing, pages 14-21. IEEE, December 1990.

[Tya97] A. Tyagi. Reconfigurable memory queues/computing units architecture. In Proceedings of the Reconfigurable Architecture Workshop at 11th International Parallel Processing Symposium, April 1997.

[Ull84] J. D. Ullman. Computational Aspects of VLSI. Computer Science Press, Rockville, Md., 1984.

[VS90] J. S. Vitter and E. A. M. Shriver. Optimal Disk I/O with Parallel Block Transfer. In Proceedings of ACM Symposium on Theory of Computing, pages 159-169. ACM-SIGACT, 1990.

[WKG+96] Kelvin H. Wagner, Shawn Kraut, Lloyd J. Griffiths, Samuel Weaver, Robert T. Weverka, and Anthony W. Sarto. Efficient true-time-delay adaptive array processing. Proc. of SPIE, Radar Processing, Technology, and Applications, 2845:287-300, 1996.