An Optical Delay Line Memory Model with Efficient Algorithms*

John H. Reif†, Department of Computer Science, Duke University, Durham, NC 27706
Akhilesh Tyagi‡, Department of Computer Science, Iowa State University, Ames, IA 50011
Abstract
The extremely high data rates of optical computing technology (100 MWords per second and upwards) present unprecedented challenges in dynamic memory design. An optical fiber loop, used as a delay line, is the best candidate for primary, dynamic memory at this time. However, such loops pose special problems in the design of algorithms due to synchronization requirements between the loop data and the processor. We develop a theoretical model, which we call the loop memory model (LMM), to capture the relevant characteristics of a loop based memory. An important class of algorithms, ascend/descend, which includes algorithms for merging, sorting, DFT, matrix transposition and multiplication, and data permutation, can be implemented without any time loss due to memory synchronization. We develop both sequential and parallel implementations of ascend/descend algorithms and some matrix computations. Some lower bounds are also demonstrated.
1 Introduction
1.1 The Success of VLSI Theory
Over the last 15 years, VLSI has moved from being a theoretical abstraction to being a practical reality. As VLSI design tools and VLSI fabrication facilities such as MOSIS became widely available, algorithm design paradigms such as systolic algorithms [Kun79], which were thought to be of theoretical interest only, have been used in high performance VLSI hardware. Along the same lines, the theoretical limitations of VLSI predicted by area-time tradeoff lower bounds [Tho79] have been found to be important limitations in practice.
* A preliminary version of this paper appeared in Proc. of the Advanced Research in VLSI: International Conference 1991, MIT Press, March 1991.
† Supported in part by DARPA/ARO contract DAAL03-88-K-0195, Air Force Contract AFOSR-87-0386, DARPA/ISTO contract N00014-88-K-0458, NASA subcontract 550-63 of prime contract NAS5-30428 and by NSF Grant NSF-IRI-91-00681.
‡ Supported by NSF Grant MIP-8806169, NCBS&T Grant 89SE04 and a Junior Faculty Development award from UNC, Chapel Hill. This work was performed when the second author was with UNC, Chapel Hill.
1.2 The Promise of Optical Computing
Optical computing is considered to be one of several technologies that could provide a boost of 2-3 orders of magnitude in computing speed over the currently used semiconductor (silicon) based technology. The field of electro-optical computing, however, is in its infancy, comparable to the state of VLSI technology, say, 10 years ago. Fabrication facilities for many key electro-optical components are not widely available; instead, the crucial electro-optical devices must be specially made in the laboratories. However, a number of prototype electro-optical computing systems have been built recently, perhaps most notably at Bell Laboratories under Huang [Hua80, Hua84], and also at Boulder under Jordan (see below). In addition, several optical message routing devices have been built at Boulder [MJR89], Stanford and USC [Lou91], [MG90, MMG90], [SS84]. Thus optical computing technology has matured to the point that prototypical computing machines can be built in the research laboratories. But it certainly has not attained the maturity of VLSI technology. What is likely to occur in the future? The technology for electro-optical computing is likely to advance rapidly through the 90s, just as VLSI technology advanced in the late 70s and 80s. Therefore, following our past experience with VLSI, it seems likely that the theoretical underpinnings for optical computing technology, namely the discovery of efficient algorithms and of resource lower bounds, are crucial to guide its development. This seems to us to be the right moment for algorithm designers to get involved in this enterprise. A careful interaction between the architects and the algorithm designers can lead to a better thought-out design. This paper is an attempt in that direction.
1.3 Data Storage: A Key Problem in Optical Computing
Optical computing technology can obtain extremely high data rates beyond what can be obtained by current semiconductor technology. Therefore, in order to sustain these data rates, the dynamic storage must be based on new technologies which are likely to be completely or partially optical. Jordan at the Colorado Optoelectronic Computing Systems Center [Jor89b] and some other groups have proposed and used optical delay line loops for dynamic storage. In these data storage systems, an optical fiber, whose characteristics match the operating wavelength, is used to form a delay line loop. In particular, the system sends a sequence of optically encoded bits down one end of the loop and, after a certain delay (which depends on the length and optical characteristics of the loop), the optically encoded bits appear at the end of the loop, to be either utilized at that time and/or once again sent down the loop. This idea of using propagation delay for data storage dates back to the use of mercury delay loops in early electronic computing systems, before the advent of large primary or secondary memory storage. Jordan [Jor89b] has been able to store $10^4$ bits per fiber loop with a fiber length of approximately one kilometer. This was achieved in a small, low cost prototype system with a synchronous loop without very precise temperature control. Jordan used such a delay loop system to build the second known purely optical computer (after Huang's), which can simulate a counter. Note that this does not represent the ultimate limitation on the storage capacity of optical delay loops, which could in principle provide very large storage using higher performance electro-optical transducers and multiple loops. Maguire and Prucnal [JP89] suggest using optical delay lines to form disk storage.
The main problem with such a dynamic storage is that it is not a random-access memory. A delay line loop cannot be tapped at many points since a larger number of taps leads to excessive signal degradation. This implies that if an algorithm is not designed around this shortcoming of the dynamic storage, it might have to wait for the whole length of the loop for each data access. Systolic algorithms (Kung [Kun79]) also exhibit such a tight inter-dependence between the dynamic storage and the data access pattern.
1.4 Loop Memory Model and Our Results
In this paper, we propose the loop-memory model (LMM) as a model of sequential electro-optical computing with delay line based dynamic storage. The loop-memory model contains the basic features that current delay loop systems use, as well as the features that systems in the future are likely to use. It would seem that the restrictive discipline imposed on the data access patterns by a loop memory would degrade the performance of most algorithms, because the processor might have to idle waiting for data. However, we demonstrate that an important class of algorithms, ascend/descend algorithms, can be realized in the loop memory model without any loss of efficiency. Note that many problems including merging, sorting, DFT, matrix transposition and multiplication, and data permutation are solvable with an ascend/descend algorithm (described in Section 3). In fact, we give sequential algorithms covering a broad range for the number of loops required. A parallel implementation performing the optimal amount of work is also shown. The work performed by a parallel algorithm running on $p$ processors in time $T$ is given by $pT$.
An ascend or descend phase takes time $O(n \log n)$ in LMM using $\log n$ loops of sizes $1, 2, 4, \ldots, n$. Note that a straightforward emulation of a butterfly network with $O(n \log n)$ time performance requires $O(n)$ loops: $n$ loops of size 1, $n/2$ of size 2, $n/4$ of size 4, ..., 1 of size $n$. It can be implemented in time $n^{1.5}$ with just two loops of sizes $n$ and $\sqrt{n}$. This can be generalized into an ascend-descend scheme with time $nk + n^{1.5} 2^{-k/2}$ with $2 \le k \le \log n$ loops. At this point in time, a loop is a precious resource in optical technology, and hence tailoring an algorithm around the number of available loops is an important capability. The $k$-loop adaptation of the ascend/descend algorithm provides just this capability. A single loop processor takes $n^2$ time. A matching lower bound also exists for this case, which is derived in Section 6 from one tape Turing machine crossing sequence arguments. Matrix multiplication and matrix transposition can also be performed in LMM without any loss of time.
We also consider a butterfly network with $p \log p$ LMM processors, where $1 < p < n$. The (number of processors) × (time) product of this network for ascend-descend algorithms is shown to be $O(n \log n)$. Note that a butterfly network performs $n \log n$ work. This shows that the ascend-descend algorithms can be redesigned in such a way as not to incur any work loss due to the restrictive nature of the loop memories.
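To make the trade-off between loops and time concrete, the following Python sketch simply evaluates the three time bounds quoted above for a sample input size. It is an illustration only: constant factors are ignored, logarithms are taken base 2, and the function names are ours rather than part of the model.

```python
import math

def ascend_time_log_loops(n):
    """Bound n log n for the version using log n loops (constants dropped)."""
    return n * math.log2(n)

def ascend_time_k_loops(n, k):
    """Bound n*k + n^1.5 * 2^(-k/2) for the k-loop version, 2 <= k <= log n."""
    assert 2 <= k <= math.log2(n), "k must lie between 2 and log n"
    return n * k + n ** 1.5 * 2 ** (-k / 2)

def ascend_time_single_loop(n):
    """Bound n^2 for a processor with a single loop."""
    return n ** 2

if __name__ == "__main__":
    n = 2 ** 16
    # Print the k-loop bound for a few loop counts between 2 and log n = 16.
    for k in (2, 4, 8, 16):
        print(k, round(ascend_time_k_loops(n, k)))
```

At $k = \log n$ the second term shrinks to $n$, so the bound becomes $n \log n + n$, in line with the $\log n$ loop version.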
1.5 Related Work
Several models for virtual memory have been considered recently. RAM was proposed as a simple model of sequential computation without any virtual memory in Aho, Hopcroft, Ullman [AHU74]. Aggarwal et al. [AACS87, ACS87] introduced the first hierarchical memory model which also incorporates block transfer. They [ACS89] extended it to the parallel computation model PRAM. Vitter and Shriver [VS90] consider the PRAM case where parallel block transfers are permitted. In all these models, the algorithm optimization consists of exploiting either temporal or spatial locality given that the access to a block of size $b$ at address $x$ costs time $f(x) + b$, where $f(x)$ is the seek time. In our model, the synchronization of the loops (primary storage) with the processor is the main objective in algorithm design. In a more fully developed optical computing system, it seems almost imperative that a memory hierarchy will exist, with either fast semiconductor memory or laser disks at the bottom of the hierarchy. An analysis of the traffic between the loop memory and secondary memory could be similar to the previous work. In the early days of computing, acoustic delay lines were used as mass storage. But using a delay line as a secondary storage medium does not pose the problems which are encountered when it is used as the primary storage. Its use as a secondary storage can be analyzed by the techniques developed for the hierarchical memory model HMM [AACS87]. The related work in the theory of optical computing is very sparse. Barakat and Reif [BR87] introduced VLSIO (electro-optical VLSI), a model of optical computing. They considered volume-time trade-offs and lower bounds in this model. We [TR90] demonstrated energy and energy-time product lower and upper bounds for optical computations. We [RT90] introduce the DFT-PRAM model, where a DFT can be computed in unit time, to model the power of optical computing. We develop several $O(1)$ time algorithms in this model.
The work most related to this paper is that of Jordan [Jor89b], [HJP92], [Jor89a], [Jor91]. They describe an optical delay loop system and show that a number of networks could be simulated by the use of optical delay loops. However, their work does not imply our results; in particular, they did not look at the algorithmic aspects of the loop-memory systems.
Optical wave guides have been used in several optical computing systems, but primarily for computing and not storage. Jordan et al. use delay lines for time slot permutation and sorting [LLJ93a, JLL+93, LLJ93b]. Barbarossa et al. [BL92] use optical delay lines for processing. [WKG+96] performs adaptive array processing with recirculating loops. There are several applications of microwave processing with delay line loops [DTD+96].
The algorithms presented in this paper schedule the memory loops for their data needs. However, all the problems solvable with an ascend-descend algorithm entertain structured data and hence data scheduling can be built into the algorithm, as in the case of systolic algorithms. For random data, the memory loop scheduling (similar to register allocation in current-day compilers [Cha82]) poses an interesting problem. Tyagi [Tya97] proposes some algorithms for scheduling data on loops.
1.6 Organization
The rest of the paper is organized as follows. Section 2 introduces both the sequential processor and parallel loop memory model. The various sequential processor implementations of the ascend-descend algorithms are described in Section 3. Section 4 sketches the $p$ processor version of ascend-descend algorithms. A brief sketch of matrix transpose and multiplication is given in Section 5. Some lower bounds are shown in Section 6. An alternate model for parallel optical computation, where the delay line loops serve as both a communication link and a dynamic memory, is presented in Section 7. Section 8 concludes the paper.
2 Model
The intent here is to develop a model for a processor with the primary dynamic storage in a delay line loop memory. The processor can have an instruction set similar to that of the Random Access Machine (RAM) in Aho et al. [AHU74]. In order to make it a little more reasonable, we will assume that the processor also contains a constant number of registers, each capable of holding one word of data. Note that a register operating at a gigahertz rate ($10^9$ per second) can be built using a very short delay line with a directional coupler [Jor89b].
Memory Loops: A processor can have several delay line memory loops of pre-specified lengths associated with it. Let there be $k$ loops $L_i$ of lengths $n_i$ for $0 \le i < k$. Note that the length of the fiber as a multiple of the wavelength determines the number of bits stored in it. We assume that the size of data stored in each loop $L_i$ is a pair of words, i.e., it is a word-parallel implementation. This could mean that each conceptual loop $L_i$ is implemented using 64 physical loops for a 32-bit word. An alternative method could be to encode two bits into each wave, thereby requiring only 32 physical loops. In either case, $L_i$ can store $n_i$ word pairs. The processor has a tap into each of these loops. A simple conceptual model is to assume that the current word under the tap is stored in a register associated with the loop. In practice, a memory loop could be directly coupled to a computational device without any need for an intermediate register. In the following, we make the assumption that for all practical purposes, the CPU accesses a loop through its tap register. A write into the loop is also performed from its tap register. We refer to the value in the tap register of the loop $L_i$ by $v(L_i)$. We will use the notation $v(L_i) = e$ to denote a write of the result of an expression $e$ into $L_i$. Similarly, $x = e(v(L_i))$ denotes a read of $L_i$ for the evaluation of the expression $e$.

[Figure 1: A Processor with Three Delay Line Memory Loops. The CPU and its local registers access loop 1, loop 2 and loop 3 through one tap register R per loop.]
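The tap-register view of a loop can be mimicked in a few lines of Python. This is a minimal sketch under our own assumptions, not part of the model's definition: the class name `DelayLoop`, its methods, and the use of a rotating index for the word under the tap are all hypothetical conveniences.

```python
class DelayLoop:
    """One conceptual loop L_i holding n_i word pairs (upper and lower track)."""

    def __init__(self, size):
        self.size = size
        self.upper = [None] * size
        self.lower = [None] * size
        self.head = 0  # index of the word pair currently under the tap

    def read(self):
        """Return the pair (upper word, lower word) now in the tap register."""
        return self.upper[self.head], self.lower[self.head]

    def write_upper(self, value):
        """Write a value into the upper track through the tap register."""
        self.upper[self.head] = value

    def tick(self):
        """Advance one cycle: the next word pair arrives at the tap."""
        self.head = (self.head + 1) % self.size

    def wait_for(self, index):
        """Cycles the processor would idle until word `index` reaches the tap."""
        return (index - self.head) % self.size


if __name__ == "__main__":
    loop = DelayLoop(4)
    for word in "abcd":        # one write per cycle fills the loop
        loop.write_upper(word)
        loop.tick()
    print(loop.read())         # the first word written is back under the tap
    print(loop.wait_for(3))    # the last word is three cycles away
```

The `wait_for` method makes the cost of a badly timed access explicit: in the worst case the processor idles for almost a full trip around the loop.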
Note that the digital optical computer at Colorado [Jor91, HJP92, SJ90] uses a bit-serial design. Each loop stores 64 words of 16 bits each in a bit-serial manner. The primary reason for choosing a bit-serial design was the cost of the optical switches. In their memory loop implementation [SJ90], each loop has a distinct read and write port. A Lithium Niobate directional coupler is used at each of these ports. Each data word recirculates serially through a delay line, passing the input/output ports once each memory cycle (20.5 µs, the time to go through the loop once). A memory counter keeps track of the address of the memory word at the read/write port. There are no latches in the system. The entire processor is designed with "time-of-flight" synchronization.
Read/Write Delays: The delay to read or write from a memory loop is a fraction of the time required for computation. The situation is similar to the register access time being a fraction of cycle time. In this paper, we will assume that both read and write require unit time. Recall that the Colorado optical computer [Jor91] has a memory cycle of 20.5 µs with a loop storing 64 16-bit words. If these loops were to be made word-parallel, then the time between successive words is 20 ns, which corresponds to a 50 MHz processing rate. This indeed is the processing rate of the Colorado digital optical computer. Hence our assumption about each loop access taking a machine cycle is not unrealistic. We also blur the distinction between an instruction cycle and a machine cycle in the algorithms presented in Section 3. We assume that instructions finish in one machine cycle as well, or that the CPI (cycles per instruction) [HP96] is nearly 1. The algorithms in this paper can be easily modified to fit any other delay model.
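The 20 ns and 50 MHz figures follow from the loop parameters quoted above. The short Python check below redoes that arithmetic; the variable names are ours, and the only inputs are the memory cycle, the word count and the word width stated in the text.

```python
MEMORY_CYCLE_S = 20.5e-6   # time for one trip around the loop
WORDS_PER_LOOP = 64
BITS_PER_WORD = 16

# A bit-serial loop holds one bit per time slot.
bits_per_loop = WORDS_PER_LOOP * BITS_PER_WORD

# Duration of one slot; with word-parallel loops a whole word would
# arrive in each slot instead of a single bit.
slot_time_ns = MEMORY_CYCLE_S / bits_per_loop * 1e9

# One access per slot expressed as a rate.
rate_mhz = 1e3 / slot_time_ns

print(f"{bits_per_loop} slots, {slot_time_ns:.2f} ns each, {rate_mhz:.1f} MHz")
```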
The operands for the instructions can be any of the local registers or one of the loop tap registers.
The parallel Butterfly model is a direct extension of the sequential processor model. The interesting scenario occurs only when an $n$-input problem instance needs to be solved on a $p \log p$ processor butterfly network for $p < n$. In such a case, the loop memory at each processor is used to maintain the copies of multiple data resulting from folding. For a description of a butterfly network, the reader is referred to Ullman [Ull84].
Note that the prototype of the electro-optical computing system being built at Colorado by Jordan [Jor89b] resembles our model. All the register and main memory is realized with fiber delay line storage loops. The processing logic is also derived from the optical technology in the form of a directional coupler, where the electric field between the wave guides is used as the control mechanism [Jor89b].
3 Ascend-Descend Algorithms
The CASCADE algorithms (Ascend-Descend), introduced in Preparata and Vuillemin [PV81], are an important class of algorithms. Consider an algorithm with $n = 2^k$ input data items. Let the $i$th input be located at address $i$ for all $0 \le i \le n - 1$ initially. An algorithm is an ascend algorithm if it operates successively on pairs of data items that are located $2^0, 2^1, 2^2, \ldots, 2^{k-1} = n/2$ distance apart. This involves ascending right to left on the $k$ address bits. A descend algorithm is defined similarly to operate on pairs of data items $2^{k-1}, 2^{k-2}, \ldots, 2^0$ apart. We define CASCADE to be the class of algorithms that are composed of a sequence of Ascend and Descend algorithms. Many problems such as merging, sorting, DFT, matrix transposition and multiplication, and data permutation have CASCADE algorithms. Hence, an efficient implementation of Ascend and Descend algorithms translates into an efficient implementation of all these problems. Besides, the topology of many interconnection networks such as Butterfly, Perfect-shuffle, Hypercube and Cube-connected-cycles is ideally suited for Ascend/Descend algorithms.
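As a point of reference, the definition above can be written as a plain random-access program before any loop memory enters the picture. The Python sketch below is our own illustration, not the loop-memory implementation: it assumes that the partner of address $i$ at distance $2^j$ is obtained by flipping address bit $j$, and that every item is replaced by `f(own value, partner value)` in each pass.

```python
def ascend(xs, f):
    """Apply f to pairs 1, 2, 4, ..., n/2 apart (n a power of 2)."""
    n = len(xs)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    cur = list(xs)
    span = 1
    while span < n:
        # i ^ span flips one address bit, giving the partner `span` away.
        cur = [f(cur[i], cur[i ^ span]) for i in range(n)]
        span *= 2
    return cur


def descend(xs, g):
    """Apply g to pairs n/2, n/4, ..., 1 apart (n a power of 2)."""
    n = len(xs)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    cur = list(xs)
    span = n // 2
    while span >= 1:
        cur = [g(cur[i], cur[i ^ span]) for i in range(n)]
        span //= 2
    return cur


if __name__ == "__main__":
    # With f = max, an ascend pass leaves the maximum at every address.
    print(ascend([3, 1, 4, 2], max))
```

With a commutative and associative operation such as `max` or addition, one ascend pass leaves the combined value of all inputs at every address, which makes it a convenient correctness check for the loop-based versions.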
In this section, we show a single processor implementation of an ascend-descend algorithm with the following resources: (1) time $O(n \log n)$, $\log n$ loops of sizes $1, 2, 4, \ldots, n$; (2) time $n^{1.5}$, two loops of sizes $n$ and $\sqrt{n}$; and (3) time $nk + n^{1.5} 2^{-k/2}$, $k$ loops, with $2 \le k \le \log n$, of sizes $\sqrt{n/2^k}$ and $n/2^k, n/2^{k-1}, \ldots, n$. Note that the first implementation performs the optimal amount of work, $n \log n$, demonstrating that an ascend/descend algorithm can be realized without any loss in time performance due to the high latency evident in loop memories. Also note that the number of loops in a real optical processor might be limited. The $k$-loop realization is helpful in this situation as it provides an implementation of ascend/descend algorithms on any optical processor.

[Figure 2: Loop Words. The words $L_i^{n_i-1}, \ldots, L_i^2, L_i^1, L_i^0$ of a loop $L_i$ circulate left to right, with $L_i^0$ at the read/write head.]
Loop Primitive Operations:
In the following algorithms, we will perform the same set of actions on loops many times. Hence we encapsulate these actions into loop primitives. We will use the notation $|L_i|$ to denote the length of the loop $L_i$, also given by $n_i$. Recall that we have assumed that each location in the loop is capable of holding two words, as if there were two conceptual tracks on the loop. We will refer to these words by $v_u(L_i)$ and $v_l(L_i)$ for upper and lower track words respectively. We will also refer to the words $v_u(L_i)$ and $v_l(L_i)$ by the notation $v_u(L_i^0)$ and $v_l(L_i^0)$ respectively. The notation $L_i^j$ denotes the word that would be at the read/write head (tap register) of $L_i$ in $j$ steps from the current time. Figure 2 shows a loop $L_i$ with data circulating left to right. The right end of the loop is connected to the left end. Sometimes, we will need to tag a value in a loop $L_i$ for reasoning about the correctness of our algorithms. In that case, we will assume that we can associate a memory counter with each word in the loop $L_i$, with values between 0 and $n_i - 1$. We will refer to the upper (lower) word with memory counter value $j$ by the notation $L_i^u[j]$ ($L_i^l[j]$).
Input copying: Our initial assumption is that the primary program input values reside in some other memory. We would copy different input values into a given loop of size 1 over time. The primitive copy_input($x_i$, $L_j$) copies the first argument $x_i$ to the upper track of the loop $L_j$ at the word aligned with the write port when this primitive is invoked, i.e., $v_u(L_j^0) = x_i$. We assume that copy_input takes one cycle.
Copying: In the ascend-descend algorithms in the next section, many copying steps involve copying each value from a loop $L_i$ to two locations in $L_j$ for $|L_j| = 2|L_i|$. In particular, $v(L_i^0)$ needs to be copied to $v(L_j^0)$ and $v(L_j^{n_i})$. Figure 3 shows the copy pattern.
[Figure 3: Loop Copying for Ascend-Descend Algorithms. Each word $L_i^k$ of the smaller loop is copied to two words, $L_j^k$ and $L_j^{n_i+k}$, of the larger loop.]
copy_upper($L_i$, $L_j$)
  {$L_j$ has twice as many words as $L_i$.}
  {Copy upper track words of $L_i$ into upper track of $L_j$.}
  1: $\forall k,\ 0 \le k \le n_i - 1$: $v_u(L_j^k) = v_u(L_i^k)$;
  2: $\forall k,\ 0 \le k \le n_i - 1$: $v_u(L_j^{n_i+k}) = v_u(L_i^k)$;
end copy_upper
The second copy of $L_i$ (through the second $\forall$ at Line 2) is synchronized for $L_i$ and $L_j$. After Line 1 has completed copying $L_i$ into the first half of $L_j$, the word that was at $L_j^{n_i}$ at the entry into copy_upper is now at the read/write port (the current $L_j^0$), and the word that was at $L_i^0$ at the entry into copy_upper is again at the read/write port. We need another similar primitive for copying the upper track words from $L_i$ to the lower track of $L_j$. Note that this copying takes time $n_j = |L_j|$.
j j
copy lowerL , L
i j
fL has twice as manywords as L .g
j i
fCopy upp er trackwords of L into lower trackofL .g
i j
k k
18k; 0 k n 1, v L = v L ;
i l l
j i
n +k
k
i
28k; 0 k n 1, v L = v L ;
i l l
j i
lower end copy
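The two copy primitives differ only in the destination track, so both can be illustrated by one small Python routine. This is a sketch under assumptions of our own: each track is a Python list, all loops are taken to circulate freely from a common start so that the word under the tap of a loop of size $s$ at cycle $t$ has index $t \bmod s$, and the name `copy_track` is hypothetical.

```python
def copy_track(src, dst, clock):
    """Copy each word of the smaller loop track `src` into `dst` twice.

    One word moves per cycle, from the tap of `src` to the tap of `dst`.
    Returns the cycle count after the copy, which takes len(dst) cycles.
    """
    n_i, n_j = len(src), len(dst)
    assert n_j == 2 * n_i, "destination must hold twice as many words"
    for _ in range(n_j):
        # Both loops advance together; the smaller one wraps around once.
        dst[clock % n_j] = src[clock % n_i]
        clock += 1
    return clock


if __name__ == "__main__":
    small_upper = ["a", "b"]
    large_upper = [None] * 4
    end = copy_track(small_upper, large_upper, clock=0)
    print(large_upper, end)
```

Passing the upper track of $L_j$ as `dst` plays the role of copy_upper, and passing its lower track plays the role of copy_lower. Because the smaller loop completes exactly two trips while the larger one completes a single trip, each source word lands in two locations $n_i$ apart regardless of the cycle at which the copy starts.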
Loop computation: Another step in the ascend-descend algorithms is to apply a function $f$ of two arguments to words on the upper and lower tracks for an entire loop.
compute_loop($L_i$, $f$)
  {Apply $f$ to words in $L_i$.}
  $\forall k,\ 0 \le k \le n_i - 1$: $v_u(L_i^k) = f(v_u(L_i^k), v_l(L_i^k))$;
end compute_loop
An assumption made here is that the reads of the two words $v_u(L_i^0)$ and $v_l(L_i^0)$, the computation of $f$ and the write of the result $v_u(L_i^0)$ all take one cycle. If need be, the result could be shifted by a few positions determined by the latency $c$ of the computation, i.e., $v_u(L_i^{k+c}) = f(v_u(L_i^k), v_l(L_i^k))$. The time taken by this computation is $n_i = |L_i|$, assuming one cycle read-compute-write.
$\log n$ Loop Version:
We number the levels of a butterfly network from 0 through $\log n$ bottom-up as in Figure 4, such that the cross-connections between level $i$ and $i+1$ have a span $2^i$. The processors on the same level are numbered 0 through $n-1$ from left to right. Let $x_{i,j}$ for $0 \le i \le n-1$ and $1 \le j \le \log n$ be the value computed in the processor $p_{i,j}$. The input value $x_i$ corresponds to $x_{i,0}$ for $0 \le i \le n-1$. Note that for the ascend phase $x_{i,j} = f(x_{i,j-1}, x_{i+2^{j-1},j-1})$ when the $j$th bit from the right in the binary representation of $i$ is 0, and $x_{i,j} = f(x_{i,j-1}, x_{i-2^{j-1},j-1})$ otherwise. Here $f$ is some operation performed in the processors. Similarly, for the descend phase, $x_{i,j}$ for $0 \le i \le n-1$ and $0 \le j \le \log n - 1$ is given by the following expressions: $x_{i,j} = g(x_{i,j+1}, x_{i+2^{j},j+1})$ when the $(j+1)$st bit from the right in the binary representation of $i$ is 0, and $x_{i,j} = g(x_{i,j+1}, x_{i-2^{j},j+1})$ otherwise. The skeleton of the implementation is as follows. A butterfly computation graph is scanned in the order shown in Figure 4. We will only demonstrate the ascend phase of the algorithm. The descend phase is very similar.
We assume that $\log n$ memory loops $L_1, L_2, L_4, \ldots, L_n$ of sizes $1, 2, 4, 8, \ldots, n$ respectively are available. Recall that the word $i$ positions away from the read/write port in the loop $L_k$ is referred to as $L_k^i$. The 0th position of a loop, $L_k^0$, is the position scanned by the tap register currently. The following implementation of the ascend algorithm assigns the physical loop $L_{2^i}$ to the storage of the values associated with butterfly level $i$ for $0 \le i \le \log n$. For instance, loop $L_1$ is time-multiplexed to store all the butterfly level 0 values, while loop $L_2$ stores all the butterfly level 1 values. Figure 5 presents the ascend phase algorithm.
The ascend phase is implemented by a call to Process-Ascend(0, $n$) (Figure 5), where $n$ is assumed to be a power of 2. It takes time $3n \log n + n$, as
$T(n) = 2T(n/2) + 3n$; $T(1) = 1$.
The two recursive calls to Process-Ascend in Steps 2 and 4 of Figure 5 give the term $2T(n/2)$, and the two copying steps in Steps 3 and 5 lead to the $2n$ term. The computation in Step 6 adds another $n$ to this equation. These equations have the solution $T(n) = 3n \log n + n$.
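The cycle count can be cross-checked with a small Python simulation of the procedure in Figure 5. The sketch rests on assumptions of ours that the text does not state in this form: all loops circulate freely from cycle 0, so the word under the tap of a loop of size $s$ at cycle $t$ has index $t \bmod s$; every primitive moves or computes one word per cycle; and $f$ is taken to be commutative, because compute_loop applies $f$(upper, lower) in the same argument order at every position. The function and variable names are hypothetical.

```python
def simulate_process_ascend(xs, f):
    """Run Process-Ascend(0, n) on loops L_1, L_2, ..., L_n.

    Returns (upper track of L_n, total number of cycles used).
    """
    n = len(xs)
    assert n >= 1 and n & (n - 1) == 0, "n must be a power of 2"
    sizes = [1 << j for j in range(n.bit_length())]   # 1, 2, 4, ..., n
    upper = {s: [None] * s for s in sizes}
    lower = {s: [None] * s for s in sizes}
    clock = 0   # global cycle counter shared by all loops

    def process_ascend(i, k):
        nonlocal clock
        if k == 1:
            upper[1][0] = xs[i]            # Step 1: copy_input, one cycle
            clock += 1
            return
        h = k // 2
        process_ascend(i, h)               # Step 2
        for _ in range(k):                 # Step 3: copy_upper(L_h, L_k)
            upper[k][clock % k] = upper[h][clock % h]
            clock += 1
        process_ascend(i + h, h)           # Step 4
        for _ in range(k):                 # Step 5: copy_lower(L_h, L_k)
            lower[k][clock % k] = upper[h][clock % h]
            clock += 1
        for _ in range(k):                 # Step 6: compute_loop(L_k, f)
            a = clock % k
            upper[k][a] = f(upper[k][a], lower[k][a])
            clock += 1

    process_ascend(0, n)
    return upper[n], clock


if __name__ == "__main__":
    values, cycles = simulate_process_ascend(list(range(1, 9)),
                                             lambda a, b: a + b)
    print(values, cycles)
```

For $n = 8$ and addition, every word of $L_8$ ends up holding the sum of all inputs, and the cycle count matches $3n \log n + n = 80$. The processor never waits in this simulation; each primitive finds the word it needs under the tap in the cycle in which it needs it.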
[Figure 4: Butterfly Network: Thick Lines Show the Order of Computing. A 4-input butterfly with processors $p_{0,j}, p_{1,j}, p_{2,j}, p_{3,j}$ on levels $j = 0, 1, 2$.]

Process-Ascend($i$, $k$)
  {Process the $k$ input bit positions starting at $x_i$ for ascend phase.}
  {$k$ is assumed to be power of 2.}
  if $k = 1$ then copy_input($x_i$, $L_1$); {copy input bit $x_i$ to $L_1$.}   (1)
  else
    Process-Ascend($i$, $k/2$);   (2)
    copy_upper($L_{k/2}$, $L_k$);   (3)
    Process-Ascend($i + k/2$, $k/2$);   (4)
    copy_lower($L_{k/2}$, $L_k$);   (5)
    compute_loop($L_k$, $f$);   (6)
    {compute the values $x_{i,k} = f(x_{i,k/2}, x_{i \pm k/2,k/2})$, for $0 \le i \le k - 1$.}
  end if
  {Note that $x_{i,k/2}$ and $x_{i \pm k/2,k/2}$ are available in $L_k$ at position $i$ due to previous two copying steps.}
end Process-Ascend
Figure 5: Recursive Version of Ascend Phase Algorithm

The main concern in determining the correctness of this implementation and its timing analysis is the synchronization of all the loop accesses. For instance, at the return from the call Process-Ascend(0, $n/2$) at Step 2, are the loops $L_{n/2}$ and $L_n$ synchronized so that copying takes place in $n$ steps? Note that the usage pattern of the loops by this algorithm has the following form: $L_1, L_2, L_1, L_2, L_4, L_1, L_2, L_1, L_2, L_4, L_8$ and so on. The following inductive proof establishes that the timing of the algorithm is correct, as the synchronization is done by "time-of-flight". Note that we have not defined the term "synchronization" used in the statement of the following theorem.
Theorem 1 The ascend phase algorithm in Figure 5 has synchronized data transfer and computation to emulate the behavior of a butterfly network.
Proof: The proof is inductive. Assume that all the data transfers and computation are synchronized for Process-Ascend($i$, $k$) for all values of $k < n$. The base case $k = 1$ involves only the transfer of $x_i$ into $L_1$. The inductive step is to prove that Process-Ascend(0, $n$) is synchronized given that Process-Ascend(0, $n/2$) and Process-Ascend($n/2$, $n/2$) are synchronized. Assuming that $n > 1$, Process-Ascend(0, $n$) first calls Process-Ascend(0, $n/2$), copies $L_{n/2}$ into $L_n$, calls Process-Ascend($n/2$, $n/2$), copies $L_{n/2}$ into $L_n$, and finally computes $L_n$. After Process-Ascend(0, $n/2$) is done, the assumption part of the inductive hypothesis is that $L_{n/2}$ is scanning $x_{0,\log n - 1}$, or $v_u(L_{n/2}^0) = x_{0,\log n - 1}$. In addition to showing that all the copying steps work correctly, we also need to prove that at the end of Process-Ascend(0, $n$), $v_u(L_n^0) = x_{0,\log n}$.
Let $T = 0$ when the copy_upper at Step 3 starts copying $L_{n/2}$ into $L_n$. Let the memory counter of the word copied into $L_n$ at $0 \le T = k \le n - 1$ be $k$. Similarly, let the memory counter in $L_{n/2}$ for the word copied at $0 \le T = k < n/2$ be $k$. Then $L_{n/2}^u[0] = x_{0,\log n - 1}$, which implies that $L_n^u[j] = L_n^u[j + n/2] = x_{j,\log n - 1}$. The copying is completed at $T = n - 1$, when Process-Ascend($n/2$, $n/2$) is called. This call is completed in $\frac{3n}{2}(\log n - 1) + \frac{n}{2}$ steps. Once again, at time $T = n + \frac{3n}{2}(\log n - 1) + \frac{n}{2}$, $v_u(L_{n/2}^0) = x_{n/2,\log n - 1}$ by the inductive hypothesis. The key question from the synchronization perspective is which word is at the read/write port of $L_n$ at $T = n + \frac{3n}{2}(\log n - 1) + \frac{n}{2}$. We would like $v_u(L_n^0)$ to be either $L_n^u[0]$ or $L_n^u[n/2]$, which holds if $T = n + \frac{3n}{2}(\log n - 1) + \frac{n}{2}$ is an integral multiple of $n/2$. Since $n/2$ divides the time elapsed since the last access to memory counter 0 in $L_n$, our synchronization holds for this copying step. After copy_lower of Step 5 is completed at $T = \frac{5n}{2} + \frac{3n}{2}(\log n - 1)$, $L_n^l[j] = L_n^l[j + n/2] = x_{n/2+j,\log n - 1}$ for $0 \le j < n/2$. Hence at $T = \frac{5n}{2} + \frac{3n}{2}(\log n - 1)$, $v_u(L_n^j) = x_{j,\log n - 1}$ and $v_l(L_n^j) = x_{n/2+j,\log n - 1}$ for $0 \le j < n/2$, as expected by the ascend algorithm, resulting in $L_n^u[j] = x_{j,\log n}$.
Similarly, $v_u(L_n^j) = x_{j-n/2,\log n - 1}$ and $v_l(L_n^j) = x_{j,\log n - 1}$ for $n/2 \le j < n$, resulting in $L_n^u[j] = x_{j,\log n}$ by the inductive hypothesis. At time $T = \frac{7n}{2} + \frac{3n}{2}(\log n - 1)$, when compute_loop is done, $v_u(L_n^0) = x_{0,\log n}$, establishing the second part of the inductive hypothesis. □

for $d = 1$; $d \le n$; $d = d + 1$;   (1) {$d$ is the diagonal number.}
  for $k = 1$; $k \le d$; $k = 2k$;   (2) {$k$ is the span of the data being processed.}
    if ($i = d - k$) mod $k$ == 0   (3) {$i$ is the starting position. All valid $(k, i)$ combinations have $i$ mod $k = 0$.}
      if $k$ == 1 copy_input($x_i$, $L_1$)   (4)
      else compute_loop($L_k$, $f$);   (5)
      if $i$ mod $2k$ == 0 copy_upper($L_k$, $L_{2k}$)   (6)
      else copy_lower($L_k$, $L_{2k}$);   (7)
Figure 6: Iterative Version of Ascend Phase Algorithm

Figure 5 gives a recursive version of the ascend phase implementation. We give an iterative version in Figure 6 as it helps understand the timing of the ascend phase from a different perspective. The key point to note here is that Process-Ascend($i$, $k$) calls one computing primitive (compute_loop usually, or copy_input for an input bit with $k = 1$) and one copying primitive (copy_upper or copy_lower) for each feasible value of $(i, k)$. The temporal order in which these two primitives are called for different $(i, k)$ pairs is as follows: (0, 1), (1, 1), (0, 2), (2, 1), (3, 1), (2, 2), (0, 4), (4, 1), (5, 1), (4, 2), (6, 1), (7, 1), (6, 2), (4, 4), (0, 8), ..., (0, $n$). Note that $i + k$ denotes a diagonal in the butterfly network of Figure 4. Also note that all the $(i, k)$ pairs are invoked in the increasing order of their diagonal $d = i + k$. Of course, some of the diagonal entries are never invoked; for instance, $(i = 1, k = 2)$ is not used due to the nature of this computation. Line 3 in Figure 6 checks for the validity of each $(i, k)$ pair. Lines 1 and 2 set up these $(i, k)$ pairs going over diagonals $d$ from 1 to $n$.
The above description assumes that the $n$ input words are available in some other memory (not the optical loops). This is not a realistic assumption since the processor contains only a small sized ($O(1)$) local storage in addition to the memory loops.
The only other place to hold the input is the slow secondary memory. In the following, we show that if the input words are resident in the loop $L_n$ initially, then they can be trickled down to the smaller loops in synchrony with Process-Ascend without any loss of efficiency. Here, we assume that either there is a separate third track on each loop to hold the input values or there is a separate set of $\log n$ input loops. In either case, let $I_k$ denote the input loop of size $k$, which is either part of the loop $L_k$ or an independent loop. Let the $n$ input words be in $I_n$ initially. In fact, we will assume that the memory counter for the input word $x_j$ is $j$, i.e., $I_n[j] = x_j$ for $0 \le j < n$. The input is moved by two primitives, copy_input_left and copy_input_right. The primitive copy_input_left($I_k$, $I_{k/2}$, $i$) copies the left half of $I_k$ ($k/2$ words) to $I_{k/2}$, starting with input word $x_i$, which would be at memory counter value 0.

    copy_input_left(I_k, I_{k/2}, i)
        {Argument i is only for reading convenience. Whenever this primitive is called, we maintain the invariant that I_k[0] = x_i.}
        ∀j, 0 ≤ j < k/2: I_{k/2}[j] = I_k[j]
    end copy_input_left

We define copy_input_right($I_k$, $I_{k/2}$, $i + k/2$) to copy the second (right) half of $I_k$ ($k/2$ words) into $I_{k/2}$. This task could have been performed by copy_input_left, but for clarity of presentation we also define copy_input_right. Once again, the specification of $i + k/2$ is really redundant. The expectation is to copy $I_k[k/2 + j] = x_{i+k/2+j}$ for $0 \le j < k/2$ into $I_{k/2}$. Whenever this primitive is called, $v^0(I_k) = I_k[k/2] = x_{i+k/2}$.

    copy_input_right(I_k, I_{k/2}, i + k/2)
        {Argument i + k/2 is only for reading convenience. Whenever this primitive is called, I_k[0] = x_{i+k/2}.}
        ∀j, 0 ≤ j < k/2: I_{k/2}[j] = I_k[k/2 + j]
    end copy_input_right

The procedure Process-Ascend also needs to be modified as follows:

    Process-Ascend(i, k)
        {Process the k input bit positions starting at x_i for ascend phase.}
        {k is assumed to be a power of 2.}
        if k = 1 then return;                               (1)
        else
            copy_input_left(I_k, I_{k/2}, i);               (2)
            Process-Ascend(i, k/2);                         (3)
            copy_upper(L_k, L_{k/2});                       (4)
            copy_input_right(I_k, I_{k/2}, i + k/2);        (5)
            Process-Ascend(i + k/2, k/2);                   (6)
            copy_lower(L_k, L_{k/2});                       (7)
            compute_loop(L_k, f);                           (8)
            {compute the values x_{i,k} = f(x_{i,k/2}, x_{i⊕k/2,k/2}), for 0 ≤ i ≤ k − 1.}
        end if
        {Note that x_{i,k/2} and x_{i⊕k/2,k/2} are available in L_k at position i due to the previous two copying steps.}
    end Process-Ascend

We now need to argue that this new version of Process-Ascend works correctly and that the input words are also synchronized. All that we have modified in Process-Ascend is to insert copy_input_left($I_k$, $I_{k/2}$, $i$) at Line 2 and copy_input_right($I_k$, $I_{k/2}$, $i + k/2$) at Line 5. Also, the action taken when $k = 1$ at Line 1 has changed from copying the input word $x_i$ into $L_1$ to doing nothing.

This version of the ascend algorithm takes time $4n \log n$. Note that

$$T(n) = 4n + 2T(n/2), \qquad T(1) = 0.$$

The input copying steps at Lines 2 and 5 each take time $n/2$. This results in an extra $n$ term in the expression for $T(n)$. Since nothing is done in the $k = 1$ case, $T(1)$ is 0. Each of the $\log n$ levels of the recursion contributes $4n$, so these equations have the solution $T(n) = 4n \log n$.

We now give a proof sketch for the timing of this version. The key invariant is that right before Process-Ascend($i$, $k$) is invoked, $I_k$ contains the input words $x_i, x_{i+1}, \ldots, x_{i+k-1}$ with $v^0(I_k) = x_i$. It is true for the top-level call Process-Ascend($0$, $n$) as an assumption. For all the other recursive calls of Process-Ascend, this invariant is maintained by Lines 2 and 5.

Lemma 1 The revised version of Process-Ascend is synchronized.

Proof: The proof is by induction on $k$ as in Theorem 1. The inductive hypotheses include: (1) when Process-Ascend($i$, $k$) is called, $I_k$ contains $x_{i+j}$ for $0 \le j < k$ with $v^0(I_k) = x_i$, (2) on return from Process-Ascend($i$, $k$), $v_u^0(L_k) = x_{i,\log k}$, (3) all the copying steps in Lines 2, 4, 5 and 7 are synchronized.
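Before the induction step, hypothesis (1) and the $4n \log n$ bound can be sanity-checked numerically for small $n$. The Python sketch below is an illustration under stated assumptions, not the authors' construction: it models only the input loops, it assumes copy_upper, copy_lower and compute_loop each take $k$ steps on $L_k$ (which is what the recurrence $T(n) = 4n + 2T(n/2)$ requires once the two $k/2$-step input copies are accounted for), and the names `simulate`, `written` and `head` are hypothetical.

```python
def simulate(n):
    """Trace the revised Process-Ascend on n = 2**m words; return total steps."""
    clock = 0
    # written[k] = (time at which the current contents of I_k started to be
    # written, index of the first word written). I_n holds x_0 .. x_{n-1},
    # with x_0 under the read head at time 0.
    written = {n: (0, 0)}

    def head(k):
        """Index of the input word currently under the head of I_k."""
        start, base = written[k]
        return base + (clock - start) % k

    def process_ascend(i, k):
        nonlocal clock
        if k == 1:                          # Line 1: nothing to do
            return
        half = k // 2
        assert head(k) == i                 # invariant needed by copy_input_left
        written[half] = (clock, i)          # Line 2: copy_input_left, k/2 steps
        clock += half
        process_ascend(i, half)             # Line 3
        clock += k                          # Line 4: copy_upper (assumed k steps)
        assert head(k) == i + half          # invariant needed by copy_input_right
        written[half] = (clock, i + half)   # Line 5: copy_input_right, k/2 steps
        clock += half
        process_ascend(i + half, half)      # Line 6
        clock += k                          # Line 7: copy_lower (assumed k steps)
        clock += k                          # Line 8: compute_loop (assumed k steps)

    process_ascend(0, n)
    return clock


if __name__ == "__main__":
    print(simulate(8))  # 96, i.e. 4 * 8 * log2(8)
```

The assertions fail if an input copy is ever entered with the wrong word under the head of $I_k$, so a clean run for a given $n$ indicates that, under the assumed step counts, Lines 2 and 5 stay aligned with the input loop.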
Let us establish the hypothesis for Process-Ascend($0$, $n$), assuming it is true for all $k < n$. Note that by assumption $v^0(I_n) = x_0$, and hence the copying step at Line 2 copies $x_0, x_1, \ldots, x_{n/2-1}$ into $I_{n/2}$, establishing the same invariant for Process-Ascend($0$, $n/2$) on Line 3. Line 2 (copy_input_left) takes $n/2$ steps. Hence at the entry into Process-Ascend($0$, $n/2$), $v^0(I_n) = x_{n/2} = I_n[n/2]$. Process-Ascend($0$, $n/2$) takes time $2n(\log n - 1)$, which is an integral multiple of $n$. copy_upper($L_n$, $L_{n/2}$) is synchronized to the beginning of $L_{n/2}$ by hypothesis (2). This copying at Line 4 takes $n$ steps. At the end of this copying, $v^0(I_n)$ is still $x_{n/2}$, since $2n(\log n - 1) + n$ steps have elapsed since the entry into Process-Ascend($0$, $n/2$) at Line 3. Hence copy_input_right at Line 5 is entered with the correct invariant, copying $x_{n/2}, \ldots, x_{n-1}$ into $I_{n/2}$ and establishing hypothesis (1) for the following Process-Ascend($n/2$, $n/2$) call. copy_lower is synchronized by the argument used in Theorem 1, since the number of steps elapsed from the end of copy_upper (Line 4) is $2n(\log n - 1) + n/2$, which is an integral multiple of $n/2$. Since copy_lower takes $n$ steps, compute_loop is also synchronized, leaving $v_u^0(L_n) = x_{0,\log n}$ and establishing hypothesis (2). $\Box$

Figure 7: A Square Shifter

Two Loop Version: An ascend-descend algorithm can be implemented with fewer loops (two), but then it takes longer time