Parallel algorithms based on expander graphs for optical computing

Ramamohan Paturi, Dau-Tsuong Lu, Joseph E. Ford, Sadik C. Esener, and Sing H. Lee

We consider the task of interconnecting processors to realize efficient parallel algorithms. We propose interconnecting processors using certain graphs called expander graphs, which can provide fast communication from any group of processors to the rest of the network. We show that these interconnections would result in a number of efficient parallel algorithms for sorting, routing, associative memory, and fault-tolerant networks. As the interconnections based on expander graphs are global and irregular, we reason that optical interconnections are preferred to electronic ones and propose implementation of these interconnections using the programmable optoelectronic multiprocessor architecture.

Key words: Optical interconnection, optical computing, expander graphs.

All authors are with the University of California, San Diego, La Jolla, California 92093; R. Paturi is in the Department of Computer Science & Engineering, the other authors are in the Department of Electrical & Computer Engineering.
Received 3 January 1990.
0003-6935/91/080917-11$05.00/0.
© 1991 Optical Society of America.

I. Introduction

To cope with the ever increasing demand on computing power, it is not enough to rely on faster device technology. It is necessary to utilize parallel processing. Assuming that the communication overhead is small and that the algorithm can be fully parallelized, a task requiring T sequential time steps can be performed in T/p steps by distributing the task among p processors. These two assumptions are the most important considerations in designing efficient parallel algorithms. We consider the design and implementation of such algorithms. More specifically, we investigate the interconnection properties of very large scale processor networks necessary to support efficient parallel algorithms. We find certain interconnection networks called expander graphs very useful for this purpose. Interconnections based on expander graphs can achieve global communication in constant time. This property of expander graphs is successfully exploited in the design of several efficient parallel algorithms.1-10 However, it has been unclear how to construct and implement good expander graphs.

We take the position that the interconnection networks based on expander graphs are the key to implementing significantly efficient parallel algorithms. To this end, we consider the design and the construction of expander graphs. We describe a probabilistic approach to construct and evaluate good expander graphs. We then try to convince the reader that expander graphs can indeed result in efficient algorithms in a variety of situations. We then discuss an optoelectronic implementation of these interconnection networks which combines the optical interconnection technology with very large scale integration (VLSI) technology,11 thus overcoming the difficulties encountered with pure VLSI technology.

In Sec. II we explain the definition and theory of expander graphs. Using this theory, we present a probabilistic approach to the construction of expander graphs. In Sec. III we show that expander graphs give rise to efficient parallel algorithms in a number of application domains. In particular, we explain how expander graphs can be used to construct approximate halvers, the basic building block of an optimal sorting algorithm. We show how expanders can result in a lower delay in routing applications. We also describe applications in associative memory, object distribution, fault-tolerant networks, and error correcting codes. In Sec. IV, we discuss approaches to implementing irregular interconnections and propose an optoelectronic system. Finally, in Sec. V, we discuss our conclusions and suggest future research directions.

II. Expander Graphs

Efficient parallel algorithms rely on the fast transfer of information within the processor network.12 Although a fully interconnected crossbar network can accomplish such communication in a single time step, the number and length of interconnections required (n² for n nodes) make the implementation of such large

scale networks prohibitively expensive. Often we are limited to networks with a smaller number of interconnections per processor. An example of such a network is the hypercube with log₂ n interconnections per processor. Two processors are connected if the Hamming distance between their log₂ n-bit addresses is 1. The hypercube and its variants, such as shuffle exchange and cube connected cycles, have become popular because of their relative ease of implementation and because a number of algorithms have been implemented on them with satisfactory performance. Such networks cannot give us optimal parallel algorithms, however, since they lack sufficient connectivity to facilitate fast parallel communication. For example, consider a communication task in which some small group, say 10%, of the processors have information which they need to transmit to the entire network. Assuming we have no control over the initial information distribution among the processors, any group of 10% of the processors can be considered. Clearly, a crossbar network can accomplish this task in one step but is technologically expensive. A hypercube can also accomplish this task but requires O(log n) steps, which scales with the size of the network (Fig. 1). We need a network which can interconnect an arbitrary number of processors in a constant number of steps. Interconnection networks based on expander graphs can provide the solution. They can accomplish this task in a number of steps dependent only on the fractional size of the group and the graph's expansion factor. The number of steps is independent of network size. We call this property global communication with a constant number of steps. It has been shown that for any given expansion there exist expander graphs with a constant number of fanouts per node (see Theorem 1 of Sec. II.B).1 This global communication ability is of fundamental importance for interconnection networks because a number of optimal algorithms can be developed based on such networks. In fact, there is some theoretical evidence that this property is indispensable for designing optimal algorithms.1,2 One of these is an O(log n) routing algorithm which can be used as a basis for a general purpose parallel computer.2

Fig. 1. Communication distance on a hypercube with n nodes. The nodes are denoted by log n-digit binary addresses. For n = 8, the longest communication distance is between (000) and (111). If information needs to be transmitted from the group of nodes in the shaded region to the remaining nodes, it requires (1/2) log n - c (log n)^(1/2) steps, which grows with log n.

10 March 1991 / Vol. 30, No. 8 / APPLIED OPTICS 917

A. Definition of Expander Graphs

Expander graphs are defined in terms of their properties. For convenience, we describe them as bipartite graphs, i.e., graphs with connections between two discrete regions. A bipartite graph G = (I,O,E) has a set I of input nodes and a set O of output nodes with E as the set of edges between input and output nodes. We consider only those bipartite graphs for which |I| = |O| and define that |I| = |O| = n. Edge (i,j) ∈ E connects input node i with the output node j. For any subset A of inputs, we define the neighborhood Γ(A) = {j ∈ O | (i,j) ∈ E for some i ∈ A}. The same applies to any subset of outputs with its neighborhood in I. We also define a bipartite graph G to be d-regular if the degree, i.e., fanout, of every node in the graph G equals d.

For 0 < ε < 1 and β > 1, a d-regular bipartite graph G = (I,O,E) is called a (d,ε,β) expander if, for all A ⊂ I (O) so that |A| ≤ εn, the neighborhood Γ(A) of A in G is such that |Γ(A)| ≥ β|A|. In other words, a graph expands if every subset of nodes up to a given size has a large neighborhood. We call β the expansion factor of the graph. Typically, we use expander graphs with β = (1 - ε)/ε. In the following subsections, we look at the existence and the construction of expander graphs in greater detail.

B. Properties of Expander Graphs

In an expander graph, the size of the neighborhood of every set is larger than that set by a constant factor. This expanding property gives rise to a number of interesting computation and communication properties. For example, this expansion property gives rise to approximate halving in a constant number of steps using compare and exchange operations. It also offers the means of realizing a trade-off between storage capacity for the interconnection patterns and the degree of the network while retaining the error correction, exponential convergence, and robustness properties in associative memory. Furthermore, this property also ensures multiple paths between the processing elements, providing the necessary redundancy for fault-tolerant communication networks. These properties together with applications are presented in further detail in Sec. III.

The large neighborhood criterion is a strong requirement. Consequently, even the proof of existence of such graphs is nontrivial. Such a proof is in fact provided by a nonconstructive (probabilistic) argument. In this argument, we look at the set of all d-regular bipartite graphs with each side having n nodes together with a uniform distribution3 on it. We then show that the fraction of these graphs that fails to have the given expanding property is <1. To make this fraction small, we select a suitable d as a function of the expansion. Since the fraction of graphs which does not have the required expanding property is <1, certain graphs of degree d must exist that meet the given expansion requirements. This result is stated more precisely in the following theorem.13

Theorem 1: Suppose 0 < ε [...]

$$ [\ldots]\ \frac{H(\epsilon) + H(1-\epsilon)}{H(\epsilon) - (1-\epsilon)\,H[\ldots]} \qquad (1) $$

where H(x) = -x log₂ x - (1 - x) log₂(1 - x) is the binary entropy function. Let I and O be two sets of vertices, |I| = |O| = n, and let G be a random d-regular bipartite graph on the classes of vertices I and O, obtained by choosing randomly d permutations from I to O. Then, with probability approaching 1 as n tends to ∞, G is a [d,ε,(1 - ε)/ε] expander.

Notice that we allow multiple edges here. This theorem guarantees expander graphs of any given expansion whose degree is bounded as a function of ε as in the above equation. Notice that the degree bound is independent of n. The problem is that this probabilistic argument does not give us a clue as to how to construct such expander graphs explicitly. Also, it seems that the explicit construction of expander graphs is difficult due to our requirement that every subset of nodes have a large neighborhood. Although there are some explicit constructions of expander graphs,14,15 none of these constructions offers high expansion with a small degree bound. On the other hand, the random construction shows that the degree need not be higher than -log ε/ε.3 Also, the probabilistic argument shows that for a suitably chosen d and for all large n, almost all the d-regular bipartite graphs will have the required expansion. This suggests that we should use randomly generated d-regular graphs. Such an approach would be successful only when we have the means to determine if a given d-regular bipartite graph has the necessary expansion. Computing the expansion of a bipartite graph is a co-NP complete problem,16 but estimating the lower bound of the expansion is not difficult, as shown by Tanner.17 We only need to compute certain eigenvalues of the incidence matrix of the expander graph for this estimation. The precise result of Tanner is given below.

Let G = (I,O,E) be a d-regular bipartite graph. Let M be the real valued incidence matrix of the bipartite graph: M = [m_ij], m_ij = 1 if the ith node in I is connected to the jth node in O and zero otherwise. Since MMᵀ is a real symmetric non-negative definite matrix, it is diagonalizable and has real non-negative eigenvalues and orthogonal eigenvectors. Let λ₁ ≥ λ₂ ≥ ... ≥ λₙ be the ordered eigenvalues of MMᵀ. Note that for d-regular graphs λ₁ = d². Then, the following theorem can be used to find the lower bound of the expansion of G.17

Theorem 2: If λ₁ > λ₂, for any ε < 1, G is a (d,ε,β) expander with

$$ \beta \ge \frac{d^2}{\epsilon\,(d^2 - \lambda_2) + \lambda_2} \qquad (2) $$

As the computation of eigenvalues of a matrix is relatively easy, this theorem provides an efficient tool for estimating the expansion. However, such a probabilistic approach gives rise to very irregular interconnection networks which create severe routing difficulties [...] In addition, optical interconnection is also superior for such global interconnection from both power and speed considerations. In Sec. IV we present the specific optical interconnection techniques to realize expander graph interconnections. In Sec. II.C we give our method for constructing expander graphs with a given expansion using the theoretical results mentioned here.

C. Probabilistic Construction

To generate an expander graph with a given expansion factor, the three primary tasks are (a) generating random d-regular graphs with a given degree d, (b) estimating the expansion of each graph by applying Theorem 2, and (c) selecting the graph with the best expansion over a large number of iterations. We present this algorithm below in a step-by-step approach, assuming that n is the number of input or output nodes on each side of the bipartite graphs and β = (1 - ε)/ε is the expansion:

Step 1: Generate Random d-Regular Graph.
We use a random number generator to first generate a random permutation of the first n integers. This random permutation will be interpreted as a one-to-one connection between the input and output nodes. Then we select d using Eq. (1). This d is the minimum degree required to achieve the given expansion. We generate d random permutations and construct the n × n incidence matrix of the corresponding d-regular graph. Theorem 1 guarantees that there exist expander graphs with the given expansion, provided that we select d according to Eq. (1) for the desired expansion.

Step 2: Estimate the Expansion Using Theorem 2.
We use standard numerical routines to compute the eigenvalues of MMᵀ for this matrix. These eigenvalues together with d and ε are substituted into Eq. (2) to obtain a lower bound on the expansion of the graph.

Step 3: Selecting the Best Graph.
Steps 1 and 2 are repeated for many iterations. We select the graph with the largest expansion.

Figure 2 graphically illustrates this algorithm. Using this algorithm, for various values of n and d, we found the network with the least second largest eigenvalue and computed its expansion using Theorem 2. We then plotted the relationship between the expansion β [for ε = 1/(1 + β)] and the degree d of the network for various values of the number of vertices n (Fig. 3). From this figure, it is clear that the relationship between the expansion and the degree is largely independent of the size n of the network for larger values of n (n > 128). The discrepancy for smaller values of n can be explained by the fact that the theoretical results are asymptotic. Hence we demonstrated that, for a given

expansion β, the degree of the network does not grow with the number of vertices.

Our initial experiments were conducted for values of n up to 1024. The incidence matrix is stored in the full storage mode. This approach will work for n up to 5000 using a Cray Y-MP with 32 million memory words. For larger values of n, n >> d, the incidence matrix is sparse. We also recall that it is a real symmetric non-negative definite matrix. Consequently we can use skyline methods to cut down dramatically the storage requirement for large incidence matrices.18

Fig. 2. Flow of the probabilistic construction algorithm.

Fig. 3. Plot of the expansion factor (β) against degree (d) for N = 32, 64, 128, 256, and 1024. See that β increases monotonically with d independent of n.

III. Applications of Expander Graphs

In this section, we give a few applications where the connectivity of expander graphs is successfully exploited to yield fast parallel algorithms and efficient designs.

A. Parallel Sorting

It has been a long-standing problem to find an optimal sorting network with O(log n) stages. It is easy to see that we need at least log n stages of comparators with each stage performing O(n) comparisons, since we have a lower bound of n log n on the total number of comparisons required for sorting. The credit for discovering an optimal algorithm goes to Ajtai, Komlos, and Szemeredi (AKS), who came up with an O(log n) stage sorting network, thereby matching the lower bound to within a constant factor.4 The basic idea of the AKS algorithm is to halve recursively the given sequence of numbers. Such a recursive halving requires O(log n) stages. Unlike a naive recursive halving scheme in which it would take O(log² n) steps (since each exact halving requires log n time), the novelty of the AKS sorting network is that it uses only approximate halving instead of exact halving. They handle these approximately halved sequences using an efficient error-management scheme to obtain the sorted sequence. Such an approach can result in an O(log n) algorithm provided approximate halving can be done quickly. This is how expander graphs enter the picture. It turns out that we can do approximate halving in constant time using expander graphs. We discuss the relationship between approximate halving and expander graphs in greater detail.

The idea is that we can use bipartite graphs to model the computation of an approximate halver network. The nodes in the bipartite graph represent the wires in the comparator network. We partition the wires into two groups of equal sizes with the node sets I and O corresponding to these two groups so that, at the end of the computation, most of the elements of the lower (higher) half of the inputs end up in the nodes of I (O). Hence we can assume that compare-and-exchange operations are only made between the nodes from different parts of the graph. Each such compare-and-exchange operation can be denoted by an edge in the bipartite graph. In this model, we discard the timing information and collapse the stages. A comparator network of depth d would then be modeled by a bipartite graph, each of whose nodes has degree d (counting multiplicities of edges). Also, given a d-regular bipartite graph, one can devise a comparator network with d stages that performs the same compare-and-exchange operations in some order. Note that the algorithms we

design will be insensitive to the order of these comparisons.

We now define an ε-halver as a comparator network which takes the inputs a₁, a₂, ..., a₂ₙ and produces two blocks (lower and higher) of outputs of equal length n. The idea is that the lower block contains all but an ε fraction of the n small elements of the input, and the higher block contains all but an ε fraction of the n large elements of the input. In fact, an ε-halver satisfies a stricter condition. An ε-halver has the property that, for any inputs and k ≤ n, the number of elements from the k smallest elements of the input which are output in the higher block is

Fig. 4. Unfolded view of an n = 32, d = 9 expander graph. ε ≤ 0.33 and expansion factor ≥ 2.0.

Fig. 5. Error distribution and cumulative error for an n = 32, d = 5 ε-halver (percentage of input against number of misclassifications after halving). See that 98% of all possible inputs result in two or fewer errors. The possibility of five errors is only 0.000025%.

[Figure: plot with series N = 32, 64, 128, 256, and 1024]
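The probabilistic construction of Sec. II.C and the use of the resulting graph as an approximate halver can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, NumPy stands in for the "standard numerical routines," the degree d is passed in directly rather than derived from Eq. (1), the incidence matrix stores edge multiplicities (so that row and column sums equal d even when permutations overlap), and each permutation is treated as one compare-and-exchange stage.

```python
import numpy as np


def random_regular_bipartite(n, d, rng):
    """Step 1: superimpose d random permutations of the n inputs.

    Returns the n x n incidence matrix M and the list of permutations.
    Assumption: entries of M count edge multiplicity (multiple edges allowed).
    """
    perms = [rng.permutation(n) for _ in range(d)]
    M = np.zeros((n, n))
    for p in perms:
        M[np.arange(n), p] += 1.0          # input i is wired to output p[i]
    return M, perms


def tanner_lower_bound(M, d, eps):
    """Step 2: lower bound on the expansion from Theorem 2."""
    lam = np.linalg.eigvalsh(M @ M.T)      # real eigenvalues, ascending order
    lam2 = lam[-2]                         # second largest; the largest is d**2
    return d**2 / (eps * (d**2 - lam2) + lam2)


def best_expander(n, d, eps, trials, rng):
    """Step 3: keep the random graph with the largest estimated expansion."""
    best = None
    for _ in range(trials):
        M, perms = random_regular_bipartite(n, d, rng)
        beta = tanner_lower_bound(M, d, eps)
        if best is None or beta > best[0]:
            best = (beta, M, perms)
    return best


def halve(values, perms):
    """Apply each permutation as one compare-and-exchange stage.

    values is a NumPy array of length 2n. Wires 0..n-1 form the lower block
    and wires n..2n-1 the higher block; the smaller element of every compared
    pair is kept in the lower block.
    """
    n = len(values) // 2
    lo, hi = values[:n].copy(), values[n:].copy()
    for p in perms:
        partner = hi[p]                    # element compared with lo[i]
        smaller = np.minimum(lo, partner)
        larger = np.maximum(lo, partner)
        lo = smaller
        hi[p] = larger
    return lo, hi


def misplaced(lo, n):
    """Number of the n largest keys (values >= n) left in the lower block."""
    return int(np.sum(lo >= n))


def empirical_eps(n, perms, samples, rng):
    """Highest error count over random permutations of 0..2n-1, divided by n."""
    worst = 0
    for _ in range(samples):
        lo, _hi = halve(rng.permutation(2 * n), perms)
        worst = max(worst, misplaced(lo, n))
    return worst / n


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, eps = 32, 5, 1.0 / 3.0
    beta, M, perms = best_expander(n, d, eps, trials=50, rng=rng)
    print("estimated lower bound on expansion:", beta)
    print("empirical epsilon:", empirical_eps(n, perms, samples=1000, rng=rng))
```

Selecting the graph with the largest bound is equivalent to selecting the one with the smallest second largest eigenvalue, since the bound decreases as that eigenvalue grows. The eigenvalue bound is only a lower estimate of the expansion, so the empirical error of the halver may be noticeably smaller than the bound suggests.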

maximum error we have to consider when designing an error-management scheme to achieve the AKS sorting algorithm. Note that the expansions computed for Fig. 3 are lower bound estimates; therefore, the ε obtained would be upper bound estimates (see Fig. 6). To see how these expander graphs perform as ε-halvers, we used them to halve 1000 random permutations of n distinct integers and recorded the number of integers that ended up in the wrong halves. The highest error count was divided by n to give an empirical estimate of ε (see Fig. 7). This empirical ε is significantly less than the upper bound estimated, confirming with Fig. 5 that ε-halvers are very efficient for a large majority of the input.

Fig. 7. Empirical ε (amount of misclassification) against degree (d), obtained by halving 1000 permutations of n distinct integers, for N = 32, 64, 128, 256, and 1024.

In addition to the sorting algorithm, researchers have developed optimal algorithms for other related problems, e.g., finding the maximum and median.5,6 The problem with all these algorithms is that they are only optimal in an asymptotic sense. This means that these algorithms can perform better only when we consider problems of very large sizes, e.g., sorting billions of numbers.

For smaller problem sizes, other existing algorithms would be more efficient. Even if the AKS algorithm is not presently competitive for practical problem sizes, one can, with improved algorithm analysis techniques and advances in the theory of parallel algorithms, hope for the development of parallel algorithms that are optimal for practical problem sizes.3 The technological feasibility of irregular interconnections would give impetus for the development of better parallel algorithms.

B. Routing

One of the principal and immediate applications of expander graphs is for constructing efficient routing networks. It has been shown by Valiant19 that if a message is routed to a random destination and then to its real destination, the delay can be made proportional to the diameter of the network. One intuitive reason for smaller delays is that random destination routing would tend to distribute the packets evenly across all the edges in the network, thereby minimizing the traffic on each edge. It turns out that one can dispense with random destination routing and achieve the same effect by using expander graphs. This was shown by Upfal.2 Recently, Leighton and Maggs showed that they can achieve significantly smaller delays and higher fault tolerance by augmenting a butterfly network with an expanding graph. In particular, they showed that such an augmented butterfly is better than even a dilated butterfly, which has the same amount of hardware. These results suggest that expander graphs will play a significant role in the development of parallel computers.

C. Associative Memory

Associative memory is the ability to recall data given partial information. One of the well understood models of associative memory is that of Hopfield.20 This model assumes a fully interconnected network of neurons. Information is stored in this system by adjusting the weights of the interconnections. This model has a number of remarkable properties, which include error correction, exponential convergence, and robustness with respect to errors in the weights. However, in practice it is hard to implement such a network since the number of interconnections grows quadratically with the number of neurons. To retain many of the nice properties exhibited by the Hopfield model, Komlos and Paturi7 have shown that one can use certain sparse networks, which should have global communication properties similar to those of expander graphs. Nonexpanding networks like the hypercube would lack the error-correction properties of the Hopfield model. In essence, expander graphs would give us the means to realize a trade-off between the storage capacity and the degree of the network while retaining the error correction, exponential convergence, and robustness properties. This shows that the connectivity provided by expander graph interconnections is versatile. We generated a 256-node expander graph with d = 20 and used it for interconnecting 256 neurons in a simple Hopfield network. Using <10% of the interconnections required by a crossbar, we stored two 16 × 16 binary images and demonstrated the error-correction property for 20% random error (Fig. 8).

D. Object Redistribution

The object distribution algorithm is the central part of Cole and Vishkin's solutions to the O(log n) time task scheduling problem and the O(1) time processor scheduling problem.8 We have a set of objects representing the tasks to be performed in parallel. The goal

is to divide this set of objects into collections of objects with approximately equal sizes so that these collections can be executed in a minimum number of parallel steps. This problem is encountered when the objects are distributed unevenly in the network, and no one processor has access to all the objects. Such an uneven distribution can be made more balanced if the objects are redistributed. The scheduling problems can be solved optimally if the redistribution can be done in constant time. The proposed solution uses an expander graph to interconnect these collections. As we shuffle the objects between pairs of interconnected collections to achieve local balance between them, the global communication property of expander graphs assures more even global object distribution in a constant number of steps.

Fig. 8. Convergence of a 16 × 16 input with 20% random error to one of the two stored images (panels: Input, Step 2, Step 3). The 256-neuron Hopfield network is interconnected by an expander graph that uses <10% of the crossbar interconnection.

E. Fault-Tolerant Networks and Error-Correction Codes

Achieving consensus in the presence of faults is a basic problem in distributed computing. Hardware or software faults can prevent a processor from cooperating in the consensus process. In such a case, the goal is to obtain unanimity among the nonfaulty processors. The problem is that faulty processors can prevent communication among the nonfaulty ones. It is also possible that faulty processors can introduce misleading messages into the network. To achieve unanimity, Ω(t) connectivity is necessary where t is the number of faults to be tolerated. This high connectivity requirement can be relaxed if we are willing to lose some nonfaulty processors and settle for cooperation among the vast majority of the nonfaulty processors. In such a case, one can use expander graphs to interconnect the processors. The communication properties of expander graphs guarantee that we lose only a few nonfaulty processors. Hence expander graphs can be used as good fault-tolerant networks.9

Economic memory storage requires a refresh or restoration mechanism to counteract the accumulation of errors. Such a mechanism must rely on redundancy and voting for restoration. This added computational requirement increases the possibility for device error. Thus the problem of information storage in the presence of noise leads to the problem of computation in the presence of noise. This problem is similar to that of fault-tolerant computing. Here again global communication properties of expander graphs can be used to implement a voting mechanism economically.10

IV. Optoelectronic Implementation

We have now described the irregular interconnection approach to parallel computation and discussed some of its advantages. In this section, we examine the implementation technology. We show that a system combining local electronic computation with global optical communications provides an excellent match to the system requirements. We describe the programmable optoelectronic multiprocessor (POEM) system being developed at UCSD and discuss how it can support expander graphs. Two implementations are discussed, one using fixed computer generated holographic optical interconnections, the other using reconfigurable volume (photorefractive) holographic interconnections.

A. Why Optoelectronics?

Electronic VLSI technology is well established, inexpensive, and reliable. It is excellent for logic operations and local communications, as in a single processing element. However, as the length and density of the communication links increase, the disadvantages of a purely electronic approach become significant. In particular, the irregular global communications described are catastrophic for VLSI. Time delay, energy dissipation, and potential clock skew all grow with increasing length, a problem for global communications. Electronic crosstalk and reliability considerations limit the allowable number of line crossings, making the layout of irregular interconnection links difficult. As a result, the problem of communications becomes critical in chip layout. Valuable silicon real estate is expended on connections, reducing the amount available for processing. For example, in most VLSI chips, 70% of the silicon area is devoted to communications and related tasks, although most of the chip layout time is spent trying to minimize this percentage.

The communications problem is basically topological. In VLSI electronics, all the processors lie in a 2-D plane. As long as the communications between them are also restricted to that plane, there is competition between communications and processing for the same limited area. By introducing free-space optical interconnection, communication links can be taken into the third dimension above the processing plane. There

10 March 1991 / Vol. 30, No. 8 / APPLIED OPTICS 923 Opto-Electronic PE plane Processor Array Interconnection Control Signal Parallel Data Input / Output Clock / Control

OptojInstructions l ctron

I S~ottctor gic

are costs in power, speed, and complexity in converting the electronic signals to optical. However, it has been shown²¹ that for links longer than a (technology dependent) break-even length l_c the optical link is more efficient in terms of both power and speed. The l_c was calculated using realistic optical and electronic performance parameters and found to be as small as 1-2 mm. This means that optical links are preferred for wafer-scale integration implementation of parallel processors using global interconnections.

Based only on power and speed considerations, the optical link is already preferred to the electronic wire for long distance communications, but there are other significant advantages offered by the optoelectronic combination. The area of the chip expended on the long distance wires and any associated electronics (amplifiers, signal boosters, etc.) is made available for processing. The VLSI layout is simplified. Problems arising from clock skew are reduced, since all long distance links have approximately the same length. In addition, there are two potential advantages which come from the physical separation of the interconnect technology from the processing plane: fault tolerance and reconfigurability. VLSI fabrication faults can be corrected after chip testing by selecting the connection links to replace faulty processors with working spares. This reduces production costs without sacrificing efficiency. As a consequence of the optical long distance communication, all processors are effectively adjacent. If the connection can be changed during operation, the system becomes more versatile, efficient, and operation fault tolerant. The type and time scale of reconfiguration are technology dependent, but in general the better the connection pattern matches the problem requirements, the more efficiently the available processing power can be applied to a variety of problems.

B. Programmable Optoelectronic Multiprocessor

The POEM is a generalized system approach to parallel computing derived from these considerations. The POEM system was described in detail in Ref. 22. We briefly describe it here, then discuss its application to parallel processing with irregular interconnection. An idealized POEM system is shown in Fig. 9. The VLSI wafer is divided into optoelectronic processing elements (PEs) which perform computations and local communications electronically. Each PE also has one or more optical detectors and modulators with which it can receive data and control instructions and communicate to other processors. These optoelectronic PEs communicate among each other through electrooptic (EO) modulation of coherent light (generated off-chip). Modulators are preferred to integrated laser sources for reduced on-chip power dissipation, simplified fabrication, and increased reliability. The processing planes can be manufactured using, for example, silicon electronic processors fabricated on transparent EO PLZT substrates.²³ The system is controlled in single instruction multiple data (SIMD) fashion by a serial host computer, which distributes the clock signal, determines the tasks of the PEs on various wafers, and (for reconfigurable systems) controls the interconnection pattern. Data transfer is bit serial, but computations are made in parallel planes. Interconnections are made using holography (see Sec. IV.C) and may be fixed or reconfigurable depending on the technology and application. The POEM architecture is a generalized approach to optoelectronic processing, describing any system using holographic interconnection of electronic processing arrays. It is intended as a framework to be adapted into specific systems matching the application requirement.

Fig. 9. Idealized POEM system, which combines local electronic processing with global optical communication.

The POEM system can have either an unfolded (Fig. 10) or folded (Fig. 11) geometry. The unfolded system can use fixed interconnects, which can be implemented with thin computer generated holograms (CGHs) or, for certain regular interconnection patterns, with refractive optics. A large number of processing planes are required to perform the computation. As a result, the hardware cost is placed on the processing electronics rather than the interconnection technology. A fixed-interconnection unfolded system can be efficient for some computations, but a more versatile computer will require reconfigurable interconnections. The folded system in Fig. 11 uses reconfigurable interconnections to perform general purpose processing with only two processing planes. Information is transferred back and forth between the planes. The connections can be bidirectional, as shown, or they can be different for the forward and return paths. The reconfigurable interconnects increase the computer's versatility and efficiency at a cost of increased optical system complexity. Using only two processing planes increases hardware utilization but does not support pipelined operations.

Fig. 10. Unfolded geometry POEM system. Information enters from the left and flows to the right as it is processed.

Fig. 11. Folded geometry POEM system. Information circulates between the two processing planes.

The speed and nature of reconfiguration determine which algorithms can be efficiently implemented. Clearly, a system which can update the connections in less than a single clock cycle is ideal, but some computations and algorithms need to update connections relatively infrequently after many clock cycles. We have found it convenient to categorize reconfigurable interconnection systems according to their range and speed of reconfiguration. A preprogrammed connection system can switch at high speed between a limited set of prerecorded patterns. These patterns must be chosen and stored before operation and can be updated slowly (compared to the computer's run time) if at all. A reprogrammable connection system is completely general; any desired connection pattern can be constructed and implemented at the reconfiguration rate. Finally, an adaptive connection system produces a continuous incremental change in the interconnection pattern in response to the algorithm's needs.

C. Algorithm Implementation on POEM

The technology of interconnection depends heavily on the needs of the algorithms to be implemented. Some highly regular interconnection patterns can be performed using space invariant refractive optics such as lenses, masks, and mirrors. For more general patterns, including the completely irregular connections discussed in this paper, space variance is needed. A promising approach to space variant interconnection is to use holography. Each connection can be stored as a single hologram. The input beam reads the hologram, reconstructing a wavefront propagating toward the desired destinations. Holographic storage is dense and distributed, storing large amounts of information in a defect-tolerant manner. Most important, any desired connection pattern with arbitrary fanout, fanin, and direction can in principle be stored. Holograms are divided into two major types, thick (volume) and thin, according to whether their thickness is large or small compared to the features of the recorded interference pattern (the grating wavelength). In Secs. IV.C.1 and IV.C.2 we describe two POEM systems implementing parallel irregularly connected algorithms. The first system is unfolded, using fixed thin holographic optical interconnects. The second is folded with preprogrammed volume holographic interconnections. We outline the procedure for approximate halving on each system to illustrate operational differences.

1. Unfolded POEM with Fixed CGH Interconnects

Thin holograms can be used to perform fixed interconnections. They may be fabricated by recording optical interference patterns or computer generated masks in either phase or amplitude. A CGH with submicron features can be written by electron-beam lithography, then etched into glass plates.²⁴ Multilevel-phase CGHs can be designed to produce up to 100% diffraction efficiency, although transmission (amplitude modulation) CGHs are much less efficient.²⁵ Figure 12 shows a POEM system which uses a fixed CGH to interconnect a series of parallel optoelectronic processor arrays.

Fig. 12. POEM system using a fixed CGH. The EO modulator output from one processing plane is interconnected by the fixed CGH to the next processing plane.

The system shown uses a double pass faceted CGH architecture²⁶ with one facet devoted to each detector and modulator. Coherent light entering the modulators from below is polarization modulated and analyzed. This output is collimated and directed to one or more locations in the next plane by modulator facets. Detector facets focus the incident light into detectors. The area of modulators and detectors is minimized to reduce device capacitance and response time.²³ Data can be input electronically or in parallel using spatial light modulators (not shown) imaged onto detectors in the processing planes. Assuming a 5- × 5-cm diffraction-limited CGH with 0.5-µm features and 700-nm light, 128 × 128 processor arrays could be interconnected.²⁶ Optoelectronic Si/PLZT processing arrays of this size are certainly feasible. More sophisticated interconnection hologram design and fabrication techniques currently under investigation should be able to accommodate larger PE arrays.

To perform the approximate halver algorithm described in Sec. III.A, each processing element requires two detector inputs (the two values to be compared) and two modulator outputs. The planes are divided into two halves, higher and lower. One output from each PE connects directly to that PE's corresponding location in the next plane, while the other output connects to a quasirandom destination on the next plane's other half. Each PE receives and compares the two input values. The higher half of the processors passes the higher of the two values straight across and switches the lower. The lower half of the processors does the opposite. The input values are loaded from the left and propagate in parallel to the right, sorted more accurately in each step into the higher and lower half-planes.

2. Folded POEM with Programmable Photorefractive Interconnects

Thick (volume) holograms are dramatically different from planar holograms in that they exhibit readout selectivity. When the readout beam mismatches the stored hologram in either the optical wavelength or the phase pattern, the diffraction efficiency decreases dramatically. The degree of selectivity depends on the thickness of the hologram; for a 1-mm thick hologram, an angular mismatch of 0.1° (Ref. 27) cuts diffraction to nearly zero. This behavior allows the superposition of multiple volume holograms, each coded with its own reference wavefront. When one or more of the reference wavefronts illuminates the hologram, the corresponding images are simultaneously recalled. Each of these volume holograms can in theory have high (approaching 100%) diffraction efficiency. The principal volume recording media are photorefractive crystals, which develop an index modulation (phase grating) in continuous response to incident light.

Figure 13 shows a POEM system using multiplexed volume holographic interconnects recorded in a photorefractive crystal. The processing planes are similar to those of the preceding example, except that now, because the connection pattern is reconfigurable, the information is exchanged back and forth between a single pair of processing array planes, PA1 and PA2. The optical system works by retrieving interconnection patterns prestored as volume holograms superimposed on a photorefractive crystal. In Fig. 13, a recording source array is used to record each processor's interconnection pattern sequentially. Computer-controlled scanning directs the recording images to spatially discrete crystal subvolumes. Multiple interconnection patterns are superimposed on each subvolume using phase or wavelength multiplexing. After all the holograms are recorded, one complete interconnection pattern can be recalled in parallel using the input coded with the proper phase or frequency. The system is preprogrammed, reconfigurable between prestored patterns. Assuming diffraction-limited holograms and 10% diffraction efficiency, two 50- × 50- × 2.5-mm lithium niobate crystals could interconnect a 128 × 128 input array with ten prestored patterns. Again, more sophisticated approaches should increase the possible performance. In particular, prefabricated CGH patterns could be used to provide wavefronts for volume storage, decreasing programming time and possibly increasing array size.

Fig. 13. POEM system using reconfigurable volume holograms. Photorefractive crystals contain several interconnection patterns distinguished by frequency or phase multiplexing.

To perform the approximate halving of n values, the input is arbitrarily divided into two halves. Each half is sent into one of two n/2 element processor arrays, PA1 and PA2. Each processor stores its value, then sends a copy to the other plane along a skewed 1-1 connection pattern. Both the forward and the reverse connection patterns are identical. In the next step, each processor compares the received value with the one it was originally given. Processors in PA1 store the higher value and send the lower to PA2 using a new irregular connection. Processors in PA2 perform the same operation in reverse, storing the lower of the two values. As the process continues, planes PA1 and PA2 hold in storage the higher and lower half, respectively, of the values with a steadily decreasing probability of error. In this folded implementation a total of only n processors was required, each with a single detector and modulator.

The two systems we have described are intended only to indicate the potential of optoelectronic processing for implementing irregularly interconnected parallel algorithms. Both optical and electronic components may be replaced as more advanced versions become available. For example, the correlation matrix-tensor multiplier system currently being investigated at UCSD may provide a more versatile reprogrammable interconnection system.²⁸ The Si/PLZT processor planes may be replaced with faster switching multiple quantum well modulators. Most important, we have shown that optoelectronics is a technology well suited to implementing these algorithms.

V. Conclusions and Further Work

We proposed an interconnection architecture based on expander graphs and have shown how these interconnections could lead to efficient parallel algorithms. We have also reasoned that such graphs cannot be implemented with existing VLSI technology but can be made practical with optoelectronic computing technology using free space optical interconnects.

Our further work will focus on experimentally demonstrating the feasibility of implementing expanders on optoelectronic computers and on finding new and more efficient ways of using irregular interconnections.

The authors would like to acknowledge support by AFOSR grant 89-0440 and DARPA administered by AFOSR grant 88-0022.

References

1. T. Leighton and B. Maggs, "Expanders Might be Practical: Fast Algorithms for Routing Around Faults on Multibutterflies," in Proceedings, IEEE Symposium on Foundations of Computer Science (1989), pp. 384-389.
2. E. Upfal, "An O(log N) Deterministic Packet Routing Scheme," in Proceedings, Twenty First Annual ACM Symposium on Theory of Computing (May 1989), pp. 241-250.
3. M. S. Paterson, "Improved Sorting Networks with O(log n) Depth," Research Report 89, Department of Computer Science, U. Warwick, Coventry, CV4 7AL, U.K. (1987).
4. M. Ajtai, J. Komlos, and E. Szemerédi, "Sorting in c log n Parallel Steps," Combinatorica 3, 1-19 (1983).
5. M. Ajtai, J. Komlos, W. L. Steiger, and E. Szemerédi, "Deterministic Selection in O(log log n) Parallel Time," in ACM Symposium on Theory of Computing, Vol. 18 (1986), pp. 188-195.
6. N. Pippenger, "Sorting and Selection in Rounds," SIAM J. Comput. 16, 1032-1038 (1987).
7. J. Komlos and R. Paturi, "Effect of Connectivity in an Associative Memory Model," in Proceedings, IEEE Symposium on Foundations of Computer Science (1988), pp. 138-147.
8. R. Cole and U. Vishkin, "Approximate and Exact Parallel Scheduling with Applications to List, Tree and Graph Problems," in Twenty-Seventh Annual Symposium on Foundations of Computer Science (1986), pp. 478-491.
9. C. Dwork, D. Peleg, N. Pippenger, and E. Upfal, "Fault Tolerance in Networks of Bounded Degree," SIAM J. Comput. 17, 975-988 (1988).
10. N. Pippenger, "The Memory Refresh Problem," in Advanced Research in VLSI, Proceedings, Fifth MIT Conference, J. Allen and F. T. Leighton, Eds. (MIT Press, Cambridge, 1988).
11. J. W. Goodman, "Optics as an Interconnect Technology," in Optical Processing and Computing, H. H. Arsenault, T. Szoplik, and B. Macukow, Eds. (Academic, New York, 1989), Chap. 1, pp. 1-32.
12. L. G. Valiant, "Graph-Theoretic Properties in Computational Complexity," J. Comput. Syst. Sci. 13, 278-285 (1976).
13. N. Alon, "Eigenvalues and Expanders," Combinatorica 6, 83-96 (1986).
14. G. A. Margulis, "Explicit Construction of Concentrators," Prob. Peredachi Inf. 9, 71-80 (1973) [Probl. Inf. Transm. 325-332 (1973)].
15. O. Gabber and Z. Galil, "Explicit Construction of Linear Sized Superconcentrators," J. Comput. Syst. Sci. 22, 407-420 (1981).
16. M. Blum, R. M. Karp, O. Vornberger, C. H. Papadimitriou, and M. Yannakakis, "The Complexity of Testing Whether a Graph is a Superconcentrator," Inf. Process. Lett. 13, 164-167 (1981).
17. R. M. Tanner, "Explicit Concentrators from Generalized N-Gons," SIAM J. Alg. Discrete Methods 5, 287-293 (1984).
18. N. E. Gibbs, W. G. Poole, Jr., and P. K. Stockmeyer, "A Comparison of Several Bandwidth and Profile Reduction Algorithms," ACM Trans. Math. Software 2, 322-330 (1976).
19. L. G. Valiant, "A Scheme for Fast Parallel Communication," SIAM J. Comput. 11, 350-361 (1982).
20. J. J. Hopfield, "Neural Networks and Physical Systems with Emergent Collective Computational Abilities," Proc. Natl. Acad. Sci. USA 79, 2554-2558 (1982).
21. M. R. Feldman, S. C. Esener, C. C. Guest, and S. H. Lee, "Comparisons Between Optical and Electrical Interconnects Based on Power and Speed Considerations," Appl. Opt. 27, 1742-1751 (1988).
22. F. Kiamilev et al., "Programmable Opto-Electronic Multiprocessors and Their Comparison with Symbolic Substitution for Digital Optical Computing," Opt. Eng. 28, 396-409 (1989).
23. S. H. Lee, S. C. Esener, M. A. Title, and T. J. Drabik, "Two-Dimensional Silicon/PLZT Spatial Light Modulators: Design Considerations and Technology," Opt. Eng. 25, 250-260 (1986).
24. K. S. Urquhart, S. H. Lee, C. C. Guest, M. R. Feldman, and H. Farhoosh, "Computer Aided Design of Computer Generated Holograms for Electron Beam Fabrication," Appl. Opt. 28, 3387-3396 (1989).
25. G. L. Swanson, "Binary Optics Technology: The Theory and Design of Multi-Level Diffractive Optical Elements," MIT Lincoln Laboratory Technical Report 854 (1989).
26. M. R. Feldman and C. C. Guest, "Interconnect Density Capabilities of Computer Generated Holograms for Optical Interconnection of Very Large Scale Integrated Circuits," Appl. Opt. 28, 3134-3137 (1989).
27. R. J. Collier, C. B. Burckhardt, and L. H. Lin, Optical Holography (Academic, New York, 1971), Chap. 9.
28. J. E. Ford, Y. Fainman, and S. H. Lee, "Array Interconnection by Phase-Coded Optical Correlation," Opt. Lett. 15, 1088-1090 (1990).
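
As an informal illustration of the folded approximate-halving procedure of Sec. IV.C.2, the following Python sketch simulates two planes, PA1 and PA2, exchanging values over a sequence of 1-1 connection patterns. It is a minimal software model under stated assumptions, not the optical implementation: each prestored pattern is replaced by a pseudorandom perfect matching (a stand-in for an expander-derived pattern), each round is read as a compare-exchange in which PA1 keeps the higher value and PA2 the lower, and the names `folded_halving` and `misplaced` are hypothetical.

```python
import random


def folded_halving(values, rounds, seed=0):
    """Run `rounds` compare-exchange rounds between planes PA1 and PA2.

    Assumption: each round uses a fresh pseudorandom 1-1 pattern in place of
    a prestored holographic interconnection pattern.
    """
    rng = random.Random(seed)
    half = len(values) // 2
    pa1 = list(values[:half])          # plane meant to end up with the higher half
    pa2 = list(values[half:2 * half])  # plane meant to end up with the lower half
    for _ in range(rounds):
        # Hypothetical connection pattern: PA1 processor i <-> PA2 processor link[i].
        link = list(range(half))
        rng.shuffle(link)
        for i, j in enumerate(link):
            a, b = pa1[i], pa2[j]      # own value and the copy received from the partner
            pa1[i] = max(a, b)         # PA1 stores the higher value
            pa2[j] = min(a, b)         # PA2 stores the lower value
    return pa1, pa2


def misplaced(pa1, pa2):
    """Count values held in PA1 that belong to the lower half of all values."""
    cutoff = sorted(pa1 + pa2)[len(pa2)]  # smallest value of the true upper half
    return sum(1 for v in pa1 if v < cutoff)


if __name__ == "__main__":
    data = random.Random(1).sample(range(1000), 64)  # 64 distinct test values
    for r in (0, 1, 2, 4, 8):
        hi, lo = folded_halving(data, r)
        print(r, "rounds:", misplaced(hi, lo), "values on the wrong plane")
```

Because a compare-exchange never moves a lower-half value into PA1, the count of misplaced values cannot increase from one round to the next in this model; how quickly it falls depends on the connection patterns, which is where the expansion property discussed in the paper enters.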

10 March 1991 / Vol. 30, No. 8 / APPLIED OPTICS 927