Performing the Shuffle with the PM2I and Illiac SIMD Interconnection Networks

Robert R. Seban and Howard Jay Siegel
Purdue University, School of Electrical Engineering, West Lafayette, Indiana 47907

Abstract: Three SIMD single stage interconnection networks which have been proposed and studied in the literature are the Illiac, PM2I, and Shuffle-Exchange. Here the ability of the Illiac and PM2I networks to perform the shuffle interconnection in an SIMD machine with N processors is examined. A lower bound of $3\sqrt{N}/2$ transfers for the Illiac to shuffle data is derived. An algorithm to do this task in $2\sqrt{N} - 1$ transfers is given. A lower bound of $\log_2 N$ transfers for the PM2I to shuffle data has been published previously. An algorithm to do this task in $(\log_2 N) + 1$ transfers is presented here.

1. Introduction

This paper extends SIMD interconnection network studies presented in [28, 31]. In particular, the ability of the PM2I and Illiac single stage interconnection SIMD machine networks to perform the shuffle interconnection is examined. In [28] it is shown that a lower bound on the number of transfers needed for the PM2I network to perform the shuffle is $\log_2 N$, where N is the number of processing elements in the SIMD machine. The algorithm presented here requires only $(\log_2 N) + 1$ transfers. This algorithm is used as the basis for an algorithm to do the shuffle with the Illiac network in $2\sqrt{N} - 1$ transfers. This compares favorably with an earlier result of $4(\sqrt{N} - 1)$ in [25]. In addition, a lower bound of $3\sqrt{N}/2$ on the number of transfers required for the Illiac to do the shuffle is proved.

The model of SIMD machines used is described in Section 2. In Section 3 the interconnection networks are formally defined. An algorithm to shuffle data using the PM2I network is given in Section 4. The lower bound analysis and algorithm for performing the shuffle with the Illiac network are presented in Section 5.

Typically, an SIMD (single instruction stream - multiple data stream) machine [12] is a computer system consisting of a control unit, N processors, N memory modules, and an interconnection network. The control unit broadcasts instructions to the processors, and all active processors execute the same instruction at the same time. Each active processor executes the instruction on data in its own memory module. The interconnection network, sometimes referred to as an alignment or permutation network, provides for communications among the processors and memory modules. Examples of SIMD machines that have been constructed are the Illiac IV [6] and STARAN [2, 3].

One way to view the physical structure of an SIMD machine is as a set of N processing elements interconnected by a network, where each processing element (PE) consists of a processor with its own memory. This type of configuration is shown in Fig. 1. It is called the PE-to-PE organization. The network is unidirectional and connects each PE to some subset of the other PEs. A transfer instruction causes data to be moved from each PE to one of the PEs to which the PE is connected by the network. (Here only one-to-one communications will be considered, i.e., broadcasting (one-to-many) connections are not considered.) To move data between two processing elements that are not directly connected, the data must be passed through intermediary processing elements by executing a programmed sequence of data transfers. An alternative to the PE-to-PE SIMD machine organization is to position a bidirectional network between the processors and the memories. The PE-to-PE paradigm will be used here; however, the results presented will be applicable to the other organization also.

Fig. 1: PE-to-PE SIMD machine configuration, with N PEs.

This material is based upon work supported by the National Science Foundation under Grant ECS-8120896.

2. SIMD Machine Model

The formal model of an SIMD machine used here consists of five parts: processing elements, control unit instructions, processing element instructions, masking schemes, and interconnection functions. It is a mathematical model that provides a common basis for evaluating and comparing the various components of different SIMD machines. This model is based on the one presented in [31].

Each processing element (PE) is a processor together with its own memory. There are N PEs, addressed (numbered) from 0 to N−1, where $N = 2^m$. It is assumed that the processor contains a fast access general purpose register A and a data transfer register (DTR). When data transfers among PEs occur, it is the DTR contents of each PE that are transferred. At any point in time, each PE is either in the active or the inactive mode. If a PE is active, it executes the instructions broadcast to it by the control unit. If a PE is inactive, it will not execute the instructions broadcast to it.

The control unit stores the SIMD programs, executes control of flow instructions, and broadcasts processing element instructions to the PEs. An example of a control of flow instruction is the loop statement "for i = 0 until N−1 do ...".

The processing element instructions consist of those operations that each processor can perform on data in its individual memory or registers. It is assumed the set of processing element instructions includes the capability to move data among the registers. The notation "Z ← Y" means the contents of register Y are copied into register Z. The notation "Z ↔ Y" means two registers exchange their contents.

A masking scheme is a method for determining which PEs will be active at a given point in time. The PE address masking scheme uses an m-position mask to specify which PEs are to be activated, each position of the mask corresponding to a bit position in the binary addresses of the PEs [28]. Each position of the mask will contain either a 0, 1, or X ("don't care"). The only PEs that will be active are those that match the mask for all i, $0 \le i < m$: if the mask has a 0 in the i-th position, then the PE address must have a 0 in the i-th position; if the mask has a 1 in the i-th position, then the PE address must have a 1 in the i-th position; and if the mask has an X in the i-th position, then the PE address may have either a 0 or 1 in the i-th position. For example, if N = 8 and the mask is 1X0, then only PEs 6 and 4 are active. Superscripts are used as repetition factors, e.g., $X^3 0 1^2$ is XXX011. Square brackets will be used to denote a mask. Each PE instruction and interconnection function (defined below) will be accompanied by a mask specifying which PEs will execute that command. For example, executing "A ← DTR $[X^{m-1}0]$" means that each even numbered PE is active and loads its A register from its DTR. Each odd numbered PE is inactive and does nothing. Further information about the use and implementation of PE address masks is in [18, 28, 31, 34].

An interconnection network can be described by a set of interconnection functions, where each interconnection function is a bijection (permutation) on the set of PE addresses [28]. When an interconnection function f is applied, PE i sends the contents of its DTR to the DTR of PE f(i). This occurs for all i simultaneously, for $0 \le i < N$ and PE i active. Saying that an interconnection function is a bijection means that every PE sends data to exactly one PE, and every PE receives data from exactly one PE (assuming all PEs are active). In this model, it is assumed that an inactive PE can receive data from another PE if an interconnection function is executed, but an inactive PE cannot send data. To pass data from one PE to another PE a programmed sequence of one or more interconnection functions must be executed. This sequence of functions moves the data from one PE's DTR to the other's by a single transfer or by passing the data through intermediary PEs.

In summary, an SIMD machine can be formally represented as the five-tuple (N, C, I, M, F), where:
(1) N is a positive integer, representing the number of PEs in the machine;
(2) C is the set of control unit instructions, i.e., instructions that are executed by the control unit in order to control the flow of the program;
(3) I is the set of processing element instructions, i.e., instructions that can be executed by each active PE and act on data within that PE;
(4) M is the set of masking schemes, where each mask partitions the set {0, 1, ..., N−1} into two disjoint sets, the enabled PEs and the disabled PEs; and
(5) F is the set of interconnection functions (i.e., the interconnection network), where each function is a bijection on the set {0, 1, ..., N−1}, which determines the communication links among the PEs.

A particular SIMD machine architecture can be described by specifying N, C, I, M, and F. In this paper, $N = 2^m$; C includes "for ... until ... do" instructions for controlling the flow of loops in the program; I includes instructions for moving data among the registers of a given PE; M includes PE address masks; and F is varied. The assumptions made about the SIMD machine to be used as the model are intentionally minimal so that the material presented is applicable to a wide range of machines.

3. The Interconnection Networks

A. Introduction

In this paper, three networks which can be constructed from a single stage of switches are examined. In a single stage network, data items may have to be passed through the switches several times before reaching their final destinations. Conceptually, a single stage network can be viewed as N input selectors and N output selectors, as shown in Fig. 2 [30]. The way in which the input selectors are connected to the output selectors determines the allowable interconnections.

Fig. 2: Conceptual view of a single-stage network. "IS" is input selector, "OS" is output selector.

The following notation will be used: let $N = 2^m$, and let the binary representation of an arbitrary PE address P be $p_{m-1}p_{m-2} \ldots p_1p_0$.

B. The Illiac Network

The Illiac network consists of four interconnection functions defined as follows:

Illiac$_{+1}(P) = P + 1$
Illiac$_{-1}(P) = P - 1$
Illiac$_{+n}(P) = P + n$
Illiac$_{-n}(P) = P - n$

where the arithmetic is mod N and $n = \sqrt{N}$.
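The masking and transfer rules of Section 2, together with the four Illiac functions, can be illustrated with a small simulation. This is only a sketch: the function names (`matches`, `active_pes`, `transfer`, `illiac`) and the convention of writing a mask as a fully expanded string such as "1X0" (most significant position first, no repetition factors) are assumptions made for the example, not notation from the paper.

```python
from math import isqrt

def matches(address, mask):
    """True if the PE address matches the mask, e.g. mask '1X0' (MSB first)."""
    bits = format(address, f"0{len(mask)}b")
    return all(c == "X" or c == b for c, b in zip(mask, bits))

def active_pes(mask):
    """Addresses of all PEs enabled by an m-position mask (N = 2**m)."""
    return [p for p in range(2 ** len(mask)) if matches(p, mask)]

def transfer(dtr, f, mask):
    """Apply interconnection function f once.

    Every active PE i sends its DTR contents to the DTR of PE f(i).
    Inactive PEs do not send, but they may receive.
    """
    new = list(dtr)
    for i in active_pes(mask):
        new[f(i)] = dtr[i]
    return new

def illiac(N):
    """The four Illiac functions for N PEs, with n = sqrt(N), arithmetic mod N."""
    n = isqrt(N)
    return {
        "+1": lambda p: (p + 1) % N,
        "-1": lambda p: (p - 1) % N,
        "+n": lambda p: (p + n) % N,
        "-n": lambda p: (p - n) % N,
    }

# N = 8, mask 1X0: only PEs 4 and 6 are active.
print(active_pes("1X0"))
# Odd PEs send their DTR one position up; even PEs are inactive but receive.
print(transfer(list(range(8)), lambda p: (p + 1) % 8, "XX1"))
```

Because inactive PEs keep their DTR unless they receive, the second call leaves the odd PEs unchanged while each even PE now holds the value of its lower odd neighbor.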

This network forms a four nearest neighbor connection pattern, as shown for N = 16 in Fig. 3. This network was implemented in the Illiac IV SIMD machine, where N = 64 [1, 6].

Fig. 3: Illiac network for N = 16. (The actual Illiac IV SIMD machine had N = 64.) Vertical lines are $+\sqrt{N}$ and $-\sqrt{N}$. Horizontal lines are +1 and −1.

Relating this to the conceptual model of a single stage network shown in Fig. 2, for each i, $0 \le i < N$, input selector i has lines to output selectors i+1, i−1, i+n, and i−n, mod N. For each j, $0 \le j < N$, output selector j gets its inputs from input selectors j−1, j+1, j−n, and j+n, mod N. Since there is a single instruction stream in an SIMD machine, all active PEs must use the same interconnection function (connection) at the same time. For example, if PE 0 is sending data to PE 1, then all active PEs must send data using the Illiac$_{+1}$ connection.

This type of network is included in the MPP [4, 5] and DAP [16] SIMD systems. Various properties and capabilities of the Illiac network are discussed in [6, 13, 25, 28, 31, 32].

C. The Plus-Minus $2^i$ (PM2I) Network

The Plus-Minus $2^i$ (PM2I) network consists of 2m interconnection functions defined by:

PM2$_{+i}(P) = P + 2^i \bmod N$
PM2$_{-i}(P) = P - 2^i \bmod N$

for $0 \le i < m$. For example, PM2$_{+1}(2) = 4$ if N > 4. Since $P + 2^{m-1} = P - 2^{m-1}$, mod N, for all P, $0 \le P < N$, the interconnection functions PM2$_{+(m-1)}$ and PM2$_{-(m-1)}$ are equivalent. As with the Illiac network, all active PEs must use the same PM2I interconnection function at the same time.

Fig. 4: PM2I network for N = 8. (a) PM2$_{+0}$ connections, (b) PM2$_{+1}$ connections, (c) PM2$_{+2}$ connections. For the PM2$_{-i}$ connections, $0 \le i \le 2$, reverse the direction of the arrows.

A network similar to the PM2I is used in the "Novel Multiprocessor Array" [24] and is included in the network of the Omen computer [15]. The concept underlying the SIMDA machine's interconnection network is similar to that of the PM2I [36]. The PM2I connection pattern forms the basis for the data manipulator [10], ADM [33], and gamma [26] multistage networks. Various properties of the PM2I are discussed in [11, 27, 28, 29, 31, 32].

D. The Shuffle-Exchange Network

The Shuffle-Exchange network consists of a shuffle function and an exchange function. The shuffle is defined by:

shuffle$(p_{m-1}p_{m-2} \ldots p_1p_0) = p_{m-2}p_{m-3} \ldots p_1p_0p_{m-1}$

and the exchange is defined by:

exchange$(p_{m-1}p_{m-2} \ldots p_1p_0) = p_{m-1}p_{m-2} \ldots p_1\bar{p}_0$

where $\bar{p}_0$ denotes the complement of $p_0$. For example, shuffle(3) = 6 and exchange(6) = 7, for $N \ge 8$. This network is shown in Fig. 5 for N = 8.

Consider the conceptual model of single stage networks shown in Fig. 2. For the Shuffle-Exchange single stage network, input selector $P = p_{m-1} \ldots p_1p_0$ is connected to output selectors $p_{m-2} \ldots p_1p_0p_{m-1}$ (= shuffle(P)) and $p_{m-1} \ldots p_1\bar{p}_0$ (= exchange(P)). Output selector $R = r_{m-1} \ldots r_1r_0$ gets its inputs from input selectors $r_0r_{m-1} \ldots r_2r_1$ and $r_{m-1} \ldots r_1\bar{r}_0$. As with the other networks, all active PEs must use the same interconnection function at the same time.

Mathematical properties of the shuffle are discussed [...]
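As a quick check on the PM2I and Shuffle-Exchange definitions of Sections 3.C and 3.D, the functions can be written directly on integer PE addresses. The sketch below assumes an m-bit address held in a Python integer; the names `pm2`, `shuffle`, and `exchange` and their argument order are illustrative choices.

```python
def pm2(p, i, m, sign=+1):
    """PM2_{+i} (sign=+1) or PM2_{-i} (sign=-1) on an m-bit PE address."""
    return (p + sign * (1 << i)) % (1 << m)

def shuffle(p, m):
    """Rotate the m-bit address left by one position."""
    msb = (p >> (m - 1)) & 1
    return ((p << 1) & ((1 << m) - 1)) | msb

def exchange(p):
    """Complement the low-order address bit."""
    return p ^ 1

m = 3
N = 1 << m
print([shuffle(p, m) for p in range(N)])   # shuffle permutation for N = 8
print(pm2(2, 1, m), shuffle(3, m), exchange(6))

# PM2_{+(m-1)} and PM2_{-(m-1)} coincide because 2**(m-1) == -2**(m-1) mod N.
print(all(pm2(p, m - 1, m, +1) == pm2(p, m - 1, m, -1) for p in range(N)))
```

The shuffle can also be read arithmetically as multiplication by two with the high-order bit carried around, i.e. shuffle(P) = 2P + $p_{m-1}$ mod N, which is the form used in the correctness argument of Section 4.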

Features of the Shuffle-Exchange are discussed in [7, 8, 11, 13, 19, 20, 22, 23, 27, 28, 31, 32, 35, 37]. (The ability of each of the PM2I and Illiac networks to perform the exchange function in just two transfers was presented in [31] and is not considered here.)

4. Shuffling with the PM2I Network

The following ground rules will be used in the design and analysis of the algorithm to perform the shuffle with the PM2I network.
(1) The model and definitions presented in Sections 2 and 3 will be the formal basis for the results.
(2) When simulating the shuffle, the data that is originally in the DTR of PE P must be transferred to the DTR of PE shuffle(P).
(3) The time for each algorithm is in terms of the number of executions of interconnection functions required to perform the simulation.

The reason for (3) can be seen by considering the way in which various instructions can be implemented. The instructions in the simulation algorithms can be divided into three categories: control unit operations (in C), register to register operations (in I), and interprocessor data transfers (in F). Control unit operations, such as incrementing a count register in the control unit for a "for loop," can, in general, be done in parallel (overlapped) with the previously broadcast PE instruction, thus taking no additional time. Register to register operations within a PE will probably involve a single chip or, at worst, adjacent chips. The inter-PE data transfers will involve setting the controls of the interconnection network and passing data among the PEs, involving board to board, and probably rack to rack, distances. Thus, unless the number of register to register operations is much greater than the number of inter-PE data transfers, the time for the interprocessor transfers will be the dominating factor in determining the execution time of the simulation algorithm.

In the algorithm below ":" indicates a comment. When discussing the algorithms, "Li" is used as an abbreviation for "line i of the algorithm."

To understand the concept underlying the algorithm to perform the shuffle, consider the "distance" the shuffle moves a data item. The data item in the DTR of PE P, [...]

Algorithm to perform the shuffle

This algorithm uses m + 1 inter-PE data transfers and m + 1 register to register moves. The operation of this algorithm for N = 8 is shown in Tab. 1. For example, consider the data item initially in the DTR of PE 5 (= 101). PE 5 does not match the mask in L1 ([XX0]). PE 5 does match the mask in L2 ([XX1]) and the data is moved to PE PM2$_{+0}(5) = 6$ (= 110). PE 6 does match the mask in L4 when j = 1 ([X10]) and the data is moved to the A register of PE 6. The data is unaffected by L5 when j = 1 (since it is not in the DTR). PE 6 does match the mask in L4 when j = 2 ([1X0]) and the data is moved to the DTR of PE 6. PE 6 does match the mask in L5 when j = 2 ([XX0]) and the data is moved to the DTR of PE PM2$_{+2}(6) = 2$. PE 2 does match the mask in L6 ([XX0]) and the data is moved to the DTR of PE PM2$_{+0}(2) = 3$. PE 3 does not match the mask in L7 ([XX0]). Thus, the data from PE 5 is moved to PE 3 = shuffle(5). This is shown by the dotted line in Tab. 1.

To prove the algorithm is correct, induction will be used (assume all arithmetic is mod N). The induction hypothesis (proven correct below) is that after executing PM2$_{+j}$ in L2 (for j = 0) or L5 (for $1 \le j < m$) the data originally in the DTR of PE $G = g_{m-1} \ldots g_1g_0$ will currently be in PE

$P = p_{m-1} \ldots p_1p_0 = (g_{m-1} \ldots g_{j+2}g_{j+1}) \cdot 2^{j+1} + (g_j \ldots g_1g_0) \cdot 2$.

(When j = 0, $P = (g_{m-1} \ldots g_2g_1) \cdot 2 + (g_0) \cdot 2$.) The data will be in the A register if $g_j = 0$ and in the DTR if $g_j = 1$.

Thus, when j = m−1, the data originally from PE G is in PE $(g_{m-1} \ldots g_1g_0) \cdot 2$. The data item from the DTR of PE $(g_{m-1} \ldots g_1g_0) \cdot 2$ is moved to PE $(g_{m-1} \ldots g_1g_0) \cdot 2 + 1$ by L6; this is correct since this data item is from a PE where $g_j = g_{m-1} = 1$, so shuffle(G) = 2·G + 1. The data item from the A register of PE $(g_{m-1} \ldots g_1g_0) \cdot 2$ is moved to the DTR of that PE by L7; this is correct since this data item is from a PE where $g_j = g_{m-1} = 0$, so shuffle(G) = 2·G.

To complete the correctness proof it must be shown that the induction hypothesis is true.

Basis: j = 0.

Fig. 6: The idea underlying the algorithm for the PM2I to perform the shuffle, shown for N=8.

Tab. 1: Example of the algorithm for performing the shuffle using the PM2I when N = 8. It is assumed that initially the DTR of PE P contains the integer P.
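The algorithm listing and the body of Tab. 1 are not reproduced in this text, so the following simulation is a reconstruction rather than the authors' listing. The seven lines in the comments are inferred from the worked example for PE 5 (masks [XX0], [XX1], [X10], [1X0]), from the stated count of m + 1 transfers and m + 1 register moves, and from the induction hypothesis; the general masks $[X^{m-1}0]$, $[X^{m-1}1]$, and $[X^{m-j-1}1X^{j-1}0]$ are assumptions that generalize those N = 8 masks. The name `shuffle_with_pm2i` is illustrative.

```python
def shuffle(p, m):
    """Reference shuffle: rotate the m-bit address left by one."""
    return ((p << 1) & ((1 << m) - 1)) | ((p >> (m - 1)) & 1)

def shuffle_with_pm2i(m):
    """Simulate the reconstructed PM2I shuffle algorithm on N = 2**m PEs.

    Returns the final DTR contents and the number of inter-PE transfers.
    """
    N = 1 << m
    dtr = list(range(N))          # DTR of PE P initially holds P, as in Tab. 1
    a = [None] * N                # A registers start empty
    transfers = 0
    even = [p for p in range(N) if p % 2 == 0]
    odd = [p for p in range(N) if p % 2 == 1]

    def pm2_plus(j, senders):
        # One execution of PM2_{+j}; only `senders` are active and send.
        nonlocal dtr, transfers
        new = list(dtr)
        for p in senders:
            new[(p + (1 << j)) % N] = dtr[p]
        dtr = new
        transfers += 1

    for p in even:                # L1: A <- DTR      [X^(m-1) 0]
        a[p] = dtr[p]
    pm2_plus(0, odd)              # L2: PM2_{+0}      [X^(m-1) 1]
    for j in range(1, m):         # L3: for j = 1 until m-1 do
        for p in even:            # L4: A <-> DTR     [X^(m-j-1) 1 X^(j-1) 0]
            if (p >> j) & 1:
                a[p], dtr[p] = dtr[p], a[p]
        pm2_plus(j, even)         # L5: PM2_{+j}      [X^(m-1) 0]
    pm2_plus(0, even)             # L6: PM2_{+0}      [X^(m-1) 0]
    for p in even:                # L7: DTR <- A      [X^(m-1) 0]
        dtr[p] = a[p]
    return dtr, transfers

final, count = shuffle_with_pm2i(3)
print(final, count)
```

For N = 8 the item that started in PE 5 ends in PE 3, matching the trace in the text, and the transfer counter equals m + 1. Checking every source address against the reference `shuffle` for a few values of m gives some confidence that the reconstruction is consistent with the induction argument, though it does not replace the proof.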

Case 1: The data item from the DTR of PE [...] not moved by L2. It remains in the A register and $g_0 = 0$. Thus, the induction hypothesis is true for j = 0 for this case.

Case 2: The data item from the DTR of PE [...] the DTR and $g_0 = 1$. Thus, the induction hypothesis is true for j = 0 for this case.

Induction Step: Assume true for j = k − 1 and show true for j = k.

Case 1: The data item from the DTR of PE [...]

Subcase 1a: $p_k = 1$. The A register data is moved to the DTR of PE P by L4 and then to the DTR of [...] Furthermore, the data is in the DTR and $g_k = 1$. Thus, the induction hypothesis is true for j = k for this subcase.

Subcase 1b: $p_k = 0$. The A register data is kept in the A register of PE P and not moved by L4 or L5 [...]

Case 2: The data item from the DTR of PE [...]

5. Shuffling with the Illiac Network

In this section the use of the Illiac network to perform the shuffle will be examined. First, it will be shown that a lower bound on the number of transfers (executions of Illiac interconnection functions) needed is 3n/2. Then, an algorithm requiring 2n − 1 transfers will be presented.

To show that a lower bound on the number of transfers is 3n/2, four of the N data moves which the shuffle performs will be considered. These are: [...] moves are done simultaneously when the shuffle interconnection function is executed. It will now be shown that the Illiac cannot do all four in less than 3n/2 transfers, i.e., at least 3n/2 transfers are needed. To [...]

In order to more easily visualize the data movements in the Illiac network, the "wrap-around" connections (e.g., 7 to 8, 56 to 0) have been "unwrapped" by drawing eight projections of the network, as shown in Fig. 7. The actual network is labeled "C" for center, and the eight projections are labeled NW (north west), N (north), NE (north east), W (west), E (east), SW (south west), S (south), and SE (south east). Thus, each PE is represented nine times: once in the original (center) network, and once in each projection.

For example, consider the data movement from PE 7 to PE 8 using the Illiac$_{+1}$ function. Normally, PE 7, which is in the rightmost column of the Illiac network, connects to PE 8, which is in the leftmost column, using a "wrap-around" connection. For purposes of this discussion, the data from PE 7 in C will be moved to C's PE 8 equivalent in the E projection.

In order to draw the projections, two constraints must be satisfied.
(1) Each projection has to be topologically isomorphic to the Illiac network.
(2) Each projection must have the proper adjacency to the C network and the other projections.
Proper adjacency means that two PEs, each from different projections, are drawn adjacent to one another if and only if they are connected in the original network. As an example of this, consider 7 in C, 63 in N, 0 in NE, and 8 in E.

One could continue generating more of these projections "ad infinitum" to represent all possible implementations of all possible moves. However, the goal here is to show that the set of moves (a) through (d) above cannot be done in less than 3n/2 steps. Therefore, projections which would involve more than 3n/2 steps to do any of (a) through (d) individually are not of interest and are unnecessary.

The lower bound proof is organized as follows. First it will be shown that there are only five sets of Illiac function executions that can perform both the 28 → 56 and 35 → 7 moves in less than 3n/2 steps (Fig. 7 and Tab. 2). Then it will be shown that there are only five sets of Illiac function executions (which happen to be different from the first five sets) that can perform both the 14 → 28 and 49 → 35 moves in less than 3n/2 steps (Fig. 8 and Tab. 3). Finally, it will be shown that no single set of less than 3n/2 Illiac function executions can perform all four moves (Tab. 4).

Fig. 7 shows all the paths from the source PE 28 in the C network to its associated destination PE 56 in the C network and in the eight projections. Also shown is the source PE 35 in the C network and its associated destination PE 7 in the C network and in the eight projections. There are only four ways to go from 28 to 56 in less than 3n/2 = 12 moves and these are shown at the top of Tab. 2. The four ways to go from 35 to 7 in less than 12 moves are shown on the side of Tab. 2. The four-tuple (w, x, y, z) means that the path consists of w Illiac$_{+8}$ executions (moves), x Illiac$_{+1}$ executions, y Illiac$_{-8}$ executions, and z Illiac$_{-1}$ executions. Note that for the purposes here the order of execution is irrelevant. For example, 28 in C can go to 56 in the NE projection by (0, 4, 5, 0), i.e., the path consists of four Illiac$_{+1}$ moves and five Illiac$_{-8}$ moves. Any path between 28 in C and 56 in NE must include these moves. This is true in general, i.e., if the path from PE A to PE B is given as (w, x, y, z) then (1) the moves specified by the four-tuple will send data from A to B, and (2) any path from A to B must include the moves specified by the four-tuple. In what follows {·} will denote the generalization of the path from n = 8 to any n.

Each square in Tab. 2 shows the set of moves needed to do both the 28 → 56 and 35 → 7 moves for all possible combinations of the individual moves which can be done in less than 12 {3n/2} steps. The five combinations which can be done in less than 12 {3n/2} steps are marked by a check (√). For example, the 28 → 56 path [...]

Tab. 2: All possible combinations of 28 → 56 and 35 → 7 paths that can be done individually in less than 3n/2 steps.

Tab. 3: All possible combinations of 14 → 28 and 49 → 35 paths that can be done individually in less than 3n/2 steps.
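The four-tuple bookkeeping can be reproduced with a short enumeration. The sketch below treats a path as a net number of ±n moves and a net number of ±1 moves whose combined displacement equals the required address difference mod N, and it lists only minimal tuples (no move is paired with its inverse). The function name `short_paths` and this modular formulation are assumptions used for illustration; the paper itself argues from the unwrapped projections of Fig. 7.

```python
def short_paths(src, dst, n=8):
    """Minimal four-tuples (w, x, y, z) taking src to dst in fewer than 3n/2 moves.

    w, x, y, z count Illiac_{+n}, Illiac_{+1}, Illiac_{-n}, Illiac_{-1} executions.
    """
    N = n * n
    limit = 3 * n // 2
    delta = (dst - src) % N
    paths = []
    for a in range(-limit, limit + 1):        # net number of +n moves
        for b in range(-limit, limit + 1):    # net number of +1 moves
            if abs(a) + abs(b) < limit and (n * a + b - delta) % N == 0:
                paths.append((max(a, 0), max(b, 0), max(-a, 0), max(-b, 0)))
    return sorted(paths)

print(short_paths(28, 56))
print(short_paths(35, 7))
```

For n = 8 each of the two moves has exactly four such tuples, and (0, 4, 5, 0) appears among the 28 → 56 paths, in agreement with the example in the text.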

The analysis for the 14 → 28 and 49 → 35 transfers shown in Fig. 8 and Tab. 3 is similar. The five sets of Illiac functions which can do both of these transfers in less than 12 moves are checked in Tab. 3.

The final step of the proof is to examine all combinations of the five sets found in each of Tabs. 2 and 3 to see if there exists any set of transfers which can perform all four transfers (28 → 56, 35 → 7, 14 → 28, and 49 → 35) in less than 12 moves. This is shown in Tab. 4.
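The combination step can also be checked by exhaustive search for n = 8. The sketch below assumes the counting model used in the text: the order of execution is ignored, and a set of executions (W, X, Y, Z) can perform a move if it contains some four-tuple for that move. Under that assumption it searches for the smallest total W + X + Y + Z that covers all four moves. The names `can_do` and `min_executions` are illustrative, and the search says nothing about how many sets attain the minimum.

```python
from itertools import product

def can_do(move, executed, n=8):
    """True if the executed set (W, X, Y, Z) contains a path for move (src, dst)."""
    src, dst = move
    W, X, Y, Z = executed
    N = n * n
    delta = (dst - src) % N
    # a = net number of +n moves used, b = net number of +1 moves used
    return any((n * a + b - delta) % N == 0
               for a in range(-Y, W + 1)
               for b in range(-Z, X + 1))

def min_executions(moves, n=8):
    """Smallest number of Illiac executions covering every move, with one such set."""
    for total in range(4 * n):
        for W, X, Y in product(range(total + 1), repeat=3):
            Z = total - W - X - Y
            if Z >= 0 and all(can_do(mv, (W, X, Y, Z), n) for mv in moves):
                return total, (W, X, Y, Z)
    return None

moves = [(28, 56), (35, 7), (14, 28), (49, 35)]
total, example = min_executions(moves)
print(total, example)
```

Under these assumptions the search finds no covering set with fewer than 12 executions for N = 64, which agrees with the 3n/2 bound for n = 8.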

Fig. 7: The source/destination relationship for the moves 28 → 56 and 35 → 7 in an "unwrapped" Illiac network. The circle denotes a destination which can be reached in less than 3n/2 steps.

Fig. 8: The source/destination relationship for the moves 14 → 28 and 49 → 35 in an "unwrapped" Illiac network. The circle denotes a destination which can be reached in less than 3n/2 steps.

Tab. 4: Combination of relevant paths from Tabs. 2 and 3.

As demonstrated in Tab. 4, there is no such set. There are seven sets which require exactly 12 moves (indicated by checks), but none which requires less than 12. For example, 28 → 56 and 35 → 7 can be done using (3, 4, 4, 0), and 14 → 28 and 49 → 35 can be done using (2, 2, 2, 2); however, the combination of these two sets yields (3, 4, 4, 2), which is greater than 12 moves.

In summary, four of the moves performed by shuffle (28 → 56, 35 → 7, 14 → 28, and 49 → 35) have been examined. It has been shown that no set of Illiac function executions can do this in less than 3n/2 = 12 moves. As indicated above, this argument can be generalized directly using the substitutions listed.

Consider an algorithm for performing the shuffle interconnection function with the Illiac network. This will be done by replacing each PM2I interconnection function in the above algorithm with Illiac interconnection functions. For L2, use "Illiac$_{+1}$ $[X^{m-1}1]$," since Illiac$_{+1}$ = PM2$_{+0}$. Similarly, for L6, use "Illiac$_{+1}$ $[X^{m-1}0]$." To do L5, first recall that only the even numbered PEs contain the data of concern (after L2 is executed and before L6 is executed). Therefore, it is acceptable to use "PM2$_{+j}$ $[X^m]$" in L5, since any data movement among the odd numbered PEs is ignored (and overwritten by L6). To perform "PM2$_{+j}$ $[X^m]$," for $1 \le j < m$, with the Illiac network the algorithms presented in [31] can be used. Specifically, to perform [...]

6. Conclusions

The ability of the PM2I and Illiac single stage interconnection SIMD machine networks to perform the shuffle interconnection was examined. In [28] it was shown that a lower bound on the number of transfers needed for the PM2I network to perform the shuffle is $\log_2 N$. The algorithm described here and proven correct required only $(\log_2 N) + 1$ transfers. This algorithm was used as the basis for an algorithm to do the shuffle with the Illiac network [...]

These results are of both theoretical and practical value. Theoretically, they add to the body of knowledge about the properties of the PM2I and Illiac networks. Practically, the algorithms presented could actually be used to perform the shuffle interconnection on a system that has implemented the PM2I or Illiac network. Furthermore, the lower bound proof shows that it is impossible to do the shuffle with the Illiac in any fewer than $3\sqrt{N}/2$ transfers.

Acknowledgements: Some of the figures and tables in this paper are from "Interconnection Networks for Large Scale Parallel Processing: Theory and Case Studies," by H. J. Siegel, to be published by D. C. Heath and Co.

References

[1] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, and R. A. Stokes, "The Illiac IV computer," IEEE Trans. Comput., Vol. C-17, Aug. 1968, pp. 746-757.
[2] K. E. Batcher, "STARAN parallel processor system hardware," AFIPS Conf. Proc. 1974 Nat'l. Computer Conf., May 1974, pp. 405-410.
[3] K. E. Batcher, "STARAN series E," 1977 Int'l. Conf. Parallel Processing, Aug. 1977, pp. 140-143.
[4] K. E. Batcher, "Design of a massively parallel processor," IEEE Trans. Comput., Vol. C-29, Sept. 1980, pp. 836-840.
[5] K. E. Batcher, "Bit-serial parallel processing systems," IEEE Trans. Comput., Vol. C-31, Mar. 1982, pp. 377-384.
[6] W. J. Bouknight, S. A. Denneberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, and D. L. Slotnick, "The Illiac IV system," Proc. of the IEEE, Vol. 60, Apr. 1972, pp. 369-388.
[7] P-Y. Chen, D. H. Lawrie, P-C. Yew, and D. A. Padua, "Interconnection networks using shuffles," Computer, Vol. 14, Dec. 1981, pp. 55-64.
[8] P-Y. Chen, P-C. Yew, and D. H. Lawrie, "Performance of packet switching in buffered single-stage shuffle-exchange networks," 3rd Int'l. Conf. Distributed Computer Systems, Oct. 1982, pp. 622-627.
[9] G. R. Couranz, M. S. Gerhardt, and C. J. Young, "Programmable RADAR signal processing using the RAP," 1974 Sagamore Computer Conf. Parallel Processing, Aug. 1974, pp. 37-52.
[10] T. Feng, "Data manipulating functions in parallel processors and their implementations," IEEE Trans. Comput., Vol. C-23, Mar. 1974, pp. 309-318.

[11] J. P. Fishburn and R. A. Finkel, "Quotient networks," IEEE Trans. Comput., Vol. C-31, Apr. 1982, pp. 288-295.
[12] M. J. Flynn, "Very high-speed computing systems," Proc. of the IEEE, Vol. 54, Dec. 1966, pp. 1901-1909.
[13] W. M. Gentleman, "Some complexity results for matrix computations on parallel processors," Journal of the ACM, Vol. 25, Jan. 1978, pp. 112-115.
[14] S. W. Golomb, "Permutations by cutting and shuffling," SIAM Review, Vol. 3, Oct. 1961, pp. 293-297.
[15] L. C. Higbie, "The Omen computer: associative array processor," IEEE Computer Society Compcon 72, Sept. 1972, pp. 287-290.
[16] D. J. Hunt, "The ICL DAP and its application to image processing," in Languages and Architectures for Image Processing, M. J. B. Duff and S. Levialdi, eds., Academic Press, London, England, 1981, pp. 275-282.
[17] P. B. Johnson, "Congruences and card shuffling," American Mathematical Monthly, Vol. 63, Dec. 1956, pp. 718-719.
[18] J. T. Kuehn, H. J. Siegel, and P. D. Hallenbeck, "Design and simulation of an MC68000-based multimicroprocessor system," 1982 Int'l. Conf. Parallel Processing, Aug. 1982, pp. 353-362.
[19] T. Lang, "Interconnections between processors and memory modules using the shuffle-exchange network," IEEE Trans. Comput., Vol. C-25, May 1976, pp. 496-503.
[20] T. Lang and H. S. Stone, "A shuffle-exchange network with simplified control," IEEE Trans. Comput., Vol. C-25, Jan. 1976, pp. 55-66.
[21] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE Trans. Comput., Vol. C-24, Dec. 1975, pp. 1145-1155.
[22] D. Nassimi and S. Sahni, "Data broadcasting in SIMD computers," IEEE Trans. Comput., Vol. C-30, Feb. 1981, pp. 101-107.
[23] D. Nassimi and S. Sahni, "Parallel permutation and sorting algorithms and a new generalized connection network," Journal of the ACM, Vol. 29, July 1982, pp. 642-667.
[24] Y. Okada, H. Tajima, and R. Mori, "A reconfigurable parallel processor with microprogram control," IEEE Micro, Vol. 2, Nov. 1982, pp. 48-60.
[25] S. E. Orcutt, "Implementation of permutation functions in Illiac IV-type computers," IEEE Trans. Comput., Vol. C-25, Sept. 1976, pp. 929-936.
[26] D. S. Parker and C. S. Raghavendra, "The gamma network: a multiprocessor interconnection network with redundant paths," 9th Annual Symp. Computer Architecture, Apr. 1982, pp. 73-80.
[27] D. K. Pradhan and K. L. Kodandapani, "A uniform representation of single- and multistage interconnection networks used in SIMD machines," IEEE Trans. Comput., Vol. C-29, Sept. 1980, pp. 777-791.
[28] H. J. Siegel, "Analysis techniques for SIMD machine interconnection networks and the effects of processor address masks," IEEE Trans. Comput., Vol. C-26, Feb. 1977, pp. 153-161.
[29] H. J. Siegel, "Partitionable SIMD computer system interconnection network universality," 16th Annual Allerton Conf. Communication, Control, and Computing, Univ. Ill., Oct. 1978, pp. 586-595.
[30] H. J. Siegel, "Interconnection networks for SIMD machines," Computer, Vol. 12, June 1979, pp. 57-65.
[31] H. J. Siegel, "A model of SIMD machines and a comparison of various interconnection networks," IEEE Trans. Comput., Vol. C-28, Dec. 1979, pp. 907-917.
[32] H. J. Siegel, "The theory underlying the partitioning of permutation networks," IEEE Trans. Comput., Vol. C-29, Sept. 1980, pp. 791-801.
[33] H. J. Siegel and R. J. McMillen, "Using the Augmented Data Manipulator network in PASM," Computer, Vol. 14, Feb. 1981, pp. 25-33.
[34] H. J. Siegel, L. J. Siegel, F. C. Kemmerer, P. T. Mueller, Jr., H. E. Smalley, Jr., and S. D. Smith, "PASM: a partitionable SIMD/MIMD system for image processing and pattern recognition," IEEE Trans. Comput., Vol. C-30, Dec. 1981, pp. 934-947.
[35] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. Comput., Vol. C-20, Feb. 1971, pp. 153-161.
[36] A. H. Wester, "Special features in SIMDA," 1972 Sagamore Computer Conf., Aug. 1972, pp. 29-40.
[37] C. Wu and T. Feng, "The universality of the shuffle-exchange network," IEEE Trans. Comput., Vol. C-30, May 1981, pp. 324-332.
