Performing the Shuffle with the PM2I and Illiac SIMD Interconnection Networks

Robert R. Seban
Howard Jay Siegel
Purdue University
School of Electrical Engineering
West Lafayette, Indiana 47907

Abstract: Three SIMD single stage interconnection networks which have been proposed and studied in the literature are the Illiac, PM2I, and Shuffle-Exchange. Here the ability of the Illiac and PM2I networks to perform the shuffle interconnection in an SIMD machine with N processors is examined. A lower bound of $3\sqrt{N}/2$ transfers for the Illiac to shuffle data is derived. An algorithm to do this task in $2\sqrt{N} - 1$ transfers is given. A lower bound of $\log_2 N$ transfers for the PM2I to shuffle data has been published previously. An algorithm to do this task in $\log_2 N + 1$ transfers is presented here.

This material is based upon work supported by the National Science Foundation under Grant ECS-8120896.

1. Introduction

This paper extends SIMD interconnection network studies presented in [28, 31]. In particular, the ability of the PM2I and Illiac single stage interconnection networks to perform the shuffle interconnection is examined. In [28] it is shown that a lower bound on the number of transfers needed for the PM2I network to perform the shuffle is $\log_2 N$, where N is the number of processing elements in the SIMD machine. The algorithm presented here requires only $(\log_2 N) + 1$ transfers. This algorithm is used as the basis for an algorithm to do the shuffle with the Illiac network in $(2\sqrt{N}) - 1$ transfers. This compares favorably with an earlier result of $4(\sqrt{N} - 1)$ in [25]. In addition, a lower bound of $3\sqrt{N}/2$ on the number of transfers required for the Illiac to do the shuffle is proved.

The model of SIMD machines used is described in Section 2. In Section 3 the interconnection networks are formally defined. An algorithm to shuffle data using the PM2I network is given in Section 4. The lower bound analysis and the algorithm for performing the shuffle with the Illiac network are presented in Section 5.

2. SIMD Machine Model

Typically, an SIMD (single instruction stream - multiple data stream) machine [12] is a computer system consisting of a control unit, N processors, N memory modules, and an interconnection network. The control unit broadcasts instructions to the processors, and all active processors execute the same instruction at the same time. Each active processor executes the instruction on data in its own memory module. The interconnection network, sometimes referred to as an alignment or permutation network, provides for communications among the processors and memory modules. Examples of SIMD machines that have been constructed are the Illiac IV [6] and STARAN [2, 3].

One way to view the physical structure of an SIMD machine is as a set of N processing elements interconnected by a network, where each processing element (PE) consists of a processor with its own memory. This type of configuration is shown in Fig. 1. It is called the PE-to-PE organization. The network is unidirectional and connects each PE to some subset of the other PEs. A transfer instruction causes data to be moved from each PE to one of the PEs to which the PE is connected by the network. (Here only one-to-one communications will be considered, i.e., broadcasting (one-to-many) connections are not considered.) To move data between two processing elements that are not directly connected, the data must be passed through intermediary processing elements by executing a programmed sequence of data transfers. An alternative to the PE-to-PE SIMD machine organization is to position a bidirectional network between the processors and the memories. The PE-to-PE paradigm will be used here; however, the results presented will be applicable to the other organization also.

Fig. 1: PE-to-PE SIMD machine configuration, with N PEs.

The formal model of an SIMD machine used here consists of five parts: processing elements, control unit instructions, processing element instructions, masking schemes, and interconnection functions. It is a mathematical model that provides a common basis for evaluating and comparing the various components of different SIMD machines. This model is based on the one presented in [31].

Each processing element (PE) is a processor together with its own memory. There are N PEs, addressed (numbered) from 0 to $N-1$, where $N = 2^m$. It is assumed that the processor contains a fast access general purpose register A and a data transfer register (DTR). When data transfers among PEs occur, it is the DTR contents of each PE that are transferred. At any point in time, each PE is either in the active or the inactive mode. If a PE is active, it executes the instructions broadcast to it by the control unit. If a PE is inactive, it will not execute the instructions broadcast to it.

The control unit stores the SIMD programs, executes control of flow instructions, and broadcasts processing element instructions to the PEs. An example of a control of flow instruction is the loop statement "for i = 0 until N-1 do ...".

The processing element instructions consist of those operations that each processor can perform on data in its individual memory or registers. It is assumed that the set of processing element instructions includes the capability to move data among the registers. The notation "Z <- Y" means the contents of register Y are copied into register Z. The notation "Z <-> Y" means the two registers exchange their contents.

A masking scheme is a method for determining which PEs will be active at a given point in time. The PE address masking scheme uses an m-position mask to specify which PEs are to be activated, each position of the mask corresponding to a bit position in the binary addresses of the PEs [28]. Each position of the mask will contain either a 0, 1, or X ("don't care"). The only PEs that will be active are those that match the mask for all i, $0 \le i < m$: if the mask has a 0 in the i-th position, then the PE address must have a 0 in the i-th position; if the mask has a 1 in the i-th position, then the PE address must have a 1 in the i-th position; and if the mask has an X in the i-th position, then the PE address may have either a 0 or 1 in the i-th position. For example, if N = 8 and the mask is 1X0, then only PEs 6 and 4 are active. Superscripts are used as repetition factors, e.g., $X^3 0 1^2$ is XXX011. Square brackets will be used to denote a mask. Each PE instruction and interconnection function (defined below) will be accompanied by a mask specifying which PEs will execute that command. For example, executing "A <- DTR $[X^{m-1}0]$" means that each even numbered PE is active and loads its A register from its DTR. Each odd numbered PE is inactive and does nothing. Further information about the use and implementation of PE address masks is in [18, 28, 31, 34].

An interconnection network can be described by a set of interconnection functions, where each interconnection function is a bijection (permutation) on the set of PE addresses [28]. When an interconnection function f is applied, PE i sends the contents of its DTR to the DTR of PE f(i). This occurs for all i simultaneously, for $0 \le i < N$ and PE i active. Saying that an interconnection function is a bijection means that every PE sends data to exactly one PE, and every PE receives data from exactly one PE (assuming all PEs are active). In this model, it is assumed that an inactive PE can receive data from another PE if an interconnection function is executed, but an inactive PE cannot send data. To pass data from one PE to another PE, a programmed sequence of one or more interconnection functions must be executed. This sequence of functions moves the data from one PE's DTR to the other's by a single transfer or by passing the data through intermediary PEs.

[...] interconnection network), where each function is a bijection on the set $\{0, 1, \ldots, N-1\}$, which determines the communication links among the PEs.

A particular SIMD machine architecture can be described by specifying N, C, I, M, and F. In this paper, $N = 2^m$; C includes "for ... until ... do" instructions for controlling the flow of loops in the program; I includes instructions for moving data among the registers of a given PE; M includes PE address masks; and F is varied. The assumptions made about the SIMD machine to be used as the model are intentionally minimal so that the material presented is applicable to a wide range of machines.

3. The Interconnection Networks

A. Introduction

In this paper, three networks which can be constructed from a single stage of switches are examined. In a single stage network, data items may have to be passed through the switches several times before reaching their final destinations. Conceptually, a single stage network can be viewed as N input selectors and N output selectors, as shown in Fig. 2 [30]. The way in which the input selectors are connected to the output selectors determines the allowable interconnections.

The following notation will be used: let $N = 2^m$, let the binary representation of an arbitrary PE address

B. The Illiac Network

The Illiac network consists of four interconnection functions defined as follows:
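To make the masking and transfer model of Section 2 concrete, the following is a minimal Python sketch, not part of the original paper. It shows how a PE address mask selects the active PEs and how one masked application of an interconnection function moves DTR contents. The names expand_mask, active_pes, and transfer are hypothetical, and the function f(i) = (i + 1) mod N is only an illustrative bijection chosen for the example. The sketch also assumes that the leftmost mask symbol corresponds to the most significant address bit, which agrees with the 1X0 example in the text.

```python
from typing import Callable, List, Sequence, Tuple


def expand_mask(parts: Sequence[Tuple[str, int]]) -> str:
    """Expand repetition factors, e.g. X^3 0 1^2 -> 'XXX011'."""
    return "".join(symbol * count for symbol, count in parts)


def active_pes(mask: str, m: int) -> List[int]:
    """Return the addresses of the PEs that match an m-position mask.

    Assumption: the leftmost mask symbol lines up with the most
    significant bit of the m-bit PE address.
    """
    if len(mask) != m or set(mask) - set("01X"):
        raise ValueError("mask must have m positions drawn from 0, 1, X")
    active = []
    for addr in range(2 ** m):
        bits = format(addr, f"0{m}b")  # zero-padded binary address
        if all(s == "X" or s == b for s, b in zip(mask, bits)):
            active.append(addr)
    return active


def transfer(dtr: Sequence[int], f: Callable[[int], int], mask: str) -> List[int]:
    """Apply interconnection function f once under the given mask.

    Every active PE i sends its DTR to the DTR of PE f(i), all at the
    same time. Inactive PEs do not send, but they may receive.
    """
    n = len(dtr)
    m = n.bit_length() - 1
    if 2 ** m != n:
        raise ValueError("the number of PEs must be a power of two")
    new_dtr = list(dtr)  # PEs that receive nothing keep their old DTR
    for i in active_pes(mask, m):
        new_dtr[f(i)] = dtr[i]  # read old values: transfers are simultaneous
    return new_dtr


if __name__ == "__main__":
    print(expand_mask([("X", 3), ("0", 1), ("1", 2)]))  # XXX011
    print(active_pes("1X0", 3))                         # [4, 6]
    # Illustrative bijection only: f(i) = (i + 1) mod N with N = 8.
    dtr = [0, 10, 20, 30, 40, 50, 60, 70]
    print(transfer(dtr, lambda i: (i + 1) % 8, "XX0"))
    # [0, 0, 20, 20, 40, 40, 60, 60]
```

With the mask XX0 only the even numbered PEs send, so each odd numbered PE (inactive) receives the DTR of its lower neighbor, while each even numbered PE keeps its own value because no active PE sends to it.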
