Proceedings of the Fourteenth International Conference on Machine Learning, ed. D. Fisher, Morgan Kaufmann, pp. 305-312, 1997

Predicting Multiprocessor Memory Access Patterns with Learning Models

M. F. Sakr (1,2), S. P. Levitan (2), D. M. Chiarulli (3), B. G. Horne (1), C. L. Giles (1,4)

(1) NEC Research Institute, Princeton, NJ 08540, [email protected]
(2) EE Department, University of Pittsburgh, Pittsburgh, PA 15261
(3) CS Department, University of Pittsburgh, Pittsburgh, PA 15260
(4) UMIACS, University of Maryland, College Park, MD 20742

Abstract

Machine learning techniques are applicable to computer system optimization. We show that multiprocessors can successfully utilize machine learning algorithms for memory access pattern prediction. In particular, three different on-line machine learning prediction techniques were tested to learn and predict repetitive memory access patterns for three typical parallel processing applications, the 2-D relaxation algorithm, matrix multiply and Fast Fourier Transform, on a shared memory multiprocessor. The predictions were then used by a routing control algorithm to reduce control latency in the interconnection network by configuring the interconnection network to provide needed memory access paths before they were requested. Three trainable prediction techniques were used and tested: (1) a Markov predictor, (2) a linear predictor, and (3) a time delay neural network (TDNN) predictor. Different predictors performed best on different applications, but the TDNN produced uniformly good results.

1 INTRODUCTION

Large scale multiprocessor systems require low-cost, highly-scalable, and dynamically reconfigurable interconnection networks (INs) (Siegel, 90). Such INs offer a limited number of communication channels that are configured on demand to satisfy required processor-memory accesses. In this demand driven environment, a processor accessing a memory module makes a request to an IN controller to establish a path (reconfigure the IN) that satisfies the processor's request. The controller is used to optimize the required IN configuration based on the set of current processor requests. Hence, the end-to-end latency incurred by such INs can be characterized by three components (Figure 1): control time, the time needed to determine the new IN configuration and to physically establish the paths in the IN; launch time, the time to transmit the data into the IN; and fly time, the time needed for the message to travel through the IN to its final destination. Launch time can be reduced by using high bandwidth opto-electronic INs, and fly time is relatively insignificant in such an environment since the end-to-end distances are relatively short. Therefore, control time dominates the communication latency.

Figure 1: The three components of the end-to-end communication latency: control time (Lc), launch time (Ll), and fly time (Lf). Control time dominates overall communication latency.

However, in a multiprocessor system executing a parallel scientific application, the memory-access requests made by the processors follow a repetitive pattern based on the application. A compiler can analyze an application and attempt to predict its access patterns (Gornish, 90), but often the pattern is dynamic and thus hard to predict. The goal of this work is to employ a technique that learns these patterns on-line, predicts the processor requests, and performs the IN configuration prior to the requests being issued, thus hiding the control latency. The effect is a significant reduction in the communication latency for multiprocessor systems.

Learning methods have been applied in various areas of computing and communication systems. For instance, neural networks have been applied to learn both network topology and traffic patterns for routing and control of communication networks (Fritsch, 91), (Jensen, 90), (Thomopoulos, 91). Using neurocomputing in high speed communication networks was the subject of a special issue of the IEEE Communications Magazine (Habib, 95). Also, using a neural network as a static branch prediction technique was recently presented by (Calder, 1997). Other applications of neural networks are the control of switching elements of a multistage interconnection network for parallel computers (Funabiki, 93), (Giles, 95) and learning the structure of interconnection networks (Goudreau, 95). For multicomputer systems, genetic algorithms have been applied as a distributed task scheduling technique (Wang, 95). Solutions to the problem of mapping parallel programs onto multicomputer systems to provide load balancing and minimize interprocessor communication have been proposed using genetic algorithms (Seredynski, 94) and self organizing maps (Dormans, 95), as well as variants of the Growing Cell Structures network (Tumuluri, 96). In uniprocessor environments, Stigal et al. (Stigal, 91) propose a neural network cache replacement algorithm. Their technique predicts which cache block will be accessed furthest in the future and therefore should be replaced, thus lowering the cache miss rate. In general, the literature on machine learning in computing and communication systems has focused on how these techniques can be used to identify patterns of communication in order to optimize the control of these systems.

The focus of this work is to study how three on-line learning methods perform at predicting processor-memory access patterns in a multiprocessor environment. We use a Markov predictor, a linear predictor and a time-delay neural network (TDNN) (Lang, 90) to learn and predict the memory access patterns of three parallelized scientific applications: a 2-D relaxation algorithm, a matrix multiply, and a 1-D FFT. The next section presents the environment of our experiment, where we describe a shared memory multiprocessor model employing prediction units. In section 3 we describe the three prediction methods used, and in section 4 we present experimental results of the predictors. The final section interprets our results and discusses future directions of research.

2 MULTIPROCESSOR MODELS

Shared memory parallel computers are commonly referred to as multiprocessor systems (Bell, 85), (Kumar, 94). Our shared memory multiprocessor (SMM) system consists of 8 processors (P0-P7), 32 memory modules (M0-M31), a reconfigurable IN and an IN controller (Figure 2). This SMM model uses a state-sequence router (Chiarulli, 94) as the reconfigurable interconnection network controller. In addition, we use an SMM simulator which allows us to record the memory access traces of parallel applications.

Figure 2: An 8✕32 shared memory multiprocessor system employing the SSR paradigm as the IN controller and one on-line Prediction Unit (PU) per processor.

In such systems with N processors and K memory modules, the reconfigurable IN can be configured to achieve any of the N✕K possible paths between a processor and a memory module; however, it can only provide a subset of these paths at any given time. A group of compatible (non-blocking) paths is called an IN configuration or a state. Because of contention for paths, the IN must be dynamically reconfigured to satisfy the set of current processor-memory accesses. This SMM model employs an IN control system based on the state sequence routing (SSR) paradigm (Chiarulli, 94), which takes advantage of the locality characteristics exhibited in memory access patterns (Johnson, 92) and reconfigures the network through a fixed set of configurations in a repetitive manner. The IN controller, used for state sequence routing, consists of a state generator which is controlled by a state transformer.
The state generator maintains a collection of configurations, called a state sequence, and periodically reconfigures the IN with a new configuration from the set. Specifically, the state sequence is maintained in a cyclic shift register of length k as shown in Figure 2. With each register shift, an IN configuration is broadcast to the processors, memory modules, and switching elements of the IN. The state sequence router exploits the memory access locality inherent in these patterns by re-using the sequence of states. The state transformer is responsible for determining the set of configurations contained within the state generator based on processor requests. A processor that needs to access a memory module issues a fault (or request) to the state transformer only if the current state sequence does not already include the required path to the memory module. In response, the state transformer adds the required path to the state sequence by removing the least recently used path.

Using SSR, the average control latency, Lc, incurred by each processor's access can be shown to be:

$$L_c = (1 - p)\frac{k}{2} + p(k + f) \qquad (1)$$

where p is the probability of a fault, k is the sequence length, and f is the fault service time. If a processor needs a path and it exists in the state sequence, there is no fault issued and the latency is just the time for the path to come around in the sequence, which on average is k/2. However, if the path does not exist after k broadcasts, the processor issues a fault which must be serviced before the memory access can occur. The SSR based IN controller needs only to establish the initial paths and respond to changes in the memory access pattern; it is not required to respond to individual memory access requests.
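To make the SSR cost model concrete, here is a minimal Python sketch (our illustration, not the original implementation; the names are ours, and each configuration is simplified to a single processor-memory path) of a length-k state sequence with least-recently-used replacement, together with the expected control latency of Equation 1:

from collections import deque

def expected_control_latency(p, k, f):
    # Equation 1: Lc = (1 - p) * k/2 + p * (k + f)
    # e.g. p = 0.1, k = 6, f = 100 gives 0.9*3 + 0.1*106 = 13.3 broadcasts
    return (1.0 - p) * (k / 2.0) + p * (k + f)

class StateSequenceRouter:
    """Toy SSR controller: a cyclic register of k IN configurations,
    each simplified here to a single processor-memory path."""

    def __init__(self, k):
        self.sequence = deque(maxlen=k)  # oldest entry drops off when full
        self.faults = 0

    def access(self, path):
        """Service one access; return True if it raises a network fault."""
        if path in self.sequence:
            self.sequence.remove(path)   # refresh recency (LRU bookkeeping)
            self.sequence.append(path)
            return False
        self.faults += 1                 # fault: the state transformer adds the
        self.sequence.append(path)       # path, evicting the least recently used
        return True

A prediction hint would be consumed the same way: inserting the predicted path before the processor actually issues the request avoids the fault entirely.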
Our goal is to employ a technique that reduces the probability of a fault by predicting changes in memory access patterns and informing the controller of a needed transformation before a fault occurs. Thus, the controller will transform the state sequence to include the soon-to-be-needed path, avoiding the latency incurred by the fault. As shown in Figure 2, a prediction unit (PU) is used to learn the access pattern of each processor. The predictions made by the PU are used as hints by the SSR while routing the memory accesses. Since processor-memory access patterns change dynamically and thus can be modeled as a time series, for this preliminary investigation we chose to study three simple on-line time series prediction methods: a Markov predictor, a linear predictor and a TDNN.

3 PREDICTION METHOD EXPERIMENT

To evaluate the performance of various prediction methods, we test how well each technique can predict the next memory access pattern as the SMM executes three typical parallelized scientific applications. The first application is a parallel (32✕32) 2-D grid-based temperature propagation/relaxation algorithm; the second is a repetitive (24✕12 · 12✕24) matrix multiply program; the third is the memory access pattern generated from a repetitive 1-D Fast Fourier Transform (FFT) of a 16 sample vector.

Each experiment consists of three distinct phases. First, using the shared memory multiprocessor (SMM) simulator, we generate the memory accesses of a parallel program assuming fixed latency in the IN and memory modules. Using the raw memory accesses generated by the SMM simulator, we extract the sequence of memory accesses of a single processor; this sequence is represented differently depending on the predictor used. For each experiment we use the 32 memory module access pattern of a single processor; these patterns are shown in Figures 3a, 4a, and 5a. The applications are symmetrically partitioned to execute the same code on all processors while each processor uses different parts of the data; hence, the access patterns of all other processors are very similar to the one used. Second, we use the processor's memory access patterns as input to the PU to perform on-line training and one-step-ahead prediction of the next memory access. Third, we evaluate the predictions by simulating the multiprocessor behavior with and without the predictions and monitoring the number of faults incurred. For each of the experiments we use a relatively short state sequence length (k). As can be seen from Equation 1, the optimum sequence length k is a trade off between increasing k to reduce faults and keeping k small to reduce waiting time. The values of k were chosen to minimize the faults for these applications in the non-predictive case. We tested using the best 1, 2, 3 and 4 predictions of the PUs as hints to the SSR controller. The three prediction methods tested are considered appropriate for this dynamic system since training and prediction are performed on-line.

3.1 MARKOV PREDICTOR

There are many ways one could consider using a Markov predictor (Isaacson, 76). We consider both a first and a second order predictor. The first order predictor calculates the conditional probability of accessing memory module Mi given that processor Pk has just accessed memory module Mj, i.e. p(Mi|Mj;Pk). Similarly, for the second order we calculate p(Mi|Mj,Mq;Pk), where the conditional probability is conditioned on processor Pk previously accessing memory module Mq, then Mj. Since in this model we use one PU per processor, the input of the Markov prediction unit is the temporal sequence depicting the memory access pattern of a processor. The probabilities are stored in a probability transition matrix. For the first order predictor, probability pij corresponds to the probability of accessing memory module i if the processor is currently accessing memory module j. Similarly for the second order predictor, pi(jq) corresponds to the probability of accessing memory module i if the processor is currently accessing memory module j after completing an access to memory module q. Each entry in the transition matrix is updated and normalized on-line as the application execution proceeds. For example, in the first order Markov predictor of processor P0, the probability of P0 going from M1 to M2 at time step t is calculated as the number of transitions P0 has performed from M1 to M2 divided by the total number of times P0 has accessed M1 from time 0 to time t. The number of parameters needed for the first order Markov predictor is 1024 probabilities, while the number for the second order Markov is 32K probabilities. However, both the first and second order predictors update 32 probabilities on-line with every access, since the next access could go to any one of the 32 memory modules.
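As an illustration, the following sketch (ours; only the counting scheme comes from the text) shows the first order predictor's on-line update and top-m hint selection. A second order version would index the count table by the pair (Mq, Mj) instead of by Mj alone:

import numpy as np

class FirstOrderMarkovPU:
    """First order Markov prediction unit for one processor."""

    def __init__(self, n_modules=32):
        # counts[j, i] = number of observed transitions Mj -> Mi
        self.counts = np.zeros((n_modules, n_modules))
        self.prev = None                 # last memory module accessed

    def observe(self, module):
        """On-line update: record the transition prev -> module."""
        if self.prev is not None:
            self.counts[self.prev, module] += 1
        self.prev = module

    def predict(self, m=3):
        """Return up to m most probable next modules as hints for the SSR."""
        if self.prev is None:
            return []
        row = self.counts[self.prev]
        if row.sum() == 0:
            return []
        probs = row / row.sum()          # normalized on demand
        ranked = np.argsort(probs)[::-1][:m]
        return [int(i) for i in ranked if probs[i] > 0]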
At any given time, the non-zero probabilities are the predictions given to the state sequence router. However, the number of non-zero probabilities could be up to 32; therefore, a fixed number of the most likely predictions is specified. We tested the first and second order Markov predictors using the highest 1, 2, 3, and 4 probabilities as predictions. Also, we tested the system using state sequence lengths (k) of size 4, 5, 6 and 7. The results for each access pattern are discussed in the results section, and the best results are depicted in Figures 3c, 4c, and 5c. The performance of the Markov predictor is compared to that of the Linear and TDNN PUs in Section 4.

3.2 LINEAR PREDICTOR

For the Linear PU, the input data is transformed from a processor's raw 32 memory module traces into a sequence of 32-bit binary vectors. The ith component of the binary vector is set to 1 when an access to the ith memory module takes place; all other values in that vector are set to zero.

For each value in the binary symbol vector we use a next step linear predictor which attempts to predict the next access based on a linear combination of all the values in the vector and their history. Since there are 32 memory modules (a 1✕32 access vector) in the system tested, we use 32 linear predictors that predict the next access vector in parallel. In order to compare the results of this predictor with that of the TDNN we use one bias weight for each output value; hence the Linear predictor is actually an affine predictor (Hecht, 91):

$$\hat{x}_i(t+1) = \sum_{k=0}^{l} \sum_{j=1}^{32} w_{ijk}\, x_j(t-k) + w_{i0}, \qquad i = 1, 2, \ldots, 32;\ l = 1, 5, 10 \qquad (2)$$

where x is a binary vector of dimension 32, x_i denotes the ith component and x̂ is the prediction. Since we are implementing one-step-ahead prediction, the Linear predictor takes as input the current binary vector and the past l history vectors and attempts to predict the vector x at the next time step (Equation 2). We tested the performance of the Linear predictor using l = 1, 5, and 10 past vectors. Therefore, the number of inputs for the three Linear predictors tested is 64, 192 or 352 (32 ✕ (l + 1)), and the number of coefficients (weights) to update at each time step is 2080, 6176 and 11296 respectively. The learning algorithm is a simple on-line gradient descent algorithm using the following adaptive learning rate, with starting value set to 0.01:

if ((present error - previous error) > previous error ✕ 10%) {
    reduce the learning rate by a decrease factor of 0.5
    and move back in the weight space to the previous point
} else {
    keep the updated weights and increase the learning rate
    by an increase factor of 1.1
}

The algorithm is performed on-line, so we make only one pass through the data. Furthermore, among the outputs (predictions) with values > 0.5, the largest values are selected as the predictions which are passed along to the state sequence router as hints. We tested using 1, 2, 3 or 4 predictions as hints to the SSR controller. The best results of the Linear predictor are depicted in Figures 3d, 4d, and 5d.
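A minimal sketch of the Linear PU follows (our illustration; the class and variable names are ours). It implements Equation 2 as a single weight matrix over the concatenated history window and applies the adaptive learning rate rule above:

import numpy as np

class LinearPU:
    """Affine one-step-ahead predictor (Equation 2) for 32-bit access vectors."""

    def __init__(self, n=32, l=1, lr=0.01):
        self.n = n
        self.W = np.zeros((n, n * (l + 1) + 1))  # n x (n*(l+1) + bias) weights
        self.lr = lr
        self.z = np.zeros(n * (l + 1))           # current vector plus l past vectors
        self.prev_error = None

    def step(self, x_next):
        """Predict from the stored window, then train on the observed vector."""
        z1 = np.append(self.z, 1.0)              # constant input for the bias weight
        pred = self.W @ z1
        error = float(np.sum((x_next - pred) ** 2))
        w_old = self.W.copy()
        self.W += self.lr * np.outer(x_next - pred, z1)  # on-line gradient step
        # Adaptive learning rate rule from the text:
        if self.prev_error is not None and error - self.prev_error > 0.1 * self.prev_error:
            self.lr *= 0.5        # decrease factor of 0.5
            self.W = w_old        # move back to the previous point in weight space
        else:
            self.lr *= 1.1        # increase factor of 1.1
        self.prev_error = error
        self.z = np.roll(self.z, self.n)          # shift the tapped delay line
        self.z[:self.n] = x_next
        return [int(i) for i in np.argsort(pred)[::-1] if pred[i] > 0.5]  # hints

For l = 1 this gives the 2080 coefficients quoted above (32 ✕ (32 ✕ 2 + 1)).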
3.3 TIME DELAY NEURAL NETWORK

The data encoding of the memory accesses for the TDNN is the same as that of the Linear predictor. Again, since we are implementing one-step-ahead prediction, the TDNN takes as input the current binary vector and attempts to predict the access vector at the next time step, as in (Sakr, 96). Therefore there are 32 inputs and 32 outputs for the network. For each input we experiment with a tapped delay line of length 1, 5 or 10, so the total number of inputs to the multilayer perceptron (MLP) section of the TDNN is 64, 192 or 352, derived from 32 ✕ (1 input + # taps). We tested the performance of the TDNN using a single hidden layer of 10, 20 and 30 neurons; every output node has an additional bias weight. This gives 1002, 2282 and 3882 total weights for the TDNN with 10 nodes in the hidden layer; 1972, 4532 and 7732 total weights with 20 nodes in the hidden layer; and 2942, 6782 and 11582 total weights with 30 nodes in the hidden layer. Nodes in the hidden layer use a hyperbolic tangent activation function, while nodes in the output layer are affine. All of the weights were initialized uniformly in the range [-1/φ, 1/φ], where φ is the number of connections that enter a node (fan-in). The learning algorithm is a simple on-line gradient descent algorithm using the same adaptive learning rate used for the Linear predictor. Since the training and prediction is performed on-line, we make only one pass through the data. Many prediction interpretations could be used; we found that the best performance was achieved when the output neurons with the largest values are selected as predictions. We tested using 1, 2, 3 or 4 output neurons with the highest values as prediction hints to the SSR controller. The best performance of the TDNN is shown in Figures 3e, 4e, and 5e.
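A corresponding sketch of the TDNN PU (again ours; for brevity it uses a fixed learning rate in place of the adaptive rule) with a tanh hidden layer, affine outputs, and fan-in-scaled initialization:

import numpy as np

class TDNNPU:
    """TDNN prediction unit: a tapped delay line feeding a one-hidden-layer MLP."""

    def __init__(self, n=32, taps=1, hidden=10, lr=0.01):
        d = n * (taps + 1)                      # inputs to the MLP section
        # Uniform init in [-1/fan_in, 1/fan_in]; the +1 columns are bias weights.
        self.W1 = np.random.uniform(-1.0 / d, 1.0 / d, (hidden, d + 1))
        self.W2 = np.random.uniform(-1.0 / hidden, 1.0 / hidden, (n, hidden + 1))
        self.z = np.zeros(d)                    # tapped delay line
        self.n, self.lr = n, lr

    def step(self, x_next, n_hints=2):
        """Predict the next access vector, then take one on-line gradient step."""
        z1 = np.append(self.z, 1.0)
        h = np.tanh(self.W1 @ z1)               # hidden layer: hyperbolic tangent
        y = self.W2 @ np.append(h, 1.0)         # output layer: affine
        e = y - x_next
        dh = (self.W2[:, :-1].T @ e) * (1.0 - h ** 2)   # backprop through tanh
        self.W2 -= self.lr * np.outer(e, np.append(h, 1.0))
        self.W1 -= self.lr * np.outer(dh, z1)
        self.z = np.roll(self.z, self.n)        # shift in the observed vector
        self.z[:self.n] = x_next
        return [int(i) for i in np.argsort(y)[::-1][:n_hints]]  # largest outputs

With taps = 1 and hidden = 10 this reproduces the 1002-weight count above (10 ✕ 65 + 32 ✕ 11).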
4 RESULTS

In this section we discuss the performance of the three prediction units tested for the three applications implemented on our SMM model. In order to compare the performance of the prediction units, for each application we plot the memory access pattern followed by fault plots. First we show the characteristic access pattern of each of the applications in Figures 3a, 4a, and 5a, then the network faults incurred for the non-predictive case (Figures 3b, 4b, and 5b), followed by the network faults incurred by the system using the PUs (Figures 3c-e, 4c-e, and 5c-e). In this paper we report on the best results of each of the predictors for each application, across the space of the system and predictor parameters tested. For the complete results see (Sakr, 96b).

4.1 2-D RELAXATION

Figure 3a shows 8697 access vectors which depict the access behavior of the 2-D relaxation algorithm; the large discontinuity in the pattern is a no-memory-access period which is a characteristic of the algorithm. The access patterns exhibit a stair-like behavior, where each stair discontinuity reflects a change in the memory module accessed. For this access pattern the first order Markov predictor performed the best of the three prediction units tested. The second order Markov predictor shows improved performance over the first order only for the 1 prediction case. In general, increasing the number of predictions used by the state sequence router enhanced performance, while increasing the size of the state sequence (k) does not for this particular application. For the Linear predictor, increasing the history or the number of predictions used as hints does not enhance performance; on the other hand, increasing k helps increase the total number of network faults eliminated. We tested many TDNN configurations; the performance of the TDNN in predicting this pattern relied heavily on the number of nodes in the hidden layer. Increasing the history used (tapped-delay line) does not improve performance as much as increasing the size of the hidden layer. Using a large k is also crucial in fault elimination for this pattern.

Figure 3b plots the network faults incurred as impulses for the non-predictive case. The other fault plots show the best performance of the on-line predictors for the 2-D relaxation algorithm. The first order Markov predictor using 3 predictions and a k of size 3 eliminated 96% of the network faults. It performs best for this pattern since the total number of non-zero probabilities is small (three), and using the top three probabilities is enough to predict almost perfectly and eliminate all faults (Figure 3c). The best performance of the Linear predictor was achieved by using 1 past access vector, 2 predictions and a k of size 6, eliminating 95% of all faults. Compared to the Markov predictor, the Linear predictor needs a few more training iterations before its predictions start to greatly reduce the number of faults (Figure 3d). The TDNN with a hidden layer of 30 nodes, a tapped delay line of size 2 and a k of size 7 produced its best result, eliminating 71% of the network faults, shown in Figure 3e.

Figure 3: (a) The memory access pattern of the 2-D relaxation algorithm for Processor P0. (b) The number of network faults incurred without predictions. (c) Number of network faults using the Markov PU. (d) Number of network faults using the Linear PU. (e) Number of network faults using the TDNN PU.

4.2 MATRIX MULTIPLY

The matrix multiply application exhibits a more complex pattern, since each processor accesses the memory modules in a less uniform fashion than in the 2-D relaxation algorithm. Figure 4a shows the 11561 vector access pattern. Since this application exhibits a complex access pattern, the first and second order Markov predictors cannot capture and predict the access pattern correctly; increasing the number of predictions or k does not enhance overall performance. The performance of the Linear predictor is similar to that of the Markov predictor for this application. Varying the history, or k, or the number of predictions does not improve performance. On the other hand, the TDNN produces marginally better results. The TDNN performs best with a hidden layer of 10 nodes, using 2 predictions and a k of size 7, achieving 30% fault reduction. However, increasing the size of the hidden layer hinders the performance of the TDNN. Figures 4b-e show the best performance of all three prediction units for this access pattern. In this case, the Markov predictor performs poorly since the probabilities are of equal values, which increases the number of wrong predictions. Specifically, the second order Markov predictor using 4 predictions and a k of size 5 eliminated 6% of the faults. The best performance of the Linear predictor, eliminating 1% of the faults, was achieved using 10 past access vectors, 1 prediction and a k of size 6. The linear predictor is not able to find a linear combination of the past accesses to predict the next access well. However, the TDNN still achieves a moderate reduction in the number of faults.

Figure 4: (a) The memory access pattern of the matrix multiply for Processor P0. (b) The number of network faults incurred without predictions. (c) Number of network faults using the Markov PU. (d) Number of network faults using the Linear PU. (e) Number of network faults using the TDNN PU.

4.3 FAST FOURIER TRANSFORM

The FFT application produces the memory access pattern shown in Figure 5a. The Markov predictor is capable of capturing and predicting this pattern, thus eliminating almost all the faults incurred by the state sequence router. The second order Markov captures the pattern of access, thus producing better results compared to the first order Markov. Increasing the number of predictions used is essential for good performance, while increasing k does not affect the percentage of fault elimination for either the first or second order Markov predictor. The Linear predictor produces its best results when using the least history, the least number of predictions and a k of size 6. Increasing the history used or the number of predictions does not boost performance. The TDNN is capable of learning and predicting the access pattern of the FFT. A hidden layer consisting of 10 nodes provides better performance than the TDNN with a larger hidden layer, and either 2 or 3 predictions with a k of size 6 or 7 are needed for good performance. Figures 5b-e show the best results of the PUs for this access pattern. Since the FFT algorithm exhibits a simple access pattern, the second order Markov predictor is able to capture the pattern and predict well, eliminating 95% of the faults using the top four predictions and a k of size 4. The Linear predictor using 1 past access vector, 1 prediction and a k of size 6 is capable of eliminating some, 34%, of the network faults. The TDNN with 10 nodes in the hidden layer, a tapped delay line of size 10, using the top 2 predictions and a k of size 6 eliminated 45% of the faults. The TDNN performs better than the Linear predictor for this pattern but not better than the Markov.

Figure 5: (a) The memory access pattern of the FFT for Processor P5. (b) The number of network faults incurred without predictions. (c) Number of network faults using the Markov PU. (d) Number of network faults using the Linear PU. (e) Number of network faults using the TDNN PU.

5 SUMMARY AND CONCLUSIONS

Completely connected interconnection networks (INs) are not feasible in large scale multiprocessor systems because of their high complexity and soaring cost. Accordingly, we utilize less expensive reconfigurable interconnection networks that scale well but suffer from high overhead due to control latency. Control latency is the time delay incurred by the network controller to determine a new desired IN configuration and to physically establish the paths in the network. Each reconfiguration request (network fault) is triggered when the current network configuration fails to satisfy a processor's memory access, and these requests are serviced on a demand driven basis. However, memory access patterns of multiprocessor systems executing parallel scientific applications exhibit a lot of repetitiveness due to loops, which are a characteristic of such applications. Hence, in this work we study how three learning methods perform at learning and predicting these access patterns on-line. Correct prediction of the access patterns allows anticipatory reconfiguration of the IN, thereby satisfying the forthcoming memory accesses and preventing network faults. Thus, the average control latency L is hidden and consequently overall communication latency is reduced. The three on-line prediction methods tested are a first and a second order Markov predictor, a linear prediction method, and a time delay neural network (TDNN). We train the prediction methods using the access patterns of three parallel scientific applications: a 2-D relaxation algorithm, a matrix multiply, and a Fast Fourier Transform (FFT). The multiprocessor model used is an 8 processor, 32 memory module shared memory system with a state sequence router as the reconfigurable interconnection controller.

The experiments show that coupling state sequence routing with different types of on-line prediction methods can decrease the number of memory access faults across different applications, with some methods being more effective than others. The best results of the prediction methods for the access patterns tested are as follows (Table 1). For the 2-D relaxation algorithm: the first order Markov predictor eliminates 96% of the faults; the Linear predictor prevents 95% of the faults; and the TDNN removes 71% of the faults. For the matrix multiply: the second order Markov predictor eliminates 6% of the faults; the linear predictor removes 1% of the faults; and the TDNN prevents 30% of the network faults from taking place. Finally, using the access patterns of the FFT: the second order Markov predictor prevents 95% of the network faults; the linear predictor eliminates 34% of the faults; and the TDNN removes 45% of the network faults. As expected, different predictors perform best on different applications. From visual inspection of the access patterns (shown in Figures 3a, 4a, and 5a), one could say that the access patterns of different applications vary in complexity, from the 2-D relaxation being the most simple to the matrix multiply the most complex. All prediction methods perform well on the applications with simple access patterns. On the other hand, for very complex patterns the Markov and Linear prediction methods perform very poorly while the TDNN gives uniformly good results.

Table 1: Percentage of faults eliminated for the three best learning techniques tested using the access patterns of three parallel applications

                   Markov Predictor   Linear Predictor   TDNN Predictor
2-D Relaxation           96%                95%               71%
Matrix Multiply           6%                 1%               30%
FFT                      95%                34%               45%

Given the environment, different applications exhibit very different patterns, and a technique that will predict well across patterns is more appealing than a technique that performs best for specific patterns. Thus, we hypothesize that the TDNN has the best chance of adapting to different memory access patterns from the variety of real applications. However, it could be feasible to use all prediction methods in a mixture of experts model (Jordan, 94) and use the best predictor available.
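One simple form such a combination could take (a sketch of ours, not something evaluated here, and a winner-take-all gate rather than the trained gating network of (Jordan, 94)) is to forward the hints of whichever PU has faulted least over a recent window:

def best_expert_hints(pus, recent_faults):
    """Select hints from the prediction unit with the fewest recent faults.

    pus           -- prediction units, each assumed to expose a predict() method
    recent_faults -- fault count over a sliding window, one entry per PU
    """
    best = min(range(len(pus)), key=lambda i: recent_faults[i])
    return pus[best].predict()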
Future work should address more realistic simulation of the multiprocessor environment, such as the effects of incorporating runtime delays due to memory and network contention in the memory access patterns, and how these prediction methods affect actual performance. We plan to test the performance of different machine learning techniques and other prediction methods. Also, it would be interesting to investigate the applicability of prediction techniques to the general problem of latency hiding at all levels of the memory hierarchy. Another open question is how these prediction methods can be efficiently implemented in hardware and their results effectively used. For example, how and what is the effect of memory fault prediction on the actual speedup of applications on a multiprocessor?
