The Ohio State University Department of Biomedical Informatics 3190 Graves Hall 333 W. 10th Avenue Columbus, OH 43210

Technical Report

OSUBMI-TR-2012-n01

A Comparison of the Cray XMT and XMT-2

Shahid H. Bokhari and Saniyah S. Bokhari

January 27, 2012

Shahid H. Bokhari1,∗,† and Saniyah S. Bokhari2,‡

1 Department of Biomedical Informatics, The Ohio State University, Columbus, Ohio
2 Department of Computer Science & Engineering, The Ohio State University, Columbus, Ohio

SUMMARY We explore the comparative performance of the Cray XMT and XMT-2 massively multithreaded supercomputers. We use benchmarks to evaluate memory accesses for various types of loops. We also compare the performance of these machines on matrix multiply and on three previously implemented dynamic programming algorithms. It is shown that the relative performance of these machines is dependent on the size (number of processors) of the configuration, as well as the size of the problem being evaluated. In particular, small configurations of the original XMT can sometimes show slightly better performance than larger configurations of the XMT-2, for the same problem size. We note that, under heavy memory load, performance of loops can saturate well before the maximum number of processors available. This suggests that it may not always be useful to use the maximum number of processors for a specific run. We also show that manual restructuring of nested loops, including decreasing the parallelism, can result in major improvements in performance. The results in this paper indicate that careful exploration of the space of problem sizes, number of processors used, and choices of loop parallelization can yield substantial improvements in performance. These improvements can be very significant for production codes that run for extended periods of time.

KEYWORDS: Cray XMT, Cray XMT-2, Matrix Multiply, Dynamic Programming, Multithreading, Parallel Algorithms, Parallel Computing, Reassortment, Sequence Alignment, Subset-sum Problem.

1. Introduction

The Cray XMT-2 is the latest version of a line of massively multithreaded supercomputers that includes the Tera [1, 2], MTA [21], MTA-2 [3, 9] and the XMT [4]. Currently several instances of the XMT-2 have been installed, notably at the Swiss National Supercomputing Centre (CSCS, www.cscs.ch), and at the Center for Applied High Performance Computing (CAHPC) at Noblis (www.noblis.org). The architecture of the basic machine is described in [1], and expositions of the

architecture and its performance appear in [4, 5, 6, 8, 9, 10, 11, 13]. It is of great interest to compare the performance of successive instantiations of the original Tera architecture; a prior example appears in [4]. Such a comparison illustrates the level of success attained by the hardware design in implementing the idealized architecture.

In the present paper we compare the performance of the XMT and XMT-2 machines. We first evaluate the basic memory access times and then proceed to analyze the performance of loops with varying numbers of memory references. The latter analysis is particularly important because the rate of memory accesses can often have a major impact on the performance of a specific code. This is illustrated in our analysis of three previously reported dynamic programming codes. For two of these three codes, the performance of the machines is almost identical. For the third code, which has a set of nested loops that access memory with high intensity, the XMT-2 provides significantly better performance. This demonstrates that the XMT-2 is capable of providing operands to processors at a much higher rate than the XMT. We also note that some of the differences between the XMT and XMT-2 are paradoxical (i.e., small configurations of the XMT are slightly faster, in some cases, than the XMT-2) and cannot be explained in terms of network size.

∗Correspondence to: Shahid H. Bokhari, Department of Biomedical Informatics, 3190 Graves Hall, 333 W. 10th Avenue, Columbus, OH 43210. †E-mail: [email protected] ‡E-mail: [email protected]

1.1. The Three Target Machines

The three machines used in our analysis are:

1. ‘Egret,’ a 16 processor, 128GB XMT at Cray Inc.,
2. ‘Cougar,’ a 128 processor, 1TB XMT at the Center for Adaptive Supercomputing Software – Multithreaded Architectures (CASS-MT) (cass-mt.pnl.gov) at Pacific Northwest National Laboratory (PNNL) (www.pnl.gov), and
3. ‘Matterhorn,’ a 64 processor, 2TB XMT-2 at the Swiss National Supercomputing Centre (CSCS) (www.cscs.ch).

2. Memory Access

The performance of the three target machines on a simple loop that executes a memory copy operation (of the form for(i=0; i<n; i++) y[i] = x[i];) is shown in Figs. 1, 2 and 3.
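As a concrete illustration only (this is not the authors' benchmark code; the function name, array types and the use of the assert parallel pragma are our assumptions), a copy kernel of this kind could be written as follows and timed over n elements to obtain cycles per element:

    /* Sketch of a parallel memory-copy kernel of the kind measured in
       Figs. 1-3.  Names and types are illustrative assumptions; the pragma
       is the same one that appears in the listings later in this paper.  */
    void copy_bench(unsigned *y, unsigned *x, long n)
    {
        long i;
    #pragma mta assert parallel        /* let the XMT compiler parallelize the loop */
        for (i = 0; i < n; i++)
            y[i] = x[i];               /* one read + one write per element */
    }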

[Figure 1: left panel, cycles per element vs. number of elements (2^12 to 2^32) for 1-16 processors, with the 1-processor level marked at 7.68 cycles per element; right panel, cycles per element vs. processors.]

Figure 1. Measured memory access times in clock cycles per element for Egret XMT.

[Figure 2: left panel, cycles per element vs. number of elements for 1-64 processors, with the 1-processor level marked at 9.86 cycles per element; right panel, cycles per element vs. processors.]

Figure 2. Measured memory access times in clock cycles per element for Cougar XMT.

[Figure 3: left panel, cycles per element vs. number of elements for 1-64 processors, with the 1-processor level marked at 7.68 cycles per element; right panel, cycles per element vs. processors.]

Figure 3. Measured memory access times in clock cycles per element for Matterhorn XMT-2.

Comparing Figs. 1 and 2, which show the performance of XMT configurations of different sizes, we could conclude that the significant difference between the two is attributable to network latency (i.e. the greater number of stages that a datum must traverse in the larger machine). However this explanation is unsatisfactory as the cycles per element should, for very large data arrays, be independent of the number of stages of the interconnection, assuming identical machine clocks.

Turning to Fig. 3, we note that the memory copy time for one processor of the XMT-2 matches almost exactly the corresponding time on the older 16 processor Egret XMT. Fig. 4 compares these machines in detail, and shows that the XMT-2 is very slightly faster. It also has better scaling, which we attribute to its larger network.

The dashed horizontal lines in the left-hand panels of Figs. 1, 2 and 3 indicate the expected asymptotic access times for 2, 4, . . . processors assuming perfect scaling. As described in [4], the right-hand panels of Figs. 1, 2 and 3 indicate the achieved speedup for various problem sizes, with the dashed sloping line indicating the ideal. We can see that good scaling is obtained in each case up to half the total number of processors, for large data array sizes.
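As a worked example, with the measured single-processor asymptote of roughly 7.68 cycles per element on Egret, perfect scaling would predict about 7.68/16 ≈ 0.48 cycles per element when all 16 processors are used; these predictions are the dashed horizontal lines in the figures.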

3. Varying numbers of Memory Accesses

To investigate further the performance of loops with varying access patterns, we carried out a set of experiments in which 1, 2, . . . , 8 memory accesses took place in each iteration of a loop. In this case a loop of the form for(i=0; i<n; i++) sum += x[i]; would correspond to 1 memory access†, while a loop of the form for(i=0; i<n; i++) y[i] = x[i]; would correspond to 2 memory accesses, and so on.

[Figure 4 plots cycles per element against the number of elements (2^12 to 2^32) for 1-64 processors, for Egret and Matterhorn.]

Figure 4. Measured memory access times in clock cycles per element: comparison of Egret XMT and Matterhorn XMT-2.

†The accumulation of sum in this case is carried out using a reduction operation resulting in a total of n + k log n operations, where n is the problem size and k a constant. The additional k log n component is negligible for all but very small n. This is evident in Fig. 5 (a), where the cycles per element for n = 2^8 are much higher than in Fig. 5 (b).
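The kernels with more memory references per iteration follow the same pattern. Purely as an illustration (the array names and the particular arithmetic are our assumptions, not the loops actually timed), a loop body with 4 memory references per iteration could take a form such as:

    /* Illustrative only: three reads (x, a, b) and one write (y) give
       4 memory references per iteration. */
    for (i = 0; i < n; i++)
        y[i] = x[i] + a[i] + b[i];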

[Figure 5 consists of three surface plots of cycles per element against problem size (2^8 to 2^22) and processors (1 to 128) for Egret XMT (16 proc), Cougar XMT (128 proc) and Matterhorn XMT-2 (64 proc): (a) 1 memory op/loop, (b) 2 memory ops/loop, (c) 4 memory ops/loop.]

Figure 5. Surfaces showing the comparative performance of Egret, Cougar and Matterhorn as the number of memory operations in a loop is varied.

Fig. 5 shows the measured cycles per element over the entire space of problem size × processors. Note that in this and subsequent 3D plots the Egret surface extends only to 16 processors, and the Matterhorn surface stops at 64 processors. It is clear that each machine outperforms the other two over some parts of the problem space. At the processors=1 plane, the performance of all three machines is almost identical‡.

Figs. 6(a) & 6(b) depict slices through the surfaces of Fig. 5 along the planes for processors = 1 & 8. In these slices the plots labeled “1” represent the slice from Fig. 5(a) (1 memory operation), the plots labeled “2” represent the slice from Fig. 5(b) (2 memory operations) and the plots labeled “4” represent the slice from Fig. 5(c) (4 memory operations). For 1 processor (Fig. 6(a)), it is interesting that the performance of Egret XMT is slightly better than Matterhorn XMT-2 for small problem sizes, for 1, 2 & 4 memory operations per loop. For large problem sizes, Egret & Matterhorn have identical performance while Cougar XMT is clearly poorer. Turning to the slice at 8 processors, we see that Matterhorn is slightly better than the other two machines at large problem sizes. For small problem sizes, Egret (a 16 processor machine) starts exhibiting some sort of breakdown or saturation, while the other two machines converge in performance.

Figs. 6(c) & 6(d) show slices through the surfaces of Fig. 5 along the planes for problem sizes 2^16 and 2^22. We see that Egret and Matterhorn have almost identical performance for small numbers of processors, while Cougar is distinctly poorer. In Fig. 6(c), there is a sudden breakdown in performance for Egret, Matterhorn and Cougar at roughly 4, 32 and 64 processors, respectively. This corresponds to the regions of peaks in Fig. 5 and occurs, for each machine, as the number of processors used approaches the total number of processors. This region of peaks grows broader as the problem size decreases (Fig. 5). Fig. 6(d), representing the slice at 2^22, shows little evidence of saturation for Egret, but distinct indications for Matterhorn and Cougar. It is interesting that, for this problem size (2^22), the performance of Egret matches that of Matterhorn at 1 processor but drifts towards, and ultimately matches, Cougar at 16 processors.

The experimental results presented above indicate that the performance of the 16 processor Egret XMT can sometimes be better than the 64 processor Matterhorn XMT-2 or the 128 processor Cougar XMT. The fact that this happens for small problem sizes would support the hypothesis that Egret's advantage is due to its smaller network and thus lower latency. This advantage disappears at large problem sizes because the latency impact is nullified by the massive multithreading of the architecture. Later in this paper we will show that this anomalous behavior of Egret also occurs for large dense problems and cannot be explained away by latency hiding.

For all three machines we note that the performance saturates when the number of processors approaches the total number of processors available. This saturation occurs earlier with decreasing problem size. The experiments described above are for simple memory access operations. More complicated operations would, we expect, result in even more complex performance curves. This is illustrated in Section 5, where we show the performance of these three machines on three large dynamic programming codes.

‡In Fig. 5 (b) the intercepts on the processors=1 plane correctly match the processors=1 plots in Figs. 1, 2 & 3, as they represent 2 memory accesses.

(a) Slice at processors = 1   (b) Slice at processors = 8
(c) Slice at problem size 2^16   (d) Slice at problem size 2^22

Figure 6. Slices through the surfaces of Fig. 5.

4. Matrix Multiply

Matrix multiply is a well-understood problem and has been evaluated for the MTA and the XMT in prior research [4]. Figs. 7, 8 and 9 show the performance of matrix multiply on the 3 machines. The performance of the Egret 16 processor XMT is virtually identical to the XMT performance reported in [4]. Cougar 128 processor XMT's performance is slightly poorer, for reasons we do not understand, but it demonstrates very good scaling, nevertheless.
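For reference, the kernel being benchmarked is the standard triply nested dense matrix multiply; the sketch below is not the authors' code, and the names, row-major layout and pragma placement are our assumptions:

    /* Sketch: m x m dense matrix multiply, row-major storage.  The outer
       loop runs in parallel; the innermost loop is a dot-product reduction. */
    void matmul(const double *A, const double *B, double *C, long m)
    {
        long i, j, k;
    #pragma mta assert parallel
        for (i = 0; i < m; i++)
            for (j = 0; j < m; j++) {
                double sum = 0.0;
                for (k = 0; k < m; k++)
                    sum += A[i*m + k] * B[k*m + j];
                C[i*m + j] = sum;
            }
    }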

[Figure 7: left panel, cycles per element vs. number of elements m × m (2^12 to 2^28) for 1-16 processors, with reference lines at 2.74m and 2.74m/16; right panel, cycles per element vs. processors.]
Figure 7. Matrix Multiplication on Egret XMT.

[Figure 8: left panel, cycles per element vs. number of elements m × m for 1-128 processors, with reference lines at 3.5m and 3.5m/128; right panel, cycles per element vs. processors.]
Figure 8. Matrix Multiplication on Cougar XMT.

What is paradoxical is the performance of the Matterhorn 64 processor XMT-2, which is almost identical (up to 16 processors) to Egret. A detailed evaluation of this behavior is an interesting topic for future research.

[Figure 9: left panel, cycles per element vs. number of elements m × m for 1-64 processors, with reference lines at 2.74m and 2.74m/64; right panel, cycles per element vs. processors.]
Figure 9. Matrix Multiplication on Matterhorn XMT-2.

5. Performance of three Dynamic Programming codes

We now describe the performance of three dynamic programming codes on each of our target machines.

5.1. Main Loop from Alignment Algorithm

The alignment of DNA sequences is one of the key problems in bioinformatics and computational biology. The traditional method of aligning sequences employs dynamic programming. The time and space to align two sequences of length n are both Θ(n^2). Prior research has shown that the MTA/XMT are very well suited to the parallel execution of this problem [4, 9]. The main loop from this algorithm is shown in Fig. 10 and consists of a traversal of an n × n table along columns and then rows. The XMT compiler is very successful in parallelizing this code and automatically converts the table traversal into “wavefront” order (indicated by ‘w’ in the compiler listing).

Fig. 11 shows the performance of the three machines. In this particular case the performance of Cougar XMT and Matterhorn XMT-2 are almost identical for 16, 32 and 64 processors. Of course, with a 2 TB memory, Matterhorn can accommodate much larger problem sizes. It is interesting that the 16 processor Egret XMT has slightly better performance than the other two machines for 1 to 16 processors. Also interesting is the fact that Cougar XMT is noticeably slower than the other two machines for 1 to 4 processors. We do not have a satisfactory explanation for this anomaly.
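The wavefront order arises from the dependence structure of the recurrence computed by the loop in Fig. 10: each cell depends only on its north, west and north-west neighbours,

    D[i][j] = min( D[i-1][j] + 1, D[i][j-1] + 1, D[i-1][j-1] + p ),  where p = (P[i] != T[j]),

so all cells on an anti-diagonal of the table can be computed in parallel.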

        | void dyn(char P[M_MAX],int m,char T[N_MAX],int n,int D[M_MAX][N_MAX])
        | {
        | int i,j;
        | for (i=0; i <= m; i++)
P       |   D[i][0] = i;
        | for (j=1; j <=n; j++) {
P       |   D[0][j] = j;
        | }
        | #pragma mta noalias *P, *T, *D
        | for (i=1; i <=m; i++) {
        |   int j;
P:e     |   int myPi=P[i];
        |   for(j=1; j <= n ; j++){
        |     int v, h, d, m1, m2, p;
-P1:w   |     v= D[i-1][j]+1;
        |     p= (myPi!=T[j]);
-P1:w   |     d= D[i-1][j-1]+p;
-P1:w   |     m1 = MIN( d, v);
-P1:w   |     h= D[i][j-1]+1;
        |     m2 = MIN( m1,h);
P-:w    |     D[i][j] = m2;
SP1:w + |   }
        |   }
        | }

Figure 10. Cray Compiler Analysis (CANAL) output of sequence alignment code. The annotations are explained in the Appendix.

5.2. Main Loop from Subset-sum Algorithm

The subset-sum problem finds a subset of a list of integers that sums to half the total sum of the list. Although this problem is NP-complete, there exists a pseudo-polynomial time algorithm that makes use of dynamic programming. The dynamic programming table is built by marking (True or 1) all cells corresponding to sums that are achievable by some subset of the integers considered so far, up to half the total sum of the list. This approach is based on the exposition given in Garey and Johnson [18].

The updating in our subset-sum algorithm is done on a word-by-word basis (the bits of the word represent the cells of the table). This allows selected words to be concurrently updated using the word level ‘full/empty’ locking mechanism of the Cray XMT and results in excellent parallelization. In addition we also parallelize the initializing and updating of the cells, on a row by row basis. Fig. 12 depicts the main loop of this code, where the filling and updating of the words is carried out. Details of this code are available in [14] and [15].

Fig. 13 shows the performance of this code. Once again, we see that Egret XMT does better for 1 to 8 processors and matches the other two machines at 16 processors.
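In word-parallel form, the row update that Fig. 12 implements can be summarized as follows (our notation; a_i denotes the i-th integer of the list and row[i] the bitset of sums reachable using the first i integers):

    row[i] = row[i-1] | (row[i-1] << a_i)

that is, every sum reachable without a_i remains reachable, and every such sum increased by a_i also becomes reachable. Shifts that cross 64-bit word boundaries are what the wNew and (64-mybit) terms in Fig. 12 handle.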

[Figure 11 plots time (sec) against problem size n for 1-128 processors, for Egret, Cougar and Matterhorn.]

Figure 11. Performance of the sequence alignment algorithm. Two equal-sized DNA sequences of size n are aligned using dynamic programming. As the table required is of size n^2 words, the maximum problem size is < 2^19 bases on the 2TB Matterhorn XMT-2.

        | //Initialization
X       | writexf(&d[0][0],1); //left-most word of first row
        | #pragma mta block schedule
        | #pragma mta assert parallel
        | for(j=1;j ... >6);//w+(size/64)
X-p     | unsigned upper=d[i-1][w];
        | unsigned temp11, temp12;
        | unsigned temp21, temp22;
XSD     | if(((w+1)*64+size)<=(halfSize+63)){
XSD     | temp11=readfe(&d[i][wNew]);
        | temp12=temp11|((upper)< ... >(64-mybit));
XSD     | writeef(&d[i][wNew+1],temp22);
        | }
        | }
        | }
        | }

Figure 12. Main loop of subset-sum algorithm. Annotations are explained in the Appendix.

The performance of Cougar XMT is the same as that of Matterhorn XMT-2 for 1 processor, and better for 2 and 4 processors. When dealing with large problem sizes, the performance of Cougar XMT continues to improve up to 64 processors, though the incremental gain is slight. For 128 processors, the performance of Cougar XMT breaks down. For small problem sizes the performance of Matterhorn XMT-2 is significantly better than Cougar XMT for 8 to 64 processors. The scaling of Matterhorn XMT-2 is overall better than that of Cougar XMT for all problem sizes.
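The essence of the locked word update in Fig. 12 is sketched below. This is not the authors' code; the names (row_prev, row_cur, bit) and the treatment of unsigned as a 64-bit word are our assumptions. readfe() and writeef() are the XMT full/empty generics that appear in the listing: readfe waits for a word to be full and leaves it empty (locked), while writeef writes the word and sets it full again.

    /* Sketch (assumed names): OR one word of the previous table row, shifted
       left by 'bit', into the current row; the full/empty bits make each word
       update atomic.  Bits shifted past position 63 spill into the next word. */
    void or_shifted_word(unsigned *row_prev, unsigned *row_cur,
                         int w, int wNew, int bit)
    {
        unsigned upper = row_prev[w];                   /* source word            */
        unsigned t;

        t = readfe(&row_cur[wNew]);                     /* lock target (set empty) */
        writeef(&row_cur[wNew], t | (upper << bit));    /* merge, unlock (set full) */

        if (bit != 0) {                                 /* spill bits, if any      */
            t = readfe(&row_cur[wNew+1]);
            writeef(&row_cur[wNew+1], t | (upper >> (64 - bit)));
        }
    }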

5.3. Key Loop from Reassortment Algorithm

The evolution of influenza A is a topic of great interest from a public health standpoint. Influenza evolves through the familiar mechanism of mutation as well as through reassortment, where RNA segments are mixed up to form new and potentially dangerous viruses. Traditional methods of evolutionary analysis, such as phylogenetic trees, cannot directly model reassortment. Bokhari and Janies [7] developed the concept of reassortment networks for modeling the evolution of influenza. Bokhari, Pomeroy and Janies [12] implemented an algorithm for reassortment networks on the Cray XMT and used it to analyze the evolution of the 2009 swine origin influenza pandemic. The reassortment networks for influenza are very large graphs, and tracing the evolution of a particular strain takes about 35 hours on the 128 processor Cougar XMT. It is thus worthwhile to explore the performance of this code on the three machines. The key loop from the reassortment algorithm that dominates the total time required by the code is shown in Fig. 14.

The performance of this code is shown in Fig. 15. We note first of all that the performance of Egret is better than the other two machines up to 16 processors. Matterhorn XMT-2 is clearly better than Cougar XMT by 20–40%. Since the key loop of this algorithm has a very intense pattern of memory accesses§, and since the XMT-2 is seen to be very significantly better than the XMT, we infer that the XMT-2 is able to deliver operands to processors at a much higher rate than the XMT.

It is, however, very important to note that the performance of both the XMT and the XMT-2 saturates well below the maximum number of processors. This shows that it can be important to analyze intense inner loops of codes experimentally in order to establish the saturation point. In the present case, for example, using 40 instead of 64 processors on the XMT-2, or 75 instead of 128 processors on the XMT, would lead to almost a factor of 2 improvement in performance. Note also that all three machines exhibit excellent scaling as long as saturation is avoided.

To demonstrate that the saturation observed in Fig. 15 is indeed caused by memory performance limitations, we used the Cray Traceview tool [17] to examine memory performance on the XMT and XMT-2 near their saturation thresholds (≈ 80 and ≈ 40 processors, respectively). In the Cray XMT family of machines, when a memory location cannot be accessed because of locking and/or network congestion, the hardware is designed to ‘retry’ the access a small number of times [2, 19]. If the location is still not delivered after these retries, there is a trap and the thread is saved in software for execution at a later time. We used Traceview to track the retries and traps just before and after saturation. The results are shown in Figs. 16 & 17. It is clear that retries and traps increase dramatically when saturation occurs. This demonstrates that the saturation phenomenon shown in Fig. 15 is caused by limitations of the memory subsystem.

§For a problem of size 3200, the loop is executed 3200^2 × 8 times, with 2 memory accesses per loop.

[Figure 13 plots time in seconds against dynamic programming table size in bits (3.5×10^9 to 1.5×10^13) for 1-128 processors.]

Figure 13. Performance of the pseudo-polynomial Subset-sum algorithm. The size of the problem is the size of the dynamic programming table in bits.

        | void labelVfromR2(int t){
        | // nV = viruses = problem size; nS = RNA segments=8
        | // from R, stage t, label the V, stage t+1 layer
        | int k;
        | for(k=0; k ... =0){
-       |   whichRVi[t+1][k] = theI;
-       |   whichRVj[t+1][k] = theJ;
-       |   whichRVs[t+1][k] = theS;
        | }
-       | V[t+1][k]=myV;
        | }
        | }

Figure 14. Dominant loop from virus reassortment code. Annotations are explained in the Appendix.

6. Restructuring nested loops

The experiments presented above demonstrate that it would be worthwhile to evaluate carefully the trade-offs in parallelizing nested loops, especially for codes that are to be run for extended periods of time, such as the virus reassortment code. We evaluated different combinations of parallelization for the 4 nested loops of the virus code of Fig. 14. Fig. 19 shows the result of this experiment. The solid plots in this figure are the same as the size=400 virus lines in Fig. 15 and are labeled SPPP to denote the fact that the outermost loop is serial while the inner three loops are parallel (as originally shown in the code fragment for this program, Fig. 14).

[Figure 15 (assort32R.c, Egret vs Cougar vs Matterhorn) plots time (sec) against processors for problem sizes 400, 800, 1600 and 3200.]

Figure 15. Performance of the key loop of the virus reassortment algorithm. The loop has a heavier memory access pattern compared to Figs. 11 and 13. The XMT-2 clearly outperforms the XMT for large numbers of processors. However, the machines saturate at ≈ 40 and ≈ 80 processors, respectively. The 16 processor Egret XMT outperforms the other machines over the range 1–16 processors. Experiments for problem sizes 1600 and 3200 could not be run for the full range of processors because of time constraints. Cougar XMT runs for problem size 3200 are not at 1 processor intervals (unlike the remaining plots) for the same reason.

(a) Cougar XMT, 86 processors

(b) Cougar XMT, 87 processors

Figure 16. Plots of Retries and Traps (explained in main text) for the virus reassortment algorithm for problem size=400 viruses on the 128 processor Cougar XMT. Going from 86 to 87 processors, the traps increase from 273,738 to 2,489,279, and time increases from 5 to 7 sec, showing that the system has saturated. This corresponds to the step in the size=400 plot for Cougar in Fig. 15. Execution of the labeling code of Fig. 14 starts at about 11 seconds. The time prior to this is taken up by setup overhead in the main program, which is largely serial. This figure shows the execution of only one iteration of the labeling code. In practice the loop would be executed many times, thus amortizing the serial overhead. Note the different scales for Traps in the two plots.

(a) Matterhorn XMT-2, 41 processors

(b) Matterhorn XMT-2, 43 processors

Figure 17. Plots of Retries and Traps (explained in main text) for the virus reassortment algorithm for problem size=400 viruses on the 64 processor Matterhorn XMT-2. Going from 41 to 43 processors, the traps increase from 69,159 to 1,712,068, and time increases from 7 to 10 sec, showing that the system has saturated. This corresponds to the step in the size=400 plot for Matterhorn in Fig. 15. Execution of the labeling code of Fig. 14 starts at about 4.5 seconds. See also the caption for Fig. 16.

#pragma mta assert parallel
for(slab=0;slab ...

Figure 18. Splitting the inner loop of the virus reassortment algorithm.
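The general shape of this slab-based splitting is sketched below; the names, the slab-size constant and process_virus() are our assumptions, standing in for the per-virus labeling work of Fig. 14:

    /* Sketch: run the loop over nV viruses as a parallel loop over slabs of
       SLAB consecutive viruses, with the work inside each slab done serially. */
    #define SLAB 4                          /* slab sizes 4 and 8 are compared in Fig. 20 */

    void process_virus(int k);              /* hypothetical per-virus labeling work */

    void label_by_slabs(int nV)
    {
        int slab;
    #pragma mta assert parallel             /* parallelism only across slabs */
        for (slab = 0; slab < (nV + SLAB - 1) / SLAB; slab++) {
            int k;
            for (k = slab * SLAB; k < (slab + 1) * SLAB && k < nV; k++)
                process_virus(k);           /* serial within each slab */
        }
    }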

The lines labeled PPPP show performance of this code when all four loops are parallelized. The performance in this case is extremely poor for all machines because of an excess of parallelism¶. In contrast, the curves labeled SSPP show the result when only the two inner loops are parallelized. In this case there is a dearth of parallelism on all three machines and they all saturate at approximately the same level. We discovered that by serializing the innermost loop we could actually do much better. This is indicated by the lines labeled SPPS in Fig. 19, which are distinctly better than our original implementation.

We carried this process further by serializing part of the second inner parallel loop, as shown in the code fragment of Fig. 18. The results are shown in Fig. 20, where triangles represent an experiment with slab size 4 and circles stand for slab size 8. The dashed lines are the same as those for SPPS in Fig. 19. In this experiment, 5 observations were made at each processor count, as there is a considerable spread in observations, especially in the breakdown region. We see that, for slab size 4 on Cougar XMT, the data points are tightly clustered and outperform the SPPS plot. No such phenomenon is seen for Matterhorn, however. The results of this experiment are not conclusive but do suggest that it might be worthwhile in some cases to break up a parallel loop into serial/parallel combinations, especially if the code in question is running in a production environment for long periods of time.

The experiments described in this section have very important implications for codes that have intense memory access patterns and consume a lot of wall clock time. We note that in these cases it is very worthwhile to explore alternative schemes of parallelizing or serializing nested loops. It may not always be a good idea to use all the processors available to us, even if we have exclusive access to the machine. It may be worthwhile to spend a considerable amount of time experimenting with various parallelizing schemes for the case of production codes. The XMT-2 definitely gives better performance than the XMT, especially in the case of tight loops with extensive memory accesses‖.

¶This is the reason why Bokhari et al. [12] did not parallelize the outer loop.

[Figure 19 (assort32R.c, size 400, Egret vs Cougar vs Matterhorn) plots time (sec) against processors for the PPPP, SSPP, SPPP and SPPS parallelizations.]

Figure 19. Experiments with parallelization of the dominant loop of the reassortment code of Fig. 14, problem size 400.

[Figure 20 (assort32R.c/34R.c, size 400) plots time (sec) against processors (32–90) for Cougar and Matterhorn, comparing slab sizes 4 and 8.]

Figure 20. Splitting up the inner parallel loop, problem size 400.

7. Discussion

Our experiments show that all three machines analyzed exhibit a saturation phenomenon caused by limitations of the memory subsystem. This is evident in Fig. 5, where we can see that, for all machines, the smooth performance surfaces degrade into regions of jagged peaks as the number of processors used approaches the total number of processors available. The regions of saturation are quite large for small problem sizes and remain significant for large problem sizes. The saturation phenomenon is observed even for simple loops such as the memory copy loop of Section 2.

‖Cray has introduced two new compiler directives to help control parallelism and concurrency [16]. These are pragma mta max concurrency and pragma mta max processors. We were unable to obtain any improvements using the max concurrency directive (which limits the maximum number of streams). We did not experiment with the max processors directive as that simply limits the number of processors, which is trivial to do manually (e.g., after inspecting plots like Fig. 15, one could simply use the number of processors that avoids the saturation region).

4. There are mysterious similarities in performance between the Egret 16 processor XMT and the Matterhorn 64 processor XMT-2 which we are at a loss to explain.
5. It is very worthwhile to explore the restructuring of nested loops. Appropriate reduction of parallelism can often yield greatly improved performance (Section 6).
6. The 64 processor XMT-2 appears to have better scaling than the 128 processor XMT (Fig. 13). This needs to be explored further on a 128 processor XMT-2, when such a machine is available.
7. It is not always worthwhile to use the maximum number of processors available in any given system. Underutilizing the total number of available processors may yield rich dividends.
8. The compiler man page on Matterhorn mentions the new -xmt2 flag that enables 7 step lookahead on the XMT-2 (as opposed to 3 steps on the XMT). We have found that this flag makes no difference in performance in our XMT-2 runs, as it is probably always on by default on XMT-2 compilers.

ACKNOWLEDGEMENTS

Access to Egret XMT was provided by Cray Inc. (www.cray.com). We thank David Mizell for his support and patience. Access to Cougar XMT was provided by the Center for Adaptive Supercomputing Software–Multithreaded Architectures (CASS-MT) hosted at the Pacific Northwest National Laboratory (cass-mt.pnl.gov). We thank John Feo and Richard Russell for their support and encouragement and Michael Peterson for ensuring smooth access to the machine. Access to Matterhorn XMT-2 was provided by the Swiss National Supercomputing Centre (www.cscs.ch). We thank Thomas Schoenemeyer for his generous assistance. Shahid Bokhari is supported by Contract 142839 from Pacific Northwest National Laboratory.

REFERENCES

1. Alverson, G., Alverson, R., Callahan, D., Koblenz, B., Porterfield, A., and Smith, B. (1992). Exploiting heterogeneous parallelism on a multithreaded multiprocessor. In Proc. Int. Conf. Supercomputing, pages 188–197.
2. Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., and Smith, B. (1990). The Tera computer system. In Proc. Int. Conf. Supercomputing, pages 1–6.
3. Anderson, W., Briggs, P., Hellberg, C. S., Hess, D. W., Khokhlov, A., Lanzagorta, M., and Rosenberg, R. (2003). Early experience with scientific programs on the Cray MTA-2. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 46, Washington, DC. IEEE Computer Society.
4. Bokhari, S. and Saltz, J. (2010). Exploring the performance of massively multithreaded architectures. Concurrency and Computation: Practice and Experience, 22(5), 588–616.
5. Bokhari, S. and Saltz, J. (January 20, 2009). Exploring the performance of massively multithreaded supercomputers. Technical Report OSUBMI-TR-2009-n01, Department of Biomedical Informatics, The Ohio State University. bmi.osu.edu/~shahid/mtaxmt.
6. Bokhari, S. and Sauer, J. (2006). Parallel algorithms for bioinformatics. In A. Zomaya, editor, Parallel Computing for Bioinformatics, pages 509–529. Wiley.
7. Bokhari, S. H. and Janies, D. A. (2010). Reassortment networks for investigating the evolution of segmented viruses. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7, 288–298.
8. Bokhari, S. H. and Sauer, J. R. (2003). Sequence alignment on the Cray MTA-2. In Proceedings of the 2003 Workshop on High Performance Computational Biology.
9. Bokhari, S. H. and Sauer, J. R. (2004). Sequence alignment on the Cray MTA-2. Concurrency and Computation, 16, 823–839.
10. Bokhari, S. H. and Sauer, J. R. (2005). A parallel graph decomposition algorithm for DNA sequencing with nanopores. Bioinformatics, 21(7), 889–896.
11. Bokhari, S. H., Glaser, M. A., Jordan, H. F., Lansac, Y., Sauer, J. R., and Van Zeghbroeck, B. (2002a). Parallelizing a DNA simulation code for the Cray MTA-2. In Proc. IEEE Computer Soc. Bioinformatics Conf., pages 291–302.
12. Bokhari, S. H., Pomeroy, L. W., and Janies, D. A. (2012). Reassortment networks and the evolution of pandemic H1N1 swine-origin influenza. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9, 214–227.
13. Bokhari, S. H., Glaser, M. A., Jordan, H. F., Lansac, Y., Sauer, J. R., and Zeghbroeck, B. V. (August 15, 2002b). Parallelizing a DNA simulation code for the Cray MTA-2. In Proc. IEEE Computer Society Bioinformatics Conf., pages 291–302.

14. Bokhari, S. S. (2011). Parallel Solution of the Subset-sum Problem: An Empirical Study. Master's thesis, Department of Computer Science and Engineering, The Ohio State University. http://etd.ohiolink.edu/view.cgi?acc_num=osu1305898281.
15. Bokhari, S. S. (2012). Parallel solution of the subset-sum problem: An empirical study. Concurrency and Computation: Practice and Experience. To appear.
16. Cray Inc. (2010). Limiting Loop Parallelism in Cray XMT Applications. S-0027-14.
17. Cray Inc. (2011). Cray XMT Performance Tools User's Guide. S-2462-20.
18. Garey, M. R. and Johnson, D. S. (1979). Computers and Intractability. W. H. Freeman, New York.
19. Mizell, D. and Maschhoff, K. (2009). Early experiences with large-scale Cray XMT systems. In Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–9.
20. Reidy, J., Bader, D. A., Ediger, D., and Mizell, D. W. (2011). Modeling memory constrained and massively multithreaded performance. Unpublished manuscript.
21. Snavely, A., Carter, L., Boisseau, J., Majumdar, A., Gatlin, K. S., Mitchell, N., Feo, J., and Koblenz, B. (1998). Multi-processor performance on the Tera MTA. In Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, pages 1–8.

APPENDIX A. CANAL Annotation

The Cray Compiler Analysis tool (CANAL) [17] explains the decisions made by the compiler in parallelizing and transforming the code. The annotations shown in the code listings in this paper are explained below.

P  The compiler executes this loop in parallel.
p  The compiler executes this loop in parallel because of an assert parallel pragma.
D  There is a dependency that would prevent parallel execution, but the compiler executes the loop in parallel because of an assert parallel pragma.
-  The loop is executed serially.
S  The loop is executed serially: the marked statement does not allow parallelism.
X  The loop is executed serially because its structure does not allow parallelism.
e  A scalar variable was replicated.
w  A nested loop was wavefronted.