Automatic Complexity Analysis of Explicitly Parallel Programs

Torsten Hoefler, ETH Zurich, Zurich, Switzerland ([email protected])
Grzegorz Kwasniewski, ETH Zurich, Zurich, Switzerland ([email protected])

ABSTRACT

The doubling of cores every two years requires programmers to expose maximum parallelism. Applications that are developed on today's machines will often be required to run on many more cores. Thus, it is necessary to understand how much parallelism codes can expose. The work and depth model provides a convenient mental framework to assess the required work and the maximum parallelism of algorithms and their parallel efficiency. We propose an automatic analysis to extract work and depth from source code. We do this by statically counting the number of loop iterations depending on the set of input parameters. The resulting expression can be used to assess work and depth with regards to the program inputs. Our method supports the large class of practically relevant loops with affine update functions and generates additional parameters for other expressions. We demonstrate how this method can be used to determine work and depth of several real-world applications. Our technique enables us to prove if the theoretically maximum parallelism is exposed in a practical implementation of a problem. This will be most important for future-proof software development.

Categories and Subject Descriptors
F.2 [Analysis of Algorithms and Problem Complexity]: general

General Terms
Theory

Keywords
Loop iterations; Polyhedral model; Work and depth analysis

1. INTRODUCTION

Parallelism in today's computers is still growing exponentially, currently doubling approximately every two years. This implies that programmers need to expose exponentially growing parallelism to exploit the full potential of the architecture. Parallel programming is generally hard and practical implementations may not always expose enough parallelism to be considered future-proof. This is exaggerated by continuous application development and the fact that applications are developed on systems with significantly lower core counts than their production environment. Thus, it is increasingly important that programmers understand bounds on the scalability of their implementation.

Parallel codes are manifold and numerous programming frameworks exist to implement parallel versions of sequential codes. We define the class of explicitly parallel codes as applications that statically divide their workload into several pieces which are processed in parallel. Explicitly parallel codes are the most prevalent programming style in large-scale parallelism using the Pthreads, OpenMP, the Message Passing Interface (MPI), Partitioned Global Address Space (PGAS), or Compute Unified Device Architecture (CUDA) APIs. Many high-level parallel frameworks (e.g., [18]) and domain-specific languages (e.g., [16]) compile to such explicitly parallel languages.

The work and depth model is a simple and effective model for parallel computation. It models computations as vertices and data dependencies as edges of a directed acyclic graph (DAG). The total number of vertices in the graph is the total work W and the length of a longest path is called the depth D (sometimes also called span). We will now describe more properties of the model and possible analyses.

1.1 Work and depth and parallel efficiency

In practice, analyses are often not used to predict exact running times of an implementation on a particular architecture. Instead, they often determine how the running time behaves with regards to the input size. The work and depth model links sequential running time and parallelism elegantly. The work W is proportional to the time T1 required to compute the problem on a single core.
The depth D is the longest sequential chain and thus proportional to a lower bound on the time T∞ required to compute the problem with an infinite number of cores.

Work and depth models are often used to develop parallel algorithms (e.g., [33]) or to describe their properties (e.g., in textbooks [22, 23]). Those algorithms are then often adapted in practical settings. We propose to use the same model, somewhat in the inverse direction, to analyze existing applications for bounds on their scalability and available parallelism. Our results can also be used to prove an implementation asymptotically optimal with regards to its parallel efficiency if bounds on work and depth of the problem are known. In our analysis we use the assumption from [13] that all operations are performed in unit time and that the time required for accessing data, storing results, etc., is ignored.

Brent's lemma [13] bounds running times on p cores with W/p ≤ Tp ≤ W/p + D. D measures the sequential parts of the calculation and is equivalent to the time t needed to perform an operation with a sufficient number of processes, W is equivalent to the number of operations q in Brent's notation, and B = D/W is a lower bound of the sequential fraction that limits the returns from adding more cores. Applying Amdahl's law [2] shows that the speedup is limited to Sp = T1/Tp ≤ 1/(B + (1−B)/p). If we consider the parallel efficiency Ep = T1/(p·Tp), then we can bound the maximally achievable efficiency using the work and depth model as Ep = T1/(p·Tp) ≤ 1/(1 + B(p−1)). We observe that for fixed B, lim_{p→∞} Ep = 0, such that every fixed-size computation can only utilize a limited number of cores efficiently, i.e., with Ep ≥ 1 − ε. This observation allows us to define available parallelism and good scaling in terms of ε as the maximum number of processes p for which Tp may decrease.

Bounds on work and depth for certain problems also allow us to differentiate between a problem that is hard to parallelize (e.g., depth first search (DFS)) and a suboptimal parallelization; we can also define the distance of a given parallel code to a parallelism-optimal implementation.

Example I: An explicitly parallel loop.
In structured programming [17], loops and recursion are the constructs that determine how much work a program performs. Consider the following explicitly parallel loop, executed by each of the p processes:

for(x=0; x<n/p; x++) { for(y=1; y<n; y=2*y) { S1; } }

All variables that are not changed in the loop but influence the iteration counts are called parameters. The parameter n represents the size of the input problem. S1 is an arbitrary computation statement that models one work item. We now analyze work and depth for this explicitly parallel loop. For any loop, the elements that determine the number of iterations can be split into three classes:

1. Initial assignment: x=0, y=1
2. Loop guards: x<n/p, y<n
3. Loop update functions: x=x+1, y=2*y

The number of iterations of statement S1 in this loop (depending on the parameters n and p) can be counted as N(n, p) = Tp = n/p · log2(n). From N(n, p), we can determine the total work and depth:

W(n) = N(n, 1) = T1 = n · log2(n)
D(n) = N(n, ∞) = T∞ = log2(n).

The parallelization is work-conserving and the parallel efficiency is Ep = 1. If this loop implements parallel sorting, then our analysis shows that it is asymptotically optimal in work and depth [25], and thus exposes maximum parallelism. In this paper, we will show how to perform this analysis automatically.

Example II: Parallel reductions.
Our second example illustrates a common problem in parallel shared memory codes: reductions. Programmers often employ inefficient algorithms because efficient tree-based schemes are significantly harder to implement. A sequential reduction would be implemented as follows (addition operations on the variable sum are performed atomically):

sum=0; for(i=0; i<n; i++) sum += A[i];

A simple parallelization of this loop for p processes (we assume n > p) would be

for(i=id*n/p; i<(id+1)*n/p; i++) sum += A[i];

where id is the rank of the executing process. A tree-based reduction scheme for p > 0 processes (we assume n is a power of 2) combines the partial sums in ⌈log2(p)⌉ steps, with the iteration count of the most loaded process N(n, p) = Tp = ⌈n/p⌉ + ⌈log2(p)⌉. The work of this solution is larger than the sequential work n because each process adds ⌈log2(p)⌉ operations, and the parallel efficiency Ep = n/(p⌈n/p⌉ + p⌈log2(p)⌉) decreases slowly because log2(p) work is added per process. From Ep, we can derive that the available parallelism is n.

This example shows that it is crucial to catch loop behavior in the analysis of parallel programs. Different implementations solving the same problem may have different work and depth, some of which result in limited scalability. Experiments at small scale may not expose those limitations as the constants are often rather small. However, our analysis exposes such bounds statically from the source code.

For an affine loop nest, initial assignments, loop guards, and update functions are affine expressions of the iteration variables and parameters; in addition, the guard of the loop at nesting level k must not depend on the iteration variables of loops nested inside it, i.e., ∀i > k : g_{k,i} = 0. Previous work showed that a subset of this class covers many important codes in parallel computing [9]. Our method is strictly more powerful than other iteration counting approaches (e.g., [6]) that require that loop update functions are valid expressions in Presburger arithmetic (which supports only addition and subtraction of symbolic values and constants); we refer the reader to Section 8 for a detailed comparison. Note that even though all the assignments and loop guards are affine, the number of iterations of such a nested loop may not be affine. For example, the following affine loop will iterate ⌈log2(n)⌉ times:

x=1; while (x<n) x=2*x;
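To illustrate the symbolic reasoning behind such a count, the following is a minimal sketch (not the authors' tool; sympy stands in for the symbolic solver) that recovers the ⌈log2(n)⌉ iteration count of the doubling loop above from the closed form of its update:

    import sympy as sp

    n, d = sp.symbols('n d', positive=True)

    x_d = 2**d                                   # closed form of x after d iterations of x=2*x with x=1 initially
    exit_point = sp.solve(sp.Eq(x_d, n), d)[0]   # real d at which the guard x<n stops holding
    iterations = sp.ceiling(exit_point)          # smallest integer d with 2**d >= n

    print(iterations)                            # ceiling(log(n)/log(2)), i.e., the ceil(log2(n)) count stated above
    print(iterations.subs(n, 1000))              # 10

The same pattern, deriving a closed form for the update, solving the guard for the iteration variable, and wrapping the result in a ceiling, is what the algorithm described below automates for whole loop nests.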

Composing the affine representations of all k loop nests will produce the final affine representation x = Afinal·x0 + bfinal that expresses the whole loop nest (cf. Equation (1)). The total number of iterations of the innermost statement is then

N = Σ_{i1=0}^{n1(x0,1)} Σ_{i2=0}^{n2(x0,2)} · · · Σ_{i_{r−1}=0}^{n_{r−1}(x0,r−1)} n_r(x0,r).    (2)

To solve Equation (2), we need to compute:

1. the affine representations for all the loops together with their counting functions nk(x0,k),
2. the starting conditions for all loops as functions of the iteration variables of the enclosing loops.

Example of an affine representation.
The following example illustrates how to transform a loop into its affine representation. Consider the following loop:

y=y0; z=z0;
while (y<z) { y=y+2; z=z-1; }
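As a cross-check of the counting function derived next, here is a small sketch in the same spirit (sympy instead of MuPAD [10]; it assumes the loop body y=y+2; z=z-1 shown above, which closes the gap z−y by 3 in every iteration):

    import sympy as sp

    y0, z0, d = sp.symbols('y0 z0 d', real=True)

    y_d = y0 + 2*d                                # closed form of y after d iterations
    z_d = z0 - d                                  # closed form of z after d iterations
    exit_point = sp.solve(sp.Eq(y_d, z_d), d)[0]  # real d at which the guard y<z stops holding
    count = sp.ceiling(exit_point)                # first integer d with z0 - y0 <= 3*d

    print(count)                                  # ceiling(z0/3 - y0/3)
    print(count.subs({y0: 1, z0: 11}))            # 4; simulating the loop from y=1, z=11 also gives 4 iterations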

4. ALGORITHM DESCRIPTION that can be simplified to dargmind(z0 − y0 ≤ 3d)e. A sym- We now describe all the steps and approximations needed bolic solver (e.g., MuPAD[10]) will determine the solution to solve Equation (2) which determines the final loop count. for dd = (z0 − y0)/3e, which leads to the counting function n(x0) = d(y0 − y0)/3e. 4.1 Affine representation of nested loops 4.2 Starting conditions We now explain how we transform a perfectly nested loop into a single affine statement. This statement can then be The starting conditions x0,k+1 for a loop at level k + 1 are combined with the initial assignments and the original loop determined by the value of the vector x before entering the update function of the parent loop into a new loop update loop. For each loop, at depths k = 1, . . . , r, letx ˆk denote the function that represents the whole loop nest. corresponding function defined in equation (3), giving the Each loop update statement x ← Ax + b is a recursive ikth initial assignment at level k, for ik = 1, . . . , nk(x0,k), formula for the value of the vector x in the current step, x0,k+1 = Ak · xˆk(ik, x0,k) + bk. (6) given the value in the previous step. The closed form of that We can now count the starting conditions recursively un- In both cases, we derive lower and upper bounds of the re- til we reach the top level, where x0,1 = x0. In general, the spective sum. starting condition at level k are compositions of affine repre- Bounded Sum Approximation (BSA) Algorithm. sentations and initial assignments of all the loops from level We now show our BSA algorithm that tightly approximates 1 . . . k − 1, treating all iteration variables i1, i2, . . . , ik−1 as lower and upper bounds of Equation (2) in the two cases parameters. where the exact solution cannot be determined. Obtaining bounds in the first case is simple. For a function Example for the starting condition. df(n)e, we determine the upper bound as f(n) + 1 and the We now show a small example to illustrate the compu- lower bound as f(n). tation of the starting conditions for inner loops. Given the For the second case, with no symbolic solution for a sum, two nested loops we approximate the sum with an integral [28]. For a non- decreasing function f1(i) y=y ; z=z ; 0 0 Z n+1 n Z n+1 while (y0) { x0,2(i1) = A1x(i1) + b1 = i1 x0 = i1 . 2 0 2 y0 m = k; while (m

All counting functions and starting conditions can be We see that k0,1 = k0 = 1 and l0,1 = l0 = 2. The counting combined into a final symbolic iteration count. First, function for the inner loop is we compute all r starting conditions x0,1, x0,2, . . . , x0,r n2 = s − k0,2 in the form x0,1 = x0, x0,2 = f2(i1, x0), ..., x0,r = fr(i1, i2, . . . , ir−1, x0). Then, we compute all counting func- and for the outer loop tions in the form n1(x0,1), n2(x0,2), ..., nr(x0,r). This en- & p 2 ' ables us to calculate the sums of Equation (2) and solve for 4l , + 4l , + 8k , + 1 + 1 n = l + 0 1 0 1 0 1 = 6. the final loop iteration count and derive work and depth 1 0,1 2 from this. We use a symbolic solver to simplify and solve the equations. The starting conditions for the inner loop are In some cases, it is not possible to symbolically determine i1 · (i1 − 1) 1 2 5 the exact solution from Equation (2). We differentiate two k0,2 = k0,1 + i1 · l0,1 − = − i1 + i1 + 1. 2 2 2 cases: The number of iterations of the loop nest, according to Equa- 1. The counting function contains a ceiling, e.g., tion 2 is ( (n+2)n n n d n e if n is even 1 1 P 2 i = 8 X X 1 2 5 i=1 (n+3)(n+1) N = n2 = (s + i1 − i1 − 1) 8 if n is odd 2 2 i1=0 i1=0 2. The symbolic solver cannot find a closed form, e.g., We now approximate this sum with an integral. Analyz- Pn i=1 i · log2(i). ing the first and second derivative of n2 shows that within n 9 to f(x − 1) at that point c. The upper bound U[0,bcc] = 8 U[0,c] 6= U[c,k] = U[dce,k] does not change in the intervals 7 [0, bcc] and [dce, k]. We can then bound the value of f(bcc) 6 with U[0,c](bcc). Thus, 5 bcc k Z bcc X Z k X 4 U[0,c](x) dx ≥ f(i), U[c,k](x) dx ≥ f(i) 3 0 i=0 dce i=dce 2 U (bcc) ≥ f(bcc). 1 [a,c]

0 0 1 2 3 4 5 6 From this follows that the upper bound of the non- i Pk monotonic sum of series i=0 f(i), with saddle point c, can Figure 1: Series approximation. Bars represent the series be expressed as: n2(i), the red line shows the function n2(k) and the dashed green line shows the function n (k − 1). The function n (k) k 2 2 Z bcc Z k X over-approximates the series in the interval [0, 2] and under- U = U[0,c](x) dx+U[0,c](c)+ U[c,n](x) dx ≥ f(i). approximates it in the interval [3, 6]; the dot shows the saddle 0 dce i=0 point. The lower bounds discussion follows a similar reasoning. We now prove propagation of this property through con- the interval (0,n1) the function is decreasing in (0,2.5) and secutive sums in Equation 2: Let f be the kth sum from increasing in (2.5,n1). The sum can then be bounded from k above by Equation 2, and Uk and Lk upper and lower bounds of fk. We then can present it as Z 2 Z n1 47 U = n (i − 1) di + n (2) + n (i ) di = 6s − nk 2 1 1 2 2 1 1 12 X 0 3 nk+1(i1, . . . , ik−1, x0) = fk(i1 . . . , ik, x0) and from below by ik=0 Z 2 Z n1 161 and bound it with L = n2(i1) di1 + n2(2) + n2(i1 − 1) di1 = 6s − . 0 3 12 Lk(i1 . . . , ik−1, x0) ≤ fk(i1, . . . , ik−1, x0) ≤ Uk(i1, . . . , ik−1, x0). Figure 1 illustrates the example. The next sum at level k − 1 will be

nk−1 4.4 Correctness of the algorithm X f (i , . . . , i , x ) All the steps of the algorithm except the sum approxima- k 1 k−1 0 i =0 tion are proper algebraic transformations. If no approxima- k−1 tion is needed then the algorithm produces the exact number Upper and lower bounds are not changed by summing, such of iterations. For example for the loops in Listing 3 the total that number of iterations of the inner loop is exactly nk−1 nk−1 X X l  m   fk(i1, . . . , ik−1, x0) ≥ Lk(i1, . . . , ik−1, x0) log m m 2 y N = y0 − 2 0 y0 + m log2( ) . ik−1=0 ik−1=0 y0 and If approximation is required, then we need to prove that nk−1 nk−1 we obtain proper upper and lower bounds and that this prop- X X erty propagates further through the algorithm. fk(i1, . . . , ik−1, x0) ≤ Uk(i1, . . . , ik−1, x0) ik−1=0 ik−1=0 Lemma 4.1. The Bounded Sum Approximation algorithm implies L (x ) ≤ f (x ) = N ≤ U (x ). gives valid lower and upper bounds for Equation (2). 1 0 1 0 1 0
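For the second of the two cases, the integral bound can be made concrete with a short sketch (sympy again; this is the standard bound for monotone summands, not the authors' implementation). The sum Σ_{i=1}^{n} i·log2(i) quoted earlier has no symbolic closed form, but its summand f(x) = x·log2(x) is non-decreasing for x ≥ 1, so the sum lies between ∫_0^n f(x) dx and ∫_1^{n+1} f(x) dx:

    import sympy as sp

    x, i, n = sp.symbols('x i n', positive=True)
    f = x * sp.log(x, 2)                          # non-decreasing summand for x >= 1

    lower = sp.integrate(f, (x, 0, n))            # integral lower bound of sum_{i=1..n} i*log2(i)
    upper = sp.integrate(f, (x, 1, n + 1))        # integral upper bound

    exact = sp.Sum(i * sp.log(i, 2), (i, 1, 10)).doit()
    print(float(lower.subs(n, 10)), float(exact), float(upper.subs(n, 10)))
    # approximately 130.0 <= 147.3 <= 166.0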

Proof. Ceiling upper and lower bound for type 1 approxi- 4.5 Multipath loops mations are correct by the definition of the ceiling function. We formalize a loop containing multiple statements as Let us consider type 2 sum approximations. while (cT x < g ){ First we need to prove that the upper and lower bounds k k x ← A x + b for a series f(n), where n = 0, . . . , k, found by the algorithm k,1 k,1 x ← A x + b are correct. Let’s denote k,2 k,2 ... ( df x ← A x + b } f(x), if ∀x ∈ [a, b]: dx ≥ 0 k,m k,m U[a,b](x) = df f(x − 1), if ∀x ∈ [a, b]: dx ≤ 0 where each of the statements x ← Ak,ix + bk,i may be a ( simple assignment or an affine representation of a loop. We f(x − 1), if ∀x ∈ [a, b]: df ≥ 0 dx compose them in the same way as we did in Equation (5), L[a,b](x) = df f(x), if ∀x ∈ [a, b]: dx ≤ 0 forming a single affine statement. Starting conditions. We compute the starting condi- as upper and lower bounds of the monotonic series f in the tions for each loop by generalizing Equation (6). For multi- interval [a, b], as stated in Section 4.3. path loops the starting condition for a loop represented by Let c ∈ [0, k] be the only saddle point of function f(x). its affine representation x ← A x + b is the composition Intervals with multiple saddle points can be split to smaller k,i k,i of all the affine statements that precede it: intervals where each contains a single saddle point. Then, U and L will change from f(x − 1) to f(x) or from f(x) x0,k+1,i = Ak,i−1(... (Ak,1 ·xˆk(ik, x0,k−1)+bk,1) ...)+bk,i−1 For illustration consider the following example loop: do j=1,lastrow-firstrow+1 T while (ck x < gk ){ sum = 0. d0 x ← Ak,1x + bk,1 do k=rowstr(j),rowstr(j+1)-1 x ← Ak,2x + bk,2 sum = sum + a(k)*p(colidx(k)) x ← Ak,3x + bk,3 enddo x ← Ak,4x + bk,4 w(j) = sum x ← Ak,5x + bk,5 } enddo
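The composition of consecutive affine statements used above is plain matrix algebra; a brief sketch (with illustrative matrices, not taken from any of the analyzed codes) of collapsing two statements of a multipath body into one:

    import sympy as sp

    # two consecutive affine statements x <- A1*x + b1 and x <- A2*x + b2 ...
    A1 = sp.Matrix([[1, 0], [0, 1]]); b1 = sp.Matrix([2, -1])   # e.g., y = y + 2; z = z - 1
    A2 = sp.Matrix([[2, 0], [0, 1]]); b2 = sp.Matrix([0, 3])    # e.g., y = 2*y;   z = z + 3

    # ... collapse into the single statement x <- A*x + b
    A = A2 * A1
    b = A2 * b1 + b2

    x = sp.Matrix(sp.symbols('y z'))
    assert sp.expand(A2 * (A1 * x + b1) + b2 - (A * x + b)) == sp.zeros(2, 1)
    print(A, b)

Folding the affine representation of an inner loop into its parent's update works the same way, which is how a whole nest collapses to the single statement x = Afinal·x0 + bfinal mentioned earlier.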

Assume that in the example above, x ← Ak,2x + bk,2, x ← Our tool traces the expression lastrow-firstrow+1 back A x + b and x ← A x + b are affine representations k,3 k,3 k,4 k,4 to the program parameter row size = na or row size = of three nested loops. Then, the starting condition for the nprows na +1, depending on the process id. This reflects the fact third loop x ← Ak,4x + bk,4 is nprows that nprows may not divide na, where na is the problem size. x0,k+1,4 = Ak,3(Ak,2(Ak,1 ·xˆk(ik, x0,k−1)+bk,1)+bk,2)+bk,3. However, the value of rowstr(j) cannot be determined stat- Counting the number of iterations. We solve Equa- ically because it represents the location of the first nonzero value in row j of one of the program arrays. Our algorithm tion (2) using the appropriate counting function nr(x0,r) for each loop. The series of sums is formed according to the then represents the previous loop nest as: hierarchy of loops starting at the innermost loop. j=1; while (j<=row_size) {u; j ++;} 5. PRACTICAL CONSIDERATIONS We now briefly outline how the developed method can be u is treated as a loop with the fixed number of iterations used to assess work and depth of real applications. This u =irowstr(j+1)-rowstr(j). The total number of itera- paper is focusing on the fundamental techniques, yet, we tions of this code fragment for the most busy process is rep- want to provide a coarse view of how we apply our method resented as in practical settings.  na  The whole mechanism can be implemented in a source- N = u. code analysis tool or a compiler. We use the Low Level nprows Virtual Machine (LLVM [26]) and will outline the implemen- It is possible to provide the user with information about tation. LLVM’s internal program representation uses Single the exact code fragment that is the origin of u. Users can Static Assignment (SSA), which makes it simple to deter- then determine upper and lower bounds for each unknown mine loops (identified by back-edges), loop guards (identified parameter. by conditional branches with back-edges), and all dependent variables. From this information, we create initial assignments, loop 6. CASE STUDIES guards, and loop updates for each loop and apply the proce- In this section we present our results from analyzing sev- dure described in Section 4. While the vast majority of loops eral benchmarks. We compute work and depth of several in practical programs are affine, some loops may depend on parallel programs to demonstrate the insight we gained into more complex conditions and thus do not fit our framework. the bounds on parallel efficiency of those practical codes. However, one of the main strengths of our method is that Our analysis allows us to make statements about parallel we can still compute the number of iterations of loop nests efficiency without studying the problem or implementation. containing non-affine functions as we will describe in the We analyzed major loops of the NAS parallel benchmarks following section. version 3.2 [3] and the Mantevo micro applications version 2.0 [5]. We only present an interesting subset in our case 5.1 Extensions for non-affine loops studies due to space constraints. If a loop guard is not an affine function of iteration vari- In all the cases presented, no approximation was needed, ables and constant parameters then we may not be able to so presented results give exact number of iterations (with determine the exact number of iterations statically. Exam- respect to the introduced constants). 
If not stated other- ples are loops with iteration counts determined by unknown wise, in the following benchmarks m represents the prob- functions or complex sources like arrays that keep dynamic lem scale, n is a program parameter to NAS specifying data. This is often the case in applications that iterate until the number of iterations to perform, and p denotes num- a complex convergence criterion is reached, e.g., conjugate ber of processes. In some cases processes are arranged into gradient methods. If the loop exit conditions cannot be rep- multiple dimensions. In those cases, p1, p2, . . . , pk repre- resented as affine statements, then the whole block is treated sents number of processes in corresponding dimensions and Qk as a symbolic value u (undefined). p = i=1 pi. We also assume that the decomposition is A strength of our method is that it still solves the re- square, i.e., p1 ≈ p2 ≈ · · · ≈ pk. For easier work-depth anal- maining affine loop nests symbolically as u is simply treated ysis, presented results are simplified by replacing constant as a parameter that propagates while solving Equation 2. terms with auxiliary constants ci. We use constants instead In addition to just treating non-affine loops as new symbolic of asymptotic notation to retain lower-order terms. parameters, our tool enables the user to annotate such loops with affine upper and lower bound functions. 6.1 NAS Parallel Benchmarks: EP We demonstrate the technique with a loop that we found The NAS EP benchmark represents a typical Monte Carlo during one of our case studies. The following Fortran code simulation and thus performs nearly no communication. is extracted from the NAS CG benchmark. Our analysis found that only one out of seven loops could not be resolved due to a conditional goto statement, result- as u1=max_key_val-min_key_val+1; the second class iter- ing in a single u. The work of EP is ates over the buckets after redistribution and is represented as u =bucket_distrib_ptr2[k] - bucket_distrib_ptr1[k].  2m−16 · (u + 216)  2 N = . Both loop iteration counts depend on the structure of the in- p put and the distribution and can thus not easily be bounded The following listing shows the non-affine loop that deter- tightly. The total number of iterations of IS is mines u after removing all statements that do not influence   m   the iteration count: N = n 3(b + t) + 2 + p + 2 · u1 + 2 · u2 p u : do i=1 ,100 where b is the number of buckets, t is the size of the test ik =kk /2 array, and m is the number of keys to be sorted. The total if (ik .eq. 0) goto 130 work on one thread is kk=ik continue W = T1 = n(3(b + t) + 2m + 2u1 + 2u2 + 1).

A programmer can easily determine that u ≤ 100, which is The depth D = ∞ because the parallelization is not negligible compared to 216. Thus, we can approximate the work efficient which is due to the necessary inter-process work using one thread W = T ≈ 2m and depth D = T ≈ communications. The parallel efficiency of IS is Ep = 1 ∞  l m 1. This shows that the the parallelization is work-optimal 2 m c1/ p + c2 p + c3 p p , which drops quickly with the and the efficiency number of processes used. The reason is that each process 2m may need to communicate with each other process. The Ep ≈ . l 2m m available parallelism is also m in this case. p p
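The available-parallelism numbers quoted in these case studies follow directly from the counting functions: by the definition in the introduction, it is the largest number of processes for which Tp can still decrease. A small numeric sketch (not part of the authors' tool), shown for the tree-based reduction count N(n, p) = ⌈n/p⌉ + ⌈log2(p)⌉ from the introduction:

    import math

    def T(n, p):
        # iteration count of the most loaded process for the tree-based reduction
        return math.ceil(n / p) + math.ceil(math.log2(p))

    def available_parallelism(n):
        # largest p that still attains the minimum of Tp, i.e., beyond which
        # adding processes can no longer decrease Tp
        best, best_p = T(n, 1), 1
        for p in range(2, n + 1):
            if T(n, p) <= best:
                best, best_p = T(n, p), p
        return best_p

    print(available_parallelism(2**16))   # 65536, i.e., the available parallelism equals n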

m m m 6.4 Mantevo Benchmarks: CoMD This means that Ep ≈ 1 if p . 2 and Ep ≈ 2 /p if p & 2 . We conclude that the maximum available parallelism in EP The Mantevo CoMD benchmark represents a classical m m molecular dynamics simulation. Eight out of 18 analyzed is 2 , because N cannot further decrease for p & 2 . This does not mean that the code will efficiently execute with 2m loops contain non-affine statements. The code distributes tasks, however, it specifies an upper bound to the speedup. atoms to processes. The first class updates atoms in the par- This is an expected result since the code is considered “em- titions and u1 represents the number of iterations of those barrassingly parallel”. loops. 6.2 NAS Parallel Benchmarks: CG u1 : while (inAtoms[iBox]) { int jBox=getBox(atoms->r[iOff+i]); The NAS CG program represents a typical conjugate gra- if (jBox!=iBox) moveAtom(i,iBox,jBox); dient solver. Our tool found that only 2 out of 23 analyzed else ++i;} loops were not affine (cf. Section 5.1). The two undefined loops had identical guards, resulting in a single parameter u The second class u2=qsort(nAtoms[iBox]) is limited by that can be bounded as 0 ≤ u ≤ m. This allows us to bound the data sizes to be sorted. The number of iterations of the the number of iterations CoMD Benchmark is s !  m   m  √   m   m    m   N . n g + (6 + 5g) + (3g + 4) log2( p) N = n g(B + 3) + g T u + u + + 2 p p p p 1 2 p B where g is the program parameter cgitmax. We√ can approx- where B is the fixed amount of atoms in each box, g is the imate work on one thread W = T1 . (g m + m(5g + 6))n. print rate program parameter, and T is the total number However, CG is not work optimal as the parallel work of boxes: is monotonically growing. This causes the depth to be √ √ √ √  3 m   3 m 3 m  3 m2 m D = T∞ = ∞. If we treat the problem size m as constant, T = 2 + 2 + + 2 + + . then CG’s parallel efficiency is p1 p2 p3 p2 p3 p1 p2 p3 s !−1 2 √  m   m  If we bound u1 < B and u2 < B log2(B), then we can ap- E = c c p log ( p) + c + c . 2/3 1/3 p 1 2 2 3 p 4 p proximate work and depth: W . c1 m+c2 m +c3 m + c4, and D . n(c5 + c6 B(log2(B))). The implementation This means that the work per process increases with is work-optimal and the efficiency Ep . c7/ (p + c8) is de- √ √ creasing with number of processors, which is the result of log2( p) due to a parallel reduction among p processes. The available parallelism is m. sequential parts of the program that cannot be parallelized. The available parallelism is m. 6.3 NAS Parallel Benchmarks: IS The NAS IS program represents a parallel bucket sort Scalability Analysis. algorithm. In each iteration, all processes perform their We were able to determine bounds for work, depth, paral- local sorting and exchange information. The communica- lel efficiency, and available parallelism for several real-world tion overhead is expected to grow with the number of pro- applications. We see that the available parallelism in all in- cesses. A total of 5 out of 15 analyzed loops were not vestigated applications scales linearly with the input prob- affine. Those five loops fall in two classes: the first class lem size. While this suggests good scaling, we show that for iterates over maximum and minimum key value represented CG and IS communication overheads increase the work with the number of processes. For example, for IS, this overhead constant updates such as x = 2 ∗ x. To the best of our grows linearly with the number of used processes. 
knowledge, no previous work handled such cases properly. We were able to perform those analyses by pure code in- Methods like the one proposed by Blanc et al. [11] require trospection which was guided by our tool without requiring explicitly that loops cannot include such statements. knowledge of the implemented methods or algorithms. If the Other works utilize dynamic approaches to extrapolate solved problem is known, then one could even proof opti- program performance and assess scalability in practical set- mality in terms of parallel efficiency or available parallelism. tings. Barnes et al. [4] use regression to linear and logarith- However, this is outside the scope of this paper. mic functions to predict scalability of nearly linear-scaling HPC applications. Calotoiu et al. [14] select a scaling model 7. DISCUSSION from a set of predefined candidate functions and fit the pa- rameters with regression. Other works, such as [20, 27] use We now discuss the limitations of our approach and briefly multi-layer neural networks or statistical techniques to pre- outline potential additional use-cases. dict scalability. More complex performance prediction frameworks con- Limitations. sider the effect of communication [15, 29] and extrapolate Since our analysis only counts loop iterations and does not single-node runs [36]. Partial execution [35] can improve account for the exact costs of each loop, we can only provide those techniques. Other studies provide advice for modeling bounds on the expected execution time on a parallel system. the general performance [21] and scalability [19] of parallel However, those bounds are always asymptotically correct. applications. In addition, many application-specific studies Since we limit ourselves to the work and depth model, we exist but cannot be generalized [7, 24]. cannot account for communication or synchronization over- We extend previous work significantly in two directions: heads in real codes. Yet, the bounds we provide are useful first, we show a technique that can tightly bound the num- to determine the relative behavior of work and depth and bers of iterations of arbitrary affine loop nests and second, allow us to reason about exposed parallelism and parallel we show how this method can be used to assess work and efficiency just like the work and depth model. depth of parallel applications.

Extending the Models. 9. CONCLUSIONS While outside the scope of this paper, it is simple to ex- tend our work and depth models to account for system pa- We show a method to symbolically count loop iterations rameters such as memory or network latency and bandwidth. of practical codes in terms of their input sizes. Our method Blelloch [12] discusses further options. provides either an exact solution or tight upper and lower bounds using bounded sum approximation. It is applica- Model-based Mapping to Heterogeneous Systems. ble to affine and non-affine loops. While it can bound all Having a model for the work and depth of each loop in a affine loops accurately, it handles non-affine loop counts as program can be useful when the program is to be mapped to a symbolic constant and allows the user to provide lower and future heterogeneous architectures. Those systems will most upper bounds. likely contain Latency Compute Units (LCU, cf. today’s We show how to derive parallel work and depth from the CPU cores) and Throughput Compute Unites (TCU, cf. to- loop count models. Using the work and depth model we ap- day’s accelerators such as GPUs or Xeon Phi). A compiler proximate bounds on the parallel efficiency of those codes. would need to determine the target architecture for each This technique allows us to specify upper bounds to scalabil- loop statically. It could use the generated work/depth mod- ity of practical parallel codes. In general, our method allows els to assign code pieces with low parallelism (small W/D) to a developer to quickly check how an explicitly parallel code LCUs and code pieces with larger parallelism (large W/D) scales with the numbers of processes and input sizes. to TCUs. We are applying a standard algorithmic analysis technique (measuring work and depth) to real source codes. Our devel- oped techniques pave the way to quickly and automatically 8. RELATED WORK assess program scalability and will thus quickly become an Counting loop iterations and assessing scalability of par- important tool for future parallel application development allel codes are important research problems. Rodriguez- and analysis. Carbonell and Kapur [31] find polynomial loop invariants using an algebraic approach. Sharma et al. [32] use a data Acknowledgments driven approach to iteratively guess the correct polynomial We thank the anonymous reviewers for outstandingly helpful loop invariant and then check its correctness, and Matringe comments to improve the quality of the paper. et al. [30] generate loop invariants also for non-linear differ- ential systems. Loop invariants can be used to bound loop References iteration counts but the resulting bounds are often not tight. [1] C. Alias, A. Darte, P. Feautrier, and L. Gonnord. Multi- Multiple research groups use the polyhedral model (PM) dimensional rankings, program termination, and com- to determine the exact number of loop iterations [1, 8, 37]. plexity bounds of flowchart programs. In Static Analy- In this case, the number of iterations can be approxi- sis, volume 6337 of Lecture Notes in , mated by counting integer points in that polyhedron using a pages 117–133. Springer, 2011. polynomial-time algorithm [6]. The PM is widely used, not [2] G. M. Amdahl. Validity of the single processor ap- proach to achieving large scale computing capabilities. only in loop analysis [9]. However, it has a serious limitation In Proc. 
of the April 18-20, 1967, Spring Joint Com- - it requires that the loop update function can be expressed puter Conference, AFIPS ’67 (Spring), pages 483–485, in Presburger arithmetic and thus cannot deal with non- New York, NY, USA, 1967. ACM. [3] D. H. Bailey et al. The NAS Parallel Benchmarks: Sum- [21] R. Jain. The Art Of Computer Systems Performance mary and Preliminary Results. In Proc. of the 1991 Analysis: Techniques for Experimental Design, Mea- ACM/IEEE Conference on Supercomputing, pages 158– surement, Simulation, and Modeling. Wiley, 2008. 165, New York, NY, USA, 1991. ACM. [22] J. J´aJ´a. An Introduction to Parallel Algorithms. Ad- [4] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, dison Wesley Longman Publishing Co., Inc., Redwood B. de Supinski, and M. Schulz. A regression-based ap- City, CA, USA, 1992. proach to scalability prediction. In Proc. of the 22nd [23] R. M. Karp and V. Ramachandran. Handbook of The- Annual Intl. Conference on Supercomputing, ICS ’08, oretical Computer Science (Vol. A). chapter Parallel pages 368–377, New York, NY, USA, 2008. ACM. Algorithms for Shared-memory Machines, pages 869– [5] R. F. Barrett, M. A. Heroux, P. T. Lin, C. T. Vaughan, 941. MIT Press, Cambridge, MA, USA, 1990. and A. B. Williams. Poster: Mini-applications: Vehicles [24] D. J. Kerbyson et al. Predictive performance and scal- for co-design. In Proc. of the 2011 High Performance ability modeling of a large-scale application. In Proc. Computing Networking, Storage and Analysis Compan- of the 2001 ACM/IEEE conference on Supercomputing, ion, SC ’11, pages 1–2. ACM, 2011. pages 37–37. ACM, 2001. [6] A. I. Barvinok. A polynomial time algorithm for count- [25] D. E. Knuth. Sorting and Searching, The Art of Com- ing integral points in polyhedra when the dimension is puter Programming, Volume 3: (2Nd Ed.) Sorting and fixed. Math. Oper. Res., 19(4):769–779, Nov. 1994. Searching. Addison Wesley Longman Publishing Co., [7] G. Bauer, S. Gottlieb, and T. Hoefler. Performance Inc., Redwood City, CA, USA, 1998. Modeling and Comparative Analysis of the MILC Lat- [26] C. Lattner and V. Adve. LLVM: A Compilation Frame- tice QCD Application su3 rmd. In Proc. of the 2012 work for Lifelong Program Analysis & Transformation. 12th IEEE/ACM Intl. Symposium on Cluster, Cloud In Proc. of the Intl. Symposium on Code Generation and Grid Computing, pages 652–659, May 2012. and Optimization: Feedback-directed and Runtime Op- [8] A. M. Ben-Amram and S. Genaim. On the linear rank- timization, CGO ’04, Washington, DC, USA, 2004. ing problem for integer linear-constraint loops. SIG- IEEE Computer Society. PLAN Not., 48(1):51–62, Jan. 2013. [27] B. C. Lee, D. M. Brooks, B. R. de Supinski, M. Schulz, [9] M.-W. Benabderrahmane, L.-N. Pouchet, A. Cohen, K. Singh, and S. A. McKee. Methods of inference and and C. Bastoul. The Polyhedral Model Is More Widely learning for performance modeling of parallel applica- Applicable Than You Think. In R. Gupta, editor, Com- tions. In Proc. of the 12th ACM SIGPLAN Symp. piler Construction, volume 6011 of LNCS, pages 283– on Principles and Practice of Parallel Programming, 303. Springer, 2010. PPoPP ’07, pages 249–258, 2007. [10] L. Bernardin. A review of symbolic solvers. SIGSAM [28] C. E. Leiserson, R. L. Rivest, C. Stein, and T. H. Cor- Bull., 30(1):9–20, Mar. 1996. men. Introduction to Algorithms. The MIT press, 2001. [11] R. Blanc, T. Henzinger, T. Hottelier, and L. Kovacs. [29] G. Marin and J. Mellor-Crummey. Cross-architecture ABC: Algebraic Bound Computation for Loops. 
In performance predictions for scientific applications using E. Clarke and A. Voronkov, editors, Logic for Pro- parameterized models. In Proc. of the Joint Int’l Conf. gramming, Artificial Intelligence, and Reasoning, vol- on Measurement and Modeling of Computer Systems, ume 6355 of LNCS, pages 103–118. Springer, 2010. SIGMETRICS ’04, pages 2–13. ACM, 2004. [12] G. E. Blelloch. Programming parallel algorithms. Com- [30] N. Matringe, A. Moura, and R. Rebiha. Generating mun. ACM, 39(3):85–97, Mar. 1996. invariants for non-linear hybrid systems by linear al- [13] R. P. Brent. The parallel evaluation of general arith- gebraic methods. In Static Analysis, volume 6337 of metic expressions. J. ACM, 21(2):201–206, Apr. 1974. LNCS, pages 373–389. Springer, 2011. [14] A. Calotoiu, T. Hoefler, M. Poke, and F. Wolf. Us- [31] E. Rodr´ıguez-Carbonell and D. Kapur. Automatic gen- ing automated performance modeling to find scalability eration of polynomial loop invariants: Algebraic foun- bugs in complex codes. In Proc, of SC13: Int’l Conf. dations. In Proc. of the 2004 Intl. Symposium on Sym- for High Performance Computing, Networking, Storage bolic and Algebraic Computation, ISSAC ’04, pages and Analysis, SC ’13, pages 45:1–45:12. ACM, 2013. 266–273, New York, NY, USA, 2004. ACM. [15] L. Carrington, A. Snavely, and N. Wolter. A perfor- [32] R. Sharma, S. Gupta, B. Hariharan, A. Aiken, P. Liang, mance prediction framework for scientific applications. and A. Nori. A data driven approach for algebraic loop Future Gener. Comput. Syst., 22(3):336–346, Feb. 2006. invariants. In M. Felleisen and P. Gardner, editors, [16] Z. DeVito et al. Liszt: A Domain Specific Language for Progr. Languages and Systems, volume 7792 of LNCS, Building Portable Mesh-based PDE Solvers. In Proc. of pages 574–592. Springer, 2013. 2011 Intl. Conference for High Performance Comput- [33] Y. Shiloach and U. Vishkin. An O(n2 log n) Parallel ing, Networking, Storage and Analysis, SC ’11, pages Max-flow Algorithm. J. Alg., 3(2):128–146, Feb. 1982. 9:1–9:12, New York, NY, USA, 2011. ACM. [34] A. M. Turing. On Computable Numbers, with an Ap- [17] E. W. Dijkstra. Structured programming. chapter plication to the Entscheidungsproblem. Proc. London Notes on Structured Programming, pages 1–82. Aca- Math. Soc., 2(42):230–265, 1936. demic Press Ltd., London, UK, UK, 1972. [35] L. T. Yang, X. Ma, and F. Mueller. Cross-platform per- [18] T. Goodale, G. Allen, G. Lanfermann, J. Mass´o, formance prediction of parallel applications using par- T. Radke, E. Seidel, and J. Shalf. The cactus frame- tial execution. In Proc. of the 2005 ACM/IEEE Con- work and toolkit: Design and applications. In High Per- ference on Supercomputing, SC ’05, Washington, DC, formance Computing for Computational Science, pages USA, 2005. IEEE Computer Society. 197–227. Springer, 2003. [36] J. Zhai, W. Chen, and W. Zheng. Phantom: Predict- [19] T. Hoefler, W. Gropp, M. Snir, and W. Kramer. Per- ing performance of parallel applications on large-scale formance Modeling for Systematic Performance Tuning. parallel machines using a single node. In Proc. of the In Intl. Conference for High Performance Computing, 15th ACM Symp. on Principles and Practice of Parallel Networking, Storage and Analysis (SC’11), Nov. 2011. Progr., PPoPP ’10, pages 305–314. ACM, 2010. [20] E. Ipek, B. R. de Supinski, M. Schulz, and S. A. Mc- [37] F. Zuleger, S. Gulwani, M. Sinn, and H. Veith. Bound Kee. 
An approach to performance prediction for par- analysis of imperative programs with the size-change allel applications. In Proc. of the 11th Intl. Euro-Par abstraction. In E. Yahav, editor, Static Analysis, vol- Conference on Parallel Processing, Euro-Par’05, pages ume 6887 of Lecture Notes in Computer Science, pages 196–205. Springer, 2005. 280–297. Springer Berlin Heidelberg, 2011.