Automatic Complexity Analysis of Explicitly Parallel Programs
Torsten Hoefler, ETH Zurich, Zurich, Switzerland ([email protected])
Grzegorz Kwasniewski, ETH Zurich, Zurich, Switzerland ([email protected])

ABSTRACT

The doubling of cores every two years requires programmers to expose maximum parallelism. Applications that are developed on today's machines will often be required to run on many more cores. Thus, it is necessary to understand how much parallelism codes can expose. The work and depth model provides a convenient mental framework to assess the required work and the maximum parallelism of algorithms and their parallel efficiency. We propose an automatic analysis to extract work and depth from source code. We do this by statically counting the number of loop iterations depending on the set of input parameters. The resulting expression can be used to assess work and depth with regard to the program inputs. Our method supports the large class of practically relevant loops with affine update functions and generates additional parameters for other expressions. We demonstrate how this method can be used to determine work and depth of several real-world applications. Our technique enables us to prove whether the theoretically maximum parallelism is exposed in a practical implementation of a problem. This will be most important for future-proof software development.

Categories and Subject Descriptors

F.2 [Analysis of Algorithms and Problem Complexity]: general

General Terms

Theory

Keywords

Loop iterations; Polyhedral model; Work and depth analysis

1. INTRODUCTION

Parallelism in today's computers is still growing exponentially, currently doubling approximately every two years. This implies that programmers need to expose exponentially growing parallelism to exploit the full potential of the architecture. Parallel programming is generally hard, and practical implementations may not always expose enough parallelism to be considered future-proof. This is exacerbated by continuous application development and the fact that applications are developed on systems with significantly lower core counts than their production environments. Thus, it is increasingly important that programmers understand bounds on the scalability of their implementation.

Parallel codes are manifold, and numerous programming frameworks exist to implement parallel versions of sequential codes. We define the class of explicitly parallel codes as applications that statically divide their workload into several pieces which are processed in parallel. Explicitly parallel codes are the most prevalent programming style in large-scale parallelism, using the Pthreads, OpenMP, Message Passing Interface (MPI), Partitioned Global Address Space (PGAS), or Compute Unified Device Architecture (CUDA) APIs. Many high-level parallel frameworks (e.g., [18]) and domain-specific languages (e.g., [16]) compile to such explicitly parallel languages.

The work and depth model is a simple and effective model for parallel computation. It models computations as vertices and data dependencies as edges of a directed acyclic graph (DAG). The total number of vertices in the graph is the total work W, and the length of a longest path is called the depth D (sometimes also called span). We will now describe more properties of the model and possible analyses.
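To make these definitions concrete, the following small C sketch (an illustration added here, not part of the paper; the DAG and its edge list are made up) computes W and D for a hand-coded dependency graph. Vertices are assumed to be numbered in topological order and the edge list is sorted by source vertex, so a single forward dynamic-programming pass suffices.

    #include <stdio.h>

    #define W 6                                /* total work: 6 unit-time operations    */
    static const int edges[][2] = {            /* data dependencies u -> v, sorted by u */
        {0, 2}, {1, 2}, {2, 4}, {3, 4}, {4, 5}
    };
    #define E ((int)(sizeof(edges) / sizeof(edges[0])))

    int main(void) {
        int len[W], D = 1;
        for (int v = 0; v < W; v++) len[v] = 1;   /* a path ending in v has >= 1 vertex */
        for (int e = 0; e < E; e++) {             /* relax edges in topological order   */
            int u = edges[e][0], v = edges[e][1];
            if (len[u] + 1 > len[v]) len[v] = len[u] + 1;
        }
        for (int v = 0; v < W; v++)
            if (len[v] > D) D = len[v];           /* D = vertices on a longest path     */
        printf("work W = %d, depth D = %d\n", W, D);  /* prints W = 6, D = 4 */
        return 0;
    }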
1.1 Work and depth and parallel efficiency

In practice, analyses are often not used to predict exact running times of an implementation on a particular architecture. Instead, they often determine how the running time behaves with regard to the input size. The work and depth model links sequential running time and parallelism elegantly. The work W is proportional to the time T1 required to compute the problem on a single core. The depth D is the longest sequential chain and thus proportional to a lower bound on the time T∞ required to compute the problem with an infinite number of cores.

Work and depth models are often used to develop parallel algorithms (e.g., [33]) or to describe their properties (e.g., in textbooks [22, 23]). Those algorithms are then often adapted in practical settings. We propose to use the same model, somewhat in the inverse direction, to analyze existing applications for bounds on their scalability and available parallelism. Our results can also be used to prove an implementation asymptotically optimal with regard to its parallel efficiency if bounds on work and depth of the problem are known. In our analysis we use the assumption from [13] that all operations are performed in unit time and that the time required for accessing data, storing results, etc., is ignored.

Brent's lemma [13] bounds running times on p cores with W/p ≤ Tp ≤ W/p + D. Here, D measures the sequential part of the calculation and is equivalent to the time t needed to perform the computation with a sufficient number of processes, W is equivalent to the number of operations q in Brent's notation, and B = D/W is a lower bound on the sequential fraction that limits the returns from adding more cores. Applying Amdahl's law [2] shows that the speedup is limited to Sp = T1/Tp ≤ 1/(B + (1−B)/p). If we consider the parallel efficiency Ep = Sp/p, then we can bound the maximally achievable efficiency using the work and depth model as Ep = T1/(p·Tp) ≤ 1/(1 + B(p−1)). We observe that, for fixed B, Ep → 0 as p → ∞, such that every fixed-size computation can only utilize a limited number of cores efficiently, i.e., with Ep ≥ 1 − ε.

This observation allows us to define available parallelism and good scaling in terms of ε as the maximum number of processes p for which Tp may decrease. Bounds on work and depth for certain problems also allow us to differentiate between a problem that is hard to parallelize (e.g., depth-first search (DFS)) and a suboptimal parallelization; we can also define the distance of a given parallel code to a parallelism-optimal solution.

Work and depth are typically functions of the input size. In structured programming [17], loops and recursion are the only techniques to increase the work depending on program input parameters. Here, we focus on loops only, and we assume that each program can be abstracted as a set of loops that determine the number of executions of each statement. We model each statement as a work item that takes unit time. To simplify the explanation further, we also assume that there is only one statement in each loop (since all statements will have identical iteration counts). Now, the problem of determining the work is equivalent to determining the loop iteration counts for each statement. The depth is relative to a special parameter p that represents the number of processes. We now discuss a simple motivating example.

Example I: Parallel sorting skeleton.

Assume the following loop is executed by p > 0 processes (p equally divides n, n > 0, and n is a power of 2):
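The loop listing itself was lost in extraction; the sketch below shows one loop nest, executed by every process (id is the process number, 0 ≤ id < p), that is consistent with the per-process iteration count derived next. The paper's exact listing may differ.

    /* Hedged reconstruction, not the paper's verbatim code: the outer loop
     * runs log2(n) stages; in each stage every process executes statement S1
     * on its n/p elements, giving N(n,p) = (n/p)*log2(n) per process. */
    for (s = 1; s < n; s *= 2)                             /* log2(n) stages      */
      for (i = id * (n / p); i < (id + 1) * (n / p); i++)  /* n/p iterations each */
        S1;                                                /* unit-time statement */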
The number of iterations of statement S1 in this loop (depending on the parameters n and p) can be counted as

    N(n,p) = Tp = (n/p) · log2(n).

From N(n,p) we can determine the total work and depth:

    W(n) = N(n,1) = T1 = n · log2(n)
    D(n) = N(n,n) = T∞ = log2(n).

The parallelization is work-conserving and the parallel efficiency is Ep = 1. If this loop implements parallel sorting, then our analysis shows that it is asymptotically optimal in work and depth [25], and thus exposes maximum parallelism. In this paper, we will show how to perform this analysis automatically.

Example II: Parallel reductions.

Our second example illustrates a common problem in parallel shared-memory codes: reductions. Programmers often employ inefficient algorithms because efficient tree-based schemes are significantly harder to implement. A sequential reduction would be implemented as follows (addition operations on the variable sum are performed atomically):

    sum = 0;
    for (i = 0; i < n; i++) sum = sum + a[i];

A simple parallelization (assuming n > p) would be

    for (i = id*n/p; i < min((id+1)*n/p, n); i++) s[id] += a[i];
    for (i = 0; i < p; i++) sum = sum + s[i];

where id is the thread number and s is an array of size p for keeping the partial sums of each thread. The total number of iterations of the most loaded process is N(n,p) = Tp = ⌈n/p⌉ + p and the efficiency is Ep = (n+1)/(p⌈n/p⌉ + p²). This implementation is not work-efficient because the lower bound is Tp = Ω(n/p + log2(p)), and the efficiency decreases with p². The lower bound can be achieved if we combine the partial results of the sum in a tree structure:

    for (i = id*n/p; i < min((id+1)*n/p, n); i++) s[id] += a[i];
    for (i = 1; i < p; i *= 2) combine_partial_sums(s);

with the iteration count of the most loaded process N(n,p) = Tp = ⌈n/p⌉ + ⌈log2(p)⌉.
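The helper combine_partial_sums is left undefined above. One hypothetical shape of a single combining stage (with the thread id, the thread count p, and the current stride i of the outer loop passed explicitly, and a barrier assumed between stages) is sketched below.

    /* Hypothetical sketch of one tree-combining stage; not from the paper.
     * In the stage with the given stride, every thread whose id is a multiple
     * of 2*stride adds its right neighbor's partial sum into its own slot.
     * A barrier between stages ensures s[id + stride] is final before it is
     * read. After the last stage s[0] holds the total; each thread executes
     * ceil(log2(p)) stages, matching the second term of N(n,p). */
    void combine_partial_sums(double *s, int id, int p, int stride) {
        if (id % (2 * stride) == 0 && id + stride < p)
            s[id] += s[id + stride];
        /* barrier();  -- synchronization between stages is assumed */
    }

Under this scheme each thread touches at most one element of s per stage, so the per-thread cost of the combining phase is ⌈log2(p)⌉, as stated.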