37th International Conference on Parallel Processing

Application of Automatic Parallelization to Modern Challenges of Scientific Computing Industries

Brian Armstrong and Rudolf Eigenmann
School of Electrical and Computer Engineering
Purdue University, West Lafayette, IN 47907-1285

Abstract

Automatic parallelization has proven successful at discovering loop-based parallelism in CPU kernel codes and libraries of vector utilities. Previous work achieved speedups of four on average when CPU kernel codes and linear algebra libraries were executed on eight-processor systems [6], [4], [2]. Characteristics of full applications found in scientific computing industries today lead to challenges that are not addressed by state-of-the-art approaches to automatic parallelization. These characteristics are not present in CPU kernel codes or linear algebra libraries, requiring a fresh look at how to make automatic parallelization apply to today's computational industries using full applications.

The challenges to automatic parallelization result from software engineering patterns that implement multifunctionality, reusable execution frameworks, data structures shared across abstract programming interfaces, and a multilingual code base for a single application, and from the observation that full applications demand more from compile-time analysis than CPU kernel codes do. Each of these challenges has a detrimental impact on the compile-time analysis required for automatic parallelization.

Then, focusing on a set of target loops that are parallelizable by hand and that result in speedups on par with the distributed parallel version of the full applications, we determine the prevalence of a number of issues that hinder automatic parallelization. These issues point to enabling techniques that are missing from the state of the art.

In order for automatic parallelization to become utilized in today's scientific computing industries, the challenges described in this paper must be addressed.

[Figure 1: elapsed seconds of the serial, MPI, OpenMP, and Polaris versions for the data generation, stacking, 3D FFT, and finite difference components, on the SMALL (up to 1,400 s) and MEDIUM (up to 35,000 s) datasets.]
Figure 1. Measured Performance Achieved by Automatic Parallelization of SEISMIC.

1. Failure to Address Today's Computational Needs

Interest in automatic parallelization has been piqued by the impact that multicore computer systems have on today's market, which broadens access to parallel systems, and by the growth of data, which requires parallelization for efficient execution. Consequently, there is current incentive to apply automatic parallelization to real applications used in industry (referred to as full, industrial-grade applications) and achieve effective speedups.

However, applying automatic parallelization to full applications typical of industrial-grade codes reveals characteristics that pose additional challenges, not addressed by traditional automatic parallelization techniques and not evident in common CPU kernels or libraries of linear algebra utilities. Using Polaris, a state-of-the-art automatic parallelizing compiler, speedups of four on

0190-3918/08 $25.00 © 2008 IEEE 279 DOI 10.1109/ICPP.2008.65

eight-processor machines can be achieved with CPU kernel codes and small applications, such as several of the PERFECT BENCHMARKS. Yet Polaris is not effective in finding significant parallelism in SEISMIC, a seismic processing application suite that mimics applications found in the seismic processing industry.

Figure 1 shows the performance of SEISMIC on a four-processor machine, comparing speedups from automatic parallelization by Polaris with speedups from manual parallelization using MPI, which can be considered the ideal in this case. The performance due to manual parallelization of outer loops using OpenMP represents what is feasible to achieve with loop parallelization, showing that loop parallelization can achieve speedups on par with the ideal case using MPI. Each of the two charts compares the performance of four compiled versions of a seismic processing application. The serial version is the base case, run on one processor of a four-processor machine. The "MPI" label identifies the manually parallelized version of the code, which was run on four processors using MPI. The "OpenMP" version includes manual parallelization of the outermost parallel loops using OpenMP directives, which represents a target for automatic parallelization. The "Polaris" version represents the state of the art in automatic parallelization. Data generation, stacking, 3D FFT, and finite difference are four components of the application suite, representing different phases of seismic processing.

Only simple loops were parallelized by Polaris, which is evident from the fact that parallelization overhead is greater than any speedups obtained from running SEISMIC on four processors. The MEDIUM dataset is an order of magnitude larger in terms of the memory required than SMALL, revealing that the poor performance of automatic parallelization is not an artifact of the data size being small. Also, the performance trends among the four versions of parallelization are consistent across each of the four components of the seismic processing application suite. Each component involves different computations, algorithms, and code, and each could be run as a separate executable, implying that the challenges faced by automatic parallelization apply to a variety of cases. In brief, the figure indicates that automatic parallelization is not able to achieve speedups comparable to manual parallelization, even when only the best loop-based parallelism is used as a basis or when larger datasets are used.

The contribution of this paper is to answer the question of how automatic parallelization applies to industrial-grade applications. Even though automatic parallelization is a mature field, most work has been evaluated with small applications and CPU kernel codes. Our work identifies challenges and characteristics of full applications that CPU kernel codes do not exhibit but that must be addressed in order for automatic parallelization to apply to today's computational industries.

The following section uses applications representative of codes in use by scientific computing industries today to identify characteristics of full applications that lead to challenges to automatic parallelization. The codes we consider are SEISMIC, GAMESS, and SANDER¹, representing computational challenges of scientific computing industries in seismic processing, quantum dynamics simulation, and molecular dynamics simulation, respectively. We identify challenges that are not present in CPU kernel codes or linear algebra libraries.

Each of the identified challenges has a detrimental impact on the compile-time analysis required for automatic parallelization. Section 3 provides further insight from observing specific target loops in full applications, revealing enabling techniques that are missing from state-of-the-art automatic parallelizing compilers, and how significant each issue is in terms of the number of target loops that would benefit.

2. Unique Challenges in Today's Industry

Codes used in the scientific computing industry have characteristics that are different from the CPU kernel codes and libraries of linear algebra utilities commonly used as benchmarks. Industrial-grade application suites typically include utilities to access file systems, multiple versions of a code for addressing various types of distributed and parallel machines, elaborate input datasets, and many configuration parameters that enable user selection of various functionalities.

The impacts that the key differences between industrial-grade applications and CPU kernel codes have on automatic parallelization are described in the following sections.

2.1. Multifunctionality

Industrial-grade application suites typically encompass a variety of approaches to the same problem. The choice of which techniques to apply is left to the user because it depends on characteristics of the data set and the type of analysis the user desires to perform on the results.

For example, consider the option in SANDER to perform minimization or molecular dynamics. To choose to do minimization, a user sets imin=1 in an input file which is read by SANDER at runtime. Within the main program, the imin parameter determines which subroutines to call. SANDER allows many other options that impact which subroutines are called.

Similar multifunctionality exists in SEISMIC and GAMESS. SEISMIC allows users to choose which computational modules to use and the order in which to apply the modules.

¹ SANDER is one component of AMBER that is written in FORTRAN 77 and encompasses the computationally intensive routines.
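The kind of runtime dispatch just described for SANDER can be sketched as follows (an illustrative C fragment with invented names; SANDER itself is FORTRAN 77). The option value is known only after the input file is read, so a compiler must treat every branch, and therefore every callee's side effects, as reachable:

```c
/* Illustrative sketch (invented names): the imin option is read from
 * an input file at runtime, so compile-time analysis must assume that
 * either computational path may execute. */
static int do_minimization(void)       { return 1; }  /* stand-in for the minimization routines */
static int do_molecular_dynamics(void) { return 2; }  /* stand-in for the MD routines */

int run_step(int imin)
{
    if (imin == 1)                     /* user selected minimization */
        return do_minimization();
    return do_molecular_dynamics();    /* otherwise: molecular dynamics */
}
```

A parallelizer that cannot prove which branch executes must merge the side effects of all possible callees, which is exactly the loss of precision discussed in Section 2.1.1 below.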


Users can select the wavefunction to use in GAMESS from choices such as restricted or unrestricted Hartree-Fock, restricted open-shell Hartree-Fock, generalized valence bond, and multiconfigurational self-consistent field.

2.1.1 Impact on Automatic Parallelization

For multifunctional applications, the compiler must assume that all combinations of choices are possible, since the compiler cannot determine which of the choices a user will select. Because the choice of which computational modules to use is determined by the user at runtime, compile-time transformations and analyses must consider all of the choices to be possible. Analysis that does not take conditionals into account must approximate to address all possible conditions, which leads to more general but less precise analysis.

Analysis that incorporates conditionals, such as an analysis algorithm that uses Guarded Array Regions [7] or the Gated Single Assignment representation [5], [9], is able to address some cases of multifunctionality, but it multiplies the amount of analysis required with every conditional block whose condition depends on one of the variables used to indicate a selection of the application's options.

Since multifunctionality multiplies the amount of compile-time analysis required as the compiler attempts to account for many possible control flow paths, precise analysis can become infeasible in terms of the analysis techniques required to compare array access patterns across control flow paths, even though the paths may never be taken within the same execution in practice. Consequently, the compiler makes conservative assumptions, which can reduce its ability to find significant parallelism.

2.2. Reuse of the Enclosing Framework

To enable reuse and iterative development without requiring recoding of the application's execution framework, a layer of abstraction is added between the main execution process and the subprocesses containing the computational techniques.

Industrial-grade applications are often made to be extensible and reusable. SEISMIC facilitates extensibility by providing templates for linking the computations of a module to the execution framework of the application suite. Every computational module in SEISMIC has a preparatory routine that abides by the template MODULEPREP(ldim,maxtrc,nra,nsa). The values of the parameters of MODULEPREP are set within the code of MODULEPREP. The variable ldim is the size of the leading dimension of the multidimensional array that will hold seismic traces for this module. The variable maxtrc is the maximum number of traces (dependent on the size of the input data set and on whether the computational module operates on each individual trace or an aggregation of traces). The variable nra is the amount of persistent storage this module needs, and nsa is the amount of scratch space this module needs.

A SEISMIC module also has several computational subroutines that are called in sequence with subroutines of the other modules. The computational module has a similar template to MODULEPREP but also includes the arrays and the number of seismic traces passed in (ntri) and the number that this module will produce (ntro), as in MODULECOMP(...,otra,ra,sa,ntri,ntro). Arrays OTR, RA, and SA are allocated outside of the code for the module. Array OTR is used for the input data. Array RA is for persistent data. Array SA is scratch space.

2.2.1 Impact on Automatic Parallelization

Due to a layer of abstraction, compiler analyses and transformations must function with limited knowledge of the control flow across computational modules. The compiler can determine the control flow for portions of the code, such as the control flow within the code of a computational module in SEISMIC, but it is not feasible for the compiler to determine the full global control flow of a large application suite, since any computational module may follow any other. Without control flow information, analysis techniques, such as dataflow analysis, cannot be performed across a layer of abstraction.

2.3. Shared Data Structures

Data structures can be shared across the layer of abstraction mentioned above. Consequently, the same data structures, allocated in outer contexts, may be used by multiple computational modules to house different types of data. The computational modules employed by an industrial application suite can be diverse in terms of their data access patterns and the types of data they operate on.

A shared data structure holding wave function data is used in GAMESS by the computational modules that perform the wave function calculation chosen by the user. As mentioned above, possible wave function calculations include restricted and unrestricted Hartree-Fock, restricted open-shell Hartree-Fock, generalized valence bond, and multiconfigurational self-consistent field. The choice of wave function calculation, and additional features of either the dataset or the preferences of the user, impact the size and shape the shared data structure must have to facilitate the selected wave function computations.
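A purely illustrative C sketch of this shared-workspace pattern (all names invented; the real modules are FORTRAN 77): one flat, framework-allocated buffer is given a different shape and element interpretation by each module, so the buffer's declaration alone says nothing about how the memory will be accessed:

```c
#include <string.h>

#define NRA 64
static double ra[NRA];   /* framework-owned persistent workspace (invented name) */

/* One module treats ra as pairs of (real, imaginary) values,
 * i.e. a 2-D view of complex numbers.  Assumes 2*n <= NRA. */
void module_complex(int n)
{
    for (int i = 0; i < n; i++) {
        ra[2*i]     = (double)i;   /* real part      */
        ra[2*i + 1] = 0.5;         /* imaginary part */
    }
}

/* Another module reinterprets the same memory as a 3-D grid of
 * plain values.  Assumes nx*ny*nz <= NRA. */
void module_stack(int nx, int ny, int nz)
{
    double (*v)[ny][nz] = (double (*)[ny][nz])ra;  /* C99 pointer-to-VLA view */
    memset(ra, 0, sizeof ra);
    for (int x = 0; x < nx; x++)
        for (int y = 0; y < ny; y++)
            for (int z = 0; z < nz; z++)
                v[x][y][z] = 1.0;
}
```

Which view is live depends on the user's module selection at runtime, so an analysis that describes accesses in terms of one declared shape cannot precisely summarize both modules.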


Consider the JKDER subroutine, which contains a significant parallelizable loop. Within the loop are calls to DABGVB and DABDFT, both of which use a portion of array X indexed from LVec. The choice of whether to call DABDFT and DABGVB depends on input parameters chosen by the user. ROGVB is set to true if the user wants to do generalized valence bond calculations. HFSCF is set to true due to other user selections involving the wave function calculation and whether or not to apply Moller-Plesset perturbation theory.

Subroutines DABDFT and DABGVB both access X[LVec...*], but for different data. DABDFT declares the memory at X[LVec] as a single-dimension array and accesses it within a loop using an indirect array access through array IA. In contrast, DABGVB declares the memory at X[LVec] to be a two-dimensional array, V, and accesses V within an additional nested loop. Thus both the access patterns to the same segment of memory and the declared shapes of the array segment differ among subroutines called within the parallelizable loops of JKDER.

The arrays RA and SA used by the generic framework for computational modules in SEISMIC are other examples of shared data structures that hold different data depending on user input selections. If the user selects the M3FK routine to execute, RA is used as a series of two-dimensional arrays that hold complex numbers. In contrast, if a user selects the STAK routine to execute, RA is used as a three-dimensional array that holds floating point numbers.

Both M3FKB and STAKB are called from a loop in SEISPROC. The user sets up an input file to determine which routines will be called from SEISPROC and in which order. At compile time, we cannot know whether M3FKB, STAKB, or both will be called during runtime. Therefore, the compiler assumes that both are called and that RA is used in both ways described above.

2.3.1 Impact on Automatic Parallelization

Shared data structures impact automatic parallelization because many parallelization techniques rely on describing the portion of memory accessed by an array expression in the context of the declaration of the array. The declared sizes and shapes of shared data structures may not relate to the access patterns in the code for a particular computational module, because the sizes and shapes of shared data structures are made general enough to facilitate the memory needs of all the computational modules.

State-of-the-art automatic parallelization techniques fail to perform precise comparisons among an array's accesses when the size and multidimensional shape of the representation of an array's accesses are not clearly defined portions of the size and shape used to describe the array's declaration. The result is that the compiler makes conservative assumptions that may limit the amount of parallelism the compiler is able to discover.

2.4. Multilingual Code

We refer to applications with code from multiple languages as multilingual. Multilingual applications are becoming increasingly common as code is reused in multiple contexts. It is common to find multilingual applications in use in industry, since much of the code of a full application suite is reused for a number of years as new techniques and applications emerge.

The SEISMIC application consists of FORTRAN and C. C is used to perform memory allocation and low-level file activities. The main program is written in FORTRAN 77. The main program calls a C routine to allocate memory. The C routine calls several FORTRAN 77 routines that perform the data processing. Other low-level C routines are called from within the data processing subroutines to store and read files on disk.

In SEISMIC, the main program encloses the calls to SEISPREP routines, written in FORTRAN 77, where relationships are applied among input parameters and some common block variables. Data structures are allocated in CPROC, written in C, using variables passed in from the main program. The data structures allocated in CPROC are then used by FORTRAN 77 subroutines called from CPROC. The compiler is unable to link the use of variables in array index expressions to the declarations of the arrays, because the size of the array is determined from within the C code.

2.4.1 Impact on Automatic Parallelization

Most compilers take code of only a single language as input. Compilers that handle more than one language translate to a low-level intermediate representation and do not leverage compile-time analysis across subroutines programmed in multiple languages at the high level required for automatic parallelization. Consequently, compilers typically make the most conservative assumptions about the possible side effects of executing code in a different language, which prevents parallelization of loops enclosing code written in multiple languages.

2.5. Compile-Time Complexity

Using SEISMIC, GAMESS, and SANDER to represent full applications, and the PERFECT BENCHMARKS and LINPACK to represent CPU kernel codes, we measure the amount of elapsed time required by Polaris to compile the various codes. Traditional automatic parallelization techniques cannot be applied to a code when the compile time required to parallelize the code exceeds "reasonable" time and memory limits on a modern workstation. We subjectively define the threshold for a reasonable time limit to be twelve hours of elapsed time on a modern workstation, and a reasonable memory limit to be four gigabytes.
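Returning to the allocation pattern of Section 2.4, it can be approximated in a single language (an all-C sketch with invented names; in SEISMIC the two halves are C and FORTRAN 77): the true extent of the buffer is computed at runtime in one routine, while the consumer imposes its own leading dimension, leaving the compiler with nothing in the consumer's scope that ties its index expressions to the allocation:

```c
#include <stdlib.h>

/* Stand-in for a CPROC-like allocator: the size is computed only
 * at runtime, in a different translation unit (or language) than
 * the code that uses the memory. */
double *alloc_workspace(int ntraces, int nsamples)
{
    return malloc((size_t)ntraces * nsamples * sizeof(double));
}

/* Stand-in for a FORTRAN 77 processing routine: it imposes its own
 * 2-D view via the leading dimension ldim, which is just another
 * runtime parameter, unrelated (as far as this scope shows) to the
 * actual allocation size. */
double process(double *work, int ldim, int nt, int ns)
{
    double sum = 0.0;
    for (int t = 0; t < nt; t++)
        for (int s = 0; s < ns; s++) {
            work[t * ldim + s] = 1.0;
            sum += work[t * ldim + s];
        }
    return sum;
}
```

Because no declaration visible in `process` bounds the accessed region, a parallelizer must either analyze across the language boundary or assume the worst.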


Authorized licensed use limited to: Purdue University. Downloaded on April 30, 2009 at 12:58 from IEEE Xplore. Restrictions apply. st etev or feasdtm namdr worksta- modern a on time elapsed of hours twelve be to is ueo h opeiyprsaeet h S The statement. per mea- complexity a the provide to of statements sure FORTRAN of number the by codes. kernel CPU the of larger one magnitude compiling of for order than to an time is required applications full the the that compile expected is it Therefore, is plication. code each S since and of codes each the For across independently. executed averaged is time pile For pass. compiler by P breakdown the the a indicate by columns segments divided the The of time compiled. statements compile FORTRAN total of number right. the the show on axis a columns the as to The displayed corresponding height is a time at compile dash elapsed black total The ker- CPU codes. compiling nel of that with of applications complexity full the compiling compare to means quantitative a provides the results. of the majority in the is included until is approach subroutines code of This number subrou- a subroutine. to all applied selected and the subroutine, enclosing selected sub- tines subroutine the a through within selecting reachable is from by that chosen calls code is the portion of compile The all to routine, isolation. graph por- in call smaller compiled a the is application, of graph an call for the code of the tion of all to applying Polaris If when gigabytes. exceeded four are be limits compile-time to reasonable limit memory reasonable a and tion, AESapiain eur infiatymr compile more significantly require applications GAMESS

h oun fFgr iie h oa opl time compile total the divides 2 Figure of columns The 2 Figure in graphed measure time compile elapsed The time (seconds) / statement 0 1 2 3 4 5 6 7 8 iue2 opl ieprCd Statement Code per Time Compile 2. Figure SANDER Authorized licensed uselimited to:Purdue University. Downloaded on April 30,2009 at 12:58 from IEEE Xplore. Restrictions apply. ERFECT Seismic 105,531 h niesto oe scniee soeap- one as considered is codes of set entire the ,

B GAMESS 177,315 ENCHMARKS Sander 107,983

Perf. Bench. 11,671

n L and Linpack 26 - 24,000 48,000 72,000 96,000 120,000 144,000 168,000 192,000 INPACK EISMIC

time (seconds) h oa com- total the , Time Total Compile dependence test data- privatization substitution variable induction inline expansion GSA translation propagation constant interprocedural reduction others GAMESS, , EISMIC and 283 iebtas eur oersucst opl h same the codes. compile kernel to CPU as resources code more of com- amount require to code also more but have pile only not Conse- applications benchmarks. full other quently, the L all to compiling respect with of significant complexity P the than parallelization and statement automatic apply per to Polaris techniques for order in time icn peu nteoealeeuintm fS of sig- time provide execution con- to overall known applications, the loops in full involving speedup for nificant study time expen- case compile a more of sider data is terms by privatization in analysis array sive symbolic and of tests use dependence the why Loops understand Parallel To of Depth Nesting The 2.5.1 exist. patterns sym- specific use when only also but recognition analysis bolic reduction vari- and Induction substitution locations. able memory if same determine the to reference iterations they loop ar- different of across comparisons resolve accesses to ray used analy- is symbolic analysis use Symbolic pass sis. privatization array P and average pendence other the the to though significant cases, more all are privatizations in passes array compiler time bench- and compile each tests the for dependence dominate time data compile The elapsed mark. the to passes compiler respect main the with columns. of the significance the of privati- shows segments 3 array the Figure and by dependence indicated data as passes, the zation in com- spent of majority is the time that pile note statement, a compile to effort FECT

oudrtn h nutilgaecdsrqiemore require codes industrial-grade why understand To % of Compile Time 100% 20% 40% 60% 80% iue3 opl ieprCmie Pass Compiler per Time Compile 3. Figure 0% B ENCHMARKS

Seismic

GAMESS rL or

INPACK Sander

Perf. Bench. ERFECT oe ohtedt de- data the Both code. INPACK Linpack B dependence test data- privatization substitution variable induction inline expansion GSA translation propagation constant interprocedural reduction others ENCHMARKS oe sin- is codes EISMIC ER - 5 ditional subroutine calls. The PERFECT BENCHMARKS Perf. Bench. Seismic were created by extracting a smaller portion of the call 4 graph that encompassed computationally intensive part of scientific applications and explicitly assigning any required variables that are defined in the outer contexts to static val- 3 ues, which is why the outer subroutine nesting depths for the target loops in the PERFECT BENCHMARKS are much 2 smaller than for the target loops in SEISMIC, whereas the enclosed subroutine and loop nesting depths are more simi- 1 lar across the two types of codes.

Avg. Number per Parallel Loop Avg. 0 2.5.2 Impact on Automatic Parallelization outer subs outer loops enclosed enclosed subs loops A larger loop nesting depth requires additional symbolic analysis for data dependence and array privatization anal- ysis, resulting in a greater compile-time complexity for par- Figure 4. Nesting Characteristics of Loops allelization. Since the compiler analyzes an array reference Manually Identified as Parallel for cross-iteration dependencies for each enclosing loop, the Range Test permutes the loops in a loop nest to determine which loops are parallel, and the compiler compares array references in different loops when the loops share a com- and the PERFECT BENCHMARKS. We refer to these loops mon enclosing loop, the amount of symbolic analysis re- as the “target loops”. Target loops are identified in SEIS- quired relates to the loop nesting depth. MIC through manual analysis. When the target loops of The amount of symbolic analysis required also relates to SEISMIC are parallelized, we obtain overall speedups on par the subroutine nesting depth since interprocedural analysis with the hand parallelized version of SEISMIC. We leverage or inlining must be used to translate array access patterns work on manual parallelization of the PERFECT BENCH- within subroutines into the calling contexts of the subrou- MARKS, performed under the Center of Supercomputing tines in order to analyze cross-iteration dependencies in any Research and Development at the University of Illinois at loops enclosing the subroutine calls. Subroutines enclosing Urbana-Champaign, to identify target loops of the PER- a loop can require symbolic analysis if an array referenced FECT BENCHMARKS. Parallelizing the target loops serve within the loop is declared in a calling subroutine. as a goal of automatic parallelization. Figure 4 shows that the “nesting depths” of target loops 3. 
Issues Preventing Effective Automatic Par- in SEISMIC are significantly greater than those of target allelization loops in the PERFECT BENCHMARKS. A loop within a loop is referred to as a nested loop. Likewise, a subroutine called What prevents sufficient precision required for automatic from within another subroutine is referred to as a nested parallelization of full applications? The challenges outlined subroutine. We sum up the number of loops and subroutines in Section 2 each impact application of traditional automatic that enclose a loop to specify the total nesting depth of the parallelization techniques. To determine the way forward loop. The figure shows the average of the number of loops for automatic parallelization, we categorize the issues that and subroutines for loops that have been selected for hand must be addressed in order to enable traditional paralleliza- parallelization. “Outer subs” and “outer loops” are the num- tion techniques to be effective with industrial-grade appli- ber of subroutine calls and loops from the program level to cations. The following issues prevent the automatic paral- the target loops in the deepest call graph paths. “Enclosed lelization of a set of target loops of SEISMIC,GAMESS, subs” and “enclosed loops” are the number of subroutine and SANDER: rangeless variables, aliased arrays, array in- calls and loops enclosed within the target loops. direction, complex symbolic expressions, complex array ac- Figure 4 breaks down the total nesting depths by the con- cess patterns, and compile-time complexity. texts enclosing and contexts enclosed by the target loops. Each target loop is identified manually with one of the The break down reveals that target loops in SEISMIC are en- categories that prevents its parallelization in Figure 5. Only closed by significantly more subroutines than target loops in loops in the autoparallelized category were found to be par- PERFECT BENCHMARKS. 
allel by Polaris using state-of-the-art techniques. For loops The challenges to parallelizing full applications, de- in the aliasing category, Polaris incorrectly assumed that de- scribed in the previous sections, typically involve using ad- pendencies exist between subroutine array parameters that
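Two of these categories can be illustrated with small C loops (illustrative sketches, not taken from the applications). In the first, proving the loop parallel requires the symbolic fact that the write range [0,n) and the read range [n,2n) never overlap; if n were a rangeless variable whose bounds are unknown, even this comparison fails. In the second, without call-site information the compiler must assume the two array parameters may alias, and an overlapping call makes the dependence real:

```c
/* Parallel only if the compiler can prove, symbolically, that the
 * writes to a[0..n-1] and the reads from a[n..2n-1] are disjoint. */
void shift_down(double *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i + n] * 2.0;
}

/* Without knowing the call sites, the compiler must assume x and y
 * may overlap; passing overlapping sections of one array makes each
 * iteration depend on the previous one. */
void scale_into(double *x, const double *y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = y[i] * 2.0;
}
```

Conservatively assuming such aliasing everywhere is exactly what places loops in the aliasing category above, even when every actual call passes disjoint arrays.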


[Figure 5: number of target loops (0 to 45) per category (complexity, access representation, symbol analysis, indirection, rangeless, aliasing, autoparallelized) for Sander, Seismic, and GAMESS.]
Figure 5. Remaining Hindrances to Automatic Parallelization of Target Loops

We refer to variables for which the compiler cannot determine the limits on the possible values the variables can hold as rangeless variables. When comparison of array index expressions depends on rangeless variables, symbolic analysis of the comparison is futile and the compiler must resort to conservative assumptions. Target loops for which Polaris could not perform precise symbolic analysis due to the presence of rangeless variables are part of the rangeless category in the figure.

Loops with array indirection were not parallelized because Polaris had no support for analysis of arrays indexed by arrays, indicated by the indirection category. The symbolic analysis category encompasses cases that require additional symbolic analysis capabilities beyond what is implemented in Polaris. The access representation category includes cases where the array access representation used within Polaris is not sufficiently precise for determining that specific array references do not contain cross-iteration dependencies. Symbolic analysis required to analyze loops in the compile-time complexity category exceeds reasonable compile-time limits, preventing Polaris from parallelizing the loops.

4. Related Work

Excessive compilation time has been addressed in techniques that trade precision for compile-time efficiency, such as interprocedural techniques that summarize array access patterns per subroutine and reuse the array summaries across multiple call sites for the same subroutine. However, compiler techniques do not exist to determine the minimal amount of precision needed to effectively parallelize an application to achieve speedups.

To address compile-time complexity that prohibits whole-program, interprocedural optimization for the purpose of register scheduling, instruction-level parallelism, and even such high-level optimizations as loop parallelization, techniques have been proposed for reducing the amount of the program to consider for analysis at any one time while still allowing the benefits obtained from interprocedural optimization. A Procedure Boundary Elimination framework was proposed that automatically determines portions of the call graph that can be analyzed as completely encapsulated units [8]. Each portion of the call graph is supplied with information from the rest of the whole-program call graph that is useful for supporting optimization of the portion. This approach to enabling interprocedural optimization is promising when compile-time complexity is prohibitive. In order to apply automatic parallelization in Polaris to the target loops in Section 3, appropriate portions of the whole-program call graph were manually extracted, always starting with the main program or an outermost routine and including all subroutines called from within the target loop. An approach such as Procedure Boundary Elimination may be able to automatically determine the portion of the call graph to consider. However, the heuristic used to automatically determine that portion, which resulted in speedups of near ten percent for low-level, classical compiler optimizations such as register allocation and scheduling, does not identify the software engineering patterns described in Section 2. Considering different optimizations, such as high-level loop parallelization, requires different heuristics that address the challenges that are not present in CPU kernel codes.

Additionally, some of the portions of the whole-program call graphs selected for compiling the target loops were prohibitive in terms of compile-time complexity when applying loop parallelization techniques, meaning that even if the compiler is able to apply a heuristic that mimics what we did manually to identify the portion of the call graph to compile, the analysis required to parallelize at the target level is prohibitive. In other words, determining how to make interprocedural optimization feasible still must be addressed in order to parallelize full applications and achieve results comparable with manual parallelization efforts.

Blended analysis is an approach developed to address issues similar to the challenges caused by multifunctionality, discussed in Section 2.1, but in the context of framework-based, object-oriented applications. The authors of [3] describe a hybrid approach to enable post-execution understanding of performance when the execution path through the call graph is unknown at compile time. However, this technique only applies to a specific configuration of input parameters, which is appropriate when addressing similar usage patterns in framework-based applications, but is not as easily applied to the codes used in this paper since


In [1], we described some of the challenges expounded on in Section 2 of this paper, using only SEISMIC and GAMESS as examples of industrial-grade applications. In this paper, we have described the impact that each of these challenges has on automatic parallelization, and we have added SANDER as another example of an industrial-grade scientific computing application. Whereas our previous paper identified challenges, this paper provides a metric for measuring the complexity of these full applications with respect to automatic parallelization, a complexity that is not evident from simple measures such as the number of lines of code or the number of subroutines. Using this metric, we compare the complexities of industrial-grade applications with those of CPU kernel and linear algebra benchmarks to quantify how much more complex full applications are to compile than several well-known benchmarks used to evaluate compiler performance. Lastly, we investigated a set of target loops in the full applications to determine what hindered state-of-the-art automatic parallelization techniques.

5. Conclusion

In the process of analyzing why applying traditional automatic parallelization techniques to full applications fails to achieve speedups comparable with the performance achieved on CPU kernel benchmarks, we identified a number of characteristics unique to full applications. Each of these characteristics hinders an automatically parallelizing compiler's ability to find profitably parallelizable loops using traditional techniques.

The challenges posed by industrial-grade applications include software engineering practices that support user-selectable multifunctionality, extensible and reusable execution frameworks, and data structures that store generic types of data shared by diverse computational routines. Industrial-grade applications also typically leverage libraries and support utilities written in different languages, requiring the compiler to be capable of compiling multiple languages in order to analyze all subroutine calls.

Additionally, we show that applying automatic parallelization techniques to full applications is more complex than applying them to CPU kernel benchmarks, which ultimately limits automatic parallelization to subsets of the call graph in order to keep the compile time reasonable. Further analysis of the compile-time complexity indicates that the symbolic analysis capabilities required to support the Range Test and array privatization are responsible for most of the elapsed compile time. We also show that profitably parallelizable loops in SEISMIC have greater nesting depths than such loops in the PERFECT BENCHMARKS; the nesting depth of a loop affects the complexity of performing the Range Test and array privatization on the nest of loops and subroutines. The resulting complexity of compiling a statement of the full applications considered in this paper is shown to be greater than that of compiling a statement of the PERFECT BENCHMARKS and significantly higher than that of LINPACK statements, indicating the unique challenge posed by full applications in terms of the demands placed on compile-time performance.

Finally, we categorize the remaining challenges that must be met to profitably apply automatic parallelization to SEISMIC, GAMESS, and SANDER with the goal of achieving results on par with manual parallelization. For automatic parallelization to become utilized in today's scientific computing industries, the challenges described in this paper must be addressed.

References

[1] B. Armstrong and R. Eigenmann. Challenges in the automatic parallelization of large-scale computational applications. In Proceedings of SPIE/ITCOM 2001, volume 4528, pages 50–60, Aug. 2001.
[2] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Parallel programming with Polaris. IEEE Computer, pages 78–82, Dec. 1996.
[3] B. Dufour, B. G. Ryder, and G. Sevitsky. Blended analysis for performance understanding of framework-based applications. In ISSTA '07: Proceedings of the 2007 International Symposium on Software Testing and Analysis, pages 118–128, New York, NY, USA, 2007. ACM.
[4] M. W. Hall, J.-A. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, 29(12):84–89, Dec. 1996.
[5] P. Havlak. Construction of thinned gated single-assignment form. In U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, editors, Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 477–499, London, UK, 1994. Springer-Verlag.
[6] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350–360, July 1991.
[7] Z. Li, J. Gu, and G. Lee. Interprocedural Analysis Based on Guarded Array Regions, volume 1808 of Lecture Notes in Computer Science, chapter 7, pages 221–246. Springer-Verlag New York, Inc., New York, NY, USA, 2001.
[8] S. Triantafyllis, M. J. Bridges, E. Raman, G. Ottoni, and D. I. August. A framework for unrestricted whole-program optimization. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 61–71, New York, NY, USA, 2006. ACM.
[9] P. Tu and D. A. Padua. Gated SSA-based demand-driven symbolic analysis for parallelizing compilers. In Proceedings of the 9th International Conference on Supercomputing, pages 414–423. ACM Press, 1995.

