A Compiler Optimization Algorithm for Shared-Memory Multiprocessors
Kathryn S. McKinley
July 1, 1998
Abstract
This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs operating over dense matrices. The programs were initially hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves 3 programs, matches 4 programs, and degrades 1 program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user-parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
Index Terms—Program parallelization, parallelization techniques, program optimization, data locality, restructuring compilers, performance evaluation.
In IEEE Transactions on Parallel and Distributed Systems, VOL. 9, NO. 8, August 1998. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
K. McKinley is with the Department of Computer Science, University of Massachusetts, Amherst, MA 01003-4610. E-mail: [email protected].
1 Introduction

One lesson from vectorization is that users rewrote programs based on feedback from vectorizing compilers. These rewritten programs were independent of any particular vector hardware and were written in a style amenable to vectorization. Compilers were then able to generate machine-dependent vector code with excellent results. We believe that just as automatic vectorization was not successful for dusty deck programs, automatic parallelization of dusty decks is unlikely to yield complete success. Finding medium to large grain parallelism is more difficult than finding single-statement parallelism, and compilers have had limited success on dusty deck programs [10, 19, 42, 43, 17]. We believe that only a combination of algorithmic parallelization and compiler optimization will enable the effective use of parallel hardware. This paper provides evidence that this approach is viable.

Since the amount of parallelism a dusty deck program contains is unknown, measuring the success of parallelizing compilers on such programs is tenuous. The programs may actually be inherently sequential, parallel, or somewhere in between. Since a different version of the algorithm could potentially achieve linear speed-up, only linear speed-ups (performance improvements that scale with the number of processors) can be declared a complete success. Linear speed-up is rare even for parallel applications, due to communication overheads and Amdahl's Law.

In practice, parallel programs often require algorithms and data structures that differ from their sequential and vector counterparts, so the intellectual and programming costs required for good parallel performance must be paid. Our ultimate goal is to provide compiler technology that detects and exploits parallelism on a variety of parallel machines, so that this programming cost has to be paid only once. Users will concentrate on parallel algorithms at a high level, and the compiler will be responsible for machine-dependent details such as exploiting the memory hierarchy. In this paper, we consider optimizing Fortran programs for symmetric shared-memory, bus-based parallel machines with local caches.

We present an advanced parallelizing algorithm for complete applications that exploits and balances data locality, parallelism, and the granularity of parallelism. The algorithm uses existing dependence analysis techniques and a new cache model to drive the following loop optimizations: loop permutation, tiling, fusion, and distribution. It tries to organize the computation such that each processor achieves data locality, accessing only the data in its own cache, and such that the granularity of parallelism is large. Since we find that large-grain parallelism often crosses procedure boundaries, the algorithm uses existing interprocedural analysis and new interprocedural optimizations. The main advantage of our optimization algorithm is that it yields good results and is polynomial with respect to loop nesting depth (it does, however, rely on dependence analysis, which is more expensive). In this paper we present a new parallelizing algorithm, but good analysis is clearly a prerequisite to this work, and we discuss the analysis that enables our parallelization algorithm to succeed. We present the parallelization algorithm in detail in Section 3.
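As a concrete, if simplified, illustration of the kind of restructuring the algorithm performs, consider the hypothetical loop nest below; it is invented for exposition and does not come from the test suite or from the paper's examples. In the original loop order, the inner J loop walks across a row of A, which in Fortran's column-major layout touches a different cache line on nearly every iteration. Permuting the loops makes the inner I loop stride-1, and parallelizing the now-outermost J loop yields coarse-grain parallelism in which each processor works on a contiguous block of columns that stays in its own cache. The OpenMP directive is only stand-in notation for a parallel loop; the programs in the study use machine-specific parallel loop constructs.

    program permute_example
      implicit none
      integer, parameter :: n = 1024
      real :: a(n, n), b(n, n)
      integer :: i, j

      a = 0.0
      b = 1.0

      ! Original order: both loops are parallel, but the inner j loop
      ! strides through memory by n elements (Fortran stores arrays in
      ! column-major order), so nearly every access misses in the cache.
      do i = 1, n
         do j = 1, n
            a(i, j) = a(i, j) + b(i, j)
         end do
      end do

      ! After loop permutation and parallelization: the inner i loop is
      ! now stride-1, and the outer j loop, which carries no dependence,
      ! runs in parallel at a coarse granularity.
      !$omp parallel do private(i)
      do j = 1, n
         do i = 1, n
            a(i, j) = a(i, j) + b(i, j)
         end do
      end do
      !$omp end parallel do

      print *, a(1, 1), a(n, n)
    end program permute_example

Tiling would further block the loops when a column of A does not fit in the cache, and fusion and distribution adjust the placement of statements across loops in the same spirit.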
To test the algorithm, we performed the following experimental study. We collected Fortran programs written for a variety of parallel machines. Most of the programs in our suite are published versions of state-of-the-art parallel algorithms. The programs use parallel loops, and two also use critical sections. We created sequential versions of each program by converting the parallel loops to sequential loops and by eliminating the critical sections. The algorithm then optimized these sequential versions. We do not recommend that users convert their programs to serial form before handing them to the compiler; we use this methodology to assess the ability of the compiler to find, exploit, and optimize known parallelism. We used ParaScope [12, 27], an interactive parallelization tool, to systematically apply the transformations in the algorithm to the sequential programs. ParaScope implements dependence analysis, interprocedural analysis, and safe application of the loop transformations (tiling, interchange, fusion, and distribution), but not the interprocedural optimizations or the optimization algorithm itself. Sections 4 and 5 detail this experiment.

In Section 6, we show that the algorithm matched or improved the performance of seven of nine programs on a shared-memory, bus-based parallel machine with local caches. In addition to dependence analysis, many of the programs require interprocedural and symbolic analysis to find parallel loops. We also analyze which parts of the algorithm are responsible for the improvements. Most of the improvements occur in cases where data locality and parallelism intertwine.
The programmers were not able to exploit both when they conflict, but our algorithm does. This result suggests that a combination of algorithmic parallelism and compiler optimization will yield the best performance.

To explore whether a machine-independent parallel programming style exists, we also examine programming styles in light of the algorithm's successes and failures in Section 6. We found that, for the most part, these parallel programmers use a clean, modular style that is amenable to compiler analysis and optimization. Although the test suite is small, we believe this result adds to the evidence that programmers can write portable, parallelizable programs for scientific applications from which compilers can achieve very good performance.

The remainder of this paper is organized as follows. Section 2 briefly reviews the technical background. Section 3 describes the parallelization algorithm. Section 5 presents our experimental framework and the program test suite and its characteristics. Section 4 measures the effectiveness of our algorithm at parallelizing and optimizing the programs in our test suite. Section 7 compares our work to other research in this area, and Section 8 summarizes our approach and results.

2 Technical Background

This section overviews the technical background on dependence and reuse that is needed to understand the parallelization algorithm.
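Before turning to the definitions, a small invented example (it is hypothetical and not drawn from the test suite) shows why dependence matters to the parallelizer. In the loop sketched in the comments below, the first statement is a recurrence: iteration i reads a(i-1), which iteration i-1 wrote, so the loop carries a dependence and cannot simply be run in parallel. The second statement is independent across iterations. Loop distribution, one of the transformations the algorithm applies, splits the two statements into separate loops so that the independent one can be parallelized while the recurrence remains sequential. As before, the OpenMP directive is only stand-in notation for a parallel loop.

    program distribute_example
      implicit none
      integer, parameter :: n = 1024
      real :: a(n), b(n), c(n)
      integer :: i

      a = 1.0
      b = 2.0
      c = 0.0

      ! Original loop (shown as a comment): the first statement carries a
      ! dependence from one iteration to the next; the second does not.
      !    do i = 2, n
      !       a(i) = a(i-1) + b(i)   ! reads the value written by i-1
      !       c(i) = 2.0 * b(i)      ! independent of other iterations
      !    end do

      ! After loop distribution: the recurrence stays in a sequential loop...
      do i = 2, n
         a(i) = a(i-1) + b(i)
      end do

      ! ...and the independent statement becomes a coarse-grain parallel loop.
      !$omp parallel do
      do i = 2, n
         c(i) = 2.0 * b(i)
      end do
      !$omp end parallel do

      print *, a(n), c(n)
    end program distribute_example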
Data Dependence. We assume the reader is familiar with data dependence [18, 29]. Throughout the paper,