Automatic and Interactive Parallelization

RICE UNIVERSITY Automatic and Interactive Parallelization by c Kathryn S M Kinley A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree Do ctor of Philosophy Approved Thesis Committee Ken Kennedy Noah Harding Professor chair Computer Science Keith D Co op er Asso ciate Professor Computer Science Don H Johnson Professor Electrical and Computer Engineering Danny C Sorensen Professor Mathematical Sciences Houston Texas March Automatic and Interactive Parallelization c Kathryn S M Kinley Abstract The goal of this dissertation is to give programmers the ability to achieve high p er formance by fo cusing on developing parallel algorithms rather than on architecture sp ecic details The advantages of this approach also include program p ortability and legibility To achieve high p erformance we provide automatic compilation techniques that tailor parallel algorithms to sharedmemory multipro cessors with lo cal caches and a common bus In particular the compiler maps complete applications onto the sp ecics of a machine exploiting b oth parallelism and memory To optimize complete applications we develop novel general algorithms to trans form lo ops that contain arbitrary conditional control ow In addition we provide new interpro cedural transformations which enable optimization across pro cedure b ound aries These techniques provide the basis for a robust automatic parallelizing algo rithm that is applicable to complete programs The algorithm for automatic parallel co de generation takes into consideration the interaction of parallelism and data lo cality as well as the overhead of parallelism The algorithm is based on a simple cost mo del that accurately predicts cache line reuse from multiple accesses to the same memory lo cation and from consecutive accesses The optimizer uses this mo del to improve data lo cality It also uses the mo del to discover and intro duce eective parallelism that complements the b enets of data lo cality The optimizer further improves the eectiveness of parallelism by seeking to increase its granularity Parallelism is intro duced only when granularity is sucient to overcome its asso ciated costs The algorithm for parallel co de generation is shown to b e ecient and several of its comp onent algorithms are proven optimal The ecacy of the optimizer is illustrated with exp erimental results In most cases it is very eective and either achieves or improves the p erformance of handcrafted parallel programs When p erformance is not satisfactory we provide an interactive parallel programming to ol which combines compiler analysis and algorithms with human exp ertise Acknowledgments Ken Kennedy provided me with the three most imp ortant elements of supp ort in graduate scho ol intellectual p olitical and nancial In addition Ken Keith Co op er and Linda Torczon fostered a research atmosphere and working environment whose b enets are untold I would also like to recognize the other memb ers of my committee Keith Co op er Don Johnson and Danny Sorensen Keith has b een an endless source of encouragement and wisdom throughout my graduate career Don Johnson gave me my rst taste of research and ho oked me for life I am fortunate that many of my fellow graduate students and friends supp orted my research intellectually emotionally and with implementations I would esp ecially like to thank ChauWen Tseng Marina Kalem Mary Hall Paul Havlak Nat McIntosh Preston Briggs Ben Chase and the entire compiler group I am extremely gratefully to my entire family As always my parents were a constant source of love and encouragement To my husband Scotty Strahan I hop e I am the ro ck for you that you have b een for me A little Madness in the Spring Is wholesome even for the King Emily Dickinson Contents Abstract ii Acknowledgments iii List of Illustrations ix Intro duction Automatic parallelization Interactive parallelization Overview Technical Background Dep endence Analysis Interpro cedural dep endence analysis Augmented call graph Interactive Parallel Programming Intro duction Work Mo del Transformations Reordering transformations Dep endence breaking transformations Memory hierarchy transformations Miscellaneous transformations Transformation algorithms Lo op interchange Lo op skewing Lo op distribution Unroll and jam Incremental analysis after edits User and compiler interaction v Related work Discussion Lo op Transformations with Arbitrary Control Flow Motivation Lo op distribution Mechanics Restructuring Co de generation Other transformations Lo op skewing Lo op reversal Lo op p ermutation Strip mining Privatization Scalar expansion Lo op fusion Lo op p eeling Related work Discussion Interpro cedural Transformations Intro duction Technical background Augmented call graph Interpro cedural section analysis Supp ort for interpro cedural optimization The ParaScop e compilation system Recompilation analysis Interpro cedural transformation Lo op extraction Lo op emb edding Intrapro cedural transformations Lo op fusion Lo op p ermutation vi Exp erimental results Sp ec Ocean Related work Discussion Optimizing for Parallelism and Data Lo cality Intro duction Memory and language mo del Tradeos in optimization Optimizing data lo cality Sources of data reuse Simplifying assumptions Lo op cost Reference groups Lo op cost algorithm Imp erfectly nested lo ops Lo op p ermutation Memory order Permuting to achieve memory order Data lo cality exp erimental results Matrix multiply Stencil computations Jacobi and SOR Erlebacher Parallelism Performance estimation Intro ducing parallelism Strip mining Parallelization algorithm Optimization algorithm Exp erimental results Matrix multiply Dmxpy Related work Discussion vii An Automatic Parallel Co de Generator Intro duction Parallel co de generation Driving co de generation Pro cedure cloning Lo opbased optimization Partitioning for lo op distribution and lo op fusion Simple partition algorithm Merging the solutions Discussion Lo op fusion Lo op distribution Integrating interpro cedural transformations Selecting the appropriate interpro cedural transformation Extensions to pro cedure cloning .

Load more