A Metadata-Enhanced Framework for High Performance Visual Effects
Imperial College London
Department of Computing

Jay L. T. Cornwall
April 2010

Submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London. The work within this thesis is my own except where explicitly cited.

Abstract

This thesis is devoted to reducing the interactive latency of image processing computations in visual effects. Film and television graphic artists depend upon low-latency feedback to receive a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising compiler which leverages high-level program metadata to guide key computational and memory hierarchy optimisations. This metadata encodes static and dynamic information about data dependence and patterns of memory access in the algorithms constituting a visual effect – features that are typically difficult to extract through program analysis – and presents it to the compiler in an explicit form. By using domain-specific information as a substitute for program analysis, our compiler is able to target a set of complex source-level optimisations that a vendor compiler does not attempt, before passing the optimised source to the vendor compiler for lower-level optimisation.

Three key metadata-supported optimisations are presented. The first is an adaptation of space and schedule optimisation – based upon well-known compositions of the loop fusion and array contraction transformations – to the dynamic working sets and schedules of a runtime-parameterised visual effect. This adaptation sidesteps the costly solution of runtime code generation by specialising static parameters in an offline process and exploiting dynamic metadata to adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second optimisation comprises a set of transformations to generate SIMD ISA-augmented source code.
Our approach differs from autovectorisation by using static metadata to identify parallelism, in place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable parameters for optimal aligned memory access. The third optimisation comprises a related set of transformations to generate code for SIMT architectures, such as GPUs. Static dependence metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads. Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying inter-thread and intra-core data sharing opportunities in memory access metadata.

A detailed performance analysis of these optimisations is presented for two industrially developed visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations of these two effects. Programmability is enhanced by automating the generation of SIMD and SIMT implementations from a single programmer-managed scalar representation.

Acknowledgements

The road to this thesis has been shaped by many people to whom I am indebted. Some I have never had the pleasure of meeting: their lives’ work is trivialised in citations and off-hand remarks littered throughout these chapters with little more than a nod of acknowledgement. To you I am eternally grateful for allowing me to stand on your shoulders. The view is breathtaking. I hope that I am one day able to return the favour.

To my supervisor, Paul Kelly, I owe much of my success and all of my gratitude. His patience, enthusiasm and insights helped to carve this path of research and navigate its many obstacles with deft ability. The incredible network of minds and personalities he has built throughout his distinguished career has had an invaluable effect on the research contained within this thesis.
To my second supervisor, Tony Field, I am grateful for his guidance on writing good papers, which I hope has influenced the text you are now reading.

Of course, we did not work alone. The input of our researchers, past and present, in the Software Performance Optimisation group, and many of my colleagues in the Computer Systems group, ensured that success was the only possible outcome of our research endeavours. I must name those with whom I worked especially closely and enjoyed many academic and technical conversations: Lee Howes, Ashley Brown, Francis Russell, Will Osborne and Michael Mellor.

I am beholden to my colleagues at The Foundry for the ideas and discussions we shared in finding common ground between our goals. In particular, I would like to thank Phil Parsonage and Bruno Nicoletti for stepping far beyond their lines of duty and endeavouring to believe that our daring academic–industrial collaboration could succeed. I owe thanks to Bill Collis for proposing the collaboration and arranging the funding to realise it.

I must also thank the Engineering and Physical Sciences Research Council (EPSRC) for partly funding our work through an Industrial CASE award. This was a completely new experiment for our research group and we were very pleased with the outcome. I hope the results of our research have met the funding goals spot on.

To my family, who supported me throughout this period of my life while I was unable to give them the time they deserved, I am more grateful than I am able to show. Amidst personal tragedy, my father and sister stood beside me with a devotion that gave me adamant resolve to complete this thesis. To my mother, I am grateful for life. I hope you find peace in your next.

Contents

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Contributions
1.4 Publications
1.5 Thesis Organisation
2 Visual Effects Optimisation
2.1 Introduction
2.2 Visual Effects
2.2.1 Digital Image
2.2.2 Digital Composition
2.3 Software Optimisation
2.3.1 Memory Hierarchy Management
2.3.2 Component-Based Programming
2.3.3 Generative Programming
2.3.4 Metadata-Guided Optimisation
2.3.5 Polyhedral Schedule Transformation
2.4 Software Parallelisation
2.4.1 Symmetric Multiprocessing (SMP)
2.4.2 Single Instruction, Multiple Data (SIMD)
2.4.3 Single Instruction, Multiple Threads (SIMT)
2.5 Related Work
2.6 Concluding Remarks
3 Metadata-Augmented Single-Source Framework
3.1 Introduction
3.2 Constraints upon the Visual Effects Domain
3.3 Metadata Definition
3.3.1 Primitive Context
3.3.2 Data Dependence
3.3.3 Memory Access
3.4 Framework Design
3.4.1 Visual Primitive Representation
3.4.2 Visual Effect Construction
3.5 Execution Strategy
3.5.1 DAG Serialisation
3.5.2 DOD Propagation
3.5.3 Cache-Aware Iteration
3.5.4 Multicore Parallelisation
3.6 Code Generation
3.6.1 Static Specialisation
3.7 Performance Analysis
3.8 Concluding Remarks
4 Space and Schedule Optimisation
4.1 Introduction
4.2 Schedule Optimisation
4.2.1 Constructing the Constraint Matrices
4.2.2 Minimising Polyhedral Scanning Complexity
4.2.3 Constructing the Scattering Matrices
4.2.4 Impact of Schedule Optimisation on Parallelisation
4.2.5 Optimising the Polyhedral Schedule
4.3 Space Optimisation
4.4 Performance Analysis
4.5 Concluding Remarks
5 SIMD Code Generation and Optimisation
5.1 Transformation Phases
5.1.1 Strip Mining
5.1.2 Scalar Promotion
5.1.3 Divergent Conditional Predication
5.1.4 Memory Access Realignment
5.1.5 Contracted Load/Store Rescheduling
5.2 Performance Analysis
5.3 Concluding Remarks
6 SIMT Code Generation and Optimisation
6.1 Transformation Phases
6.1.1 Syntax-Directed Translation
6.1.2 Constant and Shared Memory Staging
6.1.3 Memory Access Transposition
6.1.4 Memory Access Realignment
6.1.5 Maximising Parallelism
6.1.6 Scheduling Overhead Reduction
6.2 Thread Block Size, Shape and Count Selection
6.3 Challenges in SIMT Space/Schedule Optimisation
6.4 Performance Analysis
6.5 Concluding Remarks
7 Conclusions and Further Work
7.1 Review of Objectives
7.2 Technical Achievements
7.3 Critical Analysis
7.4 Further Work
7.4.1 Compiler-Assistive Metadata
7.4.2 Dynamic Runtime Optimisation
7.4.3 Heterogeneous Multiprocessing
7.5 Conclusions
A Framework Ontology
A.1 The Functor Class
A.2 The Indexer Class
Bibliography

List of Tables

2.1 A condensed matrix representation of the systems of linear equations defining two convex hulls in a two-dimensional iteration space, as shown in Figure 2.4.
3.1 Hardware specifications for the three benchmarking platforms used throughout the performance analyses in Section 3.7. The Xeon was in 2-chip SMP configuration. Both Intel chips share partitions of the L2 cache between pairs of cores.
3.2 Image resolutions tested in the benchmarks throughout this thesis to identify cache spills and row-to-row memory aliasing effects.
4.1 Constraint matrices for the loop nests shown in Listing 4.1. Each loop nest has one constraint matrix, bounding its.