Designing Graphics Architectures Around Scalability and Communication
DESIGNING GRAPHICS ARCHITECTURES AROUND SCALABILITY AND COMMUNICATION

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Matthew Eldridge
June 2001

© Copyright by Matthew Eldridge 2001
All Rights Reserved

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
Pat Hanrahan (Principal Adviser)

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
William J. Dally

I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
Mark Horowitz

Approved for the University Committee on Graduate Studies:

Abstract

Communication forms the backbone of parallel graphics, allowing multiple functional units to cooperate to render images. The cost of this communication, both in system resources and money, is the primary limit to parallelism. We examine the use of object and image parallelism and describe architectures in terms of the sorting communication that connects these forms of parallelism. We introduce an extended taxonomy of parallel graphics architecture that more fully distinguishes architectures based on their sorting communication, paying particular attention to the difference between sorting fragments after rasterization, and sorting samples after fragments are merged with the framebuffer. We introduce three new forms of communication, distribution, routing and texturing, in addition to sorting. Distribution connects object parallel pipeline stages, routing connects image parallel pipeline stages, and texturing connects untextured fragments with texture memory.
All of these types of communication allow the parallelism of successive pipeline stages to be decoupled, and thus load-balanced. We generalize communication to include not only interconnect, which provides communication across space, but also memory, which functions as communication across time. We examine a number of architectures from this communication-centric perspective, and discuss the limits to their scalability. We draw conclusions about the limits of both image parallelism and broadcast communication and suggest architectures that avoid these limitations.

We describe a new parallel graphics architecture called “Pomegranate”, which is designed around efficient and scalable communication. Pomegranate provides scalable input bandwidth, triangle rate, pixel rate, texture memory and display bandwidth. The basic unit of scalability is a single graphics pipeline, and up to 64 such units may be combined. Pomegranate’s scalability is achieved with a novel “sort-everywhere” architecture that distributes work in a balanced fashion at every stage of the pipeline, keeping the amount of work performed by each pipeline uniform as the system scales. The use of one-to-one communication, instead of broadcast, as well as a carefully balanced distribution of work allows a scalable network based on high-speed point-to-point links to be used for communicating between the pipelines. Pomegranate provides one interface per pipeline for issuing ordered, immediate-mode rendering commands and supports a parallel API that allows multiprocessor applications to exactly order drawing commands from each interface. A detailed hardware simulation demonstrates performance on next-generation workloads. Pomegranate operates at 87–99% parallel efficiency with 64 pipelines, for a simulated performance of up to 1.10 billion triangles per second and 21.9 billion pixels per second.

Acknowledgements

I would like to thank my advisor, Pat Hanrahan, for his continual guidance.
His advice has been invaluable, as has his patience with my forays into activities far removed from being a graduate student. I thank the other members of my reading committee, Bill Dally and Mark Horowitz, for their time and attention to detail. Bill Dally has provided a great engineering perspective that I have had the pleasure of being exposed to in multiple classes as well as personally. Mark Horowitz was a source of many insightful suggestions throughout my career at Stanford.

I owe a large debt of gratitude to Kurt Akeley. I have had numerous enlightening discussions with him, and he has given me a deeper understanding of graphics architecture. Kurt’s thoughtful comments on my thesis clarified not only my writing but also my ideas.

Stanford has been a wonderful place to go to school, in large part because of the people I had the good fortune to work with. In particular, Homan Igehy, Gordon Stoll, John Owens, Greg Humphreys and Ian Buck have all taught me a great deal. The certainty with which I argued with them all was nearly always because I was wrong. They, together with Jeff Solomon, Craig Kolb, Kekoa Proudfoot, Matt Pharr and many others have made being a graduate student very enjoyable.

The coincidences of my brother Adam attending graduate school at Stanford, David Rodriguez self-employing himself in Sunnyvale, and Michael and Tien-Ling Slater relocating to Berkeley have all made my time outside of school much more pleasurable.

I thank my parents for encouraging me to go to graduate school when I needed to be pushed, for their enthusiasm when I thought I would leave to go get rich, and for their constant encouragement to write my thesis when I was dragging my heels.

Finally, I thank the Fannie and John Hertz Foundation for its support, as well as DARPA contract DABT63-95-C-0085-P00006.

Contents

Abstract
Acknowledgements
1 Introduction
    1.1 Graphics Pipeline
    1.2 Performance Metrics
    1.3 Communication
    1.4 Scalability
    1.5 Summary of Original Contributions
2 Graphics Pipeline
    2.1 Choices
    2.2 Terminology
    2.3 Communication Costs
    2.4 Parallel Interface
3 Parallelism and Communication
    3.1 Parallelism
        3.1.1 Object Parallelism
        3.1.2 Image Parallelism
        3.1.3 Object Parallel to Image Parallel Transition
    3.2 Communication
        3.2.1 Patterns
        3.2.2 Networks
        3.2.3 Memory
4 Taxonomy and Architectures
    4.1 Sort-First
        4.1.1 Sort-First Retained-Mode
        4.1.2 Sort-First Immediate-Mode
    4.2 Sort-Middle
        4.2.1 Sort-Middle Interleaved
        4.2.2 Sort-Middle Tiled
    4.3 Sort-Last Fragment
    4.4 Sort-Last Image Composition
        4.4.1 Pipelined Image Composition
        4.4.2 Non-Pipelined Image Composition
    4.5 Hybrid Architectures
        4.5.1 WireGL + Lightning-2
        4.5.2 VC-1
    4.6 Observations
        4.6.1 Interface Limit
        4.6.2 Application Visibility of Work
        4.6.3 Texturing
        4.6.4 Limits of Image Parallelism
5 Pomegranate: Architecture
    5.1 Overview
    5.2 Scalability and Interface
    5.3 Architecture
        5.3.1 Network
        5.3.2 Command Processor
        5.3.3 Geometry Processor
        5.3.4 Rasterizer
        5.3.5 Texture Processor
        5.3.6 Fragment Processor
        5.3.7 Display
6 Pomegranate: Ordering and State
    6.1 Ordering
        6.1.1 Serial Ordering
        6.1.2 Parallel Ordering
    6.2 State Management
        6.2.1 State Commands
        6.2.2 Context Switching
7 Simulation and Results
    7.1 Scalability
    7.2 Load Balance
    7.3 Network Utilization
    7.4 Comparison
8 Discussion
    8.1 OpenGL and the Parallel API
    8.2 Communication
    8.3 Consistency Model
    8.4 Shared State Management
        8.4.1 Texture Objects
        8.4.2 Display Lists
        8.4.3 Virtual Memory
    8.5 Automatic Parallelization
    8.6 Pomegranate-2
9 Conclusions
Bibliography

List of Tables

2.1 Communication