Evaluation and Design of Multi-Processor Architectures
Total Page:16
File Type:pdf, Size:1020Kb
Eindhoven University of Technology MASTER Evaluation and design of multi-processor architectures Wan, W.K. Award date: 2008 Link to publication Disclaimer This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration. General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain Technische Universiteit Eindhoven Department of Mathematics and Computer Science and Department of Electrical Engineering Master’s Thesis EVALUATION AND DESIGN OF MULTI-PROCESSOR ARCHITECTURES By Win King Wan Supervisor: prof.dr. Henk Corporaal Tutor: dr.ir. Bart Mesman August 2008 Abstract This master’s thesis consists of an evaluation of two very promising parallel processor archi- tectures, namely the Cell Broadband Engine and the NVIDIA GeForce 8 Series GPU and a proposal for the design of a new architecture. The evaluation is done with respect to the gaming application domain, in particular the 3D graphics pipeline. The graphics pipeline is chosen as it consists of all the steps that are neces- sary to project 3D objects onto a 2D screen and is therefore an important part in games. For the evaluation, the most time consuming parts of the graphics pipeline are mapped onto the two architectures, followed by performing optimizations and measurements and documenting the most important lessons learned. The results from this evaluation are ultimately used to propose a new architecture, which is able to process the steps in the graphics pipeline with high performance and can therefore be used as the first step towards our goal of obtaining “the ultimate gaming architecture”. iii Acknowledgements I want to thank my supervisor Henk Corporaal and my tutor Bart Mesman for introducing me to this project and for their time, support and feedback for the entire duration of my project. Furthermore, I want to thank fellow student Paul Meys, who worked on the same project as I did and who provided valuable feedback. I also want to thank my other fellow graduation students Chong Sheau Pei (Esther), Alberto Falcon, Raymond Frijns, Jochem van der Meer and Patricia Fuentes Montoya for providing the enjoyable times in the student room on the ninth floor in the Electrical Engineering building. Other people I want to mention are my friends Wim Cools, Frank Ophelders, Tim Paffen, Marcel Steine, Mohammed El-Kebir, Martin Harwig, Diederik van Houten, Nick Matthijssen, Fred van Nijnatten and Alex Onete for joining several of the weekly pool meetings and doing other fun things for the entire duration of the project. I want to thank the people in entire Electronic Systems Group as well, for the enjoyable times, like the daily coffee breaks, during my project. Finally, I want to thank my parents and sisters for their support. v Table of Contents List of Tables ix List of Figures xi List of Acronyms and Abbreviations xiii 1 Introduction 1 2 Graphics Pipeline 3 2.1 Description of the graphics pipeline . .3 2.2 Software Implementation . .7 2.2.1 The Irrlicht Engine . .8 3 Multi-Processor Architectures 11 3.1 Cell Broadband Engine Architecture . 11 3.1.1 PowerPC Processor Element . 11 3.1.2 Synergistic Processor Element . 13 3.1.3 Element Interconnect Bus . 15 3.1.4 Memory . 16 3.1.5 Input/Output . 17 3.1.6 Programming . 17 3.1.7 PlayStation 3 . 18 3.2 NVIDIA GeForce 8 Series Graphics Processing Unit . 19 3.2.1 Stream Processor . 20 3.2.2 Memory . 20 3.2.3 Programming . 21 3.2.4 GeForce 8800 GT . 24 3.3 Other architectures . 25 4 Getting familiar with the architectures 29 4.1 Application . 29 4.2 Mapping on Cell . 31 vii 4.3 Mapping on GPU . 32 4.4 Comparison . 33 5 Mapping 35 5.1 Mapped functionality . 35 5.1.1 Description of the mapped functions . 36 5.2 Mapping on the Cell . 39 5.2.1 Optimizations . 39 5.2.2 Alternative mapping . 43 5.3 Mapping on the GPU . 46 5.3.1 Optimizations . 48 5.3.2 Alternative mapping . 53 6 Proposal 57 6.1 Evaluation of Cell and GPU . 57 6.2 Ideas for the proposal . 60 6.3 Proposed architecture . 63 7 Conclusion and Future Work 73 7.1 Conclusion . 73 7.2 Future Work . 75 A Sample PPE Source Code 77 B Sample CUDA Code 79 C Compiling Irrlicht using GCC 81 D Subword parallelism for the YUV to RGB program 83 E Extra information for mapping 85 E.1 Software Managed Cache API . 85 E.2 Comparison between the PPE and the Intel Q6600 CPU . 86 E.3 Coalesced memory access . 87 E.4 Reduction operation . 88 Bibliography 91 viii List of Tables 3.1 Latency of SPU instructions . 15 3.2 Summary of architectures . 27 4.1 Results of the YUV to RGB conversion program . 33 4.2 Results of the kernels of the YUV to RGB conversion program . 34 5.1 Mapping on the Cell . 42 5.2 Mapping on the GPU . 51 6.1 Number of tiles compared to size of memories . 69 E.1 Timings of program increasing all elements in an array . 86 E.2 Timings of program increasing one variable . 86 ix List of Figures 2.1 The graphics pipeline . .4 2.2 Vertex processing . .4 2.3 Rasterization . .5 2.4 Anti-Aliasing . .6 2.5 Texture mapping . .9 2.6 The demo application . 10 3.1 Block diagram of the Cell Broadband Engine processor . 12 3.2 Vector types on the SPU . 13 3.3 SPU Functional Units . 14 3.4 Element Interconnect Bus . 15 3.5 Block diagram of GeForce 8800 . 19 3.6 Scatter operations . 21 3.7 Gather operations . 21 3.8 Compilation trajectory of CUDA . 23 5.1 Drawing one line of a triangle . 37 5.2 First function . 38 5.3 Fixpoint number . 41 5.4 Sending pointers to SPEs . 42 5.5 Mapping of first function on Cell . 43 5.6 Drawing two triangles in parallel . 44 5.7 Illustration of locking lines . 44 5.8 Transfers between system memory and GPU memory . 46 xi 5.9 Mapping of first function on GPU . 52 5.10 Each triangle can have its own height . 53 6.1 Division of output image over the number of PEs . 60 6.2 Ideas for the proposed architecture . 63 6.3 Mipmapping . 66 6.4 Block diagram of the proposed architecture . 70 D.1 Subword parallelism for vertical scaling . 83 D.2 Subword parallelism for horizontal scaling . 83 D.3 Subword parallelism for computing RGB . 84 E.1 Coalesced memory accesses . 87 E.2 Non-coalesced memory accesses . 88 E.3 Reduction operation . 89 xii List of Acronyms and Abbreviations 2D Two Dimensional 3D Three Dimensional AA Anti-Aliasing ALU Arithmetic Logic Unit API Application Programmer’s Interface BIF Cell Broadband Engine Interface CA Communication Assist CAL Compute Abstraction Layer CBEA Cell Broadband Engine Architecture Cell BE Cell Broadband Engine CPU Central Processing Unit CUDA Compute Unified Device Architecture DMA Direct Memory Access DRAM Dynamic Random Access Memory ECC Error-Correcting Code EIB Element Interconnect Bus FAQ Frequently Asked Questions xiii FIFO First In First Out FPGA Field Programmable Gate Array FPOA Field Programmable Object Array GB Gigabyte Gbit Gigabit GCC GNU Compiler Collection GDDR3 Graphics Double Data Rate, version 3 GDDR4 Graphics Double Data Rate, version 4 GDDR5 Graphics Double Data Rate, version 5 GHz Gigahertz GPU Graphics Processing Unit GUI Graphical User Interface HD High Definition Hz Hertz I/O Input and Output ID Identifier IOIF I/O Interface ISA Instruction Set Architecture KB Kilobyte L1 Level-1 L2 Level-2 LS Local Store MB Megabyte xiv Mbit Megabit MFC Memory Flow Controller MHz Megahertz MIC Memory Interface Controller MIMD Multiple Instruction, Multiple Data MPEG Moving Picture Experts Group ms Millisecond mutex Mutual Exclusion NaN Not a Number OS Operating System PC Personal Computer PCI Peripheral Component Interconnect PE Processing Element pixel Picture element PPE PowerPC Processor Element PPM Portable Pixel Map PS3 PLAYSTATION 3 PTX Parallel Thread Execution QoS Quality-of-Service RAM Random Access Memory RGB Red Green Blue ROP Render Output unit s Second xv SDK Software Development Kit SFU Special Function Unit SIMD Single Instruction, Multiple Data SP Stream Processor SPE Synergistic Processor Element SPU Synergistic Processor Unit texel Texture element TnL Transform and Lighting VLIW Very Long Instruction Word VMX Vector Multimedia Extensions WMV Windows Media Video XDR Extreme Data Rate XLC XL C/C++ Compiler YUV Luminance-Chrominance µs Microsecond xvi Chapter 1 Introduction This report serves as the final report for the graduation project started in January 2008, which consists of an evaluation of the performance of certain currently available highly parallel proces- sor architectures with respect to the gaming application domain. Using the results, conclusions and lessons learned from the evaluation, a new architecture is proposed. The 3D graphics pipeline is the main aspect for the evaluation of the gaming application do- main and is explained in chapter 2.