Evaluation and Design of Multi-Processor Architectures

Total Page:16

File Type:pdf, Size:1020Kb

Evaluation and Design of Multi-Processor Architectures Eindhoven University of Technology MASTER Evaluation and design of multi-processor architectures Wan, W.K. Award date: 2008 Link to publication Disclaimer This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration. General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain Technische Universiteit Eindhoven Department of Mathematics and Computer Science and Department of Electrical Engineering Master’s Thesis EVALUATION AND DESIGN OF MULTI-PROCESSOR ARCHITECTURES By Win King Wan Supervisor: prof.dr. Henk Corporaal Tutor: dr.ir. Bart Mesman August 2008 Abstract This master’s thesis consists of an evaluation of two very promising parallel processor archi- tectures, namely the Cell Broadband Engine and the NVIDIA GeForce 8 Series GPU and a proposal for the design of a new architecture. The evaluation is done with respect to the gaming application domain, in particular the 3D graphics pipeline. The graphics pipeline is chosen as it consists of all the steps that are neces- sary to project 3D objects onto a 2D screen and is therefore an important part in games. For the evaluation, the most time consuming parts of the graphics pipeline are mapped onto the two architectures, followed by performing optimizations and measurements and documenting the most important lessons learned. The results from this evaluation are ultimately used to propose a new architecture, which is able to process the steps in the graphics pipeline with high performance and can therefore be used as the first step towards our goal of obtaining “the ultimate gaming architecture”. iii Acknowledgements I want to thank my supervisor Henk Corporaal and my tutor Bart Mesman for introducing me to this project and for their time, support and feedback for the entire duration of my project. Furthermore, I want to thank fellow student Paul Meys, who worked on the same project as I did and who provided valuable feedback. I also want to thank my other fellow graduation students Chong Sheau Pei (Esther), Alberto Falcon, Raymond Frijns, Jochem van der Meer and Patricia Fuentes Montoya for providing the enjoyable times in the student room on the ninth floor in the Electrical Engineering building. Other people I want to mention are my friends Wim Cools, Frank Ophelders, Tim Paffen, Marcel Steine, Mohammed El-Kebir, Martin Harwig, Diederik van Houten, Nick Matthijssen, Fred van Nijnatten and Alex Onete for joining several of the weekly pool meetings and doing other fun things for the entire duration of the project. I want to thank the people in entire Electronic Systems Group as well, for the enjoyable times, like the daily coffee breaks, during my project. Finally, I want to thank my parents and sisters for their support. v Table of Contents List of Tables ix List of Figures xi List of Acronyms and Abbreviations xiii 1 Introduction 1 2 Graphics Pipeline 3 2.1 Description of the graphics pipeline . .3 2.2 Software Implementation . .7 2.2.1 The Irrlicht Engine . .8 3 Multi-Processor Architectures 11 3.1 Cell Broadband Engine Architecture . 11 3.1.1 PowerPC Processor Element . 11 3.1.2 Synergistic Processor Element . 13 3.1.3 Element Interconnect Bus . 15 3.1.4 Memory . 16 3.1.5 Input/Output . 17 3.1.6 Programming . 17 3.1.7 PlayStation 3 . 18 3.2 NVIDIA GeForce 8 Series Graphics Processing Unit . 19 3.2.1 Stream Processor . 20 3.2.2 Memory . 20 3.2.3 Programming . 21 3.2.4 GeForce 8800 GT . 24 3.3 Other architectures . 25 4 Getting familiar with the architectures 29 4.1 Application . 29 4.2 Mapping on Cell . 31 vii 4.3 Mapping on GPU . 32 4.4 Comparison . 33 5 Mapping 35 5.1 Mapped functionality . 35 5.1.1 Description of the mapped functions . 36 5.2 Mapping on the Cell . 39 5.2.1 Optimizations . 39 5.2.2 Alternative mapping . 43 5.3 Mapping on the GPU . 46 5.3.1 Optimizations . 48 5.3.2 Alternative mapping . 53 6 Proposal 57 6.1 Evaluation of Cell and GPU . 57 6.2 Ideas for the proposal . 60 6.3 Proposed architecture . 63 7 Conclusion and Future Work 73 7.1 Conclusion . 73 7.2 Future Work . 75 A Sample PPE Source Code 77 B Sample CUDA Code 79 C Compiling Irrlicht using GCC 81 D Subword parallelism for the YUV to RGB program 83 E Extra information for mapping 85 E.1 Software Managed Cache API . 85 E.2 Comparison between the PPE and the Intel Q6600 CPU . 86 E.3 Coalesced memory access . 87 E.4 Reduction operation . 88 Bibliography 91 viii List of Tables 3.1 Latency of SPU instructions . 15 3.2 Summary of architectures . 27 4.1 Results of the YUV to RGB conversion program . 33 4.2 Results of the kernels of the YUV to RGB conversion program . 34 5.1 Mapping on the Cell . 42 5.2 Mapping on the GPU . 51 6.1 Number of tiles compared to size of memories . 69 E.1 Timings of program increasing all elements in an array . 86 E.2 Timings of program increasing one variable . 86 ix List of Figures 2.1 The graphics pipeline . .4 2.2 Vertex processing . .4 2.3 Rasterization . .5 2.4 Anti-Aliasing . .6 2.5 Texture mapping . .9 2.6 The demo application . 10 3.1 Block diagram of the Cell Broadband Engine processor . 12 3.2 Vector types on the SPU . 13 3.3 SPU Functional Units . 14 3.4 Element Interconnect Bus . 15 3.5 Block diagram of GeForce 8800 . 19 3.6 Scatter operations . 21 3.7 Gather operations . 21 3.8 Compilation trajectory of CUDA . 23 5.1 Drawing one line of a triangle . 37 5.2 First function . 38 5.3 Fixpoint number . 41 5.4 Sending pointers to SPEs . 42 5.5 Mapping of first function on Cell . 43 5.6 Drawing two triangles in parallel . 44 5.7 Illustration of locking lines . 44 5.8 Transfers between system memory and GPU memory . 46 xi 5.9 Mapping of first function on GPU . 52 5.10 Each triangle can have its own height . 53 6.1 Division of output image over the number of PEs . 60 6.2 Ideas for the proposed architecture . 63 6.3 Mipmapping . 66 6.4 Block diagram of the proposed architecture . 70 D.1 Subword parallelism for vertical scaling . 83 D.2 Subword parallelism for horizontal scaling . 83 D.3 Subword parallelism for computing RGB . 84 E.1 Coalesced memory accesses . 87 E.2 Non-coalesced memory accesses . 88 E.3 Reduction operation . 89 xii List of Acronyms and Abbreviations 2D Two Dimensional 3D Three Dimensional AA Anti-Aliasing ALU Arithmetic Logic Unit API Application Programmer’s Interface BIF Cell Broadband Engine Interface CA Communication Assist CAL Compute Abstraction Layer CBEA Cell Broadband Engine Architecture Cell BE Cell Broadband Engine CPU Central Processing Unit CUDA Compute Unified Device Architecture DMA Direct Memory Access DRAM Dynamic Random Access Memory ECC Error-Correcting Code EIB Element Interconnect Bus FAQ Frequently Asked Questions xiii FIFO First In First Out FPGA Field Programmable Gate Array FPOA Field Programmable Object Array GB Gigabyte Gbit Gigabit GCC GNU Compiler Collection GDDR3 Graphics Double Data Rate, version 3 GDDR4 Graphics Double Data Rate, version 4 GDDR5 Graphics Double Data Rate, version 5 GHz Gigahertz GPU Graphics Processing Unit GUI Graphical User Interface HD High Definition Hz Hertz I/O Input and Output ID Identifier IOIF I/O Interface ISA Instruction Set Architecture KB Kilobyte L1 Level-1 L2 Level-2 LS Local Store MB Megabyte xiv Mbit Megabit MFC Memory Flow Controller MHz Megahertz MIC Memory Interface Controller MIMD Multiple Instruction, Multiple Data MPEG Moving Picture Experts Group ms Millisecond mutex Mutual Exclusion NaN Not a Number OS Operating System PC Personal Computer PCI Peripheral Component Interconnect PE Processing Element pixel Picture element PPE PowerPC Processor Element PPM Portable Pixel Map PS3 PLAYSTATION 3 PTX Parallel Thread Execution QoS Quality-of-Service RAM Random Access Memory RGB Red Green Blue ROP Render Output unit s Second xv SDK Software Development Kit SFU Special Function Unit SIMD Single Instruction, Multiple Data SP Stream Processor SPE Synergistic Processor Element SPU Synergistic Processor Unit texel Texture element TnL Transform and Lighting VLIW Very Long Instruction Word VMX Vector Multimedia Extensions WMV Windows Media Video XDR Extreme Data Rate XLC XL C/C++ Compiler YUV Luminance-Chrominance µs Microsecond xvi Chapter 1 Introduction This report serves as the final report for the graduation project started in January 2008, which consists of an evaluation of the performance of certain currently available highly parallel proces- sor architectures with respect to the gaming application domain. Using the results, conclusions and lessons learned from the evaluation, a new architecture is proposed. The 3D graphics pipeline is the main aspect for the evaluation of the gaming application do- main and is explained in chapter 2.
Recommended publications
  • Desarrollo Del Juego Sky Fighter Mediante XNA 3.1 Para PC
    Departamento de Informática PROYECTO FIN DE CARRERA Desarrollo del juego Sky Fighter mediante XNA 3.1 para PC Autor: Íñigo Goicolea Martínez Tutor: Juan Peralta Donate Leganés, abril de 2011 Proyecto Fin de Carrera Alumno: Íñigo Goicolea Martínez Sky Fighter Tutor: Juan Peralta Donate Agradecimientos Este proyecto es la culminación de muchos meses de trabajo, y de una carrera a la que llevo dedicando más de cinco años. En estas líneas me gustaría recordar y agradecer a todas las personas que me han permitido llegar hasta aquí. En primer lugar a mis padres, Antonio y Lola, por el apoyo que me han dado siempre. Por creer en mí y confiar en que siempre voy a ser capaz de salir adelante y no dudar jamás de su hijo. Y lo mismo puedo decir de mis dos hermanos, Antonio y Manuel. A Juan Peralta, mi tutor, por darme la oportunidad de realizar este proyecto que me ha permitido acercarme más al mundo de los videojuegos, algo en lo que querría trabajar. Pese a que él también estaba ocupado con su tesis doctoral, siempre ha sacado tiempo para resolver dudas y aportar sugerencias. A Sergio, Antonio, Toño, Alberto, Dani, Jorge, Álvaro, Fernando, Marta, Carlos, otro Antonio y Javier. Todos los compañeros, y amigos, que he hecho y que he tenido a lo largo de la carrera y gracias a los cuales he podido llegar hasta aquí. Por último, y no menos importante, a los demás familiares y amigos con los que paso mucho tiempo de mi vida, porque siempre están ahí cuando hacen falta.
    [Show full text]
  • SIMD Extensions
    SIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel.
    [Show full text]
  • A Data-Driven Approach for Personalized Drama Management
    A DATA-DRIVEN APPROACH FOR PERSONALIZED DRAMA MANAGEMENT A Thesis Presented to The Academic Faculty by Hong Yu In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Computer Science Georgia Institute of Technology August 2015 Copyright © 2015 by Hong Yu A DATA-DRIVEN APPROACH FOR PERSONALIZED DRAMA MANAGEMENT Approved by: Dr. Mark O. Riedl, Advisor Dr. David Roberts School of Interactive Computing Department of Computer Science Georgia Institute of Technology North Carolina State University Dr. Charles Isbell Dr. Andrea Thomaz School of Interactive Computing School of Interactive Computing Georgia Institute of Technology Georgia Institute of Technology Dr. Brian Magerko Date Approved: April 23, 2015 School of Literature, Media, and Communication Georgia Institute of Technology To my family iii ACKNOWLEDGEMENTS First and foremost, I would like to express my most sincere gratitude and appreciation to Mark Riedl, who has been my advisor and mentor throughout the development of my study and research at Georgia Tech. He has been supportive ever since the days I took his Advanced Game AI class and entered the Entertainment Intelligence lab. Thanks to him I had the opportunity to work on the interactive narrative project which turned into my thesis topic. Without his consistent guidance, encouragement and support, this dissertation would never have been successfully completed. I would also like to gratefully thank my dissertation committee, Charles Isbell, Brian Magerko, David Roberts and Andrea Thomaz for their time, effort and the opportunities to work with them. Their expertise, insightful comments and experience in multiple research fields have been really beneficial to my thesis research.
    [Show full text]
  • This Is Your Presentation Title
    Introduction to GPU/Parallel Computing Ioannis E. Venetis University of Patras 1 Introduction to GPU/Parallel Computing www.prace-ri.eu Introduction to High Performance Systems 2 Introduction to GPU/Parallel Computing www.prace-ri.eu Wait, what? Aren’t we here to talk about GPUs? And how to program them with CUDA? Yes, but we need to understand their place and their purpose in modern High Performance Systems This will make it clear when it is beneficial to use them 3 Introduction to GPU/Parallel Computing www.prace-ri.eu Top 500 (June 2017) CPU Accel. Rmax Rpeak Power Rank Site System Cores Cores (TFlop/s) (TFlop/s) (kW) National Sunway TaihuLight - Sunway MPP, Supercomputing Center Sunway SW26010 260C 1.45GHz, 1 10.649.600 - 93.014,6 125.435,9 15.371 in Wuxi Sunway China NRCPC National Super Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Computer Center in Cluster, Intel Xeon E5-2692 12C 2 Guangzhou 2.200GHz, TH Express-2, Intel Xeon 3.120.000 2.736.000 33.862,7 54.902,4 17.808 China Phi 31S1P NUDT Swiss National Piz Daint - Cray XC50, Xeon E5- Supercomputing Centre 2690v3 12C 2.6GHz, Aries interconnect 3 361.760 297.920 19.590,0 25.326,3 2.272 (CSCS) , NVIDIA Tesla P100 Cray Inc. DOE/SC/Oak Ridge Titan - Cray XK7 , Opteron 6274 16C National Laboratory 2.200GHz, Cray Gemini interconnect, 4 560.640 261.632 17.590,0 27.112,5 8.209 United States NVIDIA K20x Cray Inc. DOE/NNSA/LLNL Sequoia - BlueGene/Q, Power BQC 5 United States 16C 1.60 GHz, Custom 1.572.864 - 17.173,2 20.132,7 7.890 4 Introduction to GPU/ParallelIBM Computing www.prace-ri.eu How do
    [Show full text]
  • Design and Performance Evaluation of a Software Framework for Multi-Physics Simulations on Heterogeneous Supercomputers
    Design and Performance Evaluation of a Software Framework for Multi-Physics Simulations on Heterogeneous Supercomputers Entwurf und Performance-Evaluierung eines Software-Frameworks für Multi-Physik-Simulationen auf heterogenen Supercomputern Der Technischen Fakultät der Universität Erlangen-Nürnberg zur Erlangung des Grades Doktor-Ingenieur vorgelegt von Dipl.-Inf. Christian Feichtinger Erlangen, 2012 Als Dissertation genehmigt von der Technischen Fakultät der Universität Erlangen-Nürnberg Tag der Einreichung: 11. Juni 2012 Tag der Promotion: 24. July 2012 Dekan: Prof. Dr. Marion Merklein Berichterstatter: Prof. Dr. Ulrich Rüde Prof. Dr. Takayuki Aoki Prof. Dr. Gerhard Wellein Abstract Despite the experience of several decades the numerical simulation of computa- tional fluid dynamics is still an enormously challenging and active research field. Most simulation tasks of scientific and industrial relevance require the model- ing of multiple physical effects, complex numerical algorithms, and have to be executed on supercomputers due to their high computational demands. Fac- ing these complexities, the reimplementation of the entire functionality for each simulation task, forced by inflexible, non-maintainable, and non-extendable im- plementations is not feasible and bound to fail. The requirements to solve the involved research objectives can only be met in an interdisciplinary effort and by a clean and structured software development process leading to usable, main- tainable, and efficient software designs on all levels of the resulting software framework. The major scientific contribution of this thesis is the thorough design and imple- mentation of the software framework WaLBerla that is suitable for the simulation of multi-physics simulation tasks centered around the lattice Boltzmann method. The design goal of WaLBerla is to be usable, maintainable, and extendable as well as to enable efficient and scalable implementations on massively parallel super- computers.
    [Show full text]
  • Basado En Imágenes Parametrizadas Sobre Resnet Para IdentiCar Intrusiones En 'Smartwatches' U Otros Dispositivos ANes
    IA eñ ™ • Publicaciones de autores 'Framework' basado en imágenes parametrizadas sobre ResNet para identicar intrusiones en 'smartwatches' u otros dispositivos anes. (Un eje singular de la publicación “Estado del arte de la ciencia de datos en el idioma español y su aplicación en el campo de la Inteligencia Articial”) Juan Antonio Lloret Egea, Celia Medina Lloret, Adrián Hernández González, Diana Díaz Raboso, Carlos Campos, Kimberly Riveros Guzmán, Adrián Pérez Herrera, Luis Miguel Cortés Carballo, Héctor Miguel Terrés Lloret License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC-BY-NC-ND 4.0) IA eñ ™ • Publicaciones de autores aplicación en el campo de la Inteligencia Articial”) Abstracto Se ha definido un framework1 conceptual y algebraicamente, inexistente hasta ahora en su morfología, y pionero en aplicación en el campo de la Inteligencia Artificial (IA) de forma conjunta; e implementado en laboratorio, en sus aspectos más estructurales, como un modelo completamente operacional. Su mayor aportación a la IA a nivel cualitativo es aplicar la conversión o transducción de parámetros obtenidos con lógica ternaria[1] (sistemas multivaluados)2 y asociarlos a una imagen, que es analizada mediante una red residual artificial ResNet34[2],[3] para que nos advierta de una intrusión. El campo de aplicación de este framework va desde smartwaches, tablets y PC's hasta la domótica basada en el estándar KNX[4]. Abstract note The full version of this document in the English language will be available in this link. Código QR de la publicación Este marco propone para la IA una ingeniería inversa de tal modo que partiendo de principios matemáticos conocidos y revisables, aplicados en una imagen gráfica en 2D para detectar intrusiones, sea escrutada la privacidad y la seguridad de un dispositivo mediante la inteligencia artificial para mitigar estas lesiones a los usuarios finales.
    [Show full text]
  • Google Adquiere Motorola Mobility * Las Tablets PC Y Su Alcance * Synergy 1.3.1 * Circuito Impreso Al Instante * Proyecto GIMP-Es
    Google adquiere Motorola Mobility * Las Tablets PC y su alcance * Synergy 1.3.1 * Circuito impreso al instante * Proyecto GIMP-Es El vocero . 5 Premio Concurso 24 Aniversario de Joven Club Editorial Por Ernesto Rodríguez Joven Club, vivió el verano 2011 junto a ti 6 Aniversario 24 de los Joven Club La mirada de TINO . Cumple TINO 4 años de Los usuarios no comprueba los enlaces antes de abrirlos existencia en este septiembre, el sueño que vió 7 Un fallo en Facebook permite apropiarse de páginas creadas la luz en el 2007 es hoy toda una realidad con- Google adquiere Motorola Mobility vertida en proeza. Esfuerzo, tesón y duro bre- gar ha acompañado cada día a esta Revista que El escritorio . ha sabido crecerse en sí misma y superar obs- 8 Las Tablets PC y su alcance táculos y dificultades propias del diario de cur- 11 Propuesta de herramientas libre para el diseño de sitios Web sar. Un colectivo de colaboración joven, entu- 14 Joven Club, Infocomunidad y las TIC siasta y emprendedor –bajo la magistral con- 18 Un vistazo a la Informática forense ducción de Raymond- ha sabido mantener y El laboratorio . desarrollar este proyecto, fruto del trabajo y la profesionalidad de quienes convergen en él. 24 PlayOnLinux TINO acumula innegables resultados en estos 25 KMPlayer 2.9.2.1200 años. Más de 350 000 visitas, un volumen apre- 26 Synergy 1.3.1 ciable de descargas y suscripciones, servicios 27 imgSeek 0.8.6 estos que ha ido incorporando, pero por enci- El entrevistado . ma de todo está el agradecimiento de muchos 28 Hilda Arribas Robaina por su existencia, por sus consejos, su oportu- na información, su diálogo fácil y directo, su uti- El taller .
    [Show full text]
  • Estado Del Arte Metodologia Para El Desarrollo De
    ESTADO DEL ARTE METODOLOGIA PARA EL DESARROLLO DE APLICACIONES EN 3D PARA WINDOWS CON VISUAL STUDIO 2008 Y XNA 3.1 Integrantes EDWIN ANDRÉS BETANCUR ÁLVAREZ JHON FREDDY VELÁSQUEZ RESTREPO UNIVERSIDAD SAN BUENAVENTURA Facultad de Ingenierías Seccional Medellín Año 2012 TABLA DE CONTENIDO AGRADECIMIENTOS 3 PARTE 1 ESTADO DEL ARTE 1. RESUMEN 4 2. INTRODUCCIÓN 5 3. MICROSOFT XNA GAME STUDIO 6 3.1 CONCEPTOS 7 4. HISTORIA 10 4.1 El origen de los Videojuegos 10 4.2 XNA en la actualidad 18 4.3 Versiones 19 4.3.1 XNA Game Studio Professional 19 4.3.2 XNA Game Studio Express 20 4.3.3 XNA Game Studio 2.0 21 4.3.4 XNA Game Studio 3.0 21 4.3.5 XNA Game Studio 3.1 22 4.3.6 XNA Game Studio 4.1 23 5. OTROS FRAMEWORKS (ALTERNATIVAS) 24 5.1 PYGAME 24 5.2 PULPCORE 25 5.3 GAMESALAD 25 5.4 ADVENTURE GAME 26 5.5 BLENDER 3D 27 5.6 CRYSTAL SPACE 27 5.7 DIM3 28 5.8 GAME MAKER 28 5.9 M.U.G.E.N 28 6. POR QUÉ XNA 3.1 29 PARTE 2 MARCO TEORICO 7.1 BUCLE DEL JUEGO 32 7.2 Componentes del juego 33 7.3 COMPORTAMIENTO 35 8. REFERENCIAS 37 9. REFERENCIAS DE IMAGENES 39 10. ANEXOS 43 2 AGRADECIMIENTOS Este proyecto de grado tiene un origen muy especial el cual esta plasmado y respaldado por la trayectoria del Microsoft Student Tech Club USB donde se ha hecho el esfuerzo como estudiantes de pregrado para incentivar, informar y enamorar a los integrantes de la institución con las múltiples posibilidades y plataformas que nos ofrece Microsoft al tener el estado de estudiantes en formación.
    [Show full text]
  • Tutorial CUDA
    Graphic Processing Units – GPU (Section 7.7) History of GPUs • VGA in early 90’s -- A memory controller and display generator connected to some (video) RAM • By 1997, VGA controllers were incorporating some acceleration functions • In 2000, a single chip graphics processor incorporated almost every detail of the traditional high-end workstation graphics pipeline - Processors oriented to 3D graphics tasks - Vertex/pixel processing, shading, texture mapping, rasterization • More recently, processor instructions and memory hardware were added to support general-purpose programming languages • OpenGL: A standard specification defining an API for writing applications that produce 2D and 3D computer graphics • CUDA (compute unified device architecture): A scalable parallel programming model and language for GPUs based on C/C++ 70 Historical PC architecture 71 Contemporary PC architecture 72 Basic unified GPU architecture 73 Tutorial CUDA Cyril Zeller NVIDIA Developer Technology Note: These slides are truncated from a longer version which is publicly available on the web Enter the GPU GPU = Graphics Processing Unit Chip in computer video cards, PlayStation 3, Xbox, etc. Two major vendors: NVIDIA and ATI (now AMD) © NVIDIA Corporation 2008 Enter the GPU GPUs are massively multithreaded manycore chips NVIDIA Tesla products have up to 128 scalar processors Over 12,000 concurrent threads in flight Over 470 GFLOPS sustained performance Users across science & engineering disciplines are achieving 100x or better speedups on GPUs CS researchers can use
    [Show full text]
  • A GPU Approach to Fortran Legacy Systems
    A GPU Approach to Fortran Legacy Systems Mariano M¶endez,Fernando G. Tinetti¤ III-LIDI, Facultad de Inform¶atica,UNLP 50 y 120, 1900, La Plata Argentina Technical Report PLA-001-2012y October 2012 Abstract A large number of Fortran legacy programs are still running in production environments, and most of these applications are running sequentially. Multi- and Many- core architectures are established as (almost) the only processing hardware available, and new programming techniques that take advantage of these architectures are necessary. In this report, we will explore the impact of applying some of these techniques into legacy Fortran source code. Furthermore, we have measured the impact in the program performance introduced by each technique. The OpenACC standard has resulted in one of the most interesting techniques to be used on Fortran Legacy source code that brings speed up while requiring minimum source code changes. 1 Introduction Even though the concept of legacy software systems is widely known among developers, there is not a unique de¯nition about of what a legacy software system is. There are di®erent viewpoints on how to describe a legacy system. Di®erent de¯nitions make di®erent levels of emphasis on di®erent character- istics, e.g.: a) \The main thing that distinguishes legacy code from non-legacy code is tests, or rather a lack of tests" [15], b) \Legacy Software is critical software that cannot be modi¯ed e±ciently" [10], c) \Any information system that signi¯cantly resists modi¯cation and evolution to meet new and constantly changing business requirements." [7], d) \large software systems that we don't know how to cope with but that are vital to our organization.
    [Show full text]
  • Krakow, Poland
    Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015) Krakow, Poland Jesus Carretero, Javier Garcia Blas Roman Wyrzykowski, Emmanuel Jeannot. (Editors) September 10-11, 2015 Volume Editors Jesus Carretero University Carlos III Computer Architecture and Technology Area Computer Science Department Avda Universidad 30, 28911, Leganes, Spain E-mail: [email protected] Javier Garcia Blas University Carlos III Computer Architecture and Technology Area Computer Science Department Avda Universidad 30, 28911, Leganes, Spain E-mail: [email protected] Roman Wyrzykowski Institute of Computer and Information Science Czestochowa University of Technology ul. D ˛abrowskiego 73, 42-201 Cz˛estochowa, Poland E-mail: [email protected] Emmanuel Jeannot Equipe Runtime INRIA Bordeaux Sud-Ouest 200, Avenue de la Vielle Tour, 33405 Talence Cedex, France E-mail: [email protected] Published by: Computer Architecture,Communications, and Systems Group (ARCOS) University Carlos III Madrid, Spain http://www.nesus.eu ISBN: 978-84-608-2581-4 Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This document also is supported by: Printed in Madrid — October 2015 Preface Network for Sustainable Ultrascale Computing (NESUS) We are very excited to present the proceedings of the Second International Workshop on Sustainable Ultrascale Com- puting Systems (NESUS 2015), a workshop created to reflect the research and cooperation activities made in the NESUS COST Action (IC1035) (www.nesus.eu), but open to all the research community working in large/ultra-scale computing systems.
    [Show full text]
  • Accelerating Hpc Applications on Nvidia Gpus
    ACCELERATING HPC APPLICATIONS ON NVIDIA GPUS WITH OPENACC Doug Miles, PGI Compilers & Tools, NVIDIA High Performance Computing Advisory Council February 21, 2018 PGI — THE NVIDIA HPC SDK Fortran, C & C++ Compilers Optimizing, SIMD Vectorizing, OpenMP Accelerated Computing Features CUDA Fortran, OpenACC Directives Multi-Platform Solution X86-64 and OpenPOWER Multicore CPUs NVIDIA Tesla GPUs Supported on Linux, macOS, Windows MPI/OpenMP/OpenACC Tools Debugger Performance Profiler Interoperable with DDT, TotalView 2 Programming GPU-Accelerated Systems Separate CPU System and GPU Memories GPU Developer View PCIe System GPU Memory Memory 3 Programming GPU-Accelerated Systems Separate CPU System and GPU Memories GPU Developer View NVLink System GPU Memory Memory 4 CUDA FORTRAN attributes(global) subroutine mm_kernel ( A, B, C, N, M, L ) real :: A(N,M), B(M,L), C(N,L), Cij integer, value :: N, M, L real, device, allocatable, dimension(:,:) :: integer :: i, j, kb, k, tx, ty Adev,Bdev,Cdev real, shared :: Asub(16,16),Bsub(16,16) tx = threadidx%x . ty = threadidx%y i = blockidx%x * 16 + tx allocate (Adev(N,M), Bdev(M,L), Cdev(N,L)) j = blockidx%y * 16 + ty Adev = A(1:N,1:M) Cij = 0.0 Bdev = B(1:M,1:L) do kb = 1, M, 16 Asub(tx,ty) = A(i,kb+tx-1) call mm_kernel <<<dim3(N/16,M/16),dim3(16,16)>>> Bsub(tx,ty) = B(kb+ty-1,j) ( Adev, Bdev, Cdev, N, M, L ) call syncthreads() do k = 1,16 C(1:N,1:L) = Cdev Cij = Cij + Asub(tx,k) * Bsub(k,ty) deallocate ( Adev, Bdev, Cdev ) enddo call syncthreads() .
    [Show full text]