Graphics Processing Unit


● Definition
● How it works
● Architecture
● GPU vs CPU
● GPGPUs
● Applications

Definition

GPUs are processors dedicated to graphics processing, belonging to the SIMD (Single Instruction, Multiple Data) class. They are designed specifically for floating-point calculations, which are essential for image rendering. Their main characteristics are a high capacity for massively parallel processing, full programmability, and strong performance on calculations over large volumes of data, resulting in high throughput (the number of instructions executed per second).

How it works

To generate images, the GPU executes a sequence of stages involving the construction of geometric elements, the application of colors, the insertion of effects, and so on. This sequence of stages is called the graphics pipeline.

The graphics pipeline has five stages: vertex processing, primitive processing, rasterization, fragment processing, and pixel operations. We will examine these stages as the GPU executes them to process a graphical object.

● Vertex processing: the GPU receives the set of vertices of the object to be processed. In our example, the GPU receives a set of vertices describing two overlapping triangles.
● Primitive processing: a complex polygon is decomposed into simpler polygons called primitives. In this stage triangulation takes place: the vertices are connected to form triangles, arranged so that together they take the shape of the object being processed.
● Rasterization: the object is filled with pixels. Each primitive is processed independently.
● Fragment processing: in this stage the pixels receive colors based on the object's color, lighting, and shading.
● Pixel operations: if an object is hidden behind another object, it must not appear in the final image. Imagine a viewpoint from which object 1 is visible but object 2 is obscured by object 1. Since each object was processed independently, calculations based on distance from the observer are performed so that the pixels of object 2 are discarded and do not appear in the image. After this stage, the processed image is written to video memory.

GPUs can rely on several resources to execute these stages, among them:

● Pixel shader: a program that generates effects on a per-pixel basis. This resource is widely used in 3D images (in games, for example) to produce lighting, reflection, and shadowing effects;
● Vertex shader: a program that operates on structures formed by vertices, dealing therefore with geometric figures. It is used to model the objects to be displayed;
● Render Output Unit (ROP): essentially, it manipulates the data stored in video memory so that it is "transformed" into the set of pixels that will form the images shown on screen. These units are responsible for applying filters, depth effects, and so on;
● Texture Mapping Unit (TMU): a component capable of rotating and resizing bitmaps (essentially, images formed by sets of pixels) in order to apply a texture onto a surface.

Fermi architecture

A GPU devotes more of its silicon to ALUs, which makes it much more cost-efficient when running parallel software. Consequently, the GPU is built for workloads that demand large parallel computations, with more emphasis on throughput than on latency. NVIDIA GPUs are organized into:

● Streaming Multiprocessors (SM): responsible for executing the thread blocks handed to the GPU.
● GigaThread Engine: responsible for distributing the thread blocks to the dozens of SMs.

An SM contains:

● Two warp schedulers: the dual warp scheduler receives groups of 32 threads, called warps, and selects one instruction from each warp.
● Two dispatch units: responsible for sending the instructions to the SM's dozens of execution units.
● Load/store units: compute source and destination addresses for the threads and move data between the cache and the shared memory.
● Special Function Units (SFU): responsible for transcendental calculations such as sine, cosine, and tangent.
● Two columns of 16 CUDA cores each.
● Each core contains an ALU and an FPU.
● One L1 cache per SM.
● Registers: each thread has access only to its own registers;
● L1 + shared memory: caches used by individual threads;
● Local memory: used to store "spilled registers". A register spill occurs when a thread block requires more register space than is available on the SM.

CPU vs GPU

A CPU generally has a small number of cores compared to a GPU. A GPU can have hundreds of cores, each with an ALU for integers and for floating-point numbers, with pipelined operations.

A modern general-purpose CPU is built with layers of cache to reduce memory-access latency. Each core has an L1 data cache and an L1 instruction cache, backed by an L2 cache. There is also a third level, the L3 cache, which is shared by all the CPU's cores. When the data is not in the caches, it is fetched from memory.

A GPU is made up of several Processor Clusters (PC), each containing several units called Streaming Multiprocessors (SM). Each SM comprises several blocks, each of which holds an L1 instruction cache shared by the dozens of cores (also called CUDA cores) that make up the block. All SMs share the same L2 cache.
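The dual warp scheduler described above issues a single instruction that all 32 threads of a warp execute in lockstep, each on its own data. The following is a minimal pure-Python sketch of that SIMT execution style; the names `issue` and `WARP_SIZE` are illustrative, not part of any real hardware API:

```python
# Illustrative model of SIMT execution: one instruction is issued and
# every lane (thread) of the warp applies it to its private operands.
WARP_SIZE = 32

def issue(instruction, operands_a, operands_b):
    """Apply one instruction across all lanes of the warp."""
    return [instruction(a, b) for a, b in zip(operands_a, operands_b)]

# Each of the 32 lanes holds its own register values.
regs_a = list(range(WARP_SIZE))   # lane i holds the value i
regs_b = [2] * WARP_SIZE          # every lane holds 2

# One multiply instruction, 32 results: single instruction, multiple data.
result = issue(lambda a, b: a * b, regs_a, regs_b)
print(result[0], result[1], result[31])   # 0 2 62
```

In real hardware the 32 lanes execute simultaneously; the loop here only models the lockstep semantics, not the parallelism.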
A GPU has far fewer caches than a CPU. A GPU is not primarily concerned with memory-access latency, that is, the time spent retrieving a piece of data from memory, but rather with the amount of work performed in parallel. A GPU is designed for data-parallel computation.

GPGPUs

The General-Purpose Graphics Processing Unit, or GPGPU (General Purpose Graphics Processing Unit), uses the GPU for tasks beyond graphics rendering, such as image processing, computer vision, artificial intelligence, and numerical computation, among other applications. In other words, it is the use of the GPU to perform computations in applications that were previously handled by the CPU (Central Processing Unit).

NVIDIA introduced the Compute Unified Device Architecture (CUDA) in 2006. It is a parallel computing platform and programming model that delivers a significant performance increase by harnessing the power of the GPU. By providing simple abstractions for the hierarchical organization of threads, memory, and synchronization, the CUDA programming model lets programmers write scalable programs without having to learn a multitude of new programming constructs.

Some NVIDIA GPU models:

● G80: introduced in November 2006. The GeForce 8800 was the first GPU to support the C language, introducing the CUDA technology. The card has a unified architecture with 128 CUDA cores distributed across 8 SMs.
● Fermi: released in April 2010, this architecture brought support for new instructions for C++ programs, such as dynamic object allocation and exception handling with try and catch. Each SM of a Fermi processor has 32 CUDA cores. Up to 16 double-precision operations per SM can be executed in each clock cycle.
● Kepler: released in 2012, NVIDIA's newest architecture at the time introduced a new SM design, called SMX, with 192 CUDA cores each, for a total of 1536 cores on the chip. This architectural change was the main factor behind a major reduction in power consumption.

Applications

● GPU-accelerated computing: the use of a GPU together with a CPU to accelerate scientific, analytics, consumer, engineering, and enterprise applications. Essentially, the GPU takes over the most compute-intensive part of the application, while the CPU handles the rest.
● Deep learning: deep learning architectures are usually complex and need large amounts of data for training, so heavy reliance on computing power is inevitable when applying these techniques. GPUs excel at parallel workloads and can speed up neural networks by a factor of 10 to 20.
● Autonomous cars: practically every autonomous or semi-autonomous car under development today needs a computing platform built around GPUs. NVIDIA, for example, launched its Drive PX line of boards, which enable automated highway driving and HD mapping; they allow vehicles to use neural networks to process data from multiple cameras and sensors, understand in real time what is happening around the vehicle, plot a safe route, and perform 24 trillion deep learning operations per second.
● Medicine: GPUs are used to process the gigabytes of data generated every second for use by medical professionals in delivering healthcare to patients, for example in detecting breast cancer and in accelerated, intelligent genomic sequencing and analysis; molecular modeling also uses the graphics processing unit for simulations.
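The offload model described under GPU-accelerated computing, in which the CPU hands the data-parallel portion of the work to a grid of GPU thread blocks so that each thread computes one element, can be sketched in pure Python. This is only a serial emulation of the CUDA-style grid/block/thread indexing; `launch` and `vector_add` are hypothetical names for illustration, not the actual CUDA API:

```python
# Serial emulation (for illustration only) of the CUDA-style
# grid/block/thread indexing used in GPGPU programming.

def launch(kernel, grid_dim, block_dim, *args):
    """Run the kernel once per thread, as the GPU would do in parallel."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

def vector_add(block_idx, thread_idx, block_dim, a, b, out):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(out):                         # guard: more threads than elements
        out[i] = a[i] + b[i]

n = 10
a = list(range(n))    # [0, 1, ..., 9]
b = [10] * n
out = [0] * n
launch(vector_add, 3, 4, a, b, out)   # 3 blocks x 4 threads = 12 threads
print(out)
```

On a real GPU the GigaThread Engine distributes the blocks across SMs and the threads run concurrently; the nested loops here only reproduce the indexing scheme.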