ECE 571 – Advanced Microprocessor-Based Design Lecture 25
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Reverse Engineering Power Management on NVIDIA Gpus - Anatomy of an Autonomic-Ready System Martin Peres
Reverse Engineering Power Management on NVIDIA GPUs - Anatomy of an Autonomic-ready System Martin Peres To cite this version: Martin Peres. Reverse Engineering Power Management on NVIDIA GPUs - Anatomy of an Autonomic-ready System. ECRTS, Operating Systems Platforms for Embedded Real-Time appli- cations 2013, Jul 2013, Paris, France. hal-00853849 HAL Id: hal-00853849 https://hal.archives-ouvertes.fr/hal-00853849 Submitted on 23 Aug 2013 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Reverse engineering power management on NVIDIA GPUs - Anatomy of an autonomic-ready system Martin Peres Ph.D. student at LaBRI University of Bordeaux Hobbyist Linux/Nouveau Developer Email: [email protected] Abstract—Research in power management is currently limited supported nor documented by NVIDIA. As GPUs are leading by the fact that companies do not release enough documentation the market in terms of performance-per-Watt [3], they are or interfaces to fully exploit the potential found in modern a good candidate for a reverse engineering effort of their processors. This problem is even more present in GPUs despite power management features. The choice of reverse engineering having the highest performance-per-Watt ratio found in today’s NVIDIA’s power management features makes sense as they processors. -
SIMD Extensions
SIMD Extensions PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 12 May 2012 17:14:46 UTC Contents Articles SIMD 1 MMX (instruction set) 6 3DNow! 8 Streaming SIMD Extensions 12 SSE2 16 SSE3 18 SSSE3 20 SSE4 22 SSE5 26 Advanced Vector Extensions 28 CVT16 instruction set 31 XOP instruction set 31 References Article Sources and Contributors 33 Image Sources, Licenses and Contributors 34 Article Licenses License 35 SIMD 1 SIMD Single instruction Multiple instruction Single data SISD MISD Multiple data SIMD MIMD Single instruction, multiple data (SIMD), is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously. Thus, such machines exploit data level parallelism. History The first use of SIMD instructions was in vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a vector of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector-processing architectures are now considered separate from SIMD machines, based on the fact that vector machines processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD machines process all elements of the vector simultaneously.[1] The first era of modern SIMD machines was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These machines had many limited-functionality processors that would work in parallel. -
This Is Your Presentation Title
Introduction to GPU/Parallel Computing Ioannis E. Venetis University of Patras 1 Introduction to GPU/Parallel Computing www.prace-ri.eu Introduction to High Performance Systems 2 Introduction to GPU/Parallel Computing www.prace-ri.eu Wait, what? Aren’t we here to talk about GPUs? And how to program them with CUDA? Yes, but we need to understand their place and their purpose in modern High Performance Systems This will make it clear when it is beneficial to use them 3 Introduction to GPU/Parallel Computing www.prace-ri.eu Top 500 (June 2017) CPU Accel. Rmax Rpeak Power Rank Site System Cores Cores (TFlop/s) (TFlop/s) (kW) National Sunway TaihuLight - Sunway MPP, Supercomputing Center Sunway SW26010 260C 1.45GHz, 1 10.649.600 - 93.014,6 125.435,9 15.371 in Wuxi Sunway China NRCPC National Super Tianhe-2 (MilkyWay-2) - TH-IVB-FEP Computer Center in Cluster, Intel Xeon E5-2692 12C 2 Guangzhou 2.200GHz, TH Express-2, Intel Xeon 3.120.000 2.736.000 33.862,7 54.902,4 17.808 China Phi 31S1P NUDT Swiss National Piz Daint - Cray XC50, Xeon E5- Supercomputing Centre 2690v3 12C 2.6GHz, Aries interconnect 3 361.760 297.920 19.590,0 25.326,3 2.272 (CSCS) , NVIDIA Tesla P100 Cray Inc. DOE/SC/Oak Ridge Titan - Cray XK7 , Opteron 6274 16C National Laboratory 2.200GHz, Cray Gemini interconnect, 4 560.640 261.632 17.590,0 27.112,5 8.209 United States NVIDIA K20x Cray Inc. DOE/NNSA/LLNL Sequoia - BlueGene/Q, Power BQC 5 United States 16C 1.60 GHz, Custom 1.572.864 - 17.173,2 20.132,7 7.890 4 Introduction to GPU/ParallelIBM Computing www.prace-ri.eu How do -
GPU4S: Embedded Gpus in Space
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. “The final publication is available at: DOI: 10.1109/DSD.2019.00064 GPU4S: Embedded GPUs in Space Leonidas Kosmidis∗,Jer´ omeˆ Lachaizey, Jaume Abella∗ Olivier Notebaerty, Francisco J. Cazorla∗;z, David Steenarix ∗Barcelona Supercomputing Center (BSC), Spain yAirbus Defence and Space, France zSpanish National Research Council (IIIA-CSIC), Spain xEuropean Space Agency, The Netherlands Abstract—Following the same trend of automotive and avion- in space [1][2]. Those studies concluded that although their ics, the space domain is witnessing an increase in the on-board energy efficiency is high, their power consumption is an order computing performance demands. This raise in performance of magnitude higher than the limited power budget of a space needs comes from both control and payload parts of the space- craft and calls for advanced electronics able to provide high system, which is limited to a couple of Watts. computational power under the constraints of the harsh space Interestingly, GPUs entered in the embedded domain to environment. On the non-technical side, for strategic reasons it is satisfy the increasing demand for multimedia-based hand- mandatory to get European independence on the used computing held and consumer devices such as smartphones, in-vehicle technology. In this project, which is still in its early phases, we entertainment systems, televisions, set-top boxes etc. -
Thermal Covert Channels Leveraging Package-On-Package DRAM
Thermal Covert Channels Leveraging Package-On-Package DRAM Shuai Chen∗, Wenjie Xiongy, Yehan Xu∗, Bing Li∗ and Jakub Szefery ∗Southeast University, Nanjing, China fchenshuai ic,220174472, bernie [email protected] yYale University, New Haven, CT, USA fwenjie.xiong, [email protected] Abstract—Package-on-Package (PoP) is an intergraded circuit observation of execution time [2], [3], or on observation packaging technique where multiple separate packages are of physical emanation, such as heat [4] or electromagnetic mounted vertically one on top of the other, allowing for more (EM) field [5], for example. This work focuses on thermal compact system design and reduction in the distance between channels, and shows how the heat, or temperature, can modules. However, this can introduce security vulnerabilities. be measured without special measurement equipment or In particular, this work shows that due to the close physical physical access in commodity SoC devices. Especially, this proximity of a System-on-a-Chip (SoC) package and a PoP work focuses on an SoC DRAM Package-on-Package (PoP) DRAM that is on top of it, a thermal covert channel exists configuration [6] where two chips (SoC and DRAM) are between the SoC and the PoP DRAM. The thermal covert stacked on top of each other. The heat transfers between the channel can allow multiple cores in the SoC to communicate two can be observed by measuring decay rate of DRAM by modulating the temperature of the PoP DRAM. Especially, cells, which is the basis for this work. it is possible for one core in the SoC to generate heat patterns, Previously, heat-based or thermal covert channels have which encode data that is to be transmitted, and another been demonstrated in data centers [7], or in multicore core to observe the transmitted pattern by measuring the processors [8], but these require dedicated thermal sensors decay rate of DRAM cells. -
Applications and Implementations
Applications and Implementations Hwanyong LEE CTO and Technical Marketing Director HUONE © Copyright Khronos Group, 2010 - Page 1 Khronos Family of Standards 3D Digital Asset Plugin-free Mobile OS accessibility Exchange format 3D Web Content Abstraction Authoring and Authoring A coordinated ecosystem of compute, graphics and media Cross platform Parallel Context and Surface Embedded 3D desktop 3D Computing Management standards and APIs Application Acceleration Acceleration Safety Critical 3D Steaming Media Advanced Audio Vector 2D Video, Audio and Window System System System Codec Creation Image Acceleration Acceleration Integration Integration Hundreds of man years invested by industry experts to create a coordinated visual computing ecosystem for accelerated parallel computation, 3D, video, audio and image processing on desktop, embedded and mobile systems © Copyright Khronos Group, 2010 - Page 2 OpenVG • Royalty-free open standard API • Low-level 2D vector graphics rendering API • OpenGL-style programming model • Advanced feature set enables - SVG, - Flash, - PDF, Postscript, Applications and UI - Java (JSR 287, 271, 226) SVG, Vector and - etc. Font Packages etc.. • Portable content • Map Applications • Hardware Acceleration Hardware Acceleration © Copyright Khronos Group, 2010 - Page 3 OpenVG Rendering Pipeline © Copyright Khronos Group, 2010 - Page 4 OpenVG with Native Graphics Processor • CPU sending data and commands to OpenVG hardware • OpenVG rendering pipeline is in the hardware CPU Native Vector Graphics Hardware © Copyright -
Paul Waring @Pwaring [email protected]
Low Cost Computing with Linux Paul Waring @pwaring [email protected] Low cost? £25-160 Creator CI20 Image: http://elinux.org/MIPS_Creator_CI20 Creator CI20 Release date May 2015 (v2) Cost £55-65 CPU Dual core 1.2Ghz MIPS 512 KB L2 cache GPU PowerVR SGX540 RAM 1 GB USB 2 ports (1 x OTG, 1 x Host) Networking 10/100Mbps Ethernet 802.11 b/g/n wireless Bluetooth 4.0 Onboard storage 8 GB flash 1 x SD MIPS Microprocessor without Interlocked Pipeline Stages Imagination Technologies – UK quoted company (LSE) MIPS Reduced Instruction Set Computer 32 and 64 bit MIPS Tend to be at the low/cheap end of the market DSL routers often have a MIPS CPU (e.g. Technicolour TG582n) Business model: License designs Raspberry Pi Image: https://www.raspberrypi.org/blog/raspberry-pi-3-on-sale/ Raspberry Pi B+ Release date February 2012 Cost £25-30 CPU Single core 700Mhz ARM11 128 KB L2 cache (shared with GPU) GPU Broadcom VideoCore IV RAM 512 MB (shared with GPU) USB 4 ports (via on-board hub) Networking 10/100Mbps Ethernet (USB) Onboard storage 1 x SD Raspberry Pi 2 Release date February 2015 Cost £25-30 CPU Quad core 900Mhz ARM Cortex-A7 256 KB L2 cache GPU Broadcom VideoCore IV RAM 1 GB USB 4 ports (via on-board hub) Networking 10/100Mbps Ethernet (USB) Onboard storage 1 x MicroSD Raspberry Pi 3 Release date February 2016 Cost £25-30 CPU Quad core 1.2Ghz ARM Cortex-A53 512 KB L2 cache GPU Broadcom VideoCore IV (at higher clock frequencies than B+ and 2) RAM 1 GB USB 4 ports (via on-board hub) Networking 10/100Mbps Ethernet 802.11 b/g/n wireless Bluetooth 4.1 Onboard -
A GPU Approach to Fortran Legacy Systems
A GPU Approach to Fortran Legacy Systems Mariano M¶endez,Fernando G. Tinetti¤ III-LIDI, Facultad de Inform¶atica,UNLP 50 y 120, 1900, La Plata Argentina Technical Report PLA-001-2012y October 2012 Abstract A large number of Fortran legacy programs are still running in production environments, and most of these applications are running sequentially. Multi- and Many- core architectures are established as (almost) the only processing hardware available, and new programming techniques that take advantage of these architectures are necessary. In this report, we will explore the impact of applying some of these techniques into legacy Fortran source code. Furthermore, we have measured the impact in the program performance introduced by each technique. The OpenACC standard has resulted in one of the most interesting techniques to be used on Fortran Legacy source code that brings speed up while requiring minimum source code changes. 1 Introduction Even though the concept of legacy software systems is widely known among developers, there is not a unique de¯nition about of what a legacy software system is. There are di®erent viewpoints on how to describe a legacy system. Di®erent de¯nitions make di®erent levels of emphasis on di®erent character- istics, e.g.: a) \The main thing that distinguishes legacy code from non-legacy code is tests, or rather a lack of tests" [15], b) \Legacy Software is critical software that cannot be modi¯ed e±ciently" [10], c) \Any information system that signi¯cantly resists modi¯cation and evolution to meet new and constantly changing business requirements." [7], d) \large software systems that we don't know how to cope with but that are vital to our organization. -
Effective Opengl 5 September 2016, Christophe Riccio
Effective OpenGL 5 September 2016, Christophe Riccio Table of Contents 0. Cross platform support 3 1. Internal texture formats 4 2. Configurable texture swizzling 5 3. BGRA texture swizzling using texture formats 6 4. Texture alpha swizzling 7 5. Half type constants 8 6. Color read format queries 9 7. sRGB texture 10 8. sRGB framebuffer object 11 9. sRGB default framebuffer 12 10. sRGB framebuffer blending precision 13 11. Compressed texture internal format support 14 12. Sized texture internal format support 15 13. Surviving without gl_DrawID 16 14. Cross architecture control of framebuffer restore and resolve to save bandwidth 17 15 Building platform specific code paths 18 16 Max texture sizes 19 17 Hardware compression format support 20 18 Draw buffers differences between APIs 21 19 iOS OpenGL ES extensions 22 20 Asynchronous pixel transfers 23 Change log 24 0. Cross platform support Initially released on January 1992, OpenGL has a long history which led to many versions; market specific variations such as OpenGL ES in July 2003 and WebGL in 2011; a backward compatibility break with OpenGL core profile in August 2009; and many vendor specifics, multi vendors (EXT), standard (ARB, OES), and cross API extensions (KHR). OpenGL is massively cross platform but it doesn’t mean it comes automagically. Just like C and C++ languages, it allows cross platform support but we have to work hard for it. The amount of work depends on the range of the application- targeted market. Across vendors? Eg: AMD, ARM, Intel, NVIDIA, PowerVR and Qualcomm GPUs. Across hardware generations? Eg: Tesla, Fermi, Kepler, Maxwell and Pascal architectures. -
Reverse Engineering Power Management on NVIDIA Gpus - a Detailed Overview
Reverse engineering power management on NVIDIA GPUs - A detailed overview Martin Peres Ph.D. student at LaBRI University of Bordeaux Hobbyist Linux/Nouveau Developer Email: [email protected] Abstract—Power management in Open Source operating sys- by Linus Torvalds [2], creator of Linux. With the exception of tems is currently limited by the fact that hardware companies the ARM-based Tegra, NVIDIA has never released enough do not release enough documentation to write the most power- documentation to provide 3D acceleration to a single GPU efficient drivers. This problem is even more present in GPUs after the Geforce 4 (2002). Power management was never despite having the highest performance-per-Watt ratio found in supported nor documented by NVIDIA. Reverse engineering today’s processors. This paper presents an overview of GPUs from a power management point of view and what power management the power management features of NVIDIA GPUs would allow features have been found when reverse engineering modern to improve the efficiency of the Linux community driver for NVIDIA GPUs. This paper finally discusses about the possibility NVIDIA GPUs called Nouveau [3]. This work will allow of achieving self-management on NVIDIA GPUs and also dis- driving the hardware more efficiently which should reduce the cusses the future challenges that will be faced by the community temperature, lower the fan noise and improve the battery-life when autonomic systems like Broadcom’s videocore R will become of laptops. mainstream. I. INTRODUCTION Nouveau is a fork of NVIDIA’s limited Open Source driver, xf86-video-nv [4] by Stephane Marchesin aimed at delivering Historically, CPU and GPU manufacturers were aiming Open Source 3D acceleration along with a port to the new to increase performance as it was the usual metric used by graphics Open Source architecture DRI [5]. -
NVIDIA's Fermi: the First Complete GPU Computing Architecture
NVIDIA’s Fermi: The First Complete GPU Computing Architecture A white paper by Peter N. Glaskowsky Prepared under contract with NVIDIA Corporation Copyright © September 2009, Peter N. Glaskowsky Peter N. Glaskowsky is a consulting computer architect, technology analyst, and professional blogger in Silicon Valley. Glaskowsky was the principal system architect of chip startup Montalvo Systems. Earlier, he was Editor in Chief of the award-winning industry newsletter Microprocessor Report. Glaskowsky writes the Speeds and Feeds blog for the CNET Blog Network: http://www.speedsnfeeds.com/ This document is licensed under the Creative Commons Attribution ShareAlike 3.0 License. In short: you are free to share and make derivative works of the file under the conditions that you appropriately attribute it, and that you distribute it only under a license identical to this one. http://creativecommons.org/licenses/by-sa/3.0/ Company and product names may be trademarks of the respective companies with which they are associated. 2 Executive Summary After 38 years of rapid progress, conventional microprocessor technology is beginning to see diminishing returns. The pace of improvement in clock speeds and architectural sophistication is slowing, and while single-threaded performance continues to improve, the focus has shifted to multicore designs. These too are reaching practical limits for personal computing; a quad-core CPU isn’t worth twice the price of a dual-core, and chips with even higher core counts aren’t likely to be a major driver of value in future PCs. CPUs will never go away, but GPUs are assuming a more prominent role in PC system architecture. -
ARM Linux Kernels and Graphics Drivers on Popular "Open" Hardware: Bleeding Edge Vs
SCaLE 13x – Open Source Hardware ARM Linux Kernels and Graphics Drivers on Popular "Open" Hardware: Bleeding Edge vs. Vendor Blobs and Kernel Forks - How Much is in Mainline, and How Open is Open? Prepared / Presented by Stephen Arnold, Principal Scientist VCT Labs Gentoo Linux / OpenEmbedded Developer What is ARM/Embedded? · Small Single Board Computer (SBC) or System on Chip (SoC) · Very resource-constrained · Zaurus 5000-D – 32 MB RAM, StrongARM SA-1100 (DEC/ARM) · Kurobox HG – 128 MB RAM, 256 MB flash, no display (G3, no altivec) · Modern devices blurring the lines between “embedded” and desktop/server-class hardware · Multicore CPUs – 2/4/8 cores · Per-core FPUs - VFP3/VFP4, NEON · Multicore GPUs – 192-core Cuda on Tegra K1 · Accelerated HD video processing · USB3, 10/100/1000 Ethernet, SATA, HDMI ARM Devices and Graphics Hardware · ARMv7 HardFloat VFP/NEON · Wandboard / udoo / cubox-i - iMX.6 quad core, Vivante GPU · Beaglebone black / white - AM335X single core, OMAP3 / SGX GPU, PRUs · Sunxi MK802-II 1GB TV stick - Allwinner A10 single core, Mali GPU · Samsung Chromebook - Exynos5 dual core, Mali GPU · Acer Chromebook / Jetson TK1 – Tegra K1 quad-core, NVDIA Cuda GPU · Genesi SmartBook - Freescale iMX.5 single core, AMD z430 GPU ARM Graphics Hardware cont. · ARMv7 HardFloat VFP (no NEON) · Trimslice Diskless - NVIDIA Tegra 2 dual core CPU/GPU · ARMv6 HardFloat VFP (no NEON) · Raspberry Pi - Broadcom SoC single core, VideoCore IV GPU The State of ARM Graphics · (mostly) Current Vendor Blobs · Cubox-i4Pro (iMX.6) · RaspberryPi (VideoCore IV) · Allwinner (Mali) · ChromeOS K1 (Tegra124) · TI (OMAP/SGX) · Open Source Graphics · Tegra/Nouveau – opentegra/grate, nouveau w/firmware · Broadcom/VideoCore IV – weston/wayland, fbturbo · Mali – lima, fbturbo · OMAP – omapfb, omap3 · Vivante – etna-viv, fbturbo · Adreno – freedreno (2D/3D, xorg) Vendor Kernel Forks · Typically a single (older) kernel branch with lots of patches · Minimal backporting (maybe none) · Forwardporting to new branch can take a long time..