AMD “Kabini” APU SOC DAN BOUVIER BEN BATES, WALTER FRY, SREEKANTH GODEY HOT CHIPS 25 AUGUST 2013

Total Page:16

File Type:pdf, Size:1020Kb

AMD “Kabini” APU SOC DAN BOUVIER BEN BATES, WALTER FRY, SREEKANTH GODEY HOT CHIPS 25 AUGUST 2013 AMD “Kabini” APU SOC DAN BOUVIER BEN BATES, WALTER FRY, SREEKANTH GODEY HOT CHIPS 25 AUGUST 2013 “KABINI” FLOORPLAN 28nm technology, 105 mm2, 914M transistors KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 2 "JAGUAR" CORE DESIGN GOALS IMPROVE ON “BOBCAT”: UPDATE THE INCREASE PROCESS PERFORMANCE IN A ISA/FEATURE SET PORTABILITY GIVEN POWER ENVELOPE “Jaguar” added: More IPC SSE4.1, SSE4.2 40-bit physical AES, CLMUL address-capable Better frequency at given voltage MOVBE Improved Improved power efficiency through AVX, virtualization clock gating and unit redesign XSAVE/XSAVEOPT F16C, BMI1 KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 3 "JAGUAR" ENHANCEMENTS 32KB Branch Improved IC prefetcher ICACHE Prediction Larger IB for improved New hardware divider fetch/decode decoupling Decode and New/improved COPs: Microcode ROMS CRC32/SSE4.2, BMI1, POPCNT, LZCNT More OOO resources Int Rename FP Decode Rename 128b native hardware Scheduler Scheduler 4 SP muls + 4 SP adds FP Scheduler Int PRF 1 DP mul + 2 DP adds FP PRF ALU ALU LAGU SAGU ISA: many new COPs VALU VALU Mul 256b AVX support VIMul St Conv. Ld/St Queues redesign Div New zero optimizations Enhanced tablewalks FPAdd FPMul 32KB Ld/St 128b data path to FPU DCache Queues Improved write combining To/From BU More outstanding Shared Cache Unit transactions KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 4 "JAGUAR" SHARED CACHE UNIT Shared cache is major design addition 32KB Branch ICACHE Prediction in “Jaguar” Decode and Microcode ROMS Int Rename FP Decode Rename Scheduler Scheduler FP Scheduler Supports 4 cores Int PRF FP PRF ALU ALU LAGU SAGU VALU VALU Mul VIMul St Conv. Div FPAdd FPMul Total shared 2MB, 16-way 32KB Ld/St DCache Queues BU Supported by 4 L2D banks L2 cache is inclusive Allows using L2 tags as probe filter L2 Interface L2 tags reside in interface block Divided into 4 banks L2D L2D L2D bank look-up only after L2 tag hit L2 interface block runs at core clock L2D L2D New L2 stream prefetcher per core KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 5 UNIFIED NORTHBRIDGE AND MEMORY EFFICIENT BANDWIDTH DELIVERY TO DELIVER VISUAL EXPERIENCE UNIFIED NORTHBRIDGE Supports Core 0 Core 1 Core 2 Core 3 DDR3 interface Quad-core Module Interface to graphics memory controller 256b Interface to I/O subsystem APU power management System Request Interface (SRI) PCI is the interconnect to I/O devices Crossbar (XBAR) MEMORY SUPPORT UNIFIED UNIFIED 64-bit interface with 4 memory ranks Link Controller Memory Controller NORTHBRIDGE NORTHBRIDGE Supports 1.25V, 1.35V, and 1.5V DIMMS Up to 10.3 GB/s with DDR3-1600 256b AMD Radeon™ 256b Up to 32 GB capacity in notebook FT3 BGA Memory Bus package 256b I/O Graphics Memory DRAM Supports memory P-states — with memory Controller Controller Controller speed changes on the fly KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 6 GRAPHICS CORE NEXT ARCHITECTURE A new GPU design Cutting-edge graphics performance and features for a new era High compute density with multi-tasking of computing Built for power efficiency Optimized for heterogeneous computing Support for high-level language features for heterogeneous compute KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 7 AMD RADEON™ HD 8000 GRAPHICS CORE NEXT ARCHITECTURE First APU with GCN architecture ACE ACE API support: Command Processor Graphics: DirectX 11.1, OpenGL 4.3, OpenGL ES 3.0 ACE Compute: OpenCL1.2, DirectCompute, C++ AMP ACE Hardware configuration: Geometry Engine Geometry engine Rasterizer ¼ prim/clock Global Data Share Two GCN compute units (CUs) 1 render back-end I$ 4 pixel color raster operation pipelines (ROPs) 16 depth test (Z) / stencil ops K$ Color cache (C$) / Depth cache (Z$) 128KB read/write L2 cache 4KB global data share with global synchronization resources L2 Cache Advanced power management: Fine-grain clock\clock tree gating PowerTune – dynamic V/F scaling with power containment System Fabric Zero core power – power gating KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 8 GCN COMPUTE UNIT Basic GPU building block of unified shader system Each CU can execute instructions from multiple New instruction set architecture kernels simultaneously Non-VLIW Designed for programming simplicity, high Vector unit + scalar co-processor utilization, high throughput, multi-tasking Distributed programmable scheduler Unstructured flow control, function calls, recursion, exception Consistent with AMD dGPU architecture so support kernels developed for GCN run anywhere Un-typed, typed, and image memory operations Flat address support Branch & Message Unit Texture Fetch Load / Store Units (16) Texture Filter Scheduler Units L1 Cache (16KB Cache Texture Filter Units Local Scalar Vector Units (SIMD-16) Vector Units (SIMD-16) Unit Vector Units (SIMD-16) Vector Units (SIMD-16) Texture Filter Data Units Share ) Vector Registers (64KB) Vector Registers (64KB) (64KB) Vector Registers (64KB) Vector Registers (64KB) Texture Filter Scalar Units Compute Unit Registers (4KB) KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 9 GCN R/W CACHE Reads and writes cached Command Bandwidth amplification Processors 64-128KB R/W L2 Improved behavior on more memory access patterns per MC Channel Compute Improved write-to-read re-use performance 16KB Vector DL1 Unit Relaxed consistency memory model Consistency controls available to control locality of 16KB Scalar DL1 32KB Instruction L1 X load/store B A GPU-coherent R Acquire/Release semantics control data visibility across the machine 64-128KB R/W L2 L2 coherent = all CUs, ACE, and command per MC Channel processors can have the same view of data Compute Global atomics 16KB Vector DL1 Unit Performed in L2 cache KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 10 AMD RADEON HD 8000 GCN ARCHITECTURE ACE Dual-display support ACE Command Processor HDMI and DisplayPort™ support up to 4K x 2K 30Hz ACE Wireless display ACE Low-power self-refresh Geometry Engine Video Codec Engine (VCE) fixed-function Rasterizer Multi-stream hardware H.264 HD encoder Global Data Share Power-efficient and faster than real-time 1080p@60fps Scalable video coding (SVC) I$ Universal Video Decoder (UVD) fixed-function K$ with codecs for: H.264 VC-1 L2 Cache MPEG-2 MVC DivX System Fabric WMV MFT WMV-native UVD VCE Display Controllers KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 11 DISPLAY TECHNOLOGY LEADERSHIP WIRELESS PANEL SELF- DYNAMIC ULTRAHD DISPLAY REFRESH REFRESH RATE 4096 3840 2160 1920 2160 1080 Drive 4K x 2K resolution Wi-Fi CERTIFIED, Significant power savings with Automatically and seamlessly displays* via HDMI and Miracast™ Support embedded DisplayPort panel reduce panel refresh rate to DisplayPort Low latency & low power self-refresh* save power* * Supported panel required KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 12 “KABINI” ACCELERATED COMPUTING A COMPLETE SYSTEM-ON-A-CHIP SOLUTION FCL FCL Allows the GPU to access coherent cache/memory 128b (each direction) path for I/O access to memory GPU access to coherent memory space CPU access to dedicated GPU framebuffer I/O GPU CPU GRAPHICS MEMORY BUS 256b (each direction) for GMC access to memory Full-bandwidth path for graphics to system memory DRAM-friendly stream of reads and write Bypasses coherency mechanism DDR3 NB INTEGRATED SYSTEM CONTROL AND I/O (FCH) Integrated FCH Provides complete system connectivity USB 3.0, USB 2.0, SATA 3, GPIO Integrated system clock generator Reduced motherboard footprint required AMD Radeon Memory Bus Allows GPU to access Higher I/O performance at reduced power consumption system memory with high bandwidth *Functional units not to scale KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 13 THE "KABINI" SYSTEM 2 SODIMMs DDR3 800-1600 DDR3 DDR3 Optionally: 18bpp LVDS SDXC - 2TB, (64b) UHS-I - 104MBps eDP SDIO or SD card: “Jaguar” “Jaguar” eDP LCD panel Core Core DisplayPort / HDMI 2MB Shared L2 2x USB 3.0 VGA External ports “Jaguar” “Jaguar” x4 PCI Express Discrete GPU - MXM Fingerprint scanner, Core Core PowerXpress™ WWAN, Bluetooth, 8x USB 2.0 AMD 2 x SATA 3 mini card HDD,DVD Radeon 8000 Display GPU x1 PCIe to PCIe mini card slot HW monitor, Control EC/KBC Media fan control, Acceleration x1 PCIe to PCIe mini card slot thermal sensor, TPM LPC keyboard scan x1 PCIe Wi-Fi controller FCH SPI x1 PCIe BIOS flash Gb Ethernet controller KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 14 CHIP-LEVEL POWER DISTRIBUTIONS TDP 100% Power consumption Core 3 90% VCE (and hence performance) is set by the cooling 80% GPU GPU GPU Display Core 2 DAC & PHYs capabilities of the power power power L2$ platform 70% 60% Core 1 Power varies a lot by GPU CPU 50% PADS power workload CPU CPU Core 0 power power 40% We measure and Display Display 30% power Display power manage the power of power UVD each component on the 20% FCH FCH FCH MC chip to generate the best power power power FCH 10% performance/watt Other chip Other chip Other chip Memory PADS power power power 0% App 1 App 2 App 3 KABINI APU SOC | HOT CHIPS 25 | AUGUST 2013 | 15 DIGITAL POWER MONITORING CPU CPU CPU CPU To manage temperature and Monitor Events send the power wherever Weights Weights Weights Weights it’s needed, we use power Weights Events monitors in all chip CAC CAC CAC CAC components Add-in Leakage Monitor Monitor Monitor Monitor “Kabini” and “Temash” have Calculate Power APU 2 power monitors in each (P = PLeakage + CacV f) P-states CPU, the GPU, the display Pboost interface, and the FCH Add in calculated GPU power AMD Turbo CORE P1 Manager Change P2 The central
Recommended publications
  • Improving Resource Utilization in Heterogeneous CPU-GPU Systems
    Improving Resource Utilization in Heterogeneous CPU-GPU Systems A Dissertation Presented to the Faculty of the School of Engineering and Applied Science University of Virginia In Partial Fulfillment of the requirements for the Degree Doctor of Philosophy (Computer Engineering) by Michael Boyer May 2013 c 2013 Michael Boyer Abstract Graphics processing units (GPUs) have attracted enormous interest over the past decade due to substantial increases in both performance and programmability. Programmers can potentially leverage GPUs for substantial performance gains, but at the cost of significant software engineering effort. In practice, most GPU applications do not effectively utilize all of the available resources in a system: they either fail to use use a resource at all or use a resource to less than its full potential. This underutilization can hurt both performance and energy efficiency. In this dissertation, we address the underutilization of resources in heterogeneous CPU-GPU systems in three different contexts. First, we address the underutilization of a single GPU by reducing CPU-GPU interaction to improve performance. We use as a case study a computationally-intensive video-tracking application from systems biology. Because of the high cost of CPU-GPU coordination, our initial, straightforward attempts to accelerate this application failed to effectively utilize the GPU. By leveraging some non-obvious optimization strategies, we significantly decreased the amount of CPU-GPU interaction and improved the performance of the GPU implementation by 26x relative to the best CPU implementation. Based on the lessons we learned, we present general guidelines for optimizing GPU applications as well as recommendations for system-level changes that would simplify the development of high-performance GPU applications.
    [Show full text]
  • AMD Firepro™Professional Graphics for CAD & Engineering and Media & Entertainment
    AMD FirePro™Professional Graphics for CAD & Engineering and Media & Entertainment Performance at every price point. AMD FirePro professional graphics offer breakthrough capabilities that can help maximize productivity and help lower cost and complexity — giving you the edge you need in your business. Outstanding graphics performance, compute power and ultrahigh-resolution multidisplay capabilities allows broadcast, design and engineering professionals to work at a whole new level of detail, speed, responsiveness and creativity. AMD FireProTM W9100 AMD FireProTM W8100 With 16GB GDDR5 memory and the ability to support up to six 4K The new AMD FirePro W8100 workstation graphics card is based on displays via six Mini DisplayPort outputs,1 the AMD FirePro W9100 the AMD Graphics Core Next (GCN) GPU architecture and packs up graphics card is the ideal single-GPU solution for the next generation to 4.2 TFLOPS of compute power to accelerate your projects beyond of ultrahigh-resolution visualization environments. just graphics. AMD FireProTM W7100 AMD FireProTM W5100 The new AMD FirePro W7100 graphics card delivers 8GB The new AMD FirePro™ W5100 graphics card delivers optimized of memory, application performance and special features application and multidisplay performance for midrange users. that media and entertainment and design and engineering With 4GB of ultra-fast GDDR5 memory, users can tackle moderately professionals need to take their projects to the next level. complex models, assemblies, data sets or advanced visual effects with ease. AMD FireProTM W4100 AMD FireProTM W2100 In a class of its own, the AMD FirePro Professional graphics starts with AMD W4100 graphics card is the best choice FirePro W2100 graphics, delivering for entry-level users who need a boost in optimized and certified professional graphics performance to better address application performance that similarly- their evolving workflows.
    [Show full text]
  • Application-Controlled Frequency Scaling Explained
    The TURBO Diaries: Application-controlled Frequency Scaling Explained Jons-Tobias Wamhoff, Stephan Diestelhorst, and Christof Fetzer, Technische Universät Dresden; Patrick Marlier and Pascal Felber, Université de Neuchâtel; Dave Dice, Oracle Labs https://www.usenix.org/conference/atc14/technical-sessions/presentation/wamhoff This paper is included in the Proceedings of USENIX ATC ’14: 2014 USENIX Annual Technical Conference. June 19–20, 2014 • Philadelphia, PA 978-1-931971-10-2 Open access to the Proceedings of USENIX ATC ’14: 2014 USENIX Annual Technical Conference is sponsored by USENIX. The TURBO Diaries: Application-controlled Frequency Scaling Explained Jons-Tobias Wamhoff Patrick Marlier Dave Dice Stephan Diestelhorst Pascal Felber Christof Fetzer Technische Universtat¨ Dresden, Germany Universite´ de Neuchatel,ˆ Switzerland Oracle Labs, USA Abstract these features from an application as needed. Examples in- Most multi-core architectures nowadays support dynamic volt- clude: accelerating the execution of key sections of code on age and frequency scaling (DVFS) to adapt their speed to the the critical path of multi-threaded applications [9]; boosting system’s load and save energy. Some recent architectures addi- time-critical operations or high-priority threads; or reducing tionally allow cores to operate at boosted speeds exceeding the the energy consumption of applications executing low-priority nominal base frequency but within their thermal design power. threads. Furthermore, workloads specifically designed to run In this paper, we propose a general-purpose library that on processors with heterogeneous cores (e.g., few fast and allows selective control of DVFS from user space to accelerate many slow cores) may take additional advantage of application- multi-threaded applications and expose the potential of hetero- level frequency scaling.
    [Show full text]
  • PC-Grade Parallel Processing and Hardware Acceleration for Large-Scale Data Analysis
    University of Huddersfield Repository Yang, Su PC-Grade Parallel Processing and Hardware Acceleration for Large-Scale Data Analysis Original Citation Yang, Su (2009) PC-Grade Parallel Processing and Hardware Acceleration for Large-Scale Data Analysis. Doctoral thesis, University of Huddersfield. This version is available at http://eprints.hud.ac.uk/id/eprint/8754/ The University Repository is a digital collection of the research output of the University, available on Open Access. Copyright and Moral Rights for the items on this site are retained by the individual author and/or other copyright owners. Users may access full items free of charge; copies of full text items generally can be reproduced, displayed or performed and given to third parties in any format or medium for personal research or study, educational or not-for-profit purposes without prior permission or charge, provided: • The authors, title and full bibliographic details is credited in any copy; • A hyperlink and/or URL is included for the original metadata page; and • The content is not changed in any way. For more information, including our policy and submission procedure, please contact the Repository Team at: [email protected]. http://eprints.hud.ac.uk/ PC-Grade Parallel Processing and Hardware Acceleration for Large-Scale Data Analysis Yang Su A thesis submitted to the University of Huddersfield in partial fulfilment of the requirements for the degree of Doctor of Philosophy School of Computing and Engineering University of Huddersfield October 2009 Acknowledgments I would like to thank the School of Computing and Engineering at the University of Huddersfield for providing this great opportunity of study and facilitating me throughout this project.
    [Show full text]
  • AMD's Early Processor Lines, up to the Hammer Family (Families K8
    AMD’s early processor lines, up to the Hammer Family (Families K8 - K10.5h) Dezső Sima October 2018 (Ver. 1.1) Sima Dezső, 2018 AMD’s early processor lines, up to the Hammer Family (Families K8 - K10.5h) • 1. Introduction to AMD’s processor families • 2. AMD’s 32-bit x86 families • 3. Migration of 32-bit ISAs and microarchitectures to 64-bit • 4. Overview of AMD’s K8 – K10.5 (Hammer-based) families • 5. The K8 (Hammer) family • 6. The K10 Barcelona family • 7. The K10.5 Shanghai family • 8. The K10.5 Istambul family • 9. The K10.5-based Magny-Course/Lisbon family • 10. References 1. Introduction to AMD’s processor families 1. Introduction to AMD’s processor families (1) 1. Introduction to AMD’s processor families AMD’s early x86 processor history [1] AMD’s own processors Second sourced processors 1. Introduction to AMD’s processor families (2) Evolution of AMD’s early processors [2] 1. Introduction to AMD’s processor families (3) Historical remarks 1) Beyond x86 processors AMD also designed and marketed two embedded processor families; • the 2900 family of bipolar, 4-bit slice microprocessors (1975-?) used in a number of processors, such as particular DEC 11 family models, and • the 29000 family (29K family) of CMOS, 32-bit embedded microcontrollers (1987-95). In late 1995 AMD cancelled their 29K family development and transferred the related design team to the firm’s K5 effort, in order to focus on x86 processors [3]. 2) Initially, AMD designed the Am386/486 processors that were clones of Intel’s processors.
    [Show full text]
  • Mellanox Corporate Deck
    UCX Community Meeting SC’19 November 2019 Open Meeting Forum and Publicly Available Work Product This is an open, public standards setting discussion and development meeting of UCF. The discussions that take place during this meeting are intended to be open to the general public and all work product derived from this meeting shall be made widely and freely available to the public. All information including exchange of technical information shall take place during open sessions of this meeting and UCF will not sponsor or support any closed or private working group, standards setting or development sessions that may take place during this meeting. Your participation in any non-public interactions or settings during this meeting are outside the scope of UCF's intended open-public meeting format. © 2019 UCF Consortium 2 UCF Consortium . Mission: • Collaboration between industry, laboratories, and academia to create production grade communication frameworks and open standards for data centric and high-performance applications . Projects https://www.ucfconsortium.org Join • UCX – Unified Communication X – www.openucx.org [email protected] • SparkUCX – www.sparkucx.org • Open RDMA . Board members • Jeff Kuehn, UCF Chairman (Los Alamos National Laboratory) • Gilad Shainer, UCF President (Mellanox Technologies) • Pavel Shamis, UCF treasurer (Arm) • Brad Benton, Board Member (AMD) • Duncan Poole, Board Member (Nvidia) • Pavan Balaji, Board Member (Argonne National Laboratory) • Sameh Sharkawi, Board Member (IBM) • Dhabaleswar K. (DK) Panda, Board Member (Ohio State University) • Steve Poole, Board Member (Open Source Software Solutions) © 2019 UCF Consortium 3 UCX History https://www.hpcwire.com/2018/09/17/ucf-ucx-and-a-car-ride-on-the-road-to-exascale/ © 2019 UCF Consortium 4 © 2019 UCF Consortium 5 UCX Portability .
    [Show full text]
  • AMD Codexl 1.0 GA Release Notes (Version 1.0.)
    AMD CodeXL 1.0 GA Release Notes (version 1.0.) CodeXL 1.0 is finally here! Thank you for using CodeXL. We appreciate any feedback you have! Please use our CodeXL Forum to provide your feedback. You can also check out the Getting Started guide on the CodeXL Web Page and Milind Kukanur’s CodeXL blog at AMD Developer Central - Blogs This version contains: CodeXL Visual Studio package and Standalone application, for 32-bit and 64-bit Windows platforms CodeXL for 64-bit Linux platforms Kernel Analyzer v2 for both Windows and Linux platforms System Requirements CodeXL contains a host of development features with varying system requirements: GPU Profiling and OpenCL Kernel Debugging o An AMD GPU (Radeon HD 5xxx or newer) or APU is required o The AMD Catalyst Driver must be installed, release 12.8 or later. Catalyst 12.12 is the recommended version (will become available later in December 2012). For GPU API-Level Debugging, a working OpenCL/OpenGL configuration is required (AMD or other). CPU Profiling o Time Based Profiling can be performed on any x86 or AMD64 (x86-64) CPU/APU. o The Event Based Profiling (EBP) and Instruction Based Sampling (IBS) session types require an AMD CPU or APU processor. Supported platforms: Windows platforms: Windows 7, 32-bit and 64-bit are supported o Visual Studio 2010 must be installed on the station before the CodeXL Visual Studio Package is installed. Linux platforms: Red Hat EL 6 u2 64-bit and Ubuntu 11.10 64-bit New in this version The following items were not part of the Beta release and are new to this version: A unified installer for Windows which installs both CodeXL and APP KernelAnalyzer2 on 32 and 64 bit platforms, packaged in a single executable file.
    [Show full text]
  • AMD Codexl 1.1 GA Release Notes
    AMD CodeXL 1.1 GA Release Notes Thank you for using CodeXL. We appreciate any feedback you have! Please use our CodeXL Forum to provide your feedback. You can also check out the Getting Started guide on the CodeXL Web Page and the latest CodeXL blog at AMD Developer Central - Blogs This version contains: CodeXL Visual Studio 2012 and 2010 packages and Standalone application, for 32-bit and 64-bit Windows platforms CodeXL for 64-bit Linux platforms Kernel Analyzer v2 for both Windows and Linux platforms Note about 32-bit Windows CodeXL 1.1 Upgrade Error On 32-bit Windows platforms, upgrading from previous version of CodeXL using the CodeXL 1.1 installer will remove the previous version and then display an error message without installing CodeXL 1.1. The recommended method is to uninstall previous CodeXL before installing CodeXL 1.1. If you ran the 1.1 installer to upgrade a previous installation and encountered the error mentioned above, ignore the error and run the installer again to install CodeXL 1.1. Note about installing CodeAnalyst after installing CodeXL for Windows CodeXL can be safely installed on a Windows station where AMD CodeAnalyst is already installed. However, do not install CodeAnalyst on a Windows station already installed with CodeXL. Uninstall CodeXL first, and then install CodeAnalyst. System Requirements CodeXL contains a host of development features with varying system requirements: GPU Profiling and OpenCL Kernel Debugging o An AMD GPU (Radeon HD 5xxx or newer) or APU is required o The AMD Catalyst Driver must be installed, release 12.8 or later.
    [Show full text]
  • Amd Power Management Srilatha Manne 11/14/2013 Outline
    AMD POWER MANAGEMENT SRILATHA MANNE 11/14/2013 OUTLINE APUs and HSA Richland APU APU Power Management Monitoring and Management Features 2 | DOE WEBINAR| NOV 14, 2013 APU: ACCELERATED PROCESSING UNIT The APU has arrived and it is a great advance over previous platforms Combines scalar processing on CPU with parallel processing on the GPU and high-bandwidth access to memory Advantages over other forms of compute offload: Easier to program Easier to optimize Easier to load-balance Higher performance Lower power © Copyright 2012 HSA Foundation. All Rights Reserved. 3 WHAT IS HSA? Joins CPUs, GPUs, and accelerators into a unified computing framework Single address space accessible to avoid the overhead of data copying HSA Accelerated Applications Access to Broad Set of Programming Use-space queuing to minimize Languages communication overhead Pre-emptive context switching for better Legacy quality of service Applications Simplified programming HSA Runtime Single, standard computing environments Infrastructure Support for mainstream languages— OS C, C++, Fortran, Java, .NET Lower development costs Memory Optimized compute density Radical performance improvement for CPU GPU ACC HPC, big data, and multimedia workloads HSA Platform Low power to maximize performance per watt 4 | AMD OPTERON™ PROCESSOR ROADMAP | NOVEMBER 13, 2013 A NEW ERA OF PROCESSOR PERFORMANCE Heterogeneous Single-core Era Multi-core Era Systems Era Temporarily Enabled by: Constrained by: Enabled by: Constrained by: Enabled by: Moore’s Power Moore’s Law Power Abundant data Constrained by: Law Complexity SMP Parallel SW parallelism Programming Voltage architecture Scalability Power efficient models Scaling GPUs Comm.overhead pthreads OpenMP / TBB … Assembly C/C++ Java Shader CUDA OpenCL … C++ and Java ? we are here we are Performance Performance here Throughput Performance Performance we are Modern Modern Application here Single-thread Single-thread Performance Time Time (# of processors) Time (Data-parallel exploitation) © Copyright 2012 HSA Foundation.
    [Show full text]
  • GPU Computing with Python: Performance, Energy Efficiency And
    computation Article GPU Computing with Python: Performance, Energy Efficiency and Usability † Håvard H. Holm 1,2,* , André R. Brodtkorb 3,4 and Martin L. Sætra 4,5 1 Mathematics and Cybernetics, SINTEF Digital, P.O. Box 124 Blindern, NO-0314 Oslo, Norway 2 Department of Mathematical Sciences, Norwegian University of Science and Technology, NO-7491 Trondheim, Norway 3 Research and Development Department, Norwegian Meteorological Institute, P.O. Box 43 Blindern, NO-0313 Oslo, Norway; [email protected] 4 Department of Computer Science, Oslo Metropolitan University, P.O. Box 4 St. Olavs plass, NO-0130 Oslo, Norway; [email protected] 5 Information Technology Department, Norwegian Meteorological Institute, P.O. Box 43 Blindern, NO-0313 Oslo, Norway * Correspondence: [email protected] † This paper is an extended version of our paper published in International Conference on Parallel Computing (ParCo2019). Received: 6 December 2019; Accepted: 6 January 2020; Published: 9 January 2020 Abstract: In this work, we examine the performance, energy efficiency, and usability when using Python for developing high-performance computing codes running on the graphics processing unit (GPU). We investigate the portability of performance and energy efficiency between Compute Unified Device Architecture (CUDA) and Open Compute Language (OpenCL); between GPU generations; and between low-end, mid-range, and high-end GPUs. Our findings showed that the impact of using Python is negligible for our applications, and furthermore, CUDA and OpenCL applications tuned to an equivalent level can in many cases obtain the same computational performance. Our experiments showed that performance in general varies more between different GPUs than between using CUDA and OpenCL.
    [Show full text]
  • Brochure 1.4 MB
    www.SapphirePGS.com Professional Graphics Solutions SAPPHIRE PGS (Professional Graphics Solutions) is a business unit within SAPPHIRE Technology for Professional Graphics. It provides various types of professional graphics display solutions for workstation and professional clients. SAPPHIRE PGS supports the full range of 3D professional applications for professional users. For industrial customers, SAPPHIRE PGS integrates display related graphics application solutions for broadcasting, digital signage, medical, surveillance, ATC (Air Traffic Control) and other markets. SAPPHIRE PGS is focused on providing our customers with highly appropriate solutions and outstanding pre and after sales consultancy and services. SAPPHIRE Technology About SAPPHIRE Technology SAPPHIRE Technology is a leading manufacturer and global supplier of a broad range of innovative technologies for PC enthusiasts, home users and professionals. Its origins rooted in graphics hardware design and manufacturing, the extensive SAPPHIRE product range has since grown from state-of-the-art graphics add-in boards—for which SAPPHIRE is recognized as the premiere AMD partner—to include motherboards, mini PCs, external graphics expanders, and Professional AV products. Founded in 2001, SAPPHIRE is a privately held global company headquartered in Hong Kong. Further information can be found at: www.sapphiretech.com. About SAPPHIRE PGS SAPPHIRE PGS (Professional Graphics Solutions) is a business unit within SAPPHIRE Technology for Professional Graphics. It provides various types of professional graphics display solutions for workstation and professional clients. SAPPHIRE PGS supports the full range of 3D professional applications for professional users. For industrial customers, SAPPHIRE PGS integrates display related graphics application solutions for broadcasting, digital signage, medical, surveillance, ATC (Air Traffic Control) and other markets.
    [Show full text]
  • 256 Color for Windows 2000 Driver Download 256 Color for Windows 2000 Driver Download
    256 color for windows 2000 driver download 256 color for windows 2000 driver download. Explorer Patch (256 color TrayIcons) Dr. Hoiby is the author of this little tool. All Dr. Hoiby's Freewares are certified without Java, without .NET and without any Virtual Machine Code. He can be reached at [email protected] . P atch TrayIcon Windows: (You can now use 256 colors icon in your TrayBar). -> BEFORE AFTER. F.A.Q Title: 256 [2008-03-24] How do I switch my computer to 256 colors mode from a different mode? ?? I don't understand the question, sorry :( [2008-03-26] Title: Windows NT 4.0 workstation? [2008-04-28] Do you have a version for NT 4.0-SP6a ? Sorry, no :( I'm patching Windows during my spare time. So, with my three childrens, I do not much. [2009-02-04] Title: 256 [2008-06-01] Hi, I'm wondering what I have to download in order to run this 256bit patch. I can't find the correct one to download for my system. I'm running US Windows 2000 SP4, what should I download? Sorry. All the patched systems are listed on this page. [2009-02-04] Title: Win2000 SP4 [2008-06-15] Hi ! I wonder how to do a manual patch of my explorer.exe v5.0.3900.6920 Thanks ;) Bedfford A.B. - Chile Please send to me your explorer.exe file. After that I'll can answer this question. [2009-02-04] Title: Windows 95 OSR2 Spanish and NT4 with ie6 [2008-07-11] I want to know if you can pacth this version of Windows When you compare the solutions for the german version and the english version : http://www.dr-hoiby.com/TrayIconIn256Color/v4.00.950B.de.txt http://www.dr- hoiby.com/TrayIconIn256Color/v4.00.950.us.txt It seems to be the same.
    [Show full text]