Exploiting Heterogeneous CPUs/GPUs

David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA

General Purpose Computing
• With the introduction of multi-core CPUs, there has been a renewed interest in parallel computing paradigms and languages
• Existing multi-/many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs)
• Heterogeneous systems are becoming a common theme
• Are we returning to the days of the X87 co-processor?
• How should we combine multi-core and many-core systems into a single design?

Heterogeneous Computing
“...electronic systems that use a variety of different types of computational units...” (Wikipedia)
• The elements could have different instruction set architectures
• The elements could have different memory byte orderings (i.e., endianness)
• The elements may have different memory coherency and consistency models
• The elements may only work with specific operating systems and application programming interfaces (APIs)
• The elements could be integrated on the same or different chips/boards/systems

Trends in Heterogeneous Computing: X86 Microprocessors
• 1978 – Intel 8086
  • Designed to run integer-based CPU-bound programs (e.g., Dhrystone) efficiently
  • No explicit floating point support
• 1980 – Intel 8087
  • 50 KFLOPS!!!!!
  • IEEE 754 definition
• 1982 – Intel 80286/287
• 1985 – Intel 80386/387 and AMD AM386 w/387
• 1989 – Intel 80486DX
  • First integrated on-chip X87
• 1996 – Intel Pentium
  • MMX multimedia extensions
• 1997 – AMD K6
  • MMX and FP support
• 1998 – AMD K6-2
  • Extends MMX with 3DNow
  • SIMD vector instructions for graphics processing
• 1999 – Intel Pentium III
  • Introduces SSE to X86
• 2001-2005 – Intel Pentium IV/Prescott and AMD Opteron/Athlon
  • SSE2 and SSE3
• 2006 – Intel Core and AMD K10
  • SSE4, SSE4.2 and SSE4a

Trends in Heterogeneous Computing: X86 Microprocessors
What spurred on these changes/advances?
• The inefficiency of X86 to effectively emulate floating point
• The need for increased precision in computations
• The desire to have interactive games (e.g., Flight Simulator, Donkey Kong)
• The emergence of multimedia (voice, video, graphics)
• The competitive market!

Some other examples of heterogeneous integration
• The IBM Cell
  • Soul of the Sony PS3
  • Composed of 1 Power Processing Element (PPE), with 8 physical SPEs
  • 9 DMA units for memory transfers
• The Analog Devices Blackfin
  • Integration of a classic fixed-point digital signal processor (DSP) and a microcontroller
  • One instruction set
  • Shared memory architecture
• The TI OMAP
  • Integration of an ARM and one or multiple DSPs
  • Popular cell-phone and media player platform
  • Shared memory architecture

Graphics Processing Units
• More than 64% of Americans played a video game in 2009
• High-end – primarily used for 3-D rendering for videogame graphics and movie animation
• Mid/low-end – primarily used for computer displays
• Manufacturers include NVIDIA, AMD/ATI, IBM-Cell
• Very competitive commodities market

Enter GPGPU – desktop supercomputing!
• GPU manufacturers made their chips programmable
• OpenGL and DirectX provide support for programming shaders
• NVIDIA GeForce3 was the first architecture to support this move (2002)
• NVIDIA’s CUDA had a huge impact on lowering the threshold to accessing the GPU for general purpose computing
• AMD’s Brook+ also played an important role
• GPU manufacturers decide to make chipsets to specifically support the programmable GPU market
  • NVIDIA Tesla and Fermi
• What spurred this change?
  • The need for 3D and 4D data processing

A wide range of GPU apps
3D image analysis, Adaptive radiation therapy, Acoustics, Astronomy, Audio, Automobile vision, Bioinformatics, Biological simulation, Broadcast, Cellular automata, Fluid dynamics, Computer vision, Cryptography, CT reconstruction, Data mining, Digital cinema / projections, Electromagnetic simulation, Equity training, Film, Financial, Languages, GIS, Holographics cinema, Machine learning, Mathematics research, Military, Mine planning, Molecular dynamics, MRI reconstruction, Multispectral imaging, N-body simulation, Network processing, Neural network, Oceanographic research, Optical inspection, Particle physics, Protein folding, Quantum chemistry, Ray tracing, Radar, Reservoir simulation, Robotic vision / AI, Robotic surgery, Satellite data analysis, Seismic imaging, Surgery simulation, Surveillance, Ultrasound, Video conferencing, Telescope, Video, Visualization, Wireless, X-Ray

GPGPU is becoming mainstream research
• Research activities are expanding significantly
[Figure: search results for the keyword “GPGPU” in IEEE and ACM]

AMD/ATI Radeon HD 5870
• Codename “Evergreen”
• 1600 SIMD cores
• L1/L2 memory architecture
• 153GB/sec memory bandwidth
• 2.72 TFLOPS SP
• OpenCL and DirectX11
• Hidden memory microarchitecture
• Provides for vectorized operation

Comparison of CPU and GPU Hardware Architectures

CPU/GPU          Single precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
NVIDIA 285       1.06                      240     5.8           $3.12
NVIDIA 295       1.79                      480     6.2           $3.80
AMD HD 5870      2.72                      1600    14.5          $0.16
AMD HD 4890      1.36                      800     7.2           $0.18
Intel I-7 965    0.051                     4       0.39          $11.02

Source: NVIDIA, AMD and Intel

• The Medical Imaging field is rapidly deploying new 3-D and 4-D imaging technologies to improve patient outcomes
• This move has created an avalanche of image data
• Image reconstruction and image analysis have become major bottlenecks
• Accurate 3-D and 4-D image reconstruction requires compute-intensive algorithms
• The use of multi-modality imaging (e.g., CT and Ultrasound) further exacerbates this problem
• Heterogeneous computing will play a large role in addressing these challenges

Developing a suite of Biomedical Image Analysis Libraries – AMD-NVIDIA/OpenCL
• Target applications:
  • Deformable registration – radiation oncology
  • 3-D Iterative reconstruction – cardiovascular imaging
  • Maximum likelihood estimation – Digital Breast Tomosynthesis
  • Motion compensation in PET/CT images – cardiovascular imaging
  • Hyperspectral imaging – skin cancer screening
  • Image segmentation – brain imaging
$1.3M NSF Award EEC-0946463

• Currently, coronary heart disease (CHD) is the single leading cause of death in America
• Health care costs related to CHD >$150B/year
• U.S. in 2006 (American Heart Association)
  • Approximately 1,255,000 coronary attacks
  • Approximately 425,425 deaths
• Invasive coronary angiography is the state-of-the-art for assessing coronary blockages
  • Inject dye into the bloodstream and then X-ray the heart
  • 8% complication rate
  • 0.2% mortality rate

3-D Cardiovascular Plaque Imaging
• 3D CT imaging can be used to identify vulnerable plaque
  • A helical scan of the body is performed
  • Provides for more accurate imaging of the cardiovascular system
  • Produces a detailed 3-D view of the blockage
  • Possesses few negative side effects
• Scanning geometry produces a tremendous amount of data to process
Image reconstruction can take days to generate a single view!!

Iterative CT Image Reconstruction
• 3-D Spiral Cone-Beam Cardiac Image Reconstruction
• Reconstruction performance is a barrier to improving image quality
• Forward/backward projections consume more than 95% of total reconstruction time in an iterative helical cone-beam CT image reconstruction method
• Comparison of a single OpenCL/AMD HD5870 implementation versus a multi-threaded optimized version on an Intel Core-2 Duo
[Figure: execution time comparison (one projection), execution time in seconds, shown alongside a reconstructed cardiac image; 31x speedup (250*250*9, 1160 projections)]
*In collaboration with H. Pien and C. Karl

• A new technology developed at MGH to:
  • Produce a 3-D image of the breast utilizing 15 or more 2-D projections
• 3-D imagery can help address the following issues related to 2-D mammography
  • Increase the correct detection rate of cancers
  • Reduce the rate of misdiagnosed cancers – avoid unneeded biopsies
[Figure: 2-D versus DBT image pairs. Cancer: increase correct detection rate. Hamartoma: decrease false positive rate.]

Tomosynthesis Image Reconstruction
• Utilizes a limited angle tomography approach using many 2-D images to generate a 3-D image
• Performs an iterative Maximum Likelihood Estimation for 3-D image reconstruction
• Reconstruction time is a barrier to image-guided biopsy
[Figure: X-ray source (15 views), detector (1196x2304) and X-ray projections; iterative loop: set 3D volume (guess), compute projections (forward), correct 3D volume (backward); 3D volume (1196x2304x45)]

Reconstruction Speedup
• 25X speedup
• Reduces false positives
• Patient receives feedback in the same visit
• Enables image-guided biopsy
• Improves patient outcomes
*In collaboration with R. Moore and W. Meleis

OpenCL – The future for heterogeneous computing
• Being developed by Khronos Group – a non-profit
• Open Computing Language
• LLVM compiler
• Looks a lot like CUDA
• A framework for writing programs that execute on a range of heterogeneous systems
• Present support for AMD/NVIDIA GPUs, Cell, X86 multi-core CPUs, IBM Power, and ARM
• More about OpenCL during this afternoon’s tutorial

GPU/OpenCL Strengths
• Supercomputing on the desktop
• Easy to program (small learning curve)
• Already have demonstrated success with several
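
The iterative loop named on the Tomosynthesis Image Reconstruction slide (set a 3D volume guess, compute forward projections, correct the volume by backprojection) can be illustrated with a small sketch. The slides do not give the authors' exact algorithm, so the code below assumes the textbook multiplicative maximum-likelihood (MLEM) update. The system matrix `A`, the measurements `y`, and the function names are hypothetical, chosen only for illustration. The forward and backward projections are the two steps the slides report as dominating reconstruction time, and each is a data-parallel loop of the kind that would be offloaded to a GPU.

```python
def forward_project(A, x):
    """Forward projection: predicted detector readings A @ x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def back_project(A, r):
    """Backprojection: smear detector-space values r back into the volume (A^T @ r)."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def mlem(A, y, iterations=50, eps=1e-12):
    """Textbook MLEM update (an assumption, not necessarily the authors' method)."""
    x = [1.0] * len(A[0])                      # initial volume guess
    sens = back_project(A, [1.0] * len(A))     # per-voxel sensitivity (column sums of A)
    for _ in range(iterations):
        proj = forward_project(A, x)           # compute projections (forward)
        ratio = [m / max(p, eps) for m, p in zip(y, proj)]
        corr = back_project(A, ratio)          # correction term (backward)
        x = [v * c / max(s, eps) for v, c, s in zip(x, corr, sens)]
    return x


if __name__ == "__main__":
    # Hypothetical toy problem: 2 voxels observed by 3 rays.
    A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [2.0, 3.0, 5.0]                        # noiseless readings of the volume [2, 3]
    print(mlem(A, y))                          # estimate approaches [2.0, 3.0]
```

In a real reconstruction `A` is never stored as a dense matrix; the two projection functions are computed from the scanning geometry, which is where a GPU implementation would spend its effort.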