Improving Resource Utilization in Heterogeneous CPU-GPU Systems

Total Page:16

File Type:pdf, Size:1020Kb

Improving Resource Utilization in Heterogeneous CPU-GPU Systems Improving Resource Utilization in Heterogeneous CPU-GPU Systems A Dissertation Presented to the Faculty of the School of Engineering and Applied Science University of Virginia In Partial Fulfillment of the requirements for the Degree Doctor of Philosophy (Computer Engineering) by Michael Boyer May 2013 c 2013 Michael Boyer Abstract Graphics processing units (GPUs) have attracted enormous interest over the past decade due to substantial increases in both performance and programmability. Programmers can potentially leverage GPUs for substantial performance gains, but at the cost of significant software engineering effort. In practice, most GPU applications do not effectively utilize all of the available resources in a system: they either fail to use use a resource at all or use a resource to less than its full potential. This underutilization can hurt both performance and energy efficiency. In this dissertation, we address the underutilization of resources in heterogeneous CPU-GPU systems in three different contexts. First, we address the underutilization of a single GPU by reducing CPU-GPU interaction to improve performance. We use as a case study a computationally-intensive video-tracking application from systems biology. Because of the high cost of CPU-GPU coordination, our initial, straightforward attempts to accelerate this application failed to effectively utilize the GPU. By leveraging some non-obvious optimization strategies, we significantly decreased the amount of CPU-GPU interaction and improved the performance of the GPU implementation by 26x relative to the best CPU implementation. Based on the lessons we learned, we present general guidelines for optimizing GPU applications as well as recommendations for system-level changes that would simplify the development of high-performance GPU applications. Next, we address underutilization at the system level by using load balancing to improve performance. We propose a dynamic scheduling algorithm that automatically and efficiently i Abstract ii divides the execution of a data-parallel kernel across multiple, possibly heterogeneous GPUs. We show that our scheduler can nearly match the performance of an unrealistic static scheduler when device performance is fixed, and can provide better performance when device performance varies. Finally, we address underutilization within a GPU by using frequency scaling to improve energy efficiency. We propose a novel algorithm for predicting the energy-optimal GPU clock frequencies for an arbitrary kernel. Using power measurements from real systems, we demonstrate that our algorithm improves significantly on the state of the art across multiple generations of GPUs. We also propose and evaluate techniques for decreasing the CPU's energy consumption during GPU computation. Many of the techniques presented in this dissertation can be used to improve the per- formance and energy efficiency of GPU applications with no programmer effort or software modifications required. As the diversity of available hardware systems continues to increase, automatic techniques such as these will become critical for software to fully realize the benefits of future hardware improvements. Approval Sheet This dissertation is submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Engineering) Michael Boyer Michael Boyer This dissertation has been read and approved by the Examining Committee: Kevin Skadron Kevin Skadron, Advisor Sudhanva Gurumurthi Sudhanva Gurumurthi, Committee Chair Jack Davidson Jack Davidson Mircean Stan Mircean Stan Westley Weimer Westley Weimer Accepted for the School of Engineering and Applied Science: James H. Aylor James H. Aylor, Dean, School of Engineering and Applied Science May 2013 iii Acknowledgments The work presented in this dissertation, as well as other work completed during my graduate career, would have not have been possible without the support of many different people. First and foremost, I thank my advisor, Kevin Skadron, who has played an important role in nearly everything I have accomplished as a graduate student. Advisors have a tough job: they must walk a fine line between being too overbearing and being too laid back, with the right mix often depending heavily on the particular student. Kevin has always managed to find the right balance between the two extremes, consistently helping me stay confident and motivated but also challenging me to keep improving. Other than my advisor, Nuwan Jayasena of AMD played the largest role in helping shape much of the work presented here. I also thank the members of my Ph.D. committee: Professors Jack Davidson, Sudhanva Gurumurthi, Mircea Stan, and Wes Weimer. Their guidance helped steer my work in a more interesting and useful direction. Wes deserves special recognition as a co-author of my first peer-reviewed publication, which was based on work that I completed as part of his Programming Languages course and which remains my most-cited (first-author) publication. Professor Andrew Grimshaw provided me with a unique opportunity to co-teach a course on parallel programming; this experience was rewarding and enlightening, and remains one of the highlights of my time in graduate school. The members of the LAVA Lab have supported me in innumerable ways over the years. David Tarjan took me under his wing early in my career and was a wonderful collaborator on a number of different projects. Shuai Che, Jeremy Sheaffer, and Runjie Zhang provided feedback iv Acknowledgments v on numerous matters both technical and otherwise. I am also indebted to other past and current members of the Lava Lab: Greg Faust, Marisabel Guevara, Tibor Horvath, Jiayuan Meng, Brett Meyer, Donnie Newell, Karthik Sankaranarayanan, Prateeksha Satyamoorthy, Lukasz Szafaryn, Jack Wadden, Ke Wang, and Liang Wang. I am especially thankful for the support of my fellow graduate students Chris Gregg, Jiawei Huang, Sudeep Ghosh, and Nishant George throughout my time at the University of Virginia. Saurav Basu helped me understand the underlying algorithms and data structures used in the leukocyte tracking work. Jonathan Dorn provided valuable insight on the frequency scaling work. This dissertation was typeset using a LATEX template created by Joel Coffman. My parents encouraged my intellectual curiosity from a young age, and are in many ways responsible for my early interest in computers and my subsequent choice to study computer engineering as both an undergraduate and graduate student. Their support throughout the years has been unwavering despite the physical distance that has separated us for the last decade. Any homesickness I may have felt has been tempered slightly by my in-laws, Jay, Cathy, Alanna, and Brett, who have warmly welcomed me into their family here in Virginia. Last but certainly not least, I thank my wife Jessica for her love and support. I am grateful for her constant encouragement, her occasional advice, and, perhaps most importantly, her patience. Seven years was a long time to wait; now it is time to see what adventures are in store for us in the next chapter of our life. This work was supported in part by NSF grants IIS-0612049 and CNS-0916908, SRC grants 1607.001, 1972.001, and 2233.001, a GRC AMD/Mahboob Khan Ph.D. fellowship, an NVIDIA research grant, and equipment donated by AMD and NVIDIA. Contents Contents vi List of Tables...................................... viii List of Figures...................................... ix 1 Introduction1 1.1 Leukocyte Tracking................................3 1.2 Heterogeneous Load Balancing..........................4 1.3 GPU Frequency Scaling.............................5 1.4 Organization...................................7 2 Background8 2.1 GPU Hardware..................................8 2.2 GPU Software................................... 11 2.2.1 GPU Programming Model........................ 12 2.3 Energy Efficiency................................. 14 3 Case Study: Leukocyte Tracking 18 3.1 Leukocyte Detection and Tracking........................ 20 3.2 Accelerating the Detection Stage........................ 22 3.2.1 Comparison to OpenMP......................... 23 3.2.2 Na¨ıve CUDA Implementation...................... 24 3.2.3 CUDA Optimizations........................... 24 3.2.4 Single- vs. Double-Precision Performance................ 26 3.3 Accelerating the Tracking Stage......................... 27 3.3.1 Comparison to OpenMP......................... 29 3.3.2 Na¨ıve CUDA Implementation...................... 29 3.3.3 CUDA Optimizations........................... 29 3.4 Discussion..................................... 35 3.4.1 Lessons for Software Developers.................... 36 3.4.2 Lessons for System Architects...................... 41 3.5 Related Work................................... 44 3.6 Conclusions.................................... 44 3.7 Postscript..................................... 45 3.8 Publications and Impact............................. 48 vi Contents vii 4 Heterogeneous Load Balancing 49 4.1 Motivation..................................... 50 4.1.1 Need for Heterogeneous Load Balancing................ 51 4.1.2 Need for Dynamic Load Balancing................... 52 4.2 Related Work................................... 54 4.3 Dynamic Load Balancing............................. 57 4.3.1 Optimal Partition............................. 57 4.3.2 Scheduling Algorithm........................... 59 4.3.3 Scheduler Implementation........................ 60 4.3.4 Example Schedule............................. 61 4.4 Experimental
Recommended publications
  • Small Form Factor 3D Graphics for Your Pc
    VisionTek Part# 900701 PRODUCTIVITY SERIES: SMALL FORM FACTOR 3D GRAPHICS FOR YOUR PC The VisionTek Radeon R7 240SFF graphics card offers a perfect balance of performance, features, and affordability for the gamer seeking a complete solution. It offers support for the DIRECTX® 11.2 graphics standard and 4K Ultra HD for stunning 3D visual effects, realistic lighting, and lifelike imagery. Its Short Form Factor design enables it to fit into the latest Low Profile desktops and workstations, yet the R7 240SFF can be converted to a standard ATX design with the included tall bracket. With 2GB of DDR3 memory and award-winning Graphics Core Next (GCN) architecture, and DVI-D/HDMI outputs, the VisionTek Radeon R7 240SFF is big on features and light on your wallet. RADEON R7 240 SPECS • Graphics Engine: RADEON R7 240 • Video Memory: 2GB DDR3 • Memory Interface: 128bit • DirectX® Support: 11.2 • Bus Standard: PCI Express 3.0 • Core Speed: 780MHz • Memory Speed: 800MHz x2 • VGA Output: VGA* • DVI Output: SL DVI-D • HDMI Output: HDMI (Video/Audio) • UEFI Ready: Support SYSTEM REQUIREMENTS • PCI Express® based PC is required with one X16 lane graphics slot available on the motherboard. • 400W (or greater) power supply GCN Architecture: A new design for AMD’s unified graphics processing and compute cores that allows recommended. 500 Watt for AMD them to achieve higher utilization for improved performance and efficiency. CrossFire™ technology in dual mode. • Minimum 1GB of system memory. 4K Ultra HD Support: Experience what you’ve been missing even at 1080P! With support for 3840 x • Installation software requires CD-ROM 2160 output via the HDMI port, textures and other detail normally compressed for lower resolutions drive.
    [Show full text]
  • AMD Firepro™Professional Graphics for CAD & Engineering and Media
    AMD FirePro™Professional Graphics for CAD & Engineering and Media & Entertainment Performance at every price point. AMD FirePro professional graphics offer breakthrough capabilities that can help maximize productivity and help lower cost and complexity — giving you the edge you need in your business. Outstanding graphics performance, compute power and ultrahigh-resolution multidisplay capabilities allows broadcast, design and engineering professionals to work at a whole new level of detail, speed, responsiveness and creativity. AMD FireProTM W9100 AMD FireProTM W8100 With 16GB GDDR5 memory and the ability to support up to six 4K The new AMD FirePro W8100 workstation graphics card is based on displays via six Mini DisplayPort outputs,1 the AMD FirePro W9100 the AMD Graphics Core Next (GCN) GPU architecture and packs up graphics card is the ideal single-GPU solution for the next generation to 4.2 TFLOPS of compute power to accelerate your projects beyond of ultrahigh-resolution visualization environments. just graphics. AMD FireProTM W7100 AMD FireProTM W5100 The new AMD FirePro W7100 graphics card delivers 8GB The new AMD FirePro™ W5100 graphics card delivers optimized of memory, application performance and special features application and multidisplay performance for midrange users. that media and entertainment and design and engineering With 4GB of ultra-fast GDDR5 memory, users can tackle moderately professionals need to take their projects to the next level. complex models, assemblies, data sets or advanced visual effects with ease. AMD FireProTM W4100 AMD FireProTM W2100 In a class of its own, the AMD FirePro Professional graphics starts with AMD W4100 graphics card is the best choice FirePro W2100 graphics, delivering for entry-level users who need a boost in optimized and certified professional graphics performance to better address application performance that similarly- their evolving workflows.
    [Show full text]
  • AMD Firepro™ W5000
    AMD FirePro™ W5000 Be Limitless, When Every Detail Counts. Powerful mid-range workstation graphics. This powerful product, designed for delivering superior performance for CAD/CAE and Media workflows, can process Key Features: up to 1.65 billion triangles per second. This means during > Utilizes Graphics Core Next (GCN) to the design process you can easily interact and render efficiently balance compute tasks with your 3D models, while the competition can only process 3D workloads, enabling multi-tasking that is designed to optimize utilization up to 0.41 billion triangles per second (up to four times and maximize performance. less performance). It also offers double the memory > Unmatched application of competing products (2GB vs. 1GB) and 2.5x responsiveness in your workflow, the memory bandwidth. It’s the ideal solution whether in advanced visualization, for professionals working with a broad range of complex models, large data sets or applications, moderately complex models and datasets, video editing. and advanced visual effects. > AMD ZeroCore Power Technology enables your GPU to power down when your monitor is off. Product features: > AMD ZeroCore Power technology leverages > GeometryBoost—the GPU processes > Optimized and certified for major CAD and M&E AMD’s leadership in notebook power efficiency geometry data at a rate of twice per clock cycle, doubling the rate of primitive applications delivering 1 TFLOP of single precision and 80 to enable our desktop GPUs to power down and vertex processing. GFLOPs of double precision performance with when your monitor is off, also known as the > AMD Eyefinity Technology— outstanding reliability for the most demanding “long idle state.” Industry-leading multi-display professional tasks.
    [Show full text]
  • SAPPHIRE HD 6950 2GB GDDR5 Dirt3 Edition
    SAPPHIRE HD 6950 2GB GDDR5 Dirt3 Edition The SAPPHIRE HD 6950 Dirt3 Special Edition is a new SAPPHIRE original model with a special cooler using a new dual fan configuration. Based on the latest high end AMD GPU architecture, it boasts true DX 11 capability and the powerful configuration of 1408 stream processors and 88 texture processing units. With its clock speed of 800MHz for the core and 2GB of the latest GDDR5 memory running at 1250Mhz (5 Gb/sec effective), this model speeds through even the most demanding applications for a smooth and detail packed experience. A Dual BIOS feature allows enthusiasts to experiment with alternative BIOS settings and performance can be further enhanced with the SAPPHIRE overclocking tool, TriXX, available as a free download from http://www.sapphiretech.com/ssc/TriXX/ System Overview Awards News Requirements Specification 1 x Dual-Link DVI 1 x HDMI 1.4a Output 1 x DisplayPort 1 x Single-Link DVI-D DisplayPort 1.2 800 MHz Core Clock GPU 40 nm Chip 1408 x Stream Processors 2048 MB Size Memory 256 -bit GDDR5 5000 MHz Effective Dimension 260(L)x110(W)x35(H) mm Size. Driver CD Software SAPPHIRE TriXX Utility 1 x Dirt®3 Coupon CrossFire™ Bridge Interconnect Cable DVI to VGA Adapter Accessory 6 PIN to 4 PIN Power Cable x 2 HDMI 1.4a high speed 1.8 meter cable(Full Retail SKU only) All specifications and accessories are subject to change without notice. Please check with your supplier for exact offers. Products may not be available in all markets.
    [Show full text]
  • AMD Firepro™ W7000
    AMD FirePro™ W7000 Be Limitless, When Every Detail Counts. The workstation card for those with higher standards. Key Features: AMD FirePro™ W7000 workstation graphics > Optimized performance for delivers incredible performance, superb workstation graphics applications visual quality and outstanding multi-display > AMD Graphics Core Next design experiences to engineering, design Architecture and digital media professionals — all from a > AMD Eye�nity technology single-slot solution. Its 3D primitive graphics performance is up to 2.1 times as fast as the > 4GB GDDR5 memory competing solutions, giving designers > 256-bit memory interface smoother interactivity when working with > 154 GB/s memory bandwidth complex 3D models allowing them to > Four DisplayPort outputs quickly visualize and render designs.1 > Maximum resolution 4096x2160 AMD FirePro™ W7000 offers up to 1.7 times > DisplayPort 1.2 support more memory bandwidth than competing solutions2, > AMD PowerTune and AMD ZeroCore Power > Support for DirectGMA bringing unmatched application responsiveness that technologies that allow for dynamic power > PCIe® 3.0 compliant professionals working with advanced visualization, management and higher engine clock speeds complex models, large data sets and video footage delivering improved performance and efficient > Designed and thoroughly power management.5 tested by AMD need. Using AMD Eyefinity multi-display technology, AMD FirePro™ W7000 can drive up to four native > Planned four year minimum lifecycle > GeometryBoost delivers real-time rendering of displays and up to six displays using DisplayPort 1.2, complex, realistic images at high tessellation speeds, > Limited three year warranty allowing designers and unparalleled productivity while a full 30-bit display pipeline enables a palette of > DirectX® 11.1, OpenCL™ 1.2 and and flexibility.3 more than 1.07 billion color values for more accurate OpenGL 4.2 support color reproduction and superior visual fidelity; requires This very powerful product is: 30-bit display.
    [Show full text]
  • SAPPHIRE R9 285 2GB GDDR5 ITX COMPACT OC Edition (UEFI)
    Specification Display Support 4 x Maximum Display Monitor(s) support 1 x HDMI (with 3D) Output 2 x Mini-DisplayPort 1 x Dual-Link DVI-I 928 MHz Core Clock GPU 28 nm Chip 1792 x Stream Processors 2048 MB Size Video Memory 256 -bit GDDR5 5500 MHz Effective 171(L)X110(W)X35(H) mm Size. Dimension 2 x slot Driver CD Software SAPPHIRE TriXX Utility DVI to VGA Adapter Mini-DP to DP Cable Accessory HDMI 1.4a high speed 1.8 meter cable(Full Retail SKU only) 1 x 8 Pin to 6 Pin x2 Power adaptor Overview HDMI (with 3D) Support for Deep Color, 7.1 High Bitrate Audio, and 3D Stereoscopic, ensuring the highest quality Blu-ray and video experience possible from your PC. Mini-DisplayPort Enjoy the benefits of the latest generation display interface, DisplayPort. With the ultra high HD resolution, the graphics card ensures that you are able to support the latest generation of LCD monitors. Dual-Link DVI-I Equipped with the most popular Dual Link DVI (Digital Visual Interface), this card is able to display ultra high resolutions of up to 2560 x 1600 at 60Hz. Advanced GDDR5 Memory Technology GDDR5 memory provides twice the bandwidth per pin of GDDR3 memory, delivering more speed and higher bandwidth. Advanced GDDR5 Memory Technology GDDR5 memory provides twice the bandwidth per pin of GDDR3 memory, delivering more speed and higher bandwidth. AMD Stream Technology Accelerate the most demanding applications with AMD Stream technology and do more with your PC. AMD Stream Technology allows you to use the teraflops of compute power locked up in your graphics processer on tasks other than traditional graphics such as video encoding, at which the graphics processor is many, many times faster than using the CPU alone.
    [Show full text]
  • AMD Firepro™ W5000
    AMD FirePro™ W5000 Be Limitless, When Every Detail Counts. The most powerful mid-range workstation graphics card ever created. Key Features: AMD FirePro™ W5000 is the most powerful > Utilizes Graphics Core Next (GCN) to efficiently balance compute tasks with mid-range workstation graphics card in the market. 3D workloads, enabling multi-tasking It delivers significantly higher performance than that is designed to optimize utilization the competing cards measured against a wide set and maximize performance. of parameters and in real world workflows.3 > Unmatched application responsiveness in your workflow, This powerful product, designed for delivering whether in advanced visualization, superior performance for CAD/CAE and Media complex models, large data sets or workflows, can process up to 1.65 billion triangles video editing. per second. This means during the design process > AMD ZeroCore Power Technology you can easily interact and render your 3D models, enables your GPU to power down when your monitor is off. while the competition can only process up to 0.41 billion > An energy-efficient design uses AMD PowerTune triangles per second (up to four times less > GeometryBoost—the GPU processes 3 technology to dynamically optimize GPU power geometry data at a rate of twice per performance). It also offers double the memory usage while AMD ZeroCore Power technology clock cycle, doubling the rate of primitive of competing products (2GB vs. 1GB) and 2.5x significantly reduces power consumption at idle. and vertex processing. the memory bandwidth.3 It’s the ideal solution > AMD Eyefinity Technology— for professionals working with a broad range of > The Industry-leading multi-display technology, Industry-leading multi-display AMD Eyefinity, enables highly immersive and technology enabling highly immersive applications, moderately complex models and datasets, and advanced visual effects.
    [Show full text]
  • Application-Controlled Frequency Scaling Explained
    The TURBO Diaries: Application-controlled Frequency Scaling Explained Jons-Tobias Wamhoff, Stephan Diestelhorst, and Christof Fetzer, Technische Universät Dresden; Patrick Marlier and Pascal Felber, Université de Neuchâtel; Dave Dice, Oracle Labs https://www.usenix.org/conference/atc14/technical-sessions/presentation/wamhoff This paper is included in the Proceedings of USENIX ATC ’14: 2014 USENIX Annual Technical Conference. June 19–20, 2014 • Philadelphia, PA 978-1-931971-10-2 Open access to the Proceedings of USENIX ATC ’14: 2014 USENIX Annual Technical Conference is sponsored by USENIX. The TURBO Diaries: Application-controlled Frequency Scaling Explained Jons-Tobias Wamhoff Patrick Marlier Dave Dice Stephan Diestelhorst Pascal Felber Christof Fetzer Technische Universtat¨ Dresden, Germany Universite´ de Neuchatel,ˆ Switzerland Oracle Labs, USA Abstract these features from an application as needed. Examples in- Most multi-core architectures nowadays support dynamic volt- clude: accelerating the execution of key sections of code on age and frequency scaling (DVFS) to adapt their speed to the the critical path of multi-threaded applications [9]; boosting system’s load and save energy. Some recent architectures addi- time-critical operations or high-priority threads; or reducing tionally allow cores to operate at boosted speeds exceeding the the energy consumption of applications executing low-priority nominal base frequency but within their thermal design power. threads. Furthermore, workloads specifically designed to run In this paper, we propose a general-purpose library that on processors with heterogeneous cores (e.g., few fast and allows selective control of DVFS from user space to accelerate many slow cores) may take additional advantage of application- multi-threaded applications and expose the potential of hetero- level frequency scaling.
    [Show full text]
  • Limited Lifetime Manufacturer's Warranty
    VisionTek Part# 900689 RADEON R9 280 SPECS • Graphics Engine: RADEON R9 280 LIKE A SUPERCHARGER FOR YOUR GAMING RIG • Video Memory: 3GB GDDR5 • Memory Interface: 384bit With 6Gb/s throughput and over four Teraflops of compute performance, you may just want to use seat belts • DirectX® Support: 11.2 when you fire up the VisionTek Radeon R9 280. It enables you to take advantage of 1440p and up to 4K Ultra HD • Bus Standard: PCI Express 3.0 high resolution displays and run the latest games without any lag. With support for the DIRECTX® 11.2 graphics • Core Speed: 855MHz (960MHz Boost) standard, you’ll elevate your gaming experience with stunning 3D visual effects, realistic lighting, and lifelike • Memory Speed: 1250MHz x4 (or faster) imagery. Additionally, the VisionTek Radeon R9 280 offers the following advanced AMD technologies: • DVI Output: DL DVI-I • AMD PowerTune technology for higher frame rates and automatic overclocking • VGA Output: Using DVI to VGA Adapter • AMD ZeroCore Power technology for automatic power-saving efficiency • HDMI Output: HDMI (Video/Audio) • AMD App Acceleration “co-processing” to improve performance of common computing tasks • DisplayPort Output: 2x mini DP • UEFI Ready: Support With 3GB of 384bit DDR5 memory, award-winning Graphics Core Next (GCN) architecture, and DVI-I/HDMI/2x mini DisplayPort outputs, the VisionTek Radeon R9 280 can run today’s most popular games at 1440p resolution SYSTEM REQUIREMENTS faster than other brand GPU powered graphics cards. For peace of mind ownership, it comes backed by an • PCI Express® based PC is required with industry leading lifetime warranty and free lifetime US-based technical support.
    [Show full text]
  • AMD's Early Processor Lines, up to the Hammer Family (Families K8
    AMD’s early processor lines, up to the Hammer Family (Families K8 - K10.5h) Dezső Sima October 2018 (Ver. 1.1) Sima Dezső, 2018 AMD’s early processor lines, up to the Hammer Family (Families K8 - K10.5h) • 1. Introduction to AMD’s processor families • 2. AMD’s 32-bit x86 families • 3. Migration of 32-bit ISAs and microarchitectures to 64-bit • 4. Overview of AMD’s K8 – K10.5 (Hammer-based) families • 5. The K8 (Hammer) family • 6. The K10 Barcelona family • 7. The K10.5 Shanghai family • 8. The K10.5 Istambul family • 9. The K10.5-based Magny-Course/Lisbon family • 10. References 1. Introduction to AMD’s processor families 1. Introduction to AMD’s processor families (1) 1. Introduction to AMD’s processor families AMD’s early x86 processor history [1] AMD’s own processors Second sourced processors 1. Introduction to AMD’s processor families (2) Evolution of AMD’s early processors [2] 1. Introduction to AMD’s processor families (3) Historical remarks 1) Beyond x86 processors AMD also designed and marketed two embedded processor families; • the 2900 family of bipolar, 4-bit slice microprocessors (1975-?) used in a number of processors, such as particular DEC 11 family models, and • the 29000 family (29K family) of CMOS, 32-bit embedded microcontrollers (1987-95). In late 1995 AMD cancelled their 29K family development and transferred the related design team to the firm’s K5 effort, in order to focus on x86 processors [3]. 2) Initially, AMD designed the Am386/486 processors that were clones of Intel’s processors.
    [Show full text]
  • Amd Power Management Srilatha Manne 11/14/2013 Outline
    AMD POWER MANAGEMENT SRILATHA MANNE 11/14/2013 OUTLINE APUs and HSA Richland APU APU Power Management Monitoring and Management Features 2 | DOE WEBINAR| NOV 14, 2013 APU: ACCELERATED PROCESSING UNIT The APU has arrived and it is a great advance over previous platforms Combines scalar processing on CPU with parallel processing on the GPU and high-bandwidth access to memory Advantages over other forms of compute offload: Easier to program Easier to optimize Easier to load-balance Higher performance Lower power © Copyright 2012 HSA Foundation. All Rights Reserved. 3 WHAT IS HSA? Joins CPUs, GPUs, and accelerators into a unified computing framework Single address space accessible to avoid the overhead of data copying HSA Accelerated Applications Access to Broad Set of Programming Use-space queuing to minimize Languages communication overhead Pre-emptive context switching for better Legacy quality of service Applications Simplified programming HSA Runtime Single, standard computing environments Infrastructure Support for mainstream languages— OS C, C++, Fortran, Java, .NET Lower development costs Memory Optimized compute density Radical performance improvement for CPU GPU ACC HPC, big data, and multimedia workloads HSA Platform Low power to maximize performance per watt 4 | AMD OPTERON™ PROCESSOR ROADMAP | NOVEMBER 13, 2013 A NEW ERA OF PROCESSOR PERFORMANCE Heterogeneous Single-core Era Multi-core Era Systems Era Temporarily Enabled by: Constrained by: Enabled by: Constrained by: Enabled by: Moore’s Power Moore’s Law Power Abundant data Constrained by: Law Complexity SMP Parallel SW parallelism Programming Voltage architecture Scalability Power efficient models Scaling GPUs Comm.overhead pthreads OpenMP / TBB … Assembly C/C++ Java Shader CUDA OpenCL … C++ and Java ? we are here we are Performance Performance here Throughput Performance Performance we are Modern Modern Application here Single-thread Single-thread Performance Time Time (# of processors) Time (Data-parallel exploitation) © Copyright 2012 HSA Foundation.
    [Show full text]
  • AMD Firepro™ W4100 Professional Graphics in a Class of Its Own
    AMD FirePro™ W4100 Professional Graphics In a class of its own Key Features: The AMD FirePro™ W4100 professional • Application optimizations graphics card represents a completely and certifications • AMD Graphics Core Next (GCN) new class of product – one that provides GPU architecture you with great graphics performance • Four Mini DisplayPort outputs • DisplayPort 1.2a support and display versatility while housed in a • AMD Eyefinity technology1 compact, low-power design. • 4K display resolution (up to 4096 x 2160) • 512 stream processors Increase your productivity by working across up to four high-resolution displays with AMD Eyefinity 1 • 645.1 GFLOPS peak single precision technology. Manipulate 3D models and large data sets with ease thanks to 2GB of ultrafast GDDR5 memory. With a stable driver that supports a growing list of optimized and certified applications, the • 2GB GDDR5 memory AMD FirePro W4100 is uniquely suited to provide the performance and quality you expect, and more, • 128-bit memory interface from a professional graphics card. • Up to 72GB/s memory bandwidth • PCIe® 3.0 compliant Peformance Get solid, midrange performance with the AMD FirePro W4100, delivering CAD performance that is up • OpenCL™, DirectX® and OpenGL support to 100% faster than the previous generation2. Equipped with 2GB of ultrafast GDDR5 memory with a • 50W maximum power consumption 128-bit memory interface, the AMD FirePro W4100 delivers up to 72GB/s of memory bandwidth, helping • Discreet active cooling solution improve application responsiveness to your workflows. Accelerate your 3D applications with 512 stream • Low profile single-slot form factor processors and enable more efficient data transfers between the GPU and CPU with PCIe® 3.0 support.
    [Show full text]