Building the IBM 4758 Secure Coprocessor

Total Page:16

File Type:pdf, Size:1020Kb

Building the IBM 4758 Secure Coprocessor COVER FEATURE Building the IBM 4758 Secure Coprocessor Meeting the challenge of building a user-configurable secure coprocessor provided several lessons in hardware and software development and continues to spur further research. Joan G. Dyer ork leading toward the IBM 4758 long as it could be transformed into a CCA follow-on started, arguably, in the 1980s when the and meet the appropriate deadlines. Mark Abyss project began exploring tech- Seeking to provide an environment where applica- Lindemann niques to build tamper-responsive hard- tions could run securely forced us to focus not only on Ronald Perez ware and use that technology to protect security mechanisms and their implementation and W 1,2 against software piracy. Influenced by this earlier management, but also on various flavors of security Reiner Sailer work, we sought to build a secure coprocessor, policies they must support. Clearly, the hardware on Leendert defined as a tamper-responding device derived from which applications run must be secure, as must the van Doorn the Abyss, Citadel, and 4755 work. We wanted to operating system and runtime environment in IBM T.J. Watson provide a single multipurpose platform that third par- between, while offering a reasonable API for applica- Research Center ties could use to develop and deploy secure coproces- tions developers. To fix problems in the field and sor applications, with minimal IBM participation. To enable fast and inexpensive reaction to changing cus- Sean W. Smith accomplish this, we sought to achieve several goals: tomer needs, we implemented part of the code as Dartmouth firmware, rather than read-only memory. Figure 1 College • Ensure that the device can be identified externally shows the 4758’s three major components and their as to contents, using some form of outbound interrelationships. authentication and public-key interface (PKI). Subdividing the software into different layers raises Steve • Design the device and its software to be securely issues of trust because upper components rely on the Weingart configurable and updatable in the field. security that lower layers offer. Applications cannot Cryptographic • Construct the software architecture to accom- be more secure than the kernel functions they call, and Appliances modate layers of code from different parties, who the operating system cannot be more secure than the may or may not trust each other. hardware that executes its commands. • Avoid letting the compromise of one device Thus, if the lower layers are robust, higher layers breach the security of another. can choose whether to relinquish some security. We • Validate all these assertions through an external designed the lower layers to be relatively permanent party. firmware, or static hardware, thereby offering fewer points of attack throughout operation. Decreasing IBM’s Common Cryptographic Architecture prod- complexity, careful evaluation, and more effective uct group realized that its next-generation product security mechanisms justify the increasing trust along required properties possessed by the secure coproces- the vertical arrow. Management of the different lay- sor that IBM Research advocated. This knowledge ers must reflect the demand for increasing trust in gave the research team a unique and perhaps nonre- lower layers, and be flexible enough to support mul- peatable opportunity: funding and authority to design tiple security policies in various organizations. and produce the product we thought should exist, as As the “Design Decisions” sidebar notes, we 0018-9162/01/$10.00 © 2001 IEEE October 2001 57 Figure 1. Overview of Manufacturing considerations a secure coproces- Because we sought to provide a third-party pro- sor’s three major Software: grammable device, the manufacturing process could • Applications design components: Officers not incorporate anything related to the eventual soft- • Environment/OS ware. Programming, loading of code, or both are com- hardware and firm- • Kernel ware, which only IBM • Loaders pleted outside the factory at the customer’s discretion. could alter, and soft- Consequently, when a 4758 leaves the factory, it must be fully armed against tampering. ware, which custo- Firmware: mers could configure IBM • Post Trust in the field to meet • Miniboot Initialization and the trusted root their specific require- Each 4758 undergoes the same initialization proce- ments. Hardware: dure, eliminating the need for individual personaliza- • Processor tion—which is expensive, error prone, generally IBM • Flash, RAM, ROM requires a large quantity, and doesn’t fit the general- • Locks purpose secure-coprocessor model. An authentication • Tamper-responding unit protocol is performed in the factory to establish an ini- • Crypto functions tial secret using a valid source of randomness and min- imal software within the 4758’s ROM. When shipped, each 4758 can authenticate requests to download code achieved our major goals, but learned through hind- and respond to tampering. Having a meaningful out- sight several techniques for improving the secure- bound authentication requires a securely generated root coprocessor design process. certificate, which is established in the factory environ- ment. SECURITY AND TRUST We based the 4758’s security on an architecture and Layers and codeload boundaries implementation validated from manufacture to Layer refers to the division of software into inde- deployment. pendent pieces. The initial layer is inserted into ROM Design Decisions We found the following to be the crucial ingredients needed Hence, for the Federal Information Processing Standards (FIPS) to build a programmable, secure coprocessor. validation procedure, we could only use the hardware to seed a pseudorandom-number generator (PRNG), per the FIPS 186-1 Hardware tamper response standard. Second, the mandatory statistical tests to ensure that Relying upon software to react to tampering is too slow. How- the RNG works correctly—performed on both the hardware ever, this observation implies a well-sealed package and careful source and the PRNG output—slow the boot process by more design of the communication paths between the 4758 and the than 20 seconds. outside world. Dissipating heat becomes a problem, which affects the speed at which components internal to the 4758 can be run. Layered design The ratchet hardware provides the mechanism that assures A programmable device must have a layer responsible for prior firmware or software layers that subsequent layers cannot implementing code load security policy. The layer should be val- alter their secrets. Hardware enforces this assurance best, ren- idated, as its correctness forms the basis upon which subsequent dering areas of storage invisible, protecting areas from tamper, or layers depend for their own security. both. The number of ratchet lock settings, and what they protect, Subsequent layers personalize the device. We initially selected requires careful iterative design—something that deadlines and two layers: one for an operating system and device drivers, the other constraints do not always allow for. Ideally, the number of second for application code. This division made it possible for locks and associated regions would be configured by firmware. third parties to either write applications to an operating system we supplied, or write everything themselves. Randomness Finally, minimal software in ROM provides for repairs or Generating initial secrets requires a validated source of ran- extensions of initial facilities using the secure field update archi- domness, which ensures that possible destruction of a few cards tecture. cannot compromise any others. We encountered two problems with this requirement, neither of them fatal. First, we could find Self-initialization no standard for hardware random-number generators (RNGs). A programmable device must leave the factory ready to 58 Computer during the 4758’s manufacturing phase, and the Secure boot remaining layers are downloaded into flash memory A 4758 provides a secure boot of itself. The hardware using a public-key authentication process associated and firmware guarantee that the coprocessor stores only with a layer’s owner. A codeload boundary denotes a authorized code in flash memory. Authorization is defined point at which the system accepts a signed file accomplished at codeload time. Checks of flash for authenticating a load-this-layer command. integrity are made at each boot before passing control to the next layer. The checks run against hardware- Officers and signing keys enforced flash contents that are locked read-only by the At each codeload boundary, an officer and key pair time the operating system and application run. authenticate loading. The previous layer’s officer The validity of these statements follows, in part, from establishes ownership and provides a certificate attest- the secure packaging. An attempt to replace the ROM ing to the new owner’s initial public key. The officer or flash contents destroys the 4758’s secrets. The dam- owning the layer has complete control over subse- aged coprocessor might still run the same application quent loads, including changing the key pair and software, but it could not authenticate itself when establishing various trust targeting options. booted, and the application software’s battery-backed RAM and encrypted flash information would be un- Retaining state available. A secure coprocessor provides the ideal place to keep secrets, provided they disappear upon tamper- HARDWARE ing and that any compromise of a single device does
Recommended publications
  • X86 Platform Coprocessor/Prpmc (PC on a PMC)
    x86 Platform Coprocessor/PrPMC (PC on a PMC) 32b/33MHz PCI bus PN1/PN2 +5V to Kbd/Mouse/USB power Vcore Power +3.3V +2.5v Conditioning +3.3VIO 1K100 FPGA & 512KB 10/100TX TMS2250 PCI BIOS/PLA Ethernet RJ45 Compact to PCI bridge Flash Memory PLA I/O Flash site 8 32b/33MHz Internal PCI bus Analog SVGA Video Pwr Seq AMD SC2200 Signal COM 1 (RXD/TXD only) IDE "GEODE (tm)" Conditioning COM 2 (RXD/TXD only) Processor USB Port 1 Rear I/O PN4 I/O Rear 64 USB Port 2 LPC Keyboard/Mouse Floppy 36 Pin 36 Pin Connector PC87364 128MB SDRAM Status LEDs Super I/O (16MWx64b) PC Spkr This board is a Processor PMC (PrPMC) implementing A PrPMC implements a TMS2250 PCI-to-PCI bridge to an x86-based PC architecture on a single-wide, stan- the host, and a “Coprocessor” cofniguration implements dard height PMC form factor. This delivers a complete back-to-back 16550 compatible COM ports. The “Mon- x86 platform with comprehensive I/O support, a 10/100- arch” signal selects either PrPMC or Co-processor TX Ethernet interface, and disk access (both floppy and mode. IDE drives). The board features an on-board hard drive using Compact Flash. A 512Kbyte FLASH memory holds the 1K100 PLA im- age that is automatically loaded on power up. It also At the heart of the design is an Advanced Micro De- contains the General Software Embedded BIOS 2000™ vices SC2200 GEODE™ processor that integrates video, BIOS code that is included with each board.
    [Show full text]
  • Convey Overview
    THE WORLD’S FIRST HYBRID-CORE COMPUTER. CONVEY HYBRID-CORE COMPUTING Hybrid-core Computing Convey HC-1 High Performance of application- specific hardware Heterogenous solutions • can be much more efficient Performance/ • still hard to program Programmability and Power efficiency deployment ease of an x86 server Application Multicore solutions • don’t always scale well • parallel programming is hard Low Difficult Ease of Deployment Easy 1/22/2010 3 Hybrid-Core Computing Application-Specific Personalities Applications • Extend the x86 instruction set • Implement key operations in Convey Compilers hardware Life Sciences x86-64 ISA Custom ISA CAE Custom Financial Oil & Gas Shared Virtual Memory Cache-coherent, shared memory • Both ISAs address common memory *ISA: Instruction Set Architecture 7/12/2010 4 HC-1 Hardware PCI I/O FPGA FPGA Intel Personalities Chipset FPGA FPGA 8 GB/s 80 GB/s Memory Memory Cache Coherent, Shared Virtual Memory 1/22/2010 5 Using Personalities C/C++ Fortran • Personalities are user specifies reloadable personality at instruction sets Convey Software compile time Development Suite • Compiler instruction generates x86 descriptions and coprocessor instructions from Hybrid-Core Executable P ANSI standard x86-64 and Coprocessor Personalities C/C++ & Fortran Instructions • Executable can run on x86 nodes FPGA Convey HC-1 or Convey Hybrid- bitfiles Core nodes Intel x86 Coprocessor personality loaded at runtime by OS 1/22/2010 6 SYSTEM ARCHITECTURE HC-1 Architecture “Commodity” Intel Server Convey FPGA-based coprocessor Direct
    [Show full text]
  • Exploiting Free Silicon for Energy-Efficient Computing Directly
    Exploiting Free Silicon for Energy-Efficient Computing Directly in NAND Flash-based Solid-State Storage Systems Peng Li Kevin Gomez David J. Lilja Seagate Technology Seagate Technology University of Minnesota, Twin Cities Shakopee, MN, 55379 Shakopee, MN, 55379 Minneapolis, MN, 55455 [email protected] [email protected] [email protected] Abstract—Energy consumption is a fundamental issue in today’s data A well-known solution to the memory wall issue is moving centers as data continue growing dramatically. How to process these computing closer to the data. For example, Gokhale et al [2] data in an energy-efficient way becomes more and more important. proposed a processor-in-memory (PIM) chip by adding a processor Prior work had proposed several methods to build an energy-efficient system. The basic idea is to attack the memory wall issue (i.e., the into the main memory for computing. Riedel et al [9] proposed performance gap between CPUs and main memory) by moving com- an active disk by using the processor inside the hard disk drive puting closer to the data. However, these methods have not been widely (HDD) for computing. With the evolution of other non-volatile adopted due to high cost and limited performance improvements. In memories (NVMs), such as phase-change memory (PCM) and spin- this paper, we propose the storage processing unit (SPU) which adds computing power into NAND flash memories at standard solid-state transfer torque (STT)-RAM, researchers also proposed to use these drive (SSD) cost. By pre-processing the data using the SPU, the data NVMs as the main memory for data-intensive applications [10] to that needs to be transferred to host CPUs for further processing improve the system energy-efficiency.
    [Show full text]
  • CUDA What Is GPGPU
    What is GPGPU ? • General Purpose computation using GPU in applications other than 3D graphics CUDA – GPU accelerates critical path of application • Data parallel algorithms leverage GPU attributes – Large data arrays, streaming throughput Slides by David Kirk – Fine-grain SIMD parallelism – Low-latency floating point (FP) computation • Applications – see //GPGPU.org – Game effects (FX) physics, image processing – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting Previous GPGPU Constraints CUDA • Dealing with graphics API per thread per Shader Input Registers • “Compute Unified Device Architecture” – Working with the corner cases per Context • General purpose programming model of the graphics API Fragment Program Texture – User kicks off batches of threads on the GPU • Addressing modes Constants – GPU = dedicated super-threaded, massively data parallel co-processor – Limited texture size/dimension Temp Registers • Targeted software stack – Compute oriented drivers, language, and tools Output Registers • Shader capabilities • Driver for loading computation programs into GPU FB Memory – Limited outputs – Standalone Driver - Optimized for computation • Instruction sets – Interface designed for compute - graphics free API – Data sharing with OpenGL buffer objects – Lack of Integer & bit ops – Guaranteed maximum download & readback speeds • Communication limited – Explicit GPU memory management – Between pixels – Scatter a[i] = p 1 Parallel Computing on a GPU Extended C • Declspecs • NVIDIA GPU Computing Architecture – global, device, shared, __device__ float filter[N]; – Via a separate HW interface local, constant __global__ void convolve (float *image) { – In laptops, desktops, workstations, servers GeForce 8800 __shared__ float region[M]; ... • Keywords • 8-series GPUs deliver 50 to 200 GFLOPS region[threadIdx] = image[i]; on compiled parallel C applications – threadIdx, blockIdx • Intrinsics __syncthreads() – __syncthreads ..
    [Show full text]
  • Comparing the Power and Performance of Intel's SCC to State
    Comparing the Power and Performance of Intel’s SCC to State-of-the-Art CPUs and GPUs Ehsan Totoni, Babak Behzad, Swapnil Ghike, Josep Torrellas Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA E-mail: ftotoni2, bbehza2, ghike2, [email protected] Abstract—Power dissipation and energy consumption are be- A key architectural challenge now is how to support in- coming increasingly important architectural design constraints in creasing parallelism and scale performance, while being power different types of computers, from embedded systems to large- and energy efficient. There are multiple options on the table, scale supercomputers. To continue the scaling of performance, it is essential that we build parallel processor chips that make the namely “heavy-weight” multi-cores (such as general purpose best use of exponentially increasing numbers of transistors within processors), “light-weight” many-cores (such as Intel’s Single- the power and energy budgets. Intel SCC is an appealing option Chip Cloud Computer (SCC) [1]), low-power processors (such for future many-core architectures. In this paper, we use various as embedded processors), and SIMD-like highly-parallel archi- scalable applications to quantitatively compare and analyze tectures (such as General-Purpose Graphics Processing Units the performance, power consumption and energy efficiency of different cutting-edge platforms that differ in architectural build. (GPGPUs)). These platforms include the Intel Single-Chip Cloud Computer The Intel SCC [1] is a research chip made by Intel Labs (SCC) many-core, the Intel Core i7 general-purpose multi-core, to explore future many-core architectures. It has 48 Pentium the Intel Atom low-power processor, and the Nvidia ION2 (P54C) cores in 24 tiles of two cores each.
    [Show full text]
  • World's First High- Performance X86 With
    World’s First High- Performance x86 with Integrated AI Coprocessor Linley Spring Processor Conference 2020 April 8, 2020 G Glenn Henry Dr. Parviz Palangpour Chief AI Architect AI Software Deep Dive into Centaur’s New x86 AI Coprocessor (Ncore) • Background • Motivations • Constraints • Architecture • Software • Benchmarks • Conclusion Demonstrated Working Silicon For Video-Analytics Edge Server Nov, 2019 ©2020 Centaur Technology. All Rights Reserved Centaur Technology Background • 25-year-old startup in Austin, owned by Via Technologies • We design, from scratch, low-cost x86 processors • Everything to produce a custom x86 SoC with ~100 people • Architecture, logic design and microcode • Design, verification, and layout • Physical build, fab interface, and tape-out • Shipped by IBM, HP, Dell, Samsung, Lenovo… ©2020 Centaur Technology. All Rights Reserved Genesis of the AI Coprocessor (Ncore) • Centaur was developing SoC (CHA) with new x86 cores • Targeted at edge/cloud server market (high-end x86 features) • Huge inference markets beyond hyperscale cloud, IOT and mobile • Video analytics, edge computing, on-premise servers • However, x86 isn’t efficient at inference • High-performance inference requires external accelerator • CHA has 44x PCIe to support GPUs, etc. • But adds cost, power, another point of failure, etc. ©2020 Centaur Technology. All Rights Reserved Why not integrate a coprocessor? • Very low cost • Many components already on SoC (“free” to Ncore) Caches, memory, clock, power, package, pins, busses, etc. • There often is “free” space on complex SOCs due to I/O & pins • Having many high-performance x86 cores allows flexibility • The x86 cores can do some of the work, in parallel • Didn’t have to implement all strange/new functions • Allows fast prototyping of new things • For customer: nothing extra to buy ©2020 Centaur Technology.
    [Show full text]
  • Introduction to Cpu
    microprocessors and microcontrollers - sadri 1 INTRODUCTION TO CPU Mohammad Sadegh Sadri Session 2 Microprocessor Course Isfahan University of Technology Sep., Oct., 2010 microprocessors and microcontrollers - sadri 2 Agenda • Review of the first session • A tour of silicon world! • Basic definition of CPU • Von Neumann Architecture • Example: Basic ARM7 Architecture • A brief detailed explanation of ARM7 Architecture • Hardvard Architecture • Example: TMS320C25 DSP microprocessors and microcontrollers - sadri 3 Agenda (2) • History of CPUs • 4004 • TMS1000 • 8080 • Z80 • Am2901 • 8051 • PIC16 microprocessors and microcontrollers - sadri 4 Von Neumann Architecture • Same Memory • Program • Data • Single Bus microprocessors and microcontrollers - sadri 5 Sample : ARM7T CPU microprocessors and microcontrollers - sadri 6 Harvard Architecture • Separate memories for program and data microprocessors and microcontrollers - sadri 7 TMS320C25 DSP microprocessors and microcontrollers - sadri 8 Silicon Market Revenue Rank Rank Country of 2009/2008 Company (million Market share 2009 2008 origin changes $ USD) Intel 11 USA 32 410 -4.0% 14.1% Corporation Samsung 22 South Korea 17 496 +3.5% 7.6% Electronics Toshiba 33Semiconduc Japan 10 319 -6.9% 4.5% tors Texas 44 USA 9 617 -12.6% 4.2% Instruments STMicroelec 55 FranceItaly 8 510 -17.6% 3.7% tronics 68Qualcomm USA 6 409 -1.1% 2.8% 79Hynix South Korea 6 246 +3.7% 2.7% 812AMD USA 5 207 -4.6% 2.3% Renesas 96 Japan 5 153 -26.6% 2.2% Technology 10 7 Sony Japan 4 468 -35.7% 1.9% microprocessors and microcontrollers
    [Show full text]
  • AI Chips: What They Are and Why They Matter
    APRIL 2020 AI Chips: What They Are and Why They Matter An AI Chips Reference AUTHORS Saif M. Khan Alexander Mann Table of Contents Introduction and Summary 3 The Laws of Chip Innovation 7 Transistor Shrinkage: Moore’s Law 7 Efficiency and Speed Improvements 8 Increasing Transistor Density Unlocks Improved Designs for Efficiency and Speed 9 Transistor Design is Reaching Fundamental Size Limits 10 The Slowing of Moore’s Law and the Decline of General-Purpose Chips 10 The Economies of Scale of General-Purpose Chips 10 Costs are Increasing Faster than the Semiconductor Market 11 The Semiconductor Industry’s Growth Rate is Unlikely to Increase 14 Chip Improvements as Moore’s Law Slows 15 Transistor Improvements Continue, but are Slowing 16 Improved Transistor Density Enables Specialization 18 The AI Chip Zoo 19 AI Chip Types 20 AI Chip Benchmarks 22 The Value of State-of-the-Art AI Chips 23 The Efficiency of State-of-the-Art AI Chips Translates into Cost-Effectiveness 23 Compute-Intensive AI Algorithms are Bottlenecked by Chip Costs and Speed 26 U.S. and Chinese AI Chips and Implications for National Competitiveness 27 Appendix A: Basics of Semiconductors and Chips 31 Appendix B: How AI Chips Work 33 Parallel Computing 33 Low-Precision Computing 34 Memory Optimization 35 Domain-Specific Languages 36 Appendix C: AI Chip Benchmarking Studies 37 Appendix D: Chip Economics Model 39 Chip Transistor Density, Design Costs, and Energy Costs 40 Foundry, Assembly, Test and Packaging Costs 41 Acknowledgments 44 Center for Security and Emerging Technology | 2 Introduction and Summary Artificial intelligence will play an important role in national and international security in the years to come.
    [Show full text]
  • Instruction Set Innovations for Convey's HC-1 Computer
    Instruction Set Innovations for Convey's HC-1 Computer THE WORLD’S FIRST HYBRID-CORE COMPUTER. Hot Chips Conference 2009 [email protected] Introduction to Convey Computer • Company Status – Second round venture based startup company – Product beta systems are at customer sites – Currently staffing at 36 people – Located in Richardson, Texas • Investors – Four Venture Capital Investors • Interwest Partners (Menlo Park) • CenterPoint Ventures (Dallas) • Rho Ventures (New York) • Braemar Energy Ventures (Boston) – Two Industry Investors • Intel Capital • Xilinx Presentation Outline • Overview of HC-1 Computer • Instruction Set Innovations • Application Examples Page 3 Hot Chips Conference 2009 What is a Hybrid-Core Computer ? A hybrid-core computer improves application performance by combining an x86 processor with hardware that implements application-specific instructions. ANSI Standard Applications C/C++/Fortran Convey Compilers x86 Coprocessor Instructions Instructions Intel® Processor Hybrid-Core Coprocessor Oil & Gas& Oil Financial Sciences Custom CAE Application-Specific Personalities Cache-coherent shared virtual memory Page 4 Hot Chips Conference 2009 What Is a Personality? • A personality is a reloadable set of instructions that augment x86 application the x86 instruction set Processor specific – Applicable to a class of applications instructions or specific to a particular code • Each personality is a set of files that includes: – The bits loaded into the Coprocessor – Information used by the Convey compiler • List of
    [Show full text]
  • CUDA C++ Programming Guide
    CUDA C++ Programming Guide Design Guide PG-02829-001_v11.4 | September 2021 Changes from Version 11.3 ‣ Added Graph Memory Nodes. ‣ Formalized Asynchronous SIMT Programming Model. CUDA C++ Programming Guide PG-02829-001_v11.4 | ii Table of Contents Chapter 1. Introduction........................................................................................................ 1 1.1. The Benefits of Using GPUs.....................................................................................................1 1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model....... 2 1.3. A Scalable Programming Model.............................................................................................. 3 1.4. Document Structure................................................................................................................. 5 Chapter 2. Programming Model.......................................................................................... 7 2.1. Kernels.......................................................................................................................................7 2.2. Thread Hierarchy...................................................................................................................... 8 2.3. Memory Hierarchy...................................................................................................................10 2.4. Heterogeneous Programming................................................................................................11 2.5. Asynchronous
    [Show full text]
  • Intel Xeon Phi Product Family Brief
    PRODUCT BRIEF The Intel® Xeon Phi™ Product Family Highly-Parallel Processing for Unparalleled Discovery Breakthrough Performance for Your Highly-Parallel Applications Extracting extreme performance from highly-parallel “Moving a code to Intel Xeon Phi might applications just got easier. Intel® Xeon Phi™ coprocessors, involve sitting down and adding a based on Intel Many Integrated Core (MIC) architecture, complement the industry-leading performance and couple lines of directives that takes energy-efficiency of the Intel® Xeon® processor E5 family a few minutes. Moving a code to to enable dramatic performance gains for some of today’s most demanding applications—up to 1.2 teraflops a GPU is a project.”3 per coprocessor.1 You can now achieve optimized performance for even your most highly-parallel technical –Dan Stanzione, Deputy Director at computing workloads, while maintaining a unified Texas Advanced Computing Center hardware and software environment.2 The Intel Xeon Phi Coprocessor Even Higher Efficiency for Parallel Processing A Family of Coprocessors for Diverse Needs While a majority of applications will continue to achieve Intel Xeon Phi coprocessors provide up to 61 cores, 244 maximum performance using Intel Xeon processors, threads, and 1.2 teraflops of performance, and they certain highly-parallel applications will benefit dramatically come in a variety of configurations to address diverse by using Intel Xeon Phi coprocessors. Each coprocessor hardware, software, workload, performance, and efficiency features many more and smaller cores, many more threads, requirements.1 They also come in a variety of form factors, and wider vector units. The high degree of parallelism including a standard PCIe* x16 form factor (with active, compensates for the lower speed of each individual core passive, or no thermal solution), and a dense form factor to deliver higher aggregate performance for highly- that offers additional design flexibility (Table 2).
    [Show full text]
  • Comparing Performance and Energy Efficiency of Fpgas and Gpus For
    Comparing Performance and Energy Efficiency of FPGAs and GPUs for High Productivity Computing Brahim Betkaoui, David B. Thomas, Wayne Luk Department of Computing, Imperial College London London, United Kingdom {bb105,dt10,wl}@imperial.ac.uk Abstract—This paper provides the first comparison of per- figurable Computing (HPRC). We define a HPRC system as formance and energy efficiency of high productivity computing a high performance computing system that relies on recon- systems based on FPGA (Field-Programmable Gate Array) and figurable hardware to boost the performance of commodity GPU (Graphics Processing Unit) technologies. The search for higher performance compute solutions has recently led to great general-purpose processors, while providing a programming interest in heterogeneous systems containing FPGA and GPU model similar to that of traditional software. In a HPRC accelerators. While these accelerators can provide significant system, problems are described using a more familiar high- performance improvements, they can also require much more level language, which allows software developers to quickly design effort than a pure software solution, reducing programmer improve the performance of their applications using reconfig- productivity. The CUDA system has provided a high productivity approach for programming GPUs. This paper evaluates the urable hardware. Our main contributions are: High-Productivity Reconfigurable Computer (HPRC) approach • Performance evaluation of benchmarks with different to FPGA programming, where a commodity CPU instruction memory characteristics on two high-productivity plat- set architecture is augmented with instructions which execute forms: an FPGA-based Hybrid-Core system, and a GPU- on a specialised FPGA co-processor, allowing the CPU and FPGA to co-operate closely while providing a programming based system.
    [Show full text]