Intel387™ SX MATH COPROCESSOR


- New Automatic Power Management
  - Low Power Consumption
  - Typically 100 mA in Dynamic Mode, and 4 mA in Idle Mode
- Socket Compatible with Intel387 Family of Math CoProcessors
  - Hardware and Software Compatible
  - Supported by Over 2100 Commercial Software Packages
  - 10% to 15% Performance Increase on Whetstone and Livermore Benchmarks
- Compatible with the Intel386™ SX Microprocessor
  - Extends CPU Instruction Set to Include Trigonometric, Logarithmic, and Exponential Instructions
- High Performance 80-Bit Internal Architecture
- Implements ANSI/IEEE Standard 754-1985 for Binary Floating-Point Arithmetic
- Available in a 68-Pin PLCC Package
  - See Intel Packaging Specification, Order #231369

The Intel387™ SX Math CoProcessor is an extension to the Intel386™ SX microprocessor architecture. The combination of the Intel387 SX with the Intel386 SX microprocessor dramatically increases the processing speed of computer application software that utilizes high-performance floating-point operations. An internal Power Management Unit enables the Intel387 SX to perform these floating-point operations while maintaining very low power consumption for portable and desktop applications; it effectively reduces power consumption by 95% when the device is idle. The Intel387 SX Math CoProcessor is available in a 68-pin PLCC package and is manufactured on Intel's advanced 1.0 micron CHMOS IV technology.

Intel386 and Intel387 are trademarks of Intel Corporation.
*Other brands and names are the property of their respective owners.

Information in this document is provided in connection with Intel products. Intel assumes no liability whatsoever, including infringement of any patent or copyright, for sale and use of Intel products except as provided in Intel's Terms and Conditions of Sale for such products.
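The typical current figures in the feature list make the quoted idle-mode savings easy to check. The sketch below is a quick illustrative calculation using only the "typical" values stated above (not measured data); note that 1 - 4/100 gives 96%, consistent with the conservatively stated "95%" reduction:

```python
# Typical supply currents from the feature list (mA) -- illustrative, not measured.
I_DYNAMIC_MA = 100  # device actively executing floating-point operations
I_IDLE_MA = 4       # Power Management Unit has put the device in Idle Mode

# Fractional reduction in current (and, at a fixed VCC, in power) when idle.
reduction = 1 - I_IDLE_MA / I_DYNAMIC_MA
print(f"Idle-mode power reduction: {reduction:.0%}")  # prints "Idle-mode power reduction: 96%"
```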
Intel retains the right to make changes to these specifications at any time, without notice. Microcomputer Products may have minor variations to this specification known as errata.

January 1994
Order Number: 240225-009
© COPYRIGHT INTEL CORPORATION, 1995

CONTENTS

1.0 PIN ASSIGNMENT ................................ 5
  1.1 Pin Description Table ....................... 6
2.0 FUNCTIONAL DESCRIPTION ....................... 7
  2.1 Feature List ................................ 7
  2.2 Math CoProcessor Architecture ............... 7
  2.3 Power Management ............................ 8
    2.3.1 Dynamic Mode ............................ 8
    2.3.2 Idle Mode ............................... 8
  2.4 Compatibility ............................... 8
  2.5 Performance ................................. 8
3.0 PROGRAMMING INTERFACE ........................ 9
  3.1 Instruction Set ............................. 9
    3.1.1 Data Transfer Instructions .............. 9
    3.1.2 Arithmetic Instructions ................. 9
    3.1.3 Comparison Instructions ................ 10
    3.1.4 Transcendental Instructions ............ 10
    3.1.5 Load Constant Instructions ............. 10
    3.1.6 Processor Instructions ................. 11
  3.2 Register Set ............................... 11
    3.2.1 Status Word (SW) Register .............. 12
    3.2.2 Control Word (CW) Register ............. 15
    3.2.3 Data Register .......................... 16
    3.2.4 Tag Word (TW) Register ................. 16
    3.2.5 Instruction and Data Pointers .......... 16
  3.3 Data Types ................................. 18
  3.4 Interrupt Description ...................... 18
  3.5 Exception Handling ......................... 18
  3.6 Initialization ............................. 21
  3.7 Processing Modes ........................... 21
  3.8 Programming Support ........................ 21
4.0 HARDWARE SYSTEM INTERFACE ................... 21
  4.1 Signal Description ......................... 22
    4.1.1 Intel386 CPU Clock 2 (CPUCLK2) ......... 22
    4.1.2 Intel387 Math CoProcessor Clock 2 (NUMCLK2) ... 22
    4.1.3 Clocking Mode (CKM) .................... 23
    4.1.4 System Reset (RESETIN) ................. 23
    4.1.5 Processor Request (PEREQ) .............. 23
    4.1.6 Busy Status (BUSY#) .................... 23
    4.1.7 Error Status (ERROR#) .................. 23
    4.1.8 Data Pins (D15-D0) ..................... 23
    4.1.9 Write/Read Bus Cycle (W/R#) ............ 23
    4.1.10 Address Strobe (ADS#) ................. 23
    4.1.11 Bus Ready Input (READY#) .............. 24
    4.1.12 Ready Output (READYO#) ................ 24
    4.1.13 Status Enable (STEN) .................. 24
    4.1.14 Math CoProcessor Select 1 (NPS1#) ..... 24
    4.1.15 Math CoProcessor Select 2 (NPS2) ...... 24
    4.1.16 Command (CMD0#) ....................... 24
    4.1.17 System Power (VCC) .................... 24
    4.1.18 System Ground (VSS) ................... 24
  4.2 System Configuration ....................... 25
  4.3 Math CoProcessor Architecture .............. 26
    4.3.1 Bus Control Logic ...................... 26
    4.3.2 Data Interface and Control Unit ........ 26
    4.3.3 Floating Point Unit .................... 26
    4.3.4 Power Management Unit .................. 26
  4.4 Bus Cycles ................................. 26
    4.4.1 Intel387 SX Math CoProcessor Addressing ... 27
    4.4.2 CPU/Math CoProcessor Synchronization ... 27
    4.4.3 Synchronous/Asynchronous Modes ......... 27
    4.4.4 Automatic Bus Cycle Termination ........ 27
5.0 BUS OPERATION ............................... 27
  5.1 Non-pipelined Bus Cycles ................... 28
    5.1.1 Write Cycle ............................ 28
    5.1.2 Read Cycle ............................. 29
  5.2 Pipelined Bus Cycles ....................... 29
  5.3 Mixed Bus Cycles ........................... 30
  5.4 BUSY# and PEREQ Timing Relationship ........ 32
6.0 PACKAGE SPECIFICATIONS ...................... 33
  6.1 Mechanical Specifications .................. 33
  6.2 Thermal Specifications ..................... 33
7.0 ELECTRICAL CHARACTERISTICS .................. 33
  7.1 Absolute Maximum Ratings ................... 33
  7.2 D.C. Characteristics ....................... 34
  7.3 A.C. Characteristics ....................... 35
8.0 Intel387 SX MATH COPROCESSOR INSTRUCTION SET ... 41
APPENDIX A: Intel387 SX MATH COPROCESSOR COMPATIBILITY ... A-1
  A.1 8087/80287 Compatibility ................... A-1
    A.1.1 General Differences .................... A-1
    A.1.2 Exceptions ............................. A-2
APPENDIX B: COMPATIBILITY BETWEEN THE 80287 AND 8087 MATH COPROCESSOR ... B-1

FIGURES

Figure 1-1  Intel387 SX Math CoProcessor Pinout ... 5
Figure 2-1  Intel387 SX Math CoProcessor Block Diagram ... 7
Figure 3-1  Intel386 SX CPU and Intel387 Math CoProcessor Register Set ... 11
Figure 3-2  Status Word ... 12
Figure 3-3  Control Word ... 15
Figure 3-4  Tag Word Register ... 16
Figure 3-5  Instruction and Data Pointer Image in Memory, 32-Bit Protected Mode Format ... 17
Figure 3-6  Instruction and Data Pointer Image in Memory, 16-Bit Protected Mode Format ... 17
Figure 3-7  Instruction and Data Pointer Image in Memory, 32-Bit Real Mode Format ... 17
Figure 3-8  Instruction and Data Pointer Image in Memory, 16-Bit Real Mode Format ... 18
Figure 4-1  Intel386 SX CPU and Intel387 SX Math CoProcessor System Configuration ... 25
Figure 5-1  Bus State Diagram ... 28
Figure 5-2  Non-Pipelined Read and Write Cycles ... 29
Figure 5-3  Fastest Transition to and from Pipelined Cycles ... 30
Figure 5-4  Pipelined Cycles with Wait States ... 31
Figure 5-5  BUSY# and PEREQ Timing Relationship ... 32
Figure 7-1a Typical Output Valid Delay vs Load Capacitance at Max Operating Temperature ... 37
Figure 7-1b Typical Output Slew Time vs Load Capacitance at Max Operating Temperature ... 37
Figure 7-1c Maximum ICC vs Frequency ... 37
Figure 7-2  CPUCLK2/NUMCLK2 Waveform and Measurement Points for Input/Output ... 38
Figure 7-3  Output Signals ... 38
Figure 7-4  Input and I/O Signals ... 39
Figure 7-5  RESET Signal ... 39
Figure 7-6  Float from STEN ... 40
Figure 7-7  Other Parameters ... 40

TABLES

Table 1-1  Pin Cross Reference - Functional Grouping ... 5
Table 3-1  Condition Code Interpretation ... 13
Table 3-2  Condition Code Interpretation after FPREM and FPREM1 Instructions ... 14
Table 3-3  Condition Code Resulting from Comparison ... 14
Table 3-4  Condition Code Defining Operand Class ... 14
Table 3-5  Mapping Condition Codes to Intel386 CPU Flag Bits ... 14
Table 3-6  Intel387 SX Math CoProcessor Data Type Representation in Memory ... 19
Table 3-7  CPU Interrupt Vectors Reserved for Math CoProcessor ... 20
Table 3-8  Intel387 SX Math CoProcessor Exceptions ... 20
Table 4-1  Pin Summary ... 22
Table 4-2  Output Pin Status during Reset ... 23
Table 4-3  Bus Cycle Definition ... 26
Table 6-1  Thermal Resistances (°C/Watt), θJC and θJA ... 33
Table 6-2  Maximum TA at Various Airflows ... 33
Table 7-1  D.C. Specifications ... 34
Table 7-2a Timing Requirements of the Bus Interface Unit ... 35
Table 7-2b Timing Requirements of the Execution Unit ... 36
Table 7-2c Other AC Parameters ... 36
Table 8-1  Instruction Formats ... 41

1.0 PIN ASSIGNMENT

The Intel387 SX Math CoProcessor pinout as viewed from the top side of the component is shown in Figure 1-1. VCC and VSS (GND) connections must be made to multiple pins. The circuit board should include VCC and VSS planes for power distribution, and all VCC and VSS pins must be connected to the appropriate plane.

NOTE: Pins identified as N.C. should remain completely unconnected.

Figure 1-1. Intel387™ SX Math CoProcessor Pinout

Table 1-1. Pin Cross Reference - Functional Grouping

Control signals:
  BUSY# 36, PEREQ 56, ERROR# 35, ADS# 47, CMD0# 48, NPS1# 44, NPS2 45,
  STEN 40, W/R# 41, READY# 49, READYO# 57, CKM 59, CPUCLK2 54,
  NUMCLK2 53, RESETIN 51

Data pins:
  D00 19, D01 20, D02 23, D03 8, D04 7, D05 6, D06 3, D07 2,
  D08 24, D09 28, D10 29, D11 30, D12 16, D13 15, D14 12, D15 11

Power (VCC):       4, 9, 13, 22, 26, 31, 33, 37, 39, 43, 46, 50, 58, 62, 64
Ground (VSS):      5, 14, 21, 25, 27, 32, 34, 38, 42, 55, 60, 61, 63, 66
No Connect (N.C.): 1, 10, 17, 18, 52, 65, 67, 68

1.1 Pin Description Table

The following table lists a brief description of each pin on the Intel387 SX Math CoProcessor.
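For board-level design review, the functional grouping of Table 1-1 can be transcribed into a small script. The sketch below is a hypothetical checking helper (the pin numbers are taken directly from the table); it confirms that the control, data, power, ground, and no-connect groups together account for each of the 68 PLCC pins exactly once:

```python
# Pin assignments transcribed from Table 1-1 (68-pin PLCC).
CONTROL = {"BUSY#": 36, "PEREQ": 56, "ERROR#": 35, "ADS#": 47, "CMD0#": 48,
           "NPS1#": 44, "NPS2": 45, "STEN": 40, "W/R#": 41, "READY#": 49,
           "READYO#": 57, "CKM": 59, "CPUCLK2": 54, "NUMCLK2": 53, "RESETIN": 51}
DATA = {f"D{i:02d}": p for i, p in enumerate(
    [19, 20, 23, 8, 7, 6, 3, 2, 24, 28, 29, 30, 16, 15, 12, 11])}
VCC = [4, 9, 13, 22, 26, 31, 33, 37, 39, 43, 46, 50, 58, 62, 64]
VSS = [5, 14, 21, 25, 27, 32, 34, 38, 42, 55, 60, 61, 63, 66]
NC = [1, 10, 17, 18, 52, 65, 67, 68]

# Every pin 1-68 must appear in exactly one group (no duplicates, no gaps).
all_pins = sorted(list(CONTROL.values()) + list(DATA.values()) + VCC + VSS + NC)
assert all_pins == list(range(1, 69)), "every pin 1-68 must appear exactly once"
print(f"{len(all_pins)}/68 pins accounted for")  # prints "68/68 pins accounted for"
```

A check like this is a cheap safeguard against transcription errors when entering the pinout into a schematic-capture tool.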