Implementation and Performance Analysis of Many-Body Quantum Chemical Methods on the Intel R Xeon Phitm Coprocessor and NVIDIA GPU Accelerator

Total Page:16

File Type:pdf, Size:1020Kb

Implementation and Performance Analysis of Many-Body Quantum Chemical Methods on the Intel R Xeon Phitm Coprocessor and NVIDIA GPU Accelerator Implementation and Performance Analysis of Many-body Quantum Chemical Methods on the Intel R Xeon PhiTM Coprocessor and NVIDIA GPU Accelerator A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University By Bobo Shi, M.S. Graduate Program in Computer Science and Engineering The Ohio State University 2016 Master’s Examination Committee: Dr. P. Sadayappan, Advisor Dr. Louis-Noel Pouchet c Copyright by Bobo Shi 2016 Abstract CCSD(T), part of coupled cluster (CC) method, is one of the most accurate methods applicable to reasonably large molecules in computational chemistry field. The ability of an efficient parallel CCSD(T) implementation will have a significant impact on applica- tion of the high-accuracy methods. Intel R Xeon PhiTM Coprocessor and NVIDIA GPU are the most important coprocessors/accelerators which has powerful parallel computing ability due to its massively parallel many-core architecture. In this work, CCSD(T) code is implemented on Intel R Xeon PhiTM Coprocessor and NVIDIA GPU. CCSD(T) method performs tensor contractions. In order to have an efficient implemen- tation, we allocate the result tensor only on Intel Xeon Phi Coprocessor or GPU, and keep result tensor on the coprocessor/accelerator to receive a sequence of results from tensor contraction performed on the Intel Xeon Phi Coprocessor or GPU. The input tensors are offloaded from the host to the coprocessor/accelerator for each tensor contraction. After all the tensor contractions are finished, the final result is accumulated on the coprocessor/ac- celerator to avoid huge data transfer from coprocessor/accelerator to host. The tensor contraction are performed using BLAS dgemm on coprocessor/accelerator. Then the result is post-processed using a 6 dimensional loop. For Intel Xeon Phi im- plementation, OpenMP is used to bind threads to physical processing units on Xeon Phi coprocessors. The OpenMP threads affinity are tuned for Intel Xeon Phi Coprocessor to obtain best performance. For GPU, a algorithm is designed to map the 6 dimensional ii loop (post-processing) to CUDA threads. gridDim and blockDim are tuned to reach best performance. 4x and 9x ∼ 13x overall speedup is obtained for Intel Xeon Phi and GPU implementation, respectively. iii This is dedicated to my parents, my sister and my wife iv Acknowledgments Firstly, I would like to take this opportunity to express my sincere appreciation to my master thesis adviser, Dr. P. (Saday) Sadayappan for his continuous support, patience and immense knowledge throughout my master’s studies and thesis work. I met Saday when I was in his class, Introduction to Parallel Computing. His teaching skills, patience to students and broad knowledge in the fields impressed me. My major was Biophysics at that time. And the class motivated my interest in computer science, especially in high performance computing field. The most important, I started to learn more knowledge in computer science field and finally decided to get a master’s degree in computer science. Saday is the one who gives me another gate through which I can start a new adventure. I would like to give my sincere gratitude to Dr. Sriram Krishnamoorthy. He provided me direct advice about my thesis work. Sriram had lots of video chat with Saday and me. I appreciate the insightful discussion with him. He always gave good ideas, which helped my thesis work. I also want to thank the member of my thesis committee, Prof. Louis-Noel Pouchet for participating in my master exam and taking time to read my thesis. I would like to thank the lab members, Venmugil Elango, Weilei Bao, Changwan Hong, Prashant Singh Rawat and John Eisenlohr for the help during my thesis work. v Vita 2010 . B.S. Physics, Fudan University 2015 . M.S. Biophysics, Ohio State University 2010-present . .PhD student in Biophyscis Program Graduate Teaching Associate, Graduate Research Associate, NSEC fellow, Ohio State University 2015-present . .Master student in Computer Science, Ohio State University Fields of Study Major Field: Computer Science and Engineering vi Table of Contents Page Abstract . ii Dedication . iv Acknowledgments . .v Vita ........................................... vi List of Tables . ix List of Figures . .x 1. Introduction . .1 1.1 CCSD(T) method . .2 1.1.1 Algebraic structure of the CCSD(T) approach . .2 1.1.2 Non-iterative CCSD(T) correction . .4 1.2 General structure of the algorithm . .5 1.3 Introduction to Intel R Xeon PhiTM ....................9 1.3.1 Intel R Xeon PhiTM coprocessor architecture . .9 1.3.2 Offload mode . 12 1.4 GPU and CUDA . 13 1.4.1 CUDA C . 13 2. Implementation . 16 2.1 Implementation on Intel R Xeon PhiTM Coprocesser . 16 2.1.1 Dgemm and offload on Intel R Xeon PhiTM Coprocesser . 16 2.1.2 OpenMP thread optimization . 17 2.2 Implementation on CUDA . 18 2.2.1 Degmm on CUDA . 18 vii 2.2.2 Explicit implementation of LOOP1 and LOOP2 on CUDA . 19 3. Performance Analysis . 23 3.1 Performance for Intel R Xeon PhiTM Coprocessor . 23 3.1.1 Tilesize . 24 3.1.2 OMP threads affinity . 27 3.1.3 Comparison between MIC implementation and original CPU ver- sion . 29 3.2 Performance for CUDA implementation . 29 3.2.1 Tuning for gridDim and blockDim ................ 31 3.2.2 Comparison between CUDA implementation and the original CPU version . 32 3.3 Discussion . 34 4. Conclusion and Future work . 38 viii List of Tables Table Page 1.1 The CUDA keywords for functions declaration. 13 1.2 The CUDA keywords for variable declaration. 14 3.1 Coprocessor Architectural comparison of Intel R Xeon PhiTM Coprocessors 3120A/P and 5110p . 23 3.2 Doubles part dgemm calculation time and data transfer time for tilesize of 40 and 50 for MIC USE 2MB BUFFERS = 16K, 32K and 64K. 26 3.3 Wall time (second) for 4 main parts of CCSD(T) for CPU and GPU for different tilesize. 33 3.4 Wall time (second) for singles dgemm and doubles dgemm of CCSD(T) for multicores for different tilesize. 35 3.5 Wall time (second) for LOOP1 and LOOP2 of CCSD(T) for multicores for different tilesize. 35 3.6 GFlops for 4 parts for CPU, MIC and GPU. 37 3.7 Cache miss of LOOP1 and LOOP2 for CPU and GPU . 37 ix List of Figures Figure Page 1.1 Diagram to calculate C[i,j]. Matrix A and B are symmetric. Blue block shows the value we actually use in the calculation. Red shows what the regular matrix multiplication algorithm should use. Because of the sym- metry of A and B, we are able to store only half of elements of A and B to save memory and, as a result, blue blocks are used instead. .6 1.2 Diagram to calculate C[j,i]. Matrix A and B are symmetric. Blue blocks show the value we actually use in the calculation. Red blocks show what the regular matrix multiplication algorithm should use. Because of the symmetry of A and B, we can store only half of elements of A and B to save memory. As a result, blue blocks are used instead. After we obtain C[i,j] and C[j,i], the symmetrized result, C*[i,j] can be obtained by 1 C*[i,j] = 2 (C[i,j] + C[j,i])....................7 1.3 High-level architecture of the Intel R Xeon PhiTM coprocessor with cores, and ring interconnect [1] . 10 1.4 Architecture overview of an Intel MIC architecture core [2] . 11 1.5 Grid of Thread Blocks. 15 2.1 Scheme of OpenMP thread affinity control for compact, scatter and balanced. compact: pack threads close to each other; scatter: Round- Robin threads to cores; balanced: keep OMP thread ids consecutive (MIC only). 18 3.1 Total time for dgemm calculation and data transfer for singles and doubles part of a small system (2 ozone) at different tilesize. 24 x 3.2 MIC offload dgemm data transfer speed vs. data size for an example dgemm code. Red dot indicates the data transfer speed for nwchem mic offload doubles dgemm tilesize = 40 case. 26 3.3 OpenMP threads affinity experiment shows the performance depend on number of OpenMP threads and threads affinity. This experiment is done on Intel R Xeon PhiTM Coprocessor 5110P. System with 3 ozone is calculated. 27 3.4 The performance between CPU and MIC with several threads affinity for four main parts of the CCSD(T): singles part dgemm, doubles part dgemm, LOOP1 and LOOP2................................ 30 3.5 Overall performance comparison between CPU and MIC for different tile- sizes. 31 3.6 Tuning of NB, NT for LOOP1 for tilesize = 30. NB is number of blocks. NT is number of threads per block. x axis is NB, y axis is wall time (second) for LOOP1................................... 32 3.7 Tuning of NB, NT for LOOP2 for tilesize = 30. NB is number of blocks. NT is number of threads per block. x axis is NB, y axis is wall time (second) for LOOP2................................... 33 3.8 Overall CCSD(T) wall time in second for the original CPU version and implementation of CPU version for different tilesize. 34 3.9 Overall CCSD(T) wall time in second for multi-CPU-cores for different tilesize. 36 xi Chapter 1: Introduction Computational chemistry always require high accurate methods to describe the instanta- neous interactions between electrons or the correlation effects in molecules. From Hartre- Fock [3], Density functional theory [4] to Coupled Cluster [5] and Configuration inter- action [6] method, each new method improves the accuracy of the computational chem- istry and bridge the gap between theory and experiments.
Recommended publications
  • Intel® Architecture Instruction Set Extensions and Future Features Programming Reference
    Intel® Architecture Instruction Set Extensions and Future Features Programming Reference 319433-037 MAY 2019 Intel technologies features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifica- tions. Current characterized errata are available on request. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Intel does not guarantee the availability of these interfaces in any future product. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1- 800-548-4725, or by visiting http://www.intel.com/design/literature.htm. Intel, the Intel logo, Intel Deep Learning Boost, Intel DL Boost, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S.
    [Show full text]
  • X86 Platform Coprocessor/Prpmc (PC on a PMC)
    x86 Platform Coprocessor/PrPMC (PC on a PMC) 32b/33MHz PCI bus PN1/PN2 +5V to Kbd/Mouse/USB power Vcore Power +3.3V +2.5v Conditioning +3.3VIO 1K100 FPGA & 512KB 10/100TX TMS2250 PCI BIOS/PLA Ethernet RJ45 Compact to PCI bridge Flash Memory PLA I/O Flash site 8 32b/33MHz Internal PCI bus Analog SVGA Video Pwr Seq AMD SC2200 Signal COM 1 (RXD/TXD only) IDE "GEODE (tm)" Conditioning COM 2 (RXD/TXD only) Processor USB Port 1 Rear I/O PN4 I/O Rear 64 USB Port 2 LPC Keyboard/Mouse Floppy 36 Pin 36 Pin Connector PC87364 128MB SDRAM Status LEDs Super I/O (16MWx64b) PC Spkr This board is a Processor PMC (PrPMC) implementing A PrPMC implements a TMS2250 PCI-to-PCI bridge to an x86-based PC architecture on a single-wide, stan- the host, and a “Coprocessor” cofniguration implements dard height PMC form factor. This delivers a complete back-to-back 16550 compatible COM ports. The “Mon- x86 platform with comprehensive I/O support, a 10/100- arch” signal selects either PrPMC or Co-processor TX Ethernet interface, and disk access (both floppy and mode. IDE drives). The board features an on-board hard drive using Compact Flash. A 512Kbyte FLASH memory holds the 1K100 PLA im- age that is automatically loaded on power up. It also At the heart of the design is an Advanced Micro De- contains the General Software Embedded BIOS 2000™ vices SC2200 GEODE™ processor that integrates video, BIOS code that is included with each board.
    [Show full text]
  • Convey Overview
    THE WORLD’S FIRST HYBRID-CORE COMPUTER. CONVEY HYBRID-CORE COMPUTING Hybrid-core Computing Convey HC-1 High Performance of application- specific hardware Heterogenous solutions • can be much more efficient Performance/ • still hard to program Programmability and Power efficiency deployment ease of an x86 server Application Multicore solutions • don’t always scale well • parallel programming is hard Low Difficult Ease of Deployment Easy 1/22/2010 3 Hybrid-Core Computing Application-Specific Personalities Applications • Extend the x86 instruction set • Implement key operations in Convey Compilers hardware Life Sciences x86-64 ISA Custom ISA CAE Custom Financial Oil & Gas Shared Virtual Memory Cache-coherent, shared memory • Both ISAs address common memory *ISA: Instruction Set Architecture 7/12/2010 4 HC-1 Hardware PCI I/O FPGA FPGA Intel Personalities Chipset FPGA FPGA 8 GB/s 80 GB/s Memory Memory Cache Coherent, Shared Virtual Memory 1/22/2010 5 Using Personalities C/C++ Fortran • Personalities are user specifies reloadable personality at instruction sets Convey Software compile time Development Suite • Compiler instruction generates x86 descriptions and coprocessor instructions from Hybrid-Core Executable P ANSI standard x86-64 and Coprocessor Personalities C/C++ & Fortran Instructions • Executable can run on x86 nodes FPGA Convey HC-1 or Convey Hybrid- bitfiles Core nodes Intel x86 Coprocessor personality loaded at runtime by OS 1/22/2010 6 SYSTEM ARCHITECTURE HC-1 Architecture “Commodity” Intel Server Convey FPGA-based coprocessor Direct
    [Show full text]
  • A Superscalar Out-Of-Order X86 Soft Processor for FPGA
    A Superscalar Out-of-Order x86 Soft Processor for FPGA Henry Wong University of Toronto, Intel [email protected] June 5, 2019 Stanford University EE380 1 Hi! ● CPU architect, Intel Hillsboro ● Ph.D., University of Toronto ● Today: x86 OoO processor for FPGA (Ph.D. work) – Motivation – High-level design and results – Microarchitecture details and some circuits 2 FPGA: Field-Programmable Gate Array ● Is a digital circuit (logic gates and wires) ● Is field-programmable (at power-on, not in the fab) ● Pre-fab everything you’ll ever need – 20x area, 20x delay cost – Circuit building blocks are somewhat bigger than logic gates 6-LUT6-LUT 6-LUT6-LUT 3 6-LUT 6-LUT FPGA: Field-Programmable Gate Array ● Is a digital circuit (logic gates and wires) ● Is field-programmable (at power-on, not in the fab) ● Pre-fab everything you’ll ever need – 20x area, 20x delay cost – Circuit building blocks are somewhat bigger than logic gates 6-LUT 6-LUT 6-LUT 6-LUT 4 6-LUT 6-LUT FPGA Soft Processors ● FPGA systems often have software components – Often running on a soft processor ● Need more performance? – Parallel code and hardware accelerators need effort – Less effort if soft processors got faster 5 FPGA Soft Processors ● FPGA systems often have software components – Often running on a soft processor ● Need more performance? – Parallel code and hardware accelerators need effort – Less effort if soft processors got faster 6 FPGA Soft Processors ● FPGA systems often have software components – Often running on a soft processor ● Need more performance? – Parallel
    [Show full text]
  • Exploiting Free Silicon for Energy-Efficient Computing Directly
    Exploiting Free Silicon for Energy-Efficient Computing Directly in NAND Flash-based Solid-State Storage Systems Peng Li Kevin Gomez David J. Lilja Seagate Technology Seagate Technology University of Minnesota, Twin Cities Shakopee, MN, 55379 Shakopee, MN, 55379 Minneapolis, MN, 55455 [email protected] [email protected] [email protected] Abstract—Energy consumption is a fundamental issue in today’s data A well-known solution to the memory wall issue is moving centers as data continue growing dramatically. How to process these computing closer to the data. For example, Gokhale et al [2] data in an energy-efficient way becomes more and more important. proposed a processor-in-memory (PIM) chip by adding a processor Prior work had proposed several methods to build an energy-efficient system. The basic idea is to attack the memory wall issue (i.e., the into the main memory for computing. Riedel et al [9] proposed performance gap between CPUs and main memory) by moving com- an active disk by using the processor inside the hard disk drive puting closer to the data. However, these methods have not been widely (HDD) for computing. With the evolution of other non-volatile adopted due to high cost and limited performance improvements. In memories (NVMs), such as phase-change memory (PCM) and spin- this paper, we propose the storage processing unit (SPU) which adds computing power into NAND flash memories at standard solid-state transfer torque (STT)-RAM, researchers also proposed to use these drive (SSD) cost. By pre-processing the data using the SPU, the data NVMs as the main memory for data-intensive applications [10] to that needs to be transferred to host CPUs for further processing improve the system energy-efficiency.
    [Show full text]
  • CUDA What Is GPGPU
    What is GPGPU ? • General Purpose computation using GPU in applications other than 3D graphics CUDA – GPU accelerates critical path of application • Data parallel algorithms leverage GPU attributes – Large data arrays, streaming throughput Slides by David Kirk – Fine-grain SIMD parallelism – Low-latency floating point (FP) computation • Applications – see //GPGPU.org – Game effects (FX) physics, image processing – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting Previous GPGPU Constraints CUDA • Dealing with graphics API per thread per Shader Input Registers • “Compute Unified Device Architecture” – Working with the corner cases per Context • General purpose programming model of the graphics API Fragment Program Texture – User kicks off batches of threads on the GPU • Addressing modes Constants – GPU = dedicated super-threaded, massively data parallel co-processor – Limited texture size/dimension Temp Registers • Targeted software stack – Compute oriented drivers, language, and tools Output Registers • Shader capabilities • Driver for loading computation programs into GPU FB Memory – Limited outputs – Standalone Driver - Optimized for computation • Instruction sets – Interface designed for compute - graphics free API – Data sharing with OpenGL buffer objects – Lack of Integer & bit ops – Guaranteed maximum download & readback speeds • Communication limited – Explicit GPU memory management – Between pixels – Scatter a[i] = p 1 Parallel Computing on a GPU Extended C • Declspecs • NVIDIA GPU Computing Architecture – global, device, shared, __device__ float filter[N]; – Via a separate HW interface local, constant __global__ void convolve (float *image) { – In laptops, desktops, workstations, servers GeForce 8800 __shared__ float region[M]; ... • Keywords • 8-series GPUs deliver 50 to 200 GFLOPS region[threadIdx] = image[i]; on compiled parallel C applications – threadIdx, blockIdx • Intrinsics __syncthreads() – __syncthreads ..
    [Show full text]
  • Multiprocessing Contents
    Multiprocessing Contents 1 Multiprocessing 1 1.1 Pre-history .............................................. 1 1.2 Key topics ............................................... 1 1.2.1 Processor symmetry ...................................... 1 1.2.2 Instruction and data streams ................................. 1 1.2.3 Processor coupling ...................................... 2 1.2.4 Multiprocessor Communication Architecture ......................... 2 1.3 Flynn’s taxonomy ........................................... 2 1.3.1 SISD multiprocessing ..................................... 2 1.3.2 SIMD multiprocessing .................................... 2 1.3.3 MISD multiprocessing .................................... 3 1.3.4 MIMD multiprocessing .................................... 3 1.4 See also ................................................ 3 1.5 References ............................................... 3 2 Computer multitasking 5 2.1 Multiprogramming .......................................... 5 2.2 Cooperative multitasking ....................................... 6 2.3 Preemptive multitasking ....................................... 6 2.4 Real time ............................................... 7 2.5 Multithreading ............................................ 7 2.6 Memory protection .......................................... 7 2.7 Memory swapping .......................................... 7 2.8 Programming ............................................. 7 2.9 See also ................................................ 8 2.10 References .............................................
    [Show full text]
  • 2Nd Generation Intel® Core™ Processor Family Mobile with ECC
    2nd Generation Intel® Core™ Processor Family Mobile with ECC Datasheet Addendum May 2012 Revision 002 Document Number: 324855-002 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
    [Show full text]
  • Comparing the Power and Performance of Intel's SCC to State
    Comparing the Power and Performance of Intel’s SCC to State-of-the-Art CPUs and GPUs Ehsan Totoni, Babak Behzad, Swapnil Ghike, Josep Torrellas Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA E-mail: ftotoni2, bbehza2, ghike2, [email protected] Abstract—Power dissipation and energy consumption are be- A key architectural challenge now is how to support in- coming increasingly important architectural design constraints in creasing parallelism and scale performance, while being power different types of computers, from embedded systems to large- and energy efficient. There are multiple options on the table, scale supercomputers. To continue the scaling of performance, it is essential that we build parallel processor chips that make the namely “heavy-weight” multi-cores (such as general purpose best use of exponentially increasing numbers of transistors within processors), “light-weight” many-cores (such as Intel’s Single- the power and energy budgets. Intel SCC is an appealing option Chip Cloud Computer (SCC) [1]), low-power processors (such for future many-core architectures. In this paper, we use various as embedded processors), and SIMD-like highly-parallel archi- scalable applications to quantitatively compare and analyze tectures (such as General-Purpose Graphics Processing Units the performance, power consumption and energy efficiency of different cutting-edge platforms that differ in architectural build. (GPGPUs)). These platforms include the Intel Single-Chip Cloud Computer The Intel SCC [1] is a research chip made by Intel Labs (SCC) many-core, the Intel Core i7 general-purpose multi-core, to explore future many-core architectures. It has 48 Pentium the Intel Atom low-power processor, and the Nvidia ION2 (P54C) cores in 24 tiles of two cores each.
    [Show full text]
  • World's First High- Performance X86 With
    World’s First High- Performance x86 with Integrated AI Coprocessor Linley Spring Processor Conference 2020 April 8, 2020 G Glenn Henry Dr. Parviz Palangpour Chief AI Architect AI Software Deep Dive into Centaur’s New x86 AI Coprocessor (Ncore) • Background • Motivations • Constraints • Architecture • Software • Benchmarks • Conclusion Demonstrated Working Silicon For Video-Analytics Edge Server Nov, 2019 ©2020 Centaur Technology. All Rights Reserved Centaur Technology Background • 25-year-old startup in Austin, owned by Via Technologies • We design, from scratch, low-cost x86 processors • Everything to produce a custom x86 SoC with ~100 people • Architecture, logic design and microcode • Design, verification, and layout • Physical build, fab interface, and tape-out • Shipped by IBM, HP, Dell, Samsung, Lenovo… ©2020 Centaur Technology. All Rights Reserved Genesis of the AI Coprocessor (Ncore) • Centaur was developing SoC (CHA) with new x86 cores • Targeted at edge/cloud server market (high-end x86 features) • Huge inference markets beyond hyperscale cloud, IOT and mobile • Video analytics, edge computing, on-premise servers • However, x86 isn’t efficient at inference • High-performance inference requires external accelerator • CHA has 44x PCIe to support GPUs, etc. • But adds cost, power, another point of failure, etc. ©2020 Centaur Technology. All Rights Reserved Why not integrate a coprocessor? • Very low cost • Many components already on SoC (“free” to Ncore) Caches, memory, clock, power, package, pins, busses, etc. • There often is “free” space on complex SOCs due to I/O & pins • Having many high-performance x86 cores allows flexibility • The x86 cores can do some of the work, in parallel • Didn’t have to implement all strange/new functions • Allows fast prototyping of new things • For customer: nothing extra to buy ©2020 Centaur Technology.
    [Show full text]
  • Travelmate P6 Series Product Sheet
    Acer recommends Windows 10 Pro. TravelMate P6 Wi-Fi 6 Microsoft Teams SPECIFICATIONS TravelMate P6 P614-51-G2_51G-G2 2 1 3 12 13 14 15 16 4 5 6 7 8 9 10 11 17 Product views 1. Infrared LED 5. HDMI® port 9. Smart Card reader slot (optional) 14. microCD card reader (optional) 6. USB port 10. Keyboard 15. USB port 2. Webcam 7. USB Type-C /Thunderbolt 11. Touchpad 16. Ethernet port 3. 14" display 3 port 12. Power button with fingerprint 17. Kensington lock slot 4. DC-in jack 8. Headset/speaker jack reader 13. SIM card slot (optional) Operating system1, Windows 10 Pro 64-bit (Acer recommends Windows 10 Pro for business.) Windows 10 Home 64-bit 2 CPU and chipset1 Intel® CoreTM i7-10510U processor Intel® CoreTM i5-10210U processor Memory1, 3, 4, 5 Dual-channel DDR4 SDRAM support up to 24 GB of DDR4 system memory, 8 GB / 4 GB of onboard DDR4 system memory Display6 14.0" display with IPS (In-Plane Switching) technology, Full HD 1920 x 1080, high-brightness (300nits) Acer ComfyViewTM LED-backlit TFT LCD 16:9 aspect ratio, 100% sRGB color gamut Wide viewing angle up to 170 degrees Stylish and Slim design Mercury free, environment friendly Graphics Intel® UHD Graphics 620 (P614-51-G2 only) NVIDIA® GeForce® MX250 (P614-51G-G2 only) Audio Four built-in microphones Built-in discrete smart amplifiers to deliver powerful sound Compatible with Cortana with Voice for up to 4 meters Acer TrueHarmony technology for lower distortion, wider frequency range, headphone-like audio and powerful sound Certified for Microsoft Teams Storage1, 7, Solid state
    [Show full text]
  • The Intel X86 Microarchitectures Map Version 2.0
    The Intel x86 Microarchitectures Map Version 2.0 P6 (1995, 0.50 to 0.35 μm) 8086 (1978, 3 µm) 80386 (1985, 1.5 to 1 µm) P5 (1993, 0.80 to 0.35 μm) NetBurst (2000 , 180 to 130 nm) Skylake (2015, 14 nm) Alternative Names: i686 Series: Alternative Names: iAPX 386, 386, i386 Alternative Names: Pentium, 80586, 586, i586 Alternative Names: Pentium 4, Pentium IV, P4 Alternative Names: SKL (Desktop and Mobile), SKX (Server) Series: Pentium Pro (used in desktops and servers) • 16-bit data bus: 8086 (iAPX Series: Series: Series: Series: • Variant: Klamath (1997, 0.35 μm) 86) • Desktop/Server: i386DX Desktop/Server: P5, P54C • Desktop: Willamette (180 nm) • Desktop: Desktop 6th Generation Core i5 (Skylake-S and Skylake-H) • Alternative Names: Pentium II, PII • 8-bit data bus: 8088 (iAPX • Desktop lower-performance: i386SX Desktop/Server higher-performance: P54CQS, P54CS • Desktop higher-performance: Northwood Pentium 4 (130 nm), Northwood B Pentium 4 HT (130 nm), • Desktop higher-performance: Desktop 6th Generation Core i7 (Skylake-S and Skylake-H), Desktop 7th Generation Core i7 X (Skylake-X), • Series: Klamath (used in desktops) 88) • Mobile: i386SL, 80376, i386EX, Mobile: P54C, P54LM Northwood C Pentium 4 HT (130 nm), Gallatin (Pentium 4 Extreme Edition 130 nm) Desktop 7th Generation Core i9 X (Skylake-X), Desktop 9th Generation Core i7 X (Skylake-X), Desktop 9th Generation Core i9 X (Skylake-X) • Variant: Deschutes (1998, 0.25 to 0.18 μm) i386CXSA, i386SXSA, i386CXSB Compatibility: Pentium OverDrive • Desktop lower-performance: Willamette-128
    [Show full text]