Instruction Set Innovations for Convey's HC-1 Computer

Instruction Set Innovations for Convey's HC-1 Computer THE WORLD’S FIRST HYBRID-CORE COMPUTER. Hot Chips Conference 2009 [email protected] Introduction to Convey Computer • Company Status – Second round venture based startup company – Product beta systems are at customer sites – Currently staffing at 36 people – Located in Richardson, Texas • Investors – Four Venture Capital Investors • Interwest Partners (Menlo Park) • CenterPoint Ventures (Dallas) • Rho Ventures (New York) • Braemar Energy Ventures (Boston) – Two Industry Investors • Intel Capital • Xilinx Presentation Outline • Overview of HC-1 Computer • Instruction Set Innovations • Application Examples Page 3 Hot Chips Conference 2009 What is a Hybrid-Core Computer ? A hybrid-core computer improves application performance by combining an x86 processor with hardware that implements application-specific instructions. ANSI Standard Applications C/C++/Fortran Convey Compilers x86 Coprocessor Instructions Instructions Intel® Processor Hybrid-Core Coprocessor Oil & Gas& Oil Financial Sciences Custom CAE Application-Specific Personalities Cache-coherent shared virtual memory Page 4 Hot Chips Conference 2009 What Is a Personality? • A personality is a reloadable set of instructions that augment x86 application the x86 instruction set Processor specific – Applicable to a class of applications instructions or specific to a particular code • Each personality is a set of files that includes: – The bits loaded into the Coprocessor – Information used by the Convey compiler • List of available instructions • Performance characteristics for compiler optimization – Information used by the GNU Debugger (GDB) • Instruction disassembly information • Information allowing machine state formatting and modification Page 5 Hot Chips Conference 2009 How do you Program using Personalities? C/C++ Fortran95 • Program in ANSI standard C/C++ or Fortran Common Optimizer • Unified compiler generates x86 and coprocessor instructions Personality Convey Intel® 64 • Compiler generates x86 code for Information Vectorizer& Optimizer full application and coprocessor Code & Code Generator Generator code where data parallelism exists • Run time checks determine if Linker sufficient parallelism exists to use coprocessor executable Intel® 64 code • Executable can run on x86 nodes or on Convey Hybrid-Core nodes Coprocessor code Page 6 Hot Chips Conference 2009 Convey System Architecture Intel x86 Linux ecosystem Commodity Motherboard Coprocessor PCIe Host Application Inter Engines Intel Intel FSB face Processor Chipset Coherent Memory Controller + TLB 4 DIMM 16 DIMM channels (80GB/sec) channels ECC memory Shared physical and virtual memory provides coherent programming model Page 7 Hot Chips Conference 2009 Inside the Coprocessor Host interface and memory Application Engines controllers implemented Scalar Instruction by coprocessor Processing Host Interface infrastructure Fetch/Decode Crossbar Four Xilinx V5LX330 FPGAs Implemented with Xilinx V5LX110 FPGAs Memory Memory Memory Memory Memory Memory Memory Memory Controller Implemented Controller Controller Controller Controller Controller Controller Controller with Xilinx V5LX155 FPGAs 16 DDR2 memory channels Standard or Scatter-Gather DIMMs 80GB/sec throughput Page 8 Hot Chips Conference 2009 High Performance Memory Subsystem Host Coprocessor • Mixed access mode PCIe Host Application North – cache line or quad word x86 Intf Engines Bridge • x86 cache coherency support Memory Controllers – without impacting 4 DIMMs coprocessor bandwidth 16 DIMMs, 80 GB/s • Prime number interleave (31) – full memory bandwidth at all strides (except a multiple of prime number) • Convey designed Scatter / Gather DIMMs or Standard DIMMs – Scatter / Gather DIMMs provide 8-byte accesses at full bandwidth • Large virtual page support (4MB) – entire virtual address of process is resident within the coprocessor TLBs • Page Coloring within OS – interleave pattern to span entire virtual address space of application Page 9 Hot Chips Conference 2009 How is the Coprocessor Started? Start Sequence: 1. x86 writes “Start Message” to a x86 predefined 64-byte memory line. Host Intf Cache Ownership for line is moved from Cache Application coprocessor cache to x86 cache. Engines 2. x86 writes start bit in a second Memory Controller memory line. Start Message Personality Mechanism works with: 3. Coprocessor detects “Start Bit” - Host thread time sharing Start Addr. memory line has been invalidated - Application paging Param 1 from its cache and immediately - Front Side Bus or QPI Param 2 retrieves the “Start Message” and Param 3 “Start Bit” memory lines. Application can start Param 4 coprocessor Param 5 4. Coprocessor checks that the synchronously, or Param 6 “Start Bit” is set and starts asynchronously coprocessor. Page 10 Hot Chips Conference 2009 A view from the Inside • Top half of platform is the Coprocessor • Bottom half is the Intel Motherboard 2U Platform Page 11 Hot Chips Conference 2009 Architecture for a Dynamically Coprocessor Host Application Loadable Instruction Set Intf Engines Memory Controllers • Three classes of instructions are defined with separate machine state for each 16 DIMMs Instruction Class / Examples Machine Pre- Location State defined? Executed Address A Registers Yes Scalar CALL Routine Processor Scalar S Registers Yes Scalar S5 2.0 * S6 Processor Application Engine (SIMD) Defined by No Application V2 S6 + V4 Personality Engines • Address and Scalar instructions are available to all personalities Note: The Scalar Processor is implemented within the Host Interface Page 12 Hot Chips Conference 2009 How does the Scalar Processor Coprocessor Interact with AE Instructions? Host Application Problem: The Scalar Processor functionality is fixed Intf Engines (I.e. does not depend on the personality loaded Memory Controllers on the AEs). How does the Scalar Processor know how to interact with user defined AE instructions? 16 DIMMs Solution: The Application Engine instruction encoding space is partitioned into regions. Each region has predefined interactions with the Scalar Processor machine state (A and/or S Registers). Example Application Instruction Operation A/S Register Interaction Engine Instructions V4 S2 + V3 Add a scalar (S2) to each Send S2 to AE element of a vector (V3) and write the result to a vector (V4). A5 VL Move vector length (VL) to an Receive VL from AE address register (A5). • A personality must select encodings for AE instructions based on the interactions between A/S registers and the AE instructions. Page 13 Hot Chips Conference 2009 A Personality’s Machine State can be Saved/Restored to Memory • Each personality defines its own machine state – Machine state is simply the instruction visible register state. – The coprocessor’s machine state is only saved to memory at instruction boundaries (I.e. once all executing instructions have completed). • A common mechanism is used to save / restore machine state for all personalities – All personality machine state is mapped to AEG registers (Application Engine Generic registers). – A few pre-defined AE instructions allow access to AEG registers for the purpose of machine state save and restore to/from memory. – The following predefined AE instructions allow user defined machine state to be saved/restored: MOV AEGcnt, At ; Obtain the number of defined AEG registers MOV Sa,Ab,AEG ; Restore an AEG[Ab] register from an S register MOV AEG,Ab,St ; Save an AEG[Ab] register to an S register – The operating system has one machine state save / restore routine used by all personalities. Page 14 Hot Chips Conference 2009 Machine State in Memory enables Application Debugging • Coprocessor machine state can be accessed (read and written) by the GNU Debugger (GDB). • GDB can perform single step, run and stop at breakpoint instruction. Page 15 Hot Chips Conference 2009 Single Precision Vector Personality 4 Application Engine FPGAs / 32 Function Pipes vector elements distributed across function pipes Load-store vector architecture with modern Dispatch Dispatch Dispatch Dispatch latency-hiding features Optimized for Signal Processing applications Crossbar Crossbar Crossbar Crossbar 1-of-32 Function Pipes A function pipe is the unit Vector Register File of data path replication Same instructions sent to all function pipes (SIMD) Add Each function pipe supports: FMA FMA FMA FMA Misc Load Store − Multiple functional units Integer Logical Convert Add − Out-of-order execution − Register renaming To Crossbar FMA - Fused Multiply Add Page 16 Hot Chips Conference 2009 Single Precision Vector Personality FFT Performance 1-d FFTs 2-d FFTs 70 70 60 60 50 50 40 40 p/s p/s o o l l f 30 f 30 G G Convey HC-1 20 20 Nehalem + MKL 10 10 Convey HC-1 Nehalem + MKL, 8 cores 0 0 1K 4K 1M 4M 1K 2K 4K 16K 64K 16M 512 64M 128 256 256K Number of Points Number of Points Page 17 Hot Chips Conference 2009 Financial Vector Personality 4 Application Engines / 32 Function Pipes vector elements distributed across function pipes Pairs of single precision functional units replaced Dispatch Dispatch Dispatch Dispatch by double precision units Crossbar Crossbar Crossbar Crossbar 1-of-32 Function Pipes Vector Register File Functional units for log, exp, random number generation Supported by the compiler as Misc Load Store vector intrinsics Integer Logical DP Add DP Convert DP FMA DP FMA Exp, Log Random Allows compile and run model for financial analytic To Crossbar applications Page 18 Hot Chips Conference 2009 Protein Sequencing Personality 4 Application

Instruction Set Innovations for Convey's HC-1 Computer

Data-Level Parallelism

Data Storage the CPU-Memory

X86 Platform Coprocessor/Prpmc (PC on a PMC)

Convey Overview

Lecture 14: Gpus

Exploiting Free Silicon for Energy-Efficient Computing Directly

CUDA What Is GPGPU

Cache-Aware Roofline Model: Upgrading the Loft

ARM-Architecture Simulator Archulator

Comparing the Power and Performance of Intel's SCC to State

Vector Vs. Scalar Processors: a Performance Comparison Using a Set of Computational Science Benchmarks

Multi-Tier Caching Technology™