<<

Instruction Set Innovations for Convey's HC-1

THE WORLD’S FIRST HYBRID-CORE COMPUTER.

Hot Chips Conference 2009

[email protected] Introduction to Convey Computer

• Company Status – Second round venture based startup company – Product beta systems are at customer sites – Currently staffing at 36 people – Located in Richardson, Texas

• Investors – Four Venture Capital Investors • Interwest Partners (Menlo Park) • CenterPoint Ventures (Dallas) • Rho Ventures (New York) • Braemar Energy Ventures (Boston) – Two Industry Investors • Capital • Presentation Outline

• Overview of HC-1 Computer • Instruction Set Innovations • Application Examples

Page 3 Hot Chips Conference 2009 What is a Hybrid-Core Computer ? A hybrid-core computer improves application performance by combining an with hardware that implements application-specific instructions.

ANSI Standard Applications C/C++/Fortran Convey Compilers x86 Instructions Instructions Intel® Processor Hybrid-Core Coprocessor Oil & Gas Financial Sciences Custom CAE

Application-Specific Personalities

Cache-coherent shared

Page 4 Hot Chips Conference 2009 What Is a Personality?

• A personality is a reloadable set of instructions that augment x86 application the x86 instruction set Processor specific – Applicable to a class of applications instructions or specific to a particular code

• Each personality is a set of files that includes: – The bits loaded into the Coprocessor – Information used by the Convey compiler • List of available instructions • Performance characteristics for compiler optimization – Information used by the GNU Debugger (GDB) • Instruction disassembly information • Information allowing machine state formatting and modification

Page 5 Hot Chips Conference 2009 How do you Program using Personalities?

C/C++ Fortran95 • Program in ANSI standard C/C++ or Fortran Common Optimizer • Unified compiler generates x86 and coprocessor instructions Personality Convey Intel® 64 • Compiler generates x86 code for Information Vectorizer& Optimizer full application and coprocessor Code & Code Generator Generator code where exists • Run time checks determine if Linker sufficient parallelism exists to use coprocessor executable Intel® 64 code • Executable can run on x86 nodes or on Convey Hybrid-Core nodes Coprocessor code

Page 6 Hot Chips Conference 2009 Convey System Architecture

Intel x86 Linux ecosystem

Commodity Motherboard Coprocessor

PCIe Host Application Inter Engines Intel Intel FSB face Processor Chipset Coherent + TLB

4 DIMM 16 DIMM channels (80GB/sec) channels ECC memory

Shared physical and virtual memory provides coherent programming model

Page 7 Hot Chips Conference 2009 Inside the Coprocessor

Host interface and memory Application Engines controllers implemented Scalar Instruction by coprocessor Processing Host Interface infrastructure Fetch/Decode Crossbar Four Xilinx V5LX330 FPGAs

Implemented with Xilinx V5LX110 FPGAs Memory Memory Memory Memory Memory Memory Memory Memory Controller Implemented Controller Controller Controller Controller Controller Controller Controller with Xilinx V5LX155 FPGAs 16 DDR2 memory channels Standard or Scatter-Gather DIMMs 80GB/sec throughput Page 8 Hot Chips Conference 2009 High Performance Memory Subsystem Host Coprocessor • Mixed access mode PCIe Host Application North – line or quad word x86 Intf Engines Bridge • x86 cache coherency support Memory Controllers – without impacting 4 DIMMs coprocessor bandwidth 16 DIMMs, 80 GB/s • Prime number interleave (31) – full memory bandwidth at all strides (except a multiple of prime number) • Convey designed Scatter / Gather DIMMs or Standard DIMMs – Scatter / Gather DIMMs provide 8-byte accesses at full bandwidth • Large virtual page support (4MB) – entire virtual address of is resident within the coprocessor TLBs • Page Coloring within OS – interleave pattern to span entire virtual of application

Page 9 Hot Chips Conference 2009 How is the Coprocessor Started?

Start Sequence: 1. x86 writes “Start Message” to a x86 predefined 64-byte memory line. Host Intf Cache Ownership for line is moved from Cache Application coprocessor cache to x86 cache. Engines

2. x86 writes start bit in a second Memory Controller memory line. Start Message Personality Mechanism works with: 3. Coprocessor detects “Start Bit” - Host time sharing Start Addr. memory line has been invalidated - Application paging Param 1 from its cache and immediately - Front Side or QPI Param 2 retrieves the “Start Message” and Param 3 “Start Bit” memory lines. Application can start Param 4 coprocessor Param 5 4. Coprocessor checks that the synchronously, or Param 6 “Start Bit” is set and starts asynchronously coprocessor. Page 10 Hot Chips Conference 2009 A view from the Inside • Top half of platform is the Coprocessor • Bottom half is the Intel Motherboard

2U Platform

Page 11 Hot Chips Conference 2009 Architecture for a Dynamically Coprocessor Host Application Loadable Instruction Set Intf Engines Memory Controllers • Three classes of instructions are defined with separate machine state for each 16 DIMMs Instruction Class / Examples Machine Pre- Location State defined? Executed Address A Registers Yes Scalar CALL Routine Processor Scalar S Registers Yes Scalar S5  2.0 * S6 Processor Application Engine (SIMD) Defined by No Application V2  S6 + V4 Personality Engines • Address and Scalar instructions are available to all personalities Note: The is implemented within the Host Interface

Page 12 Hot Chips Conference 2009 How does the Scalar Processor Coprocessor Interact with AE Instructions? Host Application Problem: The Scalar Processor functionality is fixed Intf Engines (I.e. does not depend on the personality loaded Memory Controllers on the AEs). How does the Scalar Processor know how to interact with user defined AE instructions? 16 DIMMs Solution: The Application Engine instruction encoding space is partitioned into regions. Each region has predefined interactions with the Scalar Processor machine state (A and/or S Registers).

Example Application Instruction Operation A/S Register Interaction Engine Instructions V4  S2 + V3 Add a scalar (S2) to each Send S2 to AE element of a vector (V3) and write the result to a vector (V4). A5  VL Move vector length (VL) to an Receive VL from AE address register (A5).

• A personality must select encodings for AE instructions based on the interactions between A/S registers and the AE instructions.

Page 13 Hot Chips Conference 2009 A Personality’s Machine State can be Saved/Restored to Memory • Each personality defines its own machine state – Machine state is simply the instruction visible register state. – The coprocessor’s machine state is only saved to memory at instruction boundaries (I.e. once all executing instructions have completed).

• A common mechanism is used to save / restore machine state for all personalities – All personality machine state is mapped to AEG registers (Application Engine Generic registers). – A few pre-defined AE instructions allow access to AEG registers for the purpose of machine state save and restore to/from memory. – The following predefined AE instructions allow user defined machine state to be saved/restored:

MOV AEGcnt, At ; Obtain the number of defined AEG registers MOV Sa,Ab,AEG ; Restore an AEG[Ab] register from an S register MOV AEG,Ab,St ; Save an AEG[Ab] register to an S register

– The has one machine state save / restore routine used by all personalities.

Page 14 Hot Chips Conference 2009 Machine State in Memory enables Application Debugging

• Coprocessor machine state can be accessed (read and written) by the GNU Debugger (GDB).

• GDB can perform single step, run and stop at breakpoint instruction.

Page 15 Hot Chips Conference 2009 Single Precision Vector Personality 4 Application Engine FPGAs / 32 Function Pipes vector elements distributed across function pipes Load-store vector architecture with modern Dispatch Dispatch Dispatch Dispatch -hiding features

Optimized for applications Crossbar Crossbar Crossbar Crossbar 1-of-32 Function Pipes A function pipe is the unit Vector of data path replication Same instructions sent to all function pipes (SIMD)

Add Each function pipe supports: FMA FMA FMA FMA Misc Load Store − Multiple functional units Integer Logical Convert Add − Out-of-order execution − To Crossbar FMA - Fused Multiply Add

Page 16 Hot Chips Conference 2009 Single Precision Vector Personality FFT Performance 1-d FFTs 2-d FFTs 70 70 60 60 50 50 40 40 p/s p/s o o l l f 30 f 30 G G Convey HC-1 20 20 Nehalem + MKL 10 10 Convey HC-1 Nehalem + MKL, 8 cores 0 0 1K 4K 1M 4M 1K 2K 4K 16K 64K 16M 512 64M 128 256 256K Number of Points Number of Points

Page 17 Hot Chips Conference 2009 Financial Vector Personality 4 Application Engines / 32 Function Pipes vector elements distributed across function pipes Pairs of single precision functional units replaced Dispatch Dispatch Dispatch Dispatch by double precision units

Crossbar Crossbar Crossbar Crossbar 1-of-32 Function Pipes Vector Register File Functional units for log, exp, random number generation Supported by the compiler as Misc Load Store vector intrinsics Integer Logical DP Add DP Convert DP FMA DP FMA Exp, Log Random Allows compile and run model for financial analytic To Crossbar applications

Page 18 Hot Chips Conference 2009 Protein Sequencing Personality

4 Application Engines / 40 State Machines Inspect Application ported to Convey HC-1 as Dispatch Dispatch Dispatch Dispatch joint project with UCSD Application performs protein sequencing Crossbar Crossbar Crossbar Crossbar 1-of-40 State Machines

Peptide Substring Protein Mass Fetch Fetch Memory Entire “Best Match” routine implemented as state machine

Score Multiple state machines for data parallelism PRM Save Scores Match (MIMD) Memory

Update Temp Score Operates on main memory using virtual Match To Beat Memory addresses

Store Matches Executes greater than 100 times faster than a single Nehalem core

Page 19 Hot Chips Conference 2009 In Summary

• The Convey Approach Provides: – Virtual, cache coherent, global address space – Instruction based FPGA compute model – ANSI Standard C/C++/ Fortran high level language support – Integrated debugging environment using GDB (text or GUI based)

• The Convey system allows users to develop applications as one would for standard CPU based systems.

Page 20 Hot Chips Conference 2009