Convey Overview

THE WORLD’S FIRST HYBRID-CORE COMPUTER. CONVEY HYBRID-CORE COMPUTING Hybrid-core Computing Convey HC-1 High Performance of application- specific hardware Heterogenous solutions • can be much more efficient Performance/ • still hard to program Programmability and Power efficiency deployment ease of an x86 server Application Multicore solutions • don’t always scale well • parallel programming is hard Low Difficult Ease of Deployment Easy 1/22/2010 3 Hybrid-Core Computing Application-Specific Personalities Applications • Extend the x86 instruction set • Implement key operations in Convey Compilers hardware Life Sciences x86-64 ISA Custom ISA CAE Custom Financial Oil & Gas Shared Virtual Memory Cache-coherent, shared memory • Both ISAs address common memory *ISA: Instruction Set Architecture 7/12/2010 4 HC-1 Hardware PCI I/O FPGA FPGA Intel Personalities Chipset FPGA FPGA 8 GB/s 80 GB/s Memory Memory Cache Coherent, Shared Virtual Memory 1/22/2010 5 Using Personalities C/C++ Fortran • Personalities are user specifies reloadable personality at instruction sets Convey Software compile time Development Suite • Compiler instruction generates x86 descriptions and coprocessor instructions from Hybrid-Core Executable P ANSI standard x86-64 and Coprocessor Personalities C/C++ & Fortran Instructions • Executable can run on x86 nodes FPGA Convey HC-1 or Convey Hybrid- bitfiles Core nodes Intel x86 Coprocessor personality loaded at runtime by OS 1/22/2010 6 SYSTEM ARCHITECTURE HC-1 Architecture “Commodity” Intel Server Convey FPGA-based coprocessor Direct Application Application Engines Data Intel 5138 Engine Hub Port Dual Core Processor Intel 5400 MCH Intel IO Memory Subsyste Memory m Intel x86-64 Server FPGA based x86-64 Linux Shared cache-coherent memory 7/12/2010 8 HC-1 Hardware Coprocessor Assembly Host memory DIMMs FSB x16 PCI-E Mezzanine slot Card • 2U enclosure: – Top half of 2U platform contains the coprocessor Host x86 3 x 3½” Disk Drives – Bottom half contains Intel Server Assembly motherboard 1/22/2010 9 HC-1 Physical Layout 17.78" Direct Data Port Memory DIMMs AEs Memory DIMMs Memory Memory Controller Controller AEH 11.15" AEs Memory Memory Controller Controller Memory DIMMs Memory DIMMs Direct Data Port 1/22/2010 10 Inside the Coprocessor Host interface and memory Application Engines controllers implemented e Scalar by coprocessor Instruction Processing Fetch/Decod infrastructure Host Interface Link I/F Four Xilinx V5LX330 FPGAs Implemented with Xilinx V5LX110 FPGAs Memory Memory Memory Memory Memory Memory Memory Memory Memory Memory Controller Controller Controller Controller Controller Controller Controller Implemented Controller with Xilinx V5LX155 FPGAs 16 DDR2 memory channels Standard or Scatter-Gather DIMMs 80GB/sec throughput 7/12/2010 Page 11 Coprocessor Block Diagram scalar instructions executed by IAS, AE instructions broadcast to AEs datamover AE-MC for high links bandwidth support block 8192 TIDs, moves 76.8GB/s MCs translate virtual addresses, maintain coherence with host 7/12/2010 12 Strided Memory Access Performance • Convey SG-DIMMs optimized for Strided 64-bit Memory Accesses 64-bit memory access 60 • High bandwidth for non-unity strides, scatter/gather 50 • 3131 interleave for high 40 bandwidth for all strides except multiples of 31 30 GB/sec 20 • measured with strid3.c 10 written by Mark Seager: 0 1 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 for( i=0; i<n*incx; i+=incx ) { Stride (1-64 words) yy[i] += t*xx[i]; Nehalem (single core) } HC-1 SG-DIMM 3131 1/22/2010 13 Direct Data Port • Direct data paths to I/O gates on FPGAs • Permits custom H/W interface For more information see “Virtex-5 FPGA User Guide UG190 (v5.2)” November 5, 2009 1/22/2010 14 CONVEY SOFTWARE Convey Software Architecture GCC & Convey compatible Compilers compilers Objects can be linked with other gcc/glibc Applications compatible objects Key routines (BLAS, Convey Math Library (CML) FFTs) optimized for Convey personalities If coprocessor shared library not present, native Personality Support System x86 code is executed Kernel is Linux Convey Enhanced Standards-based (LSB) Linux Kernel compliant legacy apps run “as-is” Intel x86-64 Coprocessor personalities define ISA ISA instruction set 1/22/2010 16 Convey Compilers C/C++ Fortran95 • Program in ANSI standard C/C++ and Fortran Common Optimizer • Unified compiler generates x86 & Intel® 64 Convey coprocessor instructions Optimizer Vectorizer other & Code & Code compilers • Seamless debugging Generator Generator environment for Intel & coprocessor code Linker • Executable can run on executable x86_64 nodes or on Intel® 64 code Convey Hybrid-Core nodes Coprocessor code 1/22/2010 17 Coprocessor Regions • The compiler dispatches blocks of code called “coprocessor regions.” • Switches, directives and pragmas can be used to control what code is compiled for the coprocessor – loop or code segment within a routine or a whole routine – can also use underlying library interface (“copcall”) • Coprocessor regions cannot call x86 routines 7/12/2010 18 Memory Placement • Convey systems have a Non-Uniform Memory Access (NUMA) architecture – Host and coprocessor can access all of memory – but it’s much more efficient if they access local memory • OS manages host and coprocessor memory as separate memory pools • Compiler flags and directives can specify placement – #pragma cny coproc_mem (x,y) places static arrays or common blocks in coproc memory – cny_cp_malloc() alternative to malloc() • Memory can also be migrated or copied dynamically – “#pragma cny migrate_coproc (X,n)” specifies that array X should be moved to coprocessor memory before execution of the next line of code – cny_cp_memcopy() copies buffers using the datamover 7/12/2010 19 Convey Runtime Environment executable Intel® x86-64 code if dlopen of gdb shared library debugging on Coprocessor code fails, Intel x86- HW & simulator 64 code cny_runtime.o executed personalities are demand Convey Shared loaded by OS Libraries at runtime Convey x86-64 Simulator shared coprocessor Custom hardware library hardware FAP executes on x86 SP SPAT performance simulator 1/22/2010 20 PERSONALITIES A personality is... • A reloadable instruction set that augments the x86 ISA x86 application – Applicable to a class of applications specific or specific to a particular code Processor instructions • Each personality is a set of files consisiting of: – A unique ID $ ls /opt/convey/personalities/2.1.1.1.0 – The bits loaded into the AE FPGAs ae_fpga.tgz convey-aesp-2.0.0-10_03_17_62.tgz – Information used by the Convey lic_info compilers and tools PersDesc.dat • List of available instructions readme • Performance characteristics for rest.o compiler optimization save.o • Machine state formatting and zero.o modification 7/12/2010 22 How Are Personalities Created? • Personalities are a logic design – implemented using a hardware description language (high level tools may be used as well) – synthesized with the Xilinx tools – interfaced to the coprocessor dispatch, memory, and management infrastructure – packaged for demand loading by the Convey OS • Convey sells prebuilt personalities for key algorithms and applications • Convey licenses a Personality Development Kit that enables customers to construct their own 7/12/2010 23 Instruction Based and Algorithmic Personalities parameter (N=100000) • Instruction based real*4 a(N),b(N),c(N),s c$cny BEGIN_COPROC personalities implement do i=1,N instruction sets c(i) = a(i)+s*b(i) – Automatic vectorization enddo generates SIMD instructions c$cny END_COPROC from standard languages end $ cnyf95 -O3 -mcny_vector test.f – Different sequences of VECTOR LOOP in MAIN__0 at 4: L 2 instructions can implement S 1 B 2 U 0 I 0 M 0 different algorithms • Algorithmic personalities ret= l_copcall_fmt(sig, solver,a1,size); implement complete state machines solver: – implements just one algorithm mov %a8, $0, %aeg mov %a9, $1, %aeg – may be specific to a particular caep00 $0 application rtn 7/12/2010 24 Personality Family Tree Bioinformatics University/Research Finance Blast PCAP System Smith Inspect Simulation FAP Waterma n Architectur e Research Oil & Gas DP vector Embedded/OEM SP PDK vector Speech Recognitio n Common Foundation x86/Linux host Coherent virtual memory Personality infrastructure 7/12/2010 25 Instruction Based Personality Example (FAP) 32 Function Pipes across 4 Application Engines vector elements distributed across function pipes Vector architecture Dispatch Dispatch Dispatch Dispatch with optimizations for financial Monte Carlo simulation Crossbar Crossbar Crossbar Crossbar Double precision floating vector register file point units Functional units for common functions such as rcp add fma fma load misc store logical log, exp, random number integer exp,log,CND Parallel RNG Parallel generation Supported by the compiler to crossbar as vector intrinsics 7/12/2010 26 UCSD InsPect Personality 4 Application Engines / 40 State Machines Hardware implementation of search kernel from Dispatch Dispatch Dispatch Dispatch InsPect proteomics application Crossbar Crossbar Crossbar Crossbar 1-of-40 State Machines Peptide Substring Protein Mass Fetch Fetch Memory • Entire “Best Match” routine implemented as state machine Score • Multiple state machines for data PRM Save Scores Match Memory parallelism Update Temp Score • Operates on main memory using virtual Match To Beat Memory addresses Store Matches 1/22/2010 27 Calling the Inspect Personality cny_get_signature(mckxstr(UCSD_SIGNATURE), &cny_pdk_image, &cny_pdk_image2, &Result); … ptr = (char

Convey Overview

X86 Platform Coprocessor/Prpmc (PC on a PMC)

Exploiting Free Silicon for Energy-Efficient Computing Directly

CUDA What Is GPGPU

Comparing the Power and Performance of Intel's SCC to State

World's First High- Performance X86 With

Introduction to Cpu

AI Chips: What They Are and Why They Matter

Instruction Set Innovations for Convey's HC-1 Computer

CUDA C++ Programming Guide

Intel Xeon Phi Product Family Brief

Comparing Performance and Energy Efficiency of Fpgas and Gpus For

Intel Delivers New Architecture for Discovery with Intel® Xeon Phi™ Coprocessors