
THE WORLD’S FIRST HYBRID-CORE COMPUTER.

CONVEY HYBRID-CORE COMPUTING

Hybrid-core Computing

[Figure: positioning chart. Vertical axis: application performance/power efficiency, low to high. Horizontal axis: ease of deployment, difficult to easy.]

Heterogeneous solutions
• can be much more efficient
• still hard to program

Multicore solutions
• don’t always scale well
• parallel programming is hard

Convey HC-1
• high performance of application-specific hardware
• programmability and ease of deployment

1/22/2010

Hybrid-Core Computing

Application-Specific Personalities
• Extend the x86 instruction set
• Implement key operations in hardware

[Figure: applications (Life Sciences, CAE, Financial, Oil & Gas, Custom) pass through the Convey compilers and target both the x86-64 ISA and a custom ISA.]

Cache-coherent, shared memory
• Both ISAs address common memory

*ISA: Instruction Set Architecture

7/12/2010

HC-1 Hardware

[Figure: block diagram. Host side: Intel chipset, PCI I/O, and memory. Coprocessor side: four FPGAs hosting the personalities, and memory. Bandwidth labels: 8 GB/s and 80 GB/s.]

Cache Coherent, Shared Virtual Memory

Using Personalities

• Personalities are reloadable instruction sets
• User specifies personality at compile time
• Compiler generates x86 and coprocessor instructions from ANSI standard C/C++ & Fortran
• Executable can run on x86 nodes or Convey Hybrid-Core nodes

[Figure: C/C++ and Fortran sources, together with instruction descriptions, feed the Convey Software Development Suite, which produces a Hybrid-Core Executable containing x86-64 and coprocessor instructions. On the Convey HC-1 (x86 plus coprocessor), the personality (FPGA bitfiles) is loaded at runtime by the OS.]

SYSTEM ARCHITECTURE

HC-1 Architecture

“Commodity” Intel Server
• Intel 5138 Dual Core
• Intel 5400 MCH
• Intel IO Subsystem
• Memory
• Intel x86-64 Server, x86-64 Linux

Convey FPGA-based coprocessor
• Application Engine Hub
• Application Engines
• Direct Data Port
• Memory

Shared cache-coherent memory

HC-1 Hardware

• 2U enclosure:
  – Top half of 2U platform contains the coprocessor
  – Bottom half contains Intel motherboard

[Figure: photo with callouts: Coprocessor Assembly, Host x86 Server Assembly, host memory DIMMs, FSB Mezzanine Card, x16 PCI-E slot, 3 x 3½” Disk Drives.]

HC-1 Physical Layout

[Figure: coprocessor board layout, 17.78" by 11.15". Labeled components: AEH, AEs, Memory Controllers, Memory DIMMs, and two Direct Data Ports.]

Inside the Coprocessor

Host interface and memory controllers implemented by coprocessor infrastructure
• Host Link I/F, Instruction Fetch/Decode, and Scalar Processing: implemented with Xilinx V5LX110 FPGAs
• Eight Memory Controllers: implemented with Xilinx V5LX155 FPGAs

Application Engines
• Four V5LX330 FPGAs

Memory
• 16 DDR2 memory channels
• Standard or Scatter-Gather DIMMs
• 80 GB/sec throughput

Coprocessor Block Diagram

[Figure: coprocessor block diagram, with the following callouts.]

• Scalar instructions executed by IAS, AE instructions broadcast to AEs
• Datamover for high bandwidth block moves
• AE-MC links support 8192 TIDs, 76.8 GB/s
• MCs translate virtual addresses, maintain coherence with host

Strided Memory Access Performance

• Convey SG-DIMMs optimized for 64-bit memory access
• High bandwidth for non-unity strides, scatter/gather
• 3131 interleave for high bandwidth for all strides except multiples of 31
• Measured with strid3.c written by Mark Seager:

for( i=0; i …
yy[i] += t*xx[i];
}

[Figure: “Strided 64-bit Memory Accesses”. Vertical axis: GB/sec, 0 to 60. Horizontal axis: stride, 1 to 61. Series: Nehalem (single core) and HC-1 SG-DIMM 3131.]

Direct Data Port

• Direct data paths to I/O gates on FPGAs
• Permits custom H/W interface

For more information see “Virtex-5 FPGA User Guide UG190 (v5.2)”, November 5, 2009

CONVEY SOFTWARE

Convey Software Architecture

• Applications
• Convey Compilers; GCC & compatible compilers
  – Objects can be linked with other gcc/glibc compatible objects
• Convey Math Library (CML)
  – Key routines (BLAS, FFTs) optimized for Convey personalities
• Personality Support System
  – If coprocessor shared library not present, native x86 code is executed
• Convey Enhanced Linux Kernel
  – Kernel is Linux Standard Base (LSB) compliant
  – Legacy apps run “as-is”
• Intel x86-64 ISA and Coprocessor ISA
  – Coprocessor personalities define instruction set

Convey Compilers

• Program in ANSI standard C/C++ and Fortran
• Unified compiler generates x86 & coprocessor instructions
• Seamless debugging environment for Intel & coprocessor code
• Executable can run on x86_64 nodes or on Convey Hybrid-Core nodes

[Figure: C/C++ and Fortran95 front ends feed a Common Optimizer, followed by an Intel® 64 Optimizer & Code Generator and a Convey Vectorizer & Code Generator. Their output, together with objects from other compilers, goes to the Linker, which produces an executable containing Intel® 64 code and Coprocessor code.]

Coprocessor Regions

• The compiler dispatches blocks of code called “coprocessor regions.”
• Directives and pragmas can be used to control what code is compiled for the coprocessor
  – loop or code segment within a routine or a whole routine
  – can also use underlying library interface (“copcall”)
• Coprocessor regions cannot call x86 routines

Memory Placement

• Convey systems have a Non-Uniform Memory Access (NUMA) architecture
  – Host and coprocessor can access all of memory
  – but it’s much more efficient if they access local memory
• OS manages host and coprocessor memory as separate memory pools
• Compiler flags and directives can specify placement
  – #pragma cny coproc_mem (x,y) places static arrays or common blocks in coproc memory
  – cny_cp_malloc() alternative to malloc()
• Memory can also be migrated or copied dynamically
  – “#pragma cny migrate_coproc (X,n)” specifies that array X should be moved to coprocessor memory before execution of the next line of code
  – cny_cp_memcopy() copies buffers using the datamover

Convey Runtime Environment

• Executable contains Intel® x86-64 code, coprocessor code, and cny_runtime.o
• If dlopen of shared library fails, Intel x86-64 code is executed
• gdb debugging on HW & simulator
• Personalities are demand loaded by OS at runtime

[Figure: the executable calls into the Convey shared libraries, which target either the coprocessor hardware or a simulator that executes on x86. Personalities shown: Custom, FAP, SP. Also shown: SPAT performance simulator.]

PERSONALITIES

A personality is...

• A reloadable instruction set that augments the x86 ISA
  – Applicable to a class of applications or specific to a particular code
• Each personality is a set of files consisting of:
  – A unique ID
  – The bits loaded into the AE FPGAs
  – Information used by the Convey compilers and tools
    • List of available instructions
    • Performance characteristics for compiler optimization
    • Machine state formatting and modification

$ ls /opt/convey/personalities/2.1.1.1.0
ae_fpga.tgz
convey-aesp-2.0.0-10_03_17_62.tgz
lic_info
PersDesc.dat
readme
rest.o
save.o
zero.o

[Figure: an application issuing both x86 processor instructions and application specific instructions.]

How Are Personalities Created?

• Personalities are a logic design
  – implemented using a hardware description language (high level tools may be used as well)
  – synthesized with the Xilinx tools
  – interfaced to the coprocessor dispatch, memory, and management infrastructure
  – packaged for demand loading by the Convey OS
• Convey sells prebuilt personalities for key algorithms and applications
• Convey licenses a Personality Development Kit that enables customers to construct their own

Instruction Based and Algorithmic Personalities

• Instruction based personalities implement instruction sets
  – Automatic vectorization generates SIMD instructions from standard languages
  – Different sequences of instructions can implement different algorithms

      parameter (N=100000)
      real*4 a(N),b(N),c(N),s
c$cny BEGIN_COPROC
      do i=1,N
        c(i) = a(i)+s*b(i)
      enddo
c$cny END_COPROC
      end

$ cnyf95 -O3 -mcny_vector test.f
VECTOR LOOP in MAIN__0 at 4: L 2 S 1 B 2 U 0 I 0 M 0

• Algorithmic personalities implement complete state machines
  – implements just one algorithm
  – may be specific to a particular application

ret= l_copcall_fmt(sig, solver,a1,size);

solver:
    mov %a8, $0, %aeg
    mov %a9, $1, %aeg
    caep00 $0
    rtn

Personality Family Tree

[Figure: family tree of personalities. Market areas: Bioinformatics, University/Research, Finance, Oil & Gas, Embedded/OEM. Personalities shown: Blast, Smith-Waterman, InsPect, PCAP, System Simulation, Architecture Research, FAP, SP vector, DP vector, PDK, Speech Recognition.]

Common Foundation
• x86/Linux host
• Coherent virtual memory
• Personality infrastructure

Instruction Based Personality Example (FAP)

• 32 Function Pipes across 4 Application Engines
• Vector elements distributed across function pipes
• Vector architecture with optimizations for financial Monte Carlo simulation
• Double precision floating point units
• Functional units for common functions such as log, exp, random number generation
• Supported by the compiler as vector intrinsics

[Figure: four Application Engines, each with a Dispatch unit and a Crossbar. Function pipe units connected to the crossbar: load, store, add, fma (two), rcp, logical, integer, misc, exp, log, CND, and parallel RNG.]

UCSD InsPect Personality

• 4 Application Engines / 40 State Machines
• Hardware implementation of search kernel from InsPect proteomics application
• Entire “Best Match” routine implemented as state machine
• Multiple state machines for data parallelism
• Operates on main memory using virtual addresses

[Figure: four Application Engines, each with Dispatch and Crossbar. Detail of 1-of-40 State Machines, with blocks for fetch (peptide, substring, protein), mass memory, Score, PRM Scores Memory, Save Match, Temp Match Memory, Update Score To Beat, and Store Matches.]

Calling the Inspect Personality

cny_get_signature(mckxstr(UCSD_SIGNATURE), &cny_pdk_image,
                  &cny_pdk_image2, &Result);
…
ptr = (char *)cny_cp_malloc(msize);
…
copcall_nowait_stat_fmt(cny_pdk_image, (void *) &pdk_kernel_wrapper,
                        &copcall_hndl, &copcall_status, "AAAAaaaaaA",
                        max_items_per_pipe_array,
                        num_todo_per_pipe_array,
                        num_done_per_pipe_array,
                        cny_input_per_pipe_array,
                        GlobalOptions->InitialSTB,
                        GlobalOptions->InitialTier1,
                        GlobalOptions->StoreThreshold,
                        GlobalOptions->StallThreshold,
                        GlobalOptions->DropLastPosition,
                        &PeptideMass[(int)'A']);

UCSD InsPect Performance

• Proteomics application
  – compares mass spectrometry samples against a database of proteins
• “streaming” personality implements entire search kernel
  – x86 cores place data on queues for processing by the coprocessor
  – multiple independent function pipes

[Figure: chart of relative performance (0 to 120) across samples.]

Silicon Vox Speech Recognition Personality

Continuous Automated Speech Recognizer (ASR)

• Backend search dominated by bit-level comparisons and random memory accesses
• Multiple pipelines for Viterbi
• First commercial implementation of a hybrid architecture for Speech Recognition*

*Several patents pending for ASR on a hybrid platform

[Figure: voice channels pass through A-D conversion into scoring/backend search pipelines on the AEs. Each of the four AEs has Dispatch, Score, Search, and Crossbar blocks, and the output is text streams. Scoring and search algorithm detail: Next frame, Queue, Patches, Merge, Language Model, Language info.]

Custom Personalities

• Personality Development Kit
  – logic libraries implement interfaces to coprocessor infrastructure
  – System simulation environment for debugging
  – Management tools package bit files produced with Xilinx toolset into personalities
• Architected instruction interface
  – transfer data to/from AE
  – control custom AE logic
• Compiler interfaces to generate calls to custom personality instructions

Personality Development Kit (PDK)

• Customer designed logic in Convey infrastructure
• Executes as instructions within an x86-64 address space
• Allows designers to concentrate on Intellectual Property, not housekeeping

[Figure: Customer Designed Logic inside an AE, surrounded by Convey-provided interfaces: Dispatch Interface, Mgmt/Debug Interface, AE-AE interface, MC Link Interface, and Exception handling. Other labeled blocks: Host Interface, HIB & IAS, Mgmt, Pers. Loading, Scalar Instr., Pers. Debug, HW Monitor (IPMI), and Memory Controllers providing virtually addressable, cache coherent memory.]

CONCLUSION

Energy Efficient, Hybrid Core Computing

• Higher Performance
  – 5x to 25x application gains
• Energy Saving
  – Up to 90% reduction in data center power usage
• Easy to program
  – ANSI standard C, C++ and Fortran
• Reloadable Personalities
  – application specific performance on an x86 base

“Convey Computers may be at the forefront of a wave of innovation brought on by developing FPGAs as a viable alternative to CPUs…”

“Convey Computer seeks to use FPGAs to create a hybrid computing platform”
MIS Impact Report, 12/09/08, 451 Group

“We have found that one rack of HC-1 servers will replace eight racks of other servers…with correspondingly lowered energy requirements”
Pavel Pevzner - UCSD