
Published in Computing in Science and Engineering, ed. Volodymyr Kindratenko and Pedro Trancoso, vol. 12, no. 6, 2010, pp. 80-87, http://ieeexplore.ieee.org/servlet/opac?punumber=5992. © 2010 IEEE.

Novel Architectures

Editors: Volodymyr Kindratenko, [email protected]; Pedro Trancoso, [email protected]

High-Performance Heterogeneous Computing with the Convey HC-1

By Jason D. Bakos

Unlike other socket-based reconfigurable coprocessors, the Convey HC-1 contains nearly 40 field-programmable gate arrays, scatter-gather memory modules, a high-capacity crossbar switch, and a fully coherent memory system.

At Supercomputing 2009, Convey Computer unveiled the HC-1, an all-in-one compute platform containing a socket-based reconfigurable coprocessor board. The HC-1 is unique in several ways. Unlike in-socket coprocessors from Nallatech (www.nallatech.com/Intel-Xeon-FSB-Socket-Fillers/fsb-development-systems.html), DRC (www.drccomputer.com/drc/modules.html), and XtremeData (www.xtremedata.com/products/accelerators/in-socket-accelerator/xd2000i)—all of which are confined to a socket-sized footprint—Convey uses a mezzanine connector to bring the front side bus (FSB) interface to a large coprocessor board roughly the size of an ATX motherboard. This coprocessor board is housed in a one-unit (1U) chassis that's fused to the top of another 1U chassis containing the host motherboard.

In addition to the machine, Convey designed a selection of accelerator designs to use with it. Some of these implement soft-core floating point vector processors for which Convey has also developed a C and FORTRAN compiler. Others, such as their Smith-Waterman sequence alignment accelerator design, include an easy-to-use interface library. This makes the HC-1's FPGAs accessible to programmers who lack the expertise or patience to design their own FPGA-based coprocessors in a hardware description language. However, realizing that the HC-1 appeals to customers who would like to do this, Convey offers support and tools accordingly.

Here, I examine the HC-1, emphasizing its system architecture, performance, ease of programming, and flexibility.

System Overview
The HC-1's host consists of a dual-socket server motherboard, an Intel 5400 memory-controller hub chipset, 24 Gbytes of RAM, a 1,066-MHz FSB, and a 2.13-GHz Intel Xeon 5138—a dual-core, low-voltage processor (the 65-nanometer Intel Core architecture released in 2006). Newer Intel Xeons based on the Nehalem or later architectures can't be used in an HC-1-like system until Convey completes the Quick Path Interconnect interface for their coprocessor board. The HC-1 host runs a 64-bit 2.6.18 Linux kernel with a modified virtual memory system to accommodate memory coherency for the coprocessor board.

Top-Level Design
Figure 1 shows the coprocessor board's design. There are four user-programmable Virtex-5 LX 330s, which Convey calls the application engines (AEs). Convey refers to a particular configuration of these field-programmable gate arrays (FPGAs) as a "personality." The four AEs each connect to eight memory controllers through a full crossbar. Each memory controller is implemented on its own FPGA and is connected to two Convey-designed scatter-gather dual inline memory modules (SG-DIMMs) containing 64 banks each and an integrated Stratix-2 FPGA. The AEs themselves are interconnected in a ring configuration with 668-Mbyte/s, full-duplex links for AE-to-AE communication. These links can be useful for multi-FPGA applications.

Memory Interleave Modes
Each AE has a 2.5-Gbyte/s link to each memory controller, and each SG-DIMM has a 5-Gbyte/s link to its corresponding memory controller. As such, the effective memory bandwidth of the AEs is dependent on their memory access pattern to the eight memory controllers and their two SG-DIMMs each. Each AE can achieve a theoretical peak bandwidth of 20 Gbytes/s when striding across eight different memory controllers, but this bandwidth would drop if two other AEs attempt to read from the same set of SG-DIMMs, because this would saturate the 5-Gbyte/s DIMM-to-memory-controller links.
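
Taking the stated link rates at face value (and assuming full overlap with no contention or protocol overhead), the peak numbers line up as follows:

per-AE peak: 8 memory controllers × 2.5 Gbytes/s = 20 Gbytes/s
DIMM-side aggregate: 16 SG-DIMMs × 5 Gbytes/s = 80 Gbytes/s
all four AEs at peak: 4 × 20 Gbytes/s = 80 Gbytes/s

In other words, the DIMM-side links can feed all four AEs at full rate only if accesses spread evenly across all 16 SG-DIMMs; any concentration on a subset of DIMMs saturates those 5-Gbyte/s links first.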


[Figure 1 diagram: sixteen 1-Gbyte SG-DIMMs connect over 5-Gbyte/s links to memory controllers MC0 through MC7, which connect over 2.5-Gbyte/s links to application engines AE0 through AE3 (Virtex-5 LX 330s) and the application engine hub on the coprocessor board; the AEs are linked by 668-Mbyte/s full-duplex connections. The host board holds a dual-core 2.13-GHz Xeon 5138 with a 4-Mbyte L2 cache, the Intel 5400 northbridge, 24 Gbytes of RAM, and the 1,066-MHz FSB.]

Figure 1. The HC-1 coprocessor board. Four application engines connect to eight memory controllers through a full crossbar. Each memory controller is implemented on its own field-programmable gate array.

Because each memory address maps only to one SG-DIMM (and its corresponding memory controller), Convey's goal when designing its memory system was to maximize the likelihood that an arbitrary set of unique memory references would be uniformly distributed across all 16 SG-DIMMs and eight memory controllers. Convey provides two user-selectable memory mapping modes to partition the coprocessor's virtual address space among the SG-DIMMs:

• Binary interleave, which maps bitfields of the memory address to a particular controller, DIMM, and bank; and
• 31-31 interleave, a modulo-31 mapping optimized for constant memory strides (stride lengths that are a power of two are guaranteed to hit all 16 SG-DIMMs for any sequence of 16 consecutive references).

The memory banks are divided into 32 groups of 32 banks each. In 31-31 interleave, one group isn't used, and one bank within each of the remaining groups isn't used. Because the number of groups and the number of banks per group are prime (31), this reduces the likelihood of strides aliasing to the same SG-DIMM—a stride can repeatedly hit the same bank only if it shares a factor with 31, which no power-of-two stride does. Selecting the 31-31 interleave comes at a cost of approximately 1 Gbyte of addressable memory space (6 percent) and a 6 percent reduction in peak memory bandwidth.

Coprocessor Memory Coherency
The coprocessor memory is coherent with the host memory and is implemented using the snoopy coherence mechanism built into the Intel FSB protocol. This essentially creates a common virtual address space that both the host and coprocessor share.

In the coherence protocol, both the host and the coprocessor possess copies of the global memory space. Each block of memory addresses in both the host memory and coprocessor memory is marked as exclusive, shared, or invalid. A write by the host to an address block will change its status to exclusive and invalidate the block on the coprocessor (indicating that it's out of date). If one of the application engines on the coprocessor reads from this block, an updated copy of the block's memory contents is sent to the coprocessor memory, and the memory block changes to shared in both the host and coprocessor memory. The coherence mechanism is transparent to the user and removes the need for explicit direct memory access (DMA) transactions, which coprocessors based on peripheral component interconnect (PCI) require.
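
The following minimal C sketch models only the block-state bookkeeping just described (exclusive/shared/invalid on each side). It is an illustration of the state transitions, not Convey's implementation, and it ignores everything the real FSB snooping hardware does.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;

typedef struct {
    block_state_t host;   /* state of the block in host memory */
    block_state_t coproc; /* state of the block in coprocessor memory */
} block_t;

/* Host writes the block: host copy becomes exclusive, coprocessor copy is invalidated. */
static void host_write(block_t *b) {
    b->host = EXCLUSIVE;
    b->coproc = INVALID;
}

/* An application engine reads the block: if its copy is stale, an updated copy
 * is (conceptually) transferred and both sides end up shared. */
static void ae_read(block_t *b) {
    if (b->coproc == INVALID) {
        b->host = SHARED;
        b->coproc = SHARED;
    }
}

int main(void) {
    block_t b = { SHARED, SHARED };
    host_write(&b);
    printf("after host write: host=%d coproc=%d\n", b.host, b.coproc);
    ae_read(&b);
    printf("after AE read:    host=%d coproc=%d\n", b.host, b.coproc);
    return 0;
}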


Host Interface
The coprocessor board contains two non-user-programmable FPGAs that together form the application engine hub (AEH). One FPGA serves as the physical interface between the coprocessor board and the FSB, and its logic monitors the FSB to maintain the snoopy memory coherence protocol and manages the coprocessor memory's page table. This FPGA is actually mounted on the mezzanine connector.

The second AEH FPGA contains the scalar processor, a soft-core processor that implements the base Convey instruction set. The scalar processor is a substantial architecture, including a cache and features such as multiple-issue out-of-order execution, branch predication, and sliding register windows.

The scalar processor is the mechanism by which the host invokes computations on the AEs. In Convey's programming model, the AEs act as coprocessors to the scalar processor, while the scalar processor acts as a coprocessor for the host CPU. To facilitate this, the binary executable file on the host (Intel processor) contains integrated scalar processor code (using a "fat binary" linker format), which is transferred to and executed on the scalar processor when the host code calls a scalar processor routine through one of Convey's runtime library calls (a similar mechanism is employed on GPUs). The scalar processor code can contain instructions that are dispatched and executed (that is, offloaded) onto the AEs.

Code for the scalar processor can be generated by one of Convey's compilers or handwritten in assembly language. After compilation and assembly, the scalar processor code is linked into the executable in the ctext linker section. Upon execution, the host code can invoke scalar processor routines using the synchronous and asynchronous copcall API functions. The host CPU can also use this mechanism to send parameters to and receive status information from the AEs.

The scalar processor is connected to each AE via a point-to-point link, and uses this link to dispatch to the AEs those instructions that aren't entirely implemented on the scalar processor. Instruction examples include

• move instructions for exchanging data between the scalar processor and AEs; and
• custom AE instructions, which consist of 32 unimplemented instructions that can be used to invoke user-defined AE behaviors.

Through the AE's dispatch interface, AE logic can also trigger exceptions and implement memory synchronization behaviors.

Personalities
Convey develops and licenses its own set of personalities but also allows users to develop their own using the personality development kit (PDK). Convey has established a global numeric identifier system for personalities and maintains a publicly accessible registration database for these identifiers, evidently in the hope of fostering a marketplace for custom personalities.

Convey's "stock" personalities are individually licensed and are each designed for specific application types. Currently, the set includes a single-precision vector personality, double-precision vector personality, financial analytics personality, and Smith-Waterman personality.

The two vector personalities act as vector coprocessors for the scalar processor and are targets for Convey's vectorizing compiler. When using these personalities, each AE implements eight floating-point multiply-add pipelines and eight load/store units (for a total of 32 of each, logically combined across the four AEs).

The financial analytics personality is a double-precision personality that adds additional vector instructions, transcendental functions, probability distribution functions, and various random number generators designed for high-performance Monte Carlo simulation. In addition to the compiler, the vector and financial personalities also have robust debuggers, simulators, and performance analyzers. The single-precision vector personality also has the Convey math library (CML), a corresponding hand-optimized basic linear algebra subroutines (BLAS) implementation. The Smith-Waterman personality is a parameterized, scalable processing element and is built around the Convey Sequence Library, a customized API.

As mentioned earlier, users who wish to develop their own personalities with HDL-based design must license the PDK, which includes design flows and robust system models that support hardware/software co-simulation.
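
To make the host-side call flow described above concrete, here is a self-contained sketch. The article doesn't give the exact signatures of Convey's synchronous and asynchronous copcall functions, so copcall_sync below is a hypothetical stand-in implemented as a plain-CPU stub; on the real machine the invoked routine would be linked into the ctext section and run on the scalar processor, possibly dispatching custom AE instructions.

#include <stdio.h>

struct saxpy_args { int n; float a; const float *x; float *y; };

/* Stands in for a routine the Convey compiler would place in the fat binary;
 * on the HC-1 it would execute on the scalar processor and AEs. */
static void saxpy_cp(void *p)
{
    struct saxpy_args *args = p;
    for (int i = 0; i < args->n; i++)
        args->y[i] += args->a * args->x[i];
}

/* Hypothetical synchronous dispatch: the real copcall would transfer control
 * to the scalar processor and block until the coprocessor work finishes. */
static int copcall_sync(void (*routine)(void *), void *args)
{
    routine(args);
    return 0;
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    struct saxpy_args args = {4, 2.0f, x, y};
    copcall_sync(saxpy_cp, &args);
    printf("y[3] = %f\n", y[3]);   /* expect 8.0 */
    return 0;
}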


Convey Instruction Set Architecture
Convey developed its own entirely new instruction set architecture from the ground up. The Convey ISA includes a scalar instruction set that's common to all personalities, including custom ones. All scalar instructions are executed on the scalar processor. The scalar instruction set includes instructions for program control (branches), context saves, scalar arithmetic, load/store, and move instructions for the set of A and S registers (which reside on the scalar processor). The instruction set also includes a large set of vector instructions that are offloaded to the vector personalities (if present).

The Convey ISA features a virtualized register set. The three register sets (scalar, address, and vector) are of arbitrary size because the hardware dynamically maps user registers to physical registers at runtime. This also applies to each vector register's length and the vector stride for the load/store units, both of which can be dynamically changed by the software at runtime by changing the vector registers' length and stride values.

Peak Floating Point Performance
The HC-1's hardware, compiler, and just one of its vector personalities together cost approximately 10 times as much as a state-of-the-art dual-socket Xeon-based Dell PowerEdge server, or as a rack-mounted four-GPU Nvidia Tesla server, despite the fact that each of these systems has approximately the same physical footprint. In my lab, my research group has one of each of these systems, which allows for convenient cost-performance comparisons. We ran a series of simple tests to pit our HC-1 against our Dell PowerEdge R710 with dual Xeon 5520 processors, which use the Nehalem architecture and were Intel's state-of-the-art server processor architecture from March 2009 to March 2010. This product was recently superseded by the Xeon 5600-series (Westmere), which is a technology-scaled version of the same architecture. This PowerEdge server is attached to our Nvidia Tesla S1070, containing four Tesla GPUs. The Tesla has also recently been superseded by the Fermi.

We designed a series of tests to measure both raw performance and ease of programming. To estimate the systems' peak floating point performance, we targeted dense single-precision general matrix–matrix multiply (SGEMM) from the level-three BLAS library, because an equivalent platform-optimized implementation of this function is available in Intel's math kernel library (MKL), Nvidia's CUDA basic linear algebra subprograms (CUBLAS) library (http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/CUBLAS_Library_3.0.pdf), and the Convey math library (CML). Specifically, we tested the operation C = AB, where A and B are square matrices.

Table 1 shows the effective Gflops/s for each test, where we measure Gflops as 2 × order³ / time, on an unloaded system. The time includes I/O time for the Tesla and HC-1. We ran each test only once rather than averaging over a large set of runs because these results are intended to be illustrative only.

Table 1. Level-3 BLAS performance, Nehalem Xeon vs. Tesla vs. HC-1: single-precision general matrix–matrix multiply (Gflops/s).

Matrix order    Dual Xeon 5520           Nvidia Tesla S1070        HC-1 coprocessor
                (MKL, Intel C 11.1)      (CUBLAS, Nvidia C 3.1)    (CML 1.2.2, Convey C 2.0.0)
8,000           110                      347                       75
10,000          126                      348                       76
12,000          136                      355                       76
14,000          140                      363                       75
16,000          140                      378                       76
Average         130                      358                       76

The Intel results reflect the use of all eight processor cores (two sockets, each with a four-core CPU) and an SSE4.2 vector unit for each core. The Nehalem system achieved an average throughput of approximately 130 Gflops/s. This is reasonable, because each of the eight cores has an SSE unit that can perform four multiplies and four adds per cycle at 2.26 GHz, giving a theoretical peak of 145 Gflops/s without considering any effects of the memory system. The GPU-based system showed an average throughput of approximately 358 Gflops/s. The HC-1 achieved an average throughput of 76 Gflops/s.

These performance metrics don't look encouraging for the HC-1, especially given that both the Nehalem and the Tesla GT200 GPUs are already previous-generation architectures, while the HC-1 is still current generation. Convey admits that the peak throughput of the HC-1 is "nearly 80 Gflops/s" based on its coprocessor memory bandwidth, so these results indicate that the HC-1 is more capable of achieving throughput closer to its peak than the Xeon.

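Working through the peak figures quoted above: 8 cores × 8 flops per cycle (four multiplies plus four adds) × 2.26 GHz ≈ 145 Gflops/s for the dual Xeon 5520, of which the measured 130 Gflops/s is roughly 90 percent; against Convey's stated peak of nearly 80 Gflops/s, the HC-1's measured 76 Gflops/s is roughly 95 percent. Both ratios follow directly from the numbers already given.
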
However, these performance results are given by heavily hand-optimized BLAS routines. In our next set of performance tests, we explored the performance of the Intel and Convey vectorizing compilers when given non-optimized high-level code.


Power Consumption
The tested machines are powered by a power distribution unit that is capable of measuring the total current being drawn with a granularity of one amp. Although this is obviously an inaccurate method for testing power consumption, it allows us to make rough approximations.

While running the SGEMM tests, the PowerEdge alone drew 3 amps, indicating a 360-watt consumption, and thus achieved 360 Mflops/watt. During the Tesla SGEMM test, the PowerEdge and Tesla together drew 6 amps (720 watts) and thus achieved approximately 500 Mflops/watt. During the HC-1 SGEMM test, the HC-1 alone drew 6 amps (720 watts) and thus achieved approximately 100 Mflops/watt. These results indicate that the Tesla actually wins in performance per watt and the HC-1 comes in third, which runs contrary to popular opinion regarding the power efficiency of GPUs versus FPGAs. This indicates that there might be inefficiencies in the HC-1's system design.

Convey Compiler
Convey has developed a vectorizing C and FORTRAN compiler based on Open64 (www.open64.net) that can target the scalar processor coupled with the vector personalities.

To determine how well the Intel and Convey vector architectures lend themselves to automatic compiler vectorization of naïvely written, (mostly) architecture-oblivious, and (mostly) non-hand-optimized code, we wrote a simple three-loop implementation of matrix multiply, compiled this code with the maximum optimization settings with both the Intel and Convey compilers, and then compared the resulting performance on their corresponding platforms with that of their corresponding BLAS implementations.

For the Intel version, we parallelized the outermost loop with OpenMP (using the parallel for directive), which distributed the loop across 16 threads at runtime, fully utilizing the eight cores with two-way symmetric multithreading. Also, from prior experience we know that the Intel load/store units perform best with vector strides of one—that is, floating point values can only be loaded directly into the streaming single-instruction multiple-data extensions (SSE) extended multimedia (XMM) registers from consecutive memory locations to be candidates for vectorization. The compiler also provides detailed feedback to the programmer, reporting exactly which loops are vectorized and what type and number of vector instructions are used in the generated code.

For the Convey compiler to vectorize our code, we had to apply a minor transformation, using one loop nest to initialize the result matrix to zero, followed by a second loop nest that performs the matrix multiply by computing the inner products and adding each into the entries of the result matrix. To be fair, we also applied this transformation to the Intel code, but it resulted in a slight slowdown, so we didn't use it for the Intel tests. In the HC-1 C code, both loops together are marked for coprocessor execution:

#pragma cny array(cm[size][size])
#pragma cny array(am[size][size])
#pragma cny array(bm[size][size])

#pragma cny begin_coproc
for (i = 0; i < size; i++)
  for (j = 0; j < size; j++)
    cm[i][j] = 0.0;
for (i = 0; i < size; i++)
  for (j = 0; j < size; j++)
    for (k = 0; k < size; k++)
      cm[i][j] += am[i][k] * bm[k][j];
#pragma cny end_coproc
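
For comparison, here is a minimal sketch of the Intel/OpenMP variant described above—a naïve three-loop single-precision matrix multiply with the outermost loop parallelized by an OpenMP parallel for. It illustrates the approach only; it is not the exact source used for the Table 2 measurements.

#include <stdlib.h>

/* naive SGEMM: C = A * B, all matrices n x n, row-major, no blocking */
void naive_sgemm(int n, const float *a, const float *b, float *c)
{
    #pragma omp parallel for              /* outermost loop split across threads */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

int main(void)
{
    int n = 512;
    float *a = malloc(sizeof(float) * n * n);
    float *b = malloc(sizeof(float) * n * n);
    float *c = malloc(sizeof(float) * n * n);
    if (!a || !b || !c) return 1;
    for (int i = 0; i < n * n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    naive_sgemm(n, a, b, c);
    free(a); free(b); free(c);
    return 0;
}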


Figure 2. Screen examples from Spat, Convey’s toolset for assisting programmers in tuning their code. (a) A plot depicting the utilization of the processor subsystems versus clock cycle during a loop execution. (b) An interactive trace of the instruction stream, showing the processor’s internal state during a specific clock cycle.

For the GPU, there's no equivalent way to write architecture-oblivious code. However, Nvidia's CUDA software development kit (SDK) includes a relatively simple matrix multiply that parallelizes the matrix multiply using a simple blocking technique. We measured this implementation's performance (not allowing a "kernel warmup" and including the host-GPU I/O time, which the code doesn't incorporate in its own instrumentation) and included these results for discussion.

Table 2 shows the test results. The Intel implementation achieves 8 to 10 percent of its MKL performance using the naïvely written code, while the HC-1 outperforms the Intel implementation and achieves 20 to 24 percent of its CML performance. These results indicate the HC-1 has more potential for extracting performance from, and automatically parallelizing, floating point linear algebra kernels that aren't mapped directly into BLAS routines. The CUDA SDK code achieves 48 to 54 percent of its peak performance, but (as noted earlier) this code is explicitly parallelized by Nvidia, unlike the Intel and Convey code, so it's not a fair comparison.

Table 2. Compiler effectiveness for optimizing naïve code: simple three-loop matrix multiplication (Gflops/s).

Xeon 5520 C code            Xeon 5520 C code               Nvidia CUDA SDK      HC-1 C code
SSE4.2/OMP w/ICC 11.1       SSE4.2/OMP w/ICC 11.1          matrixMul routine    single-precision vector personality
(row major × row major)     (row major × column major)
1 (<1% peak)                11 (10% peak)                  189 (54% peak)       15 (21% peak)
1 (<1% peak)                11 (9% peak)                   190 (54% peak)       15 (20% peak)
1 (<1% peak)                11 (8% peak)                   189 (53% peak)       16 (21% peak)
1 (<1% peak)                11 (8% peak)                   184 (51% peak)       16 (21% peak)
1 (<1% peak)                10 (8% peak)                   180 (48% peak)       15 (24% peak)

Convey Simulator and Performance Analysis Tool
To help developers get the most performance out of their code, Convey also offers a simulator and corresponding performance analysis tool called "Spat" that graphically plots how various aspects of the code map to the architecture and can assist in code tuning.

As Figure 2a shows, the information is presented as a plot of clock cycle versus usage of various architectural features. The tool can also graphically depict detailed state information for various units within the scalar and vector processors (see Figure 2b). This information lets users step across clock cycles and witness how the system executes various instructions. The figure's plots originate from my handwritten assembly-language implementation of the matrix multiplier, with which I attempted to outperform the compiler-generated implementation. After approximately one day's effort, I was able to match only the compiled code's performance, which speaks well of the Convey compiler.

Memory-Intensive Applications
The HC-1's real strength is memory-centric applications—applications that require nonconsecutive memory access strides.1 Our experimental results are evidence of this; but to demonstrate, I offer results from a benchmark designed to stress memory systems.


The Stride3 benchmark is part of Lawrence Livermore National Lab's Sequoia benchmark suite (https://asc.llnl.gov/sequoia/benchmarks) and uses a series of sequential kernels that perform double-precision floating point operations using values from two matrices at various stride distances. In our particular test, we set the matrix sizes such that they're too large to fit in the Xeon's cache.

Table 3 shows the results: the HC-1 easily outperformed the Xeon 5520 (the Stride3 benchmark is single-threaded, which might be a disadvantage for the Xeon).

Table 3. Stride3C benchmark, Xeon vs. HC-1 coprocessor (Gflops/s).

Stride             Xeon 5520 (single thread)    HC-1 w/double-precision personality
256                0.06                         4.3
512                0.05                         4.3
1024               0.05                         4.3
961                0.04                         0.1 (lowest)
992                0.06                         0.3 (2nd lowest)
8                  0.07                         4.4 (highest)
Overall average    0.05                         4.1

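To make the access pattern concrete, here is a small illustrative kernel in the spirit of the strided tests above—two source arrays read at a non-unit stride. It is not the actual Sequoia Stride3 source, just a sketch of the kind of loop such a benchmark times.

#include <stdio.h>
#include <stdlib.h>

/* y[i] += a[i*stride] * b[i*stride]: every load skips (stride - 1) elements */
static void strided_kernel(int n, int stride, const double *a, const double *b, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a[(size_t)i * stride] * b[(size_t)i * stride];
}

int main(void)
{
    const int n = 4096, stride = 961;            /* 961 is one of the strides in Table 3 */
    size_t len = (size_t)n * stride;
    double *a = malloc(len * sizeof *a);
    double *b = malloc(len * sizeof *b);
    double *y = calloc(n, sizeof *y);
    if (!a || !b || !y) return 1;
    for (size_t i = 0; i < len; i++) { a[i] = 1.0; b[i] = 2.0; }
    strided_kernel(n, stride, a, b, y);
    printf("y[0] = %f\n", y[0]);                 /* expect 2.0 */
    free(a); free(b); free(y);
    return 0;
}

With stride 961, consecutive iterations touch addresses about 7.5 Kbytes apart (961 × 8 bytes), so ordinary caches and prefetchers get almost no reuse—exactly the kind of access pattern the HC-1's scatter-gather DIMMs are designed to handle.
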
Convey has also recently developed a Smith-Waterman personality for high-throughput genomic database searches.2 The Smith-Waterman personality derives its performance from the FPGA's ability to perform comparisons on sub-byte data units (that is, 2 bits for nucleotide and 5 bits for protein data), which allows it to pack more operations per memory access than is possible with fixed-architecture CPUs and GPUs. However, the current version of the Smith-Waterman personality seems to use a simplistic variant of the Smith-Waterman algorithm in that it considers only match, mismatch, insert, and delete penalties, unlike more aggressive implementations with more complex cost models that allow different costs for opening gaps and extending gaps.

To approximate the Smith-Waterman personality's performance relative to a well-known software implementation, we ran a series of performance tests of the personality against the University of Virginia's SSearch35 version 35.04 (http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml), a highly optimized multithreaded SSE-based Smith-Waterman implementation. SSearch35 uses the slightly more complex cost model described earlier, so these implementations use a slightly different scoring model. However, both are based on the traditional dynamic programming approach to compute optimal alignment scores, and both use the Blosum substitution matrix. As before, the time values include I/O time between the host and coprocessor.

Table 4 shows the results. For the three sample database sizes, the HC-1 performs just over eight times better than the Xeon. Although these are encouraging results, it's not clear whether FPGAs will continue to maintain this lead as CPU architectures continue to scale.

Table 4. Smith-Waterman performance, Xeon vs. HC-1 coprocessor, searching a protein database with an 80-character query.

Database size (amino acids)    Xeon 5520 (multithreaded)    HC-1 w/AESW personality    HC-1 speedup
8 × 10^7                       3,073 ms                     353 ms                     8.7
4 × 10^8                       14,763 ms                    1,773 ms                   8.3
8 × 10^8                       29,754 ms                    3,589 ms                   8.3

Developing Custom Personalities
According to Convey, its target customers are primarily interested in using predesigned personalities. We purchased the system primarily as a platform for testing our research group's customized accelerator designs. We chose the HC-1 because it had four large Virtex-5 LX 330 FPGAs and because its memory-coherent host interface eliminates the extra engineering time required for DMA-based interfacing. Because I've worked with PCI-based FPGA coprocessors, working with the HC-1's memory model is much easier than having to coordinate with the host to set up explicit DMA transfers, which greatly simplifies host interfacing.

Designing custom personalities requires the use of Convey's PDK, which contains

• a set of makefiles to support simulation and synthesis design flows,
• a set of Verilog support and interface files,
• a set of simulation models for all of the coprocessor board's nonprogrammable components (such as the memory controllers and memory modules), and
• a programming-language interface (PLI) to let the host code interface with a behavioral HDL simulator such as ModelSim.

The kit's simulation framework is easy to use and allows users to switch between a simulated coprocessor and an actual coprocessor by changing only one environment variable.


CISE-12-6-Novel.indd 86 16/10/10 2:33 PM memory controllers, access to the both the Xeon and Tesla lose a sub- References coprocessor’s management processor stantial amount of memory system per- 1. J. Leidel, “Design Philosophies for for debugging support, and access to formance when loading vectors whose Memory-Centric Instruction Set Archi- the AE-to-AE links. However, the elements are not aligned properly and tectures,” presentation, Symp. Applica- wrapper requires fairly substantial re- not stored in consecutive memory lo- tion Accelerators in High Performance source overheads: 184 out of the 576 cations (Nvidia refers to such behavior Computing (SAAHPC’10), 2010; http:// 18-Kbytes block random access mem- as “non-coalesced” loads or stores). In saahpc.ncsa.illinois.edu/presentations/ ory (BRAMS) and approximately 10 addition, the FPGAs’ reconfigurable day1/session4/presentation_Leidel.pdf. percent of each FPGA’s slices. Con- nature lets the HC-1 perform opera- 2. Convey Computer, “Convey Computer vey supplies a fixed 150-MHz clock to tions on nonstandard memory units Announces Record-Breaking Smith- each FPGA’s user logic. and arbitrary precision values, making Waterman Acceleration of 172x,” Users who develop custom personal- it more efficient for applications such press release, 24 May 2010; www. ities must also develop a corresponding as sequence alignment. conveycomputer.com/Resources/ API. That is, although Convey’s com- Convey_Announces_Record_Breaking_ piler, debugger, and analysis tools can Acknowledgments Smith_Waterman_Acceleration.pdf. be used with their vector personalities, This material is based on work sup- there’s no compiler support—or tool ported by the US National Science support at all—for custom personali- Foundation under grant nos. CCF- Jason D. Bakos is an assistant professor in the ties. For example, if I were to develop a 0844951 and CCF-0915608. Thanks Department of Computer Science and Engi- custom personality to accelerate molec- to Glen Edwards, Chris Parrott, Mark neering at the University of South Carolina. ular dynamics, I’d also need to develop Kelly, John Leidel, Kirby Collins, and His research interests include computer ar- a corresponding software library that Tom Murphy of Convey Computer for chitecture, very large-scale integration (VLSI) would let users execute the accelerated answering my questions, for provid- design, and high-performance heterogeneous kernels on the AEs from their own soft- ing prerelease versions of the Convey computing. Bakos has a PhD in computer sci- ware. This library would be responsible compiler, and for providing free 30-day ence from the University of Pittsburgh. He is for interfacing with the scalar proces- licenses for the double-precision and a member of IEEE and the ACM. Contact him sor and AEs through the copcall and Smith-Waterman personalities. at [email protected]. custom instruction mechanism.

he HC-1’s FPGA-based coproces- Tsor doesn’t compete in peak float- ing point performance with Nvidia GPUs or even Intel Xeon processors, but its vector personality architecture is more flexible and allows its compiler to extract greater performance from generalized high-level code than Intel’s compiler. This is partly because the HC-1’s vector personalities and copro- cessor memory system are capable of single-instruction loads of vectors that are stored in nonconsecutive memory locations, allowing it to achieve a high- er ratio of its peak memory bandwidth relative to the Xeon and Nvidia GPUs for “strided” data. This is perhaps its greatest advantage over the Xeon and Nvidia architectures. In other words,
