
Published in Computing in Science and Engineering, ed. Volodymyr Kindratenko and Pedro Trancoso, vol. 12, no. 6, 2010, pp. 80-87, http://ieeexplore.ieee.org/servlet/opac?punumber=5992. © 2010 IEEE.

Novel Architectures

Editors: Volodymyr Kindratenko, [email protected]; Pedro Trancoso, [email protected]

High-Performance Heterogeneous Computing with the Convey HC-1

By Jason D. Bakos

Unlike other socket-based reconfigurable coprocessors, the Convey HC-1 contains nearly 40 field-programmable gate arrays, scatter-gather memory modules, a high-capacity crossbar switch, and a fully coherent memory system.

At Supercomputing 2009, Convey Computer unveiled the HC-1, an all-in-one compute platform containing a socket-based reconfigurable coprocessor board. The HC-1 is unique in several ways. Unlike in-socket coprocessors from Nallatech (www.nallatech.com/Intel-Xeon-FSB-Socket-Fillers/fsb-development-systems.html), DRC (www.drccomputer.com/drc/modules.html), and XtremeData (www.xtremedata.com/products/accelerators/in-socket-accelerator/xd2000i)—all of which are confined to a socket-sized footprint—Convey uses a mezzanine connector to bring the front side bus (FSB) interface to a large coprocessor board roughly the size of an ATX motherboard. This coprocessor board is housed in a one-unit (1U) chassis that's fused to the top of another 1U chassis containing the host motherboard.

In addition to the machine, Convey designed a selection of accelerator designs to use with it. Some of these implement soft-core floating point vector processors for which Convey has also developed a C and FORTRAN compiler. Others, such as their Smith-Waterman sequence alignment accelerator design, include an easy-to-use interface library. This makes the HC-1's FPGAs accessible to programmers who lack the expertise or patience to design their own FPGA-based coprocessors in a hardware description language. However, realizing that the HC-1 appeals to customers who would like to do this, Convey offers support and tools accordingly.

Here, I examine the HC-1, emphasizing its system architecture, performance, ease of programming, and flexibility.

System Overview
The HC-1's host consists of a dual-socket server motherboard, an Intel 5400 memory-controller hub chipset, 24 Gbytes of RAM, a 1,066-MHz FSB, and a 2.13-GHz Intel Xeon 5138—a dual-core, low-voltage processor (the 65-nanometer Intel Core architecture released in 2006). Newer Intel Xeons based on the Nehalem or later architectures can't be used in an HC-1-like system until Convey completes the Quick Path Interconnect interface for their coprocessor board. The HC-1 host runs a 64-bit 2.6.18 Linux kernel with a modified virtual memory system to accommodate memory coherency for the coprocessor board.

Top-Level Design
Figure 1 shows the coprocessor board's design. There are four user-programmable Virtex-5 LX 330s, which Convey calls the application engines (AEs). Convey refers to a particular configuration of these field-programmable gate arrays (FPGAs) as a "personality." The four AEs each connect to eight memory controllers through a full crossbar. Each memory controller is implemented on its own FPGA and is connected to two Convey-designed scatter-gather dual inline memory modules (SG-DIMMs) containing 64 banks each and an integrated Stratix-2 FPGA. The AEs themselves are interconnected in a ring configuration with 668-Mbyte/s, full-duplex links for AE-to-AE communication. These links can be useful for multi-FPGA applications.

Memory Interleave Modes
Each AE has a 2.5-Gbyte/s link to each memory controller, and each SG-DIMM has a 5-Gbyte/s link to its corresponding memory controller. As such, the effective memory bandwidth of the AEs is dependent on their memory access pattern to the eight memory controllers and their two SG-DIMMs each. Each AE can achieve a theoretical peak bandwidth of 20 Gbytes/s when striding across eight different memory controllers, but this bandwidth would drop if two other AEs attempt to read from the same set of SG-DIMMs, because this would saturate the 5-Gbyte/s DIMM-to-memory-controller links.
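
Taking the stated link rates at face value (and assuming full overlap with no contention or protocol overhead), the peak numbers line up as follows:

per-AE peak: 8 memory controllers × 2.5 Gbytes/s = 20 Gbytes/s
DIMM-side aggregate: 16 SG-DIMMs × 5 Gbytes/s = 80 Gbytes/s
all four AEs at peak: 4 × 20 Gbytes/s = 80 Gbytes/s

In other words, the DIMM-side links can feed all four AEs at full rate only if accesses spread evenly across all 16 SG-DIMMs; any concentration on a subset of DIMMs saturates those 5-Gbyte/s links first.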


[Figure 1 diagram: sixteen 1-Gbyte SG-DIMMs connect over 5-Gbyte/s links to memory controllers MC0 through MC7, which connect over 2.5-Gbyte/s links to application engines AE0 through AE3 (Virtex-5 LX 330s) and the application engine hub on the coprocessor board; the AEs are linked by 668-Mbyte/s full-duplex connections. The host board holds a dual-core 2.13-GHz Xeon 5138 with a 4-Mbyte L2 cache, the Intel 5400 northbridge, 24 Gbytes of RAM, and the 1,066-MHz FSB.]

Figure 1. The HC-1 coprocessor board. Four application engines connect to eight memory controllers through a full crossbar. Each memory controller is implemented on its own field-programmable gate array.

Because each memory address maps only to one SG-DIMM (and its corresponding memory controller), Convey's goal when designing its memory system was to maximize the likelihood that an arbitrary set of unique memory references would be uniformly distributed across all 16 SG-DIMMs and eight memory controllers. Convey provides two user-selectable memory mapping modes to partition the coprocessor's virtual address space among the SG-DIMMs:

• Binary interleave, which maps bitfields of the memory address to a particular controller, DIMM, and bank; and
• 31-31 interleave, a modulo-31 mapping optimized for constant memory strides (stride lengths that are a power of two are guaranteed to hit all 16 SG-DIMMs for any sequence of 16 consecutive references).

The memory banks are divided into 32 groups of 32 banks each. In 31-31 interleave, one group isn't used, and one bank within each of the remaining groups isn't used. Because the number of groups and the number of banks per group are prime (31), this reduces the likelihood of strides aliasing to the same SG-DIMM—a stride can repeatedly hit the same bank only if it shares a factor with 31, which no power-of-two stride does. Selecting the 31-31 interleave comes at a cost of approximately 1 Gbyte of addressable memory space (6 percent) and a 6 percent reduction in peak memory bandwidth.

Coprocessor Memory Coherency
The coprocessor memory is coherent with the host memory and is implemented using the snoopy coherence mechanism built into the Intel FSB protocol. This essentially creates a common virtual address space that both the host and coprocessor share.

In the coherence protocol, both the host and the coprocessor possess copies of the global memory space. Each block of memory addresses in both the host memory and coprocessor memory is marked as exclusive, shared, or invalid. A write by the host to an address block will change its status to exclusive and invalidate the block on the coprocessor (indicating that it's out of date). If one of the application engines on the coprocessor reads from this block, an updated copy of the block's memory contents is sent to the coprocessor memory, and the memory block changes to shared in both the host and coprocessor memory. The coherence mechanism is transparent to the user and removes the need for explicit direct memory access (DMA) transactions, which coprocessors based on peripheral component interconnect (PCI) require.
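
The following minimal C sketch models only the block-state bookkeeping just described (exclusive/shared/invalid on each side). It is an illustration of the state transitions, not Convey's implementation, and it ignores everything the real FSB snooping hardware does.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;

typedef struct {
    block_state_t host;   /* state of the block in host memory */
    block_state_t coproc; /* state of the block in coprocessor memory */
} block_t;

/* Host writes the block: host copy becomes exclusive, coprocessor copy is invalidated. */
static void host_write(block_t *b) {
    b->host = EXCLUSIVE;
    b->coproc = INVALID;
}

/* An application engine reads the block: if its copy is stale, an updated copy
 * is (conceptually) transferred and both sides end up shared. */
static void ae_read(block_t *b) {
    if (b->coproc == INVALID) {
        b->host = SHARED;
        b->coproc = SHARED;
    }
}

int main(void) {
    block_t b = { SHARED, SHARED };
    host_write(&b);
    printf("after host write: host=%d coproc=%d\n", b.host, b.coproc);
    ae_read(&b);
    printf("after AE read:    host=%d coproc=%d\n", b.host, b.coproc);
    return 0;
}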


Host Interface
The coprocessor board contains two non-user-programmable FPGAs that together form the application engine hub (AEH). One FPGA serves as the physical interface between the coprocessor board and the FSB, and its logic monitors the FSB to maintain the snoopy memory coherence protocol and manages the coprocessor memory's page table. This FPGA is actually mounted on the mezzanine connector.

The second AEH FPGA contains the scalar processor, a soft-core processor that implements the base Convey instruction set. The scalar processor is a substantial architecture, including a cache and features such as multiple-issue out-of-order execution, branch predication, and sliding register windows.

The scalar processor is the mechanism by which the host invokes computations on the AEs. In Convey's programming model, the AEs act as coprocessors to the scalar processor, while the scalar processor acts as a coprocessor for the host CPU. To facilitate this, the binary executable file on the host (Intel processor) contains integrated scalar processor code (using a "fat binary" linker format), which is transferred to and executed on the scalar processor when the host code calls a scalar processor routine through one of Convey's runtime library calls (a similar mechanism is employed on GPUs). The scalar processor code can contain instructions that are dispatched and executed (that is, offloaded) onto the AEs.

Code for the scalar processor can be generated by one of Convey's compilers or handwritten in assembly language. After compilation and assembly, the scalar processor code is linked into the executable in the ctext linker section. Upon execution, the host code can invoke scalar processor routines using the synchronous and asynchronous copcall API functions. The host CPU can also use this mechanism to send parameters to and receive status information from the AEs.

The scalar processor is connected to each AE via a point-to-point link, and uses this link to dispatch to the AEs those instructions that aren't entirely implemented on the scalar processor. Instruction examples include

• move instructions for exchanging data between the scalar processor and AEs; and
• custom AE instructions, which consist of 32 unimplemented instructions that can be used to invoke user-defined AE behaviors.

Through the AE's dispatch interface, AE logic can also trigger exceptions and implement memory synchronization behaviors.

Personalities
Convey develops and licenses its own set of personalities but also allows users to develop their own using the personality development kit (PDK). Convey has established a global numeric identifier system for personalities and maintains a publicly accessible registration database for these identifiers, evidently in the hope of fostering a marketplace for custom personalities.

Convey's "stock" personalities are individually licensed and are each designed for specific application types. Currently, the set includes a single-precision vector personality, double-precision vector personality, financial analytics personality, and Smith-Waterman personality.

The two vector personalities act as vector coprocessors for the scalar processor and are targets for Convey's vectorizing compiler. When using these personalities, each AE implements eight floating-point multiply-add pipelines and eight load/store units (for a total of 32 of each, logically combined across the four AEs).

The financial analytics personality is a double-precision personality that adds additional vector instructions, transcendental functions, probability distribution functions, and various random number generators designed for high-performance Monte Carlo simulation. In addition to the compiler, the vector and financial personalities also have robust debuggers, simulators, and performance analyzers. The single-precision vector personality also has the Convey math library (CML), a corresponding hand-optimized basic linear algebra subroutines (BLAS) implementation. The Smith-Waterman personality is a parameterized, scalable processing element and is built around the Convey Sequence Library, a customized API.

As mentioned earlier, users who wish to develop their own personalities with HDL-based design must license the PDK, which includes design flows and robust system models that support hardware/software co-simulation.
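
To make the host-side call flow described above concrete, here is a self-contained sketch. The article doesn't give the exact signatures of Convey's synchronous and asynchronous copcall functions, so copcall_sync below is a hypothetical stand-in implemented as a plain-CPU stub; on the real machine the invoked routine would be linked into the ctext section and run on the scalar processor, possibly dispatching custom AE instructions.

#include <stdio.h>

struct saxpy_args { int n; float a; const float *x; float *y; };

/* Stands in for a routine the Convey compiler would place in the fat binary;
 * on the HC-1 it would execute on the scalar processor and AEs. */
static void saxpy_cp(void *p)
{
    struct saxpy_args *args = p;
    for (int i = 0; i < args->n; i++)
        args->y[i] += args->a * args->x[i];
}

/* Hypothetical synchronous dispatch: the real copcall would transfer control
 * to the scalar processor and block until the coprocessor work finishes. */
static int copcall_sync(void (*routine)(void *), void *args)
{
    routine(args);
    return 0;
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    struct saxpy_args args = {4, 2.0f, x, y};
    copcall_sync(saxpy_cp, &args);
    printf("y[3] = %f\n", y[3]);   /* expect 8.0 */
    return 0;
}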


Convey Instruction Set Architecture
Convey developed its own entirely new instruction set architecture from the ground up. The Convey ISA includes a scalar instruction set that's common to all personalities, including custom ones. All scalar instructions are executed on the scalar processor. The scalar instruction set includes instructions for program control (branches), context saves, scalar arithmetic, load/store, and move instructions for the set of A and S registers (which reside on the scalar processor). The instruction set also includes a large set of vector instructions that are offloaded to the vector personalities (if present).

The Convey ISA features a virtualized register set. The three register sets (scalar, address, and vector) are of arbitrary size because the hardware dynamically maps user registers to physical registers at runtime. This also applies to each vector register's length and the vector stride for the load/store units, both of which can be dynamically changed by the software at runtime by changing the vector registers' length and stride values.

Peak Floating Point Performance
The HC-1's hardware, compiler, and just one of its vector personalities together cost approximately 10 times as much as a state-of-the-art dual-socket Xeon-based Dell PowerEdge server, or as a rack-mounted four-GPU Nvidia Tesla server, despite the fact that each of these systems has approximately the same physical footprint. In my lab, my research group has one of each of these systems, which allows for convenient cost-performance comparisons. We ran a series of simple tests to pit our HC-1 against our Dell PowerEdge R710 with dual Xeon 5520 processors, which use the Nehalem architecture and were Intel's state-of-the-art server processor architecture from March 2009 to March 2010. This product was recently superseded by the Xeon 5600-series (Westmere), which is a technology-scaled version of the same architecture. This PowerEdge server is attached to our Nvidia Tesla S1070, containing four Tesla GPUs. The Tesla has also recently been superseded by the Fermi.

We designed a series of tests to measure both raw performance and ease of programming. To estimate the systems' peak floating point performance, we targeted dense single-precision general matrix–matrix multiply (SGEMM) from the level-three BLAS library, because an equivalent platform-optimized implementation of this function is available in Intel's math kernel library (MKL), Nvidia's CUDA basic linear algebra subprograms (CUBLAS) library (http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/CUBLAS_Library_3.0.pdf), and the Convey math library (CML). Specifically, we tested the operation C = AB, where A and B are square matrices.

Table 1 shows the effective Gflops/s for each test, where we measure Gflops as 2 × order³ / time, on an unloaded system. The time includes I/O time for the Tesla and HC-1. We ran each test only once rather than averaging over a large set of runs because these results are intended to be illustrative only.

Table 1. Level-3 BLAS performance, Nehalem Xeon vs. Tesla vs. HC-1: single-precision general matrix–matrix multiply (Gflops/s).

Matrix order    Dual Xeon 5520           Nvidia Tesla S1070        HC-1 coprocessor
                (MKL, Intel C 11.1)      (CUBLAS, Nvidia C 3.1)    (CML 1.2.2, Convey C 2.0.0)
8,000           110                      347                       75
10,000          126                      348                       76
12,000          136                      355                       76
14,000          140                      363                       75
16,000          140                      378                       76
Average         130                      358                       76

The Intel results reflect the use of all eight processor cores (two sockets, each with a four-core CPU) and an SSE4.2 vector unit for each core. The Nehalem system achieved an average throughput of approximately 130 Gflops/s. This is reasonable, because each of the eight cores has an SSE unit that can perform four multiplies and four adds per cycle at 2.26 GHz, giving a theoretical peak of 145 Gflops/s without considering any effects of the memory system. The GPU-based system showed an average throughput of approximately 358 Gflops/s. The HC-1 achieved an average throughput of 76 Gflops/s.

These performance metrics don't look encouraging for the HC-1, especially given that both the Nehalem and the Tesla GT200 GPUs are already previous-generation architectures, while the HC-1 is still current generation. Convey admits that the peak throughput of the HC-1 is "nearly 80 Gflops/s" based on its coprocessor memory bandwidth, so these results indicate that the HC-1 is more capable of achieving throughput closer to its peak than the Xeon.

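Working through the peak figures quoted above: 8 cores × 8 flops per cycle (four multiplies plus four adds) × 2.26 GHz ≈ 145 Gflops/s for the dual Xeon 5520, of which the measured 130 Gflops/s is roughly 90 percent; against Convey's stated peak of nearly 80 Gflops/s, the HC-1's measured 76 Gflops/s is roughly 95 percent. Both ratios follow directly from the numbers already given.
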
However, these performance results are given by heavily hand-optimized BLAS routines. In our next set of performance tests, we explored the performance of the Intel and Convey vectorizing compilers when given non-optimized high-level code.


Power Consumption
The tested machines are powered by a power distribution unit that is capable of measuring the total current being drawn with a granularity of one amp. Although this is obviously an inaccurate method for testing power consumption, it allows us to make rough approximations.

While running the SGEMM tests, the PowerEdge alone drew 3 amps, indicating a 360-watt consumption, and thus achieved 360 Mflops/watt. During the Tesla SGEMM test, the PowerEdge and Tesla together drew 6 amps (720 watts) and thus achieved approximately 500 Mflops/watt. During the HC-1 SGEMM test, the HC-1 alone drew 6 amps (720 watts) and thus achieved approximately 100 Mflops/watt. These results indicate that the Tesla actually wins in performance per watt and the HC-1 comes in third, which runs contrary to popular opinion regarding the power efficiency of GPUs versus FPGAs. This indicates that there might be inefficiencies in the HC-1's system design.

Convey Compiler
Convey has developed a vectorizing C and FORTRAN compiler based on Open64 (www.open64.net) that can target the scalar processor coupled with the vector personalities.

To determine how well the Intel and Convey vector architectures lend themselves to automatic compiler vectorization of naïvely written, (mostly) architecture-oblivious, and (mostly) non-hand-optimized code, we wrote a simple three-loop implementation of matrix multiply, compiled this code with the maximum optimization settings with both the Intel and Convey compilers, and then compared the resulting performance on their corresponding platforms with that of their corresponding BLAS implementations.

For the Intel version, we parallelized the outermost loop with OpenMP (using the parallel for directive), which distributed the loop across 16 threads at runtime, fully utilizing the eight cores with two-way symmetric multithreading. Also, from prior experience we know that the Intel load/store units perform best with vector strides of one—that is, floating point values can only be loaded directly into the streaming single-instruction multiple-data extensions (SSE) extended multimedia (XMM) registers from consecutive memory locations to be candidates for vectorization. The compiler also provides detailed feedback to the programmer, reporting exactly which loops are vectorized and what type and number of vector instructions are used in the generated code.

For the Convey compiler to vectorize our code, we had to apply a minor transformation, using one loop nest to initialize the result matrix to zero, followed by a second loop nest that performs the matrix multiply by computing the inner products and adding each into the entries of the result matrix. To be fair, we also applied this transformation to the Intel code, but it resulted in a slight slowdown, so we didn't use it for the Intel tests. In the HC-1 C code, both loops together are marked for coprocessor execution:

#pragma cny array(cm[size][size])
#pragma cny array(am[size][size])
#pragma cny array(bm[size][size])

#pragma cny begin_coproc
for (i = 0; i < size; i++)
  for (j = 0; j < size; j++)
    cm[i][j] = 0.0;
for (i = 0; i < size; i++)
  for (j = 0; j < size; j++)
    for (k = 0; k < size; k++)
      cm[i][j] += am[i][k] * bm[k][j];
#pragma cny end_coproc
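
For comparison, here is a minimal sketch of the Intel/OpenMP variant described above—a naïve three-loop single-precision matrix multiply with the outermost loop parallelized by an OpenMP parallel for. It illustrates the approach only; it is not the exact source used for the Table 2 measurements.

#include <stdlib.h>

/* naive SGEMM: C = A * B, all matrices n x n, row-major, no blocking */
void naive_sgemm(int n, const float *a, const float *b, float *c)
{
    #pragma omp parallel for              /* outermost loop split across threads */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

int main(void)
{
    int n = 512;
    float *a = malloc(sizeof(float) * n * n);
    float *b = malloc(sizeof(float) * n * n);
    float *c = malloc(sizeof(float) * n * n);
    if (!a || !b || !c) return 1;
    for (int i = 0; i < n * n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
    naive_sgemm(n, a, b, c);
    free(a); free(b); free(c);
    return 0;
}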


Figure 2. Screen examples from Spat, Convey’s toolset for assisting programmers in tuning their code. (a) A plot depicting the utilization of the processor subsystems versus clock cycle during a loop execution. (b) An interactive trace of the instruction stream, showing the processor’s internal state during a specific clock cycle.

For the GPU, there's no equivalent way to write architecture-oblivious code. However, Nvidia's CUDA software development kit (SDK) includes a relatively simple matrix multiply that parallelizes the matrix multiply using a simple blocking technique. We measured this implementation's performance (not allowing a "kernel warmup" and including the host-GPU I/O time, which the code doesn't incorporate in its own instrumentation) and included these results for discussion.

Table 2 shows the test results. The Intel implementation achieves 8 to 10 percent of its MKL performance using the naïvely written code, while the HC-1 outperforms the Intel implementation and achieves 20 to 24 percent of its CML performance. These results indicate the HC-1 has more potential for extracting performance from, and automatically parallelizing, floating point linear algebra kernels that aren't mapped directly into BLAS routines. The CUDA SDK code achieves 48 to 54 percent of its peak performance, but (as noted earlier) this code is explicitly parallelized by Nvidia, unlike the Intel and Convey code, so it's not a fair comparison.

Table 2. Compiler effectiveness for optimizing naïve code: simple three-loop matrix multiplication (Gflops/s).

Xeon 5520 C code            Xeon 5520 C code               Nvidia CUDA SDK      HC-1 C code
SSE4.2/OMP w/ICC 11.1       SSE4.2/OMP w/ICC 11.1          matrixMul routine    single-precision vector personality
(row major × row major)     (row major × column major)
1 (<1% peak)                11 (10% peak)                  189 (54% peak)       15 (21% peak)
1 (<1% peak)                11 (9% peak)                   190 (54% peak)       15 (20% peak)
1 (<1% peak)                11 (8% peak)                   189 (53% peak)       16 (21% peak)
1 (<1% peak)                11 (8% peak)                   184 (51% peak)       16 (21% peak)
1 (<1% peak)                10 (8% peak)                   180 (48% peak)       15 (24% peak)

Convey Simulator and Performance Analysis Tool
To help developers get the most performance out of their code, Convey also offers a simulator and corresponding performance analysis tool called "Spat" that graphically plots how various aspects of the code map to the architecture and can assist in code tuning.

As Figure 2a shows, the information is presented as a plot of clock cycle versus usage of various architectural features. The tool can also graphically depict detailed state information for various units within the scalar and vector processors (see Figure 2b). This information lets users step across clock cycles and witness how the system executes various instructions. The figure's plots originate from my handwritten assembly-language implementation of the matrix multiplier, with which I attempted to outperform the compiler-generated implementation. After approximately one day's effort, I was able to match only the compiled code's performance, which speaks well of the Convey compiler.

Memory-Intensive Applications
The HC-1's real strength is memory-centric applications—applications that require nonconsecutive memory access strides.1 Our experimental results are evidence of this; but to demonstrate, I offer results from a benchmark designed to stress memory systems.


The Stride3 benchmark is part of Lawrence Livermore National Lab's Sequoia benchmark suite (https://asc.llnl.gov/sequoia/benchmarks) and uses a series of sequential kernels that perform double-precision floating point operations using values from two matrices at various stride distances. In our particular test, we set the matrix sizes such that they're too large to fit in the Xeon's cache.

Table 3 shows the results: the HC-1 easily outperformed the Xeon 5520 (the Stride3 benchmark is single-threaded, which might be a disadvantage for the Xeon).

Table 3. Stride3C benchmark, Xeon vs. HC-1 coprocessor (Gflops/s).

Stride             Xeon 5520 (single thread)    HC-1 w/double-precision personality
256                0.06                         4.3
512                0.05                         4.3
1024               0.05                         4.3
961                0.04                         0.1 (lowest)
992                0.06                         0.3 (2nd lowest)
8                  0.07                         4.4 (highest)
Overall average    0.05                         4.1

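To make the access pattern concrete, here is a small illustrative kernel in the spirit of the strided tests above—two source arrays read at a non-unit stride. It is not the actual Sequoia Stride3 source, just a sketch of the kind of loop such a benchmark times.

#include <stdio.h>
#include <stdlib.h>

/* y[i] += a[i*stride] * b[i*stride]: every load skips (stride - 1) elements */
static void strided_kernel(int n, int stride, const double *a, const double *b, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a[(size_t)i * stride] * b[(size_t)i * stride];
}

int main(void)
{
    const int n = 4096, stride = 961;            /* 961 is one of the strides in Table 3 */
    size_t len = (size_t)n * stride;
    double *a = malloc(len * sizeof *a);
    double *b = malloc(len * sizeof *b);
    double *y = calloc(n, sizeof *y);
    if (!a || !b || !y) return 1;
    for (size_t i = 0; i < len; i++) { a[i] = 1.0; b[i] = 2.0; }
    strided_kernel(n, stride, a, b, y);
    printf("y[0] = %f\n", y[0]);                 /* expect 2.0 */
    free(a); free(b); free(y);
    return 0;
}

With stride 961, consecutive iterations touch addresses about 7.5 Kbytes apart (961 × 8 bytes), so ordinary caches and prefetchers get almost no reuse—exactly the kind of access pattern the HC-1's scatter-gather DIMMs are designed to handle.
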
Convey has also recently developed a Smith-Waterman personality for high-throughput genomic database searches.2 The Smith-Waterman personality derives its performance from the FPGA's ability to perform comparisons on sub-byte data units (that is, 2 bits for nucleotide and 5 bits for protein data), which allows it to pack more operations per memory access than is possible with fixed-architecture CPUs and GPUs. However, the current version of the Smith-Waterman personality seems to use a simplistic variant of the Smith-Waterman algorithm in that it considers only match, mismatch, insert, and delete penalties, unlike more aggressive implementations with more complex cost models that allow different costs for opening gaps and extending gaps.

To approximate the Smith-Waterman personality's performance relative to a well-known software implementation, we ran a series of performance tests of the personality against the University of Virginia's SSearch35 version 35.04 (http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml), a highly optimized multithreaded SSE-based Smith-Waterman implementation. SSearch35 uses the slightly more complex cost model described earlier, so these implementations use a slightly different scoring model. However, both are based on the traditional dynamic programming approach to compute optimal alignment scores, and both use the Blosum substitution matrix. As before, the time values include I/O time between the host and coprocessor.

Table 4 shows the results. For the three sample database sizes, the HC-1 performs just over eight times better than the Xeon. Although these are encouraging results, it's not clear whether FPGAs will continue to maintain this lead as CPU architectures continue to scale.

Table 4. Smith-Waterman performance, Xeon vs. HC-1 coprocessor, searching a protein database with an 80-character query.

Database size (amino acids)    Xeon 5520 (multithreaded)    HC-1 w/AESW personality    HC-1 speedup
8 × 10^7                       3,073 ms                     353 ms                     8.7
4 × 10^8                       14,763 ms                    1,773 ms                   8.3
8 × 10^8                       29,754 ms                    3,589 ms                   8.3

Developing Custom Personalities
According to Convey, its target customers are primarily interested in using predesigned personalities. We purchased the system primarily as a platform for testing our research group's customized accelerator designs. We chose the HC-1 because it had four large Virtex-5 LX 330 FPGAs and because its memory-coherent host interface eliminates the extra engineering time required for DMA-based interfacing. Because I've worked with PCI-based FPGA coprocessors, working with the HC-1's memory model is much easier than having to coordinate with the host to set up explicit DMA transfers, which greatly simplifies host interfacing.

Designing custom personalities requires the use of Convey's PDK, which contains

• a set of makefiles to support simulation and synthesis design flows,
• a set of Verilog support and interface files,
• a set of simulation models for all of the coprocessor board's nonprogrammable components (such as the memory controllers and memory modules), and
• a programming-language interface (PLI) to let the host code interface with a behavioral HDL simulator such as ModelSim.

The kit's simulation framework is easy to use and allows users to switch between a simulated coprocessor and an actual coprocessor by changing only one environment variable.


CISE-12-6-Novel.indd 86 16/10/10 2:33 PM memory controllers, access to the both the Xeon and Tesla lose a sub- References coprocessor’s management processor stantial amount of memory system per- 1. J. Leidel, “Design Philosophies for for debugging support, and access to formance when loading vectors whose Memory-Centric Instruction Set Archi- the AE-to-AE links. However, the elements are not aligned properly and tectures,” presentation, Symp. Applica- wrapper requires fairly substantial re- not stored in consecutive memory lo- tion Accelerators in High Performance source overheads: 184 out of the 576 cations (Nvidia refers to such behavior Computing (SAAHPC’10), 2010; http:// 18-Kbytes block random access mem- as “non-coalesced” loads or stores). In saahpc.ncsa.illinois.edu/presentations/ ory (BRAMS) and approximately 10 addition, the FPGAs’ reconfigurable day1/session4/presentation_Leidel.pdf. percent of each FPGA’s slices. Con- nature lets the HC-1 perform opera- 2. Convey Computer, “Convey Computer vey supplies a fixed 150-MHz clock to tions on nonstandard memory units Announces Record-Breaking Smith- each FPGA’s user logic. and arbitrary precision values, making Waterman Acceleration of 172x,” Users who develop custom personal- it more efficient for applications such press release, 24 May 2010; www. ities must also develop a corresponding as sequence alignment. conveycomputer.com/Resources/ API. That is, although Convey’s com- Convey_Announces_Record_Breaking_ piler, debugger, and analysis tools can Acknowledgments Smith_Waterman_Acceleration.pdf. be used with their vector personalities, This material is based on work sup- there’s no compiler support—or tool ported by the US National Science support at all—for custom personali- Foundation under grant nos. CCF- Jason D. Bakos is an assistant professor in the ties. For example, if I were to develop a 0844951 and CCF-0915608. Thanks Department of Computer Science and Engi- custom personality to accelerate molec- to Glen Edwards, Chris Parrott, Mark neering at the University of South Carolina. ular dynamics, I’d also need to develop Kelly, John Leidel, Kirby Collins, and His research interests include computer ar- a corresponding software library that Tom Murphy of Convey Computer for chitecture, very large-scale integration (VLSI) would let users execute the accelerated answering my questions, for provid- design, and high-performance heterogeneous kernels on the AEs from their own soft- ing prerelease versions of the Convey computing. Bakos has a PhD in computer sci- ware. This library would be responsible compiler, and for providing free 30-day ence from the University of Pittsburgh. He is for interfacing with the scalar proces- licenses for the double-precision and a member of IEEE and the ACM. Contact him sor and AEs through the copcall and Smith-Waterman personalities. at [email protected]. custom instruction mechanism.

he HC-1’s FPGA-based coproces- Tsor doesn’t compete in peak float- ing point performance with Nvidia GPUs or even Intel Xeon processors, but its vector personality architecture is more flexible and allows its compiler to extract greater performance from generalized high-level code than Intel’s compiler. This is partly because the HC-1’s vector personalities and copro- cessor memory system are capable of single-instruction loads of vectors that are stored in nonconsecutive memory locations, allowing it to achieve a high- er ratio of its peak memory bandwidth relative to the Xeon and Nvidia GPUs for “strided” data. This is perhaps its greatest advantage over the Xeon and Nvidia architectures. In other words,
