Soft Vector Processors with Streaming Pipelines

A. Severance†,‡    J. Edwards†    H. Omidian‡    G. Lemieux†,‡
[email protected]    [email protected]    [email protected]    [email protected]
†VectorBlox Computing Inc., Vancouver, BC    ‡Univ. of British Columbia, Vancouver, BC

ABSTRACT

Soft vector processors (SVPs) achieve significant performance gains through the use of parallel ALUs. However, since ALUs are used in a time-multiplexed fashion, this does not exploit a key strength of FPGA performance: pipeline parallelism. This paper shows how streaming pipelines can be integrated into the datapath of a SVP to achieve dramatic speedups. The SVP plays an important role in supplying the pipeline with high-bandwidth input data and storing its results using on-chip memory. However, the SVP must also perform the housekeeping tasks necessary to keep the pipeline busy. In particular, it orchestrates data movement between on-chip memory and external DRAM, it pre- or post-processes the data using its own ALUs, and it controls the overall sequence of execution. Since the SVP is programmed in C, these tasks are easier to develop and debug than using a traditional HDL approach. Using the N-body problem as a case study, this paper illustrates how custom streaming pipelines are integrated into the SVP datapath and presents multiple techniques for generating them. Using a custom pipeline, we demonstrate speedups over 7,000 times and performance-per-ALM over 100 times better than Nios II/f. The custom pipeline is also 50 times faster than a naive Intel Core i7 processor implementation.

[Figure 1: System view of VectorBlox MXP. The MXP core and a Nios II/f host (with I$ and D$) connect over the Avalon bus to DDR2/DDR3 memory; custom pipelines P1-P4 attach to the scratchpad and DMA engine through conduit (C) connections, alongside Avalon master (M) and slave (S) ports.]

1. INTRODUCTION

Although capable of high performance, FPGAs are also difficult to program. Exploiting both wide and deep custom pipelines typically requires a hardware designer to build a custom system in VHDL or Verilog. Recently, the emergence of ESL tools such as Vivado HLS and Altera's OpenCL compiler allows software programmers to produce FPGA designs using C or OpenCL. However, since all ESL tools translate a high-level algorithm into an HDL, they share common drawbacks: changes to the algorithm require lengthy FPGA recompiles, recompiling may run out of resources (e.g., logic blocks) or fail to meet timing, debugging support is very limited, and high-level algorithmic features such as dynamic memory allocation and recursion are unavailable. Hence, ESL tools are not the most effective way to make FPGAs accessible to software programmers.

Another approach to supporting programmers is to provide a soft vector processor (SVP). A SVP achieves high performance through wide data parallelism, efficient looping, and prefetching. The main advantages of a SVP are scalable performance and a traditional software programming model. Performance scaling is achieved by adding more ALUs, but beyond a certain point (e.g., 64 ALUs) the increases in parallelism are eroded by clock frequency degradation. This ultimately limits performance scaling.

To increase the performance of SVPs even further, they must also harness deep pipeline parallelism. For example, most types of encryption (such as AES) need deep pipelines to get significant speedup. Although processors (and SVPs) are not built to exploit deep pipeline parallelism, FPGAs support it very well.

The natural questions then become: How can a SVP be interfaced with deep pipelines to exploit both wide and deep parallelism? What kind of performance can be achieved? What types of problems arise, and how can they be solved? To investigate these questions, we developed the VectorBlox MXP, shown in Figure 1, and devised a way to add deep custom pipelines to the processor, shown in Figure 2. The interface is kept as simple as possible so that software programmers can eventually develop these custom pipelines using C; we also show that a simple high-level synthesis tool can be created for this purpose.

[Figure 2: Internal view of VectorBlox MXP. DMA and vector work queues feed instruction decode and address generation; alignment networks A and B supply operands to the ALUs and custom pipelines P1-P4, and alignment network C aligns results for writeback to the scratchpad.]

To demonstrate speedups, we selected the N-body gravity problem as a case study. In this problem, each body exerts an attractive force on every other body, resulting in an O(N²) computation. The size and direction of the force between two bodies depend upon their two masses as well as the distance between them. Solving the problem requires square root and divide, neither of which are native operations in the MXP. Hence, we start by implementing simple custom instructions for the reciprocal and square root operations. Then, we implement the entire gravity equation as a deep pipeline.

It is important to note that we view the problem of adding floating-point units (FPUs) as a special case of adding deep, custom pipelines. FPUs are large in area, so many applications will not need one FPU for every integer ALU. In fact, many floating-point applications would find a single f-p divide unit or f-p square root unit to be sufficient, along with several f-p adders and f-p multipliers. Also, FPUs are deeply pipelined, so they need very long vectors to keep their pipelines utilized, especially when many units are instantiated in parallel. So, for both area and performance reasons, the programmer should control the number of integer units separately from the number of f-p adders, the number of f-p multipliers, etc. Hence, this paper is not just proposing a method for connecting specialized pipelines, but also a method for connecting general floating-point operators to SVPs.

The main contribution of this work is a modular way of allowing users to add streaming pipelines into SVPs as custom vector instructions to get huge speedups. On the surface, this appears to be a simple extension of the way custom instructions are added to scalar CPUs such as Nios II. However, there are unique challenges in streaming data from multiple operands in a SVP. Also, scalar CPU custom instructions are often data starved, limiting their benefits. We show that SVPs can provide high-bandwidth data streaming to properly utilize custom instructions.
2. BACKGROUND

Vector processing has a long tradition in high-performance computing, with designs originating in the 1960s. The canonical example (and the first commercially successful one) is the Cray-1 [1]. The Cray-1 used a RISC-like load/store model, processing vector operands from a vector register file (VRF) of 8 named registers that were each 64 bits wide and 64 elements deep. Vector operations were streamed through the execution units at a rate of one element per clock cycle.

2.1 FPGA-based Soft Vector Processors

The VIRAM [2] project demonstrated that vector processing could be more efficient than traditional processor architectures for embedded multimedia ASIC designs. Following this path, two FPGA-based projects implemented a VIRAM-like soft vector processor in FPGAs: VESPA [3] and VIPERS [4]. All three processors employed a hybrid vector-SIMD model, where vectors are streamed sequentially over time through replicated (parallel) execution units.

VESPA included support for heterogeneous vector lanes [5], e.g. fewer multipliers than general-purpose ALUs. Due to the mismatch between vector register file width and execution unit width, a parallel load queue was used to buffer a vector for heterogeneous operations, and a separate output queue was used to buffer results before writeback. This required additional memory and multiplexers. In contrast, this paper uses pre-existing alignment networks and requires no additional buffering to solve the width mismatch problem for 2-input/1-output custom instructions. We show how to add custom vector instructions that require more operands using minimal additional buffering.

VEGAS [6] and VENICE [7] are refinements of the VIPERS processor, further tailoring the architecture for FPGAs. Improvements include replacing the VRF with a scratchpad memory to allow for arbitrary data packing and access, removing vector length limits, enabling sub-word SIMD (four packed bytes or two packed shorts) within each lane, simplifying conditionals and flags to fit within FPGA BRAMs, and using a DMA-based memory interface rather than a traditional vector load/store approach.

Work by Cong et al. [8] created composable vector units. At compilation time, the DFG of a vector program was examined for clusters of operations that can be composed together to create a new streaming instruction that uses multiple operators and operands. This was done by chaining together existing functional units using an interconnection network and a multi-ported register file. This is similar to traditional vector chaining, but it was resolved statically by the compiler (not dynamically by the architecture) and encoded into the instruction stream. This provided pipeline parallelism, but was limited by the number of available operators and available register file ports. It is not easily extended to support wide SIMD-style parallelism. The reported speedups were less than a factor of two.

The FPVC, or floating-point vector coprocessor, was developed by Kathiara and Leeser [9]. It adds a floating-point vector unit to the hard Xilinx PowerPC cores which can exploit SIMD parallelism as well as pipeline parallelism. The FPVC fetches its own VIRAM-like instructions and has its own private register file. Unlike most other vector architectures, it can also execute its own scalar operations separate from the host PowerPC.

Convey's HC-2 [10] is a vector computer built using several FPGAs. The FPGAs can adopt one of several 'personalities', each of which provides a domain-specific vector instruction set. User-developed personalities are also possible. Designed for high-performance computing, the machine includes a high-bandwidth, highly interleaved, multi-bank DRAM array to reduce strided access latency.
2.2 VectorBlox MXP

The VectorBlox MXP [11], or MXP for short, is a new SVP that somewhat resembles VENICE [7]. It is designed for embedded systems that use a simple memory system. Figure 1 gives a high-level system view of MXP on Altera FPGAs; there is also a Xilinx version.

The host processor, Nios II/f, runs C code compiled with Altera's gcc. MXP instructions are inserted as Nios custom instructions using inline C functions. There are two types of MXP instructions: DMA operations and vector operations. Nios, DMA and vector operations all run concurrently, providing 3-way parallelism. Hardware interlocks resolve dependencies between the DMA and vector engines without software intervention.

In addition, the user can add custom operators or pipelines by connecting them in Altera's Qsys tool using a 'conduit' interface connection. These conduits are indicated in Figure 1 using the letter 'C', as opposed to Avalon masters (M) or slaves (S).

Figure 2 provides an internal view of the processor. The 2D DMA engine fetches data from the external Avalon system, typically from a DRAM controller. This efficiently copies data from external DRAM to a scratchpad memory built using on-chip block RAM. The scratchpad is double-clocked to provide four access ports. The DMA engine uses one of these ports; the vector engine uses the other three to fetch operands and store results. Internally, the scratchpad is organized as a wide, byte-addressable, multibanked memory to provide concurrent, high-bandwidth data access.

Source and destination vectors are specified by pointers into the scratchpad. These vectors can be unaligned with respect to each other. Data alignment networks ensure that any starting address can be issued to any port. Unaligned vectors naturally occur with sliding-window algorithms such as convolutions; at each iteration the starting location of the vector to be processed advances by one element. For these cases MXP has three separate alignment networks, as shown in Figure 3: networks A and B align the source operands to the ALUs, while network C aligns the writeback results. Data is read out from the scratchpad in waves, where one wave is a full-width set of data that requires just one clock cycle. Back-to-back waves form a data stream.

[Figure 3: Alignment of data operands during vector instruction execution.]
Networks A and B shift the data to ensure that the first vector element of operands A and B, respectively, appears at the 'top' position of the wave, regardless of which bank it is actually stored in. Successive data elements are also shifted, forming a contiguous, fully packed wave. The waves stream through the ALUs, which are followed by an optional summation stage (for accumulation reductions), and finally into network C to align the writeback. Figure 3 shows an alignment example where the first wave of 4 elements is read out from two vectors and written back to a third vector, where all three vectors have differing alignments. Execution of a vector instruction on all 8 elements fits into two waves and fills two back-to-back pipeline cycles. A following vector instruction can issue its own operands with completely different alignments and have its waves packed back-to-back with the preceding instruction.

The regular integer ALUs for MXP are located between the front-end alignment networks A and B and the back-end alignment network C. These are not directly shown in Figure 2, but they should be assumed to coexist with the 'custom pipelines' stage. The regular ALUs have a 3-stage pipeline, while custom pipelines can be of arbitrary length and have an arbitrary number of internal operators. The figure shows four different custom pipelines in the processor.

The number of parallel lanes (scratchpad banks/execution units) is configurable at synthesis time, but software does not need to be rewritten for different configurations thanks to the vector paradigm. Vector operations may execute over multiple cycles depending on the vector length and number of parallel lanes. Multiple instructions may be in the pipeline at a time. When hardware detects a hazard, e.g. when a vector instruction attempts to read a value currently in the pipeline, pipeline bubbles are inserted until the values are written back to the scratchpad.

The MXP natively supports integer and fixed-point data types. Every instruction can operate on bytes, shorts, or 32-bit integers in either signed or unsigned mode. Operations include traditional ALU instructions, including multiply, rotate and shift, plus a move and several conditional-move instructions. Also, fixed-point multiply can be done with a fixed decimal position. Note that divide and modulo operations and floating-point data types are missing, as these require complex logic.
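To make the programming model concrete, the sketch below shows the shape of a typical MXP program: DMA data into the scratchpad, run one vector instruction over it, then DMA the result out. Apart from vbx_set_vl(), which also appears later in Figure 8, the vbx_* helper names and the VVW/VADD opcode tokens are hypothetical stand-ins for the MXP API, written only to illustrate the flow described above.

    #include <stdint.h>

    /* Assumed API for illustration; only vbx_set_vl() appears in the paper. */
    extern int32_t *vbx_sp_malloc( int bytes );
    extern void vbx_dma_to_vector( void *sp_dst, void *ddr_src, int bytes );
    extern void vbx_dma_from_vector( void *ddr_dst, void *sp_src, int bytes );
    extern void vbx_set_vl( int vl );
    extern void vbx( int type, int op, int32_t *d, int32_t *a, int32_t *b );
    extern void vbx_sync( void );
    enum { VVW, VADD };             /* assumed opcode tokens */

    #define N 1024

    void vector_add_example( int32_t *ddr_a, int32_t *ddr_b, int32_t *ddr_c )
    {
        /* Allocate three vectors in the on-chip scratchpad. */
        int32_t *v_a = vbx_sp_malloc( N * sizeof(int32_t) );
        int32_t *v_b = vbx_sp_malloc( N * sizeof(int32_t) );
        int32_t *v_c = vbx_sp_malloc( N * sizeof(int32_t) );

        /* DMA runs concurrently with Nios; hardware interlocks order
         * the vector operation after its input DMAs complete. */
        vbx_dma_to_vector( v_a, ddr_a, N * sizeof(int32_t) );
        vbx_dma_to_vector( v_b, ddr_b, N * sizeof(int32_t) );

        vbx_set_vl( N );                  /* all N words in one instruction */
        vbx( VVW, VADD, v_c, v_a, v_b );  /* one wave per cycle through lanes */

        vbx_dma_from_vector( ddr_c, v_c, N * sizeof(int32_t) );
        vbx_sync();                       /* wait for DMA and vector engines */
    }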
3. CUSTOM VECTOR INSTRUCTIONS

3.1 Minimal Core Instructions

The VectorBlox MXP was designed to have a minimal core instruction set. It is very important to keep this instruction set minimal in a SVP because the area required by an operation will be replicated in every vector lane, thus multiplying its cost. In MXP, multiplication/shift/rotate instructions are included as core instructions because they share use of the hard multipliers in the FPGA fabric. However, divide and modulo are not included as core instructions because they require more than 3 pipeline stages and more logic than all other operators combined.

In addition to the minimal core instruction set, there is a large number of simple, stateless 1- or 2-input, single-output operators that could be created. This includes arithmetic (e.g. divide, modulo, reciprocal, square root, reciprocal square root), bit manipulation (e.g. population count, leading-zero count, bit reversal), and encryption acceleration (e.g. s-box lookup, byte swapping, finite field arithmetic). Some of these operators require very little logic, while others demand a significant amount. However, supporting all such operators is prohibitively expensive in FPGAs. Also, it is hard to imagine a single application that makes use of almost all of these specialized instructions. Finally, even when they are used by an application, they may not appear very frequently in the dynamic instruction mix. For this reason, we have excluded many operations that could potentially be useful, but that we felt are too specialized. Instead, we have devised a way for users to add these operations to the processor using custom vector instructions. Also, we allow the user to decide how many of these operators should be added to the pipeline, since it may not make sense to replicate large operators in every lane.

3.2 Custom Vector Instructions (CVIs)

For applications that require more complex operations, where software emulation is too difficult or slow, users can add their own application-specific custom vector instructions, also known as CVIs, into the pipeline.

Our add-on CVI approach is different from the VESPA approach [3], which allows selective instruction subsetting from a master instruction set. Subsetting allows very fine-grained control, but it typically saves only a small amount of logic. Because no core set of instructions is defined, it also leads to derivative SVPs with incompatible instruction sets. In contrast, the MXP approach defines a core set of instructions to increase software portability, while user-specified CVIs can be added to accelerate application-specific operations that are rarely needed by other applications. CVIs use an external conduit port interface to MXP, allowing the addition to be done without modifying the processor source HDL. This modularity will also make it easier to add CVIs using run-time reconfiguration. The user can create CVIs or take advantage of our current library of CVIs, which includes count leading zeroes, compress, divide, square root, and prefix sum operations.
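Invoking a library CVI from software then looks like any other vector instruction. In the sketch below, VCUSTOM0 and the 1D vbx() dispatch call are assumed names, patterned after the VCUSTOM1 opcode and vbx_3D() call that appear later in Figure 8:

    /* Element-wise fixed-point divide via a hypothetical library CVI. */
    void divide_vectors( int n, int32_t *v_q, int32_t *v_n, int32_t *v_d )
    {
        vbx_set_vl( n );                      /* vector length, in elements */
        vbx( VVW, VCUSTOM0, v_q, v_n, v_d );  /* v_q[i] = v_n[i] / v_d[i]   */
    }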

3.3 CVI Interface

A typical CVI is executed in MXP like a standard arithmetic instruction, with two source operands and one destination operand. The main difference is that data is sent out of MXP, through the conduit to the CVI and back again, then multiplexed back into the MXP pipeline before writeback. One individual CVI may consist of many parallel execution units, processing data in both a parallel and a streaming fashion.

In Altera's Qsys environment, CVIs are implemented as Avalon components with a specific conduit interface. Vector data and control signals are exported out of the top level of MXP and connected to the CVI automatically through the conduit. The conduit interface seen by a CVI designer is shown in Figure 4.

[Figure 4: Examples of custom vector instructions: a) a difference-squared custom instruction operating within lanes; b) a prefix-sum custom instruction communicating across lanes. Each lane receives operands A and B plus opsize, opcode, start, and valid signals, and returns result C with per-lane write enables.]

The left side (Figure 4a) shows an example of a simple CVI, the 'difference-squared' operation. This CVI does the same action on all data, so its individual execution units are simply replicated across the number of CVI lanes. In the simplest case, the number of CVI lanes will match the number of MXP lanes. This is adequate if the CVI is small, or there is plenty of area available. Later, we will consider the area-limited case when there must be fewer CVI lanes than MXP lanes.

The right side (Figure 4b) shows a more complicated example, a prefix sum, where data is communicated across lanes. The prefix sum calculates a new vector that stores the running total of the input vector appearing on operand A. Even if expressed as a tree, its complexity scales faster than O(N). This makes it a very different type of operation from the difference-squared operation, which has no communication between lanes. As a result, computing a prefix sum is a difficult operation for wide vector engines; it is best implemented in a streaming fashion. Since a vector may be longer than the width of the SVP, it is important to accumulate the value across multiple clock cycles in time. To support this, the CVI interface provides a clock and a vector start signal. Furthermore, a data valid signal indicates when each wave of input data is provided, and individual data-enable input signals (not shown for clarity) are provided for each lane.

Additional signals in the CVI interface include an opsize signal (2 bits) that indicates byte/short/word data. Also, output byte-enable signals allow writing back only partial vector data, implementing conditional-move operations, or writing back a last (incomplete) wave. Finally, an opcode field is provided to allow the selection of multiple CVIs. Alternatively, the opcode can be passed to a single CVI and used as a mode-select for different functions, such as sharing logic for divide and modulo, or implementing different rounding modes. The opcode field is shown as two bits, but this can be easily extended.

We have found this interface capable of implementing a wide array of CVIs. We are considering extending it with a few additional control signals, such as signed/unsigned, scalar load, and pipeline status information, but we wish to keep the interface as simple as possible.
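To make the interface concrete, the following C sketch models one wave of the two Figure 4 examples. The struct layout is our own invention for illustration, not the actual conduit definition; it simply names the signals described above.

    #include <stdint.h>
    #include <stdbool.h>

    #define LANES 4

    /* Hypothetical model of one wave on the CVI conduit. */
    typedef struct {
        bool     start;               /* first wave of a new vector instruction  */
        bool     valid;               /* this wave carries data                  */
        bool     lane_enable[LANES];  /* per-lane data-enable                    */
        uint8_t  opsize;              /* 0 = byte, 1 = short, 2 = word           */
        uint8_t  opcode;              /* selects between CVIs or modes           */
        int32_t  a[LANES], b[LANES];  /* operands after alignment networks A, B  */
        int32_t  c[LANES];            /* results, fed to alignment network C     */
    } cvi_wave_t;

    /* Figure 4a: difference-squared; no communication between lanes. */
    void cvi_diff_squared( cvi_wave_t *w )
    {
        for( int lane = 0; lane < LANES; lane++ )
            if( w->valid && w->lane_enable[lane] ) {
                int32_t d = w->a[lane] - w->b[lane];
                w->c[lane] = d * d;
            }
    }

    /* Figure 4b: prefix sum; the running total crosses lanes and must
     * persist across waves, so it is reset only on vector start. */
    void cvi_prefix_sum( cvi_wave_t *w )
    {
        static int32_t acc;
        if( w->start ) acc = 0;
        for( int lane = 0; lane < LANES; lane++ )
            if( w->valid && w->lane_enable[lane] ) {
                acc += w->a[lane];
                w->c[lane] = acc;
            }
    }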
3.4 Heterogeneous Lane Support

Custom operators may be prohibitively large to add to each vector lane. For example, a fully pipelined Q16.16 fixed-point divider requires 2,652 ALMs to implement in a Stratix IV FPGA. This is more logic than an entire vector lane in the MXP. Thus, it can be desirable to use fewer dividers than the number of lanes (depending upon the proportion of divides in the dynamic instruction mix). MXP supports using narrower CVIs with minimal overhead by reusing existing address generation and data alignment logic.

[Figure 5: Custom vector instructions with fewer lanes than the SVP: a) funneling elements 0, 1, 2 through three custom ALUs; b) funneling elements 3, 4, 5 through three custom ALUs.]

Figure 5 shows how CVIs with a different number of lanes are added to the existing MXP datapath. During normal operation, the address generation logic increments each address by the width of the vector processor each cycle. For the example shown, i.e. an MXP with four 32-bit lanes (written as a 'V4'), each source and destination address is incremented by 16 bytes, until the entire vector is processed. As discussed in Section 2, the input alignment networks align the start of both source vectors to lane 0 before the data is processed by the ALUs. After execution, the destination alignment network is used to align the result to the correct bank in the scratchpad for writeback.

During a CVI, the address generation logic only increments the source and destination addresses by the number of CVI lanes times the 4-byte width of each lane. In the example shown, the addresses are incremented by 12 bytes for each wave, regardless of the SVP width. As in normal execution, the alignment networks still align source data to start at lane 0 before data is processed in the custom ALUs. In this case, the fourth lane would not contain any data, so its data-enable input would be inactive. After execution, the CVI result is muxed back into the main MXP pipeline, and finally the resulting data is aligned for writeback into the scratchpad. On CVI writeback, the output byte enables are used to write out data for only the first 12 bytes of each wave; the destination alignment network then repositions the wave to the correct target address.
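The resulting dispatch arithmetic is small enough to state directly. The constants below restate the V4/three-lane example in C; this is our own illustration, not MXP source code:

    /* Per-wave address increments for a V4 MXP (4 lanes x 4 bytes) driving
     * a 3-lane CVI, as in Figure 5. */
    #define SVP_LANES   4
    #define CVI_LANES   3
    #define LANE_BYTES  4

    /* Normal vector instruction: advance one full SVP width per wave. */
    int normal_incr = SVP_LANES * LANE_BYTES;          /* 16 bytes */

    /* Narrow CVI: advance only by the CVI width per wave. */
    int cvi_incr    = CVI_LANES * LANE_BYTES;          /* 12 bytes */

    /* Number of waves needed for a vector of vl word elements. */
    int cvi_waves( int vl ) { return (vl + CVI_LANES - 1) / CVI_LANES; }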

3.5 CVIs with Deep Pipelines

MXP uses an in-order, stall-free backend for execution and writeback to achieve high frequencies. The CVIs are inserted in parallel to the regular 3-stage execution pipeline of MXP, which means they can also have 3 internal register stages. If fewer stages are needed, the CVI must be padded to 3 stages. This is usually sufficient for combinations of small operators, bit twiddling, and reductions.

Some operations, such as divide or floating point, require much deeper pipelines. If the user naively creates a pipeline that is longer than 3 cycles, the first wave of data would appear at the writeback stage later than the writeback address, and the last wave of data would not reach the writeback stage at all.

To address the latter problem, we have devised a very simple strategy for inserting long pipelines. In software, we extend the vector length to account for the additional pipeline stages (minus the 3 normal stages). This solves part of the problem, allowing the last wave of vector data to get flushed out of the pipeline and appear at the writeback stage. During the last cycles, the pipeline will read data past the end of the input operands, but those results will never be written back. However, the beginning of the output vector will contain garbage results.

To eliminate this waste of space, the MXP could simply delay the writeback address by the appropriate number of clock cycles. However, another way is to allow the CVI itself to specify its destination address for each wave. This requires the MXP to inform the CVI of the destination addresses, and to rely upon the CVI to delay them appropriately. We have chosen this latter technique, as it allows for more complex operations where the write address needs to be controlled by the CVI, such as vector compression. Because such a CVI can write to arbitrary addresses, any CVI using this mode must set a flag which tells the SVP to flush its pipeline after the CVI has completed.
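In software, the vector-length extension amounts to a one-line adjustment before issuing the CVI. The helper below is our own sketch of that bookkeeping, assuming one extra wave (i.e. 'lanes' extra word elements) is needed per pipeline stage beyond the normal three:

    #define MXP_EXEC_STAGES 3

    extern void vbx_set_vl( int vl );

    /* Pad the vector length so the last wave of a deep CVI pipeline is
     * flushed through to writeback (sketch; depth = CVI pipeline depth). */
    void set_vl_for_deep_cvi( int num_elements, int lanes, int depth )
    {
        int extra_waves = 0;
        if( depth > MXP_EXEC_STAGES )
            extra_waves = depth - MXP_EXEC_STAGES;
        vbx_set_vl( num_elements + extra_waves * lanes );
    }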

4. MULTI-OPERAND CVI

The CVIs described in the previous section are intended for 1 or 2 input operands and 1 destination operand. This can be useful for some applications, but it is not very flexible. Certainly, the DAGs of most large compute kernels require multiple inputs and outputs, and require both scalar and vector operands. In this section, we describe how to support multiple-input, multiple-output CVIs. As a motivating example, we have chosen the N-body gravitational problem. We modified the problem slightly to produce a pleasing visual demonstration: we restrict calculations to only 2 dimensions, we use a repelling force rather than an attracting force, and we allow collisions with the screen boundary.

4.1 N-Body Problem

The traditional N-body problem simulates a 3D universe, where each celestial object is a body, or particle, with a fixed mass. Over time, the velocity and position of each particle are updated according to interactions with other particles and the environment. In particular, each particle exerts a net force (i.e., gravity) on every other particle. The computational complexity of the basic all-pairs approach we use is O(N²). Although advanced methods exist to reduce this time complexity, we do not explore them here.

In our modified version, we consider a 2D screen rather than a 3D universe. The screen is easier to render than a 3D universe, but it also has boundaries. Also, we change the sign of gravity so that objects repel each other, rather than attract. (Attractive forces with screen boundaries would result in the eventual collapse into a moving black hole, which is not visually appealing.) Like the traditional N-body problem, we also treat particles as point masses, i.e. there are no collisions between particles. We have also adjusted the gravitational constant to produce visually pleasing results.

The run-time of the N-body simulation is dominated by the gravity force calculation, shown below:

    \vec{F}_{i,j} = G \frac{M_i M_j}{r^2} = 0.0625 \frac{M_i M_j}{|\vec{P}_i - \vec{P}_j|^3} (\vec{P}_i - \vec{P}_j)

where \vec{F}_{i,j} is the force particle i imposes on particle j, \vec{P}_i is the position of particle i, and M_i is the size or 'mass' of particle i. When computing these forces, we chose a Q16.16 fixed-point representation, where the integer component of \vec{P} represents a pixel location on the screen. When a particle reaches the display boundary, its position and velocity are adjusted to reflect off the edge (towards the center) after removing some energy from the particle. These checks do not dominate the run-time as they are only O(N).
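The O(N) update step itself is not listed in the paper; a plausible per-axis sketch, using the same Q16.16 conventions as the code in Figure 12 and an assumed damping factor, is:

    #include <stdint.h>

    typedef int32_t f16_t;             /* Q16.16 fixed point */
    #define DAMP_NUM 3                 /* assumed: keep 3/4 of velocity */
    #define DAMP_DEN 4                 /* on each bounce                */

    /* Our own illustrative sketch: Euler integration plus boundary
     * reflection with energy loss, applied once per particle per axis. */
    void update_particle( f16_t *px, f16_t *vx, f16_t fx,
                          f16_t min_x, f16_t max_x )
    {
        *vx += fx;                     /* force premultiplied by dt/mass */
        *px += *vx;
        if( *px < min_x || *px > max_x ) {
            *vx = -(*vx) * DAMP_NUM / DAMP_DEN;   /* reflect, lose energy */
            *px = (*px < min_x) ? min_x : max_x;  /* clamp to the edge    */
        }
    }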



Figure 6: Force summation pipeline

An implementation of the gravity computation as a streaming pipeline is shown in Figure 6. This is a fixed-point pipeline with 74 stages; the depth is dominated by the fixed-point square root and division operators at 16 and 48 cycles, respectively. (We used Altera's LPM primitives for these operators. The pipeline would benefit from a combined reciprocal square root operator, but one does not exist in the Altera library.) For each particle, its x position, y position, and mass (premultiplied by the gravitational constant) are loaded into scalar data registers within the instruction. This is the reference particle. Then, three vectors representing the x position, y position and mass of all particles are streamed through the vector pipeline. The pipeline integrates the forces exerted by all these particles, and computes a net force on the reference particle.

Overall, the pipeline requires 3 scalar inputs (reference particle properties) and 3 vector inputs (all other particles). It also produces 2 vector outputs (an x vector and a y vector), although the output vectors are of length 1 because of the accumulators at the end of the pipeline. Hence, this gravity pipeline is a 3-input, 2-output CVI.

All of the MXP vector instructions, including the custom type, have only 2 inputs and 1 output. This is a limitation in the software API, where only 2 inputs and 1 output can be specified, as well as in the hardware dispatch, where only two source vector addresses and one destination vector address can be issued. Hence, programming a CVI with an arbitrary number of inputs and outputs requires a different way of looking at things.

Loading of scalar data can be accomplished by using vector operations with length 1, and either using an opcode bit to select scalar loading versus vector execution, or by fixed ping-ponging between scalar loading and vector execution. We use the ping-pong approach to save opcodes.

Supporting multiple vector operands is not as simple, however, and will be discussed below.

[Figure 7: Multi-operand custom vector instructions: a) wide multi-operand streaming requires interleaved data; b) deep multi-operand streaming datapaths can avoid interleaved data.]

4.2 Multi-Operand CVI Dispatch

The wide approach requires data to be laid out spatially, such that operand A appears as vector element 0, operand B appears as vector element 1, and so forth. This is shown in Figure 7a. In other words, the operands are laid out consecutively in memory as if packed into a C structure. To stream these operands as vectors, an array-of-structs (AoS) is created. Ideally, the input operands would precisely fit into the first wave; with two read ports, the amount of input data would be twice the vector engine width. If more input data is required, then multiple waves will be required, which is similar to the depth approach below. If less input data is required, then the CVI does not need to span the entire width of the SVP. In this case, it may be possible to provide multiple copies of the pipeline to add SIMD-level parallelism to the CVI.

The main drawback of the wide approach is that the data must be interleaved into an AoS. In our experience, SVPs work better when data is arranged as a struct-of-arrays (SoA). The SoA layout assures that each data item is in its own array, so SVP instructions can operate on contiguously packed vectors.

For example, suppose image data is interleaved into an AoS as {r,g,b} triplets. With this organization, it is difficult to convert the data to {y,u,v} triplets because each output data item requires a different equation. When the image data is blocked as in a SoA, it is easy to compute the {y} matrix based upon the {r}, {g}, and {b} matrices. Furthermore, converting between AoS and SoA on the fly requires data copying and can take a long time. Hence, it is better for regular SVP instructions to use the SoA format.
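In C terms, the two layouts look as follows (our own illustration, using the particle data of this case study):

    #include <stdint.h>

    /* Array-of-structs (AoS): operands interleaved in memory, as the
     * wide dispatch approach of Figure 7a requires. */
    struct particle_aos { int32_t x, y, m; };
    struct particle_aos particles[8192];

    /* Struct-of-arrays (SoA): each operand contiguous, so ordinary vector
     * instructions can stream x[], y[], and m[] directly. */
    struct particles_soa {
        int32_t x[8192];
        int32_t y[8192];
        int32_t m[8192];
    };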


[Figure 9: Multi-operand custom vector instructions with funnel adapters.]

An alternative depth approach to multiple-operand CVIs requires data to be interleaved in time. This is shown in Figure 7b, where a streaming datapath only has access to two physical ports, operands A and B of one vector lane. This can be combined with wide parallelism by replicating the deep pipeline. It is not desirable to simply read two input vectors in full and then read the third input, though, as the CVI would have to buffer the full length of the instruction. In MXP, vector lengths are limited only by the size of the scratchpad, so the buffering could be costly. Rather, it is desirable to buffer only a single cycle's worth of inputs.

We accomplish this in MXP by using its 2D and 3D instruction dispatch to issue a single wavefront of data from each input on alternating cycles. The 2D instructions work by first executing a normal (1D) vector instruction, then applying a different stride to each of the input addresses and the output address and repeating this operation multiple times. The strides and repetitions can be set at runtime using a separate set-2D instruction. The 3D instructions are an extension of this, where 2D instructions are repeated using another set of strides.

Figure 8 illustrates how these 2D/3D operations are used to dispatch CVIs with multiple operands. In this example, a CVI with 4 inputs (A1, A2, B1, and B2) and 2 outputs (D1 and D2) is to be executed. The desired result is that the CVI will alternate between A1/B1 and A2/B2 inputs each cycle, and alternate between D1/D2 outputs each cycle.

To get this outcome, first the 1D vector length (VL) is set to the number of CVI lanes, and the 2D strides are set to the difference between input addresses (A2−A1, B2−B1) and output addresses (D2−D1). Since the inner vector length is the same as the number of custom instruction lanes, each row is dispatched as one wave in a single cycle, followed by a stride to the next input. The 2D vector length is set to the total number of cycles required (max(inputs/2, outputs/1)). Note that if more than 2 cycles (4 inputs or 2 outputs) are needed, sets of additional inputs and outputs will need to be laid out with a constant stride from each other.

Since a 2D operation merely alternates between sets of inputs (and outputs), a 3D instruction is used to stream through the arrays of data. Each 2D instruction processes one wavefront (of CVI lanes) worth of data, so the 3D instruction is set to stride by the number of CVI lanes. The number of these iterations (the 3D length) is set to the data length divided by the number of CVI lanes.

[Figure 8: Using 3D vector operations for multi-operand dispatch.]

Example parameters:

    VL   = 2  (number of elements to keep together)
    num1 = 2  (number of arrays to interleave)
    num2 = 5  (number of elements / VL = 10 / VL)
    srcAStride1 = (v_A2 - v_A1)        = 0x184
    srcAStride2 = (v_A1[VL] - v_A1[0]) = 0x08
    srcBStride1 = (v_B2 - v_B1)        = 0x814
    srcBStride2 = (v_B1[VL] - v_B1[0]) = 0x08

Interleaved read-out order:

    Cycle   Operand A   Operand B
      1     A1 00 01    B1 00 01
      2     A2 00 01    B2 00 01
      3     A1 02 03    B1 02 03
      4     A2 02 03    B2 02 03

Setup routine (top):

    vbx_set_vl( VL );
    vbx_set_2D( num1, dstStride1, srcAStride1, srcBStride1 );
    vbx_set_3D( num2, dstStride2, srcAStride2, srcBStride2 );
    vbx_3D( VVW, VCUSTOM1, v_D1, v_A1, v_B1 );

Abstracted call (middle):

    vbx_interleave_4_2( VVW, VCUSTOM1, num_elem, VL,
                        v_D1, v_D2, v_A1, v_A2, v_B1, v_B2 );

One possible implementation (bottom):

    void vbx_interleave_4_2( int TYPE, int INSTR, int NE, int VL,
                             int8 *v_D1, int8 *v_D2,
                             int8 *v_A1, int8 *v_A2,
                             int8 *v_B1, int8 *v_B2 )
    {
        vbx_set_vl( VL );
        vbx_set_2D( 2, v_D2 - v_D1, v_A2 - v_A1, v_B2 - v_B1 );
        vbx_set_3D( NE/VL, v_D1[VL] - v_D1[0],
                           v_A1[VL] - v_A1[0],
                           v_B1[VL] - v_B1[0] );
        vbx_3D( TYPE, INSTR, v_D1, v_A1, v_B1 );
    }

In Figure 8, the complex setup routine (top) can be abstracted away into a single function call, vbx_interleave_4_2() (middle). One possible implementation of this call is shown at the bottom of the figure.

On the hardware side, data is presented in wavefronts and needs to be multiplexed into a pipeline. Because a new set of inputs only arrives every max(inputs/2, outputs) cycles, the pipeline would be idle part of the time if it had the same width and clock rate as the CVI interface. We can recover the lost performance, and save area, by interleaving two or more logical streams into one physical pipeline. To do this, we have created 'funnel adapters' which accept the spatially distributed wave and feed it to the pipeline over time. This is illustrated in Figure 9. The funnel adapter for our 3-input, 2-output particle physics pipeline, which has inputs arriving every 2 cycles (and outputs leaving every 2 cycles), allows two MXP lanes worth of data to share a single physical streaming pipeline.
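The behaviour of a funnel adapter can be modelled in C as follows. This is our own software model of the hardware block, with the widths chosen to match a 4-lane MXP feeding a 2-lane pipeline:

    #include <stdint.h>

    #define LANES  4      /* MXP lanes presenting one wave       */
    #define PIPE_W 2      /* physical pipeline width after funneling */

    /* Hypothetical model: accept one spatially distributed wave, then
     * feed it to the narrower pipeline over LANES/PIPE_W cycles. */
    typedef struct {
        int32_t buf[LANES];
        int     fill;     /* elements still waiting in the buffer */
    } funnel_t;

    void funnel_accept_wave( funnel_t *f, const int32_t wave[LANES] )
    {
        for( int i = 0; i < LANES; i++ ) f->buf[i] = wave[i];
        f->fill = LANES;
    }

    /* Called once per clock: emits PIPE_W elements until the wave drains;
     * returns 0 when there is nothing left to send. */
    int funnel_emit( funnel_t *f, int32_t out[PIPE_W] )
    {
        if( f->fill == 0 ) return 0;
        for( int i = 0; i < PIPE_W; i++ )
            out[i] = f->buf[LANES - f->fill + i];
        f->fill -= PIPE_W;
        return 1;
    }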
4.3 Face Detection CVI Example

As another example, we have also designed a multiple-input/output CVI for Viola-Jones face detection. The face detection pipeline is shown in Figure 10. Unlike the gravity pipeline, face detection requires far more inputs – a total of 18 vector inputs and 1 scalar input. It produces a single vector output.

[Figure 10: Face detection pipeline.]

    /* Figure 12: C input to our HLS tool for the Q16.16 gravity pipeline. */
    #include <stdint.h>

    #define CVI_LANES 8            /* number of physical lanes */
    typedef int32_t f16_t;         /* Q16.16 fixed point */

    f16_t ref_px, ref_py, ref_gm;
    f16_t px[CVI_LANES], py[CVI_LANES], m[CVI_LANES];
    f16_t result_x[CVI_LANES], result_y[CVI_LANES];

    void force_calc()
    {
        for( int glane = 0; glane < CVI_LANES; glane++ ) {
            f16_t gmm     = f16_mul( ref_gm, m[glane] );
            f16_t dx      = f16_sub( ref_px, px[glane] );
            f16_t dy      = f16_sub( ref_py, py[glane] );
            f16_t dx2     = f16_mul( dx, dx );
            f16_t dy2     = f16_mul( dy, dy );
            f16_t r2      = f16_add( dx2, dy2 );
            f16_t r       = f16_sqrt( r2 );
            f16_t rr      = f16_div( F16(1.0), r );
            f16_t gmm_rr  = f16_mul( rr, gmm );
            f16_t gmm_rr2 = f16_mul( rr, gmm_rr );
            f16_t gmm_rr3 = f16_mul( rr, gmm_rr2 );
            f16_t dfx     = f16_mul( dx, gmm_rr3 );
            f16_t dfy     = f16_mul( dy, gmm_rr3 );
            result_x[glane] = f16_add( result_x[glane], dfx );
            result_y[glane] = f16_add( result_y[glane], dfy );
        }
    }

Figure 12: Gravity pipeline C code for HLS (retiming registers omitted for clarity)

Using regular SVP instructions, this face detection requires a total of 19 instructions, requiring 19 clock cycles per wave of data. In contrast, due to the large number of vector input operands, the face detection pipeline takes 9 clock cycles per wave of data. Hence, the best-case speedup expected from this custom pipeline is 19/9, or roughly 2 times. Even though face detection contains a large number of operators, the number of input operands limits the overall speedup. Hence, not all applications will benefit significantly from custom pipelines.

5. CVI DESIGN METHODOLOGIES

While implementing a CVI to accelerate a SVP program is much easier than writing a complete accelerator, implementing one in HDL is not desirable for our target users, software programmers. Hence, we have explored two alternatives for generating CVI pipelines.

5.1 Altera's DSP Builder Pipelines

Altera's DSP Builder [12] (ADSPB) is a block-based toolset integrating into Matlab and Simulink to allow for push-button generation of RTL code. Figure 11 shows a floating-point version of our physics pipeline implemented in ADSPB. ADSPB was able to create the entire pipeline, including accumulation units, and the design effort was significantly less than manually building the fixed-point version in VHDL.

[Figure 11: FLOAT custom vector pipeline in Altera's DSP Builder.]

Although we were not able to create a fixed-point version of our pipeline in ADSPB because it lacks fixed-point reciprocal and square root, we were happy to generate a floating-point version as an additional data point of interest.

Some glue logic was needed to integrate the pipeline into a CVI, however, because our CVI pipelines require a clock enable signal, which ADSPB-generated logic does not have. Rather than attempt to modify the generated code (including the libraries used), we built a FIFO buffer to retime data appropriately, which adds minimal logic and uses one additional M9K memory per lane. This glue logic is sufficiently generic to allow any ADSPB-generated pipeline to be integrated into a CVI.

5.2 High Level Synthesis

Additionally, we have started to develop a High-Level Synthesis (HLS) tool to implement CVIs in C. The goal is to show that only very simple HLS features are required to produce functional CVIs. For this reason, we started with bare LLVM rather than a more advanced HLS tool such as LegUp [13].

The user writes a function in C that produces the desired dataflow behaviour. This function has a standard API interface that matches the physical CVI interface shown in Figure 4. The input and output data are presented as global variables, and the user reads and writes to these variables to achieve the desired streaming behaviour. At the moment, our compiler recognizes specific global variable names, but this can be modified to use pragmas.

Like LegUp, our compiler converts C code into Verilog RTL output. However, since we are only attempting to generate dataflows, not complex sequential behaviour, only simple translation steps are required. For example, we unroll all loops as if they are for-generate hardware statements.

In our current implementation, the user manually inserts pipeline registers for retiming using a special function, reg(). For example, a_3 = reg(a,3); tells our LLVM compiler to create signal a_3 after adding 3 pipeline stages. Apart from this function, our C code is naturally readable by a software programmer. Later, we plan to add automatic retiming heuristics.

Although our compiler is limited in scope, we are able to generate Verilog that is cycle-accurate with the fixed-point gravity pipeline described earlier in this paper. A portion of our C code using a Q16.16 fixed-point data type is shown in Figure 12. For clarity, we have removed the retiming registers and not shown the fixed-point function definitions.
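For instance, a retimed operation in this style might be written as follows; the stage counts are illustrative assumptions on our part, not measured operator latencies:

    /* Illustrative HLS input in the style of Figure 12: a multiply-accumulate
     * retimed with reg().  Stage counts here are assumed for illustration. */
    f16_t mac_retimed( f16_t a, f16_t b, f16_t c )
    {
        f16_t prod = f16_mul( a, b );  /* suppose the multiplier takes 2 stages */
        f16_t c_2  = reg( c, 2 );      /* delay c by 2 stages to stay aligned   */
        return f16_add( prod, c_2 );
    }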

Table 1: Results with MXP compared to Nios II/f, Intel, and ARM processors

    Processor                        Area (ALMs)  DSP (18-bit)  Fmax (MHz)  s/frame  GigaOp/s  pairs/s  Speedup
    Nios II/f (fixed)                      1,223             4         283    231.6     0.004     0.3M      1.0
    ARM Cortex A9 (ZedBoard) (fixed)           –             –         667     52.1      0.02     1.3M      4.5
    ARM Cortex A9 (ZedBoard) (float)           –             –         667     14.0      0.07     4.8M     16.6
    Intel Core i7-2600 (fixed)                 –             –       3,400      6.5      0.15    10.3M     35.6
    Intel Core i7-2600 (float)                 –             –       3,400      1.6      0.63    41.9M    144.8
    MXP V32 (fixed)                       46,250           132         193     73.8      0.14     9.1M     31.4
    MXP V32+16FLOAT                      115,142           644         122    0.041      24.6   1,326M    5,669
    MXP V32+16FIXED                       86,642           740         153    0.032      31.3   2,087M    7,203

[Figure 13: Area of gravity pipeline systems.]

6. RESULTS

All FPGA results are obtained using Quartus II 13.0 and a Terasic DE4 development board, which has a Stratix IV GX530 FPGA and a 64-bit DDR2 interface. For comparison, Intel Core i7-2600 and ARM Cortex-A9 (from a Xilinx Zynq-based ZedBoard) performance results are shown. Both fixed-point (fixed) and floating-point (float) implementations were used. MXP natively supports fixed-point multiplication in all lanes. The Nios II/f contains an integer hardware multiplier and hardware divider; additional instructions are used to operate on fixed-point data. The Intel, ARM and Nios II versions are written from the same C source using libfixmath [14]. We developed a vectorized version of this library for use with MXP. Nios II/f and MXP results use gcc-4.1.2 with '-O2'. The Core i7 results use gcc-4.6.3 and '-O2 -ftree-vectorize -m64 -march=corei7-avx'. ARM results use gcc-4.7.2 and report the best runtime among '-O2' and '-O3'.

The MXP results vary the number of SVP lanes (V2, V8, and V32) and the number of CVI lanes. Three types of CVIs are generated: one containing separate fixed-point divide and square root instructions (DIV/SQRT), one containing a manually generated fixed-point gravity pipe (FIXED), and an ADSPB pipe (FLOAT). The LLVM pipeline results are omitted because they are nearly identical to FIXED.

Figure 13 shows the area of the gravity pipeline systems, in Adaptive Logic Modules (ALMs) on the left and DSP Block 18-bit elements on the right. The DIV/SQRT configurations take roughly the same area (in ALMs) as the FIXED pipeline. However, FIXED requires more multipliers. The FLOAT pipelines require about 5,500 ALMs and 38 DSP elements per lane versus 3,000 ALMs and 32 DSP elements per lane for FIXED.

Running the N-body problem with 8,192 particles, Figure 14 shows the speedup relative to a Nios II/f soft processor for the various MXP configurations as well as a 3.4 GHz Intel Core i7-2600 and a 667 MHz ARM Cortex A9. The processor implementations are naive, but typical of what a C programmer might start with. A highly optimized single-core AVX implementation for the i7-2600 matches our best MXP performance at 2 × 10⁹ pairs per second [15]. However, that code was painstakingly hand-written in assembly language. (The AVX-optimized version also solves the 3D problem; our 2D pipeline can be converted into a 3D version, with no expected loss in performance, with just 3 more additions and 2 more multiplications.) Overall, this is over 7,200 times faster than Nios II/f.
[Figure 14: Performance and performance-per-area of gravity pipeline.]

7. CONCLUSIONS AND FUTURE WORK

This work has demonstrated a way to reuse existing structures in SVPs to attach a variable number of streaming pipelines with minimal resource overhead. These can be accessed in software via custom vector instructions (CVIs). Logic-intensive operators, such as fixed-point divide, should not be simply replicated across all vector lanes; doing so wastes FPGA area unnecessarily. Instead, it is important to consider the frequency of use of the specialized pipeline, and add enough copies to get the most speed-up with minimal area overhead. Methods for dispatching complex CVIs were presented, including a time-interleaved method that allows an arbitrary number of inputs and outputs using funnel adapters.

The performance results achieve speedups far beyond what a plain SVP can accomplish. For example, a 32-lane SVP achieves a speedup of 31.4, whereas a CVI-optimized version is another 230 times faster, with a net speedup of 7,200 versus Nios II/f. This puts the MXP at par with an AVX-optimized Intel Core i7 implementation.

One area of future work is to make a repository of common operations and data types, such as min, max, power, and log for fixed-point, floating-point and even complex numbers. Also, we plan to continue exploring high-level synthesis options so that programmers can easily generate custom CVIs. Finally, allowing the programmer to dynamically reconfigure the CVIs will help when an FPGA must run several different applications.

As limitations of our work, some read and write bandwidth of the scratchpad is not utilized during the execution of our CVIs. Also, more advanced instruction dispatch strategies could be used to overlap execution of multiple CVIs. Alternatively, regular vector instructions could also overlap with a CVI.

8. ACKNOWLEDGMENTS

The authors would like to thank Martin Langhammer for his encouragement and advice on using Altera's ADSPB. We would also like to thank Altera for donating hardware and software licenses used in this research.

9. REFERENCES

[1] R. M. Russel, "The CRAY-1 computer system," Communications of the ACM, vol. 21, no. 1, pp. 63–72, 1978.
[2] C. Kozyrakis and D. Patterson, "Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks," in Microarchitecture, 2002, pp. 283–293.
[3] P. Yiannacouras, J. G. Steffan, and J. Rose, "VESPA: portable, scalable, and flexible FPGA-based vector processors," in CASES, 2008.
[4] J. Yu, C. Eagleston, C. H. Chou, M. Perreault, and G. Lemieux, "Vector processing as a soft processor accelerator," ACM TRETS, vol. 2, no. 2, pp. 1–34, 2009.
[5] P. Yiannacouras, J. G. Steffan, and J. Rose, "Fine-grain performance scaling of soft vector processors," in CASES, 2009, pp. 97–106.
[6] C. H. Chou, A. Severance, A. D. Brant, Z. Liu, S. Sant, and G. Lemieux, "VEGAS: Soft vector processor with scratchpad memory," in FPGA, 2011, pp. 15–24.
[7] A. Severance and G. Lemieux, "VENICE: A compact vector processor for FPGA applications," in FPT, 2012, pp. 261–268.
[8] J. Cong, M. A. Ghodrat, M. Gill, H. Huang, B. Liu, R. Prabhakar, G. Reinman, and M. Vitanza, "Compilation and architecture support for customized vector instruction extension," in ASP-DAC, 2012.
[9] J. Kathiara and M. Leeser, "An autonomous vector/scalar floating point coprocessor for FPGAs," in FCCM, 2011, pp. 33–36.
[10] "The Convey HC-2 computer: Architectural overview," http://www.conveycomputer.com/files/4113/5394/7097/Convey_HC2_Architectual_Overview.pdf.
[11] "Embedded supercomputing in FPGAs with the VectorBlox MXP matrix processor," in ACM International Conference on Hardware/Software Codesign and System Synthesis, Sept. 2013.
[12] "DSP Builder," http://www.altera.com/technology/dsp/advanced-blockset/dsp-advanced-blockset.html.
[13] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Czajkowski, "LegUp: High-level synthesis for FPGA-based processor/accelerator systems," in FPGA, 2011, pp. 33–36.
[14] "libfixmath - cross platform fixed point maths library," http://code.google.com/p/libfixmath/.
[15] A. Tanikawa, K. Yoshikawa, K. Nitadori, and T. Okamoto, "Phantom-GRAPE: numerical software library to accelerate collisionless N-body simulation with SIMD instruction set on x86 architecture," arXiv.org, Oct. 2012.