Invited Paper

NeuFlow: A Runtime Reconfigurable Dataflow Processor for Vision

Clément Farabet 1,2   Berin Martini 2   Benoit Corda 1   Polina Akselrod 2   Eugenio Culurciello 2   Yann LeCun 1
1 Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
2 Electrical Engineering Department, Yale University, New Haven, CT, USA
http://www.neuflow.org

Abstract

In this paper we present a scalable dataflow hardware architecture optimized for the computation of general-purpose vision algorithms—neuFlow—and a dataflow compiler—luaFlow—that transforms high-level flow-graph representations of these algorithms into machine code for neuFlow. This system was designed with the goal of providing real-time detection, categorization and localization of objects in complex scenes, while consuming 10 Watts when implemented on a Xilinx Virtex 6 FPGA platform, or about ten times less than a laptop computer, and producing speedups of up to 100 times in real-world applications. We present an application of the system to street scene analysis, segmenting 20 categories on 500 × 375 frames at 12 frames per second on our custom hardware, neuFlow.

1. Introduction

Computer vision is the task of extracting high-level information from raw images. Generic, or general-purpose, synthetic vision systems have as their ultimate goal the elaboration of a model that maps high-dimensional data (images, videos) into a low-dimensional decision space, where arbitrary information can be retrieved easily, e.g. with simple linear classifiers or nearest-neighbor techniques. The exploration of such models has been an active field of research for the past decades, ranging from fully trainable models—such as convolutional networks—to hand-tuned models—HMAX-type architectures, as well as systems based on dense SIFT (Scale-Invariant Feature Transform) or HoG (Histograms of Gradients).

Many successful object recognition systems use dense features extracted on regularly-spaced patches over the input image. Most feature extraction systems share a common structure composed of a filter bank (generally based on oriented edge detectors or 2D Gabor functions), a non-linear operation (quantization, winner-take-all, sparsification, normalization, and/or point-wise saturation), and finally a pooling operation (max, average or histogramming). For example, the scale-invariant feature transform (SIFT [23]) operator applies oriented edge filters to a small patch and determines the dominant orientation through a winner-take-all operation. Finally, the resulting sparse vectors are added (pooled) over a larger patch to form local orientation histograms. Some recognition systems use a single stage of feature extractors [19, 7, 25]. Other models, like HMAX-type models [27, 24] and convolutional networks, use two or more layers of successive feature extractors.
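To make the generic stage just described concrete, the following is a minimal sketch of a single feature-extraction stage (filter bank, point-wise non-linearity, pooling), assuming NumPy and SciPy. It is purely illustrative and is not neuFlow or luaFlow code; the function and filter names are made up, and tanh saturation with max pooling is just one of the choices listed above.

import numpy as np
from scipy.signal import convolve2d

def feature_stage(image, filters, pool=4):
    """One generic feature-extraction stage:
    filter bank -> point-wise non-linearity -> spatial (max) pooling."""
    maps = []
    for k in filters:                            # filter bank (e.g. oriented edge filters)
        r = convolve2d(image, k, mode='valid')   # 2D convolution
        r = np.tanh(r)                           # point-wise saturation non-linearity
        h = (r.shape[0] // pool) * pool          # crop so pooling blocks tile evenly
        w = (r.shape[1] // pool) * pool
        r = r[:h, :w].reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))
        maps.append(r)                           # pooled feature map for this filter
    return np.stack(maps)                        # shape: (n_filters, H/pool, W/pool)

# Toy usage: two hand-made oriented edge filters applied to a random "image".
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
features = feature_stage(np.random.rand(64, 64), [sobel_x, sobel_x.T])
print(features.shape)   # (2, 15, 15)

A multi-stage system, such as a convolutional network, simply chains such stages, feeding the pooled maps of one stage into the filter bank of the next.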
This paper presents a scalable hardware architecture for large-scale multi-layered synthetic vision systems based on large parallel filter banks, such as convolutional networks—neuFlow—and a dataflow compiler—luaFlow—that transforms a high-level flow-graph representation of an algorithm into machine code for neuFlow. This system is a dataflow vision engine that can perform real-time detection, recognition and localization in mega-pixel images processed as pipelined streams. The system was designed with the goal of providing real-time detection, categorization and localization of objects in complex scenes, while consuming ten times less power than a laptop computer—on the order of 10 W for an FPGA implementation—and producing speedups of up to 100 times in end-to-end applications, such as the street scene parser presented in Section 3.

Graphics Processing Units (GPUs) are becoming a common alternative to custom hardware in vision applications, as demonstrated in [4]. Their advantages over custom hardware are numerous: they are inexpensive, available in most recent computers, and easily programmable with standard development kits. The main reasons for continuing to develop custom hardware are twofold: performance and power consumption. By developing a custom architecture that is fully adapted to a certain range of tasks (as is shown in this paper), the product of power consumption and performance can be improved by two orders of magnitude (100x).

Other groups are currently working on custom architectures for convolutional networks or similar algorithms: NEC Labs [2], Stanford [16], KAIST [15].

Section 2 describes neuFlow's architecture. Section 3 describes a particular application, based on a standard convolutional network. Section 4 gives results on the performance of the system. Section 5 concludes.

2. Architecture

Hierarchical visual models, and more generally image processing algorithms, are usually expressed as sequences, trees or graphs of transformations. They can be well described by a modular approach, in which each module processes an input image or video collection and produces a new collection. Figure 4 is a graphical illustration of this approach. Each module requires the previous bank to be fully (or at least partially) available before computing its output. This causality prevents simple parallelism from being implemented across modules. However, parallelism can easily be introduced within a module, and at several levels, depending on the kind of underlying operations.

Figure 1. A dataflow computer. A set of runtime configurable processing tiles are connected on a 2D grid. They can exchange data with their 4 neighbors and with an off-chip memory via global lines. Configurable elements are depicted as squares.

2.1. A Dataflow Grid

The first dataflow architectures were introduced by [1], and the field quickly became an active area of research [8, 14, 18]. [3] presents one of the latest dataflow architectures, which shares several similarities with the approach presented here: while both architectures rely on a grid of compute tiles that communicate via FIFOs, the grid presented here also provides a runtime configuration bus, which allows efficient runtime reconfiguration of the hardware (as opposed to static, offline synthesis).

Figure 1 shows a dataflow architecture that we designed to process homogeneous streams of data in parallel [9]. It is defined around several key ideas (a configuration sketch follows the list):

• a 2D grid of N_PT Processing Tiles (PTs) that contain: 1) a bank of processing operators. An operator can be anything from a FIFO to an arithmetic operator, or even a combination of arithmetic operators. The operators are connected to local data lines; 2) a routing multiplexer (MUX). The MUX connects the local data lines to global data lines or to the 4 neighboring tiles;

• a Smart Direct Memory Access module (Smart DMA) that interfaces off-chip memory and provides asynchronous data transfers, with priority management;

• a set of N_global global data lines used to connect PTs to the Smart DMA, with N_global << N_PT, and local data lines used to connect PTs with their 4 neighbors;

• a Runtime Configuration Bus, used to reconfigure many aspects of the grid at runtime—connections, operators, Smart DMA modes (the configurable elements are depicted as squares in Fig. 1);

• a controller that can reconfigure most of the computing grid and the Smart DMA at runtime.
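To make the list above concrete, here is a small, purely hypothetical Python sketch of the kind of state such a grid exposes to its controller at runtime: per-tile operator selection and MUX routing, plus Smart DMA stream descriptors. All names and fields are illustrative assumptions and do not correspond to neuFlow's actual configuration registers or to luaFlow's output format.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TileConfig:
    operator: str                                       # e.g. 'MAC', 'FIFO' (hypothetical names)
    mux_in: List[str] = field(default_factory=list)     # sources: 'N', 'S', 'E', 'W' or 'global0', ...
    mux_out: List[str] = field(default_factory=list)    # destinations for the tile's results

@dataclass
class GridConfig:
    n_pt: int                                            # number of Processing Tiles (N_PT)
    n_global: int                                        # number of global data lines (N_global << N_PT)
    tiles: Dict[int, TileConfig] = field(default_factory=dict)
    dma_streams: List[dict] = field(default_factory=list)  # off-chip <-> grid transfer descriptors

# Route one stream from off-chip memory through a single multiply-accumulate tile and back.
cfg = GridConfig(n_pt=9, n_global=2)
cfg.dma_streams.append({'dir': 'read', 'line': 'global0'})    # Smart DMA feeds the grid
cfg.dma_streams.append({'dir': 'write', 'line': 'global1'})   # Smart DMA collects results
cfg.tiles[0] = TileConfig('MAC', mux_in=['global0'], mux_out=['global1'])

The point of the runtime configuration bus is that switching between operations only requires rewriting small descriptors of this kind, rather than re-synthesizing a circuit offline.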
2.2. On Runtime Reconfiguration

One of the most interesting aspects of this grid is its configuration capabilities. Many systems have been proposed that are based on two-dimensional arrays of processing elements interconnected by a reconfigurable routing fabric. Field Programmable Gate Arrays (FPGAs), for instance, offer one of the most versatile grids of processing elements. Each of these processing elements—usually a simple look-up table—can be connected to any of the other elements of the grid, which provides the most generic routing fabric one can think of. Due to the simplicity of the processing elements, the number that can be packed in a single package is on the order of 10^4 to 10^5. The drawback is the reconfiguration time, which is on the order of milliseconds, and the synthesis time, which is on the order of minutes to hours depending on the complexity of the circuit.

At the other end of the spectrum, recent multicore processors implement only a few powerful processing elements (on the order of 10s to 100s). For these architectures, no synthesis is involved; instead, extensions to existing programming languages are used to explicitly describe parallelism. The advantage of these architectures is their relative simplicity of use: the implementation of an algorithm rarely takes more than a few days, whereas months are required for a typical circuit synthesis for FPGAs.

The architecture presented here is in the middle of this spectrum. Building a fully generic dataflow computer is a tedious task. Reducing the range of applications to the computation of visual models, vision systems and image processing pipelines

address on the network. Groups of similar modules also share a broadcast address, which dramatically speeds up reconfiguration of elements that need to perform similar tasks.

A typical execution of an operation on this system is the following: (1) the control unit configures each tile to be