
Lecture 5: Image Processing Hardware

Visual Computing Systems (Stanford CS348K, Fall 2018)

Image processing workload characteristics
▪ "Pointwise" operations
 - output_pixel = f(input_pixel)
▪ "Stencil" computations (e.g., convolution, demosaic, etc.)
 - Output (x,y) depends on a fixed-size local region of the input around (x,y)
▪ Lookup
 - e.g., contrast s-curve
▪ Multi-resolution operations (upsampling/downsampling)
 - e.g., building Gaussian/Laplacian pyramids
▪ Fast Fourier transform
 - We didn't talk about many Fourier-domain techniques in class (but readings had many examples)
▪ Long pipelines (DAGs) of these operations
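To make the first two categories concrete, here is a minimal C++ sketch (illustrative only; the function names are invented, not from the slides): a pointwise gamma adjustment, where each output pixel depends on exactly one input pixel, and a 3x3 box-blur stencil, where each output pixel depends on a fixed-size neighborhood of the input.

#include <algorithm>
#include <cmath>
#include <vector>

using Image = std::vector<float>;   // row-major, width * height, values in [0,1]

// "Pointwise": output_pixel = f(input_pixel)
void pointwise_gamma(Image& img, float gamma) {
    for (float& p : img)
        p = std::pow(std::clamp(p, 0.0f, 1.0f), gamma);
}

// "Stencil": out(x,y) depends on a fixed-size local region of the input around (x,y)
Image box_blur_3x3(const Image& in, int width, int height) {
    Image out(in.size(), 0.0f);
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += in[(y + dy) * width + (x + dx)];
            out[y * width + x] = sum / 9.0f;
        }
    }
    return out;
}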

So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.

Consider the complexity of executing an instruction on a modern processor…

- Read instruction (address translation, communicate with the I-cache, access the I-cache, etc.)
- Decode instruction (translate op to uops, access the uop cache, etc.)
- Check for dependencies / pipeline hazards
- Identify an available execution resource
- Use decoded operands to control the register file (an SRAM structure) to retrieve data
- Move data from the register file to the selected execution resource
- Perform the arithmetic operation
- Move data from the execution resource to the register file
- Use decoded operands to control the write to the register file

Question: How does SIMD execution reduce overhead when executing certain types of computations? What properties must these computations have?
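One way to see the answer: the loop below performs identical, independent work on every element (data parallelism with no cross-lane dependences), so a single fetched and decoded SIMD instruction can drive many ALUs at once, amortizing the front-end and control overheads listed above. A sketch using AVX intrinsics (assumes an x86 CPU with AVX support; the function names are invented for illustration):

#include <immintrin.h>

// Scalar version: one instruction fetched/decoded per element operation
void add_scalar(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

// 8-wide SIMD version: one fetched/decoded instruction controls 8 identical lanes
void add_avx(const float* a, const float* b, float* out, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)        // scalar tail for leftover elements
        out[i] = a[i] + b[i];
}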

Fraction of energy consumed by different parts of the instruction pipeline (H.264 video encoding) [Hameed et al. ISCA 2010]: no SIMD/VLIW vs. SIMD/VLIW

[Figure 4 from Hameed et al.: datapath energy breakdown for the H.264 stages — integer motion estimation (IME), fractional/subpixel motion estimation (FME), intraframe prediction, DCT + quantization, and arithmetic encoding (CABAC). Legend: FU = functional units, RF = register file, Pip = pipeline registers (interstage), D-$ = data cache, Ctrl = misc pipeline control, IF = instruction fetch + instruction cache. Only the top bar (FU), or perhaps the top two (FU + RF), contribute useful work; for this application it is hard to get much more than 10% of the power into the functional units without adding custom hardware units.]

Key points from the paper excerpt:
- Even after tailoring the SIMD functional units to the required width, the machine spends roughly 90% of its energy on overhead, with only about 10% going to the functional units.
- Fused operations help. For FME upsampling, AddShft (which multiplies two inputs by a common multiplicand, using shifts and adds, and accumulates the result) replaces the RISC sequence of 5 add/sub and 4 multiply instructions with 3 instructions (Figure 5):

 acc = 0;
 acc = AddShft(acc, x0, x1, 20);
 acc = AddShft(acc, x-1, x2, -5);
 acc = AddShft(acc, x-2, x3, 1);
 xn = Sat(acc);

- Fused instructions give the largest advantage for FME (on average doubling the energy/performance advantage of SIMD/VLIW) but negligible gains of 1.1x for CABAC. Combined with SIMD/VLIW, fused operations yield an overall 15x performance improvement and almost 10x energy-efficiency gain for the H.264 encoder, yet still use more than 50x more energy than an ASIC.
- To bridge the remaining gap, algorithm-specific ("magic") instructions must execute hundreds of operations per instruction: custom data storage elements with algorithm-specific communication links feed custom multiple-input (and possibly multiple-output) functional units, implementing the required communication directly in hardware.

Table 5. Fused operations added to each unit and the resulting gains:
 Unit   # fused ops  Op depth  Energy gain  Perf gain
 IME    4            3-5       1.5x         1.6x
 FME    2            18-34     1.9x         2.4x
 Intra  8            3-7       1.9x         2.1x
 CABAC  5            3-7       1.1x         1.1x

(IME and FME share no instructions; Intra and FME share instructions for the Hadamard transform, and DCT implements the same transform instructions.)
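As a concrete, purely illustrative C++ model of the fused-operation idea above: AddShft here multiplies a symmetric pair of taps by a small constant using shifts and adds and accumulates, and Sat rounds and clamps the result, together implementing the standard 6-tap H.264 half-pel filter (1, -5, 20, 20, -5, 1). The function names and the exact Sat normalization are assumptions for illustration, not the paper's hardware definition.

#include <algorithm>
#include <cstdint>

// Model of the fused op: multiply (a + b) by a small constant using shifts/adds, then accumulate.
static int32_t AddShft(int32_t acc, int32_t a, int32_t b, int coeff) {
    int32_t s = a + b;
    int32_t prod;
    switch (coeff) {
        case 20: prod = (s << 4) + (s << 2); break;  // 16*s + 4*s
        case -5: prod = -((s << 2) + s);     break;  // -(4*s + s)
        default: prod = s * coeff;           break;  // fallback for other taps
    }
    return acc + prod;
}

static uint8_t Sat(int32_t acc) {
    // Round (add 16), shift by 5, and clamp to 8 bits, as in the standard half-pel filter
    return static_cast<uint8_t>(std::clamp((acc + 16) >> 5, 0, 255));
}

// Half-pel interpolation at one position from six adjacent pixels
uint8_t halfpel(const uint8_t x[6]) {    // x[0..5] correspond to x-2, x-1, x0, x1, x2, x3
    int32_t acc = 0;
    acc = AddShft(acc, x[2], x[3], 20);
    acc = AddShft(acc, x[1], x[4], -5);
    acc = AddShft(acc, x[0], x[5], 1);
    return Sat(acc);
}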

Modern SoCs feature an ASIC for image processing
▪ Implement the basic RAW-to-RGB camera pipeline in silicon
 - Traditionally critical for real-time processing such as the viewfinder or video capture

Image Signal Processor (ISP): ASIC for processing camera sensor pixels

Digital Signal Processors (DSPs): area- and power-efficient multi-issue
▪ Typically simpler instruction stream control paths
▪ Complex instructions (e.g., SIMD/VLIW): perform many operations per instruction

Example: Qualcomm Hexagon DSP
▪ Used for modem, audio, and (increasingly) image processing on Qualcomm Snapdragon SoC processors
▪ VLIW: variable-sized instruction packets (1 to 4 instructions per packet)
▪ Dual 64-bit vector execution units
 - Standard 8/16/32/64-bit data types
 - SIMD vectorized MPY / ALU / SHIFT, permute, bit ops
 - Up to 8 16-bit MACs/cycle; 2 single-precision FMAs/cycle
▪ Dual 64-bit load/store units (also usable as 32-bit ALUs)
 - 64-bit loads and stores with post-update addressing
▪ Unified 32x32-bit general register file (best for the compiler); no separate address or accumulator registers; per-thread register files
▪ Memory hierarchy: instruction cache, data cache, shared L2 cache / TCM, DDR memory
▪ Zero-overhead hardware loops (decrement count, compare, and jump to top without instruction overhead)
▪ Goal: maximize signal-processing work per packet per cycle

Example from the inner loop of an FFT — one packet executes the equivalent of 29 "simple RISC ops" per cycle:

{ R17:16 = MEMD(R0++M1)              // 64-bit load with post-update addressing
  MEMD(R6++M1) = R25:24              // 64-bit store with post-update addressing
  R20 = CMPY(R20, R8):<<1:rnd:sat    // complex multiply with round and saturation
  R11:10 = VADDH(R11:10, R13:12)     // vector 4x16-bit add
}:endloop0                           // end of zero-overhead loop

Google's Pixel Visual Core
▪ Programmable Image Processing Unit (IPU) in Google Pixel phones
 - Augments capabilities of the Qualcomm Snapdragon SoC

▪ Designed for energy-efficient image processing
 - Each core = a 16x16 grid of 16-bit multiply-add ALUs
 - Goal: 10-20x more efficient than the CPU/GPU on the SoC

▪ Programmed using Halide and TensorFlow

Class discussion: Visual Core

Question ▪ What is the role of an ISA? (e.g., x86)

Answer: the interface between the program definition (software) and the hardware implementation

Compilers produce sequences of instructions

Hardware executes sequences of instructions as efficiently as possible. (As shown earlier in the lecture, many circuits exist to implement and preserve this abstraction, not to execute the computation needed by the program.)

New ways of defining hardware
▪ Verilog/VHDL present very low-level programming abstractions for modeling circuits (RTL abstraction: register transfer level)
 - Combinational logic
 - Registers
▪ Due to the need for greater efficiency, there is significant modern interest in making it easier to synthesize circuit-level designs
 - Skip the ISA; directly synthesize the circuits needed to compute the tasks defined by a program
 - Raise the level of abstraction for direct hardware programming
▪ Examples:
 - C to HDL (e.g., ROCCC, Vivado)
 - Bluespec
 - CoRAM [Chung 2011]
 - Chisel [Bachrach 2012]
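To make the "C to HDL" idea concrete, below is a minimal sketch in the style of Vivado high-level synthesis (the #pragma HLS PIPELINE directive is standard Vivado HLS syntax; the function blur3_row itself is a made-up example, not from the slides). The tool synthesizes a pipelined datapath directly from the loop, with no ISA in between: the three window variables become registers and the loop body becomes combinational logic accepting one pixel per clock.

#include <cstdint>

// 3-tap horizontal average over a streaming row of pixels.
// PIPELINE II=1 asks the HLS tool to accept one new pixel per clock cycle.
void blur3_row(const uint8_t in[1024], uint8_t out[1024]) {
    uint8_t w0 = 0, w1 = 0, w2 = 0;   // 3-pixel window held in registers
    for (int x = 0; x < 1024; x++) {
#pragma HLS PIPELINE II=1
        w0 = w1;
        w1 = w2;
        w2 = in[x];
        out[x] = static_cast<uint8_t>((w0 + w1 + w2) / 3);
    }
}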

Compiling image processing pipelines directly to HW

▪ Darkroom [Hegarty 2014] ▪ Rigel [Hegarty 2016] ▪ RIPL [Stewart 2018]

▪ Motivation:
 - Convenience of a high-level description of image processing algorithms (like Halide)
 - Energy efficiency of hardware implementations (particularly important for high-frame-rate, low-latency, always-on, embedded/robotics applications)

Optimizing for minimal buffering

▪ Recall: scheduling Halide programs for CPUs/GPUs
 - Key challenge: organize computation so intermediate buffers fit in caches
▪ Scheduling for hardware:
 - Key challenge: minimize the size of intermediate buffers (keep buffered data spatially close to the combinational logic that consumes it)

Consider 1D convolution: out(x) = (in(x-1) + in(x) + in(x+1)) / 3.0

Efficient hardware implementation: requires storage for only 3 pixels in registers

// "Shift" a new pixel in each cycle:
out_pixel = (buf0 + buf1 + buf2) / 3
buf0 = buf1
buf1 = buf2
buf2 = in_pixel

[Figure: the input pixel (from the sensor) shifts through registers buf0, buf1, buf2; the output arithmetic reads all three.]
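A software model of the same structure (an illustrative C++ sketch; blur3_stream is not from the slides): three registers hold the sliding window, a new pixel is shifted in each "cycle", and no output is produced until the window is full.

#include <vector>

std::vector<float> blur3_stream(const std::vector<float>& in) {
    float buf0 = 0.0f, buf1 = 0.0f, buf2 = 0.0f;   // the 3 pixel registers
    std::vector<float> out;
    int filled = 0;
    for (float in_pixel : in) {
        // "Shift" the new pixel in
        buf0 = buf1;
        buf1 = buf2;
        buf2 = in_pixel;
        if (++filled >= 3)                          // window full: emit an output
            out.push_back((buf0 + buf1 + buf2) / 3.0f);
    }
    return out;
}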

"Line buffering"

Consider convolution of a 2D image (rows of WIDTH pixels) in the vertical direction:
out(x,y) = (in(x,y-1) + in(x,y) + in(x,y+1)) / 3.0

Efficient hardware implementation: let buf be a shift register containing 2*WIDTH+1 pixels

// assume: no output until the shift register fills
out_pixel = (buf[0] + buf[WIDTH] + buf[2*WIDTH]) / 3.0
shift(buf);             // buf[i] = buf[i+1]
buf[2*WIDTH] = in_pixel

[Figure: the input pixel (from the sensor) shifts through the line buffer; the output arithmetic taps buf[0], buf[WIDTH], and buf[2*WIDTH].]
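The corresponding software model of the line buffer (again an illustrative C++ sketch with an assumed function name): a shift register holding 2*WIDTH+1 pixels, tapped at offsets 0, WIDTH, and 2*WIDTH, so only two image rows plus one pixel are buffered rather than the whole frame.

#include <deque>
#include <vector>

std::vector<float> vblur3_stream(const std::vector<float>& in, int width) {
    std::deque<float> buf;                      // models the 2*WIDTH+1 shift register
    std::vector<float> out;
    const size_t depth = 2 * static_cast<size_t>(width) + 1;
    for (float in_pixel : in) {
        buf.push_back(in_pixel);                // buf[2*WIDTH] = in_pixel
        if (buf.size() < depth) continue;       // no output until the register fills
        out.push_back((buf[0] + buf[width] + buf[2 * width]) / 3.0f);
        buf.pop_front();                        // shift(buf)
    }
    return out;
}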

Rigel core and higher-order modules [Hegarty et al. 2016]
▪ Provides a set of well-defined hardware building blocks: line buffers, math exprs (arbitrary arithmetic expressions, including fixed-point types), tap constants (programmable per-frame constants), upsample/downsample, and higher-order modules (Map, Reduce, ReduceSeq) built out of other modules
▪ Provides the programmer the service of gluing module instances together in a dataflow graph (DAG) to form a complete implementation
▪ Each module has synchronous dataflow (SDF) input/output rates (M/N ≤ 1 tokens per firing), which Rigel uses to prove the pipeline cannot deadlock and to reason about buffering; variable-latency modules (e.g., Downsample, Upsample, FilterSeq) are supported by dynamic scheduling with FIFOs that hide latency variation ("variable-latency SDF")
▪ Example: 4x4 convolution in Rigel
 - A line buffer converts the pixel stream into 4x4 stencils (uint32 → uint32[4,4])
 - Convolve(4,4) = a tap constant holding the kernel, Map(*) over the stencil, and Reduce(+) to sum the products (uint32[4,4] → uint32)
▪ Multi-rate modules trade parallelism for area: Devectorize splits each 4x4 stencil into 1x4 slices delivered over 4 firings, and ReduceSeq accumulates across firings (the reduced-parallelism ConvRP), cutting the arithmetic hardware by about 4x; the pipeline's throughput is then limited to one output every 4 firings (rate 1/4)

▪ Increasing parallelism in Rigel: Map makes 8 copies of the Convolve module and the line buffer is configured to produce 8 stencils per firing, yielding an 8-wide data-parallel pipeline (p = 8, 8 pixels/firing)
▪ Downsample/Upsample modules discard or duplicate pixels by integer factors in X and Y; e.g., adding Downsample(2,2) after the convolution gives one stage of a Gaussian pyramid

Halide to hardware [Pu et al. 2017]
▪ Reinterpret common Halide scheduling primitives to describe features of hardware circuits
 - unroll() -> replicate hardware (e.g., unroll(c) creates parallel units for the RGB channels)
 - linebuffer() marks an intermediate Func for storage in an on-chip line buffer
▪ Add a new primitive f.accelerate(inputs, innerVar, blockVar)
 - Defines the scope and interface of the accelerator: inputs are streamed in, and all intermediate Funcs are computed on the accelerator to produce f
 - Defines the granularity of the accelerated task: one iteration of blockVar per launch (here a 256x256 tile, one iteration of the x loop)
 - Defines the throughput of the accelerated task: innerVar increments each cycle (here one pixel per cycle, one iteration of the xi loop); splitting innerVar and accelerating with the outer variable yields higher-throughput hardware
▪ Add a new primitive src.fifo_depth(dest, n): instantiate a FIFO of depth n between src and dest to balance latencies when a stage consumes multiple streams (optimal depths could be solved as an integer linear program; here they are specified and tuned by hand)

Example from the paper: the unsharp mask filter (algorithm + hardware schedule). The algorithm computes a blurred gray-scale version of the input (gray, blury, blurx), then amplifies the input based on the difference between the original and blurred images. The hardware schedule begins on line 14: tile() breaks the row-major traversal into 256x256 tiles (xi, yi iterate pixels within a tile; x, y iterate over tiles), and accelerate({in}, xi, x) maps the chain in -> gray -> blury -> blurx -> sharpen -> ratio -> unsharp onto the accelerator.

1  Func unsharp(Func in) {
2    Func gray, blurx, blury, sharpen, ratio, unsharp;
3    Var x, y, c, xi, yi;
4
5    // The algorithm
6    gray(x, y) = .3*in(0, x, y) + .6*in(1, x, y) + .1*in(2, x, y);
7    blury(x, y) = (gray(x, y-1) + gray(x, y) + gray(x, y+1)) / 3;
8    blurx(x, y) = (blury(x-1, y) + blury(x, y) + blury(x+1, y)) / 3;
9    sharpen(x, y) = 2 * gray(x, y) - blurx(x, y);
10   ratio(x, y) = sharpen(x, y) / gray(x, y);
11   unsharp(c, x, y) = ratio(x, y) * in(c, x, y);
12
13   // The schedule
14   unsharp.tile(x, y, xi, yi, 256, 256).unroll(c)
15          .accelerate({in}, xi, x)
16          .parallel(y).parallel(x);
17   in.fifo_depth(unsharp, 512);
18   gray.linebuffer().fifo_depth(ratio, 8);
19   blury.linebuffer();
20   ratio.linebuffer();
21
22   return unsharp;
23 }

Image processing for automotive/robotics

NVIDIA Drive Xavier

Mobileye EyeQ Processor: computer vision accelerator for automotive applications (https://en.wikipedia.org/wiki/Mobileye)

Summary
▪ Image processing workloads demand high performance
 - Historically accelerated via ASICs for efficiency on mobile devices
 - Rapidly evolving algorithms dictate a need for programmability
 - Workload characteristics are amenable to hardware specialization

▪ Active industry efforts: programmable image processors
 - Gain efficiency via wide parallelism and by limiting data flows

▪ Active academic research topic: increasing productivity of custom hardware design - Image processing is an application of interest
