High Level Synthesis with a Dataflow Architectural Template
Shaoyi Cheng and John Wawrzynek
Department of EECS, UC Berkeley, California, USA 94720
Email: sh [email protected], [email protected]

(Copyright held by the owner/author(s). Presented at the 2nd International Workshop on Overlay Architectures for FPGAs (OLAF2016), Monterey, CA, USA, Feb. 21, 2016.)

Abstract—In this work, we present a new approach to high level synthesis (HLS), where high level functions are first mapped to an architectural template before hardware synthesis is performed. As FPGA platforms are especially suitable for implementing streaming processing pipelines, we perform transformations on conventional high level programs that turn them into multi-stage dataflow engines [1]. This target template naturally overlaps slow memory data accesses with computations and therefore has much better tolerance towards memory subsystem latency. Using a state-of-the-art HLS tool for the actual circuit generation, we observe up to 9x improvement in overall performance when the dataflow architectural template is used as an intermediate compilation target.

Index Terms—FPGA, Overlay Architecture, Hardware design template, High-level Synthesis, Pipeline Parallelism

I. INTRODUCTION

As the complexity of both FPGA devices and their applications increases, the task of efficiently mapping the desired functionality onto them is getting ever more challenging. To alleviate the difficulty of designing for FPGAs, there has been a trend towards using higher levels of abstraction. Tools taking in high-level function specifications and generating hardware IP blocks have been developed both in academia [2], [3] and in industry [4], [5]. Of course, the semantics of high level languages like C/C++ are vastly different from descriptions of hardware behavior at clock cycle granularity. The tools often try to bridge this gap by fitting the control data flow graph (CDFG) of the original program into a particular hardware paradigm such as the Finite State Machine with Datapath (FSMD).

Depending on the nature of the application, these approaches may or may not generate hardware that takes full advantage of what the FPGA has to offer. User guidance in the form of directives or pragmas is often needed to expose parallelism of various kinds and to optimize the design. An important dimension of this space is the mechanism by which memory data are accessed. Designers sometimes need to restructure the original code to separate out memory accesses before invoking HLS. Also, it is often desirable to convert from conventional memory accesses to a streaming model and to insert DMA engines [6]. Further enhancements can be achieved by adding accelerator-specific caching and burst accesses.

In this paper, we realize an intermediate architectural template (section II) that complements existing work in HLS. It captures some of the common patterns applied in optimizing HLS generated designs. In particular, taking advantage of FPGAs as throughput-oriented devices, it structures the computation and data accesses into a series of coarse-grained pipeline stages through which data flows. To target this architectural template, we have developed a tool that slices the original CDFG of the performance critical loop nests into subgraphs connected by communication channels (section III). This decouples the scheduling of operations between different subgraphs and subsequently improves the overall throughput in the presence of data fetch stalls. Then, each of the subgraphs is fed to a conventional high-level synthesis flow, generating independent datapaths and controllers. FIFO channels are instantiated to connect the datapaths, forming the final system (section IV). The performance, when compared against directly synthesized accelerators, is far superior (section V), demonstrating the advantage of targeting the dataflow architectural template during HLS.
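As a concrete illustration of the manual restructuring mentioned above (separating memory accesses from compute and streaming data to the datapath), consider the following minimal sketch. It is written in the style of one commercial HLS tool's C++ stream library; the hls::stream type and the pragmas follow Vivado HLS conventions, while the function names and the shape of the computation are illustrative assumptions, not taken from this paper:

    #include "hls_stream.h"  // Vivado HLS stream library (assumed toolchain)

    // Hypothetical restructuring: the memory accesses are hoisted into a
    // separate read process, connected to the compute loop by a stream.
    static void read_stage(const float* in, int n, hls::stream<float>& s) {
        for (int i = 0; i < n; i++)
            s.write(in[i]);          // sequential, burst-friendly accesses
    }

    static void compute_stage(hls::stream<float>& s, float* out, int n) {
        float acc = 0;
        for (int i = 0; i < n; i++) {
    #pragma HLS PIPELINE II=1
            acc += s.read();         // no memory access in the compute loop
            out[i] = acc;
        }
    }

    void top(const float* in, float* out, int n) {
    #pragma HLS DATAFLOW             // the two stages execute concurrently
        hls::stream<float> s;
        read_stage(in, n, s);
        compute_stage(s, out, n);
    }

This hand-applied separation of access from execution is one of the patterns the architectural template described next is meant to capture, including for access patterns that cannot be expressed as simple sequential streams.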
II. THE DATAFLOW ARCHITECTURAL TEMPLATE

Currently, HLS tools use a simple static model for scheduling operations. Different parts of the generated hardware run in lockstep with each other, with no need for dynamic dependency checking mechanisms such as scoreboarding or load-store queueing. This rigid scheduling of operators, while producing circuits of simpler structure and smaller area, is vulnerable to stalls introduced by cache misses or variable latency operations. The entire compute engine is halted while the state machine in the controller waits for the completion of an outstanding operation. This effect becomes very pronounced when irregular off-chip data accesses are encoded in the function. Under these circumstances, the traditional approach, where data movements are explicitly managed using DMA, may not be effective, as the access pattern is not known statically. Also, there may not be sufficient on-chip memory to buffer the entirety of the involved data structure. As a result, the overall performance can deteriorate significantly.

To alleviate this problem, instead of directly synthesizing the accelerator from the original control dataflow graph, we first map the input function to an architecture resembling a dataflow engine. Figure 1 illustrates this mapping for a very simple example. The original function is broken up into a set of communicating processes, each of which can be individually turned into an accelerator. The memory subsystem is assumed to be able to take in multiple outstanding requests.

The mapping process can distribute operations in the original function into multiple stages. This process can of course …
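The process code in Figure 1 below is written using push and pop operations on named queues. As a reading aid, here is a minimal software model of such a channel; this model is an illustrative assumption, not part of the authors' flow, and where the generated hardware uses finite-depth FIFOs with back-pressure, the model below is an unbounded blocking queue:

    #include <condition_variable>
    #include <mutex>
    #include <queue>

    // Software stand-in for a FIFO channel between two decoupled stages.
    template <typename T>
    class Fifo {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void push(const T& v) {          // never blocks in this unbounded model
            { std::lock_guard<std::mutex> lk(m_); q_.push(v); }
            cv_.notify_one();
        }
        void pop(T* out) {               // blocks until a value arrives,
            std::unique_lock<std::mutex> lk(m_);  // like a stalled stage
            cv_.wait(lk, [this] { return !q_.empty(); });
            *out = q_.front();
            q_.pop();
        }
    };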
[Figure 1: Mapping a simple example onto the dataflow architectural template. The figure shows the original C loop, which walks an array of linked lists and multiplies together the values stored in each list; the same loop as a CDFG in LLVM IR form (basic blocks entry, bb, bb1, bb3, return_bb); and the loop sliced into communicating processes: an array traversal stage, a pointer chasing stage, a value fetch stage (memory access, potential cache misses), and an FP multiply stage (long latency compute), connected by the FIFO queues nodeQ, vaddrQ, valQ, brTagQ, and prodQ.]

The original loop of Figure 1:

    for (int i = 0; i < N; i++) {
        float cur_product = 1;
        node* cur_node = heads[i];
        while (cur_node != 0) {
            float cur_val = cur_node->value;
            cur_product *= cur_val;
            cur_node = cur_node->next;
        }
        products[i] = cur_product;
    }

The decoupled processes of Figure 1, in the figure's push/pop notation:

    // Array traversal
    for (int i = 0; i < N; i++) {
        node* cur_node = heads[i];
        push(nodeQ, cur_node);
    }

    // Pointer chasing
    for (int i = 0; i < N; i++) {
        pop(nodeQ, &cur_node);
        while (cur_node != 0) {
            push(brTagQ, br1_tag);
            val_addr = &cur_node->value;
            push(vaddrQ, val_addr);
            cur_node = cur_node->next;
        }
        push(brTagQ, br2_tag);
    }

    // Value fetch (memory access: potential cache misses)
    while (1) {
        pop(vaddrQ, &val_addr);
        float cur_val = *val_addr;
        push(valQ, cur_val);
    }

    // FP multiply (long latency compute)
    for (int i = 0; i < N; i++) {
        float cur_product = 1;
        while (1) {
            pop(brTagQ, &br_dst);
            if (br_dst == br1_tag) {
                pop(valQ, &cur_val);
                cur_product *= cur_val;
            } else
                break;
        }
        push(prodQ, cur_product);
    }

    // Result writeback
    for (int i = 0; i < N; i++) {
        pop(prodQ, &cur_product);
        products[i] = cur_product;
    }

Notation used in the figure's LLVM IR: phi selects a value based on the predecessor basic block of the current execution; icmp generates a 1-bit value by comparing its two operands; br chooses the successor basic block based on the value of its first operand, or jumps unconditionally if there is only one successor.
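To check the behavior of the decoupled form, the figure's five processes can be run as threads over the Fifo model sketched earlier (again an illustrative assumption, not the authors' tool; together with that Fifo class, this compiles and runs). One software-only deviation: the value fetch stage is handed a null sentinel so its thread can exit, whereas the corresponding hardware stage would simply sit idle. Note that only this stage ever waits on memory, so a cache miss there no longer stalls the multiplier or the traversal logic:

    #include <cstdio>
    #include <thread>
    #include <vector>

    struct node { float value; node* next; };
    enum { BR1 = 1, BR2 = 2 };   // branch tags: BR1 = iterate, BR2 = exit loop

    int main() {
        const int N = 3;
        std::vector<node> pool;            // list i holds i+1 nodes of value 2.0f
        pool.reserve(N * (N + 1) / 2);     // no reallocation: pointers stay valid
        node* heads[N];
        for (int i = 0; i < N; i++) {
            heads[i] = nullptr;
            for (int j = 0; j <= i; j++) {
                pool.push_back(node{2.0f, heads[i]});
                heads[i] = &pool.back();
            }
        }
        float products[N];

        Fifo<node*> nodeQ; Fifo<float*> vaddrQ; Fifo<float> valQ;
        Fifo<int> brTagQ;  Fifo<float> prodQ;

        std::thread traversal([&] {        // array traversal
            for (int i = 0; i < N; i++) nodeQ.push(heads[i]);
        });
        std::thread chasing([&] {          // pointer chasing
            for (int i = 0; i < N; i++) {
                node* cur; nodeQ.pop(&cur);
                while (cur != nullptr) {
                    brTagQ.push(BR1);
                    vaddrQ.push(&cur->value);
                    cur = cur->next;
                }
                brTagQ.push(BR2);
            }
            vaddrQ.push(nullptr);          // software-only termination sentinel
        });
        std::thread fetch([&] {            // value fetch: the only stage that
            while (true) {                 // touches (potentially slow) memory
                float* addr; vaddrQ.pop(&addr);
                if (addr == nullptr) break;
                valQ.push(*addr);
            }
        });
        std::thread multiply([&] {         // FP multiply
            for (int i = 0; i < N; i++) {
                float prod = 1;
                for (;;) {
                    int tag; brTagQ.pop(&tag);
                    if (tag != BR1) break;
                    float v; valQ.pop(&v);
                    prod *= v;
                }
                prodQ.push(prod);
            }
        });
        std::thread writeback([&] {        // result writeback
            for (int i = 0; i < N; i++) prodQ.pop(&products[i]);
        });

        traversal.join(); chasing.join(); fetch.join();
        multiply.join(); writeback.join();
        for (int i = 0; i < N; i++)
            std::printf("products[%d] = %g\n", i, products[i]);  // 2, 4, 8
        return 0;
    }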