Parallelization in Co-Compilation for Configurable Accelerators

A Host / Accelerator Partitioning Compilation Method

J. Becker
Microelectronics Systems Institute
Technische Universitaet Darmstadt (TUD)
D-64283 Darmstadt, Germany
Phone: +49 6151 16-4337
Fax: +49 6151 16-4936
e-mail: [email protected]
http://www.microelectronic.e-technik.tu-darmstadt.de/becker/becker.html

R. Hartenstein, M. Herz, U. Nageldinger
Computer Structures Group, Informatik
University of Kaiserslautern
D-67653 Kaiserslautern, Germany
Fax: +49 631 205 2640
Home fax: +49 7251 14823
e-mail: [email protected]
http://xputers.informatik.uni-kl.de

Abstract: The paper introduces a novel co-compiler and its “vertical” parallelization method, including a general model for co-operating host/accelerator platforms and a new parallelizing compilation technique derived from it. Small examples are used for illustration. It explains the exploitation of different levels of parallelism to achieve optimized speed-ups and hardware resource utilization. Section II introduces novel vertical parallelization techniques for configurable accelerators, which exploit parallelism at four different levels (task, loop, statement, and operation level). Finally the results are illustrated by a simple application example. But first the paper summarizes the fundamentally new dynamically reconfigurable hardware platform underlying the co-compilation method.

I. INTRODUCTION

Tsugio Makimoto has observed cycles of changing mainstream focus in semiconductor circuit design and application [1] (fig. 1). Makimoto’s model obviously assumes that each new wave is triggered by a paradigm shift. The second wave has been triggered by shifting from hardwired to programmable microcontroller. The third wave will be triggered by shifting to using reconfigurable hardware platforms as a basis of a new computational paradigm.

Makimoto’s third wave takes into account that hardware has become soft. Emanating from field-programmable logic (FPL, also see [2]) and its application, the awareness of the new paradigm of structural programming is growing. Commercially available FPGAs make use of RAM-based reconfigurability, where the functions of circuit blocks and the structure of their interconnect are determined by bit patterns that have been downloaded into “hidden RAM” inside the circuit. Modern FPGAs are reconfigurable within seconds or milliseconds, even partially or incrementally. Such “dynamically reconfigurable” circuits may even reconfigure themselves: an active circuit segment programs another, idling segment.

Fig. 1: Makimoto’s wave: summarizing the history of paradigm shifts in semiconductor markets. (Paradigms shown: algorithm fixed, resources fixed; after the hardware to software migration: algorithm variable, resources fixed; after the procedural to structural migration: algorithm variable, resources variable.)

So we have two programming paradigms, programming in time and programming in space, distinguishing two kinds of “software”:

- sequential software (code downloaded to RAM)
- structural software (downloaded to hidden RAM)

But Makimoto’s third wave is heavily delayed. FPGAs are available, but are mainly used for a tinkertoy approach, rather than for a new paradigm. Is it realistic to believe that Makimoto’s third wave will come? If yes, what is the reason for its delay? Although FPGA integration density has passed that of microprocessors, the evolution of dynamically reconfigurable circuits is approaching a dead end. For a change, new solutions are needed for some fundamental issues [3]. This paper analyzes the state of the art and introduces a fundamentally new approach, which has to cope with:

- a hardware gap
- a modeling gap
- a software gap
- an education gap

A. The Hardware Gap

Comparing the Gordon Moore curve of integrated memory circuits with that of microprocessors and other logic circuits (fig. 2) shows an increasing integration density gap, currently about two orders of magnitude. We believe that the predictions in fig. 2 [4] are more realistic than the more optimistic ones of Semicon’s “road map” [5] (also [6]).

Fig. 2: The Gordon Moore curve and microprocessor curve, with design gap [4]. (Axes: transistors/chip versus year; one growth annotation reads ×100 / decade, i.e. ×1.6 / year.)

A main reason for this gap is the difference in design style [7]. The high density of memory circuits mainly relies on full custom style including wiring by abutment. Microprocessors, however, include major chip areas defined by standard cell and similar styles based on “classical” placement and routing methods. This is a main reason for the density gap, which is therefore a design gap. Another indication of increasing limitations of microprocessors is the rapidly growing usage of add-on accelerators: both boards and integrated circuits.

Both standard cell based ASICs and FPGAs ([2], [8]) are usually highly area-inefficient, because usual placement algorithms use only flat wiring netlist statistics, which are much less relevant than needed for good optimization results. A much better placement strategy would be based on detailed data dependency data directly extracted from a high level application specification, as in the synthesis of systolic arrays ([9],[10],[11]), where it is derived directly from a mathematical equation system or a high level program (“very high level synthesis”).

Fig. 3: FPGA high growth rate of integration density, compared to memory and microprocessor.

Due to full custom design style, FPGA integration density (fig. 3) grows very fast (at a rate as high as that of memory chips) and has already exceeded that of general purpose microprocessors [12]. But in FPGAs the reconfigurability overhead is very high (fig. 4). Published figures indicate that 200 physical transistors are needed for one logical transistor ([13],[14]), or that only 1% of chip area is available for pure application logic [15]. Routing takes up to hours of computation time and uses only part of the logic elements, in some cases only about 50%. So FPGAs would hardly be the basis of the mainstream paradigm shift to dynamically reconfigurable hardware, as predicted e.g. by Makimoto’s wave [1] (also see analysis in [7]).

Fig. 4: The hardware gap of reconfigurability: physical versus logical integration density of FPGAs, compared to microprocessors and memory chips. (Annotations: logical chip area uses only 1% of physical chip area [DeHon]; 1 logical transistor per 200 physical transistors [Tredennick].)

The reason for the immense FPGA area inefficiency is the need for configuration memory and the extensive use of reconfigurable routing channels, both being physical reconfigurability overhead artifacts.

B. Closing the Hardware Gap

An alternative dynamically reconfigurable platform is the KressArray [16], which is much less overhead-prone and more area-efficient than FPGAs by about 3 orders of magnitude (fig. 5). (This high density may be a reason to need low power design methods [18].) Also the KressArray integration density is growing a little faster than that of memories (fig. 5). The high logical area efficiency is obtained by using multiplexers inside the PEs (processing elements) instead of routing channels. Fig. 6 illustrates a 4 by 8 KressArray example. Fig. 8 illustrates the mapping (fig. b) of an application (fig. a: a system of 8 equations) onto this array.

Fig. 5 (annotation): KressArray: logical integration density is larger than that of FPGAs by about 3 orders of magnitude. (Curves shown for memory, KressArray, and microprocessor.)

The KressArray is a generalization of the systolic array, the most area-efficient and throughput-efficient datapath design [...]

The mapping problem has been mainly reduced to a placement problem. Only a small residual routing problem goes beyond nearest neighbor interconnect, which uses a few PEs also as routing elements. DPSS includes a data scheduler to organize and optimize data streams for host/array communication, being a separate algorithm carried out after placement [16]. Instead of the hours known from FPGA tools, DPSS needs only a few seconds of computation time. By permitting alternative solutions through multiple turn-arounds within minutes, the KressArray tools support experimental very rapid prototyping, as well as profiling methods known from hardware/software co-design (also see section II ff.).
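To make the claim that mapping reduces mainly to placement more concrete, the following Python sketch places the operators of a tiny dataflow graph on a 4 by 8 grid and counts how many hops go beyond nearest neighbor interconnect. It is only an illustration under stated assumptions: it is not the DPSS algorithm, whose details are not given here, and the operator names, the example expression, the cost model, and the simple hill-climbing loop are all hypothetical.

```python
import random

ROWS, COLS = 4, 8  # array size of the 4 by 8 example mentioned in the text


def manhattan(a, b):
    """Grid distance between two PE positions given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def routing_cost(placement, edges):
    """Hops beyond nearest neighbor, summed over all data dependencies.

    An edge between adjacent PEs costs 0; every additional hop would need
    a PE used as a routing element (assumed cost model).
    """
    return sum(max(0, manhattan(placement[u], placement[v]) - 1)
               for u, v in edges)


def improve_placement(placement, edges, rounds=2000, seed=0):
    """Hill climbing: swap two cells, keep the swap if cost does not rise."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(ROWS) for c in range(COLS)]
    best = routing_cost(placement, edges)
    for _ in range(rounds):
        a, b = rng.sample(cells, 2)
        occupant = {pos: op for op, pos in placement.items()}
        op_a, op_b = occupant.get(a), occupant.get(b)
        if op_a is None and op_b is None:
            continue  # both cells empty, nothing to move
        trial = dict(placement)
        if op_a is not None:
            trial[op_a] = b
        if op_b is not None:
            trial[op_b] = a
        cost = routing_cost(trial, edges)
        if cost <= best:
            placement, best = trial, cost
    return placement, best


# Hypothetical dataflow graph for r = (a * b + c * d) - e
edges = [("mul1", "add"), ("mul2", "add"), ("add", "sub")]
# Deliberately poor starting placement: operators in the four corners
initial = {"mul1": (0, 0), "mul2": (3, 7), "add": (0, 7), "sub": (3, 0)}

placed, final_cost = improve_placement(initial, edges)
print("initial cost:", routing_cost(initial, edges))
print("final cost:", final_cost)
```

A cost of zero would mean that every data dependency is served by nearest neighbor interconnect alone, so no residual routing is left. The swap-based search was chosen only because it is short; any placement heuristic driven by the data dependencies could be substituted.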
