<<

PPaarrttiiaall RReeccoonnffiigguurraattiioonn oonn FFPPGGAAss

Dirk Koch ([email protected]) 1 Introduction: Terms and Definitions Definition of the term „“ (RC)

° A good definition for a reconfigurable hardware system was introduced with the Rammig Machine (by Franz Rammig 1977): … a system, which, with no manual or mechanical inter- ference, permits the building, changing, processing and destruction of real (not simulated) digital hardware

° Reconfigurable computing (RC) is defined as the study of computation using reconfigurable devices This includes architectures, algorithms and applications

° The term RC is often used to express that computation is carried out using dedicated hardware structures (often utilizing a high level of parallelism) which are mapped on reconfigurable hardware (this is opposed to the sequential von Neumann

computer paradigm!!!). 2 How does it Work? Look-up Tables Example of a SRAM-based look-up table (LUT) function generator. A LUT is basically a multiplexer that evaluates the truth table stored in the configuration SRAM cells (can be seen as a one bit wide ROM). A k-input LUT has 2k SRAM cells.

0 configuration SRAM cell tradeoff between Vdd 1 config power enable 2 and speed config 3 data 0 0 optimized for 4 1 1 state lowest power flip-flop 5 configura- L 6 tion cell 7 Q L

F F ......

A0 A . . 0 . A A 1 connection to 1 A2 A2 switch matrix A A3 3 FF 3 How does it Work? Look-up Tables Configuration examples Note that by permutating LUT values, inputs can be swapped (usefull for routing); it is also possible to rouite through a LUT.

4 How does it Work? Routing ° The routing is implemented with pass transistor logic: a) b)

0 0 LUT . . 1 1 .

F F

A0 .

. A1 . A2 .

. A

. 3

FF

clock 0 1 0 0 1 0 0 switch matrix multiplexer ° If we only count transistors, it costs more transistors for the SRAM cells than for the configuration SRAM cells. ° Alternatives: ° Flash-based FPGAs (e.g. FPGAs from Microsemi) ° Field Programmable Nanowire Interconnects (FPNI) (experimental) ° Tier Logic: thin film transistors for SRAM cells in the metal layer Idea: move the SRAM configuration cells from the silicon to the metal layers. (Tier logic is out of business) 5 How does it Work? FPGA Fabric ° Example of an FPGA fabric composed of LUTs, switch matrices and I/O cells. Other common primitives: memories, multipliers, transceivers, … ° On real FPGAs: a cluster of LUTs per switch matrix (e.g., eight LUTs and switch matrix form a configurable on FPGAs)

switch matrix 0 0 multiplexer 0 0 1 1 1 1 data_in switch . . . matrix F F F F

A0 A0

A1 A1

A2 A2

A3 A3

FF FF clock clock clock enable LUT I/O element

I/O pad 0 0 0 0 1 1 1 1 data_out

F F F F

A0 A0

A1 A1

A2 A2

A3 A3

FF FF

clock clock clock

possible configuration configuration SRAM cell 6 Introduction: Example for RC reconfigurable computing for i = 0 to 7 do { A[0][3..0] RC benefits tmp = A[i] & x"F"; + among von Neu- tmp = tmp + 42; 42 * Q[0] mann machines: l 24 o

Q[i]= tmp * 24; o p

} A[1][3..0] • fast parallel + u n

slow and r processing power hungry 42 * Q[1] o l l

i - pipelining 24 n g

. - loop transform. von Neumann computer . . LDI reg_i,0 • no instr. fetch A[7][3..0] L1:ANDI r_tmp,$i,xF + (no extra ... 42 * Q[7] memory access) BLI reg_i,L1 24 instruction pipelining • no instr. decode stream • possibility of A.[0] ... dedicated instr. data ..A[7] stream 42 Q[0]... (e.g., MAC) 42 Q[7] 24 • lower power 24 7 Introduction: Example (Benefits) Reconfigurable computing permits to tradeoff between performance (speed and/or latency) and area (number of used primitives) of the reconfigurable architecture. This requires to solve the following steps:

° Allocation: defining the resources / functional blocks which are allowed for implementation ° Binding: defining which operation is executed on a particular allocated resource ° Scheduling: defining the time when an operation is executed

Allocation, binding, and scheduling are fundamental problems that have to be solved at different level of abstraction (e.g., system level, architecture level, or all refinements. This holds for both the hardware and the software part!

Further: RC removes architectural limitations (e.g., like shared memory communication in GPUs) 8 Introduction: Terms and Definitions These RC benefits exist also for dedicated hardware (ASIC1, ASIP2), but reconfigurable computing allows more: ° Adaptability: react on environment changes or different workload scenarios by adapting the behavior and structure of a system (e.g., scaling a system with configuring more instances of an accelerator module to a reconf. device) ° Customization (post fabrication): allows for different features for individual systems ° Updatability: update to new standards, bug fixes, after sales business with new features „hardware apps“

Possible by (re)configuration: Configuration (and respectively reconfiguration) is the process of changing the structure of a reconfigurable device at start-up-time (respectively run-time). Mostly this means: sending new configurations to the device

ASIC1: application specific integrated circuit; ASIP2: application specific processor 9 Introduction: Terms and Definitions Reconfigurable architectures ° Coarse grained: ALU-like primitives with word sized routing channels ° Examples: NEC-DRC, PACT XPP, Silicon Hive, Ambric, Picochip, TILERA, GPGPU ° Advantage: extreme performance for domain specific tasks ° Fine grained: bit level primitives (e.g., look-up tables (LUTs)) and single wire routing ° Examples: plenty of academic architectures, FPGAs ° Advantage: can virtually implement anything ° But often poor performance and/or chip utilization ° Hybrid: fine-grained fabric with additional coarse-grained primitives (e.g., hardware multipliers or CPUs) ° Examples: Xilinx Virtex families (some with embedded PPC) ° Aims at combining the advantages of both 10 Introduction: the FPGA-ASIC Gap Hybrid FPGAs are dominating reconfigurable market, but there is a ° Gap between reconfigurable FPGAs and dedicated ASICs @ 90nm process* FPGA versus ASIC chip area ~ 18 x larger dynamic power ~ 14 x more clock speed ~ 3-5 x slower *Kuon & Rose: Measuring the Gap Between FPGAs and ASICs, in Tr. On CAD, 2007. ° Note that the gap towards a programmable von Neumann machine could be even orders of magnitude higher! ° also: lack of productive design tools (and skilled engineers)

Solution: partial run-time reconfiguration (PR): reusing the resources of a reconfigurable architecture by multiple modules over time. Only parts of a system might be updated while continuing operation of the remaining system. 11 FPGA-based Systems everywhere, but not PR

° FPGA-based systems are omnipresent in our daily life.

Each A380 contains more than 700 Actel FPGAs, e.g., for: ° Engine control & monitoring ° flight computers ° braking systems ° safety warning systems

12 What we should know about FPGAs

° Slow (~300 MHz), but highly parallel execution >1000 Operations ° Moderate I/O throughput, but >1MB @ >1TB/sec (on-chip) ° Difficult VHDL programming, but C++ is coming up

data flow oriented vs. control flow dominated for i=1 to ∞ loop if (old_position) then numbercrunching; case position is 42: if free then

13 PR Advantages: Area Saving

° Networking: Adapt to changing g r o

protocols over . a d i

time a c . w

° Encapsulated w w : e

design of the c r u o

processing s modules configuration FPGA network processor repository dispatcher config. HTTP VoIP SSH FTP

14 PR Advantages: Area Saving

° Economics of ASIC- and FPGA designs 6 0 0 2 . 3 0 . 6 1

s w e N

c i n o r t c e l E

: e c r u o S

° FPGA buyers: - reduce unit cost - after sales business ° FPGA vendors: more attractive for high volume designs 15 PR Advantages: Acceleration

° Reduce latency by spending more area on submodules

A B C ACABAC A C time A S0 S0 S1 B latency S1 A B C A B C A S2 C S2 t t ° May alternatively allow to reduce clock frequency (and power) ° Lower latency might reduce buffer sizes ° Example: TLS/SSL, sorting (database acceleration) ° May also increase throughput 16 PR Advantages: Faster Configuration

° Full FPGA bistream can currently be > 20 MB ° Flash memory performance 10-20 MB/s (special high-speed Flash memories reach up to 100 MB/s) ° ‰ Full initial configuration ~ 1-2 seconds in practice an order of magnitude to slow for PCIe (setup within 100 ms) ° Solution: Bootstrapping using PR

Initial config. from boot flash System config. via PCIe

conf. port conf. port

em'epmtyp FtyP' GA 'empty' PCIe core boot PCIe core boot flash flash

17 PR Advantages: IP Reuse

10 000 000 1 000 000 +58% / year logic transistors/year 100 000 design gap 10 000 1000 100 +21% / year 10 productivity in tr. per man-month 1 1980 1985 1990 1995 2000 2005 2010 [International Technology Roadmap for Semiconductors] High level of IP reuse ‰ Adapt the component-based system PR design flow for a general design methodology Idea: take as much as possible from an existing environment and add only the application stecific parts. 18 PR Advantages: SEU* Compensation

LUT-4 era LUT-6 era LUT-4 era LUT-6 era

# LUTs 1.2 M bitstream size ??? 6

V

6 -

I - V

- x 600 K I

30 MB x -

x

V e

i

e

t

x -

t t

i r

t

r i x

a

i i 7

r a t

7

V V - t

r 500 K - V

- 25 MB a t x

x r

S x

I

i t S e

I t e I t I t

I S r

- a r I i

i r x -

400 K t i 20 MB V V t

x 5

i

S

5 - t

a - r

o x 4 a t

o I x r I - r

I r e

I e t

4 - t x S -

300 K 15 MB P t

- r

P

x

r i e x S

I

i i

I x i t

I I t I I t r

I V I - e -

i V

- a - t a x r

x x r x x r V t x i i i

200 K 10 MB t e

e t t

e t

t e S

t V S a t r

r a

r i i r r

i r i t t

V V

V V S 100 K 5 MB S

2000 2002 2004 2006 2008 2010 2000 2002 2004 2006 2008 2010 130 nm 90 nm 56 nm 40 nm 28 nm 130 nm 90 nm 56 nm 40 nm 28 nm ° Smaller configuration SRAM cells ° Exponetial rise in the total amount ° ‰ Increased risc of *single event upsets (SEU) ° Solution: Configuration Scrubbing ° Continous reconfiguration during operation (repair) ° Readback for SEU detection (before committing a result)

19 RC on FPGAs (Classification)

° Classification of (run-time) reconfigurable FPGA-based systems FPGA-based systems

one-time configurable reconfigurable Actel SXA family (antifuse) ASIC substitution* global partial older FPGAs in field update* passive1 active2 * Typical use case Xilinx Spartan 3 Xilinx Virtex families mode changing* ° This lecture focuses on passive1 partial reconfiguration (interrupt whole FPGA during reconfiguration) and active partial recon- figuration2 (untouched parts continue execution) on FPGAs.

20 Context-Switching on FPGAs ° Partial reconfiguration is also referred as context switching. ° What is the Context of an FPGA? ° “Context” denotes a “state” which is stored in memory Located in: 1) FPGA fabric (technology level) 2) Modules (logic level) 1) Present FPGA configuration 2) State of a module • Register snapshot • RAM blocks • External state a d b o B e h p o t s i r h C

: e c r u o S Access via configuration port or Access via configuration port extra logic (e.g., scan-chain) 21 Context-Switching on FPGAs Classification Technology level (FPGA) static dynamic • Module runs forever • Configuration swapping • Single configuration/ • Run-to-completion model

) module context (no module context is e l static • ASIC-like considered at start) u d

o (e.g., memory controller) (e.g., motion-JPEG) m (

l

e • Multiple module contexts • module preemption and v

e resuming l • on a single configuration

c • Configuration swapping i

g dynamic

o (e.g., multi channel crypto) • Transparent (like software) L • Examined at UIO in the COSRECOS project* ° All variants may co-exist in a reconfigurable SoC

*Website: http://www.mn.uio.no/ifi/english/research/projects/cosrecos/ 22 Baseline Model of Partial Reconfiguration

The time-multiplex model: configuration data (bitstream) internal configuration logic

FPGA phone phone video <=> video MP3 MP3

reconfigurable region surrounding system

° Activate one module exclusively within a reconfigurable region ° Swapping between modules by writing a partial bitstream to a configuration port (defines the configuration time!) ° Bitstream might be written by the FPGA itself ‰ selfreconfiguration ° Used by the tools from Xilinx and Altera 23 PR Time-Granularity (sub-cycle) Tabula’s 3D Architecture ° 8 configuration planes ° Reconfiguration @ 1.6 GHz ° Within netlist reconfiguration (uses forwarding registers called „time via“)

8 folds @ 1.6 Ghz

200 MHz user clock

400 MHz user c2l4ock PR Time-Granularity (sub-cycle)

Example: 32- bit adder; a) conventional, b) time-multiplexed a) b15a15 b14a14 b13a13 b12a12 b11a11 b10a10 b9 a9 b8 a8 b7 a7 b6 a6 b5 a5 b4 a4 b3 a3 b2 a2 b1 a1 b0 a0

FA FA FA FA FA FA FA FA FA FA FA FA FA FA FA FA '0'

s15 s14 s13 s12 s11 s10 s9 s8 s7 s6 s5 s4 s3 s2 s1 s0 b) b15a15 b14a14 b13a13 b12a12 FA FA FA FA

s15 s14 s13 s12 b11a11 b10a10 b9 a9 b8 a8 FA FA FA FA mc4 s11 s10 s9 s8 b7 a7 b6 a6 b5 a5 b4 a4 t FA FA FA FA time forwarding latch mc3 s7 s6 s5 s4 b3 a3 b2 a2 b1 a1 b0 a0 mc2 y FA FA FA FA '0' s s s s mc1 3 2 1 0 micro configuration

basic logic element

x 25 PR Time-Granularity (sub-cycle)

Example: 32- bit adder; c) scheduling, d) configuration storage c) clk execution mc1

mc2

mc3 microcycle configuration mc4 configurable d) config. function mc4 mc3 mc2 mc1 buffer configuration latch SRAM cell config data conventional configuration time multiplexed configuration ° Each micro configuration has its own set of SRAM cells (which requires area on the die; savings are possible in the logic) ° Rapid reconfiguration consumes power (millions of configuration bits) ‰ better suitable for latest CMOS processes (where static power dominates dynamic power) 26 PR Time-Granularity (sub-cycle)

° One memory access per time plane ‰ virtually 8 memory ports; (depending on the space/time mapping there are typically fewer memory accesses possible) ° Difficult to rate this approach: ° Clustered logic fabric (fast inside, slower from cluster to cluster) ° Extra multiplexers for accessing state values (LUTs can be shared, but not state flip-flops) ° Extra latch for each wire segment ° Difficult tools: Tabula hides the space/time mapping from the user and the reconfiguration is fully transparent ‰ the device appears larger than it is and logic resources are reused to implement larger netlists ° No easy manual manipulation of the netlist. ° Monitoring of signals is difficult. (transient events cannot be easily monitored with an oscilloscope)

° More information: http://www.tabula.com 27 PR Time-Granularity (single-cycle)

Multi-context FPGAs ° originally proposed by Scalera & Trimberger ° single cycle confi- guration swapping ° idea: duplicating all configuration bits for each “plane” and multiplexing between planes ° Problem: extra multiplexer required for each configuration bit ° All planes have to be mapped on a 2D chip (3D ‰ 2D mapping) ‰ longer routing between the primitives <=> lower performance ° Bad idea for FPGAs: most of the FPGA die area is spent on confi- guration SRAM cells) usefull only for coarse-grained architectures ° Better: multiplexing between different areas on the FPGA 28 PR Time-Granularity (multi-cycle)

Configuration by writing a new configuration bitstream to the device

° normal case for all FPGAs from Xilinx and Altera (starting with the -5 family)

° rapid partial module swapping (e.g., swapping within a frame in a video processing system)

° mode changing / field update (typically used in combination with full FPGA reconfiguration, e.g., in measurement equipment when changing settings or for prototyping (ASIC emulation))

29 PR in Time and Space

So far, we have only considered to have one module exclusively placed with a reconfigurable region (temporal partial reconfigurati- on) ‰ extension to multi-module placement of partially reconfigu- rable modules (spatial partial reconfiguration) Possibilities for tiling the reconfigurable area into resource slots: a) islainsdla sntydle style b) slot sstlyolet style c) grid gstryilde style

m3 4 m m m m m 3 m m 1 2 1 2 m 1 2

static part of the system unused reconfigurable area different modules As smaller the slots, as lower the internal fragmentation (the waste of logic resulting from fitting any sized module into a tile-grid (i.e.,

clustering the FPGA area into regular groups of resources)) 30 PR in Time and Space: Efficiency PR paradox: Runtime reconfiguration is brilliant, but not used! internal fragmentation M 4 3 4 6 3 m m m m m cconst 1 2 m m 1 2 m m 5 m

cconst communication cost c M M M overhead ° Internal fragmentation is dominating the overhead ° Can be optimized with small slots ‰ 2D placement (but might result in additional cost for the communication) ° 2D enhances BRAM/DSP utilization ° 2D is obligatory for newer FPGA Architectures (Virtex-5/6) Requires adequate on-FPGA communication architectures ° Buses

° Point-to-point connections 31 Optimal Resource Slot Size

° Internal fragmentation results from fitting modules into a grid of fixed resource slots. ° Analog: storing files in a filesystem with fixed clusters ° Average overhead of a module set of modules: l : resources in a slot c : communication mi : resources of module i

– =

Optimal slot size depends on the modules and communication cost

32 Optimal Resource Slot Size

° Impact of the resource slot size and the communication cost on the average module overhead ° Scenario: 9701 modules with 300, 301, …, 10000 LUTs with a communication cost of 0, 5, …, 25 LUTs per slot d

a 3000 e

h 2500 r e

v 2000 0 5 10 15 20 25 o

c

i 1500 g o

l 1000

e

g 500 a r

e 0 v 0 0 0 0 0 0 0 0 0 0 0 0 0 a 5 5 5 5 5 5 5 5 5 5 5 5 5 2 4 6 8 0 2 4 6 8 0 2 4 1 1 1 1 1 2 2 2 resource slot size in terms of LUTs Result: optimal slot size ~200–300 LUTs or ~25–40 CLBs 33 Optimal Resource Slot Size

l : resources in a slot c : communication mi : resources of module i

Discussion:

° If mi >> l the overhead converges to l/(l-c), meaning that for large modules (with many resource slots) the internal fragmentation becomes negligible. ° The optimal slot size can be computed by differentiating the average module overhead with respect to the slot size l. ° As the ceiling function is discontinuous, its bounds are considered: l l l ° Lower bound: perfect fit ° Upper bound: one slot is l almost unused l 34 Optimal Resource Slot Size Discussion: ° Upper bound: one slot is l almost unused l

l ° Worst case: l l

° Avarage case: (achievable only with 2D grid style placement)

35 Optimal Resource Slot Size

One of the best published solutions: ° Hagemeyer et al., Design of Homogeneous Communication Infrastructures for Partially Reconfigurable FPGAs ERSA, USA 2007. ° Master and slave support (32 bit) ° 16 sockets (XC2V4000) ° Communication cost: 8554 LUTs (~three 32-bit CPU-cores) ° No I/O support ° Resource slot size: 2560 LUTs

X

Catastrophic communication cost and too large resource slots 36 PR Space-Granularity (bitstream)

The behavior or structure of a system LUT-value LUT-value A3, A2, A1, A0 can be changed by small manipulations AND gate OR gate 0 OOOO 0 0 of the configuration bitstream. 1 OOO1 0 1 ° Manipulation of the routing 2 OO1O 0 1 (switch matrix multiplexer) 3 OO11 0 1 ° Changing logic functions 4 O1OO 0 1 O1O1 0 1 example: AND ‰ OR 5 6 O11O 0 1 O111 0 1 0 7 8 1OOO 0 1 1 9 1OO1 0 1 LUT values A 1O1O 0 1 B 1O11 0 1 F C 11OO 0 1 D 11O1 0 1 A0 A1 E 111O 0 1 A 2 Slice FF F 1111 1 1 A3 FF 37 PR Space-Granularity (bitstream) Tunable LUT a) standard look-up table implementation of a switchable wide input gate. Here, a multiplexer is used to switch between different Boolean functions. b) logic switching implemented by selectively reconfiguring the LUT table values (LUT tuning). ° Idea: keep the routing and change only the LUT values (faster reconfiguration) ° Only useful for very specific problems (e.g., crypto key changing).

a) b) partial reconfiguration LUT

38 PR Space-Granularity (small modules) Sometimes, even small modules can materially speed-up a system. ° Example: reconfigurable customized instruction set extensions (e.g., with instructions for CRC, DES round, bit swapping) result a) result b) register file register file OP_B OP_A OP_A OP_B conf. conf. instruction instruction instr. instr.

° Relatively small configurable instructions can speed up execution by at least an order of magnitude. (NIOS, GARP, DISC) ° Typically non concurrent operation (blocking the ALU) ° Difficulty: instructions have a high pin count per logic ‰Interfaces have to be ultra efficient!

° Different logic requirements ‰flexible instruction placement 39 PR Space-Granularity (large modules) Typically, systems consist of multiple concurrently working modules.

internal fragmentation M 3 4 3 4 6 m m m m m cconst 1 2 m m 1 2 m m 5 m

cconst communication cost c M M M overhead ° Difficulty: modules have different resource requirements ° Logic logic memory ° Memory reconfigurable ° Multipliers L L L L M L L L L L L L L M L L FPGA region ° Placement restrictions placement (string matching problem) options L L M L module

‰Interfaces should allow two-dimensional module placement ° Further: placement impacts the communication! 40 PR Space-Granularity (module coupling) tightly coupled loosely coupled register file cache system memory complexity (size) S C S O I I D N

] 1 [

reg file ° Reconfigurable HW memory e

in parallel to the ALU U h L

c bus a A c [1] Wirthlin and Hutchings: DISC: Dynamic Instruction RHW Set Computer (FCCM 1995)

41 PR Space-Granularity (module coupling) tightly coupled loosely coupled register file cache system memory complexity (size) I I S C

S O S I I O D N I

] N 1 [

reg file ° Reconfigurable HW memory e

in parallel to the ALU U h L

c bus a ° A

Module may contain c own register file RHW

42 PR Space-Granularity (module coupling) tightly coupled loosely coupled register file cache system memory complexity (size) I e I S C

z S O S a I I l O D N B I

. ] N 1 M [

reg file ° Reconfigurable HW memory e

in parallel to the ALU U h L

c bus a ° A

Decoupled by Fifo c channels (FSL-Fifo) ° Parallel execution RHW

43 PR Space-Granularity (module coupling) tightly coupled loosely coupled register file cache system memory complexity (size) I e I P S C

z R S O S a I I l A O D N B I

. G ]

N ] 1 M [ 2 [

reg file ° Coprocessor-like memory e

coupling of the U h L

c bus a reconfigurable HW A c [2] Hauser and Wawrzynek (FCCM 97): GARP: A MIPS Processor with RHW a Reconfigurable Coprocessor

44 PR Space-Granularity (module coupling) tightly coupled loosely coupled register file cache system memory complexity (size) I 4 e I P S C

z V R S O

S a I I l A C O D N B I

. G P ]

N ] 1 P M [ 2 [

reg file ° Coprocessor-like memory

coupling of the e U h L reconfigurable HW c bus a A ° Decoupled by Fifo c channels (FSL-Fifo) RHW ° Parallel execution

45 PR Space-Granularity (module coupling) tightly coupled loosely coupled register file cache system memory complexity (size) I 4 e I P S C

z V R S O

S a I I l A C O D N B I

. G P ]

N ] 1 P M [ 2 [

reg file ° Coprocessor-like memory

coupling of the e U h L reconfigurable HW c bus a A ° Decoupled by Fifo c channels (FSL-Fifo) RHW ° Parallel execution

46 PR Space-Granularity (module coupling) tightly coupled loosely coupled register file cache system memory complexity (size) I 4 e I P S C

z V R S O

S a I I l A C O D N B I

. G P ]

N ] 1 P M [ 2 [

reg file ° Connect reconfigurable memory

HW to the memory bus e U h L

c bus a ° Common FPGA-based A approaches require an c interface interface interface (in the easiest memory RHW RHW I/O case a “bus-macro”)

47 On-FPGA Communication

Goal: an efficient on-FPGA communication architecture that supports the grid-style module placement. ° Classification of different on-chip communication architectures: On-Chip Communication

Point-to-point Interconnect Bus Network-on-Chip

Hierarchical Custom Uniform Split bus Homogeneous Heterogene shared Bus source: Custom Segmented Bus ° for FPGAs: - buses (reading / writing of registerfiles and DMA) - point-to-point links (I/O-pin connection and data streaming)

48 On-FPGA Communication (History) ° Progress in Partial reconfiguration (physical implementation) using the Xilinx tools over the last decade: ° Fundamental problem: binding of the partial module entity sig- nals to fixed routing resources of the FPGA fabric „module plug“ PR region static system

'0' '0' '1' '1' '1' '0' '0' '0' '1' '1' '1' '0'

NAND OR

° „Xilinx Bus Macros“ for constraining the routing between the static system and one or more PR regions (introduced 2002) Í Costs two TBUFs per signal wire (in terms of latency and area) Í Placement restrictions & device support

49 On-FPGA Communication (History) ° Progress in Partial reconfiguration using the Xilinx tools over the last decade:

"slice-based bus macro"

OR NANODR

° „Slice-based Bus Macros“ (proposed by Hübner et al. in 2004) Ë More flexible (higher density of wires, more placement options) Ë Works with all Xilinx FPGAs (Virtex-II Pro: last FPGA with TBUFs) Í Costs two LUTs per signal wire (in terms of latency and area)

50 On-FPGA Communication (History) ° Progress in Partial reconfiguration using the Xilinx tools over the last decade:

"proxy logic" "proxy logic"

OR NOARND

° „Proxy logic“ (released for some devices by Xilinx in 2009) Ë Automatic placement of anchor primitives Í Costs one LUT per signal wire (in terms of latency and area) Í Only provided for some devices ° Same approach is used in the upcoming Altera PR flow 51 On-FPGA Communication "PR link"

OR

NAND „PR links“ Binding entity signals to the wires crossing the border to a reconfigurable module.

Ë No logic overhead, cleaner design flow, supports S6 (V5, V6) 52 On-FPGA Communication: Buses Bus macros are best suited to integrate modules into islands! The following slides present structured communication architec- tures for slot-based (1D) or grid-style (2D) module placement

The Simple Formula for Building Bus-based Reconfigurable Systems

53 ReCoBus Communication

All bus protocols can by implemented by the use of four signal classes: shared write shared read dedicated write dedicated read

Master Slave 1 Slave 2

__ R \ W address write_data

shared master read_data write signals interrupt_1 dedicated master

interrupt_2 read signals

select_1 dedicated master shared master address write signals select_2 read signals decode ° Example: connecting an interrupt signal from a slave to an interrupt controller is basically the same problem as connec- ting a bus request from a master module to an arbiter 54 ReCoBus: Shared Read Shared read signals for connecting one selected module with the static system: module 0 module M-1 a) b) slot 0 slot 1 slot R-1 fits into data_out data_out data_out data_out data_out . . . one LUT sel0 sel1 selR-1 r r sel0 selM-1 d e e & & & t t u s s & & a a m m m ≥1 ≥1 ≥1 m ≥1 y . . . '0'

 Homogeneous (=identical) logic and routing footprint inside each resource slot  Free module placement Ú Deep combinatory path (slow) Ú Massive resource overhead (has to replicated for each bit signal) Ú Only suitable for a coarse-grained placement grid

‰internal fragmentation 55 ReCoBus: Interleaving

° Problem: the structure of a distributed read multiplexer chain is unlikely for very fine-grained resource slot layouts: reconfigurable area Slot 1 Slot 2 Slot 3

'0'

D7..0 '0'

D15..8 '0'

D23..16 '0'

D31..24

‹ Logic overhead: 4/24 = 17%

56 ReCoBus: Interleaving

° Problem: the structure of a distributed read multiplexer chain is unlikely for very fine-grained resource slot layouts: reconfigurable area SloSt l1ot 1 Slot 2 Slot 2 Slot 3 SloSt l3ot 4

'0'

D7..0 '0'

D15..8 '0'

D23..16 '0'

D31..24

‹ Logic overhead: 4/18 = 22%

57 ReCoBus: Interleaving

° Problem: the structure of a distributed read multiplexer chain is unlikely for very fine-grained resource slot layouts: reconfigurable area SloSt 1lot 1 Slot 2 SloSt l2ot 3 SloSt l4ot 3 Slot 5 SloSt l4ot 6

'0'

D7..0 '0'

D15..8 '0'

D23..16 '0'

D31..24

‹ Logic overhead: 4/12 = 33%

58 ReCoBus: Interleaving

° Problem: the structure of a distributed read multiplexer chain is unlikely for very fine-grained resource slot layouts: reconfigurable area SS1lot S1 2 SS3lot S24 SS5lot S36 SS7lot S48 S9SlotS 510 S1S1lotS 612

'0'

D7..0 '0'

D15..8 '0'

D23..16 '0'

D31..24

‹ Logic overhead: 4/6 = 66%; very high latency!

59 ReCoBus: Interleaving

° Solution: multiple interleaved read multiplexer chains

reconfigurable area S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12

D 31..24 '0' D7..0 D7..0 D15..8 D15..8 D23..16

‹ Low logic overhead, low latency and fine granularity!

60 ReCoBus: Signal Alignment (1D) ° Example system:

Module 0 Module 1 S 1 S 2 S 3 S 4 S 5 S 6 S 7 S 8

CPU en dout en dout en dout en dout en dout en dout & & & & & & ≥1 ≥1 1 1 ≥1 ≥1 ≥1 ≥1

static system runtime reconfigurable system

° Alignment-multiplexer allow free module placement ° Interface grows together with the module complexity (size) For example: a small UART might be connected using an 8-bit data bus and a more complex Ethernet adapter with 32-bit ° The first LUT function of each chain (here rightmost) must be changed to an AND gate or an external source is needed 61 ReCoBus: Signal Alignment (1D) ° Assuming an 8-bit interface pro slot, it takes at least four consecutive slots to provide the full interface size

m1 0 1 2 3 0 1 2 3

0 start point & mux select value 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 used connection 0 unused connection D31...D24 D23...D16 D15...D8 D7...D0 62 ReCoBus: Signal Alignment (2D) ° The signal interleaving scheme can be extended to implement buses allowing to integrate modules in a 2D grid style.

m1 slot indexing: slot y y,x 0 0 1 2 3 0 1 2 3

m2 1 1 2 3 0 1 2 3 0

m3 2 2 3 0 1 2 3 0 1

m4 3 3 0 1 2 3 0 1 2

0 1 2 3 4 5 6 7 ≥1 ≥1 ≥1 ≥1 x

slot3,6 0 start point & mux select value 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 used connection D31...D24 D23...D16 D15...D8 D7...D0 0 unused connection 63 ReCoBus: Dedicated Write Signals

° LUTs can be used to decode an address within the bus value (the table contains then a one-hot value, e.g. for addr. 0xA) 0 0 1 0 LUT values 2 0 LUT in 3 0 0 Sin 0 SRL16 mode 4 0 1 1 5 0

. 6 0 . . 7 0 F F 8 0 0 A 9 0 A0 A 1 A1 A 1 A Slice FF A 2 2 0 A B 3 A3 Sout C 0 FF FF D 0 ° For setting an address, LUT values can be exchanged: E 0 ° Using the configuration port, F 0 ° Accessing the table with the user logic: SRL16 shift register primitive or distributed memory 64 ReCoBus: Dedicated Write Signals ° Architecture: uniformed distributed address comparator inside the bus (implemented by SRL16 shift register primitives)

fits into one 00 0 00 0 0 0 0 0 1 0 0 0 0 0 11 1 11 1 1 1 1 1 1 1 1 1 1 1 look-up table EN 5 0 1

config_data Q Din Q config_clock module_reset bus_enable 4 module_select

module_read bus_read & bus side reconfigurable select generator module side

° Two-stage reconfiguration: 1. FPGA: initialize the shift register with 0xFFFF 2. Logic: configure address comparator and activate module 65 ReCoBus: Dedicated Write Signals

° Arrangement of the address comparators

module 1 module 2 slot 0 slot 1 slot 2 slot 3 slot 4 slot 5 fits into one

n n look-up table e e t t _ _

t EN t c c 5

CPU 0 d d e e e e 1 l l a a s s config_data Q Q e D e e e e in e s s r r r r config_clock module module module_reset select select bus_enable logic config_data logic bus_enable 4 config_clock bus_read d muodule_select bus m m logic myodule_read bus_read & bus side reconfigurable select generator module side ° Allows module relocation ° Multiple instances of a module (individual module addresses ° Automatic reset generation ° No interference by the reconfiguration process (Hot-Plug) ° Extra register file look-up for alignment multiplexer control 66 ReCoBus: Dedicated Write Signals

° Arrangement of the address comparators

module 1 module 2 slot 0 slot 1 slot 2 slot 3 slot 4 slot 5 n n e e t t _ _ t t CPU c c d d e e e e l l a a s s fits into one e e e e e e s s r r r r look-up table EN 5 0 1

config_data Q D Q module in module cosnefilge_ccltock select module_reset logic config_data logic bus_enable bus_enable 4 config_clock bus_read d module_selectu bus m & module_read m logic bus_read y bus side reconfigurable select generator module side ° Allows module relocation ° Multiple instances of a module (individual module addresses) ° Automatic reset generation ° No interference by the reconfiguration process (Hot-Plug) ° Extra register file look-up for alignment multiplexer control 67 ReCoBus: Dedicated Write Signals

° Assuming an 8-bit interface pro slot, it takes at least four consecutive slots to provide the full interface size 5 4 3 2 1 0

select A ...A 1 1 1 1 1 1 9 8 7 6 5 4 3 2 1 0 master 7 0 module F F reserved select E E module_select 2 E 8 1 D D

A logic A

. A ...A .

. 7 0 C C . . .

1 B B 5 1 1 A A FF A A ... 9 9 FE 8 8 . module & 7 7 . . register 6 6 ≥1 bus_enable file 5 5 01 ≥1 ... 4 4 3 3 00 ≥1 2 2 1 1 ≥1 module 1 0 0 module_select0 ° Address mapping: the whole ReCoBus subsystem appears like one module in the address space of the system ° Up to 15 modules can be addressed (one encoding (0xF--) is used for the case that no module is selected) ° Wildcard addressing for multi cast operation (wired OR on read) 68 ReCoBus: Dedicated Read Signals ° Dedicated master read signals (interrupt)

module 1 module 2 slot 0 slot 1 slot 2 slot 3 slot 4 slot 5 IRQ dummy IRQ dummy CPU sink sink 1 d 0 u m m y

° Idea: set connection within a module to an internal homogenously routed interrupt wire by bitstream manipulation ° The number of internal interrupt lines scales with the number of modules (allows many tiny slots) ° Crosspoints are directly implemented in the FPGA routing fabric (no extra logic required) ° In practice: internal wire sharing for interrupt and bus arbitration (also: signal interleaving and masking in the static system) 69 ReCoBus Properties

° Direct connection of a module to the bus ° Compatible to all established standards (AMBA, PLB, …) ° Module relocation & flexible module placement ° Variable module sizes ° Multiple instances of the same module ° Very low logic overhead ° Allows high speed / high throughput ° Hot-swap module exchange: The reconfiguration process is completely transparent for all bus transactions.

70 I/O-Bars

OK - We have a suitable Bus.

What about dedicated links or I/O?

71 I/O-Bars for Point-to-Point Links ° Horizontal routing track within the reconfigurable area ° Connections are set by modifying switch matrices ° One bar per interface requirement (e.g., video, audio) bypass static system Slot 0 Slot 1 Slot 2 Slot 3 Slot 4 Slot 5 video video out in

audio audio out in ReCoBus

static system 72 I/O-Bars for Point-to-Point Links • Read-modify-write connection • Ideal for data streaming

static system Slot 0 Slot 1 Slot 2 Slot 3 Slot 4 Slot 5 video video out in

audio audio out in ReCoBus

static system 73 I/O-Bars for Point-to-Point Links • I/O bar implementation

Incoming signals Route through signals Outgoing signals

74 I/O-Bars for Point-to-Point Links ° I/O-Bar implementation for 2D ° Vertical routing is accomplished in the static part ° Can be used with interleaving for decreasing latency (requires signal alignment in each module)

75 Demo System www.Re Bus.de

° 248 logic slots (192 LUTs/slot) ° +16 RAM slots ° 8-bit slave bus (up to 48 bit via 6 sequent slots) ° Video streaming ° Free placement ° Connection cost: 14 LUTs/slot ° 100 MHz (XC2V6000-6)

76 Demo System ° Regular structured ReCoBus macro (a macro contains logic and routing and is instantiated like any other VHDL module)

° Implementation on a XC2V-6000 ° One CLB provides up to 8 data signals (for read and write) ° Lower CLB packing can improve routing (congestion around the connecting resources 77 The ReCoBus-Builder www.Re Bus.de

° Easy usable System Specification (Communication Architecture & Floorplan) builder for reconfigurable systems

° Available on www.recobus.de

generate generate static module Re Bus system repository

bitlink module.bit -pos X,Y static.bit -outfile initial.bit

78 Design Flow

ReCoBus & connection bar protocol specification Design Entry, Static/Dynamic

[ReCoBus-Builder] c i n o m i t a a n c i y

f Partitioning, and Verification i d r /

bus & bar static e c v mmoodduulele1 i t RTL model system module11 a d t n s

a ,

, y r g t n i functional simulation [Modelsim] n n e ReCoBus & connection bar protocol specification

o i n t i g t i r s [ReCoBus-Builder] a n e p OK? d n o i budgeting floorplanning and communication budgeting t a t

[Xilinx XST] synthesis [ReCoBus-Builder] [Xilinx XST] n

e bus & bar static

m mmoodduulele1 e l module11 p RTL model system static static ReCoBus module module1 module 1 m 1 i

netlist constraints I/O bars constraints tenmetplilsattes l a c i s y h place & route static [PAR] placebb&uuirliodldu p tpeaa rmtritaioal dlm umoleoddu[uPleleA1R] p 1 1 functional simulation [Modelsim]

build static bitstream [bitgen] build module1 bitstream [bitgen] y l b

module m static system fulml moodduulele11 1 e n bitfile bbitiftifliele s bitfile s OK? a

m

partial bitstream extraction a e r

bitstream linking [bitscan] t

[bitscan] s t i b

initial system mmoodduulele1 partial modu1 le1 bitfile bbibtiftiftliefliele repository for the run-time system

[ ] novel tool [ ] third party or vendor tool

79 Design Flow

ReCoBus & connection bar protocol specification Physical Implementation

[ReCoBus-Builder] c i n o m i t a a n c i y f i d r /

bus & bar static e c v mmoodduulele1 i t RTL model system module11 a d t n s

a ,

, y r g t n i functional simulation [Modelsim] n n e OK?

o i n t i g t i r s a n e p OK? d floorplanning and n o i budgeting budgeting budgeting floorplanning and communication budgeting t a

t communication synthesis

[Xilinx XST] synthesis [ReCoBus-Builder] [Xilinx XST] n

e [Xilinx XST] [Xilinx XST]

m [ReCoBus-Builder] e l static static ReCoBus module p module1 module 1 m 1 i

netlist constraints I/O bars constraints tenmetplilsattes l a c i

s static static ReCoBus module y module1 module11 h place & route static [PAR] bbuuilidld p paartritaial lm moodduulele1 p constraints place&route module1 [PA1 R] netlist I/O bars constraints tenmetplilsattes

build static bitstream [bitgen] build module1 bitstream [bitgen] y l b

module m static system fulml moodduulele11 place & route static [PAR] bbuuilidld p paartritaial lm moodduulele 1 e place&route module [PA11R] bitfile bbitiftifliele s 1

bitfile s a

m

partial bitstream extraction a e r

bitstream linking [bitscan] t

[bitscan] s t i b

initial system mmoodduulele1 partial modu1 le1 bitfile bbibtiftiftliefliele repository for the run-time system

[ ] novel tool [ ] third party or vendor tool

80 Design Flow

ReCoBus & connection bar protocol specification Bitstream Assembly

[ReCoBus-Builder] c i n o m i t a a n c i y f i d r /

bus & bar static e c v mmoodduulele1 i t RTL model system module11 a d t n s

a ,

, y r g t n i functional simulation [Modelsim] n build static bitstream [bitgen] n e build module bitstream [bitgen] 1 o i n t i g t i r s a n e p OK? d static system mmoodduulele1 full module1 1 n

o bitfile

i bitfile bitfile budgeting floorplanning and communication budgeting t bitfile a t

[Xilinx XST] synthesis [ReCoBus-Builder] [Xilinx XST] n e m e l static static ReCoBus module p partial bitstream extraction module1 module 1 m 1 i

netlist constraints I/O bars constraints tenmetplilsattes l bitstream linking [bitscan] a c

i [bitscan] s y h

build partial module p place & route static [PAR] placeb&uirlodu ptea rmtiaol dmuoledu[PleA11R] 1 module initial system partimalo mduoldeu11le build static bitstream [bitgen] build module bitstream [bitgen] 1 1 bitfile

y bitfile l bitfile bitfile b

module m static system fulml moodduulele11 1 e

bitfile bbitiftifliele s

bitfile s a

m

partial bitstream extraction a e r

bitstream linking [bitscan] t

[bitscan] s t i b

initial system mmoodduulele1 partial modu1 le1 bitfile bbibtiftiftliefliele repository for the run-time system

[ ] novel tool [ ] third party or vendor tool

81 Design Flow Tested Design

floorplanning and budgeting budgeting communication synthesis [Xilinx XST] [Xilinx XST] [ReCoBus-Builder]

static static ReCoBus module module1 module11 netlist constraints I/O bars constraints tenmetplilsattes

place & route static [PAR] bbuuilidld p paartritaial lm moodduulele1 place&route module1 [PA1 R]

build static bitstream [bitgen] build module1 bitstream [bitgen]

static system mmoodduulele1 full module1 1 bitfile bbibtiftiftliefliele

partial bitstream extraction bitstream linking [bitscan] [bitscan]

initial system mmoodduulele1 partial modu1 le1 bitfile bbibtiftiftliefliele repository for the run-time system bitlink module.bit X Y \ static.bit initial.bit 82 Design Flow

° Modules might be implemented using different shapes/resources (design alternatives) ° Goal: higher utilization ° Interesting for com- ponent based system design (no place and route) ° Simplified system integration based on standardized interfaces ° Enhanced IP-reuse 2 multiplier, logic only 2 multiplier, 6 logic slots 30 slots 6 logic slots (includes gap)

83 Design Flow: Blocking

84 Design Flow: Xilinx PlanAhead

° New advanced GUI for the complete FPGA design flow

° Project management

° Floorplanning

° Critical path analysis (timing)

° Implementation viewer Source: Xilinx ° Integration of the vendor specific partial flow

85 Design Flow: Xilinx PlanAhead

° 1. Step: Synthesis of all partial and static modules in individual netlists (Static netlist has black boxes for the modules) ° 2. Step: Creation of a new PlanAhead project ° 3. Step: Creation of Reconfigurable Partitions ° A reconfigurable partition (RP)consists of several reconfigurable modules (RM) ° Assign a partial netlist to each RM ° A RM can also be a black box (empty module) 86 Design Flow: Xilinx PlanAhead ° 4. Step: Floor planning of the reconfigurable partitions ° Create Area Groups ° PlanAhead automatically creates the communication ports for the reconfigurable partition

° Port proxy logic: LUT1 (anchor re- quired for physical implementation)

° PlanAhead automatically creates the user constraints file (UCF) with the

bounding box Source: Xilinx definitions of the RPs 87 Design Flow: Xilinx PlanAhead

° 5. Step: Run design rule check (DRC) to verify the design ° 6. Step: Create the first reconfigurable configuration ° Consisting of the static module and for each RP a RM ° Implement this configuration ° Promote this configuration ° 7. Step: Create further configurations for each module in a RP: ° Import the static design ° Implement the partial module

° 8. Step: Create the static and partial configuration bitfiles

88 Design Flow: Xilinx PlanAhead

Differences between the ReCoBus-Builder approach and PlanAhead: ° Slot-style or grid-style vs. island style reconfiguration (island style has no external fragmentation problem ‰simple placement) ° ReCoBus allows module relocation "proxy logic" and multi module instantiation

° Proxy logic bounds a module to a particular fixed region (RP) ° Example: 3 islands and 4 kinds of NOARND modules requires 3x4=12 physical implementations (place&route) ° All partial modules have to be re-implemented in case of changes in the static system (does not scale for complex systems)

89 Design Flow: FPGA Issues In Xilinx FPGAs, the smallest atomic piece of configuration data is a configuration frame that contains data for all (older devices) or a set of vertical aligned CLBs (newer devices) ° Arbitrary configuration update is possible using readback-modify-write

° Instead of readback, a configuration image might be stored in memory to avoid the relatively slow readback process ° Warning: Using LUTs as memory elements (e.g., SRL16 mode) might result in side effects, when updating modules above or below these primitives because LUT values get overwritten. 90 Run-time Management

° Main problem: online temporal module placement Problem: map a DFG G = (V,E) onto a reconfigurable v v area such that the 1 2 schedule is feasible v v and the total execution 3 4 time is minimized.

v5

In other words: ° computing module placement positions ° and schedules

° Question: predictable (offline) vs. unpredictable (oline) problem

91 PR example: Sorting for Database Acceleration ° Sorting contributes to 30% of the CPU time in huge databases input: unsorted data stream . r t initial step: fully sorted sequences

FPGA n 3 o R c

- [intermediate steps]: merging D

2GB/s m

PCIe D e final step: merge and emit result

8x m

MEM prefetcher m

a A e r t max burst size s

B d max latency

e > > > > t r A B C D

o C s > > n u D FPGA FPGA > sorted output initial step context switching final step 50% area saving or 4 times larger problems as compared to a static design ° Next step: hierachical reconfiguration: swap comparator cells for different data types (integer, text, …) 92 PR example: Custom instructions ° Fine-grained communication architecture for flexible instruction placement

result OP_A OP_B register file OP_B OP_A B A instruction conf. conf. instruction instr. instr. register file

° Identical routing for OPs and results in each slot ° Both operands are available in each slot (end point & middle access) ° Commutative instructions (e.g., A > B) ° Implementation alternative ° Bitstream manipulation

93 PR example: Custom instructions

instruction slices slots bitstream latency (max/av) 64-bit XOR gate 19 (40%) 1 2.64 KB 7.04 / 5.95 ns CCITT CRC 33 (34%) 2 5.28 KB 5.32 / 3.98 ns sat. add/sub 70 (73%) 2 5.28 KB 9.89 / 7.81 ns barrel shifter 90 (94%) 2 5.28 KB 11.07 / 7.88 ns '1'-bit counter 214 (89%) 5 13.2 KB 11.37 / 8.25 ns mask & permute 16 (33%) 1 2.64 KB 5.94 / 4.05 ns

° Direct connection (no „proxy logic“) ° Swapping of instructions: ° Dedicated load commands ° Triggered by a trap handler

94