<<

A Field Programmable Array Featuring Single-Cycle Partial/Full Dynamic Reconfiguration Jingxiang Tian, Gaurav Rajavendra Reddy, Jiajia Wang, William Swartz Jr., Yiorgos Makris and Carl Sechen Department of Electrical , The University of Texas at Dallas, Richardson, TX 75080 Email: {jxt122130, gxr141930, jxw143630, wps100020, gxm112130, cms057000}@utdallas.edu

Abstract—We introduce a CMOS computational fabric con- In the following, we discuss the FPTA architecture and sisting of carefully arranged regular rows and columns of tran- design features which support each of the above three ob- sistors which can be individually configured and appropriately jectives (Sections II-IV). To demonstrate the aforementioned interconnected in order to implement a target digital circuit. Termed Field Programmable Transistor Array (FPTA), this capabilities, we designed and laid out an FPTA prototype in novel reconfigurable architecture enables several highly-desirable the IBM 130nm 1.2V process and we developed a CAD tool features including (i) simultaneous storage of three configurations flow (Section V) to synthesize, place and route designs onto it. along with the ability to dynamically between them Results using various benchmark circuits (Section VI) confirm in a fraction of a single cycle, while retaining the fabric’s that, despite the added features of single-cycle switching computational state, (ii) rapid or full modification of a stored configuration in a time proportional to the number of modified between multiple designs and rapid partial reconfiguration, configuration bits through the use of hierarchically arranged, our FPTA achieves better area utilization in comparison to high throughput, asynchronously pipelined memory buffers, and a typical FPGA in the same node. (iii) support for libraries containing cells of the same height and variable width, just as in a typical circuit, thereby II. TRANSISTOR-LEVEL PROGRAMMING simplifying transition from a prototype to a custom IC design. Besides presenting the design details of this fabric in a 130nm To present our FPTA’s ability to support transistor-level pro- technology and demonstrating the aforementioned capabilities, gramming, we first describe its overall architecture, including we also briefly discuss the development of a complete CAD flow the basic logic cell structure and the routing resources. for programing this fabric and we use numerous benchmark circuits to contrast its area efficiency against a typical FPGA A. FPTA Architecture Overview implemented in the same technology node. The FPTA architecture, which resembles a standard cell I. INTRODUCTION circuit, consists of numerous long rows of and can work with cell libraries similar to those used for a typical We present a novel field programmable device, developed standard cell-based ASIC, where each cell has the same height on conventional static CMOS processes, which has significant and variable width. In the FPTA, the granularity of the width differences and potential advantages over field programmable of the cells is one column of transistors. As shown in Fig. gate arrays (FPGAs). Specifically, our design seeks to: 1(a), a column consists of two pMOS transistors above two 1) Improve area utilization: Unlike the basic configurable nMOS transistors. This basic column is replicated repeatedly logic block (CLB) of FPGAs, which employs look-up tables in the horizontal direction, forming a row. Vertical transistors (LUTs) to generate combinational logic functions [1], our in Fig. 1(a) can be programmed to be always on, always off, FPTA relies on a carefully-arranged, configurable array of or to receive logic . Among the horizontal transistors, transistors, which can be interconnected to implement standard while the innermost (blue-colored) ones are strictly used for library cells. Thereby, for logic outside the custom cells isolation, the outermost ones not only support isolation but (e.g., full adders, D flip-flops, multiplexers) that both FPGAs also enable the use of logic functions that require up to and our FPTA explicitly possess, we surmise that transistor three transistor in series. The metal1 (M1) layer is used to utilization in our FPTA is better than in FPGA LUTs. In other interconnect the transistors and various logic gates. words, an FPGA may allocate an entire LUT to implement The actual hardware implementation of the row-based FPTA even a relatively simple gate, while our FPTA allocates only groups every four columns of transistors into so-called ‘logic the precise number of columns of transistors required. cells’. In Fig. 1(a), a potential output is illustrated 2) Enable time-sharing between multiple circuits: Our by means of a small green rectangle. Each potential output design supports simultaneous storage of three separate con- is optionally connected to a vertical metal2 (M2) track, by figurations, each with its own computational state. Therefore, means of a programmed switch. In addition, the two pMOS we also provide the means to switch, in a fraction of a single and the two nMOS inputs in a column are directly connected cycle, between any of the three stored configurations, or load to individual vertical M2 tracks. Each of these tracks is driven a new configuration while toggling between the other two. by either a programming bit or a logic . 3) Support rapid circuit updates: Instead of being serially In addition to the four columns of transistors, each logic loaded, our FPTA configuration is stored in hierarchically cell comes with a custom D-flip-flop (DFF), a full arranged, high throughput, asynchronously pipelined memory (FA) and a multiplexer (MUX), whose inputs and outputs buffers. This enables not only faster configuration but also (I/Os) are optionally connected to logic cell I/Os. The D input rapid dynamic partial reconfiguration wherein only a portion of (Signal in) of the DFF, which is shown in Fig. 1(b), is either a circuit is reloaded by addressing specific transistor columns. connected to a logic transistor in the first column of the logic

978-3-9815370-8-6/17/$31.00 c 2017 IEEE 1336

07Dsg,Atmto n eti uoe(DATE) Europe in Test and Design, 2017 1337

wthn ewe utpeconfigurations. multiple between switching spiaiyue oruetepormigbt.I i.2(c), Fig. In bits. programming the route to used primarily is

n o t oa eoysrcuecnspotsingle-cycle support can structure memory local its how and M)aetehrzna otn eore.Mtllyr6(M6) 6 layer Metal resources. routing horizontal the are (M5)

enwpoedt eciehwteFT sprogrammed is FPTA the how describe to proceed now We otn eore,wiemtllyr3(3 n ea ae 5 layer metal and (M3) 3 layer metal while resources, routing

ea ae M)admtllyr4(4 r h vertical the are (M4) 4 layer metal and (M2) 2 layer Metal I.P III. R -C &S ECONFIGURATION YCLE INGLE ROGRAMMING

oi el a evee sasnl switchbox. single a as viewed be can cells logic

srpae ihab-ietoa . bi-directional a with replaced is

oe aro wthoe eogn ovrial neighboring vertically to belonging switchboxes of pair lower

hs ie,atrec oi el,tenO astransistor pass nMOS the cells, logic 4 each after lines, these

el(MStassos rmblw etil h pe and upper the Certainly below. from transistors) (nMOS cell

el i MSps rnitr.I re olmttedlyon delay the limit to order In transistors. pass nMOS via cells

rnitr)fo bv,adfrtebto ato h logic the of part bottom the for and above, from transistors)

daetM n 5ln emnsi h egbrn logic neighboring the in segments line M5 and M3 adjacent

h rgamn isfrtetppr ftelgccl (pMOS cell logic the of part top the for bits programming the

ttelf n ih onaiso h oi elbtcnetto connect but cell logic the of boundaries right and left the at

nFg () sdsrbdi eal ti oeefiin osupply to efficient more is It detail. in described is 2(c), Fig. in

h 3adM oiotlmtlln emnsterminate segments line metal horizontal M5 and M3 The

r xcl ymti,s nyteuprsicbx sshown as switchbox, upper the only so symmetric, exactly are

h pinlb-ietoa eetr hw nFg 2(e). Fig. in shown repeater, bi-directional optional the

bv h oi eladoejs eo.Tetoswitchboxes two The below. just one and cell logic the above

fohrlgcclsaoeadblw nete ieto,using direction, either in below, and above cells logic other of

ahlgccl nldstoruigsicbxs n just one switchboxes, routing two includes cell logic Each

etclM iesget a ecnetdt 4segments M4 to connected be can segments line M4 vertical

.RuigArchitecture Routing B.

emnt ttetpadbto ftelgccl.However, cell. logic the of bottom and top the at terminate

ihu otretassosi series. in transistors three to up with falgccl.Vria 2mtlln emnsliterally segments line metal M2 Vertical cell. logic a of

euetenme fnt yalwn oecmlxcells complex more allowing by nets of number the reduce h aiu ea iesget emnt tteboundary the at terminate segments line metal various The

r u oteitroncinntok.Tu,w eie to decided we Thus, networks. interconnection the to due are oi elt nuso erylgccells. logic nearby of inputs to cell logic a

oee,frFGs h atmjrt ftedlyadpower and delay the of majority vast the FPGAs, for However, 3 5 n 7fcltt oa oncin rma uptof output an from connections local facilitate 17 and 15, 13,

ufiin ognrt iciswt h etpwrefficiency. power best the with circuits generate to sufficient nteuprsicbxad4i h oe n.M rcs11, tracks M3 one. lower the in 4 and switchbox upper the in

nsre o aho h ulu n uldw ewrswas networks pull-down and pull-up the of each for series in hc onc ootus a oncincocst 3 4 M3, to choices connection 8 has outputs, to connect which

hw htasadr ellbaylmtdt w transistors two to limited library cell standard a that shown hie oM ie i wths aho h 2lines, M2 4 the of Each . via lines M3 to choices

fcec esspwrefiinycnen.I 2,i was it [2], In concerns. efficiency power versus efficiency hc onc opO rnO nus a connection 7 has inputs, nMOS or pMOS to connect which

o iiigt he rnitr nsre a ae narea on based was series in transistors three to limiting for onc opO nO)ipt.Ec fte1 2lines, M2 12 the of Each inputs. (nMOS) pMOS to connect

itr nsre stemxmmpsil.Temotivation The possible. maximum the is series in sistors ie onc oteotuso h oi elad1 2lines M2 12 and cell logic the of outputs the to connect lines

oeta uldw o ulu)ntoko he tran- three of network pull-up) (or pull-down a that Note MS(MS aeipto ta upt nfc,4M2 4 fact, In output. an at or input gate (nMOS) pMOS a

omteotu oeo h oi function. logic the of node output the form el h 6M ie emnt nietelgccl,ete at either cell, logic the inside terminate lines M2 16 the cell,

ihihe nbu r undo ihpormigbt to bits programming with on turned are blue in highlighted wthst 4 o h wtho bv blw h logic the (below) above switchbox the For M4. to switches

h etaetre nt opeetecrut Transistors circuit. the complete to on turned are rest the wthst 4 aho h eann 3lnshs3 has lines M3 8 remaining the of Each M4. to switches 4

inlnmsa hi aetriasrciepiayinputs; primary receive terminals gate their at names signal e) aho h 3lnsadec fte9M ie has lines M5 9 the of each and lines M3 9 the of Each red).

undof mn h ihihe rnitr,teoe with ones the transistors, highlighted the Among off. turned ie sice nga)adt h 5lns(wthsin (switches lines M5 9 the to and gray) in (switches lines

l h rnitr,ecp o h nshglgtdi lc,are black, in highlighted ones the for except transistors, the All ntaogwt wthscnetn ote1 oiotlM3 horizontal 17 the to connecting switches with along unit

AD n nAI2gt,aeporme nalgccell. logic a in programmed are gate, AOI22 an and NAND3 hr r 2vria 4lnsta ooe h oi cell logic the over go that lines M4 vertical 12 are There

a i.e., cells, different two how depict (b) and 2(a) Figs. saddt ahmtlsget ssoni i.2(d). Fig. in shown as segment, metal each to added is

half-keeper a , Vdd full a pass not do transistors pass the Since oma h upto h t column. 4th the of output the at form

nti case). this in ctrl called bit programming the (with 2(d) Fig. in U upti rvddi ihrivrigo non-inverting or inverting either in provided is output MUX

w epniua ea ie o ifrn aes sshown as layers) different (on lines metal perpendicular two hc ssoni i.1d,ol cuisoeclm.The column. one occupies only 1(d), Fig. in shown is which

rgamn i,wt h oreaddancnetn the connecting drain and source the with bit, programming a fte2dad3dclms epciey h utmMUX, custom The respectively. columns, 3rd and 2nd the of

wth mlmne ya MStasso otoldby controlled transistor nMOS an by implemented switch, a ete netdo o-netd r rvdda h outputs the at provided are non-inverted) or inverted (either

ahsalsur o aiu oos hw nFg ()is 2(c) Fig. in shown colors) various (of square small Each n r oun ftelgccl.Tecryadsmoutputs sum and carry The cell. logic the of columns 3rd and

aee st o hycnetwt 2lnsi i.2(c). Fig. in lines M2 with connect they how to as labeled fteF,wihi hw nFg () pnars h 2nd the across span 1(c), Fig. in shown is which FA, the of

rc ubr nFg () h rnitr ntelgccl are cell logic the in transistors the 1(a), Fig. In number. track h oi el r once nasa hi.Tetreinputs three The chain. scan a in connected are cells logic the

ae ubr n hna nesoefloe yteln or line the by followed underscore an then and number, layer i h emlilxr l Fspoie by provided DFFs All de-multiplexer. the via 1 = enable if

ahmtlln slbldwt h etr‘’floe ythe by followed ‘M’ letter the with labeled is line metal each )o oteDipto h DFF the of input D the to or 0) = enable and 1 = ctrl (if cell

i.1 a oi elsrcue b ul-nDflpflp c ul-nfl de,ad()Biti multiplexer Built-in (d) and adder, full Built-in (c) flip-flop, D Built-in (b) structure, cell Logic (a) 1: Fig.

(a) (b) (b) (a) (c) (c) (d) Signal_in Mem_in Signal_in3 Mem_in3

Signal_in3 Mem_in3

0 1 0 M2_7b M2_11b 28 M2_10b M2_8b ctrl

0 1 0

ctrl3

0 1 0

M2_12b M2_5b M2_9b M2_2b

ctrl3

1 0

0 1 0

en

0 1 0 en

sd

SD D SD M2_1b M2_6b 23 M2_4b M2_3b en

SCAN DFF SCAN

po clk

1 0 1 21bM_4 21bM2_16b M2_15b M2_14b M2_13b

en

MS

DFF/ rst

C

en MS

S 1 0 scan

Q SQ SQ Q en

FA

A

MUX

en

2O 2O M2_O4 M2_O3 M2_O2 M2_O1 MUX B

sq B MUX MC

en

A po en en

21tM_4 21tM2_16t M2_15t M2_14t M2_13t

M2_1t M2_6t 23 M2_4t M2_3t

0 1 0

DD V

22 29 25 M2_12t M2_5t M2_9t M2_2t

0 1 0 0 1 0

0 1 0 0 1 0

en en en

en

M2_7t M2_11t 28 M2_10t M2_8t

1 0 1 0 1

1 0 1 1 0 1 ctrl2 ctrl1 ctrl2 ctrl1 Signal_in1 Mem_in2 Signal_in2 Mem_in1 Mem_in2 Signal_in2 Mem_in1 Signal_in1

07Dsg,Atmto n eti uoe(DATE) Europe in Test and Automation Design, 2017 1338

a eo n h rgamn iscomprising bits programming the and on be may clkb running, . cpb on turning by execution resume) (or start can system

is cpa by configured system the while example, For loaded. ytmcngrto,i rcino lc yl,tesecond the cycle, clock a of fraction a in configuration, system

unn hl h rgamn iso e ytmaebeing are system new a of bits programming the while running sbigue ola h third the load to used being is clkc While . clkc on turning and

n n utemr,acmltl rgamdsse a be can system programmed completely a Furthermore, on. one clkb off turning by loaded subsequently be can system third

n lblytrigadifferent a turning globally and ) cpc ,or cpb , cpa (among on ,a clkb using loaded completely is system alternate this After

lc yl,b unn f h oysga hti currently is that signal copy the off turning by cycle, clock a PAi o matdb h odn fa lent system. alternate an of loading the by impacted not is FPTA

hsdsg nbe yai eofiuaini rcinof fraction a in reconfiguration dynamic enables design This sof h urn tt fthe of state current the off, is cpb Since cell. memory the

natraiesse a elae notescn ac of latch second the into loaded be may system alternative an hti upyn h urn rgamn i oteFPTA. the to bit programming current the supplying is that

shg taygvntm,rpeetn h latch the representing time, given any at high is cpc ,or cpb

i.4 ru tutr ihlgccls(C)frec unit each for (LCs) cells logic with structure Group 4: Fig. , cpa of one only cell, memory each For FPTA. the configure

sue oslc h trdpormigbt hc actually which bits programming stored the select to used is

DECODER

3

cpc ,or cpb , cpa signals copy global the of one Subsequently,

8 7 6 5 4 3 2 1

UNIT UNIT UNIT UNIT UNIT UNIT UNIT UNIT eet h rprltht eev h bit. the receive to latch proper the selects clkc ,or clkb , clka

Bs vial e nt n ftegoa oto signals control global the of one unit, per available (BLs)

hnwiigpormigbt nfo h 3bit-lines 33 the from in bits programming writing When

tt ahn)t another. to ) state e.g., ( design digital one from

ob trda h aetm,ealn igecceswitch single-cycle a enabling time, same the at stored be to

LC LC LC LC LC LC LC LC he ace r sdt lo he eaaeFT i streams bit FPTA separate three allow to used are latches Three rvnb rnmsingt wths ssoni i.5(a). Fig. in shown is switches, gate transmission by driven

L0 MEMORY BUFFER h oa eoysrcue hc ossso he latches three of consists which structure, memory local The

.LclMmr Structure Memory Local B.

... ee ceai fagop hs iei 3u y620um. by 430um is size whose group, a of schematic level

lc tre ru)i h PA i.4sosteblock- the shows 4 Fig. FPTA. the in group) a (termed block i.3 ntstructure Unit 3: Fig.

2m ih nt r obndt omtebscreplicated basic the form to combined are units Eight 72um.

DECODER ...... by 430um is unit single a of layout the layers), metal thick 2

and thin (6 process stack metal 6-2-0 130nm IBM the Using

LOCAL MEMORY 4 MEMORY LOCAL

half. upper the expand only we symmetric, approximately

Cell

Logic ewe ifrn ucinlbok.A h ntsrcueis structure unit the As blocks. functional different between

SWITCHBOX

UNIT ...

LOWER h etprino i.3sostedtie connections detailed the shows 3 Fig. of portion left The REP/SWITCH

... boxes. switch two the surrounding parts, four into separated wthoe.T hre h iig h oa eoyis memory local the wiring, the shorten To switchboxes. LOCAL MEMORY 3 MEMORY LOCAL

oto h rgamn isaeue ocngr the configure to used are bits programming the of Most 2

Memory LC CELL LOGIC

...... Local h rgamn isue ocngr nti loshown. also is unit a configure to used bits programming the

oe wtho nFg ,telclmmr hc stores which memory local the 3, Fig. in switchbox lower

LOCAL MEMORY 2 MEMORY LOCAL

etoe wtho iig emduprsicbxand switchbox upper termed wiring, switchbox mentioned

Array

oi el hc scle nt eie h previously the Besides unit. a called is which cell, logic Switch

SWITCHBOX Top

...... UPPER tutr ftefntoa lcscmrsn n complete one comprising blocks functional the of structure

REP/SWITCH

h ahdbxi h ih oto fFg hw the shows 3 Fig. of portion right the in box dashed The

LOCAL MEMORY 1 MEMORY LOCAL 1

.PormigPt Structure Path Programming A. Memory

...... Local REPEATER

...... n e idrcinlrepeater Bi-directional (e) and

i.2 a ofiuaino AD,()Cngrto fa O2,()Uprsicbx d astasso ihhalf-keepers, with transistor Pass (d) switchbox, Upper (c) AOI22, an of Configuration (b) NAND3, a of Configuration (a) 2: Fig.

(a) (b) (b) (a) (c) (e) (e) (c) ctrl1

M4_5 M2_12t M2_11t M2_10t M2_9t M2_7t M2_6t M2_1t M2_O4 M2_O3 M2_O2 M2_O1 M4_12 M4_11 M4_10 M4_9 M4_8 M4_7 M4_6 M4_4 M4_3 M4_2 M4_1 M2_5t M2_4t M2_3t M2_8t M2_2t

io1 M3_17 io2

M3_16 ctrl1

M3_15

M3_14

ctrl2

BC C

M3_13 B

M3_12 ctrl2

M3_11

M3_10

A A D M5_9

M3_9 M5_8

(d)

M3_8

M5_7

M3_7

M5_6

ctrl

C D M3_6

M5_5

M3_5

M5_4

M3_4

A B ABC

M5_3

M3_3

M5_2

M3_2

M5_1 M3_1 clki clka L group1 L group2 L group3 L group4 L group5 L group6 0 0 0 0 0 0 DFF q BL bus d

clka clka cpa

L L L L L L (c) L group7 L group8 L group9 L group10 L group11 L group12 clka cpa 0 0 0 0 0 0 clkb 1 1 1 1 1 1 cpb clkb clkb

BL out

clkb cpb rq ack L group13 L group14 L group15 L group16 L group17 L group18 rq clkc 0 0 0 L 0 0 0 out Ri cpc 2 ack clkc clkc

out

L group19 L group20 L group21 L group22 L group23 L group24 clkc cpc 0 0 0 0 0 0

(a) (b) (d) Fig. 5: (a) Local memory structure, (b) Asynchronous memory buffer pipeline, (c) Memory buffer, and (d) Muller C-element In summary, it is possible to store the programming for three bits, the various memory buffers are, in fact, bi-directional; separate systems and switch between them in a fraction of a for brevity, this part is not presented here. clock cycle, or toggle between two systems while loading a Fig. 5(c) shows the structure of the programming bit register. third one. In order to properly enable toggling between running The clk triggers the DFF to pass the data from the bus to systems, not unlike swapping jobs in a the BL. The DFFs in the L2, L1, and L0 memory buffers are of a , separate system or finite state machine flip- controlled by distinct clk signals. These clk signals are derived flops must be available for each copy signal. Along those lines, from an asynchronous pipeline control unit, described next. with respect to Fig. 1(b) which was simplified, there is not just B. Programming Bit Asynchronous Pipeline one flip-flop whose output may optionally appear at the first column of each logic cell, but, in fact, three different flip-flops We use an asynchronous pipeline to achieve a high program- enabled by, respectively, cpa, cpb,orcpc. ming rate. For example, after an L2 memory buffer receives a 33-bit payload from off-chip, it forwards it (along with the IV. RAPID PARTIAL RECONFIGURATION address) to the corresponding L1 memory buffer as soon as the latter is ready to receive it. When the L1 memory buffer Lastly, we describe the architecture through which the FPTA receives the address and payload, the L2 memory buffer is configurations are stored and we explain how they can be freed to accept a new address and payload from off-chip. The selectively updated to support rapid partial reconfiguration. rate at which programming bits can be sent from off-chip is extremely high. In fact, detailed circuit simulations show that A. Memory Buffers L2, L1 and L0 the programming bit data rate is nominally 9.0 Gbps. The Level 0 Memory (L0) shown on the left side of Fig. 4 is We used a bounded-delay asynchronous pipeline control actually a memory buffer used to supply programming bits to scheme [3]. Each stage of the asynchronous pipeline consists the local memory for each of the 8 units comprising the block. of a Muller C-element, shown in Fig. 5(d). When a stage Each unit has 16 columns, each of which contains a payload Ri receives a request (req) from the previous stage Ri-1, if of 33 programming bits. The total number of columns for each the acknowledge (ack) from the next stage Ri+1 is available group is, therefore, 16 x 8, or 128. This means that 7 address (active low), then a request is generated to Ri+1 along with bits must be appended to the 33-bit payload to direct it to the an acknowledge to Ri-1. proper column. Three of the address bits select a unit and the The asynchronous control scheme for writing programming remaining four select a column within that unit. Thus, when bits to the local memories is shown in Fig. 6. The signal clki writing a 33-bit payload to a local memory, the L0 Memory (where i is the stage index of the memory buffer) is generated buffer passes a word of 40 bits to the group. by the logical AND of Ri and the inversion of the (bounded) We designed an asynchronous memory buffer pipeline to delayed Ri. This is basically a pulse generator that ensures rapidly program the FPTA. Our prototype has six groups that the various local clocks (clki) are non-overlapping in their horizontally and four groups vertically, as shown in Fig. high portions. The signal clki is used to trigger the DFFs in 5(b). When writing programming bits, L0 memory buffers are the Li memory buffer, as shown in Fig. 5(c). R2 receives the supplied by Level 1 memory buffers (L1). Each L1 memory request to write programming bits (RQ WR) from off-chip. buffer supplies a word to one of four L0 memory buffers, Since the L2 memory buffer feeds any of six L1 memory arranged vertically. The word size for the L1 memory is 42 buffers in the prototype, acknowledge signals from all six of bits: 40 bits needed by the selected L0 memory buffer plus them are OR-ed to form the acknowledge for R2. Also, since 2 address bits to select that L0 memory buffer. The six L1 each L1 memory buffer feeds any of four L0 memory buffers, memory buffers are fed by a Level 2 memory buffer (L2), acknowledge signals from all four of them are OR-ed to form as shown in Fig. 5(b)). Its word size consists of 45 bits: 42 the acknowledge for R1. The local memory driven by the L1 bits needed by an L1 memory buffer plus 3 bits to select memory buffer is controlled by local clock clk00. Note that the particular L1 memory buffer. An off-chip memory then the last stage uses the delayed request as its acknowledge. supplies 45-bit words to the L2 memory buffer. In order to The delay elements (Ds), shown in Fig. 6, are set based facilitate fast (i.e., pipelined) read-out of the programming on careful worst-case simulation of the extracted layout of the

2017 Design, Automation and Test in Europe (DATE) 1339 3 2 decoder decoder TABLE I: Area utilization compared to a commercial FPGA RQ_WR Benchmark Cell Count FPTA Altera OR6 OR4 (Synopsys Utilization Stratix ...... DC) Utilization

B04 317 1.02% 2% R2 R1 R0 R00 B05 353 1.29% 2% B12 539 2.02% 4% D D D D SPI 1240 4.27% 8% B14 2123 8.69% 10% Tv80 3077 11.13% 19% clk2 clk1 clk0 clk00 B15 3461 12.18% 22% B20 4407 17.78% 20% Fig. 6: Asynchronous write-in control scheme B21 4635 19.07% 20% B22 6702 26.85% 29% B17 10942 37.75% 68% AES cipher 9422 38.91% 47% AES inv cipher 13578 52.07% 72% WB conmax 16436 70.00% 148% ∗ B18 25303 87.85% 140%

(∗) One instance of B15 is removed from B18 to reduce the size of the benchmark in order to ensure that it fits within the available resources

Fig. 7: Layout of prototype comprising a 6X4 array of groups, (Versatile Place and Route) [5], [6] to make it compatible with each group comprising 8 logic cells our architecture. Finally, bit-stream generation is done through a Python script which we developed for this purpose. prototype. Since this pipeline is used for programming bits and VI. DEMONSTRATIONS is not a signal datapath, maximum throughput is not required. Therefore, we conservatively double the worst-case simulated We designed and laid out a prototype for fabrication using delay (including worst-case corner) to set the delay element the IBM 130nm 1.2V process. It includesa6x4array of value with sufficient margin to handle process, , and groups, each containing 8 logic cells, for a total of 192 logic temperature (PVT) variations. cells. The layout is shown in Fig. 7, and its overall size is 4113.41um x 2769.50um. V. C ELL LIBRARY AND CAD TOOL FLOW Our base library consists of all the possible inverting gates A. Area Utilization Efficiency that are feasible with our FPTA and its series limit of three In order determine the area utilization efficiency of our transistors. The 24 base library components are: INV, NAND2, FPTA, we compared it with a commercial FPGA, Altera NOR2, AOI12, AOI22, OAI12, OAI22, NAND3, NOR3, Stratix EP1S10, which uses the same 130nm technology and AOI31, OAI31, AOI41, OAI41, AOI32, OAI32, AOAI311, has a core size of 23mm X 23mm [7]. To make a fair OAOI311, AOAI211, OAOI211, AOOAI212, OAAOI212, comparison, we scaled up our FPTA to the same size, resulting AOAAI2111, OAOOI2111 and MAJI (which is the inverted in an array of 51 x 35 groups of 8 logic cells, for a total of mirror carry). In addition, the following custom cells are built 14,280 logic cells. We then implemented various benchmarks into the logic cells: FA, FAI, DFF, MUX and MUXI. from ITC’99 [8] and opencores [9] on both our FPTA and To further increase logic density, following the philosophy the Altera chip. A comparison of the resource utilization is of [2], numerous compounds of the 24 base library cells are presented in Table I. Note that the same switchbox sizes were also provided. Compound cells are created by appending an used throughout the scaled up array and the benchmarks were inverter (or a NAND2 or NOR2) to one input and/or the 100% routed. output of each of the 24 base cells, resulting in a total of 234 Despite the additional area overhead due to having three compound cells. Compound cells are placed as a unit but are memory cells per programming bit, the density (or utilization) decomposed into their constituent base cells prior to routing. of the FPTA is quite competitive with the Altera chip. We We also developed the necessary CAD tool-flow for pro- attribute this observation to the fact that for logic outside the gramming the FPTA. Our tool-flow consists of industry- custom cells (e.g., full adders, carry units, flip-flops, multi- standard commercial tools along with open-source software plexers) that both designs possess, the transistor utilization which has been modified to work with our architecture. Syn- of the logic cells in the FPTA is better than the transistor opsys Design is used to synthesize a gate-level netlist utilization of the LUTs in the Altera design. Essentially, even from the Register-Transfer Level (RTL) description of the a relatively simple logic function might take up an entire LUT, design. The cell library, consisting of 24 base cells, 11 built-in whereas in the FPTA, only the precise number of columns cells and 234 compounds was characterized using Synopsys needed to implement the gate are used. Thus, simple gates SiliconSmart. Placement is done through TimberWolf [4], such as NAND2, NAND3, NOR2, NOR3, and up to three or which is very effective in row-based placement. For routing, even four input AOI and OAI gates are comparatively very we modified the source code of the open-source tool VPR area-efficient in the FPTA.

1340 2017 Design, Automation and Test in Europe (DATE)

07Dsg,Atmto n eti uoe(DATE) Europe in Test and Automation Design, 2017 1341

etrn i iutnossoaeo he configurations, three of storage simultaneous (i) featuring 9 pnoe,, http://opencores.org/. ,” Opencores, [9]

o.1,n.3 p 45,Jl2000. Jul 44–53, pp. 3, no. 17, vol. , edvlpdanvlfil rgambetasso array transistor programmable field novel a developed We

EEDsg etof Test Design IEEE results,” ATPG first and benchmarks

I.C VII. ONCLUSION

8 .Cro .S era n .Suleo R-ee ITC’99 “RT-level Squillero, G. and Reorda, S. M. Corno, F. [8]

handbook.pdf.

otenme fbt hne ntebit-stream. the in changed bits of number the to

US/pdfs/literature/hb/stx/stratix dam/altera-www/global/en

h ierqie o eofiuigteFT sproportional is FPTA the reconfiguring for required time the

7 leaCroain ”https://www.altera.com/content/ ,” Corporation, Altera [7]

eodteetr i-temfrasaldsg hne hence, change; design small a for bit-stream entire the reload lwrAaei ulses 1999. Publishers, Academic Kluwer , FPGAs Deep-Submicron for

a tpe.Slciercngrto lmntstene to need the eliminates reconfiguration Selective stopped. had rhtcueadCAD and Architecture Eds., Marquardt, A. and Rose, J. Betz, V. [6]

edleg 1997. Heidelberg, dw’fo h aesae(on 1)weeteu counter up the where ‘1’) (count state same the from ‘down’

p 1–2,Srne Berlin Springer 213–222, pp. , research FPGA for tool routing

oncutr ttime At counter. down h oncutrsat counting starts counter down the ,

2

t

P:anwpcig lcmn and placement packing, new a VPR: Rose, J. and Betz V. [5]

nFg srcngrd hscnet h pcutrit a into counter up the converts This reconfigured. is 8 Fig. in

com/.

h oto hw ihntedse e rectangle red dashed the within shown portion the , and

2 1 t t 4 IBROF Tmewl ytm n.”http://www.twolf. inc.,” systems “Timberwolf TIMBERWOLF, [4]

.Between ). ttecut‘’o t eodcutn yl (time cycle counting second its of ‘1’ count the at 1 o.1,n.1 p 86,2002. 58–62, pp. 1, no. 10, vol. , Systems (VLSI) Integration t

EETascin nVr ag Scale Large Very on Transactions IEEE logic,” dynamic and one srntruhoefl onigcceadi stopped is and cycle counting full one through run is counter

3 .N oe,G e,adC ehn Lclycokdpipelines clocked “Locally Sechen, C. and Yee, G. Hoyer, N. G. [3] us.The pulse. cpa the receiving after soon 0, from ‘up’ counting

01 p 627–632. pp. 2011, , Conference Automation Design ACM/IEEE

h i-temo nu one slae n h one starts counter the and loaded is counter up an of bit-stream the

euto i eaaesnhssadpyia irre, in libraries,” physical and synthesis separate via reduction

i-tem silsrtdi h aeom fFg 0 Initially, 10. Fig. of waveforms the in illustrated as bit-stream,

2 .Rha,R fno .Tnaon n .Sce,“Power Sechen, C. and Tennakoon, H. Afonso, R. Rahman, M. [2]

fcmuainlsaebtenteiiiladtemodified the and initial the between state computational of o.3,n.5 p 7–9,2014. 371–393, pp. 5, no. 31, vol.

eofiuainmd loalw h eeto n transfer and retention the allows also mode reconfiguration , of Journal ,” and architectures FPGA

1 .Yn,J hn,J u,adL u Rve fadvanced of “Review Yu, L. and Sun, J. Zhang, J. Yang, H. [1] noadw one,a hw nFg () hsselective This 8(b). Fig. in shown as counter, down a into

otelgccl ntemdl,isfntoaiyi changed is functionality its middle, the in cell logic the to R EFERENCES

() yslcieycagn nytebt corresponding bits the only changing selectively By 8(a).

hc eaeivsiaigi u non research. ongoing our in investigating are we which

siiilycngrda nu one,a hw nFig. in shown as counter, up an as configured initially is

vligcrut n edpormal eievirtualization, device programmable field and circuits evolving

eosrtduiga xml fa2btcutr which counter, 2-bit a of example an using demonstrated

iiyo e opttoa oaiis uha dynamically as such modalities, computational new of bility

h ata/eetv yai eofiuaincpblt is capability reconfiguration dynamic partial/selective The

eofiuain ebleeta hsFT pn h possi- the opens FPTA this that believe We reconfiguration.

.SlciePrilDnmcReconfiguration Dynamic Partial Selective C.

n utpecngrtosadslcieprilfl dynamic partial/full selective and configurations multiple ing

n h irrhclpormiglgcrqie o support- for required logic programming hierarchical the and tem ihasnl-yl toggle. single-cycle a with streams

yia PA ept h rpenme fpormigbits programming of number triple the despite FPGA, typical PArsucscnb iesae ewe w ifrn bit- different two between time-shared be can resources FPTA

hsFT a qiaeto ueirae fcec vra over efficiency area superior or equivalent has FPTA this ons‘on rm3t .Tewvfrscnr htthe that confirm waveforms The 0. to 3 from ‘down’ counts

hog ul otdipeetto fbnhakcircuits, benchmark of implementation routed fully through one satvtd ihnasnl yl,adtecounter the and cycle, single a within activated, is counter

rniinfo rttp oacso C sdemonstrated As IC. custom a to prototype a from transition rie,tebtsra orsodn otedown the to corresponding bit-stream the arrives, cpb as soon

ftesm egtadvral it,teeysimplifying thereby width, variable and height same the of uss As pulses. clk counter the on based 3 to 0 from ‘up’ counting

n ii upr o SCcmail irre otiigcells containing libraries ASIC-compatible for support (iii) and us spoie n h one starts counter the and provided is pulse cpa the when activated

ihtruhu,aycrnul ieie eoybuffers, memory pipelined asynchronously throughput, high h PA epciey ssoni i.9 h pcutris counter up the 9, Fig. in shown As respectively. FPTA, the

ofiuainbt,truhteueo irrhclyarranged, hierarchically of use the through bits, configuration n r oddit h n ace ftelclmmr of memory local the of latches B and A the into loaded are and

ofiuain natm rprinlt h ubro modified of number the to proportional time a in configuration, tem r eeae o h u one’ad‘oncounter’ ‘down and counter’ ‘up the for generated are streams

ainlsae i)rpdprilo ulmdfiaino stored a of modification full or partial rapid (ii) state, tational ihtewvfrsilsrtdi i.9 w eaaebit- separate Two 9. Fig. in illustrated waveforms the with

rcino igecce hl eann h arcscompu- fabric’s the retaining while cycle, single a of fraction a oniguwrsadteohrcutn onad,along downwards, counting other the and upwards counting

ln ihteaiiyt yaial wthbtente in them between switch dynamically to ability the with along eosrtduigto2btcutr,soni i.8 one 8, Fig. in shown counters, 2-bit two using demonstrated

h igeccercngrto aaiiyo h PAis FPTA the of capability reconfiguration single-cycle The

i.1:Prildnmcrcngrto demonstration reconfiguration dynamic Partial 10: Fig.

.Snl-yl wthn ewe Configurations Between Switching Single-Cycle B.

5n 0n 5n 0n 5n 500ns 450ns 400ns 350ns 300ns 250ns

i.8 a -i pcutr n b -i oncounter down 2-bit (b) and counter, up 2-bit (a) 8: Fig.

Count 0 1 2 3 0 1 0 3 2 1 0 0 1 2 3 0 1 0 3 2 1 0

LSB

(b)

MSB

o4 o1 o5 o5 o2

counter_clk

counter_clk

o3 o1

cpa

rst 1 2 t t

RR

o5 o2 o4 o1 o3

Q Q

D o5

D o1

o4 o5

i.9 ihnccercngrto demonstration reconfiguration Within-cycle 9: Fig.

o2

o1 o1 o3 o5

(a)

8n 0n 2n 4n 6n 680ns 660ns 640ns 620ns 600ns 580ns 700ns

Count

0 0 3 2 1 0 0 1 2 3 1 2 3 0 0 3 2 1 o5 o4 o5 o1 o2

LSB

counter_clk

o1 o3

rst

MSB

RR

o3 o4 o5 o1 o2

counter_clk

Q Q

D

D o5 o5 o3 o1

o4

cpb

o2

cpa 1o5 o1 o1