Imposing Coarse-Grained Reconfiguration to General Purpose Processors

M. Duric*, M. Stanic*, I. Ratkovic*, O. Palomar*†, O. Unsal*, A. Cristal*†‡, M. Valero*†, A. Smith§
*Barcelona Supercomputing Center, {first.last}@bsc.es   †Universitat Politècnica de Catalunya   ‡IIIA-CSIC   §Microsoft Research, [email protected]

Abstract—Mobile devices execute applications with diverse compute and performance demands. This paper proposes a general purpose processor that adapts the underlying hardware to a given workload. Existing mobile processors need to utilize more complex heterogeneous substrates to deliver the demanded performance. They incorporate different cores and specialized accelerators. On the contrary, our processor utilizes only modest homogeneous cores and dynamically provides an execution substrate suitable to accelerate a particular workload. Instead of incorporating accelerators, the processor reconfigures one or more cores into accelerators on-the-fly. It improves performance with minimal hardware additions. The accelerators are made of general purpose ALUs reconfigured into a compute fabric and the general purpose pipeline that streams data through the fabric. To enable reconfiguration of the ALUs into the fabric, the floorplan of a 4-core processor is changed to place the ALUs in close proximity on the chip. A configurable switched network is added to couple and dynamically reconfigure the ALUs to perform the computation of frequently repeated regions, instead of executing general purpose instructions. Through this reconfiguration, the mobile processor specializes its substrate for a given workload and maximizes the performance of the existing resources. Our results show that reconfiguration accelerates a set of selected compute intensive workloads by 1.56x, 2.39x and 3.51x when configuring the accelerator out of 1, 2 or 4 cores respectively.

Keywords-reconfigurable computing, dynamic processors

I. INTRODUCTION

Mobile devices are becoming ubiquitous and the mobile market segment is starting to dominate the computer industry. The competitive market requires constant improvement in the quality of these devices. Modern mobiles enable new compelling user experiences such as speech and gesture interfaces. Mobile applications manage an ever-increasing amount of data such as photos, videos, and a world of content available in the cloud. On such inputs, the applications repeat their computation many times and offer potential for acceleration. To harness this potential, recent mobile processors have shifted toward heterogeneity.

Heterogeneous architectures [1], [2] integrate cores with different power and performance characteristics, GPUs and specialized accelerators in a single chip. These architectures are more complex, but yield extra performance by choosing the appropriate unit for each workload, depending on the workload characteristics and particular needs.

Accelerators usually increase performance by using fixed hardware structures that match the computation of hot code regions. Fixed-function hardware limits the applicability of accelerators. Hence, multiple accelerators are typically incorporated in the design. They incur additional area overheads, as well as increase the design and verification cost of the mobile processors. Minimizing their costs directly affects the final cost of the mobile processor. To moderate the costs and mitigate the limited applicability, various designs propose reconfigurable compute accelerators [3], [4], [5], [6], [7], [8], [9], [10]. They increase performance over different workloads, but require including a remarkable amount of compute resources to enable specialized acceleration for each workload. To avoid such an addition of resources in mobile processors, we propose reconfiguration of the general purpose cores into accelerators.

This work contributes a novel reconfigurable and yet low cost chip multiprocessor (CMP) which: 1) maximizes the computational capabilities of the existing general purpose cores; 2) minimizes the amount of extra hardware for accelerators.1 The processor is made of four composable lightweight cores [12], [13]. Composability avoids static placement of the resources and allows a workload to dynamically optimize its execution substrate by composing one or more cores into a large processor. Our approach goes one step further than composability and reconfigures one or more composable cores into accelerators. The accelerators consist of the cores' ALUs reconfigured into a compute fabric and the cores' pipeline that streams data through the fabric. The fabric resembles the power-efficient computing substrate introduced by [8]. However, instead of integrating such a substrate, our fabric places all the ALUs available in the processor at the center of the chip and connects them with a configurable circuit-switched network. The network permits the fabric to be configured to perform various commonly repeated computations. The pipeline of the accelerator can be configured as well, by composing the resources of one or more cores around the fabric to tune the memory capabilities of the accelerator and further increase its performance.

This work has been partially funded by the Spanish Government (TIN2012-34557), the European Research Council under the EU's 7th FP (FP/2007-2013) / ERC GA n. 321253, and Microsoft Research.
1 A sketch of this idea [11] with a very basic implementation and simplified evaluation showed promising results.

II. BACKGROUND AND RELATED WORK

We build our design on composable cores based on an Explicit Data Graph Execution (EDGE) architecture [14], [15]. The EDGE architecture is an efficient research vehicle for low power mobile computing. Compiler analysis is used to divide a program into blocks of instructions that execute atomically (Atomic Instruction Block or AIB). An AIB consists of a sequence of dataflow instructions. EDGE compilers statically generate the dataflow and encode it in the EDGE ISA. The encoded dataflow defines producer-consumer relationships between EDGE instructions and avoids power hungry out-of-order hardware. Instructions inside the block communicate directly. Each instruction leverages two reservation stations, which hold the left and right operands respectively. Producer instructions encode targets that route their outputs to the appropriate reservation stations of consumer instructions. Register operations are used only for handling less-frequent inter-block communication, by keeping the temporary results between the AIBs. Composing EDGE cores increases performance and efficiency [12], [13]. Rather than fixing the size of cores at design time, one or more lightweight physical cores can be composed at runtime to form a larger, wider-issue logical core by using an on-chip routed network between the cores. Reconfiguration of composable cores is an advanced dynamic feature that adds a circuit-switched network between the ALUs of composable cores to specialize one or more cores into an accelerator and further increase their performance and efficiency.

There is previous work that proposes to reconfigure existing resources of a processor into some kind of accelerator. In the context of EDGE architectures, vector execution on EDGE cores [16], [17] dynamically repurposes the general purpose cores into a vector processor. The vector processor executes vector AIBs, which allocate the existing compute resources and repeat the computation by streaming the values of large vectors. A vector memory unit is incorporated to decouple vector memory accesses from computation and arrange the vector values in an efficient manner. The vector processor customizes the memory resources for vector access patterns, while the reconfigurable cores proposed in this paper customize the compute resources for frequently executed computations. These two dynamic approaches have different trade-offs and it may be interesting to have both features on a single chip, applying them selectively depending on the workload or even combining them. For example, when the reconfigured cores compute values with regular access patterns, the vector memory unit may be used to efficiently arrange data and increase performance.

Various reconfigurable compute accelerators [18], [3], [5], [4], [6], [7], [19], [8], [9], [10] have been proposed to accelerate compute intensive code regions. They are added to the baseline processor to enable compute acceleration. The accelerators are based on coarse grained reconfigurable architectures, which incorporate a set of configurable data processing units. Each unit performs one compute task of the parallel workload (e.g. matrix arithmetic, signal or image processing), while data passes from one unit to another in a pipeline. The units are coupled by using a configurable interconnect in a grid-like compute fabric. Such a fabric is integrated into a processor pipeline like a back-end processor. The pipeline arranges the compute data, while the fabric computes the data. The fabric is configured by mapping compute instructions to it. It improves performance by "spatially" executing multiple compute instructions in different stages of the fabric. The fabric also improves efficiency by repeating the once configured computation many times, which avoids per compute instruction fetch and decode overheads [19]. Instead of integrating such a fabric into a lightweight mobile processor, we propose reconfiguration of the processor compute resources into a fabric. Our fabric is designed to resemble the previously proposed computing substrates, their performance and efficiency, while avoiding as much as possible of their area overheads.

III. RECONFIGURATION OF CORES

The reconfiguration of composable cores extends their capabilities beyond general purpose processing. Bulk resources can be allocated and used either as general purpose processors or accelerators. Each application running on this processor dynamically tunes the amount of allocated resources and their configuration to achieve the desired performance and efficiency. For example, one application may be executed by using one or more cores of the CMP for general purpose processing and another one accelerated by using the cores reconfigured into an accelerator.

Reconfiguration makes an accelerator of one or more general purpose cores. The accelerator is composed of the general purpose ALUs reconfigured to perform like a compute fabric and the general purpose pipeline that streams compute data through the fabric. The reconfiguration feature is implemented on a 4-core CMP. The floorplan design of the CMP is slightly modified to provide a suitable execution substrate for the fabric. The original design assumes 4 identical tiles, one per core. The new floorplan places the existing ALUs of each core close to each other by shifting them from their original location to the appropriate corner, as shown in Figure 1. A configurable circuit-switched network is added to connect all the ALUs into the fabric. Although in this work we focus on a 4-core CMP, an even larger compute fabric can be created by extending the switched network to connect ALUs in a CMP that incorporates more than four cores. This network extension would include longer metal interconnects and repeaters positioned along the network to enable connections between ALUs on non-neighbor cores. With such a network, the resources of the fabric scale with the number of cores on a chip.
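For illustration only, the following C sketch shows how an application might request the two uses of the bulk resources described in this section: general purpose execution on composed cores and accelerated execution on reconfigured cores. The call names (compose_cores, reconfigure_accelerator, release_substrate) are hypothetical; the design only specifies that composition and reconfiguration are requested through runtime calls backed by memory-mapped system registers (Section VII).

    /* Hypothetical runtime interface; the real system exposes composition and
     * reconfiguration through memory-mapped system registers (Section VII). */
    #include <stdio.h>

    typedef int substrate_t;

    /* Assumed helpers, not part of the paper's API. */
    static substrate_t compose_cores(int n_cores) {
        printf("composing %d core(s) into one logical processor\n", n_cores);
        return n_cores;                      /* handle to the logical processor */
    }
    static substrate_t reconfigure_accelerator(int n_cores) {
        printf("reconfiguring the ALUs of %d core(s) into a compute fabric\n", n_cores);
        return -n_cores;                     /* handle to the accelerator */
    }
    static void release_substrate(substrate_t s) { (void)s; }

    int main(void) {
        /* A control/memory intensive phase runs on a 2-core logical processor. */
        substrate_t cpu = compose_cores(2);
        /* ... general purpose phase ... */
        release_substrate(cpu);

        /* A frequently repeated compute region runs on a 4-core accelerator:
         * the cores' ALUs form the fabric, the pipeline streams data through it. */
        substrate_t acc = reconfigure_accelerator(4);
        /* ... accelerated phase ... */
        release_substrate(acc);
        return 0;
    }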

Figure 1. The dynamic reconfiguration of a 4-core CMP into accelerators. The accelerators are composed of the CMP’s FUs connected into a compute fabric and one or more cores that stream data through the fabric. The fabric adds a circuit switched network to couple the FUs in 4 neighboring cores.

The compute fabric is utilized in the pipeline to execute groups of commonly repeated compute instructions. The instructions are mapped to the fabric, while configuring the appropriate datapath among the ALUs. The configured datapath eliminates passing the data between the ALUs via the register file. Once configured, the fabric can receive and process new data without reconfiguration. If the amount of data is large, the fabric significantly increases efficiency by eliminating per instruction overheads such as fetch, decode and register file access [19]. To hide propagation delay in the fabric, the accelerators execute multiple iterations of the configured computation in a pipelined fashion; different stages of the fabric simultaneously compute different values, while the network controls the progress of data through the connected ALUs.

The fabric is easily integrated into the pipeline of the existing general purpose cores. The accelerators use the pipeline to execute memory instructions as well as to send and receive data processed by the fabric. Since the fabric performs most of the computation, the pipeline mainly executes non-compute instructions, which makes memory processing more efficient. Composing cores further increases the amount of resources available for memory processing and allows for issuing more in-flight memory requests. Figure 1 shows examples of reconfigured accelerators, which compose one or two cores to stream data through the fabric.

IV. COMPUTE FABRIC DESIGN

The compute fabric (Figure 1) is made by reconfiguring the ALUs of a CMP. Each ALU supports various operations by using a set of functional units (FUs)2 and control logic that selects which one is dispatched to perform the specific operation. At most a single FU per ALU is active simultaneously. The ALUs in one core are connected with a shared bypass network. The network enables the ALUs to operate conventionally, while bypassing the data between the ALUs and a shared register file. To support reconfiguration, we add an extra configurable circuit-switched network that connects all the FUs of a 4-core CMP and enables their reconfiguration into the compute fabric. In the fabric, one FU performs one compute instruction and the network passes data between FUs that perform dependent instructions. It permits using multiple FUs per ALU simultaneously, as opposed to conventional operation dispatch.

2 In this work, functional units refer to ALU sub-units such as shifter or multiplier.

The network is based on a lightweight 2D mesh of nodes, as shown in Figure 1. Each node of the mesh may be: a) an extended FU, b) a configurable switch, or c) a fifo queue that holds input or output values. Each FU extends the compute logic by incorporating two data registers and two status registers. The data registers hold the values that are input operands of the operation to be computed by the FU. Each status register indicates whether the content of the associated data register is valid or not. Each FU is connected to four neighboring switches, from where it receives its inputs and sends its outputs. The switches are connected to fifo queues, FUs or other switches. The switches include a crossbar, an associated configuration memory, and data and status registers. The crossbar provides one or more outputs by selecting from one or more inputs according to its configuration bits, which are stored in the configuration memory.
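As a concrete reading of the switch description above, the following C sketch models one switch node with a per-output selection entry in its configuration memory and data/status registers on its ports. It is a minimal sketch under our own naming and field widths, not the actual hardware structure.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SWITCH_PORTS 4   /* links to neighboring FUs, switches or fifo queues */

    typedef struct {
        uint8_t  select[SWITCH_PORTS];  /* configuration memory: input chosen per output */
        uint32_t data[SWITCH_PORTS];    /* data registers of the node */
        bool     valid[SWITCH_PORTS];   /* status registers of the node */
    } switch_node;

    /* Forward a value to output 'out' if its configured input is valid and the
     * downstream register is free. Returns true when a value moves this cycle. */
    static bool switch_forward(switch_node *s, int out,
                               uint32_t *down_data, bool *down_valid) {
        int in = s->select[out];
        if (!s->valid[in] || *down_valid)
            return false;                 /* nothing ready, or the next node is busy */
        *down_data  = s->data[in];        /* crossbar selects the configured input */
        *down_valid = true;
        s->valid[in] = false;             /* input register freed for the next value */
        return true;
    }

    int main(void) {
        switch_node s = { {2, 0, 1, 3}, {10, 20, 30, 40}, {false, false, true, false} };
        uint32_t down = 0; bool busy = false;
        if (switch_forward(&s, 0, &down, &busy))   /* output 0 is configured to take input 2 */
            printf("forwarded %u\n", down);        /* prints 30 */
        return 0;
    }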

The data registers hold the data that passes through a switch, before the switch is able to forward it to the next node. The switch is configured by writing configuration bits into the configuration memory. The configuration bits are transmitted by reusing the network, like in [8]. When configured, the switches provide the specialized datapath between the fifo queues and the various FUs.

Computation on the datapath may be driven either statically or dynamically. Statically driven computation uses a schedule provided by a specialized compiler to issue the operations. The compiler calculates the delay of each operation and generates the static execution schedule for the fabric. Input values are usually injected into the fabric at once and results are available after a known delay provided by the compiler. Such a scenario simplifies the design of the network, which requires nothing beyond connections and routing control. On the other hand, dynamically driven computation requires extra data and status registers in the network nodes to dynamically arrange the execution schedule. The schedule is arranged according to the dataflow control of values arriving at the FUs. An FU performs an operation when all its inputs are indicated as ready and the next node on the path is free to receive a new value. Dataflow control enables performing the computation in a pipelined fashion simply by streaming ready values through the fabric. On the other hand, statically driven computation requires complex static scheduling to pipeline the computation. The static scheduling may be jeopardized by events such as cache misses, which result in variable latencies that are hard to predict at compile time. This resembles the choice between in-order and out-of-order instruction issue. In-order issue logic is much simpler than out-of-order, but it relies heavily on the compiler to produce highly optimized schedules; this is especially complex due to variable latency instructions (e.g. loads that miss in the cache). Unlike an out-of-order implementation, in our case the complexity of the additional hardware required for dynamically driven computation is small. For these reasons, in this work we chose dynamically driven computation. It provides a more flexible substrate with small hardware additions and does not require complex compiler support to generate the static execution schedule.
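The dataflow firing rule of the dynamically driven computation can be summarized with the following C sketch. It assumes the two data/status register pairs per FU described in Section IV and reduces the configured operation and the downstream hand-off to a single add; all names are ours.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        uint32_t data[2];   /* left and right operand registers */
        bool     valid[2];  /* status registers: operand present? */
    } fu_node;

    /* Fire when both operands are valid and the next node can accept a value. */
    static bool fu_try_fire(fu_node *fu, uint32_t *next_data, bool *next_valid) {
        if (!(fu->valid[0] && fu->valid[1]) || *next_valid)
            return false;                        /* keep waiting: a bubble this cycle */
        *next_data  = fu->data[0] + fu->data[1]; /* the configured operation */
        *next_valid = true;
        fu->valid[0] = fu->valid[1] = false;     /* registers freed for the next iteration */
        return true;
    }

    int main(void) {
        fu_node fu = { {3, 4}, {true, true} };
        uint32_t out = 0; bool out_valid = false;
        if (fu_try_fire(&fu, &out, &out_valid))
            printf("fired: %u\n", out);          /* prints 7 */
        return 0;
    }

When an FU does not fire, the cycle is a bubble; as described next, such bubbles can be filled with general purpose compute instructions.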
By configuring the datapath, the fabric allocates a set of the FUs available on the CMP, possibly leaving a number of them off the datapath. The dynamically driven computation uses the allocated FUs only when a FU has ready input values. Hence, the fabric may not utilize all the allocated FUs every cycle. There is a bubble in each cycle in which a FU, allocated or not, is not occupied. On the other hand, an accelerated workload may still use some general purpose compute instructions, which are not mapped to the compute fabric (e.g. instructions that reduce results generated by the fabric). These instructions may "fill the bubbles" and utilize the FUs during such cycles, thus increasing the utilization of the FUs. To achieve this, the wakeup and select logic is extended to check if there is a bubble and issue a compute instruction if the bubble exists. The extended logic interacts with the dynamically controlled dataflow in the fabric to check if a particular FU is unoccupied. When a general purpose compute instruction is issued into an unoccupied FU, the fabric is not affected nor are any structural hazards created. This extension to the processor logic enables the FUs to operate in a hybrid mode, by performing general purpose instructions and frequent computations through the fabric at once. The hybrid mode increases the FU utilization and provides a more flexible compute substrate.

V. COMPUTE FABRIC IN PIPELINE

The fabric is integrated into the existing configurable pipeline of composable cores as shown in Figure 2. In a pipeline that composes one or more cores, the fabric performs like a deeply pipelined functional unit. The input/output fifo queues of the fabric facilitate its integration into the general purpose pipeline. Each queue utilizes one dedicated bank of entries and, by using input or output buses, it is connected to each composable core of the pipeline. The pipeline executes AIBs of EDGE instructions. The AIBs include instructions that send data to the queues or receive it from them (send/receive instructions). Send/receive instructions pass the data between the pipeline and the fabric by accessing the fifo queues. Before the fabric is utilized, it has to be configured to perform a particular computation.

A. Fabric Configuration

The fabric is configured by executing a number of configuration instructions, which occupy one or more AIBs. Each configuration instruction has an immediate field with the identifier of a node in the switched network and the data used to configure the node. Rather than broadcasting this data through the network until it reaches its destination node, the data is sent to a fifo queue connected to the row or column where the destination node resides. From the fifo queue, the data is inserted into the associated node and each node forwards the data to the next node, until the data reaches its destination. North nodes forward the data to south nodes and west nodes forward the data to east nodes. By configuring various switched nodes, the fabric forms a specialized datapath that connects a heterogeneous set of FUs that perform a particular computation (Figure 3). The number of blocks used for the configuration defines the configuration overhead, which is analyzed in Section VII. If the entire application exploits a single configuration of the fabric, the configuration overhead does not affect performance remarkably. On the other hand, if the application exploits different configurations, the configuration overheads may reduce the performance improvements.
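To make the configuration path concrete, the sketch below follows one configuration word from the fifo queue of its destination column to its destination node, using the north-to-south forwarding rule described above. The word layout and node numbering are illustrative assumptions, not the actual instruction encoding.

    #include <stdint.h>
    #include <stdio.h>

    #define ROWS 5
    #define COLS 11

    typedef struct {
        uint8_t  row, col;   /* identifier of the destination node */
        uint32_t bits;       /* configuration bits for that node   */
    } config_word;

    /* Inject a word at the fifo of the destination column (row 0) and let each
     * node forward it southwards until the destination row is reached. */
    static void configure_node(uint32_t config_mem[ROWS][COLS], config_word w) {
        for (uint8_t r = 0; r < ROWS; r++) {
            if (r == w.row) {                 /* word arrived at its destination */
                config_mem[r][w.col] = w.bits;
                return;
            }
            /* otherwise the node at (r, w.col) forwards the word to the south */
        }
    }

    int main(void) {
        static uint32_t config_mem[ROWS][COLS];
        config_word w = { 3, 7, 0x5u };       /* hypothetical node (3,7), bits 0x5 */
        configure_node(config_mem, w);
        printf("node(3,7) configured with %u\n", config_mem[3][7]);
        return 0;
    }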


Figure 2. The compute fabric integrated into the general purpose pipeline that is composed of two general purpose cores.

For such cases, an extra memory (RAM) could be used to keep various configurations. By storing the configuration data into a specialized memory that instantly sends the data to the fabric, the configuration overhead is reduced. Instead of executing one or more blocks with configuration instructions, the memory instantly sends the configuration data to the fabric. The memory could also enable the sharing of fabric configurations (e.g. FFT, used both in graphics and speech recognition). In our experiments we do not include such a memory, but we measure its potential impact.

B. Fabric Utilization

The fabric is utilized within the pipeline of composable cores, which executes AIBs appended with send/receive instructions (Figure 3). AIBs are composed of EDGE instructions, which explicitly encode producer-consumer relationships. Send instructions act like producers by forwarding input values into the fabric. The targets of send instructions route their values to the appropriate fifo queues. The values are provided by general purpose memory or register-read instructions. Receive instructions act like producers as well, by forwarding output values from the fifo queues encoded in the instructions to their specified consumers.

We rely on the general purpose pipeline to access memory and arrange input and output values. Such a design can accelerate compute regions with any kind of memory access pattern: regular, irregular or a combination of them. To efficiently arrange the operand values and maintain the correctness of the targeted application, the pipeline leverages the sophisticated features typically found in memory subsystems, such as LSQs, low-latency caches and prefetch mechanisms. Core composition tunes the pipeline resources (e.g. data cache size) on-the-fly to further increase the efficiency of memory accesses, which is evaluated in Section VII.

After the value has been loaded by the general purpose pipeline, a send instruction is used to insert the value into the fabric. Similarly, a receive instruction is used to collect the output value produced by the fabric, which is later stored to memory. Alternatively, the value can be read from/written to a general purpose register. Each send/receive instruction is associated with a fifo queue which is encoded in the instruction. The instruction requires a free entry in the queue to perform its send or receive operation. If there are no free entries in the associated queue, the core stalls the execution of the instruction due to the structural hazard. When the AIB commits, the entry is released and can be used by the next send/receive instruction. It is possible to further improve the performance of memory accesses by integrating a programmable memory controller [24] to arrange values according to the access patterns, when known at compile time. Due to the additional complexity of implementing such a memory controller, we do not evaluate it in this work.

The send/receive instructions included in the AIB control the entire iteration(s) of the configured computation. The AIB sends all the values to be computed and receives all the results. To increase performance, the AIBs with limited communication with the fabric (e.g. compute intensive regions) can apply loop unrolling to control multiple iterations of the computation. The send/receive instructions in the AIB are replicated per each iteration of the loop. An identifier specifies the iteration and the ordering of the send/receive requests in the fifo queues. The iteration identifier implies the same semantics with respect to the ordering of send/receive requests as load-store identifiers do for the ordering of memory instructions. While executing a given send or receive instruction, its iteration number is used to access the appropriate entry in the fifo queue and maintain the correctness of the computation.
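The following host-side C sketch illustrates how a compute region can be unrolled so that one AIB controls two iterations, using the _sendf/_receivef intrinsics of the example in Figure 3. The software fifo arrays and the fabric_step() stand-in only emulate the hardware so that the sketch runs standalone; in the real design an iteration identifier orders the requests in the queues, while the emulation simply relies on fifo order.

    #include <stdio.h>

    #define QUEUES  4
    #define ENTRIES 16
    static float q[QUEUES][ENTRIES];
    static int   wr[QUEUES], rd[QUEUES];

    /* Host stand-ins for the send/receive intrinsics of Figure 3. */
    static void  _sendf(int f, float v) { q[f][wr[f]++] = v; }
    static float _receivef(int f)       { return q[f][rd[f]++]; }

    /* Stand-in for a fabric configured as res = in0 * in1 (queue 2 holds results). */
    static void fabric_step(void) {
        while (rd[0] < wr[0] && rd[1] < wr[1])
            q[2][wr[2]++] = _receivef(0) * _receivef(1);
    }

    int main(void) {
        float src[4] = {1, 2, 3, 4}, res[4], w = 0.5f;
        /* The modified region unrolled by two: the send/receive intrinsics are
         * replicated per iteration; queue order stands in for the iteration id. */
        for (int i = 0; i < 4; i += 2) {
            _sendf(0, w); _sendf(1, src[i]);      /* iteration i   */
            _sendf(0, w); _sendf(1, src[i + 1]);  /* iteration i+1 */
            fabric_step();                        /* in hardware this overlaps the sends */
            res[i]     = _receivef(2);
            res[i + 1] = _receivef(2);
        }
        for (int i = 0; i < 4; i++) printf("%.1f ", res[i]);   /* 0.5 1.0 1.5 2.0 */
        printf("\n");
        return 0;
    }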
If the configured computation has a long latency, software pipelining may be applied to relieve the problem. In this optimization, multiple AIBs are used. The first AIB only sends the data to be received by the next block.



Each of the following AIBs then sends data to be received by the next AIBs and receives the computed results of the data sent by previous AIBs to avoid long delays. The last AIB only receives the results of the data sent by previous AIBs. We evaluate the impact of these two optimizations in Section VII.

C. Example of Fabric Utilization

An example of the compute code acceleration by using the CMP's execution resources reconfigured into the compute fabric is shown in Figure 3. The example contains: 1) the original code of a compute intensive loop that executes on a general purpose processor; 2) the modified code of the loop that executes on the accelerator composed of the general purpose pipeline and the compute fabric; 3) a simplified diagram of the fabric configured to perform the computation of the loop. The original code contains compute, memory and register instructions, which in each iteration load two input variables, read two input constants, compute them and store two output results. The modified code avoids all of the compute instructions inside the loop, while the remaining instructions only perform the bookkeeping operations that iterate the loop. The modified code, in addition to removing the compute instructions, appends send/receive instructions, which pass the input/output data to/from the fabric. In this example, the send/receive instructions are written by using send/receive intrinsics, where each intrinsic specifies the associated fifo queue and the operand location per send/receive instruction. The shown fabric is composed of six functional units (one unit per compute operation inside the loop) and eight switches that enable a specialized datapath from four input to two output data buffers.

Figure 3. An example of the compute code acceleration by utilizing the simplified fabric of the CMP's ALUs. The fabric is configured to perform the computation of the code and the code is modified to send input data to the fabric and receive output data from the fabric.

1) C CODE COMPUTE REGION:
    for (int i = 0; i < num; i++)
    {
        res0[i] = wx*src[i] - wy*src[i+1];
        res1[i] = wy*src[i] + wx*src[i+1];
    }

2) MODIFIED COMPUTE REGION:
    for (int i = 0; i < num; i++)
    {
        _sendf(0, wx);          // IN0
        _sendf(1, src[i+1]);    // IN1
        _sendf(2, src[i]);      // IN2
        _sendf(3, wy);          // IN3
        res0[i] = _receivef(0); // OUT0
        res1[i] = _receivef(1); // OUT1
    }

3) CONFIGURED FABRIC: the input queues IN0-IN3 feed four MUL units through the configured switches; their products feed one SUB and one ADD unit, whose results drive the output queues OUT0 and OUT1.

VI. SPECULATIVE COMPUTATION IN FABRIC

Managing values that pass through the reconfigurable compute fabric is quite challenging in conventional out-of-order processors. It is important to keep the original order in which values should enter the fabric, maintain memory correctness and support the speculative computation necessary to increase the performance of out-of-order processors. EDGE architectures provide a more advantageous substrate to integrate the fabric, because they execute atomic blocks of instructions. Block atomicity facilitates arranging speculative values for the fabric and enables integrating the fabric with modest hardware additions, such as the fifo queues.

Each AIB that arranges the values sends all the inputs and receives all the outputs for one or more iterations of the configured computation. Due to atomicity, the AIB either 1) commits all its results to registers or memory, or 2) squashes (discards) all the results. The fabric consumes a value from a fifo queue as soon as it is ready. Speculative values are computed by the fabric and the results are stored to the fifo queues, which hold the outputs of the computation. The AIB may receive the results, but if the AIB squashes, all the results are discarded. If the block squashes before sending all the values required to perform the computation, the dynamically driven computation in the fabric could stall waiting for an input value that will never be sent. To avoid this, an AIB that squashes marks the allocated entries in the fifo queues as invalid-but-ready. This way, the fabric can always finish its computation and eventually discard the results if the AIB is being squashed. Once the computation is finished and the AIB committed or squashed, the allocated entries in the fifo queues are released to be used for the next computation. This mechanism enables the composable cores to simultaneously execute one or more speculative AIBs, which leverage the fabric for speculative computation.

A more complex hardware would be required to support the software pipelining optimization introduced in Section V and speculative computation simultaneously. In software pipelining, the AIBs send values to be received by the following AIBs and receive values sent by previous AIBs. Speculative computation may provide values which need to be discarded, including the case in which the next AIB, which will use such values, has not started yet. Since in this case there is no AIB to mark the corresponding fifo queue entries as invalid, the fabric would require additional hardware to discard these results. Such a design seems to be excessively complex and we do not incorporate it in this work. Instead, the workloads optimized with software pipelining do not leverage speculative computation. The fabric consumes the values when the AIBs commit. This makes all the computation in the fabric non-speculative and the results are never discarded. The results may be received by an AIB that squashes, but the fifo queue entries that hold the results are released only when the AIB commits. This guarantees the correctness of the software pipelined accelerations. In this case, the accelerators do not leverage speculative computations, but they avoid idling of AIBs which wait for results with long delays. It is a trade-off between two optimizations and each one of them requires no complex hardware additions.
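As an illustration of the squash handling described earlier in this section, the following C sketch models fifo entries with ready and invalid flags and shows how the entries of a squashed AIB are marked invalid-but-ready so the fabric can drain while the corresponding results are discarded. The structure and function names are ours, chosen only for this sketch.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        float value;
        bool  ready;    /* the fabric may consume this entry            */
        bool  invalid;  /* set on squash: consume, then drop the result */
    } fifo_entry;

    /* On a squash, entries the AIB had allocated but not yet filled are marked
     * ready-but-invalid so the configured dataflow can still drain. */
    static void squash_aib(fifo_entry *entries, int n) {
        for (int i = 0; i < n; i++)
            if (!entries[i].ready) { entries[i].ready = true; entries[i].invalid = true; }
    }

    /* The fabric consumes a ready entry; an invalid one still flows through the
     * datapath, but its result is discarded instead of being written back. */
    static void fabric_consume(const fifo_entry *e) {
        if (!e->ready) return;
        float result = e->value * 2.0f;          /* placeholder computation */
        if (e->invalid)
            printf("computed %.1f, discarded (squashed AIB)\n", result);
        else
            printf("computed %.1f, kept\n", result);
    }

    int main(void) {
        fifo_entry in[2] = { {3.0f, true, false}, {0.0f, false, false} };
        squash_aib(in, 2);                       /* the second entry was never sent */
        for (int i = 0; i < 2; i++) fabric_consume(&in[i]);
        return 0;
    }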

Table I. SIMULATOR CONFIGURATION

  Per core:
    ALUs [FUs]          2 [FMul, FAdd, IAdd, IMul, Logical]
    Instruction window  64 entry
    B. Predictor        tournament
    L/S Queue           32 entry unordered LSQ
    L1 I/D-cache        32 kB each, 1 cycle (hit), 8x MSHR
  4x CMP:
    L2 cache            4 banks x 512 KB, 15 cycles (hit), 8x MSHR
    DRAM                latency: 256 cycles
    On-chip network     1 cycle/hop, Manhattan routing distance
  Fabric:
    FUs                 4x10 (4 cores x 2 ALUs [5 FUs])
    Switches            5x11, 1 cycle/hop
    Fifos               30, 8 entries/fifo

Table II. COMPUTING WORKLOADS

  Name     Description                     Domain
  fft      Fast Fourier Transform          Signal Processing
  kmeans   K-Means Clustering              Data Mining
  mm       Dense Matrix-Matrix Multiply    Scientific Computing
  mriq     Magnetic Resonance Imaging - Q  Image Reconstruction
  nbody    N-Body Simulation               Physics Simulation
  spmv     Sparse-Matrix Vector Multiply   Scientific Computing
  stencil  3-D Stencil Operation           Fluid Dynamics

VII. EVALUATION

In this section we evaluate the reconfiguration feature proposed in this paper on a dynamic EDGE architecture. We use an in-house, detailed cycle-level simulator that models a composable 4-core EDGE processor with the parameters shown in Table I. To model the memory system properly, the simulator is coupled with the Ruby memory model (see http://www.m5sim.org/Ruby). The EDGE simulator dynamically composes cores through calls to the runtime system, via memory mapped system registers. To model the processor reconfiguration feature, we integrate an independent cycle-level compute fabric simulator [8] that simulates a fabric with the parameters shown in Table I. The fabric utilizes the existing processor FUs (4x10), adding extra switches (5x11) and fifo queues (30), in a similar fashion to the design shown in Figure 1. The fabric simulator is based on a configurable switching network that models dynamically driven computation. The network connects and dynamically customizes the subset of available FUs for a particular computation. For some workloads that require more FUs than what the processor already provides, the fabric simulator is configured to model the amount of available and extra FUs added to enable acceleration. The processor and fabric simulators are connected over the banked fifo data queues of the fabric, by using connection buses between each core of the processor simulator and the fifo queues of the fabric simulator. The fifo queues are accessed by executing send/receive instructions in the processor. In the rest of this section, we characterize the acceleration parameters of various workloads and show the performance benefits of processor reconfiguration.

A. Accelerated Workloads Characterization

To perform the evaluation of the reconfiguration feature, we select seven computing workloads shown in Table II. The workloads are selected from various benchmark suites [20], [21], [22], [23]. Each workload utilizes a different compute algorithm and represents a commonly used algorithm in emerging applications, which are used in various segments of the computer industry, including mobiles.

Table III shows configuration parameters, overhead and the percentage of code accelerated. The most frequently executed compute regions of each workload are selected to run in the accelerator; the rest is executed in the general purpose pipeline. The selected regions are compiled with an in-house compiler [8] to provide the configuration bits for the fabric. The configuration bits are written by a set of configuration instructions, placed in one or more AIBs. By executing these AIBs, each workload configures the fabric to match the computation of its compute algorithm. The configuration overhead is shown in instructions, AIBs and cycles, and it differs depending on whether the configuration bits are cached. Once the fabric is configured, it allocates various nodes of the network to perform an entire computation. The configuration requirements of each workload are presented in Table III. Since the workloads perform floating-point computation, the fabric mostly allocates floating-point FUs. The most commonly used FUs are multipliers and adders. A floating-point adder performs two operations: add and subtract. Since it performs at most one operation at a time, the number of allocated adders is presented separately for each operation. The number of allocated multipliers goes up to 8, which is the amount provided by leveraging the existing FUs on 4 dual-issue cores. On the other hand, the number of allocated adders goes over 8 in two cases: 15 in kmeans and 12 in stencil. To accelerate these two "adder-hungry" workloads it is necessary to incorporate more FUs than what is provided by the reconfiguration of general purpose cores. Those workloads are examples where the fabric needs extra FUs to match the computation of a workload that intensively uses some operation. Please note that only 2 of the 7 selected workloads find it necessary to incorporate the modest set of extra FUs, while the others can accelerate their computations on the existing compute resources. To find out what the most preferable set of extra FUs may be, it would be required to comprehensively analyze mobile applications. Due to lack of space, we avoid that step in this work and, when necessary, we incorporate only the FUs which are required to accelerate the workload. The number of configured switches varies per workload and depends on the number of allocated FUs and their locations. The input and output switches of the configured fabric are connected to the input and output fifo queues.

Table III. ACCELERATED WORKLOADS CHARACTERISTICS

                            Configuration overhead             FUs required          FIFOs req.    Loop-unrolling
  Workload  % of code acc.  Instr.  AIBs  cycles  cyc.-cached  FMul FAdd FSub Or     Input Output  n iterations
  fft            76           277     3    3094      163         4    3    3   0       6     4          2
  kmeans         19           357     4    3775      209         8    7    8   0      16     1          2
  mm             82           269     3    3085      159         8    7    0   0      16     1          2
  mriq           64           269     3    3098      159         6    4    0   1       7     2          4
  nbody          81           281     3    3092      164         7    5    3   0       7     3          4
  spmv           20           277     3    3082      163         8    6    0   0      16     2          1
  stencil        85           317     3    3320      183         2   10    2   0      14     2          1

Figure 4. Reconfiguration performance and efficiency results: (a) reconfiguration speedup over the 1C-Composed processor for the 1C/2C/4C-Composed and +Reconf configurations; (b) reconfiguration reduction of overheads per instruction (FetchDecode, MemExe, ALUExe, ControlTransferExe, TotalExe).

The set of selected workloads uses up to 8 input fifo queues (kmeans, mm, spmv) and up to 4 output fifo queues (fft).

The compute regions of each workload are modified by hand to perform their computation in the configured fabric. We modify the code of the compute region by writing send and receive intrinsics. The workloads that use the fabric to perform a limited amount of computation per iteration are optimized using loop unrolling. The AIBs of the optimized workloads replicate send/receive intrinsics to send/receive the data of up to 4 compute iterations (mriq, nbody). The fabric does not need to be modified or reconfigured when changing the number of iterations unrolled per AIB.

We next report the speedup of the reconfiguration feature over general purpose computation on the composable cores and its efficiency due to reduced per instruction fetch and decode overheads, while executing the entire workloads (including the non-accelerated parts). We evaluate the effect of composing cores to scale the resources of general purpose processors and accelerators. Each of the workloads uses static inputs, but repeats its computation many times to warm up the caches. Repeated computation minimizes the reconfiguration overhead, since the fabric is configured once per workload. We show the benefits of two optimization techniques: loop unrolling and software pipelining. Loop unrolling is included by default in the rest of the experiments, while software pipelining is used only in the experiment that shows the benefits of the software pipelining optimization.

B. Performance Results

Figure 4a reports the performance improvements obtained by dynamically composing and reconfiguring cores. kC-Composed indicates that k cores are composed into one larger logical processor and +Reconf indicates that the cores are reconfigured into an accelerator. The 2-core composed processor outperforms the 1-core processor by 1.51 times on average. The 4-core composed processor scales the performance even further by achieving over 2.2x average speedup compared to the 1-core logical processor. Composing cores provides significant benefits for the selected workloads. This is due to their compute intensive algorithms, where abundant computation enables the enlarged substrate with more execution resources to simultaneously perform more computation and increase performance.

Reconfiguring the cores into accelerators increases performance even more by specializing the computing substrate. Each accelerator utilizes the fabric configured per workload and one or more cores composed to stream data through the fabric. Compared to the 1-core processor, the accelerators increase the average performance by 1.56x, 2.39x and 3.51x, when 1, 2 or 4 cores are used in the accelerator respectively. The results scale when the accelerators utilize more cores. More cores in the accelerator execute more speculative AIBs, which perform speculative computation and increase the utilization of the fabric, as explained in Section VI. The results of specialized computing on the accelerators scale in a similar way as the results of general purpose computing.

Figure 5. Performance benefits of various optimizations: speedup of the 1-core accelerator over the 1C-Composed processor with No-Optimization, Loop-Unrolling and Software-Pipelining.

With such similar scaling trends, the specialized computing always outperforms the general purpose computing with the same number of cores. The 1-, 2- and 4-core accelerators outperform the 1-, 2- and 4-core general purpose processors by 1.56x, 1.58x and 1.6x respectively.

The speedup of the accelerators over the general purpose processor varies per workload and goes from 1x (stencil) to 2.57x (mm), when using the 1-core configuration in each case. Stencil is an extreme workload that has a modest amount of computation over the stencil values, which are accessed following complex memory access patterns. Without extra memory optimizations [24], which are not analyzed in this work, the stencil workload does not obtain any benefit from specializing its relatively small computation. On the other hand, mm intensively computes sequentially accessed data of a dense matrix. The modified mm code for the accelerator removes most of the compute instructions. It enables an optimization of the memory accesses by refilling the AIBs with extra memory instructions. The modified mm performs much faster on the accelerator. The accelerator uses the pipeline mostly for memory processing and leverages the configured fabric to simultaneously perform multiple operations in different FUs. Some workloads (kmeans) implement a compute intensive algorithm, but include non-compute instructions (e.g. library functions such as malloc) in the frequently executed code region. In such a case, the accelerator improves performance, but the non-compute instructions (81%) limit the speedup. Other workloads (spmv) implement an algorithm that has a few different computations in a single loop. Although one iteration of the loop combines various computations, the accelerator may be configured for only one of them, because of the configuration overhead. The rest of the computation (80%) is performed using the general purpose pipeline, which limits the performance.

Figure 4b shows the reduction of instruction fetch, decode and execution overheads, comparing the 1-core accelerator with the 1-core processor. Additionally, the execution overhead is presented individually for: memory instructions; ALU instructions, including the instructions executed by the general purpose pipeline and the fabric; and control-transfer instructions, including branch, register read/write and the aforementioned send/receive instructions. The results show that the accelerator moderately reduces the number of executed memory and ALU instructions, but increases the number of executed control-transfer instructions by 2x. This happens mainly because of the code modifications. The workloads are modified by mapping the floating-point compute instructions to the fabric and refilling the AIBs with extra memory and send/receive instructions. The modified AIBs encode more extensive computation on the configured substrate. This may diminish the number of temporary results stored in memory and the bookkeeping complexity. In some workloads, the number of executed memory and ALU instructions is reduced (kmeans, mm, spmv), because they perform memory accesses and address calculations for temporary results. In other workloads, the number of bookkeeping ALU instructions is reduced (fft, nbody). The appended send/receive instructions in the AIBs incur an extra overhead in the execution of control and transfer instructions. In some cases the workloads execute over 3 times more control and transfer instructions (mriq). In total, the accelerator does not notably reduce the average number of total executed instructions, which is the expected result. On the other hand, the accelerator significantly reduces the instruction fetch and decode overheads. This happens because the accelerator maps the compute instructions to the fabric once, but executes them many times while avoiding the fetch and decode stages. The general purpose pipeline incorporated into the accelerator still performs the fetch and decode stages with the same amount of resources. But instead of processing the compute instructions, the pipeline fetches and decodes more memory and transfer instructions in the refilled AIBs. By including more memory instructions per AIB, the pipeline more efficiently saturates the available memory bandwidth, tolerates memory latency and increases the utilization of the ALUs connected into the fabric. By increasing the utilization of the ALUs, the accelerator effectively leverages the configured compute fabric and improves performance.

Figure 5 shows the benefits of the loop unrolling and software pipelining optimizations. We present the speedup of the 1-core accelerator over the 1-core general purpose processor. The general purpose computing in the processor uses loop unrolling, but not software pipelining. The results of the accelerator are presented by incrementally applying loop unrolling and software pipelining. Loop unrolling is applied to workloads that perform computation over an unsubstantial amount of per iteration input values (see Table III). Loop unrolling increases the size of the AIBs that arrange the compute values and reduces the inter-block communication overhead. Loop unrolling increases performance by about 40% (fft, mm, mriq and nbody benefit from it). Since the general purpose computing uses loop unrolling by default, the reconfiguration feature sometimes does not yield any extra speedup without applying the loop unrolling optimization

(mriq and nbody experience slowdown with "No-Optimization"). Software pipelining provides substantial additional performance when the workloads perform complex and long latency computations (mm, mriq, nbody). The AIBs avoid this latency by receiving the results of previous AIBs. Software pipelining is not applied to the general purpose computing, because the processor executes operations individually, unlike the compute fabric, which fuses multiple operations and produces results with long delays. Software pipelining may have a negative impact on performance when the computation is not extensively long or when an outer loop that repeats the computation is complex (fft). Since the software pipelining evaluated in this work does not require extra hardware support, it may be selectively applied only to the workloads which find it beneficial. By choosing the best optimization, performance increases by over 10%.

VIII. CONCLUSIONS AND FUTURE WORK

This paper presents a novel way to dynamically adapt the existing resources of a general purpose quad-core processor to diverse workload requirements. The processor reconfigures its available general purpose cores on-the-fly into accelerators of frequently executed compute regions. The general purpose cores efficiently execute control and memory intensive applications. We expect their reconfiguration to substantially accelerate compute intensive applications, when the computation repeats over a large amount of data. The results presented in this work are very promising. They show that reconfiguration yields a 56% average speedup on a set of selected compute workloads. Only two of the seven selected workloads require extra adder units to enable the processor to adapt its execution substrate to their computations. Future work will involve extending the design to more powerful 8- and 16-core reconfigurable processors, as well as estimating their energy and area savings in the domain of low power mobile devices.

REFERENCES

[1] R. Kumar et al., "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in MICRO'03, pp. 81-92.
[2] N. Brookwood, "AMD Fusion family of APUs: enabling a superior, immersive PC experience," White Paper, 2010.
[3] N. Clark et al., "Application-specific processing on a general-purpose core via transparent instruction set customization," in MICRO'04, pp. 30-40.
[4] ——, "VEAL: Virtualized execution accelerator for loops," in ISCA'08, pp. 389-400.
[5] M. Mishra et al., "Tartan: evaluating spatial computation for whole program execution," SIGOPS Oper. Syst. Rev., vol. 40, no. 5, pp. 163-174, Oct. 2006.
[6] S. Khawam et al., "The reconfigurable instruction cell array," IEEE Trans. VLSI Systems, vol. 16, no. 1, pp. 75-85, 2008.
[7] H. Park et al., "Polymorphic pipeline array: a flexible multicore accelerator with virtualized execution for mobile multimedia applications," in MICRO'09.
[8] V. Govindaraju et al., "Dynamically specialized datapaths for energy efficient computing," in HPCA'11, pp. 503-514.
[9] ——, "DySER: unifying functionality and parallelism specialization for energy efficient computing," IEEE Micro, vol. 32, no. 5, pp. 38-51, 2012.
[10] A. Parashar et al., "Triggered instructions: a control paradigm for spatially-programmed architectures," in ISCA'13, pp. 142-153.
[11] M. Duric, O. Palomar, and A. Smith, "ReCompAc: Reconfigurable compute accelerator," in Reconfigurable Computing and FPGAs (ReConFig), 2013 International Conference on, Dec 2013, pp. 1-4.
[12] C. Kim et al., "Composable lightweight processors," in MICRO'07.
[13] M. S. S. Govindan et al., "Scaling power and performance via processor composability," IEEE Transactions on Computers, March 2013.
[14] D. Burger et al., "Scaling to the end of silicon with EDGE architectures," IEEE Computer, vol. 37, no. 7, pp. 44-55, July 2004.
[15] A. Smith et al., "Compiling for EDGE architectures," in CGO'06, pp. 185-195.
[16] M. Duric et al., "EVX: Vector execution on low power EDGE cores," in DATE'14, March 2014, pp. 1-4.
[17] ——, "Dynamic-vector execution on a general purpose EDGE chip multiprocessor," in SAMOS'14.
[18] Mei et al., "ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix," in FPL, 2003, pp. 61-70.
[19] S. Gupta et al., "Bundled execution of recurring traces for energy-efficient general purpose processing," in MICRO-44, 2011, pp. 12-23. [Online]. Available: http://doi.acm.org/10.1145/2155620.2155623
[20] S. C. Woo et al., "The SPLASH-2 programs: Characterization and methodological considerations," in ACM SIGARCH Computer Architecture News, vol. 23, no. 2, 1995, pp. 24-36.
[21] S. Che et al., "Rodinia: A benchmark suite for heterogeneous computing," in IISWC'09, pp. 44-54.
[22] D. P. Playne et al., "Benchmarking GPU devices with n-body simulations," in CDES'09.
[23] J. A. Stratton et al., "Parboil: A revised benchmark suite for scientific and commercial throughput computing," Univ. of Illinois at Urbana-Champaign, IMPACT-12-01.
[24] T. Hussain et al., "Advanced pattern based memory controller for FPGA based HPC applications," in HPCS'14, pp. 287-294.
