466 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 38, NO. 3, MARCH 2019 An Efficient SRAM-Based Reconfigurable Architecture for Embedded Processors Sajjad Tamimi , Zahra Ebrahimi , Behnam Khaleghi , and Hossein Asadi , Senior Member, IEEE

Abstract—Nowadays, embedded processors are widely used in wide range of domains from low-power to safety-critical applications. By providing prominent features such as variant peripheral support and flexibility to partial or major design modifications, field-programmable gate arrays (FPGAs) are com- monly used to implement either an entire embedded system or a hardware description language-based processor, known as soft-core processor. FPGA-based designs, however, suffer from high power consumption, large die area, and low performance that hinders common use of soft-core processors in low-power embedded systems. In this paper, we present an efficient reconfig- urable architecture to implement soft-core embedded processors in SRAM-based FPGAs by using characteristics such as low uti- lization and fragmented accessibility of comprising units. To this end, we integrate the low utilized functional units into efficiently designed look-up table (LUT)-based reconfigurable units (RUs). To further improve the efficiency of the proposed architecture, we used a set of efficient configurable hard logics that imple- ment frequent Boolean functions while the other functions will still be employed by LUTs. We have evaluated effectiveness of the proposed architecture by implementing the Berkeley RISC- V processor and running MiBench benchmarks. We have also examined the applicability of the proposed architecture on an Fig. 1. Online reconfiguration to various soft-core processors. alternative open-source processor (i.e., LEON2) and a digital sig- nal processing core. Experimental results show that the proposed users [3], [4]. However, commercially off-the-shelf hard-core architecture as compared to the conventional LUT-based soft- processors cannot fulfill design requirements when proces- core processors improves area footprint, static power, energy sor customizations such as instruction set architecture (ISA) consumption, and total execution time by 30.7%, 32.5%, 36.9%, update is demanded by the customers. In addition, these and 6.3%, respectively. processors typically provide specific and limited peripheral Index Terms—Field programmable gate arrays, microproces- support that may not fit appropriately to a variety of embed- sors, partial reconfiguration, power dissipation, reconfigurable ded applications. Moreover, there is always likelihood of logic, soft-core processors. obsolescence of a hard-core processor which complicates the migration of the overlying embedded system to newer tech- nologies and leads to expensive and even infeasible design I. INTRODUCTION update. MBEDDED systems are used in broad range of appli- An alternative approach to implement embedded processors E cations from indoor utensils to medical appliances and is using reconfigurable devices such as field-programmable safety-critical applications. The hardware infrastructure of gate arrays (FPGAs) which is referred to as soft-core. A typical these systems typically consists of an embedded processor as soft-core processor, as illustrated in Fig. 1, can be reconfig- the core and different peripherals specified for the target appli- ured on an individual FPGA device during system runtime. cation [1], [2]. An embedded processor can be implemented as The main privilege of soft-core over hard-core counterpart two commonly used platforms, i.e., either hard- or soft-core. is its intrinsic flexibility to reconfigure itself with work- Hard-core processors have received great attention due to the load variations to provide higher level of parallelism which high abstraction level they provide to both developers and results in: 1) shorter time-to-market; 2) inexpensive design update with upcoming technologies; 3) continuous adoption Manuscript received May 26, 2017; revised December 6, 2017; accepted with advancements in SRAM-based FPGAs; and 4) flexibility January 23, 2018. Date of publication March 15, 2018; date of current version to implement diverse range of applications. FPGAs, how- February 18, 2019. This paper was recommended by Associate Editor G. Stitt. (Corresponding author: Hossein Asadi.) ever, are conventionally equipped with abundant logic and The authors are with the Department of Computer Engineering, routing resources to target general applications which results Sharif University of Technology, Tehran 11155-9517, Iran (e-mail: in high power consumption, large silicon die area, and low [email protected]; [email protected]; behnam_khaleghi@ performance compared with application specific integrated cir- ce.sharif.edu; [email protected]). Color versions of one or more of the figures in this paper are available cuit (ASIC) counterparts [5]. These intrinsic characteristics online at http://ieeexplore.ieee.org. of FPGAs are in contrary with fundamental requisites of Digital Object Identifier 10.1109/TCAD.2018.2812118 embedded systems [6]. 0278-0070 c 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. TAMIMI et al.: EFFICIENT SRAM-BASED RECONFIGURABLE ARCHITECTURE FOR EMBEDDED PROCESSORS 467

With the limitations of hard-core processors along with configuration cells and logic resources in the RU by exploit- infeasibility of fabricating dedicated processors due to limita- ing the similarities along with the nonuniform distribution of tions in time and/or cost in one hand, and prominent features Boolean functions in the components. These homomorphic- of FPGAs in the other hand, designers have ever-increasing structure functions are implemented as a set of small area-size tendency to use soft-core processors in embedded systems. configurable hard logics (CHLs). To our knowledge, this is the Various open-source soft-core processors [7]Ð[12] have been first study that, in a fine-grained manner, improves the area and widely used for academic and commercial purposes. Such soft- static power efficiency of FPGA-based soft-core processors cores are developed based on full description of a processor using the concept of spatiality in component utilization. The from scratch using hardware description language (HDL) [13], proposed architecture can be employed in tandem with other modification of pretested intellectual property cores (e.g., methods that intend to architecturally improve the soft-core MicroBlaze and Nios II), or precharacterized cores that are processors, e.g., multithreading [26], power gating [27]Ð[29], developed by higher abstraction level tools (e.g., SPREE [14] or ISA customization [30], as they are orthogonal to our and Chisel [15]). proposed architecture. A major challenge of embedded soft-core processors To evaluate the efficiency of the proposed architecture, we is functional units with low utilization and/or fragmented first execute MiBench benchmarks [31] on Berkeley open- accesses, e.g., floating point multiplier (MUL) and divider. source RISC-V processor [8] to obtain the access patterns Such units are accessed sporadically for a short period of and utilization rate of the RISC-V units. Afterward, using an time and remain idle for significant amount of time. The large in-house script, we identified the candidate units to be inte- period of idle time can lead to significant static power dissi- grated in an optimal number of RUs. Berkeley ABC [32] pation which is a key design constraint in battery-powered and Quartus [33] are exploited for synthesizing the mobile devices. In particular, with the downscaling of the selected units and integrating into a shared RU. Thereafter, transistor feature size and threshold voltage in smaller tech- latest version of COFFE [34] is exploited to generate efficient nology nodes, the role of static power has become more transistor sizing used in HSPICE. Finally, we use -to- pronounced [16], leading to the era of Dark Silicon [17]. routing (VTR) 7.0 [35] fed with the parameters obtained from Furthermore, these low utilized units suffer from larger area HSPICE and Design Compiler to place and route the proces- (i.e., transistor count), which is another testimony for the sors and obtain the design characteristics. Furthermore, we ever-increasing static power in embedded systems. examine generality of the proposed architecture on an alter- To improve the power consumption of embedded proces- native open-source processor (LEON2), as well. Experimental sors, previous studies have suggested to replace the low results demonstrate that our proposed architecture improves utilized units with nonvolatile reconfigurable units (RUs) the area, static power, energy, critical path delay, and total which can be reconfigured whenever an access to an unavail- execution time of RISC-V by 30.7%, 32.5%, 36.9%, 9.2%, able unit is issued [18], [19]. Such a design suffers from area and 6.3%, respectively, compared to the conventional modular overhead caused by replacing the static CMOS units with look- LUT-based soft-core processors. Examining the applicability up table (LUT)-based equivalents. Several previous studies of the proposed architecture on LEON2 revealed that the area, have also considered the use of reconfigurable architectures in static power consumption, and energy of LEON2 is enhanced conjunction with [20]Ð[23] or within the FPGA-based proces- by 23.1%, 23.2%, and 23.2%, respectively. sors [18], [19], [24], [25] mainly to speed up the computations. The novel contributions of this paper are as follows. Nevertheless, these architectures have two main drawbacks. 1) We present a comprehensive design space exploration First, the suggested reconfigurable architectures are designed among prevalent infrastructures for computa- only for a specific application domain and are not generic. tional units of a processor to discriminate the proper Second, the suggested reconfigurable architectures which have infrastructure for embedded systems. This suggests been used within these platforms are power inefficient. These integration of low utilized and fragmentally accessed shortcomings obstruct the use of such architectures in the components into shared RUs. majority of embedded systems. 2) For the first time, we propose a metric to efficient In this paper, we propose an area and power efficient recon- integration of units in elaborately designed RUs based figurable architecture for soft-core embedded processors by on component area, utilization ratio, and configuration exploiting the low utilization and fragmented accesses of func- time. tional units. The proposed architecture is motivated by large 3) By comprehensive characterization of most frequent and area and high power consumption of the low-utilized func- homomorphic-structure functions, we propose an area tional units in SRAM-based soft-core processors. Such an and power efficient synthesis algorithm which exploits architecture, however, necessitates a comprehensive profiling efficient CHLs for implementing significant portion of to determine the accessibility of processor functional units functions in the proposed RUs. To the best of our across a wide range of applications to: 1) specify which knowledge, this is the first effort which investigates the functional units should be replaced with reconfigurable com- applicability of CHLs in either soft or hard processors. ponent and 2) determine how to efficiently integrate these 4) We investigate the generality of the proposed architec- units. For this purpose, we first profile a wide range of ture on an alternative embedded processor (LEON2) as benchmarks to determine infrequently used and fragmentally well. In addition, to examine the insensitivity of shared accessed units that are the major contributor of the logic static RU, we use both academic and commercial synthesis power, and hence, are promising candidates to be merged tools. in an individual RU. Afterward, to further improve the area The rest of this paper is as follows. The related work and power efficiency of the proposed RU, we propose an is reviewed in Section II. The motivation behind the algorithm to efficiently integrate these components as shared proposed architecture is explained in Section III. We artic- RUs. The proposed algorithm aims to minimize the number of ulate the proposed architecture in Section IV. Details of the

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. 468 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 38, NO. 3, MARCH 2019

TABLE I SUMMARY OF EXPLOITING RECONFIGURABLE FABRICS IN PROCESSORS

experimental setup and results are elaborated in Section V. attempts to map the instructions into RFUs [24]. The OneChip Finally, Section VII concludes this paper. architecture that integrates an RFU into the pipeline of a super- scalar RISC processor has been proposed in [25]. The RFU is implemented in parallel with Execute and Memory Access II. RELATED WORK stages and includes one or more FPGA devices and a con- Different approaches have been proposed to reduce the troller. Thus, it can be internally pipelined and/or parallelized. area and power of soft-core processors as well as to increase In addition to significant area and power overheads imposed by their performance, which can be classified in two categories. multiple reconfiguration memories and additional circuitries, Approaches in the first category employ three techniques: 1) design complexity (e.g., modifying the compiler) is another static or 2) dynamic power gating of components of execution drawback of the OneChip architecture. A heterogeneous archi- stage [27]Ð[29], [36]; and 3) dual threshold voltage to reduce tecture exploiting RFUs based on spin-transfer torque random power. Static power gating is applied offline at the configu- access memory (STT-RAM) LUTs for the hard-core general- ration time. Therefore, the power efficiency of this technique purpose processors has been proposed in [18]. The proposed is restricted to permanently off components which depends RFUs attempt to mitigate unavailability of functional units on the target application. Dynamic power gating is applied using either static or dynamic adaptive algorithms for reconfig- online, i.e., during the application runtime, so it may have uration. In the static algorithm, during a learning phase, idle higher opportunity for power reduction. In practice, however, functional units are identified, and are reconfigured to most dynamic power gating encounters several shortcomings that active units to provide higher availability. In the dynamic algo- hinders practicality of this technique. We will discuss advan- rithm, the reconfiguration decision is made periodically while tages of the proposed architecture over dynamic power gating at the beginning of each interval, idle units are replaced with method in Section V. the active ones. Although 18% improvement in performance As the second category, using reconfigurable devices to has been reported, considerable area and power overhead is implement contemporary processors is proposed. Previous expected due to larger area and higher power consumption of studies in this category can be classified as coarse-grained and an RFU compared with custom ASIC or standard cell-based fine-grained architectures. The former allows bit-level oper- components. More importantly, substantial modifications in ations, while the latter enables relatively more complicated both compiler and processor architecture, especially bus lines operations. In the coarse-grained architectures [20]Ð[23], the and datapath, are required which are neglected in this paper. reconfigurable devices are used as co-processor to speed up A similar approach along with thermal-aware reconfiguration computations and enhance the parallelism degree and energy of functional units has been proposed in [19], wherein com- efficiency by loop unrolling. The role of compiling algorithms, putations of thermally hotspot functional units are migrated to i.e., hardwareÐsoftware partitioning, considering today’s com- cooler units in order to balance the temperature distribution. plex many-core architectures and multiprogramming can Aforementioned drawbacks for [18] stand here, as well. The adversely affect the efficiency of such architectures [37]. Thus, summary of previous studies is reported in Table I. aforementioned architectures impose considerable area over- head of profiler and departure compilers from their straight- forward flow to support branch prediction which may not always be correct in interactive and time-dependent embedded III. MOTIVATION applications [23]. The main prerequisite to integrate several functional units In fine-grained architectures, one or several types of into a single shared RU is low utilization and fragmented instructions are selected to be executed on the reconfig- access of each unit. Low utilization is a crucial constraint since urable platform. The main challenge in such architectures is implementing even few high utilized components as a single choosing computation-intensive instructions. In this regard, RU is inefficient as frequent access to such components leads CHIMAERA [24] has proposed integrating an array of recon- to substantial execution time and power penalties. Fragmented figurable functional units (RFUs) into the pipeline of a accesses is another important criterion which facilitates inte- superscalar processor which are capable of performing 9- gration of several functional components into a single RU. input functions. The authors also proposed a C compiler that To comprehend this, assume two components C1 and C2

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. TAMIMI et al.: EFFICIENT SRAM-BASED RECONFIGURABLE ARCHITECTURE FOR EMBEDDED PROCESSORS 469

TABLE II COMPONENTS UTILIZATION AND AREA COMPARISON IN RISC-V default RISC-V cache configuration in CACTI for both of I- cache and D-cache which both have 16 KB size with 16-bytes block size, 4-way associate, 32-bit input/output address and data bus, and tag size of 20 for I-cache and 22 for D-cache. Other parts of the processor such as fetch, decode, write-back stage, and peripheral units are illustrated by shaded blocks. Diamond blocks are memories such as registers and caches.

B. Resource Utilization and Access Patterns The utilization rate of RISC-V functional units over 3.4 × 109 instructions for MiBench is illustrated in Fig. 3. As shown in this figure, the highest accessed unit (i.e., IU-MUL/DIV) has been utilized only for 0.8% on average in all benchmarks. Therefore, the first condition, i.e., low utiliza- tion, is fulfilled. It should be noted that the (ALU) which has utilization rate greater than 43.3% is excluded from the results. The ALU is frequently used by other pipeline stages, e.g., instruction and decode stage. Fragmented accesses to the low-utilized units in proces- sor data path can alleviate the reconfiguration overhead and, subsequently, the execution time. To specifically clarify the importance of fragmented accessibility, we use the switching rate (SR) parameter (i.e., number of interunit switching to the total instructions) between components for each benchmark.1 For example, if accesses to three units are {C1, C1 → C2 → C3, C3, C3 → C2}, then SR is (3/7) where 3 is the number of interunit switching and 7 is the total number of instruc- tions. The results reported in Fig. 4 show that, on average, SR is 4.9 × 10−3 in MiBench benchmarks, which indicates one switching per 204 instructions (i.e., [1/0.0049]). This value would become one switching per 535 instructions if FFT benchmark is excluded. Notice that some benchmarks utilize only a unique com- ponent during the whole execution time. For example, jpeg benchmark only utilizes MUL-DIV unit as it can be observed in Fig. 3. Thus, SR for such benchmarks is zero. This obser- vation has further motivated us to propose one individual RU Fig. 2. RISC-V processor (area of each block is normalized to the chip instead of several under-utilized units that may never be used area). in most applications. As an example, Fig. 5 illustrates the fragmented accessibility of three functional units within FPU running the FFT benchmark which has the highest SR. As both having 50 percent utilizations. Accordingly, two access shown in this figure, for a long sequence of instructions, one patterns to these components could be either as: of the units in the FPU unit is busy while the other two com- { , , , → , , , → , , , → 1) C1 C1 C1 C1 C2 C2 C2 C2 C1 C1 C1 C1 ponents are idle. This motivates the concept of shared RU ···→ , , , } C2 C2 C2 C2 ; from two perspectives. First, since at most one of the internal { → → → →···→ → } 2) C1 C2 C1 C2 C1 C2 units in the FPU unit is accessed each time, integrating these wherein an arrow (→) denotes a need to reconfiguration. internal units is conceivable. Second, the large idle time of While both scenarios result in the same utilization rate, obvi- unused units imposes significant leakage power, particularly ously, the former requires fewer number of reconfigurations. considering their large area.

A. RISC-V Processor IV. EFFICIENT SRAM-BASED RECONFIGURABLE ARCHITECTURE We have examined these conditions by running MiBench benchmarks on RISC-V processor. RISC-V is a 64-bit in- Here, we first explore the design space for possible imple- order, open-source embedded processor suited for FPGA mentation of functional units inside a processor and discuss devices [38], [39] which consists of six pipeline stages, advantages and limitations of each design style. Afterward, we wherein its execution stage is equipped with both integer and elaborate our proposed architecture. floating point units, as summarized in Table II.Fig.2 illus- trates RISC-V components proportional to their area footprint A. Design-Space Exploration on the silicon die. As shown in this figure, functional units Based on the underlying infrastructure, i.e., ASIC or recon- occupy substantial portion (more than 44%) of total proces- figurable (RC), the implementation structures for functional sor area (including caches). Area of caches and memory cells are calculated using CACTI (version 6.5) [40]. We modeled 1We take the average in the benchmarks having multiple subs.

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. 470 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 38, NO. 3, MARCH 2019

Fig. 3. Utilization rate of RISC-V components running MiBench benchmarks.

(a) (b)

Fig. 4. Average intercomponent switching ratio in the proposed RUs. (c) (d)

Fig. 6. Baseline architectures of implementing the functional units: (a) ASIC/inst, (b) ASIC/module, (c) RC/all, and (d) RC/module.

Fig. 5. Access pattern to low-utilized functional units in RISC-V processor.

Fig. 7. Proposed LUT-RU architecture overview. units can be classified in four categories which are demon- stratedinFig.6. We will compare these design concepts This design style incurs significant reconfiguration overhead against the proposed architecture with respect to area and since it requires to be reconfigured once a new instruction performance. needs to be executed. Area (i.e., the number of LUTs) in such 1) ASIC/Module and RC/Module: These architectural an architecture is determined by area of the largest module. design styles [shown in Fig. 6(b) and (d), respectively] are Beside large execution time of applications, the potential area the conventional architectures for implementing hard-core and efficiency of such architecture will be shaded if the design soft-core processors. One major drawback of these design comprises a set of small and frequently used modules and approaches is that the access pattern to the target module large but infrequently used ones. has not been well-considered. As it will be discussed later, there are infrequently accessed modules within embedded pro- B. Proposed Architecture cessors that dedicating separate units for each of them is The base of the proposed architecture is to integrate the inefficient in terms of area and power. It has been reported that low-utilized and fragmentally accessed units into shared RUs area and performance gap of ASIC/module and RC/module to reduce logic resources and the number of configuration bits implementations of total processor core are 17Ð27× and 18Ð in the processor. Hence, this will improve area and power 26×, respectively. A detailed comparison of processor main efficiency of the soft-core processor. To achieve an efficient building blocks with respect to area and delay has also been architecture, we take the following steps: 1) finding the low- presented in [41]. utilized candidate units to be integrated in RUs; 2) appropriate 2) ASIC/Inst: Designing a specific module per each instruc- grouping of the candidate units into single or multiple RUs; tion, as shown in Fig. 6(a), can significantly speed up the and 3) optimizing the proposed architecture by exploiting logic operations by making them dedicated, hence, small and the similarities in Boolean functions of associated units. We efficient. Nevertheless, its main drawbacks such as compli- explain each step in the following. cated control unit and large address buses cause significant 1) Candidate Units: As discussed in Section III, low uti- area and power, which limits performance gain, and makes it lized and fragmentally accessed units (i.e., small number of impractical to implement all instructions. switching to/from a unit) is essential to migrate them to an RU. 3) RC/All: In this implementation, a single RU is dedi- These parameters are determined by running the target appli- cated to all instructions which is represented in Fig. 6(c). cation benchmarks and probing the trace file. For the RISC-V

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. TAMIMI et al.: EFFICIENT SRAM-BASED RECONFIGURABLE ARCHITECTURE FOR EMBEDDED PROCESSORS 471

Fig. 8. Various architectural classification based on λeff.

TABLE III processor, as shown in Fig. 3, ALU and IU-MUL/DIV units GROUPING CANDIDATE UNITS INTO PROPOSED RUS have relatively high utilization (44.8% and 1.5%, respectively) compared to the average utilization rate of the floating-point units (0.24%). On the other hand, the area of the ALU is smaller than the FPU. Therefore, it is not effective to migrate the ALU unit into an RU. As presented in Table II, FPU unit occupies a significant fraction (more than 59.5%) of the pro- cessor area while its utilization is only about 0.24%. The same trace is used to obtain the total number of switching between units for each benchmark and estimate the reconfiguration overhead. nonexisting units in that RU. For instance, if the original 2) Grouping Into RUs: The critical step in the proposed instruction sequence is {C1, C2, C1, C3, C1, C2, C3, C2} and architecture is determining the number of RUs and assigning only C1 and C2 are integrated inside this RU, then the num- the candidate units to each unit. There is a tradeoff between ber of reconfigurations, i.e., Trcfg, would be equal to three the number of RUs and total execution time. Smaller number 1 2 3 (i.e., {C1 −→ C2 −→ C1, C1 −→ C2, C2}). We carefully of RU results in area-efficient architecture (i.e., more units analyzed λeff for possible architectural assortments of group- are bounded in an RU) but increases the overall execution ing six candidate components into two or three RUs which time due to the high number of reconfigurations in these units. has 40 different cases as summarized in Fig. 8.Allλ in λ × eff We define the efficiency goal ( eff) as minimizing the area this figure are normalized to the minimum case (i.e., the execution time according to case with RU1 = [1, 4], RU2 = [2, 3], and RU3 = [5, 6]). λeff = Area × Texecution. (1) For the sake of brevity, we have omitted the results of inte- grating candidate units into four and five RUs, since such In this equation, Texecution is composed of two parts: 1) classification does not gain much area-power improvement. required time to execute the instructions and 2) reconfigu- Based on the results reported in Fig. 8, we propose three RUs ration overhead (which can be the dominant part). Even in based on the best classification, which is detailed in Table III, the case that reconfiguration is performed infrequently (e.g., i.e., RU1=(FP Single[MUL/ADD/SUB], FP2INT), RU2=(FP once per 500 instructions), the reconfiguration time (which is Double[MUL/ADD/SUB], FP-DIV/SQRT), and RU3=(FP2FP, typically more than 50 us) will dominate the overall execu- INT2FP). × = tion time (can be estimated as 500 5ns 5 us assuming a Exploiting the concept of integrating several low-utilized clock cycle of 5 ns). As a result, we can alter this formula to and fragmentally accessed units in one RU (LUT-RU archi- λ ∝ × eff Area Trcfg. In addition, the critical path delay of a cir- tecture) improves several design parameters. First, the number cuit is not affected by the number of units inside an RU since of RUs in the pipeline stages will be reduced, and therefore the each time only one component is implemented, and thereby, complexity of the control unit and address bus will be reduced. λ can take its original mapping. Assuming N RUs, eff can be Second, number of write-back ports decreases, which conse- obtained from (2). In this equation, area of each RU (in term quently reduces the number of write-back buffers. It should be of number of LUTs) is equal to the area of its largest candi- noted that the proposed architecture would not provide con- date unit. Thus, integrating Components (denoted by Cj) with siderable advantage when: 1) functional units are high utilized similar area/size into an individual RU is more prosperous and have frequent access patterns that may cause considerable N execution time overhead and 2) functional units differ signif- × λeff ∝ Area(RUi) × Trcfg icantly in size, e.g., merging two component with 1 and × i=1 20 LUTs may not be beneficial since the context switching N overhead can shallow the small area gain. = max{Area(Cj)|Cj ∈ RUi}×Trcfg. (2) i=1 C. Optimizing RUs The number of reconfigurations in a particular RU can be By following the previous steps, the proposed LUT-RU obtained from the original benchmark trace by discarding the architecture could be achieved, which is illustrated in Fig. 7.

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. 472 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 38, NO. 3, MARCH 2019

TABLE IV COVERAGE RATIO OF 4-INPUT NPN-CLASSES IN LOW-UTILIZED COMPONENT FUNCTIONS IN RISC-V PROCESSOR

Fig. 9. Proposed CHL-LUT-RU architecture overview.

Here, we aim to further improve the efficiency of the proposed architecture by reducing the number of resources, i.e., transis- tors and configuration cells, by using area and power efficient CHLs instead of conventional LUTs. Previous studies have shown that 4-LUT-based FPGA architecture has the minimum area among different LUTs [42]. Nonetheless, even in such a small-input LUT, the majority of LUT structure remains under- utilized in a broad range of circuits [43]Ð[45]. For example, a considerable portion of functions do not necessarily have 4-inputs, so half of LUT configuration bits and multiplexer structure, or more, is wasted. Even if all inputs are used, a small set of specific functions are repeated significantly higher than others. Therefore, there is no need to allocate generous flexibility of 16 SRAM cells to implement them. The high number of SRAMs in LUT-based logic blocks can exacer- bates the area and power consumptions in FPGA and soft-core processors. In recent studies, several fine-grained architectures have been proposed that use small reconfigurable logic cells solely or along with conventional LUTs [43]Ð[45]. CHLs are logic blocks that can implement the majority of functions with considerably smaller area and less delay as compared to their LUT counterparts. However, the main goal of these studies Fig. 10. Proposed CLB. is enhancing of design constraints in FPGA platforms while their applicability in soft-core processors is not evaluated. Accordingly, we have proposed an area-power optimization to elimination of 4-LUT and pure exploiting of CHLs which approach, i.e., CHL-LUT-RU shown in Fig. 9, where for each are less flexible than LUT in one-to-one implementation of RU, we synthesize each component (which is assigned to some functions. the target RU) into 4-input LUTs by the means of Berkeley Fig. 10 illustrates the proposed configurable ABC synthesis tool [32] to obtain 4-input functions and then (CLB) wherein cluster size (N) is equal to 10 and extract its negationÐpermutationÐnegation (NPN) class repre- ¯ CHL1:CHL2:LUT ratio is equal to 4:3:3, based on observa- sentation. To elaborate, F = AB + CD and G = BC + AD tions obtained from Table IV. Each proposed CLB reduces are NPN-equivalent since each function can be obtained from area by 28.6% compared to homogeneous LUT-based CLB. the other by negating B and permuting A and C. Therefore, The proposed CHL-LUT-RU approach iterates over functions such functions can be implemented with the same homomor- of components mapped to an RU and examines whether phic logic structure augmented with configurable inverters at the function could be substituted with either one of CHLs. its inputs and output. Our investigation, reported in Table IV, Otherwise, it will remain as a 4-LUT. has revealed that a significant portion (66.8%) of functions in the proposed RUs have the same homomorphic structure based on the concept of NPN-class representation and can D. Insensitivity of Proposed Architecture to Different effectively be implemented with CHL1 and CHL2 which Synthesis Tools have been primarily introduced in [44]. It should be noted In addition to an academic synthesis tool (Berkeley ABC), that although applicability of CHLs is motivated by func- we repeat the mapping flow with a commercial tool (Altera tion characterization in RISC-V processor, their effectiveness Quartus Integrated Synthesis tool [33]) to investigate the is also examined in LEON2 processor and large-scale com- sensitivity of the results to CAD flow. To this regard, QUIP tar- mercial FPGA benchmarks [44], [45] wherein some of these geting Startix IV device is exploited which directly synthesizes benchmarks are soft-core processors and by a VHDL, Verilog, or SystemVerilog circuit to hierarchical themselves. For instance, our investigation has revealed that BLIF. Our investigation reveals that the coverage ratio of CHL1 and CHL2 can implement 68% of functions in OR1200 CHLs is almost the same obtained by an academic synthesis in VTR benchmark suite and 59% of functions in TV80 bench- tool (Berkeley ABC). The comparison of the CHLs coverage mark in IWLS’05. It is also noteworthy that any generic ratio in both Berkeley ABC and Altera Quartus is detailed function, even unidentified at design time, can be implemented in Table IV. It is noteworthy that default synthesis constraint using pure CHLs. Albeit, in such non-LUT designs there will of ABC is area optimization while Quartus aims to reduce be an overhead in the number of used logic blocks (CHLs) due the delay which results in larger area (number of LUTs) and

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. TAMIMI et al.: EFFICIENT SRAM-BASED RECONFIGURABLE ARCHITECTURE FOR EMBEDDED PROCESSORS 473

TABLE V PARAMETERS IN THE PROPOSED AND BASELINE RC ARCHITECTURES

Fig. 11. Implementation and evaluation flow of the proposed architecture(s). therefore power consumption [33]. To be consistent with other studies, we evaluate our proposed architecture using netlists of Berkeley ABC. Fig. 12. Comparing the number of configuration bits and area of RISC-V components in the baseline and proposed architectures. V. E XPERIMENTAL SETUP AND RESULTS Here, we elaborate the implementation and evaluation flow of the proposed and baseline architectures. According to the A. Preliminary overall evaluation flow illustrated in Fig. 11,wefirstexam- Table V demonstrates the architectural parameters and ine the access pattern to each of the datapath components the description for the proposed and baseline architectures to identify low-utilized and fragmentally accessed ones. For used in the VPR experiments. Similar to commercial FPGA this aim, each MiBench benchmark is cross-compiled to the devices [49], [50], we used the conventional island-style archi- RISC-V ISA using modified GCC [8]. Then, the utilization tecture exploited in VPR tool. To calculate the area, we first and access pattern to the functional units is obtained by the extracted total area from VPR post routing reports which uses means of RISC-V ISA Simulator (Spike) [46]. Next, HDL minimum width transistor model [34] represented by (3)in description of each component is extracted from the processor which x is the ratio of transistor width to minimum width. For HDL source code and then optimized and mapped to 4-input instance, a minimum-sized inverter area in the 45-nm technol- functions by the means of Berkeley ABC synthesis tool [32] ogy can be obtained by setting x = (90 nm/90 nm) = 1for and Altera Quartus [33]. The truth-table of each function is the nMOS, and x = (135 nm/90 nm) for the pMOS transistor generated using an in-house C script and is fed to the Boolean √ Matcher [47] to extract the corresponding NPN-Class rep- Min. width transistor(x) = 0.447 + 0.128x + 0.391 x. (3) resentation. We exploit COFFE [34] for efficient transistor sizing and area and delay estimation of LUT-based architec- While (3) can be directly used for comparison purpose, to obtain the geometric (i.e., layout) area, it is shown that multi- tures (according to commercial architectures), while evaluation 2 of the CHLs is performed using HSPICE circuit-level simula- plying it to 60F , in which F is the technology size, gives an tions with Nangate 45-nm Open Cell Library [48]. The area acceptable layout-level area [34], [35] and delay values of each cell are fed to the VPR [35] archi- Area(x) = Min. width transistor(x) × 60F2. (4) tecture file for circuit placement on FPGA-based platform. To obtain more accurate and reliable results, we run VPR simulations using different initial placements (i.e., different placement seeds) since the timing and area results are sensitive B. Area to initial placements. Lastly, area and critical path delay are According to Table III, low utilized components are inte- extracted directly from VPR while power is estimated by post- grated into 3 RUs with 3152, 10 918, and 2074 LUTs. processing VPR-generated reports such as number of logic Therefore, the number of LUTs is decreased by 26.8% from blocks and routing switches (channel width). An in-house C 43 743 (which is the total number of LUTs in all func- script is used to estimate execution-time based on profiling the tional units in RC/module) to 32 019 (number of LUTs in instruction execution flow reported by RISC-V ISA Simulator the proposed architecture). Taking into account that by reduc- (Spike). In the C script, we have considered the switching ing logic resources, routing resources are also reduced; hence, time overhead which is needed whenever an RU should be number of configuration bits in the proposed architecture is reconfigured. decreased by 30.9%, as shown in Fig. 12. Area of LUT-based

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. 474 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 38, NO. 3, MARCH 2019

Fig. 13. Comparing the delay of functional units and execution time in Fig. 14. Comparing the power and energy consumption of different different architectures in RISC-V processor. architectures implementing the RISC-V processor. architectures is first calculated with respect to the dimension- E. Power and Energy Consumption less area estimation in term of minimum width transistor The total static power for each architecture is obtained by count, and then the layout-level area is obtained using (4). The post-processing the VPR reports by summing the power of layout area of ASIC-based structures is directly reported by logic and routing resources in the target architecture. Notice Design Compiler and then is multiplied to the num- that we consider the static power of both utilized and unuti- ber of instances. According to Fig. 12, the proposed LUT-RU lized LUT/CHL cells in power and energy computation despite architecture has reduced the area by 30.7%, compared with the that the RUs do not use all cells. Fig. 14 compares the RC/module baseline architecture. As expected, this improve- normalized static power consumption in different architec- ment is lower than the improvement in the number of LUTs tures. Compared with the baseline RC/module architecture, the since some CHL cells have been substituted with LUTs. proposed LUT-RU architecture reduces the power consump- tion by 32.5%. The static power of both logic and routing is C. Critical Path Delay also decreased since the LUT-RU architecture reduces both of the logic and routing resources. In addition, while the proposed The delay of the ASIC-based architectures is obtained from architecture is most power-efficient in reconfigurable architec- Synopsys Design Compiler. As shown in Fig. 13,theLUT-RU tures and reduces the number of logic resources, however, the architecture does not affect the logic delay compared with the conventional ASIC implementation, i.e., ASIC/module, is still baseline RC/module since once a component, e.g., FP-FP2F, is power-efficient than all other architectures. configured in either RC/module or LUT-RU architectures, it The energy consumption of different architectures are would have identical delay in both architectures due to the obtained by (5). Energy is estimated as summation of recon- same mapping and logic resource usages. Yet, the routing figuration and operation energy while reconfiguration energy delay will be reduced since the channel width in LUT-RU is zero in ASIC-based architectures. Powertotal stands for total architecture is considerably less than RC/module. Hence, the power of RUs or total power of resources in baseline architec- proposed LUT-RU architecture reduces critical path delay by tures and nonreconfiguring components [i.e., integer unit (IU)]. 9.2% compared to the baseline RC/module architecture. For reconfiguration, a write energy of 3.5 × 10−18 Jforasin- gle SRAM cell in 45-nm technology is assumed [51]. Fig. 14 D. Application Execution Time compares the normalized energy in the proposed and base- To calculate the execution time, we multiplied execu- line architectures. As demonstrated in this figure, LUT-RU tion time of each instruction to its corresponding com- architecture improves the energy by 36.9% as compared to the baseline RC/module architecture ponent delay. In addition, for the proposed architectures,   we added the reconfiguration time by calculating the write N latency of the total number of configuration bits (sum of E = #SRAM(RUi) × #Rcfg(RUi) × ESRAM 2 logic and routing) with the method proposed in [18]. i=1 Accordingly, we assume a 128-bit parallel data bus [18] and + Powertotal × Execution time. (5) a write latency of 0.2 ns for a single SRAM cell [51]. Thus, reconfiguration time of RU1, RU2, and RU3 in It is noteworthy that the proposed architecture advances LUT-RU architecture will be [(424 619 × 0.2ns)/128] = power-gating approach in different aspects. First, the most 0.66 μs, [(1 481 035 × 0.2ns)/128] = 0.23 μs, and challenging issue of power-gating is so-called inrush current, [(337 531 × 0.2ns)/128] = 0.52 μs, respectively, in which i.e., a substantial wake-up current occurs whenever a power- 424 619, 1 481 035, and 337 531 are the configuration bit count gated module turns abruptly on, which may cause violation of the largest component in each RU. Due to the fact that the scenarios such as instability of the registers content and func- configurations time is prorated in overall execution time of the tional errors resulting from dynamic voltage (IR) drop [52]. application, LUT-RU architecture outperforms the RC/module It also can increase the wake-up time and wake-up energy. architecture by 6.3% improvement in execution time, as shown Previous studies have reported inrush power contributes to in Fig. 13. 20% of the static power in FPGAs with 45-nm technology size [53]. Second, power gating requires complicated con- 2It is noteworthy configuring one component to another in [18] has been troller and compiling techniques to: 1) anticipate the idle calculated based on configuring configuration cells of the smallest component intervals and evaluate whether the power gating is advanta- which is not correct. geous in spite of associated overheads; 2) predict data arrival in

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. TAMIMI et al.: EFFICIENT SRAM-BASED RECONFIGURABLE ARCHITECTURE FOR EMBEDDED PROCESSORS 475

TABLE VI UTILIZATION,AREA, AND CHLS COVERAGE RATIO IN LEON2 PROCESSOR FUNCTIONAL UNITS

Fig. 15. LEON2 processor microarchitecture (the area of each block is normalized to the chip area). Fig. 16. Utilization ratio of LEON2 components in MiBench benchmarks. order to initiate the wake-up phase and assure that only a small fraction of resources is switched each time to preclude timing (and hereby, functional) errors; and 3) route signals in such a way that they do not use the resources of power-gated regions. Another advantage of the proposed architecture with respect to power-gating is the improvement of reliability. When a soft- error flips a configuration cell in the proposed architecture, this cell will be automatically corrected in the next configura- tions (notice that the power-gating methods do no turn-off the SRAM cells). Last but not least, power-gating approach leaves the part of device unused, while the proposed architecture Fig. 17. Average intercomponent switching ratio in LEON2 processor. utilizes all the resources which shortens the wirelength and thereby, wire delays, consequently. Nevertheless, since FPGAs comprise abundant routing resources which mainly remain architectures not just in FPU. Accordingly, in the rest of this underutilized, power gating of the routing resources [54] can section, the reported results for LEON2 processor are in the be employed in tandem with our proposed architecture to scope of IU. further increase the power efficiency. To identify the utilization and access pattern to the func- tional units, each MiBench benchmark is cross-compiled to the SPARC-V8 architecture using RTEMS [55]. Afterwards, F. Alternative Processor Simics full system simulator [56] is used to obtain the uti- We have examined the applicability of the proposed archi- lization and access patterns of the functional units which are tecture on an alternative open-source soft-core processor. We shown in Table VI and Fig. 16. Utilization ratio of the inte- choose LEON2 which is SPARC-V8-compliant, as an alterna- ger divider (DIV) is approximately 3.48 × 10−6% while it tive processor for several reasons. First, it is popular among occupies a significant area (79.1%) of total functional units. a wide range of embedded processors and also compati- Therefore, it can be efficiently integrated with integer MUL ble and extensible to be customized for diverse range of into a shared RU. Hence, switching ratio of the proposed RU performance/cost tradeoff, e.g., LEON2-FT. Second, the full unit is 2.81 × 10−6 on average, as demonstrated in Fig. 17. HDL source code of total LEON2 (including FPU) is license Subsequently, reconfiguration overhead of the proposed RU free, while source code of FPU is not available in LEON3 will be small and prorated in application execution time, leav- and LEON4 which causes problem in running application ing negligible impact on total execution time of applications. with floating point instructions. Finally, in RISC-V case study, The rest of the evaluation flow for the proposed and base- the integrated functional units all belong to FPU, but here in line architectures in LEON2 processor is the same as RISC-V LEON2, as we will demonstrate in the following, all functional processor. units are separate and thereby, integration of low-utilized units 1) Area: Fig. 18 demonstrates the number of LUTs and is more challenging. configuration bits in the proposed and baseline architectures. Fig. 15 illustrates the synthesized floorplan of LEON2. As As shown in this figures, the number of LUTs and configura- shown in this figure, LEON2 functional units include a 32-bit tion bits in LUT-RU has been reduced by 23.1%, compared to adder, 32-bit MUL (that can be configured to 32×32, 32×16, the baseline RC/module architecture. The total area for LUT- 32×8, and 16×16), 32-bit divider, and an optional FPU. FPU RU is also reduced by 23.1%. The detailed results for each as a co-processor is licensed and therefore, is not available architecture are shown in Fig. 18. in the processor source code. Therefore, we have omitted the 2) Power and Energy Consumption: As demonstrated in FPU unit from our evaluations and focused on the integer Fig. 19, static power of the LUT-RU architecture is reduced by functional units which proves the generality of the proposed 23.2% as compared to the RC/module architecture. In addition,

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. 476 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 38, NO. 3, MARCH 2019

TABLE VII CHLS AND LUT PARAMETERS

TABLE VIII RESULTS FOR CHL-LUT-RU IN RISC-V AND LEON2 PROCESSORS NORMALIZED TO RC/MODULE (CONSIDERING LOGIC RESOURCES)

Fig. 18. Comparing the number of LUTs and area of LEON2 components in the baseline and proposed architectures.

Nevertheless, larger processors result in larger RUs and conse- quently, higher number of reconfiguration bits and therefore, reconfiguration time. The proposed architecture also improves the critical path delay which will directly enhance the exe- cution time due to the shorter clock cycles which is valid in RISC-V case study. Fig. 19. Comparing the power and energy of different architectures implementing LEON2 processor. G. Evaluating CHL-LUT-RU In the proposed CHL-LUT-RU architecture, each shared LUT that its function can be implemented with either CHL1 or CHL2, is simply substituted by an appropriate cell in the netlist BLIF file. The CHL-LUT-RU architecture has been evaluated after placement stage and the results are reported in terms of logic resources. Since the contribution of this paper is focused on logic architecture, the impact of routing resources has not been considered. Nevertheless, a brief discussion about routing is represented in Section VI. We intend to consider proposing a customized (generic) routing architecture in the future track of this paper. Table VII reports area, delay, and static power of Fig. 20. Comparing the delay of functional units and execution time in 4-LUT and CHLs which are obtained using HSPICE simula- different architectures implementing LEON2 processor. tions and with transistor sizing and subcircuit models obtained from COFFE. The delays reported in Table VII are the average delay of all inputs of target CHL or LUT. Static power of each compared to the RC/module architecture, LUT-RU architecture cell is measured using HSPICE simulations with the same tran- has reduced energy by 23.2%. These results are expectable sistor technology files and temperature used in Nangate library since Divider is considerably smaller than MUL. Hence, the (i.e., 25 ◦C). improvements are nearly equal between energy-power and The summary of results is reported in Table VIII. As demon- between area-configuration bits. strated in this table, the proposed CHL-LUT-RU architecture 3) Critical Path Delay and Application Execution Time: has further reduced the number of SRAM configuration bits, Fig. 20 indicates the critical path delay and total execution area, static power, critical path delay, execution time, and time of the baseline and proposed architectures. The results energy compared to the proposed LUT-RU in the scope of demonstrate that total execution time has only increased by logic resources. Note that reducing configuration bits can be 10−5% in LUT-RU compared with the RC/module architec- translated as the circuit reliability due to the reduction of ture. This is due to negligible utilization ratio of the divider the number of susceptible bits to soft errors [57], [58]. In which reduces the reconfiguration time overhead. addition, while our proposed architectures, particularly CHL- It should be noted that the efficiency of the proposed archi- LUT-RU, have significantly reduced the logic resources, the tecture is irrelevant to the size of processor itself, as we have conventional ASIC implementation, i.e., ASIC/module, is still shown its effectiveness in the moderate-size RISC-V proces- power-efficient than all other architectures. This is mainly sor with ∼ 35 000 LUTs, and a smaller one (LEON2) consists due to the large number of configuration cells used in the of ∼ 3200 LUTs. The key prerequisite for effectiveness of other architectures while the ASIC implementation of a 4-input the proposed architecture is: 1) low utilized units with frag- Boolean function consumes only 35 nW on average, whereas mented access pattern and 2) low utilized units have close area. this value is greater than 180 nW for CHL cells.

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. TAMIMI et al.: EFFICIENT SRAM-BASED RECONFIGURABLE ARCHITECTURE FOR EMBEDDED PROCESSORS 477

TABLE IX SUMMARY OF THE RESULTS FOR VARIOUS SYNTHESIS TOOLS AND PROCESSORS ACROSS ALL BENCHMARK (NORMALIZED TO RC/MODULE)

TABLE X VI. DISCUSSION COMPONENTS AREAINTHEDSP CORE [61] Here, we discuss some points related to the proposed archi- tecture. The summary of the experiments results is presented in Table IX. 1) Routing Architecture: We have not investigated the rout- ing architecture for CHL-LUT-RU due to its high complexity may require large number of reconfigurations which is not in both implementation and analysis. However, since the rout- consistent with currently available NVMs. ing resources in a typical island style FPGA architecture are surrounded by logic clusters, reducing the number of required 5) Applicability in Emerging Applications: One of the logic clusters by the factor of S will reduce the number of rout- promising application of the proposed architecture can be in ing resources (i.e., switch boxes) and thereby their area and implementing machine learning (ML) algorithms, e.g., neu- power by the same factor. In addition, taking the routing delay ral networks. Despite the widespread research on efficient into account, the critical path delay will not be increased since implementation of neural networks, only a limited range of at each time only one of the functional units is mapped on the objective functions, especially sum or mean squared error used corresponding RU owning the same (or more in the case that is in linear regression, has been implemented on the state-of-the- smaller than RU) routing and logic resources as in its baseline art FPGAs [60]. This is because, operations such as division architecture. In particular, we expect that the critical path delay and square root used in the gradient descent cycle of objec- may also be improved since higher number of resources (more tive function either occur infrequently or consume significant flexibility to place and route) will be allocated to specific com- amount of resources. The proposed architecture can be ben- ponents. Furthermore, in the proposed architecture, proximity eficial in these applications when for example, floating-point of the logic clusters of each component may increase after MUL can be merged with divider since in ML algorithms, the mapping due to the exclusive mapping of components and a stream of multiplication-additional (MAC) operations is fol- reduction of the array size. It can further improve the routing lowed by a division. Therefore, the context-switching inside an delay. RU will be tolerable. As an interesting point, in the proposed architecture we have already merged these units within RISC- 2) Optimal Heterogeneous Mapper: A heterogeneous V processor. The proposed architecture is also applicable for architecture such as the proposed CHL-LUT-RU may need component-based designs other than soft-core processors, such modifications in the original mapping algorithm that is as digital signal processing (DSP) cores that are exploited in used in a homogeneous, pure 4-LUT-based architecture [43]; ML algorithms [60]. We have examined the applicability of therefore, detailed investigation of the intracluster architec- the proposed architecture on a floating point DSP core from ture (e.g., ratio of CHL to LUTs) and mapping to such OpenCores [61] which its contributing components are listed an architecture is required. We will cover details of both in Table X. Depending on the distribution of the operations the routing and logic architectures as an extension of this (tunable according to the target application), several com- paper. ponents of the DSP core [60] can be merged into a shared 3) Reconfiguration Overhead: Integrating underutilized RU. One appropriate integration to save considerable area and components into shared RUs can impose reconfiguration over- power can be achieved by merging the divider, MUL, and sub- head which can be problematic in hard real-time designs. tractor units. Our experimental results for this DSP core reveal Although the impact of reconfiguration time on the total appli- that area is saved by 50.2%, with 50.1% improvement in static cation runtime is negligible (about 2%), it can be eliminated power while keeping the latency intact. using the concept of shadow SRAM cells [59]. Nevertheless, Another use-case of the proposed method can be in a tradeoff between eliminating the reconfiguration delay and FPGA-assisted Cloud computing, also known as, FPGA area-power overhead of shadow SRAMs should be taken into virtualization [62] which incorporates FPGAs in datacenters account. in order to improve the processing capacity and power con- 4) Applicability With Nonvolatile Memories: Although our sumption. A challenging task in such platforms is an efficient proposed architecture is targeted for SRAM-based soft-core use of FPGA resources since an FPGA is expected to afford processors, it is definitely not limited to SRAMs and can be computation of different users (i.e., several tasks) at the same expanded to Nonvolatile Memory (NVM) platforms as well. time. Since the proposed architecture integrates low-utilized However, despite promising characteristics of NVM such as functional units into a single RU, provided that we keep the negligible leakage power, they encounter with several unad- FPGA size the same, further tasks can be implemented into dressed deficiencies such as write latency and energy and still the same FPGA device. This will reduce the cost-per-user or are in prototyping phase. In addition, the proposed architecture cost-per-task.

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. 478 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 38, NO. 3, MARCH 2019

VII. CONCLUSION [18] A. R. Ashammagari, H. Mahmoodi, and H. Homayoun, “Exploiting STT-NV technology for reconfigurable, high performance, low power, In this paper, we presented a reconfigurable architecture and low temperature functional unit design,” in Proc. Conf. Design for embedded soft-core processors, which aims to improve Autom. Test Europe, Dresden, Germany, 2014, p. 335. area and power efficiency of FPGA-based soft-core proces- [19] A. R. Ashammagari, H. Mahmoodi, T. Mohsenin, and H. Homayoun, sors while maintaining flexibility to customize and update the “Reconfigurable STT-NV LUT-based functional units to improve processor for designers. In the proposed architecture, we inte- performance in general-purpose processors,” in Proc. ACM 24th Ed. Great Lakes Symp. VLSI, 2014, pp. 249Ð254. grated low-utilized and fragmentally accessed components into [20] H. Singh et al., “MorphoSys: An integrated reconfigurable system a single RU and used efficient configurable logics along with for data-parallel and computation-intensive applications,” IEEE Trans. the mapping algorithm to implement the overlapping functions Comput., vol. 49, no. 5, pp. 465Ð481, May 2000. of functional components mapped to each RU. Our integration [21] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “ADRES: An architecture with tightly coupled VLIW processor and is based on carefully profiling the frequency and access pattern coarse-grained reconfigurable matrix,” in Field Programmable Logic and of instructions in MiBench applications, in order to well select Application. Heidelberg, Germany: Springer, 2003, pp. 61Ð70. the merged components. We have implemented RISC-V pro- [22] S. C. Goldstein et al., “PipeRench: A reconfigurable architecture and cessor to examine the efficiency of the proposed architectures. compiler,” Computer, vol. 33, no. 4, pp. 70Ð77, Apr. 2000. Experimental results show that, as compared to conventional [23] R. Lysecky, G. Stitt, and F. Vahid, “Warp processors,” ACM Trans. Design Autom. Electron. Syst., vol. 11, no. 3, pp. 659Ð681, 2004. FPGA-based soft-core processors, the proposed architecture [24] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, “CHIMAERA: improve the area footprint, static power, energy consumption, A high-performance architecture with a tightly-coupled reconfigurable critical path delay, and total execution time by 30.7%, 32.5%, functional unit,” in Proc. 27th Int. Symp. Comput. Archit., vol. 28, 2000, 36.9%, 9.2%, and 6.3%, respectively. pp. 225Ð235. [25] J. E. Carrillo and P. Chow, “The effect of reconfigurable units in super- scalar processors,” in Proc. ACM/SIGDA 9th Int. Symp. Field Program. Gate Arrays, Monterey, CA, USA, 2001, pp. 141Ð150. REFERENCES [26] M. Labrecque and J. G. Steffan, “Improving pipelined soft processors [1] D. D. Gajski and F. Vahid, “Specification and design of embedded with multithreading,” in Proc. IEEE Int. Conf. Field Program. Logic hardware- systems,” IEEE Des. Test. Comput., vol. 12, no. 1, Appl. (FPL), Amsterdam, The Netherlands, 2007, pp. 210Ð215. pp. 53Ð67, Feb. 1995. [27] Z. Hu et al., “Microarchitectural techniques for power gating of exe- [2] P. Marwedel, Embedded System Design: Embedded Systems Foundations cution units,” in Proc. ACM Int. Symp. Low Power Electron. Design, of Cyber-Physical Systems. Amsterdam, The Netherlands: Springer, Newport Beach, CA, USA, 2004, pp. 32Ð37. 2010. [28] S. Roy, N. Ranganathan, and S. Katkoori, “A framework for power- [3] M. Barr, Programming Embedded Systems in C and C++. Sebastopol, gating functional units in embedded ,” IEEE Trans. CA, USA: O’Reilly Media, 1999. Very Large Scale Integr. (VLSI) Syst., vol. 17, no. 11, pp. 1640Ð1649, Nov. 2009. [4] P. Marwedel and G. Goossens, Code Generation for Embedded [29] A. A. M. Bsoul, S. J. E. Wilton, K. H. Tsoi, and W. Luk, “An FPGA Processors, vol. 317. New York, NY, USA: Springer, 2013. architecture and CAD flow supporting dynamically controlled power [5] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” gating,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 1, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 2, pp. 178Ð191, Jan. 2016. pp. 203Ð215, Feb. 2007. [30] P. Yiannacouras, J. G. Steffan, and J. Rose, “Exploration and cus- [6] M. Sarrafzadeh, F. Dabiri, R. Jafari, T. Massey, and A. Nahapetan, “Low tomization of FPGA-based soft processors,” IEEE Trans. Comput.-Aided power light-weight embedded systems,” in Proc. IEEE Int. Symp. Low Design Integr. Circuits Syst., vol. 26, no. 2, pp. 266Ð277, Feb. 2007. Power Electron. Design (ISLPED), 2006, pp. 207Ð212. [31] M. R. Guthaus et al., “MiBench: A free, commercially representative [7] Gaisler Research. Accessed: Mar. 2016. [Online]. Available: embedded benchmark suite,” in Proc. IEEE Int. Workshop Workload www.gaisler.com/ Characterization (WWC), 2001, pp. 3Ð14. [8] K. Asanovic« and D. A. Patterson, “Instruction sets should be free: The [32] Berkeley Logic Synthesis and Verification Group. (2011). ABC: A case for RISC-V,” Elect. Eng. Comput. Sci., Univ. California at Berkeley, System for Sequential Synthesis and Verification, Release 70901. Berkeley, CA, USA, Rep. UCB/EECS-2014-146, 2014. [Online]. [Online]. Available: http://www.eecs.berkeley.edu/∼alanmi/abc/ Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS- [33] J. Pistorius, M. Hutton, A. Mishchenko, and R. Brayton, “Benchmarking 2014-146. pdf method and designs targeting logic synthesis for FPGAs,” in Proc. IWLS, [9] OpenSparc Website. Accessed: Mar. 2016. [Online]. Available: vol. 7, 2007, pp. 230Ð237. www.opensparc.org [34] S. Yazdanshenas and V. Betz, “Automatic circuit design and modelling [10] OpenRisc Website. Accessed: Mar. 2016. [Online]. Available: for heterogeneous FPGAs,” in Proc. IEEE Int. Conf. Field Program. www..io Technol. (FPT), Melbourne, VIC, Australia, 2017, pp. 9Ð16. [11] F. Plavec, B. Fort, Z. G. Vranesic, and S. D. Brown, “Experiences with [35] J. Luu et al., “VTR 7.0: Next generation architecture and CAD system soft-core ,” in Proc. 19th IEEE Int. Parallel Distrib. for FPGAs,” ACM Trans. Reconfigurable Technol. Syst., vol. 7, no. 2, Process. Symp., Denver, CO, USA, 2005, p. 167b. p. 6, 2014. [12] L. Barthe, L. V. Cargnini, P. Benoit, and L. Torres, “The SecretBlaze: [36] M. Hosseinabady and J. L. Nunez-Yanez, “Run-time power gating A configurable and cost-effective open-source soft-core processor,” in in hybrid ARM-FPGA devices,” in Proc. 24th Int. Conf. IEEE Field Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops Ph.D. Forum Program. Logic Appl. (FPL), Munich, Germany, 2014, pp. 1Ð6. (IPDPSW), Shanghai, China, 2011, pp. 310Ð313. [37] J. Teich, “Hardware/software codesign: The past, the present, and [13] J. G. Tong, I. D. L. Anderson, and M. A. S. Khalid, “Soft-core processors predicting the future,” Proc. IEEE, vol. 100, pp. 1411Ð1430, May 2012. for embedded systems,” in Proc. IEEE Int. Conf. Microelectron. (ICM), [38] VectorBlox/risc-V. VectorBlox Computing Inc. Accessed: Mar. 10, 2017. Dhahran, Saudi Arabia, 2006, pp. 170Ð173. [Online]. Available: https://github.com/VectorBlox/orca [14] P. Yiannacouras, J. Rose, and J. G. Steffan, “The microarchitecture of [39] Y. Lee et al., “A 45nm 1.3 GHz 16.7 double-precision GFLOPS/W FPGA-based soft processors,” in Proc. ACM Int. Conf. Compilers Archit. RISC-V processor with vector accelerators,” in Proc. 40th IEEE Eur. Synth. Embedded Syst., San Francisco, CA, USA, 2005, pp. 202Ð212. Solid State Circuits Conf. (ESSCIRC), 2014, pp. 199Ð202. [15] J. Bachrach et al., “Chisel: Constructing hardware in a scala embed- [40] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “Cacti 6.0: A ded language,” in Proc. ACM 49th Annu. Design Autom. Conf., tool to model large caches,” HP Lab., Bristol, U.K., Rep. HPL-2009-85, San Francisco, CA, USA, 2012, pp. 1216Ð1225. pp. 22Ð31, 2009. [16] A. A. M. Bsoul and S. J. E. Wilton, “An FPGA architecture supporting [41] H. Wong, V. Betz, and J. Rose, “Comparing FPGA vs. custom dynamically controlled power gating,” in Proc. IEEE Int. Conf. Field CMOS and the impact on processor microarchitecture,” in Proc. 19th Program. Technol. (FPT), Beijing, China, 2010, pp. 1Ð8. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, 2011, pp. 5Ð14. [17] H. Esmaeilzadeh, E. Blem, R. St Amant, K. Sankaralingam, and [42] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep- D. Burger, “Dark silicon and the end of multicore scaling,” ACM submicron FPGA performance and density,” IEEE Trans. Very Large SIGARCH Comput. Archit. News, vol. 39, no. 3, pp. 365Ð376, 2011. Scale Integr. (VLSI) Syst., vol. 12, no. 3, pp. 288Ð298, Mar. 2004.

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply. TAMIMI et al.: EFFICIENT SRAM-BASED RECONFIGURABLE ARCHITECTURE FOR EMBEDDED PROCESSORS 479

[43] I. Ahmadpour, B. Khaleghi, and H. Asadi, “An efficient reconfigurable Zahra Ebrahimi received the B.Sc. and M.Sc. architecture by characterizing most frequent logic functions,” in Proc. degrees in computer engineering the Sharif IEEE 25th Int. Conf. Field Program. Logic Appl. (FPL), London, U.K., University of Technology (SUT), Tehran, Iran, in 2015, pp. 1Ð6. 2014 and 2016, respectively. [44] A. Ahari, B. Khaleghi, Z. Ebrahimi, H. Asadi, and M. B. Tahoori, She was with the Data Storage, Networks, and “Towards dark silicon era in FPGAs using complementary hard logic Processing Laboratory, Department of Computer design,” in Proc. IEEE 24th Int. Conf. Field Program. Logic Appl. (FPL), Engineering, SUT, as a Research Assistant for Munich, Germany, 2014, pp. 1Ð6. four years. Her current research interests include [45] Z. Ebrahimi, B. Khaleghi, and H. Asadi, “PEAF: A power-efficient reconfigurable computing and computer-aided architecture for SRAM-based FPGAs using reconfigurable hard logic design. design in dark silicon era,” IEEE Trans. Comput., vol. 66, no. 6, pp. 982Ð995, Jun. 2017. [46] Behavioural Simulation (Spike). Accessed: Aug. 2016. [Online]. Available: www.lowrisc.org/docs/untether-v0.2/spike [47] D. Chai and A. Kuehlmann, “Building a better Boolean matcher and symmetry detector,” in Proc. Conf. Design Autom. Test Europe, Munich, Germany, 2006, pp. 1079Ð1084. Behnam Khaleghi received the B.Sc. and M.Sc. [48] Nangate Open Cell Library. Accessed: Mar. 2016. [Online]. Available: degrees in computer engineering from the Sharif www.nangate.com/ University of Technology (SUT), Tehran, Iran, in [49] -2 Platform FPGA Hand Book, Altera, San Jose, CA, USA, 2013 and 2016, respectively. Apr. 2011. He is currently a Research Assistant with [50] Virtex-4 Platform FPGA User Guide, , San Jose, CA, USA, the Data Storage, Networks, and Processing Dec. 2008. Laboratory, Department of Computer Engineering, [51] A. Chen, J. Hutchby, V. Zhirnov, and G. Bourianoff, Emerging SUT. He was a Research Assistant with the Nanoelectronic Devices. Chichester, U.K.: Wiley, 2014. Chair for Embedded Systems, Karlsruhe Institute of [52] A. A. M. Bsoul and S. J. E. Wilton, “A configurable architecture to Technology, Karlsruhe, Germany, in 2014 and 2015. limit wakeup current in dynamically-controlled power-gated FPGAs,” in His current research interests include reconfigurable Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays, Monterey, computing, CAD, and reliable system design. CA, USA, 2012, pp. 245Ð254. Mr. Khaleghi was a recipient of two Best Paper Nominations with the [53] S. Sharp. Power Management Solution GuideÐXilinx. Accessed: DAC’17 and DATE’17. Mar. 10, 2017. [Online]. Available: https://www.xilinx.com/publications/ archives/solution_guides/power_management.pdf [54] Z. Seifoori, B. Khaleghi, and H. Asadi, “A power gating switch box architecture in routing network of SRAM-based FPGAs in dark silicon era,” in Proc. Conf. Design Autom. Test Europe, Lausanne, Switzerland, 2017, pp. 1342Ð1347. [55] RTEMS Cross Compilation System (RCC). Accessed: Aug. 2016. [Online]. Available: www.rtems.org Hossein Asadi (M’08–SM’14) received the B.Sc. [56] P. S. Magnusson et al., “Simics: A full system simulation platform,” and M.Sc. degrees in computer engineering from Computer, vol. 35, no. 2, pp. 50Ð58, Feb. 2002. the Sharif University of Technology (SUT), Tehran, [57] S. Yazdanshenas, H. Asadi, and B. Khaleghi, “A scalable dependabil- Iran, in 2000 and 2002, respectively, and the Ph.D. ity scheme for routing fabric of SRAM-based reconfigurable devices,” degree in electrical and computer engineering from IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 9, Northeastern University, Boston, MA, USA, in pp. 1868Ð1878, Sep. 2015. 2007. [58] H. Asadi and M. B. Tahoori, “Analytical techniques for soft error He was with EMC Corporation, Hopkinton, MA, rate modeling and mitigation of FPGA-based designs,” IEEE Trans. USA, as a Research Scientist and a Senior Hardware Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 12, pp. 1320Ð1331, Engineer, from 2006 to 2009. From 2002 to 2003, Dec. 2007. he was a member of the Dependable Systems [59] W. Zhang, N. K. Jha, and L. Shang, “Low-power 3D nano/CMOS hybrid Laboratory, SUT, where he researched hardware verification techniques. From dynamically reconfigurable architecture,” ACM J. Emerg. Technol. 2001 to 2002, he was a member of the Sharif Rescue Robots Group. He Comput. Syst., vol. 6, no. 3, p. 10, 2010. has been with the Department of Computer Engineering, SUT, since 2009, [60] D. Mahajan et al., “TABLA: A unified template-based framework for where he is currently a tenured Associate Professor. He is the Founder and accelerating statistical machine learning,” in Proc. IEEE Int. Symp. High the Director of the Data Storage, Networks, and Processing Laboratory, the Perform. Comput. Archit. (HPCA), Barcelona, Spain, 2016, pp. 14Ð26. Director of Sharif High-Performance Computing Center, the Director of Sharif [61] OpenCores. Accessed: Nov. 2017. [Online]. Available: Information and Communications Technology Center, and the President of www..org Sharif ICT Innovation Center. He spent three months in 2015 as a Visiting [62] S. Yazdanshenas and V. Betz, “Quantifying and mitigating the costs Professor with the School of Computer and Communication Sciences, École of FPGA virtualization,” in Proc. IEEE 27th Int. Conf. Field Program. Poly-technique Fédérele de Lausanne, Lausanne, Switzerland. He is also the Logic Appl. (FPL), Ghent, Belgium, 2017, pp. 1Ð7. co-founder of HPDS corp., designing and fabricating midrange and high-end data storage systems. He has authored and co-authored over 80 technical papers in reputed journals and conference proceedings. His current research interests include data storage systems and networks, solid-state drives, oper- ating system support for I/O and memory management, and reconfigurable and dependable computing. Dr. Asadi was a recipient of the Technical Award for the Best Robot Design Sajjad Tamimi received the B.Sc. degree in com- from the International RoboCup Rescue Competition, organized by AAAI and puter engineering from the Iran University of RoboCup, the Best Paper Award at the 15th CSI International Symposium on Science and Technology, Tehran, Iran, in 2014 and & Digital Systems (CADS), the Distinguished Lecturer the M.Sc. degree in computer engineering from the Award from SUT in 2010, the Distinguished Researcher Award and the Sharif University of Technology (SUT), Tehran, in Distinguished Research Institute Award from SUT in 2016, the Distinguished 2016. Technology Award from SUT in 2017, and the Extraordinary Ability in He was with the Data Storage, Networks, and Science visa from U.S. Citizenship and Immigration Services in 2008. He Processing Laboratory, Department of Computer has also served as the publication chair of several national and international Engineering, SUT, as a Research Assistant for two conferences, including CNDS2013, AISP2013, and CSSE2013 during the past years. His current research interests include recon- four years. He has served as a Guest Editor of the IEEE TRANSACTIONS ON figurable computing, embedded system design, and COMPUTERS, an Associate Editor of Microelectronics Reliability,andthe computer architecture. Program Co-Chair of CADS2015 and CSI National Computer Conference.

Authorized licensed use limited to: SLUB Dresden. Downloaded on April 30,2021 at 13:50:39 UTC from IEEE Xplore. Restrictions apply.