<<

Journal of Low Power Electronics and Applications

Article Opportunistic Design Margining for Area and Power Efficient Processor Pipelines in Real Time Applications

Mini Jayakrishnan 1,2,*, Alan Chang 2 and Tony Tae-Hyoung Kim 1 1 VIRTUS, IC Design Centre of Excellence, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore; [email protected] 2 NXP Semiconductors Singapore Pte Ltd., 1 Fusionopolis Walk, #12-01/02 South Tower, Solaris, Singapore 138628, Singapore; [email protected] * Correspondence: [email protected]; Tel.: +65-911-63-479

 Received: 9 February 2018; Accepted: 18 March 2018; Published: 21 March 2018 

Abstract: The semiconductor industry is strategically focusing on automotive markets, and significant investment is targeted to addressing these markets. Runtime better-than-worst-case designs like Razor lead to massive timing errors upon breaching the critical operating point and have significant area overheads. As we scale to higher-reliability automotive and industrial markets we need alternative techniques that will allow full extraction of the power benefits without sacrificing reliability. The proposed method utilizes positive slack available in the stages and re-distributes it to the preceding critical logic stage using Slack Balancing Flip-Flops (SBFFs). We use opportunistic under designing to get rid of the area, power and error correction overheads associated with the speculative hardware of runtime techniques. The proposed logic reshaping results in 12 percent and eight percent power and area savings respectively compared to the worst-case design approach. Compared to runtime better-than-worst-case designs, we get 51 percent and 10 percent power and area savings, respectively. In addition, the timing budgeting and timing correction using opportunistic slack eliminate critical operating point behavior, metastability issues and hold buffer overheads encountered in existing runtime resilience techniques.

Keywords: variation tolerance; slack balancing; under design; logic reshaping

1. Introduction Technology scaling has benefited integrated circuits by meeting their power, performance and area goals over generations. However, it has also aggravated circuits and system failures creating major cost and reliability impacts in nanoscale designs [1,2]. Advanced technology nodes show a significant amount of intra-die variations due to inaccuracies [3–5]. Dynamic variations caused by voltage and temperature fluctuations also cause reliability issues [6]. These reliability issues shrink the design life cycle. The impacts of variations continue to increase with process technology scaling, which leads to pessimistic delay margins in processor pipelines. This stretches the timing guard bands in traditional designs. The worst-case design margins result in overdesign and wastage of chip resources. Therefore, we need opportunistic design techniques to improve the chip yield without compromising performance and energy efficiency, which are highly demanded. Digital circuits do not always face the worst operating conditions. Several alternative techniques have been proposed to dynamically tune the operating conditions in real time to recover the design margins. Sensor-based adaptive techniques help to combat static and dynamic variations [7,8]. Hardware signatures from sensors are used to tune the circuit operating point [9,10]. However, sensors may not respond fast enough to dynamic variations. In addition, sensors need to be calibrated to

J. Low Power Electron. Appl. 2018, 8, 9; doi:10.3390/jlpea8020009 www.mdpi.com/journal/jlpea J. Low Power Electron. Appl. 2018, 8, 9 2 of 18 determine safe operation regions through post-silicon tuning. Another technique is to use critical path replicas to monitor variations [11]. However, they fail to detect the actual amount of variation due to the mismatch between the actual path and the replica path. In situ error correction techniques illustrated in Figure1 help to overcome the drawbacks of the sensor-based designs. They tune supply voltage to the point of failure to minimize design margins and reclaim power. Timing errors due to dynamic variations are then detected and corrected using special flip-flops. However, the dynamic tuning of the operating point makes the in situ schemes vulnerable to metastability issues in data paths and/or error paths, which affects the overall system recovery. Razor I [12] illustrated in Figure1a has a metastability issue in the data path since data are allowed to change very close to the clock (CLK) edge. Razor II [13] in Figure1b overcomes the metastability in the data path using a latch. Double Sampling with Time Borrowing (DSTB) [14] and Transition Detection with Time Borrowing (TDTB) [14] are similar to Razor II, employing a latch as the data path. However, the delayed sampling leads to short path violations and buffer overheads. Bubble Razor [15] uses two-phase latch timing to overcome the short path violations. However, there is a performance penalty for error recovery during heavy workloads. Runtime techniques come with high-cost overheads, latency issues and the risk of runtime error handling. In the Razor approach, the voltage adaptation is based on the measured timing error rate. As errors have to be accepted, architectural recovery circuits are necessary, which increases design time and cost. Besides the undesirable increase in complexity, error recovery also leads to variable and unpredictable latency, which causes catastrophic failures. Error masking techniques in Figure1c like soft edge flip-flops [16] and Time Borrowing and Error Relay (TIMBER [17]) mask timing errors by borrowing time from the subsequent pipeline stages with zero latency overheads. However, they have limited design margin reduction and metastability and hold buffer overheads. Runtime techniques reclaim power, area and performance by using dynamic/adaptive voltage/frequency scaling and Better-than-Worst-Case (BTWC) design techniques [18–20]. Design margins are also recovered by redistributing slack to the critical pipeline stages using combinatorial and sequential optimizations. Sequential optimizations like retiming help to minimize clock period, achieve low power and maximize tolerance to variations [21,22]. However, it restricts the space of possible retiming and may result in unnecessary area overheads. Useful skew and clock skew scheduling [23] combined with gate sizing [24,25] help to compensate variations and speed up the processor pipelines. EVAL (Environment for Variation-Afflicted Logic) speeds up the timing of critical paths through Adaptive Body Bias (ABB), as well as Adaptive Supply Voltage (ASV) [26], but it has significant area overheads. Blue Shift uses On-demand Selective Biasing (OSB) and Path Constrained Tuning (PCT) [27] to achieve performance gains at the cost of significant power overheads. Such design optimizations worsen the critical operating point of the design [28], which limits the effectiveness of the voltage scaling. Power-aware slack distribution (SlackOptimizer) [29] uses cell sizing together with Razor voltage scaling to distribute slack evenly in a power- and cost-efficient manner. However, they retain metastability and hold buffer overheads of the speculative pipelines. Pulsed latches [30,31] are used as a means to reduce power consumption, but just like other latch-based systems, it is difficult to verify the design timing. Moreover, pulse width integrity issues and pulse generation overheads need to be taken care of while using such systems. The latency issues inherent in run time techniques makes them unsuitable for real-time applications [32,33]. For higher-reliability automotive markets, we need alternative techniques that will allow full extraction of the power benefits without sacrificing reliability [34]. Techniques that trade off critical path slack with non-critical slack such as path constraint tuning [27] and SlackOptimizer [29] over constrain the critical logic and reclaim the power by under constraining the non-critical logic. Our methodology is different in the fact that we under-design the critical logic and thereby get more power and area reductions from the usual over constrained counterparts. Compared to the existing speculative slack exploitation techniques like soft edge flip-flops [16] and TIMBER, we use available slack in the design, thereby achieving PPA reductions J. Low Power Electron. Appl. 2018, 8, 9 3 of 18

J.J. Low Low Power Power Electron. Electron. Appl. Appl. 2018 2018, ,7 7, ,x x 33 of of 18 18 independent of timing error rates. The reclaimed slack for the proposed method remains constant independent of the error rate, as shown in Figure2. Moreover, the existing optimizations based on retimingretiming [22], [22], skew skew scheduling scheduling [23] [23] and and gate gate sizing sizing [25] [25] do do not not fully fully reclaim reclaim the the available available slack slack to to be be retiming [22], skew scheduling [23] and gate sizing [25] do not fully reclaim the available slack to tradedtraded offoff withwith power/performance.power/performance. TheyThey alsoalso limitlimit thethe amountamount ofof timetime borrowedborrowed duedue toto holdhold be traded off with power/performance. They also limit the amount of time borrowed due to hold violations.violations. The The proposed proposed approach approach has has significant significant slack slack gain gain compared compared to to the the existing existing techniques, techniques, as as violations. The proposed approach has significant slack gain compared to the existing techniques, shownshown in in Figure Figure 2. 2. Our Our paper paper uses uses a a design design--timetime method, method, which which uses uses a a speculative speculative hardware hardware similar similar as shown in Figure2. Our paper uses a design-time method, which uses a speculative hardware toto thethe runtimeruntime methodsmethods toto allowallow betterbetter--thanthan--worstworst--casecase ooperation.peration. WeWe useuse opportunisticopportunistic underunder similar to the runtime methods to allow better-than-worst-case operation. We use opportunistic designingdesigning to to get get rid rid of of the the area, area, power power and and error error correction correction overheads overheads associated associated with with the the speculative speculative under designing to get rid of the area, power and error correction overheads associated with the hardwarehardware ofof runtimeruntime techniques.techniques. RuntimeRuntime techniquestechniques showshow significantsignificant areaarea overheadoverhead [33].[33]. TableTable 11 speculative hardware of runtime techniques. Runtime techniques show significant area overhead [33]. comparescompares the the differe differentnt cost cost--effectiveeffective resilient resilient design design techniques, techniques, which which shows shows that that only only the the proposed proposed Table1 compares the different cost-effective resilient design techniques, which shows that only the techniquetechnique has has zero zero area area overhead. overhead. proposed technique has zero area overhead.

FlipFlip-flop-flop LatchLatch Data Path DataData Path Path DataData Path Path Data Path QQ DD QQ DD QQ DD

CLK CLK CLKCLK ERROR CLK ERRORERROR CLK ERRORERROR ERROR

(a(a)) (b(b)) (c(c))

FigureFigure 1. 1. Speculative SpeculativeSpeculative techniques: techniques: techniques: ( a(a)( )aRazor Razor) Razor I Ierror error I error detection detection detection with with with flip flip- flip-flopflop-flop as as the the as data thedata data path path path [12], [12], [( 12b(b)]; ) Razor(Razorb) Razor II II error error II error detection detection detection with with with latch latch latch as as the asthe the data data data path path path [13] [13] [13 and and] and ( c(c) ( )cTIMBER )TIMBER TIMBER error error error masking masking masking with with with flip flip flip-flop-flop-flop asas the the data data path pathpath [17]. [[17].17].

RazorRazor SkewSkew schedule schedule TIMBERTIMBER ProposedProposed

TTckck/2/2

TTckck/6/6 Slackreclaimed Slackreclaimed

ErrorError rate rate

FigureFigure 2. 2. Reclaimed Reclaimed slack slack vs. vs. error error rate rate for for various various resilience resilience techniques. techniques.

OurOur paper paper relies relies on on opportunisti opportunisti opportunisticcc better better better-than-worst-case--thanthan--worstworst--casecase design, design, which which differs differs significantly significantlysignificantly from from traditionaltraditionaltraditional worst worst-caseworst--casecase design design and and at at the the same same time time gives gives area area and and power power savings. savings. We We use use worst worst-caseworst--casecase designdesign margins margins with with respect respect to to the the delayed delayeddelayed clock clock edge edge as as shown shownshown in in Figure FigureFigure 3 3,3,, whichwhichwhich categoricallycategoriccategoricallyally makesmakes it it a aa better better-than-worst-casebetter--thanthan--worstworst--casecase technique. technique.technique. WeWe We taketake advantageadvantage ofof the the opportunistic opportunistic slack slack and andand translatetranslatetranslate it it into intointo better better-than-worst-casebetter--thanthan--worstworst--casecase design design margins margins while while still still achieving achieving variationvariation tolerance.tolerance. OpportunisticOpportunistic logic logic downsizing downsizing downsizing helps helps helps to to to reclaim reclaim reclaim power power power at at atd design designesign time time time rather rather rather than than than scaling scaling scaling voltage voltage voltage at at runtime.atruntime. runtime. Therefore,Therefore, Therefore, wewe we useuse use thethe the termterm term “Opportunistic“Opportunistic “Opportunistic BetterBetter Better-Than-Worst-Case--ThanThan--WorstWorst--CaseCase design”design” design” (OBTWC)(OBTWC) (OBTWC) toto differentiatetodifferentiate differentiate itit from itfrom from traditionaltraditional traditional WorstWorst Worst-Case--CaseCase DesignDesign Design (WCD)(WCD) (WCD) andand and runtime runtime Better Better-Than-Worst-CaseBetter--ThanThan--WorstWorst--CaseCase (BTWC)(BTWC)(BTWC) techniques. techniques.techniques. The The propo proposedproposedsed solution solution works works in in a a conservative conservative manner manner that that guarantees guarantees “always “always correct”correct” computation computation and and timingtiming correctnesscorrectness ofof thethe circuitcircuit withwith respectrespect toto thethe delayed delayed clockclock edge,edge, eveneven in in the the worst worst-caseworst--casecase scenario. scenario.scenario.

J. Low Power Electron. Appl. 2018, 7, x 4 of 18

Table 1. Comparison of cost-effective resilient design techniques. DSTB, Double Sampling with Time Borrowing; TDTB, Detection with Time Borrowing; SBFF, Slack Balancing Flip-Flops.

Feature Speculative Non-Speculative Razor [12], DSTB, Retiming [21], SlackOptimizer, Proposed SBFF J. Low Power Electron. Appl.EVAL2018 [26],, 8, 9 TDTB [14], skew scheduling 4 of 18 Techniques SkewOptimizer, + logic Blueshift [27] TIMBER [17], soft [23], gate sizing CombOpt [29] downsizing edge flip-flop [16] [24] Table 1. ComparisonError rate of cost-effectivevs. Error resilientrate vs. design techniques. DSTB, Double Sampling with Time Trade-off Error rate vs. power No No Borrowing; TDTB,performance Detection with Timepower Borrowing; SBFF, Slack Balancing Flip-Flops. Duplicate Duplicate Error handling Duplicate Latch/FFs No error No error Featurepaths Latch/FFs Speculative Non-Speculative Clock tree Razor [12], DSTB, TDTB [14], SlackOptimizer, Retiming [21], Proposed SBFF EVAL [26], Techniques No TIMBER [17Yes], soft edge SkewOptimizer,Yes skew schedulingNo [23], Yes+ logic loading Blueshift [27] flip-flop [16] CombOpt [29] gate sizing [24] downsizing Short path Error rate vs. Trade-off No Error rateYes vs. power Error rateYes vs. powerYes No No No padding performance MetastabilityError handling DuplicateYes paths DuplicateYes Latch/FFs DuplicateYes Latch/FFs No error NoNo error ClockSequential tree loading No Yes Yes No Yes Large Large Large Small Large Shortoverhead path padding No Yes Yes Yes No CombinationalMetastability Yes Yes Yes No No Large Small Large Small Small Sequentialoverhead overhead Large Large Large Small Large CombinationalArea overhead overhead LargeYes SmallYes LargeYes Yes Small No Small AreaMargin overhead Yes Yes Yes Yes No Small Up to Tck/2 Tck/2 Small Tck/2 Marginrelaxation relaxation Small Up to Tck/2 Tck/2 Small Tck/2 MSMS = =Meta Meta Stability, Stability, Tck Tck = = clock clock period. period.

Lost performance/ Area/power Lost performance/ savings Actual circuit delay energy Actual circuit delay energy

Clock Clock

Process Process Voltage Voltage Temperature Aging Temperature Aging (a) (b)

FigureFigure 3.3. ComparisonComparison ofof designdesign marginingmargining techniques:techniques: ((aa)) Worst-CaseWorst-Case DesignDesign (WCD)(WCD) marginingmargining andand ((bb)) OpportunisticOpportunistic Better-Than-Worst-CaseBetter-Than-Worst-Case (OBTWC) (OBTWC) design design margining. margining.

The proposed approach, which is an extension of our previous work [35], leverages the The proposed approach, which is an extension of our previous work [35], leverages the underutilized slack present in processor pipelines after the tool-based optimizations. We use Static underutilized slack present in processor pipelines after the tool-based optimizations. We use Static Timing Analysis (STA) to look for near critical endpoints with sufficient consecutive slack after Timing Analysis (STA) to look for near critical endpoints with sufficient consecutive slack after placement and logic optimizations. The critical endpoints are then back annotated to the synthesis placement and logic optimizations. The critical endpoints are then back annotated to the synthesis tool, and we relax the timing margins of all the paths in the critical stage proportionate to the available tool, and we relax the timing margins of all the paths in the critical stage proportionate to the available slack. After relaxing the path slacks, we resynthesize the critical fan-in logic and replace the endpoints slack. After relaxing the path slacks, we resynthesize the critical fan-in logic and replace the endpoints with Slack Balancing Flip-Flops (SBFF). SBFFs are designed to sample data at a delayed clock edge. with Slack Balancing Flip-Flops (SBFF). SBFFs are designed to sample data at a delayed clock edge. Re-synthesis with relaxed timing margins delays the long paths and pushes them towards the SBFF Re-synthesis with relaxed timing margins delays the long paths and pushes them towards the SBFF delayed clock edge. The delayed data still maintain safe design margins with respect to the delayed delayed clock edge. The delayed data still maintain safe design margins with respect to the delayed clock edge of the SBFF. This removes metastability issues prevalent in the speculative flip-flop clock edge of the SBFF. This removes metastability issues prevalent in the speculative flip-flop pipelines. pipelines. The methodology relies on the critical wall of slack issues from the data path. Unlike the The methodology relies on the critical wall of slack issues from the data path. Unlike the in-situ in-situ techniques, the short paths converging to the SBFF also get delayed, which removes the hold techniques, the short paths converging to the SBFF also get delayed, which removes the hold buffer buffer overheads resulting from delayed data sampling. Re-synthesis downsizes and restructures the overheads resulting from delayed data sampling. Re-synthesis downsizes and restructures the logic logic gates in the critical fan-in stage, which gives significant power and area savings. This results in gates in the critical fan-in stage, which gives significant power and area savings. This results in the the optimal power-aware design of the processor pipeline from the ground up without the failure optimal power-aware design of the processor pipeline from the ground up without the failure risks risks of runtime speculative operating point tuning. They are not limited by the gate sizing and skew of runtime speculative operating point tuning. They are not limited by the gate sizing and skew optimization constraints in typical combinatorial optimizations, which results in the maximum slack optimization constraints in typical combinatorial optimizations, which results in the maximum slack power trade-off along the processor pipeline. The remainder of this paper is organized as follows. power trade-off along the processor pipeline. The remainder of this paper is organized as follows. Section 2 presents the motivation to use the proposed technique. Section 3 presents the proposed Section2 presents the motivation to use the proposed technique. Section3 presents the proposed design methodology. Section4 describes our proposed power optimization techniques, and experimental results are illustrated in Section5. Finally, Section6 concludes the proposed work. J. Low Power Electron. Appl. 2018, 8, 9 5 of 18

2. Motivation Conventional design margining approaches are based on dynamic operating point tuning and in situ error correction to trade off power and reliability. The operating conditions are tuned adaptively till Point of First Failure (PoFF), which makes the near critical paths also timing critical. Figure3 shows the effect of voltage scaling on the slack histogram of critical paths in a processor pipeline. Figure4a shows the baseline designed for the worst case with a Worst Negative Slack (WNS) of 10 ps. However, with the voltage scaling of 20 mV, 312 paths have negative slack, and the WNS becomes −600 ps (Figure4b). Similarly, 985 paths have negative slack and the WNS becomes −700 ps when the supply voltage is scaled by 40 mV (Figure4c). It is obvious that the paths with negative slack and the WNS increases with voltage scaling. The newly-created critical paths also need to be error resilient, which creates huge overheads in designs. Figure5 plots the Total Negative Slack (TNS) of the processor pipeline against the voltage scaling steps. Voltage scaling down to Va does not increase TNS. TNS slightly increases between Va and Vb, but can be managed using in situ error correction techniques. Beyond Vb, TNS increases rapidly, and the increased number of critical paths makes the error correction cost enormous. In the event of high error rates, the system frequency has to be halved to recover from errors. Traditional timing-driven optimizations focus on the combinational logic and do not support any tradeoffs between the logic paths separated by sequential elements. This results in a slack imbalance in the pipeline stages. Based on our timing analysis of a delay-optimized processor pipeline, we found that there is significant slack rebalance opportunity between the critical pipe_1 paths and the consecutive pipe_2 paths, as shown in Figure6. We leverage this slack to relax and resynthesize the critical fan-in logic and sample the resulting delayed inputs using a slack balancing flip-flop. Instead of reducing design margins at runtime using adaptive techniques, the proposed approach downsizes the critical fan-in logic of the pipeline for power savings. Worst-case design margins are still met with respect to the delayed clock edge of the slack balancing flip-flop, which helps to eliminate metastability issue faced by adaptive in situ error correction schemes. Moreover, we also relax the short paths by the same design margin, which eliminates the need for SBFF-induced short path buffer insertions later in the design flow. In contrast to the existing slack redistribution techniques [21–27], we insert intentional clock delays to fully balance the available slack at the critical endpoints and redesign the whole critical logic stage with the relaxed time constraints. This results in a pipeline optimized for low power without the need for runtime voltage speculation, which is vulnerable to critical operating point behavior. The proposed method designs the processor to handle worst-case PVT margins. Logic downsizing pushes the critical data towards the delayed clock edge, which is timed to maintain worst-case margins with respect to the delayed clock edge. This makes the design resistant to static and dynamic variations similar to traditional worst-case corner designs, but with lower power and area. Razor and TIMBER depend on runtime voltage and frequency scaling to reclaim the design margins associated with dynamic variations. Therefore, they require error handling circuitry and need to tune the voltage or frequency back to the worst case operating corner depending on the error rate. There is a significant risk and architectural latency involved in this runtime adaption, especially when the error rate is high. J. Low Power Electron. Appl. 2018, 8, 9 6 of 18 J. Low Power Electron. Appl. 2018,, 7,, xx 6 of 18

300 120 300 120 400 J. Low Power Electron. Appl. 2018, 7, x 6 of 18 Slacklack histogramhistogram Slacklack histogramhistogram withwith 350 Slacklack histogramhistogram withwith 250 350 250 100100 at Vworstcase 20mV voltage scaling 40mV voltage scaling 300 300 120 400 200 80 200 80 250 Slack histogram Slack histogram with 350 250 Slack histogram with 250 100 at Vworstcase 20mV voltage scaling 40mV voltage scaling 150 200 150 6060 300 200

of paths 200 of paths 80 of paths of paths of paths 250 of paths 150

# 100 # # # # # 100 40

40 200 150 60 100

50 of paths of paths of paths 50 2020 150 50 # 100 # # 50

40 100 0 0 0 00 0 50 10 20 30 40 50 60 70 80 90 100 20 --700 --600 --500 --400 --300 --200 --100 0 100 0 10 20 30 40 50 60 70 80 90 100 --600600 --500500 --400400 --300300 --200200 --100100 00 100100 50 Slack (ps) Slack (ps) Slack (ps) Slack (ps) Slack (ps) 0 Slack (ps) 0 0 0 10 20 30 (a40) 50 60 70 80 90 100 (b) -700 -600 -500 -400 -300 -(200c) -100 0 100 (a) -600 -500 -400 -300(b) -200 -100 0 100 (c) Slack (ps) Slack (ps) Slack (ps) (a) (b) (c) FigureFigure 4. Slack histogramhistogram of of the the processor processor pipeline pipeline (a) ( ina) the in worstthe worst case, case, (b) with (b) 20-mVwith 20 voltage-mV voltage scaling scalingandFigure (c) and with 4. ( c 40-mVSlack) with histogram 40 voltage-mV voltage scaling.of the processorscaling. pipeline (a) in the worst case, (b) with 20-mV voltage scaling and (c) with 40-mV voltage scaling.

450 450 400 400 350 350 ) ) 300 )

s 300 s s n n n

( 250

( 250

( 250

200200 TNS TNS TNS 150150

100100 50 vvaa vb vb 50 va vb 0 0 10 15 20 25 30 35 40 10 Voltage15 scaling20 25 step 30(mV) 35 40 Voltage scaling step (mV)

Figure 5. Cost of resilience and voltage scaling trade-off. FigureFigure 5. 5. CostCost of of resilience resilience and and voltage voltage scaling scaling trade trade-off.-off.

35 2.5 120 Decode stage Execute stage 30 Fetch stage 2.5 120 3535 2.5 100 120 2 Decode stage Execute stage 30 25 Fetch stage Decode stage Execute stage 30 Fetch stage 80 100 2 100 12.5 20 25 25 60 paths paths paths 80 80 15 1.5 of of 20 of 1.51 20 40 # # # 60 paths paths paths 10 60 paths paths paths 15 0.5 of of of 15 1 20 of of of 1 5 40 # # # 40 # # # 1010 0 0 0 0.5 0 10 20 30 40 50 60 70 80 90 100 0.5 0 10 20 30 40 50 60 70 80 90 100 0 20 10 20 30 40 50 60 70 80 90 100 20 55 Pipe1_slack (ps) Pipe1_slack (ps) Pipe1_slack (ps) 0 0 00 0 0 45 14 0 10 20 30 40 50 60 70 80 90 100 200 0 10 20 30 40 50 60 70 80 90 100 00 1010 2020 3030 4040 5050 6060 7070 8080 9090 100100 0 10 20 30 40 50 60 70 80 90 100 0 10 20 30 40 50 60 70 80 90 100 40 Fetch stage Decode stage 180 Execute stage Pipe1_slack (ps) 12 Pipe1_slack (ps) Pipeipe1_slack (ps) 35 Pipe1_slack (ps) Pipe1_slack (ps) 160 45 1410 200 45 30 14 140 200 40 40 Fetch stage 120 180 Execute stage 25 Fetch stage 128 Decode stage 180 Execute stage 12 Decode stage paths 35 paths paths 100 160 35 20 160 6 of of 10 of 10 80 140 30 15 140 # 30 # # 4 60 120 25 10 8 120 25 8 40 paths paths 2 paths 100 paths paths 20 5 paths 100 20 6 20 of of 6 of 80 of of 0 of 15 0 0 80 # # 15 0 500 1000 1500 2000 2500 3000 3500 4000 4500 # # # 0 500 1000 1500 2000 2500 3000 3500 4000 4500 # 4 0 60 500 1000 1500 2000 2500 3000 3500 4000 4500 4 60 10 10 Pipe2_slack (ps) Pipe2_slack (ps) 40 40 Pipe2_slack (ps) 2 5 2 5 20 20 0 0 0 0 0 500 1000 1500 2000 Critical2500 3000 3500path4000 slacks4500 (pipe_1)0 and consecutive stage slacks (pipe_2)0 of the processor. 0 500 Figure1000 1500 20006. 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 0 500 1000 1500 2000 2500 3000 3500 4000 4500 Figure 6. Critical path slacks (pipe_1)0 500 and1000 consecutive1500 2000 2500 3000 stage3500 4000 slacks4500 (pipe_2)0 500 of1000 the1500 processor.2000 2500 3000 3500 4000 4500 Pipe2_slack (ps) Pipe2_slack (ps) Pipe2_slack (ps) Pipe2_slack (ps) Pipeipe2_slack (ps) 3. Proposed Variation Tolerant Pipeline Design 3. Proposed Variation Tolerant Pipeline Design Figure 6. Critical path slacks (pipe_1) and consecutive stage slacks (pipe_2) of the processor. 3.1. Slack Balancing Principle 3.1. Slack Balancing Principle 3. ProposedTo explain Variation slack Tolerant balancing, Pipeline we consider Design a small circuit as shown in Figure 7 with four registers. TheTo explaintiming graph slack shows balancing, nodes we corresponding consider a small to the circuit registers as shownand edges in Figure corresponding7 with four to registers.the 3.1.The Slackcombinational timing Balancing graph paths. Principle shows Thenodes maximum corresponding delay for each to combinational the registers path and is edgesshown correspondingbeside the edges. to the combinationalThis circuit is paths. optimized The for maximum a minimum delay feasible for eachclock combinationalperiod T = 11. The path combinational is shown beside path (d, the b) edges. To explain slack balancing, we consider a small circuit as shown in Figure 7 with four registers. This circuit is optimized for a minimum feasible clock period T = 11. The combinational path (d, b) has The timing graph shows nodes corresponding to the registers and edges corresponding to the combinational paths. The maximum delay for each combinational path is shown beside the edges. This circuit is optimized for a minimum feasible clock period T = 11. The combinational path (d, b)

J. Low Power Electron. Appl. 2018, 8, 9 7 of 18 J. Low Power Electron. Appl. 2018, 7, x 7 of 18 J. Low Power Electron. Appl. 2018, 7, x 7 of 18

has a maximum delay of 11. In this critical path, time can be borrowed from the consecutive ahas maximum a maximum delay delay of 11. of In this11. In critical this path,critical time path, can time be borrowed can be borrowed from the consecutive from the consecutive stage (b, d) stage (b, d) whose delay is nine. We replace the flip-flop b by the SBFF with a clock delay equal to the whosestage (b, delay d) whose is nine. delay We replaceis nine. theWe flip-flop replace bthe by flip the-flop SBFF b withby the a clockSBFF delaywith a equal clock to delay the slack equal present to the slack present in the consecutive stage (b, d), which is two. Now, for the same clock period T = 11, we inslack the present consecutive in the stage consecutive (b, d), which stage is (b, two. d), which Now, foris two. the same Now, clock for the period same T clock = 11, period we have T an= 11, extra we have an extra margin of two in the fan-in paths (a, b) and (d, b). The extra margin is used to relax and marginhave an ofextra two margin in the fan-in of two paths in the (a, fan b)-in and paths (d, b). (a, Theb) an extrad (d, margin b). The isextra used margin to relax is andused downsize to relax and the downsize the logic in the fan-in paths, making sure it does not create other timing violations. This logicdownsize in the the fan-in logic paths, in the making fan-in paths, sure it making does not sure create it does other not timing create violations. other timing This violations. procedure This also procedure also delays the short paths along the critical fan-in logic, thus eliminating the need for delaysprocedure the shortalso delays paths alongthe short the criticalpaths along fan-in the logic, critical thus fan eliminating-in logic, thethus need eliminating for additional the need buffers for additional buffers compared to other timing speculation techniques. comparedadditional tobuffers other compared timing speculation to other timing techniques. speculation techniques.

6 a 6 b a b 4 7 5 6 4 5 2 7 a 5 6 b 5 d 2 a b d 7 9 9 7 11 9 9 9 11 9 9 9 c 5 c 5 c d c d 9 9 Figure 7. A circuit example and timing graph to explain slack balancing. Figure 7. A circuit example and timing graph to explain slack balancing. We compare the proposed technique with existing speculative error correction schemes. We compare the proposed technique with existing speculative error correction schemes. FigureWe 8a compare shows conceptually the proposed how technique the slack margins with existing are relaxed speculative in a typical error voltage correction scaling schemes.-based Figure 8a shows conceptually how the slack margins are relaxed in a typical voltage scaling-based Figureerror resilience8a shows scheme. conceptually We consider how the four slack timing margins paths are P1, relaxedP2, P3 and in aP4 typical with different voltage scaling-basedslack margins. error resilience scheme. We consider four timing paths P1, P2, P3 and P4 with different slack margins. errorHere, resilience we assume scheme. P3 to be We a consider part of a four non timing-critical paths fan-in P1, logic P2, P3and and P1, P4P2 with and differentP4 in the slackcritical margins. fan-in Here, we assume P3 to be a part of a non-critical fan-in logic and P1, P2 and P4 in the critical fan-in Here,logic with we assume P1 being P3 toa critical be a part short of a path. non-critical Energy fan-in can be logic reduced and P1, by P2 scaling and P4 the in supply the critical voltage, fan-in which logic logic with P1 being a critical short path. Energy can be reduced by scaling the supply voltage, which withincreases P1 being the adelay critical along short these path. paths. Energy The can timing be reduced speculation by scaling window the supply for a typical voltage, speculative which increases error increases the delay along these paths. The timing speculation window for a typical speculative error theresi delaylience alongscheme these is T paths.clk/2. Therefore, The timing the speculation paths are windowallowed forto relax a typical the speculativedelay margins error by resilience Tclk/2, as resilience scheme is Tclk/2. Therefore, the paths are allowed to relax the delay margins by Tclk/2, as schemeshown in is TFigureclk/2. Therefore,8b. Voltage the scaling paths is are done allowed beyond to relaxPoFF the until delay the marginscritical operating by Tclk/2, voltage as shown near in shown in Figure 8b. Voltage scaling is done beyond PoFF until the critical operating voltage near Figurecritical8 b.path Voltage P3 also scaling becomes is done timing beyond critical PoFF untilas depicted the critical in Figure operating 8c. voltageMoreover, near runtime critical pathvoltage P3 critical path P3 also becomes timing critical as depicted in Figure 8c. Moreover, runtime voltage alsoscaling becomes causes timing the inputs critical to change as depicted close in to Figure the clock8c. Moreover,edges, creating runtime metastability voltage scaling issues. causes They thealso scaling causes the inputs to change close to the clock edges, creating metastability issues. They also inputsrequire to additional change close buffers to the along clock P1 edges, to ensure creating that it metastability will not corrupt issues. the They delayed also data require and additional hence the require additional buffers along P1 to ensure that it will not corrupt the delayed data and hence the bufferserror genera alongted. P1 Even to ensure if timing that speculation it will not corrupt is not done the delayedtill Tclk/2, data we still and need hence to theinsert error buffers generated. for an error generated. Even if timing speculation is not done till Tclk/2, we still need to insert buffers for an Evenequivalent if timing delay speculation of Tclk/2 to isensure not done that the till Terrorclk/2, signal we still is not need corrupted. to insert Unless buffers we for use an a equivalent fine-grain equivalent delay of Tclk/2 to ensure that the error signal is not corrupted. Unless we use a fine-grain delayvoltage of Tscaling,clk/2 to ensureeven thatnon- thecritical error paths signal islike not P3 corrupted. will be Unlesscome wecritical. use a fine-grainFurthermore, voltage for voltage scaling, even non-critical paths like P3 will become critical. Furthermore, for scaling,performance even-driven non-critical design paths margin like P3relaxation will become techniques, critical. the Furthermore, slack relaxations for performance-driven are suboptimal as performance-driven design margin relaxation techniques, the slack relaxations are suboptimal as designshown marginin Figure relaxation 9b. Frequency techniques, scaling the results slack in relaxations new critical are path suboptimal formations as shown in P3, inas Figureshown9 b.in shown in Figure 9b. Frequency scaling results in new critical path formations in P3, as shown in FrequencyFigure 9c, which scaling makes results the in sca newling critical bounded. path formations in P3, as shown in Figure9c, which makes Figure 9c, which makes the scaling bounded. the scaling bounded.

Error correction FF allows Voltage scaled till Slack margin Error correction FF allows Voltage scaled till Slack margin speculation by Tclk/2 critical operating point speculation by Tclk/2 critical operating point P1 P1 P2 P2 P3 new critical path P3 new critical path P4 P4 unused slack unusedmargin slack (a) (b) (c) margin (a) (b) (c) Figure 8. Speculative power-aware flip-flop: (a) slack margin at sign off, (b) slack margin relaxed with Figure 8. Speculative power power-aware-aware flip-flop:flip-flop: (a) slack margin at sign off, off; (b) slack margin relaxed with timing speculation flip-flop and (c) slack margin traded off with power. P1, Path 1. timing speculationspeculation flip-flopflip-flop andand ((cc)) slackslack marginmargin tradedtraded offoff withwith power.power. P1,P1, PathPath 1.1.

J. Low Power Electron. Appl. 2018, 7, x 8 of 18 J.J. Low Low Power Power Electron. Electron. Appl. Appl.2018 2018, 8, ,7 9, x 88 of of 18 18

clock skew and gate size clock skew and gate size Frequency scaled till Slack margin tuning ; slack relaxed < Frequency scaled till Slack margin tuning ; slack relaxed < critical operating point Tclk/2 critical operating point Tclk/2 P1 P1 P2 P2 P3 new critical path P3 new critical path P4 P4 unused slack unusedmargin slack (a) (b) (c) (a) (b) margin (c) Figure 9. Performance-aware optimizations: (a) slack margin at sign off, (b) slack margin relaxed with Figure 9. Performance-aware optimizations: (a) slack margin at sign off, (b) slack margin relaxed with Figureclock skew 9. Performance-aware scheduling and gate optimizations: sizing and (ac) slack marginmargin attraded sign off;off with (b) slack frequency. margin relaxed with clockclock skew skew scheduling scheduling and and gate gate sizing sizing and and (c ()c slack) slack margin margin traded traded off off with with frequency. frequency. In our proposed technique, we use static time borrowing to trade off slack margin for power and In our proposed technique, we use static time borrowing to trade off slack margin for power and areaIn gains. our proposedFigure 10a technique, shows the weslack use margins static timeof the borrowing baseline worst to trade-case off design. slack We margin look for positive power area gains. Figure 10a shows the slack margins of the baseline worst-case design. We look for positive andslack area in gains.the pipeline Figure stages 10a shows and replace the slack themargins critical endpoints of the baseline with SBFFs, worst-case which design. borrows We a look time for TB slack in the pipeline stages and replace the critical endpoints with SBFFs, which borrows a time TB positiveproportional slack into the available pipeline stagesslacks as and depicted replace in the Figure critical 10b. endpoints As a result, with the SBFFs, critical which fan-in borrowspaths P1, proportional to the available slacks as depicted in Figure 10b. As a result, the critical fan-in paths P1, aP2 time and TB P4 proportional are relaxed toby the TB. available We re-synthesize slacks as depictedthe pipeline in Figure to downsize 10b. As the a result, fan-in the logic critical of SBFF fan-in for P2 and P4 are relaxed by TB. We re-synthesize the pipeline to downsize the fan-in logic of SBFF for pathspowerP1, and P2 andarea P4reductionsare relaxed (Figure by TB. 10c). We Afte re-synthesizer resizing the pipelinelogic, timing to downsize closure is the done fan-in on logicthe data of power and area reductions (Figure 10c). After resizing the logic, timing closure is done on the data SBFFpaths for with power respect and area to reductionsthe delayed (Figure clock 10 edgesc). After of resizingSBFF. Our the logic,approach timing is closuredeterministic is done onand paths with respect to the delayed clock edges of SBFF. Our approach is deterministic and thenon data-speculative, paths with unlike respect the to usual the delayeddynamic clock operating edges point of SBFF. scaling Our techniques. approach is Safe deterministic design margins and non-speculative, unlike the usual dynamic operating point scaling techniques. Safe design margins non-speculative,are kept with respect unlike to the the usual delayed dynamic clock operatingedge of the point SBFF, scaling which techniques. prevents metastability Safe designmargins issues in are kept with respect to the delayed clock edge of the SBFF, which prevents metastability issues in arethe kept pipeline. with respectUnlike toconventional the delayed resilience clock edge schemes, of the SBFF,the short which path prevents P1 is metastabilityalso delayed issuesby the the pipeline. Unlike conventional resilience schemes, the short path P1 is also delayed by the incorresponding the pipeline. TB Unlike window conventional TB1. This resilienceeliminates schemes, the need thefor shortadditional path sho P1rt is path also delayedbuffers along by the P1 corresponding TB window TB1. This eliminates the need for additional short path buffers along P1 correspondingto prevent it from TB window corrupting TB1. the This delayed eliminates data. theNote need that for the additional time borrow short window path buffers TB is fully along utilized P1 to to prevent it from corrupting the delayed data. Note that the time borrow window TB is fully utilized preventfor power it from and corrupting area reduction the delayed compared data. Note to other that the optimization time borrow techniques. window TB Unlike is fully utilizedclock skew for for power and area reduction compared to other optimization techniques. Unlike clock skew powerscheduling, and area which reduction is mostly compared used tofor other performance optimization enhancements, techniques. Unlikethe proposed clock skew method scheduling, targets scheduling, which is mostly used for performance enhancements, the proposed method targets whichcritical is mostlylogic power used forminimization performance by enhancements, utilizing maximum the proposed available method slack to targets downsize critical all logic logic power paths critical logic power minimization by utilizing maximum available slack to downsize all logic paths minimizationin the critical by stage. utilizing maximum available slack to downsize all logic paths in the critical stage. in the critical stage.

SBFF relaxes slack SBFF relaxes slack Slack margin margin by available slack Slack margin used to margin by available slack Slack margin used to Slack margin up to Tclk/2 downsize logic up to Tclk/2 downsize logic P1 TB1 P1 TB1 P2 TB3 P2 TB3 P3 P3 P4 TB4 P4 TB4 (a) (b) (c) (a) (b) (c) Figure 10. Proposed SBFF: (a) slack margin at sign off, (b) slack margin relaxed with SBFF and FigureFigure 10. 10.Proposed Proposed SBFF: SBFF: (a) slack(a) slack margin margin at sign at off,sign (b )off, slack (b) margin slack margin relaxed withrelaxed SBFF with and SBFF (c) slack and (c) slack margin used to downsize logic. margin(c) slack used margin to downsize used to logic.downsize logic. 3.2. Slack Balancing Flip-Flop 3.2.3.2. Slack Slack Balancing Balancing Flip-Flop Flip-Flop The SBFF and the timing diagram are illustrated in Figure 11. SBFF is a simplified version of TheThe SBFF SBFF and and the the timing timing diagram diagram are are illustrated illustrated in in Figure Figure 11 11.. SBFF SBFF is is a a simplified simplified version version of of TIMBER [17] without any error propagation logic. SBFF consists of the main latch, a shadow latch TIMBERTIMBER [17 [17]] without without any any error error propagation propagation logic. logic. SBFF SBFF consists consists of the of mainthe main latch, latch, a shadow a shadow latch andlatch and a clock control circuit. The data path has a main latch and a slave latch forming an edge-triggered aand clock a clock control control circuit. circuit. The The data data path path has has a main a main latch latch and and a slave a slave latch latch forming forming an an edge-triggered edge-triggered master-slave flip-flop. The proposed optimization technique downsizes the SBFF input logic for master-slavemaster-slave flip-flop. flip-flop. The The proposed proposed optimization optimization technique technique downsizes downsizes the SBFF the inputSBFF logicinput for logic power for power reduction. This delays the input data signal DATA, which is detected by the shadow latch reduction.power reduction. This delays This thedelays input the data input signal data DATA, signal DATA, whichis which detected is detected by the shadowby the shadow latch using latch using a Delayed Clock (DCK). The delay amount decides the design margin improvement. The input ausing Delayed a Delayed Clock (DCK).Clock (DCK). The delay The amountdelay amount decides decides the design the design margin margin improvement. improvement. The The input input to to the slave latch is switched between the master path and the shadow path using the control signals theto slavethe slave latch latch is switched is switched between between the the master master path path and and the the shadow shadow path path using using the the control control signals signals (P0 and P1) to ensure that the delayed data pass through the slave latch. As shown in Figure 11, the (P0(P0 and and P1) P1) to to ensureensure thatthat thethe delayed data data pass pass through through the the slave slave latch. latch. As As shown shown in inFigure Figure 11, 11 the, main latch samples the input data and transfers them to the output (Q) at the positive Clock (CK) themain main latch latch samples samples the the input input data data and and transfers transfers them them to to the the output (Q) at thethe positivepositive ClockClock (CK) (CK) edge when TG0 is closed and TG1 is open. Similarly, the slave latch samples the data and passes them edgeedge when when TG0TG0 isis closed an andd TG1 TG1 is is open. open. Similarly, Similarly, the the slave slave latch latch samples samples the thedata data and andpasses passes them to the output at the Delayed Clock edge (DCK) when TG1 is closed and TG0 is open. The data in the to the output at the Delayed Clock edge (DCK) when TG1 is closed and TG0 is open. The data in the

J. Low Power Electron. Appl. 2018, 8, 9 9 of 18 J.J. Low Low Power Power Electron. Electron. Appl. Appl. 2018 2018, 7, ,7 x, x 9 9of of 18 18 J. Low Power Electron. Appl. 2018, 7, x 9 of 18 themmastermaster to latch thelatch output are are compared compared at the Delayed with with those those Clock in in edgethe the slave (DCK)slave latch latch when by by TG1an an XOR XOR is closed gate. gate. andThe The TG0xor xor gat isgat open.ee output output The signal signal data master latch are compared with those in the slave latch by an XOR gate. The xor gate output signal inXOR_OUTXOR_OUT the master becomes becomes latch are ‘1’ ‘1’ compared when when they they with are are thosedifferent. different. in the Figure Figure slave 12 12 latch shows shows by anthe the XOR simulation simulation gate. The results results xor gateof of the the output SBFF. SBFF. XOR_OUT becomes ‘1’ when they are different. Figure 12 shows the simulation results of the SBFF. signalXOR_OUTXOR_OUT XOR_OUT is is sampled sampled becomes by by P1 P1 ‘1’ to to when filter filter theyany any glitches areglitches different. and and fake fake Figure transitions transitions 12 shows in in thethe SBFF simulationSBFF and and generates generates results of a a XOR_OUT is sampled by P1 to filter any glitches and fake transitions in the SBFF and generates a thesampledsampled SBFF. signal XOR_OUTsignal XOR_OUT_SAMPLED. XOR_OUT_SAMPLED. is sampled by P1 It toIt shows shows filter anythat that glitchesthe the delayed delayed and DATA DATA fake transitions are are detected detected in by theby the the SBFF delayed delayed and sampled signal XOR_OUT_SAMPLED. It shows that the delayed DATA are detected by the delayed generatesclockclock DCK DCK a sampledof of the the SBFF. signalSBFF. The XOR_OUT_SAMPLED.The XOR XOR output output is is also also It used showsused to to that generate generate the delayed the the monitor monitor DATA aresignal signal detected ACTIVITY ACTIVITY by the clock DCK of the SBFF. The XOR output is also used to generate the monitor signal ACTIVITY delayedMONITOR.MONITOR. clock Depending Depending DCK of the on SBFF.on whether whether The XOR a a data data output transition transition is also happens happens used to generatein in the the clock clock the cycle, monitorcycle, the the signalmaster master ACTIVITY latch latch and and MONITOR. Depending on whether a data transition happens in the clock cycle, the master latch and MONITOR.shadowshadow latch latch Depending will will sampl sampl onee the whetherthe same same value a value data or transitionor a a different different happens value. value. inThis This the will will clock be be cycle,handy handy the in in masterthe the pre pre- latchsilicon-silicon and or or shadow latch will sample the same value or a different value. This will be handy in the pre-silicon or shadowpostpost-silicon-silicon latch stage stage will sampleto to disable disable the the the same time time value borrowing borrowing or a different in in the the value.event event of Thisof low low will data data be activity handyactivity inrates, rates, the pre-siliconwhich which results results or post-silicon stage to disable the time borrowing in the event of low data activity rates, which results post-siliconinin further further power power stage toreduction. reduction. disable the In In timeFigure Figure borrowing 13a, 13a, the the monitor in monitor the event signal signal of lowis is trigger trigger data activityeded only only rates, once, once, which which which results shows shows in a a in further power reduction. In Figure 13a, the monitor signal is triggered only once, which shows a furtherlowlow data data power rate. rate. reduction.The The monitor monitor In signal Figuresignal is 13is triggered a,triggered the monitor continuously continuously signal is in triggeredin all all the the clock clock only cycles once,cycles whichas as shown shown shows in in Figure aFigure low low data rate. The monitor signal is triggered continuously in all the clock cycles as shown in Figure data13b13b for rate.for high high The data data monitor rates. rates. signal is triggered continuously in all the clock cycles as shown in Figure 13b 13bfor highfor high data data rates. rates.

MainMain Latch Latch SlaveSlave Latch Latch Main LatchCK SlaveCK LatchCK CK L0L0 CK R R CKCK CK L0 P0 R CK CK P0 L0L drives0 drives input input to to output output CK CK DATA P0 CK DATA CK QQ L0 drives input toL0 output samples TGTG0 0 CK L0 samples DATA Q CK DATA TG0 CK L0 samplesDATA CKCK P0P0 XOR_OUT CK DATA TG1 open R XOR_OUT TG1 open R CK P0 XOR_OUT DCK P1P1 L1L samples1 samples TG1 open R DCK ACTIVITYACTIVITY DCK P1 DCK L1 samplesDATADATA DCK ACTIVITYMONITORMONITOR TGTG1 1 P1 DCK DATA TGTG2 2 P1 MONITOR L0: Main Latch TG1 L0: Main Latch TG2 DCK P1 L1: Slave Latch TG0 open L1 drives input to output DCK P1P1 L0:L Main1: Slave Latch Latch TG0 open L1 drives input to output P0P0 P0P0 DCKDCK DCK P1 L1: Slave Latch TG0 open L1 drives input to output L1 P0 P0 DCK DCKDCK L1 ENEN CKCK DCK L1 DCKENDCK Shadow Latch CK Shadow Latch P1 P1 Shadow Latch DCK P1 P1 P1 P1 (a()a) (b()b) (a) (b) FigureFigure 11. 11. ( a(a) )SBFF SBFF and and ( b(b) )timing timing diagram. diagram. DCK, DCK, Delayed Delayed Clock. Clock. Figure 11. ((aa)) SBFF SBFF and and ( (b)) timing timing diagram. DCK, DCK, Delayed Delayed Clock. Clock.

DelayedDelayed data data detection DATADATA Delayeddetection data DATA detection QQ Q CKCK DelayedDelayed clock clock CK Delayededgeedge clock edge DCKDCK DCK LATCHLATCH00 LATCH0 LATCHLATCH11 LATCH1 GGlitchlitch XORXOR_OUT_OUT Glitch XOR_OUT XOR_OUT XOR_OUT Fake XOR_SAMPLED_OUT Fake _SAMPLED transitiontransitionFake _SAMPLED transition TIMETIME ( a(a.u.u.).) TIME (a.u.)

FigureFigure 12. 12. SBFF SBFF simulation simulation result. result. Figure 12. SBFFSBFF simulation simulation result.

CKCK CKCK CK CK DCKDCK DCKDCK DCK DCK DD DD D D ACTIVITYACTIVITY ACTIVITYACTIVITY ACTIVITYMONITORMONITOR ACTIVITYMONITORMONITOR MONITOR MONITOR (a()a) (b()b) (a) (b) FigureFigureFigure 13.13. 13. DataData Data activityactivity activity monitoring monitoring monitoring in in SBFF.in SBFF. SBFF. (a )( a Intermittent(a) )Intermittent Intermittent triggering triggering triggering of the of of the activity the activity activity monitor monitor monitor with with lowwith Figure 13. Data activity monitoring in SBFF. (a) Intermittent triggering of the activity monitor with datalowlow data activity data activity activity and (andb and) continuous ( b(b) )continuous continuous triggering triggering triggering of activity of of activity activity monitor monitor monitor with with high with high data high activity.data data activity. activity. low data activity and (b) continuous triggering of activity monitor with high data activity.

J. Low Power Electron. Appl. 2018, 8, 9 10 of 18

3.3. PipelineJ. Low Power Design Electron. Flow Appl. Using2018, 7, x SBFF 10 of 18 We use standard CAD tools and custom add-on scripts for SBFF insertion and logic optimization 3.3. Pipeline Design Flow Using SBFF for the proposed pipeline design. A standard cell library with 20 different flavors of SBFFs was developedWe using use standard library CAD characterization tools and custom tools add to-on replace scripts the for SBFF normal insertion flip-flops. and logic The optimization library has four sets offor non-scan the proposed flip-flops pipeline with design. two different A standard drive cell strengths library with and 20 four different sets of flavors scan flip-flops of SBFFs with was three developed using library characterization tools to replace the normal flip-. The library has four different drive strengths. Custom scripts are used to enable slack analysis, time budgeting and SBFF sets of non-scan flip-flops with two different drive strengths and four sets of scan flip-flops with three replacement.different drive Figure strengths. 14 shows Custom the proposed scripts are design used to flow enable using slack SBFF. analysis, It starts time withbudgeting filtering and theSBFF critical pathsreplacement based on. post-placement Figure 14 shows the and proposed optimization design flow STA using results. SBFF. In It this starts work, with filtering we consider the critical ~10% of the totalpaths flip-flops based on thatpost- areplacement most critical, and optimization whose slack STA isresults. less than In this 2% work, of the we clock consider period. ~10% Inof thethe next step,total we search flip-flops for that the are critical most endpointscritical, whose with slack sufficient is less than consecutive 2% of the slacks.clock period. These In endpoints the next step, are back annotatedwe search to the for synthesis the critical engine. endpoints The with timing sufficient for all theconsecutive critical fan-inslacks. paths These is endpoints relaxed proportionateare back to theannotated available to slack.the synthesis The pipeline engine. isThe then timing resynthesized for all the critical with fan the-in new paths timing is relaxed constraints. proportionate We then replaceto the the ava endpointsilable slack. by SBFFsThe pipeline and run is then the placementresynthesized and with clock the tree.new timing The clock constraints. delays forWe thethen SBFFs are insertedreplace the after endpoints the clock by tree SBFFs is built.and run This the helps placement to preserve and clock the tree. traditional The clock tool-based delays for the optimizations. SBFFs are inserted after the clock tree is built. This helps to preserve the traditional tool-based optimizations. We limit the inserted clock delay to TB1 = Tclk/8, TB2 = 2Tclk/8, TB3 = 3Tclk/8 and TB4 = 4Tclk/8 for We limit the inserted clock delay to TB1 = Tclk/8, TB2 = 2Tclk/8, TB3 = 3Tclk/8 and TB4 = 4Tclk/8 for simplicity.simplicity. The The pipeline pipeline is thenis then signed signed off, off, keeping keeping worst-caseworst-case design design margins. margins.

Filter critical Report RTL & Place & optimize Synthesize Report slack endpoints with consecutive slack constraints design slack < Th of all endpoints

Relax stage1 logic Filter endpoints Synthesize again of the endpoints with stage2 slack > by TB TB

Replace critical endpoints with SBFF

SBFF insertion and timing Place & optimize correction design

Clock tree synthesis

Delay insertion for SBFFs

Route &sign off

end

Figure 14. Design flow using SBFF. Figure 14. Design flow using SBFF. 3.4. Metastability and Hold Issues 3.4. Metastability and Hold Issues Speculative techniques relax max delay and constrain min delay by Tclk/2. The max delay savings Speculativeis minimal if techniqueswe consider relaxerror handling max delay and and min constrain delay overheads. min delay For by a typicalTclk/2. speculative The max delay pipeline savings is minimalshown ifin we Figure consider 15, the error max delay, handling Tmax andcan be min speculated delay overheads. up to Tclk/2 Fortill the a typical falling edge speculative of the clock pipeline shownand in is Figure calculated 15, theas: max delay, Tmax can be speculated up to Tclk/2 till the falling edge of the clock and is calculated as: Tmax = Tclk + Tclk/2 − Tsu,clk-f , (1) Tmax = Tclk + Tclk/2 − Tsu,clk-f, (1) where Tsu,clk-f represents the setup timing of the speculative flip-flop clock falling edge. The effective wheremaxTsu,clk-f delayrepresents improvement the T setupmax,eff is timinglimited ofby thethe speculativecritical operating flip-flop point clockand is falling calculated edge. as: The effective max delay improvement Tmax,eff is limitedTmax,eff by = T theclk + criticalTcop − Tsu,clk operating-f , point and is calculated as: (2)

where Tcop represents the effective amount by which max delay is relaxed till the critical operating Tmax,eff = Tclk + Tcop − Tsu,clk-f, (2) point. The min delay requirement, Tmin, for Razor is tightened by Tclk/2 and is calculated as: where Tcop represents the effective amountTmin by= Tclk which/2 + Th,clk max-f , delay is relaxed till the critical operating(3) point. The min delay requirement, T , for Razor is tightened by T /2 and is calculated as: min clk J. Low Power Electron. Appl. 2018, 8, 9 11 of 18

J. Low Power Electron. Appl. 2018, 7, x Tmin = Tclk/2 + Th,clk-f,11 of (3) 18 wherewhere Th,clkh,clk-f-f representsrepresents the the hold hold timing timing requirement requirement of the speculative of the speculative clock falling clock edge. falling Thus, edge. for a Thus,max delay for a maximprovement delay improvement of Tcop, there of isT copa min, there delay is a buffer min delay overhead buffer of overhead Tclk/2. of Tclk/2.

MSFF Razor FF D D

tcomb out t clk D su clk Tsu,clk-f err Tsu,clk-r tcomb clk clk tsu out err terr-path+ tres t tclk2out err-path clk tres tsu tclk2err (a) (b) (c)

FigureFigure 15. 15. ((aa)) Speculative Speculative flip flip-flop-flop pipeline, ((bb)) datadata pathpath metastabilitymetastability and and ( c(c)) error error path path metastability. metastability.

SBFF relaxes the max delay and tightens the min delay of the critical fan-in logic by TB SBFF relaxes the max delay and tightens the min delay of the critical fan-in logic by TB proportional proportional to the consecutive stage slack as shown in Figure 16. The SBFF max delay, Tmax,SBFF, is to the consecutive stage slack as shown in Figure 16. The SBFF max delay, T , is calculated as: calculated as: max,SBFF

Tmax,SBFFTmax,SBFF = Tclk ++ TTB B− −Tsu,clk_delTsu,clk_del-r-r , , (4)(4) where Tsu,clk_del-r represents the setup time constraint with respect to SBFF delayed clock rising edge. where Tsu,clk_del-r represents the setup time constraint with respect to SBFF delayed clock rising edge. The max delay improvement of TB gives extra room for power, area and performance improvement. The max delay improvement of TB gives extra room for power, area and performance improvement. ForFor highhigh performance performance designs, designs, the the extra extra margin margin can becan used be toused scale to frequency.scale frequency. For low For power low designs, power thedesigns, extra marginthe extra can margin be used can to be under used designto under the design logic to the achieve logic to power achieve and power area savings. and area The savings. logic The logic downsizing makes sure that the short path delay also increases by TB, thus eliminating the downsizing makes sure that the short path delay also increases by TB, thus eliminating the need for need for additional hold buffers. The SBFF min delay, Tmin,SBFF, is calculated as: additional hold buffers. The SBFF min delay, Tmin,SBFF, is calculated as:

Tmin,SBFF = TB + Th,clk_del-r , (5) Tmin,SBFF = TB + Th,clk_del-r, (5) where Th,clk_del-r represents the hold timing requirement with respect to the SBFF delayed clock edge. whereAs longTh,clk_del-r as the TrepresentsB constraint the is satisfied, hold timing the requirementD input meets with the respect worst-case to the setup SBFF timing delayed of the clock delayed edge. Asclock. long This as the eliminatesTB constraint the need is satisfied, for error the correction. D input meetsWe can the represent worst-case the setup maximum timing delay of the constraint delayed clock.on TB Thisas: eliminates the need for error correction. We can represent the maximum delay constraint on TB as: TBmax,SBFF ≤ Tclk/2 − Tsu,clk_del-r , (6) TBmax,SBFF ≤ Tclk/2 − Tsu,clk_del-r, (6) Runtime dynamic timing speculation techniques have metastability risks at the clock edges alongRuntime the data dynamic path and timing error path. speculation Data path techniques metastability have will metastability arise if D input risks changes at the clock too close edges to alongthe positive the data clock path edge and due error to voltage path.Data scaling path or metastabilityPVT variations, will asarise depicted if D in input Figure changes 15b. As too a closestatistical to the measure positive of clock the edgereliability due to from voltage metastable scaling orerrors, PVT variations,we express as the depicted Mean intime Figure between 15b. As a statistical measure of the reliability from metastable errors, we express the Mean time between failures, MTBFdata-path, as: failures, MTBFdata-path, as: MTBFdata-path = exp(tres⁄τc)⁄(fdata. T0. fclk), (7)

MTBFdata-path = exp(tres/τc)/(fdata. T0. fclk), (7) Here, tres represents the resolution time to exit the metastable state; T0 is the metastability window; τc is the metastability resolution time constant; fdat is the frequency of data transition; and fclk is the clock Here, tres represents the resolution time to exit the metastable state; T0 is the metastability window; frequency. The metastability resolution time tres can be calculated as: τc is the metastability resolution time constant; fdat is the frequency of data transition; and fclk is the clock frequency. The metastability resolutiontres = Tclk − time tclk2outtres − tcancomb be− tsu calculated, as: (8)

Here, tclk2out is the clock to output delay of the flip-flop; tcomb is the combinational logic delay of the tres = Tclk − tclk2out − tcomb − tsu, (8) succeeding stage; and tsu is the setup time constraint of the subsequent flip-flop as shown in Figure 15b. A higher tres amounts to better protection from metastable errors. For the speculative Here, tclk2out is the clock to output delay of the flip-flop; tcomb is the combinational logic delay of flip-flop, tclk2out is worse than the conventional flip-flops due to the redundant latch, which makes tres the succeeding stage; and tsu is the setup time constraint of the subsequent flip-flop as shown in smaller. This introduces additional constraints on the combinational delay of the next pipeline stage. Figure 15b. A higher tres amounts to better protection from metastable errors. For the speculative The error path is also prone to metastability because the dynamic voltage scaling pushes the data too close to the negative clock edge, as shown in Figure 15c. The corresponding mean time between failures, MTBFerr-path, is represented as:

MTBFerr-path = exp(tres⁄τc)⁄(fdata. T0. fclk), (9)

J. Low Power Electron. Appl. 2018, 8, 9 12 of 18

flip-flop, tclk2out is worse than the conventional flip-flops due to the redundant latch, which makes tres smaller. This introduces additional constraints on the combinational delay of the next pipeline stage. The error path is also prone to metastability because the dynamic voltage scaling pushes the data too close to the negative clock edge, as shown in Figure 15c. The corresponding mean time between failures, MTBFerr-path, is represented as:

J. Low Power Electron. Appl. 2018, 7, x MTBFerr-path = exp(tres/τc)/(fdata. T0. fclk),12 of (9) 18

Here, tres representsrepresents the the resolution resolution time time to to exit exit the the metastable metastable state; state; T0T is0 isthe the metastability metastability window; window; τc isτc theis the metastability metastability resolution resolution time time constant; constant; fdatfdat isis the the frequency frequency of of data data transition; transition; and and fclk isis the the clock clock frequency, allall withwith respectrespect toto thethe errorerror path.path. The error path metastability can be fatal especially since speculative designsdesigns rely rely on on the errorthe error signal signal timing t foriming proper for error proper recovery. error Here,recovery. the metastability Here, the metastabilityresolution time resolutiontres can be time expressed tres can be as: expressed as:

tres = Tclk − TB − tclk2err − tcomb − tsu, (10) tres = Tclk − TB − tclk2err − tcomb − tsu, (10)

Here, tclk2err is the clock to output delay for the error output signal with respect to the negative clock Here, tclk2err is the clock to output delay for the error output signal with respect to the negative edge. As shown in Figure 15c, the tres window for the error path is too small, and this puts additional clock edge. As shown in Figure 15c, the tres window for the error path is too small, and this puts constraints on the error path delay terr-path. Resolution time can extend up to three clock cycles, which additional constraints on the error path delay t . Resolution time can extend up to three clock means the error signal will take three clock cycleserr-path to resolve fully in the event of metastability. Any cycles, which means the error signal will take three clock cycles to resolve fully in the event of timing issues along the error path will affect the error recovery process, and the system may not be metastability. Any timing issues along the error path will affect the error recovery process, and the able to get back to the normal error-free state. The mean time between failures for the SBFF data path system may not be able to get back to the normal error-free state. The mean time between failures for can be represented as: the SBFF data path can be represented as: MTBFdata-path,SBFF = exp(tres⁄τc)⁄(fdata. T0. fclk), (11) MTBFdata-path,SBFF = exp(tres/τc)/(fdata. T0. fclk), (11) Here, tres represents the resolution time to come out of the metastable state; T0 is the metastability window of the delayed clock edge; τc is the metastability resolution time constant; fdata is the frequency Here, tres represents the resolution time to come out of the metastable state; T0 is the metastability of data transition; and fclk is the clock frequency. The metastability resolution time tres can be window of the delayed clock edge; τc is the metastability resolution time constant; fdata is the calculated as: frequency of data transition; and fclk is the clock frequency. The metastability resolution time tres can be calculated as: tres,SBFF = Tclk − tclk_del2out − tcomb − tsu, (12) tres,SBFF = Tclk − tclk_del2out − tcomb − tsu, (12) Here, tclk_del2out is the clock to output delay of the SBFF; tcomb is the combinational logic delay succeeding theHere, SBFF;tclk_del2out and istsu theis the clock setup to output time delayconstraint of the of SBFF; the tsubsequentcomb is the combinational flip-flop, as logicshown delay in Figure succeeding 16b. Timingthe SBFF; constraints and tsu is thein (6) setup make time sure constraint that the of setup the subsequent constraints flip-flop, are met as with shown the in delayed Figure 16clockb. Timing edge, andconstraints so, the probability in (6) make of sure metastable that the setuperrors constraintsis very low. are The met absence with the of metastability delayed clock makes edge, t andclk_del2out so, smallerthe probability than tclk2out of metastable of speculative errors FF, is which very low. results The in absence more resolution of metastability time for makes SBFFtclk_del2out in the eventsmaller of anythan timintclk2outg ofupset. speculative The design FF, which margin results improvement in more resolution using the timeproposed for SBFF non in-speculative the event of approach any timing is reliableupset. The and design predictable, margin improvement which results using in the less proposed overhead non-speculative compared approach to real is-time reliable design and predictable,margin speculation. which results in less overhead compared to real-time design margin speculation.

D Tsu,clk_del-r clk MSFF SBFF

tcomb clk_del out tsu D TB tcomb clk clk_del monitor out

tclk_del2out tsu (a) (b)

Figure 16. ((a)) SBFF SBFF pipeline and ( b) data path timing.

4. Power Optimization Algorithm Slack balancing flip-flops help to redistribute available slack in the design to the critical stages. Once we replace the critical endpoints with SBFFs, we can leverage the additional slack to under design the critical fan-in logic. The proposed power optimization approach uses the available slack margin to downsize the critical fan-in logic. There is an area reduction and delay increase when we downsize the gates. However, the critical endpoints absorb this delay increase because the delayed data are safely captured by the delayed clock of the SBFF. This results in fewer resources along the critical fan-in logic, leading to power and area savings with respect to the baseline design. To realize this, we propose a slack analysis and downsizing (SizeOpt) algorithm as described in Algorithm 1.

J. Low Power Electron. Appl. 2018, 8, 9 13 of 18

4. Power Optimization Algorithm Slack balancing flip-flops help to redistribute available slack in the design to the critical stages. Once we replace the critical endpoints with SBFFs, we can leverage the additional slack to under design the critical fan-in logic. The proposed power optimization approach uses the available slack margin to downsize the critical fan-in logic. There is an area reduction and delay increase when we downsize the gates. However, the critical endpoints absorb this delay increase because the delayed data are safely captured by the delayed clock of the SBFF. This results in fewer resources along the critical fan-in logic, leading to power and area savings with respect to the baseline design. To realize this, we propose a slack analysis and downsizing (SizeOpt) algorithm as described in Algorithm 1. The procedure takes an input netlist to calculate slack available in the logic stages. Based on the timing analysis results, we filter the critical endpoints whose slack margin is less than a predefined threshold (Th). We set Th as two percent of the clock period in this work. Our timing analysis showed enormous slack presence in the logic stages. Here, the time borrowing capability of a critical path is solely based on the slack present in the consecutive stage. Therefore, we divided the available slack into four time borrow bands TB = {TB1, TB2, TB3, TB4}, which are multiples of Tclk/8 and take values TB1 = Tclk/8, TB2 = 2Tclk/8, TB3 = 3Tclk/8 and TB4 = 4Tclk/8, respectively. Depending on whether the consecutive slack in the second stage falls in TB1, TB2, TB3 or TB4, we relax the endpoint’s fan-in logic by the respective TB. This is done by adjusting the path delay during synthesis. Once the path delay is adjusted, we resynthesize the pipeline, which results in logic downsizing. The critical endpoints in the netlist are then replaced by an SBFF using custom scripts. Clock delays for the SBFFs are decided based on their time borrowing capability TB. For the best overall power/area savings, we start from the paths with the highest slack in the range of TB4. We replace those paths with SBFFs and relax their timing with path adjust commands in the synthesis stage. This process is continued until there is no more critical path that has at least TB1 slack from which to borrow. This makes sure that the design undergoes a path relaxation proportional to the maximum slack available.

Algorithm 1: Slack Analysis and Downsizing (SizeOpt) Input: Initial Netlist Output: Netlist with SBFF and optimized cells

1 Procedure SizeOpt (Initial Netlist) 2 Run STA to find critical paths P and consecutive slacks; 3 P = ∅; 4 for all timing endpoints p in the netlist do 5 if slack(p) < Th then 6 P = P ∪ {p}; 7 end if 8 end for 9 for all TB = TB4, TB >= TB, TB = TB − TB1 do 10 for all p do 11 if consecutive slack (p) >= TB and < 2TB do 12 Replace endpoint regsiter by SBFF with DCK = TB; 13 Adjust timing of fan-in logic by TB; 14 Downsize fan-in and close timing with respect to DCK; 15 end if 16 Calculate power and area savings; 17 end for 18 end for

19 return (NetlistTB); J. Low Power Electron. Appl. 2018, 8, 9 14 of 18

5. Results and Analysis In order to enhance the processor pipeline, we developed 16 different flavors of SBFFs, as shown in Table2. We characterized the SBFFs and compared the power, area and setup-hold time metrics with the corresponding master-slave flip-flops. Figure 17 shows the power and area comparison between the different flip-flop pairs from MSFF and SBFF library. It is evident that SBFFs consume higher power and area compared to their MSFF counterparts. However, the proposed logic downsizing helps to convert the power and area overheads to gains, unlike other resilient techniques. Figure 18 shows the timingJ. Low Power comparison Electron. Appl. between 2018, 7, x the flip-flop pairs of the SBFF and MSFF library. The setup14 of time 18 and hold time are also slightlyTable high 2. SBFF for flavors the SBFF used cells. in the proposed design methodology.

Drive Reference Table 2. SBFF flavors used in the proposedDescription design methodology. Strength FF_1, FF_2 1X, 2X D-flip-flop, positive-edge triggered, q-only FF_3,Reference FF_4, FF_5 1X, Drive 2X, Strength 8X D-flip-flop, positive-edge Description triggered, q-only FF_1, FF_2FF_6 1X 1X, 2XD-flip-flop, positive D-flip-flop,-edge triggered, positive-edge low-asynchronous triggered,- q-onlyclear, q-only FF_3,FF_7, FF_4, FF_8, FF_5 FF_9 1X, 1X, 2X, 2X, 8X 8XD-flip-flop, positive D-flip-flop,-edge triggered, positive-edge low-asynchronous triggered,- q-onlyclear, q-only FF_10,FF_6 FF_11 1X, 2X 1X D-flip-flop, D-flip-flop, positive- positive-edgeedge triggered, triggered, low-asynchronous low-asynchronous-clear,-clear/set, q-only q-only FF_7,FF_12, FF_8, FF_13, FF_9 1X, 2X, 8X D-flip-flop, positive-edge triggered, low-asynchronous-clear, q-only 1X, 2X, 8X D-flip-flop, positive-edge triggered, low-asynchronous-clear/set, q-only FF_10,FF_14 FF_11 1X, 2X D-flip-flop, positive-edge triggered, low-asynchronous-clear/set, q-only FF_12, FF_13, FF_14 1X, 2X, 8X D-flip-flop, positive-edge triggered, low-asynchronous-clear/set, q-only D-flip-flop, positive-edge triggered, low-asynchronous-clear/set, sync hold, q- FF_15,FF_15, FF_16 FF_16 1X, 1X, 2X, 2X, 8X 8X D-flip-flop, positive-edge triggered, low-asynchronous-clear/set, sync hold, q-only only

9 18 MSFF 8 MSFF 16 ) SBFF 7 SBFF 14 pW ( )

6 2 12

5 10 um ( 4 8

3 6 Area Area

2 4 Internal Power Power Internal 1 2

0 0

(a) (b)

FigureFigure 17. 17.(a) ( Powera) Power comparison comparison betweenbetween MSFF MSFF and and SBFF SBFF standard standard cells cells and and(b) area (b) areacomparison comparison between MSFF and SBFF standard cells. between MSFF and SBFF standard cells. Error resilience experiments were conducted on a 40-nm embedded processor with a three-stage pipeline.Error resilience The whole experiments design flow wereis carried conducted out at the on worst a 40-nm-case PVT embedded corner using processor a 3σ process with alibrary, three-stage pipeline.a temperature The whole of 125 design °C and a flow worst is-case carried voltage out of at0.99 the V. The worst-case conventional PVT worst corner-case using corner a design 3σ process library,was a used temperature as the baseline of 125 ◦ Cto and compare a worst-case the power voltage and of area. 0.99 V.The The processor conventional has a worst-case three-stage corner designfetch was-decode used-execute as the pipe baselineline, and to compareit does more the in power each stage and to area. increase The the processor overall performance. has a three-stage There are no critical clock gates in the design. The processor implementation details and complexity fetch-decode-execute pipeline, and it does more in each stage to increase the overall performance. are depicted in Table 3. There are no critical clock gates in the design. The processor implementation details and complexity are depicted160 in Table3. 140 MSFF MSFF Based on140 the slack analysis, we found an opportunity120 to relax the slack margin in the data SBFF ) 120 ) SBFF path modules shown in Figure 19. The design has ~20100 K logic cells with 720 critical endpoints and ns ns ( 7000 flip-flops.100 Sixty percent of the critical endpoints( have sufficient slack and can be replaced by 80 SBFFs. The execute80 stage has the maximum number of critical endpoints, and ~326 paths in the execute 60 stage are relaxed60 by a maximum slack margin of TB4. 40 40 time Setup C to Q delay

20 20

0 0

(a) (b) Figure 18. (a) Setup time comparison between MSFF and SBFF standard cells and (b) hold time comparison between MSFF and SBFF standard cells.

J. Low Power Electron. Appl. 2018, 7, x 14 of 18

Table 2. SBFF flavors used in the proposed design methodology.

Drive Reference Description Strength FF_1, FF_2 1X, 2X D-flip-flop, positive-edge triggered, q-only FF_3, FF_4, FF_5 1X, 2X, 8X D-flip-flop, positive-edge triggered, q-only FF_6 1X D-flip-flop, positive-edge triggered, low-asynchronous-clear, q-only FF_7, FF_8, FF_9 1X, 2X, 8X D-flip-flop, positive-edge triggered, low-asynchronous-clear, q-only FF_10, FF_11 1X, 2X D-flip-flop, positive-edge triggered, low-asynchronous-clear/set, q-only FF_12, FF_13, 1X, 2X, 8X D-flip-flop, positive-edge triggered, low-asynchronous-clear/set, q-only FF_14 D-flip-flop, positive-edge triggered, low-asynchronous-clear/set, sync hold, q- FF_15, FF_16 1X, 2X, 8X only

9 18 MSFF 8 MSFF 16 ) SBFF 7 SBFF 14 pW ( )

6 2 12

5 10 um ( 4 8

3 6 Area Area

2 4 Internal Power Power Internal 1 2

0 0

(a) (b) Figure 17. (a) Power comparison between MSFF and SBFF standard cells and (b) area comparison between MSFF and SBFF standard cells.

Error resilience experiments were conducted on a 40-nm embedded processor with a three-stage pipeline. The whole design flow is carried out at the worst-case PVT corner using a 3σ process library, a temperature of 125 °C and a worst-case voltage of 0.99 V. The conventional worst-case corner design was used as the baseline to compare the power and area. The processor has a three-stage fetch-decode-execute pipeline, and it does more in each stage to increase the overall performance. J. LowThere Power are Electron. no critical Appl. 2018 clock, 8, gates 9 in the design. The processor implementation details and complexity15 of 18 are depicted in Table 3.

160 140 MSFF MSFF 140 120 SBFF ) 120 ) SBFF 100 ns ns ( 100 ( 80 80 60 60 40 40 time Setup J. Low Power Electron.C to Q delay Appl. 2018, 7, x 15 of 18 20 20

0 Table 3. Implementation details of0 the proposed design.

Technology(a) Node 40 nm (b) Figure 18. (a) Setup timeStd. comparison cells: between MSFF and~50 SBFFK-gates standard cells and (b) hold time Figure 18. (a) Setup time comparison between MSFF and SBFF standard cells and (b) hold time comparison between MSFFFlip and-flops SBFF standard cells. ~7000 comparison between MSFF and SBFF standard cells. Memory: 32 KB ISRAM, 32 KB DSRAM 4 KB ROM Table 3. Implementation details of the proposed design. Max. Frequency 180 MHz Typ. Voltage 1.1 Volts Technology Node 40 nm Std. cells: ~50 K-gates Based on the slack analysis, we found an opportunity to relax the slack margin in the data path Flip-flops ~7000 modules shown in Figure 19. TheMemory: design has ~20 K 32logic KB cells ISRAM, with 32 KB720 DSRAM critical endpoints and 7000 flip-flops. Sixty percent of the critical endpoints have sufficient4 KB slack ROM and can be replaced by SBFFs. The execute stage has the maximumMax. Frequency number of critical endpoints, 180 MHzand ~326 paths in the execute stage are relaxed by a maximum slackTyp. margin Voltage of TB4. 1.1 Volts

350

300 Execute stage

250 Fetch stage

200

150

100 of relaxed paths

# 50

0 TB1=Tclk/8 TB2=2Tclk/8 TB3=3Tclk/8 TB4=4Tclk/8 Slack (ps)

FigureFigure 19. 19. SlackSlack improvement improvement in in the the pipeline pipeline stages. stages.

TheThe proposed proposed power power optimization approachapproach down-sizesdown-sizes the the critical critical data data paths paths that that have have SBFFs SBFFs by bya predefineda predefined set set of relaxationof relaxation margins. margins. Here, Here, the actualthe actual power power and areaand area savings savings depend depend on how on manyhow manyendpoints endpoints are relaxed are relaxed and for and which for particularwhich particular relaxation relaxation margin. margin. Tables4 andTable5 summarizes 4 and 5 summarize the power theand power area savings and area of thesavings proposed of the approach. proposed Using approach. our algorithm, Using our we algorithm, get a maximum we get power a maximum savings powerof ~12% savings as shown of ~12% in Figure as shown 20a. Wein Figure get the 20a. maximum We get power the maximum savings inpower the execute savings stage, in the which execute has stage,the maximum which has number the maximum of paths number beings relaxedof paths by beings TB4. Onrelaxed the contrary,by TB4. On a typical the contrary, speculative a typical error speculativeresilience approach error resilience like TIMBER approach [17 like] suffers TIMBER from [17] a power suffers overhead from a power of 39% overhead due to the of 39% speculative due to thehardware. speculative We hardware. get an area We savings get an ofarea ~8% savings with of the ~8% proposed with the approach, proposed as approach, shown in as Figure shown 20 inb. FigureTIMBER 20b. hardware TIMBER has hardware an area has overhead an area of overhead ~1.7%. Again, of ~1.7%. the areaAgain, savings the area predominantly savings predominantly comes from comesthe execute from the stage execute path relaxations.stage path relaxations. It is evident It thatis evident the higher that the relaxation higher relaxation margins of margins TB4 give of moreTB4 giveoptimization more optimi opportunityzation opportunity and reduce and the overheads,reduce the whichoverheads, results which in power results and in area power savings. and area savings.

Table 4. Power savings of the proposed design methodology.

Baseline TIMBER [17] Proposed Power (nW) #Cells Leakage Dynamic Total #Cells Leakage Dynamic Total #Cells Leakage Dynamic Total Fetch 1976 1463 265,127 266,590 2014 1481 482,389 483,870 1751 1103 401,268 402,372 Decode 1874 920 149,114 150,034 1900 805 154,762 155,567 1796 573 162,912 163,485 Execute 16,414 14,825 1,307,869 1,322,695 16,356 14,714 1,777,447 1,792,162 15,053 11,039 941,299 952,339

J. Low Power Electron. Appl. 2018, 8, 9 16 of 18

Table 4. Power savings of the proposed design methodology.

Baseline TIMBER [17] Proposed Power (nW) #Cells Leakage Dynamic Total #Cells Leakage Dynamic Total #Cells Leakage Dynamic Total Fetch 1976 1463 265,127 266,590 2014 1481 482,389 483,870 1751 1103 401,268 402,372 Decode 1874 920 149,114 150,034 1900 805 154,762 155,567 1796 573 162,912 163,485 Execute 16,414 14,825 1,307,869 1,322,695 16,356 14,714 1,777,447 1,792,162 15,053 11,039 941,299 952,339 J. Low Power Electron. Appl. 2018, 7, x 16 of 18

Table 5. Area savings of the proposed design methodology. Table 5. Area savings of the proposed design methodology.

BaselineBaseline TIMBER TIMBER [17 [17]] Proposed Proposed AreaArea (µm (µm2)2 ) #Cells#Cells Cell Cell Area Area Net Net Area Total # #CellsCells Cell Cell Area Net Net Area Total #Cells#Cells Cell Cell Area Net Net AreaArea Total FetchFetch stage stage 1976 1976 4 4980980 6 6714714 11,695 2 2014014 5 5217217 6 6928928 12,145 17511751 4 4391391 6236 10,627 DecodeDecode stage stage 1874 1874 3 3316316 6 6640640 99559955 1 1900900 3 3264264 6 6677677 99409940 17961796 2 2768768 6 6430430 91999199 ExecuteExecute stage stage 16,41416,414 44,74144,741 62,899 62,899 107,639 16,356 46,919 46,919 62,784 62,784 109,702 15,053 39,790 39,790 58,245 9 98,0348,034

3000000 160000 +1.7% Baseline +39% 140000 Baseline 2500000 TIMBER -8.4% 120000 TIMBER ) ) 2000000 Proposed 2 Proposed 100000 nW (

-12% um ( 1500000 80000

60000 Area Area

Power 1000000 40000 500000 20000

0 0 Fetch Decode Execute Total Fetch Decode Execute Total Pipeline stages Pipeline stages

FigureFigure 20. 20. SizeOptSizeOpt power power optimization optimization technique. technique. (a (a) )Power Power savings savings and and (b (b) )area area savings savings with with respect respect toto the the worst worst-case-case design. design.

6.6. Conclusion Conclusionss InIn this this paper, paper, we we propose propose a a non non-speculative-speculative pipe pipe-line-line design design methodology methodology to to relax relax the the slack slack marginsmargins proportionate proportionate to to the the available available slack slack using using Slack Slack Balancing Balancing Flip Flip-Flops-Flops (SBFFs). (SBFFs). The The proposed proposed designdesign methodology methodology handles handles worst worst-case-case variations variations by bysampling sampling the thedata data at a delayed at a delayed clock clock edge. edge.The additionalThe additional slack slack margin margin obtained obtained from from SBFFs SBFFs is used is used to todownsize downsize the the critical critical fan fan-in-in logic logic for for power power andand area area savings. savings. Existing Existing speculative speculative error error detection detection schemes schemes have have critical critical operating operating point point behavior, behavior, andand their their effectiv effectivenesseness largely largely depends depends on on the the error error rate rate at at any any given given point point of of time. time. The The proposed proposed approachapproach is is non non-speculative,-speculative, which which eliminates eliminates the the critical critical wall wall of of slack slack behavior. behavior. The The speculative speculative techniquestechniques need need to to be be periodically periodically tuned tuned to to adapt adapt to to the the ambient ambient operating operating conditions conditions and and workload. workload. TheyThey rely rely on on run run-time-time voltage voltage scaling, scaling, which which results results in in metastability metastability issues issues and and compromises compromises the the functionalityfunctionality of of the the chip. chip. The The deterministic deterministic timing timing closure closure used used in in the the proposed proposed approach approach eliminates eliminates metastabilitymetastability and and error correctioncorrection overheads.overheads. Moreover,Moreover, the the proposed proposed logic logic downsizing downsizing eliminates eliminates the theneed need for for hold hold buffers. buffers. Experimental Experimental results results of theof the proposed proposed design design methodology methodology show show a power a power and andarea area savings savings of 12% of 12% and and 8%, 8%, respectively, respectively, compared compared to the to worst-casethe worst-case design. design. WithWith the aimaim to to maximize maximize power power and and area area benefits, benefits, the resilient the resilient design proposeddesign proposed can be improved can be improvedusing dynamic using variationdynamic tolerancevariation tolerance approaches. approaches. In the future, In the the future, proposed the proposed design time design resilience time resiliencetechnique technique can be combined can be withcombined runtime with voltage/frequency runtime voltage/frequency scaling to reap scaling more to benefits. reap more Furthermore, benefits. Furthermore,we can train we the can circuit train to the enable circuit or disableto enable time or borrowingdisable time depending borrowing on depending the data activityon the data rates. activityThis will rates. help toThis reduce will thehelp redundancy to reduce overheadsthe redundancy associated overheads the with associated the time borrowing the with hardware.the time borrowing hardware. Acknowledgments: This work was supported by EDB-Industrial Post-graduate Programme, Singapore, and NXP Semiconductors, Singapore. Acknowledgments: This work was supported by EDB-Industrial Post-graduate Programme, Singapore, and NXP Semiconductors, Singapore.

Author Contributions: The main author, Mini Jayakrishnan, carried out the experiments and writing of the paper. Alan Chang recommended the idea/topic and provided the necessary platform to conduct the experiments. Tony Tae-Hyoung Kim guided the main author in the research and documentation.

Conflicts of Interest: The authors declare no conflict of interest.

J. Low Power Electron. Appl. 2018, 8, 9 17 of 18

Author Contributions: The main author, Mini Jayakrishnan, carried out the experiments and writing of the paper. Alan Chang recommended the idea/topic and provided the necessary platform to conduct the experiments. Tony Tae-Hyoung Kim guided the main author in the research and documentation. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Borkar, S.; Karnik, T.; Narendra, S.; Tschanz, J.; Keshavarzi, A.; De, V. Parameter Variations and Impact on Circuits and Microarchitecture. In Proceedings of the 40th Annual Design Automation Conference (DAC), Anaheim, CA, USA, 2–6 June 2003. 2. Ghosh, S.; Roy, K. Parameter variation tolerance and error resiliency: New design paradigm for the nanoscale era. Proc. IEEE 2010, 98, 1718–1751. [CrossRef] 3. Asenov, A.; Brown, A.R.; Davies, J.H.; Kaya, S.; Slavcheva, G. Simulation of intrinsic parameter fluctuations in decananometer and nanometer-scale MOSFETs. IEEE Trans. Electron Devices 2003, 50, 1837–1852. [CrossRef] 4. Nassif, S.R. Modeling and analysis of manufacturing variations. In Proceedings of the Custom Integrated Circuit Conference, San Diego, CA, USA, 9 May 2001. 5. Borkar, S. Designing Reliable Systems from Unreliable Components: The Challenges of Transistor Variability and Degradation. IEEE Micro 2005, 25, 10–16. [CrossRef] 6. Bowman, K.A.; Duvall, S.G.; Meindl, J.D. Impact of Die-to-Die and With-in-Die Parameter Fluctuations on the Maximum Clock Frequency Distribution for Gigascale Integration. IEEE J. Solid State Circuits 2002, 37, 183–190. [CrossRef] 7. Dreslinski, R.G.; Wieckowski, M.; Blaauw, D.; Sylvester, D.; Mudge, T. Near-Threshold : Reclaiming Moore’s Law through Energy Efficient Integrated Circuits. Proc. IEEE 2010, 98, 253–266. [CrossRef] 8. Meijer, M.; Liu, B.; van Veen, R.; de Gyvez, J.P. Post-Silicon Tuning Capabilities of 45 nm Low-Power CMOS Digital Circuits. In Proceedings of the Symposium on VLSI Circuits, Kyoto, Japan, 16–18 June 2009. 9. Kim, T.; Persaud, R.; Kim, C.H. Silicon odometer: An on-chip reliability monitor for measuring frequency degradation of digital circuits. In Proceedings of the Very Large Scale Integration (VLSI) Circuits Symposium, Kyoto, Japan, 14–16 June 2007. 10. Gupta, P.; Agarwal, Y.; Dolecek, L.; Dutt, N.; Gupta, R.K.; Kumar, R.; Mitra, S.; Nicolau, A.; Rosing, T.S.; Srivastava, M.B.; et al. Underdesigned and Opportunistic Computing in Presence of Hardware Variability. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2013, 32, 8–23. [CrossRef] 11. Tschanz, J.; Bowman, K.; Walstra, S.; Agostinelli, M.; Karnik, T.; De, V. Tunable Replica Circuits and Adaptive Voltage-Frequency Techniques for Dynamic Voltage, Temperature, and Aging Variation Tolerance. In Proceedings of the Symposium on VLSI Circuits, Kyoto, Japan, 16–18 June 2009. 12. Ernst, D.; Kim, N.S.; Das, S.; Pant, S.; Rao, R.; Pham, T.; Ziesler, C.; Blaauw, D.; Austin, T.; Mudge, T.; et al. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. In Proceedings of the 36th Symposium on Microarchitecture (MICRO-36), San Diego, CA, USA, 3–5 December 2003. 13. Das, S.; Tokunaga, C.; Pant, S.; Ma, W.; Kalaiselvan, S.; Lai, K.; Bull, D.; Blaauw, D. Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance. IEEE J. Solid-State Circuits 2009, 44, 32–48. [CrossRef] 14. Bowman, K.; Tschanz, J.; Kim, N.; Lee, J.; Wilkerson, C.; Lu, S.; Karnik, T.; De, V. Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance. IEEE J. Solid-State Circuits 2009, 44, 49–63. [CrossRef] 15. Fojtik, M.; Fick, D.; Kim, Y.; Pinckney, N.; Harris, D.M.; Blaauw, D.; Sylvester, D. Bubble Razor: Eliminating timing margins in an ARM cortex-M3 processor in 45 nm CMOS using architecturally independent error detection and correction. IEEE J. Solid-State Circuits 2013, 48, 66–81. [CrossRef] 16. Joshi, V.; Blaauw, D.; Sylvester, D. Soft-edge flip-flops for improved timing yield: Design and optimization. In Proceedings of the IEEF/ACM International Conference on Computer-Aided Design, San Jose, CA, USA, 5–8 November 2007. 17. Choudhury, M.; Chandra, V.; Mohanram, K.; Aitken, R. TIMBER: Time borrowing and error relaying for online timing error resilience. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 8–12 March 2010. J. Low Power Electron. Appl. 2018, 8, 9 18 of 18

18. Bull, D.; Das, S.; Shivashankar, K.; Dasika, G.; Flautner, K.; Blaauw, D. A power-efficient 32-bit ARM processor using timing-error detection and correction for transient-error tolerance and adaptation to PVT variation. IEEE J. Solid-State Circuits 2011, 46, 18–31. [CrossRef] 19. Das, S.; Roberts, D.; Lee, S.; Pant, S.; Blaauw, D.; Austin, T.; Flautner, K.; Mudge, T.T. A self-tuning DVS processor using delay error detection and correction. IEEE J. Solid-State Circuits 2006, 41, 792–804. [CrossRef] 20. Austin, T.; Bertacco, V.; Blaauw, D.; Mudge, T. Opportunities and Challenges for Better Than Worst-Case Design. In Proceedings of the Asia and South Pacific Design Automation Conference, Shanghai, China, 18–21 January 2005. 21. Liu, X.; Papaefthymiou, M.C.; Friedman, E.G. Maximizing performance by retiming and clock skew scheduling. In Proceedings of the 36th Design Automation Conference, New Orleans, LA, USA, 21–25 June 1999. 22. Monteiro, J.; Devadas, S.; Ghosh, A. Retiming sequential circuits for low power. In Proceedings of the IEEE/ ACM International Conference on Computer-Aided Design, Santa Clara, CA, USA, 7–11 November 1993. 23. Neves, J.L.; Friedman, E.G. Optimal clock skew scheduling tolerant to process variations. In Proceedings of the 33rd Annual Design Automation Conference, Las Vegas, NV, USA, 3–7 June 1996. 24. Sathyamurthy, H.; Sapatnekar, S.; Fishburn, J. Speeding up pipelined circuits through a combination of gate sizing and clock skew optimization. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 1998, 17, 173–182. [CrossRef] 25. Xi, J.G.; Dai, W.W.M. Useful-skew clock routing with gate sizing for low power design. In Proceedings of the 33rd Annual Design Automation Conference, Las Vegas, NV, USA, 3–7 June 1996. 26. Sarangi, S.R.; Greskamp, B.; Tiwari, A.; Torrellas, J. EVAL: Utilizing processors with variation-induced timing errors. In Proceedings of the International Symposium on Microarchitecture, Lake Como, Italy, 8–12 November 2008. 27. Greskamp, B.; Wan, L.; Karpuzcu, W.R.; Cook, J.J.; Torrellas, J.; Chen, D.; Zilles, C. BlueShift: Designing Processors for Timing Speculation from the Ground Up. In Proceedings of the IEEE International Symposium on High Performance , Raleigh, NC, USA, 14–18 February 2009. 28. Patel, J. CMOS Process Variations: A Critical Operation Point Hypothesis. Available online: https://web. stanford.edu/class/ee380/Abstracts/080402-jhpatel.pdf (accessed on 19 March 2018). 29. Kahng, A.B.; Kang, S.; Li, J. A New Methodology for Reduced Cost of Resilience. In Proceedings of the GLSVLSI, Houston, TX, USA, 21–23 May 2014. 30. Alidash, H.K.; Oklobdzija, V.G. Low-power soft error hardened latch. J. Low Power Electron. Appl. 2010, 6, 218–226. [CrossRef] 31. Paik, S.; Nam, G.J.; Shin, Y. Implementation of pulsed-latch and pulsed-register circuits to minimize clocking power. In Proceedings of the International Conference on Computer-Aided Design, San Jose, CA, USA, 7–10 November 2011. 32. Sartori, J.; Kumar, R. Characterizing the Voltage Scaling Limitations of Razor-Based Designs; Technical Report; Coordinated Science Laboratory, The University of Illinois at Urbana-Champaign: Champaign, IL, USA, 2009. 33. Wirnshofer, M. A Variation-Aware Adaptive Voltage Scaling Technique Based on In-Situ Delay monitoring. In Proceedings of the IEEE International Symposium on Design and Diagnostics of Electronic Circuits and Systems, Cottbus, Germany, 13–15 April 2012. 34. Ghosh, S.; Martin, S.; Stelmach, S. Reliability for IoT and Automotive markets. In Proceedings of the System-on-Chip Conference (SOCC), Munich, Germany, 5–8 September 2017. 35. Jayakrishnan, M.; Chang, A.; De Gyvez, J.P.; Hyoung, K.T. Slack-aware Timing Margin Redistribution Technique Utilizing Error Avoidance Flip-Flops and Time Borrowing. In Proceedings of the IFIP/IEEE International Conference on Very Large-Scale Integration (VLSI-SoC), Daejeon, Korea, 5–7 October 2015.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).