
Issues in System on the Chip Clocking

Vojin G. Oklobdzija†‡, Fellow IEEE
† Distinguished Visiting Professor, Electrical Engineering Dept., Chung-Ang University, Seoul, Korea
‡ ACSEL Laboratory, Electrical Engineering Dept., University of California, Davis, California, USA
Korea: +82-2-820-5346, USA: 1-530-752-5634
Email: [email protected], [email protected]

Abstract: Clocking considerations and clocked storage elements for System on a Chip (SOC) are discussed. Various ways of SOC clocking are addressed. We discuss issues of particular importance for SOC such as "time borrowing" and absorption of clock uncertainties. Clock power savings techniques suitable for SOC are described.

I. INTRODUCTION

Deciding on the clocking strategy in a digital system is one of the single most important decisions. If not considered properly, system bring-up and diagnostics could be very costly while the operation will remain unreliable throughout the lifetime of the system [24-25]. The importance of clocking is rising as the clock speed doubles every two to three years.

Following the speed increase, the number of logic levels in the critical path diminishes. In today's high-speed processors, instructions are executed in one cycle, which is driven by a single-phase clock. The number of pipeline stages is increasing to 15 or 20 in order to accommodate the clock speed increase. Today as few as 10 levels of logic in the critical path are common, while this number is decreasing as illustrated in Fig. 1. The diminishing amount of logic placed between two pipeline stages is largely responsible for the recent rapid increase in the clock frequency surpassing the technology scaling trend. This decrease is occurring at about one half of the rate of the clock frequency increase, bringing the number of logic levels per pipeline stage to roughly one half every six years. However, we cannot expect that trend to continue much longer because a minimal amount of logic (of at least two stages) is necessary to make the pipeline stage meaningful. In deeper pipelines, any overhead associated with the clock system and clocking mechanism that directly and adversely affects the machine performance is critically important. With the clock frequency reaching 5-10 GHz, traditional clocking techniques are stretching to their limits, given that 3-5 gates per stage is barely useful.

In an SOC environment, maintaining a rapid increase in clock frequency is tied to additional difficulties such as the inability of a signal to cross the chip boundaries in one clock period, as well as the inability to distribute the clock over relatively large distances. Controlling clock uncertainties such as clock jitter and skew in SOC is met with additional difficulties.

______
Preparation of this paper was supported by the Institute of Information Technology Assessment Program of Korea under the program grant at Chung-Ang University.

[Figure: clock frequency (MHz) and gate delays per clock period versus year, 1987-2005, for Intel, IBM PowerPC, and DEC processors; frequency scales 2X per technology generation.]
Fig. 1. Increase in the clock frequency and decrease in the number of logic levels in the pipeline [1].

At today's frequencies, the ability to absorb clock skew and to use a faster Clocked Storage Element (CSE) results in a direct performance improvement comparable to those obtained through difficult implementations of architectural or micro-architectural techniques.

II. CLOCKING CONSIDERATIONS FOR SOC

In a System on a Chip (SOC) the clock subsystem has to satisfy a variety of diverse requirements [22]. The clocking domains may use different frequencies and may be separated by distances. A variety of communication methods may be employed, mandating the use of different clocking techniques such as a mix of synchronous and asynchronous clocking. The wire delay becomes significant and the propagation delay for some signals may last more than one clock cycle.

With the power continuing to grow, requirements for low power would demand more efficient clocking solutions. New ideas and new ways of designing digital systems are required.

The two most important timing parameters affecting the clock signal are Clock Skew and Clock Jitter:

Clock Skew is a spatial variation of the clock signal as distributed through the system. It is caused by the various RC characteristics of the clock paths to the various points in the system, as well as different loading of the clock signal at different points on the chip. Further, we can distinguish global clock skew and local clock skew. Both of them are equally important in high-performance system design.

Clock Jitter is a temporal variation of the clock signal

with regard to the reference transition (reference edge) of the clock signal. Clock jitter represents edge-to-edge variation of the clock signal in time. As such, clock jitter can also be classified as long-term jitter and edge-to-edge clock jitter, which defines clock signal variation between two consecutive clock edges. We are more concerned about edge-to-edge clock jitter because it is this phenomenon that affects the time available for the logic operation.

Typically the clock signal has to be distributed to several hundreds of thousands of clocked storage elements. Therefore, the clock signal has the largest fan-out of any node in the design, which requires several levels of amplification. As a consequence, the clock system by itself can use up to 40-50% of the power of the entire VLSI chip [2,3]. We also must assure that every clocked storage element receives the clock signal precisely at the same moment in time [22,25].

There are several methods for on-chip clock signal distribution that attempt to minimize the clock skew and contain the power dissipated by the clock system [4]. The clock can be distributed in several ways, of which the two typical cases are: (a) an RC matched tree and (b) a grid. If we had superior Computer Aided Design (CAD) tools, a perfect and uniform process, and the ability to route wires and balance loads with a high degree of flexibility, a matched RC delay clock distribution (a) would be preferable to a grid (b). However, none of that is true. Therefore a grid is used when clock distribution on the chip has to be very precisely controlled. This is the case in high-performance systems. The power consumed by the clock is also the highest in the case that uses the grid arrangement.

Local variations in device geometry and supply voltage are important components of the clock skew. Clock distribution more sophisticated than simple RC matched or grid-based schemes is thus necessary for SOC [25]. Active schemes with adaptive digital de-skewing typically reduce the clock skew of simple passive clock networks by an order of magnitude, allowing tighter control of the clock period and higher clock rates [5].

A. Synchronous Design

The traditional view of the Finite State Machine (FSM) is represented by the Huffman model consisting of combinational logic (CL) and clocked storage elements (CSE). In this model, the next state, which is determined by the present state and the input (in the case of a Mealy machine), is stored into the CSE by the triggering mechanism of the clock (edge or level). Following this model we are used to thinking that the purpose of the CSE is to "hold" or "memorize" the state. This view is further supported by the Level Sensitive Scan Design (LSSD) methodology, which uses the storage elements to "scan-out" the state of the machine during the test and debug mode [14].

We offer a different view: the purpose of CSE is to prevent the corruption of the next state as illustrated in Fig. 2. Though memorizing the state is a needed function for the architected registers, it is not a necessary function for every CSE in the machine.

This model is broader and includes wave pipelining [6], in which case the signal is blocked from corrupting the present state Sn by the sheer delay of the wire.

[Figure: FSM with inputs X, outputs Y, combinational logic Y = Y(X, Sn), next state Sn+1 = f(Sn, X), and a clocked path blocker (CSE) that keeps signals from corrupting the present state Sn; the blocker is either blocking or transparent, with logic signals adjusted not to arrive earlier, so that the state changes in an orderly way from Sn to Sn+1.]
Fig. 2. Different view of FSM: no explicit latches

If the signal cannot arrive in time, no blocking is necessary. However, this model also reveals the problems of the wave pipelining technique: ideally all the signals should arrive at the same point in time, which is not possible. Therefore the fast-path problem becomes more difficult to control and has much more stringent requirements than the slow-path.

B. Asynchronous Design

In SOC clocking, synchronous systems are facing problems such as the lack of ability to precisely control the clock, non-scaling clock uncertainties, wire delays, and the simple fact that the signal may need one or more clock cycles to reach its destination. Thus, asynchronous design has been revisited [7].

In asynchronous systems, the synchronous system overhead imposed by the clock uncertainties and CSE requirements is simply traded for the overhead imposed by the handshake signaling, as shown in Fig. 3 [8].

The real question is at which point one of the two communication strategies imposes lesser penalties on the data transfer as the logic speed keeps increasing. It makes sense to use synchronous design in local domains, which can be clocked synchronously without considerable difficulties. Data transfer lasting several clock cycles is accomplished using asynchronous communication. At

high frequencies it takes several clock cycles to cross from one chip edge to another. In a processor containing 1 billion transistors only a small portion can be clocked in a synchronous manner.

[Figure: asynchronous paradigm (Logic 1, 2, 3 linked by data and handshake signals, with handshake overhead following each compute time) versus synchronous paradigm (Logic 1, 2, 3 separated by CSEs, with waiting and clock overhead following each compute time).]
Fig. 3. Data transfer in Asynchronous System versus Synchronous [8].

C. Globally Asynchronous Locally Synchronous Systems

According to the SIA industry projections, VLSI chips will contain 1 billion transistors before the year 2010 [9-11]. However, the number of transistors used to build processor logic has not been increasing at the same rate, but is staying relatively constant at around 2 million logic transistors. Table 1 shows some of the transistor numbers for a sample of typical super-scalar RISC architecture processors around the year 2000. The reasons for this stagnation are: (a) the architecture has reached 64-bit word size and it is not growing further (for a considerable time), (b) the parallelism exploitation techniques of super-scalar machines have reached the limit, given that there is only a fixed amount of parallelism to be exploited in the ordinary code mix. Thus, the transistor growth has mainly occurred in cache memories. It is common to see two levels of cache integrated on a single chip, while chips containing three levels of cache are starting to emerge. The situation in embedded processors is similar except that the number of transistors used is smaller as compared to the high-performance ones.

Table 1: Transistor count in typical RISC processors

Feature         Digital 21164   MIPS 10000   PowerPC 620   HP 8000   Sun US
Freq. [MHz]     500             200          200           180       250
Pipeline Stg.   7               5-7          5             7-9       6-9
Issue Rate      4               4            4             4         4
Out-of-Ord.     6 loads         32           16            56        none
Reg-Ren./flp    none/8          32/32        8/8           56        none
Total Trans.    9.3M            5.9M         6.9M          3.9M      3.8M
Logic Trans.    1.8M            2.3M         2.2M          3.9M      2.0M

Fig. 4 provides an illustration of a 1 billion transistor chip. Projecting the speed of the chip at the time the transistor count reaches 1 billion and comparing it with the projected interconnect speed, it becomes obvious that synchronous design on the entire chip will not be possible. Several clock cycles would be necessary for the signal to cross from one side of the chip to the other. From the design point of view, it is also obvious that the future 1 billion transistor chip will contain multiple cores in either a multi-processor or a system-on-chip arrangement. Thus, Locally Synchronous and Globally Asynchronous clocking is a likely technique to be employed in SOC design [12].

[Figure: a 1 billion transistor chip divided into locally synchronous regions of about 10 million transistors each, with global asynchronous communication between them.]
Fig. 4. Projection of a 1 billion transistor VLSI chip

In such a system a number of independently synchronized modules communicate with each other using an asynchronous communication mechanism. It is projected that interconnect effects are to be manageable within a local synchronous module [11]; therefore, a synchronous design would continue to be a viable option for the processor core. The main feature of these systems is the absence of a global timing reference. Synchronous modules use distinct local clocks, or clock domains, even running at different frequencies.

This methodology is viable when various blocks are integrated in a single chip in a System on a Chip (SOC) environment. It allows proven IP (Intellectual Property) blocks to be reused without any modifications while relying on an asynchronous interface between blocks. Such a design keeps the benefit of synchronous design while avoiding problems caused by global wiring, especially global clock signal distribution.

IPs are synchronous in their construction, while fully asynchronous designs are built using self-timed circuits without having any global timing reference. Globally Asynchronous, Locally Synchronous designs have been used in mainframe computer systems in the past, and this represents a logical migration of the mainframe clocking methodology onto the VLSI chip [13].

III. CLOCKED STORAGE ELEMENTS

A. Master-Slave Latch

To avoid the transparency of a single latch, two latches are clocked back to back with two non-overlapping phases of the clock. In such an arrangement the first latch serves as a "Master" by receiving the values from the data input and

passing them to the "Slave" latch, which simply follows the "Master". This is known as a Master-Slave (M-S) or L1-L2 latch and should not be confused with the "Flip-Flop", Fig. 5. There is a fundamental difference between the F-F and the M-S Latch, each one requiring a different clocking strategy [18,25].

[Figure: (a) M-S Latch: Master (L1) latch clocked by φ1 feeding Slave (L2) latch clocked by φ2; (b) Flip-Flop: pulse generator driving a pulse-capturing S-R latch, with no clock on the slave latch.]
Fig. 5. General structure of (a) Master-Slave Latch (b) a Flip-Flop

In a Master-Slave Latch the "Slave" latch can have two or more Masters acting as an internal multiplexer with storage capabilities. The first "Master" is used for capturing the data input while the second Master can, for example, be used as a scan-input for the purpose of test. The second master is generally clocked with a separate clock. This clocking arrangement is used in IBM Level-Sensitive-Scan-Design (LSSD) [14], as shown in Fig. 6.

Testability of the SOC is one of the primary concerns. Thus, a Scan arrangement is highly desirable if not mandatory. The use of various clock domains may represent a problem in controlling the multiplicity of scan chains in SOC.

Master-Slave latch design also provides robustness and low-power characteristics when data activity is low.

Fig. 6. IBM LSSD Shift Register Latch [14].

B. Flip-Flop

Flip-Flop and Latch operate on different principles. While a Latch is "level-sensitive", meaning it acts on the level (logical value) of the clock signal, a Flip-Flop is "edge sensitive", which means that the mechanism of capturing the data value on its input is related to the changes of the clock. The two are designed for a different set of requirements and thus consist of inherently different circuit topologies [18,25].

Dependency on the rise time (or fall time) of the clock signal makes the use of the Flip-Flop hazardous, and it is thus prohibited in the IBM LSSD design methodology.

C. Time Window based Flip-Flops

Digital circuits are based on discrete time events. The time reference is a clock signal edge and/or a finite delay through one or more logic elements. To generate a needed time reference, a pulse created by the property of re-convergent fan-outs is commonly used [25]. This method is illustrated in Fig. 7 on the HLFF Flip-Flop introduced by Partovi [15]. The trailing edge of this pulse is used as a time reference for shutting the Flip-Flop off. Thus, a short "Time Window" is created during which the Flip-Flop is accepting data, which is the way of creating an "edge" in the digital world.

Fig. 7. Time Window based Flip-Flop, HLFF by Partovi [15]

D. Pulsed Latches

In order to decrease the time overhead imposed by the clocked storage element, Single Latch clocking has been used [18,22,25]. To narrow the transparency window of the latch, the latch is clocked with short pulses generated locally from the global clock signal. Thus, the possibility of a hold time violation is not eliminated, but it is traded for the convenience of a single latch and lower pipeline overhead. Given that the clock pulse is short, the hazard could be reduced by "padding" the logic, i.e. adding inverters in the fast paths so as to eliminate the problem.

The clock produced by the local clock generator must be wide enough to enable the Latch to capture its data. At the same time it must be sufficiently short to minimize the possibility of a "critical race". Those conflicting requirements make the use of such a single-latch design hazardous by reducing its robustness and reliability [22]. Nevertheless, pulsed latches have been used due to the critical need to reduce the cycle overhead imposed by the clocked storage elements. Intel's version of the Pulsed Latch is shown in Fig. 8 [16]. A benefit of this design is low power consumption due to the common clock signal generator and the simple structure of the latch. Saved power can be traded for speed [18]. The pulse generator used in Intel's Pulsed Latch uses the principle of re-convergent fan-out to obtain the desired short clock pulse. Further analysis shows that as the technology scales, placing less logic into the pipeline stage, the timing constraints imposed by the pulsed latch may be more difficult to meet [22].
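The two conflicting pulse-width requirements described for pulsed latches can be illustrated with a small numeric check. This is only a sketch under an assumed, simplified model: the function name, its parameters, and the two inequalities are illustrative assumptions and are not taken from the paper. The pulse is taken to be wide enough when it is at least as long as the latch capture time, and a race is assumed to be avoided when the fastest data path takes at least as long as the pulse width plus the clock skew.

```python
def pulsed_latch_check(pulse_width, capture_time, clk_q_min, logic_min, skew):
    """Check one pulsed-latch stage; all times in picoseconds.

    Assumed simplified model (hypothetical, for illustration only):
      capture condition: pulse_width >= capture_time
      race condition:    clk_q_min + logic_min >= pulse_width + skew
    Returns (captures_ok, padding_needed), where padding_needed is the
    extra fast-path delay (for example, inserted inverters) that would
    be required to avoid a race.
    """
    captures_ok = pulse_width >= capture_time
    fastest_arrival = clk_q_min + logic_min   # earliest new data at the next latch
    window_closes = pulse_width + skew        # latest moment the next latch is open
    padding_needed = max(0, window_closes - fastest_arrival)
    return captures_ok, padding_needed


if __name__ == "__main__":
    ok, pad = pulsed_latch_check(pulse_width=120, capture_time=80,
                                 clk_q_min=60, logic_min=30, skew=20)
    print(f"captures data: {ok}, fast-path padding needed: {pad} ps")
```

Under these assumptions, widening the pulse makes capture easier but raises the padding needed in fast paths, which is the trade-off the text describes.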

[Figure: clock pulse generator producing CP from Clk through a delay d1, driving latches with inputs D and outputs Q.]
Fig. 8. Intel's Explicit Pulsed Latch [16].

IV. TIMING PARAMETERS

Data and Clock inputs of a clocked storage element need to satisfy basic timing restrictions to ensure correct operation of the flip-flop. Fundamental timing constraints between data and clock inputs are quantified with setup and hold times, as illustrated in Fig. 9 (a) [18]. Setup and hold times define time intervals during which the input has to be stable to ensure correct flip-flop operation. The sum of setup and hold times defines the "sampling window" of the clocked storage element. The "sampling window" is defined as the time period in which the clocked storage element is "sampling" and data is not allowed to change.

[Figure: (a) Clk-Output delay [ps] versus Data-Clk delay [ps], showing the setup and hold points, the sampling window, and the minimum Data-Output delay; (b) Data-to-Output delay versus Data-to-Clock delay, showing the constant Clk-Q region, the variable Clk-Q region, and the failure region, with the D-Q and Clk-Q curves, the minimum D-Q delay DDQm, and the optimum setup time Uopt.]
Fig. 9. (a) Setup and Hold time behavior as a function of Clock-to-Output delay, (b) Setup and Hold time behavior as a function of Data-to-Output delay [18,25]

A. Setup and Hold Time Properties

Failure of the clocked storage element due to Setup and Hold time violations is not an abrupt process [18]. This failing behavior is shown in Fig. 9 (b). Considering how close data should be allowed to change with respect to the locking event, we encounter two opposing requirements: one is to keep data further from the failing region for the purpose of design reliability, and the second is to keep it close to the clock in order to increase the time available for the logic operation.

Some vendors specify Setup and Hold times as points in time when the Clk-Q (tCQ) delay rises by an arbitrary amount of 5-20%. We find this not to be valid. A redrawn picture, Fig. 9 (b), where the D-Q (tDQ) delay is plotted (instead of Clk-Q), provides more insight [18]. We observe that in spite of the increase in Clk-Q delay, we are still benefiting because the D-Q delay (representing the time taken from the cycle) is reduced [18,25].

B. Time Borrowing and Absorption of Clock Uncertainties

Even if data arrives close to the reference edge of the clock or past that clock edge, the delay contribution of the storage element is still smaller than the amount of delay passed onto the next cycle. This allows more time for useful logic operation. This is known as "time borrowing". In order to understand the full effects of delayed data arrival, we have to consider a pipelined design where the data captured in the first clock cycle is used as input in the next clock cycle, as shown in Fig. 10.

[Figure: source and destination CSEs separated by combinational logic; timing diagram of clock, data, Clk-Q and D-Q delays, and the sampling windows of Cycle 1 and Cycle 2, with TCR1 > TCR2.]
Fig. 10. "Time Borrowing" in a pipelined design. The setup time U is negative with respect to the rising edge of the clock.

The amount of time for which TCR1 was stretched did not come for free. It was simply "borrowed", leaving less time in the next cycle (Cycle 2) for TCR2. Thus the boundary between pipeline stages is flexible. If data can move around the clock reference edge, it is possible to absorb the clock uncertainties: skew and jitter. Thus, "time borrowing" is one of the most important characteristics of today's high-speed digital systems. Absorption of the clock jitter is shown in Fig. 11 (a) [25] and the effect on data arrival in the following cycle is illustrated in Fig. 11 (b). Moderate amounts of clock uncertainty can be effectively absorbed, while the absorption property diminishes as the clock uncertainties become excessive.
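The borrowing mechanism described above can be sketched with a short calculation. The model below is a deliberate simplification and rests on assumptions that are not in the paper: each CSE is assumed to be transparent for a fixed window after its clock edge, to add a constant data-to-output delay inside that window, and to launch early data at the clock edge. The function and parameter names are hypothetical.

```python
def borrow_through_pipeline(logic_delays, period, d_dq, window):
    """Return the time borrowed at each stage boundary (same unit as inputs).

    Assumed simplified model (hypothetical, for illustration only):
      - every CSE is transparent for `window` after its clock edge and adds
        a constant data-to-output delay `d_dq` (flat D-Q characteristic);
      - data that arrives before the clock edge is launched at the edge.
    Raises ValueError if data arrives after the transparency window closes.
    """
    borrowed = []
    lateness = 0.0  # how long after its clock edge data enters the current CSE
    for stage, delay in enumerate(logic_delays, start=1):
        # Arrival time at the next CSE, measured from the next clock edge.
        arrival = lateness + d_dq + delay - period
        if arrival > window:
            raise ValueError(
                f"stage {stage}: data arrives {arrival} after the clock edge, "
                f"outside the {window} transparency window"
            )
        lateness = max(0.0, arrival)  # early data simply waits for the edge
        borrowed.append(lateness)
    return borrowed


if __name__ == "__main__":
    # Stage 1 is 50 ps too slow; stage 2 is 50 ps faster and pays the time back.
    print(borrow_through_pipeline([850, 750, 800], period=1000, d_dq=200, window=100))
```

In this sketch the 50 ps that stage 1 borrows shortens the time left for stage 2, and a stage that overruns by more than the transparency window is reported as a failure, mirroring the statement that absorption diminishes when the uncertainty becomes excessive.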

[Figure: (a) D-Q delay [ps] versus D-Clk delay [ps] for early, nominal, and late clock arrival, with DDQ = 238 ps marked; (b) Clk, D, and Q waveforms for clock uncertainty tCU = 30 ps and tCU = 100 ps.]
Fig. 11. (a) Clock Skew absorption property: Data-to-Output Delay versus Clock Arrival Time (b) Effects of clock uncertainties on Data arrival in the next cycle [25].

The benefits of the "flat" Data-to-Output characteristic are obvious. We create it by expanding the time window during which the storage element is transparent (transparency window). Widening the transparency window is equivalent to increasing the separation between the two reference events in time: one that opens and the other that closes the CSE. In effect, the storage element behaves as a transparent Latch for a short amount of time after the active clock edge. The wider the transparency window, the wider is the flat region of the Data-to-Output characteristic. Widening the transparency window can be done by intentionally creating a wider capturing pulse in Flip-Flops and Pulsed Latches, or by overlapping the Master and Slave clocks of Master-Slave Latches. A consequence of increasing the transparency window is that the failure region of the Data-to-Output characteristic is moved away from the nominal clock edge. This results in a decrease of the Setup Time (larger negative values) and an increase of the Hold Time of the storage element. While decreasing the Setup Time has no significant effect on the system timing as long as the Data-to-Output delay is constant, a large Hold Time makes the fast-path requirement harder to meet. Thus, the design for clock uncertainty absorption is often traded for longer Hold Time. In many cases, however, these two requirements are not contradictory, since different types of storage elements are used in fast and slow paths. The maximal clock skew that a system can tolerate is determined by the clocked storage elements. If the clock-to-output delay of a clocked storage element is shorter than the hold time required and there is no logic in between two storage elements, a race condition can occur. A minimum delay restriction on the clock-to-output delay is given by:

$$t_{CLK-Q} \ge t_{hold} + t_{skew} \qquad (1)$$

If this relation is satisfied, the system is immune to hold time violations.

The clock uncertainty absorption property shows how the propagation delay of a CSE changes if the arrival of the reference clock is uncertain. Applying the clock uncertainty to a CSE is equivalent to holding the reference clock arrival fixed and allowing the data arrival to change.

More generally, uncertainty absorption should be treated as degradation of the Data-to-Output delay for an uncertain Data-to-Clock delay. As such, it can be used to describe the timing of the CSE when used in time borrowing in exactly the same way as when used for clock uncertainty absorption. Therefore, a "soft clock edge" designates a storage element whose output follows both early and late arrivals of the input, allowing slower stages to borrow time from the faster subsequent stages.

The time borrowing capability and the clock uncertainty absorption are not mutually exclusive. They can be traded off for each other. Fig. 12 illustrates a case where a wide transparency window, denoted as a flat Data-to-Output characteristic, is used both to absorb the clock uncertainties tCU and to borrow time tB from the surrounding stages. The combinational logic of stage 1 takes more time than nominally assigned, and it borrows a portion of the cycle time from stage 2. In general, the storage element may not be completely transparent (i.e. the Data-to-Output characteristic is not completely flat). The combination of clock uncertainty tCU and time borrowing tB causes an increase ∆DDQ in the Data-to-Output delay of the Flip-Flop.

[Figure: three CSEs (Clk 1, Clk 2, Clk 3) separated by Logic 1 (Stage 1) and Logic 2 (Stage 2); timing diagram of nominal and actual logic delays, clock uncertainty tCU around each clock edge, the borrowed time tB, and the actual arrival of D2 due to time borrowing.]
Fig. 12. Time borrowing with uncertainty-absorbing clocked storage elements [25].
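As a small worked illustration of relation (1), the check below evaluates the hold-time condition for two storage elements with no logic between them and reports the largest skew that the pair could tolerate. It is a minimal sketch: the function names and the example numbers are hypothetical, and only the inequality itself comes from the text.

```python
def is_race_free(t_clk_q, t_hold, t_skew):
    """Relation (1): t_CLK-Q >= t_hold + t_skew (all times in the same unit)."""
    return t_clk_q >= t_hold + t_skew


def max_tolerable_skew(t_clk_q, t_hold):
    """Largest clock skew for which relation (1) still holds."""
    return t_clk_q - t_hold


if __name__ == "__main__":
    # Hypothetical numbers in picoseconds.
    print(is_race_free(t_clk_q=100, t_hold=40, t_skew=50))
    print(max_tolerable_skew(t_clk_q=100, t_hold=40))
```

A wider transparency window raises the hold time, so the tolerable skew computed this way shrinks, which matches the trade-off between uncertainty absorption and Hold Time discussed in the text.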

The delay increase ∆DDQ is the same both in the case when the clock uncertainty is tB+tCU with no time borrowing and in the case when the borrowed time between stages is tB+tCU and there is no clock uncertainty.

It should be noted that the practical values of the total borrowed time are about the width of the transparency window of the storage element and in any event shorter than the Hold Time. Better absorption and time borrowing capability can be obtained by widening the transparency window. However, if the transparency window is widened, the Hold Time increases and the short-path requirement becomes harder to meet. Therefore, the use of a wide transparency window is a tradeoff between time borrowing and uncertainty absorption on one side and the Hold Time on the other side. In cases where sufficient minimum delay in the logic path can be assured, widening of this window may be beneficial.

V.

The energy consumed is approximately given by:

$$E_{switching} = \sum_{i=1}^{N} \alpha_{0-1}(i) \cdot C_i \cdot V_{swing}(i) \cdot V_{DD} \qquad (2)$$

where N is the number of nodes, Ci is the node capacitance, α0-1(i) is the probability that a transition occurs at node i, and Vswing(i) is the voltage swing of node i.

Starting from (2), several commonly used techniques applied to minimize energy consumption can be derived:
(a) Reducing the number of active nodes.
(b) Reducing the voltage swing of the switching node.
(c) Reducing the voltage (technology scaling).
(d) Reducing the activity of the node.

The approaches listed in (a)-(d) result in known techniques used in low-power applications. The most common has been "clock gating", which assures that the storage elements in an inactive part of the processor are not switching. A thorough review of the common techniques for low power can be found in [17]. In this paper we limit our consideration to clocking and clocked storage elements suitable for SOC.

A. Dual Edge Triggering

An approach suitable for high-performance and low-power applications is the use of Dual-Edge Triggered (DET) clocked storage elements [23,25]. Substantial power savings in the clock distribution network can be achieved by reducing the clock frequency by one half. This can be done if every clock transition is used as a time reference point, instead of using only one (leading edge or trailing edge) transition of the clock. The main advantage of this approach is that the system operates at half the frequency of the conventional single-edge clocking design style, while obtaining the same data throughput. Consequently, the power consumption of the clock generation and distribution system is roughly halved for the same clock load. In addition, less aggressive clock subsystems can be built, which further reduces power consumption and clock uncertainties.

Dual-edge clocking is based on Dual Edge-Triggered Clocked Storage Elements (DET-CSE), capable of capturing data on both the rising and the falling edge of the clock. The use of a dual-edge clocking strategy requires precise control of the arrival of both clock edges. This can be satisfied with reasonably low hardware overhead. In addition, the clock uncertainty due to the variation of the duty cycle can be partially absorbed by the storage element [20].

There are two fundamental ways of building dual-edge clocked storage elements: Latch-Mux and Flip-Flop, as shown in Fig. 13 (a,b).

Fig. 13. Dual-Edge Triggered CSE: (a) Latch-Mux (b) Flip-Flop topology

B. Dual Edge Triggered Flip-Flop

An example of a DET Flip-Flop design [21] is shown in Fig. 14 (a). The circuit has a narrow data transparency window and a clock-less output multiplexing scheme. The first stage is symmetric, consisting of two Pulse Generating (PG) Latches. It creates the data-conditioned clock pulse on each edge of the clock. The clock pulse is created at node SX on the leading edge and at node SY on the trailing edge of the clock. The second stage is a 2-input NAND gate. It effectively serves as a multiplexer, implicitly relying on the fact that nodes SX and SY alternate in being pre-charged "high" while the clock is "low" and "high", respectively. This type of output multiplexing is very convenient because it does not require clock control. The clock energy is mainly dissipated for pulse generation in the first stage. The clock load of this Flip-Flop is comparable to that of a Single-Edge Triggered Flip-Flop, thus allowing for up to 50% power savings. This makes the DETFF a viable option for both high-performance and low-power systems. This statement is supported by the comparison results taken against a sample of conventional and conditional CSEs [19], as shown in Fig. 14 (d).

VI. CONCLUSION

Clocking techniques and clocked storage elements for System on the Chip are discussed. Given the rapid increase in clock frequency in portable and low-power systems, it is important to consider clocking as the system reaches into multiple GHz speed. For a complete analysis of representative CSEs please see [23]. We expect that current clocking techniques will be serving adequately as long as wire delay continues to scale. In deep sub-micron domain

this may not be sustainable for much longer. At that point the pipeline boundaries start to blur and synchronous design will be possible only in limited domains on the chip. A mix of Synchronous and Asynchronous design is necessary in SOC design. This is the next design challenge when single-chip multiple processing systems start to emerge.

[Fig. 14: (a) schematic of the DET Flip-Flop with 1st stage X, 2nd stage, and 1st stage Y; comparison chart for data activity of 0%, 33%, and 50% across Dual-Edge (Latch-Mux, Flip-Flop), Conditional, and Conventional (single-ended, differential) CSEs.]

ACKNOWLEDGMENT

Contributions from my current and former students are gratefully acknowledged. A database of comparative results exists at www.ece.ucdavis.edu/acsel

REFERENCES

[1] Borkar S, "Design Challenges of Technology Scaling", IEEE Micro, Vol. 19, No. 4, Aug 1999.
[2] Gronowski P.E, Bowhill W.J, Preston R.P, Gowan M.K, Allmon R.L, "High-performance microprocessor design", IEEE Journal of Solid-State Circuits, Vol. 33, No. 5, May 1998.
[3] Bailey D.W, Benschneider B.J, "Clocking design and analysis for a 600-MHz Alpha microprocessor", IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, November 1998.
[4] Friedman E.G (ed.), "Clock Distribution Networks in VLSI Circuits and Systems", IEEE Press.
[5] Schutz J, Wallace R, "A 450MHz IA32 P6 Family Microprocessor", ISSCC Dig. Tech. Papers, pp. 236-237, Feb. 1998.
[6] Burleson W.P, Ciesielski M, Klass F, Liu W, "Wave-Pipelining: A Tutorial and Research Survey", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 6, No. 3, September 1998.
[7] Hauck S, "Asynchronous design methodologies: an overview", Proceedings of the IEEE, Vol. 83, No. 1, Jan. 1995.
[8] Oklobdzija V.G, Sparso J, "Future Directions in Clocking Multi-GHz Systems", Invited presentation, International Symposium on Low-Power Electronics and Design, Monterey, California, August 12-14, 2002.
[9] Hamilton S, "Taking Moore's Law Into the Next Century", Computer, January 1999.
[10] International Technology Roadmap for Semiconductors. http://public.itrs.net/
[11] The National Technology Roadmap for Semiconductors, Semiconductor Industry Association, San Jose, Calif, 2001. http://www.semichips.org
[12] Hemani A, Meincke T, Kumar S, Postula A, Olsson T, Nilsson P, Oberg J, Ellervee P, Lundqvist D, "Lowering power consumption in clock by using globally asynchronous locally synchronous design style", Proceedings of the 36th Design Automation Conference, 21-25 June 1999.
[13] Cotten L.W, "Circuit Implementation of High-Speed Pipeline Systems", AFIPS Proceedings, Fall Joint Comput. Conf., pp. 489-504, 1965.
[14] LSSD Rules and Applications, Manual 3531, Release 59.0, IBM Corporation, March 29, 1985.
[15] Partovi H, Burd R, Salim U, Weber F, DiGregorio L, Draper D, "Flow-through latch and edge-triggered flip-flop hybrid elements", 1996 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, ISSCC, San Francisco, February 8-10, 1996.
[18] Stojanovic V, Oklobdzija V.G, "Comparative Analysis of Master-Slave Latches and Flip-Flops for High-Performance and Low-Power VLSI Systems", IEEE Journal of Solid-State Circuits, Vol. 34, No. 4, April 1999.
[19] Nedovic N, Aleksic M, Oklobdzija V.G, "Conditional Pre-Charge Techniques for Power-Efficient Dual-Edge Clocking", Proceedings of the International Symposium on Low Power Electronics and Design, Monterey, California, August 12-14, 2002.
[20] Saint-Laurent M, Oklobdzija V.G, Singh S.S, Swaminathan M, "Optimal Sequencing Energy Allocation for CMOS Integrated Systems", Proceedings of International Symposium on Quality Electronic Design, pp. 94-99, March 2002.
[21] Nedovic N, Walker W.W, Oklobdzija V.G, Aleksic M, "A Low Power Symmetrically Pulsed Dual Edge-Triggered Flip-Flop", Proceedings of the European Solid-State Circuits Conference, ESSCIRC'02, Florence, Italy, September 24-26, 2002.
[22] Unger S.H, Tan C.J, "Clocking Schemes for High-Speed Digital Systems", IEEE Transactions on Computers, Vol. C-35, No. 10, October 1986.
[23] Clocking and CSE, Advanced Computer System Engineering Laboratory web report: http://www.ece.ucdavis.edu/acsel
[24] Oklobdzija V.G (ed.), "High-Performance System Design: Circuits and Logic", IEEE Press, July 1999.
[25] Oklobdzija V.G, Stojanovic V, Markovic D, Nedovic N, "Digital System Clocking: High-Performance and Low-Power Aspects", J. Wiley, January 2003.
EDP [fJ/250MHz, fJ/500MHz] EDP [fJ/250MHz, 0 [16] Tschanz James, Siva Narendra, Zhanping Chen, Shekhar Borkar, F F F F F F F M F F F YM Arm Manoj Sachdev, Vivek De, “Comparative Delay and Energy of GL CPFF CCF CPFF DTF S HL SD g T GFLF n SAb C2MOS im F- o DETCPFFDETDTF Diff Single Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops DTF Str for High-Performance , Proceedings of the 2001 International Symposium on Low Power Electronics and Design, Huntington Beach, California, August 6-7, 2001. (b) [17] T. Kuroda and T. Sakurai “Overview of Low-Power ULSI Circuit Fig. 14. (a) Dual-Edge Triggered Flip-Flop [21], (b) Energy- Techniques”, IEICE Trans. Electronics, E78-C, No 4, April 1995, Delay product for different CSE input activities [23]. pp.334-344, INVITED PAPER, Special Issue on Low-Voltage Low-Power Integrated Circuits.
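As a rough illustration of the dual-edge clocking argument made in the text (half the clock frequency, same data throughput, roughly half the clock power for the same clock load), the sketch below uses the standard dynamic-power relation P = C·V²·f for a clock net. The load capacitance, supply voltage, and frequency are hypothetical values chosen only for illustration; they are not taken from the paper, and the model ignores the extra energy a dual-edge storage element itself may consume.

```python
def clock_power(c_load_farads, vdd_volts, f_clk_hz):
    """Dynamic power of a clock net that charges and discharges once per cycle."""
    return c_load_farads * vdd_volts ** 2 * f_clk_hz


def data_throughput(f_clk_hz, capturing_edges_per_cycle):
    """Data captures per second for a given clock and number of active edges."""
    return f_clk_hz * capturing_edges_per_cycle


# Hypothetical numbers (assumptions, not from the paper).
C_LOAD = 1e-9      # 1 nF total clock load
VDD = 1.2          # supply voltage in volts
F_SINGLE = 1e9     # single-edge design clocked at 1 GHz

# Single-edge: one capturing edge per cycle at full frequency.
p_single = clock_power(C_LOAD, VDD, F_SINGLE)
t_single = data_throughput(F_SINGLE, 1)

# Dual-edge: both edges capture data, so the clock runs at half frequency.
F_DUAL = F_SINGLE / 2
p_dual = clock_power(C_LOAD, VDD, F_DUAL)
t_dual = data_throughput(F_DUAL, 2)

print(f"single-edge: {p_single:.3f} W, {t_single:.3e} captures/s")
print(f"dual-edge:   {p_dual:.3f} W, {t_dual:.3e} captures/s")
print(f"clock power ratio (dual/single): {p_dual / p_single:.2f}")
```

The ratio is independent of the assumed capacitance and voltage, since both designs drive the same clock load at the same supply; only the frequency term changes.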