Clock Design Technology for High-Performance Processors

Clock Design Technology for High- Performance Processors

 Kinya Ishizaka  Hiroaki Komatsu

This paper introduces technology used in designing high-performance processors installed in servers so that they can distribute high-frequency clock signals among the whole chip. Clock synchronization systems must be designed so that clock signals can reach all synchronous sequential circuits in the processor at the same time. The influence of interconnect inductance cannot be ignored because clock frequencies have been getting higher, chip sizes have been getting larger, and low- resistance interconnects have appeared. Therefore, it is necessary to give sufficient consideration to inductance when designing such systems. The technology introduced in this paper uses both circuit design techniques to suppress clock skew and crosstalk noise, and CAD techniques which employ these techniques in an actual chip. The influence of inductance in clock transmission can be analyzed electrically by simulating actually designed clock circuits at the design stage, and this can help lead to higher performance processors.

1. Introduction inductance in processor interconnection cannot The high-performance, high-speed processor be ignored, and this has a serious impact in designed by the authors adopts a clock terms of delay and cross talk noise. Therefore, synchronization system. This makes it possible adequate consideration should be given to this for all the circuit blocks to perform processing issue in designing a circuit. and signal transmission at the same timing The processor designed by the authors in the by distributing just one clock signal among the current study distributes clock signals of 3.0 GHz whole chip. By using this system, clock signals to all over the domain of approximately 21 mm should reach all the synchronous sequential square. This paper explains the design approach circuits in the processer at the same time, ideally. taken to achieve this signal distribution. However, due to differences in signal channels, some timing gap is observed in actual practice. 2. Clock distribution system This gap is called “clock skew.” Minimizing it is The current processor design adopts a the key to achieving high performance. clock tree system based on an H shape, which is To minimize clock skews while distributing illustrated in Figure 1. Clock skew is reduced a high-frequency clock over a long distance across by having equal length, equal delay interconnects the whole processor, it is effective to improve that enable overall clock delay to be reduced as performance to drive a clock distribution buffer much as possible. Where delay balance is poor, and to reduce the resistance in the clock signal a dummy load not related to logics is connected wiring (hereafter, clock wiring). In implementing to the clock tree to establish a preferred delay these measures, however, the influence on balance.

136 FUJITSU Sci. Tech. J., Vol. 47, No. 2, pp. 136–141 (April 2011) K. Ishizaka et al.: Clock Design Technology for High-Performance Processors

Figure 1 Clock distribution in core block.

With points where it is hard to have an equal with incorrect data being output to the next delay only by distributing load and changing the step due to an undesired change in the input interconnect path, several buffers with identical data transmitted from the previous step before cell sizes are used. Each of these buffers has determining the output data from the latch), a different delay line (at increments of 10 ps) because input data are transferred to output inside and the delay can be changed drastically, continuously while the clock is active. Because if designers replace one buffer with another one. of this, a signal with a narrow pulse width (in the While these design approaches are taken order of several tens of picoseconds) is input to to minimize the clock skew, it is still difficult to the latch as a clock instead of using a signal with achieve a sufficient performance actually in some the duty at 50%. However, a narrow pulse width cases due to a clock skew exceeding its design might lead to a pulse disappearance because of a value. This may be because of factors such as deviation in chip specifications in some cases. To process-related ones inside the chip, temperature avoid this, a function to widen the pulse width deviations or noise from the power source. To is installed to secure a mean to check or avoid address this issue, a ring oscillator using a clock disturbances. path is made for each clock domain (i.e. a circuit As part of an energy conservation domain controlled by the same clock) in advance, countermeasure, a clock distribution buffer in‑ and inter-domain clock skew is determined by tegrating an “Enable” function for output signals observing the frequency of these ring oscillators. is used. By using this buffer, any unnecessary Then, a clock distribution buffer is installed that supply of the clock to inactivated areas is blocked. enables an increase or decrease in clock delay In addition, active power is reduced with this via a scan chain so as to maximize the processor countermeasure. performance during manufacture based on the above-mentioned observation result. 3. Inductance effect of clock In a sequential circuit, a transparent latch wiring is used to reduce the path delay. In using this Clock signals are transmitted over a long type of circuit, sufficient consideration should be distance with a higher frequency compared with given to avoid racing (a phenomena associated signals fused or general data transmission. This

FUJITSU Sci. Tech. J., Vol. 47, No. 2 (April 2011) 137 K. Ishizaka et al.: Clock Design Technology for High-Performance Processors

is achieved by using buffers with a higher drive Besides, inconsistent impedance may lead performance and upper layer interconnects to abnormal waveforms such as overshoot or characterized by a wider width, larger thickness undershoot [Figure 2 (b)] and bump [Figure 2 and lower resistance. For these reasons, the (c)]. influence of inductance cannot be ignored. These inductance influences should be Figure 2 (a) compares the simulation results of clearly identified in the circuit design phase the RLC model, where the impact of inductance is to avoid possible disturbance in an actual considered, and the RC model, where the impact processor. To cope with the inductance influences of inductance is not considered, for the same properly, the delay is calculated based on circuit interconnect. It is observed that the waveforms simulation equivalent to Simulation Program are steeper under the influence of inductance. with Integrated Circuit Emphasis (SPICE) using the RLC interconnect model when calculating a clock path delay. At the same time, a waveform check is carried out. If the waveform is judged as abnormal, appropriate correction is made by changing the clock wiring path, buffer installation

RLC model position and buffer drive performance so that a RC model normal waveform can be obtained.

4. Wiring geometry to control noise (a) Waveforms of RLC model and RC model Large inductance in clock wiring will increase the induced cross talk noise in the surrounding wiring. Besides, it is necessary to pay attention to cross talk noise caused by capacity coupling between the clock wiring and its surrounding wiring. To minimize these impacts, we implement a noise reduction countermeasure by arranging a shield in the vicinity of the clock wiring with some dedicated power source wiring (Figure 3). (b) Overshoot and undershoot Multiple pieces of shield power wiring are arranged in parallel around the clock wiring. These pieces of wiring serve as return current channels for the clock signal current. By arranging the shield current wiring as close to the clock wiring as possible, loop inductance can be reduced. Further, the striped geometry of the clock wiring formed by its fine division will increase the number of return current channels, resulting in a further reduction of loop (c) Abnormal waveform (bump) inductance. As a consequence, induced cross talk 1) Figure 2 noise is reduced. Inductance effect. Besides, the presence of shield power source

138 FUJITSU Sci. Tech. J., Vol. 47, No. 2 (April 2011) K. Ishizaka et al.: Clock Design Technology for High-Performance Processors

Clock wiring Shield Clock

Shield power source wiring in the same layer Shield Clock

Figure 4 Divided wiring and wiring layer crossover. Shield power source wiring in the lower layer

Figure 3 complicated divisions, initial wiring as well as Cross-sectional view of clock wiring. later modification processes would require an extremely long time. To simplify the process of wiring helps to reduce capacity coupling between editing these pieces of wiring, our team developed the clock wiring and general data wiring, which a “unified wiring” feature and “interlock wiring also reduces cross talk noise associated with the of lower layer shield wiring” feature. capacity. 1) Unified wiring This configuration offers an additional Unified wiring is an approach based on advantage of making it easier to extract wiring representing multiple pieces of wiring by just a inductance and capacity, because clock wiring is single piece of virtual wiring. It defines a virtual firmly surrounded by shield power source wiring. piece of wiring width that encloses all multiple pieces of wiring within the same layer. Wiring 5. CAD support (interconnect is carried out as a unified wiring state by using function) an interactive interconnect editor [Figure 5 (a)]. In designing a processor using advanced Upon completing the wiring process, real divided technologies, automatic interconnects are used wiring is generated from this unified wiring based as an output network for clock final distribution on the division rules by running a dedicated tool supplying to the latch. However, the critical [Figure 5 (b)]. In a CAD database, we set up a clock distribution network located before system where one network has two types of data the network for final distribution output, for (data for real divided wiring and data for virtual which timing-related restrictions (e.g. clock unified wiring) so that they can be treated as skew) are applicable, has interconnects that separate data. Also by employing a width not are designed based on a manual. This design used for ordinary wiring as the width of the virtual uses an interactive interconnect editor so that wiring, it is possible to identify the original clock the designer’s intention to be compliant with wiring just by looking at information about the these restrictions can be easily reflected in the wiring width. By handling the defined virtual interconnects.2), 3) wiring in the same way as the real wiring, it is When designing the clock wiring possible to operate divided wiring in the same sandwiching a shield, and if wiring across layer as interlinked wiring. This allows an multiple distribution layers is involved, a very operability equivalent to that of ordinary wiring complicated geometry is formed where a via is in editing. As a secondary benefit, unified wiring allocated to each piece of the divided wiring and improves the visibility of drawings because of its shield wiring as illustrated in Figure 4. If a simplified illustration. wiring process takes place in such a state with 2) Interlock wiring of lower layer shield wiring

FUJITSU Sci. Tech. J., Vol. 47, No. 2 (April 2011) 139 K. Ishizaka et al.: Clock Design Technology for High-Performance Processors

Virtual via grouping the data. With regard to the above-mentioned unifi ed wiring, a problem associated with excessive checking may occur if the wiring width is different from the width assumed, if the same design rule checks (e.g. spacing) as those for ordinary wiring are applied. This problem was solved by obtaining information for real divided wiring in the event of checking. To obtain either one of two types of wiring data (i.e. virtual wiring data or real Virtual clock wiring wiring data) depending on the contents of CAD tool processing, a unifi ed data access function is used. This makes it possible for the CAD (a) Unified wiring program to process data without being aware of divisional operations. If the shield power source wiring included in virtual wiring is connected with the power source, a short circuit seems to Shield wiring be generated between virtual wiring and the Via power source. So as not to confuse the designer, changes were made to the way of expressing data.

Via Such changes include displaying check results based only on real wiring or displaying wiring information of actual wiring.

6. CAD support (delay

Clock wiring calculation) In conventional clock wiring, no closed loop is generated in wiring patterns within a single (b) Divided wiring network, when there is no divided wiring and the Figure 5 actual wiring is comprised of just a single piece of Example of wiring representation. wiring. However, in the case of divided wiring, a via is arranged to change wiring layers. A closed loop is generated in such a point. Because of Next, to improve operability and reduce this, it is impossible to calculate a wiring delay human errors in operation, lower layer shield in simplifi ed approaches including widely used power source wiring is generated in linkage with calculations based on the Elmore delay model. the upper layer clock wiring, if clock wiring has Further, to calculate delays with high accuracy lower layer shield power source wiring. Also, by considering the impact of inductance, it is a dedicated shield wiring category is defi ned so essential to calculate delays based on a waveform that the concerned wiring can be regarded as analysis typically used in a SPICE simulation. clock wiring as a signal during the editing, while Meanwhile, when designing a processor, it can be regarded as power source wiring during analysis as a whole chip is carried out for more the via connection or checking. The purpose of accurate timing analysis because it is necessary this approach is to facilitate data handling by to take the following factors into consideration:

140 FUJITSU Sci. Tech. J., Vol. 47, No. 2 (April 2011) K. Ishizaka et al.: Clock Design Technology for High-Performance Processors

dummy metal inserted to achieve leveling, the above-mentioned approaches, it was possible and availability of wiring on the upper layer. to reduce analysis time by reducing the circuit To handle the data on the whole chip, a time scale and parallel processing. necessary for processing, i.e. turn-around-time Besides, we established a system to (TAT), is an important issue to consider in the automatically detect undesirable waveforms design phase. To be specific, we need to address such as overshoot or undershoot and bump from an issue associated with a greater circuit scale the waveform information upon executing a and prolonged TAT when the whole chip is circuit simulation. Through automating visual analyzed. inspections on a large amount of clock waveform Therefore, to ensure accurate timing data, a significant reduction in the number of analysis as a whole and to reduce TAT, we carry procedures and human errors could be achieved. out a delay calculation equivalent to SPICE only for the clock distribution circuits. The analysis 7. Conclusion is performed by conducting the following four This paper explained outlines of circuit steps: technology and measurement technology to 1) Generation of a SPICE net list from the minimize clock skew or noise as well as CAD capacity extraction results technology to be embodied in real chips. These 2) Circuit division with the focus on number of are technologies used to design a clock in high- steps and connections performance, high-speed processor application 3) Simulation of circuit by using a high-speed, for servers. By promoting the development of high-accuracy SPICE compatibility tool sophisticated clock control technologies as a 4) Feedback of analysis results further advancement of technologies to realize With regard to the circuit division mentioned sophisticated clocks and also as an effort for in Step 2), a tree structure from PLL as a clock energy conservation, we hope we can help to source to chopper to drive a latch in the final enhance processor performance and generate a step is divided into three groups in the direction driver to make servers more competitive. of number of steps. Evaluation is conducted one by one from the first group. In the second and References the third groups, there are multiple inputs that 1) C. Cheng et al.: LSI Interconnect Analysis and Synthesis. John Wiley & Sons, Inc., 2000. serve as starting points of the tree structure. 2) N. Ito et al.: A physical design methodology for Further division is carried out for each of 1.3 GHz SPARC64 microprocessor. International Conference on Computer Design, 2003, pp. these input terminals and circuit simulation is 204–210. performed in parallel. Further, by establishing 3) N. Ito et al.: Timing layout design approach for designing 2.16 GHz SPARC64 microprocessor. a system to manage the simulation by using a (in Japanese), DA Symposium 2005, pp. 255–260, queue, we made it possible to run a host of circuit 2005. simulations with excellent efficiency. By using

Kinya Ishizaka Hiroaki Komatsu Fujitsu Ltd. Fujitsu Ltd. Mr. Ishizaka is engaged in development Mr. Komatsu is engaged in development of technologies for high-performance of CAD systems for LSI layout design in processors. processor applications.

FUJITSU Sci. Tech. J., Vol. 47, No. 2 (April 2011) 141