Timing Circuit Design in High Performance DRAM
Chapter 11
Timing Circuit Design in High Performance DRAM

Feng (Dan) Lin

Abstract  The need for high-performance dynamic random access memory (DRAM) becomes a pressing concern as processor speeds race into the multi-gigahertz range. For a high-performance computing system, such as a high-end graphics card or game console, memory bandwidth is one of the key limitations. One way to address this ever-increasing bandwidth requirement is to increase the data transfer rate. As the I/O speed climbs, precise timing control becomes critical and challenging given the variations and limitations of nano-scaled DRAM processes. In this chapter, we explore the two most important timing circuits used in high-performance DRAM: the clock distribution network (CDN) and the clock synchronization circuit (CSC).

Keywords  Memory interface • DRAM • Memory bandwidth • Clock distribution network (CDN) • Clock synchronization circuit (CSC) • Source-synchronous • Analog phase generator (APG) • De-skewing • Duty-cycle correction (DCC) • Delay-locked loop (DLL) • 3D integration

11.1 Introduction

11.1.1 Memory Interface

A typical memory system, shown in Fig. 11.1, generally includes a memory controller, DRAM devices, and memory channels between them. As part of the DRAM device, the memory interface serves as a gateway between the external world and the internal periphery logic (i.e. the array interface and the command and address logic). A state-of-the-art memory interface may include data transceivers, SERDES (serializer/deserializer), clock distribution networks (CDNs), and clock synchronization circuits (CSC).

F. Lin (*)
Senior Member of Technical Staff, Micron Technology, Inc.
e-mail: [email protected]

K. Iniewski (ed.), CMOS Processors and Memories, Analog Circuits and Signal Processing, DOI 10.1007/978-90-481-9216-8_11, © Springer Science+Business Media B.V. 2010

[Fig. 11.1 High-performance memory interfaces (clock paths and timing circuits in gray). Block diagram: the memory controller drives Data, WCK, CK, and C/A over memory channels into the DRAM memory interface (data I/O, CDNs, CSC, and C/A logic), which fronts the array interface.]

Besides the bidirectional data I/Os, which move data into and out of the memory interface, there are also clock and command and address (C/A) inputs, which are unidirectional. The data and C/A are retimed within the memory interface using the distributed clocks.

External to the DRAM interface, the memory controller communicates with one or multiple DRAMs via memory channels and buses. Various bussing topologies and differing channel characteristics introduce latency and timing skews, and may profoundly impact overall system timing performance. In some literature, the data channel is also referred to as a link, which includes the transceivers in both the memory controller and the DRAM, and the connection between them. Total memory bandwidth can be increased either by increasing the number of links or by increasing the per-link data transfer rate. Bandwidth limitations in both the channel and the DRAM interface put an upper bound on the data rate. The package size, the number of DRAMs, and the metal routing layers on the PCB also constrain the number of links. Power efficiency (usually specified in mW/Gb/s) and cost are two further considerations. Although each piece of the link and its associated clocking scheme is worth a detailed analysis when pursuing higher system performance, with limited space we now turn our focus to timing circuit design within the DRAM interface.

11.1.2 Evolution of the DRAM Interface and Timing Specifications

Timing circuits are generally tied to a clock. Following the advent of synchronous operation, DRAM device specifications and performance began a slow migration towards metrics related to clock frequency [1].
From EDO (extended data out) asynchronous DRAM to DDR (double data rate) synchronous DRAM, the memory interface has evolved from clock-less asynchronous operation to synchronous data operation on both rising and falling clock edges. The various generations of DDR and its faster graphics cousin GDDR, such as DDR2, DDR3, GDDR3, GDDR4, and GDDR5, encompass evolutionary advances in technology to achieve higher overall data transfer rates. The fastest data transfer rate reported for GDDR5 [4] is around 6 Gigabits per second per pin, compared to only 0.133 Gigabits per second for a low-end DDR system. To enable such a speedup, changes in burst length (BL), improvements in signaling technology and packaging, and significant advances in circuit design are imperative.

Designing timing circuitry starts with specifications. Two important timing parameters define how fast the DRAM device can cycle data (clock cycle time, or tCK) and how many clock cycles the system must wait for data following a Read request (CAS latency, or CL). To achieve higher performance, it is desirable to have DRAM devices with the lowest CAS latency possible at any given operating frequency. Unfortunately, the timing delays and timing variations associated with the data and clock paths make these targets difficult to meet at higher overall data rates.

Such timing delays and variations are generally referred to as clock skew and jitter. When a signal (e.g. CK) propagates through a memory interface, as in Fig. 11.1, a timing delay is introduced, also known as clock skew. Besides static timing delay, high-speed signal distribution is also susceptible to duty-cycle distortion and timing mismatch. Under process, voltage, and temperature (PVT) variations, the propagation delay may change dynamically (i.e. jitter) and further degrade timing performance. Both skew and jitter make it difficult to predict and track latency [14] (e.g. CL) as speeds increase.
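The interplay of tCK, CL, skew, and jitter can be made concrete with a back-of-the-envelope budget. The sketch below is only illustrative: the part parameters and skew/jitter figures are invented assumptions, not specifications from any device.

```python
# Illustrative DRAM timing budget. All numbers are assumed, not from a datasheet.

def read_latency_ns(cl_cycles: float, tck_ns: float) -> float:
    """CAS latency expressed in nanoseconds: cycles waited after a Read."""
    return cl_cycles * tck_ns

def data_valid_window_ns(tck_ns: float, skew_ns: float, jitter_ns: float) -> float:
    """At double data rate one unit interval is tCK/2; static skew and
    dynamic jitter both eat into the window left for capturing data."""
    unit_interval = tck_ns / 2.0
    return unit_interval - skew_ns - jitter_ns

# Hypothetical part: tCK = 1.25 ns (800 MHz clock, 1.6 Gb/s/pin DDR), CL = 11.
print(read_latency_ns(11, 1.25))            # 13.75 ns from Read to data
print(data_valid_window_ns(1.25, 0.15, 0.10))  # 0.375 ns left of a 0.625 ns UI
```

The second function shows why skew and jitter dominate at speed: halving tCK halves the unit interval, but the skew and jitter terms do not shrink with it unless the timing circuits improve.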
11.1.3 Source-Synchronous Interface and Matched Routing

There are several ways to mitigate the timing variations caused by on-die signal distribution. Traditionally, memory interfaces favor a source-synchronous relationship between data and clock (or strobe) to alleviate PVT sensitivities. In some literature, the clock bundled with the data is also referred to as a forwarded clock. As shown in Fig. 11.1, the additional write clocks (WCK) can facilitate high-speed data capture by placing the clock close to its destination (e.g. the transceivers), which in turn shortens the clock distribution. A differential write clock can further reduce duty-cycle distortion at the cost of additional pins. A forwarded read clock (not shown in Fig. 11.1), synchronized with the returned data, may facilitate data capture at the memory controller side as well.

To maintain source-synchronicity, a technique called matched routing [1, 2] has been utilized to match the internal clocks and data all the way from the device pins to the capture latches. Logical effort [3] matching and point-to-point matching are both used in this method to address the various topologies of the clock and data paths. With this technique, the clock distribution delay is modeled and replicated in each data path. Any timing change caused by PVT variations can be tracked and zeroed out as long as matched routing is applied. Although gate and wire delays may scale linearly with DRAM processes, matched routing adds extra circuits to each data path and increases power and area consumption. With single-ended signaling for the data transceivers and the lengthened data path, duty-cycle distortion is a major concern for this approach. At higher speeds and for increased data bus width (pin count, or number of links), matched routing may not be practical.

An alternative scheme employs adjustable timing for the clock distribution network (CDN). In this case, the data capture latches are located close to the data input pads.
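The matched-routing idea described above, replicating the clock distribution delay inside each data path so that PVT drift cancels at the capture latch, can be sketched behaviorally. The delay values and the PVT scaling factor below are invented for illustration; real matching is done at the circuit level with logical effort and routing replicas.

```python
# Behavioral sketch of matched routing (assumed, illustrative delays).
# PVT drift scales all gate/wire delays together; if the data path carries
# a replica of the clock-distribution delay, the relative skew cancels.

NOMINAL_CDN_DELAY_NS = 0.80   # assumed clock-distribution delay

def capture_skew(pvt_factor: float, matched: bool) -> float:
    """Skew between clock and data at the capture latch.
    pvt_factor scales every delay (e.g. 1.2 = slow corner)."""
    clock_delay = NOMINAL_CDN_DELAY_NS * pvt_factor
    # Matched routing inserts a CDN-delay replica into the data path;
    # the unmatched data path is assumed shorter (0.30 ns) and unmodeled.
    data_delay = (NOMINAL_CDN_DELAY_NS if matched else 0.30) * pvt_factor
    return clock_delay - data_delay

for f in (0.8, 1.0, 1.2):          # fast, nominal, slow corners
    print(f, capture_skew(f, matched=True), capture_skew(f, matched=False))
# Matched skew stays at zero across corners; unmatched skew drifts with PVT.
```

The sketch also hints at the cost noted in the text: the replica delay is extra circuitry toggling in every data path, which is exactly the power and area penalty that makes matched routing unattractive at high pin counts.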
The latches are generally built from sense-amp style circuits to maximize their gain and input sensitivity and to minimize setup and hold times. The CDN delay can be backed out through training by the memory controller. The goal of training is to optimize capture timing at the input data latches. The memory controller accomplishes this by delaying or advancing the clock until the latches operate in the center of the data eye. Although process variation and static timing offsets can be tuned out by initial training, delay variations across the CDN due to voltage and temperature (VT) drift may still exist. Retraining may be necessary if these delay variations are too great, further complicating the link interface design.

To mitigate VT sensitivity for high-speed memory operation, a novel multi-phase, voltage- and temperature-insensitive CDN [6] will be introduced in Section 11.2. Design considerations for different CDN topologies will be analyzed with a focus on power and performance. Simulation data using a 50 nm DRAM process will be presented for evaluation.

11.1.4 Timing Adjust Circuitry

Besides training, another way to compensate for timing skew is to use a clock synchronization circuit (CSC), such as a phase-locked loop (PLL) or a delay-locked loop (DLL). Both the PLL and the DLL are closed-loop feedback systems with adjustable delays and a timing comparator (i.e. a phase frequency detector). A feedback signal is compared against a reference signal, and the delay is adjusted until the phase relationship between the feedback and reference signals approaches a predetermined condition. For the PLL, however, the feedback signal is generated by an oscillator and must achieve both phase and frequency lock to the reference. PLLs are usually found in communication systems and microprocessors for clock data recovery (CDR) and frequency synthesis. DLLs, on the other hand, are widely used in memory interfaces for clock synchronization and de-skewing.
For instance, a DLL plays an important role in DRAM output timing, in which the output data are synchronized with the incoming system clock (e.g. CK). The overall timing path monitored by the DLL includes not only the input clock capture and the clock distribution network, but also the output data path.
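The closed-loop behavior described above can be sketched with a simple behavioral model of a register-controlled DLL: a replica of the I/O path sits in the feedback, and the loop lengthens a tapped delay line until the feedback edge lands on a reference clock edge, i.e. the total path delay becomes a whole number of clock periods. The period, replica delay, and tap size below are invented, illustrative values, and the monotonic search is a simplification of a real phase-detector-driven loop.

```python
# Behavioral sketch of a register-controlled DLL (assumed, illustrative numbers).
# Lock condition: I/O-model delay + delay line = integer multiple of tCK,
# so the output clock edge aligns with the external CK edge.

TCK_NS = 2.5          # assumed reference clock period
IO_MODEL_NS = 1.1     # assumed replica of input buffer + output path delay
STEP_NS = 0.05        # one delay-line tap

def phase_error(delay_line_ns: float) -> float:
    """Signed distance from the feedback edge to the nearest CK edge."""
    err = (IO_MODEL_NS + delay_line_ns) % TCK_NS
    return err if err <= TCK_NS / 2 else err - TCK_NS

delay_line = 0.0
for _ in range(200):                       # lock loop
    if abs(phase_error(delay_line)) < STEP_NS / 2:
        break                              # within half a tap: locked
    delay_line += STEP_NS                  # lengthen the line one tap

# Locked: delay line ends up near 1.4 ns, making the total path one full tCK.
print(round(delay_line, 2), round(phase_error(delay_line), 3))
```

Because the DLL only delays an existing clock rather than generating one, it locks without the frequency-acquisition step a PLL needs, which is one reason DRAMs favor DLLs for output de-skewing.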