electronics

Article ArticleA 100 Gb/s Quad-Lane SerDes Receiver with a PI- A 100 Gb/s Quad-Lane SerDes Receiver with a Based Quarter-Rate All-Digital CDR PI-Based Quarter-Rate All-Digital CDR Heejae Hwang and Jongsun Kim * Heejae Hwang and Jongsun Kim * Department of Electronic and Electrical Engineering, Hongik University, Seoul 06983, Korea; Department of Electronic and Electrical Engineering, Hongik University, Seoul 06983, Korea; [email protected] [email protected] * Correspondence: [email protected]; Tel.: +82-2-320-3014 * Correspondence: [email protected]; Tel.: +82-2-320-3014  Received: 19 June 2020; Accepted: 5 5 July July 2020; 2020; Published: 9 9 July July 2020 2020 

Abstract: A 100 GbGb/s/s quad-lanequad-lane SerDesSerDes receiverreceiver withwith aa phase-interpolatorphase-interpolator (PI)-based(PI)-based quarter-ratequarter-rate all-digitalall-digital clock and data recovery (CDR)(CDR) isis presented.presented. The proposed CDR utilizes a multi-phasemulti-phase multiplying delay-lockeddelay-locked loop (MDLL) to generate the eight eight-phase-phase reference clocks, which achieves multi-phasemulti-phase frequency multiplication wit withh a small area andand lessless powerpower consumption.consumption. The shared MDLL generates and distributes eight eight-phase-phase clocks to each CDR. The proposed CDR uses a new initial phase tracker that uses a preamble to achieve a fast lock time of about 12 ns and to provide a constant output data sequence. The CDR CDR utilizes utilizes quarter quarter-rate-rate 2x 2x-oversampling-oversampling architecture, and the PI controller is designed fullfull custom to minimize the loop latency. To improve the dithering jitter performance of the recovered clock, thethe decimation factorfactor ofof thethe CDR can be adjustable. Also, a new continuous-timecontinuous-time linear equalizer (CTLE) receiver w wasas adopted to reduce power consumption and achievedachieved aa data data rate rate of 25 of Gb 25/s /lane. Gb/s/lane. The proposed The proposed SerDes receiver SerDes with receiver a digital with CDR a digital is implemented CDR is inimplemented 40 nm CMOS in technology.40 nm CMOS The technology. 100 Gb/s four-channel The 100 Gb/s SerDes four-channel receiver SerDes (4 CTLEs receiver+ 4 CDRs (4 CTLEs+ MDLL) + 4 occupiesCDRs + MDLL) an active occupies area of an only active 0.351 area mm of2 onlyand 0.351 consumes mm2 and 241.8 consumes mW, which 241.8 achieves mW, which a high achieves energy eaffi highciency energy of 2.418 efficiency pJ/bit. of 2.418 pJ/bit.

Keywords: SerDesSerDes;; clock and data recovery recovery;; CDR CDR;; receiver;receiver; MDLL;MDLL; equalizer

1. Introduction The role ofof opticaloptical oror electricalelectrical networks networks that that enable enable high-performance high-performance servers servers and and data data centers centers to performto perform energy energy proportional proportional computing computing [1] is increasing [1] is increasing day by dayday. byTherefore, day. Therefore, there is an there increasing is an demandincreasing for demand energy-e forffi cientenergy electrical-efficient serial electrical links (asserial shown links in (as Figure shown1) thatin Figure require 1) data that ratesrequire of overdata 25rates Gb of/s/ overlane for25 Gb/s/lane tera-byte computingfor tera-byte in computing recent servers in recent and data servers centers. and data centers.

Transmitter (Tx) Chip Receiver (Rx) Chip

Parallel Driver CDR Parallel Equalizer Data + EQ Recovered Data Channel CDRIN Data Data Serializer Deserializer Sampler Data_Tx Data_Rx

Recovered CLKTX CLKRX Clock Clock PLL PLL Recovery

Reference Reference Clock Clock Figure 1. TypicalTypical high high-speed-speed serial-linkserial-link SerDes architecture.architecture.

Figure1 1 shows shows aa blockblock diagramdiagram ofof a typical high high-speed-speed serial-linkserial-link serializerserializer/deserializer/deserializer ( (SerDes)SerDes) architecture [[22–8] where the transmitter ( (Tx)Tx) chip and the receiver (Rx) chip are connected through a backplane or copper cable channel. The serializer of Tx converts slow parallel data into high-speed serial Electronics 2020, 9, x; doi: FOR PEER REVIEW www.mdpi.com/journal/electronics

Electronics 2020, 9, 1113; doi:10.3390/electronics9071113 www.mdpi.com/journal/electronics Electronics 2020, 9, 1113 2 of 16

data by using the high-frequency clock, CLKTX, generated by the phase-locked loop (PLL). The converted serial data (DATA_TX) is transmitted to the Rx chip through the lossy channel. The equalizer of the Rx chip receives the severely distorted DATA_RX signal. It compensates for the channel losses and inter-symbol interference (ISI) to generate the clock and data recovery (CDR)IN signal with an open eye. Subsequently, the clock and data recovery (CDR) uses a data sampler and a clock recovery circuit to generate the recovered data and clock from the random input data. The recovered data is then converted into slow parallel data through the deserializer. Usually, the PLL of Rx receives the low-frequency reference clock and generates the multiplied high-frequency CLKRX signal for the CDR. Conventionally, the CDR architectures have been implemented based on a PLL due to the advantage of an input jitter filtering capability. However, conventional PLL-based CDRs have a long lock time and a jitter accumulation problem [9,10]. To implement an efficient energy proportional computing network, burst-mode communication has been used for passive optical networks (PON) [11,12] and recently, an attempt has been made to increase the power efficiency by applying a burst-mode to a memory interface [13]. Therefore, a CDR technology having a fast lock time (= power-on time or acquisition time) is very important for low-power burst-mode implementation. It has become common to increase data throughputs using multiple lanes (or channels) of high-speed serial links as the aggregate I/O bandwidth of a single chip exceeds hundreds of Gb/s. Among these high-speed serial buses, multi-lane source synchronous serial-links have been used for backplane applications that provide connectivity between CPU and CPU, CPU and dual in-line memory modules (DIMMs), and CPU and bridge chips. Multiple serial point-to-point lanes can be placed in parallel to increase the aggregate bandwidth. The most representative multi-lane source synchronous serial-links are QuickPath Interconnect (QPI) [14] and AMD HyperTransport (HT) [15]. Figure2 shows typical multi-lane high-speed source synchronous serial-link architectures [16]. As shown in Figure2a, in the source synchronous clocking structure, the Rx receiver uses the transmitted clock almost identically correlated to the noise environment of the Tx chip so that high-speed data can be restored with low power and low latency. If half-rate sampling is used on the Rx chip, the frequency of the transmitted clock can be lowered to half. As the data rate of the channel increases above a few Gb/s/lane, the method of Figure2a has reached its limit, and as shown in Figure2b, PLL (or DLL or just simple deskew circuits) has been added to the clock path at the Rx side, through which improved timing margin and skew cancellation [17–21]. If 1:2 de-multiplexing is used at the receiver path of each data lane, the forwarded clock rate can be reduced to 1/2 frequency. As shown in Figure2c, as the data rate increases above about 10 Gb /s/lane, CDR and equalizer circuits are added to each data lane to improve signal integrity and provide much higher communication bandwidth [5,22–28]. However, these conventional multi-lane receivers with CDRs [5,22,23,28] generally have the disadvantages of a large area and high-power consumption. In particular, these multi-lane receivers have a problem that it is difficult to apply to burst mode applications due to the large CDR lock time. In particular, since [5,22,23] all use a PLL as a high-frequency clock generator, there are issues such as a very long lock time, jitter peaking, and jitter accumulation problems that may degrade the CDR performance. Among the various CDR architectures, phase interpolator (PI)-based digital CDRs offer a fast lock time with better jitter performance [5,6,29–32]. In this paper, a new 100 Gb/s quad-lane SerDes receiver with a small area and low-power consumption is presented. The proposed all-digital quarter-rate PI-based CDR utilizes a multiplying delay-locked loop (MDLL) instead of the usual PLL to generate the eight-phase high-frequency clocks required for the PI operation [24,33]. Through this, both frequency multiplication and eight-phase clock generation are performed at the same time, thereby achieving an area and power reduction and low-jitter characteristics. The proposed digital CDR uses a new initial phase tracker that uses a preamble to reduce the lock time. The CDR utilizes a quarter-rate 2x-oversampling architecture, and the PI controller is designed full custom to minimize the loop latency. The update rate of the CDR can be adjustable, resulting in an improved peak-to-peak (p-p) clock jitter performance. As a result, Electronics 2020, 9, x FOR PEER REVIEW 3 of 16 can be adjustable, resulting in an improved peak-to-peak (p-p) clock jitter performance. As a result, it has the advantage of having a fast lock time of about 12 ns and a constant output data sequence. Electronics 2020, 9, 1113 3 of 16 Moreover, a new differential near-ground receiver with an adaptive continuous-time linear equalizer (CTLE) was adopted to reduce the power consumption and ensure high-frequency characteristics up toit has25 Gb/s/lane. the advantage of having a fast lock time of about 12 ns and a constant output data sequence. Moreover,The rest a new of this diff papererential is near-groundorganized as receiverfollows. withSection an adaptive2 presents continuous-time the architecture linear and operation equalizer of(CTLE) the wasproposed adopted quad to reduce-lane SerDes the power Receiver consumption and the and quarter ensure-rate high-frequency CDR. Section characteristics 3 shows the up experimentalto 25 Gb/s/lane. results and Section 4 presents the conclusion.

Figure 2. Typical multi-lanemulti-lane high-speedhigh-speed source source synchronous synchronous serial-link serial-link architecture architecture (a) ( withouta) without PLL PLL (b) (withb) with PLL PLL (c) with(c) with equalizers, equalizers, PLLs, PLLs, and and CDRs. CDRs.

2. ProposedThe rest Multi of this-Lane paper SerDes is organized Receiver as follows.Architecture Section 2 presents the architecture and operation of the proposed quad-lane SerDes Receiver and the quarter-rate CDR. Section3 shows the experimental The proposed SerDes link is for the duplex interface structure with four data lanes and one clock results and Section4 presents the conclusion. lane. Figure 3 shows a block diagram of the proposed 100 Gb/s quad-lane (25 Gb/s/lane × 4 differential lanes)2. Proposed SerDes Multi-Lane receiver architecture. SerDes Receiver Here, Architecture it is assumed that a forwarded clock in which a low- frequency clock (= 625 MHz) is transmitted through a separate differential lane without equalization, The proposed SerDes link is for the duplex interface structure with four data lanes and one so the tracking of a data jitter in a wide frequency range is possible. The multi-lane interface clock lane. Figure3 shows a block diagram of the proposed 100 Gb /s quad-lane (25 Gb/s/lane architecture presented in this paper is based on source synchronous clocking. Moreover, the× 4 differential lanes) SerDes receiver architecture. Here, it is assumed that a forwarded clock in proposed CDR structure can be which a low-frequency clock (= 625 MHz) is transmitted through a separate differential lane without applied to plesiochronous systems in which the Tx and the Rx use separate independent low- equalization, so the tracking of a data jitter in a wide frequency range is possible. The multi-lane frequency quartz-based reference clocks. interface architecture presented in this paper is based on source synchronous clocking. Moreover, The proposed SerDes receiver core consists of a shared multi-phase MDLL, an all-digital PI- the proposed CDR structure can be applied to plesiochronous systems in which the Tx and the Rx use based quarter-rate CDR, and a CTLE. The shared MDLL receives a 625 MHz reference clock and separate independent low-frequency quartz-based reference clocks.

Electronics 2020, 9, x FOR PEER REVIEW 4 of 16

provides eight-phase (= p0~p7) quarter-rate (= 6.25 GHz) clocks multiplied by ten times (n = 10) and distributes them to each CDR block. The use of the proposed multi-phase MDLL has the advantage of the area and power reduction by simultaneously performing frequency multiplication and multi- phase clock generation. The CTLE is designed to compensate for a channel loss of about 19.7 dB at 12.5 GHz. The channel loss of the 40 cm transmission line used in this paper is about 22.6 dB at 12.5 GHz. This channel loss value is similar to those shown by a typical chip-to-chip and midrange backplane interface below 50 cm long [30]. The proposed CDR includes four data samplers and four edge samplers, and performs a 1-to-4 de-multiplexing operation at a quarter-rate using a 2x- oversampling technology. A total of eight samplers (four corresponding to the data samples and four corresponding to the edge samples) recover 6.25 Gb/s four-bit parallel data (D<3:0>) and four 6.25 GHz clocks (Dclk0~Dclk 3) from 25 Gb/s random data input through the CTLE. To perform this operation, this CDR includes a phase interpolator (PI), a phase selector (PS), and a PI controller to provide phase-adjusted driving clocks for each data sampler and an edge sampler. In particular, a new initial phase tracker inside the PI controller is used to speed up the lock time and to provide a Electronicsconstant2020, 9 , data 1113 output sequence. Detailed CDR configuration and operation are described in the4 of 16 following section.

FigureFigure 3. Proposed3. Proposed 100 100 Gb Gb/s/s quad-lane quad-lane SerDes SerDes receiver receiver architecture architecture..

The2.1. Proposed proposed All SerDes-Digital receiverPI-Based Quarter core consists-Rate CDR of a Architecture shared multi-phase MDLL, an all-digital PI-based quarter-rateFigure CDR, 4a shows and a the CTLE. block The diagram shared of the MDLL proposed receives all-digital a 625 25 MHz Gb/s referencePI-based quarter clock- andrate CDR provides eight-phasearchitecture. (= p0~p7) The CDR quarter-rate is composed (= 6.25 of four GHz) data clocks samplers, multiplied four edge by tensamplers, times (fourn = phase10) and selectors distributes them(PS), to each four CDR phase block. interpolators The use (PI), of the and proposed a PI controller. multi-phase The proposed MDLL hasCDR the adopts advantage a quarter of-rate the area and powerarchitecture reduction to sufficiently by simultaneously widen the timing performing margin frequencyof the input multiplication sampler and to and simplify multi-phase the clock clock generation.distribution The network CTLE is with designed much ato lower compensate sampling forclock a frequency. channel loss To this of about end, four 19.7 data dB samplers at 12.5 GHz. The channel(#1~#4) are loss arranged of the 40 in cmparallel transmission on the input line stage, used and in thisthe four paper data is samplers about 22.6 are dB operated at 12.5 in GHz. This channelsynchronization loss value with is four similar Dclks to (Dclk0~ thoseDclk3) shown operating by a typical at quarter chip-to-chip-rate (= 6.25 and GHz), midrange respectively. backplane interfaceAdditionally, below 50 cm2x-oversampling long [30]. The technology proposed is CDR used includes to find fourthe transition data samplers position and of four the edgeinput samplers,data stream. To this end, four edge samplers operating at quarter-rate are arranged parallel to the input. and performs a 1-to-4 de-multiplexing operation at a quarter-rate using a 2x-oversampling technology. As shown in Figure 4b, when the CDR is locked, the four data clocks (Dclk0~Dclk3) are aligned to the A totalmiddle of eight of each samplers output (four data (D<0 corresponding>~D<3>), respectively. to the data For samples this operation, and four the corresponding PI controller generates to theedge samples)the MA<1:0>, recover 6.25MB<1:0>, Gb/s and four-bit PI<8:0> parallel signals datato control (D< 3:0the> PS) and and fourPI. The 6.25 PS GHz and PI clocks generate (Dclk0~Dclk the four 3) fromDclks 25 Gb (Dclk0~/s randomDclk3) data required input for through the four the data CTLE. samplers, To performand the adjacent this operation, Dclks are out this of CDR phase includes with a phase interpolator (PI), a phase selector (PS), and a PI controller to provide phase-adjusted driving clocks for each data sampler and an edge sampler. In particular, a new initial phase tracker inside the PI controller is used to speed up the lock time and to provide a constant data output sequence. Detailed CDR configuration and operation are described in the following section.

2.1. Proposed All-Digital PI-Based Quarter-Rate CDR Architecture Figure4a shows the block diagram of the proposed all-digital 25 Gb /s PI-based quarter-rate CDR architecture. The CDR is composed of four data samplers, four edge samplers, four phase selectors (PS), four phase interpolators (PI), and a PI controller. The proposed CDR adopts a quarter-rate architecture to sufficiently widen the timing margin of the input sampler and to simplify the clock distribution network with much a lower sampling clock frequency. To this end, four data samplers (#1~#4) are arranged in parallel on the input stage, and the four data samplers are operated in synchronization with four Dclks (Dclk0~Dclk3) operating at quarter-rate (= 6.25 GHz), respectively. Additionally, 2x-oversampling technology is used to find the transition position of the input data stream. To this end, four edge samplers operating at quarter-rate are arranged parallel to the input. As shown in Figure4b, when the CDR is locked, the four data clocks (Dclk0~Dclk3) are aligned to the middle of each output data (D<0>~D<3>), respectively. For this operation, the PI controller generates the MA<1:0>, MB<1:0>, and PI<8:0> signals to control the PS and PI. The PS and PI generate the four Dclks (Dclk0~Dclk3) required for the four data samplers, and the adjacent Dclks are out of phase with Electronics 2020, 9, x FOR PEER REVIEW 5 of 16 each other by 90 degrees. The PS and PI also generate the four Eclks (Eclk0~Eclk3) required for the four edge samplers. By applying this 2x-oversampling quarter-rate CDR technology, demultiplexed 4-bit parallel data (D<3:0>) can be restored from the 25 Gb/s serial random input data. This CDR structure has the Electronics 2020, 9, 1113 5 of 16 advantage that the order of the recovered 4-bit parallel data (D<3:0>) is always constant. This means, for instance, that the data sampler #1 clocked with Dclk0 always outputs D<0> first, and the data eachsampler other #2 byclocked 90 degrees. with Dclk1 The PSalways and PIoutputs also generate D<1> first. the This four function Eclks (Eclk0~Eclk3) is performed required through for the the PI fourcontroller edge samplers.with the initial tracking mode.

(a)

(b)

Figure 4. (a) Block diagram of the proposed all-digital PI-based quarter-rate CDR (b) and the CDR operation in locked state.

By applying this 2x-oversampling quarter-rate CDR technology, demultiplexed 4-bit parallel data (D<3:0>) can be restored from the 25 Gb/s serial random input data. This CDR structure has the Electronics 2020, 9, 1113 6 of 16 advantage that the order of the recovered 4-bit parallel data (D<3:0>) is always constant. This means, for instance, that the data sampler #1 clocked with Dclk0 always outputs D<0> first, and the data samplerElectronics #2 clocked 2020, 9, x with FOR PEER Dclk1 REVIEW always outputs D<1> first. This function is performed through6 of 16 the PI controller withFigure the 4. (a initial) Block trackingdiagram of mode. the proposed all-digital PI-based quarter-rate CDR (b) and the CDR On theoperation other hand, in locked the state. order of the 4-bit parallel data output by the conventional quarter-rate CDR with 1:4 demultiplexing may not be constant. This means that in the conventional method, the data sampler #1On can the output other hand, any data the order from of D the<0 >4-bitto Dparallel<3> first. data output Thus, theby the conventional conventional quarter-ratequarter-rate CDRs CDR with 1:4 demultiplexing may not be constant. This means that in the conventional method, the may require additional reordering circuits to set the order of the parallel data stream that is randomly data sampler #1 can output any data from D<0> to D<3> first. Thus, the conventional quarter-rate output,CDRs which may can require have additional the disadvantage reordering circuits of increasing to set the the order latency of the and parallel area data overhead. stream that In is a packet protocol-basedrandomly serialoutput, interface, which can the have order the disadvantage of deserialized of increasing data can the be latency arranged and area in a overhead. training modeIn a with an additionalpacket protocol preamble-based overhead serial interface, of increased the order latency. of deserialized data can be arranged in a training mode with an additional preamble overhead of increased latency. 2.2. Proposed PI Controller 2.2. Proposed PI Controller Figure5a shows the block diagram of the proposed PI controller. It consists of an initial phase Figure 5a shows the block diagram of the proposed PI controller. It consists of an initial phase tracker (IPT), an early/late (E/L) detector, a majority vote logic (MVL), a digital loop filter (DLF), a 2-to-1 tracker (IPT), an early/late (E/L) detector, a majority vote logic (MVL), a digital loop filter (DLF), a 2- MUX,to and-1 MUX, an octant and an phaseoctant phase controller. controller. The The proposed proposed PIPI controller controller provides provides two operation two operation modes; modes; initial trackinginitial tracking mode mode and and sequential sequential tracking tracking mode.mode.

(a)

(b)

Figure 5. (a) Proposed PI Controller with an initial phase tracker. (b) Octants phase planes with 72 phase steps depending on the control code: MA<1:0>, MB<1:0>, and PI<8:0>. Electronics 2020, 9, 1113 7 of 16

At power-on, the operation of the CDR starts in the initial tracking mode. The IPT uses a repeated preamble of the “00001111” pattern to achieve both the fast lock time and the constant data output sequence. During this mode, the IPT compares E<1> and E<3> edge information based on the phase of Eclk1. The PI controller quickly makes the rising edge of Eclk3 align with the first rising edge of the “1111” pattern. Since the phases of Eclk1 and Eclk3 are shifted 180 degrees, when the rising edge of Eclk3 is located at the center of the preamble “00001111” pattern, Dclk1 is automatically located at the center of D0 as shown in Figures4b and5a. In this initial tracking mode, the operation of the CDR is similar to the phase tracking operation of a delay-locked loop (DLL). The initial tracking mode is performed during 36 CLKCONT cycles. The stop signal changes from zero to one after 36 CLKCONT cycles. The frequency of CLKCONT (f CLKcont) is 3.125 GHz, which is half of the frequency of Eclk3, so the total initial tracking mode takes about 11.52 ns. The PI of the PI-based CDR must have the ability to randomly shift the phase of the output signal up to 360 degrees. Therefore, as shown in Figure5b, the output signals MA <1:0> and MB<1:0> of the octant phase controller are used to allocate the octant phase planes. The PI<8:0> is used to divide the 45 degree phase into nine phases, so the resolution of the proposed PI corresponds to 5 degrees, which is about 2.22 ps in this design. Thus, the octant phase planes consist of a total of 72 phase steps, which are associated with an initial tracking mode of only 36 (= 72/2) cycles (for 180 degrees in both directions). After the initial tracking mode, the sequential tracking mode using 2x-oversampling is performed. The output of the eight samplers, D<3:0> and E<3:0>, are used as an input to the E/L detector. The E/L detector compares the output of adjacent samplers and generates the Early<7:0> and Late<7:0> signals. The MVL determines whether the phases of the sampling clocks are earlier or later than the incoming data stream by majority voting. The digital loop filter (DLF) composed of an encoder and a variable finite-state machine (FSM) receives the EA<1:0> and LA<1:0> signals. It generates the UPDLF/DNDLF signals that can change the output signals of the octant phase controller. The encoder compares the EA<1:0> and LA<1:0> to generate the Comp signal and the Equal signal. If EA<1:0> has more numbers than LA<1:0>, the output Comp signal goes high, and in the opposite case, the Comp signal goes low. If E<1:0> and L<1:0> have the same number of 10s, an Equal signal is generated, and the FSM maintains the previous value. The octant phase controller implemented with an up/down counter generates the control codes (MA<1:0>, MB<1:0>, and PI<8:0>) for the phase selector (PS) and the phase interpolator (PI). For the design of the proposed PI controller, a full-custom design was used for a high-speed operation. Both the E/L detector and the IPT operate at 6.25 GHz. Other PI controller blocks also operate at a high speed of 3.125 GHz, which results in the advantage of reducing the loop latency. As the update rate of the CDR increases, the input jitter tolerance improves, whereas the dithering jitter of the recovered clock deteriorates. The jitter tolerance refers to the ability of the CDR to maintain the targeted bit error rate (BER) when a low-frequency input sinusoidal jitter (usually from a few kHz to tens of MHz) is added and injected into the input data of the CDR. In applications that do not require spread-spectrum clocking, the BER can be minimized by paying more attention to jitter generation than jitter tolerance. Therefore, it is necessary to have the ability to adjust the update rate of the CDR depending on the application field of the CDR. In this paper, to reduce the dithering jitter of the recovered clock and prevent the BER increase, the FSM of the proposed DLF performs a decimation filter [6] function. When the decimation factor (DF) control (DFCONT) signal is 0, the FSM operates as a four-state machine and outputs once when four consecutive Comp impulses occur. Thus, the update rate of the CDR is f DF, where DF = 1/4. CLKcont × When the DF is 1, the FSM operates in eight states, and the update rate of the CDR becomes f CLKcont/8 with DF = 1/8, which has the effect that the dithering jitter is further reduced. Figure6 shows the Early /Late (E/L) detector, which consists of four bang-bang phase detectors (PD), four de-multiplexers (DEMUX), and one 1/2 divider. When the sequential tracking mode starts after the initial tracking mode ends, the four bang-bang PDs using the Alexander equation [34] generate Electronics 2020, 9, x FOR PEER REVIEW 8 of 16

generate the L0–L3 and E0–E3 signals. After going through the four DEMUX blocks, the Early<7:0> and Late<7:0> signals are generated.

2.3. Proposed CTLE Electronics 2020, 9, 1113 8 of 16 ElectronicsFigure 2020 ,7 9 shows, x FOR PEERthe block REVIEW diagram of the proposed CTLE, which consists of a near-ground 8(NG) of 16 receiver [35,36], an adaptive CTLE, and a power detector [37]. High-speed small-swing differential generate the L0–L3 and E0–E3 signals. After going through the four DEMUX blocks, the Early<7:0> theinput L0–L3 signals and (RX E0–E3IN, RX signals.INb) passing After through going the through parasitic the RLC four input DEMUX network blocks, (including the Early package/bond<7:0> and and Late<7:0> signals are generated. Latewire/PAD<7:0> signalsand electrostatic are generated. discharge (ESD) devices) are applied as inputs to the NG receiver. 2.3. Proposed CTLE Figure 7 shows the block diagram of the proposed CTLE, which consists of a near-ground (NG) receiver [35,36], an adaptive CTLE, and a power detector [37]. High-speed small-swing differential input signals (RXIN, RXINb) passing through the parasitic RLC input network (including package/bond wire/PAD and electrostatic discharge (ESD) devices) are applied as inputs to the NG receiver.

FigureFigure 6.6. BlockBlock diagramdiagram ofof thethe EE/L/L detectordetector andand thethe bang-bangbang-bang PD.PD. 2.3. Proposed CTLE Figure7 shows the block diagram of the proposed CTLE, which consists of a near-ground (NG) receiver [35,36], an adaptive CTLE, and a power detector [37]. High-speed small-swing differential input signals (RXIN, RXINb) passing through the parasitic RLC input network (including package/bond wire/PAD and electrostaticFigure 6. discharge Block diagram (ESD) of devices) the E/L detector are applied and the as bang inputs-bang to thePD. NG receiver.

Figure 7. Block diagram of the proposed CTLE.

The NG receiver adopts a dual gain-path common-gate amplifier with feed-forward capacitors (CFF) that boost high-frequency gain. The NG receiver also utilizes an adaptive bias generator (ABG) to compensate for the input common-mode variation and to improve the channel impedance matching performance. The input terminal of the NG receiver provides the impedance matching function for the channel termination,Figure 7. Block which diagram has the of the advantage proposed of CTLE.CTLE not . requiring additional channel termination devices. If the amplitude of the high-speed serial input signal changes, the characteristics of thThee NG NG receiver receiver and adopts impedance a dual matching gain-pathgain-path are common-gatecommon affected.-gate The amplifieramplifier ABG using with common feed-forwardfeed-forward-mode capacitors feedback (C(CFF)) that that boost boost high high-frequency-frequency gain. gain. The The NG NG receiver receiver also also utilizes utilizes an an adaptive adaptive bias bias generator generator (ABG) (ABG) to compensate compensate for for the the input input common-mode common-mode variation variation and to and improve to improve the channel the impedance channel impedance matching performance.matching performance. The input terminal The input of the terminal NG receiver of the providesNG receiver the impedance provides the matching impedance function matching for the channelfunction termination, for the channel which termination, has the advantage which has of not the requiring advantage additional of not requiring channel terminationadditional channel devices. Iftermination the amplitude devices. of the If the high-speed amplitude serial of the input high signal-speed changes, serial input the characteristicssignal changes, of the the characteristics NG receiver of the NG receiver and impedance matching are affected. The ABG using common-mode feedback

Electronics 2020, 9, 1113 9 of 16

Electronics 2020, 9, x FOR PEER REVIEW 9 of 16 and impedance matching are affected. The ABG using common-mode feedback can reduce the current mismatchcan reduce problem the current in the mismatch receiver input problem stage in according the receiver to the input VCM stagelevel according changes, towhich the VresultsCM level in improvedchanges, which impedance results matching in improved performances. impedance Thematching PMOS performances. active inductor The loads PMO ofS the active NG inductor receiver provideloads ofe thefficient NG high-frequencyreceiver provide compensation efficient high with-frequency less area compensation and power consumption with less area than and thepassive power inductorconsumption loads. than the passive inductor loads. The proposed adaptive CTLE is a two two-stage-stage differential differential pair amplifier amplifier with active inductor inductor loads. loads. ItIt isis usedused toto compensatecompensate for for the the additional additional high-frequency high-frequency gain gain at at about about 12.5 12.5 GHz GHz to generateto generate open-eye open- dataeye data (CDR (CDRIN, CDRIN, CDRINb).INb The). The power power detector detector [37 [37]] using using the the spectrum spectrum balancing balancing techniquetechnique createscreates aa control voltagevoltage (V(VCTRLCTRL)) that that can can adaptively adaptively adjust adjust the the ga gainin of of the CTLE. The power detector consistsconsists of a high-passhigh-pass filterfilter (HPF), a low-passlow-pass filter filter (LPF), a rectifier, rectifier, and a voltage voltage-to-current-to-current converter. converter. Figure8 8 shows shows the the simulated simulated frequency frequency response response of of the the proposed proposed CTLE CTLE for for di differentfferent VVCTRLCTRL voltages.voltages. TheThe proposedproposed CTLECTLE achievesachieves a maximummaximum gain boosting characteristic of about 19.7 dB at a NyquistNyquist frequencyfrequency ofof 12.512.5 GHz.GHz.

FigureFigure 8.8. FrequencyFrequency responseresponse ofof thethe proposedproposed CTLECTLE forfor thethe didifferentfferent V VCTRLCTRL voltages.

2.4. Proposed Multi-PhaseMulti-Phase MDLL Figure9 9shows shows the the block block diagram diagram of the of multi-phasethe multi-phase multiplying multiplying delay-locked delay-locked loop (MDLL) loop (MDLL) [ 38,39] used[38,39] in usedthis paper. in this The paper. proposed The proposed multi-phase multi MDLL-phase consists MDLL of aconsists phase detector of a phase (PD), detector a charge (PD), pump a (CP),charge a pump loop filter (CP), (LF), a loop a voltage filter (LF), controlled a voltage delay controlled line (VCDL), delay line a divide-by-10 (VCDL), a divide divider,-by and-10 divider, a select logic.and a select The MDLL logic. usesThe MDLL a four-stage uses a difourfferential-stage differential delay line to delay generate line to eight-phase generate eight clocks-phase (p0 toclocks p7), each(p0 to with p7), aeach phase with di aff phaseerence difference of 45 degrees. of 45 degrees. The proposed The proposed MDLL generatesMDLL generates 6.25 GHz 6.25 eight-phase GHz eight- clocksphase clocks multiplied multiplied by ten b timesy ten times using using a reference a reference clock clock of 625 of MHz.625 MHz. The The lock lock time time of theof the MDLL MDLL is aboutis about 350 350 ns. ns. The The proposed proposed MDLL MDLL consumes consumes about about 24.2 24.2 mW mW of of power powerat at 6.256.25 GHzGHz andand achievesachieves a peak-to-peak (p–p) jitter of 10.5 ps. The active area of the proposed MDLL is only 170 µm 80 µm. peak-to-peak (p–p) jitter of 10.5 ps. The active area of the proposed MDLL is only 170 μm × 80 μm. TheThe MDLLMDLL core consumesconsumes 19.7 mW, andand thethe multi-phasemulti-phase clockclock distributiondistribution across the fourfour CDRsCDRs consumes about 4.5 mW. TheThe sharedshared MDLLMDLL isis locatedlocated inin thethe centercenter ofof thethe fourfour CDRsCDRs to minimize the lengthlength of of the the clock clock distribution. distribution. The The clock clock distribution distribution of multi-phase of multi-phase high-frequency high-frequency clocks clocks is a critical is a taskcritical due task to problemsdue to problems such as such noise as coupling, noise coupling, clock skew,clock andskew, the and power the power consumption consumption of the of clock the tree.clock However,tree. However, since thesince overall the overall size of size the of proposed the proposed quad-lane quad receiver-lane receiver architecture architecture is very is small very andsmall the and clock the distributionclock distribution length length in one in direction one direction is shorter is shorter than 230 thanµm, 230 noise μm coupling, noise coupling and power and consumptionpower consumption increase increase problems problems can be minimized.can be minimized.

Electronics 2020, 9, x FOR PEER REVIEW 10 of 16

Electronics 2020, 9, 1113 10 of 16 Electronics 2020, 9, x FOR PEER REVIEW 10 of 16

Figure 9. Block diagram of the proposed multi-phase MDLL.

3. Experimental Results The proposed 100 Gb/s quad-lane SerDes receiver was implemented in a 40 nm CMOS process. Figure 10 shows the layout of the quad-lane SerDes receiver composed of four CTLEs, one shared MDLL, and four CDRs. The total active area of the quad-lane SerDes receiver is only 780 μm × 450 μm (= 0.351 mm2). To achieve an aggregate data rate of 100 Gb/s, the proposed four-channel SerDes receiver consumes a total power of 241.8 mW (MDLL = 24.2 mW, CTLE × 4ea = 81.6 mW, CDR × 4ea = 136 mW), which resultsFigure in a 9. high Block-energy diagram ef ficiencyof the proposed of 2.418 multi-phasemulti pJ/bit.-phase MDLL. Figure 11 shows the locking process of the proposed 25 Gb/s/lane quarter-rate all-digital CDR. 3. Experimental Results When the CDR is activated, the initial tracking mode operates for 36 CLKcont cycles by using a repeatedThe proposedpreamble 100of the GbGb/s “00001111”/s quad-lanequad-lane pattern SerDes toreceiver achieve was a fast implemented acquisition in in time a 40 and nm aCMOS CMOS constant process. process. data Figureoutput 10sequence shows, thewhere layout the CDR of thethe update quad-lanequad- ratelane is SerDesSerDes 3.125 GHz receiverreceiver (= 0.32 composedcomposed ns). When of the fourfour initial CTLEs,CTLEs, tracking one sharedshared mode MDLL,is completed, andand fourfour and CDRs.CDRs. the DATA#1 The The total total output active active areais areasynchro of of the thenized quad-lane quad to- laneDclk0 SerDes SerDes, then receiver thereceiver CDR is only isenters only780 theµ780m sequential μm450 × µ450m 2 × (trackingμ=m0.351 (= 0.351 mm mode mm). Toand2). achieve To performs achieve an aggregate 2x an-oversampling. aggregate data ratedata ofBy rate 100 controlling of Gb 100/s, the Gb/s, proposedthe the decimation proposed four-channel factor four- channelof SerDes the DLF, receiver SerDes the consumes a total power of 241.8 mW (MDLL = 24.2 mW, CTLE 4ea = 81.6 mW, CDR 4ea = 136 mW), CDRreceiver update consumes rate can a total be adjustedpower of to241.8 have mW fCLKcont (MDLL/4 or = f24.2CLKcont mW,×/8 to CTLE minimize × 4ea the = 81.6 dithering× mW, CDR jitter. × 4eaAs whichshown= 136 mW), resultsin Figure which in a11, high-energyresults the lock in atime high effi ofciency-energy the proposed of ef 2.418ficiency pJ CDR/bit. of is2.418 only pJ/bit. less than 12 ns. Figure 11 shows the locking process of the proposed 25 Gb/s/lane quarter-rate all-digital CDR. When the CDR is activated, the initial tracking mode operates for 36 CLKcont cycles by using a repeated preamble of the “00001111” pattern to achieve a fast acquisition time and a constant data output sequence, where the CDR update rate is 3.125 GHz (= 0.32 ns). When the initial tracking mode is completed, and the DATA#1 output is synchronized to Dclk0, then the CDR enters the sequential tracking mode and performs 2x-oversampling. By controlling the decimation factor of the DLF, the CDR update rate can be adjusted to have fCLKcont/4 or fCLKcont/8 to minimize the dithering jitter. As shown in Figure 11, the lock time of the proposed CDR is only less than 12 ns.

Figure 10. Layout of the proposed quad quad-lane-lane SerDes receiver core block.block.

Figure 11 shows the locking process of the proposed 25 Gb/s/lane quarter-rate all-digital CDR. When the CDR is activated, the initial tracking mode operates for 36 CLKcont cycles by using a repeated preamble of the “00001111” pattern to achieve a fast acquisition time and a constant data output sequence, where the CDR update rate is 3.125 GHz (= 0.32 ns). When the initial tracking mode is completed, and the DATA#1 output is synchronized to Dclk0, then the CDR enters the sequential tracking mode and performs 2x-oversampling. By controlling the decimation factor of the DLF, the CDR update rate can beFigure adjusted 10. Layout to have off theCLKcont proposed/4 or fquadCLKcont-lane/8 SerDes to minimize receiver the core dithering block. jitter. As shown in Figure 11, the lock time of the proposed CDR is only less than 12 ns.

Electronics 2020, 9, 1113 11 of 16 Electronics 2020, 9, x FOR PEER REVIEW 11 of 16 Electronics 2020, 9, x FOR PEER REVIEW 11 of 16

FigureFigure 11. Locking11. Locking process process of of the the proposed proposed 25 Gb/s Gb/s quarter quarter-rate-rate digital digital CDR CDR.. Figure 11. Locking process of the proposed 25 Gb/s quarter-rate digital CDR. FigureFigure 12 shows 12 shows the the simulated simulated locking locking process process ofof thethe proposed CDR CDR at at25 25 Gb/s/lane. Gb/s/lane. When When the the Figure 12 shows the simulated locking process of the proposed CDR at 25 Gb/s/lane. When the CDR starts,CDR starts, the phasethe phase error error (∆t) (Δt) between between the the Dclk0 Dclk0 andand thethe ideal lock lock point point is isabout about 70 70ps. ps.After After the the CDR starts, the phase error (Δt) between the Dclk0 and the ideal lock point is about 70 ps. After the initialinitial tracking tracking mode mode using using the the preamble preamble data data pattern pattern is finished,finished, it it can can be be confirmed confirmed that that the thephase phase initialerror trackingΔt becomes mode zero, using and thethe preaCDRmble input data D0 patternis synchronized is finished, to Dclk0.it can be confirmed that the phase errorerror∆t becomes Δt becomes zero, zero, and and the the CDR CDR input input D0 D0 isis synchronizedsynchronized to to Dclk0. Dclk0.

Figure 12. Simulated locking process of the proposed CDR at 25 Gb/s. FigureFigure 12. 12.Simulated Simulated locking locking processprocess of of the the proposed proposed CDR CDR at 25 at Gb/s 25 Gb. /s.

Figure 13 left shows the simulated eye diagram of the CTLE input when the channel loss is about

22.6 dB at 12.5 GHz, where the input data eye is completely closed. Figure 13 right shows the simulated Electronics 2020, 9, x FOR PEER REVIEW 12 of 16

Figure 13 left shows the simulated eye diagram of the CTLE input when the channel loss is about 22.6 dB at 12.5 GHz, where the input data eye is completely closed. Figure 13 right shows the simulated eye diagram at the CTLE output operating at 25 Gb/s with a 231−1 pseudorandom bit stream (PRBS) data pattern, where the data eye is opened with 0.67 unit interval (UI) (= 27 ps).

Electronics 2020, 9, x FOR PEER REVIEW 12 of 16 Electronics 2020, 9, 1113 12 of 16 Figure 13 left shows the simulated eye diagram of the CTLE input when the channel loss is about 22.6 dB at 12.5 GHz, where the input data eye is completely closed. Figure 13 right shows the 31 eye diagram at the CTLE output operating at 25 Gb/s with a 2 1 pseudorandom31 bit stream (PRBS) simulated eye diagram at the CTLE output operating at 25 Gb/s with− a 2 −1 pseudorandom bit stream data(PRBS) pattern, data pattern, where the where data the eye data is opened eye is withopened 0.67 with unit 0.67 interval unit (UI)interval (= 27 (UI) ps). (= 27 ps).

Figure 13. Post-layout simulation results of the proposed 25 Gb/s/lane CTLE operation with PRBS31 data sequence.

Figure 14 shows the eye diagram of the recovered quarter-rate 6.25 GHz clock (Dclk0), which has a p–p jitter of 18 ps and a root mean square (RMS) jitter of 3.5 ps, respectively, when DF = 1/4. When DF = 1/8, it shows an improved p–p jitter of 16.5 ps and an RMS jitter of 3.5 ps, respectively.

FigureFigure 15 shows 13. Post-layout the eye diagram simulation of results the recovered of the proposed quarter 25-rate Gb/s /6.25lane Gb/s CTLE data, operation which with has PRBS31 a p-p jitter Figure 13. Post-layout simulation results of the proposed 25 Gb/s/lane CTLE operation with PRBS31 of 22data ps and sequence. an RMS jitter of 3.7 ps, respectively, when DF = 1/4. When DF = 1/8, it shows a p–p jitter of 20.5data ps sequence and an RMS. jitter of 3.6 ps, respectively. FigureTable 114 summarizes shows the eye the diagram performance of the of recovered the proposed quarter-rate all-digital 6.25 GHzCDR clockwith (Dclk0),the state which-of-the has-art Figure 14 shows the eye diagram of the recovered quarter-rate 6.25 GHz clock (Dclk0), which aPI p–p-based jitter CDRs. of 18 The ps results and a rootin this mean work square are the (RMS) post-layout jitter of simulation 3.5 ps, respectively, values. It can when be confirmed DF = 1/4. has a p–p jitter of 18 ps and a root mean square (RMS) jitter of 3.5 ps, respectively, when DF = 1/4. Whenthat theDF proposed= 1/8, it showsCDR has an improvedthe best energy p–p jitter efficiency of 16.5 of ps 2.34 and pJ/bit an RMS (= 2.34 jitter mW/Gb/s). of 3.5 ps, Additionally, respectively. When DF = 1/8, it shows an improved p–p jitter of 16.5 ps and an RMS jitter of 3.5 ps, respectively. FigureRef. [29,32] 15 shows require the eyea high diagram-frequency of the clock recovered generator quarter-rate of 10 GHz 6.25 Gbor /highers data, whichas an external has a p–p reference jitter of Figure 15 shows the eye diagram of the recovered quarter-rate 6.25 Gb/s data, which has a p-p jitter 22clock ps andsource. an RMS This jitter means of 3.7that ps, the respectively, use of an additional when DF = PLL1/4. circuit, WhenDF which= 1/ 8,has it showsthe disadva a p–pntage jitter ofof of 22 ps and an RMS jitter of 3.7 ps, respectively, when DF = 1/4. When DF = 1/8, it shows a p–p jitter 20.5power ps andconsumption an RMS jitter and ofsilicon 3.6 ps, area respectively. increase. of 20.5 ps and an RMS jitter of 3.6 ps, respectively. Table 1 summarizes the performance of the proposed all-digital CDR with the state-of-the-art PI-based CDRs. The results in this work are the post-layout simulation values. It can be confirmed that the proposed CDR has the best energy efficiency of 2.34 pJ/bit (= 2.34 mW/Gb/s). Additionally, Ref. [29,32] require a high-frequency clock generator of 10 GHz or higher as an external reference clock source. This means that the use of an additional PLL circuit, which has the disadvantage of power consumption and silicon area increase.

Figure 14. Post-layout simulation results of the recovered quarter-rate clock with decimation factor ElectronicsFigure 2020 14., 9 ,Post x FOR-layout PEER simulationREVIEW results of the recovered quarter-rate clock with decimation factor13 of 16 (DF) = 1/4 and 1/8. (DF) = 1/4 and 1/8. 160ps = 1UI 160ps = 1UI (V) (V) 1.0 1.0

137ps 138ps (0.85UI) (0.86UI) Figure 14. Post0.5 -layout simulation results of the recovered0.5 quarter-rate clock with decimation factor (DF) = 1/4 and 1/8. p-p jitter = 22ps p-p jitter = 20.5ps RMS jitter = 3.7ps RMS jitter = 3.6ps

0 0

0 160 320 0 160 320 (ps) (ps) DF = 1/4 DF = 1/8 . Figure 15. Post-layout simulation results of the recovered quarter-rate data with DF = 1/4 and 1/8. Figure 15. Post-layout simulation results of the recovered quarter-rate data with DF = 1/4 and 1/8.

Table 1. CDR performance summary and comparison.

Reference [5] [28] [29] [32] This Work Process and supply 130 nm/1.2 V 28 m/1.0 V 90 nm/1.0 V 65 nm/1.2 V 40 nm/1.0 V Data rate (Gb/s) 7 13 44 40 25 CDR architecture PI-based PI-based PI-based PI-based PI-based Sampling rate Baud-rate Half-rate Quarter-rate Quarter-rate Quarter-rate DEMUXing ratio 1:1 1:20 1:16 1:4 1:4 PLL PLL Multi-phase generation 8-phase DLL 1/2 Divider 8-phase MDLL (LC-VCO) (LC-VCO) External reference clock 150 MHz N/A 10 GHz 20 GHz 625 MHz frequency CDR power (mW) 13.7 - - 159 34.2 Multi-phase gen. power 39.6 27 - 30 24.2 (mW) Total power (mW) 53.3 59 230 189 58.4 Energy efficiency (pJ/bit) 7.6 4.53 5.74 4.7 2.34 Lock time - - - - 12ns

Table 2 provides a comparison between this work and some recently published multi-lane SerDes receiver chips. It can be confirmed that the proposed quad-channel SerDes receiver (4 CTLEs + 4 CDRs + 1 MDLL) achieves the highest energy efficiency of 2.418 pJ/bit and the highest data rate of 25 Gb/s/channel, while using a small area of only 0.351 mm2.

Table 2. SerDes receiver chip performance summary and comparison.

Reference [5] [22] [28] This work Process and supply 130 nm/1.2 V 65 nm/1.0 V 28 m/1.0 V 40 nm/1.0 V CDR architecture PI-based PI-based PI-based PI-based Equalization CTLE CTLE CTLE + DFE CTLE Aggregate data rate (Gb/s) 28 140.8 52 100 Number of data channels 4 22 4 4 Data rate/channel (Gb/s) 7 6.4 13 25 Total Power (mW) 271.5 635 155 241.8 Energy Efficiency (pJ/bit) 9.7 4.5 2.98 2.418 Chip Area (mm2) 2.41 2.96 0.226 1 0.351 1 RX + TX area.

4. Conclusions We have presented an energy-efficient, small-area 100 Gb/s quad-lane SerDes receiver with a PI- based quarter-rate all-digital CDR. The proposed PI-based CDR utilizes a multi-phase MDLL to achieve lower clock jitter, a smaller area, and a lower power consumption. The proposed digital CDR uses a new preamble-based initial tracking mode to achieve a fast lock time of less than 12 ns, which is suitable for use in burst-mode applications. The CDR uses a quarter-rate 2x-oversampling

Electronics 2020, 9, 1113 13 of 16

Table1 summarizes the performance of the proposed all-digital CDR with the state-of-the-art PI-based CDRs. The results in this work are the post-layout simulation values. It can be confirmed that the proposed CDR has the best energy efficiency of 2.34 pJ/bit (= 2.34 mW/Gb/s). Additionally, Ref. [29,32] require a high-frequency clock generator of 10 GHz or higher as an external reference clock source. This means that the use of an additional PLL circuit, which has the disadvantage of power consumption and silicon area increase.

Table 1. CDR performance summary and comparison.

Reference [5][28][29][32] This Work Process and supply 130 nm/1.2 V 28 m/1.0 V 90 nm/1.0 V 65 nm/1.2 V 40 nm/1.0 V Data rate (Gb/s) 7 13 44 40 25 CDR architecture PI-based PI-based PI-based PI-based PI-based Sampling rate Baud-rate Half-rate Quarter-rate Quarter-rate Quarter-rate DEMUXing ratio 1:1 1:20 1:16 1:4 1:4 PLL PLL Multi-phase generation 8-phase DLL 1/2 Divider 8-phase MDLL (LC-VCO) (LC-VCO) External reference clock 150 MHz N/A 10 GHz 20 GHz 625 MHz frequency CDR power (mW) 13.7 - - 159 34.2 Multi-phase gen. power (mW) 39.6 27 - 30 24.2 Total power (mW) 53.3 59 230 189 58.4 Energy efficiency (pJ/bit) 7.6 4.53 5.74 4.7 2.34 Lock time - - - - 12ns

Table2 provides a comparison between this work and some recently published multi-lane SerDes receiver chips. It can be confirmed that the proposed quad-channel SerDes receiver (4 CTLEs + 4 CDRs + 1 MDLL) achieves the highest energy efficiency of 2.418 pJ/bit and the highest data rate of 25 Gb/s/channel, while using a small area of only 0.351 mm2.

Table 2. SerDes receiver chip performance summary and comparison.

Reference [5][22][28] This work Process and supply 130 nm/1.2 V 65 nm/1.0 V 28 m/1.0 V 40 nm/1.0 V CDR architecture PI-based PI-based PI-based PI-based Equalization CTLE CTLE CTLE + DFE CTLE Aggregate data rate (Gb/s) 28 140.8 52 100 Number of data channels 4 22 4 4 Data rate/channel (Gb/s) 7 6.4 13 25 Total Power (mW) 271.5 635 155 241.8 Energy Efficiency (pJ/bit) 9.7 4.5 2.98 2.418 Chip Area (mm2) 2.41 2.96 0.226 1 0.351 1 RX + TX area.

4. Conclusions We have presented an energy-efficient, small-area 100 Gb/s quad-lane SerDes receiver with a PI-based quarter-rate all-digital CDR. The proposed PI-based CDR utilizes a multi-phase MDLL to achieve lower clock jitter, a smaller area, and a lower power consumption. The proposed digital CDR uses a new preamble-based initial tracking mode to achieve a fast lock time of less than 12 ns, which is suitable for use in burst-mode applications. The CDR uses a quarter-rate 2x-oversampling architecture and a new full custom PI controller with an adjustable update rate, resulting in a reduced dithering jitter. A new small-area low-power CTLE using an NG receiver was adopted to achieve a data rate of 25 Gb/s/lane. Implemented in 40 nm CMOS technology, the four-channel SerDes receiver achieves an aggregate data rate of 100 Gb/s and occupies an active core area of only 0.351 mm2. It consumes only 241.8 mW and achieves a high energy efficiency of 2.418 pJ/bit. Electronics 2020, 9, 1113 14 of 16

Author Contributions: Conceptualization, J.K.; methodology, H.H.; validation, J.K. and H.H.; formal analysis, H.H.; investigation, J.K and H.H.; data curation, H.H.; writing—original draft preparation, J.K and H.H.; writing—review and editing, J.K.; supervision, J.K.; project administration, J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded and conducted under “the Competency Development Program for Industry Specialists” of the Korean Ministry of Trade, Industry and Energy (MOTIE), operated by Korea Institute for Advancement of Technology (KIAT). (No. N0001883, HRD program for N0001883). This work was also supported by National Research Foundation of Korea (NRF 2019R1A2C-1010017). The EDA tools were supported by IDEC. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Barroso, L.A.; Holzle, U. The case for energy-proportional computing. Computer 2008, 40, 33–37. [CrossRef] 2. Chen, S.; Li, H.; Chiang, P.Y. A robust energy/area-efficient forwarded-clock receiver with all-digital clock and data recovery in 28-nm CMOS for high-density interconnects. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2016, 24, 578–586. [CrossRef] 3. Guo, S.; Liu, T.; Zhang, T.; Xi, T.; Wu, G.; Gui, P.; Fan, Y.; Maung, W.; Morgan, M. A low-voltage low-power 25 Gb/s clock and data recovery with equalizer in 65 nm CMOS. In Proceedings of the 2015 IEEE Radio Frequency Integrated Circuits Symposium (RFIC), Phoenix, AZ, USA, 17–19 May 2015; pp. 307–310. 4. Chung, S.H.; Kim, L.S. 1.22 mW/Gb/s 9.6 Gb/s data jitter mixing forwarded-clock receiver robust against power noise with 1.92 ns latency mismatch between data and clock in 65 nm CMOS. In Proceedings of the 2012 Symposium on VLSI Circuit (VLSIC), Honolulu, HI, USA, 13–15 June 2012; pp. 144–145. 5. Kalantari, N.; Buckwalter, J.F. A multichannel serial link receiver with dual-loop clock-and-data recovery and channel equalization. IEEE Trans. Circuits Syst. I Regul. Pap. 2014, 60, 2920–2931. [CrossRef] 6. Wu, G.; Huang, D.; Li, J.; Gui, P.; Liu, T.; Guo, S.; Wang, R.; Fan, Y.; Chakraborty, S.; Morgan, M. A 1-16 Gb/s all-digital clock and data recovery with a wideband high-linearity phase interpolator. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2016, 24, 2511–2520. [CrossRef] 7. Kwak, Y.H.; Kim, Y.; Hwang, S.; Kim, C. A 20 Gb/s clock and data recovery with a ping-pong delay line for unlimited phase shifting in 65 nm CMOS process. IEEE Trans. Circuits Syst. I: Regul. Pap. 2013, 60, 303–313. [CrossRef] 8. Bueren, G.V.; Rodoni, L.; Jaeckel, H.; Brun, R.; Holzer, D.; Huber, A.; Schmatz, M. 5.75 to 44Gb/s quarter rate CDR with data rate selection in 90 nm bulk CMOS. In Proceedings of the ESSCIRC 2008—34th European Solid-State Circuit Conference, Edinburgh, UK, 15–19 September 2008; pp. 166–169. 9. Hsieh, M.T.; Sobelman, G.E. Architectures for multi-gigabit wire-linked clock and data recovery. IEEE Circuits Syst. Mag. 2008, 8, 45–57. [CrossRef] 10. Choi, W.S.; Anand, T.; Shu, G.; Elshazly, A.; Hanumolu, P.K.A burst-mode digital receiver with programmable input jitter filtering for energy proportional links. IEEE J. Solid-State Circuits 2015, 50, 737–748. [CrossRef] 11. Su, C.; Chen, L.K.; Cheung, K.W. Theory of burst-mode receiver and its applications in optical multiaccess networks. J. Lightwave Technol. 1997, 15, 590–606. 12. Verbeke, M.; Rombouts, P.; Ramon, H.; Verbist, J.; Bauwelinck, J.; Yin, X.; Torfs, G. A 25 Gb/s all-digital clock and data recovery circuit for burst-mode applications in PONs. J. Lightwave Technol. 2018, 36, 1503–1509. [CrossRef] 13. Leibowitz, B.; Palmer, R.; Poulton, J.; Frans, Y.; Li, S.; Wilson, J.; Bucher, M.; Fuller, A.M.; Eyles, J.; Aleksic, M.; et al. A 4.3 GB/s mobile memory interface with power-efficient bandwidth scaling. IEEE J. Solid-State Circuits 2010, 45, 889–898. [CrossRef] 14. Ziakas, D.; Baum, A.; Maddox, R.; Safranek, R. Intel® QuickPath interconnect architectural features supporting scalable system architectures. In Proceedings of the 18th IEEE Symposium on High Performance Interconnects, Mountain View, CA, USA, 18–20 August 2010; pp. 1–6. 15. Loke, A.; Doyle, B.; Maheshwari, S.; Fischette, D.; Wang, C.; Wee, T.; Fang, E.S. An 8.0-Gb/s HyperTransport transceiver for 32-nm SOI-CMOS server processors. IEEE J. Solid-State Circuits 2012, 47, 2627–2642. [CrossRef] 16. Zerbe, J.; Daly, B.; Luo, L.; Stonecypher, W.; Dettloff, W.; Eble, J.C.; Stone, T.; Ren, J.; Leibowitz, B.; Bucher, M.; et al. A 5 Gb/s link with matched source synchronous and common-mode clocking techniques. IEEE J. Solid-State Circuits 2011, 46, 974–985. [CrossRef] Electronics 2020, 9, 1113 15 of 16

17. Casper, B.; Jaussi, J.; O’Mahony, F.; Mansuri, M.; Canagasaby, K.; Kennedy, J.; Yeung, E.; Mooney, R. A 20Gb/s forwarded clock transceiver in 90 nm CMOS. In Proceedings of the 2006 IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 6–9 February 2006; pp. 263–272. 18. Jaussi, J.E.; Balamurugan, G.; Johnson, D.R.; Casper, B.; Martin, A.; Kennedy, J.; Shanbhag, N.; Mooney, R. 8-Gb/s source-synchronous I/O link with adaptive receiver equalization, offset cancellation, and clock de-skew. IEEE J. Solid-State Circuits 2005, 40, 80–88. [CrossRef] 19. O’Mahony, F.; Shekhar, S.; Mansuri, M.; Balamurugan, G.; Jaussi, J.E.; Kennedy, J.; Casper, B.; Allstot, D.J.; Mooney, R. A 27 Gb/s forwarded-clock I/O receiver using an injection-locked LC-DCO in 45nm CMOS. In Proceedings of the 2008 IEEE International Solid-State Circuits Conference-(ISSCC), San Francisco, CA, USA, 3–7 February 2008; pp. 452–627. 20. Hossain, M.; Carusone, A.C. 7.4 Gb/s 6.8 mW source synchronous receiver in 65 nm CMOS. IEEE J. Solid-State Circuits 2011, 46, 1337–1348. [CrossRef] 21. Dickson, T.O.; Liu, Y.; Rylov,S.V.;Agrawal, A.; Kim, S.; Hsieh, P.H.;Bulzacchelli, J.F.; Ferriss, M.; Ainspan, H.A.; Rylyakov, A. A 1.4 pJ/bit, power-scalable 16 12 Gb/s source-synchronous I/O with DFE receiver in 32 nm × SOI CMOS technology. IEEE J. Solid-State Circuits 2015, 50, 1917–1931. [CrossRef] 22. Reutemann, R.; Ruegg, M.; Keyser, F.; Bergkvist, J.; Dreps, D.; Toifl, T.; Schmatz, M. A 4.5 mW/Gb/s 6.4 Gb/s 22+1-lane source synchronous receiver core with optional cleanup PLL in 65 nm CMOS. IEEE J. Solid- State Circuits 2010, 45, 2850–2860. [CrossRef] 23. Lv, F.; Zheng, X.; Zhao, F.; Wang, J.; Yue, S.; Wang, Z.; Cao, W.; He, Y.; Zhang, C.; Jiang, H.; et al. A power scalable 2–10 Gb/s PI-based clock data recovery for multilane applications. Microelectron. J. 2018, 82, 36–45. [CrossRef] 24. Hwang, H.; Kim, J. A low-power 20 Gbps multi-phase MDLL-based digital CDR with receiver equalization. In Proceedings of the 2019 International SoC Design Conference (ISOCC), Jeju, Korea, 6–9 October 2019; pp. 42–43. 25. Aprile, C.; Cevrero, A.; Francese, P.A.; Menolfi, C.; Braendli, M.; Kossel, M.; Morf, T.; Kull, L.; Oezkaya, I.; Leblebici, Y.; et al. An eight-lane 7-Gb/s/pin source synchronous single-ended RX with equalization and far-end crosstalk cancellation for backplane channels. IEEE J. Solid-State Circuits 2018, 53, 861–872. [CrossRef] 26. Hossain, M.; Chen, E.H.; Navid, R.; Leibowitz, B.; Chou, A.; Li, S.; Park, M.J.; Ren, J.; Daly, B.; Su, B.; et al. A 4 40 Gb/s quad-lane CDR with shared frequency tracking and data dependent jitter filtering. × In Proceedings of the 2014 Symposium on VLSI Circuits Digest of Technical Papers, Honolulu, HI, USA, 10–13 June 2014; pp. 1–2. 27. Singh, U.; Garg, A.; Raghavan, B.; Huang, N.; Zhang, H.; Huang, Z.; Momtaz, A.; Cao, J. A 780 mW 4 28 Gb/s transceiver for 100 GbE gearbox PHY in 40 nm CMOS. In Proceedings of the 2014 IEEE × International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014. 28. Kocaman, N.; Ali, T.; Rao, L.P.; Singh, U.; Abdul-Latif, M.; Liu, Y.; Hafez, A.A.; Park, H.; Vasani, A.; Huang, Z.; et al. A 3.8 mW/Gbps quad-channel 8.5–13 Gbps serial link with a 5 tap DFE and a 4 tap transmit FFE in 28 nm CMOS. IEEE J. Solid-State Circuits 2016, 51, 881–892. 29. Rodoni, L.; Buren, G.V.; Huber, A.; Schmatz, M.; Jackel, H. A 5.75 to 44 Gb/s quarter rate CDR with data rate selection in 90 nm bulk CMOS. IEEE J. Solid-State Circuits 2009, 44, 1927–1941. [CrossRef] 30. Navid, R.; Chen, E.H.; Hossain, M.; Leibowitz, B.; Ren, J.; Chou, C.A.; Daly, B.; Aleksic, M.; Su, B.; Li, S.; et al. A 40 Gb/s serial link transceiver in 28 nm CMOS technology. IEEE J. Solid-State Circuits 2015, 50, 814–827. [CrossRef] 31. Ng, H.T.; Farjad-Rad, R.; Lee, M.E.; Dally, W.J.; Greer, T.; Poulton, J.; Edmondson, J.H.; Rathi, R.; Senthinathan, R. A second-order semidigital clock recovered circuit based on injection locking. IEEE J. Solid-State Circuits 2003, 38, 2101–2110. 32. Zheng, X.; Zhang, C.; Lv, F.; Zhao, F.; Yuan, S.; Yue, S.; Wang, Z.; Wang, Z.; Li, F.; Wang, Z.; et al. A 40-Gb/s quarter-rate serdes transmitter and receiver chipset in 65-nm CMOS. IEEE J. Solid-State Circuits 2017, 52, 2963–2978. [CrossRef] 33. Kim, J.; Shin, H. A low-power 3.52 Gbps SerDes with a MDLL frequency multiplier for high-speed on-chip networks. J. Semicond. Technol. Sci. 2018, 18, 658–666. [CrossRef] 34. Alexander, J. Clock recovery from random binary signals. IET Electron. Lett. 1975, 11, 541–542. [CrossRef] Electronics 2020, 9, 1113 16 of 16

35. Kaviani, K.; Amirkhany, A.; Huang, C.; Huang, C.; Le, P.; Beyene, W.; Madden, C.; Saito, K.; Sano, K.; Murugan, V.I.;et al. A 0.4-mW/Gb/s near-ground receiver front-end with replica transconductance termination calibration for a 16-Gb/s source-series terminated transceiver. IEEE J. Solid-State Circuits 2013, 48, 636–648. [CrossRef] 36. Ku, J.; Bae, B.; Kim, J. A 13-Gbps low-swing low-power near-ground signaling transceiver. J. Inst. Electron. Inf. Eng. 2014, 51, 49–58. 37. Lee, J. A 20 Gb/s adaptive equalizer in 0.13 µm CMOS technology. IEEE J. Solid-State Circuits 2006, 41, 2058–2066. [CrossRef] 38. Han, S.; Lim, J.; Kim, J. An area-efficient multi-phase fractional-ratio clock frequency multiplier. J. Semicond. Technol. Sci. 2016, 16, 143–146. [CrossRef] 39. Kim, J.; Han, S. A fast-locking all-digital multiplying DLL for fractional-ratio dynamic frequency scaling. IEEE Trans. Circuits Syst. II 2018, 65, 276–280. [CrossRef]

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).