High-Speed Baud-Rate Clock and Data Recovery

by

Danny Yoo

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science The Edward S. Rogers Sr. Department of Electrical and Computer Engineering University of Toronto

© Copyright 2018 by Danny Yoo

Abstract

High-Speed Baud-Rate Clock and Data Recovery

Danny Yoo Master of Applied Science The Edward S. Rogers Sr. Department of Electrical and Computer Engineering University of Toronto 2018

This thesis presents an adaptive baud-rate CDR with CTLE and 1-tap DFE. The novelty in this design is the adaptation engine tailored for baud-rate clock and data recovery where the comparators for the DFE and the PD are shared to save power. A testchip was fabricated in TSMC 28nm CMOS. The adaptation engine is demonstrated for 34-36Gb/s operation with a Tyco 5” channel resulting in 15.05-18.25dB channel losses. At 35Gb/s, the total power consumption is measured to be 106.3mW or a FOM of 3.04pJ/bit. This thesis also presents a 2x half-baud-rate clock and data recovery technique with 2x oversampling at half-baud-rate (every other UI). A testchip was also fabricated in TSMC 28nm CMOS. A 30Gb/s 2x half-baud-rate CDR was tested with a Tyco 5” channel with 13.06dB of loss. The total power consumption is measured to be 79.2mW or a FOM of 2.64pJ/bit.

Acknowledgements

I would like to sincerely thank my supervisor, Professor Ali Sheikholeslami, for providing me the opportunity to conduct research in the area of high-speed wireline circuits. Professor Sheikholeslami has supported me throughout every step of my tapeout, which is fabricated in a leading-edge advanced process technology. I thank Professor David Johns, Professor Tony Chan Carusone and Professor Joyce Poon for serving on my thesis examination committee. Their insightful comments and recommendations were an invaluable addition to this thesis.

I am thankful for the support and design review provided by Fujitsu’s staff, especially Hirotaka Tamura, Takayuki Shibasaki and Junji Ogawa. Special thanks to Wahid Rahman and Joshua Liang for guidance throughout my MASc research and Mohammad Tabrizi for layout and measurement support. I would also like to thank Nikola Nedovic for his visit to help set up the digital synthesis flow back in 2015, which still had an impact on my 2017 tapeout.

My gratitude goes out to Jaro Pristupa and the MOSIS support team for CAD and technical support. I would also like to acknowledge Professor Antonio Liscidini, Professor Sorin Voinigescu and CMC for test equipment rental, as I could not have finished my testchip measurements without them.

Finally, I would like to send my deepest thanks to my parents and my brother for their unconditional love and support.

Contents

Acknowledgements

Table of Contents

List of Figures

List of Abbreviations

1 Introduction
  1.1 Motivation
  1.2 An Adaptive Baud-Rate CDR
  1.3 A 2x Half-Baud-Rate CDR
  1.4 Thesis Outline

2 Background
  2.1 Overview of Baud-Rate PD
  2.2 Pattern-based Baud-Rate Scheme
    2.2.1 Pattern Detection
    2.2.2 Optimal Sampling Point
  2.3 Why Adaptation Engine?
    2.3.1 Challenges
    2.3.2 CTLE Adaptation
    2.3.3 Comparator Level Adaptation

3 Proposed Adaptation Engine
  3.1 Data Level Loop
  3.2 Goals for On-Chip Adaptation
  3.3 Adaptation Flow
  3.4 Part 1: CTLE Adaptation
  3.5 Part 2: Comparator Level Adaptation
  3.6 Summary of Adaptation
  3.7 System-level Behavioral Model
    3.7.1 Behavioral Model: Continuous-time Model
    3.7.2 Behavioral Model: Event-driven Model

4 Circuit Simulations and Measurement Results
  4.1 Analog Design
    4.1.1 Closed-loop CDR Simulations
  4.2 Digital Design
  4.3 Lab Measurements
    4.3.1 Testchip
    4.3.2 Test Setup
    4.3.3 Measurement Results

5 Proposed 2x Half-Baud-Rate CDR
  5.1 Background
    5.1.1 Alexander 2x-oversampled Bang-Bang PD
    5.1.2 Mueller-Muller Baud-Rate PD
    5.1.3 Sub-Baud-Rate Clock and Data Recovery
  5.2 Proposed 2x half-baud-rate scheme
    5.2.1 System-level Behavioral Model
  5.3 Circuit Implementation & Simulations
    5.3.1 Analog Design
    5.3.2 Closed-loop CDR Simulations
    5.3.3 Digital Design
  5.4 Lab Measurements
    5.4.1 Testchip
    5.4.2 Test Setup
    5.4.3 Measurement Results

6 Chip Design Methodology
  6.1 Behaviour Model Methodology
  6.2 Schematic & Layout Design Methodology
  6.3 Advanced Layout Techniques & Considerations
    6.3.1 Matching
    6.3.2 Design for Electromigration (EM) & IR drop
    6.3.3 Other Layout Considerations
  6.4 Place & Route Digital Implementation Methodology

7 Conclusion
  7.1 Thesis Contribution
  7.2 Future Works
    7.2.1 Improvements for an Adaptive Baud-Rate CDR
    7.2.2 Improvements for a 2x Half-Baud-Rate CDR

Bibliography

Appendices

A Ancillary
  A.1 Portlist for Synthesized Digital
  A.2 Output Pad MUX Selection

List of Figures

2.1 ISSCC 2016 Shibasaki’s Baud-Rate CDR
2.2 Shibasaki’s proposed analog front-end (VLSI2014) [34]
2.3 Pattern detection of Shibasaki’s baud-rate PD
2.4 Eye opening of 1-tap speculative DFE for Shibasaki’s baud-rate PD
2.5 VLSI2014 Shibasaki’s proposed PD logic [34]
2.6 Sub-optimal α levels illustrating reduced eye opening for noise and jitter margin
2.7 Pulse response of Channel + CTLE
2.8 Eye diagram demonstrating (1+α) - (1-α) = 2α
2.9 LMS for adapting comparators for DFE

3.1 Full-rate system level block diagram of CDR and the proposed adaptation engine
3.2 Block diagram of proposed data level loop
3.3 Example of dLev converging
3.4 Example of data level (dLev) filtered for 111 and 011 pattern
3.5 Diagram illustrating the goal of on-chip adaptation
3.6 Diagram illustrating the optimal PD level
3.7 Diagram of eye opening for comparator optimized for DFE and PD respectively
3.8 Flow Diagram of Proposed Adaptation Engine
3.9 Schematic of a CTLE stage with tunable Cs
3.10 CTLE transfer function across Cs settings simulated in MATLAB Simulink
3.11 CTLE adaptation using line thickness
3.12 Block diagram of spectrum balancing [17]
3.13 CTLE transfer function showing 0011 pattern and its neighboring patterns for three different CTLE settings
3.14 Visual Example of CTLE adaptation
3.15 Visual example of theory behind the proposed algorithm for finding optimal PD level
3.16 Visual example of finding Vamp
3.17 Visual example of why Vamp = dLev(011)max
3.18 Adaptation where line thickness guides CTLE adaptation of Cs (top right) and optimal sampling phase deduced from the slew rate guides adaptation of comparator level
3.19 Proposed schematic of quarter-rate baud-rate receiver. Digital adaptation is completed in digital back-end and the rest of the CDR is done in analog front-end
3.20 Plot of channel characteristic of various channels imported and converted to rational system model in MATLAB
3.21 Adaptation vs time where line thickness guides CTLE adaptation and optimal PD level guides α level adaptation. Tyco 5” channel at 36 Gb/s
3.22 Step response and pulse response of channel + CTLE
3.23 Jitter tolerance simulated for the converged adaptive setting of event-driven model (BER < 10^-6)

4.1 Schematic of the 2-stage CTLE
4.2 Simulated AC response of the 2-stage CTLE in Cadence
4.3 Simulated eye diagram at the output of the 2-stage CTLE. Top two eyes are under-equalized. The bottom left is optimally equalized and the bottom right is starting to over-equalize
4.4 Schematic of double-tail latch published in ISSCC2007 [31]
4.5 Schematic of charge pump and loop filter
4.6 Schematic of 8-stage ring oscillator used as VCO
4.7 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count
4.8 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered clock frequency
4.9 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO
4.10 Open-cavity QFN under a microscope showing wire bond connections for the proposed adaptive baud-rate CDR
4.11 Package Pinout for D1: Adaptive baud-rate CDR with CTLE + 1-tap DFE
4.12 Die micrograph in TSMC 28nm HPC process for the proposed adaptive baud-rate CDR
4.13 High-speed testboard for design 1: adaptive baud-rate CDR testchip. Testboard is programmed and controlled by Arduino Mega2560 + PC
4.14 Arduino Mega2560 used to program the testboard PCB
4.15 Measurement setup for testing adaptive baud-rate CDR
4.16 Measured S21 insertion loss of Tyco 5” channel with 36” cables and bias tees using a VNA. Bias tees are required for setting the input common-mode as the SHF 12104A PRBS bit pattern generator cannot set voltage offset to set the common-mode. Low-frequency loss is caused by poor low-frequency performance of bias tees
4.17 Measurement setup for measuring S21 channel loss
4.18 36 Gb/s PRBS31 input eye measured using a sampling scope including all channel loss
4.19 Measurement setup for eye diagram of input PRBS31
4.20 Measured clock spectrum and phase noise for locked CDR at 35 Gb/s
4.21 Measured clock spectrum and phase noise for locked CDR at 35 Gb/s
4.22 Measured jitter tolerance with sinusoidal jitter injected
4.23 Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping CTLE parameter Cs
4.24 Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping comparator level α
4.25 Measured jitter tolerance for different channel losses by sweeping the data-rate, hence Nyquist frequency
4.26 Measured power consumption with 35 Gb/s PRBS31 input with CDR lock
4.27 Performance comparison to prior work for the same CDR architecture

5.1 Schematic of Alexander 2x-oversampled bang-bang PD and its basic operation
5.2 Visual example of the lock point of Mueller-Muller PD [25]
5.3 Half baud-rate data sampling
5.4 Sub baud-rate data recovery by exploiting ISI
5.5 Eye diagram example of sub baud-rate data recovery by exploiting ISI. Green arrows on the left show theoretical maximum horizontal eye opening of 0.5UI. Green arrows on the right show the small vertical eye opening margins
5.6 Sub baud-rate (0.5x-sampled) data and clock recovery in comparison to Mueller-Muller CDR
5.7 Full-rate block diagram of the proposed 2x half-baud-rate CDR architecture
5.8 Eye diagram of the proposed 2x half-baud-rate scheme
5.9 Proposed 2x half-baud-rate PD compared to the conventional baud-rate Mueller-Muller PD
5.10 Proposed quarter-rate implementation of 2x half-baud-rate CDR. Proposed 2x half-baud-rate PD and the data decoder are simple custom high-speed digital logic gates
5.11 Jitter tolerance simulated for event-driven model of 2x half-baud-rate CDR (BER < 10^-6)
5.12 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count
5.13 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered clock frequency
5.14 Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO
5.15 Open-cavity QFN under a microscope showing wire bond connections for the proposed 2x half-baud-rate CDR
5.16 Package Pinout for D2: Non-uniform baud-rate CDR with CTLE
5.17 Die photo in TSMC 28nm HPC process for the proposed 2x half-baud-rate CDR. Dimensions of each building block are listed in a table
5.18 High-speed testboard for design 2: non-uniform baud-rate CDR testchip. Testboard is programmed and controlled by Arduino Mega2560 + PC
5.19 Measurement setup for testing 2x half-baud-rate CDR
5.20 Measured S21 insertion loss of Tyco 5” channel with 36” cables and bias tees using a VNA. Bias tees are required for setting the input common-mode as the SHF 12104A PRBS bit pattern generator cannot set voltage offset to set the common-mode
5.21 Measured clock spectrum of an open-loop CDR
5.22 Measured clock spectrum and phase noise for locked CDR at 30 Gb/s
5.23 Measured clock spectrum and phase noise for locked CDR at 30 Gb/s
5.24 Measured jitter tolerance with sinusoidal jitter injected at 30 Gb/s PRBS31 & 7
5.25 Power breakdown of 2x half-baud-rate CDR testchip
5.26 Performance comparison to recently published baud-rate CDRs

6.1 Layout of the full-chip die with two CDRs on top & bottom
6.2 Layout of the full-chip die showing top aluminum layer for power distribution

List of Abbreviations

ADC Analog-to-Digital Converter

BBPD Bang-Bang Phase Detector

BERT Bit Error Rate Test

BER Bit Error Rate

CDR Clock and Data Recovery

CML Current Mode Logic

CMOS Complementary Metal-Oxide Semiconductor

CM Common-Mode

CP Charge Pump

CTLE Continuous-Time Linear Equalizer

CT Continuous-Time

DAC Digital-to-Analog Converter

DCC Duty Cycle Correction

DCD Duty Cycle Distortion

DEMUX Demultiplexer

DFE Decision Feedback Equalizer

dLev Data Level

ESD Electrostatic Discharge

FOM Figure of Merit

HDL Hardware Description Language

I/O Input / Output

ISI Inter-Symbol Interference

LDO Low Drop Out

LF Loop Filter

LMS Least Mean Square

LPF Low-Pass Filter

MMPD Mueller-Muller Phase Detector

MMSE Minimum Mean Square Error

MUX Multiplexer

P&R Place-and-Route

PCB Printed Circuit Board

PD Phase Detector

PIPO Parallel In Parallel Out

PI Phase Interpolator

PLL Phase-Locked Loop

PM Phase Margin

PN Phase Noise

PRBS Pseudo Random Binary Sequence

PSRR Power Supply Rejection Ratio

PVT Process, Voltage and Temperature

QFN Quad Flat No Lead

RF Radio Frequency

RMS Root Mean Square

RTL Register Transfer Level

SERDES Serializer / Deserializer

SIPO Serial In Parallel Out

SJ Sinusoidal Jitter

UI Unit Interval

VCO Voltage-Controlled Oscillator

VGA Variable Gain Amplifier

VNA Vector Network Analyzer

Chapter 1

Introduction

1.1 Motivation

In today’s society, digital data is ubiquitous. To get the most out of the data that surrounds us in this digital age, sufficient processing speed and data-rate are imperative. As the demand for speed is greater than ever before, development of wireline circuits and SERDES is critical. However, an increase in data-rate is usually accompanied by an increase in power consumption, which leads to environmental concerns. The current scenario urgently calls for new low-power architectures and circuit techniques that will improve speed and data-rate, while addressing the energy concerns. This thesis introduces two unique research topics based on baud-rate clock and data recovery which aim to solve the power consumption problem in high-speed wireline I/O links.

1.2 An Adaptive Baud-Rate CDR

In an attempt to further save power, the baud-rate clock and data recovery circuit (CDR) published in ISSCC 2016 [35] shares the front-end comparators between the decision feedback equalizer (DFE) and the phase detector (PD). In the following year, a frequency detector (FD) based on the same baud-rate scheme was published in ISSCC 2017 [28]. However, these two CDRs lacked an adaptation engine that could autonomously adapt the CDR settings to the optimal lock point for various channels. Furthermore, manual tuning of these CDRs is difficult and tedious, as will be discussed in this thesis. This thesis therefore presents an adaptive baud-rate CDR with CTLE and 1-tap DFE. The novelty in this design is the adaptation engine tailored for baud-rate clock and data recovery where comparators for the DFE and the PD are shared to save power.

1.3 A 2x Half-Baud-Rate CDR

A traditional, Alexander-like [4] 2x oversampling bang-bang CDR, where the recovered clock locks to the data edge, has many inherent advantages such as robustness. In contrast, a Mueller-Muller PD, which is a prevalent baud-rate PD scheme, locks the sampling phase to the middle of the pulse response. Locking to the center of the data in this way has some innate disadvantages. Therefore, the proposed clock and data recovery technique aims to combine the advantage of locking to the edge, similar to a BBPD, with baud-rate sampling, similar to an MMPD. The proposed 2x half-baud-rate CDR collects two samples (2x) from every other UI (half-baud-rate), effectively sampling the data at baud-rate, but locks to the data edge similar to a BBPD.

1.4 Thesis Outline

In this thesis, there are two stand-alone baud-rate CDRs taped out in TSMC 28nm technology. To address both chip designs, this thesis is organized as follows. Chapters 2, 3 and 4 cover the first design, which is the proposed adaptive baud-rate CDR. First, Chapter 2 covers the background. Second, Chapter 3 delves into the proposed adaptation engine. Lastly, Chapter 4 presents circuit simulations and measurement results. The second design, which is the proposed 2x half-baud-rate CDR, is presented in Chapter 5. The chip design methodology by which the two separate chip designs were taped out on time for the same fabrication shuttle follows in Chapter 6. The final chapter (Chapter 7) is the conclusion, which summarizes the two contributions of this thesis.

Chapter 2

Background

This chapter covers the background to the proposed adaptive baud-rate CDR with CTLE and 1-tap DFE fabricated in TSMC 28nm technology. The testchip is an analog mixed-signal design with synthesized digital logic incorporated via a place & route tool. First, an overview of the baud-rate PD scheme is presented in the following section.

2.1 Overview of Baud-Rate PD

In recent years, baud-rate clock and data recovery has become prominent over classical oversampling CDRs such as the Alexander bang-bang PD [4], which require multiple clock phases to perform phase detection and clock and data recovery. Details and the background of the Alexander BBPD can be found later in Section 5.1.1, which serves as a background to the proposed 2x half-baud-rate CDR. A baud-rate CDR samples the data only once per UI, and therefore requires fewer clock phases and reduces the power consumption in the clock distribution network [9]. There are many different baud-rate PD schemes, such as integrating-based [8], Mueller-Muller [24], and MMSE (minimum mean squared error) [26]. Background to the Mueller-Muller PD, which is a prevalent baud-rate scheme, will be discussed in detail in Section 5.1.2. Despite the wide range of baud-rate schemes, this thesis will focus on 1) the pattern-based PD presented in ISSCC 2016 [35], on which the proposed adaptive baud-rate CDR is based, and 2) the proposed 2x half-baud-rate PD, which will be presented in Chapter 5.

2.2 Pattern-based Baud-Rate Scheme

The proposed adaptive CDR is based on a baud-rate CDR from ISSCC 2016 shown in Figure 2.1 [34, 35]. Its novel analog front-end and the phase detector (PD) will be discussed first.

In order to save power, the 1-tap DFE and the PD share the same comparators in the analog front-end. Figure 2.2 illustrates the advantages of combining the front-end comparators. First, the number of comparators is two-thirds of that in the conventional data and edge sampling bang-bang PD. Essentially, the proposed scheme detects phase (clock information) from the 1-tap speculative DFE data. In addition, the number of clock phases that need to be routed is half that of the conventional BBPD scheme. As a result, PLL and clock distribution power is also halved. The comparator levels are set to +/-α to cancel out the 1st post-cursor ISI (inter-symbol interference) left after the CTLE’s equalization. At the same time, this α level is also used as a locking point for the phase detector. In Figure 2.2, +α is labelled as DH and -α is labelled as DL.

2.2.1 Pattern Detection

Shibasaki’s baud-rate PD detects slow rising and slow falling patterns to make a timing decision on whether the clock is late or early. It looks at 3 consecutive samples at a time, $(S_{n-1}, S_n, S_{n+1})$, to filter out a specific pattern. For example, the 011 pattern is used for detecting a rising waveform and the 100 pattern is used for detecting a falling waveform. Figure 2.3 demonstrates a rising waveform detection with the +α comparator level setting the CDR’s lock point. In terms of data recovery, the eye opening available for the 1-tap speculative DFE for a rising waveform is shown in blue in Figure 2.4. Although Shibasaki’s PD only looks at 3 consecutive samples at a time, for a valid 011 rising waveform pattern satisfying the PD logic table in Figure 2.5, a 001 pattern must come before sample n. Essentially, when Shibasaki’s PD detects 011 for a rising pattern and 100 for a falling pattern, it is actually detecting 0011 and 1100 patterns respectively. In other words, a 2UI pulse pattern is used for pattern detection in order to recover timing information.
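As a small illustration of the pattern filtering described above, the following Python sketch scans a recovered bit stream for the 0011 and 1100 sequences. It does not reproduce the PD logic table of Figure 2.5; the function name `find_slow_edges`, the list-of-bits input and the choice to report the start index of each 4-bit match are all assumptions made for this example.

```python
def find_slow_edges(bits):
    """Return start indices of 0011 (slow rising) and 1100 (slow falling)
    sequences in a list of 0/1 decisions. Illustrative helper only."""
    rising, falling = [], []
    for i in range(len(bits) - 3):
        window = tuple(bits[i:i + 4])  # four consecutive decisions
        if window == (0, 0, 1, 1):
            rising.append(i)
        elif window == (1, 1, 0, 0):
            falling.append(i)
    return rising, falling


# Made-up bit stream for demonstration.
rising, falling = find_slow_edges([0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0])
print(rising, falling)
```

Only transitions that sit inside a 2UI pulse are reported, so isolated 010 or 101 transitions contribute no timing information in this sketch.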

2.2.2 Optimal Sampling Point

The CTLE’s boost changes the data slope of the 011 and 100 patterns which the PD detects. The CTLE should be adjusted so that the comparator assigned for the phase detection produces 0 and 1 at an equal probability at the “optimal sampling point”. To illustrate the optimal sampling point, let us take a look at the scenario when the previous bit is a 1. The eye opening available to a data sampler when the previously detected bit is a 1 is shown in red on top left of Figure 2.5. The optimal sampling position is 1/2 UI from the convergence point of the low-rate pattern (i.e. 100 sequence) with the +α (DH) level, such that the data decision is made at the x-axis center of the eye opening. At this point, the 100 pattern should produce an equal chance of early/late with the second comparator, set at -α (DL), used for the PD.

Figure 2.1: ISSCC 2016 Shibasaki’s Baud-Rate CDR

Figure 2.2: Shibasaki’s proposed analog front-end (VLSI2014) [34]

Figure 2.3: Pattern detection of Shibasaki’s baud-rate PD

Figure 2.4: Eye opening of 1-tap speculative DFE for Shibasaki’s baud-rate PD

Figure 2.5: VLSI2014 Shibasaki’s proposed PD logic [34]

2.3 Why Adaptation Engine?

An autonomous adaptation scheme that can dynamically adapt to the best CDR settings on-chip is imperative because the CDR in each lane cannot be tuned manually in a real-world product. In the following, we outline the challenges in implementing an adaptive engine.

2.3.1 Challenges

The challenge of forming an adaptation scheme for the pattern-based PD used in this CDR is that there exist two tuning knobs: the CTLE setting and the comparator level (+/-α). Tuning two knobs by hand is hard even for a known, fixed channel, and harder still for various unknown channels. Furthermore, the fact that these two tuning knobs are correlated complicates the adaptive scheme. First, changing the CTLE setting will change the data slope, which essentially changes the PD gain and the +/-α required for an optimal lock point with the maximum timing margin. Second, changing the CTLE setting also changes the +/-α needed for the 1-tap DFE by affecting the amount of post-cursor ISI remaining. For example, more equalization means a smaller α is needed and less equalization means a larger α is needed. As a result, when the CTLE and/or the α level changes, the optimal sampling point changes, which makes manual tuning difficult. Even a slight shift in the comparator level α undermines the eye opening, as shown in Figure 2.6. A reduced eye opening harms the robustness of the CDR system, as the noise and jitter margins are significantly undermined. Therefore, the goal of adaptation is to find the CTLE setting and the comparator level (+/-α) for the optimal jitter tolerance.

2.3.2 CTLE Adaptation

The ultimate goal of an inductor-less CTLE is not to fully compensate and equalize for the channel loss. The 1-tap DFE present in the CDR design would not be required in that scenario, hence making the DFE wasteful in terms of the CDR’s power budget.

We want the $f_{3dB}$ of the CTLE output to be at $f_{baud}/3 \sim f_{baud}/4$ for an ideal PD operation. This means that at the output of the CTLE, a 2UI pulse should reach full swing in an ideal scenario. In other words, all residual ISI ($\sum_{i=2}^{\infty} \alpha_i$) other than the first post-cursor should be fully minimized, for a system with a pulse response of $(1 + \alpha_1 D^1 + \alpha_2 D^2 + \alpha_3 D^3 + \cdots)$ as shown in Figure 2.7. The remaining one significant post-cursor ISI can be canceled out by the 1-tap DFE, responsible for equalizing the content at $f_{baud}/2$ (the Nyquist frequency).
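To make this target concrete, the sketch below normalises a baud-spaced pulse response to its main cursor and sums the post-cursor taps beyond the first. The tap values are made-up numbers for illustration, not measured or simulated data from this design, and summing absolute values is one common choice of metric rather than necessarily the one used in this thesis.

```python
def residual_isi(cursors):
    """cursors[0] is the main cursor and cursors[i] the i-th post-cursor,
    all sampled one UI apart. Returns (alpha_1, sum of |alpha_i| for i >= 2)."""
    main = cursors[0]
    alphas = [c / main for c in cursors[1:]]  # normalise so the main cursor is 1
    alpha_1 = alphas[0]                       # left for the 1-tap DFE to cancel
    residual = sum(abs(a) for a in alphas[1:])  # what the CTLE should minimise
    return alpha_1, residual


# Hypothetical pulse response samples (volts) at the CTLE output.
alpha_1, residual = residual_isi([0.20, 0.08, 0.01, -0.004])
print(alpha_1, residual)
```

In this example the first post-cursor is about 0.4 of the main cursor, while the residual beyond it is about 0.07, so a single DFE tap would remove most of the remaining ISI.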

Figure 2.6: Sub-optimal α levels illustrating reduced eye opening for noise and jitter margin

Figure 2.7: Pulse response of Channel + CTLE

Figure 2.8: Eye diagram demonstrating (1+α) - (1-α) = 2α

2.3.3 Comparator Level Adaptation

Since the front-end comparators are shared by the DFE and the PD, they can only be optimized for one of the two. If our primary goal is to adapt the comparator levels to optimize the DFE (i.e. set the comparators to exactly α to perfectly cancel out the 1st post-cursor ISI remaining after the CTLE), the α level can be extracted by looking at the data levels present in the CTLE’s output eye. A system with one significant post-cursor ISI will have 4 distinct levels: (1+α), (1-α), (-1+α), (-1-α). The exact value of α can be obtained from either Eq. 2.1 or Eq. 2.2 below. Figure 2.8 demonstrates the latter equation visually. The apex of the red eye opening, for when the previous bit was a 1, is (1+α). The apex of the blue eye opening, for when the previous bit was a 0, is (1-α). Taking the difference yields exactly 2α, and dividing by two then gives the value of the 1st post-cursor ISI.

$$(1 + \alpha) + (-1 + \alpha) = 2\alpha \tag{2.1}$$

$$(1 + \alpha) - (1 - \alpha) = 2\alpha \tag{2.2}$$
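As a quick numerical check of Eq. 2.1 and Eq. 2.2, the sketch below recovers α from three of the four data levels. The function and argument names are illustrative, and the levels are assumed to be normalised so that the main cursor equals 1.

```python
def alpha_from_levels(high_prev1, high_prev0, low_prev1):
    """Estimate alpha from normalised data levels.
    high_prev1 = (1 + alpha):  current bit 1, previous bit 1
    high_prev0 = (1 - alpha):  current bit 1, previous bit 0
    low_prev1  = (-1 + alpha): current bit 0, previous bit 1"""
    via_eq_2_1 = (high_prev1 + low_prev1) / 2   # Eq. 2.1 divided by two
    via_eq_2_2 = (high_prev1 - high_prev0) / 2  # Eq. 2.2 divided by two
    return via_eq_2_1, via_eq_2_2


# Levels for a hypothetical alpha of 0.3.
print(alpha_from_levels(1.3, 0.7, -0.7))
```

Both expressions return the same α when the levels are ideal; with a real eye they would differ slightly because of noise and residual ISI.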

In practice, by exploiting the 4 distinct data levels, a simple sign-sign LMS could be implemented to find α. Two sign-sign LMS loops, one for eref (error reference) and a slower loop for α, would get α to converge to the middle of the 110 pattern eye opening for when the previous bit was 1, as shown in Figure 2.9. Essentially, this LMS DFE adaptation is taking (1+α) + (-1+α) = 2α and dividing it by two to obtain α. The α in Figure 2.9 converges to the value shown in green. This green α value is the vertical mid-point of the 110 eye pattern.

Figure 2.9: LMS for adapting comparators for DFE

The sign-sign LMS convergence is governed by the following two equations, Eq. 2.3 and Eq. 2.4:

$$\alpha_{n+1} = \alpha_n + \mu \cdot \mathrm{sgn}(err)\,\mathrm{sgn}(d_{out}) \tag{2.3}$$

$$e_{ref,n+1} = e_{ref,n} + k \cdot \mathrm{sgn}(err)\,\mathrm{sgn}(d_{out}) \tag{2.4}$$

where the err signal is generated by a comparator which compares dout to ±α±eref depending on the data pattern. For example, for a data pattern where the previous data and the current data are both 1, the threshold of the comparator is +α + eref. If the previous data is 1 and the current data is 0, then the threshold would be +α − eref. However, it is apparent in Figure 2.9 that the value α converges to is not the optimal comparator level for PD operation in terms of jitter tolerance. In fact, the optimal PD level is shown in red. It was previously stated that for the optimal DFE operation, the comparator’s α level should be the vertical midpoint of the 110 eye opening available to the 1-tap loop-unrolled DFE. For the optimal PD operation, the comparator’s α level should intersect with the 011 rising data sequence at the x-axis midpoint of the 110 pattern eye opening. This x-axis midpoint should provide the greatest peak-to-peak jitter tolerance. In the next chapter, a novel technique for obtaining the optimal PD level will be discussed.

Chapter 3

Proposed Adaptation Engine

As indicated in the background chapter, an adaptive scheme is imperative as fine tuning for the best jitter tolerance manually is difficult. In addition, the jitter tolerance of Shibasaki’s baud-rate CDR is heavily affected by both the CTLE setting and the comparator level. Worse, the CTLE setting and the comparator level are correlated. Figure 3.1 illustrates the proposed adaptation engine, which is specifically tailored for a baud-rate CDR where comparators for the PD and the DFE are shared. Hence, the full-rate system level block diagram in Figure 3.1 contains many building blocks found in the Shibasaki baud-rate CDR even though all building blocks were designed independently in a slightly different process technology. The front-end consists of a CTLE, a 1-tap speculative DFE and a baud-rate PD which shares the comparators, and the CDR. The CDR loop is an analog loop consisting of a charge pump (CP) and a low-pass filter (LPF) as the loop filter. A ring oscillator is used for the voltage-controlled oscillator (VCO) in this inductor-less CDR.

The adaptive engine, highlighted at the bottom, receives the recovered clock (CKrec) from the CDR and rotates it to a new phase (CKX) in order to adaptively sample the CTLE output at the intersection of the 011 and 110 patterns. The rotation is performed by a phase interpolator (PI) and guided by the PI logic in the quarter-rate system implementation. The heart of the adaptive engine is a data level loop, which is a feedback loop that observes the CTLE samples and reconstructs the data level (dLev) with 9-bit resolution. To do so, the adaptive sampler subtracts the stored dLev from the current CTLE sample, quantizes the difference to 1-bit, and feeds the resulting bits to a digital filter prior to summing them up in an accumulator. The DAC produces an analog level corresponding to the 9-bit dLev and feeds it to the sampler as its threshold. The role of the digital filter and the pattern filter is to calculate the average, the maximum, and the minimum of the CTLE output at phase CKX. A cycle counter (or filter scheduler) schedules 16k clock cycles to execute each of these tasks using the same data level loop. The calculated values of dLev for various filters are then used by the adaptive logic block to guide both the CTLE parameter (Cs) and the one-tap DFE coefficient (+/-α) that also determines the sampling phase of the PD.

Figure 3.1: Full-rate system level block diagram of CDR and the proposed adaptation engine

3.1 Data Level Loop

In order to perform both CTLE and DFE/PD adaptation, we rely heavily on the data level loop. The original use of the data level loop for finding voltage levels was published in JSSC 2005 by Stojanovic et al. [37]. The proposed data level loop is modified to serve the adaptive scheme for the Shibasaki baud-rate CDR specifically. Figure 3.2 illustrates the block diagram of the proposed data level loop. The dLev converges to the middle of the data level of the filtered data sequence sampled by the adaptive clock, CKadapt. dLev convergence is governed by Eq. 3.1.

$$dLev_{n+1} = dLev_n + \Delta dLev \cdot \mathrm{sgn}(e_n) \tag{3.1}$$

Different data sequence patterns can be filtered out for the data level loop, making it very useful in determining the voltage level for any specific data pattern. For example, Figure 3.3 shows that dLev converges to the correct value of 200 mV. The final dLev value after convergence can be further stabilized by applying more filtering. To demonstrate that the data level loop can track any data pattern, we investigate sweeping of the adaptive clock for different pattern filter settings. Figure 3.4 depicts an example eye diagram and the result of dLev filtered out for the 111 pattern shown in green and the 011 pattern shown in blue. The adaptive clock here is swept for 1UI as a demonstration. Again, the data level loop could have been filtered more, which would have improved the monotonicity of dLev values, especially for the 011 pattern shown in blue.

Figure 3.2: Block diagram of proposed data level loop

Figure 3.3: Example of dLev converging
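The update in Eq. 3.1, combined with a pattern filter, can be sketched behaviourally in a few lines of Python. This is a simplified illustration, not the implemented engine: the DAC range of ±0.4 V, the one-LSB step, the Gaussian noise and the voltage levels assigned to each pattern are all assumed values, and the digital filter ahead of the accumulator is omitted.

```python
import random


def run_dlev_loop(samples, patterns, target, bits=9, full_scale=0.4, init_code=0):
    """Sign-based data level loop: dLev[n+1] = dLev[n] + step * sgn(e[n]).
    samples : CTLE output voltages at the adaptive clock phase
    patterns: 3-bit tuple observed around each sample
    target  : the pattern the filter passes, e.g. (1, 1, 1)"""
    max_code = 2 ** (bits - 1) - 1
    lsb = full_scale / max_code           # DAC step in volts (assumed range)
    code = init_code
    for v, p in zip(samples, patterns):
        if p != target:                   # pattern filter: skip other sequences
            continue
        e = v - code * lsb                # sampler: sample minus stored dLev
        code += 1 if e > 0 else -1        # 1-bit error drives the accumulator
        code = max(-max_code - 1, min(max_code, code))  # stay within 9 bits
    return code * lsb


random.seed(0)
n = 16384  # one 16k-cycle window
patterns = [random.choice([(1, 1, 1), (0, 1, 1)]) for _ in range(n)]
# Hypothetical levels: 200 mV for 111 and 100 mV for 011, plus 5 mV RMS noise.
samples = [(0.2 if p == (1, 1, 1) else 0.1) + random.gauss(0, 0.005) for p in patterns]
dlev_111 = run_dlev_loop(samples, patterns, (1, 1, 1))
dlev_011 = run_dlev_loop(samples, patterns, (0, 1, 1))
print(round(dlev_111, 3), round(dlev_011, 3))
```

Because only the sign of the error is used, the loop settles where samples of the selected pattern fall above and below dLev equally often, and then dithers by about one step around that level. Running the same loop with a different `target` reuses the hardware for another pattern, which is the property the adaptation engine relies on.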

3.2 Goals for On-Chip Adaptation

Figure 3.4: Example of data level (dLev) filtered for 111 and 011 pattern

Figure 3.5: Diagram illustrating the goal of on-chip adaptation

The end goal of the proposed adaptation engine is to arrive at the optimal jitter tolerance for the baud-rate CDR system. Initially, the adaptation engine needs to run with the CDR locked to a sub-optimal phase. This means that the CDR must be frequency locked and at least roughly phase locked, although the presence of bit errors (e.g. a BER of 1E-3) is acceptable. Initial bit errors are tolerable because once the adaptation engine is turned on, errors average out in the data level loop. Therefore, the data level loop is still able to track and perform data filtering to obtain dLev. Via adaptation, maximum jitter tolerance is autonomously achieved by adapting the lock position to the optimal data-sampling phase as shown in Figure 3.5. There are two steps to the proposed adaptive scheme, first the CTLE adaptation and then the DFE/PD adaptation:

1. Find the optimal CTLE setting (Cs) with the flattest equalization up to fbaud/3 ∼ fbaud/4, ensuring that there is only 1 significant post-cursor ISI with all other higher-order residual ISIs minimized.

2. Find the optimal comparator level for PD operation. The optimal PD level intersects with the 011 data sequence at the x-axis midpoint of the 110 pattern eye opening (Figure 3.6).

Figure 3.6: Diagram illustrating the optimal PD level

When the two conditions above are satisfied via adaptation, our baud-rate CDR should have the optimal jitter tolerance. It was previously mentioned that the comparator level can only be optimized for either the PD or the DFE since the front-end comparators are shared. Instead of adapting +/-α to optimize the DFE to cancel out the 1st post-cursor ISI perfectly, we opt to adapt for the optimal PD operation. The main reason for optimizing for PD operation is that the Shibasaki baud-rate CDR has less timing margin (jitter) than voltage margin (amplitude) for the DFE to recover the data correctly. Even if the optimal PD level is not at the exact value of the 1st post-cursor ISI α for the DFE, we trade off a small amount of eye opening for better jitter tolerance by locking to a more optimal data-sampling phase. Figure 3.7 depicts a fictitious example of the final eye opening where the comparator level is optimized for the DFE in (a) and for the PD in (b). It is apparent that the total peak-to-peak jitter tolerance is larger when the comparator level is optimized for the PD.

3.3 Adaptation Flow

A flow diagram of the proposed adaptation scheme is shown in Figure 3.8. First, the CTLE is set to the maximum equalization setting and the comparator level α for the DFE/PD is set to 0.3FS (full-scale). These settings should cause our baud-rate CDR to lock to a sub-optimal phase. This means that the CDR must be frequency locked, although bit errors could be present because the clock phase is sub-optimal.

Figure 3.7: Diagram of eye opening for comparator optimized for DFE and PD respectively

The initial comparator level of 0.3FS is chosen as the starting point because the system should have only one significant post-cursor ISI after the CTLE, since our baud-rate CDR only has a 1-tap loop-unrolled DFE which can cancel out just one post-cursor ISI. When the pulse response has just one significant post-cursor ISI, α is usually close to 0.3FS. This is not always the case, as different channels exhibit different channel and pulse responses. An initial α level of 0.3FS (e.g. roughly 100 mV for a 300 mV full-scale input) should be a viable starting point, but if the baud-rate CDR is unable to achieve phase lock with a reasonable BER (e.g. 1E-3), the proposed adaptation engine can be restarted with a different initial comparator level α. After the CTLE is set to the maximum setting and the comparator level α is set to the pre-defined value, adaptation begins. First, for the highest CTLE setting, the adaptation engine obtains a new α value (optimal PD level). This means that for the current eye, which is most likely over-equalized by the maximum CTLE setting, the adaptation engine has picked the optimal PD level, which gives the largest peak-to-peak jitter tolerance. The α level selection algorithm used by the proposed adaptation engine is discussed further in Section 3.5. Once the optimal PD level is obtained, the adaptation engine lowers the CTLE setting by one step. The proposed adaptation engine essentially goes back and forth between the CTLE (blue) and the comparator (red) adaptation as shown in Figure 3.8. In essence, for each CTLE setting along the way, the proposed adaptation engine is always tuning the CDR to the best lock position. In other words, it tunes the CDR in small tick-tock-like increments to avoid losing phase lock. If the CTLE adaptation were to finish completely before the comparator level α is adapted at all, there would be no guarantee that the CDR maintains lock.
Furthermore, as previously described, the CTLE setting and the comparator level α are correlated, and therefore it makes sense to adapt them in small increments to find the optimal solution in a multi-dimensional solution space. Once the proposed adaptation engine detects that the lowered CTLE setting did not reduce the CTLE line thickness, the engine reverts to the previous CTLE setting without updating the α level and ends adaptation. The use of line thickness as the metric for CTLE adaptation is elaborated in the next section.

Figure 3.8: Flow Diagram of Proposed Adaptation Engine
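The alternating flow described in this section can be sketched as a short loop. This is only an illustration under stated assumptions: `measure_thickness` and `find_pd_level` are hypothetical callbacks standing in for the line-thickness and optimal-PD-level measurements, and the toy thickness and Vamp numbers are invented for the example.

```python
def adapt(ctle_settings, measure_thickness, find_pd_level, alpha_init):
    """Alternate comparator-level and CTLE updates, one CTLE step at a time.

    ctle_settings     : settings ordered from the highest to the lowest
    measure_thickness : hypothetical callback (cs, alpha) -> line thickness
    find_pd_level     : hypothetical callback (cs, alpha) -> optimal PD level
    """
    idx = 0
    alpha = find_pd_level(ctle_settings[0], alpha_init)
    best = measure_thickness(ctle_settings[0], alpha)
    while idx + 1 < len(ctle_settings):
        cs_next = ctle_settings[idx + 1]
        thickness = measure_thickness(cs_next, alpha)
        if thickness >= best:
            break  # no improvement: keep the previous setting and alpha
        idx += 1
        best = thickness
        alpha = find_pd_level(cs_next, alpha)
    return ctle_settings[idx], alpha

# Toy data (hypothetical): thickness is smallest at Cs = 200 fF.
toy_thickness = {400: 60.0, 350: 50.0, 300: 42.0, 250: 35.0, 200: 30.0, 150: 38.0}
toy_vamp = {400: 240.0, 350: 230.0, 300: 220.0, 250: 210.0, 200: 200.0, 150: 190.0}

cs_opt, alpha_opt = adapt(
    [400, 350, 300, 250, 200, 150],
    lambda cs, alpha: toy_thickness[cs],
    lambda cs, alpha: toy_vamp[cs] / 2.0,
    alpha_init=90.0,
)
print(cs_opt, alpha_opt)
```

Updating α only after a CTLE step has been accepted mirrors the rule that the engine reverts to the previous CTLE setting without touching the α level when the line thickness stops improving.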

3.4 Part 1: CTLE Adaptation

The proposed adaptation engine is broken into two different phases: CTLE and comparator level adaptations. The former, CTLE adaptation, is discussed in more detail in this section. The CTLE used in our baud-rate CDR shown in Figure 3.1 is a common current-mode logic (CML) CTLE architecture. It is a differential pair with RC source degeneration, i.e. a resistor in parallel with a capacitor (Figure 3.9). This CTLE stage is repeated twice for the 2-stage CTLE in the CDR’s design.

Figure 3.9: Schematic of a CTLE stage with tunable Cs

The transfer function of the CTLE is as follows (Eq. 3.2).

$$H(s) = \frac{g_m}{C_L} \cdot \frac{s + \dfrac{1}{R_s C_s}}{\left(s + \dfrac{1 + g_m R_s/2}{R_s C_s}\right)\left(s + \dfrac{1}{R_D C_L}\right)} \tag{3.2}$$

The CTLE’s zero, poles, DC gain, and peaking gain are governed by the equations below.

$$\omega_z = \frac{1}{R_s C_s} \tag{3.3}$$

$$\omega_{p1} = \frac{1 + g_m R_s/2}{R_s C_s} \tag{3.4}$$

$$\omega_{p2} = \frac{1}{R_D C_L} \tag{3.5}$$

$$\text{DC gain} = \frac{g_m R_D}{1 + g_m R_s/2} \tag{3.6}$$

$$\text{Ideal peak gain} = g_m R_D \tag{3.7}$$

$$\text{Ideal peaking} = \frac{\text{Ideal peak gain}}{\text{DC gain}} = \frac{\omega_{p1}}{\omega_z} = 1 + g_m R_s/2 \tag{3.8}$$
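Equations 3.2 to 3.8 can be checked numerically with a short sketch. The component values below (gm, Rs, Cs, RD, CL) are hypothetical and chosen only to give round numbers; they are not the values of the fabricated CTLE.

```python
import math

def ctle_params(gm, rs, cs, rd, cl):
    """Return (wz, wp1, wp2, dc_gain, peak_gain) per Eqs. 3.3 to 3.7."""
    wz = 1.0 / (rs * cs)
    wp1 = (1.0 + gm * rs / 2.0) / (rs * cs)
    wp2 = 1.0 / (rd * cl)
    dc_gain = gm * rd / (1.0 + gm * rs / 2.0)
    peak_gain = gm * rd
    return wz, wp1, wp2, dc_gain, peak_gain

def h_mag(f_hz, gm, rs, cs, rd, cl):
    """Magnitude of H(s) in Eq. 3.2 evaluated at s = j*2*pi*f."""
    wz, wp1, wp2, _, _ = ctle_params(gm, rs, cs, rd, cl)
    s = 2j * math.pi * f_hz
    return abs((gm / cl) * (s + wz) / ((s + wp1) * (s + wp2)))

# Hypothetical component values, for illustration only.
gm, rs, cs, rd, cl = 10e-3, 400.0, 200e-15, 200.0, 20e-15

wz, wp1, wp2, dc_gain, peak_gain = ctle_params(gm, rs, cs, rd, cl)
peaking = peak_gain / dc_gain

# Doubling Cs halves the zero frequency but leaves the DC gain unchanged.
wz_2x, _, _, dc_gain_2x, _ = ctle_params(gm, rs, 2.0 * cs, rd, cl)

print(f"fz = {wz / (2 * math.pi) / 1e9:.2f} GHz, peaking = {peaking:.2f}")
print(f"|H(0)| = {h_mag(0.0, gm, rs, cs, rd, cl):.3f}, DC gain = {dc_gain:.3f}")
```

The comparison with a doubled Cs shows why Cs is a convenient adaptive variable: the zero moves while the DC gain of Eq. 3.6, which does not contain Cs, stays fixed.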

For our proposed CTLE adaptation scheme, the adaptive variable is the source degeneration capacitor Cs. Sweeping Cs is, overall, very similar to sweeping the zero, ωz.

Figure 3.10: CTLE transfer function across Cs settings simulated in MATLAB Simulink

Digitally tunable capacitors are used to adjust Cs for gentle tuning of the CTLE transfer function [27]. By increasing the source degeneration capacitance, the zero frequency of the system can be reduced while maintaining the low-frequency (DC) gain, as seen in Figure 3.10. In addition, if Rs were the adaptive variable, a VGA would have been needed to boost the DC gain, since Rs increases peaking by lowering the DC gain. Adding a VGA is complicated because the VGA setting must be adapted as well, adding a third variable that would further complicate the adaptive scheme.

As mentioned in Section 3.2, the goal of CTLE adaptation is to find the optimal CTLE setting (Cs) with the flattest equalization up to fbaud/3 ∼ fbaud/4, ensuring that there is only 1 significant post-cursor ISI. This also means that the 0011 (2 UI pulse) pattern should reach full-scale and have the minimum line thickness. In other words, the line thickness of the CTLE’s eye is representative of the residual ISI ($\sum_{i=2}^{\infty} \alpha_i$) for a pulse response of $(1 + \alpha_1 D^1 + \alpha_2 D^2 + \alpha_3 D^3 + \cdots)$. The CTLE adaptation involves observing the CTLE output’s line thickness and exploiting this property. Figure 3.11(a) illustrates that the line thickness of the CTLE’s eye can be measured for three different data patterns at the crossing (CKX): the 111, 011, and 0101 patterns. Measuring line thickness for the 011 and 111 patterns ensures the flattest equalization up to fbaud/3 ∼ fbaud/4. For example, the line thickness for the 011 pattern can be obtained by setting the data filter of the data level loop to 011 and allowing the loop to track the maximum dLev, which is dLev(011)max. Similarly, the minimum value of the 011 pattern can be obtained by allowing the data level loop to track the minimum dLev, which is dLev(011)min. Taking the difference yields the line thickness for the 011 pattern, as shown in Eq. 3.9. The line thickness of the 111 pattern is obtained in the same manner (Eq. 3.10). Since the max and min values of dLev are heavily dictated by voltage noise, the line thickness is heavily filtered to average out the noise.

$$\text{Line Thickness}(011) = dLev(011)_{max} - dLev(011)_{min} \tag{3.9}$$

$$\text{Line Thickness}(111) = dLev(111)_{max} - dLev(111)_{min} \tag{3.10}$$
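Equations 3.9 and 3.10 amount to a filtered difference of two tracked levels. The sketch below assumes that the heavy filtering is a plain average over repeated dLev readings, which is a simplification. The function name and the readings (in mV) are hypothetical.

```python
def line_thickness(dlev_max_readings, dlev_min_readings):
    """Averaged dLev(max) minus averaged dLev(min) for one pattern filter."""
    avg_max = sum(dlev_max_readings) / len(dlev_max_readings)
    avg_min = sum(dlev_min_readings) / len(dlev_min_readings)
    return avg_max - avg_min

# Hypothetical repeated readings for the 011 pattern at the crossing phase.
thickness_011 = line_thickness([212.0, 208.0, 210.0], [168.0, 172.0, 170.0])
print(thickness_011)
```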

Figure 3.11(b) depicts the CTLE’s line thickness for different data patterns vs. CTLE Cs settings. This plot is generated in MATLAB Simulink with the baud-rate CDR running with the settings described in the plot title. It is evident that a Cs setting of 200 fF in Figure 3.11(b) yields the optimal CTLE setting with the minimum line thickness, and thus the flattest equalization up to fbaud/3 ∼ fbaud/4. To converge to the minimum thickness, the initial CTLE setting is set to the maximum value as described in the previous section (Section 3.3), and the CTLE setting is lowered until the sign of the change in line thickness flips. At this point, the adaptation engine reverts to the previous CTLE setting and ends adaptation. In contrast to spectrum balancing of a CTLE as in Figure 3.12 [33, 20, 17, 10, 18], CTLE adaptation based on the line thickness of the CTLE output is essentially a pattern-guided CTLE adaptation similar to [12, 13, 32]. Figure 3.13 shows the CTLE transfer function with the 0011 pattern and its neighboring patterns for different CTLE settings. Filtering for the 011 and 111 patterns and measuring the line thickness is a pattern-guided method of optimizing the CTLE equalization setting such that the transfer function has the flattest response without certain patterns being over- or under-equalized. Essentially, minimizing line thickness removes all the higher-order post-cursor ISIs except for the first post-cursor ISI, which the DFE cancels after the CTLE.

A visual example demonstrating the CTLE adaptation is shown in Figure 3.14. The CTLE’s eye diagrams for various Cs settings from the highest to the lowest are shown on the left. Following the CTLE adaptation algorithm starting from the highest setting, the engine converges to the CTLE setting with minimum line thickness, which is at Cs = 200 fF. The eye diagram associated with this optimal setting is highlighted in red.

3.5 Part 2: Comparator Level Adaptation

The comparator level adaptation is predicated on optimizing for the PD operation instead of the DFE. The optimal PD level is essentially the x-axis midpoint of the eye opening.

(a) Measuring line thickness for different data patterns

(b) Line thickness Vs CTLE Cs setting

Figure 3.11: CTLE adaptation using line thickness

Figure 3.12: Block diagram of spectrum balancing [17]

Figure 3.13: CTLE transfer function showing 0011 pattern and its neighboring patterns for three different CTLE settings

Figure 3.14: Visual Example of CTLE adaptation

In order to find the optimal PD level, we look at the slew rate, similar to [23]. It is apparent from Figure 3.15 that the 011 rising data sequence slews, and thus the slew rate is given by:

$$\text{Slew rate} = \frac{V_{amp}}{0.5\,\text{UI}} \tag{3.11}$$

The 0.5 UI base of the triangle follows from the premise that if the CTLE is able to equalize up to fbaud/4, all ISIs are equalized for the 0011 pattern, and thus the rise time of a 011 sequence equals the fall time of a 110 sequence. Exploiting the right triangle shown in Figure 3.15, the optimal PD level follows Eq. 3.12. Setting α to this optimal PD level samples the data at the center of the eye opening available for the 1-tap speculative DFE, with 0.5 UI of timing margin on both sides. As a result, once we find Vamp, we can set the comparator level α to the optimal PD level simply by dividing Vamp by 2.

$$\text{Optimal PD level} = 0.25\,\text{UI} \times \text{Slew rate} = \frac{V_{amp}}{2} \tag{3.12}$$

Figure 3.15: Visual example of theory behind the proposed algorithm for finding optimal PD level

To find Vamp, the adaptive sampler inside the data level loop is used to find the voltage levels for different data patterns. First, the PI code is swept to find the cross point between dLev(011)avg and dLev(110)avg. This cross point of the average values of the two patterns is illustrated in Figure 3.16. This figure also shows various patterns highlighted on the eye diagram and, on the right, their corresponding values of dLev simulated in MATLAB Simulink. At this clock phase CKX (at the crossing), Vamp = dLev(011)max can be obtained. At this same clock phase, the line thickness for CTLE adaptation is also obtained by measuring dLev(011)max and dLev(011)min. The maximum value, rather than the average value, is deliberately used for Vamp because the rising waveform does not slew along an ideal straight line: the slope tails off slightly near the crossing point of the 011 and 110 patterns. This phenomenon can be seen in Figure 3.17. Taking the maximum value of dLev helps to compensate for this nonideality of the rising waveform.
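The PI sweep and the Vamp/2 rule of Eq. 3.12 can be sketched as follows. The sketch assumes that the cross point is taken as the PI code where the two averaged levels are closest. The function names, the 9-code sweep, and the mV values are hypothetical.

```python
def find_crossing_code(dlev_011_avg, dlev_110_avg):
    """Return the PI code where the averaged 011 and 110 levels are closest."""
    codes = range(len(dlev_011_avg))
    return min(codes, key=lambda c: abs(dlev_011_avg[c] - dlev_110_avg[c]))

def optimal_pd_level(vamp):
    """Eq. 3.12: the comparator level alpha is Vamp / 2."""
    return vamp / 2.0

# Hypothetical sweep over 9 PI codes (levels in mV).
rising_011 = [0.0, 35.0, 70.0, 105.0, 140.0, 175.0, 210.0, 245.0, 280.0]
falling_110 = list(reversed(rising_011))
# Hypothetical dLev(011)max reading at each PI code.
dlev_011_max = [v + 10.0 for v in rising_011]

ckx_code = find_crossing_code(rising_011, falling_110)
vamp = dlev_011_max[ckx_code]          # Vamp = dLev(011)max at the crossing
alpha = optimal_pd_level(vamp)
print(ckx_code, vamp, alpha)
```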

3.6 Summary of Adaptation

Figure 3.18 shows a visual summary of the entire adaptation plotted against the CTLE parameter Cs. The graph should be read from right to left, where the adaptation process starts at the highest setting of 200 fF and continues until it finds the minimum line thickness (e.g. 100 fF). Where and how the line thickness is taken is shown on the left side. At every step along the way, for each value of Cs, the α level is updated to the optimal PD level at the x-axis center of the eye opening with 0.5 UI margin on both sides, as shown in the eye diagram. At Csopt, which has the minimum line thickness, the α level is considered αopt, and these are the final converged values after adaptation. In the next section (Section 3.7), where the system-level behavioral model is discussed, adaptation versus time is illustrated via a time-domain simulation.

Figure 3.16: Visual example of finding Vamp

Figure 3.17: Visual example of why Vamp = dLev(011)max

Figure 3.18: Adaptation where line thickness guides CTLE adaptation of Cs (top right) and optimal sampling phase deduced from the slew rate guides adaptation of comparator level

3.7 System-level Behavioral Model

A system-level behavioral model was built in MATLAB Simulink for a quarter-rate architecture as shown in Figure 3.19, which is the exact schematic of the circuit implementation to be discussed in the next chapter when we delve into analog design. Variables that are adjusted during adaptation are highlighted in red. These variables (CTLE parameter, α, dLev, and PI code) form a feedback loop from digital to analog and can be observed in a system-level model. Both continuous-time and discrete-time event-driven models were created, as each has specific pros and cons. An advantage of the continuous-time model is that it is more accurate and provides insight into the real-time eye diagram of the signals, which is very useful in debugging. The cost of a continuous-time model is that the simulation is slow. An event-driven model allows for a faster simulation, which is useful for running a sweep of long simulations, e.g. a jitter tolerance test. In both the continuous-time and event-driven models of the CDR, the analog front-end is separated from the digital back-end, which is solely for digital adaptation.

3.7.1 Behavioral Model: Continuous-time Model

For modelling the channel, the Tyco 5” channel’s real s-parameters measured with a vector network analyzer (VNA) were imported. The RF Toolbox from MATLAB was used to map these into a rational function with poles and zeros such that the transfer function block can be used in Simulink for the channel model. An example of various channels mapped in MATLAB is shown in Figure 3.20. Specifically, the proposed CDR was mainly designed for the Tyco 5” channel, as other channels have too much loss at the Nyquist frequency of 18 GHz for the target data-rate of 36 Gb/s.

Figure 3.19: Proposed schematic of quarter-rate baud-rate receiver. Digital adaptation is completed in digital back-end and the rest of the CDR is done in analog front-end

One problem of a behavioral model is that it is usually built from ideal models and is hence too unrealistic. Gaussian and sinusoidal jitter are added to the quarter-rate clocks in order to imitate real-life jitter and noise as much as possible. Gaussian voltage noise is also added to the comparators. Although metastability, offset and hysteresis are not captured properly for the comparator, adding voltage noise is far better than a noiseless ideal comparator.

Figure 3.20: Plot of channel characteristic of various channels imported and converted to rational system model in MATLAB

A graphical example of adaptation vs. time is shown in Figure 3.21. As context for this simulation, the channel model is the Tyco 5” channel at a data-rate of 36 Gb/s. This channel in MATLAB has 19.9 dB loss at Nyquist. The VCO’s phase noise at 1 MHz offset is set to -80 dBc/Hz. Other simulation conditions include 0.1 UIpp sinusoidal jitter, 0.1 UIpp random Gaussian jitter, additive white Gaussian noise with 40 dB SNR, and comparator noise turned on. The left column with the eye diagrams illustrates the quality of the eye for the initial CTLE equalization setting and the final CTLE setting. A Cs value of 200 fF gives the smallest line thickness and hence the optimal CTLE equalization setting. On the right are the different variables plotted against time. The CTLE parameter Cs is initially set to the highest level of 400 fF and then adaptation begins. Once the line thickness of the next Cs value is greater than the previous value, Cs returns to the previous state and adaptation finishes. In this example, when Cs is set to 150 fF, the line thickness increases; therefore, the Cs setting returns to 200 fF, which corresponds to the minimum line thickness. At each Cs setting along the way, the comparator level α is set to the optimal PD level discussed in Section 3.5 so that the CDR always preserves phase lock, since large abrupt changes in CDR settings can cause the CDR to lose lock and diverge completely. To verify that the CDR is error free, a BERT (bit error rate tester) was built in Simulink to confirm that the recovered data is still a PRBS just like the input to the channel. The error count was stable after the adaptation finished at 20 µs.

3.7.2 Behavioral Model: Event-driven Model

An event-driven simulation refers to a class of discrete-time simulations where the smallest simulation step size corresponds to the occurrence of an event as opposed to time [15]. An advantage of an event-driven model of a CDR is that the simulation is up to 1800 times faster than a continuous-time model [39]. Event-driven simulations use a variable time step that captures events of interest, resulting in faster simulation. The event-driven model is designed to share many of the same components as the continuous-time model. However, one difference is how the VCO clocks and the input data are created. For the channel model, the channel step response is used instead of a continuous-time transfer function with poles and zeros. The CTLE’s transfer function is also combined with the channel model’s transfer function into one combined step response since the two cannot be cascaded. Figure 3.22 illustrates an example of the combined step and pulse responses of the channel + CTLE. Furthermore, the CDR’s loop filter is converted into a discrete-time z-domain function using the bilinear transform.

Figure 3.21: Adaptation vs time where line thickness guides CTLE adaptation and optimal PD level guides α level adaptation. Tyco 5” channel at 36 Gb/s

Figure 3.22: Step response and pulse response of channel + CTLE

Using the event-driven model of the CDR, jitter tolerance was simulated with injection of sinusoidal jitter after the digital adaptation converged to the final values. The jitter tolerance simulation was programmed in MATLAB to use a binary search algorithm with 5 iterations from the initial search points (red line) shown in Figure 3.23. For example, if the jitter tolerance test passes at the initial search point for the specified BER, that amplitude is the jitter tolerance for that frequency. If the test fails at the initial search point, it is repeated at the halfway point in terms of amplitude, as in a binary search. The high-frequency jitter tolerance is 0.5 UIpp and the minimum jitter tolerance is around 0.4 UIpp at the dip due to an undershoot. These jitter tolerance values are for BER < $10^{-6}$; therefore, degradation is expected for BER < $10^{-12}$.
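The amplitude search described above can be sketched in a few lines. The original search was programmed in MATLAB. The Python version below is an illustration only, and its bisection details (a lower bound of zero amplitude that is assumed to pass) are assumptions. `passes` is a hypothetical callback that would run one BER test at a given sinusoidal jitter amplitude.

```python
def jitter_tolerance(passes, start_amp, iterations=5):
    """Search for the largest passing jitter amplitude at one jitter frequency.

    passes     : hypothetical callback, amplitude (UIpp) -> True if BER target met
    start_amp  : initial search point (UIpp)
    iterations : number of bisection steps taken when the start point fails
    """
    if passes(start_amp):
        return start_amp            # the initial search point is the result
    lo, hi = 0.0, start_amp         # lo is assumed to pass, hi is known to fail
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if passes(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy pass/fail model (hypothetical): the CDR tolerates up to 0.43 UIpp.
jt = jitter_tolerance(lambda amp: amp <= 0.43, start_amp=1.0)
print(jt)
```

With 5 iterations the result is resolved to within start_amp / 32, so the initial search points bound how finely each frequency point is measured.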

Figure 3.23: Jitter tolerance simulated for the converged adaptive setting of event-driven model (BER < $10^{-6}$)

Chapter 4

Circuit Simulations and Measurement Results

Circuit implementation (schematic design) in Cadence and its simulation results, as well as the measured results of the testchip, are discussed in this chapter. The 36 Gb/s inductor-less baud-rate CDR from Figure 3.19, on which the adaptation engine is demonstrated, is fully implemented in the analog front-end. Since it is a standalone analog CDR, it is fully functional without digital synthesis. The adaptation engine and the BERT are the only circuits built digitally. An advantage of having a standalone, fully functional CDR designed in analog is that locking of the CDR can be simulated in Cadence. When a digital CDR is used, an AMS (analog/mixed-signal) verification tool has to be set up to test the analog circuits together with the digital circuits in order to verify locking of the CDR, which is more complicated. To minimize the chance of the adaptive baud-rate CDR not working, the analog CDR was fully verified for lock in Cadence using Spectre, and it was verified that there are no bit errors. In addition, the digital circuit is verified functionally in ModelSim and NCsim at every step along the way (Verilog RTL, synthesized RTL, post place & route) against the vectors generated in the Simulink simulation. The digital implementation is discussed in more detail after the analog design section.

4.1 Analog Design

Since the digital adaptation is the novelty of this adaptive baud-rate CDR, only key schematics and simulation results are discussed. To begin the analog design section, Figure 4.1 shows the schematic of the 2-stage CTLE. The tunable source degeneration capacitor is a 4-bit digitally programmable array of MOM capacitors and pass-gate switches. A 2-stage CTLE (both stages with high-frequency boost) had to be implemented because the CTLE is heavily loaded at the output by 8 comparators for the DFE and the PD, and one additional comparator used for adaptation. As a result, with a single CTLE stage, the output capacitance limits the bandwidth of the boost. When two stages are used, the first stage is only loaded by the input gate of the second CTLE stage; hence, it provides boost at a higher frequency and improves the CTLE’s overall bandwidth. Figure 4.2 is the simulated AC response of the 2-stage CTLE with post-layout extraction. Due to layout parasitics from the interconnect, power grid and fill, it is very challenging to push the bandwidth any further. Figure 4.3 shows the eye opening for various Cs values with post-layout extraction with 32 Gb/s PRBS31 as the input.

Figure 4.1: Schematic of the 2-stage CTLE

Figure 4.2: Simulated AC response of the 2-stage CTLE in Cadence

Figure 4.3: Simulated eye diagram at the output of the 2-stage CTLE. Top two eyes are under-equalized. The bottom left is optimally equalized and the bottom right is starting to over-equalize

Following the CTLE in the analog front-end are the comparators. In a quarter-rate architecture, the number of comparators required increases. In the proposed adaptive baud-rate CDR, 8 comparators are required for the DFE and the PD. Four comparators have +α and the other four comparators have -α as the threshold level, as shown in Figure 3.19. The double-tail latch published in ISSCC 2007 is used as the sense amplifier as opposed to a StrongArm latch because double-tail latches have a performance advantage at lower supply voltages due to less stacking of devices [31]. A schematic of the double-tail latch is shown in Figure 4.4. Since TSMC 28nm HPC uses a 0.9 V supply for thin-oxide devices, double-tail latches were preferred. Since we require comparisons to +/-α instead of the zero level, a dual-difference comparator scheme was used where the input data is compared to the threshold level α. These threshold levels are generated by a 9-bit reference DAC block with 512 levels of 1 mV/step. For the single adaptive sampler used to provide error information to the digital adaptation, the input sensitivity is improved by optimizing the sizing of the double-tail latch. A higher input accuracy was necessary for adaptation, whereas for the other 8 comparators used for the DFE and the PD, higher sensitivity was not required to achieve a CDR lock and error-free recovered data.

Figure 4.4: Schematic of double-tail latch published in ISSCC2007 [31]

The DFE and the PD, which use the information gathered by the comparators, are custom digital logic blocks made up of simple logic gates, flip-flops, multiplexers and adders. The DFE is designed to operate at 2 Gb/s in 16 parallel interleaved paths. The PD is designed to operate at 4 Gb/s in 8 parallel paths. The four quarter-rate comparator paths are demuxed accordingly for both the DFE and the PD.

The charge pump is a simple current-steering differential pair and the loop filter is a common higher-order RC loop filter of a type-II PLL. Figure 4.5 depicts a simplified schematic of the charge pump and loop filter combination. The amount of current steered by the charge pump can be digitally adjusted with a 4-bit setting, which affects the CDR’s loop gain. C1 and C2 of the loop filter are fixed capacitances implemented with MOM capacitors. The resistance is a 4-bit tunable resistor switch array for adjusting the CDR’s loop dynamics.

Figure 4.5: Schematic of charge pump and loop filter

The VCO is a CML-based 8-stage ring oscillator as shown in Figure 4.6. 8 stages were used to reduce phase noise, as it has been published that increasing the number of stages in a ring oscillator reduces the phase noise [3]. The proposed 8-stage ring VCO uses the same CML delay stage architecture as [16, 29] and has a tuning range of 6.76 to 9.14 GHz when Vctrl is swept from 200 to 700 mV. For this tuning range, the VCO’s free-running phase noise simulated in Cadence Spectre after post-layout extraction was -80.77 to -82.42 dBc/Hz at 1 MHz offset. The CML clocks coming out of the VCO are converted to the CMOS signals used by the comparators, using a CML2CMOS circuit similar to that in [9].

Figure 4.6: Schematic of 8-stage ring oscillator used as VCO

The VCO and the clock buffers are under a regulated voltage to suppress supply voltage noise. An LDO (low-dropout) regulator with a PMOS pass-gate was implemented for the regulator. A PMOS design with a lower drop-out voltage had to be used instead of an NMOS pass-gate, which has a superior PSRR (power supply rejection ratio), because high-voltage thick-oxide devices were not available in the TSMC 28nm HPC design kit through MOSIS. Since the nominal supply voltage is 0.9 V with a maximum recommended voltage of 1.0 V, an LDO regulator had to be used. Even then, the LDO is designed with a 1.1 V supply, which is slightly higher than the recommended maximum supply voltage and hence could have some repercussions in terms of reliability. Since this is a testchip rather than a real product with more stringent reliability requirements, applying the 1.1 V supply solely to the PMOS pass-gate of the LDO was deemed acceptable. If higher-supply thick-oxide devices had been available, the LDO regulator would have been placed on the higher supply (VDDH).

Inputs to the digital back-end of the CDR are designed to be approximately 1 Gb/s data and a 1 GHz core clock (CKrec/8). The Dout data path is 32 parallel bits of data as shown in Figure 3.19. The adaptive sampler is clocked by an adaptive clock CKX, with the PI’s phase controlled by the output of the digital adaptation. The comparator threshold level uses dLev from the output of the digital adaptation as well. The error signal from the adaptive sampler is also down-sampled to 1 Gb/s by multiple demux stages. However, it is still a 1-bit signal as only the MSB is propagated through the demux stages. To generate the core clock (CKrec/8) used for digital adaptation, the quarter-rate clock is divided down by clock dividers.
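As a quick consistency check on the rates quoted above, the arithmetic below assumes the 32 Gb/s input rate and the 8 GHz quarter-rate clock used in the closed-loop simulations of Section 4.1.1. At the 36 Gb/s target rate every figure scales by 36/32, which is consistent with the "approximately 1 Gb/s" wording.

```latex
\begin{align*}
\text{DFE path rate} &= \frac{32\ \text{Gb/s}}{16\ \text{paths}} = 2\ \text{Gb/s} \\
\text{PD path rate} &= \frac{32\ \text{Gb/s}}{8\ \text{paths}} = 4\ \text{Gb/s} \\
\text{Core clock } (CK_{rec}/8) &= \frac{8\ \text{GHz}}{8} = 1\ \text{GHz} \\
D_{out}\ \text{throughput} &= 32\ \text{bits} \times 1\ \text{Gb/s} = 32\ \text{Gb/s}
\end{align*}
```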

4.1.1 Closed-loop CDR Simulations

The results of the closed-loop CDR simulations with post-layout extraction are discussed in this subsection for the proposed adaptive baud-rate CDR with just the analog CDR portion, without digital adaptation. The testbench of the analog CDR portion is as follows. The input data is a 32 Gb/s PRBS31 pattern which is attenuated through a Verilog-A model of the Tyco 5” channel imported into Cadence. Since the CDR is a PLL-style CDR, the initial frequency is adjusted by setting the Vctrl of the VCO to an initial frequency that matches the input data within the frequency capture range of the PD. Figure 4.7 illustrates that the CDR is error free for all 16 parallel DFE paths of the data after phase lock. Even if all parallel data paths are error free for a PRBS pattern, this does not prove that the interleaved data at full-rate is also error free. Therefore, another test with a PRBS7 was conducted to verify that the data is still a PRBS7 when the parallel recovered data paths are interleaved and combined manually. A PRBS7 was used because it is a short repeating pattern that can be checked much more easily than, say, a PRBS31. For a PRBS31, a digital BERT written in Verilog for digital synthesis can be used after fabrication to check that the data is error free. It interleaves all 32 down-sampled parallel data paths and verifies that the result is error free.
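The reason for checking the interleaved stream can be illustrated with a small sketch. It assumes the common PRBS7 polynomial x^7 + x^6 + 1 and 16 parallel paths. The helper names are hypothetical and this is not the testbench used for the chip. Each 16:1 decimated path of a PRBS7 is itself a PRBS7, so a path-ordering error can pass every per-path check and is only caught once the paths are interleaved.

```python
def prbs7(n, seed=0x7F):
    """Generate n bits of PRBS7 with polynomial x^7 + x^6 + 1 (assumed)."""
    state, out = seed, []
    for _ in range(n):
        newbit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | newbit) & 0x7F
        out.append(newbit)
    return out

def is_prbs7(bits):
    """Check the PRBS7 recurrence b[n] = b[n-6] XOR b[n-7] on every bit."""
    return all(bits[n] == bits[n - 6] ^ bits[n - 7] for n in range(7, len(bits)))

def deinterleave(bits, n_paths):
    """Split a serial stream into n_paths parallel paths."""
    return [bits[k::n_paths] for k in range(n_paths)]

def interleave(paths):
    """Recombine parallel paths into one serial stream."""
    n_paths = len(paths)
    total = sum(len(p) for p in paths)
    return [paths[i % n_paths][i // n_paths] for i in range(total)]

bits = prbs7(16 * 127)
paths = deinterleave(bits, 16)

# Simulate a wiring error: two adjacent paths are swapped.
swapped = list(paths)
swapped[0], swapped[1] = swapped[1], swapped[0]

paths_ok = all(is_prbs7(p) for p in swapped)      # each path alone still passes
good_stream_ok = is_prbs7(interleave(paths))      # correct ordering passes
bad_stream_ok = is_prbs7(interleave(swapped))     # swapped ordering fails
print(paths_ok, good_stream_ok, bad_stream_ok)
```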

Figure 4.7: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count

Figure 4.8: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered clock frequency

Figure 4.8 illustrates the frequency of the recovered clock. It dithers around an average value of 8 GHz (quarter-rate) for 32 Gb/s PRBS31 data. Figure 4.9 shows Vctrl, the control voltage going into the VCO. It fluctuates after the CDR is locked since the PD periodically dithers between early and late. The peak-to-peak amplitude of Vctrl matched the Simulink simulation result very closely, confirming a close resemblance between the behavioral model and the circuit simulation.

4.2 Digital Design

During the initial design of the digital adaptation in MATLAB Simulink, the building blocks were deliberately built from MATLAB fcn (function) blocks, with code written much like Verilog to ease the HDL conversion at the later design stage, i.e. RTL logic synthesis. This is because the end goal of the digital circuit was not merely to simulate and validate the results in a behavioral model, but to synthesize and place & route the digital design so that the digital layout could be taped out along with the custom analog layout.

Figure 4.9: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO

There are three major digital blocks for the adaptation engine in Figure 3.1. The first is the accumulator block, which makes up the data level loop in conjunction with the adaptive sampler, an analog block designed in Cadence. The accumulator block consists of a digital filter, pattern filter, integrator and an FSM for pattern filter scheduling. The second digital block is the PI logic block for obtaining the appropriate clock phase for the adaptive sampler at the crossing point of the 011 and 110 data patterns. The third is the adaptation logic block, which is responsible for measuring the line thickness, Vamp, to adapt the CTLE parameter Cs and the comparator level α for the DFE/PD adaptation.

Now that the digital blocks have been introduced, the process of digital design will be discussed. The first step was to generate the test vectors from a Simulink simulation for both the input and output variables of the digital adaptation circuit. These test vectors are saved at every rising edge of the core clock used to clock the digital circuits. Second, all MATLAB fcn blocks in Simulink were manually converted into Verilog code. The validity of the Verilog models was confirmed with a testbench in which the Verilog models were fed the input test vectors saved from Simulink. For each bit, the output of the Verilog model was compared to the expected output vectors from Simulink.

The next step in the digital design was to take the RTL design written in Verilog and run RTL synthesis in Cadence using a .tcl script. After RTL synthesis, the adaptation engine was not able to meet the timing constraint at a 1 GHz clock speed, even with the high-speed custom standard cells. To fix this timing failure, the critical path was identified. The adaptation engine takes in 32 parallel data bits as the input and evaluates all of them in 8 blocks (groups) of 4-bit data. This is a serial operation in the digital logic, hence the timing could not be closed.
The solution was to use only one block of 4-bit data instead of all 8 and throw away 7 blocks (28 bits) of data every clock cycle. As a result, the timing constraint is met for the standard cells; the trade-off is that digital adaptation takes 8x longer to converge. Alternatively, one could argue for slowing down the digital core clock to 250 MHz to meet timing, but this means that the recovered data going into the digital adaptation has to be demuxed from 32 to 128 bits. To process all 128 bits, there is a 4x increase in the propagation delay of the combinational logic. Although 250 MHz means a 4x longer period, there is zero gain in terms of timing since the delay scales linearly.

The second change, which further improved the timing of the digital circuit, was to remove the >= (greater than or equal) operation, which required a 5-bit digital comparator that is cascaded for each bit. Instead, >= was replaced with ==, which uses parallel XNOR gates and is a much cheaper operation in digital logic gates during RTL synthesis. Although >= is always the safer operation in case there is a glitch in the digital bits, == had to be used in order to meet the timing constraint for the testchip.

Once the synthesized Verilog met timing, it was again tested against the test vectors from Simulink to validate error-free operation. NCSim, a digital verification tool from Cadence, was used after RTL synthesis. In the next step, the synthesized Verilog was used with the P & R (place & route) flow in Cadence Innovus. This process generates a new Verilog file representing the end result of the place & route, which was again tested against the test vectors in NCSim. Since the Verilog code was validated against the test vectors generated from Simulink at every step of the digital design process, it provided confidence that the synthesized digital circuits would be functional post-tapeout even without an AMS simulation. An AMS simulation was omitted due to lack of time and resources, especially with a tight tapeout schedule.

A GDS (graphic database system) file created from P & R was streamed into Cadence for the layout of the digital adaptation, and the final Verilog generated from P & R was imported into Cadence for the schematic. With the imported schematic, an LVS (layout-versus-schematic) check was performed to ensure that all the connections were correct without shorts or opens.
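Returning to the timing discussion earlier in this section, the argument that a slower core clock gives no benefit can be checked with simple arithmetic. The sketch below assumes, as stated in the text, that the combinational delay grows linearly with the number of 4-bit blocks evaluated in series. The per-block delay `T_BLOCK_NS` is a hypothetical placeholder and not a value taken from the synthesis reports.

```python
def period_utilisation(clock_mhz, blocks_in_series, t_block_ns):
    """Fraction of one clock period used by the serial logic (must be < 1 to meet timing)."""
    period_ns = 1e3 / clock_mhz
    return blocks_in_series * t_block_ns / period_ns

T_BLOCK_NS = 0.3  # hypothetical delay of one 4-bit block, for illustration only

all_blocks_1ghz = period_utilisation(1000, 8, T_BLOCK_NS)    # 32 bits per cycle at 1 GHz
all_blocks_250mhz = period_utilisation(250, 32, T_BLOCK_NS)  # 128 bits per cycle at 250 MHz
one_block_1ghz = period_utilisation(1000, 1, T_BLOCK_NS)     # only 1 of 8 blocks evaluated

# Using 1 block out of 8 per cycle means 8x fewer bits contribute to adaptation.
convergence_slowdown = 8 // 1

print(all_blocks_1ghz, all_blocks_250mhz, one_block_1ghz, convergence_slowdown)
```

Under this linear model the 1 GHz and 250 MHz options use the same fraction of the clock period, which is the "zero gain" noted above, while evaluating a single block shortens the path by 8x at the cost of 8x slower convergence.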

4.3 Lab Measurements

In this section, measured results of the testchip from the lab will be discussed.

4.3.1 Testchip

The testchip of the adaptive baud-rate CDR with CTLE and 1-tap DFE was fabricated in TSMC 28nm HPC CMOS technology with a 0.9V supply. The testchip die was packaged in an open-cavity QFN so that the high-speed input and output could be probed. Figure 4.10 shows the packaged die with the wire bond connections under the microscope, and Figure 4.11 is the package pinout instruction sent to the packaging company. Figure 4.12 zooms further into the die, with all the major building blocks highlighted in aqua blue. The total testchip area was 1.57 mm wide by 0.785 mm high. The following subsection will explain the test setup for the testchip.

Figure 4.10: Open-cavity QFN under a microscope showing wire bond connections for the proposed adaptive baud-rate CDR

4.3.2 Test Setup

Figure 4.15 illustrates the test setup for normal operation of the adaptive baud-rate CDR. The packaged testchip was soldered onto a PCB as shown in Figure 4.13, which is programmed via an Arduino Mega2560 (Figure 4.14) with a PC. Figure 4.10 depicts the QFN package under a microscope. High-speed probes rated for 40G were used to probe the high-speed PRBS input data. The SHF 12104A bit pattern generator was used to generate both PRBS7 and PRBS31. The input data then passes through the Tyco 5” channel using 36” SMA cables and is connected to a 40G bias tee before being connected to the GSGSG probe head. The channel loss through this setup is shown in Figure 4.16, which is measured using the Agilent N5222A PNA microwave network analyzer. The setup for obtaining the S21 channel characteristic is depicted in Figure 4.17. Figure 4.18 illustrates an eye diagram of this PRBS input at 36 Gb/s, observed using the Agilent Infiniium DCA-J 86100C digital communication analyzer with an 86112A electrical module. At this data-rate, the input eye is completely closed before being equalized. The measurement setup for observing this PRBS input eye diagram is shown in Figure 4.19.

On the output side, high-speed quarter-rate recovered clocks were probed at the output pads for observing the clock spectrum and the phase noise with the Rohde & Schwarz FSWP26 phase noise analyzer and VCO tester. Before the spectrum analyzer is connected, the differential clocks being probed need to be converted to a single-ended signal using a Narda 4346 180° coupler. For low-speed (250-500 Mb/s) or static digital signals, the Agilent DSAX91604A Infiniium 16GHz real-time digital storage oscilloscope was used to observe and debug the digital adaptation.

Figure 4.11: Package Pinout for D1: Adaptive baud-rate CDR with CTLE + 1-tap DFE

Figure 4.12: Die micrograph in TSMC 28nm HPC process for the proposed adaptive baud-rate CDR

Figure 4.13: High-speed testboard for design 1: adaptive baud-rate CDR testchip. Testboard is programmed and controlled by Arduino Mega2560 + PC

Figure 4.14: Arduino Mega2560 used to program the testboard PCB

(a)

(b)

Figure 4.15: Measurement setup for testing adaptive baud-rate CDR

Figure 4.16: Measured S21 insertion loss of Tyco 5” channel with 36” cables and bias tees using a VNA. Bias tees are required for setting the input common-mode, as the SHF 12104A PRBS bit pattern generator cannot apply a voltage offset. The low-frequency loss is caused by the poor low-frequency performance of the bias tees.

Figure 4.17: Measurement setup for measuring S21 channel loss

Figure 4.18: 36 Gb/s PRBS31 input eye measured using a sampling scope including all channel loss

4.3.3 Measurement Results

Figure 4.19: Measurement setup for eye diagram of input PRBS31

All measurements presented in this subsection were obtained with the setup shown in Figure 4.15. With 35 Gb/s PRBS31 input data, the proposed adaptive baud-rate CDR is able to converge to the optimal CDR settings and achieve a CDR lock. The CDR’s recovered clock spectrum at 8.75 GHz quarter-rate is shown in Figure 4.20(a). The locked clock spectrum exhibits a skirt characterized by the loop dynamics or the bandwidth of the CDR. Phase noise of the locked CDR for the same converged adaptive settings is shown in Figure 4.20(b). There is an overshoot present in the CDR’s loop dynamics, but this is the optimal setting with minimum overshoot for the analog loop filter present on the chip. Since the loop filter was designed using fixed MOM capacitor values, there is no extra tuning knob to completely fix the overshoot. The worst-case phase noise is -104 dBc/Hz at 40 MHz offset. Total integrated jitter is 875.8 fs with a PRBS31 input pattern. Figure 4.21 presents the same measured results but for PRBS7. The worst-case phase noise is -105 dBc/Hz at 35 MHz offset. Total integrated jitter is 750.6 fs for PRBS7, which is better than PRBS31 as expected.

The measured phase noise of the locked CDR is a result of the CDR loop bandwidth suppressing the poor phase noise of the free-running VCO, which was simulated in Cadence Spectre to be -80.77 to -82.42dBc/Hz at 1 MHz offset over the entire tuning range. Increasing the CDR’s bandwidth suppresses the phase noise of the VCO further, but as a consequence allows more input jitter from the data signal to enter the system. As a result, there is a fine balance in setting the loop bandwidth of the CDR: moving it too far in either direction from the optimal point introduces bit errors.

For the converged CDR setting after adaptation, jitter tolerance was tested by injecting sinusoidal jitter with the SHF 12104A bit pattern generator, which was programmed via a PC using an Ethernet cable. Figure 4.22 is the measured jitter tolerance for PRBS7 & PRBS31 plotted against IEEE 802.3 masks [1,2]. The equipment limits were 54ps of absolute sinusoidal jitter amplitude and a maximum jitter frequency of 400 MHz. The sinusoidal jitter was injected only up to 300 MHz, as the accuracy of the equipment up to the maximum frequency of 400 MHz was uncertain. For the jitter amplitude, however, the full 54ps was used, since even 54ps seemed a little low at lower jitter frequencies. The dip in the jitter tolerance curve is due to the undershoot, which goes hand in hand with the overshoot in the CDR’s jitter transfer curve represented by the phase noise plot. Again, this dip caused by the undershoot cannot be completely fixed due to the fixed capacitors in the analog loop filter, which sets the CDR’s loop dynamics. Despite the undershoot, the jitter tolerance after adaptation passes both the IEEE 802.3bs and IEEE 802.3cc [1,2] receiver jitter tolerance masks.
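Since jitter tolerance masks are specified in unit intervals while the equipment limit is quoted in picoseconds, a small conversion helps put the 54ps limit in context. The function name `jitter_ps_to_ui` is hypothetical, and the conversion keeps whatever amplitude convention (peak or peak-to-peak) the instrument uses, which is not restated here.

```python
def jitter_ps_to_ui(jitter_ps, data_rate_gbps):
    """Convert a jitter amplitude in ps to unit intervals at the given data-rate."""
    ui_ps = 1e3 / data_rate_gbps  # one UI in picoseconds
    return jitter_ps / ui_ps

for rate_gbps in (34, 35, 36):
    print(rate_gbps, "Gb/s:", round(jitter_ps_to_ui(54, rate_gbps), 3), "UI")
```

At 35 Gb/s one UI is about 28.6 ps, so the 54ps limit corresponds to roughly 1.9 UI.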

(a) Clock Spectrum

(b) Phase Noise

Figure 4.20: Measured clock spectrum and phase noise for locked CDR at 35 Gb/s

(a) Clock Spectrum

(b) Phase Noise

Figure 4.21: Measured clock spectrum and phase noise for locked CDR at 35 Gb/s

Figure 4.22: Measured jitter tolerance with sinusoidal jitter injected

In addition, the converged setting is in fact the optimal setting with the least amount of undershoot. This becomes evident when the jitter tolerance is plotted for settings around the converged setting after adaptation. Figure 4.23 shows that the minimum 10-100MHz JTol (jitter tolerance) degrades rapidly as the CTLE’s parameter Cs diverges from the converged value of 8, highlighted in red. When sweeping the Cs value, the comparator level α is held constant at the converged value (α = 133 mV). Similarly, Figure 4.24 shows the result of taking the converged setting and manually sweeping the comparator level α while holding Cs constant (Cs = 8). It is evident that the minimum 10-100MHz JTol is less sensitive to a change in the finer 9b comparator level α with 512 settings than to a change in the coarser 4b CTLE parameter Cs with 16 settings.

Figure 4.25 demonstrates that the designed adaptive baud-rate CDR is able to adapt to different channel losses, as all three curves pass the IEEE 802.3 masks. Different channel losses were created by changing the input data-rate, which in essence changes the channel loss at Nyquist, since the Nyquist frequency itself changes. New channels with different attenuation could not be obtained, which is why the input data-rate had to be swept. Counterintuitively, at the slower data-rate of 34 Gb/s, the proposed adaptive baud-rate CDR actually performs worse, because 34 Gb/s is at the bottom of the VCO’s tuning range, where the KVCO gain is lower and the VCO may be very noisy and perhaps not even monotonic. The testchip returned from fab faster than the TT (typical) corner, therefore the center frequency of the VCO is higher than the intended design. Ideally, with this process shift to an FF corner, the CDR should be operating at a higher range between 36-45 Gb/s, but an appropriate channel with 10-18 dB loss at Nyquist was not available for these data-rates.

Figure 4.23: Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping CTLE parameter Cs

Figure 4.24: Minimum 10-100MHz jitter tolerance around converged adaptive setting, manually sweeping comparator level α

Figure 4.25: Measured jitter tolerance for different channel losses by sweeping the data-rate, hence Nyquist frequency

The final measurement done in the lab is the power consumption measurement. On the PCB, sense resistors were installed in order to measure the current drawn by the DUT (device under test) for each of the power domains. The power domains are separated into VDDA, VDD CTLE, VDD DAC, VDD LDO, VDD DIG and VDD IO. VDDA contains most of the analog circuits, including comparators, demux, PD, DFE, CP, LF and some clock buffers & clock dividers. The VDD CTLE domain powers the two-stage CTLE. VDD DAC is a separate supply domain solely for the reference DAC used to set the comparators’ threshold levels. The reference DAC’s power supply was kept separate in case the reference DAC’s levels had to be adjusted independently from the other power domains after tapeout. VDD LDO consists of the VCO, the VCO bias, the LDO regulator and clock buffers. VDD DIG is for the synthesized digital circuit, and VDD IO is for the IO drivers from the TSMC standard library and the intermediate buffers to the output pads for low-speed digital signals used for debugging (it also carries heavily down-sampled analog signals such as the comparator and the DFE outputs).

Figure 4.26: Measured power consumption with 35 Gb/s PRBS31 input with CDR lock

Since the adaptation engine implemented in digital is designed to turn off automatically after convergence, the VDD DIG power is omitted from the total power consumption in normal operation, although the power consumed by VDD DIG (6.3 mW) is still reported. Likewise, the 2.7 mW of VDD IO is omitted, as the IO drivers and buffers were only present for the testchip’s debugging purposes. Figure 4.26 is the measured power consumption with 35 Gb/s PRBS31 input while the CDR is phase locked and error free. The total power consumption is 106.3 mW, which is 3.04 pJ/bit. Most of the power is consumed by the VCO to bring down the phase noise. A ring VCO’s phase noise improves by 3 dB with every doubling of the current consumption. Extra power was spent in the VCO to lower the risk of the CDR not locking due to poor phase noise, especially given the inductance from the package wirebonds. Therefore, if less power had been spent on the VCO by trading off phase noise margin, the total power consumption of the CDR, and hence its figure of merit (FOM) in pJ/bit, could have been improved drastically.

Finally, Figure 4.27 compares the performance of the proposed work to prior works. This is the first on-chip, live adaptation engine tailored for a baud-rate CDR where the comparators are shared between the DFE and the PD to save power. This comparison concludes the discussion of the adaptive baud-rate CDR with CTLE and 1-tap DFE.
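The power figures quoted above can be cross-checked with a few lines of Python. The function names are hypothetical. The 10·log10 scaling of phase noise with current is the usual first-order approximation behind the 3 dB per doubling rule stated in the text, not a measured characteristic of this VCO.

```python
import math

def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Figure of merit: mW divided by Gb/s gives pJ/bit."""
    return power_mw / data_rate_gbps

def phase_noise_improvement_db(current_ratio):
    """First-order ring-VCO scaling: phase noise improves by 10*log10(current ratio)."""
    return 10 * math.log10(current_ratio)

fom_normal = energy_per_bit_pj(106.3, 35)                   # VDD DIG and VDD IO excluded
fom_with_debug = energy_per_bit_pj(106.3 + 6.3 + 2.7, 35)   # both domains included

print(round(fom_normal, 2))       # 3.04
print(round(fom_with_debug, 2))   # 3.29
print(round(phase_noise_improvement_db(2), 2))  # 3.01
```

Including the digital and IO domains would raise the energy per bit by about 0.25 pJ/bit, which shows how small their contribution is next to the analog and VCO domains.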

Figure 4.27: Performance comparison to prior work for the same CDR architecture

Chapter 5

Proposed 2x Half-Baud-Rate CDR

This chapter presents the details of the second baud-rate CDR design that was taped out in TSMC 28nm technology. The proposed design #2 is a 2x half-baud-rate CDR with CTLE and data decoder. This testchip consists of an analog CDR, with the digital BERT being the only synthesized digital block incorporated via a place & route tool.

5.1 Background

Background to the proposed 2x half-baud-rate scheme will be discussed in this section. First, some prior architectures in clock and data recovery will be introduced.

5.1.1 Alexander 2x-oversampled Bang-Bang PD

Robustness is a crucial aspect of building a receiver for an I/O link. The Alexander (2x-oversampled) bang-bang phase detector (BBPD), where the data is sampled twice, at the center and the edge, has been prominent in clock and data recovery due to its robustness and simple hardware implementation [4]. Figure 5.1 illustrates the simple hardware involved with the Alexander 2x-oversampled BBPD. The basic operation is as follows. If the previous data Dn and edge En are the same then the clock is early. If the next data Dn+1 and edge En are the same then the clock is late. Since this is a bang-bang PD, theoretically, the output should be totally non-linear. However, the inevitable jitter present in practice linearizes the PD characteristic, where the slope or the PD gain is a function of the σ of the jitter.
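The early/late rule above can be written as a small Python sketch. The names `alexander_pd` and `average_pd_output` and the sign convention (EARLY = -1, LATE = +1) are assumptions for illustration. The second function assumes ideal edges and Gaussian jitter, in which case the averaged bang-bang output follows an error function whose slope near lock scales as 1/σ.

```python
import math

EARLY, HOLD, LATE = -1, 0, +1  # sign convention chosen for this sketch

def alexander_pd(d_prev, edge, d_next):
    """Alexander bang-bang decision from data Dn, edge En and data Dn+1."""
    if d_prev == d_next:
        return HOLD  # no transition, so no timing information
    return EARLY if edge == d_prev else LATE

def average_pd_output(phase_error, sigma, transition_density=0.5):
    """Mean PD output for Gaussian jitter of standard deviation sigma (same units)."""
    return transition_density * math.erf(phase_error / (math.sqrt(2) * sigma))

print(alexander_pd(0, 0, 1))  # edge equals previous data: early
print(alexander_pd(0, 1, 1))  # edge equals next data: late
print(average_pd_output(0.01, 0.02), average_pd_output(0.01, 0.05))
```

For the same small phase error, the smaller σ gives the larger average output, which is the dependence of PD gain on jitter described above.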


Figure 5.1: Schematic of Alexander 2x-oversampled bang-bang PD and its basic operation

Figure 5.2: Visual example of the lock point of Mueller-Muller PD [25]

5.1.2 Mueller-Muller Baud-Rate PD

Despite the robustness and simple hardware of the Alexander 2x-oversampled BBPD, the recent trend has shifted towards baud-rate phase detectors as a means of reducing power consumption by sampling only once per UI [36, 9, 7, 11, 14, 38]. However, it is apparent from prior works [9, 7] that the Mueller-Muller phase detector (MMPD), which is a popular baud-rate PD option, is sensitive to equalization and symmetry in the pulse response [25]. The MMPD’s lock point is at the middle of the symmetric pulse where the pre-cursor equals the post-cursor, as shown in Figure 5.2. As a result, if the pulse response is not perfectly symmetric, the locking point will not be at the peak of the pulse response. Also, the MMPD is only functional for uncorrelated random data and cannot lock to an alternating 0101 pattern. These disadvantages of the MMPD will be compared to the proposed 2x half-baud-rate scheme in a later section.
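Both properties of the MMPD mentioned above can be reproduced with a short numerical sketch. It uses one common form of the Mueller-Muller timing function, e[n] = d[n-1]·x[n] - d[n]·x[n-1], with a hypothetical 3-tap pulse response (pre-cursor, main cursor, post-cursor). The tap values and function names are illustrative assumptions and are not taken from the designs in this thesis.

```python
import random

def average_mm_output(symbols, h_pre, h0, h_post):
    """Average Mueller-Muller timing error for +/-1 symbols and a 3-tap pulse."""
    n_sym = len(symbols)
    # Received sample: x[n] = h_pre*d[n+1] + h0*d[n] + h_post*d[n-1]
    x = [0.0] * n_sym
    for n in range(1, n_sym - 1):
        x[n] = h_pre * symbols[n + 1] + h0 * symbols[n] + h_post * symbols[n - 1]
    errors = [symbols[n - 1] * x[n] - symbols[n] * x[n - 1]
              for n in range(2, n_sym - 1)]
    return sum(errors) / len(errors)

random.seed(0)
random_data = [random.choice((-1, 1)) for _ in range(20000)]
alternating = [1 if i % 2 else -1 for i in range(20000)]

# Asymmetric pulse: post-cursor 0.3, pre-cursor 0.1 (hypothetical values)
print(average_mm_output(random_data, 0.1, 1.0, 0.3))  # close to 0.3 - 0.1
print(average_mm_output(random_data, 0.2, 1.0, 0.2))  # close to 0 (symmetric pulse)
print(average_mm_output(alternating, 0.1, 1.0, 0.3))  # 0 regardless of the pulse
```

For random data the average output settles near the post-cursor minus the pre-cursor, so the loop is driven to the phase where the two are equal. For the alternating pattern the output is zero for any pulse shape, so there is no phase information to lock to.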

5.1.3 Sub-Baud-Rate Clock and Data Recovery

Figure 5.3: Half baud-rate data sampling

Following the trend of opting for lower power consumption by reducing the number of samples per UI, the most intuitive solution would be to take a baud-rate scheme and sample even less frequently. A half-baud-rate scheme, shown in Figure 5.3, where the data is sampled every other UI (0.5x-sampled), would potentially lower the power consumption. Whenever the data is sampled every other UI, information about the previous bit needs to be recovered as illustrated in Figure 5.4. For a system with only one significant post-cursor ISI, four distinct data levels exist: (h0+h1), (h0-h1), (-h0+h1), (-h0-h1), where h0 is the magnitude of the main cursor and h1 is the post-cursor ISI, as illustrated in the bottom portion of Figure 5.4. Samplers can be placed at appropriate threshold levels (+/-Vref and 0) to recover the data for the unsampled UI, thus recovering 2 bits of data for each sample. Figure 5.5 illustrates the threshold levels with which the CDR could yield error-free data recovery for any sequence of 2-bit data (dn−1, dn): (0,0), (0,1), (1,0), (1,1). However, the green arrows in Figure 5.5 highlight the theoretical maximum horizontal eye opening of 0.5 UI on the left and the vertical eye opening margins on the right. A small vertical eye opening translates into poor noise margin.

Despite this, half-baud-rate operation is theoretically feasible for clock recovery as well, even without an integrate & dump technique. Figure 5.6 illustrates that clock recovery could be achieved by adding two additional samplers at +/-α on top of the three samplers originally required solely for data recovery. The white circles at +/-α indicate the lock points. In comparison to the Mueller-Muller baud-rate CDR, this would still require fewer comparators (samplers), even for a quarter-rate clocking implementation. A total of 12 comparators would be required for an MM-CDR, whereas a half-baud-rate (0.5x-sampled) CDR would only require 10 comparators. In addition, since only every other UI is sampled, there would be a large power saving over the MM-CDR in the clock distribution network as well. A half-baud-rate scheme without the need for an integrate & dump technique sounds attractive in theory but is not very feasible in reality due to poor noise and jitter margin.
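The two-bit recovery from a single sample can be sketched as follows. The function names are hypothetical, bits are mapped to ±1 symbols, and Vref is assumed to sit at h0, the midpoint between the (h0+h1) and (h0-h1) levels. The cursor values h0 = 1.0 and h1 = 0.3 are illustrative only.

```python
def sample_level(d_prev, d_curr, h0, h1):
    """Sampled level h0*d[n] + h1*d[n-1] with bits 0/1 mapped to -1/+1."""
    to_symbol = lambda bit: 1 if bit else -1
    return h0 * to_symbol(d_curr) + h1 * to_symbol(d_prev)

def decode_two_bits(v, vref):
    """Recover (d[n-1], d[n]) from one sample using thresholds +Vref, 0 and -Vref."""
    d_curr = 1 if v > 0 else 0
    threshold = vref if d_curr else -vref
    d_prev = 1 if v > threshold else 0
    return d_prev, d_curr

def vertical_margins(h0, h1):
    """Distance from the nearest level to the +/-Vref and zero thresholds."""
    return {"at +/-Vref": h1, "at zero": h0 - h1}

H0, H1 = 1.0, 0.3  # hypothetical main cursor and post-cursor
for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    level = sample_level(pair[0], pair[1], H0, H1)
    print(pair, level, decode_two_bits(level, vref=H0))
print(vertical_margins(H0, H1))
```

The margin at the outer thresholds equals h1, so the scheme relies on the ISI term that an equalizer would normally try to remove, which is why the vertical eye opening is small.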

5.2 Proposed 2x half-baud-rate scheme

As discussed in the previous subsection on the half-baud-rate (0.5x-sampled) scheme, poor jitter and noise margins would make it very difficult to prototype in practice. Most certainly, jitter and noise would prevent the CDR from being error-free for BER < 10^-12. Therefore, the proposed CDR opts for a 2x half-baud-rate scheme where edge sampling is added to the half-baud-rate scheme to make it baud-rate on average. Since the data and edge are sampled every other UI, the 2x half-baud-rate PD is essentially a 2x oversampling BBPD at half-baud-rate that locks to the edge. The advantages of edge locking will be examined in this section.

Figure 5.4: Sub baud-rate data recovery by exploiting ISI

Figure 5.5: Eye diagram example of sub baud-rate data recovery by exploiting ISI. Green arrows on the left show theoretical maximum horizontal eye opening of 0.5UI. Green arrows on the right show the small vertical eye opening margins

Figure 5.6: Sub baud-rate (0.5x-sampled) data and clock recovery in comparison to Mueller-Muller CDR

Figure 5.7 illustrates the full-rate block diagram of the proposed 2x half-baud-rate CDR. Blocks highlighted in red are simple logic-gate circuits required for the 2x half-baud-rate operation. The data decoder circuit is imperative for the data recovery of the 1 UI that is not sampled at all, and this is done by exploiting the inherent ISI present in the system.

Figure 5.8 illustrates an eye diagram corresponding to a channel with one significant post-cursor ISI while all other ISI terms are assumed to be minimized through a front-end equalizer. We sample a UI by three comparators at the edge phase φe with their outputs labeled as DL, ED, and DH, and by one comparator at the center phase φc with its output labeled as DM, while we skip sampling the following UI altogether. Indeed, we rely on ISI to recover the previous bit. In doing so, we perform 4 comparisons in every other UI, or on average 2 comparisons per UI. By having the center and edge samples, albeit in every other UI, this scheme inherits the benefits of a bang-bang PD (BBPD) by locking to the edge, as will be demonstrated later. By skipping every other UI, the proposed scheme shares the benefits of reduced hardware and low power consumption with the baud-rate Mueller-Muller PD (MMPD).

We explain the phase detector (PD) and the data decoder (DD) logic by observing samples from the current UI (n). If at φe the data falls between +/-Vref, we conclude that there is a data transition (0→1 or 1→0) at this phase and hence we judge early/late by the outputs of the edge (ED) and the data (DM) comparators, similar to BBPD logic. If these two bits are identical, the clock is late; otherwise, it is early, as shown in the phase detector table of Figure 5.8.
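The phase detector rule just described can be expressed as a minimal Python sketch. The function name, the encoding of the comparator outputs (DH = 1 when the sample is above +Vref, DL = 1 when it is above -Vref), the ±1 sign convention and the HOLD output when no transition is detected are assumptions made for illustration. The actual truth table is the one in Figure 5.8.

```python
EARLY, HOLD, LATE = -1, 0, +1  # sign convention chosen for this sketch

def half_baud_pd(dh, dl, ed, dm):
    """Early/late decision of the 2x half-baud-rate PD from one sampled UI."""
    # A sample between -Vref and +Vref at the edge phase indicates a transition.
    transition = (dl == 1 and dh == 0)
    if not transition:
        return HOLD  # assumed: no update when no transition is detected
    return LATE if ed == dm else EARLY

print(half_baud_pd(dh=0, dl=1, ed=1, dm=1))  # ED equals DM: late
print(half_baud_pd(dh=0, dl=1, ed=0, dm=1))  # ED differs from DM: early
print(half_baud_pd(dh=1, dl=1, ed=1, dm=1))  # no transition: hold
```

Only four single-bit inputs are involved, which is consistent with the PD being implemented as a handful of logic gates.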

Figure 5.7: Full-rate block diagram of the proposed 2x half-baud-rate CDR architecture

The DD only needs to observe the outputs of the data comparators (DH, DL, and DM) to decode the current and the previous bit. Similar to a 1-tap speculative DFE, the DD recovers the unsampled UI by slicing the data eye at a threshold that is adjusted depending on the previous bit sequence. If the outputs of all three comparators are zero, Dn−1 and Dn are both zero. Similarly, if the outputs of all three comparators are 1, Dn−1 and Dn are both 1. If the data at φe falls between +/-Vref, it implies a transition between Dn−1 and Dn. Therefore, by observing the sign of DM (which indicates Dn), we can find Dn−1 as the complement of Dn. The data decoder logic is also summarized in a table in Figure 5.8.

Although the eye diagram of the proposed 2x half-baud-rate scheme may look similar to duobinary signalling, there are some advantages to the proposed scheme. The disadvantage of duobinary is that a precoder & decoder are required, and the precoder especially is not trivial, as stated in various prior works [40, 19]. The proposed 2x half-baud-rate scheme does not need a precoder and operates with conventional NRZ signalling. In addition, the proposed data decoder is simple hardware made up of digital logic gates, which is efficient in terms of both power and area.

Figure 5.8: Eye diagram of the proposed 2x half-baud-rate scheme

Figure 5.9: Proposed 2x half-baud-rate PD compared to the conventional baud-rate Mueller-Muller PD

The robustness of a conventional 2x oversampling BBPD and the power saving of the MMPD from sampling at baud-rate are combined in the proposed 2x half-baud-rate PD. Both the MMPD and the proposed 2x half-baud-rate PD in Figure 5.9 display similar PD characteristics over a 1 UI period when properly tuned and equalized, depicted by the black curves. However, the MMPD suffers significantly as the equalization setting and comparator level Vref diverge from the optimal point. For instance, when an offset is present in the comparator reference level +/-Vref, a dead zone forms for the MMPD as shown in Figure 5.9 (first row). Similarly, the second row of Figure 5.9 shows that when residual ISI exists due to poor front-end equalization, a dead zone appears for the MMPD. It is apparent from the simulation results that the proposed PD (similar to the BBPD) is less sensitive to these two settings.
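The data decoder rule described earlier in this section can be sketched in the same way. The function name and the comparator encoding (DH = 1 above +Vref, DL = 1 above -Vref) are assumptions for illustration. Combinations not covered by the description are returned as `None` here, since their handling is not specified in the text.

```python
def decode_pair(dh, dl, dm):
    """Return (D[n-1], D[n]) from the DH, DL and DM comparator outputs."""
    if (dh, dl, dm) == (0, 0, 0):
        return (0, 0)
    if (dh, dl, dm) == (1, 1, 1):
        return (1, 1)
    if dh == 0 and dl == 1:
        # Sample between -Vref and +Vref: a transition, so D[n-1] is the complement of D[n].
        return (1 - dm, dm)
    return None  # inconsistent comparator outputs: behaviour not specified in the text

for outputs in [(0, 0, 0), (1, 1, 1), (0, 1, 1), (0, 1, 0)]:
    print(outputs, "->", decode_pair(*outputs))
```

Two bits are produced from three single-bit inputs with no arithmetic, which reflects the low hardware cost claimed for the decoder.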

5.2.1 System-level Behavioral Model

A system-level behavioral model was built in MATLAB Simulink for the quarter-rate architecture as shown in Figure 5.10. Similar to the behavioral models built for the first design (the adaptive baud-rate CDR, Section 3.7), both continuous-time and event-driven models in MATLAB Simulink were created for the 2x half-baud-rate CDR. The details of how the continuous-time and event-driven behavioral models were built are omitted, as they are designed the same way with minor tweaks to some of the building blocks, such as the PD and the addition of the data decoder.

Figure 5.10: Proposed quarter-rate implementation of 2x half-baud-rate CDR. Proposed 2x half-baud-rate PD and the data decoder are simple custom high-speed digital logic gates

Figure 5.11: Jitter tolerance simulated for event-driven model of 2x half-baud-rate CDR (BER < 10^-6)

Using the event-driven model of the 2x half-baud-rate CDR, jitter tolerance was simulated with the injection of sinusoidal jitter. The jitter tolerance simulation was programmed in MATLAB to use a binary search algorithm with 5 iterations from the initial search points (red line) shown in Figure 3.23. The high-frequency jitter tolerance is 0.5 UIpp and the minimum jitter tolerance is 0.394 UIpp. These jitter tolerance values are for BER < 10^-6 due to simulation time; for BER < 10^-12, degradation is expected.
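The binary search used for the jitter tolerance simulation can be sketched as below. This is written in Python for illustration rather than MATLAB, and it is not the script used in this work. `is_error_free` stands in for one run of the event-driven CDR model at a given jitter amplitude, and the toy pass/fail model and the starting bracket are hypothetical.

```python
def jitter_tolerance(is_error_free, passing_amp, failing_amp, iterations=5):
    """Largest passing jitter amplitude found by bisecting a pass/fail bracket."""
    lo, hi = passing_amp, failing_amp
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if is_error_free(mid):
            lo = mid   # still error free: tolerance is at least mid
        else:
            hi = mid   # errors observed: tolerance is below mid
    return lo

# Toy stand-in for the CDR model: passes up to 0.394 UIpp (illustrative only).
toy_model = lambda amp_ui: amp_ui <= 0.394

result = jitter_tolerance(toy_model, passing_amp=0.0, failing_amp=1.0)
print(result)  # 0.375
```

With 5 iterations the bracket shrinks by a factor of 32, so the resolution of the reported tolerance is set by the width of the initial search range at each jitter frequency.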

5.3 Circuit Implementation & Simulations

Circuit implementation (schematic design) in Cadence and its simulation results will be presented in this section. The quarter-rate circuit implementation follows Figure 5.10 such that the behavioral model and the schematic in Cadence match exactly.

5.3.1 Analog Design

The proposed 2x half-baud-rate scheme is implemented on a PLL-style, fully analog CDR. The analog design shares many components with the proposed adaptive baud-rate CDR from the previous chapter. The same two-stage CTLE is used, with a tunable source degeneration resistor and capacitor with 4-bit controls each. The output of the CTLE is sampled by a total of 8 double-tail latch comparators from [31], which is the same comparator used in the proposed adaptive baud-rate CDR. Since the 2x half-baud-rate CDR samples every other UI, only two comparators each at +/-Vref are required for a quarter-rate implementation. For the zero-level comparators, two extra comparators are required for edge sampling on the φe phase for clock recovery, for a total of four. One difference in the zero-level comparators at φe and φc is that the dual-difference scheme is removed: instead of the input being compared to a threshold, it is compared to the plus and minus polarities of the input signal itself, which is effectively the zero level.

The two critical circuit blocks, 1) the 2x half-baud-rate PD and 2) the data decoder, highlighted by red boxes in Figure 5.10, are designed using custom high-speed digital logic gates operating at 7.5 Gb/s in accordance with the truth table in Figure 5.8. These blocks, made up of digital logic gates, are very simple and consume very little power.

The CDR’s loop remains a higher-order RC loop of a type-II PLL for the proposed 2x half-baud-rate CDR. While the charge pump and the loop filter are the same as in the proposed adaptive baud-rate CDR, the 8-stage ring VCO is tuned such that the centre frequency is a little lower to compensate for the fact that there is no DFE. Therefore, the proposed 2x half-baud-rate CDR is not able to tolerate the same data-rate or the same attenuation.

The quarter-rate VCO clock is divided down by a factor of eight to be used for the digital BERT. Similarly, in the data path, the output of the data decoder is demuxed to produce 32 parallel data signals for the digital BERT. This digital BERT gives the true error rate, as opposed to checking one of the demuxed data paths in the analog domain. A demuxed version of a PRBS is guaranteed to be a PRBS, but not the other way around. Therefore, finding one of the demuxed data paths to be error-free does not imply that the interleaved version of all parallel paths is error-free. As a result, the digital BERT is imperative for obtaining the true bit error rate, and it is used for the BER measurement of the testchip.
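The point that an error-free demuxed path does not guarantee an error-free interleaved stream can be illustrated with a minimal Python sketch. It is not the Verilog BERT from the testchip: it assumes a PRBS7 with polynomial x^7 + x^6 + 1, a self-checking recurrence test, and a hypothetical fault in which two of the 32 lanes are swapped. All function names are illustrative.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of PRBS7 (x^7 + x^6 + 1) from a nonzero 7-bit seed."""
    state = seed & 0x7F
    out = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        out.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F
    return out


def prbs7_violations(bits):
    """Count positions where bits[n] differs from bits[n-7] XOR bits[n-6]."""
    return sum(bits[n] != (bits[n - 7] ^ bits[n - 6]) for n in range(7, len(bits)))


def demux(bits, n_lanes=32):
    """Split a serial stream into n_lanes parallel lanes (lane i gets every n-th bit)."""
    return [bits[lane::n_lanes] for lane in range(n_lanes)]


def interleave(lanes):
    """Recombine parallel lanes into a single serial stream."""
    return [bit for frame in zip(*lanes) for bit in frame]


stream = prbs7(32 * 200)
lanes = demux(stream)

# Hypothetical fault: two lanes swapped somewhere in the parallel data path.
swapped = list(lanes)
swapped[0], swapped[1] = swapped[1], swapped[0]

per_lane_errors = [prbs7_violations(lane) for lane in swapped]
good_errors = prbs7_violations(interleave(lanes))
bad_errors = prbs7_violations(interleave(swapped))

print("errors seen on any single lane:", max(per_lane_errors))
print("errors on correctly interleaved stream:", good_errors)
print("errors on interleaved stream with swapped lanes:", bad_errors)
```

In this sketch every individual lane still satisfies the PRBS7 recurrence, so a per-lane check reports no errors, while the interleaved stream with swapped lanes does not.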

5.3.2 Closed-loop CDR Simulations

The closed-loop CDR simulation results with post-layout extraction are discussed in this subsection for the proposed 2x half-baud-rate CDR. Figure 5.12 illustrates that the CDR is error-free for all parallel paths of the data after phase lock. To ensure that the interleaved data is also a PRBS, a test with a PRBS7 was conducted to verify that the data is still a PRBS7 when the parallel recovered data are interleaved and combined manually. For a PRBS31, the synthesized digital BERT written in Verilog could be used to check that the CDR is error-free when measured in the lab after fabrication. Figure 5.13 illustrates the frequency of the recovered clock. It dithers around an average value of 7 GHz (quarter-rate) for 28 Gb/s PRBS31 data. Figure 5.14 shows Vctrl, the control voltage going into the VCO. It fluctuates after the CDR is locked since the PD periodically dithers between early and late. The peak-to-peak amplitude of Vctrl matched the Simulink simulation result very closely, confirming a close resemblance between the behavioral model and the circuit simulation.

Figure 5.12: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Error count

Figure 5.13: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Recovered clock frequency

Figure 5.14: Simulated results of closed loop CDR at 32 Gb/s PRBS31 input in Cadence: Vctrl of VCO

Figure 5.15: Open-cavity QFN under a microscope showing wire bond connections for the proposed 2x half-baud-rate CDR

5.3.3 Digital Design

The digital BERT from Figure 5.10 is the only circuit that is digitally synthesized and placed & routed. The digital BERT takes in 32 parallel down-sampled recovered data signals as the input and interleaves them to check that the result is still an error-free PRBS pattern. Errcnt[19:0] is the total error count, err is the bit error for every clock cycle, and erronce is a flag that stays high if there is at least one bit error after the BERT is enabled. Because the AMS (analog/mixed-signal) verification tool was not set up for the TSMC 28nm design kit, the interface between the analog CDR and the digital BERT was never simulated. Since the analog CDR locked with post-layout extraction and the digital BERT was tested separately in ModelSim and NCSim throughout the digital design stages, the chance of the interface breaking was minimized. In addition, the digital BERT has the option to flip the LSB and MSB order of the input data in case the bus ordering at the analog/digital interface does not match.
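The behaviour of the three BERT outputs and the bit-order flip option can be sketched in Python as follows. This is only a behavioural illustration, not the synthesized Verilog: it assumes a PRBS7 self-checking recurrence for brevity, a 20-bit error counter that saturates at full scale, and an err output that is high when any bit of the current word is wrong. The class and method names (BertModel, clock, flip_bit_order) are hypothetical.

```python
def prbs7(n_bits, seed=0x7F):
    """Generate n_bits of PRBS7 (x^7 + x^6 + 1)."""
    state = seed & 0x7F
    out = []
    for _ in range(n_bits):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        out.append(new_bit)
        state = ((state << 1) | new_bit) & 0x7F
    return out


class BertModel:
    """Behavioural sketch of an error checker with err, erronce and errcnt."""

    ERRCNT_MAX = (1 << 20) - 1  # full scale of a 20-bit counter (assumed to saturate)

    def __init__(self, flip_bit_order=False):
        self.flip_bit_order = flip_bit_order
        self.errcnt = 0    # total error count
        self.err = 0       # error seen in the most recent word
        self.erronce = 0   # sticky flag: at least one error since enable
        self.history = []  # last 7 received bits, oldest first

    def clock(self, word):
        """Check one parallel word (list of bits, first-received bit first)."""
        if self.flip_bit_order:
            word = word[::-1]
        self.err = 0
        for bit in word:
            if len(self.history) == 7:
                expected = self.history[0] ^ self.history[1]
                if bit != expected:
                    self.err = 1
                    self.erronce = 1
                    self.errcnt = min(self.errcnt + 1, self.ERRCNT_MAX)
                self.history.pop(0)
            self.history.append(bit)
        return self.err


stream = prbs7(32 * 50)
words = [stream[i:i + 32] for i in range(0, len(stream), 32)]

clean = BertModel()
faulty = BertModel()
reversed_bus = BertModel()                     # bus order wrong, flip disabled
corrected_bus = BertModel(flip_bit_order=True)  # bus order wrong, flip enabled

for i, w in enumerate(words):
    clean.clock(w)
    faulty.clock([w[0] ^ 1] + w[1:] if i == 10 else w)  # one injected bit error
    reversed_bus.clock(w[::-1])
    corrected_bus.clock(w[::-1])

print(clean.errcnt, faulty.errcnt, faulty.erronce, reversed_bus.errcnt, corrected_bus.errcnt)
```

The flip option matters because a reversed bus makes an otherwise error-free link look broken; in the sketch, enabling the flip restores a zero error count.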

5.4 Lab Measurements

In this section, measured results of the testchip from the lab will be presented.

Figure 5.16: Package Pinout for D2: Non-uniform baud-rate CDR with CTLE

Figure 5.17: Die photo in TSMC 28nm HPC process for the proposed 2x half-baud-rate CDR. Dimensions of each building block are listed in a table.

5.4.1 Testchip

The testchip of the 2x half-baud-rate CDR with CTLE was fabricated in TSMC 28nm HPC CMOS technology with a 0.9V supply. The testchip die was packaged in an open-cavity QFN so that the high-speed input could be probed. Figure 5.15 shows the packaged die under the microscope with the wire bond connections, and Figure 5.16 is the package pinout instruction sent to the packaging company. Figure 5.17 zooms further into the die, with all the major building blocks highlighted. The testchip measures 1.57 mm in width by 0.785 mm in height. The total die area is 1.232 mm² and the area consumed by the building blocks of the CDR is only 0.135 mm². The following subsection will explain the test setup for the testchip.

5.4.2 Test Setup

Figure 5.19 illustrates the test setup for normal operation of the 2x half-baud-rate CDR. The packaged testchip was soldered onto a PCB as shown in Figure 5.18, which is programmed via an Arduino Mega2560 with a PC. Figure 5.15 depicts the QFN package under a microscope. High-speed probes rated for 40G were used to probe the high-speed PRBS input data. The SHF 12104A bit pattern generator was used to generate both PRBS7 and PRBS31. The input data then passes through the Tyco 5” channel using 36” SMA cables and is connected to a 40G bias tee before being connected to the SGS probe head. An SGS probe had to be used instead of a GSGSG probe due to limited clearance between the wirebonds. The channel loss through this setup is shown in Figure 5.20, which was measured using the Agilent N5222A PNA microwave network analyzer.

Figure 5.18: High-speed testboard for design 2: non-uniform baud-rate CDR testchip. Testboard is programmed and controlled by Arduino Mega2560 + PC

For the output recovered clock, CK/16 was observed instead of probing the high-speed quarter-rate clock because there was not enough clearance between the wirebonds to land the probes. The clock spectrum and the phase noise of the low-speed, divided-down version of the recovered clock (CK/16) were observed using the Rohde & Schwarz FSWP26 phase noise analyzer and VCO tester. For low-speed (250-500 Mb/s) or static digital signals, the Agilent DSAX91604A Infiniium 16GHz real-time digital storage oscilloscope was used.
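The relationship between the data rate and the divided clocks mentioned in this chapter can be worked out with a few lines of Python. The sketch assumes that CK denotes the quarter-rate recovered clock and uses the divide ratios quoted in the text (divide-by-8 for the digital BERT clock, 32 parallel data lanes, divide-by-16 for the observed clock output). The function name and dictionary keys are illustrative.

```python
def divided_rates(data_rate_gbps):
    """Derive clock and lane rates from the serial data rate (quarter-rate clocking)."""
    quarter_rate_clock_ghz = data_rate_gbps / 4
    return {
        "quarter_rate_clock_ghz": quarter_rate_clock_ghz,
        "bert_clock_mhz": quarter_rate_clock_ghz / 8 * 1e3,  # VCO clock divided by 8
        "lane_rate_mbps": data_rate_gbps / 32 * 1e3,         # 32 parallel data lanes
        "ck16_mhz": quarter_rate_clock_ghz / 16 * 1e3,       # observed clock output
    }


rates_30g = divided_rates(30.0)
for name, value in rates_30g.items():
    print(f"{name}: {value}")
```

For a 30 Gb/s input this gives a 7.5 GHz quarter-rate clock, a BERT clock and lane rate that match each other, and a CK/16 output that falls within the low-speed range handled by the oscilloscope.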

5.4.3 Measurement Results

All measurements presented in this subsection were done with the setup shown in Figure 5.19(a), where the Tyco 5” channel with 13.06 dB loss at Nyquist for 30 Gb/s was used. Initially, the VCO’s frequency is manually tuned to 30 Gb/s for an open-loop CDR as shown in Figure 5.21. Figure 5.22(a) illustrates the measured clock spectrum of the divided recovered clock (CK/16) for PRBS31 when the CDR is locked. The locked clock spectrum exhibits a skirt characterized by the loop dynamics or the bandwidth of the CDR. The integrated jitter from the phase noise plot is 823.5 fs for PRBS31. The recovered clock spectrum for PRBS7 and its phase noise plot are shown in Figure 5.23. The integrated jitter for PRBS7 is lower, as expected, at 731.8 fs. In addition, the capture range was measured to be -2300ppm to +66000ppm. The higher ppm in the positive direction is due to the asymmetric nature of the 2x half-baud-rate PD logic, where the data sample always follows the edge sample, not the other way around. This property makes frequency acquisition available for free in one direction without adding any additional feedback loop in the CDR. In other words, in the positive direction, where the incoming data is faster than the CDR’s initial VCO frequency, the PD is able to pull up the VCO frequency by +66000ppm (equivalently 2 Gb/s) to a frequency lock and then track the phase simultaneously to achieve a phase lock. The measured jitter tolerance with sinusoidal jitter injected at the input bit pattern generator is shown in Figure 5.24. The jitter tolerance curves for both PRBS31 & PRBS7 pass the IEEE 802.3 masks, although PRBS31 passes marginally.

(a)

(b)

Figure 5.19: Measurement setup for testing 2x half-baud-rate CDR

Figure 5.20: Measured S21 insertion loss of Tyco 5” channel with 36” cables and bias tees using a VNA. Bias tees are required for setting the input common-mode as the SHF 12104A PRBS bit pattern generator cannot set voltage offset to set the common-mode.
The proposed CDR was originally designed for 28 Gb/s. However, after fabrication the VCO’s tuning range was shifted up due to a process shift, perhaps to an FF corner, so the VCO’s frequency cannot be brought down to 28 Gb/s even after adjusting the VCO’s supply voltage and the bias tail current. As a result, the measured jitter tolerance at 30 Gb/s is not as high, since the CDR was never designed to operate at such a speed. Figure 5.25 illustrates the power breakdown per block. The total power consumption measured is 79.2 mW and the FOM is 2.64 pJ/bit (at 30 Gb/s). It is clear that the VCO and the clocking have been over-designed for phase noise; therefore, there is room for improvement in terms of power and FOM. Omitting the VCO and the clocking power, only 25 mW is consumed. Finally, the table in Figure 5.26 compares the performance of the proposed 2x half-baud-rate CDR to recently published baud-rate CDRs. This work is the first reported 2x half-baud-rate CDR, that is, a CDR that oversamples by 2x at half-baud-rate, hence sampling every other UI and locking to the edge.
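The quoted figure of merit and the conversion of the capture range from ppm to Gb/s follow from simple arithmetic, shown here as a short Python sketch. The function names are illustrative; the numbers are the measured values stated in the text.

```python
def fom_pj_per_bit(power_mw, data_rate_gbps):
    """Energy per bit: mW divided by Gb/s gives pJ/bit."""
    return power_mw / data_rate_gbps


def ppm_to_gbps(offset_ppm, data_rate_gbps):
    """Convert a frequency offset in ppm to an equivalent data-rate offset in Gb/s."""
    return offset_ppm * 1e-6 * data_rate_gbps


fom = fom_pj_per_bit(79.2, 30.0)            # 2x half-baud-rate CDR at 30 Gb/s
capture_pos = ppm_to_gbps(66000, 30.0)      # upper end of the capture range
capture_neg = ppm_to_gbps(-2300, 30.0)      # lower end of the capture range

print(f"FOM: {fom:.2f} pJ/bit")
print(f"capture range: {capture_neg:.3f} Gb/s to +{capture_pos:.2f} Gb/s")
```

The +66000ppm end of the capture range works out to 1.98 Gb/s, which the text rounds to 2 Gb/s, while the -2300ppm end corresponds to only about 0.07 Gb/s.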

Figure 5.21: Measured clock spectrum of an open-loop CDR

(a) Clock Spectrum

(b) Phase Noise

Figure 5.22: Measured clock spectrum and phase noise for locked CDR at 30 Gb/s with PRBS31

(a) Clock Spectrum

(b) Phase Noise

Figure 5.23: Measured clock spectrum and phase noise for locked CDR at 30 Gb/s with PRBS7

Figure 5.24: Measured jitter tolerance with sinusoidal jitter injected at 30 Gb/s PRBS31 & 7

Figure 5.25: Power breakdown of 2x half-baud-rate CDR testchip

Figure 5.26: Performance comparison to recently published baud-rate CDRs

Chapter 6

Chip Design Methodology

In a single tapeout shuttle run, two separate wireline receiver chips with two separate CDRs were designed and fabricated for this research. It is known that meeting the tapeout deadline for a single CDR design itself is quite onerous due to the sheer size of the CDR as a system. In addition, doing two system level designs and simulations, schematic designs and simulations, digital designs and verification and layout designs is very time consuming. This chapter will delve into the design methodology that allowed the design of two separate CDRs within a very tight tapeout time-frame. Furthermore, advanced layout techniques implemented will be discussed.

6.1 Behaviour Model Methodology

As discussed in Section 5.2.1, the two CDR designs share most of the continuous-time and event-driven behaviour model. When designing the second CDR, most of the building blocks were recycled with some minor tweaks, such as the new PD and the data decoder. For the second CDR design, a half-rate clocking architecture was preferred over the quarter-rate clocking architecture, but due to the tight tapeout schedule, the quarter-rate clocking scheme was re-used from the initial design stage when building the behaviour model so that completely new continuous-time and event-driven models did not have to be built.

6.2 Schematic & Layout Design Methodology

The schematic and the layout were designed to be shared and re-used for both CDR designs. Many of the circuit components, such as the CTLE, the charge pump, the loop filter and much of the biasing circuitry, are shared. Even circuits that are different, such as the comparators and the VCOs tailored to each CDR’s needs, were only slightly modified instead of being fully re-designed. Furthermore, filler cells, decoupling capacitor cells, IO pads, and the power grid were designed to be shared and used for both CDR designs. Figure 6.1 illustrates that the top CDR is a mirror of the bottom CDR with tweaks made to the building blocks and routing. Without sharing many of the common blocks, the tapeout deadline could not have been met for such a large layout of 1.57mm by 1.57mm die area in a 28nm technology. Figure 6.2 reveals the top aluminum (AP) layer used to distribute power vertically to the metal8 layer below. This figure clearly illustrates that the power grid, output pads and the IO were also shared and re-used for both CDR designs.

Figure 6.1: Layout of the full-chip die with two CDRs on top & bottom

6.3 Advanced Layout Techniques & Considerations

6.3.1 Matching

The matching of the double-tail latch clocked comparators in layout was critical, as a mismatch in device properties such as the threshold voltage (Vt) can eat into the timing and noise margin. Therefore, although offset calibration was built for the comparators, the comparator layout was done with meticulous care. First, common centroid layout techniques [22, 21] were used for sets of four quarter-rate comparators. As a result, any linear gradients in the die tend to cancel out. MOSFETs are afflicted by gradients in etching, in Vt, and in oxide thickness. All capacitor and resistor arrays incorporated common centroid layout to enhance relative matching between them as well.

In addition to the common centroid layout technique, the quarter-rate comparators were interconnected using a symmetrically RC distributed H-tree method [30] for routing data and clocks to minimize skew and delay offset. A disadvantage of the H-tree method is that it is more heavily loaded by parasitic capacitance from extra routing metal, causing larger absolute delay and requiring larger clock buffers. However, timing margin is gained due to lower skew between the quarter-rate comparators, which is critical for the CDR’s front-end.

Besides the common centroid and H-tree interconnect methods, another technique used in this chip design is “interdigitation” [6], found in analog differential pairs. Interdigitation was mainly implemented in CML circuits such as the CTLE and the ring oscillator delay stages in the VCO. Interdigitation lowers the device mismatch between M1 and M2 of the input diff-pairs and also helps to cancel out linear gradients in the die, similar to common centroid.

Figure 6.2: Layout of the full-chip die showing top aluminum layer for power distribution
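The way device placement affects sensitivity to a linear gradient can be illustrated with a small Python sketch. It models two matched devices, A and B, each split into two unit devices placed along one axis, and assumes a purely linear gradient of a parameter such as Vt. The gradient value and the placement strings are hypothetical and are not taken from the testchip layout.

```python
def gradient_mismatch(pattern, gradient_mv_per_pitch=1.0):
    """Mismatch (in mV) between devices A and B under a linear gradient.

    pattern is a string such as "ABBA", giving the unit device at each position.
    With a linear gradient, the mismatch equals the gradient times the distance
    between the centroids of the A units and the B units.
    """
    positions = {"A": [], "B": []}
    for x, device in enumerate(pattern):
        positions[device].append(x)
    centroid_a = sum(positions["A"]) / len(positions["A"])
    centroid_b = sum(positions["B"]) / len(positions["B"])
    return gradient_mv_per_pitch * (centroid_a - centroid_b)


side_by_side = gradient_mismatch("AABB")      # no matching technique
interdigitated = gradient_mismatch("ABAB")    # simple interdigitation
common_centroid = gradient_mismatch("ABBA")   # common centroid

print(side_by_side, interdigitated, common_centroid)
```

In this one-dimensional model, simple interdigitation halves the gradient-induced mismatch relative to side-by-side placement, and the common centroid arrangement cancels it because both devices share the same centroid.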

6.3.2 Design for Electromigration (EM) & IR drop

This subsection discusses the design methodology for EMIR (electromigration and IR drop). During the physical layout design, the maximum current that a specific width of metal wire can carry was considered in order to pass electromigration (EM) requirements. The maximum current values (Imax) were found in the design rule check document for the TSMC 28nm process. In order to meet the allowable maximum current density for CML analog blocks, multiple fingers had to be used instead of a device with a larger width (W). By increasing the number of fingers, more metal tracks are available for the source/drain to meet EM rules. In terms of IR drop, a full mesh power grid with via stacks was incorporated to minimize IR drop on the supply. For example, the aluminum (AP) layer of the power grid was routed vertically and the metal8 (M8) layer below was routed horizontally to form a power mesh from top to bottom, all the way down to the base of the transistor. Furthermore, to improve EM, staggered output pads can be seen in Figure 6.2 on some power supply pads that sink a lot of current, such as those of the VCO. By staggering, more pads could be placed and wirebonded when packaging the chip. To improve EM even further and to reduce pad inductance, double bonding was applied when wire bonding to the QFN package.
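The finger-count reasoning above can be sketched as a simple calculation. The sketch assumes one source/drain metal track per finger and uses made-up current limits in microamps; the actual Imax values come from the foundry design rule document and are not reproduced here. The function name is illustrative.

```python
def min_fingers(i_total_ua, imax_per_track_ua):
    """Smallest number of fingers so that no metal track exceeds its EM limit.

    Assumes the device current splits evenly across fingers and that each
    finger contributes one source/drain metal track (a simplification).
    Integer microamps are used so that the ceiling division is exact.
    """
    if i_total_ua < 0 or imax_per_track_ua <= 0:
        raise ValueError("currents must be non-negative and the limit positive")
    return max(1, -(-i_total_ua // imax_per_track_ua))


# Hypothetical example: a CML tail current against a hypothetical 500 uA track limit.
print(min_fingers(4000, 500))
print(min_fingers(4200, 500))
```

A single wide device would route all of the current through one track, which is why splitting it into fingers helps meet the EM limit at the same total width.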

6.3.3 Other Layout Considerations

Many other layout considerations are discussed in this subsection. First, all high-speed clocks that were routed a long distance were shielded. Shielding the clock routes with VSS ground metal traces reduces the mutual inductance and capacitive cross-talk on high-speed quadrature clock routes.

ESD was also considered during the layout stage of the chip design. ESD diodes and secondary ESD diodes were added to protect the gates from signals coming into the input pads. ESD clamps from the TSMC ESD library were placed between power and ground to properly clamp the supplies during an electrostatic discharge.

In order to prevent latch-up, an adequate number of n-taps and p-taps were placed in the layout. For custom analog layouts, all the MOS devices were laid out in such a way that each device always sees the same environment, which includes the n-tap/p-tap. For example, if there were many rows of NMOS devices, p-taps that connect the substrate to ground were placed in between every row as well as at the outer edges. This way, every row of NMOS devices sees the same distance to the top and bottom p-taps.

PMOS devices must reside in an n-well, which is more isolated in terms of noise compared to the p− substrate for the NMOS devices. The deep n-well (DNW) layout technique was used to provide noise isolation for the NMOS devices inside isolated p-wells. In addition, DNW had to be put in to provide ground isolation between digital ground (vss dig) and analog ground (vssa), since they would be shorted if they shared the same p-substrate. The different ground domains were shorted together only at the PCB level.

6.4 Place & Route Digital Implementation Methodology

During tapeout, the digital adaptation engine was designed for both baud-rate CDR designs. However, for the second CDR design from Chapter 5, the adaptation engine did not work on the testchip after fabrication, which is why its details are omitted from this thesis. Nonetheless, during the chip design process, even the digital flow from RTL synthesis to place & route was shared between the two CDR designs. The digital flow script just had to be modified to point to different Verilog files for their respective digital adaptation schemes. Furthermore, the area allocated for the digital block after place & route in the layout was the same as well. Therefore, once the layout was streamed into Cadence Virtuoso, it fit perfectly for both CDR layouts. The pin coordinates were also preserved during the digital flow for P & R; thus, routing to the input and output of the P & R layout was shared between the two CDRs in order to save valuable time.

Chapter 7

Conclusion

In summary, this thesis began by motivating the need for a baud-rate CDR with the goal of reducing power consumption. In Chapter 2, the background to the first design (adaptive baud-rate CDR) was presented. Chapter 3 followed up with the details of the proposed adaptive engine. Chapter 4 shared the simulated and measured results. For the second design (2x half-baud-rate CDR) presented in Chapter 5, all of the background, the proposed 2x half-baud-rate scheme and the measured results were self-contained within the same chapter. In addition, Chapter 6 gave an insight into how the two chips were designed simultaneously in an efficient manner, as well as some advanced layout techniques incorporated in the testchips.

7.1 Thesis Contribution

The contributions from each of the two baud-rate CDR designs are summarized in this section as a conclusion. The first contribution of this thesis is the proposed adaptive baud-rate CDR with CTLE and 1-tap DFE. The novelty in this design is the adaptation engine tailored for baud-rate clock and data recovery where the comparators for the DFE and the PD are shared to save power. A testchip was fabricated in TSMC 28nm HPC CMOS technology with a 0.9 V supply. The adaptation engine is demonstrated for 34-36 Gb/s operation with a Tyco 5” channel resulting in 15.05-18.25 dB channel losses. Measurement in the lab demonstrated that the testchip is able to pass the IEEE 802.3 jitter tolerance masks for the mentioned channel losses. At 35 Gb/s, the total power consumption is measured to be 106.3 mW or a FOM of 3.04 pJ/bit. A paper that presents this 36 Gb/s adaptive baud-rate CDR has been submitted to ISSCC 2019.

The second contribution is the proposed 2x half-baud-rate clock and data recovery technique using both data and edge samples every other UI (half-baud-rate) to lock at the edge. A testchip was also fabricated in TSMC 28nm HPC CMOS technology with a 0.9 V supply. A 30 Gb/s 2x half-baud-rate CDR was tested with a Tyco 5” channel with 13.06 dB of loss. The total power consumption is measured to be 79.2 mW or a FOM of 2.64 pJ/bit. A paper written for the 30 Gb/s 2x half-baud-rate CDR has also been submitted to ISSCC 2019.

In conclusion, two separate CDR testchips were fabricated in a 28nm process technology and successfully measured in the lab.

7.2 Future Works

There are several ways to follow up on the work from this thesis, which are broken down into two subsections, one for each of the two designs.

7.2.1 Improvements for an Adaptive Baud-Rate CDR

One possible future work is to take the pattern-based baud-rate CDR, where the comparators are shared between the DFE and the PD, and turn it into a PAM4 receiver. Since PAM4 signaling [5, 14, 38] sends/receives two bits per symbol, it is a popular approach for achieving a higher data-rate. In addition, if PAM4 signaling is indeed feasible, an adaptive scheme that is compatible with PAM4 should be studied as well. Second, the CDR’s equalization capabilities could be improved. For example, a new tuning knob could be added to the CTLE. By tuning the source degeneration resistance, the peaking could be improved by lowering the DC gain. In addition, another tuning knob could be added to the speculative DFE as well. A 2-tap speculative DFE would be able to cancel out two post-cursor ISI terms as opposed to one. Adding more knobs would complicate the adaptation but, as a trade-off, higher channel attenuation could be handled.

7.2.2 Improvements for a 2x Half-Baud-Rate CDR

First, a 2x half-baud-rate CDR could be improved in terms of power. The VCO made up the majority of the power consumption, which hindered the design from achieving a state-of-the-art FOM for a high-speed wireline CDR. Instead of wirebonding, a flip-chip package could be used, which would get rid of the inductance from the wirebond. In addition, an LC VCO could be implemented to lower the phase noise and improve the power consumption at the trade-off of a reduced tuning range. Since the use of inductors is required for the LC VCO, t-coils can also be implemented in the front-end to extend the bandwidth, and inductive peaking using inductors could be applied to the CTLE to extend the bandwidth of the high-frequency boost. As a result, the data-rate could be improved as well.

Second, adaptation of the Vref and CTLE settings could be implemented for the 2x half-baud-rate CDR. Our adaptation scheme for this proposed CDR did not work after fabrication due to an error in the interface between the analog and the digital circuits. In future work, the adaptive scheme could be fixed and implemented properly. Third, a 2x half-baud-rate CDR that can tolerate a higher channel loss for MR and LR applications would be an interesting project. The proposed 2x half-baud-rate CDR in this thesis was tailored for an XSR/USR application with only the CTLE as the main equalization scheme. For a higher-loss application such as a backplane, a direct feedback DFE with multiple taps could be implemented after the CTLE. At the summing node of the DFE, which is still an analog eye, the phase detector and the data decoder may still work. Lastly, similar to the possible future work for the first design, the feasibility of PAM4 signaling should be studied for a 2x half-baud-rate CDR. If PAM4 is indeed feasible, it would be possible to double the data-rate at the same clock rate.

Bibliography

[1] IEEE standard for ethernet - amendment 10: Media access control parameters, physical layers, and management parameters for 200 gb/s and 400 gb/s operation. IEEE Std 802.3bs-2017 (Amendment to IEEE 802.3-2015 as amended by IEEE’s 802.3bw-2015, 802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016, 802.3bn-2016, 802.3bz-2016, 802.3bu-2016, 802.3bv-2017, and IEEE 802.3-2015/Cor1-2017), pages 1–372, Dec 2017.

[2] IEEE standard for ethernet - amendment 11: Physical layer and management parameters for serial 25 gb/s ethernet operation over single-mode fiber. IEEE Std 802.3cc-2017 (Amendment to IEEE Std 802.3-2015 as amended by IEEE’s 802.3bw-2015, 802.3by-2016, 802.3bq-2016, 802.3bp-2016, 802.3br-2016, 802.3bn-2016, 802.3bz-2016, 802.3bu-2016, 802.3bv-2017, 802.3-2015/Cor 1-2017, and 802.3bs-2017), pages 1–45, Jan 2018.

[3] A. A. Abidi. Phase noise and jitter in cmos ring oscillators. IEEE Journal of Solid-State Circuits, 41(8):1803–1816, Aug 2006.

[4] J.D.H. Alexander. Clock recovery from random binary data. 11:541 – 542, 02 1975.

[5] M. Bassi, F. Radice, M. Bruccoleri, S. Erba, and A. Mazzanti. 3.6 a 45gb/s pam-4 transmitter delivering 1.3vppd output swing with 1v supply in 28nm cmos fdsoi. In 2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 66–67, Jan 2016.

[6] J. D. Bruce, H. W. Li, M. J. Dallabetta, and R. J. Baker. Analog layout using alas! IEEE Journal of Solid-State Circuits, 31(2):271–274, Feb 1996.

[7] R. Dokania, A. Kern, M. He, A. Faust, R. Tseng, S. Weaver, K. Yu, C. Bil, T. Liang, and F. O’Mahony. 10.5 a 5.9pj/b 10gb/s serial link with unequalized mm-cdr in 14nm tri-gate cmos. In 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers, pages 1–3, Feb 2015.

[8] A. Emami-Neyestanak, S. Palermo, Hae-Chang Lee, and M. Horowitz. Cmos transceiver with baud rate clock recovery for optical interconnects. In 2004 Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.04CH37525), pages 410–413, June 2004.

[9] P. A. Francese, T. Toifl, P. Buchmann, M. Brändli, C. Menolfi, M. Kossel, T. Morf, L. Kull, and T. M. Andersen. A 16 gb/s 3.7 mw/gb/s 8-tap dfe receiver and baud-rate cdr with 31 kppm tracking bandwidth. IEEE Journal of Solid-State Circuits, 49(11):2490–2502, Nov 2014.

[10] C. Gimeno, E. Guerrero, C. Aldea, S. Celma, and C. Azcona. A fully-differential adaptive equalizer using the spectrum-balancing technique. In 2013 IEEE International Symposium on Circuits and Systems (ISCAS2013), pages 1187–1190, May 2013.

[11] J. Han, Y. Lu, N. Sutardja, and E. Alon. 6.2 a 60gb/s 288mw nrz transceiver with adaptive equalization and baud-rate clock and data recovery in 65nm cmos technology. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pages 112–113, Feb 2017.

[12] Y. Hidaka, W. Gai, A. Hattori, T. Horie, J. Jiang, K. Kanda, Y. Koyanagi, S. Matsubara, and H. Osone. A 4-channel 3.1/10.3gb/s transceiver macro with a pattern-tolerant adaptive equalizer. In 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, pages 442–443, Feb 2007.

[13] Y. Hidaka, W. Gai, T. Horie, J. H. Jiang, Y. Koyanagi, and H. Osone. A 4-channel 1.25-10.3 gb/s backplane transceiver macro with 35 db equalizer and sign-based zero-forcing adaptive control. IEEE Journal of Solid-State Circuits, 44(12):3547–3559, Dec 2009.

[14] J. Im, D. Freitas, A. Roldan, R. Casey, S. Chen, A. Chou, T. Cronin, K. Geary, S. McLeod, L. Zhou, I. Zhuang, J. Han, S. Lin, P. Upadhyaya, G. Zhang, Y. Frans, and K. Chang. 6.3 a 40-to-56gb/s pam-4 receiver with 10-tap direct decision-feedback equalization in 16nm finfet. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pages 114–115, Feb 2017.

[15] Raj Jain. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling. Wiley professional computing. Wiley, 1991.

[16] M. S. Jalali, A. Sheikholeslami, M. Kibune, and H. Tamura. A reference-less single-loop half-rate binary cdr. IEEE Journal of Solid-State Circuits, 50(9):2037–2047, Sept 2015.

[17] H. Y. Joo, K. S. Ha, and L. S. Kim. A data pattern-tolerant adaptive equalizer using spectrum balancing method. In 2009 Symposium on VLSI Circuits, pages 220–221, June 2009.

[18] Y. H. Kim, Y. J. Kim, T. Lee, and L. S. Kim. A 21-gbit/s 1.63-pj/bit adaptive ctle and one-tap dfe with single loop spectrum balancing method. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(2):789–793, Feb 2016.

[19] J. Lee, M. Chen, and H. Wang. Design and comparison of three 20-gb/s backplane transceivers for duobinary, pam4, and nrz data. IEEE Journal of Solid-State Circuits, 43(9):2120–2133, Sept 2008.

[20] Jri Lee. A 20gb/s adaptive equalizer in 0.13 μm cmos technology. In 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers, pages 273–282, Feb 2006.

[21] M. P. Lin, Y. He, V. W. Hsiao, R. Chang, and S. Lee. Common-centroid capacitor layout generation considering device matching and parasitic minimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(7):991–1002, July 2013.

[22] Di Long, Xianlong Hong, and Sheqin Dong. Optimal two-dimension common centroid layout generation for mos transistors unit-circuit. In 2005 IEEE International Symposium on Circuits and Systems, pages 2999–3002 Vol. 3, May 2005.

[23] H. Miyaoka, F. Terasawa, M. Kudo, H. Kano, A. Matsuda, N. Shirai, S. Kawai, T. Shibasaki, T. Danjo, Y. Ogata, Y. Sakai, H. Yamaguchi, T. Mori, Y. Koyanagi, H. Tamura, Y. Ide, K. Terashima, H. Higashi, T. Higuchi, and N. Naka. A 28.3 gb/s 7.3 pj/bit 35 db backplane transceiver with eye sampling phase adaptation in 28 nm cmos. In 2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits), pages 1–2, June 2016.

[24] K. Mueller and M. Muller. Timing recovery in digital synchronous data receivers. IEEE Transactions on Communications, 24(5):516–531, May 1976.

[25] F. A. Musa. High-speed baud-rate clock recovery. PhD thesis, University of Toronto, Toronto, ON, 2008.

[26] F. A. Musa and A. C. Carusone. A baud-rate timing recovery scheme with a dual-function analog filter. IEEE Transactions on Circuits and Systems II: Express Briefs, 53(12):1393–1397, Dec 2006.

[27] P. J. Peng, J. F. Li, L. Y. Chen, and J. Lee. 6.1 a 56gb/s pam-4/nrz transceiver in 40nm cmos. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pages 110–111, Feb 2017.

[28] W. Rahman, D. Yoo, J. Liang, A. Sheikholeslami, H. Tamura, T. Shibasaki, and H. Yamaguchi. A 22.5-to-32-gb/s 3.2-pj/b referenceless baud-rate digital cdr with dfe and ctle in 28-nm cmos. IEEE Journal of Solid-State Circuits, 52(12):3517–3531, Dec 2017.

[29] W. Rahman, D. Yoo, J. Liang, A. Sheikholeslami, H. Tamura, T. Shibasaki, and H. Yamaguchi. A 22.5-to-32-gb/s 3.2-pj/b referenceless baud-rate digital cdr with dfe and ctle in 28-nm cmos. IEEE Journal of Solid-State Circuits, 52(12):3517–3531, Dec 2017.

[30] B. Ravelo and A. K. Jastrzebski. Modelling of symmetrical distributed clock rc h-tree. In International Symposium on Electromagnetic Compatibility - EMC EUROPE, pages 1–6, Sept 2012.

[31] D. Schinkel, E. Mensink, E. Klumperink, E. van Tuijl, and B. Nauta. A double-tail latch-type voltage sense amplifier with 18ps setup+hold time. In 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, pages 314–605, Feb 2007.

[32] S. Shahramian, C. Ting, A. Sheikholeslami, H. Tamura, and M. Kibune. A pattern-guided adaptive equalizer in 65nm cmos. In 2011 IEEE International Solid-State Circuits Conference, pages 354–356, Feb 2011.

[33] M. H. Shakiba. A 2.5 gb/s adaptive cable equalizer. In 1999 IEEE International Solid-State Circuits Conference. Digest of Technical Papers. ISSCC. First Edition (Cat. No.99CH36278), pages 396–397, Feb 1999.

[34] T. Shibasaki, W. Chaivipas, Yanfei Chen, Y. Doi, T. Hamada, H. Takauchi, T. Mori, Y. Koyanagi, and H. Tamura. A 56-gb/s receiver front-end with a ctle and 1-tap dfe in 20-nm cmos. In 2014 Symposium on VLSI Circuits Digest of Technical Papers, pages 1–2, June 2014.

[35] T. Shibasaki, T. Danjo, Y. Ogata, Y. Sakai, H. Miyaoka, F. Terasawa, M. Kudo, H. Kano, A. Matsuda, S. Kawai, T. Arai, H. Higashi, N. Naka, H. Yamaguchi, T. Mori, Y. Koyanagi, and H. Tamura. 3.5 a 56gb/s nrz-electrical 247mw/lane serial-link transceiver in 28nm cmos. In 2016 IEEE International Solid-State Circuits Conference (ISSCC), pages 64–65, Jan 2016.

[36] F. Spagna, L. Chen, M. Deshpande, Y. Fan, D. Gambetta, S. Gowder, S. Iyer, R. Kumar, P. Kwok, R. Krishnamurthy, C. Lin, R. Mohanavelu, R. Nicholson, J. Ou, M. Pasquarella, K. Prasad, H. Rustam, L. Tong, A. Tran, J. Wu, and X. Zhang. A 78mw 11.8gb/s serial link transceiver with adaptive rx equalization and baud-rate cdr in 32nm cmos. In 2010 IEEE International Solid-State Circuits Conference - (ISSCC), pages 366–367, Feb 2010.

[37] V. Stojanovic, A. Ho, B. W. Garlepp, F. Chen, J. Wei, G. Tsang, E. Alon, R. T. Kollipara, C. W. Werner, J. L. Zerbe, and M. A. Horowitz. Autonomous dual-mode (pam2/4) serial link transceiver with adaptive equalization and data recovery. IEEE Journal of Solid-State Circuits, 40(4):1012–1026, April 2005.

[38] P. Upadhyaya, C. F. Poon, S. W. Lim, J. Cho, A. Roldan, W. Zhang, J. Namkoong, T. Pham, B. Xu, W. Lin, H. Zhang, N. Narang, K. H. Tan, G. Zhang, Y. Frans, and K. Chang. A fully adaptive 19-to-56gb/s pam-4 wireline transceiver with a configurable adc in 16nm finfet. In 2018 IEEE International Solid - State Circuits Conference - (ISSCC), pages 108–110, Feb 2018.

[39] M. van Ierssel, H. Yamaguchi, A. Sheikholeslami, H. Tamura, and W. W. Walker. Event-driven modeling of cdr jitter induced by power-supply noise, finite decision-circuit bandwidth, and channel isi. IEEE Transactions on Circuits and Systems I: Regular Papers, 55(5):1306–1315, June 2008.

[40] K. Yamaguchi, K. Sunaga, S. Kaeriyama, T. Nedachi, M. Takamiya, K. Nose, Y. Nakagawa, M. Sugawara, and M. Fukaishi. 12gb/s duobinary signaling with ×2 oversampled edge equalization. In ISSCC. 2005 IEEE International Digest of Technical Papers. Solid-State Circuits Conference, 2005., pages 70–585 Vol. 1, Feb 2005.

Appendices

Appendix A

Ancillary

A.1 Portlist for Synthesized Digital

This section lists the ports of the synthesized digital block, which comprises the adaptation engine and the BERT. The input ports listed are programmed using an Arduino.

PORT                WIDTH  DIRECTION  DESCRIPTION
i_CORECLK           1      INPUT      Input clock for digital core
i_RSTb              1      INPUT      Active-low reset for digital core
i_e_k               8      INPUT      8 bits for 1 GHz
i_DOUT              32     INPUT      Input data for digital core; 32 bits for 1 GHz
i_ncon_avg          5      INPUT
i_ncon_max          5      INPUT
i_dLev_delay_cycle  10     INPUT
i_counter_max       24     INPUT
i_num_pi_trial      5      INPUT
i_num_trial         9      INPUT
i_thick_mode        2      INPUT
i_max_Cs            4      INPUT
o_new_trial         1      OUTPUT
o_block_detected    1      OUTPUT     Accumulator port. Simplified; only has 1 block processed
o_dLev              9      OUTPUT     Accumulator port
o_pi_ctrl           5      OUTPUT     PI_logic port
o_mode_pi           3      OUTPUT     PI_logic port
o_cross_rdy         1      OUTPUT     PI_logic port
o_prev_diff         10     OUTPUT     PI_logic port; WIDTH_DLEV+1
o_dLev011           9      OUTPUT     PI_logic port
o_dLev110           9      OUTPUT     PI_logic port
o_new_Cs_setting    1      OUTPUT     adaptive_logic port
o_thresh_H          9      OUTPUT     adaptive_logic port
o_Cs                4      OUTPUT     adaptive_logic port
o_FSM_state         3      OUTPUT     adaptive_logic port
o_tot_max           16     OUTPUT     adaptive_logic port
o_tot_min           16     OUTPUT     adaptive_logic port
o_tot_vamp          16     OUTPUT     adaptive_logic port
o_max               9      OUTPUT     adaptive_logic port
o_min               9      OUTPUT     adaptive_logic port
o_thick_out         10     OUTPUT     adaptive_logic port; WIDTH_MAX+1
o_mode_adpt         3      OUTPUT     adaptive_logic port
o_adapt_rdy         1      OUTPUT     adaptive_logic port
i_bert_ENAb         1      INPUT      BERT port; active-low enable to process data or reset error counter
i_bert_pncctl       2      INPUT      BERT port; 2b00 = PRBS7, 2b01 = PRBS31, 2b10 = PRBS23, 2b11 = PRBS15
i_bert_rxorder      1      INPUT      BERT port; input data invert order; 0: Yes, bit 31 first; 1 [default]: No, bit 0 first
o_bert_pnerr        1      OUTPUT     BERT port; comparison error, current sample
o_bert_pneonce      1      OUTPUT     BERT port; comparison error, all past samples
o_bert_pnebitcnt    20     OUTPUT     BERT port; error counter
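As a rough software illustration of the BERT ports above, the following Python sketch maps the `i_bert_pncctl` encoding to a PRBS order and counts bit errors with a self-synchronizing check. Only the 2-bit encoding and the 20-bit counter width come from the portlist. The feedback taps are the standard ITU-T PRBS polynomials, and the function names, the seed, and the saturating counter are assumptions made for this sketch; it is not a description of the on-chip BERT.

```python
# Illustrative sketch only. PRBS orders follow the i_bert_pncctl encoding in
# the portlist; the taps are the standard ITU-T polynomials (assumed here).
PRBS_TAPS = {
    0b00: (7, 6),    # PRBS7:  x^7 + x^6 + 1
    0b01: (31, 28),  # PRBS31: x^31 + x^28 + 1
    0b10: (23, 18),  # PRBS23: x^23 + x^18 + 1
    0b11: (15, 14),  # PRBS15: x^15 + x^14 + 1
}


def prbs_bits(pncctl, n, seed=1):
    """Generate n bits of the PRBS selected by the 2-bit pncctl code."""
    order, tap = PRBS_TAPS[pncctl]
    mask = (1 << order) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(n):
        # New bit is the XOR of the two tapped shift-register stages.
        new = ((state >> (order - 1)) ^ (state >> (tap - 1))) & 1
        state = ((state << 1) | new) & mask
        bits.append(new)
    return bits


def count_errors(bits, pncctl, counter_width=20):
    """Self-synchronizing check: predict each bit from earlier received bits.

    The counter saturates at 2**counter_width - 1 (an assumption; the
    portlist only states that o_bert_pnebitcnt is 20 bits wide).
    """
    order, tap = PRBS_TAPS[pncctl]
    limit = (1 << counter_width) - 1
    errors = 0
    for m in range(order, len(bits)):
        expected = bits[m - order] ^ bits[m - tap]
        if bits[m] != expected:
            errors = min(errors + 1, limit)
    return errors
```

In a self-synchronizing checker of this kind, a single flipped bit is counted three times: once when it is received and once for each of the two later predictions that depend on it.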

A.2 Output Pad MUX Selection

This section describes the 4-bit selectable output pad mux and what each of its 16 possible settings drives off-chip via PCB traces.

oPAD0: coreck_16 (500 MHz)

oPAD1:
0) dout clk0 sample
1) dLev <0 or 4>
2) Thresh <8:0> selectable
3) Thresh [0-->8] burst mode
4) new_trial
5) dcomp<0>
6) PIPO<0>
7) mode_pi<0> (1-bit)
8) FSM_state [0-->2] burst mode
9) max [0-->8] burst mode
10) tot_max [0-->15] burst mode
11) tot_vamp [0-->15] burst mode
12) dLev011 [0-->8] burst mode
13) o_bert_pnebitcnt [0-->19] burst mode
14) block_detected<0>

oPAD2:
0) dout clk90 sample
1) dLev <1 or 5>
2) adapt_rdy
3) Thresh flag_bit0
4) cross_rdy
5) dcomp<1>
6) PIPO<1>
7) new_Cs_setting
8) FSM_state flag_bit0
9) max flag_bit0
10) tot_max flag_bit0
11) tot_vamp flag_bit0
12) dLev011 flag_bit0
13) o_bert_pnebitcnt flag_bit0
14) ck_div16

oPAD3:
0) dout clk180 sample
1) dLev <2 or 6>
2) PI <4:0> selectable
3) PI [0-->4] burst mode
4) Cs [0-->3] burst mode
5) dcomp<2>
6) PIPO<2>
7) mode_adpt [0-->2] burst mode
8) thick_out [0-->9] burst mode
9) min [0-->8] burst mode
10) tot_min [0-->15] burst mode
11) prev_diff [0-->9] burst mode
12) dLev110 [0-->8] burst mode
13) o_bert_pnerr (1-bit)

oPAD4:
0) e_k_demux (between ck90 and 180)
1) dLev <3 or 7>
2) Cs <3:0> selectable
3) PI flag_bit0
4) Cs flag_bit0
5) dcomp<3>
6) PIPO<3>
7) mode_adpt flag_bit0
8) thick_out flag_bit0
9) min flag_bit0
10) tot_min flag_bit0
11) prev_diff flag_bit0
12) dLev110 flag_bit0
13) o_bert_pneonce (1-bit)
14) ck_div16

<63:0> 500 MHz data_rec
<15:0> 500 MHz e_k_demux

4-bit select, e.g.:
Sel = selects dout<0>, dout<1>, dout<2>, e_k<0> [block 0]
Sel = selects dout<4>, dout<5>, dout<6>, e_k<0> [block 1]
...
Sel = selects dout<28>, dout<29>, dout<30>, e_k<0> [block 7]
Sel = selects dout<32>, dout<33>, dout<34>, e_k<8> [block 8]
Sel = selects dout<60>, dout<61>, dout<62>, e_k<8> [block 15]
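
The block-select examples above appear to follow a simple index pattern, sketched below in Python. The formula is inferred only from the five listed blocks (0, 1, 7, 8, and 15); the function name `block_taps` is hypothetical, and extending the pattern to the unlisted blocks is an assumption.

```python
def block_taps(block):
    """Return (dout indices, e_k index) observed for a given block number.

    Inferred pattern (an assumption beyond the listed examples):
      dout indices: 4*block, 4*block + 1, 4*block + 2
      e_k index:    0 for blocks 0-7, 8 for blocks 8-15
    """
    if not 0 <= block <= 15:
        raise ValueError("block must be in the range 0..15")
    dout = [4 * block + i for i in range(3)]
    e_k = 8 * (block // 8)
    return dout, e_k


if __name__ == "__main__":
    for b in (0, 1, 7, 8, 15):
        dout, e_k = block_taps(b)
        print(f"block {b}: dout{dout}, e_k<{e_k}>")
```

Each block exposes three of the four dout bits in its group, with the fourth pad carrying the error sample, which matches the four selectable pads oPAD1 to oPAD4.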