AN ABSTRACT OF THE DISSERTATION OF

Bruce Querbach for the degree of Doctor of Philosophy in Electrical and Computer

Engineering presented on June 12, 2015.

Title: The Architecture of a Reusable Built-In Self-Test for Link Training, IO and Memory

Defect Detection and Auto Repair on 14nm Intel SOC.

Abstract approved: ______

Patrick Y. Chiang

The complexity of designing and testing today’s system on chip (SOC) is increasing due to greater integrated circuit (IC) density and higher IO and memory frequencies. SOCs for the mobile phone and tablet market have the unique challenge of short product development windows, at times less than six months, and low-cost boards and platforms that limit physical access to test access ports (TAP). This dissertation presents the architecture of a reusable built-in self-test (BIST) engine called the converged pattern generator and checker (CPGC) that was developed to address the above challenges. It is used in the critical path of millions of x86 SOCs for DDR3, DDR4, LP-DDR3, and LP-DDR4 IO initialization and link training. The CPGC is also an essential BIST engine for IO and memory defect detection and, in some cases, the automatic repair of detected memory defects. The software and hardware infrastructure that leverages CPU L2/L3 cache to enable cache based testing (CBT) and the parallel execution of the CPGC Intel BIST engine is shown to improve test time by 60x to 170x over conventional TAP based testing. In addition, silicon results are presented showing that CPGC enables easy debug of inter symbol interference (ISI) and crosstalk issues in silicon and boards, enables fast IO link training, improves validation time by 3x, and in some instances reduces SOC and platform power by 5% to 11% through closed loop IO circuit power optimization. This CPGC BIST engine has been developed into a reusable IP solution, which has been successfully designed into at least 11 Intel CPUs and SOCs (32nm-14nm), with seven of these successfully debugged, tested, and launched into the marketplace. Ultimately, this architecture has enabled over 100 million CPUs to be shipped within a single quarter.

©Copyright by Bruce Querbach

June 12th, 2015

All Rights Reserved

The Architecture of a Reusable Built-In Self-Test for Link Training, IO and Memory

Defect Detection and Auto Repair on 14nm Intel SOC

by

Bruce Querbach

A DISSERTATION

submitted to

Oregon State University

in partial fulfillment of

the requirements for the

degree of

Doctor of Philosophy

Presented June 12, 2015

Commencement June 2015

Doctor of Philosophy dissertation of Bruce Querbach presented on June 12, 2015

APPROVED:

Major Professor, representing Electrical and Computer Engineering

Director of the school of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my dissertation will become part of the permanent collection of Oregon

State University libraries. My signature below authorizes release of my dissertation to any reader upon request.

Bruce Querbach, Author

ACKNOWLEDGEMENTS

The author expresses sincere appreciation to:

1. Patrick Chiang for his patience and advice throughout the years

2. Sudeep Puligundla for his insights and advice throughout this process

3. Rahul Khanna for his collaboration, showing me the ropes, and for his patience over many paper revisions

4. Sabyasachi Deyati for his inspiration, advice, and paper contribution

5. David Blankenbeckler for paper collaboration and sharing his experience

6. Peter Yanyang Tan for collecting RMT and JTAG data and for paper contribution

7. Roger R Rees for his guidance, advice over the years, and paper reviews

8. Anne Meixner for her advice and paper edits and reviews

9. Ron Anderson for sharing his insights on DDR debug, data collection, and paper contribution

10. Adam Taylor for sharing his expertise, post-silicon validation data collection, and paper contribution

11. Daniel Becerra for collecting valuable data on the DDR interface and for paper contribution

12. Zale T Schoenborn for his visionary ideas and support over the years, and paper contribution

13. David G Ellis for his guidance over the years, for teaching, for sharing his expertise in digital design, and for paper contribution

14. Yulan Zhang for her efforts and contribution in simulation and modeling of CPGC, and paper contribution

15. Thilak K Poriyani House Raju and Rucha P Purandare for their simulation, design, validation, and data collection

16. Van Lovelace for sharing his knowledge on BIOS and MRC and paper contribution

17. Joe Crop for his advice, paper edits, and paper contribution

18. Amy Querbach for her patience and support throughout the years

19. Ram Mallela, Buz Hampton (GM), John D Barton (VP), and Rani Borkar (Sr. VP) for management support and funding

Contribution of authors

From Intel Corporation:

Sudeep Puligundla, Rahul Khanna, David Blankenbeckler, Peter Yanyang Tan, Ron Anderson, Adam Taylor, Daniel Becerra, Zale T Schoenborn, David G Ellis, Yulan Zhang, and Van Lovelace.

From Georgia Institute of Technology:

Sabyasachi Deyati

From Oregon State University:

Joe Crop and Patrick Chiang.

TABLE OF CONTENTS Page

1 Dissertation Overview ...... 1

2 Introduction and Literature Review ...... 4

2.1 Challenges of IO Bandwidth, Frequency, and Low Power ...... 4

2.2 Process Variation...... 7

2.3 Temperature Variation ...... 8

2.4 Aging Effects...... 9

2.5 Calibration Methods and BIST ...... 11

2.6 High Speed Differential IO (HSIO) ...... 15

2.7 Memory Single Ended IO ...... 17

2.8 Fault Models ...... 20

2.9 Memory Test Algorithms ...... 25

2.10 Research Goals and methodology ...... 27

2.10.1 Goal ...... 27

2.10.2 Methodology ...... 28

2.11 References ...... 30

3 Comparison of hardware and software stress method ...... 35

3.1 Introduction ...... 36

3.2 The APGC Architecture ...... 38

3.3 The Aggressor/Victim Pattern Rotation ...... 40

3.4 The Software Learning Algorithm and Pattern ...... 41


3.5 Margin Result ...... 41

3.6 Receiver Voltage Margin – Low Side ...... 42

3.7 Receiver Voltage Margin – High Side ...... 42

3.8 Receiver Timing Margin – Low Side ...... 43

3.9 Transmitter Voltage Margin – Low Side ...... 43

3.10 Transmitter Voltage Margin – High Side ...... 44

3.11 Transmitter Timing Margin – Low Side...... 44

3.12 Comparison of Test Execution Time of APGC and Software ...... 45

3.13 Conclusion ...... 46

3.14 Acknowledgements ...... 47

3.15 References ...... 48

4 Improved Debug, Validation and Test Time ...... 51

4.1 Introduction ...... 52

4.2 Failure Modes ...... 55

4.3 CPGC Architecture ...... 58

4.3.1 Architecture for Software Assisted Repair Technology ...... 59

4.3.2 Unified Pattern Sequencer ...... 61

4.3.3 LMN ...... 62

4.3.4 Fixed Pattern Buffer ...... 62


4.3.5 LFSR ...... 62

4.3.6 Flop Based Data Pattern Buffer ...... 63

4.3.7 Register File Based Pattern ...... 63

4.3.8 Data Pattern Dynamic Inversion/DC ...... 63

4.3.9 Post-Scheduler Command and Address Pattern Generation ...... 64

4.3.10 Error Status, Logging and Masking ...... 64

4.3.11 Command and Address Generator ...... 64

4.3.12 Variable Retention Time Stress ...... 66

4.3.13 Gate Count ...... 66

4.3.14 Power ...... 67

4.3.15 Protocol ...... 68

4.4 CPGC Usage and IP Reuse Model ...... 68

4.5 Software Architecture ...... 75

4.5.1 Software Assisted Dynamic Margining ...... 78

4.5.2 Software Assisted CPGC Test Invocation ...... 78

4.6 Simulation Results...... 79

4.7 CPGC Results and Silicon Measurements ...... 81

4.7.1 Debugging Boot Failure Using CPGC LFSR Addressing ...... 86

4.8 Summary ...... 91

4.9 Acknowledgments ...... 92


4.10 References ...... 92

5 Platform execution time improvement using L3 Cache ...... 97

5.1 Introduction ...... 98

5.2 CBT Uses Cache-As-RAM Mode ...... 101

5.2.1 Non-Evict Mode (NEM) ...... 101

5.2.2 NEM Exit to Preboot Execution Environment (PXE) ...... 104

5.3 CPGC BIST Engine ...... 106

5.3.1 Architecture Details ...... 106

5.3.2 Address Instruction (Loop3) ...... 109

5.3.3 Data Instruction (Loop2)...... 109

5.3.4 Data Capabilities Overview ...... 110

5.3.5 Algorithm Instruction (Loop 1) ...... 110

5.3.6 Command Instruction (Loop 0) ...... 112

5.3.7 Offset Command Instruction (Loop 0) ...... 112

5.3.8 Examples of Usage Models for the Offset Command Instruction ...... 113

5.3.9 Refresh Control Generator ...... 113

5.4 Rank Margin Tool (RMT) ...... 114

5.5 Traditional JTAG and TAP Based Tests ...... 122

5.6 Performance (Test Time) Comparison ...... 126


5.7 Generalized Cache Based Testing ...... 127

5.8 Conclusion ...... 128

5.9 Acknowledgements ...... 129

5.10 References ...... 130

6 Architecture of a Reusable BIST Engine (CPGC) ...... 135

6.1 Introduction ...... 136

6.2 2nd Generation CPGC for 14nm 3DS SOC Memory and IO ...... 139

6.2.1 CPGC Architecture Overview ...... 139

6.2.2 CPGC Architecture Detail ...... 142

6.2.3 Algorithmic Memory Cell Testing...... 144

6.2.4 CPGC Software infrastructure Architecture ...... 146

6.2.5 CPGC Applications and Results ...... 148

6.3 Summary ...... 156

6.4 Acknowledgments ...... 157

6.5 References ...... 157

7 Conclusion ...... 161

8 Bibliography ...... 164

LIST OF FIGURES Page

INTEL LAB’S RESEARCH PREDICTS HSIO BANDWIDTH INCREASING 3X/2YRS 4

INTEL LAB’S RESEARCH PREDICTS -20%/YEAR IN IO POWER 5

BER REDUCES AS RECEIVER DATA EYE SHRINKS 6

20% VT VARIATION LIES BEYOND 3 STANDARD DEVIATION 7

AS T (37C->125C), JITTER INCREASED 30PS, EYE-HEIGHT DECREASED ABOUT 20MV 8

AS P (TT->SS), JITTER +25PS, EYE-HEIGHT -30MV 9

BOND BREAKING RATE INCREASE WITH (IDS/W) AND TEMPERATURE 10

DDR3 SELF-CALIBRATION 12

IO TERMINATION CALIBRATION CIRCUIT 13

INTEL HSIO RECEIVER ARCHITECTURE 15

INTEL PCIE GEN 3 8GT/S HSIO RECEIVER AND DFX 16

INTEL HASWELL (22NM) DDR CASCADE DRIVER 18

INTEL HASWELL (22NM) EDRAM AT 6.4T/S 19

DRAM ORGANIZATION 20

ADJACENT NEIGHBORHOOD OF A MEMORY CELL B 21

DRAM FAULT SIMULATION MODEL 22

MATS+ TEST ALGORITHM 26

THE CONCEPTUAL MEMORY LINK TRANSMITTER AND RECEIVER 39

THE APGC ARCHITECTURE 40

THE RX VOLTAGE LOWER SIDE BER (ON Q-SCALE) VS. VREF TICKS 42

THE RX VOLTAGE HIGHER SIDE BER (ON Q-SCALE) VS. VREF TICKS 43

THE RX TIMING LOWER SIDE BER (ON Q-SCALE) VS. VREF TICKS 43

THE TX VOLTAGE LOWER SIDE (ON Q-SCALE) 44

THE TX VOLTAGE HIGHER SIDE (ON Q-SCALE) 44

THE TX TIMING LOWER SIDE (ON Q-SCALE) 45

CPGC BIST ENGINE WITH SOFTWARE ASSISTED REPAIR TECHNOLOGY 54


CPGC ARCHITECTURE OVERVIEW 58

CPGC SW ASSISTED REPAIR TECHNOLOGY FLOW 59

CPGC UNIFIED PATTERN SEQUENCER AND BUFFER 61

CPGC ROW HAMMER STRESS PATTERN 65

UEFI COMPLIANT BIOS THAT HOSTS THE CPGC TESTS 77

READ/WRITE CREDITS FOR STANDARD IP INTERFACE 79

BASE OFFSET MEMORY STRESS ALGORITHM 79

CHECKER BOARD MEMORY DATA STRESS ALGORITHM 79

PRBS PATTERNS STRESS PATTERNS 80

VICTIM AND AGGRESSOR PATTERN 81

SILICON MEASUREMENT OF DDR IO EYE DIAGRAM 82

EYE DIAGRAM WITH VA STRESS PATTERN TURNED ON 83

CPGC BIOS MEASURED EYE HEIGHT WITH ALL LANES ENABLED 83

CPGC BIOS MEASURED EYE HEIGHT WITH BIT 5/ECC 5 DISABLED 84

CPGC IDENTIFIED MISSING SHIELDING ON BOARD LAYOUT 86

CPGC ARCHITECTURE IO TRAINING BOOT FAILURE DEBUG 87

IO DQS GLITCH CORRUPTING RX FIFO DEBUG 88

DDR IO INITIALIZATION AND TRAINING BIOS MAPPED EYE 90

CACHE BASED TEST ARCHITECTURE 100

NON-EVICT MODE ENTRY FLOW DIAGRAM 103

NON-EVICT MODE EXIT FLOW DIAGRAM 105

CPGC ARCHITECTURE LOOPS 107

CPGC ARCHITECTURE FLOW DIAGRAM 109

CPGC ARCHITECTURE LOOP 3, ADDRESS INSTRUCTION 109

CPGC ARCHITECTURE LOOP 2, DATA INSTRUCTION. 110

CPGC ARCHITECTURE LOOP 1, ALGORITHM INSTRUCTION 111


CPGC ARCHITECTURE LOOP 0, COMMAND INSTRUCTION 112

CPGC ARCHITECTURE, OFFSET INSTRUCTION 113

REFRESH CONTROL GENERATOR 114

CROSSTALK AND ISI ACROSS 64 DQ BITS (RMT) 116

VCCIO SUPPLY BOUNCE AS A FUNCTION OF DQ PATTERN FREQUENCY 117

MEASURED VOLTAGE AND TIMING DATA EYE (CBT) 118

TRANSMITTER DATA PER-BIT MARGIN (CBT) 119

RECEIVER DATA PER-BIT MARGIN (CBT) 120

TRANSMITTER REFERENCE VOLTAGE PER-BIT MARGIN (CBT) 121

RECEIVER REFERENCE VOLTAGE PER-BIT MARGIN (CBT) 122

TRANSMITTER DATA TIMING PER-BIT MARGIN (TAP) 123

RECEIVER DATA TIMING PER-BIT MARGIN (TAP) 124

TRANSMITTER REFERENCE VOLTAGE PER-BIT MARGIN (TAP) 125

RECEIVER REFERENCE VOLTAGE PER BIT-MARGIN (TAP) 126

SOC MEMORY SUB-SYSTEM AND CPGC ARCHITECTURE OVERVIEW 141

UEFI BOOT SERVICE LAYER THAT SUPPORTS CPGC TESTS MODULE. 148

CPGC SILICON RESULTS 152

CPGC CLOSED CALIBRATION LOOP SAVES CPU AND DRAM POWER 155

LIST OF TABLES Page

VICTIM AGGRESSOR PATTERN ROTATION WITH TWO DIFFERENT LFSRS 40

CROSSTALK/ISI IMPACT TO 11TH BIT IN TIME OF DQ61 USING SOFTWARE 41

COMPARISON OF TEST EXECUTION TIME OF APGC AND SOFTWARE 46

CPGC GATE COUNT 66

CPGC POWER ESTIMATES 67

CPGC ARCHITECTURE USAGE FOR VALIDATION 68

CPGC ARCHITECTURE USAGE FOR DEBUG 71

CPGC ARCHITECTURE USAGE FOR MEMORY INITIALIZATION/TRAINING 73

CPGC ARCHITECTURE SAVINGS TO VALIDATION, TEST, AND DEBUG TIME 91

PERFORMANCE GAIN OF CACHE BASED TESTS 127

COMPARISON OF CPGC FEATURES TO PRIOR WORK. 138


1 Dissertation Overview

This dissertation is organized as follows.

1. Dissertation Overview, which is the current chapter.

2. Introduction and Literature Review presents prior research and publications from experts in the field to provide a background on the challenges within IO and memory technology, as well as the existing efforts in this field. Specifically, the introduction lays the groundwork for Intel high speed IO, memory IO, inter-symbol-interference (ISI), crosstalk (which especially dominates DDR), the quest for effective and quick test patterns, test time and functional usage, memory cell failure models, and memory cell test algorithms.

3. Research Goals will cover the goals of this dissertation based on the challenges identified above in the introduction.

4. Research Methodology will cover the methods used within this dissertation to achieve the goals laid out above.

5. “Comparison of Hardware Based and Software Based Stress Testing of Memory IO Interface” was the first publication (2013 IEEE MWSCAS), focused on identifying effective stress patterns and on the effectiveness and speed of a BIST-based approach.


6. “A Reusable BIST with Software Assisted Repair Technology for Improved Memory and IO Debug, Validation, and Test Time (IO Debug and Validation)” was the next publication (2014 IEEE ITC), focused on using BIST technology to improve memory and IO debug, validation, and test time. This usage has turned out to be one of the most valuable applications of the BIST to Intel and to the industry. In addition, a software assisted memory cell repair technology was introduced. Considerable industry interest in speeding up BIST execution and overall test time using cache led to the third paper.

7. “Platform IO and System Memory Test Using L3 Cache Based Test (CBT) and Parallel Execution of CPGC Intel BIST Engine (IO Link Training)” was submitted to the 2015 IEEE ITC due to the surprisingly large interest in reducing overall test execution time by running BIST in cache. It provides more detail on the hardware and software infrastructure developed to leverage Intel’s cache-as-RAM (CAR) and non-eviction mode (NEM) to execute a BIST engine in parallel per memory controller, which led to significant test time reductions. Paired with how fast this BIST runs is the use of CPGC in IO link training and initialization. The combination of cache based code and parallel BIST execution gives current and future IO circuit designers and architects a powerful, flexible, yet low cost method to adapt to unforeseen post-silicon conditions. This was a valued contribution not only to Intel but also to the industry.

8. “Architecture of a Reusable BIST Engine for Detection and Auto Correction of Memory Failures and for IO Debug, Validation, Link Training, and Power Optimization on 14nm SOC” was the last publication submitted at the time of writing this dissertation. More of the CPGC architecture is presented, including the use of a pipelined architecture to enable back-to-back DDR transactions. The architecture is innovative as the first to combine IO BIST with memory BIST. An architecture capable of running all the classical memory test algorithms, such as “scan” and “march”, as well as newer user defined algorithms, is presented. The architecture was designed for testing 3D stacked ICs, but unfortunately, at the time of writing this dissertation, not enough 3DS IC memory cell test data was available, and thus most of the data presented is for memory IO tests. Lastly, the challenge of reducing IO power was explored using the flexible BIST engine to not only optimize IO performance but also to minimize IO power through fine power tuning.

9. Conclusion chapter summarizes this dissertation.

10. Bibliography chapter lists the references.


2 Introduction and Literature Review

2.1 Challenges of IO Bandwidth, Frequency, and Low Power

Shrinking process technology, driven by Moore’s law of doubling the transistors in integrated circuits (IC), has led to continuous improvements in computer performance since 1965. The ever increasing computing performance of single core and multi-core processors has driven an ever increasing demand on computer interface input and output (IO) bandwidth. According to Intel Labs [1], as shown in Figure 1, the bandwidth of high speed IO (HSIO) for computers is expected to increase 3X every two years.

Figure 1: Intel lab’s research predicts HSIO bandwidth increasing 3X/2yrs (source F. O’Mahony [1])


At the same time, the power density of a computer CPU core has reached the limit of existing cooling technology. To push computing performance beyond frequency scaling of the CPU core, dual core, quad core, and multi-core CPUs have been introduced not only by Intel, but also by a number of other industry players. In conjunction with the ever increasing IO bandwidth, the power efficiency demand on HSIO and memory IO continues to trend towards lower power; according to Figure 2, this corresponds to a 20% reduction in IO power every year.

Figure 2: Intel lab’s research predicts -20%/year in IO power (source F. O’Mahony [1])


Power reduction must be balanced with acceptable performance and error rates. Bit error rate (BER) can be simply defined as the number of bit errors that occur within a period of time or a window of transmitted bits. As shown in Figure 3, the vertical axis is the voltage detected at the IO receiver [2], and the horizontal axis is the timing at the receiver expressed in terms of unit interval (UI). As the receiver data eye reduces in size, so does BER. Both the data eye shown here and BER are important metrics of IO health. A number of non-idealities can reduce the eye, including clock jitter, inter-symbol-interference (ISI), crosstalk, and supply noise. ISI and crosstalk are highly channel related, and thus will be discussed further.

Figure 3: BER reduces as receiver data eye shrinks (source B. Casper [2])


2.2 Process Variation

Additional challenges for IO in advanced CMOS technology include process, temperature, and voltage variation. As transistors shrink in minimum dimension, the process variation from transistor to transistor, IC to IC, and wafer to wafer increasingly affects not only HSIO and memory IO circuits, but all IC circuits. Latif (RAICS 2011) [3] demonstrated, as shown in Figure 4, that ~20% of the silicon threshold voltage (VT) distribution lies outside of the 3 sigma window.

Figure 4: 20% VT variation lies beyond 3 standard deviation (source Latif, RAICS 2011 [3])


2.3 Temperature Variation

In addition to process variation, temperature and voltage variation can significantly affect HSIO, memory IO, and all IC circuit performance. T. Chen (EDAPS 2009) studied the effects of temperature and process variability on universal serial bus (USB2) IO [4]. According to Figure 5, as temperature increased from 37C to 125C, the jitter increased by 30ps, causing the eye height to decrease by about 20mV.

Figure 5: As T (37C->125C), jitter increased 30ps, eye-height decreased about 20mV (Source T. Chen [4])

Chen [4] also showed that as the process moved from typical to slow, the eye height decreased by about 30mV, as shown in Figure 6. The eye diagram and eye height, which will be discussed in more detail later, are important metrics for measuring IO health.


Figure 6: As P (TT->SS), jitter +25ps, eye-height -30mV (Source T. Chen [4])

2.4 Aging Effects

In addition to the PVT variation above, as ICs operate in these demanding environments over time, aging effects can adversely degrade circuit performance. For example, hot electrons cause the bond breaking rate to increase with temperature and current [5], which can fundamentally change HSIO and memory IO circuit behavior. “Bond breaking rate” is defined by A. Bravaix [5] as the rate at which electrons break Si-H bonds over time. This leads to degraded performance and non-functional circuits (ICs dying of old age). According to Figure 7, the bond breaking rate increases with current density (Ids/W) and temperature.


Figure 7: Bond breaking rate increase with (Ids/W) and Temperature (source A. Bravaix [5])


2.5 Calibration Methods and BIST

Given the PVT variation mentioned above, no modern CMOS device or circuit would be exactly as designed, and none of the high speed IO circuits would work without calibration. Having accurate on-die resistors, reference voltages, and currents based on calibration is critical to circuit functionality. There have been a number of prior works on circuit and IO calibration. A self-calibration method was demonstrated for DDR3; specifically, the on die termination (ODT) [6] was calibrated using an external calibration controller, and the calibrated results were stored in registers on the DRAM (Figure 8).


Figure 8: DDR3 Self-calibration (source C. Park [6])

Fundamental to any calibration, such as ODT calibration, are a sensor to measure the input, a controller to run the calibration algorithm, and an actuator to drive the new calibrated value. More on calibration will be discussed later.
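As a concrete illustration of this sensor/controller/actuator loop, the short Python sketch below models one common controller choice, a successive-approximation search of a termination code. The read_comparator callback and the 5-bit code width are assumptions made for this example; it is not the circuit of [6] or [7].

    # Illustrative closed-loop calibration sketch (hypothetical interfaces).
    # Sensor:   read_comparator(code) returns True while the strength driven by
    #           'code' is still at or below the external reference.
    # Actuator: the returned code would be written to the termination control register.

    def calibrate_odt(read_comparator, bits=5):
        """Successive-approximation search for the largest code the sensor accepts."""
        code = 0
        for bit in reversed(range(bits)):      # test legs MSB first
            trial = code | (1 << bit)          # tentatively enable this leg
            if read_comparator(trial):         # still within the reference: keep it
                code = trial
        return code

    if __name__ == "__main__":
        target = 19                            # modeled reference point (0..31)
        print("calibrated code:", calibrate_odt(lambda c: c <= target))   # -> 19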


Melikyan [7] in 2013 calibrated I/O termination resistors using the calibration circuit shown in Figure 9. An integrated reference current generator generated a reference current, which was compared to locally generated current through resistor banks. The compared results were sent to a logic module to control the actual transmitter (TX) and receiver (RX) through registers.

Figure 9: IO termination calibration circuit (source V. Melikyan [7])

Regardless of the method of calibration, or the implementation choice between hardware and software, the calibration time should be minimized. Calibration time is also known as the test time, or in a functional usage as training time. Calibration time should be as short as possible to enable the circuit to be operational most of the time. If the test time is too long, it is costly in a high volume manufacturing environment. If the training time is too long, the experience is simply not acceptable to the end users. Not only does the test time need to be short, but the test or calibration also needs to be accurate to be effective and, in the long term, to save power by not having to repeat the calibration often.


2.6 High Speed Differential IO (HSIO)

Despite the PVT variation challenges that HSIO and memory IO circuits face, and the added challenges of increasing demand for IO bandwidth and lower IO power, Intel has taken advantage of smaller and faster transistors to continue to push the edges of HSIO and memory IO. Intel has demonstrated mass-produced HSIO for PCIe generation 3 (PCIe Gen3) running at 8GT/S in commercial products. Shown in Figure 10 is an example of the Intel HSIO receiver architecture [8]. This architecture consists of a differential input, an equalizer, a receiver amplifier, and a digital clock and data recovery (CDR) control loop. Taking advantage of smaller transistor sizes, pushing more and more of the analog circuit control loops into the digital domain helps to minimize the PVT variation effects introduced above. This strategy of pushing more analog circuits into the digital domain to avoid PVT variation becomes more important as device sizes shrink and digital logic becomes cheaper from a silicon real-estate perspective.

Figure 10: Intel HSIO receiver architecture (source L. Chen [8])


Figure 11 from S. Puligundla [9] presents more details of an Intel PCIe Gen3 HSIO receiver. In this case, a decision feedback equalizer (DFE) was used in combination with automatic gain control (AGC) to provide a real-time eye monitor at the receiver. Depending on the channel behavior, the AGC adjusts the receiver amplifier gain, and the DFE removes inter-symbol-interference (ISI) to ensure that an adequate receiver eye is observed. Since the receiver eye is already available, a design-for-x (DFX) circuit [9] was added to enable design-for-test, design-for-manufacturing, and design-for-validation. Hence, the X refers to all post silicon activities that measure the eye. More on DFX will be discussed later.

Figure 11: Intel PCIe Gen 3 8GT/S HSIO receiver and DFX (source S. Puligundla [9])


2.7 Memory Single Ended IO

Intel recently published papers on its 22nm Haswell design [10]. For memory IO such as DDR, Intel used a cascode output driver and a pre-driver level shifter to shift from Intel internal CMOS voltages to DDR voltages (shown in Figure 12). Due to the single ended nature of the DDR IO bus, compared to the differential IO bus of PCIe, DDR suffers from crosstalk, which is interference from neighboring IO wires. At the same time, the DDR IO bus frequency is relatively low compared to PCIe Gen3, again due to the single-ended nature of the bus. This dissertation is largely focused on memory IO, and specifically on DDR IO characterization, testing, validation, and training. Hence, much more discussion of DDR will follow.


Figure 12: Intel Haswell (22nm) DDR cascade driver (N. Kurd [10])


One of the motivations for DDR to be a single ended bus is IO pin density. To keep the single ended IO bus and push the IO frequency while achieving higher IO bandwidth, Intel Haswell moved some of the DRAM on package (shown in Figure 13). Embedded DRAM (eDRAM) has a much shorter trace length than DDR; hence, Intel Haswell was able to push the IO frequency up to 6.4GT/S [10].

Figure 13: Intel Haswell (22nm) eDRAM at 6.4GT/s (source N. Kurd [10])


2.8 Memory Cell Fault Models

In addition to memory IO challenges, as memory density increases, memory cell data retention and the testing of memory cell failures become challenging as well, due to the increasing test time associated with higher memory density and the lack of probe access points. Therefore, some background on memory cell fault models is discussed here. Figure 14 shows a basic DRAM organization [11] where the addresses are decoded into x (a.k.a. row) and y (a.k.a. column).

Figure 14: DRAM organization (source K. K. Saluja and K. Kinoshita [11])


Each memory cell within the DRAM can be labeled as cell B (see Figure 15). Cell B has neighbor cells that are labeled north (N), south (S), west (W), and east (E) [12]. In addition, the paper also defines static neighborhood pattern-sensitive faults as static interactions between cell B and its neighbors (e.g., N, S, W, and E).

Figure 15: Adjacent neighborhood of a memory cell B (source P. de Jong and A. J. van de Goor [12])

Different memory cell fault patterns have been studied, and can be categorized [11] as:

1) Stuck-at-faults (SAF), which are cells stuck at 0 or 1, and can be tested via DC.

2) Adjacent pattern-sensitive faults (APSF), which are static

3) Adjacent pattern-sensitive faults, which are dynamic

4) Pattern-sensitive faults (PSF), which are dynamic

Typically, the DRAM is tested on a tester sitting outside of the DRAM interface. The tester can be programmed to step through various algorithms and patterns. According to [11], algorithms that depend on an odd number of neighbor cells are hard to build into hardware, because the number does not fit easily as a power of two. The work presented in this dissertation solves this problem by making the number of neighbors a variable, N, which is programmable via a register. This will be discussed later.

Z. Al-Ars and A. J. van de Goor [13] have studied modeling of DRAM faults in simulation. According to Figure 16, a DRAM cell is represented as a capacitor; the bit line (BL) and word line (WL) represent address control similar to the x (row) and y (column) addresses above. The O denotes an open, the S denotes a short, and B denotes bridging.

Figure 16: DRAM fault simulation model (source Z. Al-Ars and A. J. van de Goor [13])

Van De Goor defined four base test classes [14]: Electrical, march, base cell, and repetitive.

He used the following shorthand notation to describe address walking directions for his tests:


1. ⇑ Denotes an increasing address order (that is, address 0, 1, 2, and so on),

2. ⇓ Denotes a decreasing address order,

3. ⇕ Denotes that the test engineer can arbitrarily choose the address order to be ⇑ or ⇓, and

4. ↗ Denotes an address increment along the main diagonal.


The classical fault models of DRAM can be organized, per [15] and [16], as:

1. Address decoder faults (AFs):

a. A certain address cannot access any cell

b. A certain address accesses multiple cells simultaneously

c. No address can access a certain cell

d. Multiple addresses can access a certain cell

2. Memory cell faults:

a. Stuck-at fault (SAF)

b. Stuck-open fault (SOF)

c. Transition fault (TF)

d. Data retention fault (DRF): (There is a lot of interest in DRF due to smaller transistor sizes and more leaky DRAM cells. More on this will be discussed later.)

e. Coupling fault (CF)

f. Bridging fault (BF): short between cells

g. Neighborhood pattern sensitive fault (NPSF)


2.9 Memory Test Algorithms

Given the above fault models, developing an efficient test algorithm that covers these fault models in the shortest time is important and, at times, tied directly to millions of dollars.

There are a number of classical memory cell test algorithms published, and some of them are discussed here. A march test involves walking through memory in any of these N, E, S, and W directions while writing 1s or 0s, and reading the expected values such as 1s and 0s.

0s. According to Van De Goor donation, the march test can be written in short hand as:

{ ⇕(w0); ⇑(r0, w1, r1); ⇓(r1, w0, r0); ⇕(w1); ⇑(r1, w0, r0); ⇓(r0, w1, r1) }

A modified algorithmic test sequence (MATS) test is the shortest march test for unlinked SAFs in memory cells, and it can be written as:

{ ⇕(w0); ⇕(r0, w1); ⇕(r1) }

{ ⇕(w1); ⇕(r1, w0); ⇕(r0) }

A MATS+ test initializes the memory to a known data background and steps through with a read-write sequence. The test is performed with both incrementing and decrementing addressing. The test is visually shown in Figure 17, and it is written as:


{ ⇕(w0); ⇑(r0, w1); ⇓(r1, w0) }


Figure 17: MATS+ test algorithm

The MATS++ test [17] sequence is an optimized test sequence similar to the MATS+ test but allows fault coverage for TF’s. It can be written as:

{ ⇕(w0); ⇑(r0, w1); ⇓(r1, w0, r0) }

March C- is a modified march C test that removes redundancies. It detects unlinked AF’s, SAF’s, TF’s, and all CF’s, and can be written as:

{ ⇕(w0); ⇑(r0, w1); ⇑(r1, w0); ⇓(r0, w1); ⇓(r1, w0); ⇕(r0) }

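As an illustration of how this shorthand maps onto executable address loops, the minimal Python sketch below (a software model, not the CPGC hardware) runs MATS+ and March C- over a small simulated memory with one injected stuck-at-0 cell; the memory size and fault location are assumptions made for the example.

    # Illustrative software model of march tests; not the CPGC implementation.
    N = 64  # number of modeled memory cells

    def make_memory(stuck_at_zero=()):
        """Simple memory model with optional stuck-at-0 cells injected."""
        mem = [0] * N
        def write(addr, val):
            mem[addr] = 0 if addr in stuck_at_zero else val
        def read(addr):
            return mem[addr]
        return write, read

    def run_march(elements, write, read):
        """elements: list of (order, ops); order +1 = ascending, -1 = descending;
        ops are ('w', value) or ('r', expected). Returns failing addresses."""
        errors = set()
        for order, ops in elements:
            addrs = range(N) if order > 0 else range(N - 1, -1, -1)
            for addr in addrs:
                for op, val in ops:
                    if op == 'w':
                        write(addr, val)
                    elif read(addr) != val:
                        errors.add(addr)
        return sorted(errors)

    MATS_PLUS = [(+1, [('w', 0)]),
                 (+1, [('r', 0), ('w', 1)]),
                 (-1, [('r', 1), ('w', 0)])]

    MARCH_C_MINUS = [(+1, [('w', 0)]),
                     (+1, [('r', 0), ('w', 1)]),
                     (+1, [('r', 1), ('w', 0)]),
                     (-1, [('r', 0), ('w', 1)]),
                     (-1, [('r', 1), ('w', 0)]),
                     (+1, [('r', 0)])]

    if __name__ == "__main__":
        write, read = make_memory(stuck_at_zero={13})
        print("MATS+ flags:   ", run_march(MATS_PLUS, write, read))       # -> [13]
        write, read = make_memory(stuck_at_zero={13})
        print("March C- flags:", run_march(MARCH_C_MINUS, write, read))   # -> [13]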
2.10 Research Goals and methodology

2.10.1 Goal

This dissertation is based on a series of published papers investigating the shortfalls identified above. For example, how is faster and higher bandwidth IO enabled? How is lower IO power enabled? How are faults in IO and memory cells to be identified quickly (lower test time) and efficiently? With that in mind, this dissertation follows a path to develop an on-die solution to:

1. Measure and characterize the IO health

2. Fine tune IO for maximal performance (link training)

3. Fine tune IO for lower power (power tuning)

4. Leverage Intel core performance and L3 cache to reduce test time and link training time

5. Measure memory cell defects

6. Repair memory cell failures using spare cells automatically


The reason to pursue these goals was to develop a low-cost and effective method to improve IO performance, reduce power, reduce test time, improve debug and validation coverage, and in some cases, automatically repair defects detected in memory cells.

2.10.2 Methodology

The research was driven by the challenges facing high performance microprocessor design at the 32nm, 22nm, and 14nm technology nodes and beyond. The research involved understanding prior and existing publications on HSIO and memory IO design. The next step was to identify and understand gaps and shortcomings in the current state-of-the-art 32nm and 22nm ICs. The research focused on real world industry designs and design challenges, and proposed innovative, cost-effective architectures that can be applied to the real world problems identified. Specifically, the research was focused on developing a unified or converged BIST engine that addresses both the IO and memory cell test challenges and requirements, and reduces test time and overall cost.

Once the architectures were identified, they underwent rigorous peer review, primarily by industry peers, but in the process of publishing papers they were reviewed by academics as well. Multiple patents were filed (6 granted and 2 pending) [18]–[25]. For the proposed architectures to be widely implemented across Intel, each architecture also had to pass financial review to be funded and, at times, had to pass Intel vice president (VP) level review to be granted full authority, funding, and engineering resources to complete the design and validation cycle.


The author worked with a group of peers, dotted line reports, and direct reports to complete the design and validation cycles. The design cycle and methodology involved using register transfer language (RTL) to describe the architecture, and pre-silicon model based simulations of both the architecture and the test patterns. The design cycle also included automatic synthesis of the RTL into a gate level netlist, floor planning, timing analysis, reliability analysis, and power validation before tapeout. The post-silicon validation cycle and methodology involved test chips and early sample characterization of the functional HSIO, the memory IO circuits, and the BIST engine. The development and proofing of the high volume manufacturing flow followed. The post-silicon validation cycle also included statistical sub-sampling and testing of large populations to predict a population’s overall behavior, and finalized the recipe and flow for volume production to ensure that BIST and IO behavior matches expectation. Many of the steps within the design and validation cycle are iterative, taking insights from simulation, test chips, and validation back through the architecture to improve the overall research.


2.11 References

[1] F. O’Mahony, G. Balamurugan, J. E. Jaussi, J. Kennedy, M. Mansuri, S. Shekhar, and B. Casper, “The future of electrical I/O for microprocessors,” in International Symposium on VLSI Design, Automation and Test, 2009. VLSI-DAT ’09, 2009, pp. 31–34.

[2] B. Casper, J. Jaussi, F. O’Mahony, M. Mansuri, K. Canagasaby, J. Kennedy, E. Yeung, and R. Mooney, “A 20Gb/s Forwarded Clock Transceiver in 90nm CMOS,” in Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International, 2006, pp. 263–272.

[3] M. A. A. Latif, N. B. Z. Ali, and F. A. Hussin, “A case study of process-variation effect to SoC analog circuits,” in 2011 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2011, pp. 520–523.

[4] T. Chen, P. Luo, and H. Fu, “Signal integrity analysis of high speed USB IO and interconnection,” in IEEE Electrical Design of Advanced Packaging Systems Symposium (EDAPS 2009), 2009, pp. 1–5.

[5] A. Bravaix, C. Guerin, V. Huard, D. Roy, J.-M. Roux, and E. Vincent, “Hot-Carrier acceleration factors for low power management in DC-AC stressed 40nm NMOS node at high temperature,” in Reliability Physics Symposium, 2009 IEEE International, 2009, pp. 531–548.

[6] C. Park, H. Chung, Y.-S. Lee, J. Kim, Y.-S. Lee, M.-S. Chae, D.-H. Jung, S.-H. Choi, S. Seo, T.-S. Park, J.-H. Shin, J.-H. Cho, S. Lee, K.-W. Song, K. Kim, J.-B. Lee, C. Kim, and S.-I. Cho, “A 512-mb DDR3 SDRAM prototype with CIO minimization and self-calibration techniques,” IEEE J. Solid-State Circuits, vol. 41, no. 4, pp. 831–838, Apr. 2006.

[7] V. Melikyan, A. Balabanyan, A. Hayrapetyan, and N. Melikyan, “Receiver/transmitter input/output termination resistance calibration method,” in Electronics and Nanotechnology (ELNANO), 2013 IEEE XXXIII International Scientific Conference, 2013, pp. 126–130.

[8] L. Chen, F. Spagna, P. Marzolf, and J. K. Wu, “A 90nm 1-4.25-Gb/s Multi Data Rate Receiver for High Speed Serial Links,” in Solid-State Circuits Conference, 2006. ASSCC 2006. IEEE Asian, 2006, pp. 391–394.

[9] S. Puligundla, F. Spagna, L. Chen, and A. Tran, “Validating the performance of a 32nm CMOS high speed serial link receiver with adaptive equalization and baud-rate clock,” in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1–5.

[10] N. Kurd, M. Chowdhury, E. Burton, T. P. Thomas, C. Mozak, B. Boswell, M. Lal, A. Deval, J. Douglas, M. Elassal, A. Nalamalpu, T. M. Wilson, M. Merten, S. Chennupaty, W. Gomes, and R. Kumar, “5.9 Haswell: A family of IA 22nm processors,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, 2014, pp. 112–113.

[11] K. K. Saluja and K. Kinoshita, “Test Pattern Generation for API Faults in RAM,” IEEE Trans. Comput., vol. C–34, no. 3, pp. 284–287, Mar. 1985.

[12] P. de Jong and A. J. van de Goor, “Test pattern generation for API faults in RAM,” IEEE Trans. Comput., vol. 37, no. 11, pp. 1426–1428, Nov. 1988.

[13] Z. Al-Ars and A. J. van de Goor, “Test generation and optimization for DRAM cell defects using electrical simulation,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 22, no. 10, pp. 1371–1384, Oct. 2003.

[14] A. J. van de Goor, “An industrial evaluation of DRAM tests,” IEEE Des. Test Comput., vol. 21, no. 5, pp. 430–440, Sep. 2004.

[15] C.-T. Huang, J.-R. Huang, C.-F. Wu, C.-W. Wu, and T.-Y. Chang, “A programmable BIST core for embedded DRAM,” IEEE Des. Test Comput., vol. 16, no. 1, pp. 59–70, Jan. 1999.

[16] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing.

[17] A. J. van de Goor, Testing semiconductor memories: theory and practice. J. Wiley & Sons, 1991.

[18] B. Querbach, D. G. Ellis, A. Khan, M. J. Tripp, E. S. Gayles, and E. Gollapudi, Automatic self test of an integrated circuit component via AC I/O loopback. Google Patents, 2006.

[19] B. Querbach, A. Khan, M. Tripp, L. B. Guerrero, M. A. V. Vargas, and A. Muhtaroglu, Methods and apparatuses for validating AC I/O loopback tests using delay modeling in RTL simulation. Google Patents, 2007.

[20] D. Ellis, B. Querbach, J. Nejedlo, A. Khan, S. Babcock, E. Gayles, and E. Gollapudi, Pre-announce signaling for interconnect built-in self test. Google Patents, 2003.

[21] D. G. Ellis, B. Querbach, J. J. Nejedlo, A. Khan, S. R. Babcock, E. S. Gayles, and E. Gollapudi, Push button mode automatic pattern switching for interconnect built-in self test. Google Patents, 2004.

[22] B. Querbach, M. A. Abdallah, A. M. Khan, M. M. Hossain, and S. M. Sarkar, Regulating a timing between a strobe signal and a data signal. Google Patents, 2009.

[23] B. L. Spry, T. Z. Schoenborn, P. Abraham, C. P. Mozak, D. G. Ellis, J. J. Nejedlo, B. Querbach, Z. Greenfield, R. Ghattas, and J. Tholiyil, Robust memory link testing using memory controller. Google Patents, 2009.

[24] M. B. Trobough, K. K. Tiruvallur, C. B. Prudvi, C. E. Iovin, D. W. Grawrock, J. J. Nejedlo, A. N. Kabadi, T. K. Goff, E. J. Halprin, and K. B. Udawatta, Test, validation, and debug architecture. US Patent 20,150,127,983, 2015.

[25] B. Querbach, R. B. Hamilton, L. A. Johnson, and M. Kim, Voltage margining with a low power, high speed, input offset cancelling equalizer. Google Patents, 2009.


Comparison of Hardware based and Software based Stress Testing of Memory IO Interface

Bruce Querbach, Sudeep Puligundla, Daniel Becerra, Zale T Schoenborn, Patrick Chiang

2013 MWSCAS,

Columbus, Ohio,

56th IEEE International Midwest Symposium on Circuits and Systems


3 Comparison of hardware and software stress method

Comparison of Hardware Based and Software Based Stress Testing of Memory IO Interface

Abstract— In post-silicon testing and validation of circuit functionality, an effective IO stress pattern can identify bugs quickly and provide adequate test coverage. A lot of work has been done to identify the right stress patterns specific to each IO interface. While some patterns can be generic enough to apply to all IOs, other patterns are interface topology specific. In addition to identifying the worst-case pattern, tradeoffs between test-time and test coverage must be made depending on the test goals. Pseudo random bit stream (PRBS) generators are commonly used to generate test patterns because of the adequate frequency content in PRBS patterns, ease of implementation, and minimal gate count. This paper introduces an advanced pattern generator and checker (APGC) based on PRBS that retains all the aforementioned advantages. The APGC was implemented for a DDR memory interface where different LFSRs beat against each other spatially on neighboring IO lanes while rotating this form of aggressor-victim pattern in time. The results of the APGC stress patterns are compared to a form of advanced software-based learning algorithm based patterns that exhaustively search this complete parameter space. The comparison of APGC to software showed that the measured bit error rate (BER) plotted on a Q-scale is similar for both methods on the receiver side. On the transmitter side, APGC showed less eye opening than the software. In addition to the margin comparison, on the test execution side, APGC can speed up the test and validation execution time compared to the software by 32 to 2048 times, depending on an aggressor victim lane width of eight to 64 lanes.

3.1 Introduction

After an integrated circuit is taped out, the post-silicon test and validation is focused on verifying the proper functionality of the design. This work must be done quickly to enable time to market. An effective IO stress pattern can identify bugs quickly and provide adequate test coverage. The effectiveness of the stress pattern can be measured by its frequency content. The wider the frequency content, the more corner cases it excites. The running length of the test is also very important from an effectiveness point of view. As the pattern length increases, the validation coverage increases; however, that adds cost to the testing and hence, the product. Thus, an effective pattern must also be short to minimize test time.

A lot of work has been done to identify the most effective stress pattern. Goel [1] uses a term tree minimization algorithm to come up with an effective pattern.

While some patterns can be generic enough to apply to all IOs, such as a clock pattern, DC all ones, and DC all zeros, these patterns are good starting patterns to train and initialize the link. However, the frequency content of these patterns consists of impulse functions at a few select frequencies.


Using learning algorithms such as an exhaustive search or a genetic algorithm [2], a topology specific worst-case pattern can be found after a long search time. Usually this involves testing over all the interactions between neighboring IO lanes both in time and in space to analyze the effects of crosstalk and inter symbol interference (ISI). Genetic algorithms can reduce search times, but these algorithms frequently come up with a solution that is very specific to the IO interface topology under test. For example, by switching several DDR modules around, the previously discovered worst-case pattern is no longer valid due to a topology change.

Tradeoffs between test-time and test coverage must be made depending on the test goals. For example, the error tolerance of an HDMI interface is less stringent than that of multi-processor chip-to-chip communication interfaces. Higher error tolerance and reliability requirements can increase test time significantly.

The pseudo random bit sequence (PRBS) has been widely used in testing the correct functionality of integrated circuits. PRBS can quickly generate adequate frequency content while keeping the gate count of the design to a minimum. Eye diagrams built based on these PRBS patterns can be related to circuit parameters [3] and be used to tune and improve transmitter/receiver amplifier designs. There are a number of PRBS designs focused on low power and high performance, such as [4]. A number of PRBS architectures have been designed as full rate [5], half rate [6, 7, 8, and 9], and quarter rate [10]. In this paper, the focus is on comparing a combination of multiple PRBS patterns to learning algorithms.


Section 3.2 presents the architecture of the APGC. Section 3.3 describes the aggressor and victim pattern rotation capability of the APGC architecture, used to stress crosstalk effects between neighboring lanes; this was used for DDR memory link initialization and validation. Section 3.4 describes a form of an advanced learning algorithm that exhaustively searches the complete parameter space. The software based pattern generation was also used in link validation for comparison. Section 3.5 presents the margin result comparison and time savings between the two algorithms.

3.2 The APGC Architecture

Shown in Figure 18 below is a conceptual transmitter and receiver for a memory link. The Tx Vref is generated using a voltage divider, and the value is sent to the receiver on the DRAM device. The transmitter clock has multiple phase interpolators (PI) to adjust the clock up to 64 steps within one UI. Similarly, the receiver sampler can adjust the PI up to 64 steps within one UI. The receiver sampler also needs a reference voltage (Rx Vref) to compare with sampled data. The Vref on both the Tx and the Rx side can be adjusted up to 64 steps between ground and VCC.


Figure 18: The conceptual memory link transmitter and receiver

An advanced pattern generator and checker, presented below, was used to measure the IO link quality. As shown in Figure 19, APGC uses two LFSRs, which can produce PRBS and low frequency patterns. It also supports two N bit shift registers, which can produce user defined patterns. Each data lane has a mux, which chooses which of the patterns is sent on the DQ lane. Each of the 64 data lanes (DQ0, DQ1, … DQ63) can have unique patterns.


Figure 19: The APGC architecture
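For reference, the Python sketch below models the kind of LFSR-based PRBS source the APGC relies on; the tap positions shown are an example maximal-length PRBS15 polynomial (x^15 + x^14 + 1) and are not necessarily the polynomials implemented in the APGC hardware.

    # Illustrative Fibonacci LFSR model of a PRBS source; taps are an example
    # maximal-length PRBS15 polynomial, not the APGC's own programming.
    def lfsr_bits(width=15, taps=(15, 14), seed=1):
        state = seed
        while True:
            fb = 0
            for t in taps:
                fb ^= (state >> (t - 1)) & 1            # XOR of the tapped bits
            state = ((state << 1) | fb) & ((1 << width) - 1)
            yield state & 1                             # serial PRBS output bit

    if __name__ == "__main__":
        gen = lfsr_bits()
        print("first 32 PRBS bits:", [next(gen) for _ in range(32)])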

3.3 The Aggressor/Victim Pattern Rotation

On a recent Intel microprocessor memory link initialization and training algorithm, an aggressor and a victim pattern beat against each other to identify and minimize lane-to-lane crosstalk. This was achieved through the programming of two different LFSR patterns (LFSR-0, LFSR-1) provided by APGC on neighboring lanes while rotating this sequence in time across 5, 8, 16, 32 or 64 DQ lanes. An example of these two LFSR patterns rotating between five lanes is shown in Table 1.

Table 1: Victim Aggressor pattern rotation with two different LFSRs

    IO pin   Pattern               Sequence in Time
    DQ59     0000 0000 0000 0000   LFSR 0   LFSR 1   LFSR 0   LFSR 0   LFSR 0   LFSR 0
    DQ60     0001 0001 0001 0001   LFSR 0   LFSR 0   LFSR 1   LFSR 0   LFSR 0   LFSR 0
    DQ61     0011 0011 0011 0011   LFSR 0   LFSR 0   LFSR 0   LFSR 1   LFSR 0   LFSR 0
    DQ62     0111 0111 0111 0111   LFSR 0   LFSR 0   LFSR 0   LFSR 0   LFSR 1   LFSR 0
    DQ63     1111 1111 1111 1111   LFSR 1   LFSR 0   LFSR 0   LFSR 0   LFSR 0   LFSR 1
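The rotation itself is straightforward to express in software. The Python sketch below reproduces the assignment principle behind Table 1: one lane at a time carries the LFSR 1 aggressor while the remaining lanes carry LFSR 0. The starting lane and number of time slots are illustrative rather than the exact hardware programming.

    # Illustrative victim/aggressor rotation schedule in the spirit of Table 1.
    def rotation_schedule(num_lanes, steps):
        """schedule[t][lane] = 1 if that lane drives LFSR 1 during slot t."""
        return [[1 if lane == (t % num_lanes) else 0 for lane in range(num_lanes)]
                for t in range(steps)]

    if __name__ == "__main__":
        lanes = ["DQ59", "DQ60", "DQ61", "DQ62", "DQ63"]
        for t, row in enumerate(rotation_schedule(len(lanes), steps=5)):
            print("slot", t, {name: f"LFSR {sel}" for name, sel in zip(lanes, row)})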


3.4 The Software Learning Algorithm and Pattern

Many learning algorithms are used to discover the worst-case pattern, which is usually specific to the platform topology and layout. The software algorithm used here is an exhaustive search algorithm that tries all combinations of bit flips of neighboring bits, both in time and in space, of the bit under test (the 11th bit of DQ61 in Table 2). When this bit of DQ61 is under test, the margins including crosstalk and ISI are measured while its neighbors are flipped one bit at a time. Shown in red are the neighbor bits that have the most cross coupling.

Table 2: Crosstalk/ISI impact to 11th bit in time of DQ61 using software

    IO pin   Pattern               Cross Talk/ISI
    DQ59     0000 0010 1010 0001   88 99 63 94 98 80 98 58 92 67 16 99 99 92 99 79
    DQ60     0000 0001 1011 0001   65 60 65 82 94 33 69 25 21  0  0 75 99 90 99 92
    DQ61     0000 0001 1010 0000   31 46 21 48 63 18  2  4  0  0  0 40 65 46 73 33
    DQ62     0000 0000 0110 0011   90 96 71 80 99 54 99 63 37  0 29 99 86 73 99 99
    DQ63     0000 1000 0110 0011   63 65 79 79 99 35 79 31 99  0  6 99 94 73 99 92
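A compact Python model of this exhaustive neighbor-flip search is sketched below. The margin measurement is abstracted behind a hypothetical measure_margin callback, since in the real flow that step exercises the IO margining hardware, and the adjacent double-bit flips described above are omitted for brevity.

    # Illustrative exhaustive neighbor-flip search; measure_margin is a
    # hypothetical callback standing in for the hardware margin measurement.
    def exhaustive_stress(base, victim_lane, victim_bit, measure_margin):
        """base: dict lane -> list of bits. Flips every other bit to 0 and 1 in
        turn and records the worst margin seen at (victim_lane, victim_bit)."""
        worst = float("inf")
        lanes = sorted(base)
        nbits = len(base[lanes[0]])
        for lane in lanes:
            for bit in range(nbits):
                if (lane, bit) == (victim_lane, victim_bit):
                    continue
                for forced in (0, 1):                    # single-bit aggressor flip
                    pattern = {l: list(bits) for l, bits in base.items()}
                    pattern[lane][bit] = forced
                    worst = min(worst, measure_margin(pattern, victim_lane, victim_bit))
        return worst

    if __name__ == "__main__":
        import random
        random.seed(0)
        base = {f"DQ{59 + i}": [0] * 16 for i in range(5)}
        fake_margin = lambda pattern, lane, bit: random.randint(0, 99)   # stand-in
        print("worst-case margin:", exhaustive_stress(base, "DQ61", 10, fake_margin))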

3.5 Margin Result

Below is the comparison of hardware (APGC) versus software stress measured on a recent Intel microprocessor. The comparison of eye margins of APGC and software in terms of Q-scale vs. ticks is shown in Figure 20 through Figure 25 for both the transmitter (Tx) and receiver (Rx). Q-scale is defined as the inverse of the standard normal cumulative distribution of the BER. A tick is defined as 1/64th of a UI (unit interval) for timing margin and 1/64th of the full voltage swing from ground to VCC for voltage margin.
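For reference, the Q-scale value used in the following plots can be computed directly from a measured BER with the inverse standard normal CDF, as in the short sketch below.

    # Q-scale: inverse of the standard normal cumulative distribution of the BER
    # (the plots show the magnitude of this value).
    from statistics import NormalDist

    def q_scale(ber):
        return NormalDist().inv_cdf(ber)

    if __name__ == "__main__":
        for ber in (1e-3, 1e-6, 1e-9, 1e-12):
            print(f"BER {ber:.0e} -> Q = {abs(q_scale(ber)):.2f}")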


3.6 Receiver Voltage Margin – Low Side

The IO under test has the ability to change the sampling voltage and sampling time for both the transmitter and receiver, independent of the memory device on the remote side. Figure 20 compares the low side voltage margin measured by APGC and software. The plot shows that the margins measured by the two approaches match closely.

Figure 20: The Rx Voltage Lower Side BER (on Q-scale) vs. Vref ticks

3.7 Receiver Voltage Margin – High Side

A similar match and trend in margin results is also observed between the two approaches for the high side voltage margin, as shown in Figure 21.


Figure 21: The Rx Voltage Higher Side BER (on Q-scale) vs. Vref ticks

3.8 Receiver Timing Margin – Low Side

A similar match and trend in margin results is also observed between the two approaches for the lower side timing margin, as shown in Figure 22.

Figure 22: The Rx Timing Lower Side BER (on Q-scale) vs. Vref ticks

3.9 Transmitter Voltage Margin – Low Side

While we observed comparable results on the receiver margin measurements, on the transmitter margin measurements, differences were observed as shown in Figure 23. For the same Q, we observed a difference of up to 10 ticks.


Figure 23: The Tx Voltage Lower Side (on Q-scale)

3.10 Transmitter Voltage Margin – High Side

On the transmitter voltage higher side in Figure 24, for the same Q, we observed a difference of 15 ticks.

Figure 24: The Tx Voltage Higher Side (on Q-scale)

3.11 Transmitter Timing Margin – Low Side

On the transmitter timing lower side in Figure 25, for the same Q, we observed a difference of two ticks.


Figure 25: The Tx Timing Lower Side (on Q-scale)

3.12 Comparison of Test Execution Time of APGC and Software

In the case of APGC, each LFSR is programmed to generate 2^14 bits, and this pattern is rotated to the neighboring lane using a circular shift register that is eight lanes wide. Therefore, after eight rotations, the complete pattern takes 2^14 x 8 = 131072 bit times.

In the case of the software, the base pattern is 16 x 8 = 128 bits. Each of the single bits and double bits (adjacent pairs) in the base pattern is flipped (to both 1 and 0) to act as an aggressor to test the ISI and crosstalk on the victim bit. This results in 128 x 128 x 2 = 32768 bit times per victim location in time and space. The complete pattern takes 32768 x 128 = 4194304 bit times.

If only an eight-lane-wide spatial effect is considered, the APGC takes only 131072 / 4194304 = 3.125% of the software execution time. If we consider 16 lanes or 64 lanes, the time savings increase as demonstrated in Table 3:


Table 3: Comparison of test execution time of APGC and software

                           8 lanes     16 lanes     64 lanes
    APGC (bit time)        131072      262144       1048576
    Software (bit time)    4194304     33554432     2.15E+09
    Ratio of time saving   32          128          2048
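The bit-time arithmetic behind Table 3 can be reproduced in a few lines, as sketched below; note that the ratio works out to (number of lanes)^2 / 2, consistent with the 32x, 128x, and 2048x entries.

    # Reproduces the bit-time arithmetic of Table 3.
    def apgc_bit_time(lanes, prbs_len=2**14):
        return prbs_len * lanes                    # one LFSR run per lane rotation

    def software_bit_time(lanes, bits_per_lane=16):
        base = bits_per_lane * lanes               # base pattern length in bits
        return (base * base * 2) * base            # 0/1 flips per bit, per victim location

    if __name__ == "__main__":
        for lanes in (8, 16, 64):
            a, s = apgc_bit_time(lanes), software_bit_time(lanes)
            print(f"{lanes:>2} lanes: APGC {a}, software {s}, ratio {s // a}")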

3.13 Conclusion

In summary, this paper presented the APGC architecture, which provides the ability to generate rotating victim aggressor PRBS patterns and supports advanced learning algorithms in software via the user defined pattern buffers. The APGC pattern used for memory link initialization and training was compared to the exhaustive search-learning algorithm in software used on a recent Intel microprocessor.

The silicon results show good correlation across Rx voltage and timing between APGC based patterns and software. On the transmitter side, both timing and voltage margins are close, but not completely matching. This difference is worth further exploration.


This paper showed that rotating PRBS patterns are sufficient for memory link initialization and training, and the validation margin results are comparable to more advanced learning algorithms. The APGC architecture based patterns are better in terms of test execution speed. A speedup of 2048 times over the software was realized assuming a 64-lane-wide repeatable pattern, and of 32 times over eight lanes. Typical link training can be done on the order of milliseconds. During this time, APGC provides adequate discovery of ISI and crosstalk and tries to minimize their impact on link performance.

For future work, the authors would like to explore the comparison of APGC to other learning algorithms, and study in detail the execution speed of various algorithms.

3.14 Acknowledgements

The help received from Eric Donkoh, Peters Stephen, and Rahima K Mohammed of Intel Corporation is greatly appreciated.


3.15 References

[1] S. K. Goel, N. Devta-Prasanna, and R. P. Turakhia, “Effective and Efficient Test Pattern Generation for Small Delay Defect,” VLSI Test Symposium, 2009. VTS ’09. 27th IEEE.

[2] Tai-shan Yan, “An Improved Genetic Algorithm and Its Blending Application with Neural Network,” Intelligent Systems and Applications (ISA), 2010 2nd International Workshop.

[3] D. B. Venkatesha, S. Chitrashekaraiah, A. A. Rezazadeh, and A. Thiede, “Interpreting InGaP/GaAs DHBT Eye diagrams using small signal parameters,” Microwave Integrated Circuit Conference, 2007. EuMIC 2007. European.

[4] E. Laskin and S. P. Voinigescu, “A 60 mW per Lane, 4 x 23-Gb/s 2^7-1 PRBS Generator,” IEEE J. Solid-State Circuits, vol. 41, no. 10.

[5] R. Malasani, C. Bourde, and G. Gutierrez, “A SiGe 10-Gb/s multipattern bit error rate tester,” Radio Frequency Integrated Circuits (RFIC) Symposium, 2003 IEEE.

[6] H. Veenstra, “1–58 Gb/s PRBS generator with <1.1 ps RMS jitter in InP technology,” Solid-State Circuits Conference, 2004. ESSCIRC 2004. Proceeding of the 30th European.

[7] H. Knapp, M. Wurzer, T. F. Meister, J. Bock, and K. Aufinger, “40 Gbit/s 2^7-1 PRBS generator IC in SiGe bipolar technology,” Bipolar/BiCMOS Circuits and Technology Meeting, 2002. Proceedings of the 2002.

[8] H. Knapp, M. Wurzer, W. Perndl, K. Aufinger, J. Böck, and T. F. Meister, “100-Gb/s 2^7-1 and 54-Gb/s 2^11-1 PRBS generators in SiGe bipolar technology,” IEEE J. Solid-State Circuits, vol. 40, no. 10, pp. 2118–2125, Oct. 2005.

[9] D. Kucharski and K. Kornegay, “A 40 Gb/s 2.5 V 2^7-1 PRBS generator in SiGe using a low-voltage logic family,” in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, Feb. 2005, pp. 340–341.

[10] T. O. Dickson, E. Laskin, I. Khalid, R. Beerkens, J. Xie, B. Karajica, and S. P. Voinigescu, “A 72 Gb/s 2^31-1 PRBS generator in SiGe BiCMOS technology,” in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, Feb. 2005, pp. 342–345.

50

A Reusable BIST with Software Assisted Repair Technology for Improved Memory and IO Debug, Validation and Test Time

Bruce Querbach, Rahul Khanna, David Blankenbeckler, Yulan Zhang, Ronald T. Anderson, and David G. Ellis, Intel Corporation

Sabyasachi Deyati, Georgia Institute of Technology

Patrick Chiang, Oregon State University

2014 ITC

Seattle, Washington, USA

2014 IEEE International Test Conference (ITC)

4 Improved Debug, Validation and Test Time

A Reusable BIST with Software Assisted Repair Technology for Improved Memory and IO Debug, Validation and Test Time

Abstract

As silicon integration complexity increases with 3D stacking and through-silicon-via (TSV), so does the occurrence of memory and IO defects and the associated test and validation time. This ultimately leads to an overall cost increase. On a 14nm Intel SOC, a reusable BIST engine called converged pattern generator and checker (CPGC) is architected to detect memory and IO defects, and is combined with software assisted repair technology to automatically repair memory cell defects on 3D stacked Wide-IO DRAM. Additionally, we present the CPGC gate count, power, simulation, and silicon results. The reusable CPGC IP is designed to connect to a standard IP interface, which enables a quick turn-key SOC development cycle.

Silicon results show CPGC can speed up validation by 3x, improve test time from minutes down to seconds, and decrease debug time by 5x, including the root-cause of boot failures of the memory interface. CPGC is also used in memory training and initialization, which makes it a critical part of Intel SOC.

Keywords: High speed IO, Memory IO, Memory Interface Training, Post Package Repair, Memory Array Test Engine, hardware/software calibration

4.1 Introduction

Shrinking process technology and faster memory bus data rates continue to drive complexity in I/O design. This surge in complexity brings with it an increase in design bugs. While some of these bugs (such as logic bugs) can be detected at the pre-silicon stage, others (such as cross coupling of adjacent data lanes, voltage droop, ground bounce, power bounce, and timing errors due to process shifts) remain undetected. Although logic bugs can be detected in pre-silicon, the combination of increasing design complexity, shortened design time, and complex OS and software demands can lead to an escalation of bug escapes.

A modern microprocessor design can experience considerable design bugs, specifically in memory interfaces. Sarangi et al. [1] report the percentages of memory interface design bugs found in AMD64 and Intel's Pentium 4 to be 17.4% and 20.2%, respectively. Therefore there is a critical need for an efficient methodology to validate the memory interface in a microprocessor design.

Controllability and observability are key attributes in any post-silicon validation methodology. One conceivable but expensive solution is to probe the pre-packaged dynamic random access memory (DRAM) die. The easiest and most cost effective solution is to include built-in self-test (BIST) circuitry inside the chip. The BIST engine provides an efficient mechanism to deterministically drive and receive patterns on the memory bus along with automated data checking capability. These features can be used to swiftly identify and isolate bugs in the silicon validation phase. Furthermore, given the continuous increase in system reliability requirements and the current trend of adding more complex features to achieve higher reliability goals, this paper aims to present an effective automated repair strategy that combines the converged pattern generator and checker (CPGC) with software assistance in protecting against memory errors. In addition, we present possible directions for designing and testing future systems that are resilient to memory errors.

In addition to enabling efficient validation of the memory controller, a BIST engine facilitates effective identification of DRAM defects. DRAM errors are one of the leading causes of hardware failure in modern compute clusters [10, 11]. This has led to increasing requirements for reliability, availability, and serviceability (RAS), particularly in mission critical environments. Standard OS based memory stress tools face a significant challenge in providing adequate coverage of the many types of DRAM faults due to the presence of architectural features in the memory controller, such as data scrambling, reordering, and various levels of interleaving. These features make it difficult to generate the types of patterns or coverage that the test writer intends. With a suitable design, a BIST engine can overcome these challenges by providing the means to precisely put the intended patterns on the bus, thereby significantly improving the ability to identify DRAM faults. Once identified, the BIST engine with the memory controller and software assisted repair technology can provide a means to enable automatic repair of memory cell defects in 3DS WIO DRAM.

Figure 26: CPGC BIST engine with software assisted repair technology

In this paper, the authors propose a new architecture (Figure 26) for CPGC. It is capable of IO [13, 14, 15, 21] and memory test, debug, validation, IO link training [16, 17, 18], and post package repair. The architecture is suitable for such technologies as embedded memory, DDR4, and WIO on systems-on-chip (SOCs) with 3D stacking and TSV topology. To speed up fault isolation and improve manufacturing yield, the CPGC BIST engine also includes a software repair technology for 3D stacked memory cell defects.

In section 4.2, the paper presents the memory and IO failure modes. In section 4.3, the paper describes the CPGC architecture. In section 4.4, the paper discusses how CPGC uses a standardized interface to enable intellectual property (IP) reuse, which expedites rapid prototyping and a quicker SOC product development cycle. In section 4.5, the paper presents the software architecture. In section 4.6, the paper presents simulation results based on the CPGC architecture. In section 4.7, silicon results demonstrate that CPGC significantly reduces debug and validation time. In section 4.8, the paper summarizes all the results presented in this paper.

4.2 Failure Modes

This paper will briefly cover various failure modes that occur within the memory IO and cell due to cross coupling of electrical signals or charges.

There is significant literature available regarding the different types of fault modes for DRAMs and the associated algorithms to detect them. Venkatesh [2] has demonstrated a fault model (to model all plausible physical faults in memory) to test the BIST algorithms. The most prevalent memory BIST algorithms are march and modified march [3, 4, 5]. In [6] the authors proposed a non-march (O(n^2) complexity) algorithm for memory test that employs galloping pattern tests and captures some of the failures escaping conventional march based tests. While the complexity is higher compared to march based tests, this technique is helpful for embedded memory testing. Bernardi [7] introduces a hardware/software co-test for embedded memory in SOCs. The mentioned co-test is an online test; i.e., it can be performed when the device under test (DUT) is operational. Bernardi [8] also demonstrated a programmable MBIST suitable for stacked DRAM as well as standalone DRAM. It supports both march and neighborhood pattern sensitive fault (NPSF) tests. March-based MBISTs alone are not sufficient for the current state of embedded memory on SOCs.

As described in [2, 9], the following are the common memory faults that can be detected by this CPGC IP:

• Stuck-at faults
• Address decoder faults
• Transition faults
• Inversion coupling faults
• Idempotent coupling faults
• State coupling faults
• Retention faults
• Neighborhood pattern sensitive faults

In addition, common IO faults that can be detected by this CPGC IP are:

• ISI (time domain interference)
• Crosstalk (neighbor victim/aggressor interaction)
• Clock domain crossing issues
• Power supply issues
• Logic circuit IP integration issues

Figure 27: CPGC architecture overview

4.3 CPGC Architecture

Figure 27 illustrates the block diagram of the CPGC architecture, where the CPGC IP is embedded in the memory controller. CPGC supports both pre-scheduler (CPGC_T) and post-scheduler (CPGC_C) patterns. Functional traffic from the SOC core that originates upstream is multiplexed with CPGC to form the command (CMD) and data of the DDR protocol. It is transmitted through the memory scheduler and data buffer (Data Buff) to the DDRIO. CPGC control registers are programmed through a standard low bandwidth interface (side band), and an out of band protocol interface exists between CPGC and DDRIO. The CPGC IP comprises several PRBS (pseudo random binary sequence) generators such as the CPGC pattern generator checker (CPGC_PGC), the CPGC command generator (CPGC2_CMD), and the CPGC command-address data buffer (CPGC_CADB). These pattern generators can generate fixed, low frequency (LMN), and random patterns. The patterns can drive the command, address, and data IO of the memory interface.

Figure 28: CPGC SW assisted repair technology flow

4.3.1 Architecture for Software Assisted Repair Technology

The CPGC reusable BIST engine introduces a software assisted repair technology. The flow, shown in Figure 28, has been adopted by JEDEC as a part of the WIO industry standard. Due to heat from solder reflow during TSV attachment and 3D stacking, additional memory cell defects can be introduced during the packaging process. Through JEDEC and WIO industry standards, agreement has been reached with the DRAM manufacturers to provide spare rows of additional memory cells for post package repair (PPR). The CPGC architecture for software assisted repair technology conducts this PPR, if necessary, at the end of memory cell defect discovery. After CPGC detects memory failures, it supports both manual and automatic modes, depending on the software configuration, to repair the defective cell using the spare cells that the DRAM manufacturer provided. In the manual mode, the detection phase is executed separately from the repair phase, which is fitting for test and debug to ensure accurate detection of failures. In the automatic mode, the BIST engine automatically repairs the detected failures with software assistance. The repair is stored in the NVRAM of the DRAM device; hence, repairs are permanent and persist through subsequent reboots and power cycles. Combining the two phases can decrease high volume manufacturing (HVM) test time significantly, from many minutes down to seconds.

The detection and repair phases can also be run via BIOS or at boot via firmware. This enables PPR of a highly integrated SOC with a WIO memory subsystem in the field. In-field repair is especially valuable to the smart phone and tablet market: since memory is soldered down, any defective memory could cause the whole phone or tablet to be discarded. In-field repair can improve yield and reduce debug and validation time, thus accelerating the product development and launch cycle.

Figure 29: CPGC unified Pattern sequencer and buffer

4.3.2 Unified Pattern Sequencer

Each of the pattern generators supports different bus widths, but is built upon the unified pattern sequencer (UPS). As illustrated in Figure 29, this consists of a PRBS generator with programmable lengths of 16, 23, or 32 bits, a low frequency LMN counter that can generate a clock waveform as low as 10 kHz, and a 32-bit fixed pattern buffer.

4.3.3 LMN

L, M, and N denote three 32-bit counters. The L counter counts the number of clocks to drive the initial value, the M counter counts the number of clocks to drive the high value, and the N counter counts the number of clocks to drive the low value. Combined, the LMN generator can produce a very low frequency (on the order of 10 kHz) clock waveform of any duty cycle to excite low frequency resonances of the system.
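As a concrete illustration of the counter semantics above, the following Python sketch emits one LMN output bit per clock. The function name, the exact ordering of the L/M/N phases, and the example clock rate are assumptions for illustration only, not the CPGC register definition.

def lmn_waveform(l_count, m_count, n_count, initial_value, num_clocks):
    """Emit one output bit per clock for a simple LMN-style generator.

    Assumed behavior: drive `initial_value` for L clocks, then repeat
    (high for M clocks, low for N clocks), giving a low frequency clock
    whose duty cycle is M / (M + N).
    """
    assert m_count + n_count > 0, "M and N cannot both be zero"
    out = []
    clk = 0
    while clk < l_count and clk < num_clocks:   # L phase: hold the initial value
        out.append(initial_value)
        clk += 1
    while clk < num_clocks:
        for _ in range(m_count):                # M phase: drive high
            if clk >= num_clocks:
                break
            out.append(1)
            clk += 1
        for _ in range(n_count):                # N phase: drive low
            if clk >= num_clocks:
                break
            out.append(0)
            clk += 1
    return out

# Example: M = N = 80000 clocks yields a square wave at 1/160000 of the
# bit clock rate, roughly 10 kHz at an assumed 1.6 GHz clock.
wave = lmn_waveform(l_count=16, m_count=80000, n_count=80000,
                    initial_value=0, num_clocks=200000)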

4.3.4 Fixed Pattern Buffer

Advanced software algorithms frequently derive a large set of user-defined patterns to try. These can be programmed through the 32-bit fixed pattern buffer. The pattern is simply repeated until changed by the software.

4.3.5 LFSR

The PRBS generator includes multiple LFSR tap lengths (16, 23, and 32 bits), each producing a maximal-length pseudo random pattern.
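The sketch below generates such a maximal-length stream with a Fibonacci-style LFSR. The 16-bit polynomial and seed used here are one commonly published maximal-length choice, shown purely for illustration; they are not the tap sets implemented inside CPGC.

def fibonacci_lfsr(width, taps, seed, num_bits):
    """Generate a pseudo random bit stream from a Fibonacci-style LFSR.

    `taps` are 1-based bit positions XORed together to form the feedback
    bit that is shifted into the most significant position each clock.
    """
    state = seed & ((1 << width) - 1)
    assert state != 0, "an all-zero seed locks up the LFSR"
    out = []
    for _ in range(num_bits):
        out.append(state & 1)                       # emit the low bit
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = (state >> 1) | (feedback << (width - 1))
    return out

# x^16 + x^14 + x^13 + x^11 + 1 is a maximal-length polynomial (period 2^16 - 1).
prbs = fibonacci_lfsr(width=16, taps=(16, 14, 13, 11), seed=0xACE1, num_bits=64)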


4.3.6 Flop Based Data Pattern Buffer

Three instances of the uni-sequencer (UPS) are multiplexed together to construct the downstream pattern. These 8:1 per-lane MUXes are controlled by an eight-bit-wide by 64-lane pattern buffer in a small gate count, flop based design, as illustrated in Figure 29.

4.3.7 Register File Based Pattern

Depending on the IP instantiation and die area constraints, the CPGC IP can support a register file (RF) based design, achieving a pattern depth of 64 cache lines, with each cache line being eight bits deep in time. The RF design has one write port and two read ports to minimize gate count and area.

4.3.8 Data Pattern Dynamic Inversion/DC

CPGC supports DC patterns such as all zeros/ones, as well as dynamic inversion of the data dependent on the address. The dynamic inversion feature enables rapid checkerboard testing of memory cell integrity.
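A minimal sketch of how address-based inversion yields a checkerboard: the decision to invert on the parity of the low row and column address bits is an illustrative assumption, since the real decode is configurable in the CPGC registers.

def checkerboard_data(base_pattern, row, col, width_bits=32):
    """Return the data to write at (row, col) for a checkerboard stress.

    Assumed decode: invert the background pattern whenever the XOR of the
    lowest row and column address bits is 1, so neighboring cells alternate.
    """
    mask = (1 << width_bits) - 1
    invert = ((row ^ col) & 1) == 1
    return (~base_pattern & mask) if invert else (base_pattern & mask)

# An all-zeros background becomes 0x00000000 / 0xFFFFFFFF alternating by cell.
grid = [[hex(checkerboard_data(0x00000000, r, c)) for c in range(4)] for r in range(4)]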


4.3.9 Post-Scheduler Command and Address Pattern Generation

This supports both in-protocol and out-of-protocol patterns, enabling early investigation of future memory protocols or the isolation of a specific debug failure by stressing rare or difficult sequences.

4.3.10 Error Status, Logging and Masking

CPGC supports per lane and per bit time error counters. Similarly CPGC offers mask capability at per lane and per bit time level. CPGC error and data logging enables precision debug of memory and IO failures.

4.3.11 Command and Address Generator

Through CPGC2_CMD, the command and address generator has the ability to detect the latest faults associated with current DRAM processes and topologies, specifically for WIO, which uses 3D stacking and through-silicon-vias (TSV).

Figure 30: CPGC Row Hammer Stress pattern

Row hammer problems [19, 20] occur when excessive activate commands targeted at a specific row (the aggressor) introduce crosstalk that can cause a bit-flip error in an adjacent row (the victim) (Figure 30). Repeated aggressor row charging and discharging can result in one or more failures, as indicated by the red cell in the figure. Not only can CPGC hammer the +1 and -1 rows to detect traditional row hammer failures, it can also hammer and detect problems caused by the +2 and +3 rows. These CPGC features are specifically suited for WIO TSV testing of 3D stacked memory.
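The sketch below illustrates the aggressor/victim sequencing just described; the activation count and the idea of issuing reads on the victim rows so the checker can flag bit flips are assumptions for illustration, not the actual CPGC command programming.

def row_hammer_sequence(aggressor_row, victim_offsets=(-1, +1, +2, +3),
                        hammer_count=100000):
    """Yield (command, row) tuples for a simple row-hammer style stress.

    The aggressor row is activated and precharged repeatedly; the victim
    rows at the given offsets are then read back so errors can be logged.
    """
    for _ in range(hammer_count):
        yield ("ACT", aggressor_row)              # hammer the aggressor
        yield ("PRE", aggressor_row)
    for offset in victim_offsets:
        yield ("READ", aggressor_row + offset)    # check the -1/+1/+2/+3 victims

# Example: hammer row 1000 a few times and then read its neighbors.
commands = list(row_hammer_sequence(1000, hammer_count=3))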


4.3.12 Variable Retention Time Stress

Contingent on the DRAM die layout, different memory cells can experience variable retention time failures before refresh. The JEDEC spec calls for a 64 ms refresh interval. CPGC can test at intervals of 64 ms, 128 ms, and so on. In each interval, CPGC can repeat write and read scan patterns anywhere from one to seven times, with a variable pause of 64 ms or 128 ms in between. CPGC accomplishes this by adjusting the memory controller timing, which can be used to simulate longer refresh intervals and emulate the accelerated decay seen at higher temperatures.
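A sketch of a retention-time sweep built from the behavior described above. The run_scan callback, the interval list, and the repeat count are hypothetical placeholders for the write/read scan and timing adjustments the BIST engine actually performs.

import time

def retention_sweep(run_scan, intervals_ms=(64, 128, 256), repeats=3):
    """Stress cell retention by pausing between write and read scans.

    `run_scan("write")` fills memory with the test pattern; `run_scan("read")`
    reads it back and returns the number of miscompares.
    """
    results = {}
    for interval in intervals_ms:
        errors = 0
        for _ in range(repeats):
            run_scan("write")
            time.sleep(interval / 1000.0)   # emulate a longer refresh interval
            errors += run_scan("read")
        results[interval] = errors
    return results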

4.3.13 Gate Count

The overall gate count of the CPGC IP, expressed as area in Table 4, totals 44K square micrometers. Compared to the Intel Atom S1200 die area of 9.86 mm x 10 mm [22, 23], the BIST engine occupies only 0.045% of the total area.

Table 4: CPGC gate count

IP block        Gate Count (um^2)
Dpatunit        13621
Cmdunit         1847
Configunit      3751
Errlogunit      100
Errunit         874
Unisequencer    169
Msgifunit       1714
Cpgct           21953
Total           44028

Table 5: CPGC power estimates

IP block        Active Power (mW)
dpatunit        1.796
cmdunit         0.244
configunit      0.495
errlogunit      0.013
errunit         0.115
unisequencer    0.022
msgifunit       0.226
cpgct           2.894
Total           5.805

4.3.14 Power

Assuming a 13W thermal design power (TDP) [22, 23], the CPGC IP has a total active power (as illustrated in Table 5) of 5.8mW. When the IP is not in use, it can be power gated off to draw almost no power except leakage.


4.3.15 Protocol

CPGC IP supports a simple request/acknowledge protocol for all write and read transactions. It can interface with memory controller and IO physical layers using standardized interface protocols.

4.4 CPGC Usage and IP Reuse Model

The key applications of this CPGC reusable BIST IP are in the following four areas: validation, debug, memory initialization/training using the BIOS, and test/HVM. For validation, CPGC can be used in system margin validation (SMV), bench device validation (BDV), and signal integrity validation (SIV), as shown in Table 6.

Table 6: CPGC architecture usage for validation


For SMV, a victim and aggressor (VA) pattern can be used on the data pins (DQ) to measure the worst case margin. Similarly, the command address data buffer (CADB) provides the pattern for margin measurements on command and address pins. Due to the bidirectional nature of the DDR bus, whenever the command switches from read to write or from write to read, the turnaround time margin can be validated using CPGC patterns.

CPGC can purposely generate high or low bus utilization traffic. Even when the OS or the core is not powered up or functional, CPGC IP can be used as a very basic debug tool to test the DDR physical layer. Lastly, CPGC can quickly get margins of all DQ pins in a few seconds and supports full automation, which speeds up validation time significantly compared to OS based software test patterns.

Bench device validation is accomplished on a test bench with the silicon device isolated. Similar VA patterns can induce noise on the DQ pins. CPGC offers precise control of patterns on the DQ pins. Since BDV boards typically do not support OS boot, CPGC IP is the perfect test and debug tool for BDV.

Signal integrity validation (SIV) is performed on either prototype or production boards with the focus on neighbor crosstalk and inter symbol interference (ISI). Yet again, the VA pattern and the CADB pattern can provide a precise and stressful condition for SIV. SIV also requires explicit control of the pattern bits in time and space to create simultaneous switching of bits to excite even and odd mode resonance on the IO bus. CPGC IP also offers explicit trigger signals for scope capture and post processing of the waveforms.


Table 7: CPGC architecture usage for debug

CPGC can be used in a variety of debug scenarios, including power-on debug, margin debug, signal debug, and circuit debug as shown in Table 7. At power-on, CPGC offers explicit pattern control and stress testing capabilities to prove the bus is stable. This helps to isolate any issues away from the memory subsystem. CPGC IP requires very little enabling at power-on to become a useful tool.

To debug margin failure or low margin, CPGC offers precise turnaround control. CPGC can easily target any rank and channel to isolate the low margin rank/channel. CPGC can get margins of all byte lanes and individual lanes, while OS tools tend to hang at the first failure. To debug signal integrity, CPGC's explicit pattern control helps to generate ISI and crosstalk. The subseq_wait feature helps to control the gap between transactions.


In circuit debug, CPGC's control of read and write helps to isolate the transmitter and receiver of the memory physical layer independently. CPGC's precise control on rank targeting and turnarounds helps to catch any tri-state buffer issues. LFSR addressing provides semi-repeatable but randomized rank targeting, which helps to debug address transmitter circuits. As an independent data source from core functional traffic, CPGC helps to isolate circuit issues separate from memory controller or core issues.

In BIOS, CPGC IP supports memory training algorithms, helps to initialize all memory content at boot, and helps to complete Memtest, which is a quick test of all available memory. These usages are shown in Table 8. For training, the same VA pattern can be used to margin and center the strobe and reference voltage setting. Basic training can be done without using the functional path, simplifying the implementation. CPGC can generate back to back transactions, and provide CADB patterns for command training.


Table 8: CPGC architecture usage for memory initialization/training

CPGC is also used in board high volume manufacturing test (HVM). In HVM, minimizing test time is the key to reducing product cost. CPGC provides an automated way to screen and margin the memory subsystem. As the memory capacity increases, the number of memory cell defects gains even greater significance. CPGC IP is fully equipped to test the memory cell through industry standard algorithms such as march C, march G, march SL, butterfly, galloping, and MATS+. As a programmable reusable BIST engine, CPGC IP also supports many complex memory test algorithms through the use of a sophisticated memory test language (MEMTEL).
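To make the march terminology concrete, the sketch below implements March C-, whose element sequence is widely published and is closely related to the March C algorithm named above; the flat `mem` list is a stand-in for the address and data instructions CPGC would actually be programmed with.

def march_c_minus(mem, size):
    """Run March C- over `mem` (a list of bits) and return failing addresses.

    Element sequence: up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); up(r0).
    """
    fails = set()

    def check(addr, expected):
        if mem[addr] != expected:
            fails.add(addr)

    up = range(size)
    down = range(size - 1, -1, -1)

    for a in up:                    # up (w0)
        mem[a] = 0
    for a in up:                    # up (r0, w1)
        check(a, 0); mem[a] = 1
    for a in up:                    # up (r1, w0)
        check(a, 1); mem[a] = 0
    for a in down:                  # down (r0, w1)
        check(a, 0); mem[a] = 1
    for a in down:                  # down (r1, w0)
        check(a, 1); mem[a] = 0
    for a in up:                    # up (r0)
        check(a, 0)
    return fails

# A fault-free memory model reports no failing addresses.
print(march_c_minus([0] * 16, 16))   # -> set()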

CPGC's software assisted repair technology for post package defects led to the currently defined JEDEC protocol standard, in which the DRAM manufacturers provide a finite number of reserved spare rows for memory cell failures introduced during the packaging and assembly of 3D stacked TSV packages. These failures can be introduced during the solder reflow and heating process and cannot be predicted prior to packaging and assembly. The PPR flow can automatically repair failures detected at HVM in seconds. The CPGC offers huge test time savings compared to the operator intervention method, which takes multiple minutes to identify and repair failing bits.


4.5 Software Architecture

This integrated CPGC technology has built-in customizable flexibility, enabling the testing and diagnosis of the entire memory sub-system. In case of an impending failure, it is possible to isolate the memory field-replaceable unit (FRU) for further testing through the CPGC IP. The BIOS based UEFI architecture supports a middle-ware layer that can incorporate the CPGC API(s) at boot time as well as at run time. At boot time, the CPGC-based test executes at the initialization of the memory reference code (MRC). At run time, the code executes in the system management interrupt (SMI), which is a virtualized space in the system. The API libraries are incorporated in both boot-time and run-time environments, while boot-time CPGC code is executed on an as-needed basis.

Execution of run-time code can cause system perturbations that can degrade the performance of the workload running on the system. In order to execute the CPGC test through the SMI-based middle-ware, the system has to be prepared by satisfying these two conditions.

1. Reason to execute - A reason for executing the test is established by continually analyzing performance to predict an upcoming failure. Normally, this failure is detected through statistical modeling. This modeling scheme includes evaluating the memory soft-error patterns to predict a hard failure in the future. Whenever a non-spurious pattern of soft failures is detected, it is compared against a predetermined threshold to forecast hard failures. When this threshold is exceeded, the CPGC test is marked to be initiated either while the system is still functional or at the next boot.


2. System Quiescent - Once the CPGC test is marked (for a given memory sub-system), system resources are quieted by migrating the existing workload. One such OS assisted mechanism is called memory hot removal that can remove the memory (marked for failure) dynamically. Once this memory is isolated, we can perform certain CPGC tests in a contained manner. Once the test is completed, memory can then be hot-added using OS assisted methods supported by ACPI specifications.

A firmware level approach to hosting CPGC tests and API(s) requires building a memory-map infrastructure for identifying memory addresses. The memory initialization code is modified to incorporate CPGC test code. The system management interrupt (SMI) provides the necessary application programming interfaces (API) to assist in constructing and configuring the CPGC assisted memory tests. Figure 31 illustrates the UEFI boot service layer that hosts the CPGC test modules. Similar modules are replicated into the SMI for run-time execution.


Figure 31: UEFI compliant BIOS that hosts the CPGC tests

UEFI is a standards-based modern software architecture (originated by Intel) that offers a rich, extensible pre-operating system (OS) environment with advanced boot and runtime services. It has intrinsic networking capability and is designed from the ground up to work with multi-processor (MP) systems. The UEFI standard is architected for dynamic modularity. This has opened up opportunities for collaboration, reducing development cost and time to market by easily integrating plug-in modules from different partners and vendors.


4.5.1 Software Assisted Dynamic Margining

In order to reduce the system guard-band, a CPGC assisted technique can identify the margins required to operate the DRAM rank with optimal performance per watt. The goal in this case is to identify the excess margin and convert it into a voltage reduction. This checking is done in UEFI assisted agents in the SMI handler. These handlers are invoked opportunistically to test the availability of sufficient margins.

4.5.2 Software Assisted CPGC Test Invocation

As mentioned earlier in this section, ad-hoc invocation of a CPGC based test can interfere with the performance requirements of the workload. Therefore we use statistical methods to evaluate the probability of failure based on memory related system events (single-bit errors, etc.) to trigger this mechanism. These statistical methods seek to synthesize the temporal patterns of these failures using Markov chains. A system in this case can be modeled as a discrete Markov process to evaluate lifetime reliability and is described at any time as being in one of a set of states (S = normal, repair, fail, etc.). By employing Markov modeling we can represent the failure detection, repair, and maintenance process graphically. This provides a tool to evaluate system reliability measures and possible corrective measures to make the system more reliable for a given cost. Details of this methodology are beyond the scope of this paper.
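A toy sketch of the kind of model described above. The states, transition probabilities, and soft-error threshold are invented for illustration and are not the production policy; they only show how a discrete Markov walk and a threshold trigger could fit together.

import random

# Hypothetical states and per-step transition probabilities.
TRANSITIONS = {
    "normal":   {"normal": 0.98, "degraded": 0.02},
    "degraded": {"degraded": 0.90, "repair": 0.07, "fail": 0.03},
    "repair":   {"normal": 0.95, "fail": 0.05},
    "fail":     {"fail": 1.0},
}

SOFT_ERROR_THRESHOLD = 16   # invented threshold of correctable errors

def should_schedule_cpgc_test(soft_error_count):
    """Mark the CPGC test once the soft-error count crosses the threshold."""
    return soft_error_count >= SOFT_ERROR_THRESHOLD

def walk(steps, seed=0):
    """Walk the Markov chain to see which state the model ends in."""
    rng = random.Random(seed)
    state = "normal"
    for _ in range(steps):
        state = rng.choices(list(TRANSITIONS[state]),
                            weights=list(TRANSITIONS[state].values()))[0]
    return state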


4.6 Simulation Results

Figure 32 illustrates the CPGC IP interfacing with the memory controller via a standard interface using a request and grant-credit mechanism. As the initial three credits available for writes are reduced to zero, WrCreditValidDC returns a credit on each high pause, eventually restoring the credits to one, two, and three.

Figure 32: Read/Write credits for standard IP interface

Figure 33: Base offset memory stress algorithm

Figure 34: Checker board memory data stress algorithm

Figure 35: PRBS stress patterns

In order to find row-hammer problems, the CPGC IP generates a base address on the aggressor row, and generates an offset address on the victim row. Repeating base writes may result in row-hammer failures. As illustrated in Figure 33, RequestCD_Address_R stays on base row zero for a number of clocks and jumps to offset row one for a brief one or two clocks.

Checkerboard memory stress, as illustrated in Figure 34, writes all ones followed by all zeros to the DQ pins, which can be used to stress power supply on the memory IO as well as for memory cell retention.

The LFSR is an important mechanism for generating a pseudo random bit stream (PRBS). As illustrated in Figure 35, the 32-bit seed is fixed. As soon as enable is high, Cpat_out, which produces eight bits of output every clock, starts to output the random bit stream.

An aggressor/victim pattern is illustrated in Figure 36. Notice that while one lane acts as the victim, the remaining lanes act as aggressors, attacking the victim using a different pattern (LFSR1). Switching neighbors purposely introduces crosstalk on the victim bit, which will cause stress that leads to potential electrical failure.

Figure 36: Victim and Aggressor Pattern
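A minimal sketch of the victim/aggressor rotation shown in Figure 36: the victim lane carries one PRBS stream while every other lane carries a second stream, and the victim position rotates each pass. The lane count, pass length, and the use of Python's random module as a stand-in for the two hardware LFSRs are assumptions for illustration.

import random

def victim_aggressor_frames(num_lanes, bits_per_pass, victim_stream, aggressor_stream):
    """Yield (victim_lane, per-lane bit lists) for each rotation pass.

    The victim lane drives one PRBS stream; all other lanes drive a
    different stream, maximizing switching activity around the victim.
    """
    for victim in range(num_lanes):
        lanes = [aggressor_stream[:bits_per_pass] for _ in range(num_lanes)]
        lanes[victim] = victim_stream[:bits_per_pass]
        yield victim, lanes

# Stand-in PRBS streams; the hardware would use two different LFSRs.
rng0, rng1 = random.Random(1), random.Random(2)
stream0 = [rng0.randint(0, 1) for _ in range(256)]
stream1 = [rng1.randint(0, 1) for _ in range(256)]

for victim, frame in victim_aggressor_frames(8, 32, stream0, stream1):
    pass   # each pass would be written into the per-lane pattern buffers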

4.7 CPGC Results and Silicon Measurements

This section presents the CPGC results and silicon measurements.

CPGC was used in silicon validation to introduce the VA pattern rotation shown in the simulation in Figure 36. Figure 37 demonstrates the DDR margin without the VA pattern, and Figure 38 shows the DDR margin with the aggressors turned on. As illustrated by Figure 37 compared to Figure 38, the CPGC controlled VA pattern significantly reduced the measured margin, making it a great tool for stressing the DDR IO link. The added VA rotation using CPGC reduced pattern reprogramming between tests and eliminated system reboots between tests. This VA pattern rotation feature of the CPGC alone reduced the validation time by 3x.

Figure 37: silicon measurement of DDR IO eye diagram


Figure 38: eye diagram with VA stress pattern turned on

CPGC is capable of debugging and isolating IO lanes bit by bit to see their contribution to crosstalk and the resulting margin. Shown below in Figure 39 is the BIOS measured eye height for all 64 data lanes and eight ECC lanes with all byte lanes enabled.

Figure 39: CPGC BIOS measured eye height with all lanes enabled


Shown below in Figure 40 are the same eye height measurements with data lane five and ECC lane five disabled. Compared to Figure 39, there are three ticks of margin improvement, showing that bit lane five alone contributed three ticks of margin reduction.

Figure 40: CPGC BIOS measured eye height with bit 5/ECC 5 disabled


These types of bit by bit isolation benefit root cause debug of board issues. Shown in Figure 41 are both a working board with low crosstalk (top), where the victim bit in white is shielded from the seven aggressors in blue, and a non-working board, where the shielding is missing between the victim in white and the aggressors in blue.

On an Intel SOC (early stepping), CPGC was used to debug and isolate a memory controller speed-path. CPGC was able to isolate the bug because it provided an alternative path to the DDR physical layer. The debugger noticed that the functional memory subsystem did not work at certain frequencies, but when CPGC was used to bypass the functional memory controller, the DDR physical layer worked at the specified frequency. The debugger was therefore able to quickly focus attention on the memory controller and find the speed-path.

Figure 41: CPGC identified missing shielding on board layout

4.7.1 Debugging Boot Failure Using CPGC LFSR Addressing

CPGC proved to be extremely useful in debugging memory initialization and boot failures. At first, the boot failure was not deterministic, was hard to reproduce, and would have taken a lot of debug time. Illustrated near the top of Figure 42, typical memory traffic writes to Rank 0 followed by Rank 1 when not using CPGC.

Illustrated near the bottom of Figure 42, by varying burst size using CPGC and LFSR addressing, the targeted rank ends up changing constantly (purple waveform), which led to a quick failure reproduction. This helped to reduce boot failure debug time down to a few minutes.

Figure 42: CPGC architecture IO training boot failure debug

Illustrated in Figure 43, the DQS_P and DQS_N pins crossed during tri-state. The receive enable (RCVEN) inside the SOC Rx was asserted immediately for the next read command. The crossing caused the DQS clock inside the SOC Rx to glitch after RCVEN, which corrupts the Rx FIFO. Once again, CPGC provided precise address and data pattern control, which enabled the debugger to fine tune the pattern and dial into the glitch quickly.

This again saved several days of debug labor compared to debugging without the CPGC reusable BIST engine.

Figure 43: IO DQS glitch corrupting Rx FIFO debug

In DDR initialization and training, CPGC's CADB pattern and deselect feature provided an ISI and crosstalk rich pattern during the non-functional select de-asserted state. This helped the BIOS and firmware to quickly center the command and address pins and ensure these pins were protected against functional ISI noise.


Figure 44 illustrates the measured eye from BIOS based memory training and eye centering using CPGC based LFSR pattern. LFSR provides a quick pseudo random pattern. The pattern does not have to be the worst case stress, but how quickly the pattern can be applied is important from a BIOS boot time point of view. CPGC delivers this speedy pattern, which is ideal for memory training. The vertical axis represents the voltage measurements, and the horizontal axis represents the timing measurements. Both voltage and timing offset can be applied to measure the boundary of failure. A contour of the eye is mapped out by the BIOS before the voltage and timing is centered in the middle of this mapped eye.


Figure 44: DDR IO initialization and training BIOS mapped eye

CPGC can also be used in conjunction with circuit tuning controls such as reference voltage and phase interpolators to intentionally offset the data strobe to measure I/O jitter and DDR skew.


4.8 Summary

The CPGC architecture offers significant time savings for validation, test, and debug for DDR, TSV, and 3DS WIO, as shown in Table 9. The victim and aggressor pattern rotation automation of the CPGC reusable BIST engine alone decreased validation time by 3x. The CPGC software assisted repair technology reduced test time from minutes down to seconds. The CPGC reusable BIST engine enabled debug of various memory and IO failures, including a memory IO that would not boot. It reproduced non-deterministic boot failures in minutes and consistently reproduced bugs that were hard to find, which enabled a debug time savings of 5x.

Table 9: CPGC architecture savings to validation, test, and debug time

              Savings               Comments
Validation    3X                    VA pattern rotation automates victim lane selection
Test          Minutes to seconds    Auto-repair reduced loading and unloading of failure signature
Debug         5X                    Reproduced non-deterministic boot failures in minutes

An overview of the CPGC reusable BIST engine architecture on Intel 14nm SOC was presented in this paper. The CPGC BIST engine was designed as a reusable IP that supports standard interfaces to enable quick turnkey SOC solution. CPGC is also used in memory training and initialization, which makes it a critical part of Intel SOC.


4.9 Acknowledgments

The authors would like to acknowledge Thilak K Poriyani House Raju for design, simulation, and verification of the CPGC IP. The authors would like to thank Adam Taylor for his knowledge and data contribution to this paper. The authors would like to acknowledge Anne Meixner for reviews, technical advice, and encouragement.

4.10 References

[1] S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, "Patching processor design errors with programmable hardware," IEEE Micro, vol. 27, pp. 12-25, 2007.

[2] R. Venkatesh, S. Kumar, J. Philip, and S. Shukla, "A fault modeling technique to test memory BIST algorithms," in Proc. IEEE Int. Workshop on Memory Technology, Design and Testing (MTDT), 2002, pp. 109-116.

[3] Y. Jen-Chieh, W. Chi-Feng, C. Kuo-Liang, C. Yung-Fa, H. Chih-Tsun, and W. Cheng-Wen, "Flash memory built-in self-test using March-like algorithms," in Proc. First IEEE Int. Workshop on Electronic Design, Test and Applications, 2002, pp. 137-141.

[4] A. J. Van de Goor, S. Hamdioui, and H. Kukner, "Generic, orthogonal and low-cost March Element based memory BIST," in Proc. IEEE Int. Test Conference (ITC), 2011, pp. 1-10.

[5] S. Sheng-Chih, H. Hung-Ming, C. Yi-Wei, and L. Kuen-Jong, "A high speed BIST architecture for DDR-SDRAM testing," in Proc. IEEE Int. Workshop on Memory Technology, Design, and Testing (MTDT), 2005, pp. 52-57.

[6] G. Mrugalski, A. Pogiel, N. Mukherjee, J. Rajski, J. Tyszer, and P. Urbanek, "Fault diagnosis in memory BIST environment with non-march tests," in Proc. 20th Asian Test Symposium (ATS), 2011, pp. 419-424.

[7] P. Bernardi, L. Ciganda, M. S. Reorda, and S. Hamdioui, "An efficient method for the test of embedded memory cores during the operational phase," in Proc. 22nd Asian Test Symposium (ATS), 2013, pp. 227-232.

[8] P. Bernardi, M. Grosso, M. S. Reorda, and Y. Zhang, "A programmable BIST for DRAM testing and diagnosis," in Proc. IEEE Int. Test Conference (ITC), 2010, pp. 1-10.

[9] Y. Dongkyu, K. Taehyung, and P. Sungju, "A microcode-based memory BIST implementing modified march algorithm," in Proc. 10th Asian Test Symposium (ATS), 2001, pp. 391-395.

[10] B. Schroeder, E. Pinheiro, and W. Weber, "DRAM errors in the wild: a large-scale field study," in Proc. SIGMETRICS/Performance '09, Seattle, WA, USA, June 2009.

[11] D. Blankenbeckler, A. Norman, and M. Shepherd, "Comparison of off chip interconnect validation to field failures," VALID, Barcelona, Spain, 2011.

[12] J. J. Nejedlo, "IBIST (interconnect built-in-self-test) architecture and methodology for PCI Express," in Proc. IEEE Int. Test Conference (ITC), 2003.

[13] J. J. Nejedlo, "TRIBuTE board and platform test methodology," in Proc. IEEE Int. Test Conference (ITC), 2003.

[14] J. J. Nejedlo, "IBIST (interconnect built-in self-test) architecture and methodology for PCI Express: Intel's next-generation test and validation methodology for performance IO," in Proc. IEEE Int. Test Conference (ITC), 2003.

[15] J. Nejedlo and R. Khanna, "Intel IBIST, the full vision realized," in Proc. IEEE Int. Test Conference (ITC), 2009, pp. 1-11.

[16] http://www.memcon.com/pdfs/proceedings2013/track1/Tuning DDR4 for Power and Performance.pdf

[17] http://www.memcon.com/pdfs/proceedings2013/track1/Tuning DDR4 for Power and Performance.pdf

[18] http://www.google.com/patents/WO2014004748A1?cl=en

[19] http://www.youtube.com/watch?v=i3-gQSnBcdo

[20] B. Querbach, "CPGC2, a built-in self test and auto-repair engine for DRAM (WIO, DDR4)," patent pending 14/141,239.

[21] B. Querbach, "Comparison of hardware based and software based stress testing of memory IO interface," IEEE MWSCAS, 2013.

[22] http://en.wikipedia.org/wiki/Transistor_count

[23] http://www.extremetech.com/computing/152979-intel-announces-full-set-of-new-atom-and-xeon-server-processors-to-fend-off-arm

Platform IO and System Memory Test Using L3 Cache Based Test (CBT) and Parallel Execution of CPGC Intel BIST Engine

Bruce Querbach, Peter Yanyang Tan, Van Lovelace, David Blankenbeckler, Rahul Khanna, Sudeep Puligundla, and Patrick Chiang

2015 ITC (submitted)

Anaheim, California, USA

2015 IEEE International Test Conference (ITC)

5 Platform execution time improvement using L3 Cache

Platform IO and System Memory Test Using L3 Cache Based Test (CBT) and Parallel Execution of CPGC Intel BIST Engine

Abstract

As the memory industry pushes to increase memory density, device variation is creating more defects. Furthermore, new form factors (phone, tablet, mobile and client PC) and low cost boards and platforms limit physical access to JTAG or TAP.

Taking advantage of the x86 architecture's high functional bandwidth to memory, the quickest way to access and test memory is through the CPU core, by storing and running parallel CPGC/BIST test content via the CPU L3 cache, with one CPGC BIST engine per memory channel. We propose a cache based testing framework that speeds up test time 60x to 170x compared to JTAG or TAP based testing using the same test content. We will present the cache based test (CBT) architecture and infrastructure (MRC/NEM setup, CPGC/IBIST), test content, results, and a side by side comparison of test time to JTAG or TAP. Finally, we will discuss and compare this approach to generalized cache based tests.

5.1 Introduction

Memory capacity and density are increasing year after year [1], and so are the defects in memory, IO, microprocessors, and SOCs. Sarangi et al. [2] describe the percentages of memory interface design bugs found in AMD64 and Intel's Pentium 4 to be 17.4% and 20.2%, respectively, of total published errata. Therefore there is a critical need for an efficient methodology to validate the memory interface in a microprocessor design.

Even if the defect rates remain the same, larger memory capacity leads to more defects. For example, DRAM errors have been shown to be one of the leading causes of hardware failure in modern compute clusters [3], [4].

Memory test time increases with increased memory capacity; hence, having an effective memory test and repair built-in self-test (BIST) engine is critical. Hardware parallelism can be fully exploited by running the converged pattern generator and checker (CPGC) BIST engine in parallel with as many instances as needed, for example, one BIST engine per memory channel.

However, even with a fully parallel hardware CPGC BIST [5]-[12] engine implementation, test execution time can still be slow if significant test content needs to be loaded through JTAG or Intel test port (ITP) ports. Further, with the introduction of computing platform and system form factors such as phones and tablets, providing physical space for a test access port (TAP) is a challenge.

Some work has been done on testing using cache. For example, Gurumurthy et al. [13] experimented and found that cache-resident self-test (CReST) can be a substitute for functional patterns. Foutris et al. [14] were able to improve both test speed and coverage by running multi-threaded tests on OpenSPARC T1 processor models. Theodorou et al. [15] improved on their low-cost software-based self-test (SBST) to exploit the parallelism of multithreaded multicore architectures in order to speed up their testing of the L1 cache.

However, exploring cache based tests (CBT) on an IA86 platform and system to test the entire DRAM memory subsystem has not been done before. We propose a CBT architecture that not only significantly speeds up test time by parallel testing of the entire memory subsystem, but is also target based and requires no debug port, making it ideally suited for the phone and tablet market.

The CBT architecture is shown in Figure 45. The test content is stored in the BIOS on flash ROM or flash (writeable) devices before power on. As a part of the CPU boot process, the CPU fetches the tests into the level 3 (L3) cache. Tests such as the rank margin tool (RMT) program the CPGC hardware BIST engine and execute the tests in parallel (one BIST per DDR4 channel). The CPGC BIST engine writes to the memory under test (DDR4 DIMMs), reads the results back, and compares for errors. Each BIST is designed to occupy maximal memory bandwidth; thus, issuing test instructions from the CPU cache to flood all available memory bandwidth (BW) is the quickest method to test the memory subsystem, as will be shown in the results of this paper.

Figure 45: Cache based test architecture

In section 5.2, we will introduce the cache based test architecture and infrastructure and how memory reference code (MRC) and CBT are set up using non-eviction mode (NEM).

In section 5.3, we will present more CPGC BIST engine architecture details, which form the foundation of CBT. In section 5.4, we will present the test content and some of the powerful test results from CBT and RMT, for example how CBT with CPGC and an on-die voltage monitor can detect and debug system and platform level crosstalk and power supply noise problems. In sections 5.5 and 5.6, we will present test results and compare the test results and test time side by side to show a significant test time decrease. In section 5.7, we will discuss and compare this approach to more generalized cache based tests in terms of advantages and disadvantages. Section 5.8 provides a summary of the paper.

5.2 CBT Uses Cache-As-RAM Mode

5.2.1 Non-Evict Mode (NEM)

Every IA86 processor during initial power-on must fetch instructions from a predefined reset vector. This is where the BIOS gets loaded from flash ROM or RAM into the CPU cache, and it is the perfect place to store our CBT test content and test results. Intel provides its customers with memory reference code (MRC) in source code and binary form for the primary purpose of initializing the complex DDR3/DDR4 subsystem. RMT uses the CPGC BIST engine to perform DDR margin tests. The CPGC BIST engine is also capable of initializing memory (filling zeros) and of memory cell test and repair. A subset of this (margining to center strobes, filling zeros, and a quick memory scan test) is appropriate as a part of functional memory initialization; hence, all of these are a part of MRC and BIOS. Since boot time is critical to the client and mobile PC user experience (without memory, the user does not see any text or graphics displayed on the screen), the functional initialization became a subset of the full debug and diagnostic capabilities. The user can choose in the BIOS setup screen which subset of tests to run on the next boot. To speed up test content loading and leverage the IA86 core, the test content is compressed (zipped) when stored on the flash, then uncompressed as it is read into the L3 cache.

While anyone can write code that gets loaded into cache, or even BIOS code that runs at boot, most if not all of these efforts would run after memory is initialized and possibly when some unified extensible firmware interface (UEFI) or operating system (OS) support is in place. However, running IA86 code without memory initialization requires NEM mode, which is a subset of cache-as-RAM (CAR) mode. Figure 46 shows the details of how Intel enables NEM at boot time, with one logical processor selected as the boot strap processor (BSP). NEM is a special mode of IA86 memory configuration where the user must configure one base address and length for the data stack, and one base address and length for the code stack. The code stack becomes read only to avoid eviction from cache (thus self-modifying code is not permitted). The data stack, of course, is writeable, but is marked un-cacheable since it already resides in the cache. With all the configuration registers set via the memory type range registers (MTRR), the CPU can be instructed to enter NEM.

The advantage of running in NEM mode for testing memory is test time. Since IA86 is architected to give maximal memory bandwidth from the core, this is the fastest way to run memory tests, especially if the test content calls a hardware BIST engine paired one per memory channel. Another important advantage of NEM mode is that, since none of the test content resides in memory, we are able to test the entire system memory, margin the memory to failure, and reset the memory subsystem, all in a tight loop within our test without a reboot or power cycle. This has proven to be extremely fast when seeking the worst margin corner and weak cells. If we tried to do this under an OS, we would likely end up with a blue screen.

This methodology of cache based testing can be applied generally to other computer interfaces and subsystems.


Figure 46: Non-Evict Mode entry flow diagram

The entry flow assumes flat 32-bit protected mode with one logical processor per socket selected as the boot strap processor (BSP). The steps are:

1. Initialize all fixed and variable range MTRRs to 0.
2. Configure the default memory type to un-cacheable (UC).
3. Determine the base address and length for the data stack.
4. Configure the data stack as write back memory type using an MTRR register.
5. Determine the base address and length for the code region.
6. Configure the code region as write protected memory type using an MTRR register.
7. Enable the MTRRs and the BSP cache.
8. Enable the No-Eviction Mode setup state.
9. Enable the No-Eviction Mode run state.
10. Return to caller.

5.2.2 NEM Exit to Preboot Execution Environment (PXE)

As soon as possible after configuring initial system memory and completing the cache based tests, the BIOS should transition to normal cache mode (see Figure 47). Any stack data required by the BIOS must be copied to the initial system memory, and that initial system memory must be in the un-cacheable (UC) state. Next, the cache must be invalidated by executing the invalidate instruction (INVD) to flush the cache. After this point in time, cache data is no longer valid and the next read will come from the flash storage. The BIOS ensures no data modification occurs until NEM exit is complete. Finally, the non-eviction run state and non-eviction setup state are cleared, and the BIOS can continue with the rest of the power-on self-test (POST) and transition to PXE boot.

Figure 47: Non-Evict Mode exit flow diagram

The exit flow assumes flat 32-bit protected mode, one logical processor per socket selected as the boot strap processor, and system memory available but in the un-cacheable state. The steps are:

1. Copy any data required by the BIOS to system memory.
2. Configure the default memory type to un-cacheable (UC).
3. Disable the MTRRs.
4. Flush the caches by executing the INVD instruction.
5. Disable the No-Eviction Mode run state.
6. Disable the No-Eviction Mode setup state.
7. Initialize and re-enable the MTRRs.
8. Continue POST.

During the NEM exit flow, cache based testing results must be carefully handled so they are not lost when the entire cache is invalidated. The test data can be stored in the data stack during NEM run time, but it must first be copied to the actual system memory, along with the other memory initialization parameters, at NEM exit. This must be done quickly so as not to get in the way of system BIOS boot; hence, this data is not written to a hard drive and no advanced post-processing is done in the NEM exit flow. We are simply trying to preserve the data while the entire memory and cache system is being reconfigured. Storing test results from cache to memory is a quick process, again thanks to the large functional bandwidth from the core. While many of the test results and plots presented in this paper use these critical data collected during cache based testing, the data have to be post-processed to make them presentable.

5.3 CPGC BIST Engine

5.3.1 Architecture Details

An overview of the CPGC architecture was briefly discussed in ITC 2014 [5]. CPGC forms the critical foundation of CBT; hence we now present the CPGC architecture in more detail. To accomplish complex address walking algorithms, the CPGC architecture consists of nested loops of address instruction (AI), data instruction (DI), algorithm instruction (GI), command instruction (CI), and offset instruction (OI), as shown in Figure 48.

Figure 48: CPGC architecture loops

The flow diagram of the CPGC BIST engine is shown in Figure 49. The four areas of address, data, algorithm, and command CPGC instructions are initialized. The offset CPGC instruction is optionally initialized, depending on whether there is an offset. When all the instructions are ready, the complete instruction set executes. When the lowest order instruction completes (e.g., offset), the flow checks for additional offsets before completing the loops for command, algorithm, data, and address instructions.
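The nesting above can be summarized in a few lines of Python. The instruction limits come from sections 5.3.2 through 5.3.6; the dictionary representation of instructions and the execute and check callbacks are hypothetical stand-ins for issuing DDR commands and comparing results.

def run_cpgc_program(address_instrs, data_instrs, algorithm_instrs,
                     command_instrs, execute, check):
    """Walk the CPGC nested loops: address > data > algorithm > command > offset.

    Limits per sections 5.3.2-5.3.6: up to 4 address, 4 data, 8 algorithm,
    and 24 command instructions; a command instruction may reference one
    of two offset structures (modeled here as an "offsets" list in a dict).
    """
    assert len(address_instrs) <= 4 and len(data_instrs) <= 4
    assert len(algorithm_instrs) <= 8 and len(command_instrs) <= 24

    errors = []
    for ai in address_instrs:                  # loop 3: addressing for the whole test
        for di in data_instrs:                 # loop 2: data background / inversion
            for gi in algorithm_instrs:        # loop 1: algorithm controls
                for ci in command_instrs:      # loop 0: read/write/offset commands
                    for offset in ci.get("offsets", [None]):
                        observed = execute(ai, di, gi, ci, offset)
                        errors.extend(check(observed))
    return errors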



Figure 49: CPGC architecture flow diagram

5.3.2 Address Instruction (Loop 3)

The outermost loop of address instructions that direct the addressing for the entire test is shown in Figure 50. This supports up to four instructions. Each address instruction supports the following features: Address direction, address order, enable address decode, and address decode rotate count.


Figure 50: CPGC architecture loop 3, address instruction

5.3.3 Data Instruction (Loop 2)

The data instruction loop shown in Figure 51 supports up to four data instructions with the following features: data select rotation enable, invert rotation enable, invert data, and data background.


Figure 51: CPGC architecture loop 2, data instruction

5.3.4 Data Capabilities Overview

CPGC includes two data uni-sequencers, which can supply a 32-bit linear feedback shift register (LFSR) pattern, a 32-bit user programmable pattern, and a low frequency (LMN) counter. With address based inversion, the data can result in all zeros, all ones, checkerboard, row stripe, and column stripe patterns.

5.3.5 Algorithm Instruction (Loop 1)

The algorithm instruction loop shown in Figure 52 supports up to eight algorithm instructions with the following features: starting command, starting command sequence address direction, reverse address direction, invert data, FastY init, wait-for-event-before-algorithm, wait-to-start-until count value, and reset count value at start.


Figure 52: CPGC architecture loop 1, algorithm instruction


5.3.6 Command Instruction (Loop 0)

The command instruction loop shown in Figure 53 supports up to 24 command instructions with the following features: read or write, invert data, use alternate data, hammer, offset command, address decode enable, and use of Offset1 or Offset2.


Figure 53: CPGC architecture loop 0, command instruction

5.3.7 Offset Command Instruction (Loop 0)

An offset command instruction is a special command instruction that enables a variety of complex offset operations. Figure 54 below shows the two offset structures that any command instruction can point to.



Figure 54: CPGC architecture, offset instruction

5.3.8 Examples of Usage Models for the Offset Command Instruction

The usage models of the offset instruction include address decode offset, butterfly, and galloping algorithms. The address decode algorithm can test the address decoder in the X and Y directions. The butterfly algorithm can test near neighbors (crosshair), all neighbors (crosshair and diagonal), and selective neighbors. The galloping algorithm can test block, column, row, or diagonal memory addresses.

5.3.9 Refresh Control Generator

Refresh control is important to maintain DRAM data integrity. It is also an excellent tool to test and stress the DRAM for cell retention failures. The CPGC BIST engine has a variable refresh signal generator (Figure 55) to control and stress the DRAM refresh rates.



Figure 55: Refresh control generator

5.4 Rank Margin Tool (RMT)

With CBT built on the foundation of CPGC, the higher level (firmware) CBT test content is mostly packaged into a tool called RMT, which margins each DDR rank until failure. RMT can be applied to DDR3, DDR4, LP-DDR3, LP-DDR4, etc. In the simplest form, the margin failure points of the left and right strobe positions tell us where the strobe should be centered. RMT accomplishes this by writing to the DDR IO registers for strobe offset, configuring the CPGC BIST engine to record pass or failure for that specific strobe setting, and then sweeping the strobe setting across the entire range. RMT averages the left fail and right fail points to determine the optimal center strobe position. To speed up this margin search, advanced search algorithms such as binary search can be used to reduce the search complexity from:


O(n) Equation 1

to:

O(log n) Equation 2

For high volume manufacturing tests, RMT can screen pass and fail at two pre-specified points, which is significantly faster than a comprehensive search.
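A minimal sketch of such a binary edge search is shown below. The run_cpgc_at_offset() helper is a hypothetical stand-in for programming the strobe-offset register and returning CPGC pass/fail; the real RMT firmware writes the DDR IO registers directly.

def find_right_edge(run_cpgc_at_offset, lo=0, hi=63):
    """Return the largest passing offset in [lo, hi], assuming offset lo passes."""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if run_cpgc_at_offset(mid):   # True = CPGC reports no errors at this offset
            lo = mid                  # the fail edge is at or beyond mid
        else:
            hi = mid - 1              # the fail edge is below mid
    return lo

# Example with a fake device that starts failing beyond offset 21.
print(find_right_edge(lambda off: off <= 21))   # -> 21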

RMT can measure crosstalk and inter symbol interference (ISI) by running different patterns on the DDR4 bus that specifically excite neighboring bits. A complete crosstalk and ISI test can be time consuming, but it is a useful debug tool when the screens or margin tests above indicate a problematic platform.

Figure 56 shows an IA86 system in which the interactions of the 64 data (DQ) lanes of channel 0 are plotted against themselves (channel 0). Measured margin degradations are normalized to 100% and indicated by color. The strongest interaction shows up on the diagonal, between an individual bit and itself or its immediate neighbor, which is indicative of a parallel single ended bus such as DDR4. There are also approximately four blocks of patterns, indicating that within a group of 16 bits that are routed closely together, crosstalk is significant, but from group to group (of 16 bits) the interaction is limited.


Figure 56: Crosstalk and ISI across 64 DQ bits (RMT)

On-die voltage monitors have been extensively studied [16-20]. When CBT is combined with the Intel on-die voltage monitors, the effect of the CPGC BIST pattern on power supply disturbance can be studied. As the frequency of the CPGC BIST pattern increases, so does the droop noise.

Shown in Figure 57 is the VCCIO bounce as a function of DQ frequency. Up to 10mV of VCCIO bounce can be measured at 700 MHz to 800 MHz DQ frequency. These power supply noise measurements and debug tools give platform/system engineers the ability to validate and fine tune their platform power design.


Figure 57: VCCIO supply bounce as a function of DQ pattern frequency

RMT can be configured to run such extensive debug tests, and the raw results, in terms of margin deltas for 64 lanes, can be stored in a data structure by the MRC. At the end of MRC memory initialization and NEM exit, the data structure is temporarily placed in the newly initialized main memory. As the BIOS moves on to the next stage of the boot process, the pre-boot execution environment (PXE), the BIOS can store the test results in flash. Storing the prior boot's memory initialization results in this way can significantly speed up subsequent reboots: if the prior results are still considered valid on reboot, the BIOS can skip the lengthy memory initialization to give a faster user experience. The test results in flash can be read by special OS based software for post-processing, or accessed remotely via the Intel manageability engine (ME)/BMC, when applicable, to enable remote system diagnostics and debug.

Figure 58 shows another type of result that CBT can collect. In this case, a typical voltage vs. timing eye diagram is collected by sweeping the strobe from the left (-25) offset to the right (+15) offset, and the DDR4 reference voltage from the bottom (-63) offset to the top (+63) offset, to create a complete eye diagram. Once again, the raw margin data can be stored by the BIOS into flash, and the post-processing software can pull the data from flash, plot it, and extrapolate (shown as the red line) the complete data eye.


Figure 58: Measured voltage and timing data eye (CBT)


Shown in Figure 59 is the CBT result for the per-bit timing margin on transmitter data lanes (TxDq) from lane 0 (bit 0) to lane 63 (bit 63) for DQ, and ECC 0 (bit 64) to ECC 7 (bit 71). Each lane is margined to failure, which is shown as "*"; when the lane passes, it is shown as blank (" "). Extra passing points (e.g., from +16 to -16) are truncated, since they all pass, to make the plot easier to fit.

N0: TxDq Per bit margins: TxDq 0------7 8-----15 16----23 24----31 32----39 40----47 48----55 56----63 64----71 30 ******** ******** ******** ******** ******** ******** ******** ******** ******** 29 ******** ******** ******** ******** ******** ******** ******** ******** ******** 28 ******** ******** ******** ******** ******** ******** ******** ******** ******** 27 ******** ******** ******** ******** ******** ******** ******** ******** ******** 26 ******** ******** ******** ******** ******** ******** ******** ******** ******** 25 ******** ******** ******** ******** ******** ******** ******** ******** ******** 24 ******** ******** ******** ******** ******** ******** * ****** ******** ******** 23 ******** ******** * ****** ******** ******** * ****** * * **** ******** ******** 22 *** *** *** **** * ****** ******** ******** * **** * **** ******** **** * 21 * * ** *** **** ** *** **** ** * ** *** * *** * *** * * *** **** * 20 * * ** * *** * * * * ** * ** *** * * * * ** * * *** **** * 19 * * * * * * * * ** * * ** * * * * * * ** 18 * * * ** * ** * * * 17 * * * -16 * * -17 * * -18 * ** ** * * ** ** -19 * ** ** * ** * ** ** * * * * -20 * ** ***** ** * * ** * * ** ** ** * * ** * * -21 * ** ***** ** * ** * ** ** * ** ** ****** * ** ** ** ** * * -22 ** ** ******** ** ** * ** ** ** ** ** ******** ** ** * ** ** * ** ** -23 ** ** * ******** ** ** ** ** ***** ** *** * ******** ***** * ** *** * ** ** ** -24 ***** ** ******** ******** ** ***** ** ***** ******** ******** ** ***** ***** ** -25 ***** ** ******** ******** ******** ******** ******** ******** ******** ***** ** -26 ******** ******** ******** ******** ******** ******** ******** ******** ******** -27 ******** ******** ******** ******** ******** ******** ******** ******** ******** -28 ******** ******** ******** ******** ******** ******** ******** ******** ******** -29 ******** ******** ******** ******** ******** ******** ******** ******** ******** -30 ******** ******** ******** ******** ******** ******** ******** ******** ******** N0.C0.D0.R0: TxDq - Per bit margins

Figure 59: Transmitter data per-bit margin (CBT)

Shown in Figure 60 is another example of CBT results. We collected the per-bit timing margin results on receiver data lanes (RxDqs) from lane 0 (bit 0) to lane 63 (bit 63) for DQ, and ECC 0 (bit 64) to ECC 7 (bit 71). Each lane is margined to failure, which is shown as "*"; when the lane passes, it is shown as blank (" "). Extra passing points (e.g., from +19 to -19) are truncated, since they all pass, to make the plot easier to fit.

Per bit margins: RxDqs 0------7 8-----15 16----23 24----31 32----39 40----47 48----55 56----63 64----71 30 ******** ******** ******** ******** ******** ******** ******** ******** ******** 29 ******** ******** ******** ******** ******** ******** ******** ******** ******** 28 ******** ******** ******** ******** ******** ******** ******** ******** ******** 27 ******** ******** ******** ******** ******** ******** ******** ******** ******** 26 ******** ******** ******** ******** ******** ******** ******** ******** ******** 25 ******** ******** ******** ******** ******** ******** ******** ******** ******** 24 ***** ** ******** ******** ******** ** ***** ** ***** ******** ******** ******** 23 ***** ** ******** ** * ** ******** ** ***** ** ** ** ** ** * ******** ** ** ** 22 ** ** ** ***** ** * ** ** * * ** ** * ** * * ** ** ** * ** ** 21 * ** ** * ** ** ** * * * * * ** * * 20 * * * * * -20 * -21 * * * * * -22 * * * ** * * * * * * *** * *** * * *** * ** -23 * * * ** * * ** * * * * * * ** **** * * *** * * *** * ** -24 * * *** * * ** * **** * * **** * * ****** **** * * * *** * * **** * ** ** -25 * * *** ****** * ******** ******** ******** ******* **** *** ******** **** ** -26 * ****** ******** ******** ******** ******** ******** ******** ******** ******* -27 ******** ******** ******** ******** ******** ******** ******** ******** ******** -28 ******** ******** ******** ******** ******** ******** ******** ******** ******** -29 ******** ******** ******** ******** ******** ******** ******** ******** ******** -30 ******** ******** ******** ******** ******** ******** ******** ******** ******** N0.C0.D0.R0: RxDqs - Per bit margins

Figure 60: Receiver data per-bit margin (CBT)

Shown in Figure 61 is the CBT result for the per-bit voltage margin on transmitter data lanes (TxVref) from lane 0 (bit 0) to lane 63 (bit 63) for DQ, and ECC 0 (bit 64) to ECC 7 (bit 71). Each lane is margined to failure by offsetting the reference or sampling voltage. Failures are shown as "*"; when the lane passes, it is shown as blank (" "). Extra passing points (e.g., from +13 to -14) are truncated, since they all pass, to make the plot easier to fit.

Per bit margins: TxVref 0------7 8-----15 16----23 24----31 32----39 40----47 48----55 56----63 64----71 30 ******** ******** ******** ******** ******** ******** ******** ******** ******** 29 ******** ******** ******** ******** ******** ******** ******** ******** ******** 28 ******** ******** ******** ******** ******** ******** ******** ******** ******** 27 ******** ******** ******** ******** ******** ******** ******** ******** ******** 26 ******** ******** ******** ******** ******** ******** ******** ******** ******** 25 ******** ******** ******** ******** ******** ******** ******** ******** ******** 24 ******** ******** ******** ******** ******** ******** ******** ******** ******** 23 ******* ******** ******** ******** ******** ******** ******** ******** ******* 22 ******* ******** ******** ******** ******** ******** ******** ******** ****** 21 ***** * ******** ******** ******** ******** ******** ******** ******** ***** 20 ** * * ******** ****** * ** *** * ******** ******** ***** ** ******** * * 19 * ******** ***** * ** * * ******** ******** ** * ** ******** * 18 ******** ** * ** * * ******** ***** ** * * *** **** 17 ** * ** ******** *** ** * * ** 16 ** * * ** ** * * * 15 ** * ** * * * 14 * * -15 * * * * * -16 * * *** * ** *** **** * -17 * * * * ** *** *** *** **** * *** **** ** -18 * * ** * * * ** *** *** ******** ** *** **** *** -19 ******** * * **** * ** ******** ******** **** *** ******** -20 * ** * ******** ******** ** ** * ******** ******** ******** ******** *** -21 * ** * * ******** ******** ******** ******** ******** ******** ******** ** ***** -22 **** * * ******** ******** ******** ******** ******** ******** ******** ** ***** -23 ******** ******** ******** ******** ******** ******** ******** ******** ******** -24 ******** ******** ******** ******** ******** ******** ******** ******** ******** -25 ******** ******** ******** ******** ******** ******** ******** ******** ******** -26 ******** ******** ******** ******** ******** ******** ******** ******** ******** -27 ******** ******** ******** ******** ******** ******** ******** ******** ******** -28 ******** ******** ******** ******** ******** ******** ******** ******** ******** -29 ******** ******** ******** ******** ******** ******** ******** ******** ******** -30 ******** ******** ******** ******** ******** ******** ******** ******** ******** N0.C0.D0.R0: TxVref - Per bit margins

Figure 61: Transmitter reference voltage per-bit margin (CBT)

Shown in Figure 62 is the CBT result for the per-bit voltage margin on receiver data lanes (RxVref) from lane 0 (bit 0) to lane 63 (bit 63) for DQ, and ECC 0 (bit 64) to ECC 7 (bit 71). Each lane is margined to failure by offsetting the reference or sampling voltage. Failures are shown as "*"; when the lane passes, it is shown as blank (" "). Extra passing points (e.g., from +28 to -30) are truncated, since they all pass, to make the plot easier to fit.

Per bit margins: RxVref 0------7 8-----15 16----23 24----31 32----39 40----47 48----55 56----63 64----71 47 ******** ******** ******** ******** ******** ******** ******** ******** ******** 46 ******** ******** ******** ******** ******** ******** ******** ******** ******** 45 ******** ******** ******** ******** ******** ******** ******** ******** ******** 44 ******** ******** ******** ******** ******** ******** ******** ******** ******** 43 ******** ******** ******** ******** ******** ******** ******** ******** ******** 42 ******** ******** ******** ******** ******** ******** ******** ******** ******** 41 ******** ******** ******** ******** ******** ******** ******** ******** ******** 40 ******** ******** ******** ******** ******** ******** ******** ******** ******** 39 ******** ******** ******** ******** ******** ******** ******** ******** ** ***** 38 ******* ******** * ****** ******** ******** ******** ******** ******** ** ***** 37 ****** ******** * ****** ****** * ******** ******** ******** ******** * ** * 36 ** *** ******** ***** ** *** * ******** ******** ******** *** **** ** * 35 * ******** **** ** *** ******** ******** *** **** ** **** ** * 34 ***** * **** ** *** ******** **** * * * * ** * ** * 33 ** ** * * * * * ******** * ** * * ** * * 32 * ** * * * ** * * ** * * * * 31 * * ** * * * 30 * * 29 * * -31 * ** -32 ** * * * * * ** * -33 * ** * **** * * **** * -34 **** * * * * **** *** **** *** * * -35 * * ****** * ** *** ******** ******** * * * * *** *** ** -36 * ******** ** * ****** ******** ******** *** **** *** *** *** -37 ** ******** ** * ****** * ******** ******** ******** ******** *** -38 * ***** ******** * **** * ******** ******** ******** ******** ******** ** *** -39 ******** ******** ******** ******** ******** ******** ******** ******** ** ***** -40 ******** ******** ******** ******** ******** ******** ******** ******** ******** -41 ******** ******** ******** ******** ******** ******** ******** ******** ******** -42 ******** ******** ******** ******** ******** ******** ******** ******** ******** -43 ******** ******** ******** ******** ******** ******** ******** ******** ******** -44 ******** ******** ******** ******** ******** ******** ******** ******** ******** -45 ******** ******** ******** ******** ******** ******** ******** ******** ******** -46 ******** ******** ******** ******** ******** ******** ******** ******** ******** -47 ******** ******** ******** ******** ******** ******** ******** ******** ******** N0.C0.D0.R0: RxVref - Per bit margins

Figure 62: Receiver reference voltage per-bit margin (CBT)

5.5 Traditional JTAG and TAP Based Tests

The same margin test can be run via a traditional JTAG or TAP port. For some (mostly server) Intel boards, Intel test access (ITP) ports are both required and populated. The host software running on a host PC is written in Python, because Python is a high level language that supports object oriented programming without requiring compilation into a binary executable. We find in the lab, during debug, that modifying the test code on the fly is quicker and easier using Python.

Basically, the Python code follows the same algorithm and writes to the same registers in the CPU to trigger the CPGC BIST engine.
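A hedged sketch of that host-side structure is shown below. The tap_write() and tap_read() helpers and the register names are hypothetical stand-ins, not the real ITP library API; the point is that every register access crosses the slow JTAG/TAP link, which is why the host-driven flow is so much slower than cache based execution.

_REGISTERS = {"CPGC_ERROR_STATUS": 0}   # fake register file for this sketch only

def tap_write(reg, value):
    _REGISTERS[reg] = value          # stand-in for one slow JTAG/TAP transaction

def tap_read(reg):
    return _REGISTERS.get(reg, 0)    # stand-in for one slow JTAG/TAP transaction

def margin_one_offset(offset):
    # Register names below are assumed for illustration only.
    tap_write("DDR_STROBE_OFFSET", offset)
    tap_write("CPGC_START", 1)
    return tap_read("CPGC_ERROR_STATUS") == 0   # True = pass at this offset

# Sweeping 64 offsets per rank multiplies these slow round trips, which is
# why the host-driven flow takes roughly 991 s versus seconds for cache based RMT.
results = {off: margin_one_offset(off) for off in range(-32, 32)}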

Shown in Figure 63 are the ITP Python results for the per-bit timing margin on transmitter data (DQ) lanes (TxDq) from lane 0 (00) to lane 63 (63) for DQ, and ECC 0 (64) to ECC 7 (71). Each lane is margined to failure, which is shown as "#"; when the lane passes, it is shown as ".". Extra passing points (e.g., from +15 to -16) are truncated to make the plot easier to fit.

>>? margin_control.margin_txdq(channels=[0], ranks=[[0]]*4, loopcounts=[12]*4) 000000000011111111112222222222333333333344444444445555555555666666666677 012345678901234567890123456789012345678901234567890123456789012345678901 +27 ######################################################################## +26 #################.###############################.###################### +25 ###########.#####.##.###############.####.#######.#.################.#.. +24 ###..##.###.#####.##.#.#####.####.##.####.######..#.#####.##.#######.#.. +23 ###..##.###..##...##.#..####.#..#.##.###..######..#..####.#..####.##.#.. +22 #.#..#....#..#....#..#..#.##.#....#..###..#..###..#...... #..#.##.##.... +21 #.#..#....#..#....#.....#.##.#...... ##...#..#.#..#...... #..#.#..... +20 #.#...... #...... #...... ##...#...... #...... #.#..... +19 ..#...... -19 ...... #...... #...... -20 ...#...... ##..#...... #...... #...#...... -21 ...#....##.##..#...#...... ##..##...... ##..#....#.##...... #...... #.... -22 .#.##...##.##.##...##..#...##..##..#....##..#...##.##....#..#....#.#.... -23 .#.##...#####.##...##..###.##.####.##...##.##..######...##.##..###.##..# -24 ##.##..###########.##.####.#######.##...##.##..######..###.##..###.##..# -25 ##.##.############.##.####.#######.##...##.##..#######.##############.## -26 #####.################################.#######.######################.## -27 ######################################################################## hi side timing margin: offset=+19.00 target_ber=0.000000e+00 extrapolated

Figure 63: Transmitter data timing per-bit margin (TAP)


Shown in Figure 64 are the ITP Python results for the per-bit timing margin on receiver data (DQ) lanes (RxDqs) from lane 0 (00) to lane 63 (63) for DQ, and ECC 0 (64) to ECC 7 (71). Each lane is margined to failure, which is shown as "#"; when the lane passes, it is shown as ".". Extra passing points (e.g., from +20 to -18) are truncated to make the plot easier to fit.

>>? margin_control.margin_rxdq(channels=[0], ranks=[[0]]*4, loopcounts=[12]*4) 000000000011111111112222222222333333333344444444445555555555666666666677 012345678901234567890123456789012345678901234567890123456789012345678901 +27 ######################################################################## +26 #####.############################.##################################### +25 ##.##.############################.#######.#######.##.#######.####.##.## +24 ##.##.############.#.#############..###.##..#.####..#..######.####.##### +23 #..##...##.##.#.###...#.##.#####...... ##..#.#.##.##...##.##.##.#.##.## +22 ...... #..#...... #..#.##...... #...##..#...... #..## +21 ...... #...... +20 ...... #...... #...... -19 ...... #...... -20 ...... #...... -21 #...... #...... #..#...... #...... #.. -22 #....#....#.##...... #..#.#..#...##.####..#...... ####....####....##. -23 #....#....#.##.#..#..#..##.#.#..#.##########.#...... ####.#.#####..#.##. -24 #....##.#.#.##.##.####.#######..############.#....#..######.#####.##.##. -25 #.#.###.#########################################.#..###############.##. hi side timing margin: offset=+20.00 target_ber=0.000000e+00 extrapolated

Figure 64: Receiver data timing per-bit margin (TAP)

Shown in Figure 65 are the ITP Python results for the per-bit voltage margin on transmitter data (DQ) lanes (TxVref) from lane 0 (00) to lane 63 (63) for DQ, and ECC 0 (64) to ECC 7 (71). Each lane is margined to failure, which is shown as "#"; when the lane passes, it is shown as ".". Extra passing points (e.g., from +14 to -9) are truncated to make the plot easier to fit.


>>? margin_control.margin_txvref_no_PDA(channels=[0], ranks=[[0]]*4, loopcounts=[12]*4) 000000000011111111112222222222333333333344444444445555555555666666666677 012345678901234567890123456789012345678901234567890123456789012345678901 +27 ######################################################################## +26 #######.################################################################ +25 #####.#.#############################################.################.. +24 ###.#.#.##############.###.#.#.####################.#.###########...#... +23 .#....#.########..###..#.....#..#######..####.##..#.#...#####.##....#... +22 ...... ###....#..###..#...... ####.##...##..##..#.....##....##...... +21 ...... ##...... #.##...#...##...... #....#...... +20 ...... ##...... #...... #...... #...... +19 ...... #...... -16 ...... #...... #...... -17 ...... #...... ###....###...... -18 ...... #...... ###..#.###...... #...... -19 ...... #....##...... ###..#####...#.#.....#..##...##...... -20 ...... #.#..######.....#...... ############.###.....###########...... -21 ...... #.##########....#...##..##################.#.###########...... -22 ...... ############....##..##..################################...... -23 ..#.....############.#.###.###..################################.....##. -24 #.##.#..######################.##################################...#### -25 ####.#.################################################################# -26 ####.#.################################################################# -27 ######################################################################## hi side voltage margin: offset=+19.00 target_ber=0.000000e+00 extrapolated

Figure 65: Transmitter reference voltage per-bit margin (TAP)

Shown in Figure 66 are the ITP Python results for the per-bit voltage margin on receiver data (DQ) lanes (RxVref) from lane 0 (00) to lane 63 (63) for DQ, and ECC 0 (64) to ECC 7 (71). Each lane is margined to failure, which is shown as "#"; when the lane passes, it is shown as ".". Extra passing points (e.g., from +27 to -27) are truncated to make the plot easier to fit.


>>? margin_control.margin_rxvref(channels=[0], ranks=[[0]]*4, loopcounts=[12]*4) 000000000011111111112222222222333333333344444444445555555555666666666677 012345678901234567890123456789012345678901234567890123456789012345678901 +40 ######################################################################## +39 #################.################################################.##### +38 ######..#########.############.###################################.##### +37 .#####..########..#####.######.##################################...##.# +36 ...###..########...###..##.###..################################....##.# +35 ...... #####.#...###..##.###..##########################..####....##.. +34 ...... #####.#...#.#..##.#.#..#########.##.####....##..#..##.#...... +33 ...... #.##.#...... ##.#....###.#####.##.#.#.....#...... #...... +32 ...... ##...... #..#....#.#.####..#....#.....#...... #...... +31 ...... #...... ##.#..#...... #...... +30 ...... #.#...... +29 ...... #...... +28 ...... #...... -30 ...... #...... -31 ...... #...... -32 ...... #.#...... #...... ##.#...... -33 ...... #.#.##.#...... ####.#.#####.#...... -34 ...... #.#.##.#...... #..####.#.#####.#.#...... -35 ...... ######.#..#...... #.#.#..############.#.#.....###.#...#.#.....#.. -36 .....#..########..#..#...#.###.##################.#.#######..###.....#.. -37 .....#..########..####.###.###.####################.#######..###....###. -38 ##.####.#########.#########################################.#####...###. -39 #################.################################################..#### -40 ##################################################################..#### -41 ######################################################################## hi side voltage margin: offset=+28.00 target_ber=0.000000e+00 extrapolated

Figure 66: Receiver reference voltage per-bit margin (TAP)

5.6 Performance (Test Time) Comparison

While the format of the CBT/RMT output differs slightly from that of the TAP based Python test, we were able to collect the same detailed level of debug data. We ran CBT/RMT with the per-bit margin diagram option on and off; the test times were 15901ms and 5679ms, respectively.

The test time using EV Python scripts running on a host was 991130ms, or 991 seconds. Thus the speedup of running on the target cache vs. the host was either 62x or 175x, depending on whether the per-bit margin output was on or off; see Table 10.


Table 10: Performance gain of cache based tests

per-bit margin    JTAG/TAP test time (ms)    cache-based RMT test time (ms)    performance gain
on                991130                     15901                             62X
off               991130                     5679                              175X

5.7 Generalized Cache Based Testing

Most tests can be written with a small code footprint (e.g., for n = 0 to 100, do x = x + n) to fit inside the cache and achieve a speedup comparable to the method presented here. Such generalized cache based testing is encouraged because of the test time benefit. Especially when the tests run on the target under test, as a target based test (TBT), they can easily be used to test other platform interfaces (e.g., USB, PCIe). One advantage of a generalized CBT or TBT for non-memory interfaces, like USB or PCIe, is that it can test the interface to failure without worrying about corrupting the test execution data and code space itself. However, when it comes to memory tests, a generalized CBT running on the target cannot fully test the entire memory range because of OS dependency and non-accessible ranges. If the generalized CBT does not run on the target, access to the target registers via JTAG or TAP slows it down significantly.

Further, writing a CBT that runs under an OS does not guarantee full use of a CPU core: because of the multi-threaded nature of a modern OS, or even the virtualization of the hardware into virtual machines (VM), a generalized CBT can easily be task switched out when competing with other equal or higher priority OS tasks. When such a task switch occurs, significant test time degradation results.

Unlike generalized CBTs, MRC based CBT running in NEM runs in single-threaded mode with full control of all IA86 core and system resources, thus avoiding the task switching problems discussed above. MRC based CBT running in NEM has proven to be the fastest way to test memory subsystems on IA86 systems.

5.8 Conclusion

As the memory industry pushes to increase memory density, device variation is creating more defects. Larger memory capacity is leading to more defects and is also causing an increase in memory test time. Furthermore, new form factors (phone, tablet, mobile and client PC) and low cost boards and platforms limit physical JTAG or TAP access. We presented a CBT architecture that achieved test speed improvements of 60x to 170x compared to JTAG or TAP based tests running the same test content. This CBT is built on the foundation of an effective memory test and repair CPGC BIST engine, and some of the CPGC BIST engine architecture details have been discussed. The architecture discussed here is critical in dealing with the future memory test challenges that the industry faces. We also presented the CBT infrastructure (MRC setup, NEM, and RMT with CPGC/IBIST), the CBT test contents, the CBT test results, and a side by side comparison of test time to JTAG or TAP. Intel x86 CPUs are architected to have high functional memory bandwidth accessed directly through the core; thus, we have shown that the quickest way to access and test platform system memory is through the CPU core, by storing and running parallel CPGC BIST test content via the CPU L3 cache, with one CPGC BIST engine per memory channel. Finally, we discussed and compared this approach to more generalized cache based tests and presented its unique advantages.

In the future, we hope to expand CBT to broader IO and system interfaces, and hope to further improve test execution speeds through parallelism.

5.9 Acknowledgements

The authors acknowledge and thank Rajesh Kapoor at Intel for sharing his BIOS knowledge.


5.10 References

[1] "Coalition of 20+ Tech Firms Backs MRAM as Potential DRAM, NAND Replacement," http://www.dailytech.com/Coalition+of+20+Tech+Firms+Backs+MRAM+as+Potential+DRAM+NAND+Replacement/article33826.htm.

[2] S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, "Patching Processor Design Errors with Programmable Hardware," IEEE Micro, vol. 27, no. 1, pp. 12–25, Jan. 2007.

[3] B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study," in ACM SIGMETRICS Performance Evaluation Review, 2009, vol. 37, pp. 193–204.

[4] M. Shepherd, D. Blankenbeckler, and A. Norman, "Comparison of off-chip interconnect validation to field failures," in VALID 2011, The Third International Conference on Advances in System Testing and Validation Lifecycle, 2011, pp. 121–125.

[5] B. Querbach, R. Khanna, D. Blankenbeckler, Y. Zhang, R. T. Anderson, D. G. Ellis, Z. T. Schoenborn, S. Deyati, and P. Chiang, "A reusable BIST with software assisted repair technology for improved memory and IO debug, validation and test time," in Test Conference (ITC), 2014 IEEE International, 2014, pp. 1–10.

[6] B. Querbach, D. G. Ellis, A. Khan, M. J. Tripp, E. S. Gayles, and E. Gollapudi, "Automatic self test of an integrated circuit component via AC I/O loopback," U.S. Patent 7,139,957, 2006.

[7] B. Querbach, S. Puligundla, D. Becerra, Z. T. Schoenborn, and P. Chiang, "Comparison of hardware based and software based stress testing of memory IO interface," in 2013 IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS), 2013, pp. 637–640.

[8] B. Querbach, "CPGC2, a built-in self test and auto-repair engine for DRAM (WIO, DDR4)," patent pending.

[9] J. Nejedlo and R. Khanna, "Intel® IBIST, the full vision realized," in Test Conference, 2009. ITC 2009. International, 2009, pp. 1–11.

[10] D. Ellis, B. Querbach, J. Nejedlo, A. Khan, S. Babcock, E. Gayles, and E. Gollapudi, "Pre-announce signaling for interconnect built-in self test," 2003.

[11] D. G. Ellis, B. Querbach, J. J. Nejedlo, A. Khan, S. R. Babcock, E. S. Gayles, and E. Gollapudi, "Push button mode automatic pattern switching for interconnect built-in self test," U.S. Patent 6,826,100, 2004.

[12] B. L. Spry, T. Z. Schoenborn, P. Abraham, C. P. Mozak, D. G. Ellis, J. J. Nejedlo, B. Querbach, Z. Greenfield, R. Ghattas, and J. Tholiyil, "Robust memory link testing using memory controller," U.S. Patent Application 12/651,252, 2009.

[13] S. Gurumurthy, M. Pratapgarhwala, C. Gilgan, and J. Rearick, "Comparing the effectiveness of cache-resident tests against cycle-accurate deterministic functional patterns," in Test Conference (ITC), 2014 IEEE International, 2014, pp. 1–8.

[14] N. Foutris, M. Psarakis, D. Gizopoulos, A. Apostolakis, X. Vera, and A. Gonzalez, "MT-SBST: Self-test optimization in multithreaded multicore architectures," in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1–10.

[15] G. Theodorou, N. Kranitis, A. Paschalis, and D. Gizopoulos, "Software-Based Self Test Methodology for On-Line Testing of L1 Caches in Multithreaded Multicore Architectures," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 4, pp. 786–790, Apr. 2013.

[16] M. Shin, J. Shim, J. Kim, J. S. Pak, C. Hwang, C. Yoon, J. Kim, H. Kim, K. Park, and Y. Kim, "A 6.4Gbps on-chip eye opening monitor circuit for signal integrity analysis of high speed channel," in IEEE International Symposium on Electromagnetic Compatibility, 2008. EMC 2008, 2008, pp. 1–6.

[17] K. Noguchi and M. Nagata, "An On-Chip Multichannel Waveform Monitor for Diagnosis of Systems-on-a-Chip Integration," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 10, pp. 1101–1110, Oct. 2007.

[18] A. K. M. M. Islam and H. Onodera, "Characterization and compensation of performance variability using on-chip monitors," in Proceedings of Technical Program - 2014 International Symposium on VLSI Technology, Systems and Application (VLSI-TSA), 2014, pp. 1–4.

[19] J.-S. Wang, P. Andrews, S. Fu, and G. Pwu, "Low power Sigma-Delta analog-to-digital converter for temperature sensing and voltage monitoring," in Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, 2000, vol. 1, pp. 36–39.

[20] T. Okumoto, K. Yoshikawa, and M. Nagata, "Monitoring effective supply voltage within power rails of integrated circuits," in Solid State Circuits Conference (A-SSCC), 2012 IEEE Asian, 2012, pp. 113–116.


Architecture of a Reusable BIST Engine for Detection and Auto Correction of Memory Failures and for IO Debug, Validation, Link Training, and Power Optimization on 14nm 3DS SOC

Bruce Querbach [email protected], Rahul Khanna [email protected], Sudeep Puligundla [email protected], David Blankenbeckler [email protected], Joe Crop [email protected], Patrick Chiang [email protected]

IEEE Design and Test magazine, 2015 (accepted)

Vancouver, British Columbia, Canada


6 Architecture of a Reusable BIST Engine (CPGC)

Architecture of a Reusable BIST Engine for Detection and Auto Correction of Memory Failures and for IO Debug, Validation, Link Training, and Power Optimization on 14nm 3DS SOC

Abstract—3D stacked (3DS) integrated circuits (IC) present unique challenges to testability, due to the lack of probe points, and to overall product yield, due to memory subsystem failures. High speed memory and IO require potentially complex link training, yet the end user experience of phones and tablets demands quick power-on and quick link training. Further, compared to desktop CPUs, phones and tablets present low power challenges for Intel chips.

This paper presents the architecture of a reusable built-in self-test (BIST) engine called the converged pattern generator and checker (CPGC). CPGC hardware and software implemented on an Intel 14nm 3DS IC is capable of auto repair of detected memory defects. Silicon results are presented showing that CPGC enables easy debug of inter-symbol interference (ISI) and crosstalk issues in silicon and platform, improving the probe-ability of 3DS ICs. It also enables quick and complex IO link training of all memory and IO subsystems. Further, CPGC improved validation time by 3x, and reduced system-on-chip (SOC) and platform power by 5% to 11% through closed loop circuit power optimization/training. This reusable CPGC BIST engine IP solution has been successfully designed into at least 11 Intel CPU/SOCs.


6.1 Introduction

Process technology continues to shrink; for example, Intel has demonstrated 14nm technology [16]. These shrinking technologies, combined with consumer demand to store everything, have driven memory capacity to increase year after year [15]. Along with it, due to the smaller geometry, lower voltage, and complex physical design, memory defects are increasing. DRAM errors are one of the leading causes of hardware failure in modern compute clusters as memory capacity and density increase [2, 13].

Dong Wook Kim in ECTC 2013 [14] demonstrated 3D stacked wide-IO (WIO) memory with one logic die and up to four memory dies connected via through silicon vias (TSV). WIO is similar to double data rate (DDR) memory, but offers much wider channel and IO bandwidth than DDR through more parallel pins. TSVs are tiny vertical interconnects that pass through the die to enable connectivity between multiple dies. This stacking further limits the probe-ability of a sandwiched die after bonding. When functional at-speed tests are required, BIST is the on-die solution that can test this 3DS technology at speed, in contrast to tester based solutions. BIST also solves the probe-ability issue of 3DS ICs that other probe solutions cannot access, because BIST is built directly into the 3DS IC.

Much prior work has been done related to memory test. Boundary scan [1] introduced DC and stuck-at tests. Venkatesh [3] demonstrated a physical memory fault model to test BIST algorithms. March and modified-march algorithms [4-6] are commonly used to detect memory cell failures. A galloping pattern [7] can capture failures that escape the march algorithm. Bernardi [8] introduced a hardware-software co-test, which can be performed while the device under test (DUT) is operational. Bernardi [9] also demonstrated a programmable memory BIST (MBIST) suitable for stacked DRAM, which uses march to detect neighborhood pattern sensitive faults (NPSF). In [12], the paper showed that patterns that are quick to execute and provide good coverage for memory defects are important.

This paper presents the architecture of a BIST engine that is able to test memory cell failures according to the algorithms described in the prior works above (scan, march, gallop, etc.), but goes beyond prior work by introducing auto repair of detected memory cell failures, IO link training, and power optimization/training. A comparison of the features of prior work is shown in Table 11.


Table 11: Comparison of CPGC features to prior work.

The BIST engine tests the 3DS SOC using the functional circuits themselves, by exciting the physical neighboring aggressors of a victim IO lane in Figure 67 (a), or of a victim memory cell (called the base) in Figure 67 (b). This new generation of reusable BIST engines is built into the SOC CPU and is called the converged pattern generator and checker (CPGC) [11]. Converged refers to combining interconnect BIST (IBIST) [10] and MBIST capabilities into one BIST engine. The paper will also describe an auto-repair technology, which, combined with the fault detection capabilities, can physically re-map defective memory using redundant memory cells built into the DRAM. This provides significant test time savings in the post-package manufacturing flow. CPGC also reduced SOC and platform power through closed loop optimization of on die termination (ODT) and equalization. Such a BIST that supports IO and memory defect detection, memory auto repair, IO link training, and power optimization/training has not been previously demonstrated.

The rest of this paper is structured as follows. In section 6.2.1, the architecture of CPGC is presented. Section 6.2.4 details the software architecture. In section 6.2.5, silicon test results are presented. Lastly, in section 6.3, the paper is summarized.

6.2 2nd Generation CPGC for 14nm 3DS SOC Memory and IO

6.2.1 CPGC Architecture Overview

The typical computer architecture for SOCs consists of one or more processors with a memory sub-system that includes a memory controller and a main memory. A 3DS SOC significantly reduces the footprint of the memory and IO sub-system, because the memories are stacked on top of the CPU as shown in Figure 67 (a). The CPGC BIST engine becomes a part of the memory sub-system and the memory controller, as shown in Figure 67 (b). The CPGC architecture consists of test pattern generators (TPG), defect detectors (DD), a repository of failing addresses (RF), an optional self-repair logic circuit (SF), and configuration registers (CR). These capabilities (TPG, DD, RF, SF, and CR) form the basis of CPGC and are present for each memory channel. Shown on the right of Figure 67 (b) is the main memory with spare memory, along with two typical memory test algorithms. The main memory can be written (W) sequentially from top to bottom and left to right, or, as a base address walks through memory, an offset can stress the neighbors around the base. The CPGC can be programmed through the CR to automatically or manually repair memory defects detected, with software assistance or a hardware finite state machine (FSM) as shown in Figure 67 (d), using the spare memory of the DRAM to map out the defective rows.


Figure 67 (a): CPU with CPGC BIST engine (SOC with CPGC BIST engine and stacked DRAM dies connected by TSVs)
Figure 67 (b): CPGC BIST engine architecture as part of the Intel memory controller (test pattern generator, defect detector, self-repair logic circuit, repository of failing addresses, and configuration registers, alongside the main memory with spare memory and the memory-scan and butterfly examples)
Figure 67 (c): The CPGC BIST engine pipeline architecture (pipe stages n00 and n01 generating column, bank, row, and rank addresses)
Figure 67 (d): The CPGC BIST engine auto-repair architecture (PPR finite state machine looping while NumFailStatus < MaxFail)

Figure 67: SOC memory sub-system and CPGC Architecture overview


6.2.2 CPGC Architecture Detail

Similar to [6], the CPGC BIST engine includes command, address, and data generators. The CPGC supports four instruction types: address, data, algorithm, and command. An offset instruction type is optional. When all the instructions are programmed, the lowest instruction loop (offset) is executed first, followed by the command, algorithm, data, and address instruction loops.

6.2.2.1 Address Pipeline

Given the dependencies between command, address, and offset, a pipelined architecture is required to generate a complete address back-to-back on every clock and achieve high speed test, similar to [6].

An example of an address generation pipeline is shown in Figure 67 (c). The diagram shows pipe stages n00 and n01. When the bank is ordered lower than the column, it is lumped with the column into pipe stage n00. The carry count is a look-ahead carry propagation that ensures the row address has valid inputs in pipe stage n01. The rank address is lumped with the row address into pipe stage n01.
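As a rough software analogy only (the pipe stage boundaries, field sizes, and ordering below are assumptions for illustration, not the silicon design), the address generation can be modeled as two stages, with a look-ahead carry from the column/bank stage feeding the row/rank stage:

def address_stream(n, cols=4, banks=2, rows=4, ranks=2):
    """Yield n (rank, row, bank, column) addresses, one per clock."""
    col = bank = row = rank = 0
    for _ in range(n):
        yield (rank, row, bank, col)
        # Pipe stage n00: column and bank counters plus a look-ahead carry.
        col += 1
        carry = col == cols
        if carry:
            col = 0
            bank = (bank + 1) % banks
        carry = carry and bank == 0
        # Pipe stage n01: row and rank advance only when the carry is set,
        # so their inputs are valid one stage later.
        if carry:
            row += 1
            if row == rows:
                row, rank = 0, (rank + 1) % ranks

for addr in address_stream(10):
    print(addr)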

6.2.2.2 Auto Repair FSM and Auto Repair Technology


The CPGC reusable BIST engine introduced an optional hardware repair engine that works in conjunction with software to provide auto repair technology. This repair capability has been adopted by JEDEC as part of the wide-IO (WIO) industry standard, known as post package repair (PPR). In a high-volume manufacturing (HVM) environment, the automatic test equipment (ATE) can use the CPGC row hammer to detect memory cell defects. At the end of the detection phase of the test, the ATE test flow can initiate either a hardware automatic repair or a software assisted manual repair of the failed memory cells using the spare rows of the DRAM device. DRAM device manufacturers have adopted the PPR JEDEC standard; specifically, future DRAMs are built with extra spare rows of memory to compensate for the increased likelihood of memory cell failures. The automatic modes require no operator intervention, thus significantly reducing HVM test time from many minutes down to seconds.

The auto repair and manual repair can also be run at boot via firmware or BIOS. This enables in-the-field PPR of a highly integrated SOC with a 3DS WIO memory subsystem, such as a phone or tablet. Such end user auto repair can significantly improve manufacturing yield and the end user customer experience. Imagine, instead of throwing away a dead phone, getting customer service to push an auto repair software update, or even having the phone simply repair itself automatically on the next reboot after a crash.

144

Figure 67 (d) shows an 11-state FSM that loops between activate (ACT) and wait-for-programming-time-completion (Wait tPGM) if multiple failing addresses need to be repaired. All 11 states are used to complete PPR. The PPR flow is activated if the number of failures (NumFailStatus) is less than the maximum number of failures (MaxFail) that the spare memory can support.
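The repair loop can be summarized with a short, hedged sketch. The guard condition follows the FSM in Figure 67 (d), but the command helpers (issue_act, program_ppr, wait_tpgm, precharge) are hypothetical placeholders for the DRAM command sequence driven by the hardware FSM, not the hardware interface itself.

def auto_repair(failing_rows, max_fail, issue_act, program_ppr, wait_tpgm, precharge):
    """Repair failing rows with spare rows while spares remain."""
    num_fail_status = 0
    for row in failing_rows:
        if num_fail_status >= max_fail:   # no spare rows left to use
            break
        issue_act(row)        # ACT the failing row
        program_ppr(row)      # program the spare-row remap (PPR)
        wait_tpgm()           # wait for the DRAM programming time tPGM
        precharge()           # PRE before handling the next failing address
        num_fail_status += 1
    return num_fail_status    # number of rows actually repaired

# Example with stand-in callbacks.
repaired = auto_repair([0x12, 0x34], max_fail=4,
                       issue_act=print, program_ppr=print,
                       wait_tpgm=lambda: None, precharge=lambda: None)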

6.2.2.3 Refresh Control Generator

Refresh control is important to maintain DRAM data integrity. It is also an excellent tool to test and stress the DRAM for cell retention failures. The CPGC BIST engine has a variable refresh signal generator to control and stress the DRAM refresh rates.

6.2.3 Algorithmic Memory Cell Testing

The CPGC implementation has been designed with significant flexibility to accommodate a broad variety of different memory test algorithms efficiently. To accomplish these address walking algorithms, the CPGC architecture Test Pattern Generator consists of the nested loops of address instruction (AI), data instruction (DI), algorithm instruction (GI), command instruction (CI), and offset instruction (OI).

Some examples are described below.

6.2.3.1 Memory-Scan


A type of test that is commonly used to initialize memory and detect stuck-at memory failures is called memory-scan, which is shown in the upper right corner of Figure 67 (b). The scan writes "0" to the entire memory, which clears the memory contents at boot and catches any command, address, and data bus communication errors. A subsequent read scan is done to detect any errors and provide coverage for stuck-at "1" faults. The process is repeated by writing "1" and reading to verify, to catch stuck-at "0" faults. Some implementations follow with a second scan of write and read "0" to provide full stuck-at coverage, since the initial memory contents were random.
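A software model of this scan sequence, over a plain Python list standing in for the DRAM array, might look as follows; it is illustrative only, since the real scan is issued by the CPGC command and address generators.

def scan(memory, value):
    """Write 'value' to every cell, then read back and report mismatching addresses."""
    for addr in range(len(memory)):
        memory[addr] = value
    return [addr for addr, cell in enumerate(memory) if cell != value]

memory = [None] * 1024       # stand-in for one memory region
errors = scan(memory, 0)     # write/read 0: clears memory and checks stuck-at-1
errors += scan(memory, 1)    # write/read 1: checks stuck-at-0
errors += scan(memory, 0)    # optional second 0 scan for full stuck-at coverage
print("failing addresses:", errors)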

6.2.3.2 Row Hammer and Butterfly

Conceptually, DRAMs are fundamentally made up of a large array of capacitors in the form of holes in the substrate. A "1" is represented by a capacitor holding a charge; a "0" is represented by a discharged capacitor. There are various physical mechanisms by which coupling effects between neighboring capacitors can result in bit flips. The opportunity for these cross charge coupling failures increases as device size decreases. The charge sharing impact of the previous bit in time can cause ISI failures, which increase as DRAM frequency increases. The interconnect and wires for SOC internal and external IO face similar crosstalk and ISI. Due to finite charge leakage, DRAM cells are refreshed to retain their stored value; however, slower refresh rates lead to additional DRAM retention failures.


An example test algorithm for detecting neighbor cross coupling, typically called butterfly, is shown on the right side of Figure 67 (b). In this test, the nearest neighbors in the x and y directions, in a crosshair shape and up to two memory cell locations away, are written with data opposite to that of the victim, or base, cell (Wr1, Wr2, Wr3, Wr4, Wr5, Wr6, Wr7, and Wr8). The base is read at the end of these write commands to check for errors. The butterfly algorithm can test near neighbors (crosshair), all neighbors (crosshair and diagonal), and selective neighbors.
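A minimal sketch of the crosshair butterfly is shown below; the 2-D array, offsets, and final check are a software model chosen for illustration, while the CPGC offset instructions implement the same walk in hardware.

def butterfly(mem, row, col, base_val=0, reach=2):
    """Write the opposite value to crosshair neighbors up to 'reach' cells away,
    then read the base (victim) cell back and check it."""
    mem[row][col] = base_val
    aggressor = 1 - base_val
    for d in range(1, reach + 1):
        for dr, dc in ((-d, 0), (d, 0), (0, -d), (0, d)):   # N, S, W, E neighbors
            r, c = row + dr, col + dc
            if 0 <= r < len(mem) and 0 <= c < len(mem[0]):
                mem[r][c] = aggressor                        # the Wr1..Wr8 writes
    return mem[row][col] == base_val    # False would indicate a coupling fault

mem = [[0] * 8 for _ in range(8)]
print(butterfly(mem, 4, 4))             # True on this fault-free software model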

6.2.3.3 Other Cell Test Algorithms

The CPGC engine is fully capable of supporting many different types of memory cell test algorithms, such as MATS+, march LR, march C-, PMOVI, and even more complicated algorithms to target certain types of DRAM faults. In addition, the CPGC can easily set up the background data pattern prior to beginning the test.

6.2.4 CPGC Software Infrastructure Architecture

The CPGC technology has built-in customizable flexibility enabling the testing and diagnosis of entire memory sub-systems. In case of impending failure, it is possible to isolate the memory field-replaceable unit (FRU) for further testing through the CPGC IP. The FRU flow would take some or all of the memory offline and use CPGC to test it. The BIOS-based unified extensible firmware interface (UEFI) architecture supports a middleware layer that can incorporate the CPGC application program interface (API) at boot-time as well as at run-time. At boot-time, the CPGC-based test could execute during the initialization of the memory reference code. At run-time, the code could execute in the system management interrupt (SMI), which is a virtualized space in the system. The API libraries are incorporated in both boot-time and run-time environments. While boot-time CPGC code could execute at every boot, run-time code cannot be executed frequently: execution of run-time code can cause system perturbations that degrade the performance of the workload running on the system.

A firmware level approach to hosting CPGC tests and APIs requires building a memory-map infrastructure for identifying memory addresses. The memory initialization code is modified to incorporate the CPGC test code. The system management interrupt (SMI) provides the necessary APIs to assist in constructing and configuring the CPGC tests. Figure 68 illustrates the UEFI boot services layer that hosts the CPGC test modules. Similar modules are replicated into the SMI for run-time execution.


Figure 68: UEFI boot service layer that supports the CPGC test module (CPGC test module and other test modules on top of the UEFI OS boot service layer, platform specific firmware, and CPU hardware including CPGC)

UEFI is a standards-based modern software architecture (originated by Intel) that offers a rich, extensible pre-operating system (OS) environment with advanced boot and runtime services. It has intrinsic networking capability and is designed from the ground up to work with multi-processor (MP) systems. The UEFI standard is architected for dynamic modularity. This has opened up opportunities for collaboration on the technology, reducing development costs and time to market through code sharing and easy integration of plugin modules from different partners and vendors.

6.2.5 CPGC Applications and Results

6.2.5.1 Memory Initialization and IO Link Training Using CPGC

One of the important applications where CPGC is extensively used is memory initialization and training during system power up. Typical link training algorithms use eye centering techniques on the received signal to ensure the maximum possible margins on either side of the sampling instant in the voltage and timing axes.

CPGC writes a stream of patterns on the IO link, which are generated either via built-in linear feedback shift registers (LFSR) or via user programmable registers. This stream of data is stored in memory, read back, and compared to the original by CPGC. CPGC intentionally degrades and margins the IO write or read path independently while transmitting, receiving, and checking for errors to identify the high side and low side limits of the voltage margin, and the left side and right side limits of the timing margin, to generate a 2-D eye diagram (Figure 69 (c)) for the link under test. CPGC applies the eye centering technique by calculating the mid-point between the high side and low side limits of the voltage margin, and between the right side and left side limits of the timing margin. These voltage and timing mid-points become the optimal eye center as a result of CPGC IO link training. CPGC applies a link training algorithm to all the command, address, and data pins of each memory channel, and to all memory and high speed IO attached to the CPU (e.g., PCIe). During training, the eye margin is automatically measured and reported.
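For illustration, once the four fail limits have been measured by the margin sweeps described above, the centering step reduces to simple midpoint arithmetic. The fail-limit values in the example below are made up.

def eye_center(left_fail, right_fail, low_fail, high_fail):
    """Return the (timing, voltage) midpoints between the measured fail limits."""
    timing_center = (left_fail + right_fail) / 2
    voltage_center = (low_fail + high_fail) / 2
    return timing_center, voltage_center

# Example with made-up fail limits in ticks.
print(eye_center(left_fail=-20, right_fail=12, low_fail=-30, high_fail=26))
# -> (-4.0, -2.0): shift the sample point 4 ticks left and 2 ticks down.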

6.2.5.2 ISI and Crosstalk Debug Using CPGC

Another application of CPGC is identifying IO link performance loss due to data dependent jitter, such as ISI and crosstalk, through bit-by-bit pattern searches. A DDR memory IO bus is a single-ended voltage termination bus with very closely placed lanes between the processor and the DDR dual inline memory module (DIMM), which makes this bus susceptible to ISI and crosstalk issues. CPGC can be used to program and transmit stress patterns on the bus that cause a worst-case 1 or 0 transition after a long run of identical bits. One such example, used in DDR validation to measure eye height loss due to ISI, is shown in Figure 69 (b). In this case, CPGC implemented in silicon is programmed to change the precursor bit on a DDR pin (pin 68) in a deterministic manner from a 0 to 1 after a long sequence of 0s. This 0 to 1 transition is immediately followed by a 1 to 0 transition to create a worst-case 1 during this bit period, which corresponds to the inner contour of the eye diagram shown in Figure 69 (a), thus reducing the eye height margin. Similarly, a worst-case 0 can be created to stress the bus. Such programmability in CPGC helps to debug any ISI. CPGC also allows simultaneous programming of multiple lanes on the bus with different deterministic or random patterns and provides independent control of the offset/skew in timing between the lanes. This feature is used to create a victim-aggressor condition among the lanes for crosstalk studies and mitigation.
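As a simple illustration of such a stress pattern (the run length, repeat count, and LFSR aggressor choice are arbitrary assumptions, not the programmed CPGC values), the victim bit stream can be built as a long run of zeros followed by an isolated one, with pseudo-random bits on the neighboring aggressor lanes:

def isi_victim_pattern(run_length=32, repeats=4):
    """Long run of 0s followed by an isolated 1 (0 -> 1 immediately followed by 1 -> 0)."""
    return ([0] * run_length + [1]) * repeats

def aggressor_pattern(length, seed=0xACE1):
    """Pseudo-random aggressor bits from a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    bits, lfsr = [], seed
    for _ in range(length):
        bits.append(lfsr & 1)
        fb = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1
        lfsr = (lfsr >> 1) | (fb << 15)
    return bits

victim = isi_victim_pattern()
aggressors = [aggressor_pattern(len(victim)) for _ in range(3)]   # three neighboring lanes
print(victim[:40])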

Another way of stressing a memory cell is through repeated writes to the neighboring cells, called hammering. A row hammer issues repeated activate commands to specific aggressor rows to cause victim row/bit failures. CPGC can stress an additional +2 and +3 aggressor rows, which is specifically suited for the WIO TSV testing of 3DS memory, when physical probing of a sandwiched die is impossible.


Figure 69 (a): scope capture of DDR IO eye
Figure 69 (b): ISI debug changing precursor from 0 to 1
Figure 69 (c): CPGC BIST based memory IO initialization, training, and eye measurement

Figure 69: CPGC Silicon results


6.2.5.3 Silicon and System Validation Using CPGC

The CPGC engine is also used in silicon and system validation. A unique feature of CPGC is its ability to drive patterns through the memory IO without requiring the core of the processor to be working. This feature enables early silicon debug and characterization of the IO circuit physical layer independently of the processor. In an HVM system validation environment, after the initial bring-up of the complete system, CPGC provides a pattern auto-rotation feature to automatically feed test patterns from one lane to the next and switch between victim and aggressor roles. The auto-rotation of victim and aggressor patterns has been shown to decrease HVM validation time by 3x.

6.2.5.4 Platform Power Savings Using CPGC

In addition to memory IO initialization and training, CPGC is also used to optimize the power and performance of the DDR IO. The CPGC implementation provides controls via configuration registers to closed-loop compensation circuits such as reference voltage (Vref), slew rate, drive strength, and on-die termination. These control features are optimized and tuned for the lowest power while maintaining acceptable eye opening and bit error rates at the receivers. The results showed that power savings of 6mW/bit and 12mW/bit were achieved on CPU power (CpuPwr) and DIMM power (DimmPwr) respectively, according to the plot average shown in Figure 70, when the ODT was tuned. By calibrating for power, margin reductions of -5, -8, +11, and -19 ticks were observed on receiver timing (RdT), transmitter timing (WrT), receiver voltage (RdV), and transmitter voltage (WrV). There are a total of 64 controllable states of voltage and timing offset in the X and Y directions; each shift in state is called a tick. However, the reduction in margin did not affect the overall system BER of 10^-12. This margin/power trade off led to 5.28% CPU power savings on a 22nm 15W CPU and 11.03% power savings on the platform.


Figure 70: CPGC closed calibration loop saves CPU and DRAM power (calibration feedback loop controlling TX slew rate, drive strength, ODT, and the RX Vref DAC)


6.3 Summary

The CPGC engine addresses the complexity of designing and testing today's ICs, for which prior tester and probe solutions no longer work because of the limited probe-ability of 3DS packages and the higher frequencies of IO and memory. This paper presented the hardware and software architecture of a reusable BIST engine for Intel SOCs that includes the auto repair of detected memory defects.

Key CPGC silicon results and benefits are: debug isolation of ISI and crosstalk; memory IO initialization, training, and margining; auto-rotation that led to a 3x validation time improvement; isolation of memory sub-systems to enable early debug, which saved Intel up to 3 months of product delay; row hammer, butterfly, and galloping patterns that enabled WIO TSV testing of 3DS memory; and 5% to 11% CPU and platform power savings through closed-loop calibration of the DDR IO circuits.

This reusable CPGC BIST engine IP solution has been successfully designed into Intel products, with at least 11 Intel SOCs using CPGC for IO link training, validation, and debug. At least seven Intel SOCs have been successfully debugged, tested, and launched in the marketplace using the reusable CPGC BIST engine.


6.4 Acknowledgments

The authors would like to acknowledge Adam Taylor and Thilak K Poriyani House Raju for their wonderful support and effort on this CPGC project.

6.5 References

[1] B. G. Van Treuren, C. Chiang, and K. Honaker, "Problems Using Boundary-Scan for Memory Cluster Tests," Test Conference, 2008. ITC 2008. IEEE International, pp. 1-10, 28-30 Oct. 2008.
[2] D. Blankenbeckler, A. Norman, and M. Shepherd, "Comparison of off-chip interconnect validation to field failures," VALID 2011, Barcelona, Spain.
[3] R. Venkatesh, S. Kumar, J. Philip, and S. Shukla, "A fault modeling technique to test memory BIST algorithms," in Memory Technology, Design and Testing (MTDT 2002), Proceedings of the 2002 IEEE International Workshop on, 2002, pp. 109-116.
[4] Y. Jen-Chieh, W. Chi-Feng, C. Kuo-Liang, C. Yung-Fa, H. Chih-Tsun, and W. Cheng-Wen, "Flash memory built-in self-test using March-like algorithms," in Electronic Design, Test and Applications, Proceedings of the First IEEE International Workshop on, 2002, pp. 137-141.
[5] A. J. Van de Goor, S. Hamdioui, and H. Kukner, "Generic, orthogonal and low-cost March Element based memory BIST," in Test Conference (ITC), 2011 IEEE International, 2011, pp. 1-10.
[6] S. Sheng-Chih, H. Hung-Ming, C. Yi-Wei, and L. Kuen-Jong, "A high speed BIST architecture for DDR-SDRAM testing," in Memory Technology, Design, and Testing (MTDT 2005), 2005 IEEE International Workshop on, 2005, pp. 52-57.
[7] G. Mrugalski, A. Pogiel, N. Mukherjee, J. Rajski, J. Tyszer, and P. Urbanek, "Fault Diagnosis in Memory BIST Environment with Non-march Tests," in Test Symposium (ATS), 2011 20th Asian, 2011, pp. 419-424.
[8] P. Bernardi, L. Ciganda, M. S. Reorda, and S. Hamdioui, "An Efficient Method for the Test of Embedded Memory Cores during the Operational Phase," in Test Symposium (ATS), 2013 22nd Asian, 2013, pp. 227-232.
[9] P. Bernardi, M. Grosso, M. S. Reorda, and Y. Zhang, "A programmable BIST for DRAM testing and diagnosis," in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1-10.
[10] J. Nejedlo and R. Khanna, "Intel® IBIST, the full vision realized," Test Conference, 2009. ITC 2009. International, pp. 1-11, 1-6 Nov. 2009.
[11] B. Querbach, "CPGC2, a built-in self test and auto-repair engine for DRAM (WIO, DDR4)," patent pending, 14/141,239.
[12] B. Querbach, "Comparison of Hardware Based and Software Based Stress Testing of Memory IO Interface," IEEE MWSCAS 2013.
[13] B. Schroeder, E. Pinheiro, and W. Weber, "DRAM Errors in the Wild: A Large-Scale Field Study," SIGMETRICS/Performance '09, June 15-19, 2009, Seattle, WA, USA.
[14] Dong Wook Kim, R. Vidhya, B. Henderson, U. Ray, Sam Gu, Wei Zhao, R. Radojcic, and M. Nowak, "Development of 3D through silicon stack (TSS) assembly for wide IO memory to logic devices integration," Electronic Components and Technology Conference (ECTC), 2013 IEEE 63rd, pp. 77-80, 28-31 May 2013.
[15] "Coalition of 20+ Tech Firms Backs MRAM as Potential DRAM, NAND Replacement," http://www.dailytech.com/Coalition+of+20+Tech+Firms+Backs+MRAM+as+Potential+DRAM+NAND+Replacement/article33826.htm
[16] Vivek Singh, "Lithography at 14nm and beyond: Choices and challenges," Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, p. 459, 5-9 June 2011.


7 Conclusion

This dissertation tackled some of the challenges that today's complex IO and memory designs bring in terms of higher IO bandwidth, higher frequencies, and lower power requirements. Both the IO and the design-for-testability (DFT) features need to be low cost and brought to market quickly. In addition, the lack of probe-ability of phones and tablets physically limits traditional JTAG or TAP access.

First, this dissertation compared hardware BIST-type solutions to software tests. BIST-type solutions (e.g., APGC, an earlier version of CPGC) can significantly speed up test time because parallel hardware instances can be executed and the test instruction loading/unloading is reduced. For example, APGC can speed up test time over software by 32x to 2048x, depending on an aggressor-victim lane width of eight to 64 lanes, when using the auto-rotation feature.
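As a simple sanity check on the quoted range (a sketch only, not data from the dissertation), the two endpoints are consistent with a speedup that grows roughly as N²/2 for an N-lane victim/aggressor rotation:

# Hypothetical scaling check: 8 lanes -> 32x and 64 lanes -> 2048x both match N*N/2.
for lanes in (8, 16, 32, 64):
    print(lanes, "lanes ->", lanes * lanes // 2, "x speedup")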

Next, this dissertation presented the architecture of a reusable BIST engine called CPGC that is used in the critical path of DDR3, DDR4, LP-DDR3, and LP-DDR4 IO initialization (aka link training) on x86 SOCs. This on-die approach solved the above probe-ability and physical access problems while providing a low cost solution that can quickly and automatically train and test the IO, which led to the mass adoption and production of this CPGC BIST architecture. Silicon results show that CPGC can speed up validation time by 3x, decrease test time from minutes down to seconds, and decrease debug time by 5x, including the root-cause of boot failures of the memory interface.


Next, this dissertation also presented CPGC as the essential BIST engine for IO and memory defect detection and, in some cases, the auto repair of detected memory defects. The software and hardware infrastructure that leverages the CPU L2/L3 cache to enable CBT and the parallel execution of the CPGC Intel BIST engine demonstrated 60x to 170x improvements in test time compared to JTAG or TAP based testing.

Next, this dissertation presented a detailed discussion of the CPGC architecture. The architecture combined the BIST needs for both IO and memory cell testing into a single BIST engine. To generate back-to-back, full-bandwidth DDR transactions, a pipelined BIST architecture was developed. The memory cell testing engine is capable of all classical memory test algorithms (e.g., March, MATS++) and detection of classical memory fault models, and it supports highly flexible, user programmable algorithms for unforeseen future defects. Although effort was put into the architecture to enable 3DS memory IC testing, not enough 3DS memory cell test data was available at the time this dissertation was written.
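For reference, the MATS++ algorithm named above can be modeled behaviorally in a few lines; the Python below operates on a plain list standing in for the memory array and is only an illustration of the algorithm, not of the pipelined CPGC hardware.

# Behavioral sketch of MATS++: {up/down(w0); up(r0, w1); down(r1, w0, r0)}.

def mats_pp(mem):
    """Run MATS++ over a list of cells; return the addresses that miscompared."""
    fails = set()
    n = len(mem)
    for a in range(n):                    # element 1: write 0 to every cell (any order)
        mem[a] = 0
    for a in range(n):                    # element 2: ascending, read 0 then write 1
        if mem[a] != 0:
            fails.add(a)
        mem[a] = 1
    for a in reversed(range(n)):          # element 3: descending, read 1, write 0, read 0
        if mem[a] != 1:
            fails.add(a)
        mem[a] = 0
        if mem[a] != 0:
            fails.add(a)
    return sorted(fails)

# Example: a fault-free 16-cell memory reports no failing addresses.
print(mats_pp([0] * 16))   # -> []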

Lastly, this dissertation presented how this CPGC architecture reduced SOC and platform power by 5% to 11% through closed-loop IO circuit power optimization. This helped make the overall SOC lower power and more competitive.

This CPGC BIST engine has been developed into a highly successful reusable IP solution that has been designed into at least 11 Intel CPUs and SOCs (32nm-14nm), with at least seven of these SOCs debugged, tested, and launched into the marketplace, ultimately leading to over 100 million CPUs shipped within one quarter based on this architecture. This architecture is used on a daily basis around the world.

For future work, as 3DS logic and memory ICs become more prevalent, more data on 3DS IC memory cell tests will be collected to ensure that the CPGC BIST engine architecture is efficient at 3DS memory cell testing. To adapt to different test scenarios, the CPGC BIST engine IP will be made more modular and flexible, with a focus on achieving efficient test time and efficient design time.


8 Bibliography

1 http://en.wikipedia.org/wiki/Transistor_count
2 http://www.extremetech.com/computing/152979-intel-announces-full-set-of-new-atom-and-xeon-server-processors-to-fend-off-arm
3 "Coalition of 20+ Tech Firms Backs MRAM as Potential DRAM, NAND Replacement," http://www.dailytech.com/Coalition+of+20+Tech+Firms+Backs+MRAM+as+Potential+DRAM+NAND+Replacement/article33826.htm
4 "Coalition of 20+ Tech Firms Backs MRAM as Potential DRAM, NAND Replacement," http://www.dailytech.com/Coalition+of+20+Tech+Firms+Backs+MRAM+as+Potential+DRAM+NAND+Replacement/article33826.htm
5 A. Bravaix, C. Guerin, V. Huard, D. Roy, J.-M. Roux, and E. Vincent, "Hot-Carrier acceleration factors for low power management in DC-AC stressed 40nm NMOS node at high temperature," in Reliability Physics Symposium, 2009 IEEE International, 2009, pp. 531-548.
6 A. J. van de Goor, "An industrial evaluation of DRAM tests," IEEE Des. Test Comput., vol. 21, no. 5, pp. 430-440, Sep. 2004.
7 A. J. Van de Goor, S. Hamdioui, and H. Kukner, "Generic, orthogonal and low-cost March Element based memory BIST," in Test Conference (ITC), 2011 IEEE International, 2011, pp. 1-10.
8 A. J. van de Goor, Testing semiconductor memories: theory and practice. J. Wiley & Sons, 1991.
9 A. K. M. M. Islam and H. Onodera, "Characterization and compensation of performance variability using on-chip monitors," in Proceedings of Technical Program - 2014 International Symposium on VLSI Technology, Systems and Application (VLSI-TSA), 2014, pp. 1-4.
10 Tai-shan Yan, "An Improved Genetic Algorithm and Its Blending Application with Neural Network," Intelligent Systems and Applications (ISA), 2010 2nd International Workshop.
11 B. Casper, J. Jaussi, F. O'Mahony, M. Mansuri, K. Canagasaby, J. Kennedy, E. Yeung, and R. Mooney, "A 20Gb/s Forwarded Clock Transceiver in 90nm CMOS," in Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International, 2006, pp. 263-272.
12 B. L. Spry, T. Z. Schoenborn, P. Abraham, C. P. Mozak, D. G. Ellis, J. J. Nejedlo, B. Querbach, Z. Greenfield, R. Ghattas, and J. Tholiyil, "Robust memory link testing using memory controller," 12/651,252, 2009.
13 B. Querbach, D. G. Ellis, A. Khan, M. J. Tripp, E. S. Gayles, and E. Gollapudi, Automatic self test of an integrated circuit component via AC I/O loopback. Google Patents, 2006.
14 B. Querbach, A. Khan, M. Tripp, L. B. Guerrero, M. A. V. Vargas, and A. Muhtaroglu, Methods and apparatuses for validating AC I/O loopback tests using delay modeling in RTL simulation. Google Patents, 2007.
15 D. Ellis, B. Querbach, J. Nejedlo, A. Khan, S. Babcock, E. Gayles, and E. Gollapudi, Pre-announce signaling for interconnect built-in self test. Google Patents, 2003.
16 D. G. Ellis, B. Querbach, J. J. Nejedlo, A. Khan, S. R. Babcock, E. S. Gayles, and E. Gollapudi, Push button mode automatic pattern switching for interconnect built-in self test. Google Patents, 2004.
17 B. Querbach, M. A. Abdallah, A. M. Khan, M. M. Hossain, and S. M. Sarkar, Regulating a timing between a strobe signal and a data signal. Google Patents, 2009.
18 B. L. Spry, T. Z. Schoenborn, P. Abraham, C. P. Mozak, D. G. Ellis, J. J. Nejedlo, B. Querbach, Z. Greenfield, R. Ghattas, and J. Tholiyil, Robust memory link testing using memory controller. Google Patents, 2009.
19 M. B. Trobough, K. K. Tiruvallur, C. B. Prudvi, C. E. Iovin, D. W. Grawrock, J. J. Nejedlo, A. N. Kabadi, T. K. Goff, E. J. Halprin, and K. B. Udawatta, Test, validation, and debug architecture. US Patent 20,150,127,983, 2015.

20 B. Querbach, R. B. Hamilton, L. A. Johnson, and M. Kim, Voltage margining with a low power, high speed, input offset cancelling equalizer. Google Patents, 2009.
21 B. Querbach, D. G. Ellis, A. Khan, M. J. Tripp, E. S. Gayles, and E. Gollapudi, "Automatic self test of an integrated circuit component via AC I/O loopback," 7,139,957, 2006.
22 B. Querbach, R. Khanna, D. Blankenbeckler, Y. Zhang, R. T. Anderson, D. G. Ellis, Z. T. Schoenborn, S. Deyati, and P. Chiang, "A reusable BIST with software assisted repair technology for improved memory and IO debug, validation and test time," in Test Conference (ITC), 2014 IEEE International, 2014, pp. 1-10.
23 B. Querbach, S. Puligundla, D. Becerra, Z. T. Schoenborn, and P. Chiang, "Comparison of hardware based and software based stress testing of memory IO interface," in 2013 IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS), 2013, pp. 637-640.
24 B. Schroeder, E. Pinheiro, and W. Weber, "DRAM Errors in the Wild: A Large-Scale Field Study," SIGMETRICS/Performance '09, June 15-19, 2009, Seattle, WA, USA.
25 B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: a large-scale field study," in ACM SIGMETRICS Performance Evaluation Review, 2009, vol. 37, pp. 193-204.
26 C. Park, H. Chung, Y.-S. Lee, J. Kim, Y.-S. Lee, M.-S. Chae, D.-H. Jung, S.-H. Choi, S. Seo, T.-S. Park, J.-H. Shin, J.-H. Cho, S. Lee, K.-W. Song, K. Kim, J.-B. Lee, C. Kim, and S.-I. Cho, "A 512-Mb DDR3 SDRAM prototype with CIO minimization and self-calibration techniques," IEEE J. Solid-State Circuits, vol. 41, no. 4, pp. 831-838, Apr. 2006.
27 C.-T. Huang, J.-R. Huang, C.-F. Wu, C.-W. Wu, and T.-Y. Chang, "A programmable BIST core for embedded DRAM," IEEE Des. Test Comput., vol. 16, no. 1, pp. 59-70, Jan. 1999.
28 D. Blankenbeckler, A. Norman, and M. Shepherd, "Comparison of off-chip interconnect validation to field failures," VALID 2011, Barcelona, Spain.
29 D. Ellis, B. Querbach, J. Nejedlo, A. Khan, S. Babcock, E. Gayles, and E. Gollapudi, "Pre-announce signaling for interconnect built-in self test," 2003.
30 D. G. Ellis, B. Querbach, J. J. Nejedlo, A. Khan, S. R. Babcock, E. S. Gayles, and E. Gollapudi, "Push button mode automatic pattern switching for interconnect built-in self test," 6,826,100, 2004.
31 D. Kucharski and K. Kornegay, "A 40 Gb/s 2.5 V 2 1 PRBS generator in SiGe using a low-voltage logic family," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, Feb. 2005, pp. 340-341.
32 Dong Wook Kim, R. Vidhya, B. Henderson, U. Ray, Sam Gu, Wei Zhao, R. Radojcic, and M. Nowak, "Development of 3D through silicon stack (TSS) assembly for wide IO memory to logic devices integration," Electronic Components and Technology Conference (ECTC), 2013 IEEE 63rd, pp. 77-80, 28-31 May 2013.
33 F. O'Mahony, G. Balamurugan, J. E. Jaussi, J. Kennedy, M. Mansuri, S. Shekhar, and B. Casper, "The future of electrical I/O for microprocessors," in International Symposium on VLSI Design, Automation and Test, 2009. VLSI-DAT '09, 2009, pp. 31-34.
34 G. Mrugalski, A. Pogiel, N. Mukherjee, J. Rajski, J. Tyszer, and P. Urbanek, "Fault Diagnosis in Memory BIST Environment with Non-march Tests," in Test Symposium (ATS), 2011 20th Asian, 2011, pp. 419-424.
35 G. Theodorou, N. Kranitis, A. Paschalis, and D. Gizopoulos, "Software-Based Self Test Methodology for On-Line Testing of L1 Caches in Multithreaded Multicore Architectures," IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 21, no. 4, pp. 786-790, Apr. 2013.
36 S. K. Goel, N. Devta-Prasanna, and R. P. Turakhia, "Effective and Efficient Test Pattern Generation for Small Delay Defect," VLSI Test Symposium, 2009. VTS '09. 27th IEEE.
37 H. Knapp, M. Wurzer, W. Perndl, K. Aufinger, J. Böck, and T. F. Meister, "100-Gb/s 2 1 and 54-Gb/s 2 1 PRBS generators in SiGe bipolar technology," IEEE J. Solid-State Circuits, vol. 40, no. 10, pp. 2118-2125, Oct. 2005.
38 H. Veenstra, "1-58 Gb/s PRBS generator with <1.1 ps RMS jitter in InP technology," Solid-State Circuits Conference, 2004. ESSCIRC 2004. Proceedings of the 30th European.

39 http://www.google.com/patents/WO2014004748A1?cl=en
40 http://www.memcon.com/pdfs/proceedings2013/track1/Tuning DDR4 for Power and Performance.pdf
41 http://www.memcon.com/pdfs/proceedings2013/track1/Tuning DDR4 for Power and Performance.pdf
42 http://www.youtube.com/watch?v=i3-gQSnBcdo
43 J. Nejedlo and R. Khanna, "Intel® IBIST, the full vision realized," in Test Conference, 2009. ITC 2009. International, 2009, pp. 1-11.
44 J. Nejedlo and R. Khanna, "Intel® IBIST, the full vision realized," in Test Conference, 2009. ITC 2009. International, 2009, pp. 1-11.
45 J.-S. Wang, P. Andrews, S. Fu, and G. Pwu, "Low power Sigma-Delta analog-to-digital converter for temperature sensing and voltage monitoring," in Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, 2000, vol. 1, pp. 36-39.
46 J.-S. Wang, P. Andrews, S. Fu, and G. Pwu, "Low power Sigma-Delta analog-to-digital converter for temperature sensing and voltage monitoring," in Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, 2000, vol. 1, pp. 36-39.
47 K. K. Saluja and K. Kinoshita, "Test Pattern Generation for API Faults in RAM," IEEE Trans. Comput., vol. C-34, no. 3, pp. 284-287, Mar. 1985.
48 K. Noguchi and M. Nagata, "An On-Chip Multichannel Waveform Monitor for Diagnosis of Systems-on-a-Chip Integration," IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 15, no. 10, pp. 1101-1110, Oct. 2007.
49 K. Noguchi and M. Nagata, "An On-Chip Multichannel Waveform Monitor for Diagnosis of Systems-on-a-Chip Integration," IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 15, no. 10, pp. 1101-1110, Oct. 2007.
50 H. Knapp, M. Wurzer, T. F. Meister, J. Bock, and K. Aufinger, "40 Gbit/s 2^7-1 PRBS generator IC in SiGe bipolar technology," Bipolar/BiCMOS Circuits and Technology Meeting, 2002. Proceedings of the 2002.
51 L. Chen, F. Spagna, P. Marzolf, and J. K. Wu, "A 90nm 1-4.25-Gb/s Multi Data Rate Receiver for High Speed Serial Links," in Solid-State Circuits Conference, 2006. ASSCC 2006. IEEE Asian, 2006, pp. 391-394.
52 E. Laskin and S. P. Voinigescu, "A 60 mW per Lane, 4 x 23-Gb/s 2 1 PRBS Generator," IEEE Journal of Solid-State Circuits, vol. 41, no. 10.
53 M. A. A. Latif, N. B. Z. Ali, and F. A. Hussin, "A case study of process-variation effect to SoC analog circuits," in 2011 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2011, pp. 520-523.
54 M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing.
55 M. Shepherd, D. Blankenbeckler, and A. Norman, "Comparison of off-chip interconnect validation to field failures," in VALID 2011, The Third International Conference on Advances in System Testing and Validation Lifecycle, 2011, pp. 121-125.
56 M. Shepherd, D. Blankenbeckler, and A. Norman, "Comparison of off-chip interconnect validation to field failures," in VALID 2011, The Third International Conference on Advances in System Testing and Validation Lifecycle, 2011, pp. 121-125.
57 M. Shin, J. Shim, J. Kim, J. S. Pak, C. Hwang, C. Yoon, J. Kim, H. Kim, K. Park, and Y. Kim, "A 6.4Gbps on-chip eye opening monitor circuit for signal integrity analysis of high speed channel," in IEEE International Symposium on Electromagnetic Compatibility, 2008. EMC 2008, 2008, pp. 1-6.
58 M. Shin, J. Shim, J. Kim, J. S. Pak, C. Hwang, C. Yoon, J. Kim, H. Kim, K. Park, and Y. Kim, "A 6.4Gbps on-chip eye opening monitor circuit for signal integrity analysis of high speed channel," in IEEE International Symposium on Electromagnetic Compatibility, 2008. EMC 2008, 2008, pp. 1-6.
59 N. Foutris, M. Psarakis, D. Gizopoulos, A. Apostolakis, X. Vera, and A. Gonzalez, "MT-SBST: Self-test optimization in multithreaded multicore architectures," in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1-10.
60 N. Kurd, M. Chowdhury, E. Burton, T. P. Thomas, C. Mozak, B. Boswell, M. Lal, A. Deval, J. Douglas, M. Elassal, A. Nalamalpu, T. M. Wilson, M. Merten, S. Chennupaty, W. Gomes, and R. Kumar, "5.9 Haswell: A family of IA 22nm processors," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, 2014, pp. 112-113.

61 J. Nejedlo and R. Khanna, "Intel® IBIST, the full vision realized," Test Conference, 2009. ITC 2009. International, pp. 1-11, 1-6 Nov. 2009.
62 J. J. Nejedlo, "IBIST (interconnect built-in-self-test) architecture and methodology for PCI Express," Test Conference, 2003. Proceedings. ITC 2003.
63 J. J. Nejedlo, "IBIST™ (interconnect built-in self-test) architecture and methodology for PCI Express: Intel's next-generation test and validation methodology for performance IO," Test Conference, 2003.
64 J. J. Nejedlo, "TRIBuTE board and platform test methodology," Test Conference, 2003. Proceedings. ITC 2003.
65 P. Bernardi, L. Ciganda, M. S. Reorda, and S. Hamdioui, "An Efficient Method for the Test of Embedded Memory Cores during the Operational Phase," in Test Symposium (ATS), 2013 22nd Asian, 2013, pp. 227-232.
66 P. Bernardi, M. Grosso, M. S. Reorda, and Y. Zhang, "A programmable BIST for DRAM testing and diagnosis," in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1-10.
67 P. de Jong and A. J. van de Goor, "Test pattern generation for API faults in RAM," IEEE Trans. Comput., vol. 37, no. 11, pp. 1426-1428, Nov. 1988.
68 B. Querbach, "Comparison of Hardware Based and Software Based Stress Testing of Memory IO Interface," IEEE MWSCAS 2013.
69 B. Querbach, "CPGC2, a built-in self test and auto-repair engine for DRAM (WIO, DDR4)," patent pending.
70 R. Malasani, C. Bourde, and G. Gutierrez, "A SiGe 10-Gb/s multipattern bit error rate tester," Radio Frequency Integrated Circuits (RFIC) Symposium, 2003 IEEE.
71 R. Venkatesh, S. Kumar, J. Philip, and S. Shukla, "A fault modeling technique to test memory BIST algorithms," in Memory Technology, Design and Testing (MTDT 2002), Proceedings of the 2002 IEEE International Workshop on, 2002, pp. 109-116.
72 S. Gurumurthy, M. Pratapgarhwala, C. Gilgan, and J. Rearick, "Comparing the effectiveness of cache-resident tests against cycle-accurate deterministic functional patterns," in Test Conference (ITC), 2014 IEEE International, 2014, pp. 1-8.
73 S. Puligundla, F. Spagna, L. Chen, and A. Tran, "Validating the performance of a 32nm CMOS high speed serial link receiver with adaptive equalization and baud-rate clock data recovery," in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1-5.
74 S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, "Patching Processor Design Errors with Programmable Hardware," IEEE Micro, vol. 27, pp. 12-25, 2007.
75 S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, "Patching Processor Design Errors with Programmable Hardware," IEEE Micro, vol. 27, no. 1, pp. 12-25, Jan. 2007.
76 S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, "Patching Processor Design Errors with Programmable Hardware," IEEE Micro, vol. 27, no. 1, pp. 12-25, Jan. 2007.
77 S. Sheng-Chih, H. Hung-Ming, C. Yi-Wei, and L. Kuen-Jong, "A high speed BIST architecture for DDR-SDRAM testing," in Memory Technology, Design, and Testing (MTDT 2005), 2005 IEEE International Workshop on, 2005, pp. 52-57.
78 Vivek Singh, "Lithography at 14nm and beyond: Choices and challenges," Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, p. 459, 5-9 June 2011.
79 T. Chen, P. Luo, and H. Fu, "Signal integrity analysis of high speed USB IO and interconnection," in IEEE Electrical Design of Advanced Packaging Systems Symposium (EDAPS 2009), 2009, pp. 1-5.
80 T. O. Dickson, E. Laskin, I. Khalid, R. Beerkens, J. Xie, B. Karajica, and S. P. Voinigescu, "A 72 Gb/s 2 1 PRBS generator in SiGe BiCMOS technology," in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, Feb. 2005.
81 T. Okumoto, K. Yoshikawa, and M. Nagata, "Monitoring effective supply voltage within power rails of integrated circuits," in Solid State Circuits Conference (A-SSCC), 2012 IEEE Asian, 2012, pp. 113-116.
82 T. Okumoto, K. Yoshikawa, and M. Nagata, "Monitoring effective supply voltage within power rails of integrated circuits," in Solid State Circuits Conference (A-SSCC), 2012 IEEE Asian, 2012, pp. 113-116.
83 V. Melikyan, A. Balabanyan, A. Hayrapetyan, and N. Melikyan, "Receiver/transmitter input/output termination resistance calibration method," in Electronics and Nanotechnology (ELNANO), 2013 IEEE XXXIII International Scientific Conference, 2013, pp. 126-130.
84 B. G. Van Treuren, C. Chiang, and K. Honaker, "Problems Using Boundary-Scan for Memory Cluster Tests," Test Conference, 2008. ITC 2008. IEEE International, pp. 1-10, 28-30 Oct. 2008.
85 D. B. Venkatesha, S. Chitrashekaraiah, A. A. Rezazadeh, and A. Thiede, "Interpreting InGaP/GaAs DHBT Eye diagrams using small signal parameters," Microwave Integrated Circuit Conference, 2007. EuMIC 2007. European.
86 Y. Dongkyu, K. Taehyung, and P. Sungju, "A microcode-based memory BIST implementing modified march algorithm," in Test Symposium, 2001. Proceedings. 10th Asian, 2001, pp. 391-395.
87 Y. Jen-Chieh, W. Chi-Feng, C. Kuo-Liang, C. Yung-Fa, H. Chih-Tsun, and W. Cheng-Wen, "Flash memory built-in self-test using March-like algorithms," in Electronic Design, Test and Applications, Proceedings of the First IEEE International Workshop on, 2002, pp. 137-141.
88 Y. Jen-Chieh, W. Chi-Feng, C. Kuo-Liang, C. Yung-Fa, H. Chih-Tsun, and W. Cheng-Wen, "Flash memory built-in self-test using March-like algorithms," in Electronic Design, Test and Applications, Proceedings of the First IEEE International Workshop on, 2002, pp. 137-141.
89 Z. Al-Ars and A. J. van de Goor, "Test generation and optimization for DRAM cell defects using electrical simulation," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 22, no. 10, pp. 1371-1384, Oct. 2003.