AN ABSTRACT OF THE DISSERTATION OF
Bruce Querbach for the degree of Doctor of Philosophy in Electrical and Computer
Engineering presented on June 12, 2015.
Title: The Architecture of a Reusable Built-In Self-Test for Link Training, IO and Memory
Defect Detection and Auto Repair on 14nm Intel SOC.
Abstract approved: ______
Patrick Y. Chiang
The complexity of designing and testing today’s system on chip (SOC) is increasing due to greater integrated circuit (IC) density and higher IO and memory frequencies. SOCs for the mobile phone and tablet market face the additional challenges of short product development windows, at times less than six months, and low cost boards and platforms that limit physical access to test access ports (TAP). This dissertation presents the architecture of a reusable built-in self-test (BIST) engine called the converged pattern generator and checker (CPGC), which was developed to address these challenges. It is used in the critical path of millions of x86 SOCs for DDR3, DDR4, LP-DDR3, and LP-DDR4 IO initialization and link training. The CPGC is also an essential BIST engine for IO and memory defect detection and, in some cases, the automatic repair of detected memory defects. The software and hardware infrastructure that leverages CPU L2/L3 cache to enable cache based testing (CBT) and the parallel execution of the CPGC Intel BIST engine is shown to improve test time by 60x to 170x over conventional TAP based testing. In addition, silicon results are presented showing that CPGC enables easy debug of inter symbol interference (ISI) and crosstalk issues in silicon and boards, enables fast IO link training, improves validation time by 3x and, in some instances, reduces SOC and platform power by 5% to 11% through closed loop IO circuit power optimization. This CPGC BIST engine has been developed into a reusable IP solution, which has been successfully designed into at least 11 Intel CPUs and SOCs (32nm-14nm), with seven of these successfully debugged, tested, and launched into the marketplace. Ultimately, this has led to over 100 million CPUs using this architecture being shipped within one quarter.
©Copyright by Bruce Querbach
June 12th, 2015
All Rights Reserved
The Architecture of a Reusable Built-In Self-Test for Link Training, IO and Memory
Defect Detection and Auto Repair on 14nm Intel SOC
by
Bruce Querbach
A DISSERTATION
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Doctor of Philosophy
Presented June 12, 2015
Commencement June 2015
Doctor of Philosophy dissertation of Bruce Querbach presented on June 12, 2015
APPROVED:
Major Professor, representing Electrical and Computer Engineering
Director of the school of Electrical Engineering and Computer Science
Dean of the Graduate School
I understand that my dissertation will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my dissertation to any reader upon request.
Bruce Querbach, Author
ACKNOWLEDGEMENTS
The author expresses sincere appreciation to:
1. Patrick Chiang for his patience and advice throughout the years
2. Sudeep Puligundla for his insights and advice throughout this process
3. Rahul Khanna for his collaboration, showing me the ropes, and for his patience over many paper revisions
4. Sabyasachi Deyati for his inspiration, advice, and paper contribution
5. David Blankenbeckler for paper collaboration and sharing his experience
6. Peter Yanyang Tan for collecting RMT and JTAG data and for paper contribution
7. Roger R Rees for his guidance, advice over the years, and paper reviews
8. Anne Meixner for her advice and paper edit and review
9. Ron Anderson for sharing his insights on DDR debug, data collection, and paper contribution
10. Adam Taylor for sharing his expertise, post silicon data validation collection, and paper contribution
11. Daniel Becerra for collecting valuable data on DDR interface and for paper contribution
12. Zale T Schoenborn for his visionary ideas and support over the years, and paper contribution
13. David G Ellis for his guidance over the years, for teaching, for sharing his expertise in digital design, and for paper contribution
14. Yulan Zhang for her efforts and contribution in simulation and modeling CPGC, and paper contribution
15. Thilak K Poriyani House Raju, and Rucha P Purandare for their simulation, design, validation, and data collection
16. Van Lovelace for sharing his knowledge on BIOS and MRC and paper contribution
17. Joe Crop for his advice, paper edits, and paper contribution
18. Amy Querbach for her patience and support throughout time
19. Ram Mallela, Buz Hampton (GM), John D Barton (VP), and Rani Borkar (Sr. VP) for management support and funding
Contribution of authors
From Intel Corporation:
Sudeep Puligundla, Rahul Khanna, David Blankenbeckler, Peter Yanyang Tan, Ron Anderson, Adam Taylor, Daniel Becerra, Zale T Schoenborn, David G Ellis, Yulan Zhang, and Van Lovelace.
From Georgia Institute of Technology:
Sabyasachi Deyati
From Oregon State University:
Joe Crop and Patrick Chiang.
TABLE OF CONTENTS Page
1 Dissertation Overview ...... 1
2 Introduction and Literature Review ...... 4
2.1 Challenges of IO Bandwidth, Frequency, and Low Power ...... 4
2.2 Process Variation...... 7
2.3 Temperature Variation ...... 8
2.4 Aging Effects...... 9
2.5 Calibration Methods and BIST ...... 11
2.6 High Speed Differential IO (HSIO) ...... 15
2.7 Memory Single Ended IO ...... 17
2.8 Memory Cell Fault Models ...... 20
2.9 Memory Test Algorithms ...... 25
2.10 Research Goals and methodology ...... 27
2.10.1 Goal ...... 27
2.10.2 Methodology ...... 28
2.11 References ...... 30
3 Comparison of hardware and software stress method ...... 35
3.1 Introduction ...... 36
3.2 The APGC Architecture ...... 38
3.3 The Aggressor/Victim Pattern Rotation ...... 40
3.4 The Software Learning Algorithm and Pattern ...... 41
3.5 Margin Result ...... 41
3.6 Receiver Voltage Margin – Low Side ...... 42
3.7 Receiver Voltage Margin – High Side ...... 42
3.8 Receiver Timing Margin – Low Side ...... 43
3.9 Transmitter Voltage Margin – Low Side ...... 43
3.10 Transmitter Voltage Margin – High Side ...... 44
3.11 Transmitter Timing Margin – Low Side...... 44
3.12 Comparison of Test Execution Time of APGC and Software ...... 45
3.13 Conclusion ...... 46
3.14 Acknowledgements ...... 47
3.15 References ...... 48
4 Improved Debug, Validation and Test Time ...... 51
4.1 Introduction ...... 52
4.2 Failure Modes ...... 55
4.3 CPGC Architecture ...... 58
4.3.1 Architecture for Software Assisted Repair Technology ...... 59
4.3.2 Unified Pattern Sequencer ...... 61
4.3.3 LMN ...... 62
4.3.4 Fixed Pattern Buffer ...... 62
4.3.5 LFSR ...... 62
4.3.6 Flop Based Data Pattern Buffer ...... 63
4.3.7 Register File Based Pattern ...... 63
4.3.8 Data Pattern Dynamic Inversion/DC ...... 63
4.3.9 Post-Scheduler Command and Address Pattern Generation ...... 64
4.3.10 Error Status, Logging and Masking ...... 64
4.3.11 Command and Address Generator ...... 64
4.3.12 Variable Retention Time Stress ...... 66
4.3.13 Gate Count ...... 66
4.3.14 Power ...... 67
4.3.15 Protocol ...... 68
4.4 CPGC Usage and IP Reuse Model ...... 68
4.5 Software Architecture ...... 75
4.5.1 Software Assisted Dynamic Margining ...... 78
4.5.2 Software Assisted CPGC Test Invocation ...... 78
4.6 Simulation Results...... 79
4.7 CPGC Results and Silicon Measurements ...... 81
4.7.1 Debugging Boot Failure Using CPGC LFSR Addressing ...... 86
4.8 Summary ...... 91
4.9 Acknowledgments ...... 92
4.10 References ...... 92
5 Platform execution time improvement using L3 Cache ...... 97
5.1 Introduction ...... 98
5.2 CBT Uses Cache-As-RAM Mode ...... 101
5.2.1 Non-Evict Mode (NEM) ...... 101
5.2.2 NEM Exit to Preboot Execution Environment (PXE) ...... 104
5.3 CPGC BIST Engine ...... 106
5.3.1 Architecture Details ...... 106
5.3.2 Address Instruction (Loop3) ...... 109
5.3.3 Data Instruction (Loop2)...... 109
5.3.4 Data Capabilities Overview ...... 110
5.3.5 Algorithm Instruction (Loop 1) ...... 110
5.3.6 Command Instruction (Loop 0) ...... 112
5.3.7 Offset Command Instruction (Loop 0) ...... 112
5.3.8 Examples of Usage Models for the Offset Command Instruction ...... 113
5.3.9 Refresh Control Generator ...... 113
5.4 Rank Margin Tool (RMT) ...... 114
5.5 Traditional JTAG and TAP Based Tests ...... 122
5.6 Performance (Test Time) Comparison ...... 126
5.7 Generalized Cache Based Testing ...... 127
5.8 Conclusion ...... 128
5.9 Acknowledgements ...... 129
5.10 References ...... 130
6 Architecture of a Reusable BIST Engine (CPGC) ...... 135
6.1 Introduction ...... 136
6.2 2nd Generation CPGC for 14nm 3DS SOC Memory and IO ...... 139
6.2.1 CPGC Architecture Overview ...... 139
6.2.2 CPGC Architecture Detail ...... 142
6.2.3 Algorithmic Memory Cell Testing...... 144
6.2.4 CPGC Software infrastructure Architecture ...... 146
6.2.5 CPGC Applications and Results ...... 148
6.3 Summary ...... 156
6.4 Acknowledgments ...... 157
6.5 References ...... 157
7 Conclusion ...... 161
8 Bibliography ...... 164
LIST OF FIGURES Page
INTEL LAB’S RESEARCH PREDICTS HSIO BANDWIDTH INCREASING 3X/2YRS 4
INTEL LAB’S RESEARCH PREDICTS -20%/YEAR IN IO POWER 5
BER REDUCES AS RECEIVER DATA EYE SHRINKS 6
20% VT VARIATION LIES BEYOND 3 STANDARD DEVIATION 7
AS T (37C->125C), JITTER INCREASED 30PS, EYE-HEIGHT DECREASED ABOUT 20MV 8
AS P (TT->SS), JITTER +25PS, EYE-HEIGHT -30MV 9
BOND BREAKING RATE INCREASE WITH (IDS/W) AND TEMPERATURE 10
DDR3 SELF-CALIBRATION 12
IO TERMINATION CALIBRATION CIRCUIT 13
INTEL HSIO RECEIVER ARCHITECTURE 15
INTEL PCIE GEN 3 8GT/S HSIO RECEIVER AND DFX 16
INTEL HASWELL (22NM) DDR CASCODE DRIVER 18
INTEL HASWELL (22NM) EDRAM AT 6.4GT/S 19
DRAM ORGANIZATION 20
ADJACENT NEIGHBORHOOD OF A MEMORY CELL B 21
DRAM FAULT SIMULATION MODEL 22
MATS+ TEST ALGORITHM 26
THE CONCEPTUAL MEMORY LINK TRANSMITTER AND RECEIVER 39
THE APGC ARCHITECTURE 40
THE RX VOLTAGE LOWER SIDE BER (ON Q-SCALE) VS. VREF TICKS 42
THE RX VOLTAGE HIGHER SIDE BER (ON Q-SCALE) VS. VREF TICKS 43
THE RX TIMING LOWER SIDE BER (ON Q-SCALE) VS. VREF TICKS 43
THE TX VOLTAGE LOWER SIDE (ON Q-SCALE) 44
THE TX VOLTAGE HIGHER SIDE (ON Q-SCALE) 44
THE TX TIMING LOWER SIDE (ON Q-SCALE) 45
CPGC BIST ENGINE WITH SOFTWARE ASSISTED REPAIR TECHNOLOGY 54
CPGC ARCHITECTURE OVERVIEW 58
CPGC SW ASSISTED REPAIR TECHNOLOGY FLOW 59
CPGC UNIFIED PATTERN SEQUENCER AND BUFFER 61
CPGC ROW HAMMER STRESS PATTERN 65
UEFI COMPLIANT BIOS THAT HOSTS THE CPGC TESTS 77
READ/WRITE CREDITS FOR STANDARD IP INTERFACE 79
BASE OFFSET MEMORY STRESS ALGORITHM 79
CHECKER BOARD MEMORY DATA STRESS ALGORITHM 79
PRBS PATTERNS STRESS PATTERNS 80
VICTIM AND AGGRESSOR PATTERN 81
SILICON MEASUREMENT OF DDR IO EYE DIAGRAM 82
EYE DIAGRAM WITH VA STRESS PATTERN TURNED ON 83
CPGC BIOS MEASURED EYE HEIGHT WITH ALL LANES ENABLED 83
CPGC BIOS MEASURED EYE HEIGHT WITH BIT 5/ECC 5 DISABLED 84
CPGC IDENTIFIED MISSING SHIELDING ON BOARD LAYOUT 86
CPGC ARCHITECTURE IO TRAINING BOOT FAILURE DEBUG 87
IO DQS GLITCH CORRUPTING RX FIFO DEBUG 88
DDR IO INITIALIZATION AND TRAINING BIOS MAPPED EYE 90
CACHE BASED TEST ARCHITECTURE 100
NON-EVICT MODE ENTRY FLOW DIAGRAM 103
NON-EVICT MODE EXIT FLOW DIAGRAM 105
CPGC ARCHITECTURE LOOPS 107
CPGC ARCHITECTURE FLOW DIAGRAM 109
CPGC ARCHITECTURE LOOP 3, ADDRESS INSTRUCTION 109
CPGC ARCHITECTURE LOOP 2, DATA INSTRUCTION. 110
CPGC ARCHITECTURE LOOP 1, ALGORITHM INSTRUCTION 111
CPGC ARCHITECTURE LOOP 0, COMMAND INSTRUCTION 112
CPGC ARCHITECTURE, OFFSET INSTRUCTION 113
REFRESH CONTROL GENERATOR 114
CROSSTALK AND ISI ACROSS 64 DQ BITS (RMT) 116
VCCIO SUPPLY BOUNCE AS A FUNCTION OF DQ PATTERN FREQUENCY 117
MEASURED VOLTAGE AND TIMING DATA EYE (CBT) 118
TRANSMITTER DATA PER-BIT MARGIN (CBT) 119
RECEIVER DATA PER-BIT MARGIN (CBT) 120
TRANSMITTER REFERENCE VOLTAGE PER-BIT MARGIN (CBT) 121
RECEIVER REFERENCE VOLTAGE PER-BIT MARGIN (CBT) 122
TRANSMITTER DATA TIMING PER-BIT MARGIN (TAP) 123
RECEIVER DATA TIMING PER-BIT MARGIN (TAP) 124
TRANSMITTER REFERENCE VOLTAGE PER-BIT MARGIN (TAP) 125
RECEIVER REFERENCE VOLTAGE PER-BIT MARGIN (TAP) 126
SOC MEMORY SUB-SYSTEM AND CPGC ARCHITECTURE OVERVIEW 141
UEFI BOOT SERVICE LAYER THAT SUPPORTS CPGC TESTS MODULE. 148
CPGC SILICON RESULTS 152
CPGC CLOSED CALIBRATION LOOP SAVES CPU AND DRAM POWER 155
LIST OF TABLES Page
VICTIM AGGRESSOR PATTERN ROTATION WITH TWO DIFFERENT LFSRS 40
CROSSTALK/ISI IMPACT TO 11TH BIT IN TIME OF DQ61 USING SOFTWARE 41
VICTIM AGGRESSOR PATTERN ROTATION WITH TWO DIFFERENT LFSRS 46
CPGC GATE COUNT 66
CPGC POWER ESTIMATES 67
CPGC ARCHITECTURE USAGE FOR VALIDATION 68
CPGC ARCHITECTURE USAGE FOR DEBUG 71
CPGC ARCHITECTURE USAGE FOR MEMORY INITIALIZATION/TRAINING 73
CPGC ARCHITECTURE SAVINGS TO VALIDATION, TEST, AND DEBUG TIME 91
PERFORMANCE GAIN OF CACHE BASED TESTS 127
COMPARISON OF CPGC FEATURES TO PRIOR WORK. 138
1 Dissertation Overview
This dissertation is organized as follows.
1. Dissertation Overview, which is the current chapter.
2. Introduction and Literature Review presents prior research and publications from experts in the field to provide background on the challenges faced within IO and memory technology, as well as the existing efforts in this field. Specifically, the introduction lays the background for Intel high speed IO, memory IO, inter-symbol-interference (ISI), crosstalk (which especially dominates DDR), the quest for effective and quick test patterns, test time and functional usage, memory cell failure models, and memory cell test algorithms.
3. Research Goals will cover the goals of this dissertation based on the challenges identified above in the introduction.
4. Research Methodology will cover the methods used within this dissertation to achieve the goals laid out above.
5. “Comparison of Hardware Based and Software Based Stress Testing of Memory IO Interface” was the first publication (2013 IEEE MWSCAS). It focused on identifying effective stress patterns and on the effectiveness and speed of a BIST approach.
6. “A Reusable BIST with Software Assisted Repair Technology for Improved Memory and IO Debug, Validation, and Test Time (IO Debug and Validation)” was the next publication (2014 IEEE ITC). It focused on using a BIST technology to improve memory and IO debug, validation, and test time. This usage turns out to be one of the largest values of the BIST to Intel and to the industry. In addition, a software assisted memory cell repair technology was introduced. There was a lot of industry interest in how to speed up BIST execution and overall test time using cache, which led to the third paper.
7. “Platform IO and System Memory Test Using L3 Cache Based Test (CBT) and Parallel Execution of CPGC Intel BIST Engine (IO Link Training)”. This paper was submitted to 2015 IEEE ITC due to the surprisingly large interest in reducing overall test execution time through running BIST in cache. It focused in more detail on the hardware and software infrastructure developed to leverage Intel’s cache-as-RAM (CAR) and non-eviction mode (NEM) to execute one BIST engine per memory controller in parallel. This led to significant test time reductions. Because this BIST runs so fast, CPGC is also used in IO link training and initialization. The combination of cache based code and BIST engines running in parallel gave current and future IO circuit designers and architects a powerful, flexible, yet low cost method to adapt to unforeseen conditions post-silicon. This was a valued contribution not only to Intel but also to the industry.
8. “Architecture of a Reusable BIST Engine for Detection and Auto Correction of Memory Failures and for IO Debug, Validation, Link Training, and Power Optimization on 14nm SOC” was the last publication submitted at the time of writing this dissertation. More of the CPGC architecture is presented, including the use of a pipeline architecture to enable back-to-back DDR transactions. The architecture is innovative as the first to combine IO BIST with memory BIST. An architecture that is capable of running all the classical memory test algorithms such as “scan” and “march”, as well as newer user defined algorithms, is presented. The architecture was designed for testing 3D stacked (3DS) ICs, but unfortunately, at the time of writing this dissertation, not enough 3DS IC memory cell test data was available, and thus most of the data presented are for memory IO tests. Lastly, the challenges of reducing IO power were explored using the flexible BIST engine not only to optimize IO performance, but also to minimize IO power through fine power tuning.
9. Conclusion chapter summarizes this dissertation.
10. Bibliography chapter lists the references.
2 Introduction and Literature Review
2.1 Challenges of IO Bandwidth, Frequency, and Low Power
Shrinking process technology driven by Moore’s law of doubling the transistors in integrated circuits (IC) has led to continuous improvements in computer performance since 1965. The ever increasing computing performance of single core and multiple core processors has driven an ever increasing demand on computer interface input and output (IO) bandwidth. According to Intel lab research [1], shown in Figure 1, the bandwidth of high speed IO (HSIO) for computers is expected to increase 3X every two years.
Figure 1: Intel lab’s research predicts HSIO bandwidth increasing 3X/2yrs (source F. O’Mahony [1])
At the same time, the power density of a computer CPU core has reached the limit of existing cooling technology. To push computing performance beyond frequency scaling of the CPU core, dual core, quad core, and multi-core CPUs have been introduced not only by Intel, but also by a number of other industry players. In conjunction with the ever increasing IO bandwidth, the demand on HSIO and memory IO power efficiency continues to trend towards lower power. According to Figure 2, this corresponds to a 20% reduction in IO power every year.
Figure 2: Intel lab’s research predicts -20%/year in IO power (source F. O’Mahony [1])
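The two trends quoted above compound over time, and a short calculation makes the scale concrete. The sketch below only restates the 3X-per-two-years and 20%-per-year figures from the text; the function names are hypothetical and the constant-rate assumption is a simplification.

```python
# Compound-rate arithmetic for the two trends quoted in the text.
# Assumption: both trends continue at a constant rate.


def bandwidth_multiplier(years: float, factor: float = 3.0,
                         period_years: float = 2.0) -> float:
    """Bandwidth growth after `years` if it multiplies by `factor`
    every `period_years`."""
    return factor ** (years / period_years)


def power_fraction(years: float, yearly_reduction: float = 0.20) -> float:
    """Fraction of the starting IO power left after `years` of a fixed
    yearly reduction."""
    return (1.0 - yearly_reduction) ** years


if __name__ == "__main__":
    for years in (1, 2, 4, 6):
        print(years,
              round(bandwidth_multiplier(years), 2),
              round(power_fraction(years), 3))
```

Under these assumptions, bandwidth grows by roughly 1.73X per year while IO power falls to 64% of its starting value after two years.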
Power reduction must be balanced with acceptable performance and error rates. Bit error rate (BER) can be simply defined as the number of bit errors divided by the number of bits transmitted within a period of time or a window of bits. In Figure 3, the vertical axis is the voltage detected at the IO receiver [2], and the horizontal axis is the timing at the receiver expressed in terms of unit interval (UI). As the receiver data eye contour shrinks toward its center, so does the BER. Both the data eye shown here and BER are important metrics of IO health. A number of non-idealities can reduce the eye, including clock jitter, inter-symbol-interference (ISI), crosstalk, and supply noise. ISI and crosstalk are highly channel related, and thus will be discussed further.
Figure 3: BER reduces as receiver data eye shrinks (source B. Casper [2])
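As an illustration of the BER definition, the following minimal sketch counts mismatches between a transmitted and a received bit sequence and divides by the window length. The function name and the list-of-bits representation are assumptions made for this example only.

```python
from typing import Sequence


def bit_error_rate(sent: Sequence[int], received: Sequence[int]) -> float:
    """Return the fraction of bits in the window that were received wrong."""
    if len(sent) != len(received):
        raise ValueError("sent and received windows must have equal length")
    if not sent:
        raise ValueError("window must contain at least one bit")
    # Count positions where the received bit differs from the sent bit.
    errors = sum(1 for s, r in zip(sent, received) if s != r)
    return errors / len(sent)


if __name__ == "__main__":
    tx = [0, 1, 1, 0, 1, 0, 0, 1]
    rx = [0, 1, 0, 0, 1, 0, 0, 1]  # one bit flipped in the channel
    print(bit_error_rate(tx, rx))
```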
2.2 Process Variation
Additional challenges for IO in advanced CMOS technology include process, voltage, and temperature (PVT) variation. As transistors shrink in minimum dimension, transistor-to-transistor, IC-to-IC, and wafer-to-wafer process variation increasingly affects not only HSIO and memory IO circuits, but all IC circuits. Latif, RAICS 2011 [3], demonstrated in Figure 4 that ~20% of the silicon process variation in the threshold voltage (VT) distribution lies outside of the 3 sigma window.
Figure 4: 20% VT variation lies beyond 3 standard deviation (source Latif, RAICS 2011 [3])
2.3 Temperature Variation
In addition to process variation, temperature and voltage variation can significantly affect HSIO, memory IO, and all IC circuit performance. Tinghou Chen, EDAPS 2009 [4], studied temperature and process variability on universal serial bus (USB2) IO. According to Figure 5, as temperature increased from 37C to 125C, the jitter increased by 30ps and the eye height decreased by about 20mV.
Figure 5: As T (37C->125C), jitter increased 30ps, eye-height decreased about 20mV (Source T. Chen [4])
Chen [4] also showed that as the process moved from typical to slow, the jitter increased by 25ps and the eye height decreased by about 30mV (Figure 6). The eye diagram and eye height, which will be discussed in more detail later, are important metrics for measuring IO health.
Figure 6: As P (TT->SS), jitter +25ps, eye-height -30mV (Source T. Chen [4])
2.4 Aging Effects
In addition to the PVT variation above, as ICs operate in these demanding environments over time, aging effects can degrade circuit performance. For example, hot electrons cause the bond breaking rate to increase with temperature and current [5], which can fundamentally change HSIO and memory IO circuit behavior. “Bond breaking rate” is defined by A. Bravaix [5] as the breaking of Si-H bonds over time by electrons. This leads to degraded performance and, eventually, non-functional circuits (ICs dying of old age). According to Figure 7, the bond breaking rate increases with current density (Ids/W) and temperature.
Figure 7: Bond breaking rate increase with (Ids/W) and Temperature (source A. Bravaix [5])
2.5 Calibration Methods and BIST
Given the PVT variation mentioned above, no modern CMOS device or circuit would behave exactly as designed, and no high speed IO circuit would work without calibration. Having accurate on die resistors, reference voltages, and reference currents based on calibration is critical to circuit functionality. There have been a number of prior works on circuit calibration and IO calibration. A self-calibration method was demonstrated for DDR3: specifically, the on die termination (ODT) [6] was calibrated using an external calibration controller, and the calibrated results were stored in registers on the DRAM (Figure 8).
Figure 8: DDR3 Self-calibration (source C. Park [6])
Any calibration, such as ODT calibration, fundamentally requires a sensor to measure the input, a controller to run the calibration algorithm, and an actuator to drive the new calibrated value. More on calibration will be discussed later.
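The sensor, controller, and actuator roles can be sketched as a small closed loop. The example below is a simulation under stated assumptions, not the calibration circuit of [6]: the `SimulatedTermination` class, its parallel-leg resistance model, and the one-tick-at-a-time controller are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SimulatedTermination:
    """Hypothetical on die termination built from identical parallel legs."""
    leg_ohms: float = 960.0  # single-leg resistance; shifts with process
    code: int = 1            # number of legs enabled
    max_code: int = 63

    def measure(self) -> float:
        """Sensor: report the present termination resistance in ohms."""
        return self.leg_ohms / self.code

    def drive(self, code: int) -> None:
        """Actuator: apply a new code, clamped to the legal range."""
        self.code = max(1, min(self.max_code, code))


def calibrate(term: SimulatedTermination, target_ohms: float,
              max_steps: int = 64) -> int:
    """Controller: move one tick toward the target until the error stops
    shrinking, then keep the best code found."""
    best_error = abs(term.measure() - target_ohms)
    for _ in range(max_steps):
        previous = term.code
        # Too resistive: enable one more leg. Otherwise disable one.
        direction = 1 if term.measure() > target_ohms else -1
        term.drive(previous + direction)
        new_error = abs(term.measure() - target_ohms)
        if term.code == previous or new_error >= best_error:
            term.drive(previous)  # revert the tick that did not help
            break
        best_error = new_error
    return term.code


if __name__ == "__main__":
    typical = SimulatedTermination(leg_ohms=960.0)
    slow = SimulatedTermination(leg_ohms=1100.0)
    print(calibrate(typical, 60.0), calibrate(slow, 60.0))
```

In this model a more resistive (slower) process corner simply settles on a larger code, which is the kind of PVT compensation the calibration loop is meant to provide.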
Melikyan [7] in 2013 calibrated I/O termination resistors using the calibration circuit shown in Figure 9. An integrated reference current generator generated a reference current, which was compared to locally generated current through resistor banks. The compared results were sent to a logic module to control the actual transmitter (TX) and receiver (RX) through registers.
Figure 9: IO termination calibration circuit (source V. Melikyan [7])
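A comparator-driven search over a resistor bank can be sketched as follows. This is only an illustration: the successive-approximation search, the linear current-per-leg model, and all names are assumptions for the example, and the logic module in [7] may use a different search.

```python
from typing import Callable


def make_comparator(i_ref_ua: float, i_leg_ua: float) -> Callable[[int], bool]:
    """Build a hypothetical comparator: True when the current through the
    enabled resistor legs exceeds the reference current."""
    def compare(code: int) -> bool:
        return code * i_leg_ua > i_ref_ua
    return compare


def sar_calibrate(compare: Callable[[int], bool], bits: int = 6) -> int:
    """Successive approximation: test one bit per step, from the most
    significant bit down, keeping a bit only if the local current does
    not exceed the reference."""
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if not compare(trial):
            code = trial
    return code


if __name__ == "__main__":
    # 100 uA reference and 7.5 uA per enabled leg are made-up values.
    print(sar_calibrate(make_comparator(100.0, 7.5)))
```

A search of this form needs only one comparison per register bit, which is one way to keep the calibration time short.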
Regardless of the method of calibration, or the implementation choice between hardware and software, the calibration time should be minimized. Calibration time is also known as test time or, in a functional usage, as training time. Calibration time should be as short as possible to enable the circuit to be operational most of the time. If the test time is too long, it is costly in a high volume manufacturing environment. If the training time is too long, the experience is simply not acceptable to the end users. Not only does the test time need to be short, but the test or calibration also needs to be accurate to be effective and, in the long term, to save power by not having to repeat calibration often.
2.6 High Speed Differential IO (HSIO)
Despite the PVT variation challenges that HSIO and memory IO circuits face, and the added challenges of increasing demand on IO bandwidth and lower IO power, Intel has taken advantage of smaller and faster transistors to continue to push the edges of HSIO and memory IO. Intel has demonstrated HSIO for PCIe generation 3 (PCIe Gen3) running at 8GT/s in mass-produced commercial products. Shown in Figure 10 is an example of Intel HSIO receiver architecture [8]. This architecture consists of a differential input, an equalizer, a receiver amplifier, and a digital clock and data recovery (CDR) control loop. Taking advantage of a smaller transistor size, pushing more and more of the analog circuit control loops into the digital domain helps to minimize the PVT variation effect introduced above. This strategy becomes more important as device size shrinks and digital logic becomes cheaper from a silicon real-estate perspective.
Figure 10: Intel HSIO receiver architecture (source L. Chen [8])
Figure 11 from S. Puligundla [9] presents more details of an Intel PCIe Gen3 HSIO receiver. In this case, a decision feedback equalizer (DFE) was used in combination with automatic gain control (AGC) to provide a real-time eye monitor at the receiver. Depending on the channel behavior, the AGC would adjust the receiver amplifier gain, and the DFE would try to remove inter-symbol-interference (ISI) to ensure that an adequate receiver eye is observed. Since the receiver eye is already available, a design-for-x (DFX) circuit [9] was added to enable design-for-test, design-for-manufacturing, and design-for-validation. Hence, the X refers to all post silicon activities that measure the eye. More on DFX will be discussed later.
Figure 11: Intel PCIe Gen 3 8GT/S HSIO receiver and DFX (source S. Puligundla [9])
2.7 Memory Single Ended IO
Intel recently published papers on its 22nm Haswell design [10]. For memory IO such as DDR, Intel used a cascode output driver and a pre-driver level shifter to shift from Intel internal CMOS voltages to DDR voltages (shown in Figure 12). Because DDR is a single ended IO bus, unlike the differential IO bus of PCIe, it suffers from crosstalk, which is the interference from a neighboring IO wire. At the same time, DDR IO bus frequency is relatively low compared to PCIe Gen3, again due to the single-ended nature of the bus. This dissertation is largely focused on memory IO, specifically DDR IO characterization, testing, validation, and training. Hence, much more discussion on DDR will follow.
Figure 12: Intel Haswell (22nm) DDR cascode driver (source N. Kurd [10])
One of the motivations for DDR to be a single ended bus is IO pin density. To keep the single ended IO bus and push the IO frequency, while achieving higher IO bandwidth, Intel Haswell moved some of the DRAM on package (shown in Figure 13). Embedded DRAM (eDRAM) has a much shorter trace length than DDR, hence Intel Haswell was able to push the IO frequency up to 6.4GT/s [10].
Figure 13: Intel Haswell (22nm) eDRAM at 6.4GT/s (source N. Kurd [10])
2.8 Memory Cell Fault Models
In addition to memory IO challenges, as memory density increases, retaining data in memory cells and testing for memory cell failures also become more challenging, due to the increasing test time associated with higher memory density and the lack of probe access points. Therefore, some background on memory cell fault models is given here. Figure 14 shows a basic DRAM organization [11] where the addresses are decoded into x (a.k.a. row) and y (a.k.a. column).
Figure 14: DRAM organization (source K. K. Saluja and K. Kinoshita [11])
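The row and column decode can be illustrated with a few lines of bit arithmetic. The sketch assumes, purely for illustration, that the column occupies the low-order address bits; real DRAM address maps differ, and the function name is hypothetical.

```python
from typing import Tuple


def decode_address(address: int, column_bits: int) -> Tuple[int, int]:
    """Split a flat cell address into (row, column).

    Assumption: the low `column_bits` bits select the column (y) and the
    remaining high bits select the row (x).
    """
    row = address >> column_bits
    column = address & ((1 << column_bits) - 1)
    return row, column


if __name__ == "__main__":
    # Address 0b101_0011 with 4 column bits: row 0b101, column 0b0011.
    print(decode_address(0b1010011, 4))
```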
Each memory cell within the DRAM can be labeled as cell B (see Figure 15). Cell B has neighbor cells that are labeled north (N), south (S), west (W), and east (E) [12]. In addition, [12] defines static neighborhood pattern-sensitive faults as static interactions between cell B and its neighbors (e.g., N, S, W, and E).
Figure 15: Adjacent neighborhood of a memory cell B (source P. de Jong and A. J. van de Goor [12])
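The adjacent neighborhood of a base cell B can be enumerated directly from its row and column. In this sketch, the convention that north is the previous row and west is the previous column is an assumption, as is the function name; cells on the array edge simply have fewer neighbors.

```python
from typing import Dict, Tuple


def adjacent_neighbors(row: int, col: int,
                       rows: int, cols: int) -> Dict[str, Tuple[int, int]]:
    """Return the N, S, W, E neighbors of base cell B at (row, col) that
    fall inside a rows x cols array."""
    candidates = {
        "N": (row - 1, col),
        "S": (row + 1, col),
        "W": (row, col - 1),
        "E": (row, col + 1),
    }
    # Drop neighbors that would lie outside the array boundary.
    return {name: (r, c) for name, (r, c) in candidates.items()
            if 0 <= r < rows and 0 <= c < cols}


if __name__ == "__main__":
    print(adjacent_neighbors(1, 1, 3, 3))  # interior cell: four neighbors
    print(adjacent_neighbors(0, 0, 3, 3))  # corner cell: two neighbors
```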
Different memory cell fault patterns have been studied, and can be categorized [11] as:
1) Stuck-at-faults (SAF), which are cells stuck at 0 or 1, and can be tested via DC.
2) Adjacent pattern-sensitive faults (APSF), which are static
3) Adjacent pattern-sensitive faults, which are dynamic
4) Pattern-sensitive faults (PSF), which are dynamic
Typically, the DRAM is tested on a tester sitting outside of the DRAM interface. The tester can be programmed to step through various algorithms and patterns. According to [11], algorithms that depend on an odd number of neighbor cells are hard to build into hardware, because the number does not fit easily as a power of two. The work presented in this dissertation solves this problem by making the number of neighbors a variable called N, which is programmable via register. This will be discussed later.
Z. Al-Ars and A. J. van de Goor [13] have studied the modeling of DRAM faults in simulation. According to Figure 16, a DRAM cell is represented as a capacitor; the bit line (BL) and word line (WL) represent address control similar to the x (row) and y (column) addresses above. O denotes an open, S denotes a short, and B denotes bridging.
Figure 16: DRAM fault simulation model (source Z. Al-Ars and A. J. van de Goor [13])
Van De Goor defined four base test classes [14]: electrical, march, base cell, and repetitive. He used the following shorthand notation to describe address walking directions for his tests:
1. ⇑ Denotes an increasing address order (that is, address 0, 1, 2, and so on),
2. ⇓ Denotes a decreasing address order, and
3. ⇕ Denotes that the test engineer can arbitrarily choose the address order to be ⇑ or ⇓
4. The ↗ denotes an address increment along the main diagonal
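The address orders in this notation map onto simple address sequences. The sketch below is illustrative only: the string keys "up", "down", and "any" stand in for the arrow symbols, "any" is resolved to increasing order as one permitted choice, and the function names are hypothetical.

```python
from typing import List, Tuple


def address_order(symbol: str, size: int) -> List[int]:
    """Return the address sequence for one march element."""
    if symbol == "up":        # increasing order: 0, 1, 2, ...
        return list(range(size))
    if symbol == "down":      # decreasing order: size-1, ..., 1, 0
        return list(range(size - 1, -1, -1))
    if symbol == "any":       # either order is allowed; pick increasing
        return list(range(size))
    raise ValueError(f"unknown address order: {symbol}")


def main_diagonal(rows: int, cols: int) -> List[Tuple[int, int]]:
    """Return (row, column) pairs stepping along the main diagonal."""
    return [(i, i) for i in range(min(rows, cols))]


if __name__ == "__main__":
    print(address_order("down", 4))
    print(main_diagonal(3, 4))
```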
The classical fault model of DRAM can be organized [15], [16] as:
1. Address decoder faults (AFs):
a. A certain address cannot access any cell
b. A certain address accesses multiple cells simultaneously
c. No address can access a certain cell
d. Multiple addresses can access a certain cell
2. Memory cell faults:
a. Stuck-at fault (SAF)
b. Stuck-open fault (SOF)
c. Transition fault (TF)
d. Data retention fault (DRF): (There is a lot of interest in DRF due to smaller transistor size, and more leaky DRAM cells. More on this will be discussed later)
e. Coupling fault (CF)
f. Bridging fault (BF): short between cells
g. Neighborhood pattern sensitive fault (NPSF)
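Two of the memory cell faults listed above, the stuck-at fault and the transition fault, can be modeled in a few lines for use in test algorithm experiments. The `FaultyMemory` class and its parameters are hypothetical and model one-bit cells only; they are not taken from [15] or [16].

```python
from typing import Dict, Iterable, Optional


class FaultyMemory:
    """One-bit-per-cell memory with injectable SAF and TF behavior."""

    def __init__(self, size: int,
                 stuck_at: Optional[Dict[int, int]] = None,
                 no_rise: Iterable[int] = (),
                 no_fall: Iterable[int] = ()) -> None:
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})  # address -> forced value
        self.no_rise = set(no_rise)           # cells that cannot go 0 -> 1
        self.no_fall = set(no_fall)           # cells that cannot go 1 -> 0
        for address, value in self.stuck_at.items():
            self.cells[address] = value

    def write(self, address: int, value: int) -> None:
        if address in self.stuck_at:
            return  # SAF: the write has no effect
        old = self.cells[address]
        if old == 0 and value == 1 and address in self.no_rise:
            return  # TF: rising transition fails
        if old == 1 and value == 0 and address in self.no_fall:
            return  # TF: falling transition fails
        self.cells[address] = value

    def read(self, address: int) -> int:
        return self.cells[address]


if __name__ == "__main__":
    memory = FaultyMemory(4, stuck_at={2: 1}, no_rise={0})
    memory.write(2, 0)
    memory.write(0, 1)
    print(memory.read(2), memory.read(0))
```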
2.9 Memory Test Algorithms
Given the above fault models, developing an efficient test algorithm that covers these faults in the shortest possible time is important and, at times, tied directly to millions of dollars. A number of classical memory cell test algorithms have been published, and some of them are discussed here. A march test involves walking through memory in any of the N, E, S, and W directions while writing 1s or 0s and reading back the expected values. In Van De Goor’s notation, the march test can be written in shorthand as:
{ (w0); (r0, w1, r1); (r1, w0, r0);
(w1); (r1, w0, r0); (r0, w1, r1) }
A modified algorithmic test sequence (MATS) test is the shortest march test for unlinked SAFs in memory cells, and it can be written as:
{ ⇕(w0); ⇕(r0, w1); ⇕(r1) }
{ ⇕(w1); ⇕(r1, w0); ⇕(r0) }
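A minimal sketch of the first MATS sequence, run against a small simulated memory with injected stuck-at faults, is shown below. The memory model, the helper names, and the choice of increasing address order for every element are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple


def make_memory(size: int, stuck_at: Dict[int, int]
                ) -> Tuple[Callable[[int], int], Callable[[int, int], None]]:
    """Build read/write functions for a memory with stuck-at cells."""
    cells = [stuck_at.get(address, 0) for address in range(size)]

    def read(address: int) -> int:
        return cells[address]

    def write(address: int, value: int) -> None:
        if address not in stuck_at:  # a stuck cell ignores writes
            cells[address] = value

    return read, write


def mats(read: Callable[[int], int], write: Callable[[int, int], None],
         size: int) -> List[int]:
    """Run { (w0); (r0, w1); (r1) } and return the failing addresses."""
    failures = set()
    for address in range(size):          # element 1: w0
        write(address, 0)
    for address in range(size):          # element 2: r0, w1
        if read(address) != 0:
            failures.add(address)
        write(address, 1)
    for address in range(size):          # element 3: r1
        if read(address) != 1:
            failures.add(address)
    return sorted(failures)


if __name__ == "__main__":
    read, write = make_memory(8, {3: 1, 5: 0})
    print(mats(read, write, 8))
```

The stuck-at-1 cell fails the r0 check and the stuck-at-0 cell fails the r1 check, so both polarities are covered with only four operations per cell.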
A MATS+ test initializes the memory to a known data background and steps through with a read-write sequence. The test is performed with both incrementing and decrementing addressing. The test is visually shown in Figure 17, and it is written as:
{ ⇕(w0); ⇑(r0, w1); ⇓(r1, w0) }
Figure 17: MATS+ test algorithm
The MATS++ test [17] sequence is an optimized test sequence similar to the MATS+ test that adds fault coverage for TFs. It can be written as:
{ ⇕(w0); ⇑(r0, w1); ⇓(r1, w0, r0) }
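The extra transition fault coverage of MATS++ can be seen in a small simulation. The sketch assumes the commonly published address orders (any, up, down) for both tests, uses a hypothetical `run_march` helper, and injects a cell that cannot make the 1 to 0 transition.

```python
from typing import FrozenSet, List, Sequence, Tuple

MarchElement = Tuple[str, Sequence[str]]

# Address orders are assumed; "up" is used where any order is allowed.
MATS_PLUS: List[MarchElement] = [
    ("up", ["w0"]), ("up", ["r0", "w1"]), ("down", ["r1", "w0"]),
]
MATS_PLUS_PLUS: List[MarchElement] = [
    ("up", ["w0"]), ("up", ["r0", "w1"]), ("down", ["r1", "w0", "r0"]),
]


def run_march(elements: Sequence[MarchElement], size: int,
              no_fall: FrozenSet[int] = frozenset()) -> List[int]:
    """Run a march test on a simulated memory whose `no_fall` cells cannot
    transition from 1 to 0. Returns the addresses that read back wrong."""
    cells = [0] * size
    failures = set()
    for order, operations in elements:
        addresses = range(size) if order == "up" else range(size - 1, -1, -1)
        for address in addresses:
            for operation in operations:
                value = int(operation[1])
                if operation[0] == "w":
                    blocked = (address in no_fall
                               and cells[address] == 1 and value == 0)
                    if not blocked:
                        cells[address] = value
                elif cells[address] != value:  # read and compare
                    failures.add(address)
    return sorted(failures)


if __name__ == "__main__":
    faulty = frozenset({4})
    print(run_march(MATS_PLUS, 8, faulty))       # fault escapes
    print(run_march(MATS_PLUS_PLUS, 8, faulty))  # fault is caught
```

In this model MATS+ ends on a w0 that is never read back, so the failed falling transition escapes; the final r0 in MATS++ is what exposes it.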
March C- is a modified march C test that removes redundancies. It detects unlinked AFs, SAFs, TFs, and all CFs, and can be written as:
{ ⇕(w0); ⇑(r0, w1); ⇑(r1, w0);
⇓(r0, w1); ⇓(r1, w0); ⇕(r0) }
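As an illustration of coupling fault detection, the sketch below runs March C- against a simulated memory with one inversion coupling fault, in which a rising write on an aggressor cell inverts a victim cell. The address orders (any, up, up, down, down, any) follow the commonly published form of March C-; the fault model and the `run_march` helper are assumptions for this example.

```python
from typing import List, Optional, Sequence, Tuple

MarchElement = Tuple[str, Sequence[str]]

MARCH_C_MINUS: List[MarchElement] = [
    ("any", ["w0"]),
    ("up", ["r0", "w1"]),
    ("up", ["r1", "w0"]),
    ("down", ["r0", "w1"]),
    ("down", ["r1", "w0"]),
    ("any", ["r0"]),
]


def run_march(elements: Sequence[MarchElement], size: int,
              coupling: Optional[Tuple[int, int]] = None) -> List[int]:
    """Run a march test. `coupling` is (aggressor, victim): a 0 -> 1 write
    on the aggressor inverts the victim. Returns failing addresses."""
    cells = [0] * size
    failures = set()
    for order, operations in elements:
        # "any" is resolved to increasing order, one permitted choice.
        addresses = (range(size - 1, -1, -1) if order == "down"
                     else range(size))
        for address in addresses:
            for operation in operations:
                value = int(operation[1])
                if operation[0] == "r":
                    if cells[address] != value:
                        failures.add(address)
                    continue
                rising = cells[address] == 0 and value == 1
                cells[address] = value
                if coupling and address == coupling[0] and rising:
                    cells[coupling[1]] ^= 1  # victim is inverted
    return sorted(failures)


if __name__ == "__main__":
    print(run_march(MARCH_C_MINUS, 4, coupling=(1, 2)))
    print(run_march(MARCH_C_MINUS, 4, coupling=(2, 1)))
```

Marching in both address directions is what lets the test catch the victim whether it sits above or below its aggressor.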
2.10 Research Goals and methodology
2.10.1 Goal
This dissertation is based on a series of published papers investigating the shortfalls identified above. For example, how is faster and higher bandwidth IO enabled? How is lower IO power enabled? How are faults in IO and memory cells to be identified quickly (lower test time) and efficiently? With that in mind, this dissertation follows a path to develop an on die solution to:
1. Measure and characterize the IO health
2. Fine tune IO for maximal performance (link training)
3. Fine tune IO for lower power (power tuning)
4. Leverage Intel core performance and L3 cache to reduce test time and link training
time
5. Measure memory cell defects
6. Repair memory cell failures using spare cells automatically
The reason to pursue these goals was to develop a low cost and effective method to improve
IO performance, reduce power, reduce test time, improve debug and validation coverage, and in some cases, automatically repair defects detected in memory cells.
2.10.2 Methodology
The research was driven by the challenges facing high performance microprocessor design at the 32nm, 22nm, and 14nm technology nodes, and beyond. The research involved understanding prior and existing publications on HSIO and memory IO design. The next step was to identify and understand gaps and shortcomings in the current state of the art
32nm and 22nm ICs. The research focused on real world industry designs and the design challenges, and proposed architectures that are innovative and cost effective that can be applied to the real world problems identified. Specifically, the research was focused on developing a unified or converged BIST engine that addressed both the IO and memory cell test challenges and requirements, and reduced test time and overall cost.
Once the architectures were identified, they underwent rigorous peer review, primarily by industry peers; in the process of publishing papers, they were reviewed by academics as well. Multiple patents were filed (6 granted, and 2 pending)[18]–
[25]. For the proposed architectures to be widely implemented across Intel, the architecture also had to pass financial review to be funded, and, at times, had to pass Intel vice president
(VP) level review to be granted full authority, funding, and engineering resources to complete the design and validation cycle.
The author worked with a group of peers, dotted line reports, and direct reports to complete the design cycle and validation cycle. The design cycle and methodology involved describing the architecture at the register transfer level (RTL) and running pre-silicon model based simulations of both the architecture and the test patterns. The design cycle also included automatic synthesis of the RTL into a gate level netlist, floor planning, timing analysis, reliability, and power validation before tapeout. The post-silicon validation cycle and methodology involved test chips and early sample characterization of the functional HSIO, the memory IO circuits, and the BIST engine. The development and proofing of the high volume manufacturing flow followed. The post-silicon validation cycle also included statistical sub sampling of large populations and testing to predict a population’s overall behavior, and finalizing the recipe and flow for volume production to ensure that BIST and IO behavior matches expectation. Many of the steps within the design and validation cycle are iterative, feeding insights from simulation, test chips, and validation back into the architecture to improve the overall research.
2.11 References
[1] F. O’Mahony, G. Balamurugan, J. E. Jaussi, J. Kennedy, M. Mansuri, S. Shekhar, and B. Casper, “The future of electrical I/O for microprocessors,” in International
Symposium on VLSI Design, Automation and Test, 2009. VLSI-DAT ’09, 2009, pp. 31–34.
[2] B. Casper, J. Jaussi, F. O’Mahony, M. Mansuri, K. Canagasaby, J. Kennedy, E.
Yeung, and R. Mooney, “A 20Gb/s Forwarded Clock Transceiver in 90nm CMOS B.,” in
Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE
International, 2006, pp. 263–272.
[3] M. A. A. Latif, N. B. Z. Ali, and F. A. Hussin, “A case study of process-variation effect to SoC analog circuits,” in 2011 IEEE Recent Advances in Intelligent Computational
Systems (RAICS), 2011, pp. 520–523.
[4] T. Chen, P. Luo, and H. Fu, “Signal integrity analysis of high speed USB IO and interconnection,” in IEEE Electrical Design of Advanced Packaging Systems Symposium,
2009. (EDAPS 2009), 2009, pp. 1–5.
[5] A. Bravaix, C. Guerin, V. Huard, D. Roy, J.-M. Roux, and E. Vincent, “Hot-Carrier acceleration factors for low power management in DC-AC stressed 40nm NMOS node at high temperature,” in Reliability Physics Symposium, 2009 IEEE International, 2009, pp.
531–548.
[6] C. Park, H. Chung, Y.-S. Lee, J. Kim, Y.-S. Lee, M.-S. Chae, D.-H. Jung, S.-H.
Choi, S. Seo, T.-S. Park, J.-H. Shin, J.-H. Cho, S. Lee, K.-W. Song, K. Kim, J.-B. Lee, C.
Kim, and S.-I. Cho, “A 512-mb DDR3 SDRAM prototype with CIO minimization and self-calibration techniques,” IEEE J. Solid-State Circuits, vol. 41, no. 4, pp. 831–838, Apr.
2006.
[7] V. Melikyan, A. Balabanyan, A. Hayrapetyan, and N. Melikyan,
“Receiver/transmitter input/output termination resistance calibration method,” in
Electronics and Nanotechnology (ELNANO), 2013 IEEE XXXIII International Scientific
Conference, 2013, pp. 126–130.
[8] L. Chen, F. Spagna, P. Marzolf, and J. K. Wu, “A 90nm 1-4.25-Gb/s Multi Data
Rate Receiver for High Speed Serial Links,” in Solid-State Circuits Conference, 2006.
ASSCC 2006. IEEE Asian, 2006, pp. 391–394.
[9] S. Puligundla, F. Spagna, L. Chen, and A. Tran, “Validating the performance of a
32nm CMOS high speed serial link receiver with adaptive equalization and baud-rate clock data recovery,” in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1–5.
[10] N. Kurd, M. Chowdhury, E. Burton, T. P. Thomas, C. Mozak, B. Boswell, M. Lal,
A. Deval, J. Douglas, M. Elassal, A. Nalamalpu, T. M. Wilson, M. Merten, S. Chennupaty,
W. Gomes, and R. Kumar, “5.9 Haswell: A family of IA 22nm processors,” in Solid-State
Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, 2014, pp. 112–113.
[11] K. K. Saluja and K. Kinoshita, “Test Pattern Generation for API Faults in RAM,”
IEEE Trans. Comput., vol. C–34, no. 3, pp. 284–287, Mar. 1985.
[12] P. de Jong and A. J. van de Goor, “Test pattern generation for API faults in RAM,”
IEEE Trans. Comput., vol. 37, no. 11, pp. 1426–1428, Nov. 1988.
[13] Z. Al-Ars and A. J. van de Goor, “Test generation and optimization for DRAM cell defects using electrical simulation,” IEEE Trans. Comput.-Aided Des. Integr. Circuits
Syst., vol. 22, no. 10, pp. 1371–1384, Oct. 2003.
[14] A. J. van de Goor, “An industrial evaluation of DRAM tests,” IEEE Des. Test
Comput., vol. 21, no. 5, pp. 430–440, Sep. 2004.
[15] C.-T. Huang, J.-R. Huang, C.-F. Wu, C.-W. Wu, and T.-Y. Chang, “A programmable BIST core for embedded DRAM,” IEEE Des. Test Comput., vol. 16, no. 1, pp. 59–70, Jan. 1999.
[16] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing.
[17] A. J. van de Goor, Testing semiconductor memories: theory and practice. J. Wiley
& Sons, 1991.
[18] B. Querbach, D. G. Ellis, A. Khan, M. J. Tripp, E. S. Gayles, and E. Gollapudi,
Automatic self test of an integrated circuit component via AC I/O loopback. Google
Patents, 2006.
[19] B. Querbach, A. Khan, M. Tripp, L. B. Guerrero, M. A. V. Vargas, and A.
Muhtaroglu, Methods and apparatuses for validating AC I/O loopback tests using delay modeling in RTL simulation. Google Patents, 2007.
[20] D. Ellis, B. Querbach, J. Nejedlo, A. Khan, S. Babcock, E. Gayles, and E.
Gollapudi, Pre-announce signaling for interconnect built-in self test. Google Patents, 2003.
[21] D. G. Ellis, B. Querbach, J. J. Nejedlo, A. Khan, S. R. Babcock, E. S. Gayles, and
E. Gollapudi, Push button mode automatic pattern switching for interconnect built-in self test. Google Patents, 2004.
[22] B. Querbach, M. A. Abdallah, A. M. Khan, M. M. Hossain, and S. M. Sarkar,
Regulating a timing between a strobe signal and a data signal. Google Patents, 2009.
[23] B. L. Spry, T. Z. Schoenborn, P. Abraham, C. P. Mozak, D. G. Ellis, J. J. Nejedlo,
B. Querbach, Z. Greenfield, R. Ghattas, and J. Tholiyil, Robust memory link testing using memory controller. Google Patents, 2009.
[24] M. B. Trobough, K. K. Tiruvallur, C. B. Prudvi, C. E. Iovin, D. W. Grawrock, J. J.
Nejedlo, A. N. Kabadi, T. K. Goff, E. J. Halprin, and K. B. Udawatta, TEST, VALIDATION,
AND DEBUG ARCHITECTURE. US Patent 20,150,127,983, 2015.
[25] B. Querbach, R. B. Hamilton, L. A. Johnson, and M. Kim, Voltage margining with a low power, high speed, input offset cancelling equalizer. Google Patents, 2009.
Comparison of Hardware based and Software based Stress Testing of Memory IO Interface
Bruce Querbach, Sudeep Puligundla, Daniel Becerra, Zale T Schoenborn, Patrick Chiang
2013 MWSCAS,
Columbus, Ohio,
56th IEEE International Midwest Symposium on Circuits and Systems
3 Comparison of hardware and software stress method
Comparison of Hardware Based and Software Based Stress Testing of Memory IO
Interface
Abstract— In post-silicon testing and validation of circuit functionality, an effective
IO stress pattern can identify bugs quickly and provide adequate test coverage. A lot of work has been done to identify the right stress patterns specific to each IO interface. While some patterns can be generic enough to apply to all IOs, other patterns are interface topology specific. In addition to identifying the worst-case pattern, tradeoffs between test-time and test coverage must be made depending on the test goals. Pseudo random bit stream (PRBS) generators are commonly used to generate test patterns because of the adequate frequency content in PRBS patterns, ease of implementation, and minimal gate count. This paper introduces an advanced pattern generator and checker (APGC) based on PRBS that retains all the aforementioned advantages. The APGC was implemented for a DDR memory interface where different LFSRs beat against each other spatially on neighboring IO lanes while rotating this form of aggressor-victim pattern in time. The results of the
APGC stress patterns are compared to a form of advanced software-based learning algorithm based patterns that exhaustively search this complete parameter space.
The comparison of APGC to software showed that the measured bit error rate (BER) plotted on a Q-scale of both methods is similar for the receiver side. On the transmitter side, APGC showed less eye opening than the software. In addition to the
margin comparison, on the test execution side, APGC can speed up the test and validation execution time compared to the software by 32 to 2048 times depending on aggressor victim lane width of eight to 64 lanes.
3.1 Introduction
After an integrated circuit is taped out, the post-silicon test and validation is focused on verifying the proper functionality of the design. This work must be done quickly to enable time to market. An effective IO stress pattern can identify bugs quickly and provide adequate test coverage. The effectiveness of the stress pattern can be measured by its frequency content. The wider the frequency content, the more corner cases it excites. The running length of the test is also very important from an effectiveness point of view. As the pattern length increases, the validation coverage increases; however, that adds cost to the testing and hence, the product. Thus, an effective pattern must also be short to minimize test time.
A lot of work has been done to identify the most effective stress pattern. Goel [1] uses a term tree minimization algorithm to come up with an effective pattern.
Some patterns, such as a clock pattern, DC all ones, and DC all zeros, are generic enough to apply to all IOs, and they are good starting patterns to train and initialize the link. However, the frequency content of these patterns consists of impulse functions at a few selected frequencies.
Using learning algorithms such as an exhaustive search or a genetic algorithm [2], a topology specific worst-case pattern can be found after a long search time. Usually this involves testing over all the interactions between neighboring IO lanes both in time and in space to analyze the effects of crosstalk and inter symbol interference (ISI). Genetic algorithms can reduce search times, but these algorithms frequently come up with a solution that is very specific to the IO interface topology under test. For example, by switching several DDR modules around, the previously discovered worst-case pattern is no longer valid due to a topology change.
Tradeoffs between test-time and test coverage must be made depending on the test goals.
For example, the error tolerance requirement of an HDMI interface is less stringent than that of multi-processor chip-to-chip communication interfaces. More stringent error tolerance and reliability requirements can increase test time significantly.
The pseudo random binary sequence (PRBS) generator has been widely used in testing the correct functionality of integrated circuits. A PRBS generator can quickly produce adequate frequency content while keeping the gate count of the design to a minimum. Eye diagrams built based on these
PRBS patterns can be related to circuit parameters [3] and be used to tune and improve transmitter/receiver amplifier designs. There are a number of PRBS designs focused on low power and high performance, such as [4]. A number of PRBS architectures have been designed as full rate [5], half rate [6, 7, 8, and 9], and quarter rate [10]. In this paper, the focus is on comparing a combination of multiple PRBS patterns to learning algorithms.
Section 3.2 presents the architecture of APGC. Section 3.3 describes the aggressor and victim pattern rotation capability of APGC architecture to stress crosstalk effects between neighboring lanes. This was used for DDR memory link initialization and validation.
Section 3.4 describes a form of an advanced learning algorithm that exhaustively searches the complete parameter space. The software based pattern generation was also used in link validation for comparison. Section 3.5 presents the margin result comparison and time savings between the two algorithms.
3.2 The APGC Architecture
Shown in Figure 18 below is a conceptual transmitter and receiver for memory link. The
Tx Vref is generated using a voltage divider, and the value is sent to the receiver on the
DRAM device. The transmitter clock has multiple phase interpolators (PI) to adjust the clock up to 64 steps within one UI. Similarly, the receiver sampler can adjust the PI up to
64 steps within one UI. The receiver sampler also needs a reference voltage (Rx Vref) to compare with sampled data. The Vref in both the Tx and the Rx side can be adjusted up to
64 steps between the ground and the VCC.
Figure 18: The conceptual memory link transmitter and receiver (blocks shown: transmitter with Tx data, transmitter clk with PI adjustments, and Tx Vref; DQ pads; receiver sampler with Rx data, receiver clk with PI adjustments, and Rx Vref)
An advanced pattern generator and checker, presented below, was used to measure the IO link quality. As shown in Figure 19, APGC uses two LFSRs, which can produce PRBS and low frequency patterns. It also supports two N bit shift registers, which can produce user defined patterns. Each data lane has a mux, which chooses which of the patterns are sent on the DQ lane. Each of the 64 data lanes (DQ0, DQ1, … DQ63) can have unique patterns.
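The pattern sources and the per-lane mux described above can be modeled in a few lines. This is a minimal sketch under stated assumptions, not the APGC implementation: the text does not give the LFSR polynomials, so the widely used PRBS7 (x^7 + x^6 + 1) and PRBS15 (x^15 + x^14 + 1) polynomials are assumed, and the names `lfsr_bits`, `shift_register_bits`, `drive_lanes`, and the source labels are hypothetical.

```python
from typing import Dict, List, Sequence


def lfsr_bits(taps: Sequence[int], seed: int, count: int) -> List[int]:
    """Fibonacci LFSR. `taps` are 1-indexed stage numbers XORed into the feedback."""
    width = max(taps)
    mask = (1 << width) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("an all-zero seed locks up an XOR-feedback LFSR")
    bits: List[int] = []
    for _ in range(count):
        feedback = 0
        for tap in taps:
            feedback ^= (state >> (tap - 1)) & 1
        state = ((state << 1) | feedback) & mask
        bits.append(feedback)  # the feedback bit is used as the serial output
    return bits


def shift_register_bits(pattern: Sequence[int], count: int) -> List[int]:
    """Circular shift register that replays a user defined pattern."""
    return [pattern[i % len(pattern)] for i in range(count)]


def drive_lanes(select: Sequence[str], sources: Dict[str, List[int]]) -> List[List[int]]:
    """Per-lane mux: lane i transmits the source named by select[i]."""
    return [sources[name] for name in select]


N_BITS = 32
SOURCES: Dict[str, List[int]] = {
    "LFSR0": lfsr_bits((7, 6), seed=0x7F, count=N_BITS),      # assumed PRBS7
    "LFSR1": lfsr_bits((15, 14), seed=0x7FFF, count=N_BITS),  # assumed PRBS15
    "SR0": shift_register_bits([0, 1], N_BITS),                # clock-like pattern
    "SR1": shift_register_bits([0, 0, 0, 0], N_BITS),          # DC all zeros
}

# Example mux setting: DQ61 carries LFSR1 while every other lane carries LFSR0.
LANE_SELECT = ["LFSR0"] * 64
LANE_SELECT[61] = "LFSR1"
LANES = drive_lanes(LANE_SELECT, SOURCES)
```

Because each lane only stores a source selection, changing which lane carries which pattern is a matter of rewriting `LANE_SELECT`; the pattern generators themselves are shared across all 64 lanes.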
Figure 19: The APGC architecture (blocks shown: two N bit shift registers and two N bit LFSRs feeding a per-lane mux for DQ 0, DQ 1, … DQ n, … DQ 63)
3.3 The Aggressor/Victim Pattern Rotation
On a recent Intel microprocessor memory link initialization and training algorithm, an aggressor and a victim pattern beat against each other to identify and minimize lane-to-lane crosstalk. This was achieved through the programming of two different LFSR patterns (LFSR-0, LFSR-1) provided by APGC on neighboring lanes while rotating this sequence in time across 5, 8, 16, 32 or 64 DQ lanes. An example of these two LFSR patterns rotating between five lanes is shown in Table 1.
Table 1: Victim Aggressor pattern rotation with two different LFSRs
IO pin  Pattern              Sequence in Time
DQ59    0000 0000 0000 0000  LFSR 0  LFSR 1  LFSR 0  LFSR 0  LFSR 0  LFSR 0
DQ60    0001 0001 0001 0001  LFSR 0  LFSR 0  LFSR 1  LFSR 0  LFSR 0  LFSR 0
DQ61    0011 0011 0011 0011  LFSR 0  LFSR 0  LFSR 0  LFSR 1  LFSR 0  LFSR 0
DQ62    0111 0111 0111 0111  LFSR 0  LFSR 0  LFSR 0  LFSR 0  LFSR 1  LFSR 0
DQ63    1111 1111 1111 1111  LFSR 1  LFSR 0  LFSR 0  LFSR 0  LFSR 0  LFSR 1
3.4 The Software Learning Algorithm and Pattern
Many learning algorithms are used to discover the worst-case pattern, which is usually specific to the platform topology and layout. The software algorithm used here is an exhaustive search algorithm that tries all combinations of bit flips of neighboring bits both in time and in space of the bit under test (11th bit of DQ61 in Table 2). When this bit of DQ61 is under test, the margins including crosstalk and ISI are measured when all its neighbors are flipped one bit at a time. Shown in red are the neighbor bits that have the most cross coupling.
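The one-bit-at-a-time neighbor flipping described above can be sketched as a pattern enumerator. This is an illustrative sketch only, not the software used in this work: it takes the DQ59 to DQ63 bit patterns listed in Table 2 as an example base pattern, assumes the bits are numbered left to right so that the 11th bit is index 10, and covers only single-bit flips (adjacent double-bit flips are omitted). The names `BASE`, `VICTIM`, and `single_flip_patterns` are hypothetical, and the margin measurement itself is outside the sketch.

```python
from typing import Iterator, List, Tuple

Pattern = List[List[int]]   # pattern[lane][bit time]
Position = Tuple[int, int]  # (lane index, bit-time index)


def parse_lane(text: str) -> List[int]:
    """Convert a string such as '0000 0010 1010 0001' into a list of bits."""
    return [int(ch) for ch in text if ch in "01"]


# Example base pattern: the DQ59..DQ63 bit patterns listed in Table 2.
BASE: Pattern = [
    parse_lane("0000 0010 1010 0001"),  # DQ59
    parse_lane("0000 0001 1011 0001"),  # DQ60
    parse_lane("0000 0001 1010 0000"),  # DQ61
    parse_lane("0000 0000 0110 0011"),  # DQ62
    parse_lane("0000 1000 0110 0011"),  # DQ63
]

# Victim: 11th bit of DQ61 (lane index 2, bit index 10, assuming left-to-right order).
VICTIM: Position = (2, 10)


def single_flip_patterns(base: Pattern, victim: Position) -> Iterator[Tuple[Position, Pattern]]:
    """Yield (aggressor position, pattern) with exactly one non-victim bit flipped."""
    for lane, bits in enumerate(base):
        for t in range(len(bits)):
            if (lane, t) == victim:
                continue  # the bit under test is never flipped
            flipped = [row[:] for row in base]
            flipped[lane][t] ^= 1
            yield (lane, t), flipped
```

For a 5 lane by 16 bit base pattern this yields 5 x 16 - 1 = 79 single-flip patterns per victim bit. Each yielded pattern would then be transmitted and the victim margin measured, which is what makes the exhaustive search slow compared with a hardware pattern generator.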
Table 2: Crosstalk/ISI impact to 11th bit in time of DQ61 using software
IO pin  Pattern              Cross Talk/ISI
DQ59    0000 0010 1010 0001  88 99 63 94 98 80 98 58 92 67 16 99 99 92 99 79
DQ60    0000 0001 1011 0001  65 60 65 82 94 33 69 25 21  0  0 75 99 90 99 92
DQ61    0000 0001 1010 0000  31 46 21 48 63 18  2  4  0  0  0 40 65 46 73 33
DQ62    0000 0000 0110 0011  90 96 71 80 99 54 99 63 37  0 29 99 86 73 99 99
DQ63    0000 1000 0110 0011  63 65 79 79 99 35 79 31 99  0  6 99 94 73 99 92
3.5 Margin Result
Below is the comparison of hardware (APGC) vs. software stress measured on a recent Intel microprocessor. Comparison of eye margins of APGC and software in terms of Q-scale vs. ticks is shown in Figure 20 through Figure 25 for both transmitter (Tx) and receiver (Rx). Q-scale is defined as the inverse of the standard normal cumulative distribution of the BER. A tick is defined as 1/64th UI (unit interval) for timing margin and 1/64th of the full voltage swing from ground to VCC for voltage margin.
3.6 Receiver Voltage Margin – Low Side
The IO under test has the ability to change the sampling voltage and sampling time for both the transmitter and receiver independent of the memory device on the remote side. Figure 20 compares the low side voltage margin measured by APGC and software. The plot shows that the measured margin between the two approaches matches closely.
Figure 20: The Rx Voltage Lower Side BER (on Q-scale) vs. vref ticks
3.7 Receiver Voltage Margin – High Side
Similar margin result match and trend is also observed between the two approaches for high side voltage margin as shown in Figure 21.
Figure 21: The Rx Voltage Higher side BER (on Q-scale) vs. Vref ticks
3.8 Receiver Timing Margin – Low Side
Similar margin result match and trend is also observed between the two approaches for lower side timing margin as shown in Figure 22.
Figure 22: The Rx Timing Lower Side BER (on Q-scale) vs. Vref ticks
3.9 Transmitter Voltage Margin – Low Side
While we observed comparable results on the receiver margin measurements, on the transmitter margin measurements, differences were observed as shown in Figure 23. For the same Q, we observed a difference of up to 10 ticks.
Figure 23: The Tx voltage Lower Side (on Q-scale)
3.10 Transmitter Voltage Margin – High Side
On the transmitter voltage higher side in Figure 24, for the same Q, we observed a difference of 15 ticks.
Figure 24: The Tx Voltage Higher Side (on Q-scale)
3.11 Transmitter Timing Margin – Low Side
On the transmitter timing lower side in Figure 25, for the same Q, we observed a difference of two ticks.
Figure 25: The Tx Timing Lower Side (on Q-scale)
3.12 Comparison of Test Execution Time of APGC and Software
In the case of APGC, each LFSR is programmed to generate 2^14 bits and this pattern is rotated to the neighboring lane using a circular shift register that is eight lanes wide. Therefore, after eight rotations, the complete pattern takes 2^14 x 8 = 131072 bit time. In the case of the software, the base pattern is 16 x 8 = 128 bits. Each of the single bit and double bits (next to each other) in base pattern will be flipped (both 1 and 0) to act as aggressor to test the ISI and crosstalk on the victim bit. This results in 128 x 128 x 2 = 32768 bit time per victim location in time and space. The complete pattern takes 32768 x 128 = 4194304 bit time. If only eight lanes of wide spatial effect are considered, the APGC would take only 131072 / 4194304 = 3.125% of the software execution time.
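The bit-time arithmetic above can be reproduced with a short sketch. The function names are hypothetical, and the sketch assumes that the software base pattern stays 16 bits deep per lane as the lane width grows, which matches the figures quoted for eight lanes.

```python
LFSR_BITS = 2 ** 14   # bits generated by each LFSR per rotation step
BITS_PER_LANE = 16    # depth of the software base pattern per lane (assumed constant)


def apgc_bit_time(lanes: int) -> int:
    """APGC: one LFSR sequence per rotation, one rotation per lane."""
    return LFSR_BITS * lanes


def software_bit_time(lanes: int) -> int:
    """Software: 2 * base^2 bit times per victim location, over all base positions."""
    base = BITS_PER_LANE * lanes
    per_victim = base * base * 2
    return per_victim * base


def speedup(lanes: int) -> int:
    """Ratio of software bit time to APGC bit time."""
    return software_bit_time(lanes) // apgc_bit_time(lanes)


if __name__ == "__main__":
    for lanes in (8, 16, 64):
        print(lanes, apgc_bit_time(lanes), software_bit_time(lanes), speedup(lanes))
```

Under these assumptions the ratio simplifies to lanes^2 / 2, because the software cost grows with the cube of the lane width while the APGC cost grows linearly.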
If we consider 16 lanes wide or 64 lanes wide, the time savings would increase as demonstrated in Table 3:
Table 3: Victim Aggressor pattern rotation with two different LFSRs
                      8 lanes   16 lanes   64 lanes
APGC (bit time)       131072    262144     1048576
Software (bit time)   4194304   33554432   2.15E+09
Ratio of time saving  32        128        2048
3.13 Conclusion
In summary, this paper presented the APGC architecture, which provides the ability to generate rotating victim aggressor PRBS patterns and supports advanced learning algorithms in software via the user defined pattern buffers. The APGC pattern used for memory link initialization and training was compared to the exhaustive search-learning algorithm in software used on a recent Intel microprocessor. The silicon results show good correlation across Rx voltage and timing between APGC based patterns and software. On the transmitter side, both timing and voltage margins are close, but not completely matching. This difference is worth further exploration.
This paper showed that rotating PRBS patterns are sufficient for memory link initialization and training, and the validation margin results are comparable to more advanced learning algorithms. The APGC architecture based patterns are better in terms of the test execution speed. A speed up of 2048 times that of the software was realized assuming a 64 lanes wide repeatable pattern, and up to 32 times over eight lanes. Typical link training can be done on the order of milliseconds. During this time, APGC provides adequate discovery of ISI and crosstalk and tries to minimize their impact on link performance. For future work, the authors would like to explore the comparison of APGC to other learning algorithms, and study in detail the execution speed of various algorithms.
3.14 Acknowledgements
The help received from Eric Donkoh, Peters Stephen, and Rahima K Mohammed of Intel Corporation is greatly appreciated.
3.15 References
[1] Goel, S.K.; Devta-Prasanna, N.; Turakhia, R.P., “Effective and Efficient Test Pattern Generation for Small Delay Defect”, VLSI Test Symposium, 2009. VTS '09. 27th IEEE
[2] Tai-shan Yan, “An Improved Genetic Algorithm and Its Blending Application with Neural Network”, Intelligent Systems and Applications (ISA), 2010 2nd International Workshop
[3] Venkatesha, D.B.; Chitrashekaraiah, S.; Rezazadeh, A.A.; Thiede, A., “Interpreting InGaP/GaAs DHBT Eye diagrams using small signal parameters”, Microwave Integrated Circuit Conference, 2007. EuMIC 2007. European
[4] Laskin, E.; Voinigescu, S.P., “A 60 mW per Lane, 4 x 23-Gb/s 2^7-1 PRBS Generator”, Solid-State Circuits, IEEE Journal of, Volume: 41, Issue: 10
[5] R. Malasani, C. Bourde, and G. Gutierrez, “A SiGe 10-Gb/s multipattern bit error rate tester,” Radio Frequency Integrated Circuits (RFIC) Symposium, 2003 IEEE
[6] H. Veenstra, “1–58 Gb/s PRBS generator with <1.1 ps RMS jitter in InP technology,” Solid-State Circuits Conference, 2004. ESSCIRC 2004. Proceeding of the 30th European
[7] Knapp, H.; Wurzer, M.; Meister, T.F.; Bock, J.; Aufinger, K., “40 Gbit/s 2^7-1 PRBS generator IC in SiGe bipolar technology,” Bipolar/BiCMOS Circuits and Technology Meeting, 2002. Proceedings of the 2002
[8] H. Knapp, M. Wurzer, W. Perndl, K. Aufinger, J. Böck, and T. F. Meister, “100-Gb/s 2^7-1 and 54-Gb/s 2^11-1 PRBS generators in SiGe bipolar technology,” IEEE J. Solid-State Circuits, vol. 40, no. 10, pp. 2118–2125, Oct. 2005.
[9] D. Kucharski and K. Kornegay, “A 40 Gb/s 2.5 V 2^7-1 PRBS generator in SiGe using a low-voltage logic family,” in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, Feb. 2005, pp. 340–341.
[10] T. O. Dickson, E. Laskin, I. Khalid, R. Beerkens, J. Xie, B. Karajica, and S. P. Voinigescu, “A 72 Gb/s 2^31-1 PRBS generator in SiGe BiCMOS technology,” in IEEE ISSCC Dig. Tech. Papers, San Francisco, CA, Feb. 2005, pp. 342–345.
A Reusable BIST with Software Assisted Repair Technology for Improved Memory and IO Debug, Validation and Test Time
Bruce Querbach, Rahul Khanna, David Blankenbeckler, Yulan Zhang, Ronald T Anderson, David G Ellis
Intel Corporation
Sabyasachi Deyati
Georgia Institute of Tech
Patrick Chiang
Oregon State University
2014 ITC
Seattle, Washington, USA
2014 IEEE International Test Conference (ITC)
4 Improved Debug, Validation and Test Time
A Reusable BIST with Software Assisted Repair Technology for Improved Memory and IO Debug, Validation and Test Time
Abstract
As silicon integration complexity increases with 3D stacking and through-silicon-via (TSV), so does the occurrence of memory and IO defects and associated test and validation time. This ultimately leads to an overall cost increase. On a 14nm Intel SOC, a reusable BIST engine called converged pattern generator and checker (CPGC) is architected to detect memory and IO defects, and is combined with the software assisted repair technology to automatically repair memory cell defects on 3D stacked Wide-IO DRAM. Additionally, we also present the CPGC gate count, power, simulation, and silicon results. The reusable CPGC IP is designed to connect to a standard IP interface, which enables a quick turn-key SOC development cycle. Silicon results show CPGC can speed up validation by 3x, improve test time from minutes down to seconds, and decrease debug time by 5x including root-cause of boot failures of the memory interface. CPGC is also used in memory training and initialization, which makes it a critical part of Intel SOC.
Keywords: High speed IO, Memory IO, Memory Interface Training, Post Package Repair, Memory Array Test Engine, hardware/software calibration
4.1 Introduction
Shrinking process technology and faster memory bus data rates continue to drive complexity in the I/O design. This surge in complexity brings with it an increase in design bugs. While some of these bugs (such as logic bugs) can be detected at the pre-silicon stage, others (such as cross coupling of adjacent data lanes, voltage droop, ground bounce, power bounce, timing error due to process shifts, etc.) remain undetected. Although logic bugs can be detected in pre-silicon, increased design complexity, shortened design time, and complex OS and software demands can lead to more bug escapes. A modern microprocessor design can experience a considerable number of design bugs, specifically in memory interfaces. Sarangi et al. [1] describe the percentages of memory interface design bugs found in AMD64 and Intel's Pentium 4 to be 17.4 and 20.2 respectively. Therefore there is a critical need for an efficient methodology to validate the memory interface in a microprocessor design.
Controllability and observability are key attributes in any post silicon validation methodology. One conceivable but expensive solution is to probe the prepackaged dynamic random access memory (DRAM) die. The easiest and most cost effective solution is to include built-in self-test (BIST) circuitry inside the chip. The BIST engine provides an efficient mechanism to deterministically drive and receive patterns on the memory bus along with automated data checking capability. These features can be used to swiftly identify and isolate bugs in the silicon validation phase. Furthermore, given the continuous increase in system reliability requirements and the current trend of adding more complex features to achieve higher reliability goals, this paper aims to present an effective automated repair strategy that combines converged pattern generator and checker (CPGC) with software assistance in protecting against memory errors. In addition, we present possible directions for designing and testing future systems that are resilient to memory errors.
In addition to enabling efficient validation of the memory controller, a BIST engine facilitates effective identification of DRAM defects. DRAM errors are one of the leading causes of hardware failure in modern compute clusters [10, 11]. This has led to increasing requirements for reliability, availability, and serviceability (RAS), particularly in mission critical environments. Standard OS based memory stress tools pose a significant challenge in providing adequate coverage to the many types of DRAM faults due to the presence of architectural features in the memory controller, such as data scrambling, reordering, and various levels of interleaving. These features make it difficult to generate the types of patterns or coverage that the test writer intends. With a suitable design, a BIST engine can overcome these challenges by providing the means to precisely put the intended patterns on the bus, thereby significantly improving the ability to identify DRAM faults. Once identified, the BIST engine with memory controller and software assisted repair technology can provide a means to enable automatic repair of memory cell defects in 3DS WIO DRAM.
Figure 26: CPGC BIST engine with software assisted repair technology
In this paper, the authors propose a new architecture (Figure 26) for CPGC. It is capable of IO [13, 14, 15, 21] and memory test, debug, validation, IO link training [16, 17, 18], and post package repair. The architecture is suitable for such technologies as embedded memory, DDR4, and WIO on system-on-chip (SOCs) with 3D stacking and TSV topology. To speed up fault isolation and improve manufacturing yield, the CPGC BIST engine also includes a software repair technology for 3D stacked memory cell defects.
In section 4.2, the paper presents the memory and IO failure modes. In section 4.3, the paper describes the CPGC architecture. In section 4.4, the paper discusses how CPGC uses a standardized interface to enable intellectual property (IP) reuse, which expedites rapid prototyping and a quicker SOC product development cycle. In section 4.5, the paper presents the software architecture. In section 4.6, the paper presents simulation results based on the CPGC architecture. In section 4.7, silicon results demonstrate that CPGC significantly reduces debug and validation time. In section 4.8, the paper summarizes all the results presented in this paper.
4.2 Failure Modes
This paper will briefly cover various failure modes that occur within the memory IO and cell due to cross coupling of electrical signals or charges.
There is significant literature available regarding the different types of fault modes for DRAMs and the associated algorithms to detect them. Venkatesh [2] has demonstrated a fault model (to model all plausible physical faults in memory) to test the BIST algorithms. The most prevalent memory BIST algorithms are march and modified march [3, 4, and 5]. In [6] the authors proposed a non-march (O(n^2) complexity) algorithm for memory test that employs galloping pattern tests and captures some of the failures escaping conventional march based tests. While the complexity is higher compared to march based tests, this technique is helpful for embedded memory testing. Bernardi [7] introduces a hardware/software co-test for embedded memory in SOCs. The mentioned co-test is an online test; i.e., it can be performed when the device under test (DUT) is operational. Bernardi [8] also demonstrated a programmable MBIST suitable for stacked DRAM as well as standalone DRAM. It supports both march and neighborhood pattern sensitive faults (NPSF) tests. March-based MBISTs alone are not sufficient for the state of the embedded memory on SOCs.
As described in [2, 9], the following common memory faults can be detected by this CPGC IP:
- Stuck-at faults
- Address decoder faults
- Transition faults
- Inversion coupling faults
- Idempotent coupling faults
- State coupling faults
- Retention faults
- Neighborhood pattern sensitive faults

In addition, common IO faults that can be detected by this CPGC IP are:
- ISI (time domain interference)
- Crosstalk (neighbor victim/aggressor interaction)
- Clock domain crossing issues
- Power supply issues
- Logic circuit IP integration issues

Figure 27: CPGC architecture overview

4.3 CPGC Architecture

Figure 27 illustrates the block diagram of the CPGC architecture, where the CPGC IP is embedded in the memory controller. CPGC supports both pre-scheduler (CPGC_T) and post-scheduler (CPGC_C) patterns. Functional traffic from the SOC core that originates from upstream is multiplexed with CPGC to form the command (CMD) and data of the DDR protocol. It is transmitted through the memory scheduler and data buffer (Data Buff) to the DDRIO. CPGC control registers are programmed through a standard low bandwidth interface (side band), and an out of band protocol interface exists between CPGC and DDRIO. The CPGC IP comprises several PRBS (pseudo random binary sequence) generators such as the CPGC pattern generator checker (CPGC_PGC), the CPGC command generator (CPGC2_CMD), and the CPGC command-address data buffer (CPGC_CADB). These pattern generators can generate fixed, low frequency (LMN), and random patterns. The patterns can drive the command, address, and data IO of the memory interface.
Figure 28: CPGC SW assisted repair technology flow (state diagram from Idle through ACT, PPR, tPGM wait, and PRE states to Done; the flow starts only if NumFailStatus < MaxFail and loops if there are more failing addresses)

4.3.1 Architecture for Software Assisted Repair Technology

The CPGC reusable BIST engine introduces a software assisted repair technology. The flow is shown in Figure 28 and has been adopted by JEDEC as a part of the WIO industry standard. Due to heat from solder reflow during TSV and 3DS attachment, additional memory cell defects can be introduced during the packaging process. Through JEDEC and WIO industry standards, agreement has been reached with the DRAM manufacturers to provide spare rows of additional memory cells for post package repair (PPR). The CPGC architecture for software assisted repair technology conducts this PPR, if necessary, at the end of memory cell defect discovery. After CPGC detects memory failures, it repairs the defective cells using the spare cells that the DRAM manufacturer provided, in either manual or automatic mode depending on the software configuration. In the manual mode, the detection phase is executed separately from the repair phase, which is fitting for test and debug to ensure accurate detection of failures. In the automatic mode, the BIST engine automatically repairs the detected failures with software assistance. The repair is stored in the NVRAM of the DRAM device; hence, repairs are permanent and persist through subsequent reboots and power cycles. Combining the two phases can decrease High Volume Manufacturing (HVM) test time significantly, from many minutes down to seconds. The detection and repair phases can also be run via BIOS or at boot via firmware. This enables PPR of a highly integrated SOC with a WIO memory subsystem in the field. In-field repair is especially valuable to the smartphone and tablet market.
Since memory is soldered down, any defective memory could cause the whole phone or tablet to be discarded. In-field repair can improve yield and reduce debug and validation time, thus accelerating the product development and launch cycle.

Figure 29: CPGC unified pattern sequencer and buffer

4.3.2 Unified Pattern Sequencer

Each of the pattern generators supports a different bus width, but all are built upon the unified pattern sequencer (UPS). As illustrated in Figure 29, the UPS consists of a PRBS generator with a programmable length of 16, 23, or 32, a low frequency LMN counter that can generate a clock waveform as low as 10 kHz, and a 32-bit fixed pattern buffer.

4.3.3 LMN

L, M, and N denote three 32-bit counters. The L counter counts the number of clocks to drive the initial value. The M counter counts the number of clocks to drive the high value, and the N counter counts the number of clocks to drive the low value. Combined, the LMN generator can generate a very low frequency (on the order of 10 kHz) clock waveform of any duty cycle to excite low frequency resonance of the system.

4.3.4 Fixed Pattern Buffer

Advanced software algorithms frequently derive a large set of user-defined patterns to try. These can be programmed through the 32-bit fixed pattern buffer. The pattern is simply repeated until changed by the software.

4.3.5 LFSR

The PRBS generator includes multiple LFSR lengths (16, 23, 32), each producing a maximal-length pseudo random pattern.

4.3.6 Flop Based Data Pattern Buffer

Three instances of the uni-sequencer (UPS) are multiplexed together to construct the downstream pattern. These 8:1 per lane MUXes are controlled by an eight-bit wide by 64-lane pattern buffer in a small gate count flop based design, as illustrated in Figure 29.
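The LMN counter of Section 4.3.3 and the LFSR of Section 4.3.5 can be sketched in software as follows. This is an illustrative model under stated assumptions, not the CPGC register-level behavior: the function names are hypothetical, the initial value is assumed to be low, the M/N phases are assumed to repeat indefinitely, and the LFSR tap positions are commonly published maximal-length choices rather than the taps used in the CPGC hardware, which the text does not specify.

```python
# Software model of an LMN low-frequency generator and a Fibonacci LFSR.
# All names and tap choices are assumptions for illustration.
from itertools import islice


def lmn_waveform(l_count, m_count, n_count, initial=0):
    """Yield one sample per clock: `initial` for L clocks, then repeat
    high for M clocks and low for N clocks (repetition is assumed)."""
    for _ in range(l_count):
        yield initial
    while True:
        for _ in range(m_count):
            yield 1
        for _ in range(n_count):
            yield 0


def lmn_frequency_hz(clock_hz, m_count, n_count):
    """Frequency of the repeating M-high / N-low waveform."""
    return clock_hz / (m_count + n_count)


# Commonly published maximal-length tap sets (assumed, not from the source).
LFSR_TAPS = {16: (16, 14, 13, 11), 23: (23, 18), 32: (32, 22, 2, 1)}


def lfsr_step(state, width):
    """Advance a Fibonacci LFSR by one bit; return (new_state, output_bit)."""
    out = state & 1
    feedback = 0
    for tap in LFSR_TAPS[width]:
        feedback ^= (state >> (width - tap)) & 1
    state = (state >> 1) | (feedback << (width - 1))
    return state, out


def lfsr_period(seed, width):
    """Count steps until the LFSR state returns to the (nonzero) seed."""
    state, count = seed, 0
    while True:
        state, _ = lfsr_step(state, width)
        count += 1
        if state == seed:
            return count


if __name__ == "__main__":
    print(list(islice(lmn_waveform(2, 3, 1), 10)))
    # With an assumed 800 MHz clock, M + N = 80000 clocks gives 10 kHz.
    print(lmn_frequency_hz(800e6, 40000, 40000))
    print(lfsr_period(0xACE1, 16))
```

Because M and N are independent, the duty cycle is M / (M + N), which is how a single pair of counters covers "any duty cycle" at a given frequency.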
4.3.7 Register File Based Pattern

Depending on the IP instantiation and die area constraints, the CPGC IP can support a register file (RF) based design, achieving a pattern depth of 64 cache lines, with each cache line being eight bits deep in time. The RF design has one write port and two read ports to minimize gate count and area.

4.3.8 Data Pattern Dynamic Inversion/DC

CPGC supports DC patterns such as all zeros or all ones, as well as dynamic inversion of the data dependent on the address. The dynamic inversion feature enables rapid checkerboard testing of memory cell integrity.

4.3.9 Post-Scheduler Command and Address Pattern Generation

Post-scheduler generation supports both in-protocol and out-of-protocol patterns. This enables early investigation of future memory protocols, or the isolation of a specific debug failure by stressing rare or difficult sequences.

4.3.10 Error Status, Logging and Masking

CPGC supports per lane and per bit time error counters. Similarly, CPGC offers mask capability at the per lane and per bit time level. CPGC error and data logging enables precision debug of memory and IO failures.

4.3.11 Command and Address Generator

Through CPGC2_CMD, the command and address generator has the ability to detect the latest faults associated with the current DRAM process and topology, specifically for WIO that uses 3D stacking and through-silicon-via (TSV).

Figure 30: CPGC Row Hammer Stress pattern

Row hammer problems [19, 20] occur when excessive activate commands targeted at a specific row (aggressor) introduce crosstalk that can cause a bit-flip error in an adjacent row (victim) (Figure 30). Repeated aggressor row charging and discharging can result in one or more failures, as indicated by the red cell in the figure. The CPGC can not only row hammer the +1 and -1 rows to detect the traditional row hammer failures, but can also hammer and detect problems caused by the +2 and +3 rows. These CPGC features are specifically suited for the WIO TSV testing of 3D stacked memory.
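The row hammer stress described above can be sketched as a command sequence generator. This is a simplified software illustration, not the CPGC2_CMD programming model: the function name, the command tuples, and the choice to express aggressors as row offsets from the victim are all assumptions made for the example.

```python
# Sketch of a row hammer command sequence with configurable aggressor offsets.
# hammer_sequence and the ("CMD", row) tuples are hypothetical.


def hammer_sequence(victim_row, offsets=(-1, 1), activations=4, num_rows=1 << 16):
    """Return a command list that seeds the victim row, repeatedly activates
    and precharges each in-range aggressor row, then reads the victim back."""
    aggressors = [
        victim_row + off for off in offsets if 0 <= victim_row + off < num_rows
    ]
    commands = [("WR", victim_row)]  # write known data to the victim first
    for _ in range(activations):
        for row in aggressors:
            commands.append(("ACT", row))  # open (charge) the aggressor row
            commands.append(("PRE", row))  # close (discharge) it again
    commands.append(("RD", victim_row))    # check the victim for bit flips
    return commands


if __name__ == "__main__":
    # Traditional +1/-1 hammering around row 10.
    for cmd in hammer_sequence(10, offsets=(-1, 1), activations=2):
        print(cmd)
    # Wider-distance hammering using +2 and +3 offsets.
    print(len(hammer_sequence(10, offsets=(2, 3), activations=1000)))
```

Passing offsets of (2, 3) instead of (-1, 1) models the wider-distance hammering mentioned in the text; a real test would use far larger activation counts within one refresh window.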
4.3.12 Variable Retention Time Stress

Depending on the DRAM die layout, different memory cells can experience variable retention time failures before refresh. The JEDEC specification calls for a 64 ms refresh interval. CPGC can test at intervals of 64 ms, 128 ms, and so on. In each interval, CPGC can repeat write and read scan patterns one to seven times, with a variable pause (64 ms, 128 ms, etc.) in between. CPGC accomplishes this by adjusting memory controller timing. This can be used to simulate longer refresh intervals to emulate the accelerated decay seen at higher temperatures.

4.3.13 Gate Count

The overall gate count of the CPGC IP, shown in Table 4, occupies about 44K square micrometers. Compared to the Intel Atom S1200 die area of 9.86 mm x 10 mm [22, 23], the BIST engine would occupy 0.045% of total area.

Table 4: CPGC gate count
IP block        Gate count area (um^2)
dpatunit        13621
cmdunit         1847
configunit      3751
errlogunit      100
errunit         874
unisequencer    169
msgifunit       1714
cpgct           21953
Total           44028

4.3.14 Power

The CPGC IP has a total active power of 5.8 mW (Table 5). Assuming a 13 W thermal design power (TDP) [22, 23], this is roughly 0.045% of TDP. When the IP is not in use, it can be power gated off to draw almost no power except leakage.

Table 5: CPGC power estimates
IP block        Active power (mW)
dpatunit        1.796
cmdunit         0.244
configunit      0.495
errlogunit      0.013
errunit         0.115
unisequencer    0.022
msgifunit       0.226
cpgct           2.894
Total           5.805

4.3.15 Protocol

CPGC IP supports a simple request/acknowledge protocol for all write and read transactions. It can interface with the memory controller and IO physical layers using standardized interface protocols.

4.4 CPGC Usage and IP Reuse Model

The key applications of this CPGC reusable BIST IP are in the following four areas: validation, debug, memory initialization/training using the BIOS, and test/HVM. For validation, CPGC can be used in system margin validation (SMV), bench device validation (BDV), and signal integrity validation (SIV), as shown in Table 6.
Table 6: CPGC architecture usage for validation

For SMV, a victim and aggressor (VA) pattern can be used on the data pins (DQ) to measure the worst case margin. Similarly, the command address data buffer (CADB) provides the pattern for margin measurements on command and address pins. Due to the bidirectional nature of the DDR bus, whenever the command switches from read to write or from write to read, the turnaround time margin can be validated using CPGC patterns. CPGC can purposely generate high or low bus utilization traffic. Even when the OS or the core is not powered up or functional, CPGC IP can be used as a very basic debug tool to test the DDR physical layer. Lastly, CPGC can get margins of all DQ pins in a few seconds and supports full automation, which speeds up validation significantly compared to OS based software test patterns.

Bench device validation is accomplished on a test bench with the silicon device isolated. Similar VA patterns can induce noise on the DQ pins. CPGC offers precise control of patterns on the DQ pins. Since BDV boards typically do not support OS boot, CPGC IP is well suited as the test and debug tool for BDV.

Signal integrity validation (SIV) is performed on either prototype or production boards with the focus on neighbor crosstalk and inter symbol interference (ISI). Here too, the VA pattern and the CADB pattern can provide a precise and stressful condition for SIV. SIV also requires explicit control of the pattern bits in time and space to create simultaneous switching of bits to excite even and odd mode resonance on the IO bus. CPGC IP also offers explicit trigger signals for scope capture and post processing of the waveforms.

Table 7: CPGC architecture usage for debug

CPGC can be used in a variety of debug scenarios, including power-on debug, margin debug, signal debug, and circuit debug, as shown in Table 7.
At power-on, CPGC offers explicit pattern control and stress testing capabilities to prove the bus is stable. This helps to isolate any issues away from the memory subsystem. CPGC IP requires very little enabling at power-on to become a useful tool.

To debug margin failure or low margin, CPGC offers precise turnaround control. CPGC can easily target any rank and channel to isolate the low margin rank or channel. CPGC can get margins of all byte lanes and individual lanes, while OS tools tend to hang at the first failure.

To debug signal integrity, CPGC's explicit pattern control helps to generate ISI and crosstalk. The subseq_wait feature helps to control the gap between transactions.

In circuit debug, CPGC's control of read and write helps to isolate the transmitter and receiver of the memory physical layer independently. CPGC's precise control of rank targeting and turnarounds helps to catch any tri-state buffer issues. LFSR addressing provides semi-repeatable but randomized rank targeting, which helps to debug address transmitter circuits. As a data source independent of core functional traffic, CPGC helps to isolate circuit issues from memory controller or core issues.

In BIOS, CPGC IP supports memory training algorithms, helps to initialize all memory content at boot, and helps to complete Memtest, which is a quick test of all available memory. These usages are shown in Table 8. For training, the same VA pattern can be used to margin and center the strobe and reference voltage settings. Basic training can be done without using the functional path, simplifying the implementation. CPGC can generate back to back transactions and provide CADB patterns for command training.

Table 8: CPGC architecture usage for memory initialization/training

CPGC is also used in board high volume manufacturing (HVM) test. In HVM, minimizing test time is the key to reducing product cost. CPGC provides an automated way to screen and margin the memory subsystem.
As memory capacity increases, the number of memory cell defects gains even greater significance. CPGC IP is fully equipped to test the memory cell through industry standard algorithms such as march C, march G, march SL, butterfly, galloping, and MATS+. As a programmable reusable BIST engine, CPGC IP also supports many complex memory test algorithms through the use of a sophisticated memory test language (MEMTEL). CPGC's software assisted repair technology for post package defects led to the currently defined JEDEC protocol standard, in which the DRAM manufacturers provide a finite number of reserved spare rows for memory cell failures introduced during the packaging and assembly of 3D stacked TSV packages. These failures can be introduced during the solder reflow and heating process, and cannot be predicted prior to packaging and assembly. The PPR flow can automatically repair failures detected at HVM in seconds. CPGC thus offers large test time savings compared to the operator intervention method, which takes multiple minutes to identify and repair failing bits.
4.5 Software Architecture

This integrated CPGC technology has built-in customizable flexibility, enabling the testing and diagnosis of the entire memory sub-system. In case of impending failure, it is possible to isolate the memory field-replaceable unit (FRU) for further testing through CPGC IP. The BIOS based UEFI architecture supports a middleware layer that can incorporate the CPGC API(s) at boot-time as well as run-time. At boot-time, the CPGC-based test executes at the initialization of the memory reference code (MRC). At run-time, the code executes in the system management interrupt (SMI), which is a virtualized space in the system. The API libraries are incorporated in both the boot-time and run-time environments, while boot-time CPGC code is executed on an as needed basis. Execution of run-time code can cause system perturbations that can degrade the performance of the workload running on it. In order to execute the CPGC test through the SMI-based middleware, the system has to be prepared by satisfying these two conditions.

1. Reason for executing the test - Performance is continually analyzed to predict an upcoming failure. Normally, this failure is detected through statistical modeling. This modeling scheme includes evaluating the memory soft-error patterns to predict future hard failures. Whenever a non-spurious pattern of soft failures is detected, it is compared against a predetermined threshold to forecast hard failures. When this threshold is exceeded, the CPGC test is marked to be initiated either while the system is still functional or at the next boot.

2. System Quiescent - Once the CPGC test is marked (for a given memory sub-system), system resources are quieted by migrating the existing workload. One such OS assisted mechanism is called memory hot removal, which can dynamically remove the memory marked for failure. Once this memory is isolated, we can perform certain CPGC tests in a contained manner.
Once the test is completed, memory can then be hot-added using OS assisted methods supported by ACPI specifications.

A firmware level approach to hosting CPGC tests and API(s) requires building a memory-map infrastructure for identifying memory addresses. Memory initialization code is modified to incorporate CPGC test code. The system management interrupt (SMI) provides the necessary application programming interfaces (API) to assist in constructing and configuring the CPGC assisted memory tests. Figure 31 illustrates the UEFI boot service layer that hosts the CPGC test modules. Similar modules are replicated into the SMI for run-time execution.

Figure 31: UEFI compliant BIOS that hosts the CPGC tests

UEFI is a standards-based modern software architecture (originated by Intel) that offers a rich, extensible pre-operating system (OS) environment with advanced boot and runtime services. It has intrinsic networking capability and is designed from the ground up to work with multi-processor (MP) systems. The UEFI standard is architected for dynamic modularity. This has opened up opportunities for collaboration on the technology, reducing development costs and time to market through easy integration of plugin modules from different partners and vendors.

4.5.1 Software Assisted Dynamic Margining

In order to reduce the system guard-band, a CPGC assisted technique can identify the margins that are sufficient to operate the DRAM rank with optimal performance per watt. The goal in this case is to identify excess margins and convert them into voltage reduction. This checking is done by UEFI assisted agents in the SMI handler. These handlers are invoked opportunistically to test the availability of sufficient margins.

4.5.2 Software Assisted CPGC Test Invocation

As mentioned earlier in this section, ad-hoc invocation of a CPGC based test can interfere with the performance requirements of the workload.
Therefore we use statistical methods to evaluate the probability of failures based on memory related system events (single-bit errors, etc.) to trigger this mechanism. These statistical methods seek to synthesize the temporal patterns of these failures using Markov chains. A system in this case can be modeled as a discrete Markov process to evaluate lifetime reliability, and is described at any time as being in one of the states (S = normal, repair, fail, etc.). By employing Markov modeling we can represent the failure detection, repair, and maintenance process graphically. This provides us with a tool to evaluate system reliability measures and possible corrective measures to make the system more reliable for the given cost. Details of this methodology are beyond the scope of this paper.

4.6 Simulation Results

Figure 32 illustrates the CPGC IP interfacing mechanism with the memory controller via a standard interface using a request and grant-credit mechanism. After the initial three credits available for write are reduced to zero, WrCreditValidDC returns a credit on each high pulse, eventually restoring the credits to one, to two, and to three.

Figure 32: Read/Write credits for standard IP interface

Figure 33: Base offset memory stress algorithm

Figure 34: Checker board memory data stress algorithm

Figure 35: PRBS stress patterns

In order to find row-hammer problems, the CPGC IP generates a base address on the aggressor row and an offset address on the victim row. Repeating base writes may result in row-hammer failures. As illustrated in Figure 33, RequestCD_Address_R stays on base row zero for a number of clocks and jumps to offset row one for a brief one or two clocks.

Checkerboard memory stress, as illustrated in Figure 34, writes all ones followed by all zeros to the DQ pins, which can be used to stress the power supply of the memory IO as well as memory cell retention.
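The checkerboard stress above, combined with the address-dependent dynamic inversion of Section 4.3.8, can be sketched as follows. This is only an illustration: the function names are hypothetical, and the inversion rule (invert when the XOR of the row and column address LSBs is one) is an assumed example of an address-dependent rule, not the rule documented for the CPGC hardware.

```python
# Checkerboard data generation via address-dependent inversion.
# The parity-based inversion rule is an assumption for illustration.


def checkerboard_word(row, col, width=64, invert_phase=0):
    """Return the data word for (row, col): an all-zeros DC pattern that is
    inverted to all ones when the address parity (plus phase) is odd."""
    all_ones = (1 << width) - 1
    invert = (row ^ col ^ invert_phase) & 1
    return all_ones if invert else 0


def checkerboard_fill(rows, cols, width=8, invert_phase=0):
    """Build the expected memory image for a rows x cols region."""
    return [
        [checkerboard_word(r, c, width, invert_phase) for c in range(cols)]
        for r in range(rows)
    ]


if __name__ == "__main__":
    for line in checkerboard_fill(4, 4, width=8):
        print(["{:02x}".format(word) for word in line])
    # Flipping invert_phase yields the complementary checkerboard, so every
    # cell is exercised with both data values across two passes.
    for line in checkerboard_fill(4, 4, width=8, invert_phase=1):
        print(["{:02x}".format(word) for word in line])
```

Walking the columns of one row in order puts alternating all-ones and all-zeros words on the data bus, which matches the "all ones followed by all zeros" traffic described for the DQ pins.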
An LFSR is an important mechanism for generating a pseudo random bit stream (PRBS). As illustrated in Figure 35, the 32-bit seed is fixed. As soon as enable goes high, Cpat_out, which generates eight bits of output every clock, starts to output random bit streams.

An aggressor/victim pattern is illustrated in Figure 36. Notice that while one lane acts as the victim, the remaining lanes act as aggressors, attacking the victim using a different pattern (LFSR1). Switching neighbors purposely introduces crosstalk on the victim bit, which causes stress that can lead to electrical failure.

Figure 36: Victim and Aggressor Pattern

4.7 CPGC Results and Silicon Measurements

This section presents the CPGC results and silicon measurements. CPGC was used in silicon validation to introduce the VA pattern rotation shown in the simulation in Figure 36. Figure 37 shows the DDR margin without the VA pattern. Figure 38 shows the DDR margin with aggressors turned on. Comparing Figure 37 with Figure 38, the CPGC controlled VA pattern significantly reduced the measured margin, which makes it an effective tool for stressing the DDR IO link. The added VA rotation using CPGC reduced pattern reprogramming between tests and eliminated system reboots between the tests. This VA pattern rotation feature of the CPGC alone reduced the validation time by 3x.

Figure 37: silicon measurement of DDR IO eye diagram

Figure 38: eye diagram with VA stress pattern turned on

CPGC is capable of debugging and isolating IO lanes bit by bit to see their contribution to crosstalk and the resulting margin. Shown below in Figure 39 is the BIOS measured eye height for all 64 data lanes and eight ECC lanes with all byte lanes enabled.
Figure 39: CPGC BIOS measured eye height with all lanes enabled

Shown in Figure 40 are the same eye height measurements with data lane five and ECC lane five disabled. Compared to Figure 39, there are three ticks of margin improvement. This shows that bit lane five alone contributed three ticks of margin reduction.

Figure 40: CPGC BIOS measured eye height with bit 5/ECC 5 disabled

This type of bit by bit isolation benefits root cause debug of board issues. Figure 41 shows both a working board with low crosstalk (top), where the victim bit in white is shielded from the seven aggressors in blue, and a non-working board, where the shielding is missing between the victim in white and the aggressors in blue.

On an Intel SOC (early stepping), CPGC was used to debug and isolate a memory controller speed-path. CPGC was able to isolate the bug because it provided an alternative path to the DDR physical layer. The debugger noticed that the functional memory subsystem did not work at certain frequencies, but that the DDR physical layer worked at the specified frequency when CPGC was used to bypass the functional memory controller. The debugger was then able to quickly focus attention on the memory controller and find the speed-path.

Figure 41: CPGC identified missing shielding on board layout

4.7.1 Debugging Boot Failure Using CPGC LFSR Addressing

CPGC was proven to be extremely useful in debugging memory initialization and boot failures.
At first, the boot failure was not deterministic, which made it hard to reproduce and would have cost a lot of debug time. As illustrated near the top of Figure 42, typical memory traffic without CPGC writes to Rank 0 followed by Rank 1. As illustrated near the bottom of Figure 42, varying the burst size using CPGC with LFSR addressing makes the targeted rank change constantly (purple waveform), which led to a quick failure reproduction. This helped to reduce boot failure debug time down to a few minutes.

Figure 42: CPGC architecture IO training boot failure debug

As illustrated in Figure 43, the DQS_P and DQS_N pins crossed during tri-state. The receive enable (RCVEN) inside the SOC Rx was asserted immediately for the next read command. The crossing caused the DQS clock inside the SOC Rx, post RCVEN, to glitch, which corrupted the Rx FIFO. Once again, CPGC provided precise address and data pattern control, which enabled the debugger to fine tune the pattern and dial into the glitch quickly. This again saved several days of debug labor compared to debugging without the CPGC reusable BIST engine.

Figure 43: IO DQS glitch corrupting Rx FIFO debug

In DDR initialization and training, CPGC's CADB pattern and deselect feature provided an ISI and crosstalk rich pattern during the non-functional select de-asserted state. This helped the BIOS and firmware to quickly center the command and address pins and ensure these pins were protected against functional ISI noise.

Figure 44 illustrates the measured eye from BIOS based memory training and eye centering using a CPGC based LFSR pattern. The LFSR provides a quick pseudo random pattern. The pattern does not have to be the worst case stress; how quickly the pattern can be applied is what matters from a BIOS boot time point of view. CPGC delivers this speedy pattern, which is ideal for memory training. The vertical axis represents the voltage measurements, and the horizontal axis represents the timing measurements.
Both voltage and timing offsets can be applied to measure the boundary of failure. A contour of the eye is mapped out by the BIOS before the voltage and timing are centered in the middle of this mapped eye.

Figure 44: DDR IO initialization and training BIOS mapped eye

CPGC can also be used in conjunction with circuit tuning controls such as reference voltage and phase interpolators to intentionally offset the data strobe to measure I/O jitter and DDR skew.

4.8 Summary

The CPGC architecture offers significant time savings for validation, test, and debug for DDR, TSV, and 3DS WIO, as shown in Table 9. The victim and aggressor pattern rotation automation of the CPGC reusable BIST engine alone decreased validation time by 3x. The CPGC software assisted repair technology reduced test time from minutes down to seconds. The CPGC reusable BIST engine enabled debug of various memory and IO failures, including memory IO not booting. The CPGC reusable BIST engine reproduced non-deterministic boot failures in minutes, and consistently reproduced bugs that were hard to find, which enabled a debug time savings of 5X.

Table 9: CPGC architecture savings to validation, test, and debug
              Time Savings          Comments
Validation    3X                    VA pattern rotation automates victim lane selection
Test          Minutes to seconds    Auto-repair reduced loading and unloading of failure signature
Debug         5X                    Reproduced non-deterministic boot failures in minutes

An overview of the CPGC reusable BIST engine architecture on Intel 14nm SOC was presented in this paper. The CPGC BIST engine was designed as a reusable IP that supports standard interfaces to enable a quick turnkey SOC solution. CPGC is also used in memory training and initialization, which makes it a critical part of Intel SOCs.

4.9 Acknowledgments

The authors would like to acknowledge Thilak K Poriyani House Raju for design, simulation, and verification of the CPGC IP.
The authors would like to thank Adam Taylor for his knowledge and data contribution to this paper. The authors would like to acknowledge Anne Meixner for reviews, technical advice, and encouragement.

4.10 References

[1] S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, "Patching Processor Design Errors with Programmable Hardware," Micro, IEEE, vol. 27, pp. 12-25, 2007.
[2] R. Venkatesh, S. Kumar, J. Philip, and S. Shukla, "A fault modeling technique to test memory BIST algorithms," in Memory Technology, Design and Testing, 2002. (MTDT 2002). Proceedings of the 2002 IEEE International Workshop on, 2002, pp. 109-116.
[3] Y. Jen-Chieh, W. Chi-Feng, C. Kuo-Liang, C. Yung-Fa, H. Chih-Tsun, and W. Cheng-Wen, "Flash memory built-in self-test using March-like algorithms," in Electronic Design, Test and Applications, 2002. Proceedings. The First IEEE International Workshop on, 2002, pp. 137-141.
[4] A. J. Van de Goor, S. Hamdioui, and H. Kukner, "Generic, orthogonal and low-cost March Element based memory BIST," in Test Conference (ITC), 2011 IEEE International, 2011, pp. 1-10.
[5] S. Sheng-Chih, H. Hung-Ming, C. Yi-Wei, and L. Kuen-Jong, "A high speed BIST architecture for DDR-SDRAM testing," in Memory Technology, Design, and Testing, 2005. MTDT 2005. 2005 IEEE International Workshop on, 2005, pp. 52-57.
[6] G. Mrugalski, A. Pogiel, N. Mukherjee, J. Rajski, J. Tyszer, and P. Urbanek, "Fault Diagnosis in Memory BIST Environment with Non-march Tests," in Test Symposium (ATS), 2011 20th Asian, 2011, pp. 419-424.
[7] P. Bernardi, L. Ciganda, M. S. Reorda, and S. Hamdioui, "An Efficient Method for the Test of Embedded Memory Cores during the Operational Phase," in Test Symposium (ATS), 2013 22nd Asian, 2013, pp. 227-232.
[8] P. Bernardi, M. Grosso, M. S. Reorda, and Y. Zhang, "A programmable BIST for DRAM testing and diagnosis," in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1-10.
[9] Y. Dongkyu, K. Taehyung, and P.
Sungju, "A microcode-based memory BIST implementing modified march algorithm," in Test Symposium, 2001. Proceedings. 10th Asian, 2001, pp. 391-395.
[10] B. Schroeder, E. Pinheiro, and W. Weber, "DRAM Errors in the Wild: A Large-Scale Field Study," SIGMETRICS/Performance '09, June 15-19, 2009, Seattle, WA, USA.
[11] D. Blankenbeckler, A. Norman, and M. Shepherd, "Comparison of off chip interconnect validation to field failures," VALID, 2011, Barcelona, Spain.
[12] J. J. Nejedlo, "IBIST (interconnect built-in-self-test) architecture and methodology for PCI Express," Test Conference, 2003. Proceedings. ITC 2003.
[13] J. J. Nejedlo, "TRIBuTE board and platform test methodology," Test Conference, 2003. Proceedings. ITC 2003.
[14] J. J. Nejedlo, "IBIST(TM) (interconnect built-in self-test) architecture and methodology for PCI Express: Intel's next-generation test and validation methodology for performance IO," Test Conference, 2003.
[15] J. Nejedlo and R. Khanna, "Intel IBIST, the full vision realized," Test Conference, 2009. ITC 2009. International, pp. 1-11, 1-6 Nov. 2009.
[16] http://www.memcon.com/pdfs/proceedings2013/track1/Tuning DDR4 for Power and Performance.pdf
[17] http://www.memcon.com/pdfs/proceedings2013/track1/Tuning DDR4 for Power and Performance.pdf
[18] http://www.google.com/patents/WO2014004748A1?cl=en
[19] http://www.youtube.com/watch?v=i3-gQSnBcdo
[20] B. Querbach, "CPGC2, a built-in self test and auto-repair engine for DRAM (WIO, DDR4)," Patent pending 14/141,239.
[21] B. Querbach, "Comparison of Hardware Based and Software Based Stress Testing of Memory IO Interface,"
IEEE MWSCAS 2013 [22] "`http://en.wikipedia.org/wiki/Transistor count"' [23] "`http://www.extremetech.com/computing/152979-intel-announces-full-set-of-new- atom-and-xeon-server-processors-to-fend-off-arm"' 96 Platform IO and System Memory Test Using L3 Cache Based Test (CBT) and Parallel Execution of CPGC Intel BIST Engine Bruce Querbach [email protected], Tan, Peter Yanyang [email protected], Lovelace, Van [email protected], Blankenbeckler, David [email protected] Khanna, Rahul [email protected] Puligundla, Sudeep [email protected] Patrick Chiang [email protected] 2015 ITC (submitted) Anaheim California, USA 97 2015 IEEE International Test Conference (ITC) 5 Platform execution time improvement using L3 Cache Platform IO and System Memory Test Using L3 Cache Based Test (CBT) and Parallel Execution of CPGC Intel BIST Engine Abstract As the memory industry pushes to increase memory density, device variation is creating more defects. Furthermore, new form factors (phone, tablet, mobile and client PC) and low cost board and platform limits physical access to JTAG or TAP. Taking advantage of the x86 architecture’s high functional bandwidth to memory, the quickest way to access and test memory is through CPU core, by storing and running parallel CPGC/BIST test content via CPU L3 cache, one CPGC BIST engine per memory channel. We propose a cache based testing framework that speeds up test time 60x to 170x compared to JTAG or TAP based testing using the same test content. We will present the cache based test (CBT) architecture and infrastructure (MRC/NEM setup, CPGC/IBIST), test content, results, and a side by side comparison of test time to JTAG or TAP. Finally we will discuss and compare this approach to generalized cache based tests. 98 5.1 Introduction Memory capacity and density are increasing year after year [1], and so are the defects in memory, IO, microprocessors, and SOC. Sarangi et. al. 
[2] report that memory interface design bugs account for 17.4% and 20.2% of the total published errata for AMD64 and Intel's Pentium 4, respectively. There is therefore a critical need for an efficient methodology to validate the memory interface in a microprocessor design. Even if the defect rate remains the same, larger memory capacity leads to more defects. For example, DRAM errors have been shown to be one of the leading causes of hardware failure in modern compute clusters [3] [4]. Memory test time also increases with memory capacity; hence an effective memory test and repair built-in self-test (BIST) engine is critical.

Hardware parallelism can be fully exploited by running as many instances of the converged pattern generator and checker (CPGC) BIST engine in parallel as needed, for example, one BIST engine per memory channel. However, even with a fully parallel hardware CPGC BIST engine implementation [5]-[12], test execution can still be slow if significant test content needs to be loaded through JTAG or Intel test port (ITP) ports. Further, with the introduction of computing platform and system form factors such as phones and tablets, providing physical space for a test access port (TAP) is a challenge.

Some work has been done on testing using cache. For example, Gurumurthy et al. [13] found experimentally that cache-resident self-test (CReST) can be a substitute for functional patterns. Foutris et al. [14] were able to improve both test speed and coverage by running multi-threaded tests on OpenSPARC T1 processor models. Theodorou et al. [15] improved on their low-cost software-based self-test (SBST) to exploit the parallelism of multithreaded multicore architectures in order to speed up their testing of L1 cache. However, using cache based tests (CBT) on an IA86 platform and system to test the entire DRAM memory subsystem has not been explored before.
We propose a CBT architecture that not only significantly reduces test time by testing the entire memory subsystem in parallel, but is also target based and requires no debug port, making it ideally suited for the phone and tablet market.

The CBT architecture is shown in Figure 45. The test content is stored in the BIOS on flash ROM or flash (writeable) devices before power on. As a part of the CPU boot process, the CPU fetches the tests into level 3 (L3) cache. Tests such as the rank margin tool (RMT) program the CPGC hardware BIST engines and execute in parallel (one BIST per DDR4 channel). The CPGC BIST engine writes to the memory under test (DDR4 DIMMs), reads the results back, and compares for errors. Each BIST is designed to occupy maximal memory bandwidth; thus issuing test instructions from the CPU cache to flood all available memory bandwidth (BW) is the quickest method to test the memory subsystem, as will be shown in the results of this paper.

Figure 45: Cache based test architecture (BIOS and flash storage; x86 IA core with L3 cache holding the cache based test; CPGC BIST engines; DDR4 memory under test)

In section 5.2, we will introduce the cache based test architecture and infrastructure and how memory reference code (MRC) and CBT are set up using non-eviction mode (NEM). In section 5.3, we will present more CPGC BIST engine architecture details, which form the foundation of CBT. In section 5.4, we will present the test content and some of the powerful test results from CBT and RMT, for example how CBT with CPGC and the on-die voltage monitor can detect and debug system and platform level crosstalk and power supply noise problems. In sections 5.5 and 5.6, we will present test results, and compare the test results and test time side by side to show a significant decrease in test time.
In section 5.7, we will discuss and compare this approach to more generalized cache based tests in terms of advantages and disadvantages. Section 5.8 provides a summary of the paper.

5.2 CBT Uses Cache-As-RAM Mode

5.2.1 Non-Evict Mode (NEM)

Every IA86 processor must fetch instructions from a predefined reset vector during initial power-on. This is where the BIOS gets loaded from flash ROM or RAM into the CPU cache, and it is the perfect place to store our CBT test content and test results. Intel provides its customers with memory reference code (MRC) in source code and binary form for the primary purpose of initializing the complex DDR3/DDR4 subsystem. RMT uses the CPGC BIST engine to perform DDR margin tests. The CPGC BIST engine is also capable of initializing memory (fill zeros) and of memory cell test and repair. A subset of this (margining to center strobes, fill zeros, and a quick memory scan test) is appropriate as a part of functional memory initialization; hence, all of these are a part of MRC and BIOS. Since boot time is critical to the client and mobile PC user experience (without memory, the user does not see any text or graphics displayed on the screen), the functional initialization became a subset of the full debug and diagnostic capabilities. The user can choose in the BIOS setup screen which subset of tests to run on the next boot. To speed up test content loading and leverage the IA86 core, the test content is compressed (zipped) when stored on the flash, then uncompressed as it is read into the L3 cache.

While anyone can write code that gets loaded into cache, or even BIOS code that runs at boot, most if not all of these efforts would run after memory is initialized and possibly when some unified extensible firmware interface (UEFI) or operating system (OS) support is in place. However, running IA86 code without memory initialization requires NEM, which is a subset of cache-as-RAM (CAR) mode.
Figure 46 shows details of how NEM is enabled at boot time, with one logical processor selected as the boot strap processor (BSP). NEM is a special mode of IA86 memory configuration in which the user must configure one base address and length for the data stack, and one base address and length for the code stack. The code stack becomes read only to avoid eviction from cache (thus self-modifying code is not permitted). The data stack is of course writeable, but would be un-cacheable, since it is already in the cache. With all the configuration registers set via the memory type range registers (MTRR), the CPU can be instructed to enter NEM.

The advantage of running in NEM for testing memory is test time. Since IA86 is architected to give maximal memory bandwidth from the core, this is the fastest way to run memory tests, especially if the test content calls a hardware BIST engine that has been paired one to each memory channel. Another important advantage of NEM is that, since none of the test content resides in memory, we are able to test the entire system memory, margin the memory to failure, and reset the memory subsystem, all in a tight loop within our test without a reboot or power cycle. This has proven to be extremely fast when seeking the worst margin corner and weak cells. If we tried to do this under an OS, we would likely end up with a blue screen. This methodology of cache based testing can be applied generally to other computer interfaces and subsystems.
Figure 46: Non-Evict Mode entry flow diagram (NEM Init)
Assume:
- flat 32 bit protected mode
- 1 logical processor per socket is the Boot Selected Processor
Flow, in order:
- Init all fixed and variable range MTRR to 0
- Configure default memory type to Un-Cacheable (UC)
- Determine base address and length for data stack
- Configure data stack as Write Back Memory Type using MTRR register
- Determine base address and length for code region
- Configure code region as Write protected Memory Type using MTRR register
- Enable MTRR, BSP cache
- Enable No-Eviction Mode setup state
- Enable No-Eviction Mode run state
- Return to caller

5.2.2 NEM Exit to Preboot Execution Environment (PXE)

As soon as possible after configuring initial system memory and completing the cache based tests, the BIOS should transition to normal cache mode (see Figure 47). Any stack data required by the BIOS must be copied to the initial system memory, and that initial system memory must be in the un-cacheable (UC) state. Next, the cache must be flushed by executing the invalidate instruction (INVD). After this point, cache data is no longer valid and the next read will come from the flash storage. The BIOS must ensure that no data modification occurs until NEM exit is complete. Finally, the non-eviction run state and non-eviction setup state are cleared, and the BIOS can continue with the rest of the power-on self-test (POST) and transition to PXE boot.
Figure 47: Non-Evict Mode exit flow diagram (NEM Exit)
Assume:
- flat 32 bit protected mode
- 1 logical processor per socket is the Boot Selected Processor
- System memory is available but must be in the uncacheable state
- Any data required by the BIOS must be copied to system memory
Flow, in order:
- Configure default memory type to Un-Cacheable (UC)
- Disable the MTRRs
- Flush the caches, execute the INVD instruction
- Disable the No-Eviction Mode run state
- Disable the No-Eviction Mode setup state
- Initialize the MTRRs
- Re-enable the MTRRs
- Continue POST

During the NEM exit flow, cache based testing results must be handled carefully so that they are not lost when the entire cache is invalidated. The test data can be stored in the data stack during NEM run time, but it must first be copied to the actual system memory along with other memory initialization parameters at NEM exit. This must be done quickly so as not to delay the system BIOS boot; hence, this data is not written to a hard drive and no advanced post-processing is done in the NEM exit flow. We are simply trying to preserve the data while the entire memory and cache system is being reconfigured. Storing test results from cache to memory is a quick process, again thanks to the large functional bandwidth from the core. While many of the test results and plots presented in this paper use the critical data collected during cache based testing, the data has to be post-processed to make it presentable.

5.3 CPGC BIST Engine

5.3.1 Architecture Details

An overview of the CPGC architecture was briefly discussed at ITC 2014 [5]. CPGC forms the critical foundation of CBT; hence we now present the CPGC architecture in more detail. As shown in Figure 48, the CPGC architecture accomplishes complex address walking algorithms with nested loops of address instruction (AI), data instruction (DI), algorithm instruction (GI), command instruction (CI), and offset instruction (OI).
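The loop nesting can be illustrated with a short Python sketch. This is only a software model for orientation, not the hardware implementation: the function name `walk_instructions` and the tuple layout are hypothetical, the per-loop maxima are the ones quoted in sections 5.3.2 to 5.3.6, and the handling of offsets (an innermost loop that is skipped when no offset is configured) is an assumption based on the flow described for Figure 49.

```python
# Maximum instruction counts per loop, as quoted in sections 5.3.2 to 5.3.6.
MAX_COUNTS = {"address": 4, "data": 4, "algorithm": 8, "command": 24}


def walk_instructions(n_addr, n_data, n_algo, n_cmd, n_offset=0):
    """Yield (AI, DI, GI, CI, OI) index tuples in nested-loop order.

    OI is None when no offset instruction is configured (n_offset == 0).
    This is a hypothetical model of the loop order only.
    """
    requested = {"address": n_addr, "data": n_data,
                 "algorithm": n_algo, "command": n_cmd}
    for name, count in requested.items():
        if not 1 <= count <= MAX_COUNTS[name]:
            raise ValueError(f"{name} instruction count {count} out of range")
    offsets = range(n_offset) if n_offset else [None]
    for ai in range(n_addr):                  # loop 3: address (outermost)
        for di in range(n_data):              # loop 2: data
            for gi in range(n_algo):          # loop 1: algorithm
                for ci in range(n_cmd):       # loop 0: command
                    for oi in offsets:        # offset (innermost, optional)
                        yield (ai, di, gi, ci, oi)


if __name__ == "__main__":
    for step in walk_instructions(2, 1, 1, 2):
        print(step)
```

With all loops at their maxima and no offsets, the model visits 4 x 4 x 8 x 24 = 3072 instruction combinations, which gives a feel for how much test behavior a single programmed sequence can express.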
Figure 48: CPGC architecture loops (nested, from outermost to innermost: address instruction loop, data instruction loop, algorithm instruction loop, command instruction loop, offset loops 1 and 2)

The flow diagram of the CPGC BIST engine is shown in Figure 49. The four areas of address, data, algorithm, and command CPGC instructions are initialized. The offset CPGC instruction is optionally initialized, depending on whether there is an offset. When all the instructions are ready, the complete instruction set is ready to execute. When the lowest order instruction is complete (e.g., offset), the flow checks for additional offsets before completing the loops for command, algorithm, data, and address instructions.

Figure 49: CPGC architecture flow diagram (for each level, initialize the instruction selection, select the next instruction, and test whether all instructions at that level have been selected; at the innermost level, execute the selected address, data, algorithm, command, and offset instructions, compare results to expected values, identify defects, and store defect location data)

5.3.2 Address Instruction (Loop 3)

The outermost loop of address instructions, which directs the addressing for the entire test, is shown in Figure 50. This loop supports up to four instructions. Each address instruction supports the following features: address direction, address order, enable address decode, and address decode rotate count.
Figure 50: CPGC architecture loop 3, address instruction (Loop 3: up to 4 Address_Instructions; Loop 2: up to 4 Data_Instructions; Loop 1: up to 8 Algorithm_Instructions; Loop 0: up to 24 Command_Instructions; Offset 1 and Offset 2)

5.3.3 Data Instruction (Loop 2)

The data instruction loop shown in Figure 51 supports up to four data instructions with the following features: data select rotation enable, invert rotation enable, invert data, and data background.

Figure 51: CPGC architecture loop 2, data instruction

5.3.4 Data Capabilities Overview

CPGC includes two data uni-sequencers, which can supply 32-bit linear feedback shift register (LFSR), 32-bit user-programmable, and low frequency (LMN) counter patterns. With address based inversion, the resulting data can be: all zeros, all ones, checkerboard, row stripe, and column stripe.

5.3.5 Algorithm Instruction (Loop 1)

The algorithm instruction loop shown in Figure 52 supports up to eight algorithm instructions with the following features: starting command, starting command sequence address direction, reverse address direction, invert data, FastY Init, wait-for-event-before-algorithm, wait-to-start-until count value, and reset count value at start.

Figure 52: CPGC architecture loop 1, algorithm instruction

5.3.6 Command Instruction (Loop 0)

The command instruction loop shown in Figure 53 supports up to 24 Command_Instructions with the following features: read or write, invert data, use alternate data, hammer, offset command, address decode enable, and use Offset1 or Offset2.
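The data backgrounds listed in section 5.3.4 can be illustrated with a small Python sketch of address based inversion. The function name `background`, the pattern names, and the choice of which address bits drive the inversion are assumptions made for illustration; the text does not specify how CPGC derives the inversion from the address.

```python
def background(rows, cols, pattern, base=0):
    """Return a rows x cols grid of bits for one data background.

    Each cell holds the base bit, inverted according to its (row, column)
    address. The inversion rules below are illustrative assumptions.
    """
    invert_rules = {
        "all_zeros": lambda r, c: 0,              # never invert a 0 base
        "all_ones": lambda r, c: 1,               # always invert a 0 base
        "checkerboard": lambda r, c: (r ^ c) & 1,  # alternate in both axes
        "row_stripe": lambda r, c: r & 1,         # alternate row by row
        "column_stripe": lambda r, c: c & 1,      # alternate column by column
    }
    invert = invert_rules[pattern]
    return [[base ^ invert(r, c) for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    for name in ("checkerboard", "row_stripe", "column_stripe"):
        print(name)
        for row in background(4, 8, name):
            print("".join(str(bit) for bit in row))
```

Passing `base=1` produces the complement of each background, which corresponds to the invert data feature mentioned for the data, algorithm, and command instructions.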
Figure 53: CPGC architecture loop 0, command instruction

5.3.7 Offset Command Instruction (Loop 0)

An offset command instruction is a special command instruction that enables a variety of complex offset operations to occur. Figure 54 below shows two offset structures that any command instruction can point to.

Figure 54: CPGC architecture, offset instruction

5.3.8 Examples of Usage Models for the Offset Command Instruction

The usage models of the offset instruction include address decode offset, butterfly, and galloping algorithms. The address decode algorithm can test the address decoder in the X and Y directions. The butterfly algorithm can test near neighbors (crosshair), all neighbors (crosshair and diagonal), and selective neighbors. The galloping algorithm can test block, column, row, or diagonal memory addresses.

5.3.9 Refresh Control Generator

Refresh control is important to maintain DRAM data integrity. It is also an excellent tool to test and stress the DRAM for cell retention failures. The CPGC BIST engine has a variable refresh signal generator (Figure 55) to control and stress the DRAM refresh rates.

Figure 55: Refresh control generator (blocks: algorithm instruction, data input, variable refresh signal generator, refresh signals)

5.4 Rank Margin Tool (RMT)

CBT is built on the foundation of CPGC, as explained above. At a higher level (firmware), the CBT test content is mostly built into a tool called RMT, so named because it margins each DDR rank until failure. RMT can be applied to DDR3, DDR4, LP-DDR3, LP-DDR4, etc.
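Margining a strobe until failure on each side and then centering it can be sketched in Python as follows. This is a minimal model under stated assumptions, not the RMT firmware: `passes` is a hypothetical stand-in for programming the strobe offset register and running a CPGC pass/fail test, the function names are invented for illustration, and the binary search additionally assumes a single pass/fail boundary on each side of a known-good starting offset.

```python
def find_edge_linear(passes, start, stop, step):
    """Walk from start toward stop; return (first failing offset, tests run)."""
    tests = 0
    for offset in range(start, stop + step, step):
        tests += 1
        if not passes(offset):
            return offset, tests
    return None, tests


def find_edge_binary(passes, good, bad):
    """Return (first failing offset, tests run) between good and bad.

    Assumes passes(good) is True, passes(bad) is False, and there is
    exactly one pass/fail boundary between the two offsets.
    """
    tests = 0
    while abs(bad - good) > 1:
        mid = (good + bad) // 2
        tests += 1
        if passes(mid):
            good = mid
        else:
            bad = mid
    return bad, tests


def center(left_fail, right_fail):
    """Average the two fail points to get the strobe center."""
    return (left_fail + right_fail) / 2


if __name__ == "__main__":
    # Hypothetical eye: offsets -19..+16 pass, everything else fails.
    toy_passes = lambda offset: -20 < offset < 17
    left, n_left = find_edge_binary(toy_passes, 0, -32)
    right, n_right = find_edge_binary(toy_passes, 0, 32)
    print(left, right, center(left, right), n_left + n_right)
```

The linear sweep needs a number of tests proportional to the width of the sweep range, while the binary search needs a number proportional to its logarithm, which is why the choice of search matters when every test point costs a full CPGC run.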
In the simplest form, the margin failure points of the left strobe and right strobe positions tell us where the strobe should be centered. RMT accomplishes this by writing to the DDR IO registers for strobe offset, configuring the CPGC BIST engine to record pass or failure for this specific strobe margin setting, then sweeping the strobe setting across the entire range. RMT averages the left fail and right fail positions to determine the optimal center strobe position. To speed up this margin search, an advanced search algorithm such as binary search can be used to cut the number of margin tests from:

$O(n)$    Equation 1

to:

$O(\log n)$    Equation 2

where n is the number of strobe settings in the sweep range.

For high volume manufacturing tests, RMT can screen pass and fail at two pre-specified points, which is significantly faster than a comprehensive search. RMT can measure crosstalk and inter symbol interference (ISI) by running different patterns on the DDR4 bus that specifically excite neighboring bits. A complete crosstalk and ISI test can be time consuming, but it is a useful debug tool when the screens or margin tests above indicate a problematic platform. Figure 56 shows results from an IA86 system in which the interactions among the 64 data (DQ) lanes of channel 0 are plotted against each other. Measured margin degradations are normalized to 100% and indicated by color. The strongest interaction shows up on the diagonal, between an individual bit and itself or its immediate neighbor. This is indicative of a parallel single ended bus such as DDR4. There are also approximately four blocks of patterns, which indicates that within a group of 16 bits that are routed closely together crosstalk is significant, but from group to group (of 16 bits) the interaction is limited.

Figure 56: Crosstalk and ISI across 64 DQ bits (RMT)

On-die voltage monitors have been extensively studied [16-20]. When CBT is combined with the Intel on-die voltage monitors, the effect of the CPGC BIST pattern on power supply disturbance can be studied.
As the frequency of the CPGC BIST pattern increases, so does the droop noise. Figure 57 shows the VCCIO bounce as a function of DQ frequency. Up to 10 mV of VCCIO bounce can be measured at 700 MHz to 800 MHz DQ frequency. These power supply noise measurements and debug tools give platform and system engineers the ability to validate and fine tune their platform power design.

Figure 57: VCCIO supply bounce as a function of DQ pattern frequency

RMT can be configured to run such extensive debug tests, and the raw results, in terms of margin deltas for 64 lanes, can be stored in a data structure by the MRC. At the end of MRC memory initialization and NEM exit, the data structure is temporarily placed in the newly initialized main memory. As the BIOS moves on to the next stage of the boot process, called the pre-boot execution environment (PXE), the BIOS is capable of storing the test results in flash. Storing the memory initialization results of the prior boot in this way can significantly speed up subsequent reboots: if the prior results are still considered valid on reboot, the BIOS can skip the lengthy memory initialization to give a faster user experience. The test results in flash can be read by special OS based software for post-processing, or accessed remotely via the Intel manageability engine (ME)/BMC, when applicable, to enable remote system diagnostics and debug.

Figure 58 shows another type of result that CBT can collect. In this case, a typical voltage vs. timing eye diagram is collected by sweeping the strobe from the left (-25) offset to the right (+15) offset, and the DDR4 reference voltage from the bottom (-63) offset to the top (+63) offset, to create a complete eye diagram. Once again, the raw margin data can be stored by the BIOS into flash, and the post-processing software can pull the data from flash, plot it, and extrapolate (shown as a red line) the complete data eye.
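The two dimensional sweep behind such an eye diagram can be sketched in Python. The sketch is only a model: `toy_eye` is a hypothetical elliptical pass region standing in for a real CPGC pass/fail result at each strobe and reference voltage offset, and `sweep_eye` and `render_eye` are illustrative names; only the sweep ranges (-25 to +15 for timing, -63 to +63 for voltage) come from the text.

```python
def sweep_eye(passes, t_range, v_range):
    """Return {(t, v): bool} for every timing/voltage offset pair."""
    return {(t, v): passes(t, v) for v in v_range for t in t_range}


def render_eye(results, t_range, v_range):
    """One text row per voltage offset, top first; '*' = fail, ' ' = pass."""
    rows = []
    for v in sorted(v_range, reverse=True):
        cells = "".join(" " if results[(t, v)] else "*" for t in t_range)
        rows.append(f"{v:+4d} {cells}")
    return "\n".join(rows)


def toy_eye(t, v, t_center=-5, t_half=20, v_half=63):
    """Hypothetical stand-in for a CPGC test: pass inside an ellipse."""
    return ((t - t_center) / t_half) ** 2 + (v / v_half) ** 2 < 1


if __name__ == "__main__":
    timing = range(-25, 16)    # strobe offsets, in ticks
    voltage = range(-63, 64)   # reference voltage offsets, in ticks
    results = sweep_eye(toy_eye, timing, voltage)
    print(render_eye(results, timing, voltage))
```

A full sweep over these ranges visits 41 x 127 = 5207 points, so in practice the raw pass/fail map is what gets stored, with plotting and extrapolation left to post-processing.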
Figure 58: Measured voltage and timing data eye (CBT); horizontal axis: timing (ticks), vertical axis: voltage (ticks)

Figure 59 shows the CBT result for the per-bit timing margin on the transmitter data lanes (TxDq) from lane 0 (bit 0) to lane 63 (bit 63) for DQ and ECC 0 (bit 64) to ECC 7 (bit 71). Each lane is margined to failure, which is shown as “*”. When the lane passes, it is shown as blank (“ ”). Extra passing points (e.g., from 16 to -16) are truncated, since they all pass, to make the plot easier to fit.

N0: TxDq
Per bit margins: TxDq
    0------7 8-----15 16----23 24----31 32----39 40----47 48----55 56----63 64----71
 30 ******** ******** ******** ******** ******** ******** ******** ******** ********
 29 ******** ******** ******** ******** ******** ******** ******** ******** ********
 28 ******** ******** ******** ******** ******** ******** ******** ******** ********
 27 ******** ******** ******** ******** ******** ******** ******** ******** ********
 26 ******** ******** ******** ******** ******** ******** ******** ******** ********
 25 ******** ******** ******** ******** ******** ******** ******** ******** ********
 24 ******** ******** ******** ******** ******** ******** * ****** ******** ********
 23 ******** ******** * ****** ******** ******** * ****** * * **** ******** ********
 22 *** *** *** **** * ****** ******** ******** * **** * **** ******** **** *
 21 * * ** *** **** ** *** **** ** * ** *** * *** * *** * * *** **** *
 20 * * ** * *** * * * * ** * ** *** * * * * ** * * *** **** *
 19 * * * * * * * * ** * * ** * * * * * * **
 18 * * * ** * ** * * *
 17 * * *
-16 * *
-17 * *
-18 * ** ** * * ** **
-19 * ** ** * ** * ** ** * * * *
-20 * ** ***** ** * * ** * * ** ** ** * * ** * *
-21 * ** ***** ** * ** * ** ** * ** ** ****** * ** ** ** ** * *
-22 ** ** ******** ** ** * ** ** ** ** ** ******** ** ** * ** ** * ** **
-23 ** ** * ******** ** ** ** ** ***** ** *** * ******** ***** * ** *** * ** ** **
-24 ***** ** ******** ******** ** ***** ** ***** ******** ******** ** ***** ***** **
-25 ***** ** ******** ******** ******** ******** ******** ******** ******** ***** **
-26 ******** ******** ******** ******** ******** ******** ******** ******** ********
-27 ******** ******** ******** ******** ******** ******** ******** ******** ********
-28 ******** ******** ******** ******** ******** ******** ******** ******** ********
-29 ******** ******** ******** ******** ******** ******** ******** ******** ********
-30 ******** ******** ******** ******** ******** ******** ******** ******** ********
N0.C0.D0.R0: TxDq - Per bit margins

Figure 59: Transmitter data per-bit margin (CBT)

Figure 60 shows another example of CBT results. We collected the per-bit timing margin results on the receiver data lanes (RxDqs) from lane 0 (bit 0) to lane 63 (bit 63) for DQ and ECC 0 (bit 64) to ECC 7 (bit 71). Each lane is margined to failure, which is shown as “*”. When the lane passes, it is shown as blank (“ ”). Extra passing points (e.g., from 19 to -19) are truncated, since they all pass, to make the plot easier to fit.
Per bit margins: RxDqs
    0------7 8-----15 16----23 24----31 32----39 40----47 48----55 56----63 64----71
 30 ******** ******** ******** ******** ******** ******** ******** ******** ********
 29 ******** ******** ******** ******** ******** ******** ******** ******** ********
 28 ******** ******** ******** ******** ******** ******** ******** ******** ********
 27 ******** ******** ******** ******** ******** ******** ******** ******** ********
 26 ******** ******** ******** ******** ******** ******** ******** ******** ********
 25 ******** ******** ******** ******** ******** ******** ******** ******** ********
 24 ***** ** ******** ******** ******** ** ***** ** ***** ******** ******** ********
 23 ***** ** ******** ** * ** ******** ** ***** ** ** ** ** ** * ******** ** ** **
 22 ** ** ** ***** ** * ** ** * * ** ** * ** * * ** ** ** * ** **
 21 * ** ** * ** ** ** * * * * * ** * *
 20 * * * * *
-20 *
-21 * * * * *
-22 * * * ** * * * * * * *** * *** * * *** * **
-23 * * * ** * * ** * * * * * * ** **** * * *** * * *** * **
-24 * * *** * * ** * **** * * **** * * ****** **** * * * *** * * **** * ** **
-25 * * *** ****** * ******** ******** ******** ******* **** *** ******** **** **
-26 * ****** ******** ******** ******** ******** ******** ******** ******** *******
-27 ******** ******** ******** ******** ******** ******** ******** ******** ********
-28 ******** ******** ******** ******** ******** ******** ******** ******** ********
-29 ******** ******** ******** ******** ******** ******** ******** ******** ********
-30 ******** ******** ******** ******** ******** ******** ******** ******** ********
N0.C0.D0.R0: RxDqs - Per bit margins

Figure 60: Receiver data per-bit margin (CBT)

Figure 61 shows the CBT result for the per-bit voltage margin on the transmitter data lanes (TxVref) from lane 0 (bit 0) to lane 63 (bit 63) for DQ and ECC 0 (bit 64) to ECC 7 (bit 71). Each lane is margined to failure by offsetting the reference or sampling voltage.
Failures are shown as “*”. When the lane passes, it is shown as blank (“ ”). Extra passing points (e.g., from 13 to -14) are truncated, since they all pass, to make the plot easier to fit.

Per bit margins: TxVref
    0------7 8-----15 16----23 24----31 32----39 40----47 48----55 56----63 64----71
 30 ******** ******** ******** ******** ******** ******** ******** ******** ********
 29 ******** ******** ******** ******** ******** ******** ******** ******** ********
 28 ******** ******** ******** ******** ******** ******** ******** ******** ********
 27 ******** ******** ******** ******** ******** ******** ******** ******** ********
 26 ******** ******** ******** ******** ******** ******** ******** ******** ********
 25 ******** ******** ******** ******** ******** ******** ******** ******** ********
 24 ******** ******** ******** ******** ******** ******** ******** ******** ********
 23 ******* ******** ******** ******** ******** ******** ******** ******** *******
 22 ******* ******** ******** ******** ******** ******** ******** ******** ******
 21 ***** * ******** ******** ******** ******** ******** ******** ******** *****
 20 ** * * ******** ****** * ** *** * ******** ******** ***** ** ******** * *
 19 * ******** ***** * ** * * ******** ******** ** * ** ******** *
 18 ******** ** * ** * * ******** ***** ** * * *** ****
 17 ** * ** ******** *** ** * * **
 16 ** * * ** ** * * *
 15 ** * ** * * *
 14 * *
-15 * * * * *
-16 * * *** * ** *** **** *
-17 * * * * ** *** *** *** **** * *** **** **
-18 * * ** * * * ** *** *** ******** ** *** **** ***
-19 ******** * * **** * ** ******** ******** **** *** ********
-20 * ** * ******** ******** ** ** * ******** ******** ******** ******** ***
-21 * ** * * ******** ******** ******** ******** ******** ******** ******** ** *****
-22 **** * * ******** ******** ******** ******** ******** ******** ******** ** *****
-23 ******** ******** ******** ******** ******** ******** ******** ******** ********
-24 ******** ******** ******** ******** ******** ******** ******** ******** ********
-25 ******** ******** ******** ******** ******** ******** ******** ******** ********
-26 ******** ******** ******** ******** ******** ******** ******** ******** ********
-27 ******** ******** ******** ******** ******** ******** ******** ******** ********
-28 ******** ******** ******** ******** ******** ******** ******** ******** ********
-29 ******** ******** ******** ******** ******** ******** ******** ******** ********
-30 ******** ******** ******** ******** ******** ******** ******** ******** ********
N0.C0.D0.R0: TxVref - Per bit margins

Figure 61: Transmitter reference voltage per-bit margin (CBT)

Figure 62 shows the CBT result for the per-bit voltage margin on the receiver data lanes (RxVref) from lane 0 (bit 0) to lane 63 (bit 63) for DQ and ECC 0 (bit 64) to ECC 7 (bit 71). Each lane is margined to failure by offsetting the reference or sampling voltage. Failures are shown as “*”. When the lane passes, it is shown as blank (“ ”). Extra passing points (e.g., from 28 to -30) are truncated, since they all pass, to make the plot easier to fit.
Per bit margins: RxVref
    0------7 8-----15 16----23 24----31 32----39 40----47 48----55 56----63 64----71
 47 ******** ******** ******** ******** ******** ******** ******** ******** ********
 46 ******** ******** ******** ******** ******** ******** ******** ******** ********
 45 ******** ******** ******** ******** ******** ******** ******** ******** ********
 44 ******** ******** ******** ******** ******** ******** ******** ******** ********
 43 ******** ******** ******** ******** ******** ******** ******** ******** ********
 42 ******** ******** ******** ******** ******** ******** ******** ******** ********
 41 ******** ******** ******** ******** ******** ******** ******** ******** ********
 40 ******** ******** ******** ******** ******** ******** ******** ******** ********
 39 ******** ******** ******** ******** ******** ******** ******** ******** ** *****
 38 ******* ******** * ****** ******** ******** ******** ******** ******** ** *****
 37 ****** ******** * ****** ****** * ******** ******** ******** ******** * ** *
 36 ** *** ******** ***** ** *** * ******** ******** ******** *** **** ** *
 35 * ******** **** ** *** ******** ******** *** **** ** **** ** *
 34 ***** * **** ** *** ******** **** * * * * ** * ** *
 33 ** ** * * * * * ******** * ** * * ** * *
 32 * ** * * * ** * * ** * * * *
 31 * * ** * * *
 30 * *
 29 * *
-31 * **
-32 ** * * * * * ** *
-33 * ** * **** * * **** *
-34 **** * * * * **** *** **** *** * *
-35 * * ****** * ** *** ******** ******** * * * * *** *** **
-36 * ******** ** * ****** ******** ******** *** **** *** *** ***
-37 ** ******** ** * ****** * ******** ******** ******** ******** ***
-38 * ***** ******** * **** * ******** ******** ******** ******** ******** ** ***
-39 ******** ******** ******** ******** ******** ******** ******** ******** ** *****
-40 ******** ******** ******** ******** ******** ******** ******** ******** ********
-41 ******** ******** ******** ******** ******** ******** ******** ******** ********
-42 ******** ******** ******** ******** ******** ******** ******** ******** ********
-43 ******** ******** ******** ******** ******** ******** ******** ******** ********
-44 ******** ******** ******** ******** ******** ******** ******** ******** ********
-45 ******** ******** ******** ******** ******** ******** ******** ******** ********
-46 ******** ******** ******** ******** ******** ******** ******** ******** ********
-47 ******** ******** ******** ******** ******** ******** ******** ******** ********
N0.C0.D0.R0: RxVref - Per bit margins

Figure 62: Receiver reference voltage per-bit margin (CBT)

5.5 Traditional JTAG and TAP Based Tests

The same margin test can be run via a traditional JTAG or TAP port. For some (mostly server) Intel boards, Intel test access (ITP) ports are both required and populated. The tests are driven from a host PC by host software written in Python, chosen because it is a high level language that supports object oriented programming without requiring compilation into a binary executable. We find that in the lab, during debug, modifying the test code on the fly is quicker and easier using Python. The Python code follows the same algorithm and writes to the same registers in the CPU to trigger the CPGC BIST engine.

Figure 63 shows the ITP Python results for the per-bit timing margin on the transmitter data (DQ) lanes (TxDq) from lane 0 (00) to lane 63 (63) for DQ and ECC 0 (64) to ECC 7 (71). Each lane is margined to failure, which is shown as “#”. When the lane passes, it is shown as “.”. Extra passing points (e.g., from 15 to -16) are truncated to make the plot easier to fit.

>>? margin_control.margin_txdq(channels=[0], ranks=[[0]]*4, loopcounts=[12]*4)
    000000000011111111112222222222333333333344444444445555555555666666666677
    012345678901234567890123456789012345678901234567890123456789012345678901
+27 ########################################################################
+26 #################.###############################.######################
+25 ###########.#####.##.###############.####.#######.#.################.#..
+24 ###..##.###.#####.##.#.#####.####.##.####.######..#.#####.##.#######.#..
+23 ###..##.###..##...##.#..####.#..#.##.###..######..#..####.#..####.##.#..
+22 #.#..#....#..#....#..#..#.##.#....#..###..#..###..#...... #..#.##.##....
+21 #.#..#....#..#....#.....#.##.#...... ##...#..#.#..#...... #..#.#.....
+20 #.#...... #...... #...... ##...#...... #...... #.#.....
+19 ..#......
-19 ...... #...... #......
-20 ...#...... ##..#...... #...... #...#......
-21 ...#....##.##..#...#...... ##..##...... ##..#....#.##...... #...... #....
-22 .#.##...##.##.##...##..#...##..##..#....##..#...##.##....#..#....#.#....
-23 .#.##...#####.##...##..###.##.####.##...##.##..######...##.##..###.##..#
-24 ##.##..###########.##.####.#######.##...##.##..######..###.##..###.##..#
-25 ##.##.############.##.####.#######.##...##.##..#######.##############.##
-26 #####.################################.#######.######################.##
-27 ########################################################################
hi side timing margin: offset=+19.00 target_ber=0.000000e+00 extrapolated

Figure 63: Transmitter data timing per-bit margin (TAP)

Figure 64 shows the ITP Python results for the per-bit timing margin on the receiver data (DQ) lanes (RxDqs) from lane 0 (00) to lane 63 (63) for DQ and ECC 0 (64) to ECC 7 (71). Each lane is margined to failure, which is shown as “#”. When the lane passes, it is shown as “.”. Extra passing points (e.g., from 20 to -18) are truncated to make the plot easier to fit.

>>? 
margin_control.margin_rxdq(channels=[0], ranks=[[0]]*4, loopcounts=[12]*4) 000000000011111111112222222222333333333344444444445555555555666666666677 012345678901234567890123456789012345678901234567890123456789012345678901 +27 ######################################################################## +26 #####.############################.##################################### +25 ##.##.############################.#######.#######.##.#######.####.##.## +24 ##.##.############.#.#############..###.##..#.####..#..######.####.##### +23 #..##...##.##.#.###...#.##.#####...... ##..#.#.##.##...##.##.##.#.##.## +22 ...... #..#...... #..#.##...... #...##..#...... #..## +21 ...... #...... +20 ...... #...... #...... -19 ...... #...... -20 ...... #...... -21 #...... #...... #..#...... #...... #.. -22 #....#....#.##...... #..#.#..#...##.####..#...... ####....####....##. -23 #....#....#.##.#..#..#..##.#.#..#.##########.#...... ####.#.#####..#.##. -24 #....##.#.#.##.##.####.#######..############.#....#..######.#####.##.##. -25 #.#.###.#########################################.#..###############.##. hi side timing margin: offset=+20.00 target_ber=0.000000e+00 extrapolated Figure 64: Receiver data timing per-bit margin (TAP) Shown in Figure 65 are the ITP Python results of running the per-bit voltage margin results on transmitter data (DQ) lanes (RxDqs) from lane 0 (00) to lane 63 (63) for DQ and ECC 0 (64) to ECC 7 (71). For each lane, the lane is margined to failure, which is shown as “#”. When the lane passes, it is shown as (“.”). Extra passing points (eg: from (14 to -9) are truncated to make the plot easier to fit. 125 >>? 
margin_control.margin_txvref_no_PDA(channels=[0], ranks=[[0]]*4, loopcounts=[12]*4) 000000000011111111112222222222333333333344444444445555555555666666666677 012345678901234567890123456789012345678901234567890123456789012345678901 +27 ######################################################################## +26 #######.################################################################ +25 #####.#.#############################################.################.. +24 ###.#.#.##############.###.#.#.####################.#.###########...#... +23 .#....#.########..###..#.....#..#######..####.##..#.#...#####.##....#... +22 ...... ###....#..###..#...... ####.##...##..##..#.....##....##...... +21 ...... ##...... #.##...#...##...... #....#...... +20 ...... ##...... #...... #...... #...... +19 ...... #...... -16 ...... #...... #...... -17 ...... #...... ###....###...... -18 ...... #...... ###..#.###...... #...... -19 ...... #....##...... ###..#####...#.#.....#..##...##...... -20 ...... #.#..######.....#...... ############.###.....###########...... -21 ...... #.##########....#...##..##################.#.###########...... -22 ...... ############....##..##..################################...... -23 ..#.....############.#.###.###..################################.....##. -24 #.##.#..######################.##################################...#### -25 ####.#.################################################################# -26 ####.#.################################################################# -27 ######################################################################## hi side voltage margin: offset=+19.00 target_ber=0.000000e+00 extrapolated Figure 65: Transmitter reference voltage per-bit margin (TAP) Shown in Figure 66 are the ITP Python results of running the per-bit voltage margin results on receiver data (DQ) lanes (RxDqs) from lane 0 (00) to lane 63 (63) for DQ and ECC 0 (64) to ECC 7 (71). For each lane, the lane is margined to failure, which is shown as “#”. 
When the lane passes, it is shown as (“.”). Extra passing points (eg: from (27 to -27) are truncated to make the plot easier to fit. 126 >>? margin_control.margin_rxvref(channels=[0], ranks=[[0]]*4, loopcounts=[12]*4) 000000000011111111112222222222333333333344444444445555555555666666666677 012345678901234567890123456789012345678901234567890123456789012345678901 +40 ######################################################################## +39 #################.################################################.##### +38 ######..#########.############.###################################.##### +37 .#####..########..#####.######.##################################...##.# +36 ...###..########...###..##.###..################################....##.# +35 ...... #####.#...###..##.###..##########################..####....##.. +34 ...... #####.#...#.#..##.#.#..#########.##.####....##..#..##.#...... +33 ...... #.##.#...... ##.#....###.#####.##.#.#.....#...... #...... +32 ...... ##...... #..#....#.#.####..#....#.....#...... #...... +31 ...... #...... ##.#..#...... #...... +30 ...... #.#...... +29 ...... #...... +28 ...... #...... -30 ...... #...... -31 ...... #...... -32 ...... #.#...... #...... ##.#...... -33 ...... #.#.##.#...... ####.#.#####.#...... -34 ...... #.#.##.#...... #..####.#.#####.#.#...... -35 ...... ######.#..#...... #.#.#..############.#.#.....###.#...#.#.....#.. -36 .....#..########..#..#...#.###.##################.#.#######..###.....#.. -37 .....#..########..####.###.###.####################.#######..###....###. -38 ##.####.#########.#########################################.#####...###. 
-39 #################.################################################..#### -40 ##################################################################..#### -41 ######################################################################## hi side voltage margin: offset=+28.00 target_ber=0.000000e+00 extrapolated Figure 66: Receiver reference voltage per bit-margin (TAP) 5.6 Performance (Test Time) Comparison While the format of CBT/RMT output vs. TAP based Python test differ slightly, we were able to collect the same detailed level of debug data. We ran CBT/RMT with the option of showing per-bit margin diagram on/off. The test times were 15901ms and 5679ms respectively to per-bit margin on and off. The test time of using EV Python scripts running in a host was 991130ms or 991 seconds. Thus the ratios of running on target cache vs. host were either 62 or 175 times depended on the per-bit-margin being on or off, see Table 10. 127 Table 10: Performance gain of cache based tests test time per bit margin JTAG/TAP (ms) Cache based RMT (ms) performance gain on 991130 15901 62X off 991130 5679 175X 5.7 Generalized Cache Based Testing Most tests can be written small in code size (eg: for n equal 0 to 100, do x equal n + x) to fit inside cache to achieve comparable speedup to the method presented here. Such generalized cache based testing is encouraged because of the test time benefit. Especially when running the tests on the target under test target based test (TBT), it can easily be used to test other platform interfaces (eg: USB, PCIe). One of the advantages of these generalized CBT or TBT for non-memory interfaces, like USB or PCIe, is that it can test the interface to failure without worry about corrupting the test execution data and code space itself. However, when it comes to memory tests, generalized CBT cannot fully test the entire memory range if it is based on the target because of OS dependency and non- accessible ranges. 
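As a reference point for the speedup that a generalized CBT would need to match, the gains in Table 10 can be recomputed directly from the measured test times. The short Python sketch below performs only that arithmetic; the constant and function names are illustrative assumptions and are not part of the RMT or EV tooling.

```python
# Test times in milliseconds, copied from Table 10.
TAP_MS = 991130
CBT_MS = {"on": 15901, "off": 5679}  # per-bit margin diagram on / off


def speedup(tap_ms, cbt_ms):
    """Return how many times faster the cache based run is than the TAP run."""
    return tap_ms / cbt_ms


if __name__ == "__main__":
    for mode, cbt_ms in CBT_MS.items():
        print(f"per-bit margin {mode}: {speedup(TAP_MS, cbt_ms):.1f}x")
```

Rounded to whole numbers, the two ratios give the 62X and 175X entries of Table 10.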
If generalized CBT does not run on the target, access to the target registers via JTAG or TAP slows down significantly. Furthermore, a CBT that runs under an OS is not guaranteed full use of the CPU core: because of the multi-threaded nature of a modern OS, or the virtualization of hardware or machines (VM), the generalized CBT can easily be task switched out when it competes with OS tasks of equal or higher priority. When such a task switch occurs, test time degrades significantly. Unlike generalized CBT, MRC based CBT running in NEM runs in single-threaded mode with full control of all IA86 core and system resources, thus avoiding the task switching problems discussed above. MRC based CBT running in NEM has proven to be the fastest way to test memory subsystems on IA86 systems.

5.8 Conclusion

As the memory industry pushes to increase memory density, device variation is creating more defects. Larger memory capacity leads to more defects and also increases memory test time. Furthermore, new form factors (phone, tablet, mobile and client PC) and low cost boards and platforms limit physical JTAG or TAP access. We presented a CBT architecture that achieved test speed improvements of 60x to 170x compared to JTAG or TAP based tests running the same test content. This CBT is built on the foundation of an effective memory test and repair CPGC BIST engine, some of whose architecture details have been discussed. The architecture discussed here is critical in dealing with the future memory test challenges that the industry faces. We also presented the CBT infrastructure (MRC setup, NEM, and RMT with CPGC/IBIST), CBT test contents, CBT test results, and a side by side comparison of test time to JTAG or TAP.
Intel x86 processors are architected to have high functional memory bandwidth accessed directly through the core. We have thus shown that the quickest way to access and test platform system memory is through the CPU core, by storing and running parallel CPGC/BIST test content via the CPU L3 cache, with one CPGC BIST engine per memory channel. Finally, we discussed and compared this approach to more generalized cache based tests and presented its unique advantages. In the future, we hope to expand CBT to broader IO and system interfaces and to further improve test execution speed through parallelism.

5.9 Acknowledgements

The authors acknowledge and thank Rajesh Kapoor at Intel for sharing BIOS knowledge.

5.10 References

[1] “Coalition of 20+ Tech Firms Backs MRAM as Potential DRAM, NAND Replacement.” http://www.dailytech.com/Coalition+of+20+Tech+Firms+Backs+MRAM+as+Potential+DRAM+NAND+Replacement/article33826.htm
[2] S. Sarangi, S. Narayanasamy, B. Carneal, A. Tiwari, B. Calder, and J. Torrellas, “Patching Processor Design Errors with Programmable Hardware,” IEEE Micro, vol. 27, no. 1, pp. 12–25, Jan. 2007.
[3] B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM errors in the wild: a large-scale field study,” in ACM SIGMETRICS Performance Evaluation Review, 2009, vol. 37, pp. 193–204.
[4] M. Shepherd, D. Blankenbeckler, and A. Norman, “Comparison of off-chip interconnect validation to field failures,” in VALID 2011, The Third International Conference on Advances in System Testing and Validation Lifecycle, 2011, pp. 121–125.
[5] B. Querbach, R. Khanna, D. Blankenbeckler, Y. Zhang, R. T. Anderson, D. G. Ellis, Z. T. Schoenborn, S. Deyati, and P. Chiang, “A reusable BIST with software assisted repair technology for improved memory and IO debug, validation and test time,” in Test Conference (ITC), 2014 IEEE International, 2014, pp. 1–10.
[6] B. Querbach, D. G. Ellis, A. Khan, M. J. Tripp, E. S. Gayles, and E. Gollapudi, “Automatic self test of an integrated circuit component via AC I/O loopback,” 7,139,957, 2006.
[7] B. Querbach, S. Puligundla, D. Becerra, Z. T. Schoenborn, and P. Chiang, “Comparison of hardware based and software based stress testing of memory IO interface,” in 2013 IEEE 56th International Midwest Symposium on Circuits and Systems (MWSCAS), 2013, pp. 637–640.
[8] B. Querbach, “CPGC2, a built-in self test and auto-repair engine for DRAM (WIO, DDR4),” patent pending.
[9] J. Nejedlo and R. Khanna, “Intel® IBIST, the full vision realized,” in Test Conference, 2009. ITC 2009. International, 2009, pp. 1–11.
[10] D. Ellis, B. Querbach, J. Nejedlo, A. Khan, S. Babcock, E. Gayles, and E. Gollapudi, “Pre-announce signaling for interconnect built-in self test,” 2003.
[11] D. G. Ellis, B. Querbach, J. J. Nejedlo, A. Khan, S. R. Babcock, E. S. Gayles, and E. Gollapudi, “Push button mode automatic pattern switching for interconnect built-in self test,” 6,826,100, 2004.
[12] B. L. Spry, T. Z. Schoenborn, P. Abraham, C. P. Mozak, D. G. Ellis, J. J. Nejedlo, B. Querbach, Z. Greenfield, R. Ghattas, and J. Tholiyil, “Robust memory link testing using memory controller,” 12/651,252, 2009.
[13] S. Gurumurthy, M. Pratapgarhwala, C. Gilgan, and J. Rearick, “Comparing the effectiveness of cache-resident tests against cycle-accurate deterministic functional patterns,” in Test Conference (ITC), 2014 IEEE International, 2014, pp. 1–8.
[14] N. Foutris, M. Psarakis, D. Gizopoulos, A. Apostolakis, X. Vera, and A. Gonzalez, “MT-SBST: Self-test optimization in multithreaded multicore architectures,” in Test Conference (ITC), 2010 IEEE International, 2010, pp. 1–10.
[15] G. Theodorou, N. Kranitis, A. Paschalis, and D. Gizopoulos, “Software-Based Self Test Methodology for On-Line Testing of L1 Caches in Multithreaded Multicore Architectures,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 21, no. 4, pp. 786–790, Apr. 2013.
[16] M. Shin, J. Shim, J. Kim, J. S. Pak, C. Hwang, C. Yoon, J. Kim, H. Kim, K. Park, and Y. Kim, “A 6.4Gbps on-chip eye opening monitor circuit for signal integrity analysis of high speed channel,” in IEEE International Symposium on Electromagnetic Compatibility, 2008. EMC 2008, 2008, pp. 1–6.
[17] K. Noguchi and M. Nagata, “An On-Chip Multichannel Waveform Monitor for Diagnosis of Systems-on-a-Chip Integration,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 15, no. 10, pp. 1101–1110, Oct. 2007.
[18] A. K. M. M. Islam and H. Onodera, “Characterization and compensation of performance variability using on-chip monitors,” in Proceedings of Technical Program - 2014 International Symposium on VLSI Technology, Systems and Application (VLSI-TSA), 2014, pp. 1–4.
[19] J.-S. Wang, P. Andrews, S. Fu, and G. Pwu, “Low power Σ-Δ analog-to-digital converter for temperature sensing and voltage monitoring,” in Proceedings of the 43rd IEEE Midwest Symposium on Circuits and Systems, 2000, vol. 1, pp. 36–39.
[20] T. Okumoto, K. Yoshikawa, and M. Nagata, “Monitoring effective supply voltage within power rails of integrated circuits,” in Solid State Circuits Conference (A-SSCC), 2012 IEEE Asian, 2012, pp. 113–116.
Architecture of a Reusable BIST Engine for Detection and Auto Correction of Memory Failures and for IO Debug, Validation, Link Training, and Power Optimization on 14nm 3DS SOC

Bruce Querbach, Rahul Khanna, Sudeep Puligundla, David Blankenbeckler, Joe Crop, Patrick Chiang

2015 IEEE Design and Test magazine (accepted)
Vancouver, British Columbia, Canada

6 Architecture of a Reusable BIST Engine (CPGC)

Abstract: 3D stacked (3DS) integrated circuits (ICs) present unique challenges to testability, due to the lack of probe points, and to overall product yield, due to memory subsystem failures. High speed memory and IO require potentially complex link training, yet the end user experience of a phone or tablet demands quick power on and quick link training. Further, compared to desktop CPUs, phones and tablets present low power challenges for Intel chips. This paper presents the architecture of a reusable built-in self-test (BIST) engine called the converged pattern generator and checker (CPGC). The CPGC hardware and software implemented on an Intel 14nm 3DS IC are capable of auto repair of detected memory defects. Silicon results are presented showing that CPGC enables easy debug of inter-symbol interference (ISI) and crosstalk issues in silicon and platform, which improves the probe-ability of the 3DS IC. It also enables quick and complex IO link training of all memory and IO subsystems. Further, CPGC improved validation time by 3x and reduced system-on-chip (SOC) and platform power by 5% to 11% through closed loop circuit power optimization/training. This reusable CPGC BIST engine IP solution has been successfully designed into at least 11 Intel CPU/SOCs.
6.1 Introduction

Process technology continues to shrink; for example, Intel has demonstrated 14nm technology [16]. These shrinking technologies, combined with consumer demand to store everything, have driven memory capacity to increase year after year [15]. At the same time, memory defects are increasing because of smaller geometry, lower voltage, and complex physical design. DRAM errors are one of the leading causes of hardware failure in modern compute clusters as memory capacity and density increase [2, 13]. At ECTC 2013, Dong Wook Kim [14] demonstrated a 3D stacked wide-IO (WIO) memory with one logic die followed by up to four memory dies connected by through silicon vias (TSV). WIO is similar to double data rate (DDR) memory, but offers a much wider channel and higher IO bandwidth than DDR through more parallel pins. TSVs are tiny vertical interconnects that pass through the die to connect multiple dies. This stacking further limits the probe-ability of the sandwiched dies after bonding. When functional at-speed tests are required, BIST is the on-die solution that can test this 3DS technology at speed, in contrast to tester based solutions. Because BIST is built directly into the 3DS IC, it also solves the probe-ability issue for parts of the 3DS IC that other probe solutions cannot access.

Much prior work has been done on memory test. Boundary scan [1] introduced DC and stuck-at tests. Venkatesh [3] demonstrated a physical memory fault model to test BIST algorithms. March and modified-march algorithms [4-6] are commonly used to detect memory cell failures. A galloping pattern [7] can capture failures that escape the march algorithm. Bernardi [8] introduced a hardware software co-test, which can be performed while the device under test (DUT) is operational. Bernardi [9] also demonstrated a programmable memory BIST (MBIST) suitable for stacked DRAM, which uses march to detect neighborhood pattern sensitive faults (NPSF).
The work in [12] showed the importance of patterns that are quick to execute and provide good coverage of memory defects. This paper presents the architecture of a BIST engine that can test for memory cell failures according to the algorithms described in the prior works above (scan, march, gallop, etc.), but goes beyond prior work by introducing auto repair of detected memory cell failures, IO link training, and power optimization/training. A comparison with the features of prior work is shown in Table 11.

Table 11: Comparison of CPGC features to prior work.

The BIST engine tests the 3DS SOC using the functional circuits themselves, by exciting the physically neighboring aggressors of a victim IO lane in Figure 67 (a), or of a victim memory cell (called the base) in Figure 67 (b). This new generation of reusable BIST engine, called the converged pattern generator and checker (CPGC) [11], is built into the SOC CPU. Converged refers to combining interconnect BIST (IBIST) [10] and MBIST capabilities into one BIST engine. The paper also describes an auto-repair technology which, combined with the fault detection capabilities, can physically re-map defective memory using redundant memory cells built into the DRAM. This provides significant test time savings in the post package manufacturing flow. CPGC also reduced SOC and platform power through closed loop optimization of on die termination (ODT) and equalization. A BIST that supports IO and memory defect detection, memory auto repair, IO link training, and power optimization/training has not been previously demonstrated.

The rest of this paper is structured as follows. In section 6.2.1, the architecture of CPGC is presented. Section 6.2.4 details the software architecture. In section 6.2.5, silicon test results are presented. Lastly, in section 6.3, the paper is summarized.
6.2 2nd Generation CPGC for 14nm 3DS SOC Memory and IO

6.2.1 CPGC Architecture Overview

The typical computer architecture for SOCs consists of one or more processors with a memory sub-system that includes a memory controller and a main memory. A 3DS SOC significantly reduces the footprint of the memory and IO sub-system, because the memories are stacked on top of the CPU as shown in Figure 67 (a). The CPGC BIST engine becomes a part of the memory sub-system and the memory controller, as shown in Figure 67 (b). The CPGC architecture consists of test pattern generators (TPG), defect detectors (DD), a repository of failing addresses (RF), an optional self-repair logic circuit (SF), and configuration registers (CR). These capabilities (TPG, DD, RF, SF and CR) form the basis of CPGC and are present for each memory channel. Shown on the right of Figure 67 (b) is the main memory with spare memory, along with two typical memory test algorithms. The main memory can be written (W) sequentially from top to bottom and left to right, or, as a base address walks through memory, an offset can stress the neighbors around the base. The CPGC can be programmed through the CR to repair detected memory defects automatically or manually, with software assistance or with a hardware finite state machine (FSM) as shown in Figure 67 (d), using the spare memory of the DRAM to map out the defective rows.
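The sequential write pass in Figure 67 (b) is the basis of a stuck-at scan: write one value everywhere, read it back, and repeat with the complement. The following Python sketch illustrates the idea on a software model of memory. The FaultyMemory class, its stuck_at argument, and the memory_scan function are hypothetical names introduced here for illustration; the sketch does not represent the CPGC hardware or its register interface.

```python
class FaultyMemory:
    """Toy bit-addressable memory in which chosen cells are stuck at 0 or 1."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})  # address -> stuck value

    def __len__(self):
        return len(self.cells)

    def __setitem__(self, addr, value):
        self.cells[addr] = value

    def __getitem__(self, addr):
        # A stuck cell ignores whatever was written to it.
        return self.stuck_at.get(addr, self.cells[addr])


def memory_scan(mem):
    """Write then read 0, 1, 0 over all addresses; return the mismatches found.

    Each mismatch is reported as (address, expected, actual).
    """
    failures = []
    for value in (0, 1, 0):
        for addr in range(len(mem)):   # sequential write pass
            mem[addr] = value
        for addr in range(len(mem)):   # sequential read-and-compare pass
            actual = mem[addr]
            if actual != value:
                failures.append((addr, value, actual))
    return failures


if __name__ == "__main__":
    mem = FaultyMemory(16, stuck_at={3: 1, 9: 0})
    for addr, expected, actual in memory_scan(mem):
        print(f"addr {addr}: expected {expected}, read {actual}")
```

In this model a cell stuck at "1" is caught by the passes that write "0", and a cell stuck at "0" is caught by the pass that writes "1".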
[Figure 67 panels: (a) CPU with CPGC BIST engine (DRAM 1 and DRAM 2 stacked over the SOC with CPGC BIST engine and connected by TSVs); (b) CPGC BIST engine architecture as part of Intel Memory Controller (Test Pattern Generator, Defect Detector, Repository of Failing Addresses, Self Repair Logic Circuit, Flags, Configuration Registers, Main Memory, Spare Memory; stuck at and butterfly test examples); (c) The CPGC BIST engine pipeline architecture (Address Order0 to Order3, Carry Count, Column, Bank, Row and Rank Address Gen., pipestages n00 and n01); (d) The CPGC BIST engine auto-repair architecture (FSM states including Idle, ACT and Wait tPGM; transitions "if NumFailStatus < MaxFail" and "If more failing address")]

Figure 67: SOC memory sub-system and CPGC Architecture overview

6.2.2 CPGC Architecture Detail

Similar to [6], the CPGC BIST engine includes command, address, and data generators. The CPGC supports four instruction types: address, data, algorithm, and command. An offset instruction type is optional. When all the instructions are programmed, the lowest instruction loop (offset) is executed first, followed by the command, algorithm, data, and address instruction loops.

6.2.2.1 Address Pipeline

Given the dependencies between command, address, and offset, a pipelined architecture is required to generate a complete address back-to-back on every clock and so achieve high speed test similar to [6]. An example of an address generation pipeline is shown in Figure 67 (c). The diagram shows pipe stages n00 and n01. When bank is ordered lower than column, it is lumped with column into pipe stage n00. Carry count is a look-ahead carry propagation that ensures that the row address has valid inputs in pipe stage n01.
Rank address is lumped with row address into pipe stage n01.

6.2.2.2 Auto Repair FSM and Auto Repair Technology

The CPGC reusable BIST engine introduces an optional hardware repair engine that works in conjunction with software to provide auto repair technology. This repair capability has been adopted by JEDEC as a part of the wide-IO (WIO) industry standard, where it is known as post package repair (PPR). In a high-volume manufacturing (HVM) environment, the automatic test equipment (ATE) can use the CPGC row hammer test to detect memory cell defects. At the end of the detection phase of the test, the ATE test flow can initiate either a hardware automatic repair or a software assisted manual repair of the failed memory cells using spare rows of the DRAM device. DRAM device manufacturers have adopted the PPR JEDEC standard. Specifically, future DRAMs are built with extra spare rows of memory to compensate for the increased likelihood of memory cell failures. The automatic modes require no operator intervention, thus significantly reducing HVM test time from many minutes down to seconds. The auto-repair and manual repair can also be run at boot via firmware or BIOS. This enables in-the-field PPR of a highly integrated SOC with a 3DS WIO memory subsystem, such as a phone or tablet. Such end user auto-repair can significantly improve manufacturing yield and the end user experience. Imagine, instead of throwing away a dead phone, getting customer service to push an auto repair software update, or even having the phone simply repair itself on the next reboot after a crash.

Figure 67 (d) shows an 11 state FSM that loops between activate (ACT) and wait for programming time completion (Wait tPGM) if multiple failing addresses need to be repaired. All 11 states are used to complete PPR. The PPR flow is activated if the number of failures (NumFailStatus) is less than the maximum number of failures (MaxFail) that the spare memory can support.
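The control flow of the repair loop can be sketched in software as follows. This is a deliberately simplified model under stated assumptions: only the Idle, ACT, and Wait tPGM states and the NumFailStatus < MaxFail condition are taken from the text, while the remaining state names (ENTER_PPR, PRECHARGE, EXIT_PPR, DONE) and the run_ppr function are hypothetical stand-ins for the rest of the 11 state hardware FSM.

```python
from enum import Enum, auto


class PprState(Enum):
    # Simplified, partly hypothetical subset of the 11 hardware states.
    IDLE = auto()
    ENTER_PPR = auto()
    ACT = auto()
    WAIT_TPGM = auto()
    PRECHARGE = auto()
    EXIT_PPR = auto()
    DONE = auto()


def run_ppr(failing_rows, max_fail):
    """Walk the simplified repair flow; return (repaired_rows, state_trace)."""
    trace = [PprState.IDLE]
    num_fail_status = len(failing_rows)
    # Repair starts only if there is something to fix and the failure count
    # is below what the spare rows can cover (NumFailStatus < MaxFail).
    if num_fail_status == 0 or num_fail_status >= max_fail:
        return [], trace
    trace.append(PprState.ENTER_PPR)
    repaired = []
    for row in failing_rows:
        # One ACT / Wait tPGM iteration per failing row address.
        trace.append(PprState.ACT)
        trace.append(PprState.WAIT_TPGM)
        repaired.append(row)
    trace.extend([PprState.PRECHARGE, PprState.EXIT_PPR, PprState.DONE])
    return repaired, trace


if __name__ == "__main__":
    repaired, trace = run_ppr([0x1A2, 0x3F0], max_fail=4)
    print([hex(row) for row in repaired])
    print(" -> ".join(state.name for state in trace))
```

With two failing rows and max_fail=4, the trace visits ACT and WAIT_TPGM twice before finishing; with four failing rows the flow stays in IDLE because the spare capacity check fails.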
6.2.2.3 Refresh Control Generator

Refresh control is important for maintaining DRAM data integrity. It is also an excellent tool for testing and stressing the DRAM for cell retention failures. The CPGC BIST engine has a variable refresh signal generator to control and stress the DRAM refresh rates.

6.2.3 Algorithmic Memory Cell Testing

The CPGC implementation has been designed with significant flexibility to accommodate a broad variety of memory test algorithms efficiently. To implement these address walking algorithms, the Test Pattern Generator of the CPGC architecture consists of nested loops of address instruction (AI), data instruction (DI), algorithm instruction (GI), command instruction (CI), and offset instruction (OI). Some examples are described below.

6.2.3.1 Memory-Scan

A test commonly used to initialize memory and detect stuck-at faults is called memory-scan, shown in the upper right corner of Figure 67 (b). The scan writes “0” to the entire memory, which clears the memory contents at boot and catches any command, address, and data bus communication errors. A subsequent read scan detects any errors and provides coverage for stuck-at “1” faults. The process is then repeated by writing “1” and reading to verify, which catches stuck-at “0” faults. Because the initial memory contents are random, some implementations follow with a second write and read scan of “0” to provide full stuck-at coverage.

6.2.3.2 Row Hammer and Butterfly

Conceptually, a DRAM is a large array of capacitors in the form of holes in the substrate. A “1” is represented by a capacitor holding a charge; a “0” is represented by a discharged capacitor. Various physical mechanisms allow coupling effects between neighboring capacitors to cause bit flips. The opportunity for these cross charge coupling faults increases as device size decreases.
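A common way to provoke the neighbor coupling described above is to write opposite data around a victim (base) cell in a crosshair, up to two cells away in x and y, and then read the base back. The Python sketch below models this on a toy bit array. The CoupledGrid class, its coupling argument, and the function names are assumptions made for illustration only; the coupling model (the victim takes the value written to its aggressor) is a simplification and not a physical DRAM fault model.

```python
class CoupledGrid:
    """Toy rows x cols bit array with optional aggressor -> victim coupling."""

    def __init__(self, rows, cols, coupling=None):
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]
        self.coupling = dict(coupling or {})  # (row, col) -> (row, col)

    def write(self, row, col, value):
        self.cells[row][col] = value
        victim = self.coupling.get((row, col))
        if victim is not None:
            # Modeled defect: the victim is dragged to the aggressor's value.
            self.cells[victim[0]][victim[1]] = value

    def read(self, row, col):
        return self.cells[row][col]


def crosshair_neighbors(row, col, rows, cols, reach=2):
    """Return in-bounds cells up to `reach` away from the base in x and y."""
    cells = []
    for d in range(1, reach + 1):
        for r, c in ((row - d, col), (row + d, col), (row, col - d), (row, col + d)):
            if 0 <= r < rows and 0 <= c < cols:
                cells.append((r, c))
    return cells


def butterfly_test(mem, base_value=1):
    """Walk the base over the array; return bases corrupted by neighbor writes."""
    failures = []
    for row in range(mem.rows):
        for col in range(mem.cols):
            mem.write(row, col, base_value)
            for r, c in crosshair_neighbors(row, col, mem.rows, mem.cols):
                mem.write(r, c, 1 - base_value)  # opposite data on neighbors
            if mem.read(row, col) != base_value:
                failures.append((row, col))
    return failures


if __name__ == "__main__":
    grid = CoupledGrid(4, 4, coupling={(1, 2): (1, 1)})
    print(butterfly_test(grid))
```

For a base away from the array edges, crosshair_neighbors returns eight cells, one per neighbor write around the base.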
The charge sharing impact of the previous bit in time can cause ISI failures, which increase as DRAM frequency increases. Interconnects and wires for SOC internal and external IO face similar crosstalk and ISI. Because of finite charge leakage, DRAM cells are refreshed to retain their stored value; slower refresh rates, however, lead to additional DRAM retention failures.

An example test algorithm for detecting neighbor cross coupling, typically called butterfly, is shown on the right side of Figure 67 (b). In this test, the nearest neighbors of the victim (base) cell in the x and y directions, up to two memory cell locations away, are written in a crosshair shape with data opposite to that of the base (Wr1, Wr2, Wr3, Wr4, Wr5, Wr6, Wr7, and Wr8). The base is read at the end of these write commands to check for errors. The butterfly algorithm can test near neighbors (crosshair), all neighbors (crosshair and diagonal), and selective neighbors.

6.2.3.3 Other Cell Test Algorithms

The CPGC engine is capable of supporting many different types of memory cell test algorithms, such as MATS+, march LR, march C-, PMOVI, and even more complicated algorithms that target certain types of DRAM faults. In addition, the CPGC can easily set up the background data pattern before the test begins.

6.2.4 CPGC Software Infrastructure Architecture

The CPGC technology has built-in customizable flexibility that enables the testing and diagnosis of entire memory sub-systems. In case of impending failure, it is possible to isolate the memory field-replaceable unit (FRU) for further testing through the CPGC IP. The FRU would take some or all of the memory offline and use CPGC to test the memory. The BIOS-based unified extensible firmware interface (UEFI) architecture supports a middleware layer that can incorporate the CPGC application program interfaces (APIs) at boot-time as well as at run-time. At boot-time, the CPGC-based test could execute at the initialization of the memory reference code.
At run time, the code could execute in the system management interrupt (SMI), which is a virtualized space in the system. The API libraries are incorporated in both the boot-time and the run-time environments. While boot-time CPGC code could execute at every boot, run-time code cannot be executed frequently, because its execution can cause system perturbations that degrade the performance of the workload running on the system.

A firmware-level approach to hosting CPGC tests and APIs requires building a memory-map infrastructure for identifying memory addresses. The memory initialization code is modified to incorporate the CPGC test code. The system management interrupt (SMI) provides the necessary APIs to assist in constructing and configuring the CPGC tests. Figure 68 illustrates the UEFI boot services layer that hosts the CPGC test modules. Similar modules are replicated into the SMI for run-time execution.

Figure 68: UEFI BOOT SERVICE layer that supports CPGC test modules. (Layers shown: CPGC test modules and other test modules; UEFI OS BOOT SERVICE layer; PLATFORM specific firmware; CPU hardware, including CPGC.)

UEFI is a standards-based, modern software architecture (originated by Intel) that offers a rich, extensible pre-operating system (OS) environment with advanced boot and runtime services. It has intrinsic networking capability and is designed from the ground up to work with multi-processor (MP) systems. The UEFI standard is architected for dynamic modularity. This has opened up opportunities for collaboration on the technology, reducing development costs and time to market through code sharing and easy integration of plug-in modules from different partners and vendors.

6.2.5 CPGC Applications and Results

6.2.5.1 Memory Initialization and IO Link Training Using CPGC

One of the important applications in which CPGC is used extensively is memory initialization and training during system power up.
Typical link training algorithms use eye centering techniques on the received signal to ensure the maximum possible margins on either side of the sampling instant in the voltage and timing axes. CPGC writes a stream of patterns on the IO link, generated either by built-in linear feedback shift registers (LFSRs) or by user-programmable registers. This stream of data is stored in memory, read back, and compared to the original by CPGC. CPGC intentionally degrades and margins the IO write or read path independently while transmitting, receiving, and checking for errors. This identifies the high-side and low-side limits of the voltage margin and the left-side and right-side limits of the timing margin, which together generate a 2-D eye diagram, Figure 3 (c), for the link under test. The CPGC applies the eye centering technique by calculating the mid-point between the high-side and low-side limits of the voltage margin, and between the right-side and left-side limits of the timing margin. The voltage mid-point and timing mid-point become the optimal eye center as a result of CPGC IO link training. CPGC applies a link training algorithm to all the command, address, and data pins of each memory channel, and to all memory and high-speed IO attached to the CPU (e.g., PCIe). During training, the eye margin is automatically measured and reported.

6.2.5.2 ISI and Crosstalk Debug Using CPGC

Another application of CPGC is identifying IO link performance loss due to data-dependent jitter, such as ISI and crosstalk, through bit-by-bit pattern searches. A DDR memory IO bus is a single-ended, voltage-terminated bus with very closely placed lanes between the processor and the DDR dual inline memory module (DIMM), which makes this bus susceptible to ISI and crosstalk issues. CPGC can be used to program and transmit stress patterns on the bus that cause a worst-case 1 or 0 transition after a long run of identical bits.
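A stress pattern of this kind, a long run of identical bits interrupted by a single opposite bit, can be sketched in a few lines of Python. This is only an illustration of the bit sequence, not the CPGC register programming model; the function name `lone_bit_pattern` and its parameters are assumptions introduced here.

```python
from typing import List


def lone_bit_pattern(run_length: int, lone_bit: int = 1) -> List[int]:
    """Build a worst-case ISI bit sequence for one lane (hypothetical helper).

    The sequence is a run of `run_length` identical bits, then one opposite
    bit (the lone bit), then a return to the run value. With lone_bit=1 this
    gives a worst-case 1 after a long run of 0s; with lone_bit=0 it gives a
    worst-case 0 after a long run of 1s.
    """
    if lone_bit not in (0, 1):
        raise ValueError("lone_bit must be 0 or 1")
    if run_length < 1:
        raise ValueError("run_length must be at least 1")
    background = 1 - lone_bit
    return [background] * run_length + [lone_bit, background]


# Worst-case 1: eight 0s, a single 1, then back to 0.
worst_case_one = lone_bit_pattern(8, lone_bit=1)

# Worst-case 0: eight 1s, a single 0, then back to 1.
worst_case_zero = lone_bit_pattern(8, lone_bit=0)
```

The run length is left as a parameter because the text does not state how long the run of identical bits needs to be for a given channel.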
One such stress pattern, used in DDR validation to measure eye height loss due to ISI, is shown in Figure 3 (b). In this case, CPGC implemented in silicon is programmed to change the precursor bit on a DDR pin (pin 68) deterministically from 0 to 1 after a long sequence of 0s. This 0-to-1 transition is immediately followed by a 1-to-0 transition, creating a worst-case 1 during this bit period. The worst-case 1 corresponds to the inner contour of the eye diagram shown in Figure 69 (a) and thus reduces the eye height margin. Similarly, a worst-case 0 can be created to stress the bus. Such programmability in CPGC helps to debug ISI issues. CPGC also allows simultaneous programming of multiple lanes on the bus with different deterministic or random patterns, and it provides independent control of the timing offset/skew between the lanes. This feature is used to create a victim-aggressor condition among the lanes for crosstalk studies and mitigation.

Another way of stressing a memory cell is through repeated writes to the neighboring cells, called hammering. A row hammer issues repeated activate commands to specific aggressor rows to cause victim row/bit failures. CPGC can stress additional +2 and +3 aggressor rows, which is particularly suited to WIO TSV testing of 3DS memory, where physical probing of a sandwiched die is impossible.

Figure 69: CPGC silicon results. (a) Scope capture of DDR IO eye. (b) ISI debug changing precursor from 0 to 1. (c) CPGC BIST based memory IO initialization, training, and eye measurement.

6.2.5.3 Silicon and System Validation Using CPGC

The CPGC engine is also used in silicon and system validation. A unique feature of CPGC is its ability to drive patterns through memory IO without requiring the core of the processor to be working. This feature enables early silicon debug and characterization of the IO circuit physical layer independently of the processor.
In an HVM system validation environment, after the initial bring-up of the complete system, CPGC provides a pattern auto-rotation feature that automatically feeds test patterns from one lane to the next and switches between victim and aggressor roles. The auto-rotation of victim and aggressor patterns is shown to decrease HVM validation time by 3x.

6.2.5.4 Platform Power Savings Using CPGC

In addition to memory IO initialization and training, CPGC is also used to optimize the power and performance of DDR IO. The CPGC implementation provides controls, via configuration registers, to closed-loop compensation circuits such as reference voltage (Vref), slew rate, drive strength, and on-die termination (ODT). These controls are tuned for the lowest power while maintaining acceptable eye opening and bit error rates at the receivers. The results showed power savings of 6 mW/bit on CPU power (CpuPwr) and 12 mW/bit on DIMM power (DimmPwr), according to the plot average shown in Figure 4, when ODT was tuned. By calibrating for power, margin changes of -5, -8, +11, and -19 ticks were observed on receiver timing (RdT), transmitter timing (WrT), receiver voltage (RdV), and transmitter voltage (WrV), respectively. There are a total of 64 controllable states of voltage and timing offset in the X and Y directions; each shift in state is called a tick. However, the reduction in margin did not affect the overall system BER of 10^-12. This margin/power trade-off led to 5.28% CPU power savings on a 22nm 15W CPU and 11.03% power savings on the platform.
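The tuning principle in Section 6.2.5.4, choosing the lowest-power setting that still keeps acceptable margin, can be sketched as a simple constrained search. This is a minimal illustration under stated assumptions, not the calibration flow used in silicon. The function `tune_for_lowest_power`, the `measure` callback, the use of a single minimum margin in ticks as the acceptance criterion, and every number in `FAKE_MEASUREMENTS` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Margin per axis in ticks; the axis names follow the text (RdT, WrT, RdV, WrV).
Margins = Dict[str, int]


def tune_for_lowest_power(
    settings: Iterable[int],
    measure: Callable[[int], Tuple[float, Margins]],
    min_margin_ticks: int,
) -> Optional[Tuple[int, float]]:
    """Return (setting, power_mw) with the lowest power whose smallest margin
    is at least min_margin_ticks, or None if no setting qualifies."""
    best = None
    for setting in settings:
        power_mw, margins = measure(setting)
        if min(margins.values()) < min_margin_ticks:
            continue  # margin too small: reject regardless of power
        if best is None or power_mw < best[1]:
            best = (setting, power_mw)
    return best


# Made-up measurements for illustration only (not silicon data).
# Key: a hypothetical ODT code. Value: (power in mW, margins in ticks).
FAKE_MEASUREMENTS = {
    0: (100.0, {"RdT": 20, "WrT": 20, "RdV": 20, "WrV": 20}),
    1: (90.0, {"RdT": 16, "WrT": 15, "RdV": 18, "WrV": 14}),
    2: (80.0, {"RdT": 12, "WrT": 10, "RdV": 15, "WrV": 9}),
    3: (70.0, {"RdT": 6, "WrT": 5, "RdV": 9, "WrV": 4}),
}


def fake_measure(setting: int) -> Tuple[float, Margins]:
    """Stand-in for a BIST margin run plus a power measurement."""
    return FAKE_MEASUREMENTS[setting]


best = tune_for_lowest_power(FAKE_MEASUREMENTS, fake_measure, min_margin_ticks=8)
```

With the made-up table above and a minimum margin of 8 ticks, `best` is `(2, 80.0)`: setting 3 draws less power but is rejected because its WrV margin falls to 4 ticks. A real flow would replace `fake_measure` with a margin measurement and would likely tune several controls jointly rather than one code at a time.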
Figure 70: CPGC closed calibration loop saves CPU and DRAM power.

6.3 Summary

The CPGC engine addresses the complexity of designing and testing today's ICs, for which prior tester and probe solutions no longer work because of 3DS probe-ability limits and higher-frequency IO and memory. This chapter presented the hardware and software architecture of a reusable BIST engine for Intel SOCs that includes the auto repair of detected memory defects. Key CPGC silicon results and benefits are: debug isolation of ISI and crosstalk; memory IO initialization, training, and margining; auto-rotation that led to a 3x validation time improvement; isolation of memory sub-systems to enable early debug, which saved Intel up to 3 months of product delay; row hammer, butterfly, and galloping tests that enabled WIO TSV testing of 3DS memory; and 5% to 11% CPU and platform power savings through closed-loop calibration of DDR IO circuits. This reusable CPGC BIST engine IP solution has been successfully designed into Intel products, with at least 11 Intel SOCs using CPGC for IO link training, validation, and debug. At least seven Intel SOCs have been successfully debugged, tested, and launched in the marketplace using the reusable CPGC BIST engine.

6.4 Acknowledgments

The authors would like to acknowledge Adam Taylor and Thilak K Poriyani House Raju for their wonderful support and effort on this CPGC project.

7 Conclusion

This dissertation tackled some of the challenges that today's complex IO and memory designs bring in terms of higher IO bandwidth, higher frequency, and lower power requirements. Both the IO and the design-for-testability (DFT) need to be low cost and brought to market quickly. In addition, the lack of probe-ability of phones and tablets physically limits traditional JTAG or TAP access.

First, this dissertation compared hardware BIST solutions to software tests. BIST solutions (e.g., APGC, an earlier version of CPGC) can significantly reduce test time because parallel hardware instances can be executed and test instruction loading/unloading is reduced. For example, using the auto-rotation feature, APGC can speed up test time by 32x to 2048x compared to software, depending on an aggressor/victim lane width of eight to 64 lanes.

Next, this dissertation presented the architecture of a reusable BIST engine called CPGC that is used in the critical path of DDR3, DDR4, LP-DDR3, and LP-DDR4 IO initialization (also known as link training) on x86 SOCs.
This on-die solution solved the above probe-ability and physical access problems while providing a low-cost solution that can quickly and automatically train and test the IO, which led to the mass adoption and production of the CPGC BIST architecture. Silicon results show that CPGC can speed up validation time by 3x, decrease test time from minutes to seconds, and decrease debug time by 5x, including root-causing boot failures of the memory interface.

Next, this dissertation presented CPGC as the essential BIST engine for IO and memory defect detection and, in some cases, the auto repair of detected memory defects. The software and hardware infrastructure that leverages the CPU L2/L3 cache to enable CBT and parallel execution of the CPGC Intel BIST engine demonstrated 60x to 170x improvements in test time compared to JTAG or TAP based testing.

Next, this dissertation presented a detailed discussion of the CPGC architecture. The architecture combined the BIST needs for both IO and memory cell testing into a single BIST engine. To generate back-to-back, full-bandwidth DDR transactions, a pipelined BIST architecture was developed. The memory cell testing engine is capable of all classical memory test algorithms (e.g., march, MATS++) and detection of classical memory fault models, and it supports highly flexible, user-programmable algorithms for unforeseen future defects. Although effort was put into the architecture to enable 3DS memory IC testing, not enough 3DS memory cell test data was available at the time this dissertation was written.

Lastly, this dissertation presented how the CPGC architecture reduced SOC and platform power by 5% to 11% through closed-loop IO circuit power optimization. This helped to make the overall SOC lower power and more competitive.
This CPGC BIST engine has been developed into a highly successful reusable IP solution, which has been designed into at least 11 Intel CPUs and SOCs (32nm-14nm), with at least seven Intel SOCs successfully debugged, tested, and launched into the marketplace. This ultimately led to over 100 million CPUs based on this architecture being shipped within one quarter. The architecture is used on a daily basis around the world.

For future work, as 3DS logic and memory ICs become more prevalent, more data on 3DS IC memory cell tests will be collected to ensure that the CPGC BIST engine architecture is efficient at 3DS memory cell testing. To adapt to different test scenarios, the CPGC BIST engine IP will be made more modular and flexible, with a focus on achieving efficient test time and efficient design time.