Special Session: In-System-Test (IST) Architecture for NVIDIA Drive-AGX

2019 IEEE 37th VLSI Test Symposium (VTS)

Special Session: In-System-Test (IST) Architecture for NVIDIA Drive-AGX Platforms

Pavan Kumar Datla Jagannadha, Mahmut Yilmaz, Milind Sonawane, Sailendra Chadalavada, Shantanu Sarangi, Bonita Bhaskaran, Shashank Bajpai, Venkat Abilash Reddy, Jayesh Pandey, Sam Jiang
NVIDIA Corp, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
{pavand, myilmaz, msonawane, schadalavada, ssarangi, bbhaskaran, sbajpai, vreddy, jpandey, sajiang}@nvidia.com

Abstract - Safety is one of the crucial features of autonomous drive platforms, and semiconductor chips used in these architectures must guarantee the functional safety aspects mandated by the ISO 26262 standard. To monitor failures due to field defects, in-system structural tests are automatically run during key-on and/or key-off. Upon detection of any permanent defects by the in-system-test (IST) architecture, the Drive platform responds to achieve the fail-safe state of the system. In this paper, we present the IST architecture that helps with achieving the highest functional safety levels on the NVIDIA Drive platform.

Keywords - automotive, functional safety, ISO 26262, in-system-test, MBIST, LBIST, ATPG, power, permanent faults

I. INTRODUCTION

The design complexity of SOCs (System-On-Chip) and GPUs (Graphics Processing Unit) used in autonomous driving applications is on a steady rise. Companies are moving to lower semiconductor technology nodes to meet the high-performance requirements of these mission critical applications. These SOC and GPU chips are tested during production testing to screen for manufacturing defects, and only known good die are used on system boards designed for drive platform applications.

Automotive drive platform applications are considered highly safety critical [1], and any failure of the integrated circuits (ICs) used in these systems could be life threatening. The NVIDIA Drive AGX platform (Fig. 1) is architected from the ground up for safety. It is an AI (Artificial Intelligence) super computer that can run an array of deep neural networks simultaneously and is designed to safely handle highly automated and fully autonomous driving. The Drive AGX platform has different configurations that can support various performance needs. Drive AGX Xavier™ includes a single Xavier AI SOC and delivers 30 TOPS (Tera Operations Per Second) of performance while consuming only 30 watts of power. It includes various processors for redundant and diverse deep learning algorithms. For applications that require ultimate performance, NVIDIA DRIVE AGX Pegasus™ achieves 320 TOPS of deep learning with an architecture built on two NVIDIA Xavier™ processors and two next generation GPUs based on the Turing architecture [2].

Fig. 1. NVIDIA Drive AGX Pegasus™ Platform

Automotive ICs are screened with high-quality test methods to achieve near-zero DPPM (Defective Parts Per Million). Even with such high standards of testing, there can be reliability defects that manifest during in-system field operation due to environmental or operating conditions [3].

The ISO 26262 standard [4] defines functional safety for all electronic and electrical equipment used in automotive safety-related systems. These functional safety features form an integral part of each automotive product development phase. ISO 26262 also defines the various ASIL (Automotive Safety Integrity Level) standards applicable to the lifecycle of these systems. Different ASIL levels (A, B, C, and D) have different test coverage targets, with ASIL-D being the most stringent among these and requiring the highest test coverage.

NVIDIA ICs used in automotive applications have built-in functional safety mechanisms such as ECC or functional redundancy. IST supplements these safety mechanisms to achieve the highest possible ASIL level for permanent fault coverage targets.

IST involves execution of structural ATPG (Automatic Test Pattern Generation) vectors, i.e., deterministic scan compression and Logic Built-In Self-Test (LBIST), and a comprehensive set of MBIST (Memory Built-In Self-Test) algorithms during key-on and/or key-off to determine a pass or a fail status. IST can cover all fault models applicable to lower geometry FinFET technologies. The challenge was to translate the execution of these vectors into a fully self-contained functional feature that could be used repeatedly in an automotive system for the lifetime of the vehicle, within the test-time and power budgets.

In this paper, we present details of the IST operation and the challenges we had to overcome. Section II lists the goals of IST. The high-level IST architecture is presented in Section III. Section IV covers the challenges we faced and the corresponding solutions. We conclude the paper with silicon results and final remarks.

978-1-7281-1170-4/19/$31.00 ©2019 IEEE

II. GOALS

The primary goals of the IST architecture can be categorized as follows:

• High quality test: To achieve the highest ASIL safety level, the design under test (DUT) needs to have very high test coverage of permanent faults. Additionally, a comprehensive set of fault models needs to be supported by the test to detect lower geometry FinFET design defects [5][6][7].
• Low latency: High-quality test patterns are quantified by the highest test coverage achieved with the shortest possible test time and the smallest test data volume.
• Architecture flexibility: The architecture should be fully scalable for varied clock frequencies and data rates to adapt to power, storage, and latency requirements. It should also support different design configurations.
• TDP (Thermal Design Power) budget: IST execution must stay within the limits of the functional TDP.
• Debug and diagnosis: The architecture should support all modes of debug and diagnosis and provide traceability for field returns.

III. HIGH LEVEL IST ARCHITECTURE

The IST architecture enables structural testing of a complex SOC system to detect permanent faults in the field. It can be used to supplement functional safety mechanisms for the permanent fault coverage goals specified in ISO 26262 [4]. It is fully scalable and can meet the various requirements over the life cycle of the product.

IST supports key-on and key-off testing, updating test configurations and their application sequence, and targeting comprehensive fault models under different test conditions, e.g., voltage and clock frequencies. The scheme also maintains a high level of in-field diagnostic granularity for scan and MBIST test patterns.

This architecture is not limited to the in-field application of test patterns. It can also be used for System Level Test (SLT) to screen for defects and further improve test quality. For example, it can help bridge the gap between the ATE environment and the platform specific operating conditions.

IST uses a combination of hardware and software components to test a Xavier SOC standalone and/or a discrete Turing GPU paired with a Xavier SOC. Fig. 2 shows an overview of the IST architecture, where test data and results are stored off-chip in the eMMC (Embedded Multi-Media Card) flash memory on the platform. The eMMC memory size requirements are based on the desired test quality and the cost of the platform. For the DRIVE-AGX platforms, the test data for the Turing dGPU is also stored in the eMMC flash memory, which is connected to the Xavier SOC. The data is transported from the Xavier SOC to the Turing over PCIe. The hardware (HW) controller has a direct communication path with the flash memory.

Fig. 2. In-System Test Architecture

For IST, production ATE test patterns should be translated into a packet format that can be stored on eMMC memory, then fetched and decoded by HW controllers on chip. During production testing, test data is applied from the primary pins of the SOC and/or GPU using ATE platforms. During IST, test data application is enabled by intercepting the multiplexers inside the IP being tested. Customized software tools were developed to create and store IST test programs on eMMC memory.

IST HW controllers on Xavier and Turing communicate with various on-chip as well as platform components to execute the tests utilizing the IEEE 1500, scan compression [8][9], XLBIST [10], and MBIST systems on the chip. The HW controllers are programmed via IEEE 1500 as well as via software registers. The controllers can handle platform interrupts, e.g., thermal interrupts, and power cycling to meet the performance and latency requirements specified for a given system.

Based on the request to execute IST either during key-on or key-off, the system software configures the chip to platform specific operating conditions, e.g., clock frequencies, power and voltage settings, and coverage target. The testing conditions and test application sequence are flexible and can be updated over the product life cycle.

The high-level IST operational sequence is illustrated in Fig. 3. Custom designed HW controllers execute MBIST,

Our manufacturing test programs ensure that defective parts are not shipped to customers, and the design-for-testability (DFT) structures we insert throughout the chip ensure that these tests are of the highest quality with quantifiable coverage and low latency. The challenge was to translate this expertise in manufacturing testing into a fully self-contained functional feature that could exercise the same high-quality structural tests in an automotive system. Furthermore, IST needs to run these tests while a subset of the logic stays functional.

A. Test-Data and Test-Latency Limitations

The key-on and key-off application of IST mandates a very low test latency relative to ATE tests. Furthermore, external memory requirements should be minimized to reduce platform cost.

One of the first options for in-field testing is BIST mechanisms: LBIST and MBIST. MBIST has a proven track record of achieving comprehensive coverage of memory structures with low test latency, whereas we cannot say the same for LBIST. The XLBIST test patterns require minimal test data, but these non-deterministic patterns, in general, do not yield high test coverage on complex SOCs, especially under the constraints of low latency and strict power budgets.
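
To illustrate why LBIST-style patterns need so little stored test data, the following is a minimal Python sketch of the generic textbook scheme: a linear-feedback shift register (LFSR) expands a seed into pseudo-random patterns, and a multiple-input signature register (MISR) compacts the responses into one signature that is compared against a stored golden value. It is not the XLBIST implementation referenced in this paper. The 16-bit width, the tap positions, the seed 0xACE1, the pattern count, and the stand-in logic block `toy_logic` are all assumptions chosen for illustration.

```python
"""Toy LBIST model: an LFSR generates patterns, a MISR compacts responses.

Everything here is illustrative. The register width, tap positions, seed,
and the stand-in logic block are assumptions, not the XLBIST design.
"""
from typing import Optional

WIDTH = 16
MASK = (1 << WIDTH) - 1


def lfsr_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one shift.

    Feedback is the XOR of bits 0, 2, 3 and 5, a commonly cited tap set
    for the polynomial x^16 + x^14 + x^13 + x^11 + 1.
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return ((state >> 1) | (bit << (WIDTH - 1))) & MASK


def toy_logic(pattern: int, stuck_at_0_bit: Optional[int] = None) -> int:
    """Hypothetical logic block; optionally force one input bit to 0."""
    if stuck_at_0_bit is not None:
        pattern &= ~(1 << stuck_at_0_bit) & MASK
    return ((pattern ^ (pattern >> 3)) + (pattern & 0x0F0F)) & MASK


def misr_step(signature: int, response: int) -> int:
    """Fold one response word into the running signature."""
    return lfsr_step(signature) ^ (response & MASK)


def run_lbist(seed: int, num_patterns: int,
              stuck_at_0_bit: Optional[int] = None) -> int:
    """Apply num_patterns pseudo-random patterns and return the signature."""
    state, signature = seed & MASK, 0
    for _ in range(num_patterns):
        signature = misr_step(signature, toy_logic(state, stuck_at_0_bit))
        state = lfsr_step(state)
    return signature


if __name__ == "__main__":
    SEED, COUNT = 0xACE1, 1000
    golden = run_lbist(SEED, COUNT)   # computed once on a fault-free model
    retest = run_lbist(SEED, COUNT)   # in-field rerun, still fault-free
    faulty = run_lbist(SEED, COUNT, stuck_at_0_bit=4)
    print("fault-free rerun passes:", retest == golden)
    print("injected fault flagged: ", faulty != golden)
```

In this sketch the stored test data reduces to a seed, a pattern count, and one golden signature, regardless of how many patterns are applied. A fault normally changes the signature, although a MISR can alias (a faulty response stream can occasionally compact to the golden value). Pseudo-random patterns also tend to miss faults that need specific input combinations, which is consistent with the coverage limitation noted above.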
