(GPU) Devices

Edward J. Wyrwas [email protected] 301-286-5213 Lentech, Inc. in support of NEPP

Acknowledgment: This work was sponsored by: NASA Electronic Parts and Packaging (NEPP)

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 1 Acronyms

Acronym Definition Acronym Definition Acronym Definition 1MB 1 Megabit Gov't Government NRL Naval Research Laboratory 3D Three Dimensional GPU Graphics Processing Unit NRO United States Navy National Reconnaissance Office 3DIC Three Dimensional Integrated Circuits GRC NASA Glenn Research Center NSWC Crane Naval Surface Warfare Center, Crane Division ACE Absolute Contacting Encoder GSFC Goddard Space Flight Center OCM On- RAM ADC Analog to Digital Converter GSN Goal Structured Notation PBGA Ball Grid Array AEC Automotive Electronics Council GTH/GTY Transceiver Type PC Personal AES Advanced Encryption Standard HALT Highly Accelerated Life Test PCB AF Air Force HAST Highly Accelerated Stress Test PCIe Component Interconnect Express AFRL Air Force Research Laboratory HBM PCIe Gen2 Peripheral Component Interconnect Express Generation 2 AFSMC Air Force Space and Missile Systems Center HDIO High Density Digital Input/Output AMS Agile Mixed Signal HDR High-Dynamic-Range PLL Phase Locked Loop ARM ARM Holdings Public Limited Company HiREV High Reliability Virtual Electronics Center POL point of load BGA Ball Grid Array HMC PoP Package on Package BOK Body of Knowledge HP Labs Hewlett-Packard Laboratories PPAP Production Part Approval Process CAN Controller Area Network HPIO High Performance Input/Output Proc. Processing CBRAM Conductive Bridging Random Access Memory HPS High Pressure Sodium PS-GTR High Speed Bus Interface CCI Correct Coding Initiative HUPTI Hampton University Proton Therapy Institute QDR quad data rate CGA Column Grid Array I/F interface QFN Quad Flat Pack No Lead CMOS Complementary Metal Oxide Semiconductor I/O input/output QSPI Serial Quad Input/Output Xilinx ceramic flip-chip (CF and CN) packages are ceramic column I2C Inter- CN R& Research and Development grid array (CCGA) packages i2MOS Microsemi second generation of Rad-Hard MOSFET R&M Reliability and Maintainability COTS Commercial Off The Shelf IC Integrated Circuit RAM Random Access Memory CRC Cyclic Redundancy Check IC Integrated Circuit ReRAM Resistive Random Access Memory CRÈME Cosmic Ray Effects on Micro Electronics I- independent cache RGB Red, Green, and Blue CRÈME MC Cosmic Ray Effects on Micro Electronics Monte Carlo IUCF Indiana University Cyclotron Facility RH Radiation Hardened CSE Crypto Security Engin JFAC Joint Federated Assurance Center SATA Serial Advanced Technology Attachment CU JPEG Joint Photographic Experts Group SCU Secondary Control Unit D-Cache defered cache Joint Test Action Group (FPGAs use JTAG to provide SD Secure Digital DCU Distributed Control Unit JTAG access to their programming debug/emulation functions) SD/eMMC Secure Digital embedded MultiMediaCard DDR Double Data Rate (DDR3 = Generation 3; DDR4 = Generation 4) KB Kilobyte SD-HC Secure Digital High Capacity DLA Defense Logistics Agency SDM Spatial-Division-Multiplexing L2 Cache independent caches organized as a hierarchy (L1, L2, etc.) DMA SEE Single Event Effect DMEA Defense MicroElectronics Activity LANL Los Alamos National Laboratories SESI secondary electrospray ionization DoD Department of Defense LANSCE Los Alamos Neutron Science Center Si Silicon DOE Department of Energy LLUMC Loma Linda University Medical Center SiC Silicon Carbide DSP Digital Signal Processing L-mem Long-Memory SK Hynix SK Hynix Semiconductor Company dSPI Dynamic Signal Processing Instrument LP Low Power SLU Saint Louis University Dual Ch. Dual Channel LVDS Low-Voltage Differential Signaling SMDs Selected Item Descriptions ECC Error-Correcting Code LW HPS Lightwatt High Pressure Sodium SMMU System EEE Electrical, Electronic, and Electromechanical M/L BIST Memory/Logic Built-In Self-Test SNL Sandia National Laboratories EMAC Equipment Monitor And Control MBMA Model-Based Missions Assurance SOA Safe Operating Area EMIB Multi- Interconnect Bridge MGH Massachusetts General Hospital SOC Systems on a Chip ESA European Space Agency Mil/Aero Military/Aerospace SPI Serial Peripheral Interface eTimers Event Timers MIPI Mobile Industry Interface STT Spin Transfer Torque ETW Electronics Technology Workshop MMC MultiMediaCard TBD To Be Determined FCCU Fluidized Catalytic Cracking Unit MOSFET Metal-Oxide-Semiconductor Field-Effect Temp Temperature MP FeRAM Ferroelectric Random Access Memory THD+N Total Harmonic Distortion Plus Noise MP Multiport Fin Field Effect Transistor (the conducting channel is wrapped by a TRIUMF Tri-University Meson Facility FinFET MPFE Multiport Front-End thin silicon "fin") T-Sensor Temperature-Sensor MPU Microprocessor Unit FPGA Field Programmable Gate Array TSMC Taiwan Semiconductor Manufacturing Company Msg message FPU Floating Point Unit U MD University of Maryland NAND Negated AND or NOT AND FY Fiscal Year UART Universal Asynchronous Receiver/Transmitter NASA National Aeronautics and Space Administration GaN Gallium Nitride UFHPTI University of Florida Proton Health Therapy Institute GAN GIT GaN GIT Eng Prototype Sample NASA STMD NASA's Space Technology Mission Directorate UltraRAM Ultra Random Access Memory GAN SIT Gallium Nitride GIT Eng Prototype Sample Navy Crane Naval Surface Warfare Center, Crane, Indiana USB Universal Serial Bus Gb Gigabyte NEPP NASA Electronic Parts and Packaging VNAND Vertical NAND GCR Galactic Cosmic Ray NGSP Next Generation Space Processor WDT Watchdog Timer GIC Global Industry Classification NOR Not OR logic gate

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 2 Outline

• What the technology is (and isn’t) • Our tasks and their purpose • Roadmap • Partners • Test Readiness • Comments

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 3 Technology

• Graphics Processing Units (GPU) & General Purpose Graphics Processing Units (GPGPU) – Are considered a compute device or – Is not a standalone multiprocessor (even when contained in an SoC)

• Application workflow: – Run the sequential part of their workload on the CPU – which is optimized for single-threaded performance – Accelerate parallel processing using multi-thread performance on the GPU

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 4 Device Packaging

Nvidia TX1 SoC Skylake Processor

Qualcomm

Nvidia GTX 1050 GPU AMD RX460 GPU

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 5 Purpose

• GPUs are best used for single instruction- multiple data (SIMD) parallelism – Perfect for breaking apart a large data set into smaller pieces and processing those pieces in parallel

• Key computation pieces of mission applications can be computed using this technique – Sensor and science instrument input – Object tracking and obstacle identification – convergence (neural network) – Image processing – Data compression

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 6 FY18-19: GPU Testing

Description: FY18-19 Plans: – This is a task over all device topologies and process – Continue development of universal test suite which includes math, output buffer (colors), memory hierarchy and neural networks – The intent is to determine inherent radiation tolerance and sensitivities – Probable test structures for SEE: – Identify challenges for future radiation hardening efforts – Nvidia (16, 14, 10nm) – Investigate new failure modes and effects – AMD (14, 10nm) – Testing includes total dose, single event (proton) and reliability. – Intel (14) Test will include a GPU devices from nVidia and other – (10nm) vendors as available – Compare to previous generations – Tests: – Investigate failure modes/compensation for increased – characterization pre, during and post-rad power consumption

Schedule: Deliverables:

Microelectronics FY18 FY19 – Test reports and quarterly reports T&E M J J A S O N D J F M A – Expected submissions for publications On-going discussions for test samples GPU Test Development SEE Testing NASA and Non-NASA Organizations/Procurements: Analysis and Comparison – Source procurements: Proton (MGH), TID (GSFC), Laser (NRL)

Lead Center/PI: GSFC/Lentech/Wyrwas Co-Is: Carl Szabo

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 7 GPU Roadmap

BOK Body of Knowledge Document eDAA…

GPUs Radiation Testing – 14nm Nvidia GTX 1050 – 14nm AMD Collaboration with TBD – 7nm AMD Vega – 12nm Nvidia

System on Chip Radiation Testing – 20nm Nvidia X1 – 16nm Nvidia Tegra X2 TBD – 14nm Intel+AMD GPU – 12nm+ Nvidia Xavier –

Neural Chips – KnuEdge Hermosa – KnuEdge Hydra TBD

FY17 FY18 FY19 To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 8 Partners

Ongoing and new collaborations: • JPL (Steve Guertin, Andrew Daniel) • GSFC Microwave Branch • Navy Crane (Dobrin Bossev, Jonathan • GSFC Photonics Group Wang) • Harris Corporation • NEPP Microprocessors (Carl Szabo) • Ball Aerospace • Dr. Paolo Rech (UFRGS) • General Atomics • Cubic Aerospace • LetSAT.org (LeTourneau University) • TuSimple • (AMD) • JSC Human Interface Branch

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 9 Test Readiness & Results

• A universal test bench is under development to provide a standardized approach to test GPUs with minimal variation between device types. The test bench must perform comparably under Proton, Heavy-Ion, Laser and Total Ionizing Dose tests.

400W Cooling on Bare NVIDIA GTX 1050 180W Cooling on Lidded AMD Ryzen CPU

• A cooling solution created for GPU testing has been refined to also cool socketed CPUs such as an AMD Ryzen microprocessor which contains a GPU. This technique can be applied to System on Module (SOM) devices too.

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 10 Test Readiness & Results

• Three types of payloads have been created for the GPU test bench: Neural Network, Math-Logic and Colors. – The neural network is a convolutional neural network (CNN) which can avoid processor optimizations that recursive neural networks (RNN) primarily benefit from. – Math-Logic uses mathematics and conditional logic statements to exercise memory hierarchy. – The Colors payload assesses corruption in the output image presented to a display.

Algebra and Matrices Color Output Neural Networks

⋯ 퐴 = 휋푟2 푂푅 ⋮ ⋱ ⋮ ⋯

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 11 Comments

• The NEPP GPU standardized approach involves: – rapid development of cooling system for each DUT form factor and packaging type – system implementation using modular COTS’ system and network components – public domain that has been excessively tested by the community – payloads that can be easily updated to accommodate new DUTs while maintaining the ability to test older DUTs

• References – Nvidia Jetson TX1 (SoC) http://hdl.handle.net/2060/20170009004 – Nvidia GTX 1050 (Discrete GPU) http://hdl.handle.net/2060/20170009005 – Considerations for testing (overview) http://hdl.handle.net/2060/20170004734 – NEPP GPU Body of Knowledge document TBD – NEPP Website – Standardizing GPU Radiation Test Approaches TBD – SEE Symposium, May 2018

(other documents will be published after review)

To be presented by Edward J. Wyrwas at the 2018 NEPP Electronics Technology Workshop (ETW), NASA GSFC, Greenbelt, MD, June 18-21, 2018. 12