
Radiation Tolerant Intelligent Memory Stack (RTIMS)

Authors: Tak-kwong Ng, Jeffrey A. Herath
Affiliation: NASA Langley Research Center
Email: t.ng@.gov, [email protected]

Abstract

The Radiation Tolerant Intelligent Memory Stack (RTIMS), suitable for both geostationary and low earth orbit missions, has been developed. The memory module is fully functional and undergoing environmental and radiation characterization. A self-contained "flight-like" module is expected to be completed in 2006.

RTIMS provides reconfigurable circuitry and 2 gigabits of error corrected or 1 gigabit of triple redundant digital memory in a small package. RTIMS utilizes circuit stacking of heterogeneous components and radiation shielding technologies. A reprogrammable field programmable gate array (FPGA), six synchronous dynamic random access memories, a linear regulator, and the radiation mitigation circuitry are stacked into a module of 42.7mm x 42.7mm x 13.00mm.

Triple modular redundancy, current limiting, configuration scrubbing, and single event functional interrupt detection are employed to mitigate radiation effects. The mitigation techniques significantly simplify system design.

RTIMS is well suited for deployment in real-time data processing, reconfigurable computing, and memory intensive applications.

1. Introduction

NASA has identified many systems that will require on-board satellite data processing for future missions. Continuing themes for addressing these requirements include the need for ever-increasing resolution, improved data quality, and additional capacity for raw and/or processed data. The requirement to efficiently handle large data sets necessitates the use of larger on-board memories. This paper discusses the development of a radiation tolerant memory suitable for both geostationary (GEO) and low earth orbit (LEO) missions.

RTIMS has been designed, built, tested and integrated onto a 3U Compact PCI printed circuit board (PCB). RTIMS capitalizes on previous technology investments at NASA Langley Research Center (LaRC), NASA Goddard Space Flight Center (GSFC), ASRC Aerospace, Irvine Sensors and 3D Plus USA. RTIMS incorporates the circuit stacking technology of 3D-Plus USA, as well as package-level radiation shielding and novel radiation mitigation techniques developed at LaRC. By using reprogrammable FPGA technology for the memory controller, RTIMS can also be a key element in adaptive/reconfigurable computing applications.

2. Need & Benefits

The need for compact, high performance, radiation tolerant memories has been identified as critical to enabling our nation's future space missions. Improved memory technologies offer reductions in size, cost, mass, power, risk and system complexity, and can be applied to advanced data processing, high bandwidth communication, and meeting the data volume needs of the next-generation high-resolution sensors. Advances made by the RTIMS technology provide the following key benefits:

1. Significant reductions in the size and mass of mission memory arrays.
2. A radiation tolerant memory suitable for both GEO and LEO space missions by using new package level radiation shielding technology and Triple Modular Redundancy (TMR) FPGA techniques.
3. Simplified interface to a large SDRAM memory array with built-in logic for timing reads, writes and refresh cycles.
4. Novel self scrubbing and Single Event Functional Interrupt (SEFI) detection that allow a relatively "soft" FPGA to become radiation tolerant without external scrubbing and monitoring hardware.
5. Efficient design, test, and validation by incorporating the radiation mitigation technology at the component level rather than requiring the designer to implement a design unique solution at the system level.
6. Increased system reliability by distributing the radiation mitigation structure to each component instead of a single point failure at the system level.
7. Added mission flexibility by operating the memory array in a TMR architecture with 1Gb of storage or in an EDAC (Error Detection And Correction) mode where part of the memory is used to detect and correct errors with 2Gb of storage (corrects single bit errors and detects double bit errors). This allows RTIMS to be used effectively on many types of missions because it can be configured for the "harshness" of the expected environment.
8. Tightly integrated solution that includes a Field Programmable Gate Array (FPGA), six synchronous dynamic random access memories (SDRAMs), a linear regulator, and the radiation mitigation circuitry in a single stack or component.
9. In-flight reconfigurability using Static Random Access Memory (SRAM) based FPGA technology allows the design to overcome both hardware and software errors that may be detected after launch during mission operations. This reduces overall mission risk, which is increasingly important as flight system development times and budgets decrease. It also allows RTIMS to adapt to changing mission conditions.
10. Additional logic and processing resources. By reprogramming the FPGA a designer can take advantage of the "extra" elements in the FPGA to implement other portions of their design.

These features provide considerable engineering savings as well as advanced functionality. RTIMS is an enabling technology because it allows standardized hardware to be used for completely different missions, lowering costs since the same hardware design can be reused.

3. RTIMS Module Construction

The RTIMS module is a major step forward in the state of the art for space based memory arrays. A block diagram of RTIMS is shown in Figure 1. The new stacking technology used for RTIMS allows a heterogeneous stack of electronic parts to be built into a single component. The electronic parts in the stack can be bare electronic die, packaged parts (plastic or ceramic), or chip passives (capacitors and resistors). This new stacking technology typically provides an 80% reduction in required volume or footprint for a given application.

Figure 1. RTIMS Block Diagram (Xilinx Virtex-II X2V1000 FPGA, 18V04 EEPROM, LDO voltage regulator, power-on reset, and six SDRAMs, each supplied through a current-limited switch with a #FAIL output; module interface signals Address0~26, Data_In0~15, Data_Out0~15, Misc_Ctrl_Signals0~9)

NASA GSFC, in conjunction with Centre National d'Études Spatiales (CNES) and the European Space Agency (ESA), completed a study on this stacking technology from 3D Plus. The study determined that the stacking technology is rugged enough and suitable for space applications [1].

The mechanical development of a 3D module is very similar to that used for a standard Multi Chip Module (MCM). The main difference from an MCM is the splitting of the design function onto several layers. These layers are essentially individual circuit boards allowing the stacking of electronic components placed on them. Once each layer (circuit board) of the module is designed, the layers can be fabricated, populated with components, tested, and burned in, all before final module assembly. This reduces the impacts of manufacturing errors and faulty components.

The layers are then stacked together, aligned, and vertically spaced. A very pure epoxy resin from Dexter (HYSOL FP4450) is used to fill the spaces around the entire module and between each layer. After resin polymerization, the module is cut from the mold, exposing the external connections (flying-leads) that will be used to vertically interconnect the layers. Nickel is then chemically and electro-chemically deposited onto the module. This effectively shorts all of the flying leads. A laser is then used to create grooves that leave the appropriate vertical connections. Finally, a mission specific thickness of Tantalum is installed for radiation shielding.

Figure 2. RTIMS Cross Section

The RTIMS module measures 42.7mm x 42.7mm x 13.0mm as shown in Figure 3.

Figure 3. RTIMS Dimensions

The RTIMS module is designed with a 144 pin quad flat pack footprint and three active layers of circuitry. A cross section of the RTIMS module (without tantalum shield installed) is shown in Figure 2.

RTIMS also incorporates internal thermal drains and internal radiation shielding. A completed RTIMS module is shown in Figure 4.

Figure 4. RTIMS Module

4. Making the System Radiation Tolerant

4.1. Robustness vs. Capacity

A single event upset (SEU) may introduce one or more data bit errors in an SDRAM device. The RTIMS architecture provides the flexibility of organizing the six SDRAMs, each of 1 Gb capacity, into a 128 MWord (2 bytes per word) EDAC memory, or a 64 MWord TMR memory. This can be easily accomplished if all of the signal pins of the SDRAMs are connected to the FPGA. However, the Xilinx device selected for this project, XQ2V1000 in a BG575 package, does not have sufficient pins to support this solution.

Figure 5 shows the RTIMS wiring scheme between the FPGA and the SDRAMs. Clk_TR0, Clk_TR1, and Clk_TR2 are copies of the SDRAM clock.

When operating in TMR memory mode, the control buses Add/Ctl_TR3, Add/Ctl_TR2, Add/Ctl_TR1, and Add/Ctl_TR0 have the same signaling. During a write operation, DQ5, DQ4, and DQ3 are driven with the high order byte of the data word, and DQ2, DQ1, and DQ0 are driven with the low order byte. During a read, DQ5, DQ4, and DQ3 are voted to generate the high order byte of the data word, and DQ2, DQ1, and DQ0 are voted to generate the low order byte. There is no single point failure in this configuration.

When operating in EDAC memory mode, the SDRAMs are organized into two banks of 64 MWords each. The Add/Ctl_TR3 and Add/Ctl_TR1 buses have the same signaling and control one bank of memory. The Add/Ctl_TR2 and Add/Ctl_TR0 buses also have the same signaling and control the other bank. During write operations, either DQ5, DQ4, DQ2, or DQ3, DQ1, DQ0 are driven with the data word and the check bits. Similarly, during read operations, either DQ5, DQ4, DQ2, or DQ3, DQ1, DQ0 are selected for the data word and check bits. In this mode the control signals are single point failures: an SEU on one of them can induce multiple uncorrectable data bit errors.

Figure 5. RTIMS FPGA SDRAMs Wiring

4.2. Single Point Failure

The FPGA and the SDRAMs are susceptible to SEU. The logic implemented in the FPGA utilizes the Xilinx TMR strategy, and the Xilinx TMRTOOL [2] is employed to triplicate the design. Due to the limited number of signal pins on the FPGA package, it is not possible to triplicate every I/O signal. The single pin I/O signals thus become single point failures.

The DataIn and DataOut buses of the module interface signals are not triplicated. If this is shown to be problematic, an EDAC scheme can be implemented to protect these buses. The twelve spare I/O on the module can be utilized for this purpose.
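The TMR write and read paths described in Section 4.1 can be illustrated with a short behavioural sketch. This is only a software model under stated assumptions, not the RTIMS FPGA logic: the function names are hypothetical, the six byte lanes are held in a list indexed 0..5 to stand for DQ0..DQ5, and the vote is assumed to be a bitwise 2-of-3 majority.

```python
from typing import List


def vote_byte(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three 8-bit copies."""
    return ((a & b) | (a & c) | (b & c)) & 0xFF


def tmr_write(word: int) -> List[int]:
    """Split a 16-bit word into six byte lanes (index i models DQi).

    Lanes 0..2 carry the low order byte, lanes 3..5 the high order byte.
    """
    low = word & 0xFF
    high = (word >> 8) & 0xFF
    return [low, low, low, high, high, high]


def tmr_read(lanes: List[int]) -> int:
    """Vote each triple of lanes and reassemble the 16-bit word."""
    low = vote_byte(lanes[0], lanes[1], lanes[2])
    high = vote_byte(lanes[3], lanes[4], lanes[5])
    return (high << 8) | low


if __name__ == "__main__":
    lanes = tmr_write(0xBEEF)
    lanes[1] = 0x00      # one low-byte copy fully corrupted
    lanes[4] ^= 0x81     # two bits flipped in one high-byte copy
    print(hex(tmr_read(lanes)))
```

One corrupted copy per triple is outvoted by the other two. If two copies of the same byte are corrupted in the same bit position, the vote returns the wrong value, which is why upsets need to be corrected before they accumulate.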

The interface signals between the FPGA and the SDRAMs are not triplicated. An upset to any one of the control signals may result in multiple bit data errors. This may occur when RTIMS is operating in EDAC memory mode, possibly resulting in unrecoverable errors. For the TMR memory mode, this would only affect one of the three redundant words and would therefore be corrected.

An upset to one of the data signals may result in a single bit error. For the EDAC memory mode, this would result in a single bit error, but the data would be recovered (error corrected) via the EDAC circuit. For the TMR memory mode, it would only affect one bit of the three redundant words and would also be corrected.

4.3. SEU

The FPGA is susceptible to SEU. The Xilinx XTMR scheme is employed to mitigate the problem. This scheme works based on the assumption that one SEU only induces one upset on a signal and its fan-outs. Furthermore, it is assumed that the upset is corrected before the next SEU. Current manufacturer test results indicate these assumptions are realistic.

4.4. Dynamic State Recovery

With redundancy, the circuit can function correctly with errors induced by one SEU. It is essential to correct the error states as quickly as possible, because the redundancy employed may not correct as errors accumulate. The Xilinx XTMR scheme guarantees the correctness of the inputs of the flip-flop for a single error induced by an SEU.

An upset flip-flop can be corrected only when it is reloaded with the corrected value. The flip-flop primitive in the FPGA has a Clock Enable (CE) pin that controls the loading of the flip-flop. Hence an upset flip-flop is corrected only after the CE pin is made active. To mitigate this, the usage of flip-flops with rarely active CE pins is eliminated in the RTIMS design.

4.5. Configuration Memory Refresh

The FPGA is programmed to implement an application by setting the states of the configuration memory cells. The configuration memory cells are sensitive to SEU. It is essential to correct a configuration memory upset as quickly as possible, because the logic redundancy employed may not correct as errors accumulate.

RTIMS utilizes a simple approach by refreshing the configuration memory cells at a regular interval that is selectable by the application. The default interval is 1 hour. The interval can be set to 10 minutes, 30 minutes, 1 hour, 4 hours, 24 hours, or off.

The configuration bit stream is stored in the non-volatile radiation tolerant EEPROM. Upon power-on the FPGA loads its configuration from the EEPROM. It is the design objective to store one configuration bit stream that would work for both initial configuration and periodic configuration refreshing.

Table 1 outlines the default configuration bitstream [3]. The default bitstream loads the entire FPGA with one command as indicated in step 2. This also overwrites the block RAM. For configuration refreshing, the contents of the block RAM are not overwritten, requiring instead a modified step 2. Steps 3 and 4 are also skipped during configuration refreshing.

Step  Function
1     Set up
2     Write Configuration
3     Initialize registers, activate all interconnects
4     Start startup sequence
5     Check CRC, Desynch to desynchronize the configuration logic, 4 NOPS

Table 1. Default Bitstream

The modified configuration bitstream is shown in Table 2. During initial configuration loading, steps 2A-2C are functionally equivalent to step 2 in the default bitstream. During configuration refresh, the content of the EEPROM is read sequentially, but steps 2C-5 are skipped by deasserting the CS signal on the FPGA. This effectively skips the refreshing of the BRAM content, the initialization of registers, and the startup sequence. As steps are skipped, a revised CRC word is supplied. Step 5 is therefore also skipped. The revised CRC word is supplied in step 6. During initial configuration loading, though the CS signal is asserted, the Desynch command in step 5 signals the end of the configuration.

Step  Function
1     Set up
2A    Write Configuration Stream to GCLK, IOB1, IOI1, CLBs, IOI2, IOB2
2B    Write Configuration Stream to BRAM INT
2C    Write Configuration Stream to BRAM content AUTO CRC
3     Initialize registers, activate all interconnects
4     Start startup sequence
5     Check CRC, Desynch to desynchronize the configuration logic, 4 NOPS
6     Check revised CRC, Desynch to desynchronize the configuration logic, 4 NOPS

Table 2. Modified Bitstream

The configuration memory refresh controller can be implemented externally or internally. When it is implemented externally, it requires additional radiation tolerant devices. For this project, the configuration memory refresh controller is implemented in the FPGA that is internal to the RTIMS module. The internal controller does not increase the risk of failure as long as any SEU induced errors are corrected before the next SEU.

4.6. Detection of Refresh Failure

The configuration logic of the FPGA is hard-wired, and is not TMR protected. Though it has a small footprint, it has been shown to be susceptible to SEU. An upset here can only be corrected by a full initialization of the FPGA.

Previous methods infer that the configuration logic is not functioning properly when reads of certain configuration control registers fail. This method is indirect and inefficient, requiring triplicated configuration data bus pins on the FPGA (since the configuration refresh controller is internal). An alternative, efficient method for detecting configuration refresh failure is discussed next.

During configuration refresh, all distributed memory that is implemented with look-up tables within the Configurable Logic Blocks is initialized. This process can be used to detect the failure of the configuration refresh.

The configuration refresh control includes a 16x1 distributed memory. Just prior to configuration refresh, a known pattern is written into this memory. Configuration refresh, when successful, clears this memory to all zeros. Upon the completion of configuration refresh, this memory is read. Any non-zero content directly indicates the failure of configuration refresh.

4.7. DCM Failure Recovery

It has been shown that the functionality of the Digital Clock Manager (DCM) may not be recovered by configuration refresh alone. A reset to the failed DCM is required to recover its functionality. The approach to detecting and correcting this type of failure uses three counters. Each counter is clocked by one of the triplicate DCM clock outputs. The counters are started synchronously and, in normal operation, they will count in lock step. When one of the DCMs fails, the counter associated with this DCM will have a different count than the other two counters. Hence an error is detected and a reset to this DCM is generated.

Figure 6. DCM Checker
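The three-counter comparison of Section 4.7 can be sketched behaviourally as follows. This is a software model under stated assumptions rather than the RTIMS checker itself: the function names are hypothetical, a failed DCM is modelled simply as a clock that stops producing edges, and the comparison is made once per simulated cycle.

```python
from typing import Iterable, Optional, Sequence, Tuple


def find_failed_dcm(counts: Sequence[int]) -> Optional[int]:
    """Return the index of the counter that disagrees with the other two.

    Returns None when all three counters agree. Raises ValueError when no
    two counters agree, which is outside the single-failure assumption.
    """
    a, b, c = counts
    if a == b == c:
        return None
    if a == b:
        return 2
    if a == c:
        return 1
    if b == c:
        return 0
    raise ValueError("no two counters agree")


def run_checker(
    clock_edges: Iterable[Tuple[bool, bool, bool]]
) -> Optional[Tuple[int, int]]:
    """Advance three counters, one per DCM clock, and watch for a mismatch.

    Each element of clock_edges says whether each DCM produced a clock edge
    in that cycle. Returns (cycle, dcm_index) for the first mismatch, which
    is where a reset to that DCM would be generated, or None if the
    counters stay in lock step.
    """
    counts = [0, 0, 0]
    for cycle, edges in enumerate(clock_edges):
        counts = [n + (1 if edge else 0) for n, edge in zip(counts, edges)]
        failed = find_failed_dcm(counts)
        if failed is not None:
            return cycle, failed
    return None


if __name__ == "__main__":
    healthy = [(True, True, True)] * 3
    dcm1_stops = [(True, False, True)] * 2
    print(run_checker(healthy + dcm1_stops))
```

In this model the mismatch is visible on the first cycle in which one clock misses an edge, so the reset can be targeted at the single DCM whose counter fell out of step.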

The DCM checker is shown in Figure 6. The checker itself is triplicated. A router macro is inserted between the DCM clock outputs and the DCM checker. A DCM reset signal is generated when either the Reset_DCM voter fails or the DCM clock output fails.

4.8. SDRAM Scrubbing

The SDRAMs are also susceptible to SEUs. It is essential to prevent the accumulation of errors induced by multiple SEUs. Otherwise the error correction circuitry may not be able to recover the data.

RTIMS implements a scrubbing strategy where the contents of the SDRAMs are read sequentially at a fixed time interval. The interval defaults to one scrub every 30 µs, on average. It takes approximately 32 minutes to scrub the entire TMR configured SDRAM array, and 64 minutes to scrub the entire EDAC configured SDRAM array.

The scrubbing process consists of reading the SDRAM. When an error is detected and the error is correctable, the SDRAM is updated with the corrected data.

Ideally the scrubbing is performed when the SDRAM is idling. Otherwise there will be contention between regular memory accesses and scrubbing memory accesses, resulting in performance degradation.

This strategy is flexible and allows minimal contention by taking advantage of the fact that the time interval between scrubs can vary as long as the average interval is maintained. When normal accesses are sparse, the SDRAM scrubber waits until at least 8 scrubs are pending. It will then do scrubbing accesses until the scrubbing is ahead by 8. When normal accesses are dense, the SDRAM scrubber contends with normal memory access requests, resulting in a maximum 0.2% degradation in throughput.

5. Environmental Qualification

The RTIMS modules are currently installed on a 3U Compact PCI memory board developed at LaRC. This board, in turn, will provide the structural and electronics support during environmental testing for space qualification as dictated in MIL-STD-883 and MIL-STD-202. The RTIMS module was designed for a Total Ionizing Dose (TID) of 100 Krad(Si) at 25°C, latch up immunity to at least 60 MeV-cm²/mg at 25°C, and an operating temperature of -40°C to +85°C. The types of environmental testing for RTIMS include thermal vacuum, vibration, accelerated life, and radiation.

The radiation testing of the RTIMS module investigates the performance of the module in a radiation environment. During initial testing we will take the finished modules to a proton test facility to better understand the proton Single Event Effects (SEE) sensitivity. Follow on testing will include performing a radiation transport analysis using a typical geosynchronous radiation environment to understand the modules' response to Total Ionizing Dose (TID).

5.1. Proton Testing

Piece-part heavy-ion SEE testing has been performed on all devices within the RTIMS module. This testing allowed for an analysis of destructive single event effects and of individual component single event upsets and functional interrupts. However, system-level upset and functional interrupt events are possible that cannot be seen in piece-part testing, and the effectiveness of NASA LaRC's radiation mitigation technology cannot be verified by this type of testing.

Ideally, a system is placed in a single-event-producing environment (similar to the mission) and monitored for these synergistic effects while studying mitigation effectiveness. The stacked nature of the module does not allow ground-based heavy ion sources to penetrate through the entire module structure and impact every device as would higher energy space radiation.
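As a worked check of the scrubbing figures in Section 4.8, the sketch below computes the full-array sweep time and models the deferred scrubbing policy for the sparse-access case. It rests on assumptions that the text does not state: each scrub is taken to cover one word, a MWord is taken as 10^6 words (which reproduces the quoted 32 and 64 minutes; with 2^20 words the TMR sweep comes out nearer 33.6 minutes), and the class and method names are hypothetical. Contention with dense normal accesses is not modelled.

```python
def sweep_minutes(words: int, interval_us: float = 30.0) -> float:
    """Minutes to visit every word once at one scrub per interval (assumed)."""
    return words * interval_us / 1e6 / 60.0


class ScrubScheduler:
    """Deferred scrubbing: wait until 8 scrubs are owed, then burst to 8 ahead."""

    THRESHOLD = 8

    def __init__(self) -> None:
        self.pending = 0       # scrubs owed (positive) or banked ahead (negative)
        self.bursting = False

    def tick(self) -> None:
        """Called once per nominal scrub interval; one more scrub is owed."""
        self.pending += 1

    def next_action(self, bus_idle: bool) -> bool:
        """Return True if a scrub access should be issued now."""
        if self.pending >= self.THRESHOLD:
            self.bursting = True
        if self.bursting and bus_idle:
            self.pending -= 1
            if self.pending <= -self.THRESHOLD:
                self.bursting = False
            return True
        return False


if __name__ == "__main__":
    print(sweep_minutes(64_000_000), sweep_minutes(128_000_000))

    sched = ScrubScheduler()
    for _ in range(8):
        sched.tick()
    issued = 0
    while sched.next_action(bus_idle=True):
        issued += 1
    print(issued, sched.pending)
```

Because the burst runs from 8 scrubs owed to 8 scrubs ahead, the scrubber touches the bus in occasional runs of 16 accesses instead of once every interval, while the long-run average rate is unchanged.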

Where ground-based heavy ions cannot reach every device in a stacked module, high energy proton sources are used instead, since they penetrate the entire module structure. This allows verification of the entire module's performance in a radiation environment. However, since proton-induced SEEs require a nuclear interaction, the event rate is significantly reduced, requiring higher fluences to see effects and larger module sample sizes due to high doses from these higher fluences.

5.2. Radiation Transport Analysis

Every component within the RTIMS module was tested with Cobalt-60 for TID. A couple of the components showed sensitivity to TID, so additional shielding was placed in the package. To better understand how the RTIMS module will perform in a relevant space radiation environment, a radiation transport analysis will be done with the radiation transport code NOVICE, which calculates the expected dosage for each component within the RTIMS module.

For this work, a GEO mission was selected. A year-long radiation environment will be calculated for two conditions: solar active and solar inactive (i.e., the radiation environment includes solar particle events for the solar active environment). These environments will be generated using the standard NASA GSFC space radiation environment models.

Next, the geometry of the RTIMS module will be translated from the schematics and layouts of the RTIMS module into geometry coding that the radiation transport code (NOVICE) can understand. If the calculated doses are less than the sensitivities for all of the components, then the RTIMS module should survive greater than 100 Krads(Si) for a GEO mission (for solar active and inactive periods).

6. Summary

The objective of the RTIMS project is to develop and demonstrate an in-flight reconfigurable radiation tolerant stacked memory array based on state-of-the-art chip stacking, radiation shielding and radiation mitigation technologies. Upon completion of the environmental testing this objective is expected to be met.

RTIMS can be a complete computing module. Computing cores can be compiled into the on-board FPGA. The module can then support a distributed, reconfigurable computing architecture.

RTIMS is also suitable for high reliability computing at nuclear facilities. Automated or remote controlled nuclear waste handlers often have significant computing and/or data collection tasks. The radiation tolerance of RTIMS makes it an excellent fit for these applications.

The radiation tolerant RTIMS modules are also suitable for data collection and/or computing applications on nuclear powered craft (aircraft carriers, submarines, and future nuclear powered spacecraft).

First and foremost, however, the RTIMS modules are designed for high performance space-based computing applications, including real-time data processing, reconfigurable computing, and memory intensive space-based systems. The RTIMS technology enables new measurements and information products, increases the accessibility and utility of data, and reduces the risk, cost, size, and development time of space-based systems.

[1] Plante, Jeannette, Evaluation of 3D Plus Test Structures, Dynamic Range Corporation and NASA, pp 13, 12/12/2001.
[2] TMRTool User Guide, Xilinx, Inc., 5/30/2005.
[3] Virtex-II Platform FPGA User Guide, Xilinx, Inc., pp 314-336, 3/2005.
