(12) Patent Application Publication (10) Pub. No.: US 2010/0293438 A1 Lastras-Montano Et Al

US 2010O293438A1 (19) United States (12) Patent Application Publication (10) Pub. No.: US 2010/0293438 A1 Lastras-Montano et al. (43) Pub. Date: Nov. 18, 2010 (54) SYSTEM TO IMPROVE ERROR (73) Assignee: INTERNATIONAL BUSINESS CORRECTION USING VARIABLE LATENCY MACHINES CORPORATION, AND ASSOCATED METHODS Armonk, NY (US) (21) Appl. No.: 12/023.546 (75) Inventors: Luis A. Lastras-Montano, 1-1. Cortlandt Manor, NY (US); Piyush (22) Filed: Jan. 31, 2008 C. Patel, Cary, NC (US); Eric E. Publication Classification Retter, Austin, TX (US); Barry M. (51) Int. Cl. Trager, Yorktown Heights, NY H03M, 3/05 (2006.01) S. Mits; Snmuel WAEaWinograd, Cary, G06F II/It (2006.01) Scarsdale, NY (US); Kenneth L. (52) U.S. Cl. .................. 714/763; 714/E11.034; 714/785 Wright, Austin, TX (US) (57) ABSTRACT A system to improve error correction may include a fast Correspondence Address: decoder to process data packets until the fast decoder finds an IBM CORPORATION uncorrectable error in a data packet at which point a request ROCHESTER PLAW DEPT.917 for at least two data packets is generated. The system may also 3605 HIGHWAY 52 NORTH include a slow decoder to possibly correct the uncorrectable ROCHESTER, MN 55901-7829 (US) error in a data packet based upon the at least two data packets. 812 CHANNEL BUFFER FAST DECODER MARKING INFO SCORE 802- NEW ERROR CE, NCSE MARKING MARKING INFO LOCATION AND ENSUEFA INFO PREPROCESSING 804 MAGNITUDE CALCULATION MODIFIED COMPUTATION so 806 SYNDROME Eir ERROR 72B SYNDROME COMPUTATION MAGNITUDE CEON GENERATION COMPUTATION OPTIONAL FOR KNOWN SYNDROME LOCATIONS 811 BYPASS 807 SYNDROME BYPASS SELECT Patent Application Publication Nov. 18, 2010 Sheet 1 of 12 US 2010/0293438A1 INTEGRATED PROCESSOR CHIP 106 NARROW HIGH SPEED LINKS (4-6x DRAM DATA RATE) 104 RANK O 103 DIMM's 106 ------- ----------- - ---------- !------ TTTTT ------ EHUB-4------,DIMMS ----ii-Litijii TTTTTT FIG.1 Patent Application Publication Nov. 18, 2010 Sheet 2 of 12 US 2010/0293438A1 O - Patent Application Publication Nov. 18, 2010 Sheet 3 of 12 US 2010/0293438A1 997 ZZ 0 Patent Application Publication Nov. 18, 2010 Sheet 4 of 12 US 2010/0293438A1 “s PROCESSING DATA PACKETS UNTIL. A FAST DECODER FINDS AN UNCORRECTABLE ERROR DATA PACKET AT WHICH POINT A REQUEST FOR AT LEAST TWO ERROR PACKETS IS GENERATED POSSIBLY CORRECTING THE UNCORRECTABLE ERROR DATA PACKET VIA A SLOW DECODER USING THE AT LEAST TWO ERROR PACKETS FIG.4 305 SYMBOL OF THE ERROR CONTROL CODE 303 304 TIME H TIME / -eI/O PINS I/O-e PINS A PIN FAILURE AFFECTS ONLY ONE SYMBOL x8 X4 DRAM DRAM 301 FIG.5 JO2 Patent Application Publication Nov. 18, 2010 Sheet 5 of 12 US 2010/0293438A1 I L | | | |TTTTTTTTT |-–————————————————————————————————————————————————————————————— •||TTTTTTTTT'||TTTTTTTTT'|| f––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––– Patent Application Publication Nov. 18, 2010 Sheet 6 of 12 US 2010/0293438A1 |------Z?G––––––––––––––––––––––––––––––-?----------------------–1 TTTTTTTTT||||TTTTTTTTT |TTTTTTTTT'||TTTTTTTTT'|| |[[[[[[[[[]]|[TTTTTTTT [[[[[[[[[]|[TTTTTTTT |-—————————————————————————————————————————————————————————————— |TTTTTTTTT'|||TTTTTTTTTTTTTTTTTTT) – Patent Application Publication Nov. 18, 2010 Sheet 7 of 12 US 2010/0293438A1 Patent Application Publication Nov. 18, 2010 Sheet 8 of 12 US 2010/0293438A1 E18M/OWEJ SISEÕÕE}} ~~)WIWO /0/ (NWWWOO Patent Application Publication Nov. 18, 2010 Sheet 9 of 12 US 2010/0293438A1 EG9 }}0}}}|E.MAE'N 705 |OETES Z19 Patent Application Publication Nov. 18, 2010 Sheet 10 of 12 US 2010/0293438A1 S|SHTYÖEHONIGINEd}}}|B}} (IESE?SWHOTV) Patent Application Publication Nov. 18, 2010 Sheet 11 of 12 US 2010/0293438A1 EO ESON EO IESON E[]|SW EG9 |SSWdÅ8EW0}}0]NÅS ‘HEGOOGOMOTSSECINTONI)ECOWAHLENNIHEGOORG r––––––––!!!----------------------------T Patent Application Publication US 2010/0293438A1 US 2010/0293438 A1 Nov. 18, 2010 SYSTEM TO IMPROVE ERROR 0007 DIMMs are often designed with dynamic memory CORRECTION USING VARIABLE LATENCY chips or DRAMs that need to be regularly refreshed to prevent AND ASSOCATED METHODS the data stored within them from being lost. Originally, DRAM chips were asynchronous devices, however contem RELATED APPLICATIONS porary chips, synchronous DRAM (SDRAM) (e.g. single 0001. This application contains subject matter related to data rate or “SDR', double data rate or “DDR, DDR2, the following co-pending applications entitled "System for DDR3, etc) have synchronous interfaces to improve perfor Error Decoding with Retries and Associated Methods” and mance. DDR devices are available that use pre-fetching along having an attorney docket number of POU920080028US1, with other speed enhancements to improve memory band “System to Improve Memory Reliability and Associated width and to reduce latency. DDR3, for example, has a stan Methods” and having an attorney docket number of dard burst length of8, where the term burst length refers to the POU92008.0029US1, “System for Error Control Coding for number of DRAM transfers in which information is conveyed Memories of Different Types and Associated Methods” and from or to the DRAM during a read or write. Another impor having an attorney docket number of POU920080030US1, tant parameter of DRAM devices is the number of I/O pins “System to Improve Error Code Decoding Using Historical that it has to convey read/write data. When a DRAM device Information and Associated Methods and having an attorney has 4 pins, it is said that it is a “by 4' (or x4) device. When it docket number of POU920080031US1, “System to Improve has 8 pins, it is said that it is a “by 8 (or x8) device, and so on. Memory Failure Management and Associated Methods” and 0008 Memory device densities have continued to grow as having an attorney docket number of POU920080032US1, computer systems have become more powerful. Currently it and “System to Improve Miscorrection Rates in Error Control is not uncommon to have the RAM content of a single com Code Through Buffering and Associated Methods” and hav puter be composed of hundreds of trillions of bits. Unfortu ing an attorney docket number of POU920080033US1, the nately, the failure of just a portion of a single RAM device can entire subject matters of which are incorporated herein by cause the entire computer system to fail. When memory errors reference in their entirety. The aforementioned applications occur, which may be “hard' (repeating) or “soft' (one-time or are assigned to the same assignee as this application, Inter intermittent) failures, these failures may occur as single cell, national Business Machines Corporation of Armonk, New multi-bit, full chip or full DIMM failures and all or part of the York. system RAM may be unusable until it is repaired. Repair turn-around-times can be hours or even days, which can have GOVERNMENT LICENSE RIGHTS a substantial impact to a business dependent on the computer systems. 0002 This invention was made with Government support 0009. The probability of encountering a RAM failure dur under Agreement No. HR0011-07-9-0002 awarded by ing normal operations has continued to increase as the DARPA. The Government has certain rights in the invention. amount of memory storage in contemporary computers con FIELD OF THE INVENTION tinues to grow. 0010 Techniques to detect and correct bit errors have 0003. The invention relates to the field of computer sys evolved into an elaborate science over the past several tems, and, more particularly, to error correction systems and decades. Perhaps the most basic detection technique is the related methods. generation of odd or even parity where the number of 1's or O's in a data word are “exclusive or-ed' (XOR-ed) together to BACKGROUND OF THE INVENTION produce a parity bit. For example, a data word with an even 0004. This invention relates generally to computer number of 1's will have a parity bit of 0 and a data word with memory, and more particularly to providing a high fault tol an odd number of 1's will have aparity bit of 1, with this parity erant memory system. bit data appended to the stored memory data. If there is a 0005 Computer systems often require a considerable single error present in the data word during a read operation, amount of high speed RAM (random access memory) to hold it can be detected by regenerating parity from the data and information Such as operating system Software, programs and then checking to see that it matches the stored (originally other data while a computer is powered on and operational. generated) parity. This information is normally binary, composed of patterns of 0011 More sophisticated codes allow for detection and 1's and 0's known as bits of data. The bits of data are often correction of errors that can affect groups of bits rather than grouped and organized at a higher level. A byte, for example, individual bits; Reed-Solomon codes are an example of a is typically composed of 8 bits; more generally these groups class of powerful and well understood codes that can be used are called symbols and may consist on any number of bits. for these types of applications. 0006 Computer RAM is often designed with pluggable 0012. These error detection and error correction tech Subsystems, often in the form of modules, so that incremental niques are commonly used to restore data to its original/ amounts of RAM can be added to each computer, dictated by correct form in noisy communication transmission media or the specific memory requirements for each system and appli for storage media where there is a finite probability of data cation. The acronym, “DIMM refers to dual in-line memory errors due to the physical characteristics of the device. The modules, which are perhaps the most prevalent memory mod memory devices generally store data as Voltage levels repre ule currently in use. A DIMM is a thin rectangular card senting a 1 or a 0 in RAM and are subject to both device failure comprising one or more memory devices, and may also and state changes due to high energy cosmic rays and alpha include one or more of registers, buffers, hub devices, and/or particles.

Load more