(12) United States Patent (10) Patent No.: US 8,171,377 B2 Dell Et Al
Total Page:16
File Type:pdf, Size:1020Kb
US008171377B2 (12) United States Patent (10) Patent No.: US 8,171,377 B2 Dell et al. (45) Date of Patent: May 1, 2012 (54) SYSTEM TO IMPROVE MEMORY 6.732,289 B1 5/2004 Talagala et al. ................... T14?6 RELIABILITY AND ASSOCATED METHODS 6.738,947 B1* 5/2004 Maeda ......... T14f785 6.851,086 B2 2/2005 Szymanski ........ T14f781 7,017,099 B2 3/2006 Micheloni et al. ............ 714/752 (75) Inventors: Timothy J. Dell, Colchester, VT (US); 7,047.478 B2 5/2006 Gregorietal. ... 714,773 Luis A. Lastras-Montano, Cortlandt 7,100,096 B2 8/2006 Webb, Jr. et al. ............... 714.52 Manor, NY (US); Barry M. Trager, 7,117,421 B1 10/2006 Danilak .. T14f763 Yorktown Heights, NY (US); Shmuel 7.356,753 B2 * 4/2008 Chen ............................. 714/755 Winograd, Scarsdale, NY (US) (Continued) (73) Assignee: International Business Machines FOREIGN PATENT DOCUMENTS Corporation, Armonk, NY (US) JP 100911394 4f1998 WO WO2005062478 A1 7/2005 (*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 OTHER PUBLICATIONS U.S.C. 154(b) by 1127 days. IEEE (Signal Processing Systems Design and Implementation, SIPS 06), Wei Liu et al., “Low-Power High-Throughput BCH Error Cor (21) Appl. No.: 12/023,374 rection VLSI Design for Multi-Level Cell NAND Flash Memories”, pp. 303-308, Oct. 2006. (22) Filed: Jan. 31, 2008 (Continued) (65) Prior Publication Data Primary Examiner — Fritz Alphonse US 2010/0287.445A1 Nov. 11, 2010 (74) Attorney, Agent, or Firm — Ido Tuchman (51) Int. Cl. GI IC 29/00 (2006.01) (57) ABSTRACT (52) U.S. Cl. ......................... 714/763; 714/756: 714/785 A system to improve memory reliability in computer systems (58) Field of Classification Search .................. 714/763, that may include memory chips, and may rely on a error 714/756, 762, 755, 785 control encoder to send codeword symbols for storage in each See application file for complete search history. of the memory chips. At least two symbols from a codeword are assigned to each memory chip and therefore failure of any (56) References Cited of the memory chips could affect two symbols or more. The system may also include a table to record failures and partial U.S. PATENT DOCUMENTS failures of the codeword symbols for each of the memory 3,665,396 A 5/1972 Forney, Jr. .................... 714,789 chips so the error control encoder can correct Subsequent 4,742,519 A 5/1988 Abe et al. ...................... 714/755 partial failures based upon the previous partial failures. The 5,457,701 A 10, 1995 Wasilewski et al. .......... 714,776 error control coder is capable of correcting and/or detecting 5,870,406 A 2, 1999 Ramesh et al. ............... 714,709 6,018,817 A 1/2000 Chen et al. .................... T14f762 more errors if only a fraction of a chip is noted in the table as 6,044,483 A 3/2000 Chen et al. .................... T14f762 having a failure as opposed to a full chip noted as having a 6,345,375 B1 2, 2002 Kelkar et al. .... 714,748 failure. 6,487,680 B1 11/2002 Skazinski et al. ... 714, 23 6,539,513 B1 3/2003 Chen ............................. T14f788 19 Claims, 12 Drawing Sheets 10 N 24 14 MEMORY CHIPS ERROR CONTROL ENCODER INPUT PNS REED-SOLOMON CODE OUTPUT PNS COMMUNICATIONS NETWORK ERROR (CODEWORD SYMBOLS, CONTROL ERRORS, ERRORS DECODER SOLATED TO INPUT AND/OR OUTPUT PIN, AND THRESHOLDS) TABLE (FAILURES,PARTIAL FAILURES,SUBSEQUENT PARTIAL FAILURES, AND NEW FAILURES) US 8,171,377 B2 Page 2 U.S. PATENT DOCUMENTS OTHER PUBLICATIONS 2003.01.10330 A1 6/2003 Fujie et al. ............. T10.36 IBM, Bond et al., "Error Correction and Detection Circuit for Mul 2003/O135782 A1 7/2003 Matsunami et al. 714/5 tiple Memory Types', pp. 376-377. Technical Disclosure Bulletin 2004/O153844 A1 8, 2004 Ghose et al. r72 Apr. 1988 v30 n11. 2004/O25O196 Al 12/2004 Hassner et al 714,788 IBM, Aichelmann, "Paging from Multiple Bit Array Without Distrib 2006, OO29065 A1 2, 2006 Fellman ................. 370,389 uted Buffering”, pp. 485-488, Technical Disclosure Bulletin, Jun. 2006/O104279 A1 5/2006 Fellman et al. ........ 370/392 1981 v24 n1B. 2006/01 17211 A1 6/2006 Matsunami et al. 2006/0233,199 Al 10/2006 Harriman ............... 370/469 * cited by examiner U.S. Patent May 1, 2012 Sheet 1 of 12 US 8,171,377 B2 INTEGRATED PROCESSOR CHIP 106 NARROW HIGH SPEED LINKS (4-6x DRAM ATA RATE) 104 RANK O 103 DIMM's 106 F----------------- ------- ----------- - ------ EHUB-4------------|---------- if ------ DIMMS -----Liliili - FIG.1 U.S. Patent May 1, 2012 Sheet 4 of 12 US 8,171,377 B2 “s SENDING CODEWORD SYMBOLS FOR STORACE IN EACH OF A PLURALITY OF MEMORY CHIPS WHERE FAILURE OF ANY OF THE PLURALITY OF MEMORY CHIPS AFFECTS AT LEAST TWO CODEWORD SYMBOLS RECORDING FAILURES AND PARTIAL FAILURES OF THE CODEWORD SYMBOLS FOR EACH OF THE PLURALITY OF MEMORY CHIPS SO SUBSEQUENT PARTIAL FAILURES CAN BE CORRECTED BASED UPON THE PREVIOUS PARTIAL FAILURES FIG.4 305 SYMBOL OF THE ERROR CONTROL CODE 304 TIME TIME / I/O-e- PINS A PIN FAILURE AFFECTS ONLY ONE SYMBOL x8 x4 DRAM DRAM 301 G.5 302 U.S. Patent May 1, 2012 Sheet 5 of 12 US 8,171,377 B2 L 07 ETTTTTTTT TTTTTTTTTTTTTTTTTT'|| r--------------------------------------------------------------- •|[[TIITLID)|IIIIIIIII|| r-—————————————————————————————————————————————————————————————— U.S. Patent May 1, 2012 Sheet 6 of 12 US 8,171,377 B2 [LIITTITT TTTTTTTTT TTTTTTTTT [[[[[[[[[ TTTTTTTTT TTTTTTTTT ---| [[TTTTTTT TTTTTTTTT TTTTTTTTT [[[[[[[[[ [[TTTTTTT TTTTTTTTT U.S. Patent May 1, 2012 Sheet 7 of 12 US 8,171,377 B2 TETTWAWCHNION||W}{ECHOSCH|HO7X92 U.S. Patent May 1, 2012 Sheet 8 of 12 US 8,171,377 B2 yol ~~~--WIWO U.S. Patent May 1, 2012 Sheet 9 of 12 US 8,171,377 B2 U.S. Patent May 1, 2012 Sheet 11 of 12 US 8,171,377 B2 EO ESON |SSWdÅ8EWO}}0]NÅS ‘HHCOOHOMOTSSECINTON)EGOWWHENNIM?CIOOEG r––––––––!-----------------|-----------T U.S. Patent May 1, 2012 Sheet 12 of 12 US 8,171,377 B2 CHOOTHOÀIWES 9x/yx |OETES BOJOW US 8,171,377 B2 1. 2 SYSTEM TO IMPROVE MEMORY synchronous DRAM (SDRAM) (e.g. single data rate or RELIABILITY AND ASSOCATED METHODS “SDR', double data rate or "DDR", DDR2, DDR3, etc) have synchronous interfaces to improve performance. DDR GOVERNMENT LICENSE RIGHTS devices are available that use pre-fetching along with other speed enhancements to improve memory bandwidth and to This invention was made with Government support under reduce latency. DDR3, for example, has a standard burst Agreement No. HR0011-07-9-0002 awarded by DARPA. length of 8, where the term burst length refers to the number The Government has certain rights in the invention. of DRAM transfers in which information is conveyed from or to the DRAM during a read or write. Another important RELATED APPLICATIONS 10 parameter of DRAM devices is the number of I/O pins that it has to convey read/write data. When a DRAM device has 4 This application contains subject matter related to the fol pins, it is said that it is a “by 4” (or x4) device. When it has 8 lowing co-pending applications entitled "System for Error pins, it is said that it is a “by 8 (or x8) device, and so on. Decoding with Retries and Associated Methods” and having Memory device densities have continued to grow as com an Ser. No. 12/023,356, “System for Error Control Coding for 15 puter systems have become more powerful. Currently it is not Memories of Different Types and Associated Methods” and uncommon to have the RAM content of a single computer be having an Ser. No. 12/023.408, "System to Improve Error composed of hundreds of trillions of bits. Unfortunately, the Code Decoding Using Historical Information and Associated failure of just a portion of a single RAM device can cause the Methods” and having an Ser. No. 12/023,445, "System to entire computer system to fail. When memory errors occur, Improve Memory Failure Management and Associated Meth which may be “hard' (repeating) or “soft' (one-time or inter ods” and having an Ser. No. 12,023,498, “System to Improve mittent) failures, these failures may occur as single cell, Miscorrection Rates in Error Control Code Through Buffer multi-bit, full chip or full DIMM failures and all or part of the ing and Associated Methods” and having an Ser. No. 12,023, system RAM may be unusable until it is repaired. Repair 516, and “System to Improve Error Correction Using Variable turn-around-times can be hours or even days, which can have Latency and Associated Methods” and having an Ser. No. 25 a Substantial impact to a business dependent on the computer 12,023.546, the entire subject matters of which are incorpo systems. rated herein by reference in their entirety. The aforemen The probability of encountering a RAM failure during tioned applications are assigned to the same assignee as this normal operations has continued to increase as the amount of application, International Business Machines Corporation of memory storage in contemporary computers continues to Armonk, N.Y. 30 grOW. Techniques to detect and correct bit errors have evolved FIELD OF THE INVENTION into an elaborate science over the past several decades. Per haps the most basic detection technique is the generation of The invention relates to the field of computer systems, and, odd or even parity where the number of 1s or 0's in a data more particularly, to memory reliability in computing sys 35 word are “exclusive or-ed' (XOR-ed) together to produce a tems and related methods.