US008732386B2

(12) United States Patent: O'Krafka et al.
(10) Patent No.: US 8,732,386 B2
(45) Date of Patent: May 20, 2014

(54) SHARING DATA FABRIC FOR COHERENT-DISTRIBUTED CACHING OF MULTI-NODE SHARED-DISTRIBUTED FLASH MEMORY

(75) Inventors: Brian Walter O'Krafka, Austin, CA (US); Michael John Koster, Bridgeville, CA (US); Darpan Dinker, Union City, CA (US); Earl T. Cohen, Oakland, CA (US); Thomas M. McWilliams, Oakland, CA (US)

(73) Assignee: Sandisk Enterprise IP LLC, Milpitas, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 929 days.

(21) Appl. No.: 12/197,899

(22) Filed: Aug. 25, 2008

(65) Prior Publication Data: US 2009/0240869 A1, Sep. 24, 2009

Related U.S. Application Data

(63) Continuation-in-part of application No. 12/130,661, filed on May 30, 2008, now Pat. No. 7,975,109.

(60) Provisional application No. 61/038,336, filed on Mar. 20, 2008.

(51) Int. Cl.: G06F 12/00 (2006.01)

(52) U.S. Cl.: USPC 711/103; 711/148; 711/E12.008

(58) Field of Classification Search: USPC 711/103, 148, E12.008. See application file for complete search history.

Primary Examiner: Larry Mackall

(74) Attorney, Agent, or Firm: Morgan, Lewis & Bockius LLP

(57) ABSTRACT

A Sharing Data Fabric (SDF) causes flash memory attached to multiple compute nodes to appear to be a single large memory space that is global yet shared by many applications running on the many compute nodes. Flash objects stored in flash memory of a home node are copied to an object cache in DRAM at an action node by SDF threads executing on the nodes. The home node has a flash object map locating flash objects in the home node's flash memory, and a global cache directory that locates copies of the object in other sharing nodes. Application programs use an applications-programming interface (API) into the SDF to transparently get and put objects without regard to the object's location on any of the many compute nodes. SDF threads and tables control coherency of objects in flash and DRAM.

24 Claims, 15 Drawing Sheets
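The abstract refers to two per-node structures: a flash object map that locates flash objects in the home node's flash memory, and a global cache directory that locates copies of the object at other sharing nodes. A minimal C sketch of such tables follows; the field names, widths, and the fixed sharer limit are assumptions made for illustration and are not taken from the patent.

```c
/* Minimal sketch of the two per-object tables named in the abstract:
 * the home node's flash object map and its global cache directory.
 * All field names, widths, and the sharer limit are illustrative
 * assumptions, not the patent's actual data layout.
 */
#include <stdbool.h>
#include <stdint.h>

#define MAX_SHARING_NODES 64           /* assumed per-object sharer limit */

/* Flash object map entry: where an object lives in the home node's flash. */
typedef struct {
    uint64_t object_key;               /* global object identifier        */
    uint64_t flash_address;            /* offset of the object in flash   */
    uint32_t object_length;            /* size of the object in bytes     */
} flash_object_map_entry;

/* Global cache directory entry: which other nodes hold DRAM copies. */
typedef struct {
    uint64_t object_key;
    uint16_t sharing_nodes[MAX_SHARING_NODES];
    uint16_t num_sharers;
    bool     modified_remotely;        /* a sharer's copy is newer than flash */
} global_cache_dir_entry;
```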

[Representative figure: GET MISS FROM HOME FLASH; legible labels include "4: LOCATE OBJ IN FLASH MAP" and "7: PUT FLASH OBJ IN OBJ CACHE".]


[Drawing sheet 1: FIG. 1 (prior art), bottleneck problem; FIG. 2 (prior art), coherency problem, power.]

[Drawing sheet 2: FIG. 3, multiple CPUs (servers) with DRAM caches 22 connected through the Sharing Data Fabric (SDF) to flash memories forming global, shared flash memory 26.]

[Drawing sheet 3: FIG. 4, hardware node with CPU 18, DRAM 22, PCIe switch 30, flash modules 34, and NIC 36 connecting over Ethernet or InfiniBand to other nodes.]

[Drawing sheet 4: FIG. 5, layer diagram:
Application program(s) 122 (networked databases, distributed apps, networked services 118, transactional caching)
Sharing data fabric services (API's) 116 (unified shared data access across nodes, attributes, policies, user-control)
Commodity compute nodes 114
Sharing data fabric 112 (transparent location, replication, consistency, migration, version logs, caching in DRAM)
Interconnect 110 (PCIe switches, GB Ethernet)
Flash management 108 (wear-leveling, caching, logs, write buffer)
Flash interface 106
Flash controllers 104
Flash memory 102]

[Drawing sheet 5: FIG. 6, transaction diagram: application program(s) call sharing data fabric services (API's) 116 such as SDF_GET, SDF_PUT, SDF_LOCK, SDF_UNLOCK, SDF_START, SDF_ABORT, SDF_COMMIT, SDF_SAVEPOINT, SDF_CREATE(), SDF_OPEN(), SDF_CLOSE(), and SDF_DELETE(); sharing data fabric 112 maintains a lock table, node map, sharing directory, and cache map, and uses get(), put(), lock(), unlock(), start(), abort(), send(), receive(), read(), and write() operations through flash management 108, flash interface 106, and network interface 120.]

[Drawing sheets 6 through 15: FIGS. 7-16.]
SHARING DATA FABRIC FOR COHERENT-DISTRIBUTED CACHING OF MULTI-NODE SHARED-DISTRIBUTED FLASH MEMORY

RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 61/038,336 filed Mar. 20, 2008. This application is a Continuation-In-Part (CIP) of the co-pending U.S. application for "System Including a Fine-Grained Memory and a Less-Fine-Grained Memory", U.S. Ser. No. 12/130,661, filed May 30, 2008, and the co-pending PCT application for "System Including a Less-Fine-Grained Memory and a Fine-Grained Memory with a Write Buffer for the Less-Fine-Grained Memory", U.S. Ser. No. PCT/US08/65167, filed May 29, 2008, hereby incorporated by reference.

FIELD OF THE INVENTION

This invention relates to shared multi-node storage systems, and more particularly to coherent caching of objects in a shared, global flash memory.

BACKGROUND OF THE INVENTION

Demand for computer disk storage has increased sharply in the last decade. Computer hard-disk technology and the resulting storage densities have grown rapidly. Despite application-program bloat, a substantial increase in web sites and their storage requirements, and wide use of large multimedia files, disk-drive storage densities have been able to keep up.

Disk performance, however, has not been able to keep up. Access time and rotational speed of disks, key performance parameters in many applications, have only improved incrementally in the last 10 years.

Web sites on the Internet may store vast amounts of data, and large web server farms may host many web sites. Storage Area Networks (SANs) are widely used as a centralized data store. Another widespread storage technology is Network Attached Storage (NAS). These disk-based technologies are now widely deployed but consume substantial amounts of power and can become a central-resource bottleneck. The recent rise in energy costs makes further expansion of these disk-based server farms undesirable. Newer, lower-power technologies are desirable.

FIG. 1 highlights a prior-art bottleneck problem with a distributed web-based database server. A large number of users access data in database 16 through servers 12 on web 10. Web 10 can be the Internet, a local Intranet, or other network. As the number of users accessing database 16 increases, additional servers 12 may be added to handle the increased workload. However, database 16 is accessible only through database server 14. The many requests to read or write data in database 16 must funnel through database server 14, creating a bottleneck that can limit performance.

FIG. 2 highlights a coherency problem when a database is replicated to reduce bottlenecks. Replicating database 16 by creating a second database 16' that is accessible through second database server 14' can reduce the bottleneck problem by servicing read queries. However, a new coherency problem is created with any updates to the database. One user may write a data record on database 16, while a second user reads a copy of that same record on second database 16'. Does the second user read the old record or the new record? How does the copy of the record on second database 16' get updated? Complex distributed database software or a sophisticated scalable clustered hardware platform is needed to ensure coherency of replicated data accessible by multiple servers.

Adding second database 16' increases the power consumption, since a second set of disks must be rotated and cooled. Operating the motors to physically spin the hard disks and run fans and air conditioners to cool them requires a substantially large amount of power.

It has been estimated (by J. Koomey of Stanford University) that aggregate electricity use for servers doubled from 2000 to 2005, both in the U.S. and worldwide. Total power for servers and the required auxiliary infrastructure represented about 1.2% of total U.S. electricity consumption in 2005. As the Internet and its data storage requirements seem to increase exponentially, these power costs will ominously increase.

Flash memory has replaced floppy disks for personal data transport. Many small key-chain flash devices are available that can each store a few GB of data. Flash storage may also be used for data backup and some other specialized applications. Flash memory uses much less power than rotating hard disks, but the different interfacing requirements of flash have limited its use in large server farms. The slow write time of flash memory complicates the coherency problem of distributed databases.

What is desired is a large storage system that uses flash memory rather than hard disks to reduce power consumption. A flash memory system with many nodes that acts as a global yet shared address space is desirable. A global, shared flash memory spread across many nodes that can coherently share objects is desirable.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 highlights a prior-art bottleneck problem with a web-based database server.

FIG. 2 highlights a coherency problem when a database is replicated to reduce bottlenecks.

FIG. 3 shows a global, shared flash memory that appears to be a single global address space to multiple servers connected to a Sharing Data Fabric (SDF).

FIG. 4 shows a hardware node in a global, shared flash memory system.

FIG. 5 is a layer diagram of software and hardware layers in a flash memory system using a shared data fabric to enable global sharing of a distributed flash memory.

FIG. 6 is a transaction diagram of services and interfaces to a shared data fabric.

FIG. 7 shows permanent objects in flash memory being copied to DRAM caches on multiple nodes.

FIG. 8 shows an action node requesting an object from a home node that fetches a modified object on a sharing node using transaction tables and an object directory.

FIG. 9 is a storage flow model of an action node requesting an object from flash memory at a home node.

FIG. 10 is a storage flow model of an action node requesting an object from flash memory at a home node using an asynchronous SDF thread at the action node.

FIG. 11 is a snapshot state diagram of a compute node that can act as an action, home, or sharing node.

FIG. 12 shows a hit in the object cache of the action node.

FIG. 13 shows a get operation that misses in the object cache of the action node, and fetches the object from flash memory of the home node.

FIG. 14 shows a get operation that misses in the object cache of the action node, and fetches a modified copy of the object from a third-party node.

FIG. 15 shows a get operation that misses in the object cache of the action node, and directly fetches a modified copy of the object from a third-party node.

FIG. 16 shows a lock operation.

DETAILED DESCRIPTION

The present invention relates to an improvement in global, shared flash memory systems. The following description is presented to enable one of ordinary skill in the art to make and use the invention as provided in the context of a particular application and its requirements. Various modifications to the preferred embodiment will be apparent to those with skill in the art, and the general principles defined herein may be applied to other embodiments. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed.

The inventors have realized that power consumption can be dramatically reduced by replacing rotating hard disks with flash memory. The flash memory can be distributed across many physical nodes, and each node can have a processor that can process user requests and system-management threads. Dynamic-random-access memory (DRAM) on each of the physical nodes can cache data or objects that are normally stored in flash memory. Coherency among objects in flash and in DRAM can be ensured by a Sharing Data Fabric (SDF) middleware layer. SDF includes an interface for communications between high-level programs and lower-level hardware controllers and their software and firmware drivers. SDF is accessible by high-level application programs using an applications-programming interface (API). Communication between nodes to ensure coherency is performed by SDF threads.

The DRAM cache may hold copies of objects stored in the local node's flash memory, or copies of flash objects stored in another node's flash memory. Global caching is achieved by the SDF, which enables the local DRAM cache to store copies of objects from other nodes. Objects can reside anywhere in a shared, global address space. The SDF copies objects to DRAM caches on any node while ensuring consistency.

This distributed caching of flash is extremely useful since a process such as a web server running on one node's processor may access data stored on any of the nodes. The system can be scaled up by adding nodes. Normally, adding nodes slows a system down, since bottlenecks may occur to data stored in just one location on a remote node, such as shown in FIG. 1. However, using SDF, data or objects may be cached on one or more nodes, allowing multiple processors to access the same data. Coherency of the cached objects is important to prevent data corruption.
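The preceding paragraphs state that the SDF is accessible to high-level application programs through an applications-programming interface (API), and FIG. 6 names calls such as SDF_START, SDF_GET, SDF_PUT, and SDF_COMMIT. The C sketch below shows how an application thread might use such calls to read and update a shared object; the argument lists, types, and return conventions are assumptions for illustration, since this excerpt does not give the actual signatures.

```c
#include <stddef.h>

typedef unsigned long sdf_key_t;      /* assumed: global object identifier        */
typedef struct sdf_ctx sdf_ctx_t;     /* assumed: opaque per-thread SDF context   */

/* Assumed declarations; the call names appear in FIG. 6, the signatures do not. */
extern sdf_ctx_t *SDF_START(void);                                  /* begin a transaction   */
extern void      *SDF_GET(sdf_ctx_t *ctx, sdf_key_t key, size_t *len);
extern int        SDF_PUT(sdf_ctx_t *ctx, sdf_key_t key, const void *buf, size_t len);
extern int        SDF_COMMIT(sdf_ctx_t *ctx);                       /* make updates coherent */

int update_record(sdf_key_t key, const void *new_val, size_t new_len)
{
    sdf_ctx_t *ctx = SDF_START();
    size_t len;

    /* The object may be in the local DRAM object cache, in local flash, or in
     * a remote node's flash; the SDF resolves the location transparently. */
    void *obj = SDF_GET(ctx, key, &len);
    if (obj == NULL)
        return -1;

    /* Write the updated value back through the SDF so that copies cached at
     * other nodes remain coherent. */
    if (SDF_PUT(ctx, key, new_val, new_len) != 0)
        return -1;

    return SDF_COMMIT(ctx);
}
```

The point of the sketch is only that the application names objects by key and never by node or flash address, matching the transparency described above.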
FIG. 3 shows a global, shared flash memory that appears to be a single global address space to multiple servers connected to a sharing data fabric (SDF). Central Processing Units (CPUs) or processors 18, 18' can execute programs such as server applications to process requests that arrive over a network such as the Internet. Each of processors 18 has a cache of DRAM 22 that contains local copies of objects. These local copies in DRAM 22 are accessed by processors 18 in response to requests from external users.

While DRAM 22, 22' stores transient copies of objects, the objects are more permanently stored in flash memory 24, 24'. Objects remain in flash memory 24, 24' and are copied to caches in DRAM 22, 22' in response to access requests by programs running on processors 18, 18'.

Sharing data fabric (SDF) 20 is a middleware layer that includes SDF threads running on processors 18, 18', and APIs and tables of data. A physical interconnect such as an Ethernet or InfiniBand® fabric connects physical nodes together. Object copies are transferred across the physical interconnect by SDF 20 from flash memory 24, 24' to cache DRAM 22, 22', and among DRAM 22, 22' caches as needed to ensure coherency of object copies.

Flash memory 24, 24' can be physically located on many nodes, such as having one flash memory 24 for each processor 18, or in other arrangements. SDF 20 makes all the objects stored in flash memory 24, 24' appear to be stored in a global address space, even though the global address space is shared among many processors 18, 18'. Thus flash memory 24, 24' together appear to be one global, shared flash memory 26 via SDF 20.

FIG. 4 shows a hardware node in a global, shared flash memory system. A flash memory system has multiple nodes such as shown in FIG. 4. The multiple nodes are connected together by a high-speed interconnect such as an Ethernet or InfiniBand. One or more links in this high-speed interconnect connect to Network Interface Controller (NIC) 36 on the node shown in FIG. 4.

Processor 18 executes application programs, threads, and other routines and accesses a local memory that stores program code and data, such as DRAM 22. DRAM 22 also acts as a DRAM cache of objects in the global, shared flash memory.

Processor 18 also connects to switch 30. Switch 30 may be a PCI Express switch. Switch 30 allows processor 18 to communicate with other nodes through NIC 36 to send and receive object copies and coherency commands. Flash modules 34 contain arrays of flash memory that store permanent objects. Flash modules 34 are accessed by processor 18 through switch 30.
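According to the brief description of the drawings, FIG. 12 shows a hit in the action node's object cache, while FIGS. 13 and 14 show get operations that miss and fetch the object from the home node's flash or from a node holding a modified copy. The C sketch below outlines that lookup order; every helper function shown here is an assumption for illustration, not a routine disclosed in the patent.

```c
/* Illustrative sketch of a get at an action node, following the flow named in
 * FIGS. 12-14: hit in the local DRAM object cache, else ask the object's home
 * node, which returns the object from its flash (or has a sharing node supply
 * a modified copy first). All helpers below are assumed for illustration.
 */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t object_key_t;

/* Assumed helpers provided by the local node and the SDF messaging layer. */
extern void    *object_cache_lookup(object_key_t key, size_t *len);   /* local DRAM cache   */
extern uint16_t home_node_of(object_key_t key);                       /* key-to-home lookup */
extern void    *request_object_from_home(uint16_t home, object_key_t key, size_t *len);
extern void     object_cache_insert(object_key_t key, void *obj, size_t len);

void *sdf_get(object_key_t key, size_t *len)
{
    /* FIG. 12 case: the object is already cached in this node's DRAM. */
    void *obj = object_cache_lookup(key, len);
    if (obj != NULL)
        return obj;

    /* FIG. 13/14 case: miss. Send the request to the object's home node.
     * The home node reads the object from its flash, or, if another node
     * holds a modified copy, obtains that copy first, then replies. */
    uint16_t home = home_node_of(key);
    obj = request_object_from_home(home, key, len);
    if (obj != NULL)
        object_cache_insert(key, obj, *len);   /* cache the returned copy locally */

    return obj;
}
```

Keeping miss handling at the home node matches the document's description of per-object home nodes that hold the permanent flash copy and track sharing nodes.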
FIG. 5 is a layer diagram of software and hardware layers in a flash memory system using a shared data fabric to enable global sharing of a distributed flash memory. Sharing data fabric services 116 include API's that application programs 122 or networked services 118 can use to access objects and control attributes of the objects. Sharing data fabric services 116 are the API's that communicate with routines and threads in sharing data fabric 112 that provide a unified shared data access of objects that are permanently stored in flash memory 102, and may maintain cached copies in DRAM in compute nodes 114.

Compute nodes 114, such as node 100 shown in FIG. 4, have processors, DRAM caches of objects, and interconnect. These compute nodes may be constructed from commodity parts, such as commodity processors, interconnect switches and controllers, and DRAM memory modules.

Sharing data fabric services 116 allow application programs 122 and networked services 118 to control policies and attributes of objects by executing routines and launching threads of sharing data fabric 112 that are executed on compute nodes 114. The exact location of objects within flash memory 102 is transparent to application programs 122 and networked services 118, since sharing data fabric 112 copies objects from flash memory 102 to DRAM caches in compute nodes 114 and may obtain a copy from any location in flash memory 102 that has a valid copy of the object. Objects may be replicated to make back-up copies in flash memory 102.

Sharing data fabric 112 performs consistency and coherency operations such as flushing modified objects in a DRAM cache to copy back and update the permanent object in flash memory 102. Sharing data fabric 112 may also migrate flash
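The paragraph above describes the SDF flushing a modified object in a DRAM cache to copy back and update the permanent object in flash memory 102. A minimal C sketch of such a write-back is given below, assuming a simple cache-entry record and a messaging helper; none of these names come from the patent.

```c
/* Illustrative sketch of the flush operation described above: a modified
 * object in a node's DRAM cache is copied back so the permanent copy in
 * flash memory is updated. The cache-entry fields and the helper function
 * are assumptions for illustration only.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t key;        /* global object identifier                   */
    void    *data;       /* cached copy of the object in DRAM          */
    size_t   len;
    bool     modified;   /* true if the DRAM copy is newer than flash  */
    uint16_t home_node;  /* node whose flash holds the permanent object */
} cache_entry_t;

/* Assumed helper: send the object to its home node, which writes it into
 * flash and updates its flash object map and global cache directory. */
extern int send_writeback_to_home(uint16_t home_node, uint64_t key,
                                  const void *data, size_t len);

int sdf_flush(cache_entry_t *entry)
{
    if (!entry->modified)
        return 0;                        /* clean copy: nothing to write back */

    /* Copy the modified object back to the permanent copy in flash at the
     * home node; clear the modified flag only after the home node accepts
     * the update, keeping the DRAM cache and flash coherent. */
    if (send_writeback_to_home(entry->home_node, entry->key,
                               entry->data, entry->len) != 0)
        return -1;

    entry->modified = false;
    return 0;
}
```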