
Algorithms and Data Structures for Flash Memories

ERAN GAL AND SIVAN TOLEDO
Tel-Aviv University

Flash memory is a type of electrically-erasable programmable read-only memory (EEPROM). Because flash memories are nonvolatile and relatively dense, they are now used to store files and other persistent objects in handheld computers, mobile phones, digital cameras, portable music players, and many other computer systems in which magnetic disks are inappropriate. Flash, like earlier EEPROM devices, suffers from two limitations. First, bits can only be cleared by erasing a large block of memory. Second, each block can only sustain a limited number of erasures, after which it can no longer reliably store data. Due to these limitations, sophisticated data structures and algorithms are required to effectively use flash memories. These algorithms and data structures support efficient not-in-place updates of data, reduce the number of erasures, and level the wear of the blocks in the device. This survey presents these algorithms and data structures, many of which have only been described in patents until now.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—Allocation/deallocation strategies; garbage collection; secondary storage; D.4.3 [Operating Systems]: File Systems Management—Access methods; file organization; E.1 [Data Structures]—Arrays; lists, stacks, and queues; trees; E.2 [Data Storage Representations]—Linked representations; E.5 [Files]—Organization/structure

General Terms: Algorithms, Performance, Reliability

Additional Key Words and Phrases: Flash memory, EEPROM memory, wear leveling

1. INTRODUCTION

Flash memory is a type of electrically-erasable programmable read-only memory (EEPROM). Flash memory is nonvolatile (retains its content without power), so it is used to store files and other persistent objects in workstations and servers (for the BIOS), handheld computers and mobile phones, digital cameras, and portable music players.

The read/write/erase behavior of flash memory is radically different from that of other programmable memories, such as volatile RAM and magnetic disks. Perhaps more importantly, memory cells in a flash device (as well as in other types of EEPROM memory) can be written to only a limited number of times, between 10,000 and 1,000,000, after which they wear out and become unreliable.

In fact, flash memories come in two flavors, NOR and NAND, that are also quite different from each other. In both types, write operations can only clear bits (change their value from 1 to 0).


The only way to set bits (change their value from 0 to 1) is to erase an entire region of memory. These regions have fixed size in a given device, typically ranging from several kilobytes to hundreds of kilobytes, and are called erase units. NOR flash, the older type, is a random-access device that is directly addressable by the processor. Each bit in a NOR flash can be individually cleared once per erase cycle of the erase unit containing it. NOR devices suffer from high erase times. NAND flash, the newer type, enjoys much faster erase times, but it is not directly addressable (it is accessed by issuing commands to a controller), access is by page (a fraction of an erase unit, typically 512 bytes), not by bit or byte, and each page can be modified only a small number of times in each erase cycle. That is, after a few writes to a page, subsequent writes cannot reliably clear additional bits in the page; the entire erase unit must be erased before further modifications of the page are possible [Woodhouse 2001].

Because of these peculiarities, storage-management techniques that were designed for other types of memory devices, such as magnetic disks, are not always appropriate for flash. To address these issues, flash-specific storage techniques have been developed with the widespread introduction of flash memories in the early 1990s. Some of these techniques were invented specifically for flash memories, but many have been adapted from techniques that were originally invented for other storage devices. This article surveys the data structures and algorithms that have been developed for management of flash storage.

The article covers techniques that have been described in the open literature, including patents. We only cover US patents, mostly because we assume that US patents are a superset of those of other countries. To cope with the large number of flash-related patents, we used the following strategy to find relevant patents. We examined all the patents whose titles contain the words flash, file (or filing), and system, as well as all the patents whose titles contain the words wear and leveling. In addition, we also examined other patents assigned to two companies that specialize in flash-management products, M-Systems and SanDisk (formerly SunDisk). Finally, we also examined patents that were cited or mentioned in other relevant materials, both patents and Web sites.

We believe that this methodology led us to most of the relevant materials. But this does not imply, of course, that the article covers all the techniques that have been invented. The techniques that are used in some flash-management products have remained trade secrets; some are alluded to in corporate literature, but are not fully described. Obviously, this article does not cover such techniques.

The rest of this survey consists of three sections, each of which describes the mapping of one category of abstract data structures onto flash memories. The next section discusses flash data structures that store an array of fixed- or variable-length blocks. Such data structures typically emulate magnetic disks, where each block in the array represents one disk sector. Even these simple data structures pose many flash-specific challenges, such as wear leveling and efficient reclamation. These challenges and techniques to address them are discussed in detail in Section 2, and in less detail in later sections. The section that follows, Section 3, describes flash-specific file systems. A file system is a data structure that represents a collection of mutable random-access files in a hierarchical name space. Section 4 describes three additional classes of flash data structures: application-specific data structures (mainly search trees), data structures for storing machine code, and a mechanism to use flash as a main-memory replacement. Section 5 summarizes the survey.

2. BLOCK-MAPPING TECHNIQUES

One approach to using flash memory is to treat it as a block device that allows fixed-size data blocks to be read and written, much like disk sectors. This allows standard file systems designed for magnetic disks, such as FAT, to utilize flash devices.
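To make this setup concrete before we examine its problems, the following C sketch shows the kind of narrow interface such a driver might expose. The names, types, and block size are our own illustration, not taken from any particular driver.

#include <stdint.h>

/* Hypothetical interface of a flash block-device driver.  The file
   system above it sees only fixed-size virtual blocks; the driver
   hides mapping, reclamation, and wear leveling behind these calls. */
#define BLOCK_SIZE 512  /* bytes per virtual block, like a disk sector */

typedef uint32_t vblock_t;  /* virtual block number used by the host */

/* Read the current contents of a virtual block into buf. */
int flash_bdev_read(vblock_t block, uint8_t buf[BLOCK_SIZE]);

/* Rewrite a virtual block.  Internally this writes the new data to a
   fresh physical sector and remaps the block; it never overwrites the
   old sector in place. */
int flash_bdev_write(vblock_t block, const uint8_t buf[BLOCK_SIZE]);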


In this setup, the file system code calls a device driver, requesting block read or write operations. The device driver stores and retrieves blocks from the flash device. (Some removable flash devices, like CompactFlash, even incorporate a complete ATA disk interface, so they can actually be used through the standard disk driver.)

However, mapping the blocks onto flash addresses in a simple linear fashion presents two problems. First, some data blocks may be written to much more than others. This presents no problem for magnetic disks, so conventional file systems do not attempt to avoid such situations. But when the file system is mapped onto a flash device, frequently-used erase units wear out quickly, slowing down access times and eventually burning out. This problem can be addressed by using a more sophisticated block-to-flash mapping scheme and by moving blocks around. Techniques that implement such strategies are called wear-leveling techniques.

The second problem that the identity mapping poses is the inability to write data blocks smaller than a flash erase unit. Suppose that the data blocks that the file system uses are 4KB each, and that flash erase units are 128KB each. If 4KB blocks are mapped to flash addresses using the identity mapping, writing a 4KB block requires copying a 128KB flash erase unit to RAM, overwriting the appropriate 4KB region, erasing the flash erase unit, and rewriting it from RAM. Furthermore, if power is lost before the entire flash erase unit is rewritten to the device, 128KB of data are lost; in a magnetic disk, only the 4KB data block would be lost. It turns out that the wear-leveling technique automatically addresses this issue as well.

2.1. The Block-Mapping Idea

The basic idea behind all the wear-leveling techniques is to map the block number presented by the host, called a virtual block number, to a physical flash address, called a sector. (Some authors and vendors use a slightly different terminology.) When a virtual block needs to be rewritten, the new data does not overwrite the sector where the block is currently stored. Instead, the new data is written to another sector, and the virtual-block-to-sector map is updated.

Typically, sectors have a fixed size and occupy a fraction of an erase unit. In NAND devices, sectors usually occupy one flash page. But in NOR devices, it is also possible to use variable-length sectors.

This mapping serves several purposes:

—First, writing frequently-modified blocks to a different sector in every modification evens out the wear of different erase units.
—Second, the mapping allows writing a single block to flash without erasing and rewriting an entire erase unit [Assar et al. 1995a, 1995b, 1996].
—Third, the mapping allows block writes to be implemented atomically, so that, if power is lost during a write operation, the block reverts to its prewrite state when flash is used again.

Atomicity is achieved using the following technique. Each sector is associated with a small header, which may be adjacent to the sector or elsewhere in the erase unit. When a block is to be written, the software searches for a free and erased sector. In this state, all the bits in both the sector and its header are 1. Then a free/used bit in the header of the sector is cleared, to mark that the sector is no longer free. Then the virtual block number is written to its header, and the new data is written to the chosen sector. Next, the prevalid/valid bit in the header is cleared to mark the sector is ready for reading. Finally, the valid/obsolete bit in the header of the old sector is cleared, to mark that it no longer contains the most recent copy of the virtual block.

In some cases, it is possible to optimize this procedure, for example, by combining the free/used bit with the virtual block number: if the virtual block number is all 1s, then the sector is still free, otherwise it is in use.
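The following C sketch walks through the five-step write protocol just described. The header layout and all helper functions are hypothetical; a real driver would also handle error returns and wear-aware sector allocation.

#include <stdint.h>

/* A sketch of the atomic block-write protocol described above.
   Each header bit starts at 1 (erased) and can only be cleared. */
enum { BIT_USED, BIT_VALID, BIT_OBSOLETE };   /* header flag bits */

extern int  find_free_sector(void);                /* header still all 1s */
extern void clear_header_bit(int sector, int bit); /* program one 0 bit   */
extern void write_header_vblock(int sector, uint32_t vblock);
extern void program_sector_data(int sector, const void *data);

void rewrite_block(uint32_t vblock, const void *data, int old_sector)
{
    int s = find_free_sector();
    clear_header_bit(s, BIT_USED);        /* 1. sector is no longer free  */
    write_header_vblock(s, vblock);       /* 2. record the block identity */
    program_sector_data(s, data);         /* 3. write the new contents    */
    clear_header_bit(s, BIT_VALID);       /* 4. new copy becomes valid    */
    if (old_sector >= 0)                  /* first write has no old copy  */
        clear_header_bit(old_sector, BIT_OBSOLETE); /* 5. retire old copy */
}
/* If power fails between steps 4 and 5, two valid copies exist; a
   two-bit version number (modulo 4) can break the tie on recovery. */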

If power is lost during a write operation, the flash may be in two possible states with respect to the modified block. If power was lost before the new sector was marked valid, its contents are ignored when the flash is next used, and its valid/obsolete bit can be set to mark it ready for erasure. If power was lost after the new sector was marked valid but before the old one was marked obsolete, both copies are legitimate (indicating two possible serializations of the failure and write events), and the system can choose either one and mark the other obsolete. If choosing the most recent version is important, a two-bit version number can indicate which one is more recent. Since there can be at most two valid versions with consecutive version numbers modulo 4, 1 is newer than 0, 2 than 1, 3 than 2, and 0 is newer than 3 [Aleph One 2002].

2.2. Data Structures for Mapping

How does the system find the sector that contains a given block? Fundamentally, there are two kinds of data structures that represent such mappings. Direct maps are essentially arrays that store in the ith location the index of the sector that currently contains block i. Inverse maps store in the ith location the identity of the block stored in the ith sector. In other words, direct maps allow efficient mapping of blocks to sectors, and inverse maps allow efficient mapping of sectors to blocks. In some cases, direct maps are not simple arrays, but more complex data structures. But a direct map, whether implemented as an array or not, always allows efficient mapping of blocks to sectors. Inverse maps are almost always arrays, although they may not be contiguous in physical memory.

Inverse maps are stored on the flash device itself. When a block is written to a sector, the identity of the block is also written. The block's identity is always written in the same erase unit as the block itself, so that they are erased together. The block's identity may be stored in a header immediately preceding the data, or it may be written to some other area in the unit, often a sector of block numbers. The main use of the inverse map is to reconstruct a direct map during device initialization (when the flash device is inserted into a system or when the system boots).

Direct maps are stored at least partially in RAM, which is volatile. The reason that direct maps are stored in RAM is that, by definition, they support fast lookups. This implies that when a block is rewritten and moved from one sector to another, a fixed lookup location must be updated. Flash does not support this kind of in-place modification.

To summarize, the indirect map on the flash device itself ensures that sectors can always be associated with the blocks that they contain. The direct map, which is stored in RAM, allows the system to quickly find the sector that contains a given block. These block-mapping data structures are illustrated in Figure 1.

A direct map is not absolutely necessary. The system can search sequentially through the indirect map to find a valid sector containing a requested block. This is slow, but efficient in terms of RAM usage. By only allowing each block to be stored on a small number of sectors, searching can be performed much faster (perhaps through the use of hardware comparators, as patented in Assar et al. [1995a, 1995b]). This technique, which is similar to set-associative caches, reduces the amount of RAM or hardware comparators required for the searches, but reduces the flexibility of the mapping. The reduced flexibility can lead to more frequent erases and to accelerated wear.

The Flash Translation Layer (FTL) is a technique to store some of the direct map within the flash device itself, while trying to reduce the cost of updating the map on the flash device. This technique was originally patented by Ban [1995], and was later adopted as a PCMCIA standard [Intel Corporation 1998b].1

1Intel writes that “M-Systems does grant a royalty-free, non-exclusive license for the design and development of FTL-compatible drivers, file systems, and utilities using the data formats with PCMCIA PC Cards as described in the FTL Specifications” [Intel Corporation 1998b].


Fig. 1. Block mapping in a flash device. The gray array on the left is the virtual block to the physical sector direct map, residing in RAM. Each physical sector contains a header and data. The header contains the index of the virtual block stored in the sector, an erase counter, valid and obsolete bits, and perhaps an error-correction code and a version number. The virtual block numbers in the headers of populated sectors constitute the inverse map from which a direct map can be constructed. A version number allows the system to determine which of two valid sectors containing the same virtual block is more recent.
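As a rough illustration of how a direct map is rebuilt from the on-flash inverse map at initialization, consider the following C sketch. The header structure loosely follows Figure 1; the field names and helper functions are our own.

#include <stdint.h>

#define NO_SECTOR 0xFFFFFFFFu

struct sector_header {
    uint32_t vblock;       /* inverse-map entry: block stored here   */
    uint32_t erase_count;
    uint8_t  free_bit;     /* 1 = still free, 0 = allocated          */
    uint8_t  valid_bit;    /* 1 = not yet valid, 0 = contents valid  */
    uint8_t  obsolete_bit; /* 1 = current copy, 0 = superseded       */
    uint8_t  version;      /* two-bit version number, modulo 4       */
};

extern int  num_sectors(void);
extern void read_header(int sector, struct sector_header *h);

void rebuild_direct_map(uint32_t *direct_map, int num_blocks)
{
    for (int b = 0; b < num_blocks; b++)
        direct_map[b] = NO_SECTOR;
    for (int s = 0; s < num_sectors(); s++) {
        struct sector_header h;
        read_header(s, &h);
        /* populated, marked valid, and not yet superseded */
        if (h.free_bit == 0 && h.valid_bit == 0 && h.obsolete_bit == 1)
            direct_map[h.vblock] = (uint32_t)s;
        /* a tie between two valid copies of the same block would be
           broken here using the modulo-4 version number             */
    }
}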

The FTL uses a combination of mechanisms, illustrated in Figure 2, to perform the block-to-sector mapping.

(1) Block numbers are first mapped to logical block numbers, which consist of a logical erase unit number (specified by the most significant bits of the logical block number) and a sector index within the erase unit. This mechanism allows the valid sectors of an erase unit to be copied to a newly-erased unit without changing the block-to-logical-block map, since each sector is copied to the same location in the new erase unit.

(2) This block-to-logical-block map can be stored partially in RAM and partially within the flash itself. The mapping of the first blocks, which in FAT-formatted devices change frequently, can be stored in RAM, while the rest is stored in the flash device. The transition point can be configured when the flash is formatted, and is stored in a header in the beginning of the flash device.

(3) The flash portion of the block-to-logical-block map is not stored contiguously in the flash, but is scattered throughout the device, along with an inverse map. A direct map in RAM, which is reconstructed during initialization, points to the sectors of the map. To look up the logical number of a block, the system first finds the sector containing the mapping in the top-level RAM map, and then retrieves the mapping itself. In short, the map is stored in a two-level hierarchical structure.

(4) When a block is rewritten and moved to a new sector, its mapping must be changed. To allow this to happen at least some of the time without rewriting the relevant mapping block, a backup map is used. If the relevant entry in the backup map, which is also stored on flash, is available (all 1s), the original entry in the main map is cleared, and the new location is written to the backup map. Otherwise, the mapping sector must be rewritten. During lookup, if the mapping entry is all 0s, the system looks up the mapping in the backup map. This mechanism favors sequential modification of blocks, since, in such cases, multiple mappings are moved from the main map to the backup map before a new mapping sector must be written. The backup map can be sparse; not every mapping sector must have a backup sector.


Fig. 2. An example of the FTL mapping structures. The system in the figure maps two logical erase units onto three physical units. Each erase unit contains four sectors. Sectors that contain page maps contain four mappings each. Pointers represented as gray rectangles are stored in RAM. The virtual-to-logical page maps, shown on the top right, are not contiguous, so a map in RAM maps their sectors. Normally, the first sectors in the primary map reside in RAM as well. The replacement map contains only one sector; not every primary map sector must have a replacement. The illustration of the entire device on the bottom also shows the page-map sectors. In the mapping of virtual block 5, the replacement map entry is used, because it is not free (all 1’s).

(5) Finally, logical erase units are mapped to physical erase units using a small direct map in RAM. Because it is small (one entry per erase unit, not per sector), the RAM overhead is small. It is constructed during initialization from an inverse map; each physical erase unit stores its logical number. This direct map is updated whenever an erase unit is reclaimed.

Ban later patented a translation layer for NAND devices, called NFTL [Ban 1999]. It is simpler than the FTL and comes in two types: one for devices with spare storage for each sector (sometimes called out-of-band data), and one for devices without such storage. The type for devices without spare data is less efficient, but simpler, so we'll start with it.

The virtual block number is broken up into a logical erase-unit number and a sector number within the erase unit. A data structure in RAM maps each logical erase unit to a chain of physical units. To locate a block, say block 5 in logical unit 7, the system searches the appropriate chain. The units in the chain are examined sequentially. As soon as one of them contains a valid sector in position 5, it is returned. The 5th sectors in earlier units in the chain are obsolete, and the 5th sectors in later units are still free. To update block 5, the new data is written to sector 5 in the first unit in the chain where it is still free. If sector 5 is used in all the units in the chain, the system adds another unit to the chain. To reclaim space, the system folds all the valid sectors in the chain to the last unit in the chain. That unit becomes the first unit in the new chain, and all the other units in the old chain are erased. The length of chains is one or longer.

If spare data is available in every sector, the chains are always of length one or two. The first unit in the chain is the primary unit, and blocks are stored in it in their nominal sectors (sector 5 in our example). When a valid sector in the primary unit is updated, the new data are written to an arbitrary sector in the second unit in the chain, the replacement unit. The replacement unit can contain many copies of the same virtual block, but only one of them is valid. To reclaim space, or when the replacement unit becomes full, the valid sectors in the chain are copied to a new unit, and the two units in the old chain are erased.
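The chain search that NFTL performs for devices without spare storage can be sketched as follows. The types and helpers are invented for the illustration; a real implementation would read sector states from flash headers rather than from RAM structures.

#include <stddef.h>
#include <stdint.h>

#define SECTORS_PER_UNIT 32
enum { FREE, VALID, OBSOLETE };

struct phys_unit {
    struct phys_unit *next;                 /* next unit in the chain   */
    uint8_t sector_state[SECTORS_PER_UNIT]; /* FREE, VALID, or OBSOLETE */
};

extern struct phys_unit *chain_head(uint32_t logical_unit);

/* Returns the chain unit holding the valid copy of the sector,
   or NULL if the block has never been written.                 */
struct phys_unit *nftl_find(uint32_t logical_unit, int sector_index)
{
    for (struct phys_unit *u = chain_head(logical_unit); u; u = u->next)
        if (u->sector_state[sector_index] == VALID)
            return u;   /* earlier units held obsolete copies,
                           later units are still free           */
    return NULL;
}
/* An update writes to the same sector index in the first chain unit
   where that position is still FREE, appending a new unit if needed;
   folding copies all valid sectors to the last unit and erases the rest. */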


Fig. 3. Block mapping with variable-length sectors. Fixed-size headers are added to one end of the erase unit, and variable-length sectors are added to the other end. The figure shows a unit with four valid sectors, two obsolete ones, and some free space (including a free header, the third one).

It is also possible to map variable-length logical blocks onto flash memory, as shown in Figure 3. Wells et al. [1998] patented such a technique, and a similar technique was used by the Microsoft Flash File System [Torelli 1995]. The motivation for the Wells et al. [1998] patent was compressed storage of standard disk sectors. For example, if the last 200 bytes of a 512-byte sector are all zeros, the zeros can be represented implicitly rather than explicitly, thereby saving storage. The main idea in such techniques is to fill an erase unit with variable-length data blocks from one end of the unit, say the low end, while filling fixed-size headers from the other end. Each header contains a pointer to the variable-length data block that it represents. The fixed-size headers allow constant-time access to data (i.e., to the first word of the data). The fixed-size headers offer another potential advantage to systems that reference data blocks by logical erase-unit number and a block index within the unit. The Microsoft Flash File System [Torelli 1995] is one such system. In this system, a unit can be reclaimed and defragmented without any need to update references to the blocks that were relocated. We describe this mechanism in more detail in the following.

Smith and Garvin [1999] patented a similar system, but at a coarser granularity. Their system divides each erase unit into a header, an allocation map, and several fixed-size sectors. The system allocates storage in blocks comprised of one or more contiguous sectors. Such blocks are usually called extents. Each allocated extent is described by an entry in the allocation map. The entry specifies the location and length of the extent, and the virtual block number of the first sector in the extent (the other sectors in the extent store consecutive virtual blocks). When a virtual block within an extent is updated, the extent is broken into two or three new extents, one of which contains the now-obsolete block. The original entry for the extent in the allocation map is marked as invalid, and one or two new entries are added at the end of the map.
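The two-ended unit layout of Figure 3 can be expressed concretely in C. This is a minimal sketch, assuming a 64KB erase unit and an invented header format; real systems add checksums and block-identity fields to the headers.

#include <stdint.h>

struct vl_header {
    uint32_t data_offset;  /* where in the unit the block starts */
    uint32_t length;       /* block length in bytes              */
    uint8_t  valid, obsolete;
};

struct erase_unit {
    uint8_t bytes[64 * 1024];  /* data blocks fill upward from offset 0;
                                  headers fill downward from the top    */
};

/* Constant-time access to the ith header, counted from the high end. */
static inline struct vl_header *header(struct erase_unit *u, int i)
{
    return (struct vl_header *)(u->bytes + sizeof u->bytes) - (i + 1);
}

/* Constant-time access to the ith block's data through its header. */
static inline uint8_t *block_data(struct erase_unit *u, int i)
{
    return u->bytes + header(u, i)->data_offset;
}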


2.3. Erase-Unit Reclamation

Over time, the flash device accumulates obsolete sectors and the number of free sectors decreases. To make space for new blocks and for updated blocks, obsolete sectors must be reclaimed. Since the only way to reclaim a sector is to erase an entire unit, reclamation (sometimes called garbage collection) operates on entire erase units.

Reclamation can take place either in the background (when the CPU is idle) or on demand when the amount of free space drops below a predetermined threshold. The system reclaims space in several stages.

—One or more erase units are selected for reclamation.
—The valid sectors of these units are copied to newly allocated free space elsewhere in the device. Copying the valid data prior to erasing the reclaimed units ensures persistence even if a fault occurs during reclamation.
—The data structures that map logical blocks to sectors are updated, if necessary, to reflect the relocation.
—Finally, the reclaimed erase units are erased and their sectors are added to the free-sector reserve. This stage might also include writing an erase-unit header on each newly-erased unit.

Immediately following the reclamation of a unit, the number of free sectors in the device is at least one unit's worth. Therefore, the maximum amount of useful data that a device can contain is smaller by one erase unit than its physical size. In many cases, the system keeps at least one or two free and erased units at all times, to allow all the valid data in a unit that is being reclaimed to be relocated to a single erase unit. This scheme is absolutely necessary when data is stored on the unit in variable-size blocks, since, in this case, fragmentation may prevent reclamation altogether.

The reclamation mechanism is governed by two policies: which units to reclaim, and where to relocate valid sectors to. These policies are related to another policy, which governs sector allocation during block updates. These three interrelated policies affect the system in three ways. They affect the effectiveness of the reclamation process, which is measured by the number of obsolete sectors in reclaimed units; they affect wear leveling; and they affect the mapping data structures (some relocations require simple map updates and some require complex updates).

The goals of wear leveling and efficient reclamation are often contradictory. Suppose that an erase unit contains only so-called static data, data that is never or rarely updated. Efficiency considerations suggest that this unit should not be reclaimed, since reclaiming it would not free up any storage—its data will simply be copied to another erase unit, which will immediately become full. But although reclaiming the unit is inefficient, this reclamation can reduce the wear on other units, and thereby level the wear. Our supposition that the data is static implies that it will not change soon (or ever). Therefore, by copying the contents of the unit to another unit which has undergone many erasures, we can reduce future wear on the other unit.

Erase-unit reclamation involves a fourth policy, but it is irrelevant to our discussion. The fourth policy triggers reclamation events. Clearly, reclamation must take place when the system needs to update a block, but no free sector is available. This is called on-demand reclamation. But some systems can also reclaim erase units in the background when the flash device, or the system as a whole, are idle. The ability to reclaim in the background is largely determined by the overall structure of the system. Some systems can identify and utilize idle periods, while others cannot. The characteristics of the flash device are also important. When erases are slow, avoiding on-demand erases improves the response time of block updates; when erases are fast, the impact of an on-demand erase is less severe. Also, when erases can be interrupted and resumed later, a background erase operation has little or no impact on the response time of a block update, but when erases are uninterruptible, a background erase operation can delay a block update. However, all of these issues are largely orthogonal to the algorithms and data structures that are used in mapping and reclamation, so we do not discuss this issue any further.

2.3.1. Wear-Centric Reclamation Policies and Wear-Measuring Techniques. In this section, we describe reclamation techniques that are primarily designed to reduce wear. These techniques are used in systems that separate efficient reclamation and wear leveling. Typically, the system uses an efficiency-centric reclamation policy most of the time, but switches to a wear-centric technique that ignores efficiency once in a while. Sometimes uneven wear triggers the switch, and sometimes it happens periodically, whether or not wear is even.
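The four reclamation stages listed above, with the two governing policies left abstract, might look as follows in C. All helpers are placeholders, and a production implementation would order the map update and the erasure carefully for crash safety.

typedef int (*select_victim_fn)(void);       /* policy: which unit to reclaim  */
typedef int (*choose_target_fn)(int sector); /* policy: where to move a sector */

extern int  sectors_in_unit(int unit);
extern int  sector_is_valid(int unit, int s);
extern void copy_sector(int from_unit, int s, int to_sector);
extern void remap_block(int from_unit, int s, int to_sector);
extern void erase_unit(int unit);
extern void add_to_free_pool(int unit);

void reclaim(select_victim_fn select_victim, choose_target_fn choose_target)
{
    int victim = select_victim();               /* stage 1: pick a unit */
    for (int s = 0; s < sectors_in_unit(victim); s++) {
        if (!sector_is_valid(victim, s))
            continue;
        int dest = choose_target(s);
        copy_sector(victim, s, dest);           /* stage 2: copy valid  */
        remap_block(victim, s, dest);           /* stage 3: update map  */
    }
    erase_unit(victim);                         /* stage 4: erase       */
    add_to_free_pool(victim);
}
/* Copying before erasing preserves the data even if power fails in
   the middle of the reclamation.                                    */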


Many techniques attempt to level the wear by measuring it; therefore, this section will also describe wear-measurement mechanisms.

Lofgren et al. [2000, 2003] patented a simple wear-leveling technique that is triggered by erase-unit reclamation. In this technique, the header of each erase unit includes an erase counter. Also, the system sets aside one erase unit as a spare. When one of the most worn-out units is reclaimed, its counter is compared to that of the least worn-out unit. If the difference exceeds a threshold, say 15,000, a wear-leveling relocation is used instead of a simple reclamation. The contents of the least worn-out unit (or one of the least worn-out units) are relocated to the spare unit, and the contents of the most worn-out unit, the one whose reclamation triggered the event, are copied to the just-erased least worn-out unit. The most worn-out unit that was reclaimed becomes the new spare. This technique attempts to identify worn-out sectors and static blocks, and to relocate static blocks to worn-out sectors. In the next wear-leveling event, it will be used to store the blocks from the least worn-out units. Presumably, the data in the least worn-out unit are relatively static; storing them on a worn-out unit reduces the chances that the worn-out unit will soon undergo further rapid wear. Also, by removing the static data from the least worn-out unit, we usually bring its future erase rate close to the average of the other units.

Clearly, any technique that relies on erase counters in the erase-unit headers is susceptible to loss of an erase counter if power is lost after a unit is erased, but before the new counter is written to the header. The risk is higher when erase operations are slow.

One way to address this risk is to store the erase counter of unit i on another unit j ≠ i. One such technique was patented by Marshall and Manning [1998], as part of a flash file system. Their system stores an erase counter in the header of each unit. Prior to the reclamation of unit i, the counter is copied to a specially-marked area in an arbitrary unit j ≠ i. Should power be lost during the reclamation, the erase count of i will be recovered from unit j after power is restored. Assar et al. [1996] patented a simpler but less efficient solution. They proposed a bounded unary 8-bit erase counter, which is stored on another erase unit. The counter of unit i, which is stored on another unit j ≠ i, starts at all ones, and a bit is cleared every time i is erased. Because the counter can be updated, it does not need to be erased every time unit i is erased. On the other hand, the number of updates to the erase counter is bounded. When it reaches the maximum (8 in their patent), further erases of unit i will cause loss of accuracy in the counter. In their system, such counters were coupled with periodic global restarts of the wear-leveling mechanism, in which all the counters are rolled back to the erased state.

Jou and Jeppesen III [1996] patented a technique that maintains an upper bound on the wear (number of erasures). The bound is always correct, but not necessarily tight. Their system uses an erase-before-write strategy: the valid contents of an erase unit chosen for reclamation are copied to another unit, but the unit is not erased immediately. Instead, it is marked in the flash device as an erasure candidate, and added to a priority queue of candidates in RAM. The queue is sorted by wear; the unit with the least wear in the queue (actually the least wear bound) is erased when the system needs a free unit. If power is lost during an erasure, the new bound for the erased unit is set to the minimum wear among the other erase candidates plus 1. Since the pre-erasure bound on the unit was less than or equal to that of all the other ones in the queue, the new bound may be too high, but it is correct. (The patent does not increase the bound by 1 over that of the minimum in the queue; this yields a wear estimate that may be just as useful in practice, but not a bound.) This technique levels the wear to some extent, by delaying reuse of worn-out units. The evenness of the wear in this technique depends on the number of surplus units: if the queue of candidates is short, reuse of worn-out units cannot be delayed much.
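A sketch of a wear-leveling trigger in the style of Lofgren et al. appears below. The threshold value comes from the text; the helper functions and the bookkeeping around the spare unit are our own simplification of the patented scheme.

#include <stdint.h>

#define WEAR_GAP_THRESHOLD 15000  /* example value from the text */

extern uint32_t erase_count(int unit);
extern int  most_worn_unit(void);
extern int  least_worn_unit(void);
extern void plain_reclaim(int unit);
extern void relocate_unit(int from, int to);  /* copy contents, then erase */
extern int  spare_unit;                       /* reserved erased unit      */

void reclaim_with_wear_leveling(void)
{
    int hot  = most_worn_unit();   /* frequently erased, likely dynamic */
    int cold = least_worn_unit();  /* rarely erased, likely static data */
    if (erase_count(hot) - erase_count(cold) > WEAR_GAP_THRESHOLD) {
        relocate_unit(cold, spare_unit); /* static data moves to spare   */
        relocate_unit(hot, cold);        /* hot data rests on cold unit  */
        spare_unit = hot;                /* worn unit becomes the spare  */
    } else {
        plain_reclaim(hot);
    }
}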


Another solution to the same problem, patented by Han [2000], relies on wear estimation using erase latencies. On some flash devices, the erase latency increases with wear. Han's technique compares erase times in order to rank erase units by wear. This avoids altogether the need to store erase counters. The wear rankings can be used in a wear-leveling relocation or allocation policy. Without explicit erase counters, the system can only estimate the wear of a unit after it is erased in a session. Therefore, this technique is probably not applicable in its pure form (without counters) when sessions are short and only a few units are erased.

Another approach to wear leveling is to rely on randomness rather than on estimates of actual wear. Woodhouse [2001] proposed a simple randomized wear-leveling technique. Every 100th reclamation, the system selects for reclamation a unit containing only valid data, at random. This has the effect of moving static data from units with little wear to units with more wear. If this technique is used in a system that otherwise always favors reclamation efficiency over wear leveling, extreme wear imbalance can still occur. If units are selected for reclamation based solely upon the amount of invalid data they contain, a little worn-out unit with a small amount of invalid data may never be reclaimed.

At about the same time, Ban [2004] patented a more robust technique. His technique, like the one of Lofgren et al. [2000, 2003], relies on a spare unit. Every certain number of reclamations, an erase unit is selected at random, its contents relocated to the spare unit, and it is marked as the new spare. The trigger for this wear-leveling event can be deterministic, say the 1000th erase since the last event, or random. Using a random trigger ensures that wear leveling is triggered even if every session is short and encompasses only a few erase operations. The aim of this technique is to have every unit undergo a fairly large number of random swaps, say 100, during the lifetime of the flash device. The large number of swaps is supposed to diminish the likelihood that an erase unit stores static data for much of the device's lifetime. In addition, the total overhead of wear leveling in this technique is predictable and evenly spread in time.

It appears that the idea behind this technique was used in earlier software. M-Systems developed and marketed software called TrueFFS, a block-mapping device driver that implements the FTL. The M-Systems literature [Dan and Williams 1997] states that TrueFFS uses a wear-leveling technique that combines randomness with erase counts. Their literature claimed that the use of randomness eliminates the need to protect the exact erase counts stored in each erase unit. The details of the wear-leveling algorithm of TrueFFS are not described in the open literature or in patents.
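A randomized policy of the kind Woodhouse and Ban describe can be sketched in a few lines of C. The period, the helper names, and the use of the C library's rand() are illustrative assumptions, not details of either scheme.

#include <stdlib.h>

#define RANDOM_PERIOD 100   /* roughly every 100th reclamation */

extern int  num_units(void);
extern int  most_obsolete_unit(void);   /* efficiency-centric choice */
extern void reclaim_unit(int unit);

void select_and_reclaim(void)
{
    if (rand() % RANDOM_PERIOD == 0)
        reclaim_unit(rand() % num_units()); /* wear-centric: random unit,
                                               even one full of valid data */
    else
        reclaim_unit(most_obsolete_unit()); /* efficiency-centric          */
}
/* A random rather than deterministic trigger keeps the expected
   wear-leveling rate the same even when sessions are short.     */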

2.3.2. Combining Wear-Leveling with Efficient Reclamation. We now describe policies that attempt to address both wear leveling and efficient reclamation.

Kawaguchi et al. [1995] describe one such policy. They implemented a block device driver for a flash device. The driver was intended for use with a log-structured Unix file system, which we describe in the following. This file system operates much like a block-mapping mechanism: it relocates blocks on update, and reclaims erase units to free space. Kawaguchi et al. [1995] describe two reclamation policies. The first policy selects the next unit for reclamation based on a weighted benefit/cost ratio. The benefit of a unit reclamation is the amount of invalid space in the unit, and the cost is incurred by the need to read the valid data and write it back elsewhere. This is weighted by the age of the block, the time since the last invalidation. A large weight is assumed to indicate that the remaining valid data in the unit is relatively static. This implies that the valid occupancy of the unit is unlikely to decrease soon, so there is no point in waiting until the benefit increases. In other words, the method tries to avoid reclaiming units whose benefit/cost ratio is likely to increase soon. This policy is not explicitly designed to level wear, but it does result in some wear leveling, since it ensures that blocks with some invalid data are eventually reclaimed, even if the amount of invalid data in them is small.

Kawaguchi et al. [1995] found that this policy still leads to inefficient reclamation in some cases, and proposed a second policy, which tends to improve efficiency at the expense of worse wear leveling. They proposed to write data to two units. One unit is used for sectors relocated during the reclamation of so-called cold units, ones that were not modified recently. The other unit is used for sectors relocated from hot units and for updating blocks not undergoing reclamation. This policy tends to cluster static data in some units and dynamic data in others. This, in turn, tends to increase the efficiency of reclamation, because units with dynamic data tend to be almost empty upon reclamation, and static units do not need to be reclaimed at all, because they often contain no invalid data. Clearly, units with static data can remain unreclaimed for long periods, which leads to uneven wear unless a separate explicit wear-leveling mechanism is used.

A more elaborate policy is described by Wu and Zwaenepoel [1994]. Their system partitions the erase units into fixed-size partitions. Lower-numbered partitions are supposed to store hot virtual blocks, while higher-numbered partitions are supposed to store cold virtual blocks. Each partition has one active erase unit that is used to store updated blocks. When a virtual block is updated, the new data is written to a sector in the active unit in the same partition that the block currently resides in. When the active unit in a partition fills up, the system finds the unit with the least valid sectors in the same partition and reclaims it. During reclamation, valid data in the unit that is being reclaimed is copied to the beginning of an empty unit. Thus, blocks that are not updated frequently tend to slide toward the beginning of erase units. Blocks that were updated after the unit they are on became active, and are hence hot, tend to reside toward the end of the unit. This allows Wu and Zwaenepoel's [1994] system to classify blocks as hot or cold.

The system tries to achieve a roughly constant reclamation frequency in all the partitions. Therefore, hot partitions should contain fewer blocks than cold partitions, because hot data tends to be invalidated more quickly. At every reclamation, the system compares the reclamation frequency of the current partition to the average frequency. If the partition's reclamation frequency is higher than average, some of its blocks are moved to neighboring partitions. Blocks from the beginning of the unit being reclaimed are moved to a colder partition, and blocks from the end are moved to a hotter partition.

Wu and Zwaenepoel [1994] employ a simple form of wear leveling. When the erase count of the most worn-out unit is higher by 100 than that of the least worn-out unit, the data on them is swapped. This probably works well when the least worn-out unit contains static or nearly static data. If this is the case, then swapping the extreme units allows the most worn-out unit some time to rest.

We shall discuss Wu and Zwaenepoel's [1994] system again later in the survey. The system was intended as a main-memory replacement, not as a disk replacement, so it has some additional interesting features that we describe in Section 4.

Wells [1994] patented a reclamation policy that relies on a weighted combination of efficiency and wear leveling. The system selects the next unit to be reclaimed based on a score. The score of a unit j is defined to be

    score(j) = 0.8 × obsolete(j) + 0.2 × (max_i{erasures(i)} − erasures(j)),

where obsolete(j) is the amount of invalid data in unit j, and erasures(j) is the number of erasures that unit j has undergone. The unit with the maximal score is reclaimed next.
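Wells' selection rule translates directly into code. In the sketch below, the per-unit statistics are assumed to be available through invented helper functions; only the score formula itself comes from the patent.

#include <stdint.h>

extern int      num_units(void);
extern uint32_t obsolete_bytes(int unit);
extern uint32_t erasures(int unit);

int wells_select_victim(void)
{
    uint32_t max_erasures = 0;
    for (int j = 0; j < num_units(); j++)
        if (erasures(j) > max_erasures)
            max_erasures = erasures(j);

    int best = 0;
    double best_score = -1.0;
    for (int j = 0; j < num_units(); j++) {
        double score = 0.8 * obsolete_bytes(j)              /* efficiency */
                     + 0.2 * (max_erasures - erasures(j));  /* wear       */
        if (score > best_score) {
            best_score = score;
            best = j;
        }
    }
    return best;   /* the unit with the maximal score is reclaimed next */
}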

Since obsolete(j) and erasures(j) are measured in different units, the precise weights of the two terms, 0.8 and 0.2 in the system described in the patent, should depend on the space-measurement metric. The principle, however, is to weigh efficiency heavily and wear differences lightly. After enough units have been reclaimed using this policy, the system checks whether more aggressive wear leveling is necessary. If the difference between the most and the least worn-out units is 500 or higher, the system selects additional units for reclamation, using a wear-heavy policy that maximizes a different score,

    score′(j) = 0.2 × obsolete(j) + 0.8 × (max_i{erasures(i)} − erasures(j)).

The combined policy is designed to first ensure a good supply of free sectors, and once that goal is achieved, to ensure that the wear imbalance is not too extreme. In the wear-leveling phase, efficiency is not important, since that phase only starts after there are enough free sectors. Also, there is no point in activating the wear-leveling phase where wear is even, since this will just increase the overall wear.

Chiang et al. [1999] proposed a block clustering that they call CAT. Chiang and Chang [1999] later proposed an improved technique, called DAC. At the heart of these methods lies a concept called temperature. The temperature of a block is an estimate of the likelihood that it will be updated soon. The system maintains a temperature estimate for every block using two simple rules: (1) when a block is updated, its temperature rises, and (2) blocks cool down over time. The CAT policy classifies blocks into three categories: read-only (absolutely static), cold, and hot. The category that a block belongs to does not necessarily match its temperature, because, in CAT, blocks are only reclassified during reclamation. Each erase unit stores blocks from one category. When an erase unit is reclaimed, its valid sectors are reclassified. CAT selects units for reclamation by maximizing the score function

    cat-score(j) = (obsolete(j) × age(j)) / (valid(j) × erasures(j)),

where valid(j) is the amount of valid data in unit j, and age(j) is a discrete monotone function of the time since the last erasure of unit j. This score function leads the system to prefer units with a lot of obsolete data and little valid data, units that have not been reclaimed recently, and units that have not been erased many times. This combines efficiency with some degree of wear leveling. The CAT policy also includes an additional wear-leveling mechanism: when an erase unit nears the end of its life, it is exchanged with the least worn-out unit.

The DAC policy is more sophisticated. First, blocks may be classified into more than three categories. More importantly, blocks are reclassified on every update, so a cold block that heats up will be relocated to a hotter erase unit, even if the units that store it never get reclaimed while it is valid.

TrueFFS selects units for reclamation based on both the amount of invalid data, the number of erasures, and identification of static areas [Dan and Williams 1997]. The details are not described in the open literature. TrueFFS also tries to cluster related blocks so that multiple blocks in the same unit are likely to become invalid together. This is done by trying to map contiguous logical blocks onto a single unit, under the assumption that higher-level software (i.e., a file system) attempts to cluster related data at the logical-block level.

Kim and Lee [2002] proposed an adaptive scheme for combining wear leveling and efficiency in selecting units for reclamation. Their scheme is designed for a file system, not for a block-mapping mechanism, but since it is applicable for block-mapping mechanisms, we describe it here. Their scheme, which is called CICL, selects an erase unit (actually a group of erase units called a segment) for reclamation by minimizing the following score,

    cicl-score(j) = (1 − λ) × (valid(j) / (valid(j) + obsolete(j))) + λ × (erasures(j) / (1 + max_i{erasures(i)})).

In this expression, λ is not a constant, but a monotonic function that depends on the discrepancy between the most and the least worn-out units,

    0 < λ(max_i{erasures(i)} − min_i{erasures(i)}) < 1.

When λ is small, units are selected for reclamation mostly upon efficiency considerations. When λ is high, unit selection is driven mostly by wear, with preference given to young units. Letting λ grow when wear becomes uneven, and shrink when it evens out, leads to more emphasis on wear when wear is imbalanced, and more emphasis on efficiency when wear is roughly even. Kim and Lee [2002] augment this policy with two other techniques for clustering data and for unit allocation, but these are file-system specific, so we describe them later in the survey.
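For concreteness, here are the two score functions written out in C, with the per-unit statistics supplied by assumed helpers. Note that cat-score is undefined for a never-erased unit (erasures(j) = 0), so an implementation must special-case new units; the formulas themselves are as given above.

extern double valid_data(int j), obsolete_data(int j);
extern double age(int j);        /* discrete monotone time since erasure */
extern double erasures(int j);
extern double max_erasures(void);

/* CAT: maximize; favors units with much obsolete data, little valid
   data, old contents, and few erasures. */
double cat_score(int j)
{
    return (obsolete_data(j) * age(j)) / (valid_data(j) * erasures(j));
}

/* CICL: minimize; lambda in (0,1) grows as wear becomes uneven. */
double cicl_score(int j, double lambda)
{
    double utilization = valid_data(j)
                       / (valid_data(j) + obsolete_data(j));
    double relative_wear = erasures(j) / (1.0 + max_erasures());
    return (1.0 - lambda) * utilization + lambda * relative_wear;
}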


2.3.3. Real-Time Reclamation. Reclaiming an erase unit to make space for new or updated data can take a considerable amount of time. A slow and unpredictable reclamation can cause a real-time system to miss a deadline. Chang et al. [2004] proposed a guaranteed reclamation policy for real-time systems with periodic tasks. They assume that tasks are periodic, and that each task provides the system with its period, with per-period upper bounds on CPU time and the number of sector updates.

Chang et al.'s system [2004] relies on two principles. First, it uses a greedy reclamation policy that reclaims the unit with the least amount of valid data, and it only reclaims units when the number of free sectors falls below a threshold. (This policy is used only under deadline pressure; for non-real-time reclamation, a different policy is used.) This policy ensures that every reclamation generates a certain number of free sectors. Consider, for example, a 64MB flash device that is used to store 32MB worth of virtual blocks, and that reclamation is triggered when the amount of free space falls below 16MB. When reclamation is triggered, the device must contain more than 16MB of obsolete data. Therefore, on average, a quarter of the sectors on a unit are obsolete. There must be at least one unit with that much obsolete data, so by reclaiming it, the system is guaranteed to generate a quarter of a unit's worth of free sectors.

To ensure that the system meets the deadlines of all the tasks that it admits, every task is associated with a reclamation task. These tasks reclaim at most one unit every time one of them is invoked. The period of a reclamation task is chosen so that it is invoked once every time the task it is associated with writes α blocks, where α is the number of sectors that are guaranteed to be freed in every reclamation. This ensures that a task uses up free sectors at the rate that its associated reclamation task frees sectors. (If a reclamation task does not reclaim a unit because there are many free pages, the system is guaranteed to have enough sectors for the associated task.) The system admits a task only if it can meet the deadlines of both it and its reclamation task.

Chang et al. [2004] also proposed a slightly more efficient reclamation policy that avoids unnecessary reclamation, and an auxiliary wear-leveling policy that the system applies when it is not under deadline pressure.
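The free-space guarantee in the example above can be checked with a few lines of arithmetic. The sketch below assumes 512-byte sectors and 32-sector erase units purely for illustration; only the 64MB/32MB/16MB figures come from the text.

#include <stdio.h>

int main(void)
{
    const int total    = 128 * 1024;  /* sectors in a 64MB device      */
    const int live     = total / 2;   /* 32MB worth of virtual blocks  */
    const int free_min = total / 4;   /* 16MB reclamation threshold    */

    /* at the trigger point: obsolete >= total - live - free           */
    int obsolete = total - live - free_min;  /* 32768 sectors = 16MB   */

    /* averaged over units, a quarter of each unit is obsolete, so the
       best victim frees at least a quarter of a unit per reclamation  */
    const int unit_sectors = 32;
    int alpha = (obsolete * unit_sectors) / total;   /* alpha = 8      */
    printf("each reclamation frees at least %d sectors\n", alpha);
    return 0;
}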


3. FLASH-SPECIFIC FILE SYSTEMS

Block-mapping techniques present the flash device to higher-level software, in particular file systems, as a rewritable block device. The block device driver (or a hardware equivalent) performs the block-to-sector mapping, erase-unit reclamation, wear leveling, and perhaps even recovery of the block device to a designated state following a crash. Another approach is to expose the hardware characteristics of the flash device to the file-system layer, and let it manage erase units and wear. The argument is that an end-to-end solution can be more efficient than stacking a file system designed for the characteristics of magnetic hard disks on top of a device driver designed to emulate disks using flash.

The block-mapping approach does have several advantages over flash-specific file systems. First, the block-mapping approach allows developers to utilize existing file-system implementations, thereby reducing development and testing costs. Second, removable flash devices, such as CompactFlash and SmartMedia, must use storage formats that are understood by all the platforms in which users need to use them. These platforms currently include Windows, Mac, and Linux/Unix operating systems, as well as hand-held devices such as digital cameras, music players, PDAs, and phones. The only rewritable file-system format supported by all major operating systems is FAT, so removable devices typically use it. If the removable device exposes the flash memory device directly, then the host platforms must also understand the headers that describe the content of erase units and sectors. The standardization of the FTL allows these headers to be processed on multiple platforms; this is also the approach taken by the SmartMedia standard. Another approach, invented by SunDisk (now SanDisk) and typified by the CompactFlash format, is to hide the flash memory device behind a disk interface implemented in hardware as part of the removable device, so the host is not required to understand the flash header format. Typically, using a CompactFlash on a personal computer does not require even a special device driver, because the existing disk device driver can access the device.

Even on a removable device that must use a given file-system structure, say a FAT file system, combining the file system with the block-mapping mechanism yields benefits. Consider the deletion of a file, for example. If the system uses a standard file-system implementation on top of a block-mapping device driver, the deleted file's data sectors will be marked as free in a bitmap or file-allocation table, but they will normally not be overwritten or otherwise be marked as obsolete. Indeed, when the file system is stored on a magnetic disk, there is no reason to move the read-write head to these now-free sectors. But if the file system is stored on a flash device, the deleted file's blocks, which are not marked as obsolete, are copied from one unit to another whenever the unit that stores them is reclaimed. By combining the block device driver with the file-system software, the system can mark the deleted file's sectors as obsolete, which prevents them from ever being copied. This appears to be the approach of FLite, a FAT file-system/device-driver combination software from M-Systems [Dan and Williams 1997]. Similar combination software is also sold by HCC Embedded. Their file systems are described in the following, but it is not clear whether they exploit such opportunities or not.

But when the flash memory device is not removable, a flash-specific file system is a reasonable solution, and over the years, several such file systems have been developed. In the mid-1990s, Microsoft tried to standardize flash-specific file systems for removable memory devices, in particular, a file system called FFS2, but this effort did not succeed. Douglis et al. [1994] report very poor write performance for this system, which is probably the main reason it failed. We comment further on this file system later.

Most of the flash-specific file systems use the same overall principle, that of a log-structured file system. This principle was invented for file systems that utilize magnetic disks, where it is not currently used much. It turns out, however, to be appropriate for flash file systems. Therefore, we next describe how log-structured file systems work, and then describe individual flash file systems.

3.1. Background: Log-Structured File Systems

Conventional file systems modify information in place. When a block of data, whether containing part of a file or file-system metadata, must be modified, the new contents overwrite old data in exactly the same physical locations on the disk. Overwriting makes finding data easy: the same data structure that allowed the system to find the old data will now find the new data.

ACM Computing Surveys, Vol. 37, No. 2, June 2005. 152 E. Gal and S. Toledo contained part of a file, the pointer in the journal (or a log) before the block is modi- inode or in the indirect block that pointed fied in place. Following a crash, the fixing- to the old data is still valid. In other words, up process examines the tail of the log and in conventional file systems, data items either completes or rolls back each meta- (including both metadata and user data) data operation that may have been inter- are mutable so pointers remain valid af- rupted. Except for Sun Solaris systems ter modification. which use soft updates almost all the com- In-place modification creates two kinds mercial and free operating systems today of problems, however. The first kind in- use journaling file systems such as NTFS volves movement of the read-write head on Windows, JFS on AIX and Linux, XFS and disk rotation. Modifying existing data on IRIX and Linux, ext3 and reiserfs on forces the system to write in specific physi- Linux, and the new journaling HFS+ on cal disk locations so these writes may suf- MacOS. fer long seek and rotational delays. The Log-structured file systems take the effect of these mechanical delays are espe- journaling approach to the limit: the jour- cially severe when writing cannot be de- nal is the file system. The disk is organized layed, for example, when a file is closed. as a log, a continuous medium that is in The second problem that in-place modifi- principle infinite. In reality, the log con- cation creates is the risk of metadata in- sist of fixed-sized segments of contiguous consistency after a crash. Even if writes areas of the disk, chained together into a are atomic, high-level operations consist- linked list. Data and metadata are always ing of several writes, such as creating or written to the end of the log; they never deleting a file, are not atomic so the file overwrite old data. This raises two prob- system data structure may be in an incon- lems: how do you find the newly-written sistent state following a crash. There are data, and how do you reclaim disk space several techniques to address this issue. occupied by obsolete data? To allow newly- In lazy-write file systems, a fixing-up pro- written data to be found, the pointers to cess corrects the inconsistencies following the data must change so the blocks con- a crash prior to any use of the file sys- taining these blocks must also be written tem. The fixing-up can delay the recovery to the log, and so on. To limit the possi- of the computer system from the crash. In ble snowball effect of these rewrites, files careful-write systems, metadata changes are identified by a logical inode index. The are performed immediately according to a logical inodes are mapped to physical lo- strict ordering so that, after a crash, the cation in the log by a table that is kept data structure is always in a consistent in memory and is periodically flushed to state except perhaps for loss of disk blocks the log. Following a crash, the table can which are recovered later by a garbage col- be reconstructed by finding the most re- lector. This approach reduces the recovery cent copy of the table in the log, and then time but amplifies the effect of mechanical scanning the log from that position to the disk delays. Soft updates is a generaliza- end in order to find files whose position has tion of careful writing, where the file sys- changed after that copy was written. 
To tem is allowed to cache modified metadata reclaim disk space, a cleaner process finds blocks; they are later modified according segments that contain a large amount of to a partial ordering that guarantees that obsolete data, copies the valid data to the the only inconsistencies following a crash end of the log (as if they were changed), are loss of disk blocks. Experiments with and then erases the segment and adds it this approach have shown that it delivers to the end of the log. performance similar to that of journaling The rationale behind log-structured file file systems [Rosenblum and Ousterhout system is that they enjoy good write per- 1992; Seltzer et al. 1993] which we de- formance since writing is only done to the scribe next, but currently only Sun uses end of the log so it does not incur seek this approach. In journaling file systems, and rotational delays (except when switch- each metadata modification is written to a ing to a new segment which occurs only
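
To make the write path concrete, here is a minimal sketch in C of a log-structured store with an in-memory map from logical inodes to log positions. All names (log_update, inode_map, and so on) are ours, and the checkpointing of the map and the cleaner are omitted; this illustrates the principle and is not code from any of the systems discussed here.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 512
    #define LOG_BLOCKS 4096
    #define MAX_INODES 256

    static uint8_t  log_area[LOG_BLOCKS][BLOCK_SIZE]; /* the log itself     */
    static uint32_t log_tail;                         /* next free block    */
    static uint32_t inode_map[MAX_INODES];            /* inode -> log block */

    /* Append the new contents to the log tail and redirect the in-memory
     * map. The previous copy is not touched; it simply becomes obsolete
     * and will be reclaimed by the cleaner. */
    static int log_update(uint32_t inode, const void *data)
    {
        if (log_tail == LOG_BLOCKS)
            return -1;                 /* log full: the cleaner must run */
        memcpy(log_area[log_tail], data, BLOCK_SIZE);
        inode_map[inode] = log_tail++;
        return 0;
    }

    /* Reads go through the same map, so they always see the newest copy. */
    static const void *log_read(uint32_t inode)
    {
        return log_area[inode_map[inode]];
    }

Because the map is updated in RAM, an update writes only the data block itself to the medium; the snowball effect described above is confined to the periodic flushes of the map.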

The rationale behind log-structured file systems is that they enjoy good write performance, since writing is only done to the end of the log, so it does not incur seek and rotational delays (except when switching to a new segment, which occurs only rarely). On the other hand, read performance may be slow, because the blocks of a file might be scattered around the disk if each block was modified at a different time, and the cleaner process might further slow down the system if it happens to clean segments that still contain a significant amount of valid data. As a consequence, log-structured file systems are not in widespread use on disks; not many have been developed, and they are rarely deployed.

On flash devices, however, log-structured file systems make perfect sense. On flash, old data cannot be overwritten; the modified copy must be written elsewhere. Furthermore, log-structuring the file system on flash does not influence read performance, since flash allows uniform-access-time random access. Therefore, most of the flash-specific file systems use this design principle.

Kawaguchi et al. [1995] were probably the first to identify log-structured file systems as appropriate for flash memories. Although they used the log-structured approach to design a block-mapping device driver, their paper pointed out that this approach is suitable for flash memory management in general.

Kim and Lee [2002] proposed cleaning, allocation, and clustering policies for log-structured file systems that reside on flash devices. We have already described their reclamation policy, which is not file-system-specific. Their other policies are specific to file systems. Their system tries to identify files that have not been updated recently and whose data is fragmented over many erase units. Once in a while, a clustering process tries to collect the blocks of such files into a single erase unit. The rationale behind this policy is that if an infrequently modified file is fragmented, it does not contribute much valid data to the units that store it, so they may become good candidates for reclamation. But whenever such a unit is reclaimed, the file's data is simply copied to a new unit, thereby reducing the effectiveness of reclamation. By collecting the file's data onto a single unit, the unit is likely to contain a significant amount of valid data for a long period, until the file is deleted or updated again. Therefore, the unit is unlikely to be reclaimed soon, and the file's data is less likely to be copied over and over again. Kim and Lee [2002] propose policies for finding infrequently modified files, for estimating their fragmentation, and for selecting the collection period and scope (size).

Kim and Lee [2002] also propose to exploit clustering to improve wear leveling. They propose to allocate the most worn-out free unit for storage of cold data during such collections, and to allocate the least worn-out free unit for normal updates of data. The rationale here is that not-much-used data that is collected is likely to remain valid for a long period, thereby allowing a worn-out unit to rest. On the other hand, recently updated data is likely to be updated again soon, so a unit that is used for normal log operations is likely to become a candidate for reclamation soon, and it is therefore better to use a little-worn-out unit.
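
This allocation policy fits in a few lines of C. The sketch below is our own hedged reading of the policy, not code from Kim and Lee [2002]: cold data that is being clustered goes to the most worn-out free unit, while ordinary log writes go to the least worn-out one.

    #include <stdint.h>

    #define NUM_UNITS 1024

    struct erase_unit {
        uint32_t erase_count; /* how many times the unit was erased */
        int      is_free;     /* currently available for allocation */
    };

    static struct erase_unit units[NUM_UNITS];

    /* Pick a free unit: the most worn-out one for cold (rarely updated)
     * data, the least worn-out one for hot (frequently updated) data. */
    static int alloc_unit(int cold)
    {
        int best = -1;
        for (int i = 0; i < NUM_UNITS; i++) {
            if (!units[i].is_free)
                continue;
            if (best < 0 ||
                (cold  && units[i].erase_count > units[best].erase_count) ||
                (!cold && units[i].erase_count < units[best].erase_count))
                best = i;
        }
        return best; /* -1 means no free unit: reclaim first */
    }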

3.2. The Research-In-Motion File System

Research In Motion (http://www.rim.net), a company making handheld text-messaging devices and smart phones, patented a log-structured file system for flash memories [Parker 2003]. The file system is designed to store contiguous variable-length records of data, each having a unique identifier.

The flash is partitioned into an area for programs and an area for the file system (this is fairly common and is designed to allow programs to be executed in place, directly from NOR flash memory). The file-system area is organized as a perfectly circular log containing a sequence of records. Each record starts with a header containing the record's size, identity, invalidation flags, and a pointer to the next record in the same file, if any.

Because the log is circular, cleaning may not be effective: the oldest erase unit may not contain much obsolete data, but it is cleaned anyway. To address this issue, the patent proposes to partition the flash further into a log for so-called hot (frequently modified) data and a log for cold data. No suggestion is made as to how to classify records.

Keeping the records contiguous allows the file system to return pointers directly into the NOR flash in read operations. Presumably, this is enabled by an API that only allows access to a single record at a time, not to multiple records or other arbitrary parts of files.

The patent suggests keeping a direct map in RAM, mapping each logical record number to its current location on the flash device. The patent also mentions the possibility of keeping this data structure, or parts thereof, in flash, but without providing any details.
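
The patent does not give an exact record layout, but a header along the following lines captures the fields it lists. The field names and widths here are our guesses, shown only to make the structure tangible.

    #include <stdint.h>

    /* Hypothetical layout of a record header in the circular log. */
    struct rim_record_header {
        uint32_t size;    /* length of the record's data, in bytes     */
        uint32_t id;      /* the record's unique identifier            */
        uint32_t invalid; /* invalidation flags; starts as all 1s      */
        uint32_t next;    /* flash offset of the next record belonging
                             to the same file, or all 1s if none       */
    };
    /* The record's data follows the header contiguously, which is what
     * allows a read operation to return a pointer directly into NOR
     * flash. */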

3.3. The Journaling Flash Filing System

The Journaling Flash File System (JFFS) was originally developed by Axis Communications AB [2004] for embedded Linux. It was later enhanced, in a version called JFFS2, by David Woodhouse of Red Hat [Woodhouse 2001]. Both versions are freely available under the GNU Public License (GPL). Both versions focus mainly on NOR devices and may not work reliably on NAND devices. Our description here focuses on the more recent JFFS2.

JFFS2 is a POSIX-compliant file system. Files are represented by an inode number. Inode numbers are never reused, and each version of an inode structure on flash carries a version number. Version numbers are also not reused. The version numbers allow the host to reconstruct a direct inode map from the inverse map stored on flash.

In JFFS2, the log consists of a linked list of variable-length nodes. Most nodes contain parts of files. Such nodes contain a copy of the inode (the file's metadata), along with a range of data from the file, possibly empty. There are also special directory-entry nodes, which contain a file name and an associated inode number.

At mount time, the system scans all the nodes in the log and builds two data structures. One is a direct map from each inode number to the most recent version of it on flash. This map is kept in a hash table. The other is a collection of structures that represent each valid node on the flash. Each structure participates in two linked lists, one chaining all the nodes according to physical address, to assist in garbage collection, and the other containing all the nodes of a file, in order. The list of nodes belonging to a file forms a direct map of file positions to flash addresses. Because both the inode-to-flash-address and file-position-to-flash-address maps are kept only in RAM, the data structure on the flash can be very simple. In particular, when a file is extended or modified, only the new data and the inode are written to the flash, but no other mapping information. The obvious consequence of this design choice is high RAM usage.

JFFS2 uses a simple wear-leveling technique. Most of the time, the cleaner selects for cleaning an erase unit that contains at least some obsolete data (the article describing JFFS2 does not specify the exact policy, but it is probably based on the amount of obsolete data in each unit). But on every 100th cleaning operation, the cleaner selects a unit with only valid data, in an attempt to move static data around the device.

The cleaner can merge small blocks of data belonging to a file into a new large chunk. This is supposed to improve performance, but it can actually degrade performance: if, later, only part of that large chunk is modified, a new copy of the entire large chunk must be written to flash, because writes are not allowed to modify parts of valid chunks.
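
The core of the mount-time scan is the version comparison. The sketch below uses a simplified node format of our own; JFFS2's real node headers contain more fields, and its map is a hash table rather than a flat array.

    #include <stdint.h>

    #define MAX_INODES 1024

    struct node_ref {
        uint32_t offset;  /* physical location of the node on flash */
        uint32_t version; /* version number carried by the node     */
        int      present;
    };

    static struct node_ref inode_map[MAX_INODES];

    /* Called for every inode node found while scanning the log.
     * Version numbers are never reused, so the highest version is,
     * by construction, the most recent one. */
    static void mount_scan_node(uint32_t inode, uint32_t version,
                                uint32_t offset)
    {
        struct node_ref *r = &inode_map[inode];
        if (!r->present || version > r->version) {
            r->offset  = offset;
            r->version = version;
            r->present = 1;
        }
    }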

3.4. YAFFS: Yet Another Flash Filing System

YAFFS was written by Aleph One [2002] as a NAND file system for embedded devices. It has been released under the GPL and has been used in products running both Linux and Windows CE. It was written because the authors evaluated JFFS and JFFS2 and concluded that they are not suitable for NAND devices.

In YAFFS, files are stored in fixed-size chunks, which can be 512 bytes, 1KB, or 2KB in size. The file system relies on being able to associate a header with each chunk. The header is 16 bytes for 512-byte chunks, 30 bytes for 1KB chunks, and 42 bytes for 2KB chunks. Each file (including directories) includes one header chunk, containing the file's name and permissions, and zero or more data chunks. As in JFFS2, the only mapping information on flash is the content of each chunk, stored as part of the header. This implies that at mount time all the headers must be read from flash to construct the file-ID and file-contents maps and that, as in JFFS, the maps of all the files are stored in RAM at all times.

To save RAM relative to JFFS, YAFFS uses a much more efficient map structure to map file locations to physical flash addresses. The mapping uses a tree structure with 32-byte nodes. Internal nodes contain 8 pointers to other nodes, and leaf nodes contain 16 2-byte pointers to physical addresses. For large flash memories, 16-bit words cannot point to an arbitrary chunk. Therefore, YAFFS uses a scheme that we call approximate pointers. Each pointer value represents a contiguous range of chunks. For example, if the flash contains 2^18 chunks, each 16-bit value represents a range of 4 chunks. To find the chunk that contains the required data block, the system searches the headers of these 4 chunks to find the right one. In essence, approximate pointers work because the data they point to are self-describing. To avoid file-header modification on every append operation, each chunk's header contains the amount of data it carries; upon append, a new tail chunk is written containing the new size of the tail, which, together with the number of full blocks, gives the size of the file.
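
An approximate-pointer lookup might look as follows. This is a sketch under our own naming and assumes the 2^18-chunk example above; YAFFS's actual code and header layout differ.

    #include <stdint.h>

    #define TOTAL_CHUNKS   (1u << 18)
    #define GROUPS         (1u << 16)              /* 16-bit pointers */
    #define CHUNKS_PER_GRP (TOTAL_CHUNKS / GROUPS) /* = 4 */

    struct chunk_header {
        uint32_t file_id;  /* which file this chunk belongs to */
        uint32_t position; /* which block of the file it holds */
    };

    /* Chunk headers, modeled as a flat array for the sketch. */
    static struct chunk_header headers[TOTAL_CHUNKS];

    /* Resolve a 16-bit approximate pointer to an exact chunk number by
     * searching the group's headers; the data is self-describing. */
    static int32_t resolve(uint16_t approx, uint32_t file_id,
                           uint32_t position)
    {
        uint32_t base = (uint32_t)approx * CHUNKS_PER_GRP;
        for (uint32_t i = 0; i < CHUNKS_PER_GRP; i++)
            if (headers[base + i].file_id == file_id &&
                headers[base + i].position == position)
                return (int32_t)(base + i);
        return -1; /* not found: the pointer is stale */
    }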

The first version of YAFFS uses 512-byte chunks and invalidates chunks by clearing a byte in the header. To ensure that random bit errors, which are common in NAND devices, do not cause a deleted chunk to reappear as valid (or vice versa), invalidity is signaled by 4 or more zero bits in the byte, that is, by a majority vote of the bits in the byte (see the sketch below). This requires writing on each page twice before the erase unit is reclaimed.

YAFFS2 uses a slightly more complex arrangement to invalidate chunks of 1 or 2KB. The aim of this modification is to achieve a strictly sequential writing order within erase units, so that erased pages are written one after the other and never rewritten. This is achieved using two mechanisms. First, each chunk's header contains not only the file ID and the position within the file but also a sequence number. The sequence number determines which, among all the chunks representing a single block of a file, is the valid chunk; the rest can be reclaimed. Second, files and directories are deleted by moving them to a trash directory, which implicitly marks all their chunks for garbage collection. When the last chunk of a deleted file is erased, the file itself can be deleted from the trash directory. (Application code cannot rescue files from this trash directory.) The RAM file-contents maps are recreated at mount time by reading the headers of the chunks by sequence number, to ensure that only the valid chunk of each file block is referred to.
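
The majority vote on the invalidity byte amounts to a popcount threshold. A minimal sketch (ours, implementing the 4-zero-bits rule stated above):

    #include <stdint.h>

    /* A deleted 512-byte chunk is marked by clearing bits in a one-byte
     * flag in its header. To tolerate random bit flips, validity is
     * decided by majority vote: 4 or more zero bits mean "deleted". */
    static int chunk_is_deleted(uint8_t flag)
    {
        int zeros = 0;
        for (int i = 0; i < 8; i++)
            if (!(flag & (1u << i)))
                zeros++;
        return zeros >= 4;
    }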

Wear leveling is achieved mostly by infrequent random selection of an erase unit to reclaim. But as in JFFS, most of the time an attempt is made to erase an erase unit that contains no valid data. The authors argue that wear leveling is somewhat less important for NAND devices than for NOR devices. NAND devices are often shipped with bad pages, marked as such in their headers, to improve yield. They also suffer from bit flipping during normal operation, which requires using ECC/EDC codes in the headers. Therefore, the authors of YAFFS argue, file systems and block-mapping techniques must be able to cope with errors and bad pages anyway, so an erase unit that becomes defective due to excessive wear is not particularly exceptional. In other words, uneven wear will lead to loss of storage capacity, but it should not have any other impact on the file system.

3.5. The Trimble File System

The Trimble file system was a NOR file system implemented by Marshall and Manning [1998] for Trimble Navigation (a maker of GPS equipment). Manning is also one of the authors of the more recent YAFFS.

The overall structure of the file system is fairly similar to that of YAFFS. Files are broken into 252-byte chunks, and each chunk is stored with a 4-byte header in a 256-byte flash sector. The 4-byte header includes the file number and the chunk number within the file. Each file also includes a header sector, containing the file number, a valid/invalid word, a file name, and up to 14 file records, only one of which, the last one, is valid. Each record contains the size of the file, a checksum, and the last modification time. Whenever a file is modified, a new record is used, and when they are all used up, a new header sector is allocated and written.

As in YAFFS, all the mapping information is stored in RAM during normal operation, since the flash contains no mapping structures other than the inverse maps.

Erase units are chosen for reclamation based solely on the number of valid sectors that they still contain, and only if they contain no free sectors. Ties are broken by erase counts, to provide some wear leveling. To avoid losing the erase counts during a crash that occurs after a unit is erased but before its header is written, the erase count is written, prior to erasure, into a special sector containing block-erasure information.

Sectors are allocated sequentially within erase units, and the next erase unit to be used is selected from among the available ones based on erase counts, again providing some wear-leveling capability.
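
The 256-byte sector layout can be rendered directly as a C structure. The field widths below are our assumption; the description above fixes only the overall sizes.

    #include <stdint.h>

    /* One 256-byte flash sector: a 4-byte header plus a 252-byte chunk. */
    struct trimble_sector {
        uint16_t file_number;  /* which file the chunk belongs to   */
        uint16_t chunk_number; /* position of the chunk in the file */
        uint8_t  data[252];    /* the chunk itself                  */
    };

    _Static_assert(sizeof(struct trimble_sector) == 256,
                   "header and chunk must fill the sector exactly");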

3.6. The Microsoft Flash File System

In the mid-1990s, Microsoft tried to promote a standardized file system for removable flash memories, which was called FFS2 (we did not find documentation of an earlier version, but we assume one existed). Douglis et al. [1994] report very poor write performance for this system, which is probably the main reason it failed. By 1998, Intel [1998a] listed this solution as obsolete. Current Microsoft documentation (as of Spring 2004) does not mention FFS2 (nor FFS). Microsoft obtained a number of patents on flash file systems, and we assume that the systems described by the patents are similar to FFS2.

The earliest patent [Barrett et al. 1995] describes a file system for NOR flash devices that contain one large erase unit. Therefore, the device is treated as a write-once device, except that bits that were not cleared when an area was first written can be cleared later. The system uses linked lists to represent the files in a directory and the data blocks of a file. When a file is extended, a new record is appended to the end of its block list. This is done by clearing bits in the next field of the last record in the current list; that field was originally left all set, indicating that it was not yet valid (the all-1s bit pattern is considered an invalid pointer).

Updating a range of bytes in a file is more difficult. A file is updated by patching the linked list, as shown in Figure 4. The first record that points to now-invalid data is marked invalid by setting a replacement field to the address of a replacement record (the replacement field is actually called secondary ptr in the patent). This again uses a field that is initially left in the erased state, all 1s. The next field of the replacement record can point back to the original linked list, or it can point to additional new records. But unless the update reaches to the end of the file, the new part of the linked list eventually points back to old records; hence, the list is patched. A record does not contain file data, only a pointer to a run of data. The runs of data are raw and not marked by any special header. This allows a run to be broken into three when data in its middle are updated: three new records will point to the still-valid prefix of the run, to a new replacement run, and to the still-valid suffix.

The main defect in this scheme is that repeated updates to a file lead to longer and longer linked lists that must be traversed to access the data. For example, if the first 10 bytes of a file are updated t times, then a chain of t invalid records must be traversed before we reach the record that points to the most recent data.
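
A record in this scheme might be declared as follows. This is our reading of the patent's description, with invented names; each pointer field starts in the erased all-1s state and is assigned later by clearing bits, which NOR flash permits without an erase.

    #include <stdint.h>

    #define FREE_FIELD 0xFFFFFFFFu /* all 1s: not yet written */

    struct ffs_record {
        uint32_t data;        /* flash offset of a run of raw bytes      */
        uint32_t size;        /* length of the run                       */
        uint32_t next;        /* index of the next record in the list,
                                 or FREE_FIELD                           */
        uint32_t replacement; /* "secondary ptr": index of the record
                                 that supersedes this one, or FREE_FIELD */
    };

    /* Follow a record to its current version; chains of replaced
     * records grow with every update, which is the defect described
     * in the text. */
    static const struct ffs_record *
    current(const struct ffs_record *records, uint32_t i)
    {
        while (records[i].replacement != FREE_FIELD)
            i = records[i].replacement;
        return &records[i];
    }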

Fig. 4. The data structures of the Microsoft Flash File System. The data structure near the top shows a linked-list element pointing to a block of 20 raw bytes in a file. The bottom figure shows how the data structure is modified to accommodate an update of 5 of these 20 bytes. The data and next-in-list pointers of the original node are invalidated. The replacement pointer, which was originally free (all 1s; marked in gray in the figure), is set to point to a chain of 3 new nodes, two of which point to still-valid data within the existing block and one of which points to a new block of raw data. The last node in the new chain points back to the tail of the original list.

The cause of this defect is the attempt to keep objects (files and directories) at static addresses. For example, the header of the device, which is written once and for all, contains a pointer to the root directory, as it does in conventional file systems. This arrangement makes it easy to find things, but requires traversing long chains of invalid data to find current data. The log-structured approach, where objects are moved when they are updated, makes it more difficult to find things; but once an object is found, no invalid data needs to be accessed.

Later versions of FFS allowed reclamation of erase units [Torelli 1995]. This is done by associating each data block (a data extent or a linked-list record) with a small descriptor. Memory is allocated contiguously from the low end of the unit, and fixed-size descriptors are allocated from the top end, toward the low end. Each descriptor describes the offset of a data block within the unit, the size of the block, and whether it is valid or obsolete. Pointers to data blocks within the file system are not physical pointers but a concatenation of a logical erase-unit number and a block number within the unit. The block number is the index of the descriptor of the block. This allows the actual blocks to be moved and compacted when the unit is reclaimed; only the valid descriptors need to retain their position in the new unit. This does not cause much fragmentation, because descriptors are small and uniform in size (6 bytes). This system is described in three patents [Krueger and Rajagopalan 1999a, 1999b, 2001].

It is possible to reclaim linked-list records in this system, but it is not trivial. From Torelli's article [1995], it seems that at least some implementations did do that, thereby not only reclaiming space but also shortening the lists. Suppose that a valid record a points to record b, and that in b, the replacement pointer is used and points to c. This means that c replaced b. When a is moved to another unit during reclamation, it can be modified to point directly to c, and b can be marked obsolete. Obviously, it is also possible to skip multiple replaced records.

This design is highly NOR-specific, both because of the late assignment to the replacement pointers and because of the way that units are filled from both ends, with data from one end and with descriptors from the other.
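
The following sketch shows how such a logical pointer can be dereferenced. The layout follows Torelli's [1995] description (data growing up from the bottom of the unit, 6-byte descriptors growing down from the top), but the types and names are ours.

    #include <stdint.h>

    #define UNITS     32
    #define UNIT_SIZE 65536u

    /* A 6-byte descriptor, allocated downward from the unit's top end. */
    struct descriptor {
        uint16_t offset; /* where the data block starts within the unit */
        uint16_t size;   /* length of the data block, in bytes          */
        uint16_t valid;  /* cleared when the block becomes obsolete     */
    };

    /* Logical units, after the unit-mapping layer has been applied. */
    static uint8_t storage[UNITS][UNIT_SIZE];

    /* A block pointer: (logical erase-unit number, descriptor index).
     * Compaction moves data blocks but keeps descriptor indices stable. */
    struct block_ptr {
        uint16_t unit;
        uint16_t index;
    };

    static const uint8_t *deref(struct block_ptr p)
    {
        const uint8_t *base = storage[p.unit];
        const struct descriptor *d =
            (const struct descriptor *)(base + UNIT_SIZE) - 1 - p.index;
        return base + d->offset; /* physical location, found via index */
    }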

3.7. The Norris Flash File System

Norris Communications Corporation patented a flash file system based on linked lists [Daberko 1998], much like the Microsoft Flash File System. The patent is due to Daberko, who apparently designed and implemented the file system for use in a handheld audio recorder.

3.8. Other Commercial Embedded File Systems

Several other companies offer embedded flash file systems but provide only a few details on their design and implementation.

TargetFFS. Blunk Microsystems offers TargetFFS (http://www.blunkmicro.com), an embedded flash file system, in both NAND and NOR varieties. It works under their own operating system but is designed to be portable to other operating systems and to products without an operating system. The file system uses a POSIX-like API.

Blunk claims that the file system guarantees integrity across unexpected shutdowns, that it levels the wear of the erase units, that the NAND version uses ECC/EDC, and that it is optimized for fast mounts, typically a second for a 64MB file system. The code footprint of the file system is around 60KB, plus about 64KB of RAM.

The company's Web site contains one performance graph showing that the write performance degrades as the volume fills up. No explanation is given, but the likely reason is a drop in the effectiveness of erase-unit reclamation.

smxFFS. This file system from Micro Digital (http://www.smxinfo.com/rtos/fileio/smxffs.htm) only supports nonremovable NAND devices. The file system consists of a block-mapping device driver and a simple FAT-like file system with a flat directory structure. The block-mapping driver assumes that every flash page is associated with a small spare area (16 bytes for 512-byte pages) and that pages can be updated in place three times before erasure. The system performs wear leveling by relocating a fixed number of static blocks whenever the difference between the most and the least worn-out page exceeds a threshold. The default software configuration for a 16MB flash device requires about 168KB of RAM.

EFFS. HCC Embedded (http://www.hcc-embedded.com) offers several flash file systems. EFFS-STD is a full file system. The company makes claims similar to those of Blunk, including claims of integrity, wear leveling, and NOR and NAND capability.

EFFS-FAT and EFFS-THIN are FAT implementations for removable flash memories. The two versions offer similar functionality, except that EFFS-THIN is optimized for 8-bit processors.

FLite. FLite combines a standard FAT file system with an FTL-compatible block-mapping device driver [Dan and Williams 1997]. It is unclear whether this is a current product.

4. BEYOND FILE SYSTEMS

A simple rewritable block device and an abstract file store are not the only abstractions that flash-management software can present to high-level software. This section describes additional services that flash-management software can provide. The first class of services consists of higher-level abstractions than files, the second consists of programs that execute directly from flash, and the third is a mechanism to use flash devices as a persistent main memory.

4.1. Flash-Aware Application-Specific Data Structures

Some applications maintain sophisticated data structures in nonvolatile storage. Search trees, which allow database management systems and other applications to respond quickly to queries, are the most common of these data structures. Normally, such a data structure is stored in a file. In more demanding applications, the data structure might be stored directly on a block device, such as a disk partition, without a file system. Some authors argue that by implementing flash-aware application-specific data structures, performance and endurance can be improved beyond what is achievable over a file system or even over a block device.

Wu et al. [2003a, 2003b] proposed flash-aware implementations of B-trees and R-trees. Their implementations represent a tree node as an ordered set of small items. Items in a set represent individual insertions, deletions, and updates to a node. The items are ordered by time, so the system can construct the current state of a tree node by traversing its set of items. To conserve space, item sets can be compacted when they grow too much. In order to avoid frequent write operations of small amounts of data, new items are collected in RAM and flushed to disk when the RAM buffer fills. To find the items that constitute a node, the system maintains in RAM a linked list of the items of each node in the tree. Hence, it appears that the RAM consumption of these trees can be quite high.
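
The node-reconstruction idea can be sketched in C. This is our own toy rendering in the spirit of Wu et al. [2003a]; the real implementations use richer items and maintain the item lists in RAM, as described above.

    #include <stdint.h>
    #include <string.h>

    #define NODE_CAP 64

    enum item_op { ITEM_INSERT, ITEM_DELETE, ITEM_UPDATE };

    /* One logged modification to a tree node. */
    struct item {
        enum item_op op;
        int32_t      key, value;
    };

    /* The reconstructed, in-memory state of a node. */
    struct node {
        int32_t keys[NODE_CAP], values[NODE_CAP];
        int     count;
    };

    static void apply(struct node *n, const struct item *it)
    {
        int i = 0;
        while (i < n->count && n->keys[i] != it->key)
            i++;
        if (it->op == ITEM_DELETE) {
            if (i < n->count) {
                memmove(&n->keys[i], &n->keys[i + 1],
                        (size_t)(n->count - i - 1) * sizeof(int32_t));
                memmove(&n->values[i], &n->values[i + 1],
                        (size_t)(n->count - i - 1) * sizeof(int32_t));
                n->count--;
            }
            return;
        }
        if (i == n->count) {          /* key not present yet */
            if (n->count == NODE_CAP)
                return;               /* full: a real tree would split */
            n->keys[i] = it->key;
            n->count++;
        }
        n->values[i] = it->value;
    }

    /* The current state of a node is the replay, in time order, of the
     * items that were logged for it. */
    static void rebuild(struct node *n, const struct item *items, int len)
    {
        n->count = 0;
        for (int i = 0; i < len; i++)
            apply(n, &items[i]);
    }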

The main problem with flash-aware application-specific data structures is that they require that the flash device be partitioned. One partition holds the application-specific data structure, another holds other data, usually files. Partitioning the device at the physical level (physical addresses) adversely affects wear leveling, because worn-out units in one partition cannot be swapped with relatively fresh units in the other. The Wu et al. [2003a, 2003b] implementations partition the virtual block address space, so both the tree blocks and the file-system blocks are managed by the same block-mapping mechanism. In other words, their data structures are flash-aware, but they operate at the virtual block level, not at the physical sector level. Another problem with partitioning is the potential for wasting space if the partitions cannot be dynamically resized.

4.2. Execute-in-Place

Code stored in a NOR device, which is directly addressable by the processor, can be executed from the device itself without being copied into RAM. This is known as execute-in-place, or XIP. Unfortunately, execute-in-place raises some difficult problems.

Unless the system uses a virtual-memory mechanism, which requires a hardware memory-management unit, code must be contiguous in flash, and it must not move. This implies that erase units containing code cannot participate in a wear-leveling block-mapping scheme. It also precludes storage of code in files within a read-write file system, since such file systems do not store files contiguously. Therefore, implementing XIP in a system without virtual memory requires partitioning the flash memory, at the physical address level, into a code partition and a data partition. As explained earlier, this accelerates wear.

Even if the system does use virtual memory, implementing XIP is tricky. First, it requires that the block-mapping device driver be able to return the physical addresses that correspond to a given virtual block number. Second, the block-mapping device driver must notify the virtual-memory subsystem whenever memory-mapped sectors are relocated. In addition, using XIP requires that code be stored uncompressed. In some processor architectures, machine code can be effectively compressed; if this is possible, then XIP saves RAM but wastes flash storage.

These reasons have led some to suggest that, at least in systems with large RAM, XIP should not be used [Woodhouse 2001]. Systems with small RAM, where XIP is more important, often do not have a memory-management unit, so their only option is to partition the physical flash address space.

Hardware limitations also contribute to the difficulty of implementing XIP. Many flash devices cannot read from one erase unit while another is being written to or erased. If this is the case, code that might need to run during a write or erase operation cannot be executed in place. To address this issue, some flash devices allow a write or erase operation to be suspended, and other devices allow reads while writing (RWW).

4.3. Flash-Based Main Memories

Wu and Zwaenepoel [1994] describe eNVy, a flash-based nonvolatile main memory. The system was designed to reside on the memory bus of a computer and to service single-word read and write requests. Because this memory is both fast for single-word access and nonvolatile, it can replace both the magnetic disk and the DRAM in a computer. The memory system itself contained a large NOR flash memory that was connected by a very wide bus to a battery-backed static RAM device. Because static RAM is much more expensive than flash, the system combined a large flash memory with a small static RAM. The system also contained a processor with a memory-management unit (MMU) that was able to access both the flash and the internal RAM.

The system partitions the physical address space of the external memory bus into 256-byte pages that are normally mapped by the internal MMU to flash pages. Read requests are serviced directly from this memory-mapped flash. Write requests are serviced by copying a page from the flash to the internal RAM, modifying the internal MMU's state so that the external physical page is now mapped to the RAM, and then performing the word-size write onto the RAM. As long as the physical page is mapped into the RAM, further read and write requests are performed directly on the RAM. When the RAM fills, the oldest page in the RAM is written back to the flash, again modifying the MMU's state to reflect the change.
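
The write path is essentially copy-on-write at page granularity. A rough sketch follows, with the MMU remapping reduced to a pointer swap and page eviction omitted (our simplification, not eNVy's code):

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 256
    #define RAM_PAGES 128

    /* Where a 256-byte page currently lives. */
    struct page_map {
        uint8_t *frame;  /* points into flash or into the SRAM cache */
        int      in_ram;
    };

    static uint8_t sram[RAM_PAGES][PAGE_SIZE]; /* battery-backed SRAM */
    static int     sram_used;                  /* eviction omitted    */

    /* Service a word-size write: if the page is still flash-mapped,
     * copy it into SRAM (one wide-bus transfer) and remap it; then
     * perform the write on the SRAM copy. */
    static void write_word(struct page_map *pg, uint32_t offset,
                           uint32_t word)
    {
        if (!pg->in_ram) {
            if (sram_used == RAM_PAGES)
                return; /* cache full: the oldest page must be evicted */
            memcpy(sram[sram_used], pg->frame, PAGE_SIZE);
            pg->frame  = sram[sram_used++]; /* stands in for MMU remap */
            pg->in_ram = 1;
        }
        memcpy(pg->frame + offset, &word, sizeof word);
    }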

The system works well thanks to the wide bus between the flash and the internal RAM, and thanks to the ability to buffer pages in the RAM for a while. The wide bus allows pages stored on flash to be transferred to RAM in one cycle, which is critical for processing write requests quickly. By buffering pages in the RAM for a while before they are written back to flash, many updates to a single page can be performed with a single RAM-to-flash page transfer. The reduction in the number of flash writes reduces unit erasures, thereby improving performance and extending the system's lifetime.

Using a wide bus has a significant drawback, however. To build a wide bus, Wu and Zwaenepoel [1994] used many flash chips in parallel. This made the effective size of erase units much larger. Large erase units are harder to manage and, as a result, are prone to accelerated wear.

5. SUMMARY

Flash memories have been an enabling technology for the introduction of computers into numerous handheld devices. A decade ago, flash memories were used mostly in boot loaders (BIOS chips) and as disk replacements for ruggedized computers. Today, flash memories are also used in mobile phones and PDAs, portable music players and audio recorders, digital cameras, USB memory devices, remote controls, and more. Flash memories provide these devices with fast and reliable storage capabilities thanks to the sophisticated data structures and algorithms that this article surveys.

In general, the challenges posed by newer flash devices are greater than those posed by older devices: the devices are becoming harder to use. This happens because flash hardware technology is driven mostly by the desire for increased capacity and performance, often at the expense of ease of use. This trend requires the development of new software techniques and new system architectures for new types of devices.

Unfortunately, many of these techniques are described only in patents, not in technical articles. Although the purpose of patents is to reveal inventions, they suffer from three disadvantages relative to technical articles in journals and conference proceedings. First, a patent almost never contains a realistic assessment of the technique that it presents. A patent usually contains no quantitative comparison to alternative techniques and no theoretical analysis. Although patents often do contain a qualitative comparison to alternatives, the comparison is almost always one-sided, describing the advantages of the new invention but not its potential disadvantages. Second, patents often describe a prototype, not how the invention is used in actual products. This again reduces the reader's ability to assess the effectiveness of specific techniques. Third, patents are sometimes harder to read than articles in the technical and scientific literature.

Our aim has been to survey flash-management techniques in order to provide both practitioners and researchers with a broad overview of existing techniques. In particular, by surveying technical articles, patents, and corporate literature, we ensure thorough coverage of all the relevant techniques. We hope that our article will encourage researchers to analyze these techniques, both theoretically and experimentally. We also hope that this article will facilitate the development of new and improved flash-management techniques.

REFERENCES

ALEPH ONE. 2002. YAFFS: Yet another flash filing system. Cambridge, UK. Available at http://www.aleph1.co.uk/yaffs/index.html.
ASSAR, M., NEMAZIE, S., AND ESTAKHRI, P. 1995a. Flash memory mass storage architecture. US patent 5,388,083. Filed March 26, 1993; Issued February 7, 1995; Assigned to Cirrus Logic.
ASSAR, M., NEMAZIE, S., AND ESTAKHRI, P. 1995b. Flash memory mass storage architecture incorporating wear leveling technique. US patent 5,479,638. Filed March 26, 1993; Issued December 26, 1995; Assigned to Cirrus Logic.
ASSAR, M., NEMAZIE, S., AND ESTAKHRI, P. 1996. Flash memory mass storage architecture incorporating wear leveling technique without using CAM cells. US patent 5,485,595. Filed October 4, 1993; Issued January 16, 1996; Assigned to Cirrus Logic.
AXIS COMMUNICATIONS. 2004. JFFS home page. Lund, Sweden. Available at http://developer.axis.com/software/jffs/.
BAN, A. 1995. Flash file system. US patent 5,404,485. Filed March 8, 1993; Issued April 4, 1995; Assigned to M-Systems.
BAN, A. 1999. Flash file system optimized for page-mode flash technologies. US patent 5,937,425. Filed October 16, 1997; Issued August 10, 1999; Assigned to M-Systems.
BAN, A. 2004. Wear leveling of static areas in flash memory. US patent 6,732,221. Filed June 1, 2001; Issued May 4, 2004; Assigned to M-Systems.
BARRETT, P. L., QUINN, S. D., AND LIPE, R. A. 1995. System for updating data stored on a flash-erasable, programmable, read-only memory (FEPROM) based upon predetermined bit value of indicating pointers. US patent 5,392,427. Filed May 18, 1993; Issued February 21, 1995; Assigned to Microsoft.

CHANG, L.-P., KUO, T.-W., AND LO, S.-W. 2004. Real-time garbage collection for flash-memory storage systems of real-time embedded systems. ACM Trans. Embed. Comput. Syst. 3, 4, 837–863.
CHIANG, M.-L. AND CHANG, R.-C. 1999. Cleaning policies in mobile computers using flash memory. J. Syst. Softw. 48, 3, 213–231.
CHIANG, M.-L., LEE, P. C., AND CHANG, R.-C. 1999. Using data clustering to improve cleaning performance for flash memory. Softw. Pract. Exp. 29, 3, 267–290.
DABERKO, N. 1998. Operating system including improved file management for use in devices utilizing flash memory as main memory. US patent 5,787,445. Filed March 7, 1996; Issued July 28, 1998; Assigned to Norris Communications.
DAN, R. AND WILLIAMS, J. 1997. A TrueFFS and FLite technical overview of M-Systems' flash file systems. Tech. rep. 80-SR-002-00-6L Rev. 1.30, M-Systems.
DOUGLIS, F., CACERES, R., KAASHOEK, M. F., LI, K., MARSH, B., AND TAUBER, J. A. 1994. Storage alternatives for mobile computers. In Proceedings of the 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI). Monterey, CA. 25–37.
HAN, S.-W. 2000. Flash memory wear leveling system and method. US patent 6,016,275. Filed November 4, 1998; Issued January 18, 2000; Assigned to LG Semiconductors.
INTEL CORPORATION. 1998a. Flash file system selection guide. Application Note 686, Intel Corporation.
INTEL CORPORATION. 1998b. Understanding the flash translation layer (FTL) specification. Application Note 648, Intel Corporation.
JOU, E. AND JEPPESEN III, J. H. 1996. Flash memory wear leveling system providing immediate direct access to microprocessor. US patent 5,568,423. Filed April 14, 1995; Issued October 22, 1996; Assigned to Unisys.
KAWAGUCHI, A., NISHIOKA, S., AND MOTODA, H. 1995. A flash-memory based file system. In Proceedings of the USENIX Technical Conference. New Orleans, LA. 155–164.
KIM, H.-J. AND LEE, S.-G. 2002. An effective flash memory manager for reliable flash memory space management. IEICE Trans. Inform. Syst. E85-D, 6, 950–964.
KRUEGER, W. J. AND RAJAGOPALAN, S. 1999a. Method and system for file system management using a flash-erasable, programmable, read-only memory. US patent 5,634,050. Filed June 7, 1995; Issued May 27, 1999; Assigned to Microsoft.
KRUEGER, W. J. AND RAJAGOPALAN, S. 1999b. Method and system for file system management using a flash-erasable, programmable, read-only memory. US patent 5,898,868. Filed June 7, 1995; Issued April 27, 1999; Assigned to Microsoft.
KRUEGER, W. J. AND RAJAGOPALAN, S. 2001. Method and system for file system management using a flash-erasable, programmable, read-only memory. US patent 6,256,642. Filed January 29, 1992; Issued July 3, 2001; Assigned to Microsoft.
LOFGREN, K. M., NORMAN, R. D., THELIN, G. B., AND GUPTA, A. 2003. Wear leveling techniques for flash EEPROM systems. US patent 6,594,183. Filed June 30, 1998; Issued July 15, 2003; Assigned to Sandisk and Western Digital.
LOFGREN, K. M. J., NORMAN, R. D., THELIN, G. B., AND GUPTA, A. 2000. Wear leveling techniques for flash EEPROM systems. US patent 6,081,447. Filed March 25, 1999; Issued June 27, 2000; Assigned to Western Digital and Sandisk.
MARSHALL, J. M. AND MANNING, C. D. H. 1998. Flash file management system. US patent 5,832,493. Filed April 24, 1997; Issued November 3, 1998; Assigned to Trimble Navigation.
PARKER, K. W. 2003. Portable electronic device having a log-structured file system in flash memory. US patent 6,535,949. Filed April 19, 1999; Issued March 18, 2003; Assigned to Research In Motion.
ROSENBLUM, M. AND OUSTERHOUT, J. K. 1992. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10, 1, 26–52.
SELTZER, M. I., BOSTIC, K., MCKUSICK, M. K., AND STAELIN, C. 1993. An implementation of a log-structured file system for UNIX. In USENIX Winter Conference Proceedings. San Diego, CA. 307–326.
SMITH, K. B. AND GARVIN, P. K. 1999. Method and apparatus for allocating storage in flash memory. US patent 5,860,082. Filed March 28, 1996; Issued January 12, 1999; Assigned to Datalight.
TORELLI, P. 1995. The Microsoft flash file system. Dr. Dobb's Journal 20, 62–72.
WELLS, S., HASBUN, R. N., AND ROBINSON, K. 1998. Sector-based storage device emulator having variable-sized sector. US patent 5,822,781. Filed October 30, 1992; Issued October 13, 1998; Assigned to Intel.
WELLS, S. E. 1994. Method for wear leveling in a flash EEPROM memory. US patent 5,341,339. Filed November 1, 1993; Issued August 23, 1994; Assigned to Intel.
WOODHOUSE, D. 2001. JFFS: The journaling flash file system. Ottawa Linux Symposium (July). Available at http://sources.redhat.com/jffs2/jffs2.pdf.
WU, C.-H., CHANG, L.-P., AND KUO, T.-W. 2003a. An efficient B-tree layer for flash-memory storage systems. In Proceedings of the 9th International Conference on Real-Time and Embedded Computing Systems and Applications (RTCSA). Tainan City, Taiwan. Lecture Notes in Computer Science, vol. 2968. Springer, 409–430.

WU, C.-H., CHANG, L.-P., AND KUO, T.-W. 2003b. An efficient R-tree implementation over flash-memory storage systems. In Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems. New Orleans, LA. 17–24.
WU, M. AND ZWAENEPOEL, W. 1994. eNVy: A non-volatile, main memory storage system. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems. San Jose, CA. 86–97.

Received July 2004; accepted August 2005
