
Chameleon: A High Performance Flash/FRAM Hybrid Solid State Disk Architecture

Jin Hyuk Yoon, Eyee Hyun Nam, Yoon Jae Seong, Hongseok Kim, Bryan S. Kim, Sang Lyul Min, and Yookun Cho
School of Computer Science and Engineering, Seoul National University, Seoul, Korea
[email protected], {ehnam, yjsung, hskim, sjkim, symin}@archi.snu.ac.kr, [email protected]

Abstract— Flash memory solid state disk (SSD) is gaining popularity and replacing hard disk drive (HDD) in mobile computing systems such as ultra mobile PCs (UMPCs) and notebook PCs because of lower power consumption, faster random access, and higher shock resistance. One of the key challenges in designing a high-performance flash memory SSD is an efficient handling of small random writes to non-volatile data, whose performance suffers from the inherent limitation of flash memory that prohibits in-place update. In this paper, we propose a high performance Flash/FRAM hybrid SSD architecture called Chameleon. In Chameleon, metadata used by the flash translation layer (FTL), a software layer in the flash memory SSD, is maintained in a small FRAM since this metadata is a target of intensive small random writes, whereas the bulk data is kept in the flash memory. Performance evaluation based on an FPGA implementation of the Chameleon architecture shows that the use of FRAM in Chameleon improves the performance by 21.3 %. The results also show that even for bulk data that cannot be maintained in FRAM because of the size limitation, the use of fine-grained write buffering is critically important because of the inability of flash memory to perform in-place update of data.

This work was supported by the IT R&D program of MIC/IITA [2006-S-040-01, Development of Flash Memory-based Embedded Multimedia Software]. Manuscript submitted: 31-Oct-2007. Manuscript accepted: 27-Nov-2007. Final manuscript received: 29-Nov-2007.

I. INTRODUCTION

Flash memory, a non-volatile memory whose origin is EEPROM, is increasingly being used as a storage medium in mobile devices because of low power consumption, fast random access, and high shock resistance. Recently, flash memory solid state disk (SSD), which provides an interface identical to that of hard disk drive (HDD) out of flash memory, is gaining popularity and replacing HDD in mobile computing systems such as ultra mobile PCs (UMPCs) and notebook PCs because of its advantages inherited from flash memory.

NAND flash memory, the type of flash memory used for storage applications, consists of a set of physical blocks, each of which contains a set of pages [10]. Currently the most popular physical block size is 128 KB, consisting of 64 pages, each with a 2 KB data area and a 64-byte spare area. There are three operations to a NAND flash memory chip: read page, program page, and erase block. The read page operation, given the physical block number and the page number, returns the contents of the addressed page; it takes about 20 μs, which is more than two orders of magnitude faster than a typical seek time of an HDD. The program page operation writes the supplied contents to the target page and takes about 200 μs. Unlike HDD, which allows overwrites, flash memory cannot perform in-place update of data. Thus, a program page operation should be preceded by an erase operation, which sets all the bits in the target physical block to 1 and takes about 2 ms.

NAND flash memory allows a limited number of bad physical blocks at the manufacturing time, which can be identified by a special marking in them. In addition to these, some physical blocks may become bad at run time. Also, flash memory has a physical limit to the number of times a physical block can be erased and re-written, which necessitates a technique called wear-leveling [4] that tries to even out the erase counts of physical blocks.
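To make these operations and their constraints concrete, the following C sketch (our own illustration; the types, names, and latency constants merely restate the figures above and are not from any vendor API) models a NAND block:

    #include <stdint.h>
    #include <string.h>

    #define PAGES_PER_BLOCK 64
    #define PAGE_DATA_SIZE  2048    /* 2 KB data area per page */

    /* Approximate latencies from the text (microseconds). */
    #define T_READ_US   20          /* read page            */
    #define T_PROG_US   200         /* program page         */
    #define T_ERASE_US  2000        /* erase block (~2 ms)  */

    typedef struct {
        uint8_t data[PAGE_DATA_SIZE];
        int     erased;             /* 1 if all bits are still 1 */
    } page_t;

    typedef struct {
        page_t   pages[PAGES_PER_BLOCK];
        uint32_t erase_count;       /* relevant for wear-leveling */
    } block_t;

    /* read page: returns the contents of the addressed page (~20 us). */
    void read_page(const block_t *b, int page, uint8_t *out) {
        memcpy(out, b->pages[page].data, PAGE_DATA_SIZE);
    }

    /* program page: legal only on an erased page; no in-place update. */
    int program_page(block_t *b, int page, const uint8_t *in) {
        if (!b->pages[page].erased)
            return -1;              /* must erase the whole block first */
        memcpy(b->pages[page].data, in, PAGE_DATA_SIZE);
        b->pages[page].erased = 0;
        return 0;                   /* ~200 us on real hardware */
    }

    /* erase block: sets all bits in the block to 1 (~2 ms). */
    void erase_block(block_t *b) {
        for (int i = 0; i < PAGES_PER_BLOCK; i++) {
            memset(b->pages[i].data, 0xFF, PAGE_DATA_SIZE);
            b->pages[i].erased = 1;
        }
        b->erase_count++;
    }

Note how program_page refuses a non-erased page: this is the no-in-place-update constraint that the rest of the paper works around.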
Most system software that uses storage devices, including file systems and database systems, assumes a block device interface, which essentially abstracts the HDD interface. To use flash memory as a storage device, we therefore need to emulate the functionality of HDD to provide a block device interface. A software layer that provides such emulation is called the flash translation layer (FTL) [4]; it hides the peculiarities of flash memory and gives an illusion of an HDD.

In this paper, we propose an SSD architecture called Chameleon that uses, in addition to NAND flash memory, a small amount of Ferroelectric RAM (FRAM) [11], which is non-volatile and allows fast random reads and (in-place) updates. Moreover, FRAM is compatible with CMOS logic and its cell size is comparable to that of SRAM; in fact, there is already a commercial microcontroller with FRAM embedded in it [8]. In Chameleon, we use the knowledge of the FTL in order to separate data between the FRAM (FTL metadata) and the NAND flash memory (bulk data).

The contributions of this paper are two-fold: (1) we design and implement a Flash/FRAM hybrid SSD that significantly improves the performance while introducing only a small additional cost by combining the best features of NAND flash memory (low cost) and FRAM (fast random read/non-volatile write), and (2) we evaluate, based on a real SSD implementation, the effectiveness of write buffering schemes that are critically important in flash memory SSDs to minimize the performance degradation resulting from the inability of flash memory to perform in-place update of data.

II. CHAMELEON HYBRID SSD ARCHITECTURE

A. Chameleon SSD Architecture Overview

Fig. 1. Chameleon SSD architecture.

Fig. 1 shows the overall architecture of the Chameleon SSD. In Chameleon, the embedded processor together with SRAM provides an execution environment for the FTL. The Chameleon SSD uses NAND flash memory and FRAM to manage various types of non-volatile data, whose details will be explained next. The host interface implements device-side storage system protocols such as ATA and SCSI. The flash controller translates a high-level flash memory command from the FTL into a sequence of low-level flash memory operations such as read page, program page, and erase block, and directs them to the target NAND flash memory chips.

The extended DMA between the host interface and the flash controller provides a high-bandwidth data channel between the host and flash memory. It contains ECC encoder/decoder units for correcting bit-flip errors from NAND flash memory. The MUX/DEMUX units, also within the extended DMA, implement both bus-level and chip-level interleaving to increase the effective read/write bandwidth of the flash memory subsystem. Both interleavings are transparent to the FTL except that the block size and the page size as seen by the FTL are the physical block size and the physical page size times the degree of interleaving, respectively.
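As a concrete example of this geometry (our own sketch, using the prototype parameters from Section III and assuming the combined degree of interleaving is the product of the two levels):

    /* Sizes as seen by the FTL; 4-way bus x 4-way chip interleaving
       gives 16 chips operating in parallel in the prototype. */
    #define PHYS_BLOCK_SIZE (128 * 1024)               /* 128 KB */
    #define PHYS_PAGE_SIZE  (2 * 1024)                 /* 2 KB   */
    #define BUS_DEGREE      4
    #define CHIP_DEGREE     4
    #define DEGREE          (BUS_DEGREE * CHIP_DEGREE) /* 16 */

    #define FTL_BLOCK_SIZE  (PHYS_BLOCK_SIZE * DEGREE) /* 2 MB  */
    #define FTL_PAGE_SIZE   (PHYS_PAGE_SIZE * DEGREE)  /* 32 KB */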

B. Types of Non-volatile Data in Chameleon

The Chameleon SSD manages the various types of non-volatile data explained below to emulate the functionality of HDD out of flash memory and FRAM.

1) Data blocks: The most important role of the FTL is to maintain the mapping between the storage space visible to the host and the physical blocks in NAND flash memory. For this purpose, the host-visible storage space is divided into logical blocks whose size is equal to the physical block size times the degree of interleaving. Through a mapping table called the block mapping table, each logical block is mapped to a data block that consists of physical blocks within the same interleaving unit. The block mapping table is an important piece of FTL metadata in Chameleon and, since it is subject to small random updates, it is maintained in FRAM.

We could alternatively maintain the block mapping information using the spare area associated with each page in NAND flash memory and construct the block mapping table in SRAM at boot time after scanning the spare areas of all the data blocks. However, such a technique is not suitable for flash memory SSDs, where the capacity is generally larger than 16 GB, because of a scalability problem. For example, even assuming only one spare area read for each physical block at boot time, the minimum boot time will be 2.621 seconds for a 16 GB SSD (16 GB / 128 KB × 20 μs = 2.621 seconds). Moreover, the boot time will increase proportionally with the capacity, causing a serious scalability problem.

2) Write buffers: The Chameleon SSD architecture uses a hybrid write buffering scheme similar to [7] and [12] that combines both block-level write buffers [6] and page-level write buffers [5][13].

Block-level write buffers: In Chameleon, block-level write buffers in NAND flash memory are used to optimize the performance of sequential writes from the host. Each block-level write buffer is associated with a logical block and has a pointer, called the next write pointer, that indicates the next page to be written. Both the associated logical block number and the next write pointer of each block-level write buffer are maintained in FRAM since they are targets of small random writes.

In a block-level write buffer, if the next host write request continues from the next write pointer, a simple append operation (sequential program after the next write pointer) is performed and the next write pointer is updated. In the case where the host write request is made at a location greater than the next write pointer, the data in-between is copied from the data block and then the append operation is performed. In the very unlikely case where the host write request is made at a location smaller than the next write pointer, the logical block is relocated to the page-level write buffer explained below.

Since there are only a finite number of block-level write buffers, a situation is inevitable where all the write buffers are allocated when a new host write request arrives that is not mapped to any of the block-level write buffers. In this case, we select the least recently used (LRU) write buffer block and perform an operation called a block merge, in which the data after the next write pointer is copied from the data block and the write buffer block becomes the data block for the logical block being merged. After the block merge operation, the old data block that is no longer needed is erased and used as the write buffer block for the new host write request.
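The three cases for a write arriving at an already-allocated block-level write buffer, as described above, can be sketched in C as follows (our own illustrative rendering, not the authors' implementation; the helper functions are hypothetical):

    #include <stdint.h>

    /* FRAM-resident state of one block-level write buffer. */
    typedef struct {
        uint32_t logical_block;    /* associated logical block (in FRAM) */
        uint32_t next_write_ptr;   /* next page to program (in FRAM)     */
    } blk_wbuf_t;

    /* Hypothetical helpers assumed to exist elsewhere in the FTL. */
    extern void append_pages(blk_wbuf_t *wb, uint32_t start,
                             uint32_t npages, const uint8_t *data);
    extern void copy_from_data_block(blk_wbuf_t *wb, uint32_t start,
                                     uint32_t npages);
    extern void relocate_to_page_level(blk_wbuf_t *wb, uint32_t start,
                                       uint32_t npages, const uint8_t *data);

    void blk_wbuf_write(blk_wbuf_t *wb, uint32_t start, uint32_t npages,
                        const uint8_t *data)
    {
        if (start == wb->next_write_ptr) {
            /* Case 1: the write continues from the next write pointer:
               simple sequential append. */
            append_pages(wb, start, npages, data);
            wb->next_write_ptr = start + npages;
        } else if (start > wb->next_write_ptr) {
            /* Case 2: the write starts beyond the pointer: first copy
               the in-between pages from the data block, then append. */
            copy_from_data_block(wb, wb->next_write_ptr,
                                 start - wb->next_write_ptr);
            append_pages(wb, start, npages, data);
            wb->next_write_ptr = start + npages;
        } else {
            /* Case 3 (rare): the write lands before the pointer; flash
               cannot update in place, so the logical block is relocated
               to the page-level write buffer. */
            relocate_to_page_level(wb, start, npages, data);
        }
    }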
Page-level write buffers: In Chameleon, page-level write buffers in NAND flash memory aim to optimize the performance of small random writes from the host that cannot be efficiently handled by block-level write buffers. In this write buffering scheme, all the write buffer blocks form a log as in a log-structured file system [9] and data from the host write request is simply appended at the end of the log. Moreover, unlike the block-level write buffers, where mapping information is associated with each write buffer block, the mapping information for page-level write buffers, also kept in FRAM, is maintained for each page.
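A minimal sketch of this log append with its FRAM-resident per-page map might look as follows (our own illustration with hypothetical names and sizes; replenishment of free log pages is described next and is not shown here):

    #include <stdint.h>

    #define NUM_LOGICAL_PAGES 4096      /* hypothetical */
    #define INVALID_POS       0xFFFFFFFFu

    /* Per-page mapping for the page-level write buffer, kept in FRAM:
       pos_of[logical_page] = position of its latest copy in the log. */
    typedef struct {
        uint32_t pos_of[NUM_LOGICAL_PAGES]; /* FRAM-resident map */
        uint32_t tail;                      /* next free log position */
    } page_wbuf_t;

    extern void program_log_page(uint32_t pos, const uint8_t *data);

    /* Append one page of host data at the end of the log; the previous
       copy of the page, if any, simply becomes invalid. */
    void page_wbuf_write(page_wbuf_t *pb, uint32_t logical_page,
                         const uint8_t *data)
    {
        program_log_page(pb->tail, data);     /* append at end of log  */
        pb->pos_of[logical_page] = pb->tail;  /* small in-place FRAM
                                                 update                */
        pb->tail++;
    }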

In the page-level write buffering scheme, two different strategies are used to replenish free pages depending on the number of valid pages in the log. If the number of valid pages in the log is below a given threshold, an operation called garbage collection is performed [4]. In the garbage collection, the page-level write buffer block with the smallest number of valid pages is selected and all the valid pages are copied to the end of the log. Although there are other more intelligent approaches for selecting the write buffer block to be garbage-collected based on cost/benefit analysis, such as those in [2], [5], [9], and [13], this simple approach is sufficient in Chameleon since it also has data blocks and block-level write buffers. After the copy operation, the write buffer block that has been garbage-collected is erased and added to the free space of the log.

On the other hand, if the number of valid pages is above a given threshold, the logical block with the largest number of valid pages in the log is selected and a block merge operation is performed. After the block merge operation, all the pages in the log belonging to the logical block are considered invalid. This block merge operation is intended to reduce the number of valid pages in the log, thereby improving the effectiveness of garbage collection.

C. Processing of Host Requests and Miscellaneous Issues

Host write request processing: With the data blocks and the two types of write buffer explained in the previous subsection, a write request from the host is processed in a straightforward manner. If the logical block to which the request belongs is already in one of the two types of write buffer, the request is directed to that write buffer. Otherwise, the request is directed to one of the two types of write buffer based on the following criterion: if the request starts at a logical block boundary and the number of requested pages is above a given threshold, the request is directed to the block-level write buffer; otherwise, it is directed to the page-level write buffer. This criterion is simple but works reasonably well for most workloads. Moreover, if the decision turns out to be incorrect, the logical block can always be relocated from one type of write buffer to the other.
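The routing criterion just described can be summarized in a short sketch (our own rendering; the threshold value and the lookup helper are hypothetical):

    #include <stdint.h>

    typedef enum { BLOCK_LEVEL, PAGE_LEVEL } wbuf_type_t;

    #define SEQ_PAGE_THRESHOLD 16   /* hypothetical threshold */

    /* Hypothetical lookup: returns nonzero and sets *type if the
       logical block is already write-buffered. */
    extern int find_write_buffer(uint32_t logical_block, wbuf_type_t *type);

    wbuf_type_t route_host_write(uint32_t logical_block,
                                 uint32_t start_page, /* offset in block */
                                 uint32_t npages)
    {
        wbuf_type_t t;

        /* If the logical block is already write-buffered, stay there. */
        if (find_write_buffer(logical_block, &t))
            return t;

        /* Block-aligned and large enough: block-level buffer;
           everything else: page-level buffer. */
        if (start_page == 0 && npages > SEQ_PAGE_THRESHOLD)
            return BLOCK_LEVEL;
        return PAGE_LEVEL;
    }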
Host read request processing: The handling of a read request from the host is similarly straightforward since a logical block can be write-buffered in at most one type of write buffer. For a host read request, the pages currently in one type of write buffer are serviced by that write buffer and the other pages are serviced by the data block associated with the logical block.

Bad block handling: In Chameleon, bad physical blocks at the manufacturing time are mapped out when the SSD is initialized. Similarly, a physical block that becomes bad at run time is dynamically mapped out by the FTL and a re-mapping is performed using a spare physical block reserved at the SSD initialization time.

Wear-leveling: Chameleon uses both implicit and explicit wear-leveling techniques [4] to guarantee an even wear-out of physical blocks in flash memory. In the implicit technique, if a host write request is directed to the block-level write buffer, the free write buffer block with the smallest erase count is used. In the explicit technique, when the SSD is idle, the data block with the smallest erase count is swapped with the free write buffer block with the largest erase count among those in the two types of write buffer, if the difference in erase count between the two is above a given threshold. In practice, these two techniques effectively prevent any single physical block from failing prematurely due to excessive erasures [1].

III. PROTOTYPE IMPLEMENTATION AND PERFORMANCE EVALUATION

A. Implementation Platform

We implemented a prototype of the Chameleon SSD architecture using an in-house development board shown in Fig. 2. The development board has an ARM7TDMI test chip running at 20 MHz and a Xilinx Virtex II (XCV6000) FPGA chip. The FPGA implements most of the functionality of the Chameleon SSD architecture except for the on-board FRAM (2 MB), SRAM (64 KB), and NAND flash. The board has four sockets for four NAND flash memory modules, each of which is on a flash memory bus operating at 24 MHz and has four 1 GB NAND flash memory chips, resulting in a total capacity of 16 GB. The degrees of both bus-level and chip-level interleaving are four and thus 16 chips operate in parallel. The sizes of the block-level and page-level write buffers used in the experiment are 8 MB and 32 MB, respectively. The prototype uses the parallel ATA (PATA, also called IDE/EIDE) interface, whose maximum transfer bandwidth is 66 MB/s in the current implementation.

Fig. 2. Chameleon SSD prototype.

B. Performance Evaluation Method

We used the PCMark04 benchmark program [3] for our performance evaluation. The benchmark consists of four component programs: Windows XP Startup, Application Loading, General HDD Usage, and File Copying. Windows XP Startup replays host read/write requests made during the Windows XP boot-up; Application Loading contains requests made when application programs such as MS Word, Acrobat Reader, and Windows Media Player are launched or exited; General HDD Usage contains those made when application programs such as MS Word, Winzip, Winamp, Internet Explorer, and Picture Viewer are running; finally, File Copying contains those made when approximately 400 MB of files are copied. The specification of the desktop PC used for running these benchmarks is given in Fig. 2.

C. Evaluation Results

To evaluate how much performance gain is made due to the use of FRAM in the Chameleon SSD, we implemented an alternative FTL that does not make any use of FRAM and emulates the reads/writes from/to FRAM by read/modify/write operations using SRAM and page-level write buffers in flash memory.

For a fair comparison, we also used the spare area to maintain mapping information in the Chameleon SSD without FRAM as long as it does not cause the scalability problem explained in Section II.B.1. For example, the mapping information for both block-level and page-level write buffers is maintained in the spare area since their sizes are bounded and do not increase with the capacity. However, the mapping information for data blocks is not maintained in the spare area for the scalability reason; instead, it is maintained in the page-level write buffer. This combination results in a boot time of 1263 ms, whereas the boot time of the Chameleon SSD with FRAM is 6.3 ms.

TABLE I
PERFORMANCE EFFECT OF FRAM

                         Chameleon    Chameleon    % gain
                         w/o FRAM     with FRAM
    PCMark04 HDD Score   10585        12841        21.3 %
    Windows XP Startup   21.2 MB/s    25.0 MB/s    17.9 %
    Application Loading  18.3 MB/s    22.9 MB/s    25.1 %
    General HDD Usage    14.7 MB/s    18.0 MB/s    22.4 %
    File Copying         30.4 MB/s    33.9 MB/s    11.5 %

Table I shows that the performance improvement by using FRAM is 21.3 % according to the PCMark04 score, which is calculated from a weighted sum of the four component benchmark programs based on a typical usage pattern [3]. All the component benchmarks except File Copying show similar performance improvements. In File Copying, sequential writes dominate and since this access pattern generates only a small number of updates to the FTL metadata, the performance improvement is limited.

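To make the cost of this emulation concrete, the following sketch (ours, with hypothetical helpers; not the authors' code) shows how a single small metadata update turns into a page-sized read/modify/write through the page-level write buffer when no FRAM is available:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 2048

    /* Hypothetical helpers for the FRAM-less FTL variant. */
    extern void read_flash_page(uint32_t flash_page, uint8_t *buf);
    extern uint32_t append_to_page_level_log(const uint8_t *buf);

    static uint8_t sram_buf[PAGE_SIZE];  /* working copy held in SRAM */

    /* Updating one 4-byte metadata word without FRAM costs a page read
       plus a page program; returns the new location of the metadata
       page in the log. With FRAM, the same update is a single in-place
       store. */
    uint32_t emulated_fram_write(uint32_t meta_page, uint32_t offset,
                                 uint32_t value)
    {
        read_flash_page(meta_page, sram_buf);             /* read   */
        memcpy(&sram_buf[offset], &value, sizeof value);  /* modify */
        return append_to_page_level_log(sram_buf);        /* write  */
    }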
Fig. 3. Effect of block-level and page-level write buffering. [Bar chart: throughput in MB/s of the four PCMark04 component benchmarks for the four combinations of block-level and page-level write buffering, each enabled (O) or disabled (X).]

Fig. 3 shows PCMark04 results for different combinations of block-level and page-level write buffering in the Chameleon SSD with FRAM. The results show that the use of page-level write buffering is critically important for the performance of a flash memory SSD because of the inability of flash memory to perform in-place update, whereas the use of block-level write buffering has only a marginal effect except for File Copying. In File Copying, sequential accesses dominate and since they can be effectively handled by block-level write buffering, the addition of page-level write buffering gives only a limited performance improvement.

IV. CONCLUSIONS

In this paper, we have proposed a high-performance Flash/FRAM hybrid SSD architecture called Chameleon that cures the inefficiency of flash memory in handling small random writes by using a small amount of FRAM. Performance evaluation based on an FPGA implementation of the Chameleon architecture has shown that the use of FRAM in Chameleon improves the performance by 21.3 %. The results also showed that even for bulk data that cannot be maintained in FRAM because of the size limitation, the use of write buffering at the page level is critically important because of the inability of flash memory to perform in-place update.

ACKNOWLEDGEMENTS

The authors would like to thank the reviewers for their constructive comments and suggestions.

REFERENCES

[1] A. Ben-Aroya and S. Toledo, “Competitive analysis of flash-memory algorithms,” in Proc. 14th Annual European Symposium on Algorithms, Springer Lecture Notes in Computer Science, Vol. 4168, pp. 100-111, Sep. 2006.
[2] M.-L. Chiang, P. C. H. Lee, and R.-C. Chang, “Using data clustering to improve cleaning performance for flash memory,” Software: Practice and Experience, Vol. 29, No. 3, pp. 267-290, Mar. 1999.
[3] Futuremark Corporation, “PCMark04 whitepaper,” http://www.futuremark.com/.
[4] E. Gal and S. Toledo, “Algorithms and data structures for flash memories,” ACM Computing Surveys, Vol. 37, No. 2, pp. 138-163, Jun. 2005.
[5] A. Kawaguchi, S. Nishioka, and H. Motoda, “A flash-memory based file system,” in Proc. USENIX 1995 Winter Technical Conf., 1995, pp. 155-164.
[6] J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho, “A space-efficient flash translation layer for CompactFlash systems,” IEEE Trans. Consumer Electronics, Vol. 48, No. 2, pp. 366-375, May 2002.
[7] S.-W. Lee, D.-J. Park, T.-S. Chung, D.-H. Lee, S. Park, and H.-J. Song, “A log buffer-based flash translation layer using fully associative sector translation,” ACM Trans. Embedded Computing Systems, Vol. 6, No. 3, Jul. 2007.
[8] Ramtron International Corp., Versa 8051 microcontrollers, http://www.ramtron.com/doc/Products/Microcontroller/.
[9] M. Rosenblum and J. Ousterhout, “The design and implementation of a log-structured file system,” ACM Trans. Computer Systems, Vol. 10, No. 1, pp. 26-52, Feb. 1992.
[10] Samsung Electronics, NAND flash memory data sheets, http://www.samsung.com/.
[11] A. Sheikholeslami and P. G. Gulak, “A survey of circuit innovations in ferroelectric random-access memories,” Proc. IEEE, Vol. 88, pp. 667-689, May 2000.
[12] C.-H. Wu and T.-W. Kuo, “An adaptive two-level management for the flash translation layer in embedded systems,” in Proc. 2006 IEEE/ACM International Conf. Computer-Aided Design (ICCAD 2006), 2006, pp. 601-606.
[13] M. Wu and W. Zwaenepoel, “eNVy: a non-volatile, main memory storage system,” in Proc. Sixth International Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), 1994, pp. 86-97.