
University of Warsaw Faculty of Mathematics, Informatics and Mechanics

Piotr Sarna Student no. 321169

Bypassing the limit of RAM capacity in distributed file system LizardFS

Master's thesis in COMPUTER SCIENCE

Supervisor: dr Janina Mincer-Daszkiewicz Institute of Informatics

June 2016

Supervisor's statement

Hereby I confirm that the presented thesis was prepared under my supervision and that it fulfils the requirements for the degree of Master of Computer Science.

Date Supervisor's signature

Author's statement

Hereby I declare that the presented thesis was prepared by me and none of its contents was obtained by means that are against the law. The thesis has never before been a subject of any procedure of obtaining an academic degree. Moreover, I declare that the present version of the thesis is identical to the attached electronic version.

Date Author's signature

Abstract

Distributed file systems with a centralized server are constrained by many boundaries, one of which is limited RAM capacity. This particular issue can potentially be bypassed with the help of fast solid-state disks. My research will be based upon creating a library of alternative memory allocation methods, which would utilize an underlying drive (e.g. SSD, ZRAM) for storage, using RAM only as a cache. The next step would be to implement the library as the primary memory allocator in LizardFS, an open-source, distributed file system. Results could then be verified and benchmarked on real-life installations.

Keywords: distributed, file system, storage, memory, optimization, SSD, ZRAM, LizardFS, library, C++

Thesis domain (Socrates-Erasmus subject area codes)

11.3 Computer Science

Subject classification

Software and its engineering → Software organization and properties → Contextual software domains → Operating systems → Allocation / deallocation strategies
Information systems → Information storage systems → Storage architectures → Distributed storage

Tytuł pracy w języku polskim

Obejście limitu pamięci RAM w rozproszonym systemie plików LizardFS

Contents

Introduction ...... 5

1. Background ...... 7
   1.1. Introduction to LizardFS ...... 7
        1.1.1. Architecture ...... 7
        1.1.2. Resource consumption ...... 8
        1.1.3. Memory of the master server ...... 9
        1.1.4. Existent memory optimizations ...... 10

2. Inspiration ...... 15
   2.1. Introduction to SSDAlloc ...... 15
        2.1.1. Motivation ...... 15
        2.1.2. Keywords ...... 15
        2.1.3. Architecture ...... 16
   2.2. Conclusions ...... 17

3. Design and implementation ...... 19
   3.1. Design ...... 19
   3.2. Choice of programming language ...... 19
   3.3. Used libraries and portability ...... 20
        3.3.1. Linux-specific headers ...... 20
        3.3.2. BerkeleyDB ...... 21
   3.4. User-space memory management ...... 21
        3.4.1. Details ...... 21
        3.4.2. Performance of signal-based management ...... 23
   3.5. Implementation details ...... 24
        3.5.1. Application layer ...... 24
        3.5.2. Runtime layer ...... 26
        3.5.3. Storage layer ...... 28

4. Choice of storage ...... 29
   4.1. HDD ...... 29
        4.1.1. Introduction ...... 29
        4.1.2. Access time ...... 30
        4.1.3. Prices ...... 30
        4.1.4. Benchmarks ...... 30
   4.2. SSD ...... 31
        4.2.1. Introduction ...... 31
        4.2.2. Access time ...... 32
        4.2.3. Prices ...... 32
        4.2.4. Benchmarks ...... 32
   4.3. In-kernel compression ...... 32
        4.3.1. Introduction ...... 32
        4.3.2. Implementations ...... 33
        4.3.3. Usage examples ...... 33
        4.3.4. Compression algorithms ...... 34
        4.3.5. Benchmarks ...... 34
   4.4. Comparison ...... 35
        4.4.1. Price-oriented choice ...... 36
        4.4.2. Performance-oriented choice ...... 36
        4.4.3. Summary ...... 36

5. Integration with LizardFS ...... 39
   5.1. Build configuration ...... 39
   5.2. Code changes ...... 39
   5.3. Summary ...... 41

6. Test results ...... 43
   6.1. Parameters ...... 43
   6.2. Configurations ...... 44
   6.3. ...... 44
   6.4. Result sheets ...... 45
        6.4.1. without fsalloc ...... 45
        6.4.2. single stream zram ...... 45
        6.4.3. multiple stream zram ...... 46
        6.4.4. HDD ...... 47
        6.4.5. SSD ...... 48
   6.5. Combined results ...... 49
   6.6. In-kernel compressed memory ...... 51
   6.7. Profiling ...... 51
        6.7.1. Results ...... 51
        6.7.2. Conclusions ...... 52

7. Conclusions ...... 57
   7.1. Evaluation ...... 57
   7.2. Perspectives ...... 58
   7.3. Epilogue ...... 58

A. LizardFS graphs ...... 59

B. Contents of the attached CD ...... 65

Bibliography ...... 67

Introduction

The global need for maintaining very large amounts of data is still increasing. The storage sector currently enjoys the benefits of having many independent distributed file systems designed to store petabytes of data safely and efficiently. Unfortunately, these file systems have their own limitations: CPU and RAM usage, power consumption, network latencies, keeping consistency of distributed data. One of the interesting architectures is a "file system with a centralized metadata server". Its huge advantage is not having to worry about metadata consistency, which is a non-trivial problem to cope with. On the other hand, memory and CPU usage is centralized as well, which leads to very high hardware requirements for large installations. Some systems (e.g. LizardFS [13]) implement a policy of keeping all metadata in random-access memory. By that means, as long as there is enough memory available, metadata operations are performed quickly, without any need to use disk I/O. It is worth noting that metadata operations are very frequent on a POSIX-compliant file system. The most heavily used one is the lookup call, which is used to extract information about a file (more specifically, an inode) from its name. Ensuring that this operation is fast should be the primary goal of a distributed file system developer.

This performance-oriented policy has one big flaw, which is extremely high RAM consumption. An attempt to reduce the RAM constraint without rewriting the whole application is the main topic of this thesis. Princeton University researchers Anirudh Badam and Vivek S. Pai presented an interesting idea on how to move the burden of dynamically allocated memory from RAM to solid-state hard drives [1, 3]. The project is called SSDAlloc and its goal is to intercept the allocation process and keep the data on a designated SSD device, using RAM only as a temporary cache for frequently used pages. Solid-state drives are not the only reasonable way to store allocated data, though. Another noteworthy idea is to take advantage of compressed RAM devices, e.g. the ZRAM module [15] available in the Linux kernel.

The thesis contains a documented implementation of a memory management library, fsalloc, designed to overload traditional malloc/new calls with ones that keep allocated data on an underlying storage back end. The library was injected into the distributed file system LizardFS in order to compare its performance and memory usage with and without fsalloc.

The thesis consists of 7 chapters:

1. Background - introduction to the open-source distributed file system LizardFS and its memory management.

2. Inspiration - introduction to the SSDAlloc project and its summary.

3. Design and implementation - implementation details of fsalloc, the memory management library.

4. Choice of storage - review of storage back ends that could be used by fsalloc.

5. Integration with LizardFS - steps performed to introduce fsalloc as a memory manager for frequently used structures in the LizardFS system.

6. Test results - performance results of experiments carried out on production-ready installations.

7. Conclusions - comments regarding performance results and future development opportunities.

The electronic version of this thesis, the implementation of the fsalloc library and detailed performance results are available on the attached CD.

Chapter 1

Background

An allocation library is not a fully standalone product: it is used along with an application that benefits from its routines. In this particular case the application is a distributed file system cluster with high resource consumption - fsalloc is to be integrated with LizardFS and tested in terms of both correctness and performance. In order to properly understand the design decisions, approaches and other details, an overall description of LizardFS, the background for the whole project, is presented in this chapter.

1.1. Introduction to LizardFS

LizardFS [13] is an open-source, distributed file system. The project's goal is to deliver a fault-tolerant, highly available, scalable and POSIX-compliant alternative to other solutions. The source code is freely available and can be found at https://github.com/lizardfs/lizardfs.

1.1.1. Architecture

Conceptually, LizardFS can be divided into 4 parts:

1. metadata server, which can be either of:

   • master,
   • shadow,
   • metalogger,

2. chunkserver,

3. mount,

4. client-side extra features, namely:

   • CGI server,
   • console administration tool.

Master is a centralized metadata management process. It holds all metadata present in the system in RAM, which makes it a primary target for memory usage optimization. Clients gather all needed information about files from the master before performing any reads/writes. Chunkservers are managed from the master server only - decisions on performing a replication between chunkservers or getting rid of certain chunks are commissioned directly from the master.

Shadow is an online copy of the master server, ready to be promoted to master once a failure occurs. Shadows are used in high-availability solutions and keep the same memory structure as masters, so they are a target of RAM optimization as well.

Metalogger is an offline copy of the system's metadata and is not memory-consuming, so it will not be discussed further in this thesis.

Chunkserver is responsible for:

• keeping chunks of files on internal storage,

• ensuring that all chunks hold valid information (via checksum calculation),

• sending and receiving file data from clients.

All files in the system are divided into chunks that are up to 64 MiB in size. Chunks are then distributed over chunkservers in order to keep a safe number of copies and follow the rules of georeplication. Each file holds information on how many copies should be kept and on which groups of chunkservers. This information as a whole is called a goal.

Mount represents the most important client-side feature. It enables users to create a mountpoint and use LizardFS transparently as a regular file system. On Linux, LizardFS is available through the FUSE (Filesystem in Userspace) mechanism, a widely known kernel driver which allows custom file systems to be developed entirely in userspace. Additionally, LizardFS-specific functionalities are provided via a set of external binaries. Examples:

• setting a goal (replication policy for a file's data),

• creating a snapshot (copy-on-write duplicate of a file),

• managing user quotas.

In order to access files, the client contacts the master for vital metadata information, and, if need be, requests data directly from chunkservers. Figures A.1, A.2, A.3 included in appendix A show how data and metadata flow between servers on I/O operations.

Client-side extra features are created for the convenience of LizardFS administrators. Various system statistics can be viewed via a web browser thanks to the CGI server. Maintenance information can also be extracted directly from a command-line tool.

1.1.2. Resource consumption

Components of LizardFS have different needs in terms of resources offered by the hardware.

Master server is both RAM and CPU sensitive. It occupies large amounts of memory, since all system metadata is stored there. CPU time is spent on serving requests, recalculating various statistics and deciding whether chunks present in the system need to be replicated/checked/deleted/moved. It is highly recommended to reserve a dedicated physical machine for the master server to run on, if possible.

Chunkserver is constrained by I/O throughput. It extensively performs read/write operations on storage. It is discouraged to share disks between chunkservers and other processes.

Mount and other client-side features are not considered resource-demanding.

In this thesis, attention will be focused on reducing the master server's dependence on RAM capacity, bearing in mind that it must not be done at the expense of very high CPU overhead.

1.1.3. Memory of the master server

Metadata kept in the master server is internally held as in-memory structures. Information about each file is kept in its inode, represented by the fsnode structure, while paths (with semantics similar to dentries) are stored in the fsedge structure. The fsnode contains various details of every file, especially:

• inode id,

• creation time, access time, modification time,

• ids of owner and group,

• goal,

• parent fsedge,

• child fsedges (if the file is a directory).

The structure's overall size can be rounded to 128 B. The size of the fsedge structure is hard to predict, since its main component is the file name, which is variable in size. One more structure that highly affects memory usage is struct chunk, which holds the metadata of each data chunk. The mentioned metadata consists of the chunk's location, its safety policies and the number of replicas present in the system. All three structures are kept in hash tables. They create the heaviest pressure on memory usage, since their number increases linearly as more files appear in the installation. In order to visualize the importance of these structures, let us see in table 1.1 how much RAM an installation would need to store fsnodes alone and all file-related structures.

Number of files     fsnode overhead   all structures overhead
1                   128 B             500 B
1 000               128 KB            500 KB
1 000 000           128 MB            500 MB
100 000 000         12.8 GB           50 GB
1 000 000 000       128 GB            500 GB

Table 1.1: Average memory overhead of the master server

Overall RAM usage for LizardFS is 500 B per average file. The extra usage is caused by several factors, for example:

• data layer, i.e. chunks belonging to a file,

• internal helper structures of LizardFS (caches, temporary buffers),

• additional file properties: extended attributes (xattr), access control lists (ACL).

It must be clarified that the statistics above apply to an average file on an average installation. It is absolutely possible to drain a significant amount of system resources with only a few files. For instance, the assumed average file has 1 name, which is about 16 characters long. Extended attributes and access control lists are considered rare.

A good way to observe the memory usage of a process is to create and browse its heap profile. A heap profile is a visual representation of used memory with various statistics, the most important one being the percentage of heap used per function call. Profiling a heap requires gathering internal allocation data of a process and is usually connected with overriding library calls - malloc, brk, etc. As it happens, the Google perftools package [9] provides an efficient and handy heap profiler along with its flagship product, the tcmalloc allocation library [19]. Results of a profiling session executed on a running LizardFS master server are presented in appendix A.4. For clarification, the most resource-consuming symbols are explained below.

_ZN6fsnodeC4Eh is a symbol for the fsnode class constructor. It was responsible for claiming the biggest part of the heap, so it is clearly the best candidate for optimization.

get16bit is a macro used for extracting 2-byte values from binary files and the network. In this case, it was used for loading edges (i.e. file names) from disk.

chunk_malloc is used for allocating memory for the chunk structure.

fs_loadnode is a function that handles reading fsnode definitions from binary files and deserializing them into objects that reside in RAM.

fs_loadedge behaves as above, but it handles file names instead of fsnode definitions.

chunk_new is a constructor of the chunk structure.

From the description above it can be safely concluded that removing the overhead exerted by the fsnode, fsedge and chunk structures from the LizardFS master server's memory will result in bypassing its current RAM boundaries.

1.1.4. Existent memory optimizations

LizardFS implements several useful mechanisms for memory saving:

Size-aware types are used - int8_t, int16_t, uint32_t, etc. instead of regular int, long, unsigned long long. This issue becomes vital when the quantity of objects present in the system reaches millions. The example in listing 1.1 shows a casual definition of structure A and its memory-aware version, struct compact_A. While the first one can occupy, depending on the architecture, 32 bytes, its compact variant occupies exactly 4 bytes. Beyond size-aware types, specifying bit fields (uint8_t is_valid : 1) can pack values smaller than 1 byte together in order to save even more space.

Listing 1.1: Example of a size-aware typed structure

    struct A {
        long long id;
        int order;
        int flags;
        int is_valid;
        int has_acl;
        int has_xattr;
        int is_redundant;
    };

    struct compact_A {
        uint16_t id;
        uint8_t order;
        uint8_t flags : 4;
        uint8_t is_valid : 1;
        uint8_t has_acl : 1;
        uint8_t has_xattr : 1;
        uint8_t is_redundant : 1;
    };

Fields in structures are ordered in a way that minimizes padding. If the compact structure shown in listing 1.1 had been ordered like in listing 1.2, it would have occupied 6 bytes, because, due to padding rules in C/C++, the compiler would insert at least 1 empty byte between the order and id fields, in order to ensure that the 2-byte id field is properly aligned. Additionally, a trailing 1-byte padding field would be added, because the whole structure is required to be aligned as strictly as its most strictly aligned member. It is also noteworthy that bit fields specified inside the structure must be listed consecutively or they will not work as desired. The example struct presented in listing 1.3 would occupy at least 8 bytes.

Listing 1.2: Example of field ordering in a structure

    struct wrong_compactA {
        uint8_t order;
        // char transparent_padding_after_order[1];
        uint16_t id;
        uint8_t flags : 4;
        uint8_t is_valid : 1;
        uint8_t has_acl : 1;
        uint8_t has_xattr : 1;
        uint8_t is_redundant : 1;
        // char transparent_trailing_padding[1];
    };

Listing 1.3: Example of padding

    struct wrong_compactA {
        uint8_t flags : 4;          // bitfield supplemented to 8 bits
        uint8_t order;
        uint8_t is_valid : 1;       // bitfield supplemented to 8 bits
        uint16_t id;
        uint8_t has_acl : 1;
        uint8_t has_xattr : 1;
        uint8_t is_redundant : 1;   // bitfield supplemented to 8 bits
    };
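The size claims above can also be checked mechanically. The snippet below is only an illustrative sketch (the struct names are ad hoc), and the exact numbers assume a typical x86/64 GCC/Clang layout, since bit-field packing is implementation-defined.

    #include <cstdint>

    struct ordered_check {          // same layout as struct compact_A above
        uint16_t id;
        uint8_t order;
        uint8_t flags : 4;
        uint8_t is_valid : 1;
        uint8_t has_acl : 1;
        uint8_t has_xattr : 1;
        uint8_t is_redundant : 1;
    };
    static_assert(sizeof(ordered_check) == 4, "well-ordered variant packs into 4 bytes");

    struct misordered_check {       // same field order as listing 1.2
        uint8_t order;              // 1 byte of padding follows, so that id is 2-byte aligned
        uint16_t id;
        uint8_t flags : 4;
        uint8_t is_valid : 1;
        uint8_t has_acl : 1;
        uint8_t has_xattr : 1;
        uint8_t is_redundant : 1;   // trailing padding rounds the size up to 6
    };
    static_assert(sizeof(misordered_check) == 6, "badly ordered variant grows to 6 bytes");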

Additional properties that are relatively rare are often kept in hash tables instead of being inlined into every object. This is a very important design decision. If some properties of a class apply to a minority of its objects, keeping a pointer to a nearly always empty list of these properties in every instance is a waste of space. What can be done instead is to store these properties in an external dictionary-like container, e.g. a hash map. Then, with a reasonably low CPU overhead for edge cases, the total size of each structure can be reduced.
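As an illustration of this technique, the sketch below keeps a rare attribute in one shared hash map instead of a per-object pointer; the names (inode, acl_by_inode, find_acl) are made up for the example and do not come from LizardFS.

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    struct inode {
        uint32_t id;
        uint32_t mtime;
        // no "acl" pointer here - saves 8 bytes per object on 64-bit architectures
    };

    // populated only for the few objects that actually carry an ACL
    std::unordered_map<uint32_t, std::string> acl_by_inode;

    const std::string *find_acl(const inode &n) {
        auto it = acl_by_inode.find(n.id);
        return it == acl_by_inode.end() ? nullptr : &it->second;
    }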

Smart data structures can lower unneeded memory consumption as well. Standard containers provided by the STL (standard template library) often provide more functionality than necessary at the cost of bigger size. Instead of using the popular std::vector (24 bytes in size on the x86/64 architecture), it is often better to directly allocate a memory region for the data, reducing the RAM overhead to the size of an address (8 bytes on 64-bit architectures) per instance. One of the notable examples of a smart data structure is a flat set, or its more specific version, a flat map. Traditionally, ordered sets are implemented on the basis of balanced trees. A flat ordered set, on the other hand, stores all its data in a contiguous array. Naturally, the choice of the underlying data structure affects the pessimistic time complexity of set operations, listed in figure 1.1.

Operation   tree-based set   flat set
find        O(log n)         O(log n)
insert      O(log n)         O(n)
remove      O(log n)         O(n)

Figure 1.1: Complexity for selected dictionary operations

The linear complexity of insert/remove operations on a flat set is a result of the potential need to shift nearly all array elements, in order to ensure it is still contiguous after the process. Finding an element can still be achieved in O(log n) time by applying binary search. There is one big advantage of using a flat structure, though. Specifically, it is the lack of need to use pointers. A perhaps surprising fact is that in some cases, switching to a flat structure can bring profits to both CPU and RAM load. First of all, contiguous arrays, by definition, provide high locality for memory access. Heap-allocation-based tree nodes may not have this kind of feature. Secondly, the space overhead per single element can be much smaller for flat structures. A real-life example is shown in listing 1.4.

Listing 1.4: Example of set elements

    // sizeof(struct A) = 1
    struct A {
        char value;
    };

    // sizeof(struct tree_node) = 24
    struct tree_node {
        struct A value;
        struct tree_node *lhs;
        struct tree_node *rhs;
    };

In this simple code snippet, a structure occupies only 1 byte of space, while its tree node can occupy up to 24 bytes on a 64-bit architecture: 16 because of two addresses, 1 for the structure itself and (wasted) 7 bytes of padding to ensure proper alignment. It results in a twenty-four-fold overhead for storing an ordered set of characters.
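A minimal sketch of such a flat set over a sorted std::vector is shown below. It only illustrates the complexities from figure 1.1 (O(log n) find via binary search, O(n) insert due to shifting) and is not taken from any real flat-set library.

    #include <algorithm>
    #include <vector>

    struct flat_char_set {
        std::vector<char> data;                 // kept sorted, no per-element pointers

        bool find(char c) const {               // O(log n): binary search
            auto it = std::lower_bound(data.begin(), data.end(), c);
            return it != data.end() && *it == c;
        }

        void insert(char c) {                   // O(n): later elements may be shifted
            auto it = std::lower_bound(data.begin(), data.end(), c);
            if (it == data.end() || *it != c)
                data.insert(it, c);
        }
    };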

External allocation libraries like tcmalloc [19] and jemalloc [11] can be applied to an existing project. Most of them are based on the idea of per-thread memory pools, in some ways similar to the slab allocator present in the Linux kernel. LizardFS offers an option to compile file system binaries already linked to the tcmalloc library, turned off by default. Some allocation libraries can be preloaded at runtime, which can be achieved on Linux by setting the LD_PRELOAD environment variable to an appropriate path before executing a program. All available allocation libraries have their strong supporters and strong opponents, but the only way to see if one is applicable to a project is to apply it and test it thoroughly in terms of performance.

All methods above share one big disadvantage: they only slightly delay the moment when the master server runs out of memory.

One way of dealing with high memory usage is to move the responsibility for storing metadata to a database residing on persistent storage. It can be achieved by rewriting a fair amount of code in order to adapt the whole project to the new logic. The biggest disadvantage of this solution is time, spent first on performing all the changes, and then on full regression tests, since the majority of the system's code was rewritten. An interesting alternative method is presented in the next chapters of this thesis.

Chapter 2

Inspiration

The source of inspiration for creating a new memory management library is the work of researchers from Princeton University: the SSDAlloc project [1, 3].

2.1. Introduction to SSDAlloc

SSDAlloc was designed to transparently extend random-access memory with slower, but more capacious solid-state drives. The interesting aspect was to minimize the amount of code that has to be rewritten in order for the user application to work with the new allocation system. SSD drives have the irrefutable advantage of being cheaper than RAM. With 1 terabyte of storage available for about $500, random-access memory could not compete. What is more, modern SSDs are statistically only 4 times slower than RAM, which makes them perfectly usable even in highly responsive applications.

2.1.1. Motivation

SSDAlloc targeted the popular need of performing a significant change to a big project without the necessity to rewrite large amounts of code. The simplified table 2.1 shows how SSDAlloc addresses this problem. The full chart can be found in the publicly available materials regarding the SSDAlloc project [2].

Solution                    high performance   persistence   programming ease
Total application rewrite   yes                yes           no
Swapping to SSD             no                 no            yes
SSDAlloc                    yes                yes           yes

Table 2.1: Simplified comparison of SSDAlloc and its alternatives

2.1.2. Keywords

• OPP - an object-per-page policy: only 1 object is stored per system page, even if it is significantly smaller than the page size

• RAM Object Cache - frequently used allocated objects stored in random-access memory

• Page Buffer - a first-in first-out style queue of pages currently mapped into RAM

2.1.3. Architecture

SSDAlloc works on three distinctive levels: application, runtime and storage.

Application level

The main goal of the application level design was to be as transparent to the user as possible. It was vital to ensure that memory allocated via SSDAlloc was similar to the one obtained by malloc/new. Particularly, virtual addresses of SSD-allocated pages should be immutable. The requirements were successfully achieved by implementing a simple user-space memory manager.

Runtime level

Internally, SSDAlloc uses 2 important data structures: the RAM Object Cache and the Page Buffer.

RAM Object Cache is used for storing frequently used objects and uses the LRU algorithm for enqueuing them. It also distinguishes clean and dirty objects. As a general rule, a cached object is clean until its data is modified in any manner. Thanks to that mechanism, only dirty objects need to be written back to storage, which can dramatically decrease the I/O load. This feature is especially useful when working with solid-state drives, because each unneeded write wears them down.

Page Buffer is a container of active mapped pages. These pages are actually used by an application to access allocated objects. A simple FIFO algorithm proved to be enough to ensure sufficient performance. Practical experiments have shown that a Page Buffer capable of storing 8192 pages did not create a bottleneck either.

Page manager is responsible for serving memory allocation requests in user space. Its simplified workflow is the following.

1. Map a virtual page.

2. Protect the page from reading/writing.

3. Accessing memory will trigger a signal handler.

4. Run a custom handler.

5. Load needed data into memory in the handler.

6. Once page is not needed, free the page frame.

Mapping an anonymous page will reserve a virtual address, without using any physical memory yet. Only after the page is filled with data from the solid-state drive does it actually occupy physical space. When the page is no longer needed, the Linux kernel can be advised to treat it as not needed. Linux's behaviour on getting such advice is to immediately remove the page from the cache, so it no longer consumes resources.
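The two properties used above can be demonstrated with a few lines of code. The sketch below (error checking omitted, names are illustrative) only reserves virtual addresses with an anonymous PROT_NONE mapping, touches one page, and then releases the physical memory with madvise; it is not SSDAlloc's code.

    #include <cstddef>
    #include <sys/mman.h>

    int main() {
        const size_t size = 1 << 20;            // 1 MiB of virtual address space
        void *addr = mmap(nullptr, size, PROT_NONE,
                          MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);   // no physical memory used yet

        mprotect(addr, size, PROT_READ | PROT_WRITE);
        static_cast<char *>(addr)[0] = 1;       // first touch allocates one physical page

        madvise(addr, size, MADV_DONTNEED);     // the physical page is dropped immediately
        munmap(addr, size);
        return 0;
    }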

There are potential problems to be considered with user-space memory management and they will be addressed in subsection 3.4.1. Fortunately, many operating system implementations (including Linux) are capable of dealing with these problems.

Storage level

The storage layer consists of 2 parts: an address translator and a log-structured object store.

Address translator is a mechanism for mapping addresses to data locations on the solid-state drive.

Log-structured object store is responsible for storing data efficiently on the SSD block device.

2.2. Conclusions

The overall architecture of SSDAlloc is very inspirational. User-space memory management proved to be very useful in creating an allocation library. Keeping pages cached and the distinction between clean and dirty ones proved to be helpful as well. Another big advantage is a simple, natural division of layers: application, runtime and storage.

As a result, fsalloc's high-level design is in many ways similar to SSDAlloc.


Chapter 3

Design and implementation

In order to evaluate the potential of the allocation strategies presented by SSDAlloc, fsalloc, an open source allocation library, was implemented. The purpose of creating fsalloc was to provide means of moving the responsibility for storing objects from RAM to other types of storage, such as solid-state drives, hard drives and compressed memory. Emphasis was placed on minimizing the invasiveness of the library, which can be measured by the number of lines of code that need to be rewritten in the target project in order to make it compatible with fsalloc.

3.1. Design

The library can be seen as three closely cooperating layers: application, runtime and storage.

Application layer is responsible for providing an interface to the end user. In the case of fsalloc, the end user can be regarded as a programmer willing to use alternative memory allocation in his/her project. Specifically, fsalloc's application layer is a set of C/C++ functions and it is thoroughly described in subsection 3.5.1.

Runtime layer represents the core part of the fsalloc library. It is responsible for providing a way of overriding memory access routines, maintaining a cache of the most frequently used objects and gathering statistics. Its details are presented in subsection 3.5.2.

Storage layer provides means of keeping data in a persistent and efficient way. In fsalloc, all objects that need to be written back from the cache are processed by the storage layer. A detailed description can be found in subsection 3.5.3.

3.2. Choice of programming language

One of the decisions to be made before implementing a library is the one regarding the programming language used. The language chosen for the implementation of fsalloc is C++, with a strong emphasis on the fact that it could be easily ported to C. The reasons why C/C++ are the preferred languages for creating such a library are as follows:

1. fsalloc is defined as a library for C/C++ programming,

2. C is fast,

3. C offers a direct interface for system calls, which is a core functionality for this project,

4. with C, creating a complex, architecture-specific signal handler is easy,

5. with C, registering a signal handler is easy,

6. C allows direct operations on memory,

7. C++ is excellently compatible with C,

8. C++ has templates, which make the allocation interface type-aware,

9. C++ has the standard template library with efficient implementations of frequently used data structures and algorithms.

No other languages were seriously considered as candidates for fsalloc implementation.

3.3. Used libraries and portability

fsalloc does not depend on many external libraries. Although some interesting data structures can be found in Boost [7], fsalloc's requirements were simple enough to avoid creating additional external dependencies. There are 2 aspects of portability that must be mentioned, though.

3.3.1. Linux-specific headers

Memory management is both system- and architecture-specific. As such, fsalloc uses many Linux-specific header files, which makes it not easily portable to other platforms.

sys/mman.h is needed for architecture-specific mprotect flags and structures used by signal handlers. On x86/64, which was chosen as the representative architecture for the fsalloc implementation, it is possible to extract useful pieces of information from the data passed to the segmentation fault signal handler.

The function signature passed to the sigaction call used for registering a signal handler has a third parameter, which is architecture-dependent. Code snippet 3.1 shows how to find out if the memory access that triggered SIGSEGV was a read or a write attempt on x86/64 processors. Without that information, it would not be possible to distinguish dirty memory regions (i.e. with different contents in cache than on persistent storage) from clean ones. Hence, all cache entries, even the ones that were never accessed, would need to be synchronized with storage, which is a severe performance issue, especially if the storage happens to be a magnetic disk.

Listing 3.1: Checking access type in segmentation fault handler

    int get_mprotect_flags(void *ctx) {
        int flags;
        ucontext_t *context = (ucontext_t *)ctx;

        if (context->uc_mcontext.gregs[REG_ERR] & 0x2) {
            flags = PROT_READ | PROT_WRITE;
        } else {
            flags = PROT_READ;
        }

        return flags;
    }

    void handler(int signum, siginfo_t *info, void *ctx) {
        ...
        int flags = get_mprotect_flags(ctx);
        ...
    }

unistd.h contains the POSIX interface for system calls. Since the implementation is definitely Linux-specific, portability of the library is considered only within POSIX standards.

3.3.2. BerkeleyDB

fsalloc enjoys the benefits of using the open-source database library Berkeley DB [6]. It is used as the primary storage back end in the project. Details are thoroughly described in section 3.5.2.

3.4. User-space memory management

The core part of the fsalloc library is handling the allocation process. One way of interfering with low-level memory management is to write kernel-space code. Page allocation routines can, however, be pulled to user space with the help of signal handlers.

3.4.1. Details

Implementation of the user-space memory manager is based on the schema described in section 2.1.3. The steps below illustrate how to achieve custom memory access routines. Each line has a corresponding description.

    1. sigaction(SIGSEGV, &sa, &default_sigsegv);
    2. addr = mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    3. x = *addr;
    4. mprotect(addr, size, PROT_READ | PROT_WRITE);
    5. memcpy(addr, tmp, size);
    6. mprotect(addr, size, PROT_READ);
    7. madvise(addr, size, MADV_DONTNEED);

1. Register a custom signal handler for segmentation fault. The handler should check if a fault was caused by a regular error or by the protection set by fsalloc. It should only proceed with pages that are a part of fsalloc pool.

2. Map an anonymous page, obtaining its virtual address. Anonymous mapping is very similar to a malloc call. It returns a contiguous memory area consisting of at least 1 whole page. Mapped page does not occupy space until something is actually written to it.

3. Access the address, which triggers the custom SIGSEGV handler. On the x86/64 architecture it is possible to distinguish whether a fault was triggered by a read or a write request. It allows a huge performance optimization based on writing back only dirty (modified) pages to storage, evicting the clean ones without any I/O operations.

4. Unprotect a page, so it could be written to. The page needs to be unprotected or it would cause another segmentation fault, which leads to a livelock.

5. Copy data from storage to the page. The exact procedure of extracting data from storage will be discussed in 3.5.2.

6. (optional) Reprotect the page if it was accessed only for reading. This is another architecture-dependent feature, fortunately available on x86/64. A page accessed only for reading remains clean, so it should still be protected from write attempts. That way, pages will be marked dirty if and only if a write occurs.

7. Remove the page from the system cache. A madvise call will trigger the Linux kernel's aggressive policy of removing the page from the cache right away, which is very helpful when working with anonymous mappings - the page stops occupying physical space as soon as the madvise call ends.
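The whole cycle can be condensed into a small, self-contained sketch. It is only an illustration of the steps listed above for a single region, with a plain array standing in for the storage back end; it is not fsalloc's actual implementation.

    #include <csignal>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>
    #include <unistd.h>

    static const size_t kSize = 4096;
    static void *g_region = nullptr;            // the protected anonymous mapping
    static char g_storage[kSize];               // stand-in for the database

    static void segv_handler(int, siginfo_t *info, void *) {
        uintptr_t addr = (uintptr_t)info->si_addr;
        uintptr_t base = (uintptr_t)g_region;
        if (addr < base || addr >= base + kSize)
            _exit(1);                           // a genuine fault, not part of our pool
        // steps 4-5: unprotect the region and fill it with data from "storage"
        mprotect(g_region, kSize, PROT_READ | PROT_WRITE);
        memcpy(g_region, g_storage, kSize);
    }

    int main() {
        struct sigaction sa = {};
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, nullptr);       // step 1: custom SIGSEGV handler

        g_region = mmap(nullptr, kSize, PROT_NONE,
                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);     // step 2: reserve an address
        static_cast<char *>(g_region)[0] = 'x'; // step 3: access triggers the handler
        printf("%c\n", static_cast<char *>(g_region)[0]);

        // steps 6-7: write the (dirty) region back and drop the physical page
        memcpy(g_storage, g_region, kSize);
        mprotect(g_region, kSize, PROT_NONE);
        madvise(g_region, kSize, MADV_DONTNEED);
        return 0;
    }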

There are two potential problems to be considered with user-space memory management, already mentioned in subsection 2.1.3. Firstly, only a small subset of functions can be safely used in signal handlers. These functions need to be reentrant, which, according to the standard, means meeting the following conditions.

A reentrant function:

• must hold no static (or global) non-constant data,

• must not return the address of static (or global) non-constant data,

• must work only on the data provided to it by the caller,

• must not rely on locks to singleton resources,

• must not modify its own code (unless executing in its own unique thread storage),

• must not call non-reentrant computer programs or routines.

Though mprotect is not on the reentrant-safe list according to the POSIX standard, being a system call, it meets the required conditions in Linux. The second issue is to ensure that the signal handler does not use any global/static variables, which is not the case in fsalloc. However, if the implementation guarantees that access to these variables is never simultaneous, everything works fine. The strict rules of reentrancy might be in conflict with handlers which use static data structures, but in fsalloc it is assured that operations on such structures will never be interrupted by another handler of the same family. This is the case because SIGSEGV is a synchronous signal in Linux: it is guaranteed that its handler will be executed right after the memory access fault.

Given these facts, fsalloc can be safely treated as reentrant-friendly, as long as it is used as an allocation library in single-threaded projects or ones that ensure thread safety on their own. The current implementation of LizardFS's master server meets these conditions.

3.4.2. Performance of signal-based management

One might wonder if memory management based on overloading signal handlers is efficient. Having a closer look at the implementation details of signal handling in x86/64 Linux should dispel any doubts. This subsection is based on an article from Linux Journal [4].

Signals are used to notify the process/thread of a certain event. They can be divided into two groups.

Synchronous signals occur as a direct result of the instruction stream, e.g. trying to perform an illegal operation. SIGSEGV, a result of illegal memory access, is a good example of a synchronous signal.

Asynchronous signals occur independently of the process' instruction stream. A natural example of an asynchronous signal is SIGKILL, sent to a process in order to force its termination.

Each task structure, which represents a process in the Linux kernel, keeps certain information related to signal handling. The most important ones are:

• table of signal handlers,

• bitmask of blocked signals,

• signal queue, implemented as a doubly-linked list.

Handling a synchronous signal is a detailed process, a simplified example of which (for the x86/64 architecture) is presented below.

1. User attempts to write to a read-only page.

2. The page fault handler (source: kernel/entry.S) is executed.

3. entry.S calls do_page_fault (source: arch/x86/mm/fault.c), which extracts hardware-specific fault parameters and does basic range validation.

4. do_page_fault calls generic mm_fault (source: mm/memory.c), page information is gathered and further checks occur.

5. The signal is forced to be the first one by the force_sig(SIGSEGV, current) call.

6. The current signal, the previously forced SIGSEGV, is queued for execution.

7. All user-space registers are saved, and the user stack is modified to jump back to the kernel after handling is finished (since the handling itself will be performed in user space).

8. Kernel returns to user space.

9. Execution jumps to registered signal handler.

10. Control is given back to the kernel by the sys_sigreturn call, previously enqueued for execution on the user stack.

11. User-space data is restored.

12. Control goes back to user space.

Despite switching from user to kernel space and back, the described procedure consists of relatively simple operations. Thanks to that, the total overhead of handling a segmentation fault signal ranges from 3 to 4 microseconds [4]. For comparison, the latency of reading 4 KB from a solid-state drive can reach over 100 microseconds. Given these facts, it can be stated that the signal handling time overhead is negligible compared to latencies imposed by SSD storage, not to mention magnetic hard drives.

3.5. Implementation details

Implementation details of each layer (application, runtime, storage) are described in this section.

3.5.1. Application layer

fsalloc offers several interfaces for accessing its functionalities. All definitions are included in the library's header file.

fsalloc::init()

Initialization function. Initializing the fsalloc module is required before using its functionalities. The init function takes two parameters:

• path to database - full path to a place where the database file should be stored. It is advised to pick a place that provides enough memory for storage. If the underlying device is an SSD, it will provide greater performance than a casual hard drive. A detailed comparison of storage variants is presented in chapter 4.

• size of cache - the number of pages constantly kept in random-access memory. This value is a compromise between performance (high values equal more cache hits) and memory usage (low values consume less resources). The default capacity of the page cache should be selected on the basis of practical experiments.

Example:

    void system_init() {
        ...
        fsalloc::init("/mnt/zram/fsalloc.bdb", 2048U);
        ...
    }

fsalloc::term()

Termination function. Terminating the fsalloc module should be done on system shutdown in order to keep structures consistent and avoid memory leaks. This function does not need any parameters, because it is responsible only for triggering database shutdown.

Example:

    void system_exit() {
        ...
        fsalloc::term();
        ...
    }

fsalloc::fsalloc()

Basic function with a malloc-like interface. It takes one numerical parameter, which is the number of requested bytes. The return value is, as in a malloc call, of type void *.

Example:

    Object *obj;
    void *memory = fsalloc::fsalloc(sizeof(Object));
    obj = (Object *)memory;
    obj->process(data);

fsalloc::fsfree()

Basic function with a free-like interface. It takes one argument, which is a pointer previously allocated by fsalloc. It does not return anything, same as the free call.

Example:

    Object *tmp = (Object *)fsalloc::fsalloc(sizeof(Object));
    tmp->process(data);
    fsalloc::fsfree((void *)tmp);

fsalloc::fsalloc<T>()

Type-aware version of the fsalloc() call. It is semantically equivalent to calling the fsalloc::fsalloc function with the size of the templated type passed to it and returns a pointer already cast to the appropriate type.

Example:

    using namespace fsalloc;
    int *i = fsalloc<int>();
    struct entry *e = fsalloc<struct entry>();

fsalloc::fsfree<T>()

Type-aware version of the fsfree() call. It is semantically equivalent to calling the fsalloc::fsfree function.

Example:

    using namespace fsalloc;
    fsfree(i);
    fsfree(e);

fsalloc::fsnew<T>()

C++-style variation of the new statement. It allows custom constructor parameters to be passed to it.

Example:

    using namespace fsalloc;
    int *i = fsnew<int>();
    std::string *s = fsnew<std::string>("custom_constructor_parameter", 0, 6);
    std::vector<int> *x = fsnew<std::vector<int>>(5);

fsalloc::fsdelete<T>()

C++-style variation of the delete statement. Template parameters can be omitted, as they are deducible from the arguments.

Example:

    fsdelete(i);
    fsdelete<>(s);
    fsdelete<>(x);

struct fsalloc::managed

Convenient class designed to be a base for other structures that should use fsalloc as their allocation method. The fsalloc::managed class has overridden new/delete operators, which causes structures inheriting from it to use fsalloc as the allocator.

Example:

    struct foo : public fsalloc::managed {
        int x;
        int y;
        int z;
        char name[256];
        struct foo *next;
        struct foo *prev;
    };
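For orientation, the calls documented above can be combined as follows. This is an assembled example only: the header name (fsalloc.h) and the structure shown are assumptions made for illustration.

    #include <cstdint>
    #include <fsalloc.h>                  // assumed header name

    struct node : public fsalloc::managed {   // hypothetical structure using fsalloc
        uint32_t id;
        uint32_t mtime;
    };

    int main() {
        fsalloc::init("/mnt/zram/fsalloc.bdb", 2048U);   // storage path + page cache size

        node *n = new node();             // operator new overridden via fsalloc::managed
        n->id = 1;
        delete n;

        int *counter = fsalloc::fsnew<int>(0);           // type-aware allocation with a constructor argument
        fsalloc::fsdelete(counter);

        fsalloc::term();
        return 0;
    }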

3.5.2. Runtime layer

The runtime engine of fsalloc consists of the following elements, described in this section.

Region cache

A first-in first-out container of addresses. A memory region is defined as a contiguous virtual memory block composed of at least 1 page. The maximal capacity of this structure is specified by a parameter passed to the initialization function. Once a memory region is allocated and accessed, it is inserted into the region cache. It resides in there until other regions are accessed often enough to push it out of the cache.

If a region is clean, i.e. it was not accessed for writing, it does not need to be written back, as there are no differences between its contents in RAM and in the database.

If a region is dirty (accessed for writing at least once), it needs to be written back to the database before eviction, as its state has changed.

After a successful writeback, the given region is freed from memory with the madvise() system call.
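The mechanism can be summarized with the following sketch of a FIFO region cache with clean/dirty tracking. write_back() and drop() are hypothetical placeholders for the database write and the madvise()-based eviction, and the code is an illustration rather than fsalloc's actual implementation.

    #include <cstddef>
    #include <deque>
    #include <unordered_map>

    struct region {
        void *addr;
        size_t size;
        bool dirty;                        // set when the region was written to
    };

    class region_cache {
    public:
        explicit region_cache(size_t capacity) : capacity_(capacity) {}

        void insert(const region &r) {
            if (fifo_.size() == capacity_)
                evict_oldest();
            fifo_.push_back(r.addr);
            regions_[r.addr] = r;
        }

        void mark_dirty(void *addr) { regions_[addr].dirty = true; }

    private:
        void evict_oldest() {
            void *addr = fifo_.front();
            fifo_.pop_front();
            region &r = regions_[addr];
            if (r.dirty)
                write_back(r);             // hypothetical: store contents in the database
            drop(r);                       // hypothetical: madvise(MADV_DONTNEED) + reprotect
            regions_.erase(addr);
        }

        void write_back(const region &) {}
        void drop(const region &) {}

        size_t capacity_;
        std::deque<void *> fifo_;
        std::unordered_map<void *, region> regions_;
    };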

BerkeleyDB driver

The database access point.

BerkeleyDB offers various methods of storing objects:

1. Btree - where data is kept in a sorted, balanced tree structure.

2. Hash - which uses dynamic hashing algorithms.

3. Heap - which stores objects in a file and references them only by their offset.

4. Queue - designed for keeping fixed-size records and fast tail inserts.

5. Recno - which stores both fixed and variable-length records.

The Heap method was chosen as the best suited for fsalloc's needs. It works similarly to memory allocation: a key for the database entry is returned once the item is inserted. The structure used as a key in the heap method is of type DB_HEAP_RID, which is 8 bytes long (conveniently, just like a regular pointer on 64-bit architectures). Inside, it contains a 32-bit page number and a 32-bit offset. Passing this 8-byte value is enough to extract the desired item from the database. Heap storage works exceptionally well when inserted items have a fixed size, which is fortunately the case for the most important structures of LizardFS. If sizes do not vary much, internal database fragmentation stays at a negligible level.

Database initialization happens during fsalloc module start-up. The db::init() call involves creating the database file and setting up the necessary parameters:

• the size of the item cache, needed for creating BerkeleyDB's internal entry caching,

• the type of storage (in this case, the heap method).

Once an fsalloc allocation occurs, a page is mapped from virtual memory. Then, if the page wasn't modified, no database access is required. Otherwise, the data extracted from the page is stored in BerkeleyDB with a call to db::put(). Overwriting an existing element is, of course, possible as well.

Pages are extracted from the database with a db::get() call.

During fsalloc termination procedures, db::term() invokes closing the database with a special "NO SYNCHRONIZATION" flag. Passing the flag accelerates the process, since the data kept in this database is dynamic, so it would be overwritten on another module initialization anyway.
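For reference, a rough sketch of what such a heap-method back end looks like when written directly against the public Berkeley DB C API is given below; fsalloc's own db::init()/db::put() wrappers are not reproduced here and may differ in details.

    #include <cstring>
    #include <db.h>                                  // Berkeley DB

    DB *open_heap_db(const char *path, u_int32_t cache_bytes) {
        DB *dbp = nullptr;
        if (db_create(&dbp, nullptr, 0) != 0)
            return nullptr;
        dbp->set_cachesize(dbp, 0, cache_bytes, 1);  // internal entry cache
        if (dbp->open(dbp, nullptr, path, nullptr, DB_HEAP, DB_CREATE, 0) != 0)
            return nullptr;
        return dbp;
    }

    // Store one page of data; the heap access method fills in the DB_HEAP_RID key.
    int put_page(DB *dbp, void *page, u_int32_t size, DB_HEAP_RID *rid_out) {
        DBT key, data;
        memset(&key, 0, sizeof key);
        memset(&data, 0, sizeof data);
        key.data = rid_out;
        key.ulen = sizeof *rid_out;
        key.flags = DB_DBT_USERMEM;
        data.data = page;
        data.size = size;
        return dbp->put(dbp, nullptr, &key, &data, DB_APPEND);   // allocates a new record id
    }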

3.5.3. Storage layer

BerkeleyDB stores its database in a file. Choosing a path for this file is a crucial performance- and safety-oriented decision. The underlying file system should ensure that:

• the database file does not get corrupted easily,

• access to the file contents is as swift as possible.

According to the guidelines, any POSIX-compliant file system should be sufficient to serve a BerkeleyDB file, but finding a solution that provides high performance and a reasonable level of safety is a complex task.

Chapter 4 covers the topic of comparing possible ways of storing the allocation database.

Chapter 4

Choice of storage

Where should the data allocated with the fsalloc library be stored? The answer to this question is not simple and it depends on the particular needs of the application. Typical hard drives do not offer high I/O speed, but, on the other hand, they are relatively cheap. SSD drives are clearly a better choice if performance is the main issue, but they cost more and wear out faster than HDDs. An interesting idea is to take advantage of the in-kernel memory compression mechanisms included in Linux. Table 4.1 shows a brief summary of possible means of database storage, taking into account three important features: speed, price and efficiency in saving memory.

Type of storage   fast   cheap   memory efficient*
HDD               no     yes     yes
SSD               yes    no      yes
ZRAM              yes    yes     no

* meaning that RAM overhead is constant and insignificantly low

Table 4.1: Qualities of chosen storage types

Let us consider the advantages and disadvantages of specific types of storage in detail.

4.1. HDD

4.1.1. Introduction

Hard disk drives use rapidly rotating magnetic platters to store and serve data. The IBM 350 is believed to be the first hard drive and was introduced in 1956 as a part of the IBM 305 RAMAC computer. The IBM 350 was able to store 3.75 MB of data on 50 disks and was reported to have the size of two refrigerators. Each write operation applied to a hard disk drive triggers magnetizing a ferromagnetic film placed on a platter. Reading from a HDD involves detecting changes in magnetization of the mentioned film. Input/output operations are performed by read-and-write heads, which are responsible for both translating magnetic field to electrical current (reads) and vice-versa (writes). In order to operate on a given memory sector, the head must be mechanically moved to the region.

4.1.2. Access time

Due to their construction, which consists of storing data on constantly rotating disks and heads trying to place themselves above certain sectors, hard disk drives' most costly factors are seek time and latency. Block I/O schedulers designed for hard drives apply various sophisticated heuristics to minimize these factors by trying to sort and merge individual read/write requests in order to reduce redundant disk head operations. Nonetheless, even a hypothetical clairvoyant algorithm would not be able to eliminate the overhead caused by moving hardware components.

Seek time is the time needed to place an arm of the read-and-write head above the region where the I/O operation should be performed. If two consecutive requests refer to physically distant regions, the seek will be correspondingly slower. Due to this fact, sorting individual requests on the basis of their position is essential in I/O schedulers.

Latency is the time needed to rotate the platter to bring the requested sector under the head so it can be read from/written to. It is highly dependent on the disk's rotational speed, expressed in the rpm unit, which stands for revolutions per minute. Example values from drives available on the market are 15 000 rpm, 7 200 rpm and 4 800 rpm.
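As a rough worked example: a platter spinning at 7 200 rpm completes one revolution in 60 s / 7 200 ≈ 8.3 ms, so the average rotational latency (half a revolution) is about 4.2 ms; at 15 000 rpm it drops to roughly 2 ms, while at 4 800 rpm it grows to about 6.3 ms.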

Data transfer rate is a third access time factor worth mentioning. It determines how fast data can be read from the sector, once arms and platters are appropriately set.

Average values of described factors are presented in table 4.2.

Value   Seek time   Latency   Data transfer rate
low     4 ms        2 ms      200 MB/s
high    5 ms        5 ms      300 MB/s

Table 4.2: HDD features

4.1.3. Prices

Hard drives are considered to be cheap when it comes to price per storage capacity. An average 1 terabyte HDD can be acquired for about $50.

4.1.4. Benchmarks

Table 4.3 shows average speeds of selected operations, gathered from benchmark results of various disks from different vendors and price ranges [21]. All extracted values were associated with consumer-grade disks.

Operation             Speed
read                  150 MB/s
write                 140 MB/s
mixed read/write      128 MB/s
4K read               1 MB/s
4K write              1 MB/s
4K mixed read/write   0.4 MB/s

Operation             Peak speed
read                  190 MB/s
write                 180 MB/s
mixed read/write      160 MB/s
4K read               1.5 MB/s
4K write              2 MB/s
4K mixed read/write   0.6 MB/s

Table 4.3: HDD benchmark [21]

4.2. SSD

4.2.1. Introduction

Solid state drives use integrated circuits to provide persistent memory storage. As of the present year, most SSDs are based on NAND flash memory, which is able to retain data without a constant power source. The NAND flash architecture was introduced by Toshiba in 1989 and its memory can be accessed similarly to other block devices, like HDDs. For specific performance purposes, an SSD can also be constructed from RAM, but then it might require a separate power source, e.g. a battery, for ensuring data persistence. In contrast to hard disks, SSDs do not rely on mechanical elements like platters and moving read/write heads. Aside from flash memory, a solid-state drive contains a controller, which is an intermediary between the memory and the host computer. The controller's responsibilities may include:

• bad block mapping - needed to locate blocks that are unusable because they wore down or another error occurred,

• I/O caching - used for increased performance when subsequent memory access calls are local,

• garbage collection - needed to get rid of small amounts of freed data at block granularity,

• encryption - optional feature for security purposes,

• ECC - error-correcting codes, used for detecting memory corruptions,

• read scrubbing - repairing corrupted memory with the help of error-correcting codes.

4.2.2. Access time

Many performance bottlenecks of HDDs do not apply to solid-state drives. First of all, the start-up time of SSDs is negligible, because virtually no physical components need to be prepared. For comparison, hard disks may need several seconds to reach their spinning speed and prepare the read/write head.

Seek time for an SSD is typically under 100 ns, because its underlying memory already supports random access. No mechanical heads need to be moved in order to access the data.

Latency is not a concern for solid-state disk, as its value is similar to seek time and can be measured in nanoseconds.

4.2.3. Prices A consumer-grade 1 terabyte solid-state drive can be obtained for about $300, which makes it around six times more expensive than a consumer-grade HDD.

4.2.4. Benchmarks

Benchmark table 4.4 corresponds to table 4.3.

Operation             Speed
read                  480 MB/s
write                 400 MB/s
mixed read/write      380 MB/s
4K read               36 MB/s
4K write              80 MB/s
4K mixed read/write   30 MB/s

Operation             Peak speed
read                  530 MB/s
write                 512 MB/s
mixed read/write      512 MB/s
4K read               40 MB/s
4K write              140 MB/s
4K mixed read/write   50 MB/s

Table 4.4: SSD benchmark [21]

4.3. In-kernel compression

4.3.1. Introduction

In-kernel memory compression became popular as Linux was ported to smaller and smaller devices. These devices usually have small RAM capacity and sometimes they do not possess persistent memory at all. Such configurations make even the smartest swapping mechanisms useless - there is no place available to store the excessive pages. Compressed memory pools changed this unfortunate case. If pages that occupy memory are easily compressible (e.g. they are filled with zeros, which is surprisingly often the case), they can be stored in these pools and save up to 100% of RAM. Experiments have shown that an average page can be compressed to approximately 50% of its original size. To provide a simple example, a memory pool of 256 MB could potentially accept 512 MB of data, which could be a life-saving amount. Linux developers provided in-kernel memory compression mechanisms for various use cases. The two main modules currently residing in the kernel tree are zswap and zram.

4.3.2. Implementations

zswap is a mediator between the memory access and swap mechanisms. A page that is a candidate for being swapped is checked by zswap first, and if it meets the conditions, it is put into a compressed memory pool instead of being swapped to disk. Only pages with a good compression ratio are allowed into the pool. Zswap is an interesting module, but it was not found useful as an fsalloc storage back end, since it is designed specifically to intercept the native swap mechanism in Linux.

zram is a block device that stores its data compressed in RAM. In order to work, the zram module must be enabled during kernel compilation. Fortunately, most popular Linux distributions have this module enabled in their kernel by default.

4.3.3. Usage examples

A step by step guide for enabling the zram module [12], e.g. as fsalloc storage:

• Turning on the module can be done with the modprobe utility. Optionally, the number of zram devices might be passed (the default is 1).

    $ modprobe zram num_devices=2

• The compression algorithm (discussed later) can be set by passing a value to sysfs.

    $ echo lzo > /sys/block/zram0/comp_algorithm
    $ echo lz4 > /sys/block/zram1/comp_algorithm

• Initialization is done by setting a disk size for the devices via sysfs. Note that this amount of memory will actually be reserved in RAM.

    $ echo 512M > /sys/block/zram0/disksize
    $ echo 8G > /sys/block/zram1/disksize

• zram can be treated as a regular block device, so the first thing to do is to create a file system on it.

    $ mkswap /dev/zram0
    $ mkfs.xfs -l logdev=/dev/sdh,size=65536b /dev/zram1

• After the file system is initialized, zram devices can be mounted or registered as swap.

    $ swapon /dev/zram0
    $ mount -t xfs /dev/zram1 /mnt/storage

• Passing a path on the mounted zram device to the fsalloc initialization function will result in keeping its data still in RAM, but in a compressed fashion.

    fsalloc::init("/mnt/storage/fsalloc.bdb", 8192);

• Various zram statistics are available via sysfs.

    $ cat /sys/block/zram0/orig_data_size /sys/block/zram0/compr_data_size
    $ cat /sys/block/zram1/num_{reads,writes}
    $ cat /sys/block/zram1/failed_{reads,writes}

4.3.4. Compression algorithms

In-kernel memory compression's performance depends almost exclusively on its underlying algorithm. For various reasons, the most popular algorithms for this specific use case come from the Lempel-Ziv family of dictionary compression: lzo and lz4. Compared to other algorithms, the mentioned ones are ahead of the rest when it comes to decompression time, mainly at the cost of compression ratio. Decompression time is the key factor because of the nature of random memory access itself - reads are considered to be the most frequent operation, and therefore the overhead for these operations should be minimal.

lzo is a real-time compression library, available both on an open-source license and a proprietary one. It was created and is maintained by Markus F. X. J. Oberhumer. It is written in C and, as advertised, provides very high decompression speed.

lz4 is a fast, lossless compression algorithm written in C by Yann Collet, available on an open-source license. Similarly to lzo, its top priority is to provide very high decompression speed.

4.3.5. Benchmarks

Tables 4.5, 4.6, 4.7 and figure 4.1 show a comparison of various popular compression algorithms to lzo and lz4 [17]. Tests were executed on the Linux kernel source code in version 3.3, with a file size of 466083840 B (445 MB).

Level   gzip    bzip2   lzma    lzma-e   xz      xz-e    lzo     lz4
1       26.8%   20.2%   18.4%   15.5%    18.4%   15.5%   35.6%   36.0%
2       25.5%   18.8%   17.5%   15.1%    17.5%   15.1%   35.6%   35.8%
3       24.7%   18.2%   17.1%   14.8%    17.1%   14.8%   35.6%   35.8%
5       22.0%   17.6%   14.9%   14.6%    14.9%   14.6%   -       35.8%
7       21.5%   17.2%   14.4%   14.3%    14.4%   14.3%   -       24.9%
9       21.4%   16.9%   14.1%   14.0%    14.1%   14.0%   -       24.6%

Table 4.5: Compression ratio. Source: Quick Benchmark: Gzip vs Bzip2 [17].

Level   gzip    bzip2   lzma    lzma-e   xz      xz-e    lzo     lz4
1       8.1s    58.3s   31.7s   4m37s    32.2s   4m40s   1.3s    1.6s
2       8.5s    58.4s   40.7s   4m49s    41.9s   4m53s   1.4s    1.6s
3       9.6s    59.1s   1m2s    4m36s    1m1s    4m39s   1.3s    1.5s
5       14s     1m1s    3m5s    5m       3m6s    4m53s   -       1.5s
7       21s     1m2s    4m14s   5m52s    4m13s   5m57s   -       35s
9       33s     1m3s    4m48s   6m40s    4m51s   6m40s   -       1m5s

Table 4.6: Compression time. Source: Quick Benchmark: Gzip vs Bzip2 [17].

Level   gzip    bzip2   lzma    lzma-e   xz      xz-e    lzo     lz4
1       3.5s    3.4s    6.7s    5.9s     7.2s    6.5s    0.4s    1.5s
2       3s      15.7s   6.3s    5.6s     6.8s    6.3s    0.3s    1.4s
3       3.2s    15.9s   6s      5.6s     6.7s    6.2s    0.4s    1.4s
5       3.2s    16s     5.5s    5.4s     6.2s    6s      -       1.5s
7       3s      15s     5.3s    5.3s     5.9s    5.8s    -       1.3s
9       3s      15s     5s      5.1s     5.6s    5.6s    -       1.2s

Table 4.7: Decompression time. Source: Quick Benchmark: Gzip vs Bzip2 [17].


Figure 4.1: Compression/decompression speeds for lzo, lz4 and gzip algorithms (MB/s)

4.4. Comparison

There is no universal best solution among the presented mechanisms. The choice should be based on performance requirements, budget, and the balance between these factors.

4.4.1. Price-oriented choice

Emphasis on low price rules out solutions based on high-end solid-state drives.

HDD There is one big advantage in choosing hard drives for allocation purposes: they are the most popular storage solution on the market. Because of that, creating a JBOD (just a bunch of disks) configuration is usually free from additional expenses, as existing hardware can be utilized.

ZRAM An interesting alternative to low-priced HDDs is to use the in-kernel memory compression mechanism. It outperforms hard drives in access speed by orders of magnitude (see the tables in section 4.3.5). There is, however, a major disadvantage of using compression as storage: it only extends the available RAM by about a half and does not scale beyond that threshold. Moreover, it is often not possible to predict the compression ratio of the data provided to the compression algorithm; in a very pessimistic scenario, it can even reduce the amount of memory available in the system.

4.4.2. Performance-oriented choice

If the primary goal is to ensure high efficiency, a storage solution based on HDDs is a bad choice.

SSD access time is much closer to RAM than HDD's, as can be seen in section 4.2.4. There are two disadvantages to be taken into account when considering solid-state drives as a storage back end:

ˆ the cost of an SSD is higher than that of a regular HDD,

ˆ solid-state drives wear out much faster than HDDs, especially when used inappropriately. Frequent writes to the same memory sector make it unusable because of flash memory properties. As a result, one must consider the lifespan of an average solid-state drive before choosing it as the basic storage for memory allocation.

ZRAM In-kernel memory compression is not only an affordable solution, but also a high-performance one. In use cases that do not require reducing memory usage to a large extent, it should be considered a legitimate candidate.

4.4.3. Summary

Charts 4.2-4.5 visualize selected important traits of the storage back ends. The advantages and disadvantages of each particular solution are clearly visible on these charts: HDD is slow, SSD is fast yet expensive, and finally zram is fast and cheap, but does not offer 100% memory savings.

Performance

Figure 4.2: Expected read speed for selected storage back ends (MB/s)


Figure 4.3: Expected write speed for selected storage back ends (MB/s)

Utility

Figure 4.4: Expected reclaimed RAM capacity for selected storage back ends

Cost

Figure 4.5: Average price per 1 TB for selected storage back ends

Conclusion

Of all the back-end storage choices, HDD and in-kernel compression are the ones that could be sufficient for small-scale use cases while imposing little or no expense. When performance and reliability are the key factors, solid-state drives are probably the most fitting choice.

Chapter 5

Integration with LizardFS

By design, fsalloc is intended to be easily applicable to existing projects. Incorporating fsalloc into LizardFS required only a few lines of code; in fact, adding the fsalloc library to the automatic build system took more script lines than all the performed code modifications. For simplicity, only the allocation of the fsnode structure was chosen to be managed by fsalloc instead of the native routines. The reasoning is that fsnode is responsible for the majority of the master server's memory overhead (see figure A.4), so the memory reduction should be clearly visible in test results. On the other hand, should fsalloc prove to have a bad impact on performance even when applied to only a single structure, no further code changes and testing would be necessary.

5.1. Build configuration

LizardFS' build system is based on the CMake [8] project. Adding fsalloc as a module of LizardFS requires preparing a simple configuration file, available on the attached CD.

5.2. Code changes

The required code modifications are so sparse that they can all be presented and briefly described in this chapter. Changes are presented in the diff unified format [20]. First of all, the appropriate header (listing 5.1) needs to be included in the source file. Then, the class which represents a file system inode is extended with the custom allocation provided by fsalloc (listing 5.2). Afterwards, loading of fsalloc structures is added to the system's initialization routines (listing 5.3). Finally, library unloading is added to the system's termination routines (listing 5.4). These steps are sufficient to integrate fsalloc allocation procedures into LizardFS. The essential changes involved adding 3 lines of code and amending only a single one, making it 4 lines in total.

Listing 5.1: Adding header

@@ -50,6 +50,7 @@
 #include "common/slogger.h"
 #include "common/tape_copies.h"
 #include "common/tape_copy_state.h"
+#include "fsalloc/fsalloc.h"
 #include "master/checksum.h"
 #include "master/chunks.h"
 #include "master/goal_config_loader.h"

Listing 5.2: Changes in fsnode class

@@ -182,7 +183,7 @@ struct xattr_inode_entry {
 	struct xattr_inode_entry *next;
 };
 
-class fsnode {
+class fsnode : public fsalloc::managed {
 public:
 	std::unique_ptr extendedAcl;
 	std::unique_ptr defaultAcl;

Listing 5.3: Changes in initialization code

@@ -8270,6 +8272,7 @@ LIZARDFS_CREATE_EXCEPTION_CLASS
 char const MetadataStructureReadErrorMsg[] = "error reading metadata (structure)";
 
 void fs_loadall(const std::string& fname, int ignoreflag) {
+	fsalloc::init("/var/lib/lizardfs/fsalloc.bdb", 8192);
 	cstream_t fd(fopen(fname.c_str(), "r"));
 	std::string fnameWithPath;
 	if (fname.front() == '/') {

Listing 5.4: Changes in termination code

@@ -8227,6 +8228,7 @@ void fs_term(void) {
 			gMetadataLockfile->writeMessage("no_metadata: 0\n");
 		}
 	}
+	fsalloc::term();
 }
 
 void fs_disable_metadata_dump_on_exit() {
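The changes above rely on the fsalloc::managed base class exported by the library (listing 5.2). Its actual implementation is part of fsalloc and is available on the attached CD; the sketch below only illustrates how such a base class could route allocations through the library. The allocate/deallocate placeholders stand in for the real mmap/BerkeleyDB machinery and simply forward to malloc/free so that the example is self-contained.

#include <cstddef>
#include <cstdlib>
#include <new>

namespace fsalloc {

// Placeholder entry points standing in for the real library machinery;
// here they forward to malloc/free so the sketch compiles on its own.
inline void* allocate(std::size_t size) { return std::malloc(size); }
inline void deallocate(void* ptr) { std::free(ptr); }

// Deriving from fsalloc::managed places a class's instances in
// fsalloc-managed memory instead of the default heap.
class managed {
public:
    static void* operator new(std::size_t size) {
        if (void* p = allocate(size)) {
            return p;
        }
        throw std::bad_alloc();
    }
    static void operator delete(void* ptr) noexcept { deallocate(ptr); }
};

} // namespace fsalloc

// Usage mirroring the LizardFS change from listing 5.2.
struct fsnode_example : public fsalloc::managed {
    int inode_id = 0;
};

int main() {
    fsnode_example* node = new fsnode_example();  // goes through fsalloc-style allocation
    delete node;
    return 0;
}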

5.3. Summary

One of the primary objectives set for fsalloc was to ensure that integrating it with an existing project would be trivial. This goal was definitely achieved: the process of injecting the fsalloc library into LizardFS was a matter of a few technical amendments. This optimistic result makes fsalloc a great candidate for performing proof-of-concept checks of memory optimizations. Overriding native allocation routines with the ones offered by fsalloc can be done even without deep knowledge of the project's implementation. Moreover, the required changes apply to allocation routines only, which makes them fully reversible: if test results suggest that the original allocation procedures provided better performance, a full rollback is always possible.

In conclusion, the following requirements for the integration process have been successfully fulfilled:

ˆ only a few code changes were necessary (approx. 0.013% of the project size1),

ˆ code changes were trivial,

ˆ changes were fully reversible.

With fsalloc properly injected into the LizardFS master server, it is possible to begin the most vital part of this thesis: performance tests.

1Assuming that LizardFS consists of around 30 000 lines of code and the integration process required 4 lines to be amended.


Chapter 6

Test results

An important part of the fsalloc implementation process involved creating and executing tests. Validation tests were used to identify bugs, and regression tests ensured that the number of passed test cases never decreased after applying code modifications. Allocation libraries are implemented for one purpose, though: increasing performance. Therefore, this whole chapter is devoted to presenting the results of extensive performance tests of fsalloc. Instead of preparing a set of isolated cases for every single operation, performance tests were conducted as a long-term examination of a real application using fsalloc routines for memory management. Naturally, integrating LizardFS with fsalloc created a good opportunity to measure the allocation library's performance in a practical environment. Using an exact replica of a real-life production installation of a LizardFS cluster made it possible to see whether the fsalloc implementation is a step in the right direction or, conversely, whether it causes impassable bottlenecks and needs a thorough redesign. The cluster consisted of a master server, a shadow master and 3 chunkservers, all residing on dedicated machines. The installation kept files in various numbers of copies (from 1 to 3). A small portion of all files contained extended attributes and access control lists.

6.1. Parameters

The test scenarios are intended to show various statistics of the master server with different allocation routines. To provide fuller context, the hardware parameters of the machine that hosted the master server are listed in table 6.1. Table 6.2 shows a more detailed view of the LizardFS cluster configuration.

processor            Intel(R) Xeon(R) CPU E3-1226 v3 @ 3.30GHz
memory               4 x 8 GB RAM, DDR3, 1600 MHz
network              1 Gbps Ethernet
hard disk drive      1 TB SATA, 7200 rpm
solid state drive    128 GB SATA

Table 6.1: Hardware parameters

number of files        18630347
RAM usage              8 GiB
shadow servers         1
metaloggers            1
chunkservers           3
clients                4
average file size      256 KiB
average name length    32

Table 6.2: LizardFS cluster configuration

6.2. Configurations

The following configurations were tested:

ˆ no fsalloc enabled,

ˆ fsalloc with single stream in-kernel compression,

ˆ fsalloc with multiple stream in-kernel compression,

ˆ fsalloc with HDD,

ˆ fsalloc with SSD.

Performance tests were carried out on each configuration without amending any internal LizardFS options. The only variable factor was the storage back end for the allocation library.

6.3. Scope

Performance tests were designed to provide information on two factors: memory usage and the time spent on performing metadata operations. The template result sheet for a single configuration contains the following data:

1. time spent on starting the master server and loading metadata into memory,

2. time spent on a set of metadata operations:

ˆ creating files,
ˆ adding extended attributes to files,
ˆ changing file owners,
ˆ attaching a replication goal to files,
ˆ reading file goals,
ˆ removing files,

3. CPU load on the machine during metadata operations,

4. I/O load on the machine during metadata operations.

6.4. Result sheets

6.4.1. without fsalloc

This result sheet presents the state of the LizardFS cluster before supplying it with fsalloc. Table 6.3 plays the role of a reference point for the rest of the result sheets. Additionally, all results are combined into figure 6.9 in order to make the comparison more vivid. The memory and CPU usage charts for this case also include the moment of transition to a configuration with fsalloc enabled. The first part of the chart presents the original master state. The transition to the fsalloc-assisted version is identifiable as a gap in both the memory and CPU charts. The break was caused by the need to restart the master server with preloaded allocation routines: substituting a memory manager without restarting the binary would be very troublesome for both programmers and clients.

Operation                     # of files   value (s)
Master server restart time    N/A          58.614
Creating files                10 000       8.645
Creating files                100 000      134.737
Setting extended attributes   10 000       8.645
Setting extended attributes   100 000      85.662
Changing owner                10 000       10.187
Changing owner                100 000      101.393
Setting goal                  10 000       26.932
Setting goal                  100 000      249.210
Reading goal                  10 000       26.410
Reading goal                  100 000      243.189
Removing files                10 000       7.344
Removing files                100 000      127.701

Table 6.3: Metadata operations with fsalloc disabled

6.4.2. single stream zram

Table 6.4 contains the results for the first in-kernel compressed memory configuration. The difference between single and multiple streams can be examined by comparing this result sheet with table 6.5. Charts 6.1 and 6.2 visualize the moment of reloading the master server with the new memory allocation routines, around 18:00. A drop in used memory as well as a rise in CPU usage are clearly visible. The most noticeable difference is a nearly twentyfold increase in master server restart duration (1164 s versus 59 s). This issue is addressed and explained in subsection 6.7.2.

Operation                     # of files   value (s)
Master server restart time    N/A          1164.16
Creating files                10 000       9.625
Creating files                100 000      141.627
Setting extended attributes   10 000       8.015
Setting extended attributes   100 000      84.552
Changing owner                10 000       11.120
Changing owner                100 000      104.738
Setting goal                  10 000       27.268
Setting goal                  100 000      252.440
Reading goal                  10 000       28.451
Reading goal                  100 000      247.175
Removing files                10 000       8.123
Removing files                100 000      126.135

Table 6.4: Metadata operations with fsalloc enabled with single stream zram

Figure 6.1: Memory chart representing transition from plain configuration (no fsalloc) to 1 stream zram1

Figure 6.2: CPU chart representing transition from plain configuration (no fsalloc) to 1 stream zram

6.4.3. multiple stream zram

Table 6.5 shows the results after modifying the zram configuration to enable more streams to be compressed simultaneously. Charts 6.3 and 6.4 present the moment of switching from the single to the multiple stream configuration. A straightforward observation is that the number of streams changed neither memory usage nor CPU load. Master server start-up duration remained very high.

1Empty space indicates the moment of changing configuration, which requires restarting the master server.

Operation                     # of files   value (s)
Master server restart time    N/A          1099.153
Creating files                10 000       8.401
Creating files                100 000      131.190
Setting extended attributes   10 000       8.225
Setting extended attributes   100 000      84.361
Changing owner                10 000       9.011
Changing owner                100 000      102.522
Setting goal                  10 000       27.633
Setting goal                  100 000      252.350
Reading goal                  10 000       26.414
Reading goal                  100 000      242.160
Removing files                10 000       6.994
Removing files                100 000      124.500

Table 6.5: Metadata operations with fsalloc enabled with 4 stream zram

Figure 6.3: Memory chart representing transition from single stream to 4 stream zram2

Figure 6.4: CPU chart representing transition from single stream to 4 stream zram

6.4.4. HDD

Table 6.6 contains the results for fsalloc with HDD as back-end storage. It can be observed that the results for HDD, presumably the slowest of all chosen storage back ends, do not deviate from either the zram or the standard configuration. As previously, a slight increase in CPU load (figure 6.6) can be observed after restarting the master server with fsalloc enabled.

2Empty space indicates the moment of changing configuration, which requires restarting the master server.

Operation                     # of files   value (s)
Master server restart time    N/A          1182.090
Creating files                10 000       8.747
Creating files                100 000      141.260
Setting extended attributes   10 000       8.110
Setting extended attributes   100 000      83.442
Changing owner                10 000       9.187
Changing owner                100 000      101.000
Setting goal                  10 000       29.003
Setting goal                  100 000      248.225
Reading goal                  10 000       27.217
Reading goal                  100 000      242.155
Removing files                10 000       7.384
Removing files                100 000      128.770

Table 6.6: Metadata operations with fsalloc enabled with HDD

Figure 6.5: Memory chart representing transition from zram configuration to HDD3

Figure 6.6: CPU chart representing transition from zram configuration to HDD

6.4.5. SSD

Table 6.7 contains the results for fsalloc with SSD as back-end storage. Surprisingly, the most costly storage back end does not seem to be worth the money. The result sheet shows that using a solid-state drive did not influence the performance of metadata operations compared to slower and cheaper solutions, i.e. hard drives. As before, CPU overhead (figure 6.8) is observable, but not severely high.

3Empty space indicates the moment of changing configuration, which requires restarting the master server.

Operation                     # of files   value (s)
Master server restart time    N/A          1081.198
Creating files                10 000       8.347
Creating files                100 000      132.234
Setting extended attributes   10 000       8.356
Setting extended attributes   100 000      87.747
Changing owner                10 000       9.986
Changing owner                100 000      101.474
Setting goal                  10 000       26.264
Setting goal                  100 000      248.85
Reading goal                  10 000       27.623
Reading goal                  100 000      244.623
Removing files                10 000       7.679
Removing files                100 000      128.01

Table 6.7: Metadata operations with fsalloc enabled with SSD

Figure 6.7: Memory chart representing transition from HDD configuration to SSD4

Figure 6.8: CPU chart representing transition from HDD configuration to SSD5

6.5. Combined results

Figure 6.9 shows a combined plot of the performance tests for each configuration. It is clear that the elapsed time of performing a chosen metadata operation multiple times does not strongly depend on the fsalloc configuration. In fact, none of the values differs from the average by more than 5%.

4Empty space indicates the moment of changing configuration, which requires restarting the master server.
5Peak value of nearly 25% (at 10:45) is an anomaly caused by external factors and should not be taken into account in comparison of HDD and SSD configurations.

[two bar charts: elapsed time for 10K operations (s) and elapsed time for 100K operations (s) of creating files, setting xattr, setting goal, reading goals, removing files and changing owner, for the configurations no fsalloc, 1-stream zram, 4-stream zram, HDD and SSD]

Figure 6.9: Combined times of metadata operations for all tested configurations. Low diversity of values corresponding to different configurations indicates that back-end storage is not a major performance factor for small metadata operations.

Compression streams     1
Number of reads         269
Number of writes        7910336
Memory used (total)     710250496 B (710 MiB)
Compressed data size    682413960 B (682 MiB)
Original data size      2062983168 B (2 GiB)
Reclaimed memory        66.9%

Table 6.8: Statistics for single-stream zram configuration

Compression streams     4
Number of reads         447
Number of writes        4963835
Memory used (total)     710184960 B (710 MiB)
Compressed data size    682405652 B (682 MiB)
Original data size      2063036416 B (2 GiB)
Reclaimed memory        66.9%

Table 6.9: Statistics for 4-stream zram configuration

6.6. In-kernel compressed memory

Two configurations of the zram block device, differing in the maximal number of compression streams, were prepared for testing. Tables 6.8 and 6.9 present statistics provided by the zram module. The statistics were gathered during LizardFS master server start. This procedure involves reading all metadata from the hard drive to memory; thus, an imbalance between the number of reads and writes is noticeable. An important conclusion is that the file system's metadata proved to be a great fit for the chosen compression algorithm. All structures were packed to one third of their original size, which is significantly better than 50%, an average result for this class of algorithms. The reason behind this could be that the mentioned structures are similar in size and often contain values close to one another in the sense of Hamming distance6. Consecutive index numbers or pointers to the same structures are examples of such congruent values. As shown, the number of compression streams does not influence the compression ratio. On the other hand, more compression streams led to slightly better performance results, which can be observed in section 6.4.
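As a quick cross-check, the reclaimed-memory figure follows directly from the compressed and original data sizes reported in table 6.8:

\[ 1 - \frac{682413960\ \mathrm{B}}{2062983168\ \mathrm{B}} \approx 1 - 0.331 = 0.669 \approx 66.9\% \]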

6.7. Profiling

6.7.1. Results

Overriding allocation procedures carries the risk of creating a bottleneck in a programming project. Fortunately, tools designed for detecting potential bottlenecks exist and are broadly known as profilers. The profiler chosen for this experiment is perf [16], a very powerful yet straightforward utility. It is officially supported by the Linux kernel, which makes it a perfect candidate for profiling a project based on system calls and signals.

6Hamming distance between bit arrays A and B of length n is the number of positions on which A and B have different bit values, i.e. $\sum_{i=1}^{n} [A[i] \neq B[i]]$.

Figure 6.10 contains a report of the most frequent calls performed by the LizardFS master with fsalloc enabled. Each row in figure 6.10 consists of the following data:

ˆ average percentage of processor time spent on a given call,

ˆ name of the process responsible for the call,

ˆ a flag indicating whether the call originated from user space ([.]) or kernel space ([k]),

ˆ source of the call's symbol (LizardFS symbol table, kernel or external library),

ˆ call's symbol.

Supplementary profiling output is shown in figure 6.11. It presents a graph of the calls that used the most resources. The graph clearly indicates that the mprotect call was directly responsible for over 32% of the CPU overhead. Indirectly, the culprit was the LizardFS master server itself: all calls whose identifier starts with fs_ are part of the metadata handling routines.

Figures 6.10 and 6.11 show only an excerpt of the profiling results. Full graphs and all perf [16] outputs are available on the attached CD.

6.7.2. Conclusions

The most active call, which used approximately 10% of processor time, was the cryptic _M_insert_aux function. It is actually an internal method of the vector and deque classes from the Standard Template Library for C++. A deque (double-ended queue) is used as a container for cached memory regions in fsalloc; hence, the most costly operation turned out to be updating the cache of active regions. It is clear that using a less complex data structure would reduce the cost of this most expensive operation. A static circular buffer implemented with a C-style array provides all the functionality needed for a first-in first-out cache, while the much more flexible double-ended queue generates more overhead than necessary for such a simple use case. However, despite being first on the list of most expensive calls, _M_insert_aux is not the actual bottleneck. For reference, the perf output of the same experiment performed on an improved implementation of fsalloc with better suited data structures can be found in figure A.5. The real, less obvious bottleneck is described below.
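For illustration, a minimal sketch of such a first-in first-out cache built on a fixed-capacity array is shown below. It is only a sketch; the simplified structure used in the second iteration of fsalloc (section 7.2) may differ in details. Pushing into a full buffer returns the evicted, oldest region, which the caller can then write back to the storage layer.

#include <cstddef>
#include <optional>
#include <vector>

// Fixed-capacity FIFO cache of active memory regions, backed by an array
// allocated once at construction time.
template <typename Region>
class RingBufferCache {
public:
    explicit RingBufferCache(std::size_t capacity)
        : buf_(capacity), head_(0), size_(0) {}

    // Inserts a region; returns the evicted (oldest) one if the buffer was full.
    std::optional<Region> push(const Region& region) {
        std::optional<Region> evicted;
        if (size_ == buf_.size()) {
            evicted = buf_[head_];          // oldest entry falls out
        } else {
            ++size_;
        }
        buf_[head_] = region;
        head_ = (head_ + 1) % buf_.size();  // advance the write position
        return evicted;
    }

    std::size_t size() const { return size_; }

private:
    std::vector<Region> buf_;
    std::size_t head_;   // next slot to overwrite
    std::size_t size_;   // number of valid entries
};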

Second place belongs to the page fault handler, which demanded nearly 7% of processor time. Restarting the server triggers loading all metadata from disk to RAM, which, in the case of fsalloc, causes a lot of cache invalidation. The next 3 costly operations relate to page faults as well: each mapped memory area needs to be inserted into a red-black interval tree, pages need to be zeroed and flushed from translation lookaside buffers, etc.
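To make the source of this overhead more concrete, the self-contained sketch below reproduces the general user-space paging pattern: every first access to a protected page costs one SIGSEGV and one mprotect round trip. It is an illustration of the mechanism only, not the actual fsalloc code, and the one-object-per-page layout is assumed purely for simplicity.

#include <csignal>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

static long g_page_size;

// On a fault, make the offending page accessible again. A real allocator
// would fetch the object from its storage back end at this point.
static void segv_handler(int, siginfo_t* info, void*) {
    uintptr_t addr = (uintptr_t)info->si_addr;
    void* page = (void*)(addr & ~(uintptr_t)(g_page_size - 1));
    mprotect(page, g_page_size, PROT_READ | PROT_WRITE);
}

int main() {
    g_page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa = {};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);

    const int kObjects = 1000;  // one "object" per page, all initially inaccessible
    char* objects = (char*)mmap(nullptr, kObjects * g_page_size, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (objects == MAP_FAILED) {
        return 1;
    }

    long sum = 0;
    for (int i = 0; i < kObjects; ++i) {
        sum += objects[i * g_page_size];  // each first touch: SIGSEGV + mprotect
    }
    return (int)sum;  // pages are zero-filled, so sum is 0
}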

The fact that calls to the BerkeleyDB interface occupy low positions in the profiling output is promising. It suggests that the chosen database engine is indeed very efficient. Restarting the LizardFS master server is a write-oriented operation when it comes to memory, and yet the database back end proved to fulfill its duties without creating a noticeable overhead.
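For context, the storage layer's interaction with the database boils down to put/get operations keyed by a page or object identifier. The sketch below uses the classic Berkeley DB C API (provided by libdb); the key layout and file name are only illustrative and do not reflect the actual fsalloc storage layer.

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <db.h>   // Berkeley DB C API

int main() {
    DB* dbp = nullptr;
    if (db_create(&dbp, nullptr, 0) != 0) {
        return 1;
    }
    // Single-file B-tree database, created if it does not exist yet.
    if (dbp->open(dbp, nullptr, "example.bdb", nullptr, DB_BTREE, DB_CREATE, 0664) != 0) {
        return 1;
    }

    // Store one 4 KiB "page" under a numeric identifier.
    uint64_t page_id = 42;
    char page[4096];
    std::memset(page, 0xAB, sizeof(page));

    DBT key, data;
    std::memset(&key, 0, sizeof(key));
    std::memset(&data, 0, sizeof(data));
    key.data = &page_id;
    key.size = sizeof(page_id);
    data.data = page;
    data.size = sizeof(page);
    dbp->put(dbp, nullptr, &key, &data, 0);

    // Read the record back; Berkeley DB fills in result.data and result.size.
    DBT result;
    std::memset(&result, 0, sizeof(result));
    if (dbp->get(dbp, nullptr, &key, &result, 0) == 0) {
        std::printf("fetched %u bytes\n", result.size);
    }

    dbp->close(dbp, 0);
    return 0;
}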

# Inserting a memory region into fsalloc cache.
11.24% mfsmaster mfsmaster [.] _M_insert_aux
# Serving a low level page_fault operation.
 6.77% mfsmaster [kernel.kallsyms] [k] page_fault
# Inserting mapped memory region into internal kernel structure.
 5.74% mfsmaster [kernel.kallsyms] [k] anon_vma_interval_tree_insert
# Overhead generated by perf.
 3.06% mfsmaster [kernel.kallsyms] [k] perf_event_aux_ctx
# Zeroing the memory of anonymously mapped page (optimized).
 2.62% mfsmaster [kernel.kallsyms] [k] clear_page_c_e
# Invalidating translation lookaside buffers.
 2.56% mfsmaster [kernel.kallsyms] [k] flush_tlb_mm_range
# Copying memory.
 2.46% mfsmaster libc-2.19.so [.] __memcpy_sse2_unaligned
# Locating memory area containing given address (segfault handling).
 2.32% mfsmaster [kernel.kallsyms] [k] find_vma
# Expanding the size of fsalloc cache.
 1.93% mfsmaster mfsmaster [.] _M_default_append
# Adjusting memory area size and underneath tree structure.
 1.90% mfsmaster [kernel.kallsyms] [k] vma_adjust
# Result of mprotect syscall.
 1.75% mfsmaster [kernel.kallsyms] [k] change_protection
# Allocating kernel memory from a slab cache.
 1.59% mfsmaster [kernel.kallsyms] [k] kmem_cache_alloc
# Overhead for mutual exclusion.
 1.50% mfsmaster [kernel.kallsyms] [k] _raw_spin_lock
# Unmapping a memory region.
 1.29% mfsmaster [kernel.kallsyms] [k] unmap_single_vma
# Overhead created by BerkeleyDB.
 1.29% mfsmaster libdb-5.3.so [.]

Figure 6.10: Profiling results

An unpleasant conclusion is that for structures that appear in millions, per-object granularity creates a bottleneck in the fsalloc library when all (or nearly all) objects need to be accessed. When the cache is systematically invalidated with new entries, the overhead created by continuous page faults starts to dominate processor usage. This is evidenced by the fact that restarting the master server with fsalloc enabled took nearly 20 minutes compared to 1 minute with the fsalloc-free configuration. Results produced by perf (figure 6.10) support this argument as well: the processor was heavily occupied with context switching and page-level memory management. On the other hand, if memory-wide operations are sparse (e.g. occur once a week), then such increased processing time may be acceptable. When the locality of accessed regions is usually high, the cache meets its obligations and the additional cycles spent on handling page faults are not perceptible to users.

The results presented in section 6.4 indicate that a master server restart puts a lot of pressure on fsalloc, which considerably extends the time needed for the action to finish. These results are backed by the profiling output: the processor was forced to spend a lot of time processing page faults and revalidating a heavily loaded cache.

One potential optimization idea arises from the profiling results: moving the cache querying logic to kernel space would eliminate the need to switch modes so frequently. If active memory regions were kept in the kernel, only cache misses would need to be served by user space, where the Berkeley database interface is available. Cache hits could be served entirely during page fault handling. While having no noticeable impact on memory-wide operations like a server restart (because they generate a lot of cache misses), casual operations with high memory locality would most likely save CPU cycles on avoided mode switches.

# mprotect
32.71% 0.54% mfsmaster libc-2.19.so [.] __GI___mprotect
       |
       --- __GI___mprotect
          |
          |--98.22%-- 0x2
          |
          |--1.04%-- handler(int, siginfo_t*, void*)
          |          __restore_rt
          |          |
          |          |--23.94%-- fs_loadedge(_IO_FILE*, int)
          |          |           fs_loadedges(_IO_FILE*, int)
          |          |           fs_load(_IO_FILE*, int, unsigned char)
          |          |           fs_loadall(std::string const&, int)
          |          |           fs_loadall()
          |          |           fs_init (bool)
          |          |           initialize(run_tab*)
          |          |           main
          |          |           __libc_start_main
          |          |
          |          |--13.47%-- fs_loadall(std::string const&, int)
          |          |           fs_loadall()
          |          |           fs_init (bool)
          |          |           initialize(run_tab*)
          |          |           main
          |          |           __libc_start_main
          |          |
          |          |--9.18%-- fs_load(_IO_FILE*, int, unsigned char)
          |                     fs_loadall(std::string const&, int)
          |                     fs_loadall()
          |                     fs_init (bool)
          |                     initialize(run_tab*)
          |                     main
          |                     __libc_start_main

Figure 6.11: Profiling results (usage graph)


Chapter 7

Conclusions

Implementing fsalloc and measuring its performance on a real-life file system cluster resulted in many valuable lessons learned. Some of the assumptions have proven correct, others, unfortunately, wrong. Nonetheless, the whole study certainly showed that solid-state drives and memory compression are very useful in the field of RAM usage optimization. Test results have shown that memory reclaimed by using disks or compression did not cause performance drops in several cases (e.g. moderately small, local operations). Thanks to the reflections found in SSDAlloc-related documents [1, 2, 3], a simple user-space memory management scheme could be implemented and tested. Differences between using HDD, SSD and compression as a writeback target could be examined and compared to one another.

7.1. Evaluation

Test results revealed various strong and weak points of the fsalloc library. As a reminder, the presented solution was rated in terms of performance and programming ease.

The performance of fsalloc proved to be acceptable for performing minor metadata operations on a LizardFS cluster, but installation-wide actions (e.g. a master server restart) triggered a clear bottleneck: fsalloc overhead was responsible for making the operation orders of magnitude slower than with native allocation routines. It should be noted, however, that restarting a master server is a maintenance operation; it is executed in exceptional circumstances only (server failure, software upgrade). As such, it might be acceptable to reduce memory usage at the expense of slowing down operations which happen very rarely. Fortunately, the root causes of the mentioned bottleneck were identified. First of all, fsalloc treats each stored object separately. If fetching each object potentially requires performing a system call (mprotect) and handling a signal (SIGSEGV), and there are millions of objects to serve at the same time, the accumulated overhead is very high, even though a single operation needs only microseconds to succeed. Two natural solutions come to mind:

1. implementing memory management in kernel-space instead of user-space decreases the amount of required mode switches,

2. batching similar objects decreases granularity  which is the major cause of high over- head for installation-wide metadata operations.

Both of the above methods have disadvantages. Implementing memory management as a kernel module reduces portability and forces users to either recompile their kernel or use a precompiled, unofficial kernel build before using the alternative allocation, which rules out programming ease, the second important trait of fsalloc. Batching, in turn, requires more thinking and precision when it comes to integrating fsalloc with other projects. Forcing users to provide their own, thoughtful heuristics significantly reduces programming ease as well. Another minor improvement would be to port fsalloc to C and use data structures that are as simple as possible. Even if allocation/deallocation is to remain a bottleneck for installation-wide operations, it should definitely be minimized.

Programming ease was achieved without a doubt. The simple user-space memory management created on the basis of SSDAlloc [1, 2, 3] and the generic C++ interface make fsalloc highly integrable. As shown in chapter 5, overriding native allocation procedures with the ones provided by fsalloc was a minor technical issue: neither deep knowledge of LizardFS nor exceptional programming skills were required to complete the task.

7.2. Perspectives

Performance tests showed that the examined version of fsalloc was constrained by two factors: overly complex data structures and frequent memory-related system calls. A natural continuation of the fsalloc project should consider dealing with both issues. The first, less serious one has already been dealt with. The second iteration of the fsalloc implementation (available on the attached CD) replaced the misused standard data structures with simplified versions: deque → array-based ring buffer, unordered_map → array-based hash map. The real bottleneck is now clearly exposed (see figure A.5). Taking this into account, the actual goal should be to research and implement heuristics for bundling and batching multiple requests, to avoid the overhead caused by frequent system calls, namely mmap.

7.3. Epilogue

Implementing an alternative allocation library and testing it on a real-life installation of LizardFS was a very educational experience. With the help of performance tests and profiling, bottlenecks and false assumptions were detected and verified. Hopefully, this research will be useful for people responsible for designing memory management with particular attention to saving space. The implemented fsalloc library is published under an open-source compliant license and is available at https://github.com/psarna/fsalloc and on the attached CD.

Appendix A

LizardFS graphs

This appendix contains visualizations of the LizardFS cluster mentioned in chapter 1 and extended performance results related to chapter 6.

Figure A.1: LizardFS overview. Source: workshop materials of Skytechnology Sp. z o.o.

Figure A.2: Read flow of LizardFS. Source: workshop materials of Skytechnology Sp. z o.o.

Figure A.3: Write flow of LizardFS. Source: workshop materials of Skytechnology Sp. z o.o.

[heap profiling call graph: virtually all allocation happens under fs_loadall; fs_loadnode accounts for 51.7% of it, fs_loadedge for 32.3% and chunk_load/chunk_new for 13.8%]

Figure A.4: Heap profiling results

 8.89% [kernel]      [k] anon_vma_interval_tree_insert
 5.09% [kernel]      [k] find_vma
 4.47% [kernel]      [k] page_fault
 3.66% [kernel]      [k] anon_vma_interval_tree_remove
 3.60% [kernel]      [k] copy_user_enhanced_fast_string
 2.40% mfsmaster     [.] chunk_checksum(chunk const*)
 2.27% [kernel]      [k] vma_adjust
 2.26% mfsmaster     [.] ChunkGoalCounters::fileCount() const
 1.96% mfsmaster     [.] fsedges_checksum(fsedge const*)
 1.66% [kernel]      [k] unmap_single_vma
 1.53% [kernel]      [k] clear_page_c_e
 1.49% mfsmaster     [.] handler(int, siginfo_t*, void*)
 1.46% [kernel]      [k] vma_compute_subtree_gap
 1.42% [kernel]      [k] perf_event_aux_ctx
 1.42% libc-2.19.so  [.] __memcpy_sse2_unaligned
 1.39% [kernel]      [k] flush_tlb_mm_range
 1.35% libdb-5.3.so  [.] __memp_alloc
 1.32% [kernel]      [k] vma_merge
 1.17% [kernel]      [k] __rb_insert_augmented
 1.12% [kernel]      [k] kmem_cache_alloc
 1.12% [kernel]      [k] find_get_entry
 1.10% [kernel]      [k] __rb_erase_color
 0.93% [kernel]      [k] change_protection
 0.91% [kernel]      [k] kmem_cache_free
 0.90% [kernel]      [k] up_write
 0.88% [kernel]      [k] __mem_cgroup_uncharge_common
 0.82% [kernel]      [k] perf_event_aux
 0.81% [kernel]      [k] save_xstate_sig
 0.77% [kernel]      [k] _raw_spin_lock
 0.77% [kernel]      [k] down_write
 0.74% [kernel]      [k] mprotect_fixup
 0.73% [kernel]      [k] mem_cgroup_charge_anon
 0.73% [kernel]      [k] vmacache_find
 0.69% [kernel]      [k] zap_page_range
 0.69% [kernel]      [k] vma_rb_erase
 0.69% libdb-5.3.so  [.] __memp_fget
 0.66% [kernel]      [k] get_page_from_freelist
 0.65% [kernel]      [k] system_call
 0.62% [kernel]      [k] release_pages
 0.61% [kernel]      [k] anon_vma_clone
 0.59% [kernel]      [k] _raw_spin_lock_irqsave
 0.57% [kernel]      [k] system_call_after_swapgs
 0.53% [kernel]      [k] perf_event_mmap
 0.48% [kernel]      [k] __restore_xstate_sig

Figure A.5: Perf results of master server running with new implementation of fsalloc based on better suited data structures: circular buffer and hash table


Appendix B

Contents of the attached CD

ˆ /thesis: electronic version of this thesis

ˆ /fsalloc

– /fsalloc/LICENSE: full text of fsalloc's license
– /fsalloc/src: source code of the fsalloc library

ˆ /results

– /results/performance: performance test results
– /results/profiling: profiling results


Bibliography

[1] Badam, Anirudh: Bridging the Memory-Storage Gap, Princeton University, 2012

[2] Badam, Anirudh: Easing the Hybrid RAM/SSD Memory Management with SSDAlloc, Princeton University, 2010

[3] Badam, Anirudh and Pai, Vivek S.: SSDAlloc: Hybrid SSD/RAM Memory Management Made Easy, Princeton University, 2011

[4] Bar, Moshe: The Linux Signals Handling Model, Linux Journal, 2000

[5] Beltran, Vicenç and Torres, Jordi and Ayguadé, Eduard: Improving Disk Bandwidth-bound Applications Through Main Memory Compression, Dealing with Applications, systems and architecture, 2007

[6] Berkeley DB Documentation - Oracle, http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/documentation/index.html

[7] Boost Library documentation, http://www.boost.org/doc/libs/

[8] CMake project, https://cmake.org/

[9] Google perf tools, http://goog-perftools.sourceforge.net/

[10] Hu, Jian and Jiang, Hong and Manden, Prakash: Understanding Performance Anomalies of SSDs and Their Impact in Enterprise Application Environment, ACM SIGMETRICS Performance Evaluation Review, 2012

[11] jemalloc documentation, http://www.canonware.com/jemalloc/

[12] Linux kernel documentation, Compressed RAM based block devices, https://www.kernel.org/doc/Documentation/blockdev/zram.txt

[13] LizardFS documentation, https://lizardfs.com/documentation/

[14] Love, Robert: Linux Kernel Development, Pearson Education Inc, 2010

[15] Magenheimer, Dan: In-kernel memory compression, Article for lwn.net, April 2013

[16] Perf Wiki, https://perf.wiki.kernel.org/

[17] Quick Benchmark: Gzip vs Bzip2 vs LZMA vs XZ vs LZ4 vs LZO, http://catchchallenger.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO

[18] Samih, Ahmad and Wang, Ren and Maciocco, Christian and Tai, Tsung-Yuan and Duan, Ronghui and Duan, Jiangang and Solihin, Yan: Evaluating Dynamics and Bottlenecks of Memory Collaboration in Cluster Systems, 2012

[19] TCMalloc documentation, http://goog-perftools.sourceforge.net/doc/tcmalloc.html

[20] Unified Format, diffutils, http://www.gnu.org/s/diffutils/manual/html_node/Unified-Format.html

[21] UserBenchmark, http://hdd.userbenchmark.com/, http://ssd.userbenchmark.com/

[22] Yong, Zhang: Kernel Korner: Meddling with Memory, Linux Journal, January 2001
