15-122: Principles of Imperative Computation, Fall 2012
Assignment 6: Huffman Coding (Programming Part)
Due: Thursday, November 8, 2012 by 23:59

For the programming portion of this week's homework, you'll write a C0 implementation of a popular compression technique called Huffman coding. In fact, you will write parts of it twice, once in C0 and once in C.

  • huffman.c0 (described in Sections 1, 2, 3, and 4)
  • huffman.c (described in Section 5)

You should submit these files electronically by the due date. Detailed submission instructions can be found below.

Important Note: For this lab you must restrict yourself to 25 or fewer Autolab submissions. As always, you should compile and run some test cases locally before submitting to Autolab.

Assignment: Huffman Coding (25 points in total)

Starter code. Download the file handout-hw6.zip from the course website. When you unzip it, you will find several C0 and C files and a lib/ directory with some provided libraries. You should not modify or submit the library code in the lib/ directory, nor should you rely on the internals of these implementations. When we are testing your code, we will use our own implementations of these libraries with the same interface, but possibly different internals.

Compiling and running. You will compile and run your code using the standard C0 tools and the gcc compiler. You should compile and run the C0 version with

  % cc0 -d -o huff0 huffman.c0 huffman-main.c0
  % ./huff0

You should compile and run the C version with

  % gcc -DDEBUG -g -Wall -Wextra -Werror -std=c99 -pedantic -o huff \
        lib/xalloc.c lib/heaps.c freqtable.c huffman.c huffman-main.c
  % ./huff

Submitting. Once you've completed some files, you can submit them to Autolab. There are two ways to do this. From the terminal on Andrew Linux (via cluster or ssh), type:

  % handin hw6 huffman.c0 huffman.c

Your score will then be available on the Autolab website. Your files can also be submitted to the web interface of Autolab. To do so, please tar them, for example:

  % tar -cvf hw6_solutions.tar huffman.c0 huffman.c

Then, go to https://autolab.cs.cmu.edu/15122-f12 and submit them as your solution to homework 6 (huffmanlab). You may submit your assignment up to 25 times. When we grade your assignment, we will consider the most recent version submitted before the due date.

Annotations. Be sure to include appropriate //@requires, //@ensures, //@assert, and //@loop_invariant annotations in your program. You should provide loop invariants and any assertions that you use to check your reasoning. If you write any auxiliary functions, include precise and appropriate pre- and postconditions. Also include corresponding REQUIRES, ENSURES, and ASSERT macros in your C programs, especially where they help explain program properties that are important but not obvious. You should write these as you are writing the code rather than after you're done: documenting your code as you go along will help you reason about what it should be doing, and thus help you write code that is both clearer and more correct. Annotations are part of your score for the programming problems; you will not receive maximum credit if your annotations are weak or missing.
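For instance, contracts on a C helper might look like the following sketch. This is our illustration, not handout code: the helper, its name, and the table it inspects are hypothetical, and we locally define REQUIRES/ENSURES as plain asserts in case the handout's contract header differs.

  #include <assert.h>
  #include <stddef.h>

  /* Stand-ins for the course-style contract macros; prefer the handout's
   * own header (enabled with -DDEBUG) where it is available. */
  #define REQUIRES(COND) assert(COND)
  #define ENSURES(COND)  assert(COND)

  /* Hypothetical helper: length of the code word for character c, given a
   * table of code-word lengths indexed by character. */
  int code_length(const int *lengths, int num_chars, int c) {
    REQUIRES(lengths != NULL);
    REQUIRES(0 <= c && c < num_chars);
    int len = lengths[c];
    ENSURES(len > 0);  /* an encoded character has a nonempty code word */
    return len;
  }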
Unit testing. You should write unit tests for your code. For this assignment, unit tests will not be graded, but they will help you check the correctness of your code, pinpoint the location of bugs, and save you hours of frustration.

Style. Strive to write code with good style: indent every line of a block to the same level, use descriptive variable names, keep lines to 80 characters or fewer, document your code with comments, etc. If you find yourself writing the same code over and over, you should write a separate function to handle that computation and call it whenever you need it. We will read your code when we grade it, and good style is sure to earn our good graces. Feel free to ask on Piazza if you're unsure of what constitutes good style.

Task 0 (5 points)

On this assignment, 5 points will be given for style. We will examine your C0 code as well as your C code.

Data Compression: Overview

Whenever we represent data in a computer, we have to choose some sort of encoding with which to represent it. When representing strings in C0, for instance, we use ASCII codes to represent the individual characters. Other encodings are possible as well. The Unicode standard (as another encoding example) defines a variety of character encodings with a variety of different properties. The simplest, UTF-32, uses 32 bits per character.

Under the ASCII encoding, each character is represented using 7 bits, so a string of length n requires 7n bits of storage, which we usually round up to 8n bits or n bytes. For example, consider the string "more free coffee"; ignoring spaces, it can be represented in ASCII as follows with 14 × 7 = 98 bits:

  1101101 · 1101111 · 1110010 · 1100101 · 1100110 · 1110010 · 1100101 ·
  1100101 · 1100011 · 1101111 · 1100110 · 1100110 · 1100101 · 1100101

This encoding of the string is rather wasteful, though. In fact, since there are only 6 distinct characters in the string, we should be able to represent it using a custom encoding that uses only ⌈log2 6⌉ = 3 bits to encode each character, such as the one shown in Figure 1.

  Character  Code
  'c'        000
  'e'        001
  'f'        010
  'm'        011
  'o'        100
  'r'        101

  Figure 1: A custom fixed-length encoding for the non-whitespace
  characters in the string "more free coffee".

If we were to use this custom encoding, the string would be represented with only 14 × 3 = 42 bits:

  011 · 100 · 101 · 001 · 010 · 101 · 001 · 001 · 000 · 100 · 010 · 010 · 001 · 001

In both cases, we may of course omit the separator "·" between codes; they are included only for readability.

If we confine ourselves to representing each character using the same number of bits, i.e., a fixed-length encoding, then this is the best we can do. But if we allow ourselves a variable-length encoding, then we can take advantage of special properties of the data: for instance, in the sample string, the character 'e' occurs very frequently while the characters 'c' and 'm' occur very infrequently, so it would be worthwhile to use a smaller bit pattern to encode the character 'e' even at the expense of having to use longer bit patterns to encode 'c' and 'm'.
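As a quick sanity check (our own sketch, not part of the handout), the following C program counts the character frequencies in the sample string and compares the 3-bit fixed-length cost against the variable-length cost, using the code-word lengths from Figure 2 below:

  #include <stdio.h>

  int main(void) {
    const char *s = "morefreecoffee";    /* sample string, spaces ignored */
    int freq[128] = {0};
    for (const char *p = s; *p != '\0'; p++)
      freq[(unsigned char)*p]++;         /* count occurrences of each char */

    /* Code-word lengths from the variable-length encoding in Figure 2 */
    int len[128] = {0};
    len['e'] = 1; len['o'] = 3; len['m'] = 4;
    len['c'] = 4; len['r'] = 3; len['f'] = 3;

    int fixed = 0, variable = 0;
    for (int c = 0; c < 128; c++) {
      fixed    += freq[c] * 3;           /* 3 bits per character (Figure 1) */
      variable += freq[c] * len[c];      /* per-character lengths (Figure 2) */
    }
    printf("fixed: %d bits, variable: %d bits\n", fixed, variable);
    /* prints: fixed: 42 bits, variable: 34 bits */
    return 0;
  }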
  Character  Code
  'e'        0
  'o'        100
  'm'        1010
  'c'        1011
  'r'        110
  'f'        111

  Figure 2: A custom variable-length encoding for the non-whitespace
  characters in the string "more free coffee".

The encoding shown in Figure 2 employs such a strategy, and using it, the sample string can be represented with only 34 bits:

  1010 · 100 · 110 · 0 · 111 · 110 · 0 · 0 · 1011 · 100 · 111 · 111 · 0 · 0

Since this encoding is prefix-free (no code word is a prefix of any other code word), the "·" separators are redundant here, too. It can be proven that this encoding is optimal for this particular string: no other encoding can represent the string using fewer than 34 bits. Moreover, the encoding is optimal for any string that has the same distribution of characters as the sample string. In this assignment, you will implement a method for constructing such optimal encodings developed by David Huffman.

Huffman Coding: A Brief History

Huffman coding is an algorithm for constructing optimal prefix-free encodings given a frequency distribution over characters. It was developed in 1951 by David Huffman when he was a Ph.D. student at MIT taking a course on information theory taught by Robert Fano. It was towards the end of the semester, and Fano had given his students a choice: they could either take a final exam to demonstrate mastery of the material, or they could write a term paper on something pertinent to information theory. Fano suggested a number of possible topics, one of which was efficient binary encodings: while Fano himself had worked on the subject with his colleague Claude Shannon, it was not known at the time how to efficiently construct optimal encodings. Huffman struggled for some time to make headway on the problem and was about to give up and start studying for the final when he hit upon a key insight and invented the algorithm that bears his name, thus outdoing his professor, making history, and attaining an "A" for the course. Today, Huffman coding enjoys a variety of applications: it is used as part of the DEFLATE algorithm for producing ZIP files and as part of several multimedia codecs like JPEG and MP3.

1 Huffman Trees

Recall that an encoding is prefix-free if no code word is a prefix of any other code word. Prefix-free encodings can be represented as full binary trees with characters stored at the leaves: a branch to the left represents a 0 bit, a branch to the right represents a 1 bit, and the path from the root to a leaf gives the code word for the character stored at that leaf. For example, the encodings from Figures 1 and 2 are represented by the binary trees in Figures 3 and 4, respectively.

  [Figure 3: The custom encoding from Figure 1 as a binary tree, with
  leaves c, e, f, m, o, r.]
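To make the tree-walking concrete, here is a small C sketch of our own (not the handout's interface; the handout's actual types and function names will differ) that decodes a string of '0'/'1' characters by walking such a tree: follow bits from the root, emit a character at each leaf, and restart at the root.

  #include <stdio.h>
  #include <stdlib.h>

  /* A hypothetical Huffman tree node, for illustration only. Interior
   * nodes have two children (the tree is full); leaves carry a char. */
  struct htree {
    char c;                /* meaningful only at a leaf */
    struct htree *left;    /* the 0 branch */
    struct htree *right;   /* the 1 branch */
  };

  static struct htree *leaf(char c) {
    struct htree *t = malloc(sizeof *t);
    t->c = c; t->left = NULL; t->right = NULL;
    return t;
  }

  static struct htree *node(struct htree *l, struct htree *r) {
    struct htree *t = malloc(sizeof *t);
    t->c = '\0'; t->left = l; t->right = r;
    return t;
  }

  /* Walk from the root: go left on '0', right on '1'; at a leaf, print
   * its character and restart at the root. Any trailing, incomplete
   * code word is silently ignored in this sketch. */
  void decode(struct htree *root, const char *bits) {
    struct htree *cur = root;
    for (const char *p = bits; *p != '\0'; p++) {
      cur = (*p == '0') ? cur->left : cur->right;
      if (cur->left == NULL && cur->right == NULL) {  /* reached a leaf */
        putchar(cur->c);
        cur = root;
      }
    }
    putchar('\n');
  }

  int main(void) {
    /* The Figure 2 tree: e=0, o=100, m=1010, c=1011, r=110, f=111 */
    struct htree *root =
      node(leaf('e'),
           node(node(leaf('o'), node(leaf('m'), leaf('c'))),
                node(leaf('r'), leaf('f'))));
    decode(root, "1010100110011111000101110011111100");
    /* prints: morefreecoffee */
    return 0;
  }

Running it on the 34-bit encoding above prints morefreecoffee, recovering the sample string.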