Building Efficient Algorithms by Learning to Compress by Davis W. Blalock

Building Efficient Algorithms by Learning to Compress

by Davis W. Blalock

B.S., University of Virginia (2014)
S.M., Massachusetts Institute of Technology (2016)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, September 2020.

© Davis W. Blalock, MMXX. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, August 28, 2020
Certified by: John V. Guttag, Dugald C. Jackson Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students

Abstract

The amount of data in the world is doubling every two years. Such abundant data offers immense opportunities, but also imposes immense computation, storage, and energy costs. This thesis introduces efficient algorithms for reducing these costs at bottlenecks in real-world data analysis and machine learning pipelines. Concretely, we introduce algorithms for:

• Lossless compression of time series. This algorithm compresses better than any existing method, despite requiring only the resources available on a low-power edge device.
• Approximate matrix-vector multiplies. This algorithm accelerates approximate similarity scans by an order of magnitude relative to existing methods.
• Approximate matrix-matrix multiplies. This algorithm often outperforms existing approximation methods by more than 10× and non-approximate computation by more than 100×.

We provide extensive empirical analyses of all three algorithms using real-world datasets and realistic workloads. We also prove bounds on the errors introduced by the two approximation algorithms.

The theme unifying all of these contributions is learned compression. While compression is typically thought of only as a means to reduce data size, we show that specially designed compression schemes can also dramatically increase computation speed and reduce memory requirements.

Thesis Supervisor: John V. Guttag
Title: Dugald C. Jackson Professor of Electrical Engineering and Computer Science

Acknowledgments

This PhD would not have been possible without the support and efforts of many people. At the top of this list is my advisor, John Guttag. It’s difficult to identify all the positive aspects of having John as an advisor, so I will stick to only a few key points. First, John is extremely supportive and kind: he genuinely wants what’s best for his graduate students and is generous and understanding far beyond the call of duty.
In addition, one could not ask for a more knowledgeable mentor; John has deep expertise in many areas of computer science, and I doubt I could have tackled the problems I did without his willingness and ability to straddle many subfields at once. This knowledgeability also extends to the practice of research itself; John can often transform writing, presentations, and problem framings from lackluster to exceptional in the span of a single round of feedback.

I am also indebted to my collaborators and labmates. I would particularly like to thank Divya Shanmugam and Jose Javier Gonzalez Ortiz for managing to put up with me over the course of multiple research projects. I would also like to thank Anima, Yun, Guha, Joel, Jen, Amy, Maggie, Adrian, Tristan, Marzyeh, Tiam, Harini, Katie, Marianne, Emily, Addie, Wayne, Sra, Matt, Dina, Maryann, Roshni, and Aniruddh for their friendship, feedback, and ideas over the years.

Along similar lines, I would like to thank Sam Madden and Tamara Broderick for serving on my thesis committee and being great collaborators in research (and other) endeavors in the past few years. Sam and Tamara not only possess deep expertise in their fields, but are also enjoyable and interesting to work with.

I would be remiss not to include some mentors from before MIT. Foremost, I’d like to thank John Lach and Ben Boudaoud for taking a chance on me when I was an undergrad who didn’t know anything. I’d also like to thank Jeff, Kevin, Jermaine, Jim, Jake, Drake, and the rest of the PocketSonics team for mentoring me throughout many years of internships.

Finally, I should probably mention my family, who may have had some tangential influence on me getting where I am today. But seriously, I could never thank them enough or adequately describe their positive impact on my life in a paragraph, so I will just note that I’m profoundly grateful for all the sacrifices they’ve made for me and for the good fortune of getting to be part of this family.

Contents

1 Introduction
2 Compressing Integer Time Series
  2.1 Introduction
  2.2 Definitions and Background
    2.2.1 Definitions
    2.2.2 Hardware Constraints
    2.2.3 Data Characteristics
  2.3 Related Work
    2.3.1 Compression of Time Series
    2.3.2 Compression of Integers
    2.3.3 General-Purpose Compression
    2.3.4 Predictive Filtering
  2.4 Method
    2.4.1 Overview
    2.4.2 Forecasting
    2.4.3 Bit Packing
    2.4.4 Entropy Coding
    2.4.5 Vectorization
  2.5 Experimental Results
    2.5.1 Datasets
    2.5.2 Comparison Algorithms
    2.5.3 Compression Ratio
    2.5.4 Decompression Speed
    2.5.5 Compression Speed
    2.5.6 FIRE Speed
    2.5.7 When to Use Sprintz
    2.5.8 Generalizing to Floats
  2.6 Summary
3 Fast Approximate Scalar Reductions
  3.1 Introduction
    3.1.1 Problem Statement
    3.1.2 Assumptions
  3.2 Related Work
  3.3 Method
    3.3.1 Background: Product Quantization
    3.3.2 Bolt
    3.3.3 Theoretical Guarantees
  3.4 Experimental Results
    3.4.1 Datasets
    3.4.2 Comparison Algorithms
    3.4.3 Encoding Speed
    3.4.4 Query Speed
    3.4.5 Nearest Neighbor Accuracy
    3.4.6 Accuracy in Preserving Distances and Dot Products
  3.5 Summary
4 Fast Approximate Matrix Multiplication
  4.1 Introduction
    4.1.1 Problem Formulation
  4.2 Related Work
    4.2.1 Linear Approximation
    4.2.2 Hashing to Avoid Linear Operations
  4.3 Background: Product Quantization
  4.4 Our Method
    4.4.1 Hash Function Family, g(·)
    4.4.2 Learning the Hash Function Parameters
    4.4.3 Optimizing the Prototypes
    4.4.4 Fast 8-Bit Aggregation, f(·, ·)
    4.4.5 Complexity
    4.4.6 Theoretical Guarantees
  4.5 Experiments
    4.5.1 Methods Tested
    4.5.2 How Fast is Maddness?
    4.5.3 Softmax Classifier
    4.5.4 Kernel-Based Classification
    4.5.5 Image Filtering
  4.6 Summary
5 Summary and Conclusion
A Additional Theoretical Analysis of Bolt
  A.1 Quantization Error
    A.1.1 Definitions
    A.1.2 Guarantees
  A.2 Dot Product Error
    A.2.1 Definitions and Preliminaries
    A.2.2 Guarantees
    A.2.3 Euclidean Distance Error
B Additional Theoretical Analysis of Maddness
  B.1 Proof of Generalization Guarantee
  B.2 Aggregation Using Pairwise Averages
C Additional Method and Experiment Details for Maddness
  C.1 Quantizing Lookup Tables
  C.2 Quantization and MaddnessHash
  C.3 Subroutines for Training MaddnessHash
  C.4 Additional Experimental Details
    C.4.1 Exact Matrix Multiplication
    C.4.2 Additional Baselines
    C.4.3 UCR Time Series Archive
    C.4.4 Caltech101
    C.4.5 Additional Results

List of Figures

2-1 Overview of Sprintz using a delta coding predictor. a) Delta coding of each column, followed by zigzag encoding of resulting errors. The maximum number of significant bits is computed for each column. b) These numbers of bits are stored in a header, and the original data is stored as a byte-aligned payload, with leading zeros removed. When there are few columns, each column’s data is stored contiguously. When there are many columns, each row is stored contiguously, possibly with padding to ensure alignment on a byte boundary. (A sketch of the delta and zigzag step appears after this list.)
2-2 Boxplots of compression performance of different algorithms on the UCR Time Series Archive. Each boxplot captures the distribution of one algorithm across all 85 datasets.
2-3 Compression performance of different algorithms on the UCR Time Series Archive. The x-axis is the mean rank of each method, where rank 1 on a given dataset has the highest ratio. Methods joined with a horizontal black line are not statistically significantly different.
2-4 Sprintz becomes faster as the number of columns increases and as the width of each sample approaches multiples of 32B (on a machine with 32B vector registers).
2-5 Sprintz compresses at hundreds of MB/s even in the slowest case: its highest-ratio setting with incompressible 8-bit data. On lower settings with 16-bit data, it can exceed 1 GB/s.
2-6 Fire is nearly as fast as delta and double delta coding.
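The Sprintz overview in Figure 2-1 is the one algorithmic step described in this front matter, so the sketch below illustrates just that step: per-column delta coding, zigzag encoding of the resulting errors, and computation of each column's maximum number of significant bits. This is a minimal editor's illustration in Python, not code from the thesis; the function name delta_zigzag_encode and all variable names are invented for this example, and the full Sprintz codec adds forecasting (Fire), bit packing of the payload, and entropy coding on top of this step.

```python
import numpy as np


def delta_zigzag_encode(block: np.ndarray):
    """Illustrative first stage of a Sprintz-like codec (editor's sketch).

    block: (n_samples, n_columns) array of integers, e.g. int16 sensor readings.
    Returns the zigzag-encoded delta errors and, for each column, the number
    of significant bits a header would need to record.
    """
    block = block.astype(np.int64)

    # a) Delta coding of each column: error = current value - previous value.
    #    The first row is differenced against zero here; a real codec would
    #    instead carry over the last value of the previous block.
    prev = np.vstack([np.zeros((1, block.shape[1]), dtype=np.int64), block[:-1]])
    errors = block - prev

    # Zigzag encoding maps signed errors to unsigned codes so that values near
    # zero stay small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    zigzag = (errors << 1) ^ (errors >> 63)

    # Maximum number of significant bits needed by any value in each column;
    # the packed payload keeps only this many bits per value (leading zeros removed).
    nbits = np.ceil(np.log2(zigzag.max(axis=0) + 1)).astype(np.int64)
    return zigzag.astype(np.uint64), nbits


if __name__ == "__main__":
    demo = np.array([[100, 7], [101, 7], [103, 6], [102, 9]], dtype=np.int16)
    codes, bits_per_column = delta_zigzag_encode(demo)
    print(codes)            # small unsigned codes for slowly varying columns
    print(bits_per_column)  # per-column bit widths for the header / bit packing
```

Because successive samples in a slowly varying column differ only slightly, the zigzag codes stay close to zero and most of their leading bits are zero, which is what the header-plus-packed-payload layout in part (b) of Figure 2-1 exploits.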