Building Efficient Algorithms by Learning to Compress

by

Davis W. Blalock

B.S., University of Virginia (2014)
S.M., Massachusetts Institute of Technology (2016)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2020

© Davis W. Blalock, MMXX. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, August 28, 2020

Certified by: John V. Guttag, Dugald C. Jackson Professor of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students


Abstract

The amount of data in the world is doubling every two years. Such abundant data offers immense opportunities, but also imposes immense computation, storage, and energy costs. This thesis introduces efficient algorithms for reducing these costs for bottlenecks in real world data analysis and machine learning pipelines. Concretely, we introduce algorithms for:

• Lossless compression of time series. This algorithm compresses better than any existing method, despite requiring only the resources available on a low-power edge device.

• Approximate matrix-vector multiplies. This algorithm accelerates approximate similarity scans by an order of magnitude relative to existing methods.

• Approximate matrix-matrix multiplies. This algorithm often outperforms existing approximation methods by more than 10× and non-approximate computation by more than 100×.

We provide extensive empirical analyses of all three algorithms using real-world datasets and realistic workloads. We also prove bounds on the errors introduced by the two approximation algorithms.

The theme unifying all of these contributions is learned compression. While compression is typically thought of only as a means to reduce data size, we show that specially designed compression schemes can also dramatically increase computation speed and reduce memory requirements.

Thesis Supervisor: John V. Guttag
Title: Dugald C. Jackson Professor of Electrical Engineering and Computer Science


Acknowledgments

This PhD would not have been possible without the support and efforts of many people. At the top of this list is my advisor, John Guttag. It’s difficult to identify all the positive aspects of having John as an advisor, so I will stick to only a few key points. First, John is extremely supportive and kind—he genuinely wants what’s best for his graduate students and is generous and understanding far beyond the call of duty. In addition, one could not ask for a more knowledgeable mentor; John has deep expertise in many areas of computer science, and I doubt I could have tackled the problems I did without his willingness and ability to straddle many subfields at once.
This knowledgeability also extends to the practice of research itself; John can often transform writing, presentations, and problem framings from lackluster to exceptional in the span of a single round of feedback.

I am also indebted to my collaborators and labmates. I would particularly like to thank Divya Shanmugam and Jose Javier Gonzalez Ortiz for managing to put up with me over the course of multiple research projects. I would also like to thank Anima, Yun, Guha, Joel, Jen, Amy, Maggie, Adrian, Tristan, Marzyeh, Tiam, Harini, Katie, Marianne, Emily, Addie, Wayne, Sra, Matt, Dina, Maryann, Roshni, and Aniruddh for their friendship, feedback, and ideas over the years.

Along similar lines, I would like to thank Sam Madden and Tamara Broderick for serving on my thesis committee and being great collaborators in research (and other) endeavors in the past few years. Sam and Tamara not only possess deep expertise in their fields, but are also enjoyable and interesting to work with.

I would be remiss not to include some mentors from before MIT. Foremost, I’d like to thank John Lach and Ben Boudaoud for taking a chance on me when I was an undergrad who didn’t know anything. I’d also like to thank Jeff, Kevin, Jermaine, Jim, Jake, Drake, and the rest of the PocketSonics team for mentoring me throughout many years of internships.

Finally, I should probably mention my family, who may have had some tangential influence on me getting where I am today. But seriously, I could never thank them enough or adequately describe their positive impact on my life in a paragraph, so I will just note that I’m profoundly grateful for all the sacrifices they’ve made for me and for the good fortune of getting to be part of this family.


Contents

1 Introduction
2 Compressing Integer Time Series
  2.1 Introduction
  2.2 Definitions and Background
    2.2.1 Definitions
    2.2.2 Hardware Constraints
    2.2.3 Data Characteristics
  2.3 Related Work
    2.3.1 Compression of Time Series
    2.3.2 Compression of Integers
    2.3.3 General-Purpose Compression
    2.3.4 Predictive Filtering
  2.4 Method
    2.4.1 Overview
    2.4.2 Forecasting
    2.4.3 Bit Packing
    2.4.4 Entropy Coding
    2.4.5 Vectorization
  2.5 Experimental Results
    2.5.1 Datasets
    2.5.2 Comparison Algorithms
    2.5.3 Compression Ratio
    2.5.4 Decompression Speed
    2.5.5 Compression Speed
    2.5.6 FIRE Speed
    2.5.7 When to Use Sprintz
    2.5.8 Generalizing to Floats
  2.6 Summary
3 Fast Approximate Scalar Reductions
  3.1 Introduction
    3.1.1 Problem Statement
    3.1.2 Assumptions
  3.2 Related Work
  3.3 Method
    3.3.1 Background: Product Quantization
    3.3.2 Bolt
    3.3.3 Theoretical Guarantees
  3.4 Experimental Results
    3.4.1 Datasets
    3.4.2 Comparison Algorithms
    3.4.3 Encoding Speed
    3.4.4 Query Speed
    3.4.5 Nearest Neighbor Accuracy
    3.4.6 Accuracy in Preserving Distances and Dot Products
  3.5 Summary
4 Fast Approximate Matrix Multiplication
  4.1 Introduction
    4.1.1 Problem Formulation
  4.2 Related Work
    4.2.1 Linear Approximation
    4.2.2 Hashing to Avoid Linear Operations
  4.3 Background - Product Quantization
  4.4 Our Method
    4.4.1 Hash Function Family, g(·)
    4.4.2 Learning the Hash Function Parameters
    4.4.3 Optimizing the Prototypes
    4.4.4 Fast 8-Bit Aggregation, f(·, ·)
    4.4.5 Complexity
    4.4.6 Theoretical Guarantees
  4.5 Experiments
    4.5.1 Methods Tested
    4.5.2 How Fast is Maddness?
    4.5.3 Softmax Classifier
    4.5.4 Kernel-Based Classification
    4.5.5 Image Filtering
  4.6 Summary
5 Summary and Conclusion
A Additional Theoretical Analysis of Bolt
  A.1 Quantization Error
    A.1.1 Definitions
    A.1.2 Guarantees
  A.2 Dot Product Error
    A.2.1 Definitions and Preliminaries
    A.2.2 Guarantees
    A.2.3 Euclidean Distance Error
B Additional Theoretical Analysis of Maddness
  B.1 Proof of Generalization Guarantee
  B.2 Aggregation Using Pairwise Averages
C Additional Method and Experiment Details for Maddness
  C.1 Quantizing Lookup Tables
  C.2 Quantization and MaddnessHash
  C.3 Subroutines for Training MaddnessHash
  C.4 Additional Experimental Details
    C.4.1 Exact Matrix Multiplication
    C.4.2 Additional Baselines
    C.4.3 UCR Time Series Archive
    C.4.4 Caltech101
    C.4.5 Additional Results


List of Figures

2-1 Overview of Sprintz using a delta coding predictor. a) Delta coding of each column, followed by zigzag encoding of resulting errors. The maximum number of significant bits is computed for each column. b) These numbers of bits are stored in a header, and the original data is stored as a byte-aligned payload, with leading zeros removed. When there are few columns, each column’s data is stored contiguously. When there are many columns, each row is stored contiguously, possibly with padding to ensure alignment on a byte boundary.

2-2 Boxplots of compression performance of different algorithms on the UCR Time Series Archive. Each boxplot captures the distribution of one algorithm across all 85 datasets.

2-3 Compression performance of different algorithms on the UCR Time Series Archive. The x-axis is the mean rank of each method, where rank 1 on a given dataset has the highest ratio. Methods joined with a horizontal black line are not statistically significantly different.

2-4 Sprintz becomes faster as the number of columns increases and as the width of each sample approaches multiples of 32B (on a machine with 32B vector registers).

2-5 Sprintz compresses at hundreds of MB/s even in the slowest case: its highest-ratio setting with incompressible 8-bit data. On lower settings with 16-bit data, it can exceed 1GB/s.

2-6 Fire is nearly as fast as delta and double delta coding.