Decomposed Bounded Floats for Fast Compression and Queries

Chunwei Liu, Hao Jiang, John Paparrizos, Aaron J. Elmore
University of Chicago
{chunwei,hajiang,jopa,aelmore}@cs.uchicago.edu

ABSTRACT
Modern data-intensive applications often generate large amounts of low-precision float data with a limited range of values. Despite the prevalence of such data, there is a lack of an effective solution to ingest, store, and analyze bounded, low-precision, numeric data. To address this gap, we propose Buff, a new compression technique that uses a decomposed columnar storage and encoding methods to provide effective compression, fast ingestion, and high-speed in-situ adaptive query operators with SIMD support.

PVLDB Reference Format:
Chunwei Liu, Hao Jiang, John Paparrizos, Aaron J. Elmore. Decomposed Bounded Floats for Fast Compression and Queries. PVLDB, 14(11): 2586-2598, 2021. doi:10.14778/3476249.3476305

[Figure 1: Many datasets only span a fixed range of values with limited precision, which is a small subset of the broad float number range and precision.]

1 INTRODUCTION
Modern applications and systems are generating massive amounts of low-precision floating-point data. This includes server monitoring, smart cities, smart farming, autonomous vehicles, and IoT devices. For example, consider a clinical thermometer that records values between 80.0 to 120.9, a GPS device between -180.0000 to 180.0000, or an index fund between 0.0001 to 9,999.9999. The International Data Corporation predicts that the global amount of data will reach 175ZB by 2025 [44], and sensors and automated processes will be a significant driver of this growth. To address this growth, data systems require new methods to efficiently store and query this low-precision floating-point data, especially as data growth is outpacing the growth of storage and computation [38].

Several popular formats exist for storing numeric data with varied precision. A fixed-point representation allows for a fixed number of bits to be allocated for the data to the right of the radix (i.e., decimal point) but is not commonly used due to the more popular floating-point representation. Floats (or floating-point) allow for a variable amount of precision by allocating bits before and after the radix point (hence the floating radix point), within a fixed total number of bytes (32 or 64). For floats, the IEEE float standard [24] is widely supported both by modern processors and programming languages, and is ubiquitous in today's applications.

However, two main reasons result in floats not being ideal for many modern applications: (i) an overly high precision and broader range that wastes storage and query efficiency; and (ii) not being amenable to effective compression with efficient in-situ filtering operations. For the former reason, many databases offer a custom precision format, typically referred to as a numeric data type. For example, consider Figure 1, which shows how precision varies for the IEEE float standard and sample application requirements on GPS and temperature datasets. The numeric approach allows for "just enough" precision but is not well optimized for efficient storage and filtering. Several methods have been explored for the latter reason, but due to the standard format, compression opportunities are limited in terms of compression effectiveness, throughput, and in-situ query execution. This paper proposes a new storage format that extends the ideas of a numeric data type that supports custom precision, but is optimized for fast and effective encoding while allowing queries to work directly over the encoded data.

To motivate our design, we first describe shortcomings of popular compression techniques for bounded-precision floating data. Specifically, the recently proposed Gorilla method [40] is a delta-like compression approach that calculates the XOR of adjacent values and only saves the difference. Gorilla achieves compression benefits by replacing the leading and trailing zeros with counts. Gorilla is a state-of-the-art compression approach for floats, but it does not work well on low-precision datasets, as low precision does not improve the similarity of floats' representations. In addition, Gorilla's encoding and decoding steps are slow because of its complex variable coding mechanism. Alternative float compression approaches leverage integer compression techniques by scaling the float value into an integer [9]. Despite their simplicity, these approaches rely on multiplication and division operations that are usually expensive and, importantly, cause overflow problems when the input value and the quantization (i.e., scaling) factor are too large.

General-purpose byte-oriented compression approaches, such as Gzip [17], Snappy [52] and Zlib [21], can also be applied to compress floats. The input float values are serialized into a binary representation before applying byte-oriented compression. These compression approaches are usually slow because of multiple scans looking for commonly repeated byte sub-sequences. Furthermore, full decompression is needed before any query evaluation. Dictionary encoding [8, 32, 46] is applicable for float data, but it is not

ideal since the cardinality of input float data is usually high, resulting in expensive dictionary operation overhead. Note that we are only considering lossless formats or formats with bounded loss (e.g., configurable precision). Lossy methods (e.g., spectral decompositions [19] and representation learning [35]) may compress the data more aggressively but often at the cost of losing accuracy.

Decades of database systems demonstrate the benefit of a format that allows for a value domain that includes a configurable amount of precision. We take this approach, and to address the above concerns, integrate ideas from columnar systems and the fixed-point format for our proposed method, BoUnded Fast Floats compression (Buff). Buff provides fast ingestion from float-based inputs and a compressible decomposed columnar format for fast queries. Buff ingestion avoids expensive conditional logic and floating-point mathematics. Our storage format relies on a fixed-size representation for fast data skipping and random access, and incorporates encoding techniques, such as bit-packing, delta-encoding, and sparse formats, to provide good compression ratios¹ and fast adaptive query operators. When defining an attribute that uses Buff, the user defines the precision and optionally the minimum and maximum values (e.g., 90.000-119.999). Without the defined min and max values, our approach infers these through the observed range, but at a decreased ingestion performance. Our experiments show the superiority of Buff over current state-of-the-art approaches. Compared to the state-of-the-art, Buff achieves up to 35× speedup for low selective filtering, up to 50× speedup for aggregations and high selective filtering, and 1.5× speedup for ingestion with a single thread, while offering comparable compression sizes to the state-of-the-art.

¹Compression ratio is defined as compressed_size / uncompressed_size.

We start with a review of the relevant background (Section 2), which covers numeric representations (Section 2.1) and a wide range of compression methods (Section 2.2). In Section 3, we present Buff compression and query execution with four contributions:
• We introduce a "just-enough" bit representation for many real-world datasets based on the observation of their limited precision and range (Section 3.1).
• We apply an aligned decomposed byte-oriented columnar storage layout for floats to enable fast encoding with progressive, in-situ query execution on compressed data (Section 3.2).
• We propose sparse encoding to handle outliers during compression and query execution (Section 3.3).
• We devise an adaptive filtering to automatically choose between SIMD and scalar filtering (Section 3.4).
We evaluate Buff against its competitors in terms of compression ratio and query performance in Section 4, and conclude in Section 5.

2 BACKGROUND AND RELATED WORK
In this section, we first introduce several numeric data representations, including the most popular float format. Then, we review several float compression approaches with a set of examples. Finally, we include popular query-friendly data structures that have influenced our compression format.

2.1 Numeric Data Representation
Numeric data representation is the internal representation of numeric values in any hardware or software of digital systems, such as programmable computers and calculators [48]. Many data representations are developed according to different system and application requirements. The most popular implementations include fixed-point (including the numeric data type) and floating-point.

Fixed-point uses a tuple <sign, integer, fractional> to represent a real number R:

R = (-1)^sign * integer.fractional

Fixed-point partitions its bits into two fixed parts: a signed/unsigned integer section and a fractional section. For a given bit budget N, fixed-point allocates a single bit for the sign, I bits for the integer part, and F bits for the fractional part (where N = 1 + I + F). Fixed-point always has a specific number of bits for the integer part and the fractional part. Figure 2 shows examples of fixed-point; as we can see from the "Fixed" column, the radix point's location is always fixed, no matter how large or small the corresponding real number is. Fixed-point arithmetic is just scaled integer arithmetic, so most hardware implementations treat fixed-point numbers as integer numbers with a logical decimal point partitioning the integer and fractional parts. Fixed-point usually has a limited range (-2^I ~ 2^I) and precision (2^-F), thus it can encounter overflow or underflow issues. A more advanced dynamic fixed-point representation allows moving the point to achieve a trade-off between range and precision. However, this is more complicated as it involves extra control bits indicating the location of the point. Fixed-point is rarely used outside of old or low-cost embedded microprocessors where a floating-point unit (FPU) is unavailable. Fixed-point is currently supported by few computer languages, as floating-point representations are usually simpler to use and accurate enough.

Floating-point uses a tuple <sign, exponent, mantissa> to represent a real number R:

R = (-1)^sign * mantissa * β^exponent

Instead of using a fixed number of bits for the integer part and the fractional part, floating-point reserves a certain number of bits for the exponent part and the mantissa, respectively. The base β is an implicit constant given by the number representation system; in our floating-point representation, β equals 2. The number of significant digits does not depend on the position of the decimal point. For a given bit budget N, floating-point allocates a single bit for the sign, E bits for the exponent part, and M bits for the mantissa part (where N = 1 + E + M). The IEEE standard defines the format for single and double floating-point numbers using 32 and 64 bits, respectively. The exponent is encoded using an offset-binary representation, with the zero offset given by the exponent bias:

R = (-1)^sign * 1.mantissa * 2^exponent

The IEEE standard specifies a 32-bit float as having: sign bit: 1 bit; exponent width: 8 bits (offset by 127); and significand precision: 24 bits (23 explicitly stored). Figure 2 shows several examples of the 32-bit float format; as we can see from the "32-bits float" column, the underlined bits correspond to the fractional bits, and the radix point floats based on the represented value.
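To make the 32-bit layout concrete, the following Rust sketch (ours, not taken from any referenced system) unpacks the three fields just described from an f32 bit pattern; for example, decompose(3.14) yields sign 0, unbiased exponent 1, and the 23 stored mantissa bits:

```rust
/// Split an IEEE 754 single-precision float into its three fields:
/// 1 sign bit, 8 exponent bits (biased by 127), 23 stored mantissa bits.
/// Normal (non-subnormal, non-NaN) numbers are assumed.
fn decompose(v: f32) -> (u32, i32, u32) {
    let bits = v.to_bits();
    let sign = bits >> 31;                             // leading bit
    let exponent = ((bits >> 23) & 0xff) as i32 - 127; // remove the bias
    let mantissa = bits & 0x7f_ffff;                   // implicit leading 1 not stored
    (sign, exponent, mantissa)
}
```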

Numeric data type allocates enough bits for a given real number depending on the precision and scale of the input. The numeric data type can store numbers with a very large number of digits and perform calculations exactly. To declare a column of type numeric, we use the syntax NUMERIC(precision, scale). The scale is the count of decimal digits in the fractional part, to the right of the decimal point. The precision is the total count of significant digits in the whole number, that is, the number of digits on both sides of the decimal point (e.g., 65.4321 has a precision of 6 and a scale of 4). The numeric data type is used in most database systems, e.g., MySQL, PostgreSQL, Redshift, and DB2, when exactness is required. However, the flexibility and high precision come at the cost of more control bits associated with each value. The storage cost of the numeric type varies depending on the system implementation. PostgreSQL uses two bytes for each group of four decimal digits, plus 3~8 bytes of overhead per value. In MySQL, numeric values are represented in a binary format that packs nine decimal digits into four bytes. This makes the numeric type inefficient in space. Besides, querying the numeric data type either materializes values to floating point for approximate calculation or uses its own exact arithmetic, which is very expensive compared to the integer types or the floating-point types described earlier.

[Figure 2: Compression examples for different approaches. The running example (0.66, 1.41, 1.41, 1.50, 2.72, 3.14) is encoded as Fixed<2,20> (132 bits), 32-bit float (192 bits), Gorilla (XOR with the previous value, then control bits with leading-zero and meaningful-bit counts; 176 bits), Sprintz (quantize ×100, predict, delta, bit-packing; 42 bits), Mostly (quantize, MOSTLY16; 96 bits), Dictionary (lookup table; 210 bits), and Gzip (LZ77, with the Huffman step omitted for space; 328 bits).]

2.2 Compression for Decimal Numbers
Due to the popularity of floating-point data, several compression techniques exist, including methods for floating-point data, general-purpose methods, and approaches migrated from integer compression. We limit our discussion to these techniques.

Gorilla [40] is an in-memory time-series database developed by Facebook. It introduces two encodings to improve delta encoding: delta-of-delta (delta encoding on delta encoding, e.g., [101,102,103,104] is encoded as [101,1,0,0]) for timestamps, which are usually an increasing integer sequence, and XOR-based encoding for the value domain, which is a float type. In the XOR-based float encoding, successive float values are XORed together, and only the differing bits (the delta) are saved. The delta is then stored using control bits to indicate how many leading and trailing zeroes are in the XOR value. Similar compression techniques are also used in numerical simulation [25, 30, 31] and scientific computing [14, 43]. Figure 2 shows an example of Gorilla encoding, with the first step calculating the XOR, and the second step writing the control bits, the number of leading zeros, the number of meaningful bits, and the actual meaningful bits. We use 32 bits to save space, even though Gorilla was originally designed for the 64-bit format. Gorilla is the state-of-the-art approach for float compression, and it is widely used in many time series database systems, such as InfluxDB [3] and TimescaleDB [5].

However, Gorilla compression uses variable-length encoding, which means records are not aligned with their encoded bits. The encoded bits have to be decoded sequentially to reach the target records. Each value depends on its previous record; thus, for a target value, all previous values must be decompressed. Those features impede effective record skipping and random data access.
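The XOR step is simple to state in code. A minimal Rust sketch of it (our simplification; the real encoder also packs the control bits into a bit stream) is:

```rust
/// Gorilla-style XOR delta for one 64-bit float: returns the number of
/// leading zeros, the number of meaningful bits, and the meaningful bits
/// themselves (identical values emit a single '0' control bit instead).
fn xor_delta(prev: f64, curr: f64) -> (u32, u32, u64) {
    let x = prev.to_bits() ^ curr.to_bits();
    if x == 0 {
        return (64, 0, 0); // values are identical
    }
    let lead = x.leading_zeros();
    let trail = x.trailing_zeros();
    (lead, 64 - lead - trail, x >> trail) // store only the middle residue
}
```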
Sprintz [9] was initially designed for integer time series compression. Sprintz employs a forecast model to predict each value based on previous records via a lookup table, and then encodes the delta between the predicted value and the actual value. Those delta values are usually closer to zero than the actual value, making them smaller when encoded with bit-packed encoding. It is also possible to apply Sprintz to floats by first quantizing the float into an integer. As shown in Figure 2, we first multiply the input values by 100 and get a series of integers. Then we predict the next value based on the prediction model. In the third step, we calculate the delta between the prediction and the actual value, and compress the delta with bit-packing encoding in the last step. Sprintz uses a fixed length for encoded values, which makes it easy to locate the target value. However, to decode the target value, we need to fully decode the prior values, since previous values are still needed to predict the current value and then decode it from the compressed delta. Furthermore, the multiplication of decimal numbers is costly and potentially raises integer overflow issues during quantization.

Mostly encoding encodes a data column whose data type is wider than most of its stored values require. Mostly encoding is used for integer and numeric data type compression in the Amazon Redshift data warehouse [22]. For numeric values, Mostly encoding quantizes floats into integers before applying compression. Most of the column values are encoded to a smaller data representation with Mostly encoding, while the remaining values that cannot be compressed are stored in their original form. Mostly encoding can be less effective than no compression when a high portion of the column's values cannot be compressed. Figure 2 shows an example of Mostly encoding. It first quantizes the float values and then applies MOSTLY16 (2-byte) for the given data, as it needs 9 bits for each quantized value (we use hex to save space). This representation is less effective than bit-packing encoding, which stores the input value using as few bits as possible.

General-purpose byte-oriented compression encodes the input data stream at the byte level. Popular techniques, such as Gzip [17], Snappy [52] and Zlib [21], derive from the LZ77 family [54], which looks for repetitive sub-sequences within a sliding window on the input byte stream and encodes a recurring sub-sequence as a reference to its previous occurrence (first step in Figure 2). For a better compression ratio, Gzip applies Huffman encoding on the reference data stream (second step in Figure 2). Snappy only applies LZ77 but skips Huffman encoding for higher throughput. Byte-oriented compression treats the input values as a byte stream and encodes them sequentially. The data block needs to be fully decompressed before any original value can be accessed.

2.3 Query-friendly Storage Layout
Compression should not only keep the data size small but also support fast query execution. Prior work speeds up query performance by introducing hierarchical data representation layouts. Some works [29, 33, 42, 50] are orthogonal to data compression but are applicable to Buff to speed up query performance.

DAQ [42] uses a bit-sliced index representation and computes the most significant bits of the result first, then the less significant ones. With the bit-sliced index representation, it performs efficient approximation and provides efficient algorithms for evaluating predicates and aggregations on unsigned integer types. Column Sketch [23] introduces a new class of indexing scheme by applying lossy compression on a value-by-value basis, mapping original data to a representation of smaller fixed-width codes. Queries are evaluated affirmatively or negatively for the vast majority of values using the compressed data, and check the original data for the remaining values only if needed. PIDS [27] identifies common patterns in string attributes from relational databases and uses the discovered patterns to split each attribute into sub-attributes and further compress them. The sub-attribute layout enables predicate push-down to each attribute, and the intermediate result from a prior column can be projected to the following attributes efficiently with the aid of fast data skipping and progressive filtering.

BitWeaving [29] provides fast scans in a main-memory database by exploiting the parallelism available at the bit level in modern processors. It organizes the input codes horizontally or vertically into processor words, allowing early pruning techniques to avoid accesses to unnecessary data at the bit level and speed up scan performance. ByteSlice [20] is another main-memory storage layout that supports both highly efficient scans and lookups. Similar to BitWeaving, ByteSlice assumes encoded binary as input. But instead of striping the input code in bit units, ByteSlice decomposes the input code at the byte level and pads the trailing bits into a byte to achieve better read/write speed and be compatible with SIMD instructions. MLWeaving [50] is an in-memory data storage layout that allows efficient materialization of quantized data at any level of precision. Its memory layout is based on BitWeaving, where the first bits of a batch of input values are saved as a word, followed by the second bits of the same batch, and so on. Data at any level of precision can be retrieved by following a different access pattern over the same data structure. Using lower precision input data is of special interest, given its overall higher efficiency.

These techniques speed up query performance by progressively filtering out disqualified records and narrowing down the records that need to be parsed. They mainly process a query in their defined unit (either sub-attributes or bits) on their columnar data layout. This fine-grained columnar data layout is the foundation for Buff.

3 BUFF OVERVIEW
Many applications generate data that varies within a specific range with limited precision. Based on our discussion of the state-of-the-art, none can leverage this feature to support efficient compression and query execution. To alleviate those deficiencies, we propose a novel compression method for floats. The compression works at two levels: eliminating the less significant bits based on a given precision (Section 3.1), and splitting a float into its integer part and fractional part, then compressing those bits separately with two splitting strategies (Section 3.2). We then introduce sparse encoding to handle outliers and avoid compression performance deterioration (Section 3.3). In the end, we describe query execution on our compressed data format (Section 3.4). With these techniques, we achieve both a good compression ratio and outstanding query performance.

3.1 Bounded Float
Computer systems can only operate on numeric values with finite precision and range. Using floating-point values as real numbers does not clearly identify the precision with which each value must be represented. Too little precision yields inaccurate results, and too much wastes computational and storage resources [18]. We target applications that need limited precision and data that falls within a finite range. Therefore, our proposed compression takes full advantage of the range and precision features of a given dataset.

3.1.1 Bounded Precision. The float data type uses most of its bits for the mantissa, which is not necessarily needed for many sensors or applications, since many sensors usually provide limited decimal precision (e.g., a clinical thermometer has one decimal place of precision, 0.1 °F). However, the decimal precision is not aligned with those mantissa bits in a float, due to the floating radix point in the format. The float format makes the most of its mantissa bits to get the best approximation for the given number. With 0.1, for example, all 52 mantissa bits are used to get its approximation (0.10000000000000000555) in IEEE double format. This mechanism is intended to support high precision for the float format, but it impedes efficient float compression in most real-world cases. Those mantissa trailing bits provide far more extra precision than required, which leads to a huge change in bits from a tiny fluctuation of the number. For example, Gorilla achieves good compression by first applying XOR to the current number with the previous one, then removing the leading and trailing zeros. The problem here is that the mantissa bits are too sensitive to value changes, making it difficult to leverage the trailing-zero benefits.

Suppose we know the precision of the incoming numbers. In that case, we can provide just enough precision support by eliminating the less significant mantissa bits. At the same time, we are still able to provide a guaranteed-precision float for downstream analytics.
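As a small demonstration of this idea (our example, mirroring the bounded-2 row of Table 1 below), clearing the trailing mantissa bits of 3.14 in Rust still leaves a value that rounds back to two decimal places:

```rust
fn main() {
    let full = 3.14f32; // stored as the approximation 3.1400001
    // Keep only the 9 most significant of the 23 stored mantissa bits.
    let bounded = f32::from_bits(full.to_bits() & !((1u32 << 14) - 1));
    println!("{} -> {}", full, bounded);           // 3.14 -> 3.1367188
    assert_eq!(format!("{:.2}", bounded), "3.14"); // 2-decimal precision survives
}
```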

Table 1: Bounded float still keeps decimal precision

Real number | Sign | Exp (8 bits) | Mantissa (23 bits)      | Actual value
binary      | 0    | 10000000     | 10010001111010111000011 | 3.1400001
bounded-2   | 0    | 10000000     | 10010001100000000000000 | 3.13671875 ≈ 3.14

Table 1 shows the float representation of a given value, 3.14. As shown in the binary row, the float format saves the closest approximation, 3.1400001. If we only care about two decimal positions after the decimal point, as shown in the bounded-2 row, we can remove some trailing bits while the value still approximates the original value with two places of precision. The removal of insignificant bits saves a lot of bits when representing a number with a given precision. This is one motivation for Buff. To efficiently determine the number of fractional bits needed for each required precision, we run a brute-force verification on all possible numbers under each precision requirement. We obtain the maximal fractional bits needed for each corresponding precision, as shown in Table 2, which becomes the lookup table for bounded precision. We only provide up to 10 decimal places of precision here, as this is good enough for most datasets. We could extend to higher precision, but it is less meaningful as the commonly used float only has 52 bits for the mantissa.

Table 2: Number of bits needed for targeted precision

Precision   | 1 | 2 | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10
Bits needed | 5 | 8 | 11 | 15 | 18 | 21 | 25 | 28 | 31 | 35

3.1.2 Bounded Range. Another observation for most application scenarios is bounded range. The measured value is delimited by a lower and an upper measuring bound that define the measuring span. The bounded range could result from the measuring range of the instrument (e.g., a clinical thermometer measuring from 35°C to 42°C), the physical definition of the measurand (e.g., CPU usage is defined between 0% and 100%), or other domain knowledge. The bounded range is another feature that is helpful when compressing the data, since we only focus on the given range and can encode the number in a more concise but accurate manner. Buff can work with no provided ranges, but as our experiments show, this hinders ingestion performance, as the compression and allocation need to find the minimal and maximal observed values when encoding a set of values. With bounded precision and range, we can get rid of the less useful and redundant bits initially designed for higher precision and a broader range. With those two techniques, we get a condensed representation of the input data.
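In code, the lookup is just a constant table. A minimal sketch (ours) holding the Table 2 entries; note that the listed values coincide with ceil(p * log2(10)) + 1 for p = 1..=10:

```rust
/// Fractional bits required to preserve p decimal digits (Table 2).
const FRAC_BITS: [u32; 10] = [5, 8, 11, 15, 18, 21, 25, 28, 31, 35];

fn bits_for_precision(p: usize) -> u32 {
    assert!((1..=10).contains(&p), "precisions 1..=10 are supported");
    FRAC_BITS[p - 1]
}
```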
We dis- ￿ ￿ usage is de￿ned between 0% to 100%), or based on other domain cuss its potential bene ts on progressive ltering and data skipping knowledge. The bounded range is another feature that is helpful later in Section 3.4.1. Here our analysis focuses on its compression ￿ when compressing the data since we only focus on the given range, e ciency in terms of compression ratio and reads/writes through- and we can encode the number in a more concise but accurate put. The encoded bit length of the integer part is determined by the manner. Bu￿ can work with no provided ranges, but as our experi- range of input numbers, while the encoded bit length for the frac- ments show, it hinders ingestion performance as the compression tional part is determined by the actual precision required. We can ￿ and allocation needs to ￿nd the minimal and maximal observed compress both parts e ectively with delta and bit-packing encoding value when encoding a set of values. With bounded precision and for the integer part and bounded precision for the fractional part. ￿ ￿ range, we can get rid of the less useful and redundant bits initially The xed-length encoding simpli es the encoding and decoding designed for higher precision and broader range. With those two process and makes it possible for data skipping to be applied on techniques, we can get a condensed representation of the input the encoded bits directly, which improves performance. ￿ ￿ ￿ data. In addition to the compression bene ts, Bu also adopts oat Row-ID Extracted bits Compressed layout splitting to organize our compressed data better. 0 1000.110111110100011 10001101 11110100 011 1 0011.001001000011100 00110010 01000011 100 2 1001.110001010100011 10011100 01010100 011 3.2 Float Splitting and Compression Byte ...... -oriented ...... The basic idea of ￿oat spitting is decomposing the input number Splitting n-2 into integer and fractional parts. This division separates the given 0101.001000110100011 01010010 00110100 011 n-1 0011.001001010100011 00110010 01010100 011

3.2 Float Splitting and Compression
The basic idea of float splitting is decomposing the input number into integer and fractional parts. This division separates the given number by the radix point in a logical way and organizes each component into a different physical location. We can then apply efficient compression to those two parts, respectively. Bit splitting on fixed-point numbers is intuitively easy because of the fixed position of the decimal point. However, the splitting is not directly applicable to float data, since float numbers are not aligned by the radix point. We extract both parts of a given float number with bit operations. First, we extract the exponent with a bit-mask to locate the decimal point. With the exponent value, we then bit-shift the float number and extract all left-side bits (including one implicit bit) as the integer part, and the right-side bits as the fractional part. Figure 3 shows how this process works for a float value 23.1415 with scale 4, represented in the 32-bit IEEE format. The exponent (marked with green font) is decoded as 4, which means the integer part has 5 bits (with a single leading 1 given implicitly by the IEEE standard). We then extract the 4 leading bits from the mantissa (marked with red) and add the leading 1 to get the integer part. Based on Table 2, we extract the 15 following bits as the fractional part for the required precision of 4.

[Figure 3: Extracting the range-bounded integer part and precision-bounded fractional part of a float number. The example splits 23.1415 (binary 01000001101110010010000111001011; sign +, exp 4; 15 bits needed for 4-digit precision) in two steps: ① integer and fractional parts extraction, yielding 10111.001001000011100; ② range offset with range (20, 30), yielding 0011.001001000011100.]

With the range information either given by the user or detected by our encoder, we offset the integer part by the minimal value to further condense the range space. This splitting is fast and efficient, with only bit-wise operations and no if branches. It can be further improved if the user-provided data range has the same exponent value (e.g., all values fall within the same power of 2), in which case those float numbers have a fixed decimal point location, and float splitting can be conducted more easily with direct bit-masking.
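A Rust sketch of this extraction (our simplified version for positive 32-bit floats; Buff itself also handles 64-bit input) follows; split_f32(23.1415, 15, 20) returns the integer part 3 (23 offset by the range minimum 20) and the 15 fractional bits 001001000011100 from Figure 3:

```rust
/// Split a positive f32 into a range-offset integer part and the first
/// `frac_bits` fractional bits, assuming 0 <= exponent <= 22 and no
/// subnormals (i.e., bounded values >= 1.0).
fn split_f32(v: f32, frac_bits: u32, min_int: u32) -> (u32, u32) {
    let bits = v.to_bits();
    let exp = ((bits >> 23) & 0xff) - 127;         // unbiased exponent
    let mantissa = (bits & 0x7f_ffff) | (1 << 23); // restore the implicit 1
    // The 24-bit significand has exp + 1 integer bits at its top.
    let int_part = mantissa >> (23 - exp);         // bits left of the radix point
    // Shift the integer bits off the top of the u32, keep frac_bits of the rest.
    let frac_part = (mantissa << (exp + 9)) >> (32 - frac_bits);
    (int_part - min_int, frac_part)                // offset by the range minimum
}
```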

The float splitting divides the given float numbers into two components and then compresses each component separately. We discuss its potential benefits for progressive filtering and data skipping later in Section 3.4.1. Here our analysis focuses on its compression efficiency in terms of compression ratio and read/write throughput. The encoded bit length of the integer part is determined by the range of the input numbers, while the encoded bit length of the fractional part is determined by the actual precision required. We can compress both parts effectively, with delta and bit-packing encoding for the integer part and bounded precision for the fractional part. The fixed-length encoding simplifies the encoding and decoding process and makes it possible to apply data skipping on the encoded bits directly, which improves performance. In addition to the compression benefits, Buff also adopts float splitting to organize our compressed data better.

[Figure 4: Byte-oriented columnar layout. The extracted bits of each record (e.g., row 0: 1000.110111110100011) are split at byte boundaries into sub-columns x1 = 10001101, x2 = 11110100, and the trailing bits x3 = 011.]

However, float splitting introduces some overhead in the compression process. Byte-oriented reads/writes are more efficient than bit-oriented operations because of low-level data type boundary-crossing issues, where the CPU needs to unpack and offset the bits to align them on byte boundaries. Assume we have m bits to read/write; this can be finished with ⌊m/8⌋ byte reads/writes, and at most one more bit-level read/write for the remaining m - 8⌊m/8⌋ bits. Float splitting instead divides the m bits into x bits and y bits (where x + y = m), and this leads to ⌊x/8⌋ + ⌊y/8⌋ byte reads/writes, and at most two bit-level reads/writes for the remaining x - 8⌊x/8⌋ bits and y - 8⌊y/8⌋ bits. Additionally, the integer part usually takes less than one byte for many datasets, because of the narrow value range of many measurands. To avoid the slowness of bit reads/writes, we can always pad the integer bits with several following fractional bits to extend them into a byte. This keeps the fine-grained float splitting, but good read/write performance as well. If the goal is to save m bits anyway, we should split the bits in a more efficient way. This is the motivation for our advanced splitting version: byte-oriented float splitting.

Byte-oriented float splitting compression works similarly to the float splitting discussed previously. We first extract the integer bits and just enough fractional bits to meet the precision requirement, then offset the intermediate value by the minimal value before writing those bits out in byte units separately. Each byte unit is treated as a sub-column and stored together. We store the remaining trailing bits of each record as the last sub-column, as shown in Figure 4. We save the range and precision information as metadata along with the compressed data to guarantee the decompression capability. With byte-oriented float splitting, compression and decompression are improved because of the fast byte reads/writes. We can also support custom variable precision materialization and achieve more efficient query execution because of the byte-oriented splitting layout.
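As an illustration of the layout in Figure 4 (a sketch with our names, assuming the 19-bit codes of the running example), each record's code is sliced into two byte sub-columns plus a 3-bit trailing sub-column:

```rust
/// Arrange 19-bit codes (4 integer + 15 fractional bits) into the
/// byte-oriented sub-columns x1, x2, x3 of Figure 4.
fn to_sub_columns(codes: &[u32]) -> (Vec<u8>, Vec<u8>, Vec<u8>) {
    let (mut x1, mut x2, mut x3) = (Vec::new(), Vec::new(), Vec::new());
    for &code in codes {
        x1.push((code >> 11) as u8);   // leading byte
        x2.push((code >> 3) as u8);    // second byte (cast keeps the low 8 bits)
        x3.push((code & 0b111) as u8); // trailing 3 bits, bit-packed in practice
    }
    (x1, x2, x3)
}
```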
3.3 Handling Outliers
In addition to the variable precision materialization support brought by byte-oriented splitting, we can also handle outliers efficiently.

Outliers in the input data sequence are the data points that differ significantly from the others. Outliers enlarge the value range significantly and further inflate the code space, requiring more bits to encode the input data sequence. For example, 99% of the input values may be representable with 1 byte, while 1% needs 4 bytes. Encoding all numbers with the same length inflates the code space to 4 bytes because of the 1% outliers. The code space inflation, if not handled well, will deteriorate compression effectiveness and query efficiency. Outliers can be too large or too small, far away from the common range. In either case, the outliers usually only inflate the higher bits, either with padding zeros or padding ones. When there are both large and small outliers, the padding bits can be any bit sequence between all zeros and all ones. This requires us to detect outliers and distinguish the most frequent value in the higher encoded byte. We first head-sample the dataset and then apply the Boyer-Moore majority voting algorithm [11] on each compressed byte unit to find the majority item (occurring more than 50% of the time). If the majority item has a frequency greater than 90%, we take it as the most frequent value and re-organize the corresponding byte column to achieve better compression with sparse encoding, as shown in Figure 5: we keep the frequent value at the beginning and use a compressed bit-vector to indicate the outlier records, followed by the outlier values of the current byte column. The outlier handling enables better compression performance and faster query execution with the aid of bit-vectors. Higher byte columns can be skipped for most queries that target only normal values.

[Figure 5: Sparse encoding condenses a sub-column. A column of entries (16, 16, ..., 0, 16, ..., 121) becomes the frequent value 16, a run-length-compressed outlier bit-vector with its byte size, the number of outliers, and the outlier values (0, 121).]

Sparse encoding requires a number of bits to be allocated per value, to indicate whether it uses the frequent value or is an outlier. Compared with the pure byte representation, sparse encoding further reduces the compressed data size as long as there is a frequent item with a frequency greater than 1/8, assuming a 1-bit overhead per record in the bit-vector. However, the outlier scheme comes with the cost that special logic is required to handle outliers during compression and query execution. So we apply a factor analysis on the frequency of the frequent item through a micro-benchmark with a controlled outlier ratio. According to our experiments, sparse encoding achieves a significant query boost compared with the pure byte representation when the frequency of the frequent item is greater than 90%, which is the current default threshold in Buff.
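The detection and re-organization are straightforward to sketch in Rust (our simplification: a plain u64 bit-vector stands in for Buff's compressed bit-vector):

```rust
/// Sparse-encode one byte sub-column: Boyer-Moore majority vote, then,
/// if the majority value covers >= 90% of rows, return it together with
/// an outlier bit-vector and the outlier bytes; otherwise keep the column.
fn sparse_encode(col: &[u8]) -> Option<(u8, Vec<u64>, Vec<u8>)> {
    let (mut cand, mut count) = (0u8, 0usize);
    for &b in col {
        if count == 0 { cand = b; }
        if b == cand { count += 1; } else { count -= 1; }
    }
    let freq = col.iter().filter(|&&b| b == cand).count();
    if (freq as f64) < 0.9 * col.len() as f64 {
        return None; // not frequent enough: sparse encoding does not pay off
    }
    let mut bv = vec![0u64; (col.len() + 63) / 64];
    let mut outliers = Vec::new();
    for (i, &b) in col.iter().enumerate() {
        if b != cand {
            bv[i / 64] |= 1u64 << (i % 64); // mark the outlier row
            outliers.push(b);
        }
    }
    Some((cand, bv, outliers))
}
```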

3.4 Query Execution
In addition to the compression benefits, byte-oriented float splitting improves query execution on the target float columns in the following aspects: the byte-oriented data arrangement enables fast reading/writing and value parsing; fixed-length coding helps effective data skipping to save CPU resources; and the sub-column layout enables predicate push-down and progressive query execution. For efficient query execution directly on the encoded sub-columns, we apply query rewriting to decompose the original query operators into a combination of independent query operators on the sub-columns. Those translated query operators can be executed independently and combined to obtain the final results. If we evaluate each query operator on one sub-column at a time, it is possible to skip records already determined to be qualified or disqualified by previous sub-column query operators. Therefore, only a small number of records need to be parsed and further checked.

3.4.1 Progressive Filtering. With byte-oriented float splitting compression, we can progressively evaluate the translated filter on each sub-column sequentially, and obtain the final results when all filters are finished. Consider a given predicate x OP C, where x is the attribute stored using Buff, OP is the operator, and C is the operand of this predicate. Instead of decoding each record and matching the predicate, we rewrite the filtering predicate on the target column into a combination of filtering predicates on its compressed sub-columns. Assuming there are k sub-columns x1, x2, x3, ..., xk, we can get translated operands on those sub-columns as C1, C2, C3, ..., Ck, and then apply filtering on each sub-column to obtain the final results [27].

To save computation in the filtering, we can avoid evaluating some records based on our current observations. During the progressive filtering process, all the intermediate results on each sub-column are shipped via a bit-vector, similar to late materialization [49]. Two bit-vectors, RESULTS and TO_CHECK, are maintained to indicate the qualified rows and the suspect rows, respectively. The following filter only needs to further parse and check the suspect rows in the TO_CHECK bit-vector, then add qualified row numbers into RESULTS and still-undetermined rows into TO_CHECK for the filter of the following sub-column. As the filtering level goes deeper, fewer records need to be checked. We can even prune the filter execution early in some special cases if there is no item left in the TO_CHECK bit-vector. Even when early pruning does not happen, progressive filtering avoids a lot of value parsing and checking, which is especially valuable for the trailing-bits sub-column where value parsing is more expensive. Progressive filtering is eligible for most filtering query executions. We describe its detailed implementation for the EQUAL and GREATER predicates here. Other predicates such as GREATER_EQUAL or LESS_EQUAL can be achieved by logically combining the existing predicates; NOT_EQUAL can be achieved by negating the EQUAL results.

The equality filtering predicate x = C can be decomposed as x1 = C1 ∧ x2 = C2 ∧ x3 = C3 ∧ ... ∧ xk = Ck by query rewriting, which intuitively indicates that the equality predicate holds if and only if equality holds on all sub-columns. This can be formulated as:

x = C ⟺ ∧_{i=1..k} (xi = Ci)

The predicate push-down enables efficient progressive filtering on the sub-columns, where we can execute the most selective predicate first to maximize the filtering benefit: filtering out as many disqualified records as possible in the early phase. When filtering on the first sub-column, qualified rows are added into the TO_CHECK bit-vector for filtering on the following sub-columns. Data skipping is used in the following evaluations, and only rows in TO_CHECK are parsed and further evaluated. Disqualified rows are removed from the TO_CHECK bit-vector. On the last sub-column, all qualified rows are added into the RESULTS bit-vector as the final result for the filtering predicate x = C.

Continuing the example from Section 3.2, given an equality filtering predicate x = 23.1415, we can extract the bits of bounded range and precision, 0011.001001000011100, with the aid of the range and precision in the metadata. Then we can rewrite the predicate as x1 = 00110010 ∧ x2 = 01000011 ∧ x3 = 100. On the encoded file shown in Figure 4, progressive filtering starts by evaluating all values in the first sub-column x1; rows 1 and n-1 qualify under the first predicate, so they are added to TO_CHECK. On the second sub-column x2, only rows 1 and n-1 need to be checked; row n-1 is disqualified and thus removed from TO_CHECK. On the third sub-column x3, only row 1 is evaluated and qualifies, so row 1 is added into the RESULTS bit-vector and returned as the final result.

The range filtering predicate x > C can be decomposed as a combination of xi = Ci and xi > Ci after query rewriting. For example, x1 > C1 is sufficient to induce x > C, while x1 = C1 indicates that further evaluation is needed on x2. Then we proceed to the predicate evaluation on sub-column 2. Similarly, if x2 > C2, we have x > C. Otherwise, if x2 = C2, we proceed to sub-column 3. The evaluation is repeated until all sub-columns are processed. It can be formulated as:

x > C ⟺ ∨_{i=1..k} [ (x1 = C1) ∧ ... ∧ (x_{i-1} = C_{i-1}) ∧ (xi > Ci) ]

In the first step, all records are parsed and evaluated. In each following step, only rows in the TO_CHECK bit-vector from the previous step are decoded and evaluated. There are two categories of records generated in each step: qualified rows (where xi > Ci) are added to the RESULTS bit-vector, and unsettled rows (where xi = Ci) are added to a new TO_CHECK bit-vector. On the last sub-column, all qualified rows are added to the RESULTS bit-vector, which is returned as the final result for the filtering predicate x > C. During any filtering step, an early stop is activated when the TO_CHECK bit-vector is empty, and all following bits are skipped. In this case, NULL is returned for equality filtering, and the intermediate RESULTS bit-vector is returned for range filtering. We introduced the filtering cases above with a cold start, where all records are potentially qualified. However, our filtering operators can also be applied when intermediate query results are provided in the form of a bit-vector, by using the given bit-vector as the TO_CHECK for the first sub-column.
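The control flow of progressive filtering is compact. A minimal Rust sketch of the EQUAL case (ours; a row-id list stands in for the TO_CHECK/RESULTS bit-vectors, and the real operator also switches to SIMD scans as described in Section 3.4.5):

```rust
/// Progressive equality filter: compare each sub-column against its byte of
/// the rewritten operand, keeping only rows that still match.
fn progressive_equal(sub_columns: &[Vec<u8>], operand: &[u8]) -> Vec<usize> {
    let mut to_check: Vec<usize> = (0..sub_columns[0].len()).collect(); // cold start
    for (col, &c) in sub_columns.iter().zip(operand) {
        to_check.retain(|&row| col[row] == c); // parse only surviving rows
        if to_check.is_empty() {
            break; // early stop: no candidates left
        }
    }
    to_check // qualified rows (the RESULTS set)
}
```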
3.4.2 Progressive Aggregation. Byte-oriented float splitting compression also facilitates some aggregation queries with the aid of the techniques mentioned above [42]. Query execution can be improved either by data skipping for Min/Max queries or by transforming double operations into integer ones for Sum/Average queries.

Min/Max information is stored as metadata in the compressed byte array of each compressed segment. Therefore, an efficient metadata lookup is sufficient to answer those queries. However, the metadata lookup is not applicable for a custom query scope predefined by a filter. Our compression approach also supports fast custom-scope query execution, as encoded values are independently and neatly arranged at each sub-column level. Min/Max aggregation queries perform similarly to the equality query discussed previously. Starting with the first sub-column, the query executor finds the greatest/smallest value and puts all corresponding rows into the TO_CHECK bit-vector. The query executor then proceeds to the next sub-column, evaluates all records corresponding to rows belonging to the previous TO_CHECK bit-vector, and generates a new TO_CHECK bit-vector if applicable. The min/max value is assembled at the end and returned as the final result after all sub-columns are evaluated. A RESULTS bit-vector indicating the rows with the min/max value is also returned along with the query results. Min/Max queries are efficient with Buff because only very limited bits are evaluated during query execution, with the aid of progressive filtering and data skipping. In most cases, only a single row qualifies and is added into the TO_CHECK bit-vector for the first sub-column. Thus, the following sub-column evaluations merely skip to the target row and assemble that specific single record, which is more efficient than decoding all bits and comparing the decoded values.

Sum/Average aggregation queries are very efficient with our compression approach, with no full decoding needed. The byte-oriented splitting layout enables fast access to each byte/bit component corresponding to a different precision. Rather than decoding every record into a float value before applying the summation operation, the value is accumulated on each sub-column in the form of a small integer, until the final summation of the intermediate sum results from each sub-column, adjusted by its corresponding exponent and base value. The final summation is returned as the Sum result or further processed to get the Average result. For example, on the encoded file shown in Figure 4, we first sum the records in each sub-column, and get sum(x1), sum(x2), and sum(x3), respectively. Then we scale each summation by its corresponding basis, i.e., the weight of the sub-column's least significant bit; in this example, the basis for sub-column x1 is 2^-4, for x2 it is 2^-12, and for x3 it is 2^-15. The final sum result is

sum(x) = sum(x1) * 2^-4 + sum(x2) * 2^-12 + sum(x3) * 2^-15 + B × N,

where B is the base value (min value) of this encoding block and N is the number of entries. We can further divide by N to get the average. The value is not fully decompressed to the original double value during the query execution, and all the execution is achieved with integer arithmetic, which improves query performance.
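The Sum operator thus reduces to per-sub-column integer sums plus one final scaling. A sketch under the running example's assumptions (ours; the weights are the per-sub-column LSB bases such as 2^-4, 2^-12, 2^-15):

```rust
/// Sum a Buff-encoded column: accumulate each byte sub-column as integers,
/// scale by the weight of its least significant bit, and add base * count.
fn progressive_sum(sub_columns: &[Vec<u8>], lsb_weights: &[f64], base: f64) -> f64 {
    let n = sub_columns[0].len() as f64;
    let scaled: f64 = sub_columns
        .iter()
        .zip(lsb_weights)
        .map(|(col, &w)| col.iter().map(|&b| b as u64).sum::<u64>() as f64 * w)
        .sum();
    scaled + base * n
}
```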

3.4.3 Variable Precision Materialization and Aggregation. Reduced precision has emerged as a promising approach to improve power and performance trade-offs [15]. Customized precision originates from the fact that many applications can tolerate some loss in quality during computation, as in media processing and machine learning. The byte-oriented encoding layout also enables variable precision materialization in addition to the full precision required by downstream analytics. For example, consider a system-agnostic query like printf("%.2f", latitude) where AVG(temp)>90°F. We can skip reading the trailing bytes/bits of the temp column if we can already establish AVG(temp)>90° from several leading bytes, and only several leading bytes of latitude are needed for two-digit precision printing. For any reduced-precision value read, Buff aligns the requested precision with the least byte boundary covering the required precision. For example, for a given compressed dataset with 4 bits for the integer and 18 bits for the fractional part (supporting 5-digit precision at most), Buff materializes 2 bytes (⌈(4 + 11)/8⌉) to support a 3-digit precision query.

Buff also provides deterministic approximate aggregation query processing on data with reduced precision, as suggested by DAQ [42]. Recalling the previous example from Section 3.2, on the encoded file shown in Figure 4, we can get a deterministic approximate aggregation result by reading only partial data. Similar to the progressive aggregation execution in Section 3.4.2, we can get a sum approximation by reading the first sub-column as sum(x) ≈ sum(x1) * 2^-4 + B × N, with a deterministic lower bound sum(x1) * 2^-4 + B × N, where all trailing bits are zeros, and an upper bound sum(x1) * 2^-4 + (B + 2^-4) × N, where all trailing bits are ones. Similarly, for an avg query, we get a deterministic bound by reading the first sub-column as [sum(x1) * 2^-4 / N + B, sum(x1) * 2^-4 / N + (B + 2^-4)]. As we read and process more data, we can provide a more accurate result.

3.4.4 Data Skipping. Data skipping is another technique used to enhance query performance. After scanning a sub-column, we know that some records in the remaining columns do not satisfy the predicate, and we can skip them. The byte-oriented encoding layout enables efficient data skipping in the following aspects: the byte-oriented layout guarantees efficient data reading and skipping without handling any cross-boundary issues; the fixed coding length of each component enables efficient skipping with simple bit-length calculations; and the trailing bits in the last component are mostly skipped based on the previous evaluation. In some cases, we can prune the query execution early and totally skip checking the following bytes and trailing bits if no records need to be further checked. Besides, data skipping enables fast record access for custom row-level scope queries, such as queries on a given time interval for time series data. Data can still be skipped efficiently without parsing the irrelevant entries.

3.4.5 Adaptive Filtering. Different from prior works that use SIMD to accelerate query execution [20, 26, 41, 53], Buff uses a workload-adaptive execution strategy. When SIMD is available, Buff will use SIMD as much as possible. However, SIMD is not always the best when compared with scalar progressive filtering. SIMD is usually good for low-selectivity filtering, where a large percentage of records are evaluated, while progressive filtering is extremely efficient on highly selective filtering tasks. For each sub-column, Buff chooses a filtering strategy based on the density ratio r = #candidates/#records. By default, SIMD execution is used for the first sub-column unless a sparse bit-vector is provided as input. For the following sub-columns, Buff uses SIMD execution if the density ratio is high and progressive filtering otherwise. We performed a sensitivity analysis to find the crossover point: progressive filtering takes more time as the density ratio increases, and crosses over the constant SIMD cost at a density ratio of 0.06, which is the current SIMD threshold used in Buff's adaptive filtering.
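The decision rule itself is tiny; a sketch with our naming (the 0.06 threshold and the first-sub-column default come from the text above):

```rust
const SIMD_THRESHOLD: f64 = 0.06; // empirical crossover of the two strategies

enum Strategy { Simd, Progressive }

/// Pick a scan strategy for one sub-column from the candidate density ratio.
fn choose_strategy(first_sub_column: bool, candidates: usize, records: usize) -> Strategy {
    if first_sub_column {
        return Strategy::Simd; // default, unless a sparse bit-vector is supplied
    }
    let density = candidates as f64 / records as f64;
    if density > SIMD_THRESHOLD { Strategy::Simd } else { Strategy::Progressive }
}
```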
4 EXPERIMENTS
Compression approaches should keep the compressed data size small and support fast query execution, including fast filtering and aggregation queries. This section applies the compression techniques mentioned above on various datasets covering common applications with bounded floats, and reports both compression and query performance. In addition, we also include an end-to-end complex query evaluation on the Time Series Benchmark Suite (TSBS) [4].

All experiments were performed on servers with 2 Intel(R) Xeon(R) CPUs E5-2670 v3 @ 2.30GHz, 128GB memory, 250GB HDD, and Ubuntu 18.04. All our implementations and experiments are done in Rust 1.49.0, and we use AVX2_m256i for SIMD. To use ByteSlice for floats, we re-implement ByteSlice in Rust; we also implement Gorilla and Sprintz in Rust. For Gzip and Snappy, we use the Rust flate2 and parity-snappy libraries, respectively. We apply Gzip level-9 in our experiments. All our implemented baselines achieve their claimed throughput (e.g., Snappy: 250MB/s, Gzip: 20-50MB/s, Sprintz: 200MB/s).

4.1 Datasets
[Figure 6: Datasets have bounded range and precision. Panels show the value distribution and scale of (a) Taxi GPS, scale 6; (b) Power grid, 6; (c) CPU usage, 6; (d) Stock, 3; (e) UCR, 5; (f) Temperature, 1; (g) Current, 5; (h) House, 2.]

Our datasets cover various applications: stock data [1], taxi data from NYC (GPS) [6], DevOps monitoring (CPU) [45], city temperature (TEMP) [28], humidity [12, 13], power grid (PMU) [47], magnet motor (Current) [7], household electric power consumption (House) [2], and machine learning [34]. As most datasets are shared using a txt or csv format, we can easily extract the precision by counting the fractional digits. As we target a compression technique that would work with application code or a database, we evaluate an ingestion process that sends data in the IEEE 64-bit float format, which is used by many applications and compression evaluations [14-16, 31, 40, 51]. Figure 6 shows the data range, distribution, and scale (precision) for representative measurements of those datasets. The x-axis shows the data range, including all outliers if any. For GPS we use the latitude attribute. The data distribution varies across different measurements.

2593 measurements. Those datasets cover data with di￿erent precision CR 10

with/without outliers. Many datasets have a very skewed distribu- 8 TSBS Cthr tion, where most records are clustered around a tiny area. As an 6 4

example, most data points in NYC’s GPS dataset fall into the city 2 area with coordinate between (40.702541, -74.007347) and (40.793961, 0 -73.883324). The high decimal position with scale 6 guarantees sub- BUFF Max Mat. meter level precision of the data points. Similarly, all these mea- GORILLA surements have limited range and bounded precision. In addition SPRINTZ FIXED to those read world datasets, we also generate datasets from TSBS. GZIP Sum Range Here, we use the IoT dataset with a scale 1000, which includes 9.1 SCALED-SLICE billion data points about tracking information for trucks. BUFF-SLICE Equal

4.2 Optimized Baselines Figure 7: Compression performance overview (greater is bet- During our evaluation, we found that the idea of bounded precision ter, CR: compression ratio, Cthr: compression throughput). can improve Gorilla. Therefore, we provide an optimized version Bu￿ outperforms in query and compression throughput. of Gorilla for comparison. In addition, we include ￿xed-point with bounded precision and ByteSlice variations for ￿oat as baselines. 4.2.3 ByteSlice variations for float. ByteSlice is designed for an 4.2.1 Optimizing Gorilla. When we apply Gorilla on our datasets, input code of a bit-packed integer, which is not directly applicable we get a poor compression performance for most datasets. For to ￿oat numbers. In order to apply ByteSlice on ￿oats, a conversion datasets, such as CPU and Temperature, the compressed size is close from the integer code is needed to meet the ByteSlice input require- or even larger than the original size. After detailed analysis, we ment. We can either borrow from Bu￿ to extract the aligned bit ￿ 0 notice that many data points in these datasets uctuate over value representation of ￿oat (BUFF-SLICE) or scaling ￿oat into integer 2 or ± . There are two reasons why Gorilla is sub-optimal on datasets (SCALED-SLICE), and then bit-pack those code into ByteSlice. We ￿ ￿ 0 covering those values. First, oat numbers uctuating around compare against both ByteSlice variations. ￿ips the sign bit (the leading bit). Also, when ￿oats ￿uctuate around an absolute value 2, they ￿ip the leading bit of exponent bits as 4.3 Performance Overview exponent value 1 is encoded as 10000000 in a 32-bits ￿oat. Second, The radar chart of Figure 7 shows the overall performance of all ￿oat high precision ￿oating numbers make the mantissa bits sensitive compression approaches above. We normalized the performance to the value change, especially for those less signi￿cant bits. For results into a range between 1 and 10 on each dimension uniformly, these two reasons, Gorilla fails to perform e￿cient trailing zero and then take the average of all workable datasets (Sprintz and compression and performs poorly on these datasets. SCALED-SLICE fail on some dataset which are not included here). To improve the compression performance of vanilla Gorilla, we ByteSlice variations do not provide a specialized aggregation so- have two di￿erent directions to work on: lution, so we materialize before max and sum operation. For all ￿ • Leverage the leading zeros by o setting the input numbers metrics here, higher is better. As we can see, the state-of-the-art to avoid the "pitfall" value points. Gorilla approach does not work well on those datasets. Sprintz and ￿ • Use bounded precision oat for the input numbers to elimi- FIXED perform well in terms of compression ratio and material- ￿ nate the less signi cant bits in the mantissa part. ization performance on those datasets while performing poorly on ￿ We veri ed those two solutions on our datasets. We found the other performance dimensions. ByteSlice variations are good on ma- former solution improves compression ratio performance slightly terialization performance but sacri￿ce compression ratio because but at the cost of slowing down the compression throughput sig- of bit padding. Bu￿ performs best in terms of query and compres- ￿ ￿ ni cantly, since we have to o set the value before we can apply sion throughput. In addition, it handles outliers data quite well and ￿ gorilla encoding. 
To improve the compression performance of vanilla Gorilla, we have two different directions to work on:
• Leverage the leading zeros by offsetting the input numbers to avoid the "pitfall" value points.
• Use bounded-precision floats for the input numbers to eliminate the less significant bits in the mantissa part.
We verified both solutions on our datasets. We found the former improves the compression ratio slightly, but at the cost of slowing down compression throughput significantly, since we have to offset each value before we can apply Gorilla encoding. The latter solution, however, significantly improves compression performance. In the following evaluation, we include this latter Gorilla version (GorillaBD). Nevertheless, GorillaBD only improves compression performance and can do nothing to speed up query execution, because of the complex variable-length encoding/decoding mechanism inherited from Gorilla. This motivates us to devise a new compression scheme for floats.

4.2.2 Bounded Fixed-point. As discussed in Section 2.1, fixed-point is not precision-aware, since it uses every bit to approximate the value no matter what precision is needed. Additionally, fixed-point is not optimized for custom ranges, since it represents a fixed power-of-two range of the form -2^n to 2^n. To include fixed-point in our evaluation, we implement bounded fixed-point (FIXED) as a baseline, with bounded precision applied to achieve just-enough precision and delta encoding used to offset the custom range. In FIXED, we scan the file and encode the dataset with a bounded fixed-point representation.

4.2.3 ByteSlice variations for float. ByteSlice is designed for an input code of bit-packed integers, which is not directly applicable to float numbers. To apply ByteSlice to floats, a conversion into integer codes is needed to meet the ByteSlice input requirement. We can either borrow from Buff to extract the aligned bit representation of floats (BUFF-SLICE) or scale floats into integers (SCALED-SLICE), and then bit-pack those codes into ByteSlice. We compare against both ByteSlice variations. Both start from an integer code like the one sketched below.
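As a minimal sketch (our illustration; min and scale are assumed dataset statistics, not Buff's API), the quantization step shared by FIXED and SCALED-SLICE offsets each value by the range minimum and scales it into an integer code:

// Quantize a float into a bounded fixed-point integer code.
// `min` is the dataset's range minimum (delta offset); `scale` is
// 10^d for d requested fractional digits.
fn to_fixed_code(v: f64, min: f64, scale: f64) -> u64 {
    ((v - min) * scale).round() as u64
}

fn main() {
    // A thermometer reading with range [80.0, 120.9] and one decimal:
    assert_eq!(to_fixed_code(98.6, 80.0, 10.0), 186);
}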

4.3 Performance Overview

The radar chart in Figure 7 shows the overall performance of all the float compression approaches above. We normalize the performance results into a range between 1 and 10 on each dimension uniformly, and then take the average over all workable datasets (Sprintz and SCALED-SLICE fail on some datasets, which are not included here). ByteSlice variations do not provide a specialized aggregation solution, so we materialize before the max and sum operations. For all metrics here, higher is better. As we can see, the state-of-the-art Gorilla approach does not work well on these datasets. Sprintz and FIXED perform well in terms of compression ratio and materialization performance while performing poorly on the other performance dimensions. ByteSlice variations are good at materialization but sacrifice compression ratio because of bit padding. Buff performs best in terms of query and compression throughput. In addition, it handles outlier data quite well and achieves a better compression ratio on outlier-laden datasets (e.g., the GPS and Stock datasets). For some query types, such as min/max and highly selective equality filtering, Buff is up to 50× faster than the other compression baselines. General-purpose approaches (e.g., Gzip and Snappy) perform poorly in these dimensions.

4.4 Compression Performance

We evaluate both the compression ratio and the compression throughput on our datasets and benchmarks. A good compression scheme must be fast and effective.

4.4.1 Compression Ratio. Figure 8 shows the compression ratio for each compression approach. Gorilla compression performs poorly on most of our datasets, as there are very few repeated or similar adjacent values. GorillaBD works better than Gorilla, since the bounded-precision floats clear many trailing mantissa bits, resulting in a better compression ratio. Buff performs similarly to Sprintz and FIXED, since they all leverage the bounded precision and range to map the numbers into almost the same finite encoding space: the value range and the value precision determine the number of bits needed for the integer and fractional parts. The slight compression ratio difference between Sprintz and Buff comes from Buff's dividing of the bits, which wastes some code space when the bits are not fully used. However, this brings huge query benefits, which we show in Section 4.5. In addition, as mentioned in Section 3.3, Buff can detect outliers and enable sparse encoding to compress the leading byte column further, resulting in a better compression ratio on the GPS and Stock datasets compared with Sprintz and FIXED. ByteSlice variations pad the trailing bits for better query performance, so they are less effective than Buff, FIXED, and Sprintz on most datasets. Byte-oriented compression approaches compress data by searching for common sequences at the byte level, so they perform efficiently on the Temperature dataset, because of the low cardinality brought by low data precision, but perform poorly on datasets with fewer repeated values.
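As a worked example of that bit-count arithmetic (a sketch with illustrative names, not Buff's API), the width of the integer code follows directly from the range and the decimal scale:

// Bits needed to encode all codes in [0, (max - min) * 10^d].
fn bits_needed(min: f64, max: f64, decimal_digits: i32) -> u32 {
    let scale = 10f64.powi(decimal_digits);
    let max_code = ((max - min) * scale).ceil() as u64;
    64 - max_code.leading_zeros() // bit length of the largest code
}

fn main() {
    // The NYC latitude range above, at scale 6, fits in 17 bits:
    assert_eq!(bits_needed(40.702541, 40.793961, 6), 17);
}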

Figure 8: Compression ratio shows Buff is comparable to the best and always better than vanilla Gorilla (Sprintz and Scaled-slice fail on PMU).

4.4.2 Compression Throughput. In addition to the compression ratio, we also evaluate compression throughput. We count the time for loading data from memory, applying compression, and writing back to memory. Figure 9 shows the compression throughput for all float compression approaches. The leftmost bar corresponds to Buff with and without user-provided range statistics. With ranges given by a user, our method can skip the range-checking step during float decomposition and achieve higher compression throughput; we refer to this variant as "Fast-Buff", indicated by the star-marked bar drawn with the light green Buff bar as a base. Overall, Buff outperforms all its competitors on most datasets. Gorilla relies on the XOR between two adjacent values, then dynamically decides the number of leading and trailing zeros and writes the middle residue bits. Gorilla's complicated encoding mechanism and variable encoded length limit its encoding throughput.

Buff is faster than FIXED and Sprintz because of its fast byte-oriented writing: according to our profiling, Buff's byte-oriented writing is 6% ~ 12% faster than writing bits. Sprintz for floats also relies on arithmetic multiplication to quantize input float values into integers, which causes overflow issues, such that Sprintz and SCALED-SLICE fail on the PMU dataset. Overall, byte-oriented general-purpose compression is slower than Buff and Sprintz. Snappy achieves the highest compression throughput on the CPU dataset, but Snappy performs poorly on that dataset in the previous compression ratio experiment, where no compression is achieved at all. According to the previous experiments, Gzip provides a higher compression ratio but uses more CPU resources than Snappy, thus lowering compression throughput. Compared with the ByteSlice variations, Buff writes fewer bits but pays more overhead for writing the last trailing bits; therefore, we see very similar compression throughput on most datasets for both approaches, but Buff is faster on the GPS and Stock datasets, where more space is saved by sparse encoding.

Figure 9: Compression throughput shows Buff performs best in most cases with user-given range stats.

4.5 Query Performance

We also evaluate decompression and query performance. We measure execution time, including reading the data block from memory, parsing the data, and evaluating the query. For all queries evaluated, we leverage each compression approach's characteristics to speed up its query as much as possible. Techniques such as query rewriting, early pruning, and progressive filtering are used for each approach whenever feasible, as discussed in each experiment section. We use the most frequent value (lower selectivity) in the corresponding dataset as the predicate operand to conduct a fair comparison for all equality and range filtering; progressive filtering is less efficient with a low-selectivity predicate, as fewer values are skipped during query execution. We also run high-selectivity queries to show Buff's potential. The SCALED-SLICE and Sprintz bars for the PMU dataset are missing, as both fail on that dataset because of an overflow issue.

4.5.1 Filtering. We include a filtering performance evaluation with query rewriting, early pruning, progressive filtering, data skipping, and SIMD execution enabled whenever possible. For Buff, instead of materializing all records before applying the filtering predicate, we rewrite the query into a combination of predicates over all sub-columns, then apply the translated predicate on each sub-column sequentially, as the sketch below illustrates.
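A minimal sketch of this sub-column rewriting (our illustration, not Buff's operators), assuming the codes are already decomposed into byte-aligned sub-columns:

// Equality filter over decomposed byte sub-columns: compare the
// first byte of the target everywhere, then probe later sub-columns
// only for rows that are still candidates (progressive filtering).
fn equality_filter(sub_columns: &[Vec<u8>], target: &[u8]) -> Vec<usize> {
    let mut candidates: Vec<usize> = (0..sub_columns[0].len()).collect();
    for (col, &byte) in sub_columns.iter().zip(target) {
        candidates.retain(|&row| col[row] == byte);
        if candidates.is_empty() {
            break; // early pruning: no row can match anymore
        }
    }
    candidates
}

fn main() {
    // Byte sub-columns for the codes 0x0102, 0x0103, 0x0202:
    let cols = vec![vec![0x01, 0x01, 0x02], vec![0x02, 0x03, 0x02]];
    assert_eq!(equality_filter(&cols, &[0x01, 0x02]), vec![0]);
}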
We also use adaptive filtering and data skipping with the aid of intermediate bit-vector results. For Sprintz and FIXED, we apply query rewriting techniques to avoid decompression overhead. For the ByteSlice variations, we adopt their SIMD solution for filtering queries.

Figure 10: Low selective equality filter.
Figure 11: Low selective range filter.
Figure 12: High selective equality filter.

Our first set of experiments shows low-selectivity equality filtering performance. In Figure 10, Buff achieves outstanding performance compared to all the other competitors. This speedup comes mainly from adaptive filtering and data skipping: as we execute the predicate on the first sub-column, only a tiny fraction of the input records qualify and need to be checked further by the subsequent sub-column predicates. For most datasets, fewer than 5% of entries remain qualified after filtering on the first sub-column with Buff. However, for the GPS and Stock datasets, more than 90% of entries are still qualified, because the code space is inflated by outliers. Since we hit the frequent value in this case, we can skip decoding the first sub-column by directly using the bit-vector from sparse encoding, which speeds up the query. Sprintz and FIXED filtering use query rewriting to avoid record decompression as well: the compressed bits are extracted and compared directly against the translated target using integer arithmetic. ByteSlice variations are consistently the second-best approaches, thanks to their SIMD speedup and early stopping. Buff has better data and code locality, as it works on the same predicate for the current sub-column, while ByteSlice needs to load all sub-columns and jump to a different sub-column, with the corresponding predicate evaluation, in each iteration. Additionally, always using SIMD can slow ByteSlice down when there is only a single match, as it results in SIMD loading and evaluation for all adjacent records within the same SIMD word length; in Buff, data skipping can efficiently jump to the target record and avoid unnecessary data loading. Apart from the aforementioned approaches, decompression is required before predicate evaluation on the float numbers.
General-purpose byte-oriented compression approaches are sequential algorithms and need to decompress up to the last bits before the original data can be accessed. Gorilla encodes values with differing bit counts and a variable-length bit representation, which impedes query translation and data skipping. Both properties cause high latency when filtering compressed data with those methods.

Figure 11 shows the low-selectivity greater-than range filtering experiments with the same operand. As we can see, range filtering performs similarly to equality filtering but with smaller throughput overall. The selectivity of the range query is much lower than that of the previous equality filtering: there are more qualified records in this case, which triggers the IF branch that edits the bitmap more frequently, so filtering throughput deteriorates compared with the previous equality filtering experiments. Similarly, ByteSlice variations are slower than on equality filtering because of more comparisons and less early stopping. However, Buff still performs best. As with equality filtering, all other approaches require full decompression; as we can see from the figure, their filtering performance is proportional to their materialization performance, which we discuss later in Section 4.5.3. Figure 12 shows high-selectivity equality queries that filter out the majority of records. Buff is more efficient on the highly selective query, as more bits are skipped by progressive filtering. Overall, Buff achieves a 20× average query speedup compared with the second-best float compression approach. Buff shows similar trends on high-selectivity range queries, where it achieves a 16× average query speedup compared to the second-best float compression method (we omit the figure).

4.5.2 Aggregation. An aggregation query is another frequent query for numeric data analysis. We evaluate the query performance of min/max and sum for all compression approaches.

Figure 13 shows min/max query throughput. The query time includes bit parsing and query execution. Even though we can quickly extract the min/max value from the data statistics in Buff's metadata block, we force aggregation query execution on the compressed data for a fair comparison; this is also necessary for aggregation with a filter. In addition to this setting, we enable all the techniques used in the previous experiments. For min/max aggregation, we can use progressive filtering and data skipping for Buff. For FIXED and Sprintz, we find the min/max integer code directly and then convert this value into a float number. For the other approaches, float numbers are decompressed for evaluation. The results show that Buff achieves a 22× average query speedup over the second-best float compression approaches, Sprintz and FIXED. The min/max value usually has a low frequency, so only a tiny portion of entries are added to the bit-vector and need to be checked in the following sub-columns. This makes Buff extremely efficient for min/max aggregation queries, since a huge number of entries (bits) are skipped during the evaluation step.

The sum query performs similarly to materialization, as all compressed data has to be decompressed and the techniques mentioned previously are ineligible in this scenario. Figure 14 shows the sum query throughput on the same datasets. Buff outperforms all its competitors. For Sprintz, FIXED, and Buff, we use lazy materialization to avoid the overhead of fully decompressing each entry. The intermediate sum is calculated with an integer representation for Sprintz and FIXED, then converted back to a float number at the very end. This is similar to the sum operator in Buff, where the sum is calculated on each sub-column in its integer representation and the partial sums are combined into a final floating-point result at the end. The performance difference between these approaches comes from Buff's more efficient reading and parsing and the absence of data dependencies between adjacent numbers. All the other approaches perform poorly, since full decompression is inevitable.
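A minimal sketch of that per-sub-column sum (our illustration; scale and min are the assumed dataset statistics from encoding, not Buff's API):

// Sum over decomposed byte sub-columns: sum each column in the
// integer domain, weight by byte significance, and convert to float
// only once at the end.
fn decomposed_sum(sub_columns: &[Vec<u8>], scale: f64, min: f64) -> f64 {
    let n = sub_columns[0].len() as u64;
    let mut total: u64 = 0;
    for (i, col) in sub_columns.iter().enumerate() {
        // 256^(number of bytes below this sub-column) is its place value.
        let weight = 1u64 << (8 * (sub_columns.len() - 1 - i));
        total += weight * col.iter().map(|&b| b as u64).sum::<u64>();
    }
    // Undo quantization: every code was (v - min) * scale.
    total as f64 / scale + min * n as f64
}

fn main() {
    // Codes 0x0102 (2.58) and 0x0103 (2.59) at scale 100, min 0.0:
    let cols = vec![vec![0x01, 0x01], vec![0x02, 0x03]];
    assert!((decomposed_sum(&cols, 100.0, 0.0) - 5.17).abs() < 1e-9);
}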

Figure 13: Max throughput.
Figure 14: Sum throughput.
Figure 15: Materialization throughput.

4.5.3 Full Materialization. Materialization performance varies a lot across the compression approaches. Overall, materialization overhead comprises bit-parsing time and assembly time.

Fixed-length encoding helps bit parsing. As shown in Figure 15, Sprintz and FIXED perform better than Buff, as they read a single fixed-length data chunk during materialization; this is faster than Gorilla, which reads variable-length bits for each record, and than Buff, where multiple data chunks need to be fetched as requested and then parsed and concatenated into a float number (sketched below). However, Buff is still comparable to Sprintz and FIXED thanks to its efficient reading and parsing. ByteSlice variations are faster than Buff because of their fast byte reading and decoding. Since Buff's bit extraction is more efficient on decompression than the scaled version, BUFF-SLICE is always faster than SCALED-SLICE. Gzip is the slowest among all candidates, since it has to decode the Huffman encoding before restoring the recurring sub-sequences. Snappy achieves better performance than Gzip because it skips Huffman encoding for higher throughput.
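A minimal sketch of that fetch-and-concatenate assembly (our illustration with assumed scale/min statistics, not Buff's decoder):

// Full materialization from byte sub-columns: rebuild each row's
// integer code, then undo the quantization to recover the float.
fn materialize(sub_columns: &[Vec<u8>], scale: f64, min: f64) -> Vec<f64> {
    (0..sub_columns[0].len())
        .map(|row| {
            // Concatenate this row's byte slices into one code.
            let code = sub_columns
                .iter()
                .fold(0u64, |acc, col| (acc << 8) | col[row] as u64);
            code as f64 / scale + min
        })
        .collect()
}

fn main() {
    let cols = vec![vec![0x01, 0x01], vec![0x02, 0x03]];
    assert_eq!(materialize(&cols, 100.0, 0.0), vec![2.58, 2.59]);
}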


reads variable-length bits for each record, and Bu where multi- 2875 2431 1231 5172 1328 1442 1114 1032 20 ple data chunks need to be fetched as requested, then parsed and concatenated into a ￿oat number. However, Bu￿ is still comparable 10 with Sprintz and FIXED thanks to the e￿cient reading and pars- Runtime (ms) 0 ing of Bu￿. ByteSlice variations are faster than Bu￿ because of its projection range range-single fast byte reading and decoding bene￿ts. Since Bu￿ bit extraction is more e￿cient than the scaled version on decompression, BUFF- Figure 17: TSBS query runtime for ￿oat attributes. SLICE is always faster than SCALED-SLICE. Gzip is the slowest ￿ among all candidates since it has to decode Hu man encoding ￿oat-related operations, those end-to-end queries also include any before restoring the recurring sub-sequences. Snappy achieves a ￿lter, join, sorting on timestamp, string, or integer attributes. The ￿ better performance than Gzip because it skips Hu man encoding runtime for these columns is independent of ￿oat compression for higher throughput. approaches, so we take do not show them and only focus analysis ￿ ￿ 10000 24 on query runtime of oat attributes. For Gorilla, the oat-associated 22 cost makes up 54% to 76% of the total runtime. However, Bu￿ 8000 20 can diminish those costs to 1 or 2 orders of magnitude smaller 18 6000 Mat on those queries. Figure 17 highlights the query runtime on ￿oat Sum 16 ￿ 4000 14 Bits read attributes, Bu always performs the best over all other baseline ￿ Throughput(MB/s) 12 and is faster than ByteSlice variations on TSBS range lter queries 2000 10 since adaptive ￿ltering chooses progressive ￿ltering execution for 8 0 the highly selective ￿lter. full 4 3 2 1 int Precision 5 CONCLUSIONS Figure 16: Materialization and aggregation with targeted pre- This paper proposes Bu￿, a novel decomposed ￿oat compression cision boosts throughput by only reading the leading bytes. for low precision ￿oating-point data generated everywhere and 4.5.4 Materialization and Aggregation with Requested Precision. every day. Bu￿ uses “just-enough” precision for a given dataset, and We also show the versatility of Bu￿ on variable precision mate- leverages the data distribution feature (range, frequent value) to rialization and aggregation support, which are especially useful simplify its encoding mechanism and enhance the compression per- when users only care about limited precision in formatted output formance. The compression supports e￿cient in-situ adaptive query and fast estimation of aggregation queries. Figure 16 shows the on encoded data directly. In addition, Bu￿ provides fast aggregation materialization and aggregation with requested precision on the and materialization for di￿erent precision levels. Bu￿’s precision UCR dataset. The red line-based right y-axis shows the bits needed bounded technique is also applicable to the state-of-the-art Gorilla to support the requested precision. The query performance linearly method and improves Gorilla’s compression performance signi￿- increases as bytes read decreases. A single expensive bit read can cantly. Our evaluation shows Bu￿ achieves fast compression and signi￿cantly slow down the query performance, as we can see from query execution, while keeps comparable compression ratio to the the ￿rst set of results on full data with 3 bytes and 1 bits read. best approaches. 
Figure 16: Materialization and aggregation with targeted precision boosts throughput by only reading the leading bytes.

4.6 Benchmark Evaluation

In this set of experiments, we show complex query performance with TSBS. On TSBS, we generate "last-loc", "low-fuel" and "high-load" queries, which respectively include a projection on the longitude and latitude attributes (projection), a filter on a fuel_state attribute that occupies a single sub-column under Buff encoding (range-single), and a filter on a current load attribute (range). In addition to the float-related operations, these end-to-end queries also include filters, joins, and sorting on timestamp, string, or integer attributes. The runtime for those columns is independent of the float compression approach, so we do not show it and focus our analysis only on the query runtime of the float attributes. For Gorilla, the float-associated cost makes up 54% to 76% of the total runtime; Buff diminishes those costs by one to two orders of magnitude on these queries. Figure 17 highlights the query runtime on float attributes: Buff always performs best among all baselines and is faster than the ByteSlice variations on the TSBS range filter queries, since adaptive filtering chooses progressive filtering execution for the highly selective filter.

Figure 17: TSBS query runtime for float attributes.

5 CONCLUSIONS

This paper proposes Buff, a novel decomposed float compression scheme for the low-precision floating-point data generated everywhere, every day. Buff uses "just-enough" precision for a given dataset and leverages data distribution features (range, frequent values) to simplify its encoding mechanism and enhance its compression performance. The compression supports efficient in-situ adaptive queries directly on the encoded data. In addition, Buff provides fast aggregation and materialization at different precision levels. Buff's precision-bounding technique is also applicable to the state-of-the-art Gorilla method and improves Gorilla's compression performance significantly. Our evaluation shows that Buff achieves fast compression and query execution while keeping a compression ratio comparable to the best approaches. Future work includes studying the performance of Buff for common downstream complex analytical tasks, such as classification [39], clustering [36, 37], and anomaly detection [10].

ACKNOWLEDGMENTS

We thank the reviewers for their helpful feedback. This work was supported by the CERES center and gifts from NetApp, Google DAPA, Cisco Systems, and Exelon Utilities. Results presented were obtained using the Chameleon testbed supported by the NSF.

REFERENCES
[1] Huge stock market dataset. https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs.
[2] Individual household electric power consumption data set. https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption.
[3] InfluxDB - open source time series, metrics, and analytics database. https://www.influxdata.com/.
[4] Time series benchmark suite (TSBS). https://github.com/timescale/tsbs.
[5] Time-series data simplified. https://www.timescale.com.
[6] TLC trip record data. www.nyc.gov/html/tlc/html/about/triprecorddata.shtml.
[7] Torque characteristics of a permanent magnet motor. https://www.kaggle.com/graxlmaxl/identifying-the-physics-behind-an-electric-motor.
[8] G. Antoshenkov. Dictionary-based order-preserving string compression. The VLDB Journal, 6(1):26–39, 1997.
[9] D. Blalock, S. Madden, and J. Guttag. Sprintz: Time series compression for the internet of things. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2(3):1–23, 2018.
[10] P. Boniol, J. Paparrizos, T. Palpanas, and M. J. Franklin. SAND: streaming subsequence anomaly detection, 2021.
[11] R. S. Boyer and J. S. Moore. MJRTY—a fast majority vote algorithm. In Automated Reasoning, pages 105–117. Springer, 1991.
[12] J. Burgués, J. M. Jiménez-Soto, and S. Marco. Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models. Analytica Chimica Acta, 1013:13–25, 2018.
[13] J. Burgués and S. Marco. Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors. Analytica Chimica Acta, 1019:49–64, 2018.
[14] M. Burtscher and P. Ratanaworabhan. FPC: A high-speed compressor for double-precision floating-point data. IEEE Transactions on Computers, 58(1):18–31, 2008.
[15] S. Cherubin and G. Agosta. Tools for reduced precision computation: A survey. ACM Computing Surveys (CSUR), 53(2):1–35, 2020.
[16] F. De Dinechin, J. Detrey, O. Cret, and R. Tudoran. When FPGAs are better at floating-point than microprocessors. In FPGA, volume 8, page 260, 2008.
[17] P. Deutsch et al. Gzip file format specification version 4.3. Technical report, RFC 1952, May 1996.
[18] H. Dietz, B. Dieter, R. Fisher, and K. Chang. Floating-point computation with just enough accuracy. In International Conference on Computational Science, pages 226–233. Springer, 2006.
[19] A. Dziedzic, J. Paparrizos, S. Krishnan, A. Elmore, and M. Franklin. Band-limited training and inference for convolutional neural networks. In International Conference on Machine Learning, pages 1745–1754. PMLR, 2019.
[20] Z. Feng, E. Lo, B. Kao, and W. Xu. ByteSlice: Pushing the envelope of main memory data processing with a new storage layout. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 31–46, 2015.
[21] J.-l. Gailly and M. Adler. Zlib compression library. 2004.
[22] A. Gupta, D. Agarwal, D. Tan, J. Kulesza, R. Pathak, S. Stefani, and V. Srinivasan. Amazon Redshift and the case for simpler data warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1917–1923, 2015.
[23] B. Hentschel, M. S. Kester, and S. Idreos. Column sketches: A scan accelerator for rapid and robust predicate evaluation. In Proceedings of the 2018 International Conference on Management of Data, pages 857–872, 2018.
[24] D. Hough. Applications of the proposed IEEE 754 standard for floating-point arithmetic. Computer, (3):70–74, 1981.
[25] M. Isenburg, P. Lindstrom, and J. Snoeyink. Lossless compression of predicted floating-point geometry. Computer-Aided Design, 37(8):869–877, 2005.
[26] H. Jiang and A. J. Elmore. Boosting data filtering on columnar encoding with SIMD. In Proceedings of the 14th International Workshop on Data Management on New Hardware, pages 1–10, 2018.
[27] H. Jiang, C. Liu, Q. Jin, J. Paparrizos, and A. J. Elmore. PIDS: attribute decomposition for improved compression and query performance in columnar storage. Proceedings of the VLDB Endowment, 13(6):925–938, 2020.
[28] K. Kissock. University of Dayton Kettering Labs window energy use analysis. 2007.
[29] Y. Li and J. M. Patel. BitWeaving: fast scans for main memory data processing. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pages 289–300, 2013.
[30] P. Lindstrom. Fixed-rate compressed floating-point arrays. IEEE Transactions on Visualization and Computer Graphics, 20(12):2674–2683, 2014.
[31] P. Lindstrom and M. Isenburg. Fast and efficient compression of floating-point data. IEEE Transactions on Visualization and Computer Graphics, 12(5):1245–1250, 2006.
[32] C. Liu, M. Umbenhower, H. Jiang, P. Subramaniam, J. Ma, and A. J. Elmore. Mostly order preserving dictionaries. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1214–1225. IEEE, 2019.
[33] I. Müller, A. Arteaga, T. Hoefler, and G. Alonso. Reproducible floating-point aggregation in RDBMSs. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1049–1060. IEEE, 2018.
[34] J. Paparrizos. 2018 UCR time-series archive: Backward compatibility, missing values, and varying lengths, January 2019. https://github.com/johnpaparrizos/UCRArchiveFixes.
[35] J. Paparrizos and M. J. Franklin. GRAIL: efficient time-series representation learning. Proceedings of the VLDB Endowment, 12(11):1762–1777, 2019.
[36] J. Paparrizos and L. Gravano. k-Shape: Efficient and accurate clustering of time series. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1855–1870, 2015.
[37] J. Paparrizos and L. Gravano. Fast and accurate time-series clustering. ACM Transactions on Database Systems (TODS), 42(2):1–49, 2017.
[38] J. Paparrizos, C. Liu, B. Barbarioli, J. Hwang, I. Edian, A. J. Elmore, M. J. Franklin, and S. Krishnan. VergeDB: A database for IoT analytics on edge devices. In CIDR, 2021.
[39] J. Paparrizos, C. Liu, A. J. Elmore, and M. J. Franklin. Debunking four long-standing misconceptions of time-series distance measures. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pages 1887–1905, 2020.
[40] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro, Q. Huang, J. Meza, and K. Veeraraghavan. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment, 8(12):1816–1827, 2015.
[41] O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1493–1508, 2015.
[42] N. Potti and J. M. Patel. DAQ: a new paradigm for approximate query processing. Proceedings of the VLDB Endowment, 8(9):898–909, 2015.
[43] P. Ratanaworabhan, J. Ke, and M. Burtscher. Fast lossless compression of scientific floating-point data. In Data Compression Conference (DCC'06), pages 133–142. IEEE, 2006.
[44] D. R.-J. G.-J. Rydning. The digitization of the world from edge to core. Framingham: International Data Corporation, 2018.
[45] M. Shahrad, R. Fonseca, Í. Goiri, G. Chaudhry, P. Batum, J. Cooke, E. Laureano, C. Tresness, M. Russinovich, and R. Bianchini. Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider. arXiv preprint arXiv:2003.03423, 2020.
[46] V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, and C. Bornhövd. Efficient transaction processing in SAP HANA database: the end of a column store myth. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 731–742, 2012.
[47] E. Stewart, A. Liao, and C. Roberts. Open µPMU: A real world reference distribution micro-phasor measurement unit data set for research and application development. 2016.
[48] J. Stokes. Inside the Machine: an illustrated introduction to microprocessors and computer architecture. No Starch Press, 2007.
[49] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al. C-Store: a column-oriented DBMS. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker, pages 491–518. 2018.
[50] Z. Wang, K. Kara, H. Zhang, G. Alonso, O. Mutlu, and C. Zhang. Accelerating generalized linear models with MLWeaving: A one-size-fits-all system for any-precision learning. Proceedings of the VLDB Endowment, 12(7):807–821, 2019.
[51] N. Whitehead and A. Fit-Florea. Precision & performance: Floating point and IEEE 754 compliance for NVIDIA GPUs. rn (A+B), 21(1):18749–19424, 2011.
[52] Wikipedia contributors. Snappy (compression) — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=Snappy_(compression)&oldid=977673115, 2020. [Online; accessed 14-September-2020].
[53] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. Proceedings of the VLDB Endowment, 2(1):385–394, 2009.
[54] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977.
