(12) Patent Application Publication (10) Pub. No.: US 2016/0248440 A1
US 20160248440A1

(19) United States
(12) Patent Application Publication, Greenfield et al.
(10) Pub. No.: US 2016/0248440 A1
(43) Pub. Date: Aug. 25, 2016
(54) SYSTEM AND METHOD FOR COMPRESSING DATA USING ASYMMETRIC NUMERAL SYSTEMS WITH PROBABILITY DISTRIBUTIONS
(71) Applicants: Daniel Greenfield, Cambridge (GB); Alban Rrustemi, Cambridge (GB)
(72) Inventors: Daniel Greenfield, Cambridge (GB); Alban Rrustemi, Cambridge (GB)
(21) Appl. No.: 15/041,228
(22) Filed: Feb. 11, 2016
(30) Foreign Application Priority Data: Feb. 11, 2015 (GB) 1502286.6

Publication Classification
(51) Int. Cl. H03M 7/30 (2006.01)
(52) U.S. Cl. CPC H03M 7/30 (2013.01)

(57) ABSTRACT
A data compression method using the range variant of asymmetric numeral systems to encode a data stream, where the probability distribution table is constructed using a Markov model. This type of encoding results in output that has higher compression ratios when compared to other compression algorithms and it performs particularly well with information that represents gene sequences or information related to gene sequences.

FIGURE 1 (Sheet 1 of 2): SIMD lookup over a 128-bit vector (example entries 0, 500, 580, 700, 750, 4096; subtraction results 770, 270, 190, 70, 20, with the remaining lanes marked as underflow). a) 8-entry SIMD lookup: vector sub (e.g. psubw), then vector minpos (e.g. phminposuw). b) 16-entry SIMD lookup: two vector subs (e.g. psubw), vector min (e.g. pminuw), then vector minpos (e.g. phminposuw).

FIGURE 2 (Sheet 2 of 2): 64-entry SIMD lookup using vector minpos (e.g. phminposuw).

SYSTEM AND METHOD FOR COMPRESSING DATA USING ASYMMETRIC NUMERAL SYSTEMS WITH PROBABILITY DISTRIBUTIONS

TECHNICAL FIELD

[0001] The present disclosure relates to methods of compressing data, in particular to encoding and decoding methods that involve the range variant of asymmetric numeral systems (rANS). More specifically, the present disclosure concerns a method of using probability distribution tables that are constructed using Markov modelling of subsets of data as part of rANS encoding. The present disclosure also concerns the use of this encoding method in compressing data. More specifically, the present disclosure concerns the use of this method in compressing information that contains or is related to gene sequences. Furthermore, the present disclosure relates to software products recorded on machine readable data storage media, wherein the software products are executable on computing hardware for implementing aforesaid methods.

BACKGROUND

[0002] The most popular compression algorithms use one of the following two approaches for encoding data, i.e. Huffman encoding and arithmetic/range encoding. Huffman encoding is much faster but may result in low compression rates. Arithmetic encoding, on the other hand, results in higher compression rates at the expense of additional computational cost (i.e. slower) for encoding and decoding. Moreover, Asymmetric Numeral Systems (ANS) represent a relatively new approach to lossless entropy encoding, with the potential to combine the speed of Huffman encoding and compression ratios of arithmetic encoding (Jarek Duda, Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding, arXiv:1311.2540; F. Giesen, Interleaved entropy coders, arXiv:1402.3392, 2014).

[0003] Further, ANS has two general drawbacks:

[0004] 1. it relies on a static probability distribution, and

[0005] 2. it decodes symbols in the reverse order to encoding (i.e. it is LIFO, Last In First Out), making it unsuitable for very large blocks of data.

[0006] There are several implications that result from ANS's reliance on static probability distributions:

[0007] 1. A static probability distribution either needs to be built on the full dataset, or a subset of it. This needs to be done before compression can proceed.

[0008] 2. The probability distributions themselves can be resource intensive to store, especially for compressing small blocks of data. For large blocks of data, the probability distribution is expensive to calculate. Since a fixed probability distribution is required before compression can begin, building a distribution table on a large dataset is not a practical option for streaming applications.

[0009] 3. If a symbol is encountered that has an expected probability of 0 then it will fail encoding and corrupt subsequent data.

[0010] 4. Adjusting the probability of rare symbols to a minimum positive value can greatly affect the compression achieved. For example, if only 12 symbols are used for 99.9999% of the data, but there are 256 total symbols supported, then allocating a non-zero probability to the other 244 symbols can result in a considerable reduction of compression efficiency for the 12 main symbols, especially if probability distributions are on an integer scale of 4096.

[0011] Also, there is another issue with the rANS variant: despite its highly efficient coding scheme, it assumes that each symbol is independent of history (i.e. rANS does not exploit inter-symbol redundancy). This means that, unless carefully managed, rANS by itself can easily perform worse when compared to other approaches such as zlib, even though that only uses Huffman compression.

[0012] Therefore, there exists a need for an efficient data compression method and a system.

SUMMARY

[0013] The present disclosure seeks to provide an improved method of data compression.

[0014] The present disclosure also seeks to provide a system configured to efficiently perform the data compression.

[0015] According to an aspect of the present disclosure, provided is a method for encoding a data stream using the rANS technique. The rANS utilises a probability distribution table that is constructed using a Markov model.

[0016] The method of the present disclosure provides better compression ratio without a significant impact on the speed of compression and decompression.

[0017] Optionally, the method may further comprise utilising a plurality of probability distribution tables. Each of the plurality of probability distribution tables is constructed using a Markov model from a subset of the data stream. The additional probability distribution tables better represent the data that is being compressed. The construction of these multiple tables is conducted across stages, in an iterative manner. The initial portion of the data is compressed using a first model. Further, when the compression of this initial portion is conducted, a probability distribution table can be optionally constructed alongside compression. This new table can then be used to compress a subsequent portion of the data.

[0018] Optionally, the method presented in this disclosure includes an escape code that is used to extend the probability distribution table in order to handle the encoding of symbols or transitions encountered in data that cannot be encoded directly from the information that is present in the probability distribution table. This provides simpler handling of cases where the model does not cover symbols or transitions, making the method simpler, more efficient, and faster.

[0019] Optionally, the method presented in this disclosure includes steps to compress and decompress the probability distribution tables. This reduces overheads of storing these tables alongside encoded/compressed data.

[0020] According to another aspect of the present disclosure, provided is a computing device (or a system) executing the above method for encoding the data stream using the rANS technique. Further, the computing device is configured to store data that is encoded using the above method.

DESCRIPTION OF THE DIAGRAMS

[0021] Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:

[0022] FIG. 1 is a schematic illustration of efficient lookups that can be executed using SIMD operations; and

[0023] FIG. 2 is a schematic illustration of tree-based vector lookups that can be executed using SIMD operations for a full table that is 64 entries in size.

DETAILED DESCRIPTION OF EMBODIMENTS

[0024] In an embodiment, rANS can use tables to store the approximate probability distribution of symbols, where typically there are many approximate tables to choose from. It is simpler than conventional arithmetic encoding because it uses a single natural number as the state, instead of using a pair to represent a range. Further, compared to conventional arithmetic encoding, this simplicity also results in significant improvements in encoding/decoding performance. There are different ANS approaches to encoding, with range ANS (rANS) being among the most promising. The rANS also allows vectorisable SIMD processing, which enables further performance improvements in modern CPU and GPU architectures. The rANS variety utilises techniques in the alignment and cumulative probability distribution to use simple lookups and bit shifts, with a vectorisable version that fits into 4x32-bit or 8x32-bit SIMD processing pipelines. In particular such a SIMD version will be very efficient when used with a probability distribution table scaled to 4096 integer values.

[0031] That is, this approximates an order-n Markov-chain transition probability model.

[0032] To decompress in-order, symbols $\{S_1, \ldots, S_N\}$ are compressed in reverse order so that (k increasing from 1 to N):

$$X_{k+1} = E(X_k, M_{N-k+1}, S_{N-k+1})$$

[0033] and in decompressing in-order we then have (k increasing from 1 to N):

B) A Method for Handling Unexpected Transitions/Symbols

[0034] In order to handle the special case where $\Pr(S_k \mid M(S_{k-1}, \ldots, S_{k-n})) = 0$, the encoder and decoder are replaced with modified versions E' and D' respectively, using a modified model M' that incorporates an escape code (esc). In particular:
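As an illustration of the table-driven rANS coding of paragraph [0024], combined with the reverse-order encoding of paragraph [0032] and the subtract-then-minimum-position lookup named in FIG. 1, the following is a minimal Python sketch and not the implementation of the disclosure. Several things here are assumptions that do not come from the text: the model is order-1 rather than order-n, the tables are built once from the full input and shared by encoder and decoder, add-one counting is used in place of the escape code so that no transition has probability 0, the first symbol uses a fixed start context, and the byte-wise renormalisation with a lower bound of 2^23 is borrowed from commonly published rANS implementations. All function names are hypothetical.

```python
SCALE_BITS = 12                 # tables scaled to 4096, as in paragraph [0024]
SCALE = 1 << SCALE_BITS
RANS_L = 1 << 23                # assumed lower bound of the normalised state


def build_tables(data, alphabet):
    """Order-1 Markov model (assumption): one (freqs, starts) pair per previous symbol."""
    counts = {p: [1] * len(alphabet) for p in alphabet}   # add-one: no zero probability
    for prev, cur in zip(data, data[1:]):
        counts[prev][alphabet.index(cur)] += 1
    tables = {}
    for p, row in counts.items():
        total = sum(row)
        freqs = [max(1, c * SCALE // total) for c in row]
        # Give the rounding remainder to the largest entry so the row sums to SCALE.
        freqs[freqs.index(max(freqs))] += SCALE - sum(freqs)
        starts = [sum(freqs[:i]) for i in range(len(freqs))]
        tables[p] = (freqs, starts)
    return tables


def lookup(starts, slot):
    """Scalar emulation of a wrapping 16-bit subtract followed by min-position."""
    diffs = [(slot - s) & 0xFFFF for s in starts]   # entries above slot wrap to large values
    offset = min(diffs)
    return diffs.index(offset), offset              # (symbol index, offset inside its range)


def encode(data, alphabet, tables):
    x, out = RANS_L, []
    for k in range(len(data) - 1, -1, -1):          # symbols are encoded in reverse order
        ctx = data[k - 1] if k > 0 else alphabet[0]  # assumed start context
        freqs, starts = tables[ctx]
        i = alphabet.index(data[k])
        f, c = freqs[i], starts[i]
        x_max = ((RANS_L >> SCALE_BITS) << 8) * f
        while x >= x_max:                            # renormalise: push low bytes
            out.append(x & 0xFF)
            x >>= 8
        x = ((x // f) << SCALE_BITS) + (x % f) + c
    return x, out


def decode(x, stream, n, alphabet, tables):
    stream = list(stream)                            # work on a copy
    out, ctx = [], alphabet[0]
    for _ in range(n):                               # symbols come back in order
        freqs, starts = tables[ctx]
        i, offset = lookup(starts, x & (SCALE - 1))
        x = freqs[i] * (x >> SCALE_BITS) + offset
        while x < RANS_L:                            # renormalise: pop bytes, last in first out
            x = (x << 8) | stream.pop()
        ctx = alphabet[i]
        out.append(ctx)
    return "".join(out)


ALPHABET = "ACGT"
data = "ACGTTGCAACGTACGTAAAACCCGT" * 4
tables = build_tables(data, ALPHABET)
state, stream = encode(data, ALPHABET, tables)
restored = decode(state, stream, len(data), ALPHABET, tables)
print(restored == data, len(stream), "bytes")
```

Reading the values legible in FIG. 1 as a cumulative table 0, 500, 580, 700, 750, 4096 queried with 770 (an interpretation of the drawing, not something the text states), `lookup` selects entry 4 with offset 20, because every entry larger than the query wraps around to a large unsigned value and is never the minimum.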