Evaluating Lossless Data Compression for Use on Mobile Devices
Literature Synthesis
Paul Brittan (BRTPAU008)

Abstract

With an increase in the usage of mobile devices, users have to continually store and transport data on their mobile devices. In order for this to happen efficiently, the data first needs to be compressed using a data compression application. Data compression is the process of converting an input data stream into a new data stream that has a smaller size. This paper describes four lossless data compression algorithms: Lempel-Ziv 77 (LZ77) and Lempel-Ziv-Welch (LZW), which are based on dictionary methods; Prediction with Partial Match (PPM), which is based on statistical methods; and the Burrows-Wheeler Transform (BWT), which was found to be inadequate since the algorithm does not compress the data, it only optimizes it for compression. After comparison of the algorithms using benchmark testing, it was found that LZ77 was the optimal algorithm because of its speed and low memory usage.

1. Introduction

The advancements in mobile technologies and mobile computing power have caused an increase in the popularity of mobile devices. Mobile devices are now being used for everyday tasks such as communication through emails or instant messages and daily scheduling with the help of calendars. With this increase, the usage of mobile devices has been broadening through industries such as Healthcare, Insurance and Field Services [1]. To keep up with all the data that needs to be stored on a mobile device or transferred quickly across a network, there needs to be a way to efficiently compress and decompress the data without losing any of it. Lossless data compression is a set of data compression algorithms that allows the original data to be reconstructed from the compressed data [2].

With lossless data compression algorithms, compression and decompression can be efficiently implemented on a mobile device, even with hardware limitations such as low processing power, static memory and battery life [1]. Lossless data compression has many advantages on a mobile device, such as reducing the network bandwidth required for data exchange, reducing the disk space required for storage and minimizing the main memory required [3]. This paper describes four commonly used lossless compression algorithms: Lempel-Ziv 77 (LZ77) and Lempel-Ziv-Welch (LZW), which use dictionary methods to reference upcoming data against the exact data that has already been encoded; Prediction with Partial Match (PPM), which is a statistical data compression method based on context modeling and prediction [3]; and the Burrows-Wheeler Transform (BWT), which on its own does not reduce the size of the data, it only makes the data easier to compress [4]. These algorithms are then compared using benchmark tests to find which one is optimal for implementing on mobile devices.
2. Lempel-Ziv 77

The Lempel-Ziv 77 (LZ77) algorithm is used as the foundation for many compression tools [5]. The algorithm is asymmetric because, in both time and memory, encoding is much more demanding than decoding. The LZ77 algorithm uses data structures like binary trees, suffix trees and hash tables, which provide fast searching without the need for high memory [6]. LZ77 compresses data by replacing sections of the data with a reference to matching data that has already passed through both the encoder and decoder [3]. No searching is needed when decompressing the data because the compressor has issued an explicit stream of literals, locations, and match lengths [7]. The process becomes even more efficient if the window is stored entirely in the cache, so that retrieving a match is fast no matter where it occurs in the window [7]. The LZ77 algorithm works by maintaining a current pointer into the input data, a search buffer and a look-ahead buffer. Symbols that are found before the current symbol make up the search buffer, whereas symbols that appear after the current symbol are placed in the look-ahead buffer. The buffers make up a window which shows the section of input currently being viewed; as the current pointer moves forward, the window moves through the input. While symbols are found in the look-ahead buffer, the algorithm looks in the search buffer for the longest match [7]. Instead of sending the matched symbols themselves, they are encoded with the offset from the current pointer, the size of the match and the symbol in the look-ahead buffer that follows the match. The encoder and decoder must both keep track of the last 2 KB or 4 KB of the most recent data: the encoder needs this data to look for matches, while the decoder needs it to resolve the matches the encoder refers to. LZ77 provides the option of increasing memory to improve performance; with a larger window there are improvements in the speed at which matches are found.
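To make the window mechanism concrete, the minimal Python sketch below encodes the input as (offset, length, next symbol) triples and decodes them again. The buffer sizes, function names and brute-force match search are illustrative choices only, standing in for the bit-packed formats and fast search structures used by real LZ77 tools.

# A minimal LZ77-style sketch (hypothetical helper names, illustrative only).
def lz77_encode(data: bytes, window: int = 4096, lookahead: int = 32):
    i, out = 0, []
    while i < len(data):
        start = max(0, i - window)            # search buffer begins here
        best_off, best_len = 0, 0
        for j in range(start, i):             # scan the search buffer for the longest match
            length = 0
            while (length < lookahead and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_byte = data[i + best_len]        # literal following the match
        out.append((best_off, best_len, next_byte))
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = bytearray()
    for offset, length, next_byte in triples:
        start = len(out) - offset
        for k in range(length):               # copying may overlap its own output
            out.append(out[start + k])
        out.append(next_byte)
    return bytes(out)

sample = b"abracadabra abracadabra"
assert lz77_decode(lz77_encode(sample)) == sample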

3. Lempel-Ziv-Welch

The Lempel-Ziv-Welch (LZW) algorithm was introduced for cases in which a match cannot be found using LZ77. Instead of the sliding buffers, LZW uses a separate dictionary which is used as a codebook [8]. The compressor builds this dictionary from the input data: when a group of symbols is found, the dictionary is checked, and the longest prefix that matches the input together with the unmatched symbol which follows is added to the dictionary (see Table 1 for an example of LZW compression). The decompressor then builds its own dictionary so that the indices it receives refer to the same symbols as those in the compressor's dictionary.

Input Stream: AAAABAAABCC

Encoded String | Output Stream              | New Dictionary Entry
A              | 65                         | 256 - AA
AA             | 65 256                     | 257 - AAA
A              | 65 256 65                  | 258 - AB
B              | 65 256 65 66               | 259 - BA
AAA            | 65 256 65 66 257           | 260 - AAAB
B              | 65 256 65 66 257 66        | 261 - BC
C              | 65 256 65 66 257 66 67     | 262 - CC
C              | 65 256 65 66 257 66 67 67  |

Table 1. Example of LZW Compression [5].

This algorithm provides a quick build-up of long patterns that can be stored, but there are multiple downfalls. Until the dictionary is filled with large, commonly seen patterns, the resulting output will be bigger than the original input. Since the dictionary can grow without bound, LZW must be implemented so that it deletes the existing dictionary when it gets too big, or finds another way to limit memory usage [7]. This algorithm has no communication overhead and is computationally simple. Since both the compressor and the decompressor have the initial dictionary, and all new entries into the dictionary are created from entries that already exist, the decompressor can recreate the dictionary quickly as data is received. To decode a dictionary entry the decoder must have received all previous entries in the block [5].
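The sketch below, again illustrative rather than any particular tool's implementation, reproduces the behaviour shown in Table 1: byte values 0-255 seed the dictionary, so 'A' is emitted as 65 and new phrases receive codes from 256 upwards. It also shows how the decompressor rebuilds the same dictionary from the codes alone.

# A minimal LZW sketch (hypothetical function names).
def lzw_encode(data: bytes):
    dictionary = {bytes([b]): b for b in range(256)}
    next_code, phrase, out = 256, b"", []
    for value in data:
        candidate = phrase + bytes([value])
        if candidate in dictionary:
            phrase = candidate                 # keep extending the current match
        else:
            out.append(dictionary[phrase])     # emit the code for the longest match
            dictionary[candidate] = next_code  # longest match + the unmatched symbol
            next_code += 1                     # real implementations bound this growth
            phrase = bytes([value])
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes):
    dictionary = {b: bytes([b]) for b in range(256)}
    next_code, prev = 256, bytes()
    out = bytearray()
    for code in codes:
        if code in dictionary:
            entry = dictionary[code]
        else:                                  # a code the encoder has only just created
            entry = prev + prev[:1]
        out += entry
        if prev:
            dictionary[next_code] = prev + entry[:1]
            next_code += 1
        prev = entry
    return bytes(out)

print(lzw_encode(b"AAAABAAABCC"))              # [65, 256, 65, 66, 257, 66, 67, 67]
assert lzw_decode(lzw_encode(b"AAAABAAABCC")) == b"AAAABAAABCC"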
4. Prediction with Partial Match

Prediction with Partial Match (PPM) is an adaptive statistical data compression algorithm that uses context modelling and prediction [9]. It uses a fixed-context statistical modelling algorithm, which predicts the next character in the input data. The prediction probabilities for each preceding character in the model are calculated from frequency counts, which are updated regularly. The symbol that occurs is encoded in relation to its predicted probability, using arithmetic coding. Although PPM is simple, it is also computationally expensive [3]. An arithmetic encoder can use the probabilities to code the input efficiently. Longer contexts will improve the probability estimates, but will take more time to calculate. To deal with this, escape symbols are created to slowly reduce context lengths. This creates a downfall where encoding a long string of escape symbols can use up more space than would have been saved by the use of large contexts. Storing and searching through each context is the reason for the large memory usage of the PPM algorithm [7]. With the PPM algorithm, a table is built for each order, from 0 up to the highest order of the model. After parsing the input into substrings, the context for each substring is the characters that come before it. The tables keep count of how often each substring has been found for a given context [10]. When the PPM algorithm is used, it searches the highest-order table for the given context. If the context is found, the next character with the highest frequency count is returned as the prediction. If there are no matches to any entries in the table, the context is reduced by one character and the next lowest-order table is searched. This process is repeated until the context is matched or the zeroth-order table is reached. The zeroth-order table simply returns the most common character seen in the training string [10].
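A simplified sketch of these per-order frequency tables and the fallback search is given below. The helper names are hypothetical and only the prediction step is shown; a full PPM compressor couples such counts with escape symbols and an arithmetic coder.

# Per-order context tables with fallback to shorter contexts (sketch only).
from collections import defaultdict

def build_tables(training: str, max_order: int = 2):
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for order in range(max_order + 1):
        for i in range(order, len(training)):
            context = training[i - order:i]     # the `order` characters before position i
            tables[order][context][training[i]] += 1
    return tables

def predict(tables, history: str):
    max_order = len(tables) - 1
    for order in range(max_order, -1, -1):      # highest order first, then fall back
        if order > len(history):
            continue
        context = history[-order:] if order else ""
        if context in tables[order]:
            counts = tables[order][context]
            return max(counts, key=counts.get)  # most frequent follower of this context
    return None

tables = build_tables("abracadabra", max_order=2)
print(predict(tables, "ab"))                    # 'r': "ab" is always followed by 'r'
print(predict(tables, "zz"))                    # unseen context, falls back to order 0: 'a'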

5. Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) is a reversible algorithm that is used in the bzip2 compression algorithm [4]. On its own, BWT does not reduce the size of the data; it only formats the data so that it becomes easier to compress with other algorithms. When a string of characters is transformed using BWT, the number of characters remains the same; the algorithm only changes the order in which the characters appear. If the input string has multiple substrings that appear with high frequency, then the transformed string will have multiple locations in which a character recurs several times in a row. This helps with compression, since most compression algorithms are more effective when the input contains runs of repeated characters. After the BWT is completed, the data is compressed by running the transformed input through a Move-to-Front encoder and then a run-length encoder. BWT takes advantage of symbols which are located further on in the string, not just those that have already passed. The biggest problem is that BWT requires the allocation of RAM for the entire input and output streams, and a large buffer is needed to perform the required sorts [5]. Even though BWT-based compression could be performed with very little memory, common set-ups use fast sorting algorithms and data structures that need large amounts of memory to provide speed [7]. Regardless of the memory issues, algorithms that implement BWT compress files at a high compression ratio.
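The naive sketch below illustrates the transform and the Move-to-Front stage. Sorting every rotation in memory, as done here, is exactly the kind of set-up behind the RAM concerns noted above; production tools such as bzip2 use far more space-efficient suffix sorting. The function names are illustrative only.

# BWT plus Move-to-Front, written for clarity rather than efficiency.
def bwt(text: str, end: str = "\0"):
    s = text + end                              # unique end marker keeps the transform reversible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column: str, end: str = "\0"):
    table = [""] * len(last_column)
    for _ in range(len(last_column)):           # re-grow the sorted rotation table
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

def move_to_front(data: str):
    alphabet = sorted(set(data))                # simplification: alphabet taken from the data
    out = []
    for ch in data:
        index = alphabet.index(ch)
        out.append(index)                       # recently seen symbols get small indices
        alphabet.insert(0, alphabet.pop(index))
    return out

transformed = bwt("bananabananabanana")
assert inverse_bwt(transformed) == "bananabananabanana"
print(move_to_front(transformed))               # repeated characters become runs of zeros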

6. Evaluation & Chosen Implementation

Four comparison tests were run on tools that implement the algorithms in this paper. LZO and Zlib were used to test Lempel-Ziv 77 (LZ77), compress was selected for Lempel-Ziv-Welch (LZW), PPMd (as used in WinRAR) was used to test Prediction with Partial Match (PPM), and bzip2 was chosen for the Burrows-Wheeler Transform (BWT). When benchmark comparisons using traditional metrics are run on the above tools, the following graphs are produced (Figures 1 to 4).

Figure 1 – Results from Compression Ratio Tests [7].
Figure 2 – Results from Memory Tests [7].
Figure 3 – Results from Completion Time (Text) Tests [7].
Figure 4 – Results from Completion Time (Web) Tests [7].

Analysing the graphs produced by the benchmark tests shows that the algorithms that give the best compression ratios are PPM followed by BWT. These ratios, however, come at a great cost in both time and memory, resources that are not in abundance on mobile devices. The fastest of the four algorithms on both text and web data is LZO, which uses the LZ77 algorithm. Even though LZO is quick and uses the least static memory, it does provide the worst compression ratio when compressing and decompressing text.

From these results, the LZ77 algorithm is the chosen algorithm to be implemented. Based on a design similar to LZO, a lossless data compression/decompression application for mobile devices can be created that is quick and requires low processing power and memory. These three attributes will also help preserve the battery life of the mobile device by not requiring a large amount of time and processing to complete. Although LZO has a weak text compression ratio, it does provide good web compression, which will help if the mobile device user needs to download and update a lot of data across the network.
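A rough version of this kind of comparison can be reproduced with Python's standard-library codecs, for example zlib (DEFLATE, an LZ77 derivative) against bz2 (BWT based). This is not the LZO/compress/PPMd/bzip2 tool set used for the figures, only a sketch of how the ratio-versus-time trade-off can be measured on one's own data; "sample.txt" stands for any test file.

# Compare compression ratio and time for two standard-library codecs.
import bz2
import time
import zlib

def benchmark(name, compress, data):
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(packed) / len(data)             # smaller is better
    print(f"{name:4s}  ratio={ratio:.3f}  time={elapsed * 1000:.1f} ms")

with open("sample.txt", "rb") as f:             # hypothetical input file
    data = f.read()

benchmark("zlib", zlib.compress, data)          # LZ77 family: fast, moderate ratio
benchmark("bz2", bz2.compress, data)            # BWT based: slower, usually better ratio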

7. Conclusion & Future Research

From the algorithms discussed in this paper, the optimal lossless compression algorithm to be implemented on a mobile device is Lempel-Ziv 77. This algorithm provides a sufficient compression ratio while being quick and using little static memory, which is ideal for mobile hardware specifications. It has been noted that a compression algorithm may be implemented using different data structures and produce completely different performance results [7]. For that reason, further research to extend this paper could test different data structures to try to find a way to achieve better compression ratios without increasing the time or memory needed.

References

1 Nori, Anil. Mobile and Embedded Databases. SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (2007).

2 Midhun, M. Data Compression Techniques. Sree Narayana Gurukulam College of Engineering, Kolenchery (2006).

3 Sakr, Sherif. XML compression techniques: A survey and comparison. Journal of Computer and System Sciences, 75 (2009).

4 Lauther, Ulrich and Lukovszki, Tamas. Space Efficient Algorithms for the Burrows-Wheeler Transformation. Algorithmica, 58 (2010).

5 Sadler, Christopher M. and Martonosi, Margaret. Data compression algorithms for energy-constrained devices in delay tolerant networks. SenSys '06: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems (2006).

6 Ferreira, Artur, Oliveira, Arlindo and Figueiredo, Mario. Time and Memory Efficient Lempel-Ziv Compression Using Suffix Arrays. arXiv:0912.5449 (2009).

7 Barr, Kenneth C. and Asanovic, Krste. Energy-Aware Lossless Data Compression. ACM Transactions on Computer Systems, Vol. 24, No. 3 (August 2006).

8 Wang, Le and Manner, J. Evaluation of data compression for energy-aware communication in mobile networks. CyberC '09: International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (2009).

9 Carus, A. and Mesut, A. Fast Text Compression Using Multiple Static Dictionaries. Information Technology Journal, 9 (2010).

10 Burbey, Ingrid and Martin, Thomas L. Predicting future locations using prediction-by-partial-match. MELT '08: Proceedings of the First ACM International Workshop on Mobile Entity Localization and Tracking in GPS-less Environments (2008).
