Evaluating Lossless Data Compression for Use on Mobile Devices
Literature Synthesis
Paul Brittan (BRTPAU008)

Abstract

With an increase in the usage of mobile devices, users have to continually store and transport data on their mobile devices. In order for this to happen efficiently, the data first needs to be compressed using a data compression application. Data compression is the process of converting an input data stream into a new data stream that has a smaller size. This paper describes four lossless data compression algorithms: Lempel-Ziv 77 (LZ77) and Lempel-Ziv-Welch (LZW), which are based on dictionary methods; Prediction with Partial Match (PPM), which is based on statistical methods; and the Burrows-Wheeler Transform (BWT), which was found to be inadequate since the algorithm does not compress the data, it only optimizes it for compression. After comparison of the algorithms using benchmark testing, it was found that LZ77 was the optimal algorithm because of its speed and low memory usage.

1. Introduction

The advancements in mobile technologies and mobile computing power have caused an increase in the popularity of mobile devices. Mobile devices are now being used for everyday tasks such as communication through emails or instant messages and daily scheduling with the help of calendars. With this increase, the usage of mobile devices has been broadening through industries such as Healthcare, Insurance and Field Services [1]. To keep up with all the data that needs to be stored on a mobile device or transferred quickly across a network, there needs to be a way to efficiently compress and decompress the data without losing any of it. Lossless data compression is a set of data compression algorithms that allows the original data to be reconstructed from the compressed data [2].

With lossless data compression algorithms, compression and decompression can be efficiently implemented on a mobile device, even with hardware limitations such as low processing power, static memory and battery life [1]. Lossless data compression has many advantages on a mobile device, such as reducing the network bandwidth required for data exchange, reducing the disk space required for storage and minimizing the main memory required [3]. This paper describes four commonly used lossless compression algorithms: Lempel-Ziv 77 (LZ77) and Lempel-Ziv-Welch (LZW), which use dictionary methods to reference upcoming data against the exact data that has already been encoded; Prediction with Partial Match (PPM), which is a statistical data compression method based on context modeling and prediction [3]; and the Burrows-Wheeler Transform (BWT), which on its own does not reduce the size of the data, it only makes the data easier to compress [4]. These algorithms are then compared using benchmark tests to find which one is optimal for implementing on mobile devices.
2. Lempel-Ziv 77

The Lempel-Ziv 77 (LZ77) algorithm is used as the foundation for many compression tools [5]. The algorithm is asymmetric because, in both time and memory, encoding is much more demanding than decoding. The LZ77 algorithm uses data structures like binary trees, suffix trees and hash tables, which provide fast searching without the need for high memory [6]. LZ77 compresses data by replacing sections of the data with a reference to matching data that has already passed through both the encoder and decoder [3]. No searching is needed when decompressing the data because the compressor has issued an explicit stream of literals, locations, and match lengths [7]. The process becomes even more efficient if the window is stored entirely in the cache, so that retrieving a match is fast no matter where it occurs in the window [7]. The LZ77 algorithm works by maintaining a current pointer into the input data, a search buffer and a look-ahead buffer. Symbols that are found before the current symbol make up the search buffer, whereas symbols that appear after the current symbol are placed in the look-ahead buffer. The buffers make up a window which shows the section of input currently being viewed; as the current pointer moves forward, the window moves through the input. While symbols are found in the look-ahead buffer, the algorithm looks in the search buffer for the longest match [7]. Instead of sending the matched symbols themselves, they are encoded with the offset from the current pointer, the size of the match and the symbol in the look-ahead buffer that follows the match. The encoder and decoder must both keep track of the last 2 KB or 4 KB of the most recent data: the encoder needs this data to look for matches, while the decoder needs it to resolve the matches the encoder refers to. LZ77 provides the option of increasing memory to improve performance; with a larger window there are improvements in the speed at which matches are found.
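To make the window mechanism concrete, the minimal Python sketch below encodes the input as (offset, length, next symbol) triples and decodes them again. The buffer sizes, function names and brute-force match search are illustrative choices only, standing in for the bit-packed formats and fast search structures used by real LZ77 tools.

# A minimal LZ77-style sketch (hypothetical helper names, illustrative only).
def lz77_encode(data: bytes, window: int = 4096, lookahead: int = 32):
    i, out = 0, []
    while i < len(data):
        start = max(0, i - window)            # search buffer begins here
        best_off, best_len = 0, 0
        for j in range(start, i):             # scan the search buffer for the longest match
            length = 0
            while (length < lookahead and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_byte = data[i + best_len]        # literal following the match
        out.append((best_off, best_len, next_byte))
        i += best_len + 1
    return out

def lz77_decode(triples):
    out = bytearray()
    for offset, length, next_byte in triples:
        start = len(out) - offset
        for k in range(length):               # copying may overlap its own output
            out.append(out[start + k])
        out.append(next_byte)
    return bytes(out)

sample = b"abracadabra abracadabra"
assert lz77_decode(lz77_encode(sample)) == sample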

3. Lempel-Ziv-Welch

The Lempel-Ziv-Welch (LZW) algorithm was introduced for cases in which a match cannot be found using LZ77. Instead of the sliding buffers, LZW uses a separate dictionary which is used as a codebook [8]. The compressor builds this dictionary from the input data: when a group of symbols is found, the dictionary is checked, and the longest prefix that matches the input together with the unmatched symbol which follows is added to the dictionary (see Table 1 for an example of LZW compression). The decompressor then builds its own dictionary so that the indices it receives refer to the same symbols as those in the compressor's dictionary.

Input Stream: AAAABAAABCC

Encoded String | Output Stream              | New Dictionary Entry
A              | 65                         | 256 - AA
AA             | 65 256                     | 257 - AAA
A              | 65 256 65                  | 258 - AB
B              | 65 256 65 66               | 259 - BA
AAA            | 65 256 65 66 257           | 260 - AAAB
B              | 65 256 65 66 257 66        | 261 - BC
C              | 65 256 65 66 257 66 67     | 262 - CC
C              | 65 256 65 66 257 66 67 67  |

Table 1. Example of LZW Compression [5].

This algorithm provides a quick build-up of long patterns that can be stored, but there are multiple downfalls. Until the dictionary is filled with large, commonly seen patterns, the resulting output will be bigger than the original input. Since the dictionary can grow without bound, LZW must be implemented so that it deletes the existing dictionary when it gets too big, or finds another way to limit memory usage [7]. This algorithm has no communication overhead and is computationally simple. Since both the compressor and the decompressor have the initial dictionary, and all new entries into the dictionary are created from entries that already exist, the decompressor can recreate the dictionary quickly as data is received. To decode a dictionary entry the decoder must have received all previous entries in the block [5].
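The sketch below, again illustrative rather than any particular tool's implementation, reproduces the behaviour shown in Table 1: byte values 0-255 seed the dictionary, so 'A' is emitted as 65 and new phrases receive codes from 256 upwards. It also shows how the decompressor rebuilds the same dictionary from the codes alone.

# A minimal LZW sketch (hypothetical function names).
def lzw_encode(data: bytes):
    dictionary = {bytes([b]): b for b in range(256)}
    next_code, phrase, out = 256, b"", []
    for value in data:
        candidate = phrase + bytes([value])
        if candidate in dictionary:
            phrase = candidate                 # keep extending the current match
        else:
            out.append(dictionary[phrase])     # emit the code for the longest match
            dictionary[candidate] = next_code  # longest match + the unmatched symbol
            next_code += 1                     # real implementations bound this growth
            phrase = bytes([value])
    if phrase:
        out.append(dictionary[phrase])
    return out

def lzw_decode(codes):
    dictionary = {b: bytes([b]) for b in range(256)}
    next_code, prev = 256, bytes()
    out = bytearray()
    for code in codes:
        if code in dictionary:
            entry = dictionary[code]
        else:                                  # a code the encoder has only just created
            entry = prev + prev[:1]
        out += entry
        if prev:
            dictionary[next_code] = prev + entry[:1]
            next_code += 1
        prev = entry
    return bytes(out)

print(lzw_encode(b"AAAABAAABCC"))              # [65, 256, 65, 66, 257, 66, 67, 67]
assert lzw_decode(lzw_encode(b"AAAABAAABCC")) == b"AAAABAAABCC"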
4. Prediction with Partial Match

Prediction with Partial Match (PPM) is an adaptive statistical data compression algorithm that uses context modelling and prediction [9]. It uses a fixed-context statistical modelling algorithm, which predicts the next character in the input data. The prediction probabilities for each preceding character in the model are calculated from frequency counts, which are updated regularly. The symbol that occurs is encoded in relation to its predicted probability, using arithmetic coding. Although PPM is simple, it is also computationally expensive [3]. An arithmetic encoder can use the probabilities to code the input efficiently. Longer contexts will improve the probability estimates, but will take more time to calculate. To deal with this, escape symbols are created to slowly reduce context lengths. This creates a downfall where encoding a long string of escape symbols can use up more space than would have been saved by the use of large contexts. Storing and searching through each context is the reason for the large memory usage of the PPM algorithm [7]. With the PPM algorithm, a table is built for each order, from 0 up to the highest order of the model. After parsing the input into substrings, the context for each substring is the characters that come before it. The tables keep count of how often each substring has been found for a given context [10]. When the PPM algorithm is used, it searches the highest-order table for the given context. If the context is found, the next character with the highest frequency count is returned as the prediction. If there are no matches to any entries in the table, the context is reduced by one character and the next lowest-order table is searched. This process is repeated until the context is matched or the zeroth-order table is reached. The zeroth-order table simply returns the most common character seen in the training string [10].
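A simplified sketch of these per-order frequency tables and the fallback search is given below. The helper names are hypothetical and only the prediction step is shown; a full PPM compressor couples such counts with escape symbols and an arithmetic coder.

# Per-order context tables with fallback to shorter contexts (sketch only).
from collections import defaultdict

def build_tables(training: str, max_order: int = 2):
    tables = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for order in range(max_order + 1):
        for i in range(order, len(training)):
            context = training[i - order:i]     # the `order` characters before position i
            tables[order][context][training[i]] += 1
    return tables

def predict(tables, history: str):
    max_order = len(tables) - 1
    for order in range(max_order, -1, -1):      # highest order first, then fall back
        if order > len(history):
            continue
        context = history[-order:] if order else ""
        if context in tables[order]:
            counts = tables[order][context]
            return max(counts, key=counts.get)  # most frequent follower of this context
    return None

tables = build_tables("abracadabra", max_order=2)
print(predict(tables, "ab"))                    # 'r': "ab" is always followed by 'r'
print(predict(tables, "zz"))                    # unseen context, falls back to order 0: 'a'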

5. Burrows-Wheeler Transform

The Burrows-Wheeler Transform (BWT) is a reversible algorithm that is used in the bzip2 compression algorithm [4]. On its own, BWT does not reduce the size of the data; it only formats the data so that it becomes easier to compress with other algorithms. When a string of characters is transformed using BWT, the number of characters remains the same; the algorithm only changes the order in which the characters appear. If the input string has multiple substrings that appear with high frequency, then the transformed string will have multiple locations in which a character recurs several times in a row. This helps with compression, since most compression algorithms are more effective when the input contains runs of repeated characters. After the BWT is completed, the data is compressed by running the transformed input through a Move-to-Front encoder and then a run-length encoder. BWT takes advantage of symbols which are located further on in the string, not just those that have already passed. The biggest problem is that BWT requires the allocation of RAM for the entire input and output streams, and a large buffer is needed to perform the required sorts [5]. Even though BWT-based compression could be performed with very little memory, common set-ups use fast sorting algorithms and data structures that need large amounts of memory to provide speed [7]. Regardless of the memory issues, algorithms that implement BWT compress files at a high compression ratio.
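The naive sketch below illustrates the transform and the Move-to-Front stage. Sorting every rotation in memory, as done here, is exactly the kind of set-up behind the RAM concerns noted above; production tools such as bzip2 use far more space-efficient suffix sorting. The function names are illustrative only.

# BWT plus Move-to-Front, written for clarity rather than efficiency.
def bwt(text: str, end: str = "\0"):
    s = text + end                              # unique end marker keeps the transform reversible
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column: str, end: str = "\0"):
    table = [""] * len(last_column)
    for _ in range(len(last_column)):           # re-grow the sorted rotation table
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]

def move_to_front(data: str):
    alphabet = sorted(set(data))                # simplification: alphabet taken from the data
    out = []
    for ch in data:
        index = alphabet.index(ch)
        out.append(index)                       # recently seen symbols get small indices
        alphabet.insert(0, alphabet.pop(index))
    return out

transformed = bwt("bananabananabanana")
assert inverse_bwt(transformed) == "bananabananabanana"
print(move_to_front(transformed))               # repeated characters become runs of zeros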

6. Evaluation & Chosen Implementation

Four comparison tests were run on tools that implement the algorithms in this paper. LZO and Zlib were used to test Lempel-Ziv 77 (LZ77), compress was selected for Lempel-Ziv-Welch (LZW), PPMd (as used in WinRAR) was used to test Prediction with Partial Match (PPM), and bzip2 was chosen for the Burrows-Wheeler Transform (BWT). When benchmark comparisons using traditional metrics are run on the above tools, the following graphs are produced (Figures 1 to 4).

Figure 1 – Results from Compression Ratio Tests [7].
Figure 2 – Results from Memory Tests [7].
Figure 3 – Results from Completion Time (Text) Tests [7].
Figure 4 – Results from Completion Time (Web) Tests [7].

Analysing the graphs produced by the benchmark tests shows that the algorithms that give the best compression ratios are PPM followed by BWT. These ratios, however, come at a great cost in both time and memory, resources that are not in abundance on mobile devices. The fastest of the four algorithms on both text and web data is LZO, which uses the LZ77 algorithm. Even though LZO is quick and uses the least static memory, it does provide the worst compression ratio when compressing and decompressing text.

From these results, the LZ77 algorithm is the chosen algorithm to be implemented. Based on a design similar to LZO, a lossless data compression/decompression application for mobile devices can be created that is quick and requires low processing power and memory. These three attributes will also help preserve the battery life of the mobile device by not requiring a large amount of time and processing to complete. Although LZO has a weak text compression ratio, it does provide good web compression, which will help if the mobile device user needs to download and update a lot of data across the network.
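A rough version of this kind of comparison can be reproduced with Python's standard-library codecs, for example zlib (DEFLATE, an LZ77 derivative) against bz2 (BWT based). This is not the LZO/compress/PPMd/bzip2 tool set used for the figures, only a sketch of how the ratio-versus-time trade-off can be measured on one's own data; "sample.txt" stands for any test file.

# Compare compression ratio and time for two standard-library codecs.
import bz2
import time
import zlib

def benchmark(name, compress, data):
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(packed) / len(data)             # smaller is better
    print(f"{name:4s}  ratio={ratio:.3f}  time={elapsed * 1000:.1f} ms")

with open("sample.txt", "rb") as f:             # hypothetical input file
    data = f.read()

benchmark("zlib", zlib.compress, data)          # LZ77 family: fast, moderate ratio
benchmark("bz2", bz2.compress, data)            # BWT based: slower, usually better ratio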

7. Conclusion & Future Research

From the algorithms discussed in this paper, the optimal lossless compression algorithm to be implemented on a mobile device is Lempel-Ziv 77. This algorithm provides a sufficient compression ratio while being quick and using little static memory, which is ideal for mobile hardware specifications. It has been noted that a compression algorithm may be implemented using different data structures and produce completely different performance results [7]. For that reason, further research to extend this paper could test different data structures to try to find a way to achieve better compression ratios without increasing the time or memory needed.

References

1 Nori, Anil. Mobile and Embedded Databases. SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (2007).

2 Midhun, M. Data Compression Techniques. Sree Narayana Gurukulam College of Engineering, Kolenchery (2006).

3 Sakr, Sherif. XML compression techniques: A survey and comparison. Journal of Computer and System Sciences, 75 (2009).

4 Lauther, Ulrich and Lukovszki, Tamas. Space Efficient Algorithms for the Burrows-Wheeler Transformation. Algorithmica, 58 (2010).

5 Sadler, Christopher M. and Martonosi, Margaret. Data compression algorithms for energy-constrained devices in delay tolerant networks. SenSys '06: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems (2006).

6 Ferreira, Artur, Oliveira, Arlindo and Figueiredo, Mario. Time and Memory Efficient Lempel-Ziv Compression Using Suffix Arrays. arXiv:0912.5449 (2009).

7 Barr, Kenneth C. and Asanovic, Krste. Energy-Aware Lossless Data Compression. ACM Transactions on Computer Systems, Vol. 24, No. 3 (August 2006).

8 Wang, Le and Manner, J. Evaluation of data compression for energy-aware communication in mobile networks. CyberC '09: International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (2009).

9 Carus, A. and Mesut, A. Fast Text Compression Using Multiple Static Dictionaries. Information Technology Journal, 9 (2010).

10 Burbey, Ingrid and Martin, Thomas L. Predicting future locations using prediction-by-partial-match. MELT '08: Proceedings of the First ACM International Workshop on Mobile Entity Localization and Tracking in GPS-less Environments (2008).
