Research Article

n-Gram-Based Text Compression

Vu H. Nguyen,1 Hien T. Nguyen,1 Hieu N. Duong,2 and Vaclav Snasel3

1Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam
2Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology, Ho Chi Minh City, Vietnam
3Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, Czech Republic

Correspondence should be addressed to Hien T. Nguyen; [email protected]

Received 21 May 2016; Revised 2 August 2016; Accepted 25 September 2016

Academic Editor: Geun S. Jo

Computational Intelligence and Neuroscience, Volume 2016, Article ID 9483646, 11 pages, http://dx.doi.org/10.1155/2016/9483646

Copyright © 2016 Vu H. Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. Its compression ratio compares favorably with those of state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigram to five-gram to obtain the best encoding stream. Each n-gram is encoded by two to four bytes based on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigram to five-gram, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.

1. Introduction

According to [1], data compression is a process of converting an input data stream into another data stream that has a smaller size. A stream can be a file, a buffer in memory, or individual bits sent on a communications channel. The main objectives of data compression are to reduce the size of the input stream and to increase the transfer rate as well as save storage space. Typically, data compression techniques are classified into two classes, that is, lossless and lossy, based on the result of the decompression phase.

Text compression is a field of data compression that uses lossless compression techniques to convert an input file to another form of data file. It cannot use lossy compression techniques because it needs to recover the exact original file from the compressed file; if a lossy technique were used, the meaning of the decompressed file would differ from the original. Several techniques have been proposed for text compression in recent years. Most of them are based on the same principle of removing or reducing redundancies in the original input text file. The redundancy can appear at the character, syllable, or word level. This principle leads to a compression mechanism that assigns short codes to common parts, that is, characters, syllables, words, or sentences, and long codes to rare parts.

In recent years, several techniques have been developed for text compression. These techniques can be classified into four major types, that is, substitution, statistical, dictionary, and context-based methods. Substitution techniques replace long repetitions of characters with shorter ones; a representative of this type is run-length encoding [2]. Statistical techniques usually calculate the probabilities of characters to generate the shortest average code length, as in Shannon-Fano coding [3, 4], Huffman coding [5], and arithmetic coding [6, 7]. The next type consists of dictionary techniques, which substitute a substring of text with an index or a pointer code referring to the position of that substring in a dictionary; representatives of this type are LZW [8], LZ77 [9], and LZ78 [10]. The last type is context-based techniques, which make minimal prior assumptions about the statistics of the text; normally, they use the context of the text being encoded and the history of the text to provide more efficient compression. Representatives of this type are Prediction by Partial Matching (PPM) [11] and the Burrows–Wheeler transform (BWT) [12].
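To make the substitution type concrete, the following minimal run-length encoding sketch (our illustration, not code from any of the cited works) collapses each run of repeated characters into the character followed by its run length; it assumes the input contains no digit characters, so that counts can be parsed back unambiguously.

import re

def rle_encode(text: str) -> str:
    # Collapse each run of one repeated character into <char><count>.
    runs = re.finditer(r"(.)\1*", text, flags=re.DOTALL)
    return "".join(f"{m.group(0)[0]}{len(m.group(0))}" for m in runs)

def rle_decode(encoded: str) -> str:
    # Expand each <char><count> token back into its run.
    return "".join(ch * int(n) for ch, n in re.findall(r"(\D)(\d+)", encoded))

assert rle_encode("aaabbbbcc") == "a3b4c2"
assert rle_decode("a3b4c2") == "aaabbbbcc"

As the example suggests, run-length encoding pays off only when the text contains long character runs, which is why it serves as a baseline rather than a general-purpose text compressor.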
Every method has its own advantages and disadvantages when applied in a specific field, and none of the above methods has been able to achieve the best results in terms of compression ratio.

Normally, users decide on the appropriate method based on their purposes. For systems that allow the reconstructed output to differ from the input, lossy methods can be used, for example, systems that compress images and voice messages. For systems that require the original data to be recovered exactly from the compressed data, lossless methods must be used, as in text compression systems.

This paper presents the first attempt at text compression using n-gram dictionaries, and the contribution has three attributes: (1) it introduces a method for text compression using n-gram dictionaries; (2) it collects a text corpus of the Vietnamese language from the Internet and builds five n-gram dictionaries with nearly 500,000,000 n-grams; and (3) it provides a test set of 10 text files of different sizes to evaluate the new system and to compare it with our two previous methods [13, 14] as well as with other methods. The rest of this paper is organized as follows. Section 2 presents earlier work related to this effort. Section 3 presents our proposed method, and Section 4 presents our experiments and results. Our conclusions are presented in Section 5.

2. Related Work

In recent years, most text compression techniques have been based on the dictionary, word, or character level [15–18]. Reference [15] proposed a method to convert the characters in the source file to a binary code, where the most common characters in the file have the shortest binary codes and the least common have the longest. The binary codes are generated based on the estimated probability of each character within the file and are compressed using an 8-bit character word length. In [16], the authors proposed a method that combines words with LZW: their method first splits the input text into words and nonwords and then uses them as the initial alphabet of LZW. Reference [17] proposed a technique to compress short text messages in two phases. In the first phase, it converts the input text, consisting of the letters, numbers, spaces, and punctuation marks commonly used in English writing, into a format that can be compressed in the second phase; the second phase applies a transformation that reduces the size of the message by a fixed fraction of its original size. In [18], the authors proposed a word-based compression variant of the LZ77 algorithm and designed and implemented various sliding windows and various possibilities of output encoding; in a comparison with other word-based methods, their method performed best. These studies, however, do not consider the structure of words or morphemes in the compression process.

In [19], the authors proposed a lossless text compression technique that utilizes the syllable-based morphology of multisyllabic languages. The method is designed to partition words into syllables and then produce shorter bit representations of them for compression; the number of bits used to code a syllable depends on the number of entries in the dictionary file. In [20], the authors proposed a genetic algorithm for syllable-based text compression. The algorithm is used to determine the characteristics of syllables, and these characteristics are stored in a dictionary that is part of the compression algorithm, so they need not be placed in the compressed data, which reduces the space used. In [21, 22], Lansky and his colleagues were the first to propose syllable-based text compression techniques; their papers focus on the specification of syllables, methods for decomposing words into syllables, and the use of syllable-based compression in combination with the principles of LZW and Huffman coding. In [23], the authors first proposed a method for small text file compression based on the Burrows–Wheeler transformation; this method combines the Burrows–Wheeler transform with Boolean minimization.
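To illustrate the word-based dictionary family, the sketch below follows the idea described for [16]: the input is split into word and nonword tokens, which then serve as the initial alphabet of LZW. The tokenizer and code layout are simplifying assumptions of ours, not the implementation of [16].

import re
from typing import List, Tuple

def word_lzw_encode(text: str) -> Tuple[List[int], List[str]]:
    # Split into alternating word/nonword tokens, e.g. "hi, there" ->
    # ["hi", ", ", "there"]; the distinct tokens form the initial alphabet.
    tokens = re.findall(r"\w+|\W+", text)
    alphabet = list(dict.fromkeys(tokens))
    dictionary = {(t,): i for i, t in enumerate(alphabet)}
    codes: List[int] = []
    phrase: Tuple[str, ...] = ()
    for tok in tokens:
        candidate = phrase + (tok,)
        if candidate in dictionary:
            phrase = candidate                       # keep growing the phrase
        else:
            codes.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)  # learn the new phrase
            phrase = (tok,)
    if phrase:
        codes.append(dictionary[phrase])
    return codes, alphabet  # a decoder rebuilds the dictionary from these

codes, alphabet = word_lzw_encode("the cat and the cat and the dog")
print(codes)  # repeated multitoken phrases collapse to single codes

Working at the token level lets the dictionary grow phrases of whole words rather than fragments of characters, which is the property word-based variants such as [16, 18] exploit.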
In our previous work on Vietnamese text compression, we proposed a syllable-based method built on morphology and syllable dictionaries [13]. Each morphosyllable is split into a consonant and a syllable, and the two parts are compressed based on their corresponding dictionaries. This method has a compression ratio that converges to around 73% and is suitable for small text files. The method in [14] compresses text based on a trigram model: it splits a text file into trigrams and compresses these trigrams using a trigram dictionary, achieving an encouraging compression ratio of around 83%.

3. Proposed Method

In this section, we present a method for Vietnamese text compression using n-gram dictionaries. The model has two main modules: the first module performs text compression, and the second performs decompression. Figure 1 describes our text compression model. In our model, we use n-gram dictionaries for both compression and decompression. We describe the model in detail in the following subsections.

3.1. n-Gram Theory and Dictionaries

3.1.1. n-Gram Theory. In this paper, we adopt the notion of n-gram from [24]: an n-gram is a contiguous sequence of n items from a given sequence of text or speech. An item can be a phoneme, a syllable, a letter, a word, or a morphosyllable; in general, an item is considered an atomic unit. An n-gram of one item, two items, or three items is referred to as a "unigram," a "bigram," or a "trigram," respectively.
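A short sketch of this definition (our illustration; the items here are whitespace-separated words, whereas the items in this paper are Vietnamese morphosyllables):

from typing import List, Sequence, Tuple

def ngrams(items: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    # Every contiguous window of n items over the sequence.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

items = "an input data stream".split()
print(ngrams(items, 1))  # unigrams: ('an',), ('input',), ('data',), ('stream',)
print(ngrams(items, 2))  # bigrams: ('an', 'input'), ('input', 'data'), ...
print(ngrams(items, 3))  # trigrams: ('an', 'input', 'data'), ...

Dictionaries from unigram to five-gram, as used later in this paper, can be built by enumerating such windows over a corpus.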