Improved Compression Rate Using Quad-Byte Index Based Transformation As a Pre-Processing to Arithmetic Coding
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 3 Issue 2, February - 2014 Improved Compression Rate using Quad-Byte Index Based Transformation as a Pre-Processing to Arithmetic Coding Jyotika Doshi GLS Inst.of Computer Technology Opp. Law Garden, Ellisbridge Ahmedabad-380006, INDIA Savita Gandhi Dept. of Computer Science, Gujarat University Navrangpura Ahmedabad-380009, INDIA Abstract—Transformation algorithms are used to increase Grammar—based codes [10] and JPEG-MPEG methods redundancy in data sets and achieve better compression when used for image and video compression. Earlier-generation conventional compression techniques applied later. Arithmetic image and video coding standards such as JPEG, H.263, and coding is the most widely preferred entropy encoder used in MPEG-2, MPEG-4 were using Huffman coding in the most of the compression methods. It is nearly optimal and entropy coding step; whereas recent generation standards compression rate cannot be further improved without including JPEG2000 [7, 23] and H.264 [13, 24] utilize changing the data model. In this paper, we have used QBT-I arithmetic coding. (Quad-Byte Transformation using Indexes) technique to change the data model and introduce more redundancy in the Arithmetic coding [9, 12, 27] is the most widely data. We have experimented QBT-I at a pre-processing stage preferred efficient entropy coding technique providing before applying arithmetic coding compression method. QBT-I optimal entropy. Here, the problem is that further transforms most frequent 4-byte (quad-byte) integers. Most improvement in compression is not possible due to its frequent quad-bytes are arranged in sorted order of their entropy limitations. To achieve better compression, the only frequency and then divided in a group of 256 quad-bytes. EachIJERT IJERTpossibility is to change the data model and have it more quad-byte in a group is encoded using two tokens: group skewed. One way to change the data model is applying data number and the location in a group. Group number is denoted transformation. using variable length codeword; whereas location within a group is denoted using 8-bit index. QBT-I can be applied on Authors of this paper have proposed Quad-Byte any source; not necessarily text or image or audio. Minimum of Transformation using Index (QBT-I) method with an 2.5% compression gain is observed using QBT-I at a pre- intention to introduce more redundancy in the data and make processing as compared to compression using only arithmetic it more compressible using arithmetic coding at the second coding. Increasing number of groups gives better compression. stage [6]. Implementation of this proposal has resulted in minimum of nearly 2.5% improvement in compression as Keywords—data compression, data transformation, quad-byte compared to compression using only arithmetic coding. transformation, arithmetic coding QBT-I transforms most frequent quad-bytes (4-byte I. INTRODUCTION integers) forming various groups of 256 quad-bytes and then encoding quad-byte using two tokens: group number and Data transformation transforms data from one format to location (8-bit index) of quad-byte within a group. another. When data transformation is applied before applying conventional compression, the main purpose of a data Due to two-stage process of transformation and then transformation is to re-structure the data such that the compression, it is obviously going to be somewhat slower. transformed file is more compressible by a second-stage This slowness is acceptable since the transform truly skews conventional compression algorithm. The intention here is to the data source to fulfil our purpose of achieving more improve the overall compression rate as compared to what compression. could have been achieved by using only arithmetic coding Another advantage of QBT-I is that it can be applied to compression algorithm. any type of source; may be text, binary file, image, video or Majority of the data compression methods transforms any other format. data first and then apply entropy coding in the last step. Some of such methods are: LZ algorithms [21, 25, 28, 29]; DMC (Dynamic Markov Compression) [2, 4]; PPM [15] and their variants, context-tree weighting method [26], IJERTV3IS20786 www.ijert.org 1381 International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 3 Issue 2, February - 2014 II. LITERATURE REVIEW Compression Pre-processing to BWT, Later MTF and RLE and entropy methods that can encoding Most of the research work in data transformation is be applied later intended to compress specific type of files. Transformation Drawback Applicable to text source only techniques like DCT and wavelet are used for image files. Requires pattern matching Compression-time more as used as a pre-processing Burrows Wheeler Transform (BWT) [3, 16] performs to BWT block encoding. Even though it is intended for text source only, it can be used for any source. For each block, BWT Methods BPE (Byte Pair Encoding) [8], digram encoding requires rotation-sorting-indexing. It is very time consuming and ISSDC (Iterative Semi-Static Digram Coding) [14] are and requires better data structures for efficient pattern also intended for text files, but can be applied to any type of matching. It gives better compression only when it is source. They will benefit more only when applied to small- combined ad-hoc compression techniques Run Length alphabet source like text files. Encoding (RLE) and Move-To-Front (MTF) encoding and then entropy coding. TABLE III. DIGRAM BASED ENCODING TECHNIQUES Star family transformation techniques are intended to compress text files. Star Transform [11], Length Index Digram encoding ISSDC BPE Preserving Transform (LIPT) [1, 17], and StarNT [22] are Source Type Any Any Any some such techniques shown in Table 1. Dictionary semi-static semi-static --- TABLE I. STAR FAMILY TRANSFORMATION TECHNIQUES Size of token to Digram (2 bytes) be encoded Star encoding LIPT StarNT Source Type Text Text Text Matching string or integer comparision Dictionary 22 sub-dict 22 sub-dict single comparison O(Dict-size) O(Dict-size) O(1): 2-byte Size of token to be word upto 22 word upto 22 Word time per token comparison encoded letters letters Code length fixed, depends on dictionary size 1 byte comparison time per O(Sub-Dict-size) O(Sub-Dict-size) O(Dict- size) token Redundancy index index substitution using Code length variable length: variable length: variable length: Compression Huffman or Arithmetic coding word-size <*, word length, index with max. methods that index> 3-letters can be applied Redundancy using * index, length index, length later Compression methods RLE, LZW, Huffman or Arithmetic Coding Drawback benefits only with repetitive, benefits repetitive, that can be applied Huffman, small- alphabet only with small benefits only later Arithmetic coding source sized source file when source and small alphabet have some IJERTIJERT source unused symbols, Drawback Applicable to text source only i.e. for small Requires pattern matching alphabet source Dictionary Based Encoding (IDBE) [20], Enhance Many of the present day transformation techniques, along Intelligent Dictionary Based Encoding (EIDBE) [18] and with transforming data, may introduce some compression Improved Intelligent Dictionary Based Encoding (IIDBE) also. Additionally they retain enough context and [19] are the transformation methods used for text files as redundancy for later applied compression algorithms to be shown in Table 2. They transform words using their index beneficial. position in the dictionary. III. RESEARCH SCOPE TABLE II. DICTIONARY BASED ENCODING TECHNIQUES Star family and dictionary based methods are applicable IDBE EIDBE IIDBE to text source only and string matching is time consuming. Source Type Text Text Text BWT can be applied to any source even though it is single 22 sub-dict 22 sub-dict Dictionary designed for text files. It is very slow due to the need of Size of token to Word word word rotations, sorting and mapping. It gives good compression be encoded only when later applied sequence of MTF, RTF and entropy comparison time O(Dict-size) O(sub-dict-size) O(sub-dict-size) encoding. per token Code length variable length: <1- variable length: <1- variable length: <1- All these methods require better data structures and byte codeword byte word length, byte codeword pattern matching algorithms for efficiency. length, codeword> codeword> length, codeword> Digram based encoding can be applied to any type of Redundancy (index, length) (index, length) index (index, length) source, but they will be beneficial only for small-alphabet using index using ASCII using ASCII index using source files like text. Here the advantage is of integer characters 33-250, characters 33-231, characters (A-Z, a-z) length 1-4 using length 1-22 using as in StarNT, length comparison leading to speed in transformation. ASCII characters ASCII characters 1-22 using ASCII 251-254 232-253 characters 232-253 We saw a research scope in transforming quad-bytes instead of digrams. Our assumption is that it will result in a IJERTV3IS20786 www.ijert.org 1382 International Journal of Engineering Research & Technology (IJERT) ISSN: 2278-0181 Vol. 3 Issue 2, February - 2014 reduced file size and take less transformation time as V. ALGORITHM compared to digram based transformation. Algorithm uses two output files: transformed data file As arithmetic coding is the most widely used entropy and code file. encoding method used with almost all compression The transformed data file contains the index codewords methods, we intend to apply quad-byte transformation that (for transformed quad-bytes only). Purpose of storing index can be applied to any type source data and introduce codewords separately is to introduce redundancy in the data redundancy to skew the distribution for getting better for better compression. compression using arithmetic coding later. Prefix codes denoting group codeword are copied in the We have already proposed Quad-Byte Transformation code file.