Lecture 4: Lossless Coding


Lecture 4: Lossless Coding (I)
Shujun LI (李树钧): INF-10845-20091 Multimedia Coding
May 13, 2009

Outline
• Review — with proofs of some theorems (new)
• Entropy Coding: Overview
• Huffman Coding: The Optimal Code
• Information Theory: Hall of Fame

Review

Image and video coding: Where is IT?
[Figure: the image/video coding pipeline. Encoder: Input Image/Video → Pre-Processing → Lossy Coding → Lossless Coding → Encoded Image/Video (…110…11001…); decoder: the same stages in reverse order, yielding the Decoded Image/Video. Predictive coding spans the lossy and lossless coding stages, and visual quality measurement evaluates the result.]

Coding: A Formal Definition
• Input: x = (x1, …, xm), where xi ∈ X.
• Output: y = (y1, …, ym), where yi ∈ Y* = Y ∪ Y² ∪ …
• Encoding: y = F(x) = f(x1)…f(xm), where f(xi) = yi.
  - yi is the codeword corresponding to the input symbol xi.
  - The mapping f: X → Y* is called a code.
  - F is called the extension of f.
  - If F is an injection, i.e., x1 ≠ x2 ⇒ y1 ≠ y2, then f is a uniquely decodable (UD) code.
• Decoding: x* = F⁻¹(y).
  - When x* = x, we have a lossless code.
  - When x* ≠ x, we have a lossy code.
  - A lossy code cannot be a UD code.

Memoryless random (i.i.d.) source
• Notations
  - A source emits symbols from a set X.
  - At any time, each symbol xi is emitted from the source with a fixed probability pi, independent of any other symbols.
  - Any two emitted symbols are independent of each other: the probability that a symbol xj appears after another symbol xi is pi·pj.
  - ⇒ There is a discrete distribution P = {Prob(xi) = pi | ∀xi ∈ X} that describes the statistical behavior of the source.
• A memoryless random source is simply represented as a 2-tuple (X, P).
  - P can be represented as a vector P = [p1, p2, …] once an order of the elements of X is fixed.

Prefix-free (PF) code
• A code f: X → Y* is prefix-free (PF), or instantaneous, if no codeword is a prefix of another codeword, i.e., there do not exist two distinct symbols x1, x2 ∈ X such that f(x1) is a prefix of f(x2).
• Properties of PF codes
  - PF codes are always UD codes.
  - A PF code can be uniquely represented by a b-ary tree.
  - PF codes can be decoded without reference to future symbols.

Kraft-McMillan number
• Kraft-McMillan number: K = Σ_{x∈X} 1/b^L(x), where L(x) denotes the length of the codeword f(x) and b is the size of Y.
• Theorem 1 (Kraft): K ≤ 1 ⇔ a PF code with these codeword lengths exists. (K ≤ 1 is often called the Kraft inequality.)
• Theorem 2 (McMillan): a code is UD ⇒ K ≤ 1.
• Theorems 1+2: a UD code always has a PF counterpart with the same codeword lengths.
• ⇒ Nothing is lost by restricting attention to PF codes instead of general UD codes.

Definition of entropy (Shannon, 1948)
• Given a memoryless random source (X, P) with probability distribution P = [p1, …, pn], its entropy to base b is defined as
  H_b(X) = H_b(P) = Σ_{i=1}^n pi·log_b(1/pi).
• When b = 2, the subscript b may be omitted, and we write H(X) = H(P).

Properties of entropy
• The comparison theorem: if P = [p1, …, pn] and Q = [q1, …, qn] are two probability distributions, then
  H_b(P) = Σ_{i=1}^n pi·log_b(1/pi) ≤ Σ_{i=1}^n pi·log_b(1/qi),
  and the equality holds if and only if P = Q.
• ⇒ H_b(P) ≤ log_b n, and the equality holds if and only if p1 = … = pn = 1/n (the uniform distribution).
• H_b(Pⁿ) = n·H_b(P).
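These definitions are easy to check numerically. Below is a minimal Python sketch (not part of the lecture; the function names and the example distribution are our own) that computes the entropy, the cross-entropy bound from the comparison theorem, and the Kraft-McMillan number of a set of codeword lengths.

```python
import math

def kraft_mcmillan(lengths, b=2):
    """Kraft-McMillan number K = sum over codewords of 1/b^L(x)."""
    return sum(b ** -L for L in lengths)

def entropy(p, b=2):
    """H_b(P) = sum_i p_i * log_b(1/p_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(1.0 / pi, b) for pi in p if pi > 0)

def cross_entropy(p, q, b=2):
    """sum_i p_i * log_b(1/q_i), which upper-bounds H_b(P) by the comparison theorem."""
    return sum(pi * math.log(1.0 / qi, b) for pi, qi in zip(p, q) if pi > 0)

# Example distributions (ours, for illustration):
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]

print(entropy(P))            # 1.75 bits/symbol
print(cross_entropy(P, Q))   # 2.0 >= H(P), as the comparison theorem predicts
print(entropy(Q))            # 2.0 = log2(4), the maximum for 4 symbols

# The binary PF code {0, 10, 110, 111} has lengths [1, 2, 3, 3]:
print(kraft_mcmillan([1, 2, 3, 3]))  # 1.0 <= 1, so the Kraft inequality holds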
Proof of the comparison theorem *
• Lemma: ∀x > 0, ln x ≤ x − 1, and equality holds if and only if x = 1.
  [Figure: plots of ln(x) against x − 1, and of −ln(x) against x − 1, over 0 < x ≤ 2, illustrating the lemma.]
• We only need to prove the theorem for the base b = e, since log_b(x) = ln(x)/ln(b).
• From the lemma, ln(qi/pi) ≤ qi/pi − 1, with equality if and only if qi/pi = 1 (i.e., qi = pi).
• ⇒ H_e(P) − Σ_{i=1}^n pi·ln(1/qi) = Σ_{i=1}^n pi·ln(1/pi) − Σ_{i=1}^n pi·ln(1/qi)
  = Σ_{i=1}^n pi·ln(qi/pi) ≤ Σ_{i=1}^n pi·(qi/pi − 1)
  = Σ_{i=1}^n qi − Σ_{i=1}^n pi = 1 − 1 = 0.

Shannon's source coding theorem (I)
• The entropy of a memoryless random source defines the lower bound on the efficiency of all UD (uniquely decodable) codes.
  - Denoting by L̄ the average length of the codewords of a UD code, the theorem says L̄ ≥ H_b(X).

Proof of Shannon Theorem (I) *
• Calculate the Kraft-McMillan number K = Σ_{i=1}^n 1/b^L(xi).
• Construct a "virtual" probability distribution Q = [q1, …, qn], where qi = 1/(K·b^L(xi)).
• From the comparison theorem, we have
  H_b(P) = Σ_{i=1}^n pi·log_b(1/pi) ≤ Σ_{i=1}^n pi·log_b(1/qi) = Σ_{i=1}^n pi·log_b(K·b^L(xi)) = log_b K + L̄.
  Since K ≤ 1, log_b K ≤ 0, and therefore H_b(P) ≤ L̄.

Shannon's source coding theorem (II)
• Given a memoryless random source (X, P) with probability distribution P = [p1, …, pn].
• Make a PF code (the Shannon code) as follows:
  - Find L = [L1, …, Ln], where Li is the least positive integer such that b^Li ≥ 1/pi, i.e., Li = ⌈log_b(1/pi)⌉.
  - One can prove K = Σ_i 1/b^Li ≤ 1; then Kraft's theorem ensures there must be a PF code with these lengths.
  - Then, for this PF code, we can prove H_b(X) ≤ L̄ < H_b(X) + 1.

Proof of Shannon Theorem (II) *
• b^Li ≥ 1/pi ⇒ Σ_i 1/b^Li ≤ Σ_i pi = 1 ⇒ a PF code with these lengths exists.
• Li = ⌈log_b(1/pi)⌉ ⇒ Li < log_b(1/pi) + 1.
• L̄ = Σ_i pi·Li < Σ_i pi·(log_b(1/pi) + 1) = H_b(X) + 1.
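As a concrete check of the Shannon-code construction, here is a short Python sketch (ours, not from the slides; the example distribution is an assumption) that derives the lengths Li = ⌈log_b(1/pi)⌉ for b = 2 and verifies both the Kraft inequality and the bound H_b(X) ≤ L̄ < H_b(X) + 1.

```python
import math

def shannon_code_lengths(p, b=2):
    """L_i = ceil(log_b(1/p_i)): the least integer with b^L_i >= 1/p_i.
    Note: floating-point log can round up at exact powers of b; the
    example below avoids such probabilities."""
    return [math.ceil(math.log(1.0 / pi, b)) for pi in p]

P = [0.4, 0.3, 0.2, 0.1]                      # example distribution (ours)
L = shannon_code_lengths(P)                   # [2, 2, 3, 4]
K = sum(2 ** -Li for Li in L)                 # 0.6875 <= 1: a PF code exists
avg_len = sum(pi * Li for pi, Li in zip(P, L))
H = sum(pi * math.log2(1.0 / pi) for pi in P)
assert H <= avg_len < H + 1                   # the bound proved above
print(L, K, avg_len, H)                       # [2, 2, 3, 4] 0.6875 2.4 ~1.846
```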
Approaching the entropy
• Given a memoryless random source (X, P), generate the extended source (Xⁿ, Pⁿ).
• H_b(Xⁿ) ≤ L̄n < H_b(Xⁿ) + 1 ⇒ H_b(X) ≤ L̄ < H_b(X) + 1/n, noting that H_b(Xⁿ) = n·H_b(X) and L̄n = n·L̄.
• Letting n → ∞, we have L̄ → H_b(X).
• Problem: n might be too large to be used in practice.

Entropy Coding

Image and video encoding: A big picture
[Figure: the encoding pipeline Input Image/Video → Pre-Processing → Lossy Coding → Lossless Coding → Post-Processing (post-filtering) → Encoded Image/Video. Pre-processing includes A/D conversion, color space conversion, pre-filtering, partitioning, …; lossy coding includes predictive coding (differential coding, motion estimation and compensation, context-based coding, …), quantization, and transform coding; lossless coding includes entropy coding, dictionary-based coding, model-based coding, run-length coding, ….]

The ingredients of entropy coding
• A random source (X, P).
• A statistical model (X, P′) as an estimate of the random source.
• An algorithm to optimize the coding performance (i.e., to minimize the average codeword length).
• At least one designer …

FLC, VLC and V2FLC
• FLC = fixed-length coding/code(s)/codeword(s)
  - Each symbol xi emitted from a random source (X, P) is encoded as an n-bit codeword, where |X| ≤ 2ⁿ.
• VLC = variable-length coding/code(s)/codeword(s)
  - Each symbol xi emitted from a random source (X, P) is encoded as an ni-bit codeword.
  - FLC can be considered a special case of VLC, where n1 = … = n_|X|.
• V2FLC = variable-to-fixed length coding/code(s)/codeword(s)
  - A symbol or a string of symbols is encoded as an n-bit codeword.
  - V2FLC can also be considered a special case of VLC.

Static coding vs. Dynamic/Adaptive coding
• Static coding = the statistical model P′ is static, i.e., it does not change over time.
• Dynamic/Adaptive coding = the statistical model P′ is dynamically updated, i.e., it adapts itself to the context (it changes over time); see the sketch after this list.
  - Dynamic/Adaptive coding ⊂ Context-based coding.
• Hybrid coding = Static + Dynamic coding
  - A codebook is maintained at the encoder side; the encoder dynamically chooses a code for a number of symbols and informs the decoder about the choice.
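To illustrate what "dynamically updated" means in practice, the sketch below (our illustration; the class name and the Laplace-smoothing choice are assumptions, not from the lecture) maintains a count-based adaptive model P′ that encoder and decoder can update in lockstep, so the model never has to be transmitted.

```python
from collections import Counter

class AdaptiveModel:
    """A minimal adaptive statistical model P': symbol probabilities are
    estimated from counts and updated after every coded symbol, keeping
    encoder and decoder synchronized without sending the model."""

    def __init__(self, alphabet):
        # Laplace smoothing (initial count 1 per symbol) ensures no
        # symbol ever has probability 0.
        self.counts = Counter({s: 1 for s in alphabet})
        self.total = len(alphabet)

    def prob(self, symbol):
        # Current estimate P'(symbol), used to choose its codeword.
        return self.counts[symbol] / self.total

    def update(self, symbol):
        # Both sides call this after each symbol is coded/decoded.
        self.counts[symbol] += 1
        self.total += 1

model = AdaptiveModel("abc")
for s in "aababc":
    p = model.prob(s)   # probability P'(s) the coder would use for s
    model.update(s)     # adapt P' to the data seen so far
print({s: round(model.prob(s), 3) for s in "abc"})  # skewed toward 'a'
```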