Entropy and Information Theory Robert M

Total Page:16

File Type:pdf, Size:1020Kb

Entropy and Information Theory Robert M Entropy and Information Theory Robert M. Gray Entropy and Information Theory Springer Science+ Business Media, LLC Robert M. Gray Information Systems Laboratory Electrical Engineering Department Stanford University Stanford, CA 94305-2099 USA Printed on acid-free paper. © 1990 by Springer Science+Business Media New York Originally published by Springer·Verlag New York, Inc. in 1990 Softcover reprint ofthe hardcover lst edition 1990 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher Springer Science+Business Media, LLC, except for brief ex­ cerpts in connection with reviews or scholarly analysis. Use in connection with any form of infor­ mation storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use of general descriptive names, trade names, trade marks, etc. in this publication, even if the former are not especially identified, is not to be taken as a sign that such names, as understood by the Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone. Camera-ready text prepared by the author using LaTeX. 9 8 7 6 5 4 3 2 1 ISBN 978-1-4757-3984-8 ISBN 978-1-4757-3982-4 (eBook) DOI 10.1007/978-1-4757-3982-4 to Tim, Lori, Julia, Peter, Gus, Amy Elizabeth, and Alice and in memory of Tino Contents Prologue xi 1 Information Sources 1 1.1 Introduction . 1 1.2 Probability Spaces and Random Variables . 1 1.3 Random Processes and Dynamical Systems 5 1.4 Distributions . 7 1.5 Standard Alphabets . 12 1.6 Expectation . 13 1. 7 Asymptotic Mean Stationarity 16 1.8 Ergodie Properties . 17 2 Entropy and Information 21 2.1 Introduction ...... 21 2.2 Entropy and Entropy Rate 21 2.3 Basic Properties of Entropy 25 2.4 Entropy Rate . 37 2.5 Conditional Entropy and Information 42 2.6 Entropy Rate Revisited .. 50 2.7 Relative Entropy Densities ... 53 3 The Entropy Ergodie Theorem 57 3.1 Introduction .......... 57 3.2 Stationary Ergodie Sources .. 60 3.3 Stationary Nonergodie Sources 67 3.4 AMS Sources . 71 3.5 The Asymptotic Equipartition Property 75 4 Information Rates I 77 4.1 Introduction .............. 77 4.2 Stationary Codes and Approximation 77 Vll Vlll CONTENTS 4.3 Information Rate of Finite Alphabet Processes 86 5 Relative Entropy 89 5.1 Introduction ......... 89 5.2 Divergence ......... 89 5.3 Conditional Relative Entropy 106 5.4 Limiting Entropy Densities . 119 5.5 Information for General Alphabets 121 5.6 Some Convergence Results . 132 6 Information Rates II 135 6.1 Introduction ................ 135 6.2 Information Rates for General Alphabets 135 6.3 A Mean Ergodie Theorem for Densities . 139 6.4 Information Rates of Stationary Processes 141 7 Relative Entropy Rates 149 7.1 Introduction .............. 149 7.2 Relative Entropy Densities and Rates 149 7.3 Markov Dominating Measures . 152 7.4 Stationary Processes . 156 7.5 Mean Ergodie Theorems .... 159 8 Ergodie Theorems for Densities 165 8.1 Introduction .......... 165 8.2 Stationary Ergodie Sources .. 165 8.3 Stationary Nonergodie Sources 170 8.4 AMS Sources . 174 8.5 Ergodie Theorems for Information Densities. 177 9 Channels and Codes 181 9.1 Introduction ............. 181 9.2 Channels .............. 183 9.3 Stationarity Properties of Channels . 184 9.4 Examples of Channels ..... 188 9.5 The Rohlin-Kakutani Theorem 210 10 Distortion 217 10.1 Introduction .......... 217 10.2 Distortion and Fidelity Criteria 217 10.3 Performance ........ 219 10.4 The rho-bar distortion .. 221 10.5 d-bar Continuous Channels 224 CONTENTS IX 10.6 The Distortion-Rate Function . 228 11 Source Coding Theorems 241 11.1 Source Coding and Channel Coding 241 11.2 Block Source Codes for AMS Sources . 242 11.3 Block Coding Stationary Sources . 252 11.4 Block Coding AMS Ergodie Sources 253 11.5 Subadditive Fidelity Criteria 260 11.6 Asynchronaus Block Codes .... 262 11.7 Sliding Block Source Codes .... 265 11.8 A Geometrie Interpretation of OPTA's . 275 12 Coding for noisy channels 277 12.1 Noisy Channels ... 277 12.2 Feinstein's Lemma . 278 12.3 Feinstein's Theorem 281 12.4 Channel Capacity .. 285 12.5 Robust Block Codes 290 12.6 Block Coding Theorems for Noisy Channels 293 12.7 Joint Source and Channel Block Codes .. 295 12.8 Synchronizing Block Channel Codes ... 298 12.9 Sliding Block Source and Channel Coding 302 Bibliography 315 Index 327 Prologue This book is devoted to the theory of probabilistic information measures and their application to coding theorems for information sources and noisy channels. The eventual goal is a general development of Shannon's mathe­ matical theory of communication, but much of the space is devoted to the tools and methods required to prove the Shannon coding theorems. These tools form an area common to ergodie theory and information theory and comprise several quantitative notions of the information in random vari­ ables, random processes, and dynamical systems. Examples are entropy, mutual information, conditional entropy, conditional information, and dis­ crimination or relative entropy, along with the limiting normalized versions of these quantities such as entropy rate and information rate. Much of the book is concerned with their properties, especially the long term asymptotic behavior of sample information and expected information. The book has been strongly influenced by M. S. Pinsker's dassie Infor­ mation and Information Stability of Random Variables and Processes and by the seminal work of A. N. Kolmogorov, I. M. Gelfand, A. M. Yaglom, and R. L. Dobrushin on information measures for abstract alphabets and their convergence properties. Many of the results herein are extensions of their generalizations of Shannon's original results. The mathematical models of this treatment are more general than traditional treatments in that nonstationary and nonergodie information processes are treated. The models are somewhat less general than those of the Soviet school of infor­ mation theory in the sense that standard alphabets rather than completely abstract alphabets are considered. This restriction, however, permits many stronger results as weil as the extension to nonergodie processes. In addi­ tion, the assumption of standard spaces simplifies many proofs and such spaces include as examples virtually all examples of engineering interest. The information convergence results are combined with ergodie theo­ rems to prove general Shannon coding theorems for sources and channels. The results are not the most general known and the converses are not the strongest available, but they are sufficently general to cover most systems xi Xll PROLOGUE encountered in applications and they provide an introduction to recent ex­ tensions requiring significant additional mathematical machinery. Several of the generalizations have not previously been treated in book form. Ex­ amples of novel topics for an information theory text indude asymptotic mean stationary sources, one-sided sources as weil as two-sided sources, nonergodie sources, d-continuous channels, and sliding bloek codes. An­ other novel aspect is the use of recent proofs of general Shannon-McMillan­ Breiman theorems which do not use martingale theory: A coding proof of Ornstein and Weiss [117) is used to prove the almost everywhere conver­ gence of sample entropy for discrete alphabet proeesses and a variation on the sandwich approach of Algoet and Cover [7) is used to prove the conver­ gence of relative entropy densities for general standard alphabet processes. Both results are proved for asymptotically mean stationary processes which need not be ergodic. This material can be considered as a sequel to my book Probability, Random Processes, and Ergodie Properlies [51) wherein the prerequisite results on probability, standard spaees, and ordinary ergodie properties may be found. This book is self contained with the exeeption of common (and a few less common) results which may be found in the first book. It is my hope that the book will interest engineers in some of the math­ ematieal aspects and general models of the theory and mathematicians in some of the important engineering applications of performanee bounds and eode design for eommunication systems. Information theory or the mathematical theory of communication has two primary goals: The first is the development of the fundamental theoreti­ callimits on the achievable performanee when eommunicating a given infor­ mation source over a given eommunications channel using coding schemes from within a prescribed dass. The second goal is the development of coding schemes that provide performanee that is reasonably good in eom­ parison with the optimal performanee given by the theory. Information theory was born in a surprisingly rieh state in the dassie papers of Claude E. Shannon [129) [130) whieh eontained the basie results for simple mem­ oryless sources and channels and introdueed more general eommunication systems models, including finite state sources and channels. The key tools used to prove the original results and many of those that followed were special cases of the ergodie theorem and a new variation of the ergodie theorem which considered sample averages of a measure of the entropy or self information in a process. Information theory can be viewed as simply a branch of applied prob­ ability theory. Beeause of its dependenee on ergodie theorems, however, it can also be viewed as a branch of ergodie theory, the theory of invariant PROLOGUE Xlll transformations and transformations related to invariant transformations. In order to develop the ergodie theory example of principal interest to in­ formation theory, suppose that one has a random process, which for the moment we consider as a sample space or ensemble of possible output se­ quences together with a probability measure on events composed of collec­ tions of such sequences. The shift is the transformation on this space of sequences that takes a sequence and produces a new sequence by shifting the first sequence a single time unit to the left.
Recommended publications
  • On Measures of Entropy and Information
    On Measures of Entropy and Information Tech. Note 009 v0.7 http://threeplusone.com/info Gavin E. Crooks 2018-09-22 Contents 5 Csiszar´ f-divergences 12 Csiszar´ f-divergence ................ 12 0 Notes on notation and nomenclature 2 Dual f-divergence .................. 12 Symmetric f-divergences .............. 12 1 Entropy 3 K-divergence ..................... 12 Entropy ........................ 3 Fidelity ........................ 12 Joint entropy ..................... 3 Marginal entropy .................. 3 Hellinger discrimination .............. 12 Conditional entropy ................. 3 Pearson divergence ................. 14 Neyman divergence ................. 14 2 Mutual information 3 LeCam discrimination ............... 14 Mutual information ................. 3 Skewed K-divergence ................ 14 Multivariate mutual information ......... 4 Alpha-Jensen-Shannon-entropy .......... 14 Interaction information ............... 5 Conditional mutual information ......... 5 6 Chernoff divergence 14 Binding information ................ 6 Chernoff divergence ................. 14 Residual entropy .................. 6 Chernoff coefficient ................. 14 Total correlation ................... 6 Renyi´ divergence .................. 15 Lautum information ................ 6 Alpha-divergence .................. 15 Uncertainty coefficient ............... 7 Cressie-Read divergence .............. 15 Tsallis divergence .................. 15 3 Relative entropy 7 Sharma-Mittal divergence ............. 15 Relative entropy ................... 7 Cross entropy
    [Show full text]
  • Information Theory 1 Entropy 2 Mutual Information
    CS769 Spring 2010 Advanced Natural Language Processing Information Theory Lecturer: Xiaojin Zhu [email protected] In this lecture we will learn about entropy, mutual information, KL-divergence, etc., which are useful concepts for information processing systems. 1 Entropy Entropy of a discrete distribution p(x) over the event space X is X H(p) = − p(x) log p(x). (1) x∈X When the log has base 2, entropy has unit bits. Properties: H(p) ≥ 0, with equality only if p is deterministic (use the fact 0 log 0 = 0). Entropy is the average number of 0/1 questions needed to describe an outcome from p(x) (the Twenty Questions game). Entropy is a concave function of p. 1 1 1 1 7 For example, let X = {x1, x2, x3, x4} and p(x1) = 2 , p(x2) = 4 , p(x3) = 8 , p(x4) = 8 . H(p) = 4 bits. This definition naturally extends to joint distributions. Assuming (x, y) ∼ p(x, y), X X H(p) = − p(x, y) log p(x, y). (2) x∈X y∈Y We sometimes write H(X) instead of H(p) with the understanding that p is the underlying distribution. The conditional entropy H(Y |X) is the amount of information needed to determine Y , if the other party knows X. X X X H(Y |X) = p(x)H(Y |X = x) = − p(x, y) log p(y|x). (3) x∈X x∈X y∈Y From above, we can derive the chain rule for entropy: H(X1:n) = H(X1) + H(X2|X1) + ..
    [Show full text]
  • Information Theory and Maximum Entropy 8.1 Fundamentals of Information Theory
    NEU 560: Statistical Modeling and Analysis of Neural Data Spring 2018 Lecture 8: Information Theory and Maximum Entropy Lecturer: Mike Morais Scribes: 8.1 Fundamentals of Information theory Information theory started with Claude Shannon's A mathematical theory of communication. The first building block was entropy, which he sought as a functional H(·) of probability densities with two desired properties: 1. Decreasing in P (X), such that if P (X1) < P (X2), then h(P (X1)) > h(P (X2)). 2. Independent variables add, such that if X and Y are independent, then H(P (X; Y )) = H(P (X)) + H(P (Y )). These are only satisfied for − log(·). Think of it as a \surprise" function. Definition 8.1 (Entropy) The entropy of a random variable is the amount of information needed to fully describe it; alternate interpretations: average number of yes/no questions needed to identify X, how uncertain you are about X? X H(X) = − P (X) log P (X) = −EX [log P (X)] (8.1) X Average information, surprise, or uncertainty are all somewhat parsimonious plain English analogies for entropy. There are a few ways to measure entropy for multiple variables; we'll use two, X and Y . Definition 8.2 (Conditional entropy) The conditional entropy of a random variable is the entropy of one random variable conditioned on knowledge of another random variable, on average. Alternative interpretations: the average number of yes/no questions needed to identify X given knowledge of Y , on average; or How uncertain you are about X if you know Y , on average? X X h X i H(X j Y ) = P (Y )[H(P (X j Y ))] = P (Y ) − P (X j Y ) log P (X j Y ) Y Y X X = = − P (X; Y ) log P (X j Y ) X;Y = −EX;Y [log P (X j Y )] (8.2) Definition 8.3 (Joint entropy) X H(X; Y ) = − P (X; Y ) log P (X; Y ) = −EX;Y [log P (X; Y )] (8.3) X;Y 8-1 8-2 Lecture 8: Information Theory and Maximum Entropy • Bayes' rule for entropy H(X1 j X2) = H(X2 j X1) + H(X1) − H(X2) (8.4) • Chain rule of entropies n X H(Xn;Xn−1; :::X1) = H(Xn j Xn−1; :::X1) (8.5) i=1 It can be useful to think about these interrelated concepts with a so-called information diagram.
    [Show full text]
  • 16.1 Digital “Modes”
    Contents 16.1 Digital “Modes” 16.5 Networking Modes 16.1.1 Symbols, Baud, Bits and Bandwidth 16.5.1 OSI Networking Model 16.1.2 Error Detection and Correction 16.5.2 Connected and Connectionless 16.1.3 Data Representations Protocols 16.1.4 Compression Techniques 16.5.3 The Terminal Node Controller (TNC) 16.1.5 Compression vs. Encryption 16.5.4 PACTOR-I 16.2 Unstructured Digital Modes 16.5.5 PACTOR-II 16.2.1 Radioteletype (RTTY) 16.5.6 PACTOR-III 16.2.2 PSK31 16.5.7 G-TOR 16.2.3 MFSK16 16.5.8 CLOVER-II 16.2.4 DominoEX 16.5.9 CLOVER-2000 16.2.5 THROB 16.5.10 WINMOR 16.2.6 MT63 16.5.11 Packet Radio 16.2.7 Olivia 16.5.12 APRS 16.3 Fuzzy Modes 16.5.13 Winlink 2000 16.3.1 Facsimile (fax) 16.5.14 D-STAR 16.3.2 Slow-Scan TV (SSTV) 16.5.15 P25 16.3.3 Hellschreiber, Feld-Hell or Hell 16.6 Digital Mode Table 16.4 Structured Digital Modes 16.7 Glossary 16.4.1 FSK441 16.8 References and Bibliography 16.4.2 JT6M 16.4.3 JT65 16.4.4 WSPR 16.4.5 HF Digital Voice 16.4.6 ALE Chapter 16 — CD-ROM Content Supplemental Files • Table of digital mode characteristics (section 16.6) • ASCII and ITA2 code tables • Varicode tables for PSK31, MFSK16 and DominoEX • Tips for using FreeDV HF digital voice software by Mel Whitten, KØPFX Chapter 16 Digital Modes There is a broad array of digital modes to service various needs with more coming.
    [Show full text]
  • A Characterization of Guesswork on Swiftly Tilting Curves Ahmad Beirami, Robert Calderbank, Mark Christiansen, Ken Duffy, and Muriel Medard´
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by MURAL - Maynooth University Research Archive Library 1 A Characterization of Guesswork on Swiftly Tilting Curves Ahmad Beirami, Robert Calderbank, Mark Christiansen, Ken Duffy, and Muriel Medard´ Abstract Given a collection of strings, each with an associated probability of occurrence, the guesswork of each of them is their position in a list ordered from most likely to least likely, breaking ties arbitrarily. Guesswork is central to several applications in information theory: Average guesswork provides a lower bound on the expected computational cost of a sequential decoder to decode successfully the transmitted message; the complementary cumulative distribution function of guesswork gives the error probability in list decoding; the logarithm of guesswork is the number of bits needed in optimal lossless one-to-one source coding; and guesswork is the number of trials required of an adversary to breach a password protected system in a brute-force attack. In this paper, we consider memoryless string-sources that generate strings consisting of i.i.d. characters drawn from a finite alphabet, and characterize their corresponding guesswork. Our main tool is the tilt operation on a memoryless string-source. We show that the tilt operation on a memoryless string-source parametrizes an exponential family of memoryless string-sources, which we refer to as the tilted family of the string-source. We provide an operational meaning to the tilted families by proving that two memoryless string-sources result in the same guesswork on all strings of all lengths if and only if their respective categorical distributions belong to the same tilted family.
    [Show full text]
  • Shannon's Coding Theorems
    Saint Mary's College of California Department of Mathematics Senior Thesis Shannon's Coding Theorems Faculty Advisors: Author: Professor Michael Nathanson Camille Santos Professor Andrew Conner Professor Kathy Porter May 16, 2016 1 Introduction A common problem in communications is how to send information reliably over a noisy communication channel. With his groundbreaking paper titled A Mathematical Theory of Communication, published in 1948, Claude Elwood Shannon asked this question and provided all the answers as well. Shannon realized that at the heart of all forms of communication, e.g. radio, television, etc., the one thing they all have in common is information. Rather than amplifying the information, as was being done in telephone lines at that time, information could be converted into sequences of 1s and 0s, and then sent through a communication channel with minimal error. Furthermore, Shannon established fundamental limits on what is possible or could be acheived by a communication system. Thus, this paper led to the creation of a new school of thought called Information Theory. 2 Communication Systems In general, a communication system is a collection of processes that sends information from one place to another. Similarly, a storage system is a system that is used for storage and later retrieval of information. Thus, in a sense, a storage system may also be thought of as a communication system that sends information from one place (now, or the present) to another (then, or the future) [3]. In a communication system, information always begins at a source (e.g. a book, music, or video) and is then sent and processed by an encoder to a format that is suitable for transmission through a physical communications medium, called a channel.
    [Show full text]
  • Information Theory and Coding
    Information Theory and Coding Introduction – What is it all about? References: C.E.Shannon, “A Mathematical Theory of Communication”, The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October, 1948. C.E.Shannon “Communication in the presence of Noise”, Proc. Inst of Radio Engineers, Vol 37 No 1, pp 10 – 21, January 1949. James L Massey, “Information Theory: The Copernican System of Communications”, IEEE Communications Magazine, Vol. 22 No 12, pp 26 – 28, December 1984 Mischa Schwartz, “Information Transmission, Modulation and Noise”, McGraw Hill 1990, Chapter 1 The purpose of telecommunications is the efficient and reliable transmission of real data – text, speech, image, taste, smell etc over a real transmission channel – cable, fibre, wireless, satellite. In achieving this there are two main issues. The first is the source – the information. How do we code the real life usually analogue information, into digital symbols in a way to convey and reproduce precisely without loss the transmitted text, picture etc.? This entails proper analysis of the source to work out the best reliable but minimum amount of information symbols to reduce the source data rate. The second is the channel, and how to pass through it using appropriate information symbols the source. In doing this there is the problem of noise in all the multifarious ways that it appears – random, shot, burst- and the problem of one symbol interfering with the previous or next symbol especially as the symbol rate increases. At the receiver the symbols have to be guessed, based on classical statistical communications theory. The guessing, unfortunately, is heavily influenced by the channel noise, especially at high rates.
    [Show full text]
  • Language Models
    Recap: Language models Foundations of Natural Language Processing Language models tell us P (~w) = P (w . w ): How likely to occur is this • 1 n Lecture 4 sequence of words? Language Models: Evaluation and Smoothing Roughly: Is this sequence of words a “good” one in my language? LMs are used as a component in applications such as speech recognition, Alex Lascarides • machine translation, and predictive text completion. (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipp Koehn) 24 January 2020 To reduce sparse data, N-gram LMs assume words depend only on a fixed- • length history, even though we know this isn’t true. Alex Lascarides FNLP lecture 4 24 January 2020 Alex Lascarides FNLP lecture 4 1 Evaluating a language model Two types of evaluation in NLP Intuitively, a trigram model captures more context than a bigram model, so Extrinsic: measure performance on a downstream application. • should be a “better” model. • – For LM, plug it into a machine translation/ASR/etc system. – The most reliable evaluation, but can be time-consuming. That is, it should more accurately predict the probabilities of sentences. • – And of course, we still need an evaluation measure for the downstream system! But how can we measure this? • Intrinsic: design a measure that is inherent to the current task. • – Can be much quicker/easier during development cycle. – But not always easy to figure out what the right measure is: ideally, one that correlates well with extrinsic measures. Let’s consider how to define an intrinsic measure for LMs.
    [Show full text]
  • Neural Networks and Backpropagation
    10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Neural Networks and Backpropagation Neural Net Readings: Murphy -- Matt Gormley Bishop 5 Lecture 20 HTF 11 Mitchell 4 April 3, 2017 1 Reminders • Homework 6: Unsupervised Learning – Release: Wed, Mar. 22 – Due: Mon, Apr. 03 at 11:59pm • Homework 5 (Part II): Peer Review – Expectation: You Release: Wed, Mar. 29 should spend at most 1 – Due: Wed, Apr. 05 at 11:59pm hour on your reviews • Peer Tutoring 2 Neural Networks Outline • Logistic Regression (Recap) – Data, Model, Learning, Prediction • Neural Networks – A Recipe for Machine Learning Last Lecture – Visual Notation for Neural Networks – Example: Logistic Regression Output Surface – 2-Layer Neural Network – 3-Layer Neural Network • Neural Net Architectures – Objective Functions – Activation Functions • Backpropagation – Basic Chain Rule (of calculus) This Lecture – Chain Rule for Arbitrary Computation Graph – Backpropagation Algorithm – Module-based Automatic Differentiation (Autodiff) 3 DECISION BOUNDARY EXAMPLES 4 Example #1: Diagonal Band 5 Example #2: One Pocket 6 Example #3: Four Gaussians 7 Example #4: Two Pockets 8 Example #1: Diagonal Band 9 Example #1: Diagonal Band 10 Example #1: Diagonal Band Error in slides: “layers” should read “number of hidden units” All the neural networks in this section used 1 hidden layer. 11 Example #1: Diagonal Band 12 Example #1: Diagonal Band 13 Example #1: Diagonal Band 14 Example #1: Diagonal Band 15 Example #2: One Pocket
    [Show full text]
  • Source Coding: Part I of Fundamentals of Source and Video Coding Full Text Available At
    Full text available at: http://dx.doi.org/10.1561/2000000010 Source Coding: Part I of Fundamentals of Source and Video Coding Full text available at: http://dx.doi.org/10.1561/2000000010 Source Coding: Part I of Fundamentals of Source and Video Coding Thomas Wiegand Berlin Institute of Technology and Fraunhofer Institute for Telecommunications | Heinrich Hertz Institute Germany [email protected] Heiko Schwarz Fraunhofer Institute for Telecommunications | Heinrich Hertz Institute Germany [email protected] Boston { Delft Full text available at: http://dx.doi.org/10.1561/2000000010 Foundations and Trends R in Signal Processing Published, sold and distributed by: now Publishers Inc. PO Box 1024 Hanover, MA 02339 USA Tel. +1-781-985-4510 www.nowpublishers.com [email protected] Outside North America: now Publishers Inc. PO Box 179 2600 AD Delft The Netherlands Tel. +31-6-51115274 The preferred citation for this publication is T. Wiegand and H. Schwarz, Source Coding: Part I of Fundamentals of Source and Video Coding, Foundations and Trends R in Signal Processing, vol 4, nos 1{2, pp 1{222, 2010 ISBN: 978-1-60198-408-1 c 2011 T. Wiegand and H. Schwarz All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, mechanical, photocopying, recording or otherwise, without prior written permission of the publishers. Photocopying. In the USA: This journal is registered at the Copyright Clearance Cen- ter, Inc., 222 Rosewood Drive, Danvers, MA 01923. Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is granted by now Publishers Inc for users registered with the Copyright Clearance Center (CCC).
    [Show full text]
  • Entropy and Decisions
    Entropy and decisions CSC401/2511 – Natural Language Computing – Winter 2021 Lecture 5, Serena Jeblee, Frank Rudzicz and Sean Robertson CSC401/2511 – Spring 2021 University of Toronto This lecture • Information theory and entropy. • Decisions. • Classification. • Significance and hypothesis testing. Can we quantify the statistical structure in a model of communication? Can we quantify the meaningful difference between statistical models? CSC401/2511 – Spring 2021 2 Information • Imagine Darth Vader is about to say either “yes” or “no” with equal probability. • You don’t know what he’ll say. • You have a certain amount of uncertainty – a lack of information. Darth Vader is © Disney And the prequels and Rey/Finn Star Wars suck CSC401/2511 – Spring 2021 3 Star Trek is better than Star Wars Information • Imagine you then observe Darth Vader saying “no” • Your uncertainty is gone; you’ve received information. • How much information do you receive about event ! when you observe it? 1 ! 1 = log ! )(1) For the units For the inverse of measurement 1 1 ! "# = log = log = 1 bit ! )("#) ! 1 ,2 CSC401/2511 – Spring 2021 4 Information • Imagine Darth Vader is about to roll a fair die. • You have more uncertainty about an event because there are more possibilities. • You receive more information when you observe it. 1 ! 5 = log ! )(5) " = log! ! ≈ 2.59 bits ⁄" CSC401/2511 – Spring 2021 5 Information is additive • From k independent, equally likely events ", 1 " 1 1 $ 8 = log = log $ % binary decisions = log! = % bits ! 9(8") ! 9 8 " 1 " 52 • For a unigram model, with each of 50K words # equally likely, 1 $ < = log ≈ 15.61 bits ! 1 550000 and for a sequence of 1K words in that model, " 1 $ < = log! ≈ 15,610???bits 1 #$$$ 550000 CSC401/2511 – Spring 2021 6 Information with unequal events • An information source S emits symbols without memory from a vocabulary #!, #", … , ## .
    [Show full text]
  • Fundamental Limits of Video Coding: a Closed-Form Characterization of Rate Distortion Region from First Principles
    1 Fundamental Limits of Video Coding: A Closed-form Characterization of Rate Distortion Region from First Principles Kamesh Namuduri and Gayatri Mehta Department of Electrical Engineering University of North Texas Abstract—Classical motion-compensated video coding minimum amount of rate required to encode video. On methods have been standardized by MPEG over the the other hand, a communication technology may put years and video codecs have become integral parts of a limit on the maximum data transfer rate that the media entertainment applications. Despite the ubiquitous system can handle. This limit, in turn, places a bound use of video coding techniques, it is interesting to note on the resulting video quality. Therefore, there is a that a closed form rate-distortion characterization for video coding is not available in the literature. In this need to investigate the tradeoffs between the bit-rate (R) paper, we develop a simple, yet, fundamental character- required to encode video and the resulting distortion ization of rate-distortion region in video coding based (D) in the reconstructed video. Rate-distortion (R-D) on information-theoretic first principles. The concept of analysis deals with lossy video coding and it establishes conditional motion estimation is used to derive the closed- a relationship between the two parameters by means of form expression for rate-distortion region without losing a Rate-distortion R(D) function [1], [6], [10]. Since R- its generality. Conditional motion estimation offers an D analysis is based on information-theoretic concepts, elegant means to analyze the rate-distortion trade-offs it places absolute bounds on achievable rates and thus and demonstrates the viability of achieving the bounds derives its significance.
    [Show full text]