Information Theory: MATH34600

Professor Oliver Johnson [email protected] Twitter: @BristOliver

School of Mathematics, University of Bristol

Teaching Block 1, 2019-20

Why study information theory?

Information theory was introduced by Claude Shannon in 1948. His paper A Mathematical Theory of Communication is available via the course webpage. It forms the basis for our modern world: information theory is used every time you make a phone call, take a selfie, download a movie, stream a song, or save files to your hard disk.
- We represent randomly generated ‘messages’ by sequences of 0s and 1s.
- Key idea: some sources of randomness are more random than others.
- We can compress random information (remove redundancy) to save space when saving it.
- We can pad out random information (introduce redundancy) to avoid errors when transmitting it.
- A quantity called entropy quantifies how well we can do.

Relationship to other fields (from Cover and Thomas)

Now also quantum information, security, algorithms, AI/machine learning (privacy), neuroscience, bioinformatics, ecology . . .

Course outline

15 lectures, 3 exercise classes, 5 mandatory HW sets. Printed notes are minimal; lectures will provide motivation and help. IT IS YOUR RESPONSIBILITY TO ATTEND LECTURES AND TO ENSURE YOU HAVE A FULL SET OF NOTES AND SOLUTIONS. Course webpage for notes, problem sheets, links, textbooks etc: https://people.maths.bris.ac.uk/~maotj/IT.html Drop-in sessions: 11.30-12.30 on Mondays, G83 Fry Building. Just turn up in these times. (Other times, I may be out or busy - but just email [email protected] to fix an appointment.) Notes inherited from Dr Karoline Wiesner – thanks! This material is copyright of the University unless explicitly stated otherwise. It is provided exclusively for educational purposes at the University and is to be downloaded or copied for your private study only.

Course outline (cont.)

Contents

1 Introduction

2 Section 1: Review

3 Section 2: Entropy and Variants

4 Section 3: Lossless Source Coding with Variable Codeword Lengths

5 Section 4: Lossy Source Coding with Fixed Codeword Lengths

6 Section 5: Channel Coding

Section 1: Probability Review

Objectives: by the end of this section you should be able to:
- Recall the necessary basic definitions from probability.
- Understand the relationship between joint, marginal and conditional probability etc.
- Use basic information-theoretic notation and terminology such as alphabet, message, source etc.

Section 1.1: Concepts and notation

Information theory can be viewed as a branch of applied probability. We will briefly review the concepts you are expected to know. We also set the notation used throughout the course. A summary of basic probability can also be found in Chapter 2 of MacKay’s excellent book Information Theory, Inference, and Learning Algorithms, available as pdf online, http://www.cambridge.org/0521642981.

Information sources

Imagine a source of randomness, e.g. tossing a coin, signals from space, sending an SMS. It produces a sequence of symbols from an alphabet (i.e. a finite set).

Definition 1.1. We use the following notation for finite sets, sequences, etc.:
- X, Y, Z, etc.: alphabets (i.e., finite sets).
- x ∈ X etc.: a letter from an alphabet.
- X × Y = {xy : x ∈ X, y ∈ Y}: Cartesian product of sets.(a)
- x^n = x_1 x_2 ... x_n: sequence of length n.
- X^n = X × X^{n−1} = {x^n = x_1 x_2 ... x_n : x_i ∈ X}: set of all strings/words/sequences of length n.
- X^0 = {ε}: contains only the sequence of length 0 (the empty string).
- X^* = X^0 ∪ X^1 ∪ X^2 ∪ ...: set of all strings of any length.

(a) Usually write the elements as the concatenation xy, not as an ordered pair (x, y).

Coin tossing

Example 1.2. Consider tossing coins. The alphabet X is the set {H, T} with letters ‘H’ and ‘T’. The sequence (or string) HTHH ∈ X^4. Want to quantify how likely this outcome is for (biased) coins.

Section 1.2: Random variables

Definition 1.3 (Discrete random variable). A discrete random variable (r.v.) X is a function from a probability space (Ω, P) into a set X. That is, for each outcome ω ∈ Ω we have X(ω) ∈ X.

X takes on one of a possible set of values X = {x_1, x_2, x_3, ..., x_{|X|}}. Write the probability P_X(x) ≡ P(X = x) ≥ 0. We always require that Σ_{x∈X} P_X(x) = Σ_{x∈X} P(X = x) = 1.

Example: English language

Example 1.4. Take a vowel randomly drawn from an English document.

x ∈ X:    a     e     i     o     u
P_X(x):   0.19  0.29  0.19  0.23  0.10

Note that these probabilities are normalised (rescaled to add up to 1). The total probability of a vowel in the complete alphabet (including ’space’) is 0.31.

Example 1.5. For an example of English letter and word frequencies see Shannon’s original paper from 1948, A Mathematical Theory of Communication, available online.

Jointly distributed random variables

Definition 1.6 (Jointly distributed random variables). XY is a jointly distributed random variable on X × Y if X and Y are random variables taking values on X and Y.

Joint distribution PXY s.t. PXY (xy) = P(X = x, Y = y), for all x ∈ X , y ∈ Y.

Remark 1.7.

We obtain the marginal distribution P_X of random variable X from the joint distribution P_{XY} by summation:
P_X(x) = P(X = x) = Σ_{y∈Y} P(XY = xy) = Σ_{y∈Y} P_{XY}(xy),
or in short-hand notation (where context is clear): P(x) = Σ_{y∈Y} P(xy).

Conditional probability

Definition 1.8. The conditional probability of a random variable X , given knowledge of Y , is calculated as follows:

P_{X|Y}(x|y) = P(XY = xy) / P(Y = y) = P_{XY}(xy) / P_Y(y), if P_Y(y) ≠ 0.

If P_Y(y) = 0 then the conditional probability is undefined. Note that Σ_{x∈X} P_{X|Y}(x|y) = 1 for any fixed y.

Remark 1.9. We will often use the product rule (also known as chain rule) for calculating probabilities:

PXY (xy) = PX |Y (x|y)PY (y) = PY |X (y|x)PX (x)

Independence

Definition 1.10. Two random variables X and Y are independent if and only if:

PXY (xy) = PX (x)PY (y),

for all x ∈ X and y ∈ Y.

Equivalently, the conditional probability mass function P_{X|Y}(x|y) does not depend on y. We assume that e.g. successive coin tosses are independent. If X and Y are independent, then so are f(X) and g(Y) for any functions f and g. In general, random variables X and Y are not necessarily independent.

Example: English text

Example 1.11. The ordered pair XY consisting of two successive letters in an English document is a jointly distributed r.v. The possible outcomes are ordered pairs such as aa, ab, ac, and zz. Of these, we might expect ab and ac to be more probable than aa and zz. For example, the most probable value for the second letter given that the first one was q is u. Successive letters in English are not independent.

Independent and identically distributed (i.i.d.)

Definition 1.12.

Random variables X1, X2,..., Xn are independent and identically distributed (i.i.d.) if they are independent and if their individual distributions are all identical.

In other words, there is a probability distribution PX on X such that

PX1X2···Xn (x1x2 ... xn) = PX (x1) ··· PX (xn).

This definition captures the intuitive notion of n independent repeated trials of the same experiment (observations of Xi ).

Section 1.3: Expectation and variance

Definition 1.13 (Expectation). If the set of outcomes X ⊂ R, the expectation or mean EX of a discrete random variable X is defined as
EX := Σ_{x∈X} P_X(x) x.

The expectation is an average of the values x taken by X, weighted by their probabilities P_X(x) = P(X = x). Note that the above sum is finite by the assumption of discreteness. Similarly, define the expectation of functions by Ef(X) := Σ_{x∈X} P_X(x) f(x).

Variance and covariance

Definition 1.14. For random variables X and Y : The variance Var X is defined as

Var X := E(X − EX)^2 = E(X^2) − (EX)^2.

The covariance of X and Y is
Cov(X, Y) := E[(X − EX) × (Y − EY)] = E(X × Y) − E(X) × E(Y).

Variance measures how spread out the distribution of X is around EX . Covariance measures whether there is a linear trend relating X and Y .

Basic properties of expectation and variance

Theorem 1.15. We have the following properties of E and Var. For real numbers λ, µ and jointly distributed random variables X, Y:
(i) E(λX + µY) = λEX + µEY.
(ii) min_a E(X − a)^2 = Var X.
(iii) Var(λX) = λ^2 Var X.
(iv) Var(X + Y) = Var X + Var Y + 2 Cov(X, Y).
(v) If X and Y are independent, then

E(X × Y ) = (EX ) × (EY ) or equivalently Cov (X , Y ) = 0.

Probability example

Example 1.16.

Consider the joint distribution given by

P_{XY}(xy)    X = −1    X = 1
Y = −1:        1/4        0
Y = 1:         3/20       3/5

Marginal distributions: P_X = (2/5, 3/5) and P_Y = (1/4, 3/4).
Conditional distributions:
- For X: P_{X|Y=−1} = (1, 0) and P_{X|Y=1} = (1/5, 4/5).
- For Y: P_{Y|X=−1} = (5/8, 3/8) and P_{Y|X=1} = (0, 1).
We obtain:
- EX = (2/5)(−1) + (3/5)(1) = 1/5
- EY = (1/4)(−1) + (3/4)(1) = 1/2
- E(X × Y) = (1/4)(1) + (3/20)(−1) + (3/5)(1) = 14/20.
Note E(X × Y) ≠ EX × EY, which shows that the two variables are not independent (could also tell this by the joint mass function).
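As a quick numerical cross-check of these figures (a sketch of my own, not part of the original notes), the following Python snippet recomputes the marginals, expectations and E(X × Y) exactly using fractions:

```python
from fractions import Fraction as F

# Joint distribution of Example 1.16: keys are (x, y) pairs.
P = {(-1, -1): F(1, 4), (1, -1): F(0), (-1, 1): F(3, 20), (1, 1): F(3, 5)}

# Marginals obtained by summing the joint distribution over the other variable.
PX = {x: sum(p for (a, b), p in P.items() if a == x) for x in (-1, 1)}
PY = {y: sum(p for (a, b), p in P.items() if b == y) for y in (-1, 1)}

EX = sum(x * p for x, p in PX.items())           # 1/5
EY = sum(y * p for y, p in PY.items())           # 1/2
EXY = sum(x * y * p for (x, y), p in P.items())  # 14/20 = 7/10

print(PX, PY, EX, EY, EXY, EXY == EX * EY)       # final entry False: not independent
```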

Weak law of large numbers

Theorem 1.17 (Weak law of large numbers).

Consider i.i.d. real valued random variables U1, U2,..., Un,... ∼ U such that E|U| < ∞. Define the empirical average as

S_n := (1/n) Σ_{i=1}^{n} U_i.
Then for all ε > 0,
lim_{n→∞} P(|S_n − EU| ≥ ε) = 0.

Fits with intuitive understanding that ‘things average themselves out’.
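To make the ‘averaging out’ concrete, here is a small Monte Carlo sketch (an illustration of my own, with an arbitrary Bernoulli(0.3) source and ε = 0.05) estimating P(|S_n − EU| ≥ ε) as n grows:

```python
import random

# Weak law of large numbers (Theorem 1.17) in action: empirical averages of
# i.i.d. Bernoulli(0.3) variables concentrate around the mean 0.3.
random.seed(0)
p, eps = 0.3, 0.05

def empirical_average(n):
    return sum(random.random() < p for _ in range(n)) / n

for n in (10, 100, 10_000):
    trials = [empirical_average(n) for _ in range(1000)]
    prob_far = sum(abs(s - p) >= eps for s in trials) / len(trials)
    print(n, prob_far)   # estimated P(|S_n - EU| >= eps) shrinks as n grows
```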

Jensen’s inequality

Theorem 1.18. If f is a convex function, then for any random variable X:

Ef (X ) ≥ f (EX ).

If f is strictly convex, then equality holds if and only if X is deterministic (if P(X = x0) = 1 for some x0).

Sketch Proof of Jensen’s inequality.

The claimed result is that Σ_{i=1}^{n} p_i f(x_i) ≥ f(Σ_{i=1}^{n} p_i x_i).
Convexity means p_1 f(x_1) + p_2 f(x_2) ≥ f(p_1 x_1 + p_2 x_2); i.e. the case n = 2.
Do induction; if the result is true for n, then take p_i^* = p_i/(1 − p_1) for i = 2, ..., n + 1, so that Σ_{i=2}^{n+1} p_i^* = 1.

Σ_{i=1}^{n+1} p_i f(x_i) = p_1 f(x_1) + (1 − p_1) Σ_{i=2}^{n+1} p_i^* f(x_i)
  ≥ p_1 f(x_1) + (1 − p_1) f(Σ_{i=2}^{n+1} p_i^* x_i)   (induction hypothesis)
  ≥ f(p_1 x_1 + (1 − p_1) Σ_{i=2}^{n+1} p_i^* x_i)   (convexity, n = 2 case)
  = f(Σ_{i=1}^{n+1} p_i x_i).

Section 2: Entropy and Variants

Objectives: by the end of this section you should be able to:
- Define and interpret the entropy of a random variable.
- Define relative entropy and prove that it is positive.
- Define joint entropy, conditional entropy and mutual information.
- Understand and prove the relationships between these quantities.

Section 2.1: Information sources and entropy

In A Mathematical Theory of Communication, Shannon writes The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages.

Information sources

This set of possible messages is generated by the information source, which is at the beginning of any communication process. An information source emits one message at a time sampled from a given probability distribution. Hence, such a source can be represented by a discrete random variable.

Definition 2.1 (Information source). An information source X is a discrete random variable with source alphabet X and probability distribution PX .

Information

Example 2.2.

Consider an information source generating outcomes (x1,..., xn) with probabilities (p1,..., pn) respectively.

Suppose outcome xi happens (this occurs with probability pi ). How surprised are we? How much have we learnt?

If pi is small, we’ve learned a lot (we are quite surprised).

If pi is big, we’re not so surprised.

Measure surprise via some decreasing function φ(·); surprise is φ(pi ). In fact, we will see that φ(u) = − log(u) is a good choice. This choice was actually introduced by Hartley in the 1920s.

Shannon Entropy

Definition 2.3 (Entropy).

The entropy of the random variable X is the following function of P_X:
H(X) ≡ H(P_X) := − Σ_{x∈X} P_X(x) log P_X(x).
Here, logs are taken to base 2. The unit of entropy is the bit. We adopt the convention 0 log 0 := 0.

Values of x irrelevant: only the (unordered) list of values P_X(x) matters.

Remark 2.4. The entropy H(X) is the expectation of the real random variable ϕ(X), with the function ϕ(x) := φ(P_X(x)) = − log P_X(x):
H(X) = E[− log P_X(X)].
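Since H(X) is just an expectation of − log P_X(X), it is a one-line computation. The sketch below (an illustration of my own, not from the notes) evaluates it for the fair coin of Example 2.5 and the vowel source of Example 1.4:

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits, H(P) = -sum p log2 p, with the convention 0 log 0 := 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Binary entropy H_2(1/2) and the vowel distribution of Example 1.4.
print(entropy([0.5, 0.5]))                      # 1.0 bit
print(entropy([0.19, 0.29, 0.19, 0.23, 0.10]))  # roughly 2.25 bits
```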

Example: Binary random variable

Example 2.5.

Let X = {0, 1}, i.e. |X | = 2, and PX (0) = p, PX (1) = 1 − p. Then, for 0 ≤ p ≤ 1:

H2(p) := H(X ) = −p log p − (1 − p) log(1 − p) .

Note H2(0) = H2(1) = 0 bits and H2(1/2) = 1 bit.

Draw the graph of − log p, −p log p and H2(p)?

Example: uniform distribution

Example 2.6.

Suppose X = {1, ..., m} and P_X(x) = 1/m for all x. Then
H(X) = − Σ_{x=1}^{m} (1/m) log(1/m) = log m.

Turning this round, a random variable with entropy H is ‘as unpredictable’ as a uniform distribution with 2^H outcomes. The bigger H is, the more unpredictable.

Motivation for entropy

Can motivate entropy axiomatically – Shannon did this. Shannon formulated his axioms in terms of the uncertainty of a random variable. In information theory a measure of uncertainty is equivalent to a measure of information. Idea: “uncertainty before outcome of random variable is observed” equals “expected amount of information obtained when outcome is observed”.

Axioms for entropy

Shannon wanted a measure of uncertainty H to have the following properties as a function of the probabilities P_X(x):

1 H should be continuous in the P_X(x) (small changes in P_X mean small changes in H).

2 If all the probabilities are equal, P_X(x) = 1/|X|, then H should be a monotonically increasing function of |X|.
3 If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

Illustration of third axiom

At the left we have three possibilities; on the right we first choose between two possibilities, and if the second occurs we make another choice with probabilities 1/3, 2/3. We require, in this special case, that
H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(1/3, 2/3).
The coefficient 1/2 appears because this second choice only occurs half the time. Shannon proved that the only function observing all axioms is of the form −K Σ_{x∈X} p_X(x) log p_X(x).

Properties of entropy

Proposition 2.7.

The entropy has the following properties:

(i) H(X) is a function only of the probabilities P_X(x), and not of their ordering or the labelling of X.
(ii) H(X) ≥ 0, with equality if and only if P_X is a point mass at some x_0 ∈ X, i.e. P_X(x) = 1 if x = x_0 and P_X(x) = 0 if x ≠ x_0.

(iii) H(X) ≤ log |X|, with equality if and only if P_X is uniform.
(iv) H(P) is a strictly concave function of P, i.e. for 0 ≤ λ ≤ 1 and distributions P_X and Q_X,
λ H(P_X) + (1 − λ) H(Q_X) ≤ H(λP_X + (1 − λ)Q_X),

with equality if and only if λ = 0 or λ = 1 or PX = QX .

Proof of Proposition 2.7.

(i) A sum is invariant under permutations.
(ii) H(X) is the sum of non-negative terms:
- Because 0 ≤ P_X(x) ≤ 1, each term −P_X(x) log P_X(x) ≥ 0, with equality if and only if P_X(x) = 0 or 1.
- Hence H(P_X) ≥ 0.
- If H(P_X) = 0, all terms in the sum must be zero, and hence each P_X(x) = 0 or 1.
(iii) (left as homework)
(iv) θ(t) = −t log t is strictly concave (θ''(t) = −1/(t · ln 2) < 0). Apply Jensen (Theorem 1.18) to a Bernoulli(λ) r.v. to deduce, for any x:

λθ(PX (x)) + (1 − λ)θ(QX (x)) ≤ θ (λPX (x) + (1 − λ)QX (x))

Summing over x ∈ X we deduce the result (see Remark 2.19 for a better proof).

Section 2.2: Relative entropy

We define a variant (a sort of generalization) of entropy:

Definition 2.8. Given two probability distributions P and Q taking values on the same set X, we define the relative entropy from P to Q by

D(P‖Q) = Σ_{x∈X} P(x) log( P(x) / Q(x) ).

Notice that if P(x) = Q(x) for all x then D(P‖Q) = 0. In fact (see below) D(P‖Q) ≥ 0, with equality if and only if P ≡ Q. Hence think of it as a kind of distance (even though it is not symmetric and satisfies no triangle inequality). D occurs in lots of applications in Statistics and Machine Learning; it is also called the Kullback–Leibler (KL) distance.
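The following short Python sketch (illustrative only; the two distributions are arbitrary choices of mine) computes D(P‖Q) and shows the lack of symmetry numerically:

```python
from math import log2

def relative_entropy(P, Q):
    """D(P||Q) = sum_x P(x) log2(P(x)/Q(x)); assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]
print(relative_entropy(P, Q), relative_entropy(Q, P))  # both positive, unequal: not symmetric
print(relative_entropy(P, P))                          # 0 when the distributions coincide
```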

Gibbs inequality

Theorem 2.9. For any probability distributions P and Q, the relative entropy satisfies

D(P‖Q) ≥ 0,

with equality if and only if P ≡ Q.

We often rearrange this as
− Σ_{x∈X} P_X(x) log P_X(x) ≤ − Σ_{x∈X} P_X(x) log Q_X(x)
to deduce upper bounds on entropy.

Proof of Gibbs inequality, Theorem 2.9

Proof. Notice that the function f(u) = − log u is strictly convex. Apply Jensen’s inequality (Theorem 1.18) to obtain

D(P‖Q) = Σ_{x∈X} P(x) (− log(Q(x)/P(x)))
       ≥ − log( Σ_{x∈X} P(x) Q(x)/P(x) ) = − log( Σ_{x∈X} Q(x) ) = 0.

We deduce that equality holds if and only if P(x)/Q(x) ≡ c for some c; but P and Q are probability distributions, so c = 1 and P ≡ Q.

Log-sum inequality

Corollary 2.10.

For any two collections of positive numbers (a_1, ..., a_n) and (b_1, ..., b_n), we have
Σ_{i=1}^{n} a_i log(a_i/b_i) ≥ (Σ_{i=1}^{n} a_i) log( (Σ_{i=1}^{n} a_i) / (Σ_{i=1}^{n} b_i) ),   (2.1)

with equality if and only if ai /bi is constant in i.

Proof of Corollary 2.10

Proof. Writing A = Σ_{i=1}^{n} a_i and B = Σ_{i=1}^{n} b_i, we can create probability distributions P(i) = a_i/A and Q(i) = b_i/B. Then positivity of D(P‖Q) gives

0 ≤ D(P‖Q) = Σ_{i=1}^{n} (a_i/A) log( (a_i/A) / (b_i/B) )
           = (1/A) ( Σ_{i=1}^{n} a_i log(a_i/b_i) − Σ_{i=1}^{n} a_i log(A/B) )
           = (1/A) ( Σ_{i=1}^{n} a_i log(a_i/b_i) − A log(A/B) ).
Corollary 2.10 is equivalent to the positivity of the bracketed term.

Equality holds if and only if ai /A ≡ bi /B.

Section 2.3: Joint and conditional entropy, mutual information

Given random variables X and Y , since the pair XY is itself a random variable, we can ask about its entropy. In fact, following definition isn’t new at all; it’s what you’d get if you treated XY as a random variable directly.

Definition 2.11.

Given random variables X ∈ X and Y ∈ Y with joint pmf P_{XY}, we can define the joint entropy
H(XY) = − Σ_{xy∈X×Y} P_{XY}(xy) log P_{XY}(xy).

Mutual information

Natural to ask how H(XY ) relates to H(X ) and H(Y ). Make the following definition

Definition 2.12. Given random variables X ∈ X and Y ∈ Y, we can define the mutual information I (X ; Y ) = H(X ) + H(Y ) − H(XY ).

I (X ; Y ) measures how much information there is in X about Y . Also how much information there is in Y about X . Symmetry is interesting

Symmetry of I and causality

– from A Diary on Information Theory by Alfred Rényi.

Mutual information does not capture causality. Need more advanced quantities: transfer entropy, directed information?

Noisy typewriter

Example 2.13.

Consider X uniformly distributed on {0,..., 3}. Define Z uniformly distributed on {0, 1} and Y = X + Z mod 4. Can write joint probability distribution of X and Y :

P_{XY}(x, y)   x = 0   x = 1   x = 2   x = 3
y = 0:          1/8     0       0       1/8
y = 1:          1/8     1/8     0       0
y = 2:          0       1/8     1/8     0
y = 3:          0       0       1/8     1/8

Noisy typewriter (cont.)

Example 2.13. Looking at marginals, can see X and Y are both uniform on set of size 4, so

H(X ) = H(Y ) = log 4 = 2 (see Example 2.6).

Similarly XY uniform on set of pairs of size 8 so

H(XY ) = log 8 = 3

Hence I (X ; Y ) = 2 + 2 − 3 = 1. In general taking n < m, if X uniform on {0,..., m − 1} and Y = X + Z mod m, where Z uniform on {0,..., n − 1} then

H(X ) = H(Y ) = log m, H(XY ) = log(mn), I (X ; Y ) = log(m/n) > 0.
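These values are easy to check numerically. The sketch below (my own check, not part of the notes) rebuilds the joint table of Example 2.13 and recovers H(X) = H(Y) = 2, H(XY) = 3 and I(X; Y) = 1:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Joint distribution of the noisy typewriter (Example 2.13): mass 1/8 on 8 cells.
joint = {(x, (x + z) % 4): 0.125 for x in range(4) for z in (0, 1)}

PX = [sum(p for (x, y), p in joint.items() if x == i) for i in range(4)]
PY = [sum(p for (x, y), p in joint.items() if y == j) for j in range(4)]

HX, HY, HXY = H(PX), H(PY), H(list(joint.values()))
print(HX, HY, HXY, HX + HY - HXY)   # 2.0 2.0 3.0 and I(X;Y) = 1.0
```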

Subadditivity of entropy

Lemma 2.14 (Subadditivity of entropy).

For any two jointly distributed random variables X and Y ,

H(XY ) ≤ H(X ) + H(Y ) i.e., H(PXY ) ≤ H(PX ) + H(PY ),

with equality if and only if X and Y are independent, i.e., if PXY = PX × PY .

We gain less information in total from the value of XY than the sum of the information we would gain by learning X and learning Y . In general X has clues about Y , unless they are independent.

Corollary 2.15.

Equivalently the mutual information I (X ; Y ) ≥ 0 with equality if and only if X and Y are independent.

Proof of Lemma 2.14.

Use Gibbs inequality (Theorem 2.9) with P = PXY and Q = PX × PY and definition of marginals:

0 ≤ D(P_{XY} ‖ P_X × P_Y)
  = Σ_{x∈X, y∈Y} P_{XY}(xy) log( P_{XY}(xy) / (P_X(x) P_Y(y)) )
  = Σ_{x∈X, y∈Y} P_{XY}(xy) log P_{XY}(xy) − Σ_{x∈X, y∈Y} P_{XY}(xy) log P_X(x) − Σ_{x∈X, y∈Y} P_{XY}(xy) log P_Y(y)
  = Σ_{x∈X, y∈Y} P_{XY}(xy) log P_{XY}(xy) − Σ_{x∈X} P_X(x) log P_X(x) − Σ_{y∈Y} P_Y(y) log P_Y(y)
  = −H(XY) + H(X) + H(Y).

Conditional entropy

Given two random variables X and Y we can consider the conditional probability distribution P_{Y|X}(·|x) for any value x. We can write
H(Y|X = x) := − Σ_{y∈Y} P_{Y|X}(y|x) log P_{Y|X}(y|x)
for the entropy of this conditional probability distribution.

Definition 2.16.

The average of H(Y|X = x) over P_X is called the conditional entropy,
H(Y|X) := Σ_{x∈X} P_X(x) H(Y|X = x).

H(Y |X ) represents how surprised we are to learn Y , given that we know X already.

Chain rule for entropy

Proposition 2.17.

1. For any random variables X and Y :

H(XY ) = H(X ) + H(Y |X ) (chain rule for entropy).

2. Equivalently,

I(X; Y) = H(Y) − H(Y|X)   (2.2)
        = H(Y) − Σ_{x∈X} P_X(x) H(Y|X = x)   (2.3)

3. H(Y|X) ≥ 0.
4. H(Y|X) ≤ H(Y), with equality if and only if X and Y are independent.

Noisy typewriter, Example 2.13 (cont.)

Example 2.18. For the noisy typewriter, for any x the conditional distribution Y|X = x has probability 1/2 for two values. Hence H(Y|X = x) = log 2 = 1 for all x. So H(Y|X) = Σ_{x∈X} P_X(x) H(Y|X = x) = 1. Equivalently, by the chain rule and the values previously calculated,

H(Y |X ) = H(XY ) − H(X ) = 3 − 2 = 1.

Perhaps not a surprise. Can argue that

H(Y |X ) = H(X + Z|X ) = H(Z|X ) = H(Z) = log 2 = 1.

(Don’t worry if you can’t follow this chain!)

Proof of Chain Rule, Proposition 2.17.

1. By definition P_{Y|X=x}(y) = P_{XY}(xy)/P_X(x). Hence
H(Y|X) = Σ_{x∈X} P_X(x) H(Y|X = x)
       = − Σ_{x∈X} P_X(x) Σ_{y∈Y} (P_{XY}(xy)/P_X(x)) log( P_{XY}(xy)/P_X(x) )
       = − Σ_{x∈X, y∈Y} P_{XY}(xy) (log P_{XY}(xy) − log P_X(x))
       = H(XY) + Σ_{x∈X} P_X(x) log P_X(x)
       = H(XY) − H(X).

3. Follows since H(Y |X = x) ≥ 0 for all x (since it is an entropy). 4. Combining Corollary 2.15 and (2.2) we deduce the result.

Proof of concavity of entropy

Remark 2.19. The result H(Y|X) ≤ H(Y) allows a better proof of Proposition 2.7(iv). Recall that Proposition 2.7(iv) claims that

λ H(P) + (1 − λ) H(Q) ≤ H(λP + (1 − λ)Q).

X is outcome of flipping a coin which is heads with probability λ. Consider Y with conditional distribution Y |X given as

Y |{ heads } ∼ P, Y |{ tails } ∼ Q

Then overall Y ∼ λP + (1 − λ)Q, so RHS is H(Y ) as required.

H(Y |X ) = P( heads )H(Y | heads ) + P( tails )H(Y | tails ) = λH(P) + (1 − λ)H(Q).

Intuition for the entropy zoo: Hu correspondence

Remark 2.20. Can use a 1-1 matching between entropies and the sizes µ of sets:

H(X) ←→ µ(A)
I(X; Y) ←→ µ(A ∩ B)
H(XY) ←→ µ(A ∪ B)
H(X|Y) ←→ µ(A \ B)

See MacKay P140:

Intuition: Hu correspondence (cont.)

Remark 2.20. Hu shows that for pairs of random variables any linear relationship in these H and I is true, if and only if the corresponding relationship in terms of set sizes also holds. For example Proposition 2.17

H(XY ) = H(X ) + H(Y |X ) ←→ µ(A ∪ B) = µ(A) + µ(B \ A).

For example Lemma 2.14

H(XY ) ≤ H(X ) + H(Y ) ←→ µ(A ∪ B) ≤ µ(A) + µ(B).

Recommend checking the other results of this chapter in the same spirit. Doesn’t count as a proof: gives intuition as to which results may hold.

Section 2.4: Further entropy results (not proved here)

Many further properties of entropy and related quantities are available.

Theorem 2.21.

For collections of random variables X_1, X_2, ... and Y_1, Y_2, ...:
1 (See Cover and Thomas Theorem 2.5.1.) Generalized chain rule (generalizes Proposition 2.17):

H(X_1, ..., X_n) = Σ_{i=1}^{n} H(X_i | X_1, ..., X_{i−1}).

Think of the cumulative surprise of learning the collection of random variables one by one.
2 General conditioning principle:

H(X_1, ..., X_i | Y_1, ..., Y_j) is increasing in i and decreasing in j.

Fano’s inequality

Theorem 2.22. Suppose that we have a random variable X, and a noisy version Y, and we try to estimate X by X̂ = g(Y) for some (random) function g. Then
P(X ≠ X̂) ≥ (H(X|Y) − 1) / log(|X| − 1).

Note, we sometimes write X → Y → X̂ = g(Y).

Proof (not examinable).

Write E = I(X ≠ X̂) for the error indicator random variable. Consider the conditional entropy H(X, E|X̂) in two ways.

Proof of Fano (cont.)

1 Using the conditional form of the chain rule (Proposition 2.17):

H(X, E|X̂) = H(X|X̂) + H(E|X, X̂) = H(X|X̂) ≥ H(X|X̂, Y) = H(X|Y).

Here, H(E|X, X̂) = 0 because if we know X and X̂ we know E.

2 Similarly, H(X, E|X̂) equals

H(E|X̂) + H(X|E, X̂)
  ≤ 1 + P(E = 0) H(X|E = 0, X̂) + P(E = 1) H(X|E = 1, X̂)
  = 1 + P(X ≠ X̂) H(X|E = 1, X̂)
  ≤ 1 + P(X ≠ X̂) log(|X| − 1),

because if E = 0 and we know X̂ then we know X.

Section 3: Lossless Source Coding with Variable Codeword Lengths

Objectives: by the end of this section you should be able to:
- Understand the relationships between decodability and the prefix-free property.
- State and prove the Kraft inequality.
- Understand the relationship between the entropy of a source and the expected length of optimal compression.

Section 3.1: Variable-length source coding

Want to efficiently represent output of source in binary. Want to be able to transmit it as cheaply as possible. Think of e.g. coding text or video.

Definition 3.1 (Variable-length source code). A (variable-length) binary source code for source X is an injective function f : X → {0, 1}^* from the source alphabet into the finite binary sequences. Say each f(x) is a codeword. The code being injective means no two symbols are mapped onto the same sequence. Otherwise we wouldn’t be able to decode.

How good is a code? Length function

Definition 3.2. For binary strings b^n ∈ {0, 1}^n ⊆ {0, 1}^* define the length function ℓ as ℓ(b^n) := n. Note that, considering source X and code f, the length variable L := ℓ(f(X)) is a real random variable. Write ℓ_X for brevity. Hence Eℓ_X = Σ_{x∈X} P(X = x) ℓ_x is the expected length of the code.

Unique decodability

We demanded that f : X → {0, 1}^* be injective. We would like this to be true for any sequence from the source alphabet X, not only single symbols. So, we extend the function f to all sequences composed from X and demand that it be injective as well:

Definition 3.3. To encode strings of symbols we concatenate codewords f(x):

f^* : X^* → {0, 1}^*,
x^n = x_1 ... x_n ↦ f^*(x_1 ... x_n) = f(x_1) ... f(x_n).

Definition 3.4 (Uniquely decodable code (UDC)). If the encoding function f ∗ is injective, we say that f is a uniquely decodable code (UDC).

UDC, or not UDC, that is the question

Example 3.5.

For X = {A, B, C, D} consider the encoding functions f , g:

     A    B    C    D
f    1    00   01   10
g    0    10   110  111

Are these encoding functions uniquely decodable? Explain why or why not. Hint: Try to decode 1101100111. f is not uniquely decodable because ABA and DC both map to 1001. g is uniquely decodable (though it may not be obvious why).

Prefix-free codes

Key idea: can write codewords of g on a tree – they all lie at leaves (never go through one codeword to get to another).

Definition 3.6 (Prefix-free code). A sequence a^k is a prefix of a sequence b^n if k ≤ n and there exists a sequence c^{n−k} such that b^n = a^k c^{n−k}. An encoding f is prefix-free if no codeword f(x) is the prefix of another one, f(y) (with y ≠ x).

Prefix-free implies UDC

Lemma 3.7. All prefix-free codes are uniquely decodable.

Proof. Scan the encoded sequence from left to right. Thinking of going down the tree: you are guaranteed to reach a unique codeword (and there will be no other codeword on that path). This lends prefix-free codes the name ‘instantaneously decodable’.

Remark 3.8. The converse is not necessarily true. Not all uniquely decodable codes are prefix-free.

Example 3.9. Code g from Example 3.5 is prefix-free, and hence instantaneously decodable. As one scans g^*(x^n) from the left, it begins either with 0, or with 10, or with 110, or with 111. Hence we can represent it on a tree. One can decide after reading the first ’0’ or three ’1’s what the first symbol will be. E.g.

110 111 110 0 0 0 10 10 0 110 111

gives CDCAAABBACD.

Section 3.2: Kraft inequality

We shall analyse the combinatorial properties of prefix-free codes. Of course, good codes are not just uniquely decodable but also short. Thus, our goal is to minimise the expected length of an encoding Eℓ_X over all uniquely decodable codes.

We denote this minimum, which we still need to find, by ℓ_min(X). We can’t use any old lengths: there is a constraint, called the Kraft inequality.

Remark 3.10 (Intuition for Kraft). If all codewords have the same length ℓ (say), there can be at most 2^ℓ of them, so
Σ_{x∈X} 2^{−ℓ_x} = Σ_{x∈X} 2^{−ℓ} = 2^{−ℓ} Σ_{x∈X} 1 ≤ 2^{−ℓ} 2^ℓ = 1.

In general, a word of length ℓ_x ‘blocks out’ a proportion 2^{−ℓ_x} of the tree.

Kraft inequality

Proposition 3.11 (Kraft inequality).

Let f : X → {0, 1}^* be a prefix-free code.

(a) Then the codeword lengths ℓ_x := ℓ(f(x)) satisfy the Kraft inequality,
Σ_{x∈X} 2^{−ℓ_x} ≤ 1.   (3.1)

(b) Conversely, for integers ℓ_x, x ∈ X, satisfying the Kraft inequality (3.1), there exists a prefix-free code f with ℓ(f(x)) = ℓ_x for all x ∈ X.

Proof of Kraft inequality, Proposition 3.11(a)

Proof.

To prove (a), let ℓ_0 := max_{x∈X} ℓ_x be the length of the longest codeword.

For each word x define the set of sequences of length ℓ_0 which start with the codeword f(x):

S_x := {b^{ℓ_0} : f(x) is a prefix of b^{ℓ_0}}
     = {f(x) c^{ℓ_0 − ℓ_x} : c^{ℓ_0 − ℓ_x} ∈ {0, 1}^{ℓ_0 − ℓ_x}}
     ⊆ {0, 1}^{ℓ_0}.

Observe that |S_x| = 2^{ℓ_0 − ℓ_x}.

Proof of Kraft inequality, Proposition 3.11(a) (cont.)

Proof.

We now show that the S_x are pairwise disjoint: otherwise there would be some b^{ℓ_0} ∈ S_x ∩ S_y (with x ≠ y).

Assuming w.l.o.g. that ℓ_x ≤ ℓ_y, we have:

(i)  b^{ℓ_0} = f(x) c^{ℓ_0 − ℓ_x} = f(x) c_1 ... c_{ℓ_y − ℓ_x} c_{ℓ_y − ℓ_x + 1} ... c_{ℓ_0 − ℓ_x}

(ii) b^{ℓ_0} = f(y) d^{ℓ_0 − ℓ_y} = f(y) d_1 ... d_{ℓ_0 − ℓ_y}.

(i) and (ii) must be equal, and so f(y) = f(x) c_1 ... c_{ℓ_y − ℓ_x}. This, however, means that f(x) is a prefix of f(y), which is a contradiction to the assumption that the code is prefix-free.

Proof of Kraft inequality, Proposition 3.11(a) (cont.)

Proof. Now we get by counting,

2^{ℓ_0} = |{0, 1}^{ℓ_0}|
        ≥ |∪_{x∈X} S_x|
        = Σ_{x∈X} |S_x|   (since the S_x are disjoint)
        = Σ_{x∈X} 2^{ℓ_0 − ℓ_x}.

Dividing by 2^{ℓ_0} we deduce that 1 ≥ Σ_{x∈X} 2^{−ℓ_x}. This concludes the proof of (a).

Proof of Kraft inequality, Proposition 3.11(b)

Proof.

To prove (b), let 1 ≥ Σ_{x∈X} 2^{−ℓ_x}. We construct the code recursively, i.e. by induction.

For the basis, if X has a single element x, s.t. ℓ_x = 0, then X = {x}, and we let f(x) = ε (the empty word) and are done. For the inductive step we use the following Lemma.

Lemma 3.12.

Let f : X → {0, 1}^* satisfy Kraft (i.e. Σ_{x∈X} 2^{−ℓ_x} ≤ 1). Then X can be partitioned as X = X_0 ∪ X_1, such that
Σ_{x∈X_i} 2^{−ℓ_x} ≤ 1/2, for i = 0, 1.

Proof of Kraft inequality, Proposition 3.11(b)

Proof. Before proving Lemma 3.12, let us apply it. It says that we can partition X into two subsets. Assume that you have prefix-free codes for the two subsets. That is, assume f_b : X_b → {0, 1}^*, b = 0, 1, with

ℓ(f_0(x)) = ℓ_x − 1 if x ∈ X_0,
ℓ(f_1(x)) = ℓ_x − 1 if x ∈ X_1.   (3.2)

Now define a prefix-free code for the entire set X_0 ∪ X_1 in the following way:
f(x) := 0 f_0(x) if x ∈ X_0,
f(x) := 1 f_1(x) if x ∈ X_1.

Proof of Kraft inequality, Proposition 3.11(b)

Proof. f is also prefix-free. That is, if f(x) were the prefix of some f(y), y ≠ x, then the two codewords would have to start with the same bit b.

But by definition, this would mean x, y ∈ X_b, and
f(x) = b f_b(x), f(y) = b f_b(y).

So fb(x) would be a prefix of fb(y) (or vice versa) which is a contradiction to the assumption.

Proof of Kraft inequality, Proposition 3.11(b)

Proof. Applying this construction recursively yields a prefix-free code for X, assuming Lemma 3.12 is true. We need Lemma 3.12 to partition the set X into pairwise disjoint sets. This guarantees that we can divide X into subsets repeatedly until we obtain sets containing only one symbol.

Motivation for Lemma 3.12 comes from a greedy method:

- Keep adding the x ∈ X with the smallest ℓ_x into X_0, as long as the total weight Σ_{x∈X_0} 2^{−ℓ_x} ≤ 1/2.
- This process either terminates by exhausting X, i.e. X_0 = X, with Σ_{x∈X_0} 2^{−ℓ_x} < 1/2; or in a set X_0 with Σ_{x∈X_0} 2^{−ℓ_x} = 1/2 (see below).
- Hence Σ_{x∈X_1} 2^{−ℓ_x} ≤ 1/2 (where we let X_1 := X \ X_0).

Proof of Lemma 3.12

Proof.

If Σ_{x∈X} 2^{−ℓ_x} ≤ 1/2, Lemma 3.12 is true (take X_0 = X and X_1 = ∅). Hence assume Σ_{x∈X} 2^{−ℓ_x} > 1/2. Use the greedy method outlined above and add elements to X_0. In fact, we will show that if Σ_{x∈X_0} 2^{−ℓ_x} < 1/2, then for the next largest ℓ_{x*},

Σ_{x∈X_0} 2^{−ℓ_x} + 2^{−ℓ_{x*}} ≤ 1/2.

Hence, adding the next largest ℓ_{x*} either does not fill up 1/2 or fills up 1/2 exactly.

Proof of Lemma 3.12 (cont.)

Proof.

Let ℓ_max be max{ℓ_x : x ∈ X_0}; by assumption 1 ≤ ℓ_max ≤ ℓ_{x*}. This allows us to rewrite our assumption:

Σ_{x∈X_0} 2^{−ℓ_x} < 2^{−1}
⟹ Σ_{x∈X_0} 2^{ℓ_max − ℓ_x} < 2^{ℓ_max − 1}
⟹ Σ_{x∈X_0} 2^{ℓ_max − ℓ_x} + 1 ≤ 2^{ℓ_max − 1}   (both sides are integers)
⟹ Σ_{x∈X_0} 2^{−ℓ_x} + 2^{−ℓ_max} ≤ 1/2
⟹ Σ_{x∈X_0} 2^{−ℓ_x} + 2^{−ℓ_{x*}} ≤ 1/2   (as ℓ_max ≤ ℓ_{x*}, so 2^{−ℓ_{x*}} ≤ 2^{−ℓ_max}).

Example

Example 3.13. Does a prefix-free code exist for the alphabet and word lengths:

x ∈ X:   A   B   C   D   E   F
ℓ_x:     2   2   3   4   4   4

Σ_{x∈X} 2^{−ℓ_x} = 1/4 + 1/4 + 1/8 + 1/16 + 1/16 + 1/16
                 = (4 + 4 + 2 + 1 + 1 + 1)/16 = 13/16 < 1.
Hence a code exists.
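A one-line check of the Kraft sum (a sketch of my own, not from the notes):

```python
from fractions import Fraction as F

# Kraft sum for the codeword lengths of Example 3.13.
lengths = {'A': 2, 'B': 2, 'C': 3, 'D': 4, 'E': 4, 'F': 4}
kraft_sum = sum(F(1, 2 ** l) for l in lengths.values())
print(kraft_sum, kraft_sum <= 1)   # 13/16 True, so a prefix-free code with these lengths exists
```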

Recap

Let’s summarise what we have done so far. Given a list of codeword lengths, we can study the following three properties:
(U) There exists a uniquely decodable code f : X → {0, 1}^* with lengths ℓ_x.
(P) There exists a prefix-free code f : X → {0, 1}^* with lengths ℓ_x.
(K) The Kraft inequality holds: Σ_{x∈X} 2^{−ℓ_x} ≤ 1.
We have shown so far that:

(K) ⇔ (P) ⇒ (U),
where the equivalence is Proposition 3.11 and the implication holds because all prefix-free codes are UDCs.

McMillan Theorem

The Kraft inequality can be strengthened to UDCs.

Theorem 3.14 (McMillan).

Let f : X → {0, 1}^* be a uniquely decodable code.
(a) Then it satisfies the Kraft inequality (3.1), that is
Σ_{x∈X} 2^{−ℓ_x} ≤ 1.

(b) Conversely, for integers ℓ_x, x ∈ X, satisfying the Kraft inequality (3.1), there exists a uniquely decodable code f with ℓ(f(x)) = ℓ_x for all x ∈ X.

Remark 3.15. Hence, all three properties – (U),(P), and (K) – are equivalent:

(K)⇔(P)⇔(U)

Proof of Theorem 3.14

Proof. Proving part (b) is trivial, using Proposition 3.11 and the fact that any prefix-free code is uniquely decodable. To prove part (a), let f : X → {0, 1}^* be a UDC and ℓ_max := max_x ℓ_x. Consider the kth extension of the code (i.e. code k-tuples using it). Since the code is uniquely decodable, the kth extension is non-singular, i.e. x^u ≠ x^v ⇒ f(x^u) ≠ f(x^v). If a(m) is the number of code words of length m (in the kth extension), we have a(m) ≤ 2^m (since the code is non-singular).

Proof of Theorem 3.14 (cont.)

Proof. For the extension code, the length of a code sequence is

ℓ_{x^k} := ℓ_{x_1 x_2 ... x_k} = Σ_{i=1}^{k} ℓ_{x_i}.

Consider the kth power of the sum in the Kraft inequality:

( Σ_{x∈X} 2^{−ℓ_x} )^k = Σ_{x_1∈X} ··· Σ_{x_k∈X} 2^{−ℓ_{x_1}} ··· 2^{−ℓ_{x_k}}
                       = Σ_{x^k∈X^k} 2^{−ℓ_{x_1}} ··· 2^{−ℓ_{x_k}}
                       = Σ_{x^k∈X^k} 2^{−ℓ_{x_1 x_2 ... x_k}} = Σ_{x^k∈X^k} 2^{−ℓ_{x^k}}.

Proof of Theorem 3.14 (cont.)

Proof. We now gather terms by word length, remembering that there are a(m) ≤ 2^m of length m. We find:

( Σ_{x∈X} 2^{−ℓ_x} )^k = Σ_{m=1}^{kℓ_max} a(m) 2^{−m} ≤ Σ_{m=1}^{kℓ_max} 2^m 2^{−m} = kℓ_max.

Hence, taking the kth root,

Σ_{x∈X} 2^{−ℓ_x} ≤ (kℓ_max)^{1/k}.
This inequality is true for any k. Taking the limit k → ∞ we obtain, as desired,
Σ_{x∈X} 2^{−ℓ_x} ≤ 1.

Section 3.3: Best prefix-free codes

Remember we set out to minimise the expected length over all uniquely decodable codes for a given source.

With Proposition 3.11, our task of minimising Eℓ_X over all uniquely decodable codes f simplifies as follows:
ℓ_min(X) = min_{f UDC} Σ_{x∈X} P_X(x) ℓ_x
         = min { Σ_{x∈X} P_X(x) ℓ_x : integers ℓ_x s.t. Σ_{x∈X} 2^{−ℓ_x} ≤ 1 }.
How do we solve this optimization problem?

Lower bound on ℓ_min(X)

First, we obtain a lower bound, by relaxing the constraint that the ℓ_x have to be integers to ℓ_x ∈ R. Then, writing Q_X(x) = 2^{−ℓ_x} (a sort of probability distribution) for the optimal choice of ℓ_x,

ℓ_min(X) = Σ_{x∈X} P_X(x) ℓ_x
         = − Σ_{x∈X} P_X(x) log Q_X(x)
         = Σ_{x∈X} P_X(x) log( P_X(x)/Q_X(x) ) − Σ_{x∈X} P_X(x) log P_X(x)
         ≥ 1 · log( 1 / Σ_{x∈X} Q_X(x) ) + H(X)
         ≥ H(X).

Here we use the log-sum inequality (Corollary 2.10), and the final result follows because Kraft gives us that Σ_{x∈X} Q_X(x) ≤ 1.

Entropy is the data compression limit

We have proved the following result:

Theorem 3.16 (Shannon’s noiseless coding theorem: converse part).

Given an information source X and a uniquely decodable code f , we have

H(X ) ≤ E`X , (3.3)

and equality holds iff `x := − log PX (x).

Remark 3.17. This bound is one of the key justifications for being interested in Shannon entropy: the entropy is the limit for data compression.

Upper bound on ℓ_min(X)

The lower bound suggests that we should find lengths ℓ_x for the code satisfying Kraft such that, for all x, 2^{−ℓ_x} is close to P_X(x). So we make the reasonable guess

ℓ_x := ⌈− log P_X(x)⌉. Because of
(a) − log P_X(x) ≤ ℓ_x and (b) ℓ_x < − log P_X(x) + 1,
this assignment satisfies the Kraft inequality

(by (a)): Σ_{x∈X} 2^{−ℓ_x} ≤ Σ_{x∈X} P_X(x) = 1.
Using (b) we find
Eℓ_X = Σ_{x∈X} P_X(x) ℓ_x < Σ_{x∈X} P_X(x) (− log P_X(x) + 1) = H(X) + 1.

Shannon’s noiseless coding theorem

We have just proved Shannon’s noiseless coding theorem:

Theorem 3.18 (Shannon’s noiseless coding theorem: direct part).

Given an information source X, the optimal code satisfies

H(X ) ≤ `min(X ) < H(X ) + 1 . (3.4)

In particular, we can construct a prefix-free code f , such that

ℓ_x = ⌈− log P(x)⌉.   (3.5)

Definition 3.19. Given an information source X , we say a UDC is Shannon optimal if

H(X ) ≤ E`X < H(X ) + 1. (3.6)

Constructing Shannon codes

Remark 3.20. Note, that in the above derivation of Shannon’s noiseless coding theorem we have a proof by construction.

Choose ℓ_x := ⌈− log P_X(x)⌉, then take the codeword lengths ℓ_x and sort them in increasing order. We can explicitly construct a code using a binary tree.

Take the first ℓ_x. Choose the left-most empty leaf at depth ℓ_x; this corresponds to the binary codeword f(x). All descendants of this leaf are eliminated.

Repeat this with all ℓ_x.

Kraft tells us that, since each word of length ℓ_x takes up a proportion 2^{−ℓ_x} of the tree, we can pack them in this way.

Constructing Shannon codes

Definition 3.21 (Shannon code). A code for information source X which is constructed by choosing ℓ_x := ⌈− log P_X(x)⌉ and with the above (or an equivalent) procedure is called a Shannon code.

Any Shannon code is Shannon optimal, per definition.

Shannon code example

Example 3.22.

Construct a Shannon code for the following information source. Can this code be improved in terms of its expected length?

x ∈ X:    A    B    C    D    E     F
P_X(x):   1/4  1/4  1/8  1/8  3/16  1/16

Direct calculation using ℓ_x = ⌈− log P_X(x)⌉ gives ℓ_x = {2, 2, 3, 3, 3, 4}. From the graph/tree we can read off the code:

x ∈ X:   A    B    C     D     E     F
f(x):    00   01   100   101   110   1110

The code can be improved by shortening the last code word to 111 without compromising unique decodability.
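The length calculation of Example 3.22 can be reproduced directly (an illustrative sketch of my own; the variable names are mine):

```python
from math import ceil, log2

# Shannon codeword lengths l_x = ceil(-log2 P(x)) for the source of Example 3.22.
probs = {'A': 1/4, 'B': 1/4, 'C': 1/8, 'D': 1/8, 'E': 3/16, 'F': 1/16}
lengths = {x: ceil(-log2(p)) for x, p in probs.items()}
expected_len = sum(p * lengths[x] for x, p in probs.items())
H = -sum(p * log2(p) for p in probs.values())
print(lengths)             # {'A': 2, 'B': 2, 'C': 3, 'D': 3, 'E': 3, 'F': 4}
print(H, expected_len)     # about 2.45 <= 2.5625 < H + 1, as Theorem 3.18 promises
```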

Section 3.4: Huffman codes

Example 3.22 shows that Shannon codes are Shannon optimal but do not necessarily achieve the shortest expected length of all UDCs. Thus, we define:

Definition 3.23 (Optimal code). A uniquely decodable code f for information source X is called optimal if

E`X = `min(X ).

We will now introduce a second method for constructing a UDC, the so-called Huffman construction. Theorem: Huffman codes are optimal in the above sense. A proof is not included here but can be found, for example in Ash Information Theory.

Constructing Huffman codes

Remark 3.24. The construction proceeds as follows (see the sketch after this list).
- Arrange the letters x ∈ X by decreasing probabilities.
- Add the two smallest probabilities to obtain a new probability distribution on one fewer symbol.
- Arrange the new probabilities again in decreasing order.
- Continue this process until there are only two probabilities left.
- When we have only two probabilities left, we assign codewords. Each time two probabilities were merged, we can represent that in a binary tree. Constructing the whole tree, we can assign 0 and 1 to each branching point by some rule.
- The resulting code is prefix-free by construction.
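The sketch below is a minimal Python rendering of this merging procedure, returning codeword lengths only (assigning the actual 0/1 labels is the final step described above). The function name huffman_lengths and the heap-based bookkeeping are my own choices, not part of the notes.

```python
import heapq
from itertools import count

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for a dict {symbol: probability}.

    Standard textbook construction: repeatedly merge the two least probable
    nodes; each merge adds one bit to the depth of every symbol merged.
    """
    tie = count()                      # tie-breaker so heapq never compares lists
    heap = [(p, next(tie), [s]) for s, p in probs.items()]
    heapq.heapify(heap)
    depth = {s: 0 for s in probs}
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            depth[s] += 1
        heapq.heappush(heap, (p1 + p2, next(tie), s1 + s2))
    return depth

print(huffman_lengths({'a': 0.2, 'b': 0.1, 'c': 0.3, 'd': 0.1, 'e': 0.3}))
# {'a': 2, 'b': 3, 'c': 2, 'd': 3, 'e': 2}, matching the lengths in Example 3.25
```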

Huffman code example

Example 3.25. Construct a Huffman code for the information source given below.

x ∈ X:    a    b    c    d    e
P_X(x):   0.2  0.1  0.3  0.1  0.3

A solution is

x ∈ X:   a    b     c    d     e
f(x):    10   111   01   110   00

Other solutions are possible, but they will have the same lengths.

Section 4: Lossy Source Coding with Fixed Codeword Lengths

Objectives: by the end of this section you should be able to:
- Understand that the optimal variable-length coding solutions of the previous section can be hard to compute.
- Describe an alternative of fixed-length coding, which is almost optimal with a fixed probability of error.
- Understand this in the context of the Asymptotic Equipartition Property and typical sets.

Section 4.1: Block codes

We have seen that for any r.v. X , there exist prefix-free codes f : X → {0, 1}∗ such that

H(X ) ≤ E`X < H(X ) + 1,

For instance, both the Shannon and the Huffman code have this property. The extra 1 bit in the upper bound might be OK when the entropy is large, but that is typically not the case. It is natural to ask if the theoretical lower bound can be reached by some coding strategy. We write P_n for a general probability distribution on X^n.

For example, we may choose to take P_2 as the product distribution, P_2 = P ⊗ P = P^{⊗2}. We introduce block coding, which we motivate as follows.

Motivating example

Example 4.1. Let X = {a, b}, and let P(a) := 1/4, P(b) := 3/4. The code f(a) := 0, f(b) := 1 obviously has minimal expected codelength, and is essentially unique. Let f_2 denote the extension of f to X^2, given by f_2(aa) = f(a)f(a) = 00, f_2(ab) = f(a)f(b) = 01, f_2(ba) = f(b)f(a) = 10, f_2(bb) = f(b)f(b) = 11. If we evaluate the expected codelength of f_2 with respect to P^{⊗2}, we get Eℓ_{X^2} = 2. Hence the expected number of bits per symbol transmitted remains at 1.

Motivating example (cont.)

Example 4.1. However, we can find a UDC for P⊗2 with shorter expected codelength.

For example, the Huffman code g_2 has g_2(aa) = 101, g_2(ab) = 11, g_2(ba) = 100, g_2(bb) = 0. The expected codeword length is Eℓ_{X^2} = 1.6875, so the number of bits per symbol is now about 0.84. What if we consider longer and longer extensions?

Block codes

Definition 4.2 (Block code). Let X be a finite set. A function f_n : X^n → {0, 1}^* is called a block code for blocklength n.

Unique decodability and the prefix-free property are defined for block codes the same way as for single-letter codes.

Expected length per letter

Remark 4.3. Given an information source X^n and a block code f_n : X^n → {0, 1}^* for blocklength n, we can write ℓ_{X^n} = ℓ(f_n(X^n)) for the length of the encoded block. With X^n ∼ P_n the expected codelength is

L_n := Eℓ_{X^n}.

To compare the performance of block codes for different blocklengths, consider the expected codelength per letter (or coding rate):

L_n / n = (1/n) Eℓ_{X^n}.
This tells us how many bits we need to use on average to encode one letter.

Key idea

Remark 4.4. By definition, if f_n is a Shannon-optimal code for (X^n, P_n) then

H(X^n) ≤ L_n < H(X^n) + 1.

Now, for independent X, Y, we have H(XY) = H(X) + H(Y), so H(X^n) = nH(X) (since P_n is a product distribution, i.e. successive symbols are independent). Dividing by n we get
H(X) ≤ L_n / n < H(X) + 1/n.
Using block codes for large enough blocklength n, the expected codelength per letter can be brought arbitrarily close to H(X).
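As an illustration (a sketch of my own, not from the notes), the snippet below applies a Shannon code to blocks from the source of Example 4.1 and shows L_n/n approaching H(X) ≈ 0.811 as n grows, always staying below H(X) + 1/n:

```python
from math import ceil, log2
from itertools import product

# Bits per letter of a Shannon code on blocks of length n, for the source of
# Example 4.1 (P(a) = 1/4, P(b) = 3/4); Remark 4.4 sandwiches this between
# H(X) and H(X) + 1/n.
P = {'a': 0.25, 'b': 0.75}
H = -sum(p * log2(p) for p in P.values())   # about 0.811 bits

for n in (1, 2, 4, 8, 16):
    block_probs = []
    for block in product(P, repeat=n):      # all blocks x^n in X^n
        q = 1.0
        for sym in block:
            q *= P[sym]                     # product (i.i.d.) distribution
        block_probs.append(q)
    # Shannon code on blocks: length ceil(-log2 q).
    bits_per_letter = sum(q * ceil(-log2(q)) for q in block_probs) / n
    print(n, round(bits_per_letter, 4), round(H + 1 / n, 4))
```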

Section 4.2: Fixed-length codes

As described, variable-length block codes have optimal performance. However, e.g. Huffman codes can be hard to calculate in practice. Also, variable-length codes are susceptible to transmission errors. Shannon combined the robustness of fixed-length coding with data compression in his lossy source coding theorem (Theorem 4.11). The price to pay for data compression is to allow errors in the decoding process, which can be made arbitrarily close to zero. The main idea is simple:
1 use block coding;
2 encode only those sequences that are “typical”, and ignore the rest;
3 show that the number of bits needed to encode the typical sequences grows with the entropy instead of log |X|; and
4 the error from ignoring the “atypical” sequences is negligible if the blocklength is large enough.

Fixed-length code definitions

Definition 4.5. Let X^n with n ∈ N be an information source with probability distribution P^{⊗n}.

A fixed-length block code for blocklength n is a pair (fn, gn), where

f_n : X^n → {0, 1}^k is the encoding function, and

g_n : {0, 1}^k → X^n is the decoding function.

The error probability of the code (f_n, g_n) is

p_e(f_n, g_n) := P{x^n ∈ X^n : g_n(f_n(x^n)) ≠ x^n}.

We agree to tolerate a certain level ε of errors; that is, we consider (f_n, g_n) such that p_e(f_n, g_n) ≤ ε. Key question: how small can k be?

Equivalent formulation

Remark 4.6. Finding a lossy source code for blocklength n and error threshold ε is equivalent to finding a subset A_n ⊂ X^n such that

P(X^n ∈ A_n) ≥ 1 − ε.   (4.1)

Code quality is measured by the codelength/blocklength ratio

(1/n) ⌈log |A_n|⌉,   (4.2)
the average number of bits to encode one letter. Again we take log to base 2 – if |A_n| ∼ 2^k then k ∼ log |A_n|.

Optimising (4.2) over all subsets A_n satisfying (4.1) is equivalent to optimising the codelength over all codes (f_n, g_n) with p_e(f_n, g_n) ≤ ε.

High-probability strings: Example

It is in principle clear how to determine the optimal set An. Order the elements of X n according to decreasing probability.

Keep adding elements to A_n until their total probability reaches 1 − ε.

Example 4.7.

Let X = {a, b} and P(a) = 1/3, P(b) = 2/3, with n = 3 and ε = 1/3. First, we list the probabilities in decreasing order. Adding the four largest probabilities, we have

P(bbb) + P(bba) + P(bab) + P(abb) = 20/27 > 2/3 = 1 − ε.

On the other hand, if we add fewer probabilities then we can’t reach the 1 − ε threshold. Hence there is a unique optimal set, given by A_3 = {bbb, bba, bab, abb}.

Typical sets

While optimal, this procedure can be computationally intensive and difficult to analyse.

Shannon realised that instead of this optimal choice of An, we can choose a suboptimal set, the so-called typical set. This set is still asymptotically optimal, and much easier both to work with and to analyse.

Definition 4.8. Let X be a finite set with probability distribution P. A sequence xn ∈ X n is entropy δ-typical (for P), if for some δ ≥ 0

| −(1/n) log P(x^n) − H(P) | = | (1/n) Σ_{i=1}^{n} (− log P(x_i)) − H(P) | ≤ δ.   (4.3)

We call the collection of sequences satisfying (4.3) the entropy δ-typical set (or just ‘typical set’), denoted Tn,δ.

Equivalent formulation of typical set

Remark 4.9. From Definition 4.8 we can immediately see that for x^n ∈ T_{n,δ}:
H(P) − δ ≤ −(1/n) log P(x^n) ≤ H(P) + δ.   (4.4)
Equivalently,

2^{−n(H(P)+δ)} ≤ P(x^n) ≤ 2^{−n(H(P)−δ)}.   (4.5)

That is, all strings in the typical set T_{n,δ} essentially have the same probability ≈ 2^{−nH(P)}. There can’t be more than 2^{n(H(P)+δ)} of them, so we can code T_{n,δ} with binary strings of length about n(H(P) + δ). See the proof of the direct part of Theorem 4.11 for a more formal version of this argument.
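A simulation sketch (illustrative parameters of my own choosing) that estimates P(T_{n,δ}) for a biased coin, previewing the AEP stated next:

```python
import random
from math import log2

# Monte Carlo estimate of P(T_{n,delta}) for a Bernoulli(0.25) source: the
# probability of the entropy delta-typical set tends to 1 as n grows.
random.seed(1)
p, delta = 0.25, 0.05
H = -(p * log2(p) + (1 - p) * log2(1 - p))   # about 0.811 bits

def is_typical(n):
    xs = [random.random() < p for _ in range(n)]
    neg_log_prob = -sum(log2(p) if x else log2(1 - p) for x in xs)
    return abs(neg_log_prob / n - H) <= delta

for n in (100, 1000, 10000):
    hits = sum(is_typical(n) for _ in range(500))
    print(n, hits / 500)   # estimated P(T_{n,delta}) increases towards 1
```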

Asymptotic Equipartition Property (AEP)

Theorem 4.10 (AEP).

Let X be a finite set with probability distribution P and let Tn,δ be the corresponding entropy δ-typical set. Then

lim_{n→∞} P(T_{n,δ}) = 1.

i.e. for large n, most of the probability is concentrated on the set Tn,δ, over which the probability distribution is more or less uniform. This is called the Asymptotic Equipartition Property (AEP).

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 108 / 157 Proof of Theorem 4.10

Proof. It is a special case of the weak law of large numbers, Theorem 1.17.

Key is to consider random variables $U_i = -\log P(X_i)$ (the log-probability of the $i$th symbol observed). The $U_i$ are i.i.d. with mean $\sum_{x \in \mathcal{X}} P(x)(-\log P(x)) = H(P)$. We write

$$S_n = \frac{1}{n}\sum_{i=1}^n U_i = \frac{\sum_{i=1}^n -\log P(X_i)}{n}.$$

Theorem 1.17 gives

$$\lim_{n\to\infty} P(T_{n,\delta}) = \lim_{n\to\infty} P\big(|S_n - H(P)| \le \delta\big) = 1.$$
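To see the AEP numerically, here is a small Monte Carlo sketch (my own illustration, not part of the notes): it estimates $P(T_{n,\delta})$ for increasing $n$ by sampling i.i.d. sequences, exactly as in the weak-law argument above; the estimates should climb towards 1.

```python
import math
import random

random.seed(0)
probs = {"a": 1/3, "b": 2/3}
symbols, weights = list(probs), list(probs.values())
H = -sum(p * math.log2(p) for p in probs.values())
delta = 0.05

def prob_typical(n, trials=5000):
    """Monte Carlo estimate of P(T_{n,delta}) for i.i.d. samples."""
    hits = 0
    for _ in range(trials):
        xs = random.choices(symbols, weights=weights, k=n)
        S_n = sum(-math.log2(probs[x]) for x in xs) / n
        hits += abs(S_n - H) <= delta
    return hits / trials

for n in (10, 100, 1000):
    print(n, prob_typical(n))   # estimates increase towards 1 as n grows
```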

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 109 / 157 Shannon’s lossy fixed-length source coding theorem

Theorem 4.11.

Direct part: For every rate $R > H(P)$, there exists a sequence of subsets $A_n \subset \mathcal{X}^n$ such that
$$\lim_{n\to\infty} \frac{1}{n}\lceil \log |A_n| \rceil \le R \quad \text{and} \quad \lim_{n\to\infty} P(A_n) = 1.$$

Converse part: For any rate $R < H(P)$ and any sequence of subsets $A_n \subset \mathcal{X}^n$ such that $\lim_{n\to\infty} \frac{1}{n}\lceil \log |A_n| \rceil \le R$, we have that the limit (if it exists)

$$\lim_{n\to\infty} P(A_n) \neq 1.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 110 / 157 Proof of Theorem 4.11

Direct part. Let $R > H(P)$, and let $\delta > 0$ be such that $H(P) + \delta \le R$.

Take $A_n = T_{n,\delta}$. Then we have

$$1 = \sum_{x^n \in \mathcal{X}^n} P(x^n) \ge \sum_{x^n \in T_{n,\delta}} P(x^n) \ge \sum_{x^n \in T_{n,\delta}} 2^{-n(H(P)+\delta)} = |T_{n,\delta}|\, 2^{-n(H(P)+\delta)}.$$

Hence we deduce that $|T_{n,\delta}| \le 2^{n(H(P)+\delta)}$, or rearranging,
$$\frac{1}{n}\lceil \log |T_{n,\delta}| \rceil \le H(P) + \delta + \frac{1}{n} \le R + \frac{1}{n}.$$
Taking the limit $n \to \infty$, the first claim follows. The second claim follows from the AEP, Theorem 4.10.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 111 / 157 Proof of Theorem 4.11

Converse part. For $R < H(P)$, let $0 < \delta < 1$ be such that $R + 2\delta \le H(P)$.

Consider $n$ large enough that the typical set $T_{n,\delta}$ has probability $\ge 1 - \delta$ (this will happen by the AEP, Theorem 4.10).

Construct the optimal set $A_n$ as in Example 4.7 above. First use all the strings of probability $\ge 2^{-n(H(P)-\delta)}$ (these are not typical).

These contribute at most $\delta$ to the probability $P(A_n)$. Hence we need to use typical strings as well. However, we can use at most $2^{nR}$ typical strings by assumption. These contribute at most

$$2^{nR}\, 2^{-n(H(P)-\delta)} \le 2^{-n\delta}$$

to the probability $P(A_n)$. In total $P(A_n) \le \delta + 2^{-n\delta}$, which does not converge to 1.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 112 / 157 Section 5: Channel Coding

Objectives: by the end of this section you should be able to Describe noisy communication channels. Understand the definitions of rate and capacity of a channel. Know how to calculate and bound capacity for some practical examples.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 113 / 157 Section 5.1: Introduction to channel coding

So far, we have considered the scenario where our encoded message could be perfectly recovered (unless we choose to discard low-probability strings). Of course, in reality this is rarely the case, and all sorts of errors beyond our control might happen. For instance, bits might be flipped or erased. We show how to find a reliable coding scheme by using longer codewords, i.e., by fighting the noise by introducing redundancy. Reliability here means that we can control the error using long enough codewords, so the decoding error probability can be made arbitrarily small.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 114 / 157 Model for communication channel

Can describe a channel through a collection of conditional probabilities $P_{Y|X} = W$.

Definition 5.1. Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets. A channel $W$ with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$ is a function $W : \mathcal{X} \times \mathcal{Y} \to [0,1]$ satisfying
$$\sum_{y \in \mathcal{Y}} W(y|x) = 1 \quad \text{for all } x \in \mathcal{X}.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 115 / 157 Interpretation of channel Remark 5.2. The above condition says that for every x ∈ X , the function W (y|x) is a probability distribution on Y. The interpretation is that for a given input x, the channel produces a random output y according to the probability distribution W (y|x).

That is, $W(y|x) = P_{Y|X}(y|x)$ encodes the conditional probabilities of output given input. The numbers $W(y|x)$ are called the transition probabilities of the channel.

Given an ordering of $\mathcal{X} = \{x_1, \ldots, x_n\}$ and $\mathcal{Y} = \{y_1, \ldots, y_m\}$, $W$ can be represented by a matrix, which we also denote by $W$, with entries

$$W_{ij} := W(y_j | x_i), \quad \text{satisfying } \sum_j W(y_j|x_i) = 1 \text{ for each } i \text{ (rows sum to 1)}.$$

Such a matrix is called a stochastic matrix (cf. Markov chains?)

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 116 / 157 Binary symmetric channel

Example 5.3.

The binary symmetric channel with error parameter $p \in [0,1]$ has the same input and output alphabets $\mathcal{X} = \mathcal{Y} = \{0,1\}$. It is given by the stochastic matrix

$$W = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}.$$

We denote this channel by $BS_p$. The interpretation is that every input bit is either flipped, with probability $p$, or transmitted faithfully, with probability $1-p$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 117 / 157 Erasure channel

Example 5.4. The binary erasure channel with erasure parameter $p \in [0,1]$ has input alphabet $\mathcal{X} = \{0,1\}$ and output alphabet $\mathcal{Y} = \{0, 1, e\}$. It is given by the stochastic matrix (writing the last column for $e$)

$$W = \begin{pmatrix} 1-p & 0 & p \\ 0 & 1-p & p \end{pmatrix}.$$

We denote this channel by $BE_p$. Every input bit is either transmitted faithfully, with probability $1-p$, or erased, with probability $p$. Erasure of the input is represented by the output symbol $e$.
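For concreteness, here is a short NumPy sketch (my own illustration, not part of the notes) representing the transition matrices of Examples 5.3 and 5.4 and checking the stochastic-matrix property from Definition 5.1 (rows sum to 1).

```python
import numpy as np

def BS(p):
    """Binary symmetric channel BS_p as a stochastic matrix (rows = inputs 0, 1)."""
    return np.array([[1 - p, p],
                     [p, 1 - p]])

def BE(p):
    """Binary erasure channel BE_p (columns ordered 0, 1, e)."""
    return np.array([[1 - p, 0.0, p],
                     [0.0, 1 - p, p]])

for W in (BS(0.1), BE(0.1)):
    # Definition 5.1: each row W(.|x) must be a probability distribution
    assert np.allclose(W.sum(axis=1), 1.0)
    print(W)
```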

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 118 / 157 Joint channel distribution

The transition probabilities of channel W define, for every input signal x ∈ X , a probability distribution W (y|x) on the output alphabet Y.

Definition 5.5.

A channel $W$ induces a joint probability distribution $P_{XY}$ on $\mathcal{X} \times \mathcal{Y}$ for every probability distribution $P_X$. It is defined as
$$P_{XY}(xy) = P_X(x)\, W(y|x).$$

Check that $P_{XY}$ is a probability distribution on $\mathcal{X} \times \mathcal{Y}$ (adds up to 1).

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 119 / 157 Output distributions Can consider the marginal distribution of Y induced by resulting joint distribution PXY . Definition 5.6.

Given a probability distribution $P_X$ on the input alphabet $\mathcal{X}$, $W$ defines a probability distribution $P_Y$ on the output: for any $y \in \mathcal{Y}$,
$$P_Y(y) = \sum_{x \in \mathcal{X}} P_{XY}(xy) = \sum_{x \in \mathcal{X}} P_X(x) W(y|x).$$

Remark 5.7.

(cf. Markov chains) Can write $P_X$ as a row vector.

Check that the output distribution $P_Y$ is then obtained as the row vector $P_X W$, i.e. by multiplying this row vector on the left of $W$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 120 / 157 Output distributions example

Example 5.8.

Let $W = BE_p$ be the binary erasure channel with error parameter $p$, and let the input distribution $P_X$ be $P_X(0) = 1/3$, $P_X(1) = 2/3$. Then
$$P_Y(0) = P_X(0)W(0|0) + P_X(1)W(0|1) = \tfrac{1}{3}(1-p) + 0 = \tfrac{1-p}{3},$$
$$P_Y(1) = P_X(0)W(1|0) + P_X(1)W(1|1) = 0 + \tfrac{2}{3}(1-p) = \tfrac{2(1-p)}{3},$$
$$P_Y(e) = P_X(0)W(e|0) + P_X(1)W(e|1) = \tfrac{1}{3}p + \tfrac{2}{3}p = p.$$
Note that $\sum_{y \in \mathcal{Y}} P_Y(y) = 1$, as it should.
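As a sanity check of Example 5.8 and Remark 5.7 (my own sketch), the output distribution can be obtained by multiplying the input row vector by the transition matrix.

```python
import numpy as np

p = 0.1
# BE_p transition matrix, columns ordered (0, 1, e)
W = np.array([[1 - p, 0.0, p],
              [0.0, 1 - p, p]])
P_X = np.array([1/3, 2/3])        # input distribution as a row vector

P_Y = P_X @ W                     # Remark 5.7: P_Y = P_X W
print(P_Y)                        # [(1-p)/3, 2(1-p)/3, p] = [0.3, 0.6, 0.1]
assert np.isclose(P_Y.sum(), 1.0)
```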

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 121 / 157 Section 5.2: The Shannon capacity

Definition 5.9.

Let $W$ be a channel and $P_X$ be a probability distribution on $\mathcal{X}$. The mutual information between the input $X$ and the output $Y$ of the channel for the input distribution $P_X$ is

$$I(X;Y) = D(P_{XY} \| P_X \times P_Y) \qquad (5.1)$$

$$= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{XY}(xy) \log \frac{P_{XY}(xy)}{P_X(x)P_Y(y)} = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_X(x)W(y|x) \log \frac{P_X(x)W(y|x)}{P_X(x)P_Y(y)}.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 122 / 157 Shannon capacity definition Definition 5.10 (Shannon capacity). Let W be a channel. The Shannon capacity of W is

$$C(W) := \max_{P_X} I(X;Y),$$
i.e., the maximum mutual information between the input and the output.

Remark 5.11. Recall from (2.2) that

$$I(X;Y) = H(Y) - H(Y|X) \quad \text{(better for calculations?)} \qquad (5.2)$$
$$I(X;Y) = H(X) - H(X|Y) \quad \text{(better for thinking?)} \qquad (5.3)$$

(5.3) is the 'amount by which uncertainty about $X$ is reduced on learning $Y$'.

Maximising over PX gives us a free chance to optimize over the distribution we send through the channel.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 123 / 157 Remarks

The significance of C(W ) comes from Shannon’s noisy channel coding theorem, Theorem 5.25. This connects Shannon capacity to the optimal rate of communication achieved by using the channel many times. Before giving this theorem in the next section, we explore some properties of the Shannon capacity. In general there is no closed formula for the Shannon capacity. It can be explicitly computed only in some very special cases; we will see a few such examples below. For general channels, the capacity can only be evaluated numerically. Information inequalities are useful both in estimating the Shannon capacity and in evaluating it when possible.
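As the remark above notes, capacity usually has to be evaluated numerically. For a channel with a binary input alphabet the maximisation is over a single parameter, so a simple grid search suffices; the sketch below (my own illustration, not an algorithm from the notes; the function names are invented) computes $I(X;Y)$ for each candidate input distribution and takes the maximum. For larger input alphabets one would typically use a dedicated algorithm such as Blahut-Arimoto.

```python
import numpy as np

def mutual_information(P_X, W):
    """I(X;Y) in bits for input distribution P_X and channel matrix W."""
    P_XY = P_X[:, None] * W                    # joint distribution
    P_Y = P_X @ W                              # output marginal
    mask = P_XY > 0
    return np.sum(P_XY[mask] * np.log2(P_XY[mask] /
                                       (P_X[:, None] * P_Y[None, :])[mask]))

def capacity_binary_input(W, grid=10001):
    """Grid search over P_X = (q, 1-q) for a channel with two input symbols."""
    qs = np.linspace(0.0, 1.0, grid)
    return max(mutual_information(np.array([q, 1 - q]), W) for q in qs)

p = 0.1
BS_p = np.array([[1 - p, p], [p, 1 - p]])
BE_p = np.array([[1 - p, 0.0, p], [0.0, 1 - p, p]])
print(capacity_binary_input(BS_p))   # approx 0.531 = 1 - H(0.1, 0.9)
print(capacity_binary_input(BE_p))   # approx 0.9   = 1 - p
```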

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 124 / 157 Capacity bounds

Remark 5.12. A trivial upper bound on the capacity can be obtained from (5.2), which yields (since $H(Y|X)$ is non-negative)

$$I(X;Y) \le H(Y) \le \log |\mathcal{Y}|$$

for any input distribution $P_X$, and hence $C(W) \le \log |\mathcal{Y}|$. It can be shown in a similar way using (5.3) that $I(X;Y) \le H(X) \le \log |\mathcal{X}|$, and hence

$$C(W) \le \log |\mathcal{X}|.$$

Conversely, the mutual information for any fixed input distribution gives a lower bound on the capacity, by definition:

$$C(W) \ge I(X;Y) \quad \text{for a given } P_X.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 125 / 157 Section 5.3: Shannon capacity examples

Example 5.13.

Let $W = BS_p$ be the binary symmetric channel, Example 5.3. Note that the rows of the transition matrix are permutations of each other, and hence

$$H(Y|X=0) = H(Y|X=1) = H(p, 1-p).$$

Using (5.2), we have, for any input distribution $P_X$,

$$I(X;Y) = H(Y) - P_X(0)H(Y|X=0) - P_X(1)H(Y|X=1) = H(Y) - H(p, 1-p).$$

Only the first term depends on $P_X$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 126 / 157 Example: Binary Symmetric Channel (cont.)

Example 5.13.

Choosing $P_X$ to be the uniform input distribution, we get $P_Y(0) = P_Y(1) = 1/2$, and hence $H(Y) = 1$, which is the maximum value possible. We deduce that
$$C(BS_p) = 1 - H(p, 1-p).$$

$C(BS_p)$ takes its maximum value 1 for $p = 0$ and $p = 1$, i.e., when there is no error ($p = 0$) and when the error is trivially correctible ($p = 1$).

In general, $C(BS_p)$ is symmetric in $p$ (same value for $p$ and $1-p$): could flip every bit if $p \ge 1/2$.

$C(BS_p) = 0$ if and only if $p = 1/2$, in which case the output distribution of the channel is independent of the input signal.
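The closed form is easy to tabulate; the short sketch below (my own illustration) evaluates $C(BS_p) = 1 - H(p, 1-p)$ for a few values of $p$, showing the symmetry about $p = 1/2$ and the zero at $p = 1/2$.

```python
import math

def binary_entropy(p):
    """H(p, 1-p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(p, 1 - binary_entropy(p))   # C(BS_p); symmetric about p = 1/2, zero at 1/2
```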

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 127 / 157 Weakly symmetric channels

The above argument to determine C(BSp) used only some symmetry properties of the channel. It can be generalised to a large class of channels with the same symmetry properties:

Definition 5.14 (Weakly symmetric channel).

We say that a channel $W$ is weakly symmetric if

1 the rows of the transition matrix are permutations of each other, and

2 there exists a constant $c > 0$ such that all column sums of the transition matrix are equal to $c$ or to zero.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 128 / 157 Capacity of weakly symmetric channels

Proposition 5.15.

For a weakly symmetric channel W

$$C(W) = \log(\#\{\text{columns with non-zero sum}\}) - H(Y|X=x),$$

for any $x \in \mathcal{X}$.

Proof. As before, the $H(Y|X=x)$ are all the same (the entropy of a row).

Hence $H(Y|X)$ is fixed, and does not depend on $P_X$. The uniform distribution on the input $X$ gives a uniform distribution on the outputs $Y$ corresponding to non-zero column sums: by property 2, each such output has probability $c/|\mathcal{X}|$, while outputs with zero column sum have probability 0, so $H(Y)$ attains its maximum possible value $\log(\#\{\text{columns with non-zero sum}\})$. (See this from the 'row vector product' interpretation of Remark 5.7?)

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 129 / 157 Example: Binary erasure channels

The binary erasure channel $BE_p$ satisfies the first property in Definition 5.14. However, in general it is not weakly symmetric, as the column sums are $1-p$, $1-p$, $2p$, which are not all the same unless $p = 1/3$ (the extreme cases $p = 0$ and $p = 1$, used below, are also weakly symmetric). Nevertheless, we can still give an explicit formula for its capacity using a convexity property of the Shannon capacity, which we prove below.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 130 / 157 Convex combinations of distributions

Let $P_1, \ldots, P_r$ be probability distributions on the same finite set $\mathcal{X}$. A convex combination of these probability distributions with weights $t_1, \ldots, t_r \ge 0$, $t_1 + \cdots + t_r = 1$, is given by

$$t_1 P_1 + t_2 P_2 + \cdots + t_r P_r,$$

which again is a probability distribution on $\mathcal{X}$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 131 / 157 Relative entropy of convex combinations

Proposition 5.16.

The relative entropy is a jointly convex function of its arguments, i.e.

$$D(t_1 P_1 + t_2 P_2 + \cdots + t_r P_r \,\|\, t_1 Q_1 + t_2 Q_2 + \cdots + t_r Q_r) \le t_1 D(P_1 \| Q_1) + \cdots + t_r D(P_r \| Q_r).$$

Proof. For any $x$, the log-sum inequality (Corollary 2.10) gives:

$$\left(\sum_{i=1}^r t_i P_i(x)\right) \log \frac{\sum_{i=1}^r t_i P_i(x)}{\sum_{i=1}^r t_i Q_i(x)} \;\le\; \sum_{i=1}^r t_i P_i(x) \log \frac{t_i P_i(x)}{t_i Q_i(x)},$$

and the result follows by summing over $x$.
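Proposition 5.16 is also easy to check numerically; the sketch below (my own illustration, not part of the notes) verifies the joint-convexity inequality on randomly generated pairs of distributions for $r = 2$.

```python
import numpy as np

rng = np.random.default_rng(1)

def D(P, Q):
    """Relative entropy D(P||Q) in bits (assumes Q > 0 wherever P > 0)."""
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

def random_dist(k):
    v = rng.random(k)
    return v / v.sum()

# Check Proposition 5.16 numerically on random pairs of distributions (r = 2)
k, t = 4, 0.3
for _ in range(1000):
    P1, P2, Q1, Q2 = (random_dist(k) for _ in range(4))
    lhs = D(t * P1 + (1 - t) * P2, t * Q1 + (1 - t) * Q2)
    rhs = t * D(P1, Q1) + (1 - t) * D(P2, Q2)
    assert lhs <= rhs + 1e-12
print("joint convexity verified on 1000 random examples")
```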

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 132 / 157 Convex combinations of channels

Definition 5.17. We can form a convex combination of channels:

$$t_1 W_1 + \cdots + t_n W_n.$$

This convex combination is again a well-defined channel.

The interpretation of $t_1 W_1 + \cdots + t_n W_n$ is that we use the channel $W_i$ with probability $t_i$.

Proposition 5.18.

The Shannon capacity is a convex function of the channel:

$$C(t_1 W_1 + \cdots + t_n W_n) \le t_1 C(W_1) + \cdots + t_n C(W_n).$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 133 / 157 Convexity of Shannon capacity

Proof. The joint probability is $P_{XY}(xy) = P_X(x) \sum_i t_i W_i(y|x)$. Similarly $P_Y(y) = \sum_{i=1}^n t_i \sum_u P_X(u) W_i(y|u) = \sum_{i=1}^n t_i P_{Y,i}(y)$. Hence in (5.1), using Proposition 5.16:

$$I(X;Y) = D(P_{XY} \| P_X \times P_Y) = D\!\left(\sum_{i=1}^n t_i\, P_X W_i \,\Big\|\, \sum_{i=1}^n t_i\, P_X \times P_{Y,i}\right) \le \sum_{i=1}^n t_i\, D(P_X W_i \| P_X \times P_{Y,i}).$$

Each term $D(P_X W_i \| P_X \times P_{Y,i})$ is the mutual information for the channel $W_i$ with input distribution $P_X$, so is at most $C(W_i)$. Maximising over $P_X$ gives the result.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 134 / 157 Binary erasure channel lower bound

Example 5.19.

First, we derive a lower bound for the capacity $C(BE_p)$ of the binary erasure channel.

Choose the particular input distribution $P_X(0) = P_X(1) = 1/2$. The marginal distribution on the output is then
$$P_Y(0) = P_Y(1) = \frac{1-p}{2}, \qquad P_Y(e) = p.$$
The joint distribution on input and output is
$$P_{XY}(0,0) = P_{XY}(1,1) = \frac{1-p}{2}, \qquad P_{XY}(0,e) = P_{XY}(1,e) = \frac{p}{2}.$$
From these distributions we can compute the entropies $H(X) = 1$, $H(Y) = H(p, 1-p) + 1 - p$, and $H(XY) = 1 + H(p, 1-p)$. Hence
$$C(BE_p) \ge I(X;Y) = H(X) + H(Y) - H(XY) = 1 - p.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 135 / 157 Binary erasure channel upper bound

Example 5.19. To give an upper bound, we decompose the channel as

$$BE_p = (1-p)\,BE_0 + p\,BE_1,$$

where $BE_0$ and $BE_1$ are the binary erasure channels with parameters $p = 0$ and $p = 1$, respectively.

Note that both $BE_0$ and $BE_1$ are weakly symmetric, and hence we can compute their Shannon capacities using Proposition 5.15 as

$$C(BE_0) = \log 2 - H(1, 0, 0) = 1,$$

$$C(BE_1) = \log 1 - H(0, 0, 1) = 0,$$

since the number of columns with non-zero sum is 2 for $BE_0$ and 1 for $BE_1$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 136 / 157 Binary erasure channel upper bound

Example 5.19. Hence, by the convexity of the Shannon capacity (Proposition 5.18) we have

$$C(BE_p) \le (1-p)\,C(BE_0) + p\,C(BE_1) = 1 - p.$$

Combining upper and lower bound we find

$$C(BE_p) = 1 - p.$$
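The decomposition and the resulting capacity are easy to verify numerically (my own sketch, using the fact from Example 5.19 that the uniform input achieves the maximum):

```python
import numpy as np

def BE(p):
    """Binary erasure channel BE_p, columns ordered (0, 1, e)."""
    return np.array([[1 - p, 0.0, p],
                     [0.0, 1 - p, p]])

def mutual_information(P_X, W):
    """I(X;Y) in bits for input row vector P_X and channel matrix W."""
    P_XY = P_X[:, None] * W
    P_Y = P_X @ W
    mask = P_XY > 0
    return np.sum(P_XY[mask] * np.log2(P_XY[mask] /
                                       (P_X[:, None] * P_Y[None, :])[mask]))

p = 0.3
# the convex decomposition BE_p = (1-p) BE_0 + p BE_1 of Example 5.19
assert np.allclose(BE(p), (1 - p) * BE(0.0) + p * BE(1.0))

# the uniform input achieves I(X;Y) = 1 - p, matching C(BE_p)
print(mutual_information(np.array([0.5, 0.5]), BE(p)))   # 0.7 = 1 - p
```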

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 137 / 157 Section 5.4: Memoryless channels and block coding

We will use the same channel many times, and assume the channel acts independently and in the same way on consecutive inputs. Then $n$ uses of the channel $W$ can be described as a new channel $W^{\otimes n}$ with input alphabet $\mathcal{X}^n$ and output alphabet $\mathcal{Y}^n$, given by

$$W^{\otimes n}(y_1 \ldots y_n | x_1 \ldots x_n) = W(y_1|x_1) \cdots W(y_n|x_n),$$

where each $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. Such a channel is called memoryless or i.i.d. We call $n$ the blocklength. The idea is not to transmit all the possible messages; we use a smaller set of codewords, hoping errors will be reduced.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 138 / 157 Memoryless channel example

Example 5.20.

Let $W = BS_p$ be a binary symmetric channel with parameter $p$. Then
$$W^{\otimes 3}(000|001) = W(0|0)W(0|0)W(0|1) = p(1-p)^2,$$
$$W^{\otimes 3}(111|001) = W(1|0)W(1|0)W(1|1) = p^2(1-p),$$
$$W^{\otimes 3}(010|010) = W(0|0)W(1|1)W(0|0) = (1-p)^3, \quad \text{etc.}$$
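A product-channel probability is just a product of per-symbol transition probabilities; this little sketch (my own, not from the notes) reproduces the numbers in Example 5.20.

```python
def BS(p):
    """BS_p transition probabilities as a nested dict: W[x][y]."""
    return {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}

def product_channel(W, ys, xs):
    """W^{⊗n}(y_1...y_n | x_1...x_n) for a memoryless channel W."""
    prob = 1.0
    for x, y in zip(xs, ys):
        prob *= W[x][y]
    return prob

p = 0.1
W = BS(p)
print(product_channel(W, [0, 0, 0], [0, 0, 1]))   # p(1-p)^2 = 0.081
print(product_channel(W, [1, 1, 1], [0, 0, 1]))   # p^2(1-p) = 0.009
print(product_channel(W, [0, 1, 0], [0, 1, 0]))   # (1-p)^3  = 0.729
```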

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 139 / 157 Block channel coding and (transmission) rate Definition 5.21.

Let $W$ be a channel. A block code $C_n$ for $n$ uses of the channel is a triple $C_n = (M_n, f_n, g_n)$, where

$M_n$ is the size of the code (number of possible messages transmitted); $f_n : \{1, \ldots, M_n\} \to \mathcal{X}^n$ is the encoding function; $g_n : \mathcal{Y}^n \to \{1, \ldots, M_n\}$ is the decoding function.

Definition 5.22 (Rate).

The rate of a block code $C_n$ with $M_n$ codewords taken from the binary alphabet $\{0,1\}^n$ is defined as

$$R(C_n) := \frac{1}{n} \log M_n.$$

In general, if we have $M_n$ codewords from $\mathcal{X}^n$, the rate is $\log M_n / (n \log |\mathcal{X}|)$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 140 / 157 Rate example

Example 5.23. Suppose the blocklength is $n = 10$ and we use the BSC, Example 5.3. If we use all $1024 = 2^{10}$ possible input words, the rate is 1. If we only use $32 = 2^5$ of them, the rate is $1/2$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 141 / 157 Error probabilities

Definition 5.24.

For each message $i \in \{1, \ldots, M_n\}$, define the success probability $p_{s,i}(M_n, f_n, g_n)$ as

$$p_{s,i}(M_n, f_n, g_n) := \sum_{y^n \in \mathcal{Y}^n : g_n(y^n) = i} W^{\otimes n}(y^n | f_n(i))$$

(the probability of decoding the message as $i$ when the codeword $f_n(i)$ was sent). The probability of erroneous decoding of $i$ is then

$$p_{e,i}(M_n, f_n, g_n) := 1 - p_{s,i}(M_n, f_n, g_n) = \sum_{y^n \in \mathcal{Y}^n : g_n(y^n) \neq i} W^{\otimes n}(y^n | f_n(i)).$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 142 / 157 Error probabilities (cont.)

Definition 5.24.

Error probability $p_{e,i}$ is defined for each message. Combine into an overall value either as:

1 The average error probability of the code,
$$p_{e,\mathrm{av}}(M_n, f_n, g_n) := \frac{1}{M_n} \sum_{i=1}^{M_n} p_{e,i}(M_n, f_n, g_n);$$

2 The maximum error probability,
$$p_{e,\max}(M_n, f_n, g_n) := \max_{i \in \{1,\ldots,M_n\}} p_{e,i}(M_n, f_n, g_n).$$

Of course, we have $p_{e,\mathrm{av}}(M_n, f_n, g_n) \le p_{e,\max}(M_n, f_n, g_n)$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 143 / 157 Balancing the rate and error probability

To use the channel as efficiently as possible, we should maximise transmission rate R while keeping the error probability as small as possible. Clearly we expect a tradeoff between these two quantities. The fundamental theorem in channel coding, due to Shannon, states that the best achievable rate is equal to the Shannon capacity of the channel.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 144 / 157 Shannon’s noisy channel coding theorem Theorem 5.25 (Shannon’s noisy channel coding theorem).

Let $W$ be a channel with Shannon capacity $C(W)$.

Direct part: For any rate $R < C(W)$, there exists a sequence of block codes $(M_n, f_n, g_n)$, $n \in \mathbb{N}$, such that
$$\lim_{n\to\infty} \frac{1}{n} \log M_n \ge R \quad \text{and} \quad \lim_{n\to\infty} p_{e,\max}(M_n, f_n, g_n) = 0.$$

Strong converse part: For any rate $R > C(W)$ and any sequence of block codes $(M_n, f_n, g_n)$, $n \in \mathbb{N}$, such that $\lim_{n\to\infty} \frac{1}{n} \log M_n \ge R$, we have
$$\lim_{n\to\infty} p_{e,\max}(M_n, f_n, g_n) = 1.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 145 / 157 Remark 5.26. The above formulation of the channel coding theorem (Theorem 5.25) is very similar to the lossy source coding theorem (Theorem 4.11). However, there is also a significant difference between the two. The proof of the direct part of Theorem 4.11 is constructive, i.e., the proof not only gives the existence of a sequence of good codes, but also a construction of one. The proof of Theorem 5.25 is different. It shows that the average of the error probability over a large number of random codes is small, and hence there has to exist at least one code with small error probability. This argument only proves the existence and doesn't provide a construction of a good code.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 146 / 157 Motivation for noisy channel theorem Recall Example 2.13, the noisy typewriter. Example 5.27.

Consider $X$ uniformly distributed on $\mathcal{X} = \{0, \ldots, 3\}$. Define $Z$ uniformly distributed on $\{0,1\}$, independent of $X$, and $Y = X + Z \bmod 4$. Can write the joint probability distribution of $X$ and $Y$:

P_XY     X = 0   X = 1   X = 2   X = 3
Y = 0    1/8     0       0       1/8
Y = 1    1/8     1/8     0       0
Y = 2    0       1/8     1/8     0
Y = 3    0       0       1/8     1/8

Recall that $I(X;Y) = H(X) + H(Y) - H(XY) = 2 + 2 - 3 = 1$; half of the maximum value ($\log |\mathcal{X}| = 2$). Can transmit with $p_e = 0$: only use $X = 0$ and $X = 2$. If $Y \in \{0,1\}$, we know $X = 0$; if $Y \in \{2,3\}$, we know $X = 2$. Rate $R = \log 2 / \log 4 = 1/2$.
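To check the arithmetic (my own sketch), one can compute $H(X)$, $H(Y)$ and $H(XY)$ directly from the joint table and hence obtain $I(X;Y) = 1$.

```python
import numpy as np

# joint distribution P_XY of the noisy typewriter: rows Y = 0..3, columns X = 0..3
P = np.array([[1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]]) / 8.0

def H(p):
    """Entropy in bits of the distribution given by the array p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X, H_Y, H_XY = H(P.sum(axis=0)), H(P.sum(axis=1)), H(P)
print(H_X, H_Y, H_XY, H_X + H_Y - H_XY)   # 2.0 2.0 3.0 and I(X;Y) = 1.0
```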

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 147 / 157 Motivation for noisy channel coding theorem

Noisy typewriter motivates intuitive argument for Theorem 5.25. It relies on the existence of jointly typical sequences. Idea is that all channels are roughly like noisy typewriters.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 148 / 157 Jointly typical sequences

Definition 5.28 (Jointly entropy-δ typical sequence).

Let $\mathcal{X}$, $\mathcal{Y}$ be finite sets with joint probability distribution $P_{XY}$. A sequence pair $(x^n, y^n)$, $x^n \in \mathcal{X}^n$, $y^n \in \mathcal{Y}^n$, is jointly entropy $\delta$-typical (for $P_{XY}$) if, for some $\delta \ge 0$:

$$\left| \frac{\sum_{i=1}^n -\log P_X(x_i)}{n} - H(X) \right| \le \delta, \qquad \left| \frac{\sum_{i=1}^n -\log P_Y(y_i)}{n} - H(Y) \right| \le \delta, \quad \text{and} \quad \left| \frac{\sum_{i=1}^n -\log P_{XY}(x_i y_i)}{n} - H(XY) \right| \le \delta.$$

We denote the set of jointly entropy $\delta$-typical sequences as $A_{n,\delta}$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 149 / 157 Joint asymptotic equipartition property

We have the equivalent to the asymptotic equipartition property. Theorem 5.29.

Let $\mathcal{X}$, $\mathcal{Y}$ be finite sets with joint probability distribution $P_{XY}$.

Let $A_{n,\delta}$ be the corresponding jointly entropy $\delta$-typical set. Then for $(X_i, Y_i)$ i.i.d. $\sim P_{XY}$:

1 $\lim_{n\to\infty} P\big((X^n, Y^n) \in A_{n,\delta}\big) = 1$.

2 $|A_{n,\delta}| \le 2^{n(H(XY)+\delta)}$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 150 / 157 Heuristic argument for Theorem 5.25

Remark 5.30. From Theorem 5.29, we can assume the random codeword and output are jointly typical. There are $\simeq 2^{nH(Y)}$ typical outputs, each roughly equally likely.

Each input $x$ corresponds to a set of outputs $A_x$. Here $|A_x| \simeq 2^{nH(Y|X)}$, again each output roughly equally likely. This suggests we can use a subset of $M_n = 2^{nH(Y)}/2^{nH(Y|X)}$ inputs, so that the $A_x$ are disjoint but cover the set of possible outputs.

This scheme has rate $\log M_n / n = H(Y) - H(Y|X) = I(X;Y)$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 151 / 157 Section 5.5: Rate/error trade-off example

We illustrate the trade-off between the rate and the error probability with the following example.

Example 5.31. Assume that our set of messages is $\{A, B, C, D\}$. We want to find a fixed-length binary code for this set and the binary symmetric channel $W = BS_p$ with error probability $p = 1/10$.

In other words, we want to communicate $M_2 = 4$ different messages, i.e. two bits of information, with two uses of the channel.

So we need to choose encoding and decoding functions $f_2$ and $g_2$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 152 / 157 Example 5.31. The trivial encoding is:

$$f_2 : A \mapsto 00, \quad B \mapsto 01, \quad C \mapsto 10, \quad D \mapsto 11,$$

and define $g_2$ to be the inverse of $f_2$. The probability of correctly decoding $A$ is then

$$p_{s,A} = (1-p)^2 = 0.81,$$

and hence the error probability is

$$p_{e,A} = 1 - (1-p)^2 = 0.19.$$

Similarly, the error probability for each message is also 0.19.

And so $p_{e,\mathrm{av}}(M_2, f_2, g_2) = p_{e,\max}(M_2, f_2, g_2) = 0.19$. The rate of this code is $R(C_2) = \frac{1}{2}\log 4 = 1$.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 153 / 157 Example 5.31. The error probability 0.19 is rather high; in order to decrease it, we can use longer codewords. For instance, we can construct the following code with codelength n = 5; i.e., we use the channel 5 times to send each of our M2 = 4 messages:

$$f_5 : A \mapsto 00000, \quad B \mapsto 00111, \quad C \mapsto 11010, \quad D \mapsto 11101. \qquad (5.4)$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 154 / 157 Maximum likelihood decoding

Example 5.31. In general we use maximum likelihood decoding: given an output sequence $y^5 \in \{0,1\}^5$, we decode it as the message for which the probability of the output being $y^5$ is maximal.

For instance, we decode 00010 as $g_5(00010) := A$, since the message $A$ yields the sequence 00010 at the output with a strictly higher probability than any other message. It is possible to have more than one maximum; e.g., for the output sequence 01100 both $A$ and $D$ maximise the likelihood of getting 01100 at the output. In such cases, we decode the output sequence as one of the maximum likelihood messages.

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 155 / 157 Minimum distance decoding

Example 5.31. Note that any two codewords in (5.4) differ from each other in at least 3 bits. Hence, any output sequence $y^5$ that differs from a codeword $f_5(i)$ in at most one bit is strictly more likely to result from the message $i$ than from any other message, and hence will be decoded as $i$. For every $i \in \{A, B, C, D\}$, we define a "ball of Hamming distance $d$" around it as

$$B_d(i) := \{y^n \in \mathcal{Y}^n : |\{k : y_k \neq (f_n(i))_k\}| \le d\}.$$

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 156 / 157 Example 5.31.

Setting $d = 1$ and $n = 5$, then every sequence in $B_1(i)$ is decoded as $i$, and hence

$$p_{s,i}(M_5, f_5, g_5) \ge \sum_{y^5 \in B_1(i)} W^{\otimes 5}(y^5 | f_5(i)) = (1-p)^5 + 5p(1-p)^4 = 0.91854. \qquad (5.5)$$

Since (5.5) is true for all $i$, we have

$$p_{e,i}(M_5, f_5, g_5) = 1 - p_{s,i}(M_5, f_5, g_5) \le 1 - 0.91854 = 0.08146,$$

which is less than half of the error probability of the trivial code. The price to pay for this smaller error probability is a smaller rate; indeed, the rate of this code is
$$R(C_5) = \frac{1}{5} \log 4 = 0.4.$$
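As a final check (my own sketch, not part of the notes), one can enumerate all $2^5$ output sequences, decode each by minimum Hamming distance (equivalent to maximum likelihood here since $p < 1/2$), and compute the exact error probability of each message for the code (5.4); each comes out at or below the bound 0.08146 from (5.5).

```python
from itertools import product

p = 0.1
codewords = {"A": "00000", "B": "00111", "C": "11010", "D": "11101"}   # code (5.4)

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def channel_prob(y, x):
    """W^{⊗5}(y|x) for the binary symmetric channel BS_p."""
    d = hamming(y, x)
    return (p ** d) * ((1 - p) ** (len(x) - d))

def decode(y):
    """Minimum-distance decoding (ties broken arbitrarily)."""
    return min(codewords, key=lambda m: hamming(y, codewords[m]))

for message, x in codewords.items():
    p_success = sum(channel_prob("".join(y), x)
                    for y in product("01", repeat=5)
                    if decode("".join(y)) == message)
    print(message, 1 - p_success)   # each error probability is at most 0.08146

print("rate:", 2 / 5)   # log 4 / 5 = 0.4
```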

Oliver Johnson ([email protected]) Information Theory: @BristOliver TB 1 c UoB 2019 157 / 157