Santa Fe Institute Working Paper 11-05-XXX
arxiv.org:1105.XXXX [physics.gen-ph]

Anatomy of a Bit: Information in a Time Series Observation

Ryan G. James,1,2,∗ Christopher J. Ellison,1,2,† and James P. Crutchfield1,2,3,‡
1 Complexity Sciences Center and 2 Physics Department, University of California at Davis, One Shields Avenue, Davis, CA 95616
3 Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501
(Dated: October 24, 2018)

Appealing to several multivariate information measures—some familiar, some new here—we analyze the information embedded in discrete-valued stochastic time series. We dissect the uncertainty of a single observation to demonstrate how the measures' asymptotic behavior sheds structural and semantic light on the generating process's internal information dynamics. The measures scale with the length of the time window, which captures both intensive (rates of growth) and subextensive components. We provide interpretations for the components, developing explicit relationships between them. We also identify the informational component shared between the past and the future that is not contained in a single observation. The existence of this component directly motivates the notion of a process's effective (internal) states and indicates why one must build models.

Keywords: entropy, total correlation, multivariate mutual information, binding information, entropy rate, predictive information rate

PACS numbers: 02.50.-r 89.70.+c 05.45.Tp 02.50.Ey 02.50.Ga

A single measurement, when considered in the context of the past and the future, contains a wealth of information, including distinct kinds of information. Can the present measurement be predicted from the past? From the future? Or, only from them together? Or not at all? Is some of the measurement due to randomness? Does that randomness have consequences for the future or is it simply lost? We answer all of these questions and more, giving a complete dissection of a measured bit of information.

I. INTRODUCTION

In a time series of observations, what can we learn from just a single observation? If the series is a sequence of coin flips, a single observation tells us nothing of the past nor of the future. It gives a single bit of information about the present—one bit out of the infinite amount the time series contains. However, if the time series is periodic—say, alternating 0s and 1s—then with a single measurement in hand, the entire observation series need not be stored; it can be substantially compressed. In fact, a single observation tells us the oscillation's phase. And, with this single bit of information, we have learned everything—the full bit that the time series contains. Most systems fall somewhere between these two extremes. Here, we develop an analysis of the information contained in a single measurement that applies across this spectrum.

Starting from the most basic considerations, we deconstruct what a measurement is, using this to directly step through and preview the main results. With that framing laid out, we reset, introducing and reviewing the relevant tools available from multivariate information theory, including several that have been recently proposed. At that point, we give a synthesis employing the information measures and the graphical equivalent of the information diagram. The result is a systematic delineation of the kinds of information that the distribution of single measurements can contain and their required contexts of interpretation. We conclude by indicating what is missing in previous answers to the measurement question above, identifying what they do and do not contribute, and why alternative state-centric analyses are ultimately more comprehensive.

II. A MEASUREMENT: A SYNOPSIS

For our purposes an instrument is simply an interface between an observer and the system to which it attends. All the observer sees is the instrument's output—here, we take this to be one of k discrete values. And, from a series of these outputs, the observer's goal is to infer and to understand as much about the system as possible—how predictable it is, what are the active degrees of freedom, what resources are implicated in generating its behavior, and the like.

The first step in reaching the goal is that the observer must store at least one measurement. How many decimal digits must its storage device have? To specify which one of k instrument outputs occurred, the device must use log10 k decimal digits. If the device stores binary values, then it must provide log2 k bits of storage. This is the maximum for a one-time measurement. If we perform a series of n measurements, then the observer's storage device must have a capacity of n log2 k bits.

Imagine, however, that over this series of measurements it happens that output 1 occurs n1 times, 2 occurs n2 times, and so on, with k occurring nk times. It turns out that the storage device can have much less capacity, using less, sometimes substantially less, than n log2 k bits.

To see this, recall that the number M of possible sequences of n measurements with counts n1, n2, . . . , nk is given by the multinomial coefficient:

  M = n! / (n1! n2! · · · nk!) .

So, to specify which sequence occurred we need no more than

  k log2 n + log2 M + log2 log2 n + · · ·

bits. The first term is the maximum number of bits to store the count ni of each of the k output values. The second term is the number of bits needed to specify the particular observed sequence within the class of sequences that have counts n1, n2, . . . , nk. The third term is the number b of bits to specify the number of bits in n itself. Finally, the ellipsis indicates that we have to specify the number of bits to specify b (log2 log2 n) and so on, until there is less than one bit.

We can make sense of this, and so develop a helpful comparison to the original storage estimate of n log2 k bits, if we apply Stirling's approximation: n! ≈ √(2πn) (n/e)^n. For a sufficiently long measurement series, a little algebra gives:

  log2 M ≈ −n Σ_{i=1}^{k} (ni/n) log2(ni/n)
         = n H[n1/n, n2/n, . . . , nk/n]

bits for n observations. Here, the function H[P] is Shannon's entropy of the distribution P = (n1/n, n2/n, . . . , nk/n). As a shorthand, when discussing the information in a random variable X that is distributed according to P, we also write H[X]. Thus, to the extent that H[X] ≤ log2 k, as the series length n grows the observer can effectively compress the original series of observations and so use less storage than n log2 k bits.

The relationship between the raw measurement (log2 k) and the average-case view (H[X]), that we just laid out explicitly, is illustrated in the contrast between Figs. 1(a) and 1(b). The difference R1 = log2 k − H[X] is the amount of redundant information in the raw measurements. As such, the magnitude of R1 indicates how much they can be compressed.

Information storage can be reduced further, since using H[X] as the amount of information in a measurement implicitly assumed the instrument's outputs were statistically independent. And this, as it turns out, leads to H[X] being an overestimate of the amount of information in X. For general information sources, there are correlations and restrictions between successive measurements that violate this independence assumption and, helpfully, we can use these to further compress sequences of measurements X1, X2, . . . , Xℓ. Concretely, information theory tells us that the irreducible information per observation is given by the Shannon entropy rate:

  hµ = lim_{ℓ→∞} H(ℓ)/ℓ ,                                   (1)

where H(ℓ) = −Σ_{x^ℓ} Pr(x^ℓ) log2 Pr(x^ℓ) is the block entropy—the Shannon entropy of the length-ℓ word distribution Pr(x^ℓ).

The improved view of the information in a measurement is given in Fig. 1(c). Specifically, since hµ ≤ H[X], we can compress even more; indeed, by an amount R∞ = log2 k − hµ.

These comments are no more than a review of basic information theory [1] that used a little algebra. They do, however, set the stage for a parallel, but more detailed, analysis of the information in an observation. In focusing on a single measurement, the following complements recent, more sophisticated analyses of information sources that focused on a process's hidden states [2, and references therein]. In the sense that the latter is a state-centric informational analysis of a process, the following takes the complementary measurement-centric view.

Partly as preview and partly to orient ourselves on the path to be followed, we illustrate the main results in a pictorial fashion similar to that just given; see Fig. 2, which further dissects the information in X.
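To make the storage comparison concrete, the following minimal Python sketch (ours, not the paper's; the simulated source and helper names are illustrative) estimates log2 k, the single-symbol entropy H[X], and a finite-length block-entropy estimate of hµ from a correlated binary series, exhibiting the ordering hµ ≤ H[X] ≤ log2 k.

    import math
    import random
    from collections import Counter

    def shannon_entropy(counts):
        """Shannon entropy (bits) of a distribution given by raw counts."""
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c)

    def block_entropy(series, ell):
        """Block entropy H(ell): entropy of the length-ell word distribution."""
        words = Counter(tuple(series[i:i + ell]) for i in range(len(series) - ell + 1))
        return shannon_entropy(words.values())

    # Correlated binary source: after a 0 the next symbol is always 1,
    # after a 1 the next symbol is 0 or 1 with equal probability.
    random.seed(0)
    series, prev = [], 1
    for _ in range(100_000):
        prev = 1 if prev == 0 else random.randint(0, 1)
        series.append(prev)

    k = 2
    H1 = block_entropy(series, 1)                                 # single-observation entropy H[X]
    h_est = block_entropy(series, 10) - block_entropy(series, 9)  # finite-ell estimate of h_mu
    print(f"log2 k = {math.log2(k):.3f} bits")
    print(f"H[X]  ~= {H1:.3f} bits   (R1    ~= {math.log2(k) - H1:.3f})")
    print(f"h_mu  ~= {h_est:.3f} bits  (R_inf ~= {math.log2(k) - h_est:.3f})")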

As a first cut, the information H[X] provided by each observation (Fig. 2(a)) can be broken into two pieces: one part is information ρµ that could be anticipated from prior observations and the other, hµ—the random component—is that which could not be anticipated. (See Fig. 2(b).) Each of these pieces can be further decomposed into two parts. The random component hµ breaks into two kinds of randomness: a part bµ relevant for predicting the future, while the remaining part rµ is ephemeral, existing only for the moment.

The redundant portion ρµ of H[X] in turn splits into two pieces. The first part—also bµ when the process is stationary—is shared between the past and the current observation, but its relevance stops there. The second piece qµ is anticipated by the past, is present currently, and also plays a role in future behavior. Notably, this informational piece can be negative. (See Fig. 2(c).)

We can further combine all elements of H[X] that participate in structure—whether it be past, future, or both—into a single element wµ. This provides a very different decomposition of H[X] than hµ and ρµ: it partitions H[X] into a piece wµ that is structural and a piece rµ that, as mentioned above, is ephemeral. (See Fig. 2(d).)

With the basic informational components contained in a single measurement laid out, we now derive them from first principles. The next step is to address information in collections of random variables, helpful in a broad array of problems. We then specialize to time series; viz., one-dimensional chains of random variables.

FIG. 1. Dissecting information in a single measurement X being one of k values: (a) the raw storage log2 k; (b) the single-symbol entropy H[X] and the redundancy R1 = log2 k − H[X]; (c) the entropy rate hµ and R∞ = log2 k − hµ.

FIG. 2. Systematic dissection of H[X]: (a) the single-observation entropy H[X]; (b) its split into hµ and ρµ; (c) the finer split of hµ into rµ and bµ and of ρµ into bµ and qµ; (d) the split into rµ and wµ.

III. INFORMATION MEASURES

Shannon's information theory [1] is a widely used mathematical framework with many advantages in the study of complex, nonlinear systems. Most importantly, it provides a unified quantitative way to analyze systems with broadly dissimilar physical substrates. It further makes no assumptions as to the types of correlation between variables, picking up multi-way nonlinear interactions just as easily as simple pairwise linear correlations.

The workhorse of information theory is the Shannon entropy of a random variable, just introduced. The entropy measures what would commonly be considered the amount of information learned, on average, from observing a sample of that random variable. The entropy H[X] of a random variable X taking on values x ∈ A = {1, . . . , k} with distribution Pr(X = x) has the following functional form:

  H[X] = −Σ_{x∈A} Pr(x) log2 Pr(x) .                        (2)

The entropy is defined in the same manner over joint random variables—say, X and Y—where the above distribution is replaced by the joint probability Pr(X, Y).

When considering more than a single random variable, it is quite reasonable to ask how much uncertainty remains in one variable given knowledge of the other. The average entropy in one variable X given the outcome of another variable Y is the conditional entropy:

  H[X | Y] = H[X, Y] − H[Y] .                               (3)

That is, it is the entropy of the joint random variable (X, Y) with the marginal entropy H[Y] of Y subtracted from it.

The fundamental measure of correlation between random variables is the mutual information. As stated before, it can be adapted to measure all kinds of interaction between two variables. It can be written in several forms, including:

  I[X; Y] = H[X] + H[Y] − H[X, Y]                           (4)
          = H[X, Y] − H[X | Y] − H[Y | X] .                 (5)

Two variables are generally considered independent if their mutual information is zero.

Like the entropy, the mutual information can also be conditioned on another variable, say Z, resulting in the conditional mutual information. Its definition is a straightforward modification of Eq. (4):

  I[X; Y | Z] = H[X | Z] + H[Y | Z] − H[X, Y | Z] .         (6)

For example, consider two random variables X and Y that take the values 0 or 1 independently and uniformly, and a third Z = X XOR Y, the exclusive-or of the two. There is a total of two bits of information among the three variables: H[X, Y, Z] = 2 bits. Furthermore, the variables X and Y share a single bit of information with Z, their parity. Thus, I[X, Y; Z] = 1 bit. Interestingly, although X and Y are independent, I[X; Y] = 0, they are not conditionally independent: I[X; Y | Z] = 1 bit.

Given a set A of variable indices drawn from Ω_N = {0, 1, . . . , N−1}, its complement is denoted A¯ = Ω_N \ A. Finally, we use a shorthand to refer to the set of random variables corresponding to index set A:

  X_A ≡ {X_i : i ∈ A} .                                     (8)

There are at least three extensions of the two-variable mutual information, each based on a different interpretation of what its original definition intended. The first is the multivariate mutual information or co-information [3]: I[X0; X1; . . . ; X_{N−1}]. Denoted I[X_{0:N}], it is the amount of mutual information to which all variables contribute:

  I[X_{0:N}] = −Σ_{x_{0:N}} Pr(x_{0:N}) log2 Π_{A∈P(N)} Pr(x_A)^{−(−1)^{|A|}}
             = −Σ_{A∈P(N)} (−1)^{|A|} H[X_A]                (9)
             = H[X_{0:N}] − Σ_{A∈P(N), 0<|A|<N} I[X_A | X_{A¯}] ,   (10)

where P(N) is the power set of Ω_N.
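The XOR example above can be checked directly. The following self-contained Python sketch (illustrative, not from the paper) enumerates the joint distribution and evaluates Eqs. (4), (6), and the three-variable co-information of Eq. (9), exhibiting the latter's possible negativity.

    import math
    from itertools import product

    # Joint distribution of (X, Y, Z) with X, Y uniform bits and Z = X XOR Y.
    joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}

    def H(indices):
        """Shannon entropy (bits) of the marginal over the given variable positions."""
        marginal = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in indices)
            marginal[key] = marginal.get(key, 0.0) + p
        return -sum(p * math.log2(p) for p in marginal.values() if p > 0)

    X, Y, Z = 0, 1, 2
    I_XY      = H([X]) + H([Y]) - H([X, Y])                      # Eq. (4)
    I_XY_cond = H([X, Z]) + H([Y, Z]) - H([X, Y, Z]) - H([Z])    # Eq. (6), expanded
    coinfo    = I_XY - I_XY_cond                                 # co-information I[X; Y; Z]

    print(f"H[X,Y,Z] = {H([X, Y, Z]):.1f} bits")   # 2.0
    print(f"I[X;Y]   = {I_XY:.1f} bits")           # 0.0
    print(f"I[X;Y|Z] = {I_XY_cond:.1f} bits")      # 1.0
    print(f"I[X;Y;Z] = {coinfo:.1f} bits")         # -1.0: co-information can be negative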

In the following, an information measure applied to Information [bits] F to the process’s length-` words is denoted [X ] or, as F 0:` a shorthand, (`). E F − 0 1 2 3 4 5 6 Block length ` [symbols]

A. Block Entropy versus Total Correlation

FIG. 4. Block entropy H(ℓ) and block total correlation T(ℓ), illustrating their behaviors for the NRPS Process.

We begin with the long-studied block entropy information measure H(ℓ) [7, 8]. (For a review and background to the following see Ref. [6].) The block entropy curve defines two primary features. First, its growth rate limits to the entropy rate hµ. Second, its subextensive component is the excess entropy E:

  E = I[X_{:0}; X_{0:}] ,                                   (19)

which expresses the totality of information shared between the past and future.

The entropy rate and excess entropy, and the way in which they are approached with increasing block length, are commonly used quantifiers for complexity in many fields. They are complementary in the sense that, for finitary processes, the block entropy for sufficiently long blocks takes the form:

  H(ℓ) ∼ E + ℓ hµ .                                         (20)

Recall that H(0) = 0 and that H(ℓ) is monotone increasing and concave down. The finitary processes, mentioned above, are those with finite E.

Next, we turn to a less well studied measure for time series—the block total correlation T(ℓ). Adapting Eq. (12) to a stationary process gives its definition:

  T(ℓ) = ℓ H[X0] − H(ℓ) .                                   (21)

Note that T(0) = 0 and T(1) = 0. Effectively, it compares a process's block entropy to the case of independent, identically distributed random variables. In many ways, the block total correlation is the reverse side of an information-theoretic coin for which the block entropy is the obverse. For finitary processes, its growth rate limits to a constant ρµ and its subextensive part is a constant that turns out to be −E:

  T(ℓ) ∼ −E + ℓ ρµ .                                        (22)

That is, ρµ = lim_{ℓ→∞} T(ℓ)/ℓ. Finally, T(ℓ) is monotone increasing, but concave up. All of this is derived directly from Eqs. (20) and (21), by using well known properties of the block entropy.

The block entropy and block total correlation are plotted in Fig. 4. Both measures are 0 at ℓ = 0 and from there approach their asymptotic behavior, denoted by the dashed lines. Though their asymptotic slopes appear to be the same, they in fact differ. Numerical data for the asymptotic values can be found in Tables I and II under the heading NRPS (defined later).

There is a persistent confusion in the neuroscience, complex systems, and information theory literatures concerning the relationship between block entropy and block total correlation. This can be alleviated by explicitly demonstrating a partial symmetry between the two in the time series setting and by highlighting a weakness of the total correlation.

We begin by showing how, for stationary processes, the block entropy and the block total correlation contain much the same information. From Eqs. (7) and (12) we immediately see that:

  H(ℓ) + T(ℓ) = ℓ H(1) .                                    (23)

Furthermore, by substituting Eqs. (20) and (22) in Eq. (23) we note that the righthand side has no subextensive component. This gives further proof that the subextensive components of Eqs. (20) and (22) must be equal and opposite, as claimed. Moreover, by equating individual ℓ-terms we find:

  hµ + ρµ = H(1) .                                          (24)

And, this is the decomposition given in Fig. 2(b): the lefthand side provides the two pieces comprising the single-observation entropy H(1).
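As a numerical check on Eqs. (20)-(24), the short Python sketch below (illustrative; the simulated source and variable names are ours, and the finite-ℓ, finite-sample estimates converge only for large ℓ and long series) estimates H(ℓ) and T(ℓ) from data and extracts estimates of hµ, ρµ, and E.

    import math, random
    from collections import Counter

    def entropy(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

    def H(series, ell):
        """Block entropy H(ell): entropy of the length-ell word distribution."""
        return entropy(Counter(tuple(series[i:i + ell])
                               for i in range(len(series) - ell + 1)).values())

    # Binary source with no two consecutive 0s (Golden-Mean-like rule).
    random.seed(1)
    series, prev = [], 1
    for _ in range(200_000):
        prev = 1 if prev == 0 else random.randint(0, 1)
        series.append(prev)

    L = 12
    H_L, H_Lm1, H_1 = H(series, L), H(series, L - 1), H(series, 1)
    h_mu   = H_L - H_Lm1              # entropy-rate estimate (discrete derivative)
    rho_mu = H_1 - h_mu               # Eq. (24): h_mu + rho_mu = H(1)
    T_L    = L * H_1 - H_L            # block total correlation, Eq. (21)
    E      = H_L - L * h_mu           # subextensive part of H(ell), Eq. (20)

    print(f"h_mu ~= {h_mu:.3f}  rho_mu ~= {rho_mu:.3f}  (sum = H(1) = {H_1:.3f})")
    print(f"E ~= {E:.3f}  and from T(ell): -E ~= {T_L - L * rho_mu:.3f}  (Eq. 22)")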

Continuing, either information measure can be used to obtain the excess entropy. In addition, since the block entropy provides hµ as well as intrinsically containing H(1), ρµ can be directly obtained from the block entropy function by taking H(1) − hµ. The same is not true, however, for the total correlation. Though ρµ can be computed, one cannot obtain hµ from T(ℓ) alone—H(1) is required, but not available from T(ℓ), since it is subtracted out.

There are further parallels between the two quantities that can be drawn. First, following Ref. [6], we define discrete derivatives of the block measures at length ℓ:

  hℓ = H(ℓ) − H(ℓ−1)                                        (25)
  ρℓ = T(ℓ) − T(ℓ−1) .                                      (26)

These approach hµ and ρµ, respectively. From them we can determine the subextensive components by discrete integration, while subtracting out the asymptotic behavior. We find that:

  E = Σ_{ℓ=1}^{∞} (hℓ − hµ)                                 (27)

and also that

  −E = Σ_{ℓ=1}^{∞} (ρℓ − ρµ) .                              (28)

Second, these sums are equal term by term.

The first sum, however, indirectly brings us back to Eq. (24). Since h1 = H(1), we have:

  E = ρµ + Σ_{ℓ=2}^{∞} (hℓ − hµ) .                          (29)

Finally, it has been said that the total correlation ("multi-information") is the first term in E [9, 10]. This has perhaps given the impression that the total correlation is only useful as a crude approximation. Equation (29) shows that it is actually the total correlation rate ρµ that is E's first term. As we just showed, the total correlation is more useful than being a first term in an expansion. Its utility is ultimately limited, though, since its properties are redundant with those of the block entropy which, in addition, gives the process's entropy rate hµ.

B. A Finer Decomposition

We now show how, in the time series setting, the binding information, local exogenous information, enigmatic information, and residual entropy constitute a refinement of the single-measurement decomposition provided by the block entropy and the total correlation [5, 11]. To begin, their block equivalents are, respectively:

  B(ℓ) = H(ℓ) − R(ℓ)                                        (30)
  Q(ℓ) = T(ℓ) − B(ℓ)                                        (31)
  W(ℓ) = B(ℓ) + T(ℓ) ,                                      (32)

where R(ℓ) does not have an analogously simple form. Their asymptotic behaviors are, respectively:

  R(ℓ) ∼ E_R + ℓ rµ                                         (33)
  B(ℓ) ∼ E_B + ℓ bµ                                         (34)
  Q(ℓ) ∼ E_Q + ℓ qµ                                         (35)
  W(ℓ) ∼ E_W + ℓ wµ .                                       (36)

Their associated rates break the prior two components (hµ and ρµ) into finer pieces. Substituting their definitions into Eqs. (7) and (21) we have:

  H(ℓ) = B(ℓ) + R(ℓ)                                        (37)
       = (E_B + E_R) + ℓ (bµ + rµ)                          (38)
  T(ℓ) = B(ℓ) + Q(ℓ)                                        (39)
       = (E_B + E_Q) + ℓ (bµ + qµ) .                        (40)

The rates in Eqs. (38) and (40), corresponding to hµ and ρµ respectively, give the decomposition laid out in Fig. 2(c) above. Two of these components (bµ and rµ) were defined in Ref. [5] and the third (qµ) is a direct extension. We defer interpreting them to Sec. VI B, which provides greater understanding by appealing to the semantics afforded by the process information diagram developed there.

The local exogenous information, rather than refining the decomposition provided by the block entropy and the total correlation, provides a different decomposition:

  W(ℓ) = B(ℓ) + T(ℓ)                                        (41)
       = (E_B − E) + ℓ (bµ + ρµ) .                          (42)

So, wµ = bµ + ρµ, as mentioned in Fig. 2(d).

Similar to Eq. (23), we can take the local exogenous information together with the residual entropy and find:

  R(ℓ) + W(ℓ) = ℓ H(1) .                                    (43)

This implies that E_R = −E_W and that rµ and wµ are yet another partitioning of H[X], as shown earlier in Fig. 2(d).

FIG. 5. Block equivalents of the residual entropy R(ℓ), binding information B(ℓ), enigmatic information Q(ℓ), and local exogenous information W(ℓ) for a generic process (same as the previous figure).

Figure 5 illustrates these four block measures for a generic process. Each of the four measures reaches asymptotic linear behavior at a length of ℓ = 9 symbols. Once there, we see that each possesses a slope that we just showed to be a decomposition of the slopes of the measures in Fig. 4. Furthermore, each has a subextensive component that is found as the y-intercept of the linear asymptote. These subextensive parts provide a decomposition of the excess entropy, discussed further below in Sec. VI B 3.

C. Multivariate Mutual Information

Lastly, we come to the block equivalent of the multivariate mutual information I[X_{0:N}]:

  I(ℓ) = H(ℓ) − Σ_{A∈P(ℓ), 0<|A|<ℓ} I[X_A | X_{A¯}] .       (44)

Superficially, it scales similarly to the other measures:

  I(ℓ) ∼ I + ℓ iµ ,                                         (45)

with an asymptotic growth rate iµ and a constant subextensive component I. Yet, it has differing implications regarding what it captures in the process. This is drawn out by the following propositions, whose proofs appear elsewhere.

The first concerns the subextensive part of I(ℓ).

Proposition 1. For all finite-state processes:

  hµ > 0  ⇒  lim_{ℓ→∞} I(ℓ) = 0 .                           (46)

The intuition behind this is fairly straightforward. For I(ℓ) to be nonzero, no two observations can be independent. Finite-state processes with positive hµ are stochastic, however. So, observations become (conditionally) decoupled exponentially fast. Thus, for arbitrarily long blocks, the first and the last observations tend toward independence exponentially and so I(ℓ) limits to 0.

The second proposition regards the growth rate iµ.

Proposition 2. For all finite-state processes:

  iµ = 0 .                                                  (47)

The intuition behind this follows from the first proposition. If hµ > 0, then it is clear that since I(ℓ) tends toward 0, the slope must also tend toward 0. What remain are those processes that are finite state but for which hµ = 0. These are the periodic processes. For them, iµ also vanishes since, although I(ℓ) may be nonzero, there is a finite amount of information contained in a bi-infinite periodic sequence. Once all this information has been accounted for at a particular block length, then for all blocks larger than this there is no additional information to gain. And so, iµ decays to 0.

The final result concerns the subextensive component I.

Proposition 3. For all finite-state processes with hµ > 0:

  I = 0 .                                                   (48)

This follows directly from the previous two propositions.

Thus, the block multivariate mutual information is qualitatively different from the other block measures. It appears to be most interesting for infinitary processes with infinite excess entropy.

Figure 6 demonstrates the general behavior of I(ℓ), illustrating the three propositions. The dashed line highlights the asymptotic behavior of I(ℓ): both I and iµ vanish. We further see that I(ℓ) is not restricted to positive values. It oscillates about 0 until length ℓ = 11, where it finally vanishes.
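The finer block measures can also be estimated from data. The Python sketch below is ours, not the paper's: it assumes the residual entropy of a block is the sum over positions of the entropy of that symbol conditioned on the rest of the block (the definition of Ref. [5]; the paper's Eq. (14) is not reproduced above), and it uses naive discrete derivatives as in Eqs. (33)-(36), so the resulting rate estimates are only rough.

    import math, random
    from collections import Counter, defaultdict

    def H_from(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def word_dist(series, ell):
        c = Counter(tuple(series[i:i + ell]) for i in range(len(series) - ell + 1))
        n = sum(c.values())
        return {w: k / n for w, k in c.items()}

    def residual(dist, ell):
        """R(ell) = sum_i H[X_i | all other variables in the block] (assumed form)."""
        total = 0.0
        for i in range(ell):
            rest = defaultdict(float)               # marginal over the block minus position i
            for w, p in dist.items():
                rest[w[:i] + w[i + 1:]] += p
            total += H_from(dist) - H_from(rest)    # H[X_i, rest] - H[rest] = H[X_i | rest]
        return total

    # Binary source with no two consecutive 0s.
    random.seed(2)
    series, prev = [], 1
    for _ in range(200_000):
        prev = 1 if prev == 0 else random.randint(0, 1)
        series.append(prev)

    ell = 10
    d_l, d_lm1 = word_dist(series, ell), word_dist(series, ell - 1)
    H1 = H_from(word_dist(series, 1))

    def block_measures(dist, L):
        Hb = H_from(dist)
        T = L * H1 - Hb                  # Eq. (21)
        R = residual(dist, L)
        B = Hb - R                       # Eq. (30)
        return R, B, T - B, B + T        # R, B, Q (Eq. 31), W (Eq. 32)

    R1_, B1_, Q1_, W1_ = block_measures(d_l, ell)
    R0_, B0_, Q0_, W0_ = block_measures(d_lm1, ell - 1)
    print(f"r_mu ~= {R1_ - R0_:.3f}  b_mu ~= {B1_ - B0_:.3f}  "
          f"q_mu ~= {Q1_ - Q0_:.3f}  w_mu ~= {W1_ - W0_:.3f}")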

FIG. 6. Block multivariate mutual information I(ℓ) for the same example process as before.

VI. INFORMATION DIAGRAMS

Information diagrams [12] provide a graphical and intuitive way to interpret the information-theoretic relationships among variables. In construction and concept, they are very similar to Venn diagrams. The key difference is that the measure used is a Shannon entropy rather than a set size. Additionally, an overlap is not a set intersection but rather a mutual information. The irreducible intersections are, in fact, elementary atoms of a sigma-algebra over the random-variable event space. An atom's size reflects the magnitude of one or another Shannon information measure—marginal, joint, or conditional entropy or mutual information.

A. Four-Variable Information Diagrams

Using information diagrams we can deepen our understanding of the multivariate informations defined in Sec. IV. Fig. 7 illustrates them for four random variables—X, Y, Z, W. There, an atom's shade of gray denotes how much weight it carries in the overall value of its measure. Consider for example the total correlation I-diagram in Fig. 7(c). From the definition of the total correlation, Eq. (12), we see that each variable provides one count to each of its atoms and then a count is removed from each atom. Thus, the atom associated with the four-way intersection W ∩ X ∩ Y ∩ Z, contained in each of the four variables, carries a total weight of 4 − 1 = 3 on the atom I[W; X; Y; Z]. Those atoms contained in three variables carry a weight of 2, those shared among only two variables a weight of 1, and information solely contained in one variable is not counted at all.

Utilizing the I-diagrams in Fig. 7, we can easily visualize and intuit how these various information measures relate to each other and the distributions they represent. In Fig. 7(a), we find the joint entropy. Since it represents all information contained in the distribution with no bias to any sort of interaction, we see that it counts each and every atom once. The residual entropy, Fig. 7(e), is equally easy to interpret: it counts each atom which is not shared by two or more variables.

The distinctions in the menagerie of measures attempting to capture interactions among N variables can also be easily seen. The multivariate mutual information, Fig. 7(b), stands out in that it is isolated to a single atom, that contained in all variables. This makes it clear why the independence of any two of the variables leads to a zero value for this measure. The total correlation, Fig. 7(c), contains all atoms contained in at least two variables and gives higher weight to those contained in more variables. The local exogenous information, Fig. 7(f), is similar. It counts the same atoms as the total correlation does, but it gives them higher weight. The binding information, Fig. 7(d), also counts the same atoms, but only weights each of them once regardless of how many variables they participate in. Finally, the enigmatic information, Fig. 7(g), counts only those atoms shared by at least three variables and, similar to the total correlation, it counts those that participate in more variables more heavily.

FIG. 7. Four-variable information diagrams for the multivariate information measures of Sec. IV: (a) joint entropy, Eq. (7); (b) multivariate mutual information, Eq. (9); (c) total correlation, Eq. (12); (d) binding information, Eq. (13); (e) residual entropy, Eq. (14); (f) local exogenous information, Eq. (17); (g) enigmatic information, Eq. (18). Darker shades of gray denote heavier weighting in the corresponding informational sum. For example, the atoms to which all four variables contribute are added thrice to the total correlation, so the central atom I[W; X; Y; Z] carries weight 3.
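For a concrete four-variable case, the Python sketch below (ours; the parity-based example distribution is illustrative) computes the joint entropy, total correlation, residual entropy, binding information, and co-information directly from entropies, and then obtains the enigmatic and local exogenous quantities via the relations of Eqs. (31)-(32), on the assumption that those block relations mirror the N-variable definitions.

    import math
    from itertools import product, combinations

    # Joint distribution over four bits (W, X, Y, Z): X, Y, Z uniform and
    # independent, W = X XOR Y XOR Z (their parity).
    joint = {(x ^ y ^ z, x, y, z): 1 / 8 for x, y, z in product((0, 1), repeat=3)}
    N = 4

    def H(indices):
        """Entropy (bits) of the marginal over the given variable positions."""
        marg = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in indices)
            marg[key] = marg.get(key, 0.0) + p
        return -sum(p * math.log2(p) for p in marg.values() if p > 0)

    all_idx = tuple(range(N))
    H_joint = H(all_idx)                                         # counts every atom once
    T = sum(H((i,)) for i in all_idx) - H_joint                  # total correlation
    R = sum(H(all_idx) - H(tuple(j for j in all_idx if j != i))  # residual entropy:
            for i in all_idx)                                    #   sum_i H[X_i | rest]
    B = H_joint - R                                              # binding information, cf. Eq. (30)
    Q = T - B                                                    # enigmatic information, cf. Eq. (31)
    W = B + T                                                    # local exogenous information, cf. Eq. (32)
    # Co-information by inclusion-exclusion, cf. Eq. (9).
    I_co = -sum((-1) ** len(A) * H(A)
                for r in range(1, N + 1) for A in combinations(all_idx, r))
    print(f"H={H_joint:.1f}  T={T:.1f}  R={R:.1f}  B={B:.1f}  Q={Q:.1f}  W={W:.1f}  I={I_co:.1f}")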

B. Process Information Diagrams

Following Ref. [13] we adapt the multivariate I-diagrams just laid out to tracking information in finitary stationary processes. In particular, we develop process I-diagrams to explain the information in a single observation, as described before in Fig. 2. The resulting process I-diagram is displayed in Fig. 8. As we will see, exploring the diagram gives a greater, semantic understanding of the relationships among the process variables and, as we will emphasize, of the internal structure of the process itself.

For all measures, except the multivariate mutual information, the extensive rate corresponds to one or more atoms in the decomposition of H[X0]. To begin, we allow H[X0] to be split in two by the past. This exposes two pieces: hµ, the part exterior to the past, and ρµ, the part interior. This partitioning has been well studied in information theory due to how it naturally arises as one observes a sequence. This decomposition is displayed in Fig. 9(a).

Taking a step back and including the future in the diagram, we obtain a more detailed understanding of how information is transmitted in a process. The past and the future together divide H[X0] into four parts; see Fig. 9(b). We will discuss each part shortly. First, however, we draw out a different decomposition—that into rµ and wµ as seen in Fig. 9(c). From this diagram it is easy to see the semantic meaning behind the decomposition: rµ being divorced from any temporal structure, while wµ is steeped in it.

We finally turn to the partitioning shown in Fig. 9(b). The process I-diagram makes it rather transparent in which sense rµ is an amount of ephemeral information: its atom lies outside both the past and future sets and so it exists only in the present moment, having no repercussions for the future and being no consequence of the past. It is the amount of information in the present observation neither communicated to the future nor from the past. Ref. [5] referred to this as the residual entropy rate, as it is the amount of uncertainty that remains in the present even after accounting for every other variable in the time series.

Ref. [5] also proposed to use bµ as a measure of structural complexity, and we tend to agree. The argument for this is intuitive: bµ is an amount of information that is present now, is not explained by the past, but has repercussions in the future. That is, it is the portion of the entropy rate hµ that has consequences. In some contexts one may prefer to employ the ratio bµ/hµ when bµ is interpreted as an indicator of complex behavior since, for a fixed bµ, larger hµ values imply less temporal structure in the time series.

Due to stationarity, the mutual information I[X0; X1: | X:0] between the present X0 and the future X1: conditioned on the past X:0 is the same as the mutual information I[X0; X:0 | X1:] between X0 and the past X:0 conditioned on the future X1:. Moreover, both are bµ. This lends a symmetry to the process I-diagram that does not exist for nonstationary processes. Thus, the two bµ atoms in Fig. 8 are the same size.

There are two atoms remaining in the process I-diagram that have not been discussed in the literature. Both merit attention. The first is qµ—the information shared by the past, the present, and the future. Notably, its value can be negative and we discuss this further below in Sec. VI B 1. The other piece, denoted σµ, is a component of information shared between the past and the future that does not exist in the present observation. This piece is vital evidence that attempting to understand a process without using a model for its generating mechanism is ultimately incomplete. We discuss this point further in Sec. VI B 2 below.

FIG. 8. I-diagram anatomy of H[X0] in the full context of time: The past X:0 partitions H[X0] into two pieces: hµ and ρµ. The future X1: then partitions those further into rµ, two bµs, and qµ. This leaves a component σµ, shared by the past and the future, that is not in the present X0.

FIG. 9. The three decompositions of H[X] from Fig. 2. The dissecting lines are identical to those in Fig. 8.

1. Negativity of qµ

The sign of qµ holds valuable information. To see what this is we apply the partial information decomposition [14] to further analyze wµ = I[X0; X:0, X1:]—that portion of the present shared with the past and future. By decomposing wµ into four pieces—three of which are unique—we gain greater insight into the value of qµ and also draw out potential asymmetries between the past and the future.

The partial information lattice provides us with a method to isolate (i) the contribution Π{X:0}{X1:} to wµ that both the past and the future provide redundantly, (ii) parts Π{X:0} and Π{X1:} that are uniquely provided by the past and the future, respectively, and (iii) a part Π{X:0,X1:} that is synergistically provided by both the past and the future. Note that, due to stationarity, Π{X:0} = Π{X1:}. We refer to this as the uniquity and denote it ι.

FIG. 10. Partial information decomposition of wµ = I[X0; X:0, X1:]. The multivariate mutual information qµ is given by the redundancy Π{X:0}{X1:} minus the synergy Π{X:0,X1:}. wµ = ρµ + bµ is the sum of all atoms in this diagram.

Using Ref. [14] we see that qµ is equal to the redundancy minus the synergy of the past and the future, when determining the present. Thus, if qµ > 0, the past and future predominantly contribute information to the present. When qµ < 0, however, considering the past and the future separately in determining the present misses essential correlations. The latter can be teased out if the past and future are considered together.

The process I-diagram (Fig. 8) showed that the mutual information between the present and either the past or the future is ρµ. One might suspect from this that the past and the future provide the same information to the present, but this would be incorrect. Though they provide the same quantity of information to the present, what that information conveys can differ. This is evidence of a process's structural irreversibility; cf. Refs. [13, 15]. In this light, the redundancy Π{X:0}{X1:} between the past and future when considering the present is ρµ − ι. Furthermore, the synergy Π{X:0,X1:} provided by the past and the future is equal to bµ − ι.

Taking this all together, we find what we already knew: that qµ = ρµ − bµ. The journey to this conclusion, however, provided us with deeper insight into what negative qµ means and into the structure of wµ and the process as a whole.
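These atoms can also be estimated directly from data. The following Python sketch is a naive, finite-window Monte Carlo approach of ours, not the paper's method (the exact, closed-form route via the ε-machine is described in the next subsection): it approximates rµ ≈ H[X0 | k past and k future symbols], then uses bµ = hµ − rµ and qµ = ρµ − bµ, relations stated above.

    import math, random
    from collections import Counter

    def H_counts(counter):
        n = sum(counter.values())
        return -sum(c / n * math.log2(c / n) for c in counter.values() if c)

    def cond_entropy(pairs):
        """H[X | Y] estimated from (y, x) samples."""
        joint, marg = Counter(pairs), Counter(y for y, _ in pairs)
        return H_counts(joint) - H_counts(marg)

    # Binary source with no two consecutive 0s.
    random.seed(3)
    s, prev = [], 1
    for _ in range(200_000):
        prev = 1 if prev == 0 else random.randint(0, 1)
        s.append(prev)

    k = 6   # finite past/future windows standing in for X_{:0} and X_{1:}
    past   = [tuple(s[i - k:i])         for i in range(k, len(s) - k)]
    future = [tuple(s[i + 1:i + 1 + k]) for i in range(k, len(s) - k)]
    now    = [s[i]                      for i in range(k, len(s) - k)]

    h_mu   = cond_entropy(list(zip(past, now)))                                # H[X0 | past]
    r_mu   = cond_entropy([((p, f), x) for p, f, x in zip(past, future, now)]) # H[X0 | past, future]
    b_mu   = h_mu - r_mu
    rho_mu = H_counts(Counter(now)) - h_mu
    q_mu   = rho_mu - b_mu
    print(f"h_mu~{h_mu:.3f} r_mu~{r_mu:.3f} b_mu~{b_mu:.3f} rho_mu~{rho_mu:.3f} q_mu~{q_mu:.3f}")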

2. Consequence of σµ: Why we model

Notably, the final piece of the process I-diagram is not part of H[X0]—not a component of the information in a single observation. This is σµ, which represents information that is transmitted from the past to the future, but does not go through the currently observed symbol X0. This is readily understood and leads to an important conclusion.

If one believes that the process under study is generated according to the laws of physics, then the process's internal physical configuration must store all the information from the past that is relevant for generating future behavior. Only when the observed process is order-1 Markov is it sufficient to keep track of just the current observable. For the plethora of processes that are not order-1 or that are non-Markovian altogether, we are faced with the fact that information relevant for future behavior must be stored somehow. And, this fact is reflected in the existence of σµ. When σµ > 0, a complete description of the process requires accounting for this internal configurational or, simply, state information. This is why we build models and cannot rely on only collecting observation sequences.

The amount of information shared between X:0 and X1:, but ignoring X0, was previously discussed in Ref. [16]. We now see that the meaning of this information quantity—there denoted I1—is easily gleaned from its components: I1 = qµ + σµ.

Furthermore, in Refs. [5], [11], and [16], efficient computation of bµ and I1 was not provided and the brute-force estimates are inaccurate and very compute intensive. Fortunately, by a direct extension of the methods developed in Ref. [15] on bidirectional machines, we can easily compute both rµ = H[X0 | S0+, S1−] and I1 = I[S0+; S1−]. This is done by constructing joint probabilities of forward-time and reverse-time causal states—{S+} and {S−}, respectively—at different time indices, employing the dynamic of the bidirectional machine. This gives closed-form, exact methods of calculating these two measures, provided one constructs the process's forward and reverse ε-machines. bµ follows directly in this case since it is the difference of hµ and rµ; the former is also directly calculated from the ε-machine.

3. Decompositions of E

Using the process I-diagram and the tools provided above, three unique decompositions of the excess entropy, Eq. (19), can be given. Each provides a different interpretation of how information is transmitted from the past to the future.

The first is provided by Eqs. (37)-(40). The subextensive parts of the block entropy and total correlation there determine the excess entropy decomposition. We have:

  E = E_B + E_R                                             (49)
    = −E_B − E_Q                                            (50)
    = (E_R − E_Q)/2                                         (51)
    = −(E_W + E_Q)/2 .                                      (52)

We leave the meaning behind these decompositions as an open problem, but do note that they are distinct from those discussed next.

The second and third decompositions both derive directly from the process I-diagram of Fig. 8. Without further work, one can easily see that the excess entropy breaks into three pieces, all previously discussed:

  E = bµ + qµ + σµ .                                        (53)

And, finally, one can perform the partial information decomposition on the mutual information I[X:0; X0, X1:]. The result gives an improved understanding of (i) how much information is uniquely shared with either the immediate or the more distant future and (ii) how much is redundantly or synergistically shared with both.

The decompositions provided by the atoms of the process I-diagram and those provided by the subextensive rates of block-information curves are conceptually quite different. It has been shown [17] that the subextensive part of the block entropy and the mutual information between the past and the future, though equal for one-dimensional processes, differ in two dimensions. We believe the semantic differences shown here are evidence that the degeneracy of alternate E-decompositions breaks in higher dimensions.

VII. EXAMPLES

We now make the preceding concrete by calculating these quantities for three different processes, selected to illustrate a variety of informational properties. Figure 11 gives each process via its ε-machine [18]: the Even Process, the Golden Mean Process, and the Noisy Random Phase-Slip (NRPS) Process. A process's ε-machine consists of its causal states—a partitioning of infinite pasts into sets that give rise to the same predictions about future behavior. The state transitions are labeled p|s where s is the observed symbol and p is the conditional probability of observing that symbol given the state the process is in. The ε-machine representation for a process is its minimal unifilar presentation.

Table I begins by showing the single-observation entropy H[1] followed by hµ and ρµ. Note that the Even and the Golden Mean Processes cannot be differentiated using these measures alone. The table then follows with the finer decomposition. We now see that the processes can be differentiated. We can understand fairly easily that the Even Process, being infinite-order Markovian, and consisting of blocks of 1s of even length separated by one or more 0s, exhibits more structure than the Golden Mean Process. (This is rather intuitive if one recalls that the Golden Mean Process has only a single restriction: it cannot generate sequences with consecutive 0s.) We see that, for the Even Process, rµ is 0. This can be understood by considering a bi-infinite sample from the Even Process with a single gap in it. The structure of this process is such that we can always and immediately identify what that missing symbol must be.

These two processes are further differentiated by qµ, where it is negative for the Even Process and positive for the Golden Mean Process. On the one hand, this implies that there is a larger amount of synergy than redundancy in the Even Process. Indeed, it is often the case, when appealing only to the past or the future, that one cannot determine the value of X0, but when taken together the possibilities are limited to a single symbol. On the other hand, since qµ is positive for the Golden Mean Process we can determine that its behavior is dominated by redundant contributions. That wµ is larger for the Even Process than the Golden Mean Process is consonant with the impression that the former is, overall, more structured.

The next value in the table is σµ, the amount of state information not contained in the current observable. This vanishes for the Golden Mean Process, as it is order-1 Markovian. The Even Process, however, has a significant amount of information stored that is not observable in the present.

TABLE I. Information measure analysis of the three processes.

                        Even       Golden Mean   NRPS
  H[1]                  0.91830    0.91830       0.97987
  hµ                    0.66667    0.66667       0.50000
  ρµ                    0.25163    0.25163       0.47987
  rµ                    0.00000    0.45915       0.16667
  bµ                    0.66667    0.20752       0.33333
  qµ                   -0.41504    0.04411       0.14654
  wµ                    0.91830    0.45915       0.81320
  σµ                    0.66667    0.00000       1.09407
  Π{X:0}{X1:}           0.25163    0.25163       0.45550
  ι: Π{X:0}, Π{X1:}     0.00000    0.00000       0.02437
  Π{X:0,X1:}            0.66667    0.20752       0.30896

TABLE II. Alternative decompositions of excess entropy E for the three prototype processes.

                        Even       Golden Mean   NRPS
  E                     0.91830    0.25163       1.57393
  bµ                    0.66667    0.20752       0.33333
  qµ                   -0.41504    0.04411       0.14654
  σµ                    0.66667    0.00000       1.09407
  E_R                   4.48470    0.41504       1.55445
  E_B                  -3.56640   -0.16341       0.01948
  E_Q                   2.64810   -0.08822      -1.59342
  E_W                  -4.48470   -0.41504      -1.55445
  Π{X0}{X1:}            0.25163    0.04411       0.47987
  Π{X0}                 0.00000    0.20752       0.00000
  Π{X1:}                0.00000    0.00000       0.76073
  Π{X0,X1:}             0.66667    0.00000       0.33333
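The additive relations derived above can be cross-checked arithmetically against the tabulated values. The short Python sketch below (ours; values copied from Tables I and II, with a tolerance absorbing the five-decimal rounding) verifies the decomposition identities for all three processes.

    rows = {  # process: (H1, h, rho, r, b, q, w, sigma, E, ER, EB, EQ)
        "Even":        (0.91830, 0.66667, 0.25163, 0.00000, 0.66667, -0.41504, 0.91830, 0.66667, 0.91830, 4.48470, -3.56640,  2.64810),
        "Golden Mean": (0.91830, 0.66667, 0.25163, 0.45915, 0.20752,  0.04411, 0.45915, 0.00000, 0.25163, 0.41504, -0.16341, -0.08822),
        "NRPS":        (0.97987, 0.50000, 0.47987, 0.16667, 0.33333,  0.14654, 0.81320, 1.09407, 1.57393, 1.55445,  0.01948, -1.59342),
    }

    def close(a, b, tol=2e-4):
        return abs(a - b) < tol

    for name, (H1, h, rho, r, b, q, w, s, E, ER, EB, EQ) in rows.items():
        assert close(H1, h + rho)        # Eq. (24): H(1) = h_mu + rho_mu
        assert close(h, r + b)           # h_mu splits into r_mu + b_mu
        assert close(rho, b + q)         # rho_mu splits into b_mu + q_mu
        assert close(w, b + rho)         # w_mu = b_mu + rho_mu, cf. Eq. (42)
        assert close(E, b + q + s)       # Eq. (53): E = b_mu + q_mu + sigma_mu
        assert close(E, ER + EB)         # Eq. (49)
        assert close(E, -EB - EQ)        # Eq. (50)
        print(f"{name}: all decomposition identities hold")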

FIG. 11. ε-Machine presentations for the three example processes: (a) the Even Process, (b) the Golden Mean Process, and (c) the Noisy Random Phase-Slip Process.

Last in the table is a partial information decomposition of I[X0; X:0, X1:]. qµ is given by Π{X:0}{X1:} − Π{X:0,X1:}. Of note here is the NRPS Process's nonzero uniquity ι = 0.02437. For the Even and Golden Mean Processes it vanishes. That is, in the NRPS Process information is uniquely communicated to the present from the past and an equivalent in magnitude, but different, information is communicated to the future. Thus, the NRPS Process illustrates a subtle asymmetry in statistical structure.

Table II then provides an alternate breakdown of E for each prototype process. We use this here only to highlight how much the processes differ in character from one another. The consequences of the first decomposition of excess entropy—E = bµ + qµ + σµ—follow directly from the previous table's discussion. The second and third decompositions, into E_R + E_B and −E_B − E_Q, vary from one another significantly. The Even Process has much larger values for these pieces than the total E, whereas the NRPS Process has two values nearly equal to E and one very small. The Golden Mean Process falls somewhere between these two.

The final excess entropy breakdown is provided by the partial information decomposition of I[X:0; X0, X1:]. Here, we again see differing properties among the three processes. The Even Process consists only of redundancy Π{X0}{X1:} and synergy Π{X0,X1:}. The Golden Mean Process contains no synergy, a small amount of redundancy, and most of its information sharing is with the present uniquely. The NRPS Process possesses both synergy and redundancy, but also a significant amount of information shared solely with the more distant future.

And, finally, Fig. 12 plots how hµ partitions into rµ and bµ for the Golden Mean family of processes. This family consists of all processes with the ε-machine structure given in Fig. 11(b), but where the outgoing transition probabilities from state A are parametrized. We can easily see that for small self-loop transition probabilities, the majority of hµ is consumed by bµ. This should be intuitive since, when the self-loop probability is small, the process is nearly periodic and rµ should be nearly zero. On the other end of the spectrum, when the self-loop probability is large, hµ is mostly consumed by rµ. This is again intuitive since observations from that process are dominated by 1s and the occasional 0—which provides all the entropy for hµ—has no effect on structure.
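The curves in Fig. 12 can be reproduced exactly for this family. The Python sketch below is ours, not the paper's calculation: it assumes the parametrized Golden Mean machine in which state A emits 1 with self-loop probability p and 0 otherwise, and it uses the fact that, for a first-order Markov chain, the Markov blanket of X0 is {X−1, X+1}, so conditioning on one past and one future symbol gives rµ exactly (the p = 1/2 output matches the Golden Mean column of Table I).

    import math

    def Hb(p):
        """Binary entropy in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def golden_mean_rates(p):
        """h_mu and r_mu for the Golden Mean family with self-loop probability p."""
        piA = 1 / (2 - p)                       # stationary probability of state A (previous symbol 1)
        h_mu = piA * Hb(p)                      # state B's transition is deterministic, contributing 0
        # r_mu = H[X0 | X_{-1}, X_{+1}]: uncertainty only when X_{-1} = 1 and X_{+1} = 1.
        pr_next1_given_prev1 = p * p + (1 - p)  # X0 = 1 then 1, or X0 = 0 then a forced 1
        pr_x0_is_1 = p * p / pr_next1_given_prev1
        r_mu = piA * pr_next1_given_prev1 * Hb(pr_x0_is_1)
        return h_mu, r_mu

    for p in (0.1, 0.5, 0.9):
        h, r = golden_mean_rates(p)
        print(f"p={p:.1f}  h_mu={h:.5f}  r_mu={r:.5f}  b_mu={h - r:.5f}")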

FIG. 12. The breakdown of hµ for the Golden Mean Process. The self-loop probability p was varied from 0 to 1, adjusting the other edge's probability accordingly.

VIII. CONCLUDING REMARKS

We began by outlining a conceptual decomposition of a single observation in a time series: a single observation contains a hierarchy of informational components. We then made the decomposition concrete using a variety of multivariate information measures. Adapting them to time series, we showed that their asymptotic growth rates are identified with the hierarchical decomposition. To unify the various competing views, we provided the measurement-centric process I-diagram, demonstrating that it concisely reveals the semantic meaning behind each component in the hierarchy.

Once the measurement-centric process I-diagram was available, we isolated two components, analyzing in detail their meaning. We utilized the partial information lattice [14] to refine our understanding of when the past and the future redundantly and synergistically inform the present. This allowed us to explain a subtle statistical asymmetry—the directionality in the difference between ρµ and Π{X:0}{X1:}. The other atom we singled out in the process I-diagram was σµ. It is the most compelling evidence that analyzing a process from its measurements alone, without constructing a state-based model, is ultimately limited.

Next, we discussed how the different methods and measures relate to one of the most widely used complexity measures—the past-future mutual information or excess entropy. In particular, we showed how they yield four distinct decompositions and, in some cases, give useful interpretations of what these decompositions mean operationally.

Then, we calculated all the measures for three different prototype processes, each highlighting particular features of the information-theoretic decompositions. We gave interpretations of negative mutual informations, as seen in qµ. The interpretations were consistent, understandable, and insightful. There was nothing untoward about negative informations.

By adapting it to the time series setting, we highlighted a key weakness of the total correlation (or multi-information). This undoubtedly explains the lack of interest in using it in the time series setting, though the weakness still holds when it is used to analyze any group of random variables. The weakness has led to persistent over-interpretations of what it describes. It also may have eclipsed the importance of its more complete analog, such as the block entropy, in the settings of networked random variables.

In closing, we take a longer view. There is an exponential number of possible atoms for N-way information measures. In addition, there is a similarly large number of possible partial information decompositions for N variables. This diversity presents the possibility of a large number of independent efforts to define and uniquely motivate why one or the other information measure is the best. Indeed, many of these yet-to-be-explored measures may be useful. In this light, there is a bright future for developing information measures adapted to a wide range of nonlinear, complex systems. And, helpfully, a unifying framework appears to be emerging.

ACKNOWLEDGMENTS

We thank John Mahoney, Nick Travers, and Jörg Reichardt for many helpful discussions and feedback. This work was partially supported by the Defense Advanced Research Projects Agency (DARPA) Physical Intelligence Subcontract No. 9060-000709. The views, opinions, and findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of DARPA or the Department of Defense.

[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006.
[2] James P. Crutchfield, Christopher J. Ellison, Ryan G. James, and John R. Mahoney. Synchronization and Control in Intrinsic and Designed Computation: An Information-Theoretic Analysis of Competing Models of Stochastic Computation. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(3):037105, July 2010.
[3] Anthony J. Bell. The co-information lattice. In S. Amari, A. Cichocki, S. Makino, and N. Murata, editors, Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation, volume ICA 2003, New York, 2003. Springer.
[4] Satosi Watanabe. Information Theoretical Analysis of Multivariate Correlation. IBM Journal of Research and Development, 4(1):66–82, January 1960.
[5] Samer A. Abdallah and Mark D. Plumbley. A Measure of Statistical Complexity Based on Predictive Information. arXiv:1012.1890v1, 2010.
[6] James P. Crutchfield and David P. Feldman. Regularities Unseen, Randomness Observed: Levels of Entropy Convergence. Chaos, 13(1):25–54, 2003.
[7] James P. Crutchfield and N. H. Packard. Symbolic Dynamics of Noisy Chaos. Physica, 7D:201–223, 1983.
[8] Karl-Erik Eriksson and Kristian Lindgren. Structural Information in Self-Organizing Systems. Physica Scripta, 35(3):388–397, March 1987.
[9] Nihat Ay, Eckehard Olbrich, Nils Bertschinger, and Jürgen Jost. A Unifying Framework for Complexity Measures of Finite Systems. Proceedings of ECCS, 2006.
[10] Ionas Erb and Nihat Ay. Multi-Information in the Thermodynamic Limit. Journal of Statistical Physics, 115(3/4):949–976, May 2004.
[11] Samer A. Abdallah and Mark D. Plumbley. Predictive Information, Multi-Information, and Binding Information. Technical Report C4DM-TR10-10, Centre for Digital Music, Queen Mary University of London, 2010.
[12] Raymond W. Yeung. A New Outlook on Shannon's Information Measures. IEEE Transactions on Information Theory, 37(3):466–474, 1991.
[13] James P. Crutchfield, Christopher J. Ellison, and John R. Mahoney. Time's Barbed Arrow: Irreversibility, Crypticity, and Stored Information. Physical Review Letters, 103(9):094101, 2009.
[14] Paul L. Williams and Randall D. Beer. Nonnegative Decomposition of Multivariate Information. arXiv:1004.2515, April 2010.
[15] Christopher J. Ellison, John R. Mahoney, and James P. Crutchfield. Prediction, Retrodiction, and the Amount of Information Stored in the Present. Journal of Statistical Physics, 136(6):1005–1034, 2009.
[16] Robin C. Ball, Marina Diakonova, and Robert S. MacKay. Quantifying Emergence in Terms of Persistent Mutual Information. Advances in Complex Systems, 13(3):327–338, 2010.
[17] David P. Feldman and James P. Crutchfield. Structural Information in Two-Dimensional Patterns: Entropy Convergence and Excess Entropy. Physical Review E, 67(5):051103, 2003.
[18] Cosma Rohilla Shalizi and James P. Crutchfield. Computational Mechanics: Pattern and Prediction, Structure and Simplicity. Journal of Statistical Physics, 104:817–879, 2001.