
Some background on information theory

These notes are based on a section of a full biophysics course which I teach at Princeton, aimed at PhD students in Physics. In the current organization, this section includes an account of Laughlin’s ideas of optimization in the fly retina, as well as some pointers to topics in the next lecture. Gradually the whole course is being turned into a book, to be published by Princeton University Press. I hope this extract is useful in the context of our short course, and of course I would be grateful for any feedback.

The generation of physicists who turned to biological phenomena in the wake of quantum mechanics noted that to understand life one has to understand not just the flow of energy (as in inanimate systems) but also the flow of information. There is, of course, some difficulty in translating the colloquial notion of information into something mathematically precise. Indeed, almost all statistical mechanics textbooks note that the entropy of a gas measures our lack of information about the microscopic state of the molecules, but often this connection is left a bit vague or qualitative. Shannon proved a theorem that makes the connection precise [Shannon 1948]: entropy is the unique measure of available information consistent with certain simple and plausible requirements. Further, entropy also answers the practical question of how much space we need to use in writing down a description of the signals or states that we observe. This leads to a notion of efficient representation, and in this section of the course we'll explore the possibility that biological systems in fact form efficient representations of relevant information.

I. ENTROPY AND INFORMATION

Two friends, Max and Allan, are having a conversation. In the course of the conversation, Max asks Allan what he thinks of the headline story in this morning's newspaper. We have the clear intuitive notion that Max will 'gain information' by hearing the answer to his question, and we would like to quantify this intuition. Following Shannon's reasoning, we begin by assuming that Max knows Allan very well. Allan speaks very proper English, being careful to follow the grammatical rules even in casual conversation. Since they have had many political discussions Max has a rather good idea about how Allan will react to the latest news. Thus Max can make a list of Allan's possible responses to his question, and he can assign probabilities to each of the answers. From this list of possibilities and probabilities we can compute an entropy, and this is done in exactly the same way as we compute the entropy of a gas in statistical mechanics or thermodynamics: If the probability of the nth possible response is p_n, then the entropy is

S = - \sum_n p_n \log_2 p_n .   (1)

The entropy S measures Max's uncertainty about what Allan will say in response to his question. Once Allan gives his answer, all this uncertainty is removed—one of the responses occurred, corresponding to p = 1, and all the others did not, corresponding to p = 0—so the entropy is reduced to zero. It is appealing to equate this reduction in our uncertainty with the information we gain by hearing Allan's answer. Shannon proved that this is not just an interesting analogy; it is the only definition of information that conforms to some simple constraints.

A. Shannon's uniqueness theorem

To start, Shannon assumes that the information gained on hearing the answer can be written as a function of the probabilities p_n.^1 Then if all N possible answers are equally likely the information gained should be a monotonically increasing function of N. The next constraint is that if our question consists of two parts, and if these two parts are entirely independent of one another, then we should be able to write the total information gained as the sum of the information gained in response to each of the two subquestions. Finally, more general multipart questions can be thought of as branching trees, where the answer to each successive part of the question provides some further refinement of the probabilities; in this case we should be able to write the total information gained as the weighted sum of the information gained at each branch point. Shannon proved that the only function of the {p_n} consistent with these three postulates—monotonicity, independence, and branching—is the entropy S, up to a multiplicative constant.

To prove Shannon's theorem we start with the case where all N possible answers are equally likely. Then the information must be a function of N, and let this function be f(N). Consider the special case N = k^m. Then we can think of our answer—one out of N possibilities—as being given in m independent parts, and in each part we must be told one of k equally likely possibilities.^2

^1 In particular, this 'zeroth' assumption means that we must take seriously the notion of enumerating the possible answers. In this framework we cannot quantify the information that would be gained upon hearing a previously unimaginable answer to our question.

^2 If N = k^m, then we can specify each possible answer by an m–place ('digit' is the wrong word for k ≠ 10) number in base k. The m independent parts of the answer correspond to specifying the number from 0 to k − 1 that goes in each of the m places.
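Before continuing with the proof, it may help to see Eq. (1) in action. The short Python sketch below simply evaluates the sum; the probabilities and the helper name entropy_bits are invented for illustration and are not part of Shannon's construction. For N equally likely answers the function returns log_2 N.

    import numpy as np

    # A minimal numerical companion to Eq. (1): the entropy of a discrete
    # distribution, in bits.  Any normalized list of probabilities will do.
    def entropy_bits(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                      # terms with p_n = 0 contribute nothing
        return -np.sum(p * np.log2(p))

    print(entropy_bits([0.5, 0.25, 0.25]))   # 1.5 bits
    print(entropy_bits(np.ones(8) / 8))      # uniform over N = 8: log2(8) = 3 bits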

But we have assumed that information from independent questions and answers must add, so the function f(N) must obey the condition

f(k^m) = m f(k).   (2)

It is easy to see that f(N) ∝ log N satisfies this equation. To show that this is the unique solution, Shannon considers another pair of integers ℓ and n such that

k^m ≤ ℓ^n ≤ k^{m+1},   (3)

or, taking logarithms,

m/n ≤ (\log ℓ)/(\log k) ≤ m/n + 1/n.   (4)

Now because the information measure f(N) is monotonically increasing with N, the ordering in Eq. (3) means that

f(k^m) ≤ f(ℓ^n) ≤ f(k^{m+1}),   (5)

and hence from Eq. (2) we obtain

m f(k) ≤ n f(ℓ) ≤ (m + 1) f(k).   (6)

Dividing through by n f(k) we have

m/n ≤ f(ℓ)/f(k) ≤ m/n + 1/n,   (7)

which is very similar to Eq. (4). The trick is now that with k and ℓ fixed, we can choose an arbitrarily large value for n, so that 1/n = ε is as small as we like. Then Eq. (4) is telling us that

| m/n − (\log ℓ)/(\log k) | < ε,   (8)

and hence Eq. (7) for the function f(N) can similarly be rewritten as

| m/n − f(ℓ)/f(k) | < ε,  or   (9)

| f(ℓ)/f(k) − (\log ℓ)/(\log k) | ≤ 2ε,   (10)

so that f(N) ∝ log N as promised.^3

^3 If we were allowed to consider f(N) as a continuous function, then we could make a much simpler argument. But, strictly speaking, f(N) is defined only at integer arguments.

We are not quite finished, even with the simple case of N equally likely alternatives, because we still have an arbitrary constant of proportionality. We recall that the same issue arises in statistical mechanics: what are the units of entropy? In a physical chemistry course you might learn that entropy is measured in "entropy units," with the property that if you multiply by the absolute temperature (in Kelvin) you obtain an energy in units of calories per mole; this happens because the constant of proportionality is chosen to be the gas constant R, which refers to Avogadro's number of molecules.^4 In physics courses entropy is often defined with a factor of Boltzmann's constant k_B, so that if we multiply by the absolute temperature we again obtain an energy (in Joules) but now per molecule (or per degree of freedom), not per mole. In fact many statistical mechanics texts take the sensible view that temperature itself should be measured in energy units—that is, we should always talk about the quantity k_B T, not T alone—so that the entropy, which after all measures the number of possible states of the system, is dimensionless. Any dimensionless proportionality constant can be absorbed by choosing the base that we use for taking logarithms, and in information theory it is conventional to choose base two. Finally, then, we have f(N) = \log_2 N. The units of this measure are called bits, and one bit is the information contained in the choice between two equally likely alternatives.

^4 Whenever I read about entropy units (or calories, for that matter) I imagine that there was some great congress on units at which all such things were supposed to be standardized. Of course every group has its own favorite nonstandard units. Perhaps at the end of some long negotiations the chemists were allowed to keep entropy units in exchange for physicists continuing to use electron Volts.

Ultimately we need to know the information conveyed in the general case where our N possible answers all have unequal probabilities. To make a bridge to this general problem from the simple case of equal probabilities, consider the situation where all the probabilities are rational, that is

p_n = k_n / \sum_m k_m,   (11)

where all the k_n are integers. It should be clear that if we can find the correct information measure for rational {p_n} then by continuity we can extrapolate to the general case; the trick is that we can reduce the case of rational probabilities to the case of equal probabilities. To do this, imagine that we have a total of N_total = \sum_m k_m possible answers, but that we have organized these into N groups, each of which contains k_n possibilities. If we specified the full answer, we would first tell which group it was in, then tell which of the k_n possibilities was realized. In this two step process, at the first step we get the information we are really looking for—which of the N groups are we in—and so the information in the first step is our unknown function,

I_1 = I({p_n}).   (12)

At the second step, if we are in group n then we will gain \log_2 k_n bits, because this is just the problem of choosing from k_n equally likely possibilities, and since group n occurs with probability p_n the average information we gain in the second step is

I_2 = \sum_n p_n \log_2 k_n.   (13)

But this two step process is not the only way to compute the information in the enlarged problem, because, by construction, the enlarged problem is just the problem of choosing from N_total equally likely possibilities. The two calculations have to give the same answer, so that

I_1 + I_2 = \log_2(N_total),   (14)

I({p_n}) + \sum_n p_n \log_2 k_n = \log_2 ( \sum_m k_m ).   (15)

Rearranging the terms, we find

I({p_n}) = − \sum_n p_n \log_2 [ k_n / \sum_m k_m ]   (16)
         = − \sum_n p_n \log_2 p_n.   (17)

Again, although this is worked out explicitly for the case where the p_n are rational, it must be the general answer if the information measure is continuous. So we are done: the information obtained on hearing the answer to a question is measured uniquely by the entropy of the distribution of possible answers.

If we phrase the problem of gaining information from hearing the answer to a question, then it is natural to think about a discrete set of possible answers. On the other hand, if we think about gaining information from the acoustic waveform that reaches our ears, then there is a continuum of possibilities. Naively, we are tempted to write

S_continuum = − \int dx P(x) \log_2 P(x),   (18)

or some multidimensional generalization. The difficulty, of course, is that probability distributions for continuous variables [like P(x) in this equation] have units—the distribution of x has units inverse to the units of x—and we should be worried about taking logs of objects that have dimensions. Notice that if we wanted to compute a difference in entropy between two distributions, this problem would go away. This is a hint that only entropy differences are going to be important.^5

^5 The problem of defining the entropy for continuous variables is familiar in statistical mechanics. In the simple example of an ideal gas in a finite box, we know that the quantum version of the problem has a discrete set of states, so that we can compute the entropy of the gas as a sum over these states. In the limit that the box is large, sums can be approximated as integrals, and if the temperature is high we expect that quantum effects are negligible and one might naively suppose that Planck's constant should disappear from the results; we recall that this is not quite the case. Planck's constant has units of momentum times position, and so is an elementary area for each pair of conjugate position and momentum variables in the classical phase space; in the classical limit the entropy becomes (roughly) the logarithm of the occupied volume in phase space, but this volume is measured in units of Planck's constant. If we had tried to start with a classical formulation (as did Boltzmann and Gibbs, of course) then we would find ourselves with the problems of Eq. (18), namely that we are trying to take the logarithm of a quantity with dimensions. If we measure phase space volumes in units of Planck's constant, then all is well. The important point is that the problems with defining a purely classical entropy do not stop us from calculating entropy differences, which are observable directly as heat flows, and we shall find a similar situation for the information carried by continuous ("classical") variables.

B. Writing down the answers

In the simple case where we ask a question and there are exactly N = 2^m possible answers, all with equal probability, the entropy is just m bits. But if we make a list of all the possible answers we can label each of them with a distinct m–bit binary number: to specify the answer all I need to do is write down this number. Note that the answers themselves can be very complex: different possible answers could correspond to lengthy essays, but the number of pages required to write these essays is irrelevant. If we agree in advance on the set of possible answers, all I have to do in answering the question is to provide a unique label. If we think of the label as a 'codeword' for the answer, then in this simple case the length of the codeword that represents the nth possible answer is given by ℓ_n = − \log_2 p_n, and the average length of a codeword is given by the entropy.

It will turn out that the equality of the entropy and the average length of codewords is much more general than our simple example. Before proceeding, however, it is important to realize that the entropy is emerging as the answer to two very different questions. In the first case we wanted to quantify our intuitive notion of gaining information by hearing the answer to a question. In the second case, we are interested in the problem of representing this answer in the smallest possible space. It is quite remarkable that the only way of quantifying how much we learn by hearing the answer to a question is to measure how much space is required to write down the answer.

Clearly these remarks are interesting only if we can treat more general cases. Let us recall that in statistical mechanics we have the choice of working with a microcanonical ensemble, in which an ensemble of systems is distributed uniformly over states of fixed energy, or with a canonical ensemble, in which an ensemble of systems is distributed across states of different energies according to the Boltzmann distribution. The microcanonical ensemble is like our simple example with all answers having equal probability: entropy really is just the log of the number of possible states. On the other hand, we know that in the thermodynamic limit there is not much difference between the two ensembles. This suggests that we can recover a simple notion of representing answers with codewords of length ℓ_n = − \log_2 p_n provided that we can find a suitable analog of the thermodynamic limit.

Imagine that instead of asking a question once, we ask it many times. As an example, every day we can ask the weatherman for an estimate of the temperature at noon the next day. Now instead of trying to represent the answer to one question we can try to represent the whole stream of answers collected over a long period of time. Thus instead of a possible answer being labelled n, possible sequences of answers are labelled by n_1 n_2 ··· n_N. Of course these sequences have probabilities P(n_1 n_2 ··· n_N), and from these probabilities we can compute an entropy that must depend on the length of the sequence,

S(N) = − \sum_{n_1} \sum_{n_2} \cdots \sum_{n_N} P(n_1 n_2 \cdots n_N) \log_2 P(n_1 n_2 \cdots n_N).   (19)

Notice we are not assuming that successive questions have independent answers, which would correspond to P(n_1 n_2 ··· n_N) = \prod_{i=1}^{N} p_{n_i}. Now we can draw on our intuition from statistical mechanics. The entropy is an extensive quantity, which means that as N becomes large the entropy should be proportional to N; more precisely we should have

\lim_{N→∞} S(N)/N = S,   (20)

where S is the entropy density for our sequence in the same way that a large volume of material has a well defined entropy per unit volume.

The equivalence of ensembles in the thermodynamic limit means that having unequal probabilities in the Boltzmann distribution has almost no effect on anything we want to calculate. In particular, for the Boltzmann distribution we know that, state by state, the log of the probability is the energy and that this energy is itself an extensive quantity. Further we know that (relative) fluctuations in energy are small. But if energy is log probability, and relative fluctuations in energy are small, this must mean that almost all the states we actually observe have log probabilities which are the same. By analogy, all the long sequences of answers must fall into two groups: those with − \log_2 P ≈ NS, and those with P ≈ 0. Now this is all a bit sloppy, but it is the right idea: if we are willing to think about long sequences or streams of data, then the equivalence of ensembles tells us that 'typical' sequences are uniformly distributed over ≈ 2^{NS} possibilities, and that this approximation becomes more and more accurate as the length N of the sequences becomes large.

The idea of typical sequences, which is the information theoretic version of a thermodynamic limit, is enough to tell us that our simple arguments about representing answers by binary numbers ought to work on average for long sequences of answers. We will have to work significantly harder to show that this is really the smallest possible representation. An important if obvious consequence is that if we have many rather unlikely answers (rather than fewer more likely answers) then we need more space to write the answers down. More profoundly, this turns out to be true answer by answer: to be sure that long sequences of answers take up as little space as possible, we need to use an average of ℓ_n = − \log_2 p_n bits to represent each individual answer n. Thus answers which are more surprising require more space to write down.
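The concentration of −\log_2 P onto the value NS can be checked numerically. The sketch below is a rough illustration rather than a proof: it draws independent answers from an arbitrary three-letter distribution (the numbers are invented) and shows that −\log_2 P(sequence)/N clusters tightly around the entropy S, which is the sense in which only ≈ 2^{NS} 'typical' sequences matter.

    import numpy as np

    # Typical sequences: -log2 P(sequence)/N concentrates around the entropy S.
    rng = np.random.default_rng(0)
    p = np.array([0.7, 0.2, 0.1])                      # an arbitrary answer distribution
    S = -np.sum(p * np.log2(p))

    N = 1000                                           # sequence length
    samples = rng.choice(len(p), size=(2000, N), p=p)  # 2000 independent sequences
    log2P = np.sum(np.log2(p[samples]), axis=1)        # log2 of each sequence's probability
    print(S, np.mean(-log2P / N), np.std(-log2P / N))  # mean ~ S, spread shrinks like 1/sqrt(N)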
C. Entropy lost and information gained

Returning to the conversation between Max and Allan, we assumed that Max would receive a complete answer to his question, and hence that all his uncertainty would be removed. This is an idealization, of course. The more natural description is that, for example, the world can take on many states W, and by observing data D we learn something but not everything about W. Before we make our observations, we know only that states of the world are chosen from some distribution P(W), and this distribution has an entropy S(W). Once we observe some particular datum D, our (hopefully improved) knowledge of W is described by the conditional distribution P(W|D), and this has an entropy S(W|D) that is smaller than S(W) if we have reduced our uncertainty about the state of the world by virtue of our observations. We identify this reduction in entropy as the information that we have gained about W.

Perhaps this is the point to note that a single observation D is not, in fact, guaranteed to provide positive information [see, for example, DeWeese and Meister 1999]. Consider, for instance, data which tell us that all of our previous measurements have larger error bars than we thought: clearly such data, at an intuitive level, reduce our knowledge about the world and should be associated with a negative information. Another way to say this is that some data points D will increase our uncertainty about the state W of the world, and hence for these particular data the conditional distribution P(W|D) has a larger entropy than the prior distribution P(W). If we identify information with the reduction in entropy, I_D = S(W) − S(W|D), then such data points are associated unambiguously with negative information. On the other hand, we might hope that, on average, gathering data corresponds to gaining information: although single data points can increase our uncertainty, the average over all data points does not.

If we average over all possible data—weighted, of course, by their probability of occurrence P(D)—we obtain the average information that D provides about W:

I(D → W) = S(W) − \sum_D P(D) S(W|D)   (21)
          = − \sum_W P(W) \log_2 P(W) − \sum_D P(D) [ − \sum_W P(W|D) \log_2 P(W|D) ]   (22)
          = − \sum_W \sum_D P(W, D) \log_2 P(W) + \sum_W \sum_D P(W|D) P(D) \log_2 P(W|D)   (23)
          = − \sum_W \sum_D P(W, D) \log_2 P(W) + \sum_W \sum_D P(W, D) \log_2 P(W|D)   (24)
          = \sum_W \sum_D P(W, D) \log_2 [ P(W|D) / P(W) ]   (25)
          = \sum_W \sum_D P(W, D) \log_2 [ P(W, D) / ( P(W) P(D) ) ].   (26)

We see that, after all the dust settles, the information which D provides about W is symmetric in D and W. This means that we can also view the state of the world as providing information about the data we will observe, and this information is, on average, the same as the data will provide about the state of the world. This 'information provided' is therefore often called the mutual information, and this symmetry will be very important in subsequent discussions; to remind ourselves of this symmetry we write I(D; W) rather than I(D → W).

One consequence of the symmetry or mutuality of information is that we can write

I(D; W) = S(W) − \sum_D P(D) S(W|D)   (27)
        = S(D) − \sum_W P(W) S(D|W).   (28)

If we consider only discrete sets of possibilities then entropies are positive (or zero), so that these equations imply

I(D; W) ≤ S(W),   (29)
I(D; W) ≤ S(D).   (30)

The first equation tells us that by observing D we cannot learn more about the world than there is entropy in the world itself. This makes sense: entropy measures the number of possible states that the world can be in, and we cannot learn more than we would learn by reducing this set of possibilities down to one unique state. Although sensible (and, of course, true), this is not a terribly powerful statement: seldom are we in the position that our ability to gain knowledge is limited by the lack of possibilities in the world around us.^6 The second equation, however, is much more powerful. It says that, whatever may be happening in the world, we can never learn more than the entropy of the distribution that characterizes our data. Thus, if we ask how much we can learn about the world by taking readings from a wind detector on top of the roof, we can place a bound on the amount we learn just by taking a very long stream of data, using these data to estimate the distribution P(D), and then computing the entropy of this distribution.^7

^6 This is not quite true. There is a tradition of studying the nervous system as it responds to highly simplified signals, and under these conditions the lack of possibilities in the world can be a significant limitation, substantially confounding the interpretation of experiments.

^7 In the same way that we speak about the entropy of a gas I will often speak about the entropy of a variable or the entropy of a response. In the gas, we understand from statistical mechanics that the entropy is defined not as a property of the gas but as a property of the distribution or ensemble from which the microscopic states of the gas are chosen; similarly we should really speak here about "the entropy of the distribution of observations," but this is a bit cumbersome. I hope that the slightly sloppy but more compact phrasing does not cause confusion.

The entropy of our observations thus limits how much we can learn no matter what question we were hoping to answer, and so we can think of the entropy as setting (in a slight abuse of terminology) the capacity of the data D to provide or to convey information. As an example, the entropy of neural responses sets a limit to how much information a neuron can provide about the world, and we can estimate this limit even if we don't yet understand what it is that the neuron is telling us (or the rest of the brain).
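For a small discrete problem all of these quantities can be computed directly from the joint table P(W, D). The sketch below uses an arbitrary 3×2 joint distribution (made up for illustration) to evaluate Eqs. (27) and (28) and to check the bounds in Eqs. (29) and (30); the symmetry of the mutual information appears as the equality of the two ways of computing it.

    import numpy as np

    # Mutual information from a joint table P(W, D), following Eqs. (21)-(30).
    P_WD = np.array([[0.30, 0.10],
                     [0.05, 0.25],
                     [0.10, 0.20]])          # rows: states W, columns: data D

    def H(p):                                 # entropy in bits of a 1D distribution
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    P_W, P_D = P_WD.sum(axis=1), P_WD.sum(axis=0)
    S_W_given_D = sum(P_D[d] * H(P_WD[:, d] / P_D[d]) for d in range(len(P_D)))
    S_D_given_W = sum(P_W[w] * H(P_WD[w, :] / P_W[w]) for w in range(len(P_W)))

    I1 = H(P_W) - S_W_given_D                 # Eq. (27)
    I2 = H(P_D) - S_D_given_W                 # Eq. (28): the same number, by symmetry
    print(I1, I2, I1 <= H(P_W), I1 <= H(P_D)) # bounds of Eqs. (29)-(30)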

D. Optimizing input/output relations

These ideas are enough to get started on "designing" some simple neural processes [Laughlin 1981]. Imagine that a neuron is responsible for representing a single number such as the light intensity I averaged over a small patch of the retina (don't worry about time dependence). Assume that this signal will be represented by a continuous voltage V, which is true for the first stages of processing in vision, as we have seen in the first section of the course. The information that the voltage provides about the intensity is

I(V → I) = \int dI \int dV P(V, I) \log_2 [ P(V, I) / ( P(V) P(I) ) ]   (31)
         = \int dI \int dV P(V, I) \log_2 [ P(V|I) / P(V) ].   (32)

The conditional distribution P(V|I) describes the process by which the neuron responds to its input, and so this is what we should try to "design."

Let us suppose that the voltage is on average a nonlinear function of the intensity, and that the dominant source of noise is additive (to the voltage), independent of light intensity, and small compared with the overall dynamic range of the cell:

V = g(I) + ξ,   (33)

with some distribution P_noise(ξ) for the noise. Then the conditional distribution

P(V|I) = P_noise(V − g(I)),   (34)

and the entropy of this conditional distribution can be written as

S_cond = − \int dV P(V|I) \log_2 P(V|I)   (35)
       = − \int dξ P_noise(ξ) \log_2 P_noise(ξ).   (36)

Note that this is a constant, independent both of the light intensity and of the nonlinear input/output relation g(I). This is useful because we can write the information as a difference between the total entropy of the output variable V and this conditional or noise entropy, as in Eq. (28):

I(V → I) = − \int dV P(V) \log_2 P(V) − S_cond.   (37)

With S_cond constant independent of our 'design,' maximizing information is the same as maximizing the entropy of the distribution of output voltages. Assuming that there are maximum and minimum values for this voltage, but no other constraints, then the maximum entropy distribution is just the uniform distribution within the allowed dynamic range. But if the noise is small it doesn't contribute much to broadening P(V) and we calculate this distribution as if there were no noise, so that

P(V) dV = P(I) dI,   (38)

dV/dI = [ 1 / P(V) ] · P(I).   (39)

Since we want to have V = g(I) and P(V) = 1/(V_max − V_min), we find

dg(I)/dI = (V_max − V_min) P(I),   (40)

g(I) = (V_max − V_min) \int_{I_min}^{I} dI′ P(I′).   (41)

Thus, the optimal input/output relation is proportional to the cumulative probability distribution of the input signals.

The predictions of Eq. (41) are quite interesting. First of all it makes clear that any theory of the nervous system which involves optimizing information transmission or efficiency of representation inevitably predicts that the computations done by the nervous system must be matched to the statistics of sensory inputs (and, presumably, to the statistics of motor outputs as well). Here the matching is simple: in the right units we could just read off the distribution of inputs by looking at the (differentiated) input/output relation of the neuron. Second, this simple model automatically carries some predictions about adaptation to overall light levels. If we live in a world with diffuse light sources that are not directly visible, then the intensity which reaches us at a point is the product of the effective brightness of the source and some local reflectances. As it gets dark outside the reflectances don't change—these are material properties—and so we expect that the distribution P(I) will look the same except for scaling. Equivalently, if we view the input as the log of the intensity, then to a good approximation P(log I) just shifts linearly along the log I axis as mean light intensity goes up and down. But then the optimal input/output relation g(I) would exhibit a similar invariant shape with shifts along the input axis when expressed as a function of log I, and this is in rough agreement with experiments on light/dark adaptation in a wide variety of visual neurons. Finally, although obviously a simplified version of the real problem facing even the first stages of visual processing, this calculation does make a quantitative prediction that would be tested if we measure both the input/output relations of early visual neurons and the distribution of light intensities that the animal encounters in nature.

Laughlin made this comparison (25 years ago!) for the fly visual system [Laughlin 1981].

He built an electronic photodetector with aperture and spectral sensitivity matched to those of the fly retina and used his photodetector to scan natural scenes, measuring P(I) as it would appear at the input to these neurons. In parallel he characterized the second order neurons of the fly visual system—the large monopolar cells which receive direct synaptic input from the photoreceptors—by measuring the peak voltage response to flashes of light. The agreement with Eq. (41) was remarkable (Fig. 1), especially when we remember that there are no free parameters. While there are obvious open questions (what happened to time dependence?), this is a really beautiful result that inspires us to take these ideas more seriously.

FIG. 1 Matching of input/output relation to the distribution of inputs in the fly large monopolar cells (LMCs). Top, a schematic probability distribution for light intensity. Middle, the optimal input/output relation according to Eq. (41), highlighting the fact that equal ranges of output correspond to equal mass under the input distribution. Bottom, measurements of the input/output relation in LMCs compared with these predictions. From [Laughlin 1981 & 1987].
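Equation (41) is also easy to play with numerically. In the sketch below the 'light intensities' are synthetic (a lognormal stand-in for a measured P(I), chosen only for illustration); the optimal input/output relation is the empirical cumulative distribution scaled to the output range, and applying it to the inputs produces a nearly uniform distribution of outputs, which is the histogram-equalization picture of Fig. 1.

    import numpy as np

    # Optimal input/output relation, Eq. (41): g(I) proportional to the
    # cumulative distribution of the inputs.
    rng = np.random.default_rng(1)
    I_samples = rng.lognormal(mean=0.0, sigma=0.5, size=100000)   # synthetic P(I)

    V_min, V_max = 0.0, 1.0
    I_grid = np.linspace(I_samples.min(), I_samples.max(), 200)
    cdf = np.searchsorted(np.sort(I_samples), I_grid) / I_samples.size
    g = V_min + (V_max - V_min) * cdf            # the optimal g(I)

    # with this choice the outputs V = g(I) are (nearly) uniformly distributed,
    # i.e. the output entropy is maximized within the allowed dynamic range
    V = np.interp(I_samples, I_grid, g)
    print(np.histogram(V, bins=10, range=(0, 1))[0] / V.size)     # ~0.1 per bin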

E. Maximum entropy distributions

Since the information we can gain is limited by the entropy, it is natural to ask if we can put limits on the entropy using some low order statistical properties of the data: the mean, the variance, perhaps higher moments or correlation functions, ... . In particular, if we can say that the entropy has a maximum value consistent with the observed statistics, then we have placed a firm upper bound on the information that these data can convey.

The problem of finding the maximum entropy given some constraint again is familiar from statistical mechanics: the Boltzmann distribution is the distribution that has the largest possible entropy given the mean energy. More generally, imagine that we have knowledge not of the whole probability distribution P(D) but only of some expectation values,

⟨f_i⟩ = \sum_D P(D) f_i(D),   (42)

where we allow that there may be several expectation values known (i = 1, 2, ..., K). Actually there is one more expectation value that we always know, and this is that the average value of one is one; the distribution is normalized:

⟨f_0⟩ = \sum_D P(D) = 1.   (43)

Given the set of numbers {⟨f_0⟩, ⟨f_1⟩, ···, ⟨f_K⟩} as constraints on the probability distribution P(D), we would like to know the largest possible value for the entropy, and we would like to find explicitly the distribution that provides this maximum.

The problem of maximizing a quantity subject to constraints is formulated using Lagrange multipliers. In this case, we want to maximize S = − \sum P(D) \log_2 P(D), so we introduce a function S̃, with one Lagrange multiplier λ_i for each constraint:

S̃[P(D)] = − \sum_D P(D) \log_2 P(D) − \sum_{i=0}^{K} λ_i ⟨f_i⟩   (44)
         = − (1/\ln 2) \sum_D P(D) \ln P(D) − \sum_{i=0}^{K} λ_i \sum_D P(D) f_i(D).   (45)

Our problem is then to find the maximum of the function S̃, but this is easy because the probability for each value of D appears independently. The result is that

P(D) = (1/Z) \exp [ − \sum_{i=1}^{K} λ_i f_i(D) ],   (46)

where Z = \exp(1 + λ_0) is a normalization constant.

There are a few more things worth saying about maximum entropy distributions. First, we recall that if the value of D indexes the states n of a physical system, and we know only the expectation value of the energy,

⟨E⟩ = \sum_n P_n E_n,   (47)

then the maximum entropy distribution is

P_n = (1/Z) \exp(−λ E_n),   (48)

which is the Boltzmann distribution (as promised). In this case the Lagrange multiplier λ has physical meaning—it is the inverse temperature. Further, the function S̃ that we introduced for convenience is the difference between the entropy and λ times the energy; if we divide through by λ and flip the sign, then we have the energy minus the temperature times the entropy, or the free energy. Thus the distribution which maximizes entropy at fixed average energy is also the distribution which minimizes the free energy.

If we are looking at a magnetic system, for example, and we know not just the average energy but also the average magnetization, then a new term appears in the exponential of the probability distribution, and we can interpret this term as the magnetic field multiplied by the magnetization. More generally, for every order parameter which we assume is known, the probability distribution acquires a term that adds to the energy and can be thought of as a product of the order parameter with its conjugate force. Again, all these remarks should be familiar from a statistical mechanics course.

Probability distributions that have the maximum entropy form of Eq. (46) are special not only because of their connection to statistical mechanics, but because they form what the statisticians call an 'exponential family,' which seems like an obvious name. The important point is that exponential families of distributions are (almost) unique in having sufficient statistics. To understand what this means, consider the following problem: We observe a set of samples D_1, D_2, ···, D_N, each of which is drawn independently and at random from a distribution P(D|{λ_i}). Assume that we know the form of this distribution but not the values of the parameters {λ_i}. How can we estimate these parameters from the set of observations {D_n}? Notice that our data set {D_n} consists of N numbers, and N can be very large; on the other hand there typically are a small number K ≪ N of parameters λ_i that we want to estimate. Even in this limit, no finite amount of data will tell us the exact values of the parameters, and so we need a probabilistic formulation: we want to compute the distribution of parameters given the data, P({λ_i}|{D_n}). We do this using Bayes' rule,

P({λ_i}|{D_n}) = [ 1 / P({D_n}) ] · P({D_n}|{λ_i}) P({λ_i}),   (49)

where P({λ_i}) is the distribution from which the parameter values themselves are drawn. Then since each datum D_n is drawn independently, we have

P({D_n}|{λ_i}) = \prod_{n=1}^{N} P(D_n|{λ_i}).   (50)

For probability distributions of the maximum entropy form we can proceed further, using Eq. (46):

P({λ_i}|{D_n}) = [ 1 / P({D_n}) ] · P({D_n}|{λ_i}) P({λ_i})
               = [ P({λ_i}) / P({D_n}) ] \prod_{n=1}^{N} P(D_n|{λ_i})   (51)

               = [ P({λ_i}) / ( Z^N P({D_n}) ) ] \prod_{n=1}^{N} \exp [ − \sum_{i=1}^{K} λ_i f_i(D_n) ]   (52)
               = [ P({λ_i}) / ( Z^N P({D_n}) ) ] \exp [ −N \sum_{i=1}^{K} λ_i (1/N) \sum_{n=1}^{N} f_i(D_n) ].   (53)

We see that all of the information that the data points {D_n} can give about the parameters λ_i is contained in the average values of the functions f_i over the data set, or the 'empirical means' f̄_i,

f̄_i = (1/N) \sum_{n=1}^{N} f_i(D_n).   (54)

More precisely, the distribution of possible parameter values consistent with the data depends not on all details of the data, but rather only on the empirical means {f̄_i},

P({λ_i}|D_1, D_2, ···, D_N) = P({λ_i}|{f̄_j}),   (55)

and a consequence of this is the information theoretic statement

I(D_1, D_2, ···, D_N → {λ_i}) = I({f̄_j} → {λ_i}).   (56)

This situation is described by saying that the reduced set of variables {f̄_j} constitute sufficient statistics for learning the distribution. Thus, for distributions of this form, the problem of compressing N data points into K ≪ N variables that are relevant for parameter estimation can be solved explicitly: if we keep track of the running averages f̄_i we can compress our data as we go along, and we are guaranteed that we will never need to go back and examine the data in more detail. A clear example is that if we know data are drawn from a Gaussian distribution, running estimates of the mean and variance contain all the information available about the underlying parameter values.

The Gaussian example makes it seem that the concept of sufficient statistics is trivial: of course if we know that data are chosen from a Gaussian distribution, then to identify the distribution all we need to do is to keep track of two moments. Far from trivial, this situation is quite unusual. Most of the distributions that we might write down do not have this property—even if they are described by a finite number of parameters, we cannot guarantee that a comparably small set of empirical expectation values captures all the information about the parameter values. If we insist further that the sufficient statistics be additive and permutation symmetric,^8 then it is a theorem that only exponential families have sufficient statistics.

^8 These conditions mean that the sufficient statistics can be constructed as running averages, and that the same data points in different sequence carry the same information.

The generic problem of information processing, by the brain or by a machine, is that we are faced with a huge quantity of data and must extract those pieces that are of interest to us. The idea of sufficient statistics is intriguing in part because it provides an example where this problem of 'extracting interesting information' can be solved completely: if the points D_1, D_2, ···, D_N are chosen independently and at random from some distribution, the only thing which could possibly be 'interesting' is the structure of the distribution itself (everything else is random, by construction), this structure is described by a finite number of parameters, and there is an explicit algorithm for compressing the N data points {D_n} into K numbers that preserve all of the interesting information.^9 The crucial point is that this procedure cannot exist in general, but only for certain classes of probability distributions. This is an introduction to the idea that some kinds of structure in data are learnable from random examples, while other structures are not.

^9 There are more subtle questions: How many bits are we squeezing out in this process of compression, and how many bits of relevance are left? We return to this and related questions when we talk about learning later in the course.
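The Gaussian case makes the idea of sufficient statistics concrete. In the sketch below (sample size and parameter values are arbitrary choices), the log likelihood of the whole data set is written so that it depends on the samples only through N and the two running averages f̄_1 = ⟨x⟩ and f̄_2 = ⟨x²⟩; any two data sets with the same N, f̄_1, f̄_2 give exactly the same function of (μ, σ), which is the content of Eqs. (54)–(56).

    import numpy as np

    # Sufficient statistics for a Gaussian: the data enter the likelihood only
    # through N, f1 = <x> and f2 = <x^2>.
    rng = np.random.default_rng(2)
    x = rng.normal(loc=1.0, scale=2.0, size=5000)
    f1, f2 = np.mean(x), np.mean(x * x)        # the empirical means of Eq. (54)

    def log_likelihood(mu, sigma, f1, f2, N):
        # log P({x_n}|mu, sigma), expressed through the sufficient statistics
        return -N * (np.log(sigma) + 0.5 * np.log(2 * np.pi)
                     + (f2 - 2 * mu * f1 + mu ** 2) / (2 * sigma ** 2))

    print(log_likelihood(1.0, 2.0, f1, f2, x.size),
          log_likelihood(0.0, 1.0, f1, f2, x.size))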

Consider the (Boltzmann) probability distribution for the states of a system in thermal equilibrium. If we expand the Hamiltonian as a sum of terms (operators) then the family of possible probability distributions is an exponential family in which the coupling constants for each operator are the parameters analogous to the λ_i above. In principle there could be an infinite number of these operators, but for a given class of systems we usually find that only a finite set are "relevant" in the renormalization group sense: if we write an effective Hamiltonian for coarse grained degrees of freedom, then only a finite number of terms will survive the coarse graining procedure. If we have only a finite number of terms in the Hamiltonian, then the family of Boltzmann distributions has sufficient statistics, which are just the expectation values of the relevant operators. This means that the expectation values of the relevant operators carry all the information that the (coarse grained) configuration of the system can provide about the coupling constants, which in turn is information about the identity or microscopic structure of the system. Thus the statement that there are only a finite number of relevant operators is also the statement that a finite number of expectation values carries all the information about the microscopic dynamics. The 'if' part of this statement is obvious: if there are only a finite number of relevant operators, then the expectation values of these operators carry all the information about the identity of the system. The statisticians, through the theorem about the uniqueness of exponential families, give us the 'only if': a finite number of expectation values (or correlation functions) can provide all the information about the system only if the effective Hamiltonian has a finite number of relevant operators. I suspect that there is more to say along these lines, but let us turn instead to some examples.

Consider the situation in which the data D are real numbers x. Suppose that we know the mean value of x and its variance. This is equivalent to knowledge of two expectation values,

f̄_1 = ⟨x⟩ = \int dx P(x) x,  and   (57)
f̄_2 = ⟨x^2⟩ = \int dx P(x) x^2,   (58)

so we have f_1(x) = x and f_2(x) = x^2. Thus, from Eq. (46), the maximum entropy distribution is of the form

P(x) = (1/Z) \exp(−λ_1 x − λ_2 x^2).   (59)

This is a funny way of writing a more familiar object. If we identify the parameters λ_2 = 1/(2σ^2) and λ_1 = −⟨x⟩/σ^2, then we can rewrite the maximum entropy distribution as the usual Gaussian,

P(x) = [ 1 / \sqrt{2πσ^2} ] \exp [ − (x − ⟨x⟩)^2 / (2σ^2) ].   (60)

We recall that Gaussian distributions usually arise through the central limit theorem: if the random variables of interest can be thought of as sums of many independent events, then the distributions of the observable variables converge to Gaussians. This provides us with a 'mechanistic' or reductionist view of why Gaussians are so important. A very different view comes from information theory: if all we know about a variable is the mean and the variance, then the Gaussian distribution is the maximum entropy distribution consistent with this knowledge. Since the entropy measures (returning to our physical intuition) the randomness or disorder of the system, the Gaussian distribution describes the 'most random' or 'least structured' distribution that can generate the known mean and variance.
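A quick numerical check of this maximum entropy property: at fixed variance the Gaussian differential entropy, (1/2) \log_2(2πeσ^2), exceeds that of other common unit-variance densities. The two comparison distributions below (uniform and two-sided exponential) are arbitrary choices for illustration.

    import numpy as np

    # At fixed variance, the Gaussian has the largest differential entropy.
    sigma2 = 1.0
    S_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma2)

    # uniform on [-a, a]: variance a^2/3 = 1 -> a = sqrt(3); entropy = log2(2a)
    S_uniform = np.log2(2 * np.sqrt(3.0))
    # two-sided exponential with variance 2b^2 = 1 -> b = 1/sqrt(2); entropy = log2(2*b*e)
    S_laplace = np.log2(2 * np.e / np.sqrt(2.0))

    print(S_gauss, S_uniform, S_laplace)   # the Gaussian value is the largest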

A somewhat different situation is when the data D are generated by counting. Then the relevant variable is an integer n = 0, 1, 2, ···, and it is natural to imagine that what we know is the mean count ⟨n⟩. One way this problem can arise is that we are trying to communicate and are restricted to sending discrete or quantized units. An obvious case is in optical communication, where the quanta are photons. In the brain, quantization abounds: most neurons do not generate continuous analog voltages but rather communicate with one another through stereotyped pulses or spikes, and even if the voltages vary continuously transmission across a synapse involves the release of a chemical transmitter which is packaged into discrete vesicles. It can be relatively easy to measure the mean rate at which discrete events are counted, and we might want to know what bounds this mean rate places on the ability of the cells to convey information. Alternatively, there is an energetic cost associated with these discrete events—generating the electrical currents that underlie the spike, constructing and filling the vesicles, ... —and we might want to characterize the mechanisms by their cost per bit rather than their cost per event.

If we know the mean count, there is (as for the Boltzmann distribution) only one function f_1(n) = n that can appear in the exponential of the distribution, so that

P(n) = (1/Z) \exp(−λn).   (61)

Of course we have to choose the Lagrange multiplier to fix the mean count, and it turns out that λ = \ln(1 + 1/⟨n⟩); further we can find the entropy

S_max(counting) = \log_2(1 + ⟨n⟩) + ⟨n⟩ \log_2(1 + 1/⟨n⟩).   (62)

The information conveyed by counting something can never exceed the entropy of the distribution of counts, and if we know the mean count then the entropy can never exceed the bound in Eq. (62). Thus, if we have a system in which information is conveyed by counting discrete events, the simple fact that we count only a limited number of events (on average) sets a bound on how much information can be transmitted. We will see that real neurons and synapses approach this fundamental limit.

One might suppose that if information is coded in the counting of discrete events, then each event carries a certain amount of information. In fact this is not quite right. In particular, if we count a large number of events then the maximum counting entropy becomes

S_max(counting; ⟨n⟩ → ∞) ∼ \log_2(⟨n⟩ e),   (63)

and so we are guaranteed that the entropy (and hence the information) per event goes to zero, although the approach is slow. On the other hand, if events are very rare, so that the mean count is much less than one, we find the maximum entropy per event

(1/⟨n⟩) S_max(counting; ⟨n⟩ ≪ 1) ∼ \log_2( e / ⟨n⟩ ),   (64)

which is arbitrarily large for small mean count. This makes sense: rare events have an arbitrarily large capacity to surprise us and hence to convey information. It is important to note, though, that the maximum entropy per event is a monotonically decreasing function of the mean count. Thus if we are counting spikes from a neuron, counting in larger windows (hence larger mean counts) is always less efficient in terms of bits per spike.

If it is more efficient to count in small time windows, perhaps we should think not about counting but about measuring the arrival times of the discrete events. If we look at a total (large) time interval 0 < t < T, then we will observe arrival times t_1, t_2, ···, t_N in this interval; note that the number of events N is also a random variable. We want to find the distribution P(t_1, t_2, ···, t_N) that maximizes the entropy while holding fixed the average event rate. We can write the entropy of the distribution as a sum of two terms, one from the entropy of the arrival times given the count and one from the entropy of the counting distribution:

S ≡ − \sum_{N=0}^{∞} \int d^N t_n P(t_1, t_2, ···, t_N) \log_2 P(t_1, t_2, ···, t_N)
  = \sum_{N=0}^{∞} P(N) S_time(N) − \sum_{N=0}^{∞} P(N) \log_2 P(N),   (65)

where we have made use of

P(t_1, t_2, ···, t_N) = P(t_1, t_2, ···, t_N|N) P(N),   (66)

and the (conditional) entropy of the arrival times is given by

S_time(N) = − \int d^N t_n P(t_1, t_2, ···, t_N|N) \log_2 P(t_1, t_2, ···, t_N|N).   (67)

If all we fix is the mean count, ⟨N⟩ = \sum_N P(N) N, then the conditional distributions for the locations of the events given the total number of events, P(t_1, t_2, ···, t_N|N), are unconstrained. We can maximize the contribution of each of these terms to the entropy [the terms in the first sum of Eq. (65)] by making the distributions P(t_1, t_2, ···, t_N|N) uniform, but it is important to be careful about normalization. When we integrate over all the times t_1, t_2, ···, t_N, we are forgetting that the events are all identical, and hence that permutations of the times describe the same events. Thus the normalization condition is not

\int_0^T dt_1 \int_0^T dt_2 \cdots \int_0^T dt_N P(t_1, t_2, ···, t_N|N) = 1,   (68)

but rather

(1/N!) \int_0^T dt_1 \int_0^T dt_2 \cdots \int_0^T dt_N P(t_1, t_2, ···, t_N|N) = 1.   (69)

This means that the uniform distribution must be

P(t_1, t_2, ···, t_N|N) = N! / T^N,   (70)

and hence that the entropy [substituting into Eq. (65)] becomes

S = − \sum_{N=0}^{∞} P(N) [ \log_2( N!/T^N ) + \log_2 P(N) ].   (71)

Now to find the maximum entropy we proceed as before. We introduce Lagrange multipliers to constrain the mean count and the normalization of the distribution P(N), which leads to the function

S̃ = − \sum_{N=0}^{∞} P(N) [ \log_2( N!/T^N ) + \log_2 P(N) + λ_0 + λ_1 N ],   (72)

and then we maximize this function by varying P(N). As before the different Ns are not coupled, so the optimization conditions are simple:

0 = ∂S̃/∂P(N)   (73)
  = − (1/\ln 2) [ \ln( N!/T^N ) + \ln P(N) + 1 ] − λ_0 − λ_1 N,   (74)

\ln P(N) = − \ln( N!/T^N ) − (λ_1 \ln 2) N − (1 + λ_0 \ln 2).   (75)

Combining terms and simplifying, we have

P(N) = (1/Z) (λT)^N / N!,   (76)

Z = \sum_{N=0}^{∞} (λT)^N / N! = \exp(λT).   (77)

This is the Poisson distribution.

The Poisson distribution usually is derived (as in our discussion of photon counting) by assuming that the probability of occurrence of an event in any small time bin of size Δτ is independent of events in any other bin, and then we let Δτ → 0 to obtain a distribution in the continuum. This is not surprising: we have found that the maximum entropy distribution of events given the mean number of events (or their density ⟨N⟩/T) is given by the Poisson distribution, which corresponds to the events being thrown down at random with some probability per unit time (again, ⟨N⟩/T) and no interactions among the events. This describes an 'ideal gas' of events along a line (time). More generally, the ideal gas is the gas with maximum entropy given its density; interactions among the gas molecules always reduce the entropy if we hold the density fixed.

If we have multiple variables, x_1, x_2, ···, x_N, then we can go through all of the same analyses as before. In particular, if these are continuous variables and we are told the means and covariances among the variables, then the maximum entropy distribution is again a Gaussian distribution, this time the appropriate multidimensional Gaussian. This example, like the other examples so far, is simple in that we can give not only the form of the distribution but we can find the values of the parameters that will satisfy the constraints. In general this is not so easy: think of the Boltzmann distribution, where we would have to adjust the temperature to obtain a given value of the average energy, but if we can give an explicit relation between the temperature and average energy for any system then we have solved almost all of statistical mechanics!

One important example is provided by binary strings. If we label 1s by spin up and 0s by spin down, the binary string is equivalent to an Ising chain {σ_i}. Fixing the probability of a 1 is the same as fixing the mean magnetization ⟨σ_i⟩. If, in addition, we specify the joint probability of two 1s occurring in bins separated by n steps (for all n), this is equivalent to fixing the spin–spin correlation function ⟨σ_i σ_j⟩. The maximum entropy distribution consistent with these constraints is an Ising model,

P[{σ_i}] = (1/Z) \exp [ −h \sum_i σ_i − \sum_{ij} J_{ij} σ_i σ_j ];   (78)

note that the interactions are pairwise (because we fix only a two–point function) but not limited to near neighbors. Obviously the problem of finding the exchange interactions which match the correlation function is not so simple.

Another interesting feature of the Ising or binary string problem concerns higher order correlation functions. If we have continuous variables and constrain the two–point correlation functions, then the maximum entropy distribution is Gaussian and there are no nontrivial higher order correlations. But if the signals we observe are discrete, as in the sequence of spikes from a neuron, then the maximum entropy distribution is an Ising model and this model makes nontrivial predictions about the multipoint correlations. In particular, if we record the spike trains from K separate neurons and measure all of the pairwise correlation functions, then the corresponding Ising model predicts that there will be irreducible correlations among triplets of neurons, and higher order correlations as well [Schneidman et al 2006].

Before closing the discussion of maximum entropy distributions, note that our simple solution to the problem, Eq. (46), might not work. Taking derivatives and setting them to zero works only if the solution to our problem is in the interior of the domain allowed by the constraints. It is also possible that the solution lies at the boundary of this allowed region. This seems especially likely when we combine different kinds of constraints, such as trying to find the maximum entropy distribution of images consistent both with the two–point correlation function and with the histogram of intensity values at one point. The relevant distribution is a 2D field theory with a (generally nonlocal) quadratic 'kinetic energy' and some arbitrary local potential; it is not clear that all combinations of correlations and histograms can be realized, nor that the resulting field theory will be stable under renormalization.^10 There are many open questions here.

^10 The empirical histograms of local quantities in natural images are stable under renormalization [Ruderman and Bialek 1994].
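For a handful of binary units the Ising predictions can be worked out exactly by summing over all states. The sketch below builds the distribution of Eq. (78) for three units with an invented field and couplings (not fits to any recorded spike trains) and evaluates the connected triplet correlation, which is generically nonzero even though only one- and two-point functions were constrained.

    import numpy as np
    from itertools import product

    # A pairwise maximum entropy (Ising) model for three binary units, Eq. (78).
    h = 0.1
    J = {(0, 1): 0.4, (0, 2): -0.3, (1, 2): 0.2}     # illustrative couplings

    states = np.array(list(product([-1, 1], repeat=3)))
    logw = -h * states.sum(axis=1) \
           - sum(Jij * states[:, i] * states[:, j] for (i, j), Jij in J.items())
    P = np.exp(logw)
    P /= P.sum()

    m = P @ states                                   # <sigma_i>
    C = states.T @ (P[:, None] * states)             # <sigma_i sigma_j>
    t = np.sum(P * states.prod(axis=1))              # <sigma_1 sigma_2 sigma_3>
    # connected (irreducible) triplet correlation
    t_c = t - m[0]*C[1, 2] - m[1]*C[0, 2] - m[2]*C[0, 1] + 2*m[0]*m[1]*m[2]
    print(t_c)   # generically nonzero: pairwise constraints imply triplet structure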

F. Information transmission with noise

We now want to look at information transmission in the presence of noise, connecting back a bit to what we discussed in earlier parts of the course. Imagine that we are interested in some signal x, and we have a detector that generates data y which is linearly related to the signal but corrupted by added noise:

y = gx + ξ.   (79)

It seems reasonable in many systems to assume that the noise arises from many added sources (e.g., the Brownian motion of electrons in a circuit) and hence has a Gaussian distribution because of the central limit theorem. We will also start with the assumption that x is drawn from a Gaussian distribution just because this is a simple place to start; we will see that we can use the maximum entropy property of Gaussians to make some more general statements based on this simple example. The question, then, is how much information observations on y provide about the signal x.

Let us formalize our assumptions. The statement that ξ is Gaussian noise means that once we know x, y is Gaussian distributed around a mean value of gx:

P(y|x) = [ 1 / \sqrt{2π⟨ξ^2⟩} ] \exp [ − (y − gx)^2 / (2⟨ξ^2⟩) ].   (80)

Our simplification is that the signal x also is drawn from a Gaussian distribution,

P(x) = [ 1 / \sqrt{2π⟨x^2⟩} ] \exp [ − x^2 / (2⟨x^2⟩) ],   (81)

and hence y itself is Gaussian,

P(y) = [ 1 / \sqrt{2π⟨y^2⟩} ] \exp [ − y^2 / (2⟨y^2⟩) ],   (82)

⟨y^2⟩ = g^2 ⟨x^2⟩ + ⟨ξ^2⟩.   (83)

To compute the information that y provides about x we use Eq. (26):

I(y → x) = \int dy \int dx P(x, y) \log_2 [ P(x, y) / ( P(x) P(y) ) ]  bits   (84)
         = (1/\ln 2) \int dy \int dx P(x, y) \ln [ P(y|x) / P(y) ]   (85)
         = (1/\ln 2) ⟨ \ln [ \sqrt{2π⟨y^2⟩} / \sqrt{2π⟨ξ^2⟩} ] − (y − gx)^2 / (2⟨ξ^2⟩) + y^2 / (2⟨y^2⟩) ⟩,   (86)

where by h· · ·i we understand an expectation value over input, and then transducing this corrupted input: the joint distribution P (x, y). Now in Eq. (86) we can see that the first term is the expectation value of a con- y = g(x + ηeff ), (90) stant, which is just the constant. The third term involves where, obviously, ηeff = ξ/g. Note that the “effective the expectation value of y2 divided by hy2i, so we can noise” ηeff is in the same units as the input x; this is cancel numerator and denominator. In the second term, called ‘referring the noise to the input’ and is a standard we can take the expectation value first of y with x fixed, way of characterizing detectors, amplifiers and other de- and then average over x, but since y = gx + ξ the nu- vices. Clearly if we build a photodetector it is not so merator is just the mean square fluctuation of y around useful to quote the noise level in Volts at the output 2 its mean value, which again cancels with the hξ i in the ... we want to know how this noise limits our ability denominator. So we have, putting the three terms to- to detect dim lights. Similarly, when we characterize a gether, neuron that uses a stream of pulses to encode a con- tinuous signal, we don’t want to know the variance in " s # the pulse rate; we want to know how noise in the neural 1 hy2i 1 1 response limits precision in estimating the real signal, I(y → x) = ln 2 − + (87) ln 2 hξ i 2 2 and this amounts to defining an effective noise level in 1 hy2i the units of the signal itself. In the present case this is = log (88) 2 2 hξ2i just a matter of dividing, but generally it is a more com- plex task. With the effective noise level, the information 1  g2hx2i = log 1 + bits. (89) transmission takes a simple form, 2 2 hξ2i 1  hx2i  I(y → x) = log 1 + bits, (91) 2 2 hη2 i Although it may seem like useless algebra, I would like eff to rewrite this result a little bit. Rather than thinking of or our detector as adding noise after generating the signal 1 I(y → x) = log (1 + SNR), (92) gx, we can think of it as adding noise directly to the 2 2 14 where the signal to noise ratio is the ratio of the vari- then ance in the signal to the variance of the effective noise, 2 2 SNR = hx i/hηeff i. yi = gijxj + ξi, (93) The result in Eq. (92) is easy to picture: When we 2 1/2 start, the signal is spread over a range δx0 ∼ hx i , but with the usual convention that we sum over repeated by observing the output of our detector we can localize indices. The Gaussian assumptions are that each xi and 2 1/2 ξ has zero mean, but in general we have to think about the signal to a small range δx1 ∼ hηeff i , and the reduc- i tion in entropy is ∼ log2(δx0/δx1) ∼ (1/2) · log2(SNR), arbitrary covariance matrices, which is approximately the information gain. As a next step consider the case where we ob- Sij = hxixji (94) serve several variables y1, y2, ··· , yK in the hopes of Nij = hξiξji. (95) learning about the same number of underlying signals x1, x2, ··· , xK . The equations analogous to Eq. (79) are The relevant probability distributions are

As a next step consider the case where we observe several variables y_1, y_2, ···, y_K in the hopes of learning about the same number of underlying signals x_1, x_2, ···, x_K. The equations analogous to Eq. (79) are then

  y_i = g_{ij} x_j + \xi_i,   (93)

with the usual convention that we sum over repeated indices. The Gaussian assumptions are that each x_i and ξ_i has zero mean, but in general we have to think about arbitrary covariance matrices,

  S_{ij} = \langle x_i x_j \rangle,   (94)

  N_{ij} = \langle \xi_i \xi_j \rangle.   (95)

The relevant probability distributions are

  P(\{x_i\}) = \frac{1}{\sqrt{(2\pi)^K \det S}} \exp\left[-\frac{1}{2}\, x_i (S^{-1})_{ij}\, x_j\right],   (96)

  P(\{y_i\}|\{x_i\}) = \frac{1}{\sqrt{(2\pi)^K \det N}} \exp\left[-\frac{1}{2}\,(y_i - g_{ik}x_k)\,(N^{-1})_{ij}\,(y_j - g_{jm}x_m)\right],   (97)

where again the summation convention is used; det S denotes the determinant of the matrix S, and (S⁻¹)_ij is the ij element in the inverse of the matrix S.

To compute the information we proceed as before. First we find P({y_i}) by doing the integrals over the x_i,

  P(\{y_i\}) = \int d^K x\, P(\{y_i\}|\{x_i\})\, P(\{x_i\}),   (98)

and then we write the information as an expectation value,

  I(\{y_i\} \to \{x_i\}) = \left\langle \log_2 \frac{P(\{y_i\}|\{x_i\})}{P(\{y_i\})} \right\rangle,   (99)

where ⟨···⟩ denotes an average over the joint distribution P({y_i},{x_i}). As in Eq. (86), the logarithm can be broken into several terms such that the expectation value of each one is relatively easy to calculate. Two of three terms cancel, and the one which survives is related to the normalization factors that come in front of the exponentials. After the dust settles we find

  I(\{y_i\} \to \{x_i\}) = \frac{1}{2}\, {\rm Tr}\, \log_2 \left[\mathbf{1} + N^{-1}\cdot(g \cdot S \cdot g^T)\right],   (100)

where Tr denotes the trace of a matrix, 1 is the unit matrix, and g^T is the transpose of the matrix g.

The matrix g·S·g^T describes the covariance of those components of y that are contributed by the signal x. We can always rotate our coordinate system on the space of ys to make this matrix diagonal, which corresponds to finding the eigenvectors and eigenvalues of the covariance matrix; these eigenvectors are also called "principal components." The eigenvectors describe directions in the space of y which are fluctuating independently, and the eigenvalues are the variances along each of these directions. If the covariance of the noise is diagonal in the same coordinate system, then the matrix N⁻¹·(g·S·g^T) is diagonal and the elements along the diagonal are the signal to noise ratios along each independent direction. Taking the Tr log is equivalent to computing the information transmission along each direction using Eq. (92), and then summing the results.
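As a sanity check on Eq. (100), the sketch below uses made-up covariances; the matrix g·S·gᵀ is constructed directly rather than from a separate gain matrix and signal covariance. Noise and signal covariances are built to be diagonal in the same basis, (1/2) Tr log₂[1 + N⁻¹·(g·S·gᵀ)] is evaluated through the eigenvalues, and the result is compared with Eq. (92) applied mode by mode.

import numpy as np

rng = np.random.default_rng(1)
K = 5

# A random orthonormal basis; signal and noise variances are assigned along its directions,
# so N and g.S.g^T are diagonal in the same coordinate system (the case discussed above).
Q, _ = np.linalg.qr(rng.normal(size=(K, K)))
sig_var = rng.uniform(0.5, 2.0, K)            # variances of the signal contribution to y
noise_var = rng.uniform(0.1, 1.0, K)          # noise variances along the same directions

gSgT = Q @ np.diag(sig_var) @ Q.T             # plays the role of g.S.g^T
N = Q @ np.diag(noise_var) @ Q.T              # noise covariance

# Eq. (100): (1/2) Tr log2 [1 + N^{-1} (g S g^T)], via the eigenvalues of the bracketed matrix
M = np.eye(K) + np.linalg.solve(N, gSgT)
I_trlog = 0.5 * np.sum(np.log2(np.real(np.linalg.eigvals(M))))

# Eq. (92) along each independent direction, then summed
I_modes = 0.5 * np.sum(np.log2(1.0 + sig_var / noise_var))

print(f"Tr log formula: {I_trlog:.6f} bits   mode-by-mode sum: {I_modes:.6f} bits")

The two numbers agree to numerical precision, which is just the statement that the trace of a matrix logarithm is the sum of the logarithms of its eigenvalues.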

An important case is when the different variables x_i represent a signal sampled at several different points in time. Then there is some underlying continuous function x(t), and in place of the discrete Eq. (93) we have the continuous linear response of the detector to input signals,

  y(t) = \int dt'\, M(t - t')\, x(t') + \xi(t).   (101)

In this continuous case the analog of the covariance matrix ⟨x_i x_j⟩ is the correlation function ⟨x(t)x(t')⟩. We are usually interested in signals (and noise) that are stationary. This means that all statistical properties of the signal are invariant to translations in time: a particular pattern of wiggles in the function x(t) is equally likely to occur at any time. Thus, the correlation function which could in principle depend on two times t and t' depends only on the time difference,

  \langle x(t)\, x(t') \rangle = C_x(t - t').   (102)

The correlation function generalizes the covariance matrix to continuous time, but we have seen that it can be useful to diagonalize the covariance matrix, thus finding a coordinate system in which fluctuations in the different directions are independent. From previous lectures we know that the answer is to go into a Fourier representation, where (in the Gaussian case) different Fourier components are independent and their variances are (up to normalization) the power spectra.
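A quick numerical illustration of this diagonalization (a sketch only; the AR(1) process below is just a convenient stand-in for a stationary Gaussian signal, and all parameters are invented): nearby time samples are strongly correlated, while well separated Fourier components of the same realizations are nearly uncorrelated, and the variance of each Fourier component is, up to normalization, the power spectrum.

import numpy as np

rng = np.random.default_rng(2)
n_trials, n_t = 5000, 256
a = 0.9                                       # AR(1) coefficient; correlation time ~ 10 samples

# Many independent realizations of the same stationary Gaussian process
x = np.zeros((n_trials, n_t))
x[:, 0] = rng.normal(0.0, 1.0, n_trials)      # start in the stationary state (unit variance)
drive = rng.normal(0.0, np.sqrt(1 - a**2), size=(n_trials, n_t))
for t in range(1, n_t):
    x[:, t] = a * x[:, t - 1] + drive[:, t]

def corr(u, v):
    """Magnitude of the (possibly complex) correlation coefficient across trials."""
    u = u - u.mean()
    v = v - v.mean()
    return np.abs(np.vdot(u, v)) / np.sqrt(np.vdot(u, u).real * np.vdot(v, v).real)

xf = np.fft.rfft(x, axis=1)                   # Fourier components of each realization

print(f"corr(x[t], x[t+3])      ~ {corr(x[:, 100], x[:, 103]):.3f}  (theory: {a**3:.3f})")
print(f"corr(x~[k=5], x~[k=40]) ~ {corr(xf[:, 5], xf[:, 40]):.3f}  (nearly zero)")
p5 = np.mean(np.abs(xf[:, 5]) ** 2)
p100 = np.mean(np.abs(xf[:, 100]) ** 2)
print(f"mean |x~[k]|^2 at k=5 vs k=100: {p5:.0f} vs {p100:.0f}  (the power spectrum, up to normalization)")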

To complete the analysis of the continuous time Gaussian channel described by Eq. (101), we again refer noise to the input by writing

  y(t) = \int dt'\, M(t - t')\,[x(t') + \eta_{\rm eff}(t')].   (103)

If both signal and effective noise are stationary, then each has a power spectrum; let us denote the power spectrum of the effective noise η_eff by N_eff(ω) and the power spectrum of x by S_x(ω) as usual. There is a signal to noise ratio at each frequency,

  {\rm SNR}(\omega) = \frac{S_x(\omega)}{N_{\rm eff}(\omega)},   (104)

and since we have diagonalized the problem by Fourier transforming, we can compute the information just by adding the contributions from each frequency component, so that

  I[y(t) \to x(t)] = \frac{1}{2} \sum_\omega \log_2[1 + {\rm SNR}(\omega)].   (105)

Finally, to compute the frequency sum, we recall that

  \sum_n f(\omega_n) \to T \int \frac{d\omega}{2\pi}\, f(\omega).   (106)

Thus, the information conveyed by observations on a (large) window of time becomes

  I[y(0 < t < T) \to x(0 < t < T)] \to \frac{T}{2} \int \frac{d\omega}{2\pi}\, \log_2[1 + {\rm SNR}(\omega)]\,{\rm bits}.   (107)

We see that the information gained is proportional to the time of our observations, so it makes sense to define an information rate:

  R_{\rm info} \equiv \lim_{T \to \infty} \frac{1}{T}\, I[y(0 < t < T) \to x(0 < t < T)]   (108)

   = \frac{1}{2} \int \frac{d\omega}{2\pi}\, \log_2[1 + {\rm SNR}(\omega)]\,{\rm bits/sec}.   (109)

Note that in all these equations, integrals over frequency run over both positive and negative frequencies; if the signals are sampled at points in time spaced by τ_0 then the maximum (Nyquist) frequency is |ω|_max = π/τ_0.
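As a concrete illustration of Eq. (109) (a sketch only; the Lorentzian signal spectrum, the white effective noise floor, and all numbers below are invented, not measured), the frequency integral can be evaluated numerically:

import numpy as np

# Invented spectra: a Lorentzian signal spectrum with correlation time tau_c, normalized so
# that the integral over d(omega)/(2 pi) gives the signal variance, plus a flat effective noise.
tau_c = 10e-3                                   # 10 ms correlation time (assumed)
x_var = 1.0                                     # total signal variance <x^2> (assumed)
N_eff = 2e-4                                    # white effective noise spectral density (assumed)

omega = np.linspace(-2e4, 2e4, 400_001)         # rad/s, covering positive and negative frequencies
domega = omega[1] - omega[0]
S_x = 2.0 * x_var * tau_c / (1.0 + (omega * tau_c) ** 2)

snr = S_x / N_eff
R_info = 0.5 * np.sum(np.log2(1.0 + snr)) * domega / (2.0 * np.pi)   # Eq. (109)

print(f"information rate ~ {R_info:.0f} bits/s")

With these made-up numbers the rate comes out on the order of a thousand bits per second; the same few lines, fed with a measured effective noise spectrum, give the kind of conservative estimate discussed in the next paragraphs.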
The Gaussian channel is interesting in part because we can work everything out in detail, but in fact we can learn a little bit more than this. There are many situations in which there are good physical reasons to believe that noise will be Gaussian: it arises from the superposition of many small events, and the central limit theorem applies. If we know that noise is Gaussian and we know its spectrum, we might ask how much information can be gained about a signal that has some known statistical properties. Because of the maximum entropy property of the Gaussian distribution, this true information transmission rate is always less than or equal to what we would compute by assuming that the signal is Gaussian, measuring its spectrum, and plugging into Eq. (109). Notice that this bound is saturated only in the case where the signal in fact is Gaussian, that is when the signal has some of the same statistical structure as the noise. We will see another example of this somewhat counterintuitive principle in just a moment.

Now we can use the maximum entropy argument in a different way. When we study cells deep in the brain, we might choose to deliver signals that are drawn from a Gaussian distribution, but given the nonlinearities of real neurons there is no reason to think that the effective noise in the representation of these signals will be Gaussian. But we can use the maximum entropy property of the Gaussian once again, this time to show that if we can measure the power spectrum of the effective noise, and we plug this into Eq. (109), then we will obtain a lower bound to the true information transmission rate. Thus we can make conservative statements about what neurons can do, and we will see that even these conservative statements can be quite powerful.

If the effective noise is Gaussian, then we know that the maximum information transmission is achieved by choosing signals that are also Gaussian. But this doesn't tell us how to choose the spectrum of these maximally informative signals. We would like to say that measurements of effective noise levels can be translated into bounds on information transmission, but this requires that we solve the optimization problem for shaping the spectrum. Clearly this problem is not well posed without some constraints: if we are allowed just to increase the amplitude of the signal (multiply the spectrum by a large constant) then we can always increase information transmission. We need to study the optimization of information rate with some fixed 'dynamic range' for the signals. A simple example, considered by Shannon at the outset, is to fix the total variance of the signal [Shannon 1949], which is the same as fixing the integral of the spectrum. We can motivate this constraint by noting that if the signal is a voltage and we have to drive this signal through a resistive element, then the variance is proportional to the mean power dissipation. Alternatively, it might be easy to measure the variance of the signals that we are interested in (as for the visual signals in the example below), and then the constraint is empirical.

So the problem we want to solve is maximizing R_info while holding ⟨x²⟩ fixed. As before, we introduce a Lagrange multiplier and maximize a new function

  \tilde{R} = R_{\rm info} - \lambda \langle x^2 \rangle   (110)

   = \frac{1}{2} \int \frac{d\omega}{2\pi}\, \log_2\left[1 + \frac{S_x(\omega)}{N_{\rm eff}(\omega)}\right] - \lambda \int \frac{d\omega}{2\pi}\, S_x(\omega).   (111)

The value of the function S_x(ω) at each frequency contributes independently, so it is easy to compute the functional derivatives,

  \frac{\delta\tilde{R}}{\delta S_x(\omega)} = \frac{1}{2\ln 2} \cdot \frac{1}{1 + S_x(\omega)/N_{\rm eff}(\omega)} \cdot \frac{1}{N_{\rm eff}(\omega)} - \lambda,   (112)

and of course the optimization condition is δR̃/δS_x(ω) = 0. The result is that

  S_x(\omega) + N_{\rm eff}(\omega) = \frac{1}{2\lambda \ln 2}.   (113)

Thus the optimal choice of the signal spectrum is one which makes the sum of signal and (effective) noise equal to white noise! This, like the fact that information is maximized by a Gaussian signal, is telling us that efficient information transmission occurs when the received signals are as random as possible given the constraints. Thus an attempt to look for structure in an optimally encoded signal (say, deep in the brain) will be frustrating.

In general, complete whitening as suggested by Eq. (113) can't be achieved at all frequencies, since if the system has finite time resolution (for example) the effective noise grows without bound at high frequencies. Thus the full solution is to have the spectrum determined by Eq. (113) everywhere that the spectrum comes out to a positive number, and then to set the spectrum equal to zero outside this range. If we think of the effective noise spectrum as a landscape with valleys, the condition for optimizing information transmission corresponds to filling the valleys with water; the total volume of water is the variance of the signal.
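This 'water filling' prescription is easy to implement numerically. The sketch below uses an invented effective noise spectrum (a quiet valley at low frequencies, rising steeply at high frequencies where the finite time resolution bites) and finds the water level, the constant in Eq. (113), by bisection so that the optimal signal spectrum uses up a given variance budget.

import numpy as np

# Invented effective noise spectrum, in arbitrary units
omega = np.linspace(-3e3, 3e3, 60_001)              # rad/s
domega = omega[1] - omega[0]
N_eff = 1e-4 * (1.0 + (omega / 500.0) ** 4)

signal_variance = 0.1                               # the fixed budget (arbitrary here; cf. <C^2> = 0.1 below)

def filled_variance(level):
    """Variance used when the valleys of N_eff are filled with 'water' up to 'level'."""
    S_x = np.clip(level - N_eff, 0.0, None)
    return np.sum(S_x) * domega / (2.0 * np.pi)     # integral of S_x over d(omega)/(2 pi)

# Bisection on the water level, i.e. the constant 1/(2 lambda ln 2) in Eq. (113)
lo, hi = N_eff.min(), N_eff.min() + 10.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if filled_variance(mid) < signal_variance else (lo, mid)

level = 0.5 * (lo + hi)
S_x = np.clip(level - N_eff, 0.0, None)
R_info = 0.5 * np.sum(np.log2(1.0 + S_x / N_eff)) * domega / (2.0 * np.pi)

print(f"water level = {level:.2e}, optimal information rate ~ {R_info:.0f} bits/s")

The band over which S_x(ω) comes out nonzero is finite, which is exactly the 'fill the valleys' picture described above.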
G. Back to the fly's retina

These ideas have been used to characterize information transmission across the first synapse in the fly's visual system [de Ruyter van Steveninck and Laughlin 1996]. We have seen these data before, in thinking about how the precision of photon counting changes as the background light intensity increases. Recall that, over a reasonable dynamic range of intensity variations, de Ruyter van Steveninck and Laughlin found that the average voltage response of the photoreceptor cell is related linearly to the intensity or contrast in the movie, and the noise or variability δV(t) is governed by a Gaussian distribution of voltage fluctuations around the average:

  V(t) = V_{\rm DC} + \int dt'\, T(t - t')\, C(t') + \delta V(t).   (114)

This (happily) is the problem we have just analyzed. As before, we think of the noise in the response as being equivalent to noise δC_eff(t) that is added to the movie itself,

  V(t) = V_{\rm DC} + \int dt'\, T(t - t')\,[C(t') + \delta C_{\rm eff}(t')].   (115)

Since the fluctuations have a Gaussian distribution, they can be characterized completely by their power spectrum N_C^eff(ω), which measures the variance of the fluctuations δC_eff that occur at different frequencies,

  \langle \delta C_{\rm eff}(t)\, \delta C_{\rm eff}(t') \rangle = \int \frac{d\omega}{2\pi}\, N_C^{\rm eff}(\omega)\, \exp[-i\omega(t - t')].   (116)

There is a minimum level of this effective noise set by the random arrival of photons (shot noise). The photon noise is white if expressed as N_C^eff(ω), although it makes a nonwhite contribution to the voltage noise. As we have discussed, over a wide range of background light intensities and frequencies, the fly photoreceptors have effective noise levels that reach the limit set by photon statistics. At high frequencies there is excess noise beyond the physical limit, and this excess noise sets the time resolution of the system.
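In practice, referring the measured voltage noise to the contrast input is done frequency by frequency: Eq. (115) implies N_V(ω) = |T̃(ω)|² N_C^eff(ω), where T̃(ω) is the Fourier transform of the transfer function and N_V(ω) is the voltage noise spectrum, so the effective contrast noise is obtained by dividing out |T̃(ω)|². The sketch below uses an invented first-order transfer function and an invented voltage noise spectrum; none of the numbers correspond to the actual measurements.

import numpy as np

omega = np.linspace(-3e3, 3e3, 60_001)               # rad/s

# Invented ingredients, for illustration only:
tau = 5e-3                                           # assumed photoreceptor time constant (s)
T_tilde = 1.0 / (1.0 + 1j * omega * tau)             # assumed transfer function, Fourier domain
N_V = 1e-6 * (1.0 + (omega / 2e3) ** 2)              # assumed voltage noise spectrum

# Refer the voltage noise to the contrast input: N_V = |T|^2 N_C^eff  =>  N_C^eff = N_V / |T|^2
N_C_eff = N_V / np.abs(T_tilde) ** 2

mid_index = len(omega) // 2
print(f"effective contrast noise at omega ~ 0:     {N_C_eff[mid_index]:.2e}")
print(f"effective contrast noise at the band edge: {N_C_eff[-1]:.2e}")

The effective noise blows up at high frequencies, where the cell no longer responds; a spectrum of this kind is what gets fed into the water filling calculation above, with ⟨C²⟩ playing the role of the variance budget.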

The power spectrum of the effective noise tells us, ultimately, what signals the photoreceptor can and cannot transmit. How do we turn these measurements into bits? One approach is to assume that the fly lives in some particular environment, and then calculate how much information the receptor cell can provide about this particular environment. But to characterize the cell itself, we might ask a different question: in principle how much information can the cell transmit? To answer this question we are allowed to shape the statistical structure of the environment so as to make the best use of the receptor (the opposite, presumably, of what happens in evolution!). This is just the optimization discussed above, so it is possible to turn the measurements on signals and noise into estimates of the information capacity of these cells. This was done both for the photoreceptor cells and for the large monopolar cells that receive direct synaptic input from a group of six receptors. From measurements on natural scenes the mean square contrast signal was fixed at ⟨C²⟩ = 0.1. Results are shown in Fig 2.

FIG. 2 At left, the effective contrast noise levels in a single photoreceptor cell, a single LMC (the second order cell), and the inferred noise level for a single active zone of the synapse from photoreceptor to LMC. The hatching shows the signal spectra required to whiten the total output over the largest possible range while maintaining the input contrast variance ⟨C²⟩ = 0.1, as discussed in the text. At right, the resulting information capacities as a function of the photon counting rates in the photoreceptors. From [de Ruyter van Steveninck & Laughlin 1996a].

The first interesting feature of the results is the scale: individual neurons are capable of transmitting well above 1000 bits per second. This does not mean that this capacity is used under natural conditions, but rather speaks to the precision of the mechanisms underlying the detection and transmission of signals in this system. Second, information capacity continues to increase as the level of background light increases: noise due to photon statistics is less important in brighter lights, and this reduction of the physical limit actually improves the performance of the system even up to very high photon counting rates, indicating once more that the physical limit is relevant to the real performance. Third, we see that the information capacity as a function of photon counting rate is shifted along the counting rate axis as we go from photoreceptors to LMCs, and this corresponds (quite accurately!) to the fact that LMCs integrate signals from six photoreceptors and thus act as if they captured photons at six times the rate. Finally, in the large monopolar cells information has been transmitted across a synapse, and in the process is converted from a continuous voltage signal into discrete events corresponding to the release of neurotransmitter vesicles at the synapse. As a result, there is a new limit to information transmission that comes from viewing the large monopolar cell as a "vesicle counter."

If every vesicle makes a measurable, deterministic contribution to the cell's response (a generous assumption), then the large monopolar cell's response is equivalent to reporting how many vesicles are counted in a small window of time corresponding to the photoreceptor time resolution. We don't know the distribution of these counts, but we can estimate (from other experiments, with uncertainty) the mean count, and we know that there is a maximum entropy for any count distribution once we fix the mean, from Eq. (62) above. No mechanism at the synapse can transmit more information than this limit. Remarkably, the fly operates within a factor of two of this limit, and the agreement might be even better but for uncertainties in the vesicle counting rate [Rieke et al 1997].
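We do not reproduce Eq. (62) here, but for non-negative integer counts with a fixed mean the maximum entropy distribution is geometric, with entropy (n̄+1)log₂(n̄+1) − n̄ log₂ n̄ bits; assuming this is the bound being invoked, and assuming (purely for illustration, not the measured values) a counting window and a vesicle rate, the ceiling on bits per second can be sketched as follows. Treating successive windows as independent is an additional assumption.

import numpy as np

def max_count_entropy_bits(nbar):
    """Entropy (bits) of the geometric distribution, which maximizes entropy among
    distributions over counts n = 0, 1, 2, ... with mean nbar."""
    return (nbar + 1.0) * np.log2(nbar + 1.0) - nbar * np.log2(nbar)

# Invented numbers, for illustration only:
window = 2e-3            # counting window ~ photoreceptor time resolution, in seconds
vesicle_rate = 1e4       # assumed mean vesicle release rate, per second
nbar = vesicle_rate * window

bits_per_window = max_count_entropy_bits(nbar)
print(f"mean count per window: {nbar:.0f}")
print(f"bound: {bits_per_window:.2f} bits/window ~ {bits_per_window / window:.0f} bits/s")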

These two examples [Laughlin 1981; de Ruyter van Steveninck & Laughlin 1996], both from the first synapse in fly vision, provide evidence that the visual system really has constructed a transformation of light intensity into transmembrane voltage that is efficient in the sense defined by information theory. In fact there is more to the analysis of even this one synapse (e.g., why does the system choose the particular filter characteristics that it does?) and there are gaps (e.g., putting together the dynamics and the nonlinearities). But, armed with some suggestive results, let's go on ... .