
Copyright by Zhao Liu 2020

The Dissertation Committee for Zhao Liu certifies that this is the approved version of the following dissertation:

Optimizing Mutual Information over Neural Tuning Curves

Committee:

Thibaud Taillefumier, Supervisor

Lorenzo A. Sadun

Ngoc Mai Tran

Gustavo de Veciana

Optimizing Mutual Information over Neural Tuning Curves

by

Zhao Liu

DISSERTATION

Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN

December 2020

Dedicated to my parents.

Acknowledgments

First, I would like to thank my advisor, Professor Thibaud Taillefumier, for his continuous advice, support, and encouragement. Without his guidance, it would have been impossible for me to complete my studies at UT Austin. I am very fortunate to be his student.

I would like to thank Prof. Lorenzo Sadun, Prof. Ngoc Tran, and Prof. Gustavo de Veciana for serving as my committee members. I am very grateful to Prof. Sadun and Prof. de Veciana for fruitful discussions regarding my research work. I took a graduate course with Prof. Tran, and I really appreciate her mentorship and help during my study.

I would also like to thank the Department of Mathematics for their financial and mental support. Our graduate advisors, Prof. Thomas , Prof. Kui and Prof. Timothy Perutz, and graduate coordinators, Ms. Elisa Bass and Ms. Jenny Kondo, have always been there to help me with any difficulty I encountered. I thank them for everything they have done for me and for the department.

Finally, I thank my parents for their support and love. I also thank Rongting , Qi , Xiaoxia , Xikai Zhao, Luyan , Manyi Yim, and all my other family members and friends who accompanied, encouraged and helped me during the past six years.

Optimizing Mutual Information over Neural Tuning Curves

Zhao Liu, Ph.D. The University of Texas at Austin, 2020

Supervisor: Thibaud Taillefumier

In this dissertation we study tuning curves that maximize the mutual information in a multi-dimensional Poisson model based on the Efficient Coding Hypothesis.

Tuning curves are functions that characterize the stimulus-response relations of neurons by specifying their firing rates. Under certain biological and energy constraints, tuning curves in a sensory system are often assumed to transmit information in the most efficient way, which is known as the Efficient Coding Hypothesis. With the coding efficiency measured by mutual information, many information-theoretic approaches have been proposed to optimize the tuning curves. However, previous studies are mainly based on low-dimensional models or restricted to certain types of neurons.

We investigate properties of optimal tuning curves that encode a low-dimensional stimulus via high-dimensional responses in a Poisson firing model. In particular, we focus on the constraints that affect the heterogeneity in their profiles. We show that the classical approach of Fisher information approximation does not yield relevant solutions, and thus we compute the mutual information directly with gradient-based Monte Carlo algorithms. We discover that under the cyclic invariance condition, neurons tend to fire at either the maximum or the minimum firing rate in a low-firing-rate regime, which is theoretically proven for a continuous Poissonian limit. In this case, heterogeneous populations cannot be obtained by maximizing this continuous limit.

When tuning curves are not constrained to be cyclic invariant, theoretical results show that neural firing rates are still discrete. However, neurons can fire at intermediary rates in high-firing-rate regimes. Based on this result, we propose an alternating algorithm that optimizes discrete particles, from which we discover that optimal tuning configurations are affected by the trade-off between the number of states and the size of the space. When the number of states is less than the capacity of the space, the regularity constraint on the particles can result in hierarchical, heterogeneous tuning curves that demonstrate population splitting.

Table of Contents

Acknowledgments

Abstract

List of Figures

Chapter 1. Introduction

Chapter 2. Fisher Information Analysis
2.1 Mutual Information and Fisher Information
2.2 Poisson Model
2.3 Constrained Optimization Problem for One Population
2.3.1 Monotonic tuning curves
2.3.2 Piecewise tuning curves
2.4 Constrained Optimization Problem for Multiple Populations
2.4.1 Two populations
2.4.2 Generalization to multiple populations
2.4.3 The geometry of stationary solutions
2.5 Discussion

Chapter 3. Cyclic Invariant Model
3.1 Basic Assumptions
3.2 Discrete Model
3.2.1 Single population with equal discretization
3.2.2 Single population with subset discretization
3.2.3 Multiple populations
3.2.4 Numerical results
3.3 Continuous Model
3.3.1 The continuous limit

3.3.2 The optimal tuning curves
3.3.2.1 A model problem with two neurons
3.3.2.2 Generalization to the Poissonian limit
3.3.3 Adding convolution
3.4 Extensions
3.4.1 Extension to multiple populations
3.4.2 Extension to multiple inputs
3.5 Discussion

Chapter 4. Particle Model
4.1 Basic Setting
4.2 Discreteness of Optimal Tuning Curves
4.2.1 Equal Coding Theorem
4.2.2 The capacity achieving problem
4.2.3 Existence and uniqueness of solution
4.2.4 The discreteness of capacity-achieving measure
4.3 Alternating Maximization Algorithm
4.3.1 Stochastic Gradient Descent
4.3.2 Blahut-Arimoto Algorithm
4.3.3 Introducing regularity
4.4 Analysis and Approximation
4.4.1 The repelling effect of gradient
4.4.2 Approximation by lower bound
4.5 Numerical Results
4.5.1 The effect of the number of states
4.5.2 The effect of the size of space
4.5.3 Adding regularity
4.6 Discussion

Chapter 5. Summary

Appendix

Appendix 1. Additional Proofs
1.1 Proof of Proposition 3.2.1
1.2 Proof of the Continuous Limit
1.3 Proof of Lemma 3.3.2
1.4 Proof of Lemma 3.3.4
1.5 Proof of Theorem 4.2.1
1.6 Proofs in Section 4.2.3
1.6.1 Proof of Lemma 4.2.2
1.6.2 Proof of Lemma 4.2.3
1.6.3 Proof of Lemma 4.2.4
1.7 Proofs in Section 4.2.4
1.7.1 Proof of Lemma 4.2.8
1.7.2 Proof of Theorem 4.2.9

Bibliography

List of Figures

2.1 Neural spikes: an approximate plot of an action potential (left) as if recorded from inside a neuron (see inset), and the spike recordings (right) of a real neuron taken from [19]
2.2 An example of periodic tuning curves for one population
2.3 An example of mutually exclusive supported tuning curves for multiple populations
2.4 Fisher information as the path length
2.5 Tuning curves with diverging path lengths

3.1 Cyclic invariant tuning curves for 3 neurons
3.2 Discrete tuning curves for 10 neurons with 10 bins
3.3 Discrete tuning curves for 10 neurons with 20 bins
3.4 Optimized tuning curves with different average constraints
3.5 Optimized tuning curves in high firing rate regimes
3.6 Mutual information vs. average firing rate in different maximal firing rate regimes
3.7 Optimized tuning curves with different numbers of populations and subset discretization
3.8 Optimized tuning curves and rate curves with different convolution kernels
3.9 Random saturated functions with k peaks

3.10 The optimal average firing rate as a function of f_+, fixing f_- = 0.1
3.11 Piecewise constant tuning curve with 4 continuous regions
3.12 Mutual information (left) and 3 samples of tuning curves (right)
3.13 The function F(f̄) with f_+ = 4.0, f_- = 10^-5
3.14 The mutual information upper bound U_k(f̄) for k inputs
3.15 Selected components of U_k(f̄) when k = 4
3.16 The mutual information upper bound U_k(f̄) when k = 4
4.1 Tuning curves for 3 neurons without cyclic invariance
4.2 A geometric interpretation of the necessary and sufficient condition of the capacity-achieving measure
4.3 A model of 3 neurons and 4 states
4.4 Tuning curves for 1 neuron with the same mutual information
4.5 Tuning curves for 3 neurons with the same mutual information and different path lengths
4.6 An example of the alternating maximization scheme
4.7 Mutual information and gradient for a two-particle model in 1-d
4.8 Probability mass functions for the Poisson distribution
4.9 Mutual information and the lower bound for a two-particle model in 1-d
4.10 Mutual information vs. the number of neurons with a fixed number of states
4.11 Mutual information vs. the number of states in one dimension
4.12 Optimized tuning values in 1-dim Gaussian and Poisson models

4.13 Optimized tuning values in 3-dim state space with different f_+ (colors indicate the location of points, and mutual information is calculated in bits)

4.14 Initial condition (f1(θ), f2(θ), f3(θ)) for adding regularity
4.15 Optimized tuning curves for the 3-dim Poisson model with different regularization: (a) state space representation and (b) the functions f1(θ), f2(θ), f3(θ)
4.16 Optimized tuning curves for the Poisson model with different regularization in N = 4, 6, 8 dimensions
4.17 Optimized tuning curves for the Poisson model with different regularization in N = 10, 12 dimensions

Chapter 1

Introduction

Sensory neurons process information by firing and propagating electrical impulses called action potentials or spikes. The dependence of a neuron's spiking activity on the stimulus is classically described via tuning curves, which specify firing rates as functions of relevant stimulus features. The variety of tuning curves observed across sensory modalities remains a mystery in terms of their role in shaping the neural code, especially in view of information-theoretic frameworks such as the Efficient Coding Hypothesis.

The Efficient Coding Hypothesis, proposed by Barlow in 1961 [5], states that sensory systems have evolved to optimize the flow of information about the environment, subject to certain biological and energy constraints. The coding efficiency is naturally quantified by mutual information, which represents the "Shannon bits" one can obtain about a stimulus X through observing a neural response Y [46]:

\[
I(X;Y) = D_{\mathrm{KL}}\big(p(x,y)\,\|\,p(x)\,p(y)\big)
= \iint dx\,dy\; p(x,y)\,\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)
\]

Based on the Efficient Coding Hypothesis, a natural framework has been established whose major goal is to study the maximum of mutual information. However, existing applications of this framework mainly rely on the

low-dimensionality of models [44, 26, 35, 29, 42], Gaussian or independent noise assumptions [33, 3, 28], and/or specific tuning parameters [23, 47, 16, 17, 21]. Extending such studies to multidimensional, complex situations has proven to be theoretically challenging, and as a result the Efficient Coding Hypothesis remains debated. The challenge essentially stems from the non-convexity of mutual information as a function of the tuning curves, preventing the exploration of high-dimensional models.

Several approaches have been proposed to resolve this difficulty. Under certain conditions, the mutual information between high-dimensional inputs and responses can be converted to a sum of one-dimensional quantities [52]. When such decomposition is not possible, bounds or approximation methods were derived, including Monte Carlo estimators [27], kernel mutual information [22], k-nearest-neighbor (kNN) based estimators [18], and variational lower bound approximations [4, 36, 9].

Besides the approximation methods, one popular approach is to use Fisher information as a substitute. Brunel and Nadal [10] derived a Fisher information bound as a lower bound on the mutual information, which is obtained from the Cramér-Rao inequality in a parameter estimation setting. Although extensive work has been done based on this result [10, 16, 37], directly applying Fisher information can sometimes be problematic. Wei and Stocker [48] showed that the Fisher information bound only holds under certain assumptions. Other numerical results [7, 50] demonstrated the imprecision of the Fisher information bound as well.

In this dissertation, we study the properties of optimal tuning curves that maximize information transmission under various biological constraints. Specifically, we focus on the situation where multiple neurons encode a low-dimensional stimulus via high-dimensional responses, which are assumed to be conditionally independent and Poisson distributed. We put no restrictions on the type of neurons or the shape of tuning curves.

We first analyze how Fisher information performs in this model in Chapter 2. As a result, it yields unphysical solutions with an infinite amount of information, while the exact mutual information remains finite. The reason is that under certain constraints, maximally informative tuning curves in terms of Fisher information are shown to have mutually exclusive supports, such that the multi-dimensional problem is decomposed into a sum of one-dimensional problems. Geometrically, the Fisher information of a stationary solution can be interpreted as a local path length in a state space, which fails to characterize global features. Hence, it leads to a mismatch between the dimensionality of the parameter and the dimensionality of the representation.

Consequently, we turn to the exact mutual information instead of its approximation by Fisher information. We use gradient-based Monte Carlo simulations to solve for the optimal tuning curves and explore the dependence of their profiles on the biological constraints, including average activity, finite activity range, and symmetry.

In Chapter 3, we assume that the tuning curves are cyclic invariant. This is a common pattern that occurs in practice, e.g. in head direction cells [51]. We first formulate a vector representation of tuning curves via discretization. Under certain average and boundedness constraints, numerical experiments show that the optimal tuning curves saturate to the upper and lower boundaries. This binary-valued feature has been revealed to be optimal under different conditions [44, 6, 41] and was taken as an underlying assumption in some studies [21, 52]. We show that this result holds rigorously in a continuous limit involving an infinite neural population with a low firing rate, by studying the entropy and auto-correlation of tuning curves. However, this continuous limit cannot explain heterogeneity in population splitting except for the case when the input dimension is increased. This failure follows from the low firing rate assumption as well as the symmetry assumption.

Hence, to explore a more general situation, we relax these assumptions in Chapter 4. In this case, maximizing the mutual information is proven to be equivalent to achieving the information capacity of a multi-dimensional Poisson channel. We show that the optimal tuning curves are discrete, which serves as an extension of previous results [41, 39] to high dimensions. This enables us to formulate the representation of tuning curves as a set of particles, of which the states and probabilities can be optimized by alternating between Stochastic Gradient Descent [38] and the Blahut-Arimoto algorithm [8, 2]. Furthermore, a regularity constraint is naturally incorporated into the algorithm by minimizing the elastic energy in state space. Contrary to the case of cyclic invariance, numerical simulations suggest that optimal tuning curves become non-binary, saturating to intermediary values and displaying heterogeneous phenomena in their profiles.

The outline of this dissertation is as follows. In Chapter 2, we introduce the definition of the problem based on the Efficient Coding Hypothesis and analyze the drawbacks of the Fisher information approximation. In Chapter 3, under the condition of cyclic invariance, we show the binary-saturation phenomena of the discrete model and provide theoretical results by extending to a continuous Poissonian limit. In Chapter 4, the cyclic invariance constraint is dropped; we prove the discreteness of optimal tuning curves and describe the alternating maximization algorithm. We also study the dependence of tuning curves on various constraints, including the number of states, the size of the space, and regularity.

Chapter 2

Fisher Information Analysis

In this chapter, we first introduce the background and formalize a neural coding model with a one-dimensional input and multi-dimensional, Poisson distributed outputs. To investigate the Efficient Coding Hypothesis, we solve the problem of maximizing the Fisher information bound using the calculus of variations, subject to monotonicity and boundary constraints. By looking into these solutions, we discover a significant difference between the mutual information and the Fisher information, including their limits and geometric properties. Thus, Fisher information is not well suited for analyzing our Poisson model.

2.1 Mutual Information and Fisher Information

To begin with, we briefly introduce mutual information and Fisher information in the neural coding context. Consider a neuron responding to a certain kind of stimulus, such as sound, light intensity, head direction, etc. After a stimulus is applied, the neuron spikes: its membrane potential rapidly rises to a peak potential and then quickly drops to the resting potential, transmitting electrical signals along the axon and to other cells (see Figure 2.1).

Figure 2.1: Neural spikes: an approximate plot of an action potential (left) as if recorded from inside a neuron (see inset), and the spike recordings (right) of a real neuron taken from [19].

Suppose the input stimulus can be represented by a one-dimensional random variable \(\theta \in \mathbb{R}\). Induced by the randomness of the input, the neuron's response \(r\) is also a random variable with a conditional probability distribution \(p(r|\theta)\). Often the response \(r\) is taken as the spike count of a neuron. In a system of more than one cell, \(r\) could be multi-dimensional.

The mutual information measures the “Shannon bits” one can obtain from a stimulus θ through observing a response r. Its formal definition [46] in information theory is the Kullback–Leibler divergence between the joint distribution and the product of marginal distributions:

\[
I(r;\theta) = D_{\mathrm{KL}}\big(p(r,\theta)\,\|\,p(r)\,p(\theta)\big)
= \iint dr\,d\theta\; p(r,\theta)\,\ln\left(\frac{p(r,\theta)}{p(r)\,p(\theta)}\right)
\tag{2.1}
\]
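As a concrete illustration (a sketch, not part of the dissertation's model), the discrete analogue of this divergence can be evaluated directly from a joint probability table; the base-2 logarithm is used below, so the result is in bits rather than the nats of the formulas above.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a discrete joint distribution p_xy[i, j] = P(X=i, Y=j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal P(X = i)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal P(Y = j)
    mask = p_xy > 0                          # convention: 0 log 0 = 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask])).sum())

# Perfectly correlated binary variables carry exactly 1 bit
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(mutual_information(joint))   # → 1.0

# Independent variables carry 0 bits
indep = np.outer([0.5, 0.5], [0.3, 0.7])
print(mutual_information(indep))   # → 0.0
```

The mask handles the usual 0 log 0 = 0 convention so that zero-probability cells do not produce NaNs.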

On the other hand, Fisher information (also called the Fisher metric [1]) is not an information-theoretic quantity. It is a metric on the statistical manifold \(S = \{\,p(r|\theta) \mid \theta \in \Theta\,\}\), where \(\Theta\) is the parameter space. If \(\theta\) is one-dimensional, the Fisher information is written as

\[
F_r(\theta) = \mathbb{E}_{p(r|\theta)}\left[\left(\frac{\partial \ln p(r|\theta)}{\partial \theta}\right)^2\right]
= \int \left(\frac{\partial \ln p(r|\theta)}{\partial \theta}\right)^2 p(r|\theta)\,dr
\tag{2.2}
\]

Since \(\mathbb{E}_{p_\theta}\!\left[\frac{\partial \ln p_\theta}{\partial \theta}\right] = \int \frac{\partial p(r|\theta)}{\partial \theta}\,dr = 0\), \(F_r(\theta)\) can also be interpreted as the variance of the change (derivative) in the log-likelihood \(\ln p(r|\theta)\). Note that \(F_r(\theta)\) is a function of \(\theta\) whereas \(I(r;\theta)\) is a constant, and both are non-negative.

A connection between mutual information and Fisher information has been established. Brunel and Nadal [10] showed that, under the condition that a "good estimator" of \(\theta\) exists (i.e. an estimator \(\hat\theta(r)\) with mean \(\theta\), variance \(\frac{1}{F_r(\theta)}\), and \(H(\hat\theta) \approx H(\theta)\)), the mutual information between \(r\) and \(\theta\) is bounded below by:

\[
I(r;\theta) \ge H(\theta) - \frac{1}{2}\int p(\theta)\,\ln\left(\frac{2\pi e}{F_r(\theta)}\right) d\theta
\tag{2.3}
\]

where \(H(\theta)\) stands for the entropy of \(\theta\),

\[
H(\theta) = -\int p(\theta)\,\ln p(\theta)\,d\theta
\]

The right-hand side of Equation (2.3) is often referred to as the Fisher information bound

\[
I_F(r;\theta) := H(\theta) - \frac{1}{2}\int p(\theta)\,\ln\left(\frac{2\pi e}{F_r(\theta)}\right) d\theta
\tag{2.4}
\]

The conditions under which inequality (2.3) holds were reexamined by Wei and Stocker [48]. They showed that \(I_F\) serves as a lower bound only if the noise is small and Gaussian. The deviation between \(I_F\) and \(I\) can be of either sign, and it depends on the magnitude as well as the

non-Gaussianity of the noise. One should be cautious when using (2.4) as a lower bound or as an approximation of the mutual information.

The results of Wei and Stocker [48] were based on a standard one-dimensional model with additive noise. In this chapter, we will revisit the limitations of Fisher information in a different, but also frequently applied, Poisson neural coding model.

2.2 Poisson Model

We begin with the basic setting of a one-dimensional Poisson model for a single neuron. Following the previous notations, denote by \(\theta\) the stimulus and by \(r\) the response, measured by the number of spikes, with \(\theta \in \mathbb{R}\) and \(r \in \mathbb{N}\). Let \(f(\theta)\) be the firing rate in unit time when a stimulus is applied. Over a time window \(\tau\), the spike count \(r\) follows a Poisson distribution with parameter \(\lambda = f(\theta)\tau\):

\[
p(r|\theta) = e^{-f(\theta)\tau}\,\frac{(f(\theta)\tau)^r}{r!}, \qquad r = 0, 1, 2, \ldots
\tag{2.5}
\]

As a function of \(\theta\), \(f(\theta)\) is often referred to as the tuning curve in the literature. Owing to its biological meaning, \(f(\theta)\) is non-negative and is usually assumed to be bounded between the minimal and maximal firing rates:

\[
f_- \le f(\theta) \le f_+,
\tag{2.6}
\]

Specifically, \(f_-\) corresponds to the resting state of the neuron and is usually set to a small positive value.
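To make the setting concrete, the encoding model (2.5) can be simulated directly. The sigmoidal tuning curve below is an arbitrary illustrative choice (not one used in the text); the check is simply that spike counts drawn from the model have mean and variance close to f(θ)τ, as a Poisson law requires.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bounded tuning curve with f_- = 0.1 and f_+ = 4.0
f_minus, f_plus = 0.1, 4.0
f = lambda th: f_minus + (f_plus - f_minus) / (1.0 + np.exp(-4.0 * (th - 0.5)))

tau, theta = 1.0, 0.8
r = rng.poisson(f(theta) * tau, size=100_000)   # spike counts from Eq. (2.5)

# For a Poisson count, mean and variance both equal the rate f(θ)τ
print(f(theta) * tau, r.mean(), r.var())
```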

To extend the above model to multiple neurons, consider \(n\) such cells indexed by \(i = 1, 2, \ldots, n\). For simplicity, we assume that each neuron receives the same stimulus \(\theta \in \mathbb{R}\) at the same time, yet the neurons spike at different firing rates \(f_i(\theta)\). The spike counts of the \(n\) neurons are denoted by a vector \(r = (r_1, r_2, \ldots, r_n) \in \mathbb{N}^n\).

Further, we make an assumption of conditional independence: given \(\theta\), the spike counts \(r_i\) are independently Poisson-distributed with expectation \(f_i(\theta)\tau\), i.e. the joint conditional probability mass function of \(r\) equals

\[
p(r|\theta) = \prod_{i=1}^{n} p(r_i|\theta) = \prod_{i=1}^{n} \frac{e^{-f_i(\theta)\tau}\,(f_i(\theta)\tau)^{r_i}}{r_i!}.
\tag{2.7}
\]

Following the definition (2.2), the Fisher information associated with the model \(p(r|\theta)\) would be

\[
F_r(\theta) = \mathbb{E}_{r|\theta}\left[\left(\frac{\partial \ln p(r|\theta)}{\partial \theta}\right)^2\right].
\]

Under the assumption of conditional independence (2.7), and provided the functions \(f_i(\theta)\) are continuously differentiable, it can be simplified as

\[
F_r(\theta) = \mathbb{E}_{r|\theta}\left[\left(\sum_{i=1}^{n} \frac{\partial}{\partial \theta} \ln p(r_i|\theta)\right)^2\right]
= \sum_{i=1}^{n} \mathbb{E}_{r_i|\theta}\left[\left(\frac{\partial}{\partial \theta} \ln p(r_i|\theta)\right)^2\right]
= \tau \sum_{i=1}^{n} \frac{f_i'(\theta)^2}{f_i(\theta)}
\tag{2.8}
\]

where the second equality follows from \(\mathbb{E}_{r_i|\theta}\left[\frac{\partial}{\partial \theta} \ln p(r_i|\theta)\right] = \frac{\partial}{\partial \theta} \sum_{r_i=0}^{\infty} p(r_i|\theta) = 0\). The Fisher information bound (2.4) would then be

\begin{align}
I_F(r;\theta) &= \frac{1}{2}\int d\theta\, p(\theta)\,\ln\left(\frac{F_r(\theta)}{2\pi e}\right) + H(\theta)\notag\\
&= \frac{1}{2}\int d\theta\, p(\theta)\,\ln\left(\tau \sum_{i=1}^{n} \frac{f_i'(\theta)^2}{f_i(\theta)}\right) + H(\theta) - \frac{1}{2}\ln(2\pi e)
\tag{2.9}
\end{align}

On the other hand, the mutual information between the spike counts \(r\) and the stimulus \(\theta\) takes the form of

\begin{align}
I(r;\theta) &= D_{\mathrm{KL}}\big(p(r,\theta)\,\|\,p(r)\,p(\theta)\big)\notag\\
&= \int d\theta \sum_{r_1=0}^{\infty}\cdots\sum_{r_n=0}^{\infty} p(r|\theta)\,p(\theta)\,\ln\left(\frac{p(r|\theta)\,p(\theta)}{p(r)\,p(\theta)}\right)\notag\\
&= \int p(\theta)\,d\theta\; \mathbb{E}_{r|\theta}\left[\ln\frac{p(r|\theta)}{p(r)}\right]
= \int p(\theta)\,d\theta\; \mathbb{E}_{r|\theta}\left[\ln\frac{p(r|\theta)}{\int p(s)\,p(r|s)\,ds}\right]
\tag{2.10}
\end{align}

Note that \(\mathbb{E}_{r|\theta}[\,\cdot\,]\) stands for the expectation with respect to \(r\) following the conditional distribution \(p(r|\theta)\) in (2.7). Without loss of generality, throughout this chapter we take the typical time \(\tau = 1\).
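Formula (2.8) lends itself to a quick numerical sanity check: for the conditionally independent Poisson model, the score is ∂ln p(r|θ)/∂θ = Σᵢ (rᵢ/fᵢ(θ) − τ) fᵢ'(θ), and its variance under p(r|θ) should reproduce τ Σᵢ fᵢ'(θ)²/fᵢ(θ). The three tuning curves below are arbitrary illustrative choices, not ones from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, theta = 1.0, 0.7

# Hypothetical tuning curves f_i and their derivatives f_i'
f  = lambda th: np.array([1.0 + th**2, 2.0 + np.sin(th), 0.5 + th])
fp = lambda th: np.array([2.0 * th,    np.cos(th),       1.0])

rates = f(theta)
closed_form = tau * np.sum(fp(theta)**2 / rates)        # Eq. (2.8)

# Monte Carlo: the Fisher information is the variance of the score
r = rng.poisson(rates * tau, size=(200_000, rates.size))
score = ((r / rates - tau) * fp(theta)).sum(axis=1)     # d/dθ ln p(r|θ)

print(closed_form, score.var())   # agreement to within Monte Carlo error
```

The score also has mean zero, which is the identity quoted below Equation (2.2).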

2.3 Constrained Optimization Problem for One Population

Starting with the simplest case, assume that all of the \(n\) neurons have the same tuning curve \(f(\theta)\); in other words, they belong to one neural population. The Fisher information bound then reads

\begin{align}
I_F(r;\theta) &= \frac{1}{2}\int d\theta\, p(\theta)\,\ln\left(n\,\frac{f'(\theta)^2}{f(\theta)}\right) + H(\theta) - \frac{\ln(2\pi e)}{2}
\tag{2.11}\\
&= \frac{1}{2}\int d\theta\, p(\theta)\,\ln\left(\frac{f'(\theta)^2}{f(\theta)}\right) + H(\theta) - \frac{\ln(2\pi e)}{2} + \frac{\ln n}{2}
\tag{2.12}
\end{align}

The above formulation suggests that the number of neurons \(n\) plays no role in the optimization, except that \(I_F\) increases by the separate additive term \(\frac{\ln n}{2}\). Here, we use a variational approach to determine the optimal tuning curve \(f\) that maximizes the Fisher information bound:

\[
\max_f\; I_F(r;\theta)
\tag{2.13}
\]

A stationary solution f should satisfy the following Euler-Lagrange equation:

\[
\frac{\partial L}{\partial f} - \frac{d}{d\theta}\left(\frac{\partial L}{\partial f'}\right) = 0
\tag{2.14}
\]

with the Lagrangian integrand \(L(\theta, f, f') = p(\theta)\,\ln\left(\frac{f'(\theta)^2}{f(\theta)}\right)\). By differentiating the Lagrangian \(L\), the above equation reduces to

\[
-\frac{p(\theta)}{f(\theta)} - \frac{d}{d\theta}\left(\frac{2\,p(\theta)}{f'(\theta)}\right) = 0.
\]

Dividing by \(p(\theta)/f'(\theta)\),

\[
-\frac{f'(\theta)}{f(\theta)} - \frac{f'(\theta)}{p(\theta)}\,\frac{d}{d\theta}\left(\frac{2\,p(\theta)}{f'(\theta)}\right)
= \frac{d}{d\theta}\,\ln\left(\frac{f'(\theta)^2}{f(\theta)\,p(\theta)^2}\right) = 0.
\]

In other words, \(\frac{f'(\theta)^2}{f(\theta)\,p(\theta)^2}\) is constant. Hence, there exists a constant \(c > 0\) such that

\[
\frac{f'(\theta)^2}{f(\theta)} = c\,p(\theta)^2.
\tag{2.15}
\]

12 2.3.1 Monotonic tuning curves

From the above equation, a special case arises naturally in which \(f(\theta)\) is monotonic. Specifically, let us assume that \(p(\theta)\) is continuously supported on a finite interval \([\theta_-, \theta_+]\), \(f\) is an increasing function on \([\theta_-, \theta_+]\), and, without loss of generality, \(f\) reaches the maximal and minimal firing rates at the boundary:

\[
f'(\theta) \ge 0, \qquad \forall\,\theta \in (\theta_-, \theta_+)
\tag{2.16}
\]

\[
f(\theta_-) = f_-, \qquad f(\theta_+) = f_+
\tag{2.17}
\]

Using the identity \(\frac{f'(\theta)^2}{f(\theta)} = \left(2\,\frac{d}{d\theta}\sqrt{f(\theta)}\right)^2\), we simplify the stationarity condition (2.15) (absorbing constant factors into \(c\)) as

\[
\frac{d}{d\theta}\sqrt{f(\theta)} = c\,p(\theta)
\tag{2.18}
\]

where \(c\) can be determined by the boundary conditions (2.17):

\[
c = c\int_{\theta_-}^{\theta_+} p(\theta)\,d\theta = \int_{\theta_-}^{\theta_+} \frac{d}{d\theta}\sqrt{f(\theta)}\,d\theta = \sqrt{f_+} - \sqrt{f_-}.
\]

Substituting back into equation (2.18), the stationary solution can be written as

\[
f^*(\theta) = \left(\big(\sqrt{f_+} - \sqrt{f_-}\big)\,P(\theta) + \sqrt{f_-}\right)^2
\tag{2.19}
\]

where

\[
P(\theta) = \int_{\theta_-}^{\theta} p(s)\,ds
\]

defines the cumulative distribution function of \(p(\theta)\). In particular, when \(p(\theta)\) is a uniform distribution, \(f^*(\theta)\) is a quadratic function.
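For a uniform stimulus density p(θ) = 1 on [0, 1], the cumulative distribution is P(θ) = θ and (2.19) becomes an explicit quadratic. The sketch below (with arbitrary values for f₋ and f₊) builds this curve and verifies the stationarity condition (2.15), namely that f'(θ)²/f(θ) is constant.

```python
import numpy as np

# Optimal monotonic tuning curve of Eq. (2.19) for uniform p(θ) on [0, 1]
f_minus, f_plus = 0.1, 4.0
a, b = np.sqrt(f_minus), np.sqrt(f_plus)

f_star = lambda th: ((b - a) * th + a) ** 2       # P(θ) = θ for uniform p

theta = np.linspace(0.0, 1.0, 1001)
fp = np.gradient(f_star(theta), theta)            # central differences, exact
                                                  # for a quadratic (interior)

# Stationarity (2.15): f'(θ)²/f(θ) = c p(θ)² with p ≡ 1, c = 4(√f₊ − √f₋)²
ratio = fp[1:-1] ** 2 / f_star(theta[1:-1])
print(ratio.min(), ratio.max())   # both ≈ 4 (√f₊ − √f₋)²
```

The endpoints are excluded from the check because `np.gradient` falls back to one-sided differences there.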

To show that the stationary solution \(f^*\) is the maximizer of \(I_F\), checking the second-order variations is necessary. However, computing second-order variations directly in \(f\) (which involves both \(f\) and \(f'\) terms) is complicated. For convenience, we take a simpler approach by applying the substitution

\[
h(\theta) = \frac{f'(\theta)}{\sqrt{f(\theta)}} = 2\,\frac{d}{d\theta}\sqrt{f(\theta)}.
\tag{2.20}
\]

Based on the above substitution, one can reach a constrained optimization problem for \(h\) that is equivalent to (2.13, 2.16, 2.17):

\begin{align}
\max_h\quad & \int_{\theta_-}^{\theta_+} d\theta\, p(\theta)\,\ln\!\big(h(\theta)^2\big),
\tag{2.21}\\
\text{s.t.}\quad & h(\theta) \ge 0, \qquad \int_{\theta_-}^{\theta_+} h(\theta)\,d\theta = 2\big(\sqrt{f_+} - \sqrt{f_-}\big)
\tag{2.22}
\end{align}

and the Lagrangian function can be formulated as

\[
H(h,\mu,\lambda) = \int_{\theta_-}^{\theta_+} d\theta\, p(\theta)\,\ln\!\big(h(\theta)^2\big)
+ \int_{\theta_-}^{\theta_+} \mu(\theta)\,h(\theta)\,d\theta
+ \lambda\left(\int_{\theta_-}^{\theta_+} h(\theta)\,d\theta - 2\big(\sqrt{f_+} - \sqrt{f_-}\big)\right)
\tag{2.23, 2.24}
\]

where \(\lambda \in \mathbb{R}\) and \(\mu(\theta)\) are the Lagrange multipliers associated with the constraints (2.22). For a stationary solution \(h(\theta)\), there exist \(\lambda\) and \(\mu(\theta)\) such that the following Karush-Kuhn-Tucker (KKT) conditions [40] hold:

\begin{align}
\frac{2\,p(\theta)}{h(\theta)} + \mu(\theta) + \lambda &= 0, \qquad \forall\,\theta
\tag{2.25}\\
\mu(\theta) &\ge 0, \qquad \forall\,\theta
\tag{2.26}\\
\mu(\theta)\,h(\theta) &= 0, \qquad \forall\,\theta
\tag{2.27}
\end{align}

Now for the stationary solution \(f^*\) in (2.19), the corresponding substitution is \(h^*(\theta) = \frac{f^{*\prime}(\theta)}{\sqrt{f^*(\theta)}} = 2\big(\sqrt{f_+} - \sqrt{f_-}\big)\,p(\theta)\). The complementary slackness condition (2.27) and the stationarity condition (2.25) give us \(\mu(\theta) = 0\) and \(\lambda = -\frac{1}{\sqrt{f_+} - \sqrt{f_-}}\). Thus, the stationarity of \(h^*\) is induced by the stationarity of \(f^*\). Furthermore, evaluating the second-order variation of \(H\) yields

\[
\frac{\delta^2 H}{\delta h^2} = -\frac{2\,p(\theta)}{h^2(\theta)} \le 0
\tag{2.28}
\]

Therefore, the objective function H attains a local maximum at h∗(θ), and since the stationary solution is unique, this local maximizer is global.

It turns out that the maximum of the Fisher information bound under the constraints (2.16, 2.17) will be

\[
I_F^* = \frac{1}{2}\int_{\theta_-}^{\theta_+} d\theta\, p(\theta)\,
\ln\left(\frac{n\,\Big(2\big(\sqrt{f_+} - \sqrt{f_-}\big)\,p(\theta)\Big)^2}{2\pi e}\right) + H(\theta)
= \ln\left(\frac{\sqrt{2n}\,\big(\sqrt{f_+} - \sqrt{f_-}\big)}{\sqrt{\pi e}}\right)
\tag{2.29}
\]
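Since for uniform p(θ) on [0, 1] the entropy H(θ) vanishes and the optimal curve (2.19) makes the integrand of (2.11) constant, the closed form (2.29) can be checked by direct quadrature (the values of n, f₋, f₊ below are arbitrary choices).

```python
import numpy as np

n, f_minus, f_plus = 5, 0.1, 4.0
a, b = np.sqrt(f_minus), np.sqrt(f_plus)

theta = np.linspace(0.0, 1.0, 10_001)
f  = ((b - a) * theta + a) ** 2                   # optimal curve (2.19)
fp = 2 * ((b - a) * theta + a) * (b - a)          # its derivative

# Quadrature of the bound (2.11) with p(θ) = 1 and H(θ) = 0
integrand = 0.5 * np.log(n * fp**2 / f / (2 * np.pi * np.e))
quadrature = integrand.mean()

# Closed form (2.29)
closed_form = np.log(np.sqrt(2 * n) * (b - a) / np.sqrt(np.pi * np.e))
print(quadrature, closed_form)   # the two values coincide
```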

For the other case of monotonicity, if we assume \(f\) is decreasing on \([\theta_-, \theta_+]\), one can show that, given the boundary conditions at \(\theta_-\) and \(\theta_+\), the maximizer would be:

\[
f^*(\theta) = \left(\Big(\sqrt{f(\theta_+)} - \sqrt{f(\theta_-)}\Big)\,P(\theta) + \sqrt{f(\theta_-)}\right)^2
\tag{2.30}
\]

which is consistent with (2.19). This can be seen as a general formulation for the maximizer under the monotonicity assumption.

15 2.3.2 Piecewise tuning curves

We further investigate the case when the tuning curve is piecewise monotonic. Write \([\theta_-, \theta_+]\) as a union of \(M\) mutually disjoint intervals \([\theta_{i-1}, \theta_i)\) (\(i = 1, 2, \ldots, M\)) on which \(f\) is continuously differentiable and monotonic. Boundary constraints are defined as the one-sided limits at the endpoints \(\theta_i\):

\[
\lim_{\theta \to \theta_{i-1}^{+}} f(\theta) := f_-^{(i)}, \qquad
\lim_{\theta \to \theta_i^{-}} f(\theta) := f_+^{(i)}.
\]

Following the same Lagrangian approach as before, solving the constrained optimization problem on each interval gives us a piecewise function

\[
f^*(\theta) = \left(\Big(\sqrt{f_+^{(i)}} - \sqrt{f_-^{(i)}}\Big)\,\frac{P_i(\theta)}{P_i} + \sqrt{f_-^{(i)}}\right)^2,
\qquad \theta \in [\theta_{i-1}, \theta_i)
\tag{2.31}
\]

where

\[
P_i(\theta) = \int_{\theta_{i-1}}^{\theta} p(s)\,ds, \qquad P_i = \int_{\theta_{i-1}}^{\theta_i} p(s)\,ds.
\]

Evaluating the maximum of the Fisher information bound,

\begin{align}
I_F^*(r,\theta) &= \sum_{i=1}^{M} \frac{1}{2}\int_{\theta_{i-1}}^{\theta_i} d\theta\, p(\theta)\,
\ln\left(n\,\frac{f'(\theta)^2}{f(\theta)}\right) + H(\theta) - \frac{1}{2}\ln(2\pi e)\notag\\
&= \sum_{i=1}^{M} P_i\,\ln\left(\frac{\sqrt{2n}\,\Big(\sqrt{f_+^{(i)}} - \sqrt{f_-^{(i)}}\Big)}{P_i\,\sqrt{\pi e}}\right)
\tag{2.32}
\end{align}

As the domain is split into more intervals, \(I_F^*\) increases. The reason is that the Fisher information depends on the derivative \(f'(\theta)\). As a straightforward example, take \(p(\theta) = 1\) on \([0, 1)\), take the uniform splitting

\(\theta_i = \frac{i}{M}\), and for simplicity let \(f_+^{(i)} = f_+\) and \(f_-^{(i)} = f_- = 0\). We fix the number of neurons \(n = 1\). According to formula (2.31), the maximum of the Fisher information bound is attained at the following periodic quadratic function

\[
f^*(\theta) = M^2 f_+\,(\theta - \theta_{i-1})^2, \qquad \theta \in [\theta_{i-1}, \theta_i)
\tag{2.33}
\]

(see Figure 2.2 for an illustration of f ∗ with different values of M). As a result, the Fisher information bound can be simplified to

\[
I_F^*(r,\theta) = \ln\left(\frac{M\,\sqrt{2 f_+}}{\sqrt{\pi e}}\right).
\tag{2.34}
\]
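A quick numeric reading of (2.34) (with an arbitrary f₊) makes the logarithmic divergence explicit: each doubling of M adds ln 2 ≈ 0.693 to the bound, so I_F* grows without limit even though, as computed next, the true mutual information stays bounded.

```python
import numpy as np

f_plus = 4.0   # arbitrary maximal firing rate

def I_F(M):
    """Fisher information bound (2.34) for M periodic pieces, n = 1."""
    return np.log(M * np.sqrt(2 * f_plus) / np.sqrt(np.pi * np.e))

for M in (1, 2, 4, 8, 1024):
    print(M, round(float(I_F(M)), 3))   # grows like ln M
```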

Figure 2.2: An example of periodic tuning curves for one population (panels show M = 1, 2, 3).

In this case, \(I_F^* \to \infty\) as the number of intervals \(M \to \infty\). However, it can be verified that the exact value of the mutual information remains finite. We make a brief calculation of \(I(r;\theta)\) as follows. First, the conditional probability mass function reads

\[
p(r|\theta) = \frac{e^{-f^*(\theta)}\,f^*(\theta)^r}{r!}
= \frac{e^{-M^2 f_+ (\theta-\theta_{i-1})^2}\,\big(M^2 f_+ (\theta-\theta_{i-1})^2\big)^r}{r!},
\qquad \theta \in [\theta_{i-1}, \theta_i)
\]

Using the change of variables \(s = M(\nu - \theta_{i-1})\) on each interval, we compute the probability distribution of \(r\):

\begin{align*}
p(r) &= \int_0^1 p(r|\theta)\,d\theta
= \sum_{i=1}^{M} \int_{\theta_{i-1}}^{\theta_i} d\nu\, \frac{e^{-M^2 f_+(\nu-\theta_{i-1})^2}\,\big(M^2 f_+(\nu-\theta_{i-1})^2\big)^r}{r!}\\
&= \sum_{i=1}^{M} \frac{1}{M}\int_0^1 ds\, \frac{e^{-f_+ s^2}\,\big(f_+ s^2\big)^r}{r!}
= \int_0^1 ds\, \frac{e^{-f_+ s^2}\,\big(f_+ s^2\big)^r}{r!}
\end{align*}

By directly applying the definition (2.10), \(I(r;\theta)\) can be simplified as

\begin{align*}
I(r;\theta) &= \int d\theta \sum_{r=0}^{\infty} p(r|\theta)\,\ln\frac{p(r|\theta)}{p(r)}\\
&= \sum_{i=1}^{M} \frac{1}{M}\int_0^1 d\eta \sum_{r} \frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\,
\ln\frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{\int_0^1 ds\, e^{-f_+ s^2}\,(f_+ s^2)^r}\\
&= \int_0^1 d\eta \sum_{r} \frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\,
\ln\frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{\int_0^1 ds\, e^{-f_+ s^2}\,(f_+ s^2)^r}
\end{align*}

with \(\eta = M(\theta - \theta_{i-1})\); the factor \(\frac{1}{r!}\) cancels between \(p(r|\theta)\) and \(p(r)\). Note that \(I(r;\theta)\) does not depend on \(M\) at all. Therefore, it suffices to show that the above series converges to a finite number. We write \(I(r;\theta)\) as the difference of two terms and evaluate them separately:

\[
I(r;\theta) = \int_0^1 d\eta \sum_{r} \frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\,\ln\!\big(e^{-f_+\eta^2}\,\eta^{2r}\big)
- \int_0^1 d\eta \sum_{r} \frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\,\ln\left(\int_0^1 ds\, e^{-f_+ s^2}\, s^{2r}\right)
\]

(the factors \(f_+^r\) cancel between the two logarithms), where the first term equals

\begin{align*}
\int_0^1 d\eta \sum_{r} \frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\,\ln\!\big(e^{-f_+\eta^2}\,\eta^{2r}\big)
&= -\int_0^1 f_+\eta^2\, d\eta + 2\int_0^1 \ln\eta \sum_{r} r\,\frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\, d\eta\\
&= -\frac{f_+}{3} + 2\int_0^1 f_+\eta^2 \ln\eta\, d\eta
= -\frac{5}{9}\, f_+
\end{align*}

and by applying Jensen's inequality, the second term is bounded below:

\begin{align*}
\int_0^1 d\eta \sum_{r} \frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\,\ln\left(\int_0^1 ds\, e^{-f_+ s^2}\, s^{2r}\right)
&\ge \int_0^1 d\eta \sum_{r} \frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\int_0^1 ds\,\ln\!\big(e^{-f_+ s^2}\, s^{2r}\big)\\
&= \int_0^1 d\eta \sum_{r} \frac{e^{-f_+\eta^2}\,(f_+\eta^2)^r}{r!}\left(-\frac{f_+}{3} - 2r\right)\\
&= \int_0^1 d\eta \left(-\frac{f_+}{3} - 2 f_+\eta^2\right) = -f_+
\end{align*}

Thus the value of \(I(r;\theta)\) is finite,

\[
0 \le I(r;\theta) \le \frac{4}{9}\, f_+
\tag{2.35}
\]

In contrast to the finiteness of the mutual information, the Fisher information bound (2.34) diverges to infinity as \(M \to \infty\). In this situation, the Fisher information bound does not serve as a good approximation.
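The M-independent expression for I(r; θ) derived above can also be evaluated numerically. The sketch below (grid sizes and the spike-count truncation are arbitrary choices) confirms that the exact mutual information is finite and sits below the bound 4f₊/9 of (2.35), in sharp contrast to the diverging Fisher information bound.

```python
import numpy as np

f_plus = 4.0
eta = np.linspace(1e-9, 1.0, 20_001)     # quadrature grid for η (and s)
lam = f_plus * eta**2                    # rate λ(η) = f₊ η²

R = np.arange(0, 60)                     # spike counts, truncated (λ ≤ 4)
log_fact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, 60)))))

# log p(r|η) on the grid, shape (len(R), len(eta))
log_p = -lam[None, :] + R[:, None] * np.log(lam[None, :]) - log_fact[:, None]
p = np.exp(log_p)

p_r = p.mean(axis=1)                     # p(r) = ∫₀¹ p(r|s) ds
I = (p * (log_p - np.log(p_r)[:, None])).mean(axis=1).sum()

print(I, 4 * f_plus / 9)                 # finite, and below the bound (2.35)
```

The truncation at r < 60 is harmless here since the rate never exceeds f₊ = 4, so the neglected tail mass is negligible.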

2.4 Constrained Optimization Problem for Multiple Populations

The variational method for optimizing the Fisher information bound can be generalized to multiple neural populations. We start our discussion

with two populations of neurons.

2.4.1 Two populations

Consider two populations of cells with different tuning curves, namely f1(θ) and f2(θ), with n1 and n2 neurons respectively. Following the previous section, we assume that both f1 and f2 are monotonic with boundary conditions at θ− and θ+. In this case, we have the Fisher information bound
\[
I_F(r;\theta) = \frac{1}{2}\int d\theta\,p(\theta)\,\ln\Bigl( n_1 \frac{f_1'(\theta)^2}{f_1(\theta)} + n_2 \frac{f_2'(\theta)^2}{f_2(\theta)} \Bigr) + H(\theta) - \frac{\ln(2\pi e)}{2} \tag{2.36}
\]
By taking hi = f′i(θ)/√fi(θ), the optimization problem is then formulated as

\[
\max_{h_1,h_2}\; H(h_1,h_2) \tag{2.37}
\]
\[
\int_{\theta_-}^{\theta_+} h_i(\theta)\,d\theta = 2\Bigl(\sqrt{f_i(\theta_+)} - \sqrt{f_i(\theta_-)}\Bigr), \qquad i = 1, 2 \tag{2.38}
\]
\[
h_i(\theta) \ge 0 \ \ (\text{or } h_i(\theta) \le 0), \qquad i = 1, 2 \tag{2.39}
\]
where the objective function H is defined as

\[
H(h_1,h_2) := \int_{\theta_-}^{\theta_+} d\theta\,p(\theta)\,\ln\bigl( n_1 h_1(\theta)^2 + n_2 h_2(\theta)^2 \bigr). \tag{2.40}
\]
To identify a local maximizer, we evaluate the first and second order variations:
\[
\frac{\delta H}{\delta h_i} = \frac{2\,p(\theta)\,n_i h_i(\theta)}{n_1 h_1(\theta)^2 + n_2 h_2(\theta)^2}, \qquad i = 1, 2,
\]
\[
\frac{\delta^2 H}{\delta h_1^2} = \frac{2\,p\,n_1\,(n_2 h_2^2 - n_1 h_1^2)}{(n_1 h_1^2 + n_2 h_2^2)^2}, \qquad
\frac{\delta^2 H}{\delta h_2^2} = \frac{2\,p\,n_2\,(n_1 h_1^2 - n_2 h_2^2)}{(n_1 h_1^2 + n_2 h_2^2)^2},
\]
\[
\frac{\delta^2 H}{\delta h_1 \delta h_2} = -\frac{4\,p\,n_1 n_2\,h_1 h_2}{(n_1 h_1^2 + n_2 h_2^2)^2} = \frac{\delta^2 H}{\delta h_2 \delta h_1}
\]

The Hessian has a positive and a negative eigenvalue, since its determinant equals
\[
\det\begin{bmatrix} \dfrac{\delta^2 H}{\delta h_1^2} & \dfrac{\delta^2 H}{\delta h_1\delta h_2} \\[6pt] \dfrac{\delta^2 H}{\delta h_1\delta h_2} & \dfrac{\delta^2 H}{\delta h_2^2} \end{bmatrix}
= -\frac{4\,p^2 n_1 n_2}{(n_1 h_1^2 + n_2 h_2^2)^2} \le 0.
\]

This implies that any stationary solution (h1(θ), h2(θ)) in the interior (i.e. with h1(θ)h2(θ) > 0 on some set of non-zero measure) is a saddle point and thus cannot be a local maximum. The maximum of H can only be attained at the boundary, where h1 and h2 have mutually exclusive supports, i.e. h1(θ)h2(θ) = 0 for any θ ∈ (θ−, θ+).

In a simple and biologically relevant situation, the supports of h1, h2 are assumed to be two contiguous intervals: (θ0, θ1) and (θ1, θ2), where θ0 = θ− and θ2 = θ+. Without loss of generality, we suppose that h1 is supported on

(θ0, θ1) and h2 is supported on (θ1, θ2). Outside the support of hi, the tuning

curve fi(θ) is constant, since d√fi(θ)/dθ = hi(θ)/2 = 0. Thus only one population is sensitive to the stimulus on each interval. In this case, the Fisher information bound (2.36) can be written as the sum of two single-variable functionals of f1 and f2,
\[
I_F(f_1,f_2) = \frac{1}{2}\int_{\theta_0}^{\theta_1} d\theta\,p(\theta)\,\ln\Bigl( n_1 \frac{f_1'(\theta)^2}{f_1(\theta)} \Bigr)
+ \frac{1}{2}\int_{\theta_1}^{\theta_2} d\theta\,p(\theta)\,\ln\Bigl( n_2 \frac{f_2'(\theta)^2}{f_2(\theta)} \Bigr)
+ H(\theta) - \frac{\ln(2\pi e)}{2} \tag{2.41}
\]

Further, if we constrain the boundary conditions of f1 and f2,

\[
\lim_{\theta\to\theta_0^+} f_1(\theta) = f_{1,-}, \qquad \lim_{\theta\to\theta_1^-} f_1(\theta) = f_{1,+}
\]
\[
\lim_{\theta\to\theta_1^+} f_2(\theta) = f_{2,-}, \qquad \lim_{\theta\to\theta_2^-} f_2(\theta) = f_{2,+}
\]

then we can solve for the maximizers f1 and f2 separately. Note that in this process, the mutually exclusive supports of h1 and h2 have enabled us to decompose the optimization problem (2.37, 2.38, 2.39) into two separate single-population problems.

Following the same approach as in Section 2.3.1, we solve for the maximizers
\[
f_1^*(\theta) = \begin{cases} \Bigl( \bigl(\sqrt{f_{1,+}} - \sqrt{f_{1,-}}\bigr)\,\dfrac{P_1(\theta)}{P_1} + \sqrt{f_{1,-}} \Bigr)^2, & \theta \in [\theta_0, \theta_1) \\[4pt] \text{constant}, & \theta \in [\theta_1, \theta_2) \end{cases} \tag{2.42}
\]
\[
f_2^*(\theta) = \begin{cases} \text{constant}, & \theta \in [\theta_0, \theta_1) \\[4pt] \Bigl( \bigl(\sqrt{f_{2,+}} - \sqrt{f_{2,-}}\bigr)\,\dfrac{P_2(\theta)}{P_2} + \sqrt{f_{2,-}} \Bigr)^2, & \theta \in [\theta_1, \theta_2) \end{cases} \tag{2.43}
\]
where
\[
P_i(\theta) = \int_{\theta_{i-1}}^{\theta} p(s)\,ds, \qquad P_i = \int_{\theta_{i-1}}^{\theta_i} p(s)\,ds, \tag{2.44}
\]
and the optimum Fisher information bound would be

\[
I_F^*(r;\theta) = P_1 \ln\Biggl( \frac{\sqrt{2 n_1}\,\bigl(\sqrt{f_{1,+}} - \sqrt{f_{1,-}}\bigr)}{P_1 \sqrt{\pi e}} \Biggr)
+ P_2 \ln\Biggl( \frac{\sqrt{2 n_2}\,\bigl(\sqrt{f_{2,+}} - \sqrt{f_{2,-}}\bigr)}{P_2 \sqrt{\pi e}} \Biggr) \tag{2.45}
\]
In addition, one could further optimize I∗_F over the probability masses

(P1,P2) subject to P1 + P2 = 1. This yields

\[
I_F^* = \ln\Biggl( \frac{\sqrt{2}\,\bigl(\sqrt{n_1}\,c_1 + \sqrt{n_2}\,c_2\bigr)}{\sqrt{\pi e}} \Biggr) \tag{2.46}
\]
where c1 = √f1,+ − √f1,−, c2 = √f2,+ − √f2,−, and the maximum is attained when the Pi are proportional to √ni ci:

\[
P_1 = \frac{\sqrt{n_1}\,c_1}{\sqrt{n_1}\,c_1 + \sqrt{n_2}\,c_2}, \qquad
P_2 = \frac{\sqrt{n_2}\,c_2}{\sqrt{n_1}\,c_1 + \sqrt{n_2}\,c_2}. \tag{2.47}
\]
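The allocation (2.47) can be verified numerically. The sketch below (with made-up values of n_i and c_i) grid-searches the bound (2.45) over P1 and compares both the maximizer and the maximum with the closed forms (2.46)–(2.47):

```python
import numpy as np

# Grid-search check of the optimal probability masses; n_i, c_i are made up.
n1, n2 = 30, 70
c1, c2 = 0.8, 1.3                     # c_i = sqrt(f_{i,+}) - sqrt(f_{i,-})

def IF(P1):
    """Two-population Fisher information bound (2.45) as a function of P1."""
    P2 = 1.0 - P1
    return (P1 * np.log(np.sqrt(2 * n1) * c1 / (P1 * np.sqrt(np.pi * np.e)))
          + P2 * np.log(np.sqrt(2 * n2) * c2 / (P2 * np.sqrt(np.pi * np.e))))

grid = np.linspace(1e-3, 1.0 - 1e-3, 100001)
P1_num = grid[np.argmax(IF(grid))]                         # numerical optimum
P1_opt = np.sqrt(n1) * c1 / (np.sqrt(n1) * c1 + np.sqrt(n2) * c2)  # (2.47)
IF_opt = np.log(np.sqrt(2) * (np.sqrt(n1) * c1 + np.sqrt(n2) * c2)
                / np.sqrt(np.pi * np.e))                   # (2.46)
```

Since P ↦ P ln(a/P) is concave, the objective is concave in P1 and the interior stationary point is the global maximum.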

If the total number of neurons is constrained, n1 + n2 = n, the Fisher information bound can be further optimized to

\[
I_F^* = \ln\Biggl( \frac{\sqrt{2n}\,\sqrt{c_1^2 + c_2^2}}{\sqrt{\pi e}} \Biggr) \tag{2.48}
\]
for which the maximum is reached when each population size ni is proportional

to c²i, i.e.
\[
n_1 = \frac{c_1^2\,n}{c_1^2 + c_2^2}, \qquad n_2 = \frac{c_2^2\,n}{c_1^2 + c_2^2}. \tag{2.49}
\]

When the boundary values of f1 and f2 are equal, the above conditions

reduce to n1 = n2 = n/2 and P1 = P2 = 1/2. Thus the Fisher information bound is maximized when the neurons are equally split into two populations, each with a tuning curve supported on half of the domain.

2.4.2 Generalization to multiple populations

The above approach can be generalized to K populations of neurons.

Assume there are n cells in total, with ni cells in the i-th population, each with a monotonic tuning curve fi(θ). Analogous to the two-population case, the maximum of IF is attained at tuning curves with mutually exclusive supports [θi−1, θi)

(i = 1, 2, ..., K). By optimizing over {fi}, {Pi} and {ni}, the maximum Fisher information bound can be evaluated as
\[
I_F^{(K)*} = \ln\Biggl( \frac{\sqrt{2 n \sum_{j=1}^{K} c_j^2}}{\sqrt{\pi e}} \Biggr) \tag{2.50}
\]
with the maximizers
\[
f_i^*(\theta) = \begin{cases} \Bigl( \bigl(\sqrt{f_{i,+}} - \sqrt{f_{i,-}}\bigr)\,\dfrac{P_i(\theta)}{P_i} + \sqrt{f_{i,-}} \Bigr)^2, & \theta \in [\theta_{i-1}, \theta_i) \\[4pt] \text{constant}, & \text{otherwise} \end{cases} \tag{2.51}
\]
\[
P_i = \int_{\theta_{i-1}}^{\theta_i} p(\theta)\,d\theta = \frac{\sqrt{n_i}\,c_i}{\sum_{j=1}^{K} \sqrt{n_j}\,c_j}, \qquad
n_i = \frac{c_i^2\,n}{\sum_{j=1}^{K} c_j^2} \tag{2.52}
\]
where

\[
c_i = \sqrt{f_{i,+}} - \sqrt{f_{i,-}}. \tag{2.53}
\]
When the boundary constraints satisfy fi,− = f− and fi,+ = f+, the

population sizes and probability masses are equal, i.e. ni = n/K and Pi = 1/K. In this case, I^{(K)∗}_F simplifies as

\[
I_F^{(K)*} = \ln\Biggl( \frac{\sqrt{2n}\,\bigl(\sqrt{f_+} - \sqrt{f_-}\bigr)}{\sqrt{\pi e}} \Biggr) + \frac{1}{2}\ln K \tag{2.54}
\]

If measured in bits, this will be 

\[
I_F^{(K)*} = \log_2\Biggl( \frac{\sqrt{2n}\,\bigl(\sqrt{f_+} - \sqrt{f_-}\bigr)}{\sqrt{\pi e}} \Biggr) + \frac{1}{2}\log_2 K \tag{2.55}
\]

Note that I^{(K)∗}_F will increase by 1/2 bit each time the number of populations is doubled: this is the "information gain" (measured by the Fisher information bound) obtained by splitting each population of neurons into two independent populations.

However, the gain in Fisher information does not improve the mutual information. To be concrete, we construct an example in a similar way as in Section 2.3.2. Take p(θ) = 1 on [0, 1); then the {f∗i(θ)} will be the piecewise quadratic functions shown in Figure 2.3. Although it is not plausible here to set f− = 0 on a non-zero measure set, the same calculation can be carried out. Under the assumption that ni = n/K is an integer, the mutual information can be estimated such that
\[
I(r;\theta) \le C\,\frac{n}{K} \tag{2.56}
\]

Figure 2.3: An example of mutually exclusive supported tuning curves for multiple populations (panels show K = 1, 2, 3)

where C is a constant depending on f+ and f−:

\[
C = \frac{4}{9}\,\frac{f_+^{3/2} - f_-^{3/2}}{f_+^{1/2} - f_-^{1/2}}
+ \frac{f_+^{3/2}\ln f_+ - f_-^{3/2}\ln f_-}{3\,\bigl(f_+^{1/2} - f_-^{1/2}\bigr)}
- \frac{\bigl(f_+\ln f_+ - f_-\ln f_-\bigr)\bigl(f_+^{1/2} - f_-^{1/2}\bigr)}{3\,\bigl(f_+ - f_-\bigr)}
\]
This also demonstrates, in comparison to the divergence pattern of IF, that the mutual information will not increase to infinity as the number of populations increases; instead, it is bounded by a constant whenever n is proportional to K.

2.4.3 The geometry of stationary solutions

Although it has been proved that the Fisher information bound does not have a local maximum in the interior (i.e. with f′i(θ)f′j(θ) ≠ 0 on some set of non-zero measure), it is still of interest to examine the properties of interior stationary tuning curves. In this section, we look back at the stationary solutions of the two-population problem in Section 2.4.1. Assume without loss

of generality that f1(θ) and f2(θ) are increasing on [θ−, θ+]. Taking hi = f′i(θ)/√fi(θ),

we introduce the Lagrangian function

\[
H(h,\lambda,\mu) = \frac{1}{2}\int_{\theta_-}^{\theta_+} d\theta\,p(\theta)\,\ln\bigl( n_1 h_1(\theta)^2 + n_2 h_2(\theta)^2 \bigr)
+ \sum_{i=1}^{2} \int_{\theta_-}^{\theta_+} \mu_i(\theta)\, h_i(\theta)\,d\theta
+ \sum_{i=1}^{2} \lambda_i \Biggl( \int_{\theta_-}^{\theta_+} h_i(\theta)\,d\theta - 2\Bigl(\sqrt{f_i(\theta_+)} - \sqrt{f_i(\theta_-)}\Bigr) \Biggr) \tag{2.57}
\]
where λi ∈ ℝ and μi(θ) are the Lagrange multipliers associated with the boundary and monotonicity constraints:

\[
\int_{\theta_-}^{\theta_+} h_i(\theta)\,d\theta = 2\Bigl(\sqrt{f_i(\theta_+)} - \sqrt{f_i(\theta_-)}\Bigr) \tag{2.58}
\]
\[
h_i(\theta) \ge 0 \tag{2.59}
\]

For a stationary point (h1(θ), h2(θ)), the following KKT conditions hold:

\[
\frac{\delta H}{\delta h_i} = \frac{p(\theta)\,n_i h_i(\theta)}{n_1 h_1(\theta)^2 + n_2 h_2(\theta)^2} + \mu_i(\theta) + \lambda_i = 0 \tag{2.60}
\]

µi(θ)hi(θ) = 0 (2.61)

\[
\mu_i(\theta) \ge 0 \tag{2.62}
\]

Let S be the overlap of the supports, i.e. S = {θ | h1(θ)h2(θ) > 0}, which has positive measure if (h1(θ), h2(θ)) is an interior point. The complementary slackness condition (2.61) implies that μ1(θ) = μ2(θ) = 0 on S, so the stationarity condition (2.60) reduces to

\[
\lambda_1\,(n_1 h_1^2 + n_2 h_2^2) = -p\,n_1 h_1, \qquad
\lambda_2\,(n_1 h_1^2 + n_2 h_2^2) = -p\,n_2 h_2
\]

Solving for h1 and h2, we arrive at

\[
h_1(\theta) = -\frac{\lambda_1 n_2}{\lambda_1^2 n_2 + \lambda_2^2 n_1}\,p(\theta), \qquad
h_2(\theta) = -\frac{\lambda_2 n_1}{\lambda_1^2 n_2 + \lambda_2^2 n_1}\,p(\theta), \qquad \theta \in S
\]

Assume without loss of generality that S is the whole support of the input distribution p(θ), i.e. h1(θ)h2(θ) > 0 for all θ ∈ [θ−, θ+]. Substituting the above h1, h2 into the boundary conditions (2.58) yields

\[
-\frac{\lambda_1 n_2}{\lambda_1^2 n_2 + \lambda_2^2 n_1} = 2\Bigl(\sqrt{f_1(\theta_+)} - \sqrt{f_1(\theta_-)}\Bigr), \qquad
-\frac{\lambda_2 n_1}{\lambda_1^2 n_2 + \lambda_2^2 n_1} = 2\Bigl(\sqrt{f_2(\theta_+)} - \sqrt{f_2(\theta_-)}\Bigr)
\]
Solving for λ1 and λ2,

\[
\lambda_1 = -\frac{n_1 c_1}{n_1 c_1^2 + n_2 c_2^2}, \qquad
\lambda_2 = -\frac{n_2 c_2}{n_1 c_1^2 + n_2 c_2^2}
\]
where ci = 2(√fi(θ+) − √fi(θ−)). Therefore we find the stationary solutions
\[
h_1(\theta) = 2\Bigl(\sqrt{f_1(\theta_+)} - \sqrt{f_1(\theta_-)}\Bigr)\,p(\theta), \qquad
h_2(\theta) = 2\Bigl(\sqrt{f_2(\theta_+)} - \sqrt{f_2(\theta_-)}\Bigr)\,p(\theta)
\]
which integrate to

\[
f_1(\theta) = \Bigl( \bigl(\sqrt{f_1(\theta_+)} - \sqrt{f_1(\theta_-)}\bigr)\,P(\theta) + \sqrt{f_1(\theta_-)} \Bigr)^2 \tag{2.63}
\]
\[
f_2(\theta) = \Bigl( \bigl(\sqrt{f_2(\theta_+)} - \sqrt{f_2(\theta_-)}\bigr)\,P(\theta) + \sqrt{f_2(\theta_-)} \Bigr)^2 \tag{2.64}
\]

where P(θ) = ∫ from θ− to θ of p(s) ds is the cumulative distribution function of p(θ). Note that since μ1(θ) = μ2(θ) = 0, the sign of hi(θ) does not play a role in deriving the solutions; thus equations (2.63) and (2.64) also hold when f1 or f2 is decreasing.

An interesting observation is that for such stationary tuning curves, the

Fisher information Fr(θ) = n1 f′1(θ)²/f1(θ) + n2 f′2(θ)²/f2(θ) can be expressed as

\[
F_r(\theta) = 4\,p(\theta)^2\, n_1 \Bigl(\sqrt{f_1(\theta_+)} - \sqrt{f_1(\theta_-)}\Bigr)^2
+ 4\,p(\theta)^2\, n_2 \Bigl(\sqrt{f_2(\theta_+)} - \sqrt{f_2(\theta_-)}\Bigr)^2 \tag{2.65}
\]
Furthermore, if the input distribution is uniform on [θ−, θ+], i.e. p(θ) = 1/(θ+ − θ−), then the Fisher information is constant on [θ−, θ+]:

\[
F_r(\theta) = \frac{4\Bigl( n_1 \bigl(\sqrt{f_1(\theta_+)} - \sqrt{f_1(\theta_-)}\bigr)^2 + n_2 \bigl(\sqrt{f_2(\theta_+)} - \sqrt{f_2(\theta_-)}\bigr)^2 \Bigr)}{(\theta_+ - \theta_-)^2} \tag{2.66}
\]

The formula (2.66) can be interpreted as the squared Euclidean distance between (√f1(θ−), √f2(θ−)) and (√f1(θ+), √f2(θ+)), up to a scale factor and the weighting by n1, n2. Moreover, since √f1(θ) and √f2(θ) are linear in P(θ), the square root of the Fisher information can be seen as the length of the trajectory in the space of (√f1, √f2) as the input θ changes from θ− to θ+, which is illustrated in the bottom right plot of Figure 2.4.

It is possible to divide the input space [θ−, θ+] into subintervals and constrain the monotonicity and boundary values of f1 and f2 on each subinterval, as in Section 2.3.2. Consequently, the stationary tuning curves would then

be piecewise quadratic functions, with the Fisher information locally equal to the path length on each subinterval. Following this idea, one could construct longer and longer curves in the "deformed" state space (√f1, √f2), such as the space-filling curves in Figure 2.5 (shown in the first and second rows). The tuning curves and the paths in the state space (f1, f2) associated with these space-filling curves are plotted in the last two rows of the figure. With the total path length diverging to infinity, these tuning curves have a Fisher information bound that diverges to infinity as well. However, following the same approach as before, the mutual information remains finite. This serves as another example of the gap between Fisher information and mutual information.

Figure 2.4: Fisher information as the path length (top: tuning curves f1(θ), f2(θ); bottom: the corresponding trajectories in the state spaces (f1, f2) and (√f1, √f2))


Figure 2.5: Tuning curves with diverging path lengths

2.5 Discussion

We define a Poisson neural coding model with conditional independence, and derive the tuning curves that maximize the Fisher information bound using Lagrangian methods. A single neuron that fires with a piecewise continuous tuning curve (Figure 2.2) can produce arbitrarily large Fisher information by increasing the frequency with which it jumps from the highest to the lowest activity. Similarly, for K neurons that fire according to the stimulus in separate regions (Figure 2.3), the Fisher information diverges to infinity as the number of neurons grows. On the other hand, the exact value of the mutual information is bounded in both cases. The inaccuracy of Fisher information, which has been discovered in previous studies [48, 7, 50] for different model settings, arises from its formulation in terms of derivatives and its geometric interpretation as a "local path length" in the deformed state space. Because of this locality, it fails to characterize the global features of information transmission. Due to these drawbacks, in the following chapters we work directly with the mutual information instead of its approximation by Fisher information.

Chapter 3

Cyclic Invariant Model

In this chapter, we investigate the Efficient Coding Hypothesis by directly maximizing the mutual information rather than the Fisher information bound. Following the Poisson model of Section 2.2, we introduce an extra assumption that the tuning curves within a population are cyclic translates of one another. We investigate the properties of optimal tuning curves both numerically and theoretically, and extend the conclusions to multiple populations and multiple inputs.

3.1 Basic Assumptions

We start with formal definitions of the cyclic invariant model. Consider a stimulus on a circle C, represented by θ ∈ [0, 1). Let f(θ) be a function on the circle, which is also equivalent to a periodic function on ℝ with f(x) = f(x + 1). Assume that in a population of N neurons, the tuning curve of the i-th neuron is the shift of f by an angle si ∈ [0, 1):

\[
f_i(\theta) = f(\theta - s_i), \qquad i = 1, ..., N. \tag{3.1}
\]

In other words, the {fi(θ)} are related by the cyclic translations {si}. An illustration of a 3-neuron example is shown in Figure 3.1. The functions f1(θ), f2(θ), f3(θ) with shifts s1 = 0, s2 = 1/3 and s3 = 2/3 are plotted on the left, with the 3-d trajectory in the state space (f1, f2, f3) shown on the right.

Figure 3.1: Cyclic invariant tuning curves for 3 neurons. (a) The shifted tuning curves f1(θ), f2(θ), f3(θ); (b) the trajectory in the state space (f1, f2, f3).

In practice, a neuron's firing rate may depend not only on a single stimulus but on a history of stimuli. Formally, the dependence on the stimulus history is characterized by convolution, with the actual firing rate g expressed as:
\[
g(\theta) = \int_0^1 f(\theta - y)\,\nu(y)\,dy, \qquad \theta \in [0, 1) \tag{3.2}
\]
where ν is a convolution kernel, normalized such that ∫₀¹ ν(y) dy = 1. To distinguish f and g, throughout this chapter we refer to f as the tuning curve (function) and g as the rate curve (function). It can be verified that the rate functions {gi} are also related by the cyclic translations {si}, by

applying convolution to both sides of equation (3.1),

\[
g_i(\theta) = g(\theta - s_i), \qquad i = 1, ..., N. \tag{3.3}
\]
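The discrete counterpart of (3.2)–(3.3) is a circular convolution. The small sketch below (with illustrative values, not those of any experiment) checks two properties used throughout the chapter: convolution with a normalized kernel preserves the mean firing rate, and it commutes with cyclic shifts:

```python
import numpy as np

# Circular convolution g = nu * f on a discretized circle (illustrative data).
M = 12
f = np.where(np.arange(M) < 4, 3.0, 0.5)     # example tuning curve
nu = np.zeros(M)
nu[:3] = 1.0 / 3.0                           # rectangular kernel, sums to 1

idx = (np.arange(M)[:, None] - np.arange(M)[None, :]) % M
g = f[idx] @ nu                              # g_m = sum_j f_{<m-j>} nu_j

# shifting f shifts g by the same amount: the discrete analogue of (3.3)
g_shifted = np.roll(f, 4)[idx] @ nu
```

The second property is exactly why the rate curves {gi} inherit the cyclic translations of the tuning curves.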

As introduced in the previous chapter, the spike counts r = (r1, ..., rN ) follow a conditionally independent Poisson distribution with λi = gi(θ)τ:

\[
p(r\mid\theta) = \prod_{i=1}^{N} p(r_i\mid\theta) = \prod_{i=1}^{N} \frac{e^{-g_i(\theta)\tau}\,\bigl(g_i(\theta)\tau\bigr)^{r_i}}{r_i!} \tag{3.4}
\]
Based on the Efficient Coding Hypothesis, we aim to find a function f that maximizes the mutual information between the response r and the stimulus θ. We expect f(θ) to be bounded between a baseline firing rate f− and a maximal firing rate f+. Besides, we put an extra constraint on the mean firing rate over the circle, denoted by f̄. Combining all the definitions and assumptions, the constrained optimization problem is formulated as below:

\[
\max_f\; I(r;\theta) \tag{3.5}
\]
\[
f_- \le f(\theta) \le f_+ \tag{3.6}
\]
\[
\int_0^1 f(\theta)\,d\theta = \bar f \tag{3.7}
\]
where
\[
I(r;\theta) = D_{KL}\bigl(p(r,\theta)\,\|\,p(r)p(\theta)\bigr)
= \int_0^1 p(\theta)\,d\theta \sum_{r_1=0}^{\infty}\cdots\sum_{r_N=0}^{\infty} p(r\mid\theta)\,\ln\Bigl(\frac{p(r\mid\theta)}{p(r)}\Bigr)
= \int_0^1 p(\theta)\,d\theta\; D_{KL}\bigl(p(r\mid\theta)\,\|\,p(r)\bigr) \tag{3.8}
\]

Since all the fi(θ) are shifted copies of f, it follows naturally that the constraints (3.6, 3.7) are satisfied by all of them, as well as by the rate curves {gi(θ)}. We close this section with an additional note that, without loss of generality, we take the typical time τ = 1 and the input distribution to be uniform:

\[
p(\theta) = 1, \qquad \theta \in [0, 1). \tag{3.9}
\]

3.2 Discrete Model

Since direct maximization of mutual information (3.5) over a continuous function f is often intractable, our first attempt to solve the problem is to numerically optimize I(r; θ) over a discrete representation of f.

We start with a discretization of the circle C into M equally spaced angles θm = (m − 1)/M, m = 1, ..., M. The function f can thus be represented by a vector f = (f1, ..., fM) where
\[
f_m := f(\theta_m), \qquad m = 1, ..., M. \tag{3.10}
\]

Furthermore, assume that the shifts of the tuning curves are also equidistant among the population, i.e. sj = (j − 1)/N, j = 1, ..., N. In this case, the set of tuning curves {fj(θ)} is invariant under a cyclic shift of 1/N:
\[
f_j(\theta) = f\Bigl(\theta - \frac{j-1}{N}\Bigr), \qquad j = 1, ..., N. \tag{3.11}
\]

35 3.2.1 Single population with equal discretization

The simplest situation is when the circle is discretized with the same number of bins as neurons, i.e. M = N. In this case, based on the cyclic invariance condition (3.11), the values of fi(θ) are given by a circular shift of the vector f by i − 1, i.e.
\[
f_i(\theta_m) = f\Bigl(\frac{m-1}{M} - \frac{i-1}{M}\Bigr) = f_{\langle m-i\rangle + 1}
\]
where ⟨m − i⟩ := (m − i) mod M. An example is shown in Figure 3.2 with synthetic data, where M = N = 10. The values of each fi(θ) are drawn at 10 equally-spaced points on a circle, visualized using a colormap. As shown in the figure, each tuning curve is a circular shift of the top left one.

Figure 3.2: Discrete tuning curves for 10 neurons with 10 bins

After convolution, the firing rates would be

\[
g_i := g(\theta_i) = \sum_{j=1}^{M} f_{\langle i-j\rangle + 1}\,\nu_j \tag{3.12}
\]
where {νj} is a discrete convolution kernel with Σ from j = 1 to M of νj = 1. The discrete

version of the mutual information I(r; θ) (3.8) can then be expressed as

\[
I(r;\theta) = \frac{1}{M} \sum_{i=1}^{M} D_{KL}\bigl(p(r\mid\theta_i)\,\|\,p(r)\bigr) \tag{3.13}
\]
by taking into account the uniformity of p(θ) (3.9). Here, the conditional probability distribution p(r|θi) and the mixture distribution p(r) take the following discrete forms:

\[
p(r\mid\theta_i) = \prod_{j=1}^{M} e^{-g_j(\theta_i)}\,\frac{g_j(\theta_i)^{r_j}}{r_j!}
= \prod_{j=1}^{M} e^{-g_{\langle i-j\rangle+1}}\,\frac{g_{\langle i-j\rangle+1}^{\,r_j}}{r_j!} \tag{3.14}
\]

\[
p(r) = \frac{1}{M}\sum_{i=1}^{M} p(r\mid\theta_i)
= \frac{1}{M}\sum_{i=1}^{M} \prod_{j=1}^{M} e^{-g_{\langle i-j\rangle+1}}\,\frac{g_{\langle i-j\rangle+1}^{\,r_j}}{r_j!} \tag{3.15}
\]
It is worth noting that the average of Kullback-Leibler divergences in (3.13) can be simplified under the cyclic invariance assumption, based on the following proposition:

Proposition 3.2.1. When M = N, the cyclic invariance property holds for the Kullback-Leibler divergence:
\[
D_{KL}\bigl(p(r\mid\theta_m)\,\|\,p(r)\bigr) = D_{KL}\bigl(p(r\mid\theta_{m'})\,\|\,p(r)\bigr), \qquad \forall\, m, m' = 1, ..., M. \tag{3.16}
\]

See Appendix 1.1 for the proof1.
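Proposition 3.2.1 can also be checked by brute force for a small instance (M = N = 3, arbitrary rates; spike counts truncated at rj ≤ 7, which keeps the neglected Poisson mass negligible and, by the cyclic symmetry, does not break the equality):

```python
import numpy as np
from itertools import product
from math import factorial

# Exhaustive check that D_KL(p(r|theta_m) || p(r)) is the same for all m.
M = 3
g = np.array([1.2, 0.3, 0.8])                # illustrative rate vector
idx = (np.arange(M)[:, None] - np.arange(M)[None, :]) % M
G = g[idx]                                   # G[m, j] = rate of neuron j at theta_m

R = 8                                        # truncate each r_j to 0..7
p_cond = np.zeros((M, R ** M))
for a, r in enumerate(product(range(R), repeat=M)):
    r = np.array(r)
    fact = np.array([factorial(int(k)) for k in r])
    p_cond[:, a] = np.prod(np.exp(-G) * G ** r / fact, axis=1)

p_mix = p_cond.mean(axis=0)                  # p(r) for uniform theta
kl = (p_cond * np.log(p_cond / p_mix)).sum(axis=1)
```

All three divergences agree up to floating-point rounding, as the proposition asserts.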

Given that all the KL divergences in equation (3.13) are equal, the mutual information can be written as I(r; θ) = DKL(p(r|θ1) ∥ p(r)) = DKL(p(r|θ = 0) ∥ p(r)),

1This proof is from Prof. Lorenzo A. Sadun, UT Austin.

and therefore

\[
I(r;\theta) = D_{KL}\bigl(p(r\mid\theta=0)\,\|\,p(r)\bigr) \tag{3.17}
\]

\[
= \mathbb{E}_{r\mid\theta=0}\bigl[-\ln S(r)\bigr] \tag{3.18}
\]
where
\[
S(r) := \frac{p(r)}{p(r\mid\theta=0)} = \frac{1}{M}\sum_{i=1}^{M}\prod_{j=1}^{M}\Biggl(\frac{g_{\langle i-j\rangle+1}}{g_{\langle 1-j\rangle+1}}\Biggr)^{r_j} \tag{3.19}
\]
In the end, we formulate the constrained optimization problem (3.5, 3.6, 3.7) in the following discrete version:

\[
\max_f\; \mathbb{E}_{r\mid\theta=0}\bigl[-\ln S(r)\bigr]
\]
\[
f_- \le f_i \le f_+, \qquad \frac{1}{M}\sum_{i=1}^{M} f_i = \bar f \tag{3.20}
\]
The expectation form of the mutual information (3.18) allows us to estimate it with a Monte Carlo algorithm: one only needs to generate i.i.d. samples of r from the multivariate Poisson distribution p(r|θ = 0) and take the average of −ln S(r). The approximation error is O(1/√n) when n samples are used.
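The Monte Carlo estimator can be sketched as follows (the binary-valued tuning curve, the size M = 16 and the sample count are illustrative choices, not the optimized curves or settings of the experiments). Note that S(r) always contains the i = 1 term equal to 1, so S(r) ≥ 1/M and each sample of −ln S(r) is at most ln M:

```python
import numpy as np

# Monte Carlo estimate of I(r;theta) = E_{r|theta=0}[-ln S(r)], M = N case.
rng = np.random.default_rng(0)
M = 16
f = np.where(np.arange(M) < M // 2, 4.0, 0.1)   # illustrative tuning curve
g = f.copy()                                     # delta kernel: g = f

idx = (np.arange(M)[:, None] - np.arange(M)[None, :]) % M
g_shift = g[idx]                                 # g_shift[i, j] = g_{<i-j>+1}

n_samples = 20000
r = rng.poisson(g_shift[0], size=(n_samples, M)) # samples from p(r | theta=0)

# ln S(r), computed with a log-sum-exp for numerical stability
A = r @ (np.log(g_shift) - np.log(g_shift[0])[None, :]).T   # (n_samples, M)
amax = A.max(axis=1, keepdims=True)
logS = amax[:, 0] + np.log(np.exp(A - amax).mean(axis=1))
I_hat = -logS.mean()                             # estimate of I(r; theta)
```

Because θ is uniform over M values, I(r; θ) ≤ ln M, and the per-sample bound above guarantees the estimate respects this automatically.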

We apply gradient-based iteration methods for optimization. Following a similar argument as before, the gradient of I(r; θ) with respect to f could also be written as an expectation:

\[
\frac{\partial I}{\partial g_i} = \mathbb{E}_{r\mid\theta=0}\Biggl[\Bigl(1 - \frac{r_i}{g_i}\Bigr)\ln S(r)\Biggr] \tag{3.21}
\]
\[
\frac{\partial I}{\partial f_i} = \sum_{j=1}^{M} \frac{\partial I}{\partial g_j}\,\nu_{\langle j-i\rangle+1} \tag{3.22}
\]

38 3.2.2 Single population with subset discretization

In this section, we investigate the case where there are fewer neurons than stimuli, i.e. N < M. The tuning curves are not centered at all stimulus positions, but only at a subset of the {θm}. Suppose that the translations of the tuning curves are still equidistant (equation 3.11), and that M is an integer multiple of N: M = ND. Let ci = D(i − 1) + 1 be the position of the translation si along the stimulus axis 1, 2, ..., M, such that

\[
s_i = \frac{i-1}{N} = \frac{c_i - 1}{M}, \qquad i = 1, ..., N,
\]
and in this case, the tuning curve of the i-th neuron would be a circular shift of f by ci − 1:

\[
f_i(\theta_m) = f_{\langle m - c_i\rangle + 1}
\]

An illustration of this situation is drawn in Figure 3.3 with synthetic data. Here we take N = 10, M = 20 and D = 2, with 10 neural tuning curves centered on the subset {2, 4, ..., 20}, represented by red dots in Figure 3.3a. The values of fi(θ) (i = 1, ..., 10) are plotted in Figure 3.3b using a colormap. The tuning curves are circular shifts of each other by multiples of 2.

A similar property of the KL divergence holds, as a generalization of Proposition 3.2.1:

Proposition 3.2.2. When M = ND (with D a positive integer),
\[
D_{KL}\bigl(p(r\mid\theta_m)\,\|\,p(r)\bigr) = D_{KL}\bigl(p(r\mid\theta_{m+D})\,\|\,p(r)\bigr) \qquad \forall\, m = 1, ..., M \tag{3.23}
\]


Figure 3.3: Discrete tuning curves for 10 neurons with 20 bins

with the probability distributions of the spike counts r = (r1, ..., rN) given by
\[
p(r\mid\theta_m) = \prod_{j=1}^{N} e^{-g_{\langle m-c_j\rangle+1}}\,\frac{g_{\langle m-c_j\rangle+1}^{\,r_j}}{r_j!} \tag{3.24}
\]

\[
p(r) = \frac{1}{M}\sum_{i=1}^{M} \prod_{j=1}^{N} e^{-g_{\langle i-c_j\rangle+1}}\,\frac{g_{\langle i-c_j\rangle+1}^{\,r_j}}{r_j!} \tag{3.25}
\]
where g is obtained from the discrete convolution of f (3.12).

The mutual information can also be expressed as an expectation,

\[
I(r;\theta) = \frac{1}{M}\sum_{m=1}^{M} D_{KL}\bigl(p(r\mid\theta_m)\,\|\,p(r)\bigr)
= \frac{1}{D}\sum_{m=1}^{D} D_{KL}\bigl(p(r\mid\theta_m)\,\|\,p(r)\bigr) \tag{3.26}
\]
\[
= \frac{1}{D}\sum_{m=1}^{D} \mathbb{E}_{r\mid\theta_m}\bigl[-\ln S_m(r)\bigr] \tag{3.27}
\]
where

\[
S_m(r) := \frac{p(r)}{p(r\mid\theta_m)} = \frac{1}{M}\sum_{i=1}^{M}\prod_{j=1}^{N}\Biggl(\frac{g_{\langle i-c_j\rangle+1}}{g_{\langle m-c_j\rangle+1}}\Biggr)^{r_j}\, e^{-\bigl(g_{\langle i-c_j\rangle+1} - g_{\langle m-c_j\rangle+1}\bigr)} \tag{3.28}
\]

40 and the gradient equals

\[
\frac{\partial I(r;\theta)}{\partial g_i} = \frac{1}{D}\,\mathbb{E}_{r\mid\theta_{m_i}}\Biggl[\Bigl(1 - \frac{r_{j_i}}{g_i}\Bigr)\ln S_{m_i}(r)\Biggr] \tag{3.29}
\]
\[
\frac{\partial I}{\partial f_i} = \sum_{j=1}^{M} \frac{\partial I}{\partial g_j}\,\nu_{\langle j-i\rangle+1} \tag{3.30}
\]

where mi is the integer in {1, ..., D} such that (i − mi) mod D = 0, and ji = (i − mi)/D + 1 ∈ {1, ..., N}.

3.2.3 Multiple populations

When neurons are split into different populations, we discretize the model in a similar way. Suppose there are K populations of neurons, each having N neurons with tuning curves {f_i^{(k)}(θ)}, i = 1, ..., N, k = 1, ..., K. For convenience, we apply the same discretization of space and the same convolution kernel for all populations, i.e.

\[
f_{k,m} := f^{(k)}(\theta_m), \qquad
g_{k,m} := \sum_{j=1}^{M} f_{k,\langle m-j\rangle+1}\,\nu_j
\]
for all k = 1, ..., K, m = 1, ..., M. The tuning curves in each population are

cyclically invariant with respect to the shifts {si} = (0, 1/N, ..., (N − 1)/N), i.e.
\[
f_i^{(k)}(\theta_m) = f^{(k)}(\theta_m - s_i) = f_{k,\langle m-c_i\rangle+1}
\]

Note that M is also chosen to be an integer multiple of N, i.e. M = ND.

A natural assumption is that the spike counts of different populations

41 are also conditionally independent:

\[
p(r\mid\theta_m) = \prod_{k=1}^{K} p(r_{k,1}, ..., r_{k,N}\mid\theta_m) \tag{3.31}
\]
Combining this with the conditional independence within each population (3.4), we have

\[
p(r\mid\theta_m) = \prod_{k=1}^{K}\prod_{n=1}^{N} p(r_{k,n}\mid\theta_m)
= \prod_{k=1}^{K}\prod_{n=1}^{N} \frac{e^{-g_{k,\langle m-c_n\rangle+1}}\, g_{k,\langle m-c_n\rangle+1}^{\,r_{k,n}}}{r_{k,n}!} \tag{3.32}
\]
Following the same process as before, we formulate the optimization problem for the multi-population model

\[
\max_f\; I(r;\theta)
\]
\[
f_- \le f_{k,m} \le f_+, \qquad \frac{1}{M}\sum_{m=1}^{M} f_{k,m} = \bar f_k \tag{3.33}
\]
with the mutual information and its gradient

\[
I(r;\theta) = \frac{1}{D}\sum_{m=1}^{D} \mathbb{E}_{r\mid\theta_m}\bigl[-\ln S_m(r)\bigr] \tag{3.34}
\]
\[
\frac{\partial I(r;\theta)}{\partial g_{k,i}} = \frac{1}{D}\,\mathbb{E}_{r\mid\theta_{m_i}}\Biggl[\Bigl(1 - \frac{r_{k,j_i}}{g_{k,i}}\Bigr)\ln S_{m_i}(r)\Biggr] \tag{3.35}
\]
\[
\frac{\partial I}{\partial f_{k,i}} = \sum_{j=1}^{M} \frac{\partial I}{\partial g_{k,j}}\,\nu_{\langle j-i\rangle+1} \tag{3.36}
\]
where

\[
S_m(r) = \frac{p(r)}{p(r\mid\theta_m)} = \frac{1}{M}\sum_{i=1}^{M}\prod_{k=1}^{K}\prod_{n=1}^{N}\Biggl(\frac{g_{k,\langle i-c_n\rangle+1}}{g_{k,\langle m-c_n\rangle+1}}\Biggr)^{r_{k,n}} \frac{e^{-g_{k,\langle i-c_n\rangle+1}}}{e^{-g_{k,\langle m-c_n\rangle+1}}} \tag{3.37}
\]
In practice, the mutual information and its derivatives can be computed at the same time using precomputed values of ln(Sm(r)). In a Monte Carlo

algorithm with S random samples of r, evaluating I(r; θ) and {∂I(r; θ)/∂g_{k,i}} requires O(SM²K) operations and O(M²K) storage. These computations can be parallelized.

3.2.4 Numerical results

Our optimization tests are run with Python programs on a Linux server in the Department of Mathematics, UT Austin. The machine is equipped with a 24-core Intel L5640 CPU at 2.27 GHz. For constrained optimization, we apply Sequential Least Squares Programming (often referred to as "SLSQP"; details can be found in [31]).
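A minimal sketch of the SLSQP setup for the constrained problem (3.20), using scipy.optimize.minimize with a cheap surrogate objective (the variance of f) standing in for the Monte Carlo estimate of −I(r; θ); all parameter values below are illustrative, not those of the experiments:

```python
import numpy as np
from scipy.optimize import minimize

# SLSQP with box bounds and an equality constraint on the mean firing rate.
M, f_minus, f_plus, f_bar = 16, 0.1, 9.7, 3.0

def neg_objective(f):
    # surrogate placeholder for -I(r;theta): rewards spread toward the bounds
    return -np.var(f)

res = minimize(
    neg_objective,
    x0=np.linspace(f_minus, f_plus, M),            # linear initial condition
    method="SLSQP",
    bounds=[(f_minus, f_plus)] * M,                # f_- <= f_m <= f_+
    constraints=[{"type": "eq",
                  "fun": lambda f: f.mean() - f_bar}],
)
f_opt = res.x
```

Incidentally, maximizing the variance under these constraints tends to push the components toward the bounds, loosely mirroring the binary-valued phenomenon; in the actual experiments the objective and gradient are the Monte Carlo estimates (3.18) and (3.21)–(3.22).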

We set the scaling factor τ = 1.0 and the SLSQP step-size η = 1.5 × 10⁻⁷ in the algorithm. We parallelize the Monte Carlo algorithm, starting with 10⁴ samples and gradually increasing the sample size to 10⁵ or 10⁶ once the tuning curve stops changing between iterations. Usually, the algorithm converges within 500 iterations when f+ is less than 10, and converges more slowly when f+ is larger.

The binary-valued phenomenon

First, we investigate the effect of the average constraint f̄ on the tuning curves and on their mutual information. We fix the number of neurons and bins at M = N = 64, the maximal and minimal firing rates at f+ = 9.7 and f− = 0.1, and use a constant convolution kernel ν of 16 bins' width. We run the optimization algorithm subject to different average constraints f̄, starting from a linear initial condition. When f̄ = f+ or f−, the only option for f(θ) would be a constant function, which conveys zero information. For f̄ ∈ (f−, f+), the results of optimization are shown in Figure 3.4.

Figure 3.4: Optimized tuning curves with different average constraints² (panels: Average = 1.0, MI = 3.6449; Average = 1.9, MI = 3.7504; Average = 3.7, MI = 3.8598; Average = 4.9, MI = 3.6687; Average = 6.1, MI = 3.6140; Average = 8.5, MI = 2.5853)

An interesting phenomenon is that almost all the tuning curves tend to reach the boundaries: neurons fire at either the maximum or the minimum firing rate, and seldom take values in between. This is a key feature that arises in almost all the numerical experiments for the cyclic invariant model.

As f̄ increases, f(θ) takes the value f+ on a larger portion of the circle to satisfy the average constraint. Here, the values of f̄ in Figure 3.4 are chosen such that it is possible for an integer number of the fi (out of M = 64 bins in total) to equal f+ or f−. When f̄ is chosen such that |{i : fi = f+}| cannot be an integer, a

2“MI” stands for “Mutual Information” in the figures.

similar effect shows up: f is also binary-valued except at only a few points.

Figure 3.5: Optimized tuning curves in high firing rate regimes (columns: f̄ = 3.7, 4.9, 6.1; top row, computed tuning curves with MI = 4.0455, 4.0813, 4.0130; bottom row, hand-adjusted tuning curves with MI = 4.0661, 4.0845, 4.0260)

When f+ is higher (close to 20 in Figure 3.5), the algorithm produces the solutions shown in the first row of Figure 3.5. In this case, not all neurons fire at either f+ or f−. We compare these solutions with the "hand-adjusted" functions in the second row, which do reach the boundaries. Although the mutual information is higher for the adjusted tuning curves, the improvements are very small. This indicates that there is no strong tendency for f(θ) to be either f− or f+.

Another observation is that f(θ) demonstrates some high-frequency oscillations, both in Figure 3.4 and in Figure 3.5. This is not a result of initialization (our initial conditions for f are linear functions with no oscillation). In Section 3.3, we find an explanation in terms of a continuous limit: these seemingly "random" oscillations make the auto-correlation functions of f closer to uniform, thus increasing the mutual information.

Figure 3.6: Mutual information vs. average firing rate f̄ in different maximal firing rate regimes (curves: f+ = 2.5, 3.3, 4.9, 9.7, 12.9)

Besides the shapes of f, it is important to note how the mutual information changes with f̄. Starting from zero at f̄ = f−, the optimal mutual information first increases as we raise f̄, then falls back to zero at f̄ = f+, as shown in Figure 3.6. It reaches its peak at a value f̄ < (f− + f+)/2, which holds true in different maximal firing rate regimes (in this figure we take f− = 0.1 in all the tests and vary f+). This is also a phenomenon that we will study in Section 3.3, Lemma 3.3.4. If the average firing rate is fixed, a higher upper bound increases the degrees of freedom of the optimization problem, allowing the neurons to transmit more information.

Finally, the binary-valued phenomenon is also present when there is more than one population (K > 1) and when the neurons are centered on a subset of discretized bins (D > 1). Examples of these cases are shown in Figure 3.7. Note that for two populations, the optimal tuning curves do not have mutually exclusive supports, contrary to the result of maximizing the Fisher information in Chapter 2.

Figure 3.7: Optimized tuning curves with different numbers of populations and subset discretization (panels: K = 1, D = 1, MI = 2.4273; K = 2, D = 1, MI = 3.0380; K = 1, D = 2, MI = 2.0865)

The effect of convolution

Figure 3.8 shows the optimization results with different convolution kernels ν. From top to bottom, the rows show: the convolution kernel ν, the optimized tuning curve f(θ), and the corresponding rate function g = ν ∗ f. Initial conditions and all the constraints (f−, f+, f̄) are kept the same. In all the tests, the obtained tuning curves are binary-valued almost everywhere (except for 3 points in the first column). Even with an exponential-decay type of kernel (the last column), the algorithm converges to a piecewise constant tuning curve, though the rate curve g has a completely different shape from the others as a result of the convolution.

For rectangular kernels of different widths, the frequency with which f alternates between f+ and f− decreases as the convolution kernel becomes wider, from a nearly random assignment when there is no convolution (the first column) to more "continuous" curves (the second to fifth columns). One possible explanation is that the scale of ν and the "period" of f should match,

so that the rate curve g can reach the upper and lower bounds as much as possible; this can thus explain the "binary-valued phenomenon" in the numerical results. However, we also observe that the optimal solutions are asymmetric in the size of their continuous pieces, and that f is not exactly periodic. These phenomena will be revisited in Section 3.3.3.

Figure 3.8: Optimized tuning curves and rate curves with different convolution kernels (rows: convolution kernel, tuning curve, rate curve; columns: Width = 1, MI = 4.1447; Width = 2, MI = 3.9915; Width = 4, MI = 3.3838; Width = 8, MI = 2.7307; Width = 16, MI = 2.0474; exponential-decay kernel of width 16, MI = 2.9512)

3.3 Continuous Model

In this section, we explore the optimal tuning curves in a continuous regime. Specifically, we consider the limit of the model in Section 3.2.1 as M → ∞, i.e. we let both the number of neurons and the number of bins go to infinity under a proper scaling. We expect the continuous solutions to resemble the discrete ones.

To ensure the finiteness of mutual information for an infinite number of neurons, we need the expected number of spikes at each stimulus position to remain finite. A natural assumption is that the average firing rate of each

neuron scales as 1/n, so that the total average number of spikes stays constant as n → ∞. We refer to this setting as a low-firing-rate regime. Specifically, we consider a sequence of tuning curves {fn(θ)} satisfying:

1. Average constraint:

$$\bar{f}_n = \int_0^1 f_n(\theta)\,d\theta = \frac{\bar{f}}{n}, \qquad \forall n = 1, 2, \ldots \tag{3.38}$$

2. Bound constraint:

$$\frac{f_-}{n} \le f_n \le \frac{f_+}{n}, \qquad \forall n = 1, 2, \ldots \tag{3.39}$$

3. For all s ∈ [0, 1), assume the following limit exists:

$$\lim_{n\to\infty} n f_n\!\left(\frac{\lfloor sn \rfloor}{n}\right) = f(s) \tag{3.40}$$

thus defining a limit tuning curve f, which satisfies

$$\int_0^1 f(\theta)\,d\theta = \bar{f}, \qquad f_- \le f(\theta) \le f_+. \tag{3.41}$$
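The scaling above can be spot-checked numerically. The sketch below assumes an illustrative binary-valued limit curve f (the specific shape is not from the text), builds the discretized sequence f_n(i/n) = f(i/n)/n, and verifies the three constraints:

```python
import numpy as np

f_minus, f_plus, f_bar = 0.1, 1.9, 1.0

def f_limit(s):
    # An assumed binary-valued limit tuning curve on the circle [0, 1):
    # alternates between f_plus and f_minus on ten equal intervals.
    return np.where(np.floor(s * 10 + 1e-9).astype(int) % 2 == 0, f_plus, f_minus)

for n in [10, 100, 1000]:
    theta = np.arange(n) / n            # bin positions i/n
    f_n = f_limit(theta) / n            # discrete tuning curve, scaled by 1/n
    avg = f_n.mean()                    # Riemann sum of f_n over [0, 1)
    assert np.isclose(avg, f_bar / n)                           # average constraint (3.38)
    assert np.all((f_minus / n <= f_n) & (f_n <= f_plus / n))   # bound constraint (3.39)
    assert np.allclose(n * f_n, f_limit(theta))                 # rescaled curve recovers f (3.40)
```

The rescaling n·f_n converges back to the limit curve by construction, which is the content of condition (3.40).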

For each fn(θ), consider the model in Section 3.2.1 where the number of neurons and the number of bins are both equal to M = n. Furthermore, we ignore the effect of convolution by taking the kernel ν to be the Dirac delta function.

Hence, the mutual information (3.18) for fn can be expressed as a functional of fn:

$$I_n[f_n] := -E_{r|\theta=0}\left[\ln\left(\frac{1}{n}\sum_{i=0}^{n-1}\prod_{j=0}^{n-1}\left(\frac{f_n\big(\frac{i+j}{n}\big)}{f_n\big(\frac{j}{n}\big)}\right)^{r_j}\right)\right] \tag{3.42}$$

where

$$p(r \mid \theta = 0) = \prod_{i=0}^{n-1} \frac{f_n\big(\frac{i}{n}\big)^{r_i}\, e^{-f_n(\frac{i}{n})}}{r_i!} \tag{3.43}$$

Note that we start the indexing from zero for convenience.
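Equation (3.42) is an expectation over spike counts drawn at θ = 0, so it can be estimated by plain Monte Carlo. The sketch below assumes an arbitrary binary-valued cyclic tuning curve; the values, grid size, and sample count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
# an assumed binary-valued tuning curve on the n bins (aperiodic on purpose)
f = np.array([2.0, 2.0, 0.1, 2.0, 0.1, 0.1, 2.0, 0.1])

def mi_estimate(f, num_samples=200_000):
    n = len(f)
    # shifts[i, j] = ln f((i+j)/n) - ln f(j/n)
    shifts = np.stack([np.roll(np.log(f), -i) - np.log(f) for i in range(n)])
    r = rng.poisson(f, size=(num_samples, n))   # spike counts given theta = 0
    # per sample: ln of (1/n) * sum_i prod_j (f((i+j)/n)/f(j/n))^{r_j}, via log-sum-exp
    log_terms = r @ shifts.T                    # shape (num_samples, n)
    m = log_terms.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1) / n)
    return -log_mix.mean()

I_hat = mi_estimate(f)
assert 0.0 < I_hat < np.log(n)   # MI is nonnegative and bounded by H(theta) = ln n
```

The i = 0 term of the inner sum always contributes e⁰ = 1, so every sample’s contribution is at most ln n, matching the information-theoretic bound.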

The limit of In[fn] is derived in Appendix 1.2, and can be expressed in terms of the limit tuning curve f:

$$\lim_{n\to\infty} I_n[f_n] = I[f] \tag{3.44}$$

where

$$I[f] := -\sum_{k=0}^{\infty} \frac{e^{-\bar{f}}}{k!} \int_0^1 \cdots \int_0^1 \prod_{j=1}^{k} ds_j\, f(s_j)\, \ln\left(\frac{\int_0^1 \prod_{j=1}^{k} f(s_j - \theta)\,d\theta}{\prod_{j=1}^{k} f(s_j)}\right) \tag{3.45}$$

The right-hand side can also be interpreted as an expectation with respect to a Poisson point process N on the circle [0, 1) with intensity f:

$$I[f] = -E_N\left[\ln \int_0^1 e^{\int_0^1 \ln\left(\frac{f(s-\theta)}{f(s)}\right) dN(s)}\, d\theta\right] \tag{3.46}$$

It is important to note that in this low-firing-rate regime, spike counts higher than one (i.e. samples with some r_j ≥ 2) do not contribute to the limit (see Appendix 1.2 for details): each neuron either spikes once or does not spike. For this reason, we refer to I[f] as a Poissonian limit, in the sense that all but a finite number of neurons are silent, the spiking neurons fire at most once almost surely, and their positions follow a Poisson point process on the circle.
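The Poisson-point-process form (3.46) suggests a direct simulation: draw the number of spikes from Poisson(f̄), place them with density f/f̄, and average the log term over a discretized θ. A rough sketch under assumed parameters (the binary f below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 512
grid = np.arange(N) / N
f = np.where((grid * 6).astype(int) % 2 == 0, 1.9, 0.1)  # assumed binary tuning curve
f_bar = f.mean()                                          # average firing rate (about 1)
log_f = np.log(f)
t_idx = np.arange(N)

def poissonian_limit_mc(num_samples=20_000):
    """Monte Carlo estimate of the Poissonian limit (3.46)."""
    acc = 0.0
    for _ in range(num_samples):
        k = rng.poisson(f_bar)                        # number of spiking neurons
        idx = rng.choice(N, size=k, p=f / f.sum())    # spike positions with density f / f_bar
        inner = np.zeros(N)                           # integral of ln(f(s-theta)/f(s)) dN(s)
        for i in idx:
            inner += log_f[(i - t_idx) % N] - log_f[i]
        m = inner.max()                               # log-sum-exp over theta
        acc -= m + np.log(np.mean(np.exp(inner - m)))
    return acc / num_samples

I_hat = poissonian_limit_mc()
assert 0.0 < I_hat < 1.0
```

Samples with k = 0 contribute exactly zero, in line with the observation that no spikes transmit no information.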

3.3.2 The optimal tuning curves

In this section, we explore the properties of the tuning curve f that maximizes the Poissonian limit I[f] subject to the boundedness and average constraints. Since the neurons spike according to a Poisson point process, we analyze I[f] by conditioning on the number of neurons that spike. We begin with a model problem in which only 2 neurons fire, then generalize to an arbitrary number of neurons.

3.3.2.1 A model problem with two neurons

Suppose that only two neurons spike (each spiking only once in the low-firing-rate regime) at random locations x, y ∈ [0, 1). Without loss of generality, we scale f by its average so that $\int_0^1 f(s)\,ds = 1$. In this case, f can be interpreted as the conditional probability density function of x and y, i.e.

$$p(x \mid \theta) = f(x - \theta), \qquad p(y \mid \theta) = f(y - \theta).$$

In addition, conditional independence holds:

$$p(x, y \mid \theta) = p(x \mid \theta)\, p(y \mid \theta) = f(x - \theta)\, f(y - \theta).$$

Then the mutual information between the observed firing locations (x, y) and the input θ equals

$$I(x, y; \theta) = D_{KL}(p(x, y, \theta)\,\|\,p(x, y)p(\theta)) = 2\int_0^1 dx\, f(x) \ln f(x) - \int_0^1 du \int_0^1 dy\, f(y+u)\, f(y)\, \ln\left(\int_0^1 d\theta\, f(u+\theta)\, f(\theta)\right)$$

Note that the second term can be written in terms of

$$A_f(u) := \int_0^1 d\theta\, f(\theta)\, f(u+\theta) \tag{3.47}$$

the auto-correlation function of f. Given the constraints on f, A_f is also normalized and bounded, i.e.

$$\int_0^1 A_f(u)\,du = 1, \qquad f_- \le A_f \le f_+$$

Since f and Af can be seen as probability densities on [0, 1), the mutual information can be expressed in terms of two meaningful components:

$$I(x, y; \theta) = -2H[f] + H[A_f] \tag{3.48}$$

where H[·] is the continuous entropy [46]:

$$H[f] = -\int_0^1 f(x) \ln f(x)\,dx \tag{3.49}$$
$$H[A_f] = -\int_0^1 A_f(x) \ln A_f(x)\,dx \tag{3.50}$$

Here, we take the strategy of maximizing the two components −2H[f] and H[A_f] separately. For the first term, a maximizer f should saturate to the bounds almost everywhere:

Lemma 3.3.1. Assume f(x) is piecewise C¹ on [0, 1). Under the constraints that f− ≤ f ≤ f+ and $\int_0^1 f(x)\,dx = 1$,

$$-H[f] = \int_0^1 f(x) \ln f(x)\,dx$$

is maximized when f = f+ or f− a.e.

Proof. The idea is to apply the Karush–Kuhn–Tucker (KKT) conditions [40]. Let G[f] = −H[f] be the objective function. It can be shown from the definition that the functional derivative of G[f] w.r.t. f is

$$\nabla_f G[f](x) = \ln f(x) + 1 \tag{3.51}$$

First, for a local maximizer of G, there exist functions α+(x) ≥ 0, α−(x) ≥ 0 and a constant µ, such that for all x where f(x) (and thus also ∇_f G(x)) is continuous, the following KKT conditions hold:

$$\nabla_f G[f](x) - \alpha_+(x) + \alpha_-(x) - \mu = 0 \tag{3.52}$$
$$\alpha_+(x)\,(f(x) - f_+) = 0 \tag{3.53}$$

$$\alpha_-(x)\,(f_- - f(x)) = 0 \tag{3.54}$$

Substituting in ∇_f G = ln f + 1, the first condition becomes

$$\ln f(x) + 1 = \alpha_+(x) - \alpha_-(x) + \mu \tag{3.55}$$

Furthermore, from the complementary slackness conditions (3.53, 3.54), when f(x) ≠ f+ or f−, we have α+(x) = α−(x) = 0, which gives us

$$\ln f(x) + 1 = \mu \tag{3.56}$$

Hence, f(x) is a piecewise constant function which takes at most 3 values: f+, f− and e^{µ−1}.

Next, we show that the third option is impossible, by contradiction. Write G[f] as

$$G(w_1, w_2, w_3, f_3) = \sum_{i=1}^{3} w_i f_i \ln f_i \tag{3.57}$$

where {f_1, f_2, f_3} are the discrete values of f and {w_1, w_2, w_3} are the measures of {f = f_i} on the circle. Here we denote f_1 = f_-, f_2 = f_+, and f_3 the intermediate value e^{µ−1}. Therefore, we aim to maximize G with respect to {w_i} and f_3 over the domain [0, 1]³ × [f_-, f_+], subject to the following constraints:

$$g_1 = \sum_{i=1}^{3} w_i f_i - 1 = 0, \qquad g_2 = \sum_{i=1}^{3} w_i - 1 = 0 \tag{3.58}$$

Note that the first equality follows from $\int_0^1 f(x)\,dx = 1$.

The boundary values of (w_1, w_2, w_3, f_3) correspond to either the constant function f(x) = 1 or a function taking only f_- and f_+. For an interior point (w_1, w_2, w_3, f_3), we apply the KKT conditions again and obtain

$$\frac{\partial G}{\partial w_i} - \lambda_1 \frac{\partial g_1}{\partial w_i} - \lambda_2 \frac{\partial g_2}{\partial w_i} = 0, \quad i = 1, 2, 3 \tag{3.59}$$
$$\frac{\partial G}{\partial f_3} - \lambda_1 \frac{\partial g_1}{\partial f_3} - \lambda_2 \frac{\partial g_2}{\partial f_3} = 0 \tag{3.60}$$

where λ_1, λ_2 ∈ ℝ are multipliers associated with the equality constraints. By evaluating the gradients, the above equations become

$$f_i \ln f_i - \lambda_1 f_i - \lambda_2 = 0, \quad i = 1, 2, 3 \tag{3.61}$$
$$w_3(1 + \ln f_3) - \lambda_1 w_3 = 0 \tag{3.62}$$

Taking i = 3, we solve for the constants λ1 and λ2:

$$\lambda_1 = 1 + \ln f_3, \qquad \lambda_2 = -f_3.$$

Substituting λ1, λ2 into the first equation with i = 1 or 2:

$$f_i \ln f_i = f_i - f_3 + f_i \ln f_3$$

Subtracting $f_3 \ln f_3$ from both sides, we obtain

$$f_i \ln f_i - f_3 \ln f_3 = (1 + \ln f_3)(f_i - f_3)$$
$$\frac{f_i \ln f_i - f_3 \ln f_3}{f_i - f_3} = 1 + \ln f_3, \quad i = 1, 2. \tag{3.63}$$

However, this contradicts the strict convexity of y = x ln x: the right-hand side is the slope of x ln x at f_3, while the left-hand side is the slope of the chord between (f_i, f_i ln f_i) and (f_3, f_3 ln f_3). Hence, it is impossible to have an intermediate value f_3 ∈ (f_-, f_+) with positive measure, and we conclude that the optimal solution maximizing G[f] takes only f_+ and f_- a.e.
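Lemma 3.3.1 can be spot-checked numerically: among curves with the same mean and the same bounds, a binary-valued profile yields the largest ∫ f ln f, i.e. the smallest entropy. The grid, bounds, and comparison curve below are illustrative assumptions:

```python
import numpy as np

f_minus, f_plus = 0.1, 1.9
N = 10_000
x = np.arange(N) / N

def neg_entropy(f):
    # -H[f] = integral of f ln f over [0, 1), approximated on the grid
    return np.mean(f * np.log(f))

# binary profile with mean 1: half at f_plus, half at f_minus
f_binary = np.where(x < 0.5, f_plus, f_minus)
# a smooth profile with the same mean, respecting the same bounds
f_smooth = 1.0 + 0.9 * np.sin(2 * np.pi * x)

assert np.isclose(f_binary.mean(), 1.0)
assert np.isclose(f_smooth.mean(), 1.0, atol=1e-6)
assert neg_entropy(f_binary) > neg_entropy(f_smooth)   # binary saturation wins
```

By convexity of x ln x, any curve with values in [f−, f+] and mean 1 is dominated by the two-point profile, which is exactly the saturation conclusion of the lemma.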

The above Lemma shows that minimizing the entropy of f requires f to be binary-valued. On the other hand, maximizing the entropy of Af needs the auto-correlation function to be least-informative, i.e. the uniform distribution on [0, 1). This can be shown by the following lemma:

Lemma 3.3.2. The maximum entropy distribution on [0, 1) is the uniform distribution over this circle.

The proof follows directly from [46] (see Appendix 1.3 for details). As shown by this lemma, the maximum of H[Af ] is reachable only when Af is the uniform distribution on [0, 1). Ideally, this happens when f(x) is i.i.d.

Bernoulli distributed with P(f = f+) = Δ and P(f = f−) = 1 − Δ, where Δ = (1 − f−)/(f+ − f−) is chosen to satisfy the average constraint. In this case,

$$A_f(x) = \int f(\theta)\, f(x+\theta)\,d\theta = E[f(\theta)\, f(x+\theta)] = E[f(\theta)]\, E[f(x+\theta)] = (\Delta f_+ + (1-\Delta) f_-)^2 = 1.$$

Figure 3.9 shows a numerical simulation in which we investigate how the uniformity of A_f is affected by the shape of f. We randomly take k intervals {I_i^{(+)}} in which f = f+ and k intervals {I_i^{(−)}} in which f = f−. For each k ranging from 1 to 100, we randomly sample the lengths of the I_i^{(+)}’s and I_i^{(−)}’s from a Dirichlet distribution (setting $\sum_{i=1}^k |I_i^{(+)}| = \sum_{i=1}^k |I_i^{(-)}| = \frac{1}{2}$), and plot the maximum of H[A_f] over 1000 samples in the left figure.

H[Af] as k 100 k = 10 k = 100 f f 0.00

0.02 1 1

0.04

0.06 Af Af

0.08

1 1

0.10

0 20 40 60 80 100 Figure 3.9: Random saturated functions with k peaks3

Two random examples with k = 10 and k = 100 are drawn in the middle and right figures, with f plotted on top and its auto-correlation A_f on the bottom. As k increases, the piecewise function f becomes more “uncorrelated”, which results in an increasing uniformity of A_f, with H[A_f] approaching 0.
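A compact re-creation of this experiment, under the same assumed parameters (f+ = 1.9, f− = 0.1, equal total lengths): sample random binary step functions with k positive and k negative pieces and compare the entropies of their autocorrelations.

```python
import numpy as np

rng = np.random.default_rng(2)
f_minus, f_plus, N = 0.1, 1.9, 2000

def random_saturated(k):
    # alternating intervals: k at f_plus and k at f_minus, Dirichlet-distributed lengths
    lengths = np.empty(2 * k)
    lengths[0::2] = rng.dirichlet(np.ones(k)) / 2   # f_plus intervals, total length 1/2
    lengths[1::2] = rng.dirichlet(np.ones(k)) / 2   # f_minus intervals, total length 1/2
    counts = np.maximum(np.round(lengths * N).astype(int), 1)
    f = np.repeat(np.tile([f_plus, f_minus], k), counts)[:N]
    return np.pad(f, (0, N - len(f)), constant_values=f_minus)

def H_autocorr(f):
    # circular autocorrelation via FFT, normalized to integrate to 1
    A = np.fft.irfft(np.abs(np.fft.rfft(f)) ** 2, n=N) / N
    A /= A.mean()
    return -np.mean(A * np.log(A))   # continuous entropy of A_f; the uniform density gives 0

best = {k: max(H_autocorr(random_saturated(k)) for _ in range(200)) for k in (2, 100)}
assert best[2] < best[100] <= 1e-9   # H[A_f] <= 0, and approaches 0 as k grows
```

Since the uniform density maximizes continuous entropy at 0, the entropies are negative and climb toward 0 as the pieces decorrelate.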

Combining the maximization results of its two components, the mutual information of the two-neuron model (3.48) has the following upper bound:

$$I(x, y; \theta) = -2H[f] + H[A_f] \le -2H[f_+, f_-] \tag{3.64}$$

where H[f+, f−] is the entropy of a piecewise constant density function taking only f+ and f−, i.e.

$$H[f_+, f_-] := -\left(f_+ \ln(f_+)\,\Delta + f_- \ln(f_-)\,(1 - \Delta)\right) \tag{3.65}$$

where Δ = |{f = f+}| = (1 − f−)/(f+ − f−).

3.3.2.2 Generalization to the Poissonian limit

Now we generalize the above 2-neuron model to the Poissonian limit (3.45) by conditioning on the number of neurons that spike.

First, the above discussion of the 2-neuron model only holds when f is normalized. For a tuning curve f on [0, 1) with $\bar{f} = \int_0^1 f(s)\,ds \ne 1$, we simply apply the above conclusion to $\tilde{f} = f/\bar{f}$:

$$I(x, y; \theta) = \int dx \int dy\, f(x)\, f(y)\, \ln\left(\frac{f(x)\, f(y)}{\int d\theta\, f(x-\theta)\, f(y-\theta)}\right) = \bar{f}^2\left(-2H[\tilde{f}] + H[A_{\tilde{f}}]\right) \le -2\bar{f}^2 H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right]$$

³In this experiment we take f+ = 1.9, f− = 0.1 and f̄ = 1, such that Δ = 1 − Δ = 1/2.

where the maximum is attained when f̃ randomly takes the values f+/f̄ and f−/f̄.

Furthermore, if k neurons spike at x_1, ..., x_k ∈ [0, 1), then the mutual information would be

$$I(x_1, \ldots, x_k; \theta) = \int_0^1 \cdots \int_0^1 \prod_{j=1}^{k} dx_j\, f(x_j)\, \ln\left(\frac{\prod_{j=1}^{k} f(x_j)}{\int_0^1 \prod_{j=1}^{k} f(x_j - \theta)\,d\theta}\right) = \bar{f}^k\left(-kH[\tilde{f}] + H[A^{(k)}_{\tilde{f}}]\right) \le -k\bar{f}^k H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right]$$

where $A^{(k)}_f$ is the “k-fold autocorrelation”, which integrates the product of k shifted copies of f:

$$A^{(k)}_f(u_1, \ldots, u_{k-1}) := \int_0^1 f(\theta) \prod_{j=1}^{k-1} f(u_j + \theta)\,d\theta \tag{3.66}$$

Note that $A^{(k)}_{\tilde{f}}$ is also uniform when f (and thus f̃) is uncorrelated.

In a Poisson point process that defines the continuous limit (3.45, 3.46),

k neurons would fire on the circle with probability $e^{-\bar{f}} \bar{f}^k / k!$. If k = 0, no neuron fires, so zero information is transmitted. If k = 1, the information is I(x; θ) = −H[f], which also reaches its maximum when f = f+ or f−. The cases k ≥ 2 are discussed in the model problems above, which indicate that f needs to equal f+ or f− a.e. and be uncorrelated. Therefore, we conclude that the maximum of the Poissonian limit is reached at “binary-valued and uncorrelated” tuning curves:

Theorem 3.3.3. The Poissonian limit (3.45) is upper bounded by

$$I[f] \le -\bar{f}\, H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right] \tag{3.67}$$

where

$$H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right] := -\left(\frac{f_+}{\bar{f}} \ln\!\left(\frac{f_+}{\bar{f}}\right) \Delta + \frac{f_-}{\bar{f}} \ln\!\left(\frac{f_-}{\bar{f}}\right) (1 - \Delta)\right) \tag{3.68}$$

and Δ = (f̄ − f−)/(f+ − f−). The maximum is attained when f(x) = f− or f+ a.e. with |{f = f+}| = Δ, |{f = f−}| = 1 − Δ, and f is uncorrelated between its values, i.e. E[f(x)f(y)] = E[f(x)]E[f(y)] if x ≠ y. In this case, H[f+/f̄, f−/f̄] is the continuous entropy of f/f̄.

Proof. Let f̃ = f/f̄. Based on the definition (3.45),

$$\begin{aligned} I[f] &= \sum_{k=0}^{\infty} \frac{e^{-\bar{f}}}{k!} \int_0^1 \cdots \int_0^1 \prod_{j=1}^{k} ds_j\, f(s_j)\, \ln\left(\frac{\prod_{j=1}^{k} f(s_j)}{\int_0^1 \prod_{j=1}^{k} f(s_j - \theta)\,d\theta}\right) \\ &= \sum_{k=0}^{\infty} \frac{e^{-\bar{f}}}{k!} \bar{f}^k \int_0^1 \cdots \int_0^1 \prod_{j=1}^{k} ds_j\, \tilde{f}(s_j)\, \ln\left(\frac{\prod_{j=1}^{k} \tilde{f}(s_j)}{\int_0^1 \prod_{j=1}^{k} \tilde{f}(s_j - \theta)\,d\theta}\right) \\ &= \sum_{k=0}^{\infty} \frac{e^{-\bar{f}}}{k!} \bar{f}^k \left(-kH[\tilde{f}] + H[A^{(k)}_{\tilde{f}}]\right) \\ &\le \sum_{k=0}^{\infty} \frac{e^{-\bar{f}}}{k!} \bar{f}^k \left(-k H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right]\right) \\ &= -\bar{f}\, H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right] \end{aligned}$$

where the second-to-last step follows from Lemma 3.3.1 and Lemma 3.3.2.

We conclude this section by taking a further step to maximize the right-hand-side of Equation (3.67), with proof given in Appendix 1.4:

Lemma 3.3.4. The maximum of $-\bar{f}\, H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right]$ is attained at

$$\bar{f}^* = \exp\left(\frac{f_+ \ln f_+ - f_- \ln f_-}{f_+ - f_-} - 1\right) \tag{3.69}$$

Furthermore,

$$\max\left\{f_-, \frac{f_+}{e}\right\} < \bar{f}^* < \frac{f_+ + f_-}{2}. \tag{3.70}$$
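Formula (3.69) and the bounds (3.70) are easy to check numerically; the sketch below compares the closed form against a grid search over f̄, for assumed bounds f+ = 4.0 and f− = 0.1:

```python
import numpy as np

f_plus, f_minus = 4.0, 0.1

def F(f_bar):
    # F(f_bar) = -f_bar * H[f_plus/f_bar, f_minus/f_bar], the bound in (3.67)
    delta = (f_bar - f_minus) / (f_plus - f_minus)
    a, b = f_plus / f_bar, f_minus / f_bar
    return f_bar * (delta * a * np.log(a) + (1 - delta) * b * np.log(b))

f_bar_star = np.exp((f_plus * np.log(f_plus) - f_minus * np.log(f_minus))
                    / (f_plus - f_minus) - 1)          # closed form (3.69)

grid = np.linspace(f_minus + 1e-6, f_plus - 1e-6, 200_001)
numeric_argmax = grid[np.argmax(F(grid))]

assert abs(numeric_argmax - f_bar_star) < 1e-3                            # matches (3.69)
assert max(f_minus, f_plus / np.e) < f_bar_star < (f_plus + f_minus) / 2  # bounds (3.70)
```

Setting dF/df̄ = 0 yields ln f̄ = (f+ ln f+ − f− ln f−)/(f+ − f−) − 1, which the grid search confirms.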

The noise of the Poisson distribution may account for the conclusion that f̄* is smaller than the mean of f+ and f−. An informative neural code should be able to discriminate different firing positions, so f̄ should be “roughly in the middle” of f+ and f−. However, since a Poisson distribution with expected value (and thus variance) λ = f+ is noisier than one with λ = f−, it is better for more neurons to fire at f− than at f+.

Figure 3.10: The optimal average firing rate f̄* as a function of f+, fixing f− = 0.1 (log-log scale, with the reference curves (f+ + f−)/2 and f+/e)

If the minimum f− is fixed, then the optimal average firing rate approaches f+/e as f+ → ∞. This is shown in Figure 3.10, where the value of f̄* as a function of f+ is plotted on a log-log scale according to Equation (3.69). It is very close to (f+ + f−)/2 at small values of f+; however, as f+ increases, it approaches f+/e.

3.3.3 Adding convolution

When convolution is applied to the tuning curve, the Poissonian limit (3.45) becomes

$$I[g] = -E_N\left[\ln \int_0^1 e^{\int_0^1 \ln\left(\frac{g(s-\theta)}{g(s)}\right) dN(s)}\, d\theta\right] \tag{3.71}$$

given that the rate curve g equals f ∗ ν. Intrinsically, we are optimizing over a function space smaller than before, with smoothness constraints on g.

Although the results above cannot be directly generalized to this case, they can explain part of the binary-valued phenomenon that we observed in numerical experiments (Section 3.2.4). First, as shown in Theorem 3.3.3, an optimal choice of g should reach the upper and lower bounds almost everywhere, and its values should be uncorrelated. Since g is a convolution of f, its autocorrelation function is a “double convolution” of the autocorrelation of f:

$$A_g(x) = \int_0^1 g(\theta)\, g(x+\theta)\,d\theta = \int_0^1 \nu(u)\,du \int_0^1 \nu(s)\,ds\, A_f(x + u - s)$$

Thus the uniformity of A_g suggests that A_f should be uniform, which implies that the values of f should be uncorrelated.

On the other hand, if f is totally uncorrelated, the rate curve g may not reach the maximum and minimum firing rates. Uncorrelated assignments to f+ or f−, as shown in the k = 100 case of Figure 3.9, will give us a uniform g:

$$g(x) = \int_0^1 \nu(y)\,dy\, f(x - y) = E[\nu]\, E[f] = 1$$

which provides zero information. Consequently, the requirements of f being binary-valued and uncorrelated cannot be satisfied at the same time.

In numerical experiments (Figure 3.8), we observe that the optimal tuning curve f would still reach f+ and f− a.e. (which enables g to reach the bounds as much as possible), and it favors a typical length scale which is closely related to the width of the convolution kernel. An illustrative example is shown below. In this experiment, we take arbitrary step functions with at most 4 continuous regions, as plotted in Figure 3.11. The input space is equally split into n = 128 bins, and d+, d− stand for the number of bins where f takes f+ and f− in the first two continuous regions, respectively. The average firing

rate is set to (f+ + f−)/2. To obtain the rate function g, we convolve f with a flat convolution kernel of width 32 bins.

Figure 3.11: Piecewise constant tuning curve with 4 continuous regions⁴ (of widths d+, d−, n/2 − d+, n/2 − d−)

Mutual information for every combination (d−, d+) is evaluated using the Monte Carlo method with 10⁶ samples. The values of mutual information

⁴In this experiment we take f+ = 2, f− = 0, f̄ = (f+ + f−)/2 = 1, and d+, d− ∈ {0, ..., 64}.

Figure 3.12: Mutual information (left) and 3 samples of tuning curves (right)

are plotted in Figure 3.12a with a colormap. An interesting symmetry-breaking phenomenon can be observed: the mutual information reaches its minimum when the tuning curve is fully symmetric (i.e. d+ = d− = 32, shown as f1 in Figure 3.12b). The reason is that the rate curve then repeats itself exactly after half of the period, which means that half of the space plays no role in encoding.
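The fully symmetric case can be checked directly: with d+ = d− = 32 on n = 128 bins, the convolved rate curve is invariant under a half-period shift, so stimuli θ and θ + 1/2 are indistinguishable. A sketch using the experiment's parameters (the asymmetric comparison point is an arbitrary choice):

```python
import numpy as np

n, width = 128, 32
f_plus, f_minus = 2.0, 0.0

def tuning(d_plus, d_minus):
    # 4 regions: d_+ bins at f_+, d_- at f_-, n/2 - d_+ at f_+, n/2 - d_- at f_-
    return np.concatenate([
        np.full(d_plus, f_plus), np.full(d_minus, f_minus),
        np.full(n // 2 - d_plus, f_plus), np.full(n // 2 - d_minus, f_minus),
    ])

def rate(f):
    kernel = np.ones(width) / width   # flat convolution kernel, width 32 bins
    # circular convolution of f with the kernel
    return np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(kernel, n)))

g_sym = rate(tuning(32, 32))          # fully symmetric configuration
assert np.allclose(g_sym, np.roll(g_sym, n // 2))       # g repeats after half a period

g_asym = rate(tuning(40, 24))         # a near-symmetric configuration
assert not np.allclose(g_asym, np.roll(g_asym, n // 2)) # half-period symmetry is broken
```

The symmetric tuning curve is itself n/2-periodic, and convolution preserves periodicity, which is exactly why half the stimulus space becomes redundant.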

On the other hand, the values of mutual information are relatively high at near-symmetric configurations (for example, (d+, d−) in region "2" of Figure 3.12a and f2 in Figure 3.12b). Under the effect of convolution, these rate curves reach the upper and lower bounds as much as possible while avoiding the drawbacks of full symmetry. Tuning configurations like f3 (region "3" in Figure 3.12a) lose coding efficiency, since the corresponding rate curves are not as “uncorrelated” as those like f2.

3.4 Extensions

3.4.1 Extension to multiple populations

Assume there are two populations of neurons satisfying the conditional independence condition, i.e.

$$p(r_1, r_2 \mid \theta) = p(r_1 \mid \theta)\, p(r_2 \mid \theta) \tag{3.72}$$

Suppose that a discretization using n bins is applied to the input space, that each population has n neurons with cyclic-invariant tuning curves fn and gn, respectively, and that no convolution is applied. The mutual information between the responses (two populations combined) and the stimulus can be expressed as

$$I_n[f_n, g_n] = -E_{r_1, r_2|\theta=0}\left[\ln\left(\frac{1}{n}\sum_{i=0}^{n-1}\prod_{j=0}^{n-1}\left(\frac{f_n\big(\frac{i+j}{n}\big)}{f_n\big(\frac{j}{n}\big)}\right)^{r_{1,j}}\prod_{k=0}^{n-1}\left(\frac{g_n\big(\frac{i+k}{n}\big)}{g_n\big(\frac{k}{n}\big)}\right)^{r_{2,k}}\right)\right] \tag{3.73}$$

where

$$p(r_1, r_2 \mid \theta = 0) = \prod_{i=1}^{n} \frac{f_n\big(\frac{i}{n}\big)^{r_{1,i}}\, e^{-f_n(\frac{i}{n})}}{r_{1,i}!} \prod_{j=1}^{n} \frac{g_n\big(\frac{j}{n}\big)^{r_{2,j}}\, e^{-g_n(\frac{j}{n})}}{r_{2,j}!} \tag{3.74}$$

Similar to the previous section, by introducing a proper scaling on fn and gn

(following the assumptions 3.38, 3.39, 3.40), it can be verified that In[fn, gn] approaches the following continuous limit, as a functional of the limit tuning curves f and g:

$$I[f, g] = -E_{N_1, N_2}\left[\ln \int_0^1 e^{\int_0^1 \ln\frac{f(s-\theta)}{f(s)}\, dN_1(s) + \int_0^1 \ln\frac{g(t-\theta)}{g(t)}\, dN_2(t)}\, d\theta\right] \tag{3.75}$$

$$= \sum_{n_1, n_2} \frac{e^{-\bar{f}}}{n_1!} \frac{e^{-\bar{g}}}{n_2!} \int \cdots \int \prod_{i=1}^{n_1} \prod_{j=1}^{n_2} f(s_i)\, g(t_j)\, ds_i\, dt_j\, \ln\left(\frac{\prod_{i=1}^{n_1} f(s_i) \prod_{j=1}^{n_2} g(t_j)}{\int \prod_{i=1}^{n_1} f(s_i - \theta) \prod_{j=1}^{n_2} g(t_j - \theta)\,d\theta}\right) \tag{3.76}$$

where N_1, N_2 are two independent Poisson point processes with intensities f and g, respectively.

The above limit can be decomposed into two parts,

$$I[f, g] = \sum_{n_1, n_2} \frac{e^{-\bar{f}} \bar{f}^{n_1}}{n_1!} \frac{e^{-\bar{g}} \bar{g}^{n_2}}{n_2!} \left(-n_1 H[\tilde{f}] - n_2 H[\tilde{g}] + H\!\left[A^{(n_1, n_2)}_{\tilde{f}, \tilde{g}}\right]\right)$$

where f̃ = f/f̄, g̃ = g/ḡ, and $A^{(n_1, n_2)}_{f, g} = \int \prod_{i=1}^{n_1} f(s_i - \theta) \prod_{j=1}^{n_2} g(t_j - \theta)\,d\theta$. Following the same process as in the derivation of Theorem 3.3.3, we obtain an upper bound on the mutual information for two populations:

$$I[f, g] \le -\bar{f}\, H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right] - \bar{g}\, H\left[\frac{g_+}{\bar{g}}, \frac{g_-}{\bar{g}}\right] \tag{3.77}$$

where the maximum is attained when f = f+ or f− a.e., g = g+ or g− a.e.,

and $E\left[\prod_{i=1}^{n_1} f(x_i) \prod_{j=1}^{n_2} g(y_j)\right] = \prod_{i=1}^{n_1} E[f(x_i)] \prod_{j=1}^{n_2} E[g(y_j)]$ for all n_1, n_2, i.e. the values of f and g are uncorrelated. The same conclusion applies to more than 2 populations.

A natural question to ask is whether splitting into different populations is beneficial in terms of information transfer. Suppose the maximum and minimum firing rates are identical for two populations, i.e. f+ = g+, f− = g−. Then in the ideal case, in which the right-hand-side of Equation 3.77 is reached, the mutual information would be

$$F(\bar{f}) + F(\bar{g}) := -\bar{f}\, H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right] - \bar{g}\, H\left[\frac{f_+}{\bar{g}}, \frac{f_-}{\bar{g}}\right] \tag{3.78}$$

Figure 3.13: The function F(f̄) with f+ = 4.0, f− = 10⁻⁵

where the function $F(\bar{f}) = -\bar{f}\, H\left[\frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right]$. It is shown in the proof of Lemma 3.3.4 that F(f̄) is concave with respect to f̄, as demonstrated by its shape (see Figure 3.13). Therefore, if the total firing rate of the two populations is fixed, i.e. f̄ + ḡ = c, then based on the concavity of F,

$$F(\bar{f}) + F(\bar{g}) \le 2F\left(\frac{\bar{f} + \bar{g}}{2}\right) = 2F\left(\frac{c}{2}\right)$$

where equality holds if and only if f̄ = ḡ. In other words, the best strategy is to take the same average firing rate for both populations, meaning that there is no benefit to splitting the population.
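The concavity of F, on which this argument rests, can be verified numerically via the midpoint inequality F((a+b)/2) ≥ (F(a)+F(b))/2; the parameter values below match the figure and are otherwise illustrative:

```python
import numpy as np

f_plus, f_minus = 4.0, 1e-5

def F(f_bar):
    # F(f_bar) = -f_bar * H[f_plus/f_bar, f_minus/f_bar]
    delta = (f_bar - f_minus) / (f_plus - f_minus)
    a, b = f_plus / f_bar, f_minus / f_bar
    return f_bar * (delta * a * np.log(a) + (1 - delta) * b * np.log(b))

rng = np.random.default_rng(3)
pairs = rng.uniform(f_minus + 1e-6, f_plus - 1e-6, size=(10_000, 2))
mid = F(pairs.mean(axis=1))                    # F at the midpoint of each pair
avg = 0.5 * (F(pairs[:, 0]) + F(pairs[:, 1]))  # average of the endpoint values
assert np.all(mid >= avg - 1e-9)               # midpoint concavity of F
```

Differentiating F twice gives F''(f̄) = −1/f̄ < 0, so the inequality holds for every pair.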

3.4.2 Extension to multiple inputs

So far, we have only considered models where neurons receive input from a single source θ. In more practical cases, each neuron responds to more than one stimulus variable, say θ_1, ..., θ_k. Suppose that θ_1, ..., θ_k are i.i.d. uniform random variables. A natural assumption is

66 that the probability of firing at position x equals the sum of k copies of a tuning curve f, shifted by θ1, ..., θk:

$$p(x \mid \theta_1, \ldots, \theta_k) = \sum_{l=1}^{k} f(x - \theta_l) \tag{3.79}$$

Using the same approach as before, we decompose the mutual information into two terms which can be maximized separately. If only one neuron fires at position x, then

$$\begin{aligned} I(x; \theta_1, \ldots, \theta_k) &= \int dx \int d\theta_1 \cdots \int d\theta_k \sum_{l=1}^{k} f(x - \theta_l)\, \ln\left(\frac{\sum_{l=1}^{k} f(x - \theta_l)}{\int ds_1 \cdots ds_k \sum_{l=1}^{k} f(x - s_l)}\right) \\ &= k\bar{f} \int dx \int d\theta_1 \cdots \int d\theta_k\, \frac{\sum_{l=1}^{k} f(x - \theta_l)}{k\bar{f}}\, \ln\left(\frac{\sum_{l=1}^{k} f(x-\theta_l)\,/\,k\bar{f}}{\int ds_1 \cdots ds_k \sum_{l=1}^{k} f(x-s_l)\,/\,k\bar{f}}\right) \\ &= k\bar{f} \int d\theta_1 \cdots \int d\theta_k\, h_{\tilde{f},k}(\theta_1, \ldots, \theta_k) \ln h_{\tilde{f},k}(\theta_1, \ldots, \theta_k) - k\bar{f} \int dx\, A_{\tilde{f},k}(x) \ln A_{\tilde{f},k}(x) \\ &= -k\bar{f}\, H[h_{\tilde{f},k}] + k\bar{f}\, H[A_{\tilde{f},k}] \end{aligned}$$

where f̃ = f/f̄,

$$h_{\tilde{f},k}(x_1, \ldots, x_k) := \frac{\sum_{l=1}^{k} \tilde{f}(x_l)}{k}, \qquad A_{\tilde{f},k}(x) := \int d\theta_1 \cdots \int d\theta_k\, \frac{\sum_{l=1}^{k} \tilde{f}(x - \theta_l)}{k}.$$

Compared to the previous model, we have the entropy of a “high-dimensional averaged” version of f̃ = f/f̄ instead of the entropy of f̃. Still, it can be shown that −H[h_{f̃,k}] is maximized when f takes f+ or f− a.e., using the same KKT approach as in the proof of Lemma 3.3.1. Since the values of h are

67 linear combinations of f+ and f−, the maximum of mutual information can be expressed in terms of binomial coefficients:

$$I(x; \theta_1, \ldots, \theta_k) \le k\bar{f}\, B\left[k, \frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right] \tag{3.80}$$

where

$$B[k, a, b] := \sum_{i=0}^{k} \binom{k}{i} \Delta^i (1-\Delta)^{k-i}\, \frac{ia + (k-i)b}{k}\, \ln\left(\frac{ia + (k-i)b}{k}\right) \tag{3.81}$$

and Δ = (1 − b)/(a − b).
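For k = 1, the bound (3.81) reduces to the two-point expression Δ a ln a + (1 − Δ) b ln b appearing in (3.65). A small sketch (the values of a and b are illustrative):

```python
import numpy as np
from math import comb

def B(k, a, b):
    # B[k, a, b] from (3.81): binomial average of x ln x over the mixture levels of h
    delta = (1 - b) / (a - b)
    total = 0.0
    for i in range(k + 1):
        level = (i * a + (k - i) * b) / k   # value of h when i of the k components are "high"
        total += comb(k, i) * delta**i * (1 - delta)**(k - i) * level * np.log(level)
    return total

a, b = 1.9, 0.1                             # assumed values of f_plus/f_bar, f_minus/f_bar
delta = (1 - b) / (a - b)
assert np.isclose(B(1, a, b), delta * a * np.log(a) + (1 - delta) * b * np.log(b))
# averaging over more inputs flattens h toward its mean 1, so B shrinks with k
assert B(1, a, b) > B(2, a, b) > B(4, a, b) > 0
```

Since x ln x is convex with value 0 at the mean, Jensen's inequality keeps B positive while the k-fold averaging pulls it toward 0.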

Likewise, if there are n firing cells positioned at x_1, ..., x_n, with the conditional independence assumption

$$p(x_1, \ldots, x_n \mid \theta_1, \ldots, \theta_k) = \prod_{i=1}^{n}\left(\sum_{l=1}^{k} f(x_i - \theta_l)\right) \tag{3.82}$$

then we reach a similar conclusion:

$$\begin{aligned} I(x_1, \ldots, x_n; \theta_1, \ldots, \theta_k) &= \int dx_1 \cdots \int dx_n \int d\theta_1 \cdots \int d\theta_k \prod_{i=1}^{n}\left(\sum_{l=1}^{k} f(x_i - \theta_l)\right) \ln\left(\frac{\prod_{i=1}^{n} \sum_{l=1}^{k} f(x_i - \theta_l)}{\int \prod_{l} ds_l \prod_{i} \sum_{l=1}^{k} f(x_i - s_l)}\right) \\ &\le n \bar{f}^n k\, B\left[k, \frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right] \end{aligned} \tag{3.83}$$

Furthermore, if the firing positions follow a Poisson point process on the circle, the mutual information between the cell responses and the k inputs would be

$$I(r; \theta_1, \ldots, \theta_k) \le \sum_{n=0}^{\infty} \frac{e^{-\bar{f}}}{n!}\, n \bar{f}^n k\, B\left[k, \frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right] \tag{3.84}$$

Let $U_k(\bar{f}; n) = n\bar{f}^n k\, B\left[k, \frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}}\right]$ be the upper bound of the mutual information when n neurons fire, and let $U_k(\bar{f}) = \sum_{n=0}^{\infty} \frac{e^{-\bar{f}}}{n!} U_k(\bar{f}; n)$ be the right-hand side of Equation (3.84), as a function of f̄. If there are two populations of neurons, each receiving k inputs, with tuning curves f and g, it can be verified using a similar approach as in Section 3.4.1 that:

$$I(r_1, r_2; \theta_1, \ldots, \theta_k) \le U_k(\bar{f}) + U_k(\bar{g}) \tag{3.85}$$

Figure 3.14: The mutual information upper bound U_k(f̄) for k = 1, 2, 4 inputs

The functions U_k(f̄) for k = 1, 2, 4 inputs are plotted in Figure 3.14 with fixed values of f− = 10⁻⁵ ≈ 0 and f+ = 4. One observation is that the magnitude of U_k grows exponentially with k. Comparing the different components in Figure 3.15, we find that the magnitude at n = 16 is the largest, which is reasonable since we expect about 16 neurons to fire when there are 4 inputs, if f̄ ≈ f+ = 4.

Following the idea of the previous section, the information gained from population splitting depends on the concavity of U_k. Here, for larger values of k (e.g. k = 4), we notice that U_k(f̄) is no longer concave, and there is


Figure 3.15: Selected components of U_k(f̄) when k = 4

a certain threshold t_k (approximately at the peak) such that U_k(f̄) has different concavity properties on (f−, t_k) and (t_k, f+). Therefore, if we constrain the sum of average firing rates c = f̄ + ḡ, then choosing f̄ = ḡ may not necessarily yield the maximum information.

An example is shown in Figure 3.16 for k = 4. We observe that the strategy for choosing f̄ and ḡ depends on the value of c. When c/2 is small (less

than (t_k + f−)/2 ≈ t_k/2), choosing f̄ = f− and ḡ = c − f− will give the largest information, since U_k is convex on (f−, t_k):

$$U_k(f_-) + U_k(\bar{f} + \bar{g} - f_-) \ge 2U_k\left(\frac{\bar{f} + \bar{g}}{2}\right) = 2U_k\left(\frac{c}{2}\right)$$

However, such population splitting is not meaningful, since all neurons in the f population have a constant tuning curve f(x) = f− and thus provide no information. If c is larger, when t_k/2 < c/2 < t_k, we can take f̄ and ḡ such that

Figure 3.16: The mutual information upper bound U_k(f̄) when k = 4, for three values of c

the line connecting (f̄, U_k(f̄)) and (ḡ, U_k(ḡ)) is tangent to the curve at ḡ, as shown in the middle panel of Figure 3.16. Such values of f̄ and ḡ can transmit a larger amount of information than f̄ = ḡ = c/2, since

$$U_k(\bar{f}) + U_k(\bar{g}) \ge 2U_k\left(\frac{\bar{f} + \bar{g}}{2}\right) = 2U_k\left(\frac{c}{2}\right)$$

Finally, if c/2 is higher than t_k, then there is no benefit to population splitting, since U_k is concave there, as shown in the right panel of Figure 3.16.

Therefore, given multiple inputs, there is a special case in which population splitting is beneficial for information transmission: when the number of inputs k is large enough that U_k(f̄) is convex on (f−, t_k) and concave on (t_k, f+), and

that t_k/2 < (f̄ + ḡ)/2 < t_k.

We note that the above conclusion only holds when the sum of firing rates f̄ + ḡ = c is fixed. If we introduce more freedom by fixing a weighted sum of firing rates, i.e. w_1 f̄ + w_2 ḡ = c with free parameters w_1, w_2, then there would be no meaningful population splitting. Based on the shape of the curve

U_k(f̄), the maximum of mutual information would be achieved either when f̄ = ḡ or when one of the firing rates equals f+ or f−.

3.5 Discussion

We study the properties of optimal neural tuning curves that maximize the mutual information I(r; θ) under a cyclic invariance condition. Since analytical solutions are not attainable, we resort to numerical solutions obtained by Monte Carlo simulations based on a discretization of f. Geometrically, a group of discrete cyclic invariant tuning curves can be constructed by picking one point (f_1, ..., f_M) in the M-dimensional cube [f−, f+]^M, followed by making M cyclic permutations of its coordinates.

Numerical results suggest that neurons fire at either the maximum or minimum firing rate, i.e. f+ or f−, with no intermediate values, in the case of a low maximum firing rate f+. This result inspires further theoretical analysis in terms of a continuous limit. We consider a sequence of tuning curves {f_n}, such that each f_n is the tuning curve for n equally spaced neurons on the circle and lim_{n→∞} n f_n(θ) = f(θ). The latter condition ensures the finiteness of the average firing rate. As n → ∞, the mutual information I[f_n] converges to a limit I[f], which can be seen as an expectation with respect to a Poisson point process with intensity f. In this limit, although there are infinitely many neurons on the circle, only a finite number of them spike, each spiking only once, which means that neurons fire with low activity.

For the continuous limit of mutual information, we prove that its maximizer can be obtained by generating an infinite, uncorrelated Bernoulli sequence taking either f+ or f−, with probabilities chosen such that f satisfies the average constraint f̄. Geometrically, this corresponds to randomly selecting a corner of an infinite-dimensional cube (with [f−, f+] on each side) such that the coordinates of the picked corner have average f̄.

The binary-valued property of optimal tuning curves is consistent with our numerical results as well as some previous studies [44, 6, 41, 39]. However, the conclusions from the continuous limit cannot be easily generalized to the case when the firing rate is obtained by convolving a tuning curve with a spatial filter. Such a convolution describes memory dependence of the firing process. Furthermore, in the continuous limit, there is no information benefit to having a heterogeneous population of tuning curves, except for those obtained from cyclic permutations. When there is more than one input to the neurons (in which case the input is multi-dimensional), population splitting is only beneficial in some special cases, depending on the concavity of the mutual information as a function of f̄.

These limitations stem from the cyclic invariance constraint as well as the low-firing-rate assumption. Our next step is to revisit the Poisson model without these assumptions and study the optimal tuning curves using a different particle method.

Chapter 4

Particle Model

In the last chapter, we explored the optimal tuning curves in a cyclic invariant model and theoretically analyzed their properties in a low-firing-rate regime (the Poissonian limit). In this chapter, we revisit the above model in a more general setting by dropping the cyclic invariance and low-firing-rate assumptions. We present theoretical results showing that the optimal tuning curves are still discrete; however, they are not necessarily binary-valued. We also show numerical results under different constraints, obtained by applying an alternating maximization algorithm.

4.1 Basic Setting

We apply the same notation and assumptions for the Poisson model as in the last chapter (Section 3.1), except that the cyclic invariance condition (3.1) is dropped. Geometrically, in the state space, cyclic invariance means that the curve θ ↦ (f_1(θ), f_2(θ), ..., f_N(θ)) is invariant under rotations that are integer multiples of 1/N, as illustrated by the 3-dimensional example in Figure 3.1. Without this restriction, the state-space curve is no longer invariant under rotations, as shown in Figure 4.1.

Figure 4.1: Tuning curves for 3 neurons without cyclic invariance

For simplicity of analysis, in this chapter we also ignore the dependence of the firing rate on the stimulus history, i.e. we set the convolution kernel ν to be the Dirac delta function. The tuning and rate functions thus coincide, and the conditional probability mass function of the spike counts r = (r_1, ..., r_N) is:

$$p(r \mid \theta) = \prod_{k=1}^{N} p(r_k \mid \theta) = \prod_{k=1}^{N} \frac{f_k(\theta)^{r_k}\, e^{-f_k(\theta)}}{r_k!} \tag{4.1}$$

Note that we assume the typical time τ = 1 as before. Also following the previous chapter, the stimulus is uniformly distributed on the circle, i.e. p(θ) = 1 for θ ∈ C = [0, 1). Therefore, the mutual information is given by

$$I(r; \theta) = D_{KL}(p(r, \theta)\,\|\,p(r)p(\theta)) = \int_0^1 d\theta\, D_{KL}(p(r \mid \theta)\,\|\,p(r)) = \int_0^1 d\theta \sum_{r_1=0}^{\infty} \cdots \sum_{r_N=0}^{\infty} p(r \mid \theta) \ln\left(\frac{p(r \mid \theta)}{p(r)}\right) \tag{4.2}$$

where

$$p(r) = \int_0^1 p(r \mid \theta)\,d\theta \tag{4.3}$$

Since the tuning curves are no longer invariant under cyclic translations, we need to solve for N functions (f_1, ..., f_N) in order to maximize the mutual information. To introduce more freedom into the tuning curves, we constrain only their boundedness and remove the average constraints. To summarize, we consider the constrained optimization problem below:

$$\max_{f_1, \ldots, f_N} I(r; \theta) \tag{4.4}$$

$$f_- \le f_k(\theta) \le f_+, \quad k = 1, \ldots, N \tag{4.5}$$

4.2 Discreteness of Optimal Tuning Curves

For a cyclic invariant model, it is observed and analytically proved in

Chapter 3 that an optimal tuning curve only takes two values: f− and f+ in a low-firing regime. Without the restriction of cyclic invariance, solutions to the optimization problem (4.4, 4.5) would also satisfy a certain property of discreteness, which we will show in this section.

76 4.2.1 Equal Coding Theorem

The transmission of the signal can be decomposed into two steps: first, a stimulus θ is converted to f_1(θ), ..., f_N(θ), and second, neurons spike according to these firing rates. Denote by λ_k the firing rate of the k-th neuron (k = 1, ..., N) and let λ = (λ_1, ..., λ_N). The encoding step from θ to λ is deterministic; written in the form of a probability, it is

p(λ θ) = 1{λ f θ } = 1{λ f θ } (4.6) | = ( ) k= k( ) k Y=1 And the second step is conditionally independent and Poisson distributed (4.1),

$$p(r \mid \lambda) = \prod_{k=1}^{N} \frac{e^{-\lambda_k} \lambda_k^{r_k}}{r_k!} \qquad (4.7)$$

Together, given an input θ, the joint conditional distribution of (λ, r) is

$$p(\lambda, r \mid \theta) = \begin{cases} \prod_{k=1}^{N} \dfrac{e^{-f_k(\theta)} f_k(\theta)^{r_k}}{r_k!} & \text{if } \lambda = f(\theta) \\ 0 & \text{if } \lambda \neq f(\theta) \end{cases}$$

which equals p(λ | θ) p(r | λ). This implies that the whole encoding process is a Markov chain θ → λ → r, i.e.

$$p(\theta, \lambda, r) = p(\theta)\, p(\lambda, r \mid \theta) = p(\theta)\, p(\lambda \mid \theta)\, p(r \mid \lambda) \qquad (4.8)$$

Since the first step is noiseless, there is no loss of information if we simplify the process θ → λ → r to λ → r. Formally, we state this in the theorem below:

Theorem 4.2.1. (Equal Coding) For the Markov chain θ → λ → r with conditional probability distributions given by (4.6, 4.7),

$$I(r; \theta) = I(r; \lambda). \qquad (4.9)$$

The proof follows directly from Nikitin et al. [39] (see details in Appendix 1.5), whereas the term "Equal Coding" comes from Gjorgjieva et al. [20]. Based on this conclusion, the original problem (4.4, 4.5) can be converted to optimizing the coding efficiency from λ to r, i.e.

$$\max_{\lambda_1, \ldots, \lambda_N} I(r; \lambda) \qquad (4.10)$$

$$f_- \le \lambda_k \le f_+, \quad k = 1, \ldots, N \qquad (4.11)$$

4.2.2 The capacity-achieving problem

When information is transmitted from λ to r, the Poisson channel is fixed (4.7). Thus the joint distribution p(r, λ), and consequently the mutual information I(r; λ), are completely determined by the choice of the input distribution p(λ). In view of this idea, the problem (4.10, 4.11) can be formulated as finding an input distribution that reaches the Channel Capacity [46]:

$$C = \max_{\mu_\lambda} I(r; \lambda) \qquad (4.12)$$

$$\mu_\lambda \in M_S \qquad (4.13)$$

where μ_λ is the probability measure of λ, S = [f₋, f₊]^N is the domain of λ (with its Borel σ-algebra), and M_S is the set of all probability measures on S.

To simplify the notation, we drop the subscript λ and denote the input probability measure by μ. For the output r, its probability mass function is the integral of p(r | λ) with respect to μ,

$$p(r; \mu) = \int p(r \mid \lambda)\, d\mu(\lambda) \qquad (4.14)$$

The entropy of r and the conditional entropy of r given λ can be expressed as functionals of μ, denoted as

$$H_r(\mu) = -\sum_r \int d\mu\, p(r \mid \lambda) \ln p(r; \mu) \qquad (4.15)$$

$$H_{r\mid\lambda}(\mu) = -\sum_r \int d\mu\, p(r \mid \lambda) \ln p(r \mid \lambda) \qquad (4.16)$$

The mutual information is then

$$I(\mu) = H_r(\mu) - H_{r\mid\lambda}(\mu) = \sum_r \int d\mu\, p(r \mid \lambda) \ln \frac{p(r \mid \lambda)}{\int p(r \mid \lambda)\, d\mu} \qquad (4.17)$$

Suppose that the mutual information I(μ) can be maximized by a capacity-achieving measure μ* (whose existence and uniqueness will be justified in Section 4.2.3),

$$\mu^* = \arg\max_{\mu \in M_S} I(\mu) \qquad (4.18)$$

In the following section, we will show that μ* satisfies a specific property of discreteness. Although we only consider the constraint f₋ ≤ f_k(θ) ≤ f₊ in numerical experiments, our theoretical results below apply to more general cases in which S is only required to be closed and bounded and other linear constraints are imposed. Here, we generalize the problem as follows:

• Let S be a closed and bounded subset of (0, ∞)^N (for example, S = [f₋, f₊]^N where f₋ > 0);

• Let G be a finite set of constraints indexed by L, i.e. G = {g_l : l ∈ L}, where each g_l is a bounded, continuous ("well-behaved") function on S, such that

$$G_l(\mu) := E_\mu[g_l(\lambda)] \le 0, \quad \forall l \in L. \qquad (4.19)$$

(For example, when g_k(λ) = λ_k − f̄_k with k ∈ L = {1, 2, ..., N}, the above inequality translates to E[λ_k] ≤ f̄_k, i.e. a bounded average firing rate.)

• Let M_{S,G} be the collection of probability measures on S that satisfy the constraints in G. Note that M_{S,G} is a subset of M_S.

Based on the above definitions, we aim to find a capacity-achieving measure under the constraints of S and G:

$$\mu^* = \arg\max_{\mu \in M_{S,G}} I(\mu) \qquad (4.20)$$

For reference, the boundedness of the support S and the constraints (4.19) are sometimes referred to as the "bounded-input/peak constraint" and the "average cost/power constraint" in the information-theoretic literature ([11, 41]).

Under different settings, the discreteness of capacity-achieving measures has been explored by Smith [43], Shamai [41] and Chan et al. [11]. Smith [43] showed that the capacity of a scalar Gaussian channel under a bounded-input constraint is achieved by a discrete measure supported on only a finite number of values; the existence, uniqueness, and necessary and sufficient conditions for such a measure were obtained. The same approach was applied to a scalar Poisson channel by Shamai [41], where the capacity-achieving measure is also shown to have finite support. Chan et al. [11] showed that a multi-dimensional Gaussian channel achieves its capacity by a discrete measure under certain conditions (e.g. the inverse covariance matrices are constant and the constraints g_l(λ) are quadratic). The notion of discreteness was extended from ℝ to ℝ^N, which differs from finiteness in high dimensions.

We will follow the same steps as the above-mentioned previous works to show that μ* in (4.18) or (4.20) is discrete. The idea is as follows. First, topologies are defined on the spaces M_S and M_{S,G} in order to introduce the notions of continuity and strict concavity. Second, combining the convexity and compactness of M_S (or M_{S,G}) with the continuity and concavity of I(μ), we show the existence and uniqueness of μ*. Third, we apply the same definition of discreteness as Chan et al. [11], and prove a necessary and sufficient condition for μ* by taking the weak derivative of I(μ) and using the Lagrange multiplier method. Finally, the discreteness of μ* is obtained by extending the results of Shamai [41] from a scalar Poisson channel to our multi-dimensional, conditionally independent Poisson channel.

4.2.3 Existence and uniqueness of solution

In this part, we show the existence and uniqueness of µ∗ in both (4.18) and (4.20). First, it is necessary to make an important note for the topologies on MS and MS,G.

Let M(ℝ^N) denote the space of all probability measures on ℝ^N with its Borel σ-algebra. Let C_b(ℝ^N; ℝ) be the space of continuous, bounded, real-valued functions on ℝ^N, and denote by C_b*(ℝ^N; ℝ) its dual space. Then M(ℝ^N), as a subspace of C_b*(ℝ^N; ℝ), is naturally equipped with the topology induced by the weak* topology on C_b*(ℝ^N; ℝ) (in D. W. Stroock's book [45] the term "weak topology" is used).

It is known that M(ℝ^N) with the weak* topology is a Polish space¹ with the Lévy–Prokhorov metric (Theorem 3.1.11, [45]):

$$d_{levy}(\mu, \nu) := \inf\big\{\epsilon : \mu(F) \le \nu(F^{(\epsilon)}) + \epsilon \text{ and } \nu(F) \le \mu(F^{(\epsilon)}) + \epsilon \text{ for all closed } F \subseteq \mathbb{R}^N\big\}^2 \qquad (4.21)$$

Convergence with respect to the above metric is equivalent to weak* convergence, denoted by μ_n ⇒ μ.

First, by applying the compactness of S as well as the boundedness and continuity of the constraints in G, the sets M_S and M_{S,G} are shown to be convex and compact, with proof given in Appendix 1.6.1:

Lemma 4.2.2. (Convexity and Compactness) M_S and M_{S,G} are convex and compact subsets of M(ℝ^N) in the Lévy–Prokhorov metric.

Next, the objective function I(μ) is concave and continuous on M_S (thus also on its subset M_{S,G}), with proofs given in Appendices 1.6.2 and 1.6.3:

Lemma 4.2.3. (Strict Concavity) I(µ), as a functional from MS to R, is strictly concave.

Lemma 4.2.4. (Continuity) Hr(µ), Hr|λ(µ) and I(µ) are continuous on MS.

¹ A Polish space is a separable metric space whose metric is complete.
² F^(ε) is the ε-neighborhood of F, i.e. F^(ε) = {x ∈ ℝ^N : ∃ y ∈ F, ‖x − y‖ < ε}.

Furthermore, I(μ) is weakly differentiable, with the definition as follows:

Definition 4.2.1. [43] A continuous function F : M_S → ℝ is weakly differentiable³ at μ₀ if the following limit exists for all μ₁ ∈ M_S:

$$DF(\mu_0; \mu_1) = \lim_{\tau \to 0^+} \frac{F((1-\tau)\mu_0 + \tau\mu_1) - F(\mu_0)}{\tau} \qquad (4.22)$$

The weak derivative of I(µ) is given by the following lemma, proved in Appendix 1.6.3:

Lemma 4.2.5. (Weak Differentiability) I(μ) is weakly differentiable in M_S. Its weak derivative at μ₀ in the direction of μ₁ is

$$DI(\mu_0; \mu_1) = \int i(\lambda; \mu_0)\, d\mu_1 - I(\mu_0) \qquad (4.23)$$

where

$$i(\lambda; \mu) := \sum_r p(r \mid \lambda) \ln \frac{p(r \mid \lambda)}{p(r; \mu)} = D_{KL}\big(p(r \mid \lambda) \,\|\, p(r; \mu)\big) \qquad (4.24)$$

and p(r; μ) = ∫ p(r | λ) dμ. Furthermore, i(λ; μ) is a continuous function of λ on S.

Based on the above properties of I(µ), we arrive at the conclusion that the capacity-achieving measure µ∗ in (4.18) or (4.20) exists and is unique:

Theorem 4.2.6. (Existence and Uniqueness) There exists μ* in M_S that maximizes the mutual information I(μ). Furthermore, μ* is unique in the sense that if ν is also a capacity-achieving measure, then d_levy(μ*, ν) = 0. The same conclusion holds for the constrained optimization problem over M_{S,G}.

³ Also known as Gateaux differentiable.

Proof. First, M_S is a convex and compact subset of M(ℝ^N) (Lemma 4.2.2). Since I(μ) is a continuous function over M_S (Lemma 4.2.4), it achieves its maximum in M_S.

The uniqueness of μ* can be shown by contradiction. Suppose there exist capacity-achieving probability measures μ₁, μ₂ such that μ₁ ≠ μ₂. Let μ_τ = τμ₁ + (1 − τ)μ₂ where τ ∈ (0, 1). By the strict concavity of I(μ) (Lemma 4.2.3), the mutual information with respect to μ_τ satisfies

$$I(\mu_\tau) > \tau I(\mu_1) + (1-\tau) I(\mu_2)$$

which contradicts the assumption that I(μ₁) = I(μ₂) = max I(μ). Therefore, the maximum is unique.

Similarly, since M_{S,G} is convex and compact, the existence and uniqueness of μ* = arg max_{μ∈M_{S,G}} I(μ) follow from the same arguments as above.

4.2.4 The discreteness of capacity-achieving measure

In this section, we prove that the capacity-achieving measure μ* is discrete. To specify the meaning of discreteness for a measure on a high-dimensional space, we follow the definitions introduced by Chan et al. [11]. In short, they can be summarized as:

A probability measure μ on ℝ^N is discrete if its set of points of increase is sparse.

Definitions of the above-mentioned terminologies are given by:

Definition 4.2.2. (Point of increase) [11] A point λ ∈ ℝ^N is a point of increase of a probability measure μ on ℝ^N if for any open subset O ⊂ ℝ^N that contains λ, μ(O) > 0.

Definition 4.2.3. (Sparse set) [11] A subset E of ℝ^N is sparse (in ℝ^N) if there exists a holomorphic (i.e. complex differentiable) function f defined on a connected open set U ⊂ ℂ^N containing the closure of E, such that f = 0 on E, but f is not identically zero on U.

We make an important note about sparse sets. In one dimension, the sparsity of a bounded subset E is equivalent to the finiteness of |E|, which can be verified by applying the identity theorem. However, this is not necessarily true in high dimensions. As a counterexample provided by [11], a collection of "concentric shells" in ℝ^N can be sparse without being a finite set (however, the number of shells must be finite).

Although a sparse set is not necessarily finite, the concept of sparsity still gives an idea of how "discrete" a set is. As a logically equivalent statement of Definition 4.2.3, the following proposition is used frequently in our proofs:

Proposition 4.2.7. Suppose E is not sparse in ℝ^N. For any holomorphic function f defined on a connected open subset U ⊂ ℂ^N containing Ē (the closure of E), if f is zero on E, then it is also zero on U.

For a measure μ on ℝ^N, let E_μ denote the set that contains all the points of increase of μ. It can be shown that E_μ is the minimal closed subset of ℝ^N such that μ(E_μ) = 1 ([11]). The sparsity of E_μ induces the discreteness of the measure μ, with the formal definition:

Definition 4.2.4. (Discrete measure)[11] A probability measure µ on RN is discrete if its set of points of increase Eµ is sparse.

We prove a necessary and sufficient condition for the capacity-achieving measure for M_{S,G}, and finally arrive at the discreteness of μ*. The same results can be applied to M_S by simply taking the set of constraints L to be empty.

Lemma 4.2.8. μ* ∈ M_{S,G} is the capacity-achieving measure if and only if there exists {φ_l ≥ 0 : l ∈ L} such that:

1. DJ_φ(μ*; ν) ≤ 0 for all ν ∈ M_S, where J_φ(μ) is the Lagrangian

$$J_\phi(\mu) := I(\mu) - \sum_{l} \phi_l G_l(\mu) \qquad (4.25)$$

and DJ_φ(μ*; ν) is the weak derivative (4.22) of J_φ;

2. φ_l G_l(μ*) = 0, ∀ l ∈ L.

Theorem 4.2.9. (Necessary and sufficient condition) μ* ∈ M_{S,G} is capacity-achieving if and only if there exists {φ_l ≥ 0 : l ∈ L} such that

$$i(\lambda; \mu^*) - I(\mu^*) - \sum_{l \in L} \phi_l g_l(\lambda) \le 0, \quad \forall \lambda \in S \qquad (4.26)$$

where i(λ; μ) = Σ_r p(r | λ) ln [p(r | λ)/p(r; μ)] = D_KL(p(r | λ) ‖ p(r; μ)) (Equation 4.24). Furthermore, the equality is reached for all λ ∈ E_{μ*} (the set of points of increase of μ*).

The proofs of Lemma 4.2.8 and Theorem 4.2.9 are given in Appendices 1.7.1 and 1.7.2.

We note that when μ ∈ M_S (i.e. there is no average constraint), the inequality (4.26) takes a simpler form:

$$D_{KL}\big(p(r \mid \lambda) \,\|\, p(r; \mu^*)\big) \le I(\mu^*), \quad \forall \lambda \in S \qquad (4.27)$$

and for λ ∈ E_{μ*}, the equality is reached:

$$D_{KL}\big(p(r \mid \lambda) \,\|\, p(r; \mu^*)\big) = I(\mu^*), \quad \forall \lambda \in E_{\mu^*} \qquad (4.28)$$

Therefore, for a capacity-achieving measure μ*, the distribution p(r; μ*) = ∫ p(r | λ) dμ*(λ) is "equidistant" to each p(r | λ) where λ is a point of increase, with "distance" measured by the Kullback–Leibler divergence. A geometric illustration is shown in Figure 4.2.

Finally, we show the discreteness of µ∗ by the following theorem.


Figure 4.2: A geometric interpretation of the necessary and sufficient condition of the capacity-achieving measure

Theorem 4.2.10. (Discreteness) Suppose μ* is capacity-achieving and {φ_l ≥ 0 : l ∈ L} are chosen such that the inequality in Theorem 4.2.9 is satisfied. Let E be the subset of S at which the equality is reached. When {g_l(λ) : l ∈ L} are linear functions, the set E is sparse, and hence μ* is discrete.

Proof. We apply the same idea of [41] (Theorem 1) and extend it to high dimensions.

First, given the Poisson channel p(r | λ) = ∏_{k=1}^N λ_k^{r_k} e^{−λ_k} / r_k!, we rewrite i(λ; μ*) as

$$i(\lambda; \mu^*) = -\sum_r p(r \mid \lambda) \ln \frac{\int p(r \mid \lambda')\, d\mu^*(\lambda')}{p(r \mid \lambda)} = -\sum_{r_1=0}^{\infty} \cdots \sum_{r_N=0}^{\infty} \prod_{k=1}^{N} \frac{\lambda_k^{r_k} e^{-\lambda_k}}{r_k!}\, \gamma(r) + \sum_{k=1}^{N} \lambda_k (\ln \lambda_k - 1)$$

where the factorials r_k! cancel in the ratio and

$$\gamma(r) := \ln \int_S \prod_{k=1}^{N} \lambda_k^{r_k} e^{-\lambda_k}\, d\mu^*(\lambda)$$

Let L := min_{λ∈S} min_k λ_k and M := max_{λ∈S} max_k λ_k. Since S is a bounded, closed subset of (0, ∞)^N, we have 0 < L < M. Choose a connected open set W ⊆ ℂ^N that contains (L/2, ∞)^N. Since W contains S, it also contains the closure of E (denoted by Ē).

Let i(ω; μ*) be the extension of i(λ; μ*) to the complex domain W, i.e.

$$i(\omega; \mu^*) = -\sum_{r_1=0}^{\infty} \cdots \sum_{r_N=0}^{\infty} \prod_{k=1}^{N} \frac{\omega_k^{r_k} e^{-\omega_k}}{r_k!}\, \gamma(r) + \sum_{k=1}^{N} \omega_k (\ln \omega_k - 1), \quad \omega \in W.$$

Note that i(ω; μ*) is analytic (holomorphic).

We show the sparsity of E by contradiction. Suppose that E is not sparse in ℝ^N. By Proposition 4.2.7, for any holomorphic function f defined on W, if f is zero on E, then it is also zero on W. Here we set the function f to be

$$f(\omega) = i(\omega; \mu^*) - I(\mu^*) - \sum_{l \in L} \phi_l g_l(\omega)$$

Since f is zero on E, it is also zero on W, i.e.

$$i(\omega; \mu^*) = I(\mu^*) + \sum_{l} \phi_l g_l(\omega), \quad \forall \omega \in W \qquad (4.29)$$

For the linear functions g_l, there exist {a_{j,l}} and {b_l} such that g_l(ω) = Σ_{j=1}^N a_{j,l} ω_j + b_l. The above equation can be written as

$$-\sum_{r_1} \cdots \sum_{r_N} \prod_{k=1}^{N} \frac{\omega_k^{r_k} e^{-\omega_k}}{r_k!}\, \gamma(r) = I(\mu^*) + \sum_{l} \phi_l \Big( \sum_{j=1}^{N} a_{j,l}\, \omega_j + b_l \Big) - \sum_{k=1}^{N} \omega_k (\ln \omega_k - 1)$$

or equivalently,

$$\sum_{r_1} \cdots \sum_{r_N} \prod_{k=1}^{N} \frac{\omega_k^{r_k} e^{-\omega_k}}{r_k!}\, \gamma(r) = -\sum_{l} \phi_l \sum_{j=1}^{N} a_{j,l}\, \omega_j + \sum_{k=1}^{N} \omega_k (\ln \omega_k - 1) + C \qquad (4.30)$$

where C = −Σ_l φ_l b_l − I(μ*) is a constant.

Here we apply equation (4.30) to real-valued ω ∈ (L/2, ∞)^N ⊂ W. On the left-hand side, γ(r) can be bounded by

$$\gamma(r) = \ln \int_S \prod_{k=1}^{N} \lambda_k^{r_k} e^{-\lambda_k}\, d\mu^* \le \ln M^{|r|} = \Big( \sum_{i=1}^{N} r_i \Big) \ln M$$

and therefore

$$\sum_{r_1} \cdots \sum_{r_N} \prod_{k=1}^{N} \frac{\omega_k^{r_k} e^{-\omega_k}}{r_k!}\, \gamma(r) \le \sum_{r_1} \cdots \sum_{r_N} \prod_{k=1}^{N} \frac{\omega_k^{r_k} e^{-\omega_k}}{r_k!} \sum_{k=1}^{N} r_k \ln M = \ln M \sum_{k=1}^{N} \omega_k$$

For large values of ω, the left-hand side of equation (4.30) therefore grows no faster than a constant multiple of Σ_k ω_k. However, the right-hand side grows as Σ_k ω_k ln ω_k, which leads to a contradiction. Hence, E is a sparse set. From Theorem 4.2.9, E_{μ*} is a subset of E; therefore we reach the conclusion that E_{μ*} is sparse and thus μ* is discrete.

4.3 Alternating Maximization Algorithm

In the last section, we showed that the capacity-achieving measure is discrete. Although in high dimensions discreteness does not necessarily imply finiteness of support, this result is still useful in developing an algorithm for optimizing the mutual information.

In a simplified situation, suppose that μ* = arg max_{μ∈M_S} I(μ) is supported on a finite number of points in ℝ^N, namely supp(μ*) = {λ₁, ..., λ_M}, with M being the cardinality of supp(μ*). In our Poisson firing model of N neurons, λ is the vector of firing rates when an input θ ∈ [0, 1) is applied, i.e. λ_k = f_k(θ), k = 1, 2, ..., N. Therefore, the tuning curves also take a finite number of states,

$$f = (f_1(\theta), \ldots, f_N(\theta)) \in \{f_1, \ldots, f_M\} \qquad (4.31)$$

where f_i = λ_i is a vector in S ⊂ ℝ^N. In numerical computations, f is represented by an N × M matrix {f_{k,i}}, where

$$f_{k,i} := (f_i)_k, \quad k = 1, \ldots, N, \; i = 1, \ldots, M. \qquad (4.32)$$

Let w_i be the probability that f takes the value f_i,

$$\mu^*(f_i) = w_i, \quad i = 1, \ldots, M. \qquad (4.33)$$

Then the problem of optimizing the mutual information I(r; θ) in (4.4, 4.5) becomes a maximization with respect to the state-probability pairs (f, w):

$$\max_{f, w} I(r; \theta) \qquad (4.34)$$

$$f_i \in S = [f_-, f_+]^N \; (i = 1, \ldots, M), \quad w \in \Delta^{M-1} \qquad (4.35)$$

where Δ^{M−1} is the (M − 1)-simplex embedded in ℝ^M, i.e. Δ^{M−1} = {w ∈ ℝ^M : w_i ≥ 0, Σ_{i=1}^M w_i = 1}.

Based on the dependence of I(r; θ) on f and w, it is natural to consider an algorithm that, under the assumption of a finite number of states, alternates between the following two steps:

Step 1. Fix the probabilities w = {w₁, ..., w_M} and find the state vectors f = {f₁, ..., f_M} such that

$$f = \arg\max_{f \in S} I(f, w) \qquad (4.36)$$

Step 2. Fix the state vectors f = {f₁, ..., f_M} and find the probabilities w = {w₁, ..., w_M} such that

$$w = \arg\max_{w \in \Delta^{M-1}} I(f, w) \qquad (4.37)$$

We refer to the state-probability pairs {(f_i, w_i)} as particles and to this model of alternating maximization of mutual information as the particle model, based on its discrete nature. A geometric interpretation of this approach is shown in Figure 4.3. The number of neurons is N = 3, and the tuning curves take M = 4 different states {f₁, f₂, f₃, f₄}, represented by blue-colored points in the cube [f₋, f₊]³. The labels on the particles specify their probabilities, i.e. {w₁, w₂, w₃, w₄}. During an alternating maximization process, the first step (4.36) updates the positions of the particles, whereas the second step (4.37) updates the probabilities of visiting each particle.


Figure 4.3: A model of 3 neurons and 4 states

In practice, we use the Stochastic Gradient Descent algorithm for the first step and the Blahut-Arimoto algorithm for the second step. We will elaborate on each procedure in the following sections.

4.3.1 Stochastic Gradient Descent

As a function of the state-probability pairs (f, w), the mutual information and its gradient can be expressed in terms of expectations:

$$I(f, w) = -\sum_{j=1}^{M} w_j\, E_{r \mid f_j}[\ln S_j(r)] \qquad (4.38)$$

$$\frac{\partial I}{\partial f_{k,i}} = w_i\, E_{r \mid f_i}\!\left[ \frac{f_{k,i} - r_k}{f_{k,i}} \ln S_i(r) \right] \qquad (4.39)$$

where the expectations are taken with respect to p(r | f_i) = ∏_{k=1}^N f_{k,i}^{r_k} e^{−f_{k,i}} / r_k!, and

$$S_i(r) := \frac{p(r)}{p(r \mid f_i)} = \sum_{l=1}^{M} w_l \prod_{k=1}^{N} \frac{f_{k,l}^{r_k} e^{-f_{k,l}}}{f_{k,i}^{r_k} e^{-f_{k,i}}} \qquad (4.40)$$

The above expectations cannot be computed with high accuracy, which brings difficulty to our optimization process. As in the last chapter, we resort to the Monte Carlo method to obtain a stochastic approximation of the gradient.
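For the finite particle model, (4.38)–(4.40) can be estimated by the same Monte Carlo approach. A minimal sketch, under our own conventions (f stored as the N × M matrix of (4.32); all names are illustrative): computing ln S_i(r) in log space avoids overflow, and the common ln r_k! terms cancel inside the ratio (4.40).

```python
import numpy as np

rng = np.random.default_rng(0)

def logsumexp(a, axis=-1):
    """Numerically stable ln(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def mi_particles(f, w, B=3000):
    """Monte Carlo estimate of I(f, w) = -sum_i w_i E_{r|f_i}[ln S_i(r)], eq. (4.38).

    f : (N, M) array of states f_{k,i};  w : (M,) state probabilities.
    """
    N, M = f.shape
    total = 0.0
    for i in range(M):
        r = rng.poisson(f[:, i], size=(B, N)).astype(float)    # r ~ p(r | f_i)
        log_lik = r @ np.log(f) - f.sum(axis=0)                # ln p(r | f_l) + ln r!, (B, M)
        log_S = logsumexp(log_lik + np.log(w), axis=1) - log_lik[:, i]   # ln S_i(r), (4.40)
        total -= w[i] * log_S.mean()
    return total
```

For one neuron with two well-separated states the estimate approaches the noiseless-channel ceiling ln 2, while identical states give zero information.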

Here we apply the classic algorithm of projected Stochastic Gradient Descent (SGD) [38] to solve the problem of maximizing over f (4.36). Note that we actually take gradient ascent steps, since we are maximizing the objective function. To reduce variance and accelerate convergence, we compute the gradient using more than one sample, namely B i.i.d. samples r^(b,i) ~ p(r | f_i) for each i = 1, ..., M. A one-step update of the states is

$$f_i \leftarrow \Pi_S\!\left( f_i + \eta\, w_i\, \frac{1}{B} \sum_{b=1}^{B} \frac{f_i - r^{(b,i)}}{f_i} \ln S_i(r^{(b,i)}) \right) \qquad (4.41)$$

where η is the learning rate, the division by f_i is taken elementwise, and Π_S stands for the projection onto S = [f₋, f₊]^N. If there are average constraints as described in Section 4.2.2, we need to project onto the intersection of the constraints and S.

In practice, the mini-batch size B is set to be small (100 or 1000) at the beginning to approximate the solution faster, then gradually increased (to 10⁴ or 10⁵) to achieve higher accuracy. The learning rate is larger at the beginning and decreased at the later stage.
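One sweep of the projected ascent update (4.41) can be sketched as follows (an illustrative simplification, not the thesis code; the bounds, batch size and learning rate are placeholder values):

```python
import numpy as np

rng = np.random.default_rng(1)

def logsumexp(a, axis=-1):
    """Numerically stable ln(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sgd_step(f, w, eta=0.05, B=1000, f_lo=0.5, f_hi=20.0):
    """One projected stochastic-gradient-ascent sweep over all states, eq. (4.41).

    f : (N, M) array of states f_{k,i};  w : (M,) state probabilities.
    ln S_i(r) from (4.40) is evaluated in log space (the ln r! terms cancel).
    """
    N, M = f.shape
    f_new = f.copy()
    for i in range(M):
        r = rng.poisson(f[:, i], size=(B, N)).astype(float)      # B samples of p(r|f_i)
        log_lik = r @ np.log(f) - f.sum(axis=0)                  # (B, M)
        log_S = logsumexp(log_lik + np.log(w), axis=1) - log_lik[:, i]
        grad = w[i] * np.mean((f[:, i] - r) / f[:, i] * log_S[:, None], axis=0)
        f_new[:, i] = np.clip(f[:, i] + eta * grad, f_lo, f_hi)  # projection onto S
    return f_new
```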

4.3.2 Blahut-Arimoto Algorithm

When the state vectors (f₁, ..., f_M) are fixed, the optimization problem (4.37) amounts to finding the capacity-achieving probabilities w = (w₁, ..., w_M) over the Poisson channel p(r | f). We use the Blahut–Arimoto algorithm to solve this problem.

The Blahut–Arimoto algorithm is an iterative algorithm that optimizes the channel capacity or the rate-distortion function of a source [8, 2]. The idea is briefly introduced below. Let I(w) be the mutual information I(r; θ) (4.38) when f is fixed, expressed as a function of w:

$$I(w) = \sum_{i=1}^{M} w_i \sum_{r \in \mathbb{N}^N} p(r \mid f_i) \ln \frac{p(r \mid f_i)}{p(r)} = \sum_{i=1}^{M} w_i \sum_{r \in \mathbb{N}^N} p(r \mid f_i) \ln \frac{p(r \mid f_i)}{\sum_{j=1}^{M} w_j\, p(r \mid f_j)}$$

The nonlocal term p(r) = Σ_{j=1}^M w_j p(r | f_j) brings difficulty in the maximization over {w_j}. We re-formulate I(w) using Bayes' theorem:

$$I(w) = \sum_{i=1}^{M} w_i \sum_{r \in \mathbb{N}^N} p(r \mid f_i) \ln \frac{p(f_i \mid r)}{w_i} \qquad (4.42)$$

where the transition probability from output to input is given by p(f_i | r) = p(f_i) p(r | f_i) / p(r) = w_i p(r | f_i) / p(r). The idea of the Blahut–Arimoto algorithm is to introduce a

"variational transition probability" q(f | r) and a functional J defined as

$$J(w, q) = \sum_{i=1}^{M} w_i \sum_{r \in \mathbb{N}^N} p(r \mid f_i) \ln \frac{q(f_i \mid r)}{w_i} \qquad (4.43)$$

It is shown that the capacity also equals the maximum of J,

$$C = \max_w I(w) = \max_w \max_q J(w, q) \qquad (4.44)$$

Hence, the problem of maximizing I(w) has been broadened to a larger problem of maximizing J(w, q), which allows for greater flexibility.

It can be verified that the functional J is concave in both w and q. Furthermore, for fixed w, J is maximized by

$$q(f_i \mid r) = \frac{p(f_i)\, p(r \mid f_i)}{p(r)} = \frac{w_i\, p(r \mid f_i)}{\sum_{j=1}^{M} w_j\, p(r \mid f_j)}$$

and for fixed q, J(w, q) is maximized by

$$w_i = \frac{\exp\big( \sum_r p(r \mid f_i) \ln q(f_i \mid r) \big)}{\sum_{j=1}^{M} \exp\big( \sum_r p(r \mid f_j) \ln q(f_j \mid r) \big)}$$

When the above equations are satisfied simultaneously, w satisfies

$$w_i = \frac{w_i \exp\Big( E_{r \mid f_i} \ln \frac{p(r \mid f_i)}{\sum_{j=1}^{M} w_j\, p(r \mid f_j)} \Big)}{\sum_{j=1}^{M} w_j \exp\Big( E_{r \mid f_j} \ln \frac{p(r \mid f_j)}{\sum_{l=1}^{M} w_l\, p(r \mid f_l)} \Big)} \qquad (4.45)$$

Notice that the expectation term E_{r|f_i} ln [p(r | f_i) / Σ_j w_j p(r | f_j)] is the Kullback–Leibler divergence between the channel p(r | f_i) and the output distribution p(r). Thus

(4.45) can also be written as

$$w_i = \frac{w_i \exp\big( D_{KL}(p(r \mid f_i) \,\|\, p(r)) \big)}{\sum_{j=1}^{M} w_j \exp\big( D_{KL}(p(r \mid f_j) \,\|\, p(r)) \big)} \qquad (4.46)$$

It is shown by Blahut [8] that I(w) achieves the maximum if and only if there exists a constant c such that

$$D_{KL}\big(p(r \mid f_i) \,\|\, p(r)\big) = c, \quad \text{if } w_i \neq 0 \qquad (4.47)$$

$$D_{KL}\big(p(r \mid f_i) \,\|\, p(r)\big) \le c, \quad \text{if } w_i = 0 \qquad (4.48)$$

and that c = I(w). This is consistent with the necessary and sufficient conditions for μ* obtained in the previous section (see Equations 4.27, 4.28 and

Figure 4.2). Therefore, the algorithm searches for the "weights" {w_j} such that the mixture distribution p(r) = Σ_{j=1}^M w_j p(r | f_j) is "equidistant" to each p(r | f_i) that carries a weight w_i > 0.

To solve for the optimal weights {w_j}, the Blahut–Arimoto algorithm uses equation (4.45) as a fixed-point iteration. The n-th iteration is given by

$$c_i^{(n)} = \exp\left( E_{r \mid f_i} \ln \frac{p(r \mid f_i)}{\sum_{j=1}^{M} w_j^{(n)} p(r \mid f_j)} \right) \qquad (4.49)$$

$$w_i^{(n+1)} = \frac{w_i^{(n)} c_i^{(n)}}{\sum_{j=1}^{M} w_j^{(n)} c_j^{(n)}} \qquad (4.50)$$

In practice, the expectations in (4.49) are estimated using the Monte Carlo method. With a batch size B, we use B i.i.d. Poisson samples r^(b,i) ~ p(r | f_i) for each i = 1, ..., M and approximate the coefficients by

$$\hat{c}_i^{(n)} = \exp\left( \frac{1}{B} \sum_{b=1}^{B} \ln \frac{p(r^{(b,i)} \mid f_i)}{\sum_{j=1}^{M} w_j^{(n)} p(r^{(b,i)} \mid f_j)} \right) \qquad (4.51)$$

$$w_i^{(n+1)} = \frac{w_i^{(n)} \hat{c}_i^{(n)}}{\sum_{j=1}^{M} w_j^{(n)} \hat{c}_j^{(n)}} \qquad (4.52)$$

where the ratios can be computed by

$$\frac{p(r \mid f_i)}{p(r \mid f_j)} = \prod_{k=1}^{N} \frac{f_{k,i}^{r_k} e^{-f_{k,i}}}{f_{k,j}^{r_k} e^{-f_{k,j}}}$$

When average constraints are present, the algorithm needs to iterate over a sequence of "slope" parameters that serve as Lagrange multipliers. The generalized algorithm for constrained maximization is introduced in [8].
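A single Monte Carlo Blahut–Arimoto update (4.51, 4.52) may be sketched as below (our own illustrative implementation; the coefficients are kept as ln ĉ_i so the exponentials in (4.51) never overflow):

```python
import numpy as np

rng = np.random.default_rng(2)

def logsumexp(a, axis=-1):
    """Numerically stable ln(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def ba_step(f, w, B=1000):
    """One Monte Carlo Blahut-Arimoto update of the weights, eqs. (4.51)-(4.52).

    f : (N, M) fixed states;  w : (M,) current weights.  Returns updated weights.
    """
    N, M = f.shape
    log_c = np.empty(M)
    for i in range(M):
        r = rng.poisson(f[:, i], size=(B, N)).astype(float)
        log_lik = r @ np.log(f) - f.sum(axis=0)                 # ln p(r|f_l) + ln r!, (B, M)
        # estimate of E_{r|f_i} ln[ p(r|f_i) / sum_j w_j p(r|f_j) ], eq. (4.51)
        log_c[i] = np.mean(log_lik[:, i] - logsumexp(log_lik + np.log(w), axis=1))
    log_w = np.log(w) + log_c
    return np.exp(log_w - logsumexp(log_w))                     # normalization, (4.52)
```

Iterating `ba_step` from any interior starting point drives the weights toward the capacity-achieving values of (4.47, 4.48); for two well-separated one-neuron states the weights settle near 1/2 each.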

4.3.3 Introducing regularity

In Section 4.2.1, we decomposed the transmission of the signal into two steps: the deterministic encoding from the stimulus θ to the firing rates λ = (f₁(θ), ..., f_N(θ)), and the random encoding from λ to the responses r. The alternating maximization algorithm searches for the optimal probability distribution of λ that maximizes the mutual information between λ and r, but it does not specify the mapping θ → λ in the first step. In this section, we aim to find a good way to complete this part.

From the Equal Coding Theorem 4.2.1, the mutual information I(r; θ) is fully determined by I(r; λ). Thus all configurations of tuning curves (f₁(θ), ..., f_N(θ)) that satisfy the condition

(f₁(θ), ..., f_N(θ)) = f_i with probability w_i

have the same mutual information, equal to I(f, w) given by the expression in (4.38). Assume that we have M distinct particles {f_i} (otherwise, the probabilities on the same state could be grouped together). Since θ is uniformly distributed on the circle C = [0, 1), the above condition implies that there exist M mutually disjoint subsets {U_i}, with Σ_{i=1}^M |U_i| = 1, such that

$$P(\theta \in U_i) = |U_i| = w_i \qquad (4.53)$$

and that f(θ) takes the value f_i on U_i, i.e.

$$f_k(\theta) = f_{k,i}, \quad \theta \in U_i \qquad (4.54)$$

where k = 1, ..., N, i = 1, ..., M.

There are infinitely many ways to choose the subsets {U_i}. For example, for a one-dimensional tuning curve that takes the values f₊ and f₋ with probability 1/2 each, f(θ) can be a very oscillatory function as shown in Figure 4.4a, or a more "continuous" version as in Figure 4.4b. These functions have the same mutual information; however, to make the tuning configuration simple and biologically relevant, we would prefer 4.4b over 4.4a.

Generally, we restrict each subset U_i to be an interval of length w_i in [0, 1). In this case, the tuning curves are piecewise constant functions with at most M jump discontinuities, and their shapes are completely determined by the order


Figure 4.4: Tuning curves for 1 neuron with the same mutual information

in which f(θ) takes values in {f_i}. Here, we consider the configuration with the lowest "elastic energy":

$$E(f) = \sum_{i=1}^{M} \| f_i - f_{\langle i+1 \rangle} \|^2 \qquad (4.55)$$

where ⟨i + 1⟩ := i + 1 mod M and ‖f_i − f_{⟨i+1⟩}‖ is the Euclidean distance in ℝ^N. Intuitively, E(f) is the energy needed to transition among all the states

{f_i}. For biological systems, it is reasonable to assume that the states are explored in a continuous way, so that neighboring states are visited first to achieve energy efficiency.
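The elastic energy (4.55) and, for small M, the cyclic order that minimizes it can be computed directly. The brute-force search below is purely illustrative (in the algorithm of Section 4.3 the order is instead favored implicitly by the β-regularizer introduced next); all names are our own:

```python
import numpy as np
from itertools import permutations

def elastic_energy(states):
    """Elastic energy (4.55) of states f_1, ..., f_M visited in the given cyclic order.

    states : (M, N) array, one state per row.
    """
    d = states - np.roll(states, -1, axis=0)      # f_i - f_{<i+1>}, cyclic
    return float(np.sum(d ** 2))

def best_cyclic_order(states):
    """Brute-force the cyclic order of minimal elastic energy (small M only).

    Fixing index 0 removes orderings equivalent up to rotation.
    """
    M = len(states)
    return min(((0, *p) for p in permutations(range(1, M))),
               key=lambda order: elastic_energy(states[list(order)]))
```

For the four corners of a unit square visited around the perimeter the energy is 4 (four unit steps), whereas a crossing order costs 6.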

Geometrically, an arrangement of {f_i} with minimal elastic energy has the shortest "squared path length" in state space. Figure 4.5 shows an example, where the number of neurons is N = 3 and the number of states is M = 8, with each w_i = 1/8 and f_i(θ) being either f₋ or f₊. The coordinates of {f_i} are drawn on the left, which take the 8 corners of a cube [f₋, f₊]³. The indices i = 1, ..., M are labeled at each point, and the dashed line represents the closed path f₁ → f₂ → ⋯ → f₈ → f₁. In Figure 4.5b, the points are ordered such that the path length is the smallest (since ‖f_i − f_{⟨i+1⟩}‖ = 1 for each i), so the energy E(f) is smaller than for random configurations like 4.5a. Furthermore, an interesting observation is that such an arrangement produces hierarchical tuning curves f₁(θ), f₂(θ), f₃(θ), which are shown on the right.

Following the above discussion, we adjust the alternating maximization algorithm such that when the mutual information is maximized, the elastic energy E(f) is minimized according to a given order of {f_i}. Inspired by the ideas of the Elastic Net Method [14, 13], we add a regularity term proportional to −E(f) to the objective function and solve the optimization problem below:

$$(f, w) = \arg\max_{f \in S, \, w \in \Delta^{M-1}} \; I(f, w) - \frac{\beta}{2} \sum_{i=1}^{M} \| f_i - f_{\langle i+1 \rangle} \|^2 \qquad (4.56)$$

The β term has the effect of pulling neighboring points toward each other, thus making the path shorter. The parameter β will be referred to as the regularization coefficient in later sections.

To optimize the new objective, we only need to add a Laplacian term to the Stochastic Gradient Descent step in equation (4.41):

$$f_i \leftarrow \Pi_S\!\left( f_i + \eta \left[ w_i\, \frac{1}{B} \sum_{b=1}^{B} \frac{f_i - r^{(b,i)}}{f_i} \ln S_i(r^{(b,i)}) + \beta \big( f_{\langle i+1 \rangle} + f_{\langle i-1 \rangle} - 2 f_i \big) \right] \right) \qquad (4.57)$$

where the r^(b,i) are i.i.d. samples from p(r | f_i), B is the batch size, η is the learning rate and Π_S is the projection onto S = [f₋, f₊]^N (see details in Section 4.3.1).


Figure 4.5: Tuning curves for 3 neurons with the same mutual information and different path lengths

Since no w is involved in the regularity term, the Blahut–Arimoto step remains the same as before.

Taking regularity into consideration, the combined Alternating Maximization Algorithm is summarized below:

Algorithm 1: Alternating Maximization Algorithm

input: an initial guess of state-probability pairs (f^(0), w^(0)); the maximum number of iterations max_n; an error threshold ε; a sequence of hyper-parameters B_1^(n), B_2^(n), B_0^(n), m_1^(n), m_2^(n), η^(n), β^(n) (n = 0, 1, ..., max_n);
initialize n = 0, ΔI = ∞;
while n ≤ max_n and |ΔI| > ε do
  1. Obtain f^(n+1) from (f^(n), w^(n)): run m_1^(n) steps of Stochastic Gradient Descent with batch size B_1^(n), learning rate η^(n) and regularity coefficient β^(n), using equation (4.57).
  2. Obtain w^(n+1) from (f^(n+1), w^(n)): run m_2^(n) steps of the Blahut–Arimoto algorithm with batch size B_2^(n), using equations (4.51, 4.52).
  3. Compute the change in mutual information using equation (4.38) with batch size B_0^(n): ΔI = I(f^(n+1), w^(n+1)) − I(f^(n), w^(n)).
  4. n ← n + 1.
end
output: optimized state-probability pairs (f^(n), w^(n)).

Figure 4.6 demonstrates a typical optimization process using Algorithm 1. In this example we pick a random initial guess of the particles with 3 neurons and 10 states, and alternate between 10 steps of SGD (batch size 1000) and 10 steps of Blahut–Arimoto (batch size 1000) for max_n = 10 cycles in total. As shown in the figure, the mutual information increases substantially during the first several iterations before it becomes stable.
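Algorithm 1 can be condensed into the following self-contained sketch (illustrative only: fixed hyper-parameters, no stopping criterion or schedules; β = 0 recovers the unregularized update (4.41), and `f` is modified in place, so callers should pass a copy):

```python
import numpy as np

rng = np.random.default_rng(3)

def logsumexp(a, axis=-1):
    """Numerically stable ln(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def alternate(f, w, cycles=10, m1=10, m2=10, B=1000,
              eta=0.05, beta=0.0, f_lo=0.5, f_hi=20.0):
    """Compact sketch of Algorithm 1.  f : (N, M) states, w : (M,) probabilities."""
    N, M = f.shape
    for _ in range(cycles):
        for _ in range(m1):                                   # SGD sweep, eq. (4.57)
            for i in range(M):
                r = rng.poisson(f[:, i], size=(B, N)).astype(float)
                ll = r @ np.log(f) - f.sum(axis=0)            # ln p(r|f_l) + ln r!
                log_S = logsumexp(ll + np.log(w), axis=1) - ll[:, i]
                grad = w[i] * np.mean((f[:, i] - r) / f[:, i] * log_S[:, None], axis=0)
                lap = f[:, (i + 1) % M] + f[:, (i - 1) % M] - 2.0 * f[:, i]
                f[:, i] = np.clip(f[:, i] + eta * (grad + beta * lap), f_lo, f_hi)
        for _ in range(m2):                                   # Blahut-Arimoto, (4.51)-(4.52)
            log_c = np.empty(M)
            for i in range(M):
                r = rng.poisson(f[:, i], size=(B, N)).astype(float)
                ll = r @ np.log(f) - f.sum(axis=0)
                log_c[i] = np.mean(ll[:, i] - logsumexp(ll + np.log(w), axis=1))
            log_w = np.log(w) + log_c
            w = np.exp(log_w - logsumexp(log_w, axis=0))
    return f, w
```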

103 1.75

1.50

1.25

1.00

0.75

Mutual Information 0.50

0.25 SGD step Blahut-Arimoto step 0.00 0 50 100 150 200 Number of steps Figure 4.6: An example of the alternating maximization scheme

4.4 Analysis and Approximation

In this section, we analyze the gradient of mutual information in a simplified 2-particle model. Based on the repelling effect of the gradient, we find a way to speed up the algorithm by approximating the mutual information with a lower bound.

4.4.1 The repelling effect of gradient

We begin by analyzing the gradient for a Poisson model of N = 1 neuron and M = 2 states, namely f₁ and f₂ with probabilities w₁ and w₂. The mutual information (Equation 4.38) can be expressed as

$$I(f, w) = \sum_{i=1}^{2} w_i\, E_{r \mid f_i}\!\left[ \ln \frac{p(r \mid f_i)}{p(r)} \right] = w_1\, E_{r \mid f_1}\!\left[ \ln \frac{p(r \mid f_1)}{p(r)} \right] + w_2\, E_{r \mid f_2}\!\left[ \ln \frac{p(r \mid f_2)}{p(r)} \right] \qquad (4.58)$$

where p(r) = w₁ p(r | f₁) + w₂ p(r | f₂). If we fix the values of w₁, w₂ and f₁ while varying the value of f₂, the gradient of I with respect to f₁ becomes a function of d = f₂ − f₁:

$$\frac{\partial I}{\partial f_1}(d) = w_1\, E_{r \mid f_1}\!\left[ \left( 1 - \frac{r}{f_1} \right) \ln \frac{p(r)}{p(r \mid f_1)} \right] = w_1\, E_{r \mid f_1}\!\left[ \left( 1 - \frac{r}{f_1} \right) \ln \frac{w_1\, p(r \mid f_1) + w_2\, p(r \mid f_1 + d)}{p(r \mid f_1)} \right] \qquad (4.59)$$

Figure 4.7 shows the mutual information I and its gradient ∂I/∂f₁ when we fix w₁ = w₂ = 1/2 and f₁ = 20. The mutual information approaches 0 when the two points are very close (d ≈ 0), since it is then difficult to distinguish the two conditional distributions p(r | f₁) and p(r | f₂), so r carries little information about f. The gradient is also equal to zero when the two points overlap, since I reaches its minimum there.
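The gradient (4.59) can be evaluated to high accuracy by truncating the sum over r, which reproduces the repelling behavior seen in Figure 4.7. A small illustrative script (the truncation bound `r_max` is a placeholder; the log-space mixture keeps the tail terms finite):

```python
import numpy as np
from math import lgamma

def grad_f1(d, f1=20.0, w1=0.5, r_max=300):
    """Truncated-sum evaluation of dI/df1 in (4.59) for the two-particle 1-d model."""
    w2, f2 = 1.0 - w1, f1 + d
    r = np.arange(r_max, dtype=float)
    log_fact = np.array([lgamma(k + 1.0) for k in range(r_max)])
    log_p1 = r * np.log(f1) - f1 - log_fact                    # ln p(r | f1)
    log_p2 = r * np.log(f2) - f2 - log_fact                    # ln p(r | f1 + d)
    log_mix = np.logaddexp(np.log(w1) + log_p1, np.log(w2) + log_p2)  # ln p(r)
    return w1 * float(np.sum(np.exp(log_p1) * (1.0 - r / f1) * (log_mix - log_p1)))
```

Consistent with the text, the returned gradient has sign opposite to d (so gradient ascent on f₁ pushes the two particles apart) and vanishes at d = 0.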

Left panel: I(f_1, f_2) (in log base 2) as a function of d = f_2 − f_1. Right panel: ∂I/∂f_1 as a function of d.

Figure 4.7: Mutual information and gradient for a two-particle model in 1-d

As f_2 moves away from f_1, the sign of ∂I/∂f_1 is opposite to the sign of d = f_2 − f_1. This suggests that f_1 is driven away from f_2 if we make a gradient ascent step for f_1 in the direction of ∂I/∂f_1. In the limiting case (assuming no boundedness constraint), when |f_1 − f_2| → ∞, it becomes easy to identify whether a response r was drawn from p(r|f_1) or p(r|f_2) (see Figure 4.8), thus achieving the maximum information. The mixture distribution p(r) is then a multi-modal function approximately equal to w_1 p(r|f_1) when r ∼ p(r|f_1), and to w_2 p(r|f_2) when r ∼ p(r|f_2), so the mutual information converges to

I(f, w) \to w_1 E_{r|f_1}\left[\ln \frac{p(r|f_1)}{w_1 p(r|f_1)}\right] + w_2 E_{r|f_2}\left[\ln \frac{p(r|f_2)}{w_2 p(r|f_2)}\right] = -(w_1 \ln w_1 + w_2 \ln w_2)  (4.60)

which equals the entropy of a binary-valued distribution with probabilities {w_1, w_2}, since the channel is nearly noiseless. When w_1 = w_2 = 1/2, this equals ln 2 (or 1 if measured in bits, as shown in the left panel of Figure 4.7).

As the distance increases, the "repelling force" on f_1 (i.e. |∂I/∂f_1|) first increases from 0 to a peak value, then falls back to zero. Note that the derivative curve in Figure 4.7 is not symmetric: it is flatter on the right side (d > 0) than on the left side (d < 0). This is due to the higher variance of the Poisson distribution p(r|f) for larger values of f (see Figure 4.8).

4.4.2 Approximation by lower bound

Following from the above discussion, we aim to find a quantity that approximates the mutual information while requiring less computational resources than I(f, w), and whose gradient has a similar repelling effect on the particles. We find that a lower bound of the mutual information satisfies both conditions and thus serves as a good candidate for the approximation.

Figure 4.8: Probability mass functions of the Poisson distribution for f = 10, 20, 40, 60

The derivation of the lower bound is briefly introduced as follows. For a Poisson model with N neurons and M states {(f_i, w_i)}, the mutual information (4.38) can be expressed as the difference of two terms,

I(f, w) = \sum_{i=1}^{M} w_i E_{r|f_i}\left[\ln \frac{p(r|f_i)}{p(r)}\right] = \sum_{i=1}^{M} w_i E_{r|f_i}[\ln p(r|f_i)] - \sum_{i=1}^{M} w_i E_{r|f_i}[\ln p(r)]

Applying the concavity of y = ln x, we move the logarithm outside of the expectation in the second term,

E_{r|f_i}[\ln p(r)] \le \ln\left(E_{r|f_i}[p(r)]\right)  (4.61)

Consequently, we obtain the following inequality:

I(f, w) \ge \sum_{i=1}^{M} w_i E_{r|f_i}[\ln p(r|f_i)] - \sum_{i=1}^{M} w_i \ln\left(E_{r|f_i}[p(r)]\right)
        = \sum_{i=1}^{M} w_i E_{r|f_i}[\ln p(r|f_i)] - \sum_{i=1}^{M} w_i \ln\left(\sum_{j=1}^{M} w_j E_{r|f_i}[p(r|f_j)]\right)
        = \sum_{i=1}^{M} w_i E_{r|f_i}[\ln p(r|f_i)] - \sum_{i=1}^{M} w_i \ln\left(\sum_{j=1}^{M} w_j t_{ij}\right)  (4.62)

where t_{ij} integrates the product of two conditional distributions,

t_{ij} := E_{r|f_i}[p(r|f_j)]  (4.63)

If all the values {f_i} are given, {t_{ij}} can be precomputed. Substituting in the probability mass functions of the Poisson distributions, we obtain

t_{ij} = \sum_{r_1=0}^{\infty} \cdots \sum_{r_N=0}^{\infty} \prod_{k=1}^{N} \frac{e^{-f_{k,i}} f_{k,i}^{r_k}}{r_k!} \cdot \frac{e^{-f_{k,j}} f_{k,j}^{r_k}}{r_k!} = \prod_{k=1}^{N} e^{-(f_{k,i}+f_{k,j})}\, I_0\!\left(2\sqrt{f_{k,i} f_{k,j}}\right)

where I_0(x) = \sum_{m=0}^{\infty} \frac{(x/2)^{2m}}{m!\,\Gamma(m+1)} is the modified Bessel function of the first kind. In practice, it is not computationally efficient to evaluate I_0(2\sqrt{f_{k,i} f_{k,j}}) and its derivatives. Inspired by previous works on the Kullback-Leibler divergence between Gaussian mixtures [15, 24], we replace the Poisson model with a Gaussian distribution in \mathbb{R}^N with the same mean and variance, so that {t_{ij}} have closed-form expressions. The conditional distribution we take is N(f_i, \Sigma_i):

p(r|f_i) = (2\pi)^{-N/2} \det(\Sigma_i)^{-1/2} \exp\left(-\frac{1}{2}(r - f_i)^T \Sigma_i^{-1} (r - f_i)\right)  (4.64)

with the diagonal covariance matrix \Sigma_i = \mathrm{diag}(f_{1,i}, f_{2,i}, \ldots, f_{N,i}). In this case, t_{ij} = \int p(r|f_i)\, p(r|f_j)\, dr can be calculated directly:

t_{ij} = \frac{1}{(2\pi)^{N/2} \det(\Sigma_i + \Sigma_j)^{1/2}} \exp\left(-\frac{1}{2}(f_i - f_j)^T (\Sigma_i + \Sigma_j)^{-1} (f_i - f_j)\right)  (4.65)
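As a sanity check (our own, with arbitrary rate values), the exact Poisson overlap t_ij can be computed from the Bessel closed form via np.i0, NumPy's modified Bessel function of the first kind of order zero, compared against a direct truncated summation of (4.63), and set against its Gaussian surrogate (4.65):

```python
import numpy as np
from math import lgamma

def t_poisson(fi, fj):
    """Exact t_ij for the Poisson model, via the closed form
    prod_k exp(-(a+b)) I_0(2 sqrt(ab))."""
    fi, fj = np.asarray(fi, float), np.asarray(fj, float)
    return np.prod(np.exp(-(fi + fj)) * np.i0(2.0 * np.sqrt(fi * fj)))

def t_gauss(fi, fj):
    """Gaussian surrogate of t_ij (Eq. 4.65) with Sigma_i = diag(f_i)."""
    fi, fj = np.asarray(fi, float), np.asarray(fj, float)
    s = fi + fj                                  # diagonal of Sigma_i + Sigma_j
    quad = ((fi - fj) ** 2 / s).sum()
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** len(fi) * np.prod(s))

def t_brute(fi, fj, r_max=80):
    """Direct truncated sum over r of prod_k p(r_k|f_{k,i}) p(r_k|f_{k,j})."""
    r = np.arange(r_max + 1)
    logfact = np.array([lgamma(x + 1) for x in r])
    out = 1.0
    for a, b in zip(fi, fj):
        pa = np.exp(r * np.log(a) - a - logfact)
        pb = np.exp(r * np.log(b) - b - logfact)
        out *= pa @ pb
    return out

fi, fj = [3.0, 8.0], [5.0, 2.0]
print(t_poisson(fi, fj), t_brute(fi, fj), t_gauss(fi, fj))
```

The Bessel closed form and the brute-force sum agree to machine precision, while the Gaussian surrogate is close but not identical, which is exactly the trade-off exploited by the lower bound.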

After replacing the Poisson model by the Gaussian one, we obtain a simplified expression for the lower bound (4.62):

L(f, w) := -\sum_{i=1}^{M} w_i \ln\left(\sum_{j=1}^{M} w_j q_{ij}\right) - \frac{1}{2}\sum_{i=1}^{M} w_i \ln \det(\Sigma_i) - \frac{N}{2}
         = -\sum_{i=1}^{M} w_i \ln\left(\sum_{j=1}^{M} w_j q_{ij}\right) - \frac{1}{2}\sum_{i=1}^{M} w_i \sum_{k=1}^{N} \ln f_{k,i} - \frac{N}{2}  (4.66)

where

q_{ij} := (2\pi)^{N/2} t_{ij} = \frac{1}{\det(\Sigma_i + \Sigma_j)^{1/2}} \exp\left(-\frac{1}{2}(f_i - f_j)^T (\Sigma_i + \Sigma_j)^{-1} (f_i - f_j)\right)  (4.67)

We can also evaluate the gradient of L(f, w):

\frac{\partial L}{\partial f_{k,i}} = \frac{w_i}{2} \sum_{l=1}^{M} w_l \frac{q_{il}}{\sum_{j=1}^{M} w_j q_{lj}} \left(\frac{1 + 2(f_{k,i} - f_{k,l})}{f_{k,i} + f_{k,l}} - \frac{(f_{k,i} - f_{k,l})^2}{(f_{k,i} + f_{k,l})^2}\right) + \frac{w_i}{2} \sum_{l=1}^{M} w_l \frac{q_{il}}{\sum_{j=1}^{M} w_j q_{ij}} \left(\frac{1 + 2(f_{k,i} - f_{k,l})}{f_{k,i} + f_{k,l}} - \frac{(f_{k,i} - f_{k,l})^2}{(f_{k,i} + f_{k,l})^2}\right) - \frac{w_i}{2 f_{k,i}}  (4.68)

Left panel: I(f_1, f_2) and L(f_1, f_2) (in log base 2) as functions of d = f_2 − f_1. Right panel: ∂I/∂f_1 and ∂L/∂f_1 as functions of d.

Figure 4.9: Mutual information and the lower bound for a two-particle model in 1-d

where the derivatives of Σi + Σj with respect to f are taken into account.

Here we revisit the two-particle model by plotting the lower bound (4.66) and its gradient (4.68) on top of those for I(f, w) in Figure 4.9. The bell shape of the mutual information curve and the skewness of the gradient curve are well preserved by the lower bound, so it has a similar repelling effect on the particles. In the extreme case, as d = f_2 − f_1 → ∞, the cross terms

q_{12} = q_{21} = \frac{1}{\sqrt{f_1 + f_2}} \exp\left(-\frac{(f_2 - f_1)^2}{2(f_1 + f_2)}\right) \to 0

and it can be verified that

\lim_{d \to \infty} L(f, w) = -(w_1 \ln w_1 + w_2 \ln w_2) - \frac{1 - \ln 2}{2}

which is (1 − ln 2)/2 smaller than the limit of the mutual information (4.60). In N dimensions, this gap is N(1 − ln 2)/2.
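The closed-form bound is cheap to evaluate; the following sketch (our own, with arbitrary parameter values) implements (4.66)-(4.67) directly and checks the limiting gap numerically for two far-apart states in one dimension:

```python
import numpy as np

def lower_bound(f, w):
    """Gaussian-surrogate lower bound L(f, w) of Eq. (4.66), in nats.
    f: (M, N) states, w: (M,) probabilities."""
    M, N = f.shape
    s = f[:, None, :] + f[None, :, :]                # diag of Sigma_i + Sigma_j
    quad = ((f[:, None, :] - f[None, :, :]) ** 2 / s).sum(-1)
    q = np.exp(-0.5 * quad) / np.sqrt(np.prod(s, axis=-1))   # Eq. (4.67)
    cross = -(w * np.log(q @ w)).sum()
    return cross - 0.5 * (w * np.log(f).sum(-1)).sum() - N / 2.0

# Two far-apart states in N = 1: L should approach
# -(w1 ln w1 + w2 ln w2) - (1 - ln 2)/2.
w = np.array([0.5, 0.5])
f = np.array([[1.0], [4000.0]])
gap = -(w * np.log(w)).sum() - (1 - np.log(2)) / 2 - lower_bound(f, w)
print(gap)   # close to 0
```

With the states this far apart the cross terms q_12 and q_21 underflow to zero, so the computed bound matches the stated limit to machine precision.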

If we use L(f, w) (4.66) as the objective function, optimization is much faster than maximizing the exact mutual information, since the objective and its gradient can be evaluated without sampling. To incorporate this idea into the alternating maximization algorithm, we simply replace the Stochastic Gradient Descent step in Algorithm 1 by gradient descent on L (also with a regularization term):

f_{k,i} \leftarrow \Pi_S\left[f_{k,i} + \eta\left(\frac{\partial L}{\partial f_{k,i}} + \beta\left(f_{k,\langle i+1\rangle} + f_{k,\langle i-1\rangle} - 2 f_{k,i}\right)\right)\right]  (4.69)

In practice, we use this approximation strategy to produce an initial condition for Algorithm 1, then switch to Stochastic Gradient Descent for finer optimization. This strategy greatly speeds up convergence, especially in high dimensions.

4.5 Numerical Results

In this section, we present the numerical results obtained by the alternating maximization Algorithm 1. As with the numerical experiments in Section 3.2.4, our tests are run with Python programs on Linux servers in the Department of Mathematics, UT Austin. Each Linux server is equipped with a 24-core Intel L5640 2.27 GHz CPU.

We discover that the optimal tuning curves and mutual information are mainly affected by two factors: the number of states M and the size of the space S = [f_-, f_+]^N.

4.5.1 The effect of the number of states

First, the number of states M determines the maximum mutual information attainable. When M is fixed, it can be shown by applying the Equal Coding Theorem (4.2.1) that

I(r; θ) = I(r; λ) = H(λ) - H(λ|r) \le H(λ) \le \ln M  (4.70)

where the last inequality follows from the discreteness of λ and the fact that the entropy H(λ) = -\sum_{i=1}^{M} w_i \ln w_i is maximized when p(λ) is uniform. Therefore, equality is achieved only when w_i = p(λ = f_i) = 1/M and H(λ|r) = 0; in other words, when all states have equal probability and the value of λ is completely determined by r.

An illustration of this upper bound is given in Figure 4.10. We fix the number of states at M = 100 and optimize the tuning curves for various numbers of neurons without regularization (i.e. β = 0). As the dimension grows, the mutual information converges to log_2(M) bits for both curves. The convergence is faster for f_+ = 10 than for f_+ = 4, since the former allows more degrees of freedom.

Figure 4.10: Mutual information (in bits) vs. the number of neurons with a fixed number of states M = 100, for f_+ = 4 and f_+ = 10

Figure 4.11 illustrates this phenomenon from another perspective. The tests are carried out in one dimension (i.e. with a single neuron), and the number of states varies from 1 to 10. When we have a small number of states (2, 3, 4), even though the maximal firing rate f_+ is as high as 100, the mutual information is still bounded by log_2(M). For larger values of M, we would expect a similar effect.

Another noticeable fact is that each curve becomes flat as the number of states increases. The reason is that the actual number of particles used is less than M. For example, when f_+ = 1 and M = 10, although all the probabilities {w_i} are initialized to 1/M, after a few iterations of the alternating maximization algorithm (especially the Blahut-Arimoto step) the probability of almost every state becomes nearly zero, except for 2 states (with f_i = f_- and f_i = f_+, respectively). Therefore, the mutual information stops increasing once M reaches 2. Similarly, for other values of f_+, the number of states with positive probability is bounded by a fixed number once M is large enough.

Figure 4.11: Mutual information vs. the number of states in one dimension (in log base 2), for f_+ = 1, 5, 10, 20, 30, 40, 50, 100

In the following discussions, we refer to the set {i : w_i > ε} as the effective particles. When the number of effective particles is less than M, the optimal tuning curves are mainly constrained by the size of the space, which we discuss further in the next section. This effect is easy to see for the different values of f_+ in Figure 4.11.

4.5.2 The effect of the size of space

The space S = [f_-, f_+]^N plays an important role. We focus on two main aspects: the boundedness constraint f_+ and the number of neurons N. (The effect of the baseline firing rate f_- is ignored, since it is usually assumed to be approximately 0; in our numerical experiments we take f_- = 0.1, which is small compared to f_+.)

First, f_+ acts through two factors: the degrees of freedom and the amount of noise. The former is straightforward to understand: higher values of f_+ enlarge the space S = [f_-, f_+]^N and introduce more freedom into the optimization problem, thus increasing the information, as shown in Figures 4.10 and 4.11. Since the variance of a Poisson model is equal to f, the maximal value f_+ also affects the amount of noise. To show the effect of Poissonian noise clearly, we compare it with a homogeneous Gaussian model on the 1-dimensional interval [0, 1] with different noise standard deviations σ. All the optimization steps for the Gaussian model are the same as described in Section 4.3, except for the conditional probability distribution p(r|f_i) ∼ N(f_i, σ^2) and the gradient:

\frac{\partial I}{\partial f_i} = -w_i E_{r|f_i}\left[\frac{r - f_i}{\sigma^2} \ln S_i(r)\right]

where S_i(r) = p(r)/p(r|f_i). Optimization is performed with regularization parameter β = 0, and the number of states for both the Gaussian and Poisson models is M = 20, which is shown to be enough in both cases (i.e. higher than the number of effective particles).

In Figure 4.12 we scatter-plot the optimized tuning values f_i ∈ [0, 1] (or [f_-, f_+]) of the effective particles (those with w_i > ε = 10^{-3}) vertically for each σ of the Gaussian model (or each f_+ of the Poisson model). For the Gaussian model, the number of effective particles clearly goes down as the noise level goes up, as shown in Figure 4.12a. When σ = 0.02, all 20 states are used and the information is highest; when σ > 0.25, only two states are effective (f = 0 or 1). The intermediary values of f_i are evenly spaced in [0, 1].

Panel (a): optimized tuning values in the 1-dim Gaussian model, plotted against σ. Panel (b): optimized tuning values in the 1-dim Poisson model, plotted against f_+.

Figure 4.12: Optimized tuning values in 1-dim Gaussian and Poisson model

For the Poisson model, the number of effective particles increases with f_+, as shown in Figure 4.12b. When f_+ < 5, the encoding is binary: f takes either f_+ or f_-, which agrees with the saturation phenomenon under the cyclic invariance constraint in the low-firing-rate regime (Chapter 3). As f_+ rises above 5, intermediate values arise, which differs from our numerical results for the cyclic invariant model. Since the noise is inhomogeneous, the particles are not evenly spaced in [f_-, f_+]: lower values of f_i are more dense whereas higher ones are more sparse, which creates a "branching" structure.

Let us look further at how the size of the space affects the positions of the particles in a 3-dimensional example. Starting with f_+ = 1 and M = 64 states (randomly initialized), we gradually enlarge the space by increasing the value of f_+.⁴ In Figure 4.13, selected optimization results are plotted in the state space [f_-, f_+]^3. Only the effective particles (w_i > ε = 10^{-2}) are shown, where the radius of each particle is proportional to w_i.

Panels: f_+ = 1.0, MI = 0.6063; f_+ = 2.0, MI = 1.5121; f_+ = 4.0, MI = 2.5149; f_+ = 6.0, MI = 2.9288; f_+ = 8.0, MI = 3.3479; f_+ = 16.0, MI = 4.3416.

Figure 4.13: Optimized tuning values in 3-dim state space with different f+ (colors indicate the location of points, and mutual information is calculated in bits).

It is interesting to see the evolution of the states as the cube gets larger. From f_+ = 1 to f_+ = 4, the 8 corners are gradually occupied; at f_+ = 1 and f_+ = 2 the corners are selected such that the particles are maximally far apart from each other. Then a small "sub-cube" appears at f_+ = 6, with a small side length consistent with the intermediary values shown in Figure 4.12b. As f_+ continues to increase, most of the new particles are added to the edges and faces, and more "sub-cubes" show up. It is also noticeable that the particles on the corners have larger probabilities than the other particles, although as a whole the probability distribution tends to become more uniform as f_+ increases. Furthermore, the low-valued regions (e.g. around the bottom corner) carry higher probability with more densely packed points, while the high-valued regions (e.g. around the top corner) carry less probability with sparser points, similar to our result in one dimension (Figure 4.12b).

⁴ At each step, we use the optimized values of {f_i} at f_+ − 1 as the initial condition, while the weights {w_i} are re-initialized to 1/M so that particles that have "disappeared" (i.e. those with w_i < ε) are given a chance to reappear. The regularization coefficient is β = 0.

In general, the number of effective states increases exponentially with the dimension N. When the number of particles M is large enough and there is no regularity constraint, the particles fully occupy all the (intermediary and boundary) states and reach the "capacity" of the space S = [f_-, f_+]^N. However, no heterogeneity in the tuning curves is observed in this situation.

4.5.3 Adding regularity

Continuing from the last section, we test in a 3-dimensional cube with fixed f_+ for various regularity parameters β, using M = 64 states. For initialization, we assign f_2 and f_3 randomly and sort the first coordinate f_1 so that it exhibits a peak in the middle, as shown in Figure 4.14. In this way, we specify an ordering of the particles {f_i} in Equation (4.56). Also, since some particles "disappear" during the optimization, it is reasonable to include only the effective particles (those with w_i > ε) in the elastic energy, i.e. we replace E(f) in Equation (4.55) with

E(f; ε) := \sum_{j} \| f_{i_j} - f_{i_{j+1}} \|^2  (4.71)

where (i_j) enumerates the effective particles {i : w_i > ε} in order.
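Restricting the elastic energy to the effective particles is a one-line filter; the sketch below is our own illustration (the particle values and ε are arbitrary):

```python
import numpy as np

def elastic_energy(f, w, eps=1e-2):
    """E(f; eps) of Eq. (4.71): squared path length through the
    effective particles (w_i > eps), visited in index order."""
    g = f[w > eps]                       # keep only the effective particles
    return np.sum((g[1:] - g[:-1]) ** 2)

f = np.array([[0.0], [5.0], [1.0], [2.0]])
w = np.array([0.4, 1e-4, 0.4, 0.2])
print(elastic_energy(f, w))   # particle 1 is skipped: (1-0)^2 + (2-1)^2 = 2.0
```

Dropping the near-zero-probability particle (here the one at 5.0) prevents "disappeared" states from contributing spurious path length to the regularizer.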


Figure 4.14: Initial condition (f1(θ), f2(θ), f3(θ)) for adding regularity

The results of the optimization are shown in Figure 4.15. With small values of β, the tuning curves are entangled in the state space, as shown in Figure 4.15a; with stronger regularity they become disentangled, since the path lengths are shortened. The functions in Figure 4.15b become smoother, while only a small amount of information is lost in going from β = 0 to β = 10^{-3}. Meanwhile, the optimal tuning curves follow a hierarchical pattern, especially for β = 5 × 10^{-4} and 10^{-3}: as θ moves along the circle, the values of f_2 and f_3 change 2 or 3 times as fast as f_1. This can be interpreted as a case of population splitting, in which we regard the "low-frequency" neuron as one population and the "high-frequency" neurons as another.

Panels (β, mutual information in bits): β = 0, MI = 2.9906; β = 1.0 × 10^{-4}, MI = 2.9858; β = 5.0 × 10^{-4}, MI = 2.9726; β = 1.0 × 10^{-3}, MI = 2.9326; β = 2.0 × 10^{-3}, MI = 2.8533; β = 4.0 × 10^{-3}, MI = 2.6053.

Figure 4.15: Optimized tuning curves for 3-dim Poisson model with dif- ferent regularization: (a) state space representation and (b) the functions f1(θ), f2(θ), f3(θ)

Finally, we numerically compute the optimal tuning curves in higher dimensions. When there is a large number of neurons, the optimization problem is mainly constrained by the number of states M, since the capacity of the space is large enough. In this case, the maximum information attainable is log_2(M) (measured in bits; see Section 4.5.1). We expect the probabilities {w_i} to be uniform and the values of {f_i} to be somewhat randomized, because the particles cannot occupy all the boundary and intermediary states in the high-dimensional cube S = [f_-, f_+]^N. This is where the regularity term plays a role, since it brings continuity to the visiting of the randomized states.

The effect of regularity in higher dimensions is shown in Figures 4.16 and 4.17. We fix the number of states at M = 100 and f_+ = 10. In the left column of the figures, with no regularity imposed, the optimal tuning curves take roughly 3 values when N = 4 and 6; in this case, the size of the space is the dominating factor. As N increases to 8, 10 and 12, the space S = [f_-, f_+]^N is large enough, and the mutual information nearly reaches log_2(M) ≈ 6.64. The tuning curves' values become more random as the dimension increases. After adding regularity, the optimal tuning curves carry less information but are much smoother, and display heterogeneity in their profiles. Neural populations with different frequencies can be observed.

Panels (β, mutual information in bits): N = 4: β = 0, MI = 5.2074 vs. β = 10^{-3}, MI = 4.9934; N = 6: β = 0, MI = 6.4270 vs. β = 10^{-3}, MI = 6.3678; N = 8: β = 0, MI = 6.6239 vs. β = 10^{-3}, MI = 6.3586.

Figure 4.16: Optimized tuning curves for the Poisson model with different regularization in N = 4, 6, 8 dimensions

Panels (β, mutual information in bits): N = 10: β = 0, MI = 6.6319 vs. β = 10^{-3}, MI = 6.4206; N = 12: β = 0, MI = 6.6359 vs. β = 10^{-3}, MI = 6.4370.

Figure 4.17: Optimized tuning curves for the Poisson model with different regularization in N = 10, 12 dimensions

4.6 Discussion

In this chapter, we introduce a particle model for tuning curves without any symmetry assumption. Its theoretical foundation is the discreteness of optimal tuning curves: we first show that the problem of maximizing the mutual information is equivalent to finding the capacity-achieving measure µ*, whose existence and uniqueness we prove. We then show that µ* is discrete by verifying that the support of µ* is sparse in [f_-, f_+]^N.

Based on the discreteness of µ*, we develop a particle model (f, w) = {(f_i, w_i)} that consists of state-probability pairs of particles in space. Optimization of the particle model is performed by alternating between a Stochastic Gradient Descent step and a Blahut-Arimoto step, both of which are implemented with Monte Carlo estimates.

In essence, the maximum of the mutual information is reached as soon as the particles follow the capacity-achieving distribution µ*, regardless of their ordering. To interpret the particle model in terms of biologically meaningful tuning curves, we add a regularity constraint that penalizes the total squared path length \sum_i \|f_i - f_{i+1}\|^2. This is equivalent to introducing a Laplacian term into the gradient descent step of the algorithm. Such a constraint is motivated by the fact that biological systems should explore the state space continuously. In view of Chapter 3, this idea essentially requires f not to oscillate too much, as in Figure 3.9; in other words, it introduces correlations between the values of f. The effect of regularization in canceling the high-frequency oscillations is similar to that of applying a convolution kernel in Chapter 3.

Numerical results show that the optimal tuning curves are shaped by the trade-off between the number of states and the size of the space. The former determines an upper bound for the mutual information, while the latter determines the number and positions of the "effective particles" (those with probability higher than ε) via its impact on the noise level. Similar to the conclusion from the cyclic invariant model, the optimal tuning curves are binary-valued when the space is relatively small (e.g. f_+ < 5 in one dimension); however, as the space gets larger, intermediary values arise. When the number of states is less than the capacity of the space, adding the regularity term produces continuous and hierarchical tuning curves. These curves reach the boundaries f_+ and f_- at different frequencies, which suggests heterogeneity in the neural population.

We note that the particle method can be generalized to multi-dimensional inputs. In this situation, the probability w_i on each particle can be interpreted as an area or a volume in the space of θ. The regularity term pulls neighboring points closer in multiple dimensions; for example, in 2-d the stochastic gradient ascent step involves a Laplacian term on the grid:

f_i \leftarrow \Pi_S\left[ f_i + \eta\left( \frac{w_i}{B} \sum_{b=1}^{B} \frac{r^{(b,i)} - f_i}{f_i} \left(-\ln S_i(r^{(b,i)})\right) + \beta\left( f_{\langle i+1\rangle, j} + f_{\langle i-1\rangle, j} + f_{i, \langle j+1\rangle} + f_{i, \langle j-1\rangle} - 4 f_i \right) \right) \right]

There are some drawbacks to the particle method. First, as we mention in Section 4.2, the discreteness of µ* does not imply the finiteness of its support in high dimensions; it only implies that the support is sparse. Some sets, such as concentric spheres, are sparse but not finite and could in principle serve as the support of µ*, although we have never observed this in numerical experiments. Second, it is challenging to tune the hyperparameters, including the mini-batch sizes of SGD and Blahut-Arimoto, the number of steps in each procedure, the learning rate of SGD, and the regularity coefficient. In particular, the learning rate of SGD (η) and the regularity coefficient (β) should be chosen depending on the number of particles M and the size of the space (f_+ and N). Currently, we tune the algorithm empirically, and we aim to find a general rule for efficiently selecting the hyperparameters in future work.
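The grid Laplacian appearing in the 2-d update above can be formed with periodic shifts. This is our own minimal sketch; it assumes the 2-d stimulus grid is periodic in both axes, matching the cyclic neighbor notation:

```python
import numpy as np

def grid_laplacian(fk):
    """Discrete 5-point Laplacian of a (G, G) array of one neuron's
    firing rates over a 2-d stimulus grid, with periodic boundaries."""
    return (np.roll(fk, 1, axis=0) + np.roll(fk, -1, axis=0)
            + np.roll(fk, 1, axis=1) + np.roll(fk, -1, axis=1) - 4 * fk)

fk = np.arange(16.0).reshape(4, 4)
print(grid_laplacian(fk).sum())   # 0.0: a periodic Laplacian sums to zero
```

The zero-sum property means the regularizer redistributes firing rate between neighboring stimulus points without changing the total, which is what makes it a pure smoothing term.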

Chapter 5

Summary

Throughout this dissertation, we focus on the optimal neural tuning curves that maximize the mutual information. This idea follows from the Efficient Coding Hypothesis, which states that biological systems should maximize the information flow from the external stimulus to the neural responses. Based on this hypothesis, we solve the constrained optimization problem

\max_{(f_1(θ), \ldots, f_N(θ)) \in S} I(r; θ)  (5.1)

where θ ∈ C is a one-dimensional stimulus on the circle, r = (r_1, \ldots, r_N) denotes the spike counts of N neurons that are conditionally independent and Poisson distributed, and S represents the space of tuning curves satisfying certain biological constraints of the neural system.

In particular, we aim to find a set of constraints under which we can identify meaningful heterogeneity in the optimal tuning curves. In practice, neural tuning curves display various types and shapes. The reason behind such heterogeneity is still a mystery, and we search for an explanation by investigating the roles of different constraints in shaping the profiles of the tuning curves. In this dissertation, we consider several constraints, including the boundedness of the firing rates, the average firing rates, symmetry, and regularity.

The difficulty of this problem lies in the non-convexity of the mutual information in high dimensions, which makes it intractable for theoretical analysis. Moreover, the mutual information is invariant under invertible mappings of the input stimulus, which enlarges the space of optimal solutions. Therefore, one classical approach to resolving these difficulties is to approximate the mutual information by the Fisher information.

In Chapter 2, we use the Fisher information bound (2.4) as a substitute for the mutual information in (5.1), and solve the optimization problem

\max_{(f_1(θ), \ldots, f_N(θ)) \in S} I_F(r; θ)  (5.2)

under certain boundedness and monotonicity constraints. We derive the analytical solutions to (5.2) by applying variational approaches, and discover that the Fisher information approximation yields biologically irrelevant solutions, especially in multiple dimensions. For multi-populated neurons, the tuning curves that are optimal in terms of the Fisher information are mutually exclusively supported on the circle. Such a configuration attains an infinite amount of Fisher information as the number of populations increases, while the exact mutual information remains finite. For the case of two populations, we view the Fisher information of the stationary solutions as the length of a trajectory in the state space, which can be made infinitely long by space-filling curves (Figure 2.5). The reason behind this ill-posedness is that the Fisher information is formulated in terms of derivatives and thus serves only as a local approximation. To avoid the drawback of locality, we resort to a global quantity to measure the information, namely the exact mutual information.

In Chapter 3, we directly solve the optimization problem (5.1). We impose a symmetry constraint that mirrors the situation in real-life neurons, namely the cyclic invariance of the tuning curves. It is assumed that f_1(θ), \ldots, f_N(θ) are circular shifts of each other, i.e.

f_k(θ) = f\left(θ - \frac{k-1}{N}\right), \quad k = 1, \ldots, N  (5.3)

In addition to cyclic invariance, we also take into account two reasonable constraints: the maximum/minimum firing rates and the average firing rate:

f_- \le f(θ) \le f_+  (5.4)

\int_0^1 f(θ)\, dθ = \bar{f}  (5.5)

The problem (5.1) subject to the constraints (5.3, 5.4, 5.5) can be solved by numerical methods. In the process of discretization, cyclic invariance enables us to simplify the mutual information into a single-term expectation, which can be estimated using Monte Carlo simulations. In numerical experiments with a low maximal firing rate f_+, the optimal tuning curves saturate to the upper and lower boundaries (i.e. f(θ) = f_+ or f_-, a.e.), and their profiles exhibit no structure. These properties are rigorously proved for the continuous version of the problem

\max_{f_- \le f(θ) \le f_+,\ \int f(θ)\, dθ = \bar{f}} I[f]  (5.6)

where I[f] (3.45) is a continuous limit of I(r; θ) obtained by taking the number of neurons to infinity while scaling f by 1/N. Under such scaling of f, the neurons spike in a low-firing-rate regime, where only a finite number of neurons fire according to a Poisson point process (3.46), each firing only once. Based on Theorem 3.3.3, solutions to (5.6) can be obtained by sampling an infinite sequence of Bernoulli white noise, with P(f = f_+) and P(f = f_-) solely determined by the average \bar{f}.
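Concretely, the Bernoulli probability is fixed by the mean constraint (5.5): p := P(f = f_+) must satisfy p f_+ + (1 − p) f_- = f̄. A minimal sampling sketch (our own; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_binary_tuning(f_minus, f_plus, f_bar, n):
    """Sample n i.i.d. values of a binary-valued tuning curve with
    P(f = f_plus) = p chosen so that the mean equals f_bar."""
    p = (f_bar - f_minus) / (f_plus - f_minus)
    return np.where(rng.random(n) < p, f_plus, f_minus)

f = sample_binary_tuning(0.1, 5.0, 1.0, 1_000_000)
print(f.mean())   # close to f_bar = 1.0
```

The sampled curve takes only the two boundary values, and its empirical mean matches the imposed average firing rate.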

Such solutions are biologically unrealistic. Furthermore, heterogeneity does not arise in the framework of the continuous limit. Because the optimal tuning curves are binary-valued and uncorrelated, with their profiles solely dependent on f_+, f_- and \bar{f}, heterogeneity would mean that the information is larger for at least two different sets of tuning curves with \bar{f} \ne \bar{g} (assuming they share the same bounds, f_+ = g_+ and f_- = g_-). However, no information is gained by splitting into two populations, because of the concavity of I[f] in \bar{f}. This failure to explain heterogeneity may result from the cyclic invariance constraint as well as from the low-firing-rate assumption. Therefore, in the following chapter, we drop the cyclic invariance and investigate other constraints under which heterogeneity appears in the optimal tuning curves.

In Chapter 4, besides relaxing the symmetry, we consider a particle model by assuming that the firing rates take only a finite number of values. We justify that problem (5.1) is equivalent to maximizing over the probability measure of the firing rates λ = (f_1(θ), \ldots, f_N(θ)):

130 measure of firing rates λ = (f1(θ), ..., fN (θ)):

max I(r; λ) (5.7) µλ and that there exists a capacity-achieving measure µ∗, which is shown to be discrete (in other words, the set of “points of increase” is sparse). The major theoretical results are given in the theorems 4.2.1, 4.2.6, and 4.2.10. Inspired by the discreteness of µ∗, we make a stronger assumption that the support of µ∗ is finite. Thus we arrive at the following simplified problem

\[ \max_{f, w} I(f, w) \tag{5.8} \]

where $f = (f_1, \dots, f_M)$ is a collection of states in the support of $\mu^*$, each living in the $N$-dimensional cube $[f_-, f_+]^N$, and $w = (w_1, \dots, w_M)$ denotes the frequencies with which the states are visited.

In addition to the finite number of states, we introduce another constraint: regularity of the tuning curves. This comes from the idea that biological systems should explore the state space continuously, so the following elastic energy should be minimized:

\[ \min_f E(f) = \min_f \sum_i \| f_i - f_{i+1} \|^2 \tag{5.9} \]

which drives the points $\{f_i\}$ closer together and shortens the path length. Combining the maximization of information with the minimization of elastic energy, we obtain

\[ \max_{f, w} \left( I(f, w) - \frac{\beta}{2} \sum_i \| f_i - f_{i+1} \|^2 \right) \tag{5.10} \]

We utilize Algorithm 1 to solve the optimization problem (5.10), which alternates between updating the states by stochastic gradient descent and updating the probabilities by the Blahut-Arimoto algorithm.
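A minimal sketch of the Blahut-Arimoto half of this alternating scheme is given below (Python; the gradient step on the states is omitted, spike counts are truncated at a level R and renormalized, and the state values and function names are illustrative assumptions rather than the dissertation's Algorithm 1):

```python
import math
from itertools import product

def cond_pmf(state, R=8):
    """p(r | f_m): independent Poisson spike counts, truncated to {0..R}^N
    and renormalized (the truncation approximates the full Poisson support)."""
    pmf = {r: math.prod(math.exp(-f) * f**k / math.factorial(k)
                        for k, f in zip(r, state))
           for r in product(range(R + 1), repeat=len(state))}
    z = sum(pmf.values())
    return {r: p / z for r, p in pmf.items()}

def mutual_info(states, w):
    """I(r; lambda) for a discrete input measure with states f_m, weights w_m."""
    cps = [cond_pmf(s) for s in states]
    marg = {r: sum(wm * cp[r] for wm, cp in zip(w, cps)) for r in cps[0]}
    return sum(wm * cp[r] * math.log(cp[r] / marg[r])
               for wm, cp in zip(w, cps) for r in cp)

def blahut_arimoto_step(states, w):
    """One Blahut-Arimoto update of the frequencies for fixed states:
    w_m <- w_m * exp(D_KL(p(r|f_m) || p(r))) / Z."""
    cps = [cond_pmf(s) for s in states]
    marg = {r: sum(wm * cp[r] for wm, cp in zip(w, cps)) for r in cps[0]}
    d = [sum(cp[r] * math.log(cp[r] / marg[r]) for r in cp) for cp in cps]
    new = [wm * math.exp(dm) for wm, dm in zip(w, d)]
    z = sum(new)
    return [x / z for x in new]

states = [(0.5, 3.0), (3.0, 0.5), (1.5, 1.5)]   # illustrative states in [f_-, f_+]^2
weights = [1 / 3] * 3
for _ in range(20):
    weights = blahut_arimoto_step(states, weights)
```

Iterating `blahut_arimoto_step` to convergence for fixed states, then taking a stochastic gradient step on the states with the $\beta$-penalized objective (5.10), reproduces the alternating structure described above; each weight update is guaranteed not to decrease the mutual information.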

Similar to Chapter 3, numerical results show that the optimal tuning curves are binary-valued when the maximal firing rate $f_+$ is low; however, as $f_+$ increases, intermediate values between $f_-$ and $f_+$ gradually appear (Figure 4.12). Furthermore, there is a trade-off between the number of states $M$ and the size of the space $S = [f_-, f_+]^N$. In particular, when the number of neurons is large, there are fewer states to encode than the capacity of the high-dimensional cube allows. In this case, imposing the regularity constraint (i.e., $\beta > 0$) yields heterogeneous tuning curves, in the sense that they display a hierarchy and different frequencies (Figures 4.15, 4.16, 4.17). These properties resemble neural tuning curves observed in practice.

Therefore, we conclude that the constraints responsible for heterogeneity are regularity and a finite number of states, together with the boundedness of the firing rates. We note that this conclusion about heterogeneity in tuning curves is only empirical and requires further investigation. Future directions would involve more theoretical analysis of heterogeneity, especially of population splitting. Another direction is to relax the strong assumption of conditional independence in the Poisson firing model; by relaxing it, we could extend our work to account for correlations between neurons.

Appendix

Appendix 1

Additional Proofs

1.1 Proof of Proposition 3.2.1

When $M = N$, the cyclic invariance property holds for the Kullback-Leibler divergence:

\[ D_{KL}\left( p(\mathbf{r} \mid \theta_m) \,\|\, p(\mathbf{r}) \right) = D_{KL}\left( p(\mathbf{r} \mid \theta_{m'}) \,\|\, p(\mathbf{r}) \right), \qquad \forall\, m, m' = 1, \dots, M. \tag{1.1} \]

Proof. It suffices to show that $D_{KL}(p(\mathbf{r} \mid \theta_m) \,\|\, p(\mathbf{r})) = D_{KL}(p(\mathbf{r} \mid \theta_{m+1}) \,\|\, p(\mathbf{r}))$ for all $m$. First, the ratio of $p(\mathbf{r})$ to $p(\mathbf{r} \mid \theta_m)$ can be reduced to

\[ \frac{p(\mathbf{r})}{p(\mathbf{r} \mid \theta_m)} = \frac{1}{M} \sum_{i=1}^{M} \frac{\prod_{j=1}^{M} g_{i-j+1}^{r_j} \prod_{j=1}^{M} e^{-g_{i-j+1}}}{\prod_{j=1}^{M} g_{m-j+1}^{r_j} \prod_{j=1}^{M} e^{-g_{m-j+1}}} = \frac{1}{M} \sum_{i=1}^{M} \prod_{j=1}^{M} \left( \frac{g_{i-j+1}}{g_{m-j+1}} \right)^{r_j} \]

where we omit the modulus notation for convenience. The Kullback-Leibler divergence is evaluated as

\begin{align*}
D_{KL}\left( p(\mathbf{r} \mid \theta_m) \,\|\, p(\mathbf{r}) \right)
&= -\mathbb{E}_{\mathbf{r} \mid \theta_m}\left[ \ln \frac{p(\mathbf{r})}{p(\mathbf{r} \mid \theta_m)} \right] \\
&= -\sum_{\mathbf{r}} \prod_{j=1}^{M} \frac{g_{m-j+1}^{r_j}}{r_j!} e^{-g_{m-j+1}} \ln\left( \frac{1}{M} \sum_{i=1}^{M} \prod_{j=1}^{M} \left( \frac{g_{i-j+1}}{g_{m-j+1}} \right)^{r_j} \right) \\
&= -\sum_{\mathbf{r}} C_1(\mathbf{r}) \prod_{j=1}^{M} g_{m-j+1}^{r_j} e^{-g_{m-j+1}} \ln\left( C_2(\mathbf{r}) \prod_{j=1}^{M} g_{m-j+1}^{-r_j} \right)
\end{align*}

where $C_1(\mathbf{r}) = 1 / \prod_{j=1}^{M} r_j!$ and $C_2(\mathbf{r}) = \frac{1}{M} \sum_{i=1}^{M} \prod_{j=1}^{M} g_{i-j+1}^{r_j}$ are independent of $m$.

Take a cyclic permutation of $\mathbf{r}$ such that $\tilde{r}_{j+1} = r_j$, i.e. $\tilde{\mathbf{r}} = (r_M, r_1, \dots, r_{M-1})$.

Since $C_1$ and $C_2$ are invariant under cyclic permutations of $\mathbf{r}$, we arrive at the conclusion using a change of variables:

\begin{align*}
D_{KL}\left( p(\mathbf{r} \mid \theta_m) \,\|\, p(\mathbf{r}) \right)
&= -\sum_{\mathbf{r}} C_1(\mathbf{r}) \prod_{j=1}^{M} g_{m-j+1}^{r_j} e^{-g_{m-j+1}} \ln\left( C_2(\mathbf{r}) \prod_{j=1}^{M} g_{m-j+1}^{-r_j} \right) \\
&= -\sum_{\tilde{\mathbf{r}}} C_1(\tilde{\mathbf{r}}) \prod_{j=1}^{M} g_{m-j+1}^{\tilde{r}_{j+1}} e^{-g_{m-j+1}} \ln\left( C_2(\tilde{\mathbf{r}}) \prod_{j=1}^{M} g_{m-j+1}^{-\tilde{r}_{j+1}} \right) \\
&= -\sum_{\tilde{\mathbf{r}}} C_1(\tilde{\mathbf{r}}) \prod_{l=1}^{M} g_{(m+1)-l+1}^{\tilde{r}_l} e^{-g_{(m+1)-l+1}} \ln\left( C_2(\tilde{\mathbf{r}}) \prod_{l=1}^{M} g_{(m+1)-l+1}^{-\tilde{r}_l} \right) \\
&= D_{KL}\left( p(\mathbf{r} \mid \theta_{m+1}) \,\|\, p(\mathbf{r}) \right)
\end{align*}
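The cyclic invariance can be sanity-checked numerically on a small population (an illustrative Python sketch, not code from the dissertation; the tuning values `g` and the truncation level `R` are arbitrary assumptions):

```python
import math
from itertools import product

def cond_pmf(rates, R):
    # p(r | theta_m) for independent Poisson neurons, truncated to {0..R}^M
    # and renormalized; the truncated grid is invariant under the cyclic
    # relabeling used in the proof, so the symmetry survives truncation.
    pmf = {r: math.prod(math.exp(-g) * g**k / math.factorial(k)
                        for k, g in zip(r, rates))
           for r in product(range(R + 1), repeat=len(rates))}
    z = sum(pmf.values())
    return {r: p / z for r, p in pmf.items()}

g = [0.3, 1.2, 2.0, 0.7]     # illustrative tuning values g_1, ..., g_M
M = len(g)
# Neuron j under stimulus theta_m fires at rate g_{m-j+1 (mod M)}.
conds = [cond_pmf([g[(m - j) % M] for j in range(M)], R=5) for m in range(M)]
marg = {r: sum(c[r] for c in conds) / M for r in conds[0]}
kls = [sum(c[r] * math.log(c[r] / marg[r]) for r in c) for c in conds]
# All M divergences D_KL(p(r | theta_m) || p(r)) agree up to float error.
```

Because the truncated grid is permutation-invariant and each condition's rate vector is a cyclic shift of the same `g`, the computed divergences agree to machine precision, exactly as the proof predicts.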

1.2 Proof of the Continuous Limit

Based on the assumptions (3.38, 3.39, 3.40), the limit of

\[ I_n[f_n] = -\sum_{\mathbf{r}} p(\mathbf{r}, \theta) \ln\left( \frac{1}{n} \sum_{i=0}^{n-1} \prod_{j=0}^{n-1} \frac{f_n\!\left(\frac{i+j}{n}\right)^{r_j}}{f_n\!\left(\frac{j}{n}\right)^{r_j}} \right) \tag{1.2} \]

is

\[ I[f] = \sum_{k=0}^{\infty} \frac{1}{k!} \int_0^1 \cdots \int_0^1 \prod_{j=1}^{k} ds_j\, f(s_j)\, e^{-\bar{f}} \ln\left( \frac{\prod_{j=1}^{k} f(s_j)}{\int_0^1 \prod_{j=1}^{k} f(s_j - \theta)\, d\theta} \right) \tag{1.3} \]

where

\[ p(\mathbf{r}, \theta) = \prod_{i=1}^{n} \frac{f_n\!\left(\frac{i}{n}\right)^{r_i} e^{-f_n\left(\frac{i}{n}\right)}}{r_i!}. \tag{1.4} \]

Proof. First, as $n \to \infty$,

\[ e^{-\sum_{i=1}^{n} f_n\left(\frac{i}{n}\right)} = e^{-\frac{1}{n} \sum_{i=1}^{n} n f_n\left(\frac{i}{n}\right)} \to e^{-\int_0^1 f(\theta)\, d\theta} = e^{-\bar{f}}. \]

Next, denote $F_n(\mathbf{r}) := \ln\left( \frac{1}{n} \sum_{i=0}^{n-1} \prod_{j=0}^{n-1} f_n\!\left(\frac{i+j}{n}\right)^{r_j} / f_n\!\left(\frac{j}{n}\right)^{r_j} \right)$; then

\[ F_n(\mathbf{r}) \le \ln\left( \frac{1}{n} \sum_{i=0}^{n-1} \prod_{j=0}^{n-1} \left( \frac{f_+}{f_-} \right)^{r_j} \right) = \left( \sum_{i=0}^{n-1} r_i \right) \ln\left( \frac{f_+}{f_-} \right) \]

We first show that if a vector $\mathbf{r} \in \mathbb{N}^n$ has at least one coordinate $r_j \ge 2$, then it makes no contribution to the sum, i.e.

\[ \lim_{n\to\infty} S_n[f_n] = 0 \tag{1.5} \]

where

\[ S_n[f_n] := \sum_{\mathbf{r} \in \mathbb{N}^n,\, \exists r_j \ge 2} \left( \prod_{i=1}^{n} \frac{f_n\!\left(\frac{i}{n}\right)^{r_i}}{r_i!} \right) \left[ \ln\left( \frac{1}{n} \sum_{i=0}^{n-1} \prod_{j=0}^{n-1} \frac{f_n\!\left(\frac{i+j}{n}\right)^{r_j}}{f_n\!\left(\frac{j}{n}\right)^{r_j}} \right) \right] \tag{1.6} \]

Let $\{i_j\}_{j=1}^{k}$ be the nonzero indices of $\mathbf{r} \in \mathbb{N}^n$, where $k = \|\mathbf{r}\|_0 = \sum_{i=1}^{n} \mathbb{1}_{\{r_i \neq 0\}}$. The summation (1.6) can be decomposed into $n$ terms, where each term sums over all $\mathbf{r} \in \mathbb{N}^n$ with $k = 1, 2, \dots, n$ nonzero indices (with spike counts denoted $r_{i_1}, \dots, r_{i_k}$):

\[ S_n[f_n] = \sum_{k=1}^{n} \sum_{\|\mathbf{r}\|_0 = k,\, \exists r_j \ge 2} \prod_{j=1}^{k} \frac{f_n\!\left(\frac{i_j}{n}\right)^{r_{i_j}}}{r_{i_j}!} \left[ \ln\left( \frac{1}{n} \sum_{l=0}^{n-1} \prod_{j=1}^{k} \frac{f_n\!\left(\frac{i_j}{n} + \frac{l}{n}\right)^{r_{i_j}}}{f_n\!\left(\frac{i_j}{n}\right)^{r_{i_j}}} \right) \right] \]

Using $\frac{f_-}{n} \le f_n \le \frac{f_+}{n}$, a further decomposition of $S_n[f_n]$ gives

\begin{align*}
S_n[f_n] &\le \sum_{k=1}^{n} \sum_{\|\mathbf{r}\|_0 = k,\, \exists r_j \ge 2} \prod_{j=1}^{k} \frac{\left(\frac{f_+}{n}\right)^{r_{i_j}}}{r_{i_j}!} \ln\left( \prod_{j=1}^{k} \left( \frac{f_+}{f_-} \right)^{r_{i_j}} \right) \\
&= \sum_{k=1}^{n} \sum_{\|\mathbf{r}\|_0 = k,\, \exists r_j \ge 2} \frac{\left(\frac{f_+}{n}\right)^{\sum_{j=1}^{k} r_{i_j}}}{\prod_{j=1}^{k} r_{i_j}!} \ln\left( \frac{f_+}{f_-} \right)^{\sum_{j=1}^{k} r_{i_j}} \\
&= \ln\left( \frac{f_+}{f_-} \right) \sum_{k=1}^{n} \binom{n}{k} \sum_{l=1}^{k} \binom{k}{l} \sum_{c_1=2}^{\infty} \cdots \sum_{c_l=2}^{\infty} \frac{\left(\frac{f_+}{n}\right)^{k-l+\sum_{j=1}^{l} c_j} \left( k - l + \sum_{j=1}^{l} c_j \right)}{\prod_{j=1}^{l} c_j!} \\
&= \ln\left( \frac{f_+}{f_-} \right) \sum_{k=1}^{n} \binom{n}{k} \sum_{l=1}^{k} \binom{k}{l} \left( \frac{f_+}{n} \right)^{k-l} \left[ (k-l) \sum_{c_1=2}^{\infty} \cdots \sum_{c_l=2}^{\infty} \frac{\left(\frac{f_+}{n}\right)^{\sum_j c_j}}{\prod_j c_j!} + \sum_{c_1=2}^{\infty} \cdots \sum_{c_l=2}^{\infty} \frac{\left( \sum_j c_j \right) \left(\frac{f_+}{n}\right)^{\sum_j c_j}}{\prod_j c_j!} \right]
\end{align*}

where $\{c_1, \dots, c_l\}$ specify the spike counts in $\mathbf{r}$ that are at least 2; the other $r_i$'s are either 1 or 0.

Take $n$ large enough that $0 < \frac{2 f_+}{n} < 1$, and denote $q = \frac{f_+}{n}$. Since $q \in (0, \tfrac{1}{2})$, we use the following inequalities to find an upper bound for the above expression:

\[ \sum_{m=2}^{\infty} \frac{q^m}{m!} = e^q - q - 1 \le q^2, \qquad \sum_{m=2}^{\infty} \frac{m q^m}{m!} = q \sum_{m=1}^{\infty} \frac{q^m}{m!} = q (e^q - 1) \le 2 q^2 \]

Using these inequalities, the first summation term can be bounded from above:

\[ \sum_{c_1=2}^{\infty} \cdots \sum_{c_l=2}^{\infty} \frac{q^{\sum_{j=1}^{l} c_j}}{\prod_{j=1}^{l} c_j!} = \prod_{j=1}^{l} \left( \sum_{c_j=2}^{\infty} \frac{q^{c_j}}{c_j!} \right) \le q^{2l} \]

Similarly, the second term satisfies

\begin{align*}
\sum_{c_1=2}^{\infty} \cdots \sum_{c_l=2}^{\infty} \frac{\left( \sum_{j=1}^{l} c_j \right) q^{\sum_{j=1}^{l} c_j}}{\prod_{j=1}^{l} c_j!}
&= \sum_{c_1=2}^{\infty} \cdots \sum_{c_{l-1}=2}^{\infty} \frac{q^{\sum_{j=1}^{l-1} c_j}}{\prod_{j=1}^{l-1} c_j!} \left( \sum_{j=1}^{l-1} c_j \sum_{c_l=2}^{\infty} \frac{q^{c_l}}{c_l!} + \sum_{c_l=2}^{\infty} \frac{c_l\, q^{c_l}}{c_l!} \right) \\
&\le \left( \sum_{c_1=2}^{\infty} \cdots \sum_{c_{l-1}=2}^{\infty} \frac{q^{\sum_{j=1}^{l-1} c_j}}{\prod_{j=1}^{l-1} c_j!} \sum_{j=1}^{l-1} c_j \right) q^2 + 2 q^2 \cdot q^{2(l-1)} \\
&\le \left( \left( \sum_{c_1=2}^{\infty} \cdots \sum_{c_{l-2}=2}^{\infty} \frac{q^{\sum_{j=1}^{l-2} c_j}}{\prod_{j=1}^{l-2} c_j!} \sum_{j=1}^{l-2} c_j \right) q^2 + 2 q^2 \cdot q^{2(l-2)} \right) q^2 + 2 q^{2l} \\
&\le \cdots \le q^{2l} + 2 l\, q^{2l}
\end{align*}

Substituting into $S_n[f_n]$, we obtain

\begin{align*}
S_n[f_n] &\le \ln\left( \frac{f_+}{f_-} \right) \sum_{k=1}^{n} \binom{n}{k} \sum_{l=1}^{k} \binom{k}{l} q^{k-l} (k - l + 1 + 2l)\, q^{2l} \\
&= \ln\left( \frac{f_+}{f_-} \right) \sum_{k=1}^{n} \binom{n}{k} \left[ q^k (k+1) \sum_{l=1}^{k} \binom{k}{l} q^l + q^k \sum_{l=1}^{k} \binom{k}{l} l\, q^l \right]
\end{align*}

Denote $\tilde{q} = q(1+q) = \frac{f_+}{n} + \frac{f_+^2}{n^2}$; then the first term in the sum can be evaluated as

\begin{align*}
\sum_{k=1}^{n} \binom{n}{k} q^k (k+1) \sum_{l=1}^{k} \binom{k}{l} q^l
&= \sum_{k=1}^{n} \binom{n}{k} q^k (k+1) \left( (1+q)^k - 1 \right) \\
&= \sum_{k=1}^{n} \binom{n}{k} k \tilde{q}^k + \sum_{k=1}^{n} \binom{n}{k} \tilde{q}^k - \sum_{k=1}^{n} \binom{n}{k} k q^k - \sum_{k=1}^{n} \binom{n}{k} q^k \\
&= \frac{d}{d\tilde{q}} \left( \tilde{q}(1+\tilde{q})^n - \tilde{q} \right) - \frac{d}{dq} \left( q(1+q)^n - q \right) \\
&= (1+\tilde{q})^n + n\tilde{q}(1+\tilde{q})^{n-1} - (1+q)^n - nq(1+q)^{n-1}
\end{align*}

As $n \to \infty$, the above expression goes to zero since

\[ \lim_{n\to\infty} (1+\tilde{q})^n = \lim_{n\to\infty} (1+q)^n = e^{f_+}, \qquad \lim_{n\to\infty} n\tilde{q}(1+\tilde{q})^{n-1} = \lim_{n\to\infty} nq(1+q)^{n-1} = f_+ e^{f_+} \]

Similarly, the second term $\sum_{k=1}^{n} \binom{n}{k} q^k \sum_{l=1}^{k} \binom{k}{l} l\, q^l$ also converges to 0. Therefore, $S_n[f_n] \to 0$ as $n \to \infty$.
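The elementary series bounds used in this argument (valid for $0 < q < \tfrac{1}{2}$) can be verified numerically; the quick sanity check below is illustrative and not part of the proof:

```python
import math

def check_series_bounds(q, terms=60):
    # Verifies, by partial summation, that for 0 < q < 1/2:
    #   sum_{m>=2} q^m / m!   = e^q - 1 - q  <= q^2       and
    #   sum_{m>=2} m q^m / m! = q (e^q - 1)  <= 2 q^2
    s1 = sum(q**m / math.factorial(m) for m in range(2, terms))
    s2 = sum(m * q**m / math.factorial(m) for m in range(2, terms))
    return (abs(s1 - (math.exp(q) - 1 - q)) < 1e-12 and s1 <= q**2
            and abs(s2 - q * (math.exp(q) - 1)) < 1e-12 and s2 <= 2 * q**2)

ok = check_series_bounds(0.25)
```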

Since we may ignore the spike counts greater than one, the limit of $I_n[f_n]$ in equation (1.2) can be expressed as

\[ \lim_{n\to\infty} I_n[f_n] = -\lim_{n\to\infty} \sum_{k=0}^{n} W_n[f_n; k] \tag{1.7} \]

where each $r_i$ can only be zero or one in the sum:

\[ W_n[f_n; k] := \sum_{\mathbf{r} \in \{0,1\}^n,\, \|\mathbf{r}\|_0 = k} \left( \prod_{i=1}^{n} \frac{f_n\!\left(\frac{i}{n}\right)^{r_i}}{r_i!} \right) e^{-\sum_{i=1}^{n} f_n\left(\frac{i}{n}\right)} \ln\left( \frac{1}{n} \sum_{i=0}^{n-1} \prod_{j=0}^{n-1} \frac{f_n\!\left(\frac{i+j}{n}\right)^{r_j}}{f_n\!\left(\frac{j}{n}\right)^{r_j}} \right) \]

Let $\{i_j\}_{j=1}^{k}$ denote the nonzero indices of $\mathbf{r} \in \{0, 1\}^n$, with $i_1 \le i_2 \le \cdots \le i_k$. Then $W_n[f_n; k]$ can be written as

\begin{align*}
W_n[f_n; k] &= \sum_{\substack{\mathbf{r} \in \{0,1\}^n \\ r_{i_j} = 1,\ j = 1, \dots, k}} \prod_{j=1}^{k} \frac{f_n\!\left(\frac{i_j}{n}\right)}{1!}\, e^{-\sum_i f_n\left(\frac{i}{n}\right)} \ln\left( \frac{1}{n} \sum_{l=0}^{n-1} \frac{\prod_{j=1}^{k} f_n\!\left(\frac{i_j}{n} + \frac{l}{n}\right)}{\prod_{j=1}^{k} f_n\!\left(\frac{i_j}{n}\right)} \right) \\
&= \frac{1}{n^k} \sum_{\substack{i_1 \le \cdots \le i_k \\ i_j \in \{0, \dots, n-1\}}} \prod_{j=1}^{k} \tilde{f}_n\!\left(\frac{i_j}{n}\right) e^{-\frac{1}{n} \sum_i \tilde{f}_n\left(\frac{i}{n}\right)} \ln\left( \frac{1}{n} \sum_{l=0}^{n-1} \frac{\prod_{j=1}^{k} \tilde{f}_n\!\left(\frac{i_j}{n} + \frac{l}{n}\right)}{\prod_{j=1}^{k} \tilde{f}_n\!\left(\frac{i_j}{n}\right)} \right)
\end{align*}

where $\tilde{f}_n = n f_n$. As $n \to \infty$, from condition (3.40), denote $\lim_{n\to\infty} n f_n\!\left(\frac{i_j}{n}\right) = f(s_j)$ with $s_j \in [0, 1)$. Then take the limit by writing the sum as an integral:

\begin{align*}
\lim_{n\to\infty} W_n[f_n; k] &= \int \cdots \int_{\{0 \le s_1 \le \cdots \le s_k \le 1\}} ds_1 \cdots ds_k \prod_{j=1}^{k} f(s_j)\, e^{-\bar{f}} \ln\left( \frac{\int_0^1 \prod_{j=1}^{k} f(s_j + \theta)\, d\theta}{\prod_{j=1}^{k} f(s_j)} \right) \\
&= \frac{1}{k!} \int_0^1 \cdots \int_0^1 ds_1 \cdots ds_k \prod_{j=1}^{k} f(s_j)\, e^{-\int_0^1 f(s)\, ds} \ln\left( \frac{\int_0^1 \prod_{j=1}^{k} f(s_j - \theta)\, d\theta}{\prod_{j=1}^{k} f(s_j)} \right)
\end{align*}

Note that the $k!$ comes from the number of permutations of $(s_1, \dots, s_k)$, and the change from $\theta$ to $-\theta$ is obtained by a change of variable and the periodicity of $f$. Combining the above conclusions, we arrive at

\begin{align*}
\lim_{n\to\infty} I_n[f_n] &= -\sum_{k=0}^{\infty} \frac{1}{k!} \int_0^1 \cdots \int_0^1 \prod_{j=1}^{k} ds_j\, f(s_j)\, e^{-\bar{f}} \ln\left( \frac{\int_0^1 \prod_{j=1}^{k} f(s_j - \theta)\, d\theta}{\prod_{j=1}^{k} f(s_j)} \right) \\
&= \sum_{k=0}^{\infty} \frac{1}{k!} \int_0^1 \cdots \int_0^1 \prod_{j=1}^{k} ds_j\, f(s_j)\, e^{-\bar{f}} \ln\left( \frac{\prod_{j=1}^{k} f(s_j)}{\int_0^1 \prod_{j=1}^{k} f(s_j - \theta)\, d\theta} \right) \tag{1.8}
\end{align*}

where $\bar{f} = \int_0^1 f(s)\, ds$.

1.3 Proof of Lemma 3.3.2

The maximum entropy distribution on [0, 1) is the uniform distribution over this circle.

Proof. Following [46], it suffices to show that $H[g] \le 0$ for an arbitrary distribution $g$. Let $U$ be the uniform distribution on $[0, 1)$. Evaluating the Kullback-Leibler divergence between $g$ and $U$ yields

\[ D_{KL}(g \,\|\, U) = \int_0^1 g(x) \log \frac{g(x)}{U(x)}\, dx = \int_0^1 g(x) \log \frac{g(x)}{1}\, dx = -H[g] \ge 0 \]

where the last step follows from the non-negativity of the KL divergence. Therefore $H[g] \le 0 = H[U]$ for any density function $g$ on $[0, 1)$; thus the uniform density maximizes the entropy.

1.4 Proof of Lemma 3.3.4

The maximum of $\bar{f}\, H\!\left[ \frac{f_+}{\bar{f}}, \frac{f_-}{\bar{f}} \right]$ is attained at

\[ \bar{f}^* = \exp\left( \frac{f_+ \ln f_+ - f_- \ln f_-}{f_+ - f_-} - 1 \right) \tag{1.9} \]

Furthermore,

\[ \max\left\{ f_-, \frac{f_+}{e} \right\} < \bar{f}^* < \frac{f_+ + f_-}{2}. \tag{1.10} \]

Proof. To simplify the notation, denote $a = f_+$ and $b = f_-$; by their biological meanings, $0 < b < a$. Define

\begin{align*}
F(v) &= v \left[ \frac{a}{v} \cdot \frac{v - b}{a - b} \ln\frac{a}{v} + \frac{b}{v} \cdot \frac{a - v}{a - b} \ln\frac{b}{v} \right] \\
&= \frac{1}{a - b} \left[ (a \ln a - b \ln b)\, v - (a - b)\, v \ln v - ab \ln\frac{a}{b} \right]
\end{align*}

A local maximum can be found by directly finding the stationary point,

\[ F'(v) = -(\ln v + 1) + \frac{a \ln a - b \ln b}{a - b} = 0 \]

which implies

\[ v^* = \exp\left( \frac{a \ln a - b \ln b}{a - b} - 1 \right) \]

By the convexity of $y = x \ln x$, the slope $\frac{a \ln a - b \ln b}{a - b}$ lies in $(\ln b + 1, \ln a + 1)$; therefore $b < v^* < a$. Since the derivative $F'(v)$ is strictly decreasing, $F$ reaches its global maximum at the interior point $v^*$.

Next, $v^* > \frac{a}{e}$. This can be verified directly: since $\frac{a \ln a - b \ln b}{a - b} > \ln a$, we have $v^* > \exp(\ln a - 1) = \frac{a}{e}$.

Finally, we show that $v^* < \frac{a + b}{2}$. Since $b < v^* < a$ and $F'(v^*) = 0$, the monotonicity of $F'(v)$ gives $F'(a) < 0$ and $F'(b) > 0$. Further, comparing their absolute values yields

\[ |F'(a)| - |F'(b)| = (a + b) \left( -\frac{\ln a - \ln b}{a - b} + \frac{2}{a + b} \right) < 0 \]

where the inequality follows from the concavity of $y = \ln x$, by comparing the slope at $\frac{a+b}{2}$ to the chord slope $\frac{\ln a - \ln b}{a - b}$. Therefore, taking into account that $F'(v)$ is convex, that $F'(a) < 0$ and $F'(b) > 0$, and that $|F'(a)| < |F'(b)|$, the maximum point $v^*$ is smaller than the midpoint $\frac{a + b}{2}$.
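Both the stationary-point formula (1.9) and the bracketing (1.10) are easy to sanity-check numerically (an illustrative Python sketch; the values of `a` and `b` are arbitrary):

```python
import math

def F(v, a, b):
    # The objective from the proof of Lemma 3.3.4, in its simplified form.
    return ((a * math.log(a) - b * math.log(b)) * v
            - (a - b) * v * math.log(v)
            - a * b * math.log(a / b)) / (a - b)

def v_star(a, b):
    # Stationary point (1.9): the claimed maximizer of F on (b, a).
    return math.exp((a * math.log(a) - b * math.log(b)) / (a - b) - 1)

a, b = 2.0, 0.3        # illustrative bounds f_plus, f_minus
v = v_star(a, b)       # lies strictly between max(b, a/e) and (a + b)/2
```

A grid search over $(b, a)$ confirms that `v_star(a, b)` is the global maximizer of `F` and that it satisfies the strict bounds of (1.10).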

1.5 Proof of Theorem 4.2.1

(Equal Coding) For the Markov chain $\theta \to \lambda \to \mathbf{r}$ with conditional probability distributions given by

\[ p(\lambda \mid \theta) = \mathbb{1}_{\{\lambda = f(\theta)\}} = \prod_{k=1}^{N} \mathbb{1}_{\{\lambda_k = f_k(\theta)\}}, \qquad p(\mathbf{r} \mid \lambda) = \prod_{k=1}^{N} \frac{e^{-\lambda_k} \lambda_k^{r_k}}{r_k!} \]

we have

\[ I(\mathbf{r}; \theta) = I(\mathbf{r}; \lambda) \tag{1.11} \]

Proof. We extend the proof in [39] to multiple dimensions. Since $\theta \to \lambda \to \mathbf{r}$ forms a Markov chain, $p(\theta, \lambda, \mathbf{r}) = p(\theta)\, p(\lambda \mid \theta)\, p(\mathbf{r} \mid \lambda)$, and

\[ p(\theta \mid \lambda, \mathbf{r}) = \frac{p(\theta, \lambda, \mathbf{r})}{p(\lambda, \mathbf{r})} = \frac{p(\theta)\, p(\lambda \mid \theta)\, p(\mathbf{r} \mid \lambda)}{p(\lambda)\, p(\mathbf{r} \mid \lambda)} = \frac{p(\theta)\, p(\lambda \mid \theta)}{p(\lambda)} = p(\theta \mid \lambda). \tag{1.12} \]

Applying the chain rule of mutual information,

\[ I(\mathbf{r}; (\theta, \lambda)) = I(\mathbf{r}; \lambda) + I(\mathbf{r}; \theta \mid \lambda) = I(\mathbf{r}; \theta) + I(\mathbf{r}; \lambda \mid \theta) \]

where $I(\mathbf{r}; \theta \mid \lambda)$ and $I(\mathbf{r}; \lambda \mid \theta)$ are conditional mutual informations ([46]). In order to show that $I(\mathbf{r}; \theta) = I(\mathbf{r}; \lambda)$, it suffices to prove that

\[ I(\mathbf{r}; \theta \mid \lambda) = I(\mathbf{r}; \lambda \mid \theta) \]

By definition, we write the left-hand side as

\begin{align*}
I(\mathbf{r}; \theta \mid \lambda) &= \int p(\lambda)\, d\lambda \sum_{\mathbf{r}} \int d\theta\, p(\mathbf{r}, \theta \mid \lambda) \ln \frac{p(\mathbf{r}, \theta \mid \lambda)}{p(\mathbf{r} \mid \lambda)\, p(\theta \mid \lambda)} \\
&= \sum_{\mathbf{r}} \int d\theta \int d\lambda\, p(\theta, \lambda, \mathbf{r}) \ln \frac{p(\theta \mid \lambda, \mathbf{r})}{p(\theta \mid \lambda)}
\end{align*}

144 Substituting Equation (1.12), the above quantity is zero. On the other hand,

\begin{align*}
I(\mathbf{r}; \lambda \mid \theta) &= \int p(\theta)\, d\theta \sum_{\mathbf{r}} \int d\lambda\, p(\mathbf{r}, \lambda \mid \theta) \ln \frac{p(\mathbf{r}, \lambda \mid \theta)}{p(\mathbf{r} \mid \theta)\, p(\lambda \mid \theta)} \\
&= \sum_{\mathbf{r}} \int d\lambda \int d\theta\, p(\theta, \lambda, \mathbf{r}) \ln \frac{p(\lambda \mid \theta, \mathbf{r})}{p(\lambda \mid \theta)}
\end{align*}

Note that $p(\lambda \mid \theta) = \mathbb{1}_{\{\lambda = f(\theta)\}} = \prod_{k=1}^{N} \mathbb{1}_{\{\lambda_k = f_k(\theta)\}}$, and from the Markov chain $\theta \to \lambda \to \mathbf{r}$ we obtain

\begin{align*}
p(\lambda \mid \theta, \mathbf{r}) &= \frac{p(\theta)\, p(\lambda \mid \theta)\, p(\mathbf{r} \mid \lambda)}{p(\theta, \mathbf{r})} = \frac{p(\lambda \mid \theta)\, p(\mathbf{r} \mid \lambda)}{p(\mathbf{r} \mid \theta)} \\
&= \frac{\mathbb{1}_{\{\lambda = f(\theta)\}} \prod_{k=1}^{N} \frac{e^{-\lambda_k} \lambda_k^{r_k}}{r_k!}}{\prod_{k=1}^{N} \frac{e^{-f_k(\theta)} f_k(\theta)^{r_k}}{r_k!}} = \mathbb{1}_{\{\lambda = f(\theta)\}} = p(\lambda \mid \theta)
\end{align*}

Therefore $I(\mathbf{r}; \lambda \mid \theta) = 0$, which implies that $I(\mathbf{r}; \theta) = I(\mathbf{r}; \lambda)$.
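The equal-coding identity (1.11) can be checked numerically on a small discrete example (an illustrative Python sketch with an arbitrary, non-injective tuning map; spike counts are truncated at `R = 8` and renormalized, which leaves the identity exact because both computations share the same truncated channel):

```python
import math
from itertools import product

def cond_pmf(lam, R=8):
    # p(r | lambda): independent truncated Poisson neurons (illustrative setup).
    pmf = {r: math.prod(math.exp(-f) * f**k / math.factorial(k)
                        for k, f in zip(r, lam))
           for r in product(range(R + 1), repeat=len(lam))}
    z = sum(pmf.values())
    return {r: p / z for r, p in pmf.items()}

def mutual_info(priors, channels):
    # I(r; x) for a finite input x with prior p(x) and channel rows p(r | x).
    marg = {r: sum(p * ch[r] for p, ch in zip(priors, channels))
            for r in channels[0]}
    return sum(p * ch[r] * math.log(ch[r] / marg[r])
               for p, ch in zip(priors, channels) for r in ch)

# A non-injective tuning map: theta in {0, 1, 2, 3} with f(0) = f(2).
f = {0: (0.5, 2.0), 1: (2.0, 0.5), 2: (0.5, 2.0), 3: (1.0, 1.0)}
i_theta = mutual_info([0.25] * 4, [cond_pmf(f[t]) for t in f])

# The induced measure on lambda has three atoms with weights 1/2, 1/4, 1/4.
states = [(0.5, 2.0), (2.0, 0.5), (1.0, 1.0)]
i_lambda = mutual_info([0.5, 0.25, 0.25], [cond_pmf(s) for s in states])
```

Because the response depends on $\theta$ only through $\lambda = f(\theta)$, the two mutual informations coincide up to floating-point error, even though the map is not injective.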

1.6 Proofs in Section 4.2.3

1.6.1 Proof of Lemma 4.2.2

(Convexity and Compactness) $M_S$ and $M_{S,G}$ are convex and compact subsets of $M(\mathbb{R}^N)$ in the Lévy-Prokhorov metric.

Proof. The proof is identical to that given in [11] (Proposition 2).

First, we show the convexity of $M_{S,G}$. Let $\mu_1, \mu_2 \in M_{S,G}$ and $\mu = a\mu_1 + (1-a)\mu_2$, where $a \in (0, 1)$. Since $\mu$ is also supported on $S$, and the average cost constraints $\mathbb{E}_\mu[g_l(\lambda)] = a\,\mathbb{E}_{\mu_1}[g_l(\lambda)] + (1-a)\,\mathbb{E}_{\mu_2}[g_l(\lambda)] \le 0$ are satisfied, $\mu \in M_{S,G}$. The convexity of $M_S$ is trivial.

For compactness, it suffices to show that $M_S$ and $M_{S,G}$ are closed and tight in $M(\mathbb{R}^N)$, by Theorem 3.1.9 of [45] (also known as Prokhorov's theorem).

$M_S$ and $M_{S,G}$ are tight: by definition, we need to show that for every $\epsilon > 0$ there exists a compact subset $K \subset \mathbb{R}^N$ such that $\mu(K) > 1 - \epsilon$ for every $\mu \in M_S$ (or $M_{S,G}$). Since $\mu(S) = 1$ and $S$ is bounded and closed, thus compact in $\mathbb{R}^N$, taking $K = S$ works.

$M_S$ and $M_{S,G}$ are closed: suppose a sequence of measures $\mu_n \in M_S$ converges to $\mu$ in the Lévy metric. Applying Theorem 3.1.5 of [45] to the closed set $S \subset \mathbb{R}^N$,

\[ 1 = \varlimsup_{n\to\infty} \mu_n(S) \le \mu(S) = 1 \]

Thus $\mu \in M_S$. To further show that $M_{S,G}$ is closed in $M(\mathbb{R}^N)$, first note that if $\mu_n \in M_{S,G}$ converges to $\mu$, then $\mu \in M_S$. Under the condition that the $g_l$ are bounded and continuous, $\mu(g_l) = \lim_{n\to\infty} \mu_n(g_l) \le 0$. Thus $\mu$ satisfies the average constraints, i.e. $\mu \in M_{S,G}$.

The compactness of $M_S$ and $M_{S,G}$ then follows from Prokhorov's theorem.

1.6.2 Proof of Lemma 4.2.3

(Strict Concavity) $I(\mu)$, as a functional from $M_S$ to $\mathbb{R}$, is strictly concave.

Proof. Concavity follows directly from the fact that mutual information is concave in the input distribution (Theorem 2.7.4, [46]). Write $I(\mu)$ as

\[ I(\mu) = H_{\mathbf{r}}(\mu) - H_{\mathbf{r}|\lambda}(\mu) \tag{1.13} \]

where

\[ H_{\mathbf{r}}(\mu) = -\sum_{\mathbf{r}} \int d\mu\, p(\mathbf{r} \mid \lambda) \ln p(\mathbf{r}; \mu), \qquad H_{\mathbf{r}|\lambda}(\mu) = -\sum_{\mathbf{r}} \int d\mu\, p(\mathbf{r} \mid \lambda) \ln p(\mathbf{r} \mid \lambda) \]

Notice that the conditional entropy $H_{\mathbf{r}|\lambda}(\mu)$ is a linear function of $\mu$. For the entropy,

\[ H_{\mathbf{r}}(\mu) = -\sum_{\mathbf{r}} p(\mathbf{r}; \mu) \ln p(\mathbf{r}; \mu) \]

is concave in $p(\mathbf{r}; \mu)$ (by the convexity of $y = x \ln x$). Since $p(\mathbf{r}; \mu) = \int d\mu\, p(\mathbf{r} \mid \lambda)$ is a linear function of $\mu$, $H_{\mathbf{r}}(\mu)$ is concave in $\mu$. Together with the linearity of $H_{\mathbf{r}|\lambda}(\mu)$, this shows that $I(\mu) = H_{\mathbf{r}}(\mu) - H_{\mathbf{r}|\lambda}(\mu)$ is concave.

For strict concavity, we need to show that if there exists $a \in (0, 1)$ such that

\[ I(a\mu_1 + (1-a)\mu_2) = a I(\mu_1) + (1-a) I(\mu_2) \tag{1.14} \]

then $d_{levy}(\mu_1, \mu_2) = 0$. Using the linearity of $H_{\mathbf{r}|\lambda}(\mu)$, the above equation reduces to

\[ H_{\mathbf{r}}(a\mu_1 + (1-a)\mu_2) = a H_{\mathbf{r}}(\mu_1) + (1-a) H_{\mathbf{r}}(\mu_2) \]

which, by the strict convexity of $y = x \ln x$, implies that

\[ p(\mathbf{r}; \mu_1) = p(\mathbf{r}; \mu_2), \qquad \forall \mathbf{r} \in \mathbb{N}^N \tag{1.15} \]

where $N$ denotes the dimensionality of $\mathbf{r}$ (i.e. the number of neurons). Substituting the Poisson distribution $p(\mathbf{r} \mid \lambda) = \prod_{k=1}^{N} \frac{e^{-\lambda_k} \lambda_k^{r_k}}{r_k!}$, we have

\[ \int \prod_{k=1}^{N} \lambda_k^{r_k} e^{-\lambda_k}\, d\mu_1(\lambda) = \int \prod_{k=1}^{N} \lambda_k^{r_k} e^{-\lambda_k}\, d\mu_2(\lambda), \qquad \forall r_1, \dots, r_N \in \mathbb{N}. \tag{1.16} \]

Using the change of variable $d\nu_i(\lambda) = \prod_k e^{-\lambda_k}\, d\mu_i(\lambda)$, we arrive at

\[ \int \prod_{k=1}^{N} \lambda_k^{r_k}\, d\nu_1(\lambda) = \int \prod_{k=1}^{N} \lambda_k^{r_k}\, d\nu_2(\lambda), \qquad \forall r_1, \dots, r_N \in \mathbb{N} \tag{1.17} \]

which implies that

\[ \int Q(\lambda)\, d\nu_1(\lambda) = \int Q(\lambda)\, d\nu_2(\lambda) \tag{1.18} \]

To prove the uniqueness of µ, it suffices to show that dlevy (ν1, ν2) = 0. This is a moment problem ([25, 30]), which studies the question of whether a distribution can be uniquely determined by its moments. Here, we use the property that S is bounded, closed and thus compact in RN . By the Stone- Weierstrass theorem [49], any continuous function on S can be uniformly approximated by polynomials. Therefore,

f(λ)dν1 = f(λ)dν2 Z Z for all continuous function f on S. Since the topology on MS is equivalent to weak* topology, we have dlevy (ν1, ν2) = 0 and thus dlevy (µ1, µ2) = 0.

1.6.3 Proof of Lemma 4.2.4

(Continuity) $H_{\mathbf{r}}(\mu)$, $H_{\mathbf{r}|\lambda}(\mu)$ and $I(\mu)$ are continuous on $M_S$.

Proof. Assume a sequence $\mu_n \Rightarrow \mu$ in the Lévy-Prokhorov metric (4.21), i.e. in the weak* sense. For any fixed $\mathbf{r}$, $p(\mathbf{r} \mid \lambda) = \prod_{k=1}^{N} \frac{\lambda_k^{r_k} e^{-\lambda_k}}{r_k!}$ is a continuous function of $\lambda$ on $S$. Therefore,

\[ \lim_{n\to\infty} p(\mathbf{r}; \mu_n) = \lim_{n\to\infty} \int p(\mathbf{r} \mid \lambda)\, d\mu_n = \int p(\mathbf{r} \mid \lambda)\, d\mu = p(\mathbf{r}; \mu) \tag{1.19} \]

In other words, $p(\mathbf{r}; \mu_n)$ converges pointwise to $p(\mathbf{r}; \mu)$.

For the continuity of $H_{\mathbf{r}}(\mu)$, we prove that the series $-\sum_{\mathbf{r}} p(\mathbf{r}; \mu_n) \ln p(\mathbf{r}; \mu_n)$ converges to $-\sum_{\mathbf{r}} p(\mathbf{r}; \mu) \ln p(\mathbf{r}; \mu)$ by applying the Dominated Convergence Theorem. It suffices to show that $H_{\mathbf{r}}(\nu) = -\sum_{\mathbf{r}} p(\mathbf{r}; \nu) \ln p(\mathbf{r}; \nu)$ is bounded for any measure $\nu \in M_S$. Denote $M = \max_{\lambda \in S} \max_k \lambda_k$ and $|\mathbf{r}| = r_1 + \cdots + r_N$; then

\[ p(\mathbf{r}; \nu) = \int_S p(\mathbf{r} \mid \lambda)\, d\nu = \int_S \prod_{k=1}^{N} \frac{\lambda_k^{r_k} e^{-\lambda_k}}{r_k!}\, d\nu \le \frac{M^{|\mathbf{r}|}}{\prod_k r_k!} \]

For large enough $\mathbf{r}$, $p(\mathbf{r}; \nu)$ is close to zero. Using the property that $-x \ln x$ is increasing for $x \in (0, \frac{1}{e})$, we have

\begin{align*}
-p(\mathbf{r}; \nu) \ln p(\mathbf{r}; \nu) &\le -\frac{M^{|\mathbf{r}|}}{\prod_k r_k!} \ln\left( \frac{M^{|\mathbf{r}|}}{\prod_k r_k!} \right) \\
&= -\frac{M^{|\mathbf{r}|}}{\prod_k r_k!} |\mathbf{r}| \ln M + \frac{M^{|\mathbf{r}|}}{\prod_k r_k!} \sum_k \ln(r_k!) \\
&\le -\frac{M^{|\mathbf{r}|}}{\prod_k r_k!} |\mathbf{r}| \ln M + \frac{M^{|\mathbf{r}|}}{\prod_k r_k!} \sum_k r_k (r_k - 1)
\end{align*}

where the last inequality follows from $\ln(r_k!) \le r_k \ln r_k \le r_k(r_k - 1)$. Let $a(\mathbf{r}; M)$ denote the right-hand side of the above equation. The infinite sum $\sum_{\mathbf{r}} a(\mathbf{r}; M)$ can be computed using the moments of the Poisson distribution:

\[ \sum_{\mathbf{r}} a(\mathbf{r}; M) = -\sum_{k=1}^{N} (e^M)^{N-1} (M e^M) \ln M + \sum_{k=1}^{N} (e^M)^{N-1} (M^2 e^M) = N e^{MN} (M^2 - M \ln M) < \infty \]

Therefore,

\[ -\sum_{\mathbf{r}} p(\mathbf{r}; \nu) \ln p(\mathbf{r}; \nu) \le \sum_{\mathbf{r}} a(\mathbf{r}; M) < \infty, \qquad \forall \nu \in M_S \]

and in particular

\[ -\sum_{\mathbf{r}} p(\mathbf{r}; \mu_n) \ln p(\mathbf{r}; \mu_n) \le \sum_{\mathbf{r}} a(\mathbf{r}; M) < \infty, \qquad \forall n = 1, 2, \dots \]

Since $p(\mathbf{r}; \mu_n) \ln p(\mathbf{r}; \mu_n)$ converges pointwise to $p(\mathbf{r}; \mu) \ln p(\mathbf{r}; \mu)$, by the Dominated Convergence Theorem the limit and the infinite summation commute:

\[ \lim_{n\to\infty} H_{\mathbf{r}}(\mu_n) = -\lim_{n\to\infty} \sum_{\mathbf{r}} p(\mathbf{r}; \mu_n) \ln p(\mathbf{r}; \mu_n) = -\sum_{\mathbf{r}} p(\mathbf{r}; \mu) \ln p(\mathbf{r}; \mu) = H_{\mathbf{r}}(\mu) \tag{1.20} \]

Similarly, the series defining $H_{\mathbf{r}|\lambda}(\mu_n)$ satisfies the Dominated Convergence Theorem and converges to $H_{\mathbf{r}|\lambda}(\mu)$. The continuity of the mutual information then follows from $I(\mu) = H_{\mathbf{r}}(\mu) - H_{\mathbf{r}|\lambda}(\mu)$.

Proof of Lemma 4.2.5 (Weak Differentiability)

$I(\mu)$ is weakly differentiable in $M_S$. Its weak derivative at $\mu_0$ in the direction of $\mu_1$ is

\[ DI(\mu_0; \mu_1) = \int i(\lambda; \mu_0)\, d\mu_1 - I(\mu_0) \tag{1.21} \]

where

\[ i(\lambda; \mu) := \sum_{\mathbf{r}} p(\mathbf{r} \mid \lambda) \ln \frac{p(\mathbf{r} \mid \lambda)}{p(\mathbf{r}; \mu)} = D_{KL}\left( p(\mathbf{r} \mid \lambda) \,\|\, p(\mathbf{r}; \mu) \right) \tag{1.22} \]

with $p(\mathbf{r}; \mu) = \int p(\mathbf{r} \mid \lambda)\, d\mu$. Furthermore, $i(\lambda; \mu)$ is a continuous function of $\lambda$ on $S$.

Proof. Take $\mu_0, \mu_1 \in M_S$ and $\tau \in [0, 1]$. Since $M_S$ is convex (see Lemma 4.2.2), $\mu_\tau = (1-\tau)\mu_0 + \tau\mu_1$ is in $M_S$. First, we compute the weak derivative of $H_{\mathbf{r}}(\mu)$. From the definition we have

\[ \frac{H_{\mathbf{r}}(\mu_\tau) - H_{\mathbf{r}}(\mu_0)}{\tau} = -\sum_{\mathbf{r}} \frac{p(\mathbf{r}; \mu_\tau) \ln p(\mathbf{r}; \mu_\tau) - p(\mathbf{r}; \mu_0) \ln p(\mathbf{r}; \mu_0)}{\tau} \]

Using the Taylor expansion of $x \ln x$,

\begin{align*}
p(\mathbf{r}; \mu_\tau) \ln p(\mathbf{r}; \mu_\tau) - p(\mathbf{r}; \mu_0) \ln p(\mathbf{r}; \mu_0)
&= \left( p(\mathbf{r}; \mu_\tau) - p(\mathbf{r}; \mu_0) \right) \left( \ln p(\mathbf{r}; \mu_0) + 1 \right) + O(\tau^2) \\
&= \tau \left( p(\mathbf{r}; \mu_1) - p(\mathbf{r}; \mu_0) \right) \left( \ln p(\mathbf{r}; \mu_0) + 1 \right) + O(\tau^2)
\end{align*}

Therefore, similarly to the proof of Lemma 4.2.4, applying the Dominated Convergence Theorem we obtain

\begin{align*}
\lim_{\tau \downarrow 0} \frac{H_{\mathbf{r}}(\mu_\tau) - H_{\mathbf{r}}(\mu_0)}{\tau}
&= -\lim_{\tau \downarrow 0} \sum_{\mathbf{r}} \frac{p(\mathbf{r}; \mu_\tau) \ln p(\mathbf{r}; \mu_\tau) - p(\mathbf{r}; \mu_0) \ln p(\mathbf{r}; \mu_0)}{\tau} \\
&= -\sum_{\mathbf{r}} \left( p(\mathbf{r}; \mu_1) - p(\mathbf{r}; \mu_0) \right) \left( \ln p(\mathbf{r}; \mu_0) + 1 \right) \\
&= -\sum_{\mathbf{r}} \left( p(\mathbf{r}; \mu_1) - p(\mathbf{r}; \mu_0) \right) \ln p(\mathbf{r}; \mu_0) \tag{1.23}
\end{align*}

where the last equality follows from $\sum_{\mathbf{r}} p(\mathbf{r}; \mu_1) = \sum_{\mathbf{r}} p(\mathbf{r}; \mu_0) = 1$.

For the conditional entropy $H_{\mathbf{r}|\lambda}(\mu)$, since it is linear in $\mu$, it is differentiable. Subtracting $H_{\mathbf{r}|\lambda}(\mu_0)$ from $H_{\mathbf{r}|\lambda}(\mu_\tau)$,

\[ H_{\mathbf{r}|\lambda}(\mu_\tau) - H_{\mathbf{r}|\lambda}(\mu_0) = -\tau \sum_{\mathbf{r}} \int (d\mu_1 - d\mu_0)\, p(\mathbf{r} \mid \lambda) \ln p(\mathbf{r} \mid \lambda) \tag{1.24} \]

Therefore the differentiability of $I(\mu)$ follows from $I(\mu) = H_{\mathbf{r}}(\mu) - H_{\mathbf{r}|\lambda}(\mu)$:

\begin{align*}
\lim_{\tau \downarrow 0} \frac{I(\mu_\tau) - I(\mu_0)}{\tau}
&= \lim_{\tau \downarrow 0} \frac{H_{\mathbf{r}}(\mu_\tau) - H_{\mathbf{r}}(\mu_0)}{\tau} - \lim_{\tau \downarrow 0} \frac{H_{\mathbf{r}|\lambda}(\mu_\tau) - H_{\mathbf{r}|\lambda}(\mu_0)}{\tau} \\
&= -\sum_{\mathbf{r}} \left( p(\mathbf{r}; \mu_1) - p(\mathbf{r}; \mu_0) \right) \ln p(\mathbf{r}; \mu_0) + \sum_{\mathbf{r}} \int (d\mu_1 - d\mu_0)\, p(\mathbf{r} \mid \lambda) \ln p(\mathbf{r} \mid \lambda) \\
&= \sum_{\mathbf{r}} \int p(\mathbf{r} \mid \lambda) \ln \frac{p(\mathbf{r} \mid \lambda)}{p(\mathbf{r}; \mu_0)}\, d\mu_1 - \sum_{\mathbf{r}} \int p(\mathbf{r} \mid \lambda) \ln \frac{p(\mathbf{r} \mid \lambda)}{p(\mathbf{r}; \mu_0)}\, d\mu_0 \\
&= \int i(\lambda; \mu_0)\, d\mu_1 - I(\mu_0) \tag{1.25}
\end{align*}

where $i(\lambda; \mu) := \sum_{\mathbf{r}} p(\mathbf{r} \mid \lambda) \ln \frac{p(\mathbf{r} \mid \lambda)}{p(\mathbf{r}; \mu)} = D_{KL}(p(\mathbf{r} \mid \lambda) \,\|\, p(\mathbf{r}; \mu))$.

Finally, we show the continuity of $i(\lambda; \mu)$ in $\lambda$. Suppose a sequence $\lambda_n \to \lambda$ in $S$. Then $p(\mathbf{r} \mid \lambda_n) \ln \frac{p(\mathbf{r} \mid \lambda_n)}{p(\mathbf{r}; \mu)}$ converges pointwise to $p(\mathbf{r} \mid \lambda) \ln \frac{p(\mathbf{r} \mid \lambda)}{p(\mathbf{r}; \mu)}$. Using the Dominated Convergence Theorem again, the limit and the summation commute:

\[ \lim_{n\to\infty} i(\lambda_n; \mu) = \lim_{n\to\infty} \sum_{\mathbf{r}} p(\mathbf{r} \mid \lambda_n) \ln \frac{p(\mathbf{r} \mid \lambda_n)}{p(\mathbf{r}; \mu)} = i(\lambda; \mu) \]

1.7 Proofs in Section 4.2.4

1.7.1 Proof of Lemma 4.2.8

$\mu^* \in M_{S,G}$ is the capacity-achieving measure if and only if there exists $\{\phi_l \ge 0 : l \in L\}$ such that:

1. $DJ_\phi(\mu^*; \nu) \le 0$ for all $\nu \in M_S$, where $J_\phi(\mu)$ is the Lagrangian

\[ J_\phi(\mu) := I(\mu) - \sum_{l \in L} \phi_l G_l(\mu) \tag{1.26} \]

2. $\phi_l G_l(\mu^*) = 0$, $\forall l \in L$.

Proof. Finding $\mu^* = \arg\max_{\mu \in M_{S,G}} I(\mu)$ is equivalent to the constrained maximization problem

\[ \text{Maximize } I(\mu),\ \mu \in M_S \quad \text{subject to} \quad G_l(\mu) = \mathbb{E}_\mu[g_l(\lambda)] \le 0,\ \forall l \in L \]

Using the method of Lagrange multipliers ([34], Theorem 1 in Section 8.3 and Theorem 1 in Section 8.4), $\mu^*$ is the maximizer if and only if there exists $\{\phi_l \ge 0 : l \in L\}$ such that:

1. $J_\phi(\mu^*) \ge J_\phi(\nu)$ for all $\nu \in M_S$, where $J_\phi(\mu) = I(\mu) - \sum_l \phi_l G_l(\mu)$;

2. $\phi_l G_l(\mu^*) = 0$, $\forall l \in L$.

Therefore it suffices to show that the first condition is equivalent to $DJ_\phi(\mu^*; \nu) \le 0$.

Notice that $J_\phi(\mu)$ is a concave function of $\mu$, since $I(\mu)$ is concave and the constraints $\{G_l(\mu) = \int g_l(\lambda)\, d\mu : l \in L\}$ are linear in $\mu$. By definition, the weak derivative of $J_\phi$ at $\mu$ in the direction of $\nu$ equals

\[ DJ_\phi(\mu; \nu) = \lim_{\tau \to 0^+} \frac{J_\phi((1-\tau)\mu + \tau\nu) - J_\phi(\mu)}{\tau} \]

On the one hand, if $J_\phi(\mu^*) \ge J_\phi(\nu)$ for every $\nu$, then $J_\phi((1-\tau)\mu^* + \tau\nu) \le J_\phi(\mu^*)$, which implies that $DJ_\phi(\mu^*; \nu) \le 0$. On the other hand, provided that $DJ_\phi(\mu; \nu) \le 0$ for every $\nu$, the concavity of $J_\phi$ gives

\[ J_\phi((1-\tau)\mu + \tau\nu) \ge (1-\tau) J_\phi(\mu) + \tau J_\phi(\nu) \]

and

\[ \frac{J_\phi((1-\tau)\mu + \tau\nu) - J_\phi(\mu)}{\tau} \ge J_\phi(\nu) - J_\phi(\mu) \]

Thus $J_\phi(\nu) - J_\phi(\mu) \le \lim_{\tau \to 0^+} \frac{J_\phi((1-\tau)\mu + \tau\nu) - J_\phi(\mu)}{\tau} = DJ_\phi(\mu; \nu) \le 0$.

1.7.2 Proof of Theorem 4.2.9

(Necessary and sufficient condition) $\mu^* \in M_{S,G}$ is the capacity-achieving measure if and only if there exists $\{\phi_l \ge 0 : l \in L\}$ such that

\[ i(\lambda; \mu^*) - I(\mu^*) - \sum_{l \in L} \phi_l g_l(\lambda) \le 0, \qquad \forall \lambda \in S \tag{1.27} \]

where $i(\lambda; \mu) := -\sum_{\mathbf{r}} p(\mathbf{r} \mid \lambda) \ln \frac{p(\mathbf{r}; \mu)}{p(\mathbf{r} \mid \lambda)}$ (4.24). Furthermore, the equality is reached for all $\lambda \in E_{\mu^*}$ (the set of points of increase of $\mu^*$).

Proof. We follow the same approach as in [43] and [11]. We only need to show that (1.27) is equivalent to the two conditions given in Lemma 4.2.8.

To find the specific form of $DJ_\phi(\mu^*; \nu)$, we apply the weak derivative of $I$ (Lemma 4.2.5) and the linearity of $G_l$, which give $DG_l(\mu^*; \nu) = G_l(\nu) - G_l(\mu^*)$ and

\begin{align*}
DJ_\phi(\mu^*; \nu) &= DI(\mu^*; \nu) - \sum_{l \in L} \phi_l\, DG_l(\mu^*; \nu) \\
&= \int i(\lambda; \mu^*)\, d\nu - I(\mu^*) - \sum_{l \in L} \phi_l \left( G_l(\nu) - G_l(\mu^*) \right) \\
&= \int \left( i(\lambda; \mu^*) - \sum_{l \in L} \phi_l g_l(\lambda) \right) d\nu - I(\mu^*) + \sum_{l \in L} \phi_l G_l(\mu^*)
\end{align*}

From the second condition of Lemma 4.2.8, $\sum_{l \in L} \phi_l G_l(\mu^*) = 0$, thus

\[ DJ_\phi(\mu^*; \nu) = \int \left( i(\lambda; \mu^*) - \sum_{l \in L} \phi_l g_l(\lambda) \right) d\nu - I(\mu^*) \le 0 \tag{1.28} \]

Proof of necessity. For any $\lambda \in S$, let $\nu$ be the Dirac delta measure centered at $\lambda$; then $\nu \in M_S$. Substituting this $\nu$ into the above equation yields

\[ \int \left( i(\lambda'; \mu^*) - \sum_{l \in L} \phi_l g_l(\lambda') \right) d\nu(\lambda') - I(\mu^*) = i(\lambda; \mu^*) - \sum_{l \in L} \phi_l g_l(\lambda) - I(\mu^*) \le 0 \]

Proof of sufficiency. Suppose that for $\mu \in M_{S,G}$ there exists $\{\phi_l \ge 0 : l \in L\}$ such that

\[ i(\lambda; \mu) - \sum_{l \in L} \phi_l g_l(\lambda) - I(\mu) \le 0 \tag{1.29} \]

and we aim to show that $\mu$ maximizes $I(\mu)$.

First, for all $\nu \in M_S$, integrating the above equation with respect to $\nu$ gives us

\[ \int \left( i(\lambda; \mu) - \sum_l \phi_l g_l(\lambda) - I(\mu) \right) d\nu = \int \left( i(\lambda; \mu) - \sum_l \phi_l g_l(\lambda) \right) d\nu - I(\mu) \le 0 \]

where we have used that $I(\mu)$ is a constant and $\int_S d\nu = 1$. Next we show that $\phi_l G_l(\mu) = 0$ for every $l \in L$. Integrating equation (1.29) on both sides with respect to $\mu$ and applying the definition of $i(\lambda; \mu)$,

\begin{align*}
0 &\ge \int \left( i(\lambda; \mu) - \sum_l \phi_l g_l(\lambda) - I(\mu) \right) d\mu \\
&= \int \left( -\sum_{\mathbf{r}} p(\mathbf{r} \mid \lambda) \ln \frac{p(\mathbf{r}; \mu)}{p(\mathbf{r} \mid \lambda)} - \sum_l \phi_l g_l(\lambda) - I(\mu) \right) d\mu \\
&= -\sum_l \phi_l G_l(\mu)
\end{align*}

where the last equality follows from the definition of $I(\mu)$. Therefore we have $\sum_l \phi_l G_l(\mu) \ge 0$. However, since $\mu$ satisfies the constraints, $G_l(\mu) \le 0$; hence from $\phi_l \ge 0$ we obtain

\[ \phi_l G_l(\mu) = 0, \qquad \forall l \in L. \]

Thus both conditions of Lemma 4.2.8 are satisfied, and $\mu$ is a capacity-achieving measure.

Finally, the equality is achieved at $\lambda \in E_{\mu^*}$. Let $\lambda_0 \in E_{\mu^*}$ be a point of increase of $\mu^*$. We show equality by contradiction: suppose instead that there exists $\epsilon > 0$ such that

\[ f_\phi(\lambda_0; \mu^*) := i(\lambda_0; \mu^*) - \sum_{l \in L} \phi_l g_l(\lambda_0) - I(\mu^*) < -\epsilon. \]

From the continuity of $i(\lambda; \mu)$ (Lemma 4.2.5) and of the $g_l$, the function $f_\phi(\lambda; \mu^*)$ is continuous at $\lambda_0$. Hence there exists an open neighborhood $U \subset S$ of $\lambda_0$ such that

\[ f_\phi(\lambda; \mu^*) < -\frac{\epsilon}{2}, \qquad \forall \lambda \in U. \]

Because $\lambda_0$ is a point of increase, $\mu^*(U) > 0$. Integrating with respect to $\mu^*$ over the whole set $S$, we have

\[ \int_S f_\phi(\lambda; \mu^*)\, d\mu^* = -\sum_{l \in L} \phi_l G_l(\mu^*) = 0. \]

However, since $f_\phi(\lambda; \mu^*) \le 0$ on $S$ and, in particular, $f_\phi < -\frac{\epsilon}{2}$ on $U$,

\[ \int_S f_\phi(\lambda; \mu^*)\, d\mu^* \le \int_U f_\phi(\lambda; \mu^*)\, d\mu^* \le -\frac{\epsilon}{2}\, \mu^*(U) < 0 \]

which results in a contradiction. Hence for every $\lambda \in E_{\mu^*}$ the equality is satisfied.

Bibliography

[1] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.

[2] Suguru Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972.

[3] Joseph J Atick and A Norman Redlich. What does the retina know about natural scenes? Neural Computation, 4(2):196–210, 1992.

[4] David Barber and Felix V Agakov. The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.

[5] Horace B Barlow. Possible principles underlying the transformations of sensory messages. 1961.

[6] M Bethge, D Rotermund, and K Pawelzik. Optimal neural rate coding leads to bimodal firing rate distributions. Network: Computation in Neural Systems, 14(2):303–319, 2003.

[7] Matthias Bethge, David Rotermund, and Klaus Pawelzik. Optimal short-term population coding: when Fisher information fails. Neural Computation, 14(10):2317–2351, 2002.

[8] Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.

[9] Braden AW Brinkman, Alison I Weber, Fred Rieke, and Eric Shea-Brown. How do efficient coding strategies depend on origins of noise in neural circuits? PLoS computational biology, 12(10):e1005150, 2016.

[10] Nicolas Brunel and Jean-Pierre Nadal. Mutual information, fisher infor- mation, and population coding. Neural computation, 10(7):1731–1757, 1998.

[11] Terence H Chan, Steve Hranilovic, and Frank R Kschischang. Capacity-achieving probability measure for conditionally Gaussian channels with bounded inputs. IEEE Transactions on Information Theory, 51(6):2073–2088, 2005.

[12] Vyacheslav V Chistyakov and Yuliya V Tretyachenko. Maps of several variables of finite total variation and Helly-type selection principles. arXiv preprint arXiv:1001.0451, 2010.

[13] Richard Durbin, Richard Szeliski, and Alan Yuille. An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1(3):348–358, 1989.

[14] Richard Durbin and David Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326(6114):689–691, 1987.

[15] J-L Durrieu, J-Ph Thiran, and Finnian Kelly. Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4833–4836. IEEE, 2012.

[16] Deep Ganguli and Eero P Simoncelli. Implicit encoding of prior probabilities in optimal neural populations. In Advances in Neural Information Processing Systems, pages 658–666, 2010.

[17] Deep Ganguli and Eero P Simoncelli. Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Computation, 26(10):2103–2134, 2014.

[18] Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286, 2015.

[19] Luc J Gentet, Michael Avermann, Ferenc Matyas, Jochen F Staiger, and Carl CH Petersen. Membrane potential dynamics of GABAergic neurons in the barrel cortex of behaving mice. Neuron, 65(3):422–435, 2010.

[20] Julijana Gjorgjieva, Markus Meister, and Haim Sompolinsky. Functional diversity among sensory neurons from efficient coding principles. PLoS computational biology, 15(11), 2019.

[21] Julijana Gjorgjieva, Haim Sompolinsky, and Markus Meister. Benefits of pathway splitting in sensory coding. Journal of Neuroscience, 34(36):12127–12144, 2014.

[22] Arthur Gretton, Ralf Herbrich, and Alexander J Smola. The kernel mutual information. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), volume 4, pages IV–880. IEEE, 2003.

[23] Nicol S Harper and David McAlpine. Optimal neural population coding of an auditory spatial cue. Nature, 430(7000):682–686, 2004.

[24] John R Hershey and Peder A Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), volume 4, pages IV–317. IEEE, 2007.

[25] TH Hildebrandt and IJ Schoenberg. On linear functional operations and the moment problem for a finite interval in one or several dimensions. Annals of Mathematics, pages 317–328, 1933.

[26] Shiro Ikeda and Jonathan H Manton. Capacity of a single spiking neuron channel. Neural Computation, 21(6):1714–1748, 2009.

[27] Tobias Jung, Daniel Polani, and Peter Stone. Empowerment for continuous agent-environment systems. Adaptive Behavior, 19(1):16–39, 2011.

[28] Yan Karklin and Eero P Simoncelli. Efficient coding of natural images with a population of noisy linear-nonlinear neurons. In Advances in Neural Information Processing Systems, pages 999–1007, 2011.

[29] David B Kastner, Stephen A Baccus, and Tatyana O Sharpee. Critical and maximally informative encoding between neural populations in the retina. Proceedings of the National Academy of Sciences, 112(8):2533–2538, 2015.

[30] Oliver Knill. On Hausdorff's moment problem in higher dimensions. Preprint, http://abel.math.harvard.edu/~knill/preprints/stability.ps, 1997.

[31] Dieter Kraft. A software package for sequential quadratic programming. Forschungsbericht, Deutsche Forschungs- und Versuchsanstalt für Luft- und Raumfahrt, 1988.

[32] Simon Laughlin. A simple coding procedure enhances a neuron's information capacity. Zeitschrift für Naturforschung C, 36(9-10):910–912, 1981.

[33] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.

[34] David G Luenberger. Optimization by vector space methods. John Wiley & Sons, 1997.

[35] Mark D McDonnell and Nigel G Stocks. Maximally informative stimuli and tuning curves for sigmoidal rate-coding neurons and populations. Physical Review Letters, 101(5):058103, 2008.

[36] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2125–2133, 2015.

[37] Noga Mosheiff, Haggai Agmon, Avraham Moriel, and Yoram Burak. An efficient coding theory for a dynamic trajectory predicts non-uniform allocation of entorhinal grid cells to modules. PLoS Computational Biology, 13(6):e1005597, 2017.

[38] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[39] Alexander P Nikitin, Nigel G Stocks, Robert P Morse, and Mark D McDonnell. Neural population coding is optimized by discrete tuning curves. Physical Review Letters, 103(13):138101, 2009.

[40] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

[41] S Shitz Shamai. Capacity of a pulse amplitude modulated direct detection photon channel. IEE Proceedings I (Communications, Speech and Vision), 137(6):424–430, 1990.

[42] Tatyana O Sharpee. Optimizing neural information capacity through discretization. Neuron, 94(5):954–960, 2017.

[43] Joel G Smith. The information capacity of amplitude- and variance-constrained scalar Gaussian channels. Information and Control, 18(3):203–219, 1971.

[44] Richard B Stein. The information capacity of nerve cells using a frequency code. Biophysical Journal, 7(6):797, 1967.

[45] Daniel W Stroock. Probability Theory: An Analytic View. Cambridge University Press, 2010.

[46] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, 1991.

[47] Zhuo Wang, Alan A Stocker, and Daniel D Lee. Optimal neural tuning curves for arbitrary stimulus distributions: Discrimax, infomax and minimum L_p loss. In Advances in Neural Information Processing Systems, pages 2168–2176, 2012.

[48] Xue-Xin Wei and Alan A Stocker. Mutual information, Fisher information, and efficient coding. Neural Computation, 28(2):305–326, 2016.

[49] Stephen Willard. General Topology. Courier Corporation, 2012.

[50] Stuart Yarrow, Edward Challis, and Peggy Seriès. Fisher and Shannon information in finite neural populations. Neural Computation, 24(7):1740–1780, 2012.

[51] Ryan M Yoder, Benjamin J Clark, and Jeffrey S Taube. Origins of landmark encoding in the brain. Trends in Neurosciences, 34(11):561–571, 2011.

[52] Yilun Zhang, David B Kastner, Stephen A Baccus, and Tatyana O Sharpee. Optimal information transmission by overlapping retinal cell mosaics. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2018.
