Statistical physics approaches to collective behavior in networks of neurons

Xiaowen Chen

A Dissertation Presented to the Faculty of Princeton University in Candidacy for the Degree of Doctor of Philosophy

Recommended for Acceptance by the Department of Physics
Adviser: Professor William Bialek

November 2020

© Copyright by Xiaowen Chen, 2020.

All rights reserved.

Abstract

In recent years, advances in experimental techniques have allowed for the first time simultaneous measurements of many interacting components in living systems at almost all scales, making now an exciting time to search for physical principles in living systems. This thesis focuses on statistical physics approaches to collective behavior in networks of interconnected neurons; both statistical inference methods driven by real data, and analytical methods probing the theory of emergent behavior, are discussed. Chapter 3 is based on work with F. Randi, A. M. Leifer, and W. Bialek [Chen et al., 2019], where we constructed a joint probability model for the neural activity of the nematode Caenorhabditis elegans. In particular, we extended the pairwise maximum entropy model, a statistical physics approach to consistently infer distributions from data that has successfully described the activity of networks of spiking neurons, to this very different system with neurons exhibiting graded potentials. We discuss signatures of collective behavior found in the inferred models. Chapter 4 is based on work with W. Bialek [Chen and Bialek, 2020], where we examine the tuning conditions for the connection matrix among neurons such that the resulting dynamics exhibit long time scales. Starting from the simplest case of random symmetric connections, we combine maximum entropy and random matrix theory methods to explore the constraints required for long time scales to become generic. We argue that a single long time scale can emerge generically from realistic constraints, but a full spectrum of slow modes requires more tuning.

Acknowledgements

Since the beginning of my graduate school training, and perhaps even earlier, I have enjoyed reading the acknowledgement sections of doctoral theses and books, and was in awe of how a doctoral degree cannot be completed alone. My thesis is no different. Throughout the past five years, I have met many amazingly talented and friendly mentors and colleagues, and have grown thanks to my interactions with them.

First and foremost, I would like to thank my advisor, Professor Bill Bialek. From our first meeting during the Open House, he has introduced me to the wonderland of theoretical biophysics; provided me with continuous support, encouragement, and guidance; and allowed me sufficient freedom to explore my scientific interests. In addition to his fine taste in choosing research questions and his rigor in conducting research, I have also been influenced by his optimism amid scientific expeditions in the field of biophysics, and by his leadership both as a scientist and as a citizen.

I would like to thank Professor Andrew Leifer and Dr. Francesco Randi for our collaboration on the data-analysis project in this thesis; they have taught me to be true to the facts. I also appreciate Andy's enthusiasm for science, his friendliness, and his continuous support throughout the years: advising my experimental project, inviting me to design follow-up experiments, and serving on my thesis committee. I also would like to thank Professor Michael Aizenman for serving on my thesis committee, and Professor Ned Wingreen for serving as a Second Reader of this thesis, providing much advice over the past years, and allowing me to attend his group meetings.

I consider myself very lucky to study biological physics at Princeton, especially while the NSF Center for the Physics of Biological Function (CPBF) was being established. The center offers a wonderful and unique community for collaborative science and learning, financial support for participation in conferences such as the APS March Meeting and the annual iPoLS meeting, and opportunities to give back to the community through the undergraduate summer school. I would like to thank the leadership of Bill and Prof. Josh Shaevitz, and the administrative support from Dr. Halima Chahboune and Svitlana Rogers. I also thank Halima for her friendliness and support throughout the years. I would like to thank the theory faculty, Professors Curt Callan, David Schwab, Stephanie Palmer, and Vijay Balasubramanian, who have given me many valuable pieces of advice during theory group meetings. I also would like to thank the experimental faculty, Professors Robert Austin, Thomas Gregor, and Josh Shaevitz, for learning and teaching opportunities and for career advice. I learned a tremendous amount from many discussions with the postdocs, graduate students, and visiting students I had the fortune to overlap with at the Center, including Vasyl Alba, Ricard Alert Zenon, Marianne Bauer, Farzan Beroz, Ben Bratton, Katherine Copenhagen, Yuval Elhanati, Amir Erez, Kamesh Krishnamurthy, Endao Han, Caroline Holmes, Daniel Lee, Zhiyuan Li, Andreas Mayer, Lauren McGough, Leenoy Meshulam, Luisa Fernanda Ramírez Ochoa, Pierre Ronceray, Zachary Sethna, Ben Weiner, Jim Wu, Bin Xu, and Yaojun Zhang. I have also enjoyed many informal conversations and spontaneous Icahn lunches with many of you. I thank Cassidy Yang and Diana Valverde Mendez for being amazing office mates and for befriending a theorist; I miss seeing you in the office.
I also would like to thank especially the members of the Leifer Lab, including Kevin Chen, Matthew Creamer, Kelsey Hallinen, Ashley Linder, Mochi Liu, Jeffery Nguyen, Francesco Randi, Anuj Sharma, Monika Scholz, and Xinwei Yu, for all the discussions related to worms and experimental techniques, and for welcoming me in the lab meetings and the lab itself.

I would like to acknowledge the training and support provided by the Department of Physics. I thank Kate Brosowsky for her support of graduate students, Laurel Lerner for organizing the departmental recitals, and all of the very friendly and supportive administrators. My attendance at many conferences and summer schools was made possible by the Compton Fund. I also thank the Women in Physics groups for their efforts in creating a more inclusive environment in the Department. At the University

level, I would like to thank the Graduate School and the Counseling & Psychological Services at University Health Services for their support.

My last five years would have been less colorful without the friends I met through graduate school, including Trithep Devakul, Christian Jepsen, Ziming Ji, Du Jin, Rocio Kiman, Ho Tat Lam, Zhaoqi Leng, Xinran Li, Sihang Liang, Jingjing Lin, Jingyu Luo, Zheng Ma, Wenjie Su, Jie Wang, Wudi Wang, Zhenbin Yang, Zhaoyue Zhang, and many others. I value my friendship with Junyi Zhang and Jiaqi Jiang, which goes all the way back from attending the same high school in Shanghai to now being in the same cohort at Princeton Physics. I treasure my friendship with Xue (Sherry) Song and Jiaqi Jiang, who have been there for me through all the ups and downs of my graduate career. I would also like to thank Hanrong Chen for company, support, proofreading this thesis, and many other things.

Finally, I would like to thank my parents for their unwavering support and encouragement. It was my father, Wei Chen, who bought me a frog to observe and learn swimming from, and my mother, Yanling Guo, who started a part-time PhD a few years before my own graduate journey; together they kindled and cultivated my curiosity and courage for this scientific quest.

To my parents.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures

1 Introduction
  1.1 The nervous system and its collective behavior
  1.2 Key problems
  1.3 Thesis overview

2 Mathematical and statistical physics methods
  2.1 Maximum Entropy Principle
  2.2 Random Matrix Theory

3 Collective behavior in the small brain of C. elegans
  3.1 Introduction
  3.2 Data acquisition and processing
  3.3 Maximum Entropy Model
  3.4 Does the model work?
  3.5 What does the model teach us?
      3.5.1 Energy landscape
      3.5.2 Criticality
      3.5.3 Network topology
      3.5.4 Local perturbation leads to global response
  3.6 Discussion

4 Searching for long time scales without fine tuning
  4.1 Introduction
  4.2 Setup
  4.3 Time scales for ensembles with different global constraints
      4.3.1 Model 1: the Gaussian Orthogonal Ensemble
      4.3.2 Model 2: GOE with hard stability threshold
      4.3.3 Model 3: Constraining mean-square activity
  4.4 Dynamic tuning
  4.5 Discussion

5 Conclusion and Outlook

A Appendices for Chapter 3
  A.1 Perturbation methods for overfitting analysis
  A.2 Maximum entropy model with the pairwise correlation tensor constraint
  A.3 Maximum entropy model fails to predict the dynamics of the neural networks as expected

B Dynamical inference for C. elegans neural activity
  B.1 Estimate correlation time from the data
  B.2 Coupling the neural activity and its time derivative

C Appendices for Chapter 4
  C.1 How to take averages for the time constants?
  C.2 Finite size effect for Model 2
  C.3 Derivation for the scaling of time constants in Model 3
  C.4 Decay of auto-correlation coefficient
  C.5 Model with additional constraint on self-interaction strength

Bibliography

List of Tables

C.1 Scaling of the inverse slowest time scale (gap) g₀, the width of the support of the spectral density l, and the average norm per neuron ⟨x_i²⟩ versus the Lagrange multiplier ξ (to leading order), in different regimes.

List of Figures

3.1 Schematics of data acquisition and processing of C. elegans neural activity.
3.2 Comparison of the distributions of pairwise mutual information for the calcium-sensitive GCaMP worms and the GFP control worms, with mutual information measured for the calcium activity of each pair of neurons.
3.3 Discretization of the empirically observed fluorescence signals.
3.4 Model construction: learning the maximum entropy model from data.
3.5 No signs of overfitting are observed for pairwise maximum entropy models with up to N = 50 neurons.
3.6 The pairwise maximum entropy model predicts unconstrained higher-order correlations of the data.
3.7 Comparison between model prediction and data for observables not constrained by the model.
3.8 Energy landscape of the inferred maximum entropy model.
3.9 The heat capacity is plotted against temperature for models with different numbers of neurons, N.
3.10 The topology of the learned maximum entropy model approaches that of the structural connectome as the number of neurons being modeled, N, increases.

3.11 Local perturbation of the neural network leads to global response.

4.1 Schematics for emergent time scales from interconnected neurons.
4.2 Spectral density for the connection matrix M drawn from various ensembles.
4.3 Mean-field results for Model 3, i.e. ensembles with a Gaussian prior and with a maximum entropy constraint on the norm of the activity.

4.4 Finite-size effects on the time scales τ_max and τ_corr for Model 3, with interaction strength c = 1 (the supercritical phase).
4.5 Langevin dynamics for neural networks to tune themselves to the ensembles with slow modes.

A.1 Equilibrium dynamics of the inferred pairwise maximum entropy model fails to capture the neural dynamics of C. elegans.

B.1 Overlap of the discretized neuron signal, defined as q(Δt) = ⟨(1/N) Σ_{i=1}^{N} δ_{σ_i(t), σ_i(t+Δt)}⟩_t, versus delay time Δt.
B.2 Spectra of the (cross-)correlation matrix for observed neuron groups with number of neurons N = 10, 30.
B.3 The maximum entropy model constructed by constraining only the mean activity and the pairwise correlations ⟨θ_i σ_j⟩ between the magnitude of neural activity {θ} and its time derivative {σ} fails to predict the correlations within each class of observables.
B.4 The full pairwise maximum entropy model coupling the magnitude of neural activity and the time derivative of neural activity does not improve the prediction of higher-order statistics compared to the pairwise maximum entropy model for the time derivatives alone.

C.1 The fractional standard deviation of the log time scale decreases with system size, while the fractional standard deviation of the time scales is large and shows no decreasing trend, suggesting that the distribution of time scales has long tails.

C.2 Finite-size scaling of τ_max and τ_corr for matrices drawn from the GOE with a hard stability constraint, with the interaction strength at the critical value, c_c = 1/√2 (left panel), and super-critical, c = 1 (right panel). For each system size N, we average the time scales over 1000 Monte Carlo realizations.
C.3 The autocorrelation coefficient R(t) decays with time t. As the system size increases, R(t) approaches the theoretical prediction of a power-law decay, ~ t^{-1/2}. We need to pay attention to how we take the average. Here, we are at the critical case in the GOE with a hard stability constraint.
C.4 Comparison between the autocorrelation coefficient function at different system sizes and the mean-field results.

C.5 Scaling of g₀, l, and ⟨x_i²⟩ as a function of the Lagrange multiplier ξ for random matrices with a maximum entropy constraint on the norm.
C.6 The autocorrelation coefficient R(t) decays with time at different parameter sets (interaction strength c and Lagrange multiplier ξ).
C.7 Finite-size effect for the autocorrelation coefficient vs. time. The interaction strength is c = 1 (the supercritical phase).

C.8 Scaling of the longest time scale t_max and the correlation time t_corr vs. the constrained norm µ for the connection matrix M with the additional maximum entropy constraint fixing ⟨M_ii⟩ = 0.

Chapter 1

Introduction

One of the most exciting goals of a physicist is to find principles in the world of matter. For physicists working in the field of living systems, it is to find principles that are as effective and perhaps even as elegant as they are in the inanimate counterparts. This has been a rewarding yet challenging quest: thanks to the development of modern experimental methods, physicists are able to perform quantitative measurements on living systems, but are at the same time confronted by the complexity of these systems, such as the lack of symmetry, being far out of equilibrium, and the difficulty in isolating causality. Nonetheless, there exist seminal examples where physics and biology can indeed shed light upon each other. In one direction, physical principles can inspire new discoveries in biology. A prominent example was Schrödinger's lecture on What is Life?, where the physicist, puzzled by the durable inheritance amid stochastic fluctuation of mutation, hypothesized the existence of a code script and anticipated the discovery of DNA [Schrödinger, 1944, Philip, 2018]. In turn, biology also inspires new principles, exemplified by the field of active matter [Vicsek et al., 1995, Ramaswamy, 2010, Marchetti et al., 2013]. Among the many subfields of physics, the one with the strongest connection to living systems is perhaps statistical physics. We list four main connections below.

• Understand collective behavior: how it emerges from local interactions, and why

Statistical physics has traditionally been successful in understanding emergent collective properties arising from local interactions in non-living matter. Meanwhile, almost by definition, the interesting phenomena in living systems are collective: living systems either have many interacting chemical components, are multi-cellular, or interact with many other peers and other species. Examples of these system-level phenomena are abundant at all levels of living systems, ranging from gene regulation within a single cell all the way to predator-prey dynamics across different species. Statistical physics has been successful in describing, for example, the flocking of birds, the stability of ecosystems, and the formation of biofilms. In all these cases, concepts from statistical mechanics such as phase transitions have been inspiring in understanding biological functions.

• Characterize living systems across different scales and with different interactions

As mentioned above, collective behavior in biology occurs at all scales. In addition, the forms of interaction are diverse, including the electro-chemical interactions in an interconnected network of neurons, the mechano-sensory interactions among contacting bacteria in a biofilm, or even the phenomenological "social forces" that align flocking birds. Statistical physics offers a framework to characterize all of these different systems, as the establishment of universality classes suggests that systems with the same symmetry constraints, even with different detailed interactions, exhibit the same macroscopic properties. The idea that details do not matter also gives us hope to understand living systems by replacing the interactions with random ones drawn from specific ensembles.

• Characterize high-dimensional data

We are at an exciting time, with a proliferating amount of high-dimensional data in biology. Thanks to advances in high-throughput measurement technology, scientists can now sequence the entire human genome in a few hours [Reuter et al., 2015], simultaneously measure up to 10,000 neurons in alert animals [Stringer et al., 2019], and track the position and velocity of each bird in a flock of thousands [Cavagna et al., 2008]. These vast amounts of data require new analysis tools. Statistical physics offers tools to learn the rules of interaction directly from the data (statistical inference), and has been making great progress in, for example, quantifying the neural activity of many neurons in various brain regions [Schneidman et al., 2006, Tkačik et al., 2014, Meshulam et al., 2017], the adaptive immune system [Mora et al., 2010], and the flocking of birds [Bialek et al., 2012, Bialek et al., 2014]. Furthermore, these principles, directly learned from the data, can then be used to modify our understanding of the real system.

• Connect mechanism to function

With a statistical physics model learned from real biological data, we can ask questions that relate directly to the biology. The first thing we can ask is how the macroscopic property emerges, and whether local interactions are enough; or really, whether physics can say something about biology. Then, we can ask what the functional advantages of the specific biological system are. A simple way to compare it to other models in the same class is to vary the parameters of the statistical physics model. Finally, we can ask how the living system responds to perturbation, how it is maintained in a fluctuating environment, or how it has evolved to reach its current state.

This thesis investigates collective behavior in systems of interconnected neurons. For the rest of this chapter, we review relevant background in neuroscience, the hypothesis of criticality in biology, and the statistical physics and mathematical tools we will use in this thesis.

1.1 The nervous system and its collective behavior

From neurons to neural networks

The nervous system is a familiar and heavily studied biological system at the interface between physics and biology. This familiarity naturally arises from our daily observation of our own cognitive activity, such as reading this sentence while recalling the title of my dissertation. But in order to understand how those cognitive activities are performed, and to understand how surprised we should be that our brain can both read and recall, we need to look deeper into the individual units of cognition – the neurons – and into how biological functions can emerge from nervous systems with large numbers of neurons interacting with each other.

Neurons are cells in the nervous system; they are the basic units of cognition. Through careful control of the electric potential across their membranes, neurons can be excited electrically. If the excitation is large enough, the change of potential (often in the form of a pulse called the "action potential") can propagate down the axon of the neuron in milliseconds [Kandel et al., 2000]. Mathematical models developed by Hodgkin and Huxley in 1952 have been highly successful in describing the initiation and propagation of these action potentials in individual neurons [Hodgkin and Huxley, 1952]. Meanwhile, once propagated, such changes in the voltage of a neuron can excite nearby neurons through chemical transmission across the synapses, which, when the number of neurons is large, gives rise to cognitive activity. In particular, because the number of neurons in a nervous system is often very large, ranging from 10² to 10¹¹

depending on the organism, studies of the nervous system are amenable to statistical physics methods [Amit et al., 1987]. Theoretical efforts in understanding the emergent functions of interconnected neurons include early work on binary input-output artificial neural networks by McCulloch and Pitts, and the perceptron developed by Rosenblatt, both of which can perform simple classification tasks [McCulloch and Pitts, 1943, Rosenblatt, 1958].

In addition, a unique feature of neurons, especially compared to usual physical systems, is that the brain is plastic: the connection strengths among neurons are functions of time and of the neural activity itself, i.e. the brain can perform learning [Magee and Grienberger, 2020]. The framework was first laid down by Hebb in 1949 [Hebb, 1949], and is perhaps best summarized as "neurons that fire together wire together" [Shatz, 1992]. Recently, increasing experimental evidence has revealed more complex updating rules for the neuronal network [Abbott and Nelson, 2000]; for example, the total synaptic strength can be regulated by some global variable through so-called synaptic scaling [Turrigiano, 2008], and the learning rule can also depend on the activity strength of the pre- and post-synaptic neurons [Bienenstock et al., 1982]. With learning rules, neuronal networks can be designed to perform additional functions, such as content-addressable memory and optimization tasks [Hopfield, 1982, Hopfield, 1984, Hopfield and Tank, 1985]. Networks with more nonlinearity and specialized structures have proven successful at many more tasks in the modern development of artificial neural networks, such as recurrent neural networks and deep learning [Cheng and Titterington, 1994].
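To make the idea of content-addressable memory concrete, here is a minimal Hopfield-style sketch (ours, with illustrative sizes and a toy recall test, not a model analyzed in this thesis): a Hebbian rule stores a few binary patterns in the couplings, and asynchronous updates then recover a stored pattern from a corrupted version of it.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 5                                  # illustrative sizes: 100 binary neurons, 5 patterns
patterns = rng.choice([-1, 1], size=(P, N))

# Hebbian rule: "neurons that fire together wire together"
J = (patterns.T @ patterns) / N
np.fill_diagonal(J, 0.0)

def recall(state, n_sweeps=20):
    """Asynchronous updates; stored patterns are (approximate) fixed points of this dynamics."""
    s = state.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(N):
            s[i] = 1 if J[i] @ s >= 0 else -1
    return s

# corrupt 15% of one stored pattern and let the network clean it up
noisy = patterns[0].copy()
flip = rng.choice(N, size=15, replace=False)
noisy[flip] *= -1
print("overlap with stored pattern after recall:", recall(noisy) @ patterns[0] / N)
```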

Finding collective behavior with many-neuron measurements

Recent advances in technology have allowed simultaneous measurements of an increasing number of neurons, both in vivo and in brain slices [Dombeck et al., 2010, Ahrens et al., 2013, Segev et al., 2004, Nguyen et al., 2016, Nguyen et al., 2017, Venkatachalam et al., 2016]. As shown by the "Moore's law" of neuroscience in [Stevenson and Kording, 2011], the number of simultaneously recorded neurons has had a doubling time of 7.4 years since the 1960s, when only single-neuron activity could be measured; a recent experiment was able to measure more than 10,000 neurons in the mouse visual cortex [Stringer et al., 2019]. These recordings offer an exciting opportunity for researchers to test the statistical physics idea that cognitive activities are emergent from interactions among neurons, and to reveal new principles of interacting nervous systems.

How can we extract principles from these high-dimensional data? One approach is to examine coarse-grained variables of the neural activity. Interestingly, researchers have often found self-similarity of such variables when measured at different scales: examples include a power-law-like distribution of avalanche sizes in cultured slices of rat cortex [Beggs and Plenz, 2003], and a non-trivial RG fixed point when coarse-graining the spiking patterns in the mouse hippocampus [Meshulam et al., 2018]. Another approach is to infer the entire probability distribution from real data using maximum entropy models [Jaynes, 1957]. In many cases, these models can illustrate the collective character of network activity. In particular, the state of individual neurons often can be predicted with high accuracy from the state of the other neurons in the network, and the models that are inferred from the data are close to critical surfaces in their parameter space [Tkačik et al., 2009, Tkačik et al., 2015].

These novel experimental data and analysis methods have led to a controversial hypothesis, that biological systems operate at criticality for functional advantages [Mora and Bialek, 2011, Muñoz, 2018]. This includes both statistical criticality, with a peak in the heat capacity as parameters are varied, manifesting an optimal dynamic range, and dynamical criticality, which is believed to be essential for biological systems to operate with high sensitivity and information processing capacity while maintaining a certain robustness. Examples of both kinds of criticality have been found in many collective biological systems in addition to the nervous system, such as the flocking of birds [Bialek et al., 2012, Bialek et al., 2014].

1.2 Key problems

While the quest of finding collective behavior in neural data and understanding such collective behavior is exciting, especially given the increasingly high-dimensional data which are available, both modeling collective behavior using real data and finding principles such as the criticality hypothesis are still relatively new. Many questions remain to be explored, including:

• How general is the approach of constructing joint distributions of neural activity using maximum entropy models that build on local interactions? For example, can these approaches capture the dynamics of networks in which the neurons generate graded electrical responses?

• In networks where neurons generate graded electrical responses, does the system still exhibit signatures of criticality? How general is the criticality hypothesis in models of neuronal systems?

• Often, for a system to exhibit signatures of criticality, some level of fine tuning is required. For example, in the case of Ising models, there is only one critical temperature which gives self-similarity. How can biological systems achieve criticality? How much fine tuning is required, and how much can be self-organized? And how is criticality maintained when the system is coupled to an environment?

• Are there alternative mechanisms for neural systems to exhibit signatures of criticality, such as the emergence of long time scales in a dynamical network, without being poised at criticality? Are they easier to maintain?

1.3 Thesis overview

This thesis addresses the above questions in two example problems: statistical inference of experimentally observed neural activity in the nematode Caenorhabditis elegans; and the search for long time scales without fine tuning in dynamical systems.

Example 1: Collective behavior in a small brain

In the past decade, it has been shown that in some regions of the brain, such as the salamander retina as it responds to natural movies, and the mouse hippocampus during exploration, statistical physics models with only pairwise interactions can describe the network activity quite well, and that the models inferred from the data are close to critical surfaces in their parameter space. Nonetheless, almost all discussions of collective phenomena in networks of neurons had been focused on large vertebrate brains, with neurons that generate discrete, stereotyped action potentials or spikes. In Chapter 3, we show that these inverse approaches can be successfully extended to capture the neural activity of networks in which neurons have graded electrical responses, by studying the nervous system of the nematode C. elegans. Despite this brain being very different from vertebrate brains in both its analog signals and its compactness, we found that the network activity for a large portion of the brain can be explained by a joint probability model, constructed using the maximum entropy principle to constrain the mean activity of individual neurons and the pairwise correlations. We also found collective behavior in our model: the parameters are close to a critical surface, as shown by the peak of the heat capacity when we vary the parameters, and the

multiple local maxima in the inferred probability distribution of neural activity are reminiscent of the Hopfield model of memory. In addition, we also made a novel prediction about the function of such criticality, allowing the brain to be both robust against external perturbations and efficient in information transmission, which remains to be tested experimentally. The work in this chapter was done in collaboration with F. Randi, A. M. Leifer, and W. Bialek. It has been previously presented at the 2018 APS March Meeting in Los Angeles, the 2018 iPoLS Annual Conference in Houston, and the 2019 APS March Meeting in Boston. It has also been published previously as the following refereed journal article:

• X. Chen, F. Randi, A. Leifer, and W. Bialek. Searching for collective behavior in a small brain, Physical Review E 99:052418 (2019).

Example 2: Emergent long time scales in models of interconnected neurons

One simple but important example of dynamical criticality is the emergence of long time scales in neural systems, which hold continuous variables in memory for times much longer than the response time of individual neurons. A simple theoretical model for these persistent neural activities is a line attractor, which requires a fine-tuning of parameters that some biological systems have been shown experimentally to achieve through coupling with the stimuli [Seung, 1996, Major et al., 2004a]. But are there other, more general ways, beyond fine tuning the interaction parameters, to generate sufficiently long time scales? And in what cases does one obtain one long time scale versus a wide range of slow modes? Chapter 4 addresses these questions by combining maximum entropy and random matrix theory methods to construct ensembles of networks, and exploring the constraints required for long time scales to become generic. We argue that a single long time scale can emerge generically from realistic constraints, but a full spectrum of slow modes requires more tuning. We also identified the Langevin dynamics that will generate patterns of synaptic connections drawn from these ensembles as familiar dynamics in neuronal systems, involving a combination of Hebbian learning and activity-dependent synaptic scaling. This work was done in collaboration with W. Bialek, was presented at the 2020 iPoLS Annual Conference, and has been posted to the arXiv preprint server as the following article:

• X. Chen and W. Bialek. Searching for long time scales without fine tuning. arXiv:2008.11674 [physics.bio-ph] (2020).

Chapter 2

Mathematical and statistical physics methods

We describe the statistical physics and mathematical methods used in this thesis. We first review the maximum entropy principle, which is used in Chapter 3 to construct the (joint) probability distribution of neural activities in C. elegans and in Chapter 4 to construct random matrix ensembles with global constraints. We will also introduce random matrix theory that is used in Chapter 4.

2.1 Maximum Entropy Principle

The maximum entropy model was developed more than 60 years ago, as scientists gained a better understanding of the connection between information and statistical mechanics. An excellent review of its development can be found in [Pressé et al., 2013]. The idea that minimal information equals maximal entropy was first suggested in Shannon's seminal paper that laid the groundwork of information theory [Shannon, 1948]. Later, Jaynes connected information theory and statistical mechanics through the maximum entropy framework [Jaynes, 1957]; this framework was further developed by an axiomatic derivation as the only method that draws consistent inferences about probability distributions [Shore and Johnson, 1980]. In recent years, the maximum entropy model has been tremendously successful in describing many high-dimensional systems; this is especially true in the field of biophysics, where the model not only provides a description of the data, but also for the first time allows probing of the underlying principles of systems biology [Mora and Bialek, 2011, Roudi et al., 2009]. For completeness, we outline the basics of maximum entropy methods in this section.

Maximizing the entropy of a distribution

In equilibrium statistical mechanics, one solves for the equilibrium distribution by maximizing the entropy subject to the appropriate constraints for the given ensemble. For example, the Boltzmann distribution is the probability distribution for the canonical ensemble, where the entropy is maximized while the average energy is fixed; for the grand canonical ensemble, both the average energy and the average number of particles are constrained. All of these are based on a general principle: whatever is not constrained will tend toward maximum entropy.

The maximum entropy principle carries over to information theory, and is especially useful in thinking about how to model distributions. For a general probability distribution P(x), we can define the Shannon entropy as a quantity that measures the amount of uncertainty,

S(P) = -\int dx\, P(x) \ln P(x).    (2.1)

It turns out that the only way of drawing consistent inferences is to use the distribution with the maximum entropy, while satisfying the constraints. Mathematically, we assume the distribution P(x) satisfies a set of constraints on the observables O_µ(x), such that

\int dx\, P(x)\, O_\mu(x) = f_\mu ;    (2.2)

then the distribution

P(x) = \frac{1}{Z} \exp\left[ -\sum_\mu \lambda_\mu O_\mu(x) \right],    (2.3)

where the Lagrange multipliers λ_µ are set to satisfy the constraints, has

S(P_{ME}) \geq S(P).    (2.4)

This distribution is mathematically equivalent to the Boltzmann distribution, with an effective energy written as a combination of the constraints. What happens if there is a non-uniform prior, Q(x)? Instead of maximizing the entropy of the distribution subject to the constraints, we can find the distribution that is closest to the prior while matching the constraints. More formally, we minimize the Kullback-Leibler divergence, an information theoretic measure of the difference between two distributions,

D_{KL}(P \| Q) = -\int dx\, P(x) \ln \frac{Q(x)}{P(x)} .    (2.5)

The corresponding distribution is

P(x) = \frac{1}{Z}\, Q(x) \exp\left[ -\sum_\mu \lambda_\mu O_\mu(x) \right].    (2.6)

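As a simple illustration of Eqs. (2.2)-(2.4), consider a single binary variable σ = ±1 with a single constraint on its mean, ⟨σ⟩ = m. The maximum entropy distribution takes the form of Eq. (2.3),

P(\sigma) = \frac{1}{Z} e^{-\lambda \sigma}, \qquad Z = 2\cosh\lambda ,

and matching the constraint fixes the Lagrange multiplier through m = ⟨σ⟩ = −tanh λ, i.e. λ = −tanh⁻¹(m). This is just the Boltzmann distribution of a single spin in a magnetic field, the simplest instance of the equivalence between maximum entropy models and equilibrium statistical mechanics that is used throughout this thesis.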
Maximum entropy model in statistical inference

Given that modern biological data often have many degrees of freedom, but not that many more independent samples, the maximum entropy principle offers a good way (instead of a more frequentist approach) to perform statistical inference. The idea is to choose well-measured observables from the data, and to construct a probability distribution that matches these constraints while otherwise having maximum entropy. This is especially helpful when we use low-order moments as the constraints, since they can be measured accurately with a relatively small number of independent samples.

In addition, if we consider low-order moments as the constraints, the maximum entropy model is a natural way to construct systems with local interactions. For example, for a spin system, these constraints are the pairwise correlations; and if the system is truly interacting locally, the higher-order statistics can be constructed from the lower-order interactions. This method has successfully characterized the collective behavior of many biological systems, such as the coherent motion of bird flocks [Bialek et al., 2012, Bialek et al., 2014], firing patterns in neural networks [Schneidman et al., 2006, Meshulam et al., 2017, Tkačik et al., 2009, Tkačik et al., 2014], protein interaction networks [Weigt et al., 2009], and antibody distributions in immune systems [Mora et al., 2010].

We now go through the methods. If we have selected a set of constraints, the goal is to learn the Lagrange multipliers λ_µ such that the model reproduces the observables in the data. Mathematically, we define

f_\mu^{\rm model}(\{\lambda\}) \equiv \int dx\, P(x)\, O_\mu(x),    (2.7)

f_\mu^{\rm data} \equiv \frac{1}{T} \sum_t O_\mu(x(t)),    (2.8)

and we want to find the set of {λ} such that

f_\mu^{\rm model}(\{\lambda\}) = f_\mu^{\rm data}    (2.9)

for all constraints f_µ. This is a large system of non-linear equations, and is in general hard to solve. Instead, we can consider the equivalent but simpler convex optimization problem. If the system is truly described by the model, then we can write the probability of the data as

P(D \,|\, \{\lambda_\mu\}) = \prod_{t=1}^{T} \frac{1}{Z} \exp\left[ -\sum_\mu \lambda_\mu O_\mu(x(t)) \right];    (2.10)

the normalized negative log-likelihood (or empirical log loss) is

L(\lambda) = -\frac{1}{T} \log P = \sum_\mu \lambda_\mu f_\mu^{\rm data} + \log Z(\lambda).    (2.11)

At the maximum of the likelihood, i.e. the minimum of L, we have, as desired,

\frac{\partial L}{\partial \lambda_\mu} = f_\mu^{\rm data} - f_\mu^{\rm model} = 0.    (2.12)

In real systems with finite data, there is always some error when we compute the empirical averages of the observables. Thus, we only need to optimize the likelihood of the data to the point where the resulting model predictions are within the error bars given by the data.

One difficulty of the problem is to compute f_µ^model. Naively, it requires integrating over all degrees of freedom, and is generally hard to solve. Instead, one can estimate the observables by sampling the distribution with Monte Carlo methods, and perform the average over these samples. This is a computationally expensive step, although there are many efforts in speeding up the learning [Dudík et al., 2004, Broderick et al., 2007]. Alternative approaches include approximation methods such as message passing algorithms [Yedidia et al., 2001, Mezard and Mora, 2009], the Thouless-Anderson-Palmer approximation [Thouless et al., 1977, Tanaka, 1998], and diagrammatic expansions [Cocco and Monasson, 2012].
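To make this learning loop concrete, here is a minimal sketch (our illustration, with toy data and illustrative parameter choices; not the code used for the analyses in Chapter 3). It fits a pairwise maximum entropy (Ising) model to binary data, written in the convention P(s) ∝ exp(Σ_i h_i s_i + Σ_{i<j} J_ij s_i s_j), so that the Lagrange multipliers of Eq. (2.3) correspond to λ = −h, −J. The model averages of Eq. (2.7) are estimated by Metropolis Monte Carlo, and the couplings are updated by descending the gradient of the log loss, Eq. (2.11).

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_sample(h, J, n_samples=500, n_sweeps=5):
    """Sample P(s) ~ exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j), with s_i = +/-1 and J_ii = 0."""
    N = len(h)
    s = rng.choice([-1, 1], size=N)
    samples = []
    for _ in range(n_samples):
        for _ in range(N * n_sweeps):
            i = rng.integers(N)
            dE = 2 * s[i] * (h[i] + J[i] @ s)          # energy change for flipping spin i
            if dE <= 0 or rng.random() < np.exp(-dE):
                s[i] *= -1
        samples.append(s.copy())
    return np.array(samples)

def fit_pairwise_maxent(data, n_iter=30, eta=0.1):
    """Adjust h, J until the model means and pairwise correlations match those of the data."""
    m_data = data.mean(axis=0)
    C_data = (data.T @ data) / len(data)
    N = data.shape[1]
    h, J = np.zeros(N), np.zeros((N, N))
    for _ in range(n_iter):
        s = metropolis_sample(h, J)                     # Monte Carlo estimate of f_mu^model
        m_model = s.mean(axis=0)
        C_model = (s.T @ s) / len(s)
        h += eta * (m_data - m_model)                   # descend the log loss of Eq. (2.11)
        J += eta * (C_data - C_model)
        np.fill_diagonal(J, 0.0)
    return h, J

# toy "data": 5 binary neurons; in Chapter 3 the data are discretized fluorescence signals
toy_data = rng.choice([-1, 1], size=(500, 5))
h, J = fit_pairwise_maxent(toy_data)
```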

Comment on statistical inference and dynamics

The maximum entropy distribution is mathematically equivalent to the Boltzmann distribution, but it is important to note that there is no necessary connection between the probability distribution we write down and any equilibrium statistical mechanics system. In particular, systems with very different dynamics can have the same steady-state distribution. To infer the dynamics of a real biological system requires models that at least take the time derivatives of the states into consideration, or that include constraints matching cross-time correlations.
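As a quick numerical illustration of this point (a toy sketch of ours, unrelated to the specific models in this thesis): Metropolis and Glauber (heat-bath) dynamics use different update rules for the same Ising energy, yet both converge to the same Boltzmann distribution, so a steady-state fit alone cannot distinguish between them.

```python
import numpy as np

rng = np.random.default_rng(2)
N, J = 3, 0.5                                   # three spins, uniform coupling: 2^3 = 8 states
pairs = [(0, 1), (1, 2), (0, 2)]

def energy(s):
    return -J * sum(s[i] * s[j] for i, j in pairs)

def flip_cost(s, i):
    s2 = s.copy()
    s2[i] *= -1
    return energy(s2) - energy(s)

def run(rule, n_steps=200_000):
    """Single-spin-flip chain; both rules satisfy detailed balance w.r.t. exp(-E)."""
    s = rng.choice([-1, 1], size=N)
    counts = {}
    for _ in range(n_steps):
        i = rng.integers(N)
        dE = flip_cost(s, i)
        if rule == "metropolis":
            accept = dE <= 0 or rng.random() < np.exp(-dE)
        else:                                   # Glauber / heat-bath rule
            accept = rng.random() < 1.0 / (1.0 + np.exp(dE))
        if accept:
            s[i] *= -1
        key = tuple(s)
        counts[key] = counts.get(key, 0) + 1
    return {k: v / n_steps for k, v in counts.items()}

p_met, p_gla = run("metropolis"), run("glauber")
for state in sorted(p_met):
    # the two empirical distributions agree, up to sampling noise, with exp(-E)/Z
    print(state, round(p_met[state], 3), round(p_gla.get(state, 0.0), 3))
```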

2.2 Random Matrix Theory

In physics we often encounter interacting systems with many degrees of freedom, and in many cases the detailed interactions are too complicated to track down one by one. Luckily, when the system is large enough, it is often the case that the detailed interactions do not matter; rather, features of the system are determined by general properties and by symmetry. This concept of universality was what guided Eugene Wigner when he considered the problem of energy levels in heavy atoms using random matrices [Wigner, 1951], which subsequently laid the foundation of the field of random matrix theory.

The modern field of random matrix theory spans both mathematics and physics. The key goal is to compute the distributions of eigenvalues and eigenvectors, as well as coarse-grained variables such as the spectral density and fluctuation statistics, for random matrices drawn from different ensembles [Akemann et al., 2015]. Random matrix theory is also highly applicable: for example, in number theory, the level spacing statistics are conjectured to be related to the distribution of the complex zeros of the Riemann zeta function [Keating and Snaith, 2000]; in theoretical physics, it has shed light on problems in quantum chaos [Kos et al., 2018], black holes [Cotler et al., 2017], and the stability of metastable states in disordered systems [Castellani and Cavagna, 2005]; in finance, it helps identify high-reward portfolios against a noisy background [Plerou et al., 2002, Bouchaud and Potters, 2003].

In biophysics, there are two main areas where random matrix theory has proven useful. One is ecology, where random matrix theory was used to show that large ecological systems with too-strong predator-prey interactions are unstable, as argued in the seminal paper by Sir Robert May [May, 1972]. Further investigation, considering that not all fixed points are equal under the predator-prey dynamics, showed that there exist phase transitions between a single dominant species and many stable species, related to marginal stability in disordered systems [Biroli et al., 2018]. The other area of biophysics where random matrix theory has been applied is neural dynamics. The interaction matrix of a linear or non-linear neural network is often considered to be drawn from a simple random matrix ensemble, to study, for example, the phase transition between chaos and quiescence, or how brains store information [Sompolinsky et al., 1988, Vreeswijk and Sompolinsky, 1996, Rajan and Abbott, 2006, Gudowska-Nowak et al., 2020]. In related fields such as machine learning, random matrix theory is also heavily used [Pennington and Worah, 2017, Can et al., 2020].

Example: Compute the spectral density of the Gaussian Orthogonal Ensemble

For completeness, we sketch here the well-known derivation of the spectral density for random matrices drawn from the Gaussian Orthogonal Ensemble [Wigner, 1951, Dyson, 1962a, Dyson, 1962b]. Excellent pedagogical discussions can be found in Refs [Marino, 2016, Livan et al., 2018]. These same methods allow us to derive the spectral densities in all the other cases that we consider in the main text.

Let M be a matrix of size N × N. Assume M is real symmetric, and that the individual elements of the matrix are independent Gaussian random numbers,

M_{ii} \sim \mathcal{N}(0, 1/N),    (2.13)

M_{ij}\big|_{i \neq j} \sim \mathcal{N}(0, 1/2N).    (2.14)

This is the Gaussian Orthogonal Ensemble (GOE). Together with its complex and quaternion counterparts, the Gaussian ensembles are the only random matrix ensembles that both have independent entries and are invariant under orthogonal (unitary, symplectic) transformations, which is more obvious when we write the probability distribution of M in terms of its trace:

P(M) \propto \exp\left( -\frac{N}{2}\, \mathrm{Tr}\, M^{\top} M \right).    (2.15)

Symmetric matrices can be diagonalized by orthogonal transformations,

M = O^{\top} \Lambda O,    (2.16)

where the matrix O is constructed out of the eigenvectors of M and the matrix Λ is diagonal with elements given by the eigenvalues {λ_i}. Because P(M) is invariant to orthogonal transformations of M, it is natural to integrate over these transformations and obtain the joint distribution of eigenvalues. To do this we need the Jacobian, also called the Vandermonde determinant,

dM = \prod_{i<j} |\lambda_i - \lambda_j| \; d\mu(O) \prod_{i=1}^{N} d\lambda_i,    (2.17)

where dµ(O) is the Haar measure of the orthogonal group under its own action. Now we can integrate over the matrices O, or equivalently over the eigenvectors, to obtain

P(\{\lambda_i\}) \prod_{i=1}^{N} d\lambda_i = \int d\mu(O) \prod_{i<j} |\lambda_i - \lambda_j|\, P(M) \prod_{i=1}^{N} d\lambda_i.    (2.18)

Intuitively, one can think of these eigenvalues as experiencing a quadratic local potential with strength u(λ) = λ²/2. In addition, each pair of eigenvalues repels each other with logarithmic strength; this term comes from the Vandermonde determinant, and gives all the universal features of the Gaussian ensembles. Mathematically, the distribution P({λ_i}) is equivalent to the Boltzmann distribution of a two-dimensional electron gas confined to one dimension.

In mean-field theory, which for these problems becomes exact in the thermodynamic limit N → ∞, we can replace sums over eigenvalues by integrals over the spectral density,

\rho(\lambda) = \frac{1}{N} \sum_i \delta(\lambda - \lambda_i).    (2.20)

Then the eigenvalue distribution can be approximated by¹

P(\rho(\lambda)) \propto \exp\left( -\frac{1}{2} N^2 S[\rho(\lambda)] \right),    (2.21)

where

S[\rho(\lambda)] = \int d\lambda\, \rho(\lambda)\, \lambda^2 - \int\! d\lambda\, d\lambda'\, \rho(\lambda)\rho(\lambda') \ln|\lambda - \lambda'|.    (2.22)

¹The double integral over the log difference needs to be corrected for the self-interaction. Luckily, these terms, after summation, are of order N, which is small compared to the other terms of order N².

Because N is large, the probability distribution is dominated by the saddle point ρ*, such that

\frac{\delta \tilde S}{\delta \rho}\bigg|_{\rho = \rho^*} = 0.    (2.23)

Here,

\tilde S = S + \kappa \int d\lambda\, \rho(\lambda)    (2.24)

has a term with the Lagrange multiplier κ to enforce the normalization of the density. The spectral distribution then satisfies

\lambda^2 - 2 \int d\lambda'\, \rho^*(\lambda') \ln|\lambda - \lambda'| = -\kappa.    (2.25)

To eliminate κ we can take a derivative with respect to λ, which gives us

\lambda = \mathrm{Pr} \int \frac{d\lambda'\, \rho^*(\lambda')}{\lambda - \lambda'},    (2.26)

where we understand the integral to be defined by its Cauchy principal value. More generally, if

P(\{\lambda_i\}) \propto \exp\left( -\sum_i u(\lambda_i) + \frac{1}{2} \sum_{j \neq k} \ln|\lambda_j - \lambda_k| \right),    (2.27)

then everything we have done in the GOE case still goes through, but Eq. (2.26) becomes

g(\lambda) \equiv \frac{du(\lambda)}{d\lambda} = \mathrm{Pr} \int \frac{d\lambda'\, \rho(\lambda')}{\lambda - \lambda'}.    (2.28)

Two methods are common for solving equations of this form. One is the resolvent method, which we will not discuss in detail; see [Livan et al., 2018]. The other is the Tricomi solution [Tricomi, 1957], which states that, for smooth enough g(λ), the solution of Eq. (2.28) for the density ρ(λ) is

\rho(\lambda) = \frac{1}{\pi \sqrt{\lambda - a}\,\sqrt{b - \lambda}} \left[ C - \frac{1}{\pi} \mathrm{Pr} \int_a^b d\lambda'\, \frac{\sqrt{\lambda' - a}\,\sqrt{b - \lambda'}}{\lambda - \lambda'}\, g(\lambda') \right],    (2.29)

where a and b are the edges of the support, and

C = \int_a^b \rho(\lambda)\, d\lambda.    (2.30)

If the distribution has a single region of support, then C = 1. If the distribution has more than one region of support, then we need to apply Tricomi's solution separately for each region of support, and the normalization changes accordingly. In general, solving the equation reduces to finding the edges of the support.

For the Gaussian Orthogonal Ensemble, we substitute g(λ) = λ into Tricomi's solution. The distribution is invariant under λ → −λ, so we can set a = −b. Then the integral is

\frac{1}{\pi} \mathrm{Pr} \int_{-b}^{b} d\lambda'\, \frac{\sqrt{\lambda' + b}\,\sqrt{b - \lambda'}}{\lambda - \lambda'}\, \lambda' = \lambda^2 - \frac{b^2}{2}.    (2.31)

We expect the density to fall to zero at the edges of the support, rather than having a jump. Thus, we impose ρ(a) = ρ(b) = 0, which sets b = √2, and the spectral density becomes

\rho(\lambda) = \frac{1}{\pi} \sqrt{2 - \lambda^2}.    (2.32)

This is Wigner's semicircle law. More generally, one would also like to solve for the two-point function and the spacing distribution of the eigenvalues. These are more complicated, and the solutions may not be simple for all random matrix ensembles.
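The semicircle law of Eq. (2.32) is easy to verify numerically. The following sketch (ours, with an illustrative matrix size) samples a single matrix from the GOE with the normalization of Eqs. (2.13)-(2.14) and compares the empirical eigenvalue density to ρ(λ) = (1/π)√(2 − λ²).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
N = 2000                                        # illustrative size; larger N gets closer to the law

# GOE with the normalization of Eqs. (2.13)-(2.14):
# Var(M_ii) = 1/N on the diagonal, Var(M_ij) = 1/(2N) off the diagonal
G = rng.normal(scale=np.sqrt(1.0 / (2 * N)), size=(N, N))
M = (G + G.T) / np.sqrt(2)

eigs = np.linalg.eigvalsh(M)

lam = np.linspace(-np.sqrt(2), np.sqrt(2), 400)
rho = np.sqrt(2 - lam**2) / np.pi               # Eq. (2.32), Wigner's semicircle

plt.hist(eigs, bins=60, density=True, alpha=0.5, label="eigenvalues, single GOE sample")
plt.plot(lam, rho, "k-", label="semicircle law")
plt.xlabel("lambda")
plt.ylabel("rho(lambda)")
plt.legend()
plt.show()
```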

Chapter 3

Collective behavior in the small brain of C. elegans

The materials in this chapter were previously published as “Searching for collective behavior in a small brain” in [Chen et al., 2019].

In large neuronal networks, it is believed that functions emerge through the collective behavior of many interconnected neurons. Recently, the development of experimental techniques that allow simultaneous recording of calcium concentration from a large fraction of all neurons in Caenorhabditis elegans—a nematode with 302 neurons—creates the opportunity to ask if such emergence is universal, reaching down to even the smallest brains. Here, we measure the activity of 50+ neurons in C. elegans, and analyze the data by building the maximum entropy model that matches the mean activity and pairwise correlations among these neurons. To capture the graded nature of the cells' responses, we assign each cell multiple states. These models, which are equivalent to a family of Potts glasses, successfully predict higher statistical structure in the network. In addition, these models exhibit signatures of collective behavior: the state of single cells can be predicted from the state of the rest of the

network; the network, despite being sparse in a way similar to the structural connectome, distributes its response globally when locally perturbed; the distribution over network states has multiple local maxima, as in models for memory; and the parameters that describe the real network are close to a critical surface in this family of models.

3.1 Introduction

The ability of the brain to generate coherent thoughts, percepts, memories, and actions depends on the coordinated activity of large numbers of interacting neurons. It is an old idea in the physics community that these collective behaviors in neural networks should be describable in the language of statistical mechanics [Hopfield, 1982, Hopfield, 1984, Amit et al., 1985]. For many years it was very difficult to connect these ideas with experiment, but new opportunities are offered by the recent emergence of methods to record, simultaneously, the electrical activity of large numbers of neurons [Dombeck et al., 2010, Ahrens et al., 2013, Segev et al., 2004, Nguyen et al., 2016, Nguyen et al., 2017, Venkatachalam et al., 2016]. In particular, it has been suggested that maximum entropy models [Jaynes, 1957] provide a path to construct a statistical mechanics description of network activity directly from real data [Schneidman et al., 2006], and this approach has been pursued in the analysis of the vertebrate retina as it responds to natural movies and other light conditions [Schneidman et al., 2006, Cocco et al., 2009, Tkačik et al., 2015, Tkačik et al., 2014], the dynamics of the hippocampus during exploration of real and virtual environments [Monasson and Rosay, 2015, Posani et al., 2017, Meshulam et al., 2017], and the coding mechanism of spontaneous spikes in cortical networks [Tang et al., 2008, Ohiorhenuan et al., 2010, Köster et al., 2014].

Maximum entropy models that match low-order features of the data, such as the mean activity of individual neurons and the correlations between pairs, make quantitative predictions about higher-order structures in the network, and in some cases these are in surprisingly detailed agreement with experiment [Tkačik et al., 2014, Meshulam et al., 2017]. These models also illustrate the collective character of network activity. In particular, the state of individual neurons often can be predicted with high accuracy from the state of the other neurons in the network, and the models that are inferred from the data are close to critical surfaces in their parameter space, which connects with other ideas about the possible criticality of biological networks [Mora and Bialek, 2011, Tkačik et al., 2015, Muñoz, 2018, Meshulam et al., 2018].

Thus far, almost all discussion about collective phenomena in networks of neurons has been focused on vertebrate brains, with neurons that generate discrete, stereotyped action potentials or spikes [Rieke et al., 1997]. This discreteness suggests a natural mapping onto an Ising model, which is at the start of the maximum entropy analyses, although one could imagine alternative approaches. What is not at all clear is whether these approaches could capture the dynamics of networks in which the neurons generate graded electrical responses. An important example of this question is provided by the nematode Caenorhabditis elegans, which does not have the molecular machinery needed to generate conventional action potentials [Goodman et al., 1998]. The nervous system of C. elegans has just 302 neurons, yet the worm can still exhibit complex neuronal functions: locomotion, sensing, nonassociative and associative learning, and sleep-wake cycles [Stephens et al., 2011, Sengupta and Samuel, 2009, Ardiel and Rankin, 2010, Nichols et al., 2017]. All of the neurons are "identified," meaning that we can find the cell with a particular label in every organism of the species, and in some cases we can find analogous cells in nearby species [Bullock and Horridge, 1965]. In addition, this is the only organism in which

we know the entire pattern of connections among the cells, usually known as the (structural) connectome [White et al., 1986]. The small size of this nervous system, together with its known connectivity, has always made it a tempting target for theorizing, but relatively little was known about the patterns of electrical activity in the system. This has changed dramatically with the development of genetically encodable indicator molecules, whose fluorescence is modulated by changes in calcium concentration, a signal which in turn follows electrical activity [Chen et al., 2013]. Combining these tools with high resolution tracking microscopy opens the possibility of recording the activity in the entire C. elegans nervous system as the animal behaves freely [Nguyen et al., 2016, Venkatachalam et al., 2016, Nguyen et al., 2017].

In this paper we make a first try at the analysis of experiments in C. elegans using the maximum entropy methods that have been so successful in other contexts. Experiments are evolving constantly, and in particular we expect that recording times will increase significantly in the near future. To give ourselves the best chance of saying something meaningful, we focus on subpopulations of up to fifty neurons, in immobilized worms where signals are most reliable. We find that, while details differ, the same sorts of models, which match mean activity and pairwise correlations, are successful in describing this very different network. In particular, the models that we learn from the data share topological similarity with the known structural connectome, allow us to predict the activity of individual cells from the state of the rest of the network, and seem to be near a critical surface in their parameter space.

3.2 Data acquisition and processing

Following methods described previously [Nguyen et al., 2016, Nguyen et al., 2017], nematodes Caenorhabditis elegans were genetically engineered to expressed two flu- orescent proteins in all of their neurons, with tags that cause them to be localized

25 to the nuclei of these cells. One of these proteins, GCaMP6s, fluoresces in the green with an intensity that depends on the surrounding calcium concentration, which fol- lows the electrical activity of the cell and in many cases is the proximal signal for transmission across the synapses to other cells [Chen et al., 2013]. The second pro- tein, RFP, fluoresces in the red and serves as a position indicator of the nuclei as well as a control for changes in the visibility of the nuclei during the course of the experiment. Parallel control experiments were done on worms engineered to express GFP and RFP, neither of which should be sensitive to electrical activity. Although our ultimate goal is to understand neural dynamics in the freely moving animal, as a first step we study worms that are immobilized with polystyrene beads, to reduce motion-induced artifacts [Kim et al., 2013]. As described in Ref. [Nguyen et al., 2016], the fluorescence is excited using lasers. A spinning disk confocal microscope and a high-speed, high-sensitivity Scientific CMOS (sCMOS) camera records red- and green-channel fluorescent image of the head of the worm at a rate of 6 brain-volumes per second at a magnification of 40 ; a × second imaging path records the position and posture of the worm at a magnification of 10 , which are used in the tracking of the neurons across different time frames. × As shown in Fig. 3.1a, the raw data thus are essentially movies. By using a custom machine-learning approach [Nguyen et al., 2017], we are able to reduce the data to

g r the green and red intensities for each neuron i, Ii (t) and Ii (t). The data are described in more detail in [Scholz et al., 2018]. As indicated in Fig. 3.1b, the fluorescence intensity undergoes photobleaching, fortunately on much longer time scale than the calcium dynamics. Thus, we can extract the photobleaching effect by modeling the observed fluorescence intensity

26 (a) 10× IR 10× RFP 40× RFP (c) 1.5

I 1 40× GCaMP 0.5 0 100 200 300 400 500 (b) time (s) 1200 (d) 1.5

1000 f 1 800 RFP 0.5 600 0 100 200 300 400 500 (e) time (s)

(arb. units) (arb. 400 0.05 . 200 f 0 GCaMP -0.05

fluorescenceintensity area per 0 0 100 200 300 400 500 0 100 200 300 400 500 time (s) time (s)

Figure 3.1: Schematics of data acquisition and processing. (a) Examples of the raw images acquired through the 10 (scale bar equals 100µm) and 40 (scale bar equals 10µm) objectives. The body of× the nematode is outlined with× light green curves. As an example, we show for one neuron that (b) the intensity of the nuclei-localized fluorescent protein tags—the calcium-sensitive GCaMP and the control fluorophore RFP—are measured as functions of time. Photobleaching occurs on a longer time scale than the intracellular calcium dynamics, which allows us to perform photo- bleaching correction by dividing the raw signal with its exponential fit, resulting in the signals of panel (c). (d) The normalized ratio of the photobleaching-corrected in- tensity, f, is a proxy for the calcium concentration in each neuron nuclei (dark grey). As described in the text, this signal is discretized using the denoised time derivative f˙; we use three states, marked as red, blue, and black after smoothing (lightly offset for ease of visualization). (e) The time derivative f˙, extracted using total-variation regularized differentiation. with an exponential decay:

$$I^g(t) = S^g(t)\,(1+\eta_g)\left(e^{-t/\tau_g} + A_g\right), \qquad I^r(t) = S^r(t)\,(1+\eta_r)\left(e^{-t/\tau_r} + A_r\right). \qquad (3.1)$$

Here, Sg(t) and Sr(t) are the true signals corresponding to the calcium concentration,

$\eta_g$ and $\eta_r$ are stochastic variables representing the noise due to the laser and the camera, $\tau_g$ and $\tau_r$ are the characteristic times for photobleaching of the two fluorophores,

and $A_g$ and $A_r$ represent nonnegative offsets due to a population of unbleachable fluorophores, or regeneration of fluorescent states under continuous illumination.¹ For each neuron, we fit the observed fluorescence intensities to Eqs. (3.1) with

$S^g(t) = S^g_0$ and $\eta_g = 0$, and similarly for $S^r(t)$. As shown by the black lines in Fig. 3.1b, this captures the slow photobleaching dynamics; we then divide these out to recover

normalized intensities in each channel and each cell, $\bar I_i^g(t)$ and $\bar I_i^r(t)$. Finally, to reduce instrumental and/or motion-induced artifacts, we consider the ratio of the normalized

intensities as the signal for each neuron, i.e. $f_i(t) = \bar I_i^g(t)/\bar I_i^r(t)$ (Fig. 3.1d). In this

normalization scheme, if the calcium concentration remains constant, then fi(t) = 1. Our goal is to write a model for the joint probability distribution of activity in all of the cells in the network. One approach to construct the distribution is to directly

use the continuous normalized fluorescence ratio $f_i(t)$ as the microscopic degrees of freedom. However, it is not clear how to select the class of probability distributions for continuous variables, especially because the number of independent samples is relatively small due to the large temporal correlation in the data, and because the one-point and two-point marginal distributions of the data are manifestly non-Gaussian. To stay as close as possible to previous work, at least in this first try, it makes sense to quantize the activity into discrete states. One possibility is to discretize

based on the magnitude of the fluorescence ratio fi(t). But this is problematic, since even in “control” worms where the fluorescence signal should not reflect electrical activity, variations in different cells are correlated; this is illustrated in Fig. 3.2a,

where we see that the distribution of mutual information between fi(t) and fj(t), across all pairs (i, j), is almost the same in control and experimental worms. A closer look at the raw signal suggests that normalizing by the RFP intensity is not

¹One may worry that a constant "background" fluorescence should be subtracted from the raw signal, rather than contributing to a divisive normalization. In our data, this background subtraction leads to strongly non-stationary noise in the normalized intensity after the photobleaching correction, in marked contrast to what we find by treating the constant as a contribution from unbleachable or regenerated fluorophores.

enough to correct for occasional wobbles of the worm; this causes the distribution of the fluorescence ratio to be non-stationary, and generates spurious correlations. This suggests that (instantaneous) fluorescence signals are not especially reliable, at least given the current processing methods and the state of our experiments. An alternative is to look at the derivatives of these signals, which are still biologically meaningful as they capture the net calcium ion flux of the cell, and by definition suffer from the global noise only at a few instances; now there is very little mutual information between $\dot f_i(t)$ and $\dot f_j(t)$ in the control worms, and certainly much less than in the experimental worms, as seen in Fig. 3.2b. To give ourselves a bit more help in isolating a meaningful signal, we denoise the time derivatives. The optimal Bayesian reconstruction of the underlying time derivative signal u(t) combines a description of noise in the raw fluorescence signal f(t) with some prior expectations about the signal u itself. We approximate the noise in f as Gaussian and white, which is consistent with what we see at high frequencies, and we assume that the temporal variations in the derivative are exponentially distributed and only weakly correlated in time. Then maximum likelihood reconstruction is equivalent to minimizing

$$F(u) = \frac{\tau_f}{\sigma_f}\int_0^T dt\,|\dot u| + \frac{1}{2\sigma_n^2\tau_n}\int_0^T dt\,|Au - f|^2, \qquad (3.2)$$

where A is the antiderivative operator, the combination $\sigma_n^2\tau_n$ is the spectral density of the noise floor that we see in f at high frequencies, while $\sigma_f$ is the total standard deviation of the signal and $\tau_f$ is the typical time scale of these variations; for more on these reconstruction methods see Refs. [Chartrand, 2011, Kato et al., 2015]. We determine the one unknown parameter $\tau_f$ by asking that, after smoothing, the cumulative power spectrum of the residue $Au - f$ has the least root-mean-square difference from the cumulative power spectrum of the extrapolated white noise.

As an example, Fig. 3.1e shows the smooth derivative of the trace in Fig. 3.1d. After the smooth derivative u is estimated, we discretize the smooth estimate of the signal, Au, into three states of "rise," "fall," and "flat," depending on whether the derivative u exceeds a constant multiple of $\sigma_n/\tau_f$, the expected standard deviation of the smooth derivative extracted from pure white noise. The constant is chosen

to be 5, i.e. the threshold is $5\sigma_n/\tau_f$, such that the GFP control worm has almost all pairwise mutual information equal to zero after going through the same data processing pipeline. An example of the raw fluorescence and final discretized signals is shown in Fig. 3.3.
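To make the processing pipeline concrete, the sketch below is a minimal illustration, not the implementation used for the figures: it divides out an exponential-plus-offset photobleaching fit [cf. Eq. (3.1) with constant S(t)], estimates the derivative of the normalized ratio by total-variation regularized differentiation through a plain subgradient descent on a discrete analog of Eq. (3.2), and thresholds the result into three Potts states. The function names, the single regularization weight `lam`, and the optimizer settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def photobleach_correct(I, t):
    """Divide out an exponential-plus-offset fit, cf. Eq. (3.1) with S(t) held constant."""
    model = lambda t, S0, tau, A: S0 * (np.exp(-t / tau) + A)
    popt, _ = curve_fit(model, t, I, p0=(I[0], t[-1], 0.1), maxfev=10000)
    return I / (np.exp(-t / popt[1]) + popt[2])

def tv_derivative(f, dt, lam, n_iter=500, lr=1e-3):
    """Estimate u ~ df/dt by minimizing
       F(u) = lam * sum |du| + 0.5 * sum (A u - f)^2,
    a discrete analog of Eq. (3.2) with the two prefactors folded into lam;
    A is the antiderivative (cumulative sum).  Plain subgradient descent, illustrative only."""
    f = f - f[0]                       # remove the offset so that A u can match f
    u = np.gradient(f, dt)             # initialize with a naive derivative
    for _ in range(n_iter):
        resid = np.cumsum(u) * dt - f              # A u - f
        grad_data = np.cumsum(resid[::-1])[::-1] * dt   # A^T (A u - f)
        s = np.sign(np.diff(u))                    # subgradient of the TV term
        grad_tv = np.zeros_like(u)
        grad_tv[:-1] -= s
        grad_tv[1:] += s
        u -= lr * (grad_data + lam * grad_tv)
    return u

def discretize(u, sigma_n, tau_f, c=5.0):
    """Three-state variable: +1 'rise', -1 'fall', 0 'flat',
    thresholding the smoothed derivative at c * sigma_n / tau_f."""
    thr = c * sigma_n / tau_f
    return np.where(u > thr, 1, np.where(u < -thr, -1, 0))
```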


Figure 3.2: Comparison of pairwise mutual information distributions for the calcium-sensitive GCaMP worms and the GFP control worms. Mutual information is estimated using binning and finite-sample extrapolation methods as described in [Slonim et al., 2005] for all pairs of neurons. For the normalized fluorescence ratio, f, the distribution of the mutual information, $P(\mathrm{MI}(f_i; f_j))$, exhibits little difference between the calcium-sensitive GCaMP worm and the GFP control worm (panel (a)). In comparison, for the time derivative of the normalized fluorescence ratio, $\dot f$, the distribution of the mutual information, $P(\mathrm{MI}(\dot f_i; \dot f_j))$, is peaked around zero for the GFP control worm, while the distribution is wide for the calcium-sensitive GCaMP worm (panel (b)). This observation suggests that the time derivative of the fluorescence ratio, $\dot f_i$, is more informative than its magnitude, $f_i$.


Figure 3.3: Discretization of the empirically observed fluorescence signals. (a) Heatmap of the normalized fluorescence ratio between photobleaching-corrected GCaMP fluorescence intensity and RFP fluorescence intensity, f, for each neuron as a function of time. (b) Heatmap of the neuronal activity after discretization based on time derivatives of f. Green corresponds to a state of “rising”, red “falling”, and white “flat”.

3.3 Maximum Entropy Model

After preprocessing, the state of each neuron is described by a Potts variable $\sigma_i$, and the state of the entire network is $\{\sigma_i\}$. As in previous work on a wide range of biological systems [Schneidman et al., 2006, Tkačik et al., 2014, Meshulam et al., 2017, Weigt et al., 2009, Mora et al., 2010, Bialek et al., 2012], we use a maximum entropy approach to generate relatively simple approximations to the distribution of states,

$P(\{\sigma_i\})$, and then ask how accurate these models are in making predictions about higher order structure in the network activity. The maximum entropy approach begins by choosing some set of observables,

$\mathcal{O}_\mu(\{\sigma_i\})$, over the states of the system, and we insist that any model we write down for $P(\{\sigma_i\})$ match the expectation values for these observables that we find in the data,
$$\sum_{\{\sigma_i\}} P(\{\sigma_i\})\, \mathcal{O}_\mu(\{\sigma_i\}) = \langle \mathcal{O}_\mu(\{\sigma_i\}) \rangle_{\rm expt}. \qquad (3.3)$$
Among the infinitely many distributions consistent with these constraints, we choose the one that has the largest possible entropy, and hence no structure beyond what is needed to satisfy the constraints in Eq. (3.3). The formal solution to this problem is

" # 1 X P ( σ ) = exp λ ( σ ) , (3.4) i Z µ µ i { } − µ O { }

where the coupling constants $\lambda_\mu$ must be set to satisfy Eq. (3.3), and the partition function Z as usual enforces normalization. Note that although the maximum entropy model is mathematically equivalent to the Boltzmann distribution, and hence can be analyzed by well-developed tools in equilibrium statistical mechanics, the model is a probability distribution for the one-time statistics of the data and does not assume the system to be in thermodynamic equilibrium, nor that the underlying dynamics obeys detailed balance (see below). Following the original application of maximum entropy methods to neural activity [Schneidman et al., 2006], we choose as observables the mean activity of each cell, and the correlations between pairs of cells. With neural activity described by three states, "correlations" could mean a whole matrix or tensor of joint probabilities for two cells to be in particular states. We will see that models which match this tensor have too many parameters to be inferred reliably from the data sets we have available, and so we take a simpler view in which "correlation" measures the probability that two neurons are in the same state. Equation (3.4) then becomes

$$P(\sigma) = \frac{1}{Z}\, e^{-H(\sigma)}, \qquad (3.5)$$
with the effective Hamiltonian

$$H(\sigma) = -\frac{1}{2}\sum_{i\neq j} J_{ij}\,\delta_{\sigma_i \sigma_j} - \sum_i \sum_{r=1}^{p-1} h_i^r\,\delta_{\sigma_i r}. \qquad (3.6)$$

The number of states p = 3, corresponding to "rise," "fall," and "flat" as defined

above. The parameters are the pairwise interaction $J_{ij}$ and the local fields $h_i^r$, and these must be set to match the experimental values of the correlations

$$c_{ij} \equiv \langle \delta_{\sigma_i\sigma_j} \rangle = \frac{1}{T}\sum_{t=1}^{T} \delta_{\sigma_i(t)\,\sigma_j(t)}, \qquad (3.7)$$
and the magnetizations
$$m_i^r \equiv \langle \delta_{\sigma_i r} \rangle = \frac{1}{T}\sum_{t=1}^{T} \delta_{\sigma_i(t)\, r}. \qquad (3.8)$$

Note that the local field for the "flat" state, $h_i^p$, is set to zero by convention. In

addition, the interaction $J_{ij}$ can be non-zero for any pair of neurons i and j regardless of the positions of the neurons (both physical and in the structural connectome), i.e. the equivalent Potts model does not have a pre-defined spatial structure. The model parameters are learned using coordinate descent and Markov chain Monte Carlo (MCMC) sampling [Dudík et al., 2004, Broderick et al., 2007, Schmidt, 2007]. In particular, we initialize all parameters at zero. For each

optimization step, we calculate the model predictions $c_{ij}$ and $m_i^r$ by alternating between MCMC sampling with $10^4$ MC sweeps and histogram sampling to speed up the estimation.

Then, we choose a single parameter from the set of parameters $\{J_{ij}, h_i^r\}$ to update, such that the increase of the likelihood of the data is maximized [Dudík et al., 2004]. We repeat the observable estimation and parameter update steps until the model reproduces the constraints within the experimental errors, which we estimate from variations across random halves of the data. This training procedure leaves part of the

interaction matrix Jij zero, while the model is able to reproduce the magnetization

$m_i^r$ and the pairwise correlation $c_{ij}$ within the experimental errors (Fig. 3.4). Because of the large temporal correlation in the data, the number of independent data points in the recording is small compared to the number of parameters. This makes us worry about overfitting, which we test by randomly selecting 5/6 of the data as

Figure 3.4: Model construction: learning the maximum entropy model from data. (a) Connected pairwise correlation matrix, $C_{ij}$, measured for a subgroup of 50 neurons. (b) The inferred interaction matrix, $J_{ij}$. (c) Probability of neuron i to be in state r, for the same group of 50 neurons as panel (a). (d) The inferred local fields, $h_i^r$. (e) The model reproduces the pairwise correlation (unconnected) within its variation throughout the experiment. Error bars are extrapolated from bootstrapping random halves of the data. (f) Same as panel (e), but for the mean neuron activity $m_i^r$.

training set, inferring the maximum entropy model from this training set, and then comparing the log-likelihood of both the training data and the test data with respect to the maximum entropy model. No signs of overfitting are found for subgroups of up to N = 50 neurons, as indicated by the fact that the difference of the log-likelihoods is zero within error bars (Fig. 3.5; details in Appendix A.1). This is not true if we

try to match the full tensor correlations (Appendix A.2), which is why we restrict ourselves to the simpler model.
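For concreteness, a minimal sketch of the learning loop follows. It is not the code used in this chapter: the exact coordinate-descent step of [Dudík et al., 2004] is replaced by a simple update of the single most-violated constraint, the MCMC estimates use a fixed number of sweeps without the histogram trick, and all shapes, step sizes, and sweep counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3                      # states: 0 'rise', 1 'fall', 2 'flat'; the field for state 2 is the gauge choice

def mcmc_observables(J, h, n_sweeps=10_000):
    """Estimate <delta_{si sj}> and <delta_{si r}> by single-spin Metropolis sampling of
    Eq. (3.6); assumes zero diagonal of J."""
    N = J.shape[0]
    s = rng.integers(p, size=N)
    C = np.zeros((N, N)); M = np.zeros((N, p))
    for _ in range(n_sweeps):
        for i in rng.permutation(N):
            new = rng.integers(p)
            dE = (-np.sum(J[i] * (s == new)) + np.sum(J[i] * (s == s[i]))
                  - h[i, new] + h[i, s[i]])
            if dE <= 0 or rng.random() < np.exp(-dE):
                s[i] = new
        C += (s[:, None] == s[None, :])
        M[np.arange(N), s] += 1
    return C / n_sweeps, M / n_sweeps

def fit(c_data, m_data, n_rounds=200, eta=0.1):
    """Coordinate-style updates: nudge the single parameter whose constraint
    [Eqs. (3.7)-(3.8)] is most violated; the field for the reference state stays zero."""
    N = c_data.shape[0]
    J = np.zeros((N, N)); h = np.zeros((N, p))
    for _ in range(n_rounds):
        c_model, m_model = mcmc_observables(J, h)
        dJ = c_data - c_model; np.fill_diagonal(dJ, 0)
        dh = m_data - m_model; dh[:, p - 1] = 0
        if np.abs(dJ).max() >= np.abs(dh).max():
            i, j = np.unravel_index(np.abs(dJ).argmax(), dJ.shape)
            J[i, j] += eta * dJ[i, j]; J[j, i] = J[i, j]
        else:
            i, r = np.unravel_index(np.abs(dh).argmax(), dh.shape)
            h[i, r] += eta * dh[i, r]
    return J, h
```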


Figure 3.5: (a) No signs of overfitting are observed for pairwise maximum entropy models with up to N = 50 neurons, as measured by the difference of the per-neuron log-likelihood of the data under the pairwise maximum entropy model between training sets consisting of 5/6 of the data and the corresponding test sets. Clusters around N = 10, 15, 20, ..., 50 represent randomly chosen subgroups of N neurons. Error bars are the standard deviation across 10 random partitions of training and test samples. The dashed lines show the expected per-neuron log-likelihood difference and its standard deviation calculated through perturbation methods (see Appendix A.1). (b) The difference between the log-likelihood of the training data and of the test data is greater than 0 (the red line) within error bars for maximum entropy models on N = 10, 20, ..., 50 neurons with the pairwise correlation tensor constraint (see Appendix A.2), which suggests that this model does not generalize well.

3.4 Does the model work?

The maximum entropy model has many appealing features, not least its mathematical equivalence to statistical physics problems for which we have some intuition. But this does not mean that this model gives an accurate description of the real network. Here we test several predictions of the model. In practice we generate these predictions by running a long Monte Carlo simulation of the model, and then treating the samples in this simulation exactly as we do the real data. We emphasize that, having matched the mean activity and pairwise correlations, there are no free parameters, so that everything which follows is a prediction and not a fit. Since we use the correlations between pairs of neurons in constructing our model, the first nontrivial test is to predict correlations among triplets of neurons,

$$C_{ijk} = \sum_{r=1}^{p} \Big\langle \big(\delta_{\sigma_i r} - \langle\delta_{\sigma_i r}\rangle\big)\big(\delta_{\sigma_j r} - \langle\delta_{\sigma_j r}\rangle\big)\big(\delta_{\sigma_k r} - \langle\delta_{\sigma_k r}\rangle\big) \Big\rangle. \qquad (3.9)$$

More subtly, since we used only the probability of two neurons being in the same state, we can try to predict the full matrix of pairwise correlations,

$$C_{ij}^{rs} \equiv \langle \delta_{\sigma_i r}\,\delta_{\sigma_j s} \rangle - \langle \delta_{\sigma_i r}\rangle\langle \delta_{\sigma_j s}\rangle; \qquad (3.10)$$
note that the trace of this matrix is what we used in building the model. Scatter

plots of observed vs. predicted values for $C_{ijk}$ and $C_{ij}^{rs}$ are shown in Fig. 3.6a and c. In parts b and d of that figure we pool the data, comparing the root-mean-square differences between our predictions and mean observations (model error) with errors in the measurements themselves. Although not perfect, model errors are always within 1.5× the measurement errors, over the full dynamic range of our predictions. Turning to more global properties of the system, we consider the probability of k neurons being in the same state, defined as

$$P(k) \equiv \sum_{r=1}^{p} \Big\langle \mathbb{1}_{\sum_{i=1}^{N}\delta_{\sigma_i r} = k} \Big\rangle, \qquad (3.11)$$
where $\mathbb{1}$ is the indicator function. It is useful to compute this distribution not just from the data, but also from synthetic data in which we break correlations among neurons by shifting each cell's sequence of states by an independent random time. We see in Fig. 3.7a that the real distribution is very different from what we would see with


Figure 3.6: Model validation: the model predicts unconstrained higher order correlations of the data. Panel (a) shows the comparison between model prediction and data for the connected three-point correlation $C_{ijk}$ for a representative group of N = 50 neurons. All 19800 possible triplets are plotted as blue dots. Error bars are generated by bootstrapping random halves of the data, and are shown for 20 uniformly spaced random triplets in red. Panel (b) shows the error of the three-point function, $\Delta C_{ijk}$, as a function of the connected three-point function, binned by its value predicted by the model, $C_{ijk,\,\rm model}$. The red curve is the difference between data and model prediction. The blue curve is the standard error of the mean of $C_{ijk}$ over the course of the experiment, extracted by bootstrapping random halves of the experiment. Panels (c, d) are the same as panels (a, b), but for the connected two-point correlation tensor $C_{ij}^{rs}$.

independent neurons, so that in particular the tails provide a signature of correlations. These data agree very well with the distributions predicted by the model. Our model assigns an "energy" to every possible state of the network [Eq. (3.6)], which sets the probability of that state according to the Boltzmann distribution.

Because our samples are limited, we cannot test whether the energies of individual states are correct, but we can ask whether the distribution of these assigned energies across the real states taken on by the network agrees with what is predicted from the model. Figure 3.7b compares these distributions, shown cumulatively, and we see that there is very good overlap between theory and experiment across ∼90% of the density, with the data having a slightly fatter tail than predicted. The good agreement extends over a range of ∆E ∼ 20 in energy, corresponding to predicted probabilities that range over a factor of $\exp(\Delta E) \sim 10^8$. The maximum entropy model gives the probability for the entire network to be in a given state, which means that we can also compute the conditional probabilities for the state of one neuron given the state of all the other neurons in the network. Testing whether we get this right seems a very direct test of the idea that activity in the network is collective. This conditional probability can be written as

" p−1 # X r P (σi σj6=i ) exp gi δσir , (3.12) |{ } ∝ r=1

where the effective fields $g_i^r$ are combinations of the local field $h_i^r$ and each cell's interaction with the rest of the network,

$$g_i^r = h_i^r + \sum_{j\neq i}^{N} J_{ij}\,\big(\delta_{\sigma_j r} - \delta_{\sigma_j p}\big). \qquad (3.13)$$

Then the probabilities for the states of neuron i are set by

$$\frac{P(\sigma_i = r)}{P(\sigma_i = p)} = e^{g_i^r}, \qquad (3.14)$$
where the last state p is a reference. In Figure 3.7c and d we test these predictions. In practice we walk through the data, and at each moment in time, for each cell, we compute effective fields. We then find all moments where the effective field falls


Figure 3.7: Model validation: comparison between model prediction and data for observables not constrained by the model. The neuron network has N = 50 neurons. (a) Probability of k neurons being in the same state. Blue dots are computed from the data. The yellow dash-dot line is the prediction from a model where all neurons are independent, generated by applying a random temporal cyclic permutation to the activity of each neuron. The purple line is the prediction of the pairwise maximum entropy model. (b) Tail distribution of the energy for the data and the model. All error bars in this figure are extrapolated from bootstrapping. (c, d) Probability ratio of the state of a single neuron as a function of the effective field $g_i^r$, binned by the value of the effective field. Error bars are the standard deviation after binning.

into a small bin, and compute the ratio of probabilities for the states of the one cell, collecting the data as shown. The agreement is excellent, except at extreme values of the field, which are sampled only very rarely in the data. We note the agreement extends over a dynamic range of roughly two decades in the probability ratios.²

²The claim that behaviors are collective requires a bit more than predictability. It is possible that behaviors of individual cells are predictable from the state of the rest of the network, but that most of the predictive power comes from interaction with a single strongly coupled partner. We have checked that the mutual information $I(\sigma_i; g_i^r)$ is larger than the maximum of $I(\sigma_i; \sigma_k)$, in almost all cases.

3.5 What does the model teach us?

3.5.1 Energy landscape

Maximum entropy models are equivalent to Boltzmann distributions and thus define an energy landscape over the states of the system, as shown schematically in Fig. 3.8a. In our case, as in other neural systems, the relevant models have interactions with varying signs, allowing the development of frustration and hence a landscape with multiple local minima. These local minima are states of high probability, and serve to divide the large space of possible states into basins. It is natural to ask how many of these basins are supported in subnetworks of different sizes. To search for energy minima, we performed quenches from initial conditions corresponding to the states observed in the experiment, as described in [Tkačik et al., 2014]. Briefly, at each update, we change the state of one neuron such that the decrease of energy is maximized, and we terminate this procedure when no single spin flip will decrease the energy; the states that are attracted to local energy minimum α form a basin of attraction Ωα. As shown in Fig. 3.8c, the number of energy minima grows sub-exponentially as the number of neurons increases. Note that this approach only gives us the states that the animal has access to, rather than all metastable states, whose number is approximated by greedy quench along a long MCMC trajectory. Nonetheless, the probability of visiting a basin is similar between the data and the model, shown by the rank-frequency plot (Fig. 3.8d). Whether the energy minima correspond to well defined collective states depends on the heights of the barriers between states. Here, we calculate the barrier height between basins by single-spin-flip MCMC, initialized at one minimum α and terminating when the state of the system belongs to a different basin Ωβ; the barrier between


Figure 3.8: Energy landscape of the inferred maximum entropy model. (a) Schematic of the energy landscape with local minima α, β and the corresponding basins Ωα, Ωβ. Colored in light blue is the metabasin formed at the given energy threshold, ∆E. (b) Typical distribution of the values of the energy minima and the barriers of a maximum entropy model on N = 30 neurons. The global energy minimum, E0, is subtracted from the energy, E. (c) The number of energy minima increases sub-exponentially as the number of neurons included in the model increases. Error bars are the standard deviation across 10 different subgroups of N neurons. (d) The rank-frequency plot for the frequency of visiting each basin matches well between data and model for a typical subgroup of 40 neurons. (e) The number of metabasins, grouped according to the energy barrier, diverges when the energy threshold ∆E approaches 1 from above.

basins Ωα and Ωβ is defined as the maximum energy along this trajectory. This sampling procedure is repeated 1000 times for each initial basin to compute the mean energy barrier. As shown in Fig. 3.8b, the distribution of barrier energies strongly overlaps the distribution of the energy minima, which implies that the minima are not well separated. Further visualization of the topography of the energy landscape is performed by constructing metabasins, following Ref. [Becker and Karplus, 1997]. Here, we

construct metabasins by grouping the energy minima according to the barrier height; basins with barrier height lower than a given energy threshold, ∆E, are grouped into a single metabasin. This threshold can be varied: at high enough threshold, the system effectively does not see any local minima; at low threshold, the partition of the energy landscape approaches the partition given by the original basins of attraction. If the dynamics were just Brownian motion on the landscape, states within the same metabasin would transition into one another more rapidly than states belonging to different metabasins. As shown in Fig. 3.8e, there is a transition at ∆E ≈ 1.2 from single to multiple metabasins for all N = 10, 20, and 30. Since the dynamics of the real system do not correspond to a simple walk on the energy landscape (Appendix A.3 and Fig. A.1), we cannot conclude that this is a true dynamical transition, but it does suggest that the state space is organized in ways that are similar to what is seen in systems with such transitions.
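A minimal sketch of the greedy quench used to identify basins is given below; it assumes the Hamiltonian of Eq. (3.6) with zero diagonal of J, and the data structures and function names are illustrative rather than those of the actual analysis code.

```python
import numpy as np

def quench(s0, J, h):
    """Greedy descent on H(sigma) [Eq. (3.6)]: repeatedly apply the single-neuron
    change that lowers the energy the most; stop at a local minimum."""
    s = s0.copy()
    N, p = h.shape
    while True:
        best_dE, best_move = 0.0, None
        for i in range(N):
            for new in range(p):
                if new == s[i]:
                    continue
                dE = (-np.sum(J[i] * (s == new)) + np.sum(J[i] * (s == s[i]))
                      - h[i, new] + h[i, s[i]])
                if dE < best_dE:
                    best_dE, best_move = dE, (i, new)
        if best_move is None:
            return tuple(s)          # hashable label: the local minimum itself
        s[best_move[0]] = best_move[1]

def count_basins(states, J, h):
    """Map every observed network state to its local minimum; return visit counts."""
    minima = {}
    for s0 in states:
        m = quench(np.asarray(s0), J, h)
        minima[m] = minima.get(m, 0) + 1
    return minima
```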

3.5.2 Criticality

Maximum entropy models define probability distributions that are equivalent to equilibrium statistical physics problems. As these systems become large, we know that the parameter space separates into distinct phases, separated by critical surfaces. In several biological systems that have been analyzed, including the neural networks in salamander retina and mouse hippocampus, the diversity of the human B cell repertoire, and the spontaneous flocking of European starlings, there are signs that these critical surfaces are not far from the operating points of the real networks [Meshulam et al., 2018, Tkačik et al., 2015, Bialek et al., 2012, Mora et al., 2010], although the interpretation of this result remains controversial [Mora and Bialek, 2011, Muñoz, 2018]. Here we ask simply whether the same pattern emerges in C. elegans. One natural slice through the parameter space of models corresponds to changing the effective temperature of the system, effectively scaling all terms in the log

probability up and down uniformly. Concretely, we replace $H(\sigma) \to H(\sigma)/T$ in Eq. (3.5). We monitor the heat capacity of the system, as we would in thermodynamics; here the natural interpretation is of the heat capacity as being proportional to the variance of the log probability, so it measures the dynamic range of probabilities that can be represented by the network. Results are shown in Fig. 3.9, for randomly chosen subsets of N = 10, 20, ..., 50 neurons. A peak in heat capacity often signals a critical point, and here we see that the maximum of the heat capacity approaches the

operational temperature $T_0 = 1$ from below as N becomes larger, suggesting that the full network is near to criticality.
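The heat capacity curves can be estimated by introducing the temperature as above, H → H/T, sampling the rescaled model by Monte Carlo, and using C(T) = Var[H]/T². The sketch below is an illustration under those assumptions (zero diagonal of J, illustrative sweep counts), not the production analysis code.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_energies(J, h, T, n_sweeps=20_000, burn=2_000):
    """Metropolis sampling of P ~ exp[-H(sigma)/T], with H from Eq. (3.6);
    returns the energies of the sampled states."""
    N, p = h.shape
    s = rng.integers(p, size=N)
    E = -0.5 * np.sum(J * (s[:, None] == s[None, :])) - np.sum(h[np.arange(N), s])
    energies = []
    for sweep in range(n_sweeps):
        for i in rng.permutation(N):
            new = rng.integers(p)
            dE = (-np.sum(J[i] * (s == new)) + np.sum(J[i] * (s == s[i]))
                  - h[i, new] + h[i, s[i]])
            if dE <= 0 or rng.random() < np.exp(-dE / T):
                s[i], E = new, E + dE
        if sweep >= burn:
            energies.append(E)
    return np.array(energies)

def heat_capacity(J, h, temperatures):
    """C(T) = Var[H]/T^2, i.e. the variance of the log probability of Eq. (3.5)."""
    return np.array([sample_energies(J, h, T).var() / T**2 for T in temperatures])
```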


Figure 3.9: The heat capacity is plotted against temperature for models with different numbers of neurons, N. The maximum of the heat capacity approaches the operational temperature of the C. elegans neural system, $T_0 = 1$, from below as N increases. Error bars are the standard error across 10 random subgroups of N neurons.

3.5.3 Network topology

The worm C. elegans is special in part because it is the only organism in which we know (essentially) the full pattern of connectivity among neurons. Our models also have a "connectome," since only a small fraction of the possible pairs of neurons are linked by a nonzero value of $J_{ij}$. The current state of our experiments is such that we cannot identify the individual neurons, and so we cannot check if the effective connectivity in our model is similar to the anatomical connections. But we can ask statistical questions about the connections, and we focus on two global properties of the network: the clustering coefficient C, defined as the fraction of actual links compared to all possible links connecting the neighbors of a given neuron, averaged over all neurons; and the characteristic path length L, defined as the average shortest distance between any pair of neurons. As shown in Fig. 3.10, the topology of the inferred networks for all three worms that we investigated differs from random Erdős–Rényi graphs with the same number of nodes (neurons) and links (non-zero interactions). Moreover, as we increase the number of neurons that we consider, the clustering coefficient C and the characteristic path length L approach those found in the structural connectome [Watts and Strogatz, 1998].
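Both topological statistics can be computed from the binary graph defined by the nonzero entries of J. The sketch below uses the networkx library, restricts the path length to the largest connected component, and compares against an Erdős–Rényi graph with matched numbers of nodes and edges; the threshold on |J_ij| and the function names are illustrative assumptions rather than the exact analysis code.

```python
import numpy as np
import networkx as nx

def topology(J, tol=0.0):
    """Clustering coefficient C and characteristic path length L of the graph
    whose edges are the nonzero (|J_ij| > tol) interactions."""
    A = (np.abs(J) > tol).astype(int)
    np.fill_diagonal(A, 0)
    G = nx.from_numpy_array(A)
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    return nx.average_clustering(G), nx.average_shortest_path_length(giant)

def random_reference(J, seed=0):
    """Erdos-Renyi graph with the same number of nodes and edges as the inferred network."""
    N = J.shape[0]
    NE = int(np.count_nonzero(np.triu(J, k=1)))
    G = nx.gnm_random_graph(N, NE, seed=seed)
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    return nx.average_clustering(G), nx.average_shortest_path_length(giant)
```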

3.5.4 Local perturbation leads to global response

How well can the sparsity of the inferred network explain the observed globally-distributed pairwise correlations? In particular, we would like to examine the response of the network to local perturbations. This test is of particular interest, since its predictions can be examined experimentally, as local perturbation of the neural network can be achieved through optogenetic clamping or ablation of individual neurons. The maximum entropy model can be perturbed through both "clamping" and "ablation." By definition, the only possible state in which we can clamp a single neuron is the "flat" state, $\sigma_k = p$. Following the maximum entropy model [Eq. (3.6)], the probability distribution for the rest of the network becomes

$$\tilde P_k(\sigma) \equiv P(\sigma_1, \sigma_2, \ldots, \sigma_{N-1} \mid \sigma_k = 3) = \frac{1}{\tilde Z_k}\, e^{-\tilde H_k(\sigma)}, \qquad (3.15)$$

Figure 3.10: The topology of the learned maximum entropy model approaches that of the structural connectome as the number of neurons being modeled, N, increases. The two global topological properties being measured are the clustering coefficient C (panel (a)) and the characteristic path length L (panel (b)). Here, the inferred network topology for three different worms is plotted in blue. Red curves are for randomized networks with the same number of neurons, N, and number of connections, $N_E$, as the model, where we expect $L_{\rm random} \sim \ln(N)/\ln(2N_E/N)$ and $C_{\rm random} \sim 2N_E/N^2$. The dark blue line corresponds to the network properties of the structural connectome; the dark red line corresponds to a randomized network with numbers of nodes and edges equal to those of the structural connectome [Watts and Strogatz, 1998]. Error bars are generated from the standard deviation across 10 different subgroups of N neurons.

where the effective Hamiltonian is

$$\tilde H_k(\sigma) = -\frac{1}{2}\sum_{i\neq j\neq k} J_{ij}\,\delta_{\sigma_i\sigma_j} - \sum_{i\neq k} J_{ik}\,\delta_{\sigma_i p} - \sum_{i\neq k}\sum_{r=1}^{p-1} h_i^r\,\delta_{\sigma_i r}. \qquad (3.16)$$

On the other hand, ablation of neuron k means the removal of neuron k from the network, which leads to an effective Hamiltonian

$$\hat H_k(\sigma) = -\frac{1}{2}\sum_{i\neq j\neq k} J_{ij}\,\delta_{\sigma_i\sigma_j} - \sum_{i\neq k}\sum_{r=1}^{p-1} h_i^r\,\delta_{\sigma_i r}. \qquad (3.17)$$

We examine the effect of clamping and ablation by Monte Carlo simulation of these modified models. We focus on the response of individual neurons i to perturbing

neuron k, which is summarized by the change in the magnetizations, $m_i^r \to \tilde m_i^r$. But since these also represent the probabilities of finding neuron i in each of the states r = 1, ..., p, we can measure the change as a Kullback–Leibler divergence,

$$D_{\rm KL} = \sum_{r=1}^{p} m_i^r\, \log\!\left(\frac{m_i^r}{\tilde m_i^r}\right)\ {\rm bits}. \qquad (3.18)$$

As shown in Fig. 3.11, the response of the network to the local perturbation is distributed throughout the network for both clamping and ablation. However, clamping

leads to much larger values of $D_{\rm KL}$, suggesting that the network is more sensitive to clamping, and perhaps robust against (limited) ablation. Interestingly, this result echoes the experimental observation that C. elegans locomotion is easily disturbed through optogenetic manipulation of single neurons [Gordus et al., 2015, Liu et al., 2018], while ablation of single neurons has limited effect on the worms' ability to perform different patterns of locomotion [Gray et al., 2005, Piggott et al., 2011, Yan et al., 2017], although further experimental investigation is needed to test our hypotheses on network response.
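A sketch of this perturbation analysis is given below: the clamped and ablated models of Eqs. (3.15)–(3.17) are built by editing J and h, the single-neuron marginals are re-estimated by Metropolis sampling, and the response is quantified by the Kullback–Leibler divergence of Eq. (3.18). The sampler, sweep counts, and the convention that the last Potts state is the "flat" reference are illustrative assumptions, not the exact analysis code.

```python
import numpy as np

rng = np.random.default_rng(2)

def marginals(J, h, n_sweeps=20_000):
    """Single-neuron marginals m_i^r under H of Eq. (3.6), by Metropolis sampling."""
    N, p = h.shape
    s = rng.integers(p, size=N)
    M = np.zeros((N, p))
    for _ in range(n_sweeps):
        for i in rng.permutation(N):
            new = rng.integers(p)
            dE = (-np.sum(J[i] * (s == new)) + np.sum(J[i] * (s == s[i]))
                  - h[i, new] + h[i, s[i]])
            if dE <= 0 or rng.random() < np.exp(-dE):
                s[i] = new
        M[np.arange(N), s] += 1
    return M / n_sweeps

def perturb(J, h, k, mode="clamp"):
    """Clamping neuron k to the 'flat' reference state shifts the reference-state field of
    its neighbors by J_ik [Eq. (3.16)]; ablation removes row/column k [Eq. (3.17)]."""
    keep = np.arange(J.shape[0]) != k
    Jk, hk = J[np.ix_(keep, keep)], h[keep].copy()
    if mode == "clamp":
        hk[:, -1] += J[keep, k]
    return Jk, hk

def dkl_response(J, h, k, mode="clamp"):
    """D_KL (in bits) between unperturbed and perturbed marginals, per neuron [Eq. (3.18)]."""
    m0 = marginals(J, h)[np.arange(J.shape[0]) != k]
    m1 = marginals(*perturb(J, h, k, mode))
    return np.sum(m0 * np.log2(np.clip(m0, 1e-12, 1) / np.clip(m1, 1e-12, 1)), axis=1)
```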

3.6 Discussion

Soon it should be possible to record the activity of the entire nervous system of C. elegans as it engages in reasonably natural behaviors. As these experiments evolve, we would like to be in a position to ask questions about collective phenomena in this small neural network, perhaps discovering aspects of these phenomena which are shared with larger systems, or even (one might hope) universal. We start modestly, guided by the state of the data. We have built maximum entropy models for groups of up to N = 50 cells, matching the mean activity and pairwise correlations in these subnetworks. Perhaps our most important result is that these models work, providing successful quantitative predictions for many higher order statistical structures in the network activity. This parallels what has been seen in systems where the neurons generate action potentials, but the C. elegans network operates in a very different regime. The success of pairwise models in this new context adds urgency to the question of when and why these models should work, and when we might expect them to fail. Beyond the fact that the models make successful quantitative predictions, we find other similarities with analyses of vertebrate neural networks. The probability distributions that we infer have multiple peaks, corresponding to a rough energy landscape, and the parameters of these models appear close to a critical surface. In addition, we have shown that the inferred model is sparse, and has topological properties similar to those of the structural connectome. Nevertheless, a global response is observed when the modeled network is perturbed locally, in a way similar to experimental observations. With the next generation of experiments, we hope to extend our analysis in four ways. First, longer recordings will allow construction of meaningful models for larger groups of neurons. If coupled with higher signal-to-noise ratios, it should also be possible to make a more refined description of the continuous signals relevant to C. elegans neurons, rather than having to compress our description down to a small number of discrete states. This alternative description will be mathematically equivalent to a Boltzmann distribution of soft spins, constrained by the one- and two-point functions as well as a family of higher order correlations specified by the data. Second, registration and identification of the observed neurons will make it possible to compare the anatomical connections between neurons with the pattern of interactions in our probabilistic models. Being able to identify neurons across multiple worms will also allow us to address the degree of reproducibility across individuals, and perhaps extend the effective size of data sets by averaging. Third, optogenetic tools will allow local perturbation of the neural network experimentally, which can be compared directly with the theoretical predictions in §3.5.4 above. Finally, improvements in experimental methods will enable construction of maximum entropy models for freely moving worms, with which we can map the relation between the collective behavior identified in the neuronal activity and the behavior of the animal.


Figure 3.11: Local perturbation of the neural network leads to global response. (a, b) For a typical group of N = 50 neurons, the inferred interaction matrix J is sparse. Here, the neuron indices i and j are sorted based on $m_i^{\rm flat}$, as in Fig. 3.4. (c, d) When neuron k is clamped to a constant voltage, the Kullback–Leibler divergence (in bits) of the marginal distribution of states for neuron i is distributed throughout the network. (e, f) When neuron k is ablated, the $D_{\rm KL}$ is also distributed throughout the network, but is smaller than in response to clamping.

Chapter 4

Searching for long time scales without fine tuning

The materials in this chapter were previously posted in [Chen and Bialek, 2020].

Most animal and human behavior occurs on time scales much longer than the response times of individual neurons. In many cases it is plausible that these long time scales emerge from the recurrent dynamics of electrical activity in networks of neurons. In linear models, time scales are set by the eigenvalues of a dynamical matrix whose elements measure the strengths of synaptic connections between neurons. It is not clear to what extent these matrix elements need to be tuned in order to generate long time scales; in some cases, one needs not just a single long time scale but a whole range. Starting from the simplest case of random symmetric connections, we combine maximum entropy and random matrix theory methods to construct ensembles of networks, exploring the constraints required for long time scales to become generic. We argue that a single long time scale can emerge generically from realistic constraints, but a full spectrum of slow modes requires more tuning. Langevin dynamics that

50 will generate patterns of synaptic connections drawn from these ensembles involve a combination of Hebbian learning and activity–dependent synaptic scaling.

4.1 Introduction

Living systems face various challenges over their lifetimes, and responding to these challenges often involves behaviors that occur over multiple time scales. As an example, a migratory bird needs to both react to instantaneous gusts, and to navigate its course over months. Recent experiments have focused attention on this problem, demonstrating the approximate power–law decay of behavioral correlation functions in fruit flies and mice [Berman et al., 2016, Shemesh et al., 2013], and near–marginal modes in locally linear approximations to neural and behavioral dynamics of the nematode C. elegans [Costa et al., 2019]. These long time scales could emerge from responses of the organism to a fluctuating environment, or could be intrinsic, as would happen if the underlying neural networks were poised near criticality [Mora and Bialek, 2011, Muñoz, 2018]. In the cases where we can decouple the organism from its environment, the long time scales in behavior must come from long time scales in the generator of behaviors, the nervous system. While transient responses of individual neurons decay on the time scale of tens of milliseconds, autonomous behaviors can last orders of magnitude longer. We see this when we hold a string of numbers in our heads for tens of seconds before dialing a phone, and when a musician plays a piece from memory that lasts many minutes. Experimentally, long time scales in behavior have been associated with persistent neural activities, where after a pulse stimulation, some neurons are found to hold their firing rates at specific values that encode the transient stimuli [Aksay et al., 2001, Brody et al., 2003a, Major et al., 2004a, Major et al., 2004b, Major and Tank, 2004, Srimal and Curtis, 2008].

It is plausible that persistent neural activities emerge from the recurrent dynamics of electrical activity in the network of neurons. In the simplest linear model, the relaxation times of the system depend on the eigenvalues of a matrix representing the synaptic connection strengths among neurons, and we can imagine this being tuned so that time scales become arbitrarily long [Seung, 1996]. This simple model has successfully explained the long time scale in the oculomotor system of goldfish, where the nervous system tunes its dynamics to be slow and stable using constant feedback from the environment [Major et al., 2004a, Major et al., 2004b]. In general, long time scales in linear dynamical systems require fine tuning, as the modes need to be slow, but not unstable. There have been a number of discussions of how to avoid such fine tuning [Brody et al., 2003b], including adding non-linearity to create discrete approximations of the continuous attractor for the dynamics [Brody et al., 2003b], placing neurons in special configurations such as a feed-forward line [Goldman, 2009] or a ring network [Burak and Fiete, 2012], promoting the interaction matrix to a dynamical variable [Magnasco et al., 2009], and regulating the overall neural activity with synaptic scaling [Renart et al., 2003, Tetzlaff et al., 2013]. Nonetheless, it is not clear to what extent connection strengths need to be tuned in order to generate sufficiently long time scales, especially when one needs not just a single long time scale but a whole spectrum of slow modes. In this chapter, we address the fine-tuning question by asking whether we can find ensembles of random connection matrices, subject to biologically plausible constraints, such that the resulting time scales of the system grow with increasing system size. We also discuss the conditions for systems to exhibit a continuous spectrum of slow modes as opposed to single slow modes. Finally, we present plausible dynamics for the system to tune its connection matrix towards these desired ensembles.

52 4.2 Setup

The problem of characterizing time scales in fully nonlinear neural networks—or any high dimensional dynamical system—is very challenging. To make progress, we follow the example of Ref. [Seung, 1996] and consider the case of linear networks. For linear systems, time scales are related to the eigenvalues of the dynamical matrix that embodies the pattern of synaptic connectivity. The question of whether behavior is generic can be made precise by drawing these matrices at random from some probability distribution, connecting with the large literature on random matrix theory [Livan et al., 2018, Marino, 2016]. Importantly, we expect that some behaviors in these ensembles of networks become sharp as the networks become large, a result which has been exploited in thinking about problems ranging from energy levels of quantum systems [Wigner, 1951] to ecology [May, 1972] and finance [Bouchaud and Potters, 2003]. Concretely, we represent the activity of each neuron $i = 1, 2, \cdots, N$ by a continuous variable $x_i$, which we might think of as a smoothed version of the sequence of action potentials, and assume a linear dynamics

$$\dot x_i = -x_i + \sum_j M_{ij}\, x_j + \eta_i(t). \qquad (4.1)$$

If the neurons were unconnected (M = 0), their activity x would relax exponentially on a time scale which we choose as our unit of time. In what follows it will be important to imagine that the system is driven, at least weakly, and we take these driving terms to be independent in each cell and uncorrelated in time, where $\langle \eta_i(t)\rangle = 0$ and

$$\langle \eta_i(t)\,\eta_j(t')\rangle = 2\,\delta_{ij}\,\delta(t-t'). \qquad (4.2)$$

The choice of white noise is conventional, but also important because we want to understand how time scales emerge from the network dynamics rather than being imposed upon the network by outside inputs. In linear systems we can rotate to independent modes, corresponding to weighted combinations of the original variables. If the matrix M is symmetric, then the dynamics are described by the relaxation times of these modes,

$$\tau_i \equiv \frac{1}{k_i} = \frac{1}{1-\lambda_i},$$
where $\{\lambda_i\}$ are the eigenvalues of M; the system is stable only if all $\lambda_i < 1$. If the matrix M is chosen from a distribution P(M) then the eigenvalues are random variables, but their density, for example, becomes smooth and well defined in the limit $N \to \infty$,
$$\rho(\lambda) \equiv \lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \delta(\lambda - \lambda_i). \qquad (4.3)$$

$$M_{ii} \sim \mathcal{N}(0, c^2/N), \qquad (4.4)$$
$$M_{ij}\big|_{i\neq j} \sim \mathcal{N}(0, c^2/2N); \qquad (4.5)$$
the factor of N in the variance ensures that the density ρ(λ) has support over a range of eigenvalues that are $\mathcal{O}(1)$ at large N. When we say that we want to search for long time scales, there are two possibilities. One is that we are interested in the single longest time scale, and the other is that we are interested in the full range of time scales. To get at these different questions, we define the longest time scale of the system, $\tau_{\max}$, and the correlation time scale,

Figure 4.1: Schematics. A linear dynamical system with damping and pairwise interaction M has time scales determined by the eigenvalue spectrum of M, especially the gap $g_0$ to the stability threshold. If the system is perturbed (red arrow), the norm activity decays with a longest time scale $\tau_{\max} = 1/g_0$, while the correlations in the unperturbed system decay with a characteristic time scale, defined to be the correlation time $\tau_{\rm corr}$. In cases where the system has a continuous range of long time scales, the correlations decay as a power law (red curve). Systems with long time scales are defined such that $\tau_{\max}$ and $\tau_{\rm corr}$ grow with system size N.

τcorr. We continue to think about the case where M is symmetric, and return to the more general case in the discussion.

Longest time scale. The longest time scale, $\tau_{\max}$, is the time constant given by the slowest mode of the system, which dominates the dynamics after long enough times. This time scale is determined by the gap, $g_0$, between the largest eigenvalue and the stability threshold, which with our choice of units is $\lambda_w = 1$. Mathematically, we define
$$\tau_{\max} \equiv \frac{1}{g_0} = \frac{1}{1-\lambda_{\max}}. \qquad (4.6)$$

In the thermodynamic limit, the gap is taken to be between the stability threshold and the right edge of the support of the spectral density.¹ Correlation time. To get at the correlation time, let's take seriously the idea that the network is driven by noise. Then x(t) becomes a stochastic process, and from Eqs. (4.1) and (4.2) we can calculate the correlation function

$$C_N(t) \equiv \frac{1}{N}\sum_i \langle x_i(0)\,x_i(t)\rangle = \frac{1}{N}\sum_i \frac{1}{1-\lambda_i}\, e^{-(1-\lambda_i)|t|} = \frac{1}{N}\sum_i \tau_i\, e^{-|t|/\tau_i}. \qquad (4.7)$$

The normalized correlation function

$$R_N(t) \equiv \frac{C_N(t)}{C_N(0)} = \frac{\sum_i \tau_i\, e^{-|t|/\tau_i}}{\sum_i \tau_i} \qquad (4.8)$$

has the intuitive behavior of starting at $R_N(0) = 1$ and decaying monotonically. Then there is a natural definition of the correlation time, by analogy with single exponential decays,
$$\tau_{\rm corr}(\{\lambda\}) \equiv \int_0^\infty dt\, R_N(t) = \frac{\sum_i \tau_i^2}{\sum_i \tau_i}. \qquad (4.9)$$
In the thermodynamic limit, the autocorrelation coefficient R(t) and the correlation

time τcorr becomes the ratio of two integrals over the eigenvalue density ρ(λ).

Importantly, $\tau_{\max}$ depends only on the largest eigenvalue, while $\tau_{\rm corr}$ depends on the entire spectrum, and hence can be used to differentiate cases where the system is dominated by a single vs. a continuous spectrum of slow modes. The two time scales satisfy $\tau_{\rm corr} \leq \tau_{\max}$, with the equality assumed only when all eigenvalues are equal, i.e. the spectral density is a delta function at $\lambda = \lambda_{\max}$.

¹We note that this approximation is not ideal, as the spectral distribution of eigenvalues does not converge uniformly. In some cases, the fluctuation of the largest eigenvalue can be more meaningful than the average.

With these definitions, we refine our goal: to find biologically plausible ensembles for the connection matrix M, such that the resulting stochastic linear dynamics has

time scales, τmax and τcorr, that are “long,” growing as a power of the system size N, perhaps even extensively. To avoid fine tuning, we will construct examples of such ensembles by imposing global constraints on measurable observables of the dynamical system. We then compute the spectral density and the corresponding time scales using a combination of mean-field theory and numerical sampling of finite systems.
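As a numerical companion to these definitions, the sketch below integrates the Langevin dynamics of Eqs. (4.1)–(4.2) with an Euler–Maruyama step and computes τ_max and τ_corr directly from the eigenvalues of a symmetric M [Eqs. (4.6) and (4.9)]; the integration step and durations are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(M, T=200.0, dt=0.01):
    """Euler-Maruyama integration of dx/dt = -x + M x + eta, with <eta_i eta_j> = 2 delta_ij delta(t-t')."""
    N = M.shape[0]
    n = int(T / dt)
    x = np.zeros(N)
    traj = np.empty((n, N))
    for t in range(n):
        noise = rng.normal(scale=np.sqrt(2 * dt), size=N)
        x = x + dt * (-x + M @ x) + noise
        traj[t] = x
    return traj

def timescales(M):
    """tau_max and tau_corr from the eigenvalues of a symmetric M [Eqs. (4.6), (4.9)];
    requires all eigenvalues below 1 (a stable system)."""
    lam = np.linalg.eigvalsh(M)
    tau = 1.0 / (1.0 - lam)
    return tau.max(), np.sum(tau**2) / np.sum(tau)
```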

4.3 Time scales for ensembles with different global constraints

To construct linear dynamical systems that generate long time scales without fine tuning individual parameters, we want to find probability distributions P (M) for the connection matrix such that time scales are long, on average. We start with the simplest Gaussian distribution, the GOE above, and gradually add constraints. We will see that for the GOE itself, there is a critical value of the scaled variance

$c_{\rm crit}^2 = 1/2$. For $c < c_{\rm crit}$ the system is stable but time scales are short, while for $c > c_{\rm crit}$ the system is unstable. Exactly at the critical point $c = c_{\rm crit}$ time scales are long in the sense that we have defined, diverging with system size. The essential challenge is to find the weakest constraints on P(M) that make these long time scales generic. Many of the results that we need along the way are known in the random matrix theory literature, but we will arrive at some new theoretical questions.

4.3.1 Model 1: the Gaussian Orthogonal Ensemble

Figure 4.2: Spectral density for the connection matrix M drawn from the Gaussian Orthogonal Ensemble (GOE, a), the GOE with a hard threshold enforcing stability (hard threshold, b), and the GOE with an additional global constraint on the norm activity (soft threshold, d). Three representative parameters are chosen for each ensemble such that the system is subcritical (c = 0.6, red), critical (c = 1/√2, blue), and supercritical (c = 0.8, black). The stability threshold is visualized as the dashed gray line at $\lambda_w = 1$. Panel (c) is a schematic for constraining the averaged norm activity, generating ensembles with the soft threshold.

The simplest ensemble for the interaction matrix M is the Gaussian Orthogonal Ensemble (GOE) without any additional constraints, which has been studied since the beginning of random matrix theory [Wigner, 1951, Dyson, 1962a, Dyson, 1962b,

Livan et al., 2018] and overviewed in Chapter 2.2. Mathematically, we have $M = M^\top$, and
$$P(M) \propto \exp\left[-\frac{N}{2c^2}\,{\rm Tr}\, M^\top M\right]; \qquad (4.10)$$

since c sets the scale of synaptic connections, we will refer to this parameter as the interaction strength. Because the distribution only depends on matrix traces, it is invariant to rotations of M. Equivalently, if we think of decomposing the matrix M into its eigenvectors and eigenvalues, the probability depends only on the eigenvalues. Thus, we can integrate out the eigenvectors, and obtain the joint distribution of eigenvalues,

" # N X 2 1 X PGOE( λi ) exp λ + ln λj λk , (4.11) { } ∝ −2c2 i 2 − i j6=k

where the logarithmic repulsion term emerges from the Jacobian when we change variables from matrix elements to eigenvalues and eigenvectors. The spectral density can then be found using a mean field approximation, which becomes exact as $N \to \infty$. The result is Wigner's well-known semicircle distribution [Wigner, 1951],

$$\rho_{\rm GOE}(\lambda) = \frac{1}{\pi c}\sqrt{2 - \frac{\lambda^2}{c^2}}, \qquad \lambda \in [-\sqrt{2}\,c, \sqrt{2}\,c]. \qquad (4.12)$$

Equation (4.12) for the spectral density, together with Fig 4.2a, shows that there

is a phase transition at ccrit = 1/√2. If the interaction strength is greater than this critical strength (supercritical), then λmax > 1 and the system becomes unstable. If the interaction strength is smaller (subcritical), then the gap size between the largest eigenvalue and the stability threshold λ = 1 is of order 1, and the time scales remain finite as system size increases. The only case when the system has slow modes is at the critical value of the interaction strength, c = ccrit = 1/√2, where the spectral density becomes tangential to the stability threshold.

At criticality, corresponding to the blue curve in Fig. 4.2a, we can estimate the size of the gap by asking that the gap be large enough to contain ∼1 mode, that is

$$N\int_{1-g_0}^{1} d\lambda\, \rho(\lambda) \sim 1. \qquad (4.13)$$

With $\rho(\lambda) \sim (1-\lambda)^{1/2}$, this gives $N g_0^{3/2} \sim 1$, or $g_0 \sim N^{-2/3}$. Thus the longest time scale grows with system size, $\tau_{\max} \sim N^{2/3}$. In the same way, we can estimate the full correlation function, and from it the correlation time [Eq. (4.7)]:

$$C_N(t) = \frac{1}{N}\sum_i \tau_i\, e^{-t/\tau_i} = \int d\lambda\, \frac{\rho(\lambda)}{1-\lambda}\, e^{-(1-\lambda)|t|} \qquad (4.14)$$
$$\sim \int d\lambda\, \frac{1}{(1-\lambda)^{1/2}}\, e^{-(1-\lambda)|t|} \sim |t|^{-1/2}. \qquad (4.15)$$

This has the power–law behavior expected for a critical system, where there is a continuum of slow modes. Note that in the GOE system, slow modes with time scales growing as system size are only possible at a single value of the interaction strength. Nonetheless, we need to distinguish the fine tuning here as happening at an ensemble level, which is different from the element-wise fine tuning that might have been required if we considered particular interaction matrices.
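These scalings are easy to check numerically: sample symmetric matrices with the variances of Eqs. (4.4)–(4.5) and track the gap to the stability threshold as N grows. The sketch below, with illustrative ensemble sizes, should show the average gap shrinking roughly as N^(−2/3) at c = 1/√2.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_goe(N, c):
    """Symmetric matrix with Var(M_ii) = c^2/N and Var(M_ij) = c^2/(2N), Eqs. (4.4)-(4.5)."""
    M = rng.normal(scale=c / np.sqrt(2 * N), size=(N, N))
    M = np.triu(M, 1)
    M = M + M.T                                                        # off-diagonal variance c^2/(2N)
    M[np.diag_indices(N)] = rng.normal(scale=c / np.sqrt(N), size=N)   # diagonal variance c^2/N
    return M

def gap_scaling(c=1 / np.sqrt(2), sizes=(100, 200, 400, 800), n_samples=20):
    """Average gap g0 = 1 - lambda_max; at criticality this should shrink roughly as N^(-2/3)."""
    for N in sizes:
        g0 = np.mean([1.0 - np.linalg.eigvalsh(sample_goe(N, c)).max()
                      for _ in range(n_samples)])
        print(N, g0)
```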

4.3.2 Model 2: GOE with hard stability threshold

Drawing interaction matrices from the GOE leads to long time scales only at a critical value of interaction strength. Can we modify the ensemble such that long time scales can be achieved without this fine tuning? In particular, in the GOE, if the interaction strength is too large, then the system becomes unstable. What will the spectral

distribution look like if we allow $c > c_{\rm crit}$ but edit out of the ensemble any matrix that leads to instability? Mathematically, a global constraint on the system stability requires all eigenvalues to be less than the stability threshold, $\lambda_w = 1$. This modifies the eigenvalue distribution with a Heaviside step function:

$$P_{\rm hard}(\{\lambda_i\}) \propto P_{\rm GOE}(\{\lambda_i\}) \prod_i \Theta(1-\lambda_i). \qquad (4.16)$$

Conceptually, what this model does is to pull matrices out of the GOE and discard them if they produce unstable dynamics; the distribution $P_{\rm hard}(\{\lambda\})$ describes the matrices that remain after this editing. Importantly we do not introduce any extra

structure, and in this sense Phard is a maximum entropy distribution, as discussed more fully below.

The spectral density ρ(λ) that follows from Phard was first found by Dean and Majumdar [Dean and Majumdar, 2006, Dean and Majumdar, 2008]. Again, there is a phase transition depending on the interaction strength. For ensembles with inter- action strength less than the critical value ccrit = 1/√2, the stability threshold is away from the bulk spectrum, so the spectral density remains as Wigner’s semicircle. On the other hand, if the interaction strength is greater than the critical value, the spectral density becomes

$$\rho(\lambda) = \frac{1}{c^2}\,\frac{\sqrt{\lambda + l_* - 1}}{2\pi\sqrt{1-\lambda}}\,\big(l_* - 2\lambda\big), \qquad (4.17)$$

where
$$l_* = \frac{2}{3}\left(1 + \sqrt{1 + 6c^2}\right).$$

As shown in Fig. 4.2b, the stability threshold acts as a wall, pushing the eigenvalues to pile up. More precisely, near the stability threshold λ = 1 we have $\rho(\lambda) \sim (1-\lambda)^{-1/2}$,

61 which [by the same argument as in Eq (4.13)] indicates that the longest time scale

increases with system size as $\tau_{\max} \sim N^2$. The autocorrelation function also is dominated by the eigenvalues close to the stability threshold. The calculation is a bit subtle, however, since

$$C(t) = \int d\lambda\,\frac{\rho(\lambda)}{1-\lambda}\, e^{-(1-\lambda)|t|} \sim \int dk\, k^{-1/2}\, k^{-1}\, e^{-k|t|}$$

is not integrable. After introducing an IR cut-off at $\varepsilon \sim g_0 \sim N^{-2}$, we can write the resulting autocorrelation coefficient as

$$R(t) = \frac{C(t)}{C(0)} = 1 - \sqrt{\pi}\,(\varepsilon t)^{1/2} + \varepsilon t + \mathcal{O}\big((\varepsilon t)^2\big). \qquad (4.18)$$

The correlation time

$$\tau_{\rm corr} = \frac{\int_\varepsilon dk\,\rho(k)\,k^{-2}}{\int_\varepsilon dk\,\rho(k)\,k^{-1}} \sim \frac{\int_\varepsilon dk\, k^{-5/2}}{\int_\varepsilon dk\, k^{-3/2}} \sim \varepsilon^{-1} \sim N^2. \qquad (4.19)$$

We see that for supercritical systems, both the longest time scale τmax and the

correlation time $\tau_{\rm corr}$ increase as a power of the system size; the growth is even faster than for the system at criticality. In fact, there are divergently many slow modes. Meanwhile, the interaction strength can take a range of values, as long as it is greater than a certain threshold. Thus it would seem that we have overcome the fine tuning problem! In fact, we cannot quite claim that the problem is solved. First, in the supercritical phase, the correlation function does not decay as a power law. Instead, the correlation

function stays at 1 for a time period τcorr τmax, and then decays exponentially. This ∼ means the system has a single long time scale, rather than a continuous spectrum of slow modes. Second, in order for a system to impose a hard constraint on its stability, it needs to measure its stability. Naively, checking for stability, especially

62 in the presence of slow modes, requires access to infinitely long measuring times; implementing a sharp threshold may also be challenging.

4.3.3 Model 3: Constraining mean-square activity

While it can be difficult to check for stability, it is much easier to imagine checking the overall level of activity in the network. One can even think about mechanisms that would couple indirectly to activity, such as the metabolic load. If the total activity is larger than some target level, the system might be veering toward instability, and there could be feedback mechanisms to reduce the overall strength of connections. Regula- tion of this qualitative form is known to occur in the brain, and is termed synaptic scal- ing [Turrigiano et al., 1998, Turrigiano and Nelson, 2004, Abbott and Nelson, 2000]; this is hypothesized to play an important role in maintaining persistent neural activ- ities [Renart et al., 2003, Tetzlaff et al., 2013]. In this section, we construct the least structured distribution P (M) that is consistent with a fixed mean (square) level of activity, which we can think of as a soft threshold on stability, and derive the density of eigenvalues that follow from this distribution. In the following section we discuss possible mechanisms for a system to generate matrices M, dynamically, out of this ensemble.

The spectral density

It is useful to remember that the GOE, Eq (4.10), can be seen as the maximum entropy distribution of matrices consistent with some fixed variance of the matrix elements Mij [Jaynes, 1957, Press´eet al., 2013]. If we want to add a constraint, we can stay within the maximum entropy framework, and in this way we isolate the generic consequences of this constraint: we are constructing the least structured ensemble of networks that satisfies the added condition.

63 We recall that if we want to constrain the mean values of several functions fµ(M), then the maximum entropy distribution has the form

" # 1 X P (M) = exp g f (M) . (4.20) Z µ µ − µ

In our case are are interested in the mean–square value of the individual matrix elements, and the mean–square value of the activity variables xi. But our basic model Eqs (4.1) and (4.2) predicts that

1 X 2 1 X 1 µ = xi = , (4.21) N h i N 1 λi i i − so the relevant maximum entropy model becomes

" # 1 N 1 | X P (M) = exp 2 TrM M Nξ . (4.22) Z −2c − 1 λi i −

Again, this distribution is invariant to orthogonal transformation. After the integra- tion over the rotation matrices, we have

" # N X 2 1 X X 1 P ( λi ) exp 2 λi + ln λj λk Nξ , (4.23) { } ∝ −2c 2 − − 1 λi i j6=k i −

where the scaling Nξ ensures that all the terms in the exponent are N 2, so there ∼ will be a well defined thermodynamic limit. Luckily, the same arguments that yield the exact density of eigenvalues in the Gaussian Orthogonal Ensemble also work here (see ), and we find

1 ρ(λ) = p B(λ), π (λ 1 + g0 + l)(1 g0 λ) − − − " 2   2 2 # l 1 λ λ ξ (2g0 2g0 + l 2g0l) (2g0 + l)λ B(λ) = 1 + + 1 g0 l + − − − , 2 2 2 p 2 8c − − 2 c − c 2 g0(g0 + l)(λ 1) − (4.24) 64 where the gap size g0 and the width of the support l are fixed by setting the spectral density at the two ends of the support zero. To our surprise, we find a finite gap for all ξ > 0. This means that there exists a maximum time scale even when the system is infinitely large. This upper limit of longest time scale depends on the Lagrange multiplier ξ and the interaction strength c. Because the Lagrange multiplier ξ is used to constrain the averaged norm activity µ, the maximum time scale is set by the allowed norm of the activity µ, measured in units of expected norm for independent neurons. As we explain below, the greater dynamic range the system can allow leads to longer time scales. The dependence of the gap on the Lagrange multiplier ξ is shown in Fig 4.2d at each of several fixed values of c; as before there is a phase transition at ccrit = 1/√2. This is understandable, since in the limit of ξ = 0 we recover the hard wall case. For small ξ, the spectrum is similar to the hard wall case, with the eigenvalues close to the stability threshold pushed into the bulk spectrum; for large ξ, the entire spectrum is pushed away from the wall. A closer look at the longest time scale τmax vs. Lagrange multiplier in Fig. 4.3a confirms that amplification of time scales occurs only when ξ < 1, corresponding to an amplification of mean–square activity µ.

The scaling of time scales in three phases

We now discuss the dependence of time scales on the interaction strength c. In contrast to the ensemble with a hard stability threshold, we find a finite gap for all values of interaction strength c, but the scaling of time scales vs. the Lagrange multiplier (and hence the mean–square activity) is different in the different phases. For the subcritical and critical phases, the results are as expected from the hard wall case. In the subcritical phase, as ξ 0, the spectral distribution converges → smoothly to the familiar semicircle. The time scales and the mean–square activity

both approach constants. On the other hand, when c = ccrit, we find that the longest

65 −2/5 −1/5 time scale grows as τmax ξ , and the correlation time scale τcorr ξ , i.e. ∼ ∼ both time scales can be large if ξ is small enough; meanwhile, the norm activity µ approaches a constant value of µ = 2. This suggests that if a system is poised at criticality, then the system can exhibit long time scales, even when the dynamic range of individual components is well controlled. The autocorrelation function exhibits a power law-like decay, as expected; see the blue curve in Fig 4.3f. The most interesting case is the supercritical phase, where the interaction strength c > ccrit = 1/√2. As ξ 0, the spectrum does not converge to the spectrum with → the hard constraint. We find that both time scales and the norm activity increase as

−2/3 power laws of the Lagrange multiplier ξ (Fig 4.3), with τmax 3τcorr ξ , and ≈ ∼ µ ξ−1/3. This implies that the time scales grow as a power of the allowed dynamic ∼ range of the system, although not with the size of the system. The question of whether the resulting time scales are “long” then becomes more subtle. Quantitatively, we see from Fig 4.3c, that if the system has an allowed dynamic range just 10 that of × 4 independent neurons, the system can generate time scales τmax almost 10 longer × than the time scale of isolated neurons. Interestingly, once the system is in the supercritical phase, the ratio of amplifica- tion has only a small dependence on the interaction strength c (Fig 4.3e). Intuitively, while an increasing interaction strength c implies that without constraints more modes will be unstable, while with constraints more modes concentrate near the stability threshold, but the entire support of the spectrum also expands, so the density of slow modes and their distance to the stability threshold remain similar. This is perhaps another indication for long time scales without fine tuning when the system uses its norm activity to regulate its connection matrix. We note that, although both the critical phase and the supercritical phase can reach time scales that are as long as the dynamic range allows, there are significant differences between the two phases. One difference is that in the critical phase, locally

66 the dynamic range for each neuron can remain finite, while for the supercritical phase, the variance of activity for individual neurons can be much greater. Moreover, as shown by Fig. 4.3f, systems in the supercritical phase are dominated by a single slow mode, rather than by a continuous spectrum of slow modes. While the autocorrelation function decays as a power law in the critical phase, in the supercritical phase, it holds at R(t) = 1 for a much longer time compared to the subcritical case, but then decays exponentially. While a single long time scale can be achieved without fine tuning, it seems that a continuous spectrum of long time scales is much more challenging.

Finite Size Effects

If we want these ideas to be relevant to real biological systems, we need to understand what happens at finite N. We investigate this numerically using direct Monte-Carlo sampling of the (joint) eigenvalue distribution. As the system size grows the time scales τmax and τcorr also grow, up to the upper limit given by the mean field results; see Fig 4.4. Finite size scaling is difficult in this case, since the scaling exponent α for the gap difference,

α ∆g0(N) g0(N) g0 N , (4.25) ≡ − ∼ depends on the Lagrange multiplier ξ (Fig 4.4e,f ). In particular, the scaling interpo-

−2 lates between two limiting cases: for small ξ, the gap scales as ∆g0 N , as in the ∼ −2/3 universality class of the hard threshold; for large ξ, the gap scales as ∆g0 N , ∼ which is in the universality class of the Gaussian ensemble without any additional constraint. In any case, thousands of neurons will be well described by the N → ∞ limit.

Distribution of matrix elements

Now that we have examples of network ensembles that exhibit long time scales, we need to go back and check what these ensembles predict for the distribution of individ- 67 Figure 4.3: Mean field results for Model 3. The longest time scale τmax (a) and the correlation time scale τcorr (b) increase with different scaling as the Lagrange multiplier ξ decreases, corresponding to an increasing value for the constrained averaged norm activity, µ (c,d). The exact scaling and the amplification of time constant depending on whether the system is subcritical (with interaction strength c < 1/√2), critical (c = 1/√2), or supercritical (e). Despite the supercritical phase exhibit long time scales, the autocorrelation function decays as a power law only at c = ccrit, and as exponential for other values of interaction strength (f). For this panel, ξ = 10−10, red curves are for c = 0.2, c = 0.6, blue curve is at c = ccrit, and black curves are at c = 1 and c = 3. 68 Figure 4.4: Finite size effects on the time scales τmax and τcorr for Model 3, interaction strength c = 1 (the supercritical phase). Results from direct Monte Carlo sampling of the eigenvalue distribution are plotted together with the mean field results. There is no universal exponent that explains the convergence of τmax when system size increases (e), and apparent exponents α depend on the Lagrange multiplier ξ. Panel f shows that the apparent α depends on the maximum system size used in the fitting, interpolating between α = 2 and α = 2/3.

69 ual matrix elements. In particular, because we did not constrain the self interaction

Mii to be 0, we want to check whether the long time scales emerge as a collective behavior of the network, or trivially from an effective increase of the intrinsic time

eff scales for individual neurons, τ = 1/(1 Mii ) = 1/(1 λ ). A similar ques- ind − h i − h i tion arises in real networks, where there have been debates about the importance of feedback within single neurons vs. the network dynamics in maintaining long time scales; see Ref [Major and Tank, 2004] for a review. We confirm that, at least in our models, this is not an issue: the constraint on the norm activity, in fact, pushes the average eigenvalues to be negative, and hence the effective self interaction for individual neurons actually leads to a shorter intrinsic time scale. We can impose additional constraints on the distribution of matrix elements, such that Mii = 0; h i for this ensemble, we can again solve for the spectral distribution, and we find that scaling behaviors described above don’t change.

4.4 Dynamic tuning

So far, we have established that a distribution constraining the norm activity can lead generically to long time scales, but we haven’t really found a mechanism for implementing this idea. But if we can write the distribution of connection matrices M in the form a Boltzmann distribution, we know that we can sample this distribution by allowing the matrix elements be dynamical variables undergoing Brownian motion in the effective potential. We will see that this sort of dynamics is closely related to previous work on self–tuning to criticality [Magnasco et al., 2009], and we can interpret the dynamics as implementing familiar ideas about synaptic dynamics, such as Hebbian learning and metaplasticity.

70 We can rewrite our model in Eq (4.22) as

" # 1 N 1 | X P (M) = exp 2 TrM M Nξ Z −2c − 1 λi i − 1 = exp [ V (M)/T ] (4.26) Z − 1 V (M) = TrM |M + c2ξTr(1 M)−1, (4.27) 2 − with a temperature T = c2/N. The matrix M will be drawn from the distribution P (M), as M itself performs Brownian motion or Langevin dynamics in the potential V (M):

˙ ∂V (M) τM M = + ζ(t) − ∂M (4.28) = M c2ξ(1 M)−2 + ζ(t), − − − where the noise has zero mean, ζ = 0, and is independent for each matrix element, h i

0 0 ζij(t)ζkl(t ) = 2T τM δikδjlδ(t t ). (4.29) h i −

It is useful to remember that, in steady state, our dynamical model for the xi , { } Eqs (4.1) and (4.2), predicts that

−1 xixj = [(1 M) ]ij. (4.30) h i −

This means that we can rewrite the Langevin dynamics of M, element by element, as

˙ 2 −2 τM Mij = Mij c ξ[(1 M) ]ij + ζij(t) − − − 2 = Mij c ξ xixk xkxj + ζij(t). (4.31) − − h ih i

71 Because the xi are Gaussian, we have

1 xixk xkxj = ( xixkxkxj xixj xkxk ) , (4.32) h ih i 2 h i − h ih i where as above the summation over the repeated index k is understood, so that

 !  1 X 2 X 2 xixk xkxj = xixj x x . (4.33) h ih i 2 k − h ki k k

We now imagine that the dynamics of M is sufficiently slow that we can replace averages by instantaneous values, and let the dynamics of M do the averaging for us. In this approximation we have

! ˙ 1 2 X 2 τM Mij = Mij c ξxixj x θ + ζij(t), (4.34) − − 2 k − k

where the threshold θ = P x2 . kh ki The terms in this Langevin dynamics have a natural biological interpretation.

First, the connection strength decays with an overall time constant τM . Second, the

synaptic connection Mij is driven by the correlation between pre– and post–synaptic

activity, xixj, as in Hebbian learning [Hebb, 1949, Magee and Grienberger, 2020]. ∼ In more detail, we see that the response to correlations is modulated depending on whether the global neural activity is greater or less than a threshold value; if the network is highly active, then the connection between neurons with correlated activity will decrease, i.e. the dynamics are anti-Hebbian, while and if the overall network is quiet the dynamics are Hebbian. We still have the problem of setting the threshold θ. Ideally, for the dynamics to generate samples out of the correct P (M), we need

∗ | θ = θ x x s.s. = Nµ(c, ξ), (4.35) ≡ h i

72 Figure 4.5: A neural network can tune its connection matrix to the ensemble with slow modes using simple Langevin dynamics. Two candidates for the dynamics in- clude one with a fixed threshold on the averaged neural activity, and one with a sliding threshold. (a) Average λmax. With fixed threshold (left), the system is stable only if the updating time scale τM is long enough, and if the fixed threshold for the norm activity, θe, is small enough, thus requiring fine tuning. With a sliding threshold (right), the system is stable for a large range of τθ.(b) The spectral distribution for connection matrix drawn from the dynamics with fixed threshold approaches the static distribution as the time constant τM increases. (c) If the connection matrix updates too fast, i.e. τM is too small, the system exhibits quasi-periodic oscillations, and does not reach a steady state distribution. In contrast, long τM leads to the adi- abatic approximation for the steady state distribution. (d) The expected eigenvalues for the dynamics with fixed threshold, and for the ones with sliding threshold, wiith −5 N = 32, c = 1, ξ = 2 and τM = 1000. (e) Example traces for the eigenvalues vs. time.

73 where as above µ is the mean activity whose value is enforced by the Lagrange mul- tiplier ξ. This means that θ needs to be tuned in relation to ξ, and it is challenging to have a mechanism that does this directly, and just pushes the fine tuning problem back one step. Importantly, if θ = θ∗ then the steady state spectral density of the connection matrix approaches the desired equilibrium distribution as the update time constant increases (Fig 4.5b), but if θ deviates from θ∗ then the steady distribution does not have slow modes. As shown in Fig 4.5de, if the threshold is too small, then the entire spectrum is shifted away from the stability threshold, and the system no longer exhibits long time scales; if the threshold is too large, then the largest eigen- value oscillates around the stability threshold λ = 1, and a typical connection matrix drawn from the steady state distribution is unstable. But we can once again relieve the fine tuning problem by promoting θ to a dy- namical variable, ˙ X 2 τθθ = x θ, (4.36) k − k which we can think of as a sliding threshold in the spirit of models for metaplastic- ity [Bienenstock et al., 1982]. We pay the price of introducing yet another new time scale, τθ, but in Figs 4.5de we see that this can vary over at least three orders of magnitude without significantly changing the spectral density of the eigenvalues. To see whether the sliding threshold really works, we can compare results where

θ is fixed to those where it changes dynamically; we follow the mean value of λmax as an indicator of system performance. We choose parameters in the supercritical phase, specifically c = 1 and ξ = 2−5, and study a system with N = 32. Figure 4.5a shows that with fixed threshold, even in the adiabatic limit where τM 1, there is  only a measure-zero range for the fixed threshold θe such that λmax is very close to, but smaller than 1. In contrast, for the dynamics with sliding threshold, at τM 1  there is a large range of values for the time constant τθ such that the system hovers just below instability, generating a long time scale. 74 The Langevin dynamics of our system is similar to the BCM theory of meta- plasticity in neural networks, in that both models involve a combination of Hebbian learning and a threshold on the neural activity [Bienenstock et al., 1982], but there are two key differences. First, the BCM theory imposes a threshold on the activity of locally connected neurons, while here the threshold is on the overall neural activity. Second, the BCM dynamics is Hebbian when the post– and pre–synaptic activities are larger than the threshold, and anti–Hebbian otherwise, which is the opposite of the dynamics for our system. It is interesting that in some other models for home- ostasis, plasticity requires the activity detection mechanism to be fast (τθ/τM 1)  for the system to be stable [Zenke et al., 2013, Zenke et al., 2017], which we do not observe for our system.

4.5 Discussion

Living systems can generate behaviors on time scales that are much longer than the typical time scales of their component parts, and in some cases correlations in these behaviors decay as an approximate power–law, suggesting a continuous spetcrum of slow modes. In order to understand how surprised we should be by these behaviors, it is essential to ask whether there exist biologically plausible dynamical systems that can generate these long time scales “easily.” Typically, to achieve long time scales a high dimensional dynamical system requires some degree of fine tuning: from the most to the least stringent, examples include setting individual elements of the con- nection matrix, choosing a particular network architecture, or imposing some global constraints which allow ensembles of systems with long time scales. In this note, we were able to construct a mechanism for living systems to reach long time scales with the least stringent fine-tuning condition: when the interaction strength of the connection matrix is large enough, imposing a global constraint on the

75 stability of the system leads to divergent many slow modes. To impose a biologically plausible mechanism for living systems, we constrain the averaged norm activity as a proxy for global stability; in this case, the time scales for the slow modes are set by the allowed dynamic range of the system. Further, we showed that these ensembles can be achieved by updating the connection matrix M with a sliding threshold for the norm activity, a mechanism that resembles metaplasticity in neural networks. Importantly, the slow modes achieved through constraining norm activity typically lead to exponentially decaying correlations; only when the interaction strength of the matrix is at a critical value do we find power-law decays. This suggest that a continuous range of slow modes is more difficult to achieve than a single long time scale. A natural follow-up question is whether there exist mechanisms which can tune the system to criticality in a self-organized way, for example by coupling the interaction strength to the averaged norm activity. Both for simplicity and to understand the most basic picture, we have been fo- cusing on linear networks with symmetric connections. Realistically, many biological networks are asymmetric, which gives rise to more complex and perhaps even chaotic dynamics [Sompolinsky et al., 1988, Vreeswijk and Sompolinsky, 1996]. In the asym- metric case, the eigenvalue spectrum for the Gaussian ensemble is well known (uniform distribution inside a unit circle) [Ginibre, 1965, Forrester and Nagao, 2007], but a similar global constraint on the norm activity leads to a dependence on the overlap among the left and right eigenvectors [Chalker and Mehlig, 1998, Mehlig and Chalker, 2000]. In particular, the matrix distribution can no longer be separated into the product of eigenvalues and eigenvectors, and it is difficult to solve for the spectral distribution analytically. Two new features emerge in the asymmetric case. First, the time scales given by real eigenvalues vs. complex eigenvalues may be different, leading to more (or less) dominant oscillatory slow modes in large systems [Akemann and Kanzieper, 2007]. Second, asymmetric connection matrices

76 can lead to complicated transient activity when the system is perturbed, with the time scales mostly dominated by the eigenvector overlaps, and can be very different from the time scales given by the eigenvalues [Grela, 2017]. In the limit of strong asymmetry, the network is organized into a feed-forward line, information can be held for a time that is extensive in system size; see examples in Refs. [Goldman, 2009] and [Ganguli et al., 2008]. It will be interesting to check whether systems can store information in these transients without fine-tuning the structure of the network. The system we study can be extended to consider more specific ensembles for par- ticular biological systems. For example, real neural networks have inhibitory and exci- tatory neurons, so that elements belonging to the same column need to share the same sign, and the resulting spectral distribution has been shown to differ from the unit sphere [Rajan and Abbott, 2006]; more generally recent work explores how structured connectivity can lead to new but still universal dynamics [Tarnowski et al., 2020]. An- other limitation of our work is that it only considers linear dynamics, or only dynami- cal systems where all fixed points to be equally likely. In contrast, some non-linear dy- namics such as the Lotka-Volterra model in ecology [Biroli et al., 2018], and the gat- ing neural network in [Can et al., 2020, Krishnamurthy et al., 2020] have been shown to drive systems to non-generic, marginally stable fixed points, around which there exists an extensive number of slow directions for the dynamics. In summary, we believe the issue of whether a continuum of slow modes can arise generically in neural networks remains open, but we hope that our study of very simple models has helped to clarify this question.

77 Chapter 5

Conclusion and Outlook

Living systems often are found to perform incredible tasks, such as collective motion and foraging on the organism level, and cognitive activity in our brains. Naively, if one would take apart a living system, and then randomly couple the individual components back together without any design principles, one does not expect the resulting system will be as intelligent or even just animate as before. Thus, one may regard the living systems as a surprising result of emergent phenomena. On the other hand, even in the inanimate world, the rise of macroscopic features of materials – such as the spontaneous magnetization in ferromagnetic materials – from microscopic interaction is also surprising, although these can be understood now using statistical physics. When we compare the animate and inanimate systems, shall we be more surprised by one compared to the other? More precisely, can we understand the collective behavior of living systems as an extension of inanimate physical systems, or is there something unique about biology, such that “more is different” takes its literal meaning? These questions need to be addressed by first examining the collective behavior of living systems in the framework of statistical physics. This dissertation was born thanks to the continuous effort of both experimentalists and theorists in developing new technology to collect and understand the emergence

78 of collective behavior in living systems. Although it mainly focuses on systems of in- terconnected neurons, the approaches developed by this dissertation can be extended to other living systems at different scales, such as in groups of interacting animals. The first part of the dissertation extended a statistical physics framework of con- structing probability model for the activity of large groups of neurons. In particular, we examine the collective behavior in the neural network of the nematode C. elegans in Chapter 3. Through analyzing data from real neurons, it extended the maximum entropy model which matches lower orders of statistics of neural activity, a method that has been successful in spiking networks, to describe this very different neural network with very small numbers of neurons and graded electrical activities, and successfully identified features of collective behavior, such as the inferred model has parameters near a critical surface. Importantly, this research also leads to testable hypothesis on how the network would react to local perturbations. Currently, there are plans to test this hypothesis, which will further facilitate our understanding of how well statistical physics models can be used to describe this neural network, and how out-of-equilibrium the real system is when being perturbed. A natural question resulted from this work in C. elegans is how finely the system need to be tuned to appear at criticality. For example, for a worm to develop from an embryo with a functional brain, does C. elegans need to encode all information about the strength of neuronal interaction in its genome, or is a stochastic method for development enough? Currently, several research groups are developing techniques to identify neurons across individual worms, which will create an exciting opportunity to test this idea by comparing the inferred statistical models across animals, and to construct an ensemble for the neural networks of the worm. In the case that the precise interaction strengths were shown to be essential, a more urgent question then need to be asked on how the brain with such fine-tuned parameters is developed

79 through , and how we can be inspired to design artificial neural networks with optimized performance. The question of fine-tuning was then further addressed in Chapter 4, where we focused on the emergent long time scales in large dynamical systems. In biological systems, often we observe long time scales in the behavior; here, we probe the question of how surprising we should be by examining the conditions for a dynamical system with random interaction to exhibit long time scales. In particular, we found that it is possible for the neural network use a self-adaptive mechanism to achieve a single long time scales, but to have a continuous spectrum of long time scales requires criticality and is harder to achieve. This offers another way of thinking about what we see in biological systems and how surprised we should be. These questions relating to the temporal evolution of biological networks and their interaction with the environment are not limited to systems of interconnected neurons. More generally, for example, these questions can be studied in collective animal behaviors, where the “group” is already constantly changing – an example is that a flock of birds can spontaneously be broken into two and recombine – and the system can be systematically perturbed by researchers. Some interesting questions include when an individual is introduced to an already formed group, how does it adapt, and how does the rest of the network respond? If the hypothesis that biological systems would like to be poised near criticality is correct, how do the individuals tune its interaction when either the system or the environment is changing to maintain the criticality? The hope is that by studying the dynamics of collective behaviors across different systems, it may give us insights on the evolution of biological networks in general, and some clues of how to design self-assembling networks.

80 Appendix A

Appendices for Chapter 3

A.1 Perturbation methods for overfitting analysis

To test if our maximum entropy model overfits, we partition the samples into a set of training data and a set of test data. The difference of the per-neuron log-likelihood for the training data and the test data is used as a metric of whether the model overfits: if the two values for the log-likelihood are equal within error bars, then the model generalizes well to the test data and does not overfit. Here, we outline a perturbation analysis which uses the number of independent samples and the number of parameters of the model to estimate the expectation value of this log-likelihood difference.

Consider a Boltzmann distribution parameterized by g = g1, g2, . . . , gm acting on observables φ1, φ2, . . . , φm. The probability for the N spins taking the value σ =

σ1, σ2, . . . , σN is m ! 1 X P (σ g) = exp g φ (σ) , (A.1) Z(g) i i | − i=1

81 where Z is the partition function. Then, the log-likelihood of a set of data with T samples under the Boltzmann distribution parameterized by g is

T 1 X L(σ1, σ2, . . . , σT g) = log P (σt g) | T | t=1 (A.2) m T ! X 1 X = log Z(g) g φt i T i − − i=1 t=1

Now, let us assume that a set of true underlying parameters, g∗ , exists for the { } ∗ ∗ system we study, which leads to a true expectation value be fi = fi(g ). However, we are only given finite number of observations, σ1, σ2, . . . , σT , from which we con- struct a maximum entropy model, i.e. infer the parameters gˆ by maximizing the { } likelihood of the data. Our hope is that the difference between the true parameters and the inferred parameters is small, in which case we can approximate the inferred parameters using a linear approximation

∗ gi = gi + δgi, (A.3)

X ∂gi X where δg δf = χ δf . (A.4) i ∂f j eij j ≈ j j − j

Here, χ is the inverse of the susceptibility matrix χij = ∂fi/∂gj = φiφj φi φj ; e − h i − h ih i and δfj is the difference between empirical mean and the true mean of φj,

T 1 X δf = φ (σt) f ∗ (A.5) j T j j t=1 −

t t For convenience, we will use short-hand notation φi(σ ) = φi to indicate the value of the observable φi at time t.

Let the number of samples in the training data be T1, and the number of samples in the test data be T2. For simplicity, assume that all samples are independent. We maximize the entropy of the model on only the training data to obtain parameters

82 gˆ , and we would like to know how well our model generalize to the test data. Thus, { } we quantify the degree of overfitting by the difference of likelihood of the training data and the test data:

" m T2 !# " m T1 !# X 1 X t0 X 1 X t Ltest Ltrain = log Z(ˆg) gˆi φi log Z(ˆg) gˆi φi − − − T2 − − − T1 i=1 t0=1 i=1 t=1

m T1 !! T1 T2 ! X ∗ X 1 X t ∗ 1 X t 1 X t0 = gi χeij φj fj φi φi . − T1 − T1 − T2 i=1 j t=1 t=1 t0=1 (A.6)

For simplicity of notation, let us write

T1 T2 (1) 1 X (2) 1 X α = φt f ∗ , α = φt f ∗ . (A.7) i T i i i T i i 1 t=1 − 2 t=1 −

(1) (2) By the Central Limit Theorem, αi and αi are Gaussian variables. Terms that appear in the likelihood difference [Eq. (A.6)], have expectation values

(1) ((1) (1) 1 αi = 0 , αi αj = χij . (A.8) h i h i T1

In addition, because we assume that the training data and the test data are indepen- dent, the cross-covariance between the training data and the test data is

α((1)α(2) = 0 . (A.9) h i j i

83 Combining all the above expressions, we obtain the expectation value of the like- lihood difference [Eq. (A.6)],

m ! D X ∗ X (1)  (1) (2) E Ltest Ltrain = gi χeijαj αi αi h − i i=1 − j − m m X X (1) (1) = χij α α − e h i j i i=1 j=1 (A.10) m m 1 X X = χ χ T eij ij − 1 i=1 j=1 m = −T1

Note that the difference of likelihood is only related to the number of parameters in our model and the number of independent samples in the training data. Similarly, we can evaluate the variance of the likelihood difference to be

  2 X ∗ ∗ 1 1 (Ltest Ltrain) = gi gkχik + h − i T1 T2 i,k

1 2 m + 2 (m + 2m) + (A.11) T1 T1T2 using Wick’s theorem for multivariate Gaussian variables and chain rules of partial derivatives. In order to test whether perturbation theory can be applied to the maximum entropy model learned from the real data, we estimate the number of independent samples using Nind. sample T/τ, where T is the length of the experiment and τ is ∼ the correlation time. The correlation time is extracted as the decay exponent of the overlap function, defined to be

N D 1 X E q(∆t) = δσi(t)σi(t+∆t) , (A.12) N t i=1

84 In our experiment, the correlation time is τ = 4 6s. For a typical recording of 8 ∼ minutes, the number of independent samples is between 80 and 120. In Figure 3.5, we compute the perturbation results using the number of non-zero parameters after the training and the number of independent samples estimated from the data. The prediction is within the error bar from the data, which suggests that the inferred coupling is within the perturbation regime of the true underlying coupling.

Note that the plotted difference is computed for the per-neuron log-likelihood, ltest − ltrain = (Ltest Ltrain)/N. −

A.2 Maximum entropy model with the pairwise

correlation tensor constraint

To fully describe the pairwise correlation between neurons with p = 3 states, the equal-state pairwise correlation cij = δσ σ is not enough; rather, we should constrain h i j i the pairwise correlation tensor, defined as

rs c δσ rδσ s . (A.13) ij ≡ h i j i

rs Here, we constrain the pairwise correlation tensor cij together with the local mag-

r netization m δσ r . Notice that for each pair of neurons (i, j), the number of i ≡ h i i constraints are p2 + 2p = 15, but these constraints are related through normaliza-

P r P rs r tion requirements, r mi = 1 and s cij = mi , which leads to only 7 independent variables for each pair of neurons. Because of this dependence, choosing which vari- ables to constraint is a problem of gauge fixing. Here, we choose the gauge where we

r constrain the local magnetization mi for states “rise” and “fall”, and the pairwise cor- relations cr crr; in this gauge the parameters can be compared meaningfully to the ij ≡ ij equal-state maximum entropy model above. The corresponding maximum entropy

85 model has the form

3 2 ! 1 X X r X X r P (σ) exp J δσ rδσ r h δσ r (A.14) ∝ −2 ij i j − i i i6=j r=1 i r=1

Note that the equivalence between constraining the equal-state correlation for each state and constraining the full pairwise correlation tensor only holds for the case of p = 3. For p > 3 states, one need to choose more constraints to fix the gauge, and it is not obvious which variables to fix. We train the maximum entropy model with tensor constraint [Eq. (A.14)] with the same procedure as the model with equal-state correlation constraint, described in the main text. The model is able to reproduce the constraints with a sparse interaction tensor J. However, as shown in the bottom panel of Fig. 3.5, the difference between ltrain, the per-neuron log likelihood of the training data (randomly chosen 5/6 of all data) and ltest, the per-neuron log likelihood of the test data, is greater than zero within error bars. This indicates that the maximum entropy model with tensor constraint overfits for all N = 10, 20,..., 50.

A.3 Maximum entropy model fails to predict the

dynamics of the neural networks as expected

By construction, the maximum entropy model is a static probability model of the observed neuronal activities. No constraint on the dynamics was imposed in building the model, and infinitely many dynamical models can generate the observed static distribution. The simplest possibility corresponds to the dynamics being like the dynamics of Monte Carlo itself, which is essentially Brownian motion on the energy landscape. To test whether this equilibrium dynamics can capture the real neural dynamics of C. elegans, we compare the mean occupancy time of each basin, τα , h i 86 calculated using the experimental data and using MCMC. The mean occupancy time is defined as the average time a trajectory spends in a basin before escaping to another basin. For equilibrium dynamics, the mean occupancy time is determined by the height of energy barriers according to the transition state theory, or by considering

2 random walks on the energy landscape, which gives the relation τ p /2e ln(Pα), ∼ − where p = 3 is the number of Potts states and Pα is the fraction of time the system visits basin α. As shown in Figure A.1, the mean occupancy time τ MC found in the h α i Monte Carlo simulation can be predicted by this simple approximation. In contrast, the empirical neural dynamics deviates from the equilibrium dynamics, as we might have expected. The dependence between τ data and P data is weak; a linear fit gives h α i α τ data P data0.5±0.027. h α i ≈ α

87 102 Mean occupancy time data 101

⌧ data AAACEHicdVA9TxtBEN3jGyeACWWaFVaUVKc92ziUSGkoiRQDks9Yc+uxvWJv77Q7h2Kd/BNo+Cs0FEQRbUo6/g1rYySC4EkjPb03o5l5Sa6VIyEegoXFpeWV1bX1yoePG5tb1e1Pxy4rrMS2zHRmTxNwqJXBNinSeJpbhDTReJKc/5j6JxdoncrMLxrn2E1haNRASSAv9apfYw1mqJHHBEUvBp2P4Cwm/E1lHwgmPLYzv1etiVA0GmIv4iKs7zdbe01PWvWo3mrwKBQz1NgcR73qfdzPZJGiIanBuU4kcuqWYElJjZNKXDjMQZ7DEDueGkjRdcvZQxP+xSt9PsisL0N8pr6cKCF1bpwmvjMFGrnX3lR8y+sUNNjvlsrkBaGRT4sGheaU8Wk6vK8sStJjT0Ba5W/lcgQWJPkMKz6E50/5++S4HkYijH42awdiHsca+8x22TcWse/sgB2yI9Zmkl2ya3bL/gRXwU3wN7h7al0I5jM77D8E/x4B5ziduA==

h i (sec) (sec) data 0.5 0.027 ⌧ P ± hAAACLXicbVDLSiNBFK12fMZXHJduCoPgKnSLEpeCLmYZwUQhHcPtyk1SWF1dVN0WQ5MfcuOvDAOziIjb+Y2pPARfBwoO59zLrXMSo6SjMBwHCz8Wl5ZXVtdK6xubW9vlnZ9Nl+VWYENkKrM3CThUUmODJCm8MRYhTRReJ3fnE//6Hq2Tmb6iocF2Cn0te1IAealTvogV6L7CmCDvxKDMAG5jwgcqukAwiu3MNTYzlPH620gRVk9ik/KwGh7VRp1yxZMp+FcSzUmFzVHvlP/E3UzkKWoSCpxrRaGhdgGWpFA4KsW5QwPiDvrY8lRDiq5dTNOO+IFXuryXWf808an6fqOA1LlhmvjJFGjgPnsT8TuvlVPvtF1IbXJCLWaHerniPvikOt6VFgWpoScgrPR/5WIAFgT5gku+hOhz5K+keVSNwmp0eVw5C+d1rLI9ts8OWcRq7Iz9YnXWYII9st9szJ6Dp+Bv8BK8zkYXgvnOLvuA4N9/oLqoxw== ↵ i/ ↵ 100 ⌧ MC AAACDnicdVDBSiNBEO1Rd3WjuxvXo5fGIHgaZmLcxJuQixchglEhE0NNp5I09vQM3TViGPIFXvZXvHhQlr169rZ/YydGcBd9UPB4r4qqenGmpKUg+OstLC59+ry88qW0uvb12/fy+o9Tm+ZGYFukKjXnMVhUUmObJCk8zwxCEis8iy+bU//sCo2VqT6hcYbdBIZaDqQAclKvvB0p0EOFPCLIexGobAQXEeE1FUfNCY/MzO2VK4G/X937uVfjgV+rh7uNXUeqYb1RDXnoBzNU2BytXvkp6qciT1CTUGBtJwwy6hZgSAqFk1KUW8xAXMIQO45qSNB2i9k7E77tlD4fpMaVJj5T304UkFg7TmLXmQCN7P/eVHzP6+Q0aHQLqbOcUIuXRYNccUr5NBvelwYFqbEjIIx0t3IxAgOCXIIlF8Lrp/xjclr1w8APj2uVg2AexwrbZFtsh4Wszg7YIWuxNhPsht2ye/bg/fLuvN/en5fWBW8+s8H+gff4DAKjnKs=sha1_base64="hP+6LrUf2d3tZaldqaQQvEKMXyw=">AAAB2XicbZDNSgMxFIXv1L86Vq1rN8EiuCozbnQpuHFZwbZCO5RM5k4bmskMyR2hDH0BF25EfC93vo3pz0JbDwQ+zknIvSculLQUBN9ebWd3b/+gfugfNfzjk9Nmo2fz0gjsilzl5jnmFpXU2CVJCp8LgzyLFfbj6f0i77+gsTLXTzQrMMr4WMtUCk7O6oyaraAdLMW2IVxDC9YaNb+GSS7KDDUJxa0dhEFBUcUNSaFw7g9LiwUXUz7GgUPNM7RRtRxzzi6dk7A0N+5oYkv394uKZ9bOstjdzDhN7Ga2MP/LBiWlt1EldVESarH6KC0Vo5wtdmaJNChIzRxwYaSblYkJN1yQa8Z3HYSbG29D77odBu3wMYA6nMMFXEEIN3AHD9CBLghI4BXevYn35n2suqp569LO4I+8zx84xIo4sha1_base64="2O57DSZK9mfx3RPtw5h2brH+nsQ=">AAACA3icbZC9SgNBFIXv+m/8i7Y2gyJYhV0bLQUbG0HBmEA2hruTm2RwdnaZuSuGJU9g46vYWCjiO9j5Nk5iCv8ODBzOmeHO/ZJcK8dh+BHMzM7NLywuLVdWVtfWN6qbq1cuK6ykusx0ZpsJOtLKUJ0Va2rmljBNNDWSm5Nx37gl61RmLnmYUzvFvlE9JZF91KnuxRpNX5OIGYtOjDof4HXMdMfl2clIxHbSdqq7YS2cSPw10dTswlTnnep73M1kkZJhqdG5VhTm3C7RspKaRpW4cJSjvME+tbw1mJJrl5N1RmLPJ13Ry6w/hsUk/f6ixNS5YZr4mynywP3uxuF/Xavg3lG7VCYvmIz8GtQrtOBMjNmIrrIkWQ+9QWmV/6uQA7Qo2ROseAjR75X/mquDWhTWoosQlmAbdmAfIjiEYziFc6iDhHt4hGd4CR6Cp+D1C9dMMOW2BT8UvH0C3uea9g==sha1_base64="PQSt19o7XeD8eef3KCoKuz3z6nI=">AAACA3icdZDNSiNBFIVv64w60dGM29kUI4Krpjv+JO4G3MxGcMCokI7hduUmKayubqpui6HJE7jxVWbjQhnmHWY3b2MlKjiiBwoO51Rx635poZXjKPoXzM1/+LiwuPSptrzyeXWt/mXlxOWlldSWuc7tWYqOtDLUZsWazgpLmKWaTtOLg2l/eknWqdwc87igboZDowZKIvuoV99MNJqhJpEwlr0EdTHC84TpiqvDg4lI7Kzt1TeicL+xu7e7I6Jwpxlvt7a9acTNViMWcRjNtAFPOurV/yb9XJYZGZYanevEUcHdCi0rqWlSS0pHBcoLHFLHW4MZuW41W2ciNn3SF4Pc+mNYzNKXLyrMnBtnqb+ZIY/c624avtV1Sh60upUyRclk5OOgQakF52LKRvSVJcl67A1Kq/xfhRyhRcmeYM1DeN5UvG9OGmEchfHPCJbgK3yDLYihCd/hBxxBGyRcwy+4g/vgJrgNfj/imgueuK3Dfwr+PABM4ZtEsha1_base64="w5CfhoMF57sszR/7ktr2I43mCzs=">AAACDnicdVBNSyNBEO3xY9XsukY97qUxCHsaeuJHsjfBixfBhY0KmWyo6VSSxp6eobtGNgz5BV78K148KMtePXvz32wnRlDZfVDweK+KqnpJrpUjIZ6CufmFxQ9LyyuVj59WP69V1zdOXVZYiS2Z6cyeJ+BQK4MtUqTxPLcIaaLxLLk4nPhnl2idyswPGuXYSWFgVF9JIC91q9uxBjPQyGOCohuDzofwMyb8ReXx4ZjHdup2qzURfqvv7e/tchHuNqKd5o4n9ajRrEc8CsUUNTbDSbf6GPcyWaRoSGpwrh2JnDolWFJS47gSFw5zkBcwwLanBlJ0nXL6zphve6XH+5n1ZYhP1dcTJaTOjdLEd6ZAQ/fem4j/8toF9ZudUpm8IDTyeVG/0JwyPsmG95RFSXrkCUir/K1cDsGCJJ9gxYfw8in/Pzmth5EIo++idiBmcSyzL2yLfWURa7ADdsROWItJdsVu2B27D66D2+B38Oe5dS6YzWyyNwge/gIBY5yn ↵ model h i (MC 9 1 ⌧ RW = ↵ 2e log P sweep) 
hAAACM3icbVDPSxtBGJ211mpsNa1HL4NB6KVhVwq1B0HwIj2l0hghk4ZvJ98mg7Ozy8y30rDs/+TFf8SDUDy0FK/+D06SPdQfDwYe732Pb74X51o5CsPfwdKr5dcrb1bXGutv321sNt9/OHVZYSV2ZaYzexaDQ60MdkmRxrPcIqSxxl58fjTzexdoncrMD5rmOEhhbFSiJJCXhs1vQoMZaxQExVCAzifwUxD+ovKkVwk79/gB/yQSC7L8WpV7WC14VJVCZ2PeqWPVsNkK2+Ec/DmJatJiNTrD5rUYZbJI0ZDU4Fw/CnMalGBJSY1VQxQOc5DnMMa+pwZSdINyfnPFd70y4klm/TPE5+r/iRJS56Zp7CdToIl76s3El7x+Qcn+oFQmLwiNXCxKCs0p47MC+UhZlKSnnoC0yv+Vywn4RsjX3PAlRE9Pfk5O99pR2I6+f24dhnUdq2yb7bCPLGJf2CE7Zh3WZZJdshv2h/0NroLb4F9wtxhdCurMFnuE4P4BgC+r8g== i ↵ 10-1 10-4 10-3 10-2 10-1 100 P Probability of visiting basin α, Pα

Figure A.1: Equilibrium dynamics of the inferred pairwise maximum entropy model fails to capture the neural dynamics of C. elegans. The mean occupancy time of each basin of the energy landscape, τα , is plotted against the fraction of time the h i system visits the basin, Pα. For 10 subgroups of N = 10 (in dots) and N = 20 (in asterisks) neurons, the empirical dynamics exhibits a weak power-law relation between τα and Pα. The striped patterns are artifacts due to finite sample size. In contrast,h equilibriumi dynamics extracted from a Monte Carlo simulation following detailed balance shows an inverse logarithmic relation between τα and Pα, which can be explained by random walks on the energy landscape. Errorh barsi of the data are extracted from random halves of the data. Error bars of the Monte Carlo simulation are calculated using correlation time and standard deviation of the observables.

88 Appendix B

Dynamical inference for C. elegans neural activity

In Chapter 3 (also see [Chen et al., 2019]), we successfully constructed a (joint) probability model for the neural activity of 50+ neurons in the brain of nematode Caenorhabditis elegans, when the worms are immobilized. By matching the mean and the pairwise correlation of the discretized neural activity while maximizing the en- tropy, we were able to construct a statistical physics model that successfully captures the collective behavior of these neurons, such as that the state of individual neurons can be predicted using network information; the number of local maxima in the prob- ability landscape is extensive as the number of neurons in the model. Nonetheless, we also showed that the equilibrium dynamics naturally given the static model, learned using neural activity patterns when the time axis is disregarded, cannot predict the out-of-equilibrium dynamics of the observed neural activity. In general, many dy- namical processes can share the same steady state distribution. In order to infer the mechanism of the interaction, we need to add the time axis back into the data, and perform dynamical inference for the neural activity.

89 To learn the dynamical interaction rules requires larger numbers of parameters in the model, and hence more number of independent time points in the data com- pared to learning the static model. The current state of the data is such that 100+ neurons can be measured simultaneously, but not for many times longer than the correlation time in the data, so learning a meaningful dynamical model is difficult. In this appendix, we first estimate the correlation time of neural activity in immo- bilized C. elegans. Then, we provide a possible mechanism for learning a dynamical model, which will be able to provide more biological insights once experiments can be performed to collect much longer datasets.

B.1 Estimate correlation time from the data

To understand the dynamical properties of the data, we estimate the correlation time from the activity of both individual neurons and the entire network. Two methods are discussed here: one is through fitting exponential functions to the decay of the correlation function, and the other is through the convergence of eigenvalues of the cross-correlation matrix measured at different time difference. In both methods, we

are using the discrete representation of neural activity, σi(t), which assign the activity of a neuron i based on the time derivative into three states: rise, fall, and flat. The first method to estimate the correlation time for individual neuron requires measuring the overlap of the neuronal state, which we defined to be

qi(∆t) = δσ (t)σ (t+∆t) t . (B.1) h i i i

The overlap takes value of 1 when ∆t = 0, and takes a baseline value q0 at large ∆t.

We extract q0 by measuring the overlap after we randomly permute the time points to corrupt the time correlation in the data. The overlap function decays exponentially; thus we fit the exponential function qi(∆t) = (1 q0) exp( t/τi)+q0 to the measured − − 90 overlap to extract the correlation time τi for individual neurons (see Fig. B.1). Note that this correlation time is an effective time constant driven by the entire network: even though each neuron has an unique time constant computed this way, it is likely to be different compared to the time constant of the same neuron, when it is isolated from the rest of the network. We can also extract the correlation time for the entire network by measure the average overlap of the neuron states for the whole network. The average overlap is defined as N D 1 X E q(∆t) = δσi(t)σi(t+∆t) , N t i=1 where N is the total numbers of neurons in our system. For the 84 neurons recorded in the dataset used in [Chen et al., 2019], the correlation time for the network is τ = 4.1s.

Figure B.1: Overlap of the discretized neuron signal, defined as q(∆t) = D 1 PN E N i=1 δσi(t)σi(t+∆t) , versus delay time ∆t, showing global correlation time (blue), t and two example neurons (red, orange). Black lines show exponential fit.

An alternative method to extract a global correlation time is through the spectra of the cross-correlation matrix at different time difference. As shown in Fig. B.2, the rank order plot of the eigenvalues of the cross-correlation matrix,

τ C = δσ (t)σ (t+τ) , (B.2) ij h i j i 91 N = 10, specific choice of neurons N = 10, specific choice of neurons, connected 1 10 101

0 10 100

-1

10 10-1

eigenvalue eigenvalue

-2 10 10-2

10-3 -3 0 1 10 10 10 100 101 rank rank N = 30, specific choice of neurons N = 30, specific choice of neurons, connected 102 101 = 0 = 2s 1 = 4s 0 10 = 6s 10 = 8s = 10s 0 10 = 12s 10-1 = 14s

-1 -2 eigenvalue

10 eigenvalue 10

-2 10 10-3

-3 10 -4 0 1 2 10 10 10 10 0 1 2 10 10 10 rank rank N = 10, specific choice of neurons N = 50, specific choice of neurons, connected 102 1 Figure B.2: Spectra of the (cross-)correlation matrix10 for observed neuron groups with number101 of neurons N = 10, 30. We only show the real parts of the eigenvalues.

100 and its100 connected version,

-1 10-1 10

p eigenvalue eigenvalue τ X t+τ -2 Cij = δσtσ δσir δσj r , (B.3) 10 h i j i − h ih i r=110-2

10-3 converges when the time difference τ = 8s 12s. This convergence implies that -3 -4 ∼10 10 100 101 102 100 101 102 neuron-neuron interaction occurs at a timescale less than 8 seconds,rank which is consis- rank tent with the decaying exponent we extracted using overlaps.

92 B.2 Coupling the neural activity and its time

derivative

Our goal is to construct a dynamical model where the network properties can be predicted using only local interaction among neurons. A successful dynamical model can, for example, predict the long network correlation time as an emergent property despite short time scales of the neuron-neuron interaction. We continue using the idea of statistical inference directly from data, by constructing models that match the lower-order observables but otherwise are least structured. A natural method of constructing dynamical model is to infer its action with both the magnitude of neural activity (measured by the Calcium concentration observed in each neuron nuclei as a proxy), which we denote as f, and its time derivative, f˙. In addition, experimental evidences also suggest that neurons in C. elegans interact with each other through such couplings, as in multiple cases, the rate of the voltage change in the post-synaptic neuron is found proportional to the voltage of the pre- synaptic neuron [Liu et al., 2009, Goodman et al., 2012]. The inferred dynamics is a probability model that gives the probability for each possible trajectory. For simplic- ity, we construct dynamical models for discretized neural states, where the signal for neuron i, fi, is discretized into q = 2 states using histogram equalization, i.e. such that the q states are equally-likely visited by the neuron, and denote the discretized ˙ signal θi 0, 1 . For the time derivative fi, we discretize it into p = 2 states, where ∈ { } ˙ σi = 1 if fi > 0 and 0 otherwise.The resulting dynamics becomes a Markov process, defined by the joint distribution, P ( θ, σ ). { } The first question we ask, is whether a model with only local field and an interac- tion term between the signal θ and the derivatives σ can reproduce the observed { } { } pairwise correlation within θ and σ . To construct such model, we constrain the { } { }

93 mean activity

mi = σi (B.4) h i

µi = θi (B.5) h i and the pairwise correlation matrix

Γij = θiσj . h i

The resulting maximum entropy distribution is

! 1 X X X P ( θ, σ ) = exp A θ σ + g θ + h σ . (B.6) Z ij i j i i i i { } i,j i i

This is very similar to a Restricted Boltzmann Machine, but it’s important to keep in mind that there is no hidden variable in our model. The Lagrange multipliers A, g, h are learned by maximizing the probability of the data for both θ and σ. { } Nonetheless, as shown by Fig. B.3, the model fails to predicts the cross-correlation within θ and σ . { } { } Alternatively, we construct the full pairwise model for θ, σ , which additionally { } θ σ constrains the pairwise correlation Γ = θiθj , and Γ = σiσj . The resulting ij h i ij h i distribution is

1 X X X X X  P (θ, σ) = exp g θ + h σ + K θ θ + J σ σ + J˜ θ σ Z i i i i ij i j ij i j ij i j (B.7) As shown by Fig. B.4, the inferred interaction parameter mostly concentrate in the

interaction among neural activities, Kij, and the interaction among neural deriva- ˜ tives, Jij. The cross-interaction matrix Jij is sparse. In addition, if we set the cross-interaction to zero, the performance of the model in predicting the higher-order

94 1 0.3 f df/dt

0.5 0

, model ,

, model ,

ij

ij C

0 -0.3 0 0.5 1 -0.3 0 0.3 , data C , data ij ij

Figure B.3: Maximum entropy model constructed by constraining only the mean activity, and the pairwise correlation θiσj between the magnitude of neural activity θ and its time derivative σ fails toh predicti the correlation among each class of the { } { } f observables. The left panel shows the unconnected correlation matrix, Γij = θiθj , df/dt h i and Γij = sigmaiσj ; the right panel shows the connected version. Data plotted here are drawnh from a subseti of neurons with size N = 30. correlation terms of the data is not largely impacted. In particular, the distribution for the joint activity P (θ) (Fig. B.4, bottom left panel) cannot be captured by the model with only pairwise interaction, while the distribution for the time derivatives P (σ) can be well-described by a pairwise model (bottom right panel). In summary, this maximum entropy model coupling the discrete version of the neural activity and its time derivative does not add much to construct the join distri- bution; it seems from the inferred interaction coefficient, that the interaction across f and f˙ is small, which is surprising considering the graded nature of C. elegans neuronal system. Another limitation of this approach is that with discretized data, if we were to generate actual traces from this model, we are required to further model some coherent time. These problems will be better addressed if we have long enough data to perform inference directly on the continuous neural activity.

95 N = 30 N = 30 2

0.2 1.5

f f 1 0.1 0.5

0 ij

C 0

, inferred , ij

-0.5 J -0.1 df/dt df/dt -1

-0.2 -1.5

-2 f df/dt f df/dt

0.1 0.1 P( ) P( ) P( , ) P( , )

0 0

, model ,

, model ,

ijk

ijk

C C

-0.1 -0.1 -0.1 0 0.1 -0.1 0 0.1 C , data C , data ijk ijk

Figure B.4: Learn the full pairwise maximum entropy model for N = 30. Top left: Full connected correlation for N = 30. Top right: The inferred interaction strength from the full pairwise maximum entropy model. Bottom: Connected three-point correlation cannot be predicted by the model for discretized magnitude of the signal θ , but can be predicted for the discretized time derivative σ . Adding coupling between{ } θ and σ does not help. { } { } { }

96 Appendix C

Appendices for Chapter 4

C.1 How to take averages for the time constants?

For finite-sized systems, we compute the observables by averaging over samples of connection matrix M, drawn with Monte Carlo methods from the distribution P (M). Usually, we can just take the arithmetic mean:

1 X [f]av f( λi αmc ) (C.1) ≡ Nmc { } αmc

However, in the case of the correlation time τcorr, because it is a ratio of two sums over functions of the eigenvalues, we have to take the geometric mean when we perform the averages:

1/N P 2 ! mc Y τα,i τ (N) exp [ln τ ( λ )] = i (C.2) corr corr i N av P τ ≡ { } α i α,i

Similarly, when we average over disorder for τmax at finite system size, we also want to take the geometric mean, especially because the distribution of time scales has long tails, as shown by the fractional standard deviation in Fig. C.1. for the time scales in Fig. C.1.

97 105 1.5 3

1.4

2 t

ln t ln 1.3 1 1.2

1.1 0 100 102 104 100 102 104 N N

1 10 t max

av t

corr av

0.5 5

/[t]

/[ln t] /[ln

t ln t ln

0 0 100 102 104 100 102 104 N N

Figure C.1: The fractional standard deviation of log time scale decreases with system size, while the fractional standard deviation of time scales is large and does not show trend of decreasing, suggesting the distribution of time scales has long tails. These results shows that we should take the geometric mean (rather than arithmetic mean) when we average over disorder to compute time scales. Here, the matrices are drawn from the GOE ensemble with the hard wall. Interaction strength is greater its critical value (i.e. in the supercritical phase), with c = 1.

98 critical c, hard wall critical c, hard wall 3 3 c = 1, hard wall c = 1, hard wall 10 10106 106 t max t corr

) 2/3

2 N ) 2

av 10 10 4 4 N2/5 av 10 10

1/3 av

N av

[t] [t] t 1 1 2 max 2 exp ([ ln t ] t ln ([ exp 10 1010 10 exp ([ ln t ] t ln ([ exp t corr N2

0 0 10 10 0 100 0 2 4 10 0 2 4 10 10 10 10100 10102 10104 100 102 104 N NN N

Figure C.2: Finite size scaling for τmax and τcorr, for matrices drawn from GOE with hard stability constraint, interaction strength being the critical, cc = 1/√2, (left panel) and super-critical, c = 1 (right panel). For each system size N, we average the time scales over 1000 Monte Carlo realization. C.2 Finite size effect for Model 2

We compute the time scales, τcorr(N) and τmax(N), by averaging over the disorder. As shown in Fig. C.2, in the critical regime where c = 1/√2, the longest time scale grows as $\tau_{\max} \sim N^{2/3}$ and the correlation time as $\tau_{\rm corr} \sim N^{2/5}$; in the supercritical regime, the time constants grow as $\tau_{\max}(N) \sim \tau_{\rm corr}(N) \sim N^{2}$, as expected from the mean field result.

The decay of the autocorrelation coefficient R(t) is also phase dependent. For systems at the critical interaction strength, the mean field prediction for the autocorrelation function decays as a power law, $R(t) \sim t^{-1/2}$. However, for finite-sized systems the true power law cannot be reached (see Fig. C.3). On the other hand, if the system is in the supercritical phase, i.e. c > 1/√2, then the autocorrelation coefficient has a plateau of length $\sim N^2$ and then decays. As shown in Fig. C.4, it is perhaps clearer to plot $1 - [R(t)]_{\rm av}$ on log-log scales. We see that for $1 < t < N^2$, there is an intermediate scaling with exponent 1/2. Notice that the mean field results we compare to must have a cutoff (otherwise we cannot compute C(t)). This intermediate scaling, $1 - R(t) \sim t^{1/2}$, is very interesting, and not something one expects from sums of exponential decays.

Figure C.3: The autocorrelation coefficient R(t) decays with time t. As the system size increases, R(t) approaches the theoretical prediction of a power-law decay, $\sim t^{-1/2}$. We need to pay attention to how we take the average. Here, we are at the critical point of the GOE ensemble with the hard stability constraint.

Figure C.4: Comparison between the autocorrelation coefficient at different system sizes and the mean field result. For the mean field result, we impose a cutoff so that the eigenvalues are at most $1 - \epsilon$; for this plot we take $\epsilon = 1/N^2$ with N = 1024, chosen to match the largest system we study. The top left and top right panels show the same data points with different axis scales. The bottom panels show the amount of autocorrelation decay, $1 - R(t)$. (Here the interaction strength is c = 1, with the hard wall.)

C.3 Derivation for the scaling of time constants in Model 3

For the ensemble of connection matrices with a maximum entropy constraint on the norm activity (Model 3), we are interested in how the time constants, τmax and τcorr, depend on the parameters of the system: the interaction strength c and the constrained norm activity µ. In particular, we ask how the time constants scale with the norm activity when the system is set in different phases by the interaction strength c. Here, we analyze the spectral distribution ρ(λ) and investigate how the gap size $g_0$, the length of the support $l$, and the constrained norm activity µ scale with the Lagrange multiplier ξ that we used to constrain the norm. The results are summarized in Tab. C.1 and can be visualized in Fig. C.5. We have shown that the spectral distribution is (see also Eq. 4.24)

$$\rho(\lambda) = \frac{1}{\pi\sqrt{(\lambda - 1 + g_0 + l)(1 - g_0 - \lambda)}}\,B(\lambda),$$
$$B(\lambda) = 1 + \frac{l^2}{8c^2} + \frac{\lambda}{c^2}\left(1 - g_0 - \frac{l}{2} - \lambda\right) + \frac{\xi}{2}\,\frac{(2g_0 - 2g_0^2 + l - 2g_0 l) - (2g_0 + l)\lambda}{\sqrt{g_0(g_0+l)}\,(\lambda - 1)^2}. \tag{C.3}$$

By setting the spectral density at the edge of the support to zero, we can solve for

the scaling dependence of gap size g0 and the length of the support l on the Lagrange multiplier ξ. Mathematically, we have

$$B(1 - g_0 - l) = 0, \qquad B(1 - g_0) = 0. \tag{C.4}$$

After simple algebraic manipulation, this set of constraints becomes

$$(2g_0 + l)(8c^2 - 3l^2) + 4l^2 = 0, \tag{C.5}$$
$$8 - \frac{l^2}{c^2} - 2\xi\, l^2\, (l + g_0)^{-3/2}\, g_0^{-3/2} = 0, \tag{C.6}$$

which sets the scaling relation between the longest time scale τmax = 1/g0 and the Lagrange multiplier ξ.
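As a numerical check, Eqs. C.5 and C.6 can be solved for $g_0$ and $l$ at a given $(c, \xi)$ with a standard root finder. The sketch below is illustrative only; the initial guess assumes the supercritical scaling $g_0 \sim \xi^{2/3}$ with $l$ near the hard-wall support length derived later (Eq. C.13), and may need to be adapted in the other regimes.

```python
import numpy as np
from scipy.optimize import fsolve

def solve_gap_and_support(xi, c):
    """Solve Eqs. (C.5)-(C.6) for the gap g0 and the support length l."""
    def eqs(p):
        g0, l = p
        eq5 = (2*g0 + l) * (8*c**2 - 3*l**2) + 4*l**2
        eq6 = 8 - l**2 / c**2 - 2*xi * l**2 * (l + g0)**(-1.5) * g0**(-1.5)
        return [eq5, eq6]
    l0 = (2.0/3.0) * (1.0 + np.sqrt(1.0 + 6.0*c**2))   # hard-wall support length (Eq. C.13)
    return fsolve(eqs, (xi**(2.0/3.0), l0))            # guess assumes the supercritical regime

# The longest time scale tau_max = 1/g0 should grow roughly as xi**(-2/3) for c = 1:
for xi in (1e-3, 1e-6, 1e-9):
    g0, l = solve_gap_and_support(xi, c=1.0)
    print(xi, 1.0 / g0, l)
```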

101 For the correlation time τcorr, we can express it as the ratio

$$\tau_{\rm corr} = \frac{\nu}{\mu}, \tag{C.7}$$
where the denominator is the expectation value of the averaged norm activity (writing $g \equiv g_0$ for brevity),

$$\mu = \langle x_i^2\rangle = \int d\lambda\,\rho(\lambda)\,\frac{1}{1-\lambda} = \frac{1}{c^2} + \frac{1}{c^2\sqrt{g(g+l)}}\left(c^2 - g - \frac{l}{2} + \frac{l^2}{8}\right) - \frac{\xi}{8}\,\frac{l^2}{g^2(g+l)^2}, \tag{C.8}$$
and the numerator can be written as

$$\nu \equiv \int d\lambda\,\rho(\lambda)\,\frac{1}{(1-\lambda)^2} = -\frac{1}{c^2} + \frac{1}{c^2}\,g^{-3/2}(g+l)^{-3/2}\left(c^2 g + g^3 + \frac{1}{2}c^2 l + \frac{3}{2}g^2 l - \frac{1}{4}l^2 + \frac{5}{8}g l^2 + \frac{1}{16}l^3\right) - \frac{\xi}{8}\,\frac{l^2(2g+l)}{g^3(g+l)^3}. \tag{C.9}$$

regime              g_0                        l                          ⟨x_i²⟩
0 < c < 1/√2        1 − √2 c + A_− ξ           2√2 c − B_− ξ              1/c² − √(1 − 2c²)/c² − D_− ξ
c = 1/√2            A_c ξ^{2/5}                2 − B_c ξ^{2/5}            2 − D_c ξ^{1/5}
c > 1/√2            A_+ ξ^{2/3}                l_0 − B_+ ξ^{2/3}          D_+ ξ^{-1/3} + 1/c²

Table C.1: Scaling of the inverse slowest time scale (gap) g_0, the width of the support of the spectral density l, and the averaged norm per neuron ⟨x_i²⟩ versus the Lagrange multiplier ξ (to leading order, for ξ ≪ 1) in the different regimes.

Because in the limit of ξ ≫ 1, $g_0$ is large and there are no long time scales, we focus our discussion on the cases where ξ ≪ 1.

Case 1: c < 1/√2 - As ξ approaches 0, we recover the semicircle spectral density with the wall far away from the spectrum. This suggests we can write $g_0 = 1 - \sqrt{2}c + A_-\xi^\gamma$ and $l = 2\sqrt{2}c - B_-\xi^\gamma$. The expected value of the norm is
$$\lim_{\xi\to 0}\langle x_i^2\rangle\big|_{c<1/\sqrt{2}} = \int_{-\sqrt{2}c}^{\sqrt{2}c} d\lambda\,\frac{\sqrt{2 - \lambda^2/c^2}}{c\pi}\,\frac{1}{1-\lambda} = \frac{1}{c^2} - \frac{\sqrt{1-2c^2}}{c^2}. \tag{C.10}$$

Figure C.5: Scaling of $g_0$, $l$, and $\langle x_i^2\rangle$ as a function of the Lagrange multiplier ξ for random matrices with a maximum entropy constraint on the norm (panel titles: norm constraint, $w = 1$, $M_{ij} \sim \mathcal{N}(0, c^2/N)$).

Taylor expansion leads to γ = 1. As ξ decreases, both $g_0$ and µ approach finite limiting values. There are no long time scales in this case.

Case 2: c = 1/√2 - This is the critical case in the hard wall limit, where the gap $g_0$ is 0 and the spectral density remains a semicircle. To solve for the scaling behavior when ξ ≪ 1, we follow a similar procedure as in the previous case and assume $g_0 = A_c\xi^\gamma$, $l = l_0 + B_c\xi^\gamma$. Again, $l_0 = 2\sqrt{2}c = 2$. Now, the 0th-order term of Eq. C.6 has no term that scales with ξ, requiring us to go to higher orders to solve for γ, which gives
$$\gamma = \frac{2}{5}. \tag{C.11}$$

The norm is
$$\langle x_i^2\rangle = 2 - D\,\xi^{1/5} + \mathcal{O}(\xi^{3/5}). \tag{C.12}$$

This is interesting: as ξ decreases, the norm activity µ approaches the limit µ = 2, and the corresponding $g_0$ continues to decrease. The system can reach an infinitely long time scale with a bounded dynamic range for individual neurons.

Case 3: c > 1/√2 - In this regime, the spectral density at ξ = 0 is divergent near λ = 1. But for any ξ > 0, there is a finite gap between the wall and the right edge of the spectrum, so a limit of the spectral density when ξ → 0 is not well defined. We can assume the gap $g_0$ takes the form $g_0 = A_+\xi^\gamma$. Because $l$ is of order 1, we can assume $l = l_0 + B_+\xi^\gamma$. Plugging these into Eq. C.5, we can solve the 0th-order equation and get
$$l_0 = \frac{2}{3}\left(1 + \sqrt{1 + 6c^2}\right). \tag{C.13}$$

This l0 is equal to the length of the spectrum at the hard wall limit, i.e. when ξ = 0.

Plugging $g_0$ and $l$ into Eq. C.6, we get from the 0th-order solution
$$\gamma = \frac{2}{3} \tag{C.14}$$
and
$$A_+ = 2^{-2/3}\, l_0^{1/3}\left(2 - \frac{l_0^2}{4c^2}\right)^{-2/3}. \tag{C.15}$$

We notice that the prefactor $A_+$ is not a monotonic function of c. Rather, $A_+$ attains its minimum A_min = 1 at c = 2. This is interesting, as we don't want the interaction strength to be too large.
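A quick numerical check of this statement, using only Eqs. C.13 and C.15 (a sketch; the scan range is an arbitrary choice inside the supercritical phase):

```python
import numpy as np

def A_plus(c):
    """Prefactor of the gap, g0 ~ A_+ * xi**(2/3), from Eqs. (C.13) and (C.15)."""
    l0 = (2.0/3.0) * (1.0 + np.sqrt(1.0 + 6.0*c**2))
    return 2.0**(-2.0/3.0) * l0**(1.0/3.0) * (2.0 - l0**2 / (4.0*c**2))**(-2.0/3.0)

c = np.linspace(0.75, 5.0, 2000)     # stays inside the supercritical phase, c > 1/sqrt(2)
A = A_plus(c)
print(c[np.argmin(A)], A.min())      # the minimum should sit near c = 2 with A_min = 1
```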

The norm becomes
$$\mu = \int d\lambda\,\rho(\lambda)\,\frac{1}{1-\lambda} = D_+\,\xi^{-1/3} + \frac{1}{c^2} + \mathcal{O}(\xi^{1/3}), \tag{C.16}$$
where
$$D_+ = \frac{1}{c^2}\,A_+^{-1/2}\, l_0^{-1/2}\left(c^2 + \frac{l_0^2}{8} - \frac{l_0}{2}\right) - \frac{A_+^{-2}}{8}. \tag{C.17}$$

Similarly, the prefactor D+ has a maximum Dmax = 3/8 at c = 2. For the correlation time, the denominator is the averaged norm, and the numerator is

$$\nu \equiv \int d\lambda\,\rho(\lambda)\,\frac{1}{(1-\lambda)^2} = E\,\xi^{-1} - \frac{1}{c^2} + \mathcal{O}(\xi). \tag{C.18}$$

The prefactor E has a maximum E_max = 1/8 at c = 2. Then, the correlation time is

$$\tau^{\rm sw}_{\rm corr} = \frac{E}{D}\,\xi^{-2/3} + \mathcal{O}(\xi). \tag{C.19}$$

Interestingly, when we compute the ratio of the two time scales, τmax and τcorr, we find that the ratio approaches a constant:

$$\frac{\tau^{\rm sw}_{\rm corr}}{\tau^{\rm sw}_{\rm max}} = \frac{1}{3} - \frac{1}{c^2}\,\frac{AE}{D^2}\,\xi^{1/3} + \mathcal{O}(\xi). \tag{C.20}$$

In this case, as ξ approaches 0, the gap $g_0$ decreases, but the norm also increases to infinity. If we only look at the relation between the parametrized variables $g_0$ and µ, we find that $g_0 \sim \mu^{-2}$, which matches what we get from numerically solving the exact Eqs. C.5 and C.6. To get an infinitely long time scale, an unlimited growth of the dynamic range of individual neurons is required, which is impossible to realize in biological systems.
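The relation $g_0 \sim \mu^{-2}$ can be reproduced by solving Eqs. C.5 and C.6 numerically and evaluating µ from Eq. C.8, for instance as in the sketch below (the choice c = 1 and the range of ξ are illustrative, and the initial guess again assumes the supercritical scaling):

```python
import numpy as np
from scipy.optimize import fsolve

def g0_and_mu(xi, c=1.0):
    """Solve Eqs. (C.5)-(C.6) for (g0, l), then evaluate mu from Eq. (C.8)."""
    def eqs(p):
        g0, l = p
        return [(2*g0 + l) * (8*c**2 - 3*l**2) + 4*l**2,
                8 - l**2 / c**2 - 2*xi * l**2 * (l + g0)**(-1.5) * g0**(-1.5)]
    l0 = (2.0/3.0) * (1.0 + np.sqrt(1.0 + 6.0*c**2))
    g0, l = fsolve(eqs, (xi**(2.0/3.0), l0))
    mu = (1.0 / c**2
          + (c**2 - g0 - l/2 + l**2/8) / (c**2 * np.sqrt(g0 * (g0 + l)))
          - xi * l**2 / (8.0 * g0**2 * (g0 + l)**2))
    return g0, mu

xis = np.logspace(-12, -6, 13)
g0s, mus = np.array([g0_and_mu(xi) for xi in xis]).T
print(np.polyfit(np.log(mus), np.log(g0s), 1)[0])   # slope should come out close to -2
```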

Figure C.6: The autocorrelation coefficient R(t) decays with time at different parameter sets (interaction strength c and Lagrange multiplier ξ; columns: c = 0.6, critical c, c = 0.8; curves: ξ = 10^{-1}, 10^{-4}, 10^{-7}, 10^{-10}). The bottom row plots $1 - R(t)$ against time on a log-log scale to show the $\sim t^{1/2}$ scaling at small ξ for interaction strengths in the supercritical phase.

C.4 Decay of auto-correlation coefficient

How does the autocorrelation function decay? If a system has a long time scale, we expect to see a power-law decay of the autocorrelation function for some time period $0 < t < \tau_{\rm corr}$, or even a plateau that holds at R(t) = 1 for this initial time period. This is confirmed by Fig. C.6: if the Lagrange multiplier ξ is small enough, then we expect to see the 1/2 scaling when we plot $1 - R(t)$ vs. t. For finite systems, the autocorrelation coefficient R(t) decays in a way consistent with the mean field results (see Fig. C.7).
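To see where the $t^{-1/2}$ behavior comes from, note that Eq. C.2 identifies $\tau_{\rm corr} = \sum_i \tau_i^2/\sum_i\tau_i$ with $\int_0^\infty R(t)\,dt$, consistent with $R(t)$ being the $\tau$-weighted mixture of exponentials, $R(t) = \sum_i \tau_i e^{-t/\tau_i}/\sum_i\tau_i$ with $\tau_i = 1/(1-\lambda_i)$. The sketch below, under this assumption, evaluates the mixture for the mean-field critical semicircle density touching the wall, with a small, arbitrary edge cutoff standing in for the finite-size or finite-ξ gap:

```python
import numpy as np

# Mean-field critical case: semicircle density touching the wall at lambda = 1
# (support [-1, 1] for c = 1/sqrt(2)).  Work with u = 1 - lambda on a log grid.
u = np.logspace(-8, np.log10(2.0), 4001)   # u = 1 - lambda; lower end is an arbitrary cutoff
w = np.sqrt(u * (2.0 - u)) / u             # rho(lambda) * tau(lambda), with tau = 1/u

def integrate(y, x):
    """Simple trapezoidal rule on a non-uniform grid."""
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))

def R(t):
    """Autocorrelation coefficient as a tau-weighted mixture of exponential decays."""
    return integrate(w * np.exp(-t * u), u) / integrate(w, u)

for t in (1e1, 1e2, 1e3, 1e4):
    # Should track the mean-field power law ~ sqrt(2/(pi*t)) until the cutoff matters.
    print(t, R(t), np.sqrt(2.0 / (np.pi * t)))
```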

Figure C.7: Finite size effect for the autocorrelation coefficient vs. time (panels: ξ = 10^{-10}, 10^{-7}, 10^{-4}, 10^{-1}; curves: N = 4 to N = 1024 and the mean field result). The interaction strength is set at c = 1 (the supercritical phase).

C.5 Model with additional constraint on self-interaction strength

As described in the main text, the mean of $M_{ii}$ in Model 3 is negative. We can add an additional maximum entropy constraint which fixes the expectation value of the diagonal entries of the interaction matrix to be 0. Because $\mathrm{Tr}(M) = \sum_i \lambda_i$, the resulting probability distribution is simply
$$P(\{\lambda\}) \propto \exp\left(-\frac{1}{2c^2}\,N\sum_i \lambda_i^2 + \frac{1}{2}\sum_{j\neq k}\ln|\lambda_j - \lambda_k| - N\xi\sum_i \frac{1}{1-\lambda_i} - \frac{\alpha}{c^2}\,N\sum_i \lambda_i\right), \tag{C.21}$$
with the Lagrange multiplier α fixing Tr(M) = 0.

We can go through the math again and find the resulting spectral density,
$$f(y) \equiv \rho(\lambda - a) = \frac{1}{\pi\sqrt{y(l-y)}}\left[1 + \frac{1}{8c^2}\left(l^2 + 4ly - 8y^2\right) - \frac{1}{c^2}\left(1 - g_0 - l + \alpha\right)\left(y - \frac{l}{2}\right) + \frac{\xi}{2}\,\frac{(l - 2y)(g_0 + l) + ly}{\sqrt{g_0(g_0+l)}\,(g_0 + l - y)^2}\right]. \tag{C.22}$$

In addition to requiring that the spectral density vanish at the edges of the support, we have the additional requirement $\int d\lambda\,\rho(\lambda)\,\lambda = 0$. We solve the coupled equations

$$(2g_0 + l)(8c^2 - 3l^2) + 4(\alpha + 1)l^2 = 0, \tag{C.23}$$
$$8 - \frac{l^2}{c^2} - 2\xi\, l^2\, (l + g_0)^{-3/2}\, g_0^{-3/2} = 0, \tag{C.24}$$
$$\left(1 - g_0 - \frac{l}{2}\right) - \frac{l^2}{8c^2}\left(1 - g_0 - \frac{l}{2} + \alpha\right) + \frac{\xi}{2}\left(2 - \frac{2g_0 + l}{g_0^{1/2}(g_0 + l)^{1/2}}\right) = 0 \tag{C.25}$$

for $g_0$, $l$, and α. The main difference between this model and Model 3 (with only the norm constrained) is that the support of the spectrum must contain both positive and negative eigenvalues. This effect is especially large when ξ ≫ 1, where the eigenvalues are packed together, occupying a length $l$ that is no longer of order 1.

However, the scaling relations between $g_0$ and ξ, and between µ and ξ, do not change in the ξ ≪ 1 limit (through a similar argument as in the scaling analysis for Model 3; see also Fig. C.8).

Figure C.8: Scaling of the longest time scale τmax and the correlation time τcorr vs. the constrained norm µ for connection matrices M with the additional maximum entropy constraint fixing $\langle M_{ii}\rangle = 0$.
