The Biological Future of Theoretical Physics
Curtis Callan, Princeton University

The very success of theoretical physics in elucidating the structure of matter at the smallest scales, and of the universe at the largest scales, has made the future course of this discipline uncertain. At the same time, the new ability of biological experiment to produce massive data is creating an urgent need for mathematical frameworks in biology of the kind theoretical physics has traditionally provided for physical science. These overlapping “crises” offer a golden opportunity for both disciplines to collaborate. I will expand on this theme, and sketch some specific examples of how theoretical physicists are taking up this challenge.

Who am I and why am I here? (w. apologies to Adm. Stockdale!)

• I used to be a theoretical particle physicist .. writing papers like these:
  – Worldsheet Approach to Heterotic Solitons and Instantons
  – Brane Dynamics from the Born-Infeld Action
  – D-Brane Approach to Black Hole Quantum Mechanics
• But about ten years ago I started writing papers like this:
  – Precise physical models of protein-DNA interaction from high-throughput data
  – Information capacity of genetic regulatory elements
  – Quantifying selection in immune receptor repertoires
• I am occasionally asked why I did this .. why did I “switch” to biology?
  – Not really because of any dissatisfaction with “mainstream” theoretical physics
  – Rather that pioneering biophysics colleagues (Bialek, Leibler) showed me that biology poses fascinating questions for theory, and that the time is ripe to attack them.
• On reflection, though, it seems to me that biology and theoretical physics have both come to a “crisis” that offers opportunities for both subjects.
  – My purpose tonight is to explain what I mean by this and to give you some concrete notion of what theoretical physicists are doing to respond.
  – I am here for the “Quantitative …” program (largely populated by theoretical physicists) … my talk may explain why KITP is hosting such an event!

Theoretical Physics is Hugely Ambitious!

It comprehends a lot. But some things escape its net:

It describes how our universe came from the Big Bang.

It describes the behavior of inanimate matter everywhere in our universe.

Thanks to NASA for the image

The primary task of theoretical physics

• To discover the fundamental, mathematically expressed, Laws of … as well as the “stuff” that obeys those laws .. in our universe (?).
• These laws are valid within broad domains of phenomena and are in some sense “simple”; they often fit handily on a postcard in shorthand form.
• They were discovered in steps over recent centuries:
  – Newtonian mechanics & the gravitational force law (1670s)
  – Maxwell’s equations of E&M (1860s) (and special relativity)
  – Einstein’s general relativity (gravity as dynamical geometry) (1910s)
  – Quantum mechanics of electrons, and … (1920s)
  – Quantum field theory and the … of everything (1970s)
  – “Big Bang” cosmology and a theory of the origin of our universe (2000s)
• The culmination of this development, the outcome of an explosion of discovery over the last 50 years, is the Standard Model of our world.
  – It is a precise mathematical theory whose scope is … everything.
  – Some modesty is of course in order, but a victory lap or two is justified.

The Standard Model Consensus

Fifty years ago, there was no agreement on the fundamental nature of matter, on the physical theory governing that matter, or how the universe worked. Over time, a “picture” and “theory” of matter and the universe came together (on two tracks):

The particle physics track: A specific quantum field theory of the strong, weak and electromagnetic interactions between three “generations” of point-like quarks and leptons; forces mediated by “gauge” gluons, photons, and the W/Z, with a Higgs boson doing symmetry breaking and mass generation. Some 21 parameters (that could have been different) completely define the whole thing. Total agreement with many and varied experiments. The search for “constituents” could stop here!

The cosmology track: The recession of the galaxies suggests a “Big Bang” origin for the universe. The exploration of its thermal “afterglow” (cosmic microwave background, now at 2.725 K) revealed how the universe was constituted in its earliest seconds of existence. Two surprises came out: the matter of particle physics is there, but is a minority player in the census; “dark matter” and “dark energy” dominate, but seem to act on the world only via gravity. This explains many puzzles of … and also how today’s universe coalesced from the Big Bang.

The Standard Model of particle physics

We have a very specific quantum field theory of how these point particle constituents interact. The quarks are not seen directly: they are “confined”, combining in triples to make neutrons, protons etc. Our ability to calculate specific results in this theory is limited, but good enough to convince us of its accuracy.

Graphic credit: Particle Data Group

The Standard Model of cosmology

[Figure labels: “We can see photons emitted back to here”; “Density ripples start to grow about here”; “BIG BANG!!!”]

A remark on the method of …

Millions of data points in …..

A few model parameters out!
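To make the “many data points in, few parameters out” idea concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything from the talk: the straight-line model, the parameter values, the noise level, and the function names `simulate_noisy_data` and `fit_line` are all made up. It suggests how a million measurements, each individually dominated by noise, can still pin down the two underlying parameters.

```python
import numpy as np

def simulate_noisy_data(n_points, slope, intercept, noise_sd, seed=0):
    """Draw n_points noisy measurements y = intercept + slope * x + noise.

    The linear model and Gaussian noise are toy assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 4.0, size=n_points)
    y = intercept + slope * x + rng.normal(0.0, noise_sd, size=n_points)
    return x, y

def fit_line(x, y):
    """Least-squares estimate of (slope, intercept).

    For Gaussian noise this is also the maximum-likelihood estimate.
    """
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return slope, intercept

# Hypothetical "hidden" parameters: the signal changes by about 1 unit
# across the x range, while each measurement carries noise of 5 units.
x, y = simulate_noisy_data(1_000_000, slope=0.3, intercept=1.0, noise_sd=5.0)
slope_hat, intercept_hat = fit_line(x, y)
print(f"slope ~ {slope_hat:.3f}, intercept ~ {intercept_hat:.3f}")
```

No single point says much about the model here, but the uncertainty on each fitted parameter shrinks roughly like the noise divided by the square root of the number of points, which is why large data sets can “see through the noise”.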

In cosmology/particle physics we do not directly measure the “hidden variables” of interest. Given an underlying, simple, model for how they affect the noisy data sets we do measure, we use statistical inference to “see through the noise” into the underlying physical parameters. This is a nearly universal method in modern physical science. One of my messages is that biology is now entering this era.

Massive resources were deployed to get here

The WMAP satellite

The LHC accelerator at CERN

The CMS detector at the LHC

This consensus picture is the outcome of an enormous effort (intellectual, experimental, sociological, financial) carried out over a period of 50 years. This sustained effort to answer what amounts to a question of “natural philosophy” is remarkable .. and a credit to the human race.

Where does theoretical physics go from here?
• There are loose ends, but the historic program has succeeded so well that it has put its own continuation into serious question!
• On the conceptual side, we seem to be at a turning point:
  – Once you are down to point particle constituents of matter, you can’t really “explain” them in terms of more fundamental entities: Is the game over?
  – There could be more massive point constituents that even the LHC can’t see (and neutrino masses and dark matter point that way … obscurely).
  – Going deeper, in the context of string theory, we can see a natural reductionist limit point at the Planck scale .. way beyond direct experimental reach.
• On the experimental side, exploring the relevant energy scales is increasingly costly, and surely approaching societal limits:
  – The LHC is speaking; we hope for surprises, but we may “only” get a completion of the Standard Model, not a view of deeper physical law.
  – Looking beyond the Standard Model, or into inflation, the future experimental projects are in the 10^10 euro class … will taxpayers support them?
  – And, in the long run, theory without experiment is not sustainable.
• So, it is going to be hard for fundamental theory from here on … Not impossible, to be sure, but are there other paths to take?

The “other agenda” of theoretical physics
• Discovering the fundamental laws is the historic core mission of theoretical physics .. but that’s not all theoretical physicists do!
• In addition, we want to explain phenomena that are not directly baked into the fundamental laws … we call them emergent phenomena.
  – E.g. show that superconductivity (a macroscopic quantum effect) follows from the Schrödinger equation for many electrons moving in a host atomic lattice.
  – Or prove “confinement”, namely that the quarks of QCD can never be seen outside the hadrons (neutrons and protons) of which they are constituents.
• Better yet, we want to predict unknown emergent phenomena (i.e. derive them from fundamental law) before their experimental discovery:
  – Our record on this is not good. The quantized Hall effect is a striking quantum effect and could have been predicted. But it wasn’t .. a failure of imagination?
  – It doesn’t happen often, but it is not impossible: cosmologists did anticipate Big Bang phenomena like the CMB. Topological insulators were also anticipated.
• The good news is that we have (so far) found that our fundamental laws are able to explain new “emergent” phenomena. Usually after the fact …

Life: the “emperor” of all emergent phenomena
• There are phenomena of fundamental importance which are certainly described by the already-known fundamental laws … but whose derivation from those Laws completely escapes us.
• The chief of these is Life. There are good reasons why it is time to bring the domain of living matter into the realm of predictive, mathematical, science:
  – How “living matter” is governed by physical principles has always been a question for theoretical physics ... but inadequate data tied our hands.
  – The ongoing explosion of quantitative biological data (high-throughput sequencing, expression profiling, …) has created a totally new context for this issue.
  – On the biology side, it is becoming clear that we need mathematical frameworks (like we have in physics) to extract meaning from the growing mass of data.
• Developing this kind of theory is what theoretical physicists do .. and it is a major intellectual challenge, on a par with our quest for fundamental laws.
  – A quick tour of the past, present and future of this enterprise will, I hope, give you a more concrete idea of what I am talking about.
• It used to be said that there is no theory in biology … the day may be coming when there is no biology without theory!
The theoretical physics of Life: past

Historical instances of using theoretical principles to illuminate and make predictions about a broad class of biological phenomena include:
• Schroedinger’s “What is Life?”: genes are carried by a molecule
  – Basic quantum mechanics and the known rate of induction of mutations by x-rays (Morgan, 1910) led him to the conclusion that genes had to be carried by a molecule. Many genes -> information -> linear polymer molecule of heredity.
• Hopfield and Ninio’s “kinetic proofreading” for DNA replication accuracy
  – Boltzmann equilibrium statistics and the binding energy differences between base pairs would give a copying error rate of 1 per 10^4 bases per generation. The actual rate is more like 1 per 10^8, and this led them to propose an energy-consuming mechanism that checks fidelity and corrects mistakes. Found!
• Berg and Purcell’s explanation of how bacteria locate their next meal
  – Bacteria must move up a density gradient to find a food source (in the real world). B&P showed that a bacterium’s size and the physics of diffusion mean that a bacterium can’t measure the food gradient across its body. How does it know which way is up? Bacteria do a funny “run and tumble” random walk: as long as density is increasing, go straight … else pick a new direction at random. True!

The theoretical physics of Life: future

Big theoretical questions are waiting in the wings. They are way beyond our grasp today but not, I trust, forever! My favorite examples are:

• What is it about non-equilibrium statistical mechanics that makes it possible for populations of distinct, stably reproducing entities to arise? We hardly know where to begin in trying to answer this question.
• Are there equations describing evolution of … with 10^3s of genes; can they be solved and can anything be said about their global behavior?
• Can we capture the dynamics of a cell? Can equations accurately describe its behavior, given that a cell has thousands of interacting parts? Can we see tissue types as basins of attraction of a large dynamical system?
• Brains carry out tasks that have an abstract representation. Can dynamical network models capture the processing power of a human brain (or eye)?
• These are some of the “big questions” that a future theoretical physics of biology will want to tackle. Today we are doing “warm-up exercises” to prepare ourselves for the big task that lies ahead.

The theoretical physics of Life: present

Theoretical physicists, once they get hooked on biology, try to study issues that cut across kingdoms (“from bacteria to brains”), are suitable for nontrivial mathematical analysis, and can be illuminated by modern biological “big data”:

I: Biological entities respond to stimuli, but noise limits how much they can know. Do quantitative measures of information (Shannon bits) give us new views of biological function/fitness?

II: Biological function often involves probability distributions (pdf’s) on high-dimensional data spaces. These pdf’s must usually be “learned” from sparse data. How can this be done?

III: Could “high-throughput” biological experimental data give access to the molecular “wheels and gears” of cellular function (using a similar strategy to what is common in particle physics and cosmology)?

I will give you a (superficial) tour of what theoretical physicists are doing to attack a few specific problems that fall under these general headings. I will try to show in what way “theoretical physics” and “biological big data” are coming together.

Quick study: how genes are turned on and off

[Diagram: transcription factor (TF) concentration [c] → regulated gene expression, protein [g]. Plots: mean output [g] vs. [c]; output noise vs. [c].]

Special DNA site (promoter) is occupied by TF with occupancy τ, driven by TF concentration [c] and DNA binding energy ε.

Occupancy τ enhances RNAP binding and drives transcription. Typical on/off mean response.
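A minimal sketch of such an on/off mean response, assuming the simplest equilibrium picture: a binding site whose occupancy is τ = c^n / (c^n + K^n), with K = c_ref·exp(ε) and ε measured in units of k_B T. This functional form, the Hill exponent `n`, the reference concentration `c_ref`, and the function name `occupancy` are assumptions made for illustration; the slide only states that τ depends on [c] and ε.

```python
import numpy as np

def occupancy(c, epsilon, c_ref=1.0, n=1.0):
    """Assumed equilibrium occupancy tau = c^n / (c^n + K^n).

    K = c_ref * exp(epsilon) is the half-occupancy concentration, with the
    binding energy epsilon in units of k_B T (more negative = tighter binding).
    n > 1 mimics cooperative binding and sharpens the switch.
    """
    c = np.asarray(c, dtype=float)
    k_half = c_ref * np.exp(epsilon)
    return c**n / (c**n + k_half**n)

# Scan concentration over four decades around the half-occupancy point.
concentrations = np.logspace(-2, 2, 9)
taus = occupancy(concentrations, epsilon=0.0, n=2.0)
for c, tau in zip(concentrations, taus):
    print(f"c = {c:8.2f}   tau = {tau:.3f}")
```

In this toy form, lowering ε (stronger binding) shifts the switch to lower concentrations, and raising `n` makes the transition between “off” and “on” steeper.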

But diffusion and small numbers make the output noisy.

(I) Building a fly embryo bit by bit

Development is a cascade of transcription factor (TF) decisions. Hunchback distinguishes the thorax from the abdomen. This decision is driven by the level of maternally supplied Bicoid. This pattern repeats in a TF cascade (Bcd → Hb → Kr, Gt, Kn, ..) leading to the expression “stripes” needed to make a multi-segment body.

[Figure panels: maternal Bicoid; produced Hunchback.]

Nuclei must know “where they are” to make a cell fate choice (thorax v. abdomen). [Bcd] is the signal: if [Bcd] is big enough, Hb is expressed.

[Hb] and [Bcd] levels in 10^4s of nuclei can be measured by fluorescent immuno-staining (ditto for other TFs). Thus, we can measure the distribution of the input/output pairs P([Hb],[Bcd]) over the nuclei in 10^2s of eggs. What we see is a noisy switch (go back one slide). Can it do more than signal just “on” vs “off”?

(I) Building a fly embryo bit by bit

Given two variables g and c with joint pdf P(g,c), their correlation is best quantified by “mutual information” (MI). It is non-negative and measured in “bits”. One bit means that knowing [g] will tell you just that [c] was “high” or “low”. More bits mean finer discrimination is possible. The joint pdf P([Hb],[Bcd]) has been measured, so we can evaluate MI([Hb],[Bcd]) from data. We find MI_data = 1.5 bits. So, the fruit fly nucleus is more than a simple on/off switch! … Why?
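As a small illustration of how MI is evaluated from a joint distribution, here is a Python sketch using the standard definition MI = Σ P(g,c) log2[ P(g,c) / (P(g)P(c)) ]. The function name `mutual_information_bits` and the three toy joint tables are made up for illustration; they are not the measured P([Hb],[Bcd]).

```python
import numpy as np

def mutual_information_bits(joint):
    """Mutual information (in bits) of a discrete joint distribution.

    `joint[i, j]` is proportional to P(g = i, c = j); it is normalized here.
    Cells with zero probability contribute nothing to the sum.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_g = joint.sum(axis=1, keepdims=True)   # marginal over c, shape (n_g, 1)
    p_c = joint.sum(axis=0, keepdims=True)   # marginal over g, shape (1, n_c)
    independent = p_g @ p_c                  # product of the marginals
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / independent[mask])))

# Toy examples (not data): a perfect on/off switch, no coupling at all,
# and three perfectly distinguishable levels.
perfect_switch = [[0.5, 0.0], [0.0, 0.5]]
no_coupling = [[0.25, 0.25], [0.25, 0.25]]
three_levels = np.eye(3) / 3.0

for name, table in [("switch", perfect_switch),
                    ("independent", no_coupling),
                    ("three levels", three_levels)]:
    print(f"{name:>12}: {mutual_information_bits(table):.3f} bits")
```

A noiseless binary switch carries exactly 1 bit, and three equally likely, perfectly distinguishable levels carry log2(3) ≈ 1.58 bits, so a value above 1 bit signals that intermediate output levels are being used. When the joint table is estimated by binning a finite sample, the naive estimate is biased, which any real analysis would need to correct.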

Here is where theoretical physics comes in! We can ask how big MI([Hb],[Bcd]) could be, given the known I/O rule of the nucleus. This is a variational problem, with a simple solution. We find MI_max = 1.7 bits (the true MI is very close to optimal). But we also get the distributions of TF concentrations that achieve this optimum, and they match the data.

[Figure: black is data, red is optimal dist’n; axis labels OFF, ON.] 20% of nuclei are neither “on” nor “off” … and this is crucial to having MI > 1.

The larger theoretical challenge: role of information in the gap gene network

[Figure: gap gene network in D. melanogaster (bicoid, caudal; hunchback, kruppel, knirps, giant).] Gap genes express in stripes.

Joint expression patterns of {hb,kr,kn,gt} could be a code for “where you are” on the ant-post axis. NB: neighboring nuclei have different states!

Nuclei need to “locate” about 100 rows … which requires 6.6 bits. At 1.5 bits (or so) per gene, we need 4 readout genes … exactly the number of gap genes. Information optimization can be used to “derive” the regulatory network.

(II) Probability distributions & immune diversity
• T & B cells implement adaptive immunity. These cells have surface receptors to recognize pathogens floating loose or (for T cells) infecting our cells.
• A T/B-cell that recognizes a pathogen proliferates to create a big clone to clear the infection. A “memory” clone is left to protect against re-infection.
• New T/B-cells are made by stem cells. In each creation event, the germline DNA for the receptor is “edited” to create a new, unique, receptor gene.

• You have ~10^7 unique, randomly-generated immune cell types in your body. Can we quantify this diversity and understand how it is generated?
• By harvesting T cells, extracting/amplifying DNA, and then sequencing, we obtain >10^5 distinct examples of receptor genes from one blood sample.
• There is virtually no overlap between repertoires of different individuals. We need to understand their statistics in order to learn anything useful.

(II) Probability distributions & immune diversity

The genome editing event is carried out by a few DNA repair enzymes. There are only a few “moves” they can apply to the germline DNA. Any new receptor gene sequence σ is the result of a “scenario” E, a set of values for these actions.

• Choose possible V, D, J gene segments.
• Chew away some number of bases from each.
• Don’t ask! A bit exotic ….
• Insert some number of bases between genes.
• Account for different insertion probabilities.

For each type of move there is an unknown probability distribution: each V gene has its own likelihood of choice, ditto for each number of VD or DJ insertions, etc. We want to infer a pdf for the generative scenarios. To do so, we assume a plausible structure:

N.B. Many “scenarios” can yield the same sequence read σ.
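The point that many scenarios can yield the same sequence can be made concrete with a deliberately tiny toy model in Python. This is a sketch under loud assumptions, not the model used in the talk: there is no D segment, the gene segments, the deletion and insertion distributions, and all numerical values are invented, inserted bases are taken to be uniformly random, and the names `scenarios` and `pgen_table` are hypothetical. The generation probability of a sequence is obtained by summing the probabilities of every scenario that produces it.

```python
from collections import defaultdict
from itertools import product

# All numbers below are made-up toy values, not measured distributions.
P_V = {"TGTGCC": 0.7, "TGTGCA": 0.3}   # choice of V segment
P_J = {"GAGCAG": 0.6, "GAACAG": 0.4}   # choice of J segment
P_DEL_V = [0.5, 0.3, 0.2]              # P(chew 0, 1, 2 bases off the V end)
P_DEL_J = [0.6, 0.4]                   # P(chew 0, 1 bases off the J start)
P_INS = [0.4, 0.4, 0.2]                # P(insert 0, 1, 2 random bases)

def scenarios():
    """Yield (sequence, probability) for every generative scenario E."""
    for (v, p_v), (j, p_j) in product(P_V.items(), P_J.items()):
        for d_v, p_dv in enumerate(P_DEL_V):
            for d_j, p_dj in enumerate(P_DEL_J):
                for n_ins, p_n in enumerate(P_INS):
                    # Each inserted base is assumed uniform over A, C, G, T.
                    for inserted in product("ACGT", repeat=n_ins):
                        seq = v[:len(v) - d_v] + "".join(inserted) + j[d_j:]
                        prob = p_v * p_j * p_dv * p_dj * p_n * 0.25**n_ins
                        yield seq, prob

def pgen_table():
    """Sum scenario probabilities per resulting sequence: the toy P_gen."""
    table = defaultdict(float)
    for seq, prob in scenarios():
        table[seq] += prob
    return dict(table)

table = pgen_table()
n_scenarios = sum(1 for _ in scenarios())
print(n_scenarios, "scenarios give", len(table), "distinct sequences")
print("total probability:", sum(table.values()))
print("toy P_gen(TGTGCCGAGCAG):", table["TGTGCCGAGCAG"])
```

In this toy, the sequence TGTGCCGAGCAG arises from the “no deletion, no insertion” scenario, but also from scenarios that chew a base off V and happen to re-insert the same base, so its generation probability exceeds that of any single scenario. Brute-force enumeration only works because the toy is tiny; a realistic model needs a smarter way of summing over scenarios.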

P_gen(σ) is the net probability that the result σ is produced in one stem cell event. Once we know the right component pdf’s we will be able to evaluate this hidden variable.

(II) Probability distributions & immune diversity

Key theoretical idea: The true component probability distributions are those that maximize the likelihood (product of individual probabilities over all data sequences) of the repertoire. Easy search problem! Results for some of the distributions are:

[Plots: number of insertions; P_gen distribution & shared seqs.]

Error Bars: Variance over 10 individuals

The total possible diversity is enormous: something like 10^13 sequences. The unique clones we observe come nowhere near sampling sequence diversity. We can get sharp results because we assume that the sequence diversity has a simple hidden source.

Why did I tell you these particular stories?

First, each is a specific instance of a broader class of conceptually similar problems spanning the tree of life .. and for which big data is available

Theme I: Information transmission problems are ubiquitous, and data adequate for quantifying how well information is transmitted can now be collected in many systems.
• Higher organisms convert sensory inputs to spikes on nerves. Reconstructing a sensory input from the spike train is an information problem.
• Eukaryotic signaling converts an external chemical signal into the phosphorylation state of internal messenger proteins. Quantify with mass-spec.

Theme II: Correlated probability distributions on high-dimensional data, and the problems of learning them from sparse data, are everywhere.
• A given visual scene produces a noisy train of “spikes” on the optic nerves. P(spikes|scene) is the mother of all high-dim’l pdf’s.
• The same enzymatic function is provided in different species by proteins with different sequence. Is there a P(aa seq|protein type) to be learned?

Second, they exemplify in various ways the theoretical physics method of “understanding” complex phenomena through inference of a much simpler underlying “hidden” (mathematically expressed) mechanism.

Back to the future (of science)
• It seems blindingly obvious that biology will need some kind of general mathematical framework to organize the data flood that is in the offing.
• I am very skeptical that generic “data mining” approaches (a la Google X) will do the job. Finding and using simpler underlying structures will be key.
• The deep problem is figuring out what complexities can be ignored in creating an “adequate” theory or model of a cell (or brain). Do the “coarse graining” and “universality” that work in physics have a role in biology?
• We need to rise above specific models for each specific system. I don’t claim that physicists are uniquely equipped for this task .. but they are naturally inclined to look for the generality we really need.
• The concrete examples I described in the talk are a pretty pale imitation of the sort of analysis we will eventually need. They are just first steps.
• The problem is a major intellectual challenge, just as hard and deep as the fundamental physics challenges we have solved in the past. Accepting it will ensure the future vitality of theoretical physics .. and biology.

Acknowledgements

Theme 1: Gasper Tkacik (IST Austria), Bill Bialek (Princeton)
Theme 2: Thierry Mora (ENS), Aleksandra Walczak (ENS), Zach Sethna (Princeton), Anand Murugan (Stanford)
Theme 3: Justin Kinney (CSHL), Ted Cox (Princeton)

Physics Dept. & Lewis-Sigler Institute, Princeton University Funding: NSF, NIH, Keck Foundation