Large deviations theory: Data, single-cell biomarkers, and Fisher's fundamental theorem of natural selection

Hong Qian

Department of Applied Mathematics University of Washington Genomics and the physical chemistry of biomolecules are two pillars of molecular biology. One of the key branches of the latter is chemical thermodynamics, which, curiously, is quite marginalized in the current research on computational molecular biology (CMB). I shall revisit this subject, but no starting from physics nor chemistry, but rather through a result from probability, beyond the and , call large deviations theory. I shall show how our theory can be applied to Fisher’s FTNS as well as to analyzing data from single cells. Prologue: The two pillars of molecular biology

• Genetics, genomics, and bioinformatics: It is about “information”;

• Physical chemistry, molecular dynamics (MD) and structural biology (SB): It is abut “molecules, their states, and the processes cause their changes”.

Things like …

= , = 𝜕𝜕𝐺𝐺 𝐺𝐺 𝐻𝐻 − 𝑇𝑇𝑇𝑇 [𝜇𝜇𝑖𝑖] = + ln𝜕𝜕𝑐𝑐𝑖𝑖 [ ] 𝑜𝑜 𝐶𝐶 𝐷𝐷 ∆𝜇𝜇 ∆𝜇𝜇 𝑘𝑘𝑘𝑘 = ln 𝐴𝐴 𝐵𝐵 𝑜𝑜 ∆𝜇𝜇 −𝑘𝑘𝑘𝑘 𝐾𝐾eq (1) How to represent (e.g., describe) a biochemical or a biological systems, not just MD and SB: Its states and its changes? (i) classifying biochemical (or biological) individuals into populations of kinetic species; (ii) counting the number of individuals in each and every “pure kinetic species”; (iii) representing changes in terms of “stochastic elementary processes”. A stochastic elementary process + + + + + 𝑗𝑗 = , 𝑟𝑟, … , 𝐴𝐴 2𝐵𝐵 𝐶𝐶 𝐸𝐸 𝑋𝑋 3𝐵𝐵 𝐸𝐸 ( + ) = + | = 𝑵𝑵 𝑁𝑁𝐴𝐴 𝑁𝑁𝐵𝐵 𝑁𝑁𝑍𝑍 𝑃𝑃 𝑵𝑵 𝑡𝑡 𝑑𝑑𝑑𝑑 +𝒏𝒏 ∆𝒏𝒏, 𝑵𝑵 𝑡𝑡 = 𝒏𝒏 = 1 + , = 𝑗𝑗 𝑗𝑗 𝑟𝑟0𝒏𝒏, 𝑑𝑑𝑡𝑡 𝑜𝑜 𝑑𝑑𝑡𝑡 𝑖𝑖𝑖𝑖 ∆𝒏𝒏 𝝂𝝂 . � − 𝑟𝑟𝑟𝑟𝑟𝑟 𝑜𝑜 𝑑𝑑𝑑𝑑 𝑖𝑖𝑖𝑖 ∆𝒏𝒏 𝟎𝟎 𝑜𝑜𝑜𝑜표𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜𝑜 instantaneous

stoichiometric coefficients An essential feature of a stochastic elementary reaction with pure kinetic species = ( −) 𝑟𝑟𝑟𝑟 𝑃𝑃 𝐓𝐓=≥ 𝑡𝑡 𝑒𝑒 = −𝑟𝑟 𝑡𝑡+𝜏𝜏 𝑃𝑃 𝐓𝐓≥𝑡𝑡+𝜏𝜏 𝑒𝑒 −𝑟𝑟𝑟𝑟 −𝑡𝑡𝜏𝜏 𝑃𝑃 𝐓𝐓≥𝜏𝜏 𝑒𝑒 = 𝑒𝑒 𝑑𝑑 ln 𝑃𝑃 𝐓𝐓≥𝑡𝑡 − 𝑑𝑑𝑡𝑡 𝑟𝑟 The rate for the next event within multiple stochastic elementary processes among pure kinetic species

= 1 −𝑟𝑟𝑗𝑗𝑡𝑡 𝑃𝑃 𝐓𝐓𝑗𝑗=≥min𝑡𝑡 𝑒𝑒 , , , ≤ 𝑗𝑗 ≤, 𝑀𝑀 𝐓𝐓∗ 𝐓𝐓1 𝐓𝐓=2 ⋯ 𝐓𝐓𝑀𝑀, −𝑟𝑟∗𝑡𝑡 𝑃𝑃= 𝐓𝐓∗ ≥+ 𝑡𝑡 + 𝑒𝑒 + . 𝑟𝑟∗ 𝑟𝑟1 𝑟𝑟2 ⋯ 𝑟𝑟𝑀𝑀 A theorem on the rate for complex kinetic species that contain heterogeneous subpopulations = ( ) ∞ −𝑟𝑟 𝑠𝑠 𝑡𝑡 ( ) ( ) 𝑃𝑃 𝐓𝐓 ≥ 𝑡𝑡 = ∫0 𝑒𝑒 𝑓𝑓 𝑠𝑠 𝑑𝑑=𝑑𝑑 ( ) ∞ −𝑟𝑟 𝑠𝑠 𝑡𝑡( ) 𝑑𝑑 ln 𝑃𝑃 𝐓𝐓≥𝑡𝑡 ∫0 𝑟𝑟 𝑠𝑠 𝑒𝑒 𝑓𝑓 𝑠𝑠 𝑑𝑑𝑑𝑑 ∞ −𝑟𝑟 𝑠𝑠 𝑡𝑡 𝑑𝑑(𝑡𝑡) − = ∫0 𝑒𝑒 𝑓𝑓 𝑠𝑠 𝑑𝑑𝑑𝑑 0.𝑟𝑟̅ 𝑡𝑡 𝑑𝑑𝑟𝑟̅ 𝑡𝑡 2 𝑑𝑑𝑡𝑡 − 𝑟𝑟 𝑠𝑠 − 𝑟𝑟̅ 𝑡𝑡 ≤ The instantaneous rate for a non-elementary process with complex kinetic species: , ,

( ) = 𝑟𝑟̅ 𝐍𝐍 𝑡𝑡 𝑋𝑋+ 𝑡𝑡 𝑡𝑡 + 𝑑𝑑𝑟𝑟̅ 𝑡𝑡 𝜕𝜕𝑟𝑟̅ 𝑑𝑑𝐍𝐍 𝜕𝜕𝑟𝑟̅ 𝑑𝑑𝑑𝑑 𝜕𝜕𝑟𝑟̅ 𝑑𝑑𝑡𝑡 𝜕𝜕𝐍𝐍 𝑑𝑑𝑑𝑑 𝜕𝜕𝑋𝑋 𝑑𝑑𝑑𝑑 𝜕𝜕𝜕𝜕 0 ! 𝜕𝜕𝑟𝑟̅ ≤ 𝜕𝜕𝜕𝜕

Counting, it is everywhere!

• Newtonian particles in fluid mechanics, Lagrangian vs. Eulerian representations; • second quantization in quantum field theory; • atomic numbers in molecules, chemical species; • biological organisms. Lu and Qian, arXiv:2009.12644 (2020)

(2) Thermodynamics of Small Systems

Joint work with Prof. Zhiyue Lu, UNC. What are?

Fundamental thermodynamic relation Gibbs–Duhem equation Current understanding of the foundation of thermodynamics

• The existence of an function or entropy functional. • The entropy is a statistical concept; it is an Eulerian homogeneous function of all extensive variables with order 1. • It is a universal result, as a limit, for macroscopic large systems. (1917-2014)

(1919-1993) How can a small system have universal behavior? The world is stochastic.

The repeated measurements are not a way to obtaining truth via eliminating uncertainty. Variation (heterogeneity) is a part of the truth! On the border of this stochastic world, three major landmarks:

LLN: law of large numbers, CLT: central limit theorem, LDP: large deviations principle. Law of Large Numbers + + + lim = 𝑋𝑋1 𝑋𝑋2 ⋯ 𝑋𝑋𝑀𝑀 𝑀𝑀→∞ 𝔼𝔼 𝑋𝑋 𝑀𝑀 lim = ( = 1,2, … , ) 𝑘𝑘 𝑚𝑚 𝑘𝑘 𝑀𝑀→∞ ℙ 𝑘𝑘 𝐾𝐾 𝑀𝑀 Lawstationary of Large samples Numbersexpected value + + + lim = 1 2 𝑀𝑀 𝑋𝑋 𝑋𝑋 sample⋯ frequency𝑋𝑋 𝑀𝑀→∞ 𝔼𝔼 𝑋𝑋 𝑀𝑀 lim = ( = 1,2, … , ) 𝑘𝑘 𝑚𝑚 𝑘𝑘 statistical𝑀𝑀→ ∞concepts ℙprobability𝑘𝑘 𝐾𝐾 probabilistic concepts𝑀𝑀 Central Limit Theorem

+ + + lim = 𝑋𝑋1 𝑋𝑋2 ⋯ 𝑋𝑋𝑀𝑀 𝑀𝑀→∞ + + + ∞ lim 𝑀𝑀 𝑋𝑋1 𝑋𝑋2 ⋯ 𝑋𝑋𝑀𝑀 𝑀𝑀→∞ = N 0, − 𝑀𝑀𝔼𝔼 𝑋𝑋 𝑀𝑀 2 𝒩𝒩 𝜎𝜎 ∞ Central Limit Theorem

+ + + lim = 𝑋𝑋1 𝑋𝑋2 ⋯ 𝑋𝑋𝑀𝑀 𝑀𝑀→∞ + + + ∞ lim 𝑀𝑀 𝑋𝑋1 𝑋𝑋2 ⋯ 𝑋𝑋𝑀𝑀 𝑀𝑀→∞ = N 0, − 𝑀𝑀𝔼𝔼 𝑋𝑋 𝑀𝑀 2 normal 𝒩𝒩 𝜎𝜎 variance of 𝑋𝑋 Large Deviations Principle

; , = ∗ the ∗premise 𝑓𝑓𝑋𝑋�𝑀𝑀 𝑥𝑥 𝑀𝑀; →~𝛿𝛿 𝑥𝑥 − (𝑥𝑥 ), 𝑥𝑥 𝔼𝔼 𝑋𝑋 min −𝑀𝑀𝑀𝑀=𝑥𝑥 ( ) = 0, the existence𝑋𝑋�𝑀𝑀 𝑓𝑓 𝑥𝑥 𝑀𝑀 𝑒𝑒 ∗ 1 𝑥𝑥∈𝔖𝔖 𝜑𝜑 𝑥𝑥 𝜑𝜑 𝑥𝑥 lim ln = min .

i𝑀𝑀ts property 𝑀𝑀→∞ − ℙ 𝑋𝑋 ∈ 𝒜𝒜 ⊂ 𝔖𝔖 𝑥𝑥∈𝒜𝒜 𝜑𝜑 𝑥𝑥 𝑀𝑀 When the M tends to infinite … ; , ( ; ) ( ) −𝑀𝑀𝜑𝜑 𝑥𝑥 𝑓𝑓𝑋𝑋�𝑀𝑀 𝑥𝑥 𝑀𝑀 𝑓𝑓𝑋𝑋�𝑀𝑀 𝑥𝑥 𝑀𝑀 ≈ 𝐴𝐴 𝑒𝑒

sample mean value x rate function as entropy ln Pr = ln ( )d ∞ 𝜑𝜑 −𝑀𝑀𝜑𝜑 𝑥𝑥 �𝑀𝑀 = min ( ) + ( ) 𝑋𝑋 ≥ 𝑎𝑎 �𝑎𝑎 𝐴𝐴𝑒𝑒 𝑥𝑥

−𝑀𝑀 𝑎𝑎≤𝑥𝑥≤∞ 𝜑𝜑 𝑥𝑥 𝑜𝑜 𝑀𝑀 ( )

𝜑𝜑 𝑥𝑥

𝑥𝑥 𝑎𝑎 Cramér’s theorem for arbitrary distribution ( )

𝑝𝑝𝑋𝑋 𝑥𝑥

Harald Cramér (1893-1985) Generating Function and Legendre-Fenchel transform

= ln ( ) −𝛽𝛽𝛽𝛽 𝜓𝜓 𝛽𝛽 � 𝑝𝑝𝑋𝑋 𝑥𝑥 𝑒𝑒 𝑑𝑑𝑑𝑑 = sup + ( )

= inf + ( ) 𝜓𝜓 𝜷𝜷 𝒙𝒙 −𝜷𝜷 � 𝒙𝒙 𝜂𝜂 𝒙𝒙

−𝜑𝜑 𝑥𝑥 𝛽𝛽 𝜷𝜷 � 𝒙𝒙 𝜓𝜓 𝜷𝜷 Massieu-Guggenheim (free) entropy and Gibbs entropy = ln ( ) 𝜓𝜓 −𝛽𝛽𝛽𝛽 𝜂𝜂 𝑋𝑋 𝜓𝜓 𝛽𝛽 Massieu�-Guggenheim𝑝𝑝 𝑥𝑥 𝑒𝑒 entropy𝑑𝑑𝑑𝑑 = inf + ( )

−𝜑𝜑 𝑥𝑥 𝛽𝛽 𝜷𝜷 � 𝒙𝒙 𝜓𝜓 𝜷𝜷 [ ( )/ ] = ( ) = [1/ ] Gibbs entropy 𝑑𝑑 𝜓𝜓 𝛽𝛽 𝛽𝛽 𝜂𝜂 𝑥𝑥 −𝜑𝜑 𝑥𝑥 𝑑𝑑 𝛽𝛽 Legendre-Fenchel Transform and Free Entropy-Gibbs Entropy Duality = inf + ( ) = = + + 𝜂𝜂 𝒚𝒚 𝜷𝜷 𝜷𝜷 � 𝒚𝒚 𝜓𝜓 𝜷𝜷

1 fundamental1 2 thermodynamic2 Hill𝑑𝑑-Gibbs𝜂𝜂 𝒚𝒚-Duhem𝜷𝜷equation� 𝑑𝑑𝒚𝒚 𝛽𝛽 𝑑𝑑𝑦𝑦 𝛽𝛽 𝑑𝑑𝑦𝑦 ⋯ relation

= sup + ( ) = = + 𝜓𝜓 𝜷𝜷 𝒚𝒚 −𝜷𝜷 � 𝒚𝒚 𝜂𝜂 𝒚𝒚 𝑑𝑑𝜓𝜓 𝜷𝜷 −𝒚𝒚 � 𝑑𝑑𝜷𝜷 −𝑦𝑦1𝑑𝑑𝛽𝛽1 − 𝑦𝑦2𝑑𝑑𝛽𝛽2 ⋯ T. L. Hill’s Nanothermodynamics 1 1 , , = + 𝑝𝑝 𝜇𝜇 𝑆𝑆 𝑈𝑈 𝑉𝑉 𝑁𝑁 𝑈𝑈 𝑉𝑉 − 𝑁𝑁 − ℰ 𝑇𝑇 𝑇𝑇 𝑇𝑇 𝑇𝑇 extensive quantities sub-extensive 1 d , , = d + d d 𝑝𝑝 𝜇𝜇 𝑆𝑆 𝑈𝑈 𝑉𝑉 𝑁𝑁 𝑈𝑈 𝑉𝑉 − 𝑁𝑁 d , , =𝑇𝑇 d +𝑇𝑇 d d𝑇𝑇 ℰ 𝑇𝑇 𝑝𝑝 𝜇𝜇 −𝑆𝑆 𝑇𝑇 𝑉𝑉 𝑝𝑝 − 𝑁𝑁 𝜇𝜇 In the Large System Limit: Entropy is an order 1 homogeneous function of all extensive variables = ( ), ( ) = , 𝜂𝜂 𝒚𝒚 𝒚𝒚 � 𝛻𝛻𝑦𝑦𝜂𝜂 𝒚𝒚 𝜷𝜷=𝒚𝒚sup𝛻𝛻𝑦𝑦𝜂𝜂 𝒚𝒚 + ( ) = 0 ! 𝜓𝜓 𝜷𝜷 𝒚𝒚 −𝜷𝜷 � 𝒚𝒚 𝜂𝜂 𝒚𝒚 In summary, not just one entropy and one limit, but three and two limits! Three Entropies and Two Limits

Boltzmann entropy Massieu-Guggenheim Gibbs entropy entropy

big data limit large system limit Three Entropies and Two Limits

convex functions

subextensive function homogeneous function From here to things like …

= , = 𝜕𝜕𝐺𝐺 𝐺𝐺 𝐻𝐻 − 𝑇𝑇𝑇𝑇 [𝜇𝜇𝑖𝑖] = + ln𝜕𝜕𝑐𝑐𝑖𝑖 [ ] 𝑜𝑜 𝐶𝐶 𝐷𝐷 ∆𝜇𝜇 ∆𝜇𝜇 𝑘𝑘𝑘𝑘 = ln 𝐴𝐴 𝐵𝐵 𝑜𝑜 ∆𝜇𝜇 −𝑘𝑘𝑘𝑘 𝐾𝐾eq Stochastic biochemical kinetic description dictates a macroscopic, deterministic biochemical kinetics (as LLN) as well as a biochemical thermodynamics (as LDP)!

Conclusions

(1) The thermodynamic structure presented in the present work, while assumes a probability distribution a priori, does not require the concept of equilibrium in connection to detailed balance in stochastic dynamics, nor ergodicity. Therefore, it is applicable to measurements on biomarkers from isogenic single living cells. Of course, if a large system consists of many statistically identical but independent smaller parts, then the entire argument based on i.i.d. measurements can be applied to a single measurement of extensive variables of the large system as a whole Conclusions – cont. (2) The present result augments the current understanding of the nature of thermodynamic behavior, which so far has been focused on large systems. We now see there is actually a large measurements limit that generates a different kind of emergent order, a duality symmetry, for any small stochastic systems. This symmetry is lost, however, in the large systems limit. Qian and Cheng, Quant. Biol. 8, 172 (2020)

(3) Application to Single-cell Biology: counting frequencies for phenotypic heterogeneity and mean value of a quantitative biomarker

Joint work with Mr. Yu-Chen Cheng , UW. Consider total M isogenic cells,

assuming there are K phenotypic states

(clusters), with m1, m2, …, mK number of cells within different phenotypes,

Let be the value of a single-cell biomarker when the cell is in the 𝑘𝑘 𝑔𝑔 phenotype k. The standard LLN for mean value and the Borel’s LLN for frequencies + + + lim = 𝑔𝑔1 𝑔𝑔2 ⋯ 𝑔𝑔𝑀𝑀 𝑀𝑀→∞ 𝔼𝔼 𝑔𝑔 𝑀𝑀 lim = ( = 1,2, … , ) 𝑘𝑘 𝑚𝑚 𝑘𝑘 𝑀𝑀→∞ ℙ 𝑘𝑘 𝐾𝐾 𝑀𝑀 Pr = , , = ~ ( ) −𝑀𝑀𝑀𝑀 𝐱𝐱 𝑚𝑚1 𝑥𝑥1 ⋯ 𝑚𝑚𝐾𝐾 𝑥𝑥𝐾𝐾 𝑒𝑒 = 𝐾𝐾 log 𝑘𝑘 𝑘𝑘 𝑥𝑥 𝐼𝐼 𝐱𝐱 � 𝑥𝑥 𝑘𝑘 +𝑘𝑘=1 + 𝑝𝑝 ( ) = 𝑀𝑀 𝑚𝑚1𝑔𝑔1 ⋯ 𝑚𝑚𝐾𝐾𝑔𝑔𝐾𝐾 𝑔𝑔̅ = 𝐾𝐾 𝑀𝑀 𝑘𝑘 𝑘𝑘 �𝑘𝑘=1𝑥𝑥 𝑔𝑔 → 𝔼𝔼 𝑔𝑔 Contraction Principle

Pr ( ) = ~ ( ) 𝑀𝑀 −𝑀𝑀𝑀𝑀 𝑦𝑦 𝑔𝑔̅ 𝑦𝑦 𝑒𝑒 = inf ( ) : = 𝜑𝜑 𝑦𝑦 𝑀𝑀 𝐼𝐼 𝐱𝐱 𝑘𝑘 𝑘𝑘 𝐱𝐱 �𝑘𝑘=1𝑥𝑥 𝑔𝑔 𝑦𝑦 An Optimization Problem

min log , 𝑀𝑀 𝑘𝑘 𝑘𝑘 𝑥𝑥 𝐱𝐱 �𝑘𝑘=1 𝑥𝑥 = ,𝑘𝑘 𝑀𝑀 𝑝𝑝 𝑘𝑘 𝑘𝑘 �𝑘𝑘=1𝑥𝑥 𝑔𝑔 = 1𝑦𝑦. 𝑀𝑀 𝑘𝑘 �𝑘𝑘=1𝑥𝑥 Using the method of Lagrangian multiplier

= log 𝑀𝑀 −𝛽𝛽𝑔𝑔𝑘𝑘 𝑘𝑘 𝜑𝜑 −𝛽𝛽𝛽𝛽 − �𝑘𝑘=1𝑝𝑝 𝑒𝑒 = log 𝑀𝑀 𝑑𝑑 −𝛽𝛽𝑔𝑔𝑘𝑘 𝑦𝑦 − � 𝑝𝑝𝑘𝑘𝑒𝑒 𝑑𝑑𝛽𝛽 𝑘𝑘=1 Using the method of Lagrangian multiplier

= log 𝑀𝑀 −𝛽𝛽𝑔𝑔𝑘𝑘 cumulant𝑘𝑘generating function 𝜑𝜑 −𝛽𝛽𝛽𝛽 − �𝑘𝑘=1𝑝𝑝 𝑒𝑒 = log 𝑀𝑀 𝑑𝑑 −𝛽𝛽𝑔𝑔𝑘𝑘 𝑦𝑦 − � 𝑝𝑝𝑘𝑘𝑒𝑒 𝑑𝑑=𝛽𝛽 min 𝑘𝑘=+1 ( )

𝜑𝜑 𝑦𝑦 − 𝛽𝛽 𝜷𝜷 � 𝒚𝒚 𝜓𝜓 𝜷𝜷 Cramér’s theorem for arbitrary distribution ( )

𝑝𝑝𝑋𝑋 𝑥𝑥

Harald Cramér (1893-1985) The Shannon entropy is to Borel’s LLN what the Gibbs entropy is to standard LLN for mean value!

+ + + lim = 𝑔𝑔1 𝑔𝑔2 ⋯ 𝑔𝑔𝑀𝑀 𝑀𝑀→∞ 𝔼𝔼 𝑔𝑔 𝑀𝑀 lim = ( = 1,2, … , ) 𝑘𝑘 𝑚𝑚 𝑘𝑘 𝑀𝑀→∞ ℙ 𝑘𝑘 𝐾𝐾 𝑀𝑀 Tentative Summary • Applying the theory of large deviations to the statistical analysis of single cells, there could be a “thermodynamic behavior” in the date; • There is a relation at the fundamental level between Waddington’s single cell phenotypic landscape and Gibbsian thermodynamic; • Large deviations theory offers nonlinear statistical dependency beyond Gaussian fluctuations. Acknowledgements

Prof. Zhiyue Lu Mr. Yu-Chen Cheng UNC, Chapel Hill Univ. of Wash. Thank you!