arxiv.org:2006.15416

Thermodynamic Machine Learning through Maximum Work Production

Alexander B. Boyd,1, 2, ∗ James P. Crutchfield,3, † and Mile Gu1, 2, 4, ‡

1 Complexity Institute, Nanyang Technological University, Singapore
2 School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore
3 Complexity Sciences Center and Physics Department, University of California at Davis, One Shields Avenue, Davis, CA 95616
4 Centre for Quantum Technologies, National University of Singapore, Singapore

(Dated: April 4, 2021)

Adaptive systems—such as a biological organism gaining survival advantage, an autonomous robot executing a functional task, or a motor protein transporting intracellular nutrients—must model the regularities and stochasticity in their environments to take full advantage of thermodynamic resources. Analogously, but in a purely computational realm, machine learning algorithms estimate models to capture predictable structure and identify irrelevant noise in training data. This happens through optimization of performance metrics, such as model likelihood. If physically implemented, is there a sense in which computational models estimated through machine learning are physically preferred? We introduce the thermodynamic principle that work production is the most relevant performance metric for an adaptive physical agent and compare the results to the maximum-likelihood principle that guides machine learning. Within the class of physical agents that most efficiently harvest energy from their environment, we demonstrate that an efficient agent's model explicitly determines its architecture and how much useful work it harvests from the environment. We then show that selecting the maximum-work agent for given environmental data corresponds to finding the maximum-likelihood model. This establishes an equivalence between nonequilibrium thermodynamics and dynamic learning. In this way, work maximization emerges as an organizing principle that underlies learning in adaptive thermodynamic systems.

Keywords: nonequilibrium thermodynamics, Maxwell's demon, Landauer's Principle, extremal principles, machine learning, regularized inference, density estimation

∗ [email protected]
† [email protected]
‡ [email protected]

I. INTRODUCTION

What is the relationship, if any, between abiotic physical processes and intelligence? Addressed to either living or artificial systems, this challenge has been taken up by scientists and philosophers repeatedly over the last centuries, from the 19th-century teleologists [1] and biological structuralists [2, 3] to the cybernetics of the mid-20th century [4, 5] and contemporary neuroscience-inspired debates on the emergence of artificial intelligence in digital simulations [6]. The challenge remains vital today [7–10]. A key thread in this colorful and turbulent history explores issues that lie decidedly at the crossroads of thermodynamics and communication theory—of physics and engineering. In particular, what bridges the dynamics of the physical world and its immutable laws and principles to the purposeful behavior of intelligent agents? The following argues that an essential connector lies in a new thermodynamic principle: work maximization drives learning.

Perhaps unintentionally, James Clerk Maxwell laid foundations for a physics of intelligence with what Lord Kelvin (William Thomson) referred to as "intelligent demons" [11]. Maxwell in his 1857 book Theory of Heat argued that a "very observant" and "neat fingered being" could subvert the Second Law of Thermodynamics [12]. In effect, his "finite being" uses its intelligence (Maxwell's word) to sort fast from slow molecules, creating a temperature difference that drives a heat engine to do useful work. The demon presented an apparent paradox because directly converting disorganized thermal energy to organized work energy is forbidden by the Second Law. The cleverness in Maxwell's paradox turned on equating the thermodynamic behavior of mechanical systems with the intelligence in an agent that can accurately measure and control its environment. This established an operational equivalence between energetic thermodynamic processes, on the one hand, and intelligence, on the other.

We will explore the intelligence of physical processes, substantially updating the setting from the time of Kelvin and Maxwell, by calling on a wealth of recent results on the nonequilibrium thermodynamics of information [13, 14]. In this, we directly equate the operation of physical agents descended from Maxwell's demon with notions of intelligence found in modern machine learning. While learning is not necessarily the only capability of a presumed intelligent being, it is certainly a most useful and interesting feature.

The root of many tasks in machine learning lies in discovering structure from data.

FIG. 1. Thermodynamic learning generates the maximum-work producing agent: (Left) Environment (green) behavior becomes data for agents (red). (Middle) Candidate agents each have an internal model (inscribed stochastic state-machine) that captures the environment's randomness and regularity to store work energy (e.g., lift a mass against gravity) or to borrow work energy (e.g., lower the mass). (Right) Thermodynamic learning searches the candidate population for the best agent—that producing the maximum work.

The analogous process of creating models of the world from incomplete information is essential to adaptive organisms, too, as they must model their environment to categorize stimuli, predict threats, leverage opportunities, and generally prosper in a complex world. Most prosaically, translating training data into a generative model corresponds to density estimation [15–17], where the algorithm uses the data to construct a probability distribution.

This type of model-building at first appears far afield from more familiar machine learning tasks such as categorizing pet pictures into cats and dogs or generating a novel image of a giraffe from a photo travelogue. Nonetheless, it encompasses them both [18]. Thus, by addressing the thermodynamic roots of model estimation, we seek a physical foundation for a wide breadth of machine learning.

To carry out density estimation, machine learning invokes the principle of maximum likelihood to guide intelligent learning. This says that, of the possible models consistent with the training data, an algorithm should select the one with maximum probability of having generated the data. Our exploration of the physics of learning asks whether a similar thermodynamic principle guides physical systems to adapt to their environments.

The modern understanding of Maxwell's demon no longer entertains violating the Second Law of Thermodynamics [19]. In point of fact, the Second Law's primacy has been repeatedly affirmed in modern nonequilibrium theory and experiment. That said, what has emerged is that we now understand how intelligent (demon-like) physical processes can harvest thermal energy as useful work. They do this by exploiting an information reservoir [19–21]—a storehouse of information as randomness and correlation. That reservoir is the demon's informational environment, and the mechanism by which the demon measures and controls its environment embodies the demon's intelligence, according to modern physics. We will show that this mechanism is directly linked to the demon's model of its environment, which allows us to formalize the connection to machine learning.

Machine learning estimates different likelihoods of different models given the same data. Analogously, in the physical setting of information thermodynamics, different demons harness different amounts of work from the same information reservoir. Leveraging this commonality, Sec. II introduces thermodynamic learning as a physical process that infers optimal demons from environmental information. As shown in Fig. 1, thermodynamic learning selects demons that produce maximum work, paralleling parametric density estimation's selection of models with maximum likelihood. Section III establishes background in density estimation, computational mechanics, and thermodynamic computing necessary to formalize the comparison of maximum-work and maximum-likelihood learning. Our surprising result is that these two principles of maximization are the same, when compared in a common setting. This adds credence to the long-standing perspective that thermodynamics and statistical mechanics underlie many of the tools of machine learning [17, 22–28].

Section IV formally establishes that to construct an intelligent work-harvesting demon, a probabilistic model of its environment is essential. That is, the demon's Hamiltonian evolution is directly determined by its environmental model. This comes as a result of discarding demons that are ineffective at harnessing energy from any input, focusing only on a refined class of efficient demons that make the best use of the given data. This leads to the central result, found in Sec. V, that the demon's work production from environmental "training data" increases linearly with the log-likelihood of the demon's model of its environment. Thus, if the thermodynamic training process selects the maximum-work demon for given data, it has also selected the maximum-likelihood model for that same data.

Ultimately, our work demonstrates an equivalence between the conditions of maximum work and maximum likelihood. In this way, thermodynamic learning is machine learning for thermodynamic machines—it is a physical process that infers models in the same way a machine learning algorithm does. Thus, work itself can be interpreted as a thermodynamic performance measure for learning. In this framing, learning is physical, building on the long-lived narrative of the thermodynamics of organization, which we recount in Sec. VI. While it is natural to argue that learning confers benefits, our result establishes that the benefit is fundamentally rooted in the physical tradeoff between energy and information.
II. FRAMEWORK

While demons continue to haunt discussions of physical intelligence, the notion of a physical process trafficking in information and energy exchanges need not be limited to mysterious intelligent beings. Most prosaically, we are concerned with any physical system that, while interacting with an environment, simultaneously processes information at some energetic cost or benefit. Avoiding theological distractions, we refer to these processes as thermodynamic agents. In truth, any physical system can be thought of as an agent, but only a limited number of them are especially useful for or adept at commandeering information to convert between various kinds of thermodynamic resources, such as between heat and work. Here, we introduce a construction that shows how to find physical systems that are the most capable of processing information to effect thermodynamic transformations.

Consider an environment that produces information in the form of a time series of physical values at regular time intervals of length τ. We denote the particular state realized by the environment's output at time jτ by the symbol y_j ∈ Y_j. Just as the agent must be instantiated by a physical system, so too must the environment and its outputs to the agent. Specifically, Y_j represents the state space of the jth output, which is a subsystem of the environment.

An agent has no access to the internals of its environment and so treats it as a black box. Thus, the agent can only access and interact with the environment's output system Y_j over each time interval t ∈ (jτ, (j + 1)τ). In other words, the state y_j realized by the environment's output is also the agent's input at time jτ. For instance, the environment may produce realizations of a two-level spin system Y_j = {↑, ↓}, which the agent is then tasked to manipulate through Hamiltonian control.

The aim, then, is to find an agent that produces as much work as possible using these black-box outputs. To do so, the agent must come to know something about the black box's structure. This is the principle of requisite complexity [29]—thermodynamic advantage requires that the agent's organization match that of its environment. We implement this by introducing a method for thermodynamic learning, as shown in Fig. 1, that selects a specific agent from a collection of candidates.

Peeking into the internal mechanism of the black box, we wait for a time Lτ, receiving the L symbols y_{0:L} = y_0 y_1 ··· y_{L−1}. This is the agent's training data, which is copied as needed to allow a population of candidate agents to interact with it. As each agent interacts with a copy, it produces an amount of work, which it stores in the work reservoir for later use. In Fig. 1, the work reservoir is illustrated by a hanging mass, which raises when positive work is produced, storing more energy in gravitational potential energy, and lowers when work production is negative, expending that same potential energy. However the work energy is stored, after the agents harvest work from the training data, the agent that produced the most work is selected.

Finding the maximum-work agent is "thermodynamic learning" in the sense that it selects a device based on measuring its thermodynamic performance—the amount of work the device extracts. Ultimately, the goal is that the agent selected by thermodynamic learning continues to extract work as the environment produces new symbols. However, we leave analyzing the long-term effectiveness of thermodynamic learning to the future. Here, we concentrate on the condition of maximum work itself, deriving and interpreting it.
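The selection step just described can be summarized in a few lines of code. The following is a minimal sketch, not from the paper, in which `harvest_work` is a hypothetical interface standing in for the physical interaction between an agent and its copy of the symbol string:

```python
# Schematic of thermodynamic learning as a selection procedure (Sec. II):
# every candidate agent processes its own copy of the training data and
# banks the work it produces; the maximum-work agent is kept.

def thermodynamic_learning(candidate_agents, training_data):
    """Select the agent producing the most work from the training data."""
    return max(candidate_agents,
               key=lambda agent: agent.harvest_work(training_data))
```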
Section IV begins by describing the general class of physical agents that can harness work from symbol sequences, known as information ratchets [30, 31]. While these agents are sufficiently general to implement virtually any (Turing) computation, maximizing work production precludes a wide array of agents. Section IV B then refines our consideration to agents that waste as little work as possible and, in so doing, vastly narrows the search by thermodynamic learning. For this refined class of agents, we find that each agent's operation is exactly determined by its environment model. This leads to our final result, that the agent's work increases linearly with the model's log-likelihood.

For clarity, note that thermodynamic learning differs from physical systems that, evolving in time, dynamically adapt to their environment [26, 32, 33]. Work maximization as described here is thermodynamic in its objective, while these previous approaches to learning are thermodynamic in their mechanism.

That said, the perspectives are linked. In particular, it was suggested that physical systems spontaneously decrease work absorbed from driving [32]. Note that work absorbed by the system is opposite the work produced. And so, as they evolve over time, these physical systems appear to seek higher work production, paralleling how thermodynamic learning selects for the highest work production. And, the synchronization by which a physical system decreases work absorption is compared to learning [32]. Reference [33] goes further, comparing the effectiveness of physical evolution to maximum-likelihood estimation employing an autoencoder. Notably, it reports that that form of machine learning performs markedly better than physical evolution for the particular system considered there. By contrast, we show that the advantage of machine learning over thermodynamic learning does not hold in our framework. Simply speaking, they are synonymous.
We compare thermodynamic learning to machine learning algorithms that use maximum likelihood to select models consistent with given data. As Fig. 1 indicates, each agent has an internal model of its environment—a connection Sec. IV F formalizes. Each agent's work production is then evaluated for the training data. Thus, arriving at a maximum-work agent also selects that agent's internal model as a description of the environment. Moreover, and in contrast with Ref. [33], which compares thermodynamic and machine learning methods numerically, the framework here leads to an analytic derivation of the equivalence between thermodynamic learning and maximum-likelihood density estimation.

III. PRELIMINARIES

Directly comparing thermodynamic learning and density estimation requires explicitly demonstrating that thermodynamically-embedded computing and machine learning share the framework just laid out. The following introduces what we need for this: concepts from machine learning, computational mechanics, and thermodynamic computing. (Readers preferring fuller detail should refer to App. A.)

A. Parametric Density Estimation

Parametric estimation determines, from training data, the parameters θ of a probability distribution. In the present setting, θ parametrizes a family of probabilities Pr(Y_{0:∞} = y_{0:∞} | Θ = θ) over sequences (or words) of any length. Here, Y_{0:∞} = Y_0 Y_1 ··· is the infinite-sequence random variable, composed of the random variables Y_j that each realize the environment's output y_j at time jτ, and Θ is the random variable for the model. In other words, a given model θ predicts the probability of any sequence y_{0:L} of any length L that one might see. For convenience, we introduce random variables Y^θ_j that define a model:

Pr(Y^θ_{0:∞}) ≡ Pr(Y_{0:∞} | Θ = θ) .

With training data y_{0:L}, the likelihood of model θ is given by the probability of the data given the model:

L(θ | y_{0:L}) = Pr(Y_{0:L} = y_{0:L} | Θ = θ) = Pr(Y^θ_{0:L} = y_{0:L}) .

Parametric density estimation seeks to optimize the likelihood L(θ | y_{0:L}) [15, 17, 34]. However, the procedure that finds maximum-likelihood estimates usually employs the log-likelihood instead:

ℓ(θ | y_{0:L}) = ln Pr(Y^θ_{0:L} = y_{0:L}) ,   (1)

since it is maximized for the same models, but converges more effectively [35].
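As a concrete illustration of Eq. (1), the following minimal sketch (not from the paper) evaluates the log-likelihood for the simplest parametric model class, an i.i.d. binary source with θ = Pr(Y_j = 1), and selects the maximum-likelihood model over a grid of candidates:

```python
import math

def log_likelihood(theta, y):
    """ln Pr(Y_{0:L} = y | Theta = theta) for an i.i.d. Bernoulli model."""
    return sum(math.log(theta if symbol == 1 else 1.0 - theta) for symbol in y)

data = [1, 0, 1, 1, 0, 1]
grid = [t / 100 for t in range(1, 100)]
best = max(grid, key=lambda t: log_likelihood(t, data))
print(best)  # 0.67: the empirical frequency of 1s maximizes the likelihood
```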

B. Computational Mechanics

Given that our data is a time series of arbitrary length starting with y_0, we must choose a model class whose possible parameters Θ = {θ} specify a wide range of possible distributions Pr(Y^θ_{0:∞})—the semi-infinite processes. ε-Machines, a class of finite-state machines introduced to describe bi-infinite processes Pr(Y^θ_{−∞:∞}), provide a systematic means to do this [36]. As described in App. A, these finite-state machines comprise just such a flexible class of representations; they can describe any semi-infinite process. This follows from the fact that they are the minimal sufficient statistic for prediction explicitly constructed from the process.

A process's ε-machine consists of a set of hidden states S, a set of output states Y, a start state s∗ ∈ S, and a conditional output-labeled transition matrix θ^{(y)}_{s→s'} over the hidden states:

θ^{(y)}_{s→s'} = Pr(S_{j+1} = s', Y^θ_j = y | S^θ_j = s) .

θ^{(y)}_{s→s'} specifies the probability of transitioning to hidden state s' and emitting symbol y, given that the machine is in state s. In other words, the model is fully specified by the tuple:

θ = {S, Y, s∗, {θ^{(y)}_{s→s'}}_{s,s'∈S, y∈Y}} .

As an example, Fig. 2 shows an ε-machine that generates a periodic process with initially uncertain phase.

FIG. 2. ε-Machine generating the phase-uncertain period-2 process: With probability 0.5, an initial transition is made from the start state s∗ to state A. From there, it emits the sequence 1010 ···. However, with probability 0.5, the start state transitions to state B and outputs the sequence 0101 ···.

ε-Machines are unifilar, meaning that the current causal state s_j along with the next k symbols uniquely determines the following causal state through the propagator function ε:

s_{j+k} = ε(s_j, y_{j:j+k}) .

This yields a simple expression for the probability of any word in terms of the model parameters:

Pr(Y^θ_{0:L} = y_{0:L}) = ∏_{j=0}^{L−1} θ^{(y_j)}_{ε(s∗, y_{0:j}) → ε(s∗, y_{0:j+1})} .

In addition to being uniquely determined by the semi-infinite process, the ε-machine uniquely generates that same process. This means that our model class Θ is equivalent to the class of possible distributions over time-series data. Moreover, knowledge of the causal state of an ε-machine at any time step j contains all information about the future that could be predicted from the past. In this sense, the causal state is predictive of the process. These and other properties have motivated a long investigation of ε-machines, in which the memory cost of storing the causal states is frequently used as a measure of process structure. Appendix A gives an extended review.
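The word-probability product above is easy to compute because of unifilarity. The sketch below (not from the paper) encodes the Fig. 2 ε-machine in an assumed dictionary format, theta[s][y] = (s', probability), and follows the unique causal-state path:

```python
# Phase-uncertain period-2 ε-machine of Fig. 2; each (state, symbol) pair
# has at most one successor, so a word's probability is a single product.
theta = {
    's*': {1: ('A', 0.5), 0: ('B', 0.5)},
    'A':  {0: ('B', 1.0)},
    'B':  {1: ('A', 1.0)},
}

def word_probability(theta, word, start='s*'):
    """Pr(Y^theta_{0:L} = y_{0:L}) via the unique causal-state path."""
    prob, state = 1.0, start
    for y in word:
        if y not in theta[state]:
            return 0.0          # the model assigns this word zero probability
        state, p = theta[state][y]
        prob *= p
    return prob

print(word_probability(theta, [1, 0, 1, 0]))  # 0.5
print(word_probability(theta, [1, 1, 0, 0]))  # 0.0
```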

C. Thermodynamic Computing

Computation is physical—any computation takes place embedded in a physical system. Here, we refer to the substrate of the physically-embedded computation as the system of interest (SOI). Its states, denoted Z = {z}, are taken as the underlying physical system's information-bearing degrees of freedom [19]. The SOI's dynamic evolves the state distribution Pr(Z_t = z_t), where Z_t is the random variable describing the state at time t. Computation over the time interval t ∈ [τ, τ'] specifies how the dynamic maps the SOI from the initial time t = τ to the final time t = τ'. It consists of two components:

1. An initial distribution over states Pr(Z_τ = z_τ) at time t = τ.

2. Application of a Markov channel M, characterized by the conditional probability of transitioning to the final state z_τ' given the initial state z_τ:

M_{z_τ → z_τ'} = Pr(Z_τ' = z_τ' | Z_τ = z_τ) .

Together, these specify the SOI's computational elements. In this, z_τ is the input to the physical computation, z_τ' is the output, and M_{z_τ → z_τ'} is the logical architecture.

FIG. 3. Thermodynamic computing: The system of interest Z's states store information, processing it as they evolve. The work reservoir, represented as the suspended mass, supplies work energy W to drive the SOI Hamiltonian along a deterministic trajectory H_Z(t). Meanwhile, heat energy Q is exchanged with the thermal reservoir, driving the system toward thermal equilibrium.

Figure 3 illustrates a computation's physical implementation. SOI Z is coupled to a work reservoir, depicted as a mass hanging from a string, that controls the system's Hamiltonian along a trajectory H_Z(t) over the computation interval t ∈ [τ, τ'] [37]. This is the basic definition of a thermodynamic agent: an evolving Hamiltonian driving a physical system to compute at the cost of work.

In a classical system, this control determines each state's energy E(z, t). As a result of the control, changes in energy due to changes in the Hamiltonian correspond to work exchanges between the SOI and work reservoir. The system Z follows a state trajectory z_{τ:τ'} over the time interval t ∈ [τ, τ'], which we can write:

z_{τ:τ'} = z_τ z_{τ+dt} ··· z_{τ'−dt} z_{τ'} ,

where z_t is the system state at time t. Here, we decomposed the trajectory into intervals of duration dt, taken short enough to yield infinitesimal changes in state probabilities and the Hamiltonian. The resulting work production for this trajectory is then the integrated change in energy due to the Hamiltonian's time dependence [37]:

W_{|z_{τ:τ'}} = −∫_τ^{τ'} dt ∂_t E(z, t)|_{z = z_t} .

Note that while the state trajectory z_{τ:τ'} mirrors the time-series notation used for the training data y_{0:L} = y_0 y_1 ··· y_{L−1}, they are different objects and should not be conflated. On the one hand, the training-data series y_{0:L} is composed of realizations of L separate subsystems, each produced at a different time jτ, j ∈ {0, 1, 2, ···, L−1}. y_j is realized in the subsystem Y_j, and so it can be manipulated completely separately from any other element of y_{0:L} lying outside of Y_j. By contrast, z_t depends dynamically on many other elements in z_{τ:τ'}, all of which lie in the same system, since the time series z_{τ:τ'} represents the state evolution of the single system Z over time.
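The work integral has a simple discretized reading, sketched below under the stated assumption of short intervals dt (an illustration, not the paper's protocol): over each step the system rides along in state z_t while the landscape shifts, contributing −(E(z_t, t+dt) − E(z_t, t)) to the extracted work.

```python
def trajectory_work(E, traj, times, dt):
    """Riemann-sum approximation of W_{|z_{tau:tau'}}."""
    return -sum(E(z, t + dt) - E(z, t) for z, t in zip(traj, times))

# Lowering the energy of the occupied state extracts positive work:
E = lambda z, t: -t if z == 1 else 0.0
print(trajectory_work(E, traj=[1, 1], times=[0.0, 0.5], dt=0.5))  # 1.0
```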

While the SOI exchanges work energy with the work reservoir, as Fig. 3 shows, it exchanges heat Q with the thermal reservoir. Coupling to a heat reservoir adds stochasticity to the state trajectory z_{τ:τ'}. Since the SOI computes while coupled to a thermal reservoir at temperature T, Landauer's Principle [19] relates a computation's logical processing to its energetics. In its contemporary form, it bounds the average work production ⟨W⟩ by a term proportional to the SOI's entropy change. Taking the Shannon entropy H[Z_t] = −∑_z Pr(Z_t = z) ln Pr(Z_t = z) in natural units, the Second Law of Thermodynamics implies [14]:

⟨W⟩ ≤ k_B T (H[Z_{τ'}] − H[Z_τ]) .

Here, the average ⟨W⟩ is taken over all possible microscopic trajectories. And, the energy landscape is assumed to be flat at the computation's start and end, giving no energetic preference to a particular informational state.

FIG. 4. Thermodynamic computing by an agent driven by an input sequence: Information-bearing degrees of freedom of SOI Z in the jth interaction interval split into the direct product of agent states X and the jth input states Y_j. Work W and heat Q are defined in the same way as in Fig. 3, with the SOI's Hamiltonian control H_{X×Y_j}(t) explicitly decoupled from the environment's remaining subsystems ··· Y_{j−2} Y_{j−1} Y_{j+1} Y_{j+2} ···, corresponding to future inputs Y_{j+1} Y_{j+2} ··· and past outputs ··· Y_{j−2} Y_{j−1}.

IV. AGENT ENERGETICS

We now construct the theoretical framework for how agents extract work from time-series data. This involves breaking down the agent's actions into manageable elementary components—where we demonstrate that their actions can be described as repeated application of a sequence of computations. We then introduce tools to analyze work production within such general computations on finite data. We highlight the importance of the agent's model of the data in determining work production. This model-dependence emerges by refining the class of agents to those that execute their computation most efficiently. The results are finally combined, resulting in a closed-form expression for agent work production from time-series data.

A. Agent Architecture

Recall from Sec. II that the basic framework describes a thermodynamic agent interacting with an environment at regular time intervals τ, with the jth input in state y_j. Each y_j is drawn according to a random variable Y_j, such that the sequence Y_{0:∞} = Y_0 Y_1 ··· is a semi-infinite stochastic process. The agent's task is to interact with this input string to generate useful work.

For example, consider an agent charged with extracting work from an alternating process—a sequence emitted by a degenerate two-level system that alternates periodically between symbols 0 and 1. In isolation each symbol looks random and has no free energy. Thus, an agent that interacts with each symbol the same way gains no work. However, a memoryful agent can adaptively adjust its behavior, after reading the first symbol, to exactly predict succeeding symbols and, therefore, extract meaningful work. This method of harnessing temporal correlations is implemented by information ratchets [30, 31]. They combine physical inputs with additional agent memory states that store the input's temporal correlations.

As shown in Fig. 4, we describe an agent's memory via an ancillary physical system X. The agent then operates cyclically with duration τ, such that the jth cycle runs over the time interval [jτ, (j + 1)τ). Each cycle involves two phases:

1. Interaction: Agent memory X couples to and interacts with the jth input system Y_j that contains the jth input symbol y_j. This phase has duration τ' < τ, meaning the jth interaction phase occurs over the time interval [jτ, jτ + τ'). At the end, the agent decouples from the system Y_j, passing its new state y'_j to the environment as output or exhaust.

2. Rest: During the time interval [jτ + τ', (j + 1)τ), the agent's memory X sits idle, waiting for the next input Y_{j+1}.

In this way, the agent transforms a series of inputs y_{0:L} into a series of outputs y'_{0:L}.
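The cycle's logic alone can be sketched as repeated application of a joint channel over (memory, symbol) pairs; energetics enter in Sec. IV B. In this illustration (not the paper's notation) M[(x, y)] is an assumed dictionary giving a distribution over joint outcomes (x', y'):

```python
import random

def transduce(M, x0, inputs, rng=random.Random(0)):
    """Map an input series y_{0:L} to an output series y'_{0:L};
    the memory x persists between cycles."""
    x, outputs = x0, []
    for y in inputs:
        pairs, weights = zip(*M[(x, y)].items())
        x, y_out = rng.choices(pairs, weights=weights)[0]
        outputs.append(y_out)
    return outputs

# Memory for the alternating process: after one symbol the agent knows the
# phase ('even'/'odd'), so every later symbol is perfectly predictable.
M = {
    ('u', 0): {('even', 0): 1.0}, ('u', 1): {('odd', 1): 1.0},
    ('even', 1): {('odd', 1): 1.0}, ('odd', 0): {('even', 0): 1.0},
}
print(transduce(M, 'u', [0, 1, 0, 1]))  # [0, 1, 0, 1]
```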

In each cycle, all nontrivial thermodynamics occur in the interaction phase, during which the SOI consists of the joint agent-input system: i.e., Z = X ⊗ Y_j, as shown in Fig. 4. While the other subsystems ··· Y_{j−2} Y_{j−1} and Y_{j+1} Y_{j+2} ··· may be physically instantiated somewhere else in the environment, they do not participate in the interaction, since they are energetically decoupled from the agent in this phase. Paralleling the computation shown in Fig. 3, Hamiltonian control over the joint space H_{X×Y_j}(t) results in a transformation of the agent's SOI that also requires the exchange of work and heat.

This interaction phase updates SOI states according to a Markov transition matrix M, shown in Fig. 5:

M_{xy→x'y'} = Pr(X_{j+1} = x', Y'_j = y' | X_j = x, Y_j = y) ,   (2)

where X_j and X_{j+1} are the random variables for the states of the agent's memory X before and after the jth interaction interval, and Y_j and Y'_j are the random variables for the system Y_j before and after the same interaction interval, realizing the input and output, respectively. As Sec. III C described, M is the logical architecture of the physical computation that transforms the agent's memory and input simultaneously. It is the central element in the agent's procedure for transforming inputs y_{0:L} into associated outputs y'_{0:L}. The key observation is that M captures all of the agent's internal logic. The logic does not change from cycle to cycle. However, the presence of persistent internal memory between cycles implies that the agent's behavior adapts to past inputs and outputs. This motivates us to define M as the agent architecture, since it determines how an agent stores information temporally. As we will show, M is one of two essential elements in determining the work an agent produces from a time series.

FIG. 5. Agent interacting with an environment via repeated symbol exchanges: A) At time jτ agent memory X_j begins interacting with input symbol Y_j. Transitioning from A) to B), agent memory and interaction symbol jointly evolve according to the Markov channel M_{xy→x'y'}. This results in B)—the updated states of agent memory X_{j+1} and interaction symbol Y'_j at time jτ + τ'. Transitioning from B) to C), the agent memory decouples from the interaction symbol, emitting its new state to the environment. Then, transitioning from C) to D), the agent retains its memory state X_{j+1} and the environment emits the next interaction symbol Y_{j+1}. Finally, transitioning from D) to A), the agent restarts the cycle by coupling to the next input symbol.

Note that prior related efforts to address agent energetics focused on ensemble-average work production [29–31, 38–42]. In contrast, here we relate work production to parametric density estimation—which involves each agent being given a specific data string y_{0:L} for training. To address this case, the following determines the work production for single-shot short input strings.

B. Energetics of Computational Maps

The agent architecture M specifies a physical computation as described in Sec. III C and therefore has a minimum energy cost determined by Landauer's bound. However, this is a bound on the average work production, which depends explicitly on the distribution of inputs. We need to determine, instead, the work produced from a single input y_j. To find this we return to the general case of SOI Z undergoing a thermodynamic computation M.

A physical operation takes the SOI from state z_τ at time τ to state z_τ' at time τ'. This specifies a computational map z_τ → z_τ' that ignores intermediate states in the SOI state trajectory, as all information relevant to the computation's logical operation lies in the input and output. Thus, our attention turns to the question: What is the work production of a computational map z_τ → z_τ' performed by the computation M at temperature T?

To determine this, we first prove a useful relation between the entropy and work production for a particular state trajectory z_{τ:τ'}. Specifically, let W_{|z_{τ:τ'}} and Σ_{|z_{τ:τ'}} denote the work and total entropy production along this trajectory, respectively. Meanwhile, let E(z_t, t) denote the system energy when it is in state z_t at time t. Now, consider the pointwise nonequilibrium free energy:

φ(z_t, t) = E(z_t, t) + k_B T ln Pr(Z_t = z_t) .   (3)

More familiarly, note that its time-averaged quantity is the nonequilibrium free energy [43]:

F^{neq} = ⟨φ(z, t)⟩_{Pr(Z_t = z)} .

We can then show that the entropy production Σ can be expressed:

Σ_{|z_{τ:τ'}} = ( −W_{|z_{τ:τ'}} + φ(z_τ, τ) − φ(z_τ', τ') ) / T .   (4)

This follows by noting that the total entropy produced from thermodynamic control is the sum of the entropy change in the system [44]:

ΔS^Z_{|z_{τ:τ'}} = k_B ln [ Pr(Z_τ = z_τ) / Pr(Z_τ' = z_τ') ]

and that of the thermal reservoir:

ΔS^{reservoir}_{|z_{τ:τ'}} = Q_{|z_{τ:τ'}} / T .

Equation (4) follows by summing these contributions to the total entropy production Σ = ΔS^{reservoir} + ΔS^Z and noting that the SOI's change in energy obeys the First Law of Thermodynamics, ΔE^Z = −W − Q.

Since only the SOI's initial and final states matter to the logical operation of the computational map, we take a statistical average over all trajectories beginning in z_τ and ending in z_τ'. This results in the work production:

⟨W⟩_{|z_τ, z_τ'} = ∑_{z'_{τ:τ'}} W_{|z'_{τ:τ'}} Pr(Z_{τ:τ'} = z'_{τ:τ'} | z_τ, z_τ')   (5)

for the computational map z_τ → z_τ'. This determines how much energy is stored in the work reservoir on average when a computation results in this particular input-output pair.

Similarly, taking the same average of the entropy production shown in Eq. (4), conditioned on inputs and outputs, gives:

T ⟨Σ⟩_{|z_τ, z_τ'} = −⟨W⟩_{|z_τ, z_τ'} + φ(z_τ, τ) − φ(z_τ', τ') = −⟨W⟩_{|z_τ, z_τ'} − Δφ_{|z_τ, z_τ'} .

This suggestively relates computational-mapping work and the change in pointwise nonequilibrium free energy φ(z, t).

This relation between work and free energy simplifies for thermodynamically-efficient computations. In such scenarios, the average total entropy production over all trajectories vanishes. Appendix B shows that zero average entropy production, combined with the Crooks fluctuation theorem [45, 46], implies that each individual trajectory z_{τ:τ'} produces zero entropy: Σ_{|z_{τ:τ'}} = 0. This is expected from linear response [47]. Thus, substituting zero entropy production into Eq. (4), we arrive at our result: work production for thermodynamically-efficient computations is the change in pointwise nonequilibrium free energy:

W^{eff}_{|z_{τ:τ'}} = −Δφ_{|z_τ, z_τ'} .

Substituting Eq. (3) then gives:

W^{eff}_{|z_{τ:τ'}} = −ΔE_Z + k_B T ln [ Pr(Z_τ = z_τ) / Pr(Z_τ' = z_τ') ] ,

where ΔE_Z = E(z_τ', τ') − E(z_τ, τ). This is what we would expect in quasistatic computations, where the system energies E(z, t) are varied slowly enough that the system remains in equilibrium for the duration. We should note, though, that it is possible to implement efficient computations rapidly and out of equilibrium [48].

This also holds if we average over intermediate states of the SOI's state trajectory, yielding the work production of a computational map:

⟨W^{eff}⟩_{|z_τ, z_τ'} = −ΔE_Z + k_B T ln [ Pr(Z_τ = z_τ) / Pr(Z_τ' = z_τ') ] .   (6)

The energy required to perform efficient computing is independent of intermediate properties. It depends only on the probability and energy of the initial and final states. This measures the energetic gains from a single data realization as it transforms during a computation, as opposed to the ensemble average.

C. Energetics of Estimates

Thermodynamic learning concerns agents that maximize work production from their input data. As such, we now restrict our attention to agents that harness all available nonequilibrium free energy in the form of work: ⟨W⟩ = −ΔF^{neq}. These maximum-work agents zero out the average entropy production T⟨Σ⟩ = −⟨W⟩ − ΔF^{neq}, and the work production of a computational map satisfies Eq. (6). From here on out, when we refer to efficient agents we refer to those that maximize work production from the available change in nonequilibrium free energy.

SOI state probabilities feature centrally in the expression for nonequilibrium free energy and, thus, for the work production of efficient agents. However, the actual input distribution Pr(Z_τ) may vary while the agent, defined by its Hamiltonian H_Z(t) over the computation interval, remains fixed. Moreover, since the work production ⟨W^{eff}⟩_{|z_τ, z_τ'} of a computational map explicitly conditions on the initial and final SOI state, this work cannot explicitly depend on the input distribution. At first blush, this is a contradiction: work that simultaneously does and does not depend on the input distribution.

This is resolved once one recognizes the role that estimates play in thermodynamics. As indicated in Fig. 1, we claim that an agent has an estimated model of its environment that it uses to predict the SOI. This model is, in one form or another, encoded in the evolving Hamiltonian H_Z(t) that determines both the agent's energetic interactions and its logical architecture. If an agent's estimated model of SOI Z is encoded as parameters θ, then the agent estimates that the SOI state z_t at time t has probability:

Pr(Z^θ_t = z_t) = Pr(Z_t = z_t | Θ = θ) .

The physical relevance of the estimated distribution comes from insisting that the agent dissipates as little work as possible from a SOI whose distribution matches its own estimate. In essence, the initial estimated distribution Pr(Z^θ_τ) must be one of the distributions that minimizes the average entropy production [49]:

Pr(Z^θ_τ) ∈ arg min_{Pr(Z_τ)} ⟨Σ[Pr(Z_τ)]⟩ .

Estimated probabilities Pr(Z^θ_t) at later times t > τ are determined by updating the initial estimate via the stochastic dynamics that result from the Hamiltonian H_Z(t) interacting with the thermal bath.

Thus, since an efficient agent produces zero entropy when the SOI follows the minimum-dissipation distribution Pr(Z^θ_t), the work it produces from a computational map is:

⟨W^θ⟩_{|z_τ, z_τ'} = −ΔE_Z + k_B T ln [ Pr(Z^θ_τ = z_τ) / Pr(Z^θ_τ' = z_τ') ] .   (7)

In this, we replaced the superscript "eff" with "θ" to emphasize that the agent is designed to be thermodynamically efficient for that particular estimated model. Specifying the estimated model is essential, since misestimating the input distribution leads to dissipation and entropy production [49, 50]. Returning to thermodynamic learning, this is how the model θ factors into the ratchet's operation: estimated distributions explicitly determine the work production of computational maps.

Appendix C gives a concrete quasistatic mechanism for implementing any computation M and achieving the work given by Eq. (7). This directly demonstrates how the model θ is built into the evolving energy landscape H_Z(t) that implements M. The model θ determines the initial and final change in state energies: ΔE(z, τ) = −k_B T ln Pr(Z^θ_τ = z) and ΔE(z, τ') = k_B T ln Pr(Z^θ_τ' = z). This quasistatic protocol operates and produces the same work for a particular input-output pair regardless of the actual input distribution.

Since we focus on the energetic benefits derived from information itself rather than those from changing energy levels, the example implementations we use also start and end with the same flat energy landscape. Restricting to such information-driven agents, we consider cases where ΔE_Z = 0, whereby:

⟨W^θ⟩_{|z_τ, z_τ'} = k_B T ln [ Pr(Z^θ_τ = z_τ) / Pr(Z^θ_τ' = z_τ') ] .   (8)

This provides a direct relationship between the work production ⟨W^θ⟩_{|z_τ, z_τ'} from particular data realizations and the model θ that the agent uses, via the estimates Pr(Z^θ_t) provided by that model. This is an essential step in determining a model through thermodynamic learning.

D. Thermally Efficient Agents

With the work production of a maximally-efficient computational map established, we are poised to determine the work production for thermodynamically-efficient agents. Specifically, consider an agent parameterized by its logical architecture M and model parameters θ. As described by the agent architecture in Sec. IV A, the agent uses its memory X_j to map inputs Y_j to outputs Y'_j and to update its memory to X_{j+1}. In the stochastic mapping of the SOI Z = X × Y_j, the model parameter θ provides an estimate of the distribution over the current initial state (X^θ_j, Y^θ_j) as well as the final state (X^θ_{j+1}, Y'^θ_j). Assuming the agent's logical architecture M is executed optimally, direct application of Eq. (8) then says that the expected work production of the computational map x_j y_j → x_{j+1} y'_j is:

⟨W^θ⟩_{j, x_j y_j → x_{j+1} y'_j} ≡ k_B T ln [ Pr(X^θ_j = x, Y^θ_j = y) / Pr(X^θ_{j+1} = x', Y'^θ_j = y') ] .   (9)

In this, the estimated final distribution comes from the logical architecture updating the initial distribution:

Pr(X^θ_{j+1} = x', Y'^θ_j = y') = ∑_{x,y} Pr(X^θ_j = x, Y^θ_j = y) M_{xy→x'y'} .

Equation (9)'s expression for work establishes that all functional aspects (logical operation and energetics) of an efficient agent are determined by two factors:

1. The logical architecture M that specifies how the agent manipulates inputs and updates its own memory.

2. The estimated input distribution Pr(X^θ_j, Y^θ_j) for which the agent's execution of M is optimized to minimize dissipation.

Thus, we define a thermally-efficient agent by the ordered pair {M, Pr(X^θ_j, Y^θ_j)}.
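A minimal numerical sketch of Eqs. (8)-(9), under the stated assumptions of a flat energy landscape and an agent efficient for its own estimate (illustrative encoding, not the paper's notation):

```python
import math

k_B, T = 1.0, 1.0  # natural units

def step_work(p_est, M, x, y, x_next, y_out):
    """Expected work of the map x_j y_j -> x_{j+1} y'_j, Eq. (9); the final
    estimate is the initial estimate p_est pushed through the channel M."""
    p_out = {}
    for (a, b), p in p_est.items():
        for (a2, b2), q in M[(a, b)].items():
            p_out[(a2, b2)] = p_out.get((a2, b2), 0.0) + p * q
    return k_B * T * math.log(p_est[(x, y)] / p_out[(x_next, y_out)])

# Erasing a bit the model estimates to be uniform yields negative work
# production -k_B T ln 2, i.e., the agent pays Landauer's erasure cost.
M = {('m', 0): {('m', 0): 1.0}, ('m', 1): {('m', 0): 1.0}}
p_est = {('m', 0): 0.5, ('m', 1): 0.5}
print(step_work(p_est, M, 'm', 0, 'm', 0))  # -0.693... = -ln 2
```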

So defined, we can calculate the work produced when such agents act on a particular input sequence y_{0:L}. This is done by first considering the work production of a particular sequence of agent memory states x_{0:L+1} and outputs y'_{0:L}:

⟨W^θ⟩_{|y_{0:L}, y'_{0:L}, x_{0:L+1}} = ∑_{j=0}^{L−1} ⟨W^θ⟩_{j, x_j y_j → x_{j+1} y'_j} = k_B T ∑_{j=0}^{L−1} ln [ Pr(X^θ_j = x_j, Y^θ_j = y_j) / Pr(X^θ_{j+1} = x_{j+1}, Y'^θ_j = y'_j) ] .

Then, to obtain the average work produced from a particular input sequence y_{0:L}, we average over all possible hidden-state sequences x_{0:L+1} and output sequences y'_{0:L}:

⟨W^θ⟩_{|y_{0:L}} = k_B T ∑_{x_{0:L+1}, y'_{0:L}} Pr(Y'_{0:L} = y'_{0:L}, X_{0:L+1} = x_{0:L+1} | Y_{0:L} = y_{0:L}) ⟨W^θ⟩_{|y_{0:L}, y'_{0:L}, x_{0:L+1}}   (10)
 = k_B T ∑_{x_{0:L+1}, y'_{0:L}} Pr(X_0 = x_0) [ ∏_{k=0}^{L−1} M_{x_k y_k → x_{k+1} y'_k} ] ∑_{j=0}^{L−1} ln [ Pr(X^θ_j = x_j, Y^θ_j = y_j) / Pr(X^θ_{j+1} = x_{j+1}, Y'^θ_j = y'_j) ] .

This gives the average energy harvested by an agent that transduces inputs y_{0:L} according to the logical architecture M, given that it is designed to be as efficient as possible when its model θ matches the environment.

On its own, Eq. (10)'s work production is a deeply interesting quantity. In point of fact, since our agents are stochastic Turing machines [51], this is the work production for any general form of computation that maps inputs to output distributions Pr(Y'_{0:L} | Y_{0:L} = y_{0:L}) [52]. Thus, Eq. (10) determines the possible work benefit for universal thermodynamic computing.

Given this general expression for work production, one might conclude that the next step for thermodynamic learning is to search for the agent tuple {M, Pr(X^θ_j, Y^θ_j)} that maximizes the work production. However, this strategy comes with two issues. First, it requires a wider search than necessary. Second, it does not draw a direct connection to the underlying model θ. Recall that we are considering ε-machine models θ of the input sequence that give the probability estimate Pr(Y^θ_{0:L} = y_{0:L}) for any L.

We address both of these issues by refining the search space to agents whose anticipated inputs Pr(X^θ_j, Y^θ_j) are explicitly determined by their initial state distribution Pr(X^θ_0) and estimated input process Pr(Y^θ_{0:∞}):

Pr(X^θ_j = x_j, Y^θ_j = y_j) = ∑_{x_{0:j}, y_{0:j}, y'_{0:j}} Pr(X^θ_0 = x_0) Pr(Y^θ_{0:j+1} = y_{0:j+1}) ∏_{k=0}^{j−1} M_{x_k y_k → x_{k+1} y'_k} .

We use this estimate for the initial state in Eq. (10), since it is maximally efficient, dissipating as little as possible if the agent architecture M receives the input distribution Pr(Y^θ_{0:∞}). As a result of its efficiency, the resulting computation performed by the agent produces the maximum possible work, given its logical architecture. This simplifies our search for maximum-work agents by directly tying the estimated inputs to the model θ. However, it still leaves one piece of the agent undefined: its logical architecture M. Fortunately, as we discuss now, the thermodynamics of modularity further simplifies the search.
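Equation (10) can be evaluated directly for small systems. The sketch below (an illustration under the same assumptions as the step-work example above) averages the per-step log-ratios of Eq. (9) over the agent's stochastic memory/output trajectory for one specific input string; p_est_seq[j] is the model's estimated joint (memory, input) distribution at step j and must assign nonzero probability to the observed transitions:

```python
import math

k_B, T = 1.0, 1.0

def push_through(p_joint, M):
    """Estimated output distribution: p_joint propagated through M."""
    out = {}
    for (x, y), p in p_joint.items():
        for (x2, y2), q in M[(x, y)].items():
            out[(x2, y2)] = out.get((x2, y2), 0.0) + p * q
    return out

def average_work(M, p_est_seq, x0, y_data):
    """<W^theta>_{|y_{0:L}}: expected work from the input string y_data."""
    mem, W = {x0: 1.0}, 0.0                 # distribution over memory states
    for j, y in enumerate(y_data):
        p_out = push_through(p_est_seq[j], M)
        next_mem = {}
        for x, px in mem.items():
            for (x2, y2), q in M[(x, y)].items():
                W += px * q * math.log(p_est_seq[j][(x, y)] / p_out[(x2, y2)])
                next_mem[x2] = next_mem.get(x2, 0.0) + px * q
        mem = next_mem
    return k_B * T * W
```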

E. Thermodynamics of Modularity

An agent transduces inputs to outputs through a series of modular operations. The Hamiltonian H_{X×Y_j}(t) that governs the evolution of the jth operation is decoupled from the other elements of the time series Y_0 × Y_1 × ··· × Y_{j−1} × Y_{j+1} × ···. As a result of this modular computational architecture, the correlations lost between the agent and the rest of the information reservoir are irreversibly dissipated, producing entropy. This is an energetic cost associated directly with the agent's logical architecture, known as the modularity dissipation [53]. It sets a minimum for the work dissipated in a complex computation composed of many elementary steps.

To continue our pursuit of maximum work, we must design the agent's logical architecture to minimize dissipated work. Past analysis of agents that harness energy from a pattern Pr(Y_{0:∞}) showed that the modularity dissipation is only minimized when the agent's states are predictive of the pattern [53]. This means that, to maximize work extracted, an agent's state must contain all information about the past relevant to the future: Pr(Y_{j+1:L} | X_j) = Pr(Y_{j+1:L} | Y_{0:j}). That is, for maximal work extraction agent states must be sufficient statistics for predicting the future.

Moreover, the ε-machines introduced in Sec. III B are constructed with hidden states S_j that are a minimal predictor of their output process. This is why they are referred to as causal states. And so, the ε-machine is a minimal sufficient statistic for prediction. Transitions among causal states trigger outputs according to:

θ^{(y)}_{s→s'} = Pr(Y^θ_j = y, S^θ_{j+1} = s' | S^θ_j = s) ,

which is the probability that an ε-machine in internal state s transitions to s' and emits output y. ε-Machines have the additional property that they are unifilar, meaning that the next causal state s' is uniquely determined by the current state s and output y via the propagator function s' = ε(s, y).

In short, to produce maximum work, agent memory must store at least the causal states of the environment's ε-machine. Appendix D describes how the logical architecture of the maximum-work minimum-memory agent is determined by its estimated ε-machine model θ of its inputs. The agent states are chosen to be the same as the causal states, X = S, and they update according to the propagator function ε:

M_{xy→x'y'} = (1/|Y_j|) × { δ_{x', ε(x,y)} if ∑_{x'} θ^{(y)}_{x→x'} ≠ 0 ; δ_{x', x} otherwise } .   (11)

This guarantees that the agent's internal states follow the causal states of its estimated input process Pr(Y^θ_{0:∞}). In turn, it prevents the agent from dissipating temporal correlations within that process.

F. Agent-Model Equivalence

In the pursuit of maximum work, we refined the structure of candidate agents considerably, limiting consideration to those that minimize the modularity dissipation. As a result, they store the predictive states of their estimated input, and their logical architecture is explicitly determined by an ε-machine model. Moreover, to be efficient, the candidate agent should begin in the ε-machine's start state s∗, such that the model θ uniquely determines the second piece of an efficient agent. This is the estimated initial distribution over its memory state and input:

Pr(Y^θ_j = y_j, X^θ_j = s_j) = ∑_{y_{0:j}, s_{0:j}, s_{j+1}} δ_{s_0, s∗} ∏_{k=0}^{j} θ^{(y_k)}_{s_k→s_{k+1}} .

Conversely, maximum-work agents, characterized by their logical architecture and anticipated input distributions {M, Pr(X^θ_j, Y^θ_j)}, also specify the ε-machine's model of their estimated distribution:

θ^{(y)}_{s→s'} = Pr(Y^θ_j = y | S^θ_j = s) δ_{s', ε(s,y)} = Pr(Y^θ_j = y | X^θ_j = s) |Y_j| M_{sy→s'y'} .   (12)

Through the ε-machine, the agent also specifies its estimated input process. In this way, we arrive at a class of agents that are uniquely determined by their environment model.
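Equation (11) makes the model-to-agent construction mechanical. The sketch below derives the agent architecture from an ε-machine in the dict encoding used earlier (theta[x][y] = (x', prob); an assumed format, not the paper's notation):

```python
def agent_architecture(theta, alphabet):
    """Build M_{xy->x'y'} of Eq. (11): on a symbol the model can produce,
    memory follows the propagator ε(x, y) and the outgoing symbol is
    uniformly randomized; on a forbidden symbol, memory is unchanged."""
    M = {}
    for x, edges in theta.items():
        for y in alphabet:
            x_next = edges[y][0] if y in edges else x
            M[(x, y)] = {(x_next, y_out): 1.0 / len(alphabet)
                         for y_out in alphabet}
    return M
```

Feeding the Fig. 2 ε-machine with alphabet {0, 1} into this constructor, together with the estimated distributions above, should reproduce the maximum-work agent for the phase-uncertain period-2 process.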

F. Agent–Model Equivalence

In the pursuit of maximum work, we refined the structure of candidate agents considerably, limiting consideration to those that minimize the modularity dissipation. As a result, they store the predictive states of their estimated input and their logical architecture is explicitly determined by an ε-machine model. Moreover, to be efficient, the candidate agent should begin in the ε-machine's start state s*, such that the model θ uniquely determines the second piece of an efficient agent. This is the estimated initial distribution over its memory state and input:

Pr(Y_j^θ = y_j, X_j^θ = s_j) = Σ_{y_{0:j}, s_{0:j}, s_{j+1}} δ_{s_0,s*} Π_{k=0}^{j} θ^{(y_k)}_{s_k→s_{k+1}} .

Conversely, maximum-work agents, characterized by their logical architecture and anticipated input distributions {M, Pr(X_j^θ, Y_j^θ)}, also specify the ε-machine's model of their estimated distribution:

θ^{(y)}_{s→s′} = Pr(Y_j^θ = y|S_j^θ = s) δ_{s′,ε(s,y)}
             = Pr(Y_j^θ = y|X_j^θ = s) |Y_j| M_{sy→s′y′} . (12)

Through the ε-machine, the agent also specifies its estimated input process. In this way, we arrive at a class of agents that are uniquely determined by their environment model.

Figure 6 explicitly lays this out. It presents an agent that estimates a period-2 process with uncertain phase, such that Pr(Y_{0:∞}^θ = 0101···) = Pr(Y_{0:∞}^θ = 1010···) = 0.5. The middle column shows the ε-machine that is uniquely determined for that process, characterized by the model parameters θ^{(y)}_{s→s′}. The right column shows the unique minimal agent that harnesses as much work as possible from that process. All three—(i) the estimated process, (ii) the estimated ε-machine model, and (iii) the maximum-work agent—are equivalent. And, they can be determined from one another. This holds true for any estimated input process Pr(Y_{0:∞}^θ).

FIG. 6. Equivalence of the estimated input process Pr(Y_{0:∞}^θ = y_{0:∞}), the ε-machine model θ, and the agent that efficiently harnesses the input, using logical architecture M_{xy→x′y′} = Pr(X_{j+1} = x′, Y′_j = y′|X_j = x, Y_j = y) and estimated input distribution Pr(X_j^θ = x, Y_j^θ = y). Determining one determines the others.

Under the equivalence of model θ and agent operation, when we monitor the agent's thermodynamic performance through its work production, we also measure the predictive performance of its underlying model. This completes the thermodynamic learning framework laid out in Fig. 1. There, the model an agent holds affects its interaction with the symbol sequence y_{0:L} and, ultimately, its work production. And so, from this point forward, when discussing an estimated process or an ε-machine that generates that guess, we are also describing the unique thermodynamic agent designed to produce maximal work from the estimated process. We can now turn to explore how such agents' work production ties to their underlying models.

V. WORK AS A PERFORMANCE MEASURE

The agent described in the previous section is maximally efficient for the input process Pr(Y_{0:∞}^θ) and asymptotically harvests the difference between the entropy rates of the input and output, k_B T (h′_μ − h_μ), as L → ∞ [25], achieving the Information Processing Second Law of thermodynamics, if the input and output are stationary. However, what happens when these agents are applied to finite data?

Plugging this transducer into the expression for one-shot work production in Eq. (10), using efficiently-designed predictive agents leads to a much simpler expression for work production:

⟨W^θ⟩_{|y_{0:L}} = k_B T ( ln Pr(Y_{0:L}^θ = y_{0:L}) + L ln |Y| ) . (13)

The mechanism behind this vast simplification arises from unifilarity—a property of prediction machines that guarantees a single state trajectory x_{0:L} for each input string y_{0:L}. The details of the derivation are outlined in Appendix E.

This expression directly captures the relationship between work production and the agent's underlying model of the data. To see this, we recast it in the language of machine learning. Consider y_{0:L} as training data in parametric density estimation. We are then tasked to construct a model of this data. Each candidate model is parameterized by θ, which results in an estimated process Y_{0:L}^θ. Observe then that Pr(Y_{0:L}^θ = y_{0:L}) is simply the probability that the candidate model will output y_{0:L}. Therefore, the log-likelihood ℓ(θ|y_{0:L}) of parametric estimation coincides with ln Pr(Y_{0:L}^θ = y_{0:L}), and we can write:

⟨W^θ⟩_{|y_{0:L}} = k_B T ℓ(θ|y_{0:L}) + k_B T L ln |Y| . (14)

One concludes that work production is maximized precisely when the log-likelihood ℓ(θ|y_{0:L}) is maximized. Thus, the criterion for creating a good model of an environment is the same as that for extracting maximal work.
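Equation (14) is simple to evaluate once a candidate model supplies word probabilities. Below is a minimal Python sketch, assuming a unifilar model stored as labeled transition probabilities theta[(s, y, s2)] with a start state; the data layout and function names are illustrative only, and work is measured in units of k_B T.

```python
import math

KB_T = 1.0  # work in units of k_B T

def log_likelihood(theta, s_star, word):
    """ln Pr(Y_{0:L} = word) for a unifilar model: follow the unique
    state path from the start state, accumulating log-probabilities."""
    s, ll = s_star, 0.0
    for y in word:
        step = {s2: p for (si, yi, s2), p in theta.items()
                if si == s and yi == y and p > 0}
        if not step:
            return float("-inf")  # word not producible by the model
        ll += math.log(sum(step.values()))
        s = next(iter(step))      # unifilarity: a unique next state
    return ll

def work_production(theta, s_star, word, alphabet_size):
    """Eq. (14): <W> = kT * log-likelihood + kT * L * ln|Y|."""
    L = len(word)
    return (KB_T * log_likelihood(theta, s_star, word)
            + KB_T * L * math.log(alphabet_size))

# uncertain period-2 model of Fig. 6 (states: '*', 'A', 'B')
theta = {('*', 1, 'A'): 0.5, ('*', 0, 'B'): 0.5,
         ('A', 0, 'B'): 1.0, ('B', 1, 'A'): 1.0}
print(work_production(theta, '*', [1, 0, 1, 0], 2))  # kT(ln 0.5 + 4 ln 2)
```

For the matching word 1010 this gives 3 kT ln 2 of work, while any word the model cannot produce returns negative infinity, reflecting the infinite dissipation of an impossible input.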
This link is made concrete via the simple example presented in App. F. It goes through an explicit description of the Hamiltonian control required to implement a memoryless agent that harvests work from a sequence of up spins ↑ and down spins ↓ that compose the time series y_{0:L}. The agent's internal memoryless model results in Eq. (14)'s work production. And, we find that the maximum-work agent has learned about the input sequence. Specifically, the agent learns the frequency of spins ↑ and ↓, confirming the basic principle of maximum-work thermodynamic learning. However, the learning presented in App. F precludes the possibility of learning temporal structure in the spin sequence, since the agents and their internal models have no memory [29]. To learn about the temporal correlations within the sequence, one must use agents with multiple memory states. We leave thermodynamic learning among memoryful agents for later investigation.

Stepping back, we see the relationship between machine learning and information thermodynamics more clearly. In parametric density estimation we have:

1. Data y_{0:L} that provides a window into a black box.

2. A model θ of the black box that determines an estimated distribution over the data Pr(Y_{0:L}^θ).

3. A performance measure for the model of the data, given by the log-likelihood ℓ(θ|y_{0:L}) = ln Pr(Y_{0:L}^θ = y_{0:L}).

The parallel in thermodynamic learning is exact, with:

1. Data y_{0:L} physically stored in systems Y_0 × Y_1 × ··· × Y_{L−1} output from the black box.

2. An agent {M, Pr(X_j^θ, Y_j^θ)} that is entirely determined by the model θ.

3. The agent's thermodynamic performance, given by its work production ⟨W^θ⟩_{|y_{0:L}}, which increases linearly with the log-likelihood ℓ(θ|y_{0:L}).

In this way, we see that thermodynamic learning through work maximization is equivalent to parametric density estimation.

Intuitively, the natural world is replete with complex learning systems—an observation seemingly at odds with Thermodynamics and its Second Law, which dictates that order inevitably decays into disorder. However, our results are tantamount to a contravening physical principle that drives the emergence of order through learning: work maximization. We showed, in point of fact, that work maximization and learning are equivalent processes. At a larger remove, this hints of general physical principles of emergent organization.

VI. SEARCHING FOR PRINCIPLES OF ORGANIZATION

Introducing an equivalence of maximum work production and optimal learning comes at a late stage of a long line of inquiry into what kinds of thermodynamic constraints and laws govern the emergence of organization and, for that matter, biological life. So, let's historically place the seemingly-new principle. In fact, it enters a crowded field.

Within statistical physics the paradigmatic principle of organization was found by Kirchhoff [54]: in electrical networks current distributes itself so as to dissipate the least possible heat for the given applied voltages. Generalizations, for equilibrium states, are then found in Gibbs' variational principle for entropy for heterogeneous equilibrium [55], Maxwell's principles of minimum-heat [56, pp. 407-408], and Onsager's minimizing the "rate of dissipation" [57].

Close to equilibrium, Prigogine introduced minimum entropy production [58], identifying dissipative structures whose maintenance requires energy [59]. However, far from equilibrium the guiding principles can be quite the opposite. And so, the effort continues today, for example, with recent applications of nonequilibrium thermodynamics to pattern formation in chemical reactions [60]. That said, statistical physics misses at least two, related, key components: dynamics of and information in thermal states.

Dynamical systems theory takes a decidedly mechanistic approach to the emergence of organization, analyzing the geometric structures in a system's state space that amplify fluctuations and eventually attenuate them into macroscopic behaviors and patterns. This was eventually articulated by pattern formation theory [61–63]. A canonical example is fluid turbulence [64]—a dynamical explanation for its complex organizations occupied much of the 70s and 80s. Landau's original theory of incommensurate oscillations was superseded by the mathematical discovery in the 1950s of chaotic attractors [65, 66]. This approach, too, falls short of leading to a principle of emergent organization. Patterns emerge, but what exactly are they and what complex behavior do they exhibit?

Answers to this challenge came from a decidedly different direction—Shannon's theory of noisy communication channels and his measures of information [67, 68], appropriately extended [69]. While adding an important new perspective—that organized systems store and transmit information—this, also, did not go far enough, as it side-stepped the content and meaning of information [70]. In-roads to these appeared in the theory of computation inaugurated by Turing [71]. The most direct and ambitious approach to the role of information in organization, though, appeared in Wiener's cybernetics [4, 72]. While it eloquently laid out the goals to which principles should strive, it ultimately never harnessed the mathematical foundations and calculational tools needed. Likely, the earliest overt connection between statistical mechanics and information appeared with Jaynes' Maximum Entropy [73] and Minimum Entropy Production Principles [74].

So, what is new today is the synthesis of statistical physics, dynamics, and information. This, finally, allows one to answer the question, How do physical systems store and process information? The answer is that they intrinsically compute [36]. With this, one can extract from behavior a system's information processing, even going so far as to discover the effective equations of motion [75–78]. One can now frame questions about how a physical system reacts to, controls, and adapts to its environment.

All such systems, however, are embedded in the physical world and require resources to operate. More to the point, what energetic resources underlie computation? Initiated by Brillouin [79] and Landauer and Bennett [19, 80], today there is a nascent physics of information [14, 81]. Resource constraints on computing by thermodynamic systems are now expressed in a suite of new principles. For example, the principle of requisite complexity [29] dictates that maximally-efficient interactions require that an agent's internal organization match the environment's organization. And, thermodynamic resource costs arise from the modularity of an agent's architecture [53]. Pushing the search for organization further, the preceding established the thermodynamics of how a system learns, suggesting the possibility of adaptive organization.

To fully appreciate organization in natural processes, though, one must also address the dynamics of agent populations, first on the time scale of agent life cycles and second on the scale of many generations. In fact, tracking the complexity of individuals reveals that selection pressures spontaneously emerge in purely-replicating populations [82] and that replication itself necessarily dissipates energy [83].

As these pieces assembled, a picture has come into focus. Intelligent, adaptive systems learn to harness resources from their environment, expending energy to live and reproduce. Taken altogether, the historical perspective suggests we are moving close to realizing Wiener's cybernetics [4].

VII. CONCLUSION

We introduced thermodynamic machine learning—a physical process that trains intelligent agents by maximizing work production from complex environmental stimuli supplied as time-series data. This involved constructing a framework to describe the thermodynamics of computation at the single-shot level, enabling us to evaluate the work an agent can produce from individual data realizations. Key to the framework is its generality—applicable to agents exhibiting arbitrary adaptive input-output behavior and implemented within any physical substrate.

In the pursuit of maximum work, we refined this general class of agents to those that are best able to harness work from temporally-correlated inputs. We found that the performance of such maximum-work agents increases proportionally to the log-likelihood of the model they use for predicting their environment. As a consequence, our results show that thermodynamic learning exactly mimics parametric density estimation in machine learning. Thus, work is a thermodynamic performance measure for physically-embedded learning. This result further solidifies the connections between agency, intelligence, and the thermodynamics of information—hinting that energy harvesting and learning may be two sides of the same coin.

These connections suggest a number of exciting future directions. From the technological perspective, they hint at a natural method for designing intelligent energy harvesters—establishing that our present tools of machine learning can be directly mapped to the automated design of efficient information ratchets and pattern engines [29, 42, 84]. Meanwhile, recent results indicate that quantum systems can generate complex adaptive behaviors using fewer resources than classical counterparts [85–87]. Does this suggest there are new classes of quantum-enhanced energy harvesters and learners?

Ultimately, energy is an essential currency for life. This highlights the question, To what extent is work optimization a natural tendency of driven physical systems? Indeed, recent results indicate physical systems evolve to increase work production [32, 33], opening a fascinating possibility. Could the equivalence between work production and learning then indicate that the universe itself naturally learns? The fact that complex intelligent life emerged from the lifeless soup of the universe might be considered a continuing miracle: a string of unfathomable statistical anomalies strung together over eons. It would certainly be extraordinary if this evolution then has a physical basis—hidden laws of thermodynamic organization that guide the universe to create entities capable of extracting maximal work.

ACKNOWLEDGMENTS

The authors thank the Telluride Science Research Center for hospitality during visits and the participants of the Information Engines Workshops there. ABB thanks Wesley Boyd for useful conversations and JPC similarly thanks Adam Rupe. JPC acknowledges the kind hospitality of the Santa Fe Institute, the Institute for Advanced Study at the University of Amsterdam, and the California Institute of Technology during visits. This material is based upon work supported by, or in part by, Grant Nos. FQXi-RFP-IPW-1902 and FQXi-RFP-1809 from the Foundational Questions Institute and Fetzer Franklin Fund (a donor-advised fund of Silicon Valley Community Foundation), the Templeton World Charity Foundation Power of Information fellowships TWCF0337 and TWCF0560, the National Research Foundation, Singapore, under its NRFF Fellow program (Award No. NRF-NRFF2016-02), Singapore Ministry of Education Tier 1 Grant No. MOE2017-T1-002-043, and the U.S. Army Research Laboratory and the U.S. Army Research Office under Grants W911NF-18-1-0028 and W911NF-21-1-0048. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the National Research Foundation, Singapore.

During submission we became aware of related work: L. Touzo, M. Marsili, N. Merhav, and E. Roldan, "Optimal work extraction and the minimum description length principle," arXiv:2006.04544.

Appendix A: Extended Background

Developing the principle of maximum work production and calling out the physical benefits of an agent modeling its environment drew heavily from the areas of computational mechanics, nonequilibrium thermodynamics, and machine learning. Due to the variety of topics addressed, the following provides a more detailed notational and conceptual summary. This should help the development be more self-contained, hopefully providing a common language across the areas and a foundation for further exploration. While we make suggestive comparisons by viewing the foundations of each area side-by-side, readers already familiar with them and concerned with novel results may prefer to skip ahead, using the review to clarify unfamiliar notation.

1. Machine Learning and Generative Models

Thermodynamically, what is a good model of the data with which an agent interacts? Denote the data's state space as Z = {z}. If we have many copies of Z, all initially prepared in the same way, then as we observe successive realizations ~z = {z_0, z_1, ···, z_N} from an ensemble, the frequency of an observed state z approaches the actual probability distribution Pr(Z = z), where Z is the random variable that realizes states z ∈ Z. However, with only a finite number N of realizations, the best that can be done is to characterize the environment with an estimated distribution Pr(Z^θ = z). Estimating models that agree with finite data is the domain of statistical inference and machine learning of generative models [15, 34, 88].

At first blush, estimating a probability distribution appears distinct from familiar machine learning challenges, such as image classification and the inverse problem of artificially generating exemplar images from given data. However, both classification and prediction can be achieved through a form of unsupervised learning [18]. For instance, if the system is a joint variable over both the pixel images and the corresponding label, Z = pixels × {cat, dog}, then our estimated distribution Pr(Z^θ = z) gives both a means of choosing a label for an image, Pr(label^θ = cat|pixels^θ = image), and a means of choosing an image for a label, Pr(pixels^θ = image|label^θ = cat).

A generative model is specified by a set of parameters θ from which the model produces the estimated distribution Pr(Z^θ = z) = Pr(Z = z|Θ = θ). The procedure of arriving at this estimated model is parametric density estimation [15, 88]. However, we take the random variable Z^θ for the estimated distribution to denote the model, for notational and conceptual convenience.

The Shannon entropy [68]:

H[Z] ≡ − Σ_z Pr(Z = z) ln Pr(Z = z)

measures uncertainty in nats, a "natural" unit for thermodynamic entropies. The Shannon entropy easily extends to joint probabilities and to all information measures that come from their composition (conditional and mutual informations). For instance, if the environment is composed of two correlated subcomponents Z = X × Y, the probability and entropy are expressed:

Pr(Z = z) = Pr(X = x, Y = y) ≠ Pr(X = x) Pr(Y = y) and
H[Z] = H[X, Y] ≠ H[X] + H[Y] ,

respectively.

While there are many other ways to create parametric models θ—from polynomial functions with a small number of parameters to neural networks with thousands [15]—the goal is to match as well as possible the estimated distribution Pr(Z^θ) to the actual distribution Pr(Z). One measure of success in this is the probability that the model generated the data—the likelihood. The likelihood of the model θ given a data point z is the same as the likelihood of Z^θ:

L(θ|z) = Pr(Z = z|Θ = θ) = Pr(Z^θ = z) = L(Z^θ|z) .

Given a set ~z = {z_1, z_2, ···, z_N} of training data and assuming independent samples, the likelihood of the model is the product:

L(Z^θ|~z) = Π_{i=1}^{N} L(Z^θ|z_i) . (A1)

This is a commonly used performance measure in machine learning, where algorithms search for models with maximum likelihood [15]. However, it is common to use the log-likelihood instead, which is maximized for the same models:

ℓ(θ|~z) = ln L(θ|~z) = Σ_{i=1}^{N} ln Pr(Z^θ = z_i) . (A2)

If the model Z^θ were specified by a neural network, the log-likelihood could be determined through stochastic gradient descent back-propagation [34, 89], for instance. The intention is that the procedure will converge on a network model that produces the data with high probability.
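As a concrete illustration of Eqs. (A1) and (A2), the following minimal Python sketch evaluates the likelihood and log-likelihood of a simple Bernoulli model on i.i.d. binary data. The model representation is an illustrative assumption, not a construction from the text.

```python
import math

def likelihood(p_model, data):
    """Eq. (A1): product of per-sample model probabilities."""
    out = 1.0
    for z in data:
        out *= p_model[z]
    return out

def log_likelihood(p_model, data):
    """Eq. (A2): sum of per-sample log-probabilities."""
    return sum(math.log(p_model[z]) for z in data)

data = [1, 1, 0, 1]                # four i.i.d. binary samples
for q in (0.5, 0.75, 0.9):         # candidate Bernoulli models
    model = {0: 1.0 - q, 1: q}
    print(q, likelihood(model, data), log_likelihood(model, data))
# both measures are maximized at q = 3/4, the empirical frequency
```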

2. Thermodynamics of Information

Learning from data translates information in an environment into a useful model. What makes that model useful? In a physical setting—recalling from Landauer that "information is physical" [90]—the usefulness one can extract from thermodynamic processes is work. Figure 3 shows a basic implementation for physical computation. Such an information-storing physical system Z = {z}, in contact with a thermal reservoir, can execute useful computations by drawing energy from a work reservoir. Energy flowing from the system Z into the thermal reservoir is positive heat Q. When energy flows from the system Z to the work reservoir, it is positive work production W. Work production quantifies the amount of energy that is stored in the work reservoir, available for later use. And so, in this telling, it represents a natural and physically-motivated measure of thermodynamic performance. In the framework for thermodynamic computation of Fig. 3, work is extracted via controlling the system's Hamiltonian.

Specifically, the system's informational states are controlled via a time-dependent Hamiltonian—energy E(z, t) of state z at time t. For a state trajectory z_{τ:τ′} = z_τ z_{τ+dt} ··· z_{τ′−dt} z_{τ′} over the time interval t ∈ [τ, τ′], the resulting work extracted by Hamiltonian control is the temporally-integrated change in energy [37]:

W_{|z_{τ:τ′}} = − ∫_τ^{τ′} dt ∂_t E(z, t)|_{z=z_t} .

Heat Q_{|z_{τ:τ′}} = E(z_τ, τ) − E(z_{τ′}, τ′) − W_{|z_{τ:τ′}} flows into the thermal reservoir, increasing its entropy:

ΔS^{reservoir}_{|z_{τ:τ′}} = Q_{|z_{τ:τ′}} / T , (A3)

where the thermal reservoir is at temperature T. The Second Law of Thermodynamics states that, on average, any processing on the informational states can only yield nonnegative entropy production of the universe (reservoir and system Z):

⟨Σ⟩ = ⟨ΔS^{reservoir}⟩ + ⟨ΔS^{Z}⟩ ≥ 0 . (A4)

This constrains the energetic cost of computations performed within the system Z.

A computation over the time interval t ∈ [τ, τ′] has two components:

1. An initial distribution over states Pr(Z_τ = z_τ), where Z_t is the random variable of system Z at time t.

2. A Markov channel that transforms it, specified by the conditional probability of the final state z_{τ′} given the initial input z_τ:

M_{z_τ→z_{τ′}} = Pr(Z_{τ′} = z_{τ′}|Z_τ = z_τ) .

This specifies, in turn, the final distribution Pr(Z_{τ′} = z_{τ′}) that allows direct calculation of the system-entropy change [44]:

ΔS^{Z}_{|z_{τ:τ′}} = k_B ln [ Pr(Z_τ = z_τ) / Pr(Z_{τ′} = z_{τ′}) ] .

Adding this to the information reservoir's entropy change yields the entropy production of the universe. This can also be expressed in terms of the work production:

Σ_{|z_{τ:τ′}} ≡ ΔS^{reservoir}_{|z_{τ:τ′}} + ΔS^{Z}_{|z_{τ:τ′}} = [ −W_{|z_{τ:τ′}} + φ(z_τ, τ) − φ(z_{τ′}, τ′) ] / T .

Here, φ(z, t) = E(z, t) + k_B T ln Pr(Z_t = z) is the pointwise nonequilibrium free energy, which becomes the nonequilibrium free energy when averaged: ⟨φ(z, t)⟩_{Pr(Z_t = z)} = F^{neq}(t) [43].

Note that the entropy production is also proportional to the additional work that could have been extracted if the computation were efficient. This is referred to as the dissipated work:

W^{diss}_{|z_{τ:τ′}} = T Σ_{|z_{τ:τ′}} . (A5)

Turning back to the Second Law of Thermodynamics, we see that the average work extracted is bounded by the change in nonequilibrium free energy:

⟨W⟩ ≤ F^{neq}(τ) − F^{neq}(τ′) .

When the system starts and ends as an information reservoir, with equal energies for all states E(z, τ) = E(z′, τ′) [37], this reduces to Landauer's familiar principle for erasure [19]—work production must not exceed the change in state uncertainty:

⟨W⟩ ≤ k_B T ( H[Z_{τ′}] − H[Z_τ] ) ,

where H[Z_t] = −Σ_{z∈Z} Pr(Z_t = z) ln Pr(Z_t = z) is the Shannon entropy of the system at time t, measured in nats. This is the starting point for determining the work production that agents can extract from data.
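The bound above is easy to check numerically. Below is a minimal sketch, assuming flat energy landscapes at the start and end, that evaluates the work bound k_B T (H[Z_{τ′}] − H[Z_τ]) for a bit-erasure channel; the helper names are illustrative.

```python
import math

def shannon_entropy(p):
    """H[Z] in nats for a distribution given as a list of probabilities."""
    return -sum(x * math.log(x) for x in p if x > 0)

def apply_channel(p_in, M):
    """Push a distribution through a Markov channel M[z_in][z_out]."""
    return [sum(p_in[i] * M[i][j] for i in range(len(p_in)))
            for j in range(len(M[0]))]

KB_T = 1.0
p_in = [0.5, 0.5]                   # a fully random bit
erase = [[1.0, 0.0], [1.0, 0.0]]    # both states map to 0
p_out = apply_channel(p_in, erase)

bound = KB_T * (shannon_entropy(p_out) - shannon_entropy(p_in))
print(bound)  # -ln 2: erasing a random bit costs at least kT ln 2
```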

3. Computational Mechanics

When describing thermodynamics and machine learning, data was taken from the state space Z all at once. However, what if we consider a state space composed of L identical components Z = Y^L that are received in sequence? Our model of the time series y_{0:L} of realizations is then described by an estimated distribution Pr(Y_{0:L}^θ = y_{0:L}). However, for L large enough, this object becomes impossible to store, due to the exponential increase in the number of sequences. Fortunately, there are ways to generally characterize an arbitrarily long time-series distribution using a finite model.

a. Generative Machines

A hidden Markov model (HMM) is described by a set of hidden states S, a set of output states Y, and a conditional output-labeled matrix that gives the transition probabilities between the states:

θ^{(y)}_{s→s′} = Pr(S_{j+1}^θ = s′, Y_j^θ = y|S_j^θ = s)

for all j, and a start state s* ∈ S. (Generally, one also specifies an initial state distribution, with selecting a start state being a special case.) We label the transition probabilities with the model parameter θ, since these are the actual parameters that must be stored to generate probabilities of time series. For instance, Fig. 2 shows an HMM that generates a periodic process with uncertain phase. Edges between hidden states s and s′ are labeled y : θ^{(y)}_{s→s′}, where y is the output symbol and θ^{(y)}_{s→s′} is the probability of emitting that symbol on that transition.

If {θ^{(y)}_{s→s′}} is the model for the estimated input Pr(Y^θ = y_{0:L}), then the probability of any word is calculated by taking the product of transition matrices and summing over internal states:

Pr(Y_{0:L}^θ = y_{0:L}) = Σ_{s_{0:L+1}} δ_{s_0,s*} Π_{j=0}^{L−1} θ^{(y_j)}_{s_j→s_{j+1}} .

Beyond generating length-L symbol strings, these HMMs generate distributions over semi-infinite strings Pr(Y_{0:∞}^θ = y_{0:∞}). As such, they allow us to anticipate more than just the first L symbols from the same source. Once we have a model θ = {(θ^{(y)}_{s→s′}, s, s′, y)}_{s,s′,y} from our training data y_{0:L}, we can calculate probabilities of longer words Pr(Y_{0:L′}^θ = y_{0:L′}) and, thus, the probability of symbols following the training data Pr(Y_{L:L′}^θ = y_{L:L′}|Y_{0:L}^θ = y_{0:L}).

Distributions over semi-infinite strings Pr(Y_{0:∞}^θ = y_{0:∞}) are similar to processes, which are distributions over bi-infinite strings Pr(Y_{−∞:∞}^θ = y_{−∞:∞}). While not insisting on stationarity, and so allowing for flexibility in using subwords of length L, we can mirror computational mechanics' construction of unifilar HMMs from time series, where the hidden states S_j^θ are minimal sufficient statistics of the past Y_{0:j}^θ about the future Y_{j:∞}^θ [91]. In other words, the hidden states are perfect predictors.

Given a semi-infinite process Pr(Y_{0:∞}^θ = y_{0:∞}), we construct a minimal predictor through a causal equivalence relation y_{0:k} ∼ y′_{0:j} that says two histories y_{0:k} and y′_{0:j} are members of the same equivalence class if and only if they have the same semi-infinite future distribution:

Pr(Y_{j:∞}^θ = y″_{0:∞}|Y_{0:j}^θ = y′_{0:j}) = Pr(Y_{k:∞}^θ = y″_{0:∞}|Y_{0:k}^θ = y_{0:k}) .

An equivalence class of histories is a causal state. Causal states also induce a map ε(·) from histories y_{0:k} to states s_i:

s_i = {y′_{0:j}|y_{0:k} ∼ y′_{0:j}} ≡ ε(y_{0:k}) .

This guarantees that a causal state is a sufficient statistic of the past about the future, such that we can track it as a perfect predictor:

Pr(Y_{k:∞}^θ|Y_{0:k}^θ = y_{0:k}) = Pr(Y_{k:∞}^θ|S_k^θ = ε(y_{0:k})) .

In fact, the causal states are minimal sufficient statistics.

Constructing causal states reveals a number of properties of stochastic processes and models. One of these is unifilarity, which means that if the current causal state s_k = ε(y_{0:k}) is followed by any sequence y_{k:j}, then the resulting causal state s_j = ε(y_{0:j}) is uniquely determined. And, we can expand our use of the ε function to include updating a causal state:

s_j = ε(s_k, y_{k:j}) ≡ {y′_{0:l} | ∃ y′_{0:k} such that s_k = ε(y′_{0:k}) and y′_{0:l} ∼ y_{0:j}} .

This is the set of histories y′_{0:l} that predict the same future as a history y_{0:j}, which leads to causal state s_k via the initial sequence y_{0:k} and then follows with the sequence y_{k:j}.
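The word-probability formula above is a sum over hidden-state paths, which a standard forward pass computes directly. Here is a minimal Python sketch for the uncertain period-2 machine of Fig. 2; the state names and dictionary layout are illustrative assumptions.

```python
def word_probability(theta, s_star, word):
    """Pr(Y_{0:L} = word): forward pass over hidden-state paths,
    starting from the start state s*."""
    alpha = {s_star: 1.0}       # Pr(S_j = s, Y_{0:j} = word[:j])
    for y in word:
        nxt = {}
        for (s, yy, s2), p in theta.items():
            if yy == y and s in alpha:
                nxt[s2] = nxt.get(s2, 0.0) + alpha[s] * p
        alpha = nxt
    return sum(alpha.values())

# uncertain period-2 machine: s* -1:0.5-> A, s* -0:0.5-> B,
# A -0:1.0-> B, B -1:1.0-> A
theta = {('*', 1, 'A'): 0.5, ('*', 0, 'B'): 0.5,
         ('A', 0, 'B'): 1.0, ('B', 1, 'A'): 1.0}
print(word_probability(theta, '*', [0, 1, 0, 1]))  # 0.5
print(word_probability(theta, '*', [0, 0, 1, 0]))  # 0.0
```

Unlike the unifilar shortcut used earlier, this forward pass handles general HMMs, where many state paths may generate the same word.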

Unifilarity is key to deducing several useful properties of the HMM θ^{(y)}_{s→s′}, which we will refer to as a nonstationary ε-machine. First, unifilarity implies that for any causal state s followed by a symbol y, there is a unique next state s′ = ε(s, y), meaning that the symbol-labeled transition matrix can be written:

θ^{(y)}_{s→s′} = θ^{(y)}_{s→ε(s,y)} δ_{s′,ε(s,y)} .

Moreover, the ε-machine's form is uniquely determined from the semi-infinite process:

θ^{(y)}_{s→s′} = Pr(Y_j^θ = y|S_j^θ = s) δ_{s′,ε(s,y)} ,

where the conditional probability is determined from the process:

Pr(Y_j^θ = y|S_j^θ = ε(y_{0:j})) = Pr(Y_j^θ = y|Y_{0:j}^θ = y_{0:j}) .

Once constructed, the ε-machine allows us to reconstruct word probabilities via the simple product:

Pr(Y_{0:L}^θ = y_{0:L}) = Π_{j=0}^{L−1} θ^{(y_j)}_{ε(s*,y_{0:j})→ε(s*,y_{0:j+1})} ,

where y_{0:0} denotes the null word, taking a causal state to itself under the causal update ε(s, y_{0:0}) = s.

Allowing for arbitrarily-many causal states, our class of models (nonstationary ε-machines) is so general that it can represent any semi-infinite process and, thus, any distribution over sequences Y^L. One concludes that computational mechanics provides an ideal class of generative models to fit to data y_{0:L}. Bayesian structural inference implements just this [92]. In these ways, computational mechanics already had solved (and several decades prior) the unsupervised learning challenge recently posed by Ref. [93] to create an "AI Physicist": a machine that learns regularities in time series to make predictions of the future from the past [94].

b. Input-Output Machines

This way one constructs a predictive HMM that generates a desired semi-infinite process Pr(Y_{0:L}^θ = y_{0:L}). The generalization to ε-transducers allows for an input as well as an output process—a transformation between processes [52]. The transducer at the ith time step is described by transitions among the hidden states X_i → X_{i+1} that are conditioned on the input Y_i, emitting the output Y′_i:

M^{(y′|y)}_{x→x′} = Pr(Y′_i = y′, X_{i+1} = x′|Y_i = y, X_i = x) .

ε-Transducer state-transition diagrams label the edges of transitions between hidden states y′|y : M^{(y′|y)}_{x→x′}. As Fig. 7 shows, this is to be read as the probability M^{(y′|y)}_{x→x′} of output y′ and next hidden state x′, given input y and current hidden state x.

FIG. 7. The delay channel ε-transducer: the last input symbol is stored in its memory (states). If the last symbol was 1, then the corresponding transitions, labeled y′|1 : 1.0, update the hidden state to A. Then, all outputs from A are the symbol 1. Similarly, input 0 leads to state B, whose corresponding outputs are all 0. In this way, the delay channel outputs the previous input symbol.

These devices are memoryful channels, with their memory encoded in the hidden states X_i. They implement a wide variety of functional operations. Figure 7 shows the delay channel. With sufficient memory, though, an ε-transducer can implement a universal Turing machine [51]. Moreover, if the input and output alphabets are the same, then they represent the form of a physical information ratchet, whose energetic requirements arise from the thermodynamics of their operation [30, 31]. Since these physically-implementable information processors are so general in their ability to compute, they represent a very broad class of physical agents. As such, we use the framework of information ratchets to explore the functionality of agents that process information as a fuel.
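For concreteness, here is a minimal Python sketch of the delay-channel ε-transducer of Fig. 7. Since every transition is deterministic, it is stored as a dictionary M[(x, y)] = (x′, y′); the representation and the default start state are illustrative assumptions.

```python
def delay_channel(inputs, x0='B'):
    """Fig. 7's delay channel: state A remembers a 1, state B a 0;
    each step outputs the remembered (previous) symbol."""
    M = {('A', 0): ('B', 1), ('A', 1): ('A', 1),   # from A, output 1
         ('B', 0): ('B', 0), ('B', 1): ('A', 0)}   # from B, output 0
    x, outputs = x0, []
    for y in inputs:
        x, y_out = M[(x, y)]
        outputs.append(y_out)
    return outputs

print(delay_channel([1, 0, 0, 1, 1]))  # [0, 1, 0, 0, 1]
```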

Appendix B: Proof of Zero Entropy Production of Trajectories

Perfectly-efficient agents dissipate zero work and generate zero entropy, ⟨Σ⟩ = 0. The Crooks fluctuation theorem and other detailed fluctuation theorems say that entropy production is proportional to the log-ratio of probabilities [45, 46]:

Σ_{|z_{τ:τ′}} = k_B ln [ ρ_F(Σ_{|z_{τ:τ′}}) / ρ_R(−Σ_{|z_{τ:τ′}}) ] ,

where ρ_F(Σ_{|z_{τ:τ′}}) is the probability of the entropy production under the protocol that controls the ratchet and ρ_R(−Σ_{|z_{τ:τ′}}) is the probability of minus that same entropy production if the control protocol is reversed. Thus, the average entropy production is proportional to the relative

entropy between these two distributions [68]:

⟨Σ⟩ = k_B Σ_{z_{τ:τ′}} ρ_F(Σ_{|z_{τ:τ′}}) ln [ ρ_F(Σ_{|z_{τ:τ′}}) / ρ_R(−Σ_{|z_{τ:τ′}}) ]
    ≡ k_B D_KL( ρ_F(Σ_{|z_{τ:τ′}}) || ρ_R(−Σ_{|z_{τ:τ′}}) ) .

If the control is thermodynamically efficient, this relative entropy vanishes [95], implying the necessary and sufficient condition that ρ_F(Σ) = ρ_R(−Σ). This then implies that all paths produce zero entropy:

Σ_{|z_{τ:τ′}} = 0 .

Entropy fluctuations vanish as the entropy production goes to zero.

Appendix C: Thermodynamically Efficient Markov Channels

Given a physical system Z = {z}, a computation on its states is given by a Markov channel M_{z→z′} = Pr(Z_{τ′} = z′|Z_τ = z) and an input distribution Pr(Z_τ = z). The following describes a quasistatic thermodynamic control that implements this computation efficiently if the input distribution matches the estimated distribution Pr(Z^θ = z). This means that the work production is given by the drop in pointwise nonequilibrium free energy:

⟨W^θ⟩_{|z_τ,z_{τ′}} = φ(z_τ, τ) − φ(z_{τ′}, τ′) (C1)
                  = −ΔE_Z + k_B T ln [ Pr(Z_τ^θ = z_τ) / Pr(Z_{τ′}^θ = z_{τ′}) ] ,

where ΔE_Z ≡ E(z_{τ′}, τ′) − E(z_τ, τ). Note that, while Pr(Z_τ^θ = z) is the input distribution for which the computation is efficient, it is possible that other input distributions Pr(Z_τ = z) yield zero entropy production as well. They are only required to satisfy D_KL(Z_τ||Z_τ^θ) − D_KL(Z_{τ′}||Z_{τ′}^θ) = 0 [49].

The physical setting that we take for thermodynamic control is overdamped Brownian motion with a controllable energy landscape. This is described by detailed-balanced rate equations. However, if our physical state-space is limited to Z, then not all channels can be implemented with continuous-time rate equations [48]. Fortunately, this can be circumvented by additional ancillary or hidden states [48, 96]. And so, to implement any possible channel, we add an ancillary copy Z′ of our original system, such that our entire physical system is Z_total = Z × Z′.

Prescriptions have been given that efficiently implement any computation, specified by a Markov channel M_{z_τ→z_{τ′}} = Pr(Z_{τ′} = z_{τ′}|Z_τ = z_τ), using quasistatic manipulation of Z's energy levels and an ancillary copy Z′ [42, 53]. However, these did not determine the work production for individual computational maps z_τ → z_{τ′} during the computation interval (τ, τ′).

The following implements an analogous form of quasistatic computation that allows us to easily calculate the energy associated with implementing the computation M_{z_τ→z_{τ′}}, assuming the subsystem Z starts in z_τ and ends in z_{τ′}. Due to detailed balance, the rate-equation dynamics over the computational system and its ancillary copy Z_total = Z × Z′ are partially specified by the energy E(z, z′, t) of system state z and ancillary state z′ at time t. This also uniquely specifies the equilibrium distribution:

Pr(Z_t^eq = z, Z′_t^eq = z′) = e^{−E(z,z′,t)/k_B T} / Σ_{z,z′} e^{−E(z,z′,t)/k_B T} .

The normalization constant Σ_{z,z′} e^{−E(z,z′,t)/k_B T} is the partition function that determines the equilibrium free energy:

F^eq(t) = −k_B T ln ( Σ_{z,z′} e^{−E(z,z′,t)/k_B T} ) .

The equilibrium free energy is constant over the states, so the energy can be expressed as:

E(z, z′, t) = F^eq(t) − k_B T ln Pr(Z_t^eq = z, Z′_t^eq = z′) .

We leverage the relationship between energy and equilibrium probability to design a protocol that achieves the work production given by Eq. (C2) for a Markov channel M. The estimated distribution over the whole space assumes that the initial distribution of the ancillary variable is uncorrelated and uniformly distributed:

Pr(Z_τ^θ, Z′_τ) = Pr(Z_τ^θ) / |Z| .

Assuming the default energy landscape is constant initially and finally—E(z, z′, τ) = E(z, z′, τ′) = ξ—the maximally efficient protocol over the interval [τ, τ′] decomposes into five epochs, see Fig. 8:

1. Quench: [τ, τ+],
2. Quasistatically evolve: (τ+, τ_1],
3. Swap: (τ_1, τ_2],
4. Quasistatically evolve: (τ_2, τ′−), and
5. Reset: [τ′−, τ′].

For all protocol epochs, except for Epoch 3 during which the two subsystems are swapped, Z is held fixed while the ancillary system Z′ follows the local equilibrium distribution. Let's detail these in turn.

0 0 2. Quasistatically evolve: Quasistatically evolve the en- Z Pr = 1.0 Z E = high 0 0 ergy landscape over a third of total time interval (τ, τ1] t = ⌧ Z 1 Z 1 such that the joint system remains in equilibrium and the Pr = 0.0 E =low ancillary system Z0 is determined by the Markov channel 1) 0 1 0 1 0 0 M applied to the system Z: Z Pr = 1.0 Z E = high 0 0 + 0 0 t = ⌧ Pr(Z = z, Z = z ) = Pr(Z = z)M 0 τ1 τ1 τ z→z Z 1 Z 1 0 θ Pr = 0.0 E =low E(z, z , τ1) = −kBT ln Pr(Zτ = z)Mz→z0 . 2) 0 1 0 1 0 0 Z Pr = 1.0 Z E = high Also, hold the energy barriers between states in Z high, 0 0 t = ⌧1 preventing probability flow between states and preserving Z 1 Z 1 Pr = 0.0 E =low the distribution Pr(Zt) = Pr(Zτ ) for all t ∈ (τ, τ1]. 0 1 0 1 3) Given that the system started in Zτ = zτ , the work 0 0 Z Pr = 1.0 Z E = high production during this epoch corresponds to the average 0 0 t = ⌧2 change in energy: Z 1 Z 1 Pr = 0.0 E =low D θ,2 E 4) 0 1 0 1 W |zτ ,zτ0 0 0 E = high Z Pr = 1.0 Z X Z τ1 0 0 = − dt Pr(Z0 = z0,Z = z|Z = z )∂ E(z, z0, t) . 0 t t τ τ t t = ⌧ + 0 τ Z 1 Z 1 z,z Pr = 0.0 E =low 0 1 0 1 5) As the system Z remains in zτ over the interval: 0 0 Z Pr = 1.0 Z E = high 0 0 0 0 0 0 t = ⌧ 0 Pr(Zt = z ,Zt = z|Zτ = zτ ) = Pr(Zt = z |Zt = z)δz,zτ Z 1 Z 1 Pr = 0.0 E =low 0 1 0 1 and the work production simplifies to: D E FIG. 8. Quasistatic agent implementing the Markov chain W θ,2 0 |zτ ,z 0 M 0 in the system Z over the time interval [τ, τ ] using τ zτ →zτ 0 Z τ1 ancillary copy Z in five steps: Epoch 1: Energy landscape is X 0 0 0 = − dt Pr(Zt = z |Zt = zτ )∂tE(zτ , z , t) . instantaneously brought into equilibrium with the distribution + 0 τ over the joint system. Epoch 2: Probability flows in the z ancillary system Z0 as the energy landscape quasistatically We can express the energy in terms of the estimated changes to make the conditional probability distribution in 0 0 0 0 equilibrium probability distribution: Z reflect the Markov channel Pr(Zτ1 = z |Zτ1 = z) = Mz→z . Epoch 3: Systems Z and Z0 are swapped. Epoch 4: Ancillary 0 0 0 θ system quasistatically reset to the uniform distribution. Epoch E(zτ , z , t) = −kBT ln Pr(Zt = z |Zt = zτ ) Pr(Zt = zτ ) . 5: Energy landscape instantaneously reset to uniform. And, since the distribution over the system Z is fixed during this interval: over the infinitesimal time interval [τ, τ +] such that, if the 0 0 θ distribution was as we expect, it would be in equilibrium Pr(Zt = z |Zt = zτ ) Pr(Zt = zτ ) eq 0eq θ Pr(Zτ ,Zτ ) = Pr(Zτ )/|Z|. 0 0 θ = Pr(Zt = z |Zt = zτ ) Pr(Zτ = zτ ) . If the system started in zτ , then the associated work produced is opposite the energy change: Plugging these into the expression for the work produc- tion, we find that the evolution happens without energy D θ,1 E 0 0 + W = E(zτ , z , τ) − E(zτ , z , τ ) |zτ ,zτ0 Pr(Zθ = z ) = ξ + k T ln τ τ . B |Z| D E W θ,1 denotes that the work is produced in Epoch |zτ ,zτ0 θ θ 1, conditioned on the estimated distributions Zτ and Zτ 0 , and initial and final states zτ and zτ 0 . Note that we also θ condition on Zτ 0 = zτ 0 , since work production in this phase is unaffected by the computation’s end state. 21 exchange: We keep the primary system Z fixed as in Epoch 2. And, as in Epoch 2, there is zero work production: D E W θ,2 |z ,z 0 τ τ D θ,4 E τ W = 0 . 
Z 1 |z ,z 0 X 0 0 τ τ = −kBT dt Pr(Zt = z |Zt = zτ ) + τ z0 The result is that the primary system is in the desired 0 0 θ × ∂t ln Pr(Zt = z |Zt = zτ ) Pr(Zτ = zτ ) final distribution: Z τ1 X 0 0 X 0 = −kBT dt Pr(Zt = z |Zt = zτ ) Pr(Zτ 0 = z) ≡ Pr(Zτ = z )Mz0→z , τ + z0 z0 θ 0 0 Pr(Z = zτ )∂t Pr(Z = z |Zt = zτ ) × τ t having undergone a mapping from its original state at Pr(Z0 = z0|Z = z ) Pr(Zθ = z ) t t τ τ τ time τ, while the ancillary system has returned to an Z τ1 X 0 0 = −kBT dt ∂t Pr(Zt = z |Zt = zτ ) uncorrelated uniform distribution. + 0− 0 τ z0 5. Reset: Finally, over time interval [τ , τ ] instanta- Z τ1 X 0 0 neously change the energy to the default flat landscape = −kBT dt ∂t Pr(Zt = z |Zt = zτ ) 0 0 + E(z, z , τ ) = ξ. Given that the system ends in state zτ 0 , τ z0 Z τ1 the associated work production is: = −kBT dt ∂t1 τ + D θ,5 E 0 0 0− 0 0 0 W = E(zτ , z , τ ) − E(zτ , z , τ ) = 0 . |zτ ,zτ0 θ Pr(Zτ 0 = zτ 0 ) = −ξ − kBT ln . After this state, at time τ1 the resulting joint distribution |Z| over the ancillary and primary system matches the desired computation: The net work production given the initial state zτ and final state zτ 0 is then: 0 0 Pr(Z = z, Z = z ) = Pr(Z = z)M 0 . (C2) τ1 τ1 τ z→z 5 D θ E X D θ,i E W|z ,z 0 = W 3. Swap: Over time interval (τ , τ ], efficiently swap the τ τ |zτ ,zτ0 1 2 i=1 0 0 0 two systems Z ↔ Z , such that Pr(Zτ = z, Z = z ) = 1 τ1 Pr(Zθ = z ) Pr(Z = z0,Z0 = z) and E(z, z0, τ ) = E(z0, z, τ ). This = k T ln τ τ . τ2 τ2 1 2 B θ Pr(Z 0 = z 0 ) operation requires zero work as well, as it is reversible, τ τ regardless of where the system starts or ends: Thus, when we average over all possible inputs and out- puts and the estimated an actual distributions are the D θ,3 E W = 0 . θ |zτ ,zτ0 same Zτ = Zτ , we see that this protocol achieves the thermodynamic Landauer’s bound: Such an efficient swap operation has been demonstrated θ X D θ E in finite time [48]. The resulting joint distribution over 0 0 W = Pr(Zτ =zτ ,Zτ =zτ ) W|z ,z 0 the ancillary and primary system matches a flip of the τ τ zτ ,zτ0 desired computation: = kBT ln 2(H[Zτ 0 ] − H[Zτ ]) . 0 0 0 Pr(Zτ = z, Z = z ) = Pr(Zτ = z )Mz0→z . 2 τ2 One concludes that this is a thermodynamically-efficient
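To make the bookkeeping above concrete, here is a minimal numerical sketch—our own illustration, not code from the paper—that tallies the two quench works for a hypothetical two-state channel and confirms that the average reproduces the Landauer bound (entropies computed in nats, which absorbs the ln 2 conversion from bits):

```python
import numpy as np

kBT = 1.0  # work measured in units of k_B T

# Hypothetical input distribution Pr(Z_tau = z) and Markov channel M_{z->z'}.
p_in = np.array([0.9, 0.1])
M = np.array([[0.8, 0.2],
              [0.3, 0.7]])
p_out = p_in @ M  # Pr(Z_tau' = z')

# Epochs 2-4 are quasistatic or reversible and produce zero work, so only
# the quenches contribute: W|z,z' = k_B T ln[Pr(Z_tau = z)/Pr(Z_tau' = z')].
W = kBT * (np.log(p_in)[:, None] - np.log(p_out)[None, :])

# Average against the joint distribution Pr(z, z') = Pr(z) M_{z->z'}.
joint = p_in[:, None] * M
avg_W = np.sum(joint * W)

def H(p):
    """Shannon entropy in nats."""
    return -np.sum(p * np.log(p))

# Net average work equals the Landauer bound k_B T (H[Z_tau'] - H[Z_tau]).
print(avg_W, kBT * (H(p_out) - H(p_in)))
```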

Appendix D: Designing Agents

To design a predictive thermodynamic agent, its hidden states must match the states of the ε-machine at every time step X^θ_i = S^θ_i. To do this, the agent states and causal states occupy the same space X = S, and the transitions within the agent M are directly drawn from the causal equivalence relation:

M_{xy→x'y'} = (1/|Y|) × { δ_{x',ε(x,y)} if Σ_{x'} θ^{(y)}_{x→x'} ≠ 0 ; δ_{x',x} otherwise } .

The factor 1/|Y| maximizes work production by mapping to uniform outputs. The second term on the right is the probability of the next agent state given the current input and current hidden state Pr(X_{i+1} = x' | Y_i = y, X_i = x). The top case δ_{x',ε(x,y)} gives the probability that the next causal state is S_{i+1} = ε(x, y) given that the current causal state is S_i = x and output of the ε-machine is Y_i = y. This is contingent on the probability of seeing y given causal state x being nonzero. If it is, then the transitions among the agent's hidden states match the transitions of the ε-machine's causal states.

In this way, if y_{0:L} is a sequence that could be produced by the ε-machine, we have designed the agent to stay synchronized to the causal state of the input X^θ_i = S^θ_i, so that the ratchet is predictive of the process Pr(Y_{0:∞}) and produces maximal work by fully randomizing the outputs:

Pr(Y'_i = y'_i | ·) = 1/|Y| .

It can be the case that the ε-machine cannot produce y from the causal state x. This corresponds to a disallowed transition of our model θ^{(y)}_{x→x'} = 0. In these cases, we arbitrarily choose the next state to be the same, δ_{x,x'}. There are many possible choices, though—such as resetting to the start state s*. However, the choice is physically irrelevant, since these transitions correspond to zero estimated probability and, thus, to infinite work dissipation, drowning out all other details of the model. However, this particular choice for when y cannot be generated from the causal state x preserves unifilarity and allows the agent to wait in its current state until it receives an input that it can accept.

Modulo disallowed, infinitely-dissipative transitions, we now have a direct mapping between our estimated input process Pr(Y^θ_{0:∞}) and its ε-machine θ to the logical architecture M of a maximum work-producing agent.

As yet, this does not fully specify agent behavior, since it leaves out the estimated input distribution Pr(Y^θ_i = y, X^θ_i = x). This distribution must match the actual distribution Pr(Y_i = y, X_i = x) for the agent to be locally efficient, not accounting for temporal correlations. Fortunately, since agent states are designed to match the ε-machine's causal states, we know that the agent state distribution matches the causal-state distribution and inputs:

Pr(Y^θ_i = y_i, X^θ_i = s_i) = Pr(Y^θ_i = y_i, S^θ_i = s_i) ,

if the ratchet is driven by the estimated input. The joint distribution over causal states and inputs is also determined by the ε-machine, since the construction assumes starting in the state s_0 = s*. To start, note that the joint probability trajectory distribution is given by:

Pr(Y^θ_{0:i+1} = y_{0:i+1}, S^θ_{0:i+2} = s_{0:i+2}) = δ_{s_0,s*} Π_{j=0}^{i} θ^{(y_j)}_{s_j→s_{j+1}} .

Summing over the variables besides Y^θ_i and S^θ_i, we obtain an expression for the estimated agent distribution in terms of just the ε-machine's HMM:

Pr(Y^θ_i = y_i, X^θ_i = s_i) = Σ_{y_{0:i}, s_{0:i}, s_{i+1}} δ_{s_0,s*} Π_{j=0}^{i} θ^{(y_j)}_{s_j→s_{j+1}} .

Thus, an agent {M, Pr(X^θ_i, Y^θ_i)} designed to be globally efficient for the estimated input process Pr(Y^θ_{0:L}) can be derived from the estimated input process through its ε-machine θ^{(y)}_{s→s'}.
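The construction above is mechanical enough to sketch in code. The following Python snippet—our own illustration with hypothetical parameter values, not the paper's code—builds the agent's logical architecture M_{xy→x'y'} from ε-machine transition probabilities θ^{(y)}_{x→x'}, using the wait-in-place rule for disallowed inputs:

```python
import numpy as np

# Hypothetical "golden mean" e-machine (no consecutive 1s), stored as
# theta[y][x, x'] = probability of emitting y and moving from causal
# state x to x'. State 0 plays the role of the start state s*.
theta = {
    0: np.array([[0.5, 0.0],
                 [1.0, 0.0]]),
    1: np.array([[0.0, 0.5],
                 [0.0, 0.0]]),  # symbol 1 is disallowed from state 1
}
n_states, n_symbols = 2, 2

def eps(x, y):
    """Epsilon-map: the unique next causal state on input y (unifilarity),
    or None when y cannot be generated from causal state x."""
    probs = theta[y][x]
    return None if probs.sum() == 0 else int(np.argmax(probs > 0))

# M[x, y, x', y'] = (1/|Y|) * delta_{x', eps(x, y)} when y is allowed,
# and (1/|Y|) * delta_{x', x} (wait in place) otherwise.
M = np.zeros((n_states, n_symbols, n_states, n_symbols))
for x in range(n_states):
    for y in range(n_symbols):
        nxt = eps(x, y)
        target = x if nxt is None else nxt
        M[x, y, target, :] = 1.0 / n_symbols  # uniform over outputs y'

# Sanity check: each (x, y) input defines a distribution over (x', y').
assert np.allclose(M.sum(axis=(2, 3)), 1.0)
```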

Appendix E: Work Production of Optimal Transducers

The work production of an arbitrary transducer M driven by an input y_{0:L} can be difficult to calculate, as shown in Eq. (10). However, when the transducer is designed to harness an input process with ε-machine T, such that:

M_{xy→x'y'} = (1/|Y|) × { δ_{x',ε(x,y)} if Σ_{x'} θ^{(y)}_{x→x'} ≠ 0 ; δ_{x',x} else } ,

the work production simplifies. To see this, we express Eq. (10) in terms of the estimated distribution Pr(Y^θ_{0:L}), ratchet M, and input y_{0:L}, assuming that the word y_{0:L} can be produced from every initial hidden state with nonzero probability, Pr(Y^θ_{0:L} = y_{0:L} | S^θ_0 = s_0) Pr(X_0 = s_0) ≠ 0, which guarantees that θ^{(y_i)}_{x_i→x_{i+1}} ≠ 0. If this constraint is not satisfied, the agent will dissipate infinite work, as it implies Pr(X^θ_i = x_i, Y^θ_i = y_i) = 0 for some i. Thus, we use the expression M_{xy→x'y'} = δ_{x',ε(x,y)}/|Y| in the work production:

⟨W^θ⟩|_{y_{0:L}}
= k_B T Σ_{x_{0:L+1}, y'_{0:L}} Pr(X_0 = x_0) Π_{j=0}^{L−1} M_{x_j y_j→x_{j+1} y'_j} ln Π_{i=0}^{L−1} [ Pr(X^θ_i = x_i, Y^θ_i = y_i) / Pr(X^θ_{i+1} = x_{i+1}, Y'^θ_i = y'_i) ]
= k_B T Σ_{x_{0:L+1}, y'_{0:L}} Pr(X_0 = x_0) Π_{j=0}^{L−1} [ δ_{x_{j+1},ε(x_j,y_j)} / |Y| ] ln Π_{i=0}^{L−1} [ Pr(X^θ_i = x_i, Y^θ_i = y_i) / ( Pr(X^θ_{i+1} = x_{i+1}) / |Y| ) ]
= k_B T L ln |Y| + k_B T Σ_{x_{0:L+1}} Pr(X_0 = x_0) Π_{j=0}^{L−1} δ_{x_{j+1},ε(x_j,y_j)} Σ_{i=0}^{L−1} ( ln Pr(Y^θ_i = y_i | X^θ_i = x_i) + ln [ Pr(X^θ_i = x_i) / Pr(X^θ_{i+1} = x_{i+1}) ] ) .

Note that Π_{j=0}^{L−1} δ_{x_{j+1},ε(x_j,y_j)} vanishes unless each element of the hidden-state trajectory x_i corresponds to the resulting state of the ε-machine when the initial state x_0 is driven by the first i inputs y_{0:i}, which is ε(x_0, y_{0:i}). The fact that the agent is driven into a unique state is guaranteed by the ε-machine's unifilarity. Thus, we rewrite:

Π_{j=0}^{L−1} δ_{x_{j+1},ε(x_j,y_j)} = Π_{j=1}^{L} δ_{x_j,ε(x_0,y_{0:j})} .

This engenders a simplification of the work production:

⟨W^θ⟩|_{y_{0:L}}
= k_B T L ln |Y| + k_B T Σ_{x_{0:L+1}} Pr(X_0 = x_0) Π_{j=1}^{L} δ_{x_j,ε(x_0,y_{0:j})} Σ_{i=0}^{L−1} ln [ Pr(Y^θ_i = y_i | X^θ_i = x_i) Pr(X^θ_i = x_i) / Pr(X^θ_{i+1} = x_{i+1}) ]
= k_B T L ln |Y| + k_B T Σ_{x_0} Pr(X_0 = x_0) Σ_{i=0}^{L−1} ln Pr(Y^θ_i = y_i | X^θ_i = ε(x_0, y_{0:i}))
  + k_B T Σ_{x_0} Pr(X_0 = x_0) ln [ Pr(X^θ_0 = x_0) / Pr(X^θ_L = ε(x_0, y_{0:L})) ] ,   (E1)

where y_{0:0} denotes the null input, which leaves the state fixed under the ε-map: ε(x_0, y_{0:0}) = x_0.

This brings us to an easily calculable work production, especially when the system is initialized in the start state s*. Recognizing that if we initiate the ε-machine in its start state s*, such that Pr(X_0 = x_0) = δ_{x_0,s*}, then X_0 = S_0 is predictive. By extension, every following agent state is predictive and equivalent to the causal state X_i = S_i, yielding:

Pr(Y_i = y_i | X_i = ε(s*, y_{0:i})) = Pr(Y_i = y_i | S_i = ε(s*, y_{0:i})) = Pr(Y_i = y_i | Y_{0:i} = y_{0:i}) .

Thus, the work production simplifies to a sum of terms that includes the log-likelihood of the finite input:

⟨W^θ⟩|_{y_{0:L}}
= k_B T Σ_{x_0} δ_{x_0,s*} Σ_{i=0}^{L−1} ln Pr(Y^θ_i = y_i | Y^θ_{0:i} = y_{0:i}) + k_B T L ln |Y| + k_B T Σ_{x_0} δ_{x_0,s*} ln [ δ_{x_0,s*} / Pr(X^θ_L = ε(x_0, y_{0:L})) ]
= k_B T ( ln Pr(Y^θ_{0:L} = y_{0:L}) + L ln |Y| − ln Pr(X^θ_L = ε(s*, y_{0:L})) ) .

The first log(·) in the last line is the log-likelihood of the model generating y_{0:L}—a common performance measure for machine learning algorithms. If an input has zero probability, this leads to −∞ work production, and all other features are drowned out by the log-likelihood term. Thus, the additional terms that come into play when the input probability vanishes become physically irrelevant: the agent is characterized by the ε-machine. From a machine learning perspective, the model is also characterized by the ε-machine θ^{(y)}_{s→s'} for the process Pr(Y^θ_{0:∞}). The additional term k_B T L ln |Y| is the work production that comes from exhausting fully randomized outputs and does not change depending on the underlying model.
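As a self-contained numerical check—our own toy example, not the paper's—the snippet below evaluates this expression for the two-state golden-mean ε-machine (no consecutive 1s), computing the word likelihood, the final-state distribution Pr(X^θ_L) by brute-force enumeration, and the resulting work:

```python
import itertools
import numpy as np

kBT = 1.0
# Golden-mean e-machine: theta[(x, y)] = (probability, next state);
# the start state s* is 'A' and symbol 1 is disallowed from 'B'.
theta = {
    ('A', 0): (0.5, 'A'), ('A', 1): (0.5, 'B'),
    ('B', 0): (1.0, 'A'), ('B', 1): (0.0, None),
}

def word_prob_and_end(word, start='A'):
    """Return Pr(word | start) and the end state eps(start, word)."""
    p, x = 1.0, start
    for y in word:
        prob, nxt = theta[(x, y)]
        p *= prob
        if p == 0.0:
            return 0.0, None
        x = nxt
    return p, x

word = (0, 1, 0, 0, 1, 0)
L = len(word)
p_word, end_state = word_prob_and_end(word)

# Pr(X_L = x): push every length-L word from s* through the eps-map.
p_XL = {}
for w in itertools.product([0, 1], repeat=L):
    p, x = word_prob_and_end(w)
    if p > 0:
        p_XL[x] = p_XL.get(x, 0.0) + p

# Work = k_B T [ln Pr(word) + L ln|Y| - ln Pr(X_L = eps(s*, word))].
work = kBT * (np.log(p_word) + L * np.log(2) - np.log(p_XL[end_state]))
print(work)
```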

The final term −k_B T ln Pr(X^θ_L = ε(s*, y_{0:L})) does directly depend on the model. Pr(X^θ_L = x) is the distribution over agent states X at time Lτ if the agent is driven by the estimated input distribution Y^θ_{0:L}. This component of the work production is larger, on average, for agents with high state uncertainty, since this leads, on average, to smaller values of Pr(X^θ_L). This contribution to the work production comes from the state space expanding from the start state s* to the larger (recurrent) subset of agent states, and so it provides additional work. This indicates that we are neglecting the cost of resetting to the start state while harnessing the energetic benefit of starting in it.

If the machine is designed to efficiently harness inputs again after it operates on one string, it must be reset to the start state s*. This can be implemented with an efficient channel that anticipates the input distribution Pr(X^θ_L = x), outputs the distribution Pr(X^θ_{L+1} = x) = δ_{x,s*}, and so costs:

⟨W^{θ,reset}⟩|_{y_{0:L}} = k_B T ln Pr(X^θ_L = ε(s*, y_{0:L})) .

Thus, when we add the cost of resetting the agent to the start state at X_{L+1}, the work production is dominated by the log-likelihood:

⟨W^θ⟩|_{y_{0:L}} = k_B T ( ln Pr(Y^θ_{0:L} = y_{0:L}) + L ln |Y| ) .   (E2)

Appendix F: Training Simple Agents

We now outline a case study of thermodynamic learning that is experimentally implementable using a controllable two-level system. We first introduce a straightforward method to implement the simplest possible efficient agent. Second, we show that this physical process achieves the general maximum-likelihood result arrived at in the main development. Last, we find the agent selected by thermodynamic learning along with its corresponding model. As expected, we find that this maximum-work producing agent learns its environment's predictive features.

1. Efficient Computational Trajectories

The simplest possible information ratchets have only a single internal state A and receive binary data y_j from a series of two-level systems Y_j = {↑, ↓}. These agents' internal models correspond to memoryless ε-machines, as shown in Fig. 9. The model's parameters are the probabilities of emitting ↑ and ↓, denoted θ^{(↑)}_{A→A} and θ^{(↓)}_{A→A}, respectively.

FIG. 9. Memoryless model of binary data consisting of a single state A and the probability of outputting a ↑ and a ↓, denoted θ^{(↑)}_{A→A} and θ^{(↓)}_{A→A}, respectively.

Our first step is to design an efficient computation that maps an input distribution Pr(Z_{jτ}) to an output distribution Pr(Z_{jτ+τ'}) over the jth interaction interval [jτ, jτ + τ']. The agent corresponds to the Hamiltonian evolution H_Z(t) = H_{X×Y_j}(t) over the joint space of the agent memory and jth input symbol. The resulting energy landscape E(z, t) is entirely specified by the energy of the two input states E(A × ↑, t) and E(A × ↓, t). Appropriately designing this energy landscape allows us to implement the efficient computation shown in Fig. 10. The thermodynamic evolution there instantaneously quenches the energy landscape into equilibrium with the estimated distribution at the beginning of the interaction interval Pr(Z^θ_{jτ}), then quasistatically evolves the system in equilibrium to the estimated final distribution Pr(Z^θ_{jτ+τ'}), and, finally, quenches back to the default energy landscape. In Fig. 10, the system undergoes a cycle, starting and ending with the same flat energy landscape, such that ΔE_Z = 0. This cycle evolves the distribution over the joint states A × ↑ and A × ↓ from Pr(Z^θ_{jτ} = {A × ↑, A × ↓}) = {0.8, 0.2} to Pr(Z^θ_{jτ+τ'} = {A × ↑, A × ↓}) = {0.4, 0.6}. Note that this strategy can be used to evolve between any initial and final distributions.

We control the transformation over time interval t ∈ (jτ, jτ + τ') such that the time scale of equilibration in the system of interest is much shorter than the interval length τ'. This slow-moving quasistatic control means that the states are in equilibrium with the energy landscape over the interval. In this case, the state distribution becomes the Boltzmann distribution:

Pr(Z_t = z) = e^{(F^{EQ}(t) − E(z,t))/k_B T} .

To minimize dissipation for the estimated distribution, the state distribution must be the estimated distribution Pr(Z_t = z) = Pr(Z^θ_t = z). And so, we set the two-level-system energies to be in equilibrium with the estimates:

E(z, t) = −k_B T ln Pr(Z^θ_t = z) .
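The control schedule implied here is easy to spell out. The following sketch—our construction, with a linear interpolation chosen arbitrarily among the many valid quasistatic paths—tabulates the equilibrium energies E(z, t) = −k_B T ln Pr(Z^θ_t = z) that carry the two-level system from {0.8, 0.2} to {0.4, 0.6}:

```python
import numpy as np

kBT, steps = 1.0, 5
p0 = np.array([0.8, 0.2])  # Pr(Z^theta_{j tau}) over {A x up, A x down}
p1 = np.array([0.4, 0.6])  # Pr(Z^theta_{j tau + tau'})

for k in range(steps + 1):
    s = k / steps                # fraction of the interval elapsed
    p_t = (1 - s) * p0 + s * p1  # one possible quasistatic path
    E_t = -kBT * np.log(p_t)     # energies keeping Z in equilibrium
    print(f"t/tau' = {s:.1f}: E(A x up) = {E_t[0]:.3f}, "
          f"E(A x down) = {E_t[1]:.3f}")
```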

FIG. 10. Joint two-level system Z = X × Y_j = {A × ↑, A × ↓} undergoing perfectly-efficient computation when it receives its estimated input, through a series of operations. The computation occurs over the time interval t ∈ (jτ, jτ + τ'). At panel A) t = jτ and the system has a default flat energy landscape E(z, jτ) = E(x × y, jτ) = 0. However, it is out of equilibrium, since it is in the distribution Pr(Z^θ_{jτ} = {A × ↑, A × ↓}) = {0.8, 0.2}. The first operation is a quench, which instantaneously sets the energies to be in equilibrium with the initial distribution, as shown in panel B). The associated energy change is work. Then, a quasistatic operation slowly evolves the system in equilibrium, through panel C), to the final desired distribution Pr(Z^θ_{jτ+τ'} = {A × ↑, A × ↓}) = {0.4, 0.6}, shown in panel D). This requires no work. Then, the final operation is another quench, in which the energies are reset to the default energy landscape E(z, jτ + τ') = 0, leaving the system as shown in panel E). Again, the change in energy corresponds to work invested through control. The total work production for a particular computational mapping A × y → A × y' is given by the work from the initial quench W_{|A×y}(jτ) plus the work from the final quench W_{|A×y'}(jτ + τ').
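A quick numerical audit of the Fig. 10 cycle—again our own sketch, not the paper's code—confirms that the two quench works add up to Eq. (F1) below for every computational mapping:

```python
import numpy as np

kBT = 1.0
p_in = {'up': 0.8, 'down': 0.2}    # Pr(Z^theta_{j tau} = A x y)
p_out = {'up': 0.4, 'down': 0.6}   # Pr(Z^theta_{j tau + tau'} = A x y')

for y, p0 in p_in.items():
    for y2, p1 in p_out.items():
        W_initial = kBT * np.log(p0)   # work from the initial quench
        W_final = -kBT * np.log(p1)    # work from the final quench
        total = W_initial + W_final
        # Matches Eq. (F1): k_B T ln[Pr(A x y) / Pr(A x y')].
        assert np.isclose(total, kBT * np.log(p0 / p1))
        print(f"A x {y} -> A x {y2}: W = {total:+.3f} k_B T")
```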

The resulting process produces zero work:

⟨W^{quasistatic}⟩ = −∫_{jτ}^{jτ+τ'} dt Σ_z Pr(Z^θ_t = z) ∂_t E(z, t) = 0 ,

and maps Pr(Z^θ_{jτ}) to Pr(Z^θ_{jτ+τ'}) without dissipation. With the quasistatic transformation producing zero work, the total work produced from the initial joint state x × y is exactly opposite the change in energy during the initial quench:

E(x × y, jτ) − E(x × y, jτ⁺) = k_B T ln Pr(Z^θ_{jτ} = x × y) ,

minus the change in energy of the final joint state x' × y' during the final quench:

E(x' × y', jτ + τ'⁻) − E(x' × y', jτ + τ') = −k_B T ln Pr(Z^θ_{jτ+τ'} = x' × y') .

The two-level system's state is fixed during the instantaneous energy changes. Thus, if the joint state follows the computational mapping x × y → x' × y', the work production is, as expected, directly connected to the estimated distributions:

W^θ_{|x×y,x'×y'} = k_B T ln [ Pr(Z^θ_{jτ} = x × y) / Pr(Z^θ_{jτ+τ'} = x' × y') ] .   (F1)

Recall from Sec. IV A that the ratchet system variable Z^θ_{jτ} = X^θ_j × Y^θ_j splits into the random variable X^θ_j for the jth agent memory state and the jth input Y^θ_j. Similarly, Z^θ_{jτ+τ'} = X^θ_{j+1} × Y'^θ_j splits into the (j+1)th agent memory state X^θ_{j+1} and jth output Y'^θ_j. This work production achieves the efficient limit for a model, W^θ_{|x×y,x'×y'} = ⟨W^θ_{|x×y,x'×y'}⟩, described in Eq. (8).

Appendix C generalized the thermodynamic operation above to any computation M_{z_τ→z_{τ'}}. While it requires an ancillary copy of the system Z to execute the conditional dependencies in the computation, it is conceptually identical in that it uses a sequence of quenching, evolving quasistatically, and then quenching again. This appendix extends the strategies outlined in Refs. [42, 53] to computational-mapping work calculations.

2. Efficient Information Ratchets

With the method for efficiently mapping inputs to outputs in hand, we can design a series of such computations to implement a simple information ratchet that produces work from a series y_{0:L}. As prescribed in Eq. (11) of Sec. V, to produce the most work from estimated model θ, the agent's logical architecture should randomly map every state to all others:

M_{xy→x'y'} = 1/|Y_j| = 1/2 ,

since there is only one causal state A. In conjunction with Eq. (12), we find that the estimated joint distribution of the agent and interaction symbol at the start of the interaction is equivalent to the parameters of the model:

Pr(Z^θ_{jτ} = x × y) = Pr(X^θ_j = x, Y^θ_j = y) = Pr(Y^θ_j = y | X^θ_j = A) Pr(X^θ_j = A) = θ^{(y)}_{A→A} ,

where we again used the fact that A is the only causal state. In turn, the estimated distribution after the interaction is:

Pr(Z^θ_{jτ+τ'} = x' × y') = Σ_{xy} Pr(X^θ_j = x, Y^θ_j = y) M_{xy→x'y'} = 1/2 .

Thus, assuming the agent has model θ built in, Eq. (F1) determines that the work production for mapping A × y to output A × y' for a particular symbol y is:

W_{|A×y,A×y'} = k_B T ( ln 2 + ln θ^{(y)}_{A→A} ) .

Since A is the only memory state and work does not depend on the output symbol y', the average work produced from an input y is:

⟨W_{|y}⟩ = W_{|A×y,A×y'} .   (F2)

With the work production expressed for a single input y_j, we can now consider how much work our designed agent harvests from the training data y_{0:L}. Summing the work production of each input yields a simple expression in terms of the model θ:

W_{|y_{0:L}} = Σ_{j=0}^{L−1} W_{|y_j} = Σ_{j=0}^{L−1} k_B T ( ln 2 + ln θ^{(y_j)}_{A→A} ) = k_B T ( L ln 2 + ln Π_{j=0}^{L−1} θ^{(y_j)}_{A→A} ) .

Due to the single causal state, the product within the logarithm simplifies to the probability of the word given the model: Π_{j=0}^{L−1} θ^{(y_j)}_{A→A} = Pr(Y^θ_{0:L} = y_{0:L}). So, the resulting work production depends on the familiar log-likelihood:

⟨W_{|y_{0:L}}⟩ = k_B T ( L ln 2 + ℓ(θ|y_{0:L}) ) = ⟨W^θ_{|y_{0:L}}⟩ ,

again achieving efficient work production, as expected.
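For instance—with a hypothetical bias and training string of our own choosing—the sum above can be checked directly:

```python
import numpy as np

kBT = 1.0
theta_up = 0.6                          # model parameter theta^(up)_{A->A}
data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # training string y_{0:L}

# Per-symbol work k_B T (ln 2 + ln theta^(y)), summed over the string.
W_total = sum(kBT * (np.log(2) + np.log(theta_up if y else 1 - theta_up))
              for y in data)

# Equivalently, k_B T (L ln 2 + log-likelihood of the string).
log_like = sum(np.log(theta_up if y else 1 - theta_up) for y in data)
assert np.isclose(W_total, kBT * (len(data) * np.log(2) + log_like))
print(W_total)
```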

3. Maximizing Work for Memoryless Models

Leveraging the explicit construction for efficient information ratchets, we can search for the agent that maximizes work from the input string y_{0:L}. To infer a model through work maximization, we label the frequency of ↑ states in this sequence with f(↑) and the frequency of ↓ with f(↓). The corresponding log-likelihood of the model is:

ℓ(θ|y_{0:L}) = ln [ ( θ^{(↑)}_{A→A} )^{L f(↑)} ( θ^{(↓)}_{A→A} )^{L f(↓)} ] = L f(↑) ln θ^{(↑)}_{A→A} + L f(↓) ln θ^{(↓)}_{A→A} .

Thus, for the corresponding agent, the work production is:

⟨W^θ⟩|_{y_{0:L}} = k_B T ℓ(θ|y_{0:L}) + k_B T L ln 2 = k_B T L ( ln 2 + f(↑) ln θ^{(↑)}_{A→A} + f(↓) ln θ^{(↓)}_{A→A} ) .

Selecting from all possible memoryless agents, the model parameters θ maximizing work production are given by the frequency of symbols in the input: f(↑) = θ^{(↑)}_{A→A} and f(↓) = θ^{(↓)}_{A→A}. The resulting work production is:

⟨W^θ⟩|_{y_{0:L}} = k_B T L ( ln 2 − H[f(↑)] ) ,

where H[f(↑)] is the Shannon entropy of a binary variable Y with Pr(Y = ↑) = f(↑), measured in nats.
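A brute-force parameter scan—our own numerical confirmation, with an arbitrary training string—shows the work-maximizing bias coinciding with the empirical frequency and the maximum matching k_B T L (ln 2 − H[f(↑)]):

```python
import numpy as np

kBT = 1.0
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical training string
L, f_up = len(data), np.mean(data)

# Work as a function of the model bias theta^(up)_{A->A}.
thetas = np.linspace(0.01, 0.99, 981)
work = kBT * L * (np.log(2) + f_up * np.log(thetas)
                  + (1 - f_up) * np.log(1 - thetas))

best = thetas[np.argmax(work)]
H_f = -(f_up * np.log(f_up) + (1 - f_up) * np.log(1 - f_up))  # nats
print(best, f_up)                               # maximizer ~= f(up)
print(work.max(), kBT * L * (np.log(2) - H_f))  # maximum ~= Landauer gain
```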

This simple example of learning statistical bias serves to explicitly lay out the stages of thermodynamic machine learning. The class of models is too simple, though, to illustrate the full power of the new learning method. That said, it does confirm that thermodynamic work maximization leads to useful models of data in the simplest case. As one would expect, the simple agent found by thermodynamic machine learning discovers the frequency of zeros in the input and, thus, it learns about its environment. The corresponding work production is the same as the energetic gain of randomizing L bits distributed according to the frequency f(↑).

However, this neglects the substantial thermodynamic benefits possible with temporally-correlated environments [38]. To illustrate how to extract this additional energy, a sequel designs and analyzes memoryful agents.

[1] (Baron) G. Cuvier. Essay on the Theory of the Earth. Kirk and Mercein, New York, 1818.
[2] H. Bergson. Creative Evolution. Henry Holt and Company, New York, New York, 1907.
[3] D. W. Thompson. On Growth and Form. Cambridge University Press, Cambridge, 1917.
[4] N. Wiener. Cybernetics: Or Control and Communication in the Animal and the Machine. MIT, Cambridge, 1948.
[5] N. Wiener. The Human Use of Human Beings: Cybernetics and Society. Da Capo Press, Cambridge, 1988.
[6] D. C. Dennett. Consciousness Explained. Little, Brown and Co., New York, New York, 1991.
[7] S. J. Gould and R. Lewontin. The spandrels of San Marco and the Panglossian paradigm: A critique of the adaptationist programme. Proc. Roy. Soc. Lond. B, 205(1161):581–598, 1979.
[8] D. C. Dennett. Darwin's Dangerous Idea: Evolution and the Meanings of Life. Simon and Schuster, New York, New York, 1995.
[9] J. Maynard-Smith and E. Szathmary. The Major Transitions in Evolution. Oxford University Press, Oxford, reprint edition, 1998.
[10] G. P. Wagner. Homology, Genes, and Evolutionary Innovation. Princeton University Press, Princeton, New Jersey, 2014.
[11] W. Thomson. Kinetic theory of the dissipation of energy. Nature, pages 441–444, 9 April 1874.
[12] J. C. Maxwell. Theory of Heat. Longmans, Green and Co., London, United Kingdom, ninth edition, 1888.
[13] T. Sagawa. Thermodynamics of information processing in small systems. Prog. Theo. Phys., 127(1):1–56, 2012.
[14] J. M. R. Parrondo, J. M. Horowitz, and T. Sagawa. Physics of information. Nature Physics, 11(2):131, 2015.
[15] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[16] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York, New York, second edition, 2009.
[17] P. Mehta, M. Bukov, C.-H. Wang, A. G. R. Day, C. Richardson, C. K. Fisher, and D. J. Schwab. A high-bias, low-variance introduction to machine learning for physicists. Physics Reports, 810:1–124, 2019.
[18] H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? J. Stat. Phys., 168(6):1223–1247, 2017.
[19] R. Landauer. Irreversibility and heat generation in the computing process. IBM J. Res. Develop., 5(3):183–191, 1961.
[20] C. H. Bennett. Demons, engines and the Second Law. Sci. Am., 257(5):108–116, 1987.
[21] L. Szilard. On the decrease of entropy in a thermodynamic system by the intervention of intelligent beings. Z. Phys., 53:840–856, 1929.
[22] T. L. H. Watkins, A. Rau, and M. Biehl. The statistical mechanics of learning a rule. Rev. Mod. Phys., 65:499–556, 1993.
[23] A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.
[24] A. Bell. Learning out of equilibrium. In J. P. Crutchfield and J. Machta, editors, Santa Fe Institute Workshop on Randomness, Structure, and Causality, 9–13 January 2011, volume 21, 2011.
[25] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585, 2015.
[26] S. Goldt and U. Seifert. Stochastic thermodynamics of learning. Phys. Rev. Lett., 118:010601, 2017.
[27] A. A. Alemi and I. Fischer. TherML: Thermodynamics of machine learning. arXiv:1807.04162, 2018.
[28] Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli. Statistical mechanics of deep learning. Ann. Rev. Cond. Matter Physics, 11:501–528, 2020.
[29] A. B. Boyd, D. Mandal, and J. P. Crutchfield. Leveraging environmental correlations: The thermodynamics of requisite variety. J. Stat. Phys., 167(6):1555–1585, 2016.
[30] D. Mandal and C. Jarzynski. Work and information processing in a solvable model of Maxwell's demon. Proc. Natl. Acad. Sci. USA, 109(29):11641–11645, 2012.
[31] A. B. Boyd, D. Mandal, and J. P. Crutchfield. Identifying functional thermodynamics in autonomous Maxwellian ratchets. New J. Physics, 18:023049, 2016.
[32] J. M. Gold and J. L. England. Self-organized novelty detection in driven spin glasses. arXiv:1911.07216, 2019.
[33] W. Zhong, J. M. Gold, S. Marzen, J. L. England, and N. Y. Halpern. Learning about learning by many-body systems. arXiv:2004.03604 [cond-mat.stat-mech], 2020.
[34] D. Jimenez Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv:1401.4082, 2014.
[35] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Series B, 39(1):1–22, 1977.
[36] J. P. Crutchfield. Between order and chaos. Nature Physics, 8(January):17–24, 2012.
[37] S. Deffner and C. Jarzynski. Information processing and the second law of thermodynamics: An inclusive, Hamiltonian approach. Phys. Rev. X, 3:041003, 2013.
[38] A. B. Boyd, D. Mandal, and J. P. Crutchfield. Correlation-powered information engines and the thermodynamics of self-correction. Phys. Rev. E, 95(1):012152, 2017.
[39] A. B. Boyd, D. Mandal, P. M. Riechers, and J. P. Crutchfield. Transient dissipation and structural costs of physical information transduction. Phys. Rev. Lett., 118:220602, 2017.
[40] N. Merhav. Sequence complexity and work extraction. J. Stat. Mech., page P06037, 2015.
[41] N. Merhav. Relations between work and entropy production for general information-driven, finite-state engines. J. Stat. Mech.: Th. Expt., 2017:1–20, 2017.
[42] A. J. P. Garner, J. Thompson, V. Vedral, and M. Gu. Thermodynamics of complexity and pattern manipulation. Phys. Rev. E, 95:042140, 2017.
[43] S. Deffner and E. Lutz. Information free energy for nonequilibrium states. arXiv:1201.3888, 2012.
[44] M. Esposito and C. van den Broeck. Second law and Landauer principle far from equilibrium. Europhys. Lett., 95:40004, 2011.
[45] G. E. Crooks. Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences. Phys. Rev. E, 60(3), 1999.
[46] C. Jarzynski. Hamiltonian derivation of a detailed fluctuation theorem. J. Stat. Phys., 98:77–102, 2000.
[47] T. Speck and U. Seifert. Distribution of work in isothermal nonequilibrium processes. Phys. Rev. E, 70:066112, 2004.
[48] K. J. Ray, G. W. Wimsatt, A. B. Boyd, and J. P. Crutchfield. Non-Markovian momentum computing: Universal and efficient. arXiv:2010.01152, 2020.
[49] A. Kolchinsky and D. H. Wolpert. Dependence of dissipation on the initial distribution over states. J. Stat. Mech.: Th. Expt., page 083202, 2017.
[50] P. M. Riechers and M. Gu. Initial-state dependence of thermodynamic dissipation for any quantum process. arXiv:2002.11425, 2020.
[51] J. G. Brookshear. Theory of Computation: Formal Languages, Automata, and Complexity. Benjamin/Cummings, Redwood City, California, 1989.
[52] N. Barnett and J. P. Crutchfield. Computational mechanics of input-output processes: Structured transformations and the ε-transducer. J. Stat. Phys., 161(2):404–451, 2015.
[53] A. B. Boyd, D. Mandal, and J. P. Crutchfield. Thermodynamics of modularity: Structural costs beyond the Landauer bound. Phys. Rev. X, 8:031036, 2018.
[54] G. R. Kirchhoff. Ann. Phys., 75:1891, 1848.
[55] J. W. Gibbs. The Scientific Papers of J. Willard Gibbs. Longmans, Green, New York, New York, 1906.
[56] J. C. Maxwell. A Treatise on Electricity and Magnetism, Vol. I and II. Dover Publications, Inc., New York, New York, third edition, 1954.
[57] L. Onsager. Reciprocal relations in irreversible processes, I. Phys. Rev., 37(4):405–426, 1931.
[58] I. Prigogine. Modération et transformations irréversibles des systèmes ouverts. Bulletin de la Classe des Sciences, Académie Royale de Belgique, 31:600–606, 1945.
[59] I. Prigogine and P. Glansdorff. Thermodynamic Theory of Structure, Stability and Fluctuations. Wiley-Interscience, London, 1971.
[60] G. Falasco, R. Rao, and M. Esposito. Information thermodynamics of Turing patterns. Phys. Rev. Lett., 121:108301, 2018.
[61] A. M. Turing. The chemical basis of morphogenesis. Trans. Roy. Soc., Series B, 237:5, 1952.
[62] R. Hoyle. Pattern Formation: An Introduction to Methods. Cambridge University Press, New York, 2006.
[63] M. Cross and H. Greenside. Pattern Formation and Dynamics in Nonequilibrium Systems. Cambridge University Press, Cambridge, United Kingdom, 2009.
[64] W. Heisenberg. Nonlinear problems in physics. Physics Today, 20:23–33, 1967.
[65] D. Ruelle and F. Takens. On the nature of turbulence. Comm. Math. Phys., 20:167–192, 1971.
[66] A. Brandstater, J. Swift, H. L. Swinney, A. Wolf, J. D. Farmer, E. Jen, and J. P. Crutchfield. Low-dimensional chaos in a hydrodynamic system. Phys. Rev. Lett., 51:1442, 1983.
[67] C. E. Shannon. A mathematical theory of communication. Bell Sys. Tech. J., 27:379–423, 623–656, 1948.
[68] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006.
[69] R. G. James, C. J. Ellison, and J. P. Crutchfield. Anatomy of a bit: Information in a time series observation. CHAOS, 21(3):037109, 2011.
[70] C. E. Shannon. The bandwagon. IEEE Trans. Info. Th., 2(3):3, 1956.
[71] A. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc., 42, 43:230–265, 544–546, 1937.
[72] W. Ross Ashby. An Introduction to Cybernetics. John Wiley and Sons, New York, second edition, 1960.
[73] E. T. Jaynes. Information theory and statistical mechanics. Phys. Rev., 106(4):620–630, 1957.
[74] E. T. Jaynes. The minimum entropy production principle. Annu. Rev. Phys. Chem., 31:579–601, 1980.
[75] A. N. Kolmogorov. A new metric invariant of transient dynamical systems and automorphisms in Lebesgue spaces. Dokl. Akad. Nauk. SSSR, 119:861, 1958. (Russian) Math. Rev. vol. 21, no. 2035a.
[76] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw. Geometry from a time series. Phys. Rev. Lett., 45:712, 1980.
[77] F. Takens. Detecting strange attractors in fluid turbulence. In D. A. Rand and L. S. Young, editors, Symposium on Dynamical Systems and Turbulence, volume 898, page 366, Berlin, 1981. Springer-Verlag.
[78] J. P. Crutchfield and B. S. McNamara. Equations of motion from a data series. Complex Systems, 1:417–452, 1987.
[79] L. Brillouin. Science and Information Theory. Academic Press, New York, second edition, 1962.
[80] C. H. Bennett. Thermodynamics of computation—a review. Intl. J. Theo. Phys., 21:905, 1982.
[81] T. Sagawa and M. Ueda. Information thermodynamics: Maxwell's demon in nonequilibrium dynamics. Nonequilibrium Stat. Phys. Small Syst. Fluct. Relations Beyond, pages 181–211, Nov 2013.
[82] J. P. Crutchfield and O. Gornerup. Objects that make objects: The population dynamics of structural complexity. J. Roy. Soc. Interface, 3:345–349, 2006.
[83] J. England. Statistical physics of self-replication. J. Chem. Phys., 139:121923, 2013.
[84] V. Serreli, C.-F. Lee, E. R. Kay, and D. A. Leigh. A molecular information ratchet. Nature, 445(7127):523–537, 2007.
[85] J. Thompson, A. J. P. Garner, V. Vedral, and M. Gu. Using quantum theory to simplify input–output processes. npj Quant. Info., 3(6), 2017.
[86] S. Loomis and J. P. Crutchfield. Thermal efficiency of quantum memory compression. Phys. Rev. Lett., 125:020601, 2020.
[87] M. P. Woods, R. Silva, G. Putz, S. Stupar, and R. Renner. Quantum clocks are more accurate than classical ones. arXiv:1806.00491, 2018.
[88] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.
[89] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, May 2015.
[90] R. Landauer. Information is physical. Physics Today, pages 23–29, May 1991.
[91] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. CHAOS, 13(1):25–54, 2003.
[92] C. C. Strelioff and J. P. Crutchfield. Bayesian structural inference for hidden processes. Phys. Rev. E, 89:042119, 2014.
[93] T. Wu and M. Tegmark. Toward an artificial intelligence physicist for unsupervised learning. Phys. Rev. E, 100(3):033311, 2019.
[94] J. P. Crutchfield. The calculi of emergence: Computation, dynamics, and induction. Physica D, 75:11–54, 1994.
[95] C. Jarzynski. Equalities and inequalities: Irreversibility and the second law of thermodynamics at the nanoscale. Annu. Rev. Cond. Matt. Phys., 2(1), 2011.
[96] J. A. Owen, A. Kolchinsky, and D. H. Wolpert. Number of hidden states needed to physically implement a given conditional distribution. New J. Phys., 21:013022, 2019.