Markov Neural Operators for Learning Chaotic Systems
Zongyi Li∗, Nikola Kovachki∗, Kamyar Azizzadenesheli†, Burigede Liu∗, Kaushik Bhattacharya∗, Andrew Stuart∗, Anima Anandkumar∗

June 15, 2021
Abstract

Chaotic systems are notoriously challenging to predict because of their instability. Small errors accumulate in the simulation of each time step, resulting in completely different trajectories. However, the trajectories of many prominent chaotic systems live in a low-dimensional subspace (attractor). If the system is Markovian, the attractor is uniquely determined by the Markov operator that maps the evolution over infinitesimal time steps. This makes it possible to predict the behavior of the chaotic system by learning the Markov operator, even if we cannot predict the exact trajectory. Recently, a new framework for learning resolution-invariant solution operators for PDEs was proposed, known as neural operators. In this work, we train a Markov neural operator (MNO) with only the local one-step evolution information. We then compose the learned operator to obtain the global attractor and invariant measure. Such a Markov neural operator forms a discrete semigroup, and we empirically observe that it does not collapse or blow up. Experiments show neural operators are more accurate and stable compared to previous methods on chaotic systems such as the Kuramoto-Sivashinsky and Navier-Stokes equations.
1 Introduction
Chaotic systems. One day in the winter of 1961, meteorologist Edward Lorenz was simulating simplified ordinary differential equations of weather on his primitive computer. Lorenz wanted to examine one particular time sequence. Instead of running the equation from the beginning, he hand-typed the initial condition from an earlier printout and walked down the hall for a cup of coffee. When he came back, to his surprise, the result was completely different from the previous run. At first, Lorenz assumed a vacuum tube had gone bad on his computer. But later, he realized that it was the seemingly negligible round-off error in the initial condition he had typed that made the simulation completely different [1]. The phenomenon, popularly known as the butterfly effect, led to the creation of chaos theory. Chaotic systems are characterized by strong instability: a tiny perturbation in the initial condition leads to a completely different trajectory. Such instability makes chaotic systems intricate and challenging both numerically and mathematically.
Machine learning methods for chaotic systems. Because of this intrinsic instability, it is infeasible for any method to capture the exact trajectory of a chaotic system over a long period. Correlations over longer times are chaotic and unstable. If one tries to fit a standard ML model such as a recurrent neural network (RNN) directly on long-time sequences, the gradient will easily blow up. Therefore, prior works either fit RNNs on very short sequences, or only learn a step-wise projection from a randomly generated evolution using reservoir computing (RC) [2–5]. These previous attempts are able to push the limit of faithful prediction to a moderate period on lower-dimensional systems such as ODEs like the Lorenz system or one-dimensional PDEs like the Kuramoto-Sivashinsky (KS) equation. However, they are not capable of modeling more complicated systems such as the Navier-Stokes (NS) equation, which exhibits turbulence.
∗Caltech, {zongyili, nkovachki, bgl, bhatta, astuart, anima}@caltech.edu †Purdue University, [email protected]
Figure 1: Markov neural operator (MNO): learn global dynamics from local data. (a) Learn the MNO from the local time-evolution data. (b) Compose the MNO to evaluate the long-time dynamic. (c) Architecture of the MNO.

Another approach to solving chaotic PDEs is to use the equation itself instead of data, via methods such as PINNs [6–8]. These are also only capable of solving, for example, the KS equation for a very short period. These results suggest that predicting long-time trajectories of such chaotic systems is, in some sense, an ill-posed problem. Instead, we take a new perspective: we predict long-time trajectories that, while eventually diverging from the truth, still remain on the same orbit (attractor) of the system and preserve its statistical properties.
Invariants in chaos. Despite the instability, many chaotic systems exhibit certain reproducible statistical properties, such as the energy spectrum and auto-correlation, that remain the same across different realizations [1]. For instance, this holds for popular PDE families such as the KS equation and the two-dimensional NS equation (Kolmogorov flows) [9]. Mathematically, this behavior can be characterized by the existence of an invariant measure. For Markovian systems, samples from this measure can be obtained through access to a Markov operator, a memoryless deterministic operator that captures the evolution of the dynamics from one time step to the next. In this case, the Markov operators form a discrete semigroup, defined by their compositions [10]. By learning the Markov operator, we are able to generate the attractor and estimate the invariant measure for a variety of chaotic systems that are of interest to the physics and applied mathematics communities [11–16].
Neural operators. To learn Markov operators, we need to model the time-evolution of functions in infinite-dimensional function spaces. This is especially challenging when we generate long-term trajectories: even a very small error accumulates along the composition, potentially causing exponential build-up. If one cannot capture the Markov operator with respect to the true invariant measure, the trajectories will eventually blow up or collapse. Therefore, we propose to use a recent advance in operator learning known as the neural operator [17]. The neural operator remedies the mesh-dependent nature of finite-dimensional methods such as standard RNN, RC, and CNN models, and therefore better captures the invariant measure of chaotic systems.
1.1 Our Contributions

In this work, we exploit the ergodicity and Markovian property of the target dynamical systems. We learn a Markov neural operator (MNO) given only one-step evolution data and compose it over a long time horizon to obtain the global attractor of the chaotic system, as shown in Figure 1. The MNO forms a discrete semigroup defined by the composition of operators, allowing us to reach any state further in the future [18]. The goal is
to predict the global attractor which qualitatively does not collapse or blow up over a long or infinite time horizon, and quantitatively, shares the same distribution and statistics as the true function space trajectories.
• We learn a Markov neural operator (MNO) and compose it over a long time horizon to obtain the global attractor. Experiments show the learned Markov operator is stable and captures the long-time trajectory without collapsing or blowing up for complicated chaotic PDEs such as the KS and NS equations.
• We show that neural operators quantitatively outperform various RNN and CNN models. Our model is four times more accurate for single-step prediction and has one order of magnitude lower error on invariant statistics.
• We develop an approximation theorem that guarantees MNO is expressive enough for any locally Lipschitz system (this includes the KS and NS equations) over any finite period.
• We use Sobolev norms in operator learning. Experiments show the Sobolev norm significantly outperforms standard loss functions in preserving invariant statistics such as the spectrum.
Invariant statistics. We use MNO to capture invariant statistics such as the Fourier spectrum, the spectrum of the proper orthogonal decomposition (POD), the point-wise distribution, the auto-correlation, and other domain-specific statistics such as turbulence kinetic energy and dissipation rate. These invariant statistics can usually be classified as a combination of different orders of derivatives and moments. If one uses a standard mean square error (MSE) for training, the model is not able to capture the higher frequency information induced from the derivatives. Therefore, we use the Sobolev norm to address higher-order derivatives and moments [19, 20]. This is similar in spirit to pre-multiplying the spectrum by the wavenumber, an approach commonly used in fluid analysis [21]. When used for training, the Sobolev norm yields a significant improvement in capturing high frequency details compared to the standard MSE or L2 loss, and therefore we obtain a Markov operator that is able to better characterize the invariant measure.
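For concreteness, the sketch below shows one way such a Sobolev-type loss could be computed with the FFT, by weighting Fourier modes with powers of the wavenumber so that errors in derivatives (high frequencies) are penalized. The function name, the weighting $(1 + |k|^2)^s$, and the relative normalization are illustrative assumptions, not the authors' implementation.

```python
import torch

def sobolev_loss(pred, true, s=1, eps=1e-8):
    """Illustrative relative H^s-type loss on a uniform 1D periodic grid.

    pred, true: tensors of shape (batch, n). Fourier modes are weighted by
    (1 + |k|^2)^s, which emphasizes derivatives compared to a plain MSE/L2 loss.
    """
    n = pred.shape[-1]
    k = torch.fft.rfftfreq(n, d=1.0 / n)           # integer wavenumbers 0..n/2
    weight = (1.0 + k ** 2) ** s                   # Sobolev weight per mode
    diff_hat = torch.fft.rfft(pred - true, dim=-1)
    true_hat = torch.fft.rfft(true, dim=-1)
    num = torch.sum(weight * diff_hat.abs() ** 2, dim=-1)
    den = torch.sum(weight * true_hat.abs() ** 2, dim=-1) + eps
    return torch.sqrt(num / den).mean()            # relative H^s error
```

Setting s = 0 recovers a relative L2 loss, so the same routine can be used to compare against the standard baseline.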
2 Problem setting
We will consider infinite-dimensional dynamical systems where the phase space $U$ is a Banach space and, in particular, a function space on a Lipschitz domain $D \subset \mathbb{R}^d$. We are interested in the initial-value problem
$$\frac{du}{dt}(t) = F(u(t)), \quad t \in (0, \infty), \qquad u(0) = u_0, \tag{1}$$
for initial conditions $u_0 \in U$, where $F$ is usually a non-linear operator. We will assume, given some appropriate boundary conditions on $\partial D$, that the solution $u(t) \in U$ exists and is unique for all times $t \in (0, \infty)$. When making the spatial dependence explicit, we will write $u(x, t)$ to indicate the evaluation $u(t)|_x$ for any $x \in D$. We adopt the viewpoint of casting time-dependent PDEs into function-space ODEs (1), as this leads to the semigroup approach to evolutionary PDEs which underlies our learning methodology.
2.1 Markov operators and the semigroup

Since the Banach-valued ODE (1) is autonomous, that is, $F$ does not explicitly depend on time, under the well-posedness assumption we may define, for any $t \in [0, \infty)$, a Markov operator $S_t : U \to U$ such that $u(t) = S_t u(0)$. This map satisfies the properties
1. $S_0 = I$,
2. $S_t(u_0) = u(t)$,
3. $S_t(S_s(u_0)) = u(t + s)$,
for any $s, t \in [0, \infty)$ and any $u_0 \in U$, where $I$ denotes the identity operator on $U$. In particular, the family $\{S_t : t \in [0, \infty)\}$ defines a semigroup of operators acting on $U$. Our goal is to approximate a particular element of this semigroup associated to some fixed time step $h > 0$, given observations of the trajectory from (1). We build an approximation $\hat{S}_h : U \to U$ such that
$$\hat{S}_h \approx S_h. \tag{2}$$
Long-term predictions. Note that having access to the map $\hat{S}_h$ from (2) allows for approximating long-time trajectories of (1) by repeatedly composing $\hat{S}_h$ with its own output. Indeed, by (2) and the semigroup property, we find that
$$u(n \cdot h) \approx \hat{S}_h^n(u_0) := \underbrace{(\hat{S}_h \circ \cdots \circ \hat{S}_h)}_{n \text{ times}}(u_0) \tag{3}$$
for any $n \in \mathbb{N}$. This fact is formalized in Theorem 1, stated in the following section. It is important to note that the quality of the approximation in (3) is $n$-dependent; in particular, predicting a trajectory well is only possible up to a finite time (unless there is some special feature in the dynamic such as periodicity). Luckily, for many practical applications, it is not necessary to predict exact trajectories of a system. Rather, it is sufficient to characterize statistical properties of the system's long-time behavior. This is especially true for chaotic systems, where very small perturbations can result in large long-time deviations. Many such systems occur in nature, for example, in the turbulent flow of fluids. In the current work, we take the perspective of accurately characterizing such statistical properties.
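Schematically, (3) amounts to feeding the model's output back in as its next input. A minimal sketch, assuming a learned one-step map `model` acting on discretized states (the function and variable names are illustrative):

```python
import torch

@torch.no_grad()
def rollout(model, u0, n_steps):
    """Approximate u(n*h) for n = 0..n_steps by composing the learned
    one-step operator with its own output, as in (3).
    `model` maps a discretized u(t) to u(t+h)."""
    traj = [u0]
    u = u0
    for _ in range(n_steps):
        u = model(u)              # apply \hat{S}_h to the previous state
        traj.append(u)
    return torch.stack(traj)      # shape: (n_steps + 1, *u0.shape)
```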
2.2 Attractors, ergodicity, and invariant measures.
Global Attractors. The long-time behavior of the solution to (1) is characterized by a set $\mathcal{U} = \mathcal{U}(u_0) \subset U$ which is invariant under the dynamic, i.e., $S_t(\mathcal{U}) = \mathcal{U}$ for all $t \geq 0$, and to which the orbit $u(t)$ converges:
$$\inf_{v \in \mathcal{U}} \|u(t) - v\|_U \to 0 \quad \text{as } t \to \infty.$$
When it exists, $\mathcal{U}$ is often identified as the $\omega$-limit set of $u_0$. The chaotic nature of certain dynamical systems arises from the complex structure of this set: $u(t)$ follows $\mathcal{U}$, and $\mathcal{U}$ can be, for example, a fractal set. For dissipative systems, such sets exist and constitute a larger, global attractor $A \subset U$ which is compact and attracts all orbits $u(t)$; in particular, $A$ is independent of the initial condition $u_0$. Many PDEs arising in physics, such as reaction-diffusion equations describing chemical dynamics or the Navier-Stokes equation describing the flow of fluids, are dissipative and possess a global attractor which is oftentimes finite-dimensional [9]. Therefore, numerically characterizing the attractor is an important problem in scientific computing with many potential applications.
Data distribution. For many applications, an exact form for the possible initial conditions to (1) is not available; it is therefore convenient to use a stochastic model to describe the initial states. To that end, let $\mu_0$ be a probability measure on $U$ and assume that all possible initial conditions to (1) come as samples from $\mu_0$, i.e., $u_0 \sim \mu_0$. Then any possible state of the dynamic (1) after some time $t > 0$ can be thought of as being distributed according to the pushforward measure $\mu_t := S_t^{\sharp} \mu_0$, i.e., $u(t) \sim \mu_t$. Therefore, as the dynamic evolves, so does the type of likely functions that result. This further complicates the problem of long-time predictions since training data may only be obtained up to finite time horizons, hence the model will need the ability to predict not only on data that is out-of-sample but also out-of-distribution.
Ergodic systems. To alleviate some of the previously presented challenges, we consider ergodic systems. Roughly speaking, a system is ergodic if there exists an invariant measure µ such that after some T > 0, we have µt ≈ µ for any t ≥ T (in fact, µ can be defined without any reference to µ0 or its pushforwards, see [22] for details). That is, after some large enough time, the distribution of possible states that the system can be in is fixed for any time further into the future. Indeed, µ charges the global attractor A. Notice that ergodicity is a much more general property than having stationary states which means that the system has a fixed period in time, or having steady states which means the system is unchanged in time.
Ergodicity mitigates the difficulty of learning a model that is able to predict out-of-distribution, since both the input and the output of $\hat{S}_h$ will approximately be distributed according to $\mu$. Furthermore, we may use $\hat{S}_h$ to learn about $\mu$ since sampling it simply corresponds to running the dynamic forward. Indeed, we need only generate data on a finite time horizon in order to learn $\hat{S}_h$, and, once learned, we may use it to sample $\mu$ indefinitely by simply repeatedly composing $\hat{S}_h$ with itself. Having samples of $\mu$ then allows us to compute statistics which characterize the long-term behavior of the system and therefore the global attractor A. Note further that this strategy avoids the issue of accumulating errors in long-term trajectory predictions since we are only interested in the property that $\hat{S}_h(u(t)) \sim \mu$.
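As an illustration of this strategy, invariant statistics such as the energy spectrum can be estimated by averaging over states from a long rollout after discarding a burn-in transient, treating post-transient states as approximate samples of $\mu$. The sketch below makes that assumption explicit; it is not the authors' evaluation code, and the function and argument names are illustrative.

```python
import torch

@torch.no_grad()
def estimate_spectrum(model, u0, n_steps, burn_in):
    """Estimate a time-averaged energy spectrum from a long rollout.

    model: learned one-step map u(t) -> u(t+h).
    u0:    (n,) state on a uniform periodic grid.
    States after `burn_in` steps are treated as samples of the invariant measure.
    """
    u, spectra = u0, []
    for step in range(n_steps):
        u = model(u)
        if step >= burn_in:                      # discard the transient
            u_hat = torch.fft.rfft(u, dim=-1)
            spectra.append(u_hat.abs() ** 2)     # energy per Fourier mode
    return torch.stack(spectra).mean(dim=0)      # time-averaged spectrum
```

Other invariant statistics (point-wise histograms, auto-correlations, turbulence kinetic energy) can be accumulated from the same rollout in the same way.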
Remark. Notably, the existence of a global attractor does not imply the existence of an invariant measure. Indeed, the only deterministic and chaotic systems proven to possess an invariant measure are certain ODEs such as the Lorenz-63 system [23]. On the other hand, proving the existence of an invariant measure for deterministic and chaotic PDEs such as the KS or NS equations remains an open problem, despite ergodic behavior being observed empirically.
3 Markov neural operator
We propose the Markov neural operator, a method for learning the underlying Markov operators of chaotic dynamical systems. In particular, we approximate the operator mapping the solution at the current step to the next step, $\hat{S}_h : u(t) \mapsto u(t + h)$. The architecture of the Markov neural operator is detailed in Figure 1. The Markov neural operator employs a local transformation $P$ to lift the input function $u(t) \in U$ to a higher-dimensional representation $v_0(x) = P(u(x, t))$. We model this step using a fully-connected neural network. The lifting step provides a rich representation for later steps of the Markov neural operator. The computation of $v_0$ is followed by multiple layers of iterative updates $v_l \mapsto v_{l+1}$ which consist of applying non-linear
operators formed by the composition of a non-local integral operator $\mathcal{K}_{\phi_l}$ with a non-linear activation function $\sigma$. That is,
$$v_{l+1}(x) := \sigma\big(W v_l(x) + (\mathcal{K}_{\phi_l} v_l)(x)\big), \quad \forall x \in D \tag{4}$$
where $\mathcal{K}_{\phi_l} : U \to U$ is a linear operator parameterized by $\phi_l \in \Theta_l$, a finite-dimensional space. The operator $\mathcal{K}_{\phi_l}$ is chosen to be a kernel integral operator parameterized by a neural network,
$$(\mathcal{K}_{\phi_l} v_l)(x) := \int_D \kappa_{\phi_l}(x, y)\, v_l(y)\, dy, \quad \forall x \in D \tag{5}$$
where the kernel function $\kappa_{\phi_l} : \mathbb{R}^{2d} \to \mathbb{R}^{d_v \times d_v}$ is a neural network parameterized by $\phi_l$. Finally, the output $u(x, t + h) \approx Q(v_L(x))$ is the projection of $v_L$ by the local transformation $Q : \mathbb{R}^{d_v} \to \mathbb{R}^{d_u}$, which is again given by a fully-connected neural network. The work of Li et al. [24] proposes the Fourier neural operator (FNO)
as an efficient implementation of neural operators by assuming $\kappa_{\phi_l}(x, y) = \kappa_{\phi_l}(x - y)$. In this case, (5) is a convolution operator. FNO exploits the structure of convolution by parameterizing $\kappa_{\phi_l}$ directly in Fourier space and using the Fast Fourier Transform (FFT) to efficiently compute (5); in particular,
$$(\mathcal{K}_{\phi_l} v_l)(x) = \mathcal{F}^{-1}\big(R_{\phi_l} \cdot (\mathcal{F} v_l)\big)(x), \quad \forall x \in D,$$
where $\mathcal{F}$ denotes the Fourier transform and $R_{\phi_l}$ is the Fourier-space parameterization of $\kappa_{\phi_l}$, restricted to a truncated set of low-frequency modes.
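Below is a minimal 1D sketch of such a Fourier layer, together with the lifting $P$ and projection $Q$, in the spirit of the published FNO design. The layer width, number of retained modes, activation, and module names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Kernel integral operator (5) realized as multiplication in Fourier space."""
    def __init__(self, width, modes):
        super().__init__()
        self.modes = modes                         # number of retained low-frequency modes
        scale = 1.0 / (width * width)
        self.weight = nn.Parameter(
            scale * torch.randn(width, width, modes, dtype=torch.cfloat))

    def forward(self, v):                          # v: (batch, width, n)
        v_hat = torch.fft.rfft(v, dim=-1)
        out_hat = torch.zeros_like(v_hat)
        out_hat[..., :self.modes] = torch.einsum(  # R_phi * (F v) on kept modes
            "bix,iox->box", v_hat[..., :self.modes], self.weight)
        return torch.fft.irfft(out_hat, n=v.shape[-1], dim=-1)

class MNO1d(nn.Module):
    """One-step map u(t) -> u(t+h): lift P, Fourier layers as in (4), project Q."""
    def __init__(self, width=64, modes=16, n_layers=4):
        super().__init__()
        self.lift = nn.Linear(1, width)                               # P
        self.convs = nn.ModuleList(
            [SpectralConv1d(width, modes) for _ in range(n_layers)])
        self.ws = nn.ModuleList(
            [nn.Conv1d(width, width, 1) for _ in range(n_layers)])    # local W
        self.proj = nn.Linear(width, 1)                               # Q

    def forward(self, u):                          # u: (batch, n)
        v = self.lift(u.unsqueeze(-1)).permute(0, 2, 1)   # (batch, width, n)
        for conv, w in zip(self.convs, self.ws):
            v = torch.nn.functional.gelu(w(v) + conv(v))  # update (4)
        return self.proj(v.permute(0, 2, 1)).squeeze(-1)  # (batch, n)
```

A model of this form, trained on one-step pairs with a Sobolev-type loss, can then be rolled out by the composition in (3) to generate long trajectories and estimate invariant statistics.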