Environmental Modelling & Software 75 (2016) 273–316

Markov chain Monte Carlo simulation using the DREAM software package: Theory, concepts, and MATLAB implementation

Jasper A. Vrugt a, b, c, *

a Department of Civil and Environmental Engineering, University of California Irvine, 4130 Engineering Gateway, Irvine, CA 92697-2175, USA
b Department of Earth System Science, University of California Irvine, Irvine, CA, USA
c Institute for Biodiversity and Ecosystem Dynamics, University of Amsterdam, Amsterdam, The Netherlands

* Corresponding author. Department of Civil and Environmental Engineering, University of California Irvine, 4130 Engineering Gateway, Irvine, CA 92697-2175, USA. E-mail address: [email protected]. URL: http://faculty.sites.uci.edu/jasper

Article history: Received 17 February 2015; Received in revised form 13 August 2015; Accepted 13 August 2015.

Abstract

Bayesian inference has found widespread application and use in science and engineering to reconcile Earth system models with data, including prediction in space (interpolation), prediction in time (forecasting), assimilation of observations and deterministic/stochastic model output, and inference of the model parameters. Bayes theorem states that the posterior probability, $p(H|\tilde{Y})$, of a hypothesis, $H$, is proportional to the product of the prior probability, $p(H)$, of this hypothesis and the likelihood, $L(H|\tilde{Y})$, of the same hypothesis given the new observations, $\tilde{Y}$, or $p(H|\tilde{Y}) \propto p(H)L(H|\tilde{Y})$. In science and engineering, $H$ often constitutes some numerical model, $F(x)$, which summarizes, in algebraic and differential equations, state variables and fluxes, all knowledge of the system of interest, and the unknown parameter values, $x$, are subject to inference using the data $\tilde{Y}$. Unfortunately, for complex system models the posterior distribution is often high dimensional and analytically intractable, and sampling methods are required to approximate the target. In this paper I review the basic theory of Markov chain Monte Carlo (MCMC) simulation and introduce a MATLAB toolbox of the DiffeRential Evolution Adaptive Metropolis (DREAM) algorithm developed by Vrugt et al. (2008a, 2009a) and used for Bayesian inference in fields ranging from physics, chemistry and engineering, to ecology, hydrology, and geophysics. This MATLAB toolbox provides scientists and engineers with an arsenal of options and utilities to solve posterior sampling problems involving (among others) bimodality, high-dimensionality, summary statistics, bounded parameter spaces, dynamic simulation models, formal/informal likelihood functions (GLUE), diagnostic model evaluation, data assimilation, Bayesian model averaging, distributed computation, and informative/noninformative prior distributions. The DREAM toolbox supports parallel computing and includes tools for convergence analysis of the sampled chain trajectories and post-processing of the results. Seven different case studies illustrate the main capabilities and functionalities of the MATLAB toolbox.

© 2015 Elsevier Ltd. All rights reserved.

Keywords: Bayesian inference; Markov chain Monte Carlo (MCMC) simulation; Random walk Metropolis (RWM); Adaptive Metropolis (AM); Differential evolution Markov chain (DE-MC); Prior distribution; Posterior distribution; Approximate Bayesian computation (ABC); Diagnostic model evaluation; Residual analysis; Environmental modeling; Bayesian model averaging (BMA); Generalized likelihood uncertainty estimation (GLUE); Multi-processor computing; Extended Metropolis algorithm (EMA)

1. Introduction and scope

Continued advances in direct and indirect (e.g. geophysical, pumping test, remote sensing) measurement technologies and improvements in computational technology and process knowledge have stimulated the development of increasingly complex environmental models that use algebraic and (stochastic) ordinary (partial) differential equations (PDEs) to simulate the behavior of a myriad of highly interrelated ecological, hydrological, and biogeochemical processes at different spatial and temporal scales. These water, energy, nutrient, and vegetation processes are often non-separable and non-stationary, with very complicated and highly nonlinear spatio-temporal interactions (Wikle and Hooten, 2010), which gives rise to complex system behavior. This complexity poses significant measurement and modeling challenges, in particular how to adequately characterize the spatio-temporal processes of the dynamic system of interest in the presence of (often) incomplete and insufficient observations, process knowledge and system characterization. This includes prediction in space (interpolation/extrapolation), prediction in time (forecasting), assimilation of observations and deterministic/stochastic model output, and inference of the model parameters.

The use of differential equations might be more appropriate than purely empirical relationships among variables, but does not guard against epistemic errors due to incomplete and/or inexact process knowledge. Fig. 1 provides a schematic overview of the most important sources of uncertainty that affect our ability to describe as closely and consistently as possible the observed system behavior. These sources of uncertainty have been discussed extensively in the literature, and much work has focused on the characterization of parameter, model output and state variable uncertainty. Explicit knowledge of each individual error source would provide strategic guidance for investments in data collection and/or model improvement. For instance, if input (forcing/boundary condition) data uncertainty dominates total simulation uncertainty, then it would not be productive to increase model complexity, but rather to prioritize improved characterization of the input data instead. On the contrary, it would be naive to spend a large portion of the available monetary budget on system characterization if this constitutes only a minor portion of total prediction uncertainty. Note that model structural error (label 4 in Fig. 1, also called epistemic error) has received relatively little attention, but is key to learning and scientific discovery (Vrugt et al., 2005; Vrugt and Sadegh, 2013).

The focus of this paper is on spatio-temporal models that may be discrete in time and/or space, but with processes that are continuous in both. A MATLAB toolbox is described which can be used to derive the posterior parameter (and state) distribution, conditioned on measurements of observed system behavior. At least some level of calibration of these models is required to make sure that the simulated state variables, internal fluxes, and output variables match the observed system behavior as closely and consistently as possible. Bayesian methods have found widespread application and use to do so, in particular because of their innate ability to handle, in a consistent and coherent manner, parameter, state variable, and model output (simulation) uncertainty.

If $\tilde{Y} = \{\tilde{y}_1,\ldots,\tilde{y}_n\}$ signifies a discrete vector of measurements at times $t = \{1,\ldots,n\}$ which summarizes the response of some environmental system $\mathcal{J}$ to forcing variables $U = \{u_1,\ldots,u_n\}$, then the observations or data are linked to the physical system via

$$\tilde{Y} \leftarrow \mathcal{J}(x) + \varepsilon, \qquad (1)$$

where $x = \{x_1,\ldots,x_d\}$ are the unknown parameters, and $\varepsilon = \{\varepsilon_1,\ldots,\varepsilon_n\}$ is an $n$-vector of measurement errors. When a hypothesis, or simulator, $Y \leftarrow F(x,\tilde{u},\tilde{\psi}_0)$ of the physical process is available, then the data can be modeled using

$$\tilde{Y} \leftarrow F(x,\tilde{U},\tilde{\psi}_0) + E, \qquad (2)$$

where $\tilde{\psi}_0 \in \Psi \subseteq \mathbb{R}^{\tau}$ signify the initial states, and $E = \{e_1,\ldots,e_n\}$ includes observation error (of the forcing and output data) as well as error due to the fact that the simulator, $F(\cdot)$, may be systematically different from reality, $\mathcal{J}(x^*)$, for the parameters $x^*$. The latter may arise from numerical errors (inadequate solver and discretization), and improper model formulation and/or parameterization.

By adopting a Bayesian formalism the posterior distribution of the parameters of the model can be derived by conditioning the spatio-temporal behavior of the model on measurements of the observed system response

$$p(x|\tilde{Y}) = \frac{p(x)\,p(\tilde{Y}|x)}{p(\tilde{Y})}, \qquad (3)$$

where $p(x)$ and $p(x|\tilde{Y})$ signify the prior and posterior parameter distribution, respectively, and $L(x|\tilde{Y}) \equiv p(\tilde{Y}|x)$ denotes the likelihood function. The evidence, $p(\tilde{Y})$, acts as a normalization constant (scalar) so that the posterior distribution integrates to unity

$$p(\tilde{Y}) = \int_{\chi} p(x)\,p(\tilde{Y}|x)\,dx = \int_{\chi} p(x,\tilde{Y})\,dx, \qquad (4)$$

over the parameter space, $x \in \chi \subseteq \mathbb{R}^d$. In practice, $p(\tilde{Y})$ is not required for posterior estimation as all statistical inferences about $p(x|\tilde{Y})$ can be made from the unnormalized density

$$p(x|\tilde{Y}) \propto p(x)\,L(x|\tilde{Y}). \qquad (5)$$

Fig. 1. Schematic illustration of the most important sources of uncertainty in environmental systems modeling, including (1) parameter, (2) input data (also called forcing or boundary conditions), (3) initial state, (4) model structural, (5) output, and (6) calibration data uncertainty. The measurement data error is often conveniently assumed to be known, a rather optimistic approach in most practical situations. The question remains how to properly describe/infer all sources of uncertainty in a coherent and statistically adequate manner.

If we assume, for the time being, that the prior distribution, $p(x)$, is well defined, then the main culprit resides in the definition of the likelihood function, $L(x|\tilde{Y})$, used to summarize the distance between the model simulations and corresponding observations. If the error residuals are assumed to be uncorrelated then the likelihood of the $n$-vector of error residuals can be written as follows

$$L(x|\tilde{Y}) = f_{\tilde{y}_1}(y_1(x)) \times f_{\tilde{y}_2}(y_2(x)) \times \cdots \times f_{\tilde{y}_n}(y_n(x)) = \prod_{t=1}^{n} f_{\tilde{y}_t}(y_t(x)), \qquad (6)$$

where $f_a(b)$ signifies the probability density function of $a$ evaluated at $b$. If we further assume the error residuals to be normally distributed, $e_t(x) \sim \mathcal{N}(0,\hat{\sigma}_t^2)$, then Equation (6) becomes

$$L(x|\tilde{Y},\hat{\sigma}^2) = \prod_{t=1}^{n} \frac{1}{\sqrt{2\pi\hat{\sigma}_t^2}} \exp\left[-\frac{1}{2}\left(\frac{\tilde{y}_t - y_t(x)}{\hat{\sigma}_t}\right)^2\right], \qquad (7)$$

where $\hat{\sigma} = \{\hat{\sigma}_1,\ldots,\hat{\sigma}_n\}$ is an $n$-vector with standard deviations of the measurement error of the observations. This formulation allows for homoscedastic (constant variance) and heteroscedastic measurement errors (variance dependent on magnitude of data).¹ For reasons of numerical stability and algebraic simplicity it is often convenient to work with the log-likelihood, $\mathcal{L}(x|\tilde{Y},\hat{\sigma}^2)$, instead

$$\mathcal{L}(x|\tilde{Y},\hat{\sigma}^2) = -\frac{n}{2}\log(2\pi) - \sum_{t=1}^{n}\log(\hat{\sigma}_t) - \frac{1}{2}\sum_{t=1}^{n}\left(\frac{\tilde{y}_t - y_t(x)}{\hat{\sigma}_t}\right)^2. \qquad (8)$$

If the error residuals, $E(x) = \tilde{Y} - Y(x) = \{e_1(x),\ldots,e_n(x)\}$, exhibit temporal (or spatial) correlation then one can try to take explicit account of this in the derivation of the log-likelihood function. For instance, suppose the error residuals assume an AR(1) process

$$e_t(x) = c + \phi\, e_{t-1}(x) + \eta_t, \qquad (9)$$

with $\eta_t \sim \mathcal{N}(0,\hat{\sigma}_t^2)$, expectation $\mathrm{E}[e_t(x)] = c/(1-\phi)$, and variance $\mathrm{Var}[e_t(x)] = \hat{\sigma}^2/(1-\phi^2)$. This then leads to the following formulation of the log-likelihood (derivation in statistics textbooks)

$$\mathcal{L}(x|\tilde{Y},c,\phi,\hat{\sigma}^2) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\!\left(\frac{\hat{\sigma}_1^2}{1-\phi^2}\right) - \frac{\left(e_1(x) - c/(1-\phi)\right)^2}{2\hat{\sigma}_1^2/(1-\phi^2)} - \sum_{t=2}^{n}\log(\hat{\sigma}_t) - \frac{1}{2}\sum_{t=2}^{n}\left(\frac{e_t(x) - c - \phi\, e_{t-1}(x)}{\hat{\sigma}_t}\right)^2, \qquad (10)$$

where $|\phi| < 1$ signifies the first-order autoregressive coefficient. If we assume $c$ to be zero (absence of long-term trend) then Equation (10) reduces, after some rearrangement, to

$$\mathcal{L}(x|\tilde{Y},\phi,\hat{\sigma}^2) = -\frac{n}{2}\log(2\pi) + \frac{1}{2}\log\!\left(1-\phi^2\right) - \frac{1}{2}\left(1-\phi^2\right)\hat{\sigma}_1^{-2}e_1(x)^2 - \sum_{t=1}^{n}\log(\hat{\sigma}_t) - \frac{1}{2}\sum_{t=2}^{n}\left(\frac{e_t(x) - \phi\, e_{t-1}(x)}{\hat{\sigma}_t}\right)^2, \qquad (11)$$

and the nuisance variables $\{\phi,\hat{\sigma}\}$ are subject to inference with the model parameters, $x$, using the observed data, $\tilde{Y}$.²

Equation (11) is rather simplistic in that it assumes a priori that the error residuals follow a stationary AR(1) process. This assumption might not be particularly realistic for real-world studies. Various authors have therefore proposed alternative formulations of the likelihood function to extend applicability to situations where the error residuals are non-Gaussian with varying degrees of skewness and kurtosis (Schoups and Vrugt, 2010; Smith et al., 2010; Evin et al., 2013; Scharnagl et al., 2015). Latent variables can also be used to augment likelihood functions and take better consideration of forcing data and model structural error (Kavetski et al., 2006a; Vrugt et al., 2008a; Renard et al., 2011). For systems with generative (negative) feedbacks, the error in the initial states poses no harm as its effect on the system simulation rapidly diminishes when time advances. One can therefore take advantage of a spin-up period to remove sensitivity of the modeling results (and error residuals) to state value initialization.

The process of investigating phenomena, acquiring new information through experimentation and data collection, and refining existing theory and knowledge through Bayesian analysis has many elements in common with the scientific method. This framework, graphically illustrated in Fig. 2, is adopted in many branches of the earth sciences, and seeks to elucidate the rules that govern the natural world.

Once the prior distribution and likelihood function have been defined, what is left in Bayesian analysis is to summarize the posterior distribution, for example by the mean, the variance, or percentiles of individual parameters and/or nuisance variables. Unfortunately, most dynamic system models are highly nonlinear, and this task cannot be carried out by analytical means nor by analytical approximation. Confidence intervals construed from a classical first-order approximation can then only provide an approximate estimate of the posterior distribution. What is more, the target is assumed to be multivariate Gaussian ($\ell_2$-norm type likelihood function), a restrictive assumption. I therefore resort to Monte Carlo (MC) simulation methods to generate a sample of the posterior distribution.

In a previous paper, we have introduced the DiffeRential Evolution Adaptive Metropolis (DREAM) algorithm (Vrugt et al., 2008a, 2009a). This multi-chain Markov chain Monte Carlo (MCMC) simulation algorithm automatically tunes the scale and orientation of the proposal distribution en route to the target distribution, and exhibits excellent sampling efficiencies on complex, high-dimensional, and multi-modal target distributions. DREAM is an adaptation of the Shuffled Complex Evolution Metropolis (Vrugt et al., 2003) algorithm and has the advantage of maintaining detailed balance and ergodicity. Benchmark experiments [e.g. Vrugt et al. (2008a, 2009a); Laloy and Vrugt (2012); Laloy et al. (2013); Linde and Vrugt (2013); Lochbühler et al. (2014); Laloy et al. (2015)] have shown that DREAM is superior to other adaptive MCMC sampling approaches, and in high-dimensional search/variable spaces even provides better solutions than commonly used optimization algorithms.

¹ If homoscedasticity is expected and the variance of the error residuals, $s^2 = \frac{1}{n-1}\sum_{t=1}^{n} e_t(x)^2$, is taken as sufficient statistic for $\hat{\sigma}^2$, then one can show that the likelihood function simplifies to $L(x|\tilde{Y}) \propto \left[\sum_{t=1}^{n} e_t(x)^2\right]^{-n/2}$.

² A nuisance variable is a random variable that is fundamental to the probabilistic model, but that is not of particular interest itself.
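To make the likelihood formulations above concrete, the short MATLAB sketch below evaluates the Gaussian log-likelihood of Equation (8) and the AR(1) variant of Equation (11) for a given vector of error residuals. It is an illustration only and not part of the DREAM toolbox; the function name loglik_example and its argument names are assumptions.

```matlab
function [ell_iid, ell_ar1] = loglik_example(e, sigma, phi)
% Illustrative evaluation of Equations (8) and (11) for an n-vector of
% error residuals e, measurement error standard deviations sigma
% (scalar or n-vector) and first-order autoregressive coefficient phi.
n = numel(e); e = e(:); sigma = sigma(:) .* ones(n, 1);
% Equation (8): uncorrelated, Gaussian error residuals
ell_iid = -n/2*log(2*pi) - sum(log(sigma)) - 1/2*sum((e./sigma).^2);
% Equation (11): stationary AR(1) error residuals with zero mean (c = 0)
ell_ar1 = -n/2*log(2*pi) + 1/2*log(1 - phi^2) ...
    - 1/2*(1 - phi^2)*(e(1)/sigma(1))^2 - sum(log(sigma)) ...
    - 1/2*sum(((e(2:n) - phi*e(1:n-1))./sigma(2:n)).^2);
end
```

For a homoscedastic error model one would simply pass a scalar value of sigma.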

Fig. 2. The iterative research cycle for a soil-tree-atmosphere-continuum (STAC). The initial hypothesis is that this system can be described accurately with a coupled soil-tree porous media model which simulates, using PDEs, processes such as infiltration, soil evaporation, variably saturated soil water flow and storage, root water uptake, xylem water storage and sapflux, and leaf transpiration. Measurements of spatially distributed soil moisture and matric head, sapflux, and tree trunk potential are used for model calibration and evaluation. The model-data comparison step reveals a systematic deviation in the early afternoon and night time hours between the observed (black circles) and simulated (solid red line) sapflux data. It has proven to be very difficult to pinpoint this epistemic error to a specific component of the model. Ad-hoc decisions on model improvement therefore usually prevail. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

In just a few years, the DREAM algorithm has found widespread application and use in numerous different fields, including (among others) atmospheric chemistry (Partridge et al., 2011, 2012), biogeosciences (Scharnagl et al., 2010; Braakhekke et al., 2013; Ahrens and Reichstein, 2014; Dumont et al., 2014; Starrfelt and Kaste, 2014), biology (Coelho et al., 2011; Zaoli et al., 2014), chemistry (Owejan et al., 2012; Tarasevich et al., 2013; DeCaluwe et al., 2014; Gentsch et al., 2014), ecohydrology (Dekker et al., 2010), ecology (Barthel et al., 2011; Gentsch et al., 2014; Iizumi et al., 2014; Zilliox and Goselin, 2014), economics and quantitative finance (Bauwens et al., 2011; Lise et al., 2012; Lise, 2013), epidemiology (Mari et al., 2011; Rinaldo et al., 2012; Leventhal et al., 2013), geophysics (Bikowski et al., 2012; Linde and Vrugt, 2013; Laloy et al., 2012; Carbajal et al., 2014; Lochbühler et al., 2014, 2015), geostatistics (Minasny et al., 2011; Sun et al., 2013), hydrogeophysics (Hinnell et al., 2011), hydrogeology (Keating et al., 2010; Laloy et al., 2013; Malama et al., 2013), hydrology (Vrugt et al., 2008a, 2009a; Shafii et al., 2014), physics (Dura et al., 2011; Horowitz et al., 2012; Toyli et al., 2012; Kirby et al., 2013; Yale et al., 2013; Krayer et al., 2014), psychology (Turner and Sederberg, 2012), soil hydrology (Wöhling and Vrugt, 2011), and transportation engineering (Kow et al., 2012). Many of these publications have used the MATLAB toolbox of DREAM, which has been developed and written by the author of this paper, and shared with many individuals worldwide. Yet, the toolbox of DREAM has never been released formally through a software publication documenting how to use the code for Bayesian inference and posterior exploration.

In this paper, I review the basic theory of Markov chain Monte Carlo (MCMC) simulation, provide MATLAB scripts of some commonly used posterior sampling methods, and introduce a MATLAB toolbox of the DREAM algorithm. This MATLAB toolbox provides scientists and engineers with a comprehensive set of capabilities for application of the DREAM algorithm to Bayesian inference and posterior exploration. The DREAM toolbox implements multi-core computing (if the user desires) and includes tools for convergence analysis of the sampled chain trajectories and post-processing of the results. Recent extensions of the toolbox are described as well, and include (among others) built-in functionalities that enable the use of informal likelihood functions (Beven and Binley, 1992; Beven and Freer, 2001), summary statistics (Gupta et al., 2008), approximate Bayesian computation (Nott et al., 2012; Sadegh and Vrugt, 2013, 2014), diagnostic model evaluation (Vrugt and Sadegh, 2013), and the limits of acceptability framework (Beven, 2006; Beven and Binley, 2014). These developments are in part a response to the emerging paradigm of model diagnostics using summary statistics of system behavior. Recent work has shown that such an approach provides better guidance on model malfunctioning and related issues than the conventional residual-based paradigm (Sadegh et al., 2015b; Vrugt, submitted for publication). The main capabilities of the DREAM toolbox are demonstrated using seven different case studies involving (for instance) bimodality, high-dimensionality, summary statistics, bounded parameter spaces, dynamic simulation models, formal/informal likelihood functions, diagnostic model evaluation, data assimilation, Bayesian model averaging, distributed computation, informative/noninformative prior distributions, and limits of acceptability. These example studies are easy to run and adapt and serve as templates for other inference problems.

The present contribution follows papers by others in the same journal on the implementation of DREAM in high-level statistical languages such as R (Joseph and Guillaume, 2014) as well as general-purpose languages such as Fortran (Lu et al., 2014). Other unpublished versions of DREAM include codes in C (http://people.sc.fsu.edu/~jburkardt/c_src/dream/dream.html) and Python (https://pypi.python.org/pypi/multichain_mcmc/0.2.2). These different codes give potential users the option to choose their preferred language, yet these translations are based on source code supplied by the author several years ago and have limited functionalities compared to the MATLAB package described herein. The present code differs from its earlier versions in that it contains a suite of new options and new methodological developments (Vrugt and Sadegh, 2013; Sadegh and Vrugt, 2014; Vrugt, 2015a, submitted for publication).

The remainder of this paper is organized as follows. Section 2 reviews the basic theory of Monte Carlo sampling and MCMC simulation, and provides a MATLAB code of the Random Walk Metropolis algorithm. This is followed in Section 3 with a brief discussion of adaptive single and multi-chain MCMC methods. Here, I provide a source code of the basic DREAM algorithm. This source code has few options available to the user and Section 4 therefore introduces all the elements of the MATLAB toolbox of DREAM. This section is especially concerned with the input and output arguments of the DREAM program and the various functionalities, capabilities, and options available to the user. Section 5 of this paper illustrates the practical application of the DREAM toolbox to seven different case studies. These examples involve a wide variety of problem features, and illustrate some of the main capabilities of the MATLAB toolbox. In Section 6, I then discuss a few of the functionalities of the DREAM code not demonstrated explicitly in the present paper. Examples include Bayesian model selection using a new and robust integration method for inference of the marginal likelihood, $p(\tilde{Y})$ (Volpi et al., 2015), the use of diagnostic Bayes to help detect epistemic errors (Vrugt, submitted for publication), and the joint treatment of parameter, model input (forcing) and output (calibration/evaluation) data uncertainty. In the penultimate section of this paper, I discuss relatives of the DREAM algorithm including DREAM(ZS), DREAM(D) (Vrugt et al., 2011), DREAM(ABC) (Sadegh and Vrugt, 2014), and MT-DREAM(ZS) (Laloy and Vrugt, 2012), and describe briefly how their implementation in MATLAB differs from the present toolbox. Finally, Section 8 concludes this paper with a summary of the work presented herein.

2. Posterior exploration

A key task in Bayesian inference is to summarize the posterior distribution. When this task cannot be carried out by analytical means nor by analytical approximation, Monte Carlo simulation methods can be used to generate a sample from the posterior distribution. The desired summary of the posterior distribution is then obtained from the sample. The posterior distribution, also referred to as the target or limiting distribution, is often high dimensional. A large number of iterative methods have been developed to generate samples from the posterior distribution. All these methods rely in some way on Monte Carlo simulation. The next sections discuss several different posterior sampling methods.

2.1. Monte Carlo simulation

Monte Carlo methods are a broad class of computational algorithms that use repeated random sampling to approximate some multivariate integral. The simplest Monte Carlo method involves random sampling of the prior distribution. This method is known to be rather inefficient, which I can illustrate with a simple example. Let's consider a circle with unit radius in a square of size $x \in [-2,2]^2$. The circle (posterior distribution) has an area of $\pi$ and makes up $\pi/16 \approx 0.196$ of the prior distribution. I can now use Monte Carlo simulation to estimate the value of $\pi$. I do so by randomly sampling N = 10,000 values of $x$ from the prior distribution. The M samples of $x$ that fall within the circle are posterior solutions and are indicated with the plus symbol in Fig. 3. Samples that fall outside the circle are rejected and printed with a dot. The value of $\pi$ can now be estimated using $\pi \approx 16M/N$, which in this numerical experiment with N = 10,000 samples equates to 3.0912.

Fig. 3. Example target distribution: a circle with unit radius (in black) centered at the origin. The Monte Carlo samples are coded in dots (rejected) and plusses (accepted). The number of accepted samples can be used to estimate the value of $\pi$, here $\approx$ 3.0912.

The target distribution is relatively simple to sample in the present example. It should be evident however that uniform random sampling will not be particularly efficient if the hypercube of the prior distribution is much larger. Indeed, the chance that a random sample of $x$ falls within the unit circle decreases rapidly (quadratically) with increasing size of the prior distribution. If a much higher dimensional sample were considered then rejection sampling would quickly need many millions of Monte Carlo samples to delineate reasonably the posterior distribution and obtain an accurate value of $\pi$. What is more, in the present example all solutions within the circle have an equal density. If this were not the case then many accepted samples are required to approximate closely the distribution of the probability mass within the posterior distribution. Indeed, methods such as the generalized likelihood uncertainty estimation (GLUE) that rely on uniform sampling can produce questionable results if the target distribution is somewhat complex and/or comprises only a relatively small part of the prior distribution (Vrugt, 2015a). In summary, standard Monte Carlo simulation methods are computationally inefficient for anything but very low dimensional problems.

This example is rather simple but conveys what to expect when using simple Monte Carlo simulation methods to approximate complex and high-dimensional posterior distributions. I therefore resort to Markov chain Monte Carlo simulation to explore the posterior target distribution.

2.2. Markov chain Monte Carlo simulation

The basis of MCMC simulation is a Markov chain that generates a random walk through the search space and successively visits solutions with stable frequencies stemming from a stationary distribution, $\pi(\cdot)$.³ To explore the target distribution, $\pi(\cdot)$, an MCMC algorithm generates trial moves from the current state of the Markov chain $x_{t-1}$ to a new state $x_p$. The earliest MCMC approach is the random walk Metropolis (RWM) algorithm introduced by Metropolis et al. (1953). This scheme is constructed to maintain detailed balance with respect to $\pi(\cdot)$ at each step in the chain. If $p(x_{t-1})$ ($p(x_p)$) denotes the probability to find the system in state $x_{t-1}$ ($x_p$) and $q(x_{t-1} \rightarrow x_p)$ ($q(x_p \rightarrow x_{t-1})$) is the conditional probability to perform a trial move from $x_{t-1}$ to $x_p$ ($x_p$ to $x_{t-1}$), then the probability $p_{\mathrm{acc}}(x_{t-1} \rightarrow x_p)$ to accept the trial move from $x_{t-1}$ to $x_p$ is related to $p_{\mathrm{acc}}(x_p \rightarrow x_{t-1})$ according to

$$p(x_{t-1})\,q(x_{t-1} \rightarrow x_p)\,p_{\mathrm{acc}}(x_{t-1} \rightarrow x_p) = p(x_p)\,q(x_p \rightarrow x_{t-1})\,p_{\mathrm{acc}}(x_p \rightarrow x_{t-1}). \qquad (12)$$

If a symmetric jumping distribution is used, that is $q(x_{t-1} \rightarrow x_p) = q(x_p \rightarrow x_{t-1})$, then it follows that

$$\frac{p_{\mathrm{acc}}(x_{t-1} \rightarrow x_p)}{p_{\mathrm{acc}}(x_p \rightarrow x_{t-1})} = \frac{p(x_p)}{p(x_{t-1})}. \qquad (13)$$

This equation does not yet fix the acceptance probability. Metropolis et al. (1953) made the following choice

$$p_{\mathrm{acc}}(x_{t-1} \rightarrow x_p) = \min\left[1, \frac{p(x_p)}{p(x_{t-1})}\right], \qquad (14)$$

to determine whether to accept a trial move or not. This selection rule has become the basic building block of many existing MCMC algorithms. Hastings (1970) extended Equation (14) to the more general case of non-symmetrical jumping distributions

$$p_{\mathrm{acc}}(x_{t-1} \rightarrow x_p) = \min\left[1, \frac{p(x_p)\,q(x_p \rightarrow x_{t-1})}{p(x_{t-1})\,q(x_{t-1} \rightarrow x_p)}\right], \qquad (15)$$

in which the forward ($x_{t-1}$ to $x_p$) and backward ($x_p$ to $x_{t-1}$) jumps do not have equal probability, $q(x_{t-1} \rightarrow x_p) \neq q(x_p \rightarrow x_{t-1})$. This generalization is known as the Metropolis-Hastings (MH) algorithm and broadens significantly the type of proposal distribution that can be used for posterior inference.

The core of the RWM algorithm can be coded in just a few lines (see Algorithm 1) and requires only a jumping distribution, a function to generate uniform random numbers, and a function to calculate the probability density of each proposal. Note, for the time being I conveniently assume the use of a noninformative prior distribution. This simplifies the Metropolis acceptance probability to the ratio of the densities of the proposal and the current state of the chain. The use of an informative prior distribution will be considered at a later stage.

³ This notation for the target distribution has nothing to do with the value of $\pi = 3.1415\ldots$ subject to inference in Fig. 3.

ALGORITHM 1. MATLAB function script of the Random Walk Metropolis (RWM) algorithm. Notation matches variable names used in the main text. Based on input arguments prior, pdf, T and d, the RWM algorithm creates a Markov chain, x, and corresponding densities, p_x. prior() is an anonymous function that draws N samples from a d-variate prior distribution. This function generates the initial state of the Markov chain. pdf() is another anonymous function that computes the density of the target distribution for a given vector of parameter values, x. Input arguments T and d signify the number of samples of the Markov chain and the dimensionality of the parameter space, respectively. Built-in functions of MATLAB are highlighted with a low dash. The function handle q(C,d) is used to draw samples from a d-variate normal distribution, mvnrnd(), with zero mean and covariance matrix, C. rand draws a value from a standard uniform distribution on the open interval (0,1), min() returns the smallest of two different scalars, zeros() creates a zero vector (matrix), eye() computes the d × d identity matrix, sqrt() calculates the square root, and nan() fills each entry of a vector (matrix) with not-a-number.
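The listing of Algorithm 1 is a figure in the original article and is not reproduced in this text. The sketch below is a reconstruction consistent with the caption above, not the verbatim toolbox code; the default proposal covariance (2.38/√d)² I_d is an assumption.

```matlab
function [x, p_x] = rwm(prior, pdf, T, d)
% Random Walk Metropolis sketch consistent with the caption of Algorithm 1.
q = @(C, d) mvnrnd(zeros(1, d), C);          % jumping distribution
C = (2.38 / sqrt(d))^2 * eye(d);             % assumed proposal covariance
x = nan(T, d); p_x = nan(T, 1);              % chain and corresponding densities
x(1, 1:d) = prior(1, d); p_x(1) = pdf(x(1, 1:d));   % initial state of the chain
for t = 2:T
    xp = x(t-1, 1:d) + q(C, d);              % candidate point
    p_xp = pdf(xp);                          % density of the candidate point
    p_acc = min(1, p_xp / p_x(t-1));         % acceptance probability, Eq. (14)
    if p_acc > rand                          % accept and move to xp ...
        x(t, 1:d) = xp; p_x(t) = p_xp;
    else                                     % ... or remain at the current state
        x(t, 1:d) = x(t-1, 1:d); p_x(t) = p_x(t-1);
    end
end
end
```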

In words, assume that the points $\{x_0,\ldots,x_{t-1}\}$ have already been sampled; the RWM algorithm then proceeds as follows. First, a candidate point $x_p$ is sampled from a proposal distribution $q$ that depends on the present location, $x_{t-1}$, and is symmetric, $q(x_{t-1},x_p) = q(x_p,x_{t-1})$. Next, the candidate point is either accepted or rejected using the Metropolis acceptance probability (Equation (14)). Finally, if the proposal is accepted the chain moves to $x_p$; otherwise the chain remains at its current location $x_{t-1}$.

Fig. 4. (A) Bivariate scatter plot of the RWM-derived posterior samples. The green, black and blue contour lines depict the true 68, 90 and 95% uncertainty intervals of the target distribution, respectively. (B, C) Trace plots of the sampled values of $x_1$ (top) and $x_2$ (bottom). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Repeated application of these three steps results in a Markov chain which, under certain regularity conditions, has a unique stationary distribution with posterior probability density function, $\pi(\cdot)$. In practice, this means that if one looks at the values of $x$ sufficiently far from the arbitrary initial value, that is, after a burn-in period, the successively generated states of the chain will be distributed according to $\pi(\cdot)$, the posterior probability distribution of $x$. Burn-in is required to allow the chain to explore the search space and reach its stationary regime.

Fig. 4 illustrates the outcome of the RWM algorithm for a simple d = 2-dimensional multivariate normal target distribution with correlated dimensions. This target distribution is specified as an anonymous function (a function not stored as a program file) in MATLAB

pdf = @(x) mvnpdf(x, [0 0], [1 0.8; 0.8 1]);   (16)

where the @ operator creates the handle, and the parentheses contain the actual function itself. This anonymous function accepts a single input x, and implicitly returns a single output, a vector (or scalar) of posterior density values with the same number of rows as x. The chain is initialized by sampling from $\mathcal{U}_2[-10,10]$, where $\mathcal{U}_d(a,b)$ denotes the d-variate uniform distribution with lower and upper bounds a and b, respectively, and thus

prior = @(N, d) unifrnd(-10, 10, N, d);   (17)

The left graph presents a scatter plot of the bivariate posterior samples using a total of T = 50,000 function evaluations and a burn-in of 50%. The contours depict the 68, 90, and 95% uncertainty intervals of the target distribution. The right graph displays a plot of the generation number against the value of parameters $x_1$ and $x_2$ at each iteration. This is also called a traceplot.

Perhaps not surprisingly, the bivariate samples of the RWM algorithm nicely approximate the target distribution. The acceptance rate of 23% is somewhat lower than considered optimal in theory but certainly higher than derived from Monte Carlo simulation. The posterior mean and covariance are in excellent agreement with their values of the target distribution (not shown).

This simple example just serves to demonstrate the ability of RWM to approximate the posterior target distribution. The relative ease of implementation of RWM and its theoretical underpinning have led to widespread application and use in Bayesian inference. However, the efficiency of the RWM algorithm is determined by the choice of the proposal distribution, $q(\cdot)$, used to create trial moves (transitions) in the Markov chain. When the proposal distribution is too wide, too many candidate points are rejected, and therefore the chain will not mix efficiently and converge only slowly to the target distribution. On the other hand, when the proposal distribution is too narrow, nearly all candidate points are accepted, but the distance moved is so small that it will take a prohibitively large number of updates before the sampler has converged to the target distribution. The choice of the proposal distribution is therefore crucial and determines the practical applicability of MCMC simulation in many fields of study (Owen and Tribble, 2005).

3. Automatic tuning of proposal distribution

In the past decade, a variety of different approaches have been proposed to increase the efficiency of MCMC simulation and enhance the original RWM and MH algorithms. These approaches can be grouped into single and multiple chain methods.

3.1. Single-chain methods

The most common adaptive single chain methods are the adaptive proposal (AP) (Haario et al., 1999), adaptive Metropolis (AM) (Haario et al., 2001) and delayed rejection adaptive Metropolis (DRAM) algorithm (Haario et al., 2006), respectively. These methods work with a single trajectory, and continuously adapt the covariance, $\Sigma$, of a Gaussian proposal distribution, $q_t(x_{t-1},\cdot) = \mathcal{N}_d(x_{t-1}, s_d\Sigma)$, using the accepted samples of the chain, $\Sigma = \mathrm{cov}(x_0,\ldots,x_{t-1}) + \varphi I_d$. The variable $s_d$ represents a scaling factor (scalar) that depends only on the dimensionality $d$ of the problem, $I_d$ signifies the $d$-dimensional identity matrix, and $\varphi = 10^{-6}$ is a small scalar that prevents the sample covariance matrix from becoming singular. As a basic choice, the scaling factor is chosen to be $s_d = 2.38^2/d$, which is optimal for Gaussian target and proposal distributions (Gelman et al., 1996; Roberts et al., 1997) and should give an acceptance rate close to 0.44 for d = 1, 0.28 for d = 5 and 0.23 for large d. A MATLAB code of the AM algorithm is given below (see Algorithm 2).

ALGORITHM 2. Basic MATLAB code of the adaptive Metropolis (AM) algorithm. This code is similar to that of the RWM algorithm in Algorithm 1, but the d × d covariance matrix, C, of the proposal distribution, q(), is adapted using the samples stored in the Markov chain. Built-in functions of MATLAB are highlighted with a low dash. mod() signifies the modulo operation, and cov() computes the covariance matrix of the chain samples, x.
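As with Algorithm 1, the listing itself is not reproduced here. The sketch below follows the caption of Algorithm 2; the adaptation interval of 10 generations and the 10⁻⁶ jitter added to the sample covariance are assumptions rather than documented toolbox defaults.

```matlab
function [x, p_x] = am(prior, pdf, T, d)
% Adaptive Metropolis sketch consistent with the caption of Algorithm 2:
% the covariance of the Gaussian proposal is re-estimated from the chain.
s_d = 2.38^2 / d; phi = 1e-6;                % scaling factor and jitter
C = s_d * eye(d);                            % initial proposal covariance
x = nan(T, d); p_x = nan(T, 1);
x(1, 1:d) = prior(1, d); p_x(1) = pdf(x(1, 1:d));
for t = 2:T
    if mod(t, 10) == 0                       % adapt covariance every 10 steps
        C = s_d * cov(x(1:t-1, 1:d)) + s_d * phi * eye(d);
    end
    xp = x(t-1, 1:d) + mvnrnd(zeros(1, d), C);   % candidate point
    p_xp = pdf(xp);
    if min(1, p_xp / p_x(t-1)) > rand        % Metropolis acceptance rule
        x(t, 1:d) = xp; p_x(t) = p_xp;
    else
        x(t, 1:d) = x(t-1, 1:d); p_x(t) = p_x(t-1);
    end
end
end
```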

Single-site updating of x (Haario et al., 2005) is possible to increase the efficiency of AM for high-dimensional problems (large d). In addition, for the special case of hierarchical Bayesian inference of hydrologic models, Kuczera et al. (2010) proposed to tune $\Sigma$ using a limited-memory multi-block pre-sampling step, prior to a classical single block Metropolis run.

Another viable adaptation strategy is to keep the covariance matrix fixed (identity matrix) and to update during burn-in the scaling factor, $s_d$, until a desired acceptance rate is achieved. This approach differs somewhat from the AM algorithm but is easy to implement (see Algorithm 3).

ALGORITHM 3. Metropolis algorithm with adaptation of the scaling factor, $s_d$, rather than the covariance matrix. The scaling factor is updated after each 25 successive generations to reach a desired acceptance rate between 20 and 30%. The multipliers of 0.8 and 1.2 are by no means generic values and should be determined through trial-and-error. Note, adaptation is restricted to the first half of the Markov chain to ensure reversibility of the last 50% of the samples.
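A sketch consistent with the caption of Algorithm 3 is given below; it is an illustration under the stated settings (25-generation adaptation interval, 0.8/1.2 multipliers, adaptation during the first half of the chain only), not the verbatim code.

```matlab
function [x, p_x] = am_sd(prior, pdf, T, d)
% Metropolis sketch with adaptation of the scaling factor (jump rate)
% rather than the covariance matrix, consistent with Algorithm 3.
s_d = 2.38^2 / d; C = eye(d);                % jump rate and fixed covariance
x = nan(T, d); p_x = nan(T, 1); n_acc = 0;
x(1, 1:d) = prior(1, d); p_x(1) = pdf(x(1, 1:d));
for t = 2:T
    xp = x(t-1, 1:d) + mvnrnd(zeros(1, d), s_d * C);   % candidate point
    p_xp = pdf(xp);
    if min(1, p_xp / p_x(t-1)) > rand
        x(t, 1:d) = xp; p_x(t) = p_xp; n_acc = n_acc + 1;
    else
        x(t, 1:d) = x(t-1, 1:d); p_x(t) = p_x(t-1);
    end
    if mod(t, 25) == 0 && t <= T/2           % adapt only during first half
        if n_acc/25 < 0.20, s_d = 0.8 * s_d; end   % too few acceptances
        if n_acc/25 > 0.30, s_d = 1.2 * s_d; end   % too many acceptances
        n_acc = 0;                           % reset acceptance counter
    end
end
end
```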

Whether a specific adaptation scheme of the scaling factor (also called jump rate) works well in practice depends on the properties of the target distribution. Some tuning is hence required to achieve adequate results. Practical experience suggests that covariance matrix adaptation (AM) is preferred over scaling factor adaptation. The proposals created with the AM algorithm will more rapidly behave as the target distribution. The use of a multivariate normal proposal distribution with adaptive covariance matrix or jump rate works well for Gaussian-shaped target distributions, but does not converge properly for multimodal distributions with long tails, possibly infinite first and second moments (as demonstrated in the next section). Experience further suggests that single chain methods are unable to traverse efficiently complex multi-dimensional parameter spaces with multiple different regions of attraction and numerous local optima. The use of an overly dispersed proposal distribution can help to avoid premature convergence, but with a very low acceptance rate in return. With a single chain it is also particularly difficult to judge when convergence has been achieved. Even the most powerful diagnostics that compare the sample moments of the first and second half of the chain cannot guarantee that the target distribution has been sampled. Indeed, the sample moments of both parts of the chain might be identical but the chain is stuck in a local optimum of the posterior surface or traverses consistently only a portion of the target distribution (Gelman and Shirley, 2009). In fact, single chain methods suffer many similar problems as local optimizers and cannot guarantee that the full parameter space has been explored adequately in pursuit of the target distribution.

3.2. Multi-chain methods: DE-MC

Multiple chain methods use different trajectories running in parallel to explore the posterior target distribution. The use of multiple chains has several desirable advantages, particularly when dealing with complex posterior distributions involving long tails, correlated parameters, multi-modality, and numerous local optima (Gilks et al., 1994; Liu et al., 2000; ter Braak, 2006; ter Braak and Vrugt, 2008; Vrugt et al., 2009a; Radu et al., 2009). The use of multiple chains offers a robust protection against premature convergence, and opens up the use of a wide arsenal of statistical measures to test whether convergence to a limiting distribution has been achieved (Gelman and Rubin, 1992). One popular multi-chain method that has found widespread application and use in hydrology is the Shuffled Complex Evolution Metropolis algorithm (SCEM-UA, Vrugt et al., 2003). Although the proposal adaptation of SCEM-UA violates Markovian properties, numerical benchmark experiments on a diverse set of multi-variate target distributions have shown that the method is efficient and close to exact. The difference between the limiting distribution of SCEM-UA and the true target distribution is negligible in most reasonable cases and applications. The SCEM-UA method can be made an exact sampler if the multi-chain adaptation of the covariance matrix is restricted to the burn-in period only. In a fashion similar to the AP (Haario et al., 1999) and AM algorithm, the method then derives an efficient Gaussian proposal distribution for the standard Metropolis algorithm. Nevertheless, I do not consider the SCEM-UA algorithm herein.

ter Braak (2006) proposed a simple adaptive RWM algorithm called Differential Evolution Markov chain (DE-MC). DE-MC uses differential evolution as genetic algorithm for population evolution with a Metropolis selection rule to decide whether candidate points should replace their parents or not. In DE-MC, N different Markov chains are run simultaneously in parallel. If the state of a single chain is given by the d-vector $x$, then at each generation $t-1$ the N chains in DE-MC define a population, $X$, which corresponds to an N × d matrix, with each chain as a row. Multivariate proposals, $X_p$, are then generated on the fly from the collection of chains, $X = \{x_{t-1}^1,\ldots,x_{t-1}^N\}$, using differential evolution (Storn and Price, 1997; Price et al., 2005)

$$X_p^i = X^i + \gamma_d\left(X^a - X^b\right) + \zeta_d, \qquad a \neq b \neq i, \qquad (18)$$

where $\gamma$ denotes the jump rate, $a$ and $b$ are integer values drawn without replacement from $\{1,\ldots,i-1,i+1,\ldots,N\}$, and $\zeta \sim \mathcal{N}_d(0,c^*)$ is drawn from a normal distribution with small standard deviation, say $c^* = 10^{-6}$. By accepting each proposal with Metropolis probability

$$p_{\mathrm{acc}}\left(X^i \rightarrow X_p^i\right) = \min\left[1, \frac{p\left(X_p^i\right)}{p\left(X^i\right)}\right], \qquad (19)$$

a Markov chain is obtained, the stationary or limiting distribution of which is the posterior distribution. The proof of this is given in ter Braak and Vrugt (2008). Thus, if $p_{\mathrm{acc}}(X^i \rightarrow X_p^i)$ is larger than some uniform label drawn from $\mathcal{U}(0,1)$ then the candidate point is accepted and the $i$th chain moves to the new position, that is $x_t^i = X_p^i$, otherwise $x_t^i = x_{t-1}^i$.

Because the joint pdf of the N chains factorizes to $p(x^1|\cdot) \times \cdots \times p(x^N|\cdot)$, the states $x^1,\ldots,x^N$ of the individual chains are independent at any generation after DE-MC has become independent of its initial value. After this burn-in period, the convergence of a DE-MC run can thus be monitored with the $\hat{R}$-statistic of Gelman and Rubin (1992). If the initial population is drawn from the prior distribution, then DE-MC translates this sample into a posterior population. From the guidelines for $s_d$ in RWM, the optimal choice of the jump rate is $\gamma = 2.38/\sqrt{2d}$. With a 10% probability the value of $\gamma = 1$, or $p(\gamma{=}1) = 0.1$, to allow for mode-jumping (ter Braak, 2006; ter Braak and Vrugt, 2008; Vrugt et al., 2008a, 2009a), which is a significant strength of DE-MC as will be shown later. If the posterior distribution consists of disconnected modes with in-between regions of low probability, covariance based MCMC methods will be slow to converge as transitions between probability regions will be infrequent.

The DE-MC method can be coded in MATLAB in about 20 lines (Algorithm 4), and solves an important practical problem in RWM, namely that of choosing an appropriate scale and orientation for the jumping distribution. Earlier approaches such as (parallel) adaptive direction sampling (Gilks et al., 1994; Roberts and Gilks, 1994; Gilks and Roberts, 1996) solved the orientation problem but not the scale problem. Based on input arguments prior, pdf, N, T, and d, defined by the user, de_mc returns a sample from the posterior distribution. prior is an anonymous function that draws N samples from a d-variate prior distribution, and similarly pdf is a function handle which computes the posterior density of a proposal (candidate point).

To demonstrate the advantages of DE-MC over single chain methods, please consider Fig. 5, which presents histograms of the posterior samples derived from AM (left plot) and DE-MC (right plot) for a simple univariate target distribution consisting of a mixture of two normal distributions

$$p(x) = \tfrac{1}{6}\,\varphi(-8,1) + \tfrac{5}{6}\,\varphi(10,1), \qquad (20)$$

where $\varphi(a,b)$ denotes the probability density function (pdf) of a normal distribution with mean $a$ and standard deviation $b$. The target distribution is displayed with a solid black line.

ALGORITHM 4. MATLAB code of the differential evolution-Markov chain (DE-MC) algorithm. Variable use is consistent with symbols used in the main text. Based on input arguments prior, pdf, N, T and d, the DE-MC algorithm evolves N different trajectories simultaneously to produce a sample of the posterior target distribution. Jumps in each chain are computed from the remaining N-1 chains. The output arguments x and p_x store the sampled Markov chain trajectories and corresponding density values, respectively. Built-in functions are highlighted with a low dash. randsample() draws (with replacement, 'true') the value of the jump rate, gamma, from the vector [gamma_RWM 1] using selection probabilities [0.9 0.1]. randn() returns a row vector with d draws from a standard normal distribution. I refer to introductory textbooks and/or the MATLAB "help" utility for the built-in functions setdiff() and reshape().
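A sketch of the DE-MC scheme consistent with the caption of Algorithm 4 is shown below. It is not the verbatim listing; the sequential update of the population within a generation and the 10⁻⁶ standard deviation of the random perturbation follow the description of Equation (18) in the main text.

```matlab
function [x, p_x] = de_mc(prior, pdf, N, T, d)
% Sketch of DE-MC consistent with Algorithm 4: N chains evolve in
% parallel and jumps are derived from the positions of two other chains.
gamma_RWM = 2.38 / sqrt(2 * d);              % default jump rate (Section 3.2)
x = nan(T, d, N); p_x = nan(T, N);           % chain trajectories and densities
X = prior(N, d); p_X = nan(N, 1);            % initial population (N x d)
for i = 1:N, p_X(i) = pdf(X(i, 1:d)); end
x(1, 1:d, 1:N) = reshape(X', 1, d, N); p_x(1, 1:N) = p_X';
for t = 2:T
    for i = 1:N
        R = setdiff(1:N, i); ab = R(randperm(N - 1, 2));       % chains a and b
        gamma = randsample([gamma_RWM 1], 1, true, [0.9 0.1]); % mode-jumping
        Xp = X(i, 1:d) + gamma * (X(ab(1), 1:d) - X(ab(2), 1:d)) ...
            + 1e-6 * randn(1, d);            % proposal of Equation (18)
        p_Xp = pdf(Xp);
        if min(1, p_Xp / p_X(i)) > rand      % Metropolis rule, Equation (19)
            X(i, 1:d) = Xp; p_X(i) = p_Xp;
        end
    end
    x(t, 1:d, 1:N) = reshape(X', 1, d, N); p_x(t, 1:N) = p_X';
end
end
```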

In MATLAB language, this target distribution is equivalent to

pdf = @(x) 1/6 * normpdf(x, -8, 1) + 5/6 * normpdf(x, 10, 1);   (21)

The initial state of the Markov chain(s) is sampled from $\mathcal{U}[-20,20]$ using

prior = @(N, d) unifrnd(-20, 20, N, d);   (22)

The AM algorithm produces a spurious approximation of the bimodal target distribution. The variance (width) of the proposal distribution is insufficient to enable the chain to adequately explore both modes of the target distribution. A simple remedy to this problem is to increase the (default) initial variance of the univariate normal proposal distribution. This would allow the AM sampler to take much larger steps and jump directly between both modes, but at the expense of a drastic reduction in the acceptance rate and search efficiency. Indeed, besides the occasional successful jumps many other proposals will overshoot the target distribution, receive a nearly zero density, and consequently be rejected. This rather simple univariate example illustrates the dilemma of RWM: how to determine an appropriate scale and orientation of the proposal distribution. Fortunately, the histogram of the posterior samples derived with the DE-MC algorithm matches perfectly the mixture distribution. Periodic use of $\gamma = 1$ enables the N = 10 different Markov chains of DE-MC to transition directly between the two disconnected posterior modes (e.g. ter Braak and Vrugt (2008); Vrugt et al. (2008a); Laloy and Vrugt (2012)) and rapidly converge to the exact target distribution.
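Assuming the de_mc sketch given after Algorithm 4, a run on this mixture target could look as follows (N = 10 chains, T = 5,000 generations; the 50% burn-in is an assumption):

```matlab
pdf   = @(x) 1/6 * normpdf(x, -8, 1) + 5/6 * normpdf(x, 10, 1);  % Eq. (21)
prior = @(N, d) unifrnd(-20, 20, N, d);                          % Eq. (22)
[x, p_x] = de_mc(prior, pdf, 10, 5000, 1);      % N = 10, T = 5000, d = 1
post = x(2501:end, 1, :);                       % discard 50% burn-in
histogram(post(:), 'Normalization', 'pdf');     % compare with Eq. (20)
xx = linspace(-20, 20, 1000); hold on; plot(xx, pdf(xx'), 'k');  % true target
```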

Fig. 5. Histogram of the posterior distribution derived from the (A) AM (single chain), and (B) DE-MC (multi-chain) samplers. The solid black line displays the pdf of the true mixture target distribution.

The initial states of the DE-MC chains should be distributed over the parameter space so that both modes can be found. What is more, the use of N trajectories allows for a much more robust assessment of convergence.

In previous work (Vrugt et al., 2008a, 2009a) we have shown that the efficiency of DE-MC can be enhanced, sometimes dramatically, by using adaptive randomized subspace sampling, multiple chain pairs for proposal creation, and explicit consideration of aberrant trajectories. This method, entitled DiffeRential Evolution Adaptive Metropolis (DREAM), maintains detailed balance and ergodicity and has been shown to exhibit excellent performance on a wide range of problems involving nonlinearity, high-dimensionality, and multimodality. In these and other papers [e.g. Laloy and Vrugt (2012)] benchmark experiments have shown that DREAM outperforms other adaptive MCMC sampling approaches, and, in high-dimensional search/variable spaces, can even provide better solutions than commonly used optimization algorithms.

3.3. Multi-chain methods: DREAM

The DREAM algorithm has its roots within DE-MC but uses subspace sampling and outlier chain correction to speed up convergence to the target distribution. Subspace sampling is implemented in DREAM by only updating randomly selected dimensions of $x$ each time a proposal is generated. If $A$ is a subset of $d^*$ dimensions of the original parameter space, $\mathbb{R}^{d^*} \subseteq \mathbb{R}^d$, then a jump, $dX^i$, in the $i$th chain, $i = \{1,\ldots,N\}$, at iteration $t = \{2,\ldots,T\}$ is calculated from the collection of chains, $X = \{x_{t-1}^1,\ldots,x_{t-1}^N\}$, using differential evolution (Storn and Price, 1997; Price et al., 2005)

$$dX_A^i = \zeta_{d^*} + \left(1_{d^*} + \lambda_{d^*}\right)\gamma(\delta,d^*)\sum_{j=1}^{\delta}\left(X_A^{a_j} - X_A^{b_j}\right), \qquad dX_{\neq A}^i = 0, \qquad (23)$$

where

$$\gamma(\delta,d^*) = \frac{2.38}{\sqrt{2\delta d^*}}$$

is the jump rate, $\delta$ denotes the number of chain pairs used to generate the jump, and $a$ and $b$ are vectors consisting of $\delta$ integers drawn without replacement from $\{1,\ldots,i-1,i+1,\ldots,N\}$. The default value of $\delta = 3$ results, in practice, in one-third of the proposals being created with $\delta = 1$, another third with $\delta = 2$, and the remaining third using $\delta = 3$. The values of $\lambda$ and $\zeta$ are sampled independently from $\mathcal{U}_{d^*}(-c,c)$ and $\mathcal{N}_{d^*}(0,c^*)$, respectively, the multivariate uniform and normal distribution with, typically, $c = 0.1$ and $c^*$ small compared to the width of the target distribution, $c^* = 10^{-6}$ say. Compared to DE-MC, $p(\gamma{=}1) = 0.2$ to enhance the probability of jumps between disconnected modes of the target distribution. The candidate point of chain $i$ at iteration $t$ then becomes

$$X_p^i = X^i + dX^i, \qquad (24)$$

and the Metropolis ratio of Equation (19) is used to determine whether to accept this proposal or not. If $p_{\mathrm{acc}}(X^i \rightarrow X_p^i)$ is larger than a value drawn from $\mathcal{U}(0,1)$, the candidate point is accepted and the $i$th chain moves to the new position, that is $x_t^i = X_p^i$, otherwise $x_t^i = x_{t-1}^i$. The default equation for $\gamma$ should, for Gaussian and Student target distributions, result in optimal acceptance rates close to 0.44 for d = 1, 0.28 for d = 5, and 0.23 for large d (please refer to Section 7.8.4 of Roberts and Casella (2004) for a cautionary note on these reference acceptance rates).

The $d^*$ members of the subset $A$ are sampled from the entries $\{1,\ldots,d\}$ (without replacement) and define the dimensions of the parameter space to be sampled by the proposal. The subspace spanned by $A$ is construed in DREAM with the help of a crossover operator. This genetic operator is applied before each proposal is created and works as follows. First, a crossover value, $cr$, is sampled from a geometric sequence of $n_{CR}$ different crossover probabilities, $CR = \{\tfrac{1}{n_{CR}},\tfrac{2}{n_{CR}},\ldots,1\}$, using the discrete multinomial distribution, $\mathcal{M}(CR,p_{CR})$, on $CR$ with selection probabilities $p_{CR}$. Then, a $d$-vector $z = \{z_1,\ldots,z_d\}$ is drawn from a standard multivariate uniform distribution, $z \sim \mathcal{U}_d(0,1)$. All those values $j$ which satisfy $z_j \leq cr$ are stored in the subset $A$ and span the subspace of the proposal that will be sampled using Equation (23). If $A$ is empty, one dimension of $\{1,\ldots,d\}$ will be sampled at random to avoid the jump vector having zero length.

The use of a vector of crossover probabilities enables single-site Metropolis ($A$ has one element), Metropolis-within-Gibbs ($A$ has one or more elements) and regular Metropolis sampling ($A$ has d elements), and constantly introduces new directions in the parameter space that chains can take outside the subspace spanned by their current positions. What is more, the use of subspace sampling allows using N < d in DREAM, an important advantage over DE-MC which requires N = 2d chains to be run in parallel (ter Braak, 2006). Subspace sampling as implemented in DREAM adds one extra algorithmic variable, $n_{CR}$, to the algorithm. The default setting of $n_{CR} = 3$ has shown to work well in practice, but larger values of this algorithmic variable might seem appropriate for high-dimensional target distributions, say d > 50, to preserve the frequency of low-dimensional jumps. Note, more intelligent subspace selection methods can be devised for target distributions involving many highly correlated parameters. These parameters should be sampled jointly in a group, otherwise too many of the (subspace) proposals will be rejected and the search can stagnate. This topic will be explored in future work.

To enhance search efficiency the selection probability of each crossover value, stored in the $n_{CR}$-vector $p_{CR}$, is tuned adaptively during burn-in by maximizing the distance traveled by each of the N chains. This adaptation is described in detail in Vrugt et al. (2008a, 2009a), and a numerical implementation of this approach appears in the MATLAB code of DREAM below.

The core of the DREAM algorithm can be written in about 30 lines of code (see Algorithm 5). The input arguments are similar to those used by DE-MC and include the function handles prior and pdf and the values of N, T, and d. The MATLAB code of Algorithm 5 implements the different steps of the DREAM algorithm as detailed in the main text. Structure, format and notation match those of the DE-MC code, and variable names correspond with the symbols used in Equations (23) and (24). Indents and comments are used to enhance readability and to convey the main intent of each line of code. Note that this code does not monitor convergence of the sampled chain trajectories, an important hiatus addressed in the MATLAB toolbox of DREAM discussed in the next sections. The computational efficiency of this code can be improved considerably, for instance through vectorization of the inner for loop, but this will negatively affect readability.

The source code of DREAM in Algorithm 5 differs in several important ways from the basic code of the DE-MC algorithm presented in Section 3.2. These added features increase the length of the code by about 20 lines, but enhance significantly the convergence speed of the sampled chains to a limiting distribution. For reasons of simplicity, a separate function is used for one of these features, the correction of outlier chains.

ALGORITHM 5. MATLAB code of the differential evolution adaptive Metropolis (DREAM) algorithm. The script is similar to that of DE-MC but uses (a) more than one chain pair to create proposals, (b) subspace sampling, and (c) outlier chain detection, to enhance convergence to the posterior target distribution. Built-in functions are highlighted with a low dash. The jump vector, dX(i,1:d) of the ith chain contains the desired information about the scale and orientation of the proposal distribution and is derived from the remaining N-1 chains. deal() assigns default values to the algorithmic variables of DREAM, std() returns the standard deviation of each column of X, and sum() computes the sum of the columns A of the chain pairs a and b. The function check() is a remedy for outlier chains.
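The listing of Algorithm 5 is likewise not reproduced in this text. The sketch below is a deliberately simplified illustration of the proposal generation of Equations (23) and (24): the crossover selection probabilities are kept fixed (no adaptation of p_CR) and the outlier correction of check() is omitted, both of which are present in the actual toolbox code. The function name dream_sketch and its internal variable names are assumptions.

```matlab
function [x, p_x] = dream_sketch(prior, pdf, N, T, d)
% Simplified sketch of the DREAM core (cf. Algorithm 5). Crossover
% selection probabilities are fixed and outlier correction is omitted.
% Requires N >= 2*delta + 1 so that enough chain pairs are available.
[delta, c, c_star, nCR, p_g1] = deal(3, 0.1, 1e-6, 3, 0.2);
CR = (1:nCR) / nCR; pCR = ones(1, nCR) / nCR;     % crossover values and probs
x = nan(T, d, N); p_x = nan(T, N);
X = prior(N, d); p_X = nan(N, 1);                 % initial population
for i = 1:N, p_X(i) = pdf(X(i, 1:d)); end
x(1, 1:d, 1:N) = reshape(X', 1, d, N); p_x(1, 1:N) = p_X';
for t = 2:T
    [~, draw] = sort(rand(N - 1, N));             % random chain-pair indices
    for i = 1:N
        D = randsample(1:delta, 1);               % number of chain pairs
        R = setdiff(1:N, i);
        a = R(draw(1:D, i)); b = R(draw(D+1:2*D, i));
        cr = randsample(CR, 1, true, pCR);        % crossover value
        A = find(rand(1, d) < cr);                % subspace of the jump
        d_star = numel(A);
        if d_star == 0, A = randperm(d, 1); d_star = 1; end
        g = randsample([2.38/sqrt(2*D*d_star) 1], 1, true, [1-p_g1 p_g1]);
        lambda = unifrnd(-c, c, 1, d_star);       % lambda ~ U(-c, c)
        zeta = c_star * randn(1, d_star);         % zeta ~ N(0, c_star)
        dX = zeros(1, d);
        dX(A) = zeta + (1 + lambda) * g .* sum(X(a, A) - X(b, A), 1);  % Eq. (23)
        Xp = X(i, 1:d) + dX;                      % candidate point, Eq. (24)
        p_Xp = pdf(Xp);
        if min(1, p_Xp / p_X(i)) > rand           % Metropolis rule, Eq. (19)
            X(i, 1:d) = Xp; p_X(i) = p_Xp;
        end
    end
    x(t, 1:d, 1:N) = reshape(X', 1, d, N); p_x(t, 1:N) = p_X';
end
end
```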

This function is called check (line 44 of Algorithm 5) and patches a critical vulnerability of multi-chain MCMC methods such as SCEM-UA, DE-MC, and DREAM (Vrugt et al., 2003; ter Braak and Vrugt, 2008; Vrugt et al., 2008a, 2009a). The performance of these methods is impaired if one or more of their sampled chains have become trapped in an unproductive area of the parameter space while in pursuit of the target distribution. The state of these outlier chains will not only contaminate the jumping distribution of Equation (23) and thus slow down the evolution and mixing of the other "good" chains; what is much worse, dissident chains make it impossible to reach convergence to a limiting distribution. For as long as one of the chains samples a disjoint part of the parameter space, the $\hat{R}$-diagnostic of Gelman and Rubin (1992) cannot reach its stipulated threshold of 1.2 required to officially declare convergence.

The problem of outlier chains is well understood and easily demonstrated with an example involving a posterior response surface with one or more local areas of attraction far removed from the target distribution. Chains that populate such local optima can continue to persist indefinitely if the size of their jumps is insufficient to move the chain outside the space spanned by these optima (see Fig. 2 of ter Braak and Vrugt (2008)). Dissident chains will occur most frequently in high-dimensional target distributions, as they require the use of a large N, and in complex posterior response surfaces with many areas of attraction.

The function check is used as a remedy for dissident chains. The mean log density of the samples stored in the second half of each chain is used as proxy for the "fitness" of each trajectory, and these N data points are examined for anomalies using an outlier detection test. Those chains (data points) that have been labeled as outlier will relinquish their dissident state and move to the position of one of the other chains (chosen at random). Details of this procedure can be found in Vrugt et al. (2009a). The MATLAB toolbox of DREAM implements four different outlier detection methods the user can choose from. Details will be presented in the next section.
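The outlier correction itself is easy to sketch. The fragment below shows one possible check() implementation based on the inter-quartile range of the mean log densities; it illustrates the idea described above and is not necessarily one of the four detection methods implemented in the toolbox.

```matlab
function [X, p_X] = check(X, p_X, mean_logp)
% Possible outlier-chain correction: mean_logp is an N-vector with the
% mean log density of the second half of each chain (the chain "fitness").
% Chains whose fitness falls far below the lower quartile are relocated.
% Relocation violates detailed balance and should be restricted to burn-in.
N = size(X, 1);
Q = quantile(mean_logp, [0.25 0.75]);             % quartiles of the fitness
outlier = find(mean_logp(:) < Q(1) - 2 * (Q(2) - Q(1)));
for i = outlier'                                  % each dissident chain ...
    r = randsample(setdiff(1:N, i), 1);           % ... moves to the position
    X(i, :) = X(r, :); p_X(i) = p_X(r);           % of a randomly chosen chain
end
end
```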

Those proficient in statistics, computer coding and numerical computation will be able to personalize this code for their own applications. Yet, for others this code might not suffice as it has very few built-in options and capabilities. To satisfy these potential users, I have therefore developed a MATLAB toolbox for DREAM. This package has many built-in functionalities and is easy to use in practice. The next sections will introduce the various elements of the DREAM package, and use several examples to illustrate how the package can be used to solve a wide variety of Bayesian inference problems involving (among others) simple functions, dynamic simulation models, formal and informal likelihood functions, informative and noninformative prior distributions, limits of acceptability, summary statistics, diagnostic model evaluation, low and high-dimensional parameter spaces, and distributed computing.

Before I proceed to the next section, a few remarks are in order. The code of DREAM listed on the previous page does not adapt the selection probabilities of the individual crossover values, nor does it monitor the convergence of the sampled chain trajectories. These functionalities appear in the toolbox of DREAM. In fact, several different metrics are computed to help diagnose convergence of the sampled chains to a limiting distribution.

The MATLAB code of DREAM listed in Algorithm 5 evolves each of the N chains sequentially. This serial implementation satisfies DREAM's reversibility proof (ter Braak and Vrugt, 2008; Vrugt et al., 2009a), but will not be efficient for CPU-intensive models. We can adapt DREAM to a multi-core implementation in which the N proposals are evaluated simultaneously in parallel using the distributed computing toolbox of MATLAB (Algorithm 6). Numerical experiments with a large and diverse set of test functions have shown that the parallel implementation of DREAM converges to the correct target distribution. I will revisit this topic in Section 7.1 of this paper.

4. MATLAB implementation of DREAM

The basic code of DREAM listed in Algorithm 5 was written in 2006 but many new functionalities and options have been added to the source code in recent years due to continued research developments and to support the needs of a growing group of users. The DREAM code can be executed from the MATLAB prompt by the command

ALGORITHM 6. Distributed implementation of DREAM in MATLAB. This code differs from the standard code of DREAM in that the proposals are evaluated in parallel on different cores using the built-in parfor function of the parallel computing toolbox.

[chain,output,fx] = DREAM(Func_name,DREAMPar,Par_info)

where Func_name (string), DREAMPar (structure array), and the variable Par_info (structure array), etc. are input arguments defined by the user, and chain (matrix), output (structure array) and fx (matrix) are output variables computed by DREAM and returned to the user. To minimize the number of input and output arguments in the DREAM function call and related primary and secondary functions called by this program, I use MATLAB structure arrays and group related variables in one main element using data containers called fields. Two optional input arguments that the user can pass to DREAM are Meas_info and options, and their content and usage will be discussed later.

The DREAM function uses more than twenty other functions to implement its various steps and functionalities and generate samples from the posterior distribution. All these functions are summarized briefly in Appendix A. In the subsequent sections I will discuss the MATLAB implementation of DREAM. This, along with prototype case studies presented herein and template examples listed in RUNDREAM, should help users apply Bayesian inference to their data and models.

4.1. Input argument 1: Func_Name

The variable Func_Name defines the name (enclosed in quotes) of the MATLAB function (.m file) used to calculate the likelihood (or proxy thereof) of each proposal. The use of a m-file rather than an anonymous function (e.g. pdf example) permits DREAM to solve inference problems involving, for example, dynamic simulation models, as they can generally not be written in a single line of code. If Func_name is conveniently assumed to be equivalent to 'model' then the call to this function becomes

Y = model(x)   (25)

where x (input argument) is a 1 × d vector of parameter values, and Y is a return argument whose content is either a likelihood, log-likelihood, or vector of simulated values or summary statistics, respectively. The content of the function model needs to be written by the user, but the syntax and function call is universal. Appendix C provides seven different templates of the function model which are used in the case studies presented in Section 5.

4.2. Input argument 2: DREAMPar

The structure DREAMPar defines the computational settings of DREAM. Table 1 lists the different fields of DREAMPar, their default values, and the corresponding variable names used in the mathematical description of DREAM in Section 3.3. The field names of DREAMPar match exactly the symbols (letters) used in the (mathematical) description of DREAM in Equations (23) and (24). The values of the fields d, N, T depend on the dimensionality of the target distribution. These variables are problem dependent and should hence be specified by the user. Default settings are assumed in Table 1 for the remaining fields of DREAMPar, with the exception of GLUE and lik whose values will be discussed in the next two paragraphs. To create proposals with Equation (23), the value of N should at least be equivalent to 2δ + 1, or N = 7 for the default of δ = 3. This number of chains is somewhat excessive for low dimensional problems involving just a few parameters. One could therefore conveniently set δ = 1 for small d. The default settings of DREAMPar are easy to modify by the user by declaring individual fields and their respective value.

The DREAM algorithm can be used to sample efficiently the behavioral solution space of informal likelihood functions used within GLUE (Beven and Binley, 1992; Beven and Freer, 2001). In fact, as will be shown later, DREAM can also solve efficiently the limits of acceptability framework of Beven (2006). For now it suffices to say that the field GLUE of structure DREAMPar stores the value of the shaping factor used within the (pseudo)likelihood functions of GLUE. I will revisit GLUE and informal Bayesian inference at various places in the remainder of this paper.

The content of the field lik of DREAMPar defines the choice of likelihood function used to compare the output of the function model with the available calibration data. Table 2 lists the different options for lik the user can select from. The choice of likelihood function depends in large part on the content of the return argument Y of the function model, which is either a (log)-likelihood, a vector with simulated values, or a vector with summary statistics, respectively. If the return argument, Y, of function model is equivalent to a likelihood or log-likelihood then field lik of DREAMPar should be set equivalent to 1 or 2, respectively.
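To make the calling convention of Section 4.1 and the DREAMPar fields above concrete, the snippet below sketches a complete three-argument setup for a hypothetical three-parameter problem. All field values and the name 'my_model' are assumptions chosen for this illustration; 'my_model' stands for a user-written m-file that returns a log-likelihood.

% Minimal sketch of a DREAM problem setup (assumed values, for illustration)
Func_name = 'my_model';            % user-written function that returns a log-likelihood

DREAMPar.d   = 3;                  % dimensionality of the target distribution
DREAMPar.N   = 8;                  % number of Markov chains
DREAMPar.T   = 5000;               % number of generations
DREAMPar.lik = 2;                  % return argument of 'my_model' is a log-likelihood

Par_info.initial = 'latin';        % draw initial chain states with Latin hypercube sampling
Par_info.min     = [-10 -10 -10];  % lower bounds of the parameters
Par_info.max     = [ 10  10  10];  % upper bounds of the parameters

[chain,output,fx] = DREAM(Func_name,DREAMPar,Par_info);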

Table 1 Main algorithmic variables of DREAM: mathematical symbols, corresponding fields of DREAMPar and default settings. These default settings have been determined in previous work and shown to work well for a range of target distributions.

Symbol | Description | Field DREAMPar | Default

Problem dependent
d | Number of parameters | d | 1
N | Number of Markov chains | N | 2δ+1
T | Number of generations | T | 1
L(x|Ỹ) | (Log)-likelihood function | lik | [1,2], [11–17], [21–23], [31–34]

Default variables^a
n_CR | Number of crossover values | nCR | 3
δ | Number chain pairs proposal | delta | 3
λ^b | Randomness | lambda | 0.1
ζ^c | Ergodicity | zeta | 10^-12
p(γ=1) | Probability unit jump rate | p_unit_gamma | 0.2
 | Outlier detection test | outlier | 'iqr'
K | Thinning rate | thinning | 1
 | Adapt crossover probabilities? | adapt_pCR | 'yes'
G^d | Shaping factor | GLUE | > 0
β0^e | Scaling factor jump rate | beta0 | 1

a  A change to the default values of DREAM will affect the convergence (acceptance) rate.
b  λ_d ~ U_d(−DREAMPar.lambda, DREAMPar.lambda).
c  ζ_d ~ N_d(0, DREAMPar.zeta).
d  For pseudo-likelihood functions of GLUE (Beven and Binley, 1992).
e  Multiplier of the jump rate, γ = β0·γ, default β0 = 1.

This choice is appropriate for problems involving some prescribed multivariate probability distribution whose density can be evaluated directly. Examples of such functions are presented in the first two case studies of Section 5. Options 1 and 2 also enable users to evaluate their own preferred likelihood function directly in the model script. In principle, these two options are therefore sufficient to apply the DREAM code to a large suite of problems. Nevertheless, to simplify implementation and use, the DREAM package contains about 15 different built-in likelihood functions.

Likelihood functions 11–17 and 31–34 are appropriate if the output of model consists of a vector of simulated values of some variable(s) of interest. Some of these likelihood functions (e.g., 12–14, 16, 17) contain extraneous variables (nuisance coefficients) whose values need to be inferred jointly with the model parameters, x. Practical examples of joint inference are provided in the RUNDREAM script and Appendix B. Likelihood functions 21 and 22 are appropriate if the return argument Y of model consists of one or more summary statistics of the simulated data. These two likelihood functions allow use of approximate Bayesian computation and diagnostic model evaluation (Vrugt and Sadegh, 2013; Sadegh and Vrugt, 2014; Vrugt, submitted for publication). Finally, likelihood function 23 enables use of the limits of acceptability framework (Beven, 2006; Beven and Binley, 2014; Vrugt, 2015a). Section 5 presents the application of different likelihood functions and provides templates for their use. Appendix B provides the mathematical formulation of each of the likelihood functions listed in Table 2. Note, likelihoods 22 and 23 use a modified Metropolis selection rule to accept proposals or not. This issue is revisited in Section 7 of this paper.

The generalized likelihood (GL) function of Schoups and Vrugt (2010) (14) is most advanced in that it can account explicitly for bias, correlation, non-stationarity, and nonnormality of the error residuals through the use of nuisance coefficients. In a recent paper, Scharnagl et al. (2015) have introduced a skewed Student likelihood function (17) as modification to the GL formulation (14) to describe adequately heavy-tailed error residual distributions. Whittle's likelihood (Whittle, 1953) (15) is a frequency-based approximation of the Gaussian likelihood and can be interpreted as a minimum distance estimate of the distance between the parametric spectral density and the (nonparametric) periodogram. It also minimizes the asymptotic Kullback–Leibler divergence and, for autoregressive processes, provides asymptotically consistent estimates for Gaussian and non-Gaussian data, even in the presence of long-range dependence (Montanari and Toth, 2007). Likelihood function 16, also referred to as Laplace or double exponential distribution, differs from all other likelihood functions in that it assumes an ℓ1-norm of the error residuals. This approach weights all error residuals equally and the posterior inference should therefore not be as sensitive to outliers.

Likelihood functions 11–17 and 31–34 represent a different school of thought. Formulations 11–17 are derived from first-order statistical principles about the expected probabilistic properties of the error residuals, E(x) = Ỹ − Y(x). These functions are also referred to as formal likelihood functions. For example, if the error residuals are assumed to be independent (uncorrelated) and normally distributed then the likelihood function is simply equivalent to formulation 11 or 12, depending on whether the measurement data error is integrated out (11) or explicitly considered (12).

The second class of likelihood functions, 31–34, avoids over-conditioning of the likelihood surface in the presence of epistemic and other error sources, and their mathematical formulation is guided by trial-and-error, expert knowledge, and commonly used goodness-of-fit criteria (Beven and Binley, 1992; Freer et al., 1996; Beven and Freer, 2001). These informal likelihood functions enable users to implement the GLUE methodology of Beven and Binley (1992).

Table 2
Built-in likelihood functions of the DREAM package. The value of field lik of DREAMPar depends on the content of the return argument Y from the function model; [1] likelihood, [2] log-likelihood, [11–17] vector of simulated values, [21–23] vector of summary statistics, and [31–34] vector of simulated values. The mathematical formulation of each likelihood function is given in Appendix B.

lik | Description | References

User-free likelihood functions
1 | Likelihood, L(x|Ỹ) | e.g. Equation (7)
2 | Log-likelihood, L(x|Ỹ) | e.g. Equations (8), (10) and (11)

Formal likelihood functions
11 | Gaussian likelihood: measurement error integrated out | Thiemann et al. (2001); see also footnote 1
12^a | Gaussian likelihood: homos/heteroscedastic data error | Equation (7)
13^a,b | Gaussian likelihood: with AR-1 model of error residuals | Equations (10) and (11)
14^c | Generalized likelihood function | Schoups and Vrugt (2010a)
15 | Whittle's likelihood (spectral analysis) | Whittle (1953)
16^a | Laplacian likelihood: homos/heteroscedastic data error | Laplace (1774)
17^c | Skewed Student likelihood function | Scharnagl et al. (2015)

ABC – diagnostic model evaluation
21^d | Noisy ABC: Gaussian likelihood | Turner and Sederberg (2012)
22^d,e | ABC: Boxcar likelihood | Sadegh and Vrugt (2014)

GLUE – limits of acceptability
23^e | Limits of acceptability | Vrugt (2015a)

GLUE – informal likelihood functions
31^f | Inverse error variance with shaping factor | Beven and Binley (1992)
32^f | Nash and Sutcliffe efficiency with shaping factor | Freer et al. (1996)
33^f | Exponential transform error variance with shaping factor | Freer et al. (1996)
34^f | Sum of absolute error residuals | Beven and Binley (1992)

a  Measurement data error in field Sigma of Meas_info or inferred jointly with the parameters.
b  First-order autoregressive coefficient is a nuisance variable.
c  Nuisance variables for model bias, correlation, non-stationarity and nonnormality of the residuals.
d  Default of ε = 0.025 in field epsilon of options to delineate the behavioral space.
e  Uses a modified Metropolis selection rule to accept proposals or not.
f  Shaping factor, G, defined in field GLUE of DREAMPar (default G = 10).
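For the user-free options at the top of Table 2 (lik = 1 or 2), the model script computes the (log-)likelihood itself and returns it to DREAM. The template below sketches such a function for a Gaussian log-likelihood; the data file, the fixed error standard deviation and the helper forward_model are assumptions made for this illustration, not part of the toolbox.

function loglik = my_model(x)
% Sketch of a user-written model script for DREAMPar.lik = 2: returns the
% log-likelihood of proposal x directly (assumed example, not a toolbox file)
persistent Y_obs sigma
if isempty(Y_obs)
    Y_obs = load('calibration_data.txt');  % hypothetical n x 1 record of observations
    sigma = 0.1;                           % assumed homoscedastic measurement error
end
Y_sim = forward_model(x);                  % hypothetical simulator returning an n x 1 vector
res   = Y_obs - Y_sim;                     % error residuals
n     = numel(res);
loglik = -n/2*log(2*pi) - n*log(sigma) - 0.5*sum(res.^2)/sigma^2;  % Gaussian log-likelihood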

The use of DREAM enhances, sometimes dramatically, the computational efficiency of GLUE (Blasone et al., 2008).

The field thinning of DREAMPar allows the user to specify the thinning rate of each Markov chain to reduce memory requirements for high-dimensional target distributions. For instance, for a d = 100 dimensional target distribution with N = 100 and T = 10,000, MATLAB would need a staggering 100-million bytes of memory to store all the samples of the joint chains. Thinning applies to all the sampled chains, and stores only every Kth visited state. This option reduces memory storage with a factor of T/K, and also decreases the autocorrelation between successively stored chain samples. A default value of K = 1 (no thinning) is assumed in DREAM. Note, large values for K (K >> 10) can be rather wasteful as many visited states are not used in the computation of the posterior moments and/or plotting of the posterior parameter distributions.

Multi-chain methods can suffer convergence problems if one or more of the sampled chains have become stuck in a local area of attraction while in pursuit of the target distribution. This fallacy has been addressed in the basic source code of DREAM listed in Algorithm 5, where the function check was used to detect and resolve aberrant trajectories. Dissident chains are more likely to appear if the target distribution is high-dimensional and the posterior response surface is non-smooth with many local optima and regions of attraction. These non-ideal properties are often the consequence of poor model numerics (Clark and Kavetski, 2010; Schoups et al., 2010) and hinder convergence of MCMC simulation methods to the target distribution. The field outlier of DREAMPar lists (in quotes) the name of the outlier detection test that is used to expose dissident chains. Options available to the user include the 'iqr' (Upton and Cook, 1996), 'grubbs' (Grubbs, 1950), 'peirce' (Peirce, 1852), and 'chauvenet' (Chauvenet, 1960) methods. These nonparametric methods diagnose dissident chains by comparing the mean log-density values of each of the N sampled trajectories. The premise of this comparison is that the states visited by an outlier chain should have a much lower average density than their counterparts sampling the target distribution. Those chains diagnosed as outlier will give up their present position in the parameter space in lieu of the state of one of the other N − 1 chains, chosen at random. This correction step violates detailed balance (irreversible transition) but is necessary in some cases to formally reach convergence to a limiting distribution. Numerical experiments have shown that the default option of DREAMPar.outlier = 'iqr' works well in practice. Note, the problem of outlier chains would be resolved if proposals are created from past states of the chains as used in DREAM(ZS), DREAM(DZS) and MT-DREAM(ZS). Dissident chains can then sample their own position and jump directly to the mode of the target if γ = 1 (ter Braak and Vrugt, 2008; Laloy and Vrugt, 2012). We will revisit this issue in Section 7 of this paper.

The field adapt_pCR of DREAMPar defines whether the crossover probabilities, pCR, are adaptively tuned during a DREAM run so as to maximize the normalized Euclidean distance between two successive chain states. The default setting of 'yes' can be set to 'no' and thus switched off by the user. The selection probabilities are tuned only during burn-in of the chains so as not to destroy reversibility of the sampled chains.

The default choice of the jump rate in DREAM is derived from the value of s_d = 2.38²/d in the RWM algorithm. This setting should lead to optimal acceptance rates for Gaussian and Student target distributions, but might not yield adequate acceptance rates for real-world studies involving complex multivariate posterior parameter distributions. The field beta0 of structure DREAMPar allows the user to increase (decrease) the value of the jump rate, γ = 2.38β0/√(2δd*), thereby improving the mixing of the individual chains. This β0-correction is applied to all sampled proposals, with the exception of the unit jump rate used for mode jumping. Values of β0 ∈ [1/4, 1/2] have been shown to enhance significantly the convergence rate of DREAM for sampling problems involving parameter-rich groundwater and geophysical models (e.g. Laloy et al. (2015)).

4.3. Input argument 3: Par_info

The structure Par_info stores all necessary information about the parameters of the target distribution, for instance their prior uncertainty ranges (for bounded search problems), starting values (initial state of each Markov chain), prior distribution (defines the Metropolis acceptance probability) and boundary handling (what to do if out of the feasible space), respectively. Table 3 lists the different fields of Par_info and summarizes their content, default values and variable types.

The field initial of Par_info specifies with a string enclosed between quotes how to sample the initial state of each of the N chains. Options available to the user include (1) 'uniform', (2) 'latin', (3) 'normal' and (4) 'prior', and they create the initial states of the chains by sampling from (1) a uniform prior distribution, (2) a Latin hypercube (McKay et al., 1979), (3) a multivariate normal distribution, and (4) a user defined prior distribution. The first three options assume the prior distribution to be noninformative (uniform/flat), and consequently the posterior density of each proposal to be directly proportional to its likelihood. On the contrary, if the option 'prior' is used and a non-flat (informative) prior distribution of the parameters is specified by the user, then the density of each proposal becomes equivalent to the product of the (multiplicative) prior density and likelihood derived from the output of model.

Options (1) and (2) require specification of the fields min and max of Par_info. These fields contain in a 1 × d-vector the lower and upper bound values of each of the parameters, respectively. If option (3) 'normal' is used then the fields mu (1 × d-vector) and cov (d × d-matrix) of Par_info should be defined by the user. These fields store the mean and covariance matrix of the multivariate normal distribution.
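As a concrete illustration of these fields, the snippet below sketches a typical Par_info specification for a bounded search problem with Latin hypercube initialization. The parameter ranges and the choice of reflection are assumptions made for this example.

% Sketch of Par_info for a bounded problem (assumed ranges, for illustration)
Par_info.initial       = 'latin';      % initial states from a Latin hypercube
Par_info.min           = [0  0  0.5];  % lower bounds of the d = 3 parameters
Par_info.max           = [1 10 25.0];  % upper bounds of the d = 3 parameters
Par_info.boundhandling = 'reflect';    % reflect proposals that leave the hypercube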

Table 3 DREAM input argument Par_info: Different fields, their default settings and variable types.

Field Par_info | Description | Options | Default | Type

initial | Initial sample | 'uniform'/'latin'/'normal'/'prior' | | String
min | Minimum values | | −∞^d | 1×d-vector
max | Maximum values | | ∞^d | 1×d-vector
boundhandling | Boundary handling | 'reflect'/'bound'/'fold'/'none' | 'none' | String
mu | Mean | 'normal' | | 1×d-vector
cov | Covariance | 'normal' | | d×d-matrix
prior | Prior distribution | | | Cell array^a / function handle^b

a  Multiplicative case: each cell of the d-array contains a different marginal prior pdf.
b  Multivariate case: an anonymous function with the prior pdf is provided by the user.

We will revisit the option 'prior' at the end of this section.

The fields min and max of the structure Par_info serve two purposes. First, they define the feasible parameter space from which the initial state of each of the chains is drawn if 'uniform' random or 'latin' hypercube sampling is used. Second, they can define a bounded search domain for problems involving one or more parameters with known physical/conceptual ranges. This does however require the bound to be actively enforced during chain evolution. Indeed, proposals generated with Equations (23) and (24) can fall outside the hypercube defined by min and max even if the initial states of each chain are well within the feasible search space. The field boundhandling of Par_info provides several options what to do if the parameters are outside their respective ranges. The four different options that are available are (1) 'bound', (2) 'reflect', (3) 'fold', and (4) 'none' (default). These methods are illustrated graphically in Fig. 6 and act on one parameter at a time.

The option 'bound' is most simplistic and simply sets each parameter value that is out of bound equal to its closest bound. The option 'reflect' is somewhat more refined and treats the boundary of the search space as a mirror through which each individual parameter value is reflected backwards into the search space. The reflection step size is simply equivalent to the "amount" of boundary violation. The 'bound' and 'reflect' options are used widely in the optimization literature in algorithms concerned only with finding the minimum (maximum, if appropriate) of a given cost or objective function. Unfortunately, these two proposal correction methods violate detailed balance in the context of MCMC simulation. It is easy to show for both boundary handling methods that the forward (correction step) and backward jump cannot be construed with equal probability. The third option 'fold' treats the parameter space as a continuum representation by simply connecting the upper bound of each dimension to its respective lower bound. This folding approach does not destroy the Markovian properties of the N sampled chains, and is therefore preferred statistically. However, this approach can provide "bad" proposals (reduced acceptance rate) if the posterior distribution is located at the edges of the search domain. Then, the parameters can jump from one side of the search domain to the opposite end.

The option 'bound' is least recommended in practice as it collapses the parameter values to a single point. This not only relinquishes unnecessarily sample diversity but also inflates artificially the solution density (probability mass) at the bound. The loss of chain diversity also causes a-periodicity (proposal and current state are similar for selected dimensions) and distorts convergence to the target distribution. A simple numerical experiment with a truncated normal target distribution will demonstrate the superiority of the folding approach. This results in an exact inference of the target distribution whereas a reflection step overestimates the probability mass at the bound. For most practical applications, a reflection step will provide accurate results unless too many dimensions of the target distribution find their highest density in close vicinity of the bound.

What is left is a discussion of the use of 'prior' as initial sampling distribution of the chains. This option is specifically implemented to enable the use of an informative (non-flat) prior distribution. The user can select among two choices for 'prior', that is the use of a multiplicative prior or a multivariate prior distribution. In the multiplicative case each parameter has its own prior distribution, and the field prior of Par_info should be declared a cell array. Each cell then specifies between quotes the density of the corresponding parameter in the vector x, for example

Par_info.prior = {'normpdf(2,0.1)','tpdf(10)','unifpdf(2,4)'}   (26)

uses a normal distribution with mean of 2 and standard deviation of 0.1 for the first parameter, a Student distribution with 10 degrees of freedom for the second dimension, and a uniform distribution between 2 and 4 for the third and last parameter of the target distribution, respectively. The prior density of some parameter vector is then simply equivalent to the product of the individual densities specified in field prior of Par_info. The user can select from the following list of built-in density functions in MATLAB: beta, chi-square, extreme value, exponential, F, gamma, geometric, generalized extreme value, generalized Pareto, hypergeometric, lognormal, noncentral F, noncentral t, noncentral chi-square, normal (Gaussian), Poisson, Rayleigh, T, uniform, and Weibull density. The function name of each density and corresponding input variables is easily found by typing "help stats" in the MATLAB prompt.

The multiplicative prior assumes the parameters of the prior distribution to be uncorrelated, an assumption that might not be justified for some inference problems. The second option for 'prior' includes the use of a multivariate prior distribution, and declares the field prior of structure Par_info to be an anonymous function, for example

Par_info.prior = @(x,a,b) mvnpdf(x,a,b)   (27)

where mvnpdf(x,a,b) is the d-variate normal distribution, N_d(a,b), evaluated at x and with mean a and covariance matrix b, respectively. The input variables a and b should be specified as separate fields of structure Par_info, for example Par_info.a = zeros(1,d) and Par_info.b = eye(d).
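Putting Equation (27) to work requires the prior mean and covariance to be passed as fields of Par_info. A minimal, self-contained example is given below; the dimensionality d = 3 and the identity covariance are assumptions made for this sketch.

% Sketch: multivariate normal prior for DREAM (values assumed for illustration)
d = 3;
Par_info.initial = 'prior';                 % sample the initial chain states from the prior
Par_info.prior   = @(x,a,b) mvnpdf(x,a,b);  % d-variate normal density N_d(a,b)
Par_info.a       = zeros(1,d);              % prior mean
Par_info.b       = eye(d);                  % prior covariance matrix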

Fig. 6. Different options for parameter treatment in bounded search spaces in the DREAM package. a) Set to bound, b) reflection, and c) folding. The option folding is the only boundary handling approach that maintains detailed balance.

Table 4
Content of (optional) input structure Meas_info. This fourth input argument of DREAM is required if the return argument of model constitutes a vector of simulated values (or summary statistics) of one or more variables.

Field Meas_info | Description | Type
Y | Measurement data | n×1-vector
Sigma | Measurement error | scalar or n×1-vector
S | Summary statistics (ABC) | m×1-vector

The use of a multivariate prior allows the user to take into explicit consideration parameter interdependencies. Options available to the user include the multivariate normal and multivariate t-distribution, respectively.

If the standard built-in densities of MATLAB (univariate and multivariate) are insufficient for a given application then the user is free to contribute their own function for the prior distribution. This subroutine should follow exactly the same format as the standard MATLAB densities, and the name of the function should end with "pdf", for example ownpdf.m. What is more, the user has to supply a second function ending with "rnd" (e.g. ownrnd.m) which returns random samples from the user-defined prior pdf. This function should match exactly the format of standard built-in univariate and multivariate random number generators such as lognrnd and mvnrnd, respectively, and will be used by DREAM to sample the initial states of the N different chains. If this second code, ownrnd.m, is too difficult to write then the user can always choose to draw the initial states of the chains in DREAM from a noninformative prior, using for instance Latin hypercube sampling. That is, Par_info.initial = 'latin' with min and max of structure Par_info defining the sampling ranges of the parameters. This alternative approach might be favored in practice anyway as it will allow DREAM to explore more thoroughly, at least in the first generations, the parameter space outside the prior pdf. Unless, of course, the parameter space defined by min and max is limited to the area of ownpdf.m with high prior density.

4.4. (Optional) input argument 4: Meas_info

The fourth input argument Meas_info of the DREAM function is mandatory if the output of model constitutes a vector of simulated values or summary metrics of one or more entities of interest. Table 4 describes the different fields of Meas_info, their content and type. The field Y of Meas_info stores the n × 1 observations of the calibration data, Ỹ, against which the output, Y, of model is compared. The n-vector of error residuals, E(x) = Ỹ − Y(x), is then translated into a log-likelihood value using one of the formal (11–17) or informal (31–34) likelihood functions listed in Table 2 and defined by the user in field lik of structure DREAMPar. The field S of Meas_info stores the m × 1 summary statistics of the data, and is mandatory input for likelihood functions 21, 22, 23 used for ABC, diagnostic model evaluation, and limits of acceptability. Examples of these approaches are given in the case studies section of this paper. The number of elements of Y and S should match exactly the output of the script model written by the user.

The field Sigma of structure Meas_info stores the measurement error of each entry of the field Y. This data error is necessary input for likelihood functions 12, 13 and 16. A single value for Sigma suffices if homoscedasticity of the data error is expected, otherwise n values need to be declared which specify the heteroscedastic error of the observations of Y.

In case the measurement error of the data Y is unknown, three different approaches can be implemented. The first option is to select likelihood function 11. This function is derived from Equation (7) by integrating over (out) the data measurement error. Field Sigma of Meas_info can then be left blank (empty). The second option uses likelihood function 12, 13, or 16 and estimates the measurement data error along with the model parameters using nuisance variables. The field Sigma of Meas_info should then be used as inline function, for example, Meas_info.Sigma = inline('a + b*Y'), which defines mathematically the relationship between the observed data, Y, and corresponding measurement data error, Sigma. The scalars a and b are nuisance variables and their values append the vector of model parameters, which increases the dimensionality of the target distribution to d + 2. If the initial states of the chains are sampled from a uniform distribution (Par_info.initial = 'uniform') then the ranges of a and b augment the d-vectors of fields min and max. Note, care should be exercised that Sigma > 0 ∀ a,b. The user is free to define the measurement error function, as long as the nuisance variables used in the inline function are in lower caps, and follow the order of the alphabet. The third and last option uses likelihood function 14 (Schoups and Vrugt, 2010) or 17 (Scharnagl et al., 2015). These functions do not use field Sigma (can be left empty) but rather use their own built-in measurement error model. The coefficients of the error models are part of a larger set of nuisance parameters that allow these likelihood functions to adapt to nontraditional error residual distributions. Appendix B details how to use and adapt likelihood functions 11 and 17.
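The example below sketches a Meas_info structure for a calibration problem with a heteroscedastic measurement error model. The data file name and error coefficients are assumptions made for this illustration.

% Sketch of Meas_info (assumed data file and error model, for illustration)
Meas_info.Y     = load('discharge_obs.txt');  % hypothetical n x 1 vector of observations
Meas_info.Sigma = inline('a + b*Y');          % heteroscedastic error model; a and b are
                                              % nuisance variables appended to the parameters
% For a homoscedastic error one could instead set, e.g., Meas_info.Sigma = 0.05;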

4.5. (Optional) input argument 5: options

The structure options is optional and passed as fifth input argument to DREAM. The fields of this structure can activate (among others) file writing, distributed multi-core calculation, storage of the model output simulations, ABC, diagnostic model evaluation, diagnostic Bayes, and the limits of acceptability framework. Table 5 summarizes the different fields of options and their default settings.

Table 5
Content of (optional) input structure options. This fifth input argument of the main DREAM code is required to activate several of its built-in capabilities such as distributed multi-processor calculation, workspace saving, ABC, diagnostic model evaluation, diagnostic Bayes and limits of acceptability.

Field options | Description | Options | Type
parallel | Distributed multi-core calculation? | no/yes | String
IO | If parallel, IO writing of model? | no/yes | String
modout | Store output of model? | no/yes | String
save | Save DREAM workspace? | no/yes | String
restart | Restart run? ('save' required) | no/yes | String
DB | Diagnostic Bayes? | no/yes | String
epsilon | ABC cutoff threshold | | scalar or m×1-vector^a
rho | ABC distance function | | inline function^b
linux | Execute in Linux/unix? | no/yes | String
diagnostics | Within chain convergence diagnostics? | no/yes | String

a  Default setting of options.epsilon = 0.025.
b  Default is inline('abs(Meas_info.S − Y)'), that is, ρ(S(Ỹ), S(Y(x))) = |S(Ỹ) − S(Y(x))|.
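A typical options specification that activates distributed calculation, simulation storage and workspace saving might look as follows. The field values are assumptions chosen for this sketch, and the five-argument call assumes the structures of the earlier examples.

% Sketch of the (optional) options structure (values assumed for illustration)
options.parallel = 'yes';   % evaluate the N proposals on different cores
options.IO       = 'no';    % model is pure MATLAB code; no file writing needed
options.modout   = 'yes';   % store the model simulations of all sampled states
options.save     = 'yes';   % periodically save the DREAM workspace to DREAM.mat
options.restart  = 'no';    % set to 'yes' to reboot/extend a previous trial

[chain,output,fx] = DREAM(Func_name,DREAMPar,Par_info,Meas_info,options);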

Multi-core calculation takes advantage of the MATLAB Parallel Computing Toolbox and evaluates the N different proposals created with Equations (23) and (24) on different processors. Parallel computing is built-in the DREAM code and can be activated automatically if the user sets the field parallel of options equal to 'yes' (default 'no'). Such distributed calculation can significantly reduce the run time of DREAM for CPU-demanding forward models. For simple models that require only a few seconds to run, the time savings of a parallel run are usually negligible due to latency (transport delay) of the hardware and operating system. In fact, for the mixture distribution of Equation (20) multi-core evaluation of the N proposals increases the wall-time of DREAM as compared to sequential calculation.

The field IO (input/output) of options allows the user to communicate to DREAM the desired setup of their distributed computing environment. If file writing is used in model to communicate the parameter values of the DREAM proposal to some external program coded in Fortran or C, then the field IO of options should be set equal to 'yes'. Then, DREAM will automatically create, during initialization, N different copies of the model directory (and underlying folders) to satisfy each individual processor. This method avoids the corruption of model input and output files that would occur if the external program were executed at the same time by different processors working in the same directory. At the end of each DREAM trial, the duplicate directories are removed automatically. This approach to parallelization was used in the HYDRUS-1D case study in Section 5.3. If, on the contrary, the model function involves MATLAB code only, then a common directory suffices for all the different workers as all input and output arguments can be passed directly through shared memory. The field IO of options can then be set to 'no'. The same holds if the model function involves use of shared libraries linked through the built-in MEX-compiler of MATLAB (see Case study 4).

For CPU-intensive forward models it would be desirable to not only store the parameter samples but also keep in memory their corresponding model simulations returned by model and used to calculate the likelihood of each proposal. This avoids having to rerun the model script many times after DREAM has terminated to assess model predictive (simulation) uncertainty. The field modout of options allows the user to store the output of the model script. If simulation output storage is desired then modout should be set equal to 'yes', and the N simulations of X are stored, after each generation, in a binary file "Z.bin". These simulations are then returned to the user as third output argument, fx, of DREAM. If chain thinning is activated (please check Table 1) then this applies to the simulations stored in fx as well, so that the rows of fx match their samples stored in the chains.

To help evaluate the progress of DREAM, it can be useful to periodically store the MATLAB workspace of the main function "DREAM.m" to a file. This binary MATLAB file, "DREAM.mat", is written to the main directory of DREAM if the field save of structure options is set equal to 'yes'. This binary file can then be loaded into the workspace of another MATLAB worker and used to evaluate the DREAM results during its execution. What is more, the "DREAM.mat" file is required if the user wishes to reboot a prematurely aborted DREAM trial or continue with sampling if convergence (e.g. section 4.7) has not been achieved with the assigned computational budget in field T of DREAMPar. A reboot is initiated by setting the field restart of structure options equal to 'yes'. In case of a prematurely terminated DREAM run, rebooting will finalize the computational budget assigned to this trial. If lack of convergence was the culprit, then a restart run will double the number of samples in each chain, or add to the existing chains whatever new number of samples is specified by the user in field T of DREAMPar.

The MATLAB code of DREAM was developed in Windows. For a unix/linux operating system the user should set the field linux of options to 'yes'. The field diagnostics controls the computation of within-chain convergence diagnostics (see section 4.7). The default setting of this field is 'no'. The single chain diagnostics augment the multi-chain R̂-statistic of Gelman and Rubin (1992) and enable a more robust assessment of convergence.

For ABC or diagnostic model evaluation, the fields rho and epsilon of options need to be specified unless their default settings are appropriate. The field rho is an inline function object which specifies the mathematical formulation of the distance function between the simulated and observed summary statistics. In practice, a simple difference operator rho = inline('abs(Meas_info.S − Y)') (default) suffices, where Y (output of model) and field S of Meas_info denote the simulated and observed summary statistics, respectively. The field epsilon of options stores a small positive value (default of 0.025) which is used to truncate the behavioral (posterior) parameter space.

If ABC is used then the user can select two different implementations to solve for the target distribution. The first approach, adopted from Turner and Sederberg (2012), uses likelihood function 21 to transform the distance function between the observed and simulated summary metrics into a probability density that DREAM uses to derive the target distribution. This approach can produce nicely bell-shaped marginal distributions, but does not guarantee that the posterior summary metrics fall within epsilon of their observed values. A more viable and powerful approach was introduced recently by Sadegh and Vrugt (2014) and uses likelihood function 22 with the following modified Metropolis acceptance probability to decide whether to accept proposals or not

p_{\mathrm{acc}}\left( \mathbf{X}^{i} \rightarrow \mathbf{X}_{p}^{i} \right) = \begin{cases} I\left( f(\mathbf{X}_{p}^{i}) \geq f(\mathbf{X}^{i}) \right) & \text{if } f(\mathbf{X}^{i}) < 0 \\ 1 & \text{if } f(\mathbf{X}_{p}^{i}) \geq 0 \end{cases}   (28)

where I(a) is an indicator function that returns one if a is true, and zero otherwise. The mathematical expression of the fitness (likelihood) function 22 is given in Table B1 (in Appendix B). Equation (28) is implemented in an extension of DREAM called DREAM(ABC) and rapidly guides the posterior summary metrics to lie within epsilon of their observed counterparts. Section 5.4 of this paper demonstrates the application of ABC to diagnostic inference using an illustrative case study involving a catchment hydrologic model.

4.6. Output arguments

I now briefly discuss the three output (return) arguments of DREAM, including chain, output and fx. These three variables summarize the results of the DREAM algorithm and are used for convergence assessment, posterior analysis and plotting.

The variable chain is a matrix of size T × (d + 2) × N. The first d columns of chain store the sampled parameter values (state), whereas the subsequent two columns list the associated log-prior and log-likelihood values, respectively. If thinning is applied to each of the Markov chains then the number of rows of chain is equivalent to T/K + 1, where K ≥ 2 denotes the thinning rate. If a non-informative (uniform) prior is used then the values in column d + 1 of chain are all zero and consequently, p(x|Ỹ) ∝ L(x|Ỹ). With an informative prior, the values in column d + 1 are non-zero and the posterior density, p(x|Ỹ) ∝ p(x)L(x|Ỹ).
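For posterior analysis, the N chains in chain are typically pooled after discarding the burn-in. The sketch below shows one way to do this; the 50% burn-in and the variable names are assumptions for this illustration (Section 4.7 explains how to determine the burn-in with the R̂-statistic instead).

% Sketch: pool the N chains into one posterior sample after burn-in (assumed 50%)
[T,C,N] = size(chain);       % T (thinned) samples, C = d+2 columns, N chains
burnin  = floor(T/2);        % assumed burn-in; verify with the R_stat diagnostic
post    = [];
for r = 1:N                  % append the post-burn-in parameter samples of each chain
    post = [post; chain(burnin+1:T,1:C-2,r)];
end
post_mean = mean(post,1);    % posterior mean of the d parameters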

The following MATLAB command

plot(chain(1:end,1,2),'r+')   (29)

creates a traceplot (using red dots) of the first parameter of the second chain. By plotting in the same figure the remaining N − 1 chains (using different colors/symbols), the mixing of the Markov chains can be assessed visually.

The structure output contains important (diagnostic) information about the progress of the DREAM algorithm. The field RunTime (scalar) stores the wall-time (seconds); the fields R_stat (matrix), AR (matrix) and CR (matrix) list, for a given number of generations, the R̂ convergence diagnostic for each individual parameter of the target distribution, the average acceptance rate, and the selection probability of each of the nCR crossover values, respectively; and outlier (vector) contains the index of all outlier chains (often empty). The MATLAB command

output.RunTime   (30)

displays the wall time of DREAM, and the command

plot(output.AR(:,1),output.AR(:,2))   (31)

plots the acceptance rate of proposals (in %) as function of generation number. This plot reveals important information about the performance of the DREAM algorithm but cannot be used to judge when convergence has been achieved (see next section).

Finally, the matrix fx stores the output Y of model. If this return argument constitutes a vector of simulated values (summary metrics) then fx is of size NT × n (NT × m), otherwise fx is a vector of NT × 1 with likelihood or log-likelihood values. If thinning is used then this applies to fx as well and the number of rows of fx becomes equivalent to NT/K + 1; K ≥ 2.

The directory "../postprocessing" (under the main directory) contains a number of different functions that can be used to visualize the different output arguments of DREAM. The script DREAM_postproc can be executed from the MATLAB prompt after the main DREAM function has terminated. Appendix A summarizes briefly the graphical output of the post-processing scripts.

4.7. Convergence diagnostics & burn-in

From MCMC theory, the chains are expected to eventually converge to a stationary distribution, which should be the desired target distribution. But, how do we actually assess that convergence has been achieved in practice, without knowledge of the actual target distribution? One way to check for convergence is to see how well the chains are mixing, or moving around the parameter space. For a properly converged MCMC sampler, the chains should sample, for a sufficiently long period, approximately the same part of the parameter space, and mingle readily and in harmony with one another around some fixed mean value. This can be inspected visually for each dimension of x separately, and used to diagnose convergence informally.

Another proxy for convergence monitoring is the acceptance rate. A value between 15 and 30% is usually indicative of good performance of a MCMC simulation method. Much lower values usually convey that the posterior surface is difficult to traverse in pursuit of the target distribution. A low acceptance rate can have different reasons, for instance poor model numerics, or the presence of multi-modality and local optima. The user can enhance the acceptance rate by declaring a value for β0 < 1 in field beta0 of structure DREAMPar (see Table 1). This multiplier will reduce the jumping distance, dX, in Equation (23) and thus proposals will remain in closer vicinity of the current state of each chain. This should enhance the acceptance rate and mixing of individual chains. Note, the acceptance rate can only diagnose whether a MCMC method such as DREAM is achieving an acceptable performance; it cannot be used to determine when convergence has been achieved.

The MATLAB code of DREAM includes various non-parametric and parametric statistical tests to determine when convergence of the sampled chains to a limiting distribution has been achieved. The most powerful of these convergence tests is the multi-chain R̂-statistic of Gelman and Rubin (1992). This diagnostic compares for each parameter j = {1,…,d} the within-chain variance

W_j = \frac{2}{N(T-2)} \sum_{r=1}^{N} \sum_{i=\lceil T/2 \rceil}^{T} \left( x_{i,j}^{r} - \overline{x}_{j}^{r} \right)^2, \qquad \overline{x}_{j}^{r} = \frac{2}{T-2} \sum_{i=\lceil T/2 \rceil}^{T} x_{i,j}^{r}   (32)

and between-chain variance

\frac{B_j}{T} = \frac{1}{2(N-1)} \sum_{r=1}^{N} \left( \overline{x}_{j}^{r} - \overline{x}_{j} \right)^2, \qquad \overline{x}_{j} = \frac{1}{N} \sum_{r=1}^{N} \overline{x}_{j}^{r}   (33)

using

\widehat{R}_j = \sqrt{ \frac{N+1}{N} \frac{\widehat{\sigma}_{+}^{2}(j)}{W_j} - \frac{T-2}{NT} }   (34)

where T signifies the number of samples in each chain, ⌈·⌉ is the integer rounding operator, and \widehat{\sigma}_{+}^{2}(j) is an estimate of the variance of the jth parameter of the target distribution equivalent to

\widehat{\sigma}_{+}^{2}(j) = \frac{T-2}{T} W_j + \frac{2}{T} B_j   (35)

To officially declare convergence, the value R̂_j ≤ 1.2 for each parameter, j ∈ {1,…,d}; otherwise the value of T should be increased and the chains run longer. As the N different chains are launched from different starting points, the R̂-diagnostic is a relatively robust estimator.

The DREAM code computes automatically during execution the R̂-statistic for each parameter. This statistic is returned to the user in the field R_stat of output. After termination, the following MATLAB command

plot(output.R_stat(1:end,2:DREAMPar.d+1))   (36)

creates a traceplot of the R̂-convergence diagnostic for each of the d parameters of the target distribution. This plot can be used to determine when convergence has been achieved and thus which samples of chain to use for posterior estimation and analysis. The other samples can simply be discarded from the chains as burn-in. An example of how to use the R̂-statistic for convergence analysis will be provided in case study 2 in section 5.2, a 100-dimensional Student distribution.

The DREAM package also includes several within-chain diagnostics but their calculation is optional and depends on the setting of the field diagnostics of structure options. If activated by the user then DREAM computes, at the end of its run, the autocorrelation function, the Geweke (1992), and Raftery and Lewis (1992) diagnostics.
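For readers who wish to reproduce the multi-chain diagnostic outside the toolbox, the sketch below evaluates Equations (32)-(35) for a single parameter from the second half of the sampled chains. The variable names and the choice of parameter index are assumptions made for this illustration.

% Sketch: Gelman-Rubin R-hat of parameter j from chain (Equations (32)-(35))
j  = 1;                                        % parameter index (example)
T  = size(chain,1); N = size(chain,3);
x  = squeeze(chain(floor(T/2)+1:T,j,1:N));     % second half of each chain, T2 x N
T2 = size(x,1);
W  = mean(var(x,0,1));                         % within-chain variance, Eq. (32)
B  = T2*var(mean(x,1));                        % between-chain variance, Eq. (33)
s2 = (T2-1)/T2*W + B/T2;                       % target variance estimate, Eq. (35)
Rhat = sqrt((N+1)/N*s2/W - (T2-1)/(N*T2));     % convergence diagnostic, Eq. (34)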

Fig. 7. (A) Histogram of the posterior distribution derived from DREAM using N = 10 chains and T = 5,000 generations. The solid black line depicts the target distribution. (B) Trace plot. Individual chains are coded with a different color (symbol). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

The autocorrelation function for each parameter j = {1,…,d} is defined as

\rho_{j,k}^{r} = \frac{ \sum_{i=1}^{T-k} \left( x_{i,j}^{r} - \overline{x}_{j}^{r} \right)\left( x_{i+k,j}^{r} - \overline{x}_{j}^{r} \right) }{ \sum_{i=1}^{T} \left( x_{i,j}^{r} - \overline{x}_{j}^{r} \right)^2 }   (37)

and returns the correlation between two samples k iterations apart in the rth chain, r = {1,…,N}. Compared to rejection sampling which, per construction, produces uncorrelated samples, MCMC chain trajectories exhibit autocorrelation as the current state of the chain is derived from its previous state. This correlation is expected to decrease with increasing lag k. The autocorrelation function is a useful proxy to assess sample variability and mixing, but does not convey when convergence has been achieved. A high autocorrelation, say |ρ| > 0.8, at lags, say k ≥ 5, simply demonstrates a rather poor mixing of the individual chains.

The Geweke (1992) diagnostic compares the means of two nonoverlapping parts of the Markov chain using a standard Z-score adjusted for autocorrelation. The Raftery and Lewis (1992) statistic calculates the number of iterations, T, and length of burn-in necessary to satisfy the condition that some posterior quantile of interest, say q, has a probability, p, of lying within the interval [q − r, q + r]. Default values are q = 0.025, p = 0.95, and r = 0.01, respectively. Details of how to compute and interpret these two statistics are found in the cited references.

The three within-chain diagnostics are calculated for each of the N chains and d parameters separately (if options.diagnostics = 'yes') and the results are stored in a file called "DREAM_diagnostics.txt". This file is subsequently printed to the screen in the MATLAB editor after DREAM has terminated its run, unless the user is running in a unix/linux environment (options.linux = 'yes').

Altogether, joint interpretation of the different diagnostics should help assess convergence of the sampled chain trajectories. Of all these metrics, the R̂-statistic provides the best guidance on exactly when convergence has been achieved. This happens as soon as this statistic drops below the critical threshold of 1.2 for all d parameters of the target distribution. Suppose this happens at T* iterations (generations); then the first (T* − 1) samples of each chain are simply discarded as burn-in and the remaining N(T − T*) samples from the joint chains are used for posterior analysis. Note, I always recommend to verify convergence of DREAM by visually inspecting the mixing of the different chain trajectories.

In practice, one has to make sure that a sufficient number of chain samples is available for the inference, otherwise the posterior estimates can be biased. For convenience, I list here the total number of posterior samples, N(T − T*) (in brackets), one would need for a reliable inference with DREAM for a given dimensionality of the target distribution: d = 1 (500); d = 2 (1000); d = 5 (5000); d = 10 (10,000); d = 25 (50,000); d = 50 (200,000); d = 100 (1,000,000); d = 250 (5,000,000). These listed numbers are only a rough guideline, and based on several assumptions such as a reasonable acceptance rate (> 10%) and a not too complicated shape of the posterior distribution. In general, the number of posterior samples required increases with rejection rate and complexity of the target distribution.

4.8. Miscellaneous

The main reason to write this toolbox of DREAM in MATLAB is its relative ease of implementation, use, and graphical display. What is more, the computational complexity of DREAM is rather limited compared to the forward models in script model the code is designed to work with. Indeed, the CPU-time of DREAM is determined in large part by how long it takes to evaluate the density of the target distribution. Relatively little time savings are therefore expected if DREAM were written and executed in a lower level language such as Fortran or C.

The toolbox described herein has been developed for MATLAB 7.10.0.499 (R2010a). The current source code works as well for the most recent MATLAB releases. Those that do not have access to MATLAB can use GNU Octave instead. This is a high-level interpreted language as well, and intended primarily for numerical computations. The Octave language is quite similar to MATLAB so that most programs are easily portable. GNU Octave is open-source and can be downloaded for free from the following link: http://www.gnu.org/software/octave/.

Finally, likelihood options 1 and 2 allow the user to return the density of their own likelihood function (and prior distribution) immediately to the main DREAM program to satisfy the needs of their own specific inference problems and case studies. The same holds for the use of summary statistics. The built-in likelihood functions 21, 22 and 23 allow the use of any type of summary statistic (or combination thereof) the user deems appropriate for their study.

5. Numerical examples

I now demonstrate the application of the MATLAB DREAM package to seven different inference problems. These case studies cover a diverse set of problem features and involve (among others) bimodal and high-dimensional target distributions, summary statistics, dynamic simulation models, formal/informal likelihood functions, diagnostic model evaluation, Bayesian model averaging, limits of acceptability, and informative/noninformative prior parameter distributions.

5.1. Case study I: one-dimensional mixture distribution

I revisit the bimodal target distribution of Equation (20). The modes at −8 and 10 are so far separated that it is notoriously difficult for regular covariance based proposal distributions (AM and RWM) to sample correctly the target distribution. The initial state of the chains is sampled from U[−20, 20]. The following MATLAB script defines the problem setup.
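The setup script itself ships with the toolbox and its listing is not reproduced here; the sketch below only indicates what such a setup could look like. The field values (N, T, the choice of lik) are assumptions for this illustration, and 'mixture' refers to the target-density script of Appendix C.

% Sketch of a possible problem setup for the bimodal mixture target
% (values assumed for illustration; the toolbox ships its own script)
Func_name = 'mixture';            % m-file returning the density of Equation (20)

DREAMPar.d   = 1;                 % one-dimensional target
DREAMPar.N   = 10;                % number of Markov chains
DREAMPar.T   = 5000;              % number of generations
DREAMPar.lik = 1;                 % assumed: 'mixture' returns a (non-log) density

Par_info.initial = 'latin';       % Latin hypercube sample of the initial states
Par_info.min     = -20;           % lower bound of the sampling range
Par_info.max     =  20;           % upper bound of the sampling range

[chain,output,fx] = DREAM(Func_name,DREAMPar,Par_info);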

The initial sample is drawn using Latin hypercube sampling, and the target distribution is defined in the script mixture of Appendix C. Fig. 7 plots a histogram of the posterior samples, and (right) a traceplot of the sampled value of x in each of the Markov chains. The average acceptance rate is about 36.3%.

The sampled distribution is in excellent agreement with the target distribution, in large part due to the ability of DREAM to jump directly from one mode to the other when γ = 1. The traceplot shows periodic moves of all chains between both modes of the target distribution and an excellent mixing of the sampled trajectories. The time each chain spends in each of the two modes of the mixture is consistent with their weight in Equation (20).

5.2. Case study II: 100-dimensional t-distribution

Our second case study involves a 100-dimensional Student distribution with 60 degrees of freedom. The target distribution, defined in the script t_distribution of Appendix C, is centered at the zeroth vector, with all pairwise correlations equivalent to 0.5. The problem setup is given below.

The initial sample is drawn using Latin hypercube sampling, and thinning is applied to each Markov chain to reduce memory storage. Fig. 8 compares histograms of the sampled marginal distribution of dimensions {25, 50, …, 100} with the actual target distribution (black line). The sampled distributions are in excellent agreement with their observed counterparts. The R̂ diagnostic illustrates that about 500,000 function evaluations are required to reach convergence to a stationary distribution. The acceptance rate of 15.9% is close to optimal.

The marginal distributions derived from DREAM closely approximate the true histograms of the 100-dimensional target. In particular, the tails of the sampled distribution are very well represented, with a mean correlation of the d = 100 dimensions of 0.50 and standard deviation of 0.015.

5.3. Case study III: dynamic simulation model

The third case study considers HYDRUS-1D, a variably saturated porous flow model written in Fortran by Simunek et al. (1998). This case study is taken from Scharnagl et al. (2011), and involves the inference of the soil hydraulic parameters θr, θs, α, n, Ks and λ (van Genuchten, 1980) and the lower boundary condition (constant head) using time-series of observed soil water contents in the unsaturated zone. The following MATLAB script defines the problem setup.

The hydrus script is given in Appendix C. The Fortran executable of HYDRUS-1D is called from within MATLAB using the dos command. File writing and reading is used to communicate the parameter values of DREAM to HYDRUS-1D and to load the output of this executable back into MATLAB. The output, Y, of hydrus constitutes a vector of simulated soil moisture values which are compared against their observed values in Meas_info.Y using likelihood function 11.

An explicit prior distribution is used for the soil hydraulic parameters to make sure that their posterior estimates remain in close vicinity of their respective values derived from surrogate soil data using the Rosetta toolbox of hierarchical pedo-transfer functions (Schaap et al., 1998, 2001). The initial state of each chain is sampled from the prior distribution, and boundary handling is applied to enforce the parameters to stay within the hypercube specified by min and max. To speed-up posterior exploration, the N = 10 different chains are run in parallel using the MATLAB parallel computing toolbox.
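The snippet below sketches the general pattern of such a wrapper: write the candidate parameters to file, invoke the external executable with dos, and read the simulated output back into MATLAB. The file names, formats and executable name are assumptions made for this illustration and do not describe the actual HYDRUS-1D interface.

function Y = external_model_wrapper(x)
% Sketch of a wrapper around an external (Fortran) executable (illustration only;
% 'parameters.in', 'model.exe' and 'simulation.out' are hypothetical names)
dlmwrite('parameters.in',x,'delimiter',' ','precision','%12.6f');  % pass proposal to model
status = dos('model.exe');                                         % run the external code
if status ~= 0
    error('External model did not terminate successfully');
end
Y = dlmread('simulation.out');   % read the simulated output back into MATLAB
Y = Y(:);                        % return as column vector for the likelihood function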

Fig. 8. DREAM derived posterior marginal distribution of dimensions (A) 25, (B) 50, (C) 75, and (D) 100 of the d ¼ 100 multivariate Student distribution. The solid black line depicts b the target distribution. (E) Evolution of the R convergence diagnostic of Gelman and Rubin (1992). The horizontal line depicts the threshold of 1.2, necessary to officially declare convergence to a limiting distribution. 296 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316

1 (A) (B) (C) (D) 0.8 0.6

Density 0.4 0.2

0.43 0.45 0.47 0.003 0.005 0.007 1.6 1.7 1.8 1 2 3 4 5

total uncertainty (E) parameter uncertainty 0.4 observations

) 0.35 -3 /cm

3 0.3 (cm 0.25 soil moisture content 0.2 80 100 120 140 160 180 200 220 240 260 280 day of year

Fig. 9. Histograms of the marginal posterior distribution of the soil hydraulic parameters, (A) qs, (B) a, (C) n, and (D) Ks, and (E) HYDRUS-1D 95% simulation uncertainty intervals due to parameter (dark region) and total uncertainty (light gray). The observed soil moisture value are indicated with a red circle. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9 presents histograms of the marginal posterior distribution observations lies within the gray region, an indication that the of four of the seven parameters considered in this study. The bot- simulation uncertainty ranges are statistically adequate. The tom panel presents a plot of simulated soil moisture acceptance rate of DREAM averages about 12.6% e about half of its contents. The dark gray region constitutes the 95% HYDRUS-1D theoretical optimal value of 22 e 25% (for Gaussian and Student simulation uncertainty due to parameter uncertainty, whereas target distributions). This deficiency is explained in part by the high the light gray region denotes the total simulation uncertainty nonlinearity of retention and hydraulic conductivity functions, and (parameter þ randomly sampled additive error). The observed soil numerical errors of the implicit, time-variable, solver of the moisture values are indicated with a red circle. Richards' equation. This introduces irregularities (e.g. local optima) The HYDRUS-1D model closely tracks the observed soil moisture in the posterior response surface and makes the journey to and contents with Root Mean Square Error (RMSE) of the posterior sampling from the target distribution more difficult. mean simulation of about 0.01 cm3/cm3. About 95% of the

1 (A) (B) (C) (D) 0.8 0.6

Density 0.4 0.2

0.46 0.48 0.5 0.76 0.78 0.8 0.72 0.74 0.76 3.32 3.34 3.36

(E) 30 parameter uncertainty observations 20

10

Discharge (mm/day) Discharge 0 800 850 900 950 1,000 1,050 1,100 day of year

Fig. 10. Application of DREAM(ABC) to the hmodel using historical data from the Guadalupe River at Spring Branch, Texas. Posterior marginal distribution of the summary metrics (A) S1 (runoff coefficient), (B) S2 (baseflow index), (C) S3 and (D) S4 (two coefficients of the flow duration curve). The blue vertical lines are epsilon removed from the observed summary metrics (blue cross) and delineate the behavioral (posterior) model space. The bottom panel (E) presents the 95% simulation uncertainty ranges of the hmodel for a selected 300-day portion of the calibration data set. The observed discharge data are indicated with red circles. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316 297

5.4. Case study IV: diagnostic model evaluation The simulated summary metrics cluster closely (within epsilon) around their observed counterparts. About 15,000 function evalu- The fourth case study illustrates the ability of DREAM to be used ations were required with DREAM(ABC) to converge to a limiting for diagnostic model evaluation. A rather parsimonious 7- distribution (not shown). This is orders of magnitude more efficient parameter lumped watershed model (also known as hmodel) is than commonly used rejection samplers (Sadegh and Vrugt, 2014). used with historical data from the Guadalupe River at Spring Note that the hmodel nicely mimics the observed discharge dy- Branch, Texas. This is the driest of the 12 MOPEX basins described in namics with simulation uncertainty ranges that envelop a large the study of Duan et al. (2006). The model structure and hydrologic portion of the discharge observations. Thus, the four summary process representations are found in Schoups and Vrugt (2010). The metrics used herein contain sufficient information to provide a model transforms rainfall into runoff at the watershed outlet using reasonably adequate calibration. The interested reader is referred to explicit process descriptions of interception, throughfall, evapora- Vrugt and Sadegh (2013) and Vrugt (submitted for publication) for tion, runoff generation, percolation, and surface and subsurface a much more detailed ABC analysis with particular focus on diag- routing. nosis and detection of epistemic errors. Daily discharge, mean areal precipitation, and mean areal po- tential evapotranspiration were derived from Duan et al. (2006) 5.5. Case study V: Bayesian model averaging and used for diagnostic model evaluation with DREAM(ABC) (Sadegh and Vrugt, 2014). Details about the basin and experimental Ensemble Bayesian Model Averaging (BMA) proposed by data, and likelihood function can be found there, and will not be Raftery et al. (2005) is a widely used method for statistical post- discussed herein. The same model and data was used in a previous processing of forecasts from an ensemble of different models. study Schoups and Vrugt (2010), and used to introduce the The BMA predictive distribution of any future quantity of interest generalized likelihood function of Table 1. is a weighted average of probability density functions centered on Four different summary metrics of the discharge data are used the bias-corrected forecasts from a set of individual models. The for ABC inference (activated with likelihood function 22), including weights are the estimated posterior model probabilities, repre- fi fl S1 (-) the annual runoff coef cient, S2 (-) the annual base ow co- senting each model's relative forecast skill in the training (cali- fi fi fl ef cient, and S3 (day/mm) and S4 (-) two coef cients of the ow bration) period. duration curve (Vrugt and Sadegh, 2013; Sadegh et al., 2015a). The Successful application of BMA requires estimates of the following setup is used in the MATLAB package of DREAM. weights and of the individual competing models in

the ensemble. In their seminal paper, Raftery et al. (2005) The function Calc_metrics returns the values of the four sum- recommends using the Expectation Maximization (EM) algo- mary statistics using as input a record of daily discharge values. The rithm (Dempster et al., 1997). This method is relatively easy to actual model crr_model is written in the C-language and linked to implement, computationally efficient, but does not provide MATLAB into a shared library called a MEX-file. The use of such uncertainty estimates of the weights and variances. Here I MEX function significantly reduces the wall-time of DREAM. demonstrate the application of DREAM to BMA model training Fig. 10 (top panel) presents histograms of the marginal distri- using a 36-year record of daily streamflow observations from butions of the summary statistics. The posterior summary metrics the Leaf River basin in the USA. An ensemble of eight different lie within epsilon of their observed values, a necessary requirement calibrated watershed models is taken from Vrugt and Robinson for successful ABC inference. The bottom panel presents a time (2007a) and used in the present analysis. The names of these series plot of the observed (red dots) and hmodel simulated models and the RMSE (m3/s) of their forecast error are listed in streamflow values. The dark gray region constitutes the 95% Table 6. simulation uncertainty of the hmodel due to parameter Theory, concepts and applications of DREAM(BMA) have been uncertainty. presented by Vrugt et al. (2008c) and interested readers are 298 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316 referred to this publication for further details. Here, I restrict The values of the weights depend somewhat on the attention to the setup of BMA in the MATLAB package of DREAM. assumed conditional distribution of the deterministic model

The predictive distribution of each constituent member of the forecasts of the ensemble. The GR4J, HBV and SACSMA models ensemble is assumed to follow a gamma distribution with un- consistently receive the highest weights and are thus most known heteroscedastic variance. The BMA_calc script is listed in important in BMA model construction for this data set. Note Appendix C. also that TOPMO receives a very low BMA weight, despite Table 6 summarizes the results of DREAM(BMA) and presents (in having the second lowest RMSE value of the training data column “Gamma”) the maximum a-posteriori (MAP) values of the period. Correlation between the individual forecasts of the BMA weights for the different models of the ensemble. Values listed watershed models affects strongly the posterior distribution of in parentheses denote the posterior standard deviation derived the BMA weights. The gamma distribution is preferred for from the DREAM sample. I also summarize the MAP values of the probabilistic streamflow forecasting with 95% simulation un- weights for a Gaussian (conditional) distribution (columns certainty ranges that, on average, are noticeably smaller than “Normal”) with homoscedastic (left) or heteroscedastic (right) error their counterparts derived from a normal distribution. The variance, and report the average RMSE (m3/s), coverage (%) and interested reader is referred to Vrugt and Robinson (2007a) spread (m3/s) of the resulting BMA model during the 26-year eval- and Rings et al. (2012) for a more detailed analysis of the uation period. BMA results, and a comparison with filtering methods. Fig. 11 presents histograms of the marginal posterior distribu- tion of the BMA weights for each of the models of the ensemble. Table 6 The MAP values of the weights are separately indicated with a blue Results of DREAM(BMA) by application to eight different watershed models using daily discharge data from the Leaf River in Mississippi, USA. I list the individual cross. forecast errors of the models for the training data period, the corresponding MAP The distributions appear rather well-defined and exhibit an values of the weights for a Gamma (default) and Gaussian forecast distribution approximate Gaussian shape. The posterior weights convey which (numbers between parenthesis list the posterior standard deviation), and present the results of the BMA model (bottom panel) during the evaluation period. The models of the ensemble are of importance in the BMA model and spread (m3/s) and coverage (%) are derived from a 95% . which models can be discarded without harming the results. The use of fewer models is computational appealing as it will reduce the Model RMSE Gamma Normala Normalb CPU-time to generate the ensemble. ABC 31.67 0.02 (0.006) 0.03 (0.010) 0.00 (0.002) GR4J 19.21 0.21 (0.016) 0.14 (0.013) 0.10 (0.013) HYMOD 19.03 0.03 (0.008) 0.13 (0.046) 0.00 (0.005) TOPMO 17.68 0.03 (0.006) 0.08 (0.047) 0.03 (0.010) 5.6. Case study VI: generalized likelihood uncertainty estimation AWBM 26.31 0.05 (0.009) 0.01 (0.010) 0.00 (0.002) NAM 20.22 0.05 (0.011) 0.14 (0.048) 0.11 (0.014) Our sixth case study reports on GLUE and involves application of HBV 19.44 0.24 (0.017) 0.13 (0.034) 0.31 (0.016) SACSMA 16.45 0.37 (0.017) 0.34 (0.022) 0.43 (0.017) an informal likelihood function to the study of animal population dynamics. 
One of the first models to explain the interactions be- BMA: log-likelihood 9775.1 9950.5 9189.4 BMA: RMSE 22.54 23.22 23.16 tween predators and prey was proposed in 1925 by the American BMA: spread 39.74 46.98 46.54 biophysicist Alfred Lotka and the Italian mathematician Vito Vol- BMA: coverage 93.65% 92.59% 95.71% terra. This model, one of the earliest in theoretical ecology, has been a Homoscedastic (fixed) variance. widely used to study population dynamics, and is given by the b Heteroscedastic variance. following system of two coupled differential equations J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316 299

1 (A) ABC (B) GR4J (C) HYMOD (D) TOPMO 0.8 0.6

Density 0.4 0.2

0.01 0.03 0.05 0.15 0.2 0.25 0.02 0.04 0.06 0 0.02 0.04

1 (E) AWBM (F) NAM (G) HBV (H) SACSMA 0.8 0.6

Density 0.4 0.2

0.04 0.07 0.1 0.03 0.06 0.09 0.2 0.25 0.3 0.32 0.36 0.4

Fig. 11. Histograms of the marginal posterior distribution of the weights and variances of each individual model of the ensemble. The MAP values of the weights are denoted with a blue cross. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

dP1 A synthetic monthly data set of a prey and predator population t ¼ aP1 bP1P2 dt t t t is created by solving Equation (38) numerically for a 20-year period (38) using an implicit, time-variable, integration method (built-in ode dP2 t 2 1 2; solver of MATLAB). The initial states, P1 ¼ 30 and P2 ¼ 4 and ¼gPt þ dPt Pt 0 0 dt parameter values a ¼ 0.5471, b ¼ 0.0281, g ¼ 0.8439 and d ¼ 0.0266 correspond to data collected by the Hudson Bay Company between where P1 and P2 denote the size of the prey and predator popula- t t 1900 and 1920. These synthetic monthly observations are subse- tion at time t respectively, a (-) is the prey growth rate (assumed quently perturbed with a homoscedastic error, and this corrupted exponential in the absence of any predators), b (-) signifies the data set is saved as text file “abundances.txt” and used for infer- attack rate (prey mortality rate for per-capita predation), g (-) ence. The following setup of DREAM is used in MATLAB. represents the exponential death rate for predators in the absence of any prey, and d (-) is the efficiency of conversion from prey to predator. 300 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316

An informal likelihood function (32) is used to transform the z z Tðt; zÞ¼T þ A exp sin uðt fÞ ; (39) difference between the observed and simulated predator-prey 0 0 d d populations in a likelihood. The forward model script lotka_- o volterra can be found in Appendix C. where t (hr) denotes time, T0 ( C) is the annual average temper- Fig. 12 presents the marginal posterior distributions of the pa- ature at the soil surface, A0 ( C) is the amplitude of the tempera- rameters a, b, g and d (top panel) and displays (bottom panel) the ture fluctuation, u ¼ 2p/24 (hr1) signifies the angular frequency, 95% uncertainty (dark grey) of the simulated prey and predator f (hr) is a phase constant, z (cm) is the depth in the soil profile populations. The observed species abundances are separately (positive downward) and d (cm) denotes the characteristic indicated with the red circles. damping depth. The parameter d appears best defined by calibration against the A synthetic record of hourly soil temperature observations at z ¼ observed species abundances with posterior ranges that are rather 5, z ¼ 10, and z ¼ 15 cm depth is used to illustrate the DREAM setup tight. The histograms of a and g are rather dispersed with posterior and results. This data set was created by solving Equation (39) in uncertainty ranges that encompass a large part of the prior distri- the model script heatflow (see Appendix C) for a 2-day period using 0 0 bution. This relatively large parameter uncertainty translates into T0 ¼ 20 C, A0 ¼ 5 C, f ¼ 8 (hr) and d ¼ 20 (cm). The hourly data an unrealistically large prediction uncertainty (bottom panel). Of was subsequently perturbed with a normally distributed error of course, the results of DREAM depend strongly on the value of the 0.5o C and used in the analysis. The limits of acceptability were set shaping factor, GLUE of DREAMPar in likelihood function 32. If this to be equal to 2 C for each of the m ¼ 144 temperature observa- value is taken to be much larger (e.g. 100), then the marginal dis- tions. The four parameters T0, A0, f and d are determined from the tributions would be much peakier and center on the “true” Lok- observed temperature data using the following setup of DREAM in taeVolterra parameter values used to generate the synthetic record MATLAB.

of predator and prey populations. Moreover, the spread of the 95% prediction uncertainty ranges would be much smaller. Blasone et al. The value of the effective observation error is assumed to be a (2008) presents a more in-depth analysis of the application of constant, and consequently a scalar declaration suffices for this MCMC simulation to GLUE inference. field epsilon of structure Meas_info. If the limits of acceptability are observation dependent then a vector, with in this case m ¼ 144 values, should be defined. 5.7. Case study VII: limits of acceptability Fig. 13 presents the results of the analysis. The top panel pre- sents marginal distributions of the parameters (A) T0, (B) A0, (C) f, In the manifesto for the equifinality thesis, Beven (2006) sug- and (D) d, whereas the bottom panel presents time series plots of gested that a more rigorous approach to model evaluation would (E) the original temperature data before corruption with a mea- involve the use of limits of acceptability for each individual surement error, and the behavioral simulation space of Equation fi observation against which model simulated values are compared. (39) in model at (F) 5, (G) 10 and (H) 15 cm depth in the soil pro le. fi Within this framework, behavioral models are defined as those that The gray region satis es the limits of acceptability of each tem- satisfy the limits of acceptability for each observation. Our seventh perature observation and measurement depth. and last case study briefly describes the application of DREAM to The histograms center around their true values (denoted with a fi sampling the behavioral parameter space that satisfies the limits of blue cross). The parameters T0, A0, and f appear well de ned, acceptability of each observation. whereas the damping depth d exhibits a large uncertainty. This I use a simple illustrative example involving modeling of the soil uncertainty translates in a rather large uncertainty of the apparent 1 u 2 2 1 temperature T in degrees Celsius using the following analytic soil thermal diffusivity, KT ¼ 2 d (cm hr ). This concludes the equation numerical experiments. The interested reader is referred to Vrugt J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316 301

Fig. 12. Histograms (top panel) of the marginal posterior distribution of the LotkaeVolterra model parameters (A) a, (B) b, (C) g, and (D) d. Time series plot (bottom panel) of 95% simulation uncertainty ranges of the (E) prey and (F) predator populations. The observed data are indicated with the red circles. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 13. Histograms (top panel) of the marginal posterior distribution of the heat flow parameters (A) T0, (B) A0, (C) f, and (D) d. The true values of the parameters are separately indicated with a blue cross. Time series plot (bottom panel) of (E) original data (before corruption) at the 5, 10 and 15 cm depth in the soil profile, and (F)e(H) behavioral simulation space (gray region) that satisfies the effective observation error (2 C) of each temperature measurement. The corrupted data are separately indicated with dots. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

(2015a) for a more detailed exposition of membership-set likeli- as templates for their own modeling and inference problems. I now hood functions such as those used in the GLUE limits of accept- list a few important topics that have not been explicitly addressed ability framework. herein.

6. Additional options 6.1. Diagnostic Bayes

The seven case studies presented herein illustrate only some of A recurrent issue with the application of ABC is self-sufficiency ~ the capabilities of the DREAM software package. The script RUN- of the summary metrics, SðYÞ. In theory, S(,) should contain as DREAM presents a more exhaustive overview of the different much information as the original data itself, yet complex systems functionalities of the DREAM toolbox, and includes 23 prototype rarely admit sufficient statistics. Vrugt (submitted for publication) example studies involving among others much more complex and therefore proposed in another recent article a hybrid approach, higher dimensional target distributions as well, for example esti- coined diagnostic Bayes, that uses the summary metrics as prior mation of the two- and/or three-dimensional soil moisture distri- distribution and the original data in the likelihood function, or ~ ~ ~ bution from travel time data of ground penetrating radar (Laloy pðxYÞfpðxSðYÞÞLðxYÞ. This approach guarantees that no infor- et al., 2012; Linde and Vrugt, 2013) and treatment of rainfall un- mation is lost during the inference. The use of summary metrics as certainty in hydrologic modeling (Vrugt et al., 2008a). Users can prior distribution is rather unorthodox and arguments of in favor of draw inspiration from these different test problems and use them this approach are provided by Vrugt (submitted for publication). 302 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316

Diagnostic Bayes is easily setup and executed within DREAM. states and forcing data, respectively and their values subject to ~ The user has to set the field DB of structure options equal to ’yes’. inference with the parameters using the observed data, Y. The Then, the observations of the calibration data and related summary BATEA framework is an example of this more advanced approach statistics are stored in fields Y and S of structure Meas_info, (Kavetski et al., 2006a,b; Kuczera et al., 2006; Renard et al., 2010, respectively. The output of the model script consists of the simu- 2011), and can be implemented with DREAM as well (Vrugt et al., lated data, Y augmented at the end of the return vector Y with the 2008a, 2009a,b). The formulation of Equation (41) is easily adapt- values of the simulated summary statistics, S(Y(x)). ed to include errors in the calibration data as well (see Appendix B) though it remains difficult to treat epistemic errors. What is more, 6.2. Joint parameter and state estimation this approach with many nuisance variables will only work satis- factorily if a sufficiently strong prior is used for each individual The return argument Y of the function script model usually in- error source. Otherwise the inference can rapidly degenerate and volves the output of some model, Y)F ðx; ,Þ. The computation in become meaningless. this script can involve state estimation as well. The return argu- One can also persist in treating model parameter uncertainty ment of model then involves a time-series of forecasts derived from only, and use instead an advanced likelihood function whose fl the Kalman filter. This approach, assumes time-invariant parameter nuisance variables render it exible enough to mimic closely values and is at the heart of SODA and particle-DREAM (Vrugt et al., complex nontraditional error residual distributions (Schoups and 2005, 2013b). Vrugt, 2010; Evin et al., 2013; Scharnagl et al., 2015). The results of such approach might be statistically meaningful in that the as- sumptions of the likelihood function are matched by the actual 6.3. Bayesian residual properties, yet this methodology provides little guidance on structural model errors. Inferences about the model parameters are typically made from The answer to this complicated problem of how to detect, di- ~ the unnormalized posterior density, pðxYÞ in Equation (5). This agnose and resolve model structural errors might lie in the use of ~ summary statistics of the data rather than the data itself. A plea for Equation ignores the normalization constant, pðYÞ. This constant, also referred to as marginal likelihood or evidence can be derived this approach has been made by Gupta et al. (2008) and Vrugt and from multi-dimensional integration of the posterior distribution, Sadegh (2013) have provided the mathematical foundation for R ~ ~ diagnostic model evaluation using ABC. Subsequent work by pðYÞ¼ pðxÞLðxYÞdc, where x 2 c 2 ℝd. In the case of multiple c Sadegh et al. (2015b) has shown the merits of this methodology by competing model hypotheses addressing the stationarity paradigm. Other recent work demon- strates that the use of summary metrics provides much better Z∞ guidance on model malfunctioning (Vrugt, submitted for ~ ~ p YH ¼ pðHÞL HY dH (40) publication). ∞ U 6.5. [ -norm of error residuals ~ the model with the largest value of pðYHÞ is preferred statistically. 
The statistical literature has introduced several methods to Likelihood functions play a key role in of the ~ determine pðYHÞ (Chib, 1995; Kass and Raftery, 1995; Meng and model parameters. Their mathematical formulation depends on the Wong, 1996; Lewis and Raftery, 1997; Gelman and Meng, 1998). assumptions that are made about the probabilistic properties of the Numerical experiments with a suite of different benchmark func- error residuals. The validity of these assumptions can be verified a- tions have shown that these approaches are not particularly accu- posteriori by inspecting the actual error residual time series of the rate for high-dimensional and complex target distributions that posterior mean simulation. Likelihood functions based on a [2-norm deviate markedly from multi-normality. Volpi et al. (2015) have (squared residuals) are most often used in practical applications, therefore presented a new estimator of the marginal likelihood despite their relative sensitivity to peaks and outlier data points. which works well for a large range of posterior target distributions. Their use is motivated by analytic tractability - that is - with rela- This algorithm uses the posterior samples derived from DREAM and tively little ease confidence intervals of the parameters can be is integrated in the MATLAB package. construed from a classical first-order approximation around the optimum. This attractive feature of a [2-type likelihood function was of imminent importance in past eras without adequate computa- 6.4. Improved treatment of uncertainty tional resources but is a far less desirable quality nowadays with availability of powerful computing capabilities and efficient algo- Most applications of Bayesian inference in Earth and environ- rithms. Indeed, methods such as DREAM can solve for likelihood mental modeling assume the model to be a perfect representation functions with any desired norm, U 2 ℕþ. For instance, the Laplacian of reality, the input (forcing) data to be observed without error, and likelihood (see Table B1) uses a [1 norm of the error residuals and consequently the parameters to be the only source of uncertainty. therefore should be less sensitive to peaks and outliers. Unless there These assumptions are convenient in applying but are very good reasons to adopt a [2-type likelihood function, their often not borne out of the properties of the error residuals whose use might otherwise be a historical relic (Beven and Binley, 2014). probabilistic properties deviate often considerably from normality with (among others) non-constant variance, heavy tails, and vary- 6.6. Convergence monitoring ing degrees of skewness and temporal and/or spatial correlation. Bayes law allows for treatment of all sources of modeling error The most recent version of DREAM also includes calculation of b b through the use of nuisance variables, for instance the multivariate R-statistic of Gelman and Rubin (1992). This sta- bd À Á tistic, hereafter referred to as R -diagnostic, is defined in Brooks ; b; ~; j~ ~ f b ~ j~ ; b; ~; j~ ~ : p x U 0 Y pðxÞpð Þp U p 0 L x U 0 Y (41) and Gelman (1998) and assesses convergence of the d parameters simultaneously by comparing their within and between-sequence The nuisance variables are coefficients in error models of the initial covariance matrix. Convergence is achieved when a rotationally J.A. 
Vrugt / Environmental Modelling & Software 75 (2016) 273e316 303 invariant distance measure between the two matrices indicates toolboxes identical to what is presented herein for DREAM, but bd that they are “sufficiently” close. Then, the multivariate R -statistic with unique algorithmic parameters. I briefly describe each of these achieves a value close to unity, otherwise its value is much larger. In algorithms below, and discuss their algorithmic parameters in the b bd fact, the R and R -statistic take on a very similar range of values, MATLAB code. hence simplifying analysis of when convergence has been achieved. bd The R -statistic is particularly useful for high-dimensional target distributions involving complicated multi-dimensional 7.1. DREAM(ZS) parameter interactions. We do not present this statistic in the present paper, but the DREAM package returns its value at different This algorithm creates the jump vector in Equation (23) from the iteration numbers in field MR_stat of structure options. past states of the joint chains. This idea is implemented as follows. If Z ¼ {z1,…,zm} is a matrix of size m d which thinned history of each of the N chains, then the jump is calculated using 6.7. Prior distribution

The prior distribution, p(x) describes all knowledge about the Xd i aj bj dX ¼ z þð1 þ l Þg ; Z Z model parameters before any data is collected. Options include the A d d d ðd d Þ A A (43) use of noninformative (flat, uniform) and informative prior distri- j¼1 i ; butions. These built-in capabilities will not suffice for applications dXsA ¼ 0 involving complex prior parameter distributions defined (or not) by a series of simulation steps rather than some analytic distribution. where a and b are 2dN integer values drawn without replacement Such priors are used abundantly in the fields of geostatistics and from {1,…,m}. geophysics, and have led Mosegaard and Tarantola (1995) to develop The DREAM algorithm contains two additional algorithmic the extended Metropolis algorithm (EMA). This algorithm builds on (ZS) variables compared to DREAM, including m , the initial size the standard RWM algorithm, but samples proposals from the prior 0 (number of rows) of the matrix Z and k the rate at which samples distribution instead, thereby honoring existing data and the spatial are appended to this external archive. Their recommended default structure of the variable of interest (Hansen et al., 2012; Laloy et al., values are m ¼ 10d and k ¼ 10 iterations respectively. The initial 2015). The acceptance probability in Equation (14) then becomes 0 archive Z is drawn from the prior distribution of which the last N  À Á  À Á draws are copied to the matrix X which stores the current state of L xp pacc x /xp ¼ min 1; ; (42) each chain. After each k draws (generations) in each Markov chain, t1 Lðx Þ t1 the matrix X is appended to Z. The use of past samples in the jump distribution of Equation and the resulting chain simulated by EMA satisfies detailed balance. (43) has three main advantages. First, a much smaller number of This approach, also known as sequential simulation or sequential chains suffices to explore the target distribution. This not only geostatistical , can handle complex geostatistical priors, minimizes the number of samples required for burn-in, but also yet its efficiency is critically dependent on the proposal mechanism simplifies application of DREAM to high-dimensional search used to draw samples from the prior distribution (Laloy et al., 2015; (ZS) spaces. Indeed, whereas DREAM requires at least N d/2, Ruggeri et al., 2015). benchmark experiments with DREAM have shown that N ¼ 3 The basic idea of EMA is readily incorporated in DREAM by (ZS) chains (set as default) suffices for a large range of target di- replacing the parallel direction jump of Equation (23) with a swap mensionalities. Second, because the proposal distribution in type proposal distribution used in the DREAM algorithm (see (D) DREAM uses past states of the chains only, each trajectory can section). For instance, the most dissimilar entries of two other (ZS) evolve on a different processor. Such distributed implementation chains can be used to guide which coordinates to draw from the is used within DREAM as well, but violates, at least theoretically, prior distribution. This adaptive approach shares information about the convergence proof (see Section 3.3). Third, outlier chains do the topology of the search space between the different chains, a not need forceful treatment. Such chains can always sample their requirement to speed up the convergence to the target distribution. own past and with a periodic value of g ¼ 1 jump directly to the I will leave this development for future research. mode of the target. 
Dimensionality reduction methods provide an alternative to The sampling from an external archive of past states violates the EMA and represent the spatial structure of the variable of interest Markovian principles of the sampled chains, and turns the method with much fewer parameters than required for pixel based inver- into an adaptive Metropolis sampler (Roberts and Rosenthal, 2007; sion while maintaining a large degree of fine-scale information. ter Braak and Vrugt, 2008). To ensure convergence to the exact target This allows for the use of standard closed-form prior distributions distribution the adaptation should decrease in time, a requirement for the reduced set of parameters. Examples of such approaches satisfied by DREAM as Z grows by an order of N/m ¼ k/t which include the discrete cosine transform (Jafarpour et al., 2009, 2010; (ZS) hence slows down with generation t (ter Braak and Vrugt, 2008). Linde and Vrugt, 2013; Lochbühler et al., 2015), transform To enhance the diversity of the proposals created by DREAM , (Davis and Li, 2011; Jafarpour, 2011), and singular value decom- (ZS) the algorithm includes a mix of parallel direction and snooker position (Laloy et al., 2012; Oware et al., 2013). jumps (ter Braak and Vrugt, 2008). This snooker jump is depicted schematically in Fig. 14 and uses an adaptive step size. The indexes 7. The DREAM family of algorithms a, b and c are drawn randomly from the integers {1,…,m} (without replacement). In the past years, several other MCMC algorithms have appeared The orientation of the snooker jump is determined by the line in the literature with a high DREAM pedigree. These algorithms use XiZa going through the current state of the ith chain and sample a of DREAM as their basic building block but include special extensions the external archive. The snooker axis is now defined by the line to simplify inference (among others) of discrete and combinatorial ZbZc and is projected orthogonally on to the line XiZa. The differ- b c search spaces, and high-dimensional and CPU-intensive system ence between the two projection points Z⊥ and Z⊥ now defines the models. These algorithms have their own individual MATLAB length of the snooker jump as follows 304 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316

i D 1 i Xp ¼ x Xp (46)

where Dx ¼ {Dx1,…,Dxd}isa1d-vector with discretization interval of each dimension of x and 1 denotes element-by-element multipli- cation. For instance, consider a two-dimensional problem with prior U 2½2; 6 (and thus Par_info.min ¼ [2 2], Par_info.max ¼ [6 6]) and Dx ¼ {1/4,1/2}, then DREAM(D) samples the integer space, x1 2 2 i ; [033] and x2 [017], respectively. A proposal, Xp ¼ f16 9g is then equivalent to {2,2} þ {161/4,91/2} ¼ {2,5/2}. The field steps of structure Par_info stores in a 1 d-vector the values of Dx.For an integer space, the value of Dx ¼ 1d. The parallel direction jump of Equation (45) works well for discrete problems but is not necessary optimal for combinatorial problems in which the optimal values are known a-priori but not their location in the parameter vector. The topology of such search problems differs substantially from Euclidean search problems. Vrugt et al. (2011) therefore introduce two alternative proposal Fig. 14. DREAM(ZS) algorithm: Explanation of the snooker update for a hypothetical distributions for combinational problems. The first of these two two-dimensional problem using some external archive of m¼10 points (grey dots). proposal distributions swaps randomly two coordinates in each Three points of this archive Za, Zb and Zc are sampled at random and define the jump of the ith chain, Xi (blue) as follows. The points Zb and Zc are projected orthogonally on to individual chain. If the current state of the ith chain is given by i a fi i i ; …; i ; …; i ; …; i the dotted X Z line. The jump is now de ned as a multiple of the difference between X ¼½X ; X ; X ; X ; then the candidate b c i t1 t1 1 t1 j t1 k t1 d the projections points, Z⊥ and Z⊥ (green squares) and creates the proposal, , X . The p i i ; …; i ; …; i ; …; i DREAM(ZS) algorithm uses a 90/10% mix of parallel direction and snooker updates, point becomes, Xp ¼½Xp;1 Xp;k Xp;j Xp;d where j and k are respectively. The probability of a snooker update is stored in field psnooker of structure sampled without replacement from the integers {1,…,d}. It is DREAMPar. (For interpretation of the references to color in this figure legend, the straightforward to see that this proposal distribution satisfies reader is referred to the web version of this article.) detailed balance as the forward and backward jump have equal probability. i b c z ; dX ¼ gs Z⊥ Z⊥ þ d (44) This coordinate swapping does not exploit any information about the topology of the solution encapsulated in the position of the other D U : ; : fi where gs ½1 2 2 2 signi es the snooker jump rate (ter Braak N1 chains. Each chain essentially evolves independently to the and Vrugt, 2008). The proposal point is then calculated using target distribution. This appears rather inefficient, particularly for Equation (24). complicated search problems. The second proposal distribution The MATLAB code of DREAM(ZS) uses the exact same coding takes explicit information from the dissimilarities in coordinates of terminology and variables as DREAM but includes three additional the N chains evolving in parallel. This idea works as follows Vrugt fields in structure DREAMPar, that is m0, k and psnooker with et al. (2011). Let Xa and Xb be two chains that are chosen at random default values of 10d, 10 and 0.1, respectively. These are the algo- from the population Xt1. 
From the dissimilar coordinates of a and b rithmic parameters that determine the initial size of the external two different dimensions, say j and k, are picked at random, and their a archive, the rate at which proposals are appended to this archive, values swapped within each chain. The resulting two proposals, Xp b and the probability of a snooker jump. Furthermore, a default value and Xp are subsequently evaluated by model and the product of their a/ a b/ b of N ¼ 3 is used for the number of chains. respective Metropolis ratios calculated, paccðX XpÞpaccðX XpÞ using Equation (14). If this product is larger than the random label U ; 7.2. DREAM(D) drawn from ð0 1Þ then both chains move to their respective a a b b candidate points, that is, xt ¼ Xp and xt ¼ Xp, otherwise they a a b b The DREAM(D) code is especially developed to sample efficiently remain at their current state, xt ¼ xt1 and xt ¼ xt1. non-continuous, discrete, and combinatorial target distributions. The dimensions j and k are determined by the dissimilarities of This method helps solve experimental design problems involving the d coordinate values of two different chains. Unlike the random real-time selection of measurements that discriminate best among swap this second proposal distribution (also referred to as directed competing hypothesis (models) of the system under consideration swap) shares information between two chains about their state. (Vrugt and ter Braak, 2011; Kikuchi et al., 2015). The DREAM(D) al- Those coordinates of the chains that are dissimilar are swapped, a gorithm uses DREAM as its main building block and implements strategy that expedites convergence to the target distribution. The two different proposal distributions to recognize explicitly differ- swap move is fully Markovian, that is, it uses only information from ences in topology between discrete and Euclidean search spaces. the current states for proposal generation, and maintains detailed The first proposal distribution is a regular parallel direction jump balance (Vrugt et al., 2011). If the swap is not feasible (less than two 6 7 dissimilar coordinates), the current chain is simply sampled again. 6 7 6 Xd 7 This is necessary to avoid complications with unequal probabilities i 4 aj bj 5 dX ¼ z þð1 þ l Þg ; X X of move types (Denison et al., 2002), the same trick is applied in A d d d ðd d Þ A A (45) j¼1 d reversible jump MCMC (Green, 1995). Restricting the swap to dis- i ; dXsA ¼ 0 similar coordinates does not destroy detailed balance, it just selects a subspace to sample on the basis of the current state. For combina- but with each of the sampled dimensions of the jump vector tional search problems, the DREAM(D) algorithm uses a default 90/ rounded to the nearest integer using the operator, Pð,ÞR. The integer- 10% mix of directed and random swaps, respectively. i 2ℕd; ; …; fi fi valued proposals, Xp i ¼ f1 Ng can be transformed to The eld prswap of structure DREAMPar in DREAM(D) de nes the non-integer values using a simple linear transformation probability of a random swap (default DREAMPar.prswap ¼ 0.1). J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316 305

7.3. DREAM(DZS) 8. Summary

Alternative (not published) discrete variant of DREAM(ZS). This In this paper I have reviewed the basic theory of Markov chain code uses discrete (snooker) sampling from the past to explore Monte Carlo (MCMC) simulation and have introduced a MATLAB target distributions involving non-continuous and/or combinato- package of the DREAM algorithm. This toolbox provides scientists rial parameter spaces. and engineers with an arsenal of options and utilities to solve posterior sampling problems involving (amongst others) bimo-

7.4. DREAM(ABC) dality, high-dimensionality, summary statistics, bounded param- eter spaces, dynamic simulation models, formal/informal This code has been developed by Sadegh and Vrugt (2014) for likelihood functions, diagnostic model evaluation, data assimila- diagnostic model evaluation and can be activated fromwithin DREAM tion, Bayesian model averaging, distributed computation, using likelihood function 22. Implementation details have been dis- and informative/noninformative prior distributions. The DREAM cussed in the main text of this paper and in Appendix B and C. toolbox supports parallel computing and includes tools for convergence analysis of the sampled chain trajectories and post- processing of the results. Seven different case studies were used to 7.5. DREAM(BMA) illustrate the main capabilities and functionalities of the MATLAB Specific implementation of DREAM for Bayesian model aver- toolbox. These example studies are easy to run and adapt and serve aging. Theory and application of this method have been discussed as templates for other inference problems. in Vrugt et al. (2008c) and an example has been presented in the A graphical user interface (GUI) of DREAM is currently under case studies section of this paper. development and will become available in due course.

Acknowledgments 7.6. MT-DREAM(ZS)

The comments of Niklas Linde and two anonymous referees are The MT-DREAM(ZS) algorithm uses multiple-try sampling (Liu greatly appreciated. Guilherme Gomez is thanked for careful proof et al., 2000), snooker updating, and sampling from an archive of reading of the revised manuscript, and Gerrit Schoups and Benedikt past states to enhance the convergence speed of CPU-intensive and Scharnagl are acknowledged for contributing code for likelihood parameter rich models. Benchmark experiments in geophysics, hy- functions 14 and 17, respectively. I highly appreciate the feedback of drology and hydrogeology have shown that this sampler is able to the many users. Their comments and suggestions have been a sample correctly high-dimensional target distributions (Laloy and source of inspiration for code improvement, and implementation of Vrugt, 2012; Laloy et al., 2012, 2013; Linde and Vrugt, 2013; new concepts and functionalities. The MATLAB toolbox of DREAM is Carbajal et al., 2014; Lochbühler et al., 2014, 2015). available upon request from the author, [email protected]. The MT-DREAM(ZS) algorithm uses as basic building block the DREAM(ZS) algorithm and implements multi-try sampling in each of Appendices the chains. This multi-try scheme is explained in detail by Laloy and Vrugt (2012) and creates m different proposals in each of the N ¼ 3 Appendix A (default) chains. If we use symbol Jd(,) to denote the jumping distri- butions in Equation (43) or (44) then this scheme works as follows. For Table A1 summarizes, in alphabetic order, the different function/ convenience whenever the symbol j is used I mean ’for all j 2 {1,…,m}’. program files of the DREAM package in MATLAB. The main program j i , RUNDREAM contains 23 different prototype studies which cover a (1) Create m proposals, Xp ¼ X þ Jdð Þ. j j large range of problem features. These example studies have been (2) Calculate wp, the product of prior and likelihood of Xp and m 1; …; m published in the geophysics, hydrologic, pedometrics, statistics and store values in -vector, wp ¼fwp wpg. i vadose zone literature, and provide a template for users to setup (3) Select Xp from Xp using selection probabilities wp. 1 i their own case study. The last line of each example study involves a (4) Set Xr ¼ Xp and create remaining m1 points of reference j i , function call to DREAM, which uses all the other functions listed on set, Xr ¼ Xp þ Jdð Þ. j j the next page to generate samples of the posterior distribution. Each (5) Calculate wr, the product of prior and likelihood of Xr and 1; …; m example problem of RUNDREAM has its own directory which stores store values in m-vector, wr ¼ fw wr g. r the model script written by the user and all other files (data file(s), (6) Accept Xi with probability p MATLAB scripts, external executable(s), etc.) necessary to run this 2 3 script and compute the return argument Y. À Á fi w1 … wm If activated by the user ( eld diagnostics of structure options is i/ i 4 ; r þ þ r 5: pacc X Xp ¼ min 1 (47) set to ’yes’), then at the end of each DREAM trial, the autocorrela- 1 … m wp þ þ wp tion function, Geweke (1992) and Raftery and Lewis (1992) convergence diagnostic are computed separately for each of the N It can be shown that this method satisfies the detailed balance chains using the CODA toolbox written by James P. LeSage (http:// condition and therefore produces a reversible Markov chain with the www.spatial-.com/). These functions are stored in target distribution as the stationary distribution (Liu et al., 2000). 
the folder “../diagnostics” under the main DREAM directory and The advantage of this multi-try scheme is that the m proposals produce an output file called “DREAM_diagnostics.txt” which is can be evaluated in parallel. With the use of N ¼ 3 chains this would printed to the screen in the MATLAB editor at the end of each require only N mt processors, which is much more practical for DREAM trial. These within-chain convergence diagnostics were large d than running DREAM in parallel with large N (Laloy and designed specifically for single-chain Metropolis samplers, and b Vrugt, 2012). Compared to DREAM(ZS) the MT-DREAM(ZS) algo- augment the multi-chain univariate and multivariate R and bd rithm has one more algorithmic parameter, m, the number of multi- R -statistic of Gelman and Rubin (1992) and Brooks and Gelman try proposals in each of the N chains. This variable is stored in field (1998) stored in fields R_stat and MR_stat of structure output, mt of DREAMPar and assumes a default value of m ¼ 5. respectively. Joint interpretations of all these different convergence 306 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316 diagnostics allows for a better assessment of when convergence to the target distribution has been achieved. The single-chain di- agnostics, require each chain to have at least 200 samples other- wise the file “DREAM_diagnostics.txt” is returned empty. The directory “../postprocessing” contains a number of different functions designed to visualize the results (output arguments) of DREAM. The program DREAM_POSTPROC creates a large number of MATLAB figures, including (among others) traceplots of the sampled chain trajectories, bivariate scatter plots and histograms of b bd the posterior samples, traceplots of the R- and R -diagnostics, autocorrelation functions of the sampled parameter values, quan- tileequantile plots of the error residuals, time series plots of the 95% simulation (prediction) uncertainty intervals. If ABC or diag- nostic Bayes is used then marginal distributions of the sampled summary statistics are plotted as well. The number of figures that is plotted depends on the dimensionality of the target distribution, the number of chains used, and the type of output argument (e.g. likelihood/simulation/summary metrics or combination thereof) that is returned by the function model written by the user.

Table A1 Description of the MATLAB functions and scripts (.m files) used by DREAM, version 3.0.

Name of function Description

ADAPT_PCR Calculates the selection probabilities of each crossover value BOUNDARY_HANDLING Corrects those parameter values of each proposal that are outside the search domain (if so desired) CALC_DELTA Calculates the normalized Euclidean distance between successive samples of each chain CALC_DENSITY Calculates the log-likelihood of each proposal CALC_PROPOSAL Computes proposals (candidate points) using differential evolution (see Equations (23) and (24)) CHECK_SIGMA Verifies whether the measurement error is estimated along with the parameters of the target distribution DRAW_CR Draws crossover values from discrete multinomial distribution DREAM Main DREAM function that calls different functions and returns sampled chains, diagnostics, and/or simulations DREAM_CALC_SETUP Setup of computational core of DREAM and (if activated) the distributed computing environment DREAM_CHECK Verifies the DREAM setup for potential errors and/or inconsistencies in the settings DREAM_END Terminates computing environment, calculates single-chain convergence diagnostics, and checks return arguments DREAM_INITIALIZE Samples the initial state of each chain DREAM_SETUP Setup of the main variables used by DREAM (pre-allocates memory) DREAM_STORE_RESULTS Appends model simulations to binary file “Z.bin” EVALUATE_MODEL Evaluates the proposals (executes function/model script Func_name) b GELMAN Calculates the R convergence diagnostic of Gelman and Rubin (1992) GL Evaluates generalized likelihood function of Schoups and Vrugt (2010a) LATIN Latin hypercube sampling METROPOLIS_RULE Computes Metropolis selection rule to accept/reject proposals MOMENT_TPDF Calculate absolute moments of the skewed standardized t-distribution REMOVE_OUTLIER Verifies presence of outlier chains and resets their states RUNDREAM Setup of 17 different example problems and calls the main DREAM script WHITTLE Evaluates Whittle's likelihood function (Whittle, 1953)

Appendix B

The mathematical formulations of the built-in likelihood func- tions of DREAM in Table 2 are given in Table B1 below. For conve- nience, E(x) ¼ {e1(x),…,en(x)} signifies the n-vector of residuals, ~ ~ ~ S ¼ fS1ðYÞ; …; SmðYÞg and S ¼ {S1(Y(x)),…,Sm(Y(x))} are m-vectors with observed and simulated summary statistics, respectively, and A ¼ {a1,…,an}isan-vector of filtered residuals in likelihood function 14 using an autoregressive model with coefficients, f ¼ {f1,…,f4}. J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316 307

Table B1 Mathematical formulation of built-in likelihood functions of DREAM. Option (1) and (2) return directly a likelihood and log-likelihood value, respectively, and their formulation is defined in the model script by the user.

lik Mathematical formulation Nuisance variables Note

Formal likelihood functions P 11 L ~ n n 2 none ðx YÞ¼2 logf t¼1et ðxÞ g   2 2 … 12 P P st;t {1, ,n} y ~ n n 1 n et ðxÞ L ðxYÞ¼ logð2pÞ flogðst Þg 2 t¼1 2 t¼1 st !   13 s2 2 s ,f; t2{1,…,n} y ~ n 1 1 e1ðxÞ t L ðxYÞ¼ logð2pÞ log 1 ð1 f2Þ 2 s 2 2 ð1 f Þ 2 1   Pn Xn 2 1 ðet ðxÞfet1ðxÞÞ flogðst Þg 2 st t¼20 1t¼2 f 14 s Xn Xn 2= 1 b s0,s1,b,x,m1, ,KBC,lBC zx ~ @ 2 x A ð þ Þ L ðxYÞxnlog u logðs Þg c a ; b 1 f t b x t ðx þ x Þ t¼1 t¼1

Pn ~ þðlBC 1Þ ðyt þ KBCÞ ! t¼1  15 bnP=2c ÀÁÀÁ ÀÁ none ¶ ℒ e ; ; F gðÞlj x Y ¼ log fℱ lj x þ fE lj þ ; ;F fℱ ðÞlj x þfE ðÞlj !j¼1 2 … 16 Pn Pn st;t {1, ,n} y e jet ðÞjx ℒ xY ¼ fglogðÞ 2st s t 8t¼1 t¼1 17 < s ,s ,s ,s ,f,n,k ¥ Xn pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi a b c d x L ~ G = = ðx YÞ¼ :logð2c2 ððn þ 1Þ 2Þ n ðn 2ÞÞ t¼2 qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffiffiffiffiffiffi 1 p 2 logððk þ k ÞGðn=2Þ pn ð1 f Þst Þ 0 ! 19 2 = c1 þ c2h ððn þ 1Þ=2Þlog@1 þð1=ðn 2ÞÞ t A signðc1 þc2 h Þ ; k t ABC e diagnostic model evaluation 21 Pm none £♯ L ~ m ε 1ε2 r ~ ; 2 ðx YÞ¼2 logð2pÞmlogð Þ2 ðSjðYÞ SjðYðxÞÞÞ j¼1 22 L ~ ε r ~ ; none £♯ ðxYÞ¼minð j ðSjðYÞ SjðYðxÞÞÞÞ j¼1:m e GLUE limits of acceptability P 23 L ~ m ~ ε none )♯ ðx YÞ¼ j¼1fIð SjðYÞSjðYðxÞÞ jÞg GLUE - informal likelihood functions ~ ⋄ 31 L ðxYÞ¼G logfVar½EðxÞg none ! 32 none ⋄ ~ Var½EðxÞ L ðxYÞ¼G log 1 ~ Var½Y ~ ⋄ 33 L ðxYÞ¼GVar½EðxÞ none 34 Pn none ⋄ L ~ ðx YÞ¼logf et ðxÞg t¼1 y Measurement error, st defined in field Sigma of Meas_info or inferred jointly with x (see Appendix B). z Measurement error defined as st ¼ s0þ s1 yt(x); Scalars ub, sx and cb derived from values of x and b; f ¼ {f1,…,f4} stores coefficients autoregressive model of error residuals. x User is free to select exact formulation (depends on selection nuisance variables). ¶ Fourier frequencies, lj, function, fE(,) and periodogram, g(,)defined in Whittle (1953). ¥ Scalars c1 and c2 computed from n>2 and k>0; h signifies (n1)-vector of restandardized first-order decorrelated residuals; G(,) and sign denote the gamma and signum function, respectively. £ ~ ABC distance function, rðSðYÞ; SðYðxÞÞÞ specified as inline function in field rho of structure options. ♯ ε (scalar or m-vector) stored in field epsilon of options. ) Variable I(a) returns one if a is true, zero otherwise. ⋄ Shaping factor, G defined in field GLUE of structure DREAMPar. Default setting of G ¼ 10.

The generalized likelihood function of Schoups and Vrugt (2010) remaining likelihood functions, 11, 12, 13, and 16 can be found in allows for bias correction, which is applied to the first or higher introductory textbooks on time-series analysis and Bayesian order filtered residuals prior to calculation of the likelihood. I refer inference. to Schoups and Vrugt (2010) and Scharnagl et al. (2015) for an exact Likelihood functions 14 and 17 extend the applicability of the derivation and detailed analysis of likelihood functions 14 and 17, other likelihood functions to situations where residual errors are respectively, and Whittle (1953) for an introduction to likelihood correlated, heteroscedastic, and non-Gaussian with varying de- 15. The ABC likelihood functions 21 and 22 are described and dis- grees of kurtosis and skewness. For instance, consider Fig. 15 cussed in detail by Turner and Sederberg (2012) and Sadegh and which plots the density of the generalized likelihood function Vrugt (2014), whereas the limits of acceptability function 23 is for different values of the skewness, b and kurtosis, x.The introduced and tested in Vrugt (2015a). The pseudo-likelihoods in density is symmetric for x ¼ 1, positively skewed for x > 1and 31, 32, 33 and 34 are explicated in the GLUE papers of Beven and negatively skewed for x < 1. If x ¼ 1, then for b ¼1(0)[1] this coworkers (Beven and Binley, 1992; Freer et al., 1996; Beven and density reduces to a uniform (Gaussian) [double-exponential] Freer, 2001; Beven, 2006). The derivation and explanation of the distribution. 308 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316

Fig. 15. Densities of the generalized likelihood function of Schoups and Vrugt (2010) for different values of the kurtosis (β) and skewness (ξ).

The Student likelihood function, 17, of Scharnagl et al. (2015) is designed in part to better mimic residual distributions with heavy tails (see Fig. 16). By fixing some of the values of the nuisance variables the likelihood function can be simplified to a specific family of probability distributions.

Fig. 16. Densities of the skewed Student likelihood function of Scharnagl et al. (2015) for different values of the skewness (κ) and kurtosis (ν).

Table B2 summarizes several commonly used formal likelihood functions in hydrologic modeling applications and lists how likelihood function 14 can be reduced to these by making specific assumptions about the error residuals (see also Schoups and Vrugt (2010)). I am now left to describe how to set up the joint inference of the model and nuisance parameters using the data stored in field Y of structure Meas_info. The MATLAB script below (after Table B2) provides an example for likelihood function 14 involving a model with d = 3 parameters, their names referred to in the excerpt as A, B and C.

Table B2 Relationship of likelihood functions used/proposed in the hydrologic literature and the likelihood function 14 of the DREAM package.

Reference                                      Implementation using likelihood 14
Standard                                       φ1 = 0; φ2 = 0; φ3 = 0; φ4 = 0; σ1 = 0; ξ = 1; β = 0
Sorooshian and Dracup (1980): Equation (20)    φ2 = 0; φ3 = 0; φ4 = 0; σ1 = 0; ξ = 1; β = 0
Sorooshian and Dracup (1980): Equation (26)    φ1 = 0; φ2 = 0; φ3 = 0; φ4 = 0; ξ = 1; β = 0
Kuczera (1983)                                 β = 0
Bates and Campbell (2001)                      β = 0
Thiemann et al. (2001)                         φ1 = 0; φ2 = 0; φ3 = 0; φ4 = 0
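To make this setup concrete, a minimal sketch for likelihood function 14 is given below. It assumes the calling convention of the toolbox described in Section 4; the numeric ranges, the default values in fpar, the file name observations.txt, and the idx_vpar entries chosen for σ1, β and λ_BC are illustrative placeholders (only the indices 4, 9 and 12 of σ0, φ1 and K_BC are documented in the text; the full index-to-variable mapping is listed in Table B1).

%% Minimal sketch: joint inference of a d = 3 parameter model (A, B and C) and
%% six nuisance variables of likelihood function 14 (all numbers illustrative)
global LV                                   % bookkeeping of the nuisance variables

DREAMPar.N   = 10;                          % number of Markov chains
DREAMPar.T   = 5000;                        % number of generations
DREAMPar.d   = 9;                           % 3 model parameters + 6 nuisance variables
DREAMPar.lik = 14;                          % generalized likelihood of Schoups and Vrugt (2010)

% Ranges of A, B and C, followed by those of the selected nuisance variables
Par_info.initial = 'latin';                 % Latin hypercube sampling of the initial states
Par_info.min     = [  0    0    0    0    0   -1    0    0  0.1 ];
Par_info.max     = [ 10  100  500    1    1    1    1    1    1 ];

% Default values of all nuisance variables of likelihood 14 (ordering as in
% Table B1) and the indices of those selected for inference: 4 (sigma_0),
% 9 (phi_1) and 12 (K_BC) follow the text; 5, 6 and 13 are assumed here for
% sigma_1, beta and lambda_BC
LV.fpar     = [ 0.1  0  0  1  0  0  0  0  0  0  1 ];
LV.idx_vpar = [ 4  5  6  9  12  13 ];

Meas_info.Y = load('observations.txt');     % calibration data (hypothetical file name)

% Run the DREAM algorithm with the user-defined model function 'model'
[chain,output] = DREAM('model',DREAMPar,Par_info,Meas_info);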

Each nuisance variable in the DREAM package is assigned a unique label, hereafter also referred to as index or identifier. For example, the coefficients σ0, φ1 and K_BC in likelihood function 14 have index d+1, d+6 and d+9, respectively, which equates to an index of 4, 9 and 12 for a model involving d = 3 parameters. Those indexes of the nuisance variables which are stored in the field idx_vpar of global variable LV will be subject to inference. These nuisance variables of the likelihood function augment the model parameters. Nuisance variables not selected for inference are held constant at their default value declared by the user in field fpar of LV. Thus, in the MATLAB excerpt above the nuisance variables {σ0, σ1, β, φ1, K_BC, λ_BC} are subject to inference, whereas the remaining coefficients {ξ, μ1, φ2, φ3, φ4} of likelihood 14 will assume their respective default values of fpar.
A similar setup is used for likelihood function 17, except that the user has to separately define the values of the fields a, b, c and d of structure LV. These values define the anchor points to be used with piecewise cubic Hermite interpolation, details of which are given by Scharnagl et al. (2015). All nuisance variables are selected for inference in this setup of likelihood function 17, except the skewness parameter κ, which is assumed to be unity (no skew).
For completeness, I also consider an example for likelihood function 13 involving joint inference of the model parameters, the measurement error of the data, σ_t, t ∈ {1,…,n}, and the first-order autoregressive parameter, φ. The user is free to determine the measurement error model of the data as long as this is specified as an inline function object. In the present example a heteroscedastic error model is assumed; if homoscedasticity of the measurement error is expected, then the user can resort to another formulation of the inline function, for instance without the parameter a. Whatever mathematical formulation of the measurement error model is used, the ranges of its parameters should augment those of the model parameters stored in fields min and max of structure Par_info. These ranges are then followed by the range of the first-order autoregressive coefficient, φ. This order is consistent with that specified for likelihood function 13 in the third column of Table B1.
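A corresponding minimal sketch for likelihood function 13 is given below. The assignment of the inline object to field Sigma of Meas_info, the parameter names a and b, and the numeric ranges are assumptions for illustration only.

% Minimal sketch for likelihood function 13: heteroscedastic measurement error
% model sigma_t = a*Y + b specified as an inline object (assumed here to be
% passed via field Sigma of Meas_info), followed by joint inference of a, b
% and the first-order autoregressive coefficient phi
DREAMPar.lik    = 13;                       % AR(1) likelihood with measurement error
Meas_info.Sigma = inline('a * Y + b');      % heteroscedastic error model
% A homoscedastic alternative simply omits the parameter a, for example
% Meas_info.Sigma = inline('b * ones(size(Y))');

% Ranges of A, B and C, then of the error model parameters a and b, and the
% range of phi listed last (see the third column of Table B1)
Par_info.min = [  0    0    0 ,  0    0  , 0 ];
Par_info.max = [ 10  100  500 ,  1  0.5  , 1 ];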

The last example considered herein involves the use of likelihood functions 12 and 16, both of which share σ_t, t ∈ {1,…,n}, the measurement error of the data. This variable can be specified by the user in field Sigma of structure Meas_info (see section 4.4), or alternatively be estimated along with the parameters of the target distribution. I follow this second approach here. This setup is similar to that of likelihood function 13, except without the use of φ; likelihood functions 12 and 16 do not assume such a first-order autoregressive correction of the error residuals.

Appendix C

This Appendix presents the different model functions used in the seven case studies presented in Section 5 of this paper. These functions (.m files) serve as a template for users to help define their own forward model in DREAM. All model functions have as input argument a row vector x with d parameter values, and a single output argument that contains the likelihood, the log-likelihood, a vector with simulated values, or a vector with summary statistics, respectively. A low dash is used in the printout of each model script to denote the use of a standard built-in function of MATLAB.

Case study I: one-dimensional mixture distribution

The function mixture listed below uses the built-in normal probability density function of MATLAB, normpdf(), to calculate the density (likelihood) of the mixture distribution for a given candidate point, x.
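A minimal sketch of such a function is given below; the mixture weights, means and standard deviations are illustrative values and not necessarily those of the case study.

function L = mixture(x)
% Minimal sketch of a model function that returns the density (likelihood) of
% a univariate two-component Gaussian mixture at candidate point x; the
% weights, means and standard deviations below are illustrative placeholders
w  = [ 1/6  5/6 ];                          % component weights (sum to one)
mu = [ -8   10  ];                          % component means
sd = [  1    1  ];                          % component standard deviations

% Weighted sum of the two normal densities evaluated at x
L = w(1) * normpdf(x,mu(1),sd(1)) + w(2) * normpdf(x,mu(2),sd(2));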

Case study II: 100-dimensional t-distribution

The function t_distribution listed below takes advantage of the built-in functions log() and mvtpdf() of MATLAB to calculate the log-density (log-likelihood) of the multivariate t-probability density function with covariance matrix C and degrees of freedom, df. The persistent declaration helps retain variables C and df in local memory after the first function call has been completed. This is computationally appealing, as it avoids having to recompute these variables in subsequent function calls.
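A minimal sketch is given below; the dimensionality is taken from the input vector, and the correlation structure and the sixty degrees of freedom are illustrative choices rather than the exact target of the case study.

function log_L = t_distribution(x)
% Minimal sketch: log-density of a d-variate Student t target with correlation
% matrix C and df degrees of freedom (both illustrative). The persistent
% declaration retains C and df in memory after the first function call.
persistent C df

if isempty(C)                               % first call: set up the target once
    d  = numel(x);                          % dimensionality of the target
    C  = 0.5 * eye(d) + 0.5 * ones(d);      % unit variance, pairwise correlation of 1/2
    df = 60;                                % degrees of freedom
end

log_L = log( mvtpdf(x(:)',C,df) );          % log-likelihood of candidate point x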

Case study III: dynamic simulation model

The MATLAB function hydrus executes the HYDRUS-1D porous flow model and returns a vector with simulated soil moisture values. The HYDRUS-1D model is an executable file encoded with instructions in Fortran, and consequently it is not possible to pass the d parameter values, x, in MATLAB directly to this stand-alone program. I therefore have to resort to an alternative, and somewhat less efficient, approach. First, in the MATLAB script hydrus a file-writing command is used to replace the current values of the parameters in the input files of HYDRUS-1D with those of the proposal, x. Then, HYDRUS-1D is executed from within MATLAB using the dos command and functionality. After this call has terminated, a load command is used to read into the MATLAB workspace the output files created by the HYDRUS-1D program. The simulated soil moisture values are then isolated from the data stored in MATLAB memory and returned to the main DREAM program. To maximize computational efficiency, the option persistent is used to retain the structure data in local memory after the first function call has been completed.

Case study IV: likelihood-free inference

The MATLAB function hmodel simulates the rainfall-runoff transformation for parameter values, x, and returns four summary statistics (signatures) of watershed behavior. The source code of the hmodel is written in C and linked into a shared library using the MEX-compiler of MATLAB. This avoids file writing, and enables a direct passing of the parameter values, forcing data, and numerical solver settings to crr_model. A second-order time-variable integration method is used to solve the differential equations of the hmodel. The function Calc_metrics computes the four summary metrics using as input arguments the simulated discharge record and observed precipitation data.

Case study V: Bayesian model averaging

The MATLAB function BMA_calc returns the log-likelihood of the BMA model for a given proposal, x, consisting of weights and variances. The log-likelihood of the BMA model is computed as the log of the sum of the likelihoods of each ensemble member. In the example considered herein, the conditional distribution of each ensemble member is assumed to be Gaussian and with unknown variance.
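A minimal sketch of such a function is given below; the use of global variables D (an n-by-K matrix of ensemble forecasts) and y (the n verifying observations), the stacking of x as K weights followed by K variances, and the normalization of the weights are assumptions for illustration.

function log_L = BMA_calc(x)
% Minimal sketch: log-likelihood of the BMA model for proposal x, assumed to
% stack the K weights and K variances of the ensemble members. D (n-by-K
% ensemble forecasts) and y (n-by-1 observations) are assumed globals.
global D y

K  = size(D,2);                             % number of ensemble members
w  = x(1:K) / sum(x(1:K));                  % weights (normalized to sum to one)
s2 = x(K+1:2*K);                            % variances of the conditional pdfs

% Gaussian conditional distribution of each member; the BMA likelihood of each
% observation is the weighted sum of the member likelihoods
lik = zeros(numel(y),1);
for k = 1:K
    lik = lik + w(k) * normpdf(y,D(:,k),sqrt(s2(k)));
end
log_L = sum( log(lik) );                    % log of the product over all observations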

Case study VI: generalized likelihood uncertainty estimation

The MATLAB function lotka_volterra solves the predator-prey system for the proposal, x, and returns in a single vector, Y, their simulated abundances as a function of time (see the sketch at the end of this appendix). A time-variable integration method, ode45, is used for numerical solution of the two coupled differential equations. The parameters are defined as additional state variables of the Lotka-Volterra model so that their values can be passed directly to the inline function within the built-in ODE solver.

Case study VII: limits of acceptability

The MATLAB function heat_flow returns the simulated time series of soil temperatures at 5, 10 and 15 cm depth in the soil profile.
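Returning to case study VI, a minimal sketch of lotka_volterra is shown below. It uses an anonymous function for the right-hand side instead of the augmented-state formulation described above, and the parameter ordering, initial abundances and output times are illustrative assumptions.

function Y = lotka_volterra(x)
% Minimal sketch: integrate the Lotka-Volterra predator-prey equations for
% proposal x = [alpha beta gamma delta] and return the simulated abundances
% as one vector. Initial abundances and output times are illustrative.
u0 = [30; 4];                               % initial prey and predator abundance
t  = 0:1/12:20;                             % monthly output over twenty years

% Right-hand side: du1/dt = alpha*u1 - beta*u1*u2, du2/dt = -gamma*u2 + delta*u1*u2
dudt = @(t,u) [ x(1)*u(1) - x(2)*u(1)*u(2) ; ...
               -x(3)*u(2) + x(4)*u(1)*u(2) ];

[~,u] = ode45(dudt,t,u0);                   % time-variable integration
Y = u(:);                                   % stack prey and predator series in one vector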

Markov chain Monte Carlo sampling. Adv. Water Resour. 31, 630e648. http:// dx.doi.org/10.1016/j.advwatres.2007.12.003. Braakhekke, M.C., Wutzler, T., Beer, C., Kattge, J., Schrumpf, M., Ahrens, B., References Schoning,€ I., Hoosbeek, M.R., Kruijt, B., Kabat, P., Reichstein, M., 2013. Modeling the vertical soil organic matter profile using Bayesian parameter estimation. Ahrens, B., Reichstein, M., 2014. Reconciling 14C and minirhizotron-based estimates Biogeosciences 10, 399e420. http://dx.doi.org/10.5194/bg-10-399-2013. of fine-root turnover with survival functions. J. Plant Nutr. Soil Sci. 177, Brooks, S.P., Gelman, A., 1998. General methods for monitoring convergence of 287e296. http://dx.doi.org/10.1002/jpln.201300110. iterative simulations. J. Comput. Graph. Stat. 7, 434e455. Barthel, M., Hammerle, A., Sturm, P., Baur, T., Gentsch, L., Knohl, A., 2011. The diel Chauvenet, W., 1960. Reprint of 1891, fifth ed.. A Manual of Spherical and Practical imprint of leaf metabolism on the d13 13C signal of soil respiration under control Astronomy V. II. 1863, pp. 474e566 Dover, N.Y. and drought conditions. New Phytol. 192, 925e938. http://dx.doi.org/10.1111/ Chib, S., 1995. Marginal likelihood from the Gibbs output. J. Am. Stat. Assoc. 90, j.1469-8137.2011.03848.x. 1313e1321. Bates, B.C., Campbell, E.P., 2001. A Markov chain Monte Carlo scheme for parameter Clark, M.P., Kavetski, D., 2010. Ancient numerical daemons of conceptual hydro- estimation and inference in conceptual rainfall-runoff modeling. Water Resour. logical modeling: 1. Fidelity and efficiency of time stepping schemes. Water Res. 37 (4), 937e947. Resour. Res. 46, W10510. http://dx.doi.org/10.1029/2009WR008894. Bauwens, L., de Backer, B., Dufays, A., 2011. Estimating and Forecasting Structural Coelho, F.C., Codeço, C.T., Gomes, M.G.M., 2011. A Bayesian framework for parameter Breaks in Financial Time Series. Economics, Finance, Operations Research, estimation in dynamical models. PLoS ONE 6 (5), e19616. http://dx.doi.org/ Econometrics, and Statistics, Discussion paper, pp. 1e23. 10.1371/journal.pone.0019616. Beven, K., 2006. A manifesto for the equifinality thesis. J. Hydrol. 320 (1), 18e36. Davis, K., Li, Y., 2011. Fast solution of geophysical inversion using adaptive mesh, Beven, K.J., Binley, A.M., 1992. “The future of distributed models: model calibration space-filling curves and wavelet compression. Geophys. J. Int. 185 (1), 157e166. and uncertainty prediction. Hydrol. Process. 6, 279e298. DeCaluwe, S.C., Kienzle, P.A., Bhargava, P., Baker, A.M., Dura, J.A., 2014. Phase Beven, K.J., Binley, A.M., 2014. GLUE: 20 years on. Hydrol. Process. 28, 5879e5918. segregation of sulfonate groups in Nafion interface lamellae, quantified via http://dx.doi.org/10.1002/hyp.10082. neutron reflectometry fitting techniques for multi-layered structures. Soft Beven, K., Freer, J., 2001. “Equifinality, data assimilation, and uncertainty estimation Matter 10, 5763e6577. http://dx.doi.org/10.1039/C4SM00850B. in mechanistic modelling of complex environmental systems using the GLUE Dekker, S.C., Vrugt, J.A., Elkington, R.J., 2010. Significant variation in vegetation methodology. J. Hydrol. 249 (1e4), 11e29. http://dx.doi.org/10.1016/S0022- characteristics and dynamics from ecohydrologic optimality of net carbon 1694(01)00421-8. profit. Ecohydrology 5, 1e18. http://dx.doi.org/10.1002/eco.177. Bikowski, J., Huisman, J.A., Vrugt, J.A., Vereecken, H., van der Kruk, J., 2012. Inversion Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. 
“Maximum likelihood from incomplete and sensitivity analysis of ground penetrating radar data with waveguide data via the EM algorithm. J. R. Stat. Soc. 39 (B), 1e39. dispersion using deterministic and Markov chain Monte Carlo methods. Near Denison, D.G.T., Holmes, C.C., Mallick, B.K., Smith, A.F.M., 2002. Bayesian Methods Surf. Geophys. 10 (6), 641e652. http://dx.doi.org/10.3997/1873-0604.2012041. for Nonlinear Classification and Regression. John Wiley & Sons, Chicester. Blasone, R.S., Vrugt, J.A., Madsen, H., Rosbjerg, D., Zyvoloski, G.A., Robinson, B.A., Duan, Q., Schaake, J., Andreassian, V., Franks, S., Goteti, G., Gupta, H.V., Gusev, Y.M., 2008. Generalized likelihood uncertainty estimation (GLUE) using adaptive Habets, F., Hall, A., Hay, L., Hogue, T., Huang, M., Leavesley, G., Liang, X., 314 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316

Nasonova, O.N., Noilhan, J., Oudin, L., Sorooshian, S., Wagener, T., Wood, E.F., in hydrological modeling: 2. Application. Water Resour. Res. 42 (3), W03408. 2006. “Model parameter estimation Experiment (MOPEX): an overview of http://dx.doi.org/10.1029/2005WR004376. science strategy and major results from the second and third workshops. Keating, E.H., Doherty, J., Vrugt, J.A., Kang, Q., 2010. Optimization and uncertainty J. Hydrol. 320, 3e17. http://dx.doi.org/10.1016/j.jhydrol.2005.07.031. assessment of strongly nonlinear groundwater models with high parameter Dumont, B., Leemans, V., Mansouri, M., Bodson, B., Destain, J.-P., Destain, M.-F., 2014. dimensionality. Water Resour. Res. 46, W10517. http://dx.doi.org/10.1029/ “Parameter identification of the STICS crop model, using an accelerated formal 2009WR008584. MCMC approach. Environ. Model. Softw. 52, 121e135. Kikuchi, C.P., Ferre, T.P.A., Vrugt, J.A., 2015. “Discrimination-inference for measure- Dura, J.A., Kelly, S.T., Kienzle, P.A., Her, J.-H., Udovic, T.J., Majkrzak, C.F., Chung, C.-J., ment selection. Water Resour. Res. XX http://dx.doi.org/10.1002/wrcr.XXXX. Clemens, B.M., 2011. “Porous Mg formation upon dehydrogenation of MgH2 thin XXeXX. films. J. Appl. Phys. 109, 093501. Kirby, B.J., Rahman, M.T., Dumas, R.K., Davies, J.E., Lai, C.H., Liu, K., 2013. “Depth- Evin, G., Kavetski, D., Thyer, M., Kuczera, G., 2013. “Pitfalls and improvements in the resolved magnetization reversal in nanoporous perpendicular anisotropy joint inference of and autocorrelation in hydrological model multilayers. J. Appl. Phys. 113, 033909. http://dx.doi.org/10.1063/1.4775819. calibration. Water Resour. Res. 49, 4518e4524. http://dx.doi.org/10.1002/ Kow, W.Y., Khong, W.L., Chin, Y.K., Saad, I., Teo, K.T.K., 2012. “Enhancement of wrcr.20284. Markov chain monte Carlo convergence speed in vehicle tracking using genetic Freer, J., Beven, K.J., Ambroise, B., 1996. “Bayesian estimation of uncertainty in runoff operator. In: 2012 Fourth International Conference on Computational Intelli- prediction and the value of data: an application of the GLUE approach. Water gence, Modeling and Simulation (CIMSiM), pp. 270e275. http://dx.doi.org/ Resour. Res. 32, 2161e2173. 10.1109/CIMSim.2012.61. Gelman, A., Meng, X.L., 1998. Simulating normalizing constants: from importance Krayer, L., Lau, J.W., Kirby, B.J., 2014. “Structural and magnetic etch damage in CoFeB. sampling to bridge sampling to path sampling. Stat. Sci. 13 (2), 163e185. J. Appl. Phys. 115, 17B751. Gelman, A.G., Rubin, D.B., 1992. “Inference from iterative simulation using multiple Kuczera, G., 1983. “Improved parameter inference in catchment models, 1. Evalu- sequences. Stat. Sci. 7, 457e472. ating parameter uncertainty. Water Resour. Res. 19 (5), 1151e1162. http:// Gelman, A.G., Shirley, K., 2009. Inference from simulations and monitoring dx.doi.org/10.1029/WR019i005p01151. convergence. In: Meng, X.L., Gelman, A., Jones, G. (Eds.), The Handbook of Kuczera, G., Kavetski, D., Franks, S., Thyer, M., 2006. “Towards a Bayesian total error Markov Chain Monte Carlo. Chapman & Hall/CRC Press, pp. 163e174. analysis of conceptual rainfall-runoff models: characterising model error using Gelman, A.G., Roberts, G.O., Gilks, W.R., 1996. . Oxford University storm-dependent parameters. J. Hydrol. 331 (1), 161e177. Press, pp. 599e608. Kuczera, G., Kavetski, D., Renard, B., Thyer, M., 2010. 
“A limited memory acceleration Gentsch, L., Hammerle, A., Sturm, P., Ogee, J., Wingate, L., Siegwolf, R., Plüss, P., strategy for MCMC sampling in hierarchical Bayesian calibration of hydrological Baur, T., Buchmann, N., Knohl, A., 2014. Carbon isotope discrimination during models. Water Resour. Res. 46, W07602. http://dx.doi.org/10.1029/ branch photosynthesis of Fagus sylvatica: a Bayesian modeling approach. Plant 2009WR008985. Cell Environ. 37, 1516e1535. http://dx.doi.org/10.1111/pce.12262. Laloy, E., Vrugt, J.A., 2012. “High-dimensional posterior exploration of hydrologic Geweke, J., 1992. Evaluating the accuracy of sampling-based approaches to the models using multiple-try DREAM(ZS) and high-performance computing. Water calculation of posterior moments. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Resour. Res. 48, W01526. http://dx.doi.org/10.1029/2011WR010608. Smith, A.F.M. (Eds.), Bayesian Statistics 4Oxford University Press, pp. 169e193. Laloy, E., Linde, N., Vrugt, J.A., 2012. “Mass conservative three-dimensional water Gilks, W.R., Roberts, G.O., 1996. Strategies for improving MCMC. In: Gilks, W.R., tracer distribution from Markov chain Monte Carlo inversion of time-lapse Richardson, S., Spiegelhalter, D.J. (Eds.), Markov Chain Monte Carlo in Practice. ground-penetrating radar data. Water Resour. Res. 48, W07510. http:// Chapman & Hall, London, U.K, pp. 89e114. dx.doi.org/10.1029/2011WR011238. Gilks, W.R., Roberts, G.O., George, E.I., 1994. “Adaptive direction sampling. Statisti- Laloy, E., Rogiers, B., Vrugt, J.A., Jacques, D., Mallants, D., 2013. “Efficient posterior cian 43, 179e189. exploration of a high-dimensional groundwater model from two-stage Markov Green, P.J., 1995. Reversible jump Markov chain Monte Carlo computation and chain Monte Carlo simulation and polynomial chaos expansion. Water Resour. Bayesian model determination. Biometrika 82, 711e732. Res. 49 (5), 2664e2682. http://dx.doi.org/10.1002/wrcr.20226. Grubbs, F.E., 1950. “Sample criteria for testing outlying observations. Ann. Math. Laloy, E., Linde, N., Jacques, D., Vrugt, J.A., 2015. “Probabilistic inference of multi- Stat. 21 (1), 27e58. http://dx.doi.org/10.1214/aoms/1177729885. Gaussian fields from indirect hydrological data using circulant embedding Gupta, H.V., Wagener, T., Liu, Y., 2008. Reconciling theory with observations: ele- and dimensionality reduction. Water Resour. Res. 51, 4224e4243. http:// ments of a diagnostic approach to model evaluation. Hydrol. Process. 22 (18), dx.doi.org/10.1002/2014WR016395. 3802e3813. Laplace, P.S., 1774. “Memoire sur la probabilite des causes par les ev enements. Mem. Haario, H., Saksman, E., Tamminen, J., 1999. “Adaptive proposal distribution for l'Acad. R. Sci. Present. Divers Savan 6, 621e656. random walk Metropolis algorithm. Comput. Stat. 14, 375e395. Leventhal, G.E., Günthard, H.F., Bonhoeffer, S., Stadler, T., 2013. “Using an epide- Haario, H., Saksman, E., Tamminen, J., 2001. “An adaptive Metropolis algorithm. miological model for phylogenetic inference reveals density dependence in HIV Bernoulli 7, 223e242. transmission. Mol. Biol. Evol. 31 (1), 6e17. http://dx.doi.org/10.1093/molbev/ Haario, H., Saksman, E., Tamminen, J., 2005. Componentwise adaptation for high mst172. dimensional MCMC. Stat. Comput. 20, 265e274. Lewis, S.M., Raftery, A.E., 1997. “Estimating Bayes factors via posterior simulation Haario, H., Laine, M., Mira, A., Saksman, E., 2006. DRAM: efficient adaptive MCMC. with the Laplace-Metropolis estimator. J. Am. Stat. Assoc. 92 (438), 648e655. 
Stat.Comput. 16, 339e354. Linde, N., Vrugt, J.A., 2013. “Distributed soil moisture from crosshole ground- Hansen, T.M., Cordua, K.S., Mosegaard, K., 2012. Inverse problems with non-trivial penetrating radar travel times using stochastic inversion. Vadose Zone J. 12 priors: efficient solution through sequential . Comput. Geosci. (1) http://dx.doi.org/10.2136/vzj2012.0101. 16, 593e611. Lise, J., 2013. “On the job search and precautionary savings. Rev. Econ. Stud. 80 (3), Hastings, H., 1970. “Monte Carlo sampling methods using Markov chains and their 1086e1113. http://dx.doi.org/10.1093/restud/rds042. applications. Biometrika 57, 97e109. Lise, J., Meghir, C., Robin, J.-M., 2012. Mismatch, Sorting and Wage Dynamics. Hinnell, A.W., Ferre, T.P.A., Vrugt, J.A., Moysey, S., Huisman, J.A., Kowalsky, M.B., Working paper, 18719. National Bureau of Economic Research, pp. 1e43. http:// 2010. “Improved extraction of hydrologic information from geophysical data www.nber.org/papers/w18719. through coupled hydrogeophysical inversion. Water Resour. Res. 46, W00D40. Liu, J.S., Liang, F., Wong, W.H., 2000. “The multiple-try method and local optimi- http://dx.doi.org/10.1029/2008WR007060. zation in metropolis sampling. J. Am. Stat. Assoc. 95 (449), 121e134. http:// Horowitz, V.R., Aleman, B.J., Christle, D.J., Cleland, A.N., Awschalom, D.D., 2012. dx.doi.org/10.2307/2669532. “Electron spin resonance of nitrogen-vacancy centers in optically trapped Lochbühler, T., Vrugt, J.A., Sadegh, M., Linde, N., 2015. Summary statistics from nanodiamonds. Proc. Natl. Acad. U. S. A. 109 (34), 13493e13497. http:// training images as prior information in probabilistic inversion. Geophys. J. Int. dx.doi.org/10.1073/pnas.1211311109. 201, 157e171. http://dx.doi.org/10.1093/gji/ggv008. Iizumi, T., Tanaka, Y., Sakurai, G., Ishigooka, Y., Yokozawa, M., 2014. “Dependency of Lochbühler, T., Breen, S.J., Detwiler, R.L., Vrugt, J.A., Linde, N., 2014. “Probabilistic parameter values of a crop model on the spatial scale of simulation. J. Adv. electrical resistivity tomography for a CO2 sequestration analog. J. Appl. Geo- Model. Earth Syst. 06 http://dx.doi.org/10.1002/2014MS000311. phys. 107, 80e92. http://dx.doi.org/10.1016/j.jappgeo.2014.05.013. Jafarpour, B., 2011. “Wavelet reconstruction of geologic facies from nonlinear dy- Lu, D., Ye, M., Hill, M.C., Poeter, E.P., Curtis, G.P., 2014. “A computer program for namic flow measurements. IEEE Trans. Geosci. Remote Sens. 49 (5), 1520e1535. uncertainty analysis integrating regression and Bayesian methods. Environ. Jafarpour, B., Goyal, V.K., McLaughlin, D.B., Freeman, W.T., 2009. “Transform-domain Model. Softw. 60, 45e56. sparsity regularization for inverse problems in geosciences. Geophysics 74 (5), Malama, B., Kuhlman, K.L., James, S.C., 2013. “Core-scale solute transport model R69eR83. selection using Monte Carlo analysis. Water Resour. Res. 49, 3133e3147. http:// Jafarpour, B., Goyal, V.K., McLaughlin, D.B., Freeman, W.T., 2010. “Compressed his- dx.doi.org/10.1002/wrcr.20273. tory : exploiting transform-domain sparsity for regularization of Mari, L., Bertuzzo, E., Righetto, L., Casagrandi, R., Gatto, M., Rodriguez-Iturbe, I., nonlinear dynamic data integration problems. Math. Geosci. 42 (1), 1e27. Rinaldo, A., 2011. “Modeling cholera epidemics: the role of waterways, human Joseph, J.F., Guillaume, J.H.A., 2013. Using a parallelized MCMC algorithm in R to mobility and sanitation. J. R. Soc. Interface 9 (67), 376e388. identify appropriate likelihood functions for SWAT. Environ. Model. 
Softw. 46, McKay, M.D., Beckman, R.J., Conover, W.J., 1979. “A comparison of three methods for 292e298. selecting values of input variables in the analysis of output from a computer Kass, R.E., Raftery, A.E., 1995. Bayes factors. J. Am. Stat. Assoc. 90, 773e795. code. Technometrics 21 (2), 239e245. http://dx.doi.org/10.2307/1268522. Kavetski, D., Kuczera, G., Franks, S.W., 2006a. Bayesian analysis of input uncertainty Meng, X.L., Wong, W.H., 1996. “Simulating ratios of normalizing constants via a in hydrological modeling: 1. Theory. Water Resour. Res. 42 (3), W03407. http:// simple identity: a theoretical exploration. Stat. Sin. 6, 831e860. dx.doi.org/10.1029/2005WR004368. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E., 1953. Kavetski, D., Kuczera, G., Franks, S.W., 2006b. Bayesian analysis of input uncertainty “Equation of state calculations by fast computing machines. J. Chem. Phys. 21, J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316 315

1087e1092. Schaap, M.G., Leij, F.J., van Genuchten, M.Th, 1998. “Neural network analysis for Minasny, B., Vrugt, J.A., McBratney, A.B., 2011. “Confronting uncertainty in model- hierarchical prediction of soil water retention and saturated hydraulic con- based geostatistics using Markov chain Monte Carlo simulation. Geoderma ductivity. Soil Sci. Soc. Am. J. 62, 847e855. 163, 150e622. http://dx.doi.org/10.1016/j.geoderma.2011.03.011. Schaap, M.G., Leij, F.J., van Genuchten, M.Th., 2001. “Rosetta: a computer program Montanari, A., Toth, E., 2007. “Calibration of hydrological models in the spectral for estimating soil hydraulic parameters with hierarchical pedotransfer func- domain: an opportunity for scarcely gauged basins? Water Resour. Res. 43, tions. J. Hydrol. 251, 163e176. W05434. http://dx.doi.org/10.1029/2006WR005184. Scharnagl, B., Vrugt, J.A., Vereecken, H., Herbst, M., 2010. “Information content of Mosegaard, K., Tarantola, A., 1995. “Monte Carlo sampling of solutions to inverse incubation experiments for inverse estimation of pools in the Rothamsted problems. J. Geophys. Res. 100 (B7), 12431e12447. carbon model: a Bayesian perspective. Biogeosciences 7, 763e776. Nott, D.J., Marshall, L., Brown, J., 2012. “Generalized likelihood uncertainty esti- Scharnagl, B., Vrugt, J.A., Vereecken, H., Herbst, M., 2011. “Bayesian inverse mation (GLUE) and approximate Bayesian computation: what's the connection? modeling of soil water dynamics at the field scale: using prior information Water Resour. Res. 48 (12) http://dx.doi.org/10.1029/2011WR011128. about the soil hydraulic properties. Hydrol. Earth Syst. Sci. 15, 3043e3059. Oware, E., Moysey, S., Khan, T., 2013. “Physically based regularization of hydro- http://dx.doi.org/10.5194/hess-15-3043-2011. geophysical inverse problems for improved imaging of process-driven systems. Scharnagl, B., Iden, S.C., Durner, W., Vereecken, H., Herbst, M., 2015. “Inverse Water Resour. Res. 49 (10), 6238e6247. modelling of in situ soil water dynamics: accounting for heteroscedastic, Owejan, J.E., Owejan, J.P., DeCaluwe, S.C., Dura, J.A., 2012. “Solid electrolyte inter- autocorrelated, and non-Gaussian distributed residuals. Hydrol. Earth Syst. Sci. phase in Li-ion batteries: evolving structures measured in situ by neutron Discuss. 12, 2155e2199. reflectometry. Chem. Mater. 24, 2133e2140. Schoups, G., Vrugt, J.A., 2010. “A formal likelihood function for parameter and Owen, A.B., Tribble, S.D., 2005. “A quasi-Monte Carlo metropolis algorithm. Proc. predictive inference of hydrologic models with correlated, heteroscedastic and Natl. Acad. Sci. U. S. A. 102, 8844e8849. non-gaussian errors. Water Resour. Res. 46, W10531. http://dx.doi.org/10.1029/ Partridge, D.G., Vrugt, J.A., Tunved, P., Ekman, A.M.L., Gorea, D., Sorooshian, A., 2011. 2009WR008933. “Inverse modeling of cloud-aerosol interactions e part I: detailed response Schoups, G., Vrugt, J.A., Fenicia, F., van de Giesen, N.C., 2010. “Corruption of accuracy surface analysis. Atmos. Chem. Phys. 11, 4749e4806. http://dx.doi.org/10.5194/ and efficiency of Markov chain Monte Carlo simulation by inaccurate numerical acpd-11-4749-2011. implementation of conceptual hydrologic models. Water Resour. Res. 46, Partridge, D.G., Vrugt, J.A., Tunved, P., Ekman, A.M.L., Struthers, H., Sorooshian, A., W10530. http://dx.doi.org/10.1029/2009WR008648. 2012. “Inverse modeling of cloud-aerosol interactions - Part II: sensitivity tests Shafii, M., Tolson, B., Matott, L.S., 2014. 
Uncertainty-based multi-criteria calibration on liquid phase clouds using Markov chain Monte Carlo simulation approach. of rainfall-runoff models: a comparative study. Stoch. Environ. Res. Risk Assess. Atmos. Chem. Phys. 12, 2823e2847. http://dx.doi.org/10.5194/acp-12-2823- 28 (6), 1493e1510. 2012. Simunek, J., Sejna, M., van Genuchten, M. Th., 1998. The HYDRUS-1D software Peirce, B., 1852. “Criterion for the rejection of doubtful observations. Astron. J. II 45. package for simulating the one-dimensional movement of water, heat, and Price, K.V., Storn, R.M., Lampinen, J.A., 2005. Differential Evolution, a Practical multiple solutes in variably-saturated Media. In: V1.0, IGWMC-TPS-70, Inter- Approach to . Springer, Berlin. national Ground Water Modeling Center. Colorado School of Mines, Golden, CO, Radu, V.C., Rosenthal, J., Yang, C., 2009. “Learn from the thy neighbor: parallel-chain p. 186. and regional adaptive MCMC. J. Am. Stat. Assoc. 104 (488), 1454e1466. Smith, T., Sharma, A., Marshall, L., Mehrotra, R., Sisson, S., 2010. “Development of a Raftery, A.E., Lewis, S.M., 1992. “One long run with diagnostics: Implementation formal likelihood function for improved Bayesian inference of ephemeral strategies for Markov chain Monte Carlo. Stat. Sci. 7, 493e497. catchments. Water Resour. Res. 46, W12551. http://dx.doi.org/10.1029/ Raftery, A.E., Gneiting, T., Balabdaoui, F., Polakowski, M., 2005. “Using Bayesian 2010WR009514. model averaging to calibrate forecast ensembles. Mon. Weather Rev. 133, Sorooshian, S., Dracup, J.A., 1980. Stochastic parameter estimation procedures for 1155e1174. hydrologic rainfall-runoff models: correlated and heteroscedastic error cases. Renard, B., Kavetski, D., Kuczera, G., Thyer, M., Franks, S.W., 2010. “Understanding Water Resour. Res. 16 (2), 430e442. predictive uncertainty in hydrologic modeling: the challenge of identifying Starrfelt, J., Kaste, Ø, 2014. Bayesian uncertainty assessment of a semi-distributed input and structural errors. Water Resour. Res. 46, W05521. http://dx.doi.org/ integrated catchment model of phosphorus transport. Environ. Sci. Process. 10.1029/2009WR008328. Impacts 16, 1578e1587. http://dx.doi.org/10.1039/C3EM00619K. Renard, B., Kavetski, D., Leblois, E., Thyer, M., Kuczera, G., Franks, S.W., 2011. “Toward Storn, R., Price, K., 1997. Differential evolution - a simple and efficient heuristic for a reliable decomposition of predictive uncertainty in hydrological modeling: global optimization over continuous spaces. J. Glob. Optim. 11, 341e359. characterizing rainfall errors using conditional simulation. Water Resour. Res. Sun, X.-L., Wu, S.-C., Wang, H.-L., Zhao, Yu-G., Zhang, G.-L., Man, Y.B., Wong, M.H., 47 (11), W11516. http://dx.doi.org/10.1029/2011WR010643. 2013. Dealing with spatial outliers and mapping uncertainty for evaluating the Rinaldo, A., Bertuzzo, E., Mari, L., Righetto, L., Blokesch, M., Gatto, M., Casagrandi, R., effects of urbanization on soil: a case study of soil pH and particle fractions in Murray, M., Vesenbeckh, S.M., Rodriguez-Iturbe, I., 2012. “Reassessment of the Hong Kong. Geoderma 195e196, 220e233. 2010e2011 Haiti cholera outbreak and rainfall-driven multiseason projections. Tarasevich, B.J., Perez-Salas, U., Masic, D.L., Philo, J., Kienzle, P., Krueger, S., Proc. Natl. Acad. U. S. A. 109 (17), 6602e6607. Majkrzak, C.F., Gray, J.L., Shaw, W.J., 2013. Neutron reflectometry studies of the Rings, J., Vrugt, J.A., Schoups, G., Huisman, J.A., Vereecken, H., 2012. 
“Bayesian model adsorbed structure of the amelogenin, LRAP. J. Phys. Chem. B 117 (11), averaging using particle filtering and Gaussian mixture modeling: theory, 3098e3109. http://dx.doi.org/10.1021/jp311936j. concepts, and simulation experiments. Water Resour. Res. 48, W05520. http:// ter Braak, C.J.F., 2006. A Markov chain Monte Carlo version of the genetic algorithm dx.doi.org/10.1029/2011WR011607. differential evolution: easy Bayesian computing for real parameter spaces. Stat. Roberts, C.P., Casella, G., 2004. Monte Carlo Statistical Methods, second ed. Springer, Comput. 16, 239e249. New York. ter Braak, C.J.F., Vrugt, J.A., 2008. Differential evolution Markov chain with snooker Roberts, G.O., Gilks, W.R., 1994. “Convergence of adaptive direction sampling. updater and fewer chains. Stat. Comput. 18 (4), 435e446. http://dx.doi.org/ J. Multivar. Anal. 49, 287e298. 10.1007/s11222-008-9104-9. Roberts, G.O., Rosenthal, J.S., 2007. “Coupling and ergodicity of adaptive Markov Thiemann, M., Trosset, M., Gupta, H., Sorooshian, S., 2001. Bayesian recursive chain Monte Carlo algorithms. J. Appl. Probab. 44, 458e475. parameter estimation for hydrologic models. Water Resour. Res. 37 (10), Roberts, G.O., Gelman, A., Gilks, W.R., 1997. “Weak convergence and optimal scaling 2521e2535. http://dx.doi.org/10.1029/2000WR900405. of random walk Metropolis algorithms. Ann. Appl. Probab. 7, 110e120. Toyli, D.M., Christle, D.J., Alkauskas, A., Buckley, B.B., van de Walle, C.G., Rosas-Carbajal, M., Linde, N., Kalscheuer, T., Vrugt, J.A., 2014. “Two-dimensional Awschalom, D.D., 2012. Measurement and control of single nitrogen-vacancy probabilistic inversion of plane-wave electromagnetic data: methodology, center spins above 600 K. Phys. Rev. X 2, 031001. http://dx.doi.org/10.1103/ model constraints and joint inversion with electrical resistivity data. Geophys. J. PhysRevX.2.031001. Int. 196 (3), 1508e1524. http://dx.doi.org/10.1093/gji/ggt482. Turner, B.M., Sederberg, P.B., 2012. Approximate Bayesian computation with dif- Ruggeri, P., Irving, J., Holliger, K., 2015. “Systematic evaluation of sequential geo- ferential evolution. J. Math. Psychol. 56 (5), 375e385. http://dx.doi.org/10.1016/ statistical resampling within MCMC for posterior sampling of near-surface j.jmp.2012.06.004. geophysical inverse problems. Geophys. J. Int. 202, 961e975. http:// Upton, G., Cook, I., 1996. Understanding Statistics. Oxford University Press, p. 55. dx.doi.org/10.1093/gji/ggv196. van Genuchten, M.Th., 1980. A closed-form equation for predicting the hydraulic Sadegh, M., Vrugt, J.A., 2013. “Approximate Bayesian Computation in hydrologic conductivity of unsaturated soils. Soil Sci. Soc. Am. J. 44 (5), 892e898. http:// modeling: equifinality of formal and informal approaches. Hydrol. Earth Syst. dx.doi.org/10.2136/sssaj1980.03615995004400050002x. Sci. - Discuss. 10, 4739e4797. http://dx.doi.org/10.5194/hessd-10-4739-2013, Volpi, E., Vrugt, J.A., Schoups, G., 2015. Bayesian model selection with DREAM: 2013. multi-dimensional integration of the evidence. Water Resour. Res. XX http:// Sadegh, M., Vrugt, J.A., 2014. “Approximate Bayesian computation using Markov dx.doi.org/10.1002/2014WRXXXX. chain monte Carlo simulation: DREAM(ABC). Water Resour. Res. 50 http:// Vrugt, J.A., 2015. To be coherently incoherent: GLUE done with DREAM but much dx.doi.org/10.1002/2014WR015386. more accurate and efficient. J. Hydrol. XX doi:XX/XX.XX. Sadegh, M., Vrugt, J.A., Gupta, H.V., 2015a. “The soil water characteristic as new class Vrugt, J.A., 2015. 
The scientific method, Bayes theorem, diagnostic model evalua- of closed-form parametric expressions for the flow duration curve. Water tion, and summary metrics as prior information. Hydrol. Process. (submitted for Resour. Res. XX http://dx.doi.org/10.1002/2014WRXXXX. publication). Sadegh, M., Vrugt, J.A., Xu, C., Volpi, E., 2015b. “The stationarity paradigm revisited: Vrugt, J.A., Robinson, B.A., 2007a. Treatment of uncertainty using ensemble hypothesis testing using diagnostics, summary metrics, and DREAM(ABC). Water methods: comparison of sequential data assimilation and Bayesian model Resour. Res. XX. http://onlinelibrary.wiley.com/doi/10.1002/2014WR016805/ful, averaging. Water Resour. Res. 43, W01411. http://dx.doi.org/10.1029/ (in press), Published online. 2005WR004838. 316 J.A. Vrugt / Environmental Modelling & Software 75 (2016) 273e316

Vrugt, J.A., Sadegh, M., 2013. Toward diagnostic model calibration and evaluation: s00477-008-0274-y. approximate Bayesian computation. Water Resour. Res. 49 http://dx.doi.org/ Vrugt, J.A., ter Braak, C.J.F., Diks, C.G.H., Schoups, G., 2013. Advancing hydrologic 10.1002/wrcr.20354. data assimilation using particle Markov chain Monte Carlo simulation: theory, Vrugt, J.A., Gupta, H.V., Bouten, W., Sorooshian, S., 2003. A shuffled complex evo- concepts and applications. Adv. Water Resour. 51, 457e478. http://dx.doi.org/ lution metropolis algorithm for optimization and uncertainty assessment of 10.1016/j.advwatres.2012.04.002. Anniversary Issue e 35 Years. hydrologic model parameters. Water Resour. Res. 39 (8), 1201. http://dx.doi.org/ Whittle, P., 1953. Estimation and information in stationary time series. Ark. Mat. 2, 10.1029/2002WR001642. 423e434. Vrugt, J.A., Diks, C.G.H., Bouten, W., Gupta, H.V., Verstraten, J.M., 2005. Improved Wikle, C.K., Hooten, M.B., 2010. A general science-based framework for dynamic treatment of uncertainty in hydrologic modeling: combining the strengths of spatio-temporal models. Test 19, 417e451. http://dx.doi.org/10.1007/s11749- global optimization and data assimilation. Water Resour. Res. 41 (1), W01017. 010-0209-z. http://dx.doi.org/10.1029/2004WR003059. Wohling,€ T., Vrugt, J.A., 2011. “Multi-response multi-layer vadose zone model cali- Vrugt, J.A., ter Braak, C.J.F., Clark, M.P., Hyman, J.M., Robinson, B.A., 2008a. Treat- bration using Markov chain Monte Carlo simulation and field water retention ment of input uncertainty in hydrologic modeling: doing hydrology backward data. Water Resour. Res. 47, W04510. http://dx.doi.org/10.1029/2010WR009265. with Markov chain Monte Carlo simulation. Water Resour. Res. 44, W00B09. Yale, C.G., Buckley, B.B., Christle, D.J., Burkard, G., Heremans, F.J., Bassett, L.C., http://dx.doi.org/10.1029/2007WR006720. Awschalom, D.D., 2013. “All-optical control of a solid-state spin using coherent Vrugt, J.A., Diks, C.G.H., Clark, M.P., 2008c. Ensemble Bayesian model averaging dark states. Proc. Natl. Acad. Sci. U. S. A. 110 (19), 7595e7600. http://dx.doi.org/ using Markov chain Monte Carlo sampling. Environ. Fluid Mech. 8 (5e6), 10.1073/pnas.1305920110. 579e595. http://dx.doi.org/10.1007/s10652-008-9106-3. Zaoli, S., Giometto, A., Formentin, M., Azaele, S., Rinaldo, A., Maritan, A., 2014. Vrugt, J.A., ter Braak, C.J.F., Diks, C.G.H., Higdon, D., Robinson, B.A., Hyman, J.M., “Phenomenological modeling of the motility of self-propelled microorganisms. 2009a. Accelerating Markov chain Monte Carlo simulation by differential evo- arXiv, 1407.1762.. lution with self-adaptive randomized subspace sampling. Int. J. Nonlinear Sci. Zilliox, C., Gosselin, Fred eric, 2014. “Tree species diversity and abundance as in- Numer. Simul. 10 (3), 273e290. dicators of understory diversity in French mountain forests: Variations of the Vrugt, J.A., ter Braak, C.J.F., Gupta, H.V., Robinson, B.A., 2009b. Equifinality of formal relationship in geographical and ecological space. For. Ecol. Manag. 321 (1), (DREAM) and informal (GLUE) Bayesian approaches in hydrologic modeling. 105e116. Stoch. Environ. Res. Risk Assess. 23 (7), 1011e1026. http://dx.doi.org/10.1007/