Dynamical Systems, Prediction, Inference. Time Series Data
Modeling time series data - understanding the underlying processes: dynamical systems, prediction, inference.

Time series data
Values of a variable or variables that come with a time stamp. Time can be continuous, $x(t)$, or discrete, $\ldots, x_{t-T}, \ldots, x_t, x_{t+1}, \ldots, x_{t+T}, \ldots$ The intervals between measurements do not have to be equal.

Analysis tools
Many! Some examples:
- Frequency analysis: Fourier, wavelet, etc.: find a representation of the time-dependent signal $x(t)$ in the frequency domain, using different basis functions (sinusoids, step functions, ...).
- Nonlinear analysis: Poincaré map, deterministic chaos.

Models
- Linear models. Examples: LPC (linear predictive coding), linear filters.
- Nonlinear models: dynamical-systems modeling - a model of the underlying process.

Dynamical System = {State space S; Dynamic f: S -> S; Initial condition}
The dynamic is the equations of motion,
$\dot{s} = f(s)$ or $s(t) = f[s(t-1)]$,
or transition probabilities $p(s' \mid s)$, $s, s' \in S$.

Tasks
- Prediction: infer the future from the past.
- Understanding: build a model that gives insight into the underlying physical phenomena.
- Diagnostics: what possible pasts may have led to the current state?

Applications
Financial and economic forecasting, environmental modeling, medical diagnosis, industrial equipment diagnosis, speech recognition, many more...

Linear models and LPC
Linear model: $y = \sum_{i \in M} \alpha_i x_i$
Possible uses:
- Interpolation: $y = \hat{x}_k$, $k \notin M$
- Prediction: $x_i = x_t$; $y = \hat{x}_{t+1}$
Task: find those parameters $\alpha_i$ that minimize the mean squared error (MSE).

Prediction (linear filters)
Model: $\hat{x}_{t+1} = \sum_{i=0}^{T} \alpha_i x_{t-i}$
MSE: $\frac{1}{2}\left\langle (x_{t+1} - \hat{x}_{t+1})^2 \right\rangle = \ ?$
Set the derivative with respect to $\alpha_i$ to zero => system of linear equations => solution.

Prediction (linear filters)
Model: $\hat{x}_{t+1} = \sum_{i=0}^{T} \alpha_i x_{t-i}$
MSE:
$\frac{1}{2}\left\langle (x_{t+1} - \hat{x}_{t+1})^2 \right\rangle = \frac{1}{2}\sum_i \sum_j \alpha_i \alpha_j \langle x_{t-i} x_{t-j} \rangle - \sum_i \alpha_i \langle x_{t-i} x_{t+1} \rangle + \frac{1}{2}\langle x_{t+1}^2 \rangle$
Derivative: $\sum_j \alpha_j \langle x_{t-i} x_{t-j} \rangle - \langle x_{t-i} x_{t+1} \rangle = 0$
In matrix notation: $\vec{\alpha}\, C = \vec{\gamma}$, with $C_{ij} = \langle x_{t-i} x_{t-j} \rangle$ and $\gamma_i = \langle x_{t-i} x_{t+1} \rangle$.
Solution: $\vec{\alpha} = \vec{\gamma}\, C^{-1}$
(A small numerical sketch of this solution appears after the Markov-process example below.)

Issues
We need to compute an ensemble average - in reality, replace it by a time average. What if the data are non-stationary?
Adaptive filters:
- replace the average by instantaneous values
- assume a slowly changing objective function
- change the parameters in the direction of the gradient at each time step.
Frequently used! Example: echo cancellation.
But - most processes are nonlinear.

Dynamical Systems
Distinguish cases according to:
- States: discrete or continuous
- Time: discrete or continuous
Both continuous - model with differential equations. Both discrete - use HMMs.
Spatial system? Space: discrete or continuous? Cellular automata; reaction-diffusion equations.

Stochastic processes
A chain of random variables $X_t$ with realization $\ldots, x_{t-T}, \ldots, x_t, x_{t+1}, \ldots, x_{t+T}, \ldots$ (past and future).
Alphabet: $x_t \in \mathcal{A}$
Block of length T: $X_t^T = X_t X_{t+1} \ldots X_{t+T-1}$
Length-T word: $x_t^T = x_t x_{t+1} \ldots x_{t+T-1} \in \mathcal{A}^T$

Stochastic processes
Uniform process: equal-length sequences occur with the same probability,
$\Pr(X_t^T) = \Pr(X_t X_{t+1} \ldots X_{t+T-1}) = |\mathcal{A}|^{-T}$
Example: fair coin. A = {H,T}; Pr(H) = Pr(T) = 0.5
Independent, Identically Distributed (IID) process:
$\Pr(X_t^T) = \prod_{i=1}^{T} \Pr(X_{t+i-1})$
Example: biased coin. Pr(H) = p; Pr(T) = 1-p = q; with n the number of heads in the sequence, $\Pr(X_t^T) = p^n q^{T-n}$

Stochastic processes
L-block process. Example: a 2-block process with no consecutive zeros.
A = {0,1}; Pr(00) = 0; Pr(01) = 0; Pr(10) = 0.5; Pr(11) = 0.5
Pr(11101011) = Pr(11)Pr(10)Pr(10)Pr(11) = 0.5^4 = 0.0625

Markov process
$\Pr(\{X_t\}) = \prod_t P(X_{t+1} \mid X_t)$
Example: no consecutive zeros (Golden Mean Process).
A = {0,1}; Pr(0|0) = 0; Pr(1|0) = 1; Pr(0|1) = Pr(1|1) = 0.5
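The linear-filter solution above (set the derivative of the MSE to zero and solve the normal equations, with time averages in place of ensemble averages) can be written out in a few lines. This is a minimal sketch, assuming numpy; the synthetic sinusoid, the function name fit_linear_predictor, and the filter order are illustrative choices, not part of the lecture.

import numpy as np

# Sketch: fit a linear predictor x_hat[t+1] = sum_i alpha[i] * x[t-i]
# by solving the normal equations  C alpha = gamma  with
# C[i,j] = <x[t-i] x[t-j]>  and  gamma[i] = <x[t-i] x[t+1]>,
# where <.> is replaced by a time average over the training signal.

def fit_linear_predictor(x, order):
    n_samples = len(x)
    rows, targets = [], []
    for t in range(order - 1, n_samples - 1):
        rows.append(x[t - np.arange(order)])   # x[t-i], i = 0..order-1
        targets.append(x[t + 1])
    X = np.array(rows)
    y = np.array(targets)
    C = X.T @ X / len(y)          # time-averaged correlation matrix
    gamma = X.T @ y / len(y)      # time-averaged cross-correlation
    return np.linalg.solve(C, gamma)

# toy usage: a noisy sinusoid is well captured by a short linear filter
t = np.arange(2000)
x = np.sin(0.07 * t) + 0.05 * np.random.randn(len(t))
alpha = fit_linear_predictor(x, order=4)
x_hat = alpha @ x[-4:][::-1]      # one-step prediction from the last 4 samples
print("coefficients:", alpha, "next-value prediction:", x_hat)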
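To make the Golden Mean process concrete, here is a small simulation sketch, again assuming numpy; the sampler name and sequence length are illustrative. It draws a sequence from the transition probabilities given above and checks that "00" never occurs.

import numpy as np

# Sketch: sample the Golden Mean process, Pr(0|0)=0, Pr(1|0)=1, Pr(0|1)=Pr(1|1)=0.5.

def sample_golden_mean(n_steps, seed=0):
    rng = np.random.default_rng(seed)
    p_zero_given = {0: 0.0, 1: 0.5}    # Pr(next = 0 | current symbol)
    x = [1]                             # start from symbol 1
    for _ in range(n_steps - 1):
        x.append(0 if rng.random() < p_zero_given[x[-1]] else 1)
    return x

seq = sample_golden_mean(100_000)
s = "".join(map(str, seq))
print("'00' occurs:", "00" in s)                    # never, by construction
print("fraction of 1s:", seq.count(1) / len(seq))   # approx 2/3 in the stationary distribution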
n-th order Markov process
$\Pr(X_t \mid X_{t-1}, \ldots, X_{t-n}, \ldots) = \Pr(X_t \mid X_{t-1}, \ldots, X_{t-n})$
More general than an n-block process:
$\Pr(X_1^n X_2^n) = P(X_2^n \mid X_1^n)\, P(X_1^n)$, which equals $P(X_2^n)\, P(X_1^n)$ only if the blocks are independent.

Hidden Markov Process
An internal n-th order Markov process over internal states S:
$\Pr(S_t \mid S_{t-1}, \ldots, S_{t-n}, \ldots) = \Pr(S_t \mid S_{t-1}, \ldots, S_{t-n})$
When the system is in state s, it produces a symbol x with probability p(x|s): the observation process.
Observations: $\ldots, x_{t-T}, \ldots, x_t, x_{t+1}, \ldots, x_{t+T}, \ldots$
Observed process: block distribution $\Pr(X^T)$

Hidden Markov Model
Model the underlying dynamics as a hidden Markov process.
[Figure: K hidden states (state 1, state 2, ..., state K) connected by transition probabilities such as p(s=2|s=1), p(s=1|s=1), p(s=1|s=2); each state emits observations x(t) with probability p(x|s).]

Fitting an HMM to data
Infer the most likely dynamical system from the observable sequence of outputs. Find the maximum-likelihood parameters (Bayesian estimation).
Distributional parameters:
- transition probabilities p(s(t)|s(t-1))
- observation probabilities p(x|s)
- initial state distribution p(s(0))

Baum-Welch Algorithm
An EM algorithm: iteratively estimate the parameters and calculate the likelihood of the data and the expected frequencies, given the model.
1. Choose arbitrary parameters.
2. Calculate the likelihood and the expected frequencies.
3. Substitute the expected frequencies so obtained for the old parameters.
Iterate until there is no improvement.

Learning HMMs
The Baum-Welch algorithm will find a local optimum.
Interpretation of the resulting hidden-state model? Are there special HMMs which have a (physical) meaning? How about prediction?

Let's play a little game...
Past: 111111111111111111111111111111111111111111111111 - Future? Your model?
Past: 01010101010101010101010101010101 - Future? Your model?
Past: 001010001001001010100010101000001 - Future? Your model?
Past: 001010001001001010100010101000001 - Future: 0 ...
Model: a two-state machine (s1, s2) with
p(s=2|s=1, x=1) = 1; p(s=1|s=1, x=0) = 1; p(s=1|s=2, x=0) = 1
p(x=0|s=1) = 0.5; p(x=0|s=2) = 1

Causal States (J. Crutchfield, 1989)
Some histories are equivalent with regard to the development of the future!
Equivalence class: two histories h and h' are equivalent if the distribution over the future given either of those histories is the same: p(future|h) = p(future|h') -> h ~ h'.
Define causal states s: for all h in the same equivalence class, p(future|s) = p(future|h).

Causal States
Model the underlying dynamical system:
- Causal state set S: the causal states model the underlying states of the system.
- Conditional transition probabilities p(s', x | s) [x is the next observation]: they model the dynamics of the system.
- Initial conditions.

The ε-machine (J. Crutchfield, 1989)
A set containing:
- the mapping from histories to causal states, $s = \epsilon(h)$ -> the set of causal states S,
- together with the transition probabilities $T_{ij}^{(x)} = p(s_j, x \mid s_i)$:
$M = \{S, \{T^{(x)}, x \in \mathcal{A}\}\}$

Properties of the ε-machine
- Causal shielding: conditional independence of future and past, given the causal states.
- Deterministic
- Markovian
- Optimal predictor
- Minimal size
- Unique
"Probabilistic bisimulation" (1989). Bisimulation was coined by R. Milner (80s) in the CS literature.

Conditional independence of future & past
Past and future are independent given the causal state: p(future, past|s) = p(future|s) p(past|s).
The causal states shield past and future from each other: a Markov chain.
Why is this true? By definition! Because p(future|s) = p(future|past) and therefore p(future|past, s) = p(future|s), since p(future|past, s) = p(future|past). [Data Proc. Ineq.]
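The equivalence-class definition suggests a brute-force empirical procedure: estimate p(next symbol | recent history) from data and group histories whose conditional distributions (approximately) agree. The sketch below assumes numpy, a binary sequence, a fixed history length, and a simple tolerance threshold; all names are illustrative, and a practical reconstruction algorithm would use statistical tests rather than a fixed tolerance.

import numpy as np
from collections import defaultdict

# Sketch: group length-L histories of a binary sequence by their empirical
# distribution over the next symbol; histories with (approximately) equal
# conditional futures belong to the same candidate causal state.

def candidate_causal_states(seq, L=3, tol=0.05):
    counts = defaultdict(lambda: [0, 0])            # history -> [#next=0, #next=1]
    for t in range(L, len(seq)):
        counts[tuple(seq[t - L:t])][seq[t]] += 1
    # empirical conditional probability p(next=1 | history)
    p1 = {h: c[1] / sum(c) for h, c in counts.items() if sum(c) > 0}
    # greedy grouping of histories with matching conditional distributions
    states = []                                     # list of [representative p1, [histories]]
    for h, p in sorted(p1.items()):
        for state in states:
            if abs(state[0] - p) < tol:
                state[1].append(h)
                break
        else:
            states.append([p, [h]])
    return states

# toy usage with a Golden Mean sequence (same process as the earlier sampler):
# two groups emerge, histories ending in 0 (next symbol is 1 with probability 1)
# and histories ending in 1 (next symbol is 1 with probability ~0.5).
rng = np.random.default_rng(1)
seq = [1]
for _ in range(50_000):
    seq.append(1 if seq[-1] == 0 else int(rng.random() < 0.5))
for p, hs in candidate_causal_states(seq):
    print(f"p(next=1|state) ~ {p:.2f}  histories: {hs}")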
Deterministic
The ε-machine is deterministic in the sense that, given a causal state s at time t and a measurement x, there is a unique next causal state: a deterministic mapping f: (s, x) -> s'.
To prove this, we need to assume that the process is stationary!

Markovian
$p(s_t \mid s_{t-1}, s_{t-2}, \ldots) = p(s_t \mid s_{t-1})$

Optimal predictor
Let r be a state from a rival partition R. Then
$H[\text{future} \mid s] \le H[\text{future} \mid r]$
because
$H[\text{future} \mid s] = H[\text{future} \mid \text{past}] \le H[\text{future} \mid r]$
Therefore the entropy rate of the causal states equals the entropy rate of the time series:
$h(s) = \lim_{L \to \infty} \tfrac{1}{L} H[X^L_{\text{future}} \mid s] = \lim_{L \to \infty} \tfrac{1}{L} H[X^L_{\text{future}} \mid \text{past}] = h$
Causal states contain every difference (in the past) that makes a difference (to the future). Causal states are sufficient statistics for predicting the future!

Minimal
Causal states are the most compact description out of all partitions of histories which have equal predictive power:
$H[S] \le H[R]$
because rival partitions R have the same predictive power only when they are refinements of S; otherwise their prediction is a statistical mixture of the causal-state predictions. But that increases the entropy:
$H\!\left[\sum_i c_i p_i\right] \ge \sum_i c_i H[p_i]$
Minimal sufficient statistics!

Unique
Any rival partition which is as predictive and of the same size is the same as the causal state partition (up to relabeling of the states). Because for a rival partition R of the same size as S: H[R] = H[S], and "as predictive as" means H[future|R] = H[future|S].
Unique minimal sufficient statistics!

The ε-machine
- Optimal predictor: lower prediction error than any rival model.
- Minimal size: smallest of the prescient rivals.
- Unique: any smallest optimal predictor is equivalent.
- Model of the process: reproduces all of the process's statistics; renders the process's future independent of its past.
You can calculate the entropy rate and all statistics of a process from its ε-machine!

Issues
- An efficient algorithm for finding the causal state partition and the transition probabilities.
- What if we are content with some prediction error, because we want a more compact representation? Is there a systematic way of trading error for model complexity?
Will be addressed in the next lecture...

Readings
LPC: there are many places; I find "Numerical Recipes in C" gives an extremely useful and compact presentation.
HMMs: Rabiner, IEEE, 1994 and references therein.
Causal states and the ε-machine: Crutchfield and Shalizi, 1999 and references therein.
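As a concrete illustration of the claim above that the entropy rate follows directly from the ε-machine, here is a small sketch for the Golden Mean process used throughout, assuming numpy; the state labels and variable names are illustrative. The state distribution is the stationary left eigenvector of the state-to-state transition matrix, and the entropy rate is the stationary average of the per-state symbol entropies.

import numpy as np

# epsilon-machine of the Golden Mean process: state A = "last symbol was 1",
# state B = "last symbol was 0".  From A emit 0 or 1 with prob 1/2 (0 leads to B,
# 1 stays in A); from B emit 1 with prob 1 and return to A.
# T[x][i, j] holds the labeled transition probability p(s_j, x | s_i).
T = {0: np.array([[0.0, 0.5],
                  [0.0, 0.0]]),     # emit 0: A -> B with prob 0.5
     1: np.array([[0.5, 0.0],
                  [1.0, 0.0]])}     # emit 1: A -> A with prob 0.5, B -> A with prob 1

# state-to-state transition matrix (sum over emitted symbols)
P = T[0] + T[1]

# stationary distribution: left eigenvector of P with eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()                               # expect [2/3, 1/3]

# entropy rate: h = sum_s pi(s) * H[ p(x|s) ]
def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x_given_s = np.stack([T[0].sum(axis=1), T[1].sum(axis=1)], axis=1)  # row s: [p(0|s), p(1|s)]
h = sum(pi[s] * H(p_x_given_s[s]) for s in range(len(pi)))
print("stationary state distribution:", pi)       # ~ [0.667, 0.333]
print("entropy rate (bits/symbol):", h)           # ~ 2/3

This gives h = 2/3 bits per symbol: the machine is uncertain only in the state reached after a 1, which it occupies two thirds of the time.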