
Modeling time series data - understanding the underlying processes

Dynamical systems, prediction, inference.

Time series data

Values of a variable or variables that come with a time stamp.

Time can be continuous, x(t), or discrete: ..., x_{t-T}, ..., x_t, x_{t+1}, ..., x_{t+T}, ... The intervals between measurements do not have to be equal.

Analysis tools

Many! Some examples:

Frequency analysis: Fourier, wavelet, etc.: find a representation of the time-dependent x(t) in the frequency domain, using different basis functions (sinusoids, step functions, ...).
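As a quick illustration (my own example, not from the slides): the dominant frequency of a noisy sinusoid can be recovered with the discrete Fourier transform. The sampling rate and signal parameters below are made up.

```python
import numpy as np

# Hypothetical example: recover the dominant frequency of a noisy 5 Hz
# sinusoid with the discrete Fourier transform.
fs = 100.0                        # sampling rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)      # 10 s of samples
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 5.0 * t) + 0.1 * rng.standard_normal(t.size)

spectrum = np.abs(np.fft.rfft(x))            # magnitude spectrum
freqs = np.fft.rfftfreq(t.size, d=1 / fs)    # frequency of each bin
dominant = freqs[np.argmax(spectrum)]        # strongest frequency component
```

With 1000 samples at 100 Hz the frequency resolution is 0.1 Hz, so the 5 Hz component falls exactly on a bin.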

Nonlinear analysis: Poincaré maps, deterministic chaos.

Models

Linear models. Examples: LPC (linear predictive coding), linear filters.

Nonlinear models: dynamical systems - model of the underlying process.

Dynamical system = { state space S; dynamic f: S -> S }

Dynamic = equation of motion: ṡ = f(s) (continuous time), s(t) = f[s(t-1)] (discrete time), or transition probabilities p(s'|s), s, s' ∈ S.

Tasks

Prediction: infer future from past

Understanding: build model that gives insight into underlying physical phenomena

Diagnostic: what possible pasts may have led to the current state?

Applications

financial and economic forecasting
environmental modeling
medical diagnosis
industrial equipment diagnosis
speech recognition
many more...

Linear models and LPC

Linear model: y = Σ_{i∈M} α_i x_i

Possible uses:
Interpolation: y = x̂_k, k ∉ M
Prediction: the x_i are past values x_t, x_{t-1}, ...; y = x̂_{t+1}

Task: find the parameters α_i that minimize the mean squared error (MSE).

Prediction (linear filters)

Model: x̂_{t+1} = Σ_{i=0}^{T} α_i x_{t-i}

MSE: (1/2) ⟨(x_{t+1} - x̂_{t+1})²⟩ = ?

Set the derivative w.r.t. α_i to zero => system of linear equations => solution.

Model: x̂_{t+1} = Σ_{i=0}^{T} α_i x_{t-i}

MSE: (1/2) ⟨(x_{t+1} - x̂_{t+1})²⟩ = (1/2) Σ_i Σ_j α_i α_j ⟨x_{t-i} x_{t-j}⟩ - Σ_i α_i ⟨x_{t-i} x_{t+1}⟩ + (1/2) ⟨x_{t+1}²⟩

Derivative: Σ_j α_j ⟨x_{t-i} x_{t-j}⟩ - ⟨x_{t-i} x_{t+1}⟩ = 0

In matrix notation: α⃗ C = γ⃗, with C_ij = ⟨x_{t-i} x_{t-j}⟩ and γ_i = ⟨x_{t-i} x_{t+1}⟩. Solution: α⃗ = γ⃗ C⁻¹

Issues

Need to compute the ensemble average - in reality, replace it by a time average.
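A minimal numerical sketch of this fit, with the ensemble averages replaced by time averages over one long realization. The function name and the AR(2) test signal are my own, not from the slides.

```python
import numpy as np

# Sketch of the linear predictor: x_hat_{t+1} = sum_i a_i x_{t-i}.
# Ensemble averages <.> are replaced by time averages over the data.
def fit_linear_predictor(x, order):
    # Rows of lagged values [x_t, x_{t-1}, ..., x_{t-order+1}]
    rows = np.array([x[t - order + 1:t + 1][::-1]
                     for t in range(order - 1, len(x) - 1)])
    targets = x[order:]                 # the x_{t+1} values to predict
    C = rows.T @ rows / len(rows)       # C_ij = <x_{t-i} x_{t-j}>
    g = rows.T @ targets / len(rows)    # g_i  = <x_{t-i} x_{t+1}>
    return np.linalg.solve(C, g)        # solves a C = g for a

# Usage: an AR(2) process x_t = 0.6 x_{t-1} - 0.2 x_{t-2} + noise
rng = np.random.default_rng(0)
x = np.zeros(5000)
for t in range(2, len(x)):
    x[t] = 0.6 * x[t - 1] - 0.2 * x[t - 2] + rng.normal(scale=0.1)
a = fit_linear_predictor(x, order=2)    # recovers roughly [0.6, -0.2]
```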

What if the data are non-stationary?

Adaptive filters:
- replace the average by instantaneous values
- assume a slowly changing objective function
- move the parameters along the (negative) gradient at each time step

Frequently used! Example: echo cancellation.
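A minimal LMS-style sketch of such an adaptive filter (my own construction with illustrative constants): learn the taps of an unknown FIR "echo path" online by stepping along the instantaneous gradient.

```python
import numpy as np

# LMS adaptive filter sketch: a gradient step on the instantaneous squared
# error at each time step, in the spirit of echo cancellation.
# The "true" system and all constants below are illustrative assumptions.
rng = np.random.default_rng(1)
true_w = np.array([0.5, -0.3, 0.1])     # unknown echo path to identify
x = rng.normal(size=20000)              # reference signal
mu = 0.01                               # step size (learning rate)

w = np.zeros(3)
for t in range(3, len(x)):
    tap = x[t - 3:t][::-1]              # last 3 samples, newest first
    d = true_w @ tap                    # desired signal (the echo)
    e = d - w @ tap                     # instantaneous prediction error
    w += mu * e * tap                   # gradient step on e^2 / 2
# w converges toward true_w
```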

But: most processes are nonlinear.

Dynamical Systems

Distinguish cases according to:
States: discrete or continuous
Time: discrete or continuous

Both continuous - model with differential equations.

Both discrete - use HMMs.

Spatial system? Space: discrete or continuous? Cellular automata; reaction-diffusion equations.

Stochastic processes

A chain of random variables X_t; realization: ..., x_{t-T}, ..., x_t, x_{t+1}, ..., x_{t+T}, ... (past: up to and including x_t; future: from x_{t+1} on)

Alphabet: x_t ∈ A

Block of length T: X_t^T = X_t X_{t+1} ... X_{t+T-1}

Length-T word: x_t^T = x_t x_{t+1} ... x_{t+T-1} ∈ A^T

Stochastic processes

Uniform process: equal-length sequences occur with the same probability: Pr(X_t^T) = Pr(X_t X_{t+1} ... X_{t+T-1}) = 1/|A|^T

Example: fair coin, A = {H,T}; Pr(H) = Pr(T) = 0.5

Independent, Identically Distributed (IID) process: Pr(X_t^T) = Π_{i=1}^{T} Pr(X_{t+i-1})

Example: biased coin, Pr(H) = p; Pr(T) = 1-p = q; with n the number of heads in the sequence: Pr(X_t^T) = p^n q^{T-n}
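The biased-coin block probability can be computed directly; a tiny sketch (the helper name is illustrative):

```python
# Block probability for the IID biased-coin process:
# Pr(word) = p^n * q^(T-n), with n the number of heads in the word.
def word_probability(word, p):
    n = word.count("H")
    return p ** n * (1 - p) ** (len(word) - n)

prob = word_probability("HHT", 0.7)   # 0.7 * 0.7 * 0.3 = 0.147
```

For the fair coin (p = 0.5) this reduces to the uniform process, 1/2^T.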

L-Block Process:

Example: 2-block process with no consecutive zeros

A = {0,1}

Pr(00) = 0; Pr(01) = 0; Pr(10) = 0.5; Pr(11) = 0.5

Pr(11101011) = Pr(11)Pr(10)Pr(10)Pr(11) = 0.0625

Markov process

Pr({X_t}) = Π_t Pr(X_{t+1} | X_t)

Example: no consecutive zeros (Golden Mean Process). A = {0,1}

Pr(0|0) = 0; Pr(1|0) = 1; Pr(0|1) = Pr(1|1) = 0.5

n-th order Markov process:

Pr(X_t | X_{t-1}, ..., X_{t-n}, ...) = Pr(X_t | X_{t-1}, ..., X_{t-n})

More general than an n-block process: Pr(X_1^n X_2^n) = Pr(X_2^n | X_1^n) Pr(X_1^n), which equals Pr(X_2^n) Pr(X_1^n) only if the blocks are independent.

Hidden Markov Process

Internal n-th order Markov Process over internal states S.

Pr(S_t | S_{t-1}, ..., S_{t-n}, ...) = Pr(S_t | S_{t-1}, ..., S_{t-n})

When the system is in state s, it produces a symbol x with probability p(x|s): the observation process. Observations: ..., x_{t-T}, ..., x_t, x_{t+1}, ..., x_{t+T}, ...

Observed process: block distribution Pr(X^T)

Hidden Markov Model

Model the underlying dynamics as a hidden Markov process.

[Diagram: hidden states 1, 2, ..., K, with transition probabilities p(s'|s) between states; each state s emits observations x(t) with probability p(x|s).]

Fitting HMM to data

Infer the most likely model from the observable sequence of outputs. Find maximum likelihood parameters (Bayesian estimation).

Distributional parameters:
transition probabilities p(s(t)|s(t-1))
observation probabilities p(x|s)
initial state distribution p(s(0))

Baum-Welch

EM algorithm: iteratively estimate the parameters and calculate the likelihood of the data and the expected frequencies, given the model.

1. Choose arbitrary parameters

2. Calculate likelihood

3. Expected frequencies so obtained are substituted for the old parameters.

Iterate until there is no improvement.

Learning HMMs

The Baum-Welch algorithm will find a local optimum.
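The likelihood that Baum-Welch re-evaluates at each iteration can be computed with the forward algorithm; here is a minimal sketch, with made-up two-state parameters:

```python
import numpy as np

# Forward algorithm: likelihood of an observation sequence under a given
# HMM -- the quantity re-evaluated at each Baum-Welch iteration.
def forward_likelihood(obs, pi, A, B):
    """obs: observation indices; pi: initial state distribution;
    A[i, j] = p(state j | state i); B[i, k] = p(symbol k | state i)."""
    alpha = pi * B[:, obs[0]]           # joint prob of state and first obs
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate and absorb next obs
    return alpha.sum()                  # marginalize over final state

# Illustrative model: two hidden states, two output symbols
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
like = forward_likelihood([0, 1, 0], pi, A, B)
```

A quick sanity check: the likelihoods of all sequences of a fixed length sum to 1.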

Interpretation of the resulting hidden state model?

Are there special HMMs which have a (physical) meaning?

How about Prediction?

Let’s play a little game...

Past: 111111111111111111111111111111111111111111111111

Future?

Your model?

Past: 01010101010101010101010101010101

Future?

Your model?

Past: 001010001001001010100010101000001

Future?

Your model?

Past: 001010001001001010100010101000001

Future: 0 ...

Model (two states s1, s2):
transition probabilities: p(s=2 | s=1, x=1) = 1; p(s=1 | s=1, x=0) = 1; p(s=1 | s=2, x=0) = 1
emission probabilities: p(x=0 | s=1) = 0.5; p(x=0 | s=2) = 1

Causal States (J. Crutchfield, 1989)

Some histories are equivalent with regard to the development of the future!

Equivalence class: two histories h and h' are equivalent if the distribution over the future is the same given either history: p(future|h) = p(future|h') -> h ~ h'

Define causal states s: for all h in the same equivalence class, p(future|s) = p(future|h).
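The equivalence-class idea can be illustrated empirically (my own construction): estimate p(next symbol | history) for the Golden Mean Process from data, and observe that the histories cluster by their last symbol alone, giving two causal states.

```python
import random
from collections import defaultdict

# Sample the Golden Mean Process: after a 0 the next symbol is always 1;
# after a 1, both symbols are equally likely (no consecutive zeros).
rng = random.Random(0)
seq, prev = [], 1
for _ in range(200000):
    prev = 1 if prev == 0 else rng.choice([0, 1])
    seq.append(prev)

# Empirical next-symbol counts conditioned on the last 3 symbols
counts = defaultdict(lambda: [0, 0])   # history -> [#next=0, #next=1]
for t in range(3, len(seq)):
    h = tuple(seq[t - 3:t])
    counts[h][seq[t]] += 1

p_one = {h: c[1] / (c[0] + c[1]) for h, c in counts.items()}
# Histories ending in 0 all predict next=1 with probability 1; histories
# ending in 1 predict next=1 with probability ~0.5. Every difference that
# matters is captured by the last symbol: two causal states.
```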

Model the underlying dynamical system:

Causal state S; causal states model the underlying states of the system.

Conditional transition probabilities p(s’, x | s) [x is the next observation]; model the dynamics of the system.

Initial conditions.

The ε-machine (J. Crutchfield, 1989)

A set containing the mapping from histories to causal states, s = ε(h) -> the set of causal states S, together with the transition probabilities T_ij^(x) = p(s_j, x | s_i)

M = { S, { T^(x) : x ∈ A } }

Properties of the ε-machine

Causal shielding: conditional independence of future and past, given the causal states.
Deterministic
Markovian
Optimal predictor
Minimal size
Unique
"Probabilistic bisimulation" (1989); bisimulation was coined by R. Milner (1980s) in the CS literature.

Conditional independence of future & past

Past and Future are Independent given Causal State:

p(future,past|s) = p(future|s) p(past|s)

Causal states shield past and future from each other.

Why is this true? By definition!

Because p(future|s) = p(future|past), and p(future|past,s) = p(future|past), it follows that p(future|past,s) = p(future|s). [Data Processing Inequality]

Deterministic

The ε-machine is deterministic in the sense that, given a causal state s at time t and a measurement x, there is a unique next causal state: a deterministic mapping f: (s, x) -> s'.

To prove this, we need to assume that the process is stationary!

Markovian

p(s_t | s_{t-1}, s_{t-2}, ...) = p(s_t | s_{t-1})

Optimal predictor

Let r be a state from a rival partition R.

H[future | s] ≤ H[future | r], because H[future | s] = H[future | past] ≤ H[future | r].

Therefore: entropy rate of the causal states = entropy rate of the time series

h(s) = lim_{L→∞} (1/L) H[X^L_future | s] = lim_{L→∞} (1/L) H[X^L_future | past] = h

Causal states contain every difference (in the past) that makes a difference (to the future).

Causal states are sufficient statistics for predicting the future!

Minimal

Causal states are the most compact description among all partitions of histories with equal predictive power: H[S] ≤ H[R], because rival partitions R have the same predictive power only when they are refinements of S; otherwise their prediction is a statistical mixture of the causal-state predictions. But mixing increases the entropy.
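The entropy increase under mixing is the concavity of entropy, H[Σ_i c_i p_i] ≥ Σ_i c_i H[p_i]; a quick numeric check with made-up distributions:

```python
import numpy as np

# Numeric illustration of concavity of entropy:
# H[sum_i c_i p_i] >= sum_i c_i H[p_i] for any mixture of distributions.
def entropy(p):
    p = p[p > 0]                       # ignore zero-probability outcomes
    return -(p * np.log2(p)).sum()     # Shannon entropy in bits

p1 = np.array([0.9, 0.1])              # prediction of one state (illustrative)
p2 = np.array([0.2, 0.8])              # prediction of another
c = np.array([0.5, 0.5])               # mixture weights
mixture = c[0] * p1 + c[1] * p2

lhs = entropy(mixture)                         # entropy of the mixture
rhs = c[0] * entropy(p1) + c[1] * entropy(p2)  # mixture of the entropies
# lhs >= rhs: mixing distinct predictions increases entropy
```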

H[Σ_i c_i p_i] ≥ Σ_i c_i H[p_i]

Minimal sufficient statistic!

Unique

Any rival partition which is as predictive and of the same size is the same as the causal state partition (up to relabeling of the states).

Because for a rival partition R of the same size as S: H[R] = H[S] and “as predictive as” means H[future|R] = H[future|S].

Unique minimal sufficient statistic!

The ε-machine

Optimal predictor: lower prediction error than any rival model.

Minimal size: Smallest of the prescient rivals.

Unique: any smallest optimal predictor is equivalent to the ε-machine.

Model of the process: Reproduces all of process’s statistics.

Renders process’s future independent of its past.
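For example (a sketch I constructed, not from the slides): the Golden Mean Process's ε-machine has two causal states, and its entropy rate falls out of the stationary state distribution.

```python
import numpy as np

# Entropy rate of the Golden Mean Process from its two-state ε-machine.
# State A = "last symbol was 1" (emits 0 or 1 with probability 0.5);
# state B = "last symbol was 0" (emits 1 with probability 1).
# Index 0 = A, index 1 = B.
T = np.array([[0.5, 0.5],   # from A: emit 1 -> A (0.5), emit 0 -> B (0.5)
              [1.0, 0.0]])  # from B: emit 1 -> A (1.0)

# Stationary distribution: left eigenvector of T with eigenvalue 1
w, v = np.linalg.eig(T.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()                      # pi = [2/3, 1/3]

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()      # bits

# Entropy rate = stationary average of per-state transition entropy
h = sum(pi[i] * entropy(T[i]) for i in range(2))   # = 2/3 bit per symbol
```

Only state A makes a random choice (1 bit), and it is occupied 2/3 of the time, so h = 2/3 bit per symbol.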

You can calculate the entropy rate and all statistics of a process from its ε-machine!

Issues

Efficient algorithm for finding the causal state partition and the transition probabilities.

What if we are content with some prediction error because we want a more compact representation? Is there a systematic way of trading error for model size?

Will be addressed in the next lecture...

Readings:

LPC: there are many sources; I find "Numerical Recipes in C" gives an extremely useful and compact presentation.

HMMs: Rabiner, Proceedings of the IEEE, 1989, and references therein.

Causal states and the ε-machine: Crutchfield and Shalizi, 1999, and references therein.