Probability and Statistics
Robert Stengel
Optimal Control and Estimation, MAE 546, Princeton University, 2018

• Concepts and reality
• Probability distributions
• Bayes's Law
• Stationarity and ergodicity
• Correlation functions and power spectra
Copyright 2018 by Robert Stengel. All rights reserved. For educational use only. http://www.princeton.edu/~stengel/MAE546.html http://www.princeton.edu/~stengel/OptConEst.html
Concepts and Reality of Probability (Papoulis, 1990)
• Theory may be exact
– Deals with averages of phenomena with many possible outcomes
– Based on models of behavior
• Application can only be approximate
– Measure of our state of knowledge or belief that something may or may not be true
– Subjective assessment
A: event
P(A): probability of event
nA: number of times A occurs experimentally
n: total number of trials
P(A) ≈ nA / n

Interpretations of Probability (Papoulis)
• Axiomatic Definition (Theoretical interpretation)
– Probability space, abstract objects (outcomes), and sets (events)
– Axiom 1: Pr(Ai) ≥ 0
– Axiom 2: Pr(“certain event”) = 1 = Pr[all events in probability space (or universe)]
– Axiom 3: With no common elements,
Pr(Ai ∪ Aj) = Pr(Ai) + Pr(Aj)
• Relative Frequency (Empirical interpretation)
Pr(Ai) = lim_{N→∞} (nAi / N)
N = number of trials (total)
nAi = number of trials with attribute Ai
Interpretations of Probability (Papoulis)
• Classical (“Favorable outcomes” interpretation)
Pr(Ai) = nAi / N, with N finite
nAi = number of outcomes “favorable to” Ai
• Measure of belief (Subjective interpretation)
– Pr(Ai) = measure of belief that Ai is true (similar to fuzzy sets)
– Informal induction precedes deduction
– Principle of insufficient reason (i.e., total prior ignorance):
• e.g., if there are 5 event sets, Ai , i = 1 to 5, Pr(Ai) = 1/5 = 0.2
Probability: “... a way of expressing knowledge or belief that an event will occur or has occurred.”
Statistics: “The science of making effective use of numerical data relating to groups of individuals or experiments.”
Empirical (or Relative) Frequency of Discrete, Mutually Exclusive Events in Sample Space
Pr(xi) = ni / N, i = 1 to I, in [0, 1]
• N = total number of events
• ni = number of events with value xi
• I = number of different values
• xi = ordered set of hypotheses or values
• xi is a discrete, real random variable
• Ensemble of equivalent sets:
Ai = {x ∈ U | x = xi}, i = 1 to I
• Cumulative probability over all sets:
Σ_{i=1}^{I} Pr(Ai) = Σ_{i=1}^{I} Pr(xi) = (1/N) Σ_{i=1}^{I} ni = 1
[Figure: histogram of Pr(x) for x = 1 to 6]

Cumulative Probability, Pr(x ≥/≤ a), Discrete Random Variables, and Continuous Variables
[Figure: Pr(x) with cumulative Pr(x ≥ a) and Pr(x ≤ a) for x = 1 to 5]
• Continuous, real random variable, x, e.g., a continuum of colors
– xi is the center of a categorical band in x (purple, blue, green, ...)
Pr(xi ± Δx/2) = ni / N
Σ_{i=1}^{I} Pr(xi ± Δx/2) = 1
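The relative-frequency definition lends itself to a quick numerical check: frequencies over mutually exclusive values sum exactly to 1 and approach the underlying probabilities. A minimal Python sketch (the six-valued uniform variable and the sample size are illustrative assumptions, not from the slides):

```python
import random

random.seed(1)
N = 100_000
# Discrete random variable with I = 6 equally likely values (an assumed example)
samples = [random.randint(1, 6) for _ in range(N)]

counts = {x: 0 for x in range(1, 7)}
for s in samples:
    counts[s] += 1

# Relative frequency Pr(x_i) = n_i / N
pr = {x: n / N for x, n in counts.items()}

# Probabilities over all mutually exclusive events sum to 1
total = sum(pr.values())
```

Since every sample falls into exactly one bin, `total` is 1 regardless of N; each Pr(xi) only approaches 1/6 as N grows.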
Probability Density Function, pr(x), and Cumulative Distribution Function, Pr(x < X)

pr(xi) ≜ Pr(xi ± Δx/2) / Δx → pr(x) as Δx → 0

Cumulative probability for x over (–∞, ∞):
Σ_{i=1}^{I} Pr(xi ± Δx/2) = Σ_{i=1}^{I} pr(xi) Δx → ∫_{–∞}^{∞} pr(x) dx = 1 as Δx → 0, I → ∞

Cumulative distribution function for x < X:
Pr(x < X) = ∫_{–∞}^{X} pr(x) dx
[Figure: Gaussian probability density and cumulative distribution functions]

Properties of Random Variables
• Mode
– Value of x for which pr(x) is maximum
• Median
– Value of x corresponding to the 50th percentile
– Pr(x < median) = Pr(x > median) = 0.5
• Mean
– Value of x corresponding to the statistical average
– First moment of x = expected value of x (“moment arm” × “force” analogy)
x̄ = E(x) = ∫_{–∞}^{∞} x pr(x) dx

Expected Values
Mean value: x̄ = E(x) = ∫_{–∞}^{∞} x pr(x) dx
• Second central moment of x = variance
– Variance is taken about the mean value rather than about zero
– A smaller value indicates less uncertainty in the value of x
E[(x – x̄)²] = σx² = ∫_{–∞}^{∞} (x – x̄)² pr(x) dx
• Expected value of any function of x, f(x):
E[f(x)] = ∫_{–∞}^{∞} f(x) pr(x) dx
See Supplemental Material for examples of non-Gaussian distributions

Expected Value is a Linear Operation
Expected value of a sum of random variables (x1 and x2 need not be statistically independent):
E[x1 + x2] = ∫_{–∞}^{∞} (x1 + x2) pr(x) dx = ∫_{–∞}^{∞} x1 pr(x) dx + ∫_{–∞}^{∞} x2 pr(x) dx = E[x1] + E[x2]
Expected value of a constant times a random variable:
E[k x] = ∫_{–∞}^{∞} k x pr(x) dx = k ∫_{–∞}^{∞} x pr(x) dx = k E[x]

Gaussian (Normal) Random Distribution
• Used in some random number generators (e.g., MATLAB's randn)
• Unbounded, symmetric distribution
• Defined entirely by its mean and standard deviation
pr(x) = [1 / (√(2π) σx)] exp[–(x – x̄)² / (2σx²)];  ∫_{–∞}^{∞} pr(x) dx = 1
Mean value (first moment, µ1): E(x) = ∫_{–∞}^{∞} x pr(x) dx = x̄
Variance (second central moment, µ2): E[(x – x̄)²] = ∫_{–∞}^{∞} (x – x̄)² pr(x) dx = σx²
Units of x and σx are the same
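These moment definitions and the linearity of the expectation operator can be checked by Monte Carlo sampling. A hedged Python sketch using the standard library (the values x̄ = 2, σx = 0.5, and k = 3 are arbitrary assumptions):

```python
import random

random.seed(2)
x_bar, sigma = 2.0, 0.5      # assumed mean and standard deviation
N = 200_000
xs = [random.gauss(x_bar, sigma) for _ in range(N)]

# First moment: E(x) -> x_bar
mean_est = sum(xs) / N

# Second central moment: E[(x - x_bar)^2] -> sigma^2 (unbiased divisor N - 1)
var_est = sum((x - mean_est) ** 2 for x in xs) / (N - 1)

# Linearity of expectation: E[k x] = k E[x]
k = 3.0
mean_kx = sum(k * x for x in xs) / N
```

The sample mean and variance converge on 2.0 and 0.25, and `mean_kx` matches `k * mean_est` to floating-point accuracy.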
Probability of Being Close to the Mean (Gaussian Distribution)
pr(x) = [1 / (√(2π) σx)] exp[–(x – x̄)² / (2σx²)]
Probability of being within ±1σx: Pr[x < (x̄ + σx)] – Pr[x < (x̄ – σx)] ≈ 68%
Probability of being within ±2σx: Pr[x < (x̄ + 2σx)] – Pr[x < (x̄ – 2σx)] ≈ 95%
Probability of being within ±3σx: Pr[x < (x̄ + 3σx)] – Pr[x < (x̄ – 3σx)] ≈ 99% (more precisely, 99.7%)

Experimental Determination of Mean and Variance
• Sample mean for N data points, x1, x2, ..., xN:
x̄ = (1/N) Σ_{i=1}^{N} xi
• Sample variance for the same data set:
σx² = [1/(N – 1)] Σ_{i=1}^{N} (xi – x̄)²
• Divisor is (N – 1) rather than N to produce an unbiased estimate
– Only (N – 1) terms are independent
– Inconsequential for large N
• Distribution is not necessarily Gaussian
– Prior knowledge: fit histogram to a known distribution
– Hypothesis test: determine best fit (e.g., Rayleigh, binomial, Poisson, ...)

Central Limit Theorem
• Probability distribution of the sum of independent, identically distributed (i.i.d.) variables
– Approaches a normal distribution as the number of variables approaches infinity
– Summation of continuous random variables produces a convolution of probability density functions (Papoulis, 1990)
– See Supplemental Material for sufficient conditions
pr(y) = ∫_{–∞}^{∞} pr(y – x) pr(x) dx;  pr(z) = ∫_{–∞}^{∞} pr(z – y) pr(y) dy

MATLAB Simulation of Central Limit Theorem
[Figure: uniform distribution, 100,000 points, 60 histogram bins]

Multiple Probability Densities and Expected Values
Probability density functions of two random variables, x and y:
pr(x) and pr(y) given for all x and y in (–∞, ∞)
pr(x, y): joint probability density function of x and y
∫_{–∞}^{∞} pr(x) dx = 1;  ∫_{–∞}^{∞} pr(y) dy = 1;  ∫∫ pr(x, y) dx dy = 1
Expected values of x and y:
Mean values: E(x) = ∫_{–∞}^{∞} x pr(x) dx = x̄;  E(y) = ∫_{–∞}^{∞} y pr(y) dy = ȳ
Covariance (about zero): E(xy) = ∫∫ x y pr(x, y) dx dy
Autocovariance (about zero): E(x²) = ∫_{–∞}^{∞} x² pr(x) dx

Joint Probability (n = 2)
Suppose x can take I values and y can take J values; then
Σ_{i=1}^{I} Pr(xi) = 1;  Σ_{j=1}^{J} Pr(yj) = 1
If x and y are independent,
Pr(xi, yj) = Pr(xi ∧ yj) = Pr(xi) Pr(yj)
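When the joint distribution factors, every joint probability is the product of the marginals, and the double sum over the joint distribution is 1. A small sketch using the marginals from the worked table (0.6/0.4 for x and 0.5/0.3/0.2 for y):

```python
# Marginals of two independent discrete variables (values from the slide's table)
pr_x = {1: 0.6, 2: 0.4}
pr_y = {1: 0.5, 2: 0.3, 3: 0.2}

# Joint probabilities under independence: Pr(x_i, y_j) = Pr(x_i) Pr(y_j)
pr_xy = {(i, j): pr_x[i] * pr_y[j] for i in pr_x for j in pr_y}

# Double sum over the joint distribution equals 1
total = sum(pr_xy.values())
```

For example, Pr(x = 1, y = 1) = 0.6 × 0.5 = 0.30, matching the first table entry.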
Example (independent x and y):
          Pr(yj):  0.5    0.3    0.2
Pr(xi) = 0.6:      0.30   0.18   0.12
Pr(xi) = 0.4:      0.20   0.12   0.08
and Σ_{i=1}^{I} Σ_{j=1}^{J} Pr(xi, yj) = 1

Conditional Probability (n = 2)
If x and y are not independent, their probabilities are related.
Probability that x takes its ith value when y takes its jth value:
Pr(xi | yj) = Pr(xi, yj) / Pr(yj)
• Similarly, Pr(yj | xi) = Pr(xi, yj) / Pr(xi)
Pr(xi | yj) = Pr(xi) iff x and y are independent of each other
Pr(yj | xi) = Pr(yj) iff x and y are independent of each other
Causality is not addressed by conditional probability

Applications of Conditional Probability (n = 2)
Joint probability can be expressed in two ways:
Pr(xi, yj) = Pr(yj | xi) Pr(xi) = Pr(xi | yj) Pr(yj)
Unconditional probability of each variable is expressed by a sum of terms:
Pr(xi) = Σ_{j=1}^{J} Pr(xi | yj) Pr(yj);  Pr(yj) = Σ_{i=1}^{I} Pr(yj | xi) Pr(xi)

Bayes's Rule
Thomas Bayes, 1702–1761
Bayes's Rule proceeds from the previous results.
Probability of x taking the value xi, conditioned on y taking its jth value:
Pr(xi | yj) = Pr(yj | xi) Pr(xi) / Pr(yj) = Pr(yj | xi) Pr(xi) / [Σ_{i=1}^{I} Pr(yj | xi) Pr(xi)]
... and the converse:
Pr(yj | xi) = Pr(xi | yj) Pr(yj) / Pr(xi) = Pr(xi | yj) Pr(yj) / [Σ_{j=1}^{J} Pr(xi | yj) Pr(yj)]
Bayes's or Bayes' Rule? See http://www.dailywritingtips.com/charless-pen-and-jesus-name/

Random Processes

Ensemble Statistics
• Probability metrics apply to an ensemble of events, Ai, i = 1 to I
• For I events, Pr(Ai) = lim_{N→∞} (nAi / N), i = 1 to I
• As I becomes large, Ai approaches a continuous, real random variable, x:
Ai = {x ∈ U | x = xi}, i = 1 to I → ∞
• and Σ_{i=1}^{I} Pr(Ai) = Σ_{i=1}^{I} Pr(xi) = Σ_{i=1}^{I} Pr(xi ± Δx/2) = Σ_{i=1}^{I} pr(xi) Δx → ∫_{–∞}^{∞} pr(x) dx = 1 as Δx → 0, I → ∞
• x has no specified dependence on time

Random Process
• A random-process variable (or time series), x(t), varies with time as well as across the ensemble of trials:
pr_ensemble(xi) = pr_ensemble[xi(t)], i = 1 to I, I → ∞
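The distinction between ensemble statistics and time-series statistics can be illustrated by simulating an ensemble of trials of a random process. A sketch, assuming a zero-mean white Gaussian process (the trial and step counts are arbitrary):

```python
import random

random.seed(3)
n_trials, n_steps = 400, 400

# Ensemble of trials of a zero-mean white Gaussian random process (assumed example)
ensemble = [[random.gauss(0.0, 1.0) for _ in range(n_steps)] for _ in range(n_trials)]

# Ensemble statistic: mean across trials at a fixed time index t
t = 100
ensemble_mean = sum(trial[t] for trial in ensemble) / n_trials

# Time-series statistic: mean over time for a single trial
time_mean = sum(ensemble[0]) / n_steps
```

For this stationary, ergodic example the two estimates agree (both near zero); for a non-stationary or non-ergodic process they generally would not.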
• Joint ensemble statistics (e.g., the joint probability distribution) depend on time:
pr_ensemble(x1, x2) = pr_ensemble[x(t1), x(t2)]

Non-Stationary Random Process
• Joint ensemble statistics depend on the specific values of time, t1 and t2:
pr_ensemble[x(t1)] ≠ pr_ensemble[x(t2)] {unless periodic}
pr_ensemble(x1, x2) = pr_ensemble[x(t1), x(t2)] ... unless the time series is periodic or a special case

Non-Stationary Process: Sinusoidal Mean
[Figure: Trials 1, 2, ..., n, with ensemble statistics]

Non-Stationary Process: Sinusoidal Variance
[Figure: Trials 1, 2, ..., n, with ensemble statistics]

Stationary Random Process
Joint ensemble statistics depend only on the difference, Δt, in values of time:
pr_ensemble[x(t1)] = pr_ensemble[x(t2)]
pr_ensemble[x(t1), x(t2)] = pr_ensemble{x(t1), x[t1 + (t2 – t1)]} = pr_ensemble[x(t), x(t + Δt)]

Stationary Process
[Figure: Trials 1, 2, ..., n] Ensemble statistics and time-series statistics are stationary but not the same for each trial

Ergodic Random Process
Ensemble statistics and time-series statistics are stationary, and they are the same:
pr_ensemble[x(t)] = pr_time[x(t)]
pr_ensemble[x(t), x(t + Δt)] = pr_time[x(t), x(t + Δt)]

Stationary, Ergodic Process
[Figure: Trials 1, 2, ..., n] Ensemble statistics and time-series statistics are stationary and the same for each trial

Histograms of Random Processes
[Figure: process with non-stationary mean; process with non-stationary variance; stationary, non-ergodic process; ergodic process]

Correlation and Covariance Functions

Autocorrelation Functions of Random Processes
Autocorrelation function (shown for a process with non-stationary variance):
E[x(t1) x(t2)] = ∫_{–∞}^{∞} ∫_{–∞}^{∞} x(t1) x(t2) pr_ensemble[x(t1), x(t2)] dx(t1) dx(t2) ≜ ψ[x(t1), x(t2)]
a scalar function of two variables
Autocovariance Functions of Random Processes
Mean values are subtracted from the variables:
E{[x(t1) – x̄(t1)][x(t2) – x̄(t2)]} = ∫∫ [x(t1) – x̄(t1)][x(t2) – x̄(t2)] pr_ensemble[x(t1), x(t2)] dx(t1) dx(t2)
≜ φ[x̃(t1), x̃(t2)], where x̃ ≜ x – x̄

With t1 = t2, the autocovariance function equals the variance:
E{[x(t1) – x̄(t1)][x(t2) – x̄(t2)]} = E{[x(t1) – x̄(t1)]²} = E[x̃²(t1)] = σx²,  where x̃(t) ≜ x(t) – x̄(t)

Cross-Covariance Functions
Cross-covariance function (shown for non-stationary variance):
E{[x(t1) – x̄(t1)][y(t2) – ȳ(t2)]} = ∫∫ [x(t1) – x̄(t1)][y(t2) – ȳ(t2)] pr_ensemble[x(t1), y(t2)] dx(t1) dy(t2) = φ[x̃(t1), ỹ(t2)]
With t1 = t2, the cross-covariance function is
E{[x(t1) – x̄(t1)][y(t1) – ȳ(t1)]} = σxy

Covariance Functions for Stationary Processes
For stationary processes, statistics are independent of absolute time:
Autocovariance: φ[x̃(t1) x̃(t2)] = φ[x̃(t) x̃(t + Δt)] ≜ φxx(Δt)
Cross-covariance: φ[x̃(t1) ỹ(t2)] = φ[x̃(t) ỹ(t + Δt)] ≜ φxy(Δt)
Ordering in the product is immaterial; therefore, the autocovariance function is symmetric:
φxx(Δt) = φxx(–Δt)
The cross-covariance function is not symmetric unless y = kx
Δt = lag time

Properties of Continuous-Time Covariance Functions for Ergodic Processes
Correlation depends on Δt only; ensemble averages equal time averages of the dependent variables:
φxx(Δt) = E[x(t) x(t + Δt)] = ∫_{–∞}^{∞} x(t) x(t + Δt) pr(x) dx = lim_{T→∞} (1/T) ∫_0^T x(t) x(t + Δt) dt
φyy(Δt) = E[y(t) y(t + Δt)] = lim_{T→∞} (1/T) ∫_0^T y(t) y(t + Δt) dt
Cross-covariance function:
φxy(Δt) = E[x(t) y(t + Δt)] = lim_{T→∞} (1/T) ∫_0^T x(t) y(t + Δt) dt

Properties of Discrete-Time Covariance Functions for Ergodic Processes
k = number of (±) lags; the discrete case approaches the continuous case as the sampling interval approaches zero
Time-average sums of the dependent variables:
φxx(k) = E(xn xn+k) = lim_{N→∞} (1/N) Σ_{n=1}^{N} xn xn+k
φyy(k) = E(yn yn+k) = lim_{N→∞} (1/N) Σ_{n=1}^{N} yn yn+k
Cross-covariance function:
φxy(k) = E(xn yn+k) = lim_{N→∞} (1/N) Σ_{n=1}^{N} xn yn+k

Additional Properties of Covariance Functions for Stationary Processes
“Self” correlation ≥ lagged correlation:
φxx(0) ≥ φxx(Δt)
φyy(0) ≥ φyy(Δt)
φxx(0) φyy(0) ≥ [φxy(Δt)]²

Additional Properties of Correlation Functions for Stationary Processes
φxx(k) = E(xn xn+k) = lim_{N→∞} (1/N) Σ_{n=1}^{N} xn xn+k
For a record of finite length, the number of products in the correlation-function estimate decreases as the number of lags increases.
[Table: x(n) sampled at 0.1-s intervals over 0 to 1 s; the φ(k) estimate uses 11 products at k = 0, 10 at k = 1, and 9 at k = 2]
Approximate correction: multiply φ by N/(N – k)

AutoCovariance Functions for Stationary and Non-Stationary Examples
[Figure: stationary, non-ergodic process; ergodic process; process with non-stationary mean; process with non-stationary variance]

http://en.wikipedia.org/wiki/Colors_of_noise

Dirac Delta Function
Singularity at a point in time:
δ(Δt) = ∞, Δt = 0; 0, Δt ≠ 0
∫_{–∞}^{∞} δ(t) dt = lim_{Δt1→0} ∫_{–Δt1}^{Δt1} δ(Δt) d(Δt) = 1
Representation of the Dirac delta function as a Gaussian distribution with vanishing standard deviation:
δ(Δt) = lim_{a→0} [1/(a√π)] exp[–(Δt/a)²]

Continuous-Time White Noise
(http://simplynoise.com)
Autocovariance function:
φxx(Δt) = σx² δ(Δt)
Conditional probability distribution = unconditional probability distribution:
pr[x(t) | x(t + Δt)] = pr[x(t)]
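The discrete-time covariance estimate, including the N/(N − k) finite-record correction discussed above, can be sketched and applied to simulated white noise, for which φxx(0) ≈ σx² and φxx(k) ≈ 0 for k ≠ 0 (the sample size and unit variance are assumptions):

```python
import random

random.seed(4)
N = 20_000
# Zero-mean, unit-variance white Gaussian sequence (assumed test signal)
x = [random.gauss(0.0, 1.0) for _ in range(N)]

def autocov(x, k):
    """Sample autocovariance at lag k, corrected by N/(N - k) for the
    reduced number of products in a finite-length record."""
    n = len(x)
    return sum(x[i] * x[i + k] for i in range(n - k)) / (n - k)

phi0 = autocov(x, 0)   # approaches sigma_x^2 = 1
phi5 = autocov(x, 5)   # approaches 0 for white noise
```

For a white sequence the lagged estimate is small at every nonzero lag, consistent with the delta-function autocovariance on the slide above.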
Kronecker Delta Function
A discrete function of indices:
δij = 1, i = j; 0, i ≠ j

Discrete-Time White Noise
Autocovariance function:
φxx(tj) = σx² δ(tj) = σx² δij
Conditional probability distribution = unconditional probability distribution:
pr[x(tj) | x(tj+k)] = pr[x(tj)]

Colored Noise*: First-Order Filter Effects
ẋ = ax – au, a < 0, u = white noise
[Figure: responses for a = –1 rad/s and a = –0.1 rad/s]
* colored = non-white noise; http://yusynth.net/Modular/EN/NOISE/

Colored Noise: Probability Density Functions
ẋ = –x + u
[Figure: input and output histograms, tf = 10,000 sec]
The input and output probability density functions are virtually identical

Markov Sequences (or Chains) and Processes
• Markov Sequence (Discrete Time)
– Probability distribution of a dynamic process at time tk+1 > tk > 0, conditioned on the past history, depends only on the state, x, at time tk:
Pr[xk+1 | (xk, xk–1, xk–2, ..., 0)] = Pr[xk+1 | xk]
• Markov Process (Continuous Time)
– Probability distribution of a dynamic process at time s > t > 0, conditioned on the past history, depends only on the state, x, at time t
– Follow the discussion of the continuous-time Markov chain (infinitesimal definition, jump chain/holding definition, transition probability definition, Kolmogorov forward equation) at https://en.wikipedia.org/wiki/Markov_chain

Probability Distributions, Markov Chains, and Expected Values
• Multivariate probability distributions are functions of many components of x
• If a process is Markovian, the description on the previous slide applies to the entire distribution
• Consequently, it applies to expected values of x and f(x) as well:
E(x) = ∫_{–∞}^{∞} x pr(x) dx;  E[f(x)] = ∫_{–∞}^{∞} f(x) pr(x) dx

Stochastic Steady State
Scalar, discrete-time LTI system with zero-mean Gaussian random input:
xi+1 = a xi + (1 – a) ui, 0 < a < 1
Expected value:
E(xi+1) = E[a xi + (1 – a) ui] = a E(xi) + (1 – a) E(ui), 0 < a < 1
Expected value at stochastic steady state:
E(xi+1) = E(xi)
(1 – a) E(xi) = (1 – a) E(ui)
E(xi) = E(ui) = 0

Autocovariance of a Markov Sequence
Scalar LTI system with zero-mean Gaussian random input:
φxx(1) = E(xi xi+1) = E{xi [a xi + (1 – a) ui]} = a E(xi²) + (1 – a) E(xi ui)
If the input is white noise, E(xi ui) = 0, and
φxx(1) = a E(xi²) = a σx²
φxx(k) = a^k σx², 0 < a < 1

Autocovariance of a Markov Process
Expected value at stochastic steady state:
ẋ = ax – au = a(x – u), a < 0
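The Markov-sequence result φxx(k) = a^k σx² can be checked by simulating the scalar system above. A hedged sketch with an assumed a = 0.8 and unit-variance white Gaussian input:

```python
import random

random.seed(5)
a = 0.8                       # assumed 0 < a < 1
N = 200_000
x = [0.0]
for _ in range(N):
    # x_{i+1} = a x_i + (1 - a) u_i, with zero-mean white Gaussian u_i
    x.append(a * x[-1] + (1.0 - a) * random.gauss(0.0, 1.0))
x = x[1000:]                  # discard the transient toward stochastic steady state
n = len(x)

var = sum(v * v for v in x) / n                       # sigma_x^2 (zero-mean process)
phi1 = sum(x[i] * x[i + 1] for i in range(n - 1)) / (n - 1)

ratio = phi1 / var            # theory: phi_xx(1) / sigma_x^2 = a
```

The lag-1 autocovariance divided by the variance recovers the pole location a, as the slide's result predicts.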
E(ẋ) = a[E(x) – E(u)] → 0 as t → ∞, so E(x) → E(u) as t → ∞
Autocovariance function of a Markov process:
φxx(Δt) = E[x(t) x(t + Δt)] = e^{a|Δt|} E(x²) = e^{a|Δt|} σx², a < 0

AutoCovariance Functions of Markov Random Sequences (discrete) and Processes (continuous)
[Figure; equivalent filter frequency f = –ln(a)/Δt]

Fourier Transform and Its Inverse
Fourier transform of x(t):
F[x(t)] = X(ω) = ∫_{–∞}^{∞} x(t) e^{–jωt} dt, ω = frequency, rad/s
Inverse Fourier transform:
x(t) = (1/2π) ∫_{–∞}^{∞} X(ω) e^{jωt} dω
x(t) and X(ω) are a Fourier transform pair

Power Spectral Density Function of a Random Process
Frequency distribution of the complex amplitude in x(t):
X(ω) = XRe(ω) + j XIm(ω)
Complex conjugate:
X*(ω) = XRe(ω) – j XIm(ω)
Power spectral density function:
|X(ω)|² = X(ω) X*(ω) = [XRe(ω) + j XIm(ω)][XRe(ω) – j XIm(ω)] = XRe²(ω) + XIm²(ω) ≜ Ψxx(ω)

Power Spectral Density and Autocorrelation Functions Form a Fourier Transform Pair
For a stationary process:
Power spectral density function: Ψxx(ω) = ∫_{–∞}^{∞} ψxx(τ) e^{–jωτ} dτ
Autocorrelation function: ψxx(τ) = (1/2π) ∫_{–∞}^{∞} Ψxx(ω) e^{jωτ} dω

Power Spectral Density and Autocovariance Functions for Variation from the Mean, x̃(t) = x(t) – x̄
Power spectral density function: Φxx(ω) = ∫_{–∞}^{∞} φxx(τ) e^{–jωτ} dτ
Autocovariance function: φxx(τ) = (1/2π) ∫_{–∞}^{∞} Φxx(ω) e^{jωτ} dω

Power Spectral Density and Autocovariance Functions for 1st-Order Filter Output, a = –1 rad/s, White-Noise Input
[Figure: autocovariance function and power spectral density function]
The power spectral density function has magnitude but no phase angle

Spectral Characteristics of Continuous-Time White Noise
φxx(0) = σx² = (1/π) ∫_0^∞ Φxx(ω) e^{jω(0)} dω = (1/π) ∫_0^∞ Φxx(ω) dω
which is (1/π) × the area, A, under the power spectral density curve.
For white noise,
Φxx(ω) = φxx(0) ∫_{–∞}^{∞} δ(τ) e^{–jωτ} dτ = φxx(0) = constant
However, A is infinite except when the constant integrand is zero
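The transform-pair relationship can be exercised numerically on a Markov-sequence autocovariance, φxx(k) = σx² a^|k|. The closed-form spectrum used for comparison, Φxx(ω) = σx²(1 − a²)/(1 − 2a cos ω + a²), is the standard discrete first-order result; it is not derived on these slides, and the parameter values are assumptions:

```python
import cmath
import math

a, sigma2 = 0.6, 1.0          # assumed pole and variance
omega = 1.0                   # rad/sample, an arbitrary test frequency

# Truncated discrete-time Fourier transform of phi_xx(k) = sigma2 * a^|k|;
# a^200 is negligibly small, so truncation error is below machine precision
K = 200
Phi = sum(sigma2 * a ** abs(k) * cmath.exp(-1j * omega * k)
          for k in range(-K, K + 1))

# Closed-form spectrum of a first-order Markov sequence (assumed reference)
Phi_exact = sigma2 * (1 - a * a) / (1 - 2 * a * math.cos(omega) + a * a)
```

The transform is purely real, consistent with the remark that a power spectral density has magnitude but no phase angle.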
White Noise and Band-Limiting
• Heuristic arguments for the use of “white noise”:
– Real systems are band-limited by dynamics
– Discrete-time systems are band-limited by sampling frequency
Assume that the power spectral density has a sharp cutoff at ωB:
Φxx(ω) = Φ, |ω| ≤ ωB; 0, |ω| > ωB
σx² = Φ ωB / π
φxx(τ) = Φ sin(ωBτ) / (πτ)
• If x(t) is a Markov process,
Φxx(ω) = 2f σx² / (ω² + f²)
When the power spectral density is plotted against frequency rather than log(frequency), the rolloff does not seem as abrupt.
https://en.wikipedia.org/wiki/White_noise

Next Time: Least-Squares Estimation

Supplemental Material

Relative Frequency Example: Probability of Rolling a “7” with Two Dice
• Proposition 1: 11 possible sums, one of which is 7
Pr(Ai) = nAi / N = 1/11
• Proposition 2: 21 possible pairs, not distinguishing between dice
– 3 pairs: 1-6, 2-5, 3-4
Pr(Ai) = nAi / N = 3/21 = 1/7
• Proposition 3: 36 possible outcomes, distinguishing between the two dice
– 6 pairs: 1-6, 2-5, 3-4, 6-1, 5-2, 4-3
Pr(Ai) = nAi / N = 6/36 = 1/6

Rolling Dice: Favorable-Outcomes Result
Dice rolling, 10,000,000 trials (05-Feb-2017 16:37:55). MATLAB script, shown for Number = 7:

    Number = 7;
    freq = 0;
    Trials = 1e7;
    d1 = randi(6,Trials,1);
    d2 = randi(6,Trials,1);
    for i = 1:Trials
        sum = d1(i) + d2(i);
        if sum == Number
            freq = freq + 1;
        end
    end

Output:
Number = 2, Probability = 0.027762
Number = 3, Probability = 0.055632
Number = 4, Probability = 0.083471
Number = 5, Probability = 0.11094
Number = 6, Probability = 0.13878
Number = 7, Probability = 0.16689 (or ~1/6)
Number = 8, Probability = 0.13907
Number = 9, Probability = 0.11113
Number = 10, Probability = 0.083353
Number = 11, Probability = 0.055476
Number = 12, Probability = 0.027808

Random Number Example
• Statistical properties are known prior to the actual event
• Excel spreadsheet: two random rows and one deterministic row
– RAND() generates a uniform random number on each call
[Figure: three trials of the two random rows and the deterministic row (0.1, 0.2, ..., 0.7)]
• Output for the 4th trial:
0.18 0.54 0.49 0.49 0.02 0.73 0.88
0.81 0.46 0.84 0.16 0.89 0.30 0.03
0.10 0.20 0.30 0.40 0.50 0.60 0.70
Once the experiment is over, the results are determined

Simple Hypothesis Test: t Test
Is A different from B?
• t compares the mean-value difference of two data sets, normalized by their standard deviations
– |t| is reduced by uncertainty in the data sets (σ)
– |t| is increased by the number of points in the data sets (n)
t = (mA – mB) / √(σA²/nA + σB²/nB)
m: mean value of a data set;  σ: standard deviation of a data set;  n: number of points in a data set
• |t| > 3 implies mA ≠ mB with ≥ 99.7% confidence (error probability ≤ 0.003 for Gaussian distributions) [n > 25]

Application of t Test to DNA Microarray Data (data from Alon et al., 1999)
t = (mT – mN) / √(σT²/36 + σN²/22)
• 58 RNA samples representing tumor and normal tissue
• 1,151 transcripts are over-/under-expressed in the tumor/normal comparison (p ≤ 0.003)
• Genetically dissimilar samples are apparent

Mean Value of a Uniform Random Distribution
• Used in most random number generators (e.g., MATLAB's rand)
• Bounded distribution
• Example is symmetric about the mean
pr(x) = 0, x < xmin; 1/(xmax – xmin), xmin < x < xmax; 0, x > xmax
∫_{–∞}^{∞} pr(x) dx = 1
Mean value:
x̄ = E(x) = ∫_{–∞}^{∞} x pr(x) dx = ∫_{xmin}^{xmax} x/(xmax – xmin) dx = (xmax² – xmin²)/[2(xmax – xmin)] = (xmax + xmin)/2

Variance and Standard Deviation of a Uniform Random Distribution
∫_{–∞}^{∞} pr(x) dx = 1
Variance, with xmax = –xmin ≜ a:
E[(x – x̄)²] = σx² = (1/2a) ∫_{–a}^{a} x² dx = [x³/(6a)] evaluated from –a to a = a²/3
Standard deviation:
σx = √(σx²) = a/√3

Some Non-Gaussian Distributions
• Binomial Distribution
– Random variable, x
– Probability of k successes in n trials
– Discrete probability distribution described by a probability mass function, pr(x)
pr(x) = [n!/(k!(n – k)!)] p^k (1 – p)^(n–k) = (n choose k) p^k (1 – p)^(n–k)
= probability of exactly k successes in n trials, in (0, 1); ~ normal distribution for large n
Parameters of the distribution: p = probability of occurrence, in (0, 1); n = number of trials

• Poisson Distribution
– Probability of a number of events occurring in a fixed period of time
– Discrete probability distribution described by a probability mass function
pr(k) = λ^k e^(–λ) / k!
λ = average rate of occurrence of the event (per unit time); k = number of occurrences of the event; pr(k) = probability of k occurrences (per unit time); ~ normal distribution for large λ

• Cauchy-Lorentz Distribution
pr(x) = γ / {π[γ² + (x – x0)²]}
Pr(x) = (1/π) tan⁻¹[(x – x0)/γ] + 1/2
– Mean and variance are undefined
– “Fat tails”: extreme values are more likely than for a normal distribution
– Central limit theorem fails

• Rayleigh Distribution
– Continuous, positive random variable
– Example: magnitude of a vector whose components have normal distributions
pr(x) = (x/σ²) e^(–x²/(2σ²)), x ≥ 0

• Weibull Distribution
– Positive random variable with shape, k, and scale, λ
– Example: time to failure for a process with failure rate proportional to a power of time
pr(x) = (k/λ)(x/λ)^(k–1) e^(–(x/λ)^k), x ≥ 0
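The uniform-distribution moments derived above, mean (xmax + xmin)/2 and variance a²/3, can be checked by Monte Carlo sampling; a sketch with an assumed bound a = 1:

```python
import random

random.seed(6)
a = 1.0                       # bounds are ±a, so pr(x) = 1/(2a) on [-a, a]
N = 300_000
xs = [random.uniform(-a, a) for _ in range(N)]

# Mean -> (x_max + x_min)/2 = 0 for the symmetric case
mean_est = sum(xs) / N
# Variance -> a^2 / 3
var_est = sum((x - mean_est) ** 2 for x in xs) / (N - 1)
```

The sample variance approaches 1/3, consistent with σx = a/√3.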
Skewness and Kurtosis of Probability Distributions (3rd and 4th Central Moments)
Skewness = µ3/σ³;  Excess Kurtosis = Kurtosis – 3 = µ4/σ⁴ – 3
[Figure: Laplace, hyperbolic secant, logistic, normal, raised cosine, Wigner semicircle, and uniform distributions]

Two Trials to Show Mean, Standard Deviation, Kurtosis, and Skewness
Stationary, ergodic mean and variance:
Trial 1: Mean = 0.0068123, SD = 1.0006, Kurtosis = 3.0291, Skewness = –0.0093799
Trial 2: Mean = 0.0050475, SD = 0.99947, Kurtosis = 3.0485, Skewness = 0.0147
Stationary, non-ergodic mean and variance:
Trial 1: Mean = 0.99847, SD = 1.3156, Kurtosis = 2.5933, Skewness = –0.12198
Trial 2: Mean = 0.99337, SD = 1.3262, Kurtosis = 2.6395, Skewness = –0.18208
Non-stationary (sinusoidal) mean:
Trial 1: Mean = 0.17438, SD = 1.2058, Kurtosis = 2.904, Skewness = –0.042752
Trial 2: Mean = 0.18366, SD = 1.1833, Kurtosis = 2.83, Skewness = –0.10543
Non-stationary (sinusoidal) variance:
Trial 1: Mean = –0.0043818, SD = 0.6804, Kurtosis = 4.6055, Skewness = 0.063442
Trial 2: Mean = 0.0072196, SD = 0.71168, Kurtosis = 5.0192, Skewness = –0.05492
Sine wave (identical in both trials): Mean = 6.6818e-06, SD = 0.707, Kurtosis = 1.5006, Skewness = –2.8338e-05

Some Non-Gaussian Distributions: Bimodal Distributions
• Bimodal Distribution: two peaks
– e.g., concatenation of 2 normal distributions with different means [histogram]
• Random Sine Wave
x = A sin[ωt + random phase angle]
pr(x) = 1/(π√(A² – x²)), |x| ≤ A; 0, |x| > A

Histograms (Wikipedia Example)
[Figure: scaled numbers, area = sum; normalized numbers, area = 1]

Sufficient Conditions for Central Limit Theorem (Papoulis, 1990)
The probability distribution of the sum of independent, identically distributed (i.i.d.) variables approaches a normal distribution as the number of variables approaches infinity, provided that
• the xi are i.i.d. random variables and E(xi³) is finite, or
• the xi are bounded, |xi| < A < ∞, and σi > a > 0, or
• E(|xi|³) < B < ∞ and Σ σn² → ∞ as n → ∞

Dependence and Correlation
x and y are independent if
pr(x, y) = pr(x) pr(y) at every x and y in (–∞, ∞), i.e., pr(x | y) = pr(x) and pr(y | x) = pr(y)
Dependence:
pr(x, y) ≠ pr(x) pr(y) for some x and y in (–∞, ∞)
x and y are uncorrelated if E(xy) = E(x)E(y):
∫∫ x y pr(x, y) dx dy = ∫ x pr(x) dx ∫ y pr(y) dy = x̄ ȳ
Correlation: E(xy) ≠ E(x)E(y)

Which Combinations are Possible?
• Independence and lack of correlation: pr(x, y) = pr(x) pr(y) everywhere, and ∫∫ x y pr(x, y) dx dy = ∫∫ x y pr(x) pr(y) dx dy = x̄ ȳ
• Independence and correlation: pr(x, y) = pr(x) pr(y) everywhere, but ∫∫ x y pr(x, y) dx dy ≠ x̄ ȳ
• Dependence and lack of correlation: pr(x, y) ≠ pr(x) pr(y) for some x and y, yet ∫∫ x y pr(x, y) dx dy = x̄ ȳ
• Dependence and correlation: pr(x, y) ≠ pr(x) pr(y) for some x and y, and ∫∫ x y pr(x, y) dx dy ≠ x̄ ȳ

Example
xk = [xk; yk] = [x1k; x2k]
2nd-order LTI system:
xk+1 = Φ xk + Λ wk, x0 = 0
Gaussian disturbance, wk, with independent, uncorrelated components:
w = [w1; w2];  Q = diag(σw1², σw2²)
Propagation of the state mean and covariance:
x̄k+1 = Φ x̄k + Λ w̄
Pk+1 = Φ Pk Φᵀ + Λ Q Λᵀ, P0 = 0
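The covariance propagation above can be iterated directly. A dependency-free 2×2 sketch with an assumed stable, diagonal Φ and Λ (the "independence and lack of correlation" case), so that P stays diagonal and converges to a stochastic steady state:

```python
# Plain-list 2x2 matrix helpers, to keep the sketch dependency-free
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def mat_T(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

# Assumed stable, independent dynamics: Phi = diag(a, b), Lambda = diag(c, d)
Phi = [[0.9, 0.0], [0.0, 0.5]]
Lam = [[0.1, 0.0], [0.0, 0.5]]
Q = [[1.0, 0.0], [0.0, 1.0]]   # independent, uncorrelated disturbance components

P = [[0.0, 0.0], [0.0, 0.0]]   # P_0 = 0
LQL = mat_mul(mat_mul(Lam, Q), mat_T(Lam))
for _ in range(500):
    # P_{k+1} = Phi P_k Phi' + Lambda Q Lambda'
    P = mat_add(mat_mul(mat_mul(Phi, P), mat_T(Phi)), LQL)
```

With diagonal Φ, Λ, and Q, the off-diagonal elements of P remain zero (the states stay uncorrelated), and each diagonal element converges to c²/(1 − a²) for its own (a, c) pair.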
Off-diagonal elements of P and Q express correlation

Example (continued)
x̄k+1 = Φ x̄k + Λ w̄;  Pk+1 = Φ Pk Φᵀ + Λ Q Λᵀ, P0 = 0
• Independence and lack of correlation in the state (independent dynamics):
Φ = [a 0; 0 b];  Λ = [c 0; 0 d]
• Independent dynamics and correlation in the state:
Φ = [a 0; 0 a];  x0 ≠ 0;  w1 = w2 = w;  Λ = [c; c]
• Dependence and correlation in the state:
Φ = [a b; c d];  Λ = [e 0; 0 f]
• Dependence and lack of correlation in a nonlinear output (conjecture, t.b.d.):
Φ = [a b; c d];  Λ = [e 0; 0 f];  z = [x1; x2²]

Colored Noise: First-Order Filter, a = –1 rad/s
[Figure: input and output autocovariance functions, tf = 10,000 sec]

Bandwidth Effect: Comparison of AutoCovariance Functions for 1st-Order Filters
[Figure: a = –1 rad/s vs. a = –0.1 rad/s, tf = 10,000 sec]

Sampled Sine Wave Statistics
x(t) = sin(0.1t), Δt = 0.1 s, number of points = 1258
pr(x) = 1/(π√(A² – x²)), |x| ≤ A; 0, |x| > A — a bimodal distribution
[Figure: time series, histogram, and autocorrelation function (aliased by short sample length without the n/(n – k) correction)]

Sample-Length Effect on Estimate of Sine Wave Autocorrelation Function
x(t) = sin(t/16π)
The autocorrelation-function estimate becomes inaccurate when the number of lags exceeds ~10% of the number of sample points

Frequency Folding and Aliasing
τ = sampling interval, sec
ωD = ωC τ, rad
ωC = π/τ = folding frequency
• The discrete Fourier transform is periodic in frequency increments of ΔωD = 2π
• Relationship of the discrete Fourier transform to the continuous Fourier transform:
XD(ωC) = (1/τ) Σ_{n=–∞}^{∞} XC(ω + 2πn/τ)
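The folding-frequency relationship can be demonstrated by sampling: a sinusoid whose frequency is shifted by the full sampling rate 2π/τ produces exactly the same samples as the original, so any content above ωC = π/τ aliases below it. A sketch with an assumed τ = 0.1 s:

```python
import math

tau = 0.1                     # sampling interval, s (assumed)
omega_C = math.pi / tau       # folding frequency, rad/s
omega = 10.0                  # a test frequency below omega_C

# A signal shifted by the sampling rate aliases onto the original:
# cos((2*pi/tau - omega)*n*tau) = cos(2*pi*n - omega*n*tau) = cos(omega*n*tau)
alias = 2.0 * math.pi / tau - omega
s_low = [math.cos(omega * k * tau) for k in range(100)]
s_high = [math.cos(alias * k * tau) for k in range(100)]

max_diff = max(abs(u - v) for u, v in zip(s_low, s_high))
```

The two sampled sequences are identical to machine precision, so the two continuous frequencies are indistinguishable after sampling.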