Basic Multivariate Normal Theory


[Prerequisite probability background: univariate theory of random variables, expectation, variance, covariance, moment generating function, independence and the normal distribution. Other requirements: basic vector-matrix theory, multivariate calculus, multivariate change of variable.]

1 Random Vector

A random vector $U \in \mathbb{R}^k$ is a vector $(U_1, U_2, \dots, U_k)^T$ of scalar random variables $U_i$ defined over a common probability space. The expectation and variance of $U$ are defined as:
\[
E U := \begin{pmatrix} E U_1 \\ E U_2 \\ \vdots \\ E U_k \end{pmatrix},
\qquad
\mathrm{Var}\, U := E\big(\{U - E U\}\{U - E U\}^T\big)
= \begin{pmatrix}
\mathrm{Var}\, U_1 & \mathrm{Cov}(U_1, U_2) & \dots & \mathrm{Cov}(U_1, U_k) \\
\mathrm{Cov}(U_2, U_1) & \mathrm{Var}\, U_2 & \dots & \mathrm{Cov}(U_2, U_k) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{Cov}(U_k, U_1) & \mathrm{Cov}(U_k, U_2) & \dots & \mathrm{Var}\, U_k
\end{pmatrix}.
\]
It is easy to check that if $X = a + BU$, where $a \in \mathbb{R}^p$ and $B$ is a $p \times k$ matrix, then $X$ is a random vector in $\mathbb{R}^p$ with $E X = a + B\, E U$ and $\mathrm{Var}\, X = B\, \mathrm{Var}(U)\, B^T$. You should also note that if $\Sigma = \mathrm{Var}\, U$ for some random vector $U$, then $\Sigma$ is a $k \times k$ non-negative definite matrix, because for any $a \in \mathbb{R}^k$, $a^T \Sigma a = \mathrm{Var}(a^T U) \ge 0$.

2 Multivariate Normal

Definition 1. A random vector $U \in \mathbb{R}^k$ is called a normal random vector if for every $a \in \mathbb{R}^k$, $a^T U$ is a (one-dimensional) normal random variable.

Theorem 1. A random vector $U \in \mathbb{R}^k$ is a normal random vector if and only if one can write $U = m + AZ$ for some $m \in \mathbb{R}^k$ and $k \times k$ matrix $A$, where $Z = (Z_1, \dots, Z_k)^T$ with IID $Z_i \sim \mathrm{Normal}(0, 1)$.

Proof. "If part". Suppose $U = m + AZ$ with $m$, $A$ and $Z$ as in the statement of the theorem. Then for any $a \in \mathbb{R}^k$,
\[
a^T U = a^T m + a^T A Z = b_0 + \sum_{i=1}^{k} b_i Z_i
\]
for some scalars $b_0, b_1, \dots, b_k$. But a linear combination of independent (one-dimensional) normal variables is another normal, so $a^T U$ is a normal variable.

"Only if part". Suppose $U$ is a normal random vector. It suffices to show that $V = m + AZ$, with $Z$ as in the statement of the theorem and suitably chosen $m$ and $A$, has the same distribution as $U$. For any $a \in \mathbb{R}^k$, the moment generating function $M_U(a)$ of $U$ at $a$ is $E e^{a^T U} = E e^{X}$, where $X = a^T U$ is normally distributed by definition. But, by one-dimensional normal distribution theory, $E e^{X} = e^{E X + \frac{1}{2}\mathrm{Var}\, X} = e^{a^T E U + \frac{1}{2} a^T (\mathrm{Var}\, U) a} = e^{a^T \mu + \frac{1}{2} a^T \Sigma a}$, where we denote $E U$ by $\mu$ and $\mathrm{Var}\, U$ by $\Sigma$. Note that $\Sigma$ is non-negative definite and thus can be written as $\Sigma = A A^T$ for some $k \times k$ matrix $A$.

Write $V = \mu + AZ$ where $Z = (Z_1, \dots, Z_k)^T$ with $Z_i \overset{\text{IID}}{\sim} \mathrm{Normal}(0, 1)$. Then, by the "if part", $V$ is a normal random vector, and because the $Z_i$'s are IID with mean 0 and variance 1, $E V = \mu$ and $\mathrm{Var}\, V = \Sigma$. So by the above discussion $M_V(a) = e^{a^T \mu + \frac{1}{2} a^T \Sigma a} = M_U(a)$. So $U$ and $V$ have the same moment generating function. Because this moment generating function is defined for all $a \in \mathbb{R}^k$, it uniquely determines the associated probability distribution. That is, $V$ and $U$ have the same distribution.

Notation. If a random $k$-vector $U$ is a normal random vector, then by the above proof its distribution is completely determined by its mean $\mu = E U$ and variance $\Sigma = \mathrm{Var}\, U$. We shall denote this distribution by $\mathrm{Normal}_k(\mu, \Sigma)$. Note that $U \sim \mathrm{Normal}_k(\mu, \Sigma)$ means that $U = \mu + AZ$ for $Z$ as in the above theorem, where $A$ satisfies $\Sigma = A A^T$.

Theorem 2 (Linear transformations). If $U \sim \mathrm{Normal}_k(\mu, \Sigma)$, $a \in \mathbb{R}^p$ and $B$ is a $p \times k$ matrix, then $V = a + BU \sim \mathrm{Normal}_p(a + B\mu, B \Sigma B^T)$.

Proof. Write $U = \mu + AZ$ where $Z$ is as in Theorem 1 and $A$ satisfies $\Sigma = A A^T$. Then $V = a + BU = (a + B\mu) + (BA)Z$; by the argument used in the "if part" of Theorem 1, $V$ is a normal random vector, with mean $a + B\mu$ and variance $(BA)(BA)^T = B \Sigma B^T$. This proves the result.
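The representation $U = \mu + AZ$ in Theorem 1 is also how one simulates from $\mathrm{Normal}_k(\mu, \Sigma)$ in practice: factor $\Sigma = AA^T$ (for instance with a Cholesky decomposition) and transform IID standard normals. The following NumPy sketch is not part of the original notes; the values of mu, Sigma, a and B are made up for illustration. It draws $U$ this way and checks the mean and covariance formulas of Theorem 2 against sample moments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up example values (k = 3): any symmetric positive definite Sigma would do.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])
A = np.linalg.cholesky(Sigma)            # one choice of A with A @ A.T == Sigma

n_draws = 200_000
Z = rng.standard_normal((n_draws, 3))    # rows are IID Normal_3(0, I_3) vectors
U = mu + Z @ A.T                         # rows are draws of U ~ Normal_3(mu, Sigma)

# Theorem 2 with a made-up shift a and p x k matrix B (p = 2, k = 3):
a = np.array([0.5, 1.0])
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, -1.0]])
V = a + U @ B.T                          # should be ~ Normal_2(a + B mu, B Sigma B^T)

print(np.abs(V.mean(axis=0) - (a + B @ mu)).max())              # close to 0
print(np.abs(np.cov(V, rowvar=False) - B @ Sigma @ B.T).max())  # close to 0
```

Any factor $A$ with $AA^T = \Sigma$ works here; the Cholesky factor is simply a convenient choice when $\Sigma$ is positive definite.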
Theorem 3 (Probability density function). A $\mathrm{Normal}_k(\mu, \Sigma)$ distribution with a positive definite $\Sigma$ admits the following pdf:
\[
f(x) = (2\pi)^{-k/2} (\det \Sigma)^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}, \qquad x \in \mathbb{R}^k.
\]
Proof. Suppose $X \sim \mathrm{Normal}_k(\mu, \Sigma)$; then $X = \mu + AZ$ where $A A^T = \Sigma$ and $A$ is non-singular (because $\Sigma$ is p.d.). The joint pdf of $Z_1, \dots, Z_k$ is given by
\[
g(z_1, \dots, z_k) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{z_i^2}{2} \right\}
= (2\pi)^{-k/2} \exp\left\{ -\frac{1}{2} z^T z \right\}, \qquad z = (z_1, \dots, z_k)^T \in \mathbb{R}^k.
\]
Therefore, by the multivariate change of variable formula for the transformation $X = \mu + AZ$, with inverse transformation $Z = A^{-1}(X - \mu)$ and hence Jacobian $J = (\det A)^{-1}$, the pdf of $X$ is
\[
\begin{aligned}
f(x) &= |\det A|^{-1}\, g(A^{-1}(x - \mu)), \qquad x \in \mathbb{R}^k \\
&= (2\pi)^{-k/2} |\det A|^{-1} \exp\left\{ -\frac{1}{2} (x - \mu)^T (A A^T)^{-1} (x - \mu) \right\} \\
&= (2\pi)^{-k/2} (\det \Sigma)^{-1/2} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\},
\end{aligned}
\]
as $\det \Sigma = (\det A)^2$.

Theorem 4 (Independence). Let $U \sim \mathrm{Normal}_k(\mu, \Sigma)$. Suppose for some $1 \le p < k$, $\Sigma$ can be partitioned as
\[
\Sigma = \begin{pmatrix} \Sigma_{(11)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & \Sigma_{(22)} \end{pmatrix},
\]
where $\Sigma_{(11)}$ is $p \times p$, $\Sigma_{(22)}$ is $(k-p) \times (k-p)$, and $0_{m \times n}$ denotes the matrix of zeros of the specified dimensions. Similarly partition
\[
U = \begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix} \quad \text{and} \quad \mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}
\]
into vectors of length $p$ and $(k - p)$. Then $U_{(1)} \sim \mathrm{Normal}_p(\mu_{(1)}, \Sigma_{(11)})$, $U_{(2)} \sim \mathrm{Normal}_{k-p}(\mu_{(2)}, \Sigma_{(22)})$, and $U_{(1)}$ and $U_{(2)}$ are independent.

Proof. Because $\Sigma$ is non-negative definite, so must be any of its diagonal blocks. Therefore there are $p \times p$ and $(k-p) \times (k-p)$ matrices $A_{(1)}$ and $A_{(2)}$ such that $\Sigma_{(11)} = A_{(1)} A_{(1)}^T$ and $\Sigma_{(22)} = A_{(2)} A_{(2)}^T$. Clearly, $\Sigma = A A^T$ with
\[
A = \begin{pmatrix} A_{(1)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & A_{(2)} \end{pmatrix}.
\]
Because $U$ is normal, we must have $U = \mu + AZ$ where $Z = (Z_1, \dots, Z_k)^T$ with $Z_i \overset{\text{IID}}{\sim} \mathrm{Normal}(0, 1)$. Write $Z_{(1)} = (Z_1, \dots, Z_p)^T$ and $Z_{(2)} = (Z_{p+1}, \dots, Z_k)^T$. Then $Z_{(1)}$ and $Z_{(2)}$ are independent and
\[
\begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix}
= \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}
+ \begin{pmatrix} A_{(1)} & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & A_{(2)} \end{pmatrix}
\begin{pmatrix} Z_{(1)} \\ Z_{(2)} \end{pmatrix}
= \begin{pmatrix} \mu_{(1)} + A_{(1)} Z_{(1)} \\ \mu_{(2)} + A_{(2)} Z_{(2)} \end{pmatrix}.
\]
From this the result follows.

Theorem 5 (Conditioning). Let $U \sim N_k(\mu, \Sigma)$. For some $1 \le p < k$, consider the partition into blocks with the first $p$ and the last $k - p$ variables:
\[
U = \begin{pmatrix} U_{(1)} \\ U_{(2)} \end{pmatrix}, \quad
\mu = \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix} \quad \text{and} \quad
\Sigma = \begin{pmatrix} \Sigma_{(11)} & \Sigma_{(12)} \\ \Sigma_{(21)} & \Sigma_{(22)} \end{pmatrix},
\]
with $\Sigma_{(22)}$ positive definite. Then,
\[
U_{(1)} \mid U_{(2)} \sim N_p\big(\mu_{(1)} + \Sigma_{(12)} \Sigma_{(22)}^{-1} (U_{(2)} - \mu_{(2)}),\; \Sigma_{(11)} - \Sigma_{(12)} \Sigma_{(22)}^{-1} \Sigma_{(21)}\big).
\]
Proof. Clearly, $U_{(2)} \sim N_{k-p}(\mu_{(2)}, \Sigma_{(22)})$. Consider $p$ IID $N_1(0, 1)$ variables $Z_1, \dots, Z_p$ that are independent of $U_{(2)}$ and let $Z = (Z_1, \dots, Z_p)^T$. The matrix $\Sigma_{(11)} - \Sigma_{(12)} \Sigma_{(22)}^{-1} \Sigma_{(21)}$ is non-negative definite, so let $B$ be such that $\Sigma_{(11)} - \Sigma_{(12)} \Sigma_{(22)}^{-1} \Sigma_{(21)} = B B^T$. By Theorem 1,
\[
\begin{pmatrix} Z \\ U_{(2)} \end{pmatrix} \sim N_k\left( \begin{pmatrix} 0_{p \times 1} \\ \mu_{(2)} \end{pmatrix}, \begin{pmatrix} I_p & 0_{p \times (k-p)} \\ 0_{(k-p) \times p} & \Sigma_{(22)} \end{pmatrix} \right)
\]
and hence,
\[
\begin{pmatrix} \mu_{(1)} + \Sigma_{(12)} \Sigma_{(22)}^{-1} (U_{(2)} - \mu_{(2)}) + BZ \\ U_{(2)} \end{pmatrix}
\sim N_k\left( \begin{pmatrix} \mu_{(1)} \\ \mu_{(2)} \end{pmatrix}, \begin{pmatrix} \Sigma_{(11)} & \Sigma_{(12)} \\ \Sigma_{(21)} & \Sigma_{(22)} \end{pmatrix} \right),
\]
which is the same as the distribution of $U$. So the conditional distribution of $U_{(1)}$ given $U_{(2)}$ is the same as the conditional distribution of $\mu_{(1)} + \Sigma_{(12)} \Sigma_{(22)}^{-1} (U_{(2)} - \mu_{(2)}) + BZ$ given $U_{(2)}$; but the latter conditional distribution is precisely $N_p\big(\mu_{(1)} + \Sigma_{(12)} \Sigma_{(22)}^{-1} (U_{(2)} - \mu_{(2)}),\; \Sigma_{(11)} - \Sigma_{(12)} \Sigma_{(22)}^{-1} \Sigma_{(21)}\big)$.
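To make Theorem 5 concrete, the NumPy sketch below (again not part of the notes; all numerical values and variable names are made up) computes the conditional mean $\mu_{(1)} + \Sigma_{(12)} \Sigma_{(22)}^{-1} (u_{(2)} - \mu_{(2)})$ and conditional covariance $\Sigma_{(11)} - \Sigma_{(12)} \Sigma_{(22)}^{-1} \Sigma_{(21)}$ for a small example, using a linear solve instead of forming $\Sigma_{(22)}^{-1}$ explicitly.

```python
import numpy as np

# Made-up 3-dimensional example: condition the first p = 1 coordinate on the last two.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.4],
                  [0.3, 0.4, 1.0]])
p = 1
S11, S12 = Sigma[:p, :p], Sigma[:p, p:]
S21, S22 = Sigma[p:, :p], Sigma[p:, p:]

u2 = np.array([1.5, -0.5])               # a hypothetical observed value of U_(2)

# Theorem 5: U_(1) | U_(2) = u2  ~  N_p(cond_mean, cond_cov)
cond_mean = mu[:p] + S12 @ np.linalg.solve(S22, u2 - mu[p:])
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)

print(cond_mean)   # conditional mean, a length-p vector
print(cond_cov)    # conditional covariance, a p x p matrix
```

Solving the system $\Sigma_{(22)} x = b$ rather than inverting $\Sigma_{(22)}$ is the numerically safer way to apply the formula; the result is the same in exact arithmetic.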
3 Sample mean and variance of normal data

Theorem 6. Let $X_i \overset{\text{IID}}{\sim} \mathrm{Normal}(\mu, \sigma^2)$, $i = 1, \dots, n$. Then

1. $W = \dfrac{\bar X - \mu}{\sigma / \sqrt{n}} \sim \mathrm{Normal}(0, 1)$,

2. $V = \dfrac{\sum_{i=1}^{n} (X_i - \bar X)^2}{\sigma^2} \sim \chi^2_{n-1}$,

3. and $V$ and $W$ are independent.

Proof. It suffices to prove the result for $\mu = 0$ and $\sigma^2 = 1$ (otherwise take $\tilde X_i = (X_i - \mu)/\sigma$). Let $X$ denote the random vector $(X_1, \dots, X_n)^T \in \mathbb{R}^n$. Then $X \sim \mathrm{Normal}_n(0_{n \times 1}, I_n)$, where $I_n$ denotes the $n \times n$ identity matrix with 1 on the diagonal and zero elsewhere.

Let $B$ be an $n \times n$ orthogonal matrix whose first row equals $(1/\sqrt{n})\, 1_{1 \times n}$, where $1_{m \times n}$ denotes the matrix of ones of the given dimensions (such an orthogonal matrix can be constructed by the Gram–Schmidt method). Take $Y = BX$. Then
\[
Y \sim \mathrm{Normal}_n(0_{n \times 1}, B B^T) = \mathrm{Normal}_n(0_{n \times 1}, I_n),
\]
as $B$ is orthogonal. Therefore the coordinate variables $Y_1, Y_2, \dots, Y_n$ of $Y$ are IID $\mathrm{Normal}(0, 1)$. Now argue,

1. $Y_1 = (1/\sqrt{n})\, 1_{1 \times n} X = (1/\sqrt{n}) \sum_{i=1}^{n} X_i = \sqrt{n}\, \bar X$. So $W = Y_1 \sim \mathrm{Normal}(0, 1)$.

2. $Y^T Y = X^T X$ (since $B$ is orthogonal), and so $Y_1^2 + Y_2^2 + \dots + Y_n^2 = X_1^2 + X_2^2 + \dots + X_n^2$. Hence $V = \sum_{i=1}^{n} (X_i - \bar X)^2 = X^T X - n \bar X^2 = Y^T Y - Y_1^2 = Y_2^2 + \dots + Y_n^2$, a sum of squares of $n - 1$ IID $\mathrm{Normal}(0, 1)$ variables, so $V \sim \chi^2_{n-1}$.

3. $W = Y_1$ while $V$ is a function of $(Y_2, \dots, Y_n)$ alone; since $Y_1, \dots, Y_n$ are independent, so are $W$ and $V$.
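Theorem 6 is easy to check by simulation. The sketch below is not part of the notes and its parameter values are arbitrary; it draws many samples of size $n$, forms $W$ and $V$, and compares their sample moments and correlation with the values the theorem predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 3.0, 2.0, 10, 100_000   # made-up values

X = rng.normal(mu, sigma, size=(reps, n))    # each row is one sample X_1, ..., X_n
xbar = X.mean(axis=1)
W = (xbar - mu) / (sigma / np.sqrt(n))
V = ((X - xbar[:, None]) ** 2).sum(axis=1) / sigma**2

print(W.mean(), W.var())          # about 0 and 1, matching Normal(0, 1)
print(V.mean(), V.var())          # about n - 1 = 9 and 2(n - 1) = 18, matching chi^2_{n-1}
print(np.corrcoef(W, V)[0, 1])    # about 0, consistent with independence
```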