Exponential Families

Robert L. Wolpert
Department of Statistical Science
Duke University, Durham, NC, USA

Surprisingly many of the distributions we use in statistics for random variables $X$ taking values in some space $\mathcal{X}$ (often $\mathbb{R}$ or $\mathbb{N}_0$, but sometimes $\mathbb{R}^n$, $\mathbb{Z}$, or some other space), indexed by a parameter $\theta$ from some parameter set $\Theta$, can be written in exponential family form, with pdf or pmf

\[ f(x \mid \theta) = \exp\big[\eta(\theta)\,t(x) - B(\theta)\big]\, h(x) \]

for some statistic $t : \mathcal{X} \to \mathbb{R}$, natural parameter $\eta : \Theta \to \mathbb{R}$, and functions $B : \Theta \to \mathbb{R}$ and $h : \mathcal{X} \to \mathbb{R}_+$. The likelihood function for a random sample of size $n$ from the exponential family is

\[ f_n(x \mid \theta) = \exp\Big[\eta(\theta) \sum_{j=1}^{n} t(x_j) - n B(\theta)\Big] \prod_j h(x_j), \]

which is actually of the same form with the same natural parameter $\eta(\cdot)$, but now with statistic $T_n(x) = \sum t(x_j)$ and functions $B_n(\theta) = nB(\theta)$ and $h_n(x) = \prod h(x_j)$.

Examples

For example, the pmf for the binomial distribution $\mathsf{Bi}(m, p)$ can be written as

\[ \binom{m}{x} p^x (1-p)^{m-x} = \exp\Big[\Big(\log\frac{p}{1-p}\Big) x + m \log(1-p)\Big] \binom{m}{x}, \]

of exponential family form with $\eta(p) = \log\frac{p}{1-p}$ and natural sufficient statistic $t(x) = x$, and the Poisson

\[ \frac{\theta^x}{x!}\, e^{-\theta} = \exp\big[(\log\theta)\, x - \theta\big]\, \frac{1}{x!} \]

with $\eta = \log\theta$ and again $t(x) = x$. The Beta distribution $\mathsf{Be}(\alpha, \beta)$ with either one of its two parameters unknown can be written in EF form too:

\begin{align*} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1} &= \exp\Big[\alpha \log x - \log\frac{\Gamma(\alpha)}{\Gamma(\alpha+\beta)}\Big]\, \frac{(1-x)^{\beta}}{x(1-x)\Gamma(\beta)} \\ &= \exp\Big[\beta \log(1-x) - \log\frac{\Gamma(\beta)}{\Gamma(\alpha+\beta)}\Big]\, \frac{x^{\alpha}}{x(1-x)\Gamma(\alpha)} \end{align*}

with $t(x) = \log x$ or $\log(1-x)$ when $\eta = \alpha$ or $\eta = \beta$ is unknown, respectively. With both parameters unknown the beta distribution can be written as a bivariate exponential family with parameter $\theta = (\alpha, \beta) \in \mathbb{R}^2_+$:

\[ f(x \mid \theta) = \exp\big[\eta(\theta) \cdot t(x) - B(\theta)\big]\, h(x) \tag{1} \]

with vector parameter $\eta = (\alpha, \beta)$, statistic $t(x) = (\log x, \log(1-x))$, and scalar (one-dimensional) functions $B(\theta) = \log\Gamma(\alpha) + \log\Gamma(\beta) - \log\Gamma(\alpha+\beta)$ and $h(x) = 1/x(1-x)$. Since this comes up often, we'll let $\eta$ and $T$ be $q$-dimensional below; usually in this course $q = 1$ or $2$.

Natural Exponential Families

It is often convenient to reparametrize exponential families to the natural parameter $\eta = \eta(\theta) \in \mathbb{R}^q$, leading (with $A(\eta(\theta)) \equiv B(\theta)$) to

\[ f(x \mid \eta) = e^{\eta \cdot t(x) - A(\eta)}\, h(x). \tag{2} \]

Since any pdf integrates to unity we have

\[ e^{A(\eta)} = \int_{\mathcal{X}} e^{\eta \cdot t(x)}\, h(x)\, dx \]

and hence can calculate the moment generating function (MGF) for the natural sufficient statistic $t(x) = \{t_1(x), \dots, t_q(x)\}$ as

\begin{align*} M_t(s) = \mathsf{E}\big[e^{s \cdot t(X)}\big] &= \int_{\mathcal{X}} e^{s \cdot t(x)}\, e^{\eta \cdot t(x) - A(\eta)}\, h(x)\, dx \\ &= e^{-A(\eta)} \int_{\mathcal{X}} e^{(\eta+s) \cdot t(x)}\, h(x)\, dx \\ &= e^{A(\eta+s) - A(\eta)}, \end{align*}

so $\log M_t(s) = A(\eta+s) - A(\eta)$ and we can find moments for the natural sufficient statistic by

\[ \mathsf{E}[t] = \nabla \log M_t(0) = \nabla A(\eta), \qquad \mathsf{V}[t] = \nabla^2 \log M_t(0) = \nabla^2 A(\eta), \]

provided that $\eta$ is an interior point of the natural parameter space

\[ E \equiv \Big\{\eta \in \mathbb{R}^q : 0 < \int_{\mathcal{X}} e^{\eta \cdot t(x)}\, h(x)\, dx < \infty\Big\} \]

and that $A(\cdot)$ is twice-differentiable near $\eta$. For samples of size $n \in \mathbb{N}$ the sufficient statistic $T_n(x) = \sum t(x_j)$ is a sum of independent random variables, so by the Central Limit Theorem we have, approximately,

\[ T_n \sim \mathsf{No}\big(n \nabla A(\eta),\, n \nabla^2 A(\eta)\big). \]

Note that $\nabla^2 A(\eta) = -\nabla^2 \log f(x \mid \theta)$ is both the observed and Fisher (expected) information (matrix) $I_n(\theta)$ for natural exponential families, and that the score statistic is

\[ Z_n := \nabla \log f(x \mid \theta) = T_n(x) - n \nabla A(\eta). \]
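For a concrete check of $\mathsf{E}[t] = \nabla A(\eta)$ and $\mathsf{V}[t] = \nabla^2 A(\eta)$, here is a minimal Python sketch for the bivariate Beta family of (1), where $A(\eta) = \log\Gamma(\eta_1) + \log\Gamma(\eta_2) - \log\Gamma(\eta_1+\eta_2)$; the parameter values and the numpy/scipy dependency are illustrative choices, not prescribed by the notes:

```python
# Minimal sketch (illustrative values): check E[t] = grad A(eta) and
# V[t] = Hessian A(eta) for the Beta family, where t(x) = (log x, log(1-x))
# and A(eta) = logGamma(eta1) + logGamma(eta2) - logGamma(eta1 + eta2).
import numpy as np
from scipy.special import digamma, polygamma   # psi and psi'
from scipy.stats import beta

a, b = 2.0, 5.0                                # eta = (alpha, beta)

# E[t] = grad A(eta): componentwise psi(eta_i) - psi(eta1 + eta2)
mean_t = np.array([digamma(a) - digamma(a + b),
                   digamma(b) - digamma(a + b)])

# V[t] = Hessian of A(eta); c = psi'(eta1 + eta2) fills the off-diagonals
c = polygamma(1, a + b)
cov_t = np.array([[polygamma(1, a) - c, -c],
                  [-c, polygamma(1, b) - c]])

# Monte Carlo comparison against sample moments of t(X)
x = beta.rvs(a, b, size=200_000, random_state=0)
t = np.column_stack([np.log(x), np.log1p(-x)])
print(mean_t, t.mean(axis=0))                  # should nearly agree
print(cov_t, np.cov(t.T), sep="\n")            # likewise
```

The digamma and trigamma expressions evaluated here are exactly the $\mathsf{Be}(\alpha,\beta)$ entries tabulated at the end of these notes.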
Conjugate Priors

For hyper-parameters $\alpha \in \mathbb{R}^q$ and $\beta \in \mathbb{R}$ such that

\[ c_{\alpha,\beta} := \int_{\Theta} e^{\eta(\theta) \cdot \alpha - \beta B(\theta)}\, d\theta < \infty, \]

we can define a prior density for $\theta$ by

\[ \pi(\theta \mid \alpha, \beta) = c_{\alpha,\beta}^{-1}\, e^{\eta(\theta) \cdot \alpha - \beta B(\theta)}. \]

With this prior and with data $\{X_i\} \stackrel{\text{iid}}{\sim} f(x \mid \theta)$ from the exponential family, the posterior is

\begin{align*} \pi(\theta \mid x) &\propto e^{\eta(\theta) \cdot \alpha - \beta B(\theta)}\, e^{\eta(\theta) \cdot T_n(x) - n B(\theta)} \\ &\propto \pi\big(\theta \mid \alpha^* = \alpha + T_n(x),\, \beta^* = \beta + n\big), \end{align*}

again within the same family but now with parameters $\alpha^* = \alpha + T_n$ and $\beta^* = \beta + n$. For example, in the binomial example above this conjugate prior family is

\[ \pi(\theta \mid \alpha, \beta) \propto \exp\Big[\alpha \log\frac{p}{1-p} + \beta \log(1-p)\Big] = p^{\alpha} (1-p)^{\beta - \alpha}, \]

the Beta family, while for the Poisson example it is

\[ \pi(\theta \mid \alpha, \beta) \propto \exp\{\alpha \log\theta - \beta\theta\} = \theta^{\alpha} e^{-\beta\theta}, \]

the Gamma family. Conjugate families for every exponential family are available in the same way.
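For instance, a minimal Python sketch of the Poisson–Gamma update $\alpha^* = \alpha + T_n$, $\beta^* = \beta + n$ (the hyper-parameter values, sample size, and rate parametrization $\mathsf{Ga}(\alpha, \beta)$ with mean $\alpha/\beta$ are illustrative assumptions):

```python
# Sketch of conjugate updating for the Poisson example: a Gamma(alpha, beta)
# prior (rate parametrization) on theta, iid Poisson(theta) data, and
# posterior Gamma(alpha + T_n, beta + n). Values below are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=3.5, size=50)        # iid sample; true theta = 3.5

alpha, beta = 2.0, 1.0                   # prior hyper-parameters
T_n = x.sum()                            # natural sufficient statistic
alpha_star = alpha + T_n                 # alpha* = alpha + T_n
beta_star = beta + len(x)                # beta*  = beta + n

print(alpha_star / beta_star)            # posterior mean of theta, near 3.5
```

The posterior mean $(\alpha + T_n)/(\beta + n)$ shrinks the sample mean toward the prior mean $\alpha/\beta$, as is typical of conjugate updates.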
Note that not every distribution we consider is from an exponential family. From (2), for example, it is clear that the set of points where the pdf or pmf is nonzero, the possible values a random variable $X$ can take, is just

\[ \{x \in \mathcal{X} : f(x \mid \theta) > 0\} = \{x \in \mathcal{X} : h(x) > 0\}, \]

which does not depend on the parameter $\theta$; thus any family of distributions whose "support" depends on the parameter (uniform distributions are important examples) can't be from an exponential family.

The next pages show several familiar (and some less familiar ones, like the Inverse Gaussian $\mathsf{IG}(\mu, \lambda)$ and Pareto $\mathsf{Pa}(\alpha, \beta)$) distributions in exponential family form. Some of the formulas involve the log gamma function $\gamma(z) = \log\Gamma(z)$ and its first and second derivatives, the "digamma" $\psi(z) = (d/dz)\gamma(z)$ and "trigamma" $\psi'(z) = (d^2/dz^2)\gamma(z)$, which are built into R, Mathematica, Maple, the gsl library in C, and such, but aren't on pocket calculators or most spreadsheets. In each case $\nabla^2 A(\eta)$ is the information matrix in the natural parametrization, $I(\theta)$ in the usual parametrization.

Exponential Family Examples

$\mathsf{Be}(\alpha, \beta)$:
\[ f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}, \quad x \in (0,1) \qquad T = (\log x,\, \log(1-x)) \]
\[ B(\alpha,\beta) = \gamma(\alpha) + \gamma(\beta) - \gamma(\alpha+\beta) \qquad \eta = (\alpha, \beta) \]
\[ A(\eta) = \gamma(\eta_1) + \gamma(\eta_2) - \gamma(\eta_1+\eta_2) \]
\[ \nabla A(\eta) = \begin{pmatrix} \psi(\eta_1) - \psi(\eta_1+\eta_2) \\ \psi(\eta_2) - \psi(\eta_1+\eta_2) \end{pmatrix} \qquad \mathsf{E}T = \begin{pmatrix} \psi(\alpha) - \psi(\alpha+\beta) \\ \psi(\beta) - \psi(\alpha+\beta) \end{pmatrix} \]
\[ \nabla^2 A(\eta) = \begin{pmatrix} \psi'(\eta_1) - c & -c \\ -c & \psi'(\eta_2) - c \end{pmatrix}, \quad c = \psi'(\eta_1+\eta_2) \]

$\mathsf{Bi}(m, p)$:
\[ f(x) = \binom{m}{x} p^x q^{m-x}, \quad x = 0, 1, \dots, m \qquad T = x \]
\[ B(p) = -m \log q \qquad \eta = \log(p/q) \]
\[ A(\eta) = m \log(1 + e^{\eta}) \qquad p = e^{\eta}/(1 + e^{\eta}) \]
\[ \nabla A(\eta) = \frac{m e^{\eta}}{1 + e^{\eta}} \qquad \mathsf{E}T = mp \]
\[ \nabla^2 A(\eta) = \frac{m e^{\eta}}{(1 + e^{\eta})^2} \qquad I(p) = m/pq \]

$\mathsf{Ex}(\lambda)$:
\[ f(x) = \lambda e^{-\lambda x}, \quad x > 0 \qquad T = x \]
\[ B(\lambda) = -\log\lambda \qquad \eta = -\lambda \]
\[ A(\eta) = -\log(-\eta) \]
\[ \nabla A(\eta) = -1/\eta \qquad \mathsf{E}T = 1/\lambda \]
\[ \nabla^2 A(\eta) = \eta^{-2} \qquad I(\lambda) = 1/\lambda^2 \]

$\mathsf{Ga}(\alpha, \lambda)$:
\[ f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}, \quad x > 0 \qquad T = (\log x,\, x) \]
\[ B(\alpha, \lambda) = \gamma(\alpha) - \alpha \log\lambda \qquad \eta = (\alpha, -\lambda) \]
\[ A(\eta) = \gamma(\eta_1) - \eta_1 \log(-\eta_2) \]
\[ \nabla A(\eta) = \begin{pmatrix} \psi(\eta_1) - \log(-\eta_2) \\ -\eta_1/\eta_2 \end{pmatrix} \qquad \mathsf{E}T = \begin{pmatrix} \psi(\alpha) - \log\lambda \\ \alpha/\lambda \end{pmatrix} \]
\[ \nabla^2 A(\eta) = \begin{pmatrix} \psi'(\eta_1) & -1/\eta_2 \\ -1/\eta_2 & \eta_1/\eta_2^2 \end{pmatrix} \qquad I(\alpha, \lambda) = \begin{pmatrix} \psi'(\alpha) & -1/\lambda \\ -1/\lambda & \alpha/\lambda^2 \end{pmatrix} \]

$\mathsf{Ge}(p)$:
\[ f(x) = p q^x, \quad x = 0, 1, 2, \dots \qquad T = x \]
\[ B(p) = -\log p \qquad \eta = \log q \]
\[ A(\eta) = -\log(1 - e^{\eta}) \qquad p = 1 - e^{\eta} \]
\[ \nabla A(\eta) = \frac{e^{\eta}}{1 - e^{\eta}} \qquad \mathsf{E}T = q/p \]
\[ \nabla^2 A(\eta) = \frac{e^{\eta}}{(1 - e^{\eta})^2} \qquad I(p) = 1/p^2 q \]

$\mathsf{IG}(a, b)$:
\[ f(x) = \frac{a\, e^{-(a - bx)^2/2x}}{\sqrt{2\pi x^3}}, \quad x > 0 \qquad T = (1/x,\, x) \]
\[ B(a, b) = -ab - \log a \qquad \eta = (-a^2/2,\, -b^2/2) \]
\[ A(\eta) = -2\sqrt{\eta_1 \eta_2} - \tfrac{1}{2}\log(-2\eta_1) \qquad a = \sqrt{-2\eta_1}, \; b = \sqrt{-2\eta_2} \]
\[ \nabla A(\eta) = \begin{pmatrix} \sqrt{\eta_2/\eta_1} - 1/2\eta_1 \\ \sqrt{\eta_1/\eta_2} \end{pmatrix} \qquad \mathsf{E}T = \begin{pmatrix} b/a + 1/a^2 \\ a/b \end{pmatrix} \]
\[ \nabla^2 A(\eta) = \begin{pmatrix} \tfrac{1}{2}\sqrt{\eta_2/\eta_1^3} + \tfrac{1}{2\eta_1^2} & -\tfrac{1}{2\sqrt{\eta_1\eta_2}} \\ -\tfrac{1}{2\sqrt{\eta_1\eta_2}} & \tfrac{1}{2}\sqrt{\eta_1/\eta_2^3} \end{pmatrix} \qquad I(a, b) = \begin{pmatrix} b/a + 2/a^2 & -1 \\ -1 & a/b \end{pmatrix} \]

$\mathsf{NB}(\alpha, p)$:
\[ f(x) = \binom{-\alpha}{x} p^{\alpha} (-q)^x, \quad x = 0, 1, 2, \dots \qquad T = x \]
\[ B(p) = -\alpha \log p \qquad \eta = \log q \]
\[ A(\eta) = -\alpha \log(1 - e^{\eta}) \qquad p = 1 - e^{\eta} \]
\[ \nabla A(\eta) = \frac{\alpha e^{\eta}}{1 - e^{\eta}} \qquad \mathsf{E}T = \alpha q/p \]
\[ \nabla^2 A(\eta) = \frac{\alpha e^{\eta}}{(1 - e^{\eta})^2} \qquad I(p) = \alpha/p^2 q \]

$\mathsf{No}(\mu, \sigma^2)$:
\[ f(x) = \frac{e^{-(x-\mu)^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}} \qquad T = (x,\, x^2) \]
\[ B(\mu, \sigma^2) = \mu^2/2\sigma^2 + \tfrac{1}{2}\log\sigma^2 \qquad \eta = (\mu\sigma^{-2},\, -\sigma^{-2}/2) \]
\[ A(\eta) = -\eta_1^2/4\eta_2 - \tfrac{1}{2}\log(-2\eta_2) \]
\[ \nabla A(\eta) = \begin{pmatrix} -\eta_1/2\eta_2 \\ \eta_1^2/4\eta_2^2 - 1/2\eta_2 \end{pmatrix} \qquad \mathsf{E}T = \begin{pmatrix} \mu \\ \mu^2 + \sigma^2 \end{pmatrix} \]
\[ \nabla^2 A(\eta) = \begin{pmatrix} -1/2\eta_2 & \eta_1/2\eta_2^2 \\ \eta_1/2\eta_2^2 & -\eta_1^2/2\eta_2^3 + 1/2\eta_2^2 \end{pmatrix} \qquad I(\mu, \sigma^2) = \begin{pmatrix} \sigma^{-2} & 0 \\ 0 & \sigma^{-4}/2 \end{pmatrix} \]

$\mathsf{Po}(\lambda)$:
\[ f(x) = \lambda^x e^{-\lambda}/x!, \quad x = 0, 1, 2, \dots \qquad T = x \]
\[ B(\lambda) = \lambda \qquad \eta = \log\lambda \]
\[ A(\eta) = e^{\eta} \qquad \lambda = e^{\eta} \]
\[ \nabla A(\eta) = e^{\eta} \qquad \mathsf{E}T = \lambda \]
\[ \nabla^2 A(\eta) = e^{\eta} \qquad I(\lambda) = 1/\lambda \]
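These entries are easy to spot-check numerically. For example, the $\mathsf{Ga}(\alpha, \lambda)$ row predicts $\mathsf{E}T = (\psi(\alpha) - \log\lambda,\ \alpha/\lambda)$; here is a rough Monte Carlo sketch in Python (the parameter values are arbitrary):

```python
# Rough Monte Carlo spot-check of the Ga(alpha, lambda) table row:
# T = (log x, x) with E[T] = (psi(alpha) - log(lambda), alpha / lambda).
# Parameter values are arbitrary illustrative choices.
import numpy as np
from scipy.special import digamma
from scipy.stats import gamma

a, lam = 3.0, 2.0
x = gamma.rvs(a, scale=1.0 / lam, size=200_000, random_state=0)

print(digamma(a) - np.log(lam), np.log(x).mean())   # E[log x]: formula vs MC
print(a / lam, x.mean())                            # E[x]:     formula vs MC
```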