6. Density Estimation: 1. Cross Validation, 2. Histogram, 3. Kernel


6. Density Estimation

1. Cross Validation
2. Histogram
3. Kernel Density Estimation
4. Local Polynomials
5. Higher Dimensions
6. Mixture Models
7. Converting Density Estimation Into Regression

6.1 Cross Validation

Suppose we observe $X_1, \ldots, X_n$ from an unknown density $f$. Our goal is to estimate $f$ nonparametrically. Finding the best estimator $\hat{f}_n$ in some sense is equivalent to finding the optimal smoothing parameter $h$.

How do we measure the performance of $\hat{f}_n$?

• Risk / Integrated Mean Squared Error (IMSE):
$$R(\hat{f}_n, f) = E\big(L(\hat{f}_n, f)\big), \qquad \text{where } L(\hat{f}_n, f) = \int \big(\hat{f}_n(x) - f(x)\big)^2\,dx.$$
• One can look for an optimal estimator that minimizes the risk function,
$$\hat{f}_n^* = \operatorname*{argmin}_{\hat{f}_n} R(\hat{f}_n, f),$$
but the risk function is unknown!

Estimation of the risk function

• One can express the loss function as a function of the smoothing parameter $h$:
$$L(h) = \int \big(\hat{f}_n(x) - f(x)\big)^2\,dx
= \underbrace{\int \hat{f}_n^2(x)\,dx - 2\int \hat{f}_n(x) f(x)\,dx}_{J(h)} + \int f^2(x)\,dx.$$
• The cross-validation estimator of the risk function (up to a constant) is
$$\hat{J}(h) = \int \hat{f}_n^2(x)\,dx - \frac{2}{n}\sum_{i=1}^n \hat{f}_{(-i)}(X_i),$$
where $\hat{f}_{(-i)}$ is the density estimator obtained after removing the $i$-th observation.

Finding an optimal estimator $\hat{f}_n$

• Use leave-one-out cross validation to estimate the risk function: $E(\hat{J}(h)) = E(J(h)) = R(h) + c$, where $c$ does not depend on $h$.
• The optimal smoothing parameter is
$$h^* = \operatorname*{argmin}_h \hat{J}(h).$$
• The nonparametric estimator $\hat{f}$ can be expressed as a function of $h$, and the best estimator $\hat{f}_n^*$ is obtained by plugging in the optimal smoothing parameter:
$$\hat{f}_n^* = \hat{f}_{h^*}.$$

Theorem 1. Suppose that $f'$ is absolutely continuous and $\int (f'(u))^2\,du < \infty$. Then
$$R(\hat{f}_n, f) = \frac{h^2}{12}\int (f'(u))^2\,du + \frac{1}{nh} + o(h^2) + o\!\left(\frac{1}{n}\right).$$
The optimal bandwidth is
$$h^* = \frac{1}{n^{1/3}}\left(\frac{6}{\int (f'(u))^2\,du}\right)^{1/3}.$$
With the optimal binwidth,
$$R(\hat{f}_n, f) \approx \frac{C}{n^{2/3}}, \qquad \text{where } C = \left(\frac{3}{4}\right)^{2/3}\left(\int (f'(u))^2\,du\right)^{1/3}.$$

6.2 Histogram

Histogram Estimator

• Without loss of generality, we assume that the support of $f$ is $[0, 1]$. Divide the support into $m$ equally sized bins
$$B_1 = \left[0, \tfrac{1}{m}\right),\; B_2 = \left[\tfrac{1}{m}, \tfrac{2}{m}\right),\; \ldots,\; B_m = \left[\tfrac{m-1}{m}, 1\right].$$
• Let $h = \tfrac{1}{m}$, $p_j = \int_{B_j} f(x)\,dx$ and $Y_j = \sum_{i=1}^n I(X_i \in B_j)$.
• The histogram estimator is defined by
$$\hat{f}_n(x) = \sum_{j=1}^m \frac{\hat{p}_j}{h}\, I(x \in B_j), \qquad \text{where } \hat{p}_j = \frac{Y_j}{n}.$$

Proof (of Theorem 1). For any $x, u \in B_j$,
$$f(u) = f(x) + (u - x) f'(x) + \frac{(u - x)^2}{2} f''(\tilde{x})$$
for some $\tilde{x}$ between $x$ and $u$. Hence
$$p_j = \int_{B_j} f(u)\,du = \int_{B_j} \left[ f(x) + (u - x) f'(x) + \frac{(u - x)^2}{2} f''(\tilde{x}) \right] du
= f(x)\, h + h f'(x)\left(h\big(j - \tfrac{1}{2}\big) - x\right) + O(h^3).$$

Therefore, the bias of $\hat{f}_n$ is
$$b(x) \equiv E\big(\hat{f}_n(x)\big) - f(x) = \frac{E(\hat{p}_j)}{h} - f(x) = \frac{p_j}{h} - f(x)
= \frac{f(x)\, h + h f'(x)\left(h\big(j - \tfrac{1}{2}\big) - x\right) + O(h^3)}{h} - f(x)
= f'(x)\left(h\big(j - \tfrac{1}{2}\big) - x\right) + O(h^2).$$
By the mean value theorem, for some $\tilde{x}_j \in B_j$,
$$\int_{B_j} b^2(x)\,dx = \int_{B_j} (f'(x))^2 \left(h\big(j - \tfrac{1}{2}\big) - x\right)^2 dx + O(h^4)
= (f'(\tilde{x}_j))^2 \int_{B_j} \left(h\big(j - \tfrac{1}{2}\big) - x\right)^2 dx + O(h^4)
= (f'(\tilde{x}_j))^2\, \frac{h^3}{12} + O(h^4).$$

Hence
$$\int_0^1 b^2(x)\,dx = \sum_{j=1}^m \int_{B_j} b^2(x)\,dx
= \frac{h^3}{12}\sum_{j=1}^m (f'(\tilde{x}_j))^2 + O(h^3)
= \frac{h^2}{12}\int_0^1 (f'(x))^2\,dx + o(h^2).$$
For the variance of $\hat{f}_n$: for $x \in B_j$,
$$v(x) \equiv \operatorname{Var}\big(\hat{f}_n(x)\big) = \frac{1}{h^2}\operatorname{Var}(\hat{p}_j) = \frac{p_j(1 - p_j)}{n h^2}.$$
By the mean value theorem, for some $\tilde{x}_j \in B_j$, $p_j = \int_{B_j} f(x)\,dx = h f(\tilde{x}_j)$.

Therefore,
$$\int_0^1 v(x)\,dx = \sum_{j=1}^m \int_{B_j} v(x)\,dx = \sum_{j=1}^m \int_{B_j} \frac{p_j(1 - p_j)}{n h^2}\,dx
= \frac{1}{nh}\sum_{j=1}^m p_j - \frac{1}{nh}\sum_{j=1}^m p_j^2
= \frac{1}{nh} - \frac{1}{nh}\sum_{j=1}^m h^2 f^2(\tilde{x}_j)
= \frac{1}{nh} - \frac{1}{n}\int_0^1 f^2(x)\,dx + o\!\left(\frac{1}{n}\right).$$
Adding the integrated squared bias and the integrated variance gives the risk expansion of Theorem 1.
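The following short numpy sketch (not part of the original notes) illustrates the histogram estimator $\hat{p}_j/h$ from Section 6.2 together with the leave-one-out cross-validation score $\hat{J}(h)$ from Section 6.1; the Beta(2,5) sample, the range of candidate bin counts, and the function names are illustrative choices only.

import numpy as np

def hist_density(x, counts, n, h):
    # Histogram estimator f_hat(x) = p_hat_j / h for data supported on [0, 1].
    j = np.minimum((np.asarray(x) / h).astype(int), len(counts) - 1)  # bin index of each x
    return counts[j] / (n * h)

def cv_risk(data, m):
    # Leave-one-out CV score J_hat(h) = int f_hat^2 dx - (2/n) sum_i f_hat_{(-i)}(X_i)
    # for an m-bin histogram with binwidth h = 1/m.
    n, h = len(data), 1.0 / m
    bins = np.minimum((data / h).astype(int), m - 1)   # bin index j(i) of each observation
    counts = np.bincount(bins, minlength=m)            # Y_j
    p_hat = counts / n                                 # p_hat_j
    int_f2 = np.sum(p_hat ** 2) / h                    # integral of f_hat^2
    loo = (counts[bins] - 1) / ((n - 1) * h)           # f_hat_{(-i)}(X_i): drop X_i from its own bin
    return int_f2 - 2.0 * loo.mean()

rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=500)                        # toy sample supported on [0, 1]
scores = {m: cv_risk(data, m) for m in range(1, 51)}   # candidate bin counts
m_star = min(scores, key=scores.get)                   # m minimizing the CV score
h_star = 1.0 / m_star
counts = np.bincount(np.minimum((data / h_star).astype(int), m_star - 1), minlength=m_star)
print("selected number of bins:", m_star)
print(hist_density(np.linspace(0.0, 1.0, 5), counts, len(data), h_star))

For the histogram, this direct leave-one-out computation reduces algebraically to the closed-form expression given in Theorem 2 below.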
Theorem 2. The cross-validation estimator of risk for the histogram is
$$\hat{J}(h) = \frac{2}{h(n - 1)} - \frac{n + 1}{h(n - 1)} \sum_{j=1}^m \hat{p}_j^2.$$

Theorem 3. Let $m = m(n)$ be the number of bins in the histogram $\hat{f}_n$. Assume that $m(n) \to \infty$ and $m(n) \log n / n \to 0$ as $n \to \infty$. Define
$$\ell_n(x) = \left(\max\left\{\sqrt{\hat{f}_n(x)} - c,\; 0\right\}\right)^2, \qquad
u_n(x) = \left(\sqrt{\hat{f}_n(x)} + c\right)^2,$$
where
$$c = \frac{z_{\alpha/(2m)}}{2}\sqrt{\frac{m}{n}}.$$
Then
$$\Pr\!\left(\ell_n(x) \le E\big(\hat{f}_n(x)\big) \le u_n(x) \;\text{ for all } x\right) \ge 1 - \alpha.$$

6.3 Kernel Density Estimation

• Given a kernel $K$ and a positive number $h$, called the bandwidth, the kernel density estimator is
$$\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h} K\!\left(\frac{x - X_i}{h}\right).$$
• The choice of kernel $K$ is not crucial, but the choice of bandwidth $h$ is important.
• We assume that $K$ satisfies
$$\int K(x)\,dx = 1, \qquad \int x K(x)\,dx = 0, \qquad \sigma_K^2 \equiv \int x^2 K(x)\,dx > 0.$$

Theorem 4. Let $R_x = E\big(\hat{f}(x) - f(x)\big)^2$ be the risk at a point $x$ and let $R = \int R_x\,dx$ denote the integrated risk. Assume that $f''$ is absolutely continuous and that $\int (f'''(x))^2\,dx < \infty$. Also assume that $K$ satisfies the kernel conditions above. Then
$$R_x = \frac{1}{4}\sigma_K^4 h_n^4 (f''(x))^2 + \frac{f(x)\int K^2(x)\,dx}{n h_n} + O(n^{-1}) + O(h_n^6),$$
$$R = \frac{1}{4}\sigma_K^4 h_n^4 \int (f''(x))^2\,dx + \frac{\int K^2(x)\,dx}{n h_n} + O(n^{-1}) + O(h_n^6).$$

Proof. Write $K_h(x, X) = \frac{1}{h} K\!\left(\frac{x - X}{h}\right)$, so that $\hat{f}_n(x) = \frac{1}{n}\sum_{i=1}^n K_h(x, X_i)$. Then
$$E\big(\hat{f}_n(x)\big) = E\big(K_h(x, X)\big) = \int \frac{1}{h} K\!\left(\frac{x - t}{h}\right) f(t)\,dt = \int K(u)\, f(x - hu)\,du$$
$$= \int K(u)\left[ f(x) - h u f'(x) + \frac{h^2 u^2}{2} f''(x) + \cdots \right] du
= f(x) + \frac{1}{2} h^2 f''(x) \int u^2 K(u)\,du + \cdots.$$
Hence
$$\operatorname{bias}\big(\hat{f}_n(x)\big) = \frac{1}{2}\sigma_K^2 h_n^2 f''(x) + O(h_n^4).$$
Similarly,
$$\operatorname{Var}\big(\hat{f}_n(x)\big) = \frac{f(x)\int K^2(x)\,dx}{n h_n} + O(n^{-1}).$$

Bandwidth selection

• The optimal smoothing bandwidth is
$$h^* = \left(\frac{c_2}{c_1^2\, A(f)\, n}\right)^{1/5},$$
where $c_1 = \int x^2 K(x)\,dx$, $c_2 = \int (K(x))^2\,dx$ and $A(f) = \int (f''(x))^2\,dx$.
• The only unknown quantity in $h^*$ is $A(f) = \int (f''(x))^2\,dx$.
• $A(f) = \int (f''(x))^2\,dx = \int f^{(4)}(x)\, f(x)\,dx = E\big(f^{(4)}(X)\big)$, where $f^{(r)}$ denotes the $r$-th derivative of $f$.

The Normal reference rule

• Assuming that $f$ is Normal, one can compute $h^*$ for the Gaussian kernel:
$$h^* = 1.06\,\sigma\, n^{-1/5}.$$
• $\sigma$ is estimated by $\hat{\sigma} = \min\{s,\; \mathrm{IQR}/1.34\}$, where $s$ is the sample standard deviation and IQR is the interquartile range.
• The selected bandwidth is
$$h_n = \frac{1.06\,\hat{\sigma}}{n^{1/5}}.$$

Plug-in method

Let $\psi_r = E\big(f^{(r)}(X)\big)$. To estimate $\psi_4$ by the kernel method, one needs to choose the optimal bandwidth, which is a functional of $\psi_6$.
1. Estimate $\psi_8$ with the bandwidth chosen by the normal reference rule.
2. Estimate $\psi_6$ with the bandwidth depending on $\hat{\psi}_8$.
3. Estimate $\psi_4$ with the bandwidth depending on $\hat{\psi}_6$.
4. The selected bandwidth is
$$h^* = \left(\frac{c_2}{c_1^2\, \hat{\psi}_4\, n}\right)^{1/5}.$$

Cross-validation

• Cross-validation score function:
$$\hat{J}(h) = \int \hat{f}^2(x)\,dx - \frac{2}{n}\sum_{i=1}^n \hat{f}_{-i}(X_i).$$
• The selected bandwidth is
$$\hat{h} = \operatorname*{argmin}_h \hat{J}(h).$$

Theorem 5 (Stone's Theorem). Suppose $f$ is bounded. Let $\hat{f}_h$ denote the kernel estimator with bandwidth $h$ and let $\hat{h}$ denote the bandwidth chosen by cross-validation. Then
$$\frac{\int \big(f(x) - \hat{f}_{\hat{h}}(x)\big)^2\,dx}{\inf_h \int \big(f(x) - \hat{f}_h(x)\big)^2\,dx} \;\xrightarrow{\;a.s.\;}\; 1.$$

Computation

• The cross-validation score function $\hat{J}(h)$ can be approximated by
$$\hat{J}(h) = \frac{1}{h n^2}\sum_{i=1}^n \sum_{j=1}^n K^*\!\left(\frac{X_i - X_j}{h}\right) + \frac{2}{nh} K(0) + O\!\left(\frac{1}{n^2}\right),$$
where $K^* = K^{(2)} - 2K$ and $K^{(2)}(z) = \int K(z - y) K(y)\,dy$ is the convolution of $K$ with itself.

Optimal Convergence rate

From Theorem 4, the optimal convergence rate is $O(n^{-4/5})$ if the optimal bandwidth is used.

Theorem 6. Let $\mathcal{F}$ be the set of all pdfs and let $f^{(m)}$ denote the $m$-th derivative of $f$.
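To make the normal reference rule concrete, here is a minimal numpy sketch (not part of the original notes) of the Gaussian kernel density estimator with bandwidth $h_n = 1.06\,\hat{\sigma}/n^{1/5}$ and $\hat{\sigma} = \min\{s, \mathrm{IQR}/1.34\}$; the simulated standard normal sample and the evaluation grid are arbitrary choices for the demonstration.

import numpy as np

def kde(x, data, h):
    # Gaussian-kernel density estimate f_hat(x) = (1/n) sum_i (1/h) K((x - X_i)/h).
    u = (np.asarray(x)[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # standard normal kernel
    return K.mean(axis=1) / h

def normal_reference_bandwidth(data):
    # h_n = 1.06 * sigma_hat * n^(-1/5) with sigma_hat = min(s, IQR/1.34).
    n = len(data)
    s = data.std(ddof=1)                               # sample standard deviation
    q75, q25 = np.percentile(data, [75, 25])
    sigma_hat = min(s, (q75 - q25) / 1.34)
    return 1.06 * sigma_hat * n ** (-1 / 5)

rng = np.random.default_rng(1)
data = rng.normal(size=400)                            # toy sample
h = normal_reference_bandwidth(data)
grid = np.linspace(-4.0, 4.0, 9)
print("normal reference bandwidth:", round(h, 3))
print(kde(grid, data, h))

Swapping the rule-of-thumb bandwidth for the cross-validation choice $\hat{h} = \operatorname*{argmin}_h \hat{J}(h)$ only changes how $h$ is selected; the estimator itself is unchanged.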