Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)
Variable Kernel Density Estimation in High-Dimensional Feature Spaces
Christiaan M. van der Walt (a,b) and Etienne Barnard (b)
(a) Modelling and Digital Science, CSIR, Pretoria, South Africa ([email protected])
(b) Multilingual Speech Technologies Group, North-West University, Vanderbijlpark, South Africa ([email protected])
Abstract

Estimating the joint probability density function of a dataset is a central task in many machine learning applications. In this work we address the fundamental problem of kernel bandwidth estimation for variable kernel density estimation in high-dimensional feature spaces. We derive a variable kernel bandwidth estimator by minimizing the leave-one-out entropy objective function and show that this estimator is capable of performing estimation in high-dimensional feature spaces with great success. We compare the performance of this estimator to state-of-the-art maximum-likelihood estimators on a number of representative high-dimensional machine learning tasks and show that the newly introduced minimum leave-one-out entropy estimator performs optimally on a number of the high-dimensional datasets considered.

Introduction

With the advent of the internet and advances in computing power, the collection of very large high-dimensional datasets has become feasible; understanding and modelling high-dimensional data has thus become a crucial activity, especially in the field of machine learning. Since non-parametric density estimators are data-driven and do not require or impose a pre-defined probability density function on data, they are very powerful tools for probabilistic data modelling and analysis. Conventional non-parametric density estimation methods, however, originated in the field of statistics and were not originally intended to perform density estimation in high-dimensional feature spaces, as is often encountered in real-world machine learning tasks. Scott and Sain (2005) state, for example, that kernel density estimation of a full density function is only feasible up to six dimensions. We therefore define density estimation tasks with dimensionalities of 10 and higher as high-dimensional, and we address the fundamental problem of non-parametric density estimation in high-dimensional feature spaces in this study.

The first notable attempt to free discriminant analysis from strict distributional assumptions was made in 1951 (Fix and Hodges 1951; Silverman 1986) with the introduction of the naïve estimator. Since then, many approaches to non-parametric density estimation have been developed, most notably: kernel density estimators (KDEs) (Whittle 1958), k-nearest-neighbour density estimators (Loftsgaarden and Quesenberry 1965), variable KDEs (Breiman et al. 1977), projection pursuit density estimators (Friedman et al. 1984), mixture model density estimators (Dempster et al. 1977; Redner and Walker 1984) and Bayesian networks (Webb 2003; Heckerman 2008).

Recent advances in machine learning have shown that maximum-likelihood (ML) kernel density estimation holds much promise for estimating non-parametric density functions in high-dimensional spaces. Two notable contributions are the Maximum Leave-one-out Likelihood (MLL) kernel bandwidth estimator (Barnard 2010) and the Maximum Likelihood Leave-One-Out (ML-LOO) kernel bandwidth estimator (Leiva-Murillo and Rodríguez 2012). Both estimators were independently derived from the ML objective function, with the only differences being the simplifying assumptions made in their derivations to limit the complexity of the bandwidth optimisation problem.

In particular, the ML-LOO estimator constrains the number of free parameters that need to be estimated in the bandwidth optimisation problem by estimating an identical full-covariance or spherical bandwidth matrix for all kernels. The MLL estimator by Barnard assumes that the density function changes slowly throughout feature space, which limits the complexity of the objective function being optimised. Specifically, when the kernel bandwidth H_i for data point x_i is estimated, it is assumed that the kernel bandwidths of the remaining N - 1 kernels are equal to H_i. The optimisation of the bandwidth H_i therefore does not require the estimation of the bandwidths of the remaining N - 1 kernels.

We address the problem of kernel bandwidth estimation for kernel density estimation in this work by deriving a kernel bandwidth estimation technique from the ML framework without the simplifications imposed by the MLL and ML-LOO estimators. This allows us to compare the performance of this more general ML estimator to the MLL and ML-LOO estimators on a number of representative machine learning tasks, to gain a better understanding of the practical implications of the simplifying assumptions made in the derivations of the MLL and ML-LOO estimators on real-world (RW) density estimation tasks.

In Section 2 we define the classical ML kernel bandwidth estimation problem and derive a novel kernel bandwidth estimator from the ML leave-one-out objective function. We also show how the MLL and ML-LOO estimators can be derived from the same theoretical framework. In Section 3 we describe the experimental design of our comparative simulation studies, and in Section 4 we present the results of our simulations. Finally, we make concluding remarks and suggest future work in Section 5.

Maximum-likelihood Kernel Density Estimation

KDEs estimate the probability density function of a D-dimensional dataset X, consisting of N independent and identically distributed samples x_1, ..., x_N, with the sum

$$p_H(x) = \frac{1}{N} \sum_{j=1}^{N} K_{H_j}(x - x_j \mid H_j) \qquad (1)$$

where K_{H_j}(x - x_j | H_j) is the kernel smoothing function fitted over each data point x_j, with bandwidth matrix H_j that describes the variation of the kernel function.

This formulation shows that the density estimated with the KDE is non-parametric, since no parametric distribution is imposed on the estimate; instead, the estimated distribution is defined by the sum of the kernel functions centred on the data points. KDEs thus require the selection of two design parameters, namely the parametric form of the kernel function and the bandwidth matrix. It has been shown that the efficiencies of kernels with respect to the mean squared error between the true and estimated distributions do not differ significantly, and that the choice of kernel function should rather be based on its mathematical properties, since the estimated density function inherits the smoothness properties of the kernel function (Silverman 1986). The Gaussian kernel is therefore often selected in practice for its smoothness properties, such as continuity and differentiability. This leaves the kernel bandwidth as the only parameter to be estimated.

The ML criterion for kernel density estimation is typically defined as the log-likelihood function and has an inherent shortcoming: the estimated log-likelihood tends to infinity as the bandwidths of the kernel functions centred on each data point tend to zero. This leads to a trivial degenerate solution when kernel bandwidths are estimated by optimising the ML objective function. To address this shortcoming, the leave-one-out KDE estimate is used. The leave-one-out estimate removes the effect of sample x_i from the KDE sum when estimating the likelihood of x_i; optimising the leave-one-out ML objective function with respect to the kernel bandwidth matrix thus prevents the trivial solution of the ML objective function, where l_H(X) = infinity when H_j = 0. The leave-one-out ML objective function is thus defined as

$$l_H(X) = \sum_{i=1}^{N} \log \left[ \frac{1}{N-1} \sum_{j \neq i} K_{H_j}(x_i - x_j \mid H_j) \right] \qquad (2)$$

where H is used to denote the dependency of the log-likelihood on the kernel bandwidths H_j.
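To make Eqs. 1 and 2 concrete, the following is a minimal NumPy sketch of the leave-one-out objective, assuming multivariate Gaussian kernels with diagonal bandwidth matrices (so that H_j(d,d) = h_jd^2); the function names and toy data are our own illustration, not the authors' implementation.

```python
import numpy as np

def gaussian_kernel(diff, h):
    """Diagonal-covariance Gaussian kernel K_{H_j}(x - x_j | H_j).

    diff: (D,) difference x - x_j; h: (D,) per-dimension bandwidths h_jd,
    so that H_j(d,d) = h_jd**2.
    """
    z = diff / h
    norm = np.prod(h) * (2.0 * np.pi) ** (len(h) / 2.0)
    return np.exp(-0.5 * np.dot(z, z)) / norm

def loo_log_likelihood(X, H):
    """Leave-one-out ML objective l_H(X) of Eq. 2 for an (N, D) dataset X
    and per-kernel bandwidths H of shape (N, D)."""
    N = X.shape[0]
    total = 0.0
    for i in range(N):
        # p_{H(-i)}(x_i): the KDE of Eq. 1 with the kernel on x_i left out
        p = sum(gaussian_kernel(X[i] - X[j], H[j]) for j in range(N) if j != i)
        total += np.log(p / (N - 1))
    return total

# toy usage: 50 samples in 10 dimensions with unit bandwidths
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
print(loo_log_likelihood(X, np.ones_like(X)))
```

Leaving kernel i out of the inner sum is precisely what keeps this objective finite as an individual bandwidth shrinks towards zero.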
Maximum-likelihood Kernel Bandwidth Estimation

KDE bandwidths that optimise the leave-one-out ML objective function can be estimated by finding the partial derivative of the leave-one-out ML objective function in Eq. 2 with respect to each bandwidth H_k:

$$\frac{\partial}{\partial H_k} l_H(X) = \sum_{i=1}^{N} \frac{\frac{\partial}{\partial H_k} \left[ \frac{1}{N-1} \sum_{j \neq i} K_{H_j}(x_i - x_j \mid H_j) \right]}{p_{H(-i)}(x_i)} \qquad (3)$$

The bandwidth H_k can thus be estimated by setting this partial derivative to zero and solving for H_k.

MLE kernel bandwidth estimation. Based on the formulation of the leave-one-out ML objective function in Eq. 3, we derive a new kernel bandwidth estimator named the minimum leave-one-out entropy (MLE) estimator. (To our knowledge, this is the first attempt where partial derivatives are used to derive variable bandwidths in a closed-form solution.)

Since the partial derivative in Eq. 3 will only be non-zero for j = k, we simplify this equation to give the partial derivative of the MLE objective function:

$$\frac{\partial}{\partial H_k} l_{MLE}(X) = \sum_{i=1}^{N} \frac{\frac{1}{N-1} \frac{\partial}{\partial H_k} \left[ K_{H_k}(x_i - x_k \mid H_k) \right]}{p_{H(-i)}(x_i)} \qquad (4)$$

If we restrict the MLE estimator to multivariate Gaussian kernels with diagonal covariance matrices, the partial derivative of the kernel function with respect to the diagonal bandwidth H_k in dimension d can be derived with the product rule:(1)

$$\frac{\partial}{\partial h_{kd}} K_{H_k}(x_i - x_k \mid H_k) = h_{kd}^{-1} \left[ h_{kd}^{-2} (x_{id} - x_{kd})^2 - 1 \right] K_{h_{kd}}\!\left( \frac{x_{id} - x_{kd}}{h_{kd}} \right) \prod_{p \neq d} K_{h_{kp}}\!\left( \frac{x_{ip} - x_{kp}}{h_{kp}} \right) \qquad (5)$$

where H_k(d,d) = h_kd^2 and h_kd is the kernel bandwidth in dimension d centred on data point x_k.

(1) We provide complete proofs for the MLE diagonal and full-covariance, MLL and ML-LOO bandwidth estimators, as well as experimental results on the convergence of the MLE bandwidths, in (van der Walt 2014).
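As a quick sanity check on Eq. 5, the sketch below (our own code, with assumed names) evaluates the diagonal-Gaussian kernel together with this partial derivative and compares the analytic derivative against a central finite difference.

```python
import numpy as np

def kernel_and_grad(diff, h):
    """Diagonal Gaussian kernel value and its partial derivatives with
    respect to each per-dimension bandwidth h_kd, following Eq. 5:
    dK/dh_kd = K * ((x_id - x_kd)**2 / h_kd**3 - 1 / h_kd)."""
    z = diff / h
    K = np.exp(-0.5 * np.dot(z, z)) / (np.prod(h) * (2.0 * np.pi) ** (len(h) / 2.0))
    return K, K * (diff ** 2 / h ** 3 - 1.0 / h)

# compare the analytic derivative with a central finite difference in dimension 0
rng = np.random.default_rng(1)
diff = rng.standard_normal(5)
h = np.full(5, 0.8)
_, grad = kernel_and_grad(diff, h)
eps = 1e-6
h_hi = h.copy()
h_hi[0] += eps
h_lo = h.copy()
h_lo[0] -= eps
numeric = (kernel_and_grad(diff, h_hi)[0] - kernel_and_grad(diff, h_lo)[0]) / (2 * eps)
print(grad[0], numeric)  # the two values should agree closely
```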
If we substitute the partial derivative of this kernel function into Eq. 4, set the result to zero and solve for H_k, we obtain the MLE diagonal-covariance bandwidth estimate

$$H_{k(d,d)} = \frac{\sum_{i=1}^{N} \frac{K_{H_k}(x_i - x_k \mid H_k) \, (x_{id} - x_{kd})^2}{p_{H(-i)}(x_i)}}{\sum_{i=1}^{N} \frac{K_{H_k}(x_i - x_k \mid H_k)}{p_{H(-i)}(x_i)}} \qquad (6)$$

Note that H_k also appears on the right-hand side of Eq. 6, both in the kernel weights and in the leave-one-out densities, so in practice the estimate is obtained by iterating this expression as a fixed-point update until the bandwidths converge.
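The following vectorised NumPy sketch performs one such fixed-point iteration of Eq. 6 under the same diagonal-Gaussian assumptions as above; it is our own illustrative code, not the authors' implementation, and its O(N^2 D) intermediate arrays are acceptable only at sketch scale.

```python
import numpy as np

def mle_bandwidth_update(X, H):
    """One fixed-point iteration of the MLE estimate of Eq. 6.

    X: (N, D) data; H: (N, D) current per-kernel bandwidths h_kd.
    Returns updated bandwidths, i.e. sqrt of the new diagonals H_k(d,d).
    """
    N, D = X.shape
    # K[k, i] = K_{H_k}(x_i - x_k | H_k) for all pairs
    Z = (X[None, :, :] - X[:, None, :]) / H[:, None, :]
    norm = np.prod(H, axis=1) * (2.0 * np.pi) ** (D / 2.0)
    K = np.exp(-0.5 * np.sum(Z ** 2, axis=2)) / norm[:, None]
    # exclude the i = k terms: they never appear in the leave-one-out objective
    np.fill_diagonal(K, 0.0)
    # leave-one-out densities p_{H(-i)}(x_i), then the weights K / p of Eq. 6
    p = np.sum(K, axis=0) / (N - 1)
    w = K / p[None, :]
    sq = (X[None, :, :] - X[:, None, :]) ** 2  # sq[k, i, d] = (x_id - x_kd)**2
    H_diag = np.einsum('ki,kid->kd', w, sq) / np.sum(w, axis=1, keepdims=True)
    return np.sqrt(H_diag)

# iterate the update from unit bandwidths on toy data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
H = np.ones_like(X)
for _ in range(20):
    H = mle_bandwidth_update(X, H)
```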
MLL kernel bandwidth estimation. A key insight, observed by Gangopadhyay and Cheung (2002) and by Barnard (2010), is that if the density function changes relatively slowly throughout feature space, the optimal kernel bandwidths will also change slowly throughout feature space. If it is assumed that the density function changes slowly throughout feature space, a simplification can be made by assuming that the bandwidths of kernels in the neighbourhood of a kernel are the same as that of the kernel. Thus, when the bandwidth H_i is being estimated, it is assumed that all other bandwidths are equal to H_i. Barnard used this assumption to reduce the complexity of the leave-one-out kernel density estimate and derived a diagonal-covariance variable multivariate Gaussian kernel bandwidth estimator as

$$H_{i(d,d)} = \frac{\sum_{j \neq i} K_{H_i}(x_i - x_j \mid H_i) \, (x_{id} - x_{jd})^2}{\sum_{j \neq i} K_{H_i}(x_i - x_j \mid H_i)} \qquad (7)$$

for each dimension d. This diagonal-covariance bandwidth estimator is called the MLL estimator (a code sketch of this update is given after Table 1 below).

ML-LOO bandwidth estimation. Leiva-Murillo and Rodríguez (2012) also simplified the bandwidth optimisation problem for ML kernel bandwidth estimation by estimating an identical spherical or full-covariance bandwidth matrix for all kernels, thus limiting the number of parameters to be optimised to a single bandwidth or a single covariance matrix. If the definition of the leave-one-out KDE with an identical spherical Gaussian bandwidth is substituted into Eq. 2, the ML-LOO spherical bandwidth h can be solved for analogously.

Datasets

We select five RW datasets (with independent train and test sets) from the UCI Machine Learning Repository (Lichman 2013) for the purpose of our simulation studies. These datasets are selected to have relatively high dimensionalities, since high dimensions are typical of many real-world machine learning tasks. The selected datasets are the Letter, Segmentation, Landsat, Optdigits and Isolet datasets. We summarise these datasets in Table 1 and denote the number of classes with "C", the dimensionality with "D", the number of training samples with "Ntr" and the number of test samples with "Nte". We refer the reader to (Lichman 2013) for a more detailed description of these datasets.

Table 1: RW dataset summary

Dataset        C    D    Ntr     Nte
Letter         26   16   16000   4000
Segmentation   7    18   2100    210
Landsat        6    36   4435    2000
Optdigits      10   64   3823    1797
Isolet         26   617  6238    1559
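To connect Table 1 to the estimators above, here is a hedged end-to-end sketch: it runs the MLL fixed-point update of Eq. 7 on a training set and scores the held-out samples under the resulting KDE. The synthetic arrays stand in for a real train/test split such as Letter (D = 16), and the feature standardisation is our own assumption, not a protocol stated in this excerpt.

```python
import numpy as np

def mll_bandwidth_update(X, H):
    """One fixed-point iteration of the MLL estimate of Eq. 7."""
    N, D = X.shape
    Z = (X[None, :, :] - X[:, None, :]) / H[:, None, :]
    norm = np.prod(H, axis=1) * (2.0 * np.pi) ** (D / 2.0)
    K = np.exp(-0.5 * np.sum(Z ** 2, axis=2)) / norm[:, None]  # K[i, j] = K_{H_i}(x_i - x_j)
    np.fill_diagonal(K, 0.0)  # Eq. 7 sums over j != i
    sq = (X[None, :, :] - X[:, None, :]) ** 2
    return np.sqrt(np.einsum('ij,ijd->id', K, sq) / np.sum(K, axis=1, keepdims=True))

def test_log_likelihood(X_tr, H, X_te):
    """Average log-density of held-out samples under the training-set KDE (Eq. 1)."""
    D = X_tr.shape[1]
    Z = (X_te[:, None, :] - X_tr[None, :, :]) / H[None, :, :]
    norm = np.prod(H, axis=1) * (2.0 * np.pi) ** (D / 2.0)
    p = np.mean(np.exp(-0.5 * np.sum(Z ** 2, axis=2)) / norm[None, :], axis=1)
    return float(np.mean(np.log(p)))

# synthetic stand-in for a Letter-like split (Table 1: D = 16); real experiments
# would load the UCI train and test sets instead
rng = np.random.default_rng(0)
X_tr = rng.standard_normal((400, 16))
X_te = rng.standard_normal((100, 16))
mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
X_tr, X_te = (X_tr - mu) / sd, (X_te - mu) / sd  # assumed preprocessing step
H = np.ones_like(X_tr)
for _ in range(10):  # iterate Eq. 7 to an approximate fixed point
    H = mll_bandwidth_update(X_tr, H)
print(test_log_likelihood(X_tr, H, X_te))
```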