
Lecture 8: Kernel Density Estimator

D. Jason Koskinen [email protected]

Advanced Methods in Applied Statistics, Feb - Apr 2019
University of Copenhagen, Niels Bohr Institute
Photo by Howard Jackman

Overview

• Much of what we have covered has been estimation, but using analytic or defined density expressions
• Today we cover density estimates from the data itself
• The methods are regularly employed on finite data samples that need smoothing or require non-parametric methods to get a PDF

• Last few slides of this lecture contain extended literature for further reading


• The histogram is one of the simplest forms of a data-driven non-parametric density estimator
• But the only two histogram parameters (bin width and bin position(s)) are arbitrary

[Figure: Circle Area Histogram (from Lecture 1), frequency vs. area of circle from MC [m^2], shown for bin widths of 0.1 m^2, 1 m^2, and 3 m^2]


• The histogram is one of the simplest forms of a data-driven non-parametric density estimator
• But the only two histogram parameters (bin width and bin position(s)) are arbitrary
• Histograms are not smooth, but ideally our density estimator/function is smooth
• More dimensions require more data in order to have a multi-dimensional histogram which can match the true PDF
• We can avoid some of these issues with density estimates by using something more sensible

Wish List

• For density estimates, what do we want?
  • Non-parametric, i.e. no explicit requirement for the form of the PDF
  • (Easily) extendable to higher dimensions
  • Use data to get local point-wise density estimates which can be combined to get an overall density estimate
  • Smooth
    • At least smoother than a ‘jagged’ histogram
  • Preserves real probabilities, i.e. any transformation has to give PDFs which integrate to 1 and don’t ever go negative
• The answer… Kernel Density Estimation (KDE)
  • Sometimes the “E” in KDE stands for “Estimator” too

Bayesian Blocks

• An alternative to constant bins for histograms is to use Bayesian Blocks, developed by J.D. Scargle
• Bayesian Blocks are very useful for determining time-varying changes
• Covers many, but not all, wish list items

[Figure: Crab flux, dN/dE ∝ E^(…), shown with Bayesian Block binning]

*arXiv:1308.6698

Basics of Data Driven Estimation

• Fixed regions in parameter space are expected to have approximately equal probabilities
  • The smaller the region(s), the more supported our assumption that the probability is constant
  • The more data in each region, the more accurate the density estimate
• We will keep the region fixed and find some compromise; large enough to collect some data, but small enough that our probability assumption is reasonable
• For a more thorough treatment of the original idea see the articles by Parzen and Rosenblatt in the last slides of this lecture

Example in 1 dimension

• Start with a uniform probability box around data points:

$$K(u) = \begin{cases} 0.5, & \text{for } |u| \le 1 \\ 0, & \text{for } |u| > 1 \end{cases}$$

• Notice that this kernel K(u) is normalized. The kernel is always normalized!! The total width is 2*1, so the ‘height’ is 1/(2*1)
• For a 1D set of data (x), u = x - x_i, where x_i is a data point and h is the distance around a data point that has a non-zero probability
• The value of h is set by the analyzer
• Here, h = 1

KDE by pieces

• For 2 data points (x1 and x2) we can produce a data driven probability density P(x) using a box of total width=2 and height=0.5 centered at each data point

[Figure: P(x) vs. x showing the box-kernel probability contributions from x1 and x2, each with half-width = 1 and height K = 0.5]

Final KDE

• Combine the contributions, normalize the new function to 1, and we have generated a data driven PDF.

[Figure: the resulting Kernel Density Estimator P(x) using a box kernel, built from the data points x1 and x2]

Hyper-Cube

• Our extended ‘region’ definition is a D-dimensional hyper-cube with lengths equal to 2h on each side
• We include all points within the hyper-cube volume via some weighting scheme. This is known as the kernel (K), which for this KDE is:

$$K(\vec{x}_i) = \begin{cases} 0.5, & \vec{x}_i \text{ in region } R_d \\ 0, & \vec{x}_i \text{ outside region } R_d \end{cases}$$

for some region R_d centered at point x_d

• Sometimes you will see the kernel as K(u), where u is the ‘distance’ from x_i to x_d

Kernel Characteristics

• The integrated kernel always equals 1

$$\int_{-\infty}^{+\infty} K(u)\, du = 1$$

• If ‘u’ is multidimensional, then so is the integration. But the integral is always 1.
• Even if the kernel is not transformed to be 1D, e.g. K(x,y,z) instead of K(u), the integral of the kernel is always 1.

$$\iiint K(x, y, z)\, dx\, dy\, dz = 1 \qquad \text{(3D in Cartesian coordinates } x, y, z\text{)}$$

$$\int \cdots \int K(s_1, \ldots, s_k)\, ds_1 \cdots ds_k = 1 \qquad \text{(}k\text{ dimensions represented by } \vec{s}\,\text{)}$$

Visual Region

• Everything within the hyper-cube is included with an appropriate value (also known as a weight)
• For the density estimator we need:
  • To normalize by the total number of events in the sample (N)
  • The kernel is always normalized, i.e. the integral or sum equals 1

[Figure: D-dimensional hyper-cube with side length 2h, centered at the estimation point x_d]

• The PDF estimator (P_KDE) is now constructed from the individual data points:

$$P_{KDE}(\vec{x}) = \frac{1}{N} \sum_{n=1}^{N} K\!\left(\frac{\vec{x} - \vec{x}_n}{h}\right)$$

• The illustration is the estimation at a single point x_d ∈ x

Exercise 1

• Take the fixed-length hyper-cube KDE out for a spin in 1D
• Use the following data [1, 2, 5, 6, 12, 15, 16, 16, 22, 22, 22, 23] for the finite data sample and h = 1.5
  • Normalize the kernel!!!!!!!! If the total width is 2h, then K(u) = 1/(2h) or K(u) = 0
  • Because the length is 2h, to be in the ‘hyper-cube’ each data point x_i only needs to be within h of x_d in each dimension
• Calculate P_KDE(x=6), P_KDE(x=10.1), P_KDE(x=20.499), and P_KDE(x=20.501)
• Code this by-hand, i.e. no external packages

Exercise 1 Example

• Calculate P_KDE(x=6) by taking all 12 data points and seeing if they are within ±h of x=6, i.e. in the range 4.5 to 7.5
• We include the data point at x=6 in the KDE

$$P_{KDE}(x=6) = \frac{1}{12} \sum_{n=1}^{N} K\!\left(\frac{6 - \vec{x}_n}{1.5}\right)$$

$$= \frac{1}{12}\left[ K\!\left(\frac{6-1}{1.5}\right) + K\!\left(\frac{6-2}{1.5}\right) + K\!\left(\frac{6-5}{1.5}\right) + K\!\left(\frac{6-6}{1.5}\right) + K\!\left(\frac{6-12}{1.5}\right) + \ldots + K\!\left(\frac{6-23}{1.5}\right) \right]$$

$$= \frac{1}{12}\left[ 0 + 0 + \frac{1}{3} + \frac{1}{3} + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 \right] = \frac{1}{18}$$
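A minimal by-hand sketch (in Python, one possible language choice) of the fixed-width box-kernel estimator from Exercise 1; the data, h = 1.5, and the evaluation points come from the exercise, and the expected values in the final comment are the worked example above plus the jump quoted on the later Hyper-Cube slide.

```python
# Minimal by-hand box-kernel (Parzen window) KDE for Exercise 1.
# K carries its own normalization: 1/(2h) inside +/- h of a data point, else 0.

data = [1, 2, 5, 6, 12, 15, 16, 16, 22, 22, 22, 23]
h = 1.5  # half-width of the box kernel

def box_kernel(u, h):
    """Normalized box kernel: height 1/(2h) for |u| <= h, zero outside."""
    return 1.0 / (2.0 * h) if abs(u) <= h else 0.0

def p_kde(x, data, h):
    """P_KDE(x) = (1/N) * sum_n K(x - x_n)."""
    return sum(box_kernel(x - xn, h) for xn in data) / len(data)

for x in (6, 10.1, 20.499, 20.501):
    print(f"P_KDE(x={x}) = {p_kde(x, data, h):.5f}")
# Expected: 1/18 ~ 0.05556 at x=6, 0 at x=10.1 and x=20.499, 1/12 ~ 0.08333 at x=20.501
```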

KDE Comments

• The function K() is known as the kernel, and h is the bandwidth
• Larger bandwidths mean more smoothing, but it can remove real features
• Smaller bandwidths will approach the true PDF better, but need lots of data points, otherwise they are ‘bumpy’
• The fixed window KDE is similar to a histogram, but has better support for local densities

Hyper-Cube

• We could use the hyper-cube kernel to construct a probability density, but there are a few drawbacks to this kernel
  • We have discrete jumps in density and limited smoothness
  • Nearby points in x have some sharp differences in probability density, e.g. P_KDE(x=20.499) = 0 but P_KDE(x=20.501) = 0.08333
  • All data have equal weighting and contribution regardless of distance to the estimation point

• So let’s switch to a different kernel with weights that decrease smoothly as a function of distance from the estimation point

Gaussian Kernel

• The generic KDE expression remains similar, e.g.

$$P_{KDE}(\vec{x}) = \frac{1}{N} \sum_{n=1}^{N} K\!\left(\frac{\vec{x} - \vec{x}_n}{h}\right)$$

• The kernel is now:

$$K(\vec{x}, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-|\vec{x} - \vec{x}_n|^2 / 2\sigma^2}$$

• The kernel at each data point now contributes a non-zero probability from [-∞, +∞] smoothly, with decreasing weight as a function of distance
• Each data point and corresponding kernel integrate to 1 over the whole parameter space

Exercise 2

• Redo exercise 1 using the new Gaussian kernel
  • For the Gaussian width use σ = 3
• Calculate the KDE two ways:
  • Writing software where you code the Gaussian function yourself
  • Using an external package
• Plot the density estimate P_KDE(x) over the range -10 < x < 35
• If you have time, plot the individual kernel contributions too (one possible sketch of both approaches follows below)
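One possible way to do Exercise 2 both by hand and with an external package, assuming NumPy/Matplotlib and scipy.stats.gaussian_kde are available. Note that scipy's bw_method is a multiplicative factor on the sample standard deviation, so it is rescaled here to approximate an absolute kernel width of σ = 3; this scaling is an assumption worth cross-checking against the by-hand curve.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

data = np.array([1, 2, 5, 6, 12, 15, 16, 16, 22, 22, 22, 23], dtype=float)
sigma = 3.0
x = np.linspace(-10, 35, 500)

def gauss_kde(x, data, sigma):
    """By-hand Gaussian KDE: average of normalized Gaussians centered on the data."""
    u = (x[:, None] - data[None, :]) / sigma        # shape (len(x), len(data))
    kernels = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * sigma)
    return kernels.mean(axis=1)                     # average over the data points

p_by_hand = gauss_kde(x, data, sigma)

# External package; gaussian_kde multiplies its bandwidth factor by the sample
# standard deviation, so divide by the std to target an absolute width of sigma.
kde = gaussian_kde(data, bw_method=sigma / data.std(ddof=1))
p_scipy = kde(x)

plt.plot(x, p_by_hand, label="by-hand (sigma=3)")
plt.plot(x, p_scipy, "--", label="scipy.stats.gaussian_kde")
for xn in data:  # individual kernel contributions, each scaled by 1/N
    plt.plot(x, np.exp(-0.5 * ((x - xn) / sigma)**2)
             / (np.sqrt(2 * np.pi) * sigma * len(data)), color="gray", lw=0.5)
plt.xlabel("x"); plt.ylabel("p(x)"); plt.legend(); plt.show()
```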

Exercise 2 KDE plot

[Figure: KDE p(x) built with Gaussian kernels (σ = 3.00), plotted over -10 < x < 35]

Compact Kernel

• The Gaussian kernel contributes across the whole space (infinite support), but sometimes we want compact support, i.e. zero outside of a specific range
  • Maybe some parameters are constrained to be non-negative
  • We know the physical system has either boundaries or effective cut-offs
• A common compact support kernel is the Epanechnikov kernel:

$$K(u) = \begin{cases} \frac{3}{4}(1 - u^2), & \text{for } |u| \le 1 \\ 0, & \text{for } |u| > 1 \end{cases}$$

*https://en.wikipedia.org/wiki/Kernel_(statistics)

Exercise 3

• Redo exercise 2 using the Epanechnikov kernel with a bandwidth that you choose
• In a nicely formatted table, compare P_KDE(x=6), P_KDE(x=10.1), P_KDE(x=20.499), and P_KDE(x=20.501) between the 3 different kernels: Parzen-Rosenblatt, Gaussian, and Epanechnikov (one possible tabulation is sketched below)
• Use either your by-hand(s) version or an external package
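A possible tabulation for Exercise 3, writing each kernel as a unit-normalized function of u = (x - x_n)/h so the estimator carries an explicit 1/(Nh). The bandwidths shown (h = 1.5 for the box, σ = 3 for the Gaussian, h = 3 for the Epanechnikov) are illustrative choices, not prescribed values.

```python
import numpy as np

data = np.array([1, 2, 5, 6, 12, 15, 16, 16, 22, 22, 22, 23], dtype=float)

# Unit kernels in u = (x - x_n)/h; each integrates to 1 in u, so
# P(x) = 1/(N h) * sum_n K((x - x_n)/h) integrates to 1 in x.
def box(u):          return np.where(np.abs(u) <= 1, 0.5, 0.0)
def gauss(u):        return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
def epanechnikov(u): return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def p_kde(x, kernel, h):
    return kernel((x - data) / h).sum() / (len(data) * h)

points = [6, 10.1, 20.499, 20.501]
kernels = [("Parzen-Rosenblatt (h=1.5)", box, 1.5),
           ("Gaussian (sigma=3)", gauss, 3.0),
           ("Epanechnikov (h=3)", epanechnikov, 3.0)]

print(f"{'kernel':<28}" + "".join(f"x={p:<10}" for p in points))
for name, k, h in kernels:
    row = "".join(f"{p_kde(p, k, h):<12.5f}" for p in points)
    print(f"{name:<28}{row}")
```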

Kernel Bandwidth

• Every KDE is, unfortunately, strongly influenced by the kernel bandwidth, which is a user defined free parameter

[Figure: KDE p(x) for Gaussian kernels with bandwidths σ = 0.20, σ = 3.00, and σ = 7.00, each plotted over -10 < x < 35, illustrating the effect of the bandwidth choice]

Bandwidth Selection

• An analytic approach to bandwidth selection is to choose a bandwidth which minimizes the mean integrated square error (MISE)

$$\mathrm{MISE}(h) = E\!\left[ \int \left( P_{KDE}(\vec{x}) - P(\vec{x}) \right)^2 dx \right]$$

• But analytically this requires some known form of the underlying distribution
• Assuming that the underlying distribution is Gaussian, the optimal bandwidth is

$$\hat{h} \approx 1.06\, \hat{\sigma}\, N^{-1/5}$$

where σ̂ is the standard deviation from the data and N is the number of data points
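A quick numerical check of the rule-of-thumb bandwidth for the Exercise 1 data, assuming the sample standard deviation plays the role of σ̂.

```python
import numpy as np

data = np.array([1, 2, 5, 6, 12, 15, 16, 16, 22, 22, 22, 23], dtype=float)
sigma_hat = data.std(ddof=1)                      # sample standard deviation
h_opt = 1.06 * sigma_hat * len(data) ** (-1 / 5)  # h ~ 1.06 * sigma_hat * N^(-1/5)
print(f"sigma_hat = {sigma_hat:.3f}, rule-of-thumb h = {h_opt:.3f}")
```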

Bandwidth Selection: Non-parametric

• Instead of using a known function we can use subsets of the data as cross-validation of the kernel bandwidth

$$\mathrm{ISE}(\hat{f}_h) = \int \left( \hat{f}_h(y) - f(y) \right)^2 dy = \int \hat{f}_h(y)^2\, dy - 2 \int \hat{f}_h(y)\, f(y)\, dy + \int f(y)^2\, dy$$

• This can be shown to converge to a least squares cross-validation (LSCV) expression as n → ∞:

$$\mathrm{LSCV}(h) = \int \hat{f}_h(y)^2\, dy - \frac{2}{N} \sum_{i=1}^{N} \hat{f}_{-i}(y_i)$$

• The expression f̂₋ᵢ(yᵢ) is the kernel estimator from the data omitting the data point yᵢ, which is also known as the “leave-one-out” density estimator

*https://projecteuclid.org/euclid.ss/1113832723
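A brute-force sketch of LSCV bandwidth selection for a 1D Gaussian-kernel KDE, using the Exercise 1 data. The ∫ f̂_h² term is done by simple numerical integration on a grid (a closed form exists for Gaussian kernels, but this keeps the sketch short), and the scan range for h is an arbitrary choice.

```python
import numpy as np

data = np.array([1, 2, 5, 6, 12, 15, 16, 16, 22, 22, 22, 23], dtype=float)

def f_hat(y, pts, h):
    """Gaussian-kernel KDE evaluated at y using the points in pts."""
    u = (np.atleast_1d(y)[:, None] - pts[None, :]) / h
    return (np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)).mean(axis=1)

def lscv(h, data, grid):
    """LSCV(h) = int f_hat^2 dy - (2/N) * sum_i f_hat_{-i}(y_i)."""
    dy = grid[1] - grid[0]
    sq_term = np.sum(f_hat(grid, data, h) ** 2) * dy     # numerical integral of f_hat^2
    loo = [f_hat(yi, np.delete(data, i), h)[0]           # leave-one-out estimates
           for i, yi in enumerate(data)]
    return sq_term - 2.0 * np.mean(loo)

grid = np.linspace(data.min() - 30, data.max() + 30, 2000)
hs = np.linspace(0.5, 8, 60)
scores = [lscv(h, data, grid) for h in hs]
print(f"LSCV-optimal bandwidth ~ {hs[np.argmin(scores)]:.2f}")
```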

KDE in 2D, and more

• While the previous examples and work were for 1 dimension, the kernels work just fine in additional dimensions
  • No escaping the curse of dimensionality :-(
• Similar to all other multi-dimensional problems, anything beyond 3D is difficult to visualize
• Kernel bandwidths do not have to be the same in each dimension
  • Either specify the bandwidth in each dimension, or
  • Transform the parameter space(s) to be uniform for a given bandwidth

Exercise 4

• There are many online tutorials covering different 2D density estimation problems in R, python, MatLab, etc.
  • Eruption of “Old Faithful” geyser
  • Rendering of text and numerals
  • Spread of diseases
  • Geographical population densities
• After working through your own particular choice, use the 500 pseudo-experiment bootstrap from Lecture “Parameter Estimation and Confidence Intervals” exercise 1b to produce a 2D KDE (a minimal starting point is sketched below)
• Because the LLH method gives precise contours, you can compare the contours from the KDE
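A minimal 2D starting point using scipy.stats.gaussian_kde; the toy bivariate sample below is a placeholder for the 500 bootstrap pseudo-experiments from the earlier lecture exercise, and the grid limits and axis labels are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Toy stand-in for the 500 bootstrap parameter estimates from the earlier exercise
samples = rng.multivariate_normal(mean=[0.5, 0.5],
                                  cov=[[0.01, 0.003], [0.003, 0.02]], size=500)

kde = gaussian_kde(samples.T)            # gaussian_kde expects shape (n_dim, n_points)

# Evaluate the KDE on a grid and draw contours
xg, yg = np.meshgrid(np.linspace(0.1, 0.9, 100), np.linspace(0.0, 1.0, 100))
density = kde(np.vstack([xg.ravel(), yg.ravel()])).reshape(xg.shape)

plt.contour(xg, yg, density)
plt.scatter(samples[:, 0], samples[:, 1], s=3, alpha=0.3)
plt.xlabel("parameter 1"); plt.ylabel("parameter 2")
plt.show()
```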

The End

More KDE comments

• The kernel is symmetric about each data point
  • Makes sense, because the region near the data point should have a similar probability for a narrow (enough) bandwidth
  • Kernel symmetry is not technically a requirement, but in practice symmetry is often desirable because then the average of the kernel is centered on the data point
• The kernel density estimator PDF is often used for Monte Carlo sampling
  • E.g. N-body simulations (galaxy formation, astrophysical large scale structure, disease propagation in an ecosystem, etc.) take 2 months to generate 200 data points across 3 dimensions or parameters. Real data is much, much larger. In order to use our N-body PDF, we can sample from a smoothed PDF from a KDE (a minimal sketch follows below).
• Because KDEs require ‘subjective’ input, clearly state the kernel, bandwidth, and any optimization from an analysis
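If sampling from the KDE-smoothed PDF is the goal (as in the N-body example), scipy's gaussian_kde provides a resample method; a minimal sketch, with a random 3-dimensional toy array standing in for the expensive simulation output.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Stand-in for a small, expensive-to-generate simulation output: 200 points in 3D
sim_points = rng.normal(size=(3, 200))

kde = gaussian_kde(sim_points)          # smooth PDF estimate from the sparse sample
new_events = kde.resample(size=100000)  # cheap Monte Carlo sample drawn from the KDE
print(new_events.shape)                 # (3, 100000)
```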

Further Info

• Fixed kernel width window, Parzen-Rosenblatt window
  • Parzen (http://www.jstor.org/stable/2237880)
  • Rosenblatt (http://projecteuclid.org/euclid.aoms/1177728190)
• Nice list of various kernels at https://en.wikipedia.org/wiki/Kernel_(statistics)
• Very nice review article on kernel bandwidth selection: https://projecteuclid.org/euclid.ss/1113832723
• Collection of other cross-validation techniques: https://cran.r-project.org/web/packages/kedd/vignettes/kedd.pdf

Further Info (cont.)

• Variable bandwidth kernels
  • Z. I. Botev, et al., Kernel Density Estimation via Diffusion, Annals of Statistics, 38(5), 2916-2957 (2010).
  • I. S. Abramson, On bandwidth variation in kernel estimates - A square root law, Annals of Statistics, 10(4), 1217-1223 (1982).
• Any other suggestions?
