Transformations and Bayesian Estimation of Skewed and Heavy-Tailed Densities
Dissertation
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Andrew Bean, B.A., M.S.
Graduate Program in Statistics
The Ohio State University
2017
Dissertation Committee:
Xinyi Xu, Co-Advisor
Steven N. MacEachern, Co-Advisor
Yoonkyung Lee
Matthew T. Pratola

© Copyright by
Andrew Bean
2017

Abstract
In data analysis applications characterized by large and possibly irregular data sets, nonparametric statistical techniques aim to ensure that, as the sample size grows, all unusual features of the data generating process can be captured. Good large-sample performance can be guaranteed in broad classes of problems. Yet within these broad classes, some problems may be substantially more difficult than others.
This fact, long recognized in classical nonparametrics, also holds in the growing field of Bayesian nonparametrics, where flexible prior distributions are developed to allow for an infinite-dimensional set of possible truths.
This dissertation studies the Bayesian approach to the classic problem of nonparametric density estimation, in the presence of specific irregularities such as heavy tails and skew. The problem of estimating an unknown probability density is recognized as being harder when the density is skewed or heavy-tailed than when it is symmetric and light-tailed. It is a more challenging problem for classical kernel density estimators, where the expected squared-error loss is higher for heavier-tailed densities. It is also a more challenging problem in Bayesian density estimation, where heavy tails preclude the analytical treatment required to establish a large-sample convergence rate for the popular Dirichlet process (DP) mixture model.
Our proposed approach addresses these features by incorporating a low-dimensional parametric transformation of the sample, estimated from the data, with the aim of setting up an easier density estimation problem on the transformed scale. This strategy was proposed earlier in combination with kernel density estimators, and we illustrate its usefulness in the Bayesian context. Further, we develop a set of transformations estimated in a way that ensures the fastest proven convergence rate for the DP mixture is applicable to the transformed problem.
The transformation-density estimation technique makes advantageous use of a parametric pre-analysis to address specific irregularities in the data generating process. Since the parametric stage is low-dimensional, and governed by a faster convergence rate, the asymptotic performance of the model is enhanced without slowing down the overall convergence rate. We consider other settings where this recipe for semiparametric analysis — with parametric sub-analyses designed to address specific irregularities, or to simplify the main nonparametric component of the analysis — might be beneficial.
To my family.
Acknowledgments
This dissertation is the culmination of a six-year odyssey at Ohio State. I was
“carried to Ohio in a swarm of bees,” as the rock band The National put it. (Figuratively, of course, when it comes to the bees.) My studies have taken me far from my home state of Arizona. They have taken me far from friends and family, who, despite the distance, have continued to provide much-needed balance during my time in Ohio. In the end, Columbus too proved to be a wonderful home away from home.
This is mostly thanks to the people I have met here, including the faculty and staff in the Department of Statistics, and friends, roommates, and classmates throughout the years. I will miss them, and will look back on these years with fondness.
I would not have reached this point without the support of several people I want to thank individually. My advisors Xinyi Xu and Steve MacEachern are tremendous role models, both personally and as statisticians and researchers. They were unfailingly supportive and encouraging, even when my work did not proceed smoothly. It has been a privilege to work with them, and I look forward to continuing collaboration in the future. I thank Yoon Lee and Matt Pratola for serving on the committee for this dissertation. Their perspective and input on this work are invaluable. I have loved teaching statistics at Ohio State (loved everything but the grading, that is), and I became a better teacher during my time here thanks to Michelle Everson,
Jonathan Baker, Laura Kubatko, and others. I also thank the faculty and staff of the
Mathematics and Computer Science Department at Colorado College; the wonderful teachers there inspired me to pursue my studies this far.
Most of all, I am grateful to my family — my parents Jeff and Sydney, and my brother Owen — for their love and support as I worked towards this degree. I am privileged to have had parents and grandparents who gave me the opportunity to succeed at every level of my education. The only way to properly give thanks for this gift is to pass it on. I can only hope to do so with the same selflessness. Lastly, to Yi: the greatest fortune I’ve had during my time in Ohio is to have found a partner like you. The daily ups and downs of doctoral studies, the halls of Cockins: neither are romantic, but both seemed that way as we navigated them together. I look forward to writing our next chapters together.
Vita
August 3, 1987: Born, San Francisco, CA, USA
2009: B.A. Mathematics, The Colorado College
2012: M.S. Statistics, The Ohio State University
2012-present: Graduate Teaching Associate, The Ohio State University
Publications
Research Publications
A. Bean, X. Xu, S.N. MacEachern, “Transformations and Bayesian Density Estimation”. Electronic Journal of Statistics, 10(2):3355-3373, Nov. 2016.
Fields of Study
Major Field: Statistics
Table of Contents
Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction and Theoretical Motivation
   1.1 Parametric and Nonparametric Asymptotics by Example
   1.2 Asymptotic Properties of Bayesian Posteriors
       1.2.1 Parametric Bayesian Models
       1.2.2 Nonparametric Bayesian Models
   1.3 Outline of the Thesis

2. Frequentist Transformation-Density Estimation
   2.1 Parzen-Type Kernel Density Estimators
       2.1.1 Asymptotic properties of the kernel density estimator
   2.2 Density Estimation with Transformations
       2.2.1 An L2 Criterion for Selecting Transformations
   2.3 Transformation Families
       2.3.1 Parametric transformation, nonparametric density estimation
       2.3.2 Nonparametric transformation, nonparametric density estimation
       2.3.3 Other Transformation Families

3. Density Estimation using Dirichlet Process Mixtures
   3.1 Nonparametric Prior Construction
       3.1.1 The Dirichlet Process
       3.1.2 Dirichlet process mixtures
   3.2 Posterior computation for DP mixtures
   3.3 Performance of Dirichlet process mixtures for density estimation
       3.3.1 Measurements of accuracy for density estimation
       3.3.2 Constructions for the DP mixture prior, and finite-sample performance
       3.3.3 Asymptotic properties of DP mixtures

4. Iterative Transformation and Bayesian Density Estimation
   4.1 Background
   4.2 Transformations
       4.2.1 Family of Transformations
       4.2.2 A Criterion for Estimating Transformation Parameters
       4.2.3 Iterative Transformation Selection
   4.3 Simulation Study
       4.3.1 Simulation Design
       4.3.2 Simulation Results
   4.4 An Application to BMI Modeling
   4.5 Discussion

5. Heavy-Tailed Density Estimation Using Transformations and DP Mixtures
   5.1 Motivation
   5.2 DP Mixture Asymptotics and Distribution Tails
       5.2.1 Convergence Rates for Sub-Gaussian Tails
       5.2.2 Characterization of Heavy-Tailed Distributions
   5.3 A Family of Transformations for Heavy-Tailed Densities
       5.3.1 Transformations to sub-Gaussian tails
       5.3.2 Skew-t cdf-inverse-cdf transformations
       5.3.3 Estimating Skew-t Transformation Parameters
   5.4 Simulations and Data Analysis
       5.4.1 Data Analysis
       5.4.2 Simulation Study
   5.5 Discussion

6. Extensions and Future Work
   6.1 Multivariate Density Estimation
       6.1.1 Transformations to Multivariate sub-Gaussianity
       6.1.2 Bayesian Analysis of Heavy-Tailed Time Series
       6.1.3 Median Regression with a Heavy-Tailed Error Distribution
   6.2 A Recipe for Efficient Semiparametric Analysis

Appendices
A. Asymptotic Expansion of MISE for Kernel Estimates
B. Estimating integrated squared second derivatives

Bibliography
List of Tables
4.1 Ohio Family Health Survey (2008) sample sizes, divided into training and holdout samples.
5.1 Unemployment data: Log predictive scores (5.36) for transformation / DP mixture (5.2) predictive densities.
5.2 Acidity data: Log predictive scores (5.36) for transformation / DP mixture (5.2) predictive densities.
5.3 Griffin (2010) predictive scores for the acidity data.
List of Figures
4.1 DPM density estimates (dashed lines) based on samples of size 100 for two examples of the two-piece distributions (the true densities are shown in solid black). The leftmost density is symmetric, but has $t_2$ tails. The rightmost density has Gaussian tails, but is right-skewed.
4.2 Illustration of the Transformation-DPM technique. The heavy-tailed sample (left column, A1-A3) and skewed sample (right column, B1-B3) of figure 4.1 are transformed according to the symmetrizing and tail-shortening transformations of section 2. The DPM model is fit to the transformed samples in the bottom panels, then back-transformed to give the TDPM estimate on the original scale.
4.3 The families $\tilde{\varphi}_{(1,\theta_1)}$ of Yeo-Johnson transformations and $\tilde{\varphi}_{(2,\theta_2)}$ of t-to-Normal cdf transformations.
4.4 Comparison of Hellinger error for four density estimates, KDE, TKDE, DPM, and TDPM. 20 replicate samples from each scenario and sample size. Horizontal dotted line represents median Hellinger distance obtained by fitting Griffin's (2010) model without transformation.
4.5 Top: comparison of density estimates (via KDE and DPM) with and without transformation, based on a single sample of size n = 200 from the skewed and heavy-tailed density with $\sigma_2 = 5$ and g(·) a t distribution with 2 degrees of freedom. Bottom: quantile-quantile plot comparing estimated to true quantiles for the same sample and model fits. Dashed and dotted lines represent point estimates of the quantiles, while the shaded regions represent pointwise 90% credible intervals for the true quantile based on the Bayesian fits.
4.6 Illustration of the estimated transformations based on 20 simulated samples of size n = 200 from each of the four two-piece scenarios (a)-(d).
4.7 Quantiles of the empirical distribution of BMI, separated by gender and age.
4.8 Comparison of the average log predictive likelihood of the holdout cases (4.18) based on the DPM and TDPM fits.
5.1 Unemployment rates by US county, continental 48 states, BLS, May 2017.
5.2 Unemployment rates plotted against the size of the labor force for 3219 counties.
5.3 From left to right: estimated transformation, DP mixture predictive density, and posterior distribution for number of occupied mixture components for the unemployment data. Five transformation methods.
5.4 From left to right: estimated transformation, DP mixture predictive density, and posterior distribution for number of occupied mixture components for the acidity data. Five transformation methods.
5.5 Densities $f_0$ for Monte Carlo simulation study.
5.6 Median log Hellinger distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median log Hellinger distance for the normal density estimated with no transformation.
5.7 Hellinger distance between posterior predictive densities and true $f_0$, K = 80 Monte Carlo simulations. First three scenarios (a)-(c).
5.8 Hellinger distance between posterior predictive densities and true $f_0$, K = 80 Monte Carlo simulations. Last three scenarios (d)-(f).
5.9 Median log integrated squared error distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation.
5.10 Median log Kullback-Leibler divergence for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation.
5.11 Median log Kullback-Leibler divergence for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation.
5.12 Median log total variation distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median log total variation distance for the normal density estimated with no transformation.
Chapter 1: Introduction and Theoretical Motivation
A classic problem of nonparametric statistical inference involves the estimation of an unknown probability density. Given a sample $X_1, \ldots, X_n$ drawn independently from a distribution $F_0$, the aim is to produce an estimate $\hat{f}_n$ of the density $f_0$ associated with $F_0$. In the absence of strong assumptions on the form of the distribution, the inferential problem is fundamentally nonparametric. In order to produce a "good" estimator $\hat{f}_n$ — which can well-approximate any $f_0$ in a broad class of densities — the complexity of $\hat{f}_n$ must be allowed to increase with the sample size. A drawback of such nonparametric analysis is that the estimators, in comparison to their parametric counterparts, typically converge more slowly as the sample size grows. This stylized fact holds true in many situations in both frequentist and Bayesian inference.
In this dissertation, we propose a strategy for inference on unknown densities which divides the estimation into two stages: the first is parametric and low-dimensional, and is used to transform the estimation problem into one which is more easily addressed by the second, nonparametric stage. This semiparametric approach balances the benefits and drawbacks of parametric and nonparametric estimation, while also making combined use of Bayesian and frequentist inference.
The current chapter sets the stage for this work. The sections to follow give an informal overview of some famous results in large-sample theory which illustrate ideas
pertinent to this thesis. The aim is to provide a theoretical context for our methods, and to demonstrate the promise of estimation strategies which combine parametric and nonparametric methods in a sensible way. We withhold some mathematical detail in the interest of a discussion that highlights a few central ideas:
1. Parametric and nonparametric estimators are governed by different asymptotic rates of convergence, a fact which is well-known. This idea is illustrated by some famous frequentist estimators in Section 1.1.

2. Analogous convergence concepts hold true in Bayesian inference, though the notion of convergence differs.

3. Large-sample properties of nonparametric Bayesian models are difficult to establish analytically. In Bayesian density estimation, the strongest available results require moderately strong assumptions about the data-generating processes.

4. We argue for transforming the nonparametric problem to meet these assumptions by way of a parametric "pre-analysis." Since the pre-analysis is governed by a comparatively fast asymptotic rate, it does not slow down the overall convergence rate for the estimation as a whole.
This introductory chapter is organized as follows. Section 1.1 considers two well-known frequentist parametric and nonparametric estimators, highlighting the difference in convergence rates as the sample size grows. Section 1.2 reviews some famous results from the large-sample theory of Bayesian posteriors. Section 1.3 gives an informal argument for the usefulness of our semiparametric strategy for density estimation, and outlines the remainder of the thesis.¹

¹ While this chapter outlines a theoretical motivation for the methods described in this thesis, the techniques presented here are also inspired by a line of work on frequentist density estimation. This parallel motivation is described in Chapter 2.
1.1 Parametric and Nonparametric Asymptotics by Example
Faced with the problem of using the sample $X_1, \ldots, X_n$ to infer features of the generating distribution $F_0$, we give two simple examples from frequentist inference to illustrate the difference in asymptotics between parametric and nonparametric estimation.
A parametric strategy for estimation assumes that $F_0$ belongs to a family of distributions indexed by a finite-dimensional Euclidean parameter. The nonparametric approach treats the distribution $F_0$ as entirely unknown, making minimal assumptions about its analytic properties. The next examples discuss two famous estimators.
Example 1. Suppose one models the density $f_0$ as belonging to a family $\{f_\theta : \theta \in \Theta\}$ indexed by a finite-dimensional parameter $\theta$ belonging to $\Theta \subseteq \mathbb{R}^p$. To believe $f_0$ truly belongs to this family may be naïve. Rather, the analyst may consider $f_\theta$ a reasonable approximation to the truth — a model which simplifies the problem of estimating $f_0$, an infinite-dimensional quantity, to that of estimating the finite-dimensional $\theta$. The maximum-likelihood estimator $\hat{\theta}_n$ is computed by maximizing the log-likelihood function
$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log f_\theta(X_i)$$
with respect to $\theta$. Under quite general conditions on the form of $f_0$ and on the family $f_\theta$ (see Van der Vaart (1998); Huber (1967)), a typical result is that
$$\sqrt{n}\,\bigl(\hat{\theta}_n - \theta_0\bigr) \Rightarrow N(0, \Sigma), \tag{1.1}$$
where $\theta_0$ maximizes the entropy functional $\int \log f_\theta(x)\, f_0(x)\, dx$, and $\Sigma$ is an asymptotic covariance whose form is akin to the inverse Fisher information,
$$\Sigma^{-1} = -\int \frac{d^2}{d\theta^2} \log f_\theta(x)\, f_0(x)\, dx.$$
The notation $Z_n \Rightarrow Z$ indicates weak convergence of the distribution functions of the $Z_n$ to the distribution of $Z$.
Attention should be paid to the normalizing sequence $\sqrt{n}$ in (1.1), which indicates the rate at which the discrepancy $\hat{\theta}_n - \theta_0$ decays in probability to zero. This rate is typical of the asymptotic behavior of parametric estimators.
As our focus is density estimation, we make an additional comment that the limit $f_{\theta_0}$ of the estimated density $f_{\hat{\theta}_n}$ is not, in general, the same as the density $f_0$ of the sample. Rather, $f_{\theta_0}$ is the density within the family $\{f_\theta\}$ that is nearest to $f_0$ in terms of Kullback-Leibler divergence. That is, $\theta_0$ minimizes $\int \log \frac{f_0}{f_\theta}\, dF_0$ over $\theta \in \Theta$. Hence this parametric strategy cannot produce a consistent estimator unless $f_0$ belongs to the same parametric family (a fact which is obvious intuitively).
The next example discusses a nonparametric kernel estimator which addresses this inconsistency, but does so at the cost of a slower convergence rate than the $\sqrt{n}$ rate of the first example.
Example 2. The well-known kernel density estimator (KDE) constructs an estimate of $f_0$ as
$$\hat{f}_{n,h}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right). \tag{1.2}$$
The function $K$, termed a kernel, is another probability density, and $h > 0$ is known as the bandwidth. The literature concerning (1.2) is extensive. While some additional statistical properties of the KDE will be discussed in the next section, here we briefly address consistency of (1.2) and the associated rate of convergence. While asymptotics of $\hat{f}_{n,h}$ are most commonly considered with respect to $L_2$, or integrated squared-error (ISE) loss, we describe here the $L_1$ asymptotics. The loss function is
integrated absolute error (IAE),
$$d_1\bigl(\hat{f}_{n,h}, f_0\bigr) = \int \bigl|\hat{f}_{n,h}(x) - f_0(x)\bigr|\, dx, \tag{1.3}$$
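As a companion sketch for Example 2 (again illustrative rather than part of the dissertation; NumPy and SciPy are assumed, the standard normal plays the roles of both $f_0$ and the kernel $K$, and the bandwidth is hand-picked), the following evaluates the kernel estimate (1.2) on a grid and approximates the integrated absolute error (1.3) by a Riemann sum.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Sample from an illustrative true density f0 (standard normal here).
n, h = 500, 0.3          # h is a fixed, hand-picked bandwidth for this sketch
x = rng.normal(size=n)

def kde(t, data, h):
    """Kernel density estimate (1.2) with a Gaussian kernel K, evaluated at points t."""
    t = np.atleast_1d(t)
    # (1 / (n h)) * sum_i K((t - X_i) / h), vectorized over the evaluation grid
    return np.mean(stats.norm.pdf((t[:, None] - data[None, :]) / h), axis=1) / h

# Integrated absolute error d_1(f_hat, f0) of (1.3), approximated on a fine grid.
grid = np.linspace(-5.0, 5.0, 2001)
dx = grid[1] - grid[0]
iae = np.sum(np.abs(kde(grid, x, h) - stats.norm.pdf(grid))) * dx
print(iae)
```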