
Transformations and Bayesian Estimation of Skewed and Heavy-Tailed Densities

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Andrew Bean, B.A., M.S.

Graduate Program in Statistics

The Ohio State University

2017

Dissertation Committee:

Xinyi Xu, Co-Advisor

Steven N. MacEachern, Co-Advisor

Yoonkyung Lee

Matthew T. Pratola

© Copyright by

Andrew Bean

2017

Abstract

In analysis applications characterized by large and possibly irregular data sets, nonparametric statistical techniques aim to ensure that, as the sample size grows, all unusual features of the data generating process can be captured. Good large-sample performance can be guaranteed in broad classes of problems. Yet within these broad classes, some problems may be substantially more difficult than others.

This fact, long recognized in classical nonparametrics, also holds in the growing field of Bayesian nonparametrics, where flexible prior distributions are developed to allow for an infinite-dimensional set of possible truths.

This dissertation studies the Bayesian approach to the classic problem of nonparametric density estimation, in the presence of specific irregularities such as heavy tails and skew. The problem of estimating an unknown probability density is recognized as being harder when the density is skewed or heavy tailed than when it is symmetric and light-tailed. It is a more challenging problem for classical kernel density estimation, where the expected squared-error loss is higher for heavier tailed densities. It is also a more challenging problem in Bayesian density estimation, where heavy tails preclude the analytical treatment required to establish a large-sample convergence rate for the popular Dirichlet process (DP) mixture model.

Our proposed approach addresses these features by incorporating a low-dimensional parametric transformation of the sample, estimated from the data, with the aim of setting up an easier density estimation problem on the transformed scale. This strategy was proposed earlier in combination with kernel density estimators, and we illustrate its usefulness in the Bayesian context. Further, we develop a set of transformations estimated in a way to ensure that the fastest proven convergence rate for the DP mixture is applicable to the transformed problem.

The transformation-density estimation technique makes advantageous use of a parametric pre-analysis to address specific irregularities in the data generating process. Since the parametric stage is low-dimensional, and governed by a faster convergence rate, the asymptotic performance of the model is enhanced without slowing down the overall convergence rate. We consider other settings where this recipe for semiparametric analysis — with parametric sub-analyses designed to address specific irregularities, or to simplify the main nonparametric component of the analysis — might be beneficial.

To my family.

Acknowledgments

This dissertation is the culmination of a six-year odyssey at Ohio State. I was “carried to Ohio in a swarm of bees,” as the rock band The National put it. (Figuratively, of course, when it comes to the bees.) My studies have taken me far from my home state of Arizona. They have taken me far from friends and family, who, despite the distance, have continued to provide much-needed balance during my time in Ohio. In the end, Columbus too proved to be a wonderful home away from home.

This is mostly thanks to the people I have met here, including the faculty and staff in the Department of Statistics, and friends, roommates, and classmates throughout the years. I will miss them, and will look back on these years with fondness.

I would not have reached this point without the support of several people I want to thank individually. My advisors Xinyi Xu and Steve MacEachern are tremendous role models, both personally and as teachers and researchers. They were unfailingly supportive and encouraging, even when my work did not proceed smoothly. It has been a privilege to work with them, and I look forward to continuing collaboration in the future. I thank Yoon Lee and Matt Pratola for serving on the committee for this dissertation. Their perspective and input on this work are invaluable. I have loved teaching statistics at Ohio State (loved everything but the grading, that is), and I became a better teacher during my time here thanks to Michelle Everson, Jonathan Baker, Laura Kubatko, and others. I also thank the faculty and staff of the Mathematics and Computer Science Department at Colorado College; the wonderful teachers there inspired me to pursue my studies this far.

Most of all, I am grateful to my family — my parents Jeff and Sydney, and my brother Owen — for their love and support as I worked towards this degree. I am privileged to have had parents and grandparents who gave me the opportunity to succeed at every level of my education. The only way to properly give thanks for this gift is to pass it on. I can only hope to do so with the same selflessness. Lastly, to Yi: the greatest fortune I’ve had during my time in Ohio is to have found a partner like you. The daily ups and downs of doctoral studies, the halls of Cockins: neither are romantic, but both seemed that way as we navigated them together. I look forward to writing our next chapters together.

Vita

August 3, 1987 ...... Born - San Francisco, CA, USA

2009 ...... B.A. Mathematics, The Colorado College.

2012 ...... M.S. Statistics, The Ohio State University.

2012-present ...... Graduate Teaching Associate, The Ohio State University.

Publications

Research Publications

A. Bean, X. Xu, S.N. MacEachern, “Transformations and Bayesian Density Estimation”. Electronic Journal of Statistics, 10(2):3355-3373, Nov. 2016.

Fields of Study

Major Field: Statistics

Table of Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita...... vii

List of Tables ...... xi

List of Figures ...... xii

1. Introduction and Theoretical Motivation ...... 1

1.1 Parametric and Nonparametric Asymptotics by Example ...... 3
1.2 Asymptotic Properties of Bayesian Posteriors ...... 6
1.2.1 Parametric Bayesian Models ...... 7
1.2.2 Nonparametric Bayesian Models ...... 9
1.3 Outline of the Thesis ...... 12

2. Frequentist Transformation-Density Estimation ...... 14

2.1 Parzen-Type Kernel Density Estimators ...... 15
2.1.1 Asymptotic properties of the kernel density estimator ...... 16
2.2 Density Estimation with Transformations ...... 19

2.2.1 An L2 Criterion for Selecting Transformations ...... 20
2.3 Transformation Families ...... 27
2.3.1 Parametric transformation, nonparametric density estimation ...... 27
2.3.2 Nonparametric transformation, nonparametric density estimation ...... 29
2.3.3 Other Transformation Families ...... 30

3. Density Estimation using Dirichlet Process Mixtures ...... 35

3.1 Nonparametric Prior Construction ...... 35
3.1.1 The Dirichlet Process ...... 36
3.1.2 Dirichlet process mixtures ...... 38
3.2 Posterior computation for DP mixtures ...... 41
3.3 Performance of Dirichlet process mixtures for density estimation ...... 43
3.3.1 Measurements of accuracy for density estimation ...... 43
3.3.2 Constructions for the DP mixture prior, and finite-sample performance ...... 46
3.3.3 Asymptotic properties of DP mixtures ...... 48

4. Iterative Transformation and Bayesian Density Estimation ...... 53

4.1 Background ...... 53
4.2 Transformations ...... 56
4.2.1 Family of Transformations ...... 60
4.2.2 A Criterion for Estimating Transformation Parameters ...... 62
4.2.3 Iterative Transformation Selection ...... 66
4.3 Simulation Study ...... 67
4.3.1 Simulation Design ...... 67
4.3.2 Simulation Results ...... 69
4.4 An Application to BMI Modeling ...... 74
4.5 Discussion ...... 77

5. Heavy-Tailed Density Estimation Using Transformations and DP Mixtures 80

5.1 Motivation ...... 80
5.2 DP Mixture Asymptotics and Distribution Tails ...... 82
5.2.1 Convergence Rates for Sub-Gaussian Tails ...... 84
5.2.2 Characterization of Heavy-Tailed Distributions ...... 85
5.3 A Family of Transformations for Heavy-Tailed Densities ...... 86
5.3.1 Transformations to sub-Gaussian tails ...... 86
5.3.2 Skew-t cdf-inverse-cdf transformations ...... 89
5.3.3 Estimating Skew-t Transformation Parameters ...... 90
5.4 Simulations and Data Analysis ...... 98
5.4.1 Data Analysis ...... 99
5.4.2 Simulation Study ...... 107
5.5 Discussion ...... 110

6. Extensions and Future Work ...... 120

6.1 Multivariate Density Estimation ...... 120
6.1.1 Transformations to Multivariate sub-Gaussianity ...... 123
6.1.2 Bayesian Analysis of Heavy-Tailed ...... 126
6.1.3 Regression with a Heavy-Tailed Error Distribution ...... 127
6.2 A Recipe for Efficient Semiparametric Analysis ...... 130

Appendices 131

A. Asymptotic Expansion of MISE for Kernel Estimates ...... 131

B. Estimating integrated squared second derivatives ...... 133

Bibliography ...... 136

List of Tables

Table Page

4.1 Ohio Family Health Survey (2008) sample sizes, divided into training and holdout samples...... 77

5.1 Unemployment data: Log predictive scores (5.36) for transformation / DP mixture (5.2) predictive densities...... 102

5.2 Acidity data: Log predictive scores (5.36) for transformation / DP mixture (5.2) predictive densities...... 106

5.3 Griffin (2010) predictive scores for the acidity data...... 107

List of Figures

Figure Page

4.1 DPM density estimates (dashed lines) based on samples of size 100 for two examples of the two-piece distributions (the true densities are shown in solid black). The leftmost density is symmetric, but has t2 tails. The rightmost density has Gaussian tails, but is right-skewed. ...... 54

4.2 Illustration of the Transformation-DPM technique. The heavy-tailed sample (left column, A1-A3) and skewed sample (right column, B1-B3) of figure 4.1 are transformed according to the symmetrizing and tail-shortening transformations of section 2. The DPM model is fit to the transformed samples in the bottom panels, then back-transformed to give the TDPM estimate on the original scale. ...... 57

4.3 The families $\tilde\varphi_{(1,\theta_1)}$ of Yeo-Johnson transformations and $\tilde\varphi_{(2,\theta_2)}$ of t-to-Normal cdf transformations. ...... 61

4.4 Comparison of Hellinger error for four density estimates, KDE, TKDE, DPM, and TDPM. 20 replicate samples from each scenario and sample size. Horizontal dotted line represents median Hellinger distance obtained by fitting Griffin’s (2010) model without transformation. ...... 70

4.5 Top: comparison of density estimates (via KDE and DPM) with and without transformation, based on a single sample of size n = 200 from the skewed and heavy tailed density with σ2 = 5 and g(·) a t distribution with 2 degrees of freedom. Bottom: quantile-quantile plot comparing estimated to true quantiles for the same sample and model fits. Dashed and dotted lines represent point estimates of the quantiles, while the shaded regions represent pointwise 90% credible intervals for the true quantile based on the Bayesian fits. ...... 72

4.6 Illustration of the estimated transformations based on 20 simulated samples of size n = 200 from each of the four two-piece scenarios (a)-(d). ...... 73

4.7 Quantiles of the empirical distribution of BMI, separated by gender and age...... 75

4.8 Comparison of the average log predictive likelihood of the holdout cases (4.18) based on the DPM and TDPM fits...... 76

5.1 Unemployment rates by US county, continental 48 states, BLS, May 2017...... 100

5.2 Unemployment rates plotted against the size of the labor force for 3219 counties...... 101

5.3 From left to right: estimated transformation, DP mixture predictive density, and posterior distribution for number of occupied mixture components for the unemployment data. Five transformation methods. ...... 103

5.4 From left to right: estimated transformation, DP mixture predictive density, and posterior distribution for number of occupied mixture components for the acidity data. Five transformation methods. . . . . 105

5.5 Densities f0 for Monte Carlo simulation study...... 108

5.6 Median log Hellinger distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median log Hellinger distance for the normal density estimated with no transformation. . . 111

5.7 Hellinger distance between posterior predictive densities and true f0, K = 80 Monte Carlo simulations. First three scenarios (a)-(c). . . . . 114

5.8 Hellinger distance between posterior predictive densities and true f0, K = 80 Monte Carlo simulations. Last three scenarios (d)-(f). . . . . 115

5.9 Median log integrated squared error distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation...... 116

5.10 Median log Kullback-Leibler divergence for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation. ...... 117

5.11 Median log Kullback-Leibler for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation. ...... 118

5.12 Median log total variation distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median log total variation distance for the normal density estimated with no transformation. ...... 119

Chapter 1: Introduction and Theoretical Motivation

A classic problem of nonparametric statistics involves the estimation of an unknown probability density. Given a sample $X_1,\dots,X_n$ drawn independently from a distribution F0, the aim is to produce an estimate $\hat f_n$ of the density f0 associated with F0. In the absence of strong assumptions on the form of the distribution, the inferential problem is fundamentally nonparametric. In order to produce a “good” estimator $\hat f_n$ — which can well-approximate any f0 in a broad class of densities — the complexity of $\hat f_n$ must be allowed to increase with the sample size. A drawback of such nonparametric analysis is that the estimators, in comparison to their parametric counterparts, typically converge more slowly as the sample size grows. This stylized fact holds true in many situations in both frequentist and Bayesian inference.

In this dissertation, we propose a strategy for inference on unknown densities which divides the estimation into two stages: the first is parametric and low-dimensional, and is used to transform the estimation problem into one which is more easily addressed by the second, nonparametric stage. This semiparametric approach balances the benefits and drawbacks of parametric and nonparametric estimation, while also making combined use of Bayesian and frequentist methods.

The current chapter sets the stage for this work. The sections to follow give an informal overview of some famous results in large-sample theory which illustrate ideas pertinent to this thesis. The aim is to provide a theoretical context for our methods, and to demonstrate the promise of estimation strategies which combine parametric and nonparametric methods in a sensible way. We withhold some mathematical detail in the interest of a discussion that highlights a few central ideas.

1. Parametric and nonparametric estimators are governed by different asymptotic rates of convergence, a fact which is well-known. This idea is illustrated by some famous frequentist estimators in Section 1.1.

2. Analogous convergence concepts hold true in Bayesian inference, though the notion of convergence differs.

3. Large-sample properties of nonparametric Bayesian models are difficult to establish analytically. In Bayesian density estimation, the strongest available results require moderately strong assumptions about the data-generating processes.

4. We argue for transforming the nonparametric problem to meet these assumptions by way of a parametric “pre-analysis.” Since the pre-analysis is governed by a comparatively fast asymptotic rate, it does not slow down the overall convergence rate for the estimation as a whole.

This introductory chapter is organized as follows. Section 1.1 considers two well-known frequentist parametric and nonparametric estimators, highlighting the difference in convergence rates as the sample size grows. Section 1.2 reviews some famous results from the large-sample theory of Bayesian posteriors. Section 1.3 gives an informal argument for the usefulness of our semiparametric strategy for density estimation, and the remainder of the thesis is outlined.[1]

1.1 Parametric and Nonparametric Asymptotics by Example

Faced with the problem of using the sample X1,...,Xn to infer features of the generating distribution F0, we give two simple examples from frequentist inference to illustrate the difference in asymptotics for parametric and nonparametric inference.

A parametric strategy for estimation assumes that F0 belongs to a family of distributions indexed by a finite-dimensional Euclidean parameter. The nonparametric approach treats the distribution F0 as entirely unknown, making minimal assumptions about its analytic properties. The next examples discuss two famous estimators.

Example 1. Suppose one models the density f0 as belonging to a family $\{f_\theta : \theta \in \Theta\}$ indexed by a finite-dimensional parameter θ belonging to $\Theta \subseteq \mathbb{R}^p$. To believe f0 truly belongs to this family may be naïve. Rather, the analyst may consider $f_\theta$ a reasonable approximation to the truth — a model which simplifies the problem of estimating f0, an infinite-dimensional quantity, to that of estimating the finite-dimensional θ. The maximum-likelihood estimator $\hat\theta_n$ is computed by maximizing the log-likelihood

$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^n \log f_\theta(X_i)$$

[1] While this chapter outlines a theoretical motivation for the methods described in this thesis, the techniques presented here are also inspired by a line of work on frequentist density estimation. This parallel motivation is described in Chapter 2.

with respect to θ. Under quite general conditions on the form of f0 and on the family $f_\theta$ (see Van der Vaart (1998); Huber (1967)), a typical result is that

$$\sqrt{n}\left(\hat\theta_n - \theta_0\right) \Rightarrow N(0, \Sigma), \qquad (1.1)$$

where θ0 maximizes the entropy functional $\int \log f_\theta(x)\, f_0(x)\, dx$, and Σ is an asymptotic covariance matrix whose form is akin to the inverse Fisher information,

$$\Sigma^{-1} = -\int \frac{d^2}{d\theta^2} \log f_\theta(x)\, f_0(x)\, dx.$$

The notation $Z_n \Rightarrow Z$ indicates weak convergence of the distribution functions of the $Z_n$ to the distribution of Z.

Attention should be paid to the normalizing sequence $\sqrt{n}$ in (1.1), which indicates the rate at which the discrepancy $\hat\theta_n - \theta_0$ decays in probability to zero. This rate is typical of the asymptotic behavior of parametric estimators.

As our focus is density estimation, we make an additional comment that the limit $f_{\theta_0}$ of the estimated density $f_{\hat\theta_n}$ is not, in general, the same as the density f0 of the sample. Rather, $f_{\theta_0}$ is the density within the family $\{f_\theta\}$ that is nearest to f0 in terms of Kullback-Leibler divergence. That is, θ0 minimizes $\int \log \frac{f_0}{f_\theta}\, dF_0$ over θ ∈ Θ. Hence this parametric strategy cannot produce a consistent estimator unless f0 belongs to the same parametric family (a fact which is obvious intuitively).
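To make Example 1 concrete, the short sketch below (Python with NumPy/SciPy; the Gamma data-generating choice and sample size are illustrative assumptions, not taken from the text) maximizes $M_n(\theta)$ numerically for a normal working family fit to skewed data. The fitted parameters settle near the normal distribution with the same mean and variance as f0, which is the Kullback-Leibler projection of f0 onto the family, rather than recovering f0 itself.

```python
# Illustrative fit of a (misspecified) normal working family by maximum likelihood.
# The data come from a skewed Gamma density, so f0 lies outside the family; the
# estimate converges to the KL-closest normal (mean 3.0, sd approx. 2.12), not to f0.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=2000)      # skewed "true" f0

def neg_Mn(theta):
    """Negative of M_n(theta) = (1/n) sum_i log f_theta(X_i), theta = (mu, log sigma)."""
    mu, log_sigma = theta
    return -np.mean(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

fit = optimize.minimize(neg_Mn, x0=np.array([0.0, 0.0]))
print(fit.x[0], np.exp(fit.x[1]))                   # roughly (3.0, 2.12)
```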

The next example discusses a nonparametric kernel estimator which addresses this inconsistency, but does so to the detriment of the rate described in the first example.

Example 2. The well-known kernel density estimator (KDE) constructs an estimate of f0 as

$$\hat f_{n,h}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right). \qquad (1.2)$$

The function K, termed a kernel, is another probability density, and h > 0 is known as the bandwidth. The literature concerning (1.2) is extensive. While some additional statistical properties of the KDE will be discussed in the next section, here we briefly address consistency of (1.2) and the associated rate of convergence. While asymptotics of $\hat f_{n,h}$ are most commonly considered with respect to L2, or integrated squared-error (ISE) loss, we describe here the L1 asymptotics. The loss function is the integrated absolute error (IAE),

$$d_1\!\left(\hat f_{n,h}, f_0\right) = \int \left|\hat f_{n,h}(x) - f_0(x)\right| dx, \qquad (1.3)$$

and consistency in L1 requires $d_1(\hat f_{n,h}, f_0) \to 0$; if the limit is in probability, $\hat f_{n,h}$ is called weakly consistent, while if the limit is almost sure, $\hat f_{n,h}$ is strongly consistent. Devroye (1983) proved that weak and strong consistency for any f0 require only that h → 0 and nh → ∞ as n → ∞ (in the one-dimensional case), and that the kernel K satisfy K(u) ≥ 0 and $\int K(u)\, du = 1$. Hall and Wand (1988) studied limiting properties of the expectation of (1.3), known as the mean integrated absolute error (MIAE). Those authors established the rate at which the MIAE vanishes to zero as n → ∞. For a second-order kernel K whose first moment is zero and second moment is one, and if f0 has two bounded, continuous derivatives, Hall and Wand (1988) showed

$$\inf_{h>0} E\!\int \left|\hat f_{n,h}(x) - f_0(x)\right| dx = O\!\left(n^{-2/5}\right). \qquad (1.4)$$
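As a concrete reference for (1.2), here is a minimal, self-contained implementation of the estimator with a Gaussian kernel (Python/NumPy). The heavy-tailed t sample and the fixed bandwidth h = 0.4 are illustrative assumptions; bandwidth selection is taken up in Chapter 2.

```python
# A direct implementation of the kernel density estimator (1.2) with a Gaussian kernel.
import numpy as np

def kde(x_grid, sample, h):
    """Evaluate f_hat_{n,h}(x) = (1/(n h)) sum_i K((x - X_i)/h) on a grid."""
    u = (x_grid[:, None] - sample[None, :]) / h          # (grid, n) array of arguments
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)       # standard normal kernel
    return K.mean(axis=1) / h

rng = np.random.default_rng(1)
sample = rng.standard_t(df=3, size=500)                  # heavy-tailed example sample
grid = np.linspace(-8.0, 8.0, 401)
f_hat = kde(grid, sample, h=0.4)
print(f_hat.sum() * (grid[1] - grid[0]))                 # integrates to approximately 1
```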

These examples illustrate that the broader consistency guarantees of nonparametric estimators are obtained at the cost of a slower convergence rate, $n^{-2/5}$, compared to the parametric $n^{-1/2}$. These rates are indicators of asymptotic properties of the estimators, and as such, they do not give information about the quality of the estimators in small samples. However, they remain useful for understanding how the precision of the estimates compares if the sample is moderately large. If n is large, then the distribution of a nonparametric estimator based on a sample of size n displays similar stability to a parametric estimator based on a smaller sample of size $n^{4/5}$. This is the price paid to avoid a potential bias between the limiting parametric estimate $F_{\theta_0}$ and F0, which may be large if the model is badly misspecified.

1.2 Asymptotic Properties of Bayesian Posteriors

Bayesian inference differs fundamentally from the examples in Section 1.1 in that the unknown quantities to be estimated, either the parameter θ or the distribution F0 itself, are treated as random variables. Thus, the parametric model $f_\theta$ represents the conditional distribution of the Xi given a value of the parameter θ ∈ Θ. The distribution Π of θ is known as the prior distribution. The goal of inference is to obtain the posterior distribution of θ conditional on the observables $\mathbf{X}_n = (X_1,\dots,X_n)$. When Π has a density π, and the Xi are modeled as conditionally independent draws from $f_\theta$, the posterior distribution is given by Bayes’ theorem,

$$\pi_n(\theta \mid \mathbf{X}_n) = \frac{\prod_{i=1}^n f_\theta(X_i)\, \pi(\theta)}{\int_\Theta \prod_{i=1}^n f_\theta(X_i)\, \pi(\theta)\, d\theta}. \qquad (1.5)$$

The distribution (1.5) represents the information content that the data $\mathbf{X}_n$ provide regarding θ.
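The mechanics of (1.5) can be seen in a toy numerical example: a normal-mean model with known variance, with the posterior computed on a grid. This is a minimal sketch; the prior, grid, and sample size below are illustrative choices, and in this conjugate case the posterior is also available in closed form.

```python
# Grid approximation of the posterior (1.5) for the mean of a N(theta, 1) model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(1.3, 1.0, size=50)                        # data from theta_true = 1.3

theta_grid = np.linspace(-4.0, 4.0, 2001)
log_lik = stats.norm.logpdf(x[:, None], loc=theta_grid[None, :], scale=1.0).sum(axis=0)
log_prior = stats.norm.logpdf(theta_grid, loc=0.0, scale=5.0)
log_post = log_lik + log_prior                           # numerator of (1.5), on the log scale
post = np.exp(log_post - log_post.max())
post /= post.sum() * (theta_grid[1] - theta_grid[0])     # normalize to a density on the grid

post_mean = np.sum(theta_grid * post) * (theta_grid[1] - theta_grid[0])
print(post_mean)                                         # close to the conjugate posterior mean
```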

The Bayesian approach to modeling raises interesting philosophical questions. Diaconis and Freedman (1986) describe two interpretations of the just-described distributions on Θ. So-called “classical” Bayesians suppose there is a true parameter value θ0, and the aim of sampling and modeling is to recover θ0. The assumed randomness of θ is a vehicle to achieve a natural, convenient framework for inference on θ0. “Subjective” Bayesians, on the other hand, interpret probabilities for θ as degrees of belief, indicating plausibility of various subsets of Θ.

In the classical view, it is natural to ask about the effectiveness of posterior probabilities (1.5) for identifying the true θ0. If the Xi in (1.5) are i.i.d. from $F_{\theta_0}$, we would hope that the posterior distribution converges to one which is degenerate at θ0 — this is one sense in which the model could be called “consistent.” This line of questioning involves frequentist asymptotic properties of Bayesian posteriors, and regardless of philosophy, favorable answers to these questions provide an important justification for the Bayesian approach. In fact, Diaconis and Freedman (1986) argue that frequentist properties are important through a subjective Bayesian lens as well. For example, from a subjective perspective, it would be useful to know that two Bayesians will ultimately arrive at similar inference under the same likelihood, or that a posterior based on a large sample is approximately independent of the prior.

1.2.1 Parametric Bayesian Models

The most famous result in the study of frequentist asymptotics of Bayesian posteriors is the Bernstein-von Mises theorem. Informally, this result asserts that the posterior distribution, centered around its mean, asymptotically resembles the normal distribution (1.1) of the maximum likelihood estimator, centered around the true parameter value. In particular, these distributions shrink in scale at the same $\sqrt{n}$ rate. To be more precise, suppose $X_1,\dots,X_n \overset{\text{iid}}{\sim} F_{\theta_0}$, and assume $F_\theta$ is a sufficiently regular parametric model (see Van der Vaart (1998) Theorem 10.1 and Lemma 10.6 for details). One compares the distance between the posterior distribution of the rescaled parameter $\sqrt{n}(\theta - \theta_0)$,

$$\mu_1(\,\cdot \mid \mathbf{X}_n) = \Pr\!\left\{\sqrt{n}\,(\theta - \theta_0) \in \cdot \mid \mathbf{X}_n\right\} \qquad (1.6)$$

and a normal distribution, with mean equal to a similar rescaling of the MLE $\hat\theta_n = \hat\theta_n(X_1,\dots,X_n)$,

$$\mu_2(\,\cdot \mid \mathbf{X}_n) = N\!\left(\cdot \;\Big|\; \sqrt{n}\,(\hat\theta_n - \theta_0),\; I_{\theta_0}^{-1}\right). \qquad (1.7)$$

The first distribution (1.6) depends on the data through the posterior distribution (1.5) of θ given $\mathbf{X}_n = (X_1,\dots,X_n)$, while the second (1.7) depends on the data through the MLE $\hat\theta_n$. The Bernstein-von Mises Theorem states that the distributions (1.6) and (1.7) differ in total variation by an amount which converges to zero in probability under $\mathbf{X}_n \sim F_{\theta_0}^n$.

This theorem is sometimes used to justify the use of Bayesian credible intervals, as it ensures those intervals have frequentist asymptotic coverage rates that match those of likelihood-based intervals. To set a context for this dissertation, however, it is more important to highlight that the required scaling of the parameter in (1.6) to obtain convergence is again $\sqrt{n}$; thus, the typical parametric rate carries over to Bayesian inference based on sufficiently regular parametric models.

However, when the parametric model is misspecified — if the true data generating process is $F_0 \ne F_{\theta_0}$ — the inference will be inconsistent. Versions of the Bernstein-von Mises Theorem for misspecified models have been established, for example, by Kleijn and Van der Vaart (2012), who showed that the posterior concentrates asymptotically at the same value θ0 of Example 1 in Section 1.1. That is, inference for the parametric Bayesian model concentrates around the $F_{\theta_0}$ that is nearest in K-L divergence to the true F0. This motivates the development of more flexible priors which include a broader class of F0 inside their Kullback-Leibler support.

1.2.2 Nonparametric Bayesian Models

Nonparametric Bayesian models are commonly defined as having infinite-dimensional parameter spaces. This includes models for inference on function spaces, such as the space of continuous densities. In a manner analogous to Section 1.1, the posterior distributions from these models allow broad consistency guarantees, but they converge more slowly than their parametric counterparts. To illustrate with an example, we focus on one well-known prior construction for estimating a continuous density: the Dirichlet process (DP) mixture model. The DP mixture model will be discussed at length in Chapter 3.

Example 3. A popular approach for Bayesian density estimation expresses the density as a countable mixture of continuous kernels. It is assumed that $X_i \overset{\text{iid}}{\sim} f$ for a density f described as

$$f(x) = \int \frac{1}{\sigma} K\!\left(\frac{x - \mu}{\sigma}\right) dG(\mu), \qquad (1.8)$$

where K is the kernel and G is known as the mixing distribution. G and the kernel scale σ > 0 are the unknown parameters in this setup. When G is assigned a Dirichlet Process (Ferguson; 1973) prior, and K is a normal density (a common choice), the model (1.8) is known as a DP mixture of normals. It gives a well-defined prior on the space $\mathcal F$ of continuous densities, and convenient methods are available to simulate the posterior distribution

$$\Pi_n(A \mid \mathbf{X}_n) = \frac{\int_A \prod_{i=1}^n f(X_i)\, d\Pi(f)}{\int_{\mathcal F} \prod_{i=1}^n f(X_i)\, d\Pi(f)}. \qquad (1.9)$$

This construction will be discussed at length in Chapter 3.
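Although the full treatment is deferred to Chapter 3, a single draw from the prior (1.8) can be sketched using a truncated stick-breaking representation of the Dirichlet process. The truncation level, base measure, and precision parameter below are illustrative choices, not those used later in the dissertation.

```python
# One random density from a DP mixture of normals prior, via truncated stick-breaking.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, trunc = 1.0, 200                        # DP precision and truncation level (assumed)
sigma = 0.5                                    # common kernel scale (assumed)

# Stick-breaking weights and atoms: G = sum_k w_k * delta_{mu_k}
v = rng.beta(1.0, alpha, size=trunc)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
mu = rng.normal(0.0, 3.0, size=trunc)          # atoms drawn from a normal base measure

# The corresponding random density f(x) = sum_k w_k * N(x | mu_k, sigma^2)
grid = np.linspace(-10.0, 10.0, 400)
f = (w[None, :] * stats.norm.pdf(grid[:, None], loc=mu[None, :], scale=sigma)).sum(axis=1)
print(f.sum() * (grid[1] - grid[0]))           # approximately 1, up to truncation error
```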

For now, we simply highlight the similarity in form of (1.8) and the kernel density estimator (1.2), which can alternatively be expressed as

$$\hat f_{n,h}(x) = \int \frac{1}{h} K\!\left(\frac{x - \mu}{h}\right) d\hat F_n(\mu).$$

The mixing distribution in this construction is the empirical distribution of the sample,

$$\hat F_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i},$$

consisting of n atomic measures $\delta_{X_i}$ located at each data point, each with equal probability weight 1/n.

Continuing our exploration of frequentist asymptotic properties, we assume f0 is the true density of the Xi, and study the posterior mass assigned by (1.9) to neighborhoods of f0 with respect to the Hellinger metric,

$$d_H(f, f_0) = \left(\int \left(\sqrt{f_0(x)} - \sqrt{f(x)}\right)^2 dx\right)^{1/2}.$$

Several researchers have established asymptotic properties of the DP mixture that are analogous to the kernel estimator. For example, it has been shown under weak conditions (see Ghosal, Ghosh and Ramamoorthi (1999)) that the posterior (1.9) from the DP mixture of normals is consistent, in the sense that for virtually any continuous f0, and any ε > 0,

$$\Pi_n\!\left(\left\{f : d_H(f, f_0) > \varepsilon\right\} \,\Big|\, \mathbf{X}_n\right) \xrightarrow{\,F_0^n\,} 0.$$

That is, the posterior concentrates, in the limit as n → ∞, on Hellinger neighborhoods of the truth.
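Since the Hellinger metric recurs throughout these results, a short numerical helper is sketched below; the two densities compared are arbitrary examples, used only to show the computation.

```python
# Numerical Hellinger distance d_H(f, f0) = ( int (sqrt(f0) - sqrt(f))^2 dx )^{1/2}.
import numpy as np
from scipy import stats

def hellinger(f, f0, grid):
    dx = grid[1] - grid[0]
    return np.sqrt(np.sum((np.sqrt(f0) - np.sqrt(f))**2) * dx)

grid = np.linspace(-20.0, 20.0, 4001)
f0 = stats.t.pdf(grid, df=2)                      # heavy-tailed "truth" (example)
f = stats.norm.pdf(grid, loc=0.0, scale=1.4)      # a light-tailed approximation (example)
print(hellinger(f, f0, grid))                     # a value in [0, sqrt(2)]
```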

To understand the rate of convergence of the DP mixture posterior, one considers neighborhoods $\{f : d_H(f, f_0) \le M\varepsilon_n\}$, with M > 0, which shrink towards f0 at a rate determined by the sequence $\varepsilon_n \to 0$. If the posterior concentrates asymptotically on these increasingly small sets,

$$\Pi_n\!\left(\left\{f : d_H(f, f_0) > M\varepsilon_n\right\} \,\Big|\, \mathbf{X}_n\right) \xrightarrow{\,F_0^n\,} 0, \qquad (1.10)$$

then $\varepsilon_n$ is an upper bound for the convergence rate.

Establishing rates $\varepsilon_n$ of posterior convergence is a difficult analytical endeavor. The most fruitful efforts (e.g. Ghosal and Van der Vaart (2001, 2007); Shen, Ghosal and Tokdar (2013)) require troubling restrictions on F0 and on the form of the prior. For example, Ghosal and Van der Vaart (2007) found a rate of convergence

$$\varepsilon_n = n^{-2/5} (\log n)^2 \qquad (1.11)$$

that matches the L1 rate (1.4) of the KDE up to a logarithmic factor. However, this rate is only theoretically guaranteed under several somewhat restrictive assumptions.

1. The kernel scale σ > 0 is given a rather unnatural sequence of priors

$$\pi_{n,\sigma}(\sigma) = \sigma_n^{-1}\, g(\sigma/\sigma_n),$$

where g is compactly supported, and the scale of the prior shrinks according to a sequence of real numbers $\sigma_n \sim n^{-1/5}$.

2. The density f0 is smooth, with two continuous derivatives satisfying $\int (f_0''/f_0)^2 f_0\, d\lambda < \infty$ and $\int (f_0'/f_0)^4 f_0\, d\lambda < \infty$.

3. The distribution F0 has light, sub-Gaussian tails,

$$1 - F_0\!\left([-a, a]\right) \le c_1 e^{-c_2 a^2} \qquad (1.12)$$

for positive constants c1 and c2.

The tail condition (1.12) is perhaps most problematic, as it excludes distributions with polynomial tails that behave asymptotically like Gamma or Pareto densities, including Student’s t.

1.3 Outline of the Thesis

The preceding tour of examples from frequentist and Bayesian, parametric and nonparametric inference provides a theoretical backdrop for the approach to Bayesian density estimation proposed in this thesis. Density estimation requires the flexibility of nonparametric techniques. As the density may possess any number of irregular features — extreme skew, high curvature, multiple modes, heavy tails, and so on — we prefer estimators whose theoretical performance does not suffer in the presence of such features. Yet for Bayesian density estimation, favorable asymptotic performance is only theoretically guaranteed in the absence of heavy tails.

For densities which possess specific irregularities such as skew and heavy tails, we propose to address those features with a tailored parametric analysis. The analysis is designed to transform the density estimation problem to one which is theoretically more tractable. In terms of asymptotic performance, this strategy provides benefits — for example, the rate (1.11) could be applicable to a broader class of densities — while exacting little cost. The parametric stage is governed by a faster rate than the nonparametric density estimation, so the overall rate for the problem is unchanged.

The remainder of the thesis is organized as follows. Chapters 2 and 3 provide additional background. Chapter 2 reviews transformation-density estimation strategies that have proved useful in frequentist kernel estimation. Chapter 3 discusses the DP mixture model in greater detail. In Chapter 4, we adapt a classical transformation strategy from kernel estimation and demonstrate its usefulness in the Bayesian context. Chapter 5 proposes a new set of transformations for estimating heavy-tailed densities which asymptotically ensure that the tail condition (1.12) is satisfied. In Chapter 6, we explore possible extensions and future work.

Chapter 2: Frequentist Transformation-Density Estimation

Chapter 1 drew a contrast between nonparametric procedures — which attempt to estimate quantities, such as probability densities, belonging to infinite-dimensional spaces — and parametric models possessing only finite-dimensional parameter spaces.

A nonparametric approach is necessary for a problem like density estimation if one is interested in favorable asymptotic properties, such as consistency. Yet nonparametric estimators such as (1.2) converge more slowly than their parametric counterparts.

In this chapter, we review a line of research which combines a standard nonparametric kernel density estimate with a parametric transformation to achieve improved performance. The technique, known as transformation-density estimation, is remarkably effective. We highlight two reasons for its success.

1. The strategy recognizes that some nonparametric estimation problems are more challenging than others, both in terms of empirical performance and theoretical asymptotic properties of the usual estimators.

2. The transformations employed are low-dimensional, indexed by a small number of parameters. As such, they can be estimated from data at a fast asymptotic rate.

In subsequent chapters, we will argue that both points apply equally to Bayesian density estimation.

Before reviewing the literature on transformation kernel density estimation, we review some additional properties of the kernel density estimator.

2.1 Parzen-Type Kernel Density Estimators

We recall from Example 2 of Section 1.1 the famous kernel density estimate (KDE),

$$\hat f_{n,h}(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right). \qquad (2.1)$$

The KDE (2.1) is sometimes referred to as the Parzen,[2] Rosenblatt, or “window” estimate. Smaller values of the bandwidth h produce estimates with more “wiggles,” while larger values produce smoother estimates, typically with fewer modes. The derivations of Appendix A show how this bandwidth relates to a bias-variance trade-off for $\hat f_{n,h}$ as an estimator of f0. Estimators with larger h generally have smaller variance and larger bias. In practice, the analyst must decide on a rule for setting the bandwidth. The discussion to follow will focus on one approach, known as the “plug-in” technique (Silverman; 1986; Park and Marron; 1990; Sheather and Jones; 1991).

The first appearance of (2.1) was in the context of classification and discriminant analysis, in an unpublished technical report on discriminant analysis by Fix and Hodges (1949) which generalized Fisher’s linear discriminant analysis by relaxing parametric assumptions on the class-specific densities. In doing so, the authors introduced a form of (2.1) with K being a uniform density over [−1, 1]. Such estimates have proved to be of tremendous independent theoretical interest and practical utility. In Section 2.1.1, we discuss asymptotic properties of (2.1) and motivate the transformation techniques of Section 2.2.

[2] So named for Emanuel Parzen, whose 1962 paper (Parzen; 1962) established conditions for uniform consistency of (2.1) pointwise over x, as well as the consistency of (2.1) for estimating the location of and curvature at a unique mode of f0.

2.1.1 Asymptotic properties of the kernel density estimator

Rosenblatt (1956) examined large-sample properties of $\hat f_{n,h}(x)$ with respect to the L2 risk, or the mean integrated squared error

$$R_2\!\left(\hat f_{n,h}, f_0\right) = E\!\left[d_2\!\left(\hat f_{n,h}, f_0\right)\right] = E\!\left[\int \left(\hat f_{n,h}(x) - f_0(x)\right)^2 dx\right]. \qquad (2.2)$$

The expectation above is taken over $X_1,\dots,X_n \sim f_0$. This quantity is of great interest to later developments, and will be discussed in detail in Section 2.2. The following asymptotic expansion, first given by Rosenblatt (1956), is quite common (e.g. Silverman (1986), section 3.3). For a derivation, see Appendix A.

Lemma 1 (Asymptotic expansion of MISE for kernel estimates). We study (2.1) with a kernel satisfying K(u) = K(−u), $\int K(u)\, du = 1$ and $\int u^2 K(u)\, du < \infty$. Under an asymptotic regime in which h = h(n) → 0 and nh → ∞ as n → ∞, and provided that f0 belongs to the second-order Hölder class,

$$\{f : |f'(x) - f'(y)| \le L|x - y| \text{ for all } x, y\}, \qquad (2.3)$$

then (2.2) has the following expansion:

$$E\!\left[d_2\!\left(\hat f_{n,h}, f_0\right)\right] = \frac{1}{nh} R(K) + \frac{h^4 \sigma_K^4}{4} R(f_0'') + o\!\left((nh)^{-1} + h^4\right), \qquad (2.4)$$

where the functional $R(g) = \int_{-\infty}^{\infty} g^2(x)\, dx$, and $\sigma_g^2 = \int u^2 g(u)\, du$.

It is now quite clear that with a bandwidth h of an appropriate order (anything satisfying h → 0 with nh → ∞) the kernel estimate $\hat f_{n,h}$ is consistent in L2 for f0 under fairly weak conditions on the target density. To determine the optimal rate at which $E[d_2(\hat f_{n,h}, f_0)] \to 0$, we need a more specific asymptotic rule for the bandwidth h. If we consider choosing h to minimize the dominant terms in (2.4),

$$\mathrm{AMISE}(h) = \frac{1}{nh} R(K) + \frac{h^4 \sigma_K^4}{4} R(f_0''), \qquad (2.5)$$

then it can be shown that the AMISE-optimal bandwidth $h_*$ is

$$h_* = \operatorname*{argmin}_{h>0} \mathrm{AMISE}(h) = \left(\frac{R(K)}{\sigma_K^4\, R(f_0'')}\right)^{1/5} n^{-1/5}. \qquad (2.6)$$

The expression for the AMISE-optimal bandwidth, (2.6), is the jumping-off point for so-called “plug-in” bandwidth selection rules (Silverman (1986), Park and Marron (1990), Sheather and Jones (1991)), which select a bandwidth $\hat h$ by plugging an empirical estimate of the unknown $R(f_0'')$ into (2.6). More details on estimation of $R(f_0'')$ are given in Section 2.2.1. Alternative bandwidth selection rules include a variety of cross validation techniques (see Wand and Jones (1994)).
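For a sense of how (2.6) is used, the sketch below computes a plug-in bandwidth for a Gaussian kernel, substituting a normal-reference value of $R(f_0'')$ for simplicity. This shortcut reproduces Silverman's rule of thumb; the Sheather-Jones selector would instead plug a kernel-based estimate of $R(f_0'')$ into the same expression.

```python
# Plug-in evaluation of the AMISE-optimal bandwidth (2.6) for a Gaussian kernel,
# using a normal-reference estimate of R(f0'') (an assumption made for simplicity).
import numpy as np

def amise_bandwidth(sample):
    n = sample.size
    sigma_hat = sample.std(ddof=1)
    R_K = 1.0 / (2.0 * np.sqrt(np.pi))                       # R(K) for the N(0,1) kernel
    sigma_K4 = 1.0                                           # sigma_K^2 = 1 for the N(0,1) kernel
    R_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma_hat**5)       # normal-reference value of R(f0'')
    return (R_K / (sigma_K4 * R_f2))**0.2 * n**-0.2          # equation (2.6)

rng = np.random.default_rng(4)
sample = rng.normal(size=1000)
print(amise_bandwidth(sample))       # approximately 1.06 * sigma_hat * n^{-1/5}
```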

Yet when any bandwidth $\hat h$ of the same order as $h_*$ is used, with $\hat h_n = C n^{-1/5}$, we obtain from (2.4) the fastest possible rate of convergence of an estimator of the type (2.1),

$$E\!\left[d_2\!\left(\hat f_{n,h}, f_0\right)\right] \sim C(K)\, R(f_0'')^{1/5}\, n^{-4/5}. \qquad (2.7)$$

Here, C(K) is a constant depending only on the kernel K (see Appendix A for details).

In fact, this rate of order $n^{-4/5}$ is optimal, not just for kernel estimates, but for any nonparametric estimate of f0 belonging to the Hölder class (2.3) (Farrell; 1972; Wahba; 1981). That is, denoting the class (2.3) by $\mathcal F_2$,

$$\inf_{\hat f_n}\, \sup_{f_0 \in \mathcal F_2} R_2\!\left(\hat f_n, f_0\right) = O\!\left(n^{-4/5}\right).$$

As a consequence, no improvement in rate can be made over the kernel density estimator (2.7).

However, the asymptotic L2 risk (2.7) of the KDE depends on the functionals C(K) and $R(f_0'')$. The first constant, C(K), is a property of the estimator, depending on the kernel being used. The second constant, $R(f_0'')$, is a property of the density being estimated. Thus, we are presented with two clear objectives for improving the asymptotic performance of the KDE. First, we could choose the kernel K to give a minimal value for C(K). This is the approach behind the Epanechnikov kernel (Epanechnikov; 1969) and other sophisticated choices of higher-order kernels K. Second, we could attempt to reduce $R(f_0'')$.

At first glance, $R(f_0'')$ would seem to be beyond the analyst’s control, as it depends on the unknown f0. Still, it is useful to understand the meaning of the functional. For a fixed kernel K, the expansion (2.7) indicates that smaller values of

$$R(f_0'') = \int f_0''(x)^2\, dx \qquad (2.8)$$

are associated with smaller expected L2 loss when the sample size is large. This is intuitive, since the magnitude of $f_0''$ measures the local curvature of the density, making the integral (2.8) a global measure of curvature. Densities with sharper features would yield larger values of (2.8), in turn leading to larger expected L2 error if the density is estimated via (2.1). The strategy described in Section 2.2 attempts to transform f0 to set up an easier density estimation problem.

18 2.2 Density Estimation with Transformations

The asymptotic expansions for both L2 risk, given in (2.7), and for L1 risk, in (1.4), illustrate that some f0 are considerably harder to estimate than others.

A strategy suggested initially by Silverman (1986), then studied by Wand, Marron and Ruppert (1991), involves transforming the sample before estimating the density. The idea is to transform the Xi, say to $\varphi_\theta(X_i) = Y_{\theta,i}$, according to some smooth, strictly increasing transformation $\varphi_\theta$, so that the density of the transformed sample $Y_{\theta,1},\dots,Y_{\theta,n}$ is “easier” to estimate than the density of the original Xi.

We adopt the notation $g_{0,\theta}(y)$ for the density of the transformed variates $Y_\theta = \varphi_\theta(X)$, and retain f0 for the density of X. These densities are related through the change-of-variable formula

$$g_{0,\theta}(y) = f_0\!\left(\varphi_\theta^{-1}(y)\right) \cdot \frac{d}{dy}\, \varphi_\theta^{-1}(y). \qquad (2.9)$$

The transformed density, clearly, may have an entirely different shape than f0, including different values for functionals like $R(g_{0,\theta}'')$.

Given the transformed sample $Y_{\theta,i}$, one could also estimate (2.9) using a standard estimator such as (2.1). With this estimate, say $\hat g_{\theta,n}$, in hand, we “back-transform” for an estimate of the original density f0 of the Xi, using

$$\hat f_{\theta,n}(x) = \hat g_{\theta,n}\!\left(\varphi_\theta(x)\right) \cdot \varphi_\theta'(x). \qquad (2.10)$$

An illustration of the technique can be found in Chapter 4: see Figure 4.2 on page 57. If $\varphi_\theta$ is chosen wisely, it is our hope that the transformation-density estimator $\hat f_{\theta,n}(x)$ will be superior to a basic kernel estimate (2.1).
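A minimal sketch of the transform/estimate/back-transform cycle in (2.9)-(2.10) follows. The fixed log transformation, the lognormal sample, and the rule-of-thumb bandwidth are illustrative stand-ins for the data-driven transformations discussed below.

```python
# Transformation-density estimation: transform, run a KDE on the new scale, back-transform.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=1.0, size=800)       # skewed sample on (0, inf)

phi = np.log                                           # transformation phi(x) (assumed fixed)
dphi = lambda t: 1.0 / t                               # its derivative phi'(x)
y = phi(x)                                             # transformed sample

h = 1.06 * y.std(ddof=1) * y.size**-0.2                # rule-of-thumb bandwidth on the y-scale

def g_hat(t):
    """Gaussian-kernel KDE of the transformed sample, evaluated at the points t."""
    return stats.norm.pdf((t[:, None] - y[None, :]) / h).mean(axis=1) / h

grid = np.linspace(0.05, 15.0, 500)
f_hat = g_hat(phi(grid)) * dphi(grid)                  # back-transform, equation (2.10)
print(f_hat.sum() * (grid[1] - grid[0]))               # close to 1 over the grid
```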

Central to the transformation method are two questions:

1. What are some appropriate criteria for selecting a single transformation $\varphi_{\hat\theta}$ to ensure that $g_{0,\hat\theta}$ is “easy” to estimate using standard methods?

2. How to define a rich enough family of transformations $\{\varphi_\theta : \theta \in \Theta\}$ that a variety of problem behaviors of f0, such as heavy tails, skew, etc., can be corrected?

We address question (1) in Section 2.2.1, and survey the literature for answers to question (2) in part 2.3.

2.2.1 An L2 Criterion for Selecting Transformations

In this section, we describe a simple criterion based on the L2 asymptotics of the KDE (e.g. (2.7)) that can be used to select an appropriate transformation $\hat\theta_n = \hat\theta_n(X_1,\dots,X_n)$. The approach discussed here was proposed by Wand et al. (1991), and studied further by several authors (Ruppert and Wand; 1992; Yang and Marron; 1999; Yang; 2000).

The general approach is to consider the form of the density

$$g_{0,\theta}(y) = f_0\!\left(\varphi_\theta^{-1}(y)\right) \cdot \frac{d}{dy}\, \varphi_\theta^{-1}(y) \qquad (2.11)$$

after transformation by θ, and to study, on this transformed scale, the difficulty of estimating $g_{0,\theta}$ with a standard kernel estimate

$$\hat g_{\theta,n}(y) = \frac{1}{n h_\theta} \sum_{i=1}^n \phi\!\left(\frac{Y_{\theta,i} - y}{h_\theta}\right), \qquad (2.12)$$

where φ is the standard normal density. The same asymptotic treatment discussed in Section 2.1.1, in particular equation (2.7), suggests choosing θ and $\varphi_\theta$ to minimize the curvature functional

$$R\!\left(g_{0,\theta}''\right) = \int g_{0,\theta}''(x)^2\, dx, \qquad (2.13)$$

which, in large samples, is an increasing function of the MISE

$$R_2\!\left(\hat g_{\theta,n}, g_{0,\theta}\right) = E\!\left[\int \left(\hat g_{\theta,n}(y) - g_{0,\theta}(y)\right)^2 dy\right]. \qquad (2.14)$$

However, like the MISE itself, $R(g_{0,\theta}'')$ is not scale invariant. For example, if we consider a linear transformation $\varphi_\theta(X) = \theta X$ for θ > 0, then the density f0 of X and the density $g_{0,\theta}(y) = \theta^{-1} f_0(y/\theta)$ of $Y = \varphi_\theta(X)$ are equally difficult to estimate. Yet $R(g_{0,\theta}'') \ne R(f_0'')$ if θ ≠ 1, since

$$R\!\left(g_{0,\theta}''\right) = \int g_{0,\theta}''(y)^2\, dy = \int \left(\theta^{-3} f_0''(y/\theta)\right)^2 dy = \theta^{-5} \int f_0''(x)^2\, dx = \theta^{-5} R(f_0'').$$

This is an unacceptable property if (2.13) is taken as a criterion for selecting a transformation.

Instead, a scale-invariant version of the curvature functional (2.13) is given by

$$L(\theta) = \sigma_\theta \left(\int g_{0,\theta}''(y)^2\, dy\right)^{1/5}. \qquad (2.15)$$

This is the measure of the ease of estimating a given density $g_{0,\theta}$ which guides the transformations of Wand et al. (1991), Ruppert and Wand (1992) and Yang and Marron (1999). These authors deem the asymptotically optimal transformation to be

$$\theta_0 = \operatorname*{argmin}_{\theta \in \Theta} L(\theta), \qquad (2.16)$$

and devise estimators $\hat\theta_n$ of θ0.

The optimal transformation (2.16) is of course unknown, since it depends on the unknown density f0. To produce an estimator $\hat\theta_n$ of θ0, Wand et al. (1991); Ruppert and Wand (1992); Yang and Marron (1999) employ an estimator $\hat L_n(\theta)$ of the curvature functional, and take

$$\hat\theta_n = \operatorname*{argmin}_{\theta \in \Theta} \hat L_n(\theta). \qquad (2.17)$$

We will refer to $\hat\theta_n$ as the Wand-Marron-Ruppert (WMR) estimator. The remainder of this section will be devoted to reviewing its properties. The approach (2.17) — which employs the minimizer of an estimate $\hat L_n$ as an estimate of the true minimizer θ0 of L — implicitly assumes that L is convex (at least in a neighborhood of θ0) and that estimators $\hat L_n(\theta)$ are uniformly consistent in θ. In that case, favorable properties of $\hat L_n$ for estimating L will, theoretically, translate to good estimation of θ0 by $\hat\theta_n$.

In the next subsection, we describe an estimator of L(θ) and its asymptotic properties. Following that, properties of the WMR estimator (2.17) are established.

Estimating the Curvature Functional

The quality of $\hat\theta_n$ as an estimator of θ0 is dependent on the ability to accurately estimate the function L(θ). In particular, it requires an estimate of the integrated curvature $R(g_{0,\theta}'')$ (2.13). However, due to the crucial role this functional played, historically, in the development of rules for bandwidth selection, a substantial body of work exists on estimators of integrated squared second-order density derivatives. See Appendix B for details on this topic. The best available estimator of $R(g_{0,\theta}'')$ in terms of asymptotic performance is the “diagonals-in” estimator of Jones and Sheather (1991),

$$\tilde S_\delta(Y_{\theta,1},\dots,Y_{\theta,n}) = \int \tilde g_{\delta,n}''(y)^2\, dy, \qquad (2.18)$$

where $\tilde g_{\delta,n}(y) = (n\delta)^{-1} \sum_{i=1}^n L\!\left(\delta^{-1}(y - Y_{\theta,i})\right)$ is a further kernel smoothing of the transformed empirical distribution, with a bandwidth δ > 0 and kernel L. Note these are not necessarily the same as the bandwidth h and kernel K for estimating f0. The bandwidth δ also requires a selector; Jones and Sheather (1991) note that

$$\delta^* = C(L)\, R\!\left(g_{0,\theta}'''\right)^{-1/7} n^{-1/7} \qquad (2.19)$$

minimizes an asymptotic expansion of $E\big[\tilde S_\delta - R(g_{0,\theta}'')\big]^2$. Note that this bandwidth shrinks at a slower rate, $n^{-1/7}$, than that of the density estimator, which from (2.6) is $n^{-1/5}$. Use of $\delta^*$, of course, requires an estimate of $R(g_{0,\theta}''')$. It is common to use a parametric model for $g_{0,\theta}$ to estimate $R(g_{0,\theta}''')$ at this stage.[3] Finally, given a bandwidth $\hat\delta_n = C(L)\, \hat R(g_{0,\theta}''')^{-1/7} n^{-1/7}$, we take $\tilde S_{\hat\delta_n}$ as an estimate of $R(g_{0,\theta}'')$, and

$$\hat L_n(\theta) = \hat\sigma_\theta \left[\tilde S_{\hat\delta_n}\right]^{1/5} \qquad (2.20)$$

as an estimate of L(θ), where $\hat\sigma_\theta^2 = n^{-1} \sum_i (Y_{\theta,i} - \bar Y_\theta)^2$. The estimator (2.18) was shown by Jones and Sheather (1991) to give better asymptotic performance than the “diagonals-out” construction of Hall and Marron (1987).

[3] Alternatively, one could proceed down the rabbit hole with another nonparametric estimate of $R(f_0''')$, if one were so inclined. A cost of this strategy is the assumption that additional higher-order density derivatives are square-integrable. In practice, only a modest level of accuracy may be required for these high order properties.
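The pieces of (2.18)-(2.20) can be assembled numerically as follows. This is a rough sketch only: the Gaussian kernel L, the hand-picked bandwidth δ, and the example transformation are assumptions made for illustration, and the two-stage bandwidth rule (2.19) is not implemented.

```python
# Numerical sketch of the curvature criterion (2.18)-(2.20) with a Gaussian kernel L.
import numpy as np

def curvature_stat(y, delta, grid):
    """S_tilde_delta: integral of the squared 2nd derivative of the Gaussian KDE of y."""
    u = (grid[:, None] - y[None, :]) / delta
    phi = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    g2 = ((u**2 - 1.0) * phi).mean(axis=1) / delta**3      # second derivative of the KDE
    return np.sum(g2**2) * (grid[1] - grid[0])

def L_hat(y, delta):
    """Equation (2.20): sample sd times the one-fifth power of the curvature statistic."""
    grid = np.linspace(y.min() - 3 * delta, y.max() + 3 * delta, 2000)
    return y.std(ddof=0) * curvature_stat(y, delta, grid)**0.2

rng = np.random.default_rng(6)
x = rng.gamma(shape=1.5, scale=1.0, size=500)
print(L_hat(x, delta=0.35), L_hat(np.log(x + 0.1), delta=0.35))   # compare two scales
```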

Properties of the Wand-Marron-Ruppert Transformation Estimator

Before establishing properties of $\hat\theta_n$ as an estimator of θ0, it is necessary to study the performance of $\hat L_n(\theta)$, given by (2.20), as an estimator of L(θ). Since the aim is to consistently estimate the minimizer of L, it is not sufficient that $\hat L_n(\theta)$ be “pointwise” consistent for each θ. Rather, stronger conditions akin to uniform consistency are required. The analytical treatment given by Yang (2000) requires the following assumptions:

• A1: Θ is a compact rectangle in $\mathbb{R}^p$. That is, $\Theta = \Theta_1 \times \Theta_2 \times \cdots \times \Theta_p$, with each $\Theta_j$ a compact interval.

• A2: The minimizer (2.16) of L(θ) is unique.

• A3: L(θ) is locally convex at θ0 in the sense that for any ε > 0, there exists D(ε) > 0 such that L(θ) − L(θ0) > D(ε) whenever ‖θ − θ0‖ > ε.

• A4: The transformation $\varphi_\theta(x)$ is infinitely continuously differentiable in x and θ ∈ Θ.

• A5: For any θ ∈ Θ, the transformed density $g_{0,\theta}(y)$ has continuous derivatives of all orders in y, and

$$\lim_{|y|\to\infty} |y|^k\, g_{0,\theta}^{(l)}(y) = 0$$

for any l = 0, 1, 2, . . . and any k ≥ 1.

The last assumption, especially for the applications we will emphasize in later sections, is problematic, since it excludes densities with polynomial tail decay. Nonetheless, we proceed to describe Yang’s results under these conditions. Yang (1995) showed that if the sequence of bandwidths used for $\hat L_n(\theta)$ in (2.18) belong to intervals $\Delta_n = [n^{(-1+\epsilon)/5}, n^{-\epsilon}]$ for some $0 < \epsilon < 1/6$, then for any ε > 0,

$$\sum_{n=1}^{\infty} \Pr\!\left(\sup_{\theta\in\Theta}\, \sup_{\delta_n\in\Delta_n} \left|\hat L_n(\theta) - L(\theta)\right| > \varepsilon\right) < \infty. \qquad (2.21)$$

For example, bandwidths $\delta_n = O(n^{-1/7})$ ensure that (2.21) holds. Yang showed that (2.21), combined with the local convexity condition A3, implies

$$\sum_{n=1}^{\infty} \Pr\!\left(\sup_{\delta_n\in\Delta_n} \|\hat\theta_n - \theta_0\| > \varepsilon\right) \le 2 \sum_{n=1}^{\infty} \Pr\!\left(\sup_{\theta\in\Theta}\, \sup_{\delta_n\in\Delta_n} \left|\hat L_n(\theta) - L(\theta)\right| > D(\varepsilon)/2\right) < \infty.$$

From this and the Borel-Cantelli lemma, it follows that $\hat\theta_n$ is strongly consistent, with

$$\Pr\!\left(\sup_{\delta_n\in\Delta_n} \|\hat\theta_n - \theta_0\| > \varepsilon \text{ i.o.}\right) = 0$$

for any ε > 0.

A description of the asymptotic distribution of $\hat\theta_n$ requires the additional assumptions A4-A5, including the tail condition A5. Yang (2000) showed that under A1-A5, and with $\delta_n \in \Delta_n$,

$$\hat\theta_n = \theta_0 + B_n(\delta_n) + Z_n, \qquad (2.22)$$

where

$$B_n(\delta_n) = C(f_0)\, \delta_n^2 + O\!\left(\delta_n^4\right) \qquad (2.23)$$

characterizes the asymptotic bias of $\hat\theta_n$, and $Z_n$ describes the variability in the asymptotic sampling distribution. The distribution of $Z_n$ is asymptotically normal, with

$$\sqrt{n + n^2 \delta_n^{9}}\; Z_n \Rightarrow N\!\left(0,\, \Sigma(f_0)\right) \qquad (2.24)$$

for a p × p covariance matrix Σ(f0).

In particular, with the choice $\delta_n = O(n^{-1/7})$ implied by (2.19), the squared bias $B_n(\delta_n)^2$ is of order $n^{-4/7}$, and the variance is of order $1/n + n^{-5/7}$. Consequently, the root mean square error is driven by the asymptotic bias, and is $O(n^{-2/7})$. This convergence rate is quite slow, a result of the nonparametric approach to estimating the functional L(θ).

Yang (2000) proposes a bias-corrected estimator with a larger bandwidth that is asymptotically normal with a $\sqrt{n}$ convergence rate. However, the bias correction requires two stages of estimates of high-order integrated derivatives of $g_{0,\theta}$, up to a fifth-order integrated squared derivative. We regard this approach as somewhat impractical, and suspect that it would only lead to a marked improvement for very large samples.

25 An L1 criterion due to Wand and Devroye (1993)

An alternative approach begins with the mean integrated absolute error (MIAE),

$$R_1\!\left(\hat g_{\theta,n}, g_{0,\theta}\right) = E\!\left[\int \left|\hat g_{\theta,n}(y) - g_{0,\theta}(y)\right| dy\right], \qquad (2.25)$$

where $\hat g_{\theta,n}$ is a standard density estimate based on the transformed sample $Y_{\theta,1},\dots,Y_{\theta,n}$, and $g_{0,\theta}$ is the transformed density given by (2.11).

The L1 risk, unlike its L2 counterpart, has the appealing property of being invariant under monotone transformations. In particular,

$$R_1\!\left(\hat g_{\theta,n}, g_{0,\theta}\right) = R_1\!\left(\hat f_{\theta,n}, f_0\right),$$

where $\hat f_{\theta,n}$ is the transformation density estimator obtained by back-transforming $\hat g_{\theta,n}$ (see (2.10)). In the L1 approach, a transformation which minimizes expected loss (for a kernel estimator) on the transformed scale also minimizes it (among transformation kernel estimators) on the original scale.

Wand and Devroye (1993) give an asymptotic expansion of (2.25), which we have encountered previously in the introductory chapter. For f0 possessing two continuous integrable derivatives, and a finite moment of order 1 + ε for some ε > 0, we have

$$\inf_{h_\theta>0} R_1\!\left(\hat g_{\theta,n}, g_{0,\theta}\right) \sim 2^{-1/5}\, Q\!\left(g_{0,\theta}\right) A(K)\, n^{-2/5}, \qquad (2.26)$$

where $\hat g_{\theta,n}$ is a kernel estimate with a global bandwidth $h_\theta$, as in (2.12). The right side is seen to be an increasing function of

$$M(\theta) = Q\!\left(g_{0,\theta}\right) = \inf_{u>0}\; u^{-1} \int \sqrt{g_{0,\theta}(y)}\; E\left|\,Z - \frac{u^5\, g_{0,\theta}''(y)}{\sqrt{g_{0,\theta}(y)}}\,\right| dy, \qquad (2.27)$$

for Z ∼ N(0, 1). This motivates (2.27) as an alternative to L(θ) from (2.15) as a measure of the difficulty of estimating $g_{0,\theta}$. Wand and Devroye (1993) note that for the strongly skewed lognormal density and a kurtotic mixture of normals, (2.27) can be reduced substantially with a judicious choice of transformation. However, they do not discuss the problem of estimating M(θ) based on a sample when $g_{0,\theta}$ is unknown. In comparison, estimators of the functional in L(θ) are well-studied, and their asymptotic properties have been established.

2.3 Transformation Families

The transformation density estimation strategy has been studied by several authors in the context of kernel density estimation, and they have proposed a variety of transformation families and criteria for selecting transformations. We briefly summarize several approaches.

2.3.1 Parametric transformation, nonparametric density estimation

Several authors have studied the use of parametric transformation families, selecting the transformation based on the scale-free estimated curvature criterion $\hat L_n$ of (2.20). These authors differ on their choices of transformation families — the choice of appropriate transformation family depends on the types of densities one wishes to estimate — and in the details of the construction of $\hat L_n$ (for example, the use of “diagonals-in” versus “diagonals-out” bandwidths for estimating $R(g_{0,\theta}'')$). Here we give a listing of some authors who considered the parametric approach.

Wand, Marron and Ruppert (1991) pioneered the approach involving a parametric transformation. The paper focuses on the estimation of right-skewed densities supported on the positive reals. This estimation is made easier with a family of shifted power transformations

$$\varphi^{SP}_{\theta_1,\theta_2}(x) = \begin{cases} (x + \theta_1)^{\theta_2} \cdot \operatorname{sign}(\theta_2) & \theta_2 \ne 0 \\ \ln(x + \theta_1) & \theta_2 = 0. \end{cases} \qquad (2.28)$$

The use of the curvature criterion L(θ) as a proxy for the MISE of a standard KDE on the transformed scale, just discussed in Section 2.2.1, is due to this paper. Wand, Marron and Ruppert (1991) also consider an MISE criterion calculated on the original scale (the “X-space”) rather than on the transformed scale (the “Y-space”), but for simplicity select $\hat\theta_n$ to minimize $\hat L_n(\theta)$ for the transformed density, over a carefully selected grid Θ. Bolancé, Guillén and Nielsen (2003) demonstrate the effectiveness of the Wand et al. (1991) approach for estimating actuarial loss functions.
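For reference, the shifted power family (2.28) can be coded directly, as in the sketch below; the example parameter values are arbitrary, and the grid search over $(\theta_1, \theta_2)$ that Wand, Marron and Ruppert (1991) use to minimize $\hat L_n$ is not shown.

```python
# The shifted power transformation family (2.28), evaluation only.
import numpy as np

def shifted_power(x, theta1, theta2):
    """phi^{SP}_{theta1,theta2}(x) for x > -theta1, per equation (2.28)."""
    x = np.asarray(x, dtype=float)
    if theta2 == 0:
        return np.log(x + theta1)
    return np.sign(theta2) * (x + theta1)**theta2

x = np.array([0.5, 1.0, 2.0, 10.0])
print(shifted_power(x, theta1=0.0, theta2=0.5))    # a square-root type transformation
print(shifted_power(x, theta1=1.0, theta2=0.0))    # log(x + 1)
```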

Ruppert and Wand (1992) adapt the transformation approach to densities suffering from high kurtosis. This trouble is addressed using the parametric transformation $\varphi^{RW}_{\theta}(x)$ (2.34). The transformation is again selected according to the $\hat L_n$ criterion. A fixed-bandwidth kernel density estimate performs quite well on the resulting transformed scale, giving high quality estimates of a variety of kurtotic densities.

Yang and Marron (1999) consider three transformations from the Johnson family of section 2.3.3. These three are selected as remedies for skewness, heavy-tailedness and short-tailedness in the original density. It is shown that, for a broad class of densities f0, at least one of the transformation families contains a transformation which can reduce L relative to f0. They develop an algorithm for selecting a series of transformations from these families, again with the aim of minimizing $\hat L_n$. It is shown that transforming twice, or even more times, can further improve the quality of TKDE estimates. In the methodology of Chapter 4, we borrow much from the iterative transformation technique that Yang and Marron (1999) describe, yet we abandon the Johnson transformations in favor of transformation families that map $\mathbb{R} \to \mathbb{R}$.

2.3.2 Nonparametric transformation, nonparametric density estimation

Other authors have considered nonparametric transformation strategies. This methodology is less aligned with the techniques of Chapters 4 and 5, but we briefly mention several works in the direction of TKDE with nonparametric transformations.

Ruppert and Cline (1994) develop a method for iteratively estimating and transforming the distribution. The transformations are of the basic form $\varphi_{\hat F}(x) = G^{-1}(\hat F(x))$, where $\hat F$ is a nonparametric estimate of the original cdf, and G is a target cdf. They continue estimating the cdf and re-transforming the sample until the empirical cdf of the transformed data is adequately close to G. Buch-Larsen, Nielsen, Guillén and Bolancé (2005) estimate heavy-tailed densities by applying the cdf of an estimated Champernowne distribution to the sample, and estimating the resulting density on (0, 1) by kernel methods with a boundary correction. Koekemoer and Swanepoel (2008) consider a “black box of transformations” including several parametric families mentioned here. They perform a series of transformations with a target of a Gaussian density. Their scheme involves an initial pilot black-box transformation, selected to minimize a Wilk-Shapiro statistic. The pilot transformed data is then fit with a variable-bandwidth KDE with an additional multiplicative bias correction, where the bandwidth is concocted with a creative scheme that generalizes Abramson (1982). The pilot transformation is accordingly modified by shifting the data to avoid “explosive tail behavior” caused at the boundaries of the domains of some transformations, the data are re-transformed, and another variable kernel estimate is made. The authors repeat several times to produce the final estimate. Readers are referred to section 5.2 of Koekemoer and Swanepoel (2008).

2.3.3 Other Transformation Families

Transformations can play many roles in a statistical analysis. We will mainly focus here on transformations that have been suggested to improve the performance of density estimates in the scheme described at the outset of this section. We begin, however, by mentioning influential work on transformations by Johnson (1949) and Box and Cox (1964). These authors developed parametric families of transformations to transform a variety of non-normal distributions to approximate normality. Although our ultimate aim of easy density estimation does not require normality, studies of the criteria in Section 2.2.1 suggest that normal distributions are among the “easiest” densities to estimate with the standard kernel methods. In particular, $L(f_0)$ attains a minimum value of 0.6787 when $f_0$ is a Beta(4, 4) density, and normal densities are similar in shape and are close behind, with $L(\phi) = 0.733$ (Terrell and Scott; 1992). With respect to $Q(f_0)$, a Beta(5.3, 5.3) density is easiest, at $Q(f_0) = 1.92$, while normal densities have $Q = 1.99$. Thus, transformation families that can effectively induce approximate normality will also be effective at reducing L and Q.

Johnson (1949) systems

Johnson (1949) considers transformations to approximate normality, of the form
$$\varphi_\theta(x) = \gamma + \delta\, g\!\left(\frac{x - \beta}{\eta}\right), \qquad (2.29)$$
where we have put θ = (γ, δ, β, η), and where g is a monotone function. The collection, indexed by θ, of distributions for $X = \varphi_\theta^{-1}(Y)$, where Y is standard normal, is commonly known as a Johnson system. The shape of the distribution of X depends on the choice of g(·). Here are the three suggestions from Johnson (1949).

1. (The lognormal system) When g(u) = log(u), a Gaussian distribution for Y = ϕθ(X) corresponds to lognormal distributions for X. Hence the transformation g(u) = log(u) is appropriate for samples having skewed densities supported on intervals [c, ∞) or (−∞, c].

2. (The SB system) Setting $g(u) = \log\{u/(1-u)\}$ is appropriate for the class of X supported on a compact interval (β, β + η), known as the Johnson SB class.

3. (The SU system) The inverse hyperbolic sine function $g(u) = \sinh^{-1}(u) = \log\left(u + \sqrt{1+u^{2}}\right)$ corresponds to the so-called SU class, which includes a modest variety of distributions that exhibit heavy-tailed or skewed behaviors, or both.

These transformations capture a variety of features. However, neither the SB nor the lognormal system allows input distributions supported on (−∞, ∞). Hence, these transformations do not serve our purpose of estimating densities on the real line.

Skew-reducing transformations

The famous power transformations of Box and Cox (1964) were proposed, along with a likelihood-based framework for estimation, in the context of the linear model.

The Box-Cox family of transformations is given by
$$\varphi^{BC}_{\theta}(x) = \begin{cases} (x^{\theta} - 1)/\theta, & \theta \neq 0 \\ \log x, & \theta = 0. \end{cases} \qquad (2.30)$$

Box and Cox (1964) describe a use for these transformations in a linear model with normal errors, where the response has been transformed: $Y_{\theta,i} \equiv \varphi^{BC}_{\theta}(X_i) = z_i^{\top}\beta + \varepsilon_i$, for original response measurements $X_i$ and predictors $z_i$. They suggest estimating θ and β together by maximizing a normal likelihood. This family of transformations is effective for transforming skewed distributions (for the residuals, in the linear model framework) to approximate symmetry. However, the transformations are valid only for positive inputs $X_i$.

Bickel and Doksum (1981) discuss the family (2.30), and investigate asymptotic properties of the estimators $(\hat\theta, \hat\beta)$, noting that these are highly correlated, leading to instability in the estimate $\hat\beta$.$^4$ In the same paper, Bickel and Doksum (1981) proposed an extension of (2.30) valid for all real inputs x, the signed power transformation
$$\varphi^{SP}_{\theta}(x) = \begin{cases} \left(|x|^{\theta}\operatorname{sign}(x) - 1\right)/\theta, & \theta > 0 \\ \log x, & \theta = 0 \end{cases} \qquad (2.31)$$
Although this extension is valid for all real x, Yeo and Johnson (2000) note that the derivative $\frac{d}{dx}\varphi^{SP}_{\theta}(x)$ changes from a decreasing to an increasing function of x as x changes sign. (That is, the concavity of the transformation changes.) Because of this, when (2.31) is applied to skewed densities on $\mathbb{R}$ with substantial mass on both positive and negative values, the resulting transformed distribution is often bimodal.

To alleviate this trouble, Yeo and Johnson (2000) propose a new family of power transformations, which we will call the Yeo-Johnson family, which is either convex or concave for all x. The Yeo-Johnson family is
$$\varphi^{YJ}_{\theta}(x) = \begin{cases} \left\{(x+1)^{\theta} - 1\right\}/\theta, & x \geq 0,\ \theta \neq 0 \\ \log(x+1), & x \geq 0,\ \theta = 0 \\ -\left\{(-x+1)^{2-\theta} - 1\right\}/(2-\theta), & x < 0,\ \theta \neq 2 \\ -\log(-x+1), & x < 0,\ \theta = 2. \end{cases} \qquad (2.32)$$
The shape of this transformation is quite similar to Box-Cox, but is valid for all x ∈ (−∞, ∞). When 0 ≤ θ ≤ 2, the family is a monotone increasing map from $\mathbb{R}$ to $\mathbb{R}$. In Chapter 4, we will employ this family for skew correction as part of a transformation-DP-mixture density estimation scheme.

$^4$Box and Cox (1982) respond, with more than a hint of derision, “it seems to us that this general conclusion is qualitatively obvious and at the same time scientifically irrelevant.”

Kurtosis-reducing transformations

There are several common transformations for normalizing excessively kurtotic distributions. To be effective, these transformations should be convex for x below the median of the input distribution, and concave for x above. The inverse hyperbolic sine transformation
$$\varphi^{SU}_{\theta}(x) = \frac{1}{\theta}\sinh^{-1}(\theta x) = \frac{1}{\theta}\log\left(\theta x + \sqrt{\theta^{2}x^{2}+1}\right) \qquad (2.33)$$
is a special case of the SU-type transformations of Johnson (1949), mentioned in part 2.3.3. This is an effective family for reducing kurtosis. The parameter θ > 0 controls the severity of the kurtosis correction: as θ → 0, the transformation becomes nearly linear (and hence does not affect the shape of the input density), while larger θ give more strongly convex-to-concave “S-shaped” curves.

Ruppert and Wand (1992) suggest another kurtosis reducing transformation,
$$\varphi^{RW}_{\theta}(x) = \alpha x + (1-\alpha)\,\sigma\sqrt{2\pi}\left(\Phi(x/\sigma) - \tfrac{1}{2}\right), \qquad (2.34)$$
where $\Phi(u) = \int_{-\infty}^{u}\phi(t)\,dt$ is the cdf of a standard normal. The parameters θ = (α, σ) both control the degree of kurtosis reduction, giving stronger effects as they become smaller. Sending α → 1 gives the identity transformation, while, as α → 0, the shape of the transformation approaches that of a shifted-and-scaled normal cumulative distribution function. For fixed α ∈ (0, 1), smaller values of σ give the transformation greater convex-to-concave curvature, and hence stronger kurtosis reduction. Large values of σ lead to a transformation that is approximately linear for comparatively small values of x. For our purposes, α = 0 is a degenerate case we would like to avoid, since it leads to a transformation with bounded range.

An alternative kurtosis reducing transformation is the cdf-inverse-cdf transformation
$$\varphi^{CDF}_{\theta}(x) = \Phi^{-1}\left(F_{\theta}(x)\right), \qquad (2.35)$$
where Φ is a standard normal cdf and $F_{\theta}$ is a heavier-tailed cdf. For example, $F_{\theta}$ could be the cdf of a Student t distribution with θ degrees of freedom. This one-parameter family, like the others in this section, has a convex-to-concave shape that produces a transformed density with lighter tails than the original.

Though all of these transformations are appropriate for shortening the tails of a distribution, we favor $\varphi^{CDF}_{\theta}(x)$ for the interpretability of its parameter as a degrees of freedom: this transformation is best applied to distributions $f_0(x)$ with polynomial tails, behaving like a $t_{\theta}$ distribution as |x| → ∞.
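As a quick numerical illustration of the tail-shortening effect of (2.35), the sketch below (Python with scipy; the choice θ = 2 and the simulated t-distributed sample are assumptions made for illustration only) applies $\Phi^{-1}(F_\theta(\cdot))$ to a heavy-tailed sample and compares the sample excess kurtosis before and after.

```python
import numpy as np
from scipy import stats

def cdf_inverse_cdf(x, df):
    """Tail-shortening map Phi^{-1}(F_theta(x)) of (2.35), with F_theta a Student t cdf."""
    return stats.norm.ppf(stats.t.cdf(x, df=df))

rng = np.random.default_rng(0)
x = stats.t.rvs(df=2, size=5000, random_state=rng)   # heavy-tailed input sample
y = cdf_inverse_cdf(x, df=2)                         # exactly standard normal if x ~ t_2

print("sample excess kurtosis before:", stats.kurtosis(x))
print("sample excess kurtosis after: ", stats.kurtosis(y))
```

When the degrees of freedom of $F_\theta$ match the polynomial tail of the input, the transformed sample is exactly standard normal, which is the sense in which the parameter is interpretable as a tail index.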

Chapter 3: Density Estimation using Dirichlet Process Mixtures

3.1 Nonparametric Prior Construction

Nonparametric Bayesian models are commonly defined as probability models on infinite dimensional parameter spaces (Orbanz and Teh (2010), Müller and Rodriguez

(2013)). The problem of density estimation is a prime example. Nonparametric

Bayesian models approach the problem of density estimation by specifying a prior for the unknown distribution, an infinite dimensional quantity. Ferguson (1973), in his seminal paper on the subject, proposes two desirable properties for such a prior:

1. that they should possess large support, and

2. that the posterior distributions should be analytically manageable.

To this end, he defined the Dirichlet process (DP), described in Section 3.1.1. The

flexibility and relative ease of implementation of the DP and its extensions have made it central to many recent developments in the field of Bayesian nonparametrics, drawing interest from statisticians and data analysts, as well as many practitioners from the machine learning community. Recent reviews of the field of Bayesian nonparametrics are given by Quintana and Müller (2004) and Müller and Rodriguez (2013).

Reviews with a machine learning perspective are given by Teh (2010), which focuses

specifically on the DP, and by Orbanz and Teh (2010) for the broader field of study.

In subsequent sections, we briefly review the DP and extensions, focusing on their

use for density estimation.

3.1.1 The Dirichlet Process

Let (X , A) be a measurable space, and let G0 be a probability measure on (X , A).

(In much of what follows, we will focus on the DP over the real line X = R, with A = B, the Borel sets.) A random probability measure G is said to follow a

Dirichlet process prior with mass parameter M > 0 and base distribution G0 if for any measurable partition {A1,...,Ak} ∈ A of X ,

$$\left(G(A_1), \ldots, G(A_k)\right) \sim \mathrm{Dirichlet}\left(M G_0(A_1), \ldots, M G_0(A_k)\right). \qquad (3.1)$$

For a random probability measure G having this property, we will write

G ∼ DP(M,G0).

Ferguson (1973), using properties of the Dirichlet distribution, verifies the Kolmogorov

consistency conditions, establishing the existence of the stochastic process that assigns probabilities for G according to (3.1). The Dirichlet process, with its many

extensions and variations, has proven to be extremely useful on both points (1) and

(2) above.

Ferguson (1973) gave several useful properties for the basic construction (3.1).

Among them:

1. (Roles of the DP parameters) If G ∼ DP(M,G0), then for any A ∈ A,

$$E\,G(A) = G_0(A) \qquad\text{and}\qquad \mathrm{Var}\,G(A) = \frac{G_0(A)\left(1 - G_0(A)\right)}{M+1}, \qquad (3.2)$$
facts which are immediate from (3.1) when we consider the measurable partition $\{A, A^{c}\}$. Hence the base distribution G0 informs the location of the DP prior,

while M is a concentration parameter that behaves like an inverse variance.

2. (Almost sure discreteness) Any realization G ∼ DP(M,G0) is discrete with

probability 1.

3. (Conjugacy under random sampling) If G ∼ DP(M,G0), and θ1, . . . , θn are

a random sample from G, then the posterior distribution for G is also DP

distributed, with

$$G \mid \theta_1, \ldots, \theta_n \sim \mathrm{DP}\!\left(M + n,\ \frac{1}{M+n}\sum_{i=1}^{n}\delta_{\theta_i} + \frac{M}{M+n}\,G_0\right). \qquad (3.3)$$

Part of the great value of the Dirichlet process is the wealth of different representations

it implies. For example, we have the Pólya urn scheme representation of Blackwell

and MacQueen (1973). Suppose θ1, . . . , θn are i.i.d. from G ∼ DP(MG0), let θj be

one of these and define θ−j = (θ1, . . . , θj−1, θj+1, . . . , θn). In that case, θj|G, θ−j ∼ G,

and so from (3.2) and (3.3), marginalizing over G gives

$$\Pr\left(\theta_j \in A \mid \theta_{-j}\right) = E\left[G(A)\mid\theta_{-j}\right] = \frac{1}{M+n-1}\sum_{i=1;\,i\neq j}^{n}\delta_{\theta_i}(A) + \frac{M}{M+n-1}\,G_0(A). \qquad (3.4)$$

This representation is quite useful for the Gibbs samplers developed in the 1990s.

Another useful representation of the Dirichlet process is the constructive definition

given by Sethuraman (1994). This definition begins with a sequence $p_j \overset{iid}{\sim} \mathrm{Beta}(1, M)$ for $j = 1, 2, \ldots$, and from these constructs weights $w_1 = p_1$ and $w_k = p_k \prod_{j<k}(1 - p_j)$. Taking atoms $\theta_{0k} \overset{iid}{\sim} G_0$, independently of the $p_j$, the random measure G is defined as
$$G = \sum_{k=1}^{\infty} w_k \delta_{\theta_{0k}}. \qquad (3.5)$$
Sethuraman (1994) showed this construction uniquely defines the DP. This representation, a special case of what Ishwaran and James (2001) termed a stick-breaking construction, makes the discreteness of G quite clear.
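As a concrete illustration of the stick-breaking construction (3.5), the sketch below (Python with numpy; the truncation level, the mass parameter M, and the standard normal base measure are illustrative assumptions) draws an approximate realization of G ∼ DP(M, G0) by truncating the infinite sum at K atoms.

```python
import numpy as np

def stick_breaking_draw(M, base_sampler, K=2000, seed=None):
    """Approximate draw from DP(M, G0) via the stick-breaking construction (3.5),
    truncated at K atoms; returns (weights, atoms)."""
    rng = np.random.default_rng(seed)
    p = rng.beta(1.0, M, size=K)                                  # p_j ~ Beta(1, M)
    w = p * np.concatenate(([1.0], np.cumprod(1.0 - p)[:-1]))     # w_k = p_k * prod_{j<k}(1 - p_j)
    atoms = base_sampler(K, rng)                                  # theta_{0k} ~ G0
    return w / w.sum(), atoms                                     # renormalize to absorb truncation error

w, theta = stick_breaking_draw(M=5.0, base_sampler=lambda k, rng: rng.normal(size=k), seed=0)
print("atoms carrying weight > 0.01:", int(np.sum(w > 0.01)))
```

The discreteness of G is visible directly in such draws: almost all of the mass sits on a handful of atoms when M is small, and spreads over more atoms as M grows.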

Although this discreteness limits the usefulness of the DP directly as a prior for estimating an unknown density, a natural extension uses the DP as a prior for the mixing distribution in the general mixture model (1.8). These Dirichlet process mixture

(DPM) models are our primary interest for subsequent sections. We describe the

DPM approach in part 3.1.2, and discuss MCMC computation in 3.2.

3.1.2 Dirichlet process mixtures

Following the tradition of the kernel smoothing methods we have already seen for density estimation, we construct a prior for an unknown density as a DP mixture. This approach, suggested by Ferguson (1983), has since been studied extensively, although practical implementation was not possible until the breakthroughs of modern Bayesian computation and MCMC in the 1990s. The suggestion of Ferguson (1983) is to model the unknown density as a convolution of a continuous kernel over a DP distributed mixing distribution

$$G \sim \mathrm{DP}(M, G_0), \qquad X_i \mid G, \sigma \overset{iid}{\sim} f(x \mid G) = \int p_\sigma(x \mid \theta)\, dG(\theta). \qquad (3.6)$$

Here, pσ(x|θ) is a parametric density. Some parameters θ are mixed over G ∼ DP, while others σ may be fixed or given independent prior distributions σ ∼ π(σ).

Implementation of the DP mixture model requires choosing an appropriate kernel pσ(x|θ) and base measure G0. Lo (1984) notes that if (3.6) is taken as a location and precision θ = (µ, τ²) mixture of normal kernels $p_\sigma(x \mid \theta) = \tau\,\phi\!\left(\tau(x-\mu)\right)$, where φ is the standard normal density, and G0(θ) = G0(µ, τ²) is supported on $\mathbb{R}\times\mathbb{R}^{+}$, the model (3.6) contains all continuous densities on R in its closure. In the case of this location and scale mixture of normals, it is common for computational reasons (see 3.2) to set G0(µ, τ²) as a conjugate normal-inverse-gamma distribution (Escobar and

West; 1995). Further, the parameters of G0 may be given additional hyper-priors.

Conjugacy of the hyper-priors can also simplify computations. We discuss the merits of various DPM constructions in part 3.3.2.

Of course, like the DP, the DP mixture model (3.6) has many representations.

For example, an equivalent hierarchical representation is

$$G \sim \mathrm{DP}(M, G_0), \qquad \theta_i \mid G \sim G, \qquad X_i \mid \theta_i, \sigma \sim p_\sigma(x \mid \theta_i). \qquad (3.7)$$

Alternatively, we could use the stick-breaking representation (3.5) of the DP to write

$$X_i \mid \{w_k, \theta_{0k}\}_{k=1}^{\infty}, \sigma \sim f(x \mid G) = \sum_{k=1}^{\infty} w_k\, p_\sigma(x \mid \theta_{0k}), \qquad (3.8)$$
expressing the density as an infinite mixture of continuous kernels, with the weights $w_k$ and kernel parameters $\theta_{0k}$ given the same prior described previously in the context of (3.5).

Because of the discreteness of G, the model (3.8) implicitly leads to clustering among the kernel parameters θ1, . . . , θn in (3.7), and also among the observations

X1,...,Xn themselves, inducing a random partition of the observation indices I =

{1, . . . , n}. Antoniak (1974) studied the partitions. We introduce some notation to

39 discuss them here. The partitions

$$p = \left\{I_j(p) \subset I : j = 1, \ldots, k_n(p)\right\}, \qquad I_j(p) \cap I_k(p) = \emptyset \ \text{for all } j \neq k, \qquad \bigcup_{j=1}^{k_n(p)} I_j(p) = I \qquad (3.9)$$
consist of clusters of indices $I_j(p)$ belonging to the same component of the mixture

(3.8). Let $\theta_j^{*}$ denote the common value of $\theta_i$ for all $i \in I_j(p)$. We introduce variables $s_i = j \iff \theta_i = \theta_j^{*}$, for i = 1, . . . , n, to indicate the assignment of observations to clusters. There is a one-to-one correspondence between s = (s1, . . . , sn) and the partition p. It can be shown (Lo; 1984) that another equivalent representation of the DP mixture model using the induced distribution of s is given by

$$s = (s_1, \ldots, s_n) \sim f_s(s) = \frac{\Gamma(M)}{\Gamma(M+n)}\, M^{k_n(s)} \prod_{j=1}^{k_n(s)} \Gamma\!\left(n_j(s)\right), \qquad \theta_j^{*} \sim G_0(\theta), \qquad X_i \mid s_i, \theta^{*} \sim p_\sigma\!\left(x \mid \theta^{*}_{s_i}\right),$$

where $k_n(s) = k_n(p)$ denotes the number of unique values among θ1, . . . , θn, or equivalently, the number of cells $I_j(p)$ in the partition p. Also, $n_j(s) = n_j(p) = |I_j(p)| = \sum_{i=1}^{n} I(s_i = j)$ denotes the size of the $j$th cluster in the partition p. Note that in this representation, the infinite dimensional distribution G has been marginalized out.

This points to a key reason for the importance of the DP and DP mixture models: although the prior itself is infinite dimensional, it permits representations in terms of a finite number of active parameters.
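As a small worked example of the induced partition distribution above, the following sketch (Python with scipy; the label vector and mass parameter are made-up illustrative values) evaluates $f_s(s) = \frac{\Gamma(M)}{\Gamma(M+n)}\, M^{k_n(s)} \prod_j \Gamma(n_j(s))$ on the log scale.

```python
import numpy as np
from scipy.special import gammaln

def log_partition_prior(s, M):
    """Log of the DP-induced partition probability f_s(s) for cluster labels s and mass M."""
    s = np.asarray(s)
    counts = np.unique(s, return_counts=True)[1]   # cluster sizes n_j(s)
    return (gammaln(M) - gammaln(M + s.size)
            + counts.size * np.log(M)
            + np.sum(gammaln(counts)))

# Six observations split into clusters of sizes 3, 2, 1, with M = 1.
print(np.exp(log_partition_prior([1, 1, 1, 2, 2, 3], M=1.0)))
```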

Lo (1984) utilized this finite representation to give an analytic expression for the Bayes estimator of the density of the $X_i$ under squared error loss, the posterior mean
$$\hat f(x) = E\left[f(x \mid G)\mid X_1,\ldots,X_n\right] = \frac{n}{M+n}\sum_{p} W(p) \sum_{j=1}^{k_n(p)} \frac{n_j(p)}{n}\, \frac{\int p_\sigma(x\mid u)\prod_{i\in I_j(p)} p_\sigma(X_i\mid u)\, dG_0(u)}{\int \prod_{i\in I_j(p)} p_\sigma(X_i\mid u)\, dG_0(u)} + \frac{M}{M+n}\int p_\sigma(x\mid u)\, dG_0(u). \qquad (3.10)$$

The integrals can be evaluated analytically for conjugate choices of pσ and G0, and the expression contains a finite number of terms. However, the use of (3.10) is infeasible for realistic problems, as it contains a sum over all possible partitions p of I, a number which grows extremely fast with the sample size. Easy implementation of the

DP mixture model was made possible with the development of the MCMC schemes discussed in part 3.2.

3.2 Posterior computation for DP mixtures

Although Gelfand and Smith (1990) brought the Gibbs sampler to the forefront of Bayesian computation, the first use of a Gibbs-style algorithm for implementation of the DP mixture (3.6) was in an unpublished dissertation by Escobar (1988), later published as Escobar (1994). The basic algorithm exploits the Pólya urn scheme representation (3.4), drawing the θi in the representation (3.7) from their full conditional distributions,

$$\theta_i \mid \theta_{-i}, X \sim \sum_{j=1;\,j\neq i}^{n} \frac{1}{M+n-1}\, p_\sigma(X_i \mid \theta_j)\, \delta_{\theta_j}(\theta_i) + \frac{M}{M+n-1}\left(\int p_\sigma(X_i \mid u)\, dG_0(u)\right)\pi_{G_0}(\theta_i \mid X_i), \qquad (3.11)$$

where $\pi_{G_0}(\theta \mid X_i) \propto p_\sigma(X_i \mid \theta)\, g_0(\theta)$ is a standard Bayesian update of the base measure G0 with the single observation $X_i$. Many authors have proposed extensions

and improvements of the basic sampler. When G0(θ) is conjugate to the likelihood $p_\sigma(X_i\mid\theta)$, the integral $\int p_\sigma(X_i\mid u)\, dG_0(u)$ can be evaluated in closed form. For non-conjugate models, the integral can be approximated.
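To make the full conditional (3.11) concrete, here is a minimal sketch of the resulting Gibbs sweep for the conjugate case (Python with numpy/scipy), assuming a normal location kernel with fixed variance σ² and a N(µ0, τ0²) base measure; the hyperparameter values, the toy data, and the omission of any remixing of cluster-specific values are simplifications for illustration, not the samplers actually used later in this dissertation.

```python
import numpy as np
from scipy import stats

def polya_urn_gibbs_sweep(theta, X, sigma2, M, mu0, tau02, rng):
    """One Gibbs sweep over theta_1,...,theta_n using the full conditional (3.11)
    for a conjugate normal location DP mixture with fixed kernel variance sigma2."""
    n = len(X)
    for i in range(n):
        others = np.delete(theta, i)
        # weight for joining each existing theta_j ...
        w_existing = stats.norm.pdf(X[i], loc=others, scale=np.sqrt(sigma2))
        # ... and for drawing a fresh value from the base-measure posterior
        w_new = M * stats.norm.pdf(X[i], loc=mu0, scale=np.sqrt(sigma2 + tau02))
        w = np.append(w_existing, w_new)
        k = rng.choice(n, p=w / w.sum())
        if k < n - 1:
            theta[i] = others[k]
        else:
            v = 1.0 / (1.0 / sigma2 + 1.0 / tau02)
            m = v * (X[i] / sigma2 + mu0 / tau02)
            theta[i] = rng.normal(m, np.sqrt(v))
    return theta

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 1, 50), rng.normal(3, 1, 50)])   # toy bimodal data
theta = X.copy()
for _ in range(100):
    theta = polya_urn_gibbs_sweep(theta, X, sigma2=1.0, M=1.0, mu0=0.0, tau02=10.0, rng=rng)
print("number of distinct cluster locations:", np.unique(np.round(theta, 6)).size)
```

In this conjugate setting the integral in (3.11) is the N(µ0, σ² + τ0²) density evaluated at $X_i$, and the base-measure update $\pi_{G_0}(\theta_i \mid X_i)$ is the usual normal posterior, which is what the two branches of the sampler compute.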

Since the original MCMC schemes were introduced, there has been an explosion

of work on posterior computation strategies for the DP and associated nonparametric

Bayesian constructions. Samplers for the DP mixtures of normal kernels were discussed by Escobar and West (1995), who give examples of the model’s use for density

estimation, and by MacEachern (1994), who presents a faster-mixing Gibbs sampler

with a collapsed state space. Bush and MacEachern (1996) improve upon the basic

Gibbs sampler using a “Rao-Blackwellization” strategy. For non-conjugate models,

MacEachern and Müller (1998) give an exact “no-gaps” MCMC scheme that does not

require approximations to the integral in (3.11). Neal (2000) reviews MCMC methods for conjugate and non-conjugate DP mixtures, discusses the use of Metropolis-

Hastings steps, and gives two additional algorithms. Gelfand and Kottas (2002)

discuss using MCMC samples for inference on functionals of the infinite dimensional

mixing distribution G. More recent MCMC strategies include split-merge algorithms

(Dahl; 2003; Jain and Neal; 2004; Jain and Neal; 2007), the slice sampler of Walker

(2007) and the retrospective sampler of Papaspiliopoulos and Roberts (2008).

Since the number of active parameters in the mixture (3.8) grows with the sample size, MCMC computations can become quite difficult for large samples. Approximate

Bayesian computation techniques for the DPM include the variational algorithm of

Blei and Jordan (2006).

3.3 Performance of Dirichlet process mixtures for density estimation

We now turn to evaluations of the performance of the DP mixture (3.6) for estimating an unknown density f0 on the basis of a sample X1,...,Xn. In part 3.3.1 we discuss some common metrics for the closeness of an estimate $\hat f$ to the true density f0. In section 3.3.2, we mention a few forms of the DP mixture model that have been proposed in the literature as priors for density estimation, and consider their performance for finite samples. In part 3.3.3, we give an overview of asymptotic properties of the posterior distribution of the DP mixture model.

3.3.1 Measurements of accuracy for density estimation

In order to discuss the performance of density estimates, we require a metric d(f, g) to measure the distance between two densities f and g, typically an estimate $\hat f$ and the true density f0. We will briefly mention several such measures. For a more complete discussion on this topic, the reader may refer to the informative review by Gibbs and Su (2002). The first is defined on general Lp functions, while the subsequent subsections describe metrics specific to probability densities. Some of these are metrics in the formal sense, while others are not. The author will continue to write d(f, g) even when d is not a proper metric.

The Lp metric

Consider functions f, g : R → R belonging to the Lp space of functions with respect to Lebesgue measure λ on R. That is, $|f|^{p}$ and $|g|^{p}$ are assumed to be Lebesgue integrable. For f ∈ Lp, we write
$$\|f\|_p = \left(\int |f|^{p}\, d\lambda\right)^{1/p}, \qquad (3.12)$$
and also define $\|f\|_\infty = \inf\{M : f \leq M \text{ a.e. } \lambda\}$. The Lp norm (3.12) induces a metric on the Lp space of functions, defined by
$$d_p(f, g) = \|f - g\|_p = \left(\int |f - g|^{p}\, d\lambda\right)^{1/p}. \qquad (3.13)$$

This is a metric in the formal sense; it is symmetric, and Minkowski’s inequality gives the triangle inequality (see section 19 of Billingsley (2012)). The integrated squared error

$$d_{ISE}(f, g) = d_2(f, g)^{2} \qquad (3.14)$$
is a traditional loss function for evaluating kernel density estimates. Taking an expectation gives the mean integrated squared error (2.2) as a risk function (making (2.2) an evaluation of the estimator rather than a particular estimate). Some authors, such as Wand and Devroye (1993), prefer an L1 approach, studying the so-called integrated absolute error $d_{L_1}(f, g)$. This is of course a proper metric, and has the advantage that it is invariant under monotone changes of variables (see part 2.2.1 of Chapter 2). In fact, half the L1 distance is sometimes known by another name: the total variation metric,
$$d_{TV}(f, g) = \sup_{A \in \mathcal{B}(\mathbb{R})}\left|\int_{A} f\, d\lambda - \int_{A} g\, d\lambda\right| = \frac{1}{2}\|f - g\|_{1}. \qquad (3.15)$$

Relative entropy (Kullback-Leibler divergence)

For Lebesgue continuous probability densities f and g, the relative entropy, or Kullback-Leibler divergence, is given by
$$d_{KL}(f \mid g) = \int \log\!\left(\frac{f}{g}\right) f\, d\lambda, \qquad (3.16)$$
where the integral is taken over the support of f. This is non-negative by Jensen’s inequality, but it is not a metric; it does not satisfy the triangle inequality, and it is not symmetric (though it can be made so by considering $d_{KL}^{*}(f, g) = d_{KL}(f \mid g) + d_{KL}(g \mid f)$). It does, however, have appealing properties. For example, if f and g are joint densities of independent components, $f(x) = f_1(x_1) f_2(x_2)$ and $g(x) = g_1(x_1) g_2(x_2)$, then $d_{KL}(f \mid g) = d_{KL}(f_1 \mid g_1) + d_{KL}(f_2 \mid g_2)$. When f and g belong to the same parametric family, there is an intimate connection between (3.16) and the Fisher information matrix. The K-L divergence (3.16) is of great interest in many statistical problems. It has been studied as a loss function in Bayesian decision theory, for example, by Brown, George and Xu (2008). However, $d_{KL}(f \mid g)$ is strongly affected by the tail behavior of f and g. For example, if f has a substantially heavier tail than g, the K-L divergence will be extremely large.

Hellinger distance

The Hellinger distance between probability measures with Lebesgue densities f and g,

$$d_H(f, g) = \left(\int \left(\sqrt{f} - \sqrt{g}\right)^{2} d\lambda\right)^{1/2} = \left\{2\left(1 - \int \sqrt{fg}\, d\lambda\right)\right\}^{1/2} \qquad (3.17)$$
is a bounded metric on the space of densities, satisfying $0 \leq d_H(f, g) \leq \sqrt{2}$. Like the total variation norm, $d_H$ is invariant under monotone transformations.

For a more complete listing of metrics on the space of densities, see Gibbs and Su

(2002).
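Because these distances reappear when estimates are compared in Chapter 4, the following sketch (Python with numpy/scipy; the grid, the example densities, and the trapezoidal integration are illustrative choices) evaluates the total variation, Kullback-Leibler, and Hellinger distances numerically for densities supplied on a grid.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def tv_distance(f, g, x):
    """Total variation distance (3.15): half the L1 distance between the densities."""
    return 0.5 * trapezoid(np.abs(f - g), x)

def kl_divergence(f, g, x):
    """Kullback-Leibler divergence (3.16); assumes f and g are positive on the grid."""
    return trapezoid(f * np.log(f / g), x)

def hellinger_distance(f, g, x):
    """Hellinger distance (3.17)."""
    return np.sqrt(trapezoid((np.sqrt(f) - np.sqrt(g)) ** 2, x))

# Example: standard normal versus a heavy-tailed t density with 2 degrees of freedom.
x = np.linspace(-30.0, 30.0, 20001)
f, g = stats.norm.pdf(x), stats.t.pdf(x, df=2)
print(tv_distance(f, g, x), kl_divergence(f, g, x), hellinger_distance(f, g, x))
```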

3.3.2 Constructions for the DP mixture prior, and finite-sample performance

DP mixture of normal models for density estimation

CCV:
  σ² ∼ IG(a_σ, b_σ)
  G₀(µ) = N(µ₀, σ₀²)
  G ∼ DP(M, G₀)
  µᵢ ∼ G
  Xᵢ | µᵢ, σ² ∼ N(Xᵢ | µᵢ, σ²)

DCV1:
  τ ∼ IG(a_τ, b_τ)
  G₀(µ, σ²) = N(µ | µ₀, τσ²) · IG(σ² | a_σ, b_σ)
  G ∼ DP(M, G₀)
  (µᵢ, σᵢ²) ∼ G
  Xᵢ | µᵢ, σᵢ² ∼ N(Xᵢ | µᵢ, σᵢ²)

DCV2:
  σ² ∼ IG(a_σ, b_σ),  a ∼ Beta(α, β)
  G₀(µ, ζ) = N(µ | µ₀, (1 − a)σ²) · IG(ζ | (µ_ζ + 1)/µ_ζ, 1)
  G ∼ DP(M, G₀)
  (µᵢ, ζᵢ) ∼ G
  Xᵢ | µᵢ, ζᵢ, σ² ∼ N(Xᵢ | µᵢ, a (ζᵢ/µ_ζ) σ²)

The DP mixtures are a flexible class of models that can be used in a wide variety of statistical analyses, including, prominently, Bayesian density estimation. Many variants on the basic structure (3.6) have been proposed, though it is very common to assume normal kernels pσ(x|θ). Three possible structures for the DP mixture of normals are given in the table above.

The CCV model is an example of what Griffin (2010) describes as a Common Component Variance model, in which the normal kernels have a shared σ, which is given an independent inverse-gamma prior. This structure is perhaps most similar to the standard kernel density estimate (2.1). The kernel scale parameter σ is analogous to the bandwidth h, but the locations µi of the kernels are comparatively shrunk away from Xi towards a common value µ0, and exhibit clustering, as described in part 3.1.2.

In the DCV1 model, the kernels are allowed different $\sigma_i^{2}$, with the DP mixing employed over the locations as well as the scales. This model was studied by Ferguson (1983) and Escobar and West (1995). The form of G0 is the familiar normal-inverse-gamma distribution, conjugate to the normal likelihood. The

parameter τ > 0 is strongly linked to the smoothness of the density; larger values

of τ result in a prior that puts more mass on many-modal densities. Small values of

τ lead to small dispersion in the kernel locations µi, and hence a smoother density

estimate.

The DCV2 model is due to Griffin (2010). Its parameters are more directly inter-

pretable in terms of the location, scale, and smoothness of the density being modeled.

The distinguishing feature of this model is the use of σ2, which can be interpreted

as an overall variance, “split” between the base measure G0 and the kernels, with

a ∈ (0, 1) governing the split. In a manner similar to τ, a controls the smoothness of the resulting density, with smaller values leading to more wiggly densities. The ζi allow the kernel variances to be inflated or deflated in different components of the infinite mixture. Griffin (2010) argues that the structure and parametrization of DCV2 make specification of hyperparameters more straightforward. The interpretation of σ² as a marginal variance, for example, allows easier specification of the hyperparameters aσ and bσ than in DCV1. Lastly, we note that the DCV2 model, unlike the others, is not conjugate. Alternative MCMC strategies such as MacEachern and Müller (1998) are required. Griffin (2010) gives an algorithm in the style of Neal (2000).

I only say a few words about the finite-sample performance of these models. Griffin (2010) gives convincing evidence for the effectiveness of the DCV2 construction, demonstrating its superior performance to alternative structures of the DP mixture by analyzing several common datasets. When performance in real data analysis is measured by cross validation with a predictive likelihood measure, Griffin’s prior is seen to be effective for estimating densities with a variety of shapes. Although a simulation study would be an excellent way to evaluate the merits of these priors, and assess their performance for finite samples, I do not carry out such a study here, other

than the results I will show to assess the transformation methodology of Chapters 4

and 5.

3.3.3 Asymptotic properties of DP mixtures

As mentioned at the beginning of this chapter, an advantage of nonparametric

approaches to density estimation is that consistency can often be guaranteed under

relatively weak assumptions on the true density. Recently, there has been strong

interest in studying asymptotic properties of nonparametric Bayesian models.

Study of asymptotic properties for these Bayesian methods proceeds somewhat

differently than the frequentist asymptotics exemplified in 2.1.1, though the results

are often analogous. Here we give a short account of important results concerning

the DP mixture, due to Ghosal, Ghosh and Ramamoorthi (1999), Ghosal and Van

der Vaart (2001), Ghosal and Van der Vaart (2007), and others.

Weak and strong posterior consistency

Let F be the set of all densities on R with respect to Lebesgue measure, and

continue to denote by f0 the true density generating a sample X1,X2,.... A prior Π

on F, such as the prior induced by the DP mixture (3.6), is evaluated asymptotically

by the behavior of posterior probabilities

$$\Pi\left(U(f_0) \mid X_1,\ldots,X_n\right) = \frac{\int_{U(f_0)} \prod_{i=1}^{n} f(X_i)\, \Pi(df)}{\int_{\mathcal F} \prod_{i=1}^{n} f(X_i)\, \Pi(df)},$$
where U(f0) ⊂ F is a neighborhood of f0. Consistency of the prior Π for f0 requires that
$$\Pi\left(U(f_0) \mid X_1,\ldots,X_n\right) \to 1 \qquad (3.18)$$

as n → ∞ almost surely with respect to the f0 measure. Two types of consistency are of interest (Ghosal et al.; 1999):

1. Π is said to be weakly consistent if (3.18) holds for any weak neighborhood

U(f0) of f0. A weak neighborhood of f0 is a set containing a set of the form

$$\left\{ f \in \mathcal{F} : \left|\int \phi_i f\, d\lambda - \int \phi_i f_0\, d\lambda\right| < \varepsilon,\ i = 1, 2, \ldots, k \right\},$$

where φi are bounded continuous functions on R.

2. Π is said to be strongly consistent if (3.18) holds for strong neighborhoods U(f0),

which contain a set of the form

$$\{f \in \mathcal{F} : \|f - f_0\|_1 < \varepsilon\},$$

where $\|f - f_0\|_1$ is the L1 metric (3.13). Strong consistency is also equivalent to consistency with respect to the Hellinger metric (Wu and Ghosal; 2010a).

Weak consistency is established by verifying the famous prior positivity condition $\Pi\{f \in \mathcal{F} : d_{KL}(f, f_0) < \varepsilon\} > 0$ of Schwartz (1965). Ghosal, Ghosh and Ramamoorthi

(1999) verified the Schwartz condition for basic location DP mixture of normal models, similar to the CCV model of the previous section, under lax conditions on f0.

Typically, however, strong consistency is of greater practical interest than weak. For example, Bhattacharya and Dunson (2012) note that weak consistency does not guarantee that the posterior assigns positive probability to the set of densities with respect to Lebesgue measure. Hence most recent work has been concerned with strong consistency. Ghosal, Ghosh and Ramamoorthi (1999) initiated these investigations, though a more general result was shown by Tokdar (2006): the location-and-scale DP mixture of normals (for example, DCV1 from the previous section), with a regularity

condition on the tail of the base measure G0, is strongly consistent for any f0 satisfying

the tail condition $\int |x|^{\eta} f_0(x)\, dx < \infty$. Lijoi, Prunster and Walker (2005) also give a relaxation of the Ghosal, Ghosh and Ramamoorthi (1999) conditions for strong consistency.

Posterior convergence rates

Ghosal and Van der Vaart (2001) and Ghosal and Van der Vaart (2007) extend the notion of strong consistency to establish rates of posterior convergence to the true density. These rates are again defined in terms of asymptotic behavior of (3.18), but with Hellinger neighborhoods of f0 that are allowed to shrink as n → ∞. If

$$\Pi\left(f : d_H(f, f_0) > M\varepsilon_n \mid X_1,\ldots,X_n\right) \to 0 \qquad (3.19)$$

almost surely under X1,X2,... ∼ f0, then εn is defined to be (an upper bound for) the rate of convergence with respect to the Hellinger distance dH from (3.17).

The papers of Ghosal and van der Vaart establish rates εn under two classes of f0.

Ghosal and Van der Vaart (2001) study “super-smooth” f0; estimation is carried out with DP mixtures of normals when f0 itself is a mixture of normals. Ghosal and Van der Vaart (2007) study a broader class of f0 only assumed to be twice continuously differentiable. We briefly summarize their findings here.

The following is a reproduction of a result from Ghosal and Van der Vaart (2001).

The result applies to estimation of densities of the form

$$f_0(x) = \int \varphi_{\sigma_0}(x - \theta)\, dH_0(\theta), \qquad (3.20)$$

where $\varphi_{\sigma}(u) = N(u \mid 0, \sigma^{2})$, and $\sigma_0$ is assumed to lie in a compact interval $[\underline{\sigma}, \overline{\sigma}] \subset$

(0, ∞). Estimation of f0 is carried out with a DP mixture of normals having a similar

50 mixture structure. The density is modeled as

$$f(x) = \int \varphi_{\sigma}(x - \theta)\, dG(\theta), \qquad (3.21)$$
where G ∼ DP(M, G0), G0 is a positive measure on R, and σ is independently distributed on $[\underline{\sigma}, \overline{\sigma}]$.

Theorem 2 (Ghosal and Van der Vaart (2001) Theorem 5.2). Assume that the true mixing measure H0 has sub-Gaussian tails in the sense that for some $c_0 > 0$, $H_0(|\theta| > t) \lesssim e^{-c_0 t^{2}}$ for all t. If the prior for σ has a continuous and positive density on an interval containing σ0 and the base measure G0 is normal, then for a large enough constant M,
$$\Pi\left(f : d_H(f, f_0) > M\,\frac{(\log n)^{3/2}}{\sqrt{n}} \mid X_1,\ldots,X_n\right) \to 0$$
in $F_0^{n}$-probability.

Here $\lesssim$ denotes inequality up to a fixed constant multiple. The key quantity is the rate $(\log n)^{3/2}/\sqrt{n}$. This is the posterior convergence rate for the location DP mixture of normals for estimating such a “super-smooth” f0. This rate is quite fast — it is described by Ghosal and Van der Vaart (2007) as “nearly parametric” — but it requires restrictive assumptions about synthesis between the mixture representation of the true density (3.20) and the structure of the mixture model (3.21).

Ghosal and Van der Vaart (2007) derive convergence rates for smooth densities f0 not assumed to be mixtures of normals. Of the two main results of the paper, the

first is a preliminary result which assumes f0 is compactly supported. The second, reproduced below, applies to a fairly broad class of f0, assumed to be twice continuously differentiable, with $\int (f_0''/f_0)^{2} f_0\, d\lambda < \infty$ and $\int (f_0'/f_0)^{2} f_0\, d\lambda < \infty$. The density

is modeled with a sequence of priors Πn, which describes the unknown density with a location mixture of normals (3.21), with the same DP prior on the mixing distribution G described earlier. The kernel variance σ is given a sequence of priors assumed to shrink as n → ∞ at an analogous rate to the L2-optimal bandwidth (2.6) for a

kernel density estimate. It is assumed that σ/σn ∼ P , where σn > 0 are real numbers

shrinking to zero (but not too fast): $n^{-a_2} \lesssim \sigma_n \lesssim n^{-a_1}$ for $0 < a_1 < a_2 < 1$. In the result below, P is assumed to be compactly supported in (0, ∞).

Theorem 3 (Ghosal and Van der Vaart (2007) Theorem 2). Assume f0 satisfies the

tail condition

$$F_0(|x| > a) \leq M e^{-c a^{\delta}} \qquad (3.22)$$
for some M, c, δ > 0. If the base measure G0 satisfies $G_0(|x| > a) \gtrsim e^{-d a^{\delta}}$ for some d > 0, then with respect to the Hellinger semi-distance
$$d_H^{k}(p, q) = \int_{-k}^{k}\left(\sqrt{p} - \sqrt{q}\right)^{2} d\lambda,$$
the rate of convergence is
$$\varepsilon_n = \max\left\{(n\sigma_n)^{-1/2}(\log n)^{1+\delta/2},\ \sigma_n^{2}\log n\right\}. \qquad (3.23)$$

A possible choice for the bandwidth scale mentioned by Ghosal and Van der

Vaart (2007) is $\sigma_n = n^{-1/5}(\log n)^{4/5}$, which yields a rate $\varepsilon_n = n^{-2/5}(\log n)^{13/5}$ when δ = 2. This should be compared to the analogous L1 rate of $n^{-2/5}$ for a kernel density

estimate (see (2.26)). These rates of course describe different manners of convergence,

but the orders are equivalent up to a logarithmic factor.

As a final note, if we intend to take the posterior mean (3.10) as a point estimate

of f0, the convergence (3.19) implies convergence of the posterior mean to f0 at the

same rate.

Chapter 4: Iterative Transformation and Bayesian Density Estimation

4.1 Background

We study the problem of estimating an unknown continuous density f on the real line. A popular Bayesian approach models the unknown density using Dirichlet Process Mixtures (DPM) (c.f. Ferguson (1983), Lo (1984)). The prior for f is constructed by convolving a continuous kernel, frequently Gaussian, with a Dirichlet Process (DP) (c.f. Ferguson (1973)) distributed mixing distribution G. Many variations on this basic structure have been proposed, one of the simplest being the location mixture of Gaussian kernels given by
$$G \sim \mathrm{DP}(M G_0), \qquad \mu_i \mid G \overset{iid}{\sim} G, \qquad X_i \mid \mu_i, \sigma^{2} \overset{indep}{\sim} N(\mu_i, \sigma^{2}), \quad i = 1, \ldots, n, \qquad (4.1)$$
where the base distribution G0 is often taken to be a normal distribution with fixed parameters, and an inverse-gamma prior for the kernel variance σ² completes the model. Although more complex structures have been proposed, even this basic construction is known to be extremely flexible, having KL support on a wide class of continuous distributions (Ghosal et al.; 1999).


Figure 4.1: DPM density estimates (dashed lines) based on samples of size 100 for two examples of the two-piece distributions (the true densities are shown in solid black). The leftmost density is symmetric, but has t2 tails. The rightmost density has Gaussian tails, but is right-skewed.

However, despite these models’ flexibility, a fact long recognized in classical kernel density estimation is pertinent in the Bayesian context: some densities f are easier to estimate than others. In particular, densities f which possess heavy tails or severe skew can cause difficulties for estimation when the sample size is small, or even moderate. Applying (4.1) directly to estimate such skewed and heavy-tailed samples gives estimates that tend, qualitatively, to be overly bumpy or unsmooth in the tails.

Quantitatively, the error for estimating such densities is inflated under a variety of metrics on the space of distributions, as will be shown in our simulation and data examples. For an understanding of these difficulties, consider the two examples shown in Figure 4.1. The plots show estimates of two so-called two-piece distributions, which were studied, for example, in Fernandez and Steel (1998) and Rubio and Steel (2014).

The left plot is a (symmetric but heavy-tailed) Student t distribution with 2 degrees

of freedom, while the right plot is a (light-tailed but skewed) split normal distribution

where the scale parameter is larger above the median by a factor of 5. In both cases,

the DPM estimates leave much to be desired. At left, in the heavy tailed plot, we can

see that the DPM model underestimates kurtosis. At right, in the skewed case, the

estimated density is bumpy, and appears to underestimate the degree of asymmetry

seen in the true density.

To alleviate this problem, in this paper we propose a Transformation DPM (TDPM)

method, which carefully selects a series of transformations from a parametric family

{ϕθ : θ ∈ Θ} that is designed to symmetrize and shorten the tails of the density, so that the density of the transformed sample Yθ,1,...,Yθ,n is easier to estimate than the density of the original observations X1,...,Xn, in a sense made specific in Section

4.2. We fit the model (4.1) on this transformed scale, produce a DPM estimate $\hat g_{n,\theta}$ of the density of the $Y_{\theta,i}$ on the transformed scale, and then back-transform for an estimate $\hat f_{n,\theta}$ of the density of the original $X_i$, that is,
$$\hat f_{n,\hat\theta} = \left(\hat g_{n,\hat\theta} \circ \varphi_{\hat\theta}\right)\cdot \varphi_{\hat\theta}'. \qquad (4.2)$$

Figure 4.2 illustrates the application of the TDPM procedure to estimating the two-

piece densities. It is clear that for both cases, the TDPM estimates are closer to the

true densities than the direct DPM estimates. This transformation-density-estimation

technique has been investigated in the context of kernel density estimation and is shown to be effective for producing improved density estimates (c.f. Wand et al. (1991)

and Yang and Marron (1999)). Recently, Iwata, Duvenaud and Ghahramani (2013)

proposed to improve manifold learning and cluster detection through transformations

with nonparametric warping functions. Our TDPM method, in contrast, utilizes a

low dimensional parametric transformation, with the focus of minimizing a statistic

that measures the ease with which a density can be estimated.

The rest of the paper is structured as follows. In Section 4.2, we describe a new

family of transformations {ϕθ : θ ∈ Θ} that is rich enough to correct problems with both skew and heavy tails, and a method for selecting a series of transformations from the family. In Section 4.3, we discuss the use of such transformations in combination with the DP mixture model, and demonstrate the effectiveness of the method, under a variety of simulated scenarios, at reducing error for estimating the original f0. In

Section 4.4, the method is used to estimate and compare the distributions of body mass index across groups of individuals from the 2008 Ohio Family Health Survey.

Finally, Section 5 concludes with a discussion of the contributions and of future work.

4.2 Transformations

Several authors (Wand et al. (1991); Ruppert and Wand (1992); Yang and Marron (1999)) have considered the use of parametric transformations for reducing bias of kernel density estimates (KDEs), motivating their transformations through asymptotics of the kernel estimators. Because of the role of (4.1) as a Bayesian analogue of the Gaussian KDE (Escobar and West; 1995), and the connections between KDE asymptotics and the asymptotic behavior of posteriors from DP mixture models (Ghosal and Van der Vaart; 2001), we hypothesize that these authors’ transformation-density estimation methods are also relevant for DP mixture estimation, and that much of their work concerning transformation selection translates well to the setting of Bayesian density estimation.

[Figure 4.2 panels: (A1)/(B1) Estimates on Original Scale; (A2)/(B2) Transformation (original versus transformed scale); (A3)/(B3) Estimates on Transformed Scale.]

Figure 4.2: Illustration of the Transformation-DPM technique. The heavy-tailed sample (left column, A1-A3) and skewed sample (right column, B1-B3) of figure 4.1 are transformed according to the symmetrizing and tail-shortening transformations of section 2. The DPM model is fit to the transformed samples in the bottom panels, then back-transformed to give the TDPM estimate on the original scale.

In studying the transformation density estimation technique, we must address two central questions:

1. How to define an appropriate family of transformations {ϕθ : θ ∈ Θ}, and

2. How to select a transformation $\varphi_{\hat\theta}$ from this family based on a sample?

The answer to the first question depends on the types of densities one wishes to estimate. A quick review of work on transformation kernel density estimation yields some ideas. Wand et al. (1991) seeks to estimate skewed densities whose support is bounded below. They suggest a signed-power family of transformations which maps ranges [c, ∞) to R. Ruppert and Wand (1992) estimates kurtotic densities on R by first applying a kurtosis reducing transformation. Yang and Marron (1999) suggest that for some densities, a second or third round of transformations can further improve the quality of kernel estimates. They consider an ensemble of three parametric

Johnson transformations, and develop an iterative method for choosing between these families.

Bayesian methods for density estimation differ from classical methods in an essential way: they are driven by the likelihood. Consequently, evaluation of the success (or failure) of the methods in simulation settings is tied to likelihood, and, for real-data examples, to out-of-sample predictive performance. Thus, to avoid a zero in the likelihood, it is essential to get the right support for the density, and we consider new families of transformations which preserve the support. For skew correction, we choose an alternative to the families of Wand et al. (1991) and Yang and Marron

(1999) that, unlike those transformations, maps R to R. For kurtosis reduction, we

forego the specific transformations in Ruppert and Wand (1992) and Yang and Marron (1999) in favor of a simple cdf-inverse-cdf transformation whose parameter can

be directly related to the extreme tail index of the original density. All of the transformations we consider are strictly monotonic. The families we employ are detailed

in Section 4.2.1.

To answer the second question, our approach is to develop a single statistic measuring the ease with which a density can be estimated. This statistic should be quickly

and easily computable for a large collection of candidate transformations. In their

work on transformation kernel density estimation, Ruppert and Wand (1992), Wand

et al. (1991), and Yang and Marron (1999) take this approach, and those authors

agree on the form of the statistic. For each candidate transformation Yθ,i = ϕθ(Xi), one can compute an estimate of the criterion

$$L(\theta) = \sigma_\theta \left(\int g_{0,\theta}''(y)^{2}\, dy\right)^{1/5}, \qquad (4.3)$$
where $g_{0,\theta}(y) = f_0\!\left(\varphi_\theta^{-1}(y)\right)\cdot \frac{d}{dy}\varphi_\theta^{-1}(y)$ is the transformed density. This criterion is

motivated through study of the L2 asymptotics of kernel density estimates; larger

values of L(θ) indicate that $g_{0,\theta}$ is more difficult to estimate. In practice, estimation of (4.3) requires estimation of the integrated curvature $R(\theta) = \int g_{0,\theta}''(y)^{2}\, dy$, a problem which is well studied because of its role in “plug-in” bandwidth selectors

for kernel density estimates (Sheather and Jones; 1991). Like Ruppert and Wand

(1992), Wand et al. (1991), and Yang and Marron (1999), we select transformations

according to a kernel estimate $\hat L(\theta)$ of (4.3). The criterion (4.3) is motivated, and the

estimator described in detail, in Section 4.2.2.

59 4.2.1 Family of Transformations

We consider a collection {ϕθ : θ ∈ Θ} of transformations consisting of two parametric families: a skew-correcting transformation due to Yeo and Johnson (2000), and a kurtosis-reducing Student-t cdf-inverse-Gaussian-cdf transformation.

These parametric transformation families, which will be described shortly, are appropriate for samples centered near 0 and appropriately scaled. For this reason, the original X = (X1,...,Xn) are first centered and scaled according to the sample

median m(X) and interquartile range I(X), giving
$$\tilde X_i = \tilde h(X_i) = \frac{X_i - m(X)}{I(X)/5}.$$
This rescaling sets $I(\tilde X) = 5$ as the desired interquartile range of $\tilde X_1, \ldots, \tilde X_n$. This

constant affects the achievable shapes of Yeo-Johnson transformations (4.4); simulations suggest samples with an IQR of 5 allow effective estimation of the Yeo-Johnson

parameter.

Selection of the appropriate transformation family proceeds with the rescaled data. After centering and scaling according to $\tilde X_i = \tilde h(X_i)$, a parametric transformation $\tilde\varphi_\theta$ is applied to the centered and scaled data to give the final transformation
$$Y_i = \varphi_\theta(X_i) = \left(\tilde\varphi_\theta \circ \tilde h\right)(X_i).$$
We search for an appropriate parametric transformation $\tilde\varphi_\theta$ over a parameter θ ∈ Θ =

{(j, θj): j ∈ {1, 2}, θj ∈ Θj}, where j ∈ {1, 2} indexes families of transformations

with parameters θj.

Figure 4.3: The families $\tilde\varphi_{(1,\theta_1)}$ of Yeo-Johnson transformations and $\tilde\varphi_{(2,\theta_2)}$ of t-to-Normal cdf transformations.

The family j = 1 corrects excessively skewed distributions by employing the Yeo-

Johnson power transformations (Yeo and Johnson (2000)) given by

$$\tilde\varphi_{(1,\theta_1)}(x) = \begin{cases} \left\{(x+1)^{\theta_1} - 1\right\}/\theta_1, & x \geq 0,\ \theta_1 \neq 0 \\ \log(x+1), & x \geq 0,\ \theta_1 = 0 \\ -\left\{(-x+1)^{2-\theta_1} - 1\right\}/(2-\theta_1), & x < 0,\ \theta_1 \neq 2 \\ -\log(-x+1), & x < 0,\ \theta_1 = 2, \end{cases} \qquad (4.4)$$
where the real-valued parameter θ1 is restricted to Θ1 = [0, 2]. See the left panel of Figure 4.3 for the shapes of the Yeo-Johnson transformations. These are closely related to the famous Box-Cox family and the signed power transformations of Wand et al. (1991). The Yeo-Johnson family is appropriate for correcting both left and right skew, with θ1 > 1 and θ1 < 1, respectively. Setting θ1 = 1 gives the identity transformation. The family possesses a symmetry property that $\tilde\varphi_{(1,\theta_1)}(x) = -\tilde\varphi_{(1,2-\theta_1)}(-x)$.

61 For excessively kurtotic distributions, we propose a family of tail shortening cdf-

inverse-cdf transformations,

$$\tilde\varphi_{(2,\theta_2)}(x) = \Phi^{-1}\!\left(T_{\theta_2}\!\left(x / b_{\theta_2}\right)\right), \qquad (4.5)$$
where Φ and $T_{\theta_2}$ are, respectively, the cdfs of a standard normal and a t distribution. The degrees of freedom parameter $\theta_2 > 0$ of the t cdf controls the severity of the kurtosis reduction. The rescaling constant $b_{\theta_2} = 5/\left\{T_{\theta_2}^{-1}(0.75) - T_{\theta_2}^{-1}(0.25)\right\}$ is required to again rescale the input sample $\tilde X_1, \ldots, \tilde X_n$ from their interquartile range of 5 to match the IQR of a $t_{\theta_2}$ distribution.
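A minimal sketch of these two families and the preliminary centering-and-scaling step (Python with numpy/scipy) is given below; the gamma-distributed example data and the specific parameter values are assumptions made for illustration, and the rescaling constant $b_{\theta_2}$ is implemented as reconstructed above.

```python
import numpy as np
from scipy import stats

def center_scale(x):
    """Center at the sample median and rescale so the IQR of the output is 5."""
    q1, m, q3 = np.percentile(x, [25, 50, 75])
    return (x - m) / ((q3 - q1) / 5.0)

def yeo_johnson(x, theta1):
    """Skew-correcting Yeo-Johnson transformation (4.4), theta1 in [0, 2]."""
    x = np.asarray(x, dtype=float)
    pos, out = x >= 0, np.empty_like(x)
    if theta1 != 0:
        out[pos] = ((x[pos] + 1.0) ** theta1 - 1.0) / theta1
    else:
        out[pos] = np.log(x[pos] + 1.0)
    if theta1 != 2:
        out[~pos] = -(((-x[~pos] + 1.0) ** (2.0 - theta1)) - 1.0) / (2.0 - theta1)
    else:
        out[~pos] = -np.log(-x[~pos] + 1.0)
    return out

def t_to_normal(x, theta2):
    """Tail-shortening cdf-inverse-cdf transformation (4.5) with a t_theta2 cdf."""
    b = 5.0 / (stats.t.ppf(0.75, df=theta2) - stats.t.ppf(0.25, df=theta2))
    return stats.norm.ppf(stats.t.cdf(np.asarray(x) / b, df=theta2))

# Example: a right-skewed sample is rescaled and then partially symmetrized.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=500)
y = yeo_johnson(center_scale(x), theta1=0.5)
print("sample skewness before / after:", stats.skew(x), stats.skew(y))
```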

4.2.2 A Criterion for Estimating Transformation Parameters

The criterion L(θ) given in (4.3) can be motivated as a target function for guiding

the transformation θ by studying the asymptotics of the Gaussian kernel density estimator
$$\hat g_{n,h_\theta}(y) = \frac{1}{n}\sum_{i=1}^{n}\phi_{h_\theta}(y - Y_{\theta,i}) \qquad (4.6)$$
of the transformed density $g_{0,\theta}$. The kernel in this classic estimator is $\phi_h(u) = (2\pi)^{-1/2} h^{-1} e^{-u^{2}/(2h^{2})}$. In this section, we review the asymptotic considerations of (4.6) that give rise to the criterion L(θ) in (4.3), a data-based estimator $\hat L(\theta)$ of L, and finally, a selection for $\hat\theta \in \Theta$.

Provided g0,θ possesses two bounded, continuous derivatives, the mean integrated

squared error
$$\mathrm{MISE}(\theta) = \int \left(\hat g_{n,h_\theta}(y) - g_{0,\theta}(y)\right)^{2} dy \qquad (4.7)$$
of the KDE $\hat g_{n,h_\theta}$ admits an expansion that is quite standard in the kernel density estimation literature (c.f. Wand and Jones (1994)),
$$\mathrm{MISE}(\theta) = \mathrm{AMISE}(\theta) + o\!\left(\frac{1}{n h_\theta} + h_\theta^{4}\right), \qquad n \to \infty,\ h_\theta \to 0,\ n h_\theta \to \infty. \qquad (4.8)$$
These asymptotics allow different bandwidth selections $\hat h_\theta$ for each possible transformed sample $Y_{\theta,1},\ldots,Y_{\theta,n}$, but require, for each θ, that the bandwidth shrinks to 0 more slowly than $n^{-1}$. For large n, the L2 error of $\hat g_{n,h_\theta}$ depends chiefly on the asymptotic mean integrated squared error
$$\mathrm{AMISE}(\theta, h_\theta) = \frac{R(\phi)}{n h_\theta} + \frac{1}{4} h_\theta^{4}\, R(g_{0,\theta}''), \qquad (4.9)$$
where R is the curvature functional
$$R(g_{0,\theta}'') = \int \left(g_{0,\theta}''(y)\right)^{2} dy \qquad (4.10)$$

of the transformed density $g_{0,\theta}$. This suggests a strategy of choosing both the transformation parameter θ and the bandwidth $h_\theta$ to reduce AMISE(θ, hθ). For a given $R(g_{0,\theta}'')$, the bandwidth minimizing the AMISE is
$$h_\theta^{*} = C_1(\phi)\, R(g_{0,\theta}'')^{-1/5}\, n^{-1/5}.$$
Plugging $h_\theta^{*}$ into (4.9), we find that for each θ, the minimum AMISE over possible bandwidths $h_\theta > 0$ is
$$\mathrm{AMISE}^{*}(\theta) = \mathrm{AMISE}(\theta, h_\theta^{*}) = C_2(\phi)\left(\int g_{0,\theta}''(y)^{2}\, dy\right)^{1/5} n^{-4/5},$$
which depends on θ only through a positive power of the curvature $R(g_{0,\theta}'')$. Hence, the transformation θ which minimizes $R(g_{0,\theta}'')$ also minimizes the AMISE (4.9).

There is a technical difficulty that, like the MISE itself, the curvature $R(g_{0,\theta}'')$ is not scale invariant. For example, standardizing $g_{0,\theta}$ by its mean $\mu_\theta = E_{f_0}(\varphi_\theta(X))$ and standard deviation $\sigma_\theta = \mathrm{Var}_{f_0}^{1/2}(\varphi_\theta(X))$, and considering the density $h_{0,\theta}(z) = \sigma_\theta\, g_{0,\theta}(\sigma_\theta z + \mu_\theta)$ of $Z = (Y - \mu_\theta)/\sigma_\theta$, one finds that $R(h_{0,\theta}'') = \sigma_\theta^{5}\, R(g_{0,\theta}'')$. This is an unacceptable property if we are to use R as a criterion for transformations.

Standardizing R for scale yields the criterion (4.3). L(θ) is used as a target function in the transformation density estimation schemes of Ruppert and Wand

(1992), Wand et al. (1991), and Yang and Marron (1999). Transformations θ ∈ Θ near

the L-optimal $\theta^{*} = \operatorname{argmin}_{\theta \in \Theta} L(\theta)$ lead to densities for which the global-bandwidth normal KDE $\hat g_{n,h_\theta}$ in (4.6) will incur smaller L2 errors for estimating the true $g_{0,\theta}$. In practice, of course, L cannot be evaluated, so $\theta^{*}$ is unknown. To concoct an estimate $\hat\theta$, we adopt the strategy of Ruppert and Wand (1992), Wand et al. (1991), and Yang and Marron (1999) of first developing a kernel estimate $\hat L(\theta)$ of (4.3), and taking $\hat\theta = \operatorname{argmin}_{\theta} \hat L(\theta)$. The estimates are of the form
$$\hat L(\theta) = \hat\sigma_\theta \left(\hat R(Y_{\theta,1},\ldots,Y_{\theta,n})\right)^{1/5}, \qquad (4.11)$$
where $\hat\sigma_\theta^{2} = (n-1)^{-1}\sum_i (Y_{\theta,i} - \bar Y_\theta)^{2}$, and the chief difficulty is choosing an estimator $\hat R(Y_{\theta,1},\ldots,Y_{\theta,n})$ of the integrated curvature $R(g_{0,\theta}'')$ in (4.10).

Several estimators of $R(g_{0,\theta}'')$ have been proposed. Jones and Sheather (1991) suggest the popular “diagonals-in” choice
$$\hat R_1(Y_{\theta,1},\ldots,Y_{\theta,n}) = n^{-2} b^{-5} \sum_{i=1}^{n}\sum_{j=1}^{n} K^{(4)}\!\left(b^{-1}(Y_{\theta,i} - Y_{\theta,j})\right), \qquad (4.12)$$
derived from a further kernel estimate
$$\tilde f_{n,b}(y) = n^{-1} b^{-1}\sum_{i=1}^{n} K\!\left(b^{-1}(y - Y_i)\right), \qquad (4.13)$$
where b is a bandwidth and K is a kernel for estimating $R(g_{0,\theta}'')$, not to be confused with h and φ in (4.6). Sheather and Jones (1991) give a rule for selecting b of the form
$$b_{SJ} = C(g_{0,\theta})\, D(K)\, n^{-1/7} \qquad (4.14)$$
by setting the asymptotically dominant terms in the bias of (4.12) to zero and solving the resulting equation. The constant $C(g_{0,\theta})$ involves higher order integrated squared derivatives of $g_{0,\theta}$, which Sheather and Jones (1991) suggest estimating in another similar stage. An implementation of the bandwidth selector and curvature estimator (4.12) is given by the R package KernSmooth (Wand and Ripley; 2015).

However, the strategy of Sheather and Jones (1991) using (4.12) requires at least

$2n^{2}$ operations, which is too costly for our setting, in which we would like to compute

(4.12) for many candidate values of θ. We take a simple numerical approach to estimating the integrated curvature of the kernel approximation (4.13),

$$\hat R_2(Y_{\theta,1},\ldots,Y_{\theta,n}) = R(\tilde f_{n,b_{SJ}}'') = \int \tilde f_{n,b_{SJ}}''(y)^{2}\, dy.$$
We use a bandwidth $b_{SJ}$ selected according to the Sheather-Jones rule (4.14), and a normal kernel K = φ. The integral is approximated over an equally spaced grid of 1000 points covering 20 sample standard deviations. The result is taken as $\hat R(Y_{\theta,1},\ldots,Y_{\theta,n})$ in (4.11). Yang (2000) gives some asymptotic results concerning properties of $\hat L(\theta)$ as an estimator of L(θ), and of $\hat\theta$ as an estimator of $\theta^{*}$. The trials we conduct in the simulations and data analysis sections are concerned less with the accuracy of $\hat\theta$ for $\theta^{*}$, and more with the accuracy of density estimation. To suit this purpose, the $\hat L$-optimal transformation performs quite well.
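The following sketch (Python with numpy/scipy) illustrates the numerical strategy just described: estimate the transformed density with a Gaussian kernel, differentiate it twice on a grid, and integrate the squared curvature. For simplicity it uses a normal-reference bandwidth in place of the Sheather-Jones rule (4.14), a smaller grid, and finite differences for the second derivative; these simplifications are assumptions of the sketch, not the dissertation's implementation.

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

def curvature_hat(y, grid_size=1000, span_sd=10):
    """Numerical estimate of the integrated curvature int f''(t)^2 dt of a Gaussian
    KDE fit to the sample y, in the spirit of R_hat_2 above."""
    y = np.asarray(y, dtype=float)
    b = 1.06 * y.std(ddof=1) * y.size ** (-0.2)   # normal-reference bandwidth, standing in for b_SJ
    t = np.linspace(y.mean() - span_sd * y.std(), y.mean() + span_sd * y.std(), grid_size)
    f = stats.norm.pdf(t[:, None], loc=y[None, :], scale=b).mean(axis=1)   # KDE on the grid
    d2 = np.gradient(np.gradient(f, t), t)        # second derivative by finite differences
    return trapezoid(d2 ** 2, t)

def L_hat(y):
    """Kernel estimate (4.11) of the criterion: sigma_hat * R_hat^{1/5}."""
    return np.std(y, ddof=1) * curvature_hat(y) ** 0.2

rng = np.random.default_rng(0)
x = stats.t.rvs(df=2, size=500, random_state=rng)
z = stats.norm.ppf(stats.t.cdf(x, df=2))          # tail-shortening transform with theta2 = 2
print("L_hat before transformation:", L_hat(x))
print("L_hat after transformation: ", L_hat(z))
```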

In Section 4.2.3, we describe rules for selecting, in an iterative fashion, a series of transformations from among the families described in Section 4.2.1.

65 4.2.3 Iterative Transformation Selection

As demonstrated by Yang and Marron (1999), we note that some difficult densities may benefit from multiple transformations applied in sequence. In particular, densities which are both heavy tailed and skewed may require application of both transformation families $\varphi_{(1,\theta_1)}$ and $\varphi_{(2,\theta_2)}$ described in Section 4.2.1.

We now describe a method for selecting, on the basis of the statistic $\hat L(\theta)$, a series of transformations
$$\hat\theta^{(1)}, \hat\theta^{(2)}, \ldots, \hat\theta^{(K)} \in \Theta = \{(j, \theta_j) : j \in \{1, 2\},\ \theta_j \in \Theta_j\},$$
chosen from the families $(j, \theta_j)$, giving transformed sample values
$$Y_{\hat\theta,i} = \left(\varphi_{\hat\theta^{(K)}} \circ \cdots \circ \varphi_{\hat\theta^{(2)}} \circ \varphi_{\hat\theta^{(1)}}\right)(X_i). \qquad (4.15)$$

In the simulations and data analyses of the next sections, we set Θ1 and Θ2

to be equally spaced grids over reasonable parameter values for the skew- and tail-

transformations. For the Yeo-Johnson family (4.4), Θ1 is set as a grid of 1000 equally

spaced values between 0 and 2. For the cdf-inverse-cdf transformation, the set Θ2

consists of 1000 possible values, equally spaced on the inverse scale between 1 and 20.

In total, this yields 2000 candidate transformations plus the identity.

The total number K of transformations that should be applied for a given density

is of great interest. The procedure should be sensitive enough to give a number K of transformations so that
$$g_{0,\theta}(y) = \left(f_0 \circ \varphi_{\theta^{(1)}}^{-1} \circ \cdots \circ \varphi_{\theta^{(K)}}^{-1}\right)(y) \cdot \prod_{l=1}^{K}\left(\varphi_{\theta^{(l)}}^{-1}\right)'(y)$$
has a small value of L(θ). However, it should not be too aggressive in suggesting transformations of already-easy densities such as the normal, nor should it apply transformations that do not give a substantial reduction in $\hat L$.

To control the sensitivity of the procedure, we simulate 10,000 standard normal samples of size n, evaluate $\hat L$ for these, and continue applying transformations only so long as $\hat L(\theta)$ satisfies the two conditions: (1) $\hat L$ exceeds the sample 1 − α quantile $\hat L_{n,1-\alpha}$ of the 10,000 simulated normal $\hat L$ values for some small α, which we set to be α = 0.1, and (2) the $\hat L$-optimal transformation reduces $\hat L$ by more than a minimum percentage, which we set to be 5%. Thus, at stage k, given the current transformed version of the data $X_1^{(k)}, \ldots, X_n^{(k)}$, the procedure finds the optimal transformation over the grids Θ1 and Θ2 of skew- and tail-transformation parameter values. Let $\hat\theta^{(k+1)}$ and $\hat L^{(k+1)}$ denote the minimizer and the minimum, respectively. If either

1. $\hat L^{(k+1)} \geq 0.95\,\hat L^{(k)}$, or

2. $\hat L^{(k+1)} \leq \hat L_{n,1-\alpha}$,

the transformation $\hat\theta^{(k+1)}$ is not applied: we set K = k, and proceed with density estimation with the sample values (4.15). Otherwise, $\hat\theta^{(k+1)}$ is applied, and another round of transformations is proposed.
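A compact sketch of this iterative selection rule (Python with numpy/scipy) appears below. It reuses the L_hat, center_scale, yeo_johnson, and t_to_normal helpers sketched earlier, uses much coarser parameter grids and fewer benchmark simulations than described above, and does not re-standardize between stages; all of these are simplifications for illustration rather than the dissertation's implementation.

```python
import numpy as np

def normal_benchmark(n, n_sims=1000, alpha=0.1, seed=None):
    """Sample (1 - alpha) quantile of L_hat over simulated standard normal samples."""
    rng = np.random.default_rng(seed)
    return np.quantile([L_hat(rng.normal(size=n)) for _ in range(n_sims)], 1 - alpha)

def select_transformations(x, max_steps=5, seed=None):
    """Greedily pick skew- or tail-transformations while they reduce L_hat enough."""
    theta1_grid = np.linspace(0.0, 2.0, 41)                 # Yeo-Johnson parameters
    theta2_grid = 1.0 / np.linspace(1.0 / 20.0, 1.0, 41)    # t degrees of freedom, inverse scale
    benchmark = normal_benchmark(len(x), seed=seed)
    y = center_scale(x)
    chosen, L_cur = [], L_hat(y)
    for _ in range(max_steps):
        candidates = ([(("YJ", t1), yeo_johnson(y, t1)) for t1 in theta1_grid]
                      + [(("tCDF", t2), t_to_normal(y, t2)) for t2 in theta2_grid])
        scored = [(L_hat(z), label, z) for label, z in candidates]
        L_new, label, z = min(scored, key=lambda s: s[0])
        if L_new >= 0.95 * L_cur or L_new <= benchmark:     # stopping conditions (1) and (2)
            break
        y, L_cur = z, L_new
        chosen.append(label)
    return chosen, y
```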

4.3 Simulation Study

4.3.1 Simulation Design

We design a simulation study to investigate the utility of the transformation

method in estimating densities of varying skewness and tail heaviness. We consider

estimating two-piece densities of the form,

2  x − µ x − µ  f0(x) = g I(−∞,µ)(x) + g I(−∞,µ)(x) , (4.16) σ1 + σ2 σ1 σ2

where g is a symmetric density from a location-scale family, and σ1, σ2 > 0 are

distinct scale parameters for the regions above and below the median µ of f0. The

67 parent density g controls the tail behavior of f0, while the ratio of σ1 to σ2 controls

the degree of skewness. Parametric estimation of these densities has been studied,

for example, by Fernandez and Steel (1998) and Rubio and Steel (2014). In the

simulations detailed here, we fix µ = 0 and σ1 = 1. We then study cases where

σ2 = 1, for symmetric distributions, and where σ2 = 5, for right-skewed distributions.

Additionally, we study two choices for the parent family g: a standard normal density

and a heavy-tailed t density with 2 degrees of freedom. The resulting four densities

are depicted in the leftmost column of Figure 4.4. From each of the four two-piece

densities and each of three different sample sizes n = 100, 200 and 500, we perform

20 replicates of the simulation.

For each sample, we compare the direct DPM density estimate — found by fitting ˆ the basic DPM model (4.1) to the untransformed Xi so that fn is the posterior

predictive density — to the Transformation DPM density estimate (4.2), with the

transformation selected as described in Section 4.2, so that

ˆ  0 fn,θˆ = gˆn,θˆ ◦ ϕθˆ · ϕθˆ,

whereg ˆn,θˆ is the posterior predictive density of the DPM (4.1) fit on the transformed

scale to the Yθ,iˆ ’s. In addition, we fit Griffin’s modified DPM model, which can be represented by

G ∼ DP(MG0) iid µi G ∼ G (4.17)   indep ζi 2 Xi µi ∼ N µi, a σ i = 1, . . . , n µζ 2 where a ∼ Beta(α, β), G0 = N(µ0, (1 − a)σ ), ζi are iid from an inverse-gamma

2 distribution with mean µζ , and σ is given an inverse-gamma prior. This model was proposed by Griffin (2010) as an improved procedure over the basic DPM model (4.1) for density estimation, and is used as a benchmark in our simulation studies.

68 Computationally, the MCMC for the Bayesian methods is the most demanding

component for CPU time. The basic location mixture (4.1) is a conjugate-style model,

and is fit with a Gibbs sampler over a collapsed sample space, along the lines of

Algorithm 3 of MacEachern (1994). Griffin’s model is non-conjugate; for this model,

we implement the MCMC strategy suggested by the author in Appendix A of Griffin

(2010). In comparison, estimation of the transformation is far less computationally

demanding. Consider one example of a sample of size n = 500 from the skewed and

heavy-tailed density. Using one core of a 2×Twelve Core Xeon E5-2690 v3 / 2.6GHz

/ 128GB machine, the sequence of transformations can be completed in 3.09 minutes, while 5000 iterations of MCMC for the basic DPM model requires 27.44 minutes.

4.3.2 Simulation Results

For comparing point estimates to the truth, we employ the Hellinger distance, expressed here for a real-valued variate,

Z 1/2  p p 2 dH p, q = p(x) − q(x) dx . R

Hellinger distance is a useful metric for quantifying the distance between distributions

p and q. Bayesian methods construct the point estimates using MCMC approxima-

tions to the posterior predictive density given the observations. For each estimate ˆ ˆ fX , we evaluate the Hellinger distance dH (fX , fX ) between the point estimate and

the true two-piece density. Although the numerical results are presented only un-

der the Hellinger distance, other metrics such as the total variation distance and the

Kullback-Leibler distance have also been evaluated, and the results are similar under

these alternative metrics.

69 n = 100 n = 200 n = 500 Symmetric with Gaussian tails 0.015 0.4 0.020 0.3 0.010 hat,f) hat,f) hat,f) 0.004 − − − 0.2 f(x) d(f d(f d(f 0.010 0.005 0.1 0.0 0.000 0.000 0.000 −4 −2 0 2 4 KDE KDE KDE DPM DPM DPM TKDE TKDE TKDE x TDPM TDPM TDPM

n = 100 n = 200 n = 500 Skewed with Gaussian tails 0.04 0.4 0.015 0.03 0.008 0.3 hat,f) hat,f) hat,f) 0.010 − − − 0.02 0.2 f(x) d(f d(f d(f 0.004 0.005 0.01 0.1 0.0 0.00 0.000 0.000 0 5 10 15 20 KDE KDE KDE DPM DPM DPM TKDE TKDE TKDE x TDPM TDPM TDPM

n = 100 n = 200 n = 500 Symmetric with t tails 0.04 0.015 0.4 0.020 0.03 0.3 0.010 hat,f) hat,f) hat,f) − − − 0.02 0.2 f(x) d(f d(f d(f 0.010 0.005 0.01 0.1 0.0 0.00 0.000 0.000 −4 −2 0 2 4 KDE KDE KDE DPM DPM DPM TKDE TKDE TKDE x TDPM TDPM TDPM

n = 100 n = 200 n = 500 Skewed with t tails 0.04 0.4 0.015 0.03 0.020 0.3 hat,f) hat,f) hat,f) 0.010 − − − 0.02 0.2 f(x) d(f d(f d(f 0.010 0.005 0.01 0.1 0.0 0.00 0.000 0.000 0 5 10 15 20 KDE KDE KDE DPM DPM DPM TKDE TKDE TKDE x TDPM TDPM TDPM

Figure 4.4: Comparison of Hellinger error for four density estimates, KDE, TKDE, DPM, and TDPM. 20 replicate samples from each scenario and sample size. Hori- zontal dotted line represents median Hellinger distance obtained by fitting Griffin’s (2010) model without transformation. 70 Figure 4.4 gives a comparison of the Hellinger error for the Transformation DPM

(TDPM) density estimates, direct DPM density estimates, transformation KDE (TKDE), and direct KDE, across all simulation settings. The horizontal dotted lines represent the median of Hellinger distance for direct fits of Griffin’s model (4.17). These nu- merical results again suggest that the TDPM approach gives improved estimates for skewed or heavy-tailed distributions. For the heavy-tailed scenarios in particular, transformations reduced the median Hellinger error by half in comparison to a direct

fit of (4.1). Griffin’s DPM (4.17) notably outperforms the basic DP mixture (4.1) for capturing skewness and heavy tails, but the TDPM method, combining (4.1) with a pre-transformation, gives still better results.

To investigate the transformations employed in the estimation procedures, we il- lustrate the selected transformations in Figure 4.6 for the samples with size n = 200.

For the skewed and heavy-tailed densities investigated here, the criterion (4.11) ap- pears to be effective for identifying an appropriate remedial transformation. In each case, the transformation symmetrizes and shortens the tail of the original density, allowing more accurate density estimation on the transformed scale. When the true density is already “easy” to estimate5, as is the case with the standard normal den- sity shown in the top row of Figure 4.6, the selected transformations are close to linear. When the distribution is skewed, as in the second column, the skew-correcting transformations (4.4) of Yeo and Johnson are most commonly chosen. When the dis- tribution is heavy tailed but symmetric, the tail-shortening copula transformations

(4.5) are most common.

5As noted by Yang and Marron (1999), normal densities possess a curvature (4.3) that is already near the minimum among all densities with continuous second derivative.

71 Figure 4.5: Top: comparison of density estimates (via KDE and DPM) with and without transformation, based on a single sample of size n = 200 from the skewed and heavy tailed density with σ2 = 5 and g(·) a t distribution with 2 degrees of freedom. Bottom: quantile-quantile plot comparing estimated to true quantiles for the same sample and model fits. Dashed and dotted lines represent point estimates of the quantiles, while the shaded regions represent pointwise 90% credible intervals for the true quantile based on the Bayesian fits.

The Bayesian DPM model (4.1) allows a natural description of the uncertainty in the density estimation procedure. For a sample of size n = 200 from the skewed and heavy-tailed two-piece density with t2 tails, in Figure 4.5 we show draws from the

DPM and TDPM posterior distributions for the density. It can be seen that for this moderate sample size, the location-mixture (4.1) of Gaussians struggles to capture the polynomial tail decay of a t2; while after a tail-correcting transformation, the TDPM

72 Symmetric with Gaussian tails Skewed with Gaussian tails Symmetric with t tails Skewed with t tails 0.4 0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.2 f(x) f(x) f(x) f(x) 0.1 0.1 0.1 0.1 0.0 0.0 0.0 0.0

−4 −2 0 2 4 0 5 10 15 20 −4 −2 0 2 4 0 5 10 15 20

x x x x

Transformation family by round 1.0 1.0 Final 1.0 1.0 Simulation tt− cdfto−Gaussian-inverse Copulacdf Estimated Yeo−Johnson Scenario 0.8 Yeo-Johnson 0.8 Transformations 0.8 0.8

(a) Symmetric with Gaussian tails 0.6 0.6 0.6 0.6 15

0.8 t−to−Gaussian Copula 0.4 0.4 0.4 0.4 0.4

Yeo−Johnson 5 5 0.2 f(x) 0.4 0.2 0.2 0.2 0.2 − Percentage of Simulated Samples of Simulated Percentage Samples of Simulated Percentage Samples of Simulated Percentage Samples of Simulated Percentage transformed scale, Y scale, transformed 15 0.0 − % of Simulated Samples % of Simulated 0.0 0.0 0.0 0.0 0.0 3 2 1 0 1 2 3

−4 −2 0 2 4 − − −

x original scale, X Round 1 Round 1 Round 1 Round 1 Round 2 Round 1 Round 2

(b) Skewed with Gaussian tails 15 0.8 0.4 15 15 15 15 5 10 10 10 10 5 0.2 f(x) 0.4 − 5 5 5 5 transformed scale, Y scale, transformed 15 0.0 − % of Simulated Samples % of Simulated 0.0 0 5 0 0 0 0

0 5 10 15 20 10 15

x 5 5 original scale, X 5 5 Round 1 − − − − transformed scale, Y scale, transformed Y scale, transformed Y scale, transformed Y scale, transformed

(c) Symmetric with t tails 15 15 15 15 − − − − 15 3 2 1 6 4 2 5 0 1 2 3 0 5 0 2 4 6 0 5 0.8 10 15 10 15 20 25 30 − − − − − − − 0.4 5 original scale, X original scale, X original scale, X original scale, X 5 0.2 f(x) 0.4 − transformed scale, Y scale, transformed 15 0.0 − % of Simulated Samples % of Simulated 0.0 6 4 2 0 2 4 6

−4 −2 0 2 4 − − −

x original scale, X Round 1 Round 2

(d) Skewed with t tails 15 0.8 0.4 5 5 0.2 f(x) 0.4 − transformed scale, Y scale, transformed 15 0.0 − % of Simulated Samples % of Simulated 0.0 5 0 5 10 15 20 25 30 0 5 10 15 20 −

x original scale, X Round 1 Round 2

Figure 4.6: Illustration of the estimated transformations based on 20 simulated sam- ples of size n = 200 from each of the four two piece scenarios (a)-(d).

73 model gives a more accurate depiction of the extreme tail behavior. Moreover, the

DPM model yields unstable estimates of the extreme quantiles, with the empirical confidence levels of the credible intervals often falling much below the nominal values.

On the other hand, the TDPM method provides much more stable estimates and more reliable coverage.

4.4 An Application to BMI Modeling

We demonstrate the use of transformations in modeling body mass index (BMI) measurements for Ohio adults grouped by age and gender. The data are from the 2008 Ohio Family Health Survey (OFHS), and are publicly available online

(at http://grc.osu.edu/omas/datadownloads/ofhsoehspublicdatasets/). BMI is cal- culated as an individual’s weight in kilograms divided by the square of height in meters. Due to the ease of its calculation, BMI is often used as a surrogate for more difficult measures of obesity, such as body fat percentage. Inference for the distribu- tion of BMI, and for the extreme quantiles in particular, is of significant public-health interest. Individuals with extreme BMI have greater risk for a variety of health prob- lems, including heart disease, stroke, and type-2 diabetes.

Figure 4.7 displays sample quantiles of the observed BMIs for each gender and age group in the study. The distributions of the BMI measurements are strongly-skewed, with heavy right tails, but the strength of these features varies with age and gender.

We divide the survey respondents into subgroups by age and gender, with the aim of estimating BMI distributions for each age-by-gender group. For a given gender g and age group a, denote the set of BMI measurements by xga = {xgai : i ∈ Iga},

74 Figure 4.7: Quantiles of the empirical distribution of BMI, separated by gender and age.

where Iga = {1, . . . , nga} are the indices for the respondents in the age-and-gender subgroup. The sizes of these subgroups are shown in Table 4.1.

To assess the effectiveness of the transformation DPM method for estimating the

BMI densities, we draw subsamples of size 200 from each age-and-gender group, and measure the out-of-sample predictive likelihood. For gender g and age a, we partition

train the indices Iga = {1, . . . , nga} at random into training cases Iga and holdout cases

hold train train train Iga , with Iga = 200. Denote the training set of BMIs by xga = {xgai : i ∈ Iga }

hold hold and the holdout set by xga = {xgai : i ∈ Iga }. With the 200 training observations

train xga , we form direct-DPM and transformation-DPM point estimates of the BMI density. As a measure of the quality of these density estimates, we consider the

75 Figure 4.8: Comparison of the average log predictive likelihood of the holdout cases (4.18) based on the DPM and TDPM fits.

average log predictive likelihood for the holdout cases,

1 X L = logfˆ (x |xtrain). (4.18) ga hold ga gai ga Iga hold i∈Iga

Figure 4.8 gives a comparison of Lga for the DPM and TDPM fits.

Some improvement can be seen for the middle-aged subgroups, which also tend to have the heaviest-tailed distributions of BMI (see Figure 4.7). The youngest and oldest groups have relatively more symmetric and less heavy-tailed distributions. In such cases, there is less to be gained by transforming these samples prior to estimation, which was clear in the simulations of Section 4.3.

76 Age group 18-24 25-34 35-44 45-54 55-64 65+ Total OFHS sample size 1194 3226 4656 6268 6299 9191 30,834 Female Training sample size 200 200 200 200 200 200 1200 Holdout sample size 994 3026 4456 6068 6099 8991 29,634 OFHS sample size 895 1838 2914 3892 3852 4660 18,051 Male Training sample size 200 200 200 200 200 200 1200 Holdout sample size 695 1638 2714 3692 3652 4460 16,851

Table 4.1: Ohio Family Health Survey (2008) sample sizes, divided into training and holdout samples.

4.5 Discussion

In modern work on nonparametric Bayesian density estimation, many excellent and sophisticated models have been proposed, often by modifying the structure of the DP mixture (4.1). Here we follow a different path to improved performance, by deconstructing a difficult problem — estimation of an unknown density with extreme features — into two easier problems. First we choose a sequence of transformations to symmetrize and shorten the tails of the distribution, and second, we use basic DPM models to estimate the resulting “well-behaved” density. The evidence presented here suggests that devoting some attention to choosing a good transformation of the sample can yield substantial gains in performance for density estimation.

These two subproblems of the density estimation problem differ intrinsically in dif-

ficulty. The transformation part of the problem is low-dimensional and parametric, and so we expect estimation of the transformation parameters to follow conventional root-n asymptotics. The density estimation part of the problem is infinite dimen- sional, and so we expect both poorer large sample behavior and a slower asymptotic

77 rate. Density estimators based on the DPM model are consistent under very mild conditions, and so we have little interest in finding the “optimal” transformation.

Rather, we seek to find a decent one, and we then let a standard DPM model do its work. The difference in asymptotic rates for the two parts of the problem ensures that, for large samples, the transformation has little variation in comparison to the density estimate, motivating our choice to fix a single transformation rather than averaging over transformations. This strategy applies to a wide variety of statisti- cal problems where different portions of the problem exhibit different rates. Among them, are problems where portions of a model differ greatly in dimension (as here) and problems where portions of the model are informed by different amounts of data, as in multiscale, local and treed regression models.

The advantages of the strategy we have pursued are twofold. First, conditioning on a single estimated transformation allows us to focus our computational resources on density estimation given the transformation. This task is simpler than averaging over transformations, and it allows us to rely on standard methods and code for fitting the model. Second, the single transformation approach provides a simpler conceptual framework, providing the user with a model which is easier to grasp.

There are many variations on the method we have presented. The families of transformations we have used could be replaced with other families. The transforma- tions could be driven by likelihood rather than Lˆ(θ). The rule for when to stop the iterative process of transformation selection need not be driven by the perspective of hypothesis testing (rule (1) of section 2.3). Information criteria could be used to se- lect transformation and density estimate, or a fully Bayesian approach could be used,

78 averaging over transformations. These last two strategies are relatively expensive in terms of computation.

Nonparametric Bayesian density estimation is often used in multivariate settings.

A natural question is how to extend this method to multivariate transformations to alleviate excess skewness and kurtosis. One promising route is to preprocess the data by first conducting a principal components analysis of the data. The transformation approach described herein could then be applied to each margin. Full development of such a method awaits further work.

The TDPM model can be represented in alternative forms. While we describe the model in terms of a transformation, followed by a DPM model, and finally completed with a back-transformation to the original scale, a mathematically equivalent version describes the model as a DPM with a non-standard base measure and kernel. Both presentations of the model have value; we believe the presentation here highlights our overall modelling strategy.

One of the great advantages of DP mixture models such as (4.1) and (4.17) is their

flexibility, allowing them to capture unusual features in the target density. Only two such features, skew and heavy-tails, are investigated here with the transformation method. The usefulness of the transformation strategy has not been established for other situations, such as estimation of many-modal distributions. The basic idea, however, remains powerful. The area in which TDPM estimates show the most im- provement over basic DPM is in the tails of the estimates. We would expect this advantage to persist, even if strange features are present in the body of the distribu- tion.

79 Chapter 5: Heavy-Tailed Density Estimation Using Transformations and DP Mixtures

5.1 Motivation

Previous chapters have illustrated that for skewed and heavy-tailed densities, an effective approach to Bayesian density estimation begins with a parametric trans- formation of the sample. In Chapter 4, the transformations were estimated using a frequentist criterion related to kernel density estimates. They were chosen to sym- metrizes the distribution and shorten its tails, so that superior density estimates could be obtained on the transformed scale. After back-transformation, this transformation- density estimation technique yields improved inference on the original scale as well.

While Chapter 4 demonstrated the usefulness of the transformation strategy of

Chapter 2 in the Bayesian context, we now investigate whether a different set of trans- formations, or a different method for estimating them, would give better performance in combination with the DP mixture.

Specifically, in this chapter, we investigate the impact of parametric transforma- tions on asymptotic properties of DPM posteriors. Recall, from Section 1.2.2 and

Section 3.3.3, that the theory establishing posterior convergence rates of the DP mix- ture rely on strong conditions on the lightness of the distribution tail. Theorem 3,

80 due to Ghosal and Van der Vaart (2007), stated that when a normal base measure is used in the DP mixture, a convergence rate is known only when the true distribution has equally light tails,

 c −ca2 F0 [−a, a] ≤ Me . (5.1)

This suggests that transformations which yield sub-Gaussian tails in the sense of

(3.22) on the transformed scale may lead to improved density estimates.

This Chapter investigates methods for estimating the transformation ϕθ so that the transformed tail probabilities

 c  −1 −1 c G0,θ [−a, a] = F0 −ϕθ (−a), ϕθ (a) can be made to decay at the exponential rate (5.1). The transformations described in Chapter 4 are not designed to ensure this.

Neither the curvature criterion L(θ) suggested in Chapter 4 and Section 2.2.1 for choosing a transformation nor the family of iterated Yeo-Johnson and t-cdf transfor- mations are especially well-suited to ensuring that (5.1) is satisfied.

The functional L(θ) suggested in Chapter 4 and Section 2.2.1 for defining the optimal transformation is not directly tied to the tails of either the transformed density or the original f0. Instead, that functional is derived from the asymptotic

L2 risk of a kernel density estimate for the transformed density after application of ϕθ : X 7→ Yθ. The L2 loss is insensitive to differences in tail-decay rates (see

Devroye and Gyorfi (1985), Chapter 1). There is no guarantee that the L(θ) optimal transformation θ0 of (2.16) will yield a transformed density with light tails.

Further, the iterative method proposed in Chapter 4 has stochastic behavior which make its asymptotic properties difficult to analyze. The Yeo-Johnson transformation

81 (4.4) used to correct skew also has an imbalanced effect on the tails of the transformed

distribution.6

In light of (5.1), we develop transformations which more directly aim to produce

exponentially decaying tails. To facilitate this, we employ cdf transformations of the

form

−1  Yi,θ = Φ Fθ(Xi) ,

where Fθ is an appropriate family of parametric distribution functions: here, we

propose the skew-t family of Azzalini and Capitanio (2003).

The chapter is organized as follows. In Section 5.2, we review asymptotic prop-

erties of a basic DP location mixture of normals, and highlight the impact of heavy

tails on performance. Section 5.3 concerns the transformations: we describe, for gen-

eral heavy-tailed families {Fθ, θ ∈ Θ}, the subset of the parameter space A(F0) ⊂ Θ

that give a sub-Gaussian tailed distribution for Yi,θ. Later in Section 5.3, we detail estimators of the parameters of a skew-t cdf Fθ that converge in the interior of this subset A(F0). In Section 5.4, we test the new transformation-density estimators in simulations and real data analysis. Section 5.5 concludes with a discussion.

5.2 DP Mixture Asymptotics and Distribution Tails

Consider a sample X1,...,Xn which is i.i.d. from a true distribution F0 (with density f0). The true density f0 being unknown, we model it as a random probability density f. For nonparametric estimation, one might employ a simple DP location

6 For example, consider applying the transformation (4.4) with θ1 > 0 to a normal distribution. The resulting distribution has a polynomially decaying left tail.

82 mixture of normals,

σ ∼ πn(σ) 2 G0(µ) = N ( µ | µ0, σ0 ) G ∼ DP(M,G0) (5.2) R f(x) | G, σ = φσ(x − µ) dG(µ) iid X1,...,Xn f ∼ f(x).

2 −1/2 −u2/2σ2 Here, φσ(u) = (2πσ ) e . The distribution G is, with probability one, a countable mixture of atomic measures. Its atoms are independent draws from G0, and the weights on these support points are distributed according to a stick-breaking scheme (Sethuraman; 1994). Hence (5.2) represents the density f as a countable location mixture of normal kernels. All kernels share a common scale σ, which is

7 given its own prior πn. For the remainder of this paper, we will set πn ≡ π for all n to be static inverse gamma distributions for σ2, with fixed hyperparameters. For the theoretical development, however, a sequence of priors πn with shrinking scales as n → ∞ is sometimes required.

Much of the established theory regarding asymptotic properties of (5.2) applies to light-tailed distributions with sub-Gaussian tail decay. Section 5.2.1 describes asymptotic properties in the sub-Gaussian case. Section 5.2.2 gives a mathematical description of a distribution whose tails decay at a slower polynomial rate, for which the results in 5.2.1 would not apply; we also review some approaches to nonparametric

Bayesian estimation in the heavy-tailed case.

7Many authors have proposed alternatives to the basic location mixture (5.2) for density esti- mation. The conjugate-style location-scale DP mixture of normals discussed by Escobar and West (1995) is another computationally simple choice. More recently, Griffin (2010) proposed a DP mix- ture of normals whose parameters are more easily relatable to marginal properties of the density f. Priors such as Griffin’s outperform (5.2) in simulations (Bean, Xu and MacEachern; 2016). For the current work, however, we adopt the simple DP mixture; additional complexity is introduced via the transformations.

83 5.2.1 Convergence Rates for Sub-Gaussian Tails

For example, Ghosal and Van der Vaart (2007) study the asymptotic behavior of

the posterior of the DP mixture (5.2), assuming the true distribution F0 of the Xi

satisfies

 −cx2 F0 |X| > x ≤ e (5.3)

for a postive constant c. If σ is given a sequence of priors πn of the form

−1 πn(σ) = σn g(σ/σn),

−1/5 where σn = s · n for positive s, and g is a compactly supported density on

(0, ∞), then the posterior convergence rate of (5.2), with respect to the Hellinger

R k p p 2 semi-distance hk(f, f0) = −k f(x) − f0(x) dx is given by Theorem 2 of Ghosal and Van der Vaart (2007). The theorem guarantees that for the sequence

−2/5 2 n = n log n , (5.4)

and some constant M > 0, we have

h i

Πn {f : hk(f, f0) > Mn} X1,...,Xn → 0

iid in probability under Xi ∼ F0. The posterior distribution referenced here assigns prob-   R Qn . R Qn abilities Πn A X1,...,Xn = A i=1 f(Xi)dΠ(f) F i=1 f(Xi) dΠ(f) to measur-

able subsets A of the space of continuous densities F on R. The expression (5.4) defines the posterior convergence rate of (5.2).

Though such a prior construction is unusual in practice (again, typically, πn is an

inverse gamma prior for σ2 which the analyst would choose independently of n), the

importance of the tail behavior of F0 on performance of the DP mixture is clear from

84 (5.3). Similar tail conditions appear in other results concerning posterior convergence

rates of DP mixtures (Ghosal and Van der Vaart; 2001; Wu and Ghosal; 2010b). No

posterior convergence rate has been established for F0 whose tail probabilities do not

decay exponentially.

5.2.2 Characterization of Heavy-Tailed Distributions

Here, we entertain the possibility that the tails of F0 may decay at a polynomial rate, slower than (5.3). A mathematical description of such a distribution is based on regularly varying functions; see Resnick (2007) for an overview. For a heavy right tailed distribution, the complementary cdf F 0 = 1 − F0 is regularly varying at ∞.

That is,

F 0(tx) lim = x−ν0 (5.5) t→∞ F 0(t)

for any x > 0. The constant ν0 > 0 is called the right-tail index of F0, while

−ν0 is called the index of regular variation. The statement (5.5) will be denoted

F 0(x) ∈ RV−ν0 . Left-tail heaviness is defined similarly; a cdf F0 has a heavy left tail if G(x) = F0(−x) ∈ RVν for some ν. The left- and right-tail indices need not be equal.

A more restrictive class of functions F 0 are said to exhibit second-order regular

variation if the convergence in (5.5) can be specialized to

F 0(tx) −ν0 − x −ρ0 F (t) 1 − x lim 0 = x−ν0 , (5.6) t→∞ a(t) ρ0

where a(t) is of constant sign, and ρ0 ≥ 0 is the second order parameter.

Although the posterior distribution of common DP mixture models are consistent for estimating the densities f0 of heavy-tailed distributions (Ghosal et al.; 1999), the

85 rate of posterior convergence is not known. Other authors have noted this short-

coming. Li, Lin and Dunson (2015) in particular note that location-and-scale DP

mixtures of normal kernels induce a degenerate prior on the tail index. As a result,

the posteriors are not consistent for the tail index. Li et al. (2015) and Tressou

(2008) have each proposed to address this problem by employing heavy-tailed kernel

mixtures. Our approach — to address tail heaviness, along with possible skew, by

estimating a single transformation to symmetrize the distribution and lighten its tails

— is an alternative to theirs.

5.3 A Family of Transformations for Heavy-Tailed Densities

In this section, we propose a parametric family of transformations which can map

general F0 satisfying (5.5) to light-tailed transformed distributions satisfying (5.3). In

Section 5.3.1, we give a general class of cdf-inverse-cdf transformations, and discuss

conditions under which these transformations lead to sub-Gaussian tails. In Section

5.3.3, a specific class of cdf-inverse-cdfs is proposed, based on the parametric family

of Skew-t distributions (Azzalini and Capitanio; 2003).

5.3.1 Transformations to sub-Gaussian tails

A simple technique for achieving this is to take a parametric family of heavy-tailed

cdfs Fθ and apply Gaussian copula transformations with Fθ marginals,

−1  Yθ = ϕθ(X) = Φ Fθ(X) , (5.7)

where Φ is the standard normal cdf. Of course, if X ∼ Fθ, then Yθ ∼ Φ; this is the so-called probability integral transformation. In general, if X ∼ F0, then the cdf G0,θ

86 of Yθ is

 −1  G0,θ(y) = Pr Φ Fθ(X) ≤ y

 −1  = Pr X ≤ Fθ Φ(y)

−1  = F0 Fθ (Φ(y)) .

Similarly, the complementary cdf is

−1  G0,θ(y) = F 0 Fθ (Φ(y)) . (5.8)

We hope to choose the family {Fθ, θ ∈ Θ} so that a nonempty set of transformations

lead to sub-Gaussian tail probabilities:

−cy2 Pr{|Yθ| ≥ y} = G0,θ(−y) + G0,θ(y) ≤ Me . (5.9)

Transformation to (5.9) is possible as long as θ indexes heavy-tailed distributions Fθ.

Let A(F0) denote the set of transformations which lead to sub-Gaussian tails (5.9),

−cy2 A(F0) = {θ ∈ Θ : Pr{|ϕθ(X)| ≥ y} ≤ Me for X ∼ F0}. (5.10)

Theorem 4. Assume F0 is a heavy-tailed cdf such that for x > 0,

−ν+ F 0(x) = x 0 L+(x)

and

−ν− F0(−x) = x 0 L−(x), and that the slowly varying functions L+ and L− are eventually bounded,   max lim sup L−(x), lim sup L+(x) < ∞ x→∞ x→∞

− + Denote ν0 = min{ν0 , ν0 }. Suppose Fθ is a family of continuous cdfs with 0 < Fθ(x) <

 k 1, with a parameter vector θ ∈ Θ = (ν, η): ν > 0, η ∈ R that includes a parameter

87 ν > 0 controlling both the left- and right-tail indices of Fθ. That is, 1 − Fθ(x) ∈ RV−ν and Fθ(−x) ∈ RV−ν. Then the distribution of

−1  Yθ = ϕθ(X) = Φ Fθ(X) ,

where X ∼ F0, satisfies the sub-Gaussian tail condition (5.9) provided ν ≤ ν0. That is,

A(F0) = {θ = (ν, η): ν ≤ ν0}.

Proof. No generality is lost by assuming right tail of F0 is heavier than the left, so

F 0 ∈ RV−ν0 . It suffices to characterize the set of θ for which

−1  −cy2 G0,θ(y) = F 0 Fθ (Φ(y)) ≤ Me . (5.11)

That is, the right-tail probabilities decay at a sub-Gaussian rate.

Consider the function Uθ = 1/(1−Fθ), which is strictly increasing with Uθ(x) → ∞

as x → ∞. Since 1 − Fθ ∈ RV−ν, well-known closure properties of regularly varying

functions imply that Uθ ∈ RVν. Additionally, its inverse function, which can be

written  1 U −1(t) = F −1 1 − θ θ t

−1 is regularly varying, Uθ ∈ RV1/ν. By another closure property, we have that the composition   1 h (t) = F ◦ U −1(t) = F F −1 1 − θ 0 θ 0 θ t

−1 of F 0 ∈ RV−ν0 and Uθ ∈ RV1/ν is regularly varying with index −ν0/ν. The tail probabilities (5.8) depend closely on this composition. The tail probabilities are

 1  G0,θ(y) = hθ 1−Φ(y) , so by regular variation of hθ, we have

 1 −ν0/ν  1   1  G (y) = L = Φ(y)ν0/ν L . 0,θ 1 − Φ(y) 1 − Φ(y) 1 − Φ(y)

88 In order for the tails to be sub-Gaussian (5.11), a finite upper-bound M > 0 must be available for the ratio

 ν0/ν   G0,θ(y) √ Φ(y) 1 = 2π · y−ν0/ν · φ(y)(ν−ν0)/ν · L . e−y2/2 φ(y)/y 1 − Φ(y)

Over intervals y ≤ C for any C, the expression are clearly bounded, since on the left side, the denominator is bounded away from zero and the numerator G0,θ(y) ∈ (0, 1).

We show the ratio can be bounded as y → ∞. For the first factor, Mills’ ratio gives

Φ(y)φ(y)/y → 1 as y → ∞. Hence

    G0,θ(y) √ ν − ν0   1  lim sup = 2π lim sup y−ν0/ν · exp y2 · L . (5.12) −y2/2 y→∞ e y→∞ 2ν 1 − Φ(y)

−1  1  From the condition lim supy→∞ y L 1−Φ(y) < ∞, it follows that for any ν ≤ ν0,

G (y) lim sup 0,θ < ∞. −y2/2 y→∞ e

Hence there exists a large C such that the ratio is bounded for y > C.

5.3.2 Skew-t cdf-inverse-cdf transformations

With the aim of improving performance of DP mixtures for estimating skewed- and heavy-tailed distributions, we propose to use the family of cdf transformations

−1  Yθ,i = ϕθ(Xi) = Φ Fθ(Xi) , (5.13)

where the family Fθ is a skewed extension of the t distribution.

We employ the Skew-t distributions of Azzalini and Capitanio (2003), which in- clude the Student t distributions as a special case, but include an additional slant pa- rameter to allow left- or right-skewed shapes. For a parameter vector θ = (ξ, ω, α, ν), comprised of center, scale, slant, and degrees of freedom parameters, respectively, the

89 skew-t density is given by

2  r ν + 1  f (x) = t (z) T αz , (5.14) θ ω ν ν+1 ν + z2

h √ i 2 −(ν+1)/2 where z = (x − ξ)/ω, tν(z) = Γ((ν + 1)/2) πνΓ(ν/2) 1 + z /ν is a R u standard Student t density with ν degrees of freedom, and Tν+1(u) = −∞ tν+1(z) dz is a tν+1 cdf. No closed form expression exists for Fθ(x) or the associated quantile

−1 function Fθ , as these involve intractable integral expressions. However, accurate numerical approximations to these functions are available.

To study tail properties of the skew-t distribution, note that the density is of the form

σfθ(µ + σz) = tν(z)gθ(z), where the skewing factor converges to a constant as z → ±∞, √ r ν + 1    Tν+1 α √ν + 1 as x → ∞ gθ(z) = 2Tν+1 αz 2 →  ν + z Tν+1 −α ν + 1 as x → −∞, for any θ. As a result, the tail behavior of the skew t is inherited from the standard t distribution with ν degrees of freedom, and one can show

F θ(x) ∈ RVν.

Chang and Genton (2007) give a proof, and discuss regular variation properties of the Skew-t and related distributions.

5.3.3 Estimating Skew-t Transformation Parameters

As a consequence of the regularly varying tail of Fθ, the transformation (5.13) leads to a sub-Gaussian tailed distribution for Yθ,i as long as the degrees of freedom parameter ν of Fθ is smaller than the left- and right-tail indices for the distribution

90 F0 of Xi. In the notation adopted previously, we have, for F0 satisfying the conditions of Theorem 4,

 A(F0) = θ = (ξ, ω, α, ν) ∈ Θ: ν ∈ (0, ν0], α ∈ (−∞, ∞) , (5.15)

− + where again, ν0 = min{ν0 , ν0 } denotes the smaller of the left- and right-tail indices of F0. ˜ ˜  In this section, we discuss estimators θn = ξn, ω˜n, α˜n, ν˜n of the skew-t parameter vector that converge to the interior of A(F0) with probability approaching 1 under

iid X1,...,Xn ∼ F0. That is,

˜  Pr θn ∈ A(F0) ≤ Pr ν˜n ≤ ν0 → 1, (5.16)

as n → ∞ for any ν0 > 0. We will accomplish this by first obtaining a consistentν ˆn of ν0, then modifying it to produceν ˜n satisfying (5.16). The following are two classes of estimators for ν0; Theorem 5 gives conditions under which they are consistent.

1. (Maximum skew-t likelihood) Likelihood estimation for the skew-t distribution

has been discussed by Azzalini and Capitanio (2003). The skew-t log-likelihood

function n X l(θ|X1,...,Xn) = log stθ(Xi) i=1 is continuously differentiable in θ = (ξ, ω, α, ν) up to all orders. Closed form-

expressions for the first-order partial derivatives are available (Azzalini and

Capitanio (2003), Appendix B). An MLE

ˆML ˆML ML ML ML θn = ξn , ωˆn , αˆn , νˆn = argmaxθ∈Θ l(θ|X1,...,Xn) (5.17)

can be found via a numerical search for solutions to the score equation 0 =

d dθ l(θ|X1,...,Xn). 91 2. (Tail index estimation) A more robust approach involves direct estimation of

the tail index ν0 based on extremal subsets of the order statistics X(1,n) ≤

X(2,n) ≤ · · · ≤ X(n,n). Many classical estimators target the inverse of the tail

index γ0 = 1/ν0, which, more generally, can be regarded as the parameter of

extreme value distribution associated with F0. If sequences an, bn can be found  such that X(n,n) − bn an converges in law, then necessarily

X − b  (n,n) n −1/γ0  lim Pr ≤ x = exp −(1 + γ0x) , (5.18) n→∞ an

for some γ0 ∈ R. When (5.18) holds with γ0 > 0, F0 is said to belong to the domain of attraction of the Fr`echet extreme value distributions. The class of

distributions satisfying (5.18) with γ0 > 0 corresponds identically to the class of

F0 whose complementary cdf is regularly varying (see(5.5)), and the tail index

is γ0 = 1/ν0. The well-known Hill estimator of γ0 > 0 is based on log excesses

for the k largest order statistics,

k−1 (1) 1 Xh i M = log X − log X . (5.19) n,k k (n−i,n) (n−k,n) i=0

For general γ ∈ R, Dekkers, Einmahl and Haan (1989) suggest the moment estimator (1)2 ! 1 M γˆ = M (1) + 1 − 1 − n,k , (5.20) n,k n,k 2 (2) Mn,k where k−1 2 (2) 1 Xh i M = log X − log X . n,k k (n−i,n) (n−k,n) i=0 There are many other estimators of the tail and extreme value index, including

adaptive estimators which achieve minimax rates. For simplicity, we choose

(5.20), an estimator which retains consistency for light-tailed γ0 ≤ 0, and has

92 smaller AMSE, under many settings, than other traditional estimators such as

(5.19) De Haan and Peng (1998).

Asymptotic properties of the estimators (5.17) and (5.20) are described in the next theorem.

ˆML Theorem 5. Consistency and asymptotic normality of the skew-t MLE θn and the

moment estimator γˆn,k are considered in for three increasingly restrictive classes for

the data-generating F0.

1. If 1 − F0 ∈ RV−ν0 , and if k(n) is a sequence of order statistics that satisfies

k(n) → ∞ and k(n)/n → 0 as n → ∞, then

γˆn,k(n) → γ0 = 1/ν0 (5.21)

in probability as n → ∞ for any γ0 ∈ R.

2. If 1 − F0 ∈ 2RV−ν0,2ρ0 , the second-order regular variation class (5.6), and if

2ρ0  k(n) → ∞ but with k(n) = o n 1 − ρ0 , then

p   k(n) γˆn,k(n) − γ0 → N 0, 1 + γ0 (5.22)

in distribution whenever γ0 > 0. Asymptotic normality still holds when γ0 ≤ 0,

but with different asymptotic variance; see Dekkers et al. (1989); De Haan and

Peng (1998).

 3. If F0 = St ξ0, ω0, α0, ν0 for some θ0 = (ξ0, ω0, α0, ν0) on the interior of the ˆML skew-t parameter space, then the MLE θn is consistent and asymptotically normal, √ ˆML   n θn − θ0 → N 0, Iθ0 , (5.23)

R ∞ h d2 i where Iθ = − −∞ dθ2 fθ(x) fθ(x) dx is the Fisher information.

93 Remark. If F0 does not necessarily belong to the skew-t parametric class, the MLE ˆML θn still converges in probability and is asymptotically normal, under regularity con- ditions described by Huber (1967) and Van der Vaart (1998). Specifically, conditions can be set to ensure

√ ML nθˆ − θ  → N0, I−1, n F0 F0 R where θF0 is the value which maximizes log(fθ(x)) f0(x) dx, and

Z h d2 i I = f (x) f (x) dx. F0 dθ2 θ 0

The value θF0 indexes the skew-t distribution nearest to F0 in Kullback-Leibler di-

vergence. However, in general, there is no guarantee that the degrees of freedom

− + component νF0 corresponds to the tail indices ν0 or ν0 of F0. The maximum skew-t likelihood estimator of degrees of freedom not, in general, a consistent estimator of

the tail index.

As one would expect, the nonparametric estimatorγ ˆn,k has good theoretical

asymptotic performance in broad classes of heavy tailed F0, but converges at slower p √ rates, k(n), than the parametric n. Under general F0, it is difficult to characterize

the value νF0 to which the MLE for degrees of freedom converges. While the MLE may perform reasonably well in practice for distributions somewhat near the skew-t, no broad guarantees can be made that transformations estimated using the likelihood will yield sub-Gaussian tails.

Modifying the tail-index estimator for transformation to sub-Gaussian tails

This section describes a simple strategy for modifying a consistent estimatorν ˆn of the tail index ν0 to produce a statisticν ˜n which satisfies (5.16); that is, it un- derestimates the tail index in the limit. Of course (5.16) is equivalent to finding an

94 estimatorγ ˜n of γ0 = 1/ν0 for which

 Pr γ˜n ≥ γ0 → 1, (5.24)

−1 in which caseν ˜n :=γ ˜n satisfies (5.16).

The search forγ ˜n satisfying (5.24) can be conceptualized in the context of a series of confidence sets for γ0. For each n ≥ 1, consider upper confidence bounds of the form (0, γˆn + δn], whereγ ˆn is a consistent estimator of γ0. Now choose the statistics

δn = δn(X1,...,Xn) > 0 so that the interval’s coverage rate 1 − αn approaches one, or equivalently  αn = Pr γˆn + δn ≤ γ0 → 0 (5.25) for any γ0 > 0 as n → ∞. The boundary of the confidence set,γ ˜n =γ ˆn + δn is an

‘estimator’ of γ0 which satisfies (5.16).

As a first example, take δn to be a series of constants with δn,1 → c > 0 as n → ∞.

Then (5.25) with δn = δn,1 follows from the consistency ofγ ˆn. Hence

γ˜n,1 =γ ˆn + δn (5.26)

satisfies (5.16);γ ˜n underestimates the tail index with probability tending to 1.

However, better choices are available for δn and the resulting modified estimator

2 γ˜. For example, if E γˆn < ∞, then Chebyshev’s inequality gives 2   Var(ˆγn) + E(ˆγn) − γ0 Pr γˆn + δn ≤ γ0 ≤ Pr |γˆn − γ0| > δn ≤ 2 (5.27) δn for any constants δn > 0. If the moments ofγ ˆn can be normalized in the sense

2 that for some sequence k(n) → ∞ as n → ∞, we have k(n)Var(ˆγn) → σ > 0 and p  k(n) E(ˆγn) − γ0 → 0, then choosing

σ 1 δn,2 = (5.28) pk(n) pp(n)

95 and substituting into (5.27) yields

k(n)Var(ˆγ ) + pk(n)E(ˆγ ) − γ 2 Prγˆ + δ > γ ≤ p(n) · n n 0 ∼ p(n). (5.29) n n 0 σ2

Taking p(n) → 0 but with k(n)p(n) → ∞, one finds thatγ ˆn + δn satisfies (5.25), with the probabilities vanishing at a rate at least as fast as p(n). However, since σ depends on F0, this can no longer be considered an estimator of γ0. Instead,

2 2 a consistent estimatorσ ˆn of the asymptotic variance σ = limn→∞ k(n)Var(ˆγn) ofγ ˆn is required. The resulting estimator

ˆ γ˜n,2 =γ ˆn + δn,2, (5.30)

ˆ σˆn 1 with δn,2 = √ · √ also satisfies (5.16). k(n) p(n)

Still better choices for δn are available if one assumes further thatγ ˆn is asymptot- ically normal, with

p  2 k(n) γˆn − γ0 → N 0, σ (5.31) in distribution as n → ∞. The probabilities in (5.25) can then be described asymp- totically by pk(n)  pk(n)  α ∼ Φ δ = 1 − Φ δ . (5.32) n σ n σ n

Take σ δn,3 = z(n), pk(n) with z(n) = Φ−1(1 − p(n)), a sequence of increasing standard normal quantiles with upper tail probabilities p(n). It follows that

 αn = Pr γˆn + δn > γ0 ∼ p(n)

96 and one can again control the rate p(n) of decay for the probabilities. With a consis-

ˆ σˆ tent estimatorσ ˆ of σ, and δn,3 = √ z(n), we obtain an estimator k(n)

ˆ γ˜n,3 =γ ˆn + δn,3 (5.33) which satisfies (5.16).

To compare the estimatorsγ ˜n,1,γ ˜n,2, andγ ˜n,3, note that for any σ,

δ n,3 = pp(n) · Φ−1(1 − p(n)) → 0, δn,2 as p(n) → 0. In fact, δn,3 ≤ δn,2 for any 0 < p(n) < 1. Henceγ ˜n,3 involves the least drastic adjustment to the consistent estimatorγ ˆn. The bias ofγ ˜n,3 is of a smaller order as n → ∞,

 −1/2 Bias γ˜n,2 = O (k(n)p(n)) ,

 −1/2 −1  Bias γ˜n,2 = O k(n) Φ (1 − p(n)) , where, again, the latter decays much faster. p Assuming the conditions of Theorem 2 (2) are satisfied, we have that k(n) γˆn,k(n)−   γ0 → N 0, 1 + γ0 whenever γ0 > 0. Using the adjustment described asγ ˜n,3, we can construct the modified estimator

ˆ γ˜n,k(n) =γ ˆn,k(n) + δn,k(n), (5.34) where p1 +γ ˆ ˆ n,k(n) −1  δn,k(n) = Φ 1 − p(n) , (5.35) pk(n) and p(n) = Cn−r. Assuming k(n) = np for some p ∈ (0, 1), then for such a sequence, one can show Φ−11 − p(n)  pk(n) → 0. Applications of Slutsky’s Theorem give

p1 +γ ˆ ˆ n,k(n) −1  P δn,k(n) = Φ 1 − p(n) → 0. pk(n)

97 and

P γ˜n,k(n) → γ0

for any γ0 > 0. Thusγ ˜n,k(n) is, likeγ ˆn,k(n), a consistent estimator of the heavy-tailed

EV index γ0, while retaining the property that

 −1  Pr γ˜n,k(n) ∈ A(F0) = Pr γ˜n,k(n) ≤ γ0 ∼ 1 − p(n) → 1, and the probability approaches 1 at a polynomial rate.

5.4 Simulations and Data Analysis

In this section, we assess the quality of point estimates of the density after trans- formation selected according to the following rules. We will occasionally refer to these transformations by the abbreviations listed here.

1. (NONE) No transformation, or the identity transformation ϕθ(x) = 1.

2. (BXM) The transformations of Bean, Xu and MacEachern (2016): iterated

Yeo-Johnson (Yeo and Johnson; 2000) and symmetric t-cdf-inverse-cdf trans-

formations, selected according to the scheme described in Chapter 4.

3. Skew-t cdf transformations, with the parameters estimated using:

(a) (ST ML) Maximum likelihood for the entire parameter vector θ.

(b) (ST EVI) The extreme value index estimator (5.20) without adjustment

to estimate 1/ν, followed by conditional maximum likelihood for center,

scale, and skewness.

98 (c) (ST EVI adj) The moment estimator (5.20) with adjustment (5.33) to

estimate 1/ν, followed by conditional maximum likelihood for center, scale,

and skewness.

We compare these estimates using a selection of real datasets in Section 5.4.1, then in a simulation study in Section 5.4.2. The model employed for density esti- mation on the transformed scale is slightly different from the normal mixture (5.2) described in the theoretical development. The prior πn(σ) for the shared kernel scale parameters assigns an inverse gamma distribution to σ2. This conditionally conjugate specification facilitates simple MCMC computations. Assessing performance of the models requires many runs of the MCMC, so we shorten the individual chains to 5500 iterations with a 500-iteration burn-in period (perhaps shorter than would be advis- able if very precise inference were required on functionals of the distribution), and implement the collapsed Gibbs sampler described in Section 3.2, which gives faster mixing for the cluster assignments.

5.4.1 Data Analysis

Bean et al. (2016) showed that in applications with skewed and heavy tailed densi- ties, the transformation approach yields improvements in terms of density estimation error and out-of-sample predictive power. In this section, we continue to study per- formance by considering a few real-data examples to assess the quality of predictive densities produced by the DP mixture (5.2) with and without transformation. Gneit- ing and Raftery (2007) discuss a variety of predictive scores; we follow Griffin (2010) in using the random-fold cross validation predictive score

K m 1 X X  S = log f xγ x−γ , (5.36) Km ki k k=1 i=1 99 where f(·|xI ) denotes the posterior predictive density given the observations xI , and the set γk comprises the first m elements of a random permutation of {1, . . . , n}, where n is the full sample size. Larger values of S indicate superior predictive densities.

Computation of (5.36) requires K runs of the MCMC algorithm described in

Chapter 3, which is quite time-consuming if the training datasets x−γk are large. For some examples with larger sample sizes, we take m to be large, and choose to reserve

most of the full data set as testing observations in xγk .

Unemployment Rates of US Counties

Figure 5.1: Unemployment rates by US county, continental 48 states, BLS, May 2017.

The United States Bureau of Labor Statistics (BLS) maintains detailed employ- ment statistics at the state, county, and metropolitan levels. A convenient tool for accessing this data in the R environment is the package blscrapeR (Eberwein; 2017).

We download county-level data on the labor force and unemployment from April

100 2016 to May 2017.8 The full sample consists of unemployment rates for these 3219 counties.

Figure 5.1 displays the unemployment rates for these counties on a map of the con- tinental 48 states. There is no obvious geographical pattern, but Figure 5.2 suggests that the counties with the highest unemployment rates tend to have medium-sized in terms of the labor force.

● ● ●

● ● 20 ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● 15 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 10 ● ●● ● ●●● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ● ●● ● ● ● ●●● ● ● ● ●● ● ●● ●●●●●● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ● ●● ●●●● ● ● ● ●● ● ● ● ●● ●● ● ●● ●●●●●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●●●● ● ● ●●●●●● ● ● ●● ● ● ● ● ● ● ● ● ●●● ●●●●●●● ●●●●●●●●●●● ●●●● ●●●●● ●●● ●●●●● ● ● ● ●●●●●●●●●●●●● ●●●● ●● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●●● ●●●●● ●●● ●● ●●●●●●● ● ●● ●● ●● ●●●●● ● ● ●●● ● ● ● ● ● ● ● ● ●● ●●● ●●● ●●●●●●● ●●●●●●●●● ●●● ●●●●●●●●●● ●●●●●●● ●●●● ●● ● ●● ● ● ● ● ● ●●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ● ●●● ● ●●● ● ● ● ● ● ● ● ● ● ●● ●●● ●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●●● ● ● ●●●● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●● ●●●●●● ● ●●●●●●●●●●●● ● ●● ● ●●● ● ● ● ●● ● ● ● ● ●● ●● ●●●●●●●●●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●● ● ●● ● ● ●● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●● ● ● ● ● ●● ● ● ● ● ● ●● ●●●●● ● ●● ●●● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●● ●●●●●●● ●●●●●● ●●●●● ● ●● ● ● ● ● ● ● 5 ● ● ● ● ●●●●●● ●● ●●● ●●●● ●●●●●●●● ●●●●●●●●●●●● ●●● ● ● ●● ●●● ● ●●● ●● ●● ● ● ● ● ●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ● ● ●● ●●●● ● ● ● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ●●●●●●●●●●●● ●● ●● ●●● ●●●●● ●●●●●● ● ● ● ● ● ● ●●●● ● ● ●●●●●● ●● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●● ●●●●●●●● ●●●● ●● ● ● ●●●● ● ●● ●● ● ● ● ● ● ● ●● ●● ● ● ●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ● ● ● ● ● ●● ● ●● ●● ● ●● ● ● ●● ●●●● ●●● ●●●●●●●● ●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●● ● ●●●● ●●●●●●●●● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ●● ● ●●● ●●● ●●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●● ●●●● ●●●●●●●●●●●●● ● ●●●●●●●● ●●●●● ●●●●●● ● ● ● ● ● ●● ●● ●●●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●● ●●●●● ●●●●●●●●●●●●●●●● ● ●● ● ●●●● ● ● ●● ● ● ● ● ● ● ●●● ● ●●●● ●●●●●● ●●●●●● ●● ● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●● ●●●●● ●●●●●● ●●● ●● ●● ●●● ● ● ● ●● ● ● ● ● ●●● ●●●●●●●● ●● ●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●●●●●● ●● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●● ●●●●● ●● ●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●● ●● ●●●●●●● ●●● ●●●●●●● ● ● ● ● ● ●● ●●● ●●●● ●● ●●●●●●●●● ● ●●● ●●●●● ●●●●●●●●● ●● ●● ●●●●● ●● ●●●●●●●●●●●●●●●●●●●● ●●●● ●●●●●● ●●● ● ●●● ● ● ●● ● ●● ●●● ●● ●● ●●●●● ●● ●●● ● ●●●●●●●● ●●● ●●●●●● ● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●●●●● ● ●●● ●●●● ● ●●●●●●● ● ● ● ●● ●●● ● ● ● ● ●●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● County Unemployment Rate County Unemployment

6 8 10 12 14

Log of County Labor Force Size

Figure 5.2: Unemployment rates plotted against the size of the labor force for 3219 counties.

More strikingly Figure 5.3 shows that, despite restricted range of the unemployment- rate variable, the distribution of has a heavy right tail. Some well-known static trans- formations from [0, 1] 7→ R, such as the logit or probit transformations, would correct the range of the variable to be consistent with the normal mixture (5.2) and would also change the shape of the density. Here, we consider transformations estimated according to the schemes listed at the beginning of this section.

8https://www.bls.gov/web/metro/laucntycur14.txt

101 NONE BXM 2016 ST ML ST EVI ST EVI adj Predictive Score S -1.9266808 -1.8840721 -1.8783091 -1.8917496 -1.8893939 Std. Err. 0.0072078 0.0022487 0.0019394 0.0098569 0.0081237

Table 5.1: Unemployment data: Log predictive scores (5.36) for transformation / DP mixture (5.2) predictive densities.

Since the sample size is quite large (3219), we take m = |γk| = 3019 in (5.36),

and fit the model to randomly selected training sets of size |γ−k| = 200, to save

computation time. Figure 5.3 compares the estimated densities for the first subsample

x−γ1 based on five estimated transformations. The rightmost panel of Figure 5.3 shows

the estimated posterior distributions for the number of occupied components of the

countable mixture (5.2). As noted in Chapter 4, the transformed density estimates

have smoother tail decay, with fewer pronounced bumps in the tail.

Table 5.1 compares the log predictive scores (5.36) for K = 50 rounds of subsample

fits. Below is a , proportional to the standard deviation of the K joint √ P log predictive probabilities log p(xγki |xγk ) times 1/ K. All transformations lead to improvements over the direct DP mixture estimate.

For this density, whose shape is unimodal and right skewed, the Skew-t cdf trans- formations (5.13) appear to give slightly better performance than the iterated Yeo-

Johnson/t-cdf transformations of Bean, Xu and MacEachern (2016) (BXM). This is consistent with an understanding that densities closer to Skew-t might see the strongest benefit from transformation (5.13) via the Skew-t cdf. However, there is no clear difference in performance between the estimators of the Skew-t parameters; this choice appears to make little difference in terms of the quality of predictions.

102 NONE 0.3

20

0.2 15

10 Density 0.1 Transformed 5

0.0 0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ● 0 5 10 15 20 0 5 10 15 20 unemployment unemployment BXM 2016 0.3 20

10 0.2

0 Density 0.1 Transformed

−10

0.0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●

0 5 10 15 20 0 5 10 15 20 unemployment unemployment ST ML 0.3

2 0.2

0 Density 0.1 Transformed

−2

0.0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●

0 5 10 15 20 0 5 10 15 20 unemployment unemployment ST EVI

4 0.2

0

Density 0.1 Transformed −4

0.0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●

0 5 10 15 20 0 5 10 15 20 unemployment unemployment ST EVI adj 0.3

4 0.2

0

Density 0.1 Transformed −4

0.0 ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●

0 5 10 15 20 0 5 10 15 20 unemployment unemployment

Figure 5.3: From left to right: estimated transformation, DP mixture predictive density, and posterior distribution for number of occupied mixture components for the unemployment data. Five transformation methods.

103 Acidity

This example, first studied in Richardson and Green (1997) and also analyzed by

Griffin (2010), concerns acidity index measurements for sample of 155 lakes in north- central Wisconsin.9 According to Richardson and Green (1997), these measurements have already been log transformed. This dataset has an additional interesting feature of a second mode for acidities around 6-7. Neither tail of the sample seems to indicate power-law behavior. Plots of the estimated densities (using the full sample) can be found in Figure 5.4.

There are some striking differences in the tail behavior of the estimated densities in the figures. The parametric transformations, designed to capture and correct skewness, identify (rather than multiple symmetric modes) an overall right-skewed pattern in the sample. The estimated transformations have concave shapes, which have the effect of “pulling in” the observations in the right tail, while “stretching out” the left tail. Conditional on such a transformation, the model for the density, which has the form

 0 f(x) = g ϕθ(x) · ϕθ(x) ∞ X ϕ0 (x)   = w θ φ ϕ (x) − µ σ , k σ θ k k=1 effectively has a smaller local kernel bandwidth at values of x where the derivative

0 ϕθ is large. As a result, we see a smaller scale in the descent of the densities to zero in the left tail, but a larger scale at the right end.

9The data can be found in text format from the website of the second author of Richardson and Green (1997) at https://people.maths.bris.ac.uk/ mapjg/mixdata.

[Figure 5.4 appears here. Panels (top to bottom): Identity Transformation; Iterated YJ/t-cdf Transformations (Ch. 4); Skew-T Maximum Likelihood; Skew-T EVI estimator; Skew-T EVI adjusted. Columns show the Acidity data, Transformed Acidity, Density, and the posterior probability of k occupied components.]

Figure 5.4: From left to right: estimated transformation, DP mixture predictive density, and posterior distribution for number of occupied mixture components for the acidity data. Five transformation methods.

                   NONE         BXM 2016     ST ML        ST EVI       ST EVI adj
Predictive Score   -1.1680690   -1.1406810   -1.1386870   -1.1424340   -1.1409290
Std. Err.           0.0070133    0.0082951    0.0065229    0.0082127    0.0073565

Table 5.2: Acidity data: Log predictive scores (5.36) for transformation / DP mixture (5.2) predictive densities.

The parametric stage does not identify problems with heavy tails in the sample.

The Bean, Xu and MacEachern (2016) approach yields a single Yeo-Johnson transformation, and the Skew-t transformations have degrees of freedom near infinity (in the case of ML) or at infinity (in the case of the EVI estimators). As a result, the transformations are approximately linear for large |x| and do not induce a difference in tail thickness between the transformed and back-transformed estimates.

Whether these features are appropriate for the application is unclear.

Turning to the predictive criterion (5.36), however, we again see a small but persistent improvement in performance from the transformed estimates, compared to the direct

DPM estimates. Table 5.2 displays these results. In this example, the K = 50

randomly selected training sets x_{-γ_k} consist of 100 of the 155 total observations, with the other 55 making up the test set. For comparison, in Table 5.3 we include several predictive scores for the same example listed by Griffin (2010), though it is unclear how Griffin balanced the split between testing and training, so the scores may not be directly comparable. A larger training set would generally give improved performance. Griffin’s results include two of his models, CCV and DCV, which were introduced in Chapter 3 of this thesis, and two additional prior constructions due to

Richardson and Green (1997) and Ishwaran and James (2002).

                   Griffin CCV   Griffin DCV   Richardson/Green   Ishwaran/James
Predictive Score   -1.13         -1.13         -1.22              -1.22

Table 5.3: Griffin (2010) predictive scores for the acidity data.
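For readers who want to reproduce this style of comparison, one natural implementation of a split-based log predictive score is sketched below in Python. The model-fitting function and its returned predictive density are placeholders (here a simple normal fit), not the transformation/DP mixture procedures scored in Tables 5.2 and 5.3, and the exact form of (5.36) may differ in constants or averaging.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_predictive_score(x, fit_model, n_train=100, K=50):
    """Average log predictive density over K random train/test splits."""
    scores = []
    for _ in range(K):
        perm = rng.permutation(len(x))
        train, test = x[perm[:n_train]], x[perm[n_train:]]
        pred_density = fit_model(train)        # returns a callable density
        scores.append(np.mean(np.log(pred_density(test))))
    return np.mean(scores), np.std(scores) / np.sqrt(K)

# Stand-in "model": a normal density fit to the training data.
def fit_normal(train):
    m, s = train.mean(), train.std(ddof=1)
    return lambda y: np.exp(-0.5 * ((y - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

x = rng.standard_t(df=3, size=155)   # fake data of the same size as the acidity sample
print(log_predictive_score(x, fit_normal))
```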

The example raises some interesting questions about how the transformations would fare in situations where the true density resembles a multi-modal mixture with some heavy tailed components. If the heavy tailed components of such a mixture have small probability weight, the EVI estimators would require very large samples to identify the polynomial tail. Similarly, ML procedures for the skew-t would identify skewness, but might struggle to detect a heavy tail in the multimodal setting.

5.4.2 Simulation Study

We performed a Monte Carlo simulation study to evaluate the effectiveness of the transformations in improving the quality of point estimates of the density for a variety of f0. The simulations consider the six densities listed below and shown in

Figure 5.5.

(a) (NORM) The standard normal density \(f_0(x) = (2\pi)^{-1/2} e^{-x^2/2}\).

(b) (2P-NORM) The right-skewed two-piece normal density

\[
f_0(x) = \frac{1}{2\sqrt{2\pi}}\left[\, e^{-x^2/2}\, I(x<0) \;+\; e^{-x^2/(2\cdot 3^2)}\, I(x\ge 0) \,\right]
\]

of Rubio and Steel (2014), with right-side scale parameter larger by a factor of three.

[Figure 5.5 appears here. Panels: (a) NORM, (b) 2P-NORM, (c) T, (d) ST, (e) 2P-T, (f) MIX; each panel plots the density against x.]

Figure 5.5: Densities f0 for Monte Carlo simulation study.

(c) (T) The symmetric t distribution with 2 degrees of freedom, with density

\[
f_0(x) = \frac{\Gamma(3/2)}{\sqrt{2\pi}\,\Gamma(1)}\left(1 + \frac{x^2}{2}\right)^{-3/2}.
\]

(d) (ST) The skew-t density st(0, 1, 1, 2) with α = 1 and ν = 2, with density

\[
f_0(x) = 2\, t_2(x)\, T_3\!\left(x\sqrt{\frac{3}{2 + x^2}}\right),
\]

where tν and Tν are the density and distribution function of the symmetric t

distribution with ν degrees of freedom.

(e) (2P-T) The two-piece t density

\[
f_0(x) = \frac{1}{2}\left[\frac{\Gamma(3/2)}{\sqrt{2\pi}\,\Gamma(1)}\left(1+\frac{x^2}{2}\right)^{-3/2} I(x<0)
\;+\; \frac{\Gamma(3)}{\sqrt{5\pi}\,\Gamma(5/2)}\left(1+\frac{x^2}{2\cdot 3^2}\right)^{-3} I(x\ge 0)\right]
\]

with 2 degrees of freedom and scale equal to 1 below the mode, and 5 degrees of freedom with scale equal to 3 above the mode. This distribution has a heavier left tail than right, but is right skewed.

(f) (MIX) The mixture

\[
f_0(x) = 0.4\,\phi(x+5) + 0.6\, t_3(x),
\]

with a normal component centered at x = −5 and a t_3 component centered at x = 0.

For sample sizes n = 100, n = 200, and n = 500, we draw K = 80 samples from each distribution, and analyze them using each of the five transformation strategies listed at the outset of this section (abbreviated NONE, BXM, ST ML, ST EVI, and

ST EVI adj), with the DP mixture (5.2) fit on the transformed scale. The MCMC for each model fit consists of 5500 iterations of the collapsed Gibbs sampler, with the first 500 iterations discarded as burn-in. The posterior predictive distribution ĝ_{θ,n}(y) is approximated by sampling from full-conditional distributions for the mixture-specific parameters for a new data value, using the scheme (3.11). It is back-transformed using (2.10) to produce a point estimate f̂_{θ,n}(x) of f0. On this scale, we compute several metrics to measure density estimation error. See Section 3.3.1 for more discussion of these metrics.

• Hellinger distance (3.17).

• Total variation distance (3.15).

• The Kullback-Leibler divergences d_{KL}(f̂_{θ,n} ‖ f0) and d_{KL}(f0 ‖ f̂_{θ,n}) described in (3.16).

• The integrated squared error d_2^2 (3.13).

The integrals defining these metrics are estimated numerically by discretizing the densities over wide, finite grids of 1000 points.
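A small Python sketch of such grid-based approximations is given below; it uses simple Riemann sums on a uniform grid and one common convention for each metric, so constants may differ from the precise definitions in (3.13)-(3.17).

```python
import numpy as np

def density_metrics(f_hat, f0, grid):
    """Riemann-sum approximations of common density-estimation errors
    (grid must be uniform; one common convention is used for each metric)."""
    dx = grid[1] - grid[0]
    hellinger = np.sqrt(0.5 * np.sum((np.sqrt(f_hat) - np.sqrt(f0)) ** 2) * dx)
    tv        = 0.5 * np.sum(np.abs(f_hat - f0)) * dx
    ise       = np.sum((f_hat - f0) ** 2) * dx
    pos = (f_hat > 0) & (f0 > 0)              # avoid log(0) in the KL terms
    kl_forward  = np.sum(f0[pos] * np.log(f0[pos] / f_hat[pos])) * dx
    kl_backward = np.sum(f_hat[pos] * np.log(f_hat[pos] / f0[pos])) * dx
    return dict(hellinger=hellinger, tv=tv, ise=ise,
                kl_forward=kl_forward, kl_backward=kl_backward)

grid  = np.linspace(-30.0, 30.0, 1000)
f0    = np.exp(-grid ** 2 / 2) / np.sqrt(2 * np.pi)                       # true density
f_hat = np.exp(-(grid - 0.1) ** 2 / (2 * 1.1)) / np.sqrt(2 * np.pi * 1.1) # rough estimate
print(density_metrics(f_hat, f0, grid))
```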

Figure 5.6 compares the median of the log Hellinger distance, across six scenarios

(a)-(f). For reference, the median for estimates of the standard normal density without transformation is included as a gray dotted line on each plot. Similar plots for the other metrics can be found at the end of this chapter. In every setting except for the standard normal (NORM) and the mixture density (MIX), all four transformation strategies yield improvements in Hellinger error.

5.5 Discussion

The transformation-density estimation technique, previously shown to be an effective strategy in kernel estimation, is equally useful in combination with DP mixture density estimation. However, in light of differences in the way features of the true f0 drive the asymptotic behavior of the kernel and DP density estimates, this chapter suggests a strategy for estimating transformations that is more directly tailored to

DP mixtures.

The analytical treatment required for establishing posterior convergence rates for the DP mixtures (e.g. Ghosal and Van der Vaart (2001, 2007); Shen et al. (2013)) is typically based on entropy bounds for classes of distributions F0, and accompanying bounds on prior concentration. Regular variation properties for F0 preclude this analytical treatment. While this paper does not make a deep investigation of the technical reasons behind the absence of rates for heavy tailed F0, we note that the sub-Gaussian tail condition (5.11) requires that the density have tails equally light

[Figure 5.6 appears here. Panels (a) NORM, (b) 2P-NORM, (c) T, (d) ST, (e) 2P-T, (f) MIX; each plots the median of the log Hellinger distance against sample size for NONE, BXM 2016, ST ML, ST EVI, and ST EVI adj.]

Figure 5.6: Median log Hellinger distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median log Hellinger distance for the normal density estimated with no transformation.

as the Gaussian kernel and base measure. There is a necessary synthesis between the tail behavior of F0 and the tails of the distributions driving the prior. For heavy tailed F0, others have proposed employing heavy tailed distributions as the kernel or base measure (Li et al., 2015; Tressou, 2008), but the convergence rates for such models are still unknown.

In contrast, our approach transforms the density to one which, as the sample size grows, is increasingly likely to satisfy the sub-Gaussian condition. The effect of the technique is to take an extraordinarily difficult inferential problem — the estimation of a complete density when it possesses heavy tails — for which convergence rates are unknown and divide it into two pieces — the estimation of a parametric description of the extreme tail, and DP mixture estimation of a light-tailed density — which can be solved at known rates.

We employ flexible Skew-t cdf-inverse-cdf transformations capable of transforming polynomial tails of any degree to sub-Gaussianity. As the parameter space for the transformations is low-dimensional, the parameters can be estimated by methods

(such as maximum likelihood) which stabilize at parametric rates. Alternatively, the tail index of the cdf transformation, which drives tail behavior on the transformed scale, can be estimated using well-known extreme value models. These extreme-value index estimators converge more slowly than the usual √n rate, as they use only a fraction of the sample. However, they can be modified to ensure the probability of a sub-Gaussian transformed tail approaches 1 at a rate faster than the convergence rate of the DP mixture.

The general method is simple to implement, estimators of the transformation

(5.17) and (5.20) are fast to compute, and the approach yields substantial gains in

empirical performance. We recommend this technique as a straightforward way to improve Bayesian estimates of heavy tailed densities.

[Figure 5.7 appears here. Panels: (a) NORM, (b) 2P-NORM, (c) T, each at n = 100, 200, 500; vertical axis is Hellinger distance; methods NONE, BXM 2016, ST ML, ST EVI, ST EVI adj.]

Figure 5.7: Hellinger distance between posterior predictive densities and true f0, K = 80 Monte Carlo simulations. First three scenarios (a)-(c).

[Figure 5.8 appears here. Panels: (d) ST, (e) 2P-T, (f) MIX, each at n = 100, 200, 500; vertical axis is Hellinger distance; methods NONE, BXM 2016, ST ML, ST EVI, ST EVI adj.]

Figure 5.8: Hellinger distance between posterior predictive densities and true f0, K = 80 Monte Carlo simulations. Last three scenarios (d)-(f).

[Figure 5.9 appears here. Panels (a)-(f); each plots the median of the log integrated squared error against sample size for the five methods.]

Figure 5.9: Median log integrated squared error distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation.

[Figure 5.10 appears here. Panels (a)-(f); each plots the median of the log backward Kullback-Leibler divergence against sample size for the five methods.]

Figure 5.10: Median log Kullback-Leibler divergence for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation.

[Figure 5.11 appears here. Panels (a)-(f); each plots the median of the log forward Kullback-Leibler divergence against sample size for the five methods.]

Figure 5.11: Median log Kullback-Leibler divergence for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median for the normal density estimated with no transformation.

[Figure 5.12 appears here. Panels (a)-(f); each plots the median of the log total variation distance against sample size for the five methods.]

Figure 5.12: Median log total variation distances for K = 80 Monte Carlo simulations in each setting. The gray dotted line indicates the median log total variation distance for the normal density estimated with no transformation.

Chapter 6: Extensions and Future Work

The methodology discussed in Chapters 4 and 5 has potential for application and extension in a few more complex data analytic settings. This chapter gives an overview of these possible extensions. We note that Dirichlet Process mixtures have been successfully applied in many settings aside from straightforward density estimation, including the modeling of random effects (Bush and MacEachern, 1996), ANOVA modeling (De Iorio, Muller, Rosner and MacEachern, 2004), regression (Kottas and Gelfand, 2001), and survival analysis (Kuo, Smith, MacEachern and West, 1992).

In many of these settings, one can envision incorporating transformations to handle heavy tailed distributions. The subsequent sections review some possible extensions of the transformation methodology.

6.1 Multivariate Density Estimation

The focus of this dissertation has been on estimation of univariate densities. If instead the sample X_1, ..., X_n belongs to R^d, and f_0 is the d-dimensional Lebesgue density, then an analogous construction to the previously presented DP mixture models takes the following form:

\[
\begin{aligned}
\Sigma &\sim \pi_\Sigma \\
G &\sim \mathrm{DP}(M, G_0) \\
f(x) \mid G, \Sigma &= \int \phi_\Sigma(x - \mu)\, dG(\mu) \\
X_i \mid f &\overset{iid}{\sim} f,
\end{aligned}
\qquad (6.1)
\]

where \(\phi_\Sigma(u)\) denotes the d-dimensional normal density with mean zero and covariance matrix \(\Sigma\). The conditionally conjugate choice for the base measure G_0 is another multivariate normal distribution.
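As a rough numerical sketch of (6.1), the snippet below evaluates the mixture density using a finite stick-breaking truncation in place of the full DP; the mass parameter, base measure draws, and covariance matrix are arbitrary illustrative values rather than quantities taken from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, M, n_comp = 2, 1.0, 25          # dimension, DP mass, truncation level

# Draw (weights, locations) from a truncated stick-breaking representation
# of G ~ DP(M, G0), with G0 a d-dimensional standard normal.  At a finite
# truncation the weights sum to slightly less than one.
v = rng.beta(1.0, M, size=n_comp)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
mu = rng.standard_normal((n_comp, d))          # draws from G0
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])     # a fixed draw from pi_Sigma

def f(x):
    """Mixture density f(x | G, Sigma) = sum_k w_k * N_d(x; mu_k, Sigma)."""
    return sum(wk * stats.multivariate_normal.pdf(x, mean=mk, cov=Sigma)
               for wk, mk in zip(w, mu))

print(f(np.zeros(d)))
```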

Shen et al. (2013) studied frequentist asymptotic properties of multivariate DP mixtures such as (6.1). Their results are largely analogous to those established for univariate estimation by Ghosal and Van der Vaart (2007). Specifically, they gave conditions on F0 and the prior (6.1) to derive a posterior convergence rate εn with respect to the Hellinger or L1 metrics, which ensures

\[
\Pi_n\big(\{f : d_H(f, f_0) > M\varepsilon_n\} \,\big|\, X_1, \ldots, X_n\big) \;\to\; 0 \qquad (6.2)
\]

almost surely as n → ∞ under \(X_i \overset{iid}{\sim} F_0\).

The rate established by Shen et al. (2013) is equivalent to the minimax rate for the β-Hölder class (see (2.3)) up to a logarithmic factor. Their development requires the following conditions.

• The base measure has powered-exponential tails,

\[
1 - G_0\big([-a, a]^d\big) \;\le\; b_1 e^{-C_1 a^{a_1}},
\]

where \([-a, a]^d\) denotes a d-dimensional rectangle.

• The prior π_Σ of the covariance matrix satisfies

\[
\pi_\Sigma\big(\Sigma : \text{largest eigenvalue of } \Sigma^{-1} \ge x\big) \;\le\; b_2 e^{-C_2 x^{a_2}},
\]
\[
\pi_\Sigma\big(\Sigma : \text{smallest eigenvalue of } \Sigma^{-1} < x\big) \;\le\; b_3 x^{a_3},
\]

the latter condition required only for sufficiently small x. Further, they assume

there exist constants κ, a_4, a_5, b_4, C_3 > 0 such that for any 0 < s_1 ≤ · · · ≤ s_d

and t ∈ (0, 1),

\[
\pi_\Sigma\big(\Sigma : s_j < j\text{th eigenvalue} < s_j(1+t)\ \forall\, j = 1, \ldots, d\big)
\;\ge\; b_4\, s_1^{a_4}\, t^{a_5} \exp\!\big(-C_3 s_d^{\kappa/2}\big).
\]

• The density f_0 belongs to the locally β-Hölder class, with finite partial derivatives D^k f_0 for every k = (k_1, ..., k_d) with \(\sum_j k_j \le \lfloor\beta\rfloor\), and satisfying, for every such k,

\[
\big|(D^k f_0)(x + y) - (D^k f_0)(x)\big| \;\le\; L(x)\, \exp\!\big(\tau_0 \|y\|^2\big)\, \|y\|^{\beta - \lfloor\beta\rfloor}
\]

for any x, y ∈ R^d.

• The derivatives of f_0 additionally satisfy the integrability conditions that for some ε > 0,

\[
\int \left(\frac{|D^k f_0|}{f_0}\right)^{(2\beta + \varepsilon)/\sum_j k_j} dF_0 < \infty
\]

for every k = (k_1, ..., k_d) with \(\sum_j k_j \le \lfloor\beta\rfloor\), and

\[
\int \left(\frac{L}{f_0}\right)^{(2\beta + \varepsilon)/\beta} dF_0 < \infty.
\]

• The density f_0 must exhibit exponential tail decay, with constants c, b, τ > 0 such that

\[
f_0(x) \;\le\; c \exp\!\big(-b\|x\|^{\tau}\big) \qquad (6.3)
\]

for all ‖x‖ larger than some constant a > 0.

Under these conditions, the posterior convergence rate ε_n in (6.2) is

\[
\varepsilon_n = n^{-\beta/(2\beta + d^*)} (\log n)^{t}, \qquad (6.4)
\]

where d* = max(d, κ). Note this rate becomes slower as the dimension d increases.
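To make the dependence on d concrete, consider plugging illustrative values into (6.4), suppressing the logarithmic factor: for β = 2 and κ ≤ d, so that d* = d,

\[
\varepsilon_n \asymp n^{-2/(4+d)}:
\qquad
d = 1:\ n^{-2/5}, \qquad
d = 2:\ n^{-1/3}, \qquad
d = 10:\ n^{-1/7},
\]

so the same amount of smoothness buys far less accuracy as the dimension grows.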

The final condition (6.3) is analogous to the Ghosal and Van der Vaart (2007) conditions mentioned in equations (5.3) and (3.22). In the next section, we examine this condition in more detail and consider its consequences for the marginal distributions.

6.1.1 Transformations to Multivariate sub-Gaussianity

Given a joint density f0(x) that does not satisfy (6.3), a natural extension of our work is to develop a transformation of (X1,...,Xd) ∼ F0, of the form

Yθ = (Yθ,1,...,Yθ,d) = ϕθ(X1,...,Xd)

such that the density g_{0,θ}(y) of Y_θ satisfies (6.3). The methodology described in previous chapters gives a straightforward method for transforming the marginal densities, individually, to sub-Gaussianity. For example, the transformation that takes

\[
Y_{\theta,j} = \Phi^{-1}\big(F_{\theta_j}(X_j)\big), \qquad (6.5)
\]

with F_{θ,j} a parametric fit for the jth marginal distribution, could be estimated just as in Chapter 5 to ensure that the jth transformed marginal distribution is sub-Gaussian.

This transformation ϕ_θ : R^d → R^d is one-to-one and has a diagonal Jacobian matrix.

The joint density g_{0,θ} of Y_θ under transformation of the margins has the form

\[
g_{0,\theta}(y) = f_0\big(\varphi_\theta^{-1}(y)\big)\,\left|\frac{d}{dy}\varphi_\theta^{-1}(y)\right|. \qquad (6.6)
\]

Since the Jacobian \(\frac{d}{dy}\varphi_\theta^{-1}(y)\) is diagonal, its determinant is simple,

\[
\left|\frac{d}{dy}\varphi_\theta^{-1}(y)\right| = \prod_{j=1}^{d} \frac{\partial}{\partial y_j}\varphi_{\theta,j}^{-1}(y).
\]

Further, in the case of the cdf transformations (6.5),

\[
\left|\frac{d}{dy}\varphi_\theta^{-1}(y)\right| = \prod_{j=1}^{d} \frac{\phi(y_j)}{f_{\theta,j}\big(F_{\theta,j}^{-1}(\Phi(y_j))\big)},
\]

as in the univariate case.
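A brief Python sketch of the coordinatewise construction (6.5)-(6.6) follows, using Student-t margins as stand-ins for the parametric fits F_{θ,j}; the degrees of freedom below are illustrative rather than estimated. In the special case where the coordinates of X are independent with exactly these t margins, the transformed density reduces to a product of standard normals, which the last line checks.

```python
import numpy as np
from scipy import stats

nu = np.array([3.0, 5.0, 2.5])       # hypothetical marginal t fits F_{theta,j}

def to_gaussian_margins(x):
    """Coordinatewise map Y_j = Phi^{-1}(F_{theta,j}(X_j)) as in (6.5)."""
    return stats.norm.ppf(stats.t.cdf(x, df=nu))

def log_jacobian_inverse(y):
    """log |d/dy phi_theta^{-1}(y)|: the Jacobian is diagonal, so the log-determinant
    is a sum of log phi(y_j) - log f_{theta,j}(F_{theta,j}^{-1}(Phi(y_j)))."""
    x = stats.t.ppf(stats.norm.cdf(y), df=nu)
    return np.sum(stats.norm.logpdf(y) - stats.t.logpdf(x, df=nu), axis=-1)

def log_g0(y, log_f0):
    """Log density of Y via (6.6), given the log joint density of X."""
    x = stats.t.ppf(stats.norm.cdf(y), df=nu)
    return log_f0(x) + log_jacobian_inverse(y)

# Example: X with independent t margins, so Y has exactly standard normal margins.
log_f0 = lambda x: np.sum(stats.t.logpdf(x, df=nu), axis=-1)
y = to_gaussian_margins(np.array([0.5, -1.2, 3.0]))
print(log_g0(y, log_f0), np.sum(stats.norm.logpdf(y)))   # these agree
```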

A useful theorem for relating the condition (6.3) on the joint density f_0 to the marginal densities f_{0j} is Sklar's Theorem (Sklar, 1959), which states that any f_0 can be expressed as

\[
f_0(x) = c_0\big(F_{01}(x_1), \ldots, F_{0d}(x_d)\big)\cdot f_{01}(x_1)\cdots f_{0d}(x_d), \qquad (6.7)
\]

where c_0 is a copula density, corresponding to the joint distribution of (F_{01}(X_1), ..., F_{0d}(X_d)) where X = (X_1, ..., X_d) ∼ F_0. Applying this to (6.6), we find that

\[
g_{0,\theta}(y) = c_0\Big(F_{0,1}\big(\varphi_{\theta,1}^{-1}(y_1)\big), \ldots, F_{0,d}\big(\varphi_{\theta,d}^{-1}(y_d)\big)\Big)\cdot \prod_{j=1}^{d} g_{0,\theta,j}(y_j), \qquad (6.8)
\]

where the marginal densities of the transformed variate, g_{0,θ,j}, have the same form

\[
g_{0,\theta,j}(y_j) = \frac{f_{0,j}\big(F_{\theta,j}^{-1}(\Phi(y_j))\big)}{f_{\theta,j}\big(F_{\theta,j}^{-1}(\Phi(y_j))\big)}\cdot \phi(y_j) \qquad (6.9)
\]

as in previous chapters. Each of these can be made sub-Gaussian by ensuring that the parametric density f_{θ,j} has a heavy enough tail to ensure

\[
\limsup_{|x_j| \to \infty} \frac{f_{0,j}(x_j)}{f_{\theta,j}(x_j)} < \infty.
\]

If so,

\[
g_{0,\theta,j}(y_j) \;\le\; d_j \exp\!\big(-\tfrac{1}{2} y_j^2\big)
\]

for each j. Hence

\[
\prod_{j=1}^{d} g_{0,\theta,j}(y_j) \;\le\; d \exp\!\big(-\tfrac{1}{2}\|y\|^2\big).
\]

Restating the condition (6.3) for the transformed density, and taking the required exponent to be the Gaussian τ = 2, we aim to ensure

\[
g_{0,\theta}(y) \;\le\; c \exp\!\big(-b\|y\|^2\big).
\]

From the expression (6.8) and the preceding arguments, this can be reduced to the condition

\[
c_0\Big(F_{0,1}\big(\varphi_{\theta,1}^{-1}(y_1)\big), \ldots, F_{0,d}\big(\varphi_{\theta,d}^{-1}(y_d)\big)\Big)
\;\le\; \frac{c}{d}\, \exp\!\big(-(b - 1/2)\|y\|^2\big) \qquad (6.10)
\]

on the copula density of f_0. This gives an additional requirement that the copula density c_0(u_1, ..., u_d) cannot go to infinity too fast as (u_1, ..., u_d) approaches the boundary of [0, 1]^d.

At the time of writing, it remains unknown to the author whether (6.10) holds

for a broad variety of multivariate distributions F0 and parametric transformations

of the margins Fθ,j based on skew-t densities. It is known (e.g. Geenens, Charpentier

and Paindaveine (2017)) that many parametric copulas, such as the Gaussian and

Student t copulas, have unbounded densities, so that taking c0 to be bounded would

be a restrictive assumption. This is an area for future exploration.

The expression (6.8) introduces another idea for semiparametric estimation. The

copula density c0 captures the dependence structure inherent in F0. One could either

take a nonparametric approach to estimating the whole of the joint density g_{0,θ}(y) after transformation of the margins, or, as an alternative, employ a parametric estimate of the copula c_0, and estimate the margins g_{0,θ,j}(y_j) using univariate nonparametric techniques. Since parametric copulas are low-dimensional and can be estimated efficiently, and the nonparametric estimation is employed only for univariate densities, the so-called “curse of dimensionality” apparent in (6.4) (where the rate becomes slower as d increases) can be avoided. However, if the coordinates of X and Y are indeed correlated, or if one believes (a priori) that the marginal distributions are similar, then the use of independent models for estimating the margins would seem

ficiently (e.g. ), and the nonparametric estimation is employed only for univariate densities, the so-called “curse of dimensionality” apparent in 6.4 (where the rate be- comes slower as d increases) can be avoided. However, if the coordinates of X and Y are indeed correlated, or if one believes (a priori) that the marginal distributions are similar, then the use of independent models for estimating the margins would seem to imply a loss of efficiency.

Closely related ideas were discussed by Xu, Maceachern and Xu (2015), who study

semiparametric Bayesian copula models for estimating stationary time series. In that

setting, it is assumed the marginal distributions f0t of X = (Xt, t = 1,...,T ) are identical for all t.

6.1.2 Bayesian Analysis of Heavy-Tailed Time Series

Xu, Maceachern and Xu (2015) present a copula approach to modeling non-

Gaussian time series that transforms the marginal distribution of the time series according to a cdf transformation from a distribution given a DP mixture prior. The time series is viewed through two major components: the marginal distribution of the observations, and the correlation structure. Through the copula approach, the authors make use of the DP mixture for a flexible model of the marginal distribu- tion, a model which is used to transform the series to one with more nearly-Gaussian marginals, on which scale traditional Gaussian time series models can be used to account for correlation structure.

Yet for many time series, particularly in financial applications (cf. Resnick (2007)) for the analysis of log returns of financial investments, and in insurance, where modeling of rare but extreme events is central to appropriate decision making, there is a third important component to the problem: the marginal distributions of the time series may be heavy tailed.

If the marginal distribution f0 of the observations Xt in a time series X =

(X_1, ..., X_T) has a heavy tail, consider transforming the margins according to a single parametric transformation. The transformed time series

\[
\big\{\, Y_{\theta,t} = \varphi_\theta(X_t) = \Phi^{-1}\big(F_\theta(X_t)\big),\ t = 1, \ldots, T \,\big\}
\]

has identical univariate marginal distributions of the form (6.9). On this transformed scale, if the technique of Xu et al. (2015) is employed, one would develop a nonparametric model for the shared univariate marginal distribution G_{0,θ} of the Y_{θ,t}, and apply another transformation

\[
Z_{\theta,t} = \Phi^{-1}\big(G_\theta(Y_{\theta,t})\big).
\]

The authors suggest modeling Gθ using a DP mixture of normals and, given Gθ, modeling the copula transformed series Zθ,t using a standard parametric Bayesian time series model with normal margins.
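A schematic Python version of this two-stage construction is sketched below. The parametric stage uses a t cdf as a stand-in for F_θ, and the DP-mixture marginal G_θ is replaced by a fixed two-component normal mixture; all numbers are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.standard_t(df=2, size=500)            # heavy-tailed "series" (margins only)

# Stage 1: parametric cdf transformation Y_t = Phi^{-1}(F_theta(X_t)).
nu_hat = 2.0                                  # hypothetical tail fit
y = stats.norm.ppf(stats.t.cdf(x, df=nu_hat))

# Stage 2: copula transformation Z_t = Phi^{-1}(G_theta(Y_t)), with G_theta
# standing in for a fitted DP mixture of normals (here a 2-component mixture).
w, mu, sd = np.array([0.6, 0.4]), np.array([-0.3, 0.5]), np.array([0.9, 1.1])

def G(yv):
    """Mixture-of-normals cdf playing the role of G_theta."""
    return np.sum(w * stats.norm.cdf(yv[:, None], loc=mu, scale=sd), axis=1)

z = stats.norm.ppf(G(y))
# z now has (approximately) standard normal margins and could be handed to a
# standard Gaussian time series model for the correlation structure.
print(np.round([z.mean(), z.std()], 2))
```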

We hypothesize that in cases where the X_t have heavy-tailed marginal distributions, a parametric transformation yields a series Y_{θ,t} whose margins can be better estimated by a DP mixture of normals. This would be an interesting, and perhaps quite useful, extension of the transformation methodology. For example, a common objective in financial applications is to estimate the value at risk (Resnick, 2007), which is an extreme quantile of the return distribution. Since the transformed DP mixture tends to produce better estimates of extreme quantiles in the heavy-tailed setting (see

Figure 4.5 in Section 4.3), one might hope for better performance in estimating value at risk if the financial time series has heavy tails.

6.1.3 Median Regression with a Heavy-Tailed Error Distribution

One of the simplest extensions of nonparametric priors for density estimation is to employ similar priors for the error distribution in a regression setting. To explore possible extensions of the transformation technique to this context, we highlight two

Bayesian approaches from the literature.

Kottas and Gelfand (2001) describe modeling the error distribution in a median regression model with a DP scale-mixture of a median-zero two-piece density. A sample Y_1, ..., Y_n of responses is modeled as

\[
Y_i = \beta_0 + x_i^{\top}\beta + \varepsilon_i,
\]

and the distribution F of ε_i is described through a DP mixture

\[
f(\varepsilon \mid G, \gamma) = \int p(\varepsilon \mid \sigma, \gamma)\, dG(\sigma^2), \qquad (6.11)
\]

with split-normal kernels of the form

\[
p(\varepsilon \mid \sigma, \gamma) = \phi_{\gamma\sigma}(\varepsilon)\, I(\varepsilon \le 0) + \phi_{\sigma/\gamma}(\varepsilon)\, I(\varepsilon > 0).
\]

These kernels are designed to have median zero, a property which is retained by f(ε|G, γ) after convolution with G. The mixing distribution has a DP prior G ∼

DP(M, G_0) with an inverse-gamma base measure. To complete the prior, (β_0, β) is assigned a multivariate normal distribution, and γ² is assumed to be gamma distributed, both independent of (6.11).
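A quick numerical check of the split-normal kernel above (this only verifies that the kernel is a proper density with median zero; the values of σ and γ are arbitrary):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def p(eps, sigma, gamma):
    """Split-normal kernel: N(0, (gamma*sigma)^2) below zero, N(0, (sigma/gamma)^2) above."""
    return np.where(eps <= 0,
                    stats.norm.pdf(eps, scale=gamma * sigma),
                    stats.norm.pdf(eps, scale=sigma / gamma))

sigma, gamma = 1.5, 2.0
total   = quad(lambda e: p(e, sigma, gamma), -np.inf, np.inf)[0]
below_0 = quad(lambda e: p(e, sigma, gamma), -np.inf, 0.0)[0]
print(round(total, 6), round(below_0, 6))   # ~1.0 and ~0.5: proper density, median zero
```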

This setup of the DP mixture (6.11) differs slightly from our applications in density estimation. It involves a scale (rather than location) mixture of normals, and the base measure of the mixing distribution (inverse gamma) has a polynomial tail.

Convolutions of such distributions may indeed exhibit heavy tailed behavior.

However, a transformation could still contribute to such a scheme in cases where the error distribution is skewed or heavy tailed. The classic transformation of Box and Cox (1964), and the family of Yeo and Johnson (2000) which figured centrally in Chapter 4, were designed for use in these situations. To carry through the ideas from Chapters 4 and 5, we consider a preliminary fit based on a parametric model that allows a skewed and heavy tailed error distribution, and use the parametric fit

to transform the response variable to make the error distribution more tractable for

a nonparametric fit.

Transform the Residuals

In a straightforward extension of the methodology described in Chapter 5, consider

fitting a preliminary linear model with skew-t errors,

\[
Y_i \overset{ind}{\sim} \mathrm{St}\big(\xi_0 + x_i^{\top}\xi,\ \omega,\ \alpha,\ \nu\big).
\]

This approach is recommended by Azzalini and Capitanio (2003), who describe the

score equations necessary for a maximum-likelihood approach to estimation. The

skew-t likelihood can be used to obtain preliminary estimates θ̂ of the parameters

θ = (ξ_0, ξ, ω, α, ν). The residuals from this preliminary fit,

\[
\hat{e}_i = Y_i - \hat{\xi}_0 - x_i^{\top}\hat{\xi},
\]

may exhibit skew or heavy tails, with a distribution close to skew-t.

Consider transforming the response observations according to

\[
Z_{\hat\theta,i} = \varphi_{\hat\theta}(Y_i) = \hat{\xi}_0 + x_i^{\top}\hat{\xi} + \Phi^{-1}\big(\mathrm{St}(\hat{e}_i \mid 0, \omega, \alpha, \nu)\big),
\]

which is strictly increasing in Y_i, with inverse

\[
Y_i = \hat{\xi}_0 + x_i^{\top}\hat{\xi} + \mathrm{St}^{-1}\Big(\Phi\big(Z_{\hat\theta,i} - (\hat{\xi}_0 + x_i^{\top}\hat{\xi})\big)\Big).
\]

If the Kottas and Gelfand (2001) model is fit to the transformed response variables Z_{θ̂,i}, back-transforming to obtain inference about the conditional distribution of Y given x is straightforward, since these distributions are again related through (2.10).
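A rough Python sketch of this residual transformation is given below. Since the skew-t cdf is not available directly in scipy, it is obtained by numerically integrating the skew-t density; the coefficients ξ̂_0, ξ̂, ω, α, ν are placeholders rather than estimates from an actual fit.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def st_pdf(z, alpha, nu):
    """Standard skew-t density (location 0, scale 1)."""
    return 2.0 * stats.t.pdf(z, nu) * stats.t.cdf(
        alpha * z * np.sqrt((nu + 1.0) / (nu + z ** 2)), nu + 1.0)

def st_cdf(z, alpha, nu):
    """Skew-t cdf by numerical integration of the density."""
    return quad(st_pdf, -np.inf, z, args=(alpha, nu))[0]

# Hypothetical preliminary skew-t regression fit (not estimated here).
xi0_hat, xi_hat = 1.0, np.array([0.5, -0.2])
omega, alpha, nu = 2.0, 3.0, 4.0

def transform_response(y_i, x_i):
    """Z_i = xi0 + x_i'xi + Phi^{-1}( St( (y_i - xi0 - x_i'xi) / omega ) )."""
    loc = xi0_hat + x_i @ xi_hat
    u = st_cdf((y_i - loc) / omega, alpha, nu)
    return loc + stats.norm.ppf(u)

print(transform_response(4.0, np.array([1.0, 2.0])))
```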

6.2 A Recipe for Efficient Semiparametric Analysis

The fundamental strength of the transformation method is that it separates a very difficult problem (nonparametric estimation of a skewed or heavy-tailed density) into two easier problems (parametric estimation of a transformation, and nonparametric estimation of a comparatively short-tailed and symmetrized density). This breakdown of a difficult inference problem into easier pieces is powerful in a variety of modern problems.

Analysis of large data sets with strange features, increasingly common in this era of “big data,” begs for flexible models such as the countable mixtures I have studied.

However, approaching all aspects of a problem using a single nonparametric model can lead to inefficient inference. Substantial benefits can be derived by identifying specific features or irregularities of the problem, and adapting a parametric approach in those areas. These parametric components make more efficient use of the data, while still giving reasonable inference for these parts of the problem. The full flexibil- ity of nonparametric models is reserved for components of the problem (like density estimation) that require a high level of complexity.

This can serve as a useful guiding philosophy for data analysis in areas outside of density estimation.

Appendix A: Asymptotic Expansion of MISE for Kernel Estimates

The MISE is of enough interest to this methodology that it is worth a close look.

Here we derive the standard asymptotic expansion cited in (2.4). Consider first the

pointwise mean-square error

\[
\mathrm{E}\big[\hat{f}_h(x) - f_0(x)\big]^2 = \mathrm{Var}\big(\hat{f}_h(x)\big) + \mathrm{Bias}^2\big(\hat{f}_h(x)\big), \qquad (A.1)
\]

where again \(\hat{f}_h(x) = (nh)^{-1}\sum_i K\big(h^{-1}(x - X_i)\big)\). Note that the MISE (2.14) can be obtained by integrating (A.1) over x, since the integration and expectation can be switched in (2.14) via Fubini's theorem. To study the pointwise bias and variance, we first note

\[
\mathrm{E}\,\hat{f}_h(x) = h^{-1}\int K\big((x - u)/h\big) f_0(u)\, du = \int K(t) f_0(x + ht)\, dt,
\]
\[
\frac{1}{n}\,\mathrm{E}\!\left[\frac{1}{h^2} K\big((x - X_i)/h\big)^2\right]
= n^{-1} h^{-2} \int K\big((x - u)/h\big)^2 f_0(u)\, du
= (nh)^{-1}\int K(t)^2 f_0(x + ht)\, dt.
\]

Hence, evaluating the variance and squared bias, we get

\[
\mathrm{Bias}\big(\hat{f}_h(x)\big) = \int K(t)\big[f_0(x + ht) - f_0(x)\big]\, dt, \qquad (A.2)
\]
\[
\mathrm{Var}\big(\hat{f}_h(x)\big) = \frac{1}{n}\left[\frac{1}{h}\int K^2(t) f_0(x + ht)\, dt - \left(\int K(t) f_0(x + ht)\, dt\right)^{2}\right]. \qquad (A.3)
\]

If the bandwidth is made to shrink as sample size grows, h → 0 as n → ∞, then we will have

\[
\int K(t) f_0(x + ht)\, dt \to f_0(x),
\qquad
\int K(t)^2 f_0(x + ht)\, dt \to f_0(x)\int K(t)^2\, dt.
\]

Hence, taking a limit of (A.3), it follows that

\[
\mathrm{Var}\big(\hat{f}_h(x)\big) \sim \frac{1}{nh}\, f_0(x) \int K(t)^2\, dt. \qquad (A.4)
\]

For the bias, on the other hand, note that assuming f_0 belongs to the β-Hölder class (2.3), we have the expansion

\[
f_0(x + ht) - f_0(x) = ht f_0'(x) + \frac{1}{2} h^2 t^2 f_0''(x) + O(h^3),
\]

so that from (A.2),

\[
\mathrm{Bias}\big(\hat{f}_h(x)\big) = h f_0'(x)\int t K(t)\, dt + \frac{1}{2} h^2 f_0''(x) \int t^2 K(t)\, dt + O(h^3). \qquad (A.5)
\]

If the kernel is symmetric, then \(\int t K(t)\, dt = 0\), so we get an asymptotic expression for the pointwise MSE (A.1),

\[
\mathrm{Var}\big(\hat{f}_h(x)\big) + \mathrm{Bias}^2\big(\hat{f}_h(x)\big)
= \frac{1}{nh} R(K) f_0(x) + \frac{h^4 \sigma_K^4}{4}\, f_0''(x)^2 + o\big((nh)^{-1} + h^4\big).
\]

Integrating this expression over x gives the relevant expansion (2.4).
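As a brief aside, minimizing the two leading terms of this expansion over h gives the familiar optimal bandwidth (here R(K) = ∫K(t)² dt and σ_K² = ∫t²K(t) dt, as above):

\[
\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{h^4\sigma_K^4}{4} R(f_0'')
\quad\Longrightarrow\quad
h_{\mathrm{opt}} = \left\{ \frac{R(K)}{\sigma_K^4 R(f_0'')} \right\}^{1/5} n^{-1/5},
\qquad
\mathrm{AMISE}(h_{\mathrm{opt}}) = \frac{5}{4}\big\{\sigma_K^4 R(K)^4 R(f_0'')\big\}^{1/5} n^{-4/5}.
\]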

Appendix B: Estimating integrated squared second derivatives

Estimation of such functionals is of key interest for bandwidth selection rules in kernel density estimation, and there is a rich literature discussing the merits of various estimators. It is assumed from the outset that f satisfies the necessary differentiability conditions, and further that f has smoothness of order p. That is, there exists Z such that for any x and y,

\[
\big|f^{(j)}(x) - f^{(j)}(y)\big| \;\le\; Z\,|x - y|^{\varepsilon},
\]

where p = j + ε for an integer j and 0 < ε ≤ 1. If p is an integer and f has p bounded derivatives, then f is smooth of order p. Estimates are based on two equivalent representations of the integrated squared second derivative,

\[
R(f'') := \int f''(x)^2\, dx
\qquad\text{and}\qquad
R(f'') = \int f^{(4)}(x) f(x)\, dx.
\]

In a companion paper to Sheather and Jones (1991), which introduced the famous Sheather-Jones “plug-in” bandwidth selector, Jones and Sheather discuss “diagonals-in” estimators of the form

\[
\tilde{S}_\alpha = \int \hat{f}_\alpha''(x)^2\, dx
\qquad\text{or}\qquad
\tilde{S}_\alpha = n^{-1}\sum_{i=1}^{n} \hat{f}_\alpha^{(4)}(X_i).
\]

These estimators of R(f'') have the same variance, but reduced bias when compared to the “diagonals-out” estimators of Hall and Marron (1987),

\[
\hat{S}_\alpha = n^{-1}(n-1)^{-1}\sum_{i \ne j} K_\alpha'' * K_\alpha''(X_i - X_j)
\qquad\text{and}\qquad
\hat{S}_\alpha = n^{-1}(n-1)^{-1}\sum_{i \ne j} K_\alpha^{(4)}(X_i - X_j).
\]

The Jones and Sheather “diagonals-in” technique involves estimating \(R(f'') = \int (f'')^2\) by replacing f with a kernel estimate

\[
\hat{f}_\alpha(x) = n^{-1}\sum_{i=1}^{n} \alpha^{-1} L\big(\alpha^{-1}(x - X_i)\big),
\]

with a kernel L having order k, meaning

\[
\int x^j L(x)\, dx =
\begin{cases}
1 & \text{for } j = 0 \\
0 & \text{for } 1 \le j \le k - 1 \\
\mu_k \ne 0 & \text{for } j = k.
\end{cases}
\]

Hence normal kernels have order k = 2. The bandwidth α is selected by examining an expression for the bias,

\[
\mathrm{E}(\tilde{S}_\alpha) - R(f'') = \frac{(-1)^{k/2}\alpha^{k} s_k}{k!}\, R\big(f^{(k/2+2)}\big) + \frac{L^{(4)}(0)}{n\alpha^{5}} + o(\alpha^{k}),
\]

which holds when f has an order of smoothness p > k/2 + 2 (so p > 3 if L is normal). Jones and Sheather observe that for most kernels L, in particular the normal, the first two terms on the right side have opposite signs. They derive a rule for the bandwidth α* by equating the sum to 0. This gives

\[
\alpha^* = A(L)\, \big\{R\big(f^{(k/2+2)}\big)\big\}^{-1/(5+k)}\, n^{-1/(5+k)}.
\]

Practical use of this rule requires a further estimate of \(R\big(f^{(k/2+2)}\big)\), but Jones and Sheather say “the theory requires no great accuracy in estimating this functional at this stage.” The MSE of \(\tilde{S}_\alpha\) as an estimator of R(f'') is then

\[
\mathrm{MSE}^* \sim 2\, R(f)\, R\big(L^{(4)}\big)\, A(L)^{-9}\, \big\{R\big(f^{(k/2+2)}\big)\big\}^{9/(5+k)} \times n^{-\frac{2k+1}{k+5}}.
\]

When using a second order kernel L, the order of magnitude of MSE* is \(n^{-5/7}\), an improvement on the “diagonals-out” rate of \(n^{-8/13}\).
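A compact Python sketch of the “diagonals-in” estimator S̃_α = n^{-1}∑_i f̂_α^{(4)}(X_i) with a normal kernel L = φ (so k = 2) is given below; the bandwidth is fixed arbitrarily rather than chosen by the rule above.

```python
import numpy as np

def phi4(u):
    """Fourth derivative of the standard normal density."""
    return (u ** 4 - 6.0 * u ** 2 + 3.0) * np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def diagonals_in(x, alpha):
    """Jones-Sheather 'diagonals-in' estimate of R(f'') = int f^{(4)} f, i.e.
    n^{-2} alpha^{-5} sum_{i,j} phi^{(4)}((X_i - X_j)/alpha), including i = j."""
    n = len(x)
    diffs = (x[:, None] - x[None, :]) / alpha
    return phi4(diffs).sum() / (n ** 2 * alpha ** 5)

rng = np.random.default_rng(3)
x = rng.standard_normal(500)
print(diagonals_in(x, alpha=0.6))
# For the standard normal, R(f'') = 3 / (8 * sqrt(pi)) ~= 0.2116, for comparison.
```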

Bibliography

Abramson, I. S. (1982). On Bandwidth Variation in Kernel Estimates-A Square Root

Law, The Annals of Statistics 10(4): 1217–1223.

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with Applications to Bayesian

Nonparametric Problems, The Annals of Statistics 2(6): 1152–1174.

Azzalini, A. and Capitanio, A. (2003). Distributions Generated by Perturbation of

Symmetry with Emphasis on a Multivariate Skew t-Distribution, Journal of the

Royal Statistical Society. Series B (Statistical Methodology) 65(2): 367–389.

Bean, A., Xu, X. and MacEachern, S. (2016). Transformations and Bayesian density

estimation, Electronic Journal of Statistics 10(2): 3355–3373.

Bhattacharya, A. and Dunson, D. B. (2012). Strong consistency of nonparametric

Bayes density estimation on compact metric spaces with applications to specific

manifolds, Annals of the Institute of Statistical Mathematics 64(4): 687–714.

Bickel, P. J. and Doksum, K. A. (1981). An Analysis of Transformations Revisited,

Journal of the American Statistical Association 76(374): 296–311.

Billingsley, P. (2012). Probability and Measure, Wiley, Hoboken, N.J.

Blackwell, D. and MacQueen, J. B. (1973). Ferguson Distributions Via Polya Urn

Schemes, The Annals of Statistics 1(2): 353–355.

Blei, D. M. and Jordan, M. I. (2006). Variational inference for Dirichlet process

mixtures, Bayesian Analysis 1(1A): 121–144.

Bolance, C., Guillen, M. and Nielsen, J. P. (2003). Kernel density estimation of

actuarial loss functions, Insurance: Mathematics and Economics 32(1): 19–36.

Box, G. E. P. and Cox, D. R. (1964). An Analysis of Transformations, Journal of the

Royal Statistical Society. Series B (Methodological) 26(2): 211–252.

Box, G. E. P. and Cox, D. R. (1982). An Analysis of Transformations Revisited,

Rebutted, Journal of the American Statistical Association 77(377): 209–210.

Brown, L. D., George, E. I. and Xu, X. (2008). Admissible predictive density estima-

tion, The Annals of Statistics 36(3): 1156–1170.

Buch-Larsen, T., Nielsen, J. P., Guillén, M. and Bolancé, C. (2005). Kernel density

estimation for heavy-tailed distributions using the champernowne transformation,

Statistics 39(6): 503–516.

Bush, C. and MacEachern, S. (1996). A semiparametric Bayesian model for ran-

domised block designs, Biometrika 83(2): 275–285.

Chang, S.-M. and Genton, M. G. (2007). Extreme Value Distributions for the Skew-

Symmetric Family of Distributions, Communications in Statistics - Theory and

Methods 36(9): 1705–1717.

Dahl, D. B. (2003). An improved merge-split sampler for conjugate Dirichlet process

mixture models, Technical report, Univ. of Wisconsin, Madison.

De Haan, L. and Peng, L. (1998). Comparison of tail index estimators, Statistica

Neerlandica 52(1): 60–70.

De Iorio, M., Muller, P., Rosner, G. L. and MacEachern, S. N. (2004). An ANOVA

Model for Dependent Random Measures, Journal of the American Statistical As-

sociation 99(465): 205–215.

Dekkers, A. L. M., Einmahl, J. H. J. and Haan, L. D. (1989). A Moment Estimator for

the Index of an Extreme-Value Distribution, The Annals of Statistics 17(4): 1833–

1855.

Devroye, L. (1983). The Equivalence of Weak, Strong and Complete Convergence in

L1 for Kernel Density Estimates, The Annals of Statistics 11(3): 896–904.

Devroye, L. and Gyorfi, L. (1985). Nonparametric Density Estimation: The L1 View,

Wiley Interscience Series in Discrete Mathematics, Wiley, New York.

Diaconis, P. and Freedman, D. (1986). On the Consistency of Bayes Estimates, The

Annals of Statistics 14(1): 1–26.

Eberwein, K. (2017). blscrapeR: An API Wrapper for the Bureau of Labor Statistics

(BLS).

Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate probability

density, Theory of Probability & Its Applications 14(1): 153–158.

Escobar, M. D. (1988). Estimating the means of several normal populations by esti-

mating the distribution of the means, PhD thesis, Yale University.

Escobar, M. D. (1994). Estimating Normal Means with a Dirichlet Process Prior,

Journal of the American Statistical Association 89(425): 268–277.

Escobar, M. D. and West, M. (1995). Bayesian Density Estimation and Inference

Using Mixtures, Journal of the American Statistical Association 90(430): 577–588.

Farrell, R. H. (1972). On the best obtainable asymptotic rates of convergence in

estimation of a density function at a point, The Annals of Mathematical Statistics

43(1): 170–180.

Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems, The

Annals of Statistics 1(2): 209–230.

Ferguson, T. S. (1983). Bayesian Density Estimation by Mixtures of Normal Distri-

butions, Recent Advances in Statistics 24(1): 287–302.

Fernandez, C. and Steel, M. F. J. (1998). On Bayesian Modeling of Fat Tails and

Skewness, Journal of the American Statistical Association 93(441): 359–371.

Fix, E. and Hodges, J. (1949). Discriminatory Analysis. Nonparametric Discrimina-

tion: Consistency Properties, Technical report, USAF School of Aviation Medicine.

Geenens, G., Charpentier, A. and Paindaveine, D. (2017). Probit transformation for

nonparametric kernel estimation of the copula density, Bernoulli 23(3): 1848–1873.

Gelfand, A. E. and Kottas, A. (2002). A Computational Approach for Full Non-

parametric Bayesian Inference under Dirichlet Process Mixture Models, Journal of

Computational and Graphical Statistics 11(2): 289–305.

Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-Based Approaches to

Calculating Marginal Densities, Journal of the American Statistical Association

85(410): 398–409.

Ghosal, S., Ghosh, J. K. and Ramamoorthi, R. V. (1999). Posterior consistency of

Dirichlet mixtures in density estimation, The Annals of Statistics 27(1): 143–158.

Ghosal, S. and Van der Vaart, A. W. (2001). Entropies and rates of convergence for

maximum likelihood and Bayes estimation for mixtures of normal densities, The

Annals of Statistics 29(5): 1233–1263.

Ghosal, S. and Van der Vaart, A. W. (2007). Posterior convergence rates of Dirichlet

mixtures at smooth densities, The Annals of Statistics pp. 697–723.

Gibbs, A. L. and Su, F. E. (2002). On Choosing and Bounding Probability Metrics,

International Statistical Review 70(3): 419–435.

Gneiting, T. and Raftery, A. E. (2007). Strictly Proper Scoring Rules, Prediction, and

Estimation, Journal of the American Statistical Association 102(477): 359–378.

Griffin, J. E. (2010). Default priors for density estimation with mixture models,

Bayesian Analysis 5(1): 45–64.

Hall, P. and Marron, J. (1987). Estimation of integrated squared density derivatives,

Statistics & Probability Letters 6(2): 109–115.

Hall, P. and Wand, M. P. (1988). On the minimization of absolute distance in kernel

density estimation, Statistics and Probability Letters 6(5): 311–314.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under non-

standard conditions, Proceedings of the Fifth Berkeley Symposium on Mathemati-

cal Statistics and Probability, Volume 1: Statistics, Fifth Berkeley Symposium on

Mathematical Statistics and Probability, University of California Press, Berkeley,

Calif., pp. 221–233.

Ishwaran, H. and James, L. F. (2001). Methods for Stick-Breaking

Priors, Journal of the American Statistical Association 96(453): 161–173.

Ishwaran, H. and James, L. F. (2002). Approximate Dirichlet Process Comput-

ing in Finite Normal Mixtures, Journal of Computational and Graphical Statistics

11(3): 508–532.

Iwata, T., Duvenaud, D. and Ghahramani, Z. (2013). Warped Mixtures for Non-

parametric Cluster Shapes, Twenty-Ninth Conference on Uncertainty in Artificial

Intelligence, pp. 311–320.

Jain, S. and Neal, R. M. (2004). A Split-Merge Markov chain Monte Carlo Procedure

for the Dirichlet Process Mixture Model, Journal of Computational and Graphical

Statistics 13: 158–182.

Jain, S. and Neal, R. M. (2007). Splitting and merging components of a nonconjugate

Dirichlet process mixture model, Bayesian Analysis 2(3): 445–472.

Johnson, N. L. (1949). Systems of Frequency Curves Generated by Methods of Trans-

lation, Biometrika 36(1/2): 149–176.

Jones, M. and Sheather, S. (1991). Using non-stochastic terms to advantage in kernel-

based estimation of integrated squared density derivatives, Statistics & Probability

Letters 11(6): 511–514.

Kleijn, B. J. K. and Van der Vaart, A. W. (2012). The Bernstein-Von-Mises theorem

under misspecification, Electronic Journal of Statistics 6: 354–381.

Koekemoer, G. and Swanepoel, J. W. H. (2008). Transformation Kernel Density

Estimation with Applications, Journal of Computational and Graphical Statistics

17(3): 750–769.

Kottas, A. and Gelfand, A. E. (2001). Bayesian Semiparametric Median Regression

Modeling, Journal of the American Statistical Association 96(456): 1458–1468.

Kuo, L., Smith, A. F. M., MacEachern, S. N. and West, M. (1992). Bayesian Com-

putations in Survival Models Via the Gibbs Sampler, in J. P. Klein and P. K.

Goel (eds), Survival Analysis: State of the Art, Springer Netherlands, Dordrecht,

pp. 11–24.

Li, C., Lin, L. and Dunson, D. B. (2015). On Posterior Consistency of Tail Index for

Bayesian Kernel Mixture Models, ArXiv: 1511.02775, pp. 1–30.

Lijoi, A., Prunster, I. and Walker, S. G. (2005). On Consistency of Nonparamet-

ric Normal Mixtures for Bayesian Density Estimation, Journal of the American

Statistical Association 100(472): 1292–1296.

Lo, A. Y. (1984). On a Class of Bayesian Nonparametric Estimates: I. Density

Estimates, The Annals of Statistics 12(1): 351–357.

MacEachern, S. N. (1994). Estimating normal means with a conjugate style dirich-

let process prior, Communications in Statistics - Simulation and Computation

23(3): 727–741.

MacEachern, S. N. and Müller, P. (1998). Estimating Mixture of Dirichlet Process

Models, Journal of Computational and Graphical Statistics 7(2): 223–238.

Müller, P. and Rodriguez, A. (2013). Nonparametric Bayesian Inference, Institute of

Mathematical Statistics.

Neal, R. M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture

Models, Journal of Computational and Graphical Statistics 9(2): 249–265.

Orbanz, P. and Teh, Y. W. (2010). Bayesian Nonparametric Models, Encyclopedia of

Machine Learning pp. 81–89.

Papaspiliopoulos, O. and Roberts, G. O. (2008). Retrospective Markov chain Monte

Carlo methods for Dirichlet process hierarchical models, Biometrika 95(1): 169–

186.

Park, B. U. and Marron, J. S. (1990). Comparison of Data-Driven Bandwidth Selec-

tors, Journal of the American Statistical Association 85(409): 66–72.

Parzen, E. (1962). On Estimation of a Probability Density Function and Mode, The

Annals of Mathematical Statistics 33(3): 1065–1076.

Quintana, F. and Muller, P. (2004). Nonparametric Bayesian Data Analysis, Statis-

tical Science 19(1): 95–110.

Resnick, S. I. (2007). Heavy-tail phenomena: probabilistic and statistical modeling,

Springer, New York, N.Y.

Richardson, S. and Green, P. J. (1997). On Bayesian Analysis of Mixtures with an

Unknown Number of Components, Journal of the Royal Statistical Society. Series

B (Methodological) 59(4): 731–792.

Rosenblatt, M. (1956). Remarks on Some Nonparametric Estimates of a Density

Function, The Annals of Mathematical Statistics 27(3): 832–837.

Rubio, F. J. and Steel, M. F. J. (2014). Inference in two-piece location-scale models

with Jeffreys priors, Bayesian Analysis 9(1): 1–22.

Ruppert, D. and Cline, D. B. H. (1994). Bias Reduction in Kernel Density Estimation

by Smoothed Empirical Transformations, The Annals of Statistics 22(1): 185–210.

Ruppert, D. and Wand, M. P. (1992). Correcting for Kurtosis in Density Estimation,

Australian Journal of Statistics 34(March 1991): 19–29.

Schwartz, L. (1965). On Bayes procedures, Zeitschrift f¨urWahrscheinlichkeitstheorie

und Verwandte Gebiete 4(1): 10–26.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors, Statistica Sinica 4: 639–650.

Sheather, S. and Jones, M. (1991). A Reliable Data-Based Bandwidth Selection

Method for Kernel Density Estimation, Journal of the Royal Statistical Society.

Series B (Methodological) 53(3): 683–690.

Shen, W., Ghosal, S. and Tokdar, S. (2013). Adaptive Bayesian multivariate density

estimation with Dirichlet mixtures, Biometrika 100(3): 623–640.

Silverman, B. W. (1986). Density estimation for statistics and data analysis, Chapman and Hall, London.

Sklar, A. (1959). Fonctions de repartition a n dimensions et leurs marges, Technical

report, Institut Statistique de lUniversit´ede Paris, Paris.

Teh, Y. W. (2010). Dirichlet Process, Encyclopedia of Machine Learning pp. 280–287.

Terrell, G. R. and Scott, D. W. (1992). Variable Kernel Density Estimation, The

Annals of Statistics 20(3).

Tokdar, S. T. (2006). Posterior consistency of dirichlet location-scale mixture of

normals in density estimation and regression, Sankhya: The Indian Journal of

Statistics 68(1): 90–110.

Tressou, J. (2008). Bayesian nonparametrics for heavy tailed distribution. Application

to food risk assessment, Bayesian Analysis 3(2): 367–392.

Van der Vaart, A. W. (1998). Asymptotic statistics, Cambridge University Press,

Cambridge, UK; New York, NY, USA.

Wahba, G. (1981). Data-based optimal smoothing of orthogonal series density esti-

mates, The Annals of Statistics 9(1): 146–156.

Walker, S. G. (2007). Sampling the Dirichlet Mixture Model with Slices, Communi-

cations in Statistics - Simulation and Computation 36(1): 45–54.

Wand, M. and Devroye, L. (1993). How easy is a given density to estimate?, Com-

putational Statistics and Data Analysis 16(3): 311–323.

Wand, M. P. and Jones, M. C. (1994). Kernel Smoothing, Chapman & Hall/CRC

Monographs on Statistics & Applied Probability, Taylor & Francis.

Wand, M. P., Marron, J. S. and Ruppert, D. (1991). Transformations in density

estimation, Journal of the American Statistical Association 86(414): 343–353.

Wand, M. and Ripley, B. (2015). KernSmooth: Functions for kernel density estima-

tion supporting Wand & Jones (1995).

Wu, Y. and Ghosal, S. (2010a). The L1-consistency of Dirichlet mixtures in multivari-

ate Bayesian density estimation, Journal of Multivariate Analysis 101(10): 2411–

2419.

Wu, Y. and Ghosal, S. (2010b). The L1-consistency of Dirichlet mixtures in multivari-

ate Bayesian density estimation, Journal of Multivariate Analysis 101(10): 2411–

2419.

Xu, Z., Maceachern, S. N. and Xu, X. (2015). Modeling non-gaussian time series

with nonparametric bayesian model, IEEE Transactions on Pattern Analysis and

Machine Intelligence 37(2): 372–382.

Yang, L. (1995). Transformation-Density Estimation, PhD thesis, University of North

Carolina - Chapel Hill.

Yang, L. (2000). Root- n convergent transformation-kernel density estimation, Jour-

nal of 12(4): 447–474.

Yang, L. and Marron, J. S. (1999). Iterated transformation-kernel density estimation,

Journal of the American Statistical Association 94(446): 580–589.

Yeo, I.-K. and Johnson, R. A. (2000). A New Family of Power Transformations to

Improve Normality or Symmetry, Biometrika 87(4): 954–959.
