
ENCYCLOPEDIA WITH SEMANTIC COMPUTING Vol. 1, No. 1 (2016) article id (17 pages) © The Authors

Selected Topics in Statistical Computing

Suneel Babu Chatla†, Chun-houh Chen‡, and Galit Shmueli†

†Institute of Service Science, National Tsing Hua University, Hsinchu 30013, Taiwan R.O.C.
‡Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan R.O.C.

Received Day Month Year; Revised Day Month Year; Accepted Day Month Year; Published Day Month Year

The field of computational statistics refers to statistical methods or tools that are computationally intensive. Due to recent advances in computing power, some of these methods have become prominent and central to modern data analysis. In this article we focus on several of the main methods, including density estimation, kernel smoothing, smoothing splines, and additive models. While the field of computational statistics includes many more methods, this article serves as a brief introduction to selected popular topics.

Keywords: Histogram, Kernel density estimation, Local regression, Additive models, Splines, MCMC, Bootstrap

Introduction

"Let the data speak for themselves"

In 1962, John Tukey76 published an article on "the future of data analysis", which turned out to be extraordinarily clairvoyant. Specifically, he accorded algorithmic models the same foundation status as the algebraic models that statisticians had favored at that time. More than three decades later, in 1998, Jerome Friedman delivered a keynote speech23 in which he stressed the role of data-driven or algorithmic models in the next revolution of statistical computing. In response, the field of statistics has seen tremendous growth in research areas related to computational statistics.

According to the current Wikipedia entry on "Computational Statistics"a: "Computational statistics or statistical computing refers to the interface between statistics and computer science. It is the area of computational science (or scientific computing) specific to the mathematical science of statistics." Two well known examples of statistical computing methods are the bootstrap and Markov chain Monte Carlo (MCMC). These methods are prohibitive with insufficient computing power. While the bootstrap has gained significant popularity both in academic research and in practical applications, its feasibility still relies on efficient computing. Similarly, MCMC, which is at the core of Bayesian analysis, is computationally very demanding. A third method which has also become prominent in both academia and practice is nonparametric estimation. Today, nonparametric models are popular data analytic tools due to their flexibility, despite being very computationally intensive, and even prohibitively intensive with large datasets.

In this article, we provide summarized expositions for some of these important methods. The choice of methods highlights major computational methods for estimation and for inference. We do not aim to provide a comprehensive review of each of these methods, but rather a brief introduction. However, we compiled a list of references for readers interested in further information on any of these methods. For each of the methods, we provide the statistical definition and properties, as well as a brief illustration using an example dataset.

In addition to the aforementioned topics, the twenty-first century has witnessed tremendous growth in statistical computational methods such as functional data analysis, lasso, and machine learning methods such as random forests, neural networks, deep learning and support vector machines. Although most of these methods have roots in the machine learning field, they have become popular in the field of statistics as well. The recent book by Ref. 15 describes many of these topics.

The article is organized as follows. In Section 1, we open with nonparametric density estimation. Sections 2 and 3 discuss smoothing methods and their extensions. Specifically, Section 2 focuses on kernel smoothing while Section 3 introduces spline smoothing. Section 4 covers additive models, and Section 5 introduces Markov chain Monte Carlo (MCMC) methods. The final Section 6 is dedicated to two popular resampling methods: the bootstrap and the jackknife.

†The corresponding author can be reached at [email protected]. The first and third authors were supported in part by grant 105-2410-H-007-034-MY3 from the Ministry of Science and Technology in Taiwan. This is an Open Access article published by World Scientific Publishing Company. It is distributed under the terms of the Creative Commons Attribution 3.0 (CC-BY) License. Further distribution of this work is permitted, provided the original work is properly cited.
a https://en.wikipedia.org/wiki/Computational_statistics, accessed August 24, 2016


1. Density Estimation

A basic characteristic describing the behavior of any random variable X is its probability density function. Knowledge of the density function is useful in many aspects. By looking at the density function chart we can get a clear picture of whether the distribution is skewed, multi-modal, etc. In the simple case of a continuous random variable X over an interval X ∈ (a, b), the density is defined as

P(a < X < b) = \int_a^b f(x)\,dx.

In most practical studies the density of X is not directly available. Instead, we are given a set of n observations x_1, ..., x_n that we assume are iid realizations of the random variable. We then aim to estimate the density on the basis of these observations. There are two basic estimation approaches: the parametric approach, which consists of representing the density with a finite set of parameters, and the nonparametric approach, which does not restrict the possible form of the density function by assuming it belongs to a pre-specified family of density functions.

In parametric estimation only the parameters are unknown. Hence, the density estimation problem is equivalent to estimating the parameters. However, in the nonparametric approach, one must estimate the entire distribution. This is because we make no assumptions about the density function.

1.1. Histogram

The oldest and most widely used density estimator is the histogram. Detailed discussions are found in Refs. 65 and 34. Using the definition of derivatives we can write the density in the following form:

f(x) \equiv \frac{d}{dx}F(x) \equiv \lim_{h \to 0} \frac{F(x+h) - F(x)}{h},   (1)

where F(x) is the cumulative distribution function of the random variable X. A natural finite sample analog of equation (1) is to divide the real line into K equi-sized bins with small bin width h and replace F(x) with the empirical cumulative distribution function

\hat{F}(x) = \frac{\#\{x_i \le x\}}{n}.

This leads to the empirical density function estimate

\hat{f}(x) = \frac{(\#\{x_i \le b_{j+1}\} - \#\{x_i \le b_j\})/n}{h}, \quad x \in (b_j, b_{j+1}],

where (b_j, b_{j+1}] defines the boundaries of the jth bin and h = b_{j+1} - b_j. If we define n_j = \#\{x_i \le b_{j+1}\} - \#\{x_i \le b_j\}, then

\hat{f}(x) = \frac{n_j}{nh}.   (2)

The same histogram estimate can also be obtained using maximum likelihood estimation methods. Here, we try to find a density \hat{f} maximizing the likelihood of the observations,

\prod_{i=1}^n \hat{f}(x_i).   (3)

Since the above likelihood (or its logarithm) cannot be maximized directly, penalized maximum likelihood estimation can be used to obtain the histogram estimate.

Next, we proceed to calculate the bias, variance and MSE of the histogram estimator. These properties give us an idea of the accuracy and precision of the estimator. If we define

B_j = [x_0 + (j-1)h, \ x_0 + jh), \quad j \in \mathbb{Z},

with x_0 being the origin of the histogram, then the histogram estimator can be formally written as

\hat{f}_h(x) = (nh)^{-1} \sum_{i=1}^n \sum_j I(X_i \in B_j) I(x \in B_j).   (4)

We now derive the bias of the histogram estimator. Assume that the origin of the histogram x_0 is zero and x ∈ B_j. Since the X_i are identically distributed,

E[\hat{f}_h(x)] = (nh)^{-1} \sum_{i=1}^n E[I(X_i \in B_j)] = (nh)^{-1} n E[I(X \in B_j)] = h^{-1} \int_{(j-1)h}^{jh} f(u)\,du.

This last term is not equal to f(x) unless f(x) is constant in B_j. For simplicity, assume f(x) = a + cx, x ∈ B_j and a, c ∈ \mathbb{R}. Therefore

Bias[\hat{f}_h(x)] = E[\hat{f}_h(x)] - f(x) = h^{-1} \int_{B_j} (f(u) - f(x))\,du = h^{-1} \int_{B_j} (a + cu - a - cx)\,du = h^{-1} h c \left( \left(j - \tfrac{1}{2}\right)h - x \right) = c \left( \left(j - \tfrac{1}{2}\right)h - x \right).

Instead of the slope c we may write the first derivative of the density at the midpoint (j - 1/2)h of the bin B_j:

Bias(\hat{f}_h(x)) = f'\!\left(\left(j - \tfrac{1}{2}\right)h\right) \left( \left(j - \tfrac{1}{2}\right)h - x \right) = O(1)\,O(h) = O(h), \quad h \to 0.

When f is not linear, a Taylor expansion of f to the first order reduces the problem to the linear case. Hence the bias of the histogram is given by

Bias(\hat{f}_h(x)) = \left( \left(j - \tfrac{1}{2}\right)h - x \right) f'\!\left(\left(j - \tfrac{1}{2}\right)h\right) + o(h), \quad h \to 0.   (5)
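Equation (2) is easy to verify numerically. The following R sketch (the simulated sample, origin, and bin width are arbitrary choices, not taken from the paper) computes n_j/(nh) by hand and checks it against the density values reported by hist().

# A minimal check of the histogram estimator in equation (2); sample,
# origin x0 and bin width h are arbitrary illustrative choices.
set.seed(1)
x      <- rnorm(500)
h      <- 0.5                                 # bin width
breaks <- seq(-5, 5, by = h)                  # bins B_j with origin x0 = -5
nj     <- as.numeric(table(cut(x, breaks)))   # bin counts n_j
fhat   <- nj / (length(x) * h)                # f_hat(x) = n_j / (n h)
# hist() reports the same values in its 'density' component
all.equal(fhat, hist(x, breaks = breaks, plot = FALSE)$density)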


Similarly, the variance for the histogram estimator can be calculated as

Var(\hat{f}_h(x)) = Var\left( (nh)^{-1} \sum_{i=1}^n I(X_i \in B_j) \right) = (nh)^{-2} \sum_{i=1}^n Var[I(X_i \in B_j)] = n^{-1} h^{-2} Var[I(X \in B_j)]
 = n^{-1} h^{-2} \left( \int_{B_j} f(u)\,du \right)\left( 1 - \int_{B_j} f(u)\,du \right) = (nh)^{-1} \left( h^{-1} \int_{B_j} f(u)\,du \right)(1 - O(h)) = (nh)^{-1} (f(x) + o(1)), \quad h \to 0, \ nh \to \infty.

Bin width choice is crucial in constructing a histogram. As illustrated in Figure 1, bandwidth choice affects the bias-variance trade-off. The top three panels show histograms for one normal random sample with three different bin sizes. Similarly, the bottom three histograms are from another normal sample. From the plot it can be seen that the histograms with larger bin width have smaller variability but larger bias, and vice versa. Hence we need to strike a balance between bias and variance to come up with a good histogram estimator.

Fig. 1. Histograms for two randomly simulated normal samples with 5 bins (left), 15 bins (middle), and 26 bins (right).

We observe that the variance of the histogram is proportional to f(x) and decreases as nh increases. This conflicts with the fact that the bias of the histogram decreases as h decreases. To find a compromise we consider the mean squared error (MSE):

MSE(\hat{f}_h(x)) = Var(\hat{f}_h(x)) + (Bias(\hat{f}_h(x)))^2 = \frac{1}{nh} f(x) + \left( \left(j - \tfrac{1}{2}\right)h - x \right)^2 f'\!\left(\left(j - \tfrac{1}{2}\right)h\right)^2 + o(h) + o\!\left(\frac{1}{nh}\right).

In order for the histogram estimator to be consistent, the MSE should converge to zero asymptotically, which means that the bin width should get smaller while the number of observations per bin n_j gets larger as n → ∞. Thus, under nh → ∞ and h → 0, the histogram estimator is consistent: \hat{f}_h(x) \xrightarrow{p} f(x).

Implementing the MSE formula is difficult in practice because of the unknown density involved. In addition, it would have to be calculated for each and every point. Instead of looking at the estimate at one particular point, it might be worth calculating a measure of goodness of fit for the entire histogram. For this reason the mean integrated squared error (MISE) is used. It is defined as:

MISE(\hat{f}_h(x)) = E\left[ \int_{-\infty}^{\infty} (\hat{f} - f)^2(x)\,dx \right] = \int_{-\infty}^{\infty} MSE(\hat{f}_h(x))\,dx = (nh)^{-1} + \frac{h^2}{12} \|f'\|_2^2 + o(h^2) + o((nh)^{-1}).

Note that \|f'\|_2^2 (dx is omitted in the shorthand notation) is the square of the L_2 norm of f', which describes how smooth the density function f is. The common approach for minimizing MISE is to minimize it as a function of h without the higher order terms (asymptotic MISE, or AMISE). The minimizer h_0, called an optimal bandwidth, can be obtained by differentiating AMISE with respect to h:

h_0 = \left( \frac{6}{n \|f'\|_2^2} \right)^{1/3}.   (6)

Hence we see that for minimizing AMISE we should theoretically choose h_0 \sim n^{-1/3}, which, if substituted into the MISE formula, gives the best convergence rate O(n^{-2/3}) for sufficiently large n. Again, the solution of equation (6) does not help much as it involves f', which is still unknown. However, this problem can be overcome by using a reference distribution (e.g., Gaussian). This method is often called the "plug-in" method.

1.2. Kernel Density Estimation

The idea of the kernel estimator was introduced by Ref. 57. Using the definition of the probability density, suppose X has density f. Then

f(x) = \lim_{h \to 0} \frac{1}{2h} P(x - h < X < x + h).

For any given h, we can estimate the probability P(x - h < X < x + h) by the proportion of the observations falling in the interval (x - h, x + h). Thus a naive estimator \hat{f} of the density is given by

\hat{f}(x) = \frac{1}{2hn} \sum_{i=1}^n I_{(x-h, x+h)}(X_i).
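The naive estimator above is simple to code directly. The sketch below (sample, bandwidth h and evaluation grid are arbitrary illustrative choices) evaluates it on a grid; its visibly stepwise character motivates the smoother kernel weights introduced next.

# Direct implementation of the naive moving-window density estimator;
# sample, bandwidth h and grid are arbitrary illustrative choices.
set.seed(1)
X <- rnorm(100)
naive_density <- function(x, data, h) {
  sapply(x, function(x0) sum(data > x0 - h & data < x0 + h) / (2 * h * length(data)))
}
grid <- seq(-3, 3, length.out = 121)
fhat <- naive_density(grid, X, h = 0.5)
plot(grid, fhat, type = "s", ylab = "density")   # stepwise, not differentiable everywhere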


To express the estimator more formally, we define the weight function w by

w(x) = \begin{cases} \tfrac{1}{2} & \text{if } |x| < 1 \\ 0 & \text{otherwise.} \end{cases}

Then it is easy to see that the above naive estimator can be written as

\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} w\!\left( \frac{x - X_i}{h} \right).   (7)

However, the naive estimator is not wholly satisfactory because \hat{f}(x) is of a "stepwise" nature and not differentiable everywhere. We therefore generalize the naive estimator to overcome some of these difficulties by replacing the weight function w with a kernel function K which satisfies the conditions

\int K(t)\,dt = 1, \quad \int t K(t)\,dt = 0, \quad \text{and} \quad \int t^2 K(t)\,dt = k_2 \neq 0.

Usually, but not always, K will be a symmetric probability density function. Now the kernel density estimator becomes

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left( \frac{x - X_i}{h} \right).

From the kernel density definition it can be observed that:

• Kernel functions are symmetric around 0 and integrate to 1.
• Since the kernel is a density function, the kernel estimate is a density too: \int K(x)\,dx = 1 implies \int \hat{f}_h(x)\,dx = 1.
• The smoothness of the kernel is inherited by \hat{f}_h(x). If K is n times continuously differentiable, then \hat{f}_h(x) is also n times continuously differentiable.
• Unlike histograms, kernel estimates do not depend on the choice of origin.
• Usually kernels are positive to assure that \hat{f}_h(x) is a density. There are reasons to consider negative kernels, but then \hat{f}_h(x) may sometimes be negative.

We next consider the bias of the kernel estimate:

Bias[\hat{f}_h(x)] = E[\hat{f}_h(x)] - f(x) = \int K(s) f(x + sh)\,ds - f(x) = \int K(s)\left[ f(x) + sh f'(x) + \frac{h^2 s^2}{2} f''(x) + o(h^2) \right] ds - f(x) = \frac{h^2}{2} f''(x) k_2 + o(h^2), \quad h \to 0.

For the proof see Ref. 53. We see that the bias is quadratic in h. Hence we must choose h small to reduce the bias. Similarly, the variance for the kernel estimate can be written as

var(\hat{f}_h(x)) = n^{-2} \sum_{i=1}^n var[K_h(x - X_i)] = n^{-1} var[K_h(x - X)] = (nh)^{-1} f(x) \int K^2 + o((nh)^{-1}), \quad nh \to \infty.

Similar to the histogram case, we observe a bias-variance tradeoff. The variance is nearly proportional to (nh)^{-1}, which requires choosing h large to minimize the variance. However, this contradicts the aim of decreasing the bias by choosing a small h. From a smoothing perspective, a smaller bandwidth results in under-smoothing and a larger bandwidth results in over-smoothing. From the illustration in Figure 2, we see that when the bandwidth is too small (left) the kernel estimator under-smoothes the true density, and when the bandwidth is large (right) the kernel estimator over-smoothes the underlying density. Therefore we consider the MISE or MSE as a function of h as a compromise:

MSE[\hat{f}_h(x)] = \frac{1}{nh} f(x) \int K^2 + \frac{h^4}{4} (f''(x) k_2)^2 + o((nh)^{-1}) + o(h^4), \quad h \to 0, \ nh \to \infty.

Note that MSE[\hat{f}_h(x)] converges to zero if h → 0 and nh → ∞. Thus the kernel density estimate is consistent, that is, \hat{f}_h(x) \xrightarrow{p} f(x). On the whole, the variance term in the MSE penalizes under-smoothing and the bias term penalizes over-smoothing.

Fig. 2. Kernel densities with three bandwidth choices (0.1, 0.5, and 1) for a sample from an exponential distribution.

Further, the asymptotically optimal bandwidth can be obtained by differentiating the MSE with respect to h and equating it to zero:

h_0 = \left( \frac{\int K^2}{(f''(x))^2 k_2^2\, n} \right)^{1/5}.

It can be further verified that if we substitute this bandwidth into the MISE formula then

MISE(\hat{f}_{h_0}) = \frac{5}{4} \left( \int K^2 \right)^{4/5} k_2^{2/5} \left( \int f''(x)^2 \right)^{1/5} n^{-4/5} = \frac{5}{4} C(K) \left( \int f''(x)^2 \right)^{1/5} n^{-4/5}, \quad \text{where} \quad C(K) = \left( \int K^2 \right)^{4/5} k_2^{2/5}.

From the above formula it can be observed that we should choose a kernel K with a small value of C(K), all other things being equal. The problem of minimizing C(K) can be reduced to that of minimizing \int K^2 by allowing suitably rescaled versions of kernels. In a different context, Ref. 38 showed that this problem is solved by setting K to be the Epanechnikov kernel17 (see Table 1).
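As an illustration of the kernel estimator, the sketch below implements it by hand with an Epanechnikov kernel on [-1, 1] (a common rescaling of the version given in Table 1); the exponential sample and the bandwidth values echo Figure 2 but are otherwise arbitrary choices.

# Hand-rolled kernel density estimate with an Epanechnikov kernel on [-1, 1];
# sample, grid and bandwidths are arbitrary illustrative choices.
epan <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
kde  <- function(x, data, h) sapply(x, function(x0) mean(epan((x0 - data) / h)) / h)

set.seed(1)
X    <- rexp(200)
grid <- seq(0, 8, length.out = 200)
f_under <- kde(grid, X, h = 0.1)   # under-smoothed
f_mid   <- kde(grid, X, h = 0.5)
f_over  <- kde(grid, X, h = 1)     # over-smoothed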


We define the efficiency of any symmetric kernel K by comparing it to the Epanechnikov kernel:

eff(K) = \left\{ \frac{C(K_e)}{C(K)} \right\}^{5/4} = \frac{3}{5\sqrt{5}} \, k_2^{-1/2} \left( \int K^2 \right)^{-1}.

The reason for the power 5/4 in the above equation is that, for large n, the MISE will be the same whether we use n observations with kernel K or n·eff(K) observations with the kernel K_e72. Some kernels and their efficiencies are given in Table 1.

Table 1. Definitions of some kernels and their efficiencies.

  Kernel        K(t)                                                  Efficiency
  Rectangular   1/2 for |t| <= 1; 0 otherwise                         0.9295
  Epanechnikov  (3/4)(1 - t^2/5)/sqrt(5) for |t| <= sqrt(5); 0 o.w.   1
  Biweight      (15/16)(1 - t^2)^2 for |t| <= 1; 0 otherwise          0.9939
  Triweight     (35/32)(1 - t^2)^3 for |t| <= 1; 0 otherwise          0.987
  Triangular    1 - |t| for |t| <= 1; 0 otherwise                     0.9859
  Gaussian      (1/sqrt(2*pi)) exp(-t^2/2)                            0.9512

The top four kernels are particular cases of the following family:

K(x; p) = \{2^{2p+1} B(p+1, p+1)\}^{-1} (1 - x^2)^p I_{\{|x| < 1\}},

where B(·, ·) is the beta function. These kernels are symmetric beta densities on the interval [-1, 1]. For p = 0 the expression gives rise to the rectangular density, p = 1 to the Epanechnikov kernel, and p = 2 and p = 3 to the biweight and triweight kernels, respectively. The standard normal density is obtained as the limiting case p → ∞.

Similar to the histogram scenario, the problem of choosing the bandwidth (smoothing parameter) is of crucial importance in density estimation. A natural method for choosing the bandwidth is to plot several curves and choose the estimate that is most desirable. For many applications this approach is satisfactory. However, there is a need for data-driven and automatic procedures that are practical and have fast convergence rates. The problem of bandwidth selection has stimulated much research in kernel density estimation. The main approaches include cross-validation and "plug-in" methods (see the review paper by Ref. 39).

As an example, consider least squares cross validation, which was suggested by Refs. 58 and 3; see also Refs. 4, 32 and 69. Given any estimator \hat{f} of a density f, the integrated squared error can be written as

\int (\hat{f} - f)^2 = \int \hat{f}^2 - 2 \int \hat{f} f + \int f^2.

Since the last term (\int f^2) does not involve \hat{f}, the relevant quantity to minimize, R(\hat{f}), is

R(\hat{f}) = \int \hat{f}^2 - 2 \int \hat{f} f.

The basic principle of least squares cross validation is to construct an estimate of R(\hat{f}) from the data themselves and then to minimize this estimate over h to give the choice of window width. The term \int \hat{f}^2 can be found from the estimate \hat{f}. Further, if we define \hat{f}_{-i} as the density estimate constructed from all the data points except X_i,

\hat{f}_{-i}(x) = (n-1)^{-1} h^{-1} \sum_{j \neq i} K_h(x - X_j),

then \int \hat{f} f can be estimated and we can define the required quantity, without any unknown term f, as

M_0(h) = \int \hat{f}^2 - 2 n^{-1} \sum_i \hat{f}_{-i}(X_i).

The score M_0 depends only on the data, and the idea of least squares cross validation is to minimize this score over h. There also exists a computationally simple approach to estimating M_0; Ref. 69 provided the large sample properties for this estimator. Thus, asymptotically, least squares cross validation achieves the best possible choice of smoothing parameter, in the sense of minimizing the integrated squared error. For further details see Ref. 63.

Another possible approach, related to "plug-in" estimators, is to use a standard family of distributions as a reference family for the value of \int f''(x)^2 dx, which is the only unknown in the optimal bandwidth formula h_{opt}. For example, if we consider the normal distribution with variance σ², then

\int f''(x)^2\,dx = \sigma^{-5} \int \phi''(x)^2\,dx = \frac{3}{8} \pi^{-1/2} \sigma^{-5} \approx 0.212\,\sigma^{-5}.

If a Gaussian kernel is used, then the optimal bandwidth is

h_{opt} = (4\pi)^{-1/10} \left( \frac{3}{8} \pi^{-1/2} \right)^{-1/5} \sigma n^{-1/5} = \left( \frac{4}{3} \right)^{1/5} \sigma n^{-1/5} = 1.06\,\sigma n^{-1/5}.

A quick way of choosing the bandwidth is therefore to estimate σ from the data and substitute it into the above formula. While this works well if the normal distribution is a reasonable approximation, it may oversmooth if the population is multimodal. Better results can be obtained using a robust measure of spread such as the interquartile range R, which in this example yields h_{opt} = 0.79 R n^{-1/5}. Similarly, one can improve this further by taking the minimum of the standard deviation and the interquartile range divided by 1.34.
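In R, the normal-reference and cross-validation rules just described are available as bandwidth selectors for density(); the sketch below (with an arbitrary simulated sample) also computes the 1.06σn^{-1/5} rule by hand. Note that bw.nrd() uses the closely related interquartile-range constant 1.349.

# Rule-of-thumb and cross-validation bandwidths for a simulated sample
# (the sample itself is an arbitrary illustrative choice).
set.seed(1)
x <- rnorm(200)
n <- length(x)
1.06 * sd(x) * n^(-1/5)      # normal reference rule computed by hand
bw.nrd(x)                    # built-in normal reference, min(sd, IQR/1.349) version
bw.ucv(x)                    # least squares (unbiased) cross-validation
density(x, bw = bw.ucv(x))   # kernel estimate using the CV bandwidth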


For most applications these bandwidths are easy to evaluate and serve as a good starting value.

Although the underlying idea is very simple and the first paper was published long ago, the kernel approach did not make much progress until recently, with advances in computing power. At present, without an efficient algorithm, the calculation of a kernel density for moderately large datasets can become prohibitive. The direct use of the above formulas for computation is very inefficient. Researchers developed fast and efficient Fourier transformation methods to calculate the estimate, using the fact that the kernel estimate can be written as a convolution of the data and the kernel function43.

2. Kernel Smoothing

The problem of smoothing sequences of observations is important in many branches of science, as demonstrated by the number of different fields in which smoothing methods have been applied. Early contributions were made in fields as diverse as astronomy, actuarial science, and economics. Despite their long history, local regression methods received little attention in the statistics literature until the late 1970s. Initial work includes the mathematical development of Refs. 67, 40 and 68, and the LOWESS procedure of Ref. 9. Recent work on local regression includes Refs. 19, 20 and 36. The local regression method was developed largely as an extension of parametric regression methods and accompanied by an elegant finite sample theory of linear estimation methods that builds on theoretical results for parametric regression. It is a method for curve estimation by fitting locally weighted least squares regressions. One extension of local linear regression, called local polynomial regression, is discussed in Ref. 59 and in the monograph by Ref. 21.

Assume that (X_1, Y_1), ..., (X_n, Y_n) are iid observations with conditional mean and conditional variance denoted respectively by

m(x) = E(Y | X = x) \quad \text{and} \quad \sigma^2(x) = Var(Y | X = x).   (8)

Many important applications involve estimation of the regression function m(x) or its νth derivative m^{(ν)}(x). The performance of an estimator \hat{m}_\nu(x) of m^{(ν)}(x) is assessed via its MSE or MISE, defined in previous sections. While the MSE criterion is used when the main objective is to estimate the function at the point x, the MISE criterion is used when the main goal is to recover the whole curve.

2.1. Nadaraya-Watson Estimator

If we do not assume a specific form for the regression function m(x), then a data point remote from x carries little information about the value of m(x). In such a case, an intuitive estimator of the conditional mean function is the running locally weighted average. If we consider a kernel K with bandwidth h as the weight function, the Nadaraya-Watson kernel regression estimator is given by

\hat{m}_h(x) = \frac{\sum_{i=1}^n K_h(X_i - x) Y_i}{\sum_{i=1}^n K_h(X_i - x)},   (9)

where K_h(\cdot) = K(\cdot/h)/h. For more details see Refs. 50, 80 and 33.

2.2. Gasser-Müller Estimator

Assume that the data have already been sorted according to the X variable. Ref. 24 proposed the following estimator:

\hat{m}_h(x) = \sum_{i=1}^n \int_{s_{i-1}}^{s_i} K_h(u - x)\,du \; Y_i,   (10)

with s_i = (X_i + X_{i+1})/2, X_0 = -\infty and X_{n+1} = +\infty. The weights in equation (10) add up to 1, so there is no need for a denominator as in the Nadaraya-Watson estimator. Although it was originally developed for equispaced designs, the Gasser-Müller estimator can also be used for non-equispaced designs. For the asymptotic properties please refer to Refs. 44 and 8.

2.3. Local Linear Estimator

This estimator assumes that locally the regression function m can be approximated by

m(z) \approx \sum_{j=0}^p \frac{m^{(j)}(x)}{j!} (z - x)^j \equiv \sum_{j=0}^p \beta_j (z - x)^j,   (11)

for z in a neighborhood of x, by using a Taylor expansion. Using a local least squares formulation, the model coefficients can be estimated by minimizing the following function:

\sum_{i=1}^n \left\{ Y_i - \sum_{j=0}^p \beta_j (X_i - x)^j \right\}^2 K_h(X_i - x),   (12)

where K(·) is a kernel function with bandwidth h. If we let \hat{\beta}_j (j = 0, ..., p) be the estimates obtained from minimizing equation (12), then the estimators for the regression function and its derivatives are obtained as

\hat{m}_\nu(x) = \nu! \, \hat{\beta}_\nu.   (13)

When p = 1 the estimator \hat{m}_0(x) is termed a local linear smoother or local linear estimator, with the following explicit expression:

\hat{m}_0(x) = \frac{\sum_{i=1}^n w_i Y_i}{\sum_{i=1}^n w_i}, \quad w_i = K_h(X_i - x)\{ S_{n,2} - (X_i - x) S_{n,1} \},   (14)

where S_{n,j} = \sum_{i=1}^n K_h(X_i - x)(X_i - x)^j. When p = 0, the local linear estimator equals the Nadaraya-Watson estimator. Also, both the Nadaraya-Watson and Gasser-Müller estimators are local least squares estimators, with weights w_i = K_h(X_i - x) and w_i = \int_{s_{i-1}}^{s_i} K_h(u - x)\,du, respectively.
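Both estimators are easy to try in R: ksmooth() gives a Nadaraya-Watson fit and locpoly() from the KernSmooth package (the package used later for Figure 3) gives local polynomial fits. The bandwidth values below are arbitrary choices, and the two functions scale their bandwidth arguments differently, so they are not directly comparable.

# Nadaraya-Watson versus local linear fits for the trees data; bandwidth
# values are arbitrary and scaled differently by the two functions.
library(KernSmooth)                    # provides locpoly()
data(trees)
x <- log(trees$Girth); y <- log(trees$Volume)

nw <- ksmooth(x, y, kernel = "normal", bandwidth = 0.3)   # local constant
ll <- locpoly(x, y, degree = 1, bandwidth = 0.1)          # local linear

plot(x, y, xlab = "log(Girth)", ylab = "log(Volume)")
lines(nw, lty = 2)
lines(ll, lty = 1)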


The asymptotic results are provided in Table 2, which is taken from Ref. 19.

Table 2. Comparison of asymptotic properties of local estimators.

  Method            Bias                                 Variance
  Nadaraya-Watson   (m''(x) + 2 m'(x) f'(x)/f(x)) b_n    V_n
  Gasser-Müller     m''(x) b_n                           1.5 V_n
  Local linear      m''(x) b_n                           V_n

  Note: Here, b_n = \frac{1}{2} \int_{-\infty}^{\infty} u^2 K(u)\,du \; h^2 and V_n = \frac{\sigma^2(x)}{f(x) n h} \int_{-\infty}^{\infty} K^2(u)\,du.

To illustrate both the Nadaraya-Watson and local linear fits on data, we considered an example dataset on trees2,60. This dataset includes measurements of the girth, height, and volume of timber in 31 felled black cherry trees. The smooth fit results are described in Figure 3. The fits were produced using the KernSmooth79 package in the R software73. In the right panel of Figure 3, we see that for a larger bandwidth, as expected, Nadaraya-Watson fits a global constant model while the local linear estimator fits a linear model. Further, from the left panel, where we used a reasonable bandwidth, the Nadaraya-Watson estimator has more bias than the local linear fit.

Fig. 3. Comparison of local linear fit versus Nadaraya-Watson fit for a model of (log) volume as a function of (log) girth, with choice of two bandwidths: 0.1 (left) and 0.5 (right).

In comparison with the local linear fit, the Nadaraya-Watson estimator locally uses one parameter less without reducing the asymptotic variance. It suffers from large bias, particularly in regions where the derivative of the regression function or of the design density is large. Also, it does not adapt to non-uniform designs. In addition, it was shown that the Nadaraya-Watson estimator has zero minimax efficiency; for details see Ref. 19. Based on the definition of minimax efficiency, a 90% efficient estimator uses only 90% of the data, which means that the Nadaraya-Watson estimator does not use all the available data. In contrast, the Gasser-Müller estimator corrects the bias of the Nadaraya-Watson estimator, but at the expense of increased variability for random designs. Further, both the Nadaraya-Watson and Gasser-Müller estimators have a larger order of bias when estimating a curve in the boundary region. Comparisons between the local linear and local constant (Nadaraya-Watson) fits were discussed in detail by Refs. 8, 19 and 36.

2.4. Computational Considerations

Recent proposals for fast implementations of nonparametric curve estimators include binning methods and updating methods. Ref. 22 gave careful speed comparisons of these two fast implementations and direct naive implementations under a variety of settings and using various machines and software. Both fast methods turned out to be much faster, with negligible differences in accuracy. While the key idea of the binning method is to bin the data and compute the required quantities based on the binned data, the key idea of the updating method is to update the quantities previously computed. It has been reported that for practical purposes neither method dominates the other.

3. Smoothing Using Splines

Similar to local linear estimators, another family of methods that provide flexible data modeling is spline methods. These methods involve fitting piecewise polynomials, or splines, to allow the regression function to have discontinuities at certain locations, which are called "knots"18,78,30.

3.1. Polynomial Spline

Suppose that we want to approximate the unknown regression function m by a cubic spline function, that is, a piecewise polynomial with continuous first two derivatives. Let t_1, ..., t_J be a fixed knot sequence such that -\infty < t_1 < ... < t_J < +\infty. Then the cubic spline functions are twice continuously differentiable functions s such that the restriction of s to each of the intervals (-\infty, t_1], [t_1, t_2], ..., [t_{J-1}, t_J], [t_J, +\infty) is a cubic polynomial. The collection of all these cubic spline functions forms a (J + 4)-dimensional linear space. There exist two popular cubic spline bases for this linear space:

Power basis: 1, x, x^2, x^3, (x - t_j)_+^3, \ (j = 1, ..., J).
B-spline basis: The ith B-spline of degree p = 3, written as N_{i,p}(u), is defined recursively as:

N_{i,0}(u) = \begin{cases} 1 & \text{if } u_i \le u \le u_{i+1} \\ 0 & \text{otherwise,} \end{cases}

N_{i,p}(u) = \frac{u - u_i}{u_{i+p} - u_i} N_{i,p-1}(u) + \frac{u_{i+p+1} - u}{u_{i+p+1} - u_{i+1}} N_{i+1,p-1}(u).

The above is usually referred to as the Cox-de Boor recursion formula.

The B-spline basis is typically numerically more stable because the multiple correlation among the basis functions is smaller, but the power basis has the advantage that it provides an easier interpretation of the knots, so that deleting a particular basis function is the same as deleting that particular knot. The direct estimation of the regression function m depends on the choice of knot locations and the number of knots.
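A cubic regression spline with a B-spline basis can be fitted by ordinary least squares using the splines package; in the sketch below the interior knots (quartiles of the predictor) are an arbitrary choice for illustration.

# Cubic regression spline for the trees data using a B-spline basis;
# the interior knots (quartiles of log girth) are an arbitrary choice.
library(splines)                       # provides bs()
data(trees)
x <- log(trees$Girth); y <- log(trees$Volume)
knots <- quantile(x, c(0.25, 0.5, 0.75))
fit <- lm(y ~ bs(x, knots = knots, degree = 3))
plot(x, y)
lines(sort(x), fitted(fit)[order(x)])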


There exist some methods based on the knot-deletion idea. For full details please see Refs. 41 and 42.

3.2. Smoothing Spline

Consider the following objective function:

\sum_{i=1}^n \{ Y_i - m(X_i) \}^2.   (15)

Minimizing this function gives the best possible estimate for the unknown regression function. The major problem with the above objective function is that any function m that interpolates the data satisfies it, thereby leading to overfitting. To avoid this, a penalty for overparametrization is imposed on the function. A convenient way of introducing such a penalty is via the roughness penalty approach. The following function is minimized:

\sum_{i=1}^n \{ Y_i - m(X_i) \}^2 + \lambda \int \{ m''(x) \}^2 dx,   (16)

where λ > 0 is a smoothing parameter. The first term penalizes the lack of fit, which is in some sense modeling bias. The second term denotes the roughness penalty, which is related to overparametrization. It is evident that λ = 0 yields interpolation and λ = ∞ yields linear regression, the typical undersmoothing and oversmoothing extremes. Hence, the estimate obtained from the objective function, which also depends on the smoothing parameter λ, is called the smoothing spline estimator. For local properties of this estimator, please refer to Ref. 51.

It is well known that the solution to the minimization of (16) is a cubic spline on the interval [X_{(1)}, X_{(n)}] and it is unique in this data range. Moreover, the estimator is a linear smoother with weights that do not depend on the response {Y_i}33. The connections between kernel regression, which we discussed in the previous section, and smoothing splines have been critically studied by Refs. 62 and 64.

The smoothing parameter λ can be chosen by minimizing the cross-validation (CV)71,1 or generalized cross validation (GCV)77,11 criteria. Both quantities are consistent estimates of the MISE of the smoothing spline estimator. For other methods and details please see Ref. 78. Further, for computational issues please refer to Refs. 78 and 18.

4. Additive Models

While the smoothing methods discussed in the previous section are mostly univariate, the additive model is a widely used multivariate smoothing technique. An additive model is defined as

Y = \alpha + \sum_{j=1}^p f_j(X_j) + \epsilon,   (17)

where the errors ε are independent of the X_j and have mean E(ε) = 0 and variance var(ε) = σ². The f_j are arbitrary univariate functions, one for each predictor. Since each variable is represented separately, the model retains the interpretative ease of a linear model.

The most general method for estimating additive models allows us to estimate each function by an arbitrary smoother. Some possible candidates are smoothing splines and kernel smoothers. The backfitting algorithm is a general purpose algorithm that enables one to fit additive models with any kind of smoothing functions, although for specific smoothing functions such as smoothing splines or penalized splines there exist separate estimation methods based on least squares. The backfitting algorithm is an iterative algorithm and consists of the following steps (a minimal code sketch is given below):

(i) Initialize: α = ave(y_i), f_j = f_j^0, j = 1, ..., p.
(ii) Cycle: for j = 1, ..., p repeat f_j = S_j\left( y - \alpha - \sum_{k \neq j} f_k \,\middle|\, x_j \right).
(iii) Continue (ii) until the individual functions do not change,

where S_j(y | x_j) denotes a smooth of the response y against the predictor x_j. The motivation for the backfitting algorithm can be understood using conditional expectation. If the additive model is correct, then for any k, E\left( Y - \alpha - \sum_{j \neq k} f_j(X_j) \,\middle|\, X_k \right) = f_k(X_k). This suggests the appropriateness of the backfitting algorithm for computing all the f_j.

Ref. 70 showed, in the context of regression splines (OLS estimation of spline models), that the additive model has the desirable property of reducing a full p-dimensional nonparametric regression problem to one that can be fitted with the same asymptotic efficiency as a univariate problem. Due to the lack of explicit expressions, the earlier research by Ref. 5 studied only the bivariate additive model in detail and showed that both the convergence of the algorithm and the uniqueness of its solution depend on the behavior of the product of the two smoother matrices. Later, Refs. 52 and 45 extended the convergence theory to p dimensions. For more details, please see Refs. 5, 45 and 52.

We can write all the estimating equations in a compact form:

\begin{pmatrix} I & S_1 & S_1 & \cdots & S_1 \\ S_2 & I & S_2 & \cdots & S_2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ S_p & S_p & S_p & \cdots & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 y \\ S_2 y \\ \vdots \\ S_p y \end{pmatrix}, \qquad \hat{P} f = \hat{Q} y.

Backfitting is a Gauss-Seidel procedure for solving the above system. While one could directly use a QR decomposition, without any iterations, to solve the entire system, the computational complexity prohibits doing so. The difficulty is that QR requires O\{(np)^3\} operations while backfitting involves only O(np) operations, which is much cheaper. For more details see Ref. 37.

We return to the trees dataset that we considered in the last section, and fit an additive model with both height and girth as predictors to model volume. The fitted model is log(Volume) ~ log(Height) + log(Girth). We used the gam35 package in R. The fitted functions are shown in Figure 4.
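The code sketch referred to above gives a bare-bones version of the backfitting loop, using smooth.spline() as the smoother S_j (an assumed choice); the fixed number of sweeps stands in for a proper convergence check.

# A bare-bones backfitting loop for an additive model, with smooth.spline()
# as the smoother; the fixed number of sweeps replaces a convergence check.
backfit <- function(y, X, sweeps = 20) {
  n <- nrow(X); p <- ncol(X)
  alpha <- mean(y)
  f <- matrix(0, n, p)                                       # current f_j(x_ij)
  for (it in seq_len(sweeps)) {
    for (j in seq_len(p)) {
      partial <- y - alpha - rowSums(f[, -j, drop = FALSE])  # partial residuals
      fit <- smooth.spline(X[, j], partial)
      f[, j] <- predict(fit, X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])                        # center each component
    }
  }
  list(alpha = alpha, f = f)
}

data(trees)
X   <- cbind(log(trees$Girth), log(trees$Height))
res <- backfit(log(trees$Volume), X)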


Fig. 4. Estimated functions for girth (left) and height (right) using smoothing splines.

The backfitting method is very generic in the sense that it can handle any type of smoothing function. However, there exists another method, specific to penalized splines, which has become quite popular. This method estimates the model using penalized regression methods. In equation (16), because m is linear in the parameters β, the penalty can always be written as a quadratic form in β:

\int \{ m''(x) \}^2 dx = \beta^T S \beta,

where S is a matrix of known coefficients. Therefore the penalized regression spline fitting problem is to minimize

\| y - X\beta \|^2 + \lambda \beta^T S \beta   (18)

with respect to β. It is straightforward to see that the solution is a least squares type of estimator and depends on the smoothing parameter λ:

\hat{\beta} = (X^T X + \lambda S)^{-1} X^T y.   (19)

Penalized likelihood maximization can only estimate the model coefficients β given the smoothing parameter λ. There exist two basic useful estimation approaches: when the scale parameter in the model is known, one can use Mallow's C_p criterion; when the scale parameter is unknown, one can use GCV. Furthermore, for models such as generalized linear models, which are estimated iteratively, there exist two different numerical ways of estimating the smoothing parameter:

Outer iteration: The score can be minimized directly. This means that the penalized regression must be evaluated for each trial set of smoothing parameters in each iteration.
Performance iteration: The score can be minimized and the smoothing parameter selected for each working penalized linear model. This method is computationally efficient.

Performance iteration was originally proposed by Ref. 31. It usually converges, and requires only a reliable and efficient method for score minimization. However, it also has some issues related to convergence. In contrast, the outer method suffers from none of the disadvantages that performance iteration has, but it is more computationally costly. For more details, please see Ref. 81.

The recent work by Ref. 83 showcases the successful application of additive models to large datasets, using performance iteration with block QR updating. This indicates the feasibility of applying these computationally intensive and useful models to big data. The routines are available in the R package mgcv82.
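The same additive fit can be expressed with penalized regression splines in the mgcv package mentioned above; in this sketch the log-transformed predictors are created explicitly, the basis dimension k is an arbitrary choice, and the smoothing parameters are selected by GCV.

# Penalized-regression-spline additive model via mgcv; the basis dimension k
# and the use of GCV for smoothing-parameter selection are illustrative choices.
library(mgcv)
data(trees)
trees$lGirth  <- log(trees$Girth)
trees$lHeight <- log(trees$Height)
fit <- gam(log(Volume) ~ s(lGirth, k = 10) + s(lHeight, k = 10),
           data = trees, method = "GCV.Cp")
summary(fit)
plot(fit, pages = 1)   # estimated smooth terms, in the spirit of Figure 4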


5. Markov Chain Monte Carlo

The Markov chain Monte Carlo (MCMC) methodology provides enormous scope for realistic and complex statistical modeling. The idea is to perform Monte Carlo integration using Markov chains. Bayesian statisticians, and sometimes also frequentists, need to integrate over possibly high-dimensional probability distributions to draw inference about model parameters or to generate predictions. For a brief history and overview please refer to Refs. 28 and 29.

5.1. Markov chains

Consider a sequence of random variables, {X_0, X_1, X_2, ...}, such that at each time t ≥ 0, the next state X_{t+1} is sampled from a conditional distribution P(X_{t+1} | X_t), which depends only on the current state of the chain, X_t. That is, given the current state X_t, the next state X_{t+1} does not depend on the past states; this is called the memory-less property. This sequence is called a Markov chain.

The joint distribution of a Markov chain is determined by two components:

The marginal distribution of X_0, called the initial distribution.
The conditional density p(·|·), called the transition kernel of the chain.

It is assumed that the chain is time-homogeneous, which means that the probability P(·|·) does not depend on time t. The set in which X_t takes values is called the state space of the Markov chain and it can be countably finite or infinite.

Under some regularity conditions, the chain will gradually forget its initial state and converge to a unique stationary (invariant) distribution, say π(·), which does not depend on t or X_0. To converge to a stationary distribution, the chain needs to satisfy three important properties. First, it has to be irreducible, which means that from all starting points the Markov chain can reach any non-empty set with positive probability in some number of iterations. Second, the chain needs to be aperiodic, which means that it should not oscillate between any two points in a periodic manner. Third, the chain must be positive recurrent, as defined next.

Definition 1 (Ref. 49).
(i) A Markov chain X is called irreducible if for all i, j, there exists t > 0 such that P_{i,j}(t) = P[X_t = j | X_0 = i] > 0.
(ii) Let τ_{ii} be the time of the first return to state i, τ_{ii} = min{t > 0 : X_t = i | X_0 = i}. An irreducible chain X is recurrent if P[τ_{ii} < ∞] = 1 for some i. Otherwise X is transient. Another equivalent condition for recurrence is

\sum_t P_{ij}(t) = \infty \quad \text{for all } i, j.

(iii) An irreducible recurrent chain X is called positive recurrent if E[τ_{ii}] < ∞ for some i. Otherwise, it is called null-recurrent. Another equivalent condition for positive recurrence is the existence of a stationary probability distribution for X, that is, there exists π(·) such that

\sum_i \pi(i) P_{ij}(t) = \pi(j)   (20)

for all j and t ≥ 0.
(iv) An irreducible chain X is called aperiodic if for some i, the greatest common divisor of {t > 0 : P_{ii}(t) > 0} is 1.

In MCMC, since we already have a target distribution π(·), X will be positive recurrent if we can demonstrate irreducibility.

After a sufficiently long burn-in of, say, m iterations, the points {X_t; t = m + 1, ..., n} will be a dependent sample approximately from π(·). We can then use the output from the Markov chain to estimate the required quantities. For example, we estimate E[f(X)], where X has distribution π(·), as follows:

\bar{f} = \frac{1}{n - m} \sum_{t = m+1}^n f(X_t).

This quantity is called an ergodic average and its convergence to the required expectation is ensured by the ergodic theorem56,49.

Theorem 2. If X is positive recurrent and aperiodic then its stationary distribution π(·) is the unique probability distribution satisfying equation (20). We then say that X is ergodic and the following consequences hold:
(i) P_{ij}(t) → π(j) as t → ∞ for all i, j.
(ii) (Ergodic theorem) If E[|f(X)|] < ∞, then P(\bar{f} → E[f(X)]) = 1, where E[f(X)] = \sum_i f(i)\pi(i), the expectation of f(X) with respect to π(·).

Most of the Markov chain procedures used in MCMC are reversible, which means that they are positive recurrent with stationary distribution π(·) and satisfy π(i)P_{ij} = π(j)P_{ji}. Further, we say that X is geometrically ergodic if it is ergodic (positive recurrent and aperiodic) and there exist 0 ≤ λ < 1 and a function V(·) > 1 such that

\sum_j |P_{ij}(t) - \pi(j)| \le V(i) \lambda^t   (21)

for all i. The smallest λ for which there exists a function satisfying the above equation is called the rate of convergence. As a consequence of the geometric convergence, the central limit theorem can be used for ergodic averages, that is,

N^{1/2} (\bar{f} - E[f(X)]) \to N(0, \sigma^2)

for some positive constant σ, as N → ∞, with convergence in distribution. For an extensive treatment of geometric convergence and central limit theorems for Markov chains, please refer to Ref. 48.

5.2. The Gibbs sampler and Metropolis-Hastings algorithm

Many MCMC algorithms are hybrids or generalizations of the two simplest methods: the Gibbs sampler and the Metropolis-Hastings algorithm. We therefore describe each of these two methods next.

5.2.1. Gibbs Sampler

The Gibbs sampler enjoyed an initial surge of popularity starting with the paper of Ref. 27 (in a study of image processing models), while the roots of this method can be traced back to Ref. 47. The Gibbs sampler is a technique for indirectly generating random variables from a (marginal) distribution, without calculating the joint density. With the help of techniques like these we are able to avoid difficult calculations, replacing them with a sequence of easier calculations.

Let π(x) = π(x_1, ..., x_k), x ∈ R^n, denote a joint density, and let π(x_i | x_{-i}) denote the induced full conditional densities for each of the components x_i, given values of the other components x_{-i} = (x_j; j ≠ i), i = 1, ..., k, 1 < k ≤ n. Now the Gibbs sampler proceeds as follows. First, choose arbitrary starting values x^0 = (x_1^0, ..., x_k^0). Then successively make random drawings from the full conditional distributions π(x_i | x_{-i}), i = 1, ..., k, as follows66:

x_1^1 from π(x_1 | x_{-1}^0),
x_2^1 from π(x_2 | x_1^1, x_3^0, ..., x_k^0),
x_3^1 from π(x_3 | x_1^1, x_2^1, x_4^0, ..., x_k^0),
...
x_k^1 from π(x_k | x_{-k}^1).

This completes a transition from x^0 = (x_1^0, ..., x_k^0) to x^1 = (x_1^1, ..., x_k^1). Each complete cycle through the conditional distributions produces a sequence x^0, x^1, ..., x^t, ... which is a realization of a Markov chain, with transition probability from x^t to x^{t+1} given by

TP(x^t, x^{t+1}) = \prod_{l=1}^k \pi(x_l^{t+1} | x_j^t, j > l; \ x_j^{t+1}, j < l).   (22)

Thus the key feature of this algorithm is that it only samples from the full conditional distributions, which are often easier to evaluate than the joint density. For more details see Ref. 6.
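As a toy illustration of the sampler (not an example from the paper), the following sketch draws from a standard bivariate normal with correlation rho by cycling through its two full conditionals.

# Toy Gibbs sampler for a standard bivariate normal with correlation rho;
# the target, rho and chain length are illustrative assumptions.
gibbs_bvn <- function(n_iter = 5000, rho = 0.8, x0 = c(0, 0)) {
  out <- matrix(NA_real_, n_iter, 2)
  x <- x0
  for (t in seq_len(n_iter)) {
    x[1] <- rnorm(1, mean = rho * x[2], sd = sqrt(1 - rho^2))  # x1 | x2
    x[2] <- rnorm(1, mean = rho * x[1], sd = sqrt(1 - rho^2))  # x2 | x1
    out[t, ] <- x
  }
  out
}
set.seed(1)
draws <- gibbs_bvn()
colMeans(draws[-(1:1000), ])   # after burn-in, both means should be near 0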


5.2.2. Metropolis-Hastings Algorithm

The Metropolis-Hastings (M-H) algorithm was developed by Ref. 47. This algorithm is extremely versatile and produces the Gibbs sampler as a special case25.

To construct a Markov chain X_1, X_2, ..., X_t, ... with state space χ and equilibrium distribution π(x), the Metropolis-Hastings algorithm constructs the transition probability from X_t = x to the next realized state X_{t+1} as follows. Let q(x, x') denote a candidate generating density such that, given X_t = x, a value x' drawn from q(x, x') is considered as a proposed possible value for X_{t+1}. With some probability α(x, x') we accept X_{t+1} = x'; otherwise, we reject the value generated from q(x, x') and set X_{t+1} = x. This construction defines a Markov chain with transition probabilities given by

p(x, x') = \begin{cases} q(x, x')\,\alpha(x, x') & \text{if } x' \neq x \\ 1 - \sum_{x''} q(x, x'')\,\alpha(x, x'') & \text{if } x' = x. \end{cases}

Next, we choose

\alpha(x, x') = \begin{cases} \min\left\{ \frac{\pi(x')\,q(x', x)}{\pi(x)\,q(x, x')}, 1 \right\} & \text{if } \pi(x)\,q(x, x') > 0, \\ 1 & \text{if } \pi(x)\,q(x, x') = 0. \end{cases}

The choice of an arbitrary q(x, x') that makes the chain irreducible and aperiodic is a sufficient condition for π(x) to be the equilibrium distribution of the constructed chain.

It can be observed that different choices of q(x, x') lead to different specific algorithms. For q(x, x') = q(x', x), we have α(x, x') = min{π(x')/π(x), 1}, which is the well-known Metropolis algorithm47. For q(x, x') = q(x' - x), the chain is driven by a random walk process. For more choices and their consequences please refer to Ref. 74. Similarly, for applications of the M-H algorithm and for more details see Ref. 7.

5.3. MCMC Issues

There is a great deal of theory about the convergence properties of MCMC. However, it has not been found to be very useful in practice for determining convergence information. A critical issue for users of MCMC is how to determine when to stop the algorithm. Sometimes a Markov chain can appear to have converged to the equilibrium distribution when it has not. This can happen due to prolonged transition times between regions of the state space or due to the multimodality of the equilibrium distribution. This phenomenon is often called pseudo convergence or multimodality.

The phenomenon of pseudo convergence has led many MCMC users to embrace the idea of comparing multiple runs of the sampler, started at multiple points, instead of the usual single run. It is believed that if the multiple runs converge to the same equilibrium distribution then everything is fine with the chain. However, this approach does not alleviate all the problems. Many times running multiple chains leads to avoiding running the sampler long enough to detect problems, such as bugs in the code, etc. Those who have used MCMC in complicated problems are probably familiar with stories about last minute problems after running the chain for several weeks. In the following we describe the two popular MCMC diagnostic methods.

Ref. 26 proposed a convergence diagnostic method commonly known as the "Gelman-Rubin" diagnostic. It consists of the following two steps. First, obtain an overdispersed estimate of the target distribution and from it generate the starting points for the desired number of independent chains (say 10). Second, run the Gibbs sampler and re-estimate the target distribution of the required scalar quantity as a conservative Student t distribution, the scale parameter of which involves both the between-chain variance and the within-chain variance. Convergence is then monitored by estimating the factor by which the scale parameter might shrink if sampling were continued indefinitely, namely

\sqrt{\hat{R}} = \sqrt{ \left( \frac{n-1}{n} + \frac{m+1}{mn} \frac{B}{W} \right) \frac{df}{df - 2} },

where B is the variance between the means from the m parallel chains, W is the within-chain variance, df is the degrees of freedom of the approximating density, and n is the number of observations used to re-estimate the target density. The authors recommend an iterative process of running additional iterations of the parallel chains and redoing step 2 until the shrink factors for all the quantities of interest are near 1. Though created for the Gibbs sampler, the method of Ref. 26 may be applied to the output of any MCMC algorithm. It emphasizes the reduction of bias in estimation. There also exist a number of criticisms of the Ref. 26 method. It relies heavily on the user's ability to find a starting distribution which is highly overdispersed with respect to the target distribution. This means that the user should have some prior knowledge of the target distribution. Although the approach is essentially univariate, the authors suggested using -2 times the log posterior density as a way of summarizing the convergence of a joint density.

Similarly, Ref. 55 proposed a diagnostic method which is intended both to detect convergence to the stationary distribution and to provide a way of bounding the variance estimates of the quantiles of functions of parameters. The user must first run a single-chain Gibbs sampler with the minimum number of iterations that would be needed to obtain the desired precision of estimation if the samples were independent. The approach is based on two-state Markov chain theory, as well as standard sample size formulas that involve the binomial variance. For more details please refer to Ref. 55. Critics point out that the method can produce variable estimates of the required number of iterations given different initial chains for the same problem, and that it is univariate rather than giving information about the full joint posterior distribution.

There are more methods available for MCMC convergence diagnostics, although they are not as popular as these. For a discussion of other methods refer to Ref. 10.
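A random-walk Metropolis sampler, the special case q(x, x') = q(x' - x) noted above, takes only a few lines of R; in this toy sketch the target is a standard normal and the proposal standard deviation is an arbitrary tuning choice.

# Random-walk Metropolis targeting a standard normal (toy illustration);
# the proposal standard deviation and chain length are arbitrary choices.
metropolis <- function(n_iter = 5000, x0 = 0, prop_sd = 1,
                       log_target = function(x) dnorm(x, log = TRUE)) {
  x <- numeric(n_iter)
  x[1] <- x0
  for (t in 2:n_iter) {
    cand <- rnorm(1, x[t - 1], prop_sd)                 # symmetric proposal
    log_alpha <- log_target(cand) - log_target(x[t - 1])
    x[t] <- if (log(runif(1)) < log_alpha) cand else x[t - 1]
  }
  x
}
set.seed(1)
chain <- metropolis()
mean(chain[-(1:1000)])   # approximately 0 after discarding burn-in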


Continuing our illustrations using the trees data, we fit the same model that we used in Section 4, this time using MCMC methods. We used the MCMCpack46 package in R. For the sake of illustration, we considered 3 chains and 600 observations per chain. Of the 600, the first 100 observations are used as burn-in. The summary results are as follows.

  Iterations = 101:600
  Thinning interval = 1
  Number of chains = 3
  Sample size per chain = 500

  1. Empirical mean and standard deviation for each variable,
     plus standard error of the mean:

                   Mean       SD  Naive SE Time-series SE
  (Intercept) -6.640053 0.837601 2.163e-02      2.147e-02
  log(Girth)   1.983349 0.078237 2.020e-03      2.017e-03
  log(Height)  1.118669 0.213692 5.518e-03      5.512e-03
  sigma2       0.007139 0.002037 5.259e-05      6.265e-05

  2. Quantiles for each variable:

                   2.5%       25%       50%       75%    97.5%
  (Intercept) -8.307100 -7.185067 -6.643978 -6.084797 -4.96591
  log(Girth)   1.832276  1.929704  1.986649  2.036207  2.13400
  log(Height)  0.681964  0.981868  1.116321  1.256287  1.53511
  sigma2       0.004133  0.005643  0.006824  0.008352  0.01161

Fig. 5. MCMC trace plots with 3 chains for each parameter in the estimated model.

Further, the trace plots for each parameter are displayed in Figure 5. From the plots it can be observed that the chains are well mixed, which gives an indication of convergence of the MCMC.

Since we used 3 chains, we can use the Gelman-Rubin convergence diagnostic method and check whether the shrink factor is close to 1 or not, which indicates the convergence of the 3 chains to the same equilibrium distribution. The results are shown in Figure 6. From the plots we see that for all the parameters the shrink factor and its 97.5% value are very close to 1, which confirms that the 3 chains converged to the same equilibrium distribution.
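A fit like the one summarized above can be reproduced along the following lines with MCMCpack and the coda package; the seeds are arbitrary, and the burn-in and chain lengths mirror the 100/500 split described in the text.

# Reproducing the three-chain fit with MCMCpack and coda (seeds are arbitrary;
# burn-in and kept iterations follow the 100/500 split described in the text).
library(MCMCpack)   # provides MCMCregress()
library(coda)       # provides mcmc.list(), gelman.diag(), gelman.plot()
data(trees)

chain1 <- MCMCregress(log(Volume) ~ log(Girth) + log(Height), data = trees,
                      burnin = 100, mcmc = 500, seed = 1)
chain2 <- MCMCregress(log(Volume) ~ log(Girth) + log(Height), data = trees,
                      burnin = 100, mcmc = 500, seed = 2)
chain3 <- MCMCregress(log(Volume) ~ log(Girth) + log(Height), data = trees,
                      burnin = 100, mcmc = 500, seed = 3)
chains <- mcmc.list(chain1, chain2, chain3)

summary(chains)       # posterior means, SDs and quantiles, as reported above
plot(chains)          # trace and density plots (cf. Figure 5)
gelman.diag(chains)   # Gelman-Rubin shrink factors, should be close to 1
gelman.plot(chains)   # evolution of the shrink factors (cf. Figure 6)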


Fig. 6. Gelman-Rubin diagnostics for each model parameter, using the results from 3 chains.

6. Resampling Methods

Resampling methods are statistical procedures that involve repeated sampling of the data. They replace the theoretical derivations required for applying traditional methods in statistical analysis by repeatedly resampling the original data and making inference from the resamples. Due to advances in computing power these methods have become prominent and particularly well appreciated by applied statisticians. The jackknife and the bootstrap are the most popular data-resampling methods used in statistical analysis. For a comprehensive treatment of these methods see Refs. 14 and 61. In the following, we describe the jackknife and bootstrap methods.

6.1. The Jackknife

Quenouille54 introduced a method, later named the jackknife, to estimate the bias of an estimator by deleting one data point each time from the original dataset and recalculating the estimator based on the rest of the data. Let T_n = T_n(X_1, ..., X_n) be an estimator of an unknown parameter θ. The bias of T_n is defined as

bias(T_n) = E(T_n) - \theta.

Let T_{n-1,i} = T_{n-1}(X_1, ..., X_{i-1}, X_{i+1}, ..., X_n) be the given statistic but based on the n - 1 observations X_1, ..., X_{i-1}, X_{i+1}, ..., X_n, i = 1, ..., n. Quenouille's jackknife bias estimator is

b_{JACK} = (n - 1)(\tilde{T}_n - T_n),   (23)

where \tilde{T}_n = n^{-1} \sum_{i=1}^n T_{n-1,i}. This leads to a bias-reduced jackknife estimator of θ,

T_{JACK} = T_n - b_{JACK} = n T_n - (n - 1)\tilde{T}_n.   (24)

The jackknife estimators b_{JACK} and T_{JACK} can be heuristically justified as follows. Suppose that

bias(T_n) = \frac{a}{n} + \frac{b}{n^2} + O\!\left( \frac{1}{n^3} \right),   (25)

where a and b are unknown but do not depend on n. Since T_{n-1,i}, i = 1, ..., n, are identically distributed,

bias(T_{n-1,i}) = \frac{a}{n-1} + \frac{b}{(n-1)^2} + O\!\left( \frac{1}{(n-1)^3} \right),   (26)

and bias(\tilde{T}_n) has the same expression. Therefore,

E(b_{JACK}) = (n - 1)[bias(\tilde{T}_n) - bias(T_n)] = (n - 1)\left[ \left( \frac{1}{n-1} - \frac{1}{n} \right) a + \left( \frac{1}{(n-1)^2} - \frac{1}{n^2} \right) b + O\!\left( \frac{1}{n^3} \right) \right] = \frac{a}{n} + \frac{(2n - 1)b}{n^2(n-1)} + O\!\left( \frac{1}{n^2} \right),

which means that, as an estimator of the bias of T_n, b_{JACK} is correct up to the order of n^{-2}. It follows that

bias(T_{JACK}) = bias(T_n) - E(b_{JACK}) = -\frac{b}{n(n-1)} + O\!\left( \frac{1}{n^2} \right),

that is, the bias of T_{JACK} is of order n^{-2}. The jackknife produces a bias-reduced estimator by removing the first order term in bias(T_n).

The jackknife has become a more valuable tool since Ref. 75 found that it can also be used to construct variance estimators. It is less dependent on model assumptions and does not need any theoretical formula as required by the traditional approach. Although it was prohibitive in the old days due to its computational costs, today it is certainly a popular tool in data analysis.

6.2. The Bootstrap

The bootstrap13 is conceptually the simplest of all resampling methods. Let X_1, ..., X_n denote a dataset of n independent and identically distributed (iid) observations from an unknown distribution F, which is estimated by \hat{F}, and let T_n = T_n(X_1, ..., X_n) be a given statistic. Then the variance of T_n is

var(T_n) = \int \left\{ T_n(x) - \int T_n(y)\, d\prod_{i=1}^n F(y_i) \right\}^2 d\prod_{i=1}^n F(x_i),   (27)

where x = (x_1, ..., x_n) and y = (y_1, ..., y_n). Substituting \hat{F} for F, we obtain the bootstrap variance estimator

\nu_{BOOT} = \int \left\{ T_n(x) - \int T_n(y)\, d\prod_{i=1}^n \hat{F}(y_i) \right\}^2 d\prod_{i=1}^n \hat{F}(x_i) = var_*[T_n(X_1^*, ..., X_n^*) \,|\, X_1, ..., X_n],

where {X_1^*, ..., X_n^*} is an iid sample from \hat{F}, called a bootstrap sample, and var_*[· | X_1, ..., X_n] denotes the conditional variance given X_1, ..., X_n. This variance cannot be used directly for practical applications when ν_BOOT is not an explicit function of X_1, ..., X_n. Monte Carlo methods can be used to evaluate this expression since \hat{F} is known.
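To make equations (23) and (24) concrete, the sketch below jackknifes the sample correlation between log girth and log volume in the trees data (the choice of statistic is ours, purely for illustration), and also computes the delete-one variance estimator referred to above.

# Jackknife bias and standard error for the sample correlation in the trees
# data; the statistic is an illustrative choice, not taken from the paper.
data(trees)
x <- log(trees$Girth); y <- log(trees$Volume)
n <- length(x)
theta_hat <- cor(x, y)
theta_i   <- sapply(seq_len(n), function(i) cor(x[-i], y[-i]))  # leave-one-out
bias_jack  <- (n - 1) * (mean(theta_i) - theta_hat)             # equation (23)
theta_jack <- n * theta_hat - (n - 1) * mean(theta_i)           # equation (24)
se_jack    <- sqrt((n - 1) / n * sum((theta_i - mean(theta_i))^2))
c(estimate = theta_hat, bias = bias_jack, jackknife = theta_jack, se = se_jack)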


6.2. The Bootstrap

The bootstrap (Ref. 13) is conceptually the simplest of all resampling methods. Let X_1, ..., X_n denote a dataset of n independent and identically distributed (iid) observations from an unknown distribution F, which is estimated by F̂, and let T_n = T_n(X_1, ..., X_n) be a given statistic. Then the variance of T_n is

    var(T_n) = ∫ [ T_n(x) − ∫ T_n(y) d∏_{i=1}^{n} F(y_i) ]^2 d∏_{i=1}^{n} F(x_i),   (27)

where x = (x_1, ..., x_n) and y = (y_1, ..., y_n). Substituting F̂ for F, we obtain the bootstrap variance estimator

    ν_BOOT = ∫ [ T_n(x) − ∫ T_n(y) d∏_{i=1}^{n} F̂(y_i) ]^2 d∏_{i=1}^{n} F̂(x_i)
           = var_*[ T_n(X*_1, ..., X*_n) | X_1, ..., X_n ],

where {X*_1, ..., X*_n} is an iid sample from F̂, called a bootstrap sample, and var_*[ · | X_1, ..., X_n] denotes the conditional variance given X_1, ..., X_n. This estimator cannot be used directly in practice when ν_BOOT is not an explicit function of X_1, ..., X_n. Monte Carlo methods can be used to evaluate such an expression when F is known: we repeatedly draw new datasets from F and then use the sample variance of the values of T_n computed from the new datasets as a numerical approximation to var(T_n). Since F̂ is a known distribution, the same idea applies here. That is, we draw {X*_{1b}, ..., X*_{nb}}, b = 1, ..., B, independently from F̂, conditioned on X_1, ..., X_n. Letting T*_{n,b} = T_n(X*_{1b}, ..., X*_{nb}), we approximate ν_BOOT by

    ν_BOOT^(B) = (1/B) Σ_{b=1}^{B} [ T*_{n,b} − (1/B) Σ_{l=1}^{B} T*_{n,l} ]^2.   (28)

By the law of large numbers, ν_BOOT = lim_{B→∞} ν_BOOT^(B) almost surely. Both ν_BOOT and its Monte Carlo approximation ν_BOOT^(B) are called bootstrap estimators. While ν_BOOT^(B) is more useful in practical applications, ν_BOOT is convenient for theoretical derivations. The distribution F̂ used to generate the bootstrap datasets can be any estimator (parametric or nonparametric) of F based on X_1, ..., X_n; a simple nonparametric choice is the empirical distribution. Although we have considered the bootstrap variance estimator here, the bootstrap method can be used for more general problems such as inference for regression parameters, hypothesis testing, etc. For further discussion of the bootstrap see Ref. 16.

Next, we consider the bias and variance of the bootstrap estimator. Efron (Ref. 13) applied the delta method to approximate the bootstrap bias and variance. Let {X*_1, ..., X*_n} be a bootstrap sample from the empirical distribution F_n. Define

    P*_i = #{ X*_j = X_i, j = 1, ..., n } / n,   and   P* = (P*_1, ..., P*_n)'.

Given X_1, ..., X_n, the vector nP* has a multinomial distribution with parameters n and P^0 = (1, ..., 1)'/n. Then

    E_*(P*) = P^0   and   var_*(P*) = n^{−2}(I − 11'/n),

where I is the identity matrix, 1 is a column vector of ones, and E_* and var_* denote the bootstrap expectation and variance, respectively. Now consider estimating a moment of a random variable R_n(X_1, ..., X_n, F). The bootstrap replaces the population quantities by their empirical counterparts, R_n(X*_1, ..., X*_n, F_n) = R_n(P*). Expanding this around P^0 in a multivariate Taylor series yields the desired approximations for the bootstrap bias and variance:

    b_BOOT = E_* R_n(P*) ≈ R_n(P^0) + (1/(2n^2)) tr(V),
    ν_BOOT = var_* R_n(P*) ≈ (1/n^2) U'U,

where U = ∇R_n(P^0) and V = ∇^2 R_n(P^0) are the gradient vector and Hessian matrix of R_n evaluated at P^0.
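A minimal R sketch of the Monte Carlo approximation (28): draw B bootstrap samples from the empirical distribution (i.e., resample the observed data with replacement), recompute the statistic on each, and take the sample variance of the replicates. The function name boot_var, the choice B = 1000, the simulated data, and the use of the sample median as the example statistic are illustrative assumptions, not taken from the text.

```r
# Monte Carlo approximation to the bootstrap variance, equation (28)
boot_var <- function(x, statistic, B = 1000) {
  n <- length(x)
  # B bootstrap samples from the empirical distribution = resampling with replacement
  Tstar <- replicate(B, statistic(sample(x, n, replace = TRUE)))
  mean((Tstar - mean(Tstar))^2)                  # nu_BOOT^(B)
}

set.seed(1)
x <- rnorm(30)
boot_var(x, statistic = median)                  # bootstrap variance of the sample median
sqrt(boot_var(x, statistic = median))            # corresponding bootstrap standard error
```

Note that (28) divides by B rather than B − 1; for the values of B typically used, the difference is negligible.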
6.3. Comparing the Jackknife and the Bootstrap

In general, the jackknife is easier to compute if n is smaller than, say, the 100 or 200 replicates used by the bootstrap for standard error estimation. However, by looking only at the n jackknife samples, the jackknife uses only limited information about the statistic, which means it might be less efficient than the bootstrap. In fact, the jackknife can be viewed as a linear approximation to the bootstrap (Ref. 14). Hence, if the statistic is linear the two estimators agree, but for nonlinear statistics there is a loss of information. Practically speaking, the accuracy of the jackknife estimate of standard error depends on how close the statistic is to linearity. Also, while it is not obvious how to estimate the entire sampling distribution of T_n by jackknifing, the bootstrap can readily be used to obtain a distribution estimator for T_n.

In weighing the merits of the bootstrap, note that the general formulas for estimating standard errors that involve the observed Fisher information matrix are essentially bootstrap estimates carried out in a parametric framework. While the use of the Fisher information matrix involves parametric assumptions, the bootstrap is free of those: the data analyst can obtain standard errors for enormously complicated estimators subject only to the constraints of computer time. In addition, if needed, one can obtain smoother bootstrap estimates by combining the nonparametric bootstrap with the parametric bootstrap. The parametric bootstrap generates samples from the fitted model with estimated parameters, whereas the nonparametric bootstrap generates samples from the available data alone.

To provide a simple illustration, we again considered the trees data and fit an ordinary regression model with the formula mentioned in Section 4. To conduct a bootstrap analysis of the regression parameters, we resampled the data with replacement 100 times (bootstrap replications) and fit the same model to each sample. We then calculated the mean and standard deviation of each regression coefficient, which are analogous to the OLS coefficient and standard error, and did the same for the jackknife estimators. The results are presented in Table 3. The jackknife estimates are off, owing to the small sample size, whereas the bootstrap results are much closer to the OLS values.

Table 3. Comparison of the bootstrap, jackknife, and parametric method (OLS) in a regression setting; standard errors are shown in parentheses.

                    OLS        Bootstrap    Jackknife
  (Intercept)      -6.63        -6.65        -6.63
                   (0.80)       (0.73)       (0.15)
  log(Height)       1.12         1.12         1.11
                   (0.20)       (0.19)       (0.04)
  log(Girth)        1.98         1.99         1.98
                   (0.08)       (0.07)       (0.01)
  Observations      31
  Samples                        100           31
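A sketch of this illustration in R, assuming the Section 4 formula is log(Volume) ~ log(Height) + log(Girth) for the built-in trees data (31 observations); the random seed and the decision to use base R only are our own choices, so the resampled numbers will differ slightly from Table 3.

```r
data(trees)
form <- log(Volume) ~ log(Height) + log(Girth)
ols  <- lm(form, data = trees)
n    <- nrow(trees)

set.seed(1)
# Bootstrap: 100 case resamples with replacement, refit, keep the coefficients
boot_coef <- t(replicate(100, {
  idx <- sample(n, n, replace = TRUE)
  coef(lm(form, data = trees[idx, ]))
}))

# Jackknife: refit the model leaving out one observation at a time
jack_coef <- t(sapply(seq_len(n), function(i) coef(lm(form, data = trees[-i, ]))))

# Means play the role of the coefficient estimates ...
rbind(OLS = coef(ols), Bootstrap = colMeans(boot_coef), Jackknife = colMeans(jack_coef))
# ... and standard deviations play the role of the standard errors (cf. Table 3)
rbind(OLS       = coef(summary(ols))[, "Std. Error"],
      Bootstrap = apply(boot_coef, 2, sd),
      Jackknife = apply(jack_coef, 2, sd))
```

One reason the jackknife column in Table 3 appears too small is that the raw standard deviation of the n leave-one-out coefficients is not rescaled; the usual jackknife standard error multiplies it by (n − 1)/sqrt(n), which here brings the values close to the OLS standard errors.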


7. Conclusion

The area and methods of computational statistics have been evolving rapidly. Existing statistical software such as R already has efficient routines to implement and evaluate these methods. In addition, there is a growing literature on parallelizing these methods to make them even more efficient; see, for example, Ref. 83.

While some existing methods, such as the local linear estimator, are still prohibitive even with moderately large data, implementations in more resourceful environments such as servers or clouds make these methods feasible even with big data. For an example see Ref. 84, where the authors used a server with 32GB of RAM to estimate their proposed model on real data in no more than 34 seconds, a task that would otherwise have taken considerably longer. This illustrates how computing power makes such computationally intensive methods practical.

To the best of our knowledge, there exist multiple algorithms or R packages implementing all of the methods discussed here. However, not every implementation is computationally efficient. For example, Ref. 12 reported that there are 20 R packages implementing density estimation, and found that two of them (KernSmooth, ASH) are very fast, accurate, and well maintained. Users should therefore choose efficient implementations when dealing with larger datasets.

Lastly, as mentioned earlier, we were able to cover only a few of the modern statistical computing methods. For an expanded exposition of computational methods, especially for inference, see Ref. 15.

References

1. David M Allen, The relationship between variable selection and data augmentation and a method for prediction, Technometrics 16 (1974), no. 1, 125–127.
2. Anthony Curtis Atkinson, Plots, transformations, and regression: an introduction to graphical methods of diagnostic regression analysis, Clarendon Press, 1985.
3. Adrian W Bowman, An alternative method of cross-validation for the smoothing of density estimates, Biometrika 71 (1984), no. 2, 353–360.
4. Adrian W Bowman, Peter Hall, and DM Titterington, Cross-validation in nonparametric estimation of probabilities and probability densities, Biometrika 71 (1984), no. 2, 341–351.
5. Andreas Buja, Trevor Hastie, and Robert Tibshirani, Linear smoothers and additive models, The Annals of Statistics (1989), 453–510.
6. George Casella and Edward I George, Explaining the Gibbs sampler, The American Statistician 46 (1992), no. 3, 167–174.
7. Siddhartha Chib and Edward Greenberg, Understanding the Metropolis–Hastings algorithm, The American Statistician 49 (1995), no. 4, 327–335.
8. C-K Chu and JS Marron, Choosing a kernel regression estimator, Statistical Science (1991), 404–419.
9. William S Cleveland, Robust locally weighted regression and smoothing scatterplots, Journal of the American Statistical Association 74 (1979), no. 368, 829–836.
10. Mary Kathryn Cowles and Bradley P Carlin, Markov chain Monte Carlo convergence diagnostics: a comparative review, Journal of the American Statistical Association 91 (1996), no. 434, 883–904.
11. Peter Craven and Grace Wahba, Smoothing noisy data with spline functions, Numerische Mathematik 31 (1978), no. 4, 377–403.
12. Henry Deng and Hadley Wickham, Density estimation in R, 2011.
13. Bradley Efron, Bootstrap methods: another look at the jackknife, Breakthroughs in Statistics, Springer, 1992, pp. 569–593.
14. Bradley Efron, The jackknife, the bootstrap and other resampling plans, vol. 38, SIAM, 1982.
15. Bradley Efron and Trevor Hastie, Computer age statistical inference, vol. 5, Cambridge University Press, 2016.
16. Bradley Efron and Robert J Tibshirani, An introduction to the bootstrap, CRC Press, 1994.
17. VA Epanechnikov, Nonparametric estimation of a multidimensional probability density, Teoriya Veroyatnostei i ee Primeneniya 14 (1969), no. 1, 156–161.
18. Randall L Eubank, Spline smoothing and nonparametric regression, 1988.
19. Jianqing Fan, Design-adaptive nonparametric regression, Journal of the American Statistical Association 87 (1992), no. 420, 998–1004.
20. Jianqing Fan, Local linear regression smoothers and their minimax efficiencies, The Annals of Statistics (1993), 196–216.
21. Jianqing Fan and Irene Gijbels, Local polynomial modelling and its applications, Monographs on Statistics and Applied Probability 66, CRC Press, 1996.
22. Jianqing Fan and James S Marron, Fast implementations of nonparametric curve estimators, Journal of Computational and Graphical Statistics 3 (1994), no. 1, 35–56.
23. Jerome H Friedman, Data mining and statistics: What's the connection?, Computing Science and Statistics 29 (1998), no. 1, 3–9.
24. Theo Gasser and Hans-Georg Müller, Kernel estimation of regression functions, Smoothing Techniques for Curve Estimation, Springer, 1979, pp. 23–68.
25. Andrew Gelman, Iterative and non-iterative simulation algorithms, Computing Science and Statistics (1993), 433–433.
26. Andrew Gelman and Donald B Rubin, Inference from iterative simulation using multiple sequences, Statistical Science (1992), 457–472.
27. Stuart Geman and Donald Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Transactions on Pattern Analysis and Machine Intelligence (1984), no. 6, 721–741.
28. Charles Geyer, Introduction to Markov chain Monte Carlo, Handbook of Markov Chain Monte Carlo (2011), 3–48.
29. Walter R Gilks, Markov chain Monte Carlo, Wiley Online Library.
30. PJ Green and BW Silverman, Nonparametric regression and generalized linear models, vol. 58 of Monographs on Statistics and Applied Probability, 1994.
31. Chong Gu, Cross-validating non-Gaussian data, Journal of Computational and Graphical Statistics 1 (1992), no. 2, 169–179.
32. Peter Hall, Large sample optimality of least squares cross-validation in density estimation, The Annals of Statistics (1983), 1156–1174.
33. Wolfgang Härdle, Applied nonparametric regression, Cambridge, UK, 1990.
34. Wolfgang Härdle, Smoothing techniques: with implementation in S, Springer Science & Business Media, 2012.
35. T Hastie, gam: Generalized additive models, R package version 1.06.2, 2011.


36. Trevor Hastie and Clive Loader, Local regression: automatic kernel carpentry, Statistical Science (1993), 120–129.
37. Trevor J Hastie and Robert J Tibshirani, Generalized additive models, vol. 43, CRC Press, 1990.
38. Joseph L Hodges Jr and Erich L Lehmann, The efficiency of some nonparametric competitors of the t-test, The Annals of Mathematical Statistics (1956), 324–335.
39. M Chris Jones, James S Marron, and Simon J Sheather, A brief survey of bandwidth selection for density estimation, Journal of the American Statistical Association 91 (1996), no. 433, 401–407.
40. V Ya Katkovnik, Linear and nonlinear methods of nonparametric regression analysis, Soviet Automatic Control 5 (1979), 25–34.
41. Charles Kooperberg and Charles J Stone, A study of logspline density estimation, Computational Statistics & Data Analysis 12 (1991), no. 3, 327–347.
42. Charles Kooperberg, Charles J Stone, and Young K Truong, Hazard regression, Journal of the American Statistical Association 90 (1995), no. 429, 78–94.
43. Clive Loader, Local regression and likelihood, Springer Science & Business Media, 2006.
44. YP Mack and Hans-Georg Müller, Derivative estimation in nonparametric regression with random predictor variable, Sankhyā: The Indian Journal of Statistics, Series A (1989), 59–72.
45. Enno Mammen, Oliver Linton, and J Nielsen, The existence and asymptotic properties of a backfitting projection algorithm under weak conditions, The Annals of Statistics 27 (1999), no. 5, 1443–1490.
46. Andrew D Martin, Kevin M Quinn, and Jong Hee Park, MCMCpack: Markov chain Monte Carlo in R, Journal of Statistical Software 42 (2011), no. 9, 1–21.
47. Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics 21 (1953), no. 6, 1087–1092.
48. Sean P Meyn and Richard L Tweedie, Stability of Markovian processes II: continuous-time processes and sampled chains, Advances in Applied Probability (1993), 487–517.
49. Per Mykland, Luke Tierney, and Bin Yu, Regeneration in Markov chain samplers, Journal of the American Statistical Association 90 (1995), no. 429, 233–241.
50. Elizbar A Nadaraya, On estimating regression, Theory of Probability & Its Applications 9 (1964), no. 1, 141–142.
51. Douglas Nychka, Splines as local smoothers, The Annals of Statistics (1995), 1175–1197.
52. Jean D Opsomer, Asymptotic properties of backfitting estimators, Journal of Multivariate Analysis 73 (2000), no. 2, 166–179.
53. Emanuel Parzen, On estimation of a probability density function and mode, The Annals of Mathematical Statistics 33 (1962), 1065–1076.
54. Maurice H Quenouille, Approximate tests of correlation in time-series 3, Mathematical Proceedings of the Cambridge Philosophical Society, vol. 45, Cambridge University Press, 1949, pp. 483–484.
55. Adrian E Raftery and Steven Lewis, How many iterations in the Gibbs sampler?, Bayesian Statistics 4 (1992), no. 2, 763–773.
56. Gareth O Roberts, Markov chain concepts related to sampling algorithms, Markov Chain Monte Carlo in Practice 57 (1996).
57. Murray Rosenblatt, Remarks on some nonparametric estimates of a density function, The Annals of Mathematical Statistics 27 (1956), no. 3, 832–837.
58. Mats Rudemo, Empirical choice of histograms and kernel density estimators, Scandinavian Journal of Statistics (1982), 65–78.
59. David Ruppert and Matthew P Wand, Multivariate locally weighted least squares regression, The Annals of Statistics (1994), 1346–1370.
60. Thomas A Ryan, Brian L Joiner, and Barbara F Ryan, Minitab student handbook, 1976.
61. Jun Shao and Dongsheng Tu, The jackknife and bootstrap, Springer Science & Business Media, 2012.
62. Bernard W Silverman, Spline smoothing: the equivalent variable kernel method, The Annals of Statistics (1984), 898–916.
63. Bernard W Silverman, Density estimation for statistics and data analysis, vol. 26, CRC Press, 1986.
64. Bernard W Silverman, Some aspects of the spline smoothing approach to non-parametric regression curve fitting, Journal of the Royal Statistical Society, Series B (Methodological) (1985), 1–52.
65. Jeffrey S Simonoff, Smoothing methods in statistics, Springer Science & Business Media, 2012.
66. Adrian FM Smith and Gareth O Roberts, Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods, Journal of the Royal Statistical Society, Series B (Methodological) (1993), 3–23.
67. Charles J Stone, Consistent nonparametric regression, The Annals of Statistics (1977), 595–620.
68. Charles J Stone, Optimal rates of convergence for nonparametric estimators, The Annals of Statistics (1980), 1348–1360.
69. Charles J Stone, An asymptotically optimal window selection rule for kernel density estimates, The Annals of Statistics (1984), 1285–1297.
70. Charles J Stone, Additive regression and other nonparametric models, The Annals of Statistics (1985), 689–705.
71. Mervyn Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, Series B (Methodological) (1974), 111–147.
72. Alan Stuart and Maurice G Kendall, The advanced theory of statistics, vol. 2, Charles Griffin, 1973.
73. R Core Team, R: A language and environment for statistical computing, 2013.
74. Luke Tierney, Markov chains for exploring posterior distributions, The Annals of Statistics (1994), 1701–1728.
75. John W Tukey, Bias and confidence in not-quite large samples, The Annals of Mathematical Statistics 29 (1958), 614.
76. John W Tukey, The future of data analysis, The Annals of Mathematical Statistics 33 (1962), no. 1, 1–67.
77. Grace Wahba, Practical approximate solutions to linear operator equations when the data are noisy, SIAM Journal on Numerical Analysis 14 (1977), no. 4, 651–667.
78. Grace Wahba, Spline models for observational data, vol. 59, SIAM, 1990.
79. MP Wand and BD Ripley, KernSmooth: Functions for kernel smoothing for Wand and Jones (1995), R package version 2.23-15, 2015.
80. Geoffrey S Watson, Smooth regression analysis, Sankhyā: The Indian Journal of Statistics, Series A (1964), 359–372.
81. Simon Wood, Generalized additive models: an introduction with R, CRC Press, 2006.
82. Simon N Wood, mgcv: GAMs and generalized ridge regression for R, R News 1 (2001), no. 2, 20–25.
83. Simon N Wood, Yannig Goude, and Simon Shaw, Generalized additive models for large data sets, Journal of the Royal Statistical Society: Series C (Applied Statistics) 64 (2015), no. 1, 139–155.


84. Xiaoke Zhang, Byeong U Park, and Jane-Ling Wang, Time-varying additive models for longitudinal data, Journal of the American Statistical Association 108 (2013), no. 503, 983–998.
