Model-free intervals for regression and autoregression

Dimitris N. Politis, University of California, San Diego

To explain or to predict?

▶ Models are indispensable for exploring/utilizing relationships between variables: explaining the world.

▶ Use of models for prediction can be problematic when:
  ▶ a model is overspecified;
  ▶ parameter inference is highly model-specific (and sensitive to model mis-specification);
  ▶ prediction is carried out by plugging in the estimated parameters and treating the model as exactly true.

▶ "All models are wrong but some are useful"—George Box.

A Toy Example

▶ Assume the regression model: Y = β0 + β1 X + β2 X^2 + error.

▶ If β̂2 is barely statistically significant, do you still use it in prediction?

▶ If the true value of β2 is close to zero, and var(β̂2) is large, then it may be advantageous to omit β2: allow a nonzero Bias but minimize MSE.

▶ A mis-specified model can be optimal for prediction!

Prediction Framework

▶ a. Point predictors; b. Interval predictors; c. Predictive distribution.

▶ Abundant Bayesian literature in the parametric framework—Cox (1975), Geisser (1993), etc.

▶ Frequentist and/or nonparametric literature is scarce.

I.i.d. set-up

▶ Let ε1, . . . , εn be i.i.d. from the (unknown) cdf Fε.

▶ GOAL: prediction of the future εn+1 based on the data.

▶ Fε is the predictive distribution, and its quantiles could be used to form predictive intervals.

▶ The mean and median of Fε are the optimal point predictors under an L2 and L1 criterion, respectively.

I.i.d. data

▶ Fε is unknown but can be estimated by the empirical distribution function (edf) F̂ε.

▶ Practical model-free predictive intervals will be based on quantiles of F̂ε, and the L2 and L1 optimal predictors will be approximated by the mean and median of F̂ε, respectively.
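A minimal base-R sketch of this i.i.d. building block; the vector eps is a placeholder assumption standing in for ε1, . . . , εn.

## edf quantiles give the predictive interval; sample mean/median give the
## L2/L1-optimal point predictors.
eps   <- rnorm(100)                                       # placeholder i.i.d. sample
alpha <- 0.10
pred.interval <- quantile(eps, c(alpha/2, 1 - alpha/2))   # quantiles of the edf
L2.pred <- mean(eps)                                      # L2-optimal point predictor
L1.pred <- median(eps)                                    # L1-optimal point predictor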

Non-i.i.d. data

▶ In general, the data Y_n = (Y1, . . . , Yn)′ are not i.i.d.

▶ So the predictive distribution of Yn+1 given the data will depend on Y_n and on Xn+1, which is a matrix of observable, explanatory (predictor) variables.

▶ Key examples: Regression and Time series.

Models

▶ Regression: Yt = µ(xt) + σ(xt) εt with εt ∼ i.i.d. (0,1).

▶ Time series: Yt = µ(Yt−1, · · · , Yt−p; xt) + σ(Yt−1, · · · , Yt−p; xt) εt.

▶ The above are flexible, nonparametric models.

▶ Given one of the above models, optimal model-based predictors of a future Y-value can be constructed.

▶ Nevertheless, the prediction problem can be carried out in a fully model-free setting, offering—at the very least—robustness against model mis-specification.

Transformation vs. modeling

▶ DATA: Y_n = (Y1, . . . , Yn)′.

▶ GOAL: predict the future value Yn+1 given the data.

▶ Find an invertible transformation Hm so that (for all m) the vector ε_m = Hm(Y_m) has i.i.d. components εk, where ε_m = (ε1, . . . , εm)′.

▶ Schematically: Y_m → ε_m under Hm, and ε_m → Y_m under Hm^{−1}.

Transformation

(i)  (Y1, . . . , Ym) → (ε1, . . . , εm)  via Hm
(ii) (Y1, . . . , Ym) ← (ε1, . . . , εm)  via Hm^{−1}

▶ (i) implies that ε1, . . . , εn are known given the data Y1, . . . , Yn.

▶ (ii) implies that Yn+1 is a function of ε1, . . . , εn, and εn+1.

▶ So, given the data Y_n, Yn+1 is a function of εn+1 only, i.e., Yn+1 = h̃(εn+1).

Model-free prediction principle

Yn+1 = h̃(εn+1)

▶ Suppose ε1, . . . , εn ∼ cdf Fε.

▶ The mean and median of h̃(ε), where ε ∼ Fε, are the optimal point predictors of Yn+1 under an L2 or L1 criterion, respectively.

▶ The whole predictive distribution of Yn+1 is the distribution of h̃(ε) when ε ∼ Fε.

▶ To predict Yn+1^2, replace h̃ by h̃^2; to predict g(Yn+1), replace h̃ by g ◦ h̃.

▶ The unknown Fε can be estimated by F̂ε, the edf of ε1, . . . , εn.

▶ But the predictive distribution needs bootstrapping—also because h̃ is estimated from the data.

Nonparametric Regression

MODEL (?): Yt = µ(xt) + σ(xt) εt

▶ xt univariate and deterministic.

▶ Yt data available for t = 1, . . . , n.

▶ εt ∼ i.i.d. (0,1) from the (unknown) cdf F.

▶ The functions µ(·) and σ(·) are unknown but smooth.

Nonparametric Regression

Note: µ(x) = E(Y|x) and σ^2(x) = Var(Y|x).

▶ Let m_x, s_x be smoothing estimators of µ(x), σ(x).

▶ Examples: kernel smoothers, local linear fitting, wavelets, etc.

▶ E.g. the Nadaraya-Watson estimator m_x = Σ_{i=1}^n Yi K̃((x − xi)/h);

▶ here K(x) is the kernel, h the bandwidth, and K̃((x − xi)/h) = K((x − xi)/h) / Σ_{k=1}^n K((x − xk)/h).

▶ Similarly, s_x^2 = M_x − m_x^2, where M_x = Σ_{i=1}^n Yi^2 K̃((x − xi)/q).

FIGURE 1: (a) Log-wage vs. age data with fitted kernel smoother m_x. (b) Unstudentized residuals Y − m_x with superimposed s_x.

▶ 1971 Canadian data cps71 from the np package of R; wage vs. age dataset of 205 male individuals with common education.

▶ The kernel smoother is problematic at the left boundary; local linear fitting is better (Fan and Gijbels, 1996), or use reflection (Hall and Wehrly, 1991).
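A minimal base-R sketch of the kernel estimators m_x and s_x defined above; the Gaussian kernel and the helper name nw.fit are assumptions for illustration, not part of the original slides.

nw.fit <- function(x0, x, y, h, q = h) {
  ## normalized kernel weights K~((x0 - x_i)/h), Gaussian kernel assumed
  Kh <- dnorm((x0 - x) / h); Kh <- Kh / sum(Kh)
  Kq <- dnorm((x0 - x) / q); Kq <- Kq / sum(Kq)
  m  <- sum(y * Kh)                        # m_x: smoother of the Y_i
  M  <- sum(y^2 * Kq)                      # M_x: smoother of the Y_i^2
  c(m = m, s = sqrt(max(M - m^2, 0)))      # s_x^2 = M_x - m_x^2
}
## Example: fitted curve on a grid (x, y assumed to hold cps71-type data)
# grid <- seq(min(x), max(x), length = 100)
# fit  <- t(sapply(grid, function(x0) nw.fit(x0, x, y, h = 2)))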

Residuals

▶ (?): Yt = µ(xt) + σ(xt) εt.

▶ Fitted residuals: et = (Yt − m_{x_t}) / s_{x_t}.

▶ Predictive residuals: ẽt = (Yt − m_{x_t}^{(t)}) / s_{x_t}^{(t)}.

▶ m_x^{(t)} and s_x^{(t)} are the estimators m and s computed from the delete-Yt dataset: {(Yi, xi), for all i ≠ t}.

▶ ẽt is the (standardized) error in trying to predict Yt from the delete-Yt dataset.

▶ Selection of the bandwidth parameters h and q is often done by cross-validation, i.e., pick h, q to minimize PRESS = Σ_{t=1}^n ẽt^2.

▶ BETTER: L1 cross-validation: pick h, q to minimize Σ_{t=1}^n |ẽt|. (A sketch of both criteria follows.)
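A minimal sketch of the predictive residuals and the two cross-validation criteria, assuming data vectors x, y and the nw.fit helper sketched earlier.

pred.resid <- function(x, y, h, q = h) {
  sapply(seq_along(y), function(t) {
    fit <- nw.fit(x[t], x[-t], y[-t], h, q)     # delete-Y_t estimators m^(t), s^(t)
    as.numeric((y[t] - fit["m"]) / fit["s"])    # predictive residual e~_t
  })
}
## Cross-validation criteria for a candidate bandwidth pair (h, q):
# press <- sum(pred.resid(x, y, h, q)^2)        # L2 criterion (PRESS)
# l1cv  <- sum(abs(pred.resid(x, y, h, q)))     # L1 cross-validation
## Pick (h, q) minimizing the chosen criterion over a grid of candidates.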

Model-based (MB) point predictors

(?) Yt = µ(xt) + σ(xt) εt with εt ∼ i.i.d. (0, 1) with cdf F.

▶ GOAL: Predict a future response Yf associated with the point xf.

▶ The L2–optimal predictor of Yf is E(Yf|xf), i.e., µ(xf), which is approximated by m_{x_f}.

▶ The L1–optimal predictor of Yf is the conditional median, i.e., µ(xf) + σ(xf) · median(F), which is approximated by m_{x_f} + s_{x_f} · median(F̂e), where F̂e is the edf of the residuals e1, . . . , en.

Model-based (MB) point predictors

(?) Yt = µ(xt) + σ(xt) εt with εt ∼ i.i.d. (0, 1) with cdf F.

▶ DATASET cps71: salaries are logarithmically transformed, i.e., Yt = log-salary.

▶ To predict the salary at age xf we need to predict g(Yf), where g(x) = exp(x).

▶ The MB L2–optimal predictor of g(Yf) is E(g(Yf)|xf), estimated by n^{−1} Σ_{i=1}^n g(m_{x_f} + s_{x_f} ei).

▶ The naive predictor g(m_{x_f}) is grossly suboptimal when g is nonlinear.

▶ The MB L1–optimal predictor of g(Yf) is estimated by the sample median of the set {g(m_{x_f} + s_{x_f} ei), i = 1, . . . , n}; the naive plug-in is ok iff g is monotone!
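A minimal sketch of these MB point predictors with g = exp, assuming data vectors x, y = log-salary and the nw.fit helper sketched earlier.

mb.predict <- function(xf, x, y, h, q = h, g = exp) {
  fit  <- nw.fit(xf, x, y, h, q)                        # m_{x_f} and s_{x_f}
  fits <- t(sapply(x, function(xi) nw.fit(xi, x, y, h, q)))
  e    <- (y - fits[, "m"]) / fits[, "s"]               # fitted residuals e_i
  vals <- g(fit["m"] + fit["s"] * e)                    # g(m_{x_f} + s_{x_f} e_i)
  c(L2 = mean(vals), L1 = median(vals), naive = unname(g(fit["m"])))
}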

Resampling Algorithm for the predictive distribution of g(Yf)

Prediction root: g(Yf) − Π, where Π is the point predictor.

▶ Bootstrap the (fitted or predictive) residuals r1, ..., rn to create pseudo-residuals r1*, ..., rn*, whose edf is denoted by F̂n*.

▶ Create pseudo-data Yi* = m_{x_i} + s_{x_i} ri*, for i = 1, . . . , n.

▶ Calculate a bootstrap pseudo-response Yf* = m_{x_f} + s_{x_f} r, where r is drawn randomly from (r1, ..., rn).

▶ Based on the pseudo-data Y1*, ..., Yn*, re-estimate the functions µ(x) and σ(x) by m_x* and s_x*.

▶ Calculate the bootstrap root: g(Yf*) − Π(g, m_x*, s_x*, Y_n*, Xn+1, F̂n*).

▶ Repeat the above B times, and collect the B bootstrap roots in an empirical distribution with α-quantile denoted q(α).

▶ Our estimate of the predictive distribution of g(Yf) is the empirical df of the bootstrap roots shifted to the right by Π.

▶ Then, a (1 − α)100% equal-tailed predictive interval for g(Yf) is given by [Π + q(α/2), Π + q(1 − α/2)]. (A sketch of the algorithm follows.)
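A minimal sketch of the model-based resampling algorithm above, assuming data vectors x, y and the nw.fit and mb.predict helpers sketched earlier; fitted residuals are used and B is kept small for illustration.

mb.pred.interval <- function(xf, x, y, h, q = h, g = exp, B = 250, alpha = 0.10) {
  fits <- t(sapply(x, function(xi) nw.fit(xi, x, y, h, q)))
  r    <- (y - fits[, "m"]) / fits[, "s"]               # residuals r_1,...,r_n
  fxf  <- nw.fit(xf, x, y, h, q)                        # m_{x_f}, s_{x_f}
  Pi   <- unname(mb.predict(xf, x, y, h, q, g)["L2"])   # point predictor Π
  roots <- replicate(B, {
    rstar  <- sample(r, length(r), replace = TRUE)      # pseudo-residuals r*
    ystar  <- fits[, "m"] + fits[, "s"] * rstar         # pseudo-data Y_i*
    yfstar <- fxf["m"] + fxf["s"] * sample(r, 1)        # pseudo-response Y_f*
    Pistar <- mb.predict(xf, x, ystar, h, q, g)["L2"]   # Π* from the pseudo-data
    as.numeric(g(yfstar) - Pistar)                      # bootstrap root
  })
  Pi + quantile(roots, c(alpha/2, 1 - alpha/2))         # [Π + q(α/2), Π + q(1−α/2)]
}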

Model-free prediction in regression

Previous discussion hinged on the model (?): Yt = µ(xt) + σ(xt) εt with εt ∼ i.i.d. (0, 1) from cdf F.

▶ What happens if model (?) does not hold true?

▶ E.g., the skewness and/or kurtosis of Yt may depend on xt.

▶ cps71 data: the skewness/kurtosis of salary depend on age.

FIGURE: (a) Log-wage SKEWNESS vs. age. (b) Log-wage KURTOSIS vs. age.

▶ Both skewness and kurtosis are nonconstant!

General background

▶ Could try skewness-reducing transformations—but the log already does that.

▶ Could try ACE, AVAS, etc.

▶ There is a simpler, more general solution!

General background

▶ The Yt's are still independent but not identically distributed.

▶ We will denote the conditional distribution of Yf given xf by Dx(y) = P{Yf ≤ y | xf = x}.

▶ Assume the quantity Dx(y) is continuous in both x and y.

▶ With a categorical response, standard methods like Generalized Linear Models can be invoked, e.g. logistic regression, Poisson regression, etc.

▶ Since Dx(·) depends in a smooth way on x, we can estimate Dx(y) by the 'local' empirical N_{x,h}^{−1} Σ_{t: |xt − x| < h} 1{Yt ≤ y}.

Constructing the transformation

▶ More general estimator: D̂x(y) = Σ_{i=1}^n 1{Yi ≤ y} K̃((x − xi)/h).

▶ D̂x(y) is just a Nadaraya-Watson smoother of the variables 1{Yt ≤ y}, t = 1, . . . , n.

▶ Can also use a local linear smoother of 1{Yt ≤ y}, t = 1, . . . , n.

▶ The estimator D̂x(y) enjoys many good properties, including asymptotic consistency; see e.g. Li and Racine (2007).

▶ But D̂x(y) is discontinuous in y, and therefore unacceptable! Could smooth it by kernel methods, or... linear interpolation.

Linear interpolation

FIGURE 2: (a) Empirical distribution of a test sample consisting of five N(0, 1) and five N(1/2, 1) independent r.v.'s, with the piecewise linear estimator D̃(·) superimposed; the vertical/horizontal lines indicate the inversion process, i.e., finding D̃^{−1}(0.75). (b) Q-Q plot of the transformed variables ui vs. the quantiles of Uniform(0,1).
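A minimal sketch of the smoothed conditional-distribution estimator and its piecewise-linear, invertible version; the Gaussian kernel, the use of approxfun for the linear interpolation, and the name D.tilde are assumptions for illustration.

D.tilde <- function(x0, x, y, h) {
  w  <- dnorm((x0 - x) / h); w <- w / sum(w)         # weights K~((x0 - x_i)/h)
  ys <- sort(unique(y))
  Fy <- sapply(ys, function(yy) sum(w * (y <= yy)))  # step estimator D^_{x0}(y)
  ## piecewise-linear (hence continuous, invertible) version and its inverse
  list(cdf = approxfun(ys, Fy, yleft = 0, yright = 1),
       inv = approxfun(Fy, ys, rule = 2))
}
## Example: the u_i = D~_{x_i}(Y_i) should look approximately Uniform(0,1).
# u <- sapply(seq_along(y), function(i) D.tilde(x[i], x, y, h = 5)$cdf(y[i]))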

Constructing the transformation (continued)

▶ Since the Yt's are continuous r.v.'s, the probability integral transform is the key idea to transform them to 'i.i.d.-ness'.

▶ To see why, note that if we let ηi = D_{x_i}(Yi) for i = 1, . . . , n, our transformation objective would be exactly achieved, since η1, . . . , ηn would be i.i.d. Uniform(0,1).

▶ Dx(·) is not known, but we have the estimator D̃x(·) as its proxy.

▶ Therefore, our proposed transformation for the MF prediction principle is ui = D̃_{x_i}(Yi) for i = 1, . . . , n.

▶ D̃x(·) is consistent, so u1, . . . , un are approximately i.i.d.

▶ The probability integral transform has been used in the past for building better density estimators—Ruppert and Cline (1994).

Model-free optimal predictors

▶ Transformation: ui = D̃_{x_i}(Yi) for i = 1, . . . , n.

▶ The inverse transformation D̃_x^{−1} is well-defined since D̃_x(·) is strictly increasing.

▶ Let uf = D_{x_f}(Yf), so that Yf = D_{x_f}^{−1}(uf).

▶ D̃_{x_f}^{−1}(ui) has (approximately) the same distribution as Yf (conditionally on xf) for any i.

▶ So {D̃_{x_f}^{−1}(ui), i = 1, . . . , n} is a set of bona fide potential responses that can be used as proxies for Yf.

▶ These n valid potential responses {D̃_{x_f}^{−1}(ui), i = 1, . . . , n}, gathered together, give an approximate empirical distribution for Yf from which our predictors will be derived.

▶ The L2—optimal predictor of g(Yf) will be the expected value of g(Yf), approximated by n^{−1} Σ_{i=1}^n g(D̃_{x_f}^{−1}(ui)).

▶ The L1—optimal predictor of g(Yf) will be approximated by the sample median of the set {g(D̃_{x_f}^{−1}(ui)), i = 1, . . . , n}.

Model-free optimal point predictors

TABLE. The model-free (MF2) optimal point predictors.

  L2—predictor of Yf:      n^{−1} Σ_{i=1}^n D̃_{x_f}^{−1}(D̃_{x_i}(Yi))
  L1—predictor of Yf:      median{ D̃_{x_f}^{−1}(D̃_{x_i}(Yi)) }
  L2—predictor of g(Yf):   n^{−1} Σ_{i=1}^n g(D̃_{x_f}^{−1}(D̃_{x_i}(Yi)))
  L1—predictor of g(Yf):   median{ g(D̃_{x_f}^{−1}(D̃_{x_i}(Yi))) }

Model-free model-fitting

▶ The MF predictors (mean or median) can be used to give the equivalent of a model fit.

▶ Focus on the L2—optimal case with g(x) = x.

▶ Calculating the MF predictor Π(xf) = n^{−1} Σ_{i=1}^n g(D̃_{x_f}^{−1}(ui)) for many different xf values—say, on a grid—the equivalent of a nonparametric smoother of a regression function is constructed: Model-Free Model-Fitting (MF2). (A sketch follows.)
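A minimal sketch of the MF2 point predictors in the table above, assuming data vectors x, y and the D.tilde helper sketched earlier; evaluating the L2 predictor on a grid gives the Model-Free Model-Fitting curve.

mf.predict <- function(xf, x, y, h, g = identity) {
  u    <- sapply(seq_along(y),
                 function(i) D.tilde(x[i], x, y, h)$cdf(y[i]))  # u_i = D~_{x_i}(Y_i)
  inv  <- D.tilde(xf, x, y, h)$inv                              # D~_{x_f}^{-1}
  vals <- g(inv(u))                                             # g(D~_{x_f}^{-1}(u_i))
  c(L2 = mean(vals), L1 = median(vals))
}
## Model-Free Model-Fitting: the L2 predictor evaluated over a grid of x_f values
# grid <- seq(min(x), max(x), length = 100)
# Pi.grid <- sapply(grid, function(xf) mf.predict(xf, x, y, h = 5)["L2"])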

M.o.a.T.

▶ MF2 relieves the practitioner from the need to find the optimal transformation for additivity and variance stabilization, such as Box/Cox, ACE, AVAS, etc.—see Figures 3 and 4.

▶ No need for the log-transformation of salaries!

▶ MF2 is totally automatic!!

FIGURE 3: (a) Wage vs. age scatterplot. (b) Circles indicate the salary predictor n^{−1} Σ_{i=1}^n g(D̃_{x_f}^{−1}(ui)) calculated from the log-wage data with g(x) exponential. For both panels, the superimposed solid line represents the MF2 salary predictor calculated from the raw data (without log).

FIGURE 4: Q-Q plots of the ui vs. the quantiles of Uniform(0,1). (a) The ui's are obtained from the log-wage vs. age dataset of Figure 1 using bandwidth 5.5; (b) the ui's are obtained from the raw (untransformed) dataset of Figure 3 using bandwidth 7.3.

MF2 predictive distributions

▶ For MF2 we can always take g(x) = x; no need for other preliminary transformations.

▶ Let g(Yf) − Π be the prediction root, where Π is either the L2– or L1–optimal predictor, i.e., Π = n^{−1} Σ_{i=1}^n g(D̃_{x_f}^{−1}(ui)) or Π = median{g(D̃_{x_f}^{−1}(ui))}.

▶ Based on the Y–data, estimate the conditional distribution Dx(·) by D̃x(·), and let ui = D̃_{x_i}(Yi) to obtain the transformed data u1, ..., un that are approximately i.i.d.

Resampling Algorithm: MF2 predictive distribution of g(Yf)

I Bootstrap the u1, ..., un to create bootstrap pseudo-data ∗ ∗ ˆ ∗ u1, ..., un whose empirical distribution in denoted Fn . Generate a bootstrap pseudo-response Y ∗ = D˜ −1 (u) where u I f xf is drawn randomly from the set (u1, ..., un). ? I Based on the pseudo-data Yt , re-estimate the conditional ˜ ∗ distribution Dx (·); denote the bootstrap estimator by Dx (·). ∗ ∗ I Calculate the bootstrap root g(Yf ) − Π where  −1   −1  Π∗ = n−1 Pn g D˜ ∗ (u∗) or Π∗ =median {g D˜ ∗ (u∗) } i=1 xf i xf i I Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α—quantile denoted q(α). I Predictive distribution of g(Yf ) is the above edf shifted to the right by Π, and MF2 (1 − α)100% equal-tailed, prediction interval for g(Yf ) is [Π + q(α/2), Π + q(1 − α/2)].

Resampling Algorithm: MF2 predictive distribution of g(Yf)

I Bootstrap the u1, ..., un to create bootstrap pseudo-data ∗ ∗ ˆ ∗ u1, ..., un whose empirical distribution in denoted Fn . ˜ −1 I Use the inverse transformation Dx to create pseudo-data in ∗ ˜ −1 ∗ the Y domain, i.e., Yt = Dxt (ut ) for t = 1, ...n. ? I Based on the pseudo-data Yt , re-estimate the conditional ˜ ∗ distribution Dx (·); denote the bootstrap estimator by Dx (·). ∗ ∗ I Calculate the bootstrap root g(Yf ) − Π where  −1   −1  Π∗ = n−1 Pn g D˜ ∗ (u∗) or Π∗ =median {g D˜ ∗ (u∗) } i=1 xf i xf i I Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α—quantile denoted q(α). I Predictive distribution of g(Yf ) is the above edf shifted to the right by Π, and MF2 (1 − α)100% equal-tailed, prediction interval for g(Yf ) is [Π + q(α/2), Π + q(1 − α/2)].

Resampling Algorithm: MF2 predictive distribution of g(Yf)

I Bootstrap the u1, ..., un to create bootstrap pseudo-data ∗ ∗ ˆ ∗ u1, ..., un whose empirical distribution in denoted Fn . ˜ −1 I Use the inverse transformation Dx to create pseudo-data in ∗ ˜ −1 ∗ the Y domain, i.e., Yt = Dxt (ut ) for t = 1, ...n. Generate a bootstrap pseudo-response Y ∗ = D˜ −1 (u) where u I f xf is drawn randomly from the set (u1, ..., un). ∗ ∗ I Calculate the bootstrap root g(Yf ) − Π where  −1   −1  Π∗ = n−1 Pn g D˜ ∗ (u∗) or Π∗ =median {g D˜ ∗ (u∗) } i=1 xf i xf i I Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α—quantile denoted q(α). I Predictive distribution of g(Yf ) is the above edf shifted to the right by Π, and MF2 (1 − α)100% equal-tailed, prediction interval for g(Yf ) is [Π + q(α/2), Π + q(1 − α/2)].

Resampling Algorithm: MF2 predictive distribution of g(Yf)

I Bootstrap the u1, ..., un to create bootstrap pseudo-data ∗ ∗ ˆ ∗ u1, ..., un whose empirical distribution in denoted Fn . ˜ −1 I Use the inverse transformation Dx to create pseudo-data in ∗ ˜ −1 ∗ the Y domain, i.e., Yt = Dxt (ut ) for t = 1, ...n. Generate a bootstrap pseudo-response Y ∗ = D˜ −1 (u) where u I f xf is drawn randomly from the set (u1, ..., un). ? I Based on the pseudo-data Yt , re-estimate the conditional ˜ ∗ distribution Dx (·); denote the bootstrap estimator by Dx (·). I Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α—quantile denoted q(α). I Predictive distribution of g(Yf ) is the above edf shifted to the right by Π, and MF2 (1 − α)100% equal-tailed, prediction interval for g(Yf ) is [Π + q(α/2), Π + q(1 − α/2)].

Resampling Algorithm: MF2 predictive distribution of g(Yf)

I Bootstrap the u1, ..., un to create bootstrap pseudo-data ∗ ∗ ˆ ∗ u1, ..., un whose empirical distribution in denoted Fn . ˜ −1 I Use the inverse transformation Dx to create pseudo-data in ∗ ˜ −1 ∗ the Y domain, i.e., Yt = Dxt (ut ) for t = 1, ...n. Generate a bootstrap pseudo-response Y ∗ = D˜ −1 (u) where u I f xf is drawn randomly from the set (u1, ..., un). ? I Based on the pseudo-data Yt , re-estimate the conditional ˜ ∗ distribution Dx (·); denote the bootstrap estimator by Dx (·). ∗ ∗ I Calculate the bootstrap root g(Yf ) − Π where  −1   −1  Π∗ = n−1 Pn g D˜ ∗ (u∗) or Π∗ =median {g D˜ ∗ (u∗) } i=1 xf i xf i I Predictive distribution of g(Yf ) is the above edf shifted to the right by Π, and MF2 (1 − α)100% equal-tailed, prediction interval for g(Yf ) is [Π + q(α/2), Π + q(1 − α/2)].

Resampling Algorithm: MF2 predictive distribution of g(Yf)

I Bootstrap the u1, ..., un to create bootstrap pseudo-data ∗ ∗ ˆ ∗ u1, ..., un whose empirical distribution in denoted Fn . ˜ −1 I Use the inverse transformation Dx to create pseudo-data in ∗ ˜ −1 ∗ the Y domain, i.e., Yt = Dxt (ut ) for t = 1, ...n. Generate a bootstrap pseudo-response Y ∗ = D˜ −1 (u) where u I f xf is drawn randomly from the set (u1, ..., un). ? I Based on the pseudo-data Yt , re-estimate the conditional ˜ ∗ distribution Dx (·); denote the bootstrap estimator by Dx (·). ∗ ∗ I Calculate the bootstrap root g(Yf ) − Π where  −1   −1  Π∗ = n−1 Pn g D˜ ∗ (u∗) or Π∗ =median {g D˜ ∗ (u∗) } i=1 xf i xf i I Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α—quantile denoted q(α). Resampling Algorithm: MF2 predictive distribution of g(Yf)

I Bootstrap the u1, ..., un to create bootstrap pseudo-data ∗ ∗ ˆ ∗ u1, ..., un whose empirical distribution in denoted Fn . ˜ −1 I Use the inverse transformation Dx to create pseudo-data in ∗ ˜ −1 ∗ the Y domain, i.e., Yt = Dxt (ut ) for t = 1, ...n. Generate a bootstrap pseudo-response Y ∗ = D˜ −1 (u) where u I f xf is drawn randomly from the set (u1, ..., un). ? I Based on the pseudo-data Yt , re-estimate the conditional ˜ ∗ distribution Dx (·); denote the bootstrap estimator by Dx (·). ∗ ∗ I Calculate the bootstrap root g(Yf ) − Π where  −1   −1  Π∗ = n−1 Pn g D˜ ∗ (u∗) or Π∗ =median {g D˜ ∗ (u∗) } i=1 xf i xf i I Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α—quantile denoted q(α). I Predictive distribution of g(Yf ) is the above edf shifted to the right by Π, and MF2 (1 − α)100% equal-tailed, prediction interval for g(Yf ) is [Π + q(α/2), Π + q(1 − α/2)]. I Πxf is an estimate of the regression function µ(xf ) = E(Yf |xf ).
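To make the steps concrete, here is a minimal Python sketch for the case g(x) = x. It is an illustrative implementation under assumed choices (Gaussian kernel weights for the conditional distribution, linear interpolation for D̃ and its inverse, the L2-optimal predictor); the names cond_cdf, cdf_eval, cdf_inv and mf2_pred_interval are hypothetical, not the paper's own code.

```python
import numpy as np

def cond_cdf(x0, x, y, h):
    """Kernel-weighted estimate of the conditional cdf D_{x0}(t) = P(Y <= t | X = x0),
    returned as (sorted y values, cumulative weights) so it can be evaluated/inverted."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)      # Gaussian kernel weights
    w = w / w.sum()
    order = np.argsort(y)
    return y[order], np.cumsum(w[order])

def cdf_eval(y_sorted, cum_w, t):
    """Evaluate the (linearly interpolated) estimated conditional cdf at t."""
    return np.interp(t, y_sorted, cum_w, left=0.0, right=1.0)

def cdf_inv(y_sorted, cum_w, u):
    """Inverse of the interpolated conditional cdf (quantile function)."""
    return np.interp(u, cum_w, y_sorted, left=y_sorted[0], right=y_sorted[-1])

def mf2_pred_interval(x, y, xf, h, B=500, alpha=0.1, rng=None):
    """MF2 bootstrap prediction interval for Y_f at X = xf, case g(x) = x."""
    rng = np.random.default_rng(rng)
    n = len(y)
    # transform the data to approximately i.i.d. uniforms  u_i = D-tilde_{x_i}(Y_i)
    u = np.array([cdf_eval(*cond_cdf(x[i], x, y, h), y[i]) for i in range(n)])
    ys_f, cw_f = cond_cdf(xf, x, y, h)
    Pi = cdf_inv(ys_f, cw_f, u).mean()          # L2-optimal point predictor
    roots = np.empty(B)
    for b in range(B):
        u_star = rng.choice(u, size=n, replace=True)              # bootstrap the u's
        y_star = np.array([cdf_inv(*cond_cdf(x[t], x, y, h), u_star[t]) for t in range(n)])
        yf_star = cdf_inv(ys_f, cw_f, rng.choice(u))              # pseudo-response Y*_f
        ys_f_b, cw_f_b = cond_cdf(xf, x, y_star, h)               # re-estimated D-tilde*
        Pi_star = cdf_inv(ys_f_b, cw_f_b, u_star).mean()
        roots[b] = yf_star - Pi_star                              # bootstrap root
    q_lo, q_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return Pi + q_lo, Pi + q_hi
```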

MF2: Confidence intervals for the regression function without an additive model

I To fix ideas, assume g(x) = x.
I The MF2 L2-optimal point predictor of Y_f is Π_{x_f} = n^{-1} Σ_{i=1}^{n} D̃_{x_f}^{-1}(u_i), where u_i = D̃_{x_i}(Y_i).
I Π_{x_f} is an estimate of the regression function µ(x_f) = E(Y_f | x_f).
I How does Π_{x_f} compare to the kernel smoother m_{x_f}?
I Can we construct confidence intervals for the regression function µ(x_f) without the additive model (⋆) Y_t = µ(x_t) + σ(x_t) ε_t with ε_t ~ i.i.d. (0,1)?
I Note that Π_{x_f} is defined as a function of the approx. i.i.d. variables u_1, ..., u_n.
I So Π_{x_f} could be bootstrapped by resampling the i.i.d. variables u_1, ..., u_n.
I In fact, Π_{x_f} is asymptotically equivalent to the kernel smoother m_{x_f}, i.e., Π_{x_f} = m_{x_f} + o_P(1/√(nh)).
I Recall u_i ~ approx. i.i.d. Uniform(0,1). So, for large n, Π_{x_f} = n^{-1} Σ_{i=1}^{n} D̃_{x_f}^{-1}(u_i) ≈ ∫_0^1 D̃_{x_f}^{-1}(u) du = m_{x_f}.
I This is just the identity ∫ y dF(y) = ∫_0^1 F^{-1}(u) du.

Resampling Algorithm: MF2 confidence intervals for µ(x_f)

Define the confidence root µ(x_f) − Π_{x_f}.

I Bootstrap the u_1, ..., u_n to create bootstrap pseudo-data u*_1, ..., u*_n whose empirical distribution is denoted F̂*_n.
I Use the inverse transformation D̃_x^{-1} to create pseudo-data in the Y domain, i.e., Y*_t = D̃_{x_t}^{-1}(u*_t) for t = 1, ..., n.
I Based on the pseudo-data Y*_t, re-estimate the conditional distribution D̃_x(·); denote the bootstrap estimator by D̃*_x(·).
I Calculate the bootstrap confidence root Π*_{x_f} − Π_{x_f}, where Π*_{x_f} = n^{-1} Σ_{i=1}^{n} D̃*_{x_f}^{-1}(u*_i).
I Repeat the above steps B times, and collect the B bootstrap roots in the form of an edf with α-quantile denoted q(α).
I The MF2 (1 − α)100% equal-tailed confidence interval for µ(x_f) is [Π_{x_f} + q(α/2), Π_{x_f} + q(1 − α/2)].

Simulation: regression under model (⋆)

(⋆) Y_t = µ(x_t) + σ(x_t) ε_t with ε_t ~ i.i.d. (0, 1) with cdf F.

I Design points x_1, ..., x_n for n = 100 drawn from a uniform distribution on (0, 2π).
I µ(x) = sin(x), σ(x) = (cos(x/2) + 2)/7, and errors N(0,1) or Laplace.
I Prediction points: x_f = π; µ(x) has high slope but zero curvature, an easy case for estimation.
I x_f = π/2 and x_f = 3π/2; µ(x) has zero slope but high curvature (peak and valley), so large bias of m_x.
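A data-generating sketch for this design follows; the helper name make_regression_data is hypothetical, and the Laplace errors are rescaled so that they also have variance 1.

```python
import numpy as np

def make_regression_data(n=100, errors="normal", rng=None):
    """One sample from the simulation design under model (*)."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 2 * np.pi, size=n)                  # design points
    if errors == "normal":
        eps = rng.standard_normal(n)
    else:                                                    # Laplace, rescaled to variance 1
        eps = rng.laplace(scale=1 / np.sqrt(2), size=n)
    y = np.sin(x) + (np.cos(x / 2) + 2) / 7 * eps            # mu(x) + sigma(x)*eps
    return x, y
```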


FIGURE 6: Typical scatterplots with superimposed kernel smoothers; (a) Normal data; (b) Laplace data.

xf/π =    0.15   0.3    0.5    0.75   1      1.25   1.5    1.7    1.85
MB        0.845  0.832  0.822  0.838  0.847  0.865  0.798  0.852  0.854
MF/MB     0.901  0.912  0.874  0.897  0.921  0.908  0.861  0.903  0.914
MF2       0.834  0.836  0.829  0.831  0.849  0.852  0.804  0.840  0.838
MF/MF2    0.897  0.906  0.886  0.886  0.912  0.895  0.868  0.897  0.899
Normal    0.874  0.876  0.879  0.860  0.863  0.867  0.856  0.859  0.872

Table 4.2(a). Empirical coverage levels (CVR) of prediction intervals according to different methods at several xf points spanning the interval (0, 2π). Nominal coverage was 0.90, sample size n = 100 and bandwidths chosen by L1 cross-validation. Error distribution: i.i.d. Normal. [st.error of CVRs = 0.013]

xf/π =    0.15   0.3    0.5    0.75   1      1.25   1.5    1.7    1.85
MB        0.870  0.820  0.841  0.894  0.883  0.872  0.831  0.867  0.868
MF/MB     0.899  0.870  0.886  0.910  0.912  0.905  0.868  0.905  0.888
MF2       0.868  0.805  0.834  0.867  0.874  0.859  0.825  0.858  0.856
MF/MF2    0.905  0.876  0.885  0.916  0.910  0.910  0.865  0.905  0.895
Normal    0.874  0.849  0.876  0.888  0.885  0.879  0.874  0.888  0.879

Table 4.3(a). Empirical coverage levels (CVR) of prediction intervals according to different methods at several xf points spanning the interval (0, 2π). Nominal coverage was 0.90, sample size n = 100 and bandwidths chosen by L1 cross-validation. Error distribution: i.i.d. Laplace.

I NORMAL intervals exhibit under-coverage even when the true distribution is Normal; they need explicit bias correction.

I Bootstrap methods are more variable due to the extra resampling randomness.

I The MF/MB intervals are more accurate than their MB analogs in the case xf = π/2.

I When xf = π, the MB intervals are most accurate, and the MF/MB intervals seem to over-correct (and over-cover).

I Over-coverage can be attributed to 'bias leakage' and should be alleviated with a larger sample size, using higher-order kernels, or by a bandwidth trick such as undersmoothing.
I Performance of MF2 (resp. MF/MF2) intervals resembles that of MB intervals (resp. MF/MB).
I MF/MF2 intervals are best at x_f = π/2; this is quite surprising since one would expect the MB and MF/MB intervals to have a distinct advantage when model (⋆) is true.
I The price to pay for using the generally valid MF/MF2 intervals instead of the model-specific MF/MB ones is the increased variability of interval length.

Simulation: regression without model (⋆)

Instead: Y_x = µ(x) + σ(x) ε_x with ε_x = (c_x Z + (1 − c_x) W) / sqrt(c_x^2 + (1 − c_x)^2).

I Z ~ N(0, 1), independent of W that also has mean 0 and variance 1.
I W is either exponential, i.e., (1/2) χ²_2 − 1, to capture skewness, or Student's t with 5 d.f., i.e., sqrt(3/5) t_5, to capture kurtosis.
I ε_x independent but not i.i.d.: c_x = x/(2π) for x ∈ [0, 2π].
I Large x: ε_x is close to Normal. Small x: ε_x is skewed/kurtotic.
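A sketch of this error construction is given below; the helper name eps_x is hypothetical, and both choices of W are standardized to mean 0 and variance 1 as in the bullets above.

```python
import numpy as np

def eps_x(x, kind="skewed", rng=None):
    """Non-i.i.d. errors eps_x = (c_x Z + (1 - c_x) W) / sqrt(c_x^2 + (1 - c_x)^2)."""
    rng = np.random.default_rng(rng)
    x = np.atleast_1d(np.asarray(x, float))
    c = x / (2 * np.pi)
    z = rng.standard_normal(x.shape)
    if kind == "skewed":
        w = 0.5 * rng.chisquare(2, size=x.shape) - 1          # (1/2)*chi2_2 - 1: mean 0, var 1
    else:
        w = np.sqrt(3 / 5) * rng.standard_t(5, size=x.shape)  # sqrt(3/5)*t_5: mean 0, var 1
    return (c * z + (1 - c) * w) / np.sqrt(c**2 + (1 - c)**2)
```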

xf/π =    0.15   0.3    0.5    0.75   1      1.25   1.5    1.7    1.85
MB        0.886  0.893  0.840  0.840  0.888  0.816  0.780  0.827  0.816
MF/MB     0.917  0.927  0.877  0.917  0.928  0.892  0.849  0.888  0.885
MF2       0.894  0.876  0.823  0.843  0.876  0.813  0.782  0.816  0.831
MF/MF2    0.917  0.932  0.897  0.894  0.930  0.874  0.836  0.881  0.890
Normal    0.903  0.920  0.876  0.888  0.883  0.838  0.834  0.868  0.868

Table 4.4(a). CVRs with error distribution non–i.i.d. skewed.

xf/π =    0.15   0.3    0.5    0.75   1      1.25   1.5    1.7    1.85
MB        0.836  0.849  0.804  0.868  0.872  0.831  0.805  0.840  0.858
MF/MB     0.883  0.885  0.850  0.921  0.923  0.908  0.861  0.890  0.921
MF2       0.814  0.843  0.798  0.861  0.863  0.829  0.816  0.831  0.865
MF/MF2    0.876  0.899  0.858  0.923  0.921  0.894  0.877  0.874  0.914
Normal    0.858  0.876  0.859  0.883  0.872  0.845  0.865  0.858  0.881

Table 4.5(a). CVRs with error distribution non-i.i.d. kurtotic.

I The NORMAL intervals are totally unreliable, which is to be expected due to the non-normal error distributions.
I The MF/MF2 intervals are best (by far) in the cases x_f = π/2 and x_f = 3π/2, attaining close to nominal coverage even with a sample size as low as n = 100.

I The case x_f = π remains problematic for the same reasons previously discussed.

Conclusions

I Model-free prediction principle
I Novel viewpoint for inference: estimation and prediction
I Regression: point predictors and MF2
I Prediction intervals with good coverage properties
I Totally Automatic: no need to search for optimal transformations in regression...
I ...but very computationally intensive!

Bootstrap prediction intervals for time series [joint work with Li Pan, UCSD]

TIME SERIES DATA: X1,..., Xn (new notation)

GOAL: Predict Xn+1 given the data

Problem: We cannot choose x_f; it is given by the recent history of the time series, e.g., X_{n−1}, ..., X_{n−p} for some p. The bootstrap series X*_1, X*_2, ..., X*_{n−1}, X*_n, X*_{n+1}, ... must have the same values for the recent history, i.e., X*_{n−1} = X_{n−1}, X*_{n−2} = X_{n−2}, ..., X*_{n−p} = X_{n−p}.

"Forward" bootstrap:

Linear AR model: φ(B)X_t = ε_t with ε_t ~ i.i.d.
Let φ̂ be the LS (scatterplot) estimate of φ based on X_1, ..., X_n.
Let φ̃ be the LS (scatterplot) estimate of φ based on X_1, ..., X_{n−p}, X̃_{n−p+1}, ..., X̃_n, i.e., the last p values have been changed/corrupted/omitted.

φ̂ − φ̃ = O_P(1/n)

Hence, we can artificially change the last p values in the bootstrap world and the effect will be negligible.

Bootstrap series: X*_1, ..., X*_{n−p}, X_{n−p+1}, ..., X_n, where X*_t = (1/φ̂(B)) ε*_t (forward recursion) and X_{n−p+1}, ..., X_n are the same values as in the original dataset.
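Below is a hedged Python sketch of this forward bootstrap for a zero-mean AR(p): it keeps the last p values fixed at their observed values, re-estimates φ on each pseudo-series, and uses the predictive root X*_{n+1} minus the bootstrap predictor. The function names and the use of centred fitted LS residuals are illustrative assumptions, not a transcription of the authors' code.

```python
import numpy as np

def ar_forward_bootstrap_pi(x, p=1, B=1000, alpha=0.05, rng=None):
    """Forward bootstrap prediction interval for X_{n+1} in a zero-mean AR(p),
    keeping the last p observed values fixed (conditional inference)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, float)
    n = len(x)

    def fit(series):
        # LS ("scatterplot") fit of X_t on (X_{t-1}, ..., X_{t-p})
        m = len(series)
        Y = series[p:]
        Z = np.column_stack([series[p - j - 1 : m - j - 1] for j in range(p)])
        phi, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        resid = Y - Z @ phi
        return phi, resid - resid.mean()

    phi_hat, resid = fit(x)
    last = x[n - p:][::-1]                        # (X_n, ..., X_{n-p+1})
    x_hat = last @ phi_hat                        # point predictor of X_{n+1}
    roots = np.empty(B)
    for b in range(B):
        xb = np.empty(n)
        xb[:p] = x[:p]
        for t in range(p, n):                     # forward recursion with phi_hat
            xb[t] = xb[t - p:t][::-1] @ phi_hat + rng.choice(resid)
        xb[n - p:] = x[n - p:]                    # keep the last p values as observed
        phi_star, _ = fit(xb)
        x_star = last @ phi_hat + rng.choice(resid)   # future value uses phi_hat
        roots[b] = x_star - last @ phi_star           # predictor uses phi_star
    q = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return x_hat + q[0], x_hat + q[1]
```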

Backward bootstrap for linear AR: Thombs/Schucany (1990), Breidt, Davis, Dunsmuir (1993).

Linear AR model: φ(B)X_t = ε_t with ε_t ~ i.i.d.

This implies φ(B^{-1}) X_t = [φ(B^{-1})/φ(B)] ε_t ≡ w_t.

I The w_t are uncorrelated but not i.i.d.; we cannot reshuffle them!
I Bootstrap the residuals to get ε*_t, and get bootstrap w*_t by w*_t = [φ(B^{-1})/φ(B)] ε*_t (with estimated φ).
I Construct the backward bootstrap X*_t via φ(B^{-1}) X*_t = w*_t using initial (ending) conditions: X*_{n−1} = X_{n−1}, X*_{n−2} = X_{n−2}, ..., X*_{n−p} = X_{n−p}.
I Get X*_{n+1} using forward bootstrap.

Table: F=Forward, B=Backward, f/p=fitted/predictive residuals.

Normal AR(1)   nominal coverage 95%        nominal coverage 90%
φ1 = 0.9       CVR     LEN     st.err      CVR     LEN     st.err
Ff             0.941   3.915   0.355       0.892   3.306   0.282
Fp             0.945   3.993   0.369       0.898   3.369   0.283
Bf             0.940   3.900   0.363       0.890   3.290   0.271
Bp             0.943   3.965   0.374       0.896   3.349   0.290
φ1 = 0.5
Ff             0.940   3.895   0.357       0.891   3.294   0.283
Fp             0.944   3.968   0.377       0.898   3.355   0.281
Bf             0.941   3.904   0.367       0.891   3.293   0.277
Bp             0.944   3.965   0.383       0.897   3.347   0.285
φ1 = −0.9
Ff             0.939   3.895   0.363       0.890   3.291   0.283
Fp             0.944   3.982   0.382       0.897   3.358   0.282
Bf             0.940   3.899   0.364       0.890   3.283   0.269
Bp             0.943   3.965   0.382       0.896   3.343   0.289

Table: Simulation Results of AR(1) with Laplace innovations

Laplace errors   nominal coverage 95%        nominal coverage 90%
φ1 = 0.9         CVR     LEN     st.err      CVR     LEN     st.err
Ff               0.940   4.229   0.620       0.891   3.286   0.429
Fp               0.943   4.323   0.636       0.897   3.363   0.430
Bf               0.939   4.233   0.620       0.891   3.276   0.419
Bp               0.943   4.306   0.623       0.895   3.330   0.420
φ1 = 0.5
Ff               0.939   4.208   0.612       0.891   3.274   0.431
Fp               0.942   4.303   0.639       0.896   3.347   0.441
Bf               0.939   4.224   0.615       0.891   3.277   0.426
Bp               0.942   4.296   0.621       0.896   3.332   0.425
φ1 = −0.9
Ff               0.939   4.207   0.608       0.890   3.274   0.432
Fp               0.943   4.312   0.625       0.895   3.343   0.432
Bf               0.939   4.222   0.609       0.890   3.276   0.429
Bp               0.942   4.288   0.612       0.894   3.328   0.424

Other approaches

I Box and Jenkins (1976): naive approach assuming Gaussianity and ignoring the variability of estimated parameters

I Similarly, Cao et al. (1997) use forward bootstrap for the extrapolated X*_{n+1} but do not re-estimate φ on the bootstrap series; as in Box and Jenkins (1976) but relaxing the Gaussianity assumption.

I Masarotto (1990) does forward bootstrap (with studentized prediction roots) but does not fix the last values to the ones from the original data—conditional inference breaks down.

I AR-sieve bootstrap of Alonso, Pena and Romo (2002), and ARIMA bootstrap of Pascual, Romo and Ruiz (2004): they do not use predictive roots; they approximate L(X_{n+1}) with L*(X*_{n+1}); plus they generate X*_{n+1} using φ̂*, not φ̂.

Table: Simulation w. AR(2) model X_t = 1.55 X_{t−1} − 0.6 X_{t−2} + ε_t. BJ=Box/Jenkins, M=Masarotto, Mf/Mp=modified M using conditional inference and fitted/predictive residuals, PRR=Pascual/Romo/Ruiz.

normal errors   nominal coverage 95%        nominal coverage 90%
n = 50          CVR     LEN     st.err      CVR     LEN     st.err
BJ              0.926   3.771   0.402       0.868   3.165   0.337
M               0.942   4.098   0.515       0.894   3.455   0.413
Mf              0.946   4.185   0.547       0.901   3.521   0.441
Mp              0.945   4.159   0.537       0.899   3.496   0.431
Cao             0.905   3.640   0.560       0.859   3.150   0.424
Sieve/PRR       0.926   3.931   0.523       0.875   3.324   0.410
Ff              0.931   3.933   0.521       0.883   3.328   0.417
Fp              0.947   4.171   0.543       0.902   3.527   0.434
Bf              0.926   3.870   0.633       0.875   3.274   0.481
Bp              0.942   4.105   0.612       0.897   3.474   0.464

Table: Simulation w. AR(2) model X_t = 1.55 X_{t−1} − 0.6 X_{t−2} + ε_t with Laplace innovations

Laplace errors   nominal coverage 95%        nominal coverage 90%
n = 50           CVR     LEN     st.err      CVR     LEN     st.err
BJ               0.918   3.755   0.580       0.877   3.151   0.487
M                0.940   4.488   0.839       0.894   3.489   0.576
Mf               0.943   4.569   0.858       0.899   3.563   0.611
Mp               0.943   4.563   0.871       0.899   3.564   0.620
Cao              0.915   3.994   0.841       0.864   3.151   0.599
Sieve/PRR        0.929   4.235   0.800       0.880   3.335   0.596
Ff               0.931   4.252   0.828       0.883   3.322   0.582
Fp               0.942   4.529   0.873       0.898   3.537   0.613
Bf               0.929   4.176   0.792       0.879   3.259   0.554
Bp               0.939   4.449   0.887       0.895   3.481   0.594

Nonparametric autoregression model

I i.i.d. errors: X_t = m(Y_{t−1}) + ε_t

I heteroscedastic errors: X_t = m(Y_{t−1}) + σ(Y_{t−1}) ε_t, where Y_{t−1} = (X_{t−1}, ..., X_{t−p})' and the ε_t are i.i.d. and independent of the past X_i, i < t.

Estimate the conditional mean m(y) = E[X_{t+1} | Y_t = y] and variance σ²(y) = E[(X_{t+1} − m(y))² | Y_t = y] by the usual Nadaraya-Watson kernel estimates m̂ and σ̂², defined as

m̂(y) = [ Σ_{t=1}^{n−1} K(||y − y_t||/h) x_{t+1} ] / [ Σ_{t=1}^{n−1} K(||y − y_t||/h) ],

σ̂²(y) = [ Σ_{t=1}^{n−1} K(||y − y_t||/h) (x_{t+1} − m̂(y_t))² ] / [ Σ_{t=1}^{n−1} K(||y − y_t||/h) ].

Nonparametric autoregression – residual bootstrap I

Mammen, Franke and Kreiss (1998, 2002)

I Given observations x_1, ..., x_n, we generate the bootstrap pseudo-data forwards by the following recursions:
  I x*_i = m̂(y*_{i−1}) + ε*_i
  I x*_i = m̂(y*_{i−1}) + σ̂(y*_{i−1}) ε*_i
  Here y*_{i−1} = (x*_{i−1}, ..., x*_{i−p})' and the ε*_i can be resampled from fitted or predictive residuals.

I Fitted residuals

  I ε̂_i = x_i − m̂(y_{i−1}), for i = p + 1, ..., n
  I ε̂_i = (x_i − m̂(y_{i−1})) / σ̂(y_{i−1}), for i = p + 1, ..., n

I Predictive residuals as in Politis (2010)
  I ε̂_t^{(t)} = x_t − m̂^{(t)}(y_{t−1}), t = p + 1, ..., n
  I ε̂_t^{(t)} = (x_t − m̂^{(t)}(y_{t−1})) / σ̂^{(t)}(y_{t−1}), t = p + 1, ..., n.

Nonparametric autoregression – residual bootstrap II

where the delete-xt kernel estimator is defined the same way as in regression scatterplot:

m̂^{(t)}(y) = [ Σ_{i=p+1, i≠t}^{n} K(||y − y_{i−1}||/h) x_i ] / [ Σ_{i=p+1, i≠t}^{n} K(||y − y_{i−1}||/h) ]

(σ̂^{(t)}(y))² = [ Σ_{i=p+1, i≠t}^{n} K(||y − y_{i−1}||/h) (x_i − m̂^{(t)}(y_{i−1}))² ] / [ Σ_{i=p+1, i≠t}^{n} K(||y − y_{i−1}||/h) ]

for t = p + 1, ..., n.

I bootstrap estimate

m̂*(y) = [ Σ_{i=1}^{n−1} K(||y − y*_i||/h) x*_{i+1} ] / [ Σ_{i=1}^{n−1} K(||y − y*_i||/h) ], for y ∈ R^p.

Nonparametric autoregression – prediction intervals I

I Prediction root:
  I X_{n+1} − X̂_{n+1} = m(y_n) − m̂(y_n) + ε_{n+1}
  I X_{n+1} − X̂_{n+1} = m(y_n) − m̂(y_n) + σ(y_n) ε_{n+1}
I Bootstrap prediction root:
  I X*_{n+1} − X̂*_{n+1} = m̂(y_n) − m̂*(y_n) + ε*_{n+1}
  I X*_{n+1} − X̂*_{n+1} = m̂(y_n) − m̂*(y_n) + σ̂(y_n) ε*_{n+1}
I Prediction interval: Collect B bootstrap root replicates in the form of an empirical distribution whose α-quantile is denoted q(α). Then, a (1 − α)100% equal-tailed predictive interval for X_{n+1} is given by [m̂(y_n) + q(α/2), m̂(y_n) + q(1 − α/2)].
I This is forward bootstrap fixing the last p values; is backward bootstrap feasible??
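A minimal sketch of this forward residual bootstrap for p = 1 with i.i.d. errors and fitted residuals follows; Gaussian Nadaraya-Watson weights and the names nw and npar_forward_pi are assumptions for illustration only.

```python
import numpy as np

def nw(y0, ypast, xnext, h):
    """Nadaraya-Watson estimate of m(y0) = E[X_{t+1} | Y_t = y0] (p = 1, Gaussian kernel)."""
    w = np.exp(-0.5 * ((ypast - y0) / h) ** 2)
    return (w * xnext).sum() / w.sum()

def npar_forward_pi(x, h, B=500, alpha=0.05, rng=None):
    """Forward residual bootstrap prediction interval under X_t = m(X_{t-1}) + eps_t."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, float)
    n = len(x)
    ypast, xnext = x[:-1], x[1:]
    resid = xnext - np.array([nw(v, ypast, xnext, h) for v in ypast])   # fitted residuals
    resid = resid - resid.mean()
    yn = x[-1]
    pred = nw(yn, ypast, xnext, h)                 # point predictor m_hat(y_n)
    roots = np.empty(B)
    for b in range(B):
        xb = np.empty(n)
        xb[0] = x[0]
        for t in range(1, n):                      # forward recursion, m_hat held fixed
            xb[t] = nw(xb[t - 1], ypast, xnext, h) + rng.choice(resid)
        m_star_yn = nw(yn, xb[:-1], xb[1:], h)     # re-estimated m* at y_n
        x_star = pred + rng.choice(resid)          # future value uses m_hat ("truth" in bootstrap world)
        roots[b] = x_star - m_star_yn
    q = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return pred + q[0], pred + q[1]
```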

Bootstrap Prediction Intervals for Markov Processes I

I TIME SERIES DATA: X_1, ..., X_n
I Drop the nonparametric autoregression model

I Assume Markov property (of order p):

f (Xt |Xt−1, Xt−2, ··· ) = f (Xt |Yt−1)

where Y_{t−1} = (X_{t−1}, ..., X_{t−p})'

Bootstrap Prediction Intervals for Markov Processes II

I Bootstrap based on transition density (Rajarshi 1990)

I Given observations x1, ··· , xn, the estimated transition density is constructed as follows,

f̂_n(x, y) = [1 / ((n − p) h_1 h_2)] Σ_{i=p+1}^{n} K((x − x_i)/h_1, ||y − y_{i−1}||/h_2)

f̂_n(y) = ∫ f̂_n(x, y) dx

f̂_n(x|y) = f̂_n(x, y) / f̂_n(y).

Bootstrap Prediction Intervals for Markov Processes III

I Generate pseudo series:
  I Forward bootstrap: x*_{t+1} ~ f̂_n(·|y*_t).
  I Backward bootstrap: x*_t ~ f̂_{b,n}(·|y*_{t+p}).

f̂_{b,n}(x, y) = [1 / ((n − p) h_1 h_2)] Σ_{i=p+1}^{n} K((x − x_{i−p})/h_1, ||y − y_i||/h_2)

f̂_{b,n}(y) = ∫ f̂_{b,n}(x, y) dx

f̂_{b,n}(x|y) = f̂_{b,n}(x, y) / f̂_{b,n}(y)

I Construct f̂*_n(x|y) based on the pseudo-data x*_1, x*_2, ..., x*_n.
I Predictive root: X_{n+1} − X̂_{n+1}, where X_{n+1} ~ f(·|y_n) and X̂_{n+1} = ∫ x f̂_n(x|y_n) dx.
I Bootstrap predictive root: X*_{n+1} − X̂*_{n+1}, where X*_{n+1} ~ f̂_n(·|y_n) and X̂*_{n+1} = ∫ x f̂*_n(x|y_n) dx.
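Drawing from the kernel transition-density estimate can be implemented as weighted resampling of successor values plus kernel jitter; the sketch below assumes p = 1, Gaussian kernels, and a hypothetical function name.

```python
import numpy as np

def draw_from_transition_density(y0, x, h1, h2, rng):
    """Draw one value from the kernel transition-density estimate f_hat_n(.|y0), p = 1.
    With Gaussian kernels this is weighted resampling of successors plus jitter of size h1."""
    ypast, xnext = x[:-1], x[1:]
    w = np.exp(-0.5 * ((ypast - y0) / h2) ** 2)    # weights from the conditioning kernel
    w = w / w.sum()
    i = rng.choice(len(xnext), p=w)                # pick a mixture component
    return xnext[i] + h1 * rng.standard_normal()   # add kernel noise in the x-direction
```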

Bootstrap Prediction Intervals for Markov Processes IV

I The Local Bootstrap [Paparoditis and Politis (2001)]
I Given observations x_1, ..., x_n, consider a finite-state Markov chain with discrete states x_1, ..., x_n.
I Estimate the one-step ahead transition probability by

P̂(X_{t+1} = x_{j+1} | Y_t = y_t) = W_b(y_t − y_j) / Σ_{m=p}^{n−1} W_b(y_t − y_m)

for j ∈ {p, ··· , n − 1} where Wb(·) is a positive kernel with bandwidth b. Then the estimator of the transition distribution is as follows,

F̂_n(x|y) = Σ_{j=p}^{n−1} 1_{(−∞,x]}(x_{j+1}) P̂(X_{t+1} = x_{j+1} | Y_t = y)
         = [ Σ_{j=p}^{n−1} 1_{(−∞,x]}(x_{j+1}) W_b(y − y_j) ] / [ Σ_{m=p}^{n−1} W_b(y − y_m) ].

Bootstrap Prediction Intervals for Markov Processes V

I Generate pseudo series:
  I Forward bootstrap: Generate x*_{t+1} from F̂_n(·|y*_t), i.e., let J be a discrete random variable taking its values in the set {p, ..., n − 1} with probability mass function given by

      P(J = s) = W_b(y*_t − y_s) / Σ_{m=p}^{n−1} W_b(y*_t − y_m),

      Then let x*_{t+1} = x_{J+1}.
  I Backward bootstrap: Generate x*_t from F̂_{b,n}(·|y*_{t+p}). Here the estimated backward conditional density and distribution are

      P̂_b(X_t = x_j | Y_{t+p} = y_{t+p}) = W_b(y_{t+p} − y_{j+p}) / Σ_{m=1}^{n−p} W_b(y_{t+p} − y_{m+p}),   j ∈ {1, 2, ..., n − p}, and

F̂_{b,n}(x|y) = Σ_{j=1}^{n−p} 1_{(−∞,x]}(x_j) P̂_b(X_j = x_j | Y_{j+p} = y)
            = [ Σ_{j=1}^{n−p} 1_{(−∞,x]}(x_j) W_b(y − y_{j+p}) ] / [ Σ_{m=1}^{n−p} W_b(y − y_{m+p}) ].

Bootstrap Prediction Intervals for Markov Processes VI

i.e. let J be a discrete random variable taking its values in the set {1, 2, ··· , n − p} with probability mass function given by

P(J = s) = W_b(y*_{t+p} − y_{s+p}) / Σ_{m=1}^{n−p} W_b(y*_{t+p} − y_{m+p}),

then let x*_t = x_J.

I Predictive root: X_{n+1} − X̂_{n+1}, where X_{n+1} has conditional distribution F(·|y_n) and

X̂_{n+1} = [ Σ_{j=p}^{n−1} W_b(y_n − y_j) x_{j+1} ] / [ Σ_{m=p}^{n−1} W_b(y_n − y_m) ].

I Bootstrap predictive root: X*_{n+1} − X̂*_{n+1}, where X*_{n+1} has conditional distribution F̂_n(·|y_n) and

X̂*_{n+1} = [ Σ_{j=p}^{n−1} W_b(y_n − y*_j) x*_{j+1} ] / [ Σ_{m=p}^{n−1} W_b(y_n − y*_m) ].
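One forward step of the local bootstrap can be sketched as follows; the Gaussian resampling kernel W_b and the helper name local_bootstrap_step are assumptions for illustration.

```python
import numpy as np

def local_bootstrap_step(y_star, x, p, b, rng):
    """One forward step of the local bootstrap: given the current history vector
    y_star = (x*_t, ..., x*_{t-p+1}), draw the next pseudo-value x*_{t+1}."""
    n = len(x)
    # lagged "history" vectors y_j built from the observed series
    Y = np.column_stack([x[p - 1 - k : n - 1 - k] for k in range(p)])
    w = np.exp(-0.5 * np.sum(((Y - y_star) / b) ** 2, axis=1))   # kernel weights W_b
    w = w / w.sum()
    J = rng.choice(len(w), p=w)                                  # resampled index
    return x[p + J]                                              # its successor value
```

Iterating this step (and re-estimating the transition distribution on the resulting pseudo-series) gives the forward local bootstrap described above; note that no new values are invented, only observed data points are resampled.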

The Rosenblatt Transformation

I Time series data: X_1, ..., X_n

I Define η1 = DX1 (X1) so η1 ∼ Unif (0, 1)

I Let η2 = DX2|X1 (X2|X1). Can show η2 ∼ Unif (0, 1) and independent of η1

I Let η3 = DX3|X2,X1 (X3|X2, X1). Can show η3 ∼ Unif (0, 1) and independent of η1 and η2.

I Continuing in this way, construct η_1, ..., η_n ~ i.i.d. Unif(0, 1)
I To use this we need to estimate n functions

D_{X1}(·), D_{X2|X1}(·), D_{X3|X2,X1}(·), ..., D_{Xn|Xn−1,...,X1}(·)

I Infeasible! Unless ... we have Markov structure, in which case all these functions (except the first) are the same!

Prediction Intervals for Markov Processes Based on the Model-free Prediction Method I

Rosenblatt Transformation for Markov Processes

I Distribution of Yt : D(y) = P(Yt ≤ y)

I Distribution of X_t given Y_{t−1} = y: D_y(x) = P(X_t ≤ x | Y_{t−1} = y)

I Let ηp = D(Yp) and ηt = DYt−1 (Xt ), for t = p + 1, ··· , n.

P(η_p ≤ z) = P(D(Y_p) ≤ z) = P(Y_p ≤ D^{-1}(z)) = D(D^{-1}(z)) = z

P(η_t ≤ z | Y_{t−1} = y) = P(D_y(X_t) ≤ z | Y_{t−1} = y)
                        = P(X_t ≤ D_y^{-1}(z) | Y_{t−1} = y)
                        = D_y(D_y^{-1}(z))
                        = z (Uniform), and does not depend on y (Independence).

Prediction Intervals for Markov Processes Based on the Model-free Prediction Method II

I Model free bootstrap

I Given observations x1, ··· , xn, we estimate Dy (x) by usual ˆ kernel estimator Dy (x),

D̂_y(x) = [ Σ_{i=p+1}^{n} 1{x_i ≤ x} K(||y − y_{i−1}||/h) ] / [ Σ_{k=p+1}^{n} K(||y − y_{k−1}||/h) ].

I D̂ is a step function; can use linear interpolation to produce a continuous function D̃.

Prediction Intervals for Markov Processes Based on the Model-free Prediction Method III

I Construct i.i.d. transformed data
  I fitted transformation: u_t = D̃_{y_{t−1}}(x_t), for t = p + 1, ..., n.
  I predictive transformation:

D̂^{(t)}_{y_{t−1}}(x_t) = [ Σ_{i=p+1, i≠t}^{n} 1{x_i ≤ x_t} K(||y_{t−1} − y_{i−1}||/h) ] / [ Σ_{k=p+1, k≠t}^{n} K(||y_{t−1} − y_{k−1}||/h) ],  t = p + 1, ..., n

After smoothing Dˆy (x) to D˜y (x), we get

u_t^{(t)} = D̃^{(t)}_{y_{t−1}}(x_t).

I The ut are independent—no need for backward bootstrap!

I Generate pseudo series: use D̃^{-1}_{y*_{t−1}} to generate x*_t = D̃^{-1}_{y*_{t−1}}(u*_t), where u*_t is sampled from u_{p+1}, ..., u_n or u^{(t)}_{p+1}, ..., u^{(t)}_n.

Prediction Intervals for Markov Processes Based on the Model-free Prediction Method IV

I Predictive root: X_{n+1} − X̂_{n+1}, where X_{n+1} has distribution D_{y_n}(x) and X̂_{n+1} = (1/(n − p)) Σ_{t=p+1}^{n} D̃^{-1}_{y_n}(u_t).
I Bootstrap predictive root: X*_{n+1} − X̂*_{n+1} = D̃^{-1}_{y_n}(u*_{n+1}) − (1/(n − p)) Σ_{t=p+1}^{n} D̃*^{-1}_{y_n}(u*_t).
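A sketch of the transformation step and the resulting point predictor for p = 1 is given below; linear interpolation plays the role of the smoothing D̂ → D̃, the kernel is Gaussian, and the function names are hypothetical.

```python
import numpy as np

def cond_cdf_markov(y0, x, h):
    """Kernel estimate of D_{y0}(t) = P(X_t <= t | X_{t-1} = y0), p = 1,
    returned as (sorted successor values, cumulative weights)."""
    ypast, xnext = x[:-1], x[1:]
    w = np.exp(-0.5 * ((ypast - y0) / h) ** 2)
    w = w / w.sum()
    order = np.argsort(xnext)
    return xnext[order], np.cumsum(w[order])

def mf_transform_and_predict(x, h):
    """Transform to approx. i.i.d. uniforms u_t = D-tilde_{x_{t-1}}(x_t) and form
    the predictor X-hat_{n+1} = mean over t of D-tilde^{-1}_{y_n}(u_t)."""
    x = np.asarray(x, float)
    u = np.array([np.interp(x[t], *cond_cdf_markov(x[t - 1], x, h))
                  for t in range(1, len(x))])
    vals, cw = cond_cdf_markov(x[-1], x, h)        # conditional cdf at y_n = x_n
    x_hat = np.interp(u, cw, vals).mean()          # average of conditional quantiles
    return u, x_hat
```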

I Smoothed Model Free Markov Chain bootstrap
  I Substitute 1{x_i ≤ x} in D̂_y(x) with a continuous cdf G((x − x_i)/h_0).
  I The new estimator Ŝ_y(x) is defined by

Ŝ_y(x) = [ Σ_{i=p+1}^{n} G((x − x_i)/h_0) K(||y − y_{i−1}||/h) ] / [ Σ_{k=p+1}^{n} K(||y − y_{k−1}||/h) ]

I Substitute Ŝ_y(x) in place of D̃_y(x) to get the Smoothed Model Free (SMF) method.

Simulation Results: Nonparametric autoregression I

I Model 1: X_{t+1} = sin(X_t) + ε_{t+1}
I Model 2: X_{t+1} = sin(X_t) + sqrt(0.5 + 0.25 X_t²) ε_{t+1}

Table : nonparametric autoregression with i.i.d innovations-model 1

normal innovations     nominal coverage 95%        nominal coverage 90%
n = 100                CVR     LEN     st.err      CVR     LEN     st.err
fitted residuals       0.927   3.860   0.393       0.873   3.255   0.310
predictive residuals   0.943   4.099   0.402       0.894   3.456   0.317
n = 200
fitted residuals       0.938   3.868   0.272       0.886   3.263   0.219
predictive residuals   0.948   4.012   0.283       0.899   3.385   0.231
Laplace innovations
n = 100
fitted residuals       0.933   4.161   0.648       0.879   3.218   0.452
predictive residuals   0.944   4.430   0.658       0.896   3.445   0.470
n = 200
fitted residuals       0.937   4.122   0.460       0.885   3.198   0.329
predictive residuals   0.943   4.275   0.455       0.895   3.341   0.341

Simulation Results: Nonparametric autoregression II

Table : nonparametric autoregression with heteroscedastic innovations-model 2 (needs oversmoothed bandwidth g = 2h)

g = 2h                 nominal coverage 95%        nominal coverage 90%
normal innovations     CVR     LEN     st.err      CVR     LEN     st.err
n = 100
fitted residuals       0.894   3.015   0.926       0.843   2.566   0.783
predictive residuals   0.922   3.318   1.003       0.868   2.744   0.826
n = 200
fitted residuals       0.903   2.903   0.774       0.848   2.537   0.647
predictive residuals   0.921   3.164   0.789       0.863   2.636   0.654
Laplace innovations    CVR     LEN     st.err      CVR     LEN     st.err
n = 100
fitted residuals       0.895   3.197   1.270       0.843   2.521   0.909
predictive residuals   0.921   3.662   1.515       0.866   2.740   0.967
n = 200
fitted residuals       0.905   3.028   0.955       0.851   2.395   0.747
predictive residuals   0.921   3.285   1.029       0.864   2.514   0.776

Simulation Results: Markov bootstrap - model 1 I

normal innovation   nominal coverage 95%        nominal coverage 90%
n = 50              CVR     LEN     std.err     CVR     LEN     std.err
Rajarshi-f          0.931   4.126   0.760       0.886   3.544   0.667
Rajarshi-b          0.931   4.130   0.757       0.887   3.555   0.677
Local-f             0.910   3.885   0.778       0.862   3.337   0.685
Local-b             0.911   3.920   0.795       0.863   3.355   0.676
MF-fitted           0.893   3.647   0.740       0.844   3.150   0.645
MF-predictive       0.926   4.108   0.854       0.880   3.482   0.721
SMF-fitted          0.892   3.627   0.717       0.843   3.131   0.619
SMF-predictive      0.939   4.293   0.828       0.893   3.614   0.709
n = 100
Rajarshi-f          0.942   4.137   0.627       0.901   3.535   0.519
Rajarshi-b          0.942   4.143   0.621       0.900   3.531   0.519
Local-f             0.930   3.980   0.625       0.886   3.409   0.511
Local-b             0.932   4.001   0.605       0.886   3.411   0.508
MF-fitted           0.917   3.771   0.602       0.869   3.238   0.497
MF-predictive       0.941   4.152   0.680       0.897   3.511   0.550
SMF-fitted          0.916   3.731   0.551       0.869   3.221   0.489
SMF-predictive      0.946   4.231   0.647       0.902   3.471   0.530

Simulation Results: Markov bootstrap - model 1 II

Laplace innovation   nominal coverage 95%        nominal coverage 90%
n = 50               CVR     LEN     std.err     CVR     LEN     std.err
Rajarshi-f           0.913   4.072   1.033       0.873   3.429   0.894
Rajarshi-b           0.913   4.081   1.021       0.873   3.441   0.897
Local-f              0.902   4.036   1.138       0.856   3.313   0.935
Local-b              0.905   4.046   1.103       0.861   3.324   0.847
MF-fitted            0.889   3.708   0.986       0.844   3.086   0.803
MF-predictive        0.922   4.278   1.129       0.876   3.494   0.993
SMF-fitted           0.891   3.715   0.963       0.846   3.084   0.764
SMF-predictive       0.930   4.442   1.126       0.887   3.620   0.972
n = 100
Rajarshi-f           0.925   4.236   1.027       0.885   3.424   0.763
Rajarshi-b           0.926   4.239   1.024       0.885   3.437   0.764
Local-f              0.923   4.153   0.935       0.878   3.323   0.714
Local-b              0.923   4.189   0.986       0.879   3.356   0.724
MF-fitted            0.911   3.884   0.895       0.864   3.124   0.643
MF-predictive        0.935   4.426   1.001       0.891   3.510   0.765
SMF-fitted           0.910   3.846   0.856       0.864   3.106   0.623
SMF-predictive       0.941   4.544   0.965       0.896   3.542   0.738

Simulation Results: Markov bootstrap - model 2 I

normal innovation   nominal coverage 95%        nominal coverage 90%
n = 50              CVR     LEN     std.err     CVR     LEN     std.err
Rajarshi-f          0.927   3.477   0.802       0.882   2.983   0.716
Rajarshi-b          0.928   3.495   0.832       0.882   2.992   0.722
Local-f             0.914   3.424   0.833       0.867   2.893   0.717
Local-b             0.914   3.451   0.803       0.867   2.924   0.692
MF-fitted           0.891   3.084   0.742       0.841   2.660   0.670
MF-predictive       0.926   3.549   0.849       0.880   2.989   0.725
SMF-fitted          0.890   3.073   0.712       0.842   2.653   0.630
SMF-predictive      0.931   3.687   0.836       0.891   3.078   0.699
n = 100
Rajarshi-f          0.943   3.421   0.629       0.901   2.928   0.561
Rajarshi-b          0.943   3.439   0.648       0.901   2.930   0.573
Local-f             0.942   3.425   0.578       0.898   2.894   0.483
Local-b             0.941   3.406   0.562       0.895   2.858   0.462
MF-fitted           0.921   3.134   0.578       0.874   2.685   0.504
MF-predictive       0.945   3.495   0.636       0.900   2.916   0.533
SMF-fitted          0.921   3.119   0.551       0.874   2.679   0.476
SMF-predictive      0.951   3.587   0.620       0.908   2.964   0.513

Simulation Results: Markov bootstrap - model 2 II

Laplace innovation   nominal coverage 95%        nominal coverage 90%
n = 50               CVR     LEN     std.err     CVR     LEN     std.err
Rajarshi-f           0.910   3.359   0.964       0.870   2.828   0.847
Rajarshi-b           0.911   3.369   0.953       0.871   2.839   0.841
Local-f              0.924   3.482   0.839       0.877   2.762   0.654
Local-b              0.924   3.531   0.871       0.881   2.810   0.676
MF-fitted            0.891   3.102   0.932       0.846   2.554   0.764
MF-predictive        0.920   3.616   1.064       0.879   2.916   0.855
SMF-fitted           0.893   3.093   0.871       0.846   2.539   0.707
SMF-predictive       0.930   3.774   1.094       0.888   3.017   0.857
n = 100
Rajarshi-f           0.926   3.518   0.952       0.887   2.867   0.779
Rajarshi-b           0.926   3.526   0.963       0.886   2.872   0.789
Local-f              0.932   3.425   0.717       0.888   2.724   0.580
Local-b              0.933   3.477   0.735       0.888   2.751   0.592
MF-fitted            0.912   3.200   0.740       0.865   2.578   0.581
MF-predictive        0.938   3.755   0.905       0.894   2.918   0.663
SMF-fitted           0.912   3.177   0.724       0.864   2.559   0.555
SMF-predictive       0.943   3.880   0.892       0.899   2.950   0.605

Simulation Results

I For local bootstrap, we select the bandwidth by cross validation.

I For Markov bootstrap based on transition density, we choose the bandwidth h = 0.9 A n^{-1/4}, where A = min(σ̂, IQR/1.34). [The unusual exponent −1/4 is to ensure convergence of the estimator of the transition density.]
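As a sketch, this bandwidth rule can be coded as follows (the helper name is hypothetical):

```python
import numpy as np

def transition_density_bandwidth(x):
    """Bandwidth rule h = 0.9 * A * n**(-1/4), with A = min(sigma_hat, IQR/1.34)."""
    x = np.asarray(x, float)
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    A = min(x.std(ddof=1), iqr / 1.34)
    return 0.9 * A * len(x) ** (-0.25)
```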

I For the model-free method we select the bandwidth h by cross validation, and take h_0 in the smoothed model-free method to be h².
I If we use cross validation to select the bandwidth for the Markov bootstrap based on the transition density, the prediction interval yields over-coverage. The following table shows the simulation results for the Markov bootstrap based on the transition density using cross validation (on the scatterplot, not time-series cross validation).

Simulation Results - model 1 using cross validation

Normal innovation   nominal coverage 95%        nominal coverage 90%
n = 50              CVR     LEN     std.err     CVR     LEN     std.err
Rajarshi-f          0.934   4.628   1.226       0.899   3.942   1.027
Rajarshi-b          0.935   4.641   1.208       0.901   3.953   1.008
n = 100
Rajarshi-f          0.959   4.485   0.808       0.926   3.899   0.668
Rajarshi-b          0.960   4.488   0.806       0.926   3.897   0.661
n = 200
Rajarshi-f          0.959   4.400   0.630       0.922   3.735   0.521
Rajarshi-b          0.959   4.399   0.632       0.922   3.741   0.524

Laplace innovation   nominal coverage 95%        nominal coverage 90%
n = 50               CVR     LEN     std.err     CVR     LEN     std.err
Rajarshi-f           0.919   4.624   1.440       0.884   3.853   1.185
Rajarshi-b           0.921   4.634   1.410       0.886   3.867   1.165
n = 100
Rajarshi-f           0.943   4.642   1.053       0.908   3.799   0.837
Rajarshi-b           0.944   4.659   1.057       0.909   3.801   0.822
n = 200
Rajarshi-f           0.944   4.510   0.878       0.907   3.625   0.656
Rajarshi-b           0.944   4.506   0.869       0.908   3.632   0.644