BAYESIAN ANALYSIS OF NON-GAUSSIAN STOCHASTIC PROCESSES FOR TEMPORAL AND SPATIAL DATA

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Jiangyong (Matthew) Yin, M.S.

Graduate Program in

The Ohio State University

2014

Dissertation Committee:

Peter F. Craigmile, Advisor
Xinyi Xu, Advisor
Steven N. MacEachern

© Copyright by

Jiangyong (Matthew) Yin

2014

Abstract

The Gaussian process is the most commonly used approach for modeling temporal and geostatistical data. The Gaussianity assumption, however, is known to be insufficient or inappropriate in many problems. In this dissertation, I develop specific non-Gaussian models to capture the asymmetry and heavy tails of many real-world data indexed in the time, space or space-time domain.

Chapter 2 of this dissertation deals with a particular non-Gaussian time series model: the stochastic volatility model. The parametric stochastic volatility model is a nonlinear state space model whose state equation is traditionally assumed to be linear. Nonparametric stochastic volatility models provide great flexibility for modeling financial volatilities, but they often fail to account for useful shape information. For example, a model may not use the knowledge that the autoregressive component of the volatility equation is monotonically increasing in the lagged volatility. I propose a class of additive stochastic volatility models that allow for different shape constraints and can incorporate the leverage effect, the asymmetric impact of positive and negative return shocks on volatilities. I develop a Bayesian model fitting algorithm and demonstrate model performance on simulated and empirical datasets. Unlike general nonparametric models, the proposed model sacrifices little when the true volatility equation is linear. In nonlinear situations, the proposed method improves the model fit and the ability to estimate volatilities over general, unconstrained, nonparametric models, while at the same time maintaining more modeling flexibility than parametric models.

The second part of this dissertation focuses on non-Gaussian spatial processes. In Chapter 3, I first introduce a general framework for constructing non-Gaussian spatial processes using transformations of a latent multivariate Gaussian process. Based on this framework, I then develop a heteroscedastic asymmetric spatial process (HASP) for capturing the non-Gaussian features of environmental or climatic data, such as heavy tails and skewness. The conditions for this non-Gaussian spatial process to be well defined are discussed at length. The properties of the HASP, especially its marginal moments and covariance structure, are established, along with a Markov chain Monte Carlo (MCMC) procedure for sampling from the posterior distribution. The HASP model is used to study a US nitrogen dioxide concentration dataset. It is demonstrated that the ability of the HASP to capture asymmetry and heavy tails benefits its predictive performance.

Finally, I highlight extensions of the proposed methods in the temporal and spatial domains in several new directions. I discuss the application of the proposed methodology to GARCH-type nonparametric volatility models as well as to other state space models, in particular the stochastic conditional duration model. I also discuss extensions of the heteroscedastic asymmetric spatial process to the space-time domain.

To my parents

Acknowledgments

First and foremost, I owe my deepest gratitude to my advisors, Peter Craigmile and Xinyi Xu, for taking me in as their student from my very first year at Ohio State and devoting so much of their time to the individual studies that we did together, for their continued guidance and coaching throughout the past five years even when we were oceans or cities apart, for their amazing patience and tolerance in the face of my often wrong opinions and sometimes unreasonable requests, for imparting their knowledge and wisdom to me without any reservation, and for their financial support of my academic growth.

I would like to thank my committee member, Professor Steven MacEachern, for his guidance and his invaluable suggestions during my candidacy exam, which have contributed to the development of part of this dissertation. I would also like to thank Dr. Yoonkyung Lee, who was on my candidacy exam committee, for her time and for allowing me to sit in her classes asking numerous questions.

I especially want to thank Dr. Christopher Holloman for the two years of great consulting experience at SCS and for giving me the opportunity to work on the Nationwide project.

I really appreciate the financial support of the university and the department, which has allowed me to complete my graduate studies and also afforded me the chance to see much of this country. I have also benefited greatly from the teachings of so many great statisticians at Ohio State, without whom I would not have achieved the statistical maturity that I have today.

To my friends with whom I have shared too many drunken nights, Agniva, Casey, Dani, Grant, Jingjing, John, Sarah, Steve, Tyler and many others: thank you for making the past five years in Columbus a fun time.

This dissertation is dedicated to my parents, Min Zhang and Mingshi Yin, who, with only high school educations, have worked so hard to put me through college and have always supported me no matter where I decide to go or what I decide to do.

This dissertation research is supported in part by the National Science Foundation (NSF) under grants DMS-0906864, DMS-1209194 and SES-1024709.

Vita

09/09/1983 ...... Born in Rushan, China

09/2002 - 07/2006 ...... Bachelor of Science in Statistics, Fudan University, China
07/2006 - 07/2009 ...... Executive and Senior Executive, ACNielsen, Shanghai
09/2009 - 08/2011 ...... Master of Science in Statistics, The Ohio State University

Publications

Research Publications

Jiangyong Yin, Peter F. Craigmile and Xinyi Xu. Shape-constrained Semiparametric Additive Stochastic Volatility Models. Submitted, May 2014.

Jiangyong Yin and Xinyi Xu. Portfolio Optimization Using Constrained Hierarchical Bayes Models. Department of Statistics Technical Report No. 874. The Ohio State University, Aug. 2013.

Fields of Study

Major Field: Statistics

Table of Contents


Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters

1. Introduction

1.1 Motivation
1.2 Non-Gaussian Time Series Models
1.3 Non-Gaussian Spatial and Spatio-temporal Processes
1.4 Outline of the Dissertation

2. Shape-constrained Semiparametric Additive Stochastic Volatility Models

2.1 Background
  2.1.1 Asymmetric or Semiparametric Stochastic Volatility Models
  2.1.2 The Role of Shape Constraints in Semiparametric SV Models
2.2 Model Specification
  2.2.1 Semiparametric Additive Stochastic Volatility Models
  2.2.2 Shape-constrained Semiparametric Additive Stochastic Volatility Models
2.3 Model Fitting and Comparison
  2.3.1 MCMC Procedure
  2.3.2 Model Comparison Criteria
2.4 Simulations
  2.4.1 Uncentered versus Centered Basis Functions
  2.4.2 Unleveraged SV Models
  2.4.3 Leveraged SV Models
2.5 Empirical Studies
2.6 Discussion

3. Heteroscedastic Asymmetric Spatial Processes

3.1 Introduction
3.2 A General Framework for Constructing Non-Gaussian Processes
3.3 Heteroscedastic Asymmetric Spatial Process (HASP)
  3.3.1 Model Specification and Properties
  3.3.2 The Covariance Properties of the HASP Model
  3.3.3 Linear Co-Regionalization Version of the HASP Model
  3.3.4 More on the Choices of the Correlation and Cross-Correlation Functions
  3.3.5 Examples of the HASP Sample Paths
3.4 Model Fitting and Spatial Prediction
  3.4.1 The Likelihood Function
  3.4.2 Model Fitting Strategy
  3.4.3 Spatial Prediction
3.5 Using HASP to Model the Nitrogen Dioxide Pollution Data in the Contiguous United States
3.6 Discussion

4. Concluding Remarks and Future Work

4.1 Summary
4.2 Extensions of the Shape-Constrained Semiparametric Additive SV Model
  4.2.1 GARCH-type models
  4.2.2 Other non-Gaussian nonlinear state-space models
4.3 Extensions of the Heteroscedastic Asymmetric Spatial Process
  4.3.1 Into the spatio-temporal domain
  4.3.2 Other directions
4.4 Closing Remarks

Bibliography

List of Tables


2.1 For the three true unleveraged SV models (Lf, Cf and Sf), a summary of the MSE and MAE of the log volatilities, {h_t}_{t=1}^N, obtained from fitting five different SV models. The averages and standard errors are multiplied by 100, and are based on 200 replicate series.

2.2 For the three true leveraged SV models (Lf-Lg, Lf-NLg and NLf-NLg), a summary of the MSE and MAE of the log volatilities, {h_t}_{t=1}^N, obtained from fitting five different SV models. The averages and standard errors are multiplied by 100, and are based on 200 replicate series.

2.3 Predictive log likelihood of daily returns from November 1, 2010 to October 31, 2013 using six different SV models. (u) - unleveraged; (l) - leveraged. The results of the best performing model (those with the largest predictive log likelihood) are in bold face.

3.1 Predictive performance of the GP, SHP and HASP models fitted on the original and the transformed data. The highlighted values are the smallest within each group.

List of Figures


2.1 A comparison of the centered and uncentered basis functions. 7 knots (unequally spaced in the first row and equally spaced in the second row) are used for illustration purposes, where the first and last knot denote the smallest and largest possible values of the predictor x. Different colors and line types are used to differentiate different basis functions. Note how the centered and uncentered basis functions differ relative to the horizontal line at 0.

2.2 Stochastic volatility (grey line) vs. realized volatility (dark line) vs. log(r_t^2)/2 (dots): the distribution of the realized volatilities is indeed less left skewed than that of log(r_t^2)/2, but is still much noisier than the estimated stochastic volatilities and slightly left skewed.

2.3 Trace plots of the parameters µ and β_1 in model (2.2.5) with centered versus uncentered basis functions.

2.4 An example of the simulated datasets for the case of sigmoid f. The top plot shows the simulated return series, while the bottom one shows the simulated latent log volatility series, {h_t}. The quasi-Markov-switching feature is easily noticeable in the lower plot.

2.5 Estimated autoregressive function f for the unleveraged SV models based on a single simulated dataset, using 10 knot intervals. The dashed line in each graph shows the true functional form of f. The solid line shows the posterior mean of f, and the shaded region indicates the point-wise 95% credible bands for f.

2.6 Estimated autoregressive function f for the unleveraged SV models based on a single simulated dataset, using 20 knot intervals. The dashed line in each graph shows the true functional form of f. The solid line shows the posterior mean of f, and the shaded region indicates the point-wise 95% credible bands for f.

2.7 For the three true leveraged SV models (Lf-Lg, Lf-NLg and NLf-NLg), the estimated autoregressive function f (odd rows) and leverage function g (even rows) in different fitted models (columns) based on 10 knot intervals. The line types and shading are defined in the same way as in Figure 2.5.

2.8 For the three true leveraged SV models (Lf-Lg, Lf-NLg and NLf-NLg), the estimated autoregressive function f (odd rows) and leverage function g (even rows) in different fitted models (columns) based on 20 knot intervals. The line types and shading are defined in the same way as in Figure 2.5.

2.9 Fitted autoregressive function f and leverage function g for the daily returns of the S&P500, EQR, MSFT and JnJ. The solid lines show the posterior means and 95% credible bands of the fitted functions in the leveraged semiparametric models with shape constraints, while the dashed lines represent the posterior means and 95% credible bands in the leveraged semiparametric models without shape constraints.

3.1 Plots (i)–(iv) show the mean, variance, skewness and excess kurtosis, respectively, of the process Z(s) as a function of τ and ϱ. Plot (v) shows the simulated density function for the marginal distribution of Z(s) for τ = 0.5. The dotted line in each plot denotes the case ϱ = 0, the dot-dash lines ϱ = ±0.3, the dashed lines ϱ = ±0.6 and the dark solid lines ϱ = 0.9. The grey solid line in plot (v) denotes the density function of a standard normal distribution.

3.2 The shape of the correlation function implied by (3.3.19) for different choices of ϱ(h). Plots (i)–(iii) use the exponential correlation function for ϱ(h), while plots (iv)–(vi) use the Gaussian correlation function. The values of the other parameters are shown in the plots.

3.3 The shape of the correlation function implied by (3.3.20), where ρ_α(h) assumes the form of a Gaussian correlation function and ρ_ξ(h) the form of an exponential correlation function. The values of the other parameters are shown in the plots.

3.4 Five sample paths from each of the four spatial processes in R: GP, GLG/SHP, HASP with positive skewness and HASP with negative skewness. An exponential correlation function is used for the GP. For the non-Gaussian processes, the same exponential correlation function is used for the marginal as well as the cross-correlation functions of the latent multivariate Gaussian process. The sample paths for different processes bear resemblance to each other because the same seed is used for random number generation, for easier comparison of the different processes. The sample paths are approximated based on a finite number of observations on a grid with an increment of 0.1. Realizations of the HASP model are computed according to (3.3.17).

3.5 Five sample paths from each of the four spatial processes in R: GP, GLG/SHP, HASP with positive skewness and HASP with negative skewness. A Gaussian (or double exponential) correlation function is used for the GP. For the non-Gaussian processes, the same Gaussian correlation function is used for the marginal as well as the cross-correlation functions of the latent multivariate Gaussian process. The sample paths are constructed in the same way as in Figure 3.4.

3.6 Five sample paths from each of the four spatial processes in R^2: GP, GLG/SHP, HASP with positive skewness and HASP with negative skewness. An exponential correlation function is used for the GP. For the non-Gaussian processes, the same exponential correlation function is used for the marginal as well as the cross-correlation functions of the latent multivariate Gaussian process. Again, the sample paths for different processes bear resemblance to each other because the same seed is used for random number generation, for easier comparison of the different processes. The sample paths are approximated based on a finite number of observations on a rectangular grid with an increment of 0.1 in each direction. Realizations of the HASP model are computed according to (3.3.17).

3.7 Five sample paths from each of the four spatial processes in R^2: GP, GLG/SHP, HASP with positive skewness and HASP with negative skewness. A Gaussian correlation function is used for the GP. For the non-Gaussian processes, the same Gaussian correlation function is used for the marginal as well as the cross-correlation functions of the latent multivariate Gaussian process. The sample paths are constructed in the same way as in Figure 3.6.

3.8 The ambient NO2 concentration levels measured across the contiguous United States on September 9, 2013. In the top graph, each circle represents an observation site. Blue circles represent the observation sites used in the training dataset, while the red circles indicate sites randomly chosen for prediction evaluation. The size of the circles (measured by the area, not the radius) is proportional to the NO2 concentration levels. The three histograms show the distributions of the original data, the log-transformed data and the square-root-transformed NO2 concentrations.

3.9 The three graphs in the first column are based on the original data, while those in the second column are based on the transformed data. Plots a) and b) show the posterior density of the co-locational correlation coefficient ϱ, which also measures the degree of skewness in the data. Plots c) and d) present the posterior density of the parameter κ, which is the correlation length in the Gaussian process as well as in the respective processes ε(s) and ξ(s) in the SHP and HASP models. Finally, plots e) and f) demonstrate the implied correlation functions of the fitted Gaussian and non-Gaussian processes.

3.10 Predictive distributions for the NO2 levels at the 25 sites in the test dataset, which are based on the three models fitted on the original data. The red lines represent the predictive distributions of the HASP model, while the blue and grey lines show the predictive distributions of the SHP and GP models, respectively. The dashed lines represent the actual observed values at these test sites...... 118

3.11 Predictive distributions for the NO2 levels on the original scale at the 25 sites in the test dataset, which are based on the three models fitted on the transformed data. The red lines represent the predictive distributions of the HASP model, while the blue and grey lines show the predictive distributions of the SHP and GP models, respectively. The dashed lines represent the actual observed values at these test sites.

Chapter 1: Introduction

1.1 Motivation

Univariate and multivariate time series, spatial processes and space-time models are all special stochastic processes that play fundamental roles in the study of real-world data that are sampled temporally and/or spatially, such as economic, financial, environmental and climatic data. Early modeling efforts were mostly focused on second-order stationary Gaussian processes, such as the stationary autoregressive moving average (ARMA) model for time series analysis (see, e.g., Brockwell and Davis, 2009; Tsay, 2005) and the isotropic Gaussian process for spatial interpolation and function estimation (see, e.g., Cressie, 1993). Although the stationarity and Gaussianity assumptions in these models are reasonable for a wide range of problems, they can be inappropriate in many other settings. In financial time series, there is compelling evidence that financial returns are non-Gaussian (with heavy tails and left skewness) and exhibit temporal dependence in the squared returns. The statistical modeling of these empirical features started with the seminal work of Engle (1982). Since the second moment of financial returns is widely used as a proxy for investment risk in financial econometrics, this type of model has found significant applications in risk management, option pricing, and portfolio optimization, just to name a few.

Modeling non-Gaussian features is also important in the field of geostatistics. For example, Hering and Genton (2010) argued that a skewed-t distribution is more appropriate than a normal distribution for the error terms of their wind speed model. Furthermore, correctly modeling the distribution of wind speed is important not only for the prediction of wind speed itself, but also for the prediction of wind power production. It is well known that the functional relationship between the wind speed and the power production of a wind turbine, called the power curve, is highly nonlinear (see, e.g., Lange, 2005). As a result, a good estimate of the mean wind speed alone is not sufficient for obtaining a good estimate of the wind power production. Non-Gaussian features in the wind speed, such as skewness and heavy tails, play a significant role as well. As another example, Damian et al. (2003) found that the variances of log-transformed 10-day aggregate precipitation at 39 stations in the Languedoc-Roussillon region of France are not constant in space, and they demonstrated spatial dependence as well. This means that at any given time, the purely spatially indexed data should have spatially varying variances. We refer to this phenomenon as spatial heteroscedasticity in this dissertation. From a purely spatial point of view, spatial heteroscedasticity means that there are regions in the spatial domain whose values depart from the underlying trend surface by enough that, under the Gaussian process assumption, they need to be explained by relatively large variances compared to other regions (Palacios and Steel, 2006). This is different from spatial outliers, where the extreme departure from the mean surface under the Gaussian assumption is isolated rather than clustered. Borrowing a term from the time series literature, we can also call this phenomenon variance clustering.

This dissertation is dedicated to the modeling of non-Gaussian features in empirical data from the fields of financial and environmental studies, especially skewness and heavy tails. In particular, this dissertation focuses on the study of non-Gaussian temporal, spatial and spatio-temporal processes of the form

Y(u) = exp(α(u)) ε(u),   (1.1.1)

where α(u) and ε(u) are two latent Gaussian processes and u is a temporal, spatial or spatio-temporal index that can be countable or uncountable.

1.2 Non-Gaussian Time Series Models

In financial econometrics, volatility is usually defined as the conditional standard deviation of a discrete-time return series {r_t, t ∈ Z}; namely, σ_t = √Var(r_t | F_{t-1}), where F_{t-1} is the filtration up to time t − 1. The development of volatility models has been largely motivated by the following empirical features commonly observed in real-world financial returns (see, e.g., Rydberg, 2000):

• Heavy tails. It has been generally accepted that the distribution of financial returns has heavier tails than a normal distribution.

• Volatility clustering. Large price changes tend to cluster together, which can be explained by serial dependence in the volatility process.

• Asymmetry. There is evidence that a high volatility period tends to follow a large negative return rather than a positive one. In other words, positive and negative returns have asymmetric impacts on the volatility. Engle and Ng (1993) used the so-called news impact curve to capture the functional dependence of volatilities on returns. Another type of asymmetry is that the distribution of stock returns is slightly negatively skewed.

Systematic modeling of financial volatility started with the autoregressive conditional heteroscedastic (ARCH) model in the seminal paper of Engle (1982). Bollerslev (1986) proposed the generalized autoregressive conditional heteroscedastic (GARCH) model, which allows for a parsimonious representation of the volatility process. A time series {r_t}_{t∈Z} (e.g., the returns of a financial asset) is called a GARCH(p, q) process if

r_t = σ_t ε_t,  t ∈ Z,
σ_t^2 = ω + Σ_{i=1}^p α_i r_{t-i}^2 + Σ_{j=1}^q β_j σ_{t-j}^2,

where {ε_t}_{t∈Z} is an independent and identically distributed (iid) process with mean zero and variance 1. A number of parametric models have been proposed to describe the asymmetry feature of financial volatilities, among which the "GJR-GARCH" model proposed by Glosten et al. (1993) and the "EGARCH" model of Nelson (1991) are the most well known. The GJR-GARCH model captures the asymmetric news impact curve with a piecewise linear function in r_t^2:

r_t = σ_t ε_t,  t ∈ Z,
σ_t^2 = ω + α_1 r_{t-1}^2 + β_1 σ_{t-1}^2 + α_2 r_{t-1}^2 I(r_{t-1} < 0).
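As a quick illustration (a simulation sketch with arbitrary illustrative parameter values, not code or estimates from this dissertation), the GJR-GARCH(1,1) recursion can be iterated directly; setting α_2 = 0 recovers a plain GARCH(1,1). The simulation reproduces two of the stylized facts listed earlier: marginal heavy tails and volatility clustering, even though the innovations are Gaussian.

```python
import numpy as np

def simulate_gjr_garch(n, omega=0.05, alpha1=0.05, beta1=0.85, alpha2=0.1,
                       seed=0):
    """Simulate r_t = sigma_t * eps_t with the GJR-GARCH(1,1) variance
    sigma_t^2 = omega + alpha1*r_{t-1}^2 + beta1*sigma_{t-1}^2
                + alpha2*r_{t-1}^2 * I(r_{t-1} < 0)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    r = np.zeros(n)
    sigma2 = np.zeros(n)
    # start at the unconditional variance (E[I(r<0)] = 1/2 for symmetric eps)
    sigma2[0] = omega / (1.0 - alpha1 - beta1 - 0.5 * alpha2)
    r[0] = np.sqrt(sigma2[0]) * eps[0]
    for t in range(1, n):
        sigma2[t] = (omega + alpha1 * r[t - 1]**2 + beta1 * sigma2[t - 1]
                     + alpha2 * r[t - 1]**2 * (r[t - 1] < 0.0))
        r[t] = np.sqrt(sigma2[t]) * eps[t]
    return r, sigma2

r, sigma2 = simulate_gjr_garch(50000)
# Heavy tails: positive excess kurtosis despite Gaussian eps_t.
excess_kurt = np.mean(r**4) / np.mean(r**2)**2 - 3.0
# Volatility clustering: squared returns are positively autocorrelated.
x = r**2 - np.mean(r**2)
acf1 = np.sum(x[1:] * x[:-1]) / np.sum(x * x)
```

The parameter values here satisfy the stationarity condition α_1 + β_1 + α_2/2 < 1; they are chosen for illustration only.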

In contrast, the EGARCH model describes the asymmetric news impact curve through a piecewise linear function in ε_{t-1}:

r_t = exp(h_t/2) ε_t,  t ∈ Z,
h_t = γ_0 + γ_1 h_{t-1} + g(ε_{t-1}),

where

g(ε_t) = ω ε_t + λ(|ε_t| − E|ε_t|).
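For intuition (an illustrative sketch, not the dissertation's code), the asymmetry of the news impact function g can be checked numerically: with ω < 0 the function is piecewise linear with slope λ − ω on the negative side and λ + ω on the positive side, so a negative shock raises log volatility more than a positive shock of the same magnitude. The values ω = −0.1 and λ = 0.2 are arbitrary.

```python
import math

def g(eps, omega=-0.1, lam=0.2):
    """EGARCH news impact function g(eps) = omega*eps + lam*(|eps| - E|eps|),
    where E|eps| = sqrt(2/pi) for standard normal eps."""
    e_abs = math.sqrt(2.0 / math.pi)
    return omega * eps + lam * (abs(eps) - e_abs)

# A unit negative shock vs. a unit positive shock; the gap equals -2*omega.
g_neg, g_pos = g(-1.0), g(1.0)
```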

Nonparametric (and usually nonstationary) models for describing financial volatilities include the partially nonparametric (PNP) model of Engle and Ng (1993), the locally stationary ARCH model of Dahlhaus and Subba Rao (2006), a number of additive ARCH models (see Kim and Linton, 2004; Linton and Mammen, 2005; Yang, 2006, etc.), and several nonparametric GARCH models, such as the approach of Audrino and Bühlmann (2009), which is based on multivariate B-splines and estimated with a coordinate-wise gradient descent algorithm.

Another popular approach for describing the dependence in the second moment of financial returns is the so-called stochastic volatility (SV) model, which originated from Hull and White (1987)'s stochastic differential equation for option pricing. The basic parametric (Gaussian) SV model (Taylor, 1994) is usually given as

r_t = ω + exp(h_t/2) ε_t,  ε_t ~ iid N(0, 1);
h_t = µ + φ h_{t-1} + η_t,  η_t ~ iid N(0, σ_1^2),   (1.2.1)

where {ε_t}_{t∈Z} and {η_t}_{t∈Z} are two mutually independent innovation processes. Model (1.2.1) is essentially a nonlinear state space model with the state variables, the log volatilities, following an autoregressive (AR) process.

Model (1.2.1) can explain both the heavy tails and the volatility clustering features introduced above, but it does not accommodate the asymmetric effect of positive and negative returns on volatilities or the skewness in the marginal distribution of financial returns. To capture asymmetry, researchers introduced SV models with leverage effects (see Harvey and Shephard, 1996; Yu, 2005). A leverage effect is defined as a (usually negative) correlation between the innovations of the volatility process and the lagged innovations of the return process, i.e., corr(η_t, ε_{t-1}) = ρ < 0 in (1.2.1). More details are given in Chapter 2.

Although the linearity assumption in both model forms lends itself to the development of efficient estimation procedures, it remains a convenient approximation rather than the truth. In Chapter 2, I propose a semiparametric temporal process for {h_t}_{t∈Z} that offers more modeling flexibility than parametric models. In nonlinear state space models, efficiently modeling the different latent features with semiparametric methods remains an important yet challenging issue due to the large dimension of the latent state space. I address this issue by incorporating shape constraints in the semiparametric nonstationary volatility process, which proves to be crucial for achieving better model performance and also provides a better fit for empirical data.

1.3 Non-Gaussian Spatial and Spatio-temporal Processes

A spatial or spatio-temporal process is a stochastic process defined on the domain {s : s ∈ D ⊂ R^d} or {(s, t) : (s, t) ∈ D ⊂ R^d × R}. Consider a univariate spatio-temporal process {Y(s, t), (s, t) ∈ D ⊂ R^d × R}. The process Y(s, t) is most commonly assumed to be a covariance (or second-order) stationary Gaussian process (GP); i.e., it satisfies the following conditions:

1) E[Y(s, t)] = µ ∈ R, which does not depend on (s, t).

2) The covariance cov(Y(s, t), Y(s', t')) depends only on the separation between (s, t) and (s', t'), but not on the spatio-temporal indices (s, t) and (s', t') per se.

As a special case of a general space-time process, a spatial process Y(s) can be viewed as a snapshot in time of a spatio-temporal process Y(s, t) with fixed t = t_0. For purely spatial processes, another common assumption is the isotropy property. A random field Y(s) is isotropic if the covariance cov(Y(s), Y(s')) depends only on the (usually Euclidean) distance ‖s − s'‖.

Since Y(s, t) is a Gaussian process, the joint distribution of any finite number of realizations of Y(s, t) is uniquely determined by the mean and the covariance structure. As a result, much research attention has been paid to the study of covariance functions in stationary temporal, spatial, or spatio-temporal processes. For spatial processes, the Matérn class of functions (Matérn, 1960; Handcock and Stein, 1993; Guttorp and Gneiting, 2006) has emerged as the most popular choice for covariance functions. Other types of covariance functions include the powered exponential family (see, e.g., Diggle et al., 1998), the Cauchy family (Gneiting and Schlather, 2004), and, for computational convenience, covariance functions with compact support (see, e.g., Gneiting, 2002a; Furrer et al., 2006). Further discussions of space-time covariance functions are provided in Chapter 4. Spatial statistical models dealing with areal data generally rely on the theory of Markov random fields (MRF) (Besag, 1974, 1975; Rue and Held, 2005). The most popular model of this sort is the Gaussian conditional autoregressive (CAR) model. A number of other types of models have also been proposed in the literature. See Held and Rue (2010) and the references therein for more information about areal processes.

Stationarity and Gaussianity of spatial or spatio-temporal processes are often convenient modeling assumptions, but in reality they are not always adequate. It is now widely recognized that many environmental processes exhibit nonstationary covariance structures or non-Gaussian features (see, e.g., Palacios and Steel, 2006; Zhang and El-Shaarawi, 2010; Craigmile and Guttorp, 2011; Huang et al., 2011). Nonstationary models, especially the construction and estimation of nonstationary covariance functions, have received a lot of research attention recently. See the review article of Sampson (2010) and the references therein for more information.

Non-Gaussian spatial or spatio-temporal data have been studied since the 1970s. For the usual count or binary data, non-Gaussian spatial models within the framework of generalized linear mixed models (GLMM) have been discussed in Diggle et al. (1998). For processes with discrete spatial variation, similar models exist, such as the auto-models of Besag (1974). However, not all non-Gaussian data can be accommodated in a GLMM framework. For moderate departures from Gaussianity, De Oliveira et al. (1997) proposed to use a Box-Cox power transformation on a non-Gaussian process and assume that the transformed process is a stationary Gaussian process. A large body of work has focused on addressing a specific shape departure from Gaussianity. For example, Palacios and Steel (2006) proposed a non-Gaussian spatial process, the Gaussian-log-Gaussian (GLG) model, that captures heavy tails by scale mixing a stationary Gaussian process with a stationary log-Gaussian process:

Y(s) = σ ε(s)/√λ(s) + ρ(s),  s ∈ D ⊂ R^d,

where ρ(s) is an iid process with mean zero and variance τ^2 for modeling the nugget effect, ε(s) is a stationary spatial process with zero mean, unit variance and a covariance function C_θ, and log(λ(s)) is a Gaussian process with mean −ν/2, variance ν and covariance function C_θ as well. Craigmile and Guttorp (2011) proposed a heavy-tailed non-Gaussian space-time temperature model to capture the spatially varying

seasonality in the variance of temperature time series:

ζ_t(s) = σ_t(s) η_t(s),  t ∈ T ⊂ Z,  s ∈ D ⊂ R^2,
log(σ_t(s)) = α_0(s) + α_1(s) sin(2πt/365.25) + α_2(s) cos(2πt/365.25)
            + α_3(s) sin(2πt/182.625) + α_4(s) cos(2πt/182.625).

Huang et al. (2011) proposed a heavy-tailed non-Gaussian space-time model based on the temporal stochastic volatility model, which can be viewed as the extension of the GLG model into the space-time domain.

Y(s, t) = σ exp(τ α(s, t)/2) ε(s, t),  s ∈ D ⊂ R^d,  t ∈ R,

where α(s, t) and ε(s, t) are two independent stationary Gaussian processes with mean 0 and variance 1. To capture non-Gaussian features in lattice data, Yan (2007) proposed the following non-Gaussian model:

z_i = φ_i + ε_i,  i = 1, …, I,

φ_i | φ_{j≠i} ∼ N( Σ_{j≠i} b_{ij} φ_j / Σ_{j≠i} b_{ij},  σ_φ² / Σ_{j≠i} b_{ij} ),

ε_i | α ∼ iid N(0, exp(2ω + α_i)),

α_i | α_{j≠i} ∼ N( Σ_{j≠i} c_{ij} α_j / Σ_{j≠i} c_{ij},  σ_α² / Σ_{j≠i} c_{ij} ).
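The tail-inflation mechanism behind these scale-mixture constructions is easy to see by simulation. The sketch below, with illustrative parameter values, compares the marginal excess kurtosis of a Gaussian sample with that of the GLG mixture Y(s) = σ ε(s)/√λ(s) + ρ(s); since only the marginal distribution is at issue, sites are simulated independently:

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma, tau, nu = 200_000, 1.0, 0.1, 1.0

# log(lambda(s)) ~ N(-nu/2, nu); small lambda(s) inflates the local variance
log_lam = rng.normal(-nu / 2.0, np.sqrt(nu), n)
eps = rng.standard_normal(n)      # smooth Gaussian component, unit variance
rho = rng.normal(0.0, tau, n)     # nugget
y = sigma * eps / np.sqrt(np.exp(log_lam)) + rho

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

print(excess_kurtosis(rng.standard_normal(n)))  # near 0 for a Gaussian sample
print(excess_kurtosis(y))                       # clearly positive: heavier tails
```

With ν = 1 the mixture's theoretical kurtosis is a multiple of the Gaussian one, and the sample excess kurtosis of y is far above zero.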

All models in the aforementioned papers, especially Palacios and Steel (2006), Craigmile and Guttorp (2011) and Huang et al. (2011), have heavier tails than a Gaussian process, which can accommodate observations that would otherwise be outliers under a Gaussian process. However, none of these models is able to capture skewness in a spatial or spatio-temporal process. There are a number of other papers that focus on modeling the skewness in spatial processes, most of which are based on the skew-normal distribution and its different multivariate generalizations (see Azzalini and Capitanio, 1999; Azzalini and Dalla Valle, 1996; Azzalini, 2005, for discussions of these distributions). For example, Kim and Mallick (2004) proposed a skew-Gaussian process {Y(s), s ∈ D ⊂ R^d} based on the multivariate skew-normal distribution of Azzalini and Dalla Valle (1996), which can be viewed essentially as the sum of a Gaussian random field and a non-Gaussian random variable. In particular,

Y(s) = δ|Z_1| + √(1 − δ²) Z_2(s),  s ∈ D ⊂ R^d,

where Z_1 is a standard normal random variable, Z_2(s) is a zero-mean and unit-variance Gaussian random field, and δ ∈ (−1, 1). Intuitively, as |δ| → 1, the process Y(s) becomes more and more skewed and also less and less spatially correlated, which is an undesirable property. An easy fix is to consider the following stationary skew-Gaussian spatial process proposed by Zhang and El-Shaarawi (2010), which assumes

Y(s) = δ|Z_1(s)| + √(1 − δ²) Z_2(s),  s ∈ D ⊂ R^d,

where δ ∈ (−1, 1) and Z_1(s) and Z_2(s) are two zero-mean and unit-variance spatial processes that are independent of each other.
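A marginal simulation (with an illustrative δ and independently simulated sites) shows how |δ| controls the skewness of this construction:

```python
import numpy as np

rng = np.random.default_rng(7)
n, delta = 200_000, 0.9

z1 = np.abs(rng.standard_normal(n))   # half-normal component |Z_1(s)|
z2 = rng.standard_normal(n)           # independent Gaussian component Z_2(s)
y = delta * z1 + np.sqrt(1.0 - delta ** 2) * z2

y0 = rng.standard_normal(n)           # delta = 0 recovers a symmetric Gaussian

def skewness(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

print(skewness(y), skewness(y0))      # positive for delta > 0; near 0 for delta = 0
```

Even with δ as large as 0.9, the sample skewness stays well below 1, in line with the bounded skewness of the skew-normal family discussed below.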

Heavy tails, skewness and variance clustering are undoubtedly common features of spatially indexed data, and all of these non-Gaussian features need to be taken into account in our statistical models. However, the heavy-tailed spatial processes presented above (e.g., the GLG process) do not allow for skewness in their marginal distributions. On the other hand, the skewness of the skew-normal distribution has a theoretical range of (−0.9953, 0.9953) and the kurtosis is bounded by [0, 0.8691) (see, e.g., Azzalini, 2005), which comes from the skewness and kurtosis of a half-normal distribution. This is too restrictive for the corresponding skew-Gaussian process to be widely applicable in practice. In addition, the ability of the skew-Gaussian process to capture the variance clustering phenomenon is limited. To see this, note that in the extreme case where δ = ±1, the skew-Gaussian process reduces to the process δ|Z_1(s)| which, albeit skewed, is no better at capturing variance clustering than the underlying Gaussian process Z_1(s).

A more general framework is studied in Bolin (2013) and Bolin and Wallin (2013), where non-Gaussian random fields with Matérn covariance functions are obtained as solutions to stochastic partial differential equations (SPDE) driven by non-Gaussian noise processes. This is an extension of the Gaussian case studied in the well-known paper of Lindgren et al. (2011). For the non-Gaussian noise process in a SPDE, Bolin (2013) and Bolin and Wallin (2013) considered special cases of the generalized hyperbolic process (see Eberlein and Hammerstein, 2004) – in particular, the iid processes generated from a normal inverse Gaussian distribution and an asymmetric generalized Laplace distribution (also known as the variance gamma distribution).

However, this SPDE-based approach is still limited in several respects in practice. First of all, computationally efficient implementation of the non-Gaussian random field implied by a SPDE relies on a Hilbert space approximation based on the method developed in Lindgren et al. (2011). The accuracy of such an approximation for non-Gaussian processes is yet to be established. Second, even with the Hilbert space approximation, model fitting remains a challenging issue. Bolin (2013) developed a model fitting procedure based on the expectation maximization (EM) algorithm, but it does not accommodate measurement errors. Furthermore, it requires all nodes in the random field to be observed. Although these requirements are relaxed with the Monte Carlo EM algorithm in Bolin and Wallin (2013), the model fitting procedure remains difficult to implement. In addition, the ability of this model to capture the variance clustering feature is not clearly understood. Therefore, we still need a general approach that naturally accommodates the heavy tails, asymmetry and variance clustering of spatial data, and at the same time, is more straightforward to implement.

1.4 Outline of the Dissertation

This dissertation focuses on the study of non-Gaussian temporal, spatial and spatio-temporal stochastic processes for capturing features such as asymmetry, heavy tails, and temporal or spatial heteroscedasticity.

In Chapter 2, I develop a semiparametric additive stochastic volatility model to describe the heavy tails, volatility clustering and asymmetry of financial returns. In addition, I propose to include shape constraints in each component of our additive SV model, which proves to be very important for achieving better model fit and predictive performance in our data examples. I employ a basis expansion of the shape-constrained functions that allows for efficient fitting of the Bayesian model, and I also present a particle-filter-based model comparison approach. The model is applied to several real-world equity returns datasets and achieves better out-of-sample predictive performance compared to the linear parametric SV model as well as the semiparametric SV models without proper shape constraints. I also demonstrate the necessity of including a leverage effect in the semiparametric additive SV models to capture the asymmetry of many real-world financial returns.

In Chapter 3, I first discuss a more general framework for constructing non-Gaussian spatial processes using transformations of a latent multivariate Gaussian process. Based on this framework, I then introduce a heteroscedastic asymmetric spatial process (HASP) that can capture non-Gaussian features – heavy tails, skewness and variance clustering – of spatially indexed data, such as environmental or climatic data. The properties of the HASP will be established, along with a Markov chain Monte Carlo (MCMC) procedure to sample from the posterior distribution. In the end, we apply the HASP to study a US nitrogen dioxide concentration dataset and to demonstrate the benefits and necessity of modeling the above-mentioned non-Gaussian features for predictive purposes.

Finally, this dissertation is concluded in Chapter 4 with a summary and some extensions of the aforementioned research. I will discuss the application of our methodology to the GARCH-type nonparametric volatility models as well as to other state space models, in particular, the stochastic conditional duration model. I also highlight the extensions of our heteroscedastic asymmetric spatial process to the space-time domain, and discuss the possibility of fitting our non-Gaussian process to large datasets and of building non-Gaussian and nonstationary processes.

Chapter 2: Shape-constrained Semiparametric Additive Stochastic Volatility Models

2.1 Background

2.1.1 Asymmetric or Semiparametric Stochastic Volatility Models

The state space model approach to modeling conditional heteroscedasticity in financial time series leads to the well-known stochastic volatility (SV) model, which originated from the stochastic differential equation in Hull and White (1987)'s work on option pricing (see also Taylor, 1994). The basic form of the SV model was given in (1.2.1). The properties of the basic SV model (1.2.1) as well as its limitations are discussed below (see also Fan and Yao, 2003, p.180).

i) Stationarity. It is well known that if |φ| < 1, the volatility process (1.2.1) is strictly stationary with

h_t ∼ N( ω/(1 − φ), τ²/(1 − φ²) )  and  corr(h_t, h_{t+k}) = φ^|k|  for t, k ∈ Z.

As a result, {r_t}_{t∈Z} is also strictly stationary since the marginal distribution of r_t, for each t ∈ Z, can be expressed as

p(r_t) = ∫ p(r_t | h_t) p(h_t) dh_t,

which does not depend on t. In the discussion below, let µ_h ≡ ω/(1 − φ) denote the mean and σ_h² ≡ τ²/(1 − φ²) denote the variance of the marginal distribution of {h_t}_{t∈Z}.

ii) White Noise. Because the process {ε_t}_{t∈Z} is independent of the process {η_t}_{t∈Z}, it is thus also independent of {h_t}_{t∈Z}. It is then easy to see that E(r_t) = 0 and

E(r_t r_{t+k}) = E[ exp((h_t + h_{t+k})/2) ] E(ε_t ε_{t+k}) = 0,  for any t, k ∈ Z, k ≠ 0.

Also, by the moment generating function of the normal distribution, for each t ∈ Z, we have

Var(r_t) = E(r_t²) = E(e^{h_t}) E(ε_t²) = exp( µ_h + σ_h²/2 ).

In other words, {r_t}_{t∈Z} is a white noise series.

iii) Heavy Tails. Let κ_r and κ_ε denote the kurtosis of the marginal distribution of r_t and ε_t, respectively. Then,

κ_r = E(r_t⁴) / [E(r_t²)]² = ( E(e^{2h_t}) / [E(e^{h_t})]² ) ( E(ε_t⁴) / [E(ε_t²)]² ) = exp(σ_h²) κ_ε > κ_ε,

that is, the kurtosis of {r_t} is always larger than that of {ε_t}. This holds even if ε_t, for all t ∈ Z, follows a distribution other than the standard normal distribution.

iv) Volatility Clustering. First, for t, k ∈ Z, note that h_t + h_{t+k} follows a normal distribution with mean 2µ_h and variance 2σ_h²[1 + corr(h_t, h_{t+k})] = 2σ_h²(1 + φ^|k|). Using the moment generating function of the normal distribution again, we have that, for k ≠ 0,

γ_r(k) ≡ Cov(r_t², r_{t+k}²) = E(exp(h_t + h_{t+k})) − E(exp(h_t)) E(exp(h_{t+k}))
       = exp(2µ_h + σ_h²) [ exp(σ_h² φ^|k|) − 1 ],

and hence, noting that γ_r(0) = Var(r_t²) = exp(2µ_h + σ_h²) [ 3 exp(σ_h²) − 1 ] when ε_t is standard normal (since E(ε_t⁴) = 3),

corr(r_t², r_{t+k}²) = γ_r(k) / γ_r(0) = [ exp(σ_h² φ^|k|) − 1 ] / [ 3 exp(σ_h²) − 1 ] ≈ ( σ_h² / [ 3 exp(σ_h²) − 1 ] ) φ^|k|.

The approximation follows from the fact that e^x ≈ 1 + x for small x > 0. Note that the autocorrelation of the squared returns under the SV model decays at the same rate as that of an AR(1) series (similar results hold for the GARCH model too).
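These moment properties can be checked numerically. The sketch below simulates the basic SV model with illustrative parameter values (ω = 0, φ = 0.9, τ = 0.4), compares the sample variance of r_t with exp(µ_h + σ_h²/2), and confirms that the autocorrelation of r_t² is positive and decaying in the lag:

```python
import numpy as np

rng = np.random.default_rng(123)
n, burn = 200_000, 1_000
omega, phi, tau = 0.0, 0.9, 0.4

# simulate the AR(1) log-volatility and the returns
h = np.empty(n + burn)
h[0] = omega / (1 - phi)
for t in range(1, n + burn):
    h[t] = omega + phi * h[t - 1] + tau * rng.standard_normal()
h = h[burn:]
r = np.exp(h / 2.0) * rng.standard_normal(n)

mu_h = omega / (1 - phi)
sigma2_h = tau ** 2 / (1 - phi ** 2)            # stationary variance of h_t
print(r.var(), np.exp(mu_h + sigma2_h / 2.0))   # should be close

def acf(x, k):
    x = x - x.mean()
    return float(np.mean(x[:-k] * x[k:]) / np.mean(x * x))

r2 = r ** 2
print(acf(r2, 1), acf(r2, 10))                  # positive and decaying, like an AR(1)
```

The geometric decay in the lag mirrors the φ^|k| factor in the autocorrelation of the squared returns.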

v) Asymmetry. Recall that for financial time series data, positive and negative returns are found to have asymmetric impacts on the volatility, and the distribution of returns is slightly negatively skewed. The SV model does not capture asymmetry since the evolution of the volatility process does not depend on the return shocks. Also, the marginal distribution of {r_t} is symmetric since

E[r_t³] = E(e^{3h_t/2}) E(ε_t³) = 0;

that is, the skewness of the marginal distribution is 0.

In volatility modeling, two functional relationships are of particular interest – the dependence of volatilities on lagged volatilities and the functional relationship between volatilities and lagged return shocks. The latter is also known as the news impact curve in the ARCH literature (see, e.g., Engle and Gonzalez-Rivera, 1991; Engle and Ng, 1993). To capture the asymmetric impact of positive and negative return shocks on volatilities in SV models, researchers have introduced the "leverage effect". Recall from Chapter 1 that a leverage effect is defined as a (usually negative) correlation between the innovations of the volatility process and the lagged innovations of the return process (see Harvey and Shephard, 1996; Yu, 2005). In model (1.2.1), we induce the leverage effect by assuming that corr(ε_{t−1}, η_t) = ρ < 0.

The resulting model is an Euler-Maruyama approximation to the following continuous time asymmetric SV model from the options pricing literature:

d log(S(t)) = σ(t) dB_1(t),
d log(σ²(t)) = [ µ + (φ − 1) log(σ²(t)) ] dt + σ_1 dB_2(t);

where S(t) is the asset price, and B_1(t) and B_2(t) are two Brownian motions satisfying corr(dB_1(t), dB_2(t)) = ρ < 0; i.e., the increments of the two Brownian motions are negatively correlated (see Yu (2005) and the references therein for more information).

Since the error distributions in the discrete-time SV models are assumed to be normal, the leveraged SV model can be equivalently expressed as

r_t = exp(h_t/2) ε_t,  ε_t ∼ iid N(0, 1),
h_t = µ + φ h_{t−1} + ψ ε_{t−1} + ξ_t,  ξ_t ∼ iid N(0, σ_2²),    (2.1.1)

where ψ = ρσ_1, σ_2 = σ_1 √(1 − ρ²), and {ξ_t}_{t∈Z} is independent of {ε_t}_{t∈Z}. This alternative form of the leverage effect (2.1.1) has been exploited by several papers including Yu (2005), Omori et al. (2007), and Yu (2012). We will also consider a similar form in our heteroscedastic asymmetric spatial process in Chapter 3. It is easy to establish the following important properties of the SV model with leverage effect (see Harvey and Shephard, 1996; Yu, 2005):

i) The bivariate process {(r_t, h_t)}_{t∈Z} is Markov since

p(r_t, h_t | r_{t−1}, h_{t−1}, …, r_{t−k}, h_{t−k}, …) = p(r_t, h_t | r_{t−1}, h_{t−1})

for all t, k ∈ Z.

ii) The process r_t is a martingale difference sequence (MDS). To see this, let F_t = σ{(ε_s, ξ_s), s ≤ t}. Then (r_t, h_t) ∈ F_t, i.e., (r_t, h_t) is F_t-measurable, and the result follows instantly from

E[r_t | F_{t−1}] = E[ E[r_t | h_t, F_{t−1}] | F_{t−1} ] = 0.

If the second moment of r_t exists for all t ∈ Z, then {r_t} is also a white noise series.

These two properties are shared with other popular volatility models such as the ARCH, GARCH, GJR-GARCH and EGARCH models, as defined in Chapter 1.
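The effect of the leverage term in (2.1.1) can also be seen in simulation: with ψ < 0, a negative return shock today raises tomorrow's volatility, so r_{t−1} and r_t² should be negatively correlated. A sketch with illustrative parameter values (ρ = −0.7, σ_1 = 0.3):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
mu, phi = 0.0, 0.95
rho, sigma1 = -0.7, 0.3
psi = rho * sigma1                        # psi = rho * sigma_1 < 0
sigma2 = sigma1 * np.sqrt(1 - rho ** 2)   # sigma_2 = sigma_1 * sqrt(1 - rho^2)

eps = rng.standard_normal(n)
h = np.zeros(n)
for t in range(1, n):
    h[t] = mu + phi * h[t - 1] + psi * eps[t - 1] + sigma2 * rng.standard_normal()
r = np.exp(h / 2.0) * eps

# leverage: yesterday's return shock vs today's squared return
c = np.corrcoef(r[:-1], (r ** 2)[1:])[0, 1]
print(c)  # negative
```

The sample correlation is clearly negative, which is the asymmetry that the unleveraged model in Section 2.1.1 cannot produce.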

Although parametric assumptions in traditional SV models, such as the linearity of the autoregressive and leverage components, provide reasonable approximations in some settings, they can also deviate considerably from the truth. Comte (2004) showed the nonlinearity of the autoregressive component in SV models for large cap indices of several major equity markets. Yu (2012) modeled the leverage effect semiparametrically using piecewise linear functions:

h_t = φ h_{t−1} + σ Σ_{i=1}^{d+1} ( ρ_i ε_{t−1} + √(1 − ρ_i²) ξ_t ) I(c_{i−1} ≥ ε_{t−1} > c_i),    (2.1.2)

where +∞ = c_0 > c_1 > … > c_d > c_{d+1} = −∞ are the pre-specified knots. They found empirical evidence that the leverage effect for many U.S. large cap equity returns is significantly less than zero when the lagged return innovation is negative but not when it is positive, which implies that a linear leverage function might not be sufficient for capturing the news impact curve. Other related nonparametric models include Bandi and Renò (2012)'s continuous time stochastic volatility jump diffusion model for capturing a time-varying leverage effect as well as Jensen and Maheu (2012)'s approach for modeling the SV model with leverage effect. Both articles have found empirical evidence supporting the usefulness of a time-varying leverage effect.

2.1.2 The Role of Shape Constraints in Semiparametric SV Models

Nonparametric or semiparametric SV models can be used to relax the linearity assumptions in traditional SV models. Such models offer great flexibility for modeling the nonlinear functional relationships in the volatility equation. However, due to the latent structure of the state space model, they can also lead to very large variance in the estimated volatilities. In this chapter, we propose to incorporate our knowledge about the shapes (such as monotonicity or piecewise monotonicity) of the functional components in the volatility equation as additional constraints in semiparametric SV models. In particular, we implement the shape constraints in the prior distributions of our Bayesian semiparametric additive model and show that the resulting model is more flexible than the linear parametric model and that it can achieve lower estimation errors of the log volatilities and higher predictive log likelihood compared to the semiparametric model without shape constraints.

The idea of imposing shape constraints in a semiparametric SV model is warranted by the availability of our knowledge about the functional shapes as well as the proven benefits of imposing shape constraints in linear and generalized linear models. On one hand, the past two decades have witnessed the accumulation of significant knowledge about the shapes of the autoregressive and leverage components in SV models. For example, the phenomenon of volatility clustering (e.g., Rydberg, 2000) suggests that the autoregressive component of the volatility equation should be monotonically increasing as the lagged volatility increases. Additionally, Yu (2012)'s findings show that the leverage function should at least be monotonically decreasing on the negative real line. On the other hand, it has long been established that imposing the correct shape constraints in function estimation problems has the potential of greatly improving model fit without incurring much additional cost (Ramsay, 1988). As a result, shape-constrained function estimation has been widely applied in regression settings and generalized linear models. In state-space models such as SV models, there has been, to the best of our knowledge, no demonstration of the benefits of shape-constrained state smoothing. It is our belief that the aforementioned functional shapes can be exploited to provide additional regularization in the semiparametric SV models to ensure better model fit and prediction accuracy.

The remainder of this chapter is organized as follows. In Section 2.2, we introduce our shape-constrained semiparametric additive SV model. In Section 2.3, we develop the Markov chain Monte Carlo procedures for sampling from the posterior distribution and a particle-filter-based procedure for evaluating the predictive log likelihood. We apply our methods to simulated and real-world financial returns in Sections 2.4 and 2.5 and demonstrate their advantages over both the linear SV model and the semiparametric model without shape constraints. We close with conclusions and discussions in Section 2.6.

2.2 Model Specification

2.2.1 Semiparametric Additive Stochastic Volatility Models

Consider the following semiparametric stochastic volatility model:

r_t = ω + exp(h_t/2) ε_t,  ε_t ∼ iid N(0, 1);
h_t = µ + f(h_{t−1}) + g(ε_{t−1}) + η_t,  η_t ∼ iid N(0, σ²),    (2.2.1)

where the innovations {ε_t}_{t∈Z} and {η_t}_{t∈Z} are mutually independent. We assume that the autoregressive component, f, and the leverage function, g, are additive. It is easy

to see that the leveraged SV model (2.1.1) is a special case of (2.2.1) when both f and g are linear functions. Model (2.2.1) also includes the models of Comte (2004) and Yu (2012) as special cases: when g(ε_{t−1}) = 0, (2.2.1) reduces to Comte (2004)'s nonparametric unleveraged SV model; when f is linear and g is piecewise linear, (2.2.1) becomes Yu (2012)'s semiparametric leverage effect model. Additionally, when η_t = 0 for all t, model (2.2.1) also contains the exponential ARCH (EGARCH) model of Nelson (1991).

Additional constraints on the functions f and g are needed for (2.2.1) to be identifiable. We assume that f(h_m) = 0 and g(ε_m) = 0, where h_m and ε_m are two fixed numbers within the respective ranges of the processes {h_t}_{t∈Z} and {ε_t}_{t∈Z}. Under these conditions, the parameter µ in (2.2.1) represents the expected conditional log volatility when h_{t−1} = h_m and ε_{t−1} = ε_m. The choice of h_m and ε_m is discussed in Section 2.2.2. For the remainder of this chapter, we will assume that the data have already been de-trended, so ω = 0.

2.2.2 Shape-constrained Semiparametric Additive Stochastic Volatility Models

We call model (2.2.1) the shape-constrained semiparametric additive SV model if f and g are restricted to be of certain shapes. As discussed in Section 1, for equity returns, the autoregressive function f is typically monotonically increasing on the entire real line, while the leverage function g is monotonically decreasing (at least) on the negative real line. The findings in Yu (2012) also indicate that g(ε_{t−1}) might have some change-point behavior about zero, but generally, our knowledge of g(ε_{t−1}) for positive ε_{t−1} is mixed.

There are many ways to incorporate shape constraints in function estimation problems. Below we briefly discuss some of the most popular methods for estimating monotone functions, monotonicity being the most important form of shape constraint. While some monotone function estimation methods are built on kernel smoothing techniques (e.g., Hall and Huang, 2001; Dette and Pilz, 2006; Dette and Scheder, 2006), the majority are based on splines or power basis functions. Consider the following basis expansion of an unknown function q:

q(x) = β_0 + β_1 w_1(x) + … + β_J w_J(x),    (2.2.2)

where {w_j(x)}_{j=1}^J are the basis functions. The splines-based methods come in different flavors, but the basic idea is mostly the same – by choosing appropriate basis functions for the basis expansion (2.2.2), shape constraints on the unknown function q can be translated into range constraints on the coefficients of the basis functions, {β_j}_{j=1}^J. The model fitting can be carried out either through a constrained optimization for the maximum likelihood method or through a constrained prior distribution in the Bayesian framework. In this chapter, we focus on Bayesian isotonic regression methods.

There are roughly two flavors when it comes to the choice of splines as basis functions for monotone function estimation. One approach directly uses B-splines or P-splines. For B-splines of order m with equally spaced knots,

(d/dx) B_{j,m}(x) = (1/h) [ B_{j,m−1}(x) − B_{j+1,m−1}(x) ],

where h = x_{j+1} − x_j. As a result, for q(x) = Σ_{j=1}^J β_j B_{j,m}(x),

q′(x) = (1/h) Σ_{j=2}^J (β_j − β_{j−1}) B_{j,m−1}(x).

Since B-splines are nonnegative by construction, it is easy to see that a natural sufficient condition for the function q(x) to be monotonically increasing is that the coefficients satisfy the ordering β_1 ≤ … ≤ β_{J−1} ≤ β_J. For B-splines of higher orders, such as cubic splines, this is a sufficient but not necessary condition and, as a result, it represents only a subset of all legitimate monotone functions. The sufficient and necessary constraints on higher order splines to obtain monotonicity can be very complex. Therefore, most applications have practically considered only linear or quadratic splines. Brezger and Steiner (2008) is an example of this approach, where the unknown function is expanded by the traditional P-splines and the monotonicity of the unknown function is guaranteed through a truncated normal prior constraining the order of the coefficients.

The other approach to monotone function estimation is to embed the monotonic feature into the basis functions. In other words, for q to be monotone, one can simply use monotone basis functions {w_j(x)}_{j=1}^J and then impose sign constraints on the coefficients {β_j}_{j=1}^J. Suppose that the w_j(x)'s are monotonically increasing in x. Then q(x) is monotonically increasing (decreasing) in x if β_j ≥ (≤) 0 for all j ≥ 1. A prominent example of this type of basis functions is the integrated regression splines (I-splines) proposed by Ramsay (1988). For example, Meyer et al. (2011) employs the monotonic I-splines as basis functions and uses gamma priors for the coefficients to ensure monotonicity. A different set of monotone basis functions is proposed by Neelon and Dunson (2004) and also used in Cai and Dunson (2007). Their basis functions are defined as

w_j(x) = I(x > γ_{j−1}) [ min(x, γ_j) − γ_{j−1} ],  j = 1, …, J,    (2.2.3)

given pre-specified knots γ_0 < … < γ_J. These basis functions are monotone and nonnegative but not orthogonal, and because of their non-orthogonality, the coefficients {β_j}_{j=1}^J are heavily correlated with each other. Although this is usually an undesirable feature, Neelon and Dunson (2004) and Cai and Dunson (2007) proposed priors for {β_j}_{j=1}^J based on a latent random walk and a Gaussian Markov random field, respectively, to explicitly model the correlation among the coefficients.

In this work, we adopt the Bayesian method of Neelon and Dunson (2004) in our semiparametric SV models. When (2.2.3) are used as our basis functions in (2.2.2), the monotonicity constraint on the function q is exactly equivalent to sign constraints on the coefficients {β_j}_{j=1}^J. It is also easy to see that with (2.2.3), the parameter β_0 in (2.2.2) satisfies β_0 = q(γ_0); i.e., β_0 represents the "baseline" of the function q on [γ_0, γ_J]. The coefficient β_j, j ≥ 1, is simply the slope of the function q on the knot interval [γ_{j−1}, γ_j]. Due to the nonorthogonality of the basis functions (2.2.3), the estimates of the coefficients {β_j}_{j=0}^J will depend on each other through the equations

q(γ_0) = β_0,
q(γ_1) = β_0 + β_1(γ_1 − γ_0),
q(γ_2) = β_0 + β_1(γ_1 − γ_0) + β_2(γ_2 − γ_1),
⋮
q(γ_J) = β_0 + Σ_{j=1}^J β_j(γ_j − γ_{j−1}).

On top of that, the estimated q function will have larger variance towards the boundaries because of the relative lack of data in those regions. As a result of both the dependence among model coefficients and the large estimation variance of q(γ_j) for small γ_j, the estimation efficiency will be low for the baseline coefficient β_0 as well as the first few slope coefficients β_j. For SV models or other state space models, this problem is aggravated by the necessity of specifying the knots with a wide enough range to cover all possible values of the latent variables.

Hence we propose to shift the basis functions so that the baseline coefficient β_0 in (2.2.2) represents the "center" of the function q. We let

w_j(x) = I(x > γ_{j−1}) [ min(x, γ_j) − γ_{j−1} ] − (γ_j − γ_{j−1})   if 1 ≤ j < M;
w_j(x) = I(x > γ_{j−1}) [ min(x, γ_j) − γ_{j−1} ]                     if M ≤ j ≤ K,    (2.2.4)

where M denotes the "center" of the range of x. Figure 2.1 compares the centered and the uncentered basis functions using a set of unequally spaced knots. With the basis functions (2.2.4), the parameter β_0 in (2.2.2) models the center of the function through the equation β_0 = q(γ_{M−1}). The estimation variance of q(γ_{M−1}) tends to be much smaller than that of q(γ_0) because of the abundance of information contributed by the data from both sides of γ_{M−1}. As a result, although (2.2.4) renders exactly the same model as (2.2.3), it considerably improves the computational efficiency for the class of state space models entertained in this chapter.
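A direct implementation of the shifted bases (2.2.4) makes these properties easy to verify numerically: nonnegative slope coefficients yield a nondecreasing q, and q(γ_{M−1}) equals the baseline β_0. The knot and coefficient values below are illustrative:

```python
import numpy as np

def centered_bases(x, knots, M):
    """Shifted monotone basis functions (2.2.4) for pre-specified knots
    gamma_0 < ... < gamma_K; pieces left of the center index M are shifted
    down by their own knot spacing so they vanish at gamma_{M-1}."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    K = len(knots) - 1
    W = np.empty((len(x), K))
    for j in range(1, K + 1):
        g0, g1 = knots[j - 1], knots[j]
        w = np.where(x > g0, np.minimum(x, g1) - g0, 0.0)
        if j < M:
            w = w - (g1 - g0)
        W[:, j - 1] = w
    return W

knots = np.array([0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0])   # gamma_0..gamma_6, K = 6
M = 4                                                    # center knot gamma_{M-1} = 0.5
beta0 = 1.5
beta = np.array([0.4, 2.0, 0.1, 0.7, 0.0, 1.2])          # nonnegative slopes

xs = np.linspace(0.0, 1.0, 201)
q = beta0 + centered_bases(xs, knots, M) @ beta

q_center = (beta0 + centered_bases(knots[M - 1], knots, M) @ beta)[0]
print(q[0], q_center, q[-1])  # q is nondecreasing and q(gamma_{M-1}) == beta0
```

Because all shifted pieces evaluate to zero at γ_{M−1}, the intercept β_0 is identified with the center of the curve rather than its left boundary.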

Letting w(h_{t−1}) and v(ε_{t−1}) denote the basis functions for f and g respectively, we can rewrite (2.2.1) as

r_t = exp(h_t/2) ε_t,  ε_t ∼ iid N(0, 1);
h_t = µ + Σ_{k=1}^K β_k w_k(h_{t−1}) + Σ_{l=1}^L α_l v_l(ε_{t−1}) + η_t,  η_t ∼ iid N(0, σ²).    (2.2.5)

As for the choice of M, we use M = ⌈(K + 1)/2⌉ for estimating the function f(h_{t−1}) in (2.2.5). For estimating the function g(ε_{t−1}), M is specified in such a way that γ_{M−1} = 0. This is always possible since the lagged return innovation, ε_{t−1}, follows a standard normal distribution. It is easy to see that model (2.2.5) satisfies

f(γ^h_{M−1}) = Σ_{k=1}^K β_k w_k(γ^h_{M−1}) = 0


Figure 2.1: A comparison of the centered and uncentered basis functions. 7 knots (unequally spaced in the first row and equally spaced in the second row) are used for illustration purposes, where the first and last knots denote the lowest and largest possible values of the predictor x. Different colors and line types are used to differentiate different basis functions. Note how the centered and uncentered basis functions differ in terms of the 0 horizontal line.

and

g(γ^ε_{M−1}) = Σ_{l=1}^L α_l v_l(γ^ε_{M−1}) = 0,

which are the additional constraints needed for the model to be identifiable. We use the following priors from Neelon and Dunson (2004) on {β_k}_{k=1}^K to ensure their non-negativity:

β*_1 ∼ π(β*_1);
β*_k ∼ N(β*_{k−1}, τ²),  k ≥ 2;
β_k = I(β*_k > 0) β*_k,  k ∈ {1, …, K}.    (2.2.6)

The prior for β*_1 can be based on prior knowledge of the steepness of the function toward the left tail. However, if such knowledge is not readily available, the flat prior π(β*_1) ∝ 1 can be a convenient alternative, which also has a nice property that the posterior distribution does not depend on the starting point, nor on the direction, of the latent random walk process {β*_k}_{k=1}^K. In other words, the posterior will be exactly the same as by assuming the random walk starts at β*_κ with π(β*_κ) ∝ 1, for any integer κ from 1 to K.

As for the function g(ε_{t−1}) = Σ_l α_l v_l(ε_{t−1}), we have empirical evidence from the literature (such as Yu, 2012) showing that it is monotonically decreasing on the negative real line, but we do not have consistent information about its shape on the positive real line. Accordingly, our priors for {α_l}_{l=1}^L are specified as follows.

α*_1 ∼ π(α*_1);
α*_l ∼ N(α*_{l−1}, γ²),  l ≥ 2;
α_l = I(α*_l < 0) α*_l,  l ≤ M − 1;    (2.2.7)
α_l = ι(α*_l) α*_l,  l ≥ M.    (2.2.8)

In the equations above, π(α*_1) can be specified in the same way as π(β*_1), and ι is an indicator function that depends on the assumption about the shape of g on the positive real line: if g(ε_{t−1}) is monotonically decreasing in positive ε_{t−1}, then ι(α*_l) = I(α*_l < 0); if g(ε_{t−1}) is monotonically increasing in positive ε_{t−1}, then ι(α*_l) = I(α*_l > 0); finally, if we lack such information or if g(ε_{t−1}) is not monotone for positive ε_{t−1}, we can simply set ι(α*_l) = 1 to remove the shape constraint on that part of the leverage function altogether.

The remaining parameters are assumed to be mutually independent. For the variance parameters σ², τ² and γ², we use the conjugate inverse gamma distribution as the prior for each parameter. We assume a N(0, d_µ²) prior for µ and a N(0, d_0²) prior for the initial state h_1.

2.3 Model Fitting and Comparison

2.3.1 MCMC Procedure

We use a Gibbs sampler to simulate from the joint posterior distribution of model (2.2.5). An overview of the MCMC algorithm for a single iteration is described below.

Step 1 - Update {β*_k}_{k=1}^K and {α*_l}_{l=1}^L sequentially from their full conditional distributions, which are mixtures of two truncated normal distributions for the shape-constrained semiparametric model and normal distributions for the corresponding semiparametric model without shape constraints. For the shape-constrained model, compute {β_k}_{k=1}^K and {α_l}_{l=1}^L deterministically using (2.2.6) – (2.2.8).

Step 2 - Update the variance parameters τ², γ² and σ² from their inverse gamma full conditional distributions.

Step 3 - Update µ from its normal full conditional distribution.

Step 4 - Update {h_t, t = 1, …, T} sequentially from their full conditional distributions.

The detailed full conditional distributions, as well as discussions of the specific challenges involved in each step, are presented below.

1) Update β*_k and β_k: Without sign constraints on β_k (i.e., no shape constraints on the autoregressive function f), we draw β_k = β*_k from its normal full conditional distribution, whose mean and variance are given by

µ̃_k = [ τ_k^{2*} / (σ_k^{2w} + τ_k^{2*}) ] µ_k^w + [ σ_k^{2w} / (σ_k^{2w} + τ_k^{2*}) ] µ*_k   and   τ̃_k² = σ_k^{2w} τ_k^{2*} / (σ_k^{2w} + τ_k^{2*}).

In the above equations, µ*_k and τ_k^{2*} are the conditional prior mean and variance of β*_k, whereas µ_k^w and σ_k^{2w} reflect the new information about β*_k in the data. If we use a flat prior for the initial state of the random walk prior of {β*_k}_{k=1}^K, we have

µ*_k = β*_{k+1} if k = 1;   β*_{k−1} if k = K;   (β*_{k−1} + β*_{k+1})/2 otherwise,

and

τ_k^{2*} = τ² if k = 1 or K;   τ²/2 otherwise.

From the likelihood, we can obtain

µ_k^w = Σ_{t=2}^T w_k(h_{t−1}) e_k(t) / Σ_{t=2}^T w_k(h_{t−1})²   and   σ_k^{2w} = σ² / Σ_{t=2}^T w_k(h_{t−1})²,

where

e_k(t) = h_t − µ − Σ_{l=1}^L v_l(ε_{t−1}) α_l − Σ_{j=1, j≠k}^K w_j(h_{t−1}) β_j.
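The quantities µ_k^w and σ_k^{2w} are weighted-least-squares summaries of the partial residuals e_k(t). The sketch below (toy, randomly generated bases; all names hypothetical) computes them and checks a sanity property: if the states were generated exactly from the volatility equation with no noise, µ_k^w recovers β_k:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, L = 500, 6, 4
sigma2 = 0.04

W = rng.uniform(0.1, 1.0, (T - 1, K))    # w_k(h_{t-1}) evaluated for t = 2, ..., T
V = rng.standard_normal((T - 1, L))      # v_l(eps_{t-1}) evaluated for t = 2, ..., T
mu = 0.3
beta = rng.uniform(0.0, 0.5, K)          # nonnegative slopes
alpha = rng.normal(scale=0.1, size=L)

# Noise-free states h_2, ..., h_T for the sanity check
h_rest = mu + V @ alpha + W @ beta

def likelihood_moments(k):
    """mu_k^w and sigma_k^{2w} for the full conditional of beta*_k."""
    # partial residual e_k(t): leave the k-th autoregressive term out
    e_k = h_rest - mu - V @ alpha - (W @ beta - W[:, k] * beta[k])
    s = np.sum(W[:, k] ** 2)
    return np.sum(W[:, k] * e_k) / s, sigma2 / s

mu_w, sig2_w = likelihood_moments(2)
print(mu_w, beta[2], sig2_w)  # mu_w matches beta_2 in the noise-free case
```

In a real MCMC iteration, h would of course be the current draw of the latent states rather than a noise-free construction.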

If $\beta_k$ is constrained to be positive a priori, i.e., $\beta_k = I(\beta_k^* > 0)\,\beta_k^*$, Neelon and Dunson (2004) show that the full conditional of $\beta_k^*$ can be expressed as a mixture of two truncated normal distributions. Let $\Phi(\cdot\,; m, v)$ be the cumulative distribution function of a $N(m, v)$ random variable, $\phi(\cdot\,; m, v)$ be the corresponding density function, and $tN(\mu_0, \sigma_0^2; [a, b])$ denote the truncated normal distribution with support $[a, b]$. Then
\[
\begin{aligned}
\pi(\beta_k^* \mid -) &\propto \prod_{t=2}^T \pi(h_t \mid h_{t-1}, \beta_k, \beta_{-k})\; \pi(\beta_k^* \mid \beta_{k-1}^*, \tau^2)\; \pi(\beta_k^* \mid \beta_{k+1}^*, \tau^2) \\
&\propto \phi\!\left(\beta_k;\, \mu_k^w, \sigma_k^{2w}\right) \times \phi\!\left(\beta_k^*;\, \mu_k^*, \tau_k^{2*}\right) \\
&= \begin{cases} \phi(0; \mu_k^w, \sigma_k^{2w})\, \phi(\beta_k^*; \mu_k^*, \tau_k^{2*}) & \text{if } \beta_k^* \le 0, \\ \phi(\beta_k^*; \mu_k^w, \sigma_k^{2w})\, \phi(\beta_k^*; \mu_k^*, \tau_k^{2*}) & \text{if } \beta_k^* > 0 \end{cases} \\
&= \begin{cases} \dfrac{\phi(\beta_k^*; \mu_k^*, \tau_k^{2*})}{\phi(0; \mu_k^*, \tau_k^{2*})}\, \phi(0; \mu_k^*, \tau_k^{2*})\, \phi(0; \mu_k^w, \sigma_k^{2w}) & \text{if } \beta_k^* \le 0, \\[1.5ex] \dfrac{\phi(\beta_k^*; \tilde\mu_k, \tilde\tau_k^2)}{\phi(0; \tilde\mu_k, \tilde\tau_k^2)}\, \phi(0; \mu_k^*, \tau_k^{2*})\, \phi(0; \mu_k^w, \sigma_k^{2w}) & \text{if } \beta_k^* > 0 \end{cases} \\
&\propto \begin{cases} \dfrac{\phi(\beta_k^*; \mu_k^*, \tau_k^{2*})}{\phi(0; \mu_k^*, \tau_k^{2*})} & \text{if } \beta_k^* \le 0, \\[1.5ex] \dfrac{\phi(\beta_k^*; \tilde\mu_k, \tilde\tau_k^2)}{\phi(0; \tilde\mu_k, \tilde\tau_k^2)} & \text{if } \beta_k^* > 0. \end{cases}
\end{aligned}
\]

Thus, an MCMC draw for $\beta_k^*$ is simply a draw from a $tN(\mu_k^*, \tau_k^{2*}; (-\infty, 0])$ distribution with probability proportional to
\[
\frac{\Phi(0; \mu_k^*, \tau_k^{2*})}{\phi(0; \mu_k^*, \tau_k^{2*})},
\]
or from a $tN(\tilde\mu_k, \tilde\tau_k^2; [0, \infty))$ distribution with probability proportional to
\[
\frac{1 - \Phi(0; \tilde\mu_k, \tilde\tau_k^2)}{\phi(0; \tilde\mu_k, \tilde\tau_k^2)}.
\]
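Sampling from this two-component truncated-normal mixture is straightforward; the Python sketch below (illustrative only, using SciPy's `truncnorm`; the function name is hypothetical) first computes the unnormalized component weights and then draws from the selected truncated component:

```python
import numpy as np
from scipy.stats import norm, truncnorm

def draw_beta_star(mu_star, tau2_star, mu_tilde, tau2_tilde, rng):
    """Draw beta_k^* from the mixture of two truncated normals:
    a tN(mu_star, tau2_star; (-inf, 0]) component and a
    tN(mu_tilde, tau2_tilde; [0, inf)) component, with the
    unnormalized weights given in the text."""
    s_star, s_tilde = np.sqrt(tau2_star), np.sqrt(tau2_tilde)
    w_neg = norm.cdf(0, mu_star, s_star) / norm.pdf(0, mu_star, s_star)
    w_pos = (1 - norm.cdf(0, mu_tilde, s_tilde)) / norm.pdf(0, mu_tilde, s_tilde)
    if rng.uniform() < w_neg / (w_neg + w_pos):
        # truncnorm takes standardized bounds (lo - mu)/s, (hi - mu)/s
        return truncnorm.rvs(-np.inf, (0 - mu_star) / s_star,
                             loc=mu_star, scale=s_star, random_state=rng)
    return truncnorm.rvs((0 - mu_tilde) / s_tilde, np.inf,
                         loc=mu_tilde, scale=s_tilde, random_state=rng)
```

Note that for components whose mean is far from zero, the ratio $\Phi/\phi$ can under- or overflow in floating point; a careful implementation would compute these weights on the log scale.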

We can see that for $\mu_k^w$ and $\sigma_k^{2w}$ (and hence $\tilde\mu_k$ and $\tilde\tau_k^2$) to exist, the corresponding basis function evaluated at $h_{t-1}$, i.e., $w_k(h_{t-1})$, needs to be non-zero for at least some $t \ge 2$. However, this is not always guaranteed, since by the definition of the basis functions (2.2.4), when $k \ge M$, $w_k(h_{t-1}) = 0$ for $h_{t-1} < \gamma_{k-1}$, and when $k < M$, $w_k(h_{t-1}) = 0$ for $h_{t-1} > \gamma_k$. In other words, $w_k(h_{t-1})$ would be zero for all $t \ge 2$ if all values of $\{h_{t-1}\}_{t=2}^T$ simulated in one MCMC iteration are below $\gamma_{k-1}$ for $k \ge M$, or above $\gamma_k$ for $k < M$. Whenever this happens, we draw $\beta_k^*$ from its conditional prior distribution $N(\mu_k^*, \tau_k^{2*})$. It is worth pointing out that this is different from a variable selection problem, where reversible jump MCMC (Green, 1995) or pseudo-priors (Carlin and Chib, 1995) are used, since which coefficients can be updated by the data is uniquely determined given the draws of $h_t$, rather than being a stochastic choice. A useful implication of drawing from the prior to update a coefficient when no data are available is that it allows us to specify a knot grid for estimating the autoregressive function $f$ that covers a large enough range so that the values of $\{h_t\}_{t \in \mathbb{Z}}$ in the MCMC iterations are always within the boundary knots.

2) Update $\alpha_l^*$ and $\alpha_l$: The difference between updating $\alpha_l$ and $\beta_k$ lies in the fact that both positive and negative sign constraints for $\alpha_l$ are considered in (2.2.7) and (2.2.8). Without sign constraints on $\alpha_l$ (i.e., without shape constraints on the leverage function $g$), the full conditional distribution of $\alpha_l^*$ is similar to that of $\beta_k^*$. When $\alpha_l$ is constrained to be positive, the full conditional distribution of $\alpha_l^*$ is again similar to that of $\beta_k^*$ for positive $\beta_k$. However, when $\alpha_l$ is constrained to be negative, the full conditional distribution of $\alpha_l^*$ is a mixture of two truncated normal distributions in which the positive part now depends on the conditional prior mean and variance and the negative part on the data. The full conditional distribution for $\alpha_l^*$ is described below.

Similarly to the update of $\beta_k^*$, we let
\[
\mu_l^v = \sum_{t=2}^T v_l(\varepsilon_{t-1})\, e_l(t) \Big/ \sum_{t=2}^T v_l^2(\varepsilon_{t-1})
\quad\text{and}\quad
\sigma_l^{2v} = \sigma^2 \Big/ \sum_{t=2}^T v_l^2(\varepsilon_{t-1}),
\]
where
\[
e_l(t) = h_t - \mu - \sum_{k=1}^K w_k(h_{t-1})\,\beta_k - \sum_{j=1,\, j \neq l}^L v_j(\varepsilon_{t-1})\,\alpha_j,
\]
with $\varepsilon_{t-1} = e^{-h_{t-1}/2}\, r_{t-1}$. On the other hand, the conditional prior mean and variance for the latent $\alpha_l^*$ are
\[
\mu_l^* = \begin{cases} \alpha_{l+1}^* & \text{if } l = 1, \\ \alpha_{l-1}^* & \text{if } l = L, \\ (\alpha_{l-1}^* + \alpha_{l+1}^*)/2 & \text{otherwise,} \end{cases}
\quad\text{and}\quad
\gamma_l^{2*} = \begin{cases} \gamma^2 & \text{if } l = 1 \text{ or } L, \\ \gamma^2/2 & \text{otherwise.} \end{cases}
\]
As before, the mean and variance of the full conditional distribution of $\alpha_l^*$ when there is no sign constraint are given by
\[
\tilde\mu_l = \frac{\gamma_l^{2*}}{\sigma_l^{2v} + \gamma_l^{2*}}\,\mu_l^v + \frac{\sigma_l^{2v}}{\sigma_l^{2v} + \gamma_l^{2*}}\,\mu_l^*
\quad\text{and}\quad
\tilde\gamma_l^2 = \frac{\sigma_l^{2v}\,\gamma_l^{2*}}{\sigma_l^{2v} + \gamma_l^{2*}}.
\]

In summary, the different scenarios for sampling from the full conditional distribution for $\alpha_l^*$ are as follows.

• If $\alpha_l = \alpha_l^*$, i.e., we do not impose any sign constraints on $\alpha_l$, the full conditional distribution for $\alpha_l^*$ is simply $N(\tilde\mu_l, \tilde\gamma_l^2)$.

• If $\alpha_l = I(\alpha_l^* > 0)\,\alpha_l^*$, then the full conditional distribution for $\alpha_l^*$ is a mixture of two truncated normal distributions: a $tN(\mu_l^*, \gamma_l^{2*}; (-\infty, 0])$ distribution with probability proportional to
\[
\frac{\Phi(0; \mu_l^*, \gamma_l^{2*})}{\phi(0; \mu_l^*, \gamma_l^{2*})},
\]
and a $tN(\tilde\mu_l, \tilde\gamma_l^2; [0, \infty))$ distribution with probability proportional to
\[
\frac{1 - \Phi(0; \tilde\mu_l, \tilde\gamma_l^2)}{\phi(0; \tilde\mu_l, \tilde\gamma_l^2)}.
\]

• If $\alpha_l = I(\alpha_l^* < 0)\,\alpha_l^*$, then the full conditional distribution for $\alpha_l^*$ is a mixture of two different truncated normal distributions: a $tN(\mu_l^*, \gamma_l^{2*}; [0, \infty))$ distribution with probability proportional to
\[
\frac{1 - \Phi(0; \mu_l^*, \gamma_l^{2*})}{\phi(0; \mu_l^*, \gamma_l^{2*})},
\]
and a $tN(\tilde\mu_l, \tilde\gamma_l^2; (-\infty, 0])$ distribution with probability proportional to
\[
\frac{\Phi(0; \tilde\mu_l, \tilde\gamma_l^2)}{\phi(0; \tilde\mu_l, \tilde\gamma_l^2)}.
\]

• If $\sum_{t=2}^T v_l^2(\varepsilon_{t-1}) = 0$, then sample $\alpha_l^*$ from $N(\mu_l^*, \gamma_l^{2*})$.

3) Update the variance parameters $\sigma^2$, $\tau^2$ and $\gamma^2$: Suppose the conjugate prior distributions for $\sigma^2$, $\tau^2$ and $\gamma^2$ are $IG(a_s, b_s)$, $IG(a_t, b_t)$ and $IG(a_g, b_g)$, respectively. Then the full conditional distribution for $\sigma^2$ is
\[
IG\!\left(a_s + \frac{T-1}{2},\; b_s + 0.5 \sum_{t=2}^T e_t^2\right),
\quad\text{where}\quad
e_t = h_t - \mu - \sum_{k=1}^K w_k(h_{t-1})\,\beta_k - \sum_{l=1}^L v_l(\varepsilon_{t-1})\,\alpha_l.
\]
The full conditional distribution for $\tau^2$ is
\[
IG\!\left(a_t + \frac{K-1}{2},\; b_t + 0.5 \sum_{k=2}^K [\beta_k^* - \beta_{k-1}^*]^2\right).
\]
Finally, the full conditional distribution for $\gamma^2$ is
\[
IG\!\left(a_g + \frac{L-1}{2},\; b_g + 0.5 \sum_{l=2}^L [\alpha_l^* - \alpha_{l-1}^*]^2\right).
\]
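These conjugate updates are easy to implement. As a Python sketch (hypothetical function names; NumPy has no inverse-gamma sampler, so we invert a gamma draw with the rate expressed as a scale):

```python
import numpy as np

def update_sigma2(e, a_s, b_s, rng):
    """Draw sigma^2 from IG(a_s + (T-1)/2, b_s + 0.5 * sum e_t^2),
    where e holds the residuals e_2, ..., e_T."""
    shape = a_s + len(e) / 2.0
    rate = b_s + 0.5 * np.sum(np.asarray(e) ** 2)
    # An IG(shape, rate) draw is the reciprocal of a Gamma(shape, rate) draw.
    return 1.0 / rng.gamma(shape, 1.0 / rate)

def update_tau2(beta_star, a_t, b_t, rng):
    """Draw tau^2 from its IG full conditional built from the K-1
    successive differences of the random-walk coefficients beta*_k."""
    d = np.diff(np.asarray(beta_star))  # beta*_k - beta*_{k-1}, k = 2..K
    shape = a_t + len(d) / 2.0
    rate = b_t + 0.5 * np.sum(d ** 2)
    return 1.0 / rng.gamma(shape, 1.0 / rate)
```

The update for $\gamma^2$ is identical in form, with the $\alpha_l^*$ differences in place of the $\beta_k^*$ differences.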

4) Update $\mu$: Let
\[
e_\mu(t) = h_t - \sum_{k=1}^K w_k(h_{t-1})\,\beta_k - \sum_{l=1}^L v_l(\varepsilon_{t-1})\,\alpha_l.
\]
The full conditional distribution for $\mu$ is
\[
N\!\left(\frac{d_\mu^2}{(T-1)\,d_\mu^2 + \sigma^2} \sum_{t=2}^T e_\mu(t),\; \frac{d_\mu^2\,\sigma^2}{(T-1)\,d_\mu^2 + \sigma^2}\right),
\]
where $d_\mu^2$ denotes the prior variance of $\mu$.

5) Update $h_t$: The full conditional distribution for $h_t$ ($2 \le t \le T-1$) satisfies
\[
\pi(h_t \mid r_t, h_{t-1}, h_{t+1}, \beta, \mu, \sigma^2) \propto \phi(r_t; 0, e^{h_t})\; \phi\!\left(h_t;\, \mu + \sum_{k=1}^K \beta_k w_k(h_{t-1}) + \sum_{l=1}^L \alpha_l v_l(\varepsilon_{t-1}),\, \sigma^2\right) \times \phi\!\left(h_{t+1};\, \mu + \sum_{k=1}^K \beta_k w_k(h_t) + \sum_{l=1}^L \alpha_l v_l(\varepsilon_t),\, \sigma^2\right).
\]
For $h_1$, this full conditional becomes
\[
\pi(h_1 \mid r_1, h_2, \beta, \mu, \sigma^2) \propto \phi(r_1; 0, e^{h_1})\; \phi(h_1; 0, d_0^2)\; \phi\!\left(h_2;\, \mu + \sum_{k=1}^K \beta_k w_k(h_1) + \sum_{l=1}^L \alpha_l v_l(\varepsilon_1),\, \sigma^2\right),
\]
and for $h_T$, it is
\[
\pi(h_T \mid r_T, h_{T-1}, \beta, \mu, \sigma^2) \propto \phi(r_T; 0, e^{h_T})\; \phi\!\left(h_T;\, \mu + \sum_{k=1}^K \beta_k w_k(h_{T-1}) + \sum_{l=1}^L \alpha_l v_l(\varepsilon_{T-1}),\, \sigma^2\right).
\]
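A single random-walk Metropolis step targeting the interior full conditional above can be sketched in Python (illustrative only, not the dissertation's R/C++ implementation; `mean_fn`, which stands for the state mean $\mu + f(h_{t-1}) + g(\varepsilon_{t-1})$, and the proposal step size are placeholders):

```python
import numpy as np

def rw_update_ht(h, t, r, mean_fn, sigma2, step, rng):
    """One random-walk Metropolis step for an interior state h[t]
    (0-based index, 0 < t < len(h) - 1). mean_fn(h_prev, eps_prev)
    returns the conditional state mean; h is modified in place."""
    def log_target(ht):
        eps_prev = r[t - 1] * np.exp(-h[t - 1] / 2.0)
        m_t = mean_fn(h[t - 1], eps_prev)
        eps_t = r[t] * np.exp(-ht / 2.0)
        m_next = mean_fn(ht, eps_t)
        lp = -0.5 * (ht + r[t] ** 2 * np.exp(-ht))      # log phi(r_t; 0, e^{h_t})
        lp += -0.5 * (ht - m_t) ** 2 / sigma2           # log p(h_t | h_{t-1})
        lp += -0.5 * (h[t + 1] - m_next) ** 2 / sigma2  # log p(h_{t+1} | h_t)
        return lp

    prop = h[t] + step * rng.normal()
    if np.log(rng.uniform()) < log_target(prop) - log_target(h[t]):
        h[t] = prop
    return h[t]
```

The boundary states $h_1$ and $h_T$ would use the corresponding one-sided full conditionals given above.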

Updating the high-dimensional state vector $\{h_t,\, t = 1, \ldots, T\}$ remains a bottleneck for many applications of Bayesian state space models. For the linear unleveraged SV model, Kim et al. (1998) developed a popular algorithm that uses a mixture of normal distributions to approximate the log chi-squared distribution in the linearized observation equation, and then applies the augmented Kalman filter and simulation smoother to improve the efficiency of the MCMC. This method was later extended to the linear leveraged SV model in Omori et al. (2007) and Nakajima and Omori (2009). However, this approach hinges entirely on the linearity of the state equation, which does not apply in our case.

Several alternative strategies can potentially be used to update the state variables. First, the random-walk Metropolis-Hastings algorithm is the easiest to implement, but has the drawback of high autocorrelation among the samples. However, if we can simulate the chain long enough and thin it to alleviate the autocorrelation problem, this basic algorithm still performs reasonably well. Second, since the state variables are highly correlated with each other, updating more than one state variable at a time, i.e., using block updating (e.g., Chib and Greenberg, 1994, 1995), would be more efficient than one-at-a-time updating. To implement block updating, one way to form a block is to choose state variables further apart in time so that the variables in one block are close to independent of each other. For example, if we choose a block size of $B$, then a typical block can be constructed as
\[
\left(h_t,\; h_{t + \lfloor T/B \rfloor},\; h_{t + 2\lfloor T/B \rfloor},\; \ldots,\; h_{t + K_B \lfloor T/B \rfloor}\right),
\]
where $K_B = \operatorname{argmax}_k\, (t + k \lfloor T/B \rfloor \le T)$. Third, we can improve the proposal distributions for updating the state variables by exploiting the shape of the full-conditional distribution, such as using the Metropolis-adjusted Langevin algorithm (Grenander and Miller, 1994; Roberts and Tweedie, 1996) and its variations. Last but not least, particle-filter-based methods, such as the particle Markov chain Monte Carlo (PMCMC) method (e.g., Andrieu et al., 2010), can also be applied to update the state variables.
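The thinned block construction above amounts to taking every $\lfloor T/B \rfloor$-th time index starting from $t$; a minimal Python sketch (hypothetical function name, 1-based time indices):

```python
def block_indices(t, T, B):
    """Time indices (t, t + floor(T/B), t + 2*floor(T/B), ...) <= T
    forming one block of nearly independent state variables."""
    step = T // B
    return list(range(t, T + 1, step))
```

For example, with $T = 10$ and block size $B = 5$, starting points $t = 1$ and $t = 2$ give the interleaved blocks $(h_1, h_3, h_5, h_7, h_9)$ and $(h_2, h_4, h_6, h_8, h_{10})$.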

2.3.2 Model Comparison Criteria

For simulation studies where the true values of the latent log volatility, ht, are known, model comparison can be easily accomplished by comparing the estimated and simulated ht under the Mean Absolute Error (MAE) or the Mean Squared Error

(MSE). Under these two loss functions, it is also desirable that we make comparisons on the log volatility scale rather than the variance or standard deviation scale since the distribution of the former is more symmetric.

For empirical data, a likelihood-based criterion such as the Bayes factor is a natural choice for model comparison. Kim et al. (1998) developed a method of approximating the Bayes factor using posterior draws based on Chib (1995)'s method,
\[
\log p(r \mid M_k) = \log p(r \mid \theta^*, M_k) + \log p(\theta^* \mid M_k) - \log p(\theta^* \mid r, M_k).
\]
Yu (2012) adopted the same approach, where $\theta^*$ is chosen to be the posterior mean of the parameters $\theta$. However, such approaches require many approximations, such as calculating $p(r \mid \theta, M_k)$ with an importance sampling approach based on the Laplace approximation, as well as estimating $p(\theta \mid r, M_k)$ with a high-dimensional multivariate kernel density estimate. As for out-of-sample predictive performance, the most common approach is to use the so-called realized volatilities as the true states (see, e.g.,

Audrino and Bühlmann, 2009; Yu, 2012). However, realized volatility is but another estimator of the true volatility, and it suffers from market microstructure noise when used on high-frequency data (McAleer and Medeiros, 2008). Yu (2012) used daily returns to calculate weekly realized volatilities, but as illustrated in Figure 2.2, although they are better than simply using $r_t^2$, their usefulness as proxies for the true volatility states is debatable.

[Figure 2.2 about here: S&P 500 log volatility over time (left panel, "SP500") and its distribution (right panel, "Distribution").]

Figure 2.2: Stochastic volatility (grey line) vs. realized volatility (dark line) vs. $\log(r_t^2)/2$ (dots): the distribution of the realized volatilities is indeed less left-skewed than that of $\log(r_t^2)/2$, but is still much noisier than the estimated stochastic volatilities and slightly left-skewed.

In this chapter, we use the predictive log likelihood as our performance measure.

Let rt1:t2 be an abbreviation for {rt1 , . . . , rt2 } and let θ denote all model parameters.

Then the predictive log likelihood over the test period {T + 1,...,T + S} can be expressed as

\[
\log p(r_{(T+1):(T+S)} \mid r_{1:T}) = \sum_{s=0}^{S-1} \log p(r_{T+s+1} \mid r_{1:(T+s)}), \tag{2.3.1}
\]

where

\[
p(r_{T+s+1} \mid r_{1:(T+s)}) = \int p(r_{T+s+1} \mid h_{T+s+1})\; p(h_{T+s+1} \mid h_{T+s}, \theta, r_{1:(T+s)})\; p(h_{T+s}, \theta \mid r_{1:(T+s)})\; dh_{T+s+1}\, dh_{T+s}\, d\theta. \tag{2.3.2}
\]

37 Equation (2.3.1) suggests the following scheme for computing the predictive log

likelihood: first, use (2.3.2) as well as the MCMC output based on the training data

r1:(T +s) to evaluate the one-step-ahead predictive log likelihood, log p(rT +s+1|r1:(T +s));

next, append the new return rT +s+1 to the training data and refit the model using

MCMC. Finally, the overall log likelihood can be computed by iterating the previous

two steps for s = 0,...,S − 1 and then summing up all individual one-step-ahead

predictive log likelihoods. However, this method is computationally intense since we

have to re-run the entire MCMC algorithm whenever we append a new return to the

training data.

To alleviate the computational burden, we propose to use the “Forward Filter-

ing” stage of the nonlinear Forward-Filtering-Backward-Sampling algorithm of God-

sill et al. (2004). Instead of using the one-step-ahead prediction, a particle filter

allows us to carry out multiple-step-ahead prediction very efficiently. A key step in

the multiple-step-ahead prediction approach is to generate samples of hT +s, θ|r1:(T +s)

for s > 0 (when s = 0, the samples of hT , θ|r1:T can be directly obtained from the

output of the MCMC algorithm). Given that

\[
p(h_{T+s}, \theta \mid r_{1:(T+s)}) = \frac{p(r_{T+s} \mid h_{T+s}, \theta, r_{1:(T+s-1)})\; p(h_{T+s}, \theta \mid r_{1:(T+s-1)})}{p(r_{T+s} \mid r_{1:(T+s-1)})} = \frac{p(r_{T+s} \mid h_{T+s})}{p(r_{T+s} \mid r_{1:(T+s-1)})}\; p(h_{T+s}, \theta \mid r_{1:(T+s-1)}), \tag{2.3.3}
\]
we can apply Gordon et al. (1993)'s bootstrap filter to get weighted samples from the posterior distribution $p(h_{T+s}, \theta \mid r_{1:(T+s)})$ without having to re-run the MCMC for each $s$. When applying the bootstrap filter, we follow the suggestion of Liu and Chen

(1998) and resample using the residual resampling scheme only when the effective sample size (ESS) falls below a threshold (see the filtering step iii in the algorithm

38 below). See Kong et al. (1994) and Liu and Chen (1995) for in-depth discussions of

resampling and the ESS. Alternatively, we can consider other resampling algorithms

such as, most notably, the systematic or stratified sampling procedure discussed in

Kitagawa (1996).

Although the resampling approach alleviates the so-called sample impoverishment

or degeneracy problem of particle filters where only a few particles will have non-

negligible weights after a few iterations, it is still desirable to re-run the MCMC

regularly to obtain new samples. In other words, we need to pick a moderate step

size for our multiple-step-ahead prediction in order to limit the impact of the sample

impoverishment problem while still allowing fast computation. We also note that the

idea of rejuvenating the particles once in a while using a Gibbs sampler has been

explored previously in different scenarios (see, e.g. MacEachern et al., 1999).

In summary, our algorithm is as follows. Let D be the number of the posterior

draws after thinning and S0 be the step size of the multiple-step-ahead prediction (i.e., the length of the test data to be evaluated before we re-run the MCMC to refresh the samples). For s = 0,...,S − 1, we iterate the following two steps.

1) The Prediction Step

i. If $s \equiv 0 \pmod{S_0}$, run the Markov chain Monte Carlo algorithm to obtain new samples from the posterior distribution $p(h_{T+s}, \theta \mid r_{1:(T+s)})$. Initialize the particles as $\{h_{T+s}^{(d)}, \theta^{(d)};\, \tilde w_s^{(d)}\}_{d=1}^D$, where $h_{T+s}^{(d)}$ and $\theta^{(d)}$ are obtained from the MCMC and $\tilde w_s^{(d)} = 1/D$.

ii. For $d = 1, \ldots, D$, draw $h_{T+s+1}^{(d)}$ from the univariate normal distribution $p(h_{T+s+1} \mid h_{T+s}^{(d)}, \theta^{(d)}, r_{1:(T+s)})$.

iii. Estimate the predictive likelihood for the observation $r_{T+s+1}$ using
\[
\tilde p(r_{T+s+1} \mid r_{1:(T+s)}) \approx \sum_{d=1}^D \tilde w_s^{(d)}\; \phi\!\left(r_{T+s+1};\, 0,\, e^{h_{T+s+1}^{(d)}}\right).
\]

2) The Filtering Step

i. For $d = 1, \ldots, D$, given the particles $(h_{T+s+1}^{(d)}, \theta^{(d)})$ obtained in the prediction step, update the weights using $w_{s+1}^{(d)} \propto \tilde w_s^{(d)}\, p(r_{T+s+1} \mid h_{T+s+1}^{(d)})$, which follows from Equation (2.3.3).

ii. Normalize the weights, $\tilde w_{s+1}^{(d)} = w_{s+1}^{(d)} / \sum_{d=1}^D w_{s+1}^{(d)}$, and calculate the effective sample size, $ESS = 1 / \sum_{d=1}^D [\tilde w_{s+1}^{(d)}]^2$.

iii. If $ESS < 0.5D$, resample $D$ particles with probabilities $\{\tilde w_{s+1}^{(d)}\}_{d=1}^D$ and reset the weights of the new particles to be $\tilde w_{s+1}^{(d)} = 1/D$. Otherwise, use the particles and weights obtained previously, $\{h_{T+s+1}^{(d)}, \theta^{(d)};\, \tilde w_{s+1}^{(d)}\}_{d=1}^D$. The residual resampling method works as follows:

a - Retain $C_d = \lfloor D \tilde w_{s+1}^{(d)} \rfloor$ copies of $(h_{T+s+1}^{(d)}, \theta^{(d)})$ and calculate the total residual count $C_r = D - \sum_{d=1}^D C_d$.

b - Obtain $C_r$ iid draws from $\{h_{T+s+1}^{(d)}, \theta^{(d)}\}$ with probabilities proportional to $D \tilde w_{s+1}^{(d)} - C_d$, $d = 1, \ldots, D$.
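The ESS rule and the residual resampling scheme in step iii can be sketched as follows (illustrative Python with hypothetical names; the functions operate on normalized weights and return particle indices):

```python
import numpy as np

def ess(weights):
    """Effective sample size of a vector of normalized weights."""
    w = np.asarray(weights)
    return 1.0 / np.sum(w ** 2)

def residual_resample(weights, rng):
    """Residual resampling: keep floor(D*w_d) deterministic copies of
    particle d, then draw the remaining C_r indices with probabilities
    proportional to the residuals D*w_d - floor(D*w_d)."""
    w = np.asarray(weights)
    D = len(w)
    counts = np.floor(D * w).astype(int)
    idx = np.repeat(np.arange(D), counts)
    Cr = D - counts.sum()
    if Cr > 0:
        resid = D * w - counts
        resid = resid / resid.sum()
        idx = np.concatenate([idx, rng.choice(D, size=Cr, p=resid)])
    return idx
```

With uniform weights the residual step is vacuous: every particle is retained exactly once and no random draws are needed, which is why residual resampling adds less Monte Carlo noise than multinomial resampling.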

At the end, after all test data have been evaluated, we calculate the log predictive likelihood using (2.3.1); that is, $\sum_{s=0}^{S-1} \log \tilde p(r_{T+s+1} \mid r_{1:(T+s)})$.

2.4 Simulations

The main objective of our simulation studies is to understand the role of shape constraints in the estimation of log volatilities. We investigate both the unleveraged and the more realistic leveraged SV models with various forms of the volatility function, and compare the fit of our shape-constrained semiparametric additive SV model

40 (denoted by SC-SV) with the fit of two other models: the parametric SV model (AR-

SV) as defined by (1.2.1) and (2.1.1), and the semiparametric additive SV model

without shape constraints (SP-SV). To allow for different degrees of smoothness, we

consider semiparametric models with 10 and 20 knot intervals.

Our simulation results are based on 200 replicate time series of length $N = 2{,}048$ for each experimental condition. We use MAE and MSE to measure our ability to accurately estimate the log volatilities. We assume flat priors for $\beta_1^*$ and $\alpha_1^*$, and conjugate multivariate normal priors for $(\mu, \phi)$ or $(\mu, \phi, \psi)$ in the parametric AR-SV models. The MCMC algorithm for fitting the semiparametric SV models was presented in Section 2.3.1, and we used the basic random-walk Metropolis-Hastings algorithm to update the state variables $\{h_t\}_{t=1}^T$. The same algorithm also applies to the parametric models (1.2.1) and (2.1.1), except that the full conditional for the coefficients $(\mu, \phi)$ or $(\mu, \phi, \psi)$ is now multivariate normal given conjugate priors. We used R (R Core Team, 2013) for all computations, but used C++ through the R package RcppArmadillo (Eddelbuettel and Sanderson, 2014) for the update of the log volatilities, $\{h_t\}_{t=1}^N$, to speed up the computation. We use the log squared returns as the initial states for $\{h_t\}_{t=1}^N$. For the AR-SV and SC-SV models, we sample from the posterior distribution using 21,000 iterations of the MCMC algorithm. To ensure convergence of the Markov chain, we used 31,000 iterations of the MCMC algorithm for the SP-SV model. In all cases, the first 1,000 iterations were discarded as burn-in samples and the remaining samples were thinned every 10 iterations.

2.4.1 Uncentered versus Centered Basis Functions

To compare the two different sets of basis functions, (2.2.4) versus (2.2.3), we implement each set in the unleveraged shape-constrained semiparametric SV model and fit the resulting two models to the same dataset, simulated with the autoregressive function
\[
f(h_{t-1}) = -0.8 + 0.9\, h_{t-1}.
\]
We use 21 knots for the estimation of $f$ and run the MCMC for 21,000 iterations, with the first 1,000 draws discarded as burn-in samples and the remaining samples thinned every 10 iterations. Figure 2.3 shows the trace plots of the parameters $\mu$ and $\beta_1$, as defined in (2.2.5), for each set of basis functions. For the model with uncentered basis functions, $\mu$ represents the value of the estimated autoregressive function $f$ at the leftmost knot, and for the model with centered basis functions, $\mu$ represents the value of $f$ at the central knot.

In Figure 2.3, the trace plots for the uncentered basis functions clearly show larger variance and higher autocorrelation than those for the centered basis functions. The poor mixing of the MCMC chain when using the uncentered basis functions also affects the estimation accuracy of the log volatilities, $\{h_t\}_{t=1}^N$. The MSE of the estimated versus simulated log volatilities is 0.006 when using the centered basis functions, and this number is almost 50% larger for the uncentered version, at 0.010. Clearly, the centered basis functions should be preferred to the uncentered ones. Even though the resulting models are theoretically equivalent, they behave quite differently in practice. For the rest of this chapter, we implement the centered basis functions in all our semiparametric models.

[Figure 2.3 about here: trace plots of $\mu$ and $\beta_1$ for the uncentered (top row) and centered (bottom row) basis functions.]

Figure 2.3: Trace plots of the parameters $\mu$ and $\beta_1$ in model (2.2.5) with centered versus uncentered basis functions.

2.4.2 Unleveraged SV Models

We consider three different forms of the autoregressive component f in the SV

model (2.2.1) without leverage effect. (The three f functions are shown as the dashed

lines in Figure 2.6.) The variance of the error term ηt of the volatility process is set

to be 0.15 for all simulations. Although the scales of the returns and volatilities have

no bearing on model performance, we adjust the value of µ in each volatility equation

so that the simulated log volatilities are centered around −8, which is a typical value for real-world equity returns such as the S&P500 daily returns. The true volatility equations for the three unleveraged SV models are as follows.

1) (Lf) Linear f:

\[
h_t = -0.8 + 0.9\, h_{t-1} + \eta_t.
\]

43 This is the basic SV model.

2) (Cf) Change-point f:

\[
h_t = -8.1 + 0.9\, h_{t-1}^*\, I(h_{t-1}^* > 0) + 0.3\, h_{t-1}^*\, I(h_{t-1}^* \le 0) + \eta_t,
\]

where $h_{t-1}^* = h_{t-1} + 8$. The slope of this autoregressive function is steeper when

the lagged log volatility ht−1 is greater than the change point, −8, which reflects

the belief that the signal in the volatility process is stronger when the market is

more volatile.

3) (Sf) Sigmoid f:
\[
h_t = -8 + 1.6\, h_{t-1}^* \Big/ \sqrt{1 + h_{t-1}^{*2}} + \eta_t,
\]

where $h_{t-1}^* = h_{t-1} + 8$. This particular sigmoid function produces a volatility series with a quasi-Markov-switching behavior (see Figure 2.4). The two states

(volatile and stable) are generated by the two tails of the sigmoid function and

the steepness of the curve in the middle controls the transition between the two

states.
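The quasi-Markov-switching behavior is easy to reproduce by simulating the Sf model directly. The following Python sketch (illustrative only, not the R code used in the dissertation) matches the equation above, with $\mathrm{Var}(\eta_t) = 0.15$ and $r_t = e^{h_t/2}\,\varepsilon_t$ for standard normal $\varepsilon_t$:

```python
import numpy as np

def simulate_sf_sv(T, rng):
    """Simulate T returns from the unleveraged sigmoid-f SV model:
    h_t = -8 + 1.6 * h*_{t-1} / sqrt(1 + h*_{t-1}^2) + eta_t,
    with h* = h + 8, Var(eta) = 0.15, and r_t = exp(h_t / 2) * N(0, 1)."""
    h = np.empty(T)
    r = np.empty(T)
    h[0] = -8.0  # start at the center of the volatility process
    for t in range(T):
        if t > 0:
            hs = h[t - 1] + 8.0
            h[t] = -8.0 + 1.6 * hs / np.sqrt(1.0 + hs ** 2) \
                   + np.sqrt(0.15) * rng.normal()
        r[t] = np.exp(h[t] / 2.0) * rng.normal()
    return r, h
```

Plotting the simulated $\{h_t\}$ shows the two regimes produced by the tails of the sigmoid, with occasional transitions driven by the noise pushing the process across the steep middle section.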

Table 2.1 summarizes the estimated MSE and MAE of the posterior means of the log volatilities compared to their simulated values. When the autoregressive function is linear, i.e., when AR-SV is the correct model, the shape-constrained semiparametric model underperforms the AR-SV model by a small margin, while the SP-SV model, without proper shape constraints, is too flexible to pick up the linear form reliably. When the true autoregressive function is nonlinear, the impact of imposing the correct shape constraints on the estimation of the log volatilities becomes evident. For example, with the change-point autoregressive SV model, imposing shape constraints (i.e., the SC-SV model) reduces the MSE over the SP-SV model by 28% (10 knot intervals) and 30% (20 knot intervals), and the reduction in MSE for the sigmoid autoregressive SV model is 26% and 30%, respectively. This demonstrates the gain of including shape constraints in a semiparametric SV model when they are warranted.

True Model: Lf              AR-SV      SC-SV(10)  SP-SV(10)  SC-SV(20)  SP-SV(20)
Average (s.e.)   MSE        36.3(0.2)  36.7(0.2)  40.2(0.4)  36.9(0.2)  43.2(0.6)
                 MAE        47.9(0.1)  48.2(0.1)  50.4(0.3)  48.3(0.1)  52.2(0.3)
Rel. to AR-SV    MSE        100.0%     101.2%     110.7%     101.5%     119.0%
                 MAE        100.0%     100.6%     105.1%     100.8%     108.9%

True Model: Cf              AR-SV      SC-SV(10)  SP-SV(10)  SC-SV(20)  SP-SV(20)
Average (s.e.)   MSE        24.7(0.1)  23.1(0.1)  31.9(0.3)  23.1(0.1)  33.2(0.3)
                 MAE        39.4(0.1)  38.1(0.1)  44.7(0.2)  38.1(0.1)  46.0(0.2)
Rel. to AR-SV    MSE        100.0%     93.6%      129.3%     93.5%      134.7%
                 MAE        100.0%     96.9%      113.6%     96.8%      116.8%

True Model: Sf              AR-SV      SC-SV(10)  SP-SV(10)  SC-SV(20)  SP-SV(20)
Average (s.e.)   MSE        38.2(0.2)  34.2(0.2)  46.5(0.6)  34.0(0.2)  48.5(0.7)
                 MAE        49.1(0.1)  46.2(0.2)  55.7(0.4)  45.9(0.2)  57.1(0.4)
Rel. to AR-SV    MSE        100.0%     89.6%      121.8%     88.9%      126.9%
                 MAE        100.0%     94.1%      113.5%     93.6%      116.3%

Table 2.1: For the three true unleveraged SV models (Lf, Cf and Sf), a summary of the MSE and MAE of the log volatilities, $\{h_t\}_{t=1}^N$, obtained from fitting five different SV models. The averages and standard errors are multiplied by 100, and are based on 200 replicate series.

[Figure 2.4 about here: simulated return series (top panel) and latent log volatility series (bottom panel) for the sigmoid-f case.]

Figure 2.4: An example of the simulated datasets for the case of sigmoid f. The top plot shows the simulated return series, while the bottom one shows the simulated latent log volatility series, $\{h_t\}$. The quasi-Markov-switching feature is easily noticeable in the lower plot.

Our findings from Table 2.1 are further confirmed by the fitted autoregressive functions shown in Figure 2.5 (for the case of 10 knot intervals) and Figure 2.6 (for the case of 20 knot intervals). The only major difference between Figure 2.5 and Figure 2.6 is that the plots in Figure 2.6 are much smoother. The true autoregressive functions are shown as the dashed lines in each panel of the figures, and the estimates are the posterior means of $f(\tilde h_{t-1})$, where $\tilde h_{t-1}$ is a grid of pre-specified values of $h_t$.

[Figure 2.5 about here: panels AR-SV, SC-SV and SP-SV (columns) for the Lf, Cf and Sf cases (rows), plotting $\mu + f(h_{t-1})$ against $h_{t-1}$.]

Figure 2.5: Estimated autoregressive function f for the unleveraged SV models based on a single simulated dataset, using 10 knot intervals. The dashed line in each graph shows the true functional form of f. The solid line shows the posterior mean of f, and the shaded region indicates the point-wise 95% credible bands for f.

[Figure 2.6 about here: panels AR-SV, SC-SV and SP-SV (columns) for the Lf, Cf and Sf cases (rows), plotting $\mu + f(h_{t-1})$ against $h_{t-1}$.]

Figure 2.6: Estimated autoregressive function f for the unleveraged SV models based on a single simulated dataset, using 20 knot intervals. The dashed line in each graph shows the true functional form of f. The solid line shows the posterior mean of f, and the shaded region indicates the point-wise 95% credible bands for f.

It is clearly seen in both Figure 2.5 and Figure 2.6 that the estimated autoregressive functions for the semiparametric model without shape constraints (SP-SV) have very large posterior variance, especially toward the tails, where the volatility data are scarce. Additionally, the posterior distribution of $f(\tilde h_{t-1})$ can be highly skewed, and the posterior averages can deviate substantially from the truth (see the fitted f function for the change-point model). Imposing proper shape constraints in the semiparametric model considerably reduces the posterior variance and skewness of the estimated autoregressive function across all values of the past volatilities, and the resulting SC-SV model is still flexible enough to capture the shapes of the different true autoregressive functions.

2.4.3 Leveraged SV Models

For the leveraged models, we consider the following three different volatility equa-

tions. (Plots of the true autoregressive and leverage functions are shown as the dashed

lines in Figure 2.8.)

1) (Lf-Lg) Linear f & Linear g: The state equation is given by

ht = −0.9 + 0.9ht−1 − 0.6εt−1 + ηt,

where εt−1 = rt−1 exp(−ht−1/2). This is the AR-SV model with leverage effect.

2) (Lf-NLg) Linear f & Nonlinear g: The state equation is given by

ht = −1.1 + 0.9ht−1 − 0.6εt−1I(εt−1 < 0) + 0.1εt−1I(εt−1 ≥ 0) + ηt,

where $\varepsilon_{t-1} = r_{t-1} \exp(-h_{t-1}/2)$. If the variance of the error term $\eta_t$ is 0, this becomes the EGARCH model, whose news impact curve is asymmetric and piecewise linear.

3) (NLf-NLg) Nonlinear f & Nonlinear g: The state equation is given by
\[
h_t = -8.5 + 0.9\, h_{t-1}^*\, I(h_{t-1}^* > 0) + 0.3\, h_{t-1}^*\, I(h_{t-1}^* \le 0) - 0.6\, \varepsilon_{t-1}\, I(\varepsilon_{t-1} < 0) + 0.1\, \varepsilon_{t-1}\, I(\varepsilon_{t-1} \ge 0) + \eta_t,
\]
where $h_{t-1}^* = h_{t-1} + 8$ and $\varepsilon_{t-1} = r_{t-1} \exp(-h_{t-1}/2)$.

The variances of the $\eta_t$ terms in the above equations are set to be 0.15. We implement the correct shape constraints in the SC-SV models, which means that for the Lf-Lg case the leverage function g is assumed to be monotonically decreasing on the entire real line, while in the other two cases it is assumed to be monotonically decreasing on the negative real line and monotonically increasing on the positive real line. The MSE and MAE of the estimated log volatilities compared to the simulated values are shown in Table 2.2.

True Model: Lf-Lg           AR-SV      SC-SV(10)  SP-SV(10)  SC-SV(20)  SP-SV(20)
Average (s.e.)   MSE        38.0(0.2)  39.5(0.2)  39.6(0.2)  39.7(0.2)  40.3(0.3)
                 MAE        48.5(0.1)  49.6(0.1)  49.6(0.1)  49.7(0.1)  50.0(0.2)
Rel. to AR-SV    MSE        100.0%     104.2%     104.3%     104.7%     106.1%
                 MAE        100.0%     102.3%     102.3%     102.5%     103.1%

True Model: Lf-NLg          AR-SV      SC-SV(10)  SP-SV(10)  SC-SV(20)  SP-SV(20)
Average (s.e.)   MSE        35.3(0.2)  30.9(0.1)  31.2(0.1)  31.0(0.1)  31.7(0.3)
                 MAE        47.2(0.1)  44.3(0.1)  44.6(0.1)  44.4(0.1)  44.8(0.1)
Rel. to AR-SV    MSE        100.0%     87.5%      88.5%      87.9%      89.6%
                 MAE        100.0%     93.9%      94.4%      94.2%      94.9%

True Model: NLf-NLg         AR-SV      SC-SV(10)  SP-SV(10)  SC-SV(20)  SP-SV(20)
Average (s.e.)   MSE        25.5(0.1)  21.8(0.1)  26.3(0.3)  22.0(0.1)  27.0(0.3)
                 MAE        40.2(0.1)  37.4(0.1)  41.2(0.3)  37.6(0.1)  41.9(0.3)
Rel. to AR-SV    MSE        100.0%     85.5%      103.2%     86.4%      105.9%
                 MAE        100.0%     93.0%      102.4%     93.6%      104.2%

Table 2.2: For the three true leveraged SV models (Lf-Lg, Lf-NLg and NLf-NLg), a summary of the MSE and MAE of the log volatilities, $\{h_t\}_{t=1}^N$, obtained from fitting five different SV models. The averages and standard errors are multiplied by 100, and are based on 200 replicate series.

When the autoregressive component f and the leverage function g are both linear,

AR-SV is the true model and as expected, it outperforms both semiparametric models.

When either f or g is nonlinear, however, the AR-SV model is not flexible enough to pick up the true functional forms, resulting in the larger MSE and MAE as compared to the shape-constrained semiparametric model. For all three forms of the volatility function, the advantage of incorporating shape constraints in the semiparametric models is reflected in the consistently smaller MSE and MAE for the SC-SV model over the SP-SV model. As in Section 2.4.2, the greatest gain in the MSE or MAE was obtained when the autoregressive function f deviates the most from linearity

(the NLf-NLg case).

Figure 2.7 shows the estimated autoregressive function f (odd rows) and leverage function g (even rows) based on 10 knot intervals; the 20-knot-interval version is presented in Figure 2.8.

Figure 2.7: For the three true leveraged SV models (Lf-Lg, Lf-NLg and NLf-NLg), the estimated autoregressive function f (odd rows) and leverage function g (even rows) in different fitted models (columns) based on 10 knot intervals. The line types and shading are defined in the same way as in Figure 2.5.

Figure 2.8: For the three true leveraged SV models (Lf-Lg, Lf-NLg and NLf-NLg), the estimated autoregressive function f (odd rows) and leverage function g (even rows) in different fitted models (columns) based on 20 knot intervals. The line types and shading are defined in the same way as in Figure 2.5.

The two versions are very close to each other, with the plots in Figure 2.8 slightly smoother than those in Figure 2.7, so we focus on Figure 2.8 for the following discussion. Whenever the true autoregressive or leverage function is nonlinear, the fitted f or g in the semiparametric SV model without shape constraints (SP-SV) can have extremely large posterior variances toward the tails, where the data are scarce. Restricting the parameter space through proper shape constraints proves to be essential for more efficient estimation in SV models.

2.5 Empirical Studies

We now assess the advantage of our proposed shape-constrained semiparametric additive SV models for predicting volatilities in four real-world equity return series: the S&P500 index, Equity Residential (EQR; a real estate investment trust), Microsoft (MSFT), and Johnson & Johnson (JnJ). The returns are calculated as the change in the logarithm of the daily adjusted closing prices obtained from Yahoo! Finance (retrieved on January 12, 2014). For each series, we use the returns from November 1, 2001 to October 29, 2010 as our training set and the following three years of returns, from November 1, 2010 to October 31, 2013, as the test set. For both the training and test return series, we subtract the sample mean from the data so that the returns have zero mean.

As in our simulation studies, we consider the parametric model (AR-SV) and the semiparametric additive SV model with shape constraints (SC-SV) and without shape constraints (SP-SV). We fit both a leveraged and an unleveraged version of each model and compare their predictive log likelihoods over the test period. We run the MCMC for 102,500 iterations, discarding the first 2,500 draws as burn-in and thinning the remaining draws every 25 iterations. For the shape constraints in the SC-SV model, exploratory analyses indicated that it is reasonable to assume the autoregressive function f to be monotonically increasing and the leverage function g monotonically decreasing; these assumptions will be assessed later. The predictive log likelihood that we use to compare the models is calculated according to Section 2.3.2, with the step size S0 for the multiple-step prediction set to 110.

          AR-SV(u)  SC-SV(u)  SP-SV(u)  AR-SV(l)  SC-SV(l)  SP-SV(l)
S&P500    -3798.8   -3796.5   -3799.3   -3784.5   -3784.2   -3786.0
EQR       -4080.5   -4068.8   -4072.0   -4080.4   -4070.6   -4079.5
MSFT      -4103.6   -4100.7   -4100.8   -4102.6   -4099.4   -4103.1
JnJ       -3681.7   -3682.9   -3684.2   -3680.9   -3680.7   -3681.7

Table 2.3: Predictive log likelihood of daily returns from November 1, 2010 to October 31, 2013 using six different SV models. (u) - unleveraged; (l) - leveraged. The results of the best performing model (those with the largest predictive log likelihood) are in bold face.

Table 2.3 displays the estimated predictive log likelihoods for all six SV models. For each of the four return series, the value in bold denotes the model with the largest predictive log likelihood. The predictive log likelihood of the semiparametric model without shape constraints always lags behind that of the corresponding shape-constrained model, suggesting that it is desirable to incorporate shape constraints in our semiparametric additive model when knowledge of the true functional shapes is available. The semiparametric model without shape constraints outperforms the corresponding AR-SV model in only three out of the eight pairs; imposing shape constraints increases this ratio to seven out of eight. For Equity Residential, the outperformance of the shape-constrained model compared to AR-SV is substantial. In one case, Johnson & Johnson, the linear parametric AR-SV model seems to be the most suitable model, indicating that a more parsimonious model is preferred for explaining the volatility in this return series.

To assess the appropriateness of our shape assumptions for the four equity returns, we estimate the autoregressive function f and leverage function g in the same way as in the simulation studies. Figure 2.9 shows the fitted functions (posterior means) and the associated pointwise 95% credible bands for the leveraged semiparametric SV models (with and without the shape constraints) for each return series.

For the S&P500 index, both fitted functions are clearly nonlinear, which means that the volatility process, and consequently the return series, is nonstationary. When the lagged log volatility ht−1 is small, the lag-one autocorrelation for fixed εt−1 is diminished compared to when ht−1 is moderate; when ht−1 is large, the lag-one autocorrelation is also attenuated, but to a lesser degree. In addition, the shapes of the autoregressive and leverage functions estimated from the two semiparametric models (SC-SV and SP-SV) agree with each other for the most part: the function f is monotonically increasing and g monotonically decreasing, which indicates that our shape-constraint assumptions are appropriate. Even though the leftmost section of the g function is found to be increasing in the SP-SV model, we find this counterintuitive and unreliable, since our simulation studies showed that the estimated functions can deviate considerably from the truth toward the boundaries if no shape constraints are implemented.

As for the other three return series, our monotonicity assumptions for the autoregressive and leverage functions in the shape-constrained models hold for all except the Equity Residential series (EQR). For EQR, the SP-SV model

Figure 2.9: Fitted autoregressive function f and leverage function g for the daily returns of the S&P500, EQR, MSFT and JnJ. The solid lines show the posterior means and 95% credible bands of the fitted functions in the leveraged semiparametric models with shape constraints, while the dashed lines represent the posterior means and 95% credible bands in the leveraged semiparametric models without shape constraints.

finds that the posterior mean of the leverage function g is increasing on the positive real line, but with a very large degree of uncertainty. As a further investigation, we fit two additional shape-constrained models with the same priors on f and on g(εt−1) for εt−1 < 0. In the first model, we assumed the leverage function g to be monotonically increasing on the positive real line; in the second, we imposed no shape constraints on g(εt−1) for εt−1 > 0. Their predictive log likelihoods are −4074.3 and −4074.5 respectively, both larger than those of the corresponding leveraged AR-SV and SP-SV models. However, neither of these two additional models outperforms the original leveraged SC-SV model with the monotonicity constraint on g. This may be because the lagged return innovation εt−1 follows the standard normal distribution a priori, so an extreme εt−1 out in the tails would rarely be observed. Nevertheless, a monotonically decreasing leverage function seems a reasonable assumption when fitting the shape-constrained semiparametric additive SV models to most equity returns.

2.6 Discussion

In this chapter, we have proposed a class of shape-constrained semiparametric additive stochastic volatility models. We have introduced a parameterization that allows for more efficient sampling from the posterior distribution, and developed a particle-filter-based model comparison approach. Through simulation and empirical studies, we demonstrated that the shape-constrained semiparametric SV model has the flexibility to capture unknown functional shapes, while not losing too much efficiency compared to a parametric SV model when the true underlying model is best explained by the latter. We demonstrated the use of specific forms of shape constraints on the autoregressive and leverage functions. Within this framework, researchers and practitioners can easily incorporate other shape constraints that they deem appropriate.

Our shape-constrained SV model can be extended in a number of directions, beyond simply including additional lagged terms. First, the additivity between the autoregressive and leverage functions can be relaxed. One option would be to assume that the log volatility satisfies

ht = ζ(ht−1, εt−1) + ηt,

where the function ζ is subject to a two-dimensional shape constraint. The challenge lies in the fact that ζ does not have a rectangular support, because of the dependence between ht−1 and εt−1. Alternatively, we can assume that the additivity holds on a different scale, ς(σt), where ς can be a known function other than exp(ht/2); the exp(ht/λ0) scale considered by Comte (2004) would be an example. Second, the idea of incorporating shape constraints in semiparametric SV models can also be applied to additive GARCH-type models for more efficient estimation of the news impact curve. Third, our framework can be adapted to include convexity constraints on the functional components. Under the basis expansion (2.2.2) and with the same basis functions (2.2.4), convexity constraints translate to an ordering of the slope coefficients {βj}_{j=1}^J. To constrain the function to be convex and monotonically increasing at the same time, the slope coefficients need to satisfy

0 ≤ β1 ≤ β2 ≤ ... ≤ βJ.

The corresponding model fitting is straightforward.
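One way to impose this ordering in a sampler is to parameterize the slopes through cumulative sums of exponentials of unconstrained parameters, so that any real-valued draw maps to slopes satisfying 0 ≤ β1 ≤ ... ≤ βJ. The sketch below is illustrative only: the function names, the evenly spaced knots, and the random coefficients are assumptions for demonstration, not the dissertation's basis (2.2.2)–(2.2.4) itself.

```python
import numpy as np

def ordered_slopes(theta):
    """Map unconstrained reals to slopes with 0 <= b_1 <= ... <= b_J."""
    return np.cumsum(np.exp(theta))

def piecewise_linear(x, knots, beta, intercept=0.0):
    """Evaluate the piecewise-linear function with slope beta[j] on [knots[j], knots[j+1])."""
    # how far x has traveled through each knot interval, capped at the interval width
    incr = np.clip(x[:, None] - knots[None, :-1], 0.0, np.diff(knots)[None, :])
    return intercept + incr @ beta

rng = np.random.default_rng(1)
knots = np.linspace(-2.0, 2.0, 11)       # 10 knot intervals, as in the simulations
beta = ordered_slopes(rng.normal(size=10))
x = np.linspace(-2.0, 2.0, 200)
f = piecewise_linear(x, knots, beta)

assert np.all(np.diff(f) >= 0)           # monotonically increasing
assert np.all(np.diff(f, 2) >= -1e-12)   # convex: non-decreasing slopes
```

Since the map from theta to beta is smooth and unconstrained, standard MCMC or optimization updates can be performed on theta directly.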

Chapter 3: Heteroscedastic Asymmetric Spatial Processes

3.1 Introduction

In Chapter 2, we focused on semiparametric stochastic volatility models, non-Gaussian time series models that can capture the asymmetry, heavy tails and volatility clustering of financial returns. As discussed in Chapter 1, non-Gaussian features – heavy tails, asymmetry and spatial or spatiotemporal heteroscedasticity – are also important for modeling environmental and climatic data. In this chapter, we focus on the spatial domain; the extension to the space-time domain will be discussed in Chapter 4.

In particular, we propose a non-Gaussian spatial process that captures not only the heavy tails but also the skewness in the marginal distribution. Before we set out to define our model, we first discuss a general strategy for constructing non-Gaussian processes and use it to motivate our asymmetric and heteroscedastic non-Gaussian model.

3.2 A General Framework for Constructing Non-Gaussian Processes

Expressing a stochastic spatial process as a function of two or more latent stochastic processes is a common approach in spatial statistics. It can be argued that these types of models are analogous to the state space models in the time series literature. For example, to explain the so-called nugget effect in spatial data, one commonly models an observed spatial process {Y(s), s ∈ D ⊂ R^d} as the sum of two independent spatial processes, i.e.,

Y(s) = Z(s) + ε(s), s ∈ D ⊂ R^d,

where Z(s) is usually a stationary Gaussian process whose covariance function is smooth at spatial lag zero, and ε(s) represents independent and identically distributed measurement errors over the spatial domain of interest. For multivariate spatial processes, the commonly used linear model of co-regionalization (LMC) is another example of representing the observed spatial processes as functions of latent stochastic processes (see Goulard and Voltz, 1992). In the LMC, the observed multivariate spatial process is expressed as linear combinations of latent independent univariate spatial processes.

Although the previous examples all use Gaussian processes, the same idea can be used to construct a wide range of non-Gaussian processes. Consider the following more general framework for constructing a spatial process. For an observed univariate spatial process {Y(s), s ∈ D ⊂ R^d}, we assume that

Y(s) = f(U(s)), s ∈ D ⊂ R^d,    (3.2.1)

where f is a measurable function taking a p × 1 vector as its argument, f : R^p → R, and {U(s), s ∈ D ⊂ R^d} is a latent p-dimensional multivariate Gaussian spatial process. The measurability of f ensures that Y(s) at any fixed spatial location is a valid random variable; the most common sufficient condition is that f is continuous. In addition, we assume that U(s) has mean zero and is second-order stationary with a matrix-valued covariance function C(h).

Examples of spatial processes with a nonlinear transformation function f include the transformed Gaussian process of De Oliveira et al. (1997), the Gaussian-log-Gaussian (GLG) model of Palacios and Steel (2006), the stationary skew-Gaussian process of Zhang and El-Shaarawi (2010), the stochastic heteroscedastic process (SHP) of Huang et al. (2011), and the wavelet-based trend spatiotemporal long memory model of Craigmile and Guttorp (2011). The transformed Gaussian process of De Oliveira et al. (1997) is a trivial instance of the framework (3.2.1), in which U(s) is a univariate Gaussian process and the transformation f is the inverse of the Box-Cox transformation. If we let U(s) = (α(s), ε(s))^T, s ∈ D ⊂ R^d, and f(x, y) = e^{x/2} y, x, y ∈ R, where α(s) and ε(s) are two Gaussian processes with zero mean and unit variance, we essentially get the GLG model. To get the skew-Gaussian process of Zhang and El-Shaarawi (2010), we assume U(s) = (X1(s), X2(s))^T, s ∈ D ⊂ R^d, with transformation function f(x, y) = σ1|x| + σ2 y, x, y ∈ R. The two zero-mean, unit-variance processes X1(s) and X2(s) are assumed to be stationary and independent of each other, and the resulting univariate process is stationary as well.

In the framework (3.2.1), for the univariate process {Y(s), s ∈ D ⊂ R^d} to be well defined, we first need a well-defined multivariate process {U(s), s ∈ D ⊂ R^d}. Suppose U(s) = (U1(s), ..., Up(s))^T is a mean-zero process with isotropic covariance function C(h); then C(h) is the p × p matrix

C(h) = ( Cij(h) )_{i,j=1,...,p},  h = ||s − s'||,

where Cii(h) = E[Ui(s) Ui(s')] is the univariate covariance function of the process Ui(s), and Cij(h) = E[Ui(s) Uj(s')] is the cross-covariance function between the processes Ui(s) and Uj(s) for i ≠ j. For the process U(s) to be well defined, the covariance function needs to be such that, for any n observations (U(s1)^T, ..., U(sn)^T)^T of the process, the covariance matrix of the joint Gaussian distribution is non-negative definite. This is a notoriously hard condition to satisfy for multivariate processes (see, e.g., Gneiting et al., 2010).

The linear model of co-regionalization (LMC) is a commonly used methodology that ensures the non-negative definiteness of the covariance function by construction. Alternatively, if the covariance and cross-covariance functions are restricted to be of a certain functional form (such as the Matérn class), then we can find sufficient, and sometimes necessary and sufficient, conditions for the multivariate Gaussian spatial process to have a valid covariance function C(h). Let M(h|ν, a) denote the Matérn correlation function with smoothness parameter ν and scale parameter a; i.e.,

M(h|ν, a) = (2^(1−ν) / Γ(ν)) (a||h||)^ν K_ν(a||h||),    (3.2.2)

where K_ν is a modified Bessel function of the second kind. If ν = 1/2, M(h|ν, a) reduces to the exponential correlation function exp(−a||h||). When ν = 3/2, M(h|ν, a) reduces to (1 + a||h||) exp(−a||h||). In general, if ν = 1/2 + n for n = 0, 1, 2, ..., the Matérn function can be expressed as the product of an exponential function and a polynomial:

M(h | n + 1/2, a) = exp(−a||h||) Σ_{k=0}^{n} [(n + k)! / (2n)!] C(n, k) (2a||h||)^(n−k),

where C(n, k) denotes the binomial coefficient. Suppose

Cii(h) = σi^2 M(h|νi, ai)    (3.2.3)

for i = 1, ..., p, and

Cij(h) = Cji(h) = ρij σi σj M(h|νij, aij).    (3.2.4)

Then the following theorem, proved by Gneiting et al. (2010), provides a sufficient (but not necessary) condition for the multivariate process U(s) to have a valid second-order structure.
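The Matérn correlation (3.2.2) and its closed-form special cases are easy to check numerically; the sketch below (assuming SciPy's `kv` for the modified Bessel function of the second kind) verifies the ν = 1/2 and ν = 3/2 reductions:

```python
import numpy as np
from scipy.special import gamma, kv

def matern(h, nu, a):
    """Matern correlation M(h | nu, a) = 2^(1-nu)/Gamma(nu) * (a h)^nu * K_nu(a h)."""
    h = np.asarray(h, dtype=float)
    out = np.ones_like(h)                  # M -> 1 as h -> 0
    pos = h > 0
    x = a * h[pos]
    out[pos] = 2.0 ** (1.0 - nu) / gamma(nu) * x ** nu * kv(nu, x)
    return out

h = np.linspace(0.01, 3.0, 50)
# nu = 1/2: the exponential correlation exp(-a h)
assert np.allclose(matern(h, 0.5, 2.0), np.exp(-2.0 * h))
# nu = 3/2: (1 + a h) exp(-a h)
assert np.allclose(matern(h, 1.5, 2.0), (1.0 + 2.0 * h) * np.exp(-2.0 * h))
```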

Theorem 1 (Gneiting et al., 2010, Theorem 1) For d ≥ 1, p ≥ 2 and 1 ≤ i ≠ j ≤ p, suppose that

νij = (νi + νj)/2,

and that the scale parameters satisfy

a1 = ... = ap = aij = a.

Then U(s) has a valid covariance structure in R^d defined by (3.2.3) and (3.2.4) if the matrix (βij)_{i,j=1}^p, with diagonal elements βii = 1 for i = 1, ..., p and off-diagonal elements

βij = ρij [ (Γ(νi + d/2) / Γ(νi))^(1/2) (Γ(νj + d/2) / Γ(νj))^(1/2) Γ((νi + νj)/2) / Γ((νi + νj)/2 + d/2) ]^(−1),  1 ≤ i ≠ j ≤ p,

is symmetric and non-negative definite.

Gneiting et al. (2010) argue that the conditions in Theorem 1 are not necessarily as restrictive as they seem. If ν = 1/2, i.e., for the exponential covariance function, Ying (1991) showed that either a or σ^2 can be fixed arbitrarily and the composite quantity can still be estimated consistently and efficiently. More importantly, Zhang (2004) proved that in dimension d ≤ 3, the parameters σ^2 and a of a Matérn covariance function with fixed smoothness parameter ν cannot be consistently estimated under infill asymptotics, but that the composite quantity σ^2 a^(2ν) can be consistently estimated, and that this quantity is more important for spatial prediction, as explained in the following theorem.

Theorem 2 (Zhang, 2004, Theorem 2) Let Pi, i = 1, 2, be two probability measures such that under Pi, the process X(s), s ∈ R^d, is stationary Gaussian with mean 0 and an isotropic Matérn covariance function in R^d with variance σi^2, scale parameter ai and the same smoothness parameter ν, where d = 1, 2, 3. For any bounded infinite set D ⊂ R^d, the two probability measures P1 and P2 are equivalent on the paths of X(s), s ∈ D, if and only if σ1^2 a1^(2ν) = σ2^2 a2^(2ν).

When p = 2, the restriction on ρ12 in Theorem 1, expressed in terms of the entries of a positive definite matrix, can be rephrased as the inequality

|ρ12| ≤ (Γ(ν1 + d/2) / Γ(ν1))^(1/2) (Γ(ν2 + d/2) / Γ(ν2))^(1/2) Γ((ν1 + ν2)/2) / Γ((ν1 + ν2)/2 + d/2).

This is immediate, since for the 2 × 2 matrix with unit diagonal and off-diagonal entries β12 to be non-negative definite, we need |β12| ≤ 1. As it turns out, for p = 2 this condition on ρ12 is not only sufficient but also necessary, as stated in the following theorem.

Theorem 3 (Gneiting et al., 2010, Theorem 3) When p = 2, the full bivariate Matérn model described by (3.2.2) and (3.2.4) is valid if and only if

|ρ12| ≤ (Γ(ν1 + d/2) / Γ(ν1))^(1/2) (Γ(ν2 + d/2) / Γ(ν2))^(1/2) [Γ((ν1 + ν2)/2) / Γ((ν1 + ν2)/2 + d/2)] × inf_{t ≥ 0} [ a1^(2ν1) a2^(2ν2) (a12^2 + t^2)^(2ν12 + d) ] / [ a12^(4ν12) (a1^2 + t^2)^(ν1 + d/2) (a2^2 + t^2)^(ν2 + d/2) ].

The above inequality implies the following important cases.

i) If ν12 < (ν1 + ν2)/2, the full bivariate Matérn model is valid if and only if ρ12 = 0. In other words, ν12 ≥ (ν1 + ν2)/2 is a necessary condition for a bivariate Matérn model to capture any cross-correlation in the multivariate spatial data.

ii) If ν12 = (ν1 + ν2)/2 and a1 = a2 = a12 = a, the full bivariate Matérn model is valid if and only if

|ρ12| ≤ (Γ(ν1 + d/2) / Γ(ν1))^(1/2) (Γ(ν2 + d/2) / Γ(ν2))^(1/2) Γ((ν1 + ν2)/2) / Γ((ν1 + ν2)/2 + d/2).

As a special case, if d = 2 (e.g., if we focus only on the two-dimensional spatial domain), then the condition on ρ12 simplifies to

|ρ12| ≤ (ν1 ν2)^(1/2) / [(ν1 + ν2)/2].

iii) (Gneiting et al., 2010, Theorem 4) When ν12 ≥ (ν1 + ν2)/2 and a12^2 ≥ (a1^2 + a2^2)/2, a sufficient but not necessary condition for the full bivariate Matérn model to be valid is given by

|ρ12| ≤ [ a1^(ν1) a2^(ν2) Γ(ν12) / ( a12^(2ν12) Γ(ν1)^(1/2) Γ(ν2)^(1/2) ) ] [ e (a12^2 − (a1^2 + a2^2)/2) / (ν12 − (ν1 + ν2)/2) ]^(ν12 − (ν1 + ν2)/2).
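The d = 2 simplification in case ii) follows from the identity Γ(ν + 1) = ν Γ(ν); the small numerical sketch below confirms it (`rho_bound` is a hypothetical helper name, standard-library `math` only):

```python
from math import gamma, sqrt

def rho_bound(nu1, nu2, d):
    """Upper bound on |rho12| in case ii) of Theorem 3."""
    return (sqrt(gamma(nu1 + d / 2) / gamma(nu1))
            * sqrt(gamma(nu2 + d / 2) / gamma(nu2))
            * gamma((nu1 + nu2) / 2) / gamma((nu1 + nu2) / 2 + d / 2))

# For d = 2 the gamma-function bound collapses to sqrt(nu1*nu2) / ((nu1 + nu2)/2)
for nu1, nu2 in [(0.5, 1.5), (1.0, 2.5), (0.5, 0.5)]:
    simplified = sqrt(nu1 * nu2) / ((nu1 + nu2) / 2)
    assert abs(rho_bound(nu1, nu2, 2) - simplified) < 1e-12
```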

Apanasovich et al. (2012) proved a more general result than Theorem 1, where the smoothness parameter in the cross-covariance function of two processes can be greater than or equal to the average of the smoothness parameters in the covariance functions of the corresponding two marginal processes, as opposed to the equality condition in Theorem 1.

Theorem 4 (Apanasovich et al., 2012, Theorem 1) The Matérn model (3.2.2) and (3.2.4) provides a valid structure if there exists ∆A ≥ 0 such that:

i) νij − (νi + νj)/2 = ∆A (1 − Aij), i, j = 1, ..., p, where the 0 ≤ Aij ≤ 1 form a valid correlation matrix.

ii) The collection −aij^2, i, j = 1, ..., p, forms a conditionally non-negative definite matrix. In other words, if M denotes the matrix with entries −aij^2, then x^T M x ≥ 0 for all x ∈ R^p such that Σ_{i=1}^p xi = 0.

iii) The collection

ρij σi σj aij^(2∆A + νi + νj) Γ(νij + d/2) / [ Γ((νi + νj)/2 + d/2) Γ(νij) ],  i, j = 1, ..., p,

forms a non-negative definite matrix.

It is easy to see that condition (i) in Theorem 4 implies νij ≥ (νi + νj)/2, since we must have ∆A(1 − Aij) ≥ 0. Examples of collections {aij}_{i,j=1}^p that satisfy condition (ii) include aij^2 = (ai^2 + aj^2)/2 + τ(ai − aj)^2 with 0 ≤ τ ≤ ∞, and aij = max{ai, aj} (see Apanasovich et al., 2012). In other words, Theorem 4 relaxes the restrictions on the smoothness and scale parameters in Theorem 1. However, as noted in Apanasovich et al. (2012), the constraints on the co-locational correlation coefficients ρij, i, j = 1, ..., p, still seem unavoidable. These constraints, i.e., condition (iii) in Theorem 4, can be reformulated as upper bounds on ρij that depend on how much the smoothness and scale parameters deviate from the corresponding parameters of the two marginal processes:

ρij^2 ≤ τij^(1) τij^(2) τij^(3) ≤ 1,  i, j = 1, ..., p,

where

τij^(1) = B(νij, d/2) / B((νi + νj)/2, d/2),
τij^(2) = (ai aj / aij^2)^(2∆A),
τij^(3) = [Γ((νi + νj)/2)^2 / (Γ(νi) Γ(νj))] [ai^(2νi) aj^(2νj) / aij^(2(νi + νj))],

and B(·, ·) is the Beta function; see Apanasovich et al. (2012) for the proof. One implication of Theorems 1, 3 and 4 is that we need to place more stringent restrictions on the smoothness and scale parameters in the Matérn correlation functions in order to relax the constraints on the co-locational correlation parameters ρij, 1 ≤ i ≠ j ≤ p. In particular, in Theorem 3, if all the smoothness parameters are equal and so are the scale parameters, then the restriction on ρ12 is no longer needed.

For the remainder of the chapter, we assume that the latent multivariate process {U(s), s ∈ D ⊂ R^d} is always well defined (i.e., that it has a valid covariance function). As long as U(s) also satisfies the natural restrictions of the transformation function f, the univariate process {Y(s), s ∈ D ⊂ R^d} is well defined too. The following result discusses when the constructed univariate process Y(s) can be non-Gaussian.

Result 1 Suppose that U(s) is well defined with a valid stationary and isotropic covariance function C(h). Then (3.2.1) defines a univariate non-Gaussian process only if the transformation function f is nonlinear. However, the converse is not true.

This result is trivial; it is presented only as a reminder that a nonlinear transformation of Gaussian processes can still be a Gaussian process. Here is a simple example. Consider the multivariate Gaussian process {U(s) = (U1(s), U2(s), U3(s))^T, s ∈ D ⊂ R^d}. Suppose that the three univariate processes all have mean zero and variance σ^2, and that they are independent of each other, i.e., Cov(Ui(s), Uj(s')) = 0 for i ≠ j and for all s, s' ∈ D. Define the transformation function f as

f(x1, x2, x3) = x1 g1(x3) + x2 g2(x3),  x1, x2, x3 ∈ R,

where g1 and g2 satisfy g1(x)^2 + g2(x)^2 = 1 for all x ∈ R; for example, g1(x) = sin(x) and g2(x) = cos(x), or g1(x) = 1/√(1 + x^2) and g2(x) = x/√(1 + x^2). The resulting univariate process

Y(s) = f(U(s)) = U1(s) g1(U3(s)) + U2(s) g2(U3(s)),  s ∈ D ⊂ R^d,

is a Gaussian process with mean zero and variance σ^2, since Y(s) | U3(s) is a Gaussian random variable whose distribution does not depend on U3(s).
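A quick Monte Carlo sketch of this example (taking g1 = sin and g2 = cos, with an arbitrary σ and sample size) illustrates that the first, second, and fourth moments of Y match those of N(0, σ^2):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 1_000_000

# Three independent mean-zero Gaussian components at a fixed location s
u1, u2, u3 = rng.normal(0.0, sigma, size=(3, n))

# Y = U1*g1(U3) + U2*g2(U3) with g1^2 + g2^2 = 1 (here g1 = sin, g2 = cos)
y = u1 * np.sin(u3) + u2 * np.cos(u3)

# Y | U3 ~ N(0, sigma^2) regardless of U3, so Y is exactly N(0, sigma^2):
# sample mean near 0, sample variance near sigma^2 = 4, excess kurtosis near 0
excess_kurt = np.mean(((y - y.mean()) / y.std()) ** 4) - 3.0
print(y.mean(), y.var(), excess_kurt)
```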

Smoothness of a spatial process is also of special interest (see, e.g., Stein, 1999; Banerjee and Gelfand, 2003). One definition of the smoothness of a stochastic process is mean square continuity (or L2 continuity): a spatial process {Y(s), s ∈ D ⊂ R^d} is mean square continuous at s0 if

lim_{s → s0} E[(Y(s) − Y(s0))^2] = 0.    (3.2.5)

The following result from Banerjee and Gelfand (2003) establishes the smoothness properties of the process Y(s), s ∈ D ⊂ R^d, defined in (3.2.1).

Proposition 1 (Banerjee and Gelfand, 2003, Proposition 4) If the transformation function f : R^p → R is Lipschitz of order 1 and the multivariate process U(s) is mean square continuous, then the process Y(s) = f(U(s)) is also mean square continuous.

The proof of this proposition (see Banerjee and Gelfand, 2003) follows instantly from the definition of the Lipschitz condition of order 1, i.e.,

|f(U(s + ∆s)) − f(U(s))| ≤ M ||U(s + ∆s) − U(s)||,

where M is a constant in s and ∆s, and || · || denotes the Euclidean norm.

Palacios and Steel (2006) discussed a special case in which U(s) consists of two independent stationary processes (U1(s), U2(s))^T, with U1(s) ∈ R and U2(s) > 0, and the transformation function f(x, y) = x/√y, x ∈ R, y > 0. They showed that as long as both U1(s) and U2(s) are mean square continuous, the process Y(s) = f(U1(s), U2(s)) is mean square continuous, too.

3.3 Heteroscedastic Asymmetric Spatial Process (HASP)

3.3.1 Model Specification and Properties

Now that we have introduced the general framework (3.2.1) for constructing a non-Gaussian process, we are ready to return to the topic of this chapter – the heteroscedastic and asymmetric non-Gaussian process. We have shown that assuming U(s) = (α(s), ε(s))^T with the transformation function f(x, y) = e^{x/2} y, where α(s) and ε(s) are two independent Gaussian processes with zero mean and unit variance, yields the GLG model. However, the drawback of the GLG model (as well as the SHP model on the space-time domain and the non-Gaussian model used in Craigmile and Guttorp, 2011) is that by assuming the component processes of the latent multivariate process U(s) = (α(s), ε(s))^T to be mutually independent, the model is not flexible enough to capture the skewness often observed in spatial data. On the other hand, existing models that do capture skewness, such as that of Zhang and El-Shaarawi (2010), do not allow for spatial heteroscedasticity as the GLG model does.

Using (3.2.1), we propose a heteroscedastic asymmetric spatial process (HASP) that captures both the heavy tails and the skewness in the observed spatial data simultaneously. We focus only on the spatial domain here; generalizations to the space-time domain will be discussed in Chapter 4. Consider a latent mean-zero multivariate Gaussian process (α(s), ε(s))^T whose two univariate components, α(s) and ε(s), are not independent of each other but rather have a cross-covariance function γc(h). Using the same transformation function f(x, y) = e^{x/2} y as in (3.2.1), we instantly obtain our heteroscedastic asymmetric spatial process (HASP), which is analogous to the stochastic volatility model with leverage effect in the time series literature. To allow for long-range dependence or non-stationarity in the process, we write the model as

Y(s) = X^T(s) β + exp( (H^T(s) δ + α(s)) / 2 ) ε(s),    (3.3.1)

where X^T(s) and H^T(s) are two deterministic row vectors of spatial covariates. In general, both X^T(s) and H^T(s) should at least contain a column of ones and, if necessary, columns of other spatial covariates. Thus, in the simplest case, X^T(s) = H^T(s) = 1.
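A minimal simulation sketch of this construction on a one-dimensional transect: with a single Matérn correlation shared by all three correlation functions (the equal-parameter case of Theorem 1, valid for any co-locational correlation smaller than one in absolute value), a positive co-locational correlation produces clear right skewness in Z(s) = exp(α(s)/2) ε(s). All numerical settings below (ν = 3/2, a = 5, τ = 1, % = 0.8, 50 sites, 4,000 replicates) are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(7)
n, tau, rho_cl = 50, 1.0, 0.8            # rho_cl: co-locational correlation
s = np.linspace(0.0, 1.0, n)
h = np.abs(s[:, None] - s[None, :])
R = (1.0 + 5.0 * h) * np.exp(-5.0 * h)   # Matern correlation, nu = 3/2, a = 5

# Shared-correlation bivariate covariance of the stacked vector (alpha, eps):
# kron([[tau^2, tau*rho], [tau*rho, 1]], R) is non-negative definite for |rho| < 1
C = np.kron(np.array([[tau**2, tau * rho_cl], [tau * rho_cl, 1.0]]), R)
L = np.linalg.cholesky(C + 1e-10 * np.eye(2 * n))   # small jitter for stability

draws = L @ rng.standard_normal((2 * n, 4000))
alpha, eps = draws[:n], draws[n:]
z = np.exp(alpha / 2.0) * eps            # the HASP stochastic component Z(s)

assert skew(z.ravel()) > 1.0             # rho_cl > 0 induces strong right skewness
```

Repeating the experiment with a negative co-locational correlation flips the sign of the sample skewness, in line with Result 3 below.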

To see that the HASP model (3.3.1) can capture the skewness in the data, we examine the properties of Z(s) = exp(α(s)/2) ε(s), s ∈ D ⊂ R^d, the stochastic component of the process Y(s). For two spatial locations s, s' ∈ D ⊂ R^d, define the spatial distance h = ||s − s'||. The matrix-valued stationary and isotropic covariance function of the latent multivariate process (α(s), ε(s))^T is thus

C(h) = [ τ^2 ρα(h)   τ% ρc(h) ]
       [ τ% ρc(h)    ρε(h)    ],    (3.3.2)

where ρα(h), ρc(h) and ρε(h) are three isotropic correlation functions that, together with the co-locational correlation coefficient %, satisfy appropriate conditions, such as those in Theorem 1, to ensure that the joint process (α(s), ε(s))^T is well defined. In addition, we always assume that ρα(h), ρc(h) and ρε(h) are non-negative functions; i.e., we do not consider the so-called hole effect, which allows for the negative correlations occasionally observed in empirical data (see, e.g., Cressie, 1993). Given the conditions above, the spatial process {Z(s), s ∈ D ⊂ R^d} is also well defined and has the following properties.

Result 2 Z(s) has mean (1/2) τ% e^{τ^2/8} and variance (1 + τ^2 %^2) e^{τ^2/2} − (1/4) τ^2 %^2 e^{τ^2/4}.

Result 3 A non-zero correlation % induces skewness in Z(s). When % > 0, Z(s) is positively (right) skewed, and when % < 0, Z(s) is negatively (left) skewed.

Result 4 When % = 0, the process Z(s) has heavier tails than a Gaussian process. When % ≠ 0, the excess kurtosis is even larger than when % = 0.

Results 2 and 3 can be proven easily with the well-known Stein's identity, first proven in the seminal paper of Stein (1981).

Lemma 1 (Stein’s Identity) (Stein, 1981): If X is a normal random variable with

mean µ and variance σ2 and g is a real-valued function with a Lebesgue measurable

derivative function g0, then E[(X − µ)g(X)] = σ2 E[g0(X)].

Using Stein's identity, we can easily derive the following result.

Lemma 2 For a normal random variable $X$ with mean $\mu$ and variance $\sigma^{2}$, we have
\begin{align*}
\mathrm{E}[X^{n}e^{aX}] &= \mathrm{E}[(X - \mu)X^{n-1}e^{aX}] + \mu\,\mathrm{E}[X^{n-1}e^{aX}] \\
&= \sigma^{2}\,\mathrm{E}[aX^{n-1}e^{aX} + (n-1)X^{n-2}e^{aX}] + \mu\,\mathrm{E}[X^{n-1}e^{aX}] \\
&= (\mu + a\sigma^{2})\,\mathrm{E}[X^{n-1}e^{aX}] + (n-1)\sigma^{2}\,\mathrm{E}[X^{n-2}e^{aX}].
\end{align*}
In particular, for $n = 1, 2, 3$ and $4$, we have

\begin{align}
\mathrm{E}[Xe^{aX}] &= (\mu + a\sigma^{2})\,\mathrm{E}[e^{aX}]; \tag{3.3.3} \\
\mathrm{E}[X^{2}e^{aX}] &= (\mu + a\sigma^{2})\,\mathrm{E}[Xe^{aX}] + \sigma^{2}\,\mathrm{E}[e^{aX}] \notag \\
&= [(\mu + a\sigma^{2})^{2} + \sigma^{2}]\,\mathrm{E}[e^{aX}]; \tag{3.3.4} \\
\mathrm{E}[X^{3}e^{aX}] &= (\mu + a\sigma^{2})\,\mathrm{E}[X^{2}e^{aX}] + 2\sigma^{2}\,\mathrm{E}[Xe^{aX}] \notag \\
&= [(\mu + a\sigma^{2})^{3} + 3\sigma^{2}(\mu + a\sigma^{2})]\,\mathrm{E}[e^{aX}]; \tag{3.3.5} \\
\mathrm{E}[X^{4}e^{aX}] &= (\mu + a\sigma^{2})\,\mathrm{E}[X^{3}e^{aX}] + 3\sigma^{2}\,\mathrm{E}[X^{2}e^{aX}] \notag \\
&= [(\mu + a\sigma^{2})^{4} + 6\sigma^{2}(\mu + a\sigma^{2})^{2} + 3\sigma^{4}]\,\mathrm{E}[e^{aX}]. \tag{3.3.6}
\end{align}
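The closed forms (3.3.3)–(3.3.6) are easy to check numerically. The sketch below (all parameter values are illustrative, not taken from the text) compares each closed form against direct trapezoidal quadrature of $x^{n}e^{ax}$ under the normal density:

```python
import numpy as np

def moment_quad(n, a, mu, sigma):
    """E[X^n e^{aX}] for X ~ N(mu, sigma^2), by trapezoidal quadrature."""
    x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200001)
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    f = x ** n * np.exp(a * x) * dens
    return float(np.sum((f[1:] + f[:-1]) / 2 * np.diff(x)))

def moment_closed(n, a, mu, sigma):
    """Closed forms (3.3.3)-(3.3.6); E[e^{aX}] is the normal MGF at a."""
    m0 = np.exp(a * mu + 0.5 * a ** 2 * sigma ** 2)
    b = mu + a * sigma ** 2  # the recurring quantity mu + a*sigma^2
    poly = {1: b,
            2: b ** 2 + sigma ** 2,
            3: b ** 3 + 3 * sigma ** 2 * b,
            4: b ** 4 + 6 * sigma ** 2 * b ** 2 + 3 * sigma ** 4}[n]
    return poly * m0

for n in (1, 2, 3, 4):
    q, c = moment_quad(n, 0.5, 0.3, 1.2), moment_closed(n, 0.5, 0.3, 1.2)
    assert abs(q - c) < 1e-5 * max(1.0, abs(c))
```

The same recursion can of course be continued to any order $n$, which is what makes Lemma 2 useful for the higher-moment calculations below.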

Proof of Result 2 Using (3.3.3) and (3.3.4) as well as the facts that
\[
\varepsilon(s)\,|\,\alpha(s) \sim N(\alpha(s)\varrho/\tau,\; 1 - \varrho^{2}), \quad\text{and}\quad
\alpha(s)\,|\,\varepsilon(s) \sim N(\tau\varrho\,\varepsilon(s),\; \tau^{2}(1 - \varrho^{2})),
\]
we can prove the aforementioned properties of the process $Z(s) = e^{\alpha(s)/2}\varepsilon(s)$:
\[
\mathrm{E}[Z(s)] = \mathrm{E}[\mathrm{E}[e^{\alpha(s)/2}\varepsilon(s)\,|\,\alpha(s)]]
= \frac{\varrho}{\tau}\,\mathrm{E}[\alpha(s)e^{\alpha(s)/2}]
= \frac{1}{2}\tau\varrho\,\mathrm{E}[e^{\alpha(s)/2}]
= \frac{1}{2}\tau\varrho\, e^{\tau^{2}/8};
\]
\begin{align*}
\mathrm{Var}(Z(s)) &= \mathrm{E}[Z(s)^{2}] - \mathrm{E}[Z(s)]^{2} \\
&= \mathrm{E}[\mathrm{E}[\varepsilon^{2}(s)e^{\alpha(s)}\,|\,\alpha(s)]] - \left(\frac{1}{2}\tau\varrho\,\mathrm{E}[e^{\alpha(s)/2}]\right)^{2} \\
&= \mathrm{E}\left[\left(1 - \varrho^{2} + \frac{\varrho^{2}}{\tau^{2}}\alpha^{2}(s)\right)e^{\alpha(s)}\right] - \frac{1}{4}\tau^{2}\varrho^{2}\,\mathrm{E}[e^{\alpha(s)/2}]^{2} \\
&= (1 + \tau^{2}\varrho^{2})\,\mathrm{E}[e^{\alpha(s)}] - \frac{1}{4}\tau^{2}\varrho^{2}\,\mathrm{E}[e^{\alpha(s)/2}]^{2} \\
&= (1 + \tau^{2}\varrho^{2})e^{\tau^{2}/2} - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4}.
\end{align*}
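As a sanity check on Results 2 and 3, the co-located pair $(\alpha(s), \varepsilon(s))$ can be simulated directly through the linearization $\alpha = \tau\varrho\,\varepsilon + \tau\sqrt{1-\varrho^{2}}\,\eta$ used later in the proofs, and Monte Carlo moments of $Z = e^{\alpha/2}\varepsilon$ compared with the closed forms. This is a sketch; the seed and the values of $\tau$ and $\varrho$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, rho, n = 0.8, 0.6, 2_000_000
eps = rng.standard_normal(n)
eta = rng.standard_normal(n)
alpha = tau * rho * eps + tau * np.sqrt(1 - rho ** 2) * eta  # linearization of alpha(s)
Z = np.exp(alpha / 2) * eps

# Closed forms from Result 2
mean_theory = 0.5 * tau * rho * np.exp(tau ** 2 / 8)
var_theory = (1 + tau ** 2 * rho ** 2) * np.exp(tau ** 2 / 2) \
    - 0.25 * tau ** 2 * rho ** 2 * np.exp(tau ** 2 / 4)

assert abs(Z.mean() - mean_theory) < 0.01
assert abs(Z.var() - var_theory) < 0.05
assert ((Z - Z.mean()) ** 3).mean() > 0  # rho > 0 implies right skew (Result 3)
```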

Proof of Result 3 We can still use the conditional expectation approach as above to find the third moment of $Z(s)$, but a much simpler approach is to use the properties of a bivariate normal random variable and express $\alpha(s)$ as
\[
\alpha(s) = \tau\varrho\,\varepsilon(s) + \tau\sqrt{1 - \varrho^{2}}\,\eta(s),
\]
where $s$ is a fixed spatial location, and $\eta(s)$ is a normal random variable with zero mean and unit variance that is independent of $\varepsilon(s)$. Note that the original bivariate process $\{[\alpha(s), \varepsilon(s)]^{T}, s \in D \subset \mathbb{R}^{d}\}$ is not equivalent to the bivariate process $\{[\tau\varrho\,\varepsilon(s) + \tau\sqrt{1-\varrho^{2}}\,\eta(s), \varepsilon(s)]^{T}, s \in D \subset \mathbb{R}^{d}\}$, since the cross-covariance functions of these two processes are not necessarily the same. However, for a single fixed spatial location $s$, the two bivariate normal random vectors $[\alpha(s), \varepsilon(s)]^{T}$ and $[\tau\varrho\,\varepsilon(s) + \tau\sqrt{1-\varrho^{2}}\,\eta(s), \varepsilon(s)]^{T}$ are indeed equivalent. Hence random variables based on a measurable transformation of these two bivariate random vectors have the same distribution and thus the same moments. This approach is used repeatedly in the remainder of this chapter. Thus, we have
\begin{align*}
\mathrm{E}[Z^{3}(s)] &= \mathrm{E}[e^{3\tau\varrho\,\varepsilon(s)/2}\varepsilon^{3}(s)]\,\mathrm{E}\left[e^{3\tau\sqrt{1-\varrho^{2}}\,\eta(s)/2}\right] \\
&= \left(\frac{27}{8}\tau^{3}\varrho^{3} + \frac{9}{2}\tau\varrho\right)\mathrm{E}\left[e^{3(\tau\varrho\,\varepsilon(s) + \tau\sqrt{1-\varrho^{2}}\,\eta(s))/2}\right] \\
&= \left(\frac{27}{8}\tau^{3}\varrho^{3} + \frac{9}{2}\tau\varrho\right)\mathrm{E}\left[e^{3\alpha(s)/2}\right]
= \left(\frac{27}{8}\tau^{3}\varrho^{3} + \frac{9}{2}\tau\varrho\right)e^{9\tau^{2}/8}.
\end{align*}

So the skewness of $Z(s)$ is given by
\begin{align*}
\frac{\mathrm{E}[(Z(s) - \mathrm{E}[Z(s)])^{3}]}{[\mathrm{Var}(Z(s))]^{3/2}}
&= \frac{\mathrm{E}[Z(s)^{3}] - 3\,\mathrm{E}[Z(s)]\,\mathrm{E}[Z(s)^{2}] + 2\,\mathrm{E}[Z(s)]^{3}}{[\mathrm{Var}(Z(s))]^{3/2}} \\
&= \frac{\left(\frac{27}{8}\tau^{3}\varrho^{3} + \frac{9}{2}\tau\varrho\right)\mathrm{E}[e^{3\alpha(s)/2}] - \frac{3}{2}\tau\varrho(1 + \tau^{2}\varrho^{2})\,\mathrm{E}[e^{\alpha(s)/2}]\,\mathrm{E}[e^{\alpha(s)}] + \frac{1}{4}\tau^{3}\varrho^{3}\,\mathrm{E}[e^{\alpha(s)/2}]^{3}}{[\mathrm{Var}(Z(s))]^{3/2}} \\
&= \tau\varrho\,\frac{A}{[\mathrm{Var}(Z(s))]^{3/2}},
\end{align*}
where $A = \left(\frac{27}{8}\tau^{2}\varrho^{2} + \frac{9}{2}\right)\mathrm{E}[e^{3\alpha(s)/2}] - \frac{3}{2}(1 + \tau^{2}\varrho^{2})\,\mathrm{E}[e^{\alpha(s)/2}]\,\mathrm{E}[e^{\alpha(s)}] + \frac{1}{4}\tau^{2}\varrho^{2}\,\mathrm{E}[e^{\alpha(s)/2}]^{3}$. Using the fact that $\mathrm{E}[XY] > \mathrm{E}[X]\,\mathrm{E}[Y]$ for positively correlated random variables $X$ and $Y$, we can see that the numerator $A$ satisfies
\[
A > \left\{\left(\frac{27}{8}\tau^{2}\varrho^{2} + \frac{9}{2}\right) - \frac{3}{2}(1 + \tau^{2}\varrho^{2})\right\}\mathrm{E}[e^{\alpha(s)/2}]\,\mathrm{E}[e^{\alpha(s)}] > 0.
\]
Therefore, the skewness of $Z(s)$ always takes the same sign as $\varrho$, the co-locational correlation coefficient between $\alpha(s)$ and $\varepsilon(s)$.

Proof of Result 4 Consider a fixed spatial location $s$. As in the proof of Result 3, we write
\[
\alpha(s) = \tau\varrho\,\varepsilon(s) + \tau\sqrt{1 - \varrho^{2}}\,\eta(s),
\]
where $\eta(s)$ is a normal random variable with zero mean and unit variance that is independent of $\varepsilon(s)$. The fourth non-central moment of $Z(s)$ is thus given by
\begin{align*}
\mathrm{E}[Z^{4}(s)] &= \mathrm{E}\left[e^{2\tau\varrho\,\varepsilon(s) + 2\tau\sqrt{1-\varrho^{2}}\,\eta(s)}\varepsilon(s)^{4}\right]
= \mathrm{E}\left[e^{2\tau\varrho\,\varepsilon(s)}\varepsilon(s)^{4}\right]\mathrm{E}\left[e^{2\tau\sqrt{1-\varrho^{2}}\,\eta(s)}\right] \\
&= (16\tau^{4}\varrho^{4} + 24\tau^{2}\varrho^{2} + 3)\,\mathrm{E}\left[e^{2\tau\varrho\,\varepsilon(s) + 2\tau\sqrt{1-\varrho^{2}}\,\eta(s)}\right] \\
&= (16\tau^{4}\varrho^{4} + 24\tau^{2}\varrho^{2} + 3)\,\mathrm{E}\left[e^{2\alpha(s)}\right]
= (16\tau^{4}\varrho^{4} + 24\tau^{2}\varrho^{2} + 3)e^{2\tau^{2}}.
\end{align*}

To prove the excess kurtosis of $Z(s)$, we only need to show that
\[
\mathrm{E}[(Z(s) - \mathrm{E}[Z(s)])^{4}] - 3\,\mathrm{Var}(Z(s))^{2} > 0.
\]
The calculation is tedious but straightforward. Let $m_{i}$ denote the $i$-th non-central moment of $Z(s)$; then
\begin{align*}
&\mathrm{E}[(Z(s) - \mathrm{E}[Z(s)])^{4}] - 3\,\mathrm{Var}(Z(s))^{2} \\
&= m_{4} - 4m_{3}m_{1} + 6m_{2}m_{1}^{2} - 3m_{1}^{4} - 3m_{2}^{2} + 6m_{2}m_{1}^{2} - 3m_{1}^{4} \\
&= (16\tau^{4}\varrho^{4} + 24\tau^{2}\varrho^{2} + 3)e^{2\tau^{2}}
- \left(\tfrac{27}{4}\tau^{4}\varrho^{4} + 9\tau^{2}\varrho^{2}\right)e^{5\tau^{2}/4}
- 3(1 + \tau^{2}\varrho^{2})^{2}e^{\tau^{2}} \\
&\qquad + (3\tau^{4}\varrho^{4} + 3\tau^{2}\varrho^{2})e^{3\tau^{2}/4}
- \tfrac{3}{8}\tau^{4}\varrho^{4}e^{\tau^{2}/2} \\
&\geq (16\tau^{4}\varrho^{4} + 24\tau^{2}\varrho^{2} + 3)e^{2\tau^{2}}
- \left(\tfrac{39}{4}\tau^{4}\varrho^{4} + 15\tau^{2}\varrho^{2} + 3\right)e^{2\tau^{2}}
+ \left(\tfrac{21}{8}\tau^{4}\varrho^{4} + 3\tau^{2}\varrho^{2}\right)e^{\tau^{2}/2} \\
&= \left(\tfrac{25}{4}\tau^{4}\varrho^{4} + 9\tau^{2}\varrho^{2}\right)e^{2\tau^{2}}
+ \left(\tfrac{21}{8}\tau^{4}\varrho^{4} + 3\tau^{2}\varrho^{2}\right)e^{\tau^{2}/2}
\geq 0,
\end{align*}
where the inequality bounds the two negative terms from above (using $e^{5\tau^{2}/4} \leq e^{2\tau^{2}}$ and $e^{\tau^{2}} \leq e^{2\tau^{2}}$) and the positive term involving $e^{3\tau^{2}/4}$ from below by $e^{\tau^{2}/2}$. When $\varrho = 0$, the difference reduces to $3(e^{2\tau^{2}} - e^{\tau^{2}}) > 0$ for $\tau > 0$, so the process has excess kurtosis even without skewness, and it is easy to see that the excess kurtosis is even larger when $\varrho \neq 0$.

Result 2 indicates that when the co-locational correlation coefficient $\varrho \neq 0$, the mean of the $Z(s)$ process is non-zero. This might seem undesirable for modeling purposes, but since we are trying to capture the skewness in the resulting process, a non-zero mean that takes the same sign as the skewness makes intuitive sense. In addition, it is easy to check that the variance of the process is always positive. When $\varrho = 0$, i.e., when there is no skewness in the process, the mean and skewness reduce to $0$ and the variance reduces to $e^{\tau^{2}/2}$.

Figure 3.1: The plots (i)–(iv) show the mean, variance, skewness and excess kurtosis, respectively, of the process $Z(s)$ as a function of $\tau$ and $\varrho$. Plot (v) shows the simulated density function for the marginal distribution of $Z(s)$ for $\tau = 0.5$. The dotted line in each plot denotes the case where $\varrho = 0$, the dot-dash lines $\varrho = \pm 0.3$, the dashed lines $\varrho = \pm 0.6$ and the dark solid lines $\varrho = 0.9$. The grey solid line in plot (v) denotes the density function of a standard normal distribution.

Figure 3.1 shows the functional relationships between the mean, variance, skewness and excess kurtosis of the process $Z(s)$ and the parameters $\tau$ and $\varrho$, as well as the simulated density of the marginal distribution of $Z(s)$. Results 3 and 4 can be easily verified from the graphs. The plots also clearly show that for a fixed co-locational correlation coefficient $\varrho$, the magnitudes of the mean, variance, skewness and kurtosis of the process $Z(s)$ increase rapidly as $\tau$ increases; for fixed $\tau$, the magnitudes of the four moments also increase as the absolute value of $\varrho$ increases.

3.3.2 The Covariance Properties of the HASP Model

This section presents the covariance function of the univariate spatial process

Y (s). We need the following lemma for our derivations.

Lemma 3 For a mean zero and second-order stationary Gaussian spatial process $\varepsilon(s)$ with correlation function $\rho_{\varepsilon}(h)$, we have, for $\|s - s'\| = h$,
\[
\mathrm{E}[\exp\{a(\varepsilon(s) + \varepsilon(s'))\}\varepsilon(s)\varepsilon(s')]
= [\rho_{\varepsilon}(h) + a^{2}(1 + \rho_{\varepsilon}(h))^{2}]\exp\{a^{2}(1 + \rho_{\varepsilon}(h))\}. \tag{3.3.7}
\]

Proof We prove Lemma 3 using the conditional expectation approach, but note that the proof can be made simpler by applying the same linearization technique as in the proofs of Results 3 and 4.
\begin{align*}
&\mathrm{E}[\exp\{a(\varepsilon(s) + \varepsilon(s'))\}\varepsilon(s)\varepsilon(s')] \\
&= \mathrm{E}[\mathrm{E}[\exp\{a(\varepsilon(s) + \varepsilon(s'))\}\varepsilon(s)\varepsilon(s')\,|\,\varepsilon(s)]] \\
&= \mathrm{E}[\varepsilon(s)\exp\{a\varepsilon(s)\}\,\mathrm{E}[\exp\{a\varepsilon(s')\}\varepsilon(s')\,|\,\varepsilon(s)]] \\
&= \mathrm{E}\left[\varepsilon(s)\exp\{a\varepsilon(s)\}\left(\rho_{\varepsilon}(h)\varepsilon(s) + a(1 - \rho_{\varepsilon}^{2}(h))\right)\exp\left\{a\rho_{\varepsilon}(h)\varepsilon(s) + \frac{a^{2}}{2}(1 - \rho_{\varepsilon}^{2}(h))\right\}\right] \\
&= \exp\left\{\frac{a^{2}}{2}(1 - \rho_{\varepsilon}^{2}(h))\right\}\mathrm{E}\left[\left\{\rho_{\varepsilon}(h)\varepsilon^{2}(s) + a(1 - \rho_{\varepsilon}^{2}(h))\varepsilon(s)\right\}\exp\{a(1 + \rho_{\varepsilon}(h))\varepsilon(s)\}\right].
\end{align*}
Now applying (3.3.3) and (3.3.4), we have
\begin{align*}
&\mathrm{E}[\exp\{a(\varepsilon(s) + \varepsilon(s'))\}\varepsilon(s)\varepsilon(s')] \\
&= \exp\left\{\frac{a^{2}}{2}(1 - \rho_{\varepsilon}^{2}(h))\right\}
[\rho_{\varepsilon}(h) + \rho_{\varepsilon}(h)a^{2}(1 + \rho_{\varepsilon}(h))^{2} + a^{2}(1 + \rho_{\varepsilon}(h))(1 - \rho_{\varepsilon}^{2}(h))]\,\mathrm{E}[\exp\{a(1 + \rho_{\varepsilon}(h))\varepsilon(s)\}] \\
&= [\rho_{\varepsilon}(h) + a^{2}(1 + \rho_{\varepsilon}(h))^{2}]\exp\left\{\frac{a^{2}}{2}(1 - \rho_{\varepsilon}^{2}(h)) + \frac{a^{2}}{2}(1 + \rho_{\varepsilon}(h))^{2}\right\} \\
&= [\rho_{\varepsilon}(h) + a^{2}(1 + \rho_{\varepsilon}(h))^{2}]\exp\{a^{2}(1 + \rho_{\varepsilon}(h))\}.
\end{align*}

Result 5 We assume as usual that the latent stationary bivariate Gaussian process $(\alpha(s), \varepsilon(s))^{T}$ is well defined with a valid stationary and isotropic covariance function (3.3.2). Then $Z(s)$ is also stationary, with covariance function given by
\[
\mathrm{cov}(Z(s), Z(s')) = \left[\rho_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{c}(h))^{2}\right]\exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h))\right\} - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4}. \tag{3.3.8}
\]
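The covariance (3.3.8) can be evaluated directly. In the sketch below, all three correlation functions are taken to be the same exponential function $\exp(-\lambda h)$ (anticipating the common-correlation choice (3.3.18) discussed later; the value of $\lambda$ is illustrative), and two properties stated in the text are checked: the covariance equals $\mathrm{Var}(Z(s))$ from Result 2 at $h = 0$, and it vanishes as $h \to \infty$:

```python
import numpy as np

def hasp_cov(h, tau, rho, lam):
    """HASP covariance (3.3.8) with rho_alpha = rho_c = rho_eps = exp(-lam*h)."""
    r = np.exp(-lam * h)
    return (r + tau**2 * rho**2 / 4 * (1 + r)**2) * np.exp(tau**2 / 4 * (1 + r)) \
        - 0.25 * tau**2 * rho**2 * np.exp(tau**2 / 4)

tau, rho, lam = 1.0, 0.5, 2.0
var_z = (1 + tau**2 * rho**2) * np.exp(tau**2 / 2) \
    - 0.25 * tau**2 * rho**2 * np.exp(tau**2 / 4)

assert abs(hasp_cov(0.0, tau, rho, lam) - var_z) < 1e-10   # C(0) = Var(Z(s))
assert abs(hasp_cov(50.0, tau, rho, lam)) < 1e-8           # C(h) -> 0 as h grows
```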

Proof Since the latent bivariate process is assumed to be Gaussian and second-order stationary, we have
\[
\begin{pmatrix} \alpha(s) \\ \alpha(s') \\ \varepsilon(s) \\ \varepsilon(s') \end{pmatrix}
\sim N\left(0,\; \Sigma = \begin{pmatrix} \tau^{2}P & \tau\varrho R \\ \tau\varrho R & Q \end{pmatrix}\right), \tag{3.3.9}
\]
where $P$, $Q$ and $R$ are three $2 \times 2$ correlation matrices with correlation functions $\rho_{\alpha}(h)$, $\rho_{\varepsilon}(h)$ and $\rho_{c}(h)$, respectively. In general, the matrix $\Sigma$ so constructed is not necessarily non-negative definite. However, since we assumed that the bivariate process $(\alpha(s), \varepsilon(s))^{T}$ has a valid second-order structure (3.3.2) (e.g., by assuming the conditions in Theorem 1 hold), $\Sigma$ is indeed non-negative definite. Using the properties of the multivariate normal distribution, we have
\[
\begin{pmatrix} \alpha(s) \\ \alpha(s') \end{pmatrix} \,\bigg|\, \begin{pmatrix} \varepsilon(s) \\ \varepsilon(s') \end{pmatrix}
\sim N\left(\tau\varrho\, R Q^{-1}\begin{pmatrix} \varepsilon(s) \\ \varepsilon(s') \end{pmatrix},\; \tau^{2}P - \tau^{2}\varrho^{2}RQ^{-1}R\right),
\]
and hence
\[
\frac{\alpha(s) + \alpha(s')}{2} \,\bigg|\, \varepsilon(s), \varepsilon(s') \sim N(\mu_{\alpha|\varepsilon}, \Sigma_{\alpha|\varepsilon}),
\]
where
\begin{align*}
\mu_{\alpha|\varepsilon} &= \frac{\tau\varrho}{2}\mathbf{1}^{T}RQ^{-1}\begin{pmatrix} \varepsilon(s) \\ \varepsilon(s') \end{pmatrix}
= \frac{\tau\varrho}{2}\cdot\frac{1 + \rho_{c}(h)}{1 + \rho_{\varepsilon}(h)}\,(\varepsilon(s) + \varepsilon(s')), \quad\text{and} \\
\Sigma_{\alpha|\varepsilon} &= \frac{\tau^{2}}{4}\mathbf{1}^{T}(P - \varrho^{2}RQ^{-1}R)\mathbf{1}
= \frac{\tau^{2}}{4}(\mathbf{1}^{T}P\mathbf{1} - \varrho^{2}\mathbf{1}^{T}RQ^{-1}R\mathbf{1})
= \frac{\tau^{2}}{2}\left(1 + \rho_{\alpha}(h) - \varrho^{2}\frac{(1 + \rho_{c}(h))^{2}}{1 + \rho_{\varepsilon}(h)}\right).
\end{align*}
Using the moment generating function of a normal random variable as well as Lemma 3, we have
\begin{align*}
&\mathrm{cov}(Z(s), Z(s')) \\
&= \mathrm{E}\left[\mathrm{E}\left[\exp\left\{\frac{\alpha(s) + \alpha(s')}{2}\right\}\varepsilon(s)\varepsilon(s')\,\bigg|\,\varepsilon(s), \varepsilon(s')\right]\right] - \mathrm{E}[Z(s)]^{2} \\
&= \exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h)) - \frac{\tau^{2}\varrho^{2}}{4}\frac{(1 + \rho_{c}(h))^{2}}{1 + \rho_{\varepsilon}(h)}\right\}
\mathrm{E}\left[\varepsilon(s)\varepsilon(s')\exp\left\{\frac{\tau\varrho}{2}\frac{1 + \rho_{c}(h)}{1 + \rho_{\varepsilon}(h)}(\varepsilon(s) + \varepsilon(s'))\right\}\right] - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4} \\
&= \exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h)) - \frac{\tau^{2}\varrho^{2}}{4}\frac{(1 + \rho_{c}(h))^{2}}{1 + \rho_{\varepsilon}(h)}\right\}
\left[\rho_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{c}(h))^{2}\right]\exp\left\{\frac{\tau^{2}\varrho^{2}}{4}\frac{(1 + \rho_{c}(h))^{2}}{1 + \rho_{\varepsilon}(h)}\right\} - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4} \\
&= \left[\rho_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{c}(h))^{2}\right]\exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h))\right\} - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4}.
\end{align*}

Remark Regardless of the choice of the correlation functions $\rho_{\alpha}(h)$, $\rho_{\varepsilon}(h)$ and $\rho_{c}(h)$, a necessary (but not necessarily sufficient) condition for the bivariate latent process to be valid is
\begin{align*}
(1 + \rho_{\alpha}(h))(1 + \rho_{\varepsilon}(h)) &\geq \varrho^{2}(1 + \rho_{c}(h))^{2}, \quad\text{and} \\
(1 - \rho_{\alpha}(h))(1 - \rho_{\varepsilon}(h)) &\geq \varrho^{2}(1 - \rho_{c}(h))^{2}.
\end{align*}
This can be easily proven by noting that in (3.3.9), given that $Q$ is positive definite, $\Sigma$ is non-negative definite if and only if the Schur complement of $Q$ in $\Sigma$, $\tau^{2}(P - \varrho^{2}RQ^{-1}R)$, is non-negative definite (Haynsworth, 1968).

Denote the covariance function in (3.3.8) as $C(h)$ with $h = \|s - s'\|$. It is easy to verify that $C(0) = \mathrm{Var}(Z(s))$ and that $\lim_{h\to\infty} C(h) = 0$ as long as $\rho_{\alpha}(h)$, $\rho_{c}(h)$ and $\rho_{\varepsilon}(h)$ all tend to zero as the spatial lag goes to infinity. This also leads to our next result on the mean square continuity of the process $Y(s)$.

Result 6 Suppose the correlation functions $\rho_{\alpha}(h)$, $\rho_{c}(h)$, $\rho_{\varepsilon}(h)$ are such that the corresponding bivariate process $[\alpha(s), \varepsilon(s)]^{T}$ is well defined, and in addition assume that
\[
\lim_{h\to 0}\rho_{\alpha}(h) = \lim_{h\to 0}\rho_{c}(h) = \lim_{h\to 0}\rho_{\varepsilon}(h) = 1.
\]
Then the stochastic part of the HASP, $Z(s) = e^{\alpha(s)/2}\varepsilon(s)$, is mean square continuous as defined in Section 3.2.5.

Proof It is easier to prove this result by definition than by using Proposition 1. We need to establish that $\lim_{\Delta s\to 0}\mathrm{E}[(Z(s + \Delta s) - Z(s))^{2}] = 0$. Let $\|\Delta s\| = h$, $\mu = \mathrm{E}[Z(s)]$ and $C(h) = \mathrm{cov}(Z(s + \Delta s), Z(s))$. As has been proven, the process $Z(s)$ has finite mean, finite variance and a valid covariance function. Therefore,

\begin{align*}
\mathrm{E}[(Z(s + \Delta s) - Z(s))^{2}]
&= \mathrm{E}[Z^{2}(s + \Delta s)] + \mathrm{E}[Z^{2}(s)] - 2\,\mathrm{E}[Z(s + \Delta s)Z(s)] \\
&= 2\,\mathrm{E}[Z^{2}(s)] - 2(C(h) + \mu^{2}) \\
&= 2(1 + \tau^{2}\varrho^{2})e^{\tau^{2}/2} - 2\left[\rho_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{c}(h))^{2}\right]\exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h))\right\}.
\end{align*}
It is easy to see that
\begin{align*}
\lim_{\Delta s\to 0}\mathrm{E}[(Z(s + \Delta s) - Z(s))^{2}]
&= \lim_{h\to 0}\left(2(1 + \tau^{2}\varrho^{2})e^{\tau^{2}/2} - 2\left[\rho_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{c}(h))^{2}\right]\exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h))\right\}\right) \\
&= 2(1 + \tau^{2}\varrho^{2})e^{\tau^{2}/2} - 2(1 + \tau^{2}\varrho^{2})e^{\tau^{2}/2} = 0.
\end{align*}

Finally, it is worth pointing out that the function (3.3.8) can still be a valid covariance function for a univariate spatial or spatio-temporal process even when $\rho_{\alpha}(h)$, $\rho_{\varepsilon}(h)$ and $\rho_{c}(h)$ do not imply a valid covariance structure for a bivariate Gaussian process. The following result, although not directly useful for model fitting, sheds some light on the properties of the particular functional form of (3.3.8).

Result 7

i) Suppose $\rho_{\alpha}(h)$, $\rho_{\varepsilon}(h)$ and $\rho_{c}(h)$ are all continuous, non-negative, non-increasing and convex functions that satisfy
\[
\rho_{\alpha}(0) = \rho_{\varepsilon}(0) = \rho_{c}(0) = 1; \qquad
\lim_{h\to\infty}\rho_{\alpha}(h) = \lim_{h\to\infty}\rho_{\varepsilon}(h) = \lim_{h\to\infty}\rho_{c}(h) = 0.
\]
Then (3.3.8) is a positive definite covariance function for a spatial process defined on $\mathbb{R}$.

ii) If, in addition to the conditions in i), the derivatives of the correlation functions, $\rho'_{\alpha}(h)$, $\rho'_{\varepsilon}(h)$ and $\rho'_{c}(h)$, are non-positive, non-decreasing and concave for $h \geq 0$, then (3.3.8) is a positive definite covariance function for a spatial process defined on $\mathbb{R}^{3}$.

It is easy to verify that the exponential correlation function satisfies both sets of conditions. Result 7 follows directly from Pólya's criterion (Pólya, 1949) as well as the following criterion of Pólya type for the positive definiteness of radial functions (Gneiting, 2001):

Theorem 5 (Theorem 1.1 in Gneiting (2001)) Let $\varphi : [0, \infty) \to \mathbb{R}$ be a continuous function with $\varphi(0) = 1$ and $\lim_{t\to\infty}\varphi(t) = 0$. Suppose that $k$ and $l$ are non-negative integers, at least one of which is strictly positive. Let
\[
\eta_{1}(t) = \left[\left(-\frac{d}{du}\right)^{k}\varphi(\sqrt{u})\right]_{u = t^{2}}.
\]
If there exists an $\alpha > 1/2$ so that
\[
\eta_{2}(t) = \left(-\frac{d}{dt}\right)^{k+l-1}\left[-\eta'_{1}(t^{\alpha})\right]
\]
is convex for $t > 0$, then the radial function $\varphi(\|x\|)$, $x \in \mathbb{R}^{n}$, is positive definite for $n = 1, \ldots, 2l + 1$.

Proof of Result 7

i) Let $\gamma(h)$ denote the covariance function in (3.3.8); then $\gamma(h)$ is continuous, $\lim_{h\to\infty}\gamma(h) = 0$, and, after normalizing by $\gamma(0)$, we may take $\gamma(0) = 1$. In addition, given the conditions, $\gamma(h)$ is convex by the following properties of convex functions:

a. If $f(h)$ and $g(h)$ are convex, then $f(h) + g(h)$ is also convex.

b. If $g(h)$ is convex, and $f(h)$ is convex and non-decreasing, then $f(g(h))$ is convex.

c. If $f(h)$ and $g(h)$ are convex, non-negative and are either both non-increasing or both non-decreasing, then $f(h)g(h)$ is convex.

Thus by Pólya's criterion, $\gamma(h)$ is positive definite in $\mathbb{R}$.

ii) In Theorem 5, let $k = 0$, $l = 1$ and $\alpha = 1$. This special case of Theorem 5 (the result of Askey, 1973) dictates that $\gamma(h)$ is positive definite in $\mathbb{R}^{3}$ if $-\gamma'(h)$ is convex, where $-\gamma'(h)$ is given by
\begin{align*}
-\gamma'(h) &= -e^{\tau^{2}(1 + \rho_{\alpha}(h))/4}\left[\rho'_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{2}\rho'_{c}(h)(1 + \rho_{c}(h))\right] \\
&\quad - \frac{\tau^{2}}{4}\rho'_{\alpha}(h)\,e^{\tau^{2}(1 + \rho_{\alpha}(h))/4}\left[\rho_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{c}(h))^{2}\right].
\end{align*}
Since $\rho'_{\alpha}(h)$, $\rho'_{\varepsilon}(h)$ and $\rho'_{c}(h)$ are non-positive, non-decreasing and concave, the functions $-\rho'_{\alpha}(h)$, $-\rho'_{\varepsilon}(h)$ and $-\rho'_{c}(h)$ are non-negative, non-increasing and convex. Therefore, by the same rules as in part i), $-\gamma'(h)$ is convex.

3.3.3 Linear Co-Regionalization Version of the HASP Model

An alternative to using constrained Matérn covariance functions (as dictated by Theorems 1, 3 and 4) for constructing multivariate Gaussian processes is to consider the linear model of co-regionalization (LMC; see, e.g., Goulard and Voltz, 1992; Wackernagel, 2003; Schmidt and Gelfand, 2003; Zhang, 2007). The LMC is a commonly used approach for obtaining a valid covariance structure for multivariate spatial processes. The LMC expresses a $p$-dimensional multivariate process $U(s)$ as a linear combination of $r$ ($1 \leq r \leq p$) independent spatial processes $W(s) = (w_{1}(s), \ldots, w_{r}(s))^{T}$, i.e.,
\[
U(s) = AW(s),
\]
where $A$ is a $p \times r$ full-rank coefficient matrix and $w_{k}(s)$ has zero mean, unit variance and stationary correlation function $\rho_{k}(h)$ for $1 \leq k \leq r$. The matrix-valued covariance function of $U(s)$ is thus given by
\[
\Sigma(h) = A\Theta(h)A^{T},
\]
where $\Theta(h) = \mathrm{diag}\{\rho_{1}(h), \ldots, \rho_{r}(h)\}$.

Since both $\alpha(s)$ and $\varepsilon(s)$ are Gaussian processes and their co-locational correlation is $\varrho$, we can rewrite $\alpha(s)$ as a linear combination of $\varepsilon(s)$ and a third Gaussian process $\eta(s)$ that has mean $0$, variance $1$, correlation function $\rho_{\eta}(h)$ and is independent of $\varepsilon(s)$, i.e.,
\[
\alpha(s) = \tau\varrho\,\varepsilon(s) + \tau\sqrt{1 - \varrho^{2}}\,\eta(s). \tag{3.3.10}
\]

Alternatively, we can express $\varepsilon(s)$ as
\[
\varepsilon(s) = \varrho\,\alpha(s)/\tau + \sqrt{1 - \varrho^{2}}\,\xi(s), \tag{3.3.11}
\]
where $\xi(s)$ is independent of $\alpha(s)$ and has mean $0$, variance $1$ and correlation function $\rho_{\xi}(h)$. It is easy to see that the bivariate Gaussian process $(\alpha(s), \varepsilon(s))^{T}$ is well defined by construction, but we have also made implicit assumptions about the correlation and cross-correlation functions of $\alpha(s)$ and $\varepsilon(s)$ when using (3.3.10) and (3.3.11). For (3.3.10), the correlation functions satisfy
\begin{align}
\rho_{c}(h) &= \rho_{\varepsilon}(h); \tag{3.3.12} \\
\rho_{\alpha}(h) &= \varrho^{2}\rho_{\varepsilon}(h) + (1 - \varrho^{2})\rho_{\eta}(h), \tag{3.3.13}
\end{align}
and for (3.3.11), we have
\begin{align}
\rho_{c}(h) &= \rho_{\alpha}(h); \tag{3.3.14} \\
\rho_{\varepsilon}(h) &= \varrho^{2}\rho_{\alpha}(h) + (1 - \varrho^{2})\rho_{\xi}(h). \tag{3.3.15}
\end{align}
In other words, the matrix-valued correlation structures for $(\alpha(s), \varepsilon(s))^{T}$ under (3.3.10) and (3.3.11) are respectively
\[
\begin{pmatrix} \varrho^{2}\rho_{\varepsilon}(h) + (1 - \varrho^{2})\rho_{\eta}(h) & \varrho\,\rho_{\varepsilon}(h) \\ \varrho\,\rho_{\varepsilon}(h) & \rho_{\varepsilon}(h) \end{pmatrix}
\quad\text{and}\quad
\begin{pmatrix} \rho_{\alpha}(h) & \varrho\,\rho_{\alpha}(h) \\ \varrho\,\rho_{\alpha}(h) & \varrho^{2}\rho_{\alpha}(h) + (1 - \varrho^{2})\rho_{\xi}(h) \end{pmatrix}.
\]
By plugging the above two linear representations into model (3.3.1), we obtain the following two LMC versions of the HASP model:
\[
Y(s) = X^{T}(s)\beta + \exp\left\{\frac{H^{T}(s)\delta + \tau\varrho\,\varepsilon(s) + \tau\sqrt{1 - \varrho^{2}}\,\eta(s)}{2}\right\}\varepsilon(s), \tag{3.3.16}
\]
or
\begin{align}
Y(s) &= X^{T}(s)\beta + \exp\left\{\frac{H^{T}(s)\delta + \alpha(s)}{2}\right\}\left[\varrho\,\alpha(s)/\tau + \sqrt{1 - \varrho^{2}}\,\xi(s)\right] \notag \\
&= X^{T}(s)\beta + \frac{\varrho}{\tau}\,\alpha(s)\exp\left\{\frac{H^{T}(s)\delta + \alpha(s)}{2}\right\} + \sqrt{1 - \varrho^{2}}\exp\left\{\frac{H^{T}(s)\delta + \alpha(s)}{2}\right\}\xi(s). \tag{3.3.17}
\end{align}
The term
\[
\frac{\varrho}{\tau}\,\alpha(s)\exp\left\{\frac{H^{T}(s)\delta + \alpha(s)}{2}\right\}
\]
in (3.3.17) clearly illustrates the skewness in the HASP model.
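The "valid by construction" nature of the LMC representation can be illustrated on a finite set of sites: assembling the joint covariance of $(\alpha, \varepsilon)$ implied by (3.3.10) necessarily yields a non-negative definite matrix whose blocks follow (3.3.12)–(3.3.13). The grid, the decay rates of the correlation functions, and the values of $\tau$ and $\varrho$ below are illustrative:

```python
import numpy as np

s = np.linspace(0, 5, 40)                   # one-dimensional site locations
H = np.abs(s[:, None] - s[None, :])         # pairwise distances
Qe = np.exp(-1.5 * H)                       # rho_eps: exponential correlation
Qh = np.exp(-0.7 * H**2)                    # rho_eta: Gaussian correlation
tau, rho = 1.2, 0.8

# Blocks implied by alpha = tau*rho*eps + tau*sqrt(1-rho^2)*eta
Caa = tau**2 * (rho**2 * Qe + (1 - rho**2) * Qh)  # (3.3.13), scaled by tau^2
Cae = tau * rho * Qe                              # cross block: rho_c = rho_eps
Sigma = np.block([[Caa, Cae], [Cae.T, Qe]])

# Valid by construction: smallest eigenvalue is non-negative up to rounding error
assert np.linalg.eigvalsh(Sigma).min() > -1e-8
```

The same check applied to an arbitrary, unconstrained choice of $\rho_{\alpha}$, $\rho_{c}$, $\rho_{\varepsilon}$ can fail, which is exactly the validity issue the LMC construction avoids.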

Result 8 The covariance functions for (3.3.16) and (3.3.17) are, respectively,
\[
\mathrm{cov}(Z(s), Z(s')) = \left[\rho_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{\varepsilon}(h))^{2}\right]\exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h))\right\} - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4},
\]
and
\[
\mathrm{cov}(Z(s), Z(s')) = \left[\rho_{\varepsilon}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{\alpha}(h))^{2}\right]\exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h))\right\} - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4},
\]
where
\[
\rho_{\alpha}(h) = \varrho^{2}\rho_{\varepsilon}(h) + (1 - \varrho^{2})\rho_{\eta}(h), \quad\text{and}\quad
\rho_{\varepsilon}(h) = \varrho^{2}\rho_{\alpha}(h) + (1 - \varrho^{2})\rho_{\xi}(h).
\]

Proof Immediate from (3.3.8) and (3.3.12)–(3.3.15).

Last but not least, it is important to note that although the linearized forms (3.3.16) and (3.3.17) represent only a subset of the more general model (3.3.1), the marginal distributions of (3.3.16) and (3.3.17) are exactly the same as that of (3.3.1). As a result, the mean, variance and skewness of the models (3.3.16) and (3.3.17) are exactly the same as those of (3.3.1).

3.3.4 More on the Choices of the Correlation and Cross-Correlation Functions

We have discussed several possible choices for the correlation functions $\rho_{\alpha}(h)$, $\rho_{\varepsilon}(h)$ and $\rho_{c}(h)$, but the question remains which restrictions we should embrace in practice. In addition to the modeling flexibility afforded by the assumptions on the correlation functions, we also need to consider the practicality of the model fitting procedures. As the marginal distribution is the same regardless of the choice of the correlation functions, we suggest the following two options that can make the model fitting elegantly tractable while, at the same time, still providing enough flexibility in the covariance structure.

I. As mentioned previously, in the Matérn model (3.2.2) and (3.2.3), the constraint on $\varrho$ depends on how much the smoothness and scale parameters of the cross-correlation function deviate from the corresponding parameters of the two marginal processes. Since a major feature of the HASP model is the ability to capture skewness in the process, which depends on the co-locational correlation parameter $\varrho$, it is undesirable for us to place constraints on $\varrho$.

According to Theorem 3, we can remove the restriction on $\varrho$ by requiring both the $\alpha(s)$ and $\varepsilon(s)$ processes to have the same degree of smoothness, i.e., $\nu_{1} = \nu_{2}$. In other words, there is no restriction on $\varrho$ for the HASP model if we require
\[
\rho_{\alpha}(h) = \rho_{c}(h) = \rho_{\varepsilon}(h) = \varrho(h). \tag{3.3.18}
\]
In fact, it is easy to see that we can use any correlation function, not just Matérn functions, for $\varrho(h)$, and the bivariate Gaussian process is always valid. The covariance function is given by (for $\|s - s'\| = h$)
\[
\mathrm{cov}(Z(s), Z(s')) = \left[\varrho(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \varrho(h))^{2}\right]\exp\left\{\frac{\tau^{2}}{4}(1 + \varrho(h))\right\} - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4}. \tag{3.3.19}
\]

Under the assumption (3.3.18), the three models (3.3.1), (3.3.16) and (3.3.17) are also equivalent to each other. The condition (3.3.18) appears to be very restrictive, but as we have learned from our previous discussion, not all the parameters in the covariance function, even in the more restrictive Gaussian process, can be consistently estimated at the same time (Zhang, 2004). In the case of the exponential covariance function, $\sigma^{2}\exp(-ah)$, either $a$ or $\sigma^{2}$ can be fixed arbitrarily and the composite quantity $a\sigma^{2}$ can still be estimated consistently and efficiently (Ying, 1991). As a result, the assumption (3.3.18) might not be as restrictive as it seems, and its impact on the model performance should be very limited.

II. A more flexible model is the LMC version (3.3.17). (3.3.17) is easier to work with than (3.3.16) since, conditional on $\alpha(s)$, $Y(s)$ is a Gaussian process. Regardless of the choice of correlation functions $\rho_{\alpha}(h)$ and $\rho_{\xi}(h)$, the model is always valid by construction. The covariance function of (3.3.17) is given by
\[
\mathrm{cov}(Z(s), Z(s')) = \left[\varrho^{2}\rho_{\alpha}(h) + (1 - \varrho^{2})\rho_{\xi}(h) + \frac{\tau^{2}\varrho^{2}}{4}(1 + \rho_{\alpha}(h))^{2}\right]\exp\left\{\frac{\tau^{2}}{4}(1 + \rho_{\alpha}(h))\right\} - \frac{1}{4}\tau^{2}\varrho^{2}e^{\tau^{2}/4}. \tag{3.3.20}
\]

Plots of the two covariance functions are presented below to facilitate the understanding of their behaviors and differences. Figure 3.2 shows the covariance function (3.3.19) for different choices of $\varrho(h)$ and different values of the other parameters.

Figure 3.2: The shape of the correlation function implied by (3.3.19) for different choices of $\varrho(h)$. Plots (i)–(iii) use the exponential correlation function for $\varrho(h)$, while plots (iv)–(vi) use the Gaussian correlation function. The values of the other parameters are shown in the plots.

Figure 3.3: The shape of the correlation function implied by (3.3.20), where $\rho_{\alpha}(h)$ assumes the form of a Gaussian correlation function and $\rho_{\xi}(h)$ the form of an exponential correlation function. The values of the other parameters are shown in the plots.

The exponential correlation function is used in plots (i)–(iii) of Figure 3.2, i.e., $\varrho(h) = \exp(-\lambda h)$. In plots (iv)–(vi), the Gaussian correlation function is used, thus $\varrho(h) = \exp(-\lambda h^{2})$. The

Figure 3.2 shows that the co-locational correlation parameter $\varrho$ has little impact on the covariance function, while it has a big impact on the marginal distribution of the spatial process, as can be seen in Figure 3.1. In addition, both $\tau$ and $\lambda$ impact the covariance function in a similar fashion.

Due to the additional parameters, the covariance function (3.3.20) is more flexible than (3.3.19), which can be clearly seen in Figure 3.3. The plots in Figure 3.3 are based on the assumption that $\rho_{\alpha}(h)$ takes the form of a Gaussian correlation function $\exp(-\lambda h^{2})$, and that $\rho_{\xi}(h)$ takes the form of an exponential correlation function $\exp(-\kappa h)$, so that the process $\alpha(s)$ has a smoother sample path than $\varepsilon(s)$. As in Figure 3.2, the impact of different values of $\varrho$ on the shape of the covariance function in Figure 3.3 is not as large as the impacts of $\tau$ and $\lambda$. However, the smoothness of the covariance function at $h = 0$ as well as its effective correlation length does depend partially on $\varrho$. These impacts of $\varrho$ are in large part due to the relation
\[
\rho_{\varepsilon}(h) = \varrho^{2}\rho_{\alpha}(h) + (1 - \varrho^{2})\rho_{\xi}(h).
\]
The parameters $\tau$ and $\lambda$ also have similar effects on the covariance function, suggesting that consistent simultaneous estimation of both parameters might still be a problem.

It is ultimately up to the researcher to decide which of the above two strategies to choose. The covariance function (3.3.20) is certainly more flexible than (3.3.19), but in practice we do not always have enough data to estimate it efficiently. When data are not abundant, or when there is no need to estimate the smoothness of the underlying process with a very general covariance function, the more restricted strategy (3.3.19) might be the better choice. In either case, the model fitting procedures for the two strategies are very similar.

3.3.5 Examples of the HASP Sample Paths

In this section, we present several realizations from the heteroscedastic asymmetric spatial process and compare them with realizations from the SHP and the Gaussian process (GP). Figures 3.4 and 3.5 show sample paths of the processes defined on $\mathbb{R}$, and examples of the processes defined on $\mathbb{R}^{2}$ are presented in Figures 3.6 and 3.7. For Figures 3.4 and 3.6, an exponential function is used as the correlation and cross-correlation function of the (potentially latent) Gaussian processes, whereas a Gaussian correlation function is used for Figures 3.5 and 3.7.

As discussed earlier, the sample path of a Gaussian process with a Gaussian correlation function is smoother than that with an exponential correlation function, which can be clearly seen from the plots. From the time series literature, we know that a stationary and Markov Gaussian process defined on $\mathbb{R}^{+}$ with a continuous correlation function is necessarily an Ornstein–Uhlenbeck process, which is the unique stationary solution to the stochastic differential equation
\[
dX_{t} = \theta(\mu - X_{t})\,dt + \sigma\,dW_{t}, \quad t \geq 0,
\]
where $\{W_{t}, t \geq 0\}$ is a Brownian motion with unit variance (see Uhlenbeck and Ornstein, 1930; Doob, 1942). The Ornstein–Uhlenbeck process has a stationary and isotropic exponential correlation function, and its Euler–Maruyama discretization is the discrete-time AR(1) process. In addition, the sample path of an Ornstein–Uhlenbeck process is continuous but nowhere differentiable with probability 1.

Apart from the discrepancies in the smoothness of the sample paths caused by the different correlation functions, we can also recognize the unique features of the curves or surfaces drawn from the non-Gaussian processes compared to those drawn from the Gaussian processes (which are set to have zero mean and unit variance). In Figures 3.4–3.7, the curves or surfaces drawn from a Gaussian process mostly vary within $-2$ and $2$. When we modulate the said Gaussian process with an independent exponential-Gaussian process and get the GLG/SHP model, the resulting sample paths end up with more fluctuation in terms of the range of the functions over the entire domain. Furthermore, when we correlate the Gaussian process and the modulating exponential-Gaussian process, i.e., for the HASP with positive or negative skewness, the curves or surfaces drawn from the processes can have an even larger swing in one direction. This suggests that the HASP model (with the SHP as a special case) is useful in modeling spatially indexed data when the underlying curve or surface has

Figure 3.4: Five sample paths from each of the four spatial processes in $\mathbb{R}$: GP, GLG/SHP, HASP with positive skewness and HASP with negative skewness. An exponential correlation function is used for the GP. For the non-Gaussian processes, the same exponential correlation function is used for the marginal as well as the cross correlation functions of the latent multivariate Gaussian process. The sample paths for different processes bear resemblance to each other because the same seed is used for random number generation for easier comparison of the different processes. The sample paths are approximated based on a finite number of observations on a grid with an increment of 0.1. Realizations of the HASP model are computed according to (3.3.17).

Figure 3.5: Five sample paths from each of the four spatial processes in $\mathbb{R}$: GP, GLG/SHP, HASP with positive skewness and HASP with negative skewness. A Gaussian (or double exponential) correlation function is used for the GP. For the non-Gaussian processes, the same Gaussian correlation function is used for the marginal as well as cross correlation functions of the latent multivariate Gaussian process. The sample paths are constructed in the same way as in Figure 3.4.

Figure 3.6: Five sample paths from each of the four spatial processes in $\mathbb{R}^{2}$: GP, GLG/SHP, HASP with positive skewness and HASP with negative skewness. An exponential correlation function is used for the GP. For the non-Gaussian processes, the same exponential correlation function is used for the marginal as well as cross correlation functions of the latent multivariate Gaussian process. Again, the sample paths for different processes bear resemblance to each other because the same seed is used for random number generation for easier comparison of the different processes. The sample paths are approximated based on a finite number of observations on a rectangular grid with an increment of 0.1 in each direction. Realizations of the HASP model are computed according to (3.3.17).

Figure 3.7: Five sample paths from each of the four spatial processes in $\mathbb{R}^{2}$: GP, GLG/SHP, HASP with positive skewness and HASP with negative skewness. A Gaussian correlation function is used for the GP. For the non-Gaussian processes, the same Gaussian correlation function is used for the marginal as well as cross correlation functions of the latent multivariate Gaussian process. The sample paths are constructed in the same way as in Figure 3.6.

marked peaks or valleys which, if modeled by the Gaussian processes, might be overly shrunken toward the overall mean.
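Sample paths like those in Figures 3.4–3.7 can be generated directly from the conditional representation (3.3.17). Below is a minimal one-dimensional sketch under the common-correlation strategy (3.3.18), for which $\rho_{\alpha} = \rho_{\xi}$; with $\varrho > 0$ the generated paths should on average exhibit the right skew discussed above. The grid, seed and parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
s = np.linspace(0, 10, 200)
H = np.abs(s[:, None] - s[None, :])
P = np.exp(-H)                                   # common exponential correlation
L = np.linalg.cholesky(P + 1e-10 * np.eye(len(s)))  # jitter for numerical safety
tau, rho = 1.0, 0.7

skews = []
for _ in range(200):
    alpha = tau * (L @ rng.standard_normal(len(s)))  # latent log-variance field
    xi = L @ rng.standard_normal(len(s))             # independent of alpha
    # Representation (3.3.17) with beta = delta = 0, so Y(s) = Z(s):
    Z = (rho / tau) * alpha * np.exp(alpha / 2) \
        + np.sqrt(1 - rho**2) * np.exp(alpha / 2) * xi
    skews.append(((Z - Z.mean())**3).mean())

assert np.mean(skews) > 0   # positive co-locational correlation => right skew
```

Flipping the sign of `rho` reverses the direction of the skew, matching the lower panels of the figures.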

3.4 Model Fitting and Spatial Prediction

3.4.1 The Likelihood Function

Suppose we observe the $Y(s)$ process at $n$ spatial locations in $\mathbb{R}^d$, indexed by $s = (s_1, s_2, \cdots, s_n)^T$. Let
\begin{align*}
Y &= (Y(s_1), \cdots, Y(s_n))^T, \\
\alpha &= (\alpha(s_1), \cdots, \alpha(s_n))^T, \quad \text{and} \\
\epsilon &= (\varepsilon(s_1), \cdots, \varepsilon(s_n))^T.
\end{align*}

Assume that
\[
\begin{pmatrix} \alpha \\ \epsilon \end{pmatrix}
\sim N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix} \tau^2 P & \tau\varrho R \\ \tau\varrho R & Q \end{pmatrix} \right),
\]
where $P$, $Q$ and $R$ are three correlation matrices of dimension $n$ whose $(i,j)$ elements are defined by $\rho_\alpha(\|s_i - s_j\|)$, $\rho_\varepsilon(\|s_i - s_j\|)$ and $\rho_c(\|s_i - s_j\|)$, respectively. We also use $A \circ B$ to denote the element-wise multiplication of two vectors or matrices of the same dimensions, and use $A \cdot B$ to denote the inner product of two vectors of the same dimension. In addition, we introduce the following notation for the rest of

this chapter:

\begin{align*}
X &= (X(s_1), \cdots, X(s_n))^T, \\
W_\delta &= \left(e^{H^T(s_1)\delta/2}, \cdots, e^{H^T(s_n)\delta/2}\right)^T, \\
W_\alpha &= \left(e^{\alpha(s_1)/2}, \cdots, e^{\alpha(s_n)/2}\right)^T, \\
V_\delta &= \mathrm{diag}\left\{e^{H^T(s_1)\delta/2}, \cdots, e^{H^T(s_n)\delta/2}\right\}, \\
V_\alpha &= \mathrm{diag}\left\{e^{\alpha(s_1)/2}, \cdots, e^{\alpha(s_n)/2}\right\}, \\
W &= W_\delta \circ W_\alpha, \\
V &= V_\delta V_\alpha.
\end{align*}

As long as the correlation functions $\rho_\alpha(h)$, $\rho_\varepsilon(h)$ and $\rho_c(h)$ are such that the matrix $Q - \varrho^2 R P^{-1} R$ is positive definite for all integers $n > 1$ (e.g., $\rho_\alpha(h)$, $\rho_\varepsilon(h)$ and $\rho_c(h)$ satisfy the conditions in Theorem 1, 3 or 4, or the conditions given in Section 3.3.3), the conditional distribution of $\epsilon$ given $\alpha$ is multivariate normal,
\[
\epsilon \mid \alpha \sim N\left( \frac{\varrho}{\tau} R P^{-1} \alpha, \; Q - \varrho^2 R P^{-1} R \right).
\]
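As a quick numerical check (not the dissertation's code), the closed-form conditional above can be compared with the generic Gaussian conditioning formula. For illustration, $P$, $Q$ and $R$ below share one exponential correlation function, which is an assumption made purely for this sketch:

```python
import numpy as np

# Check eps | alpha = N((rho/tau) R P^{-1} alpha, Q - rho^2 R P^{-1} R)
# against generic Gaussian conditioning on a small synthetic configuration.
rng = np.random.default_rng(0)
n, tau, rho = 5, 1.3, 0.6
sites = rng.uniform(0, 10, size=(n, 2))
dist = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)
P = Q = R = np.exp(-dist / 2.0)              # exponential correlation, range 2

# Joint covariance of (alpha, eps)
Sigma = np.block([[tau**2 * P, tau * rho * R],
                  [tau * rho * R, Q]])

alpha = rng.normal(size=n)
S11, S12 = Sigma[:n, :n], Sigma[:n, n:]
S21, S22 = Sigma[n:, :n], Sigma[n:, n:]
mean_generic = S21 @ np.linalg.solve(S11, alpha)
cov_generic = S22 - S21 @ np.linalg.solve(S11, S12)

# Closed form from the text
mean_text = (rho / tau) * R @ np.linalg.solve(P, alpha)
cov_text = Q - rho**2 * R @ np.linalg.solve(P, R)

print(np.allclose(mean_generic, mean_text), np.allclose(cov_generic, cov_text))
```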

The distribution of the data $Y$ conditional on $\alpha$ and all model parameters $\theta$ is simply
\[
N\left( X\beta + \frac{\varrho}{\tau} V R P^{-1} \alpha, \; V \left(Q - \varrho^2 R P^{-1} R\right) V \right).
\]

Under the Bayesian framework, the posterior distribution of the parameters (augmented with the latent $\alpha$) is given by
\[
\pi(\alpha, \theta \mid Y) = \frac{\pi(\alpha, \theta, Y)}{\pi(Y)} \propto \pi(Y \mid \alpha, \theta)\,\pi(\alpha \mid \theta)\,\pi(\theta), \tag{3.4.1}
\]
where $\pi(\theta)$ is the prior distribution for $\theta$.

Given (3.4.1), it is straightforward to come up with a Markov chain Monte Carlo (MCMC) algorithm for fitting the model. If we adopt strategy I, presented in Section 3.3.4, to ensure the positive definiteness of $Q - \varrho^2 R P^{-1} R$, i.e., to assume that $P = Q = R$, then the conditional distribution $Y \mid \alpha, \theta$ can be simplified to
\[
N\left( X\beta + \frac{\varrho}{\tau} V\alpha, \; (1 - \varrho^2)\, V P V \right),
\]
or equivalently,
\[
N\left( X\beta + \frac{\varrho}{\tau} W \circ \alpha, \; (1 - \varrho^2)\, V P V \right).
\]

We can simplify the model fitting even further by re-parameterizing the model. Since the row vector $H(s)^T$ contains a constant 1, we can rewrite $H(s)^T$ and $\beta$ as
\[
H(s)^T = \begin{bmatrix} 1 & \tilde{H}(s)^T \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \mu & \tilde{\beta}^T \end{bmatrix}^T,
\]
so that $H(s)^T\beta = \mu + \tilde{H}(s)^T\tilde{\beta}$. Define $\sigma = e^{\mu/2}$, $\phi = \varrho\sigma/\tau \in \mathbb{R}$ and $\psi^2 = \sigma^2(1 - \varrho^2) \in \mathbb{R}^+$; the distribution of $Y \mid \alpha, \theta$ can then be equivalently expressed as
\[
N\left( X\beta + \phi\,\tilde{W} \circ \alpha, \; \psi^2\, \tilde{V} P \tilde{V} \right).
\]
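This re-parameterization is invertible: since $\psi^2 + \phi^2\tau^2 = \sigma^2(1-\varrho^2) + \varrho^2\sigma^2 = \sigma^2$, the original parameters can be recovered from $(\phi, \psi^2, \tau)$. A small numerical sketch, with arbitrary illustrative values rather than estimates from the text:

```python
import math

# Forward map: sigma = e^{mu/2}, phi = rho*sigma/tau, psi^2 = sigma^2*(1 - rho^2)
mu, rho, tau = 0.8, 0.6, 1.3

sigma = math.exp(mu / 2)
phi = rho * sigma / tau
psi2 = sigma**2 * (1 - rho**2)

# Inverse map, using the identity psi^2 + phi^2 tau^2 = sigma^2
sigma_back = math.sqrt(psi2 + phi**2 * tau**2)
rho_back = phi * tau / sigma_back

print(abs(sigma_back - sigma) < 1e-12, abs(rho_back - rho) < 1e-12)
```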

On the other hand, if strategy II, i.e., the LMC (3.3.17), is adopted, where the correlation matrix for the process $\xi(s)$ is denoted $\Xi$, then we have $Q = \varrho^2 P + (1 - \varrho^2)\Xi$ and $P = R$ by equations (3.3.14) and (3.3.15). Note that $Q - \varrho^2 R P^{-1} R$ simplifies to $(1 - \varrho^2)\Xi$, so the distribution $\pi(Y \mid \alpha, \theta)$ can be written as
\[
N\left( X\beta + \phi\,\tilde{W} \circ \alpha, \; \psi^2\, \tilde{V} \Xi \tilde{V} \right).
\]

3.4.2 Model Fitting Strategy

A number of different model fitting approaches have been proposed in the literature for non-Gaussian processes of the form (3.2.1). For example, Palacios and Steel (2006) fitted the GLG model under the Bayesian framework. In particular, they used an MCMC algorithm with block updating of the latent scale parameters. The block update is carried out through a Metropolis-Hastings step, in which the proposal distribution is set to be a normal distribution that best approximates the full-conditional distribution. Zhang and El-Shaarawi (2010) used a maximum likelihood approach for the skew-Gaussian process. They augmented the parameter space with one of the two latent univariate Gaussian processes and used the expectation maximization (EM) algorithm to evaluate the likelihood function. Craigmile and Guttorp (2011) proposed a hierarchical Bayesian model on the wavelet domain to fit their non-Gaussian space-time model under an approximate whitened model. Huang et al. (2011) employed a pseudo maximum likelihood approach without augmenting the parameter space and used importance sampling to evaluate the likelihood function
\[
L(\theta; Y) = \int p(Y \mid \alpha, \theta)\, p(\alpha \mid \theta)\, d\alpha,
\]
where $\theta$ denotes all parameters in the model.
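The importance-sampling evaluation of such a likelihood can be sketched with a toy one-dimensional latent variable standing in for the latent field; the densities below are placeholders chosen so the exact marginal is known in closed form, not the HASP model's densities:

```python
import numpy as np

# L(theta; Y) = ∫ p(Y|alpha) p(alpha) d(alpha), estimated by importance sampling
# with weights w = p(Y|alpha) p(alpha) / q(alpha) for a proposal q.
rng = np.random.default_rng(1)

def log_p_y_given_alpha(y, a):              # placeholder: Y | alpha ~ N(alpha, 1)
    return -0.5 * (y - a) ** 2 - 0.5 * np.log(2 * np.pi)

def log_p_alpha(a):                         # placeholder: alpha ~ N(0, 1)
    return -0.5 * a ** 2 - 0.5 * np.log(2 * np.pi)

y, M = 0.7, 200_000
a = rng.normal(loc=y, size=M)               # proposal q = N(y, 1)
log_q = -0.5 * (a - y) ** 2 - 0.5 * np.log(2 * np.pi)
log_w = log_p_y_given_alpha(y, a) + log_p_alpha(a) - log_q
L_hat = np.exp(log_w).mean()

# For these placeholder densities the exact marginal is N(0, 2) at y
L_exact = np.exp(-0.25 * y**2) / np.sqrt(4 * np.pi)
print(abs(L_hat - L_exact) < 5e-3)
```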

Instead of integrating out the latent process $\alpha$ and approximating the resulting likelihood function, we propose to use a Bayesian method to fit the HASP model, similar to the approach of Palacios and Steel (2006). The Gibbs sampler for our MCMC algorithm is presented below. We use strategy II to illustrate the model fitting process, but note that the procedure for strategy I is very similar.

Let $\theta$ denote all model parameters other than $\alpha$, i.e., $\theta = \{\beta, \phi, \psi^2, \tau^2, \lambda, \kappa, \tilde{\delta}\}$. For a single iteration in the MCMC algorithm, we simulate from the posterior distribution using the following steps.

1) Update the latent variables $\alpha$ either one-by-one or through a block update as in Palacios and Steel (2006). One approach is to use a random walk Metropolis-Hastings step. Alternatively, we can consider a better proposal distribution by using techniques such as approximating the full conditional distribution, as was done in Palacios and Steel (2006), or using the Metropolis-adjusted Langevin algorithm (Grenander and Miller, 1994; Roberts and Tweedie, 1996). The full conditional distribution of $\alpha$ is given by
\begin{align*}
p(\alpha \mid Y, \theta) \propto{}& |V_\alpha|^{-1} \exp\left\{ -\frac{1}{2\psi^2} \left[ V_\alpha^{-1}\tilde{V}_\delta^{-1}(Y - X\beta) - \phi\alpha \right]^T \Xi^{-1} \left[ V_\alpha^{-1}\tilde{V}_\delta^{-1}(Y - X\beta) - \phi\alpha \right] \right\} \\
& \times \exp\left\{ -\frac{1}{2\tau^2}\, \alpha^T P^{-1} \alpha \right\}.
\end{align*}

Let $Y_\phi = \tilde{V}_\delta^{-1}(Y - X\beta)$ and let $C$ be a constant in $\alpha$; then
\begin{align*}
\log p(\alpha \mid Y, \theta)
&= C - \frac{1}{2}\sum_{k=1}^n \alpha(s_k) - \frac{1}{2\psi^2}\left[ (V_\alpha^{-1}Y_\phi)^T \Xi^{-1} (V_\alpha^{-1}Y_\phi) - 2\phi\,\alpha^T \Xi^{-1} (V_\alpha^{-1}Y_\phi) \right] \\
&\qquad - \frac{1}{2}\alpha^T \left( \frac{\phi^2}{\psi^2}\Xi^{-1} + \frac{1}{\tau^2}P^{-1} \right)\alpha \\
&= C - \frac{1}{2}\sum_{k=1}^n \alpha(s_k) - \frac{1}{2\psi^2}\, (V_\alpha^{-1}Y_\phi - 2\phi\alpha)^T \Xi^{-1} (V_\alpha^{-1}Y_\phi) \\
&\qquad - \frac{1}{2}\alpha^T \left( \frac{\phi^2}{\psi^2}\Xi^{-1} + \frac{1}{\tau^2}P^{-1} \right)\alpha. \tag{3.4.2}
\end{align*}

In this chapter, we use the Metropolis-adjusted Langevin algorithm (MALA) to sequentially update the components of $\alpha$. For $k = 1, \cdots, n$, let $l_t(\alpha(s_k))$ denote the full conditional log likelihood function of $\alpha(s_k)$ (up to a constant in $\alpha(s_k)$) evaluated at the most recent draws of all other model parameters in iteration $t$. Then,

i. Given the sample of $\alpha(s_k)$ from iteration $t-1$, $\alpha^{[t-1]}(s_k)$, draw a sample $\alpha^*(s_k)$ from the proposal distribution
\[
N\left( \alpha^{[t-1]}(s_k) + \frac{\varsigma_0^2}{2}\, \nabla l_t(\alpha^{[t-1]}(s_k)), \; \varsigma_0^2 \right),
\]
where $\varsigma_0$ is a user-defined tuning parameter.

ii. Calculate the ratio $e^{\Delta_t}$, where
\begin{align*}
\Delta_t &= \left\{ l_t(\alpha^*(s_k)) - \frac{1}{2\varsigma_0^2} \left[ \alpha^{[t-1]}(s_k) - \alpha^*(s_k) - \frac{\varsigma_0^2}{2}\nabla l_t(\alpha^*(s_k)) \right]^2 \right\} \\
&\quad - \left\{ l_t(\alpha^{[t-1]}(s_k)) - \frac{1}{2\varsigma_0^2} \left[ \alpha^*(s_k) - \alpha^{[t-1]}(s_k) - \frac{\varsigma_0^2}{2}\nabla l_t(\alpha^{[t-1]}(s_k)) \right]^2 \right\}.
\end{align*}

iii. Set $\alpha^{[t]}(s_k) = \alpha^*(s_k)$ with probability $\min(1, e^{\Delta_t})$. Otherwise, set $\alpha^{[t]}(s_k) = \alpha^{[t-1]}(s_k)$.
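The three steps above can be sketched for a generic univariate log density. Here a standard normal target stands in for the full conditional of $\alpha(s_k)$, and the tuning value $\varsigma_0 = 1$ is an arbitrary illustrative choice:

```python
import numpy as np

# MALA on a univariate log density l with gradient grad_l.
rng = np.random.default_rng(42)

def l(a):          # log target (up to a constant): standard normal
    return -0.5 * a**2

def grad_l(a):     # its gradient
    return -a

sigma0 = 1.0       # tuning parameter varsigma_0
a, draws = 0.0, []
for t in range(20000):
    # i. Langevin proposal centered at the gradient-shifted current value
    mean_fwd = a + 0.5 * sigma0**2 * grad_l(a)
    a_star = rng.normal(mean_fwd, sigma0)
    # ii. log acceptance ratio Delta_t (target ratio plus proposal correction)
    mean_bwd = a_star + 0.5 * sigma0**2 * grad_l(a_star)
    delta = (l(a_star) - (a - mean_bwd)**2 / (2 * sigma0**2)) \
          - (l(a)      - (a_star - mean_fwd)**2 / (2 * sigma0**2))
    # iii. accept with probability min(1, e^Delta)
    if np.log(rng.uniform()) < delta:
        a = a_star
    draws.append(a)

draws = np.array(draws[2000:])   # discard burn-in
print(abs(draws.mean()) < 0.1, abs(draws.var() - 1.0) < 0.1)
```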

A simpler form of the full conditional log likelihood function of $\alpha(s_k)$, $k = 1, \ldots, n$, can be derived from (3.4.2), as shown below (with subscript $t$ omitted for simplicity). For convenience, let
\[
\Omega = \Xi^{-1} \quad \text{and} \quad \Lambda = \frac{\phi^2}{\psi^2}\Xi^{-1} + \frac{1}{\tau^2}P^{-1}.
\]
Now let $\Omega_i$ and $\Lambda_i$ denote the $i$-th column vectors of $\Omega$ and $\Lambda$, respectively. Since $\Omega$ and $\Lambda$ are both symmetric, $\Omega_i^T$ and $\Lambda_i^T$ are thus their respective $i$-th row vectors. In addition, let $\Omega_{ij}$ and $\Lambda_{ij}$ denote the $(i,j)$-th elements of the respective matrices. Finally, let $W_\alpha^{-1} = (e^{-\alpha(s_1)/2}, \cdots, e^{-\alpha(s_n)/2})^T$ and let $C_k$ be a scalar that does not depend on $\alpha(s_k)$. Ignoring the terms in (3.4.2) that do not contain $\alpha(s_k)$, we have
\begin{align*}
l(\alpha(s_k)) &= C_k - \frac{1}{2}\alpha(s_k) - \frac{1}{2\psi^2}\bigg\{ \left( e^{-\alpha(s_k)/2} Y_{\phi k} - 2\phi\,\alpha(s_k) \right) \Omega_k \cdot (W_\alpha^{-1} \circ Y_\phi) \\
&\qquad + e^{-\alpha(s_k)/2} Y_{\phi k}\, \Omega_k \cdot (W_\alpha^{-1} \circ Y_\phi - 2\phi\alpha) \\
&\qquad - \left( e^{-\alpha(s_k)/2} Y_{\phi k} - 2\phi\,\alpha(s_k) \right) \Omega_{kk} \left( e^{-\alpha(s_k)/2} Y_{\phi k} \right) \bigg\} \\
&\qquad - \frac{1}{2}\left[ 2\alpha(s_k)\, \Lambda_k \cdot \alpha - \alpha(s_k)^2\, \Lambda_{kk} \right].
\end{align*}

To calculate the gradient of $l(\alpha(s_k))$, we can use the following general result.

Lemma 4. Let $\omega$ be a $p \times 1$ column vector and $M$ be a $p \times p$ symmetric matrix whose $k$-th column vector is denoted $M_k$. Let $f$ and $g$ be two differentiable functions
\[
f, g: \mathbb{R} \to \mathbb{R},
\]
and define
\[
f(\omega) = (f(\omega_1), \ldots, f(\omega_p))^T, \qquad g(\omega) = (g(\omega_1), \ldots, g(\omega_p))^T.
\]
Then, for $k = 1, \ldots, p$, we have
\[
\frac{\partial}{\partial \omega_k}\, f^T(\omega)\, M\, g(\omega) = f'(\omega_k)\, M_k \cdot g(\omega) + g'(\omega_k)\, M_k \cdot f(\omega).
\]

The proof is trivial: it involves expanding $f^T(\omega) M g(\omega)$ into a quadratic form and then applying basic calculus rules. By Lemma 4 and (3.4.2), the gradient of $l(\alpha(s_k))$ is given by
\begin{align*}
\nabla l(\alpha(s_k)) &= \frac{\partial}{\partial \alpha(s_k)}\, l(\alpha(s_k)) \\
&= -\frac{1}{2} - \frac{1}{2\psi^2}\bigg\{ \left( -\frac{1}{2} e^{-\alpha(s_k)/2} Y_{\phi k} - 2\phi \right) \Omega_k \cdot (W_\alpha^{-1} \circ Y_\phi) \\
&\qquad + \left( -\frac{1}{2} e^{-\alpha(s_k)/2} Y_{\phi k} \right) \Omega_k \cdot (W_\alpha^{-1} \circ Y_\phi - 2\phi\alpha) \bigg\} - \Lambda_k \cdot \alpha.
\end{align*}
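Lemma 4, and hence the structure of the gradient above, can be checked numerically with finite differences; the test functions $f$ and $g$ below are arbitrary choices made for illustration:

```python
import numpy as np

# Verify d/d(omega_k) [f(omega)^T M g(omega)]
#      = f'(omega_k) M_k . g(omega) + g'(omega_k) M_k . f(omega)
rng = np.random.default_rng(3)
p = 4
A = rng.normal(size=(p, p))
M = A + A.T                                  # symmetric matrix
omega = rng.normal(size=p)

f, fprime = np.exp, np.exp                   # f(x) = e^x
g, gprime = np.sin, np.cos                   # g(x) = sin(x)

def quad(w):
    return f(w) @ M @ g(w)

k, h = 2, 1e-6
e_k = np.eye(p)[k]
numeric = (quad(omega + h * e_k) - quad(omega - h * e_k)) / (2 * h)
analytic = fprime(omega[k]) * (M[:, k] @ g(omega)) + gprime(omega[k]) * (M[:, k] @ f(omega))
print(abs(numeric - analytic) < 1e-5)
```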

2) Update $\beta$ and $\phi$ from a multivariate normal full conditional distribution. Let $X^* = [X \;\; \tilde{W} \circ \alpha]$, $D = (X^{*T}X^*)^{-1}X^{*T}$ and $S_D = \psi^2 D \tilde{V} \Xi \tilde{V} D^T$. We assume the prior distribution for $[\beta^T \; \phi]^T$ is multivariate normal with mean $m_0$ and covariance matrix $S_0$, $N(m_0, S_0)$; then the full conditional for $[\beta^T \; \phi]^T$ is
\[
[\beta^T \; \phi]^T \mid Y, \alpha, \tilde{\delta}, \kappa, \psi^2 \sim N(m_1, S_1),
\]
where
\begin{align*}
S_1 &= (S_0^{-1} + S_D^{-1})^{-1} = S_0(S_0 + S_D)^{-1}S_D = S_D(S_0 + S_D)^{-1}S_0, \\
m_1 &= (S_0^{-1} + S_D^{-1})^{-1}(S_0^{-1}m_0 + S_D^{-1}DY) \\
&= S_D(S_0 + S_D)^{-1}m_0 + S_0(S_0 + S_D)^{-1}DY.
\end{align*}
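The two algebraic forms of $S_1$ and $m_1$ given above can be verified numerically; the dimension and matrices below are illustrative, not taken from the data analysis:

```python
import numpy as np

# Check (S0^{-1} + SD^{-1})^{-1} = SD (S0 + SD)^{-1} S0 and the matching
# identity for the posterior mean, with random positive definite matrices.
rng = np.random.default_rng(7)
q = 3                                   # dimension of [beta^T phi]^T
A0, AD = rng.normal(size=(q, q)), rng.normal(size=(q, q))
S0 = A0 @ A0.T + q * np.eye(q)          # prior covariance (positive definite)
SD = AD @ AD.T + q * np.eye(q)          # "data" covariance (positive definite)
m0 = rng.normal(size=q)
DY = rng.normal(size=q)                 # stands in for D @ Y

S1_a = np.linalg.inv(np.linalg.inv(S0) + np.linalg.inv(SD))
S1_b = S0 @ np.linalg.solve(S0 + SD, SD)

m1_a = S1_a @ (np.linalg.solve(S0, m0) + np.linalg.solve(SD, DY))
m1_b = SD @ np.linalg.solve(S0 + SD, m0) + S0 @ np.linalg.solve(S0 + SD, DY)

print(np.allclose(S1_a, S1_b), np.allclose(m1_a, m1_b))
```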

3) Update the variance parameters $\tau^2$ and $\psi^2$ from their full conditional distributions. We consider independent conjugate inverse gamma priors for $\tau^2$ and $\psi^2$. Since the parameters $\tau^2$ and $\lambda$ have similar effects on the covariance function of the HASP process, as illustrated in Figure 3.2, it might be desirable to use a highly informative prior for at least one of the two parameters.

Suppose the prior for $\tau^2$ is $IG(a_0, b_0)$ and the prior for $\psi^2$ is $IG(c_0, d_0)$; then their full conditional distributions are given, respectively, by
\[
IG\left(a_0 + \frac{n}{2}, \; b_0 + \frac{1}{2}\alpha^T P^{-1} \alpha\right)
\]
and
\[
IG\left(c_0 + \frac{n}{2}, \; d_0 + \frac{1}{2}Y^{*T}(\tilde{V}\Xi\tilde{V})^{-1}Y^*\right),
\]
where
\[
Y^* = Y - X\beta - \phi\tilde{V}\alpha.
\]
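The inverse gamma Gibbs step for $\tau^2$ can be sketched as follows, with an illustrative exponential correlation matrix and latent draw (not the dissertation's code):

```python
import numpy as np

# tau^2 | . ~ IG(a0 + n/2, b0 + alpha^T P^{-1} alpha / 2)
rng = np.random.default_rng(11)
n, a0, b0 = 6, 2.0, 1.0
sites = rng.uniform(0, 10, size=(n, 2))
dist = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)
P = np.exp(-dist / 2.0)                      # exponential correlation matrix
alpha = rng.normal(size=n)

shape = a0 + 0.5 * n
rate = b0 + 0.5 * alpha @ np.linalg.solve(P, alpha)
# An IG(shape, rate) draw is the reciprocal of a Gamma(shape, scale=1/rate) draw
tau2_draw = 1.0 / rng.gamma(shape, 1.0 / rate)
print(shape == a0 + 0.5 * n, rate > b0, tau2_draw > 0)
```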

4) Update $\lambda$ and $\kappa$ from their full conditional distributions using Metropolis-Hastings steps. The full conditional distribution for $\lambda$ is
\[
\pi(\lambda \mid Y, \alpha, \phi, \psi^2, \tau^2, \beta, \tilde{\delta}) \propto |P|^{-1/2} \exp\left\{ -\frac{1}{2\tau^2}\, \alpha^T P^{-1} \alpha \right\} \pi(\lambda),
\]
while the full conditional distribution for $\kappa$ is given by
\[
\pi(\kappa \mid Y, \alpha, \phi, \psi^2, \tau^2, \beta, \tilde{\delta}) \propto |\Xi|^{-1/2} \exp\left\{ -\frac{1}{2\psi^2}\, Y^{*T} (\tilde{V}\Xi\tilde{V})^{-1} Y^* \right\} \pi(\kappa).
\]
In the equations above, $\pi(\lambda)$ and $\pi(\kappa)$ denote the probability density functions of the prior distributions of $\lambda$ and $\kappa$, respectively. In practice, we can choose an exponential distribution, a gamma distribution, or another distribution with positive support as the prior distribution. In our implementation, a truncated normal distribution is used as the proposal distribution in the Metropolis-Hastings algorithm.

5) Update $\tilde{\delta}$ from its full conditional distribution using a Metropolis-Hastings step. Depending on model assumptions, $\tilde{\delta}$ could be univariate, multivariate, or not needed in the model at all (when the original parameter $\delta$ is univariate). We use a (potentially multivariate) normal distribution with mean $\tilde{m}_0$ and covariance matrix $\tilde{S}_0$ as the prior for $\tilde{\delta}$. The full conditional distribution of $\tilde{\delta}$ is thus
\begin{align*}
\pi(\tilde{\delta} \mid Y, \alpha, \phi, \psi^2, \tau^2, \beta, \kappa)
&\propto |\tilde{V}_\delta|^{-1} \exp\left\{ -\frac{1}{2\psi^2}\, Y^{*T} (\tilde{V}\Xi\tilde{V})^{-1} Y^* \right\} \exp\left\{ -\frac{1}{2} (\tilde{\delta} - \tilde{m}_0)^T \tilde{S}_0^{-1} (\tilde{\delta} - \tilde{m}_0) \right\} \\
&\propto |\tilde{V}_\delta|^{-1} \exp\left\{ -\frac{1}{2\psi^2} \left[ \tilde{V}_\delta^{-1} V_\alpha^{-1}(Y - X\beta) - \phi\alpha \right]^T \Xi^{-1} \left[ \tilde{V}_\delta^{-1} V_\alpha^{-1}(Y - X\beta) - \phi\alpha \right] \right\} \\
&\qquad \times \exp\left\{ -\frac{1}{2} (\tilde{\delta} - \tilde{m}_0)^T \tilde{S}_0^{-1} (\tilde{\delta} - \tilde{m}_0) \right\}.
\end{align*}

3.4.3 Spatial Prediction

The objective of many spatial smoothing problems is to make predictions of the spatial process at unobserved locations. In other words, we need to calculate $E[Y(s_0) \mid Y]$ for $s_0 \in D \subset \mathbb{R}^d$, as well as understand the uncertainty associated with this prediction. The commonly used kriging estimator is essentially a linear predictor based on a weighted average of the observed values, where the weights depend on the known or estimated covariance structure of the spatial process (see, e.g., Cressie, 1993). Under the Bayesian framework, where we use MCMC to sample from the posterior distribution, the spatial prediction problem becomes drawing samples from the predictive distribution $p(Y(s_0) \mid Y)$. Based on these samples, we can estimate the mean $E[Y(s_0) \mid Y]$ and evaluate other properties of $p(Y(s_0) \mid Y)$, such as the spread, skewness and percentiles.

In the context of making spatial predictions based on the HASP model, we need to simulate from the predictive distribution

\[
p(Y(s_0) \mid Y) = \int p(Y(s_0) \mid Y, \alpha, \theta)\, p(\alpha, \theta \mid Y)\, d\alpha\, d\theta. \tag{3.4.3}
\]

We present two strategies to sample from this predictive distribution. The first approach relies on the original parameterization of the HASP model (3.3.1) and is applicable for any choice of the correlation functions. Note that $Y(s_0)$ can be computed deterministically given a sample of $\alpha(s_0)$, $\varepsilon(s_0)$ and $\theta$ from the conditional distribution $p(\alpha(s_0), \varepsilon(s_0), \theta \mid Y)$. At the same time, note that sampling from $p(\alpha(s_0), \varepsilon(s_0), \theta \mid Y)$ is straightforward given the following result:
\begin{align*}
p(\alpha(s_0), \varepsilon(s_0), \theta \mid Y)
&= \int p(\alpha(s_0), \varepsilon(s_0) \mid Y, \alpha, \theta)\, p(\alpha, \theta \mid Y)\, d\alpha \\
&= \int p(\alpha(s_0), \varepsilon(s_0) \mid \alpha, \epsilon, \theta)\, p(\alpha, \theta \mid Y)\, d\alpha, \tag{3.4.4}
\end{align*}
where $p(\alpha(s_0), \varepsilon(s_0) \mid \alpha, \epsilon, \theta)$ is a multivariate normal distribution, which can be easily obtained from the joint multivariate normal distribution $p(\alpha(s_0), \varepsilon(s_0), \alpha, \epsilon \mid \theta)$. The second equality in (3.4.4) holds because the sigma field generated by $(Y, \alpha, \theta)$ satisfies
\[
\sigma\{Y, \alpha, \theta\} = \sigma\{Y, \alpha, \epsilon, \theta\} = \sigma\{\alpha, \epsilon, \theta\}.
\]

Therefore, we propose the following strategy to draw samples from the predictive distribution of Y (s0)|Y within the MCMC algorithm for model fitting.

i) In each MCMC iteration $t$, record the posterior samples of $\alpha$ and $\theta$ as $\alpha^{[t]}$ and $\theta^{[t]}$. Calculate $\varepsilon(s_i)^{[t]}$ for $i = 1, \ldots, n$ from
\[
\varepsilon(s_i)^{[t]} = \exp\left\{ -\frac{H^T(s_i)\delta^{[t]} + \alpha(s_i)^{[t]}}{2} \right\} \left( Y(s_i) - X^T(s_i)\beta^{[t]} \right).
\]

ii) Given $\alpha^{[t]}$, $\epsilon^{[t]}$ and $\theta^{[t]}$, draw a sample $(\alpha(s_0)^{[t]}, \varepsilon(s_0)^{[t]})$ from the conditional distribution $p(\alpha(s_0), \varepsilon(s_0) \mid \alpha, \epsilon, \theta)$ in (3.4.4).

iii) Compute $Y(s_0)^{[t]}$ according to
\[
Y(s_0)^{[t]} = X^T(s_0)\beta^{[t]} + \exp\left\{ \frac{H^T(s_0)\delta^{[t]} + \alpha(s_0)^{[t]}}{2} \right\} \varepsilon(s_0)^{[t]}.
\]
Then $Y(s_0)^{[t]}$ is a sample from the predictive distribution $p(Y(s_0) \mid Y)$.

Our second approach works specifically for the two strategies detailed in Section 3.3.4. As discussed previously, strategy I is essentially a special case of strategy II when $\Xi = P$. Therefore, we use equation (3.3.17) to demonstrate the process of simulating from the predictive distribution. Let $\xi = (\xi(s_1), \ldots, \xi(s_n))^T$. Then equation (3.4.4) becomes
\begin{align*}
p(\alpha(s_0), \xi(s_0), \theta \mid Y)
&= \int p(\alpha(s_0), \xi(s_0) \mid Y, \alpha, \theta)\, p(\alpha, \theta \mid Y)\, d\alpha \\
&= \int p(\alpha(s_0), \xi(s_0) \mid \alpha, \xi, \theta)\, p(\alpha, \theta \mid Y)\, d\alpha \\
&= \int p(\alpha(s_0) \mid \alpha, \theta)\, p(\xi(s_0) \mid \xi, \theta)\, p(\alpha, \theta \mid Y)\, d\alpha. \tag{3.4.5}
\end{align*}
As before, the second equality in (3.4.5) holds because the sigma field generated by $(Y, \alpha, \theta)$ satisfies
\[
\sigma\{Y, \alpha, \theta\} = \sigma\{Y, \alpha, \xi, \theta\} = \sigma\{\alpha, \xi, \theta\},
\]
and the third equality follows from the independence of the processes $\alpha(s)$ and $\xi(s)$, $s \in D \subset \mathbb{R}^d$.

The resulting procedure for sampling from p(Y (s0)|Y ) is as follows.

i) In each MCMC iteration $t$, record the posterior samples of $\alpha$ and $\theta$ as $\alpha^{[t]}$ and $\theta^{[t]}$. Calculate $\xi(s_i)^{[t]}$ for $i = 1, \ldots, n$ from
\[
\xi(s_i)^{[t]} = (\psi^{[t]})^{-1} \left[ \exp\left\{ -\frac{H^T(s_i)\tilde{\delta}^{[t]} + \alpha(s_i)^{[t]}}{2} \right\} \left( Y(s_i) - X^T(s_i)\beta^{[t]} \right) - \phi^{[t]}\alpha(s_i)^{[t]} \right].
\]

ii) Given $\alpha^{[t]}$, $\xi^{[t]}$ and $\theta^{[t]}$, draw a sample $(\alpha(s_0)^{[t]}, \xi(s_0)^{[t]})$ from the conditional distribution $p(\alpha(s_0), \xi(s_0) \mid \alpha, \xi, \theta)$. According to (3.4.5), this can be further broken down into two steps.

a. Draw $\alpha(s_0)^{[t]}$ from the normal distribution $p(\alpha(s_0) \mid \alpha^{[t]}, \theta^{[t]})$, which is given by
\[
N\left( (P_0^{[t]})^T (P^{[t]})^{-1} \alpha^{[t]}, \; (\tau^{[t]})^2 \left[ P_{00} - (P_0^{[t]})^T (P^{[t]})^{-1} P_0^{[t]} \right] \right),
\]
where $P_{00} = 1$ and $P_0^{[t]}$ is a column vector whose $i$-th element is given by $\rho_\alpha(\|s_0 - s_i\|; \lambda^{[t]})$.

b. Draw $\xi(s_0)^{[t]}$ from the normal distribution $p(\xi(s_0) \mid \xi^{[t]}, \theta^{[t]})$, given by
\[
N\left( (\Xi_0^{[t]})^T (\Xi^{[t]})^{-1} \xi^{[t]}, \; \Xi_{00} - (\Xi_0^{[t]})^T (\Xi^{[t]})^{-1} \Xi_0^{[t]} \right),
\]
where $\Xi_{00} = 1$ and $\Xi_0^{[t]}$ is a column vector whose $i$-th element is given by the correlation function of $\xi(s)$ evaluated at $\|s_0 - s_i\|$ with parameter $\kappa^{[t]}$.

iii) Compute $Y(s_0)^{[t]}$ according to
\begin{align*}
Y(s_0)^{[t]} &= X^T(s_0)\beta^{[t]} + \phi^{[t]} \exp\left\{ \frac{H^T(s_0)\tilde{\delta}^{[t]} + \alpha(s_0)^{[t]}}{2} \right\} \alpha(s_0)^{[t]} \\
&\quad + \psi^{[t]} \exp\left\{ \frac{H^T(s_0)\tilde{\delta}^{[t]} + \alpha(s_0)^{[t]}}{2} \right\} \xi(s_0)^{[t]}.
\end{align*}
Then $Y(s_0)^{[t]}$ is a sample from the predictive distribution $p(Y(s_0) \mid Y)$.
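Step ii-a of this procedure, the conditional Gaussian draw of $\alpha(s_0)$, can be sketched as follows, assuming an exponential correlation $\rho_\alpha(h; \lambda) = \exp(-h/\lambda)$ and illustrative parameter values:

```python
import numpy as np

# Conditional (kriging-style) draw of alpha(s0) given alpha at observed sites:
# mean (P0)^T P^{-1} alpha, variance tau^2 [P00 - (P0)^T P^{-1} P0], P00 = 1.
rng = np.random.default_rng(21)
n, tau2, lam = 8, 1.5, 2.0
sites = rng.uniform(0, 10, size=(n, 2))
s0 = np.array([5.0, 5.0])

dist = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)
P = np.exp(-dist / lam)                                  # correlation among sites
P0 = np.exp(-np.linalg.norm(sites - s0, axis=1) / lam)   # correlation with s0

alpha = rng.multivariate_normal(np.zeros(n), tau2 * P)

w = np.linalg.solve(P, P0)            # kriging-style weights P^{-1} P0
cond_mean = w @ alpha                 # (P0)^T P^{-1} alpha  (P is symmetric)
cond_var = tau2 * (1.0 - P0 @ w)      # tau^2 [1 - (P0)^T P^{-1} P0]
alpha_s0 = rng.normal(cond_mean, np.sqrt(cond_var))

# Conditioning can only reduce the marginal variance tau^2
print(0.0 < cond_var < tau2)
```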

3.5 Using HASP to Model the Nitrogen Dioxide Pollution Data in the Contiguous United States

Nitrogen dioxide (NO2) is a major air pollutant that is closely monitored by the United States Environmental Protection Agency (EPA) as one of the six criteria air pollutants (common pollutants that EPA uses to set national air quality standards, which include ozone, particulate matter, carbon monoxide, nitrogen dioxide, sulfur dioxide and lead, according to the EPA website http://www.epa.gov/airquality/urbanair). According to the World Health Organization (WHO Press, 2006), both short-term and long-term exposures to elevated NO2 concentrations could have adverse effects on pulmonary functions. In addition, NO2 is often used as a marker for the cocktail of combustion-related pollutants, such as ultrafine particles, nitrous oxide, particulate matter and benzene.

Ambient concentrations of nitrogen dioxide are measured hourly (notionally) by EPA monitoring stations nationwide and stored in EPA's Air Quality System. Both the raw data and the aggregated data can be obtained from the EPA website (http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html). For our analysis, we use the daily average of the NO2 concentration values measured hourly on September 9, 2013. The data was retrieved on February 27, 2014. The measurement unit is parts per billion by volume (ppb). Only non-zero measurements not associated with any exceptional events are used in our analysis, and for duplicated records at a single site (due to more than one instrument at a site), we only retain one record.

There are 349 values observed at different sites meeting these criteria, out of which we randomly selected 25 observations as our test set to evaluate the predictive performance of our models; we then fit the HASP and competing models on the remaining 327 observations. The sites and the NO2 concentration levels are illustrated in Figure 3.8.

It can be clearly seen from Figure 3.8 that the NO2 concentration data is skewed.

Since the emission of NO2 is mostly contributed by combustion-related sources, such as road traffic and indoor combustion, we tend to observe high levels of NO2 concentrations in large urban centers. Further scrutiny of the NO2 concentration map suggests that the high NO2 concentration levels in urban centers can diminish very quickly outward, suggesting that its spatial correlation length might not be very

112 ● ●

● ● ● ● ●● ● ● ● ●

● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●●●●●● ● ● ● ● ● ● ●● ●● ● ● ●● ● ●●●● ●●● ● ● ●●● ● ● ● ● ●●● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ●● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ● ●● ● ●● ● ● ●●●●● ●● ●● ●● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●

● ● ● ● ●● ● ● ●● ●●● ●● ● ●●●● ● ● ● ● ●● ●

● ●● 80 60 60 50 50 60 40 40 40 30 30 20 20 20 10 10 0 0 0

0 5 10 15 20 25 30 −2 −1 0 1 2 3 0 1 2 3 4 5 Histogram of the original data (y) Histogram of log(y) Histogram of sqrt(y)

Figure 3.8: The ambient NO2 concentration levels measured across the contiguous United States on September 9, 2013. In the top graph, each circle represents an observation site. Blue circles represent the observation sites used in the training dataset while the red circles indicate sites randomly chosen for prediction evaluation. The size of the circles (measured by the area not the radius) is proportional to the values of the NO2 concentration levels. The three histograms show the distributions of the original data, the log transformed data and the square root transformed NO2 concentrations.

large. A common approach for modeling right-skewed positive data in practice is to transform the original data using a log or square root function. The histograms of the original NO2 concentrations as well as the log and square root transformed data are also presented in Figure 3.8. Although the histograms are not based on independently observed data, we can still use them to inform us about the possible shape of the marginal distribution of the spatial process. It can be seen that the log transformation in this case over-corrects the positive skewness, while the square root transformed data is closer to normal but remains slightly right skewed. In addition, we also tried a cube root transformation, which still renders a histogram with some positive skewness. On the other hand, since the observations are not iid, it is hard to argue that a slight skewness observed in the histogram is necessarily due to skewness in the marginal distribution.

From our observations above, it does not seem a good strategy to fit a Gaussian process or a symmetric non-Gaussian model to the original data. The HASP model, on the other hand, can be applied to both the original and the transformed data. It would be of practical interest to see how the model performances based on the original as well as the transformed data compare with each other. As a result, we decided to fit the following three models on both the original and the square root transformed data: (i) a Gaussian process (GP); (ii) a symmetric non-Gaussian process (SHP); and (iii) the proposed asymmetric non-Gaussian process (HASP).

It is sometimes desirable to include the so-called nugget effect in spatial processes (see, e.g., Cressie, 1993). The nugget effect describes the discontinuity of the covariance function at the origin, which can either be explained by independent measurement errors at each observational site, or by micro-scale variability which cannot be distinguished from the effect of measurement errors (see, e.g., Gneiting and Guttorp, 2010). According to Gneiting and Guttorp (2010), the term "nugget" comes from mining applications, where the occurrence of gold nuggets shows substantial micro-scale variability. It is important to develop a HASP process with a nugget effect in our future work, but for comparing the aforementioned three processes in this section, we focus on models without any nugget effect. We use the MCMC algorithm developed in the previous section to fit the HASP model and the SHP model (by setting $\phi = 0$). The GP model is also fit under the Bayesian framework. All the covariance functions are assumed to be exponential since the sample path shown in Figure 3.8 is not smooth (the empirical semivariogram can also be used for the choice of the covariance functions, but it is not informative in this case). In particular, for the SHP model, we fit

\[
Y(s) = X^T(s)\beta + \sigma^2 \exp\left( \frac{\alpha(s)}{2} \right) \varepsilon(s),
\]
where the covariance functions for the processes $\alpha(s)$ and $\varepsilon(s)$ are assumed to be $\exp(-\Delta s/\lambda)$ and $\exp(-\Delta s/\kappa)$, respectively (note that the parameterization here is different from the previous sections). The parameters $\lambda$ and $\kappa$ are assumed to be different and have diffuse but informative gamma priors. For the HASP model, we fit
\[
Y(s) = X^T(s)\beta + \phi\,\alpha(s) \exp\left( \frac{\alpha(s)}{2} \right) + \psi \exp\left( \frac{\alpha(s)}{2} \right) \xi(s),
\]
where the covariance functions for the processes $\alpha(s)$ and $\xi(s)$ are $\exp(-\Delta s/\lambda)$ and $\exp(-\Delta s/\kappa)$, respectively. Based on our experience, the chain does not mix very well when $\lambda$ and $\kappa$ are allowed to be estimated separately; thus, as discussed in Section 3.3.4, we require that $\lambda = \kappa$. In addition, exploratory analyses did not reveal

any significant spatial trend, so a constant mean is assumed for all three models, i.e., $X^T(s) = 1$.

We implemented the model fitting algorithms in C++11 with the use of the linear algebra library Armadillo (Sanderson, 2010) and a Mersenne-Twister pseudo-random generator of 32-bit numbers with a state size of 19937 bits from the C++11 standard library. The Armadillo library is linked with the Accelerate framework under Mac OS X 10.9.3. We carefully adjusted the tuning parameters and the initial values of the Metropolis-Hastings and Langevin steps in advance to speed up the convergence of the chains. For each model, we ran the chain for 100,000 iterations and thinned it every 20 iterations.

From plots a) and b) in Figure 3.9, we can clearly see that HASP is able to capture the skewness in both the original data and the transformed data. Since the degree of skewness increases with the co-locational correlation coefficient $\varrho$, as illustrated in Figure 3.1, smaller values of $\varrho$ for the square root transformed data are expected. Plots c) to f) are presented to illustrate an interesting phenomenon: when a Gaussian process is scaled by an exponential Gaussian process, the correlation in the composite process is attenuated compared to the original Gaussian process, as illustrated in Figures 3.2 and 3.3.

To compare the performances of the three models on both the original and the transformed data, we examine the predictive accuracy for the NO2 levels (on the original scale) at the 25 randomly selected test sites. Figures 3.10 and 3.11 show the predictive distributions for the original NO2 level at each site based on the original data and the transformed data, respectively. Note that when the models are fitted on the original data, the predictive distributions under the HASP model are strongly right


Figure 3.9: The three graphs in the first column are based on the original data while those in the second column are based on the transformed data. Plots a) and b) show the posterior density of the co-locational correlation coefficient %, which also measures the degree of skewness in the data. Plots c) and d) present the posterior density of the parameter κ, which is the correlation length in the Gaussian process as well as the respective process ε(s) and ξ(s) in the SHP and HASP model. Finally, plots e) and f) demonstrate the implied correlation functions of the fitted Gaussian and non-Gaussian processes.


Figure 3.10: Predictive distributions for the NO2 levels at the 25 sites in the test dataset, which are based on the three models fitted on the original data. The red lines represent the predictive distributions of the HASP model, while the blue and grey lines show the predictive distributions of the SHP and GP models, respectively. The dashed lines represent the actual observed values at these test sites.

118 (1) (2) (3) (4) (5)

(6) (7) (8) (9) (10)

(11) (12) (13) (14) (15)

(16) (17) (18) (19) (20)

(21) (22) (23) (24) (25)

Figure 3.11: Predictive distributions for the NO2 levels on the original scale at the 25 sites in the test dataset, which are based on the three models fitted on the transformed data. The red lines represent the predictive distributions of the HASP model, while the blue and grey lines show the predictive distributions of the SHP and GP models, respectively. The dashed lines represent the actual observed values at these test sites.

                              Original Data          Transformed Data
Criteria                     HASP    SHP     GP     HASP    SHP     GP
RMSE of Posterior Mean       4.15    4.25    4.33   4.16    4.23    4.23
MAE of Posterior Median      3.61    3.71    3.79   3.56    3.60    3.64

Table 3.1: Predictive performance of the GP, SHP and HASP models fitted on the original and the transformed data. The highlighted values are the smallest within each group.

skewed, while those under the SHP and GP models are more or less symmetric. In addition, the predictive distributions of the GP model based on the original data tend to have larger variances than those of the corresponding SHP model, potentially due to the skewness in the data and the absence of a nugget effect in the model. As for the models based on the transformed data, all three processes produce right skewed predictive distributions because of the square transformation back to the original scale. Again, the predictive distributions under the HASP model exhibit more skewness than those under the other two models.

To quantify the accuracy of the point estimates obtained from each model, we compare the root mean squared error (RMSE)
\[
\sqrt{ \frac{1}{25} \sum_{i=1}^{25} \left[ \hat{Y}(s_i)^{M_k} - Y(s_i) \right]^2 }
\]
and the mean absolute error (MAE)
\[
\frac{1}{25} \sum_{i=1}^{25} \left| \tilde{Y}(s_i)^{M_k} - Y(s_i) \right|,
\]
where $\hat{Y}(s_i)^{M_k}$ denotes the posterior mean of the predictive distribution for site $i$ under model $k$, and $\tilde{Y}(s_i)^{M_k}$ represents the posterior median of the predictive distribution for site $i$ under model $k$. The results are presented in Table 3.1. First of all,

as expected, the performance of the SHP and the GP models fitted on the original data is worse than on the transformed data, due to the different degrees of skewness in the two datasets. The performances of the HASP model fitted on the two datasets, on the other hand, are comparable in terms of the RMSE of the posterior means. Secondly, for both the original and the transformed NO2 concentration levels, the HASP model is able to achieve better performance than the SHP and GP processes, in terms of the RMSE of the posterior mean and the MAE of the posterior median. This is not surprising for the original NO2 data, but for the square-root-transformed NO2 concentrations, it seems that even though the skewness in the process is mild, explicitly modeling it can still be beneficial for model performance.
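These two criteria are straightforward to compute from posterior predictive samples; the sketch below uses synthetic samples and observations, not the NO2 data:

```python
import numpy as np

# RMSE of the posterior mean and MAE of the posterior median, computed from
# posterior predictive samples at held-out test sites.
rng = np.random.default_rng(5)
n_sites, n_draws = 25, 1000
y_obs = rng.gamma(shape=2.0, scale=3.0, size=n_sites)             # held-out values
pred_samples = y_obs[:, None] + rng.normal(0, 2.0, size=(n_sites, n_draws))

post_mean = pred_samples.mean(axis=1)        # hat{Y}(s_i)^{M_k}
post_median = np.median(pred_samples, axis=1)  # tilde{Y}(s_i)^{M_k}

rmse = np.sqrt(np.mean((post_mean - y_obs) ** 2))
mae = np.mean(np.abs(post_median - y_obs))
# With predictive samples centered on the truth, both errors should be small
print(rmse < 0.3, mae < 0.3)
```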

In summary, there exists strong skewness in the spatial data of daily average NO2 concentrations. Failure to account for this skewness results in poor predictive perfor- mance of the model. The log or square root transformation attenuates the skewness but does not eliminate it and we should still use a skewed non-Gaussian process such as HASP to model the transformed data. In addition, in the SHP and HASP models, we also find evidence of variance clustering, or spatial heteroscedasticity.

3.6 Discussion

In this chapter, we proposed a non-Gaussian spatial process that is able to capture the skewness, heavy tails and spatial heteroscedasticity in the data. Our model is motivated as a special case of a more general framework for constructing non-Gaussian processes. We studied the moments and the covariance structures of our heteroscedastic asymmetric non-Gaussian process, as well as the conditions on the covariance functions that ensure the process is well defined. We applied our model to the nitrogen dioxide concentration data and found that by capturing the skewness in the data, we can achieve better predictive performance. We found that even though the common transformation functions can reduce the skewness in the data, the "left-over" skewness after the transformation still needs to be accounted for.

Our analyses can be improved in several aspects. First, as discussed previously, we need to consider the nugget effect in our model to account for measurement errors or micro-scale variability. It would be interesting to compare the model performances again with the nugget effects explicitly modeled. In addition, our method will also be compared against the approach of De Oliveira et al. (1997) in our future work. Second, we assume that the two correlation length parameters, λ and κ, in the HASP model are the same. This is merely a modeling choice. We do not believe this should be the truth for real-world spatial data, nor do we believe that the covariance functions for the α(s) and ξ(s) (or ε(s) for the SHP model) should necessarily have the same smoothness either. It is necessary to develop a more efficient model fitting procedure so that we can relax some of the convenient assumptions we made and at the same time, facilities the application of the HASP model on much larger datasets. Third, our nonparametric model comparison criteria only takes into account the posterior mean or median at each test site. The other information contained in the predictive distribution, such as the spread, skewness or heavy tails, are not examined. Alterna- tively, we can consider the continuous ranked probability score suggested in Gneiting et al. (2007) which satisfies proper scoring rules and is robust. In addition, a model comparison criterion based on the joint predictive distribution should be preferred to

our current approach. Alternatively, we can also consider the Bayes factor. The challenge with computing the Bayes factor lies in the fact that we need to integrate out a high-dimensional vector of the latent process evaluated at the observed locations in order to obtain the marginal density. Therefore, approximation methods are usually required for the computation. For example, Palacios and Steel (2006) adopted the method of Verdinelli and Wasserman (1995), which is based on the Savage-Dickey density ratio. Other methods for approximating the Bayes factor also exist, and it would be interesting to evaluate their approximation accuracies for the HASP model under different settings. Last, Proposition 1 gives a general condition for the transformed process (3.2.1) to be mean square continuous. However, mean square continuity says nothing about the continuity or smoothness of the sample paths. Stein (1999) gave an example where the process {Z(s), s ∈ R} is mean square continuous but

P[Z(s) is continuous on R] = 0. For evaluating the smoothness of the sample path, Kent (1989) proved that if a covariance function C(h) for a d-dimensional random field is d-times continuously differentiable and satisfies

|C(h)| = O(|h|^{d+β}), as h → 0,

for some β > 0, then there exists a version of the d-dimensional random field {Y(s), s ∈ R^d} with continuous realizations. Whether we can find a general result for the mean square differentiability of the general non-Gaussian process (3.2.1), as well as the HASP model, remains an important question to be answered in our future work.

Chapter 4: Concluding Remarks and Future Work

4.1 Summary

In this dissertation, we proposed two approaches to modeling the non-Gaussian features, such as heavy tails and skewness, that are commonly observed in temporally and spatially indexed data. Our first method focuses on time series data, for which numerous non-Gaussian models have been developed in the past few decades, including the well-known stochastic volatility (SV) model. We proposed a semiparametric additive SV model that can capture the nonlinear (and potentially nonstationary) features of the latent volatility process. More importantly, we introduced shape constraints in our semiparametric model, which proved to be very important for achieving good model fit and predictive performance. Our second method focuses on modeling the non-Gaussian features of geostatistical data. Compared to the time series literature, the development of non-Gaussian models in the geostatistical literature is lacking, but it has received increasing attention. Our heteroscedastic asymmetric spatial process (HASP) is able to capture not only the heavy tails but also the skewness of non-Gaussian spatial data. When the spatial domain is reduced to the real line, our method can also be used as a curve fitting tool, which lends itself to an even wider range of applications.

Non-Gaussian stochastic processes are undoubtedly very useful in practice, and many research questions about them are yet to be answered. The remainder of this chapter is dedicated to improvements of our methods and their extensions in several directions.

4.2 Extensions of the Shape-Constrained Semiparametric Additive SV Model

4.2.1 GARCH-type models

For capturing the non-Gaussian features of financial time series, an alternative to the SV model is the family of ARCH-type models. The forms of the most commonly used parametric ARCH-type models are given in Section 1.2. Just like the SV model, the very commonly used GARCH model is a stationary white noise process with heavier tails than a Gaussian process, while the two popular alternatives to GARCH, the EGARCH and GJR-GARCH models, are able to capture the asymmetric impact of positive and negative returns on volatilities (for more details, see, e.g., Fan and Yao, 1998).

For the nonparametric estimation of (G)ARCH-type models, a general formulation is

r_t = σ_t ε_t, t ∈ Z;
σ_t = m(r_{t−1}, …, r_{t−d}),

where m is an unknown function taking values in R^+ and the number of lagged return terms, d ∈ Z^+, needs to be pre-specified by researchers. Models of this type have been examined in the literature as early as Pagan and Ullah (1988). However, the positivity of the function m(·) is not easily guaranteed, and estimating such a high-dimensional nonparametric surface suffers from the so-called “curse of dimensionality”. As a result, additional constraints on the function m are usually needed; among them, additivity constraints, such as those considered in the generalized additive model (Hastie and Tibshirani, 1986), are among the most popular. With the additivity constraint, we can consider a nonparametric ARCH-type model with

σ_t^2 = G(ω + Σ_{j=1}^{d} m_j(r_{t−j})), t ∈ Z, (4.2.1)

where the m_j(·)’s are unknown functions and G is a transformation function. To ensure the positivity of σ_t, G should be a function defined on R and taking values in R^+, such as the exponential function corresponding to the logarithmic link used in Yang et al. (1999) (also see Kim and Linton, 2004; Yang, 2006).
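To make the role of the link G concrete, here is a minimal Python sketch of the additive volatility map in (4.2.1). The component functions m_j and all parameter values are hypothetical illustrations, not estimates from this chapter.

```python
import numpy as np

def additive_arch_sigma2(returns, t, omega, m_list, G=np.exp):
    """Volatility from the additive ARCH-type form (4.2.1):
    sigma_t^2 = G(omega + sum_j m_j(r_{t-j})), with d = len(m_list)."""
    d = len(m_list)
    s = omega + sum(m_list[j](returns[t - 1 - j]) for j in range(d))
    return G(s)

# Illustrative choice: d = 2, quadratic components (hypothetical).
m_list = [lambda r: 0.1 * r**2, lambda r: 0.05 * r**2]
r = np.array([0.0, 1.0, -2.0, 0.5])
sigma2 = additive_arch_sigma2(r, t=3, omega=-1.0, m_list=m_list)
# With G = exp, sigma_t^2 is positive regardless of omega and the m_j values.
```

The point of the exponential link is visible here: the additive predictor inside G is negative (−0.55), yet the returned variance is still strictly positive.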

Because (4.2.1) specifies the volatility as a deterministic function of past returns, we do not need to estimate all the volatilities, σ_t, t = 1, …, T, in the model fitting process, as we did for the SV models. However, further restrictions on the forms of the m_j(·) are also warranted for achieving better model fit and predictive performance. Linton and Mammen (2005) consider the following volatility process:

σ_t^2(θ, m) = Σ_{j=1}^{∞} ψ_j(θ) m(r_{t−j}), t ∈ Z, (4.2.2)

where ψ_j(·) is a known function, e.g., ψ_j(θ) = θ^{j−1}, while m(·) is unknown but assumed to be smooth. This model can be viewed as a generalization of the ARCH(∞) model. In the case where ψ_j(θ) = θ^{j−1} for θ ∈ (0, 1), the model (4.2.2) can be simplified to

σ_t^2 = θ σ_{t−1}^2 + m(r_{t−1}), t ∈ Z, (4.2.3)

which is exactly the partially nonparametric (PNP) model proposed by Engle and Ng (1993). Model (4.2.3) also includes the GARCH and GJR-GARCH models as special cases: if we let m(r) = ω + α r^2, we get the GARCH(1,1) model, and if m(r) = ω + α_1 r^2 + α_2 r^2 I(r < 0), we get the GJR-GARCH model.
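The PNP recursion (4.2.3) and its two named special cases can be sketched in a few lines of Python; the parameter values below are illustrative, not fitted.

```python
import numpy as np

def pnp_volatility(returns, theta, m, sigma2_0=1.0):
    """Partially nonparametric (PNP) recursion (4.2.3):
    sigma_t^2 = theta * sigma_{t-1}^2 + m(r_{t-1})."""
    sigma2 = np.empty(len(returns))
    sigma2[0] = sigma2_0
    for t in range(1, len(returns)):
        sigma2[t] = theta * sigma2[t - 1] + m(returns[t - 1])
    return sigma2

# Special cases discussed in the text (omega and alpha values are illustrative):
garch_m = lambda r: 0.05 + 0.1 * r**2                          # -> GARCH(1,1)
gjr_m = lambda r: 0.05 + 0.1 * r**2 + 0.08 * r**2 * (r < 0)    # -> GJR-GARCH

r = np.array([0.5, -0.5, 1.0])
s_garch = pnp_volatility(r, theta=0.85, m=garch_m)
s_gjr = pnp_volatility(r, theta=0.85, m=gjr_m)
# The GJR choice of m adds an extra penalty after the negative return r_1.
```

The asymmetry of GJR-GARCH shows up directly: after the negative return at t = 1, the GJR volatility at t = 2 exceeds the plain GARCH one, while the two paths agree wherever lagged returns are positive.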

For semiparametric GARCH models, Audrino and Bühlmann (2009) propose modeling the volatility function on the log scale, centering on a parametric GARCH(1,1) model and using multivariate B-splines to represent the residual part. In particular,

r_t = σ_t ε_t, t = 1, …, T;
log(σ_t^2) = g(r_{t−1}, σ_{t−1}^2), t = 2, …, T,

where ε_t is an iid process with zero mean and unit variance. The specific representation of log(σ_t^2) using multivariate B-splines is

log(σ_t^2(θ)) = g_{θ_0}(r_{t−1}, σ_{t−1}^2(θ)) + Σ_{j_1=1}^{k_1} Σ_{j_2=1}^{k_2} β_{j_1,j_2} B_{j_1,j_2}(r_{t−1}, σ_{t−1}^2(θ))
             = g_{θ_0}(r_{t−1}, σ_{t−1}^2(θ)) + Σ_{j_1=1}^{k_1} Σ_{j_2=1}^{k_2} β_{j_1,j_2} B_{j_1}(r_{t−1}) B_{j_2}(σ_{t−1}^2(θ)), (4.2.4)

where g_{θ_0} is set to be the logarithm of a parametric GARCH(1,1) process, and the basis function B_{j_1}(r_{t−1}) is assumed to be of degree 3 and B_{j_2}(σ_{t−1}^2(θ)) of degree 2.

Assuming the error terms ε_t, t = 1, …, T, to be normal, we can write down the log-likelihood function and estimate the parameters {β_{j_1,j_2} : j_1 = 1, …, k_1; j_2 = 1, …, k_2} by maximizing the log-likelihood.

However, due to the complexity of this semiparametric model, direct optimization of the likelihood function is undesirable. Audrino and Bühlmann (2009) proposed a gradient boosting algorithm, in particular the functional gradient descent algorithm of Friedman (2001). The algorithm considers the loss function

λ(r_t, g) = (1/2) [ log(2π) + g + r_t^2 / exp(g) ],

where g represents the log squared volatility log(σ_t^2), which is also a function of lagged returns and lagged squared volatilities, as given in (4.2.4). If we sum the values of the loss function over the data sample, we obtain the negative log-likelihood under the assumption of Gaussian errors. Conceptually, the algorithm works by iterating the following steps after initialization:

i) At step m, evaluate the negative gradient of the loss function over the data {r_t : t = 2, …, T}, given the most recent estimates of the squared volatilities, {ĝ_{m−1}(t) : t = 2, …, T};

ii) Find a bivariate B-spline function B_{d_1}(r_{t−1}) B_{d_2}(σ_{t−1}^2(θ)), for some 1 ≤ d_1 ≤ k_1 and 1 ≤ d_2 ≤ k_2, that achieves the best least squares approximation of the direction of the negative gradient obtained in step i);

iii) In the direction represented by the bivariate B-spline B_{d_1}(r_{t−1}) B_{d_2}(σ_{t−1}^2(θ)), find the optimal step size β̂_m such that the updated squared volatilities, {ĝ_m(t) : t = 2, …, T}, where ĝ_m(t) = ĝ_{m−1}(t) + β̂_m B_{d_1}(r_{t−1}) B_{d_2}(σ_{t−1}^2(θ)), minimize the loss function evaluated over the data {r_t : t = 2, …, T}.
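Step i) requires the pointwise gradient of the loss with respect to g, which the text leaves implicit; differentiating λ gives

```latex
-\frac{\partial \lambda(r_t, g)}{\partial g}
  = -\frac{1}{2}\left(1 - \frac{r_t^2}{\exp(g)}\right)
  = \frac{1}{2}\left(\frac{r_t^2}{\exp(g)} - 1\right),
```

so the negative gradient at time t is positive exactly when the squared return r_t^2 exceeds the current variance estimate exp(ĝ_{m−1}(t)), pushing the fitted log volatility upward there.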

The stopping rule is based on sample splitting: the optimal model structure is estimated on 70% of the data and then validated on the remaining 30%. The number of iterations in the algorithm above is set to minimize the empirical risk of the validation sample. Note that this gradient boosting algorithm approximates the function g by greedily adding a weighted B-spline basis function to the current estimate at each step. Since the dictionary of B-spline basis functions is finite in this model, the eventual estimate of g can be expressed in the form of the basis expansion (4.2.4).
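As a concrete illustration of the loop above, the following Python sketch runs functional gradient descent over a small, hypothetical dictionary of basis vectors (simple "regime" indicators rather than the bivariate B-splines of the actual method, and without refitting the volatility recursion), under Gaussian errors. Everything here is a toy stand-in for the real algorithm.

```python
import numpy as np

def loss(r, g):
    """Summed Gaussian loss: sum_t 0.5 * (log 2*pi + g_t + r_t^2 / exp(g_t))."""
    return np.sum(0.5 * (np.log(2 * np.pi) + g + r**2 / np.exp(g)))

def fgd_boost(r, basis, g0, n_iter=50, steps=np.linspace(-1.0, 1.0, 81)):
    """Functional gradient descent over a finite dictionary `basis`
    (an (n_basis, T) array of candidate directions).  Each iteration:
    i) negative gradient, ii) best least-squares basis, iii) line search."""
    g = g0.copy()
    for _ in range(n_iter):
        u = 0.5 * (r**2 / np.exp(g) - 1.0)           # step i): -d loss / d g_t
        scores = (basis @ u) ** 2 / np.sum(basis**2, axis=1)
        k = np.argmax(scores)                        # step ii)
        cand = [loss(r, g + b * basis[k]) for b in steps]
        g = g + steps[np.argmin(cand)] * basis[k]    # step iii): grid search
    return g

rng = np.random.default_rng(0)
T = 200
true_g = np.where(np.arange(T) < T // 2, -1.0, 0.5)  # two-regime log-variance
r = rng.normal(0.0, np.exp(true_g / 2.0))
basis = np.stack([np.ones(T),
                  (np.arange(T) < T // 2).astype(float),
                  (np.arange(T) >= T // 2).astype(float)])
g_hat = fgd_boost(r, basis, g0=np.zeros(T))
# The fitted log-variances should move toward the two true regime levels.
```

Because the step-size grid contains zero, the loss never increases across iterations, mirroring the greedy, dictionary-limited nature of the boosting estimate described above.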

The weights of the basis functions obtained from the boosting algorithm, however, differ from those obtained from a maximum likelihood approach. The coordinate-wise (one basis function at a time) gradient descent approach is related to penalized maximum likelihood methods such as the lasso method (Tibshirani, 1996) for certain loss functions (see Efron et al., 2004; Zhao and Yu, 2007), although Audrino and Bühlmann (2009) did not provide the specific regularizations associated with the loss function used in the algorithm above.

One thing is clear from the discussion above: since semiparametric volatility models are generally very flexible, achieving good model fit and predictive performance requires proper regularization. Such regularization can be imposed either through explicit constraints on the functional form or through regularized estimation procedures. Given this premise, the idea of imposing shape constraints in semiparametric SV models can be applied to ARCH-type models as well. For example, in (4.2.1), we can assume that each function m_j, j = 1, …, d, is monotonically decreasing on the negative real line and monotonically increasing on the positive real line. For fitting such a shape-constrained semiparametric (G)ARCH-type model, the basis functions of Neelon and Dunson (2004) might not be the best choice due to their non-orthogonality. Instead, a number of other spline- or kernel-based isotonic regression techniques could be exploited for the model fitting.

4.2.2 Other non-Gaussian nonlinear state-space models

The challenge of fitting semiparametric nonlinear state space models is not confined to SV models. Our methodology of imposing shape constraints on the different functional components of the additive state equation can be applied to other nonlinear state space models as well, in order to efficiently capture nonlinear features of the state equation. A prominent example is a generalization of the generalized linear model:

y_t ∼ F(g(α_t); θ), t ∈ Z;
α_t = µ + f(α_{t−1}) + ε_t, ε_t ∼ iid N(0, σ^2),

where F is a known non-Gaussian distribution, such as a Poisson or Bernoulli distribution, and g is a parametric link function. Based on domain-specific knowledge, we could potentially assume that the function f is monotone (monotonically increasing or decreasing) and use that to improve the model performance.
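As a sketch of this generalization, the following simulates one such model with a Poisson observation equation and a log link; the monotone tanh form for f is our own illustrative choice, not one prescribed by the text.

```python
import numpy as np

def simulate_poisson_ssm(T, mu, f, sigma, seed=0):
    """Simulate the non-Gaussian state-space model sketched above with a
    Poisson observation equation and log link:
        y_t ~ Poisson(exp(alpha_t)),
        alpha_t = mu + f(alpha_{t-1}) + eps_t,  eps_t ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    alpha = np.empty(T)
    y = np.empty(T, dtype=int)
    alpha[0] = mu
    for t in range(T):
        if t > 0:
            alpha[t] = mu + f(alpha[t - 1]) + rng.normal(0.0, sigma)
        y[t] = rng.poisson(np.exp(alpha[t]))
    return alpha, y

# Monotonically increasing, bounded autoregression function (hypothetical).
f = lambda a: 0.9 * np.tanh(a)
alpha, y = simulate_poisson_ssm(T=300, mu=0.2, f=f, sigma=0.3)
```

A bounded monotone f like this keeps the latent process stable while preserving the shape information (higher past intensity, higher current intensity) that the shape-constrained fitting strategy is meant to exploit.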

Another group of non-Gaussian models in the financial time series literature is the so-called duration models, which are related to survival models in biostatistics and also share many features with volatility models. Duration models were proposed to describe the behavior of the time intervals (the durations) between high-frequency trading events of a given stock. The behavior of the durations contains useful information about intraday market activity, as longer durations indicate a lack of trading activity, which in turn signifies a period of no new information (Tsay, 2005). Let {t_i}_{i∈Z^+} denote the time of the i-th trading event measured with respect to an origin t_0, and define the duration series d_i as

d_i = t_i − t_{i−1}, i = 1, …, N.

In practice, it is necessary to adjust the durations d_i for the deterministic patterns of intraday transactions, but for simplicity we assume that the diurnal patterns have been accounted for and that d_i > 0 for all i = 1, …, N. The first duration model, the widely known autoregressive conditional duration (ACD) model given below, was introduced by Engle and Russell (1998) and is based on the GARCH model:

d_i = ∆_i ε_i, i ∈ Z;
∆_i = ω + Σ_{k=1}^{p} α_k d_{i−k} + Σ_{l=1}^{q} β_l ∆_{i−l},

where ε_i, i = 1, …, N, are assumed to be iid according to some distribution with positive support. Just as in survival models, the commonly used distributions are the exponential and the Weibull. The SV version of the duration model

is the so-called stochastic conditional duration (SCD) model proposed by Bauwens and Veredas (2004). As with the unleveraged SV model, the SCD model is also a non-Gaussian nonlinear state space model, given by

d_i = e^{ζ_i} ε_i, ε_i ∼ iid Weibull(α, β), i ∈ Z;
ζ_i = µ + φ ζ_{i−1} + η_i, η_i ∼ iid N(0, τ^2),

where Weibull(α, β) denotes a Weibull distribution with shape parameter α and scale parameter β, and the processes {ε_i}_{i∈Z} and {η_i}_{i∈Z} are mutually independent. When α = 1, the Weibull distribution reduces to an exponential distribution.

Based on our argument in Chapter 2, we can consider the following semiparametric SCD model:

d_i = e^{ζ_i} ε_i, ε_i ∼ iid Weibull(α, β), i ∈ Z;
ζ_i = µ + f(ζ_{i−1}) + η_i, η_i ∼ iid N(0, τ^2), (4.2.5)

where the processes {ε_i}_{i∈Z} and {η_i}_{i∈Z} are mutually independent and the autoregressive function f is monotone and satisfies f(0) = 0. The model fitting and model comparison procedures developed in Chapter 2 can be easily adapted to fit (4.2.5).
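A quick way to see the behavior implied by (4.2.5) is to simulate from it; the monotone f below, with f(0) = 0, and all parameter values are illustrative choices for the sketch.

```python
import numpy as np

def simulate_scd(N, mu, f, tau, a, b, seed=0):
    """Simulate the semiparametric SCD model (4.2.5):
        d_i = exp(zeta_i) * eps_i,  eps_i ~ Weibull(shape a, scale b),
        zeta_i = mu + f(zeta_{i-1}) + eta_i,  eta_i ~ N(0, tau^2)."""
    rng = np.random.default_rng(seed)
    zeta = np.empty(N)
    zeta[0] = mu
    for i in range(1, N):
        zeta[i] = mu + f(zeta[i - 1]) + rng.normal(0.0, tau)
    eps = b * rng.weibull(a, size=N)   # numpy's weibull draws have scale 1
    return np.exp(zeta) * eps

f = lambda z: 0.7 * np.tanh(z)         # monotone, with f(0) = 0
d = simulate_scd(N=1000, mu=0.0, f=f, tau=0.2, a=1.5, b=1.0)
```

The simulated durations are positive by construction, and the monotone state equation induces the kind of persistent clustering of long and short durations that the model is designed to capture.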

4.3 Extensions of the Heteroscedastic Asymmetric Spatial Process

4.3.1 Into the spatio-temporal domain

Conceptually, the heteroscedastic asymmetric spatial process (HASP) developed in Chapter 3 can be easily generalized to the spatio-temporal domain, but appreciating the unique features of a space-time process is also crucial for understanding these types of models. In this section, we first briefly review the special features of a space-time process compared to a purely spatial process before introducing a generalized HASP model.

Consider a space-time process Y(s, t), where (s, t) ∈ D ⊂ R^d × R. Although, mathematically, a space-time process can be viewed as a “spatial” process defined on a bounded subset of R^{d+1}, this view is insufficient, since the progression of a stochastic process in time intrinsically differs from its separation in space. In particular, temporal progression usually has an intrinsic direction, while this need not be true for spatial separation. In addition, a valid covariance function defined on R^{d+1} does not necessarily capture the interaction between temporal and spatial dependence. As a result, space-time processes warrant special treatment.

A space-time process {Y(s, t), (s, t) ∈ D ⊂ R^d × R} is said to have a fully symmetric covariance if

cov{Y(s, t), Y(s′, t′)} = cov{Y(s′, t), Y(s, t′)}, (s, t), (s′, t′) ∈ D ⊂ R^d × R.

A related notion is the separability of the covariance function. A space-time covariance function is said to be separable if there exist a purely spatial covariance function C_S and a purely temporal covariance function C_T such that

cov{Y(s, t), Y(s′, t′)} = C_S(s, s′) C_T(t, t′), (s, t), (s′, t′) ∈ D ⊂ R^d × R.

If a space-time process Y(s, t), (s, t) ∈ D ⊂ R^d × R, can be expressed as

Y(s, t) = Z_S(s) Z_T(t),

where the purely spatial process Z_S(s) and the purely temporal process Z_T(t) are independent, then the covariance function of Y(s, t) is separable. The covariance function cov{Y(s, t), Y(s′, t′)} is said to be spatially stationary if it depends on the sites s and s′ only through the spatial separation vector ∆s = s − s′. Similarly, the covariance function is temporally stationary if it depends on the times t and t′ only through the temporal separation ∆t = t − t′. We say that a spatio-temporal process is stationary if its covariance function is both spatially and temporally stationary, and write

cov{Y(s, t), Y(s′, t′)} = C(s − s′, t − t′) = C(∆s, ∆t).
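To see why separability is computationally convenient, note that on a grid of sites and times the covariance matrix of a separable process factors as a Kronecker product of a purely temporal and a purely spatial matrix. A small sketch, with hypothetical exponential and AR(1)-type component covariances:

```python
import numpy as np

def separable_cov(sites, times, CS, CT):
    """Covariance matrix of a separable space-time process observed on the
    grid {sites} x {times}: the matrix factors as the Kronecker product
    C = C_T (x) C_S, which is what makes separability cheap to work with."""
    S = np.array([[CS(np.linalg.norm(s - sp)) for sp in sites] for s in sites])
    T = np.array([[CT(abs(t - tp)) for tp in times] for t in times])
    return np.kron(T, S)

CS = lambda h: np.exp(-h)      # exponential spatial covariance (illustrative)
CT = lambda u: 0.9 ** u        # AR(1)-type temporal covariance (illustrative)
sites = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 2.0])]
times = [0, 1, 2]
C = separable_cov(sites, times, CS, CT)
# C is 9 x 9, symmetric, and non-negative definite, since both factors are.
```

Each entry is exactly C_S(h) C_T(u) for the corresponding pair, so cross space-time dependence in this construction is always the product of the purely spatial and purely temporal terms.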

See the review articles of Kyriakidis and Journel (1999), Gneiting et al. (2006) and Gneiting and Guttorp (2010) for more detailed discussions. The stationarity, separability and full symmetry of a space-time covariance function are all convenient assumptions that simplify the modeling of space-time data. Furthermore, covariance functions are commonly assumed to be spatially isotropic as well. In reality, these assumptions are often insufficient for describing the true physical phenomena of atmospheric, geophysical and environmental processes. For example, consider the values of the space-time process Y(s, t), (s, t) ∈ D ⊂ R^d × R, observed at

(s_i, t_i) and (s_i′, t_i′) for i = 1, 2. Let ∆s_i = s_i − s_i′ and ∆t_i = t_i − t_i′, for i = 1, 2. Under the assumption of a stationary and separable covariance model, the ratio of the correlations

cov{Y(s_1, t_1), Y(s_1′, t_1′)} / cov{Y(s_2, t_2), Y(s_2′, t_2′)} = C_S(∆s_1) C_T(∆t_1) / [C_S(∆s_2) C_T(∆t_2)] (4.3.1)

is always proportional to C_S(∆s_1)/C_S(∆s_2) for fixed time points and proportional to C_T(∆t_1)/C_T(∆t_2) for fixed spatial locations (Cressie and Huang, 1999). Furthermore, if ∆t_1 = ∆t_2, then the correlation ratio (4.3.1) is constant in the magnitudes of the temporal separations, and equals the correlation ratio obtained from the purely spatial process Y(s, t_0), s ∈ D ⊂ R^d, for any t_0 ∈ R. Similarly, the actual sizes of the spatial separations ∆s_1 and ∆s_2 have no bearing on the relative strength of the correlations defined in (4.3.1) when ∆s_1 = ∆s_2; as an extreme case, the univariate time series observed at a fixed spatial location has the same temporal correlation function regardless of the location.

To allow for a space-time interaction in the covariance function, i.e., to relax the proportional correlation ratio property implied by (4.3.1), researchers have proposed stationary, but non-separable, space-time covariance functions (see, e.g., Cressie and Huang, 1999; Gneiting, 2002b; Stein, 2005; Rodrigues and Diggle, 2010; Fonseca and Steel, 2011a). Cressie and Huang (1999) showed that a continuous, bounded, integrable and symmetric function on R^d × R is a space-time covariance function if and only if the function

ψ(∆t | ω) = ∫ e^{−i ∆s^T ω} C(∆s, ∆t) d∆s

is positive definite for almost all ω ∈ R^d. This result reduces Bochner's criterion for positive definite functions on R^d × R to a criterion on R. Based on this result of Cressie and Huang (1999), Gneiting (2002b) constructed non-separable, stationary space-time covariance functions of the form

C(∆s, ∆t) = [σ^2 / ψ(|∆t|^2)^{d/2}] ϕ( ‖∆s‖^2 / ψ(|∆t|^2) ), (∆s, ∆t) ∈ R^d × R,

which is valid provided ϕ(x), x ≥ 0, is a completely monotone function and ψ(x), x ≥ 0, is a positive function with a completely monotone derivative. (A continuous function defined on the positive real line is completely monotone if it possesses derivatives of all orders, among which the odd-order derivatives are non-positive while the even-order derivatives are non-negative.) An example of a valid non-separable stationary space-time covariance function that can be constructed based on either Cressie and Huang (1999) or Gneiting (2002b) is

C(h, u) = [σ^2 / (a u^{2α} + 1)^{δ+βd/2}] exp( −c h^{2γ} / (a u^{2α} + 1)^{βγ} ),

where h = ‖∆s‖ and u = |∆t| are the spatial and temporal distances, respectively. A second strategy for constructing non-separable space-time covariance functions is the convolution-based method of Rodrigues and Diggle (2010). A third strategy, discussed by Ma (2002), Ma (2003), Fonseca and Steel (2011a) and others, is based on the mixture of a purely spatial and a purely temporal process:

Y(s, t) = Z_S(s; Φ) Z_T(t; Ψ), (s, t) ∈ D ⊂ R^d × R,

where (Φ, Ψ) is a bivariate non-negative random vector with cumulative distribution function µ(φ, ψ), independent of the purely spatial process Z_S(s; φ) with covariance C_S(h; φ) and of the purely temporal process Z_T(t; ψ) with covariance C_T(u; ψ).

The covariance function of the space-time process is thus

C(h, u) = ∫ C_S(h; φ) C_T(u; ψ) dµ(φ, ψ), h, u ∈ R.

See, e.g., Fonseca and Steel (2011a) for a discussion of specific choices of the distributions of Φ and Ψ and of the covariance functions C_S(h; φ) and C_T(u; ψ).
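The Gneiting-type example displayed earlier in this section is straightforward to implement. The sketch below (with illustrative parameter values) also checks numerically that setting β = 0 removes the space-time interaction, so that the covariance factors into a purely spatial and a purely temporal part.

```python
import numpy as np

def gneiting_cov(h, u, sigma2=1.0, a=1.0, alpha=1.0, beta=1.0,
                 gamma=0.5, c=1.0, delta=1.0, d=2):
    """Non-separable space-time covariance of the Gneiting (2002b) form:
    C(h, u) = sigma^2 / (a u^{2 alpha} + 1)^{delta + beta d / 2}
              * exp(-c h^{2 gamma} / (a u^{2 alpha} + 1)^{beta gamma}).
    beta controls the strength of the space-time interaction."""
    den = a * u ** (2.0 * alpha) + 1.0
    return (sigma2 / den ** (delta + beta * d / 2.0)
            * np.exp(-c * h ** (2.0 * gamma) / den ** (beta * gamma)))

# beta = 0 (with sigma2 = 1): the covariance factors into C(h, 0) * C(0, u).
C_sep = gneiting_cov(2.0, 3.0, beta=0.0)
C_fact = gneiting_cov(2.0, 0.0, beta=0.0) * gneiting_cov(0.0, 3.0, beta=0.0)
# beta = 1: genuine space-time interaction, and no such factorization.
C_ns = gneiting_cov(2.0, 3.0, beta=1.0)
C_fact_ns = gneiting_cov(2.0, 0.0, beta=1.0) * gneiting_cov(0.0, 3.0, beta=1.0)
```

This makes explicit the role of β as an interaction parameter: the factorization test succeeds at β = 0 and fails once β > 0.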

To model the spatial heteroscedasticity of a space-time process, Damian et al. (2003) proposed a non-stationary non-Gaussian space-time model in which the measurements over time are assumed to be independent and the variation in the variances at different spatial locations is modeled through a latent log-Gaussian variance process, in the same fashion as the stationary GLG model of Palacios and Steel (2006). The stationary non-Gaussian SHP model discussed in Chapter 3 is originally formulated as a general space-time process as well, although Huang et al. (2011) did not provide any example where a general space-time covariance function is used; the only space-time dataset studied in their article assumes that the observations in time are independent. A truly space-time non-Gaussian model is proposed by Fonseca and Steel (2011b), where the non-Gaussianity in the spatial and temporal domains is modeled separately through a scale mixing process as in Palacios and Steel (2006), and the space-time interaction in the covariance structure is induced by the mixture model of Fonseca and Steel (2011a). Let Z_S(s; φ), s ∈ D ⊂ R^d, be a purely spatial process with mean zero and covariance function C_S(h; φ), where h denotes the spatial distance. Similarly, let Z_T(t; ψ), t ∈ R, be a purely temporal process with mean zero and covariance function C_T(u; ψ), where u denotes the temporal distance. Finally, let (Φ, Ψ) be a bivariate non-negative random vector that is independent of Z_S(s; φ) and Z_T(t; ψ). Then the non-Gaussian space-time model of Fonseca and Steel (2011b)

(without the nugget effect in the spatial process) is given by

Y(s, t; Φ, Ψ) = Z̃_S(s; Φ) Z̃_T(t; Ψ), (s, t) ∈ D ⊂ R^d × R,

where

Z̃_S(s; φ) = Z_S(s; φ) / λ_S(s)^{1/2}, or equivalently, Z̃_S(s; φ) = e^{−α_S(s)/2} Z_S(s; φ),

and

Z̃_T(t; ψ) = Z_T(t; ψ) / λ_T(t)^{1/2}, or equivalently, Z̃_T(t; ψ) = e^{−α_T(t)/2} Z_T(t; ψ).

The processes α_S(s) and α_T(t) are assumed to be mean-square continuous and are independent of the other stochastic processes and random variables in the model.

The covariance function of Y(s, t; Φ, Ψ) is thus given by

E_{Φ,Ψ}[ C_S(h; Φ) C_T(u; Ψ) ], h, u ∈ R.

It is easy to see that this model is able to capture the heavy tails of a non-Gaussian space-time process but does not allow for skewness.

As is shown in Chapter 3, capturing the skewness in addition to the heavy tails is important for building an appropriate model for space-time data. To obtain a heteroscedastic asymmetric space-time process with a separable space-time covariance function, we can simply specify a product of a purely spatial process and a purely temporal process. For the non-separable case, there are different ways to generalize our HASP model, depending on which strategy for constructing a non-separable space-time covariance model is adopted. We present the most straightforward approach below.

Under the framework (3.2.1), a latent multivariate space-time process is needed for obtaining a heteroscedastic asymmetric space-time process. However, specifying the covariance functions, which is already difficult for a multivariate spatial process, becomes even more complicated for multivariate space-time processes. In addition to the space-time interaction in the covariance function of each univariate component space-time process, we need to deliberate on the choice of cross-covariance functions between the different component processes measured at different locations and times. In addition, we need to ensure that the different covariance and cross-covariance functions are constrained so that the matrix-valued covariance function of the multivariate space-time process U(s, t),

cov{U(s, t), U^T(s′, t′)} = C(h, u) = [ C_{11}(h, u) ··· C_{1p}(h, u) ]
                                      [      ⋮       ⋱       ⋮      ]
                                      [ C_{p1}(h, u) ··· C_{pp}(h, u) ],

where h = ‖s − s′‖ ∈ R and u = |t − t′| ∈ R, is always valid. In other words, for any finite number of observations of the multivariate space-time process, the resulting covariance matrix must be non-negative definite. To the best of our knowledge, there has been no attempt to establish general conditions for the covariance function of a general space-time process to be well defined. Consider

Y(s, t) = e^{α(s,t)/2} ε(s, t), (s, t) ∈ D ⊂ R^d × R, (4.3.2)

where the latent bivariate spatio-temporal process [α(s, t), ε(s, t)]^T has the stationary matrix-valued covariance function

C(h, u) = [ τ^2 ρ_α(h, u)   τϱ ρ_c(h, u) ]
          [ τϱ ρ_c(h, u)    ρ_ε(h, u)    ]    (4.3.3)

with h = ‖s − s′‖ and u = |t − t′|. In addition, ρ_α(h, u) and ρ_ε(h, u) are the stationary, symmetric and non-separable correlation functions for the latent univariate processes α(s, t) and ε(s, t), respectively. The function ρ_c(h, u) is the cross-correlation function between α(s, t) and ε(s, t), i.e.,

ρ_c(h, u) = cov{α(s, t), ε(s′, t′)} / τ.

It is easy to see that the moments and the covariance function of the non-Gaussian process Y(s, t) derived in Chapter 3 are still applicable here. However, the conditions on the correlation functions ρ_α, ρ_c and ρ_ε for the model to be well defined, given in Theorems 1, 3 and 4 in Chapter 3, do not necessarily hold, since non-separable space-time covariance functions are not necessarily nested within the class of Matérn functions.

On the other hand, we can always guarantee that such a covariance function is valid by adopting the LMC approach, i.e.,

ε(s, t) = ϱ α(s, t)/τ + √(1 − ϱ^2) ξ(s, t),

which implies that

corr{α(s, t), α(s′, t′)} = corr{α(s, t), ε(s′, t′)}, i.e., ρ_c(h, u) = ρ_α(h, u).

The heteroscedastic asymmetric spatio-temporal process can then be expressed as

Y(s, t) = X^T(s, t) β + (ϱ/τ) exp{ (H^T(s, t) δ + α(s, t)) / 2 } α(s, t)
        + √(1 − ϱ^2) exp{ (H^T(s, t) δ + α(s, t)) / 2 } ξ(s, t).
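To make the construction concrete, here is a simulation sketch on a one-dimensional spatial grid. For simplicity we drop the covariates (X and H) and, purely as our own illustrative choice, use separable exponential correlations for the latent α(s, t) and ξ(s, t) processes; the construction above also allows non-separable ones.

```python
import numpy as np

def simulate_hasp_st(sites, times, tau, rho, lam=1.0, kappa=1.0, seed=0):
    """Simulate the heteroscedastic asymmetric space-time process above
    (no covariates) on a small grid:
        Y(s, t) = exp(alpha / 2) * (rho * alpha / tau + sqrt(1 - rho^2) * xi),
    where alpha ~ GP(0, tau^2 R_alpha) and xi ~ GP(0, R_xi) are independent."""
    rng = np.random.default_rng(seed)
    H = np.abs(np.subtract.outer(sites, sites))      # spatial distances
    U = np.abs(np.subtract.outer(times, times))      # temporal distances
    R_alpha = np.kron(np.exp(-U / lam), np.exp(-H / lam))
    R_xi = np.kron(np.exp(-U / kappa), np.exp(-H / kappa))
    n = len(sites) * len(times)
    jitter = 1e-10 * np.eye(n)
    alpha = tau * np.linalg.cholesky(R_alpha + jitter) @ rng.normal(size=n)
    xi = np.linalg.cholesky(R_xi + jitter) @ rng.normal(size=n)
    return np.exp(alpha / 2.0) * (rho * alpha / tau
                                  + np.sqrt(1.0 - rho**2) * xi)

sites = np.linspace(0.0, 1.0, 5)
times = np.arange(4.0)
Y = simulate_hasp_st(sites, times, tau=1.0, rho=0.6)  # rho > 0: right skew
```

The latent α enters both as the scale (through exp(α/2)) and, via ϱ, as part of the mean direction, which is exactly how the HASP construction produces heavy tails and skewness simultaneously.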

One drawback of this model specification is that a single parameter ϱ controls the skewness of the model over the entire space-time domain, while in reality there might be differences between the spatial and temporal behaviors. To explicitly model the difference between time and space, e.g., skewness in space but not in time (the simplest example being independent temporal measurements), we can mix the HASP with a purely temporal process as in Ma (2002), Ma (2003) and Fonseca and Steel (2011a). However, because a HASP process does not have zero mean, understanding whether or not the mixing approach makes sense requires a more in-depth analysis.

4.3.2 Other directions

Even if we can write down a non-Gaussian spatial or spatio-temporal model and develop a Bayesian procedure to fit it, computational feasibility is always a concern for models employing a latent scaling process, and this computational burden can be severe for large datasets. Existing strategies for large spatial datasets are reviewed in detail in Sun et al. (2012). Some of the methods for Gaussian processes, such as the tapering strategy developed by Furrer et al. (2006) and Kaufman et al. (2008) and the Gaussian Markov random field approximation of Lindgren et al. (2011), depend on the sparsification of the covariance matrix of the Gaussian process so that it can be handled more efficiently with specially designed algorithms. However, a major computational challenge in the HASP model, as well as its extension to the space-time domain, is the update of the latent α(s, t) process within a Gibbs sampler. Neither the tapering method nor a Markov approximation of a Gaussian random field would contribute much to the fitting of the HASP model or its space-time counterparts. To solve this problem, we either need to devise a cleverer model fitting procedure or use other approximation techniques, such as a low-rank representation of the spatial processes.

Another question that remains to be explored is the development of a non-Gaussian process that also captures non-stationary features. We have seen an example of such a model in Damian et al. (2003), where the variance of the process is also modeled as a spatially indexed random field. In general, however, the non-stationary models developed in the literature still assume Gaussianity. The number of methods for building non-stationary spatial or spatio-temporal processes is large and growing (see, e.g., Sampson and Guttorp, 1992; Guttorp and Sampson, 1994; Nychka and Saltzman, 1998; Higdon, 1998; Fuentes, 2001; Nott and Dunsmuir, 2002; Nychka et al., 2002, for early papers in this field), and the modeling potential of a flexible non-Gaussian and non-stationary model is certainly worth our research attention.

4.4 Closing Remarks

For many practical problems, the art of statistics resides in the process of designing a model that is as close to the true data generating process as possible. Technically, however, there is usually a nuanced balance between choosing feasible mathematical assumptions for our models and using them to uncover repeatable patterns concealed in the data. The findings from an overly restrictive model might be dominated by our own (and usually wrong) assumptions, while an overly flexible model will end up modeling random errors in the data and produce unrepeatable results.

In this dissertation, we have developed ways to relax some common but restrictive assumptions in temporal and spatial stochastic processes, including the linearity assumption in the state equation of the SV model and the symmetry assumption in the marginal distribution of heavy-tailed non-Gaussian spatial processes. Our proposed methodologies offer disciplined ways to model non-Gaussian features in temporally and spatially indexed data. They are conceptually easy to understand and practically straightforward to implement. In addition, our methods can serve as building blocks for developing models that allow for even more empirical features to approximate the true data generating process.

Bibliography

Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72:269–342.

Apanasovich, T. V., Genton, M. G., and Sun, Y. (2012). A valid Matérn class of cross-covariance functions for multivariate random fields with any number of components. Journal of the American Statistical Association, 107:180–193.

Askey, R. (1973). Radial characteristic functions. Technical report, University of Wisconsin-Madison, Mathematics Research Center.

Audrino, F. and Bühlmann, P. (2009). Splines for financial volatility. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71:655–670.

Azzalini, A. (2005). The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32:159–188.

Azzalini, A. and Capitanio, A. (1999). Statistical applications of the multivariate skew normal distribution. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61:579–602.

Azzalini, A. and Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika, 83:715–726.

Bandi, F. M. and Renò, R. (2012). Time-varying leverage effects. Journal of Econometrics, 169:94–113.

Banerjee, S. and Gelfand, A. (2003). On smoothness properties of spatial processes. Journal of Multivariate Analysis, 84:85–100.

Bauwens, L. and Veredas, D. (2004). The stochastic conditional duration model: a latent variable model for the analysis of financial durations. Journal of Econometrics, 119:381–412.

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society: Series B (Methodological), 36:192–236.

Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24:179–195.

Bolin, D. (2013). Spatial Matérn fields driven by non-Gaussian noise. Scandinavian Journal of Statistics, in press.

Bolin, D. and Wallin, J. (2013). Non-Gaussian Matérn fields with an application to precipitation modeling. arXiv preprint arXiv:1307.6366.

Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31:307–327.

Brezger, A. and Steiner, W. J. (2008). Monotonic regression based on Bayesian P-splines. Journal of Business & Economic Statistics, 26:90–104.

Brockwell, P. and Davis, R. (2009). Time Series: Theory and Methods. Springer Verlag, New York, NY.

Cai, B. and Dunson, D. (2007). Bayesian multivariate isotonic regression splines. Journal of the American Statistical Association, 102:1158–1171.

Carlin, B. and Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Methodological), 57:473–484.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90:1313–1321.

Chib, S. and Greenberg, E. (1994). Bayes inference in regression models with ARMA(p, q) errors. Journal of Econometrics, 64:183–206.

Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49:327–335.

Comte, F. (2004). Kernel deconvolution of stochastic volatility models. Journal of

Time Series Analysis, 25:563–582.

Craigmile, P. F. and Guttorp, P. (2011). Space-time modelling of trends in tempera-

ture series. Journal of Time Series Analysis, 32:378–395.

Cressie, N. (1993). Statistics for Spatial Data. Wiley, New York, NY.

Cressie, N. and Huang, H.-C. (1999). Classes of nonseparable, spatio-temporal sta-

tionary covariance functions. Journal of the American Statistical Association,

94:1330–1339.

Dahlhaus, R. and Subba Rao, S. (2006). Statistical inference for time-varying ARCH

processes. The Annals of Statistics, 34:1075–1114.

144 Damian, D., Sampson, P. D., and Guttorp, P. (2003). Variance modeling for non-

stationary spatial processes with temporal replications. Journal of Geophysical

Research: Atmospheres, 108:1–12.

De Oliveira, V., Kedem, B., and Short, D. A. (1997). Bayesian prediction of trans-

formed Gaussian random fields. Journal of the American Statistical Association,

92:1422–1433.

Dette, H. and Pilz, K. (2006). A comparative study of monotone nonparametric

kernel estimates. Journal of Statistical Computation and Simulation, 76:41–56.

Dette, H. and Scheder, R. (2006). Strictly monotone and smooth nonparametric

regression for two or more variables. Canadian Journal of Statistics, 34:535–561.

Diggle, P. J., Tawn, J., and Moyeed, R. (1998). Model-based geostatistics. Journal

of the Royal Statistical Society: Series C (Applied Statistics), 47:299–350.

Doob, J. L. (1942). The Brownian movement and stochastic equations. Annals of

Mathematics, 43:351–369.

Eberlein, E. and Hammerstein, E. A. v. (2004). Generalized hyperbolic and inverse

Gaussian distributions: limiting cases and approximation of processes. In Seminar

on Stochastic Analysis, Random Fields and Applications IV, volume 58, pages 221–

264. Springer.

Eddelbuettel, D. and Sanderson, C. (2014). RcppArmadillo: Accelerating R with

high-performance C++ linear algebra. Computational Statistics & Data Analysis,

71:1054–1063.

145 Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression.

The Annals of Statistics, 32:407–499.

Engle, R. (1982). Autoregressive conditional heteroscedasticity with estimates of the

variance of United Kingdom inflation. Econometrica, 50:987–1007.

Engle, R. and Gonzalez-Rivera, G. (1991). Semiparametric ARCH models. Journal

of Business & Economic Statistics, 9:345–359.

Engle, R. and Ng, V. (1993). Measuring and testing the impact of news on volatility.

The Journal of Finance, 48:1749–1778.

Engle, R. F. and Russell, J. R. (1998). Autoregressive conditional duration: a new

model for irregularly spaced transaction data. Econometrica, 66:1127–1162.

Fan, J. and Yao, Q. (1998). Efficient estimation of conditional variance functions in

stochastic regression. Biometrika, 85:645–660.

Fan, J. and Yao, Q. (2003). Nonlinear time series: Nonparametric and parametric

methods. Springer Verlag, New York, NY.

Fonseca, T. C. and Steel, M. F. (2011a). A general class of nonseparable space–time

covariance models. Environmetrics, 22:224–242.

Fonseca, T. C. and Steel, M. F. (2011b). Non-Gaussian spatiotemporal modelling

through scale mixing. Biometrika, 98:761–774.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine.

Annals of Statistics, 29:1189–1232.

146 Fuentes, M. (2001). A high frequency kriging approach for non-stationary environ-

mental processes. Environmetrics, 12:469–483.

Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpo-

lation of large spatial datasets. Journal of Computational and Graphical Statistics,

15:502–523.

Glosten, L., Jagannathan, R., and Runkle, D. (1993). On the relation between the

expected value and the volatility of the nominal excess return on stocks. Journal

of Finance, 48:1779–1801.

Gneiting, T. (2001). Criteria of P´olya type for radial positive definite functions.

Proceedings of the American Mathematical Society, 129:2309–2318.

Gneiting, T. (2002a). Compactly supported correlation functions. Journal of Multi-

variate Analysis, 83:493–508.

Gneiting, T. (2002b). Nonseparable, stationary covariance functions for space–time

data. Journal of the American Statistical Association, 97:590–600.

Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, cali-

bration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical

Methodology), 69:243–268.

Gneiting, T., Genton, M., and Guttorp, P. (2006). Geostatistical space-time models,

stationarity, separability and full symmetry. In Finkenstadt, B., Held, L., and

Isham, V., editors, Statistical Methods for Spatio-Temporal Systems, pages 151–

175. Chapman and Hall/CRC, Boca Raton, FL.

147 Gneiting, T. and Guttorp, P. (2010). Continuous parameter spatio-temporal pro-

cesses. In Gelfand, A. E., Diggle, P., Guttop, P., and Fuentes, M., editors, Handbook

of Spatial Statistics, pages 427–436. CRC Press, New York, NY.

Gneiting, T., Kleiber, W., and Schlather, M. (2010). Mat´erncross-covariance func-

tions for multivariate random fields. Journal of the American Statistical Associa-

tion, 105:1167–1177.

Gneiting, T. and Schlather, M. (2004). Stochastic models that separate fractal di-

mension and the Hurst effect. SIAM review, 46:269–282.

Godsill, S. J., Doucet, A., and West, M. (2004). Monte Carlo smoothing for nonlinear

time series. Journal of the American Statistical Association, 99:156–168.

Gordon, N. J., Salmond, D. J., and Smith, A. F. (1993). Novel approach to

nonlinear/non-Gaussian Bayesian state estimation. IEEE Proceedings F (Radar

and ), 140:107–113.

Goulard, M. and Voltz, M. (1992). Linear coregionalization model: tools for estima-

tion and choice of cross-variogram matrix. Mathematical Geology, 24:269–286.

Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and

Bayesian model determination. Biometrika, 82:711–732.

Grenander, U. and Miller, M. I. (1994). Representations of knowledge in complex

systems. Journal of the Royal Statistical Society: Series B (Methodological), 56:549–

603.

Guttorp, P. and Gneiting, T. (2006). Studies in the history of probability and statistics

XLIX: On the Mat´erncorrelation family. Biometrika, 93:989–995.

148 Guttorp, P. and Sampson, P. D. (1994). Methods for estimating heterogeneous spatial

covariance functions with environmental applications. In Patil, G. and Rao, C.,

editors, Handbook of Statistics: Environmental Statistics, volume 12, pages 661–

689. Elsevier, New York, NY.

Hall, P. and Huang, L. (2001). Nonparametric kernel regression subject to mono-

tonicity constraints. The Annals of Statistics, 29:624–647.

Handcock, M. S. and Stein, M. L. (1993). A Bayesian analysis of kriging. Techno-

metrics, 35:403–410.

Harvey, A. C. and Shephard, N. (1996). Estimation of an asymmetric stochastic

volatility model for asset returns. Journal of Business & Economic Statistics,

14:429–434.

Hastie, T. and Tibshirani, R. (1986). Generalized additive models. Statistical Science,

1:297–318.

Haynsworth, E. V. (1968). Determination of the inertia of a partitioned Hermitian

matrix. Linear Algebra and its Applications, 1:73–81.

Held, L. and Rue, H. (2010). Conditional and intrinsic autoregressions. In Gelfand,

A. E., Diggle, P., Guttop, P., and Fuentes, M., editors, Handbook of Spatial Statis-

tics, pages 207–208. CRC Press, New York, NY.

Hering, A. S. and Genton, M. G. (2010). Powering up with space-time wind forecast-

ing. Journal of the American Statistical Association, 105:92–104.

Higdon, D. (1998). A process-convolution approach to modelling temperatures in the

North Atlantic Ocean. Environmental and Ecological Statistics, 5:173–190.

149 Huang, W., Wang, K., Breidt, F., and Davis, R. (2011). A class of stochastic volatility

models for environmental applications. Journal of Time Series Analysis, 32:364–

377.

Hull, J. and White, A. (1987). The pricing of options on assets with stochastic

volatilities. The Journal of Finance, 42:281–300.

Jensen, M. J. and Maheu, J. M. (2012). Estimating a semiparametric asymmetric

stochastic volatility model with a Dirichlet process mixture. Technical report,

Federal Reserve Bank of Atlanta.

Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering

for likelihood-based estimation in large spatial data sets. Journal of the American

Statistical Association, 103:1545–1555.

Kent, J. T. (1989). Continuity properties for random fields. The Annals of Probability,

17:1432–1440.

Kim, H.-M. and Mallick, B. K. (2004). A Bayesian prediction using the skew Gaussian

distribution. Journal of Statistical Planning and Inference, 120:85–101.

Kim, S., Shephard, N., and Chib, S. (1998). Stochastic volatility: likelihood inference

and comparison with ARCH models. The Review of Economic Studies, 65:361–393.

Kim, W. and Linton, O. (2004). The LIVE method for generalized additive volatility

models. Econometric Theory, 20:1094–1139.

Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear

state space models. Journal of Computational and Graphical Statistics, 5:1–25.

150 Kong, A., Liu, J. S., and Wong, W. H. (1994). Sequential imputations and Bayesian

missing data problems. Journal of the American Statistical Association, 89:278–

288.

Kyriakidis, P. C. and Journel, A. G. (1999). Geostatistical space–time models: a

review. Mathematical Geology, 31:651–684.

Lange, M. (2005). On the uncertainty of wind power predictionsanalysis of the forecast

accuracy and statistical distribution of errors. Journal of Solar Energy Engineering,

127:177–184.

Lindgren, F., Rue, H., and Lindstr¨om,J. (2011). An explicit link between Gaus-

sian fields and Gaussian Markov random fields: the stochastic partial differential

equation approach. Journal of the Royal Statistical Society: Series B (Statistical

Methodology), 73:423–498.

Linton, O. and Mammen, E. (2005). Estimating semiparametric ARCH models by

kernel smoothing methods. Econometrica, 73:771–836.

Liu, J. S. and Chen, R. (1995). Blind deconvolution via sequential imputations.

Journal of the American Statistical Association, 90:567–576.

Liu, J. S. and Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems.

Journal of the American Statistical Association, 93:1032–1044.

Ma, C. (2002). Spatio-temporal covariance functions generated by mixtures. Mathe-

matical Geology, 34:965–975.

Ma, C. (2003). Spatio-temporal stationary covariance models. Journal of Multivariate

Analysis, 86:97–107.

151 MacEachern, S. N., Clyde, M., and Liu, J. S. (1999). Sequential importance sam-

pling for nonparametric Bayes models: The next generation. Canadian Journal of

Statistics, 27:251–267.

Mat´ern,B. (1960). Spatial variation: Stochastic models and their application to

some problems in forest surveys and other sampling investigations. Meddelanden

fr˚anStatens Skogsforskningsinstitut, Stockholm, 49.

McAleer, M. and Medeiros, M. (2008). Realized volatility: A review. Econometric

Reviews, 27:10–45.

Meyer, M. C., Hackstadt, A. J., and Hoeting, J. A. (2011). Bayesian estimation

and inference for generalised partial linear models using shape-restricted splines.

Journal of Nonparametric Statistics, 23:867–884.

Nakajima, J. and Omori, Y. (2009). Leverage, heavy-tails and correlated jumps in

stochastic volatility models. Computational Statistics & Data Analysis, 53:2335–

2353.

Neelon, B. and Dunson, D. (2004). Bayesian isotonic regression and trend analysis.

Biometrics, 60:398–406.

Nelson, D. (1991). Conditional heteroskedasticity in asset returns: a new approach.

Econometrica, 59:347–370.

Nott, D. J. and Dunsmuir, W. T. (2002). Estimation of nonstationary spatial covari-

ance structure. Biometrika, 89:819–829.

152 Nychka, D. and Saltzman, N. (1998). Design of air-quality monitoring networks. In

Nychka, D., Piegorsch, W. W., and Cox, L. H., editors, Case studies in environ-

mental statistics, pages 51–76. Springer Verlag, New York, NY.

Nychka, D., Wikle, C., and Royle, J. A. (2002). Multiresolution models for nonsta-

tionary spatial covariance functions. Statistical Modelling, 2:315–331.

Omori, Y., Chib, S., Shephard, N., and Nakajima, J. (2007). Stochastic volatility

with leverage: Fast and efficient likelihood inference. Journal of Econometrics,

140:425–449.

Pagan, A. and Ullah, A. (1988). The econometric analysis of models with risk terms.

Journal of Applied Econometrics, 3:87–105.

Palacios, M. and Steel, M. (2006). Non-Gaussian Bayesian geostatistical modeling.

Journal of the American Statistical Association, 101:604–618.

P´olya, G. (1949). Remarks on characteristic functions. In Proceedings of the Berkeley

Symposium on and Probability, pages 115–123.

R Core Team (2013). R: A Language and Environment for Statistical Computing.

R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-

project.org/.

Ramsay, J. (1988). Monotone regression splines in action. Statistical Science, 3:425–

441.

Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin

distributions and their discrete approximations. Bernoulli, 2:341–363.

153 Rodrigues, A. and Diggle, P. J. (2010). A class of convolution-based models for spatio-

temporal processes with non-separable covariance structure. Scandinavian Journal

of Statistics, 37:553–567.

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applica-

tions, volume 104 of Monographs on Statistics and Applied Probability. Chapman

& Hall, London, UK.

Rydberg, T. (2000). Realistic statistical modelling of financial data. International

Statistical Review, 68:233–258.

Sampson, P. D. (2010). Constructions for nonstationary spatial processes. In Gelfand,

A. E., Diggle, P., Guttop, P., and Fuentes, M., editors, Handbook of Spatial Statis-

tics, pages 119–130. CRC/Chapman and Hall, New York, NY.

Sampson, P. D. and Guttorp, P. (1992). Nonparametric estimation of nonstation-

ary spatial covariance structure. Journal of the American Statistical Association,

87:108–119.

Sanderson, C. (2010). Armadillo: An open source C++ linear algebra library for fast

prototyping and computationally intensive experiments. Technical report, NICTA,

Australia.

Schmidt, A. M. and Gelfand, A. E. (2003). A Bayesian coregionalization approach

for multivariate pollutant data. Journal of Geophysical Research: Atmospheres,

108:1–9.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution.

The Annals of Statistics, 9:1135–1151.

154 Stein, M. L. (1999). Interpolation of spatial data: Some theory for kriging. Springer,

New York.

Stein, M. L. (2005). Space–time covariance functions. Journal of the American

Statistical Association, 100:310–321.

Sun, Y., Li, B., and Genton, M. G. (2012). Geostatistics for large datasets. In Porcu,

E., Montero, J.-M., and Schlather, M., editors, Advances and challenges in space-

time modelling of natural events, Lecture Notes in Statistics, Vol. 207, pages 55–77.

Springer-Verlag, Berlin, Germany.

Taylor, S. (1994). Modeling stochastic volatility: A review and comparative study.

Mathematical Finance, 4:183–204.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of

the Royal Statistical Society: Series B (Methodological), 58:267–288.

Tsay, R. S. (2005). Analysis of financial time series. John Wiley & Sons, Inc.,

Hoboken, NJ.

Uhlenbeck, G. E. and Ornstein, L. S. (1930). On the theory of the Brownian motion.

Physical review, 36:823–841.

Verdinelli, I. and Wasserman, L. (1995). Computing Bayes factors using a gener-

alization of the Savage-Dickey density ratio. Journal of the American Statistical

Association, 90:614–618.

Wackernagel, H. (2003). Multivariate Geostatistics: An Introduction with Applica-

tions. Springer, Berlin, Germany, 3rd edition.

155 WHO Press (2006). WHO Air quality guidelines for particulate matter, ozone, nitro-

gen dioxide and sulfur dioxide: Global update 2005; Summary of risk assessment.

World Health Organization, Geneva, Switzerland.

Yan, J. (2007). Spatial stochastic volatility for lattice data. Journal of Agricultural,

Biological, and Environmental Statistics, 12:25–40.

Yang, L. (2006). A semiparametric GARCH model for foreign exchange volatility.

Journal of Econometrics, 130:365–384.

Yang, L., Hardle, W., and Nielsen, J. (1999). Nonparametric autoregression with mul-

tiplicative volatility and additive mean. Journal of Time Series Analysis, 20:579–

604.

Ying, Z. (1991). Asymptotic properties of a maximum likelihood estimator with data

from a Gaussian process. Journal of Multivariate Analysis, 36:280–296.

Yu, J. (2005). On leverage in a stochastic volatility model. Journal of Econometrics,

127:165–178.

Yu, J. (2012). A semiparametric stochastic volatility model. Journal of Econometrics,

167:473–482.

Zhang, H. (2004). Inconsistent estimation and asymptotically equal interpolations in

model-based geostatistics. Journal of the American Statistical Association, 99:250–

261.

Zhang, H. (2007). Maximum-likelihood estimation for multivariate spatial linear

coregionalization models. Environmetrics, 18:125–139.

156 Zhang, H. and El-Shaarawi, A. (2010). On spatial skew-Gaussian processes and

applications. Environmetrics, 21:33–47.

Zhao, P. and Yu, B. (2007). Stagewise lasso. The Journal of

Research, 8:2701–2726.

157