Multivariate isotonic regression

Konstantinos Fokianos, University of Cyprus, Department of Mathematics and Statistics, P.O. Box 20537, CY–1678 Nicosia, Cyprus. E-mail: [email protected]

Anne Leucht, Technische Universität Braunschweig, Institut für Mathematische Stochastik, Pockelsstraße 14, D–38106 Braunschweig, Germany. E-mail: [email protected]

Michael H. Neumann, Friedrich-Schiller-Universität Jena, Institut für Mathematik, Ernst-Abbe-Platz 2, D–07743 Jena, Germany. E-mail: [email protected]

Abstract. We consider a general monotone regression estimation problem for time series with multivariate regressors that may consist of nonstationary discrete and/or continuous components as well as a deterministic trend. We propose a modification of the classical isotonic estimator and establish its rate of convergence for the integrated L1-loss. The methodology captures the shape of the regression function without assuming additivity or a parametric form. Some simulations and a data example complement the study of the proposed estimator.

2010 Mathematics Subject Classification: Primary 62G08; secondary 62G20, 62H12, 62M10. Keywords and Phrases: Count time series, isotonic least squares estimation, multivariate isotonic regression, rates of convergence, Rosenthal inequality. Short title: Multivariate isotonic regression. Version: September 15, 2016.

1. Motivation

We consider the classical regression model

\[ Y_t = f(I_t) + \varepsilon_t \quad \text{with } E(\varepsilon_t \mid I_t) = 0 \text{ a.s.}, \quad t \in \mathbb{Z}, \tag{1.1} \]

where we assume that the regression function f: S ⊆ R^d → R is unknown and allow for dependent observations ((Y_t, I_t')')_t having a general time series structure. The primary aim of this work is to provide a nonparametric estimator of f subject to shape constraints; in particular, we will assume that the function f in (1.1) is isotonic. As we argue throughout the paper, the assumption of isotonicity encompasses various data generating processes, including monotone trends, which are observed in several applied problems.

The problem of estimating a regression function subject to shape constraints in the context of time series has not, to the best of our knowledge, been addressed adequately in the literature. There exists a large body of literature on estimation and testing for situations where the class of admissible functions f can be parametrized by a finite-dimensional parameter; see e.g. Seber and Wild (1989), Kedem and Fokianos (2002), Escanciano (2006), Francq and Zakoian (2010) and Shumway and Stoffer (2011), among others. There are also many results on nonparametric kernel estimators of f relying on the assumption that the covariate vector I_t has a Lebesgue density. For an overview, we refer the reader to the monographs by Härdle (1990) and Fan and Gijbels (1996). On the other hand, there are numerous applications in which the covariates do not possess a density with respect to Lebesgue measure; a case in point is the various count time series models which have been employed for the analysis of financial data (e.g. modeling the number of transactions) or biomedical data (e.g. modeling infectious diseases); see Fokianos et al. (2009), for instance. The aim of this paper is therefore to provide a nonparametric estimation procedure for the regression function under less restrictive conditions on the underlying distribution, so that various types of covariates can be included.
Instead, we impose a shape constraint and assume f to be isotonic but not necessarily additive. This general framework also allows for the inclusion of a trend component. Such a covariate accommodates the case of gradual changes over time, in contrast to change-point models with stationarity between the points of (abrupt) change. Two examples below show the usefulness of this approach and its applicability.

Example 1.1 ((Non-)linear ARCH models). Since the seminal paper by Engle (1982), (non-)linear ARCH models and their various generalizations have been used in financial mathematics, e.g. for volatility modeling. The nonlinear ARCH(p) model is given by

\[ Y_t = \sigma_t \varepsilon_t, \quad \text{with } \sigma_t^2 = f(Y_{t-1}, \dots, Y_{t-p}), \quad t \in \mathbb{Z}, \]

where (ε_t)_{t∈Z} is a sequence of i.i.d. random variables with Eε_t² = 1. Hence, with η_t = σ_t²(ε_t² − 1), t ∈ Z, the squared process can be represented in the form of (1.1),

\[ Y_t^2 = f(I_t) + \eta_t \quad \text{with } I_t = (Y_{t-1}^2, \dots, Y_{t-p}^2)'. \]

In particular, the linear ARCH(p) model with nonnegative coefficients satisfies the monotonicity constraint on f and falls within the framework of (1.1).
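The squared-process representation above can be illustrated by a small simulation (this is not code from the paper; the parameter values a0, a1 below are arbitrary choices for a linear ARCH(1) model):

```python
import numpy as np

# Illustration: for a linear ARCH(1) process with (assumed) parameters a0, a1,
# the errors eta_t = Y_t^2 - (a0 + a1*Y_{t-1}^2) = sigma_t^2*(eps_t^2 - 1)
# should have conditional mean zero; we check this on coarse bins of the
# regressor x = Y_{t-1}^2.
rng = np.random.default_rng(0)
a0, a1, n = 0.5, 0.4, 200_000
y = np.zeros(n)
for t in range(1, n):
    y[t] = np.sqrt(a0 + a1 * y[t - 1] ** 2) * rng.standard_normal()

x = y[:-1] ** 2                       # regressor: squared lag
eta = y[1:] ** 2 - (a0 + a1 * x)      # eta_t = sigma_t^2 * (eps_t^2 - 1)
edges = np.quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])
bin_id = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, 3)
cond_means = [eta[bin_id == j].mean() for j in range(4)]
print([round(m, 3) for m in cond_means])   # all close to zero
```

The per-bin means of eta are close to zero, in line with the martingale-difference structure E(η_t | I_t) = 0.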

Example 1.2 (Count time series with exogenous variables). A classical model for integer-valued time series is that of Poisson autoregression; that is, observations Y_1,…,Y_n are available with (see Ferland et al. (2006), for instance)

\[ Y_t \mid \sigma(Y_{t-1}, Y_{t-2}, \dots) \sim \mathrm{Poisson}(\lambda_t), \]

where

\[ \lambda_t = f(Y_{t-1}, \dots, Y_{t-p}, \lambda_{t-1}, \dots, \lambda_{t-q}) \tag{1.2} \]

is the unobserved process of intensities. Special cases of the above specification are the INGARCH(1,1) model, where f(Y_{t−1}, λ_{t−1}) = θ_0 + θ_1 Y_{t−1} + θ_2 λ_{t−1}, and the INARCH(p) model, with f(Y_{t−1},…,Y_{t−p}) = θ_0 + θ_1 Y_{t−1} + ⋯ + θ_p Y_{t−p}. For nonnegative parameters θ_i, the mean function f is monotonically non-decreasing in each argument. In this contribution, we will be working with ARCH-type models, though. The inclusion of the λ_t terms poses challenging issues since it is a hidden process. The same comment applies to the GARCH model (Example 1.1); work in the context of nonparametric estimation of GARCH models has been reported by Audrino and Bühlmann (2009) and Meister and Kreiss (2016). Recalling model (1.2), but with past λ_t's omitted, consider the inclusion of a trend component and/or a covariate vector as follows (see, for example, Davis et al. (2000)):
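A minimal simulation of the INARCH(1) special case (illustrative only; the parameter values are assumptions of this sketch) shows the monotone intensity recursion in action:

```python
import numpy as np

# Illustration with assumed parameters: an INARCH(1) count series with
# intensity lambda_t = theta0 + theta1 * Y_{t-1}, which is monotonically
# non-decreasing in the lagged count. Its stationary mean is
# theta0 / (1 - theta1).
rng = np.random.default_rng(1)
theta0, theta1, n = 1.0, 0.5, 10_000
y = np.zeros(n, dtype=np.int64)
y[0] = rng.poisson(theta0 / (1.0 - theta1))   # start near the stationary mean
for t in range(1, n):
    y[t] = rng.poisson(theta0 + theta1 * y[t - 1])

print(y.mean())   # close to theta0 / (1 - theta1) = 2.0
```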

\[ \lambda_t = f(Y_{t-1}, t/n, Z_t). \tag{1.3} \]

This model can be represented as

\[ Y_t = f(I_t) + \varepsilon_t \quad \text{with } I_t = (Y_{t-1}, t/n, Z_t')', \]

and, for a certain choice of the white noise sequence (ε_t)_t, falls within the framework of (1.1). In model (1.3), the explanatory variables (Y_{t−1}, t/n, Z_t')' do not have a density w.r.t. Lebesgue measure. Moreover, in model (1.2) the nature of the stationary distribution of λ_t is sometimes unclear; it can be discrete or continuous or even a mixture thereof. Hence, application of standard nonparametric methods, such as kernel estimators of the function f as proposed e.g. by Dette et al. (2006) and references therein, does not alleviate the aforementioned problems. Furthermore, with dependent errors, a data-driven choice of smoothing parameters, such as a bandwidth, is a challenging issue. While simple leave-one-out cross-validation may fail, the method of leave-k-out cross-validation involves a choice of k, which in turn requires a difficult subjective decision; see e.g. Chu and Marron (1991). On the other hand, the isotonic least squares estimator does not require the choice of any smoothing parameter, since an appropriate tuning of the degree of smoothing is done automatically. This estimator seems to be less sensitive to irregularities in the design, and if the target function is indeed isotonic then this estimator is consistent; see e.g. Christopeit and Tosstorff (1987) and references therein. The assumption of isotonicity is often appropriate in the context of models (1.2) and (1.3) and, in fact, some popular parametric models share this property. We propose a new modification of the isotonic least squares estimator and show that it attains rates of convergence that are known to be optimal in comparable settings.
In sharp contrast to the usual nonparametric estimators, this estimator does not require the choice of an appropriate bandwidth, which could cause problems in our general setting with a possibly irregular distribution of the explanatory variables and with dependent observations. The paper is structured as follows. We introduce our estimators and present results on their rates of convergence in Section 2. Numerical examples are discussed in Section 3. All proofs, as well as technical auxiliary results, are deferred to Section 4.

2. The estimator and its asymptotic properties

In the sequel, x' denotes the transpose of a vector x. We assume that (Y_{n,1}, I_{n,1}')',…,(Y_{n,n}, I_{n,n}')' are observed, where Y_{n,t} is a real-valued and I_{n,t} an R^d-valued random variable such that

\[ Y_{n,t} = f(I_{n,t}) + \varepsilon_{n,t}, \quad t = 1, \dots, n, \]

with E(ε_{n,t} | I_{n,t}) = 0 almost surely. We assume that the conditional mean function f is isotonic, that is, monotonically non-decreasing in each argument. A popular estimator is the isotonic least squares estimator f̃_n, which is given as

\[ \tilde f_n = \arg\min_{g \text{ isotonic}} \sum_{t=1}^n \big(Y_{n,t} - g(I_{n,t})\big)^2. \]

It is well known that f̃_n satisfies, at all observation points x ∈ {I_{n,1},…,I_{n,n}}, the following equations:

\[ \tilde f_n(x) = \max_{U\colon x \in U} \ \min_{L\colon x \in L} \operatorname{Av}_Y(L \cap U) \tag{2.1} \]
\[ \phantom{\tilde f_n(x)} = \min_{L\colon x \in L} \ \max_{U\colon x \in U} \operatorname{Av}_Y(L \cap U), \tag{2.2} \]

where

\[ \operatorname{Av}_Y(B) = \frac{\sum_{t=1}^n Y_{n,t}\, 1(I_{n,t} \in B)}{\#\{t \le n\colon I_{n,t} \in B\}}, \quad B \subseteq \mathbb{R}^d, \]

and U and L denote upper and lower sets, respectively; see e.g. Theorem 1 in Brunk (1955) and Theorem 1.4.4 in Robertson et al. (1988, p. 23). (A set U ⊆ R^d is called an upper set if x ∈ U and x ⪯ y imply that y ∈ U; analogously, L ⊆ R^d is called a lower set if x ∈ L and x ⪰ y imply that y ∈ L. Here x ⪯ y and x ⪰ y mean that x_i ≤ y_i or x_i ≥ y_i, respectively, for all i = 1,…,d.) While f̃_n is uniquely defined at the observation points, there is some arbitrariness in choosing f̃_n between these points; only the postulated isotonicity has to be satisfied.

There are already several asymptotic results for the classical isotonic least squares estimator f̃_n in dimension d = 1, mostly derived in the case of deterministic regressors. Pointwise asymptotic distributions of isotonic least squares estimators under short- and long-range dependence have been derived in Anevski and Hössjer (2006) and Dedecker et al. (2011). In particular, it is known that this estimator converges at the optimal rate n^{−1/3} to f. Zhang (2002, Theorem 2.3) shows, in the case of independent, not necessarily i.i.d., errors, that (n^{−1} Σ_{i=1}^n E|f̃_n(t_i) − f(t_i)|^p)^{1/p} = O(n^{−1/3}), where t_1,…,t_n are the (deterministic) covariates and 1 ≤ p ≤ 3; see also Chatterjee et al. (2015) for a refinement for p = 2 under the assumption of i.i.d. errors. Furthermore, Durot (2002, Theorem 1) proves that E[∫_0^1 |f̃_n(x) − f(x)| dx] = O(n^{−1/3}).

On the other hand, much less is known in the multivariate case. The only results for f̃_n we are aware of are the following. Hanson et al. (1973, Theorem 5) prove uniform consistency of f̃_n in the case d = 2 under the assumption of deterministic regressors and a continuous target function f. Moreover, they provide intuition about the convergence of the probability of a large deviation of the estimator from the true regression function towards zero; see their formula (26). Robertson and Wright (1975, Theorems 2.1 and 2.2) state pointwise consistency of f̃_n in the context of a general partial order for the regressors. Finally, Christopeit and Tosstorff (1987, Theorem 1) provide a consistency result in the general d-dimensional case with continuous stochastic covariates and martingale difference innovations. To the best of our knowledge, there are no rate results for the multivariate isotonic LSE yet. We conjecture that a serious obstacle to deriving rates of convergence in the case d ≥ 2 is that there are simply too many possible lower and upper sets involved in (2.1) and (2.2); see e.g. Gao and Wellner (2007) as well as the discussion in Section 3 of Wu et al. (2005).
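In the univariate case, the min-max characterization (2.1) can be evaluated directly by brute force (an illustrative O(n²) sketch, not code from the paper); for sorted design points it coincides with the pool-adjacent-violators fit:

```python
import numpy as np

# Brute-force evaluation of (2.1) for d = 1 with sorted design points: the
# isotonic LSE at the j-th point equals the max over i <= j of the min over
# k >= j of the mean of Y_i, ..., Y_k.
def isotonic_minmax(y):
    n = len(y)
    c = np.concatenate(([0.0], np.cumsum(y)))           # prefix sums of y
    avg = lambda i, k: (c[k + 1] - c[i]) / (k + 1 - i)  # mean of y[i..k]
    return np.array([
        max(min(avg(i, k) for k in range(j, n)) for i in range(j + 1))
        for j in range(n)
    ])

y = np.array([1.0, 3.0, 2.0, 2.0, 5.0])
fit = isotonic_minmax(y)
print(fit)   # [1, 7/3, 7/3, 7/3, 5] -- the pool-adjacent-violators fit
```

The violating block (3, 2, 2) is pooled to its mean 7/3, exactly as the pool-adjacent-violators algorithm would do.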

We actually employ f̃_n in the univariate setup, that is, if d = 1. However, in the multivariate case we propose a slightly simpler estimator, as follows. It turns out that the entropy problem can be avoided if we restrict ourselves to lower and upper sets of (hyper-)rectangular type. For example, we could use the following estimators:

\[ \underline f_n(x) = \max_{a\colon a \preceq x} \ \min_{b\colon b \succeq x} \operatorname{Av}_Y(\llbracket a, b \rrbracket) \]

as well as

\[ \bar f_n(x) = \min_{b\colon b \succeq x} \ \max_{a\colon a \preceq x} \operatorname{Av}_Y(\llbracket a, b \rrbracket), \]

where, for a, b ∈ R^d, ⟦a, b⟧ = [a^{(1)}, b^{(1)}] × ⋯ × [a^{(d)}, b^{(d)}]. It follows from the construction that \underline f_n and \bar f_n are isotonic and that \underline f_n(x) ≤ \bar f_n(x) holds for all x. We propose to choose f̂_n as an isotonic function such that

\[ \underline f_n(x) \le \hat f_n(x) \le \bar f_n(x) \quad \forall x; \tag{2.3} \]

for example, we can choose f̂_n(x) = (\underline f_n(x) + \bar f_n(x))/2. When d = 1, any choice of f̂_n between \underline f_n and \bar f_n is equal to f̃_n at the observation points. However, it may deviate from f̃_n in the multivariate case. As can be seen in the proof of Theorem 2.1 below, replacing lower and upper sets in (2.1) and (2.2) by hyperrectangles greatly simplifies the derivation of the desired rate of convergence, as well as the computation of the estimator.
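The rectangle-based estimators can be sketched by brute force (illustrative only; here the corners a, b are restricted to observed design points, and the simulated data and target function are assumptions of this example):

```python
import numpy as np

# Brute-force sketch of the lower and upper rectangle estimators: the lower
# estimate is the max over corners a <= x0 of the min over corners b >= x0 of
# the average of Y over the hyperrectangle [a, b]; the upper estimate swaps
# max and min.
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(300, 2))
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(300)  # isotonic f(x)=x1+x2

def minmax_rectangles(X, y, x0):
    lows = X[np.all(X <= x0, axis=1)]      # feasible lower corners a
    ups = X[np.all(X >= x0, axis=1)]       # feasible upper corners b

    def rect_avg(a, b):
        inside = np.all((X >= a) & (X <= b), axis=1)
        return y[inside].mean()            # [a, b] contains a, so nonempty

    f_lower = max(min(rect_avg(a, b) for b in ups) for a in lows)
    f_upper = min(max(rect_avg(a, b) for a in lows) for b in ups)
    return f_lower, f_upper

lo, up = minmax_rectangles(X, y, np.array([0.5, 0.5]))
print(lo <= up)   # True: the max-min never exceeds the min-max
```

The ordering lo ≤ up is the finite-sample analogue of \underline f_n ≤ \bar f_n, and the midpoint (lo + up)/2 is a candidate for f̂_n from (2.3).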

For any process (X_t)_{t∈Z}, we define the coefficients of uniform mixing as φ(0) = 1 and, for r ∈ N,

\[ \phi(r) = \sup_t \sup\left\{ \left| \frac{P(U \cap V)}{P(U)} - P(V) \right| \colon U \in \sigma(X_t, X_{t-1}, \dots),\ P(U) \neq 0,\ V \in \sigma(X_{t+r}, X_{t+r+1}, \dots) \right\}. \]

We also recall the definition of the coefficients of strong mixing, for r ∈ N,

\[ \alpha(r) = \sup_t \sup\left\{ |P(U)P(V) - P(U \cap V)| \colon U \in \sigma(X_t, X_{t-1}, \dots),\ V \in \sigma(X_{t+r}, X_{t+r+1}, \dots) \right\}, \]

and set α(0) = 1/4. In the following we assume to have observations of either a uniform or a strong mixing process at hand. Moreover, we assume the information variables to be of the form I_{n,t} = (X_{n,t}', Z_{n,t}')', where X_{n,t} is a d_1-dimensional vector consisting of components with values in N_0, and Z_{n,t} is a d_2-dimensional covariate consisting of variables with continuous marginal distribution functions and possibly a trend component t/n. Here, we allow for d_1, d_2 ∈ N_0 with d = d_1 + d_2 > 0. Note that, by setting d_2 = 0 or d_1 = 0, it is possible that I_{n,t} is just equal to X_{n,t} or Z_{n,t}, respectively.

(A1) (i) ((Y_{n,t}, I_{n,t}')')_{t=1,…,n} is uniform (φ-) mixing with coefficients satisfying φ_n(r) ≤ φ(r), where Σ_{r=1}^∞ r² φ^{1/4}(r) < ∞.

(ii) ((Y_{n,t}, I_{n,t}')')_{t=1,…,n} is strong (α-) mixing with coefficients satisfying α_n(r) ≤ α(r), where Σ_{r=1}^∞ r^{q_1} α^{q_2}(r) < ∞ for some q_1 ≥ (2 ∨ (5d_2/4 − 2)) with q_1 ∈ 2N_0 = {2k | k ∈ N_0} and q_2 = 1/(1 + (q_1 + 2)/2).

Example 1.2 (continued): Suppose that

\[ Y_{n,t} \mid \mathcal{F}_t \sim \mathrm{Poisson}(f(I_{n,t})), \]

where I_{n,t} = (Y_{n,t−1},…,Y_{n,t−d_1}, Z_{n,t}')', \mathcal{F}_t = σ(Z_{n,t}, Y_{n,t−1}, Z_{n,t−1}, Y_{n,t−2}, Z_{n,t−2},…), and (Z_{n,t})_{t∈Z} is a sequence of R^{d_2}-valued covariates (possibly including a trend component t/n) such that Z_{n,t+1} is independent of \mathcal{F}_t. Then (I_{n,t})_t is a Markov chain. Assume in addition that f: R^{d_1+d_2} → [0, M] is bounded.

We define, for probability distributions P_1 and P_2 on (R^d, B^d) with densities p_1 and p_2 w.r.t. some σ-finite measure μ, the subprobability P_1 ∧ P_2 as

\[ (P_1 \wedge P_2)(B) = \int_B p_1 \wedge p_2 \, d\mu \quad \forall B \in \mathcal{B}^d. \]

Let ‖P_1 ∧ P_2‖ := (P_1 ∧ P_2)(R^d). (Then ‖P_1 ∧ P_2‖ = 1 − ‖P_1 − P_2‖_{Var}/2, where ‖P_1 − P_2‖_{Var} = 2 sup{|P_1(B) − P_2(B)| : B ∈ B^d} is the total variation of the signed measure P_1 − P_2.) Since

\[ \inf\big\{ \|\mathrm{Poisson}(\lambda_1) \wedge \mathrm{Poisson}(\lambda_2)\| \colon \lambda_1, \lambda_2 \in [0, M] \big\} = \inf\Big\{ \sum_{k=0}^{\infty} \min\{e^{-\lambda_1}\lambda_1^k/k!,\ e^{-\lambda_2}\lambda_2^k/k!\} \colon \lambda_1, \lambda_2 \in [0, M] \Big\} =: \rho > 0, \]

we obtain that

\[ \inf\big\{ \|P^{Y_{n,t} \mid I_{n,t} \in D_1} \wedge P^{Y_{n,t} \mid I_{n,t} \in D_2}\| \colon P^{I_{n,t}}(D_1) > 0,\ P^{I_{n,t}}(D_2) > 0 \big\} \ge \rho. \]

Continuing in the same way we get

\[ \inf\big\{ \|P^{(Y_{n,t},\dots,Y_{n,t+d_1-1}) \mid I_{n,t} \in D_1} \wedge P^{(Y_{n,t},\dots,Y_{n,t+d_1-1}) \mid I_{n,t} \in D_2}\| \colon P^{I_{n,t}}(D_1) > 0,\ P^{I_{n,t}}(D_2) > 0 \big\} \ge \rho^{d_1}. \]

Since Z_{n,t+d_1} is independent of (I_{n,t}, Y_{n,t},…,Y_{n,t+d_1−1}), we obtain that

\[ \inf\big\{ \|P^{I_{n,t+d_1} \mid I_{n,t} \in D_1} \wedge P^{I_{n,t+d_1} \mid I_{n,t} \in D_2}\| \colon P^{I_{n,t}}(D_1) > 0,\ P^{I_{n,t}}(D_2) > 0 \big\} \ge \rho^{d_1}. \]

This implies that the process (I_{n,t})_{t∈Z} is uniform mixing with φ(d_1) ≤ 1 − ρ^{d_1}. Since φ(m d_1) ≤ (φ(d_1))^m, we see that the process ((Y_{n,t}, I_{n,t}')')_t satisfies the mixing condition (A1)(i).
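The minorization constant ρ can be computed numerically (an illustrative check, not from the paper; the value M = 3 is an arbitrary choice):

```python
import math

# Numerical check of the minorization constant rho: the minimal overlap of
# two Poisson laws with means in [0, M] is attained at the pair (0, M),
# where only the k = 0 term contributes, giving rho = exp(-M).
def poisson_pmf(lam, kmax):
    if lam == 0.0:
        return [1.0] + [0.0] * kmax
    p = [math.exp(-lam)]
    for k in range(1, kmax + 1):
        p.append(p[-1] * lam / k)       # pmf recursion: p_k = p_{k-1}*lam/k
    return p

def overlap(l1, l2, kmax=60):
    return sum(map(min, poisson_pmf(l1, kmax), poisson_pmf(l2, kmax)))

M = 3.0
grid = [i * M / 30 for i in range(31)]
rho = min(overlap(l1, l2) for l1 in grid for l2 in grid)
print(rho, math.exp(-M))   # both equal exp(-3), about 0.0498
```

In particular ρ > 0 for any finite M, which is exactly what the bounded-intensity assumption f ≤ M buys in the mixing argument above.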

Having in mind that I_{n,t} = (X_{n,t}', Z_{n,t}')' contains d_1 ≥ 0 components with a discrete distribution and d_2 ≥ 0 components having either a continuous distribution or being nonrandom, such as t/n, we impose the following condition:

(A2) For t = 1,…,n and n ∈ N, the random vectors Z_{n,t} consist of components with continuous marginal distribution functions and/or a trend component t/n. Moreover, there exist continuous distribution functions G_1,…,G_{d_2} on R and, for all K ∈ N, constants 0 < C_1, C_2 = C_2(K), C_3 = C_3(K) such that

\[ C_1 \prod_{i=1}^{d_2} \big(G_i(b_i) - G_i(a_i)\big) - \frac{C_2}{n} \le \frac{1}{n} \sum_{t=1}^n P\big(X_{n,t} = (k_1, \dots, k_{d_1})',\ Z_{n,t} \in (a_1, b_1] \times \dots \times (a_{d_2}, b_{d_2}]\big) \le C_3 \prod_{i=1}^{d_2} \big(G_i(b_i) - G_i(a_i)\big) + \frac{C_2}{n} \quad \forall k_1, \dots, k_{d_1} \le K,\ \forall a_i \le b_i. \]

Before we proceed, some comments on assumption (A2) are in order. This condition means that the "average distribution" of the continuous random variables behaves like a d_2-dimensional product distribution which has, after an appropriate rescaling with G_1^{−1},…,G_{d_2}^{−1}, a density bounded away from zero on [0,1]^{d_2}. The term 1/n is needed to accommodate the possible case of a trend variable t/n. The role played by the terms on the left-hand side and the right-hand side can also be understood from the following special cases.

Remark 1. (i) Suppose that (I_{n,t})_{t=0,…,n; n∈N_0} = (I_t)_{t∈N_0} is a strictly stationary process with continuous marginal distributions of the last d_2 components and such that P^{I_t^{(d_1+1)}} ⊗ ⋯ ⊗ P^{I_t^{(d_1+d_2)}} ≪ P^{(I_t^{(d_1+1)},…,I_t^{(d_1+d_2)})} ≪ P^{I_t^{(d_1+1)}} ⊗ ⋯ ⊗ P^{I_t^{(d_1+d_2)}}. Then (A2) holds true with G_i being the c.d.f. of I_0^{(d_1+i)}.

(ii) Suppose that I_{n,t} = t/n is a purely deterministic trend component. Then (A2) holds true with G_1(x) = max{min{x, 1}, 0} and C_1 = C_2 = C_3 = 1.

To simplify notation, we suppress the index n in Y_{n,t}, I_{n,t} and ε_{n,t} from here on, keeping in mind that a triangular scheme is still allowed, e.g., when a trend variable t/n is included. We define

\[ M_n = \big[n^{1/(2+d_2)}\big], \quad h_n = 1/M_n, \]

and, for multi-indices k = (k_1,…,k_d)' with 0 ≤ k_j ≤ K ∀ j = 1,…,d_1 and 1 ≤ k_j ≤ M_n ∀ j = d_1+1,…,d, subsets of the domain of f as

\[ B_k = \{(k_1, \dots, k_{d_1})'\} \times \big(G_1^{-1}((k_{d_1+1}-1)h_n),\, G_1^{-1}(k_{d_1+1} h_n)\big] \times \dots \times \big(G_{d_2}^{-1}((k_d - 1)h_n),\, G_{d_2}^{-1}(k_d h_n)\big]. \]

Since the estimator f̂_n is based on means over hyperrectangles, a "regular" behavior can be expected if there are sufficiently many observations in each box B_k. It turns out that regularity can be assured if the following event occurs:

\[ A_n = \big\{\omega\colon \#\{t \le n\colon I_t(\omega) \in B_k\} \ge C_4\, n^{2/(2+d_2)} \ \forall k \in \{0,\dots,K\}^{d_1} \times \{1,\dots,M_n\}^{d_2}\big\}, \tag{2.4} \]

where C_4 is some positive constant. The following lemma shows that the event A_n actually occurs with a probability tending to one.

Lemma 2.1. Suppose that (A2) is fulfilled. Furthermore, assume that either (A1)(i) or (A1)(ii) is satisfied. Then, for sufficiently small C_4 > 0 in (2.4),

\[ P(A_n) \xrightarrow[n\to\infty]{} 1. \]

It is well known that the traditional isotonic estimator f̃_n(x) is problematic when x is close to the boundary of the support of the distribution of the I_t. The same is true for our estimator f̂_n near the boundary of the domain. To fix the bias problem at extremely small and large design points, discussed e.g. in Sampson et al. (2003), Wu et al. (2005) proposed an adequate modification by pulling up and down the isotonic LSE at the extremal points. We do not implement a boundary correction since this would involve some sort of tuning parameter whose appropriate choice is somewhat subjective. Taking this into account, we neglect the behaviour of f̂_n near the boundary and focus on estimating f on

\[ D_n = \bigcup_{k \in K_n} B_k = \{0,\dots,K\}^{d_1} \times \big(G_1^{-1}(h_n),\, G_1^{-1}(1-h_n)\big] \times \dots \times \big(G_{d_2}^{-1}(h_n),\, G_{d_2}^{-1}(1-h_n)\big], \]

where K_n = {k : 0 ≤ k_1,…,k_{d_1} ≤ K, 1 < k_{d_1+1},…,k_d < M_n}. Denote by Q_1,…,Q_{d_2} the probability measures corresponding to the distribution functions G_1,…,G_{d_2}, respectively. With μ being the counting measure on N_0^{d_1}, define ν = μ ⊗ Q_1 ⊗ ⋯ ⊗ Q_{d_2}. In order to establish rates of convergence for our new estimators, we require some additional assumptions:

(A3) (i) The process ((Y_{n,t}, I_{n,t}')')_{t=1,…,n}, n ∈ N, is φ-mixing and sup_{x,t,n} E(ε_{n,t}^4 | I_{n,t} = x) < ∞.

(ii) The process ((Y_{n,t}, I_{n,t}')')_{t=1,…,n}, n ∈ N, is α-mixing such that Σ_{r=1}^∞ r^{2+3δ/(4+δ)} [α(r)]^{δ/(4+δ)} < ∞ for some δ > 0. It holds that sup_{x,t,n} E(|ε_{n,t}|^{4+δ} | I_{n,t} = x) < ∞; for any B ∈ B^d, define η_{n,t} = ε_t 1((X_{n,t}', Z_{n,t}^{(1)},…,Z_{n,t}^{(d_2)})' ∈ B) and assume that P^{(η_{n,t_1},…,η_{n,t_4})} ≪ ⊗_{j=1}^4 P^{η_{n,t_j}}, t_1,…,t_4 ∈ {1,…,n}. Moreover, it holds either

(a) that (Z_{n,t})_{t,n} have uniformly bounded densities w.r.t. ⊗_{i=1}^{d_2} Q_i, or

(b) that Z_{n,t}^{(1)} = t/n and (Z_{n,t}^{(2)},…,Z_{n,t}^{(d_2)})_{t,n} have uniformly bounded densities w.r.t. ⊗_{i=2}^{d_2} Q_i.

Conditions (a) and (b) above are more restrictive assumptions on the covariates Z_{n,t} than (A2). It can be seen from the proof of Theorem 2.1 below that, under α-mixing, regularity of their average distribution as stated in (A2) is not sufficient to deduce the desired rates of convergence.

Theorem 2.1. Suppose that (A2) and either (A1)(i) and (A3)(i) or (A1)(ii) and (A3)(ii) are fulfilled. Then,

\[ E\Big[ \int_{D_n} |\hat f_n(x) - f(x)| \, \nu(dx)\, 1_{A_n} \Big] = O\big(n^{-1/(2+d_2)}\big). \]

Here, in the definition of A_n, the constant C_4 is chosen such that the assertion of Lemma 2.1 holds true.

Remark 2. (i) It follows from this theorem and Lemma 2.1 that

\[ \int_{D_n} |\hat f_n(x) - f(x)| \, \nu(dx) = O_P\big(n^{-1/(2+d_2)}\big). \]

(ii) In the special case of a partially differentiable and bounded function f: [0,1]^d → R, isotonicity and boundedness imply that ∫_{[0,1]^d} Σ_{i=1}^d |∂_i f(x)| λ^d(dx) ≤ d (sup_x{f(x)} − inf_x{f(x)}). Hence, the degree of smoothness β, measured in the L_1-norm, is equal to 1. It is well known that, under appropriate conditions, the optimal rate of convergence for the L_1-loss is n^{−β/(2β+d)} = n^{−1/(2+d)}; see Stone (1982). Due to the somewhat "imprecise" (mixing) conditions, we do not prove optimality of this rate in our context. Nevertheless, we are convinced that this rate cannot be improved in general.

(iii) It might be surprising that our rate of convergence does not depend on the number of discrete explanatory random variables. This is explained by the fact that, for any (k_1,…,k_{d_1})' ∈ {0,…,K}^{d_1}, the cardinality of the set {t ≤ n : X_{n,t} = (k_1,…,k_{d_1})'} is proportional to the full sample size n. Therefore, there is no need to smooth over the first d_1 directions, and there is no loss due to a trade-off between bias and variance that would appear with nonparametric smoothing techniques.

3. Examples

3.1. A simulation study. We illustrate the theoretical results by a simulation study which applies the aforementioned results to an additive and a non-additive model. More specifically, consider the ARCH model discussed in Example 1.1, specified by including a trend component:

\[ Y_t = \sigma_t \varepsilon_t, \quad \text{with } \sigma_t^2 = d + a Y_{t-1}^2 + b\,\frac{t}{n}, \tag{3.1} \]

where (ε_t)_t is an i.i.d. sequence of standard normal random variables and a, b are positive parameters with a < 1. In addition, consider the following non-linear AR model:

\[ Y_t = d + a Y_{t-1}\Big(1 - \exp\Big(-b\,\frac{t}{n}\Big)\Big) + \varepsilon_t. \tag{3.2} \]

In this case the i.i.d. sequence (ε_t)_t is drawn from an exponential distribution with mean one. The parameters are chosen such that 0 < a < 1 and b > 0. Model (3.1) corresponds to the case of an additive model for Y_t² with f(y, s) = d + ay + bs, whereas model (3.2) is an example of a non-additive model with f(y, s) = d + ay(1 − exp(−bs)). For both models, the function f is isotonic. We compare the estimator given by (2.3) with the standard parametric least squares estimator corresponding to model (3.1) and with the estimator proposed recently by Chen and Samworth (2016). The latter estimator is denoted by SCAR; it has been developed for additive models and independent data. However, it is still interesting to study, at least empirically, its performance on time series data.

All results are based on 1000 simulations and 500 observations. To compare the empirical performance of the estimators, we employ their mean absolute error (MAE), that is, (1/n) Σ_{t=1}^n |f̂_n(x_t) − f(x_t)|, where x_t denotes the value of the covariate vector at time t. In fact, we exclude a small number of values at the beginning and end of the series (1% at each side) because of the boundary problems. The computation of the proposed estimate f̂_n = (\underline f_n + \bar f_n)/2 is done by considering a grid of intervals around each component of x_t. In this case, we use a grid of ten intervals for each component of the covariate vector. The additive estimator of Chen and Samworth (2016) is computed using the R package scar.

Figure 1 displays the results. The top row shows boxplots of the MAE for the ARCH model (3.1) and for various choices of a and b. In this case, the standard LSE generally performs better than the other estimators considered. The estimator proposed by Chen and Samworth (2016) performs better than (2.3) because of the additivity of the assumed model. The bottom row shows boxplots of the MAE for all estimators under model (3.2). Clearly, (2.3) performs better than the rest of the estimators in terms of MAE. Note that model (3.2) is not additive; therefore we would expect (2.3) to take this feature of the data into account more properly than the other estimators. It is worth pointing out that the estimator proposed by Chen and Samworth (2016) performs comparably to the standard least squares estimator.

3.2. Data Example. We apply the theoretical results to a study of population growth. In particular, we investigate the growth of the population of whooping cranes, which became nearly extinct during the period 1938–1955. Whooping cranes are among the largest birds in North America but also among the rarest on the continent. For some time their population was constantly decreasing and reached about 20 individual birds in the world. With the employment of various conservation measures the population has grown over the last years. The data are depicted in Figure 2, which shows the growth of the population of whooping cranes between 1938 and 2007; see Int. Recovery Plan (2007). These are count time series data; recall Example 1.2. Clearly, the plot shows an increasing trend and a strong autocorrelation between the values of the series. The partial autocorrelation function illustrates further that the number of whooping cranes depends strongly on their population one year before.
We fit a model of the form (1.3) to these data, with λ_t = f(y_{t−1}, t/n), where n is the number of effective observations (in this case, the sample size is 70 and therefore n = 69 because of the inclusion of y_{t−1}). We examine the performance of each of the methods described before for estimating the model. This task is accomplished by studying the in-sample predictive power using the mean absolute prediction error (MAPE), that is, (1/n) Σ_{t=1}^n |Y_t − Ŷ_t|. The results are shown



Figure 1. Boxplots of MAE. Top row: results from ARCH model (3.1) with d = 0.2 and (a) a = 0.1, b = 0.7, (b) a = 0.6, b = 0.7, (c) a = 0.1, b = 0.2. Bottom row: results from the nonlinear AR model (3.2) with d = 0.2 and (e) a = 0.1, b = 0.7, (f) a = 0.6, b = 0.7, (g) a = 0.1, b = 0.2.

LSE     Min-Max 10   Min-Max 20   Min-Max 30   Min-Max 40   SCAR
5.639   3.797        3.594        3.498        2.847        2.566

Table 1. Mean absolute prediction error obtained from the least squares estimator, the isotonic estimator (2.3) for a varying number of grid intervals, and SCAR.

in Table 1. The isotonic estimator has been computed for a varying number of intervals. Recall that the computation of the proposed estimate (2.3) is done by considering a grid of intervals around each component of the covariate. Hence, Min-Max 10 means that we create a 10 × 10 grid of intervals and apply (2.3); similarly for the other entries in the table. The results of the data analysis show that the SCAR estimator is superior to the rest, but (2.3) fits the data quite well, especially when a denser grid of intervals is considered. It is worth reiterating that our approach does not require an additivity assumption.
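The pipeline of Section 3 can be sketched end to end as follows (the model form, parameter values, sample size and evaluation grid are all assumptions of this illustration, and the O(n³) fit is deliberately brute force):

```python
import numpy as np

# End-to-end sketch: simulate a nonlinear AR path with trend covariate t/n,
# evaluate the min-max estimate f_hat = (lower + upper)/2 at interior design
# points, and report the mean absolute error against the true f.
rng = np.random.default_rng(4)
n, d0, a, b = 150, 0.2, 0.6, 0.7
f = lambda y, s: d0 + a * y * (1.0 - np.exp(-b * s))
y = np.zeros(n + 1)
for t in range(1, n + 1):
    # exponential innovations with mean one, centred to mean zero
    y[t] = f(y[t - 1], t / n) + rng.exponential(1.0) - 1.0

X = np.column_stack([y[:-1], np.arange(1, n + 1) / n])  # (Y_{t-1}, t/n)
Y = y[1:]

def fit_at(x0):
    lows = X[np.all(X <= x0, axis=1)]   # feasible lower corners
    ups = X[np.all(X >= x0, axis=1)]    # feasible upper corners
    avg = lambda a_, b_: Y[np.all((X >= a_) & (X <= b_), axis=1)].mean()
    lo = max(min(avg(a_, b_) for b_ in ups) for a_ in lows)
    up = min(max(avg(a_, b_) for a_ in lows) for b_ in ups)
    return 0.5 * (lo + up)

idx = np.arange(10, n - 10, 10)         # skip design points near the boundary
mae = np.mean([abs(fit_at(X[t]) - f(X[t, 0], X[t, 1])) for t in idx])
print(round(mae, 3))
```

Skipping the boundary design points mirrors the trimming used in the MAE computation of Section 3.1.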

4. Proofs

We prove some auxiliary results in Section 4.1. The proofs of the main results of Section 2 are given in Section 4.2.


Figure 2. (a) Time series plot of the yearly number of whooping cranes between 1938 and 2007. (b) Autocorrelation function. (c) Partial autocorrelation function.

4.1. General tools: inequalities for uniform or strong mixing processes. The following lemma provides an upper bound for the fourth moment of sums of uniformly or strongly mixing, possibly nonstationary, random variables.

Lemma 4.1. Suppose that η_1,…,η_n are real-valued random variables with Eη_t = 0 for all t = 1,…,n.

(i) If (η_t)_{t=1,…,n} is uniform (φ-) mixing and E[η_t^4] < ∞, then

\[ E\Big[\Big(\sum_{t=1}^n \eta_t\Big)^4\Big] \le 4 \Big(\sum_{r=-\infty}^{\infty} \phi^{1/2}(|r|)\Big)^2 \Big(\sum_{t=1}^n E[\eta_t^2]\Big)^2 + 144 \Big(1 + \sum_{r=1}^{\infty} (r+1)^2 \phi^{1/4}(r)\Big) \sum_{t=1}^n E[\eta_t^4]. \]

(ii) If (η_t)_{t=1,…,n} is strong (α-) mixing and E[|η_t|^{4+δ}] < ∞, then

\[ E\Big[\Big(\sum_{t=1}^n \eta_t\Big)^4\Big] \le \sum_{t=1}^n E[\eta_t^4] + \Big(\sum_{t=1}^n E[\eta_t^2] + 2 \sum_{r=1}^{n-1} \sum_{t=1}^n \min\big\{\mathcal{C}_{2,t},\ 4[\alpha(r)]^{(2+\delta)/(4+\delta)} \|\eta_t\|_{4+\delta}^2\big\}\Big)^2 + 72 \sum_{r=1}^{n-1} (r+1)^2 \sum_{t=1}^n \min\big\{\mathcal{C}_{4,t},\ 4[\alpha(r)]^{\delta/(4+\delta)} \|\eta_t\|_{4+\delta}^4\big\}, \]

where \mathcal{C}_{2,t} = max_{u≠t}{|E[η_t η_u]|} and \mathcal{C}_{4,t} = max_{(u,v,w): u≠t}{|E[η_t η_u η_v η_w]|}.
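A quick sanity check of the bound in part (i) (as reconstructed here) in the simplest case: for i.i.d. variables, φ(0) = 1 and φ(r) = 0 for r ≥ 1, so for Rademacher variables the bound reduces to 4n² + 144n, while the exact fourth moment of the sum is n + 3n(n − 1):

```python
# Deterministic check of Lemma 4.1(i) for i.i.d. Rademacher variables:
# E[eta_t^2] = E[eta_t^4] = 1, and E[(S_n)^4] = n + 3*n*(n-1) exactly.
for n in [10, 100, 1000]:
    exact = n + 3 * n * (n - 1)       # exact fourth moment of the sum
    bound = 4 * n**2 + 144 * n        # bound with phi(0)=1, phi(r>=1)=0
    assert exact <= bound
    print(n, exact, bound)
```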


Proof of Lemma 4.1. To exploit the mixing property as much as possible, we switch, for 1 ≤ t ≤ u ≤ v ≤ w ≤ n, between the following estimates:

\[ |E[\eta_t\eta_u\eta_v\eta_w]| \le \begin{cases} |\mathrm{cov}(\eta_t,\, \eta_u\eta_v\eta_w)|, & \text{if } u-t \ge \max\{v-u,\, w-v\}, \\ |\mathrm{cov}(\eta_t\eta_u\eta_v,\, \eta_w)|, & \text{if } w-v \ge \max\{u-t,\, v-u\}, \\ |\mathrm{cov}(\eta_t\eta_u,\, \eta_v\eta_w)| + |E[\eta_t\eta_u]\, E[\eta_v\eta_w]|, & \text{if } v-u \ge \max\{u-t,\, w-v\}. \end{cases} \]

Accordingly, we define for r ≥ 1

\[ T_{n,r}^{(1)} = \{(t,u,v,w)\colon 1 \le t \le u \le v \le w \le n,\ r = u-t \ge \max\{v-u,\, w-v\}\}, \]
\[ T_{n,r}^{(2)} = \{(t,u,v,w)\colon 1 \le t \le u \le v \le w \le n,\ r = v-u \ge \max\{u-t,\, w-v\}\}, \]

and

\[ T_{n,r}^{(3)} = \{(t,u,v,w)\colon 1 \le t \le u \le v \le w \le n,\ r = w-v \ge \max\{u-t,\, v-u\}\}. \]

Now we obtain that

\[ E\Big[\Big(\sum_{t=1}^n \eta_t\Big)^4\Big] \le \sum_{t=1}^n E[\eta_t^4] + \Big(\sum_{t,u=1}^n |E[\eta_t\eta_u]|\Big)^2 + 4! \sum_{r=1}^{n-1} \sum_{(t,u,v,w)\in T_{n,r}^{(1)}} |\mathrm{cov}(\eta_t,\, \eta_u\eta_v\eta_w)| + 4! \sum_{r=1}^{n-1} \sum_{(t,u,v,w)\in T_{n,r}^{(2)}} |\mathrm{cov}(\eta_t\eta_u,\, \eta_v\eta_w)| + 4! \sum_{r=1}^{n-1} \sum_{(t,u,v,w)\in T_{n,r}^{(3)}} |\mathrm{cov}(\eta_t\eta_u\eta_v,\, \eta_w)| =: R_{n,1} + \dots + R_{n,5}. \tag{4.1} \]

=∶ + ⋯ + (i) φ-mixing

To estimate R_{n,2} to R_{n,5}, we make use of the inequality

\[ |\mathrm{cov}(X, Y)| \le 2\, [\phi(\sigma(X), \sigma(Y))]^{1/\alpha}\, \|X\|_\alpha \|Y\|_\beta, \tag{4.2} \]

for α, β ∈ [1, ∞] with 1/α + 1/β = 1, which is due to Ibragimov (1962); see also Billingsley (1968, p. 170) for a more accessible reference. We obtain from (4.2) that

\[ \sum_{t,u=1}^n |E[\eta_t\eta_u]| \le \sum_{t,u=1}^n 2\, \phi^{1/2}(|u-t|)\, (E[\eta_t^2])^{1/2} (E[\eta_u^2])^{1/2} \le \sum_{t,u=1}^n \phi^{1/2}(|u-t|)\, \{E[\eta_t^2] + E[\eta_u^2]\} \le 2 \sum_{t=1}^n E[\eta_t^2] \sum_{r=-\infty}^{\infty} \phi^{1/2}(|r|). \tag{4.3} \]

For (t, u, v, w) ∈ T_{n,r}^{(1)}, we have that

\[ |\mathrm{cov}(\eta_t,\, \eta_u\eta_v\eta_w)| \le 2\, \phi^{1/4}(r)\, (E[\eta_t^4])^{1/4} \|\eta_u\eta_v\eta_w\|_{4/3} \le 2\, \phi^{1/4}(r)\, (E[\eta_t^4])^{1/4} (E[\eta_u^4])^{1/4} (E[\eta_v^4])^{1/4} (E[\eta_w^4])^{1/4} \le 2\, \phi^{1/4}(r)\, \frac{E[\eta_t^4] + \dots + E[\eta_w^4]}{4}, \]

which implies that

\[ \sum_{(t,u,v,w)\in T_{n,r}^{(1)}} |\mathrm{cov}(\eta_t,\, \eta_u\eta_v\eta_w)| \le 2\, \phi^{1/4}(r)\, (r+1)^2 \sum_{t=1}^n E[\eta_t^4]. \tag{4.4} \]

For (t, u, v, w) ∈ T_{n,r}^{(2)}, we get

\[ |\mathrm{cov}(\eta_t\eta_u,\, \eta_v\eta_w)| \le 2\, \phi^{1/2}(r)\, (E[\eta_t^2\eta_u^2])^{1/2} (E[\eta_v^2\eta_w^2])^{1/2} \le 2\, \phi^{1/2}(r)\, (E[\eta_t^4])^{1/4} (E[\eta_u^4])^{1/4} (E[\eta_v^4])^{1/4} (E[\eta_w^4])^{1/4} \le 2\, \phi^{1/2}(r)\, \frac{E[\eta_t^4] + \dots + E[\eta_w^4]}{4}, \]

hence

\[ \sum_{(t,u,v,w)\in T_{n,r}^{(2)}} |\mathrm{cov}(\eta_t\eta_u,\, \eta_v\eta_w)| \le 2\, \phi^{1/2}(r)\, (r+1)^2 \sum_{t=1}^n E[\eta_t^4]. \tag{4.5} \]

And finally, for (t, u, v, w) ∈ T_{n,r}^{(3)}, we get

\[ |\mathrm{cov}(\eta_t\eta_u\eta_v,\, \eta_w)| \le 2\, \phi^{3/4}(r)\, \|\eta_t\eta_u\eta_v\|_{4/3}\, (E[\eta_w^4])^{1/4} \le 2\, \phi^{3/4}(r)\, \frac{E[\eta_t^4] + \dots + E[\eta_w^4]}{4}, \]

which implies that

\[ \sum_{(t,u,v,w)\in T_{n,r}^{(3)}} |\mathrm{cov}(\eta_t\eta_u\eta_v,\, \eta_w)| \le 2\, \phi^{3/4}(r)\, (r+1)^2 \sum_{t=1}^n E[\eta_t^4]. \tag{4.6} \]

Assertion (i) now follows from (4.1) and (4.3) to (4.6).

(ii) α-mixing

Here we use the covariance inequality
\[
|\operatorname{cov}(X,Y)| \le 4\, [\alpha(\sigma(X), \sigma(Y))]^{1 - 1/\alpha - 1/\beta}\, \|X\|_\alpha\, \|Y\|_\beta, \tag{4.7}
\]
where $\alpha, \beta \in (1,\infty]$ are such that $1/\alpha + 1/\beta < 1$ and $X, Y$ are random variables with $\|X\|_\alpha < \infty$, $\|Y\|_\beta < \infty$; see Bradley (2007, Corollary 10.16).

From $|E[\eta_t\eta_u]| \le \min\big\{C_{2,t},\, C_{2,u},\, 4\,[\alpha(|u-t|)]^{(2+\delta)/(4+\delta)}\, \|\eta_t\|_{4+\delta} \|\eta_u\|_{4+\delta}\big\}$ and $\|\eta_t\|_{4+\delta} \|\eta_u\|_{4+\delta} \le (\|\eta_t\|_{4+\delta}^2 + \|\eta_u\|_{4+\delta}^2)/2$ we obtain that
\[
R_{n,2} \le \Big(\sum_{t=1}^n E[\eta_t^2] + 2 \sum_{r=1}^{n-1} \min\Big\{\sum_{t=1}^n C_{2,t},\; 4\,[\alpha(r)]^{(2+\delta)/(4+\delta)} \sum_{t=1}^n \|\eta_t\|_{4+\delta}^2\Big\}\Big)^2. \tag{4.8}
\]
Furthermore, for $(t,u,v,w) \in T_{n,r}^{(1)}$,
\[
|\operatorname{cov}(\eta_t,\, \eta_u\eta_v\eta_w)| \le \min\big\{C_{4,t},\, C_{4,u},\, C_{4,v},\, C_{4,w},\; 4\,[\alpha(r)]^{\delta/(4+\delta)}\, \|\eta_t\|_{4+\delta}\, \|\eta_u\eta_v\eta_w\|_{(4+\delta)/3}\big\},
\]

which implies, by $\|\eta_t\|_{4+\delta}\, \|\eta_u\eta_v\eta_w\|_{(4+\delta)/3} \le \big(\|\eta_t\|_{4+\delta}^4 + \|\eta_u\|_{4+\delta}^4 + \|\eta_v\|_{4+\delta}^4 + \|\eta_w\|_{4+\delta}^4\big)/4$, that

\[
R_{n,3} \le 4! \sum_{r=1}^{n-1} (r+1)^2 \min\Big\{\sum_{t=1}^n C_{4,t},\; 4\,[\alpha(r)]^{\delta/(4+\delta)} \sum_{t=1}^n \|\eta_t\|_{4+\delta}^4\Big\}. \tag{4.9}
\]
Since $R_{n,4}$ and $R_{n,5}$ can be estimated in the same manner, we obtain (ii) from (4.1), (4.8) and (4.9). $\square$

The next lemma provides a Rosenthal-type inequality for uniform mixing and bounded random variables which is tailor-made for nonstationary processes.

Lemma 4.2. Suppose that $(\eta_t)_{t=1,\dots,n}$ is a $\phi$-mixing process with $E\eta_t = 0$ and $|\eta_t| \le 1$ almost surely for all $t = 1, \dots, n$. Then, for all integers $k \ge 2$,
\[
\Big|E\Big[\Big(\sum_{t=1}^n \eta_t\Big)^k\Big]\Big| \le \frac{(2k-2)!}{(k-1)!}\, \big(V_2^{k/2} \vee V_k\big),
\]
where
\[
V_q = \sum_{t=1}^n E[|\eta_t|] \sum_{r=-(n-1)}^{n-1} \phi(|r|)\, (|r|+1)^{q-2}.
\]
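Lemma 4.2 can be given a quick numerical plausibility check in the degenerate special case of i.i.d. variables, where $\phi(r) = 0$ for $r \ge 1$ and $\phi(0) \le 1$, so that $V_q$ reduces to $\sum_{t=1}^n E[|\eta_t|]$. The sketch below (not part of the proof; the uniform distribution and the values of $n$, $k$ and the number of replications are illustrative choices) estimates the left-hand side for $k = 4$ by simulation and compares it with the right-hand side of the lemma:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(0)
n, k, reps = 100, 4, 20000

# eta_t i.i.d. Uniform(-1, 1): centered and |eta_t| <= 1 a.s.;
# phi(r) = 0 for r >= 1, so V_q = phi(0) * sum_t E|eta_t| <= n * E|eta_t|.
E_abs = 0.5                      # E|eta_t| for Uniform(-1, 1)
V2 = Vk = n * E_abs              # using phi(0) <= 1

# Monte Carlo estimate of E[(sum_t eta_t)^k]
S = rng.uniform(-1.0, 1.0, size=(reps, n)).sum(axis=1)
lhs = np.mean(S ** k)

# bound of Lemma 4.2: (2k-2)!/(k-1)! * max(V2^(k/2), Vk)
rhs = factorial(2 * k - 2) / factorial(k - 1) * max(V2 ** (k / 2), Vk)
assert lhs < rhs                 # holds with a lot of slack here
```

For these values the exact fourth moment is $n E[\eta_1^4] + 3n(n-1)(E[\eta_1^2])^2 \approx 3320$, far below the bound $120 \cdot 2500 = 300000$; the inequality is of course far from tight in this independent special case.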

Proof of Lemma 4.2. The proof uses ideas described in Section 4.3 in Dedecker et al. (2007). We define, for $q = 1, \dots, k$,
\[
A_q(n) = \sum_{1 \le t_1 \le t_2 \le \dots \le t_q \le n} \big|E[\eta_{t_1} \cdots \eta_{t_q}]\big|.
\]
Then
\[
\Big|E\Big[\Big(\sum_{t=1}^n \eta_t\Big)^k\Big]\Big| \le k!\, A_k(n). \tag{4.10}
\]
For $1 \le t_1 \le n-1$, we decompose the set $\{(t_2,\dots,t_q)\colon t_1 \le \dots \le t_q \le n,\ \max_{1\le i<q}\{t_{i+1}-t_i\} \ge 1\}$ into the sets
\[
G_{r,m}(t_1) = \Big\{(t_2,\dots,t_q)\colon t_1 \le \dots \le t_q \le n,\ \max_{1\le i<m}\{t_{i+1}-t_i\} < r = t_{m+1}-t_m = \max_{m\le i<q}\{t_{i+1}-t_i\}\Big\}.
\]
$G_{r,m}(t_1)$ contains all $(t_2,\dots,t_q)$ where the largest gap between consecutive $t_i$'s is $r$, and $m$ is the first position where this gap occurs. Applying the covariance inequality (4.2) around this gap makes an efficient use of the mixing property possible. We have

\begin{align*}
A_q(n) &= \sum_{t_1=1}^n \big|E[\eta_{t_1}^q]\big| \\
&\quad + \sum_{t_1=1}^{n-1} \sum_{m=1}^{q-1} \sum_{r=1}^{n-1}\, \sum_{(t_2,\dots,t_q)\in G_{r,m}(t_1)} \big|E[\eta_{t_1}\cdots\eta_{t_m}]\, E[\eta_{t_{m+1}}\cdots\eta_{t_q}] + \operatorname{cov}(\eta_{t_1}\cdots\eta_{t_m},\, \eta_{t_{m+1}}\cdots\eta_{t_q})\big|.
\end{align*}
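The two combinatorial facts used in the next step, namely that each tuple with a positive maximal gap lies in exactly one set $G_{r,m}(t_1)$ and that $\#G_{r,m}(t_1) \le (r+1)^{q-2}$, can be confirmed by brute force for small $n$ and $q$ (the values below are arbitrary illustrative choices):

```python
from itertools import combinations_with_replacement
from collections import defaultdict

n, q = 8, 4
for t1 in range(1, n):
    buckets = defaultdict(list)      # (r, m) -> tuples in G_{r,m}(t1)
    total = 0
    for rest in combinations_with_replacement(range(t1, n + 1), q - 1):
        t = (t1,) + rest
        gaps = [t[i + 1] - t[i] for i in range(q - 1)]
        r = max(gaps)
        if r >= 1:                   # tuples with at least one positive gap
            m = gaps.index(r) + 1    # first position where the gap r occurs
            buckets[(r, m)].append(t)
            total += 1
    # each tuple is assigned to exactly one (r, m), so the G_{r,m}(t1)
    # partition the tuples with maximal gap >= 1; check the cardinality bound:
    for (r, m), tuples in buckets.items():
        assert len(tuples) <= (r + 1) ** (q - 2)
```

The bound holds because, given $t_1$, $r$ and $m$, the gap at position $m$ is fixed, the $m-1$ gaps before it take values in $\{0,\dots,r-1\}$ and the remaining gaps take values in $\{0,\dots,r\}$.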

Since $\#G_{r,m}(t_1) \le (r+1)^{q-2}$ and $|\operatorname{cov}(\eta_{t_1}\cdots\eta_{t_m},\, \eta_{t_{m+1}}\cdots\eta_{t_q})| \le 2\, E[|\eta_{t_1}|]\, \phi(t_{m+1}-t_m)$, we obtain that
\begin{align*}
A_q(n) &\le \sum_{m=1}^{q-1} A_m(n)\, A_{q-m}(n) + \sum_{t_1=1}^n E[|\eta_{t_1}|]\Big(1 + 2(q-1) \sum_{r=1}^{n-1} \phi(r)\, (r+1)^{q-2}\Big) \\
&\le \sum_{m=1}^{q-1} \big\{A_m(n)\, A_{q-m}(n) + V_q\big\}.
\end{align*}
Next we prove that, for $2 \le m < q \le k$,

\[
\big(V_2^{m/2} \vee V_m\big)\big(V_2^{(q-m)/2} \vee V_{q-m}\big)
= V_2^{q/2} \vee V_2^{m/2} V_{q-m} \vee V_m V_2^{(q-m)/2} \vee V_m V_{q-m}
\le V_2^{q/2} \vee V_q. \tag{4.11}
\]
(4.11) implies by Lemma 4.7 in Dedecker et al. (2007, p. 79) that
\[
A_k(n) \le \frac{1}{k}\binom{2k-2}{k-1}\big(V_2^{k/2} \vee V_k\big),
\]
which concludes in conjunction with (4.10) the proof. It remains to verify (4.11). It follows from Hölder's inequality, for $2 \le p < q$, that

\[
\sum_{r=-(n-1)}^{n-1} \phi(|r|)\, (|r|+1)^{p-2}
\le \Big(\sum_{r=-(n-1)}^{n-1} \phi(|r|)\Big)^{\frac{q-p}{q-2}} \Big(\sum_{r=-(n-1)}^{n-1} \phi(|r|)\, (|r|+1)^{q-2}\Big)^{\frac{p-2}{q-2}},
\]
which implies that

\[
V_p \le V_2^{\frac{q-p}{q-2}}\, V_q^{\frac{p-2}{q-2}}. \tag{4.12}
\]

Using (4.12) we obtain that

\begin{align*}
V_2^{m/2}\, V_{q-m} &\le V_2^{m/2}\, V_2^{\frac{m}{q-2}}\, V_q^{\frac{q-m-2}{q-2}}
= \big(V_2^{q/2}\big)^{\frac{m}{q-2}}\, V_q^{\frac{q-m-2}{q-2}} \le V_2^{q/2} \vee V_q, \\
V_m\, V_2^{(q-m)/2} &\le V_2^{\frac{q-m}{q-2}}\, V_q^{\frac{m-2}{q-2}}\, V_2^{(q-m)/2}
= \big(V_2^{q/2}\big)^{\frac{q-m}{q-2}}\, V_q^{\frac{m-2}{q-2}} \le V_2^{q/2} \vee V_q
\end{align*}
and, finally,

\[
V_m\, V_{q-m}
\le V_2^{\frac{q-m}{q-2}}\, V_q^{\frac{m-2}{q-2}}\, V_2^{\frac{m}{q-2}}\, V_q^{\frac{q-m-2}{q-2}}
= V_2^{\frac{q}{q-2}}\, V_q^{\frac{q-4}{q-2}}
= \big(V_2^{q/2}\big)^{\frac{2}{q-2}}\, V_q^{\frac{q-4}{q-2}}
\le V_2^{q/2} \vee V_q,
\]
which proves (4.11). $\square$
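Both (4.12) and (4.11) can be checked numerically for quantities $V_q$ of the stated form. In the sketch below, the values $\phi(0), \dots, \phi(n-1)$ and the factor playing the role of $\sum_t E[|\eta_t|]$ are drawn at random; all concrete numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def V(q, phi, mass):
    # V_q = (sum_t E|eta_t|) * sum_{r=-(n-1)}^{n-1} phi(|r|) (|r|+1)^(q-2)
    r = np.arange(-(len(phi) - 1), len(phi))
    return mass * np.sum(phi[np.abs(r)] * (np.abs(r) + 1.0) ** (q - 2))

for _ in range(200):
    phi = rng.uniform(0, 1, size=10)   # arbitrary nonnegative phi-values
    mass = rng.uniform(0.1, 5.0)       # plays the role of sum_t E|eta_t|
    q = int(rng.integers(4, 9))
    V2, Vq = V(2, phi, mass), V(q, phi, mass)
    for p in range(2, q):              # interpolation bound (4.12)
        assert V(p, phi, mass) <= V2 ** ((q - p) / (q - 2)) \
            * Vq ** ((p - 2) / (q - 2)) * (1 + 1e-12)
    for m in range(2, q - 1):          # max-product bound (4.11)
        lhs = max(V2 ** (m / 2), V(m, phi, mass)) \
            * max(V2 ** ((q - m) / 2), V(q - m, phi, mass))
        assert lhs <= max(V2 ** (q / 2), Vq) * (1 + 1e-12)
```

Each of the four products in (4.11) is a geometric mean of $V_2^{q/2}$ and $V_q$ with exponents summing to one, which is why the maximum bound holds regardless of the sizes of the individual sums.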

4.2. Proofs of the main results.

Proof of Lemma 2.1. First, we obtain under (A2) that, for all sufficiently large $n$,
\[
\sum_{t=1}^n P(I_t \in B_k) \ge \frac{C_1}{2}\, n^{2/(2+d_2)}.
\]
Hence, for $C_4 < C_1/4$, it suffices to show that
\[
\sum_{k \in \{0,\dots,K-1\}^{d_1} \times \{1,\dots,M_n\}^{d_2}} P\Big(\Big|\sum_{t=1}^n \big(\mathbb{1}(I_t \in B_k) - P(I_t \in B_k)\big)\Big| \ge \frac{C_1}{4}\, n^{2/(2+d_2)}\Big) \;\underset{n\to\infty}{\longrightarrow}\; 0.
\]
Applying Markov's inequality, it is enough to verify that
\[
E\Big[\Big(\sum_{t=1}^n \big(\mathbb{1}(I_t \in B_k) - P(I_t \in B_k)\big)\Big)^q\Big] = o\big(n^{(2q-d_2)/(2+d_2)}\big) \tag{4.13}
\]
holds for some $q > 2 \vee (d_2/2)$. This will now be proven for the two cases of uniform and strong mixing separately.

(i) φ-mixing

By Lemma 4.2 with $\eta_t = \mathbb{1}(I_t \in B_k) - P(I_t \in B_k)$ we obtain, for $q \ge 2$, that
\[
E\Big[\Big(\sum_{t=1}^n \big(\mathbb{1}(I_t \in B_k) - P(I_t \in B_k)\big)\Big)^q\Big] = O\big(n^{q/(2+d_2)}\big).
\]
Hence, (4.13) is satisfied for $q > \max\{d_2, 2\}$.

(ii) α-mixing

Suppose that for some $q \in 2\mathbb{N}$ and $\epsilon > 0$ we have $\sum_{r \in \mathbb{N}} r^{q-2}\, [\alpha(r)]^{\epsilon/(q+\epsilon)} < \infty$. Then we can apply an extension of Rosenthal's inequality (Theorem 2 in Doukhan (1994, Section 1.4.1)), Minkowski's inequality and Jensen's inequality to get
\begin{align*}
E\Big[\Big(\sum_{t=1}^n \big(\mathbb{1}(I_t \in B_k) - P(I_t \in B_k)\big)\Big)^q\Big]
&\le C_5 \max\Big\{\sum_{t=1}^n \big\|\mathbb{1}(I_t \in B_k) - P(I_t \in B_k)\big\|_{q+\epsilon}^q,\; \Big(\sum_{t=1}^n \big\|\mathbb{1}(I_t \in B_k) - P(I_t \in B_k)\big\|_{2+\epsilon}^2\Big)^{q/2}\Big\} \\
&\le C_6 \max\Big\{n \Big(\frac{1}{n}\sum_{t=1}^n P(I_t \in B_k)\Big)^{q/(q+\epsilon)},\; \Big(n \Big(\frac{1}{n}\sum_{t=1}^n P(I_t \in B_k)\Big)^{2/(2+\epsilon)}\Big)^{q/2}\Big\} \\
&\le C_7 \max\big\{n^{1 - d_2 q/[(2+d_2)(q+\epsilon)]},\; n^{q/2 - d_2 q/[(2+d_2)(2+\epsilon)]}\big\} \\
&= C_7\, n^{q/2 - d_2 q/[(2+d_2)(2+\epsilon)]}
\end{align*}
for $q > 2$, arbitrary $\epsilon > 0$ and some constants $C_5, C_6, C_7 < \infty$. The latter term is of the desired order if
\[
\frac{q}{2} - \frac{d_2 q}{(2+d_2)(2+\epsilon)} < \frac{2q - d_2}{2+d_2}
\quad\Longleftrightarrow\quad
\frac{d_2 q\, \epsilon}{2(2+\epsilon)} < q - d_2,
\]
which in turn is satisfied for $\epsilon = 1/(q-1)$ and $q > (5d_2/4) \vee 2$. Finally note that the initial summability condition on the mixing coefficient is satisfied for these choices of $\epsilon$ and $q$ under the assumptions of the lemma. $\square$

Proof of Theorem 2.1. Recall that
\[
B_k = \{(k_1,\dots,k_{d_1})'\} \times \big(G_1^{-1}(h_n(k_{d_1+1}-1)),\, G_1^{-1}(h_n k_{d_1+1})\big] \times \cdots \times \big(G_{d_2}^{-1}(h_n(k_d-1)),\, G_{d_2}^{-1}(h_n k_d)\big].
\]
We define
\[
\overline{x}_k = \big(k_1,\dots,k_{d_1},\, G_1^{-1}(h_n(k_{d_1+1}+1)),\, \dots,\, G_{d_2}^{-1}(h_n(k_d+1))\big)'
\]
and
\[
\underline{x}_k = \big(k_1,\dots,k_{d_1},\, G_1^{-1}(h_n(k_{d_1+1}-1)),\, \dots,\, G_{d_2}^{-1}(h_n(k_d-1))\big)'.
\]
We have, for $x \in B_k$,
\[
\widehat{f}_n(x) - f(x)
\;\le\; \sup_{y \preceq \underline{x}_k} \operatorname{Av} Y\big(\llbracket y, \overline{x}_k \rrbracket\big) - f(x)
\;\le\; \sup_{y \preceq \underline{x}_k} \operatorname{Av} \varepsilon\big(\llbracket y, \overline{x}_k \rrbracket\big) + \big(f(\overline{x}_k) - f(x)\big). \tag{4.14}
\]

According to Lemma 4.3 and since $\nu(B_k) = h_n^{d_2} = O\big(n^{-d_2/(2+d_2)}\big)$ we obtain that
\[
\sum_{k\colon B_k \in D_n} \int_{B_k} \big(f(\overline{x}_k) - f(x)\big)\, \nu(dx)
\;\le\; \sum_{k\colon B_k \in D_n} \nu(B_k)\, \big(f(\overline{x}_k) - f(\underline{x}_k)\big)
\;=\; O\big(n^{-1/(2+d_2)}\big). \tag{4.15}
\]
Next we estimate $E\big[\sup_{y \preceq \underline{x}_k} \operatorname{Av}\varepsilon(\llbracket y, \overline{x}_k\rrbracket)\, \mathbb{1}_{A_n}\big]$. We define

\[
\overline{B}_k = \bigcup_{0 \le l_i \le k_i\ \forall i} B_{(l_1,\dots,l_{d_1},\, k_{d_1+1},\dots,k_d)},
\]
\[
\overline{B}_k^{(j_1,\dots,j_{d_2})} = \bigcup_{0 \le m_i \le 2^{j_i}\ \forall i} \overline{B}_{(k_1,\dots,k_{d_1},\, k_{d_1+1}-m_1+1,\dots,\, k_d-m_{d_2}+1)},
\]
and
\[
B_k^{(j_1,\dots,j_{d_2})} = \bigcup_{0 \le m_i \le 2^{j_i}\ \forall i} B_{(k_1,\dots,k_{d_1},\, k_{d_1+1}-m_1+1,\dots,\, k_d-m_{d_2}+1)}.
\]
Recall that, if the event $A_n$ occurs, $\#\{t \le n\colon I_t \in \overline{B}_k\} \ge C_4\, n^{2/(2+d_2)}$, which implies that

\[
\#\big\{t \le n\colon I_t \in \overline{B}_k^{(j_1,\dots,j_{d_2})}\big\} \;\ge\; C_4\, 2^{\,j_1+\cdots+j_{d_2}}\, n^{2/(2+d_2)}.
\]
Furthermore, it follows from (4.18) in Lemma 4.4 that, for some $C_8 < \infty$,
\[
E\Big[\sup\Big\{\Big|\sum_{t=1}^n \varepsilon_t\, \mathbb{1}\big(I_t \in \llbracket y, \overline{x}_k\rrbracket\big)\Big| \colon y \in B_k^{(j_1+1,\dots,j_{d_2}+1)}\Big\}\Big] \;\le\; C_8\, 2^{(j_1+\cdots+j_{d_2})/2}\, n^{1/(2+d_2)}.
\]
Therefore, we obtain that
\begin{align*}
E\Big[\sup_{y \preceq \underline{x}_k} \operatorname{Av}\varepsilon(\llbracket y, \overline{x}_k\rrbracket)\, \mathbb{1}_{A_n}\Big]
&\le \sum_{j_1,\dots,j_{d_2} \ge 0} E\Bigg[\frac{\sup\big\{\big|\sum_{t=1}^n \varepsilon_t\, \mathbb{1}(I_t \in \llbracket y, \overline{x}_k\rrbracket)\big| \colon y \in B_k^{(j_1+1,\dots,j_{d_2}+1)}\big\}}{\#\{t \le n\colon I_t \in \overline{B}_k^{(j_1,\dots,j_{d_2})}\}}\; \mathbb{1}_{A_n}\Bigg] \\
&= O\big(n^{-1/(2+d_2)}\big). \tag{4.16}
\end{align*}
To sum up, (4.14), (4.15) and (4.16) yield that

\[
E\Big[\int_{D_n} \big(\widehat{f}_n(x) - f(x)\big)_+\, \nu(dx)\; \mathbb{1}_{A_n}\Big] = O\big(n^{-1/(2+d_2)}\big).
\]
The term $E\big[\int_{D_n} \big(\widehat{f}_n(x) - f(x)\big)_-\, \nu(dx)\; \mathbb{1}_{A_n}\big]$ can be treated analogously, which completes the proof of the theorem. $\square$

Lemma 4.3. Under the assumptions of Theorem 2.1,

\[
\max_{0 \le k_1,\dots,k_{d_1} \le K-1}\; \sum_{1 < k_{d_1+1},\dots,k_{d_1+d_2} < M_n} \big(f(\overline{x}_k) - f(\underline{x}_k)\big) = O\big(n^{(d_2-1)/(2+d_2)}\big).
\]
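The counting idea behind this bound, telescoping along diagonals so that the total increment of a coordinatewise nondecreasing function over all grid cells is at most the number of diagonals times the range of the function, can be illustrated numerically in dimension $d_2 = 2$; the grid size and the random monotone function below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
M, d2 = 30, 2

# A random coordinatewise nondecreasing function on an M x M grid:
# two-dimensional cumulative sums of nonnegative increments.
f = np.cumsum(np.cumsum(rng.uniform(0, 1, size=(M, M)), axis=0), axis=1)

# Sum over all cells of f(upper corner) - f(lower corner); grouping the
# cells by diagonal (constant difference of the two indices) telescopes
# each group to at most max(f) - min(f).
total = np.sum(f[1:, 1:] - f[:-1, :-1])

bound = d2 * M ** (d2 - 1) * (f.max() - f.min())  # #diagonals <= d2 * M^(d2-1)
assert total <= bound
```

Here there are $2(M-1)-1 = 57$ diagonals, below the bound $d_2 M^{d_2-1} = 60$, and each diagonal contributes at most the range of $f$, so the assertion holds for every monotone $f$.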

Proof of Lemma 4.3. Let $k_1,\dots,k_{d_1}$ be arbitrary and let
\[
\mathcal{I}_0 = \big\{(k_{d_1+1},\dots,k_{d_1+d_2})\colon 1 < k_{d_1+1},\dots,k_{d_1+d_2} < M_n\ \text{and}\ k_j = 2\ \text{for at least one}\ j\big\}.
\]
We estimate the sum by walking along the main and minor diagonals as follows:
\begin{align*}
\sum_{1 < k_{d_1+1},\dots,k_{d_1+d_2} < M_n} \big(f(\overline{x}_k) - f(\underline{x}_k)\big)
&= \sum_{(k_{d_1+1},\dots,k_{d_1+d_2}) \in \mathcal{I}_0}\; \sum_{i \ge 0} \Big(f\big(\overline{x}_{(k_1,\dots,k_{d_1},\, k_{d_1+1}+i,\dots,\, k_{d_1+d_2}+i)}\big) - f\big(\underline{x}_{(k_1,\dots,k_{d_1},\, k_{d_1+1}+i,\dots,\, k_{d_1+d_2}+i)}\big)\Big) \\
&\le \#\mathcal{I}_0 \cdot 2\, \Big(\sup_z \{f(k_1,\dots,k_{d_1},z)\} - \inf_z \{f(k_1,\dots,k_{d_1},z)\}\Big).
\end{align*}
Since $\#\mathcal{I}_0 \le d_2 M_n^{d_2-1} = O\big(n^{(d_2-1)/(2+d_2)}\big)$, we obtain the assertion. $\square$

Lemma 4.4. Suppose that the assumptions of Theorem 2.1 hold true. Then, for arbitrary $\underline{k} \preceq \overline{k}$, $\underline{z} \preceq \overline{z}$ and some $C_8 < \infty$,
\[
E\Big[\sup_{(k,z)\colon \underline{k} \preceq k \preceq \overline{k},\ \underline{z} \preceq z \preceq \overline{z}} \Big|\sum_{t=1}^n \varepsilon_t\, \mathbb{1}\big(I_t \in \llbracket (k,z)', (\overline{k},\overline{z})'\rrbracket\big)\Big|\Big] \;\le\; C_8\big(\sqrt{n\pi_n} + (\log n)^{d-1}\big),
\]
where $\pi_n = n^{-1}\sum_{t=1}^n P\big(I_t \in \llbracket (\underline{k},\underline{z})', (\overline{k},\overline{z})'\rrbracket\big)$.

all the other random variables, we could define $\tilde{Z}_t^{(1)} = t/n + V_t$ and we obtain analogously to (4.19) that, for $\underline{z}^{(1)}, \overline{z}^{(1)} \in \{1/n, 2/n, \dots, 1\}$,
\[
\sup_{(k,z) \in \mathbb{R}^d\colon k \succeq \underline{k},\ \underline{z} \preceq z \preceq \overline{z}} \Big|\sum_{t=1}^n \varepsilon_t\, \mathbb{1}\big(\tilde{X}_t \in \llbracket k, \overline{k}\rrbracket,\ Z_t \in \llbracket z, \overline{z}\rrbracket\big)\Big|
= \sup_{(k,z) \in \mathbb{R}^d\colon k \succeq \underline{k},\ \underline{z} \preceq z \preceq \overline{z}} \Big|\sum_{t=1}^n \varepsilon_t\, \mathbb{1}\big(\tilde{X}_t \in \llbracket k, \overline{k}\rrbracket,\ \tilde{Z}_t \in \llbracket z, \tilde{\overline{z}}\rrbracket\big)\Big|, \tag{4.20}
\]
where $\tilde{\overline{z}} = \big(\overline{z}^{(1)} + 1/(2n),\, \overline{z}^{(2)}, \dots, \overline{z}^{(d_2)}\big)'$. Furthermore, it is obvious that

\[
P\big(\tilde{X}_t \succeq k,\ \underline{z} \preceq \tilde{Z}_t \preceq z\big) = P\big(X_t \succeq k,\ \underline{z} \preceq Z_t \preceq z\big), \qquad k \in \mathbb{N}_0^{d_1},\ \underline{z}, z \in \mathbb{R}^{d_2}. \tag{4.21}
\]
Finally, it follows from Lemma 6.4 in Bradley (2007) that the coefficients of strong mixing respectively uniform mixing of the process $\big((Y_t, \tilde{X}_t', \tilde{Z}_t')'\big)_{t=1,\dots,n}$ are the same as those of the original process $\big((Y_t, I_t')'\big)_{t=1,\dots,n}$. Therefore, we can reduce the technically cumbersome case with integer-valued and/or trend components in $I_t$ to the more convenient case where all components of $I_t$ possess a continuous distribution. We make this simplifying assumption in what follows.

To unify notation, we set $u = (u^{(1)},\dots,u^{(d)})' = (k_1,\dots,k_{d_1}, z^{(1)},\dots,z^{(d_2)})'$ and
\[
\llbracket u, \overline{u}\rrbracket = [u^{(1)}, \infty) \times \cdots \times [u^{(d_1)}, \infty) \times \llbracket z, \overline{z}\rrbracket.
\]
Let $P_n = n^{-1}\sum_{t=1}^n P^{I_t}$. Then $\pi_n = P_n(\llbracket u, \overline{u}\rrbracket)$. In what follows, we decompose the hyperrectangle $\llbracket u, \overline{u}\rrbracket$ into certain smaller hyperrectangles. We define, for $j \ge 1$, $u_{j;0} = u^{(1)}$, $u_{j;2^j} = \overline{u}^{(1)}$ and, for $1 \le k < 2^j$, we choose $u_{j;k}$ such that
\[
P_n\big([u_{j;0},\, u_{j;k}] \times [u^{(2)}, \overline{u}^{(2)}] \times \cdots \times [u^{(d)}, \overline{u}^{(d)}]\big) = \pi_n\, k\, 2^{-j}.
\]
Having defined these grid points $u_{j;k}$ on the interval $[u^{(1)}, \overline{u}^{(1)}]$, we proceed with defining grid points on the interval $[u^{(2)}, \overline{u}^{(2)}]$ as follows. For any $j_1, j_2 \ge 1$, $1 \le k_1 \le 2^{j_1}$, we set $u^{j_1;k_1}_{j_2;0} = u^{(2)}$, $u^{j_1;k_1}_{j_2;2^{j_2}} = \overline{u}^{(2)}$ and, for $1 \le k_2 < 2^{j_2}$, we choose $u^{j_1;k_1}_{j_2;k_2}$ such that
\[
P_n\big([u_{j_1;k_1-1},\, u_{j_1;k_1}] \times [u^{j_1;k_1}_{j_2;0},\, u^{j_1;k_1}_{j_2;k_2}] \times [u^{(3)}, \overline{u}^{(3)}] \times \cdots \times [u^{(d)}, \overline{u}^{(d)}]\big) = \pi_n\, k_2\, 2^{-(j_1+j_2)}.
\]
We proceed further in the same way and set finally, for any $j_1,\dots,j_d \ge 1$, $1 \le k_i \le 2^{j_i}$ ($i = 1,\dots,d-1$), $u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;0} = u^{(d)}$, $u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;2^{j_d}} = \overline{u}^{(d)}$ and choose, for $1 \le k_d < 2^{j_d}$, $u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;k_d}$ such that

\[
P_n\big([u_{j_1;k_1-1},\, u_{j_1;k_1}] \times [u^{j_1;k_1}_{j_2;k_2-1},\, u^{j_1;k_1}_{j_2;k_2}] \times \cdots \times [u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;0},\, u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;k_d}]\big) = \pi_n\, k_d\, 2^{-(j_1+\cdots+j_d)}.
\]
With these grid points, we define intervals

\[
B_{j_1;k_1} =
\begin{cases}
[u_{j_1;0},\, u_{j_1;1}], & \text{if } k_1 = 1,\\[2pt]
(u_{j_1;k_1-1},\, u_{j_1;k_1}], & \text{if } 1 < k_1 \le 2^{j_1}.
\end{cases}
\]
We proceed analogously to the definition above and end up with intervals
\[
B^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;k_d} =
\begin{cases}
[u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;0},\, u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;1}], & \text{if } k_d = 1,\\[2pt]
(u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;k_d-1},\, u^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;k_d}], & \text{if } 1 < k_d \le 2^{j_d}.
\end{cases}
\]
For any multi-index $(j;k) = (j_1,\dots,j_d;\, k_1,\dots,k_d)$, we define the hyperrectangle
\[
B_{j;k} = B_{j_1;k_1} \times B^{j_1;k_1}_{j_2;k_2} \times \cdots \times B^{j_1,\dots,j_{d-1};k_1,\dots,k_{d-1}}_{j_d;k_d}.
\]

The indexes $j_1,\dots,j_d$ determine the scale and $k_1,\dots,k_d$ the location of such a hyperrectangle. Note that these $B_{j;k}$ do not form a dyadic partition of $\llbracket u, \overline{u}\rrbracket$. By construction, these sets fulfil

Figure 3. Partition of $\llbracket u, \overline{u}\rrbracket$ in dimension 2.

\[
P_n(B_{j;k}) = \pi_n\, 2^{-|j|}, \tag{4.22}
\]
where $|j| = j_1 + \cdots + j_d$.

Let $J_n \in \mathbb{N}$ be chosen such that $1/(2n) < \pi_n 2^{-J_n} \le 1/n$. Each hyperrectangle $\llbracket u, v\rrbracket$ with $u \preceq v \preceq \overline{u}$ can be composed of disjoint rectangles $B_{j;l}$, where from each (multi-)scale $j$ at most one rectangle is taken, plus subsets of rectangles with $|j| = J_n$, again at most one per scale. For $(j;l)$ with $|j| \le J_n$ and $1 \le l_i \le 2^{j_i}\ \forall i$, we define
\[
V_{j;l} = \sum_{t=1}^n \varepsilon_t\, \mathbb{1}(I_t \in B_{j;l}).
\]
For $(j;l)$ with $|j| = J_n$ and $1 \le l_i \le 2^{j_i}\ \forall i$, we also define
\[
\overline{V}_{j;l} = \sum_{t=1}^n |\varepsilon_t|\, \mathbb{1}(I_t \in B_{j;l}).
\]
Then
\[
\sup_{v\colon u \preceq v \preceq \overline{u}} \Big|\sum_{t=1}^n \varepsilon_t\, \mathbb{1}(I_t \in \llbracket u, v\rrbracket)\Big|
\;\le\; \sum_{j\colon |j| \le J_n}\; \max_{l\colon 1 \le l_i \le 2^{j_i}\,\forall i} \{|V_{j;l}|\}
\;+\; \sum_{j\colon |j| = J_n}\; \max_{l\colon 1 \le l_i \le 2^{j_i}\,\forall i} \{\overline{V}_{j;l}\}. \tag{4.23}
\]
We verify at the end of this proof that, for $(j;l)$ with $|j| \le J_n$ and $1 \le l_i \le 2^{j_i}\ \forall i$,
\[
E[V_{j;l}^4] \le C_9\, \big(n \pi_n 2^{-|j|}\big)^2 \tag{4.24}
\]
for some $C_9 < \infty$. Therefore, we obtain
\[
E\big[|V_{j;l}|\; \mathbb{1}\big(|V_{j;l}| > \sqrt{n\pi_n}\, 2^{-|j|/4}\big)\big]
\;\le\; \frac{E[V_{j;l}^4]}{\big(\sqrt{n\pi_n}\, 2^{-|j|/4}\big)^3}
\;\le\; C_9\, \sqrt{n\pi_n}\, 2^{-5|j|/4},
\]
which implies that

\begin{align*}
E\Big[\max_{l\colon 1 \le l_i \le 2^{j_i}\,\forall i} \{|V_{j;l}|\}\Big]
&\le \sqrt{n\pi_n}\, 2^{-|j|/4} + \sum_{l\colon 1 \le l_i \le 2^{j_i}\,\forall i} E\big[|V_{j;l}|\; \mathbb{1}\big(|V_{j;l}| > \sqrt{n\pi_n}\, 2^{-|j|/4}\big)\big] \\
&\le (1 + C_9)\, \sqrt{n\pi_n}\, 2^{-|j|/4}. \tag{4.25}
\end{align*}

Now we define $\widetilde{V}_{j;l} = \overline{V}_{j;l} - E[\overline{V}_{j;l}]$. We obtain in complete analogy to (4.25) that, for some $C_{10} < \infty$,
\[
E\Big[\max_{l\colon 1 \le l_i \le 2^{j_i}\,\forall i} \{|\widetilde{V}_{j;l}|\}\Big] \le C_{10}\, \sqrt{n\pi_n}\, 2^{-J_n/4}.
\]
Since $E[\overline{V}_{j;l}] \le \sup_{x,t}\{E(|\varepsilon_t| \mid I_t = x)\}\; n\, \pi_n 2^{-J_n} \le C_{11}$ for some $C_{11} < \infty$, we conclude that
\[
E\Big[\max_{l\colon 1 \le l_i \le 2^{j_i}\,\forall i} \{\overline{V}_{j;l}\}\Big] \le C_{11} + C_{10}\, \sqrt{n\pi_n}\, 2^{-J_n/4}. \tag{4.26}
\]
From (4.23), (4.25) and (4.26) we obtain that

\begin{align*}
E\Big[\max_{v\colon u \preceq v \preceq \overline{u}} \Big|\sum_{t=1}^n \varepsilon_t\, \mathbb{1}(I_t \in \llbracket u, v\rrbracket)\Big|\Big]
&\le \sum_{j\colon |j| \le J_n} E\Big[\max_{l\colon 1 \le l_i \le 2^{j_i}\,\forall i} \{|V_{j;l}|\}\Big]
+ \sum_{j\colon |j| = J_n} E\Big[\max_{l\colon 1 \le l_i \le 2^{j_i}\,\forall i} \{\overline{V}_{j;l}\}\Big] \\
&\le (1 + C_9)\, \sqrt{n\pi_n}\, \sum_{j=1}^{J_n} 2^{-j/4}\, \#\{j\colon |j| = j\}
+ \big(C_{11} + C_{10}\, \sqrt{n\pi_n}\, 2^{-J_n/4}\big)\, \#\{j\colon |j| = J_n\} \\
&\le C_8\big(\sqrt{n\pi_n} + (\log n)^{d-1}\big),
\end{align*}
for some $C_8 < \infty$. $\square$

Proof of (4.24). (i) φ-mixing

Since $\sum_{t=1}^n E[\eta_t^\kappa] = O\big(n\pi_n 2^{-|j|}\big)$ holds under (A3)(i), for $\eta_t = \varepsilon_t\, \mathbb{1}(I_t \in B_{j;l})$ and $\kappa = 2, 4$, (4.24) is an immediate consequence of Lemma 4.1(i).

(ii) α-mixing

We first consider the case without trend, i.e. we assume (A3)(ii)(a) to be valid. Obviously, we have $\sum_{t=1}^n E[\eta_t^\kappa] = O\big(n\pi_n 2^{-|j|}\big)$, for $\eta_t = \varepsilon_t\, \mathbb{1}(I_t \in B_{j;l})$ and $\kappa = 2, 4$. Furthermore, using the notation of Lemma 4.1(ii), $\max_{1\le t\le n}\{C_{2,t}\} = O\big((\pi_n 2^{-|j|})^2\big)$ and $\max_{1\le t\le n}\{\|\eta_t\|_{4+\delta}^2\} = O\big((\pi_n 2^{-|j|})^{2/(4+\delta)}\big)$ hold under (A2) and the additional assumptions in (A3)(ii), in particular part (a). We obtain, with $\tilde{r} \asymp \big(\pi_n 2^{-|j|}\big)^{-1}$, that
\begin{align*}
&\sum_{t=1}^n E[\eta_t^2] + 2 \sum_{r=1}^{n-1} \min\Big\{\sum_{t=1}^n C_{2,t},\; 4\,[\alpha(r)]^{(2+\delta)/(4+\delta)} \sum_{t=1}^n \|\eta_t\|_{4+\delta}^2\Big\} \\
&\qquad = O\Big(n\pi_n 2^{-|j|} + \tilde{r} \sum_{t=1}^n C_{2,t} + 4 \sum_{r=\tilde{r}+1}^{n-1} [\alpha(r)]^{(2+\delta)/(4+\delta)} \sum_{t=1}^n \|\eta_t\|_{4+\delta}^2\Big)
= O\big(n\pi_n 2^{-|j|}\big).
\end{align*}
Using $\max_{1\le t\le n}\{C_{4,t}\} = O\big((\pi_n 2^{-|j|})^2\big)$, $\max_{1\le t\le n}\{\|\eta_t\|_{4+\delta}^4\} = O\big((\pi_n 2^{-|j|})^{4/(4+\delta)}\big)$ and choosing $\tilde{r} \asymp \big(\pi_n 2^{-|j|}\big)^{-1/3}$ we obtain that
\begin{align*}
&\sum_{r=1}^{n-1} (r+1)^2 \min\Big\{\sum_{t=1}^n C_{4,t},\; 4\,[\alpha(r)]^{\delta/(4+\delta)} \sum_{t=1}^n \|\eta_t\|_{4+\delta}^4\Big\} \\
&\qquad = O\Big(\tilde{r}^3 \sum_{t=1}^n C_{4,t} + \sum_{r=\tilde{r}+1}^{n-1} (r+1)^2\, [\alpha(r)]^{\delta/(4+\delta)} \sum_{t=1}^n \|\eta_t\|_{4+\delta}^4\Big)
= O\big(n\pi_n 2^{-|j|}\big).
\end{align*}

Hence, (4.24) follows by Lemma 4.1(ii).

The calculations are similar in the case where a trend variable is involved, that is, assumption (A3)(ii)(b) holds. Here we have, as in the case without a trend, $\sum_{t=1}^n E[\eta_t^\kappa] = O\big(n\pi_n 2^{-|j|}\big)$, for $\kappa = 2, 4$. Let $Z_t^{(1)} = t/n$, $1 \le t \le n$. Then $\eta_t \equiv 0$ unless
\[
t \in T_n = \Big\{t \le n\colon t/n \in \big(u^{j_1,\dots,j_{d_1};\, l_1,\dots,l_{d_1}}_{j_{d_1+1};\, l_{d_1+1}-1},\; u^{j_1,\dots,j_{d_1};\, l_1,\dots,l_{d_1}}_{j_{d_1+1};\, l_{d_1+1}}\big]\Big\}.
\]
Note that $n\pi_n 2^{-|j|}$ is of the same order of magnitude as $\#T_n\, \pi$, where $\pi = \prod_{i=2}^{d_2} \big(G_i(z^{(i)}_{j_i,l_i}) - G_i(z^{(i)}_{j_i,l_i-1})\big)$. Since $\max_{t\in T_n}\{C_{2,t}\} = O(\pi^2)$ and $\max_{t\in T_n}\{\|\eta_t\|_{4+\delta}^2\} = O\big(\pi^{2/(4+\delta)}\big)$ we obtain, with the choice $\tilde{r} \asymp 1/\pi$, that
\begin{align*}
&\sum_{t=1}^n E[\eta_t^2] + 2 \sum_{r=1}^{n-1} \min\Big\{\sum_{t=1}^n C_{2,t},\; 4\,[\alpha(r)]^{(2+\delta)/(4+\delta)} \sum_{t=1}^n \|\eta_t\|_{4+\delta}^2\Big\} \\
&\qquad = O\Big(\sum_{t\in T_n}\Big(\tilde{r}\, C_{2,t} + \sum_{r=\tilde{r}+1}^{n-1} [\alpha(r)]^{(2+\delta)/(4+\delta)}\, \|\eta_t\|_{4+\delta}^2\Big)\Big)
= O(\#T_n\, \pi) = O\big(n\pi_n 2^{-|j|}\big).
\end{align*}
Analogously, we obtain, with the choice $\tilde{r} \asymp \pi^{-1/3}$, that
\begin{align*}
&\sum_{r=1}^{n-1} (r+1)^2 \min\Big\{\sum_{t=1}^n C_{4,t},\; 4\,[\alpha(r)]^{\delta/(4+\delta)} \sum_{t=1}^n \|\eta_t\|_{4+\delta}^4\Big\} \\
&\qquad = O\Big(\sum_{t\in T_n}\Big(\tilde{r}^3\, C_{4,t} + \sum_{r=\tilde{r}+1}^{n-1} (r+1)^2\, [\alpha(r)]^{\delta/(4+\delta)}\, \|\eta_t\|_{4+\delta}^4\Big)\Big)
= O\big(n\pi_n 2^{-|j|}\big).
\end{align*}
Therefore, (4.24) follows again by Lemma 4.1(ii). $\square$

Acknowledgment. This research was partly funded by the German Research Foundation DFG, project NE 606/2-2, and by the Volkswagen Foundation (Professorinnen für Niedersachsen des Niedersächsischen Vorab). Part of this work was completed while K. Fokianos was visiting the Department of Statistics at TU Dortmund.

References

Anevski, D. and Hössjer, O. (2006). A general asymptotic scheme for inference under order restrictions. Ann. Statist. 34, 1874–1930.
Audrino, F. and Bühlmann, P. (2009). Splines for financial volatility. J. R. Statist. Soc. B 71, 655–670.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
Bradley, R. C. (2007). Introduction to Strong Mixing Conditions, Volume I. Kendrick Press.
Brunk, H. D. (1955). Maximum likelihood estimates of monotone parameters. Ann. Math. Statist. 26, 607–616.
Chatterjee, S., Guntuboyina, A. and Sen, B. (2015). On risk bounds in isotonic regression and other shape restricted regression problems. Ann. Statist. 43, 1774–1800.
Chen, Y. and Samworth, R. J. (2016). Generalized additive and index models with shape constraints. J. R. Statist. Soc. B, to appear.

Christopeit, N. and Tosstorff, G. (1987). Strong consistency of least-squares estimators in the monotone regression model with stochastic regressors. Ann. Statist. 15, 568–586.
Chu, C. K. and Marron, J. S. (1991). Comparison of two bandwidth selectors with dependent errors. Ann. Statist. 19, 1906–1918.
Davis, R. A., Dunsmuir, W. T. M., and Wang, Y. (2000). On autocorrelation in a Poisson regression model. Biometrika 87, 491–506.
Dedecker, J., Doukhan, P., Lang, G., León, J. R., Louhichi, S., and Prieur, C. (2007). Weak Dependence: With Examples and Applications. Lecture Notes in Statistics 190, Springer.
Dedecker, J., Merlevède, F., and Peligrad, M. (2011). Invariance principles for linear processes with application to isotonic regression. Bernoulli 17, 88–113.
Dette, H., Neumeyer, N., and Pilz, K. F. (2006). A simple nonparametric estimator of a strictly monotone regression function. Bernoulli 12, 469–490.
Doukhan, P. (1994). Mixing: Properties and Examples. Springer.
Durot, C. (2002). Sharp asymptotics for isotonic regression. Probab. Theory Relat. Fields 122, 222–240.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50, 987–1007.
Escanciano, J. C. (2006). Goodness-of-fit tests for linear and nonlinear time series models. J. Amer. Statist. Assoc. 101, 531–541.
Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and its Applications. London: Chapman & Hall.
Ferland, R., Latour, A. and Oraichi, D. (2006). Integer-valued GARCH processes. J. Time Ser. Anal. 27, 923–942.
Fokianos, K., Rahbek, A., and Tjøstheim, D. (2009). Poisson autoregression. J. Amer. Statist. Assoc. 104, 1430–1439.
Francq, C. and Zakoïan, J.-M. (2010). GARCH Models: Structure, Statistical Inference and Financial Applications. UK: Wiley.
Gao, F. and Wellner, J. A. (2007). Entropy estimate for high-dimensional monotonic functions. J. Mult. Anal. 98, 1751–1764.
Hanson, D. L., Pledger, G. and Wright, F. T. (1973). On consistency in monotonic regression. Ann.
Statist. 1, 401–421.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge; New York; New Rochelle: Cambridge University Press.
Ibragimov, I. A. (1962). Some limit theorems for stationary processes. Teor. Veroyatn. Primen. 7, 361–392 (in Russian). [English translation: Theory Probab. Appl. 7, 349–383.]
International Recovery Plan for the Whooping Crane (Grus americana), Third Revision (2007). Endangered Species Bulletins and Technical Reports (US Fish and Wildlife Service). Paper 45.
Kedem, B. and Fokianos, K. (2002). Regression Models for Time Series Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ.
Meister, A. and Kreiß, J.-P. (2016). Statistical inference for nonparametric GARCH models. Stoch. Proc. Appl. 126, 3009–3040.
Robertson, T. and Wright, F. T. (1975). Consistency in generalized isotonic regression. Ann. Statist. 3, 350–362.

Robertson, T., Wright, F. T., and Dykstra, R. L. (1988). Order Restricted Statistical Inference. Wiley, New York.
Sampson, A. R., Singh, H., and Whitaker, L. R. Order restricted estimators: some bias results. Statist. Probab. Lett. 61, 299–308.
Seber, G. A. F. and Wild, C. J. (1989). Nonlinear Regression. John Wiley & Sons, Inc., New York.
Shumway, R. H. and Stoffer, D. S. (2011). Time Series Analysis and its Applications. Third ed. Springer, New York.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10, 1040–1053.
Wu, J., Meyer, M. C., and Opsomer, J. D. (2015). Penalized isotonic regression. J. Statist. Plann. Inference 161, 12–24.
Zhang, C.-H. (2002). Risk bounds in isotonic regression. Ann. Statist. 30, 528–555.