<<

Optimal solutions to the isotonic regression problem

Alexander I. Jordan, Anja M¨uhlemann, and Johanna F. Ziegel

University of Bern

May 6, 2020

Abstract In general, the solution to a regression problem is the minimizer of a given loss criterion, and depends on the specified . The nonparametric isotonic regression problem is special, in that optimal solutions can be found by solely specifying a functional. These solutions will then be minimizers under all loss functions simultaneously as long as the loss functions have the requested functional as the Bayes act. For the functional, the only require- ment is that it can be defined via an identification function, with examples including the expectation, quantile, and expectile functionals. Generalizing classical results, we characterize the optimal solutions to the iso- tonic regression problem for such functionals, and extend the results from the case of totally ordered explanatory variables to partial orders. For total or- ders, we show that any solution resulting from the pool-adjacent-violators al- gorithm is optimal. It is noteworthy, that simultaneous optimality is unattain- able in the unimodal regression problem, despite its close connection.

Keywords: Order-restricted optimization problems, Partial order, Simulta- neous optimality, pool-adjacent-violators algorithm, consistent loss functions

arXiv:1904.04761v2 [math.ST] 5 May 2020 MSC Classifications: 62G08

1 Introduction

Suppose that we have pairs of observations (z1, y1), ..., (zn, yn) where we assume that yi, i = 1, . . . , n are real-valued. The aim of isotonic regression is to fit an increasing functiong ˆ: {z1, . . . , zn} → to these observations. The covariates z1, . . . , zn can take values in any set as long as it is equipped with a partial or- der which we denote by . Then, a function g : {z1, . . . , zn} → R is increasing if zi  zj implies that g(zi) ≤ g(zj).

1 As it is common in , we aim to find an estimateg ˆ that mini- mizes the expected loss for some loss function L: R × R → [0, ∞). If the function gˆ is interpreted as an estimator of the conditional expectation of a Y given Z, then a natural choice for L is the squared error loss L(x, y) = (x − y)2. For i ≤ j, let Ei:j denote the expectation with respect to the empirical distribution of (zi, yi),..., (zj, yj). Assuming that z1 < z2 < ··· < zn, the minimizer of the quadratic loss criterion 2 E1:n(g(Z) − Y ) (1) over all increasing functions g is given by

gˆ(z`) = min max Ei:jY = max min Ei:jY, ` = 1, . . . , n, (2) j≥` i≤j i≤` j≥i see Barlow et al. (1972, eq. (1.9)–(1.13)). The solutiong ˆ can be computed effi- ciently using the so-called pool-adjacent-violators (PAV) algorithm. These results were developed in the 1950s by several parties independently; see Ayer et al. (1955), Bartholomew (1959a), Bartholomew (1959b), Brunk (1955), van Eeden (1958), Miles (1959). It turns out that the solution given at (2) is also the unique minimizer of the Bregman loss criterion E1:nL(g(Z),Y ), (3) where the squared error loss in (1) has been replaced by a Bregman loss function L = Lφ (Barlow et al., 1972, Theorem 1.10). That is,

0 Lφ(x, y) = φ(y) − φ(x) − φ (x)(y − x), where φ is a convex function with subgradient φ0. Savage (1971) found that the Bregman class comprises all loss functions L where the expectation functional min- imizes the expected loss, i.e.,

EP Y = arg minx EP L(x, Y ), where Y is a random variable with distribution P . Due to this property, any loss function in the Bregman class is also referred to as a consistent loss function for the expectation functional (Gneiting, 2011). In summary, the increasing regression function at (2) is simultaneously optimal with respect to all consistent loss functions for the expectation. This robustness with respect to the choice of loss function that the solution to the regression problem is determined by the choice of the expectation as the target functional. We will see that the same holds for other functionals. As such, in nonparametric isotonic regression we can replace the task of choosing a loss function with the task of choosing a suitable target functional. This remarkable result is particularly beneficial in scenarios where a single rel- evant loss function cannot easily be identified. For example, institutions such as

2 y

● ●

z

Figure 1: Solutions in isotonic . Two solutions to the isotonic regression problem are shown for an example with z = z1, . . . , z4. The red curve is predetermined to pass through the midpoint of the functional intervals, whereas the blue curve illustrates the smoothest solution with minimal slope central banks or weather services provide analyses and forecasts that drive indi- vidual decision making in a heterogeneous group of users. In these circumstances, determining a unifying loss function is hardly trivial. However, publishing results for the expectation and for various quantile levels is certainly feasible. The simultaneous-optimality result for nonparametric isotonic regression is in stark contrast to the optimality behavior of parametric models for increasing regres- d sion functions. Suppose that {gθ : θ ∈ Θ},Θ ⊆ R is a parametric model of increas- ing functions gθ. Then, the optimal parameters with respect to the Bregman-loss criterion (3) generally vary (substantially) depending on the chosen loss function (Patton, 2019). Consistency of the loss function merely ensures that the true pa- rameter value of a correctly specified model minimizes the Bregman-loss criterion on the population level. Interestingly, simultaneous optimality with respect to all consistent loss functions generally also breaks down if one weakens the isotonicity constraint of the regression function to a unimodality constraint; see Section 5. In this paper, we generalize the result of Barlow et al. (1972, Theorem 1.10) in several directions. First, instead of the expectation functional, we consider gen- eral (possibly set-valued) functionals T that are given by an identification function V (x, y) as defined in Definition 2.1. Second, in the case of set-valued function- als, we give a complete characterization of all possible solutions for totally ordered covariates. Third, we demonstrate that a suitably modified version of min-max or max-min solutions as in (2) continues to hold for general partial orders on the covariates. An identification function is an increasing function that weighs negative values

3 in the case of underestimation against positive values in the case of overestimation, with an optimal expected value of zero. The corresponding functional T then maps to the optimizing argument (or set of optimizing arguments). Prime examples of such functionals are (possibly set-valued) quantiles, expectiles (Newey and Powell, 1987), or ratios of expectations. Quantiles, including the , have previously also been treated in Robertson and Wright (1973, 1980), but not in the interpreta- tion as set-valued functionals. Predefining a global scheme for reducing the median interval to a single point (e.g., some weighted average of lower and upper functional value) inevitably restricts the possible solutions to the isotonic regression problem. Figure 1 illustrates this issue, and shows how a more general interpretation of the functional as set-valued facilitates solutions with secondary optimality criteria such as smoothness and minimal slope. Expectiles and ratios of expectations, on the other hand, have been fully treated in Robertson and Wright (1980). These func- tionals map to single values and satisfy the Cauchy value property which is implied by identifiability. In contrast to previous work, we treat all functionals as set-valued. In Section 3, we give explicit solutions for the lower and upper bound of the isotonic regression problem in the context of total orders. The method of proof for these results is fundamentally different from the approach of Barlow et al. (1972, Theorem 1.10) or Robertson and Wright (1980), and in contrast to the latter comes with an im- mediate construction principle for loss functions. Our method relies on the mixture or Choquet representations of consistent loss functions, introduced by Ehm et al. (2016) for the quantile and expectile functionals. Given the identification function V (x, y) for the functional T , a one-parameter family of elementary loss functions that are consistent for the functional T can be readily defined,

Sη(x, y) = (1{η ≤ x} − 1{η ≤ y}) V (η, y), where η ∈ R. For all consistent loss functions L in the class Z  S = Sη(x, y) dH(η): H is a nonnegative measure on R , (4) R the optimal isotonic solution to the criterion (3) is bounded below by a min-max formula and bounded above by a max-min formula as in (2) with the expectation replaced by the lower and upper functional values under T , respectively. We show that the min-max or max-min solution is simultaneously optimal with respect to all elementary loss functions for T , and hence with respect to the entire class S. In fact, optimality of an isotonic solution with respect to the criterion (3) for L = Sη for some η ∈ R corresponds to finding a solution with optimal superlevel set {g ≥ η}. Considering an isotonicity constraint as a constraint on admissible superlevel sets of the regression function relates to the work of Polonik (1998) in the context of . If T is a quantile, an expectile, or a ratio of expectations, then S comprises all consistent loss functions for T subject to standard conditions, and if V (x, y) = x−y

4 is the identification function of the expectation, then the class S is the class of Bregman loss functions; see Ehm et al. (2016); Gneiting (2011). We also give results that can be directly translated to a simple algorithm that recovers the full of optimal solutions from the lower and upper bounds and the full set. While the bounds alone do not contain sufficient information, only few additional computations on the entire data set are necessary. Our method of proof also leads to a transparent proof of the validity of the PAV algorithm; see Section 3.2. Recently, M¨osching and D¨umbgen (2020) derived a similar result of min-max and max-min formulas as lower and upper bounds for optimal isotonic solutions in the context of set-valued minimizers of convex and coercive loss functions. Br¨ummer and Du Preez (2013) rediscover the result of Barlow et al. (1972) that the PAV algorithm leads to a simultaneously optimal solution for all proper scoring rules in the context of binary events – a special class of loss functions that are consistent for the expectation functional. In Section 4, we treat general partial orders on the covariates and demonstrate that a suitably modified version of min-max or max-min solutions continues to hold. Again, the optimal isotonic fit is simultaneously optimal with respect to all loss functions in S defined at (4). With our method of proof this extension is straightforward but for reasons of transparency, we first present the case of a total order in Section 3. The results in Robertson and Wright (1980) not only hold for a large class of functionals, but also for partial orders on the covariates. However, the generality of their results is limited by treating potentially set-valued functionals as maps to single values. To the best of our knowledge, the literature following Robertson and Wright (1980) is void of further results that characterize the solutions to the isotonic regression problem, or any investigations into the effect of the choice of loss function among options sharing the same Bayes act. A comprehensive overview on isotonic regression is given in the monograph Groeneboom and Jongbloed (2014). Also, Guntuboyina and Sen (2018) review risk bounds, asymptotic theory, and algorithms in common nonparametric shape- restricted regression problems in the context of optimization. Among the most recent developments on algorithms for isotonic regression with partially ordered covariates, Kyng et al. (2015) and Stout (2015) provide fast algorithms for isotone regression under different loss functions using the representation of a partial order as a directed acyclic graph. Recent advances on asymptotic theory for isotonic regression include Han et al. (2019), giving rates for least squares isotonic regression on the unit cube of arbitrary dimension, and Bellec (2018), considering isotonic, unimodal, and convex regression in the context of total orders. Another recent interest is the regularization of isotonic regression on multiple variables with Luss and Rosset (2017) proposing a method via range restriction on the solution to the regression problem.

5 2 Functionals and consistent loss functions

We start with the definition of a functional via an identification function.

Definition 2.1. A function V : R × R → R is called an identification function if V (·, y) is increasing and left-continuous for all y ∈ R. Then, for any finite and nonnegative measure P on R, we define the functional T induced by an identification function V as − + ¯ T (P ) = [TP ,TP ] ⊆ [−∞, +∞] = R, where the lower and upper bounds are given by

− + TP = sup {x : V (x, P ) < 0} and TP = inf {x : V (x, P ) > 0} , R ∞ using the notation V (x, P ) = −∞ V (x, y) dP (y). Defining functionals for any finite and nonnegative measure, as opposed to merely probability distributions, is a minor detail that simplifies notation when joining and intersecting data subsets. Except in the case of the null measure, any finite and nonnegative measure can be replaced with its corresponding , without any change to the functional values. All results are concerned with probability distributions P with finite support, and therefore, the existence of integrals is guaranteed. The following example and Proposition 2.6 hold for more general types of distributions given that the relevant integrals exist. We leave these obvious generalizations up to the reader and assume that all probability distributions considered have finite support. − + Note that TP can take the value −∞ and TP can take the value +∞. In the subsequent results, we repeatedly refer to the smallest or largest element of a finite set where one of the elements could be ±∞. We still write min and max of the set but this quantity could be ±∞.

Definition 2.2. A functional T is called a functional of singleton type if T (P ) is a singleton whenever P is not the null measure. Otherwise, T is called a functional of interval type.

Table 1 summarizes common functionals and their respective identification func- tions, and Example 2.3 explains two options in more detail.

Example 2.3. Let α, τ ∈ (0, 1), and let P denote a probability distribution.

(a) Consider the identification function V (x, y) = 1{x > y} − α, then V (x, P ) = P (Y < x) − α, and the interval of all α-quantiles of P ,

T (P ) = [sup{x : P (Y < x) < α}, inf{x : P (Y < x) > α}],

is potentially of positive length.

6 Table 1: Selection of functionals and their respective identification func- tions. The parameters satisfy α, τ ∈ (0, 1), p > 1 and δ > 0, and u : I → R and w : I → (0, ∞) are measurable functions on an interval I ⊆ R. The functionals “`p minimizer” and “Huber minimizer” map to the intervals of values minimizing the `p loss and the Huber loss (Huber, 1964), respectively

Functional Identification function Type Median V (x, y) = 1{x > y} − 1/2 interval Mean V (x, y) = x − y singleton 2nd V (x, y) = x − y2 singleton α-Quantile V (x, y) = 1{x > y} − α interval τ-Expectile V (x, y) = 2|1{x > y} − τ|(x − y) singleton Ratio EP (u(Y ))/EP (w(Y )) V (x, y) = xw(y) − u(y) singleton p−1 `p minimizer V (x, y) = sign(x − y)|x − y| singleton Huber minimizer V (x, y) = sign(x − y) min(|x − y|, δ) interval

(b) The identification function V (x, y) = 2|1{x > y} − τ|(x − y) leads to Z x Z ∞ V (x, P ) = 2(1 − τ) (x − y) dP (y) + 2τ (x − y) dP (y), −∞ x which is strictly increasing and continuous in its first argument. Hence, there exists a unique solution in x for the equation V (x, P ) = 0, and we call that 1 solution the τ-expectile eτ (P ). In particular, for τ = 2 we obtain V (x, y) = x − y and thus T (P ) = {EP (Y )}. In the later proofs, we use three implications of Definition 2.1 repeatedly to establish order relationships between the variable in the first argument of V and the functional of an empirical distribution. To facilitate reference, we note these statements explicitly. Corollary 2.4. Let V be an identification function inducing the functional T , and P be a finite and nonnegative measure on R. Then, V (η, P ) = 0 =⇒ η ∈ T (P ), + V (η, P ) > 0 =⇒ η > sup T (P ) = TP , − V (η, P ) < 0 =⇒ η ≤ inf T (P ) = TP . Lemma 2.5 shows that a generalized version of the Cauchy mean value property, used to define functionals in Robertson and Wright (1980), holds for any functional we consider in this paper. This suggests that our results are less general, unless it can be proven that every Cauchy mean value function can be defined in terms of

7 an identification function. On the other hand, in contrast to Robertson and Wright (1980), we treat set-valued functionals and their boundaries rigorously, and retain a higher level of generality in that regard.

Lemma 2.5. Let P,Q be finite and nonnegative measures on R. Then,

− + − + − + min{TP ,TQ } ≤ TP +Q ≤ TP +Q ≤ max{TP ,TQ }.

Proof. The statement follows from Definition 2.1. The second inequality is trivial. − + For the first inequality, and x < min{TP ,TQ }, we have V (x, P ) < 0 and V (x, Q) ≤ 0, hence V (x, P + Q) < 0. A similar argument applies to the third inequality. The definition of a functional in terms of an identification function comes with a straightforward construction principle for large classes of loss functions. In a nut- shell, a continuous oriented identification function defines a functional via its unique root in the first argument, a first-order condition. By integration, corresponding loss functions inherit the consistency for the functional, i.e., the minimum expected loss is attained by any member in T (P ). The loss functions defined in Proposition 2.6 are the most basic, in the sense that they are a result of integration with respect to the Dirac measure at a given threshold η ∈ R. A similar result has also been discussed in Dawid (2016) and Ziegel (2016).

Proposition 2.6. Let V be an identification function, T be the induced functional, ¯ and η ∈ R. Then the elementary loss function Sη : R × R → R given by

Sη(x, y) = (1{η ≤ x} − 1{η ≤ y}) V (η, y) is consistent for T relative to the class P of probability distributions with finite support. That is,

EP Sη(t, Y ) ≤ EP Sη(x, Y ) for all P ∈ P, all t ∈ T (P ) and all x ∈ R¯. Proof. Let

1 1 d(η) = EP Sη(t, Y ) − EP Sη(x, Y ) = ( {η ≤ t} − {η ≤ x}) V (η, P ). If V (η, P ) = 0 then d(η) = 0. If V (η, P ) < 0 it follows from Corollary 2.4 that η ≤ t and therefore d(η) ≤ 0. Similary, if V (η, P ) > 0 it follows that η > t and therefore d(η) ≤ 0. As an immediate consequence of the consistency of elementary loss functions for the functional T , we have that all loss functions in the class S defined at (4) are also consistent for the functional T . This result exemplifies an important line of reasoning used multiple times in this paper: A property of Sη that holds for

8 Table 2: Commonly used loss functions that are consistent for the mean functional. For an interval I ⊆ R, a Bregman loss is induced by a convex function φ : I → R with subgradient φ0. See Patton (2011, 2019) for the QLIKE loss and the exponential Bregman loss, respectively

Name Mixing measure Loss function Domain H((η1, η2]) = L(x, y) = 0 0 0 Bregman loss φ (η2) − φ (η1) φ(y) − φ(x) − φ (x)(y − x) I 2 Squared error η2 − η1 (x − y) R Exponential Bregman exp(η2) − exp(η1) exp(y) − exp(x) − exp(x)(y − x) R QLIKE loss −1/η2 + 1/η1 y/x − log (y/x) − 1 (0, ∞)

all η ∈ R translates to the class S. Examples of members of the class S for the expectation functional, i.e., V (x, y) = x − y, are given in Table 2. The importance of the construction in Proposition 2.6 lies in the postponing of integration, or, in other words, applying Fubini in a double integration (with respect to P and to H), and then showing the property of consistency for the integrand Sη for each η rather than for the original loss function which is the integral of Sη with respect to dH(η).

3 Results for total orders

3.1 Min-max and max-min solutions

Suppose that we have observations (z1, y1),..., (zn, yn), and let P denote their empirical distribution. Throughout this section, we assume that the covariates z1, . . . , zn are equipped with a total order, and that the indices are chosen such that z1 < z2 < ··· < zn. Repeated observations can also be easily accommodated as explained in Remark 3.1 below. ¯ We aim to find an increasing function g : {z1, . . . zn} → R that minimizes

EP Sη(g(Z),Y ) for all η ∈ R, (5) where the random vector (Z,Y ) has distribution P . Any increasing functiong ˆ solving this optimization problem is a solution to the isotonic regression problem that is optimal with respect to all scoring functions in the class S, simultaneously. 1 Condition (5) is equivalent to minimizing EP {η ≤ g(Z)}V (η, Y ) for all η ∈ R. We can rephrase the minimization problem to reflect the way in which we prove the main result: For a given η ∈ R, we have to find an index i ∈ {1, . . . , n + 1} that minimizes n X si(η) = vi:n(η) = V (η, y`). `=i

9 Thereby, we obey the condition

{z : g(z) ≥ η} = {zi, . . . , zn}, implied by the monotonicity constraint on g. This index search needs to be con- ducted for every η ∈ R separately. In a nutshell, we find the generalized inverse to an optimal solution. Afterwards, we define the overall minimizing functiong ˆ. From now on, we assume that all indices i, j ∈ {1, . . . , n + 1} unless specified otherwise.

Remark 3.1. The assumption that the ordering z1 < ··· < zn is strict is non- 0 0 restrictive. Given a series of observations (z1, y1),..., (zm, ym) with non-strictly 0 ordered or unordered zi, we can choose z1 < ··· < zn < zn+1 = ∞ such that 0 0 {z1, . . . , zm} ⊆ {z1, . . . , zn}. We define the empirical counting measure for the index range from i to j by

j m X X 1 0 1 0 Pi:j(B) = {z` = zk} {(zk, yk) ∈ B}, `=i k=1 with the corresponding integral of the identification function being equal to the following sum,

j m X X 1 0 V (η, Pi:j) = vi:j(η) = {z` = zk}V (η, yk). `=i k=1

For condition (5), we write the empirical probability distribution as P (B) = P1:n(B)/m. The subsequent arguments leading to an optimal solution rely solely on the identifi- cation sum vi:j(η) and the functional T (Pi:j), where we dealt with the dependence on the number of observations for each unique value of z` in the above generalization. We begin by introducing sets consisting of minimizing indices. For η ∈ R, let I(η) denote the set of indices i minimizing s (η), and define I = S I(η) ⊆ i η∈R {1, . . . , n + 1}. That is, i ∈ I if and only if there exists an η ∈ R such that

si(η) ≤ sj(η) for all j.

The following proposition is immediate.

Proposition 3.2. Let η ∈ R. The inclusion i ∈ I(η) holds if and only if,

vi:(j−1)(η) ≤ 0 for all j > i,

vj:(i−1)(η) ≥ 0 for all j < i.

If j > i and vi:(j−1)(η) = 0, then j ∈ I(η). Analogously, if j < i and vj:(i−1)(η) = 0, then j ∈ I(η).

10 y

● ●

● ● ● η ●

● z 1 3 4 6 8

Figure 2: Minimizing indices are separators. For a sample of 9 data points, the graph illustrates the functional value (expectation) on relevant subsets of the data for a given η and the minimizing index i = 3. The expectation value (vertical location of a brown line) is above or below η when the corresponding subsample extends (horizontal extension of a brown line) to the right or left of the minimizing index, respectively

The following proposition is a key observation to show optimality of the min- max and max-min solution. We relate the threshold η ∈ R to the minimal and − maximal elements of the functional T on subsets of the data. We write Ti:j = T − = inf T (P ) and T + = T + = sup T (P ). Pi:j i:j i:j Pi:j i:j Proposition 3.3. Let η ∈ R, and i ∈ I(η). Then, max T − ≤ η ≤ min T + , ji i:(j−1) + − max Tj:(i−1) < η ≤ min Ti:(j−1). ji,j6∈I(η)

Proof. For all j < i, we have vj:(i−1)(η) ≥ 0. For all j > i, we have vi:(j−1)(η) ≤ 0. Both inequalities are strict when j∈ / I(η). Corollary 2.4 implies the result. Figure 2 illustrates the statement in Proposition 3.3.

Lemma 3.4. (a) Let η, η0 ∈ R, η < η0, and i ∈ I(η), j ∈ I(η0). Then, min{i, j} ∈ I(η) and max{i, j} ∈ I(η0).

(b) For all η ∈ R, I(η) is a set of consecutive indices in I. In other words, if i, j ∈ I(η) and i < i0 < j such that i0 ∈/ I(η), then i0 ∈/ I. (c) The functions η 7→ min I(η) and η 7→ max I(η)

11 are increasing.

(d) Suppose that ηm ↑ η and i ∈ I(ηm) for all m ∈ N. Then, i ∈ I(η).

0 Proof. (a) If j ≥ i the statement is trivial. Otherwise, vj:(i−1)(η) ≥ 0 ≥ vj:(i−1)(η ) 00 00 0 by Proposition 3.2, hence vj:(i−1)(η ) = 0 for all η ∈ [η, η ] due to the mono- tonicity of the identification function. The statement follows from Proposition 3.2.

0 0 0 (b) Suppose the contrary: There exists an η 6= η such that i0 ∈ I(η ). If η < 0 η, we have that vi:(i0−1)(η ) ≥ 0. Similarly, since i0 ∈/ I(η) it holds that

vi:(i0−1)(η) < 0. This contradicts the monotonicity assumption for the first 0 0 argument of V . The argument against an η > η such that i0 ∈ I(η ) works similarly.

(c) Let η < η0 and suppose the contrary: Let i = min I(η) and i0 = min I(η0) such 0 0 0 that i > i . Then, i 6∈ I(η), and we have vi0:(i−1)(η ) ≤ 0 and vi0:(i−1)(η) > 0, contradicting the monotonicity assumption for the first argument of V . The argument for η 7→ max I(η) works similarly.

(d) The left-continuity of the identification function implies that sj(ηm) ↑ sj(η) as ηm ↑ η for all j. If i ∈ I(ηm) for all m ∈ N, then si(ηm) ≤ sj(ηm) for all m ∈ N and all j, which in combination with the left-continuity implies si(η) ≤ sj(η) for all j.

Lemma 3.4 confirms the existence of a left-continuous function ι: R → {1, . . . , n+ 1} mapping η to a score-minimizing index i that indicates the smallest z` in the corresponding set {zi, . . . , zn}. Note that limη→−∞ ι(η) = 1 and limη→∞ ι(η) = n+1. Then, any function g : {z1, . . . , zn} → R with superlevel sets corresponding to the sets induced by ι, i.e., with g(z`) ≥ η for all z` ∈ {zι(η), . . . , zn} for all η ∈ R, must be an optimizing solution. In fact, this solution is unique for a given ι because monotone functions are characterized by their superlevel sets.

Proposition 3.5. Let ι: R → {1, . . . , n+1} be an increasing, left-continuous func- tion such that ι(η) ∈ I(η). Then, the function gˆ: {z1, . . . , zn} → R given by

inf{η : ι(η) > `} =g ˆ(z`) = max{η : ι(η) ≤ `} (6) is the unique function that satisfies

{z : g(z) ≥ η} = {zι(η), . . . , zn} for all η ∈ R, among all increasing functions g : {z1, . . . , zn} → R.

12 η

ι(η) =ι(7η)

y6 ●● ι(η) =ι(6η) y3 ● y2 = y5 ●● ●

y4 ● ι(η) =ι(2η)

y1 ●●

ι(η) =ι(1η)

= = z1 z2 z3 z4 z5 z6

Figure 3: Graph of gˆ. For a sample of 6 data points, the values ofg ˆ(z) for z = z1, . . . , z6 are shown in red. The epigraph of the function η 7→ zι(η) is shown in grey, where T is chosen as the median functional to choose ι(η)

Proof. Due to the monotonicity and left-continuity of ι: R → {1, . . . , n + 1}, we have inf{η : ι(η) > `} = max{η : ι(η) ≤ `}, ` = 1, . . . , n. The monotonicity ofg ˆ follows from the monotonicity of ι and the fact that {z1, . . . , zn} is ordered. Let η0 ∈ R. Then,

0 0 0 (i)g ˆ(z`) ≥ η =⇒ ι(ˆg(z`)) ≥ ι(η ) =⇒ ` ≥ ι(η ), 0 0 (ii)g ˆ(zι(η0)) = max{η : ι(η) = ι(η )} ≥ η .

0 0 Therefore, {z :g ˆ(z) ≥ η } ⊆ {zι(η0), . . . , zn} ⊆ {z :g ˆ(z) ≥ η } where the first inclusion follows by (i) and the second by (ii). Uniqueness follows because any hypothetical alternativeg ¯ withg ¯(z`) 6=g ˆ(z`) for some ` ∈ {1, . . . , n} leads to the contradiction {zι(η), . . . , zn} = {z :g ¯(z) ≥ η}= 6 {z :g ˆ(z) ≥ η} = {zι(η), . . . , zn} for all η betweeng ¯(z`) andg ˆ(z`). In Figure 3 we give an example for 6 data points. The example illustrates how the valuesg ˆ(z`), ` = 1, . . . , n, can be determined from the epigraph of the function η 7→ zι(η). Now, we can state and show one of our main results which is thatg ˆ coincides with or is bounded by a min-max and max-min solution.

Proposition 3.6. Let ` ∈ {1, . . . , n} and let gˆ be a solution to the isotonic regres- sion problem. Then,

− + min max Ti:j ≤ gˆ(z`) ≤ max min Ti:j. j≥` i≤j i≤` j≥i

13 Proof. Applying the first set of bounds from Proposition 3.3 to the formula forg ˆ at (6) yields

− + inf max Ti:(ι(η)−1) ≤ gˆ(z`) ≤ max min Tι(η):(j−1). ι(η)>` i<ι(η) ι(η)≤` j>ι(η)

− The lower bound is bounded below by minj≥` maxi≤j Ti:j, and the upper bound is + bounded above by maxi≤` minj≥i Ti:j. The max-min inequality implies that for functionals T of singleton type, e.g., the expectation or expectile functionals, the lower and upper bound in Proposition 3.6 are equal. In general, a similar statement always holds, where the choice of ι determines whetherg ˆ attains the minimal or maximal elements of the functional.

Proposition 3.7. Let ` ∈ {1, . . . , n}.

(a) If ι(η) = min I(η) for all η ∈ R, then,

+ + gˆ(z`) = min max Ti:j = max min Ti:j. j≥` i≤j i≤` j≥i

(b) If ι(η) = max I(η) for all η ∈ R, then,

− − gˆ(z`) = min max Ti:j = max min Ti:j. j≥` i≤j i≤` j≥i

Proof. The proof works the same way as the proof of Proposition 3.6 but using second set of bounds in Proposition 3.3. This is possible because in (a) we have that for j < ι(η) it holds that j 6∈ I(η) and in (b), for j > ι(η) we know that j 6∈ I(η). Let us denote the solution in part (a) of Proposition 3.7 by g+ and the one in part (b) by g−. Clearly, it always holds that g− ≤ g+. It is a natural question whether any increasing function g that satisfies g− ≤ g ≤ g+ is also a minimizer of the criterion (5). It turns out that the answer is negative; see M¨osching and D¨umbgen (2020, Remark 2.2, Example 2.4). The following proposition provides a simple sufficient criterion for g to also be a solution. In the case of quantiles and for the classical asymmetric linear loss, the same result is shown in M¨osching and D¨umbgen (2020, Lemma 2.1). Note that in Proposition 3.8 it is not required that g−, g+ are the solutions from Proposition 3.7 as long as they satisfy g− ≤ g+.

Proposition 3.8. Let g−, g+ be two solutions to the isotonic regression problem, that is, minimizers of (5), and suppose they satisfy g− ≤ g+. Let gˆ be increasing, − + + + − − 0 g ≤ gˆ ≤ g , and suppose that g (z`) = g (z`0 ), g (z`) = g (z`0 ) for some ` < ` implies gˆ(z`) =g ˆ(z`0 ). Then, gˆ is also a minimizer of (5), that is, a solution to the isotonic regression problem.

14 − Proof. For η ∈ R, define ι(η) := min{` :g ˆ(z`) ≥ η}, and analogously ι (η) and ι+(η) withg ˆ replaced by g− and g+, respectively. The functions ι, ι−, ι+ are increasing and left-continuous. For ι−, ι+ it holds that ι−(η), ι+(η) ∈ I(η). For all ` ∈ {1, . . . , n}, we have that

− − g (z`) = max{η : ι (η) ≤ `} + + ≤ gˆ(z`) = max{η : ι(η) ≤ `} ≤ g (z`) = max{η : ι (η) ≤ `}, therefore, ι+(η) ≤ ι(η) ≤ ι−(η) for all η ∈ R. By Lemma 3.4 (b), it remains to show that ι(η) ∈ I. This follows from the following two observations. First, if

0 gˆ(z`) = max{η : ι(η) ≤ `} = max{η : ι(η) ≤ ` } =g ˆ(z`0 )

0 0 − − for some ` ≤ ` , then ι(η) 6∈ {` + 1, . . . , ` }. Second, if g (z`) < g (z`+1) or + + g (z`) < g (z`+1), then ` + 1 ∈ I.

+ The proof of Proposition 3.8 shows thatg ˆ may jump at points z` where g and g− do not jump as long as ` ∈ I, that is, as long as ` is a minimizing index for some η. The following Proposition 3.9 characterizes the possible additional jumps ofg ˆ.

Proposition 3.9. Let g−, g+ be two solutions to the isotonic regression problem, and suppose that for some i, j ∈ {1, . . . , n}, i < j,

− − − + + + η := g (zi) = g (zj) < g (zi) = g (zj) =: η .

− − + + Furthermore, assume it holds that for i > 1, g (zi−1) 6= g (zi) or g (zi−1) 6= g (zi), − − + + and for j < n, g (zj) 6= g (zj+1) or g (zj) 6= g (zj+1). Then for ` ∈ {i + 1, . . . , j}, − − − + we have Ti:(`−1) ≤ η if and only if ` ∈ I(η) for all η ∈ (η , η ].

+ + − − Proof. We will first argue that Ti:k ≥ η for k > i, and that Tk:j ≤ η for k < j. The assumptions ensure that there are i− ≤ i, i+ ≤ i such that i− ∈ I(η−), i+ ∈ I(η+), and max{i−, i+} = i. By Lemma 3.4 (a) we have i ∈ I(η+), and + + + therefore vi:k(η ) ≤ 0 for all k > i by Proposition 3.2, hence Ti:k ≥ η by Corollary 2.4. The assumptions also imply that there areη ˜− > η−, j− ≥ j,η ˜+ > η+, j+ ≥ j such that j− + 1 ∈ I(η) for all η ∈ (η−, η˜−], j+ + 1 ∈ I(η) for all η ∈ (η+, η˜+], and min{j−, j+} = j. By Lemma 3.4 (a) we have j + 1 ∈ I(η) for η ∈ (η−, η˜−], − − − and therefore vk:j(η) ≥ 0 for all k ≤ j, η ∈ (η , η˜ ], hence Tk:j ≤ η. In summary, − − Tk:j ≤ η for all k ≤ j. − − For the first part of the result, let ` ∈ {i + 1, . . . , j} such that Ti:(`−1) ≤ η . By + − + + + Lemma 2.5, we have Ti:k ≤ max{Ti:(`−1),T`:k} for all k ≥ `. Since Ti:k ≥ η and − − + + + Ti:(`−1) ≤ η , we have T`:k ≥ η and v`:k(η) ≤ 0 for all η ≤ η , k ≥ `. Similarly, by − − + − − Lemma 2.5, we have Tk:j ≥ min{Tk:(`−1),T`:j} for all k ≤ ` − 1. Since Tk:j ≤ η and

15 + + − − − as shown above T`:j ≥ η , we have Tk:(`−1) ≤ η and vk:(`−1)(η) ≥ 0 for all η > η , k ≤ ` − 1. Hence, we have ` ∈ I(η) for all η ∈ (η−, η+]. − + To prove the converse, note that ` ∈ I(η) for all η ∈ (η , η ] implies vk:(`−1)(η) ≥ − + − 0 for all η ∈ (η , η ], k < `. Hence, in particular, vi:(`−1)(η) ≥ 0 and Ti:(`−1) ≤ η for − + − − all η ∈ (η , η ], and, therefore, Ti:(`−1) ≤ η .

3.2 Pool-adjacent-violators algorithm

As in Section 3.1, the PAV algorithm takes observations (z1, y1),..., (zn, yn), with z1 < ··· < zn and can be generalized as detailed in Remark 3.1. Its starting point is the finest partition Q0 = {{1},..., {n}} of the index set {1, . . . , n}, and a corresponding function g0 : {z1, . . . , zn} → R satisfying

g0(z`) ∈ T (P`:`).

If possible, an increasing function has to be chosen. The algorithm iteratively considers pooling adjacent elements Q1 and Q2 in the current partition, where + “adjacent” means that the largest element of Q1, Q1 = max Q1, is the predecessor − (in terms of the natural numbers) of the smallest element of Q2, Q2 = min Q2. − + Pooling adjacent partition elements is considered necessary when T − + > T − + Q1 :Q1 Q2 :Q2 + − (strong adjacent violators), it is considered invalid when T − + < T − + , and Q1 :Q1 Q2 :Q2 optional otherwise (weak adjacent violators). The early stopping criterion is the existence of an increasing function gPAV : {z1, . . . , zn} → R that is constant on each element of the current partition QPAV and satisfies

gPAV(z`) ∈ T (PQ−:Q+ ) for all Q ∈ QPAV and ` ∈ Q, (7) that is, when no further pooling is necessary. The late stopping criterion is reached when no weak adjacent violators remain. The first and most apparent property we − + observe is that for all ` ∈ {1, . . . , n}, Q1,Q2 ∈ QPAV, Q1 ≤ ` ≤ Q2 , we have

− + T − + ≤ gPAV(z`) ≤ T − + , (8) Q1 :Q1 Q2 :Q2 since otherwise either gPAV is not increasing or the condition (7) is violated. Defini- tion 2.1 and its Corollary 2.4 allow for an immediate proof of an additional property of QPAV. Proposition 3.10. Let Q be a partition of {1, . . . , n} found by the PAV algorithm, Q ∈ Q, and j ∈ Q. Then,

− − + + Tj:Q+ ≤ TQ−:Q+ ≤ TQ−:Q+ ≤ TQ−:j. Proof. The second inequality is trivial. For the first inequality, suppose the con- − − trary: There exist η ∈ R, j ∈ Q such that TQ−:Q+ < η < Tj:Q+ . This implies

16 − that j > Q and vQ−:Q+ (η) ≥ 0 > vj:Q+ (η), hence vQ−:(j−1)(η) > 0. Therefore, + − TQ−:(j−1) < η < Tj:Q+ , which means that Q can be seen as the result of an invalid pooling of {Q−, . . . , j −1} and {j, . . . , Q+}. A similar argument applies to the third inequality. To show the connection between a valid solution by the PAV algorithm and the score optimizing solutiong ˆ in Section 3.1, we define

ιPAV(η) = min{k : η ≤ gPAV(zk)}. (9)

Plugging ιPAV into the definition ofg ˆ recovers gPAV,

gˆ(z`) = max{η : ιPAV(η) ≤ `}

= max{η : η ≤ gPAV(z`)} = gPAV(z`).

In order to show that gPAV solves the isotonic regression problem, it remains to be shown that ιPAV(η) ∈ I(η) for all η ∈ R.

Proposition 3.11. Let η ∈ R, then ιPAV(η) ∈ I(η). Proof. Let η ∈ R. We combine Proposition 3.10, the statement (8), and the defining equation (9). As a result, for all j, k ∈ {1, . . . , n + 1}, j < ιPAV(η) < k, we have T − ≤ g (z ) < η ≤ g (z ) ≤ T + , hence j:(ιPAV(η)−1) PAV ιPAV(η)−1 PAV ιPAV(η) ιPAV(η):(k−1) vj:(ιPAV(η)−1)(η) ≥ 0 ≥ vιPAV(η):(k−1)(η). The statement follows from Proposition 3.2.

As a closing side note, we point out that ιPAV corresponds to coarsest partition that allows the solution gPAV. Any weak adjacent violators on which gPAV takes the same value have been pooled.

4 Generalization to partial orders

In the first part of this paper, we considered a series of observations (z`, y`), where ` = 1, . . . , n and the set {z1, . . . , zn} was totally ordered. In this section, we solve the isotonic regression problem considering a distribution P for a random vector (Z,Y ) ∈ Z × R, where Z is a finite partially ordered set. Analogously to (5), we aim to minimize the criterion

EP Sη(g(Z),Y ) for all η ∈ R, (10) over all increasing functions g : Z → R¯. We call any minimizer of (10) a solution to the isotonic regression problem. The considerations in Section 3.1 lead to the reformulation of the optimization problem at (5) as the minimization of

n X si(η) = vi:n(η) = V (η, y`) `=i

17 over all i ∈ {1, . . . , n + 1}. The dependence on z1, . . . , zn andg ˆ seemingly vanishes, but remains encoded in the index set {1, . . . , n + 1} and in the link to η via an optimizing function ι: R → {1, . . . , n + 1} such that

{z :g ˆ(z) ≥ η} = {zι(η), . . . , zn}.

In the second part, we now generalize the index set {1, . . . , n + 1} and the function ι in order to accommodate a partially ordered set Z. We introduce upper sets x ⊆ Z to replace single indices i ∈ {1, . . . , n + 1}. We consider a set X ⊆ P(Z), where P denotes the power set. The set X consists of all admissible superlevel sets for an increasing function g imposed by the partial order on Z. A set x ∈ X is characterized by the property that if z ∈ x and z  z0, then z0 ∈ x. This implies that X is a finite lattice, that is, it is closed under union and intersection and contains Z and the empty set. Consequently, we replace the function ι: R → {1, . . . , n + 1} with a function ξ : R → X , that maps η to an upper set x of Z that minimizes Z sx(η) = vx(η) = V (η, Px) = V (η, y) P (dz, dy), (11) x×R where Px(A) = P ((x × R) ∩ A) for any A ∈ P(Z) ⊗ B(R), where B(R) denotes the Borel σ-algebra on R. In this notation, sx is only defined for x ∈ X , whereas vx and Px are defined for any x ∈ P(Z). Let X(η) denote the set of superlevel sets x ∈ X minimizing sx(η) at (11). Since P(Z) is finite, such a minimizer always exists. The following proposition is the analogue to Proposition 3.2 and gives necessary and sufficient conditions for the inclusion x ∈ X(η).

Proposition 4.1. Let η ∈ R. Subject to x, x0 ∈ X , the inclusion x ∈ X(η) holds if and only if

0 vx\x0 (η) ≤ 0 for all x ( x, 0 vx0\x(η) ≥ 0 for all x ) x.

0 0 Let x ∈ X(η), x ∈ X . If vx\x0 (η) = vx0\x(η), then x ∈ X(η).

0 0 Proof. Note that sx(η) ≤ sx0 (η) for all x ( x and all x ) x holds if and only 0 0 if vx\x0 (η) ≤ 0 for all x ( x and vx0\x(η) ≥ 0 for all x ) x. For the first part 0 of the result, note that x ∈ X(η) implies sx(η) ≤ sx0 (η) for all x ( x and all x0 ) x. Conversely, let x ∈ X be such that the latter condition is satisfied. Then, 0 sx(η) ≤ sx0∩x(η) and sx(η) ≤ sx0∪x(η) for all x ∈ X . By substracting vx\x0 (η) on 0 both sides of the latter inequality, we have sx∩x0 (η) ≤ sx0 (η) for all x ∈ X , and hence x ∈ X(η). The second part of the result is immediate after adding sx∩x0 (η) to both sides of vx\x0 (η) = vx0\x(η), that is, sx(η) = sx0 (η).

0 0 Corollary 4.2. Let η ∈ R and x ∈ X(η), x ∈ X . If x ( x and vx\x0 (η) = 0, then 0 0 0 x ∈ X(η). Analogously, if x ) x and vx0\x(η) = 0, then x ∈ X(η).

18 Lemma 4.3. (a) Let η, η0 ∈ R, η < η0, and x ∈ X(η), x0 ∈ X(η0). Then, 00 00 0 vx0\x(η ) = 0 for all η ∈ [η, η ]. (b) Let η ∈ and x0, x00 ∈ X(η), x ∈ X . If x ∈ S X(η) and x0 ⊇ x ⊇ x00, then R η∈R x ∈ X(η).

(c) Let η, η0 ∈ R, η < η0, and x ∈ X(η), x0 ∈ X(η0). Then, x ∪ x0 ∈ X(η) and x ∩ x0 ∈ X(η0).

Proof. (a) We have (x ∪ x0) \ x = x0 \ x = x0 \ (x ∩ x0). The statement is trivial if 0 0 x \ x = ∅. Otherwise, vx0\x(η) ≥ 0 ≥ vx0\x(η ) by Proposition 4.1, where the statement follows from the monotonicity of the identification function in its first argument.

(b) The statement is trivial if x = x0, x = x00, or x∈ / X(η0) for all η0 6= η. 0 0 0 Therefore, assume x ∈ X(η ), η 6= η. If η < η , then vx\x00 (η) = 0 by part (a). 0 If η < η, then vx0\x(η) = 0 by part (a). In either case, x ∈ X(η) by Corollary 4.2.

0 0 00 (c) We have sx(η) ≤ sx∪x0 (η) and sx0 (η ) ≤ sx∩x0 (η ), and vx0\x(η ) = 0 for all 00 0 0 0 η ∈ [η, η ] by part (a). That means, sx(η) = sx∪x0 (η) and sx0 (η ) = sx∩x0 (η ).

Example 4.4. In the case Z = {z1, . . . , zn}, we choose X as the image of {1, . . . , n+ 1} under the one-to-one mapping

i 7→ {z : z ≥ zi}, implying a total order on Z. In combination with the function ξ : R → X defined by ξ(η) = {z : z ≥ zι(η)}, (12) we can embed the results from Section 3 into the more general setting of a partial order on the covariates.

Equation (12) demonstrates that instead of an increasing function ι: R → {1, . . . , n + 1} such that ι(η) ∈ I(η) for all η ∈ R, we are now interested in a de- creasing function ξ : R → X in the sense that for η0 > η it holds that ξ(η0) ⊆ ξ(η). Furthermore, ξ(η) ∈ X(η) should hold for all η ∈ R. In Proposition 4.5, we show that the functions ξ are in one-to-one correspondence to the solutionsg ˆ of the iso- tonic regression problem at (10), and in Lemma 4.6 we show existence of such a function ξ. The following proposition is analogous to Proposition 3.5 and allows to recover gˆ from ξ.

19 Proposition 4.5. Let ξ : R → X be a decreasing, left-continuous function such that ξ(η) ∈ X(η), where left-continuity means that if ηn ↑ η and z ∈ ξ(ηn), then z ∈ ξ(η). Then, the function gˆ: Z → R given by inf{η : z∈ / ξ(η)} =g ˆ(z) = max{η : z ∈ ξ(η)} (13) is the unique function that satisfies

{z : g(z) ≥ η} = ξ(η) for all η ∈ R, among all increasing functions g : Z → R.

Proof. The left-continuity and monotonicity of ξ : R → X implies equation (13). The monotonicity ofg ˆ follows from the monotonicity of ξ and the fact that ξ takes values being superlevel sets of the partial order on Z. Let η0 ∈ R. Then, (i)g ˆ(z) ≥ η0 =⇒ ξ(ˆg(z)) ⊆ ξ(η0) =⇒ z ∈ ξ(η0). (ii) For any z ∈ ξ(η0) :g ˆ(z) = max{η : z ∈ ξ(η)} ≥ η0.

Therefore, {z :g ˆ(z) ≥ η0} ⊆ {z : z ∈ ξ(η0)} ⊆ {z :g ˆ(z) ≥ η0} where the first inclusion follows by (i) and the second by (ii). Uniqueness follows because any hy- pothetical alternativeg ¯ withg ¯(z0) 6=g ˆ(z0) for some z0 ∈ Z leads to the contradiction ξ(η) = {z :g ¯(z) ≥ η}= 6 {z :g ˆ(z) ≥ η} = ξ(η) for all η betweeng ¯(z0) andg ˆ(z0). The following lemma guarantees the existence of a function ξ as specified in Proposition 4.5.

Lemma 4.6. (a) There exists a decreasing function ξ : Q → X such that ξ(q) ∈ X(q) for all q ∈ Q. (b) Let η ↑ η and x ∈ X(η ), x ⊇ x . Then, x = T x ∈ X(η). n n n n n+1 n∈N n

Proof. (a) Let {qn} = Q be an enumeration of the rationals. We define ξ(qn) inductively. Pick x1 ∈ X(q1) and set ξ(q1) = x1. For n ≥ 2, define

− [ + \ xn = ξ(qi), xn = ξ(qi), i∈{1,...,n−1} i∈{1,...,n−1} qi>qn qi

− if {i : qi > qn} 6= ∅ and {i : qi < qn} 6= ∅. If {i : qi > qn} = ∅, we set xn = ∅, + and if {i : qi < qn} = ∅, we set xn = Z. We choose any xn ∈ X(qn) and set − + ξ(qn) = (xn ∪ xn ) ∩ xn . At each step n, ξ(qn) ∈ X(qn) follows by 4.3 (a), and + − ξ(qn) ⊆ xn . Furthermore, we show by induction that xn ⊆ ξ(qn) for all n. For n = 2, this is easily verified. Suppose the claim holds for n − 1 ≥ 2. If − − + + qn > qn−1, then xn = xn−1 and xn = xn−1 ∩ ξ(qn−1) = ξ(qn−1), hence

− − − xn = xn−1 ⊆ (xn ∪ xn−1) ∩ ξ(qn−1) = ξ(qn).

20 − − + + If qn < qn−1, then xn = xn−1 ∪ ξ(qn−1) = ξ(qn−1) and xn = xn−1, hence

− + xn = ξ(qn−1) ⊆ (xn ∪ ξ(qn−1)) ∩ xn−1 = ξ(qn).

+ In summary, for k < n, if qk < qn, then ξ(qn) ⊆ xn ⊆ ξ(qk), and if qk > qn, − ξ(qk) ⊆ xn ⊆ ξ(qn) showing that ξ is decreasing.

0 0 (b) We have sxn (ηn) ≤ sx (ηn) for all x ∈ X . Furthermore, the definitions of x and V imply 1{z ∈ xn}V (ηn, y) → 1{z ∈ x}V (η, y) pointwise, and we 1 have {z ∈ xn}V (ηn, y) ≤ supn∈N |V (ηn, y)|. By the dominated convergence 0 0 theorem, sxn (ηn) → sx(η) and sx (ηn) → sx (η).

Part (b) of Lemma 4.6 describes a possible completion step for part (a) that also modifies ξ to be left-continuous. In a nutshell, any decreasing ξ0 : Q → X that satisfies ξ0(η0) ∈ X(η0) for all η0 ∈ Q admits a left-continuous version on R, T 0 0 0 0 ξ : η 7→ η0<η ξ (η ) ∈ X(η), where the intersection is over all η ∈ Q, η < η. As η increases, ξ follows one of the totally ordered paths through the lattice. In Figure 4 the direction of movement as η increases is illustrated by arrows. In order to prove the existence of a function ξ (and thusg ˆ) that solves the isotonic regression problem, we need that X is closed under union and intersection. This property is essential for Lemma 4.6. We could also start with a set X of subsets of {z1, . . . , zn} that are interpreted as the admissible superlevel sets of the function g that is to be fitted. If X is closed under union and intersection, then X induces a partial order on {z1, . . . , zn} by Birkhoff’s Representation Theorem; see for example Gurney and Griffin (2011). Consequently, the optimal functiong ˆ always exists and is increasing. Starting with X , one could formulate constraints other than isotonicity on g as long as they can be formulated in terms of restrictions on admissible superlevel sets. Examples are unimodality or quasi-convexity. Generally, there is no solution that is simultaneously optimal with respect to all elementary loss functions; see Section 5 for examples in the case of a unimodality constraint. The following proposition generalizes Proposition 3.3 and is essential to provide − min-max bounds on solutions to the isotonic regression problem. We write Tx = T − = inf T (P ) and T + = T + = sup T (P ). Px x x Px x Proposition 4.7. Let η ∈ R, x ∈ X(η). Then, subject to x0 ∈ X ,

− + max Tx0\x ≤ η ≤ min Tx\x0 , x0)x x0(x + − max Tx0\x < η ≤ min Tx\x0 . x0)x,x0∈/X(η) x0(x,x0∈/X(η)

0 0 Proof. For all x ) x, we have vx0\x(η) ≥ 0. For all x ( x, we have vx\x0 (η) ≤ 0. If x0 ∈/ X(η), then both inequalities are strict. Corollary 2.4 implies the result. As a generalization of Proposition 3.6, we obtain the following min-max bounds.

21 z , z , z , z { 1 2 3 4}

z , z , z { 2 3 4}

z , z z , z { 3 4} { 2 4}

z z { 4} { 2}

Figure 4: Moving through X . The possible paths through X based on a specific partial order on Z = {z1, z2, z3, z4} are illustrated. The arrows indicate the direction of moving through the lattice X as η increases

Proposition 4.8. Let z ∈ Z and let gˆ be a solution to the isotonic regression problem. Then, subject to x, x0 ∈ X , − + min max Tx\x0 ≤ gˆ(z) ≤ max min Tx\x0 . x0:z∈ /x0 x)x0 x:z∈x x0(x Proof. Applying the first set of bounds from Proposition 4.7 to the formula forg ˆ at (13), we obtain − + inf max Tx\ξ(η) ≤ gˆ(z) ≤ max min Tξ(η)\x0 . η:z∈ /ξ(η) x)ξ(η) η:z∈ξ(η) x0(ξ(η) − The lower bound is bounded from below by minx0:z∈ /x0 maxx)x0 Tx\x0 , and the upper + bound is bounded from above by maxx:z∈x minx0(x Tx\x0 . In the case of partial orders on the covariates, it is also possible to define minimal and maximal solutions. Recall that, analogously to I(η), we defined X(η) as the set of superlevel sets x ∈ X minimizing sx(η) at (11). Now, let − 0 0 X (η) = {x ∈ X(η): @ x ∈ X(η) such that x ( x}, + 0 0 X (η) = {x ∈ X(η): @ x ∈ X(η) such that x ) x} denote the sets of minimal and maximal elements of X(η), respectively. In order to prove an analogous statement to Proposition 3.7, we need the following lemma on a modified max-min inequality. Lemma 4.9. Suppose that T is of singleton type. Let z ∈ Z be such that P ({z} × R) > 0. Then, subject to x, x0 ∈ X , + − max min Tx\x0 ≤ min max Tx\x0 . x:z∈x x0(x x0:z∈ /x0 x)x0

22 Proof. Let x00 ∈ X such that z∈ / x00, then

+ + − max min T 0 = max min T 0 = max min T 0 0 x\x 0 x\x 0 x\x x:z∈x x (x x:z∈x x (x x:z∈x x (x 0 0 P ((x\x )×R)>0 P ((x\x )×R)>0 − − − ≤ max Tx\(x∩x00) = max T(x∪x00)\x00 ≤ max Tx\x00 , x:z∈x x:z∈x x:x)x00 where the last inequality holds because x∪x00 ∈ X and if z ∈ x then x∪x00 ) x00. Proposition 4.10. Let z ∈ Z be such that P ({z} × R) > 0, and let ξ : R → X be decreasing and left-continuous.

(a) If ξ(η) ∈ X+(η) for all η ∈ R, then, subject to x, x0 ∈ X ,

+ + gˆ(z) = min max Tx\x0 = max min Tx\x0 . x0:z∈ /x0 x)x0 x:z∈x x0(x

(b) If ξ(η) ∈ X−(η) for all η ∈ R, then, subject to x, x0 ∈ X ,

− − gˆ(z) = min max Tx\x0 = max min Tx\x0 . x0:z∈ /x0 x)x0 x:z∈x x0(x

Proof. The proof follows using Lemma 4.9 and applying the same steps as in the proof of Proposition 4.8 to the second set of bounds in Proposition 4.7. As in Section 3, we denote the solution in part (a) of Proposition 4.10 by g+ and the one in part (b) by g−. Combining Propositions 4.10 to 4.15 and Corollary 4.12, gives a complete characterizations of all possible solutions to the isotonic regression problem for partial orders. For the following results, it is not required that g−, g+ are the solutions from Proposition 4.10. Unless specified, they do not even need to satisfy g− ≤ g+ everywhere. We define ξ− : η 7→ {z : g−(z) ≥ η} and ξ+ analogously. Proposition 4.11. Let g− and g+ be two solutions to the isotonic regression prob- lem such that g− ≤ g+. Let gˆ be isotonic, g− ≤ gˆ ≤ g+, and suppose that all superlevel sets of gˆ lie in S X(η). Then, gˆ is a solution to the isotonic regres- η∈R sion problem.

Proof. For η ∈ R define ξ(η) = {z :g ˆ(z) ≥ η}. The functions ξ, ξ−, ξ+ are decreas- ing, that is ξ(η) ⊇ ξ(η0) for η ≤ η0, and left-continuous. For ξ−, ξ+ it holds that ξ−(η), ξ+(η) ∈ X(η). Since, for all z ∈ Z, it holds that

g−(z) = max{η : z ∈ ξ−(η)} ≤ g(z) = max{η : z ∈ ξ(η)} ≤ g+(z) = max{η : z ∈ ξ+(η)}, we obtain ξ−(η) ⊆ ξ(η) ⊆ ξ+(η) for all η ∈ R. Lemma 4.3 (b) implies the result. The following corollary is an immediate consequence of Lemma 4.3 (c).

23 Corollary 4.12. Let g− and g+ be two solutions to the isotonic regression problem. Then, the distributive lattice generated by ξ− and ξ+ is a subset of S X(η). η∈R Having two solutions g− and g+ allows us to find all solutions to the isotonic regression problem with superlevel sets that lie in the lattice generated by ξ− and ξ+. Examples include solutions that transition from g− to g+ at a particular threshold η, ( g+(z), z ∈ ξ+(η), gˆ(z) = g−(z), otherwise, or pointwise convex combinations of solutions with α ∈ (0, 1),

gˆ(z) = αg−(z) + (1 − α)g+(z).

In order to refine the lattice of minimizing upper sets from Corollary 4.12 with the purpose to characterize all solutions, we pose the question whether simple sep- aration rules exist for the set difference of consecutive lattice elements. These sets necessarily take the form of the intersection of a level set of g− and a level set of g+, that is, sets of the form {z : g−(z) = η− and g+(z) = η+}. These rules do exist as we show in Propositions 4.14 and 4.15. First, we introduce the notion of a separation.

Definition 4.13. A separation of a set Z ∈ P(Z) is a collection of sets Z1,...,Zn ⊆ Sn Z that are pairwise separated and satisfy Z = i=1 Zi. Two sets Zi and Zj are 0 00 separated with respect to Z if for all z ∈ Zi and z ∈ Zj, there does not exist a 0 00 finite sequence (zk)k=1,...,m, zk ∈ Z, z1 = z , zm = z that for all k = 1, . . . , m − 1 satisfies zk  zk+1 or zk+1  zk. Proposition 4.14. Let g− and g+ be two solutions to the isotonic regression prob- lem, and let η−, η+ ∈ R, η− < η+, be such that Z = {z : g−(z) = η− and g+(z) = + 0 η } is nonempty. Furthermore, let Z1,...,Zn be a separation of Z, and let x = − − + + 00 0 00 − + ξ (η ) ∩ ξ (η ) and x = x \ Z. Then, x ∪ Zk ∈ X(η) for all η ∈ (η , η ], k = 1, . . . , n.

Proof. Without loss of generality, we show the claim for k = 1. By Lemma 4.3 (c), 0 + 00 − − + + − we have x ∈ X(η ) and x = ξ (η + 1) ∪ ξ (η + 2) ∈ X(η + 1) for some 0 00 − + 1, 2 > 0. More precisely, we have x , x ∈ X(η) for all η ∈ (η , η ] by Lemma 4.3 (b), since ξ−(η) ⊆ x00 ⊆ x0 ⊆ ξ+(η), η ∈ (η−, η+]. 00 0 Let x1 = x ∪ Z1 and x2 = x \ Z1 both of which are upper sets in X . Then 00 0 − + Z1 = x1 \x but also Z1 = x \x2. Therefore, vZ1 (η) ≥ 0 ≥ vZ1 (η) for all η ∈ (η , η ] by Proposition 4.1. Then the statement follows from Corollary 4.2. Proposition 4.14 allows us to find additional solutions to the isotonic regression problem with superlevel sets where separation elements have been added to known

24 minimizing superlevel sets. Using the variables defined in Proposition 4.14, one example of a new solution is  η, z ∈ Z ,  1 gˆ(z) = g+(z), z ∈ x00, g−(z), otherwise, where η ∈ (η−, η+]. Iterative application of Proposition 4.14 recovers all minimizing superlevel sets that can be obtained from the solutions in Proposition 4.10 via Corollary 4.12 and the information on the partially ordered set Z. Proposition 4.15 allows us to recover the remaining minimizing superlevel sets when the distribution P of the random vector (Z,Y ) is fully known. In fact, this proposition is a generalization of Proposition 4.14 that determines whether a level set intersection of g− and g+ can be split further by calculating values of the lower bound of the functional T .

Proposition 4.15. Let g− and g+ be two solutions to the isotonic regression prob- lem, and let η−, η+ ∈ R, η− < η+, be such that Z = {z : g−(z) = η− and g+(z) = η+} is nonempty. Furthermore, let x0 = ξ−(η−)∩ξ+(η+) and x00 = x0\Z. For x ∈ X , 0 00 − − − + x ) x ) x , we have Tx0\x ≤ η if and only if x ∈ X(η) for all η ∈ (η , η ]. Proof. We have x0, x00 ∈ X(η) for all η ∈ (η−, η+] as in the proof of Proposition 4.14. + 0 + + Then, vx0\k(η ) ≤ 0 for all k ∈ X , k ( x , by Proposition 4.1, and hence Tx0\k ≥ η 00 − + by Corollary 2.4. Analogously, vk\x00 (η) ≥ 0 for all k ∈ X , k ) x , η ∈ (η , η ], − − leading to Tk\x00 ≤ η . For the first part of the statement, let x ∈ X , x0 ) x ) x00, be such that − − − + Tx0\x ≤ η . We show that x ∈ X(η) for all η ∈ (η , η ] using Proposition 4.1. + − + − − We have Tx0\k ≤ max{Tx0\x,Tx\k} for all k ( x by Lemma 2.5. Since Tx0\x ≤ η + + + + by assumption and as just shown Tx0\k ≥ η , we obtain Tx\k ≥ η . By Corollary + 2.4, vx\k(η) ≤ 0 for all k ( x, η ≤ η , that is, the first inequality in Proposition − + − − + 4.1 holds for all η ∈ (η , η ]. Similarly, Tk\x00 ≥ min{Tk\x,Tx\x00 } for all k ) x. − − + + − − Since Tk\x00 ≤ η and Tx\x00 ≥ η , we obtain Tk\x ≤ η . Therefore, vk\x(η) ≥ 0, for all η > η−, k ) x, that is, the second inequality in Proposition 4.1 holds for all η ∈ (η−, η+]. To prove the converse, note that x ∈ X(η) for all η ∈ (η−, η+] implies that − + vk\x(η) ≥ 0 for all η ∈ (η , η ], k ) x. Hence, in particular, vx0\x(η) ≥ 0 and − − + − − Tx0\x ≤ η for all η ∈ (η , η ], and, therefore Tx0\x ≤ η .

4.1 Partitioning the covariate set In Section 3.2, we discussed how the PAV algorithm creates a partition of Z, and that it leads to a solutiong ˆ of the isotonic regression problem in the context of total

25 orders. In this section, we show how a solution to the isotonic regression problem leads to a corresponding partition Q of Z, such that the solution satisfies

gˆ(z) ∈ T (PQ), for all Q ∈ Q, z ∈ Q, and the solution is constant on every element of the partition. Let T be a functional of singleton type, andg ˆ be a solution to the isotonic regression problem. Subject to x, x0, k, k0 ∈ X , the combination of Proposition 4.8 and Lemma 4.9 yields

+ gˆ(z) = max min Tx\x0 (14) x:z∈x x0(x − = min max Tk\k0 . (15) k0:z∈ /k0 k)k0 for all z ∈ Z with P ({z} × R) > 0. We call (x, x0) a max-min pair for z if z ∈ x, 0 + 0 0 x ( x, andg ˆ(z) = Tx\x0 , and we call (k , k) a min-max pair for z if z∈ / k , 0 − 0 − + k ) k , andg ˆ(z) = Tk\k0 . For a pair x, x ∈ X such that Tx\x0 = Tx\x0 , we also ± use the notation Tx\x0 . Note that for a functional T of singleton type, we have ± 0 T (Px\x0 ) = {Tx\x0 } if P ((x \ x ) × R) > 0. The following lemma provides the necessary tools to construct the partition Q.

Lemma 4.16. Let T be a functional of singleton type, and gˆ be a solution to the isotonic regression problem. Furthermore, let z ∈ Z such that P ({z} × R) > 0, and 0 0 0 0 let (x1, x1), (x2, x2) be max-min pairs for z, and (k1, k1), (k2, k2) be min-max pairs for z. Then the following statements hold:

± ± ± (a) We have that gˆ(z) = T 0 = T 0 = T 0 0 . x1\k1 (x1∪x2)\k1 x1\(k1∩k2)

0 0 ± 0 (b) If x, k ∈ X such that z ∈ x, z∈ / k , and gˆ(z) = Tx\k0 , then (x, x ∩ k ) is a max-min pair for z and (k0, k0 ∪ x) is a min-max pair for z.

0 0 0 (c) If z˜ ∈ x1 \ k1, then (x1, x1) is a max-min pair for z˜ and (k1, k1) is a min-max pair for z˜.

+ + + Proof. We repeatedly use the inequalitiesg ˆ(z) = T 0 = minx0∈X T 0 ≤ T 0 x1\x1 x1\x x1\k − − − 0 andg ˆ(z) = T 0 = maxk∈X T 0 ≥ T 0 for all x, k ∈ X , where the second k1\k1 k\k1 x\k1 + − equality holds because TP = ∞ and TP = −∞ for null measures P . Furthermore, 0 by assumption, T (Px\k0 ) is a singleton if P ((x \ k ) × R) > 0, and therefore T (Px\k0 ) is a singleton if z ∈ x and z∈ / k0.

0 0 ± (a) Clearly, z ∈ x1, z ∈ x2, z∈ / k1, and z∈ / k2. Hence,g ˆ(z) ≤ T 0 ≤ gˆ(z) x1\k1 + + implies the first statement. Furthermore,g ˆ(z) ≤ T 0 = T 0 , x2\(x1∪k1) (x2\x1)\k1 − + ± and henceg ˆ(z) = min{T 0 ,T 0 } ≤ T 0 ≤ gˆ(z) confirms the x1\k1 (x2\x1)\k1 (x1∪x2)\k1 second statement using Lemma 2.5. Similarly, for the third statement,g ˆ(z) ≤ ± + − T 0 0 ≤ max{T 0 ,T 0 0 } =g ˆ(z). x1\(k1∩k2) x1\k1 (x1∩k1)\k2

26 − + 0 0 0 (b) The statement follows immediately from Tx\k0 = Tx\k0 ,(x ∪ k ) \ k = x \ k = x \ (x ∩ k0), and the definition of max-min and min-max pairs.

0 0 (c) Let (xz˜, xz˜) be a max-min pair forz ˜ and (kz˜, kz˜) be a min-max pair forz ˜. ± ± Then the statement follows fromg ˆ(z) ≤ T 0 ≤ gˆ(˜z) ≤ T 0 ≤ gˆ(z). x1\kz˜ xz˜\k1

Proposition 4.17. Let T be a functional of singleton type. Then there exists a partition Q of Z such that gˆ is constant on every element of the partition almost everywhere and gˆ(z) ∈ T (PQ) for all Q ∈ Q, z ∈ Q such that P ({z} × R) > 0.

Proof. Letx ¯z denote the union of the first components of all max-min pairs for ¯0 z ∈ Z, and let kz denote the intersection of the first components of all min-max ± pairs for z ∈ Z. By Lemma 4.16 (a), we haveg ˆ(z) = T ¯0 . We now show that the x¯z\kz ¯0 S collection Q of sets Qz =x ¯z \ kz is a partition of Z. First, we have z∈Z Qz = Z, ¯0 since z ∈ x¯z and z∈ / kz for all z ∈ Z. Second, by Lemma 4.16 (b), we have that ¯0 ¯0 ¯0 (¯xz, x¯z ∩ kz) is a max-min pair for z and (kz, kz ∪ x¯z) is a min-max pair for z. Then, ¯0 ¯0 by Lemma 4.16 (c), we havex ¯z ⊂ x¯z˜ and kz ⊃ kz˜ for allz ˜ ∈ Qz, i.e., Qz ⊂ Qz˜ and in particular z ∈ Qz˜. Swapping the roles of z andz ˜ gives Qz˜ ⊂ Qz. Therefore, Qz = Qz˜ for all z ∈ Z, z˜ ∈ Qz. When T is a functional of interval type, we therefore obtain a partition for every fixed convex combination of its lower bound T − and its upper bound T +.

5 Unimodal Regression

It is astonishing that in isotonic regression, solutions are simultaneously optimal for all loss functions in the class S which exhausts all consistent loss functions for the functional T in many relevant examples. One might wonder whether this is still fulfilled for slightly adapted shape constraints. Unimodality is a shape constraint closely related to isotonicity. One estimation procedure is to take a between two consecutive observations and then split the data set in two. On the subset preceding the mode an isotonic regression is performed and on the data following the mode an antitonic regression is performed. This procedure is then repeated for any possible choice of mode, as illustrated in Figure 5. Finally the optimal function is chosen by selecting the one with minimal loss. The reason for the mode to be chosen outside of {z1, . . . , zn} is to avoid ambiguity. If the mode is fixed on observation zi, 1 < i < n, then the isotonic regression on {z1, . . . , zi} and the antitonic regression on {zi, . . . , zn} might yield two different values forg ˆ(zi). Fixing mode mi and applying our method to {z1, . . . , zi−1} and {zi, . . . , zn} with isotonicity and antitonicity, respectively, as shape constraints yields a function gˆi : {z1, . . . , zn} → R that is optimal for any consistent loss function for functional T . The question arises whether there is one mode mi such that the corresponding

27 η

y4 ●●

y1 = y2 ●● ●●

● ● ●

● ●

^ = ^ = ^ g1 g2 g3 ^ = ^ y3 ● g4 g5

m1 z1 m2 z2 m3 z3 m4 z4 m5

Figure 5: Possible Modes. For a sample of 4 data points, the five possible choices m1, . . . , m5 for the mode, and the corresponding subdivision into isotonic and antitonic part for the functionsg ˆ1,..., gˆ5 are marked

gˆi dominates all other functionsg ˆj, j 6= i. It turns out that this is generally not the case. To give an example, we consider four observations (z1, y1),..., (z4, y4) with z1 < ··· < z4 and (y1, . . . y4) = (9, 9, 0, 10), and let P denote the corresponding empirical distribution. We choose the expectation functional as the regression tar- get, and consider modes m1, . . . , m5 with m1 < z1 < m2 < z2 < ··· < z4 < m5.

For modes m1 and m3, the unimodal approach yields the partitions Qm1 = Qm3 = {{z1, z2}, {z3, z4}} using the PAV algorithm, and for mode m2, we obtain the parti- tion Qm2 = {{z1}, {z2}, {z3, z4}}. But for this specific data example all three modes yield the same function i.e.,g ˆ1 =g ˆ2 =g ˆ3. For modes m4 and m5, we obtain the partitions Qm4 = Qm5 = {{z1, z2, z3}, {z4}} and thereforeg ˆ4 =g ˆ5. The functions gˆ1,..., gˆ5 are illustrated in Figure 6. Solutiong ˆi, 1 ≤ i ≤ 5, dominates all otherg ˆj, j 6= i, if

EP Sη(ˆgi(Z),Y ) ≤ EP Sη(ˆgj(Z),Y ) for all η ∈ R, j 6= i. It can be seen that this condition is not fulfilled by plotting the expected elementary scores forg ˆ1,..., gˆ5; see Figure 7. This visual method of comparing forecasts is called a Murphy diagram and was introduced by Ehm et al. (2016). Hence, in unimodal regression there is not necessarily a solutiong ˆi that simul- taneously minimizes all consistent loss functions for a functional T . This agrees with our findings in Section 4 because the set X is not closed under union and intersection. Indeed, it holds that {z1}, {z4} ∈ X but {z1, z4} ∈/ X . Therefore, the existence of a decreasing function ξ : R → X is not guaranteed.

28 Isotonic Antitonic

^ g5

^ g4

^ g3 ● ● ● ● ^ g2

^ g1

m1 z1 m2 z2 m3 z3 m4 z4 m5

Figure 6: Counterexample. For the specific sample of 4 data points (black), the five possible choices m1, . . . , m5 for the mode, and the resulting functionsg ˆ1,..., gˆ5 are illustrated

^ = ^ = ^ ^ = ^ 1 g1 g2 g3 g4 g5

0.8

0.6

0.4

0.2

0 η

Figure 7: Murphy Diagram. The Murphy diagram comparing the expected elementary scores ofg ˆ1,..., gˆ5 for η ∈ R given realizations (y1, . . . , y4) = (9, 9, 0, 10)

29 Acknowledgements We would like to thank Tilmann Gneiting, Alexandre M¨oschung and Lutz D¨umbgen for inspiring discussions and valuable comments. Anja M¨uhlemann and Johanna F. Ziegel gratefully acknowledge financial support from the Swiss National Science Foundation.

References

Ayer, M., Brunk, H. D., Ewing, G. M., Reid, W. T. and Silverman, E. (1955). An empirical distribution function for with incomplete information. Ann. Math. Statist., 26, 641–647.

Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Under Order Restrictions. Wiley, London.

Bartholomew, D. J. (1959a). A test of homogeneity for ordered alternatives. Biometrika, 46, 36–48.

Bartholomew, D. J. (1959b). A test of homogeneity for ordered alternatives. II. Biometrika, 46, 328–335.

Bellec, P. C. (2018). Sharp oracle inequalities for least squares estimators in shape restricted regression. Ann. Statist., 46, 745–780.

Br¨ummer,N. and Du Preez, J. (2013). The PAV algorithm optimizes binary proper scoring rules. arXiv:1304.2331.

Brunk, H. D. (1955). Maximum likelihood estimates of monotone parameters. Ann. Math. Statist., 26, 607–616.

Dawid, A. P. (2016). Contribution to the discussion of “Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast ” by Ehm, W., Gneiting, T., Jordan, A. and Kr¨uger,F. J. R. Stat. Soc. Ser. B. Stat. Methodol., 78, 505–562.

Ehm, W., Gneiting, T., Jordan, A. and Kr¨uger,F. (2016). Of quantiles and expec- tiles: Consistent scoring functions, Choquet representations and forecast rank- ings. J. R. Stat. Soc. Ser. B. Stat. Methodol., 78, 505–562.

Gneiting, T. (2011). Making and evaluating point forecasts. J. Amer. Statist. Assoc., 106, 746–762.

Groeneboom, P. and Jongbloed, G. (2014). Nonparametric estimation under shape constraints. Cambridge University Press, New York.

Guntuboyina, A. and Sen, B. (2018). Nonparametric shape-restricted regression. Statist. Sci., 33, 568–594.

30 Gurney, A. J. T. and Griffin, T. G. (2011). Pathfinding through congruences. In Relational and Algebraic Methods in Computer Science, vol. 6663. Springer, Heidelberg, 180–195.

Han, Q., Wang, T., Chatterjee, S. and Samworth, R. J. (2019). Isotonic regression in general dimensions. Ann. Statist., 47, 2440–2471.

Huber, P. J. (1964). Robust estimation of a . Ann. Math. Statist., 35, 73–101.

Kyng, R., Rao, A. and Sachdeva, S. (2015). Fast, provable algorithms for isotonic regression in all Lp-norms. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc., Red Hook, 2719–2727.

Luss, R. and Rosset, S. (2017). Bounded isotonic regression. Electron. J. Stat., 11, 4488–4514.

Miles, R. E. (1959). The complete amalgamation into blocks, by weighted means, of a finite set of real numbers. Biometrika, 46, 317–327.

M¨osching, A. and D¨umbgen, L. (2020). Monotone least squares and isotonic quan- tiles. Electron. J. Stat., 14, 24–49.

Newey, W. K. and Powell, J. L. (1987). Asymmetric least squares estimation and testing. Econometrica, 55, 819–847.

Patton, A. J. (2011). Volatility forecast comparison using imperfect volatility prox- ies. J. , 160, 246–256.

Patton, A. J. (2019). Comparing possibly misspecified forecasts. J. Bus. Econom. Statist. published online.

Polonik, W. (1998). The silhouette, concentration functions and ML-density esti- mation under order restrictions. Ann. Statist., 26, 1857–1877.

Robertson, T. and Wright, F. T. (1973). Multiple isotonic median regression. Ann. Statist., 1, 422–432.

Robertson, T. and Wright, F. T. (1980). Algorithms in order restricted statistical inference and the Cauchy mean value property. Ann. Statist., 8, 645–651.

Savage, L. J. (1971). Elicitation of personal probabilities and expectations. J. Amer. Statist. Assoc., 66, 783–801.

Stout, Q. F. (2015). Isotonic regression for multiple independent variables. Algo- rithmica, 71, 450–470.

31 van Eeden, C. (1958). Testing and Estimating Ordered Parameters of Probability Distributions. Mathematical Centre, Amsterdam.

Ziegel, J. F. (2016). Contribution to the discussion of “Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings” by Ehm, W., Gneiting, T., Jordan, A. and Kr¨uger,F. J. R. Stat. Soc. Ser. B. Stat. Methodol., 78, 505–562.

32