Fast Algorithms for Segmented Regression

Jayadev Acharya [email protected] Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Ilias Diakonikolas [email protected] University of Southern California, Los Angeles, CA 90089, USA

Jerry Li [email protected] Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Ludwig Schmidt∗ [email protected] Massachusetts Institute of Technology, Cambridge, MA 02139, USA

∗Authors ordered alphabetically.

Abstract

We study the fixed design segmented regression problem: Given noisy samples from a piecewise linear function f, we want to recover f up to a desired accuracy in mean-squared error. Previous rigorous approaches for this problem rely on dynamic programming (DP) and, while sample efficient, have running time quadratic in the sample size. As our main contribution, we provide new sample near-linear time algorithms for the problem that – while not being minimax optimal – achieve a significantly better sample-time tradeoff on large datasets compared to the DP approach. Our experimental evaluation shows that, compared with the DP approach, our algorithms provide a convergence rate that is only off by a factor of 2 to 4, while achieving speedups of three orders of magnitude.

1. Introduction

We study the regression problem – a fundamental inference task with numerous applications that has received tremendous attention in machine learning and statistics during the past fifty years (see, e.g., (Mosteller & Tukey, 1977) for a classical textbook). Roughly speaking, in a (fixed design) regression problem, we are given a set of n observations (x_i, y_i), where the y_i's are the dependent variables and the x_i's are the independent variables, and our goal is to model the relationship between them. The typical assumptions are that (i) there exists a simple function f that (approximately) models the underlying relation, and (ii) the dependent observations are corrupted by random noise. More specifically, we assume that there exists a family of functions F such that for some f ∈ F the following holds: y_i = f(x_i) + ε_i, where the ε_i's are i.i.d. random variables drawn from a "tame" distribution such as a Gaussian (later, we also consider model misspecification).

Throughout this paper, we consider the classical notion of Mean Squared Error (MSE) to measure the performance (risk) of an estimator. As expected, the minimax risk depends on the family F that f comes from. The natural case that f is linear is fully understood: It is well-known that the least-squares estimator is statistically efficient and runs in sample-linear time. The more general case that f is non-linear, but satisfies some well-defined structural constraint, has been extensively studied in a variety of contexts (see, e.g., (Gallant & Fuller, 1973; Feder, 1975; Friedman, 1991; Bai & Perron, 1998; Yamamoto & Perron, 2013; Kyng et al., 2015; Avron et al., 2013; Meyer, 2008; Chatterjee et al., 2015)). In contrast to the linear case, this more general setting is not well-understood from an information-theoretic and/or computational aspect.

In this paper, we focus on the case that the function f is promised to be piecewise linear with a given number k of unknown pieces (segments). This is known as fixed design segmented regression, and has received considerable attention in the statistics community (Gallant & Fuller, 1973; Feder, 1975; Bai & Perron, 1998; Yamamoto & Perron, 2013). The special case of piecewise polynomial functions (splines) has been extensively used in the context of inference, including density estimation and regression; see, e.g., (Wegman & Wright, 1983; Friedman, 1991; Stone, 1994; Stone et al., 1997; Meyer, 2008).

Information-theoretic aspects of the segmented regression problem are well-understood: Roughly speaking, the minimax risk is inversely proportional to the number of samples. In contrast, the computational complexity of the problem is poorly understood: Prior to our work, known algorithms for this problem with provable guarantees were quite slow. Our main contribution is a set of new provably fast algorithms that outperform previous approaches both in theory and in practice. Our algorithms run in time that is nearly-linear in the number of data points n and the number of intervals k. Their computational efficiency is established both theoretically and experimentally. We also emphasize that our algorithms are robust to model misspecification, i.e., they perform well even if the function f is only approximately piecewise linear.

Note that if the segments of f were known a priori, the segmented regression problem could be immediately reduced to k independent linear regression problems. Roughly speaking, in the general case (where the location of the segment boundaries is unknown) one needs to "discover" the right segments using information provided by the samples. To address this algorithmic problem, previous works (Bai & Perron, 1998; Yamamoto & Perron, 2013) relied on dynamic programming that, while being statistically efficient, is computationally quite slow: its running time scales at least quadratically with the size n of the data, hence it is rather impractical for large datasets.

Our main motivation comes from the availability of large datasets that has made computational efficiency the main bottleneck in many cases. In the words of (Jordan, 2013): "As data grows, it may be beneficial to consider faster inferential algorithms, because the increasing statistical strength of the data can compensate for the poor algorithmic quality." Hence, it is sometimes advantageous to sacrifice statistical efficiency in order to achieve faster running times because we can then achieve the desired error guarantee faster (provided more samples). In our context, instead of using a slow dynamic program, we employ a subtle iterative greedy approach that runs in sample-linear time.

Our iterative greedy approach builds on the work of (Acharya et al., 2015a;b), but the details of our algorithms here and their analysis are substantially different. In particular, as we explain in the body of the paper, the natural adaptation of their analysis to our setting fails to provide any meaningful statistical guarantees. To obtain our results, we introduce novel algorithmic ideas and carefully combine them with additional probabilistic arguments.

2. Preliminaries

In this paper, we study the problem of fixed design segmented regression. We are given samples x_i ∈ R^d for i ∈ [n] (= {1, . . . , n}), and we consider the following classical regression model:
\[ y_i = f(x_i) + \epsilon_i . \qquad (1) \]
Here, the ε_i are i.i.d. sub-Gaussian noise variables with variance proxy σ², mean E[ε_i] = 0, and variance s² = E[ε_i²] for all i. We will let ε = (ε_1, . . . , ε_n) denote the vector of noise variables. We also assume that f : R^d → R is a k-piecewise linear function. Formally, this means:

Definition 1. The function f : R^d → R is a k-piecewise linear function if there exists a partition of the real line into k disjoint intervals I_1, . . . , I_k, k corresponding parameters θ_1, . . . , θ_k ∈ R^d, and a fixed, known j such that for all x = (x_1, . . . , x_d) ∈ R^d we have that f(x) = ⟨θ_i, x⟩ if x_j ∈ I_i. Let L_{k,j} denote the space of k-piecewise linear functions with partition defined on coordinate j.

Moreover, we say f is flat on an interval I ⊆ R if I ⊆ I_i for some i = 1, . . . , k; otherwise, we say that f has a jump on the interval I.

In the full paper, we also discuss the agnostic setting where the ground truth f is not piecewise linear itself but only well-approximated by a k-piecewise linear function. For simplicity of exposition, we assume that the partition coordinate j is 1 in the rest of the paper. We remark that this model also contains the problem of (fixed design) piecewise polynomial regression as an important subcase (see the full paper for details).

Following this generative model, a regression algorithm receives the n pairs (x_i, y_i) as input. The goal of the algorithm is then to produce an estimate f̂ that is close to the true, unknown f with high probability over the noise terms ε_i and any randomness in the algorithm. We measure the distance between our estimate f̂ and the unknown function f with the classical mean-squared error:
\[ \mathrm{MSE}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i) - \hat{f}(x_i) \big)^2 . \]

Throughout this paper, we let X ∈ R^{n×d} be the data matrix, i.e., the matrix whose j-th row is x_j^T for every j, and we let r denote the rank of X.

The following notation will also be useful. For any function f : R^d → R, we let f ∈ R^n denote the vector with components f_i = f(x_i) for i ∈ [n]. For any interval I, we let X_I denote the data matrix consisting of all data points x_i for i ∈ I, and for any vector v ∈ R^n, we let v_I ∈ R^{|I|} be the vector of v_i for i ∈ I.
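For concreteness, the following minimal Julia sketch (Julia is also the language we use for the experiments in Section 4) instantiates the generative model (1) in the simple special case d = 1 with a piecewise constant f, and evaluates the mean-squared error of an estimate. The function names and the choice of design points are illustrative only and are not part of the algorithms analyzed below.

    # Generate n samples from the model y_i = f(x_i) + eps_i, where f is a
    # k-piecewise constant function on [0, 1] (a simple special case of
    # Definition 1 with d = 1) and eps_i is Gaussian noise with variance s^2.
    function generate_data(n::Int, k::Int; s::Float64 = 1.0)
        x = sort(rand(n))                      # fixed design points in [0, 1]
        levels = randn(k)                      # one parameter per segment
        boundaries = collect(range(0, 1; length = k + 1))
        f = [levels[min(searchsortedlast(boundaries, xi), k)] for xi in x]
        y = f .+ s .* randn(n)                 # noisy observations
        return x, y, f
    end

    # Mean-squared error between the true values f(x_i) and an estimate fhat(x_i).
    mse(f::Vector{Float64}, fhat::Vector{Float64}) = sum(abs2, f .- fhat) / length(f)

For example, x, y, f = generate_data(1000, 10) draws n = 1000 noisy samples from a random 10-piece function, and mse(f, fhat) evaluates any estimate f̂ on the same design points.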

2.1. Our Contributions

Our main contributions are new, fast algorithms for the aforementioned segmented regression problem. We now informally state our main results and refer to later sections for more precise theorems.

Theorem 2 (informal statement of Theorems 13 and 14). There is an algorithm GREEDYMERGE, which, given X (of rank r), y, a target number of pieces k, and the variance of the noise s², runs in time O(nd² log n) and outputs an O(k)-piecewise linear function f̂ so that with probability at least 0.99, we have
\[ \mathrm{MSE}(\hat{f}) \le O\!\left( \sigma^2 \frac{kr}{n} + \sigma \sqrt{\frac{k}{n}} \log n \right) . \]

That is, our algorithm runs in time which is nearly linear in n and still achieves a reasonable rate of convergence. While this rate is asymptotically slower than the rate of the dynamic programming estimator, our algorithm is significantly faster than the DP, so that in order to achieve a comparable MSE, our algorithm takes less total time given access to a sufficient number of samples. For more details on this comparison and an experimental evaluation, see Sections 2.2 and 4.

At a high level, our algorithm first forms a fine partition of [n] and then iteratively merges pieces of this partition until only O(k) pieces are left. In each iteration, the algorithm reduces the number of pieces in the following manner: we group consecutive intervals into pairs which we call "candidates". For each candidate interval, the algorithm computes an error term that is the error of a least squares fit combined with a regularizer depending on the variance of the noise s². The algorithm then finds the O(k) candidates with the largest errors. We do not merge these candidates, but do merge all other candidate intervals. We repeat this process until only O(k) pieces are left.

A drawback of this algorithm is that we need to know the variance of the noise s², or at least have a good estimate. In practice, we might be able to obtain such an estimate, but ideally our algorithm would work without knowing s². By extending our algorithm, we obtain the following result:

Theorem 3 (informal). There is an algorithm BUCKETGREEDYMERGE, which, given X (of rank r), y, and a target number of pieces k, runs in time O(nd² log n) and outputs an O(k log n)-piecewise linear function f̂ so that with probability at least 0.99, we have
\[ \mathrm{MSE}(\hat{f}) \le O\!\left( \sigma^2 \frac{kr \log n}{n} + \sigma \sqrt{\frac{k}{n}} \log n \right) . \]

At a high level, there are two fundamental changes to the algorithm: first, instead of merging with respect to the sum of squared errors of the least squares fit regularized by a term depending on s², we instead merge with respect to the average error the least squares fit incurs within the current interval. The second change is more substantial: instead of finding the O(k) candidates with largest error and merging the rest, we now split the candidates into log n buckets based on the lengths of the candidate intervals. In this scheme, bucket α contains all candidates with length between 2^α and 2^{α+1}, for α = 0, . . . , log n − 1. Then we find the k + 1 candidates with largest error within each bucket and merge the remaining candidate intervals. We continue this process until we are left with O(k log n) pieces.

A potential disadvantage of our algorithms above is that they produce O(k) or O(k log n) pieces, respectively. In order to address this issue, we also provide a postprocessing algorithm that converts the output of any of our algorithms and decreases the number of pieces down to 2k + 1 while preserving the statistical guarantees above. The guarantee of this postprocessing algorithm is as follows.

Theorem 4 (informal). There is an algorithm POSTPROCESSING that takes as input the output of either GREEDYMERGE or BUCKETGREEDYMERGE together with a target number of pieces k, runs in time O(k³ d² log n), and outputs a (2k + 1)-piecewise linear function f̂_p so that with probability at least 0.99, we have
\[ \mathrm{MSE}(\hat{f}_p) \le O\!\left( \sigma^2 \frac{kr}{n} + \sigma \sqrt{\frac{k}{n}} \log n \right) . \]

Qualitatively, an important aspect of this algorithm is that its running time depends only logarithmically on n. In practice, we expect k to be much smaller than n, and hence the running time of this postprocessing step will usually be dominated by the running time of the piecewise linear fitting algorithm run before it.

In this document, we only give the formal pseudo code and proofs for GREEDYMERGE. We refer the reader to the full version for more details.

2.2. Comparison to prior work

Dynamic programming. Previous work on segmented regression with statistical guarantees (Bai & Perron, 1998; Yamamoto & Perron, 2013) relies heavily on dynamic programming-style algorithms to find the k-piecewise linear least-squares estimator. Somewhat surprisingly, we are not aware of any work which explicitly gives the best possible running time and statistical guarantee for this algorithm. For completeness, we prove the following result (Theorem 5), which we believe to be folklore. The techniques used in its proof will also be useful for us later.

Theorem 5 (informal). The exact dynamic program runs in time O(n²(d² + k)) and outputs a k-piecewise linear estimator f̂ so that with probability at least 0.99 we have
\[ \mathrm{MSE}(\hat{f}) \le O\!\left( \sigma^2 \frac{kr}{n} \right) . \]

We now compare our guarantees with those of the DP. The main advantage of our approaches is computational efficiency – our algorithm runs in linear time, while the running time of the DP has a quadratic dependence on n. While our statistical rate of convergence is slower, we are able to achieve the same MSE as the DP in asymptotically less time if enough samples are available.

For instance, suppose we wish to obtain a MSE of η with a k-piecewise linear function, and suppose for simplicity that d = O(1) so that r = O(1) as well. Then Theorem 5 tells us that the DP requires n = k/η samples and runs in time O(k³/η²). On the other hand, Theorem 2 tells us that GREEDYMERGING requires n = Õ(k/η²) samples and thus runs in time Õ(k/η²). For non-trivial values of k, this is a significant improvement in time complexity.

This gap also manifests itself strongly in our experiments (see Section 4). When given the same number of samples, our MSE is a factor of 2-4 times worse than that achieved by the DP, but our algorithm is about 1,000 times faster already for 10⁴ data points. When more samples are available for our algorithm, it achieves the same MSE as the DP about 100 times faster.

Histogram approximation. Our results build upon the techniques of (Acharya et al., 2015a), who consider the problem of histogram approximation. Indeed, without too much work it is possible to convert their algorithm to our regression setting, but the resulting guarantee is not statistically meaningful (see the full version for a more detailed discussion). To overcome this difficulty, we propose a merging algorithm based on a new merging error measure. Utilizing more sophisticated proof techniques than in (Acharya et al., 2015a), we show that our new algorithm has good statistical guarantees in the regression setting.

2.3. Mathematical Preliminaries

Tail bounds. We will require the following tail bound on deviations of squares of sub-Gaussian random variables. We defer the proof to the supplementary material.

Corollary 6. Fix δ > 0 and let ε_1, . . . , ε_n be as in (1). Recall that s² = E[ε_i²]. Then, with probability 1 − δ, we have that simultaneously for all intervals I ⊆ [n] the following inequality holds:
\[ \left| \sum_{i \in I} \epsilon_i^2 - s^2 |I| \right| \le O\big( \sigma^2 \cdot \log(n/\delta) \big) \sqrt{|I|} . \qquad (2) \]

Linear regression. Our analysis builds on the classical results for fixed design linear regression. In linear regression, the generative model is exactly of the form described in (1), except that f is restricted to be a 1-piecewise linear function (as opposed to a k-piecewise linear function), i.e., f(x) = ⟨θ*, x⟩ for some unknown θ*.

The problem of linear regression is very well understood, and the asymptotically best estimator is known to be the least-squares estimator.

Definition 7. Given x_1, . . . , x_n and y_1, . . . , y_n, the least squares estimator f^{LS} is defined to be the linear function which minimizes Σ_{i=1}^n (y_i − f(x_i))² over all linear functions f. For any interval I, we let f^{LS}_I denote the least squares estimator for the data points restricted to I, i.e., for the data pairs {(x_i, y_i)}_{i∈I}. We also let LEASTSQUARES(X, y, I) denote an algorithm which solves linear least squares for these data points, i.e., which outputs the coefficients of the least squares fit for these points.

Following our previous definitions, we let f^{LS} ∈ R^n denote the vector whose i-th coordinate is f^{LS}(x_i), and similarly for any I ⊆ [n] we let f^{LS}_I ∈ R^{|I|} denote the vector whose i-th coordinate for i ∈ I is f^{LS}_I(x_i). The following error rate is known for the least-squares estimator:

Fact 8. Let x_1, . . . , x_n and y_1, . . . , y_n be generated as in (1), where f is a linear function. Let f^{LS}(x) be the least squares estimator. Then,
\[ \mathbb{E}\big[\mathrm{MSE}(f^{LS})\big] = O\!\left( \sigma^2 \frac{r}{n} \right) . \]
Moreover, with probability 1 − δ, we have
\[ \mathrm{MSE}(f^{LS}) = O\!\left( \sigma^2 \frac{r + \log 1/\delta}{n} \right) . \]
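As an illustration of these primitives, the LEASTSQUARES subroutine from Definition 7 can be realized by an ordinary least-squares solve on the rows indexed by an interval. The following hedged Julia sketch (hypothetical function names; it assumes the restricted subproblem is well-posed) also shows the regularized error ‖y_I − X_I θ‖² − s²|I| that the merging algorithm of Section 3 (Algorithm 1) computes for each candidate interval.

    # Least-squares coefficients for the data points with indices in the interval I,
    # i.e., the minimizer of sum_{i in I} (y_i - <theta, x_i>)^2 (cf. Definition 7).
    function least_squares(X::Matrix{Float64}, y::Vector{Float64}, I::UnitRange{Int})
        return X[I, :] \ y[I]          # backslash solves the least-squares problem
    end

    # Regularized merging error of an interval I: the squared residual of the
    # least-squares fit on I minus s^2 * |I| (cf. the quantity e_u in Algorithm 1).
    function merging_error(X, y, I, s2)
        theta = least_squares(X, y, I)
        return sum(abs2, y[I] .- X[I, :] * theta) - s2 * length(I)
    end

In Algorithm 1, this error is evaluated on the union of two neighboring intervals of the current partition.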

Fact 8 can be proved with the following lemma, which we also use in our analysis. The lemma bounds the correlation of a random vector with any fixed r-dimensional subspace:

Lemma 9 (c.f. proof of Theorem 2.2 in (Rigollet, 2015)). Fix δ > 0. Let ε_1, . . . , ε_n be i.i.d. sub-Gaussian random variables with variance proxy σ². Let ε = (ε_1, . . . , ε_n), and let S be a fixed r-dimensional affine subspace of R^n. Then, with probability 1 − δ, we have
\[ \sup_{v \in S \setminus \{0\}} \frac{|v^T \epsilon|}{\|v\|} \le O\!\left( \sigma \sqrt{r + \log(1/\delta)} \right) . \]

This lemma also yields the two following consequences. The first corollary bounds the correlation between sub-Gaussian random noise and any linear function on any interval (see the full paper for proofs):

Corollary 10. Fix δ > 0 and v ∈ R^n. Let ε = (ε_1, . . . , ε_n) be as in Lemma 9. Then with probability at least 1 − δ, we have that for all intervals I, and for all non-zero linear functions f : R^d → R,
\[ \frac{|\langle \epsilon_I, f_I + v_I \rangle|}{\| f_I + v_I \|} \le O\!\left( \sigma \sqrt{r + \log(n/\delta)} \right) . \]

Corollary 11. Fix δ > 0 and v ∈ R^n. Let ε be as in Lemma 9. Then with probability at least 1 − δ, we have that for all intervals I,
\[ \sup_{f \in L_k} \frac{|\langle \epsilon_I, f_I + v_I \rangle|}{\| f_I + v_I \|} \le O\!\left( \sigma \sqrt{kr + k \log\frac{n}{\delta}} \right) . \]
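As a side illustration (not part of the analysis), the subspace correlation bound of Lemma 9 is easy to check numerically. The following hedged Julia snippet estimates the supremum for a random r-dimensional linear subspace, using the fact that for an orthonormal basis Q of S the supremum equals ‖Qᵀε‖; all names and parameter choices are illustrative.

    using LinearAlgebra, Statistics

    # Monte Carlo sanity check of Lemma 9: for a linear subspace S spanned by the
    # orthonormal columns of Q, sup_{v in S, v != 0} |v' * eps| / ||v|| = ||Q' * eps||.
    function lemma9_check(n::Int, r::Int; trials::Int = 1000, sigma::Float64 = 1.0)
        Q = Matrix(qr(randn(n, r)).Q)[:, 1:r]            # orthonormal basis of S
        sups = [norm(Q' * (sigma .* randn(n))) for _ in 1:trials]
        return quantile(sups, 0.95), sigma * sqrt(r)     # observed vs. predicted scale
    end

For Gaussian noise, the empirical 0.95-quantile should be within a small constant factor of σ√r, matching the lemma up to the log(1/δ) term.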

The bounds in Corollaries 10 and 11 also imply the following statement, which we will need later. We defer the proof to the supplementary material.

Lemma 12. Fix δ > 0. Then with probability 1 − δ we have the following: for all disjoint sets of k intervals I_1, . . . , I_k of [n] so that f is flat on each I_ℓ, the following inequality holds:
\[ \sum_{\ell=1}^{k} \| f_{I_\ell} - f^{LS}_{I_\ell} \|^2 \le O\big( \sigma^2 k (r + \log(n/\delta)) \big) . \]

3. A greedy merging algorithm

In this section, we formally state our new algorithm for segmented regression and analyze its statistical and computational efficiency. See Algorithm 1 for the corresponding pseudocode. The algorithm expects two tuning parameters τ, γ because the algorithm does not return a k-piecewise linear function but instead an O(k)-piecewise linear function. The tuning parameters allow us to trade off running time for fewer pieces. In typical use cases we have τ, γ = Θ(1), in which case our algorithm produces an O(k)-piecewise linear function in time O(nd² log n).

Algorithm 1 Piecewise linear regression by greedy merging.
1: GREEDYMERGING(τ, γ, s, X, y)
   {Initial partition of [n] into intervals of length 1.}
2: I^0 ← {{1}, {2}, . . . , {n}}
   {Iterative greedy merging (we start with j ← 0).}
3: while |I^j| > (2 + 2/τ)k + γ do
4:   Let s_j be the current number of intervals.
     {Compute the least squares fit and its error for merging neighboring pairs of intervals.}
5:   for u ∈ {1, 2, . . . , s_j/2} do
6:     θ_u ← LEASTSQUARES(X, y, I_{2u−1} ∪ I_{2u})
7:     e_u ← ‖y_I − X_I θ_u‖²_2 − s²|I|, where I = I_{2u−1} ∪ I_{2u}
8:   end for
9:   Let L be the set of indices u with the (1 + 1/τ)k largest errors e_u, breaking ties arbitrarily.
10:  Let M be the set of the remaining indices.
     {Keep the intervals with large merging errors.}
11:  I^{j+1} ← ⋃_{u∈L} {I_{2u−1}, I_{2u}}
     {Merge the remaining intervals.}
12:  I^{j+1} ← I^{j+1} ∪ {I_{2u−1} ∪ I_{2u} | u ∈ M}
13:  j ← j + 1
14: end while
15: return the least squares fit to the data on every interval in I^j

We now state the running time of GREEDYMERGING. The analysis is similar to the analysis presented in (Acharya et al., 2015a) and for space considerations is left to the supplementary material.

Theorem 13. Let X and y be as above. Then GREEDYMERGING(τ, γ, s, X, y) outputs a ((2 + 2/τ)k + γ)-piecewise linear function and runs in time O(nd² log(n/γ)).

3.1. Analysis of GREEDYMERGING

Theorem 14. Let δ > 0, and let f̂ be the estimator returned by GREEDYMERGING. Let m = (2 + 2/τ)k + γ be the number of pieces in f̂. Then, with probability 1 − δ, we have that
\[ \mathrm{MSE}(\hat{f}) \le O\!\left( \sigma^2 \frac{m (r + \log(n/\delta))}{n} + \sigma \frac{\tau + \sqrt{k}}{\sqrt{n}} \log\frac{n}{\delta} \right) . \]

Proof. We first condition on the event that Corollaries 6, 10 and 11, and Lemma 12 all hold with error parameter O(δ), so that together they all hold with probability at least 1 − δ. Let I = {I_1, . . . , I_m} be the final partition of [n] that our algorithm produces. Recall f is the ground truth k-piecewise linear function. We partition the intervals in I into two sets: F = {I ∈ I : f is flat on I} and J = {I ∈ I : f has a jump on I}.

We first bound the error over the intervals in F. By Lemma 12, we have
\[ \sum_{I \in F} \| f_I - \hat{f}_I \|^2 \le O\big( \sigma^2 |F| (r + \log(n/\delta)) \big) , \qquad (3) \]
with probability at least 1 − O(δ).

We now turn our attention to the intervals in J and distinguish two further cases. We let J_1 be the set of intervals in J which were never merged, and we let J_2 be the remaining intervals. If the interval I ∈ J_1 was never merged, the interval contains one point, call it i. Because we may assume that x_i ≠ 0, we know that for this one point, our estimator satisfies f̂(x_i) = y_i, since this is clearly the least squares fit for a linear estimator on one nonzero point. Hence Corollary 6 implies that the following inequality holds with probability at least 1 − O(δ):
\[ \sum_{I \in J_1} \| f_I - \hat{f}_I \|^2 = \sum_{I \in J_1} \| \epsilon_I \|^2 \le \sigma^2 \left( \sum_{I \in J_1} |I| + O\Big( \log\frac{n}{\delta} \sqrt{\textstyle\sum_{I \in J_1} |I|} \Big) \right) \le \sigma^2 \Big( m + O\Big( \log\frac{n}{\delta} \sqrt{m} \Big) \Big) . \qquad (4) \]

We now finally turn our attention to the intervals in J_2. Fix an interval I ∈ J_2. By definition, the interval I was merged in some iteration of the algorithm. This implies that in that iteration, there were (1 + 1/τ)k intervals M_1, . . . , M_{(1+1/τ)k} so that for each interval M_ℓ, we have
\[ \| y_I - \hat{f}_I \|^2 - s^2 |I| \le \| y_{M_\ell} - f^{LS}_{M_\ell} \|^2 - s^2 |M_\ell| . \qquad (5) \]

Since the true, underlying k-piecewise linear function f has at most k jumps, this means that there are at least k/τ intervals of the M_ℓ on which f is flat. WLOG assume that these intervals are M_1, . . . , M_{k/τ}.

We have, by definition,
\[ \sum_{\ell=1}^{k/\tau} \Big( \| y_{M_\ell} - f^{LS}_{M_\ell} \|^2 - s^2 |M_\ell| \Big) = \sum_{\ell=1}^{k/\tau} \| f_{M_\ell} - f^{LS}_{M_\ell} \|^2 + 2 \sum_{\ell=1}^{k/\tau} \langle \epsilon_{M_\ell}, f_{M_\ell} - f^{LS}_{M_\ell} \rangle + \sum_{\ell=1}^{k/\tau} \sum_{i \in M_\ell} (\epsilon_i^2 - s^2) . \qquad (6) \]

We will upper bound each term on the RHS in turn. First, since the function f is flat on each M_ℓ for ℓ = 1, . . . , k/τ, Lemma 12 implies
\[ \sum_{\ell=1}^{k/\tau} \| f_{M_\ell} - f^{LS}_{M_\ell} \|^2 \le O\!\left( \sigma^2 \frac{k}{\tau} \Big( r + \log\frac{n}{\delta} \Big) \right) , \qquad (7) \]
with probability at least 1 − O(δ).

Moreover, note that the function f^{LS}_{M_ℓ} is a linear function on M_ℓ of the form f^{LS}_{M_ℓ}(x) = x^T β̂, where β̂ ∈ R^d is the least-squares fit on M_ℓ. Because f is just a fixed vector, the vector f_{M_ℓ} − f^{LS}_{M_ℓ} lives in the affine subspace of vectors of the form f_{M_ℓ} + (X_{M_ℓ})η, where η ∈ R^d is arbitrary. So Corollary 11 and (7) imply that
\[ \sum_{\ell=1}^{k/\tau} \langle \epsilon_{M_\ell}, f_{M_\ell} - f^{LS}_{M_\ell} \rangle \le \sqrt{ \sum_{\ell=1}^{k/\tau} \langle \epsilon_{M_\ell}, f_{M_\ell} - f^{LS}_{M_\ell} \rangle } \cdot \sup_{\eta} \frac{ |\langle \epsilon_{M_\ell}, X \eta \rangle| }{ \| X \eta \| } \le O\!\left( \sigma^2 \frac{k}{\tau} \Big( r + \log\frac{n}{\delta} \Big) \right) , \qquad (8) \]
with probability 1 − O(δ). By Corollary 6, we get that with probability 1 − O(δ),
\[ \sum_{\ell=1}^{k/\tau} \Big( \sum_{i \in M_\ell} \epsilon_i^2 - s^2 |M_\ell| \Big) \le O\big( \sigma \log(n/\delta) \sqrt{n} \big) . \]

Putting it all together, we get that
\[ \sum_{\ell=1}^{k/\tau} \Big( \| y_{M_\ell} - f^{LS}_{M_\ell} \|^2 - s^2 |M_\ell| \Big) \le O\!\left( \sigma^2 \frac{k}{\tau} \Big( r + \log\frac{n}{\delta} \Big) \right) + O\big( \sigma \log(n/\delta) \sqrt{n} \big) \qquad (9) \]
with probability 1 − O(δ). Since the LHS of (5) is bounded by each individual summand above, the LHS is also bounded by their average, which implies that
\[ \| y_I - \hat{f}_I \|^2 - s^2 |I| \le \frac{\tau}{k} \sum_{\ell=1}^{k/\tau} \Big( \| y_{M_\ell} - f^{LS}_{M_\ell} \|^2 - s^2 |M_\ell| \Big) \le O\!\left( \sigma^2 \Big( r + \log\frac{n}{\delta} \Big) \right) + O\!\left( \sigma \log\frac{n}{\delta} \cdot \frac{\tau \sqrt{n}}{k} \right) . \qquad (10) \]

We now similarly expand out the LHS of (5):
\[ \| y_I - \hat{f}_I \|^2 - s^2 |I| = \| f_I + \epsilon_I - \hat{f}_I \|^2 - s^2 |I| = \| f_I - \hat{f}_I \|^2 + 2 \langle \epsilon_I, f_I - \hat{f}_I \rangle + \| \epsilon_I \|^2 - s^2 |I| . \qquad (11) \]

Next, we aim to upper bound Σ_{i∈I} (f(x_i) − f̂(x_i))². Hence, we lower bound the second and third terms of (11). The calculations here closely mirror those above.

By Corollary 10, we have that
\[ 2 \langle \epsilon_I, f_I - \hat{f}_I \rangle \ge - O\!\left( \sigma \sqrt{r + \log\frac{n}{\delta}} \right) \| f_I - \hat{f}_I \| , \]
and by Corollary 6 we have
\[ \| \epsilon_I \|^2 - s^2 |I| \ge - O\!\left( \sigma \log\frac{n}{\delta} \right) \sqrt{|I|} , \]
and so
\[ \| y_I - \hat{f}_I \|^2 - s^2 |I| \ge \| f_I - \hat{f}_I \|^2 - O\!\left( \sigma \sqrt{r + \log\frac{n}{\delta}} \right) \| f_I - \hat{f}_I \| - O\!\left( \sigma \log\frac{n}{\delta} \right) \sqrt{|I|} . \qquad (12) \]

Putting (10) and (12) together yields that with probability 1 − O(δ),
\[ \| f_I - \hat{f}_I \|^2 \le O\!\left( \sigma^2 \Big( r + \log\frac{n}{\delta} \Big) \right) + O\!\left( \sigma \sqrt{r + \log\frac{n}{\delta}} \right) \| f_I - \hat{f}_I \| + O\!\left( \sigma \log\frac{n}{\delta} \Big( \frac{\tau \sqrt{n}}{k} + \sqrt{|I|} \Big) \right) . \]

Letting z² = ‖f̂_I − f_I‖², this inequality is of the form z² ≤ bz + c where b, c ≥ 0. In this case, we have that
\[ b = O\!\left( \sigma \sqrt{r + \log\frac{n}{\delta}} \right) , \quad \text{and} \quad c = O\!\left( \sigma^2 \Big( r + \log\frac{n}{\delta} \Big) \right) + O\!\left( \sigma \log\frac{n}{\delta} \Big( \frac{\tau \sqrt{n}}{k} + \sqrt{|I|} \Big) \right) . \]

We now prove the following lemma about the behavior of such quadratic inequalities:

Lemma 15. Suppose z² ≤ bz + c where b, c ≥ 0. Then z² ≤ O(b² + c).

Proof. From the quadratic formula, the inequality implies that z ≤ (b + √(b² + 4c))/2. From this, it is straightforward to demonstrate the desired claim.

Thus, from the lemma, we have
\[ \| f_I - \hat{f}_I \|^2 \le O\!\left( \sigma^2 \Big( r + \log\frac{n}{\delta} \Big) \right) + O\!\left( \sigma \log\frac{n}{\delta} \Big( \frac{\tau \sqrt{n}}{k} + \sqrt{|I|} \Big) \right) . \]

Hence the total error over all intervals in J_2 can be bounded by:
\[ \sum_{I \in J_2} \| f_I - \hat{f}_I \|^2 \le O\!\left( k \sigma^2 \Big( r + \log\frac{n}{\delta} \Big) \right) + O\!\left( \sigma \log\frac{n}{\delta} \Big( \tau \sqrt{n} + \sum_{I \in J_2} \sqrt{|I|} \Big) \right) \le O\!\left( k \sigma^2 \Big( r + \log\frac{n}{\delta} \Big) \right) + O\!\left( \sigma \log\frac{n}{\delta} \big( \tau \sqrt{n} + \sqrt{kn} \big) \right) . \qquad (13) \]
In the last line we use that the intervals I ∈ J_2 are disjoint (and hence their cardinalities sum up to at most n), and that there are at most k intervals in J_2 because the function f is k-piecewise linear. Finally, applying a union bound and summing (3), (4), and (13) yields the desired conclusion.

[Figure 2. Computational vs. statistical efficiency in the synthetic data experiments (log-log plots of MSE against running time in seconds, for piecewise constant and piecewise linear data). The solid lines correspond to the data in Figure 1, the dashed lines show the results from additional runs of the merging algorithms for larger values of n. The merging algorithm achieves the same MSE as the dynamic program about 100× faster if a sufficient number of samples is available.]

[Figure 3. Dow Jones data and its piecewise linear approximation: results of fitting a 5-piecewise linear function (d = 2) to a Dow Jones time series. The merging algorithm produces a fit that is comparable to the dynamic program and is about 200× faster (0.013 vs. 3.2 seconds).]

4. Experiments

In addition to our theoretical analysis above, we also study the empirical performance of our new estimator for segmented regression on both real and synthetic data. As baseline, we compare our estimator (GREEDYMERGING) to the dynamic programming approach. Since our algorithm combines both combinatorial and linear-algebraic operations, we use the Julia programming language¹ (version 0.4.2) for our experiments because Julia achieves performance close to C on both types of operations. All experiments were conducted on a laptop computer with a 2.8 GHz Intel Core i7 CPU and 16 GB of RAM.

Synthetic data. Experiments with synthetic data allow us to study the statistical and computational performance of our estimator as a function of the problem size n. Our theoretical bounds indicate that the worst-case performance of the merging algorithm scales as O(kd/n + √(k/n) log n) for constant error variance. Compared to the O(kd/n) rate of the dynamic program, this indicates that the relative performance of our algorithm can depend on the number of features d. Hence we use two types of synthetic data: a piecewise-constant function f (effectively d = 1) and a piecewise linear function f with d = 10 features.

¹http://julialang.org/
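As a concrete illustration of the synthetic data described in the next paragraphs, the following Julia sketch generates the piecewise linear design: an n × d matrix with i.i.d. Gaussian entries, an independent parameter vector drawn uniformly from [−1, 1]^d for each of the k segments, and additive Gaussian noise with variance 1. The function and variable names are illustrative; this is not necessarily the exact script used for our experiments.

    # Sketch of the piecewise linear synthetic data described below.
    function piecewise_linear_data(n::Int, k::Int, d::Int)
        X = randn(n, d)                                   # i.i.d. Gaussian design
        f = zeros(n)
        seg = ceil(Int, n / k)
        for j in 1:k
            I = ((j - 1) * seg + 1):min(j * seg, n)       # j-th segment of indices
            beta = 2 .* rand(d) .- 1                      # uniform in [-1, 1]^d
            f[I] = X[I, :] * beta                         # true function values
        end
        y = f .+ randn(n)                                 # unit-variance noise
        return X, y, f
    end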

We generate the piecewise constant function f by randomly choosing k = 10 integers from the set {1, . . . , 10} as function value in each of the k segments.² Then we draw n/k samples from each segment by adding an i.i.d. Gaussian noise term with variance 1 to each sample.

²We also repeated the experiment for other values of k. Since the results are not qualitatively different, we only report the k = 10 case here.

For the piecewise linear case, we generate an n × d data matrix X with i.i.d. Gaussian entries (d = 10). In each segment I, we choose the parameter values β_I independently and uniformly at random from the interval [−1, 1]. So the true function values in this segment are given by f_I = X_I β_I. As before, we then add an i.i.d. Gaussian noise term with variance 1 to each function value.

[Figure 1. Experiments with synthetic data: results for piecewise constant models with k = 10 segments (top row) and piecewise linear models with k = 5 segments (bottom row, dimension d = 10). The panels plot MSE, relative MSE ratio, running time (s), and speed-up against the sample size n for the merging algorithm with k, 2k, and 4k output pieces and for the exact DP. Compared to the exact dynamic program, the MSE achieved by the merging algorithm is worse but stays within a factor of 2 to 4 for a sufficient number of output segments. The merging algorithm is significantly faster and achieves a speed-up of about 10³× compared to the dynamic program for n = 10⁴. This leads to a significantly better trade-off between statistical and computational performance (see also Figure 2).]

Figure 1 shows the results of the merging algorithm and the exact dynamic program for sample size n ranging from 10² to 10⁴. Since the merging algorithm can produce a variable number of output segments, we run the merging algorithm with three different parameter settings corresponding to k, 2k, and 4k output segments, respectively. As predicted by our theory, the plots show that the exact dynamic program has a better statistical performance. However, the MSE of the merging algorithm with 2k pieces is only worse by a factor of 2 to 4, and this ratio empirically increases only slowly with n (if at all). The experiments also show that forcing the merging algorithm to return at most k pieces can lead to a significantly worse MSE.

In terms of computational performance, the merging algorithm has a significantly faster running time, with speed-ups of more than 1,000× for n = 10⁴ samples. As can be seen in Figure 2, this combination of statistical and computational performance leads to a significantly improved trade-off between the two quantities. When we have a sufficient number of samples, the merging algorithm achieves a given MSE roughly 100× faster than the dynamic program.

Real data. We also investigate whether the merging algorithm can empirically be used to find linear trends in a real dataset. We use a time series of the Dow Jones index as input, and fit a piecewise linear function (d = 2) with 5 segments using both the dynamic program and our merging algorithm with k = 5 output pieces. As can be seen from Figure 3, the dynamic program produces a slightly better fit for the rightmost part of the curve, but the merging algorithm identifies roughly the same five main segments. As before, the merging algorithm is significantly faster and achieves a 200× speed-up compared to the dynamic program (0.013 vs 3.2 seconds).
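To make the merging procedure concrete in the language used for these experiments, the following simplified Julia sketch implements the candidate-pair merging loop of Algorithm 1 for the special case τ = γ = 1 and a known noise variance s². It is an illustrative re-implementation under these assumptions (for instance, it recomputes each least-squares fit from scratch), not the exact code behind the timings reported above.

    # Simplified sketch of GREEDYMERGING (Algorithm 1) with tau = gamma = 1:
    # start from singleton intervals, repeatedly pair up neighbors, keep the
    # 2k candidate pairs with the largest regularized errors unmerged, and
    # merge the rest, until at most 4k + 1 intervals remain.
    function greedy_merging(X::Matrix{Float64}, y::Vector{Float64}, k::Int, s2::Float64)
        n = size(X, 1)
        intervals = [i:i for i in 1:n]
        while length(intervals) > 4k + 1
            npairs = div(length(intervals), 2)
            errors = zeros(npairs)
            pairs = Vector{UnitRange{Int}}(undef, npairs)
            for u in 1:npairs
                I = first(intervals[2u - 1]):last(intervals[2u])
                theta = X[I, :] \ y[I]
                errors[u] = sum(abs2, y[I] .- X[I, :] * theta) - s2 * length(I)
                pairs[u] = I
            end
            keep = sortperm(errors; rev = true)[1:min(2k, npairs)]  # largest errors stay split
            next = Vector{UnitRange{Int}}()
            for u in 1:npairs
                if u in keep
                    push!(next, intervals[2u - 1], intervals[2u])
                else
                    push!(next, pairs[u])
                end
            end
            if isodd(length(intervals))          # a leftover interval with no partner
                push!(next, intervals[end])
            end
            length(next) < length(intervals) || break   # safety guard against stalling
            intervals = sort(next; by = first)
        end
        # Return the per-interval least-squares fits.
        return [(I, X[I, :] \ y[I]) for I in intervals]
    end

A call such as greedy_merging(X, y, 5, 1.0) returns at most 4·5 + 1 intervals together with their least-squares coefficients; the postprocessing step described in Section 2.1 could then reduce the number of pieces further.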

Acknowledgements

Part of this research was conducted while Ilias Diakonikolas was at the University of Edinburgh, Jerry Li was at Microsoft Research Cambridge (UK), and Ludwig Schmidt was at UC Berkeley.

Jayadev Acharya was supported by a grant from the MIT-Shell Energy Initiative. Ilias Diakonikolas was supported in part by EPSRC grant EP/L021749/1, a Marie Curie Career Integration Grant, and a SICSA grant. Jerry Li was supported by NSF grant CCF-1217921, DOE grant DE-SC0008923, NSF CAREER Award CCF-145326, and a NSF Graduate Research Fellowship. Ludwig Schmidt was supported by grants from the MIT-Shell Energy Initiative, MADALGO, and the Simons Foundation.

References

Acharya, J., Diakonikolas, I., Hegde, C., Li, J. Z., and Schmidt, L. Fast and near-optimal algorithms for approximating distributions by histograms. In PODS, pp. 249–263, 2015a.

Acharya, J., Diakonikolas, I., Li, J. Z., and Schmidt, L. Sample-optimal density estimation in nearly-linear time. CoRR, abs/1506.00671, 2015b. URL http://arxiv.org/abs/1506.00671.

Avron, H., Sindhwani, V., and Woodruff, D. Sketching structured matrices for faster nonlinear regression. In NIPS, pp. 2994–3002, 2013.

Bai, J. and Perron, P. Estimating and testing linear models with multiple structural changes. Econometrica, 66(1):47–78, 1998.

Chatterjee, S., Guntuboyina, A., and Sen, B. On risk bounds in isotonic and other shape restricted regression problems. Annals of Statistics, 43(4):1774–1800, 2015.

Feder, P. I. On asymptotic distribution theory in segmented regression problems – identified case. Annals of Statistics, 3(1):49–83, 1975.

Friedman, J. H. Multivariate adaptive regression splines. Annals of Statistics, 19(1):1–67, 1991.

Gallant, A. R. and Fuller, W. A. Fitting segmented polynomial regression models whose join points have to be estimated. Journal of the American Statistical Association, 68(341):144–147, 1973.

Jordan, M. I. On statistics, computation and scalability. Bernoulli, 19(4):1378–1390, 2013.

Kyng, R., Rao, A., and Sachdeva, S. Fast, provable algorithms for isotonic regression in all l_p-norms. In NIPS, pp. 2701–2709, 2015.

Meyer, M. C. Inference using shape-restricted regression splines. Annals of Applied Statistics, 2(3):1013–1033, 2008.

Mosteller, F. and Tukey, J. W. Data analysis and regression: a second course in statistics. Addison-Wesley, Reading (Mass.), Menlo Park (Calif.), London, 1977.

Rigollet, P. High dimensional statistics. 2015. URL http://www-math.mit.edu/~rigollet/PDFs/RigNotes15.pdf.

Stone, C. J. The use of polynomial splines and their tensor products in multivariate function estimation. Annals of Statistics, 22(1):118–171, 1994.

Stone, C. J., Hansen, M. H., Kooperberg, C., and Truong, Y. K. Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture. Annals of Statistics, 25(4):1371–1470, 1997.

Wegman, E. J. and Wright, I. W. Splines in statistics. Journal of the American Statistical Association, 78(382):351–365, 1983.

Yamamoto, Y. and Perron, P. Estimating and testing multiple structural changes in linear models using band spectral regressions. Econometrics Journal, 16(3):400–429, 2013.