Fast Algorithms for Segmented Regression

Jayadev Acharya, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Ilias Diakonikolas, University of Southern California, Los Angeles, CA 90089, USA
Jerry Li and Ludwig Schmidt∗, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

∗Authors ordered alphabetically.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We study the fixed design segmented regression problem: Given noisy samples from a piecewise linear function $f$, we want to recover $f$ up to a desired accuracy in mean-squared error. Previous rigorous approaches for this problem rely on dynamic programming (DP) and, while sample efficient, have running time quadratic in the sample size. As our main contribution, we provide new sample near-linear time algorithms for the problem that, while not being minimax optimal, achieve a significantly better sample-time tradeoff on large datasets compared to the DP approach. Our experimental evaluation shows that, compared with the DP approach, our algorithms provide a convergence rate that is only off by a factor of 2 to 4, while achieving speedups of three orders of magnitude.

1. Introduction

We study the regression problem, a fundamental inference task with numerous applications that has received tremendous attention in machine learning and statistics during the past fifty years (see, e.g., (Mosteller & Tukey, 1977) for a classical textbook). Roughly speaking, in a (fixed design) regression problem, we are given a set of $n$ observations $(x_i, y_i)$, where the $y_i$'s are the dependent variables and the $x_i$'s are the independent variables, and our goal is to model the relationship between them. The typical assumptions are that (i) there exists a simple function $f$ that (approximately) models the underlying relation, and (ii) the dependent observations are corrupted by random noise. More specifically, we assume that there exists a family of functions $\mathcal{F}$ such that for some $f \in \mathcal{F}$ the following holds: $y_i = f(x_i) + \epsilon_i$, where the $\epsilon_i$'s are i.i.d. random variables drawn from a "tame" distribution such as a Gaussian (later, we also consider model misspecification).

Throughout this paper, we consider the classical notion of Mean Squared Error (MSE) to measure the performance (risk) of an estimator. As expected, the minimax risk depends on the family $\mathcal{F}$ that $f$ comes from. The natural case that $f$ is linear is fully understood: It is well-known that the least-squares estimator is statistically efficient and runs in sample-linear time. The more general case that $f$ is non-linear, but satisfies some well-defined structural constraint, has been extensively studied in a variety of contexts (see, e.g., (Gallant & A., 1973; Feder, 1975; Friedman, 1991; Bai & Perron, 1998; Yamamoto & Perron, 2013; Kyng et al., 2015; Avron et al., 2013; Meyer, 2008; Chatterjee et al., 2015)). In contrast to the linear case, this more general setting is not well-understood from an information-theoretic and/or computational aspect.

In this paper, we focus on the case that the function $f$ is promised to be piecewise linear with a given number $k$ of unknown pieces (segments). This is known as fixed design segmented regression, and has received considerable attention in the statistics community (Gallant & A., 1973; Feder, 1975; Bai & Perron, 1998; Yamamoto & Perron, 2013). The special case of piecewise polynomial functions (splines) has been extensively used in the context of inference, including density estimation and regression, see, e.g., (Wegman & Wright, 1983; Friedman, 1991; Stone, 1994; Stone et al., 1997; Meyer, 2008).

Information-theoretic aspects of the segmented regression problem are well-understood: Roughly speaking, the minimax risk is inversely proportional to the number of samples. In contrast, the computational complexity of the problem is poorly understood: Prior to our work, known algorithms for this problem with provable guarantees were quite slow. Our main contribution is a set of new provably fast algorithms that outperform previous approaches both in theory and in practice. Our algorithms run in time that is nearly-linear in the number of data points $n$ and the number of intervals $k$. Their computational efficiency is established both theoretically and experimentally. We also emphasize that our algorithms are robust to model misspecification, i.e., they perform well even if the function $f$ is only approximately piecewise linear.

Note that if the segments of $f$ were known a priori, the segmented regression problem could be immediately reduced to $k$ independent linear regression problems. Roughly speaking, in the general case (where the location of the segment boundaries is unknown) one needs to "discover" the right segments using information provided by the samples. To address this algorithmic problem, previous works (Bai & Perron, 1998; Yamamoto & Perron, 2013) relied on dynamic programming that, while being statistically efficient, is computationally quite slow: its running time scales at least quadratically with the size $n$ of the data, hence it is rather impractical for large datasets.

Our main motivation comes from the availability of large datasets that has made computational efficiency the main bottleneck in many cases. In the words of (Jordan, 2013): "As data grows, it may be beneficial to consider faster inferential algorithms, because the increasing statistical strength of the data can compensate for the poor algorithmic quality." Hence, it is sometimes advantageous to sacrifice statistical efficiency in order to achieve faster running times because we can then achieve the desired error guarantee faster (provided more samples). In our context, instead of using a slow dynamic program, we employ a subtle iterative greedy approach that runs in sample-linear time.

Our iterative greedy approach builds on the work of (Acharya et al., 2015a;b), but the details of our algorithms here and their analysis are substantially different. In particular, as we explain in the body of the paper, the natural adaptation of their analysis to our setting fails to provide any meaningful statistical guarantees. To obtain our results, we introduce novel algorithmic ideas and carefully combine them with additional probabilistic arguments.

2. Preliminaries

In this paper, we study the problem of fixed design segmented regression. We are given samples $x_i \in \mathbb{R}^d$ for $i \in [n]$ ($= \{1, \ldots, n\}$), and we consider the following classical regression model:

$$y_i = f(x_i) + \epsilon_i . \tag{1}$$

Here, the $\epsilon_i$ are i.i.d. sub-Gaussian noise variables with variance proxy $\sigma^2$, mean $\mathbb{E}[\epsilon_i] = 0$, and variance $s^2 = \mathbb{E}[\epsilon_i^2]$ for all $i$. We will let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ denote the vector of noise variables. We also assume that $f : \mathbb{R}^d \to \mathbb{R}$ is a $k$-piecewise linear function. Formally, this means:

Definition 1. The function $f : \mathbb{R}^d \to \mathbb{R}$ is a $k$-piecewise linear function if there exists a partition of the real line into $k$ disjoint intervals $I_1, \ldots, I_k$, $k$ corresponding parameters $\theta_1, \ldots, \theta_k \in \mathbb{R}^d$, and a fixed, known $j$ such that for all $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ we have that $f(x) = \langle \theta_i, x \rangle$ if $x_j \in I_i$. Let $\mathcal{L}_{k,j}$ denote the space of $k$-piecewise linear functions with partition defined on coordinate $j$.

Moreover, we say $f$ is flat on an interval $I \subseteq \mathbb{R}$ if $I \subseteq I_i$ for some $i = 1, \ldots, k$; otherwise, we say that $f$ has a jump on the interval $I$.

In the full paper, we also discuss the agnostic setting where the ground truth $f$ is not piecewise linear itself but only well-approximated by a $k$-piecewise linear function. For simplicity of exposition, we assume that the partition coordinate $j$ is 1 in the rest of the paper. We remark that this model also contains the problem of (fixed design) piecewise polynomial regression as an important subcase (see the full paper for details).

Following this generative model, a regression algorithm receives the $n$ pairs $(x_i, y_i)$ as input. The goal of the algorithm is then to produce an estimate $\hat{f}$ that is close to the true, unknown $f$ with high probability over the noise terms $\epsilon_i$ and any randomness in the algorithm. We measure the distance between our estimate $\hat{f}$ and the unknown function $f$ with the classical mean-squared error:

$$\mathrm{MSE}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - \hat{f}(x_i) \right)^2 .$$

Throughout this paper, we let $X \in \mathbb{R}^{n \times d}$ be the data matrix, i.e., the matrix whose $j$-th row is $x_j^T$ for every $j$, and we let $r$ denote the rank of $X$.

The following notation will also be useful. For any function $f : \mathbb{R}^d \to \mathbb{R}$, we let $f \in \mathbb{R}^n$ denote the vector with components $f_i = f(x_i)$ for $i \in [n]$. For any interval $I$, we let $X^I$ denote the data matrix consisting of all data points $x_i$ for $i \in I$, and for any vector $v \in \mathbb{R}^n$, we let $v^I \in \mathbb{R}^{|I|}$ be the vector of $v_i$ for $i \in I$.

2.1. Our Contributions

Our main contributions are new, fast algorithms for the aforementioned segmented regression problem. We now informally state our main results and refer to later sections for more precise theorems.

Theorem 2 (informal statement of Theorems 13 and 14). There is an algorithm GREEDYMERGE, which, given $X$ (of rank $r$), $y$, a target number of pieces $k$, and the variance

ing the rest, we now split the candidates into $\log n$ buckets based on the lengths of the candidate intervals. In this scheme, bucket $\alpha$ contains all candidates with length between $2^\alpha$ and $2^{\alpha+1}$, for $\alpha = 0, \ldots, \log n - 1$.
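The length-based bucketing just described can be illustrated with a short sketch. This is not the paper's implementation: the function name `bucket_by_length`, the representation of candidates as half-open index ranges `(start, end)`, and the convention that bucket $\alpha$ holds lengths $L$ with $2^\alpha \le L < 2^{\alpha+1}$ are all assumptions made here for illustration, since the text only says "between $2^\alpha$ and $2^{\alpha+1}$".

```python
from collections import defaultdict


def bucket_by_length(intervals):
    """Group candidate intervals into buckets by length (hypothetical helper).

    Each interval is a half-open index range (start, end), so its length is
    end - start. Assumed convention: bucket alpha holds the intervals whose
    length L satisfies 2**alpha <= L < 2**(alpha + 1).
    """
    buckets = defaultdict(list)
    for start, end in intervals:
        length = end - start
        if length <= 0:
            raise ValueError("intervals must be non-empty")
        # For a positive integer, bit_length() - 1 equals floor(log2(length)).
        alpha = length.bit_length() - 1
        buckets[alpha].append((start, end))
    return dict(buckets)


# Candidate intervals with lengths 1, 2, 3, 4 and 8.
candidates = [(0, 1), (1, 3), (3, 6), (6, 10), (10, 18)]
buckets = bucket_by_length(candidates)
```

Using the integer bit length avoids floating-point logarithms, which can misplace intervals whose length is an exact power of two.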

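As a concrete illustration of the regression model (1) and the MSE criterion from Section 2, the sketch below draws samples from a 2-piecewise linear function and evaluates the mean-squared error of an estimate. It is only a sketch under stated assumptions: the helper names (`piecewise_linear`, `mse`), the half-open intervals, the Gaussian noise (one example of sub-Gaussian noise), and the constant second feature used as an intercept are all choices made here, not details taken from the paper.

```python
import random


def piecewise_linear(x, breakpoints, thetas, j=0):
    """Evaluate f(x) = <theta_i, x>, where x[j] lies in the i-th interval.

    breakpoints are the sorted interior boundaries of the partition; the
    intervals are assumed half-open, [b_{i-1}, b_i). Index j = 0 corresponds
    to the paper's partition coordinate j = 1.
    """
    i = sum(1 for b in breakpoints if x[j] >= b)
    return sum(t * xi for t, xi in zip(thetas[i], x))


def mse(f, f_hat, xs):
    """Mean-squared error between f and f_hat at the fixed design points xs."""
    return sum((f(x) - f_hat(x)) ** 2 for x in xs) / len(xs)


rng = random.Random(0)
breakpoints = [0.0]                   # k = 2 pieces: x_1 < 0 and x_1 >= 0
thetas = [(1.0, -2.0), (1.0, 3.0)]    # one parameter vector per piece
sigma = 0.1                           # assumed noise standard deviation

# Fixed design points in R^2; the second coordinate is a constant feature.
xs = [(rng.uniform(-1.0, 1.0), 1.0) for _ in range(200)]


def f(x):
    return piecewise_linear(x, breakpoints, thetas)


# Noisy observations y_i = f(x_i) + eps_i, as in Equation (1).
ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]


def f_shifted(x):
    # A deliberately wrong estimate: the true function shifted by 0.5.
    return f(x) + 0.5


error_of_truth = mse(f, f, xs)
error_of_shift = mse(f, f_shifted, xs)
```

Because the shifted estimate differs from $f$ by 0.5 at every design point, its MSE is $0.5^2 = 0.25$ up to floating-point rounding, while the true function has MSE 0.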