
Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, Shanghai, P.R. China, December 16-18, 2009

Kernel regression for travel time estimation via convex optimization

Sébastien Blandin, Laurent El Ghaoui and Alexandre Bayen

S. Blandin is a Ph.D. student, Systems Engineering, Department of Civil and Environmental Engineering, University of California, Berkeley, CA 94720-1710 USA (e-mail: [email protected]). Corresponding author.
L. El Ghaoui is a Professor, Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94720-1710 USA (e-mail: [email protected]).
A. Bayen is an Assistant Professor, Systems Engineering, Department of Civil and Environmental Engineering, University of California, Berkeley, CA 94720-1710 USA (e-mail: [email protected]).

Abstract—We develop an algorithm aimed at estimating travel time on segments of a road network using a convex optimization framework. Sampled travel times from probe vehicles are assumed to be known and serve as a training set for an algorithm which provides an optimal estimate of the travel time for all vehicles. A kernel method is introduced to allow for a non-linear relation between the known entry times and the travel times that we want to estimate. To improve the quality of the estimate we minimize the estimation error over a convex combination of known kernels. This problem is shown to be a semi-definite program. A rank-one decomposition is used to convert it to a linear program which can be solved efficiently.

I. INTRODUCTION

Travel time estimation on transportation networks is a valuable traffic metric. It is readily understood by practitioners, and can be used as a performance measure [3] for traffic monitoring applications. Note however that the travel time estimation problem is easier to address on highways than on arterial roads. This can be intuitively explained by the fact that the properties of highways can be considered 'more spatially invariant' than those of arterial roads. Indeed the latter present complex features such as intersections and signals forcing vehicles to stop, resulting in spatially discontinuous properties. In this article, we propose a new method to estimate travel time on road segments without any elaborate model assumption. This method is shown to belong to a specific class of convex optimization problems and provides a non-linear estimate of the travel time. The kernel regression method introduced allows for estimation improvement through the online extension of the set of kernels used. In particular, we assess the performance of this technique through a rank-one kernel decomposition.

Highway traffic modeling is a mature field. Macroscopic models date back to [10], [16], [22] and usually fall under the theory of scalar conservation laws [9]. Microscopic models take into account vehicle driving behaviors and can be derived from the car-following model [17]. Flow models and driving behavior models on arterials are still the focus of significant ongoing research.

Sensors such as loop detectors are widely available on highways. They provide accurate measurements of density or speed [13], but are extremely sparse on arterials. Thus hardly any information on arterial traffic is available in real-time. It is only recently that the growth of mobile sensing has been shown to offer reliable information for traffic monitoring [12], [29]. One can hope to use this source of information to provide accurate estimates of travel time on arterials.

Travel time estimation on highways has been investigated with different tools. Efforts have been made on the modeling side, assuming local knowledge of density or speed, and producing an estimate given by a deterministic or stochastic model [5], [14], [20]. This problem has also been addressed using data analysis and machine learning techniques with various types of learning methods [18], [19], [21], [28], [30].

Arterial travel time estimation is more complex because the continuum approximation of the road might not apply at intersections, where the dynamics is not easily modeled [1]. Information about the state of traffic on arterials is also limited because of the sparsity of sensors. Some attempts have been made to estimate travel time on arterials, but in practice it is not always possible to know the traffic light cycles or to obtain a dedicated fleet of probe vehicles, often needed for estimation [24], [25]. However, the ubiquity of GPS now enables one to realistically assume the knowledge of sampled travel times, an assumption for example verified in sections of Northern California with the Mobile Millennium system [11].

We propose to focus on arterial travel time estimation using machine learning techniques and convex optimization. We use kernel methods [23] to provide a non-linear estimate of travel time on an arterial road segment. We assume the knowledge of the travel times of a subset of vehicles and estimate the travel time of all vehicles. We use convex optimization [2] to improve the performance of the non-linear estimate through kernel regression. The regression is done on a set of kernels chosen according to their usually good performance, or according to physical criteria. The kernel framework [6] enables the addition of features to the set of covariates in the regression problem. The kernel regression gives the possibility to select the most relevant features via optimization.

This article is organized as follows. In section II, we describe the optimization problem, introducing the regularization parameter and the kernel in a learning setting. In section III, we pose the kernel regression problem, show that it can be written as a convex optimization problem, transform it into a linear program which can be solved efficiently, and describe the general learning algorithm used. In section IV, we present the simulation dataset used for validating the method and the results obtained. In particular, we discuss the results showing that kernel regression enables one to obtain a better estimate on the validation set. Finally, based on these results, we enumerate in section V ongoing extensions to this work.


II. PROBLEM STATEMENT

A. Travel time estimation

We investigate travel time estimation for a given road segment. Assuming a set of entry times and travel times on the section, we apply machine learning techniques to use the knowledge of a subset of the (entry time, travel time) pairs in order to produce an estimate of travel time for every entry time. The dataset used for validation is described later in section IV-A. We assume the knowledge of a dataset of size N which reads S = {(x_i, y_i) ∈ R+ × R+ | i = 1 ··· N}, where for each value of the index i = 1 ··· N, x_i is an entry time on the road section and y_i is the realized travel time (known as the a-posteriori travel time in the transportation community) for entry time x_i. We would like to learn a function h : R+ → R+ which, given S, would provide an estimate of the travel time y for any x ∈ R+. This is a typical regression problem as described in [6]. The well-known unconstrained least-squares method can be formulated as an optimization problem:

    min_θ ||y − x^T θ||_2^2    (1)

where y ∈ R^{N×1} is the vector of realized travel times, or output, and x^T ∈ R^{N×1} is the vector of entry times, or input. One must note that here x is a row vector and y is a column vector, so θ is a scalar. The well-known solution of this problem can be computed as:

    θ_opt = (x x^T)† x y    (2)

where the notation (x x^T)† denotes the pseudo-inverse of (x x^T), and the optimal estimate is given by ŷ = (x x^T)† x^T x y. This estimate does not have bias, i.e. the mean of the output y equals the mean of the estimate ŷ.

B. Regularization

The regression problem defined in (1) is often ill-posed in the sense that the solution does not depend continuously on the data (the case of multiple solutions falls into that denomination). Formulation (1) could also lead to over-fitting in the case of non-linear regression since there is no penalization of high values of the solution θ_opt. In order to prevent these two possible flaws, it is a common practice to add to the objective function a quadratic term called Tikhonov regularization [26], which has the form ρ^2 |θ|^2 in the scalar case. Then the optimal estimate becomes:

    ŷ = (x x^T + ρ^2 I)^{-1} x^T x y.    (3)

For ρ large enough, the problem is well-posed and over-fitting with respect to θ is prevented [8].

C. Kernel methods

The regression method described in section II-A in the linear case can be extended to the non-linear case through the use of a kernel. One can consider a mapping function φ(·) and consider the linear regression problem between the mapped covariates φ(x_i) and the outputs y_i. This is the main principle of kernel methods, which consists in using a feature space, in which the dataset is represented, and in considering linear relations between objects in this feature space, and between these features and the outputs. Given a positive semi-definite matrix K = (k_ij), we define the kernel function K_f : X × X → R such that K_f(x_i, x_j) = k_ij. This implicitly defines a feature mapping φ(·) between the input set X and a Hilbert space H, φ(·) : X → H, such that ⟨φ(x_i), φ(x_j)⟩_H = K_f(x_i, x_j). In the following we denote by X_map a matrix representation of the mapping φ(·) (thus the i-th column of X_map is φ(x_i)). When φ(·) has scalar values, X_map is a row vector.

Remark 1: One does not have to define a mapping function φ(·) to define a kernel matrix, but can simply consider a positive semi-definite matrix and use it as a kernel. It is also possible to define a kernel matrix from a mapping φ(·) and one of its matrix representations X_map as K = X_map^T X_map.

The inner product in H naturally appears to be given by the Gram matrix K, called the kernel. Kernel techniques [6], [23] have several benefits:
• They enable one to work with any type of features of the initial dataset, which a priori has no particular structure, in a Hilbert space.
• They guarantee a reasonable computational cost for the algorithm by allowing a complexity related to the number of points represented and not to the number of features used (this is known as the kernel trick and is described in Remark 4).
Thus, kernel methods provide several extensions to usual regression methods, and can be easily written in a machine learning framework.

D. Learning setting

We assume the knowledge of a training set S_tr = {(x_i, y_i) | i = 1 ··· n_tr} and we look for the best estimate of the elements of the test set S_t = {x_i | i = n_tr+1 ··· n_tr+n_t}. In order to match the structure of this problem, we define the kernel matrix in block form as:

    K = [ K_tr     K_trt ]
        [ K_trt^T  K_t   ]    (4)

where k_ij = ⟨φ(x_i), φ(x_j)⟩_H for i, j = 1 ··· n_tr+n_t. The Gram matrix K_tr is the result of an optimization problem over the training set, and we learn the cross-term K_trt, which expresses the inner product in the feature space between the elements of the test set and the elements of the training set. K_t is the inner product between the elements of the test set.
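For concreteness, the scalar least-squares solution (2) and the regularized estimate (3) can be checked numerically. The following minimal sketch (Python/numpy, with synthetic entry and travel times of our own invention; the implementation used in the paper is Matlab-based) only makes the dimensions explicit:

```python
import numpy as np

# Synthetic data (hypothetical): x is the 1 x N row vector of entry times,
# y the N x 1 column vector of realized travel times, as in section II-A.
rng = np.random.default_rng(0)
N = 100
x = rng.uniform(0.0, 1800.0, size=(1, N))                 # entry times (s)
y = (60.0 + 0.01 * x + rng.normal(0.0, 5.0, (1, N))).T    # travel times (s)

# Unregularized least-squares (1)-(2); theta is a scalar here.
theta_opt = np.linalg.pinv(x @ x.T) @ x @ y               # (x x^T)^† x y
y_hat = x.T @ theta_opt                                   # optimal estimate

# Tikhonov-regularized estimate (3) for a given rho; since x x^T is a
# 1 x 1 matrix in the scalar case, the factors commute as in the text.
rho = 10.0
y_hat_reg = x.T @ np.linalg.solve(x @ x.T + rho**2 * np.eye(1), x @ y)
```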
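The block structure (4) of the kernel matrix in the learning setting can likewise be made explicit. A sketch with an arbitrary train/test split (the Gaussian kernel anticipates section III-E):

```python
import numpy as np

def gaussian_kernel(xa, xb, sigma=100.0):
    # k_ij = exp(-|x_i - x_j|^2 / sigma^2), cf. section III-E.
    return np.exp(-np.subtract.outer(xa, xb) ** 2 / sigma ** 2)

# Hypothetical entry times, split into training and test parts (section II-D).
x_all = np.linspace(0.0, 1800.0, 200)
n_tr = 100
x_tr, x_t = x_all[:n_tr], x_all[n_tr:]

# Block form (4): K = [[K_tr, K_trt], [K_trt^T, K_t]].
K_tr = gaussian_kernel(x_tr, x_tr)    # training block
K_trt = gaussian_kernel(x_tr, x_t)    # cross term between train and test
K_t = gaussian_kernel(x_t, x_t)       # test block
K = np.block([[K_tr, K_trt], [K_trt.T, K_t]])
```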


III. ANALYSIS

A. Convex formulation

Expressing the linear least-squares problem (1) for the mapped input X_map, with the regularization term described in section II-B, yields:

    p* = min_θ ||y − X_map^T θ||_2^2 + ρ^2 ||θ||_2^2    (5)

where we denote by p* the optimal value of this problem. Using the change of variable z = X_map^T θ − y yields the equivalent formulation:

    p* = min_{θ,z} ||z||_2^2 + ρ^2 ||θ||_2^2    (6)
    subject to z = y − X_map^T θ.    (7)

The Lagrangian dual of this problem reads:

    d* = max_α 2 α^T y − α^T (I + X_map^T X_map / ρ^2) α.    (8)

In this equation, we see the expression of the kernel matrix:

    K = X_map^T X_map.    (9)

If we denote by K_ρ = I + X_map^T X_map / ρ^2 the regularized kernel, the dual optimal point and dual optimal value of problem (8) can be expressed as:

    α* = K_ρ^{-1} y  and  d* = y^T K_ρ^{-1} y.    (10)

Remark 2: Since the primal (6)-(7) and dual (8) are convex and strictly feasible, strong duality holds and the primal optimal value p* and the dual optimal value d* are equal. We note that expression (8) shows that the dual optimal value is a maximum over a set of linear functions of K_ρ, so the optimal value is a convex function of the regularized kernel matrix K_ρ. Since the choice of the kernel is crucial for the optimal value, it is interesting to minimize the optimal value d* with respect to the kernel.

Remark 3: Optimizing the kernel matrix physically means looking for the best mapping function φ(·) such that there is a linear relation between the features of the inputs φ(x_i) and the outputs y_i. If one takes φ(·) as the identity mapping, then the optimal value of (8) becomes:

    y^T K_ρ^{-1} y = y^T (I + x^T x / ρ^2)^{-1} y    (11)

which may not be optimal for non-linear systems.

Remark 4: One must note that the kernel matrix (9) is a square matrix which has the dimension of x^T x, and thus its size does not depend on the number of features represented in X_map but only on the number of covariates x. The dimension of the image space of φ(·), which is the dimension of the feature space, does not appear in the kernel matrix. This is the kernel trick mentioned in section II-C.

B. Cross-validation

The optimal value of (5), as expressed in (10), depends on the kernel matrix (9) and on the regularization parameter ρ. The parameter ρ is tuned through a re-sampling procedure [7], k-fold cross-validation (here k does not denote the kernel matrix but the number of folds used in the cross-validation method). This technique consists in dividing the dataset into k parts of approximately equal size, and using a subset for training while the remainder is used for testing [27]. For instance, if the different parts are {P_i | i = 1 ··· k}, then given n ∈ {1 ··· k} one would use P_n as a training set and ∪_{i=1···k, i≠n} P_i as a test set. This is useful to make extensive use of the dataset while avoiding bias on the training set. Here we use this method on the training set to pick the optimal value of the regularization parameter ρ, and on the whole set to obtain a meaningful estimation error.
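Evaluating the dual optimal value (10) for a fixed kernel and a fixed ρ amounts to a single linear solve. A minimal sketch (assuming a precomputed kernel matrix K and a 1-D output vector y; names are ours):

```python
import numpy as np

def dual_value(K, y, rho):
    # d* = y^T K_rho^{-1} y with K_rho = I + K / rho^2, cf. (10).
    K_rho = np.eye(K.shape[0]) + K / rho**2
    return float(y @ np.linalg.solve(K_rho, y))
```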
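The re-sampling procedure of section III-B can be sketched as below. We follow the convention stated in the text (one part P_n serves as the training set, the union of the other parts as the test set) and, as a stand-in for the full kernel regression of section III-C, we score each candidate ρ with a standard kernel ridge predictor; all names and details here are ours:

```python
import numpy as np

def kfold_rho_selection(x, y, rhos, kernel, k=5, seed=0):
    """Pick rho by k-fold re-sampling (section III-B): fold P_n trains, the
    remaining folds test; the mean L2 relative test error is minimized.
    `kernel(xa, xb)` returns the Gram matrix between two sets of inputs."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(x)), k)
    best_rho, best_err = None, np.inf
    for rho in rhos:
        errs = []
        for n in range(k):
            tr = folds[n]
            te = np.concatenate([folds[i] for i in range(k) if i != n])
            # Kernel ridge estimate on the test folds (a stand-in for the
            # regularized kernel regression of section III).
            alpha = np.linalg.solve(kernel(x[tr], x[tr])
                                    + rho**2 * np.eye(len(tr)), y[tr])
            y_hat = kernel(x[te], x[tr]) @ alpha
            errs.append(np.linalg.norm(y_hat - y[te]) / np.linalg.norm(y[te]))
        if np.mean(errs) < best_err:
            best_rho, best_err = rho, float(np.mean(errs))
    return best_rho
```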

C. Kernel regression

As stated in Remark 2, the optimal value of the regularized regression problem (5) is a convex function of the regularized kernel matrix K_ρ and can be optimized over the kernel. The kernel optimization problem, which consists in minimizing the value d* defined in (10) with respect to the regularized kernel K_ρ, reads:

    min_{K_ρ} y^T K_ρ^{-1} y    (12)
    subject to K_ρ ⪰ 0    (13)

where the constraint on the kernel matrix enforces that the regularized kernel K_ρ must be a Gram matrix. This problem is convex according to Remark 2. In order to prevent over-fitting with respect to K_ρ, we follow [15] and constrain K_ρ to be a convex combination of given kernels, i.e. we define a set of kernels {K_1 ··· K_k} and consider the problem:

    min_λ y^T K_ρ^{-1} y    (14)
    subject to λ_i ≥ 0,  Σ_{i=1}^k λ_i = 1    (15)
    K_ρ = Σ_{i=1}^k λ_i K_i.    (16)

In a learning setting, the optimization problem (14) is defined only on the training set, but the expression of the kernel matrix as a linear combination of known kernels must be satisfied on the whole set. Using the notation introduced in section II-D we write K_ρ = [ K_tr  K_trt ; K_trt^T  K_t ], and under this form the problem reads:

    min_λ y^T K_tr^{-1} y    (17)
    subject to λ_i ≥ 0,  Σ_{i=1}^k λ_i = 1    (18)
    K_ρ = Σ_{i=1}^k λ_i K_i    (19)

which can be written in semi-definite program form using an epigraph property and the Schur complement:

    min_{λ,t} t    (20)
    subject to λ_i ≥ 0,  Σ_{i=1}^k λ_i = 1    (21)
    K_ρ = Σ_{i=1}^k λ_i K_i  and  [ t   y^T            ]
                                  [ y   I + K_tr/ρ^2   ] ⪰ 0.    (22)

The solution of this optimization problem is the parameter λ* giving the optimal convex combination of the set of kernels {K_1 ··· K_k} which minimizes d* from (10).

D. Rank-one kernel optimization

The kernel optimization problem in the form (20)-(21)-(22) is not tractable and cannot be efficiently solved by standard optimization software. In this section we use a rank-one decomposition of the kernels to find an equivalent formulation as a linear program. This is done through the introduction of several intermediate problems. We assume that we can write the regularized kernel as a convex combination of dyads: K_ρ = Σ_{i=1}^p ν_i l_i l_i^T, where the l_i are vectors and the ν_i are positive scalars such that Σ_{i=1}^p ν_i = 1. Since by definition K_ρ = I + K/ρ^2, the decomposition of K_ρ into a sum of dyads is possible whenever the kernel K itself can be written as a sum of dyads. In practice the kernel is a positive semi-definite matrix, so it can be diagonalized in an orthonormal basis and this property is satisfied. Thus we can write an equivalent formulation of problem (14)-(15)-(16) as:

    Ψ = min_ν y^T K_ρ^{-1} y    (23)
    subject to ν_i ≥ 0,  Σ_{i=1}^p ν_i = 1    (24)
    K_ρ = Σ_{i=1}^p ν_i l_i l_i^T    (25)

where the vectors l_i are the eigenvectors of the matrices K_j from equation (16). Introducing the change of variable κ = K_ρ^{-1}(ν) y and doing some computations enables one to rewrite problem (23)-(24)-(25) as:

    Ψ = max_κ ( 2 y^T κ − max_{1≤i≤p} (l_i^T κ)^2 )    (26)

and the optimal κ is related to the optimal ν by the relation:

    κ* = K_ρ^{-1}(ν*) y.    (27)

One can note that solving problem (26) for the vector variable κ is the same as solving the problem:

    Ψ = max_{γ,β} ( 2 γ y^T β − γ^2 max_{1≤i≤p} (l_i^T β)^2 )    (28)

for the variables γ and β. This is simply obtained by writing κ = γ β in problem (26), with γ scalar and β vector. The optimal point (γ*, β*) of problem (28) satisfies:

    Ψ^{1/2} β* = γ* β* = κ*.    (29)

If we maximize over γ in (28) we obtain the following optimization problem:

    Ψ^{1/2} = max_β y^T β    (30)
    subject to |l_i^T β| ≤ 1,  i = 1 ··· p.    (31)

The Lagrangian of this problem can be written as:

    L(β, u) = y^T β + Σ_{i=1}^p ( |u_i| − u_i l_i^T β )    (32)

and taking the Lagrangian dual of problem (30)-(31) yields:

    Ψ^{1/2} = min_u ||u||_1    (33)
    subject to y = Σ_{i=1}^p u_i l_i    (34)

using the strict feasibility of the primal and the convexity of the primal and the dual. Problem (33)-(34) is a linear program. The optimal ν* can be retrieved from the optimal u* through the relation:

    ν_i* = |u_i*| / Ψ^{1/2}.    (35)

Indeed one can check that with this value of the vector ν, equations (27)-(29) yield Ψ^{1/2} K_ρ(ν*) β* = y, and on the other hand we can write Ψ^{1/2} K_ρ(ν*) β* = Σ_{i=1}^p |u_i*| (l_i^T β*) l_i, which is equal to Σ_{i=1}^p u_i l_i using the optimality condition in the Lagrangian (32). This proves that if u* is optimal for (33)-(34), then ν* given by (35) is optimal for (23)-(24)-(25), and vice-versa.

E. Choice of kernels

Several types of kernels are commonly used in machine learning [23]. Here, we propose to combine several classical kernels with a kernel motivated by the known physical properties of the phenomenon we want to estimate.

1) Classical kernels: We consider a Gaussian kernel K^σ defined by k^σ_ij = exp(−|x_i − x_j|^2 / σ^2). We also use a linear kernel K^lin defined by k^lin_ij = x_i^T x_j. Since we use a rank-one decomposition of each kernel, the regression problem with the linear kernel K^lin is expected to produce a slightly better estimate than regular linear least-squares, because in the kernel method the weights of the eigenvectors can be different.

2) Physics of the phenomenon: Since we are interested in estimating the travel time across a traffic light intersection, we consider the physical properties of this phenomenon as described in [1]. According to the authors, a reasonable model is the following: the travel time across a traffic light intersection increases suddenly at the beginning of the red light and decreases linearly from there until the next beginning of the red light. This motivates us to use a piecewise linear function φ(·) as a mapping. In order to do so, we assume the traffic cycle length to be constant with value c. The slope of the linear function is left free and is chosen to be an optimization parameter. These considerations lead us to define the mapping function for the third kernel as φ^phy(x) = x mod c, where c is the traffic cycle length. It is motivated by the fact that according to the model the phenomenon is periodic with period c, and on a period the relation between the entry time and the travel time is linear. We assume that c = 60 seconds.
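As a sketch of the kernel combination problem (17)-(19): the paper solves the semi-definite program (20)-(22) with CVX/SDPT3 in Matlab; the Python sketch below uses the analogous disciplined convex programming package cvxpy, whose matrix_frac atom expresses y^T K^{-1} y directly, so the epigraph/Schur-complement step is handled internally. We read (16) as combining the unregularized kernels and then adding the I + K/ρ^2 regularization; this reading, and all names, are ours:

```python
import numpy as np
import cvxpy as cp

def optimal_kernel_weights(kernels_tr, y_tr, rho):
    """Solve min_lambda y^T K_rho,tr^{-1} y over the simplex, cf. (17)-(19).
    kernels_tr: list of training-block kernel matrices K_i (n_tr x n_tr).
    Requires an SDP-capable solver (e.g. SCS, shipped with cvxpy)."""
    k, n = len(kernels_tr), kernels_tr[0].shape[0]
    lam = cp.Variable(k, nonneg=True)                       # weights, cf. (18)
    K_rho = np.eye(n) + sum(lam[i] * kernels_tr[i] for i in range(k)) / rho**2
    prob = cp.Problem(cp.Minimize(cp.matrix_frac(y_tr, K_rho)),
                      [cp.sum(lam) == 1])                   # simplex, cf. (18)
    prob.solve()
    return lam.value, prob.value
```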
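The rank-one reformulation (33)-(34) is a plain linear program, and (35) recovers the dyad weights from its solution. A sketch (row i of the matrix L holds the direction l_i; variable names are ours):

```python
import numpy as np
import cvxpy as cp

def rank_one_kernel_lp(L, y):
    """Linear program (33)-(34): min ||u||_1  s.t.  sum_i u_i l_i = y,
    with l_i the rows of L. Returns nu* via (35) and Psi^{1/2}."""
    u = cp.Variable(L.shape[0])
    prob = cp.Problem(cp.Minimize(cp.norm1(u)), [L.T @ u == y])
    psi_sqrt = prob.solve()       # optimal value is Psi^{1/2}, cf. (33)
    nu = np.abs(u.value) / psi_sqrt   # relation (35)
    return nu, psi_sqrt
```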
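The three kernels of section III-E, written as Gram-matrix constructors for scalar entry times (the physical kernel is the inner product induced by the mapping φ^phy(x) = x mod c, with the assumed cycle length c = 60 s; a free slope in the piecewise linear model only rescales the mapping):

```python
import numpy as np

def linear_kernel(xa, xb):
    # k_ij = x_i x_j (scalar inputs).
    return np.outer(xa, xb)

def gaussian_kernel(xa, xb, sigma=100.0):
    # k_ij = exp(-|x_i - x_j|^2 / sigma^2).
    return np.exp(-np.subtract.outer(xa, xb) ** 2 / sigma ** 2)

def physical_kernel(xa, xb, c=60.0):
    # Mapping phi_phy(x) = x mod c (position within the signal cycle);
    # the kernel is the induced inner product phi(x_i) * phi(x_j).
    return np.outer(np.mod(xa, c), np.mod(xb, c))
```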


IV. SIMULATION RESULTS

A. Dataset description

We use a dataset generated by a traffic micro-simulator, Paramics [4]. It is based on car-following theory and driving behavior modeling and has been the subject of extensive research funded by Caltrans. It is assumed to accurately reproduce the macroscopic properties of traffic as well as inconsistent driving patterns observed in real life. Thus it provides a challenging dataset for estimating the performance of our algorithm. The dataset consists of 1055 pairs (x_i, y_i) where x_i is an entry time and y_i is the travel time of a vehicle entering the road section at time x_i. This dataset has been generated for a road segment in Berkeley, California. It consists of an arterial link of length 1207 feet, and the simulation has been run for half an hour, between 3:30 PM and 4:00 PM, on a week day.

B. Analysis method

Given an entry time x_i, we would like to provide a travel time estimate ŷ_i. We are interested in the quadratic error between this estimate ŷ_i and the effective travel time y_i. In order to evaluate the performance of the techniques described in section III, we follow the method described below. The error metric used is an L2 relative norm.
• We consider a training set whose size is one half of the size of the whole dataset. This can be considered to model the fact that we know one half of the travel times of vehicles flowing on the road section, and we want to estimate the travel times of the other vehicles.
• In order to define the optimal regularization parameter for the training set, we define a 5-fold partition of this set. We use cross-validation as defined in section III-B on this 5-fold partition. Namely, we use one of the five subsets as a training set, and the remainder serves as the test set. We solve the optimization problem described in section III on each of the five training sets, and for several values of ρ, and we pick the value which minimizes the error metric.
• Having defined a regularization parameter at the previous step, we compute the optimal weight vector from the training set and evaluate the error on the test set.
• We iterate this method for different training sets, each being in size one half of the whole dataset, and we average the errors obtained. The results are given in Table I for different convex combinations of kernels.

These computations are executed with Matlab, and the optimization problems are solved with CVX, a disciplined convex programming tool, using the solver SDPT3.

C. Results and discussion

    Kernel            Error on training set    Error on test set
    K^lin                    1.09                    1.10
    K^phy                    1.07                    1.09
    K^σ                      0.79                    0.76
    K^lin + K^phy            1.07                    1.10
    K^lin + K^σ              0.79                    0.76
    K^phy + K^σ              0.77                    0.74

Table I. Values of the L2 relative error on the training set and on the test set for different combinations of kernels, for a training set of size 50% of the size of the whole dataset. We denote by K^σ the Gaussian kernel (with σ = 100), by K^lin the linear kernel, and by K^phy the physical kernel.

The results shown in Table I yield several observations. First, the high values of the relative errors must be compared to usual techniques, such as conventional linear least-squares. For this dataset, the linear least-squares optimal estimate gives an L2 relative error of 1 on the training set, because it produces an estimate without bias. The error on the test set is as close to 1 as the distribution of the training set is close to the distribution of the test set. When using only one kernel, the estimate performs at least almost as well as the linear least-squares estimate. The performance of a combination of kernels is at least as good as the performance of the best of these kernels, and is better when the two kernels have different features. In the case of a Gaussian kernel and a physical kernel, the result is improved because the physical kernel uses a feature (the operator mod) which does not exist in the Gaussian kernel. In the case of the linear kernel and the Gaussian kernel, there is no added value compared to the Gaussian kernel alone. The benefit of a combination of kernels is that if one does not know a priori which kernel performs better, the optimization algorithm picks an optimal combination and yields a mixed non-linear estimate which, as illustrated in Figure 1, is able to locally capture trends of the phenomenon. This property of the estimate does not appear in the comparison proposed in Table I but is really useful for traffic applications. Here the linear least-squares estimate has a constant value at the mean value of the dataset, whereas the estimate given by a combination of the kernels oscillates. This is the subject of ongoing research which uses the same formalism with other norms, such as the H1 norm instead of the L2 norm, in the objective function of (5).

[Figure 1 — "Centered travel time at a traffic light intersection": centered travel time (sec) versus time (3:28 PM to 4:00 PM). Observed travel times (blue sparse points) and estimated travel times (red smooth curve) for a linear combination of a Gaussian kernel (with σ = 100) and a linear kernel.]
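The L2 relative error used in Table I requires a normalization; the sketch below normalizes by the centered output norm, which is our reading of the statement that the unbiased (constant-at-the-mean) linear least-squares estimate scores exactly 1 on the training set:

```python
import numpy as np

def l2_relative_error(y_hat, y):
    # ||y_hat - y||_2 / ||y - mean(y)||_2: the constant estimate at the
    # mean of y scores exactly 1, matching the baseline of section IV-C.
    return np.linalg.norm(y_hat - y) / np.linalg.norm(y - np.mean(y))
```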


V. CONCLUSIONS AND FUTURE WORK

The results presented in the previous sections show that even if the accuracy obtained by the kernel regression technique is not spectacular, an added value is the fact that the estimate is able to follow the trend of the travel time. The kernel regression technique makes it possible to add kernels to the set used in order to provide a richer signal with better accuracy. Thus, extensions to this work include the use of different kernels offering other features to improve the results obtained. In particular, it would be satisfying to reach sufficiently good estimation accuracy with kernels only based on the physical properties of the road and some varying parameters (weather, time of day). Applying results from support vector machine theory allowing one to bound the error of classifiers would significantly improve the quality of our travel time estimate. The estimated travel time on a road segment may be as important for practitioners as its range of variations. This is related to the discussion in section IV-C on the ongoing research focusing on the use of other norms in the regression problem (5), and on how to find a tractable and efficient way to solve the problem in these cases.

REFERENCES

[1] X. Ban, R. Herring, P. Hao, and A. Bayen. Delay pattern estimation for signalized intersections using sampled travel times. In Transportation Research Board Annual Meeting, Washington, D.C., January 10-14 2008.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] T. Choe, A. Skabardonis, and P. Varaiya. Freeway performance measurement system: an operational analysis tool. Transportation Research Record, 1811:67–75, 2002.
[4] L. Chu, H. Liu, and W. Recker. Using microscopic simulation to evaluate potential intelligent transportation system strategies under nonrecurrent congestion. Transportation Research Record, 1886:76–84, 2004.
[5] B. Coifman. Estimating travel times and vehicle trajectories on freeways using dual loop detectors. Transportation Research Part A, 36(4):351–364, 2002.
[6] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.
[7] B. Efron and G. Gong. A leisurely look at the bootstrap, the jackknife, and cross-validation. American Statistician, pages 36–48, 1983.
[8] H. Engl, K. Kunisch, and A. Neubauer. Convergence rates for Tikhonov regularisation of nonlinear ill-posed problems. Inverse Problems, 5(4):523–540, 1989.
[9] M. Garavello and B. Piccoli. Traffic Flow on Networks. American Institute of Mathematical Sciences, Springfield, USA, 2006.
[10] B. Greenshields. A study of traffic capacity. Proceedings of the Highway Research Board, 14(1):448–477, 1935.
[11] R. Herring, A. Hofleitner, S. Amin, T. Abou Nasr, A. Abdel Khalek, P. Abbeel, and A. Bayen. Using mobile phones to forecast arterial traffic through statistical learning. Submitted to Transportation Research Board, 2009.
[12] B. Hoh, M. Gruteser, R. Herring, J. Ban, D. Work, J.-C. Herrera, A. Bayen, and Q. Jacobson. Virtual trip lines for distributed privacy-preserving traffic monitoring. In 6th International Conference on Mobile Systems, Applications, and Services, pages 15–28, Breckenridge, CO, June 17-18 2008.
[13] Z. Jia, C. Chen, B. Coifman, and P. Varaiya. The PeMS algorithms for accurate, real-time estimates of g-factors and speeds from single-loop detectors. In Proceedings of the 2001 IEEE Intelligent Transportation Systems Conference, pages 536–541, 2001.
[14] P. Kachroo, K. Ozbay, and A. Hobeika. Real-time travel time estimation using macroscopic traffic flow models. In Proceedings of the 2001 IEEE Intelligent Transportation Systems Conference, pages 132–137, 2001.
[15] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[16] M. Lighthill and G. Whitham. On kinematic waves II: a theory of traffic flow on long crowded roads. Proceedings of the Royal Society of London, 229(1178):317–345, 1956.
[17] G. Newell. A simplified car-following theory: a lower order model. Transportation Research Part B, 36(3):195–205, 2002.
[18] D. Nikovski, N. Nishiuma, Y. Goto, and H. Kumazawa. Univariate short-term prediction of road travel times. In Proceedings of the 2005 IEEE Intelligent Transportation Systems Conference, pages 1074–1079, 2005.
[19] T. Park and S. Lee. A Bayesian approach for estimating link travel time on urban arterial road network. Lecture Notes in Computer Science, pages 1017–1025, 2004.
[20] K. Petty, P. Bickel, M. Ostland, J. Rice, F. Schoenberg, J. Jiang, and Y. Ritov. Accurate estimation of travel times from single-loop detectors. Transportation Research Part A, 32(1):1–17, 1998.
[21] J. Rice and E. van Zwet. A simple and effective method for predicting travel times on freeways. IEEE Transactions on Intelligent Transportation Systems, 5(3):200–207, 2004.
[22] P. Richards. Shock waves on the highway. Operations Research, 4(1):42–51, 1956.
[23] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[24] A. Skabardonis and N. Geroliminis. Real-time estimation of travel times on signalized arterials. In Proceedings of the 16th International Symposium on Transportation and Traffic Theory, 2005.
[25] K. Srinivasan and P. Jovanis. Determination of number of probe vehicles required for reliable travel time measurement in urban network. Transportation Research Record, 1537:15–22, 1996.
[26] A. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 4:1035–1038, 1963.
[27] P. Turney. A theory of cross-validation error. Journal of Experimental and Theoretical Artificial Intelligence, 6:361–361, 1994.
[28] J. van Lint and H. van Zuylen. Monitoring and predicting freeway travel time reliability: using width and skew of day-to-day travel time distribution. Transportation Research Record, 1917:54–62, 2005.
[29] D. Work, O.-P. Tossavainen, S. Blandin, A. Bayen, T. Iwuchukwu, and K. Tracton. An ensemble Kalman filtering approach to highway traffic estimation using GPS enabled mobile devices. In 47th IEEE Conference on Decision and Control, Cancun, Mexico, pages 5062–5068, 2008.
[30] J. Yeon, L. Elefteriadou, and S. Lawphongpanich. Travel time estimation on a freeway using discrete time Markov chains. Transportation Research Part B, 42(4):325–338, 2008.
