WDS'08 Proceedings of Contributed Papers, Part I, 88–93, 2008. ISBN 978-80-7378-065-4 © MATFYZPRESS

Total Least Squares Approach in Regression Methods

M. Pešta

Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic.

Abstract. Total least squares (TLS) is a data modelling technique which can be used for many types of statistical analysis, e.g. regression. In the regression setup, both dependent and independent variables are considered to be measured with errors. Therefore, the TLS approach is sometimes called errors-in-variables (EIV) modelling and, moreover, this type of regression is usually known as orthogonal regression. We consider an EIV regression model. Necessary algebraic tools are introduced in order to construct the TLS estimator, and a comparison with the classical estimator is illustrated. Consequently, the existence and uniqueness of the TLS estimator are discussed. Finally, we show the large sample properties of the TLS estimator, i.e. strong and weak consistency and the asymptotic distribution.

Introduction

Observing several characteristics, which may be thought of as variables, straightforwardly raises a natural question: "What is the relationship between these measured characteristics?" One of many possible attitudes is that some of the characteristics might be explained by a (functional) dependence on the other characteristics. Therefore, we consider the first-mentioned variables as dependent or response and the second ones as independent or explanatory. Our proposed model of dependence contains errors in the response variable (we think only of one dependent variable) as well as in the explanatory variables. But first, we simply try to find an appropriate fit for some points in Euclidean space using a hyperplane, i.e. approximating several incompatible linear relations. Afterwards, some properties of the measurement errors are added and, hence, several asymptotic statistical qualities are developed.

Overdetermined System

Let us consider the overdetermined system of linear relations

$$y \approx X\beta, \quad y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times m}, \; n > m. \qquad (1)$$

Relations in (1) are deliberately not denoted as equations because, in many cases, an exact solution need not exist. Thereby, only an approximation can be found. Hence, one can speak about the "best" solution of the overdetermined system (1). But the "best" in which way?

Singular Value Decomposition

Before inquiring into an appropriate solution of (1), we should introduce some very important tools for further exploration.

Theorem (Singular Value Decomposition – SVD). If $A \in \mathbb{R}^{n \times m}$, then there exist orthonormal matrices $U = [u_1, \dots, u_n] \in \mathbb{R}^{n \times n}$ and $V = [v_1, \dots, v_m] \in \mathbb{R}^{m \times m}$ such that

$$U^\top A V = \Sigma = \mathrm{diag}\{\sigma_1, \dots, \sigma_p\} \in \mathbb{R}^{n \times m}, \quad \sigma_1 \geq \dots \geq \sigma_p \geq 0, \quad p = \min\{n, m\}. \qquad (2)$$

Proof. See Golub and Van Loan [1996]. □

In the SVD, the diagonal matrix $\Sigma$ is uniquely determined by $A$ (though the matrices $U$ and $V$ are not). The previous powerful matrix decomposition allows us to define a cutting point $r$ for a given matrix $A \in \mathbb{R}^{n \times m}$ using its singular values $\sigma_i$:

$$\sigma_1 \geq \dots \geq \sigma_r > \sigma_{r+1} = \dots = \sigma_p = 0, \quad p = \min\{n, m\}.$$


Since the matrices $U$ and $V$ in (2) are orthonormal, it follows that $\mathrm{rank}(A) = r$, and one may obtain a dyadic decomposition (expansion) of the matrix $A$:

$$A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top. \qquad (3)$$

A suitable matrix norm is also required; hence, the Frobenius norm for a matrix $A \equiv (a_{ij})_{i,j=1}^{n,m}$ is defined as follows:

$$\|A\|_F := \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)} = \sqrt{\sum_{i=1}^{p} \sigma_i^2} = \sqrt{\sum_{i=1}^{r} \sigma_i^2}, \quad p = \min\{n, m\}. \qquad (4)$$

Furthermore, the following approximation theorem plays the main role in the forthcoming derivation, where a matrix is approximated by another one of lower rank.

Theorem (Eckart-Young-Mirsky Matrix Approximation). Let the SVD of $A \in \mathbb{R}^{n \times m}$ be given by $A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$ with $\mathrm{rank}(A) = r$. If $k < r$ and $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, then

$$\min_{\mathrm{rank}(B) = k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}. \qquad (5)$$

Proof. See Eckart and Young [1936] and Mirsky [1960]. □

Above all, one more technical property needs to be incorporated.

Theorem (Sturm Interlacing Property). Let $n \geq m$ and let the singular values of $A \in \mathbb{R}^{n \times m}$ be $\sigma_1 \geq \dots \geq \sigma_m$. If $B$ results from $A$ by deleting one column of $A$ and $B$ has singular values $\sigma'_1 \geq \dots \geq \sigma'_{m-1}$, then

$$\sigma_1 \geq \sigma'_1 \geq \sigma_2 \geq \sigma'_2 \geq \dots \geq \sigma'_{m-1} \geq \sigma_m \geq 0. \qquad (6)$$

Proof. See Thompson [1972]. □
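The identity (5) is easy to verify numerically. The following minimal sketch (an illustration of ours, not part of the original exposition) computes the SVD of an arbitrary random matrix with numpy and compares the Frobenius misfit of the best rank-$k$ approximation with the tail of the singular values; the matrix size and the rank $k$ are arbitrary choices.

```python
import numpy as np

# Sketch: SVD (2) and the Eckart-Young-Mirsky identity (5) on an arbitrary matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s holds sigma_1 >= ... >= sigma_p

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation A_k from (5)

# The Frobenius misfit ||A - A_k||_F equals sqrt(sigma_{k+1}^2 + ... + sigma_p^2).
print(np.linalg.norm(A - A_k, "fro"), np.sqrt(np.sum(s[k:] ** 2)))  # both agree
```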

Total Least Squares Solution

Now, three basic ways of approximating the overdetermined system (1) are suggested. The traditional approach penalizes only the misfit in the dependent variable part

$$\min_{\epsilon \in \mathbb{R}^n, \, \beta \in \mathbb{R}^m} \|\epsilon\|_2 \quad \text{s.t.} \quad y + \epsilon = X\beta \qquad (7)$$

and is called the ordinary least squares (OLS). Here, the data matrix $X$ is thought of as exactly known and errors occur only in the vector $y$. An opposite case to the OLS is represented by the data least squares (DLS), which allows corrections only in the explanatory variables (the independent input data):

$$\min_{\Theta \in \mathbb{R}^{n \times m}, \, \beta \in \mathbb{R}^m} \|\Theta\|_F \quad \text{s.t.} \quad y = (X + \Theta)\beta. \qquad (8)$$

Finally, we concentrate on the total least squares approach, minimizing the sum of squared errors in the values of both dependent and independent variables:

$$\min_{[\varepsilon, \Xi] \in \mathbb{R}^{n \times (m+1)}, \, \beta \in \mathbb{R}^m} \|[\varepsilon, \Xi]\|_F \quad \text{s.t.} \quad y + \varepsilon = (X + \Xi)\beta. \qquad (9)$$

A graphical illustration of the three previous cases can be found in Figure 1. One may notice that the TLS "searches" for the orthogonal projection of the observed data onto the unknown approximation corresponding to a TLS solution. Once a minimizing $[\hat{\varepsilon}, \hat{\Xi}]$ of the TLS problem (9) is found, any $\beta$ satisfying $y + \hat{\varepsilon} = (X + \hat{\Xi})\beta$ is called a TLS solution. The "basic" form of the TLS solution was investigated for the first time by Golub and Van Loan [1980].


[Figure 1 here: four panels titled OLS, DLS, TLS, and "Various Least Squares Fit" (all three fits overlaid), each plotted on axes from 0 to 5.]

Figure 1. Various least squares fits (ordinary, data, and total LS) for the same three data points in the two-dimensional plane, which coincides with the regression setup of one response and one explanatory variable.
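As a complement to Figure 1, the following sketch computes the three fits for one response and one explanatory variable with an intercept. The three data points are hypothetical (not the ones used in the figure); the DLS fit is obtained here by the elementary device of regressing $x$ on $y$ and inverting the fitted line, and the TLS fit by the orthogonal-regression construction via the SVD of the centered data.

```python
import numpy as np

# Hypothetical data points for one response y and one regressor x.
x = np.array([1.0, 2.0, 4.0])
y = np.array([1.0, 3.0, 4.0])

# OLS: minimize vertical misfits, y ~ a + b*x.
b_ols, a_ols = np.polyfit(x, y, 1)

# DLS: corrections only in x, i.e. regress x on y (x ~ c + d*y) and invert the line.
d, c = np.polyfit(y, x, 1)
a_dls, b_dls = -c / d, 1.0 / d

# TLS: orthogonal regression; the fitted line passes through the centroid and its
# normal vector is the right singular vector with the smallest singular value.
Z = np.column_stack([x - x.mean(), y - y.mean()])
nx, ny = np.linalg.svd(Z)[2][-1]
b_tls = -nx / ny
a_tls = y.mean() - b_tls * x.mean()

print((a_ols, b_ols), (a_dls, b_dls), (a_tls, b_tls))  # three different lines
```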

Theorem (TLS Solution of $y \approx X\beta$). Let the SVD of $X$ be given by $X = \sum_{i=1}^{m} \sigma'_i u'_i v_i'^\top$ and the SVD of $[y, X]$ by $[y, X] = \sum_{i=1}^{m+1} \sigma_i u_i v_i^\top$. If $\sigma'_m > \sigma_{m+1}$, then

$$[\hat{y}, \hat{X}] := [y + \hat{\varepsilon}, X + \hat{\Xi}] = U \hat{\Sigma} V^\top, \quad \hat{\Sigma} = \mathrm{diag}\{\sigma_1, \dots, \sigma_m, 0\}, \qquad (10)$$

with the corresponding TLS correction matrix

$$[\hat{\varepsilon}, \hat{\Xi}] = -\sigma_{m+1} u_{m+1} v_{m+1}^\top, \qquad (11)$$

solves the TLS problem, and

$$\hat{\beta} = \frac{1}{-e_1^\top v_{m+1}} \, [v_{2,m+1}, \dots, v_{m+1,m+1}]^\top \qquad (12)$$

exists and is the unique solution to $\hat{y} = \hat{X}\beta$.

Proof. By contradiction, we firstly show that $e_1^\top v_{m+1} \neq 0$. Suppose $v_{1,m+1} = 0$; then there exists $w \in \mathbb{R}^m$, $w \neq 0$, $\|w\|_2 = 1$, such that

$$[0, w^\top] \, [y, X]^\top [y, X] \begin{bmatrix} 0 \\ w \end{bmatrix} = \sigma_{m+1}^2,$$

which yields $w^\top X^\top X w = \sigma_{m+1}^2$. But this is a contradiction with the assumption $\sigma'_m > \sigma_{m+1}$, since $\sigma_m'^2$ is the smallest eigenvalue of $X^\top X$.

The Sturm interlacing theorem (6) and the assumption $\sigma'_m > \sigma_{m+1}$ yield $\sigma_m > \sigma_{m+1}$. Therefore, $\sigma_{m+1}$ is not a repeated singular value of $[y, X]$, and $\sigma_m > 0$.

If $\sigma_{m+1} \neq 0$, then $\mathrm{rank}([y, X]) = m + 1$. We want to find $[\hat{y}, \hat{X}]$ such that $\|[y, X] - [\hat{y}, \hat{X}]\|_F$ is minimal and $[\hat{y}, \hat{X}][-1, \beta^\top]^\top = 0$ for some $\beta$. Therefore, $\mathrm{rank}([\hat{y}, \hat{X}]) = m$ and, applying the Eckart-Young-Mirsky


theorem (5), one may easily obtain the SVD of $[\hat{y}, \hat{X}]$ in (10) and the TLS correction matrix (11), which must have rank one. Now, it is clear that the TLS solution is given by the last column of $V$. Finally, since $\dim \mathrm{Ker}([\hat{y}, \hat{X}]) = 1$, the TLS solution (12) must be unique.

If $\sigma_{m+1} = 0$, then $v_{m+1} \in \mathrm{Ker}([y, X])$ and $[y, X][-1, \beta^\top]^\top = 0$. Hence, no approximation is needed, the overdetermined system (1) is compatible, and the exact TLS solution is given by (12). Uniqueness of this TLS solution follows from the fact that $[-1, \beta^\top]^\top \perp \mathrm{Range}([y, X]^\top)$. □

A closed-form expression of the TLS solution (12) can be derived. If $\sigma'_m > \sigma_{m+1}$, the existence and uniqueness of the TLS solution have already been shown. Thereby, since the right singular vectors $v_i$ from (10) are eigenvectors of $[y, X]^\top [y, X]$, $\hat{\beta}$ also satisfies

$$[y, X]^\top [y, X] \begin{bmatrix} -1 \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} y^\top y & y^\top X \\ X^\top y & X^\top X \end{bmatrix} \begin{bmatrix} -1 \\ \hat{\beta} \end{bmatrix} = \sigma_{m+1}^2 \begin{bmatrix} -1 \\ \hat{\beta} \end{bmatrix}$$

and, hence,

$$\hat{\beta} = (X^\top X - \sigma_{m+1}^2 I_m)^{-1} X^\top y. \qquad (13)$$

The previous equation reminds us of the form of an estimator in the ridge regression setup. Therefore, due to the correspondence between ridge regression and the TLS "orthogonal" regression, one may expect to avoid the multicollinearity problems of classical OLS regression (7). Expression (13) looks almost like the OLS estimator $\tilde{\beta}$ of (7), except for the term containing $\sigma_{m+1}^2$. This term is missing in the well-known OLS estimator with a full-rank regression matrix, provided by the Gauss-Markov theorem as the solution of the so-called normal equations $X^\top X \tilde{\beta} = X^\top y$. (A numerical comparison of the two routes (12) and (13) is sketched after the following summary.)

From a statistical point of view, a situation where $\sigma'_m = \sigma_{m+1}$ occurs for real data is unlikely and also quite irrelevant. But Van Huffel and Vandewalle [1991] investigated this case and concluded the following summary. Suppose $\sigma_q > \sigma_{q+1} = \dots = \sigma_{m+1}$, $q \leq m$, and denote $Q := [v_{q+1}, \dots, v_{m+1}]$. Then:

• $\sigma'_m > \sigma_{m+1}$ ⇒ the unique TLS solution (12) exists;

• $\sigma'_m = \sigma_{m+1}$ and $e_1^\top Q \neq 0$ ⇒ infinitely many TLS solutions of (9) exist and one can pick the one with the smallest norm;

• $\sigma'_m = \sigma_{m+1}$ and $e_1^\top Q = 0$ ⇒ no solution of (9) exists and one needs to define another ("more restrictive") TLS problem.

The more restrictive TLS problem mentioned previously is called a nongeneric TLS problem. Simply, an additional restriction $[\varepsilon, \Xi] Q = 0$, added to the constraints in (9), tries to "project out" the "unimportant" or "redundant" data from the original TLS problem (9).
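Under the assumption $\sigma'_m > \sigma_{m+1}$, the SVD route (12) and the closed form (13) must give the same vector. The following sketch (with synthetic data of ours) checks this agreement numerically.

```python
import numpy as np

# Synthetic overdetermined system; the check below is purely algebraic.
rng = np.random.default_rng(1)
n, m = 50, 3
X = rng.standard_normal((n, m))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(n)

C = np.column_stack([y, X])               # the compound matrix [y, X]
_, s, Vt = np.linalg.svd(C)
v = Vt[-1]                                # right singular vector for sigma_{m+1}
beta_svd = -v[1:] / v[0]                  # TLS solution (12), scaled to [-1, beta]

sigma2 = s[-1] ** 2                       # sigma_{m+1}^2
beta_closed = np.linalg.solve(X.T @ X - sigma2 * np.eye(m), X.T @ y)  # closed form (13)

print(np.allclose(beta_svd, beta_closed))  # True: both routes coincide
```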

Errors-in-Variables Model

One should not only pay attention to the existence or form of the TLS solution, but also to its properties, e.g. statistical ones. In statistics, the TLS problem (9) corresponds to a so-called errors-in-variables setup. Here, unobservable true values $y_0$ and $X_0$ satisfy a single linear relationship

$$y_0 = \alpha 1_n + X_0 \beta \qquad (14)$$

and the unknown parameters $\alpha$ (intercept) and $\beta$ (regression coefficients) need to be estimated. The observations $y$ and $X$ measure $y_0$ and $X_0$ with additive errors $\varepsilon$ and $\Xi$:

$$y = y_0 + \varepsilon, \qquad (15)$$

$$X = X_0 + \Xi. \qquad (16)$$

Rows of the errors $[\varepsilon, \Xi]$ are iid with common zero mean and covariance matrix $\sigma_\nu^2 I_{m+1}$, where $\sigma_\nu^2 > 0$ is unknown.

TLS Estimator

For simplicity, we suppose that the condition $\sigma'_m > \sigma_{m+1}$ is satisfied. Let us denote $G := I_n - \frac{1}{n} 1_n 1_n^\top$ with $1_n := [1, \dots, 1]^\top$ for practical purposes. Then, we define the estimate of the coefficient $\beta$ as the TLS solution $\hat{\beta}$ and the estimate of the intercept $\alpha$ as follows:

$$\hat{\alpha} := \bar{y} - [\bar{x}_1, \dots, \bar{x}_m] \hat{\beta}, \qquad (17)$$

where $\bar{x}_i$ denotes the average of the elements of the $i$th column of the matrix $X$. Finally, the variance term $\sigma_\nu^2$ is estimated using the singular values: $\hat{\sigma}^2 := n^{-1} \sigma_{m+1}^2$.
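A sketch of the whole estimation procedure on synthetic data may be helpful. We assume here, as the centering matrix $G$ suggests, that the TLS solution (12) is computed from the column-centered data; the intercept then follows from (17) and the error variance from $\hat{\sigma}^2 = n^{-1}\sigma_{m+1}^2$.

```python
import numpy as np

# Synthetic EIV data following (14)-(16): true values plus iid errors.
rng = np.random.default_rng(2)
n, m = 200, 2
alpha, beta = 1.5, np.array([2.0, -1.0])
X0 = rng.uniform(0, 10, size=(n, m))
y0 = alpha + X0 @ beta
sigma_nu = 0.3
y = y0 + sigma_nu * rng.standard_normal(n)          # (15)
X = X0 + sigma_nu * rng.standard_normal((n, m))     # (16)

# Centering implements G = I_n - (1/n) 1_n 1_n^T and removes the intercept.
yc, Xc = y - y.mean(), X - X.mean(axis=0)
C = np.column_stack([yc, Xc])
_, s, Vt = np.linalg.svd(C, full_matrices=False)
v = Vt[-1]
beta_hat = -v[1:] / v[0]                            # TLS solution (12) on centered data
alpha_hat = y.mean() - X.mean(axis=0) @ beta_hat    # intercept estimate (17)
sigma2_hat = s[-1] ** 2 / n                         # variance estimate n^{-1} sigma_{m+1}^2

print(alpha_hat, beta_hat, sigma2_hat)              # close to 1.5, (2, -1), 0.09
```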

Large Sample Properties

An asymptotic behaviour of an estimator is one of its basic characteristics. The asymptotic properties can provide some information about the quality (i.e. efficiency) of the estimator.

Consistency

Firstly, we provide a theorem showing the strong consistency of the TLS estimator.

Theorem (Strong Consistency). If $\lim_{n \to \infty} \frac{1}{n} X_0^\top X_0$ exists, then

$$\lim_{n \to \infty} \hat{\sigma}^2 \overset{\mathrm{a.s.}}{=} \sigma_\nu^2. \qquad (18)$$

Moreover, if $\lim_{n \to \infty} \frac{1}{n} X_0^\top G X_0 > 0$, then

$$\lim_{n \to \infty} \hat{\beta} \overset{\mathrm{a.s.}}{=} \beta, \qquad (19)$$

$$\lim_{n \to \infty} \hat{\alpha} \overset{\mathrm{a.s.}}{=} \alpha. \qquad (20)$$

Proof. See Gleser [1981]. □

The assumptions in the previous theorem are somewhat restrictive and need not be satisfied, e.g. in a univariate errors-in-variables model where the values of the independent variable vary linearly with the sample size. Therefore, these assumptions need to be weakened, yielding the following theorem.

Theorem (Weak Consistency). Suppose that the distribution of the rows of $[\varepsilon, \Xi]$ possesses finite fourth moments. Denote $X_0' := [1_n, X_0]$. If

$$\frac{1}{\sqrt{n}} \, \lambda_{\min}\!\left(X_0'^\top X_0'\right) \to \infty, \quad n \to \infty,$$

$$\frac{\lambda_{\min}^2\!\left(X_0'^\top X_0'\right)}{\lambda_{\max}\!\left(X_0'^\top X_0'\right)} \to \infty, \quad n \to \infty,$$

then

$$\begin{bmatrix} \hat{\alpha} \\ \hat{\beta} \end{bmatrix} \overset{P}{\to} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad n \to \infty. \qquad (21)$$

Proof. This can be easily derived using Theorem 2 of Gallo [1982a]. □

The notation $\lambda_{\min}$ (respectively, $\lambda_{\max}$) denotes the minimal (respectively, maximal) eigenvalue. Concerning the finiteness of the fourth moments of the rows of $[\varepsilon, \Xi]$, it has to be remarked that this mathematically means, for all $i \in \{1, \dots, n\}$,

$$\mathrm{E} \prod_j \omega_{ij}^{r_j} < \infty, \quad \omega_{ij} \in \{\varepsilon_i, \Xi_{i,1}, \dots, \Xi_{i,m}\}, \quad r_j \in \mathbb{N}, \quad \sum_j r_j = 4. \qquad (22)$$

The assumptions in the previous theorems ensure that the values of the independent variables "spread out" fast enough. Gallo [1982a] proved that the previous "intermediate" assumptions are implied by the assumptions in the theorem for strong consistency.
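The consistency statements can be illustrated by a small simulation (a synthetic setup of ours; the uniform design makes the values of $X_0$ "spread out" as required): the error of the TLS estimate should shrink as $n$ grows.

```python
import numpy as np

def tls_beta(y, X):
    # TLS solution (12) via the SVD of the compound matrix [y, X].
    v = np.linalg.svd(np.column_stack([y, X]))[2][-1]
    return -v[1:] / v[0]

rng = np.random.default_rng(3)
beta = np.array([2.0, -1.0])
for n in (50, 500, 5000, 50000):
    X0 = rng.uniform(0, 10, size=(n, 2))            # spreading design
    X = X0 + 0.3 * rng.standard_normal((n, 2))      # errors in the regressors
    y = X0 @ beta + 0.3 * rng.standard_normal(n)    # errors in the response
    print(n, np.abs(tls_beta(y, X) - beta).max())   # error decreases with n
```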

Asymptotic Distributions

Finally, an asymptotic distribution for further statistical inference has to be shown.

Theorem (Asymptotic Normality). Suppose that the distribution of the rows of $[\varepsilon, \Xi]$ possesses finite fourth moments. If

$$\lim_{n \to \infty} \frac{1}{n} X_0^\top X_0 > 0,$$

then $\sqrt{n} \begin{bmatrix} \hat{\alpha} - \alpha \\ \hat{\beta} - \beta \end{bmatrix}$ has an asymptotic zero-mean multivariate normal distribution as $n \to \infty$.

Proof. See Gallo [1982b]. □

The covariance matrix of the multivariate normal distribution from the previous theorem is not shown here due to its complicated form; one may find the formula in Gallo [1982b].
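A rough Monte Carlo illustration of the theorem may be given as follows (again a synthetic design of ours; the complicated covariance formula is not reproduced): across repetitions, $\sqrt{n}(\hat{\beta} - \beta)$ stays centered at zero with a stable spread, as the asymptotic normality predicts.

```python
import numpy as np

rng = np.random.default_rng(4)
beta, n, reps = np.array([2.0, -1.0]), 1000, 500

stats = []
for _ in range(reps):
    X0 = rng.uniform(0, 10, size=(n, 2))
    X = X0 + 0.3 * rng.standard_normal((n, 2))
    y = X0 @ beta + 0.3 * rng.standard_normal(n)
    v = np.linalg.svd(np.column_stack([y, X]))[2][-1]
    stats.append(np.sqrt(n) * (-v[1:] / v[0] - beta))  # sqrt(n) (beta_hat - beta)

stats = np.array(stats)
print(stats.mean(axis=0), stats.std(axis=0))  # mean near zero, stable spread
```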

Discussion and Conclusions

In this paper, the TLS problem is summarized from an algebraic point of view and a connection with the errors-in-variables model is shown. A unification of algebraic and numerical results with statistical ones is demonstrated. The TLS optimizing problem is defined here together with its OLS and DLS alternatives. Its solution is found using spectral information of the system, and the existence and uniqueness of this solution are discussed. The errors-in-variables model, as a correspondence to orthogonal regression, is introduced. Moreover, a comparison of the classical regression approach with the errors-in-variables setup is shown. Finally, large sample properties, such as strong and weak consistency and an asymptotic distribution of the TLS estimator (an estimator in the errors-in-variables model), are recapitulated.

For further research, one may be interested in extensions of the TLS approach. Amemiya [1997] proposed a way of first-order linearization of nonlinear relations. Computational stability could be improved using the Golub-Kahan bidiagonalization, connected with the TLS problem by Paige and Strakoš [2006]. This approach needs to be studied from the statistical point of view as well.

Acknowledgments. The present work was supported by the Grant Agency of the Czech Republic (grant 201/05/H007).

References

Amemiya, Y., Generalization of the TLS approach in the errors-in-variables problem, in Proceedings of the Second International Workshop on Total Least Squares and Errors-in-Variables Modeling, edited by S. Van Huffel, pp. 77–86, 1997.

Eckart, G. and Young, G., The approximation of one matrix by another of lower rank, Psychometrika, 1, 211–218, 1936.

Gallo, P. P., Consistency of regression estimates when some variables are subject to error, Communications in Statistics: Theory and Methods, 11, 973–983, 1982a.

Gallo, P. P., Properties of Estimators in Errors-in-Variables Models, Ph.D. thesis, Institute of Statistics Mimeoseries #1511, University of North Carolina, Chapel Hill, NC, 1982b.

Gleser, L. J., Estimation in a multivariate "errors in variables" regression model: Large sample results, Annals of Statistics, 9, 24–44, 1981.

Golub, G. H. and Van Loan, C. F., An analysis of the total least squares problem, SIAM Journal on Numerical Analysis, 17, 883–893, 1980.

Golub, G. H. and Van Loan, C. F., Matrix Computations, Johns Hopkins University Press, Baltimore, MD, 3rd edn., 1996.

Mirsky, L., Symmetric gauge functions and unitarily invariant norms, Quarterly Journal of Mathematics Oxford, 11, 50–59, 1960.

Paige, C. C. and Strakoš, Z., Core problems in linear algebraic systems, SIAM Journal on Matrix Analysis and Applications, 27, 861–875, 2006.

Thompson, R. C., Principal submatrices IX: Interlacing inequalities for singular values of submatrices, Linear Algebra and its Applications, 5, 1–12, 1972.

Van Huffel, S. and Vandewalle, J., The Total Least Squares Problem: Computational Aspects and Analysis, SIAM, Philadelphia, PA, 1991.
