arXiv:1710.05513v1 [stat.ML] 16 Oct 2017

ROBUST MAXIMUM LIKELIHOOD ESTIMATION OF SPARSE VECTOR ERROR CORRECTION MODEL

Ziping Zhao and Daniel P. Palomar

Department of Electronic and Computer Engineering,
The Hong Kong University of Science and Technology, Hong Kong

(This work was supported by the Hong Kong RGC 16206315 research grant. Ziping Zhao is supported by the Hong Kong PhD Fellowship Scheme (HKPFS). E-mail: [email protected]; [email protected].)

ABSTRACT

In econometrics and finance, the vector error correction model (VECM) is an important time series model for cointegration analysis, which is used to estimate the long-run equilibrium variable relationships. The traditional analysis and estimation methodologies assume the underlying Gaussian distribution but, in practice, heavy-tailed data and outliers can lead to the inapplicability of these methods. In this paper, we propose a robust model estimation method based on the Cauchy distribution to tackle this issue. In addition, sparse cointegration relations are considered to realize feature selection and dimension reduction. An efficient algorithm based on the majorization-minimization (MM) method is applied to solve the proposed nonconvex problem. The performance of this algorithm is shown through numerical simulations.

Index Terms: cointegration analysis, robust statistics, heavy-tails, outliers, group sparsity.

1. INTRODUCTION

The vector error correction model (VECM) [1] is very important in cointegration analysis to estimate and test for long-run cointegrated equilibriums. It is widely used in time series modeling for financial returns and macroeconomic variables. In [2, 3], Engle and Granger first proposed the concept of "cointegration" to describe the linear stationary relationships in the nonstationary time series. Later, Johansen studied the statistical estimation and inference problem in time series cointegration modeling [4, 5, 6]. A VECM for the time series y_t is given as follows:

    \Delta y_t = \nu + \Pi y_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta y_{t-i} + \varepsilon_t,    (1)

where \Delta is the first difference operator, i.e., \Delta y_t = y_t - y_{t-1}, \nu denotes the drift, \Pi determines the long-run equilibriums, \Gamma_i (i = 1, ..., p-1) contains the short-run effects, and \varepsilon_t is the innovation with mean 0 and covariance \Sigma. Matrix \Pi \in R^{K \times K} has a reduced cointegration rank, i.e., rank(\Pi) = r < K, and it can be written as \Pi = \alpha\beta^T (\alpha, \beta \in R^{K \times r}). Accordingly, y_t is said to be cointegrated with rank r, and \beta^T y_t gives the long-run stationary time series defined by the cointegration matrix \beta. Such long-run equilibriums are often implied by economic theory and can be used for statistical arbitrage [7].

It is well-known that financial returns and macroeconomic variables exhibit heavy-tails and outliers due to external factors, like political and regulatory changes, as well as wrongly processed data like data corruption, and faulty observations [8]. These stylized features typically contradict the popular Gaussian noise assumption in the theoretical analysis and estimation procedures, with adverse effects in the estimated models. Cointegration analysis is particularly sensitive to these issues. Papers [9, 10, 11] discussed the properties of the Johansen test and the Dickey-Fuller test in the presence of outliers. Lucas studied such issues both from a theoretical and an empirical point of view [12, 13, 14]. To deal with the heavy-tails and outliers in time series modeling, simple and effective estimation methods are needed. In [15], the pseudo maximum likelihood estimators were introduced for VECM. In this paper, based on [15], we formulate the estimation problem on the log-likelihood function of the Cauchy distribution, a conservative representative of the heavy-tailed distributions, to better fit the heavy-tails and dampen the influence of outliers.

Sparse optimization [16] has become the focus of much research interest as a way to realize feature selection and dimension reduction (e.g., lasso [17]). In [18], element-wise sparsity was imposed on \beta in VECM modeling. As indicated by [19, 20], the group sparsity is better to realize the feature selection since it can simultaneously reduce cointegration relations and naturally keep the low-rank geometry of the parameter space. In this paper, instead of imposing the group sparsity on \beta, we equivalently put group sparsity on \Pi and add a rank constraint for it, which can realize the same target without the factorization \Pi = \alpha\beta^T ahead. For sparsity pursuing, i.e., approximating the \ell_0-"norm", rather than the popular \ell_1-norm, we use a nonconvex Geman-type function [21] which has a better sparsity approximation power. A smoothed counterpart is also firstly proposed to reduce the "singularity issue" in optimization, based on which the group sparsity regularizer is attained.
Robust estimation is somewhat underrated in financial applications due to the complex computations that are time and resource intensive. By considering the robust loss and the sparsity regularizer, a nonconvex optimization problem is finally formulated. The expectation-maximization (EM) method is usually used to solve the robust losses (e.g., [22]). However, EM cannot be applied for our formulation. To deal with it, an efficient algorithm based on the majorization-minimization (MM) method is proposed, with estimation performance numerically shown.

2. ROBUST ESTIMATION OF SPARSE VECM

Suppose a sample path \{y_t\}_{t=1}^{N} (N > K) and the needed pre-sample values are available; then the VECM (1) can be written into a matrix form as follows:

    \Delta Y = \Pi Y_{-1} + \Gamma \Delta X + E,    (2)

where \Gamma = [\Gamma_1, ..., \Gamma_{p-1}, \nu], \Delta Y = [\Delta y_1, ..., \Delta y_N], Y_{-1} = [y_0, ..., y_{N-1}], \Delta X = [\Delta x_1, ..., \Delta x_N] with \Delta x_t = [\Delta y_{t-1}^T, ..., \Delta y_{t-p+1}^T, 1]^T, and E = [\varepsilon_1, ..., \varepsilon_N].

2.1. Robustness Pursued by Cauchy Log-likelihood Loss

The robustness is pursued by a multivariate Cauchy distribution. Assume the innovations \varepsilon_t's in (1) follow a Cauchy distribution, i.e., \varepsilon_t \sim Cauchy(0, \Sigma) with \Sigma \in S_{++}^{K}; then the probability density function is given by

    g_\theta(\varepsilon_t) = \frac{\Gamma(\frac{1+K}{2})}{\Gamma(\frac{1}{2}) \pi^{K/2}} [\det(\Sigma)]^{-1/2} \left(1 + \varepsilon_t^T \Sigma^{-1} \varepsilon_t\right)^{-\frac{1+K}{2}}.

The negative log-likelihood loss function of the Cauchy distribution for N samples from (1) is written as follows:

    L(\theta) = \frac{N}{2} \log\det(\Sigma) + \frac{1+K}{2} \sum_{i=1}^{N} \log\left(1 + \left\|\Sigma^{-1/2}\left(\Delta y_i - \Pi y_{i-1} - \Gamma \Delta x_i\right)\right\|_2^2\right),    (3)

where the constants are dropped and \theta \triangleq \{\Pi(\alpha, \beta), \Gamma, \Sigma\}.

2.2. Group Sparsity Pursued by Nonconvex Regularizer

For a vector x \in R^K, the sparsity level is usually measured by the \ell_0-"norm" as \|x\|_0 = \sum_{i=1}^{K} sgn(|x_i|) = k, where k is the number of nonzero entries in x. Generally, applying the \ell_0-"norm" to different groups of variables can enforce group sparsity in the solutions. The \ell_0-"norm" is not convex and not continuous, which makes it computationally difficult and leads to intractable NP-hard problems. So, the \ell_1-norm, as the tightest convex relaxation, is usually used to approximate the \ell_0-"norm" in practice, which is easier for optimization and still favors sparse solutions.

Tighter nonconvex sparsity-inducing functions can lead to better performance [16]. In this paper, to better pursue the sparsity and to remove the "singularity issue", i.e., when using nonsmooth functions, the variable may get stuck at a nonsmooth point [23], a smooth nonconvex function based on the rational (Geman) function in [21] is used, given as follows:

    rat_p^{\epsilon}(x) = \begin{cases} \frac{p x^2}{2\epsilon(p+\epsilon)^2}, & |x| \le \epsilon, \\ \frac{|x|}{p+|x|} - \frac{2\epsilon^2 + p\epsilon}{2(p+\epsilon)^2}, & |x| > \epsilon. \end{cases}

In order to attain feature selection in VECM, i.e., sparse cointegration relations, according to [19, 20], we can impose the row-wise group sparsity on matrix \beta. In fact, due to \Pi = \alpha\beta^T, the row-wise sparsity imposed on \beta can also be realized by directly estimating \Pi through imposing the column-wise group sparsity on \Pi and constraining its rank. Then we have the sparsity regularizer of matrix \Pi, which is given by

    R(\Pi) = \sum_{i=1}^{K} rat_p^{\epsilon}(\|\pi_i\|_2),    (4)

where \pi_i (i = 1, ..., K) denotes the ith column of \Pi. The grouping effect is achieved by taking the \ell_2-norm of each group, and then applying the group regularization.

2.3. Problem Formulation

By combining the robust loss function (3) and the sparsity regularizer (4), we attain a penalized maximum likelihood estimation formulation which is specified as follows:

    \underset{\theta = \{\Pi, \Gamma, \Sigma\}}{\text{minimize}} \quad F(\theta) \triangleq L(\theta) + \xi R(\Pi)
    \text{subject to} \quad rank(\Pi) \le r, \quad \Sigma \succ 0.    (5)

This is a constrained smooth nonconvex problem due to the nonconvexity of the objective function and the constraint set.

3. PROBLEM SOLVING VIA THE MM METHOD

The MM method [24, 25, 26] is a generalization of the well-known EM method. For an optimization problem given by

    \underset{x}{\text{minimize}} \quad f(x) \quad \text{subject to} \quad x \in \mathcal{X},

instead of dealing with this problem directly, which could be difficult, the MM-based algorithm solves a series of simpler subproblems with surrogate functions that majorize f(x) over \mathcal{X}. More specifically, starting from an initial point x^{(0)}, it produces a sequence \{x^{(k)}\} by the following update rule:

    x^{(k)} \in \arg\min_{x \in \mathcal{X}} \bar{f}(x, x^{(k-1)}),

where the surrogate majorizing function \bar{f}(x, x^{(k)}) satisfies

    \bar{f}(x^{(k)}, x^{(k)}) = f(x^{(k)}), \quad \forall x^{(k)} \in \mathcal{X},
    \bar{f}(x, x^{(k)}) \ge f(x), \quad \forall x, x^{(k)} \in \mathcal{X},
    \bar{f}'(x^{(k)}, x^{(k)}; d) = f'(x^{(k)}; d), \quad \forall d \ \text{s.t.} \ x^{(k)} + d \in \mathcal{X}.

The objective function value is monotonically nonincreasing at each iteration. In order to use the MM method, the key step is to find a majorizing function to make the subproblem easy to solve, which will be discussed in the following subsections.

3.1. Majorization for the Robust Loss Function L(\theta)

Instead of using the EM method [22], in this paper, we derive the majorizing function for L(\theta) from an MM perspective.

[Figure: the fractional (rational) function, its \epsilon-smoothed version, and the majorizing function at a majorizing point.]

Lemma 1. At any point x^{(k)} \in R, \log(1+x) \le \log(1+x^{(k)}) + \frac{1}{1+x^{(k)}}(x - x^{(k)}), with the equality attained at x = x^{(k)}.

Based on Lemma 1, at the iterate \theta^{(k)}, the loss function L(\theta) can be majorized by the following function:

    \bar{L}(\theta, \theta^{(k)}) \simeq \frac{N}{2} \log\det(\Sigma) + \frac{1}{2} \left\|\Sigma^{-1/2}\left(\Delta\bar{Y} - \Pi\bar{Y}_{-1} - \Gamma\Delta\bar{X}\right)\right\|_F^2,
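The two nonconvex ingredients used in this section, the \epsilon-smoothed rational penalty and the Lemma 1 tangent-line bound, can be sanity-checked numerically. The sketch below is not the paper's code; the helper names rat and log1p_major are invented for this illustration, with p = 1 and \epsilon = 0.1 as arbitrary example values.

```python
import numpy as np

def rat(x, p=1.0, eps=0.1):
    """eps-smoothed rational (Geman) penalty: quadratic near zero,
    shifted fractional function |x|/(p+|x|) away from zero."""
    x = np.abs(x)
    quad = p * x**2 / (2 * eps * (p + eps)**2)
    tail = x / (p + x) - (2 * eps**2 + p * eps) / (2 * (p + eps)**2)
    return np.where(x <= eps, quad, tail)

def log1p_major(x, xk):
    """Lemma 1: first-order (tangent-line) upper bound of log(1+x) at xk."""
    return np.log1p(xk) + (x - xk) / (1 + xk)

xs = np.linspace(0.0, 5.0, 1001)
xk = 1.5
gap = log1p_major(xs, xk) - np.log1p(xs)

print(gap.min() >= -1e-12)                              # bound holds everywhere
print(abs(log1p_major(xk, xk) - np.log1p(xk)) < 1e-12)  # tight at x = xk

# the two branches of rat meet continuously at |x| = eps
print(abs(rat(0.1 - 1e-9) - rat(0.1 + 1e-9)) < 1e-6)
```

The first two checks confirm that, since log(1+x) is concave, its tangent line is a global majorizer that touches the function at x^{(k)}, which is exactly the property the MM surrogate needs; the last check confirms the smoothing removes the kink of the raw fractional function at the origin while keeping the penalty continuous.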