M-estimation in Low-rank Matrix Factorization: a General Framework

Peng Liu1,* Wei Tu2,* Jingyu Zhao3 Yi Liu2 Linglong Kong2,† Guodong Li3 Bei Jiang2 Hengshuai Yao4 Guangjian Tian5

1School of Mathematics, Statistics and Actuarial Science, University of Kent
2Department of Mathematical and Statistical Sciences, University of Alberta
3Department of Statistics and Actuarial Science, University of Hong Kong
4Huawei Hi-Silicon, Canada
5Huawei Noah's Ark Lab, Hong Kong, China

Emails: [email protected]; {wei.tu, yliu16, lkong, bei1}@ualberta.; {gladys17, gdli}@hku.hk; {hengshuai.yao, tian.guangjian}@huawei.com

Abstract—Many problems in science and engineering can be reduced to the recovery of an unknown large matrix from a small number of random linear measurements. Matrix factorization is arguably the most popular approach for low-rank matrix recovery. Many methods have been proposed using different loss functions, such as the widely used L2 loss, the more robust L1 and Huber losses, and the quantile and expectile losses for skewed data. All of them can be unified into the framework of M-estimation. In this paper, we present a general framework for low-rank matrix factorization based on M-estimation in statistics. The framework mainly involves two steps: we first apply Nesterov's smoothing technique to obtain an optimal smooth approximation for non-smooth loss functions, such as the L1 and quantile losses; secondly, we exploit an alternating updating scheme along with Nesterov's momentum method at each step to minimize the smoothed objective function. A strong theoretical convergence guarantee has been developed for the general framework, and extensive numerical experiments have been conducted to illustrate the performance of the proposed algorithm.

Index Terms—matrix recovery, M-estimation, matrix factorization, robustness, statistical foundation

*These authors contributed equally to this work. †Corresponding author. Part of this work was done during a sabbatical at Huawei.

I. INTRODUCTION

Motivation. In matrix recovery from linear measurements, we are interested in recovering an unknown matrix X ∈ R^{m×n} from p < mn linear measurements b_i = Tr(A_i^T X), where each A_i ∈ R^{m×n} is a measurement matrix, i = 1, ..., p. Usually it is expensive or even impossible to fully sample the entire matrix X, and we are left with a highly incomplete set of observations. In general it is not always possible to recover X under such settings. However, if we impose a low-rank structure on X, it is possible to exploit this structure and efficiently estimate X. The problem of matrix recovery arises in a wide range of applications, such as collaborative filtering [1], image recovery [2], structure from motion, photometric stereo [3], system identification [4] and computer network tomography [5].

Matrix factorization is arguably the most popular and intuitive approach for low-rank matrix recovery. The basic idea is to decompose the low-rank matrix X ∈ R^{m×n} into the product of two matrices

    X = U^T V,    (1)

where U ∈ R^{r×m} and V ∈ R^{r×n}. In many practical problems, the rank of a matrix is known or can be estimated in advance; see, for example, the rigid and nonrigid structures from motion as well as image recovery. The matrices U and V can also be interpreted as latent factors that drive the unknown matrix X.

Matrix factorization is usually based on the L2 loss, i.e. the squared loss, which is optimal for Gaussian errors. However, its performance may severely deteriorate when the data is contaminated by outliers. For example, in a collaborative filtering system, some popular items have many ratings regardless of whether they are useful, while others have fewer. There may even exist shilling attacks, i.e. a user may consistently give positive feedback to their own products or negative feedback to their competitors regardless of the items themselves [6]. Recently there have been a few attempts to address this problem; see, for example, [2], [7]. However, to the best of our knowledge, they focus on either the L1 or the quantile loss, and are only useful in limited scenarios. For low-rank matrix recovery under M-estimation, He et al. [8] studied the use of a few smooth loss functions, such as the Huber and Welsch losses, in this setting. In this paper, we propose a more general framework that is applicable to any M-estimation loss function, smooth or non-smooth, and provide a theoretical convergence guarantee for the proposed state-of-the-art algorithm.

This paper introduces a general framework of M-estimation [9]–[11] on matrix factorization. Specifically, we consider loss functions related to M-estimation for matrix factorization. M-estimation is defined in a way similar to the well-known terminology "Maximum Likelihood Estimate (MLE)" in statistics, and has many good properties that are similar to those of MLE. Meanwhile, it still retains an intuitive interpretation. The proposed class of loss functions includes the well-known L1 and L2 losses as special cases. In practice, we choose a suitable M-estimation procedure according to our knowledge of the data and the specific nature of the problem. For example, the Huber loss enjoys the smoothness of the L2 loss and the robustness of the L1 loss.

The loss functions of some M-estimation procedures are smooth, such as the L2 loss, while others are non-smooth, such as the L1 and quantile losses. Note that the resulting objective functions all have a bilinear structure due to the decomposition (1). For non-smooth cases, we first consider Nesterov's smoothing method to obtain an optimal smooth approximation [12], and the bilinear structure is preserved. The alternating minimization method is then used to search for the solutions. At each step, we employ Nesterov's momentum method to accelerate the convergence, and it turns out to find the global optima easily. Figure 1 gives the flowchart of the proposed algorithm: check whether the objective function M is smooth; if not, apply Nesterov's smoothing to obtain a smooth problem; initialize; and then alternate the updates U^{t+1} = NAG_U(U^t, V^t) and V^{t+1} = NAG_V(U^{t+1}, V^t) until convergence, returning (Û, V̂). Theoretical convergence analysis is conducted for both smooth and non-smooth loss functions.

Contributions. We summarize our contributions below.
1. We propose to perform matrix factorization based on the loss functions of M-estimation procedures. The proposed framework is very general and applicable to any M-estimation loss function, which gives us flexibility in selecting a suitable loss for specific problems.
2. We propose to use Nesterov's smoothing technique to obtain an optimal smooth approximation when the loss function is non-smooth.
3. We use Nesterov's momentum method, rather than plain gradient descent, to perform the optimization at each step of the alternating minimization, which greatly accelerates the convergence.
4. We provide theoretical convergence guarantees for the proposed algorithm.
5. We illustrate the usefulness of our method by conducting extensive numerical experiments on both synthetic and real data.

[Figure 1: Flowchart of the whole algorithm]

II. METHODOLOGY FRAMEWORK

Let X* ∈ R^{m×n} be the target low-rank matrix, and let A_i ∈ R^{m×n} with 1 ≤ i ≤ p be given measurement matrices. Here the A_i's can be the same as or different from each other. We assume the observed signals b = (b_1, ..., b_p)^T have the structure

    b_i = ⟨A_i, X*⟩ + ε_i,  i = 1, ..., p,    (2)

where ⟨A_i, X⟩ := Tr(A_i^T X) and ε_i is the error term. Suppose that the rank of the matrix X* is no more than r with r ≪ min(m, n, p). We then have the decomposition X* = U*^T V*, and the matrix can be recovered by solving the non-convex optimization problem

    min_{U ∈ R^{r×m}, V ∈ R^{r×n}} (1/p) Σ_{i=1}^p L(b_i − ⟨A_i, U^T V⟩),    (3)

where L(·) is the loss function used in an M-estimation procedure. For example, L(y) = y² for the L2 loss and L(y) = |y| for the L1 loss. Here L(·) usually is convex. Let A : R^{m×n} → R^p be the affine transformation with the ith entry of A(X) being ⟨A_i, X⟩, and let M(x) = p^{−1} Σ_{i=1}^p L(x_i) for a vector x = (x_1, ..., x_p)^T. We can then rewrite (3) in the more compact form

    min_{U ∈ R^{r×m}, V ∈ R^{r×n}} M(b − A(U^T V)).    (4)

A. The case that M is not smooth

The loss function L(·), and hence M(·), may be non-smooth in some M-estimation procedures, such as the L1 loss. For this non-smooth case, we first employ Nesterov's smoothing method to obtain an optimal smooth approximation; see [12] (pp. 129-132).

Specifically, we first assume that the objective function M has the following structure

    M(b − A(U^T V)) = M̂(b − A(U^T V)) + max_u { ⟨B(b − A(U^T V)), u⟩_2 − φ̂(u) },    (5)

where M̂(·) is continuous and convex; see Eq. (2.2) in [12]. Then the objective function M can be approximated by

    M_π(b − A(U^T V)) = M̂(b − A(U^T V)) + max_u { ⟨B(b − A(U^T V)), u⟩_2 − φ̂(u) − π d_2(u) },    (6)

where π is a positive smoothness parameter. Note that M_π(·) is smooth and can be optimized by many available gradient-based methods.

B. Alternating Minimization

Due to the bilinear form with respect to U and V in the objective functions (4) and (6), we consider an alternating minimization scheme. Specifically, in each iteration, we keep one of U and V fixed, optimize over the other, and then switch in the next iteration until the algorithm converges. Details can be found in Figure 1.

Moreover, direct optimization of (4) may lead to unstable solutions, and this paper addresses the problem by adopting the regularization method via Procrustes flow in [13]. Specifically, rather than (4), we attempt to solve the regularized problem below.
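As an illustration of the loss functions entering (3) and of the smoothing idea, the following numpy sketch (the function names are ours, not the paper's code) implements the L1 and L2 losses, the smoothed-L1 (Huber) loss that Nesterov's smoothing produces, and the objective M(b − A(UᵀV)):

```python
import numpy as np

# A few M-estimation losses L(.) and, where needed, their derivatives.
def l2(y):      return y ** 2
def l2_dot(y):  return 2 * y

def l1(y):      return np.abs(y)

def huber(y, mu=1.0):
    # Smoothed L1 loss: quadratic for |y| <= mu, linear in the tails.
    return np.where(np.abs(y) <= mu, y ** 2 / (2 * mu), np.abs(y) - mu / 2)

def huber_dot(y, mu=1.0):
    return np.where(np.abs(y) <= mu, y / mu, np.sign(y))

def objective(U, V, As, b, loss):
    # M(b - A(U^T V)) = (1/p) * sum_i L(b_i - <A_i, U^T V>).
    X = U.T @ V                                        # m x n candidate matrix
    res = b - np.array([np.sum(A * X) for A in As])    # residuals b_i - <A_i, X>
    return np.mean(loss(res))
```

As μ decreases, `huber` approaches the L1 loss, which is how the smoothness parameter trades accuracy of the approximation against smoothness.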

    min_{U ∈ R^{r×m}, V ∈ R^{r×n}} M(b − A(U^T V)) + λ‖UU^T − VV^T‖_F²,    (7)

where ‖·‖_F stands for the Frobenius norm, and the ad hoc choice of λ is 1/16. Denote by M^λ(U, V) the regularized version (7); all algorithms designed below are for this objective function. For the case with non-smooth M, we can similarly define the regularized objective function, denoted by M^λ_π(U, V), and all algorithms for M^λ(U, V) can then be applied.

III. INITIALIZATION

When the starting values of U and V are orthogonal (or almost orthogonal) to the true space, an alternating minimization algorithm may never converge to the true values, and hence an initialization procedure is needed to avoid this situation. Here we adopt the singular value projection (SVP), which originates from [14] and was later used by [13] to provide the initial values of U and V for the algorithm designed in the next section. The main difference between our method and the original one is summarized in Line 2 of Algorithm 1, where we use a loss function related to M-estimation,

    ∇_X M(b − A(X^t)) = −(1/p) Σ_{i=1}^p L̇(b_i − ⟨A_i, X^t⟩) A_i,

where L̇(·) is the first derivative of L(·).

Algorithm 1: Initialization by the SVP algorithm
  Input: A, b, tolerance ε_1, step sizes ξ_t with t = 0, 1, ..., and X^0 = 0_{m×n}
  Output: X^{t+1}
  1 Repeat
  2   Y^{t+1} ← X^t − ξ_t ∇_X M(b − A(X^t))
  3   Compute the top r singular vectors of Y^{t+1}: U_r, Σ_r, V_r
  4   X^{t+1} ← U_r Σ_r V_r
  5   t ← t + 1
  6 Until ‖X^{t+1} − X^t‖_F ≤ ε_1

It is worth pointing out that Algorithm 1 can be used directly to recover the matrix X* if it is iterated sufficiently many times. However, the singular value calculation is time-consuming when the dimension of X* is large. In this initialization step, we do not need a very small tolerance ε_1; a rough output is sufficient. Our simulation experiments show that, after several iterations of the SVP algorithm, the resulting values are close to the true ones, which is not the case for random initialization.

Algorithm 1 can be written in the compact form

    X^{t+1} ← P_r( X^t − ξ_t ∇_X M(b − A(X^t)) ),

where P_r denotes the projection onto the space of rank-r matrices. Moreover, the original objective function M, rather than the regularized one in (7), is used in Algorithm 1, and for the non-smooth case we use M_π from (6).

IV. ALGORITHM

There are two layers of iterations in our algorithm: the outer layer is the alternating minimization procedure, and the inner layer employs Nesterov's momentum algorithm to obtain the updated values U^{t+1} or V^{t+1}.

We first introduce the inner layer, where Nesterov's momentum method is used to update the values of U^t and V^t to those of U^{t+1} and V^{t+1}. Algorithm 2 gives the details of updating the value of U^t, and we denote it by NAG_U for simplicity. Here the gradient ∇_U M^λ(U, V) is defined as the gradient with respect to U, and similarly we can define ∇_V M^λ(U, V). Moreover, ν^t_{(i)} stands for the momentum term, γ is the momentum parameter, and η is the learning rate. The value of γ is usually chosen to be around 0.9; see [15] and [16]. Similarly, we can give the detailed algorithm for updating the value of V^t, denoted by NAG_V.

Algorithm 2: Nesterov's accelerated gradient (NAG) method
  Input: U^t, V^t, momentum parameter γ, learning rate η, and tolerance ε_2
  Output: U^{t+1}
  1 Repeat
  2   ν^t_{(i)} = γ ν^t_{(i−1)} + η ∇_U M^λ(U^t_{(i−1)} − γ ν^t_{(i−1)}, V^t)
  3   U^t_{(i)} = U^t_{(i−1)} − ν^t_{(i)}
  4 Until ‖U^t_{(i)} − U^t_{(i−1)}‖_F ≤ ε_2

The alternating minimization method is employed for the outer layer of iterations; see Algorithm 3 for details. The final solutions are denoted by Û and V̂, and we then use Û^T V̂ to approximate the low-rank matrix X*.

Algorithm 3: Alternating Minimization
  Input: U^0, V^0
  Output: Û, V̂
  1 Repeat
  2   Update U^t with U^{t+1} = NAG_U(U^t, V^t)
  3   Update V^t with V^{t+1} = NAG_V(U^{t+1}, V^t)
  4 Until convergence

V. CONVERGENCE RESULTS

Suppose that U and V are a pair of solutions, i.e. X = U^T V. It then holds that, for any orthonormal matrix R satisfying R^T R = I_r, U† = RU and V† = RV form another pair of solutions.
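A compact numpy sketch of Algorithms 1-3 for the smooth L2 case is given below (the helper names, step sizes, and fixed iteration counts are our own illustrative choices, not the paper's code; for a non-smooth loss the gradient of the smoothed objective M_π would replace `grad_X`, and the λ-regularizer of (7) is omitted for brevity):

```python
import numpy as np

def grad_X(X, As, b):
    # Gradient of M(b - A(X)) = (1/p) sum_i (b_i - <A_i, X>)^2 with respect to X.
    res = b - np.array([np.sum(A * X) for A in As])
    return -(2.0 / len(As)) * sum(r * A for r, A in zip(res, As))

def svp_init(As, b, m, n, rank, step, iters=10):
    # Algorithm 1 (SVP): gradient step, then projection onto rank-r matrices.
    X = np.zeros((m, n))
    for _ in range(iters):
        Y = X - step * grad_X(X, As, b)
        Uy, s, Vty = np.linalg.svd(Y, full_matrices=False)
        X = Uy[:, :rank] @ np.diag(s[:rank]) @ Vty[:rank, :]
    return X

def alt_min(U, V, As, b, gamma=0.9, eta=1e-3, outer=10, inner=5):
    # Algorithm 3 with Algorithm 2 (NAG) inside: alternate momentum updates
    # of U (r x m) and V (r x n), evaluating the gradient at a look-ahead point.
    vU, vV = np.zeros_like(U), np.zeros_like(V)
    for _ in range(outer):
        for _ in range(inner):                         # NAG_U steps
            G = grad_X((U - gamma * vU).T @ V, As, b)
            vU = gamma * vU + eta * (V @ G.T)          # chain rule: dM/dU = V G^T
            U = U - vU
        for _ in range(inner):                         # NAG_V steps
            G = grad_X(U.T @ (V - gamma * vV), As, b)
            vV = gamma * vV + eta * (U @ G)            # chain rule: dM/dV = U G
            V = V - vV
    return U, V
```

A typical pipeline would run `svp_init`, factor its output by a truncated SVD to get U^0 and V^0, and then hand these to `alt_min`.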
To evaluate the performance of the proposed algorithm, we first define a distance between two matrices:

    dist(U, U†) = min_{R ∈ R^{r×r} : R^T R = I_r} ‖U − RU†‖_F,

where U, U† ∈ R^{r×m} with m ≥ r; see [13].

Definition 1. (Restricted Isometry Property (RIP)) A linear map A satisfies the r-RIP with constant δ_r if

    (1 − δ_r)‖X‖_F² ≤ ‖A(X)‖_2² ≤ (1 + δ_r)‖X‖_F²

holds for all matrices X ∈ R^{m×n} of rank at most r.

Theorem 1. Let X ∈ R^{m×n} be a rank-r matrix with singular values σ_1(X) ≥ σ_2(X) ≥ ··· ≥ σ_r(X) > 0 and condition number κ = σ_1(X)/σ_r(X). Denote by X = A^T Σ B the corresponding SVD, and let U = A^T Σ^{1/2} ∈ R^{m×r} and V = B^T Σ^{1/2} ∈ R^{n×r}. Assume that A satisfies the rank-6r RIP condition with RIP constant δ_6r < 1/25, and take ξ_t = 1/p. Then using T_0 ≥ 3 log(√r κ) + 5 iterations in Algorithm 1 yields a solution (U_0, V_0) obeying

    dist( [U_0; V_0], [U; V] ) ≤ (1/4) σ_r(U).    (8)

Furthermore, starting from any initial solution obeying (8), the tth iterate of Algorithm 3 satisfies

    dist( [U_t; V_t], [U; V] ) ≤ (1/4) (1 − τ̃_1)^t (μ̃/ξ̃) ((1 + δ_r)/(1 − δ_r)) σ_r(U).    (9)

Eq. (8) in the above theorem guarantees that, under the RIP assumption on the linear measurements A, our algorithm achieves a good initialization. Eq. (9) states that, starting from a sufficiently accurate initialization, the algorithm exhibits a linear convergence rate. Moreover, the specific convergence rates of the initialization in (8) and the alternating minimization in (9) both depend on the RIP constant δ_6r.

Jain et al. [14] and Tu et al. [13] built convergence results for alternating minimization under least squares matrix factorization; however, their proof methods cannot be directly adapted to derive the results in (8). This paper extends the SVP method of [14] to more general objective functions based on M-estimation, and a detailed proof of the convergence of the SVP initialization is also provided. For the linear convergence in (9), Tu et al. [13] gave a similar result, but with the gradient descent method used for the alternating minimization. This paper adopts the proof technique of [17] to establish the linear convergence of the alternating minimization based on Nesterov's momentum method.

When M is not smooth, the regularized objective function M^λ, defined in (7), is also not smooth, while the function M^λ_π is smooth. Denote {U‡, V‡} = argmin_{U,V} M^λ(U, V) and {U^{π‡}, V^{π‡}} = argmin_{U,V} M^λ_π(U, V).

Theorem 2. (Convergence of the optimal solution of the smoothed objective function) As π → 0+, we have U^{π‡T} V^{π‡} → U^{‡T} V^‡.

The above theorem guarantees that the optimal solution of the smoothed objective function converges to that of the non-smooth one as the smoothness parameter π tends to zero. Theorem 2 is one of our important contributions. In the literature, most state-of-the-art smoothing methods were given without theoretical justification; see, for example, [18]. Theorem 2 implies that the optimal solution for the smooth approximation indeed converges to the solution of the non-smooth objective function, i.e. the smooth and non-smooth objective functions lead to the same results under some regularity conditions. Under matrix factorization settings, Yang et al. [19] considered Nesterov's smoothing method to obtain a smooth approximation, but it handled nonnegative matrices only, which is a much simpler problem compared with the one in this paper. Moreover, no strong theoretical convergence guarantee was provided there.

VI. NUMERICAL EXPERIMENTS

As a robust alternative to the L2 loss, the L1 loss brings computational difficulties since it is not smooth. Using Nesterov's smoothing method, we obtain the well-known Huber loss:

    L_μ(a) = { |a|²/(2μ)   for |a| ≤ μ,
               |a| − μ/2   otherwise },

where μ is the smoothness parameter, and L_μ(a) approaches the L1 loss as μ decreases. In this section, we compare the performance of the Huber and L2 losses.

A. Synthetic data

We first generate a matrix X of size m × n by sampling each entry from the Gaussian distribution N(0, 1), and the true matrix X* is obtained via truncated singular value decomposition (SVD) by keeping the r largest singular values. The sampling matrices A_i, i = 1, ..., p are independently produced by sampling each entry from N(0, 1). To evaluate the robustness of the proposed method, we consider eight distributions for the error term ε_i: (a) no error; (b) N(0, 2); (c) N(0, 10); (d) logN(0, 1); (e) Cauchy; (f) t(3); (g) Pareto(1, 1); (h) 0.9N(0, 1) + 0.1N(100, 1). Note that (e) and (g) are heavy-tailed with infinite first moment. We consider two loss functions for the recovery, the Huber and L2 losses, and there are 100 replications for each error distribution with m = n = 100, r = 10 and p = 5000.

We consider the following three metrics for evaluation: (1) relative error (RE): ‖X* − Û^T V̂‖_F / ‖X*‖_F; (2) recovery rate (RR): the fraction of elements whose element-wise relative error is smaller than 5%, where the element-wise relative error is defined as |(X*_{ij} − (Û^T V̂)_{ij}) / X*_{ij}|; (3) test error (MSE): a test set {A*, b*} with p* = 100 is generated using the same pipeline, and the test error is defined as p*^{−1} ‖A*(Û^T V̂) − b*‖_2². For both RE and MSE, a smaller value is desired, while a value closer to one indicates better performance for RR. Table I presents the results based on the average of 100 replications, and we highlight preferable figures in boldface.

When the data is very heavy-tailed (errors (e) and (g)), the algorithm diverges in some cases. Table II shows the percentages of convergent repetitions. For the other six distributions, the algorithm converges for all replications. Moreover, we observe that scaling down the learning rate can make the algorithm converge, but this severely slows it down. When starting with a larger learning rate, the algorithm with the L2 loss tends to diverge, while that with the Huber loss converges faster.

Figure 2 gives box plots of the recovery rates for different error distributions and loss functions. It can be seen that the recovery rates vary within a small range, and the observations are consistent with those in Table I. Overall, the algorithm with the Huber loss is comparable to that with the L2 loss in special cases (errors a-c) and significantly better in general cases (errors d-h).

Table I: Simulation results for synthetic data with eight error distributions: (a) no error; (b) N(0, 2); (c) N(0, 10); (d) logN(0, 1); (e) Cauchy; (f) t(3); (g) Pareto(1, 1); (h) 0.9N(0, 1) + 0.1N(100, 1).

  Error  Loss   RE     RR     MSE
  (a)    Huber  0.003  0.960  0.030
         L2     0.003  0.969  0.022
  (b)    Huber  0.030  0.650  2.885
         L2     0.028  0.671  2.487
  (c)    Huber  0.163  0.185  84.628
         L2     0.141  0.211  63.040
  (d)    Huber  0.028  0.667  2.617
         L2     0.038  0.580  4.571
  (e)    Huber  0.039  0.570  4.832
         L2     1.083  0.037  4742
  (f)    Huber  0.020  0.757  1.238
         L2     0.024  0.711  1.862
  (g)    Huber  0.068  0.399  14.68
         L2     1.825  0.021  12451
  (h)    Huber  0.025  0.704  1.920
         L2     0.494  0.063  791.1

Table II: Percentage of the replications where the algorithm converges for two error distributions: (e) Cauchy; (g) Pareto(1, 1).

  Error  Huber   L2
  (e)    99.4%   80.0%
  (g)    100.0%  62.8%

[Figure 2: Box plots of recovery rates under Huber (left) and L2 (right) loss functions with eight error distributions: (a) no error; (b) N(0, 2); (c) N(0, 10); (d) logN(0, 1); (e) Cauchy; (f) t(3); (g) Pareto(1, 1); (h) 0.9N(0, 1) + 0.1N(100, 1).]

When there is no error, the Huber and L2 losses are comparable, both reaching a recovery rate over 96%. In terms of the metrics presented, the L2 loss is better than the Huber loss when the error is normally distributed. However, the Huber loss is more robust when the error introduces bias, skewness or outliers. The Huber loss has a substantial advantage over the L2 loss especially when the error is skewed (error d) or heavy-tailed (error f). When the linear measurements are biased (error h) or contaminated by heavy-tailed outliers (errors e and g), matrix recovery using the L2 loss recovers less than 7% of the entries and has a very large MSE, while employing the Huber loss allows the procedure to recover at least 40% of the entries with a recovery error of less than 7% in general.

B. Real data

We further demonstrate the efficiency and robustness of our low-rank matrix recovery algorithm by applying it to two real examples.

The first is a chlorine concentration dataset, available in [20]. The dataset contains a matrix of chlorine concentration levels collected in a water distribution system. The observations in each row are collected at a certain location, and the columns correspond to observations at consecutive time points.

The second is compressed sensing of the MIT logo [14], [21], which is a 38 × 72 gray-scale image; see Figure 3.

[Figure 3: MIT logo]

1) Chlorine concentration recovery: We use a sub-matrix of size 120 × 180 as our ground truth matrix X* and generate p = 4200 sensing matrices and linear measurements according to (2). We set r = 6 since the rank-6 truncated SVD of X* achieves a relatively low error of 0.069. In this experiment, as in [22], we replace some measurements b_i with outliers sampled from N(0, 10‖X*‖_F). The replacement happens with probability 1% or 5%, which we denote by error distributions (i) and (j), respectively. As a comparison, we also consider the cases with no error and with N(0, 10) errors, which correspond to error distributions (a) and (c) in the synthetic data, respectively.

Table III: Performance of chlorine concentration recovery with four error distributions: (a) no error; (c) N(0, 10); (i) 1% outliers sampled from N(0, 10‖X*‖_F); (j) 5% outliers sampled from N(0, 10‖X*‖_F).

  Error  Loss   RE      RR
  (a)    Huber  0.095   0.130
         L2     0.102   0.121
  (c)    Huber  0.448   0.032
         L2     0.432   0.032
  (i)    Huber  0.097   0.132
         L2     12.587  0.002
  (j)    Huber  0.114   0.109
         L2     N.A.

Table III gives the sensing performance under different levels of outliers. The recovery rates are all low, possibly because X* has many entries close to zero; as a result, the relative error may be a more informative metric here. The Huber and L2 losses have comparable performance when there is no error (see also Figure 4) or a normal error of moderate scale (see also Figure 5). Replacing 1% or 5% of the measurements with N(0, 10‖X*‖_F) outliers hardly affects the recovery if the Huber loss is used. However, for the L2 loss, the result is severely affected by outliers. When there are 5% outliers, the algorithm is either too slow or divergent, thus the result is not available (N.A.).

Figures 4-7 visualize the element-wise comparison between the true matrix and the recovery results by plotting the second row of the matrices. Figure 4 shows that the recovery is good when there is no error. Figure 5 shows that both procedures are affected by the N(0, 10) noise, but the reconstructed values still share the same pattern with the true values. When there are 1% outliers, using the L2 loss results in large fluctuations and cannot recover the matrix (see Figure 6). In contrast, using the Huber loss allows us to recover the true X* even when 5% of the observations are outliers; see Figure 7.

[Figure 4: Chlorine concentration recovery: no error]
[Figure 5: Chlorine concentration recovery: N(0, 10) error]
[Figure 6: Chlorine concentration recovery: 1% outliers]

2) Compressed sensing of the MIT logo: We use the gray-scale logo as the ground truth matrix, which can be well approximated by a rank-4 matrix (RE = 0.0194 by truncated SVD). We set m = 38, n = 72, r = 4, and take p = 1200 measurements using (2). Figure 8 visually compares the sensing results under the L2 and Huber losses. If the Huber loss is used, the MIT logo can be recovered in all four scenarios examined. Recall that the N(0, 10) noise is the most adverse case for the Huber loss among the eight error distributions tested on the synthetic dataset. In this MIT logo experiment, the reconstructed image is still recognizable when the error follows N(0, 10). For the other three types of errors, the recovery error is hardly noticeable. On the other hand, using the L2 loss allows us to recover the logo when there is no error or a normal error. Under normal error, the L2 loss has a slight advantage over the Huber loss, as the outcome is a bit clearer. The reconstructed image is very blurred when the Gaussian mixture error is present, and cannot be recognized under the Cauchy error. Numerical metrics are supplied in Table IV.

Table IV: Recovery of the MIT logo with four error distributions: (a) no error; (c) N(0, 10); (e) Cauchy; (h) 0.9N(0, 1) + 0.1N(100, 1).

  Error  Loss   RE     RR
  (a)    Huber  0.024  0.547
         L2     0.026  0.538
  (c)    Huber  0.275  0.105
         L2     0.235  0.113
  (e)    Huber  0.063  0.380
         L2     9.795  0.002
  (h)    Huber  0.043  0.487
         L2     0.951  0.035
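The outlier-replacement mechanism used for error distributions (i) and (j) can be sketched as follows (our own helper, not the paper's code; we read N(0, 10‖X*‖_F) as specifying the standard deviation, which is an assumption):

```python
import numpy as np

def contaminate(b, X_true, q, rng):
    # Replace each measurement independently with probability q by an
    # outlier drawn from N(0, 10 * ||X*||_F), as in the chlorine experiment.
    scale = 10 * np.linalg.norm(X_true)          # Frobenius norm of X*
    mask = rng.random(b.shape) < q               # which entries become outliers
    out = b.copy()
    out[mask] = rng.normal(0.0, scale, size=int(mask.sum()))
    return out, mask
```

With q = 0.01 or q = 0.05 this reproduces the 1% and 5% contamination levels of Table III.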

[Figure 7: Chlorine concentration recovery: 5% outliers]

VII. CONCLUSION

This paper proposes to use M-estimation in the problem of matrix factorization for low-rank matrix recovery. The proposed framework is very general, and it includes the popular L1, L2 and quantile loss functions as special cases. In particular, the Huber M-estimator enjoys a good robustness property while its loss function is smooth. The alternating minimization method is applied at each step, and Nesterov's momentum method is employed to accelerate the convergence. The numerical experiments on both synthetic and real data demonstrate that the Huber loss function provides more robust performance for skewed and/or heavy-tailed data compared with the L2 loss.

APPENDIX

Appendix A: Proof Sketch of Theorem 1

The proof of the convergence shares the same nature as the combined analysis of [13], [23] and [17], so we only provide a road map.

We first present the argument for the case where M is smooth. The following assumptions are made on the function M(·):

1. M(A(Y) − b) − M(A(X) − b) ≥ ⟨∇M(A(X) − b), Y − X⟩ + ξ‖A(X) − A(Y)‖_2². This is the ξ-strong convexity assumption in [24].
2. M(A(Y) − b) − M(A(X) − b) ≤ ⟨∇M(A(X) − b), Y − X⟩ + η‖A(X) − A(Y)‖_2².
3. Without loss of generality, assume X* is the optimal matrix; then M(A(X*) − b) = 0 and M(X) ≥ 0.
4. M also defines a matrix norm when acting on a matrix X, with c_1‖X‖_2² ≤ M(X) ≤ c_2‖X‖_2².

Remark: Assumptions 1-2 define the condition number of the function M(·); similar assumptions also appear in [25]. Usually κ = η/ξ is called the condition number of a function f [26].

[Figure 8: Compressed sensing of the MIT logo. Panels: (a) Huber, no error; (b) L2, no error; (c) Huber, N(0, 10); (d) L2, N(0, 10); (e) Huber, Cauchy; (f) L2, Cauchy; (g) Huber, Gaussian mixture; (h) L2, Gaussian mixture.]

Lemma A.1. [27] Let A satisfy the 2r-RIP with constant δ_2r. Then, for all matrices X, Y of rank at most r, we have

    |⟨A(X), A(Y)⟩ − ⟨X, Y⟩| ≤ δ_2r ‖X‖_F ‖Y‖_F.

The next lemma characterizes the convergence rate of the initialization procedure.

Lemma A.2. [28] Let X ∈ R^{m×n} be an arbitrary matrix of rank r, and let b = A(X) ∈ R^p be p linear measurements. Consider the iterative updates

    Y^{t+1} ← P_r( Y^t − ξ_t ∇_X M(A(Y^t) − b) ),

where the Y^t are m × n matrices. Then

    ‖Y^t − X‖_F ≤ ψ(A)^t ‖Y^0 − X‖_F

holds, where ψ(A) is defined as

    ψ(A) = 2 sup { |⟨A(X), A(Y)⟩ − ⟨X, Y⟩| : ‖X‖_F = ‖Y‖_F = 1, rank(X) ≤ 2r, rank(Y) ≤ 2r }.

This lemma can be proved by using the results of Theorem A.1.

First, we prove that the initialization procedure indeed converges to the true value of X.

A. Proof of Eq. (8) in Theorem 1

Lemma A.3 (Initialization). Assume that ξ(1 + δ_2k)/(1 − δ_2k) − η > 0, and denote M̃(X) = M(A(X) − b). Let X* be an optimal solution and let X^t be the value obtained by Algorithm 1 at the tth iteration. Then

    M̃(X^{t+1}) ≤ M̃(X*) + ( ξ(1 + δ_2k)/(1 − δ_2k) − η ) ‖A(X* − X^t)‖_2².

Proof. From Assumption 2 we have

    M̃(X^{t+1}) − M̃(X^t)
    ≤ ⟨∇M̃(X^t), X^{t+1} − X^t⟩ + ξ ‖A(X^{t+1}) − A(X^t)‖_2²
    ≤ ⟨∇M̃(X^t), X^{t+1} − X^t⟩ + ξ(1 + δ_2k) ‖X^{t+1} − X^t‖_F²,

where the last inequality comes from the RIP. Let Y^{t+1} = X^t − (1/(2ξ(1 + δ_2k))) ∇M̃(X^t), and define

    f_t(X) = ⟨∇M̃(X^t), X − X^t⟩ + ξ(1 + δ_2k) ‖X − X^t‖_F².

Then

    f_t(X) = ξ(1 + δ_2k) ( ‖X − Y^{t+1}‖_F² − (1/(4ξ²(1 + δ_2k)²)) ‖∇M̃(X^t)‖_F² ).

By definition, P_k(Y^{t+1}) = X^{t+1}, so f_t(X^{t+1}) ≤ f_t(X*). Thus

    M̃(X^{t+1}) − M̃(X^t) ≤ f_t(X^{t+1}) ≤ f_t(X*)
    = ⟨∇M̃(X^t), X* − X^t⟩ + ξ(1 + δ_2k) ‖X* − X^t‖_F²
    ≤ ⟨∇M̃(X^t), X* − X^t⟩ + ξ ((1 + δ_2k)/(1 − δ_2k)) ‖A(X*) − A(X^t)‖_F²
    ≤ ⟨∇M̃(X^t), X* − X^t⟩ + η ‖A(X*) − A(X^t)‖_F² + ( ξ(1 + δ_2k)/(1 − δ_2k) − η ) ‖A(X*) − A(X^t)‖_F²
    ≤ M̃(X*) − M̃(X^t) + ( ξ(1 + δ_2k)/(1 − δ_2k) − η ) ‖A(X*) − A(X^t)‖_F².

Theorem A.1. Let b = A(X*) + e for a rank-k matrix X* and an error vector e ∈ R^p, and let

    D = 1/C² + ( ξ(1 + δ_2k)/(1 − δ_2k) − η ) ( 2/C² + 1/c_1 + 2/(C√c_1) ).

Then, under the assumption that D < 1, Algorithm 1 with step size η_t = 1/(2ξ(1 + δ_2k)) outputs a matrix X of rank at most k such that M(A(X) − b) ≤ (C² + ε)‖e‖²/2, where ε ≥ 0, in at most ⌈ (1/log D) log( (C² + ε)‖e‖² / (2c_2‖b‖²) ) ⌉ iterations.

Proof. Let the current solution X^t satisfy M(X^t) ≥ C²‖e‖²/2. By Lemma A.3 and b − A(X*) = e, we have

    M(X^{t+1}) ≤ ‖e‖²/2 + ( ξ(1 + δ_2k)/(1 − δ_2k) − η ) ‖b − A(X^t) − e‖_2²
    ≤ ‖e‖²/2 + ( ξ(1 + δ_2k)/(1 − δ_2k) − η ) ( ‖b − A(X^t)‖² − 2e^T(b − A(X^t)) + ‖e‖² )
    ≤ M(X^t)/C² + ( ξ(1 + δ_2k)/(1 − δ_2k) − η ) ( 2M(X^t)/C² + M(X^t)/c_1 + 2M(X^t)/(C√c_1) )
    = D M(X^t).

Since D < 1, combining this with the fact that M(X^0) ≤ c_2‖b‖², and taking t = ⌈ (1/log D) log( (C² + ε)‖e‖² / (2c_2‖b‖²) ) ⌉, we complete the proof.

The following lemma is adapted from [13]:

Lemma A.4. Let X_1, X_2 ∈ R^{m×n} be two rank-r matrices with SVDs X_1 = A_1^T Σ_1 B_1 and X_2 = A_2^T Σ_2 B_2. For l = 1, 2, define U_l = A_l^T Σ_l^{1/2} ∈ R^{m×r} and V_l = B_l^T Σ_l^{1/2} ∈ R^{n×r}. Assume X_1 and X_2 obey ‖X_2 − X_1‖ ≤ (1/2) σ_r(X_1). Then

    dist²( [U_2; V_2], [U_1; V_1] ) ≤ (2/(√2 − 1)) ‖X_2 − X_1‖_F² / σ_r(X_1).

Combining Lemmas A.1 and A.4, and following a route similar to that of [13], we can prove that using more than 3 log(√r κ) + 5 iterations of Algorithm 1 yields

    dist( [U_0; V_0], [U; V] ) ≤ (1/4) σ_r(U).

This finishes the proof of convergence of the initialization procedure. Next, we prove the linear convergence of the main algorithm on the penalized objective function.

We first rewrite the objective function (7) using an uplifting technique, so that it is easier to consider M and the regularization term simultaneously. To this end, consider a rank-r matrix X ∈ R^{m×n} with SVD X = U^T Σ V. Define Sym : R^{m×n} → R^{(m+n)×(m+n)} as

    Sym(X) = [ 0_{m×m}, X ; X^T, 0_{n×n} ].

Given a block matrix A = [A_{11}, A_{12} ; A_{21}, A_{22}] with A_{11} ∈ R^{m×m}, A_{12} ∈ R^{m×n}, A_{21} ∈ R^{n×m}, A_{22} ∈ R^{n×n}, define

    P_diag(A) = [A_{11}, 0_{m×n} ; 0_{n×m}, A_{22}]  and  P_off(A) = [0_{m×m}, A_{12} ; A_{21}, 0_{n×n}].

Define B : R^{(m+n)×(m+n)} → R^p as the uplifted version of the operator A:

    B(X)_k = ⟨B_k, X⟩, where B_k = Sym(A_k).

Define W = [U^T, V^T]^T. As a result, we can rewrite objective function (7) as

    g(W) = g(U, V) = M(b − A(U^T V)) + λ ‖UU^T − VV^T‖_F²
         = M( (1/2) B( Sym(X) − Sym(U^T V) ) ) + (λ/2) ‖ Sym(U^T V) − Sym(X) ‖_F².    (10)

From (10) we can see that the two parts of the penalized objective function have similar structures. As a result, although Assumptions 1, 2 and 4 are made on the original function M, we can see from (10) that the penalized objective function still retains similar properties. As for Assumption 3, we can use a location transform to make the penalized objective function satisfy it as well. Thus, the penalized objective function can be handled in the same way as the unpenalized one.

Then, similar to [13], the alternating minimization together with Nesterov's momentum algorithm with respect to U and V (the sub-matrices of W) can be written as a NAG algorithm applied to g(W) with respect to W.

For the convergence analysis of Nesterov's momentum algorithm, we employ the following lemma, which is a theorem in [17].

Lemma A.5. For the minimization problem min_{x∈X} f(x), where x is a vector, using the Lyapunov function

    Ṽ_k = f(y_k) + ξ ‖z_k − x*‖²,

it can be shown that

    Ṽ_{k+1} − Ṽ_k = −τ_k Ṽ_k + ε_{k+1},    (11)

where the error is expressed as

    ε_{k+1} = (τ_k²/(4ξ)) ‖∇f(x_k)‖² + ( τ_k η − ξ/τ_k ) ‖x_k − y_k‖².

Here τ_k is the step size in Nesterov's momentum algorithm, usually equal to 1/√κ, and the iterates are

    y_{k+1} = x_k − (1/(2η)) ∇f(x_k),
    x_{k+1} = (1/(1 + τ_k)) y_{k+1} + (τ_k/(1 + τ_k)) z_k,
    z_{k+1} = z_k + τ_k ( x_{k+1} − z_k − (1/(2ξ)) ∇f(x_{k+1}) ).

Assume that τ_0 = 0, τ_1 = τ_2 = ··· = τ̃, and that ε_1, ···, ε_{k+1} have a common upper bound ε̃. Then (11) implies

    |Ṽ_{k+1}| = | (1 − τ̃)^{k+1} Ṽ_0 + Σ_{i=1}^{k+1} (1 − τ̃)^{i−1} ε_{k+2−i} |
    ≤ (1 − τ̃)^{k+1} |Ṽ_0| + ( ε̃ − ε̃(1 − τ̃)^{k+1} ) / τ̃.

Now substitute x_k with W_{k+1} = [U^{k+1}, V^{k+1}]^T and f with g. To analyze the convergence with respect to W_t − W* and W_0 − W*, we need to handle two parts. The first part is the error part with respect to ε̃, for which it suffices that τ̃ ≤ (ε̃ − ε̃(1 − τ̃)^k)/ε. Since Ṽ_k still satisfies Assumptions 1 and 2, without loss of generality assume the corresponding parameters are ξ̃ and η̃.

Next we study the relation between W_t − W* and Ṽ_t. This involves Assumptions 1 and 2 as well as Lemma A.1. With a rough handling of the gradient part in Assumptions 1 and 2, we obtain

    ξ̃ ‖A(W_t) − A(W*)‖_2² ≤ (1 − τ̃)^t μ̃ ‖A(W_0) − A(W*)‖_2² + ε̃.

Notice that ε̃ can be made arbitrarily small, so that

    ξ̃ ‖A(W_t) − A(W*)‖_2² ≤ (1 − τ̃_1)^t μ̃ ‖A(W_0) − A(W*)‖_2²

and 1 − τ̃_1 can still be larger than 0 and smaller than 1. Employing the RIP property, we have

    ξ̃ (1 − δ_r) ‖W_t − W*‖_2² ≤ (1 − τ̃_1)^t μ̃ (1 + δ_r) ‖W_0 − W*‖_2².

Thus

    dist( [U_t; V_t], [U; V] ) ≤ (1 − τ̃_1)^t (μ̃/ξ̃) ((1 + δ_r)/(1 − δ_r)) dist( [U_0; V_0], [U; V] )
    ≤ (1/4) (1 − τ̃_1)^t (μ̃/ξ̃) ((1 + δ_r)/(1 − δ_r)) σ_r(U).

Theorem 1 is proved.

Remark on (10): Combining (10) and Assumption 4, we obtain from (10) a guideline for the selection of λ compared with [13].

Appendix B: Proof Sketch of Theorem 2

Proof. Employing Theorem 1 in [29] and Lemma 3 in [30], we know that

    M^λ_π(b − A(U^{π*T} V^{π*})) → M^λ(b − A(U^{*T} V^{*})).

Given the fact that M^λ_π and M^λ are convex, as well as the restricted isometry property, we finish the proof of Theorem 2.

REFERENCES

[1] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, no. 8, pp. 30–37, 2009.
[2] Z. Lin, C. Xu, and H. Zha, "Robust matrix factorization by majorization minimization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 1, pp. 208–220, 2018.
[3] R. Basri, D. Jacobs, and I. Kemelmacher, "Photometric stereo with general, unknown lighting," International Journal of Computer Vision, vol. 72, no. 3, pp. 239–257, 2007.
[4] M. A. Davenport and J. Romberg, "An overview of low-rank matrix recovery from incomplete observations," IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 4, pp. 608–622, 2016.
[5] D. Bindel, "Matrix factorizations for computer network tomography," in Householder Symposium XVIII on Numerical Linear Algebra, p. 27, 2011.
[6] S. C. Ponz, Machine Learning based Models for Matrix Factorization. PhD thesis, 2017.
[7] R. Zhu, D. Niu, L. Kong, and Z. Li, "Expectile matrix factorization for skewed data analysis," in Thirty-First AAAI Conference on Artificial
This can be solved by choosing an Intelligence, 2017. [8] R. He, T. Tan, and L. Wang, “Robust recovery of corrupted low- initial estimate close to the true value. As a result, k∇f(x1)k rankmatrix by implicit regularizers,” IEEE transactions on Pattern can be arbitrary close to 0. For notational simplicity, assume Analysis and Machine Intelligence, vol. 36, no. 4, pp. 770–783, 2014. [9] J. Tukey, “A survey of sampling from contaminated distributions,” in Contributions to probability and statistics (I. Olkin, ed.), pp. 448–485, Stanford University Press, 1960. [10] F. Hampel, Contribution to the theory of robust estimation. Phd thesis, University of California, Berkeley, 1968. [11] P. J. Huber and E. M. Ronchetti, . Wiley, 2009. [12] Y. Nesterov, “Smooth minimization of non-smooth functions,” Mathe- matical Programming, vol. 103, no. 1, pp. 127–152, 2005. [13] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via procrustes flow,” in International Conference on Machine Learning, pp. 964–973, 2016. [14] P. Jain, R. Meka, and I. S. Dhillon, “Guaranteed rank minimization via singular value projection,” in Advances in Neural Information Processing Systems, pp. 937–945, 2010. [15] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning, pp. 1139–1147, 2013. [16] J. Zhang and I. Mitliagkas, “Yellowfin and the art of momentum tuning,” arXiv preprint arXiv:1706.03471, 2017. [17] R. Brooks, “Convergence analysis of deterministic and stochastic meth- ods for convex optimization,” Master’s thesis, University of Waterloo, 2017. [18] A. Y. Aravkin, A. Kambadur, A. C. Lozano, and R. Luss, “Sparse quantile huber regression for efficient and robust estimation,” arXiv preprint arXiv:1402.4624, 2014. [19] Z. Yang, Y. Zhang, W. Yan, Y. Xiang, and S. 
Xie, “A fast non-smooth nonnegative matrix factorization for learning sparse representation,” IEEE access, vol. 4, pp. 5161–5168, 2016. [20] S. Papadimitriou, J. Sun, and C. Faloutsos, “Streaming pattern discovery in multiple time-series,” in Proceedings of the 31st International Con- ference on Very Large Data Bases, pp. 697–708, VLDB Endowment, 2005. [21] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM review, vol. 52, no. 3, pp. 471–501, 2010. [22] Y. Li, Y. Chi, H. Zhang, and Y. Liang, “Nonconvex low-rank matrix recovery with arbitrary outliers via -truncated gradient descent,” arXiv preprint arXiv:1709.08114, 2017. [23] E. J. Candes, X. Li, and M. Soltanolkotabi, “Phase retrieval via wirtinger flow: Theory and algorithms,” IEEE Transactions on Information The- ory, vol. 61, no. 4, pp. 1985–2007, 2015. [24] D. Park, A. Kyrillidis, C. Caramanis, and S. Sanghavi, “Finding low-rank solutions to matrix problems, efficiently and provably,” arXiv preprint arXiv:1606.03168, 2016. [25] W. Su, S. Boyd, and E. Candes, “A differential equation for model- ing nesterov’s accelerated gradient method: Theory and insights,” in Advances in Neural Information Processing Systems, pp. 2510–2518, 2014. [26] S. Bhojanapalli, A. Kyrillidis, and S. Sanghavi, “Dropping convexity for faster semi-definite optimization,” in Conference on Learning Theory, pp. 530–582, 2016. [27] E. J. Candes, “The restricted isometry property and its implications for compressed sensing,” Comptes rendus mathematique, vol. 346, no. 9-10, pp. 589–592, 2008. [28] S. Oymak, B. Recht, and M. Soltanolkotabi, “Sharp time–data tradeoffs for linear inverse problems,” IEEE Transactions on Information Theory, vol. 64, no. 6, pp. 4129–4158, 2018. [29] N. Guan, D. Tao, Z. Luo, and J. Shawe-Taylor, “Mahnmf: Manhat- tan non-negative matrix factorization,” arXiv preprint arXiv:1207.3438, 2012. [30] O. Fercoq and P. 
Richtarik,´ “Smooth minimization of nonsmooth functions with parallel coordinate descent methods,” arXiv preprint arXiv:1309.5885, 2013.
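Supplementary note. To make the uplifting construction of Appendix A concrete, the following is a minimal numerical sketch of the $\mathrm{Sym}$, $P_{\mathrm{diag}}$ and $P_{\mathrm{off}}$ operators; the function names `sym`, `p_diag` and `p_off` are ours, chosen purely for illustration:

```python
import numpy as np

def sym(X):
    """Uplift X (m x n) to the symmetric (m+n) x (m+n) block matrix [[0, X], [X^T, 0]]."""
    m, n = X.shape
    top = np.hstack([np.zeros((m, m)), X])
    bot = np.hstack([X.T, np.zeros((n, n))])
    return np.vstack([top, bot])

def p_diag(A, m):
    """Keep only the diagonal blocks A11 (m x m) and A22; zero out the rest."""
    out = np.zeros_like(A)
    out[:m, :m] = A[:m, :m]
    out[m:, m:] = A[m:, m:]
    return out

def p_off(A, m):
    """Keep only the off-diagonal blocks A12 and A21; zero out the rest."""
    out = np.zeros_like(A)
    out[:m, m:] = A[:m, m:]
    out[m:, :m] = A[m:, :m]
    return out
```

One can check numerically that $\langle\mathrm{Sym}(A_k), \mathrm{Sym}(X)\rangle = 2\langle A_k, X\rangle$, which is precisely why the uplifted measurement term in (10) carries the factor $\tfrac12$.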
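Supplementary note. The distance $\mathrm{dist}(\cdot,\cdot)$ appearing in Lemma A.4 and in the initialization bound is, following the convention of [13], the Frobenius distance between stacked factor matrices up to an $r\times r$ orthogonal rotation. A minimal sketch assuming that definition (the name `procrustes_dist` is ours):

```python
import numpy as np

def procrustes_dist(W1, W2):
    """dist(W1, W2) = min over orthogonal Q of ||W1 - W2 @ Q||_F.

    The minimizer is the orthogonal Procrustes solution Q = P @ Rt,
    where P, Rt come from the SVD of W2^T @ W1.
    """
    P, _, Rt = np.linalg.svd(W2.T @ W1)
    Q = P @ Rt
    return np.linalg.norm(W1 - W2 @ Q)
```

The rotation invariance reflects the fact that $(U, V)$ and $(UQ, VQ)$ produce the same matrix $U^\top V$ after the factors are stacked into $W = [U^\top, V^\top]^\top$.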
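Supplementary note. The three-sequence form of Nesterov's momentum iteration used in Lemma A.5 can be exercised on a simple strongly convex quadratic. The constant identifications below ($\eta = L/2$ and $\xi = \mu/2$ for an $L$-smooth, $\mu$-strongly convex $f$, and $\tau = 1/\sqrt\kappa$) are our reading of the lemma's notation, not prescriptions from the paper:

```python
import numpy as np

# f(x) = 0.5 * x^T diag(d) x : L-smooth and mu-strongly convex.
d = np.array([1.0, 10.0])
mu, L = d.min(), d.max()
grad = lambda x: d * x

# Constants as we read Lemma A.5 (an assumption, not the paper's choice).
eta, xi = L / 2.0, mu / 2.0
tau = 1.0 / np.sqrt(L / mu)

x = y = z = np.array([5.0, -3.0])
for _ in range(300):
    y = x - grad(x) / (2.0 * eta)                 # gradient step on x_k
    x = y / (1.0 + tau) + tau * z / (1.0 + tau)   # coupling of y_{k+1} and z_k
    z = z + tau * (x - z - grad(x) / (2.0 * xi))  # momentum/dual sequence
x_final = x
```

On this toy problem the iterates contract geometrically, consistent with the $(1-\tilde\tau)^{k+1}$ factor in the bound on $|\tilde V_{k+1}|$ above.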