Linearized Binary Regression
Andrew S. Lan¹, Mung Chiang², and Christoph Studer³
¹Princeton University, Princeton, NJ; [email protected]
²Purdue University, West Lafayette, IN; [email protected]
³Cornell University, Ithaca, NY; [email protected]

Abstract—Probit regression was first proposed by Bliss in 1934 to study mortality rates of insects. Since then, an extensive body of work has analyzed and used probit or related binary regression methods (such as logistic regression) in numerous applications and fields. This paper provides a fresh angle to such well-established binary regression methods. Concretely, we demonstrate that linearizing the probit model in combination with linear estimators performs on par with state-of-the-art nonlinear regression methods, such as posterior mean or maximum a-posteriori estimation, for a broad range of real-world regression problems. We derive exact, closed-form, and nonasymptotic expressions for the mean-squared error of our linearized estimators, which clearly separates them from nonlinear regression methods that are typically difficult to analyze. We showcase the efficacy of our methods and results for a number of synthetic and real-world datasets, which demonstrates that linearized binary regression finds potential use in a variety of inference, estimation, signal processing, and machine learning applications that deal with binary-valued observations or measurements.

I. INTRODUCTION

This paper deals with the estimation of the N-dimensional vector x ∈ R^N from the following measurement model:

    y = sign(Dx + w).    (1)

Here, the vector y ∈ {−1, +1}^M contains M binary-valued measurements, the function sign(z) operates element-wise on its argument and outputs +1 for z ≥ 0 and −1 otherwise, D ∈ R^{M×N} is a given design matrix (or matrix of covariates), and the noise vector w ∈ R^M has i.i.d. random entries. Estimation of the vector x from the observation model in (1) is known as binary regression. The two most common types of binary regression are (i) probit regression [1], for which the noise vector w follows a standard normal distribution, and (ii) logistic regression [2], for which the noise vector w follows a logistic distribution with unit scale parameter.

Binary regression finds widespread use in a broad range of applications and fields, including (but not limited to) image classification [3], biomedical data analysis [4], [5], economics [6], and signal processing [7], [8]. In most real-world applications, one can use either probit or logistic regression, since the noise distribution is unknown; in this paper, we focus on probit regression for reasons that we will detail in Section II-A. In what follows, we will assume that the noise vector w ∈ R^M has i.i.d. standard normal entries, and refer to (1) as the standard probit model.

[Footnote: AL and MC were supported in part by the US National Science Foundation (NSF) under grant CNS-1347234. CS was supported in part by Xilinx Inc. and by the US NSF under grants ECCS-1408006, CCF-1535897, CAREER CCF-1652065, and CNS-1717559.]

A. Relevant Prior Art

1) Estimators: The two most common estimation techniques for the standard probit model in (1) are the posterior mean (PM) and maximum a-posteriori (MAP) estimators. The PM estimator computes the following conditional expectation [9]:

    x̂^PM = E_x[x | y] = ∫_{R^N} x p(x | y) dx,    (2)

where p(x | y) is the posterior probability of the vector x given the observations y under the model (1). The PM estimator is optimal in the sense that it minimizes the mean-squared error (MSE) defined as

    MSE(x̂) = E_{x,w}[ ‖x − x̂‖² ],    (3)

and is, hence, also known as the nonlinear minimum mean-squared error (MMSE) estimator. Evaluating the integral in (2) for the probit model is difficult and hence, one typically resorts to rather slow Monte-Carlo methods [10]. By assuming that the vector x is multivariate Gaussian, an alternative regression technique is the MAP estimator that solves the following convex optimization problem [11]:

    x̂^MAP = arg min_{x ∈ R^N}  −Σ_{m=1}^{M} log Φ(y_m d_m^T x) + (1/2) x^T C_x^{−1} x.    (4)

Here, Φ(x) = ∫_{−∞}^{x} (2π)^{−1/2} e^{−t²/2} dt is the cumulative distribution function of a standard normal random variable, d_m^T is the mth row of the covariate matrix D, and C_x is the covariance matrix of the zero-mean multivariate Gaussian prior on the vector x. By ignoring the prior on x, one arrives at the well-known maximum-likelihood (ML) estimator. Compared to the PM estimator, MAP and ML estimation can be implemented efficiently, either by solving a series of re-weighted least-squares problems [12] or by using standard numerical methods for convex problems that scale favorably to large problem sizes [13], [14]. In contrast to such well-established nonlinear estimators, we will investigate linear estimators that are computationally efficient and whose performance is on par with that of the PM, MAP, and ML estimators.
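For concreteness, the following sketch shows one way to compute the MAP estimate in (4) with an off-the-shelf convex solver. It is not the authors' implementation: the function name probit_map, the choice of SciPy's L-BFGS-B routine, and the zero initialization are illustrative assumptions only.

```python
# Minimal sketch (not the authors' code): MAP estimation for the standard
# probit model by minimizing the convex objective in (4).
# Assumes D (M x N), y in {-1,+1}^M, and the prior covariance Cx are given.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def probit_map(D, y, Cx):
    Cx_inv = np.linalg.inv(Cx)

    def objective(x):
        t = y * (D @ x)                      # y_m * d_m^T x
        # -sum_m log Phi(y_m d_m^T x) + 0.5 x^T Cx^{-1} x, cf. (4)
        return -norm.logcdf(t).sum() + 0.5 * x @ Cx_inv @ x

    def gradient(x):
        t = y * (D @ x)
        # d/dt log Phi(t) = phi(t)/Phi(t), computed in a numerically stable way
        ratio = np.exp(norm.logpdf(t) - norm.logcdf(t))
        return -D.T @ (y * ratio) + Cx_inv @ x

    x0 = np.zeros(D.shape[1])
    res = minimize(objective, x0, jac=gradient, method="L-BFGS-B")
    return res.x
```

Since the objective in (4) is convex, any standard quasi-Newton solver reaches the global minimizer; iteratively re-weighted least squares [12] would be an equally valid choice here.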
2) Analytical Results: Analytical results that characterize the performance of estimation under the probit model are almost exclusively for the asymptotic setting, i.e., when M and/or N tend to infinity. More specifically, Brillinger [15] has shown in 1982 that the conventional least-squares (LS) estimator, for scenarios in which the design matrix D has i.i.d. Gaussian entries, delivers an estimate that is the same as that of the PM estimator up to a constant. More recently, Brillinger's result has been generalized by Thrampoulidis et al. [16] to the sparse setting, i.e., where the vector x has only a few nonzero entries. Other related results analyze the consistency of the ML estimator for sparse logistic regression. These results are either asymptotic [8], [17], [18] or of probabilistic nature [19]; the latter type of results bounds the MSE with high probability. In contrast to all such existing analytical results, we will provide nonasymptotic and exact expressions for the MSE that are valid for arbitrary and deterministic design matrices D.

B. Contributions

We propose novel linear estimators of the form x̂ = Wy for the probit model in (1), where W ∈ R^{N×M} are suitably chosen estimation matrices, and provide exact, closed-form, and nonasymptotic expressions for the MSE of these estimators. Specifically, we will develop two estimators: a linear minimum mean-squared error (L-MMSE) estimator that aims at minimizing the MSE in (3) and a more efficient but less accurate least-squares (LS) estimator. Our MSE results are in stark contrast to existing performance guarantees for the MAP or PM estimators, for which a nonasymptotic performance analysis is, in general, difficult. We provide inference results on synthetic data, which suggest that the inference quality of the proposed linear estimators is on par with that of state-of-the-art nonlinear estimators, especially at low signal-to-noise ratio (SNR), i.e., when the quantization error is lower than the noise level. Moreover, we show using six different real-world binary regression datasets that the proposed linear estimators achieve predictive performance competitive with PM and MAP estimation at comparable or even lower complexity.

II. LINEARIZED PROBIT REGRESSION

To develop and analyze linearized inference methods for the standard probit model in (1), we will first consider the following smoothed version of the probit model:

    ȳ = f_σ(Dx + w).    (5)

We will then use these results to study the binary model (1). Here, ȳ ∈ [−1, +1]^M, x is zero-mean Gaussian with known covariance C_x, the sigmoid function is defined as f_σ(z) = 2Φ(z/σ) − 1 and operates element-wise on its argument, σ² is a smoothing parameter, and the noise vector w is zero-mean Gaussian with a covariance matrix C_w that we assume to be invertible. We next introduce two new linear estimators for this model and then provide exact, closed-form, and nonasymptotic expressions for the associated MSEs.

A. Linear Minimum Mean-Squared Error Estimator

Our main result is as follows.

Theorem 1. The linear minimum mean-squared error (L-MMSE) estimate for the generalized probit model in (5) is

    x̂^{L-MMSE} = E^T C_ȳ^{−1} ȳ,    (6)

where

    E = √(2/π) diag(diag(σ²I + C_z))^{−1/2} D C_x,    (7)

    C_ȳ = (2/π) arcsin( diag(diag(σ²I + C_z))^{−1/2} C_z diag(diag(σ²I + C_z))^{−1/2} ),    (8)

and C_z = D C_x D^T + C_w.
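As an illustration, equations (6)–(8) translate directly into a few lines of linear algebra. The sketch below assumes D, C_x, C_w, the smoothing parameter σ, and the observations ȳ are given as NumPy arrays; the function name lmmse_estimate is a placeholder.

```python
# Minimal sketch of the L-MMSE estimator in (6)-(8), assuming D, Cx, Cw,
# the smoothing parameter sigma, and the observations y_bar are given.
import numpy as np


def lmmse_estimate(y_bar, D, Cx, Cw, sigma):
    Cz = D @ Cx @ D.T + Cw                        # Cz = D Cx D^T + Cw
    s = 1.0 / np.sqrt(sigma**2 + np.diag(Cz))     # diag(sigma^2 I + Cz)^{-1/2}
    E = np.sqrt(2.0 / np.pi) * (s[:, None] * (D @ Cx))         # eq. (7)
    C_ybar = (2.0 / np.pi) * np.arcsin(
        np.clip(s[:, None] * Cz * s[None, :], -1.0, 1.0))      # eq. (8)
    return E.T @ np.linalg.solve(C_ybar, y_bar)                 # eq. (6)
```

The clipping only guards against round-off: the entries of the normalized matrix inside the arcsine in (8) lie in [−1, 1] by construction.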
Remark 1. The reason that we focus on probit regression is that, under the standard probit model, the matrices E and C_ȳ exhibit closed-form expressions; for logistic regression, such closed-form expressions do not exist.

Proof. The proof consists of two steps. First, we linearize the model in (5). Then, we derive the L-MMSE estimate in (6) for the linearized model. The two steps are as follows.

Step 1 (Linearization): Let z = Dx + w and

    ȳ = f_σ(z) = Fx + e    (9)

be a linearization of the generalized probit model in (5), where F ∈ R^{M×N} is a linearization matrix and e ∈ R^M is a residual error vector that contains noise and linearization artifacts. Our goal is to perform a Bussgang-like decomposition [22], which uses the linearization matrix F that minimizes the ℓ2-norm of the residual error vector e averaged over the signal and noise. Concretely, let C_z be the covariance matrix of the vector z and consider the optimization problem

    minimize_{F ∈ R^{M×N}}  E_{x,w}[ ‖ȳ − Fx‖² ],

which has a closed-form solution given by F = E C_x^{−1} with E = E_{x,w}[ȳ x^T]. It can easily be verified that, for this particular choice of the linearization matrix F, the residual error vector e and the signal of interest x are uncorrelated, i.e., we have E_{x,w}[x e^T] = 0_{N×M}.

We now derive a closed-form expression for the entries of the matrix E.
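The decorrelation property E_{x,w}[x e^T] = 0_{N×M} from Step 1 can also be checked numerically. The following Monte-Carlo sketch (illustrative only; the problem size, seed, and sample count are arbitrary assumptions) estimates E = E_{x,w}[ȳ x^T] from samples, forms F = E C_x^{−1}, and verifies that the empirical cross-correlation between x and the residual e is negligible.

```python
# Monte-Carlo sanity check (illustration only) of the Bussgang-like
# decomposition: with F = E Cx^{-1} and E = E[y_bar x^T], the residual
# e = y_bar - F x should satisfy E[x e^T] ~ 0.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
M, N, sigma, trials = 8, 4, 0.5, 200_000

D = rng.standard_normal((M, N))
Cx = np.eye(N)

x = rng.multivariate_normal(np.zeros(N), Cx, size=trials)   # signal samples
w = rng.standard_normal((trials, M))                          # standard normal noise
z = x @ D.T + w
y_bar = 2.0 * norm.cdf(z / sigma) - 1.0                       # f_sigma(z), eq. (5)

E_hat = (y_bar.T @ x) / trials                                # estimate of E[y_bar x^T]
F = E_hat @ np.linalg.inv(Cx)                                 # F = E Cx^{-1}
e = y_bar - x @ F.T                                           # residual error vector
print(np.max(np.abs((x.T @ e) / trials)))                     # ~ 0 up to Monte-Carlo error
```

The printed cross-correlation shrinks at the usual 1/√(number of samples) Monte-Carlo rate.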