Bayesian Poisson Regression for Crowd Counting
Appears in IEEE Int'l Conf. on Computer Vision, Kyoto, 2009.

Antoni B. Chan
Department of Computer Science
City University of Hong Kong
[email protected]

Nuno Vasconcelos
Dept. of Electrical and Computer Engineering
University of California, San Diego
[email protected]

Abstract

Poisson regression models the noisy output of a counting function as a Poisson random variable, with a log-mean parameter that is a linear function of the input vector. In this work, we analyze Poisson regression in a Bayesian setting, by introducing a prior distribution on the weights of the linear function. Since exact inference is analytically unobtainable, we derive a closed-form approximation to the predictive distribution of the model. We show that the predictive distribution can be kernelized, enabling the representation of non-linear log-mean functions. We also derive an approximate marginal likelihood that can be optimized to learn the hyperparameters of the kernel. We then relate the proposed approximate Bayesian Poisson regression to Gaussian processes. Finally, we present experimental results using Bayesian Poisson regression for crowd counting from low-level features.

1. Introduction

Recent work [1, 2] on crowd counting using low-level feature regression has shown promise in computer vision. One advantage of these methods is that they bypass intermediate processing stages, such as people detection or tracking, that may be susceptible to problems when the crowd is dense. In [1], the scene is segmented into crowds moving in different directions, and various low-level features are extracted from each crowd segment (e.g. information on the shape, edges, and texture of the segment). The crowd count in each segment is then estimated with a Gaussian process (GP) regression function that maps feature vectors to the crowd size. Experiments in [1] indicate that the counting algorithm is capable of producing accurate counts for a wide range of crowd densities.

One problem with the system of [1] is that it uses GP regression, which models continuous real-valued functions, to predict discrete counting numbers. Because of this mismatch, regression may not be taking full advantage of Bayesian inference. For example, rounding of the real-valued predictions is not handled in a principled way, e.g. by reducing the confidence when the prediction is far from an integer. In addition, the confidence levels are currently measured in standard deviations, which provides little intuition on the reliability of the estimates; a confidence measure based on posterior probabilities seems more intuitive for counting numbers. Finally, negative outputs of the GP must be truncated to zero, and it is unclear how this affects the optimality of the predictive distribution.

One common method of regression for counting numbers is Poisson regression [3], which models the noisy output of a counting function as a Poisson random variable, with a log-mean parameter that is a linear function of the input vector. This is analogous to standard linear regression, except that the mean is modeled as the exponential of a linear function to ensure non-negative values, and that the noise model is Poisson because the outputs are counting numbers. One way of extending Poisson regression to the Bayesian setting is to adopt a hierarchical model, where the log-mean function is modeled with a standard Gaussian process [4, 5, 6]. These solutions, however, have two disadvantages. First, because of the lack of conjugacy between the Poisson and the GP, [4, 5, 6] must approximate inference with Markov-chain Monte Carlo (MCMC), which limits these algorithms to small datasets. Second, the hierarchical model contains two redundant noise sources: 1) the Poisson-distributed observation noise, and 2) the Gaussian noise of the GP in the log-mean function. These two noise terms model essentially the same thing: the noise in observing the count. A more parsimonious representation would include only the observation noise, while modeling the mean as a deterministic function.

In this work, we analyze the standard Poisson regression model in a Bayesian setting, by adding a Gaussian prior on the weights of the linear log-mean function. Since exact inference is analytically unobtainable, approximate inference is still necessary. However, in contrast to previous work [4, 5, 6], we propose a closed-form approximation to Bayesian inference. The contributions of this paper, with respect to Bayesian Poisson regression (BPR), are five-fold: 1) we derive a closed-form approximation to the predictive distribution for BPR; 2) we kernelize the predictive distribution, enabling the representation of non-linear log-mean functions via kernel functions; 3) we derive an approximate marginal likelihood function for optimizing the hyperparameters of the kernel function with Type-II maximum likelihood; 4) we show that the proposed approximation to BPR is related to a Gaussian process with a special non-i.i.d. noise term; 5) finally, we present experimental results that show improvement in crowd counting accuracy when using the proposed model.

The remainder of this paper is organized as follows. In Sections 2 and 3, we briefly review Gaussian processes and Poisson regression. In Section 4, we present the Bayesian framework for Poisson regression, derive a closed-form approximation for the predictive distribution and marginal likelihood, and kernelize the regression model. Finally, in Section 5, we present experimental results on Bayesian Poisson regression for crowd counting.

2. Gaussian process regression

Gaussian process (GP) regression [7] is a Bayesian treatment for predicting a function value f(x) from the input vector x \in R^d. Consider the case when f(x) is linear, from which we observe a noisy target y, i.e.

    f(x) = x^T w,    y = f(x) + \epsilon,    (1)

where w \in R^d is the weight vector of the linear model, and the observation noise is Gaussian, \epsilon \sim N(0, \sigma_n^2). The Bayesian model assumes a prior distribution on the weight vectors, w \sim N(0, \Sigma_p), where \Sigma_p is the covariance matrix of the weight prior.

2.1. Bayesian prediction

Let X = [x_1 \cdots x_N] be the matrix of observed input vectors x_i, and y = [y_1 \cdots y_N]^T be the vector of observed outputs y_i. Bayesian inference on (1) is based on the posterior distribution of the weights w, conditioned on the observed data {X, y}, and is computed with Bayes' rule,

    p(w|X, y) = \frac{p(y|X, w) p(w)}{\int p(y|X, w) p(w) \, dw}.    (2)

Since the data-likelihood and weight prior are both Gaussian, (2) is also Gaussian [7],

    p(w|X, y) = G(w \,|\, \tfrac{1}{\sigma_n^2} A^{-1} X y, \, A^{-1}),    (3)

where G(x|\mu, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp(-\tfrac{1}{2} \|x - \mu\|_\Sigma^2) is the equation of a multivariate Gaussian distribution, \|x\|_\Sigma^2 = x^T \Sigma^{-1} x, and A = \tfrac{1}{\sigma_n^2} X X^T + \Sigma_p^{-1}. Finally, given a novel input vector x_*, the predictive distribution of f_* = f(x_*) is obtained by averaging over all possible model parameterizations, with respect to the posterior distribution of w [7],

    p(f_*|x_*, X, y) = \int p(f_*|x_*, w) \, p(w|X, y) \, dw    (4)
                     = G(f_* \,|\, \tfrac{1}{\sigma_n^2} x_*^T A^{-1} X y, \, x_*^T A^{-1} x_*).    (5)

2.2. Kernelized regression

The predictive distribution in (5) can be rewritten to depend only on the inner products between the inputs x_i. Hence, the "kernel trick" can be applied to obtain a kernel version of the Bayesian linear regression. Consider the model

    f(x) = \phi(x)^T w,    (6)

where \phi(x) is a high-dimensional feature transformation of x from dimension d to D, i.e. \phi : R^d \to R^D, and w \in R^D. Substituting into (5) and applying the matrix inversion lemma, the predictive distribution can be rewritten in terms of the kernel function k(x, x') = \phi(x)^T \Sigma_p \phi(x') [7],

    p(f_*|x_*, X, y) = G(f_* | \mu_*, \Sigma_*),    (7)

where the predictive mean and covariance are

    \mu_* = k_*^T (K + \sigma_n^2 I)^{-1} y,    (8)
    \Sigma_* = k(x_*, x_*) - k_*^T (K + \sigma_n^2 I)^{-1} k_*,    (9)

and K is the kernel matrix with entries K_{ij} = k(x_i, x_j), and k_* = [k(x_*, x_1) \cdots k(x_*, x_N)]^T. Hence, non-linear regression is achieved by adopting different positive definite kernel functions. For example, using a linear kernel,

    k_l(x, x') = \theta_1^2 (x^T x' + 1) + \theta_2^2,    (10)

results in standard Bayesian linear regression, while employing a squared-exponential (RBF) kernel,

    k_r(x, x') = \theta_1^2 \exp(-\tfrac{\|x - x'\|^2}{2 \theta_2^2}) + \theta_3^2,    (11)

yields Bayesian regression for locally smooth, infinitely differentiable functions. Finally, a compound kernel, such as the RBF-RBF kernel,

    k_{rr}(x, x') = \theta_1^2 \exp(-\tfrac{\|x - x'\|^2}{2 \theta_2^2}) + \theta_3^2 \exp(-\tfrac{\|x - x'\|^2}{2 \theta_4^2}) + \theta_5^2,    (12)

which contains two RBF functions with different length scales, can simultaneously model both global non-linear trends and local deviations from the trend.

The hyperparameters \theta of the kernel function k(x, x') can be learned with Type-II maximum likelihood. The marginal likelihood of the training data {x_i, y_i}_{i=1}^N is maximized with respect to the hyperparameters ([7], Chapter 5),

    \log p(y|X, \theta) = \log \int p(y|w, X, \theta) \, p(w|\theta) \, dw    (13)
                        = -\tfrac{1}{2} y^T K_y^{-1} y - \tfrac{1}{2} \log |K_y| - \tfrac{N}{2} \log 2\pi,    (14)

where K_y = K + \sigma_n^2 I.

A Poisson regression model that accounts for overdispersion replaces the Poisson noise with a negative binomial [3],

    \mu(x) = \exp(x^T \beta),    y \sim NegBin(\mu(x), \alpha),    (16)

where \alpha is the scale parameter of the negative binomial. The likelihood of y given an input vector x is

    p(y|x) = \frac{\Gamma(y + \alpha^{-1})}{\Gamma(\alpha^{-1}) \, \Gamma(y + 1)} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu(x)}\right)^{\alpha^{-1}} \left(\frac{\mu(x)}{\alpha^{-1} + \mu(x)}\right)^{y}.
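As a concrete illustration of the kernelized predictive distribution of Section 2.2, the following NumPy sketch evaluates the predictive mean (8) and covariance (9) under the RBF kernel (11). The training data, hyperparameter values (theta_1, theta_2, theta_3), and noise level sigma_n are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def rbf_kernel(X1, X2, theta1=1.0, theta2=0.5, theta3=0.1):
    """Squared-exponential kernel k_r of eq. (11); theta values are
    illustrative defaults, not taken from the paper."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return theta1 ** 2 * np.exp(-d2 / (2 * theta2 ** 2)) + theta3 ** 2

def gp_predict(X, y, Xstar, sigma_n=0.1):
    """Predictive mean (8) and covariance (9) of GP regression."""
    Ky = rbf_kernel(X, X) + sigma_n ** 2 * np.eye(len(X))  # K + sigma_n^2 I
    Ks = rbf_kernel(X, Xstar)                              # columns are k_*
    mu = Ks.T @ np.linalg.solve(Ky, y)                     # eq. (8)
    Sigma = rbf_kernel(Xstar, Xstar) - Ks.T @ np.linalg.solve(Ky, Ks)  # eq. (9)
    return mu, Sigma

# Toy 1-D counting-like data (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])
mu, Sigma = gp_predict(X, y, np.array([[1.5]]))
```

Note that nothing constrains the predictive mean to be a non-negative integer; this is precisely the mismatch with counting numbers that motivates Bayesian Poisson regression.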
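The Type-II maximum likelihood objective (13)-(14) can be evaluated as in the sketch below; the Cholesky factorization is a standard numerical device for computing K_y^{-1} y and log |K_y|, assumed here rather than prescribed by the paper.

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma_n):
    """log p(y|X, theta) of eq. (14):
    -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - N/2 log(2 pi),
    with Ky = K + sigma_n^2 I."""
    N = len(y)
    Ky = K + sigma_n ** 2 * np.eye(N)
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                   # = 1/2 log|Ky|
            - 0.5 * N * np.log(2 * np.pi))
```

The kernel hyperparameters theta (and the noise level sigma_n) would then be chosen by maximizing this quantity, e.g. by gradient ascent.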
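For comparison with the GP model, the standard (non-Bayesian) Poisson regression likelihood described in Section 1, with the count y Poisson-distributed and log-mean linear in x, can be sketched as follows; the weight vector beta and the data in the test are illustrative assumptions.

```python
import math
import numpy as np

def poisson_log_likelihood(X, y, beta):
    """Sum of log p(y_i | x_i) for y_i ~ Poisson(mu(x_i)), where the
    log-mean is linear in the input: mu(x) = exp(x^T beta)."""
    eta = X @ beta        # linear predictor x^T beta
    mu = np.exp(eta)      # exponential link keeps the mean non-negative
    # log Poisson pmf: y log(mu) - mu - log(y!)
    return float(sum(yi * ei - mi - math.lgamma(yi + 1.0)
                     for yi, ei, mi in zip(y, eta, mu)))
```

In classical Poisson regression [3], beta is fit by maximizing this likelihood; the paper instead places a Gaussian prior on the weights and approximates the resulting Bayesian inference in closed form.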