
Appears in IEEE Int’l Conf. on Computer Vision, Kyoto, 2009.

Bayesian Poisson Regression for Crowd Counting

Antoni B. Chan, Department of Computer Science, City University of Hong Kong ([email protected])
Nuno Vasconcelos, Dept. of Electrical and Computer Engineering, University of California, San Diego ([email protected])

Abstract

Poisson regression models the noisy output of a counting function as a Poisson random variable, with a log-mean parameter that is a linear function of the input vector. In this work, we analyze Poisson regression in a Bayesian setting, by introducing a prior distribution on the weights of the linear function. Since exact inference is analytically unobtainable, we derive a closed-form approximation to the predictive distribution of the model. We show that the predictive distribution can be kernelized, enabling the representation of non-linear log-mean functions. We also derive an approximate marginal likelihood that can be optimized to learn the hyperparameters of the kernel. We then relate the proposed approximate Bayesian Poisson regression to Gaussian processes. Finally, we present experimental results using Bayesian Poisson regression for crowd counting from low-level features.

1. Introduction

Recent work [1, 2] on crowd counting using low-level feature regression has shown promise in computer vision. One advantage of these methods is that they bypass intermediate processing stages, such as people detection or tracking, that may be susceptible to problems when the crowd is dense. In [1], the scene is segmented into crowds moving in different directions, and various low-level features are extracted from each crowd segment (e.g. information on the shape, edges, and texture of the segment). The crowd count in each segment is then estimated with a Gaussian process (GP) regression function that maps feature vectors to the crowd size. The experiments in [1] indicate that the counting algorithm is capable of producing accurate counts for a wide range of crowd densities.

One problem with the system of [1] is that it uses GP regression, which models continuous real-valued functions, to predict discrete counting numbers. Because of this mismatch, the regression may not be taking full advantage of Bayesian inference. For example, rounding of the real-valued prediction is not handled in a principled way, e.g. by reducing the confidence when the prediction is far from an integer. In addition, the confidence levels are currently measured in standard deviations, which provides little intuition on the reliability of the estimates; a confidence measure based on posterior probabilities seems more intuitive for counting numbers. Finally, negative outputs of the GP must be truncated to zero, and it is unclear how this affects the optimality of the predictive distribution.

One common method of regression for counting numbers is Poisson regression [3], which models the noisy output of a counting function as a Poisson random variable, with a log-mean parameter that is a linear function of the input vector. This is analogous to standard linear regression, except that the mean is modeled as the exponential of a linear function to ensure non-negative values, and that the noise model is Poisson because the outputs are counting numbers. One way of extending Poisson regression to the Bayesian setting is to adopt a hierarchical model, where the log-mean function is modeled with a standard Gaussian process [4, 5, 6]. These solutions, however, have two disadvantages. First, because of the lack of conjugacy between the Poisson and the GP, [4, 5, 6] must approximate inference with Markov-chain Monte Carlo (MCMC), which limits these algorithms to small datasets. Second, the hierarchical model contains two redundant noise sources: 1) the Poisson-distributed observation noise, and 2) the Gaussian noise of the GP in the log-mean function. These two noise terms model essentially the same thing: the noise in observing the count. A more parsimonious representation would include only the observation noise, while modeling the mean as a deterministic function.

In this work, we analyze the standard Poisson regression model in a Bayesian setting, by adding a Gaussian prior on the weights of the linear log-mean function. Since exact inference is analytically unobtainable, approximate inference is still necessary. However, in contrast to previous work [4, 5, 6], we propose a closed-form approximation to Bayesian inference. The contributions of this paper, with respect to Bayesian Poisson regression (BPR), are five-fold:

1) we derive a closed-form approximation to the predictive distribution for BPR; 2) we kernelize the predictive distribution, enabling the representation of non-linear log-mean functions via kernel functions; 3) we derive an approximate marginal likelihood for optimizing the hyperparameters of the kernel function with Type-II maximum likelihood; 4) we show that the proposed approximation to BPR is related to a Gaussian process with a special non-i.i.d. noise term; 5) finally, we present experimental results that show improvement in crowd counting accuracy when using the proposed model.

The remainder of this paper is organized as follows. In Sections 2 and 3, we briefly review Gaussian processes and Poisson regression. In Section 4, we present the Bayesian framework for Poisson regression, derive a closed-form approximation for the predictive distribution and marginal likelihood, and kernelize the regression model. Finally, in Section 5, we present experimental results on Bayesian Poisson regression for crowd counting.

2. Gaussian process regression

Gaussian process (GP) regression [7] is a Bayesian treatment for predicting a function value f(x) from the input vector x ∈ R^d. Consider the case when f(x) is linear, from which we observe a noisy target y, i.e.

  f(x) = x^T w,    y = f(x) + ε,    (1)

where w ∈ R^d is the weight vector of the linear function, and the observation noise is Gaussian, ε ∼ N(0, σ_n^2). The Bayesian model assumes a prior distribution on the weight vectors, w ∼ N(0, Σ_p), where Σ_p is the covariance matrix of the weight prior.

2.1. Bayesian prediction

Let X = [x_1, ··· , x_N] be the matrix of observed input vectors x_i, and y = [y_1 ··· y_N]^T be the vector of observed outputs y_i. Bayesian inference on (1) is based on the posterior distribution of the weights w, conditioned on the observed data {X, y}, and is computed with Bayes' rule,

  p(w|X, y) = p(y|X, w) p(w) / ∫ p(y|X, w) p(w) dw.    (2)

Since the data-likelihood and weight prior are both Gaussian, (2) is also Gaussian [7],

  p(w|X, y) = G(w | (1/σ_n^2) A^{-1} X y, A^{-1}),    (3)

where G(x|µ, Σ) = (2π)^{-d/2} |Σ|^{-1/2} exp(-(1/2) ||x - µ||_Σ^2) is the equation of a multivariate Gaussian distribution, ||x||_Σ^2 = x^T Σ^{-1} x, and A = (1/σ_n^2) X X^T + Σ_p^{-1}. Finally, given a novel input vector x_∗, the predictive distribution of f_∗ = f(x_∗) is obtained by averaging over all possible model parameterizations, with respect to the posterior distribution of w [7],

  p(f_∗|x_∗, X, y) = ∫ p(f_∗|x_∗, w) p(w|X, y) dw    (4)
                   = G(f_∗ | (1/σ_n^2) x_∗^T A^{-1} X y, x_∗^T A^{-1} x_∗).    (5)

2.2. Kernelized regression

The predictive distribution in (5) can be rewritten to only depend on the inner products between the inputs x_i. Hence, the "kernel trick" can be applied to obtain a kernel version of the Bayesian linear regression. Consider the model

  f(x) = φ(x)^T w,    (6)

where φ(x) is a high-dimensional feature transformation of x from dimension d to D, i.e. φ: R^d → R^D, and w ∈ R^D. Substituting into (5) and applying the matrix inversion lemma, the predictive distribution can be rewritten in terms of the kernel function k(x, x') = φ(x)^T Σ_p φ(x') [7],

  p(f_∗|x_∗, X, y) = G(f_∗ | µ_∗, Σ_∗),    (7)

where the predictive mean and covariance are

  µ_∗ = k_∗^T (K + σ_n^2 I)^{-1} y,    (8)
  Σ_∗ = k(x_∗, x_∗) - k_∗^T (K + σ_n^2 I)^{-1} k_∗,    (9)

and K is the kernel matrix with entries K_ij = k(x_i, x_j), and k_∗ = [k(x_∗, x_1) ··· k(x_∗, x_N)]^T. Hence, non-linear regression is achieved by adopting different positive definite kernel functions. For example, using a linear kernel,

  k_l(x, x') = θ_1^2 (x^T x' + 1) + θ_2^2,    (10)

results in standard Bayesian linear regression, while employing a squared-exponential (RBF) kernel,

  k_r(x, x') = θ_1^2 exp(-||x - x'||^2 / (2 θ_2^2)) + θ_3^2,    (11)

yields Bayesian regression for locally smooth, infinitely differentiable functions. Finally, a compound kernel, such as the RBF-RBF kernel,

  k_rr(x, x') = θ_1^2 exp(-||x - x'||^2 / (2 θ_2^2)) + θ_3^2 exp(-||x - x'||^2 / (2 θ_4^2)) + θ_5^2,    (12)

which contains two RBF functions with different length scales, can simultaneously model both global non-linear trends and local deviations from the trend.

The hyperparameters θ of the kernel function k(x, x') can be learned with Type-II maximum likelihood. The marginal likelihood of the training data {x_i, y_i}_{i=1}^N is maximized with respect to the hyperparameters ([7], Chapter 5),

  p(y|X, θ) = ∫ p(y|w, X, θ) p(w|θ) dw,    (13)
  log p(y|X, θ) = -(1/2) y^T K_y^{-1} y - (1/2) log |K_y| - (N/2) log 2π,    (14)

where K_y = K + σ_n^2 I. More details are available in [7].
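For illustration, the following is a minimal NumPy sketch of the GP predictive equations (8)-(9) with the RBF kernel of (11). The function and variable names (rbf_kernel, gp_predict, theta) are our own choices for this sketch, not code from [7].

```python
import numpy as np

def rbf_kernel(X1, X2, theta):
    # Squared-exponential kernel of Eq. (11): theta = (theta1, theta2, theta3).
    t1, t2, t3 = theta
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return t1**2 * np.exp(-sq / (2 * t2**2)) + t3**2

def gp_predict(X, y, Xstar, theta, sigma_n):
    """Predictive mean and variance of Eqs. (8)-(9).

    X: (N, d) training inputs, y: (N,) targets, Xstar: (M, d) test inputs.
    """
    K = rbf_kernel(X, X, theta) + sigma_n**2 * np.eye(len(X))   # K + sigma_n^2 I
    Ks = rbf_kernel(X, Xstar, theta)                             # N x M cross-kernel
    Kss = rbf_kernel(Xstar, Xstar, theta)
    alpha = np.linalg.solve(K, y)                                # (K + sigma_n^2 I)^{-1} y
    mu = Ks.T @ alpha                                            # Eq. (8)
    var = np.diag(Kss) - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)  # diagonal of Eq. (9)
    return mu, var
```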

3. Regression for counting numbers

While the GP provides a Bayesian framework for regressing to real-valued outputs, it is not clear how to use the GP when the outputs are counting numbers, i.e. non-negative integers, y ∈ Z_+ = {0, 1, 2, ···}. A typical approach to regression for counting functions models the output as a Poisson random variable, where the mean parameter is a function of the input variable. In this section, we review two standard regression methods for counting numbers, Poisson regression and negative binomial regression.

3.1. Poisson regression

Poisson regression [3] models the noisy output y as a Poisson distribution, where the log-mean parameter is a linear function of the input vector x ∈ R^d, i.e.

  λ(x) = x^T β,    µ(x) = e^{λ(x)},    y ∼ Poisson(µ(x)),    (15)

where λ(x) is the log-mean function, µ(x) is the mean function, y ∈ Z_+, and β ∈ R^d is the weight vector. The likelihood of an output y given an input vector x is p(y|x, β) = e^{-µ(x)} µ(x)^y / y!. The mean and the variance of the predictive distribution are equal, i.e. E[y] = var(y) = µ(x), and mode(y) = ⌊µ(x)⌋.

Given a matrix of input vectors X = [x_1 ··· x_N] and a vector of outputs y = [y_1 ··· y_N]^T, the weight vector β can be learned by maximizing the data likelihood, log p(y|X, β), which is concave in β. Poisson regression is an example of a generalized linear model [8], which is a general regression framework when the underlying covariates are linear. Generalized kernel machines, and the resulting kernel Poisson regression, were proposed in [9].

3.2. Negative binomial regression

A Poisson random variable is equidispersed, i.e. the variance is equal to the mean. However, in many cases, the actual random variable is overdispersed, with variance greater than the mean, due to additional factors that are not accounted for by the input x or the model itself. Poisson regression is ill-suited to model overdispersion because it will bias the mean towards the variance, in order to keep the equidispersion property. One popular regression model for overdispersed counts replaces the Poisson noise with a negative binomial [3],

  µ(x) = exp(x^T β),    y ∼ NegBin(µ(x), α),    (16)

where α is the scale parameter of the negative binomial. The likelihood of y given an input vector x is

  p(y|x, β, α) = [Γ(y + α^{-1}) / (Γ(y + 1) Γ(α^{-1}))] p^{α^{-1}} (1 - p)^y,    (17)

where p = α^{-1} / (α^{-1} + µ(x)), and Γ(·) is the gamma function. Note that the negative binomial reduces to a Poisson distribution when α = 0. The mean, variance, and mode of y are

  E[y] = µ(x),    (18)
  var(y) = µ(x)(1 + α µ(x)),    (19)
  mode(y) = (1 - α) µ(x) if α < 1, and 0 if α ≥ 1.    (20)

Hence, for α > 0, the negative binomial has variance larger than that of an equivalent Poisson with mean µ(x). Similar to Poisson regression, the parameters {α, β} of the negative binomial model can be estimated by maximizing the data log-likelihood (see [3] for more details).
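As a concrete illustration of the maximum-likelihood fit in Section 3.1, the sketch below maximizes the concave Poisson log-likelihood with scipy. The helper names (fit_poisson_regression, neg_log_lik) are hypothetical, and the sketch omits any regularization.

```python
import numpy as np
from scipy.optimize import minimize

def fit_poisson_regression(X, y):
    """Maximum-likelihood fit of Eq. (15): lambda(x) = x^T beta, y ~ Poisson(exp(lambda)).

    X: (N, d) matrix of input vectors, y: (N,) vector of counts.
    """
    def neg_log_lik(beta):
        lam = X @ beta                      # log-mean for each input
        # log p(y|X, beta) = sum_i [ y_i * lam_i - exp(lam_i) - log(y_i!) ];
        # the log(y_i!) term is constant in beta and dropped.
        return -(y @ lam - np.exp(lam).sum())

    def grad(beta):
        return -(X.T @ (y - np.exp(X @ beta)))

    beta0 = np.zeros(X.shape[1])
    res = minimize(neg_log_lik, beta0, jac=grad, method="L-BFGS-B")
    return res.x
```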

4. Bayesian Poisson regression

Although both Poisson and negative binomial regression provide methods for regressing a counting function, they do not do so in a Bayesian setting, i.e. by integrating over the posterior distribution of the weight vector β. In this section, we present a Bayesian regression model for counting functions. We adopt the standard Poisson regression model,

  λ(x) = x^T β,    µ(x) = e^{λ(x)},    y ∼ Poisson(µ(x)),    (21)

and introduce a Gaussian prior on the weight vector, β ∼ N(0, Σ_p). The posterior distribution of β, given the training data {X, y}, is computed with Bayes' rule,

  p(β|X, y) = p(y|X, β) p(β) / ∫ p(y|X, β) p(β) dβ.    (22)

However, a closed-form expression for (22) is analytically unobtainable because of the lack of conjugacy between the Poisson likelihood and the Gaussian prior. Instead, we will adopt the approximate posterior distribution of [10]. We will then derive a closed-form expression for the predictive distribution and marginal likelihood, using this approximate posterior distribution, and kernelize both quantities.

4.1. Log-gamma approximation

The approximate posterior distribution of [10] is based on approximating the log-gamma distribution with a Gaussian. Consider a Gamma random variable µ ∼ Gamma(a, b), with distribution

  p(µ|a, b) = (1 / (Γ(a) b^a)) µ^{a-1} e^{-µ/b}.    (23)

The transformed random variable λ = log µ has a log-gamma distribution. It is well known that, for large a, the log-gamma distribution is approximately Gaussian [11, 12],

  λ = log µ ∼ N(log a + log b, a^{-1}).    (24)

Setting b = 1 and a = y ∈ Z_+, (23) becomes

  p(µ|y, 1) = (1 / Γ(y)) µ^{y-1} e^{-µ}.    (25)

The distribution of λ is obtained with the change of variable formula, leading to the following approximation,

  p(λ|y, 1) = p(µ = e^λ|y, 1) ∂e^λ/∂λ    (26)
            = (1 / (y - 1)!) e^{λ y} e^{-e^λ} ≈ G(λ | log y, y^{-1}).    (27)

Figure 1 plots the Gaussian approximation of the log-gamma distribution for different values of y. As y increases, the log-gamma converges to the Gaussian approximation.

Figure 1. Gaussian approximation of the log-gamma distribution for different values of y (curves for y = 1, 5, 20, 50, and the normal reference). The plot is normalized so that the distributions are zero-mean and unit variance.

4.2. Approximate posterior distribution

We now present the approximation to the posterior distribution p(β|X, y). The output y is Poisson, and hence the data-likelihood is

  p(y|X, β) = ∏_{i=1}^N (1 / y_i!) µ(x_i)^{y_i} e^{-µ(x_i)}    (28)
            = ∏_{i=1}^N (1 / (y_i (y_i - 1)!)) e^{λ(x_i) y_i} e^{-e^{λ(x_i)}}.    (29)

Using (27), this can be approximated as [10]

  p(y|X, β) ≈ ∏_{i=1}^N (1/y_i) G(λ(x_i) | log y_i, y_i^{-1})    (30)
            = (|Σ_y|^{1/2} / (2π)^{N/2}) exp(-(1/2) ||X^T β - t||_{Σ_y}^2),    (31)

where Σ_y = diag([1/y_1 ··· 1/y_N]), and t = log(y) is the element-wise logarithm of y. Substituting into (22),

  log p(β|X, y) ∝ log p(y|X, β) + log p(β)    (32)
              ≈ -(1/2) ||X^T β - t||_{Σ_y}^2 - (1/2) ||β||_{Σ_p}^2,    (33)

where we have dropped terms that are not a function of β. Expanding the norm term and completing the square, the posterior distribution is approximately Gaussian,

  p(β|X, y) ≈ G(β | µ̂_β, Σ̂_β),    (34)

with mean and covariance

  µ̂_β = (X Σ_y^{-1} X^T + Σ_p^{-1})^{-1} X Σ_y^{-1} t,    (35)
  Σ̂_β = (X Σ_y^{-1} X^T + Σ_p^{-1})^{-1}.    (36)

This approximate posterior distribution was originally derived in [10]. In the remainder of this section, we extend [10], by deriving an approximation to the predictive distribution and marginal likelihood for Bayesian Poisson regression, and apply the kernel trick to both quantities.

4.3. Bayesian prediction

Given a novel input x_∗, the predictive distribution of the output y_∗ is obtained by averaging over all possible parameters, with respect to the posterior distribution of β,

  p(y_∗|x_∗, X, y) = ∫ p(y_∗|x_∗, β) p(β|X, y) dβ.    (37)

Let us define an intermediate random variable λ_∗ = x_∗^T β. Note that λ_∗ is a linear transformation of β, and that the posterior distribution of β is approximately Gaussian. Hence, the distribution of λ_∗ is also approximately Gaussian,

  p(λ_∗|x_∗, X, y) = G(λ_∗ | µ̂_λ, σ̂_λ^2),    (38)

where

  µ̂_λ = x_∗^T (X Σ_y^{-1} X^T + Σ_p^{-1})^{-1} X Σ_y^{-1} t,    (39)
  σ̂_λ^2 = x_∗^T (X Σ_y^{-1} X^T + Σ_p^{-1})^{-1} x_∗.    (40)

Finally, we can obtain the predictive distribution by integrating over λ_∗,

  p(y_∗|x_∗, X, y) = ∫ p(y_∗|λ_∗) p(λ_∗|x_∗, X, y) dλ_∗,    (41)

where p(y_∗|λ_∗) = e^{-e^{λ_∗}} (e^{λ_∗})^{y_∗} / y_∗! is a Poisson distribution. The integral in (41) does not have an analytic solution, and thus an approximation is necessary.
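To make (35)-(40) concrete, here is a minimal NumPy sketch of the approximate posterior and the Gaussian predictive distribution of λ_∗. The function name bpr_latent_predict is our own, inputs are stored one per column as in the text (d x N), and the counts are assumed positive so that the log-counts are defined.

```python
import numpy as np

def bpr_latent_predict(X, y, Xstar, Sigma_p):
    """Approximate posterior (35)-(36) and lambda_* predictive (39)-(40).

    X: (d, N) training inputs as columns, y: (N,) positive counts,
    Xstar: (d, M) test inputs, Sigma_p: (d, d) prior covariance.
    """
    t = np.log(y)                                   # element-wise log-counts
    Sy_inv = np.diag(y.astype(float))               # Sigma_y^{-1} = diag(y)
    A = X @ Sy_inv @ X.T + np.linalg.inv(Sigma_p)   # posterior precision
    Sigma_beta = np.linalg.inv(A)                   # Eq. (36)
    mu_beta = Sigma_beta @ (X @ Sy_inv @ t)         # Eq. (35)
    mu_lam = Xstar.T @ mu_beta                      # Eq. (39)
    var_lam = np.sum(Xstar * (Sigma_beta @ Xstar), axis=0)  # Eq. (40), one value per test input
    return mu_lam, var_lam
```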

Figure 2. Examples of Bayesian Poisson regression using (a) the linear kernel, and (c) the RBF kernel. The mean parameter e^{µ̂_λ} and the mode are plotted on top of the negative binomial predictive distribution. The corresponding log-mean functions are plotted in (b) and (d).

4.4. Closed-form approximate prediction

To obtain a closed-form approximation to the predictive distribution in (41), we note that we can define a random variable µ_∗ = exp(λ_∗), and hence λ_∗ = log µ_∗. Since λ_∗ is approximately Gaussian, we can use (24) to approximate λ_∗ as a log-gamma random variable, or equivalently µ_∗ as a gamma random variable, µ_∗ ∼ Gamma(â, b̂), where

  â = σ̂_λ^{-2},    b̂ = σ̂_λ^2 e^{µ̂_λ}.    (42)

We can now rewrite the predictive distribution of (41) as the integral over µ_∗,

  p(y_∗|x_∗, X, y) = ∫_0^∞ p(y_∗|µ_∗) p(µ_∗|x_∗, X, y) dµ_∗,    (43)

where p(y_∗|µ_∗) = e^{-µ_∗} µ_∗^{y_∗} / y_∗! is a Poisson distribution, and p(µ_∗|x_∗, X, y) is a gamma distribution. The gamma is the conjugate prior of the Poisson, and thus the integral in (43) can be solved analytically, resulting in a negative binomial distribution [3],

  p(y_∗|x_∗, X, y) = [Γ(â + y_∗) / (Γ(y_∗ + 1) Γ(â))] p̂^{â} (1 - p̂)^{y_∗},    (44)

where p̂ = 1/(1 + b̂) = σ̂_λ^{-2} / (σ̂_λ^{-2} + exp(µ̂_λ)). Hence, the predictive distribution of y_∗ can be approximated as a negative binomial,

  y_∗|x_∗, X, y ∼ NegBin(e^{µ̂_λ}, σ̂_λ^2),    (45)

with mean and scale parameter computed with (39, 40).

4.5. Kernelized regression

Similar to GP regression, we can extend BPR to represent non-linear log-mean functions using the kernel trick. Given a high-dimensional feature transformation φ(x), the log-mean function is

  λ(x) = φ(x)^T β.    (46)

Rewriting (39, 40) in terms of φ(x) and applying the matrix inversion lemma, the parameters of the λ_∗ distribution can be computed using a kernel function,

  µ̂_λ = k_∗^T (K + Σ_y)^{-1} t,    (47)
  σ̂_λ^2 = k(x_∗, x_∗) - k_∗^T (K + Σ_y)^{-1} k_∗,    (48)

where k(·, ·), K, and k_∗ are defined as in Section 2.2. After computing (47, 48), the predictive distribution is still (45).

The hyperparameters θ of the kernel k(x, x') can be learned, in a manner similar to the GP, by maximizing the marginal likelihood p(y|X, θ). Using the log-gamma approximation in (31), p(y|X, θ) is approximated with

  log p(y|X, θ) ∝ -(1/2) log |K + Σ_y| - (1/2) t^T (K + Σ_y)^{-1} t.    (49)

Figure 2 presents two examples of learning a BPR function by maximizing the marginal likelihood. Two different kernels were used, the linear kernel and the RBF kernel, and the predictive distributions are shown in Figures 2a and 2c, respectively. The corresponding log-mean functions are plotted in Figures 2b and 2d. While the linear kernel can only model an exponential trend in the data, the RBF kernel is capable of adapting to the local deviations in the function.
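The sketch below assembles the kernelized prediction of (45), (47), (48) into a negative binomial predictive distribution with scipy. The NegBin(mean, scale) parameterization of (16) is converted to scipy's (n, p) convention, and the function name bpr_kernel_predict is our own illustrative choice.

```python
import numpy as np
from scipy.stats import nbinom

def bpr_kernel_predict(K, kstar, kss, y):
    """Kernelized BPR prediction: Eqs. (47)-(48), then the NegBin of Eq. (45).

    K: (N, N) kernel matrix, kstar: (N,) kernel between training and test input,
    kss: scalar k(x*, x*), y: (N,) positive training counts.
    """
    t = np.log(y)                                        # log-counts
    C = K + np.diag(1.0 / y)                             # K + Sigma_y
    mu_lam = kstar @ np.linalg.solve(C, t)               # Eq. (47)
    var_lam = kss - kstar @ np.linalg.solve(C, kstar)    # Eq. (48), assumed positive
    # Eq. (45): y* ~ NegBin(mean = exp(mu_lam), scale alpha = var_lam).
    # Convert to scipy's (n, p): n = 1/alpha, p = n / (n + mean),
    # which gives mean = n(1-p)/p and variance = mean * (1 + alpha * mean), matching (18)-(19).
    mean = np.exp(mu_lam)
    n = 1.0 / var_lam
    p = n / (n + mean)
    return nbinom(n, p)
```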

4.6. Relationship with Gaussian processes

We now relate the proposed approximate Bayesian Poisson regression to Gaussian processes. The equations for the parameters of the approximate λ_∗ distribution, µ̂_λ and σ̂_λ^2 in (47, 48), are almost identical to those of the GP predictive distribution, µ_∗ and Σ_∗ in (8, 9). There are two main differences. First, while the GP noise term in (9) is i.i.d. (σ_n^2 I), the noise term of BPR in (48) depends on the output values (Σ_y = diag([1/y_1 ··· 1/y_N])). This is a consequence of assuming a Poisson noise model. Second, the predictive mean µ̂_λ in (47) is computed using the log-counts t, rather than the counts y, as with the GP in (8).

Hence, we have the following interpretation of approximate Bayesian prediction for Poisson regression: given observed data {X, y} and a novel input x_∗, the approximation models the predictive distribution of the log-mean λ_∗ as a Gaussian process with non-i.i.d. observation noise of covariance Σ_y = diag([1/y_1 ··· 1/y_N]), learned from the log-data {X, log y}. Given the distribution of λ_∗, the predictive distribution for y_∗ is a negative binomial with mean e^{µ̂_λ} and scale parameter σ̂_λ^2. Note that the variance of λ_∗ plays the role of the scale parameter of the negative binomial. Hence, increased uncertainty in estimating λ_∗ with a GP leads to increased uncertainty in the y_∗ prediction.

The approximation to the BPR marginal likelihood in (49) differs from that of the GP in (14) in a similar manner as above, and hence we have a similar interpretation. In summary, we have shown that the proposed approximation to BPR is based on assuming a GP prior on the log-mean parameter of the Poisson output distribution. The GP prior uses a special noise term, which approximates the uncertainty that arises from the Poisson noise model. This is in contrast to other methods [4, 5, 6] that assume the standard i.i.d. Gaussian noise in the GP prior.

5. Crowd Counting Experiments

In this section, we present experimental results on crowd counting using the proposed Bayesian Poisson regression. We use the crowd video database introduced in [1], which contains 4000 frames of video of a pedestrian walkway with a large number of moving people. The database is annotated with two crowd motion classes, "away" from or "towards" the camera, and the goal is to count the number of people in each motion class. The database was split into a training set of 1200 frames for learning the regression function, and a test set of 2800 frames.

5.1. Experimental Setup

We use the crowd counting system from [1] to compare different regression functions. The crowd was segmented into the two motion classes using the mixture of dynamic textures [13]. A feature vector, composed of the 29 perspective-normalized features described in [1], was extracted from each crowd segment in each video frame. The feature vectors were normalized so that each dimension had zero mean and unit variance, based on the statistics from the training frames. A Bayesian Poisson regression function was learned from the training frames, using the linear kernel in (10) and the RBF-RBF kernel in (12), which we denote "BPR-l" and "BPR-rr", respectively. For comparison, a GP regression function was also trained using the linear and RBF-RBF kernels (GPR-l and GPR-rr). A standard linear least-squares regression function and a Poisson regression function were also learned.

For BPR, the count estimate is the mode of the predictive distribution. For the GP, the count estimate is obtained by rounding the predictive mean to the nearest non-negative integer. The quality of the count estimates is evaluated with the mean-squared error, MSE = (1/M) Σ_{i=1}^M (ĉ_i - c_i)^2, and the absolute error, err = (1/M) Σ_{i=1}^M |ĉ_i - c_i|, between the count estimate ĉ_i and the ground-truth count c_i, averaged over the M test frames.
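A minimal sketch of this evaluation: the count estimate is taken as the mode of the negative binomial predictive distribution, following (20), rounded down to an integer count (the flooring is our own implementation choice), and the MSE and absolute error are averaged over the test frames. The function names are hypothetical.

```python
import numpy as np

def negbin_mode(mean, alpha):
    # Mode of NegBin(mean, alpha) from Eq. (20), floored to an integer count:
    # floor((1 - alpha) * mean) if alpha < 1, else 0.
    return np.where(alpha < 1.0, np.floor((1.0 - alpha) * mean), 0.0)

def count_errors(pred_mean, pred_alpha, true_counts):
    """MSE and absolute error between predicted and ground-truth counts."""
    c_hat = negbin_mode(pred_mean, pred_alpha)
    mse = np.mean((c_hat - true_counts) ** 2)
    err = np.mean(np.abs(c_hat - true_counts))
    return mse, err
```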
5.2. Experimental Results

Table 1 presents the counting error rates for the various regression functions.

Table 1. Comparison of regression functions for crowd counting.

  Method   | MSE (away) | MSE (towards) | err (away) | err (towards)
  Poisson  |   3.1518   |    3.1179     |   1.3975   |    1.3750
  BPR-l    |   3.0814   |    2.0936     |   1.3700   |    1.1686
  BPR-rr   |   2.4675   |    2.0246     |   1.2154   |    1.1375
  linear   |   3.3493   |    2.8718     |   1.4521   |    1.3304
  GPR-l    |   3.2786   |    2.6929     |   1.4371   |    1.2800
  GPR-rr   |   3.1725   |    2.0896     |   1.4561   |    1.1011

For Poisson regression, the MSE improves when using the Bayesian framework, decreasing from 3.152/3.118 (away/towards) to 3.081/2.094 for linear BPR. The MSE further decreases to 2.468/2.025 when non-linear trends in the log-mean function are modeled with the RBF-RBF kernel (BPR-rr). Comparing the two Bayesian regression models with linear kernels, BPR-l outperforms GPR-l on both classes (MSE of 3.081/2.094 vs. 3.279/2.693). In the non-linear case, BPR-rr has a significantly lower MSE than GPR-rr on the "away" class (2.468 vs. 3.173), but shows only a slight improvement on the "towards" class (2.025 vs. 2.090). This indicates that BPR improves the cases where GPR tends to have larger error.

We also measured the test error while varying the size of the training set, by picking a subset of the original training set. Figure 3 plots the MSE versus the training size. Overall, the Bayesian methods (BPR and GPR) are more robust when the training set is small, compared with standard linear or Poisson regression. This indicates that, in practice, a system could be trained with fewer examples, thus reducing the number of images that need to be annotated by hand.

Figure 3. Error rate for training sets of different sizes for the (left) "away" crowd, and (right) "towards" crowd.

Figure 4 plots the BPR-rr predictions and the true counts for the "away" and "towards" crowds. The predictions track the true counts in most of the test frames, with some errors occurring due to outliers in the video (e.g. bicycles and skateboarders). Finally, Figure 5 presents the original image, segmentation, and crowd estimates for several test frames.

Figure 4. Crowd counting results over both the training and test sets for: (top) "away" crowd, and (bottom) "toward" crowd. The gray bars show the one standard-deviation error bars of the predictive distribution.

Figure 5. Crowd counting examples: the red and green segments are the "away" and "towards" crowds. The estimated crowd count for each segment is shown in the top-left, with the (standard-deviation of the Bayesian prediction) and the [ground-truth]. The ROI is also highlighted.

6. Conclusions

In this paper, we have proposed an approximation to Bayesian Poisson regression for modeling counting functions. We derived a closed-form approximation to the predictive distribution of the model, and showed that the model can be kernelized, enabling the representation of non-linear log-mean functions. We also proposed an approximation to the marginal likelihood, for learning the kernel hyperparameters via Type-II maximum likelihood. The proposed approximation is related to a Gaussian process with a special non-i.i.d. noise term that approximates the Poisson output noise. Finally, we applied BPR to feature-based crowd counting, and improved on the results obtained with GPR.

References

[1] A. B. Chan, Z. S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in CVPR, 2008.
[2] D. Kong, D. Gray, and H. Tao, "Counting pedestrians in crowds using viewpoint invariant training," in BMVC, 2005.
[3] A. C. Cameron and P. K. Trivedi, Regression Analysis of Count Data. Cambridge Univ. Press, 1998.
[4] P. J. Diggle, J. A. Tawn, and R. A. Moyeed, "Model-based geostatistics," Applied Statistics, vol. 47, no. 3, pp. 299–350, 1998.
[5] C. J. Paciorek and M. J. Schervish, "Nonstationary covariance functions for Gaussian process regression," in NIPS, 2004.
[6] J. Vanhatalo and A. Vehtari, "Sparse log Gaussian processes via MCMC for spatial epidemiology," in JMLR Workshop and Conference Proceedings, 2007, pp. 73–89.
[7] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[8] J. A. Nelder and R. W. M. Wedderburn, "Generalized linear models," J. of the Royal Statistical Society, Series A, vol. 135, pp. 370–84, 1972.
[9] G. C. Cawley, G. J. Janacek, and N. L. C. Talbot, "Generalised kernel machines," in Intl. Joint Conf. on Neural Networks, 2007, pp. 1720–25.
[10] G. M. El-Sayyad, "Bayesian and classical analysis of Poisson regression," J. of the Royal Statistical Society, Series B (Methodological), vol. 35, no. 3, pp. 445–51, 1973.
[11] M. S. Bartlett and D. G. Kendall, "The statistical analysis of variance-heterogeneity and the logarithmic transformation," Supplement to the J. of the Royal Statistical Society, vol. 8, no. 1, pp. 128–38, 1946.
[12] R. L. Prentice, "A log gamma model and its maximum likelihood estimation," Biometrika, vol. 61, no. 3, pp. 539–44, 1974.
[13] A. B. Chan and N. Vasconcelos, "Modeling, clustering, and segmenting video with mixtures of dynamic textures," IEEE Trans. on PAMI, vol. 30, no. 5, pp. 909–926, May 2008.