
Color Constancy Beyond Bags of Pixels

Ayan Chakrabarti, Keigo Hirakawa, Todd Zickler
Harvard School of Engineering and Applied Sciences
[email protected] [email protected] [email protected]

Abstract

Estimating the color of a scene illuminant often plays a central role in computational color constancy. While this problem has received significant attention, the methods that exist do not maximally leverage the spatial dependencies between pixels. Indeed, most methods treat the observed color (or its spatial derivative) at each pixel independently of its neighbors. We propose an alternative approach to illuminant estimation, one that employs an explicit statistical model to capture the spatial dependencies between pixels induced by the surfaces they observe. The parameters of this model are estimated from a training set of natural images captured under canonical illumination, and for a new image, an appropriate transform is found such that the corrected image best fits our model.

0 0 rected image best fits our model. 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 Red (c) (d)

1. Introduction

Color is useful for characterizing objects only if we have a representation that is unaffected by changes in scene illumination. As the spectral content of an illuminant changes, so does the spectral radiance emitted by surfaces in a scene, and so do the spectral observations collected by a tri-chromatic sensor. For color to be of practical value, we require the ability to compute color descriptors that are invariant to these changes.

[Figure 1. Color distributions under changing illumination. Images (a,b) were generated synthetically from a hyper-spectral reflectance image [12], standard color filters and two different illuminant spectra. Shown are scatter plots of the red and green values of (c,d) individual pixels, and (e,f) 8×8 image patches projected onto a particular spatial basis vector. Lines in (c-f) correspond to the illuminant direction. The distribution of individual pixels does not disambiguate between the dominant colors in the image and the color of the illuminant.]

As a first step, we often consider the case in which the spectrum of the illumination is uniform across the scene. Here, the task is to compute a mapping from an input color image y(n) to an illuminant-invariant representation x(n). What makes the task difficult is that we do not know the input illuminant a priori.

The task of computing invariant color representations has received significant attention under a variety of titles, including color constancy, illuminant estimation, chromatic adaptation, and white balance. Many methods exist, and almost all of them leverage the assumed independence of each pixel. According to this paradigm, spatial information is discarded, and each pixel in a natural image is modeled as an independent draw. The well-known grey world hypothesis is a good example; it simply states that the expected reflectance in an image is achromatic [14]. A wide variety of more sophisticated techniques take this approach as well. Methods based on the dichromatic model [10], gamut mapping [8, 11], color by correlation [9], Bayesian inference [1], neural networks [3], and the grey edge hypothesis [18] are distinct in terms of the computational techniques they employ, but they all discard spatial information and effectively treat images as "bags of pixels."

Bag-of-pixels methods depend on the statistical distributions of individual pixels and ignore their spatial contexts. Such distributions convey only meager illuminant information, however, because the expected behavior of the models is counterbalanced by the strong dependencies between nearby pixels. This is demonstrated in Figure 1(c,d), for example, where it is clearly difficult to infer the illuminant direction with high precision.

In this paper, we break from the bag-of-pixels paradigm by building an explicit statistical model of the spatial dependencies between nearby image points. These image dependencies echo those of the spatially-varying reflectance [17] of an observed scene, and we show that they can be exploited to distinguish the illuminant from the natural variability of the scene (Figure 1(e,f)).

We describe an efficient method for inferring scene illumination by examining the statistics of natural color images in the spatio-spectral sense. These statistics are learned from images collected under a known illuminant. Then, given an input image captured under an unknown illuminant, we can map it to its invariant (canonical) representation by fitting it to the learned model. Our results suggest that exploiting spatial information in this way can significantly improve our ability to achieve chromatic adaptation.

The rest of this paper is organized as follows. We begin with a brief review of a standard image formation model in Section 2. A statistical model for a single color image patch is introduced in Section 3, and the optimal corrective transform for the illuminant is found via model-fitting in Section 4. The proposed model is empirically verified using available training data in Section 5.

2. Background: Image Formation

We assume a Lambertian model where x : R × Z^2 → [0, 1] is the diffuse reflectivity of a surface corresponding to the image pixel location n ∈ Z^2, as a function of the electromagnetic wavelength λ ∈ R in the visible range. The tri-stimulus value recorded by a color imaging device is

    y(n) = \int f(\lambda)\, \ell(\lambda)\, x(\lambda, n)\, d\lambda,        (1)

where y(n) = [y^{(1)}(n), y^{(2)}(n), y^{(3)}(n)]^T is the tri-stimulus (e.g., RGB) value at pixel location n corresponding to the color matching functions f(λ) = [f^{(1)}(λ), f^{(2)}(λ), f^{(3)}(λ)]^T, with f^{(i)} : R → [0, 1], and ℓ : R → R is the spectrum of the illuminant.

Our task is to map a color image y(n) taken under an unknown illuminant to an illuminant-invariant representation x(n).¹ In general, this computational chromatic adaptation problem is ill-posed. To make it tractable, we make the standard assumption that the mapping from y to x is algebraic/linear, and furthermore that it is a diagonal transform (in RGB or some other linear color space [7]). This assumption effectively imposes joint restrictions on the color matching functions, the scene reflectivities, and the illuminant spectra [5, 19]. Under this assumption of (generalized) diagonal transforms, we can write

    y(n) = L\, x(n),        (2)

where L = diag(ℓ), ℓ ∈ R^3, x(n) = [x^{(1)}(n), x^{(2)}(n), x^{(3)}(n)]^T ∈ [0, 1]^3, and f is implicit in the algebraic constraints imposed.

¹ For convenience, we refer to x(n) as the reflectance image and to ℓ as the illuminant color. In practice these may be, respectively, the image under a canonical illuminant and the entries of a diagonal "relighting transform". These interpretations are mathematically equivalent.
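To make the diagonal model concrete, the following is a minimal Python sketch of relighting and correcting an image under equation (2); the image shape and the illuminant vector are illustrative assumptions, not values from the paper.

    import numpy as np

    def relight(x, ell):
        # y(n) = L x(n) with L = diag(ell): scale each channel of an
        # H x W x 3 reflectance image by the illuminant color.
        return x * np.asarray(ell, dtype=float)[None, None, :]

    def correct(y, ell):
        # Invert the diagonal transform to recover the reflectance image.
        return y / np.asarray(ell, dtype=float)[None, None, :]

    x = np.random.rand(4, 4, 3)        # stand-in reflectance image
    ell = [0.9, 1.0, 0.6]              # hypothetical yellowish illuminant
    assert np.allclose(correct(relight(x, ell), ell), x)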
3. Spatio-Spectral Analysis

Studies in photometry have established that the diffuse reflectivities of real-world materials, as functions of λ, are typically smooth and can be taken to live in a low-dimensional linear subspace [15]. That is, x(λ, n) = \sum_{t=0}^{T-1} φ_t(λ) c_t(n), where the φ_t : R → R are basis functions and the c_t(n) ∈ R are the corresponding coefficients that describe the reflectivity at location n. Empirically, we observe that the baseband reflectance φ_0 is constant across all λ (φ_0(λ) = φ_0), and that the spatial variance along this dimension (i.e., the variance in c_0(n)) is disproportionately larger than that along the rest.

The color image y can thus be written as a sum of baseline and residual images:

    y(n) = y_{lum}(n) + y_{chr}(n),
    y_{lum}(n) = \int f(\lambda)\, \ell(\lambda)\, \phi_0\, c_0(n)\, d\lambda = \ell\, \phi_0\, c_0(n),
    y_{chr}(n) = \int f(\lambda)\, \ell(\lambda) \sum_{t=1}^{T-1} \phi_t(\lambda)\, c_t(n)\, d\lambda.        (3)

Here, the baseline "luminance" image y_lum contains the majority of the energy in y and is proportional to the illuminant color ℓ ∈ R^3; we see from Figure 2 that y_lum marks the inter-object boundaries and intra-object textures. The residual "chrominance" image y_chr describes the deviation from the baseline intensity image, capturing the color variations in reflectance. Also, unlike the luminance image, it is largely void of high spatial frequency content.

[Figure 2. Decomposition of (left column) a color image y into (middle column) luminance y_lum and (right column) chrominance y_chr components. The log-magnitudes of the Fourier coefficients (bottom row) correspond to the images in the top row, respectively. Owing to the edge and texture information that comprise the luminance image, luminance dominates chrominance in the high-pass components of y.]

Existing literature in signal processing provides additional evidence that y_chr is generally a low-pass signal. For instance, Gunturk et al. [13] have shown that the Pearson product-moment correlation coefficient is typically above 0.9 for the high-pass components of y^{(1)}, y^{(2)}, and y^{(3)}, suggesting that y_lum dominates the high-pass components of y. Figure 2 also illustrates the Fourier support of a typical color image taken under a canonical illuminant, clearly confirming the band-limitedness of y_chr. These observations are consistent with the contrast sensitivity function of human vision [14] (but see [16]), as well as with the notion that the scene reflectivity x(λ, n) is spatially coherent, with a high concentration of energy at low spatial frequencies.

All of this suggests that decomposing images by spatial frequency can aid in illuminant estimation. High-pass coefficients of an image y will be dominated by contributions from the luminance image y_lum, and the contribution of y_chr (and thus of the scene chrominance) will be limited. Since the luminance image y_lum provides direct information about the illuminant color (equation (3)), so too will the high-pass image coefficients. This is demonstrated in Figure 1(e,f), which shows the color of 8×8 image patches projected onto a high-pass spatial basis function.

In subsequent sections, we develop a method to exploit the "extra information" available in (high-pass coefficients of) spatial image patches.
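This decomposition can be probed numerically with a toy check (illustrative only, and not part of the original implementation): approximate the baseline color direction by the mean image color (a crude stand-in for ℓφ_0) and use finite differences as a rough high-pass filter. On natural images, the chrominance residual should carry far less high-frequency energy than the luminance part.

    import numpy as np

    def lum_chr_split(y):
        # Project each pixel of an H x W x 3 image onto a single baseline
        # color direction d and return the projection ("luminance") and
        # the residual ("chrominance").
        d = y.reshape(-1, 3).mean(axis=0)
        d /= np.linalg.norm(d)
        y_lum = (y @ d)[..., None] * d
        return y_lum, y - y_lum

    def highpass_energy(img):
        # Crude high-pass: energy of finite differences along both axes.
        return (np.diff(img, axis=0) ** 2).sum() + (np.diff(img, axis=1) ** 2).sum()

    y = np.random.rand(64, 64, 3)      # substitute a real image here
    y_lum, y_chr = lum_chr_split(y)
    print(highpass_energy(y_lum), highpass_energy(y_chr))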

3.1. Statistical Model

We seek to develop a statistical model for a √K × √K color patch, where X^{(1)}, X^{(2)}, X^{(3)} ∈ R^K are cropped from x^{(1)}(n), x^{(2)}(n), x^{(3)}(n) respectively. Rather than using a general model for patches of size √K × √K × 3, we employ a spatial decorrelating basis and represent such patches as a mutually independent collection of K three-vectors in terms of this basis. We use the discrete cosine transform (DCT) here, but the discrete wavelet transform (DWT), steerable pyramids, curvelets, etc., are other common transform domains that could also be used. This gives us a set of basis vectors {D_k}_{k=0,...,K-1} ⊂ R^K, where without loss of generality D_0 can be taken to correspond to the lowest frequency component, or DC.

By using this decorrelating basis, modeling the distribution of color image patches X reduces to modeling the distribution of the three-vectors D_k^T X for all k, where D_k^T computes the response of each of X^{(1)}, X^{(2)} and X^{(3)} to D_k, such that

    D_k^T X = [\, D_k^T X^{(1)},\; D_k^T X^{(2)},\; D_k^T X^{(3)} \,]^T.        (4)

The DC component of natural images is known to have a near-uniform distribution [4]. The remaining components are modeled as Gaussian. Formally,

    D_0^T X ~ U(\nu_{min}, \nu_{max})  (i.i.d.),
    D_k^T X ~ N(0, \Lambda_k)  (i.i.d.),  k > 0,        (5)

where Λ_k = E[(D_k^T X)(D_k^T X)^T] and [ν_min, ν_max] is the range of the DC coefficients. The probability of the entire reflectance image patch is then given by

    P(X) ∝ \prod_{k>0} \det(\Lambda_k)^{-1/2} \exp\left( -\tfrac{1}{2}\, (D_k^T X)^T \Lambda_k^{-1} (D_k^T X) \right).        (6)

We can gain further insight by looking at the sample covariance matrices {Λ_k} computed from a set of natural images taken under a single (canonical) illuminant. The eigenvectors of Λ_k represent directions in tri-stimulus space, and Figure 3 visualizes these directions for three choices of k. For all k > 0, we find that the most significant eigenvector is achromatic, and that the corresponding eigenvalue is significantly larger than the other two. This is consistent with the scatter plots in Figure 1, where the distributions have a highly eccentric elliptical shape that is aligned with the illuminant direction.

[Figure 3. Eigenvectors of the covariance matrices Λ_k, shown for k = 1, 9, 59. The pattern in each patch corresponds to a basis vector used for spatial decorrelation (in this case a DCT filter), and the colors represent the eigenvectors of the corresponding Λ_k. The right-most column contains the most significant eigenvectors, which are found to be achromatic.]
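The model parameters can be estimated directly from training patches. Below is a minimal sketch, assuming patches cropped from color-corrected images and SciPy's 2-D DCT as the decorrelating basis; the array layout, function names, and the small regularizer eps are our assumptions rather than the paper's specification.

    import numpy as np
    from scipy.fft import dctn

    def learn_covariances(patches):
        # patches: N x p x p x 3 reflectance patches (canonical light).
        # Returns Lambda_k = E[(D_k^T X)(D_k^T X)^T] for every DCT index k.
        n, p = patches.shape[0], patches.shape[1]
        v = dctn(patches, axes=(1, 2), norm='ortho').reshape(n, p * p, 3)
        return np.einsum('nki,nkj->kij', v, v) / n       # (p*p) x 3 x 3

    def log_likelihood(patch, lambdas, eps=1e-9):
        # Log of equation (6) up to a constant; the DC term (k = 0) is
        # modeled as uniform and therefore skipped.
        v = dctn(patch, axes=(0, 1), norm='ortho').reshape(-1, 3)
        ll = 0.0
        for k in range(1, v.shape[0]):
            cov = lambdas[k] + eps * np.eye(3)           # small regularizer
            ll -= 0.5 * (v[k] @ np.linalg.solve(cov, v[k])
                         + np.log(np.linalg.det(cov)))
        return ll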

4. Estimation Algorithm

In the previous section, a statistical model for a single color patch was proposed. The parameters of this model can be learned, for example, from a training set of natural images with canonical illumination. In this section, we develop a method for color constancy that breaks an image into a "bag of patches" and then attempts to fit these patches to such a learned model.

Let diag(w), with w = [1/ℓ^{(1)}, 1/ℓ^{(2)}, 1/ℓ^{(3)}]^T, represent the diagonal transform that maps the observed image to the reflectance image (or image under a canonical illumination). Dividing the observed image into a set of overlapping patches {Y_j}, we wish to find the set of patches {X̂_j(w)} that best fit the learned model from the previous section (in terms of log-likelihood), such that for all j, X̂_j is related to Y_j as

    \hat{X}_j(w) = [\, w^{(1)} Y_j^{(1)T},\; w^{(2)} Y_j^{(2)T},\; w^{(3)} Y_j^{(3)T} \,]^T.        (7)

We choose to estimate w by model-fitting as follows:

    w = \arg\max_{w'} \sum_j \log P\big( \hat{X}_j(w') \big).        (8)

It is clear that (8) always admits the solution w = 0. We therefore add the constraint that w^T w = 3 (so that w = [1, 1, 1]^T when Y is taken under a canonical illumination).
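The objective in (8) can be evaluated directly for any candidate w by rescaling each patch as in (7) and summing the learned log-likelihoods. The sketch below reuses log_likelihood() and the lambdas array from the previous sketch, and is again an illustration rather than the paper's code.

    import numpy as np

    def objective(w, patches, lambdas):
        # Sum of patch log-likelihoods for the corrected patches of eq. (7).
        w = np.asarray(w, dtype=float)
        return sum(log_likelihood(patch * w[None, None, :], lambdas)
                   for patch in patches)

    w = np.array([1.2, 1.0, 0.7])
    w *= np.sqrt(3) / np.linalg.norm(w)      # enforce the constraint w^T w = 3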

This constrained optimization problem admits a closed-form solution. To see this, let the eigenvectors and eigenvalues of Λ_k be given by V_kh = [V_kh^{(1)}, V_kh^{(2)}, V_kh^{(3)}]^T and σ_kh^2, for h ∈ {1, 2, 3}. Then equation (8) simplifies to

    w = \arg\min_{w'} \sum_{j,k>0,h} \frac{1}{2\sigma_{kh}^2} \left( w'^{(1)} V_{kh}^{(1)} D_k^T Y_j^{(1)} + w'^{(2)} V_{kh}^{(2)} D_k^T Y_j^{(2)} + w'^{(3)} V_{kh}^{(3)} D_k^T Y_j^{(3)} \right)^2
      = \arg\min_{w'} \sum_{j,k>0,h} \frac{1}{2\sigma_{kh}^2}\, w'^T a_{jkh}\, a_{jkh}^T w'
      = \arg\min_{w'}\; w'^T A\, w',        (9)

subject to w^T w = 3, where

    a_{jkh} = \left[\, V_{kh}^{(1)} D_k^T Y_j^{(1)},\; V_{kh}^{(2)} D_k^T Y_j^{(2)},\; V_{kh}^{(3)} D_k^T Y_j^{(3)} \,\right]^T,
    A = \sum_{j,k>0,h} \frac{a_{jkh}\, a_{jkh}^T}{2\sigma_{kh}^2}.        (10)

The solution can now be found by an eigen-decomposition of A. Note that the equivalue contours of w'^T A w' are ellipsoids of increasing size whose axes are given by the eigenvectors of A. Therefore, the point where the smallest ellipsoid touches the sphere w^T w = 3 lies along the major axis, i.e., the eigenvector e of A that corresponds to the minimum eigenvalue. The solution to (8) is then given by √3 e. This is illustrated in Figure 4.

[Figure 4. The concentric ellipses correspond to the equivalue contours of w'^T A w'. The optimal point on the sphere w^T w = 3 therefore lies on the major axis of these ellipses.]
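This closed-form solution is straightforward to implement. The sketch below, under the same assumptions as the earlier sketches, accumulates A from equation (10) and returns √3 times the minimum-eigenvalue eigenvector of A.

    import numpy as np
    from scipy.fft import dctn

    def estimate_w(patches, lambdas, eps=1e-9):
        K = lambdas.shape[0]
        # Pre-compute the eigen-decomposition of every Lambda_k once.
        eig = [np.linalg.eigh(lambdas[k] + eps * np.eye(3)) for k in range(K)]
        A = np.zeros((3, 3))
        for patch in patches:
            r = dctn(patch, axes=(0, 1), norm='ortho').reshape(-1, 3)
            for k in range(1, K):                    # skip the DC term
                sig2, V = eig[k]
                for h in range(3):
                    a = V[:, h] * r[k]               # a_jkh of equation (10)
                    A += np.outer(a, a) / (2.0 * sig2[h])
        _, vecs = np.linalg.eigh(A)                  # eigenvalues in ascending order
        w = np.sqrt(3.0) * vecs[:, 0]                # minimum-eigenvalue direction
        return w if w.sum() > 0 else -w              # resolve the sign ambiguity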

5. Experimental Results

In this section, we evaluate the performance of the proposed method on a database collected specifically for color constancy research [6]. While this database suffers from a variety of non-idealities (JPEG artifacts, demosaicking, non-linear effects such as gamma correction, etc.), it is frequently used in the literature to measure performance and therefore provides a useful benchmark [6, 18]. The database contains a large number of images captured in different conditions. Every image has a small grey sphere in the bottom right corner that provides the "ground truth". Since the sphere is known to be perfectly grey, its mean color (or rather, the mean color of the 5% brightest pixels, to account for the sphere being partially in shadow) is taken to be the color of the illuminant.

Training was done on all overlapping patches in a set of 100 images that were color corrected based on the sphere; i.e., for each image the illuminant was estimated from the sphere and then every pixel was diagonally transformed by the inverse of the illuminant. The patch size was chosen to be 8×8 and the DCT was used for spatial decorrelation. For "relighting" images, we chose to apply diagonal transforms directly in RGB color space, and it is important to keep in mind that the results would likely improve (for all methods we consider) by first "sharpening" the color matching functions (e.g., [5, 7]).

The performance of the estimation algorithm was evaluated on 20 images from the same database. These images were chosen a priori such that they did not represent any of the scenes used in training, and also such that the sphere was approximately under the same illumination as the rest of the scene. The proposed algorithm was compared with the Grey-World [2] and Grey-Edge [18] methods. An implementation provided by the authors of [18] was used for both of these methods, and for Grey-Edge the parameters described in [18] as performing best were chosen (i.e., second-order edges, a Minkowski norm of 7, and a smoothing standard deviation of 5). For all algorithms, the right portion of the image was masked out so that the sphere would not be included in the estimation process. The angular deviation of the sphere color in the corrected image from [1 1 1]^T was chosen as the error metric.

[Figure 5. A selection of images from the test set corrected by the different methods, with the corresponding angular errors. Columns: un-processed image, Grey-World, Grey-Edge, proposed method. Errors: (a) 10.4°, 3.3°, 0.98°; (b) 3.1°, 8.1°, 1.5°; (c) 4.1°, 5.5°, 1.2°; (d) 0.55°, 0.48°, 1.7°.]
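For concreteness, the error metric can be computed as follows (a small sketch; rgb stands for the mean sphere color in the corrected image and is an assumed input):

    import numpy as np

    def angular_error_deg(rgb):
        # Angle between the corrected sphere color and [1 1 1]^T, in degrees.
        rgb = np.asarray(rgb, dtype=float)
        grey = np.ones(3) / np.sqrt(3.0)
        cos = rgb @ grey / np.linalg.norm(rgb)
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    print(angular_error_deg([0.5, 0.5, 0.5]))        # 0.0 for a perfectly grey sphere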

Table 1 shows the angular errors for each of the three algorithms on all 20 test images.

Table 1. Angular errors for the different color constancy algorithms.

     #    Grey-World [2]    Grey-Edge [18]    Proposed
     1        7.4°              5.9°            2.5°
     2        4.1°              5.5°            1.2°
     3       10.4°              3.3°            0.98°
     4        3.1°              8.1°            1.5°
     5       11.3°              1.8°            0.31°
     6        4.3°              1.8°            0.90°
     7        2.2°              4.2°            1.4°
     8        4.4°              1.9°            1.2°
     9        3.3°              1.7°            1.1°
    10        2.6°              0.91°           0.42°
    11        4.4°              1.9°            1.7°
    12        2.5°              3.6°            2.6°
    13        2.4°              2.1°            2.6°
    14        4.6°              0.80°           1.4°
    15       14.7°              6.8°            7.7°
    16        7.2°              2.2°            3.1°
    17       13.7°              0.96°           1.9°
    18        6.9°              3.1°            4.3°
    19        0.55°             0.48°           1.7°
    20        3.9°              0.08°           2.2°
    Mean      5.7°              2.9°            2.0°

The proposed method does better than Grey-World in 17 and better than Grey-Edge in 12 of the 20 images. Some of the actual color corrected images are shown in Figure 5. In Figure 5(a-c), the proposed method outperforms both Grey-World and Grey-Edge. In the first case, we see that since the image has green as a very dominant color, Grey-World performs poorly and infers the illuminant to be green. For images (b) and (c), there are many edges (e.g., the roof in (c)) with the same color distribution across them, and this causes the Grey-Edge method to perform poorly. In both cases, the proposed method benefits from spatial correlations and cues from complex image features. In Figure 5(d), both Grey-World and Grey-Edge do better than the proposed method. This is because most of the objects in the scene are truly achromatic (i.e., their true color is grey/white/black), and therefore the image closely satisfies the underlying hypothesis of those algorithms.

Finally, the performance of each individual spatial subband component was evaluated. That is, we observed how well the proposed method performed when estimating w using the statistics of each D_k Y_j alone, for every k. Figure 6 shows a box-plot summarizing the angular errors across the test set for three representative values of k, and compares them with the Grey-World and Grey-Edge algorithms as well as with the proposed method, which combines all components. Each single component outperforms Grey-World and some are comparable to Grey-Edge. The proposed method, which uses a statistical model to weight and combine cues from all components, performs best.

[Figure 6. Box-plot of angular errors (in degrees) across the test set for three individual spatial components (k = 1, 9, 59), compared with the Grey-World (GW) and Grey-Edge (GE) algorithms and with the proposed method that combines cues from all spatial components. The proposed method performs best, having the lowest average error as well as the smallest variance.]

6. Conclusion and Future Work

In this paper, we presented a novel solution to the computational chromatic adaptation task through explicit statistical modeling of the spatial dependencies between pixels. Local image features are modeled using a combination of spatially decorrelating transforms and an evaluation of the spectral correlation in this transform domain. The experimental verification suggests that this joint spatio-spectral modeling strategy is effective.

The ideas explored in this paper underscore the benefits of exploiting spatio-spectral statistics for color constancy. We expect further improvements from a model based on heavy-tailed probability distribution functions for the transform coefficients. Also, many bag-of-pixels approaches to color constancy can be adapted to use bags of patches instead, especially Bayesian methods [1] that fit naturally into our statistical framework. Examining spatially-varying illumination is also within the scope of our future work.

Acknowledgments

The authors thank Dr. H.-C. Lee for useful discussions, and the authors of [6, 12, 18] for access to their databases and code.

References

[1] D. Brainard and W. Freeman. Bayesian color constancy. J. of the Optical Soc. of Am. A, 14(7):1393–1411, 1997.
[2] G. Buchsbaum. A spatial processor model for object colour perception. J. Franklin Inst., 310(1):1–26, 1980.
[3] V. Cardei, B. Funt, and K. Barnard. Estimating the scene illumination chromaticity by using a neural network. J. of the Optical Soc. of Am. A, 19(12):2374–2386, 2002.
[4] A. Chakrabarti and K. Hirakawa. Effective separation of sparse and non-sparse image features for denoising. In Proc. ICASSP, 2008.
[5] H. Chong, S. Gortler, and T. Zickler. The von Kries hypothesis and a basis for color constancy. In Proc. ICCV, 2007.
[6] F. Ciurea and B. Funt. A large image database for color constancy research. In Proc. IS&T/SID Color Imaging Conf., pages 160–164, 2003.
[7] G. Finlayson, M. Drew, and B. Funt. Diagonal transforms suffice for color constancy. In Proc. ICCV, 1993.
[8] G. Finlayson and S. Hordley. Gamut constrained illumination estimation. Intl. J. of Comp. Vis., 67(1):93–109, 2006.
[9] G. Finlayson, S. Hordley, and P. M. Hubel. Color by correlation: A simple, unifying framework for color constancy. IEEE Trans. PAMI, 23(11):1209–1221, 2001.
[10] G. Finlayson and G. Schaefer. Convex and non-convex illuminant constraints for dichromatic colour constancy. In Proc. CVPR, 2001.
[11] D. Forsyth. A novel algorithm for color constancy. Intl. J. of Comp. Vis., 5(1), 1990.
[12] D. Foster, S. Nascimento, and K. Amano. Information limits on neural identification of colored surfaces in natural scenes. Visual Neuroscience, 21(3):331–336, 2005.
[13] B. K. Gunturk, Y. Altunbasak, and R. M. Mersereau. Color plane interpolation using alternating projections. IEEE Trans. Image Processing, 11(9):997–1013, 2002.
[14] E. Land. The retinex theory of colour vision. In Proc. R. Instn. Gr. Br., volume 47, pages 23–58, 1974.
[15] H.-C. Lee. Introduction to Color Imaging Science. Camb. Univ. Press, 2005.
[16] C. Parraga, G. Brelstaff, T. Troscianko, and I. Moorehead. Color and luminance information in natural scenes. J. of the Optical Soc. of Am. A, 15(3):563–569, 1998.
[17] B. Singh, W. Freeman, and D. Brainard. Exploiting spatial and spectral image regularities for color constancy. In Workshop on Stat. and Comp. Theories of Vis., 2003.
[18] J. van de Weijer, T. Gevers, and A. Gijsenij. Edge-based color constancy. IEEE Trans. on Image Processing, 16(9):2207–2214, 2007.
[19] G. West and M. H. Brill. Necessary and sufficient conditions for von Kries chromatic adaptation to give color constancy. J. of Math. Bio., 15(2):249–258, 1982.