Understanding Generalized Whitening and Coloring Transform for Universal Style Transfer

Tai-Yin Chiu, University of Texas at Austin, [email protected]

Abstract

Style transfer is the task of rendering images in the styles of other images. In the past few years, neural style transfer has achieved great success in this task, yet existing methods suffer from either the inability to generalize to unseen style images or slow transfer speed. Recently, a universal style transfer technique that applies zero-phase component analysis (ZCA) for whitening and coloring image features realized fast and arbitrary style transfer. However, using ZCA for style transfer is empirical and has no theoretical support. In addition, whitening and coloring transforms (WCT) other than ZCA have not been investigated. In this report, we generalize ZCA to the general form of WCT, provide an analytical performance analysis from the angle of neural style transfer, and show why ZCA is a good choice for style transfer among different WCTs and why some WCTs are not well applicable to style transfer.

1. Introduction

Style transfer is the task of synthesizing an image whose content comes from a target content image and whose style comes from another texture image. At an early stage, one successful method for style transfer was image quilting [4], which considers spatial maps of certain quantities (such as image intensity and local image orientation) over both the texture image and the content image. Another method [8] makes style transfer possible by drawing an analogy between images and their artistically filtered versions. Though these methods produce good results, they suffer from only using low-level image features.

Later, a seminal work on neural style transfer [5] leveraged the power of convolutional neural networks to extract features of an image that decouple and well represent visual styles and contents. Style transfer [6] is then achieved by jointly minimizing the feature loss [17] and the style loss formulated as the difference of Gram matrices. This optimization is solved by iteration [6, 12, 16, 18]. Thus, while producing remarkable results, it suffers from computational inefficiency. To overcome this issue, a few methods [10, 13, 20] that use pre-computed neural networks to accelerate style transfer were proposed. However, these methods are limited to a single transferrable style and cannot generalize to other unseen styles. StyleBank [1] addresses this limit by controlling the transferred style with style filters. Whenever a new style is needed, it can be learned into filters while the neural network is held fixed. Another method [3] proposes to train a conditional style transfer network that uses conditional instance normalization for multiple styles. Besides these two, more methods [2, 7, 21] to achieve arbitrary style transfer have been proposed. However, they only partly solve the problem and still cannot generalize to every unseen style.

More recently, several methods exploring the second-order statistics of content image features and style image features emerged for universal style transfer. AdaIN [9] tries to match the mean and variance of the stylized image feature and the style image feature. A method [14] that uses zero-phase component analysis (ZCA), a special kind of whitening and coloring transform (WCT), for feature transformation further matches the covariance of image features. While AdaIN enjoys more computational efficiency and the ZCA method synthesizes images visually closer to a considered style, Avatar-Net [19] aims to find a balance between them.

Even though the ZCA method for feature transformation produces good stylized images, it remains an empirical method and lacks theoretical analysis. Also, the report [14] does not mention the performance of other WCT methods. Here we generalize ZCA to the general form of WCT for style transfer. Furthermore, we incorporate the idea of neural style transfer to analytically discuss the performance of WCT on style transfer. From the analysis, we explain why ZCA is good for style transfer among different WCTs and show that not every WCT is well applicable to style transfer. In experiments, we study five natural WCT methods [11] associated with ZCA, principal component analysis (PCA), Cholesky decomposition, standardized ZCA, and standardized PCA. The experiments show that PCA and standardized PCA give bad results while the others lead to perceptually meaningful images, which is consistent with our theory.

2. Background

Since this report mainly inherits the ideas of neural style transfer and universal style transfer with ZCA, we briefly introduce them in the following.

2.1. Neural style transfer

The paper by Gatys et al. [6] first introduced an effective method for style transfer using neural networks. The key finding of this work is that by processing images with a convolutional neural network, the style representation and the content representation of an image can be decoupled. This means that the style or the content of an image can be altered independently to generate another perceptually meaningful image.

The capability of separating content from style was observed in the VGG-19 network, a convolutional neural network targeting object recognition. VGG is composed of a series of convolutional layers followed by three fully connected layers. The output of each convolutional layer is a feature map of the input image. For two images with similar contents, their feature maps extracted from a higher convolutional layer should be closer than the feature maps extracted from a lower convolutional layer, so that the final fully connected layer can classify them into the same category based on their similar features from the highest convolutional layer. This implies that the feature map from a higher convolutional layer can be used as a content representation of an image. Suppose φ_j(I) is the feature map of an input image I at the j-th convolutional layer of the VGG-19 network. A feasible content representation of I could be extracted, for example, from the layers j = relu4_1 or relu4_2.

On the other hand, the computation of a style representation is inspired by the primary visual system, where correlations between neurons are computed. Let φ_j(I) be the feature map of an image I at the j-th convolutional layer, of shape h_j(I) × w_j(I) × k_j, where h_j(I), w_j(I), and k_j are the height, width, and channel length of the feature map. For each layer, say the j-th layer, we can define the Gram matrix G_j(I) of shape k_j × k_j, whose (α, β) component is the correlation between channels α and β and is given by

    G_j(I)_{\alpha,\beta} = \sum_{h=1}^{h_j(I)} \sum_{w=1}^{w_j(I)} \phi_j(I)_{h,w,\alpha}\, \phi_j(I)_{h,w,\beta}.   (1)

By reshaping φ_j(I) into a matrix F_j(I) of shape k_j × h_j(I)w_j(I), the Gram matrix can be written in the concise form F_j F_j^T. In fact, such a Gram matrix can serve as a style representation of an image I. Furthermore, the Gram matrix from a higher convolutional layer captures a coarser style representation of I, while the one from a lower layer captures a finer style representation. For convenience, when we mention a feature map later, it refers to the reshaped feature matrix F instead of the original feature tensor φ.

To synthesize an image I_o with the content from the image I_c and the style from the image I_s, one has to find an optimal I_o that minimizes the content loss and the style loss:

    \arg\min_{I_o} \frac{1}{n_l}\|F_l(I_o) - F_l(I_c)\|_F^2 + \sum_{j\in\Omega} \lambda_j \Big\| \frac{1}{n_j} G_j(I_o) - \frac{1}{m_j} G_j(I_s) \Big\|_F^2,   (2)

where ||·||_F denotes the Frobenius norm, l denotes some high l-th convolutional layer of the VGG-19 network for evaluating the content loss, Ω is a predefined set of convolutional layers for evaluating the style loss, the λ_j's are scaling factors, and n_j = h_j(I_c)w_j(I_c) and m_j = h_j(I_s)w_j(I_s).
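For readers who want to see Eqs. 1 and 2 operationally, the following is a minimal NumPy sketch of our own (not code from [6]). The per-layer feature tensors, the layer keys, and the lambdas dictionary are placeholders standing in for VGG-19 activations, the content layer l, and the set Ω.

```python
import numpy as np

def gram(phi):
    """Gram matrix of Eq. 1 for a feature tensor phi of shape (h, w, k)."""
    F = phi.reshape(-1, phi.shape[-1]).T      # reshape to F of shape (k, h*w)
    return F @ F.T                            # k x k channel-correlation matrix

def style_transfer_loss(feats_o, feats_c, feats_s, content_layer, lambdas):
    """Evaluate Eq. 2 given dicts of per-layer feature tensors for the synthesized
    image (feats_o), the content image (feats_c), and the style image (feats_s)."""
    phi_o, phi_c = feats_o[content_layer], feats_c[content_layer]
    n_l = phi_o.shape[0] * phi_o.shape[1]
    loss = np.sum((phi_o - phi_c) ** 2) / n_l             # content term of Eq. 2
    for j, lam in lambdas.items():                        # style terms over Omega
        n_j = feats_o[j].shape[0] * feats_o[j].shape[1]
        m_j = feats_s[j].shape[0] * feats_s[j].shape[1]
        loss += lam * np.sum((gram(feats_o[j]) / n_j - gram(feats_s[j]) / m_j) ** 2)
    return loss
```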

2.2. Universal style transfer with ZCA

Unlike neural style transfer, where learning from content images and style images is necessary to optimize Eq. 2, [14] proposes a learning-free scheme and formulates style transfer as an image reconstruction process. In particular, four autoencoders for general image reconstruction are built. Each encoder is a part of the pre-trained VGG-19 network that spans the input layer to the reluN_1 layer (N = 1, 2, 3, or 4) and is kept fixed during the training process, while the corresponding decoder is structurally symmetrical to the encoder network. The autoencoder network is trained by minimizing the reconstruction loss ||I_r − I_i||_F^2, where I_i and I_r are an input image and the reconstructed image. After training, each autoencoder can be used for single-level style transfer, and the four autoencoders can be cascaded to perform multi-level style transfer to achieve better synthetic images.

The inner workings of an autoencoder for single-level style transfer are as follows. The encoder of the autoencoder is used to extract a feature map of an input image. For style transfer, a special kind of whitening and coloring transform (WCT) that uses zero-phase component analysis (ZCA) is applied for feature transformation. The transformed feature is then converted back to a perceptually meaningful image by the decoder. Specifically, given a content image I_c and a style image I_s, the features extracted by the encoder are F_c of shape k × n and F_s of shape k × m, respectively. F_c and F_s are first subtracted by their means such that the centralized features F̄_c and F̄_s have zero mean. Then eigen-decomposition is applied to their covariance matrices to derive (1/n) F̄_c F̄_c^T = E_c Λ_c E_c^T and (1/m) F̄_s F̄_s^T = E_s Λ_s E_s^T. The whitening step transforms F̄_c to an uncorrelated feature F̃_c (with (1/n) F̃_c F̃_c^T = I) according to Eq. 3:

    \tilde{F}_c = E_c \Lambda_c^{-\frac{1}{2}} E_c^T \bar{F}_c.   (3)

The coloring step transforms F̃_c to F̄_zca such that (1/n) F̄_zca F̄_zca^T = (1/m) F̄_s F̄_s^T according to Eq. 4:

    \bar{F}_{zca} = E_s \Lambda_s^{\frac{1}{2}} E_s^T \tilde{F}_c.   (4)

F̄_zca is finally re-centered to F_zca by adding the mean of F_s, which finishes the whole WCT. After feature transformation, the decoder converts the transformed feature F_zca to an image with the content from I_c and the style from I_s.

Moreover, better results can be achieved by multi-level style transfer: the autoencoder associated with the relu4_1 layer takes I_c and I_s as inputs and produces a synthetic image I_4. Then I_4 as a content image and I_s are passed to the autoencoder associated with the relu3_1 layer to generate a synthetic image I_3. This procedure is repeated until a synthetic image I_1 is generated from the autoencoder associated with the relu1_1 layer. We will explain why multi-level style transfer works better later in Sec. 3.4.

3. Methods

[14] uses ZCA, a special case of WCT, to realize style transfer. However, it says little about whether other WCT methods can work well for style transfer, nor does it analytically discuss the performance of WCT. Here we consider the generalized WCT scheme for style transfer, providing a theory from the perspective of neural style transfer to explain why ZCA is a good way for this task and why some other WCTs might not be.

3.1. Whitening transform

Suppose x = (x_1, ..., x_d)^T is a random vector to be whitened, and μ = E[x] is the mass center of x. While x does not have to be centralized by subtracting μ for whitening purposes, we follow [14], where the whitening transform is done on centered signals. Therefore, in the following x is assumed to be centered with μ = 0.

A whitening transform is a linear operation that brings a random vector x with covariance matrix Cov(x) = Σ to another random vector z = (z_1, ..., z_d)^T with an identity covariance matrix Cov(z) = I. Specifically, the linear operation is defined by a d × d matrix W that converts x to z = W x and satisfies

    E[z z^T] = W E[x x^T] W^T = W \Sigma W^T = I.   (5)

It follows that W Σ W^T W = W and thus

    W^T W = \Sigma^{-1}.   (6)

Accordingly,

    W = U_1 \Sigma^{-\frac{1}{2}} = U_1 E \Lambda^{-\frac{1}{2}} E^T,   (7)

where the eigen-decomposition of Σ is Σ = E Λ E^T, and U_1 is an orthogonal matrix. Different choices of U_1 define different whitening transforms.

Besides, in certain situations it is more convenient to work with the standardized random vector y = V^{-1/2} x, with V being the diagonal matrix diag(Σ). Similar to Eq. 7, the whitening matrix W_y for y is written as W_y = U_2 P^{-1/2}, where U_2 is an orthogonal matrix and P is the covariance matrix of y and also the correlation matrix of x, i.e., P = E[y y^T] = V^{-1/2} E[x x^T] V^{-1/2} = V^{-1/2} Σ V^{-1/2}. Since the whitened vector is W_y y = U_2 P^{-1/2} V^{-1/2} x, we can alternatively express the whitening matrix W for x as

    W = U_2 P^{-\frac{1}{2}} V^{-\frac{1}{2}}.   (8)

In this report we study five natural whitening transformations which are succinctly representable in the form defined in either Eq. 7 or Eq. 8. First recall [14], which uses the ZCA whitening transformation for style transfer. The ZCA whitening matrix W_zca is given by

    W_{zca} = \Sigma^{-\frac{1}{2}} = E \Lambda^{-\frac{1}{2}} E^T,   (9)

which corresponds to U_1 = I in Eq. 7. Closely related to ZCA whitening, the PCA whitening matrix W_pca is defined as

    W_{pca} = E^T \Sigma^{-\frac{1}{2}} = \Lambda^{-\frac{1}{2}} E^T.   (10)

The major difference between W_zca and W_pca is that ZCA finally rotates back to the original coordinates by E after the rotation by E^T followed by the scaling Λ^{-1/2}.

If the ZCA transform or the PCA transform is applied to standardized vectors, we obtain the standardized versions of the ZCA transform matrix, W_zca-std, and of the PCA transform matrix, W_pca-std:

    W_{zca\text{-}std} = P^{-\frac{1}{2}} V^{-\frac{1}{2}} = E_p \Lambda_p^{-\frac{1}{2}} E_p^T V^{-\frac{1}{2}},   (11)

    W_{pca\text{-}std} = E_p^T P^{-\frac{1}{2}} V^{-\frac{1}{2}} = \Lambda_p^{-\frac{1}{2}} E_p^T V^{-\frac{1}{2}},   (12)

where the eigen-decomposition P = E_p Λ_p E_p^T is used. Note that W_zca-std and W_pca-std correspond to U_2 = I and U_2 = E_p^T in Eq. 8, respectively.

The last natural whitening is Cholesky whitening, whose name comes from the Cholesky decomposition of Σ^{-1}: Σ^{-1} = L L^T. Comparing it to Eq. 6, we can derive the Cholesky whitening matrix W_chol to be

    W_{chol} = L^T.   (13)

Note that W_chol corresponds to U_1 = L^T Σ^{1/2} in Eq. 7. In addition, it can be verified that the standardized version W_chol-std is also L^T.
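As a concrete reference for the five transforms above, here is a small NumPy sketch of our own (not code from [14] or [11]) that builds the whitening matrices of Eqs. 9 to 13 from a centered feature matrix. The ridge term eps is an implementation detail we add for numerical stability; it is not part of the derivation.

```python
import numpy as np

def whitening_matrices(F_bar, eps=1e-5):
    """Whitening matrices of Sec. 3.1 for a centered feature matrix F_bar of shape (k, n)."""
    k, n = F_bar.shape
    Sigma = F_bar @ F_bar.T / n + eps * np.eye(k)       # covariance, kept invertible by eps*I
    lam, E = np.linalg.eigh(Sigma)                      # Sigma = E diag(lam) E^T
    W_zca = E @ np.diag(lam ** -0.5) @ E.T              # Eq. 9
    W_pca = np.diag(lam ** -0.5) @ E.T                  # Eq. 10

    V_inv_sqrt = np.diag(np.diag(Sigma) ** -0.5)        # V^{-1/2}, V = diag(Sigma)
    P = V_inv_sqrt @ Sigma @ V_inv_sqrt                 # correlation matrix
    lam_p, E_p = np.linalg.eigh(P)
    W_zca_std = E_p @ np.diag(lam_p ** -0.5) @ E_p.T @ V_inv_sqrt   # Eq. 11
    W_pca_std = np.diag(lam_p ** -0.5) @ E_p.T @ V_inv_sqrt         # Eq. 12

    L = np.linalg.cholesky(np.linalg.inv(Sigma))        # Sigma^{-1} = L L^T
    W_chol = L.T                                        # Eq. 13
    return {"zca": W_zca, "pca": W_pca, "zca-std": W_zca_std,
            "pca-std": W_pca_std, "chol": W_chol}
```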

3.2. Coloring transform

A coloring transform is the reversed procedure of the corresponding whitening transform. Specifically, in single-level style transfer, a content feature F_c = [f_1^c, ..., f_n^c] of shape k × n and a style feature F_s = [f_1^s, ..., f_m^s] of shape k × m are extracted by an encoder composed of a part of the VGG-19 network. F_c and F_s are then centralized to F̄_c = [f_1^c − f̄^c, ..., f_n^c − f̄^c] and F̄_s = [f_1^s − f̄^s, ..., f_m^s − f̄^s] by subtracting their means f̄^c = (1/n) Σ_{i=1}^{n} f_i^c and f̄^s = (1/m) Σ_{i=1}^{m} f_i^s, respectively. After the computation of the covariances Σ_c = (1/n) F̄_c F̄_c^T and Σ_s = (1/m) F̄_s F̄_s^T, for each of them we can derive a whitening transform matrix W_c or W_s according to the methods introduced in Sec. 3.1. The whitened and colored feature F_wct to be decoded is then derived by

    F_{wct} = W_s^{-1} W_c \bar{F}_c + [\bar{f}^s, \dots, \bar{f}^s]_{k\times n},   (14)

where W_c is the whitening transform, W_s^{-1} is the coloring transform, and the addition of the k × n matrix [f̄^s, ..., f̄^s] is the re-centering step.

To comply with the formulation in Sec. 3.1, we define f^c as the random vector representing the n examples f_i^c, i = 1, ..., n, and f^s as the random vector representing the m examples f_i^s, i = 1, ..., m. Whitening matrices W_c and W_s can then be derived from the covariances Σ_c = E[(f^c − f̄^c)(f^c − f̄^c)^T] and Σ_s = E[(f^s − f̄^s)(f^s − f̄^s)^T], where f̄^c = E[f^c] and f̄^s = E[f^s]. Therefore, the random vector f^wct that represents the columns of F_wct is given by

    f^{wct} = W_s^{-1} W_c (f^c - \bar{f}^c) + \bar{f}^s.   (15)

Equipped with Eq. 15, we can analyze the performance of WCTs for style transfer.
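Continuing the earlier sketch, Eq. 14 can be written in a few lines. This is again our own illustrative code (the helper whitening_matrices is defined above), not the implementation of [14].

```python
import numpy as np

def wct_transfer(F_c, F_s, method="zca", eps=1e-5):
    """Whiten the content feature with W_c and color it with W_s^{-1} (Eq. 14).
    F_c: (k, n) content feature, F_s: (k, m) style feature."""
    f_bar_c = F_c.mean(axis=1, keepdims=True)       # content mean
    f_bar_s = F_s.mean(axis=1, keepdims=True)       # style mean
    Fb_c, Fb_s = F_c - f_bar_c, F_s - f_bar_s       # centralized features

    W_c = whitening_matrices(Fb_c, eps)[method]     # whitening matrix of Sec. 3.1
    W_s = whitening_matrices(Fb_s, eps)[method]
    # Coloring is the inverse of the style whitening transform,
    # followed by re-centering with the style mean (Eq. 14).
    return np.linalg.inv(W_s) @ W_c @ Fb_c + f_bar_s
```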

3.3. Analysis of WCTs for single-level style transfer

From the point of view of neural style transfer, a single-level style transfer in [14] can actually be regarded as using ZCA to provide a fast and approximate solution to min_F l(F; F_c, F_s), with l(F; F_c, F_s) defined as

    l(F; F_c, F_s) = \underbrace{\frac{1}{n}\|F - F_c\|_F^2}_{\text{content loss}} + \lambda \underbrace{\Big\| \frac{1}{n} F F^T - \frac{1}{m} F_s F_s^T \Big\|_F^2}_{\text{style loss}},   (16)

where F_c of shape k × n and F_s of shape k × m are extracted at the output of either the relu4_1, relu3_1, relu2_1, or relu1_1 layer of the VGG network, and k is the number of channels at that layer. Furthermore, it can be proved that if a WCT is used to approximate the solution, the loss function l(F; F_c, F_s) of the approximated solution F_wct is bounded.

Theorem 3.1. Given a single-level style transfer formulated as the minimization of l(F; F_c, F_s), if F = F_wct, computed according to Eq. 14, then the style loss is zero and the content loss is bounded by the means and covariances of F_c and F_s.

Proof. First recall that a whitening matrix should satisfy Eq. 6. Therefore, we have Σ = (W^T W)^{-1} = W^{-1}(W^T)^{-1}. Thus

    \Sigma_c = E[(f^c - \bar{f}^c)(f^c - \bar{f}^c)^T] = W_c^{-1}(W_c^T)^{-1},   (17a)
    \Sigma_s = E[(f^s - \bar{f}^s)(f^s - \bar{f}^s)^T] = W_s^{-1}(W_s^T)^{-1}.   (17b)

Let us begin with the second term of l(F_wct):

    \frac{1}{n} F_{wct} F_{wct}^T - \frac{1}{m} F_s F_s^T = \frac{1}{n}\sum_{i=1}^{n} f_i^{wct}(f_i^{wct})^T - \frac{1}{m}\sum_{i=1}^{m} f_i^s (f_i^s)^T = E[f^{wct}(f^{wct})^T] - E[f^s(f^s)^T],   (18)

where E[f^wct (f^wct)^T] equals

    W_s^{-1} W_c\, E[(f^c - \bar{f}^c)(f^c - \bar{f}^c)^T]\, W_c^T (W_s^{-1})^T + W_s^{-1} W_c (E[f^c] - \bar{f}^c)(\bar{f}^s)^T + \bar{f}^s (E[f^c] - \bar{f}^c)^T W_c^T (W_s^{-1})^T + \bar{f}^s(\bar{f}^s)^T,   (19)

which can be reduced to

    W_s^{-1}(W_s^{-1})^T + 0 + 0 + \bar{f}^s(\bar{f}^s)^T = E[f^s(f^s)^T],   (20)

where the identities (W_s^{-1})^T = (W_s^T)^{-1} and E[(f^s − f̄^s)(f^s − f̄^s)^T] = E[f^s(f^s)^T] − f̄^s(f̄^s)^T are used. Hence the matrix difference inside the second term of l(F_wct) is E[f^s(f^s)^T] − E[f^s(f^s)^T] = 0. This proves that the style loss is zero if F_wct is used.

Next we focus on the first term:

    \frac{1}{n}\|F_{wct} - F_c\|_F^2 = \mathrm{tr}\Big[\frac{1}{n}(F_{wct} - F_c)(F_{wct} - F_c)^T\Big] = \mathrm{tr}\Big[\frac{1}{n}\sum_{i=1}^{n}(f_i^{wct} - f_i^c)(f_i^{wct} - f_i^c)^T\Big] = \mathrm{tr}\big[E[(f^{wct} - f^c)(f^{wct} - f^c)^T]\big].   (21)

With f^{c−c̄} := f^c − f̄^c and f^{c−s̄} := f^c − f̄^s, Eq. 21 can be expanded as

    \mathrm{tr}[W_s^{-1} W_c\, E[f^{c-\bar{c}}(f^{c-\bar{c}})^T]\, W_c^T (W_s^{-1})^T] - \mathrm{tr}[W_s^{-1} W_c\, E[f^{c-\bar{c}}(f^{c-\bar{s}})^T]] - \mathrm{tr}[E[f^{c-\bar{s}}(f^{c-\bar{c}})^T]\, W_c^T (W_s^{-1})^T] + \mathrm{tr}[E[f^{c-\bar{s}}(f^{c-\bar{s}})^T]].   (22)

The first trace equals tr[W_s^{-1}(W_s^{-1})^T] = tr[Σ_s]. The second and third trace terms are equivalent, since tr[A] = tr[A^T] for any matrix A. Since E[f^{c−c̄}(f^{c−s̄})^T] equals

    E[f^c (f^c)^T] - E[f^c](\bar{f}^s)^T - \bar{f}^c E[f^c]^T + \bar{f}^c(\bar{f}^s)^T = E[f^c(f^c)^T] - \bar{f}^c(\bar{f}^c)^T = \Sigma_c = W_c^{-1}(W_c^{-1})^T,   (23)

the second and third trace terms become tr[W_s^{-1}(W_c^{-1})^T]. For the fourth trace, we observe that E[f^{c−s̄}(f^{c−s̄})^T] can be further written as

    E[f^c(f^c)^T] - E[f^c](\bar{f}^s)^T - \bar{f}^s E[f^c]^T + \bar{f}^s(\bar{f}^s)^T = W_c^{-1}(W_c^{-1})^T + (\bar{f}^c - \bar{f}^s)(\bar{f}^c - \bar{f}^s)^T,   (24)

taking the trace of which, the fourth trace term turns out to be tr[W_c^{-1}(W_c^{-1})^T] + ||f̄^c − f̄^s||_2^2. Putting everything above together, the loss l(F_wct; F_c, F_s) is

    l(F_{wct}; F_c, F_s) = \|\bar{f}^c - \bar{f}^s\|_2^2 + \mathrm{tr}[W_s^{-1}(W_s^{-1})^T - 2 W_s^{-1}(W_c^{-1})^T + W_c^{-1}(W_c^{-1})^T].   (25)

Moreover, by exploiting the Cauchy–Schwarz inequality

    \mathrm{tr}[W_s^{-1}(W_c^{-1})^T] \ge -\sqrt{\mathrm{tr}[W_s^{-1}(W_s^{-1})^T]}\,\sqrt{\mathrm{tr}[W_c^{-1}(W_c^{-1})^T]}   (26)

and the identities Σ_c = W_c^{-1}(W_c^{-1})^T and Σ_s = W_s^{-1}(W_s^{-1})^T, we can derive the inequality

    l(F_{wct}) \le \|\bar{f}^c - \bar{f}^s\|_2^2 + \big(\sqrt{\mathrm{tr}(\Sigma_s)} + \sqrt{\mathrm{tr}(\Sigma_c)}\big)^2,   (27)

where the upper bound is associated with the means and covariances of F_c and F_s. ∎

Theorem 3.1 says that whichever WCT is used for single-level style transfer, there is always no style loss, and the upper bound on the content loss depends only on the content feature and the style feature, irrespective of the WCT used. The bounded content loss implies that WCT can capture the general appearance of a content image. However, whether the details of the content are preserved still depends on the WCT used. To evaluate the performance of different WCTs on style transfer, we have to come up with a simple and yet indicative quantitative metric. To this end, following Theorem 3.1 we have the corollary below.

Corollary 3.1.1. tr[W_s^{-1}(W_c^{-1})^T] can be used as a score function to evaluate style transfer results using WCT: the higher the value of tr[W_s^{-1}(W_c^{-1})^T], the better the performance of the WCT used for style transfer.

Proof. Recall Eq. 25, where ||f̄^c − f̄^s||_2^2, Σ_c = W_c^{-1}(W_c^{-1})^T, and Σ_s = W_s^{-1}(W_s^{-1})^T all depend only on the content and style images and are irrelevant to the WCT used. Therefore, a WCT that minimizes the loss function l(F_wct) more corresponds to a higher value of tr[W_s^{-1}(W_c^{-1})^T]. ∎

With Corollary 3.1.1, we can get a sense of why ZCA is a good choice for style transfer. If ZCA (refer to Eq. 9) is used, the score is

    \mathrm{tr}[\Sigma_s^{1/2}\Sigma_c^{1/2}] = \mathrm{tr}[E_s \Lambda_s^{1/2} E_s^T E_c \Lambda_c^{1/2} E_c^T] = \mathrm{tr}\Big[\sum_i \sigma_i^s e_i^s(e_i^s)^T \sum_j \sigma_j^c e_j^c(e_j^c)^T\Big] = \sum_{i,j} \sigma_i^s \sigma_j^c\, (e_i^s)^T e_j^c \cdot \mathrm{tr}[e_i^s(e_j^c)^T] = \sum_{i,j} \sigma_i^s \sigma_j^c\, [(e_i^s)^T e_j^c]^2,   (28)

where the σ_i^s's (σ_j^c's) and e_i^s's (e_j^c's) are the singular values and the corresponding eigenvectors of Σ_s^{1/2} (Σ_c^{1/2}), respectively.

On the other hand, since in general W_s and W_c can be written as W_s = U_s Σ_s^{-1/2} and W_c = U_c Σ_c^{-1/2} with certain orthogonal matrices U_s and U_c (refer to Eq. 7), tr[W_s^{-1}(W_c^{-1})^T] equals tr[Σ_s^{1/2} U_s^T U_c Σ_c^{1/2}]. According to von Neumann's trace inequality, it can be bounded as follows:

    \big|\mathrm{tr}[\Sigma_s^{1/2} U_s^T U_c \Sigma_c^{1/2}]\big| \le \sum_i \sigma_i^s \sigma_i^c = \sum_{i,j} \sigma_i^s \sigma_i^c\, [(e_i^s)^T e_j^c]^2,   (29)

where we use the identity Σ_j [(e_i^s)^T e_j^c]^2 = ||e_i^s||_2^2 = 1 for all i. By replacing the factor σ_i^c of Σ_{i,j} σ_i^s σ_i^c [(e_i^s)^T e_j^c]^2 with σ_j^c, it becomes the score of ZCA in Eq. 28. This implies that the score of ZCA is a good approximation of the upper bound and thus ZCA is a good choice for style transfer.

In contrast, bad choices of U_s and U_c could result in negative scores and bad style transfer results, and PCA is one such case. If PCA (refer to Eq. 10) is used, the score is

    \mathrm{tr}[E_s \Lambda_s^{1/2} \Lambda_c^{1/2} E_c^T] = \sum_i \sigma_i^s \sigma_i^c\, \mathrm{tr}[e_i^s(e_i^c)^T] = \sum_i \sigma_i^s \sigma_i^c\, (e_i^s)^T e_i^c.   (30)

Eq. 30 could be a small or even a negative value, since (e_i^s)^T e_i^c can be negative. This implies that PCA is not a good option for style transfer.

Moreover, if the standardized version of ZCA (refer to Eq. 11) is used, the score is tr[V_s^{1/2} P_s^{1/2} P_c^{1/2} V_c^{1/2}], where V_s (V_c) and P_s (P_c) are the diagonal variance matrix and the correlation matrix of F_s (F_c), respectively. Since a covariance matrix Σ connects to the corresponding correlation matrix P via Σ = V^{1/2} P V^{1/2}, we have Σ^{1/2} = V^{1/2} P^{1/2} U with U being some orthogonal matrix. It can be shown that U is very close to the identity matrix, and thus Σ^{1/2} ≈ V^{1/2} P^{1/2}. This implies that the score tr[V_s^{1/2} P_s^{1/2} P_c^{1/2} V_c^{1/2}] is close to tr[Σ_s^{1/2} Σ_c^{1/2}], which is the score of ZCA. Hence, the performance of standardized ZCA and ZCA on style transfer is similar. Besides, similar to the case of PCA, standardized PCA could lead to a negative score and is not good for style transfer.
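The score of Corollary 3.1.1 is cheap to evaluate. Below is a short sketch of our own (reusing the helpers defined earlier) showing how it could be computed to compare WCTs; the random matrices in the commented usage are placeholders standing in for encoder features, not real VGG activations.

```python
import numpy as np

def wct_score(F_c, F_s, method="zca", eps=1e-5):
    """Score tr[W_s^{-1} (W_c^{-1})^T] from Corollary 3.1.1 for a chosen WCT."""
    Fb_c = F_c - F_c.mean(axis=1, keepdims=True)
    Fb_s = F_s - F_s.mean(axis=1, keepdims=True)
    W_c = whitening_matrices(Fb_c, eps)[method]
    W_s = whitening_matrices(Fb_s, eps)[method]
    return np.trace(np.linalg.inv(W_s) @ np.linalg.inv(W_c).T)

# Hypothetical comparison with random stand-ins for encoder features:
# rng = np.random.default_rng(0)
# F_c, F_s = rng.normal(size=(64, 1024)), rng.normal(size=(64, 900))
# print({m: round(wct_score(F_c, F_s, m), 1)
#        for m in ["zca", "pca", "zca-std", "pca-std", "chol"]})
```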

3.4. Multi-level style transfer

[14] empirically shows that multiple autoencoders for single-level style transfer can be cascaded to achieve better style transfer results. Here we propose an explanation from the perspective of neural style transfer: suppose we want to find a synthetic image I that minimizes the loss function below:

    \frac{1}{n_4}\|F_4 - F_{4,c}\|_F^2 + \sum_{N=1}^{4} \lambda_N \Big\| \frac{1}{n_N} F_N F_N^T - \frac{1}{m_N} F_{N,s} F_{N,s}^T \Big\|_F^2,   (31)

where F_N = F_N(I) is the feature of I extracted at the reluN_1 layer of the VGG-19 network, and the F_{N,c}'s and F_{N,s}'s are the features of the content image and the style image at the different layers, respectively. We approach this optimization problem by first solving for I_4 from the relu4_1 part:

    \frac{\|F_4(I_4) - F_{4,c}\|_F^2}{n_4} + \lambda_4 \Big\| \frac{F_4(I_4) F_4(I_4)^T}{n_4} - \frac{F_{4,s} F_{4,s}^T}{m_4} \Big\|_F^2,   (32)

which is the single-level style transfer associated with the relu4_1 layer, and the solution is approximated using WCT. Next we want to find another image I_3 on top of I_4 such that I_3 is close to I_4 and also optimizes the loss λ_3 || (1/n_3) F_3 F_3^T − (1/m_3) F_{3,s} F_{3,s}^T ||_F^2. If I_3 is close to I_4, then I_3 is still suboptimal for Eq. 32. To account for this, instead of using a direct loss ||I_4 − I_3||_F^2, we require the features of I_4 and I_3 extracted at the relu3_1 layer to be close. Overall, I_3 optimizes the following loss:

    \frac{\|F_3(I_3) - F_3(I_4)\|_F^2}{n_3} + \lambda_3 \Big\| \frac{F_3(I_3) F_3(I_3)^T}{n_3} - \frac{F_{3,s} F_{3,s}^T}{m_3} \Big\|_F^2,   (33)

which exactly corresponds to the single-level style transfer associated with the relu3_1 layer that takes I_4 as the content image and whose solution can be approximated by WCT. We can repeat the procedure for I_2 on top of I_3 and I_1 on top of I_2, and I_1 will be a suboptimal solution to Eq. 31. Effectively, we are approximating a solution to Eq. 31 by cascading four single-level style transfer autoencoders, where each autoencoder takes the output image of the previous autoencoder as a content image (a code sketch of this cascade is given at the end of this subsection).

This standpoint also explains why synthetic results are worse if the autoencoders are cascaded in the reverse order: first I_1 is generated from I_c and I_s using the autoencoder associated with the relu1_1 layer, then I_1 and I_s are fed into the autoencoder associated with the relu2_1 layer to generate I_2, and the procedure is repeated until I_4 is generated by the autoencoder associated with the relu4_1 layer. Compared to the original order, where the content feature comes from the relu4_1 layer (the (1/n_4)||F_4 − F_{4,c}||_F^2 term in Eq. 31), the reverse order actually finds a suboptimal solution to the loss below:

    \frac{1}{n_1}\|F_1 - F_{1,c}\|_F^2 + \sum_{N=1}^{4} \lambda_N \Big\| \frac{1}{n_N} F_N F_N^T - \frac{1}{m_N} F_{N,s} F_{N,s}^T \Big\|_F^2,   (34)

where the content information comes from the relu1_1 layer. As mentioned in Sec. 2.1, since a feature map from a higher convolutional layer better represents the content information of an image, the original order of cascade gives better synthetic results.
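The cascade described above can be summarized as follows. This is our own illustration, not the released implementation of [14]: encoders and decoders are stand-ins for the pre-trained VGG encoder slices and the trained decoders, and wct_transfer is the sketch from Sec. 3.2.

```python
def multi_level_transfer(I_c, I_s, encoders, decoders, method="zca"):
    """Cascade of single-level transfers (Sec. 3.4): the output of the level-N
    autoencoder becomes the content image of level N-1. encoders[N] / decoders[N]
    are assumed to map an image to a (k, n) feature matrix and back."""
    I = I_c
    for N in (4, 3, 2, 1):                      # relu4_1 -> relu1_1, coarse to fine
        F_c = encoders[N](I)                    # current content feature
        F_s = encoders[N](I_s)                  # style feature at the same level
        F_wct = wct_transfer(F_c, F_s, method)  # single-level WCT (Eq. 14)
        I = decoders[N](F_wct)                  # decode back to image space
    return I
```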
4. Experiments

4.1. Training details

We train the autoencoders on the MS-COCO dataset, which consists of 11.8K training images. Each image from the dataset is resized to 512 × 512 and randomly cropped to 256 × 256 as an input in a batch. For the relu4_1 and relu3_1 cases, the autoencoders are trained with a batch size of 16 for 10 epochs, while for the relu2_1 and relu1_1 cases, the autoencoders are trained for 5 epochs. We use the Adam optimizer with learning rate 1 × 10^{-4} and without weight decay.

In the report [14], a decoder is structurally symmetric to the corresponding encoder in the sense that a max-pooling layer in the encoder corresponds to an up-sampling layer in the decoder. However, a max-pooling operation in the encoder loses spatial information in the feature maps, and the corresponding up-sampling operation in the decoder cannot recover the lost structures very well. Therefore, the decoded image will contain structural artifacts and distortion at boundaries [15]. To fix this, we use transposed convolutional layers as the symmetric counterparts of the max-pooling layers. Compared to up-sampling layers, transposed convolutional layers have adjustable parameters to flexibly learn to reconstruct images and avoid distortions.

4.2. Discussions

In Fig. 1 we demonstrate eight examples of style transfer based on five natural WCTs, which are associated with ZCA, PCA, the standardized versions of ZCA and PCA, and Cholesky decomposition, respectively. Table 1 lists the corresponding values of the score tr(W_s^{-1}(W_c^{-1})^T) at the different single levels of style transfer. Note that comparing the values across different examples is meaningless, since a value depends on many factors, such as the sizes of the content image and the style image and the distribution of pixel values.
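As one concrete illustration of the decoder change described in Sec. 4.1, the following PyTorch-style sketch replaces the fixed up-sampling layer with a learnable ConvTranspose2d for a relu2_1-level decoder. The channel widths and layer counts are placeholders of ours; this report does not spell out the exact architecture used.

```python
import torch.nn as nn

# Hypothetical relu2_1-level decoder: a ConvTranspose2d stands in for the
# up-sampling layer that would mirror the encoder's max-pooling, so the
# upsampling weights are learned rather than fixed.
decoder_relu2_1 = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # learned 2x upsample
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 3, kernel_size=3, padding=1),                       # back to RGB
)
```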

Figure 1. Examples of style transfer using five natural WCTs. The results from PCA and PCA-std are not good but still capture the general appearance of the content images and the styles of the style images, which can be explained by the bounded content loss and zero style loss in single-level style transfer. In contrast, the results from ZCA, ZCA-std, and Cholesky are much better, since the values of tr(W_s^{-1}(W_c^{-1})^T) for them are much higher, as shown in Table 1.

Example | ZCA | PCA | ZCA-std | PCA-std | Cholesky
1 | 70177.1 / 2244.9 / 464.0 / 8.5 | -8200.9 / -28.7 / -15.3 / 5.2 | 69862.4 / 2246.2 / 466.4 / 8.4 | -7389.0 / -98.9 / -22.6 / -3.7 | 62073.5 / 2063.2 / 442.8 / 8.2
2 | 83055.0 / 3349.3 / 688.4 / 9.6 | 359.5 / 140.3 / -3.4 / -3.3 | 83012.9 / 3358.8 / 691.5 / 9.7 | 5748.9 / 255.5 / 6.1 / -0.9 | 71536.8 / 2940.6 / 642.1 / 8.8
3 | 38127.9 / 1861.5 / 404.6 / 3.9 | 2267.4 / 144.3 / 79.8 / 2.7 | 37572.1 / 1852.1 / 402.9 / 3.8 | 6400.9 / -228.4 / -48.1 / -0.5 | 32533.6 / 1695.7 / 375.4 / 3.5
4 | 60312.3 / 2513.3 / 616.0 / 7.6 | 2079.1 / -20.2 / 43.9 / -0.3 | 60144.6 / 2533.0 / 620.8 / 7.7 | 6482.2 / -549.6 / 0.8 / 2.9 | 52955.3 / 2269.4 / 587.6 / 6.9
5 | 92117.1 / 4942.3 / 1146.7 / 13.4 | -9806.4 / -386.3 / 27.1 / 3.5 | 91384.9 / 4969.0 / 1151.0 / 13.5 | 4170.3 / 624.5 / 39.0 / 1.5 | 79736.5 / 4623.4 / 1115.4 / 12.7
6 | 147059.6 / 5585.0 / 1175.9 / 16.4 | -15089.9 / 33.0 / 79.8 / -6.9 | 146909.1 / 5592.4 / 1183.1 / 16.5 | -6802.0 / -88.6 / 162.9 / 7.0 | 135144.2 / 5232.4 / 1138.3 / 15.6
7 | 59170.7 / 2247.0 / 647.1 / 6.3 | 7361.8 / -8.3 / -34.7 / -1.3 | 59147.1 / 2264.9 / 650.4 / 6.3 | 12447.2 / -276.2 / 98.5 / -4.3 | 53560.3 / 2096.3 / 620.0 / 5.9
8 | 44762.8 / 1174.6 / 255.8 / 4.5 | -6245.8 / -76.0 / -31.8 / 2. | 44515.5 / 1173.1 / 255.3 / 4.5 | -10660.4 / -104.7 / 15.1 / 2.5 | 39738.1 / 1067.6 / 241.2 / 4.3

Table 1. Values of tr(W_s^{-1}(W_c^{-1})^T) for the eight examples in Fig. 1. Each cell contains four values separated by slashes; they correspond to the values from the single levels associated with the reluN_1 layer, N = 4, 3, 2, 1, respectively.

Let us focus on the results from PCA and PCA-std first. If we look at them closely, we can notice some points that are consistent with our previous analysis. Apparently, the results from PCA and PCA-std are not good, but we can observe that the styles from the style images are transferred to the synthetic images to a certain extent. This is because, from the perspective of neural style transfer, WCT does not cause any style information loss in single-level style transfer, as shown in the proof of Theorem 3.1. Moreover, we can observe that the synthetic images by PCA and PCA-std still capture the general appearance of the content images. This can be explained by the bounded content loss proved in Theorem 3.1. However, as explained around Eq. 30, since style transfer by PCA or PCA-std results in worse content losses, the details of the content information are ruined in the synthetic images. This is justified in Table 1, where in each example the values of tr(W_s^{-1}(W_c^{-1})^T) for PCA and PCA-std are much smaller than the values for the other WCT methods and can even be negative.

Furthermore, Table 1 shows that in each example the values of tr(W_s^{-1}(W_c^{-1})^T) for ZCA and ZCA-std are very close, which is consistent with the previous discussion. That is why the synthetic images by ZCA and ZCA-std look almost the same. Besides, we can notice that in each example the values of tr(W_s^{-1}(W_c^{-1})^T) for the Cholesky decomposition method are slightly smaller than the values for ZCA and ZCA-std, implying that the results by the Cholesky method could be slightly worse than those by ZCA and ZCA-std. However, the visualization in Fig. 1 shows only little difference between them; one small but noticeable difference is in the third example of pencil sketch, where the left part of the Cholesky result is brighter.

In summary, even though there are many choices of WCT, such as standardized ZCA or Cholesky decomposition, for style transfer, from the angle of neural style transfer ZCA can always be the first good choice. In contrast, PCA, one of the most widely used whitening methods, is not well applicable to style transfer.

5. Conclusion

In this report we study the general form of WCT for style transfer. We analyze the performance of WCT from the perspective of neural style transfer. From the analysis we show why some WCTs, especially ZCA, are good choices for style transfer among different WCTs and why some other WCTs might not be well applicable to style transfer. In experiments we study five natural WCTs and show that ZCA, standardized ZCA, and Cholesky decomposition for feature transformation can achieve good style transfer results, while PCA and standardized PCA are not well applicable to style transfer.

References

[1] Dongdong Chen, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. StyleBank: An explicit representation for neural image style transfer. In Proc. CVPR, volume 1, page 4, 2017.
[2] Tian Qi Chen and Mark Schmidt. Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337, 2016.
[3] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. In Proc. ICLR, 2017.
[4] Alexei A. Efros and William T. Freeman. Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 341–346. ACM, 2001.

[5] Leon Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, pages 262–270, 2015.
[6] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423, 2016.
[7] Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, and Jonathon Shlens. Exploring the structure of a real-time, arbitrary neural artistic stylization network. arXiv preprint arXiv:1705.06830, 2017.
[8] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 327–340. ACM, 2001.
[9] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1510–1519. IEEE, 2017.
[10] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[11] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The American Statistician, 72(4):309–314, 2018.
[12] Chuan Li and Michael Wand. Combining Markov random fields and convolutional neural networks for image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2479–2486, 2016.
[13] Chuan Li and Michael Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision, pages 702–716. Springer, 2016.
[14] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 386–396, 2017.
[15] Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, and Jan Kautz. A closed-form solution to photorealistic image stylization. arXiv preprint arXiv:1802.06474, 2018.
[16] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.
[17] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5188–5196, 2015.
[18] Eric Risser, Pierre Wilmot, and Connelly Barnes. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893, 2017.
[19] Lu Sheng, Ziyi Lin, Jing Shao, and Xiaogang Wang. Avatar-Net: Multi-scale zero-shot style transfer by feature decoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8242–8250, 2018.
[20] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, pages 1349–1357, 2016.
[21] Hao Wang, Xiaodan Liang, Hao Zhang, Dit-Yan Yeung, and Eric P. Xing. ZM-Net: Real-time zero-shot image manipulation network. arXiv preprint arXiv:1703.07255, 2017.
