Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21)

Instance-Aware Coherent Video Style Transfer for Chinese Ink Wash Painting

Hao Liang, Shuai Yang, Wenjing Wang and Jiaying Liu∗
Wangxuan Institute of Computer Technology, Peking University
{lianghao17, williamyang, daooshee, liujiaying}@pku.edu.cn

Abstract

Recent research has made remarkable achievements in fast video style transfer based on Western paintings. However, due to the inherently different drawing techniques and aesthetic expressions of Chinese ink wash paintings, existing methods either achieve poor temporal consistency or fail to transfer the key freehand brushstroke characteristics of Chinese ink wash painting. In this paper, we present a novel video style transfer framework for Chinese ink wash paintings. The two key ideas are a multi-frame fusion for temporal coherence and an instance-aware style transfer. Frame reordering and stylization based on reference frame fusion are proposed to improve temporal consistency. Meanwhile, the proposed method is able to adaptively leave white spaces in the background and to select the proper scale to extract features and depict the foreground subject by leveraging instance segmentation. Experimental results demonstrate the superiority of the proposed method over the state-of-the-arts in terms of both temporal coherence and visual quality. Our project website is available at https://oblivioussy.github.io/InkVideo/.

Figure 1: Our model is able to produce temporally stable Chinese ink wash videos. Moreover, our model can capture the characteristics of Chinese ink wash style better than existing methods. (a) Chinese ink wash video style transfer (Frames 1 and 2: input frames, ChipGAN, Ours). (b) Chinese ink wash image style transfer (Input, NST, ChipGAN, Ours, Real).

1 Introduction

Ink video is a modern artistic expression of Chinese ink wash painting. It inherits the freehand brushwork characteristics of Chinese ink wash paintings, and is widely used in the film, advertising, and art industries. However, to produce traditional ink videos, a large number of paintings need to be drawn, which is considerably time-consuming. This practical requirement motivates our work: we investigate an approach to automatically convert a real video into an ink video, which can support large-scale production and help people enjoy, understand and analyze this art form.

Since Gatys et al. [2016], there have been many efforts towards image/video stylization. Among them, optical flow is widely used in video stylization for temporal coherence [Ruder et al., 2016; Chen et al., 2017; Huang et al., 2017]. However, existing methods are less effective for ink wash styles. The reasons are twofold: 1) Motions: it is still challenging to estimate optical flow for large motions, while wrong flows will cause flickers and jitters; 2) Strokes: the high contrast between the black strokes and white spaces in ink wash styles makes subtle changes evident, which requires higher temporal consistency. In this paper, we propose a novel multi-frame fusion scheme to solve these problems.

Meanwhile, although methods following [Gatys et al., 2016] show good performance on richly textured styles such as oil paintings, their results are less satisfactory for Chinese ink wash styles, as shown in Fig. 1(b). It is hard for them to learn the ink wash brushstrokes and how to leave white spaces correctly. The challenges of Chinese ink wash style transfer lie in three aspects: 1) White space: Chinese ink wash paintings often leave white spaces to express ethereal beauty, which requires the model to identify the painting subject and determine the areas to leave blank; 2) Contour brushstroke: Chinese ink wash paintings usually do not use filled colors, so the model has to carefully render contour brushstrokes to distinguish different subjects;

∗Corresponding author.

3) Subject size: subjects in real photos are very diverse in size, which requires the model to adjust the brushstroke size for different subjects, given the fixed kernel size. Recently, attempts have been made at Chinese ink wash style transfer. ChipGAN [He et al., 2018] learns the Chinese ink wash style with Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]. However, ChipGAN treats the image as a whole and is still not adaptive enough to the semantic contents. In this paper, we propose three specialized instance-aware constraints to meet the above-mentioned challenges.

In general, we propose a novel Chinese ink wash video style transfer network. The key ideas include multi-frame fusion for temporal coherence and instance-aware style transfer. To maintain temporal consistency, we reorder the frames to flexibly select proper reference frames for optical flow estimation, and fuse the reference frames at the feature level. Moreover, a spatial-transformation consistency constraint is imposed to alleviate the sensitivity of GANs to small input changes, further stabilizing the outputs. For introducing instance awareness, we first identify the painting subject and background, then apply a white space constraint, a contour contrast constraint, and adaptive scale selection to guide the network to extract features at the scale adaptive to the image content. In summary, our contributions are threefold:

• We propose a novel instance-aware coherent video style transfer framework, which translates video frames into Chinese ink wash paintings with both high temporal consistency and fine style imitation.

• We propose a multi-frame fusion strategy for robust video style transfer, which is end-to-end trainable and can efficiently generate stable results.

• We analyze the characteristics of Chinese ink wash paintings and propose an instance-aware style transfer method, which can generate proper white spaces and distinct contours, and is adaptive to multi-scale subjects.

2 Related Work

2.1 Image Style Transfer
Style transfer aims to transfer a painting's style onto real photos. Traditional methods [Hertzmann et al., 2001] are mainly based on non-parametric texture synthesis, failing to consider the high-level concept of style. Recently, benefiting from the development of deep convolutional networks, Neural Style Transfer (NST) [Gatys et al., 2016] successfully learns high-level style features and shows superior performance. Later, many works improve NST in terms of speed [Johnson et al., 2016; Huang and Belongie, 2017; Li et al., 2017b] and theoretical explanation [Li et al., 2017a]. Besides NST-based methods, another way of style transfer is to treat styles as image domains and use image-to-image translation models such as CycleGAN [Zhu et al., 2017]. However, as analyzed in Sec. 1, the above universal style transfer methods are not suitable for Chinese ink wash styles, which implies the requirement of specialized architecture and loss designs.

2.2 Video Style Transfer
Directly applying image style transfer to videos leads to flickers and discontinuities. To maintain temporal consistency, Ruder et al. [2016] introduce optical flow into NST by adding a new loss function on the continuity of adjacent frames. For acceleration, Chen et al. [2017] design a feed-forward video style transfer model based on feature fusion. However, due to large motions, intensive ink wash brushstrokes, and the sensitivity of GANs, directly exploiting optical flow-based schemes leads to unsatisfactory results. In this paper, we propose a coherent Chinese ink wash video style transfer model based on multi-frame fusion.

2.3 Chinese Ink Wash Painting Generation
Traditional Chinese ink wash style transfer methods [Yu et al., 2003; Yang and Xu, 2013; Liang and Jin, 2013] mainly rely on brushstroke simulation, which is manually designed and not robust enough for real applications. ChipGAN [He et al., 2018] is a GAN-based model that learns the ink wash style directly from data. Zhang et al. [2020] improve it by modelling styles with AdaIN [Huang and Belongie, 2017]. However, these methods are less adaptive to the image content and fail to consider temporal consistency for videos. In this paper, we propose a new model to meet the challenges of Chinese ink wash painting video style transfer.
3 Instance-Aware Coherent Chinese Ink Wash Video Style Transfer

Let X and Y be the real photo domain and the Chinese ink wash painting domain, respectively. Our model aims to learn the bidirectional mapping between X and Y, while additionally considering the temporal consistency between consecutive frames. Our model consists of two generators F : X → Y and B : Y → X with two corresponding discriminators D_Y and D_X. We further disassemble F into two mappings: an encoder E and a generator G, where E extracts the feature map of the image and G generates ink paintings from the feature map. In this section, we first introduce our strategy for coherent video stylization. Then, we propose the instance-aware style transfer method for Chinese ink wash paintings. Finally, we present our full training objective.

3.1 Coherent Video Style Transfer
First, we briefly review the idea of the optical flow-based temporal consistency constraint. Optical flow describes the motion of each pixel between two frames. Ideally, we have the relationship x_t = ω(x_{t−1}, f(x_{t−1}, x_t)), where x_t is the frame at time t, f(x, x') is the optical flow from frame x to x', and ω(x, f) represents warping x based on f. An occlusion map M_o is often introduced to handle object occlusion and inaccurate estimation: M_o ⊗ x_t = M_o ⊗ ω(x_{t−1}, f(x_{t−1}, x_t)), with ⊗ the element-wise multiplication operator.

There are several drawbacks to the traditional optical flow-based methods. In the following, we point out and analyze each problem and provide our solutions.
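The occlusion-masked consistency relation above can be made concrete in a few lines. The sketch below, a minimal illustration rather than the released implementation, assumes the flow is stored as per-pixel (dx, dy) displacements suitable for backward warping and that the occlusion mask marks valid pixels with 1; it warps the previous frame with `grid_sample` and measures the masked temporal error.

```python
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Warp image/feature x (N,C,H,W) by a backward flow (N,2,H,W) given in pixels."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(x.device)    # (2,H,W), channel 0 = x coord
    coords = grid.unsqueeze(0) + flow                           # follow the flow vectors
    # normalize coordinates to [-1, 1] for grid_sample
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)       # (N,H,W,2)
    return F.grid_sample(x, grid_norm, align_corners=True)

def temporal_error(frame_t, frame_prev, flow, occ_mask):
    """Occlusion-masked consistency: M_o * x_t should match M_o * warp(x_{t-1})."""
    warped = warp(frame_prev, flow)
    diff = (frame_t - warped) * occ_mask
    return diff.pow(2).sum() / occ_mask.sum().clamp(min=1.0)

if __name__ == "__main__":
    xt = torch.rand(1, 3, 64, 64)
    xp = torch.rand(1, 3, 64, 64)
    flow = torch.zeros(1, 2, 64, 64)    # zero motion as a sanity check
    occ = torch.ones(1, 1, 64, 64)      # 1 = valid (non-occluded) pixels
    print(temporal_error(xt, xp, flow, occ).item())
```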



Figure 2: Pipeline of the feature-based multi-frame fusion framework. Our method fuses the feature maps from k selected reference frames based on optical flow to achieve temporal coherence. For simplicity, we omit occlusion masks and the spatial-transformation consistency loss.

Spatial-Transformation Consistency
Directly warping frames according to optical flow easily causes distortion and seam artifacts [Chen et al., 2017]. Naturally, we warp the feature map instead. However, in experiments we find that the GAN is very sensitive to feature changes, i.e., the transformation on the input feature does not match the transformation on the output image. To solve this problem, we propose a spatial-transformation consistency loss by imposing an input-output consistency constraint under the optical flow warping transformation. Let y_ω = ω(F(x), f) denote the warped stylized image and H_ω = ω(E(x), f↓) denote the warped feature map, where f↓ is the down-sampled f matching the size of E(x). Our loss function is:

L_cons = E_x[ ||(G(H_ω) − y_ω) ⊗ M_o||₂² / |M_o| ],  (1)

where |·| calculates the number of pixels. Intuitively, L_cons keeps the consistency between the feature input E(x) and the image output F(x) under the spatial transformation ω.

Feature-Based Multi-Frame Fusion
In most video stylization methods, only the previous frame is used as the reference. However, because of object occlusion and inaccurate optical flow estimation, using only one frame may cause large errors. Therefore, we propose a multi-frame feature fusion scheme to stylize the target frame x_t with k reference frames at times {t_1, ..., t_k}. For simplicity, in the following we omit t and use x and x_i to denote the target frame x_t and the reference frame x_{t_i}, respectively.

As illustrated in Fig. 2, we first align the feature map H_i = E(x_i) of the reference frame x_i to the feature map H = E(x) of the target frame x based on their optical flow: R_i = ω(H_i, f_{i↓}), where f_{i↓} = f(x_i, x)↓ is the down-sampled optical flow between x_i and x. The aligned reference feature maps {R_1, ..., R_k} are then fused into one reference feature map R_f. R_f is further fused with H to obtain the final feature map H', which is decoded into the stylization result G(H'). Here, two networks W_r and W are proposed to predict the fusion weight maps. Specifically, W_r is fed with (R_i − H) to predict the weight map w_i of R_i. The weight maps are normalized by softmax and used to fuse {R_1, ..., R_k} into the reference feature map R_f. Similarly, W is fed with (R_f − H) to predict the weight map w, and fuses R_f and H:

R_f = Σ_{i=1}^{k} w̃_i ⊗ R_i,  where w̃_i = e^{w_i} / Σ_{j=1}^{k} e^{w_j},  (2)

H' = w ⊗ R_f + (1 − w) ⊗ H.  (3)

To train the networks W_r and W, we propose a temporal loss that requires high temporal consistency after fusion. We expect that the stylized frame and the aligned reference frames are consistent in non-occlusion regions:

L_tmp = E_x[ ||(G(H') − F(x)) ⊗ M_{o,c}||₂² / |M_{o,c}| + Σ_{i=1}^{k} ||(G(H') − ω(F(x_i), f_i)) ⊗ M_{o,i}||₂² / |M_{o,i}| ],  (4)

where f_i and M_{o,i} are the optical flow and occlusion map between x and x_i, and M_{o,c} = max(0, 1 − Σ_{i=1}^{k} M_{o,i}) is the fused occlusion map indicating regions with no reference.
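To make the fusion step concrete, the sketch below implements Eqs. (2)–(3) with two small convolutional weight predictors standing in for W_r and W. The layer sizes, the single-channel weight maps, and the sigmoid on w are illustrative assumptions, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class MultiFrameFusion(nn.Module):
    """Fuse k aligned reference feature maps R_i with the target feature map H."""
    def __init__(self, channels):
        super().__init__()
        # W_r: predicts a per-pixel score for each aligned reference feature
        self.w_r = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        # W: predicts how much of the fused reference to keep versus H
        self.w = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=3, padding=1),
                               nn.Sigmoid())

    def forward(self, H, refs):
        # refs: list of k aligned reference features R_i, each of shape (N, C, h, w)
        scores = torch.stack([self.w_r(r - H) for r in refs], dim=0)   # (k, N, 1, h, w)
        weights = torch.softmax(scores, dim=0)                         # Eq. (2): softmax over references
        R_f = (weights * torch.stack(refs, dim=0)).sum(dim=0)          # fused reference feature map
        w = self.w(R_f - H)                                            # per-pixel weight in (0, 1)
        return w * R_f + (1.0 - w) * H                                 # Eq. (3)

if __name__ == "__main__":
    fusion = MultiFrameFusion(channels=64)
    H = torch.rand(1, 64, 32, 32)
    refs = [torch.rand(1, 64, 32, 32) for _ in range(3)]
    print(fusion(H, refs).shape)   # torch.Size([1, 64, 32, 32])
```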
Frame Selection by Reordering
In our multi-frame fusion process, the k reference frames should contain different information to be fused together. So there is a potential constraint that these reference frames should be "far away" from each other. Meanwhile, sequential reference may accumulate errors in rendering: errors in the preceding frames will propagate to subsequent frames through sequential references until the last few results become unacceptable. A non-sequential reference structure, such as a binary tree, may be better. Therefore, we propose a frame reordering scheme. By adjusting the order of generation, we make the reference frames diverse and organize the reference relationships adaptively.

We first divide all frames into two sets: S_inked and S_uninked, denoting frames that have been stylized and frames to be stylized, respectively. In each round, we select the most appropriate frame x̂ from S_uninked, and then select the k closest frames from S_inked as reference frames to generate the corresponding ink video frame of x̂. Specifically, to select x̂ from S_uninked, we design the criterion:

x̂ = arg min_x ( Σ_{x'∈S_uninked} α·δ(x, x') / |S_uninked| − Σ_{x'∈S_inked} δ(x, x') / |S_inked| ),

where α is a hyper-parameter and δ(x, x') = ||f(x, x')||₂² measures the distance between frames x and x', with f(x, x') the optical flow from x to x'. The whole formula can be understood as follows: we want to select the frame that is close to the unstylized frames while being different from the stylized frames. As a result, the frames in S_inked are "far away" from each other, which meets our constraint, and the proximity of the selected frame to the unstylized frames means it will provide reference for more frames in subsequent rounds.

Then, we select the k reference frames from S_inked: the frames closest to x̂ in terms of δ(·, ·) are used as references. Finally, we generate the ink video frame corresponding to x̂ based on the selected reference frames, and then move x̂ from S_uninked to S_inked. Through frame reordering, the reference structure can be non-sequential, which is better than sequential reference in terms of avoiding accumulated errors.
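A minimal sketch of the reordering loop is given below. It assumes a precomputed pairwise distance matrix delta[i][j] = ||f(x_i, x_j)||₂² and greedily applies the selection criterion; seeding the stylized set with the first frame and the plain-Python data structures are illustrative choices, not details specified in the paper.

```python
def reorder_frames(delta, num_frames, k=2, alpha=1.0):
    """Greedy stylization order: each round picks the frame that is close to the
    unstylized set and far from the stylized set, then its k nearest stylized
    frames serve as references."""
    inked = [0]                      # assume frame 0 is stylized first (seed)
    uninked = list(range(1, num_frames))
    schedule = []                    # (target frame, [reference frames]) per round

    while uninked:
        def score(x):
            near_uninked = sum(alpha * delta[x][u] for u in uninked) / len(uninked)
            far_inked = sum(delta[x][s] for s in inked) / len(inked)
            return near_uninked - far_inked      # the selection criterion to minimize
        target = min(uninked, key=score)
        refs = sorted(inked, key=lambda s: delta[target][s])[:k]
        schedule.append((target, refs))
        uninked.remove(target)
        inked.append(target)
    return schedule

if __name__ == "__main__":
    # toy distance matrix: frames farther apart in time move more
    n = 6
    delta = [[abs(i - j) for j in range(n)] for i in range(n)]
    for target, refs in reorder_frames(delta, n, k=2):
        print(f"stylize frame {target} using references {refs}")
```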


Figure 3: Pipeline of instance-aware Chinese ink wash style transfer. For simplicity, this example only contains one instance (N = 1).

3.2 Instance-Aware Chinese Ink Wash Stylization

The Chinese ink wash style has many characteristics that need special consideration. By observing professional paintings, we find that white space is often left outside the painting subject, and the contour is usually depicted with heavy brushstrokes to highlight the content. Meanwhile, the stroke size is basically determined by the size of the subject: for small subjects, finer strokes are used to better depict their details, and vice versa. All of these require identifying the painting subject, which is closely related to object detection.

Therefore, we are motivated to leverage an advanced instance segmentation tool [Kirillov et al., 2019], and propose an instance-aware style transfer network. Based on the detected subjects, we design a whiten loss L_whiten for white space and a contour loss L_contour for contour brushstrokes. Moreover, we propose a scale normalization strategy to adaptively adjust the scale of the subject.

White Space Constraint. Leaving white space means that more ink needs to be allocated to the significant subjects, such as the horses in Fig. 1, while trivial objects are left rough or blank to emphasize the simplicity of the painting. Given the detected foreground instance mask M_f of the target frame x ∈ X, we expect the ratio of the average darkness of the background area to that of the foreground area in the generated painting to be small. So we propose the whiten loss:

L_whiten = E_x[ (||(1 − F(x)) ⊗ M_b||₂² / |M_b|) / (||(1 − F(x)) ⊗ M_f||₂² / |M_f|) ],  (5)

where M_b = 1 − M_f is the background mask. F(x) is the generated painting in RGB space with values in [−1, 1], so (1 − F(x)) represents the darkness of the generated painting.

Contour Contrast Constraint. In Chinese ink wash paintings, the brushstrokes of different objects need to be carefully distinguished. To achieve this, we propose to increase the high-frequency components at the contours of the subjects. Specifically, we calculate the contours based on the instance segmentation results, and use a contour loss to increase the differentiation between object boundaries:

L_contour = 2 − E_x[ ||Haar(F(x)) ⊗ M_c||₁ / |M_c| ],  (6)

where M_c is the contour mask of x, with 1 representing the subject contour and 0 representing the rest. Haar(x) denotes the high-frequency components of x obtained by applying the Haar wavelet, with values in [−2, 2]; thus, the offset 2 is added in the loss.
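The two instance-aware losses can be sketched directly from Eqs. (5)–(6). The snippet below assumes the output F(x) lies in [−1, 1], the masks are binary tensors of shape (N, 1, H, W), and the Haar high-frequency response is approximated with fixed 2×2 wavelet detail filters; these choices are illustrative rather than the exact released implementation.

```python
import torch
import torch.nn.functional as F

def whiten_loss(fx, m_fg):
    """Eq. (5): ratio of background darkness to foreground darkness of output fx in [-1, 1]."""
    m_bg = 1.0 - m_fg
    dark = 1.0 - fx                                        # darkness of the generated painting
    bg = (dark * m_bg).pow(2).sum() / m_bg.sum().clamp(min=1.0)
    fg = (dark * m_fg).pow(2).sum() / m_fg.sum().clamp(min=1.0)
    return bg / fg.clamp(min=1e-6)

def contour_loss(fx, m_contour):
    """Eq. (6): encourage a large Haar high-frequency response on the subject contours."""
    gray = fx.mean(dim=1, keepdim=True)
    # 2x2 Haar detail filters (horizontal, vertical, diagonal); each response lies in [-2, 2]
    kernels = torch.tensor([[[[0.5,  0.5], [-0.5, -0.5]]],
                            [[[0.5, -0.5], [ 0.5, -0.5]]],
                            [[[0.5, -0.5], [-0.5,  0.5]]]], dtype=gray.dtype)
    high = F.conv2d(gray, kernels).abs().mean(dim=1, keepdim=True)   # mean detail magnitude
    m = m_contour[:, :, :high.shape[2], :high.shape[3]]              # crop mask to conv output size
    return 2.0 - (high * m).sum() / m.sum().clamp(min=1.0)

if __name__ == "__main__":
    fx = torch.rand(1, 3, 64, 64) * 2 - 1                            # toy stylized frame in [-1, 1]
    m_fg = torch.zeros(1, 1, 64, 64); m_fg[:, :, 16:48, 16:48] = 1.0
    m_c = torch.zeros(1, 1, 64, 64); m_c[:, :, 16:48, 16] = 1.0      # left edge of the subject box
    print(whiten_loss(fx, m_fg).item(), contour_loss(fx, m_c).item())
```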
Adaptive Scale Selection. Since deep models are limited by the convolution kernel size, it is difficult to adaptively extract features according to the subject scale. Therefore, we design a novel scale normalization strategy. Considering a frame x with N instances, let x^n and M^n be the bounding box region and the mask of the n-th subject, respectively, i.e., M_f = Σ_{n=1}^{N} M^n. As shown in Fig. 3, our encoder E consists of a basic encoder E⁰. We first scale each subject region to a predefined fixed size, obtaining s(x^n) with s the scale operation. Then we feed it to E⁰ to extract the feature map at this predefined scale. Later, we re-scale the feature map Ĥ^n back to its original size as H^n = s⁻¹(Ĥ^n), with s⁻¹ the re-scale operation, and fuse all H^n with the original feature map E⁰(x) to obtain the final feature map E(x):

E(x) = Σ_{n=1}^{N} H^n ⊗ M^n + E⁰(x) ⊗ M_b.  (7)

The final stylization result is obtained as F(x) = G(E(x)). Here, the fixed size is set to the average subject size in the training set, which means the normalized subject is stylized using the most suitable brushstroke size. In Sec. 3.1, we introduced a spatial-transformation consistency constraint. Now we can extend it to both warping and scale transformations:

L_cons = E_x[ ||(G(H_ω) − y_ω) ⊗ M_o||₂² / |M_o| ] + E_x[ Σ_{n=1}^{N} ||(s⁻¹(G(Ĥ^n)) − F(x)) ⊗ M^n||₂² / |M^n| ].  (8)
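The scale normalization step can be sketched as follows: each detected subject is cropped from its bounding box, resized to a fixed reference size, encoded, resized back, and pasted into the frame-level feature map according to Eq. (7). The toy encoder, the fixed size of 128, and the (x0, y0, x1, y1) box format below are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleNormEncoder(nn.Module):
    """Encode each detected subject at a fixed reference scale and paste it back (Eq. (7))."""
    def __init__(self, base_encoder, fixed_size=128, down_factor=4):
        super().__init__()
        self.enc = base_encoder        # E^0: (1,3,H,W) -> (1,C,H/d,W/d)
        self.fixed_size = fixed_size   # assumed average subject size of the training set
        self.d = down_factor

    def forward(self, x, boxes, masks):
        """x: (1,3,H,W); boxes: list of (x0,y0,x1,y1) with coords divisible by down_factor;
        masks: (N_inst,1,H/d,W/d) binary instance masks at feature resolution."""
        feat = self.enc(x)                                              # E^0(x)
        m_bg = 1.0 - masks.sum(dim=0, keepdim=True).clamp(max=1.0)      # background mask M_b
        out = feat * m_bg                                               # E^0(x) masked by M_b
        for n, (x0, y0, x1, y1) in enumerate(boxes):
            crop = x[:, :, y0:y1, x0:x1]                                # subject region x^n
            crop = F.interpolate(crop, size=(self.fixed_size, self.fixed_size),
                                 mode="bilinear", align_corners=False)  # s(x^n)
            h_hat = self.enc(crop)                                      # feature at the normalized scale
            fh, fw = (y1 - y0) // self.d, (x1 - x0) // self.d
            h_n = F.interpolate(h_hat, size=(fh, fw),
                                mode="bilinear", align_corners=False)   # re-scaled feature H^n
            region = torch.zeros_like(feat)
            region[:, :, y0 // self.d:y0 // self.d + fh,
                         x0 // self.d:x0 // self.d + fw] = h_n
            out = out + region * masks[n:n + 1]                         # add H^n masked by M^n
        return out

if __name__ == "__main__":
    enc = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 64, 3, stride=2, padding=1))
    model = ScaleNormEncoder(enc, fixed_size=128, down_factor=4)
    x = torch.rand(1, 3, 256, 256)
    boxes = [(32, 32, 160, 160)]
    masks = torch.zeros(1, 1, 64, 64); masks[:, :, 8:40, 8:40] = 1.0
    print(model(x, boxes, masks).shape)   # torch.Size([1, 64, 64, 64])
```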

3.3 Full Objective

We use the adversarial cycle loss [Zhu et al., 2017] to learn the bidirectional mapping between X and Y:

L_adv = E_y[log D_Y(y)] + E_x[log(1 − D_Y(F(x)))] + E_x[log D_X(x)] + E_y[log(1 − D_X(B(y)))],

L_cycle = γ·E_x[||B(F(x)) − x||₁] + E_y[||F(B(y)) − y||₁].

Converting from photos to paintings inevitably loses some information. Therefore, we add a parameter γ < 1 to reduce the weight of the forward cycle consistency in L_cycle.

We also incorporate the brushstroke loss L_stroke and the ink wash loss L_ink introduced by ChipGAN [He et al., 2018] to emphasize the subject edges and realize the ink wash diffusion, respectively. Finally, our full objective is:

L = L_adv + λ₁L_cycle + λ₂L_cons + λ₃L_tmp + λ₄L_whiten + λ₅L_contour + λ₆L_stroke + λ₇L_ink,  (9)

where λ₁–λ₇ are hyper-parameters that balance these losses. We will later verify the effects of the individual terms through ablation studies.
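For concreteness, the sketch below assembles the adversarial cycle losses and the weighted sum of Eq. (9); the network arguments are assumed callables, and the γ and λ values are placeholders rather than the paper's tuned settings.

```python
import torch

def adversarial_cycle_losses(x, y, F_net, B_net, D_X, D_Y, gamma=0.8):
    """L_adv and L_cycle from Sec. 3.3; gamma < 1 down-weights the forward cycle term."""
    eps = 1e-8
    fx, by = F_net(x), B_net(y)
    l_adv = (torch.log(D_Y(y) + eps).mean() + torch.log(1 - D_Y(fx) + eps).mean()
             + torch.log(D_X(x) + eps).mean() + torch.log(1 - D_X(by) + eps).mean())
    l_cycle = gamma * (B_net(fx) - x).abs().mean() + (F_net(by) - y).abs().mean()
    return l_adv, l_cycle

def full_objective(l_adv, l_cycle, l_cons, l_tmp, l_whiten, l_contour, l_stroke, l_ink,
                   lambdas=(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)):
    """Eq. (9): weighted sum of all loss terms; the lambda values are placeholders."""
    l1, l2, l3, l4, l5, l6, l7 = lambdas
    return (l_adv + l1 * l_cycle + l2 * l_cons + l3 * l_tmp
            + l4 * l_whiten + l5 * l_contour + l6 * l_stroke + l7 * l_ink)

if __name__ == "__main__":
    # toy check with identity "generators" and constant discriminators
    x = torch.rand(2, 3, 8, 8); y = torch.rand(2, 3, 8, 8)
    ident = lambda t: t
    disc = lambda t: torch.full((t.shape[0], 1), 0.5)
    l_adv, l_cycle = adversarial_cycle_losses(x, y, ident, ident, disc, disc)
    zeros = [torch.tensor(0.0)] * 6
    print(full_objective(l_adv, l_cycle, *zeros).item())
```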

4 Experimental Results

4.1 Implementation Details
For training, we use the Chinese ink wash paintings in the ChipPhi dataset [He et al., 2018], and further collect 115 videos of horses from the Internet, about 10k frames in total. PWC-Net [Niklaus, 2019; Sun et al., 2018] is used to estimate optical flows for training and testing. More experimental details and results are provided in the supplementary material.

4.2 Comparison with State-of-the-Arts
To validate the effectiveness of the proposed method, we compare with the following state-of-the-art video style transfer methods: Linear [Li et al., 2019], Compound [Wang et al., 2020] and ChipGAN [He et al., 2018], and image style transfer methods: AdaIN [Huang and Belongie, 2017], WCT [Li et al., 2017b], CycleGAN [Zhu et al., 2017] and ChipGAN [He et al., 2018].

Table 1: User preference rate of different methods, covering (a) Chinese ink wash video style transfer and (b) Chinese ink wash image style transfer.

Table 2: FID score of different methods.

Figure 4: Video style transfer comparison with ChipGAN [He et al., 2018], Linear [Li et al., 2019] and Compound [Wang et al., 2020].

Figure 5: Comparison with AdaIN [Huang and Belongie, 2017], WCT [Li et al., 2017b], CycleGAN [Zhu et al., 2017] and ChipGAN [He et al., 2018] on single frame stylization.

Video Style Transfer. Results are shown in Fig. 1(a) and Fig. 4. ChipGAN does not consider temporal consistency, therefore its results have flickers and jitters. Linear and Compound avoid the problem of flickering; however, they fail to imitate the ink strokes and suffer from color bias and blurry artifacts. By comparison, our model can maintain good temporal consistency thanks to the proposed schemes. Videos and more results can be found in the supplementary material. For quantitative comparison, we invite 40 users to score the results from three aspects: temporal consistency, stylization quality, and comprehensive quality. In Table 1(a), our method has the highest temporal, stylization, and comprehensive scores.

Image Style Transfer. Results are shown in Fig. 1(b) and Fig. 5. AdaIN and WCT transfer the color tonality but fail to render the brushstrokes. CycleGAN tends to use dry strokes to depict the image content, and the strokes lack variety. ChipGAN fuses messy brushstrokes with the surrounding backgrounds, so the foreground subjects and the backgrounds are interwoven. By comparison, our model successfully leaves white spaces in the background and better depicts the contour of each instance subject. Our results are clean and follow the high-contrast and simplicity principles of Chinese ink wash paintings. For quantitative evaluation, Table 1(b) reports the Mean Opinion Score (MOS) in the user study and Table 2 reports the FID score [Heusel et al., 2017], where our model is the best.

Table 3: User preference rate for ablation studies.

(a) Chinese ink wash video style transfer
Config          w/o fusion   w/o reordering   full
Temporal        1.92         1.72             2.36
Stylization     1.91         1.92             2.18
Comprehensive   1.88         1.81             2.32

(b) Chinese ink wash image style transfer
Config   w/o Lwhiten   w/o Lcontour   w/o scaleNorm   full
MOS      1.83          2.21           2.95            3.01

Figure 7: Ablation studies for instance-aware style transfer (Input, w/o scaleNorm, w/o Lwhiten, w/o Lcontour, Full).


Figure 6: Effects of frame fusion and frame reordering. The difference between the two frames is visualized in the middle row (Input, w/o fusion, w/o reordering, Full).

Figure 8: Generalization of the proposed method. (a) Results on dog and tiger with the model trained on horses. (b) Results of training on the fish dataset. (c) Sketch painting style transfer (input frames, CycleGAN, Ours).

Figure 9: The improvement brought by our method to CycleGAN (input frames, CycleGAN, CycleGAN + our schemes).

4.3 Ablation Study
In Fig. 6, we compare the video style transfer results under different configurations. Without frame fusion, the result has texture jitters. Without frame reordering, the horse's nose becomes longer as the horse moves, which is caused by sequential references. By comparison, our full model avoids these problems and achieves better temporal consistency.

Fig. 7 analyzes the designs of our instance-aware Chinese ink wash stylization. Without L_whiten, the result is much messier. Without L_contour, the contour of the horse gets blurred and mixed with the background. Without scale normalization, the model fills the body of the horse with black, which is rare in Chinese ink wash paintings. In comparison, the result of our full model is more vivid and balanced.

For quantitative comparison, we conduct a user study in Table 3 following the previous settings. The full model has the best result, demonstrating the effectiveness of our designs.

4.4 Model Generalization
Generalization of Content. Though trained only on horses, our model has learned the general features of ink wash and thus can transfer other subjects, as shown in Fig. 8(a). We also collect data and train for fish; the results are good, as shown in Fig. 8(b). More results and details are in the supplementary material.

Generalization of Style. The proposed model is not limited to Chinese ink wash paintings; it can be well applied to other styles. In Fig. 8(c), we trained our model on the Quick Draw Dataset¹. Our model successfully generates sketches better than CycleGAN. Moreover, in Fig. 9, our frame fusion and reordering schemes effectively improve the temporal consistency of CycleGAN for oil painting video style transfer.

5 Conclusion
In this paper, we study the challenging problem of video style transfer for Chinese ink wash painting. A novel multi-frame fusion with frame reordering is proposed for temporal consistency, and an instance-aware style transfer is proposed to transfer ink wash painting characteristics. We validate the effectiveness and robustness of our method through comparison experiments and comprehensive ablation studies.

Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities and the National Natural Science Foundation of China under Contract No. 61772043, and is a research achievement of the Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology).

¹https://github.com/googlecreativelab/quickdraw-dataset


References

[Chen et al., 2017] Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. Coherent online video style transfer. In Proc. Int'l Conf. Computer Vision, pages 1105–1114, 2017.

[Gatys et al., 2016] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pages 2414–2423, 2016.

[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[He et al., 2018] Bin He, Feng Gao, Daiqian Ma, Boxin Shi, and Ling-Yu Duan. ChipGAN: A generative adversarial network for Chinese ink wash painting style transfer. In Proc. ACM Int'l Conf. Multimedia, pages 1172–1180, 2018.

[Hertzmann et al., 2001] Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David H. Salesin. Image analogies. In Proc. the 28th Annual Conf. Computer Graphics and Interactive Techniques, pages 327–340, 2001.

[Heusel et al., 2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[Huang and Belongie, 2017] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. Int'l Conf. Computer Vision, pages 1510–1519, 2017.

[Huang et al., 2017] Haozhi Huang, Hao Wang, Wenhan Luo, Lin Ma, Wenhao Jiang, Xiaolong Zhu, Zhifeng Li, and Wei Liu. Real-time neural style transfer for videos. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pages 783–791, 2017.

[Johnson et al., 2016] Justin Johnson, Alexandre Alahi, and Fei-Fei Li. Perceptual losses for real-time style transfer and super-resolution. In Proc. European Conf. Computer Vision, pages 694–711, 2016.

[Kirillov et al., 2019] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. PointRend: Image segmentation as rendering. 2019.

[Li et al., 2017a] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. In Int'l Joint Conf. Artificial Intelligence, pages 2230–2236, 2017.

[Li et al., 2017b] Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Universal style transfer via feature transforms. In Advances in Neural Information Processing Systems, pages 386–396, 2017.

[Li et al., 2019] Xueting Li, Sifei Liu, Jan Kautz, and Ming-Hsuan Yang. Learning linear transformations for fast arbitrary style transfer. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2019.

[Liang and Jin, 2013] Lingyu Liang and Lianwen Jin. Image-based rendering for ink painting. In Proc. IEEE Int'l Conf. Systems, Man, and Cybernetics, pages 3950–3954, 2013.

[Niklaus, 2019] Simon Niklaus. A reimplementation of PWC-Net using PyTorch. https://github.com/sniklaus/pytorch-pwc, Nov 19, 2019.

[Ruder et al., 2016] Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. Artistic style transfer for videos. In German Conf. Pattern Recognition, pages 26–36, 2016.

[Sun et al., 2018] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2018.

[Wang et al., 2020] Wenjing Wang, Jizheng Xu, Li Zhang, Yue Wang, and Jiaying Liu. Consistent video style transfer via compound regularization. In AAAI Conference on Artificial Intelligence, 2020.

[Yang and Xu, 2013] Lijie Yang and Tianchen Xu. Animating Chinese ink painting through generating reproducible brush strokes. Science China Information Sciences, 56(1):1–13, 2013.

[Yu et al., 2003] Jinhui Yu, Guoming Luo, and Qunsheng Peng. Image-based synthesis of Chinese landscape painting. Journal of Computer Science and Technology, 18(1):22–28, 2003.

[Zhang et al., 2020] Fengquan Zhang, Huaming Gao, and Yuping Lai. Detail-preserving CycleGAN-AdaIN framework for image-to-ink painting translation. IEEE Access, 8:132002–132011, 2020.

[Zhu et al., 2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. Int'l Conf. Computer Vision, pages 2242–2251, 2017.
