Deep Animation Video Interpolation in the Wild
Li Siyao1*, Shiyu Zhao1,2*, Weijiang Yu3, Wenxiu Sun1,4, Dimitris Metaxas2, Chen Change Loy5, Ziwei Liu5
1SenseTime Research and Tetras.AI  2Rutgers University  3Sun Yat-sen University  4Shanghai AI Laboratory  5S-Lab, Nanyang Technological University
*Equal contributions. Corresponding author: Ziwei Liu.

Figure 1: A typical example of animation video interpolation (left to right: Frame 0, Frame 1, DAIN, SoftSplat, Ours). Our approach correctly estimates optical flows for large motions and restores the content, while competing methods fail to handle such motions.

Abstract

In the animation industry, cartoon videos are usually produced at a low frame rate, since hand drawing such frames is costly and time-consuming. It is therefore desirable to develop computational models that can automatically interpolate the in-between animation frames. However, existing video interpolation methods fail to produce satisfying results on animation data. Compared to natural videos, animation videos possess two unique characteristics that make frame interpolation difficult: 1) cartoons comprise lines and smooth color pieces, and the smooth areas lack textures, which makes it difficult to estimate accurate motions; 2) cartoons express stories via exaggeration, so some of the motions are non-linear and extremely large. In this work, we formally define and study the animation video interpolation problem for the first time. To address the aforementioned challenges, we propose an effective framework, AnimeInterp, with two dedicated modules working in a coarse-to-fine manner. Specifically, 1) Segment-Guided Matching resolves the "lack of textures" challenge by exploiting global matching among color pieces that are piece-wise coherent; 2) Recurrent Flow Refinement resolves the "non-linear and extremely large motion" challenge by recurrent predictions using a transformer-like architecture. To facilitate comprehensive training and evaluation, we build a large-scale animation triplet dataset, ATD-12K, which comprises 12,000 triplets with rich annotations. Extensive experiments demonstrate that our approach outperforms existing state-of-the-art interpolation methods on animation videos. Notably, AnimeInterp shows favorable perceptual quality and robustness for animation scenarios in the wild. The proposed dataset and code are available at https://github.com/lisiyao21/AnimeInterp/.

1. Introduction

In the animation industry, cartoon videos are produced from hand drawings of expert animators using a complex and precise procedure. Drawing every frame of an animation video manually would consume tremendous time and thus lead to a prohibitively high cost. In practice, animation producers usually replicate one drawing two or three times to reduce the cost, which results in the low actual frame rate of animation videos. Therefore, it is highly desirable to develop computational algorithms that interpolate the intermediate animation frames automatically.

In recent years, video interpolation has made great progress on natural videos. In animations, however, existing video interpolation methods are not able to produce satisfying in-between frames. An example from the film Children Who Chase Lost Voices is illustrated in Figure 1, where the current state-of-the-art methods fail to generate a complete piece of luggage due to incorrect motion estimation, as shown in the lower-left corner of the image.
The challenges here stem from two unique characteristics of animation videos. 1) Cartoon images consist of explicit sketches and lines, which split the image into segments of smooth color pieces. Pixels within one segment are similar, which yields insufficient texture to match corresponding pixels between two frames and hence increases the difficulty of predicting accurate motions. 2) Cartoon animations use exaggerated expressions in pursuit of artistic effects, which result in non-linear and extremely large motions between adjacent frames. Two typical cases are depicted in Figure 2 (a) and (b), which illustrate these challenges respectively. Due to these difficulties, video interpolation for animations remains a challenging task.

Figure 2: Two challenges in animation video interpolation. (a) Piece-wise smooth animations lack textures. (b) Non-linear and extremely large motions.

In this work, we develop an effective and principled method for video interpolation in animations. In particular, we propose an effective framework, AnimeInterp, to address the two aforementioned challenges. AnimeInterp consists of two dedicated modules, a Segment-Guided Matching (SGM) module and a Recurrent Flow Refinement (RFR) module, which are designed to predict accurate motions for animations in a coarse-to-fine manner. More specifically, the SGM module computes a coarse piece-wise optical flow using global semantic matching among color pieces split by contours. Since similar pixels belonging to one segment are treated as a whole, SGM avoids the local minima caused by mismatching in smooth areas, which resolves the "lack of textures" problem. To tackle the "non-linear and extremely large motion" challenge, the piece-wise flow estimated by SGM is further enhanced by a transformer-like network named Recurrent Flow Refinement. As shown in Figure 1, our proposed approach better estimates the flow of the luggage under large displacement and generates a complete in-between frame.
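To make the segment-matching idea concrete, the toy sketch below quantizes colors into connected "color pieces", greedily matches pieces between two frames by mean intensity and area, and assigns every pixel of a piece the displacement between matched piece centroids. This is only a conceptual illustration under simplified assumptions: the actual SGM module segments by contours and matches pieces with learned deep features, the recurrent refinement stage is omitted, and all function names here are ours, not from the released code.

import numpy as np
from skimage.measure import label, regionprops

def color_pieces(frame, levels=8):
    # Label connected regions of quantized color ("color pieces").
    quant = (frame // (256 // levels)).astype(np.int32)
    flat = (quant[..., 0] * levels + quant[..., 1]) * levels + quant[..., 2]
    return label(flat, connectivity=2)

def coarse_piecewise_flow(frame0, frame1, levels=8):
    labels0, labels1 = color_pieces(frame0, levels), color_pieces(frame1, levels)
    props0 = regionprops(labels0, intensity_image=frame0.mean(-1))
    props1 = regionprops(labels1, intensity_image=frame1.mean(-1))
    flow = np.zeros(frame0.shape[:2] + (2,), dtype=np.float32)
    for p0 in props0:
        # Greedy match: pick the target piece with the most similar mean intensity and area.
        costs = [abs(p0.mean_intensity - p1.mean_intensity) + 1e-3 * abs(p0.area - p1.area)
                 for p1 in props1]
        p1 = props1[int(np.argmin(costs))]
        dy, dx = np.subtract(p1.centroid, p0.centroid)
        flow[labels0 == p0.label] = (dx, dy)  # one displacement per color piece
    return flow  # coarse piece-wise flow; AnimeInterp then refines it with a recurrent network

frame0 = np.full((96, 160, 3), (200, 220, 240), dtype=np.uint8)  # flat background piece
frame0[30:60, 20:50] = (255, 60, 60)                             # one foreground piece
frame1 = np.roll(frame0, shift=40, axis=1)                       # large horizontal motion
print(coarse_piecewise_flow(frame0, frame1)[45, 35])             # approx. [40. 0.]

Even such a crude piece-level matcher recovers the 40-pixel displacement that pixel-wise matching on the textureless interior would miss, which is the intuition behind using segments as matching units.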
A large-scale animation triplet dataset, ATD-12K, is built to facilitate comprehensive training and evaluation of video interpolation methods on cartoon videos. Unlike other animation datasets, which consist of only single images, ATD-12K contains 12,000 frame triplets selected from 30 animation movies in different styles with a total length of over 25 hours. Apart from this diversity, our test set is divided into three difficulty levels according to the magnitude of motion and occlusion. We also provide annotations on movement categories for further analysis.

The contributions of this work can be summarized as follows: 1) We formally define and study the animation video interpolation problem for the first time. This problem is of significance to both academia and industry. 2) We propose an effective animation interpolation framework named AnimeInterp with two dedicated modules that resolve the "lack of textures" and "non-linear and extremely large motion" challenges. Extensive experiments demonstrate that AnimeInterp outperforms existing state-of-the-art methods both quantitatively and qualitatively. 3) We build a large-scale cartoon triplet dataset called ATD-12K, with large content diversity covering many types of animations, to test animation video interpolation methods. Given its data size and rich annotations, ATD-12K will pave the way for future research on animation.

2. Related Work

Video Interpolation. Video interpolation has been widely studied in recent years. In [16], Meyer et al. propose a phase-based video interpolation scheme, which performs impressively on videos with small displacements. In [19, 20], Niklaus et al. design a kernel-based framework to sample interpolated pixels via convolutions on corresponding patches of adjacent frames. However, the kernel-based framework is still not capable of processing large movements due to the limitation of the kernel size. To tackle large motions in videos, many studies use optical flows for video interpolation. Liu et al. [15] predict a 3D voxel flow to sample the inputs into the middle frame. Similarly, Jiang et al. [9] propose to jointly estimate bidirectional flows and an occlusion mask for multi-frame interpolation. Besides, some studies are dedicated to improving warping and synthesis given bidirectional flows [2, 1, 17, 18], and higher-order motion information is used to better approximate real-world videos [33, 5]. In addition to pixel-wise flows on images, "feature flows" on deep features are also explored for video interpolation [8]. Although existing methods achieve great success in interpolating real-world videos, they fail to handle the large and non-linear motions of animations; thus, animation video interpolation remains unsolved. In this paper, we propose a segment-guided matching module based on color piece matching, boosting the flow prediction.

Vision for Animation. In the vision and graphics community, there are many works on processing and enhancing animation content, e.g., manga sketch simplification [24, 23], vectorization of cartoon images [35, 34], stereoscopy of animation videos [14], and colorization of black-and-white manga [22, 27, 11]. In recent years, with the surge of deep learning, researchers have attempted to generate favorable animation content. For example, generative adversarial networks are used to generate vivid cartoon characters [10, 32], and style-transfer models are developed to transfer real-world images to cartoons [4, 29, 3]. However, generating animation video content is still a challenging task due to the difficulty of estimating the motion between frames.

Figure 3: Triplet samples and statistics of ATD-12K. Top: typical triplets in the training set (Disney, Hayao, Hosoda, Makoto, A1 Pictures, Kyoto Animation) and the test set (Easy, Medium, Hard). (a) The percentages of difficulty levels and motion tags (e.g., translation, rotation, scaling, deformation, speaking, walking, sporting, fetching, eating) in different categories. (b) Histograms of per-image optical flow means and standard deviations (in pixels) for ATD-12K and Adobe240.

3. ATD-12K Dataset

Meanwhile, triplets with an SSIM lower than 0.75 are also dropped to remove scene transitions from our data.
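For reference, a minimal sketch of this SSIM-based filtering is given below, assuming each triplet is a tuple of three uint8 RGB frames; the 0.75 threshold is the value stated above, while the choice to test both adjacent pairs and the helper names are our assumptions, not the authors' pipeline.

import numpy as np
from skimage.metrics import structural_similarity

def keep_triplet(frame0, frame1, frame2, threshold=0.75):
    # Drop the triplet if any adjacent pair falls below the SSIM threshold,
    # which typically indicates a scene transition rather than plain motion.
    s01 = structural_similarity(frame0, frame1, channel_axis=-1, data_range=255)
    s12 = structural_similarity(frame1, frame2, channel_axis=-1, data_range=255)
    return min(s01, s12) >= threshold

# Usage on a list of triplets: clean = [t for t in triplets if keep_triplet(*t)]
# Tiny synthetic check: three identical frames from the same "scene" pass the filter.
rng = np.random.default_rng(0)
scene = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
print(keep_triplet(scene, scene.copy(), scene.copy()))  # True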