
Single-Shot Motion Completion with Transformer

Yinglin Duan*, Tianyang Shi* (NetEase Fuxi AI Lab), Zhengxia Zou* (University of Michigan, Ann Arbor), Yenan Lin, Zhehui Qian, Bohan Zhang (NetEase), Yi Yuan† (NetEase Fuxi AI Lab)
{duanyinglin, shitianyang}@corp.netease.com, [email protected], yenan [email protected], [email protected], {qianzhehui, hzzhangbohan}@corp.netease.com
*Contributed equally (names are in alphabetical order). †Corresponding author.

Abstract

Motion completion is a challenging and long-discussed problem, which is of great significance in film and game applications. For different motion completion scenarios (in-betweening, in-filling, and blending), most previous methods deal with the completion problems with case-by-case designs. In this work, we propose a simple but effective method to solve multiple motion completion problems under a unified framework and achieve a new state-of-the-art accuracy under multiple evaluation settings. Inspired by the recent great success of attention-based models, we consider the completion as a sequence-to-sequence prediction problem. Our method consists of two modules: a standard transformer encoder with self-attention that learns long-range dependencies of the input motions, and a trainable mixture embedding module that models temporal information and discriminates key-frames. Our method can run in a non-autoregressive manner and predicts multiple missing frames within a single forward propagation in real time. We finally show the effectiveness of our method in music-dance applications. Our animated results can be found on our project page https://github.com/FuxiCV/SSMCT.

Figure 1. Motion completion by our method based on input keyframes (blue ones). Our method is a unified framework that can solve multiple motion completion problems. We achieve a new state of the art on a high-quality motion completion dataset, LaFAN1 [16].

1. Introduction

Motion completion is an important and challenging problem that has been studied for a long time. It provides fundamental technical support for the authoring of 3D characters and has recently been applied successfully in film production and video games [45, 8].

In recent years, deep learning methods have greatly promoted the research progress of motion completion. With recent advances in this field, considerable manual effort can now be saved: high-quality motion can be smoothly generated from a set of historical or sparsely placed key-frames by learning over large-scale motion capture data [24, 16]. However, most previous methods handle different completion scenarios (in-betweening, in-filling, and blending) with case-by-case designs. In this paper, we propose a novel framework that unifies these scenarios. In our method, we leverage the recently popular deep learning architecture named Transformer [46], which is built on the self-attention mechanism and has been widely used in natural language processing, computer vision, and reinforcement learning [42, 11, 34]. We adopt BERT [10], a well-known transformer encoder, as our backbone, where known frames (key-frames) and unknown frames (to be completed) are fed together, in order, as input, so that all frames can be predicted in a single propagation at inference. Our method works in a non-autoregressive manner and can be easily accelerated with GPU parallelization. In comparison, most previous methods adopt recursive or auto-regressive prediction, where the motion frames need to be generated iteratively and thus lack parallelism.
We consider motion completion in the following three scenarios:

* In-betweening. Animators are required to complement the motion frame-by-frame between the past frames and a provided future keyframe [16].

* In-filling. As an extension of in-betweening, in-filling poses the characters at specific positions on the timeline [8] and complements the rest of the frames [24]. A sub-task of in-filling is temporal super-resolution, where the input motion frames are considered as keyframes with equal intervals [15].

* Blending. Blending focuses on the automatic generation of the transition between a pair of pre-defined motions and has been widely used in video games (1). For example, many games provide dance fragments for players to choreograph, and blending helps players concatenate and smooth the chosen fragments.

We also introduce a mixture embedding module that further integrates temporal knowledge into the transformer encoder. Our embedding module contains two types of learnable embeddings: a position embedding and a keyframe embedding. Position embedding is a widely studied technique in the recent transformer literature, where a set of pre-defined sinusoidal signals is usually used to introduce temporal knowledge to the transformer model [15, 16]. In our method, we further make this embedding trainable to deal with different motion completion scenarios. In addition to encoding the input frame order, we also introduce a keyframe embedding to annotate the input frames and tell the model which parts of the input are keyframes (already known) and which frames need to be predicted (unknown). Since the keyframe embedding can take different forms, our method can be easily applied to different completion scenarios regardless of how the input keyframes are arranged. Our design can also be considered a motion version of the Masked Language Model (MLM) [44], where the unknown frames are predicted by a deep bi-directional model trained with self-supervised losses [10].

The contributions of our paper are summarized as follows:

1. We investigate the capability of transformer-based models in the motion completion task. We propose a simple but efficient method to solve the motion completion problems of different application scenarios under a unified framework.

2. Our method works in a parallel prediction manner with high computational efficiency. On a single CPU desktop (i7-8700K @ 3.70GHz), our method runs in real time (40 motion sequences per second, each 30 frames long).

3. We achieve new in-betweening benchmarking accuracy under multiple evaluation settings. Results on other completion tasks (in-filling and blending) are also reported as baselines for fostering the research community (2).

(1) https://docs.unity3d.com/Packages/com.unity.[email protected]/manual/clp_blend.html
(2) We will make our model and pre-trained weights publicly available.

2. Related works

2.1. Motion completion

Motion completion is an emerging research hotspot in computer graphics and multimedia. It can be viewed as a conditional motion sequence generation problem. Different from unconditional motion synthesis tasks [40, 47, 43], which focus on generating unconstrained motion sequences by directly sampling from their posterior distribution, motion completion aims at filling in the missing frames of a temporal sequence based on a given set of keyframes.

Motion completion has a long research history, which can be traced back to the late 1980s. Early works on motion completion typically adopt inverse kinematics to generate realistic transitions between keyframes. For example, space-time constraint and search based methods were proposed in the 1980s-1990s [48, 33] to compute optimal, physically realistic trajectories. Using such techniques, transitions between different motions can be smoothly generated [38]. Probabilistic models such as maximum a posteriori methods [7, 32], Gaussian processes [47], and Markov models [29] were also introduced to motion completion tasks and were commonly used after the 2000s.

Recently, deep learning methods have greatly promoted the research and the performance of motion completion, where the recurrent neural network is the most commonly used framework of the deep learning era [51, 15]. For example, Harvey et al. introduce a framework named Recurrent Transition Networks (RTN) to learn a more complex representation of human motions with the help of LSTMs [15].
Generative adversarial learning has also been introduced to motion completion to make the output motion more realistic and natural [19]. Some non-recurrent motion completion models have been proposed very recently. Kaufmann et al. propose an end-to-end trainable convolutional autoencoder to fill in missing frames [24]. Another recent work by Harvey et al. introduces time-to-arrival embeddings and scheduled target-noise to further enhance the performance of RTN, achieving impressive completion results [16]. Different from these RNN-based or convolution-based methods, we propose a transformer-based model, which is a more unified solution to this task and can deal with arbitrary missing frames in a single-shot prediction manner.

2.2. Motion control

Motion control is a typical conditional motion generation task which is also highly related to motion completion. In motion control, the control signal comes from a pre-defined temporal sequence, e.g. a root trajectory, rather than from a set of keyframes as in motion completion.

Graph-based motion control was the most common approach in this field before the deep learning era [27, 39, 26, 1, 5]. For example, Arikan and Forsyth formulate motion generation as a randomized search of a motion graph, which allows complex motions to be edited interactively [1]. Beaudoin et al. propose a string-based motif-finding algorithm named Motion-Motif Graphs, which further considers the motif length and the number of motions in a motif [5]. Besides, many statistical methods have been proposed to avoid searching over predefined motion templates [13, 47, 32, 31]. Chai et al. propose a statistical dynamic model to generate motions with motion priors (e.g. a user-defined trajectory) and formulate constrained motion synthesis as a maximum a posteriori problem [7]. Ye et al. introduce a nonlinear probabilistic dynamic model that can handle perturbations [49]. Levine et al. propose a probabilistic motion model that learns a low-dimensional space from example motions and generates character animation based on user-specified tasks [30].

Recently, deep learning methods have become the mainstream in motion control. Popular architectures like convolutional autoencoders and recurrent neural networks are widely used for this problem [22, 21, 20, 28]. Adversarial training has also played an important role in motion control and can help generate more realistic motion sequences [4, 14, 16]. Besides, some recent approaches leverage reinforcement learning to further incorporate physical rules and improve the quality of the generated results [9, 50, 3, 37, 36, 6].

3. Methods

In this work, we formulate motion completion as a sequence-to-sequence prediction problem: the unknown motion frames are generated in a single forward propagation at inference, conditioned on the input keyframes. We choose BERT, an off-the-shelf transformer architecture [10], as our network backbone with minimal modifications, so that subsequent variants can be introduced without impediment.

3.1. Motion completion transformer

Fig. 2 shows an overview of our method. Our network consists of 1) a mixture embedding module that converts the motion into a set of sequential tokens, and 2) a standard transformer encoder that processes the sequential features. Our method supports multiple input formats, e.g. [local positions & rotations], [global positions & rotations], or [positions only]. Without loss of generality, we assume that the input has both position (x, y, z) and rotation (q0, q1, q2, q3) variables (no matter local or global); a single pose therefore contains a position coordinate matrix P ∈ R^{J×3} and a quaternion matrix Q ∈ R^{J×4}, where J is the number of joints of the input pose.

For each input pose, we first flatten P and Q into 1-D vectors p and q, and then concatenate the two vectors into a single long vector:

x = [p, q] ∈ R^{J×(3+4)}.    (1)

For the unknown input frames, we use linear interpolation to fill in their missing values along the temporal dimension before feeding their pose vectors to the model.
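For illustration, Eq. (1) and the interpolation-based initialization can be sketched as follows. This is a simplified example under our assumptions: the helper names flatten_pose and fill_unknown_frames, the 22-joint toy skeleton, and the keyframe layout are illustrative and not part of the released code.

```python
import torch

def flatten_pose(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """Eq. (1): concatenate flattened positions (J, 3) and quaternions (J, 4)
    into a single pose vector of length J * (3 + 4)."""
    return torch.cat([P.reshape(-1), Q.reshape(-1)], dim=0)

def fill_unknown_frames(x: torch.Tensor, keyframe_idx: list) -> torch.Tensor:
    """Fill non-keyframe rows of x (T, D) by linear interpolation between the
    two nearest keyframes along the temporal dimension (a simple stand-in for
    the LERP initialization described in the text)."""
    x = x.clone()
    keyframe_idx = sorted(keyframe_idx)
    for a, b in zip(keyframe_idx[:-1], keyframe_idx[1:]):
        for t in range(a + 1, b):
            w = (t - a) / (b - a)
            x[t] = (1 - w) * x[a] + w * x[b]
    return x

# Toy usage: 50 frames, 22 joints; frames 0-9 and frame 40 are keyframes.
J, T = 22, 50
poses = torch.stack([flatten_pose(torch.randn(J, 3), torch.randn(J, 4)) for _ in range(T)])
poses = fill_unknown_frames(poses, keyframe_idx=list(range(10)) + [40])
```

Frames after the last keyframe are simply left untouched in this sketch; they correspond to the "ignored" frames annotated by the mixture embedding in Sec. 3.2.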
We then use a 1-D temporal convolution layer to transform the pose vectors into a set of "language tokens":

Z = Conv1d([x_1; x_2; ...; x_T]),    (2)

where T is the length of the input sequence, Z ∈ R^{T×F} is the resulting temporal feature of the tokens, and F is the output dimension of the Conv1d layer. The convolution is performed over the joint dimension. Note that, different from previous transformer-based models in computer vision [11] that use a linear projection layer to generate the embeddings, we use a convolution layer to better capture temporal information, e.g. velocity and acceleration.

Considering that the transformer cannot know the order and the exact locations of the keyframes in the input sequence, we introduce a mixture embedding E to annotate these frames before feeding their features into the transformer. For each input configuration of keyframes, an embedding E is learned as a global variable on the training data and does not change with the pose feature Z. We represent the final annotated features Ẑ as follows:

Ẑ = [z_1 + e_1; z_2 + e_2; ...; z_T + e_T],    (3)

where z_t and e_t are the sub-vectors of the input feature Z and the mixture embedding E at time t.

Figure 2. An overview of our method. Our method consists of a standard transformer encoder, a mixture embedding layer, and input/output convolutional heads. In motion completion, the unknown input frames are first generated by linear interpolation (shown in red) before being fed to the model. Our method takes in a whole masked sequence and completes the prediction within a single forward propagation.
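A minimal sketch of Eqs. (2)-(3) is given below, assuming batch-first tensors of shape (B, T, J·7); the class name PoseTokenizer, the toy sizes, and the zero placeholder standing in for the mixture embedding of Sec. 3.2 are our assumptions.

```python
import torch
import torch.nn as nn

class PoseTokenizer(nn.Module):
    """Eq. (2): a 1-D temporal convolution that maps pose vectors (B, T, D_pose)
    to token features Z of shape (B, T, F)."""
    def __init__(self, d_pose: int, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d_pose, d_model, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Conv1d expects (B, channels, T), so move the temporal axis last and back.
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

d_pose, d_model, T = 22 * 7, 256, 50
tokenizer = PoseTokenizer(d_pose, d_model)
x = torch.randn(4, T, d_pose)        # interpolated pose vectors
z = tokenizer(x)                     # (4, T, 256) tokens
e = torch.zeros_like(z)              # placeholder for the mixture embedding E
z_hat = z + e                        # Eq. (3): annotate tokens frame-by-frame
```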
The BERT transformer we use consists of multiple encoder layers [10]. Each encoder layer contains a multi-head self-attention layer (MHSA) and a feed-forward network (FFN), with a residual connection [17] applied across the two sub-layers. The forward mapping of the transformer can be written as:

Ĥ^l = Norm(H^{l-1} + MHSA(H^{l-1})),
H^l = Norm(Ĥ^l + FFN(Ĥ^l)),    (4)

where H^l is the output of the l-th hidden layer, l = 1, ..., L are the indices of the encoder layers, and "Norm" denotes the layer normalization [2] placed at the output end of the residual connections. We use H^0 = Norm(Ẑ) as the hidden representation of the input layer.

In the multi-head self-attention layer (MHSA), a dense computation between each pair of input frames is conducted, so that very long-range temporal relations between frames can be captured. The processing of a single head can be written as:

f_att = Softmax(Q K^T / α) V,    (5)

where Q = W_q H is the query matrix, K = W_k H the key matrix, and V = W_v H the value matrix. W_q, W_k, and W_v are all learnable matrices. We follow the multi-head attention configuration in BERT and set the dimension of Q, K, and V to 1/M of that of the input H, where M is the number of heads in the attention layer. Finally, the outputs of the different heads are collected, concatenated, and projected by a matrix W_mhsa to form the output of the MHSA layer:

MHSA(H) = W_mhsa [f_att^{(1)}; f_att^{(2)}; ...; f_att^{(M)}].    (6)

The feed-forward network (FFN) consists of two linear layers and a GeLU activation [18], and processes each frame individually:

FFN(H) = W_ffn^{(2)}(GeLU(W_ffn^{(1)}(H))),    (7)

where W_ffn^{(1)} and W_ffn^{(2)} are learnable linear projection matrices. Finally, we apply another 1-D convolution layer at the output end of the transformer, and the final completion output Y can be written as:

Y = Conv1d(H^L).    (8)
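The encoder of Eqs. (4)-(8) can be sketched as below. This is an illustrative re-implementation rather than the released model: nn.MultiheadAttention stands in for MHSA with the standard scaled dot-product (the scaling constant α in Eq. (5)), the post-norm residual layout follows Eq. (4), and the layer sizes follow Sec. 3.4.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-norm encoder layer following Eq. (4)."""
    def __init__(self, d_model: int = 256, n_heads: int = 8, d_ffn: int = 512):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        h = self.norm1(h + self.mhsa(h, h, h, need_weights=False)[0])   # MHSA + residual
        return self.norm2(h + self.ffn(h))                              # FFN + residual

class CompletionHead(nn.Module):
    """Eq. (8): a final 1-D convolution maps the last hidden states back to pose vectors."""
    def __init__(self, d_model: int, d_pose: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_pose, kernel_size, padding=kernel_size // 2)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.conv(h.transpose(1, 2)).transpose(1, 2)

layers = nn.ModuleList([EncoderLayer() for _ in range(8)])
head = CompletionHead(256, 22 * 7)
h = torch.randn(4, 50, 256)      # Z_hat from the tokenizer plus the mixture embedding
for layer in layers:
    h = layer(h)
y = head(h)                      # (4, 50, 154) completed pose vectors
```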

3.2. Mixture embeddings

The mixture embedding E we use consists of a positional embedding E_pos ∈ R^{T×F} and a keyframe embedding E_kf ∈ R^{T×F}, where T is the length of the temporal sequence and F is the input feature dimension of the transformer. Fig. 3 gives an illustration of the mixture embedding module.

The position embedding E_pos is a matrix that contains T sub-vectors, one for each time step:

E_pos = [e_pos^1, e_pos^2, ..., e_pos^T].    (9)

The keyframe embeddings E_kf are selected from a learnable dictionary D = {ê_0, ê_1, ê_2} that contains three types of embedding vectors, which annotate keyframes, unknown frames, and ignored frames, respectively (keyframe = 0, unknown = 1, ignored = 2). These frame types can be configured in any form according to the completion task and scenario (e.g., in-betweening, in-filling, or blending). The keyframe embeddings are written as follows:

E_kf = [e_kf^1, e_kf^2, ..., e_kf^T],    (10)

where e_kf^m ∈ {ê_0, ê_1, ê_2}. Finally, the two types of embeddings are mixed by adding them together position-by-position:

E = E_pos + E_kf.    (11)

It is worth noting that our keyframe embedding is not limited to the three configurations shown in Fig. 3; it can take any format with practical meaning. As a special case, if the keyframes are randomly specified, our keyframe embedding reduces to the random token mask of the original BERT paper [10].

Figure 3. An illustration of the mixture embedding module. We design a position embedding and a keyframe embedding, where the former encodes position information and the latter annotates whether a frame is a keyframe or not. (Blue dashed lines represent ignored frames that exceed the prediction range.)
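A possible realization of Eqs. (9)-(11) is sketched below, assuming both tables are stored as nn.Embedding layers; the constant names and the in-betweening example layout (10 past keyframes plus one future keyframe) are our choices.

```python
import torch
import torch.nn as nn

KEYFRAME, UNKNOWN, IGNORED = 0, 1, 2   # frame types used by the keyframe embedding

class MixtureEmbedding(nn.Module):
    """Eqs. (9)-(11): a learnable per-time-step position embedding plus a keyframe
    embedding looked up from a 3-entry dictionary, summed position-by-position."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_model)   # E_pos
        self.kf = nn.Embedding(3, d_model)          # dictionary {e0, e1, e2}

    def forward(self, frame_types: torch.Tensor) -> torch.Tensor:
        # frame_types: (B, T) integer tensor of KEYFRAME / UNKNOWN / IGNORED labels.
        t = torch.arange(frame_types.size(1), device=frame_types.device)
        return self.pos(t).unsqueeze(0) + self.kf(frame_types)   # E = E_pos + E_kf

# In-betweening example: 10 past keyframes, 30 unknown frames, 1 future keyframe.
types = torch.full((1, 41), UNKNOWN, dtype=torch.long)
types[:, :10] = KEYFRAME
types[:, 40] = KEYFRAME
emb = MixtureEmbedding(max_len=65, d_model=256)
E = emb(types)    # (1, 41, 256), added to the tokens Z as in Eq. (3)
```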

3.3. Loss function

We train our transformer and the embeddings with multi-task regression losses. Given a set of predicted motions and their ground truth, we design the pose reconstruction loss L_rec as follows:

L_rec = (1 / (N T)) \sum_{n=1}^{N} \sum_{t=1}^{T} ( \|p_n^t - \hat{p}_n^t\|_1 + \|q_n^t - \hat{q}_n^t\|_1 ),    (12)

where p_n^t and q_n^t are the position coordinates and rotation quaternions of predicted motion sequence n at time step t, and \hat{p}_n^t and \hat{q}_n^t are their ground truth. T and N are the length of a motion sequence and the total number of training sequences in the dataset. Note that the above loss can be used in both global and local coordinate systems; when applied in the local one, p_n^t can be replaced by r_n^t, i.e. the root coordinates, since the T-pose (or offsets) is a constant vector for the same character.

In addition to the pose reconstruction loss, we introduce two kinematics losses to improve the results in different coordinate systems.

1) Forward Kinematics (FK) loss L_FK. We follow Harvey et al. [16] and add an FK loss to our multi-task training loss. The main idea behind the FK loss is to compute the global position coordinates from the local ones with forward kinematics, which effectively re-weights the local reconstruction loss across joints:

L_FK = \|FK(r, q_local) - p_global\|_1.    (13)

2) Inverse Kinematics (IK) loss L_IK. We also apply an IK loss to constrain the T-pose when working in the global coordinate system. We first compute the local positions from the global ones with inverse kinematics, and then compare the offsets between the inverse output and the original input:

L_IK = \|IK(p_global, q_global) - b\|_1,    (14)

where b represents the offset vector. We remove the root coordinates and keep only the predicted offsets when computing the IK loss.

Our final training loss is written as follows:

L = α_rec L_rec + α_K L_K,    (15)

where α_rec and α_K are coefficients that balance the loss terms, and L_K is the kinematics loss in the global system (L_IK) or in the local system (L_FK).
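A hedged sketch of the training objective is given below: Eq. (12) as per-frame L1 norms averaged over sequences and time steps, combined with a kinematics term as in Eq. (15). The FK/IK computations of Eqs. (13)-(14) depend on the skeleton definition and are therefore passed in here as an already-evaluated scalar; the function names are ours.

```python
import torch

def reconstruction_loss(p, q, p_gt, q_gt):
    """Eq. (12): per-frame L1 norms of position and quaternion errors,
    averaged over the N sequences and T time steps.
    Shapes: (N, T, J, 3) for positions, (N, T, J, 4) for quaternions."""
    pos_term = (p - p_gt).abs().sum(dim=(-1, -2)).mean()
    rot_term = (q - q_gt).abs().sum(dim=(-1, -2)).mean()
    return pos_term + rot_term

def total_loss(p, q, p_gt, q_gt, kinematics_loss, alpha_rec=1.0, alpha_k=0.01):
    """Eq. (15): weighted sum of the reconstruction loss and a kinematics loss
    (FK loss in local coordinates, Eq. (13), or IK loss in global coordinates,
    Eq. (14)); `kinematics_loss` is a pre-computed scalar supplied by the caller."""
    return alpha_rec * reconstruction_loss(p, q, p_gt, q_gt) + alpha_k * kinematics_loss

# Toy usage with random tensors.
p, p_gt = torch.randn(2, 30, 22, 3), torch.randn(2, 30, 22, 3)
q, q_gt = torch.randn(2, 30, 22, 4), torch.randn(2, 30, 22, 4)
loss = total_loss(p, q, p_gt, q_gt, kinematics_loss=torch.tensor(0.0))
```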

3.4. Implementation details

We adopt BERT [10] as the backbone of our transformer, with 8 encoder layers. In each encoder layer, we set the number of attention heads to M = 8. For the input and output Conv1d layers, the kernel size is set to 3 and the padding to 1. We set the dimension of the feature embedding in the MHSA layers to 256 and the dimension in the FFN layers to 512. In our training loss, we set α_rec = 1.0 and α_K = 0.01. Considering that the quaternions q ∈ [0, 1] while the position coordinates p span a much larger range, we scale the localization loss and the rotation loss to the same order of magnitude.

We train our network using the Adam optimizer [25] with a maximum learning rate of 10^{-3}. The whole framework is implemented with PyTorch [35]. For more detailed training configurations, please refer to the experimental section.
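For illustration, the hyper-parameters above can be wired into PyTorch's stock transformer encoder as a stand-in for the BERT backbone; this is a sketch under our assumptions (e.g. a 22-joint skeleton), not the released implementation.

```python
import torch
import torch.nn as nn

# Sec. 3.4 hyper-parameters.
d_pose, d_model, d_ffn, n_layers, n_heads = 22 * 7, 256, 512, 8, 8

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                               dim_feedforward=d_ffn, activation="gelu",
                               batch_first=True),
    num_layers=n_layers,
)
in_conv = nn.Conv1d(d_pose, d_model, kernel_size=3, padding=1)    # input head
out_conv = nn.Conv1d(d_model, d_pose, kernel_size=3, padding=1)   # output head

params = list(in_conv.parameters()) + list(backbone.parameters()) + list(out_conv.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)   # maximum learning rate 10^-3
```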

4. Experiments

4.1. Datasets and motion completion tasks

In our experiments, we evaluate our method across three different motion completion tasks:

1. In-betweening on LaFAN1 [16]: LaFAN1 is a public high-quality general motion dataset introduced by Harvey et al. from Ubisoft. In the in-betweening completion task, given the past 10 keyframes and another future keyframe, we aim to predict the motion of the remaining frames.

2. In-filling on Anidance [41]: Anidance is a public music-dance dataset proposed by Tang et al. [41]. We test our method on this dataset for the in-filling task, where equally spaced keyframes are given.

3. Blending on our dance dataset: We collect a new dance movement dataset, which contains high-quality dance movements performed by senior dancers and is more challenging than the previous ones. We evaluate on this dataset to explore the potential of our method for dance applications in game environments.

To build our dataset, we follow the classic choreography theory of Doris Humphrey [23] and define dance phrases as our basic movement unit. We invited four senior dancers to perform five types of dance movements (including Jazz dance, Street dance, J-pop dance, Indian dance, and Uygur dance). We used motion capture devices (Vicon V16 cameras) to record the dance movements at 30Hz. In total, more than 130,000 frames are recorded in our dataset. For convenience, we re-target the dance movements onto a standard character released by Harvey et al. [16].

Table 1. Experimental results on the LaFAN1 dataset. A lower score indicates better performance. (*Note that for a fair comparison, the T-pose of our global results has been replaced by a standard one in the local coordinate system.)

                              | L2Q                 | L2P                 | NPSS
Length                        | 5     15    30      | 5     15    30      | 5       15      30
Zero-Vel                      | 0.56  1.10  1.51    | 1.52  3.69  6.60    | 0.0053  0.0522  0.2318
Interp                        | 0.22  0.62  0.98    | 0.37  1.25  2.32    | 0.0023  0.0391  0.2013
ERD-QV [16]                   | 0.17  0.42  0.69    | 0.23  0.65  1.28    | 0.0020  0.0258  0.1328
Ours (local w/o FK)           | 0.18  0.47  0.74    | 0.27  0.82  1.46    | 0.0020  0.0307  0.1487
Ours (local)                  | 0.17  0.44  0.71    | 0.23  0.74  1.37    | 0.0019  0.0291  0.1430
Ours (global w/o ME & IK)     | 0.16  0.37  0.63    | 0.24  0.61  1.16    | 0.0018  0.0243  0.1284
Ours (global w/o IK)          | 0.14  0.36  0.61    | 0.21  0.57  1.11    | 0.0016  0.0238  0.1241
Ours* (global-full)           | 0.14  0.36  0.61    | 0.22  0.56  1.10    | 0.0016  0.0234  0.1222

4.2. Metrics

We follow Harvey et al. [16] and use L2Q, L2P, and NPSS as our evaluation metrics. L2Q is the average L2 distance between the global quaternions of the predicted motions and their ground truth. Similarly, L2P is the average L2 distance between the global positions (3). NPSS, proposed by Gopalakrishnan et al. [12], is a variant of L2Q that computes the Normalized Power Spectrum Similarity and is based on the angular frequency distance between the prediction and the ground truth.

(3) In Harvey et al.'s implementation, the predicted positions and their ground truth are normalized by the mean and std of the training set. We also follow this setting for a fair comparison.

Note that when we generate results in the global coordinate system, we replace the T-pose of these results with the standard one under the local coordinate system (by the simplest IK, i.e. a global-to-local coordinate transformation) for a fair comparison. We found that this operation may slightly reduce our accuracy; we give more discussion of this interesting observation in the Discussion section.
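Simplified versions of L2Q and L2P are sketched below for reference; NPSS is omitted, the exact benchmark implementation of Harvey et al. [16] may aggregate errors differently, and the normalization statistics are assumed to be pre-computed from the training set.

```python
import torch

def l2q(q_pred: torch.Tensor, q_gt: torch.Tensor) -> torch.Tensor:
    """Simplified L2Q: mean L2 distance between predicted and ground-truth
    global quaternions, shapes (N, T, J, 4)."""
    return (q_pred - q_gt).norm(dim=-1).mean()

def l2p(p_pred: torch.Tensor, p_gt: torch.Tensor, mean, std) -> torch.Tensor:
    """Simplified L2P: mean L2 distance between global positions, shapes (N, T, J, 3),
    normalized by training-set statistics as in Harvey et al. [16]."""
    return (((p_pred - mean) / std) - ((p_gt - mean) / std)).norm(dim=-1).mean()

# Toy usage; the mean/std here are placeholders for the training-set statistics.
q_pred, q_gt = torch.randn(2, 30, 22, 4), torch.randn(2, 30, 22, 4)
p_pred, p_gt = torch.randn(2, 30, 22, 3), torch.randn(2, 30, 22, 3)
print(l2q(q_pred, q_gt), l2p(p_pred, p_gt, p_gt.mean(), p_gt.std()))
```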
4.3. Motion in-betweening

We evaluate our method on the LaFAN1 dataset [16] for the motion in-betweening task. This dataset contains 496,672 frames performed by 5 motion subjects and recorded by motion capture at 30Hz. The training set and the test set are clearly separated, where the test set contains motions from subject No. 5 only. Since the originally captured motions are very long sequences, motion windows are introduced on this dataset, where the window width is set to 50 (65) frames and the offset to 20 (25) frames for training (test). In total, there are 20,212 windows for training and 2,232 windows for testing. We train our model on LaFAN1 for 1,000 epochs from random initialization. We set the maximum learning rate to 10^{-3} and decay the learning rate by 0.75 every 200 epochs. We further adopt a warm-up strategy for the first 50 training epochs, where the learning rate increases gradually from 0 to the maximum. Since our method can take in arbitrary inputs, we set the transition length of in-betweening to 5~39 during training (the maximum possible length is 39). For the unknown frames, before feeding them to our model, we interpolate them from the two nearest keyframes by linear interpolation (LERP) for positions and spherical linear interpolation (SLERP) for rotations.
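The SLERP used for this rotation initialization can be sketched as follows; this is a generic, convention-agnostic quaternion SLERP, and the eps handling, short-arc flip, and LERP fallback are our choices.

```python
import torch

def slerp(q0: torch.Tensor, q1: torch.Tensor, w: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between unit quaternions q0 and q1 (shape (..., 4))
    at weight w in [0, 1]; falls back to LERP when the quaternions are nearly parallel."""
    dot = (q0 * q1).sum(dim=-1, keepdim=True)
    q1 = torch.where(dot < 0, -q1, q1)            # take the shorter arc
    dot = dot.abs().clamp(max=1.0)
    theta = torch.acos(dot)
    sin_theta = torch.sin(theta)
    lerp = (1 - w) * q0 + w * q1
    slerped = (torch.sin((1 - w) * theta) * q0 + torch.sin(w * theta) * q1) / sin_theta.clamp(min=eps)
    out = torch.where(sin_theta < 1e-4, lerp, slerped)
    return out / out.norm(dim=-1, keepdim=True).clamp(min=eps)

# Initialize an unknown frame halfway between two keyframe rotations.
q_a = torch.tensor([1.0, 0.0, 0.0, 0.0])
q_b = torch.tensor([0.7071, 0.7071, 0.0, 0.0])
q_mid = slerp(q_a, q_b, 0.5)
```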

When training stops, we evaluate our method on the test set of LaFAN1. Tab. 1 shows the evaluation results, where interpolation and zero-velocity are used as naive baselines [16]. During evaluation, only the first 10 keyframes and another keyframe at frame 10+L+1 are given, where L is a predefined transition length. We keep the same configuration as Harvey et al., with transition lengths set to 5, 15, and 30, and evaluate our method under both local and global coordinate systems.

Tab. 1 shows that our method achieves high accuracy on the LaFAN1 dataset even when the transition length is 5. We can also see a noticeable accuracy improvement when switching the generation mode from local to global. This may be due to the accumulated rotation errors in local coordinates, which the FK loss of the previous method was proposed to reduce [16]. Our results suggest that motion completion can be better solved in the global coordinate system, although the T-pose predicted in the global system may not be accurate enough (it can be further improved by the IK loss). We further evaluate the Mixture Embedding (ME) and the IK loss used in our method: both strategies significantly improve the accuracy in all evaluation settings (L = 5, 15, and 30).

Besides, Tab. 2 indicates that our method achieves a very high inference speed on a CPU device and can even run in real time (< 0.033s), which benefits from the non-autoregressive design.

Table 2. Speed comparison. CPU inference times are recorded for two batch sizes (1 and 10 sequences), with the in-betweening length set to 30 frames (i.e. 1 second).

Method        | 1 x 30   | 10 x 30  | CPU info
ERD-QV [16]   | 0.31s    | 0.40s    | E5-1650 @ 3.20GHz
Ours          | 0.025s   | 0.083s   | i7-8700K @ 3.70GHz

4.4. Dance in-filling

Next, we evaluate our method on the Anidance dataset [41]. In this dataset, four types of dance movements (Cha-cha, Tango, Rumba, and Waltz) are captured in the global coordinate system. The dataset was originally designed for music-to-dance generation and contains 61 independent dance fragments with 101,390 frames.

We apply similar evaluation settings on this dataset, where the time window is set to 128 frames and the offset to 64. 20% of the dance fragments are randomly selected as the test set, giving 1,117 sequences for training and 323 sequences for testing. We train our model on this dataset for 3,000 epochs, and the rest of the configuration is kept the same as Sec. 4.3. We set the interval between keyframes to 5~30 for in-filling, which means that only 5 keyframes are given to the model when the interval is 30, and the initial transitions between keyframes are again interpolated by LERP.

Tab. 3 shows the evaluation results of our method on the test set of Anidance, with transition lengths set to 5, 15, and 30. We evaluate our method under the global coordinate system since the Anidance dataset contains global positions only. Similar to the results on the in-betweening task, our method achieves high accuracy on long-term completion and also outperforms LERP on short-term completion. Our method also handles in-the-wild input (random keyframes) very well, as shown in Fig. 4 and 5, where it completes much more meaningful dance movements than LERP.

Table 3. In-filling results on the Anidance dataset [41] (a lower score indicates better performance).

             | L2P
Length       | 5      15     30
Zero-Vel     | 2.34   5.12   6.73
Interp       | 0.94   3.24   4.68
Ours (full)  | 0.84   1.46   1.64

Figure 4. Our transformer-based in-filling results and linear-interpolation results on the Anidance test set. The first row is the ground truth; in the remaining rows, red skeletons are predicted and black ones are input keyframes.

Figure 5. Our transformer-based in-filling results and linear-interpolation results on the Anidance test set (in this experiment, keyframes are randomly chosen from the test set in a random order to simulate an in-the-wild scenario).
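For reference, the keyframe layouts of the two evaluation protocols above can be written as simple index sets; the function names and the 0-based indexing are our assumptions.

```python
def inbetweening_keyframes(transition: int) -> list:
    """Sec. 4.3 protocol: the first 10 frames are keyframes, plus the single
    future keyframe at frame 10 + transition + 1 (0-indexed: 10 + transition)."""
    return list(range(10)) + [10 + transition]

def infilling_keyframes(seq_len: int, interval: int) -> list:
    """Sec. 4.4 protocol: equally spaced keyframes with a fixed interval."""
    return list(range(0, seq_len, interval))

# Examples: a 30-frame transition, and a 128-frame clip with keyframes every 30 frames.
print(inbetweening_keyframes(30))     # [0, 1, ..., 9, 40]
print(infilling_keyframes(128, 30))   # [0, 30, 60, 90, 120] -> 5 keyframes
```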

4.5. Dance blending

The above experiments show that our method applies very well to current public datasets. To evaluate it further, we test our method on a very challenging dataset that we built, containing more complex and diverse dance movements. For this dataset, the time window is set to 64 and the offset to 32, which yields 3,220 sequences in the training set and 400 sequences in the test set. We keep the same configuration as Sec. 4.3 but use the pre-trained weights from Sec. 4.3 for initialization. We set the blending window to 5~32 frames; for example, when the window is 32, the first and last 16 frames are given as keyframes and the rest are masked. Tab. 4 shows our evaluation results with window widths of 8, 16, and 32. Fig. 6 shows our results with window width 16, which is a commonly used setting (~0.5s) in auto-blending. Our method still achieves very high-quality results on this challenging dataset.

Table 4. Blending results on our new dance dataset (a lower score indicates better performance).

                              | L2Q                 | L2P                 | NPSS
Length                        | 8     16    32      | 8     16    32      | 8       16      32
Zero-Vel                      | 2.17  2.67  3.22    | 3.68  5.15  7.52    | 0.2061  0.6004  1.7998
Interp                        | 2.00  2.55  3.14    | 1.84  2.87  4.19    | 0.1948  0.5781  1.7218
Ours (global-native-Tpose)    | 1.62  2.03  2.48    | 1.55  2.32  3.29    | 0.1906  0.5438  1.4758
Ours (global-standard-Tpose)  | 1.62  2.03  2.48    | 1.71  2.46  3.45    | 0.1906  0.5438  1.4758

Figure 6. Results of different blending methods on our dataset with window width = 16. Our results are much closer to the ground truth.

4.6. Discussion

Our method can work in both global and local coordinate systems. When directly predicting poses in the global coordinate system, the bone lengths may not be well controlled, which matters especially for games that use relative coordinate systems (e.g., local root and rotations), although global prediction may bring higher accuracy. For example, in the last two rows of Tab. 4 we evaluate our results using the native predicted T-pose (global-native-Tpose) and the standard T-pose criterion (global-standard-Tpose), respectively. The global-native-Tpose results are slightly closer to the ground truth but are hard to apply directly in local-coordinate applications. We also notice that the global rotation may show discontinuities when the dance is very fast (caused by SLERP). This is not a serious problem on the LaFAN1 dataset, since the motions there are slow; we will further investigate this interesting issue.
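As a concrete illustration of the bone-length concern raised above, one could run a simple diagnostic like the following on globally predicted poses; the `parents` array describing the skeleton hierarchy is a hypothetical input, not something defined in this paper.

```python
import torch

def bone_lengths(p_global: torch.Tensor, parents: list) -> torch.Tensor:
    """Per-frame bone lengths from global joint positions (T, J, 3); parents[j]
    is the parent joint index of joint j (-1 for the root). A large variation over
    time indicates that predicted global positions do not preserve the skeleton."""
    lengths = []
    for j, par in enumerate(parents):
        if par >= 0:
            lengths.append((p_global[:, j] - p_global[:, par]).norm(dim=-1))
    return torch.stack(lengths, dim=-1)   # (T, num_bones)

# Toy usage with a 3-joint chain.
parents = [-1, 0, 1]
p = torch.randn(30, 3, 3)
drift = bone_lengths(p, parents).std(dim=0)   # per-bone length deviation over time
```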
5. Conclusion

In this paper, we propose a simple but effective method to solve motion completion problems under a unified framework. A standard transformer encoder is introduced to handle arbitrary sequence inputs, and a mixture embedding is introduced to better encode the temporal information of multiple input types. Our method predicts multiple missing frames in a single forward propagation at inference time rather than running in an auto-regressive manner. Experimental results show that our method applies well to different motion completion modes, including in-betweening, in-filling, and blending, and achieves a new state-of-the-art accuracy on the LaFAN1 dataset.

References

[1] Okan Arikan and David A Forsyth. Interactive motion generation from examples. ACM Transactions on Graphics (TOG), 21(3):483–490, 2002.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Nir Baram, Oron Anschel, and Shie Mannor. Model-based adversarial imitation learning. arXiv preprint arXiv:1612.02179, 2016.
[4] Emad Barsoum, John Kender, and Zicheng Liu. Hp-gan: Probabilistic 3d human motion prediction via gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1418–1427, 2018.
[5] Philippe Beaudoin, Stelian Coros, Michiel van de Panne, and Pierre Poulin. Motion-motif graphs. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 117–126, 2008.
[6] Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. Drecon: Data-driven responsive control of physics-based characters. ACM Transactions on Graphics (TOG), 38(6):1–11, 2019.
[7] Jinxiang Chai and Jessica K Hodgins. Constraint-based motion optimization using a statistical dynamic model. In ACM SIGGRAPH 2007 Papers, pages 8–es, 2007.
[8] Loïc Ciccone, Cengiz Öztireli, and Robert W Sumner. Tangent-space optimization for interactive animation control. ACM Transactions on Graphics (TOG), 38(4):1–10, 2019.
[9] Stelian Coros, Philippe Beaudoin, and Michiel Van de Panne. Robust task-based control policies for physics-based characters. In ACM SIGGRAPH Asia 2009 Papers, pages 1–9, 2009.
[10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[12] Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G Ororbia. A neural temporal model for human motion prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12116–12125, 2019.
[13] Keith Grochow, Steven L Martin, Aaron Hertzmann, and Zoran Popović. Style-based inverse kinematics. In ACM SIGGRAPH 2004 Papers, pages 522–531, 2004.
[14] Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José MF Moura. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 786–803, 2018.
[15] Félix G Harvey and Christopher Pal. Recurrent transition networks for character locomotion. In SIGGRAPH Asia 2018 Technical Briefs, pages 1–4, 2018.
[16] Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
[19] Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE International Conference on Computer Vision, pages 7134–7143, 2019.
[20] Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017.
[21] Daniel Holden, Jun Saito, and Taku Komura. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG), 35(4):1–11, 2016.
[22] Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs, pages 1–4, 2015.
[23] Doris Humphrey. The art of making dances. Dance Horizons, 1959.
[24] Manuel Kaufmann, Emre Aksan, Jie Song, Fabrizio Pece, Remo Ziegler, and Otmar Hilliges. Convolutional autoencoders for human motion infilling. In 8th International Conference on 3D Vision (3DV 2020) (virtual), page 263, 2020.
[25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. Motion graphs. In ACM SIGGRAPH 2008 Classes, pages 1–10, 2008.
[27] Jehee Lee, Jinxiang Chai, Paul SA Reitsma, Jessica K Hodgins, and Nancy S Pollard. Interactive control of avatars animated with human motion data. In Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, pages 491–500, 2002.
[28] Kyungho Lee, Seyoung Lee, and Jehee Lee. Interactive character animation by learning multi-objective control. ACM Transactions on Graphics (TOG), 37(6):1–10, 2018.
[29] Andreas M Lehrmann, Peter V Gehler, and Sebastian Nowozin. Efficient nonlinear markov models for human motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1314–1321, 2014.
[30] Sergey Levine, Jack M Wang, Alexis Haraux, Zoran Popović, and Vladlen Koltun. Continuous character control with low-dimensional embeddings. ACM Transactions on Graphics (TOG), 31(4):1–10, 2012.
[31] Jianyuan Min and Jinxiang Chai. Motion graphs++: A compact generative model for semantic motion analysis and synthesis. ACM Transactions on Graphics (TOG), 31(6):1–12, 2012.
[32] Jianyuan Min, Yen-Lin Chen, and Jinxiang Chai. Interactive generation of human animation with deformable motion models. ACM Transactions on Graphics (TOG), 29(1):1–12, 2009.
[33] J Thomas Ngo and Joe Marks. Spacetime constraints revisited. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, pages 343–350, 1993.
[34] Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphael Lopez Kaufman, Aidan Clark, Seb Noury, et al. Stabilizing transformers for reinforcement learning. In International Conference on Machine Learning, pages 7487–7498. PMLR, 2020.
[35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[36] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.
[37] Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017.
[38] Charles Rose, Brian Guenter, Bobby Bodenheimer, and Michael F Cohen. Efficient generation of motion transitions using spacetime constraints. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 147–154, 1996.
[39] Alla Safonova and Jessica K Hodgins. Construction and optimal search of interpolated motion graphs. In ACM SIGGRAPH 2007 Papers, pages 106–es, 2007.
[40] Hedvig Sidenbladh, Michael J Black, and Leonid Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In European Conference on Computer Vision, pages 784–800. Springer, 2002.
[41] Taoran Tang, Hanyang Mao, and Jia Jia. Anidance: Real-time dance motion synthesize to the song. In Proceedings of the 26th ACM International Conference on Multimedia, pages 1237–1239, 2018.
[42] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020.
[43] Graham W Taylor, Geoffrey E Hinton, and Sam T Roweis. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems, pages 1345–1352, 2007.
[44] Wilson L Taylor. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[45] Adrian Thomas. Integrated graphic and computer modelling. Springer Science & Business Media, 2009.
[46] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[47] Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2007.
[48] Andrew Witkin and Michael Kass. Spacetime constraints. ACM SIGGRAPH Computer Graphics, 22(4):159–168, 1988.
[49] Yuting Ye and C Karen Liu. Synthesis of responsive motion using a dynamic model. In Computer Graphics Forum, volume 29, pages 555–562. Wiley Online Library, 2010.
[50] KangKang Yin, Kevin Loken, and Michiel Van de Panne. Simbicon: Simple biped locomotion control. ACM Transactions on Graphics (TOG), 26(3):105–es, 2007.
[51] Xinyi Zhang and Michiel van de Panne. Data-driven autocompletion for keyframe animation. In Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games, pages 1–11, 2018.