Everybody's Talkin': Let Me Talk As You Want
Linsen Song1∗   Wayne Wu2,3   Chen Qian2   Ran He1   Chen Change Loy3
1NLPR, CASIA   2SenseTime Research   3Nanyang Technological University
∗This work was done during an internship at SenseTime Research.

arXiv:2001.05201v1 [cs.CV] 15 Jan 2020

Figure 1: Audio-based video editing. Speech audio of an arbitrary speaker, extracted from any video, can be used to drive any video featuring a random speaker.

Abstract

We present a method to edit a target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. This method is unique because it is highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate the source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth region precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, therefore preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio. Some results are shown in Fig. 1. Video results are shown on our project page1.

1Project Page: https://wywu.github.io/projects/EBT/EBT.html

1. Introduction

I’m going where the sun keeps shining. Through the pouring rain. Going where the weather suits my clothes.
— Fred Neil, Everybody’s Talkin’

Video portrait editing is a highly sought-after technique in view of its wide applications, such as filmmaking, video production, and telepresence. Commercial video editing applications, such as Adobe Premiere and Apple iMovie, are resource-intensive tools: editing audio-visual content with them would often require one or more footages to be reshot. Moreover, the quality of the edited video is highly dependent on the prowess of the editor.

Audio-based approaches are an attractive option for automatic video portrait editing. Several methods [6, 62] have been proposed to animate the mouth region of a still image to follow a speech audio. The result is an animated static image rather than a video, hence sacrificing realism. Audio-driven 3D head animation [46] is an audio-based approach but aims at a different goal, namely to drive stylized 3D computer-graphics avatars rather than to generate a photo-realistic video. Suwajanakorn et al. [45] attempted to synthesize photo-realistic videos driven by audio. While impressive performance was achieved, the method assumes the source audio and target video to come from the same identity, and it is only demonstrated on the audio tracks and videos of Barack Obama. Besides, it requires long hours of single-identity data (up to 17 hours [45]) for training, using relatively controlled and high-quality shots.

In this paper, we investigate a learning-based framework that can perform many-to-many audio-to-video translation, i.e., without assuming a single identity for the source audio and the target video. We further assume that only a scarce amount of target video is available for training, e.g., at most a 15-minute footage of a person is needed. Such assumptions make our problem non-trivial: 1) Without sufficient data, especially in the absence of source video, it is challenging to learn a direct mapping from audio to video. 2) To apply the framework to arbitrary source audios and target videos, our method needs to cope with large audio-video variations between different subjects. 3) Without explicitly specifying scene geometry, materials, lighting, and dynamics, as in the case of a standard graphics rendering engine, it is hard for a learning-based framework to generate photo-realistic yet temporally coherent videos.

To overcome the aforementioned challenges, we propose to use the expression parameter space, rather than the full pixel space, as the target space for audio-to-video mapping. This facilitates the learning of a more effective mapping, since expression is semantically more relevant to the audio source than the other orthogonal spaces, such as geometry and pose. In particular, we manipulate the expression of a target face by generating a new set of parameters through a novel LSTM-based Audio-to-Expression Translation Network. The newly generated expression parameters, combined with the geometry and pose parameters of the target human portrait, allow us to reconstruct a 3D face mesh with the same identity and head pose as the target but with a new expression (i.e., lip movements) that matches the phonemes of the source audio.
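To make this two-stage idea concrete, the following is a minimal sketch of an LSTM that translates per-frame audio features into expression coefficients, followed by a generic linear-3DMM recombination that keeps the target's geometry and swaps in the audio-predicted expression. The feature dimensions, bases, and network sizes are illustrative assumptions, not the exact components of our implementation.

```python
# Minimal sketch (PyTorch), assuming generic per-frame audio features and a
# generic linear 3DMM; all dimensions and bases below are placeholders.
import torch
import torch.nn as nn

class AudioToExpression(nn.Module):
    """LSTM that translates a sequence of audio feature frames into per-frame
    3DMM expression coefficients."""
    def __init__(self, audio_dim=29, hidden_dim=256, exp_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, exp_dim)

    def forward(self, audio_feats):           # (B, T, audio_dim)
        h, _ = self.lstm(audio_feats)         # (B, T, hidden_dim)
        return self.head(h)                   # (B, T, exp_dim)

def recombine_mesh(mean_shape, id_basis, exp_basis, alpha_target, delta_audio):
    """Rebuild mesh vertices: keep the target's identity (geometry) coefficients
    but use the audio-predicted expression coefficients. The target's head pose
    (R, t) is reused unchanged when projecting the mesh back into the frame."""
    verts = mean_shape + id_basis @ alpha_target + exp_basis @ delta_audio
    return verts.reshape(-1, 3)               # (N, 3) vertices in canonical pose

# Usage with dummy tensors: 100 frames of audio features drive the expression
# of a 5000-vertex template whose identity comes from the target video.
net = AudioToExpression()
audio_feats = torch.randn(1, 100, 29)
delta = net(audio_feats)[0, 0]                # expression for the first frame
mean_shape = torch.zeros(3 * 5000)
id_basis, exp_basis = torch.randn(3 * 5000, 80), torch.randn(3 * 5000, 64)
alpha_target = torch.randn(80)                # recovered from the target frame
verts = recombine_mesh(mean_shape, id_basis, exp_basis, alpha_target, delta)
```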
We further propose an Audio ID-Removing Network that keeps the audio-to-expression translation agnostic to the identity of the source audio. Thus, the translation is robust to variations in the voices of different people across different source audios. Finally, we cast the difficult face generation problem as a face completion problem conditioned on facial landmarks. Specifically, after reconstructing a 3D face mesh with the new expression parameters, we extract the associated 2D landmarks of the mouth region and represent them as heatmaps. These heatmaps are combined with target frames in which the mouth region is masked. Taking the landmark heatmaps and the masked target frames as inputs, a video rendering network then completes the mouth region of each frame, guided by the dynamics of the landmarks.
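As an illustration of how the inputs to such a rendering network can be assembled, the sketch below rasterizes each 2D mouth landmark into a Gaussian heatmap and zeroes out a box around the mouth in the target frame before channel-wise concatenation. The heatmap width, mask margin, and channel layout are assumptions for illustration rather than our exact configuration.

```python
# Minimal sketch (NumPy): build the rendering-network input from 2D mouth
# landmarks and a target frame. Sigma, mask margin, and layout are illustrative.
import numpy as np

def landmark_heatmaps(landmarks, height, width, sigma=2.0):
    """One Gaussian heatmap per 2D landmark; output shape (K, H, W)."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for (x, y) in landmarks]
    return np.stack(maps, axis=0).astype(np.float32)

def mask_mouth(frame, landmarks, margin=10):
    """Zero out a box around the mouth landmarks in the target frame (H, W, 3)."""
    xs, ys = zip(*landmarks)
    x0, x1 = max(int(min(xs)) - margin, 0), min(int(max(xs)) + margin, frame.shape[1])
    y0, y1 = max(int(min(ys)) - margin, 0), min(int(max(ys)) + margin, frame.shape[0])
    out = frame.copy()
    out[y0:y1, x0:x1] = 0
    return out

# Usage: concatenate the masked RGB frame with the landmark heatmaps to form the
# per-frame conditioning tensor fed to the completion (rendering) network.
frame = np.random.rand(256, 256, 3).astype(np.float32)
mouth_lms = [(100 + 5 * i, 180) for i in range(20)]      # dummy mouth landmarks
cond = np.concatenate(
    [mask_mouth(frame, mouth_lms).transpose(2, 0, 1),    # (3, H, W)
     landmark_heatmaps(mouth_lms, 256, 256)], axis=0)    # (3 + K, H, W)
```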
We summarize our contributions as follows: 1) We make the first attempt at formulating an end-to-end learnable framework that supports audio-based video portrait editing. We demonstrate coherent and photo-realistic results by focusing specifically on the expression parameter space as the target space, from which source audios can be effectively translated into target videos. 2) We present an Audio ID-Removing Network that encourages an identity-agnostic audio-to-expression translation. This network allows our framework to cope with the large variations in voices that are present in arbitrary audio sources. 3) We propose a Neural Video Rendering Network based on the notion of face completion, with a masked face as input and mesh landmarks as conditions. This approach facilitates the generation of photo-realistic video for arbitrary people within one single network.

Ethical Considerations. Our method could contribute greatly towards advancements in video editing. We envisage relevant industries, such as filmmaking, video production, and telepresence, to benefit immensely from this technique. We do acknowledge the potential of such forward-looking technology being misused or abused for various malevolent purposes, e.g., aspersion, media manipulation, or dissemination of malicious propaganda. Therefore, we strongly advocate and support all safeguarding measures against such exploitative practices. We welcome the enactment and enforcement of legislation that obligates all edited videos to be distinctly labeled as such, mandates that informed consent be collected from all subjects involved in an edited video, and imposes a hefty levy on all defaulters. Being at the forefront of developing creative and innovative technologies, we strive to develop methodologies to detect edited video as a countermeasure. We also encourage the public to serve as sentinels in reporting any suspicious-looking videos to the authorities. Working in concert, we shall be able to promote cutting-edge and innovative technologies without compromising the personal interest of the general public.

2. Related Work

Audio-based Facial Animation. Driving the facial animation of a target 3D head model with source audio requires learning to associate phonemes or speech features of the source audio with visemes. Taylor et al. [47] propose to directly map phonemes of the source audio to a face rig. Afterward, many speech-driven methods have been shown to be superior to phoneme-driven methods under different 3D models, e.g., face rigs [61, 12], face meshes [25], and expression blendshapes [37]. Compared to driving a 3D head model, driving a photo-realistic portrait video is much harder, since speaker-specific appearance and head pose are crucial to the quality of the generated portrait video. Taylor et al. [46] present a sliding-window neural network that maps a speech feature window to a visual feature window encoded by active appearance model (AAM) [10] parameters.

In [48], the best mouth image of the target actor is retrieved and warped to produce a photo-realistic target mouth region. Kim et al. [27] present a method that transfers expression and pose parameters, as well as eye movement, from the source to the target actor. These methods are person-specific [27] and therefore rigid in practice, and they suffer from audio-visual misalignment [48, 27, 32], creating artifacts that lead to unrealistic results.

Deep Generative Models. Inspired by the successful application of GANs [18] to image generation [41, 31, 63, 20, 57], many methods [6, 60, 62, 38, 27] leverage GANs to generate photo-realistic talking-face images conditioned on a coarse rendering image [27] or on fused audio and image features [6, 62, 60]. Generative inpainting networks [34, 21, 29, 59] are capable of modifying image content by imposing guiding object edges or semantic maps [59, 44]. We convert talking-face generation into an inpainting problem of the mouth region, since mouth movement is primarily induced by the input audio.
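In this spirit, the completion step can be pictured as a small conditional encoder-decoder that takes the masked target frame concatenated with the mouth-landmark heatmaps (as in the earlier sketch) and predicts the full frame. The block below is a generic stand-in under assumed channel counts and depths, not the architecture of our Neural Video Rendering Network.

```python
# Minimal sketch (PyTorch): a generic encoder-decoder that completes the masked
# mouth region from a (3 + K)-channel conditioning input. Depths, channel
# counts, and losses are placeholders, not the paper's network.
import torch
import torch.nn as nn

class MouthCompletionNet(nn.Module):
    def __init__(self, in_ch=3 + 20, base=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(base * 4, base * 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, cond):                  # cond: (B, 3 + K, H, W)
        return self.dec(self.enc(cond))       # completed RGB frame, (B, 3, H, W)

# Usage with the conditioning layout from the previous sketch (one frame,
# 3 RGB channels plus 20 landmark heatmaps):
net = MouthCompletionNet(in_ch=23)
cond = torch.randn(1, 23, 256, 256)
completed = net(cond)                         # (1, 3, 256, 256)
```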