A Novel Lip Synchronization Approach for Games and Virtual Environments
Jan Krejsa, Masaryk University, Brno, Czech Republic
Fotis Liarokapis, CYENS – Centre of Excellence, Nicosia, Cyprus

Abstract—This paper presents an algorithm for the offline generation of lip-sync animation. It redefines visemes as sets of constraints on the facial articulators, such as the lips, jaw, and tongue. The algorithm was comparatively evaluated with 30 healthy participants by presenting a set of phrases delivered verbally by a virtual character. Each phrase was presented in two versions: once with the traditional lip-sync method, and once with our method. Results confirm that the suggested solution produces more natural animation than standard keyframe interpolation techniques.

Index Terms—virtual reality, computer games, lip-sync animation, user-studies

I. INTRODUCTION

The purpose of lip synchronization (or lip-sync) is to reinforce the illusion that the voice of an animated character is coming out of the character's mouth. This can help create an emotional connection to the character and improve the immersion of the viewer in a virtual environment. There are currently two leading techniques in lip synchronization: one is capturing the performance of human actors via motion capture, the other is hiring professional animators to produce the facial animations manually. Both methods deliver high-quality results, but they are time-consuming and expensive. As such, they are not suitable for applications where a large quantity of speech animation is required (e.g. role-playing games). In such applications, it is preferable to generate the majority of speech animations automatically.

Lip animation can also be generated automatically from audio input or a phonetic transcript. However, although significant advances have been made recently [1], [2], the resulting quality is still not on par with the two aforementioned approaches. In practice, the automated approach tends to be used only to generate low-quality baseline animations. Animators then refine as many animations as possible, and for the most important scenes, they often use motion capture.

When synthesizing speech animation, the speech audio we want to synchronize to is usually segmented into atomic units of sound, called phonemes. For example, the "m" sound in the word "mother" is described by the phoneme /m/. A viseme, then, is a group of phonemes that are all visually similar to each other, but distinguishable from phonemes that belong to another viseme [3]. Some phonemes are hard to distinguish visually from one another, and multiple phonemes may map to the same viseme.

This paper proposes a novel solution to improve the quality of auto-generated lip-sync animation. Despite the lower quality of its results, the automated approach still has an irreplaceable role due to its time- and cost-saving properties. Traditionally, auto-generated lip-sync animation is achieved via keyframe interpolation. First, the speech audio is segmented into short units known as visemes. For each viseme, the traditional approach defines a viseme shape: the shape of the mouth that is required to produce the given sound. These poses are used as animation keyframes placed at the onset of each viseme, and the final animation is created by interpolating between these keyframes. The visemic context, or co-articulation (i.e. how the appearance of a given viseme is influenced by the visemes that come before and after it), is usually disregarded, although solutions have been suggested; notable methods involve dominance functions [4] or neural networks [5].
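To make this baseline concrete, the following is a minimal Python sketch of the keyframe-interpolation scheme described above; it is not code from the paper. The viseme identifiers, facial parameters, and the VISEME_SHAPES table are illustrative assumptions. The point it demonstrates is that every viseme pins a complete facial pose at its onset, and the animation is a straight blend between consecutive full poses.

```python
from bisect import bisect_right

# Illustrative viseme-shape table (assumed values, not the paper's data):
# every viseme pins a complete facial pose.
VISEME_SHAPES = {
    "sil": {"jaw_open": 0.05, "mouth_wide": 0.1, "lips_closed": 0.1},
    "PP":  {"jaw_open": 0.1,  "mouth_wide": 0.3, "lips_closed": 1.0},  # /p b m/
    "aa":  {"jaw_open": 0.9,  "mouth_wide": 0.4, "lips_closed": 0.0},  # open vowel
    "oo":  {"jaw_open": 0.4,  "mouth_wide": 0.0, "lips_closed": 0.0},  # rounded vowel
}


def baseline_pose(track, t):
    """Classic keyframe interpolation of full viseme shapes.

    track -- list of (onset_time_seconds, viseme_id), sorted by strictly
             increasing onset time; each onset carries a full-pose keyframe.
    t     -- query time in seconds.
    """
    times = [onset for onset, _ in track]
    i = bisect_right(times, t) - 1
    if i < 0:
        return dict(VISEME_SHAPES[track[0][1]])    # before the first keyframe
    if i >= len(track) - 1:
        return dict(VISEME_SHAPES[track[-1][1]])   # after the last keyframe
    (t0, v0), (t1, v1) = track[i], track[i + 1]
    w = (t - t0) / (t1 - t0)
    a, b = VISEME_SHAPES[v0], VISEME_SHAPES[v1]
    # Every parameter is blended between two complete poses; no parameter
    # is ever left free to follow the surrounding visemes.
    return {k: (1.0 - w) * a[k] + w * b[k] for k in a}


# Example: the pose halfway between the /m/ ("PP") and "oo" keyframes.
track = [(0.0, "sil"), (0.2, "PP"), (0.35, "oo"), (0.7, "sil")]
print(baseline_pose(track, 0.275))
```

Because every parameter is keyed at every viseme onset, articulators that are irrelevant to a given sound (e.g. lip rounding during /m/) are still reset at each keyframe, which is the co-articulation limitation the method proposed in this work targets.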
In this work, we suggest an alternative solution for modeling co-articulation (Fig. 1). We propose to replace each viseme shape with a set of constraints on certain parameters of the face (e.g. mouth openness). Instead of specifying a single facial pose for each viseme, the constraints define a wide range of allowed poses. Since each viseme in our model constrains only some of the facial parameters (e.g. mouth width, jaw openness), the parameters that are not constrained by the given viseme are free to be driven by neighboring visemes. Effectively, this models co-articulation and leads to more fluid and natural motion.
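As an illustration of this idea, the sketch below replaces the full viseme shapes of the baseline sketch with per-parameter constraints. It is a simplified reading, not the authors' implementation: the viseme names, the parameter set, the allowed ranges, and the choice to key each parameter at the midpoint of its range are assumptions made for brevity. What it preserves is the essential behaviour described above: a parameter is keyed only where some viseme constrains it, so unconstrained parameters are carried across from neighboring visemes.

```python
from bisect import bisect_right

# Illustrative constraint tables (assumed values, not the paper's data):
# each viseme constrains only some parameters, as an allowed (lo, hi) range.
VISEME_CONSTRAINTS = {
    "sil": {"jaw_open": (0.0, 0.1), "mouth_wide": (0.0, 0.2), "lips_closed": (0.0, 0.2)},
    "PP":  {"lips_closed": (0.9, 1.0)},                        # /p b m/: only lip closure matters
    "aa":  {"jaw_open": (0.7, 1.0)},                           # open vowel: only jaw opening matters
    "oo":  {"jaw_open": (0.3, 0.6), "mouth_wide": (0.0, 0.1)}, # rounded vowel
}
PARAMS = ("jaw_open", "mouth_wide", "lips_closed")


def constrained_pose(track, t):
    """Interpolate each facial parameter over its own sparse constraint keys.

    track -- list of (onset_time_seconds, viseme_id), sorted by strictly
             increasing onset time.
    A parameter is keyed only at visemes that constrain it (using the
    midpoint of the allowed range as the key value), so parameters the
    current viseme leaves unconstrained are driven by neighboring visemes
    instead -- a crude stand-in for co-articulation.
    """
    pose = {}
    for p in PARAMS:
        keys = [(onset, sum(VISEME_CONSTRAINTS[v][p]) / 2.0)
                for onset, v in track if p in VISEME_CONSTRAINTS[v]]
        if not keys:
            pose[p] = 0.0                       # neutral value if nothing constrains p
            continue
        times = [onset for onset, _ in keys]
        i = bisect_right(times, t) - 1
        if i < 0:
            pose[p] = keys[0][1]                # before the first key: hold it
        elif i >= len(keys) - 1:
            pose[p] = keys[-1][1]               # after the last key: hold it
        else:
            (t0, v0), (t1, v1) = keys[i], keys[i + 1]
            w = (t - t0) / (t1 - t0)
            pose[p] = (1.0 - w) * v0 + w * v1   # linear blend between sparse keys
    return pose


# Example: lip rounding from "oo" is not reset by the bilabial "PP" (/m/),
# because "PP" places no key on mouth_wide or jaw_open.
track = [(0.0, "sil"), (0.2, "PP"), (0.35, "oo"), (0.7, "PP"), (0.9, "sil")]
print(constrained_pose(track, 0.7))
```

With this scheme, a bilabial viseme such as /m/ constrains only lip closure, so the rounding introduced by a neighboring /u/ can persist through it (the situation shown in Fig. 2 for the /m/ in the syllable "moo"), whereas the baseline scheme would snap the mouth back to a single generic /m/ shape.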
Our method was evaluated in a study with 30 participants. Participants were presented with a set of phrases delivered verbally by a virtual character. Each phrase was presented in two versions: once with the traditional lip-sync method, and once with our method. Participants were asked to rate the lip motion for each phrase in several aspects, such as naturalness and perceived quality.

In all measured aspects, our method significantly outperformed the traditional method. On a scale from 0 to 100%, the naturalness of our method was rated at 70% on average, compared to 56% for the traditional method. In terms of quality, our method was rated at 75%, while the traditional method received a rating of 60%. Overall, the average score of our method was 73%, a significant improvement over the 60% rating of the traditional method.

Fig. 1. An example user interface for viseme tagging.

II. RELATED WORK

There are three fundamental approaches to lip synchronization: animation by hand, facial performance capture, and automatic synthesis. Hand-made facial animation has the advantage of full control over the performance: a skilled artist can add expressiveness even beyond the limits of what an actor could do. However, animating by hand is time-consuming, and as such it may not be suitable for large-scale projects that require many hours of facial animation (such as an expansive role-playing game), especially if their budget is limited. This section therefore focuses on the other two approaches, facial performance capture and automatic synthesis.

A. Facial Performance Capture

Performance capture records motion data of a human actor and transfers it onto a virtual character [6]. By recording the performance of an actor, we can capture all the intricacies of their facial movements in great detail. This approach currently achieves the highest-quality results, but it is also the most expensive. The current industry standard in film is to track markers on the face using a helmet-mounted camera (HMC) rig [7]. Tracking precision can be an issue, requiring an animator to clean up the recorded data manually, which can be quite laborious. The resulting animations are also difficult to edit or refine [8]. The method generally requires expensive equipment and professional actors.

Since motion capture equipment is so expensive, multiple works have attempted to reduce the required cost. Cao et al. [9] presented a method that adds high-fidelity facial detail to low-resolution face tracking data. However, although the method produces plausible wrinkles, they are still a mere approximation limited by the training data, and it may deliver unexpected results when presented with a pose outside of its training data set. Ma and Deng [10] suggest a geometry-based method for capturing facial performance that includes wrinkle information while still retaining low-cost acquisition. Their method can reconstruct high-resolution facial geometry and appearance in real time and with superior accuracy.

Laine et al. [11] utilized deep learning to reduce the amount of manual clean-up and correction required to maintain high-quality output. Their pipeline involves manually cleaning up a small subset of the input video footage to train a deep neural network, which is then used to automatically process the remaining video footage at very high speed. While the training data must be recorded in stereo, the remaining footage may be monocular. In comparison with previous methods, their method delivers increased accuracy, especially in the mouth and eye regions.

B. Automatic Synthesis

Existing methods of automatic synthesis can be grouped into two categories: (1) data-driven methods and (2) procedural methods based on keyframe interpolation.

1) Data-driven Methods: Data-driven methods of lip-sync animation are generally based on either (a) concatenating short speech animation units sampled from a large data set of motion-captured speech, or (b) sampling from statistical models extracted from motion-captured data [12]–[15].

Taylor et al. [5] proposed to replace viseme shapes with animation units called dynamic visemes. These are short animation units extracted from a training data set, which can then be rearranged in a different order and blended into each other. This technique produces good results for phrases that are similar to the training data, but to consistently provide good results for any possible phrase, an exhaustive data set covering all possible sequences of phonemes would be required. Deng et al. [16] proposed a method that builds co-articulation models from motion-capture data using weight decomposition. Again, as with most other data-driven methods, it would provide the best results with a complete training data set of all possible combinations of phonemes. Deena et al. [17], [18] presented an approach based on a switching state model; one disadvantage is that it requires the entire training data set to be manually phoneme-tagged.

Fig. 2. The facial pose produced when pronouncing the phoneme /m/ in the syllable "moo".

Asadiabadi et al. [19] used deep learning to train a speaker-independent model that can reproduce the emotions of an actor. In another approach, Long Short-Term Memory (LSTM) networks were used to improve the mapping from phoneme sequences to their dynamic visemes [20], [21]. Karras et al. [2] presented a deep-learning-based method that generates lip-sync animation in real time, based on real-time audio input.

... approach, Edwards et al. [1] introduced two parameters: jaw and lip. The values of these two parameters dynamically ...