Real-Time Lip Sync for Live 2D Animation

Deepali Aneja, University of Washington
Wilmot Li, Adobe Research
Figure 1. Real-Time Lip Sync. Our deep learning approach uses an LSTM to convert live streaming audio to discrete visemes for 2D characters.

ABSTRACT
The emergence of commercial tools for real-time performance-based 2D animation has enabled 2D characters to appear on live broadcasts and streaming platforms. A key requirement for live animation is fast and accurate lip sync that allows characters to respond naturally to other actors or the audience through the voice of a human performer. In this work, we present a deep learning based interactive system that automatically generates live lip sync for layered 2D characters using a Long Short-Term Memory (LSTM) model. Our system takes streaming audio as input and produces viseme sequences with less than 200ms of latency (including processing

characters and objects move. However, live 2D animation has recently emerged as a powerful new way to communicate and convey ideas with animated characters. In live animation, human performers control cartoon characters in real-time, allowing them to interact and improvise directly with other actors and the audience. Recent examples from major studios include Stephen Colbert interviewing cartoon guests on The Late Show [6], Homer answering phone-in questions from viewers during a segment of The Simpsons [15], Archer talking to a live audience at ComicCon [1], and the stars of animated shows (e.g., Disney's Star vs. The Forces of Evil, My Little Pony, cartoon Mr.
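To make the streaming audio-to-viseme pipeline concrete, the sketch below shows, assuming PyTorch, how a causal (unidirectional) LSTM can map per-frame audio features to discrete viseme classes while carrying its recurrent state across audio chunks. This is a minimal illustration rather than the system described above; the feature dimensionality, hidden size, and viseme count are placeholder values, not the settings used in this work.

# Minimal sketch (PyTorch assumed; not the authors' implementation) of a
# causal LSTM mapping streaming audio feature frames to viseme classes.
import torch
import torch.nn as nn

class VisemeLSTM(nn.Module):
    def __init__(self, feat_dim=26, hidden_dim=256, num_visemes=12):
        # feat_dim, hidden_dim, num_visemes are illustrative placeholders.
        super().__init__()
        # Unidirectional LSTM so inference can run causally on live audio.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_visemes)

    def forward(self, feats, state=None):
        # feats: (batch, time, feat_dim) chunk of audio features.
        out, state = self.lstm(feats, state)
        # Per-frame viseme logits, plus the state to carry into the next chunk.
        return self.head(out), state

model = VisemeLSTM().eval()
state = None
with torch.no_grad():
    # Simulate streaming: feed small chunks, reusing the LSTM state so the
    # model sees one continuous sequence while responding chunk by chunk.
    for _ in range(5):
        chunk = torch.randn(1, 10, 26)       # 10 feature frames per chunk
        logits, state = model(chunk, state)
        visemes = logits.argmax(dim=-1)      # discrete viseme id per frame

Carrying the recurrent state across chunks is what lets a unidirectional model emit a prediction for each frame as soon as it arrives, which is the property that keeps per-chunk latency low in a live setting.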