US006778252B2
(12) United States Patent          Moulton et al.
(10) Patent No.: US 6,778,252 B2
(45) Date of Patent: Aug. 17, 2004

(54) FILM LANGUAGE

(75) Inventors: William Scott Moulton, Kentfield, CA (US); Steven Wolff, Woodacre, CA (US); Rod Schumacher, Los Angeles, CA (US); Andrew Bryant, San Diego, CA (US); Marcy Hamilton, Los Angeles, CA (US); Strath Hamilton, Los Angeles, CA (US); Dana Taschner, Sunset Beach, CA (US)

(73) Assignee: Film Language, Los Angeles, CA (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 53 days.

(21) Appl. No.: 10/027,191
(22) Filed: Dec. 20, 2001
(65) Prior Publication Data: US 2002/0097380 A1, Jul. 25, 2002

Related U.S. Application Data
(60) Provisional application No. 60/257,660, filed on Dec. 22, 2000.

(51) Int. Cl.7 .................................. G03B 31/00
(52) U.S. Cl. .................... 352/12; 352/5; 352/23; 352/25
(58) Field of Search .................... 352/5, 12, 23, 25

(56) References Cited

U.S. PATENT DOCUMENTS
5,502,790 A * 3/1996 Yi ............................ 704/256
[one entry illegible in source]
5,630,017 A   5/1997 Gasper et al.
5,657,426 A   8/1997 Waters et al. ................ 704/276
5,729,694 A   3/1998 Holzrichter et al.
5,818,461 A  10/1998 Rouet et al.
5,826,234 A  10/1998 Lyberg
5,880,788 A   3/1999 Bregler
5,884,267 A   3/1999 Goldenthal et al.
6,097,381 A   8/2000 Scott et al.

OTHER PUBLICATIONS
Bregler et al., "Video Rewrite: Driving Visual Speech with Audio," ACM SIGGRAPH 97, Interval Research Corporation, No. 97, (Mar. 8, 1997).
Ezzat et al., "MikeTalk: A Talking Facial Display Based on Morphing Visemes," Proceedings of Computer Animation Conference, (Jun. 8, 1998).
Ezzat et al., "Visual Speech Synthesis by Morphing Visemes," A.I. Memo, MIT, No. 165, (Mar. 8, 1999).
Brand et al., "Voice-Driven Animation," TR-98-20, Mitsubishi Electric Research Laboratory, (Mar. 8, 1998).
Burnett et al., "Direct and Indirect Measures of Speech Articulator Motions Using Low Power EM Sensors," XIV International Congress of Phonetic Sciences, Lawrence Livermore National Laboratory, (Mar. 8, 1999).
(List continued on next page.)

Primary Examiner: Rodney Fuller
(74) Attorney, Agent, or Firm: Sanford Astor

(57) ABSTRACT

A method for modifying an audio visual recording originally produced with an original audio track of an original speaker, using a second audio dub track of a second speaker, comprising analyzing the original audio track to convert it into a continuous time-coded facial and acoustic symbol stream to identify corresponding visual facial motions of the original speaker, to create continuous data sets of facial motion corresponding to speech utterance states and transformations; storing these continuous data sets in a database; analyzing the second audio dub track to convert it to a continuous time-coded facial and acoustic symbol stream; and using the second audio dub track's continuous time-coded facial and acoustic symbol stream to animate the original speaker's face, synchronized to the second audio dub track, to create natural continuous facial speech expression by the original speaker of the second dub audio track.

34 Claims, 2 Drawing Sheets

[Front page representative drawing: a reproduction of the Figure 1 flowchart, Sheet 1 of 2.]

US 6,778,252 B2    Page 2

OTHER PUBLICATIONS (continued)
Holzrichter et al., "Human Speech Articulator Measurements Using Low Power, 2 GHz Homodyne Sensors," 24th International Conference of Infrared and Millimeter Waves, Lawrence Livermore National Laboratory, (Mar. 8, 1999).
Burnett et al., "The Use of Glottal Electromagnetic Micropower Sensors (GEMS) in Determining a Voiced Excitation Function," 138th Meeting of the Acoustical Society of America, Lawrence Livermore National Laboratory, (Mar. 8, 1999).
Pentland et al., "View-Based and Modular Eigenspaces for Face Recognition," MIT Media Laboratory, IEEE Conference on Computer Vision & Pattern Recognition, (Mar. 8, 1994).

* cited by examiner

U.S. Patent    Aug. 17, 2004    Sheet 1 of 2    US 6,778,252 B2

[Figure 1: block diagram of the audio-visual dubbing system. The recoverable block labels describe the pipeline: identify the A/V speech passage to be used for the voice dub; acquire an emotional range capture, from closed mouth through open vowels, for building the actor reference viseme set; input the time-coded target A/V original screen actor speech track; manually capture emotional expression in the target voice dub A/V track and in the target screen actor A/V track; record screen actor and voice dub actor reference A/V tracks to databases; run acoustic speech recognition to mark beginning and end timing points for screen actor and dub actor target speech sentence events; analyze the audio tracks to generate a time-stamped phoneme stream and time-stamp and annotate the A/V frame tracks, with speech recognition plus a transcript used to identify all phones in the target A/V track; identify all pure-phone visemes and their associated muzzle shapes for the screen actor; execute a temporal warp to match dub utterance lengths to the preferred screen actor utterance lengths; store visemes for the voice dub actor; store pure-phone muzzle shapes and phone labels to the image database; select the best phone frame instances to store as standard visemes for the actor; annotate the target dub track with labels identifying visemes for all pure-phone key frames; assemble the target track with labels identifying visemes for the new voice dub track; use computer vision to identify all control points for the face-shape viseme muzzle shapes for every pure-phone key frame of the new voice dub track; insert and head-position rectify the visemes into the target track; execute muzzle stitching and texture matching for every pure-phone viseme key frame; execute 2D image morphing for all viseme-to-viseme time-interval transitions to produce intermediate frame muzzle patches; execute final texture mapping and muzzle stitching clean-up; use a lookup table translator to reduce the larger phone set to a smaller set of visemes (e.g., 9, 16, 24); mix in emotional weights; and add volumetric highlights from the dub or screen actor. The blocks are numbered 100 through 380 and are cited throughout the detailed description. Figure 1.]

U.S. Patent    Aug. 17, 2004    Sheet 2 of 2    US 6,778,252 B2

[Figure 2: block diagram of radar-aided viseme morphing. Recoverable block labels: sound analysis (900); image analysis (905); voice radar analysis (910); head position analysis (915); store annotated and time-synched A/V/R tracks (920); 3D CyberScan of the screen actor's head (925); generate normalized 3D mesh model (930); identify vocal tract transformation waveforms (935); match to morph key interpolation frames (940); train speech recognizer on new voice track (945); identify phones in video (950); linguist check (955); index image key frames with phonemes and weights (960); index audio track with phonemes and weights (965); match voice dub audio track to screen actor stored reference viseme base (970); compare differences and identify any transition discontinuities (973); select function curve matching radar waveform slope and inter-slope points (975); morph-interpolate wireframe control vertices between key frame visemes for all frames (978); apply function curve to morph interpolation path between viseme key frames (985); acoustic dub-actor to screen-actor transfer function to correct and smooth morph optical control paths (980).]

Figure 2 US 6,778,252 B2 1 2 FILM LANGUAGE face and mouth during Speech, and applying it to directing facial animation during Speech using Visemes. This invention is described in our provisional patent Visemes are collected by means of using existing legacy application No. 60/257,660 filed on Dec. 22, 2000. material or by the ability to have access to actors to generate BACKGROUND OF THE INVENTION reference audio video “footage”. When the screen actor is 1. Field of the Invention available, the actor Speaks a Script eliciting all the different The present invention relates generally to cinematic required phonemes and co-articulations, as would be com Works, and particularly to altered cinematic works where the monly established with the assistance of a trained linguist. facial motion and audio speech Vocal tract dynamics of a This Script is composed for each language or actor on a case Voice dub Speaker are matched to animate the facial and lip by case basis. The Script attempts to elicit all needed facial motion of a Screen actor, thereby replacing the Sound track expressive phonemes and co-articulation points. of a motion picture with a new Sound track in a different The Sentences of the Script first elicit speech-as-audio, to language. represent mouth shapes for each spoken phoneme as a 2. Background of the Invention position of the mouth and face. These Sentences then elicit In the field of audio dubbing, there are many cinematic 15 Speech-as-motion, to derive a requisite range of facial and works where it is desirable to have a language expressive transformations. Such facial transformations translation dub of an original cinematic or dramatic work, include those as effected from (1) Speaking words for where the original recorded Voice track is replaced with a capturing the facial motion paths as a variety of diphones new voice track. In one case, it is desirable to re-record a and triphones needed to represent the new speech facial Screen actor's Speech and dub it onto the original visual motions, and (2) making emotional facial gestures. Common track. In another case it is desirable to have the audio dub be phoneme range elicitation Scripts exist as alternate Sets of exactly Synchronized with the facial and lip motions of the Sentences, which are used to elicit all the basic phonemes, original cinematic Speaker. Instead of needing to re-shoot such as the “Rainbow Passage” for example. To elicit all the actor Speaking the Scene again, the dubbing proceSS types of transformation between one phoneme and another provides an opportunity to change the Voice. 25 Prior approaches to generating lip and mouth motion requires using diphones, the Sound Segment that is the to new voice Sound tracks have largely been transformation between one phoneme and another, for all the manual processes using computer audio and graphical pro phonemes. As Bregler confirmed in U.S. Pat. No. 5,880,788, cessing and Special effects tools. There have been recent triphones can have many thousands of different transforma developments towards automating the Voice dubbing pro tions from one phoneme Sound and corresponding mouth ceSS using 2D based techniques to modify archival footage shape to another. 
Triphones are used to elicit the capture of (Bregler), using computer vision techniques and audio Visual facial and mouth shape transformations from one Speech recognition techniques to identify, analyze and cap Speech phoneme mouth position dynamically to another ture Visual motions associated with Specific Speech utter phoneme mouth position and capture the most important ances. Prior approaches have concentrated on creating con co-articulation facial motion paths that occur during Speech. catenated based Synthesis of new visuals to Synchronize with 35 The actual motion path of a Set of fixed reference points, new Voice dub tracks from the Same or other actors, in the while the mouth moves from one phoneme to another, is Same or other languages. This approach analyzes Screen recorded and captured for the entire transformation between actor Speech to convert it into triphones and/or phonemes any Set of different phonemes. AS the mouth naturally Speaks and then uses a time coded phoneme Stream to identify one speech phoneme and then alters its shape to speak corresponding visual facial motions of the jaw, lips, visible 40 another phoneme, the entire group of fixed reference points tongue and visible teeth. These Single frame SnapShots or move and follow a particular relative course during any multi-frame clips of facial motion corresponding to speech phoneme to phoneme transformation. Triphones capture the phoneme utterance States and transformations are Stored in requisite variety of facial and mouth motion paths. Captur a database, which are then Subsequently used to animate the ing many examples of many of these different phoneme to original Screen actor's face, Synchronized to a new Voice 45 track that has been converted into a time-coded, image phoneme mouth shape transformations from a Speaking frame-indexed phoneme Stream. actor is completed. Concatenated based Synthesis relies on acquiring Samples There are two Sources of capture: target footage and of the variety of unique facial expression States correspond reference footage. The target footage is the audio visual ing to pure or mixed phonemes, as triphones or diphones. 50 Sequence. Elicitation examples are Selected to accommodate The Snapshot image States, or the short clip image Sequences the phoneme Set of the language of the target footage, are used, as key frame facial Speech motion image Sets, creating a database. This database of recorded mouth respectively, and are interpolated for intermediate frames motions is used as a training base for a computer vision between key frames using optical morph techniques. This motion tracking System, Such as the eigen-images approach technique is limited by being essentially a Symbol System 55 described in Pentland et al., “View-Based and Modular that uses atomic Speech and facial motion States to Syntheti Eigenspaces for Face Recognition', Proceedings of the cally continuously animate facial motion by identifying the IEEE Computer Society Conference on Computer Vision facial motions and interpolating between key frames of and Pattern Recognition, June 1994, pps. 84-91. The com facial motion. The actual transformation paths from first puter vision System analyzes the training footage by means 60 of practicing its vision analysis on the Visual footage to Viseme State to Second Viseme State are estimated using improve fixed reference point tracking and identification for either short clips, or by hand, frame to frame, or estimated each frame. 
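The capture process described above records the motion path of each fixed reference point over numerous examples of the same phoneme-to-phoneme transformation and averages them into a reference path. The sketch below (Python; the array layout, function name and step count are illustrative assumptions, not taken from the patent) shows one way such averaging over time-normalized examples could be done:

    import numpy as np

    def average_transition_path(examples, n_steps=30):
        """Average several captured examples of one phoneme-to-phoneme
        transition into a single reference motion path.

        examples : list of arrays shaped (frames_i, n_points, 2) holding the
                   tracked 2D positions of the fixed reference points for one
                   captured instance of the transition (assumed layout).
        n_steps  : number of normalized time steps in the averaged path.
        """
        resampled = []
        for ex in examples:
            ex = np.asarray(ex, dtype=float)
            t_src = np.linspace(0.0, 1.0, len(ex))   # original frame times
            t_dst = np.linspace(0.0, 1.0, n_steps)   # common normalized time base
            flat = ex.reshape(len(ex), -1)
            # Interpolate each coordinate of each control point onto the common base.
            interp = np.stack([np.interp(t_dst, t_src, flat[:, k])
                               for k in range(flat.shape[1])], axis=1)
            resampled.append(interp.reshape(n_steps, ex.shape[1], 2))
        # Element-wise mean over the captured examples gives the reference path.
        return np.mean(np.stack(resampled), axis=0)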
A speech recognition System or linguist is used by Standard morph animation techniques using various func to recognize phonemes in the training footage which can be tions curved to Smooth the concatenation process. used as indeX points to Select frames for usage as Visemes. BRIEF SUMMARY OF THE INVENTION 65 If the training footage is of a Screen actor, it permits a The invention comprises a method for accumulating an computer vision motion tracking System to learn the prob accurate database of learned motion paths of a speaker's able optical flow paths for each fixed reference point on the US 6,778,252 B2 3 4 face for different types of mouth motions corresponding to facial muzzle model, where the X/Y points representing phoneme transitions. These identified and recorded optical fixed reference control points now become what are called reference point flow paths during speech facial motions are *CV's or control vectors, in the 3D facial model. recorded and averaged over numerous captured examples of Sizing the Viseme to fit the actor on Screen facial position, the same motion transformations. Scale and orientation is done using either manual or auto A database of recorded triphone and diphone mouth matic techniques. The manual approach is to hand place a transformation optical flow path groups are accumulated. model wireframe overlay onto an image frame, and rotate it For commonly used speech transformations, commonly for correct orientation on X and Y axes (virtual camera used spline based curve fitting techniques are applied to angle) and Scale it for correct matching apparent distance to estimate and closely match the recorded relative Spatial the camera. The 3D wireframe overlay is used as a guide to paths and the rate of relative motions during different exactly fit and wrap the 2D viseme muzzle to the 3D transformations. The estimated motion path for any refer wireframe. The 3D wireframe is then fitted to match the 2D ence point on the face, in conjunction with all the reference original actor head and face orientation. The wireframe is points and rates of relative motion change during any mouth moved to fit, with the human operator Visually adjusting the shape transformation, is Saved and indexed for later produc 15 placement of the wireframe for each frame. The last match tion usage. moved Visemes from the prior frame are incrementally moved and Scaled and rotated to fit the current frame image An emotional capture elicitation proceSS is effected by of the actor head and face muzzle. having the actor get in the mood of a list of different basic emotional expressions. The actor then performs and Visually The Viseme CV fixed reference control points are exactly records examples of those emotional expressions and registered to the original Screen actor facial position for the changes. These expressions and changes are simultaneously eyes and nose position, to place them exactly to the head recorded with more than one camera position, Such as face position, Scaled to the correct size. The moving and Scaling forward, one of three-quarter profile left, or three-quarter and positioning actions are done manually using Standard profile right Side, or lower upward looking face Shot, or head 3D computer graphics overlay and compositing tools, Such 25 as Pixels3D. The actor viseme associated with the new dub level in full profile. Examples are recorded as Static image track for the image frame is applied. 
The applied Visemes are positions of the face, and also as short audio Video clips of aligned and registered to the correct position of the Screen mouth motion. This includes recording examples of the actor's face either manually or preferably by using computer emotional expressions with closed mouth and different open Vision registration techniques, Such as an automatic visual mouth Vowels, for greater accuracy. tracking algorithm to identify the lip and jaw edge bound The two dimensional facial image of the actor is initially aries of the Screen actor's face. Contour tracking algorithms manually overlayed with a finite number of control points. can be used to identify the outside edges of the actor's lips. These points have been established commonly within the A useful contour tracking algorithm is described in Kaas et animation industry and prior techniques applied to human al, “SNAKES: Active Contour Models”, Proc. Of the First actors to control facial motion animation, including Adobe's 35 International Conference On Computer Vision, London After Effects, Pinnacle Systems Commotion DV 3.1 video 1987. compositor tool, Discreet's Flame and Pixels3D animation Another computer vision tracking approach can be more tool, for example. Suitably applied to lower resolution images, Such as Video, This initial establishment of fixed reference control points using a gray Scale level based algorithm Such as the eigen mapped on one or more reference images of an actor's face 40 images approach described in Pentland et al. The computer is usually done manually. The points include key facial vision system identified locations of the boundaries of the features, Such as a constellation of points that outline the lipS and jaw permits the automatic placement and registra chin, the outside of the mouth and the inside edge of the lips. tion of new dub track Visemes to fit the original Screen actor The facial and mouth expressive motion paths are tracked facial positioning. The Viseme muzzle patches are Selected using common techniques established in computer vision, 45 from the actor Viseme database based on the dub Speech including the eigen-images approach described in the Pent Script and Speech recognition derived phoneme-to-viseme land et all algorithm. The commonly available computer match Sequence. The Sequence of Visemes is locked into Vision techniques track the motion of the predesignated place and correct orientation and Scale in the actor image, control points for the face and learn to follow and annotate and then the Viseme muzzle patches are altered to match the all image frames with the position of the mouth. 50 light, color, and texture of the original Screen actor muzzle, The elicited reference audio-visual database is then which alteration is enabled through mapped Samples from Supplemented by the original target Screen footage, to be the original Screen actor imageS as Seed texture patch re-synchronized to another language. The computer vision references for a plurality of muzzle area texture patches. The tracks and estimates the position of the control points Visual texture, light and hue of the Screen actor muzzle patch mapped to the mouth as they move in the target production 55 Samples are applied to the Viseme muzzle patch. The actor footage. Viseme is texture matched and match lighted for the Screen Each facial image Viseme represents a Specific facial actor Viseme to be applied for the image frame being expressive primitive of the Voice for different phonemes. processed. 
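The registration step described above scales, rotates and positions a viseme so that its eye and nose reference points land exactly on the screen actor's head in the frame. One standard way to compute such an alignment is a least-squares similarity transform between corresponding anchor points; the following sketch illustrates that general technique under assumed 2D point formats, and is not the patent's specific registration procedure:

    import numpy as np

    def similarity_transform(src, dst):
        """Least-squares similarity transform (uniform scale s, rotation R,
        translation t) mapping 2D anchor points src (for example a viseme's
        eye and nose points) onto dst (the same anchors located in the screen
        actor image), so that dst is approximately s * R @ src + t."""
        src = np.asarray(src, dtype=float)
        dst = np.asarray(dst, dtype=float)
        mu_s = src.mean(axis=0)
        mu_d = dst.mean(axis=0)
        a = src - mu_s
        b = dst - mu_d
        cov = b.T @ a / len(src)              # 2x2 cross-covariance
        U, S, Vt = np.linalg.svd(cov)
        d = np.sign(np.linalg.det(U @ Vt))    # guard against reflection
        D = np.diag([1.0, d])
        R = U @ D @ Vt
        var_src = (a ** 2).sum() / len(src)   # variance of the source points
        s = np.trace(np.diag(S) @ D) / var_src
        t = mu_d - s * (R @ mu_s)
        return s, R, t

    def register_viseme(points, s, R, t):
        """Move every viseme control point with the estimated transform."""
        pts = np.asarray(points, dtype=float)
        return (s * (R @ pts.T)).T + t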
The Viseme muzzle patch boundary is matched The morph target transformation Subsystem has learned, when Visually, it Seamlessly Stitches to the original Screen from the computer vision analysis on a real actor, to acquire 60 actor Surrounding face in the original footage. The result is and tune the right optical path transform for each different Such that the image frames show Sequential lip motion that actor expressive transformation. These transformations can is now Visually Synchronized to the new dub Speech track. include speech and emotional expression, and idiosyncratic In one embodiment of the invention the Sampling of gesture based expressions. texture patches from the Screen actor image is automatically 2D Visemes are mapped and controlled using X/Ypoints, 65 performed. This is accomplished at the point for the pro and after the texture map of a 2D Viseme capture is nor duction footage, that speech Visemes are first identified malized to a 3D model size, it is then mapped onto a 3D through voice recognition. The identified Visemes are Scaled, US 6,778,252 B2 S 6 rotated, and locked into correct Vertical and three dimen indexed to the corresponding dub Speech track radar mea Sional orientation relative to the camera, to match the Screen surement value. The difference between the dub speaker's actor head and face in the performance. The whole muzzle radar absolute dimensional measurements for different as well as patches of light, color and texture from the Speech phoneme utterances, and the actor's absolute dimen Sampled corresponding original actor muzzle in an image Sions is normalized, So that the radar track reading is Scaled frame is collected as a set of patches. These patches are to match the dimensions of Screen actor for the reference Stored and mapped relative to known wireframe control Viseme dimensional equivalents. The radar measurement of Vertices. These are later used as Seeds to paint and Smooth the lips and the jaw in terms of extent and Scale are modified the actor reference Viseme muzzle patch image to match the from the dub track Speaker to match the Screen actor face actor original performance image lighting and color. The and motion extents. The Scaled radar measurement track is process of texture matching the reference actor Viseme then used to directly articulate the morph path of all the muzzle patch to the original Screen performance actor facial asSociated control vertices between the actor Visemes in lighting can be manually or automatically accomplished. Sequence in the image track. The lighting and color altered reference Viseme muzzle In one embodiment of the present invention, radar mea patch Sufficiently matches the original Screen actor image, in 15 Surements produced from a dub Speaker, a Screen actor terms of hue, Saturation, lightness, lighting angle, Shadow, normalized dynamic motion path for the lips or the jaw for skin texture, motion blur and atmospheric diffusion, for the dub track, and this normalized radar motion track for the example. When texture mapping is completed for all Screen lips, is automatically analyzed to identify deviations from Visemes, the facial muzzle wireframe morph transformation the dub Speaker's reference motion Set for the same is also effected for the visual texture maps. The morph phonemes, which deviations may be additionally associated process generates in-between images from Viseme-to with emotional expression or actor facial quirks, or other Viseme. 
The morphing of a Viseme with another Viseme over idiosyncrasies. Each discrete phoneme and emotional time produces a bi-Viseme, which is equivalent to a speech expression and idiosyncratic facial motion is recorded and diphone. Morphing generates intermediate image frames Stored into a database of face Shapes, which can have an between Visemes occurances in the footage. The morph 25 applied amplitude of Zero, or no application, or up to 1, transformation is applied to the wireframe model and also representing the maximum dimensional extent of the control the locked Visual texture map. Morphing generates any Vertices as a Set for that discrete expression or mouth shape. intervening image frame for mouth, lip and jaw muzzle In one embodiment of the invention, radar captured lip shape positioning during speech. The applied reference motion measurements are Stored as motion path information Visemes may or may not have their control vertices positions that is used to control the relative motion paths of the control Specially modified to represent more than the one facially Vertices of the morph paths for the Selected Viseme mor expressive channel for the footage, as possibly detected phing to be applied. After Scaling normalization between within the original footage. dub Speaker and Screen actor mouth shape and motion The present invention includes the unique provision for extents, any remaining discrepancies between reference producing composite Visemes on the fly. This is accom 35 Viseme control vertices and actual motion paths are auto plished by means of using continuous dub Speaker radar matically incorporated and alter the reference Viseme to an measurement, or applying actor Speech facial motion track offset control vertices Set for that Viseme application. The ing techniques, or using multi-channel character animation automatically incorporated Viseme offsets to control vertices techniques, Such as used in Pixels3D, to iteratively may contain emotional or other non-speech expressive con approximate, model and decompose original actor Screen 40 tent and shape. The degree of Viseme offset may be given a footage into usable components for creating the new actor Separate channel control in a multi-channel mixer approach facial animation. Facial expression is a complex compliment to animation control. Thus, any radar measurement motion of usually one or more channels of expressive influence. tracking of the lip position may be separated into discreet Character animation control environments such as Pixels3D, component channels of shape expression. provide the ability to combine an unlimited number of 45 For example, each phoneme, each vowel and each emo different channels of expressive extent (relative to no tion and each idiosyncratic expression can have a separate expression) and relative control vertices motion path trans channel. At any moment in time, one or more channels are formations within an expression. being linearly mixed to produce a composite motion path of In one embodiment of the invention, radar is used as a the control vertices and thus a composite facial expression mechanism to acquire the motion dynamics of Speech as 50 and lip position. In this embodiment, radar measurements mouth and lip motions, in addition to inner Vocal tract are used to control the dynamic morph transformations of motion dynamics. The positions of the lipS and the jaw of a the images and to generate a real-world composite measure dub audio track Speaker are recorded using radar. 
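The morph stage described above generates the intervening frames between two viseme occurrences by moving the locked control vertices and blending the texture. A deliberately simplified stand-in for that idea is sketched below, using linear vertex interpolation plus a plain cross-dissolve of equally sized muzzle patches; the system described above instead warps the texture with the wireframe, so this is only a schematic:

    import numpy as np

    def morph_step(verts_a, verts_b, patch_a, patch_b, alpha):
        """Produce one intermediate frame between viseme A and viseme B.

        verts_a, verts_b : (n_points, 2) control-vertex positions of the two
                           viseme key frames, already registered to the actor.
        patch_a, patch_b : equally sized muzzle texture patches (float arrays).
        alpha            : 0.0 at viseme A, 1.0 at viseme B.
        """
        verts = (1.0 - alpha) * np.asarray(verts_a) + alpha * np.asarray(verts_b)
        patch = (1.0 - alpha) * np.asarray(patch_a) + alpha * np.asarray(patch_b)
        return verts, patch

    def morph_sequence(verts_a, verts_b, patch_a, patch_b, n_frames):
        """All intervening frames for one viseme-to-viseme transition."""
        return [morph_step(verts_a, verts_b, patch_a, patch_b, a)
                for a in np.linspace(0.0, 1.0, n_frames)[1:-1]]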
The ment of multiple influence, including Speech, emotion and dynamics of motion of the dub Speaker are Scaled to match other expression. The animation modeling tool then is the dynamic range of motion of the Screen actor lipS and jaw. 55 applied to analyze the real-world radar captured motion This is accomplished by means of having a reference Set of paths of facial expression and lip motion and knowing the image frame Visemes for the Screen actor that can be used as Speech track, effects a decomposition of the motion track reference motion dynamic measures for various mouth and into multiple facial animation mixer channels. This is analo lip motions during Speech. The actor Visemes are Scaled to gous to decomposition of a complex audio signal into its the actor. The dub Speech radar track measurements are 60 component Sine waves, Such as is accomplished using Fou absolute values that are referenced to a particular phoneme rier Series analysis for example. The algorithm for identify or phoneme-to-phoneme transition in time. The automati ing the discreet components of expression that combine into cally recognized phonemes in the dub track are used to Select a composite expression, is to iteratively Subtract known and the actor Viseme corresponding to the phoneme. The actor expected channel components from the composite shape Viseme and all its associated control vertices are automati 65 amplitude, and by process of approximation and reduction, cally indexed to a dub track radar measurement value. The identify all the probable discreet Sources of expressive shape Subsequent actor Viseme in the Speech Sequence is also and extent. US 6,778,252 B2 7 8 In another embodiment of the invention the actual speech number of degrees offset from standard face front forward of the dub actor track is modified using vocal tract modeling position, including estimated camera angle. This can be or other conventional techniques Such as ProSoniq's Time done by hand or by using computer vision analysis tech Factory, to effectively morph the audio tonal and pitch nique including Affine transformations to estimate the plane characteristics of the dub actor's voice to match the Screen of head orientation in the archival footage, and eigenvector actor voice. estimation techniques to track edges, to identify a finite Set of fixed reference points on the face which are used as the BRIEF DESCRIPTION OF THE DRAWINGS control points. (FIG. 1, Block 250) These facial fixed reference control points are common FIG. 1 is a block diagram of the audio-visual dubbing throughout the production process, and include points on the System; and outside and inside edges of the lips, points on the edge of the FIG. 2, is a block diagram of the System using radar. chin, a point for the tip of the tongue, a point for the bottom DETAILED DESCRIPTION OF THE edge of the upper teeth, a point for the upper edge of the INVENTION lower teeth, and points for the nose, and eyes. All these points are connected to Surfaces that can be recognized as In order to execute morph animation of a 2D Source 15 edges by the computer vision System. There are 55 points Speaker muzzle, a 3D muzzle model is used. The Speech total being recognized and mapped in the archival footage track is time-stamped to frames. (FIG. 1, Block 130 and and on the dub actor, if used. (FIG. 1, Block 250) 170). Computer voice recognition of the original recorded The Standard fixed reference control points on the human speech track is executed. (FIG. 1, Block 210, 200.) 
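The normalization described above rescales the dub speaker's absolute radar measurements so that they span the screen actor's dimensional range for the same articulator, as established from the actor's reference visemes. A minimal sketch of that range scaling, with assumed variable names, follows:

    import numpy as np

    def normalize_radar_track(dub_track, dub_min, dub_max, actor_min, actor_max):
        """Rescale a dub speaker's radar measurement track (e.g., lip aperture
        or jaw opening) so its range matches the screen actor's range for the
        same articulator."""
        dub_track = np.asarray(dub_track, dtype=float)
        span = dub_max - dub_min
        if span == 0:
            raise ValueError("dub speaker measurement range is degenerate")
        unit = (dub_track - dub_min) / span          # 0..1 in dub speaker units
        return actor_min + unit * (actor_max - actor_min)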
The face are common for (a) the animation wireframe reference Voice recognition can be additionally aided by working with model using a character model head articulation Subsystem, a prior known Speech text transcript. The Voice recognition (b) the archival actor capture database Subsystem, (c) the System generates a time-Stamped annotation database of dub actor capture database Subsystem, (d) the standard individual frames and associated computer estimated pho reference and individual actor Speech motion path reference nemes and diphones, and the estimated pure or mixed database Subsystem, (e) the morph transformation phoneme combination corresponding to the frame. Speech 25 Subsystem, and the (f) animation rendering Subsystem. This phonemes are time-Stamped every millisecond for the audio commonality of references from wireframe through to final footage. (FIG. 1, Block 220 and 270) animation is important Standardization for Streamlining the A 3D wireframe model of a human facial muzzle is production process. It permits the automatic unification of a overlayed and rectified to the Source 2D footage on a frame production template character animation head model to by frame basis. (FIG. 1, Block 330). The wireframe model particular actors and Speakers. is Scaled and positioned and oriented as an overlay model The head wireframe mesh model is a spline point wire onto the original audio/visual (A/V) footage, so that the frame model that is fitted to that of a target actor Speaker wireframe head and muzzle follows and matches the 2D head and speech related mouth motion. The 2D control point footage actor. This can be accomplished by hand or by 35 tracking and 3D morph animation Subsystems form a means of computer vision analysis techniques with Some method which permits rectification between different actor occasional human fine tuning of the model overlay fitting. head models, So that a captured dub actor's differences in (FIG. 1, Block 330) absolute and motion dimension measurements are easily A computer vision System tracks the (a) head position, (b) modified to fit the target original footage Speaker or actor the facial motions for the jaw and (c) the lip motion during 40 head dimensions and dynamics (FIG. 1, Block 330). speech. (FIG. 1, Block 250). This develops a database of A reference System is used, where System Standard head computer estimated positions of control reference points for and Speech mouth optical motion paths form a reference the head as a whole and for the facial muzzle area including database norm. The norm model records a given Speaker or the jaw and lips. (FIG. 1, Block 250). actor dimensions and dynamics as an offset reference. These The computer vision System analyzes each frame of 45 references are floating point vectors, assigned to identify the original A/V footage while Speech is occurring to track the 3D Vector coordinate position of any control point for any motion of the chin and lips on the 2D facial muzzle. (FIG. actor, as an offset to the Standard reference head and mouth 1, Block 250). articulation model. Offsets for a given actor can be selec Control reference points are initialized using a human to tively preserved and duplicated in the model to be used to find and affix computer overlay of points to the face and 50 generate the new speech track for the archival footage to be head of a Speaker in the footage. These Sets of points are modified. 
The System permits this to be achieved using an tracked by the computer vision System to estimate their easily modified complex model of a generic reference head, position on a frame to frame basis. The point positions per where the production engineer conforms the generic model frame are estimated after computer vision techniques iden to match the dimensions and dynamics of the target actor or tify the edges and lines associated with the points in the A/V 55 Speaker. This includes their mouth positions, range of footage of the speaker. (FIG. 1, Block 250). The positions of expressive States, and dynamics of motion during speech. fixed reference points on the face are estimated per frame by Common facial fixed reference control points are useful to the computer vision System, and the motion paths these permit the generic model System size, shape and motion to points take frame to frame during speech is recorded. (FIG. be modified to closely match the target Speaker in the 1, Block 250). 60 archival footage. The reference animation head model is The motion paths of the recorded speech for known adjusted to match the Speaker's relevant measurements, recognized Speech phoneme to phoneme transitions are including absolute head and face dimensions, and mouth recorded for the original A/V 2D footage (FIG. 1, Block position extremities during Speech, and mouth motion paths 170). The position of the mouth for all pure phonemes is as a whole during any speech transformation available from annotated in the database. (FIG. 1, Block 250) 65 the capture database. (FIG. 1, Block 290 and 300). The facial position and orientation in the original A/V The established database of fixed reference point place footage is estimated for each frame, and recorded as a ment and Speech and emotional related motion paths of the US 6,778,252 B2 9 10 points on the face, comes from multiple Sources: (a) 3D Estimated phoneme to phoneme transitions of the dub character animation Speech modeling techniqueS Such as actor Speech track are automatically estimated for each those employed for example, in the Pixels3D character target image frame to be modified, using the pure phoneme animation Software, the Autodesk 3D Character reference occurrences recognized by the computer (FIG. 1, Software, Intel's Skeletal Character Animation Software, Block 310). Metamotions Gypsy Software, Curious Labs Poser 4 char acter modeling and animation Software package. These can Dub actor Speech track phonemes are used to Select provide a Standard generic human head reference model to Visemes made from and for the target Screen actor, which use as a vector database template reference System where all Visemes will be used as animation reference index frames relevant points of the fixed reference points on the face (FIG. 1, Block 320). A database of phoneme specific identified during computer vision recognition are also ver Visemes are normalized for an actor. Archival footage, or if tices in the 3D wireframe of the head. This generic human available, new reference footage is acquired from the origi Speech facial motion model provides a reference point using nal actor, or if necessary, from a Stand-in actor with Some multi-channel Speech and emotional expression articulation Similarity to the Screen actor. The archival footage is ana Schema, Such as the Pixels3D animated Gorilla model for lyzed using voice recognition to identify phonemes in the example. (b) Another Source is archival footage or A/V 15 spoken track. 
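The annotation database described above links every image frame to the phoneme estimated by voice recognition at millisecond resolution. The sketch below assumes the recognizer output is available as (phoneme, start_ms, end_ms) segments, which is an assumed format rather than the output of any particular recognizer:

    def annotate_frames(segments, n_frames, fps=24.0):
        """Build a frame-indexed phoneme annotation table.

        segments : list of (phoneme, start_ms, end_ms) tuples from speech
                   recognition of the audio track (assumed format).
        n_frames : number of image frames in the footage.
        fps      : frame rate used to time-stamp frames against the audio.
        Returns one annotation record per frame.
        """
        table = []
        for frame in range(n_frames):
            t_ms = frame * 1000.0 / fps              # frame time stamp in ms
            label = None
            for phoneme, start, end in segments:
                if start <= t_ms < end:
                    label = phoneme
                    break
            table.append({"frame": frame, "time_ms": t_ms, "phoneme": label})
        return table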
The facial and mouth position during many material of human speakers, including video and film. (c) A instances of Speaking the same phoneme or phoneme tran third source is from actors currently available to dub the Sitions are recorded, compared, and an optimal example of Speech and facially mimic other Screen facial acting or the archival footage speaker is used as a Visual image Sample orientation requirements. (FIG. 1, Block 290 and 300). of the actor for the phoneme utterance, and becomes the The established database imports information learned Static frame Viseme of the actor Speaking that phoneme from the available archival A/V speech for the material. The (FIG. 1, Block 240). motion paths during speech of all facially fixed reference There are 46 phonemes used by English Speakers, which control points are estimated for all Speech in the track. The due to the common mouth shape for multiple phonemes, audio track is analyzed to identify phoneme transitions reduces to a set of as few as 15 visemes (FIG. 1, Block 260). within 10 millisecond accuracy, that are recognizable in the 25 Each of these 15 Visemes for a target Screen actor are archival footage. developed and stored in the database. Visemes exist as 2D Using Standard A/V database management techniques, images derived from archival or other reference Sources for phoneme transitions in the archival footage are annotated as the actor, or as Specifically acquired from the actor using file links to the A/V database audio track indexed both to the reference script recording (FIG. 1, Block 260). The 2D audio track in 10 millisecond intervals and on a frame by Visemes are an image map of the face, and also have frame basis. annotated in them all of the Standard facial fixed reference Each frame is labeled with a best estimate phonemic control points (FIG. 1, Block 250, 260). The 2D visemes are speech State, whether pure phones or mixed phones. (FIG. 1, wrapped as a texture map onto the wireframe model of the actor, which wireframe has been match moved to the correct Block 240). The voice recognition of the actual phonemes in 35 the archival A/V to be dubbed, permits the ready indexing of orientation and scale (FIG. 1, Block 330). The 2D visemes the facial Speech motion of the original Speaker for the are assigned to the to-be-modified archival image frame, purpose of digitally acquiring the Specific character of which Visemes are associated to the dub actor Speech Speech of the original actor/speaker Speech mouth patterns. phoneme track phonemes as annotated in the audio track and These particular speech characteristics of the archival foot 40 referenced to the image frame (FIG. 1, Block 330). age speaker are imported as groups of motion paths for If the dub actor audio track pure phoneme is not Suffi particular Speech phonemic transformations. The available ciently aligned in the audio time indeX to the image frame of Sample base of archival footage Speech provides these the Selected Screen actor Viseme, where the pure phoneme Samples, or if the actor is currently available, reference occurred between image frames, then this off-set is anno footage could be acquired. (FIG. 1, Block 290 and 300). 45 tated as an mix between at least two phonemes known in The 3D facial model in wireframe matches the dimen adjacent time. Sions of the actor. 
The 3D facial model orientation is AS a result, the reference image Viseme can be pre match-moved, for every frame to that of the archival footage morphed to match the dub phoneme mix, exact to the Speech actor. The wireframe model is first match moved to fit the transition moment for that image frame, in order to have absolute head orientation and estimated distance from cam 50 more frame accurate lip motion animation pre-morph index era of the Screen actor. The wireframe head muzzle is Visemes. In most cases the pure dub track phoneme Selected match-moved in 3D space to fit and overlay every head Screen actor's 15 Visemes are used and no offset pre-morph position of the actor for every frame (FIG. 1, Block 330). of the index viseme is necessary (FIG. 1, Block 320). The This can be done by hand, with a human operator using Standard Set of 15 Visemes for the given actor are texture Visual estimation techniques, or by automated computer 55 mapped onto the wireframe that has been match-moved to fit Vision techniques, that better estimate head distance and the Screen actor head and face and mouth in the archival orientation of any actor or Speaker in archival footage, in footage frame (FIG. 1, Block 260). which automatic estimates can be used to make the first The lighting, texture, diffusion, color, and Shading for match-move placement of the Wireframe model to the every frame of the original actor/speaker are sampled to Speaker or actor, and then if necessary, the human operator 60 establish the corrections to be made to the Standard actor can visually check the accuracy of the match move place viseme (FIG. 1, Block 350). Each match moved actor ment of the 3D wireframe muzzle. Viseme placement has its texture map modified to match the Using common Voice recognition techniques, the dub original Screen actor's face texture map for the frame. This actor's Speech is analyzed to identify pure phoneme peaks in can be done by hand or computer, Sampling different patches their speech (FIG. 1, Block 180, 190, 210). Phoneme peak 65 of pixels from the corresponding frame in the archival points are annotated to the audio track database in 10 footage and placing the Sample patches in the palette to be millisecond intervals (FIG. 1, Block 220). used to guide the Shading and texturizing of the Viseme to US 6,778,252 B2 11 12 match the original lighting (using Photoshop, AfterEffects) different phoneme transitions. Some, but usually not all (FIG. 1, Block 350). phoneme transition paths are estimated from a Screen actor. The phoneme indexed database of Stored motion paths of The limited Sample of motion paths from the Screen actor asSociated facial fixed reference points during Speech of the can be used to globally modify the motion paths during Screen actor for any speech phoneme transition are applied morphing of the 3D wireframe facial muzzle model. The to the morph engine (FIG. 1, Block 360). This is done by wireframe model is locked to the texture map Viseme image using the recorded motion paths for the control points during (FIG. 1, Block 350). The viseme index frames, placed to Screen actor Speech phoneme transitions. The associated their target index frames in the footage to be modified, are point group motion paths identified for the Screen actor used as morph targets (FIG. 1, Block 320, 330, 350). 
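The reduction described above maps the 46 English phonemes onto a much smaller viseme set through a lookup table and then pulls the screen actor's stored viseme for each dub-track phoneme. The patent does not enumerate the table, so the grouping below is purely illustrative, and the database access pattern is likewise an assumption:

    # Illustrative grouping only: the patent states that the 46 English phonemes
    # collapse to roughly 15 visemes but does not list the table, so these class
    # names and memberships are placeholders.
    PHONEME_TO_VISEME = {
        "p": "bilabial", "b": "bilabial", "m": "bilabial",
        "f": "labiodental", "v": "labiodental",
        "ch": "postalveolar", "jh": "postalveolar", "sh": "postalveolar",
        "aa": "open_vowel", "ae": "open_vowel",
        "iy": "spread_vowel", "ih": "spread_vowel",
        "uw": "rounded_vowel", "ow": "rounded_vowel",
        # ... remaining phonemes assigned to the other viseme classes
    }

    def select_visemes(dub_phonemes, actor_viseme_db):
        """Translate the dub track phoneme sequence into the screen actor's
        stored viseme key frames through the lookup table.  actor_viseme_db is
        assumed to hold one stored viseme record (image plus control vertices)
        per viseme class, including a "neutral" fallback."""
        selected = []
        for ph in dub_phonemes:
            viseme_class = PHONEME_TO_VISEME.get(ph, "neutral")
            selected.append(actor_viseme_db[viseme_class])
        return selected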
during Speech are used to Scale and calibrate the motion The morph animation Subsystem imports the motion paths trajectory of the fixed reference points in the morph anima to be used for the morph transitions, Some paths of which are tion Subsystem. This can be done by hand where the 2D reference paths matched to the actor and Some paths of control point path of motion is recorded as a complex vector which are captured and reproduced to match the Screen actor trajectory path for any phoneme transition. from the archival footage or newly generated reference The fixed reference points on that path are indexed as 15 footage (FIG. 1, Block 360). floating point vector offsets from the generic reference The morph animation Subsystem generates and renderS all model for the same phoneme transition as articulated by the intermediary frames between the Viseme index frames. The reference model. The motion path is mapped and used as a Viseme texture map image is already correctly colored, Static motion path map for the morph animation System shaded, textured and Stitched to the original actor face for when articulating the same phoneme transition. each indeX Viseme. The 2D Viseme is already mapped to the The motion path relative to transitioning between any two 3D wireframe muzzle model. The wireframe is already phonemes for the actor is mapped from 2D archival Source match moved to the correct position and orientation and capture Subsystem as an 2D to 3D Surface overlay wrap, scale of the original actor screen image (FIG. 1, Block 360). onto the 3D wireframe actor muzzle model (FIG. 1, Block The morphed and rendered frames between Viseme index 330). This is done by taking the actual motion path of any 25 frames are checked by a human operator for any final control point and using curve matching, Such as using optical, textural, Shading, or Stitching touch up to be done by common mathematical techniques for 2D slope estimation, hand, if necessary (FIG. 1, Block 370 and 380). identify slopes and rates of change of slope, and Select a Referring now to FIG. 2 showing Radar Aided Viseme Spline function that matches the path of that motion vector. Morphing, the production System engages in an analysis 2D spline curves of motion paths are mapped onto a 3D Stage, where the actor audio Video track Sound (FIG. 2, muzzle model and become re-labeled as 3D floating point Block 900) is analyzed, the image frames 905 are analyzed, vector trajectories. The motion path trajectory of any pho the actor voice radar measurements (FIG. 2, Block 910) are neme transition is estimated as a complex Spline curve that analyzed and the actor head position and orientation (FIG.2, matches the motion path. The complex Spline curve for the Block 915) are analyzed and located in the image frames for motion path of the control point is then used and Scaled to 35 all frames to be modified. The analytic results are annotated the transition interval in time to guide the morph optical flow as Such in the original audio Video track using a database. All path for the frame Specific placement of control point during annotations are Synchronized or time Synched to the Speech morph animation rendering. The database Stores these esti tracks to be modified. mated Spline curves that match the control point motion path 40 In addition, the dub track actor's Speech is measured in (FIG. 1, Block 290, 300, 360). 
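The curve fitting described above stores each recorded control-point motion path as a spline and later rescales it to the duration of a new transition to guide the morph optical flow. A small sketch using a cubic spline (scipy is one common choice here; the data layout is assumed) follows:

    import numpy as np
    from scipy.interpolate import CubicSpline   # one common spline choice

    def fit_motion_path(positions):
        """Fit a spline to one recorded control-point motion path.

        positions : (n_samples, 2) array of the 2D positions of a fixed
                    reference point tracked across a phoneme-to-phoneme
                    transition (assumed layout).
        Returns a callable spline over normalized time u in [0, 1].
        """
        positions = np.asarray(positions, dtype=float)
        u = np.linspace(0.0, 1.0, len(positions))
        return CubicSpline(u, positions, axis=0)

    def sample_path(spline, n_frames):
        """Rescale the stored path to a new transition lasting n_frames and
        return the per-frame control-point positions that guide the morph."""
        return spline(np.linspace(0.0, 1.0, n_frames))

The stored spline can then be evaluated for any transition interval length, which is what allows a single captured path to be reused across differently timed utterances.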
The splines curves are used real-time using radar Sensors in order to identify mouth and to modify the generic head model articulation of the match Vocal tract shapes during the dub Speech, and thereby ing phonemes. identify the vocal tract transformation waveforms (FIG. 2, Where possible, the floating point vectorS representing the Block 935). A 3D wire frame mesh model of the actor's head control points associated with the phoneme transition 45 is generated using the CyberScan System (FIG. 2, Block motion path of the lips, mouth and jaw, can be used to 925). The stored (FIG. 2, Block 920) audio video tracks for modify other phoneme morph transitions, as a globally the actor provides means to identify and Select a set of effective modifier. This permits the reference facial lip Visemes for the actor. The Set of Visemes is derived by using motion model for all phoneme transitions to be matched the actor image frames which are uttering each phoneme closely to the actor using only a limited Subset of phoneme 50 which has a unique face and mouth shape. If the image transition examples from the archival footage. This is frames of the original audio Video track of the original actor accomplished by associating all points as control points as are all that is available, then a same sized and front facing relative percentage offsets from a human head reference Set of Visemes must be derived from the original footage. model, and these control point offsets are globally associated This is accomplished by means of applying Size Scaling and with other control points for any speech utterance. When any 55 three dimensional orientation correcting affine transforma point is moved, as associated with a given Speech utterance, tions to the original Video face image to correct for Size, other points are known to also move with Some associated orientation. This produces a Standard Set of Visemes to be probability as to where they will move as a group, for the used for the actor face that will be used to synthesize and asSociated phoneme or phoneme-to-phoneme transition. Synchronize the new speech track animation. Alternatively, The flock or group behavior of control points of facial 60 if the original footage actor is available, then a Standard Set motion permits the animation System to change one point of Visemes can be elicited from the actor that are captured as position and have other facial points match, as indexed to a face-front, identical oriented, thereby eliminating the need given speech transition. Speech phoneme transitions iden for the use of mathematical transformations using an affine tified for a Screen actor, have motion paths which are applied transform for example, and thus increasing the reference to the wireframe animation model of the actor muzzle. These 65 Viseme quality and accuracy. motion paths are Saved as complex Spline curves. The Further analysis is performed. Using automatic Speech motion paths of many points are differently associated in recognition techniques (ASR), the recognizer is trained US 6,778,252 B2 13 14 (FIG. 2, Block 945) to recognize and time index phonemes Sional measurement changes of the radar measurements in the new Voice track dub actor Speech, and also in the from the dub actor after having been normalized in dimen screen actor original speech audio video (FIG.2, Block 950) Sional range to that of the Screen actor. Occasionally, the track. 
If needed for accuracy over existing ASR techniques, Synthesized images of the actor's face which are Synchro a linguist (FIG. 2, Block 955) may be employed to check the nized to the new voice dub track may not transition Smoothly accuracy and quality of ASR phoneme analysis output. Once between (viseme A to viseme B) and then (viseme B to the phones (phonemes) are identified in the original actor viseme C). While the optical paths for the A to B and B to speech track (FIG. 2, Block 950), the database of the audio C transitions are Smooth within themselves, and accurate to Video track is annotated and key frame indexed (FIG. 2, the radar measurements that are used to control the optical Block 960) and the audio track is also annotated and audio path of the viseme to viseme morph, different sets of these weighted phoneme time indexed (FIG. 2, Block 965). Once Viseme to Viseme transformations, when time adjacently the phones (phonemes) are identified in the new dub actor combined may introduce Some degree of flutter or Voice track, the dub audio track, but not with any visual mismatch, which can be identified (FIG. 2, Block 973) and component, is Stored as a similarly annotated and audio manually Smoothed and matched. However, the use of radar weighted phoneme time indexed track (FIG. 2, Block 965). 15 to control the optical flow paths of Viseme to Viseme morph In FIG. 2 Radar Aided Viseme Morphing, the production produces tracks with much less inherent flutter or Slight System engages in a Synthesis Stage, where phonemes iden Visual distortion compared to other Viseme and morph facial tified (FIG.2, Block 945) in the dub actor new speech track image animation concatenation methods. The dub actor are used to Select original Screen actor Visemes correspond radar measurements that are normalized to the Screen actor ing to the dub actor speech utterance (FIG. 2, Block 970). Vocal tract can be represented as a close matching or exact The wire frame mesh model of the facial muzzle of the fitting function curve fitting associated Slopes and inter original screen actor (FIG. 2, Block 930) is used to match slope points. The identified function curves (FIG. 2, Block move and Scale and orient the Screen actor Visemes to the 975) for the dub speech has been normalized to match the matching position of the Screen actor head in the original actors dimensional range, and are used (FIG. 2, Block 985) image frames. This is accomplished since the wire frame 25 to control and Smooth the optical control paths of the Screen mesh model is locked using a Series of wire frame control actor wire frame intermediate Viseme muzzle shapes. The Vertices to a corresponding texture map of the actor face. match moved, Scaled, oriented and texture matched three The texture map thereby automatically corresponds to the dimensional Visemes are used as the morph key frames, and Visemes used. Once the original Screen actor Viseme has the optical morph transformation (FIG. 2, Block 978) is been selected (FIG. 2, Block 970) using the dub track applied using the radar derived dimensional measurements phoneme identification data (FIG. 2, Blocks 950, 960, and to control the flow paths of control vertices between any 965), and match moved and oriented to original screen actor time adjacent visemes. 
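In the radar-aided synthesis described above, the actor-normalized radar measurements control the optical motion path of the wireframe control vertices between time-adjacent visemes. The sketch below reduces that to driving a single morph weight per output frame from the radar segment; the data shapes and the mapping of measurement to weight are simplifying assumptions:

    import numpy as np

    def radar_driven_vertices(cv_a, cv_b, radar_segment):
        """Drive control vertices between two adjacent visemes with the
        actor-normalized radar measurements for that transition.

        cv_a, cv_b    : (n_vertices, 3) wireframe control vertices of the
                        bounding viseme key frames (assumed layout).
        radar_segment : 1-D radar samples covering the transition interval,
                        already scaled to the screen actor's range.
        Returns one (n_vertices, 3) array per output frame.
        """
        radar_segment = np.asarray(radar_segment, dtype=float)
        lo, hi = radar_segment[0], radar_segment[-1]
        denom = (hi - lo) if hi != lo else 1.0
        # Map the measurement onto a 0..1 morph weight between the two visemes.
        weights = np.clip((radar_segment - lo) / denom, 0.0, 1.0)
        return [(1.0 - w) * np.asarray(cv_a) + w * np.asarray(cv_b)
                for w in weights]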
Further, the derived function curves head position for that image frame, the Viseme texture map that fit the radar measurements may not be avoided and the is applied and altered using image modification techniques radar measurements themselves used. The radar track, after such as those provided in the Adobe Photoshop 6.0 software 35 Some dimensional normalization and Scaling to match the tool, Such that the altered texture map for the Selected Screen actor, may be used directly. In this alternative reference Viseme Seamlessly overlays and Stitches into the embodiment, there is no need for deriving function curves original texture and color and hue and lighting of the original (FIG.2, Blocks 975,985 and 990) to match the recorded dub Screen actor head footage. This process Selects the Visemes actor radar measurements, and instead the radar measure for the new audio track, modifies the Scale and orientation 40 ments with Some Smoothing and base Scaling, can be used as and texture map of the Viseme to match the original Screen a direct controlling value in step (FIG. 2, Block 985) to actor. The match moved and match textured Visemes of the control the change of shape of the Screen actor intermediate Screen actor are indexed to frames in the audio Video footage Synthesized images between Viseme images in the image (FIG.2, Block 970). The continuous radar measurement data Sequence. for the dub actor lip and vocal tract motion Speech is Scaled 45 The operation of the invention using radar proceeds as to match the absolute dimensions of the Screen actor lip and follows: Vocal tract motion range. This Screen actor normalized radar The radar process in FIG. 2 fits within the overall inven measurements derived from the dub actor Speech, are used tion as diagrammed in FIG. 1 as alternative or adjunct to control the optical motion path of wire mesh control method to non-radar techniques. The Viseme and radar data points between actor Visemes, thereby providing an accurate 50 and continuous Viseme to Viseme optical transformation are mutually indexed in time (FIG. 1, Blocks 290 and 300), path. This is accomplished by taking the dub actor radar or (FIG. 2 Block 920). measurements that have been normalized to match the Radar measurements (waveforms) are taken by affixing a radar headset or related Structure upon both the Screen actor screen actor vocal tract dimensions (FIG. 2, Block 935). By and Voice actor's head. The radar headset or related Structure using the actor Visemes as morph transformation beginning 55 and ending points, and using the continuous and now Screen Supports the 3 EM Sensors. The Subject Speaks according to actor normalized radar measurements, the morph optical a variety of project and technical goals as embodied in transformation flow paths for all control vertices of the various scripts. (FIG. 1, Block 100). Screen actor are identified for all intervening image frames The computer's data acquisition System acquires and between viseme frames. This effects control of the exact 60 Stores the data in real-time for post capture processing (FIG. optical path of all associated control vertices of the Screen 2, Block 910). actor Viseme to Viseme image tranformations to correspond The Voice actor performs his or her voice role according to the radar dimensions of the actual dub actor. This only to Scripts for the Scene as well as for a phonemically guided requires the dimensional measurements to be matched (FIG. Script. The Screen actor reads a phonemically guided Script. 
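For the alternative embodiment just described, where the raw radar track is used directly after some smoothing and base scaling, a minimal moving-average smoother with an offset and gain could look like the following (window length, offset and gain values are placeholders):

    import numpy as np

    def smooth_and_scale(radar_track, window=15, base=0.0, gain=1.0):
        """Moving-average smoothing followed by base scaling, so the raw radar
        track can be used directly as the controlling value for the morph.

        window     : smoothing window in samples (e.g., 15 ms at 1000 samples/s).
        base, gain : offset and gain mapping the smoothed track into the screen
                     actor's dimensional range (placeholder values)."""
        radar_track = np.asarray(radar_track, dtype=float)
        kernel = np.ones(window) / float(window)
        smoothed = np.convolve(radar_track, kernel, mode="same")
        return base + gain * smoothed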
Further, the derived function curves that fit the radar measurements may be dispensed with and the radar measurements themselves used. The radar track, after some dimensional normalization and scaling to match the screen actor, may be used directly. In this alternative embodiment, there is no need for deriving function curves (FIG. 2, Blocks 975, 985 and 990) to match the recorded dub actor radar measurements; instead the radar measurements, with some smoothing and base scaling, can be used as a direct controlling value in step (FIG. 2, Block 985) to control the change of shape of the screen actor intermediate synthesized images between viseme images in the image sequence.

The operation of the invention using radar proceeds as follows:

The radar process in FIG. 2 fits within the overall invention as diagrammed in FIG. 1 as an alternative or adjunct method to non-radar techniques. The viseme and radar data are mutually indexed in time (FIG. 1, Blocks 290 and 300), or (FIG. 2, Block 920).

Radar measurements (waveforms) are taken by affixing a radar headset or related structure upon both the screen actor and voice actor's head. The radar headset or related structure supports the 3 EM sensors. The subject speaks according to a variety of project and technical goals as embodied in various scripts (FIG. 1, Block 100).

The computer's data acquisition system acquires and stores the data in real time for post capture processing (FIG. 2, Block 910).

The voice actor performs his or her voice role according to scripts for the scene as well as for a phonemically guided script. The screen actor reads a phonemically guided script (FIG. 1, Block 100). The 3 channels of the radar signal and the microphone are recorded in a computer. The transfer function of the screen actor is calculated for each phoneme and di-phone. The radar waveforms are used to determine the movements, in real time, of the subject's jaw, tongue, and glottis. The movement of the jaw is in exactly synchronous operation to the subject's voice. The captured jaw movement signal is used to "drive" the jaw graphical synthesis function of any corresponding screen actor or synthetic character's jaw movement (FIG. 1, Block 360), and (FIG. 2, Block 978).

The tongue and glottal measurements as well as the subject's acoustic signal are used to derive a complex transfer function. The subject's transfer function is an accurate description (signature) of the subject's voice. It can be used to transform the voice actor's speech into that of the screen actor's, thereby appearing to the audience that the screen actor (whose voice they know) speaks the language, including its nuances, of the voice actor.

The facial animation of the screen actor is indexed to the 3 channels of the voice actor's radar signals (FIG. 2, Block 920). The jaw and lips of the voice actor are used to drive (through a variety of software processes) the jaws and lips of the screen actor (FIG. 1, Block 310), and (FIG. 2, Blocks 970, 978). The glottal signal of the voice actor is used to drive the screen actor's phonemically indexed set of transfer functions to produce the voice of the screen actor, sounding like the screen actor but speaking the language and nuance of the voice actor.
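For illustration only, a minimal sketch of the direct use alternative described above, in which a raw radar channel (here the jaw) is smoothed, rescaled and time (frame) locked to the picture rate so that it can drive the screen actor's jaw; the sample rates, the box filter smoothing and the scaling are assumptions.

# Minimal sketch: smooth, frame-lock and rescale a raw radar jaw channel so it can
# directly drive the jaw of the screen actor model.
import numpy as np

RADAR_RATE = 1000.0   # radar samples per second (as in the described interface)
FRAME_RATE = 24.0     # film frames per second (assumed)

def smooth(signal, window=25):
    """Simple moving-average smoothing of one radar channel."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

def frame_lock(signal, duration_s):
    """Resample a radar-rate signal onto video frame times (time/frame locking)."""
    radar_t = np.arange(len(signal)) / RADAR_RATE
    frame_t = np.arange(0.0, duration_s, 1.0 / FRAME_RATE)
    return np.interp(frame_t, radar_t, signal)

def to_jaw_opening(signal, jaw_min, jaw_max):
    """Base scaling of the smoothed channel into the screen actor's jaw range."""
    lo, hi = signal.min(), signal.max()
    return jaw_min + (signal - lo) / (hi - lo) * (jaw_max - jaw_min)

if __name__ == "__main__":
    t = np.arange(0, 1.0, 1.0 / RADAR_RATE)
    raw = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(t.size)  # toy jaw channel
    per_frame = to_jaw_opening(frame_lock(smooth(raw), 1.0), 0.0, 2.5)
    print(per_frame.shape, per_frame[:4])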
The radar sensors used (FIG. 2, Block 910) comprise, for instance, 0.5 diameter dipole transmit and receive antennae and coaxial cables. There is one radar sensor per location on the human head, with up to 3 per head. These sensors are manufactured by LLNL for internal use.

A micro-power 0.3 milliwatt transmitter operating at 2 GHz is used in a homodyne transmit-receive mode. Analog filters of 70–7000 Hz and 2 to 10 Hz are used to process the output. LLNL manufactures these for its internal use.

There is utilized a radar sensor headset or related structure for holding in place the 3 EM sensors, in which one is located on the throat, another located under the chin, and the third is located directly in front of the lips.

A computer analog interface comprised of 3 A/D channels operating at ±1 volt Pk-Pk input, at 1000 samples/second at 16 bits, as manufactured by National Instruments, is used to transfer the signal readings to a computer. Data acquisition is comprised of software capable of reading and buffering the 3 A/D channels of 1000 samples/second at 16 bits, as manufactured by National Instruments.

There is utilized a microphone on the headset, with another one also mounted nearby on a microphone stand.

A digital audio interface is used to also capture and store digital audio (FIG. 2, Block 900) at up to 44 k samples per second (CD quality) at mono 16 bits. Data storage (FIG. 2, Block 920) comprises standard computer hard drives such as under Windows or Apple OS.

A radar waveform display is comprised of a multi-channel audio type display, such as manufactured by Sound Forge or National Instruments.

A radar data processor for transforming and interfacing (FIG. 2, Blocks 920, 930, and 935) to the 3D morph software is comprised of custom software written to take in the radar digital data and scale and provide files or buffers of said data to the 3D morph software. The 3D morph software is comprised of Pixels3D or Maya with an interface for reading processed and buffered radar data and/or its scaled and transformed data.

The radar signal can be captured, stored and processed (FIG. 2, Block 920) using a Windows PC or Apple Macintosh to collect, store, view and process the data. Sound Forge or National Instruments software can be used to collect and view the data. Software for post-processing the radar signal data (transforming the data to make it suitable for the 3D morph process) can be written in the C, Perl, TKL, AppleScript or Java languages.

Audio transformation of the voice actor's voice into that of the screen actor's voice can be implemented by custom DSP code as embodied in a DirectX plug-in, in Sound Forge on Windows, or in National Instruments LabVIEW. The DSP code would provide, in real time, an execution of the voice actor to screen actor transfer function, as calculated in the complex signal domain.
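For illustration only, a minimal sketch of exciting a phoneme indexed screen actor transfer function with the voice actor's glottal signal; representing each transfer function as an all pole (LPC style) filter, and the coefficient values shown, are assumptions, since the disclosure specifies only a complex transfer function computed per phoneme and di-phone.

# Minimal sketch: drive a per-phoneme screen-actor filter with a glottal excitation.
import numpy as np
from scipy.signal import lfilter

# Hypothetical table: phoneme -> denominator coefficients of an all-pole filter
# derived offline from the screen actor's recordings.
SCREEN_ACTOR_FILTERS = {
    "aa": np.array([1.0, -1.6, 0.8]),
    "iy": np.array([1.0, -1.2, 0.5]),
}

def render_segment(phoneme, glottal_excitation):
    """Filter the voice actor's glottal waveform through the screen actor's
    transfer function for this phoneme, yielding screen-actor-like speech."""
    a = SCREEN_ACTOR_FILTERS[phoneme]
    return lfilter([1.0], a, glottal_excitation)

if __name__ == "__main__":
    fs = 16000
    n = fs // 10
    pulses = np.zeros(n)
    pulses[:: fs // 120] = 1.0            # crude 120 Hz glottal pulse train
    out = render_segment("aa", pulses)
    print(out.shape, float(out[:50].sum()))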
The radar data is used for two diverse processes. One process is for directly driving facial animation (FIG. 1, Block 360) and (FIG. 2, Blocks 910, 920, 970, and 978). The second process is for deriving both the voice and screen actor's speech signatures, which in turn are used for transforming the voice actor's speech into the screen actor's speech; this is accomplished using available techniques such as those provided by ProSoniq's Time Factory software, for example.

For facial animation, the jaw and lips radar signal data is used to automate the opening and closing of the screen actor's jaws and lips (FIG. 2, Blocks 970, 978). The raw jaw and lips radar signal data is processed for scaling and is time (frame) locked with the screen actor's moving images (FIG. 2, Block 920). This function allows the screen actor's face to be in literal synchronous operation to that of the voice actor's jaw and lips (FIG. 2, Blocks 970 and 978). In addition, the glottal signal of the voice actor's speech allows for a transformative syncing of the screen actor's face to that of the voice actor's face (FIG. 2, Block 980).

For speech transformation, the transfer function derived for each phoneme from the screen actor is fed the glottal, jaw and lips signals obtained from the voice actor's speech using the EM sensor headset. The glottal signal drives the transfer function to create the screen actor speaking in the pitch of the voice actor. In addition, the movement of the voice actor's jaw and lips is used to select the corresponding screen actor transfer function. The combination of the voice actor's glottal signal driving the phoneme indexed screen actor transfer function produces speech that sounds like the screen actor but has the language and the nuance of the voice actor's speech.
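For illustration only, a minimal sketch of the selection step described above, in which the voice actor's frame locked jaw and lip measurements pick the corresponding phoneme indexed screen actor transfer function; the nearest neighbour rule and the stored per-phoneme centroids are assumptions.

# Minimal sketch: pick a phoneme-indexed transfer function per frame from jaw/lip data.
import numpy as np

# Hypothetical per-phoneme centroids of (jaw opening, lip aperture) for the screen actor.
PHONEME_CENTROIDS = {
    "m":  np.array([0.1, 0.0]),
    "aa": np.array([0.9, 0.7]),
    "uw": np.array([0.4, 0.2]),
}

def select_transfer_functions(jaw_lip_per_frame):
    """Return, for each frame's (jaw, lip) pair, the phoneme key of the screen
    actor transfer function to apply."""
    keys = list(PHONEME_CENTROIDS)
    centroids = np.stack([PHONEME_CENTROIDS[k] for k in keys])      # (n_phonemes, 2)
    frames = np.asarray(jaw_lip_per_frame, dtype=float)             # (n_frames, 2)
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
    return [keys[i] for i in np.argmin(dists, axis=1)]

if __name__ == "__main__":
    measured = [[0.12, 0.05], [0.85, 0.65], [0.45, 0.25]]
    print(select_transfer_functions(measured))   # e.g. ['m', 'aa', 'uw']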
The foregoing aspects of the invention are preferably implemented in a computer system that is programmed to perform the functions of speech recognition, image morphing, and database for integrated radar measurement capture. The output actor screen image track that is synchronized to the new dub audio track can be initially stored in the computer system's memory but later transferred to some other medium for playback, such as video tape.

While the invention has been described in reference to audio-visual dubbing, it can also be applied to other uses, such as foreign language learning or speech therapy, for example.

It will be appreciated by those of ordinary skill in the art that the present invention can be embodied in other forms without departing from the spirit or required characteristics thereof. For example, although eigenvalue vector estimation and radar dimensional measurement are described as suitable implementations for identifying and mapping the location of control vertices on an actor's facial image, a different boundary detection algorithm and a different means to measure lip motion dynamics may be employed for this purpose, but still retain the fundamental characteristic of the present invention to use a captured continuous dynamic measured motion path, however derived, from a dub actor's speech in order to more accurately and realistically control the motion animation of a screen actor's face and lips to synchronize to the new dub speech track.

The presently disclosed embodiments are therefore considered in all respects to be illustrative and not restrictive.

We claim:

1. A method for modifying an audio visual recording originally produced with an original audio track of an original speaker, using a second audio dub track of a second speaker, to produce a new transformed audio visual recording with synchronized audio to facial expressive speech of the second audio dub track spoken by the original speaker, comprising analyzing the original audio track to convert it into a continuous time coded facial and acoustic symbol stream to identify corresponding visual facial motions of the original speaker to create continuous data sets of facial motion corresponding to speech phoneme utterance states and transformations, storing these continuous data sets in a database, analyzing the second audio dub track to convert it to a continuous time coded facial and acoustic symbol stream, using the second audio dub track's continuous time coded facial and acoustic symbol stream to animate the original speaker's face, synchronized to the second audio dub track to create natural continuous facial speech expression by the original speaker of the second dub audio track.

2. The method of claim 1 wherein said second audio dub track is spoken in a language different from that of the original speaker.

3. The method of claim 1 wherein said continuous time coded facial and acoustic symbols comprise diphones and visemes.

4. The method of claim 1 wherein said continuous time coded facial and acoustic symbols comprise triphones and visemes.

5. The method of claim 1 comprising using a set of affixed facial reference points to track the facial transformation from one continuous time coded facial and acoustic symbol to another continuous time coded facial and acoustic symbol.

6. The method of claim 5 comprising using a computer motion tracking system to record the optical flow path for each fixed reference point.

7. The method of claim 1 further comprising using a set of fixed facial reference points to track the facial transformation from one continuous time coded facial and acoustic symbol to another continuous time coded facial and acoustic symbol.

8. The method of claim 7 further comprising accumulating a database of recorded mouth transformation optical flow paths associated with continuous time coded facial and acoustic symbols.

9. The method of claim 5 further comprising adding an emotional elicitation process by using a computer motion tracking system to record a number of facial control points for each emotion.

10. The method of claim 9 in which the facial control points comprise, but are not limited to, the chin, teeth, inner lips, outer lips, cheeks and jowls.

11. The method of claim 7 comprising accumulating a database of visemes by recording the facial control points corresponding to each continuous time coded facial and acoustic symbol.

12. The method of claim 11 comprising accumulating a database of muzzle patches by mapping the facial control points for each viseme.
13. The method of claim 12 comprising selecting muzzle patches from the speaker's viseme database based on the second dub audio track and continuous time coded facial and acoustic symbol to viseme sequence and applying the selected muzzle patches onto a three dimensional facial muzzle model.

14. The method of claim 13 further comprising also collecting a set of patches of light, color and texture from the sampled first speaker's muzzle patches.

15. The method of claim 1 in which the fixed reference points for both speakers are mapped using radar.

16. The method of claim 15 in which the dub audio track radar measurements are referenced to a particular continuous time coded facial and acoustic symbol or symbol to symbol transitions.

17. A method for modifying an audio visual recording originally produced with an original audio track of an original screen actor, using a second audio dub track of a second screen actor, to produce a new transformed audio visual recording with synchronized audio to facial expressive speech of the second audio dub track spoken by the original screen actor, comprising analyzing the original audio track to convert it into a continuous time coded facial and acoustic symbol stream to identify corresponding visual facial motions of the original speaker to create continuous data sets of facial motion corresponding to speech utterance states and transformations, storing these continuous data sets in a database, analyzing the second audio dub track to convert it to phonemes as a time coded phoneme stream a continuous time coded facial and acoustic symbol stream, using the second audio dub track's continuous time coded facial and acoustic symbol stream to animate the original screen actor's face, synchronized to the second audio dub track to create natural continuous facial speech expression by the original screen actor of the second dub audio track.

18. The method of claim 17 comprising using a computer vision system to track and record a database of visemes of the actors head position, facial motions of the jaw, and the lip motion during speech for each continuous time coded facial and acoustic symbol.

19. The method of claim 18 comprising tracking fixed reference control points on the head, jaw and lips.

20. The method of claim 19 in which the reference control points comprise the outside edge of the lips, the inside edge of the lips, the edge of the chin, the tip of the tongue, the bottom edge of the upper teeth, the upper edge of the lower teeth, the cheeks and jowls, the nose and the eyes.

21. The method of claim 20 comprising accumulating a database of muzzle patches by mapping the facial control points for each viseme.

22. The method of claim 21 comprising selecting muzzle patches from the speaker's viseme database based on the second dub audio track and continuous time coded facial and acoustic symbol to viseme sequence and applying the selected muzzle patches onto a three dimensional facial muzzle model.

23. The method of claim 22 further comprising also collecting a set of patches of light, color and texture from the sampled first speaker's muzzle patches and applying these patches based on the second dub audio track.

24. The method of claim 19 in which the fixed reference points for both actors are mapped using radar.

25. The method of claim 24 in which the dub audio track radar measurements are referenced to a particular phoneme or phoneme to phoneme transition in time continuous time coded facial and acoustic symbol or symbol to symbol transition.
26. The method of claim 24 in which the radar mapping information is transferred to an animation control mixer.

27. The method of claim 24 in which the viseme database information is transferred to an animation control mixer.

28. The methods of claim 1 or 17 comprising using head modeling to aid in positioning three dimensional visemes.

29. The methods of claim 1 or 17 comprising modeling multiple floating point control vertices relative to a generic head shape and relative on a variety of speech viseme standards.

30. The methods of claim 1 or 17 comprising texture sampling target footage to texture match visemes.

31. A method for modifying an audio visual recording originally produced with an original audio track of an original screen actor, using a second audio dub track of a second screen actor, to produce a transformed audio visual recording with synchronized audio to facial expressive speech of the second audio dub track spoken by the original screen actor, comprising analyzing the original audio track to convert it into a continuous time coded facial and acoustic symbol stream, identifying corresponding visemes of the original screen actor, using radar to measure a set of facial reference points corresponding to speech continuous time coded facial and acoustic symbol utterance states and transformations, storing the data obtained in a database, analyzing the second audio dub track to convert it to phonemes a continuous time coded facial and acoustic symbol stream, identifying corresponding visemes of the second screen actor, using radar to measure a set of facial reference points corresponding to speech facial and acoustic symbol utterance states and transformations, storing the data obtained in a database, using the second audio dub track time-coded continuous time coded facial and acoustic symbol stream and the actors visemes to animate the original screen actor's face, synchronized to the second audio dub track to create natural continuous facial speech expression by the original screen actor of the second dub audio track.

32. The method of claim 31 in which the database resides in an animation control mixer.

33. The method of claim 31 in which the facial reference points comprise the outside edge of the lips, the inside edge of the lips, the edge of the chin, the tip of the tongue, the bottom edge of the upper teeth, the upper edge of the lower teeth, the cheeks and jowls, the nose and the eyes.

34. The method of claim 31 comprising texture sampling target footage to texture match visemes.

* * * * *