Advanced Multimodal Interfaces: Lessons Learned from Project Oxygen

Victor Zue ([email protected]), MIT Computer Science and Artificial Intelligence Laboratory

CSAIL Formed in July 2003

• Merger of the former Artificial Intelligence Laboratory and the Laboratory for Computer Science (= Project MAC, July 1963)
• About 820 members
  – 93 principal investigators
    * 72 active teaching faculty
    * EECS, Math, Brain and Cognitive Sciences, Aero/Astro, Mechanical Engineering, Planetary Sciences, Harvard-MIT HST
  – 450 graduate students
  – 90 research staff and research visitors
  – 80 postdoctoral fellows
  – 60 staff
  – 50 undergraduates


Our New Home: The Stata Center


How Do We Organize?

• Four Directorates (Research Directors)
  – Architecture, systems, and networks (Srini Devadas)
  – Theory (David Karger)
  – Language, learning, vision, and graphics (Leslie Kaelbling)
  – Physical, biological, and social systems (Randy Davis)
• Cross-cutting organizations within
  – T-Party
  – Project Oxygen
  – Center for Information Security and Privacy
• Funding
  – DARPA, NSF, NASA, CIA, NIH, ONR, AFOSR, ARDA
  – Singapore-MIT Alliance, Cambridge-MIT Institute, ITRI
  – Quanta Computer, Hewlett-Packard, Microsoft, NTT, Nokia, Philips, Delta Electronics, Acer, Shell, Ford, Sun, IBM, Intel, Toyota, Honda, Cisco, ABB


CSAIL PIs

Architecture, Systems, & Networks: Arvind, Agarwal, Amarasinghe, Asanovic, Balakrishnan, Clark, Corbato, Devadas, Dennis, Ernst, Fano, Guttag, Jackson, Kaashoek, Katabi, Lampson, Liskov, Madden, Miller, Morris, Rinard, Rudolph, Saltzer, Sollins, Stonebraker, Terman, Ward

Language, Learning, Vision, & Graphics: Adelson, Barzilay, Collins, Darrell, Durand, Fisher, Freeman, Glass, Golland, Grimson, Horn, Jaakkola, Kaelbling, Katz, Lozano-Perez, Poggio, Popovic, Richards, Seneff, Tenenbaum, Willsky, Zue

Theory: Berger, Demaine, Edelman, Garland, Goemans, Goldwasser, Indyk, Karger, Leighton, Leiserson, Lynch, Meyer, Micali, Rivest, Rubinfeld, Shor, Sipser, Spielman, Sudan, Vempala

Physical, Biological, & Social Systems: Abelson, Brooks, Davis, Gifford, Kellis, Knight, Leonard, Long, Massaquoi, Moses, O'Reilly, Roy, Berners-Lee, Rus, Shrobe, Stultz, Sussman, Szolovits, Teller, Weitzner, W3C, Tidor, Williams, Wisdom, Winston

CSAIL PIs

• Diverse
  – From 7 departments
  – From 22 countries
  – 13 women
• Recognized worldwide
  – 16 NAS/NAE members
  – 6 MacArthur Foundation Genius Awards
  – 4 Turing Awards
  – 2 Japan Prizes
  – 1 Knight of the British Empire
  – 1 Millennium Technology Award
  – …


What is Oxygen?

• A research project to radically change the way people deal with information-related activities
  – Launched in 2000
  – In collaboration with six world-class partners (Acer, Delta Electronics, HP, NTT, Nokia, Philips)
  – Develops technology on many fronts
  – Experiments with prototypes


The Oxygen Vision

• Computing and communication (C&C) will soon be “freely” available to everybody, everywhere
  – Need to accommodate a new, nomadic lifestyle
• Computers should serve people, rather than the other way around
  – Need to create a new mode of interaction
⇒ To bring the abundance of computation and communication within easy reach of humans by
  – Blending computation into people's lives through natural, perceptual interfaces, using sound, sight, gesture, etc.
  – Enabling them to easily do the tasks they want to do: collaborate, access knowledge, automate, and customize

Pervasive, Human-Centered Computing


Some Technical Challenges

• Devices
• Networks
• Security and Privacy
• Interfaces
• Collaboration
• Operating Systems
• …


We Need Anthropomorphic Interfaces

• Computers should interact with humans the same way humans interact with each other
• Next-generation interfaces should be human-centric:
  – Carry on a conversation with the user
  – Permit multimodal interactions using speech, vision, gesture, and sketching
  – Create an integrated, immersive environment

[Diagram: speech recognition, handwriting recognition, gesture recognition, and mouth & eye tracking all feed into language understanding to extract meaning]


Challenges for Multimodal Interfaces

• Input needs to be understood in the proper context
  – “What about that one?”
• Timing information is a useful way to relate inputs across modalities:

[Timeline: the utterance “Move this one over here,” with pointing gestures aligned in time to “this one” (object) and “over here” (location)]

• Handling uncertainties
• Need to develop a unifying linguistic framework
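To make the timing idea concrete, here is a minimal, hypothetical sketch (not the system shown in the video) that binds deictic words in a recognized utterance to the pointing event nearest in time; the word list, event format, and one-second window are illustrative assumptions.

```python
# Hypothetical sketch: bind deictic words ("this", "here", ...) to
# time-stamped pointing events by temporal proximity. Illustrative only.
DEICTICS = {"this", "that", "one", "here", "there"}

def resolve_deictics(words, pointing_events, max_gap=1.0):
    """words: list of (word, time_sec) from the speech recognizer.
    pointing_events: list of (referent, time_sec) from the gesture tracker.
    Returns (word, referent) bindings within max_gap seconds."""
    bindings = []
    for word, t_word in words:
        if word not in DEICTICS:
            continue
        # Choose the pointing event closest in time to the spoken word
        referent, t_point = min(pointing_events, key=lambda ev: abs(ev[1] - t_word))
        if abs(t_point - t_word) <= max_gap:
            bindings.append((word, referent))
    return bindings

# "Move this one over here" with two pointing gestures
words = [("move", 0.0), ("this", 0.4), ("one", 0.6), ("over", 1.0), ("here", 1.3)]
points = [("object_7", 0.5), ("location_B", 1.35)]
print(resolve_deictics(words, points))
# [('this', 'object_7'), ('one', 'object_7'), ('here', 'location_B')]
```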

Map-Based Interactions [video]

Conversational Interactions (Seneff, Glass, Zue)


Speech-Based Interfaces

• Communicate with users through a conversational paradigm
• Understand verbal input
  – Speech recognition
  – Language understanding (in context)
• Verbalize the response
  – Language generation
  – Speech synthesis
• Engage in dialogue with the user during the interaction


Conversational Interfaces

[Diagram: the Galaxy architecture. A central Hub connects specialized servers: Audio, Speech Recognition, Language Understanding, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, Database, …]
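As a concrete reading of the diagram, the sketch below mimics a Galaxy-style hub in a few lines: each server is a callable that annotates a shared frame, and the hub routes a turn through them in a scripted order. The server names follow the diagram; the dict-based frame and the routing rule are simplifying assumptions, not the actual Galaxy protocol.

```python
# Sketch of a Galaxy-style hub: servers are callables that annotate a shared
# frame; the hub routes each turn through them in a scripted order.
# The frame layout and routing rule are simplifying assumptions.
class Hub:
    def __init__(self, route):
        self.route = route
        self.servers = {}

    def register(self, name, server):
        self.servers[name] = server

    def process_turn(self, frame):
        for name in self.route:
            frame = self.servers[name](frame)  # each server adds annotations
        return frame

hub = Hub(["recognition", "understanding", "context",
           "dialogue", "generation", "synthesis"])
hub.register("recognition",   lambda f: {**f, "words": "what is the weather in boston"})
hub.register("understanding", lambda f: {**f, "meaning": {"topic": "weather", "city": "boston"}})
hub.register("context",       lambda f: f)  # would resolve "there", "what about tomorrow", ...
hub.register("dialogue",      lambda f: {**f, "answer": {"city": "boston", "forecast": "sunny"}})
hub.register("generation",    lambda f: {**f, "reply": "In Boston it will be sunny."})
hub.register("synthesis",     lambda f: {**f, "audio_out": "<waveform>"})

print(hub.process_turn({"audio_in": "<waveform>"})["reply"])
```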


Mono-lingual ⇒ Multi-lingual

[Diagram: several language-specific Speech Understanding and Speech Generation components exchange meaning through a Common Semantic Frame, backed by a single DATABASE]

• Use an interlingua to capture meaning
• Keep technology components language-independent when possible
• Encode language-specific information in external data structures

Multilingual Weather Information [video]
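A minimal sketch of these three principles, assuming a dict-based semantic frame and invented per-language tables (the real system's interlingua and generation rules are far richer): meaning stays language-neutral, and everything language-specific lives in external data.

```python
# Sketch: a language-neutral semantic frame rendered into several languages
# from external, language-specific tables (all entries invented here).
frame = {"event": "thunderstorm", "modality": "possibly",
         "pred": "accompanied_by", "objects": ["gusty_wind", "hail"]}

lexicon = {
    "english": {"gusty_wind": "gusty winds", "hail": "hail", "and": " and ",
                "pattern": "Some thunderstorms may be accompanied by {objs}"},
    "spanish": {"gusty_wind": "vientos racheados", "hail": "granizo", "and": " y ",
                "pattern": "Algunas tormentas posiblemente acompañadas por {objs}"},
}

def generate(frame, lang):
    # All language-specific knowledge comes from the external table
    entry = lexicon[lang]
    objs = entry["and"].join(entry[obj] for obj in frame["objects"])
    return entry["pattern"].format(objs=objs)

print(generate(frame, "english"))  # Some thunderstorms may be accompanied by gusty winds and hail
print(generate(frame, "spanish"))
```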

Visual Conversational Cues (Darrell)


Visual Conversational Cues

• Locate and track people in a scene
• Determine who is speaking
• Determine to what/whom the speaker is speaking
• Perceive head-nod agreement gestures
• Track body position and pose
• Recognize arm gestures
• Provide pose-invariant features for audio-visual speech recognition
• Infer audio-visual activities
• …


Head Pose Tracking

• Head pose provides important visual cues in dialogue:
  – Determine focus of attention
  – Acknowledge information and answer yes/no questions
• Real-world challenges
  – Variable backgrounds
  – Subtle movements
  – User independence
• Solutions
  – Stereo processing
  – Motion estimation
  – Online model acquisition

Head Pose Tracking [video]
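As one hypothetical illustration of turning tracked head pose into a yes/no answer, the sketch below flags a nod when the detrended pitch angle oscillates enough within a short window; the thresholds and window length are invented, not the tracker's actual decision rule.

```python
# Hypothetical sketch: detect a head nod from a tracked pitch-angle series
# by requiring several oscillations of sufficient amplitude within a window.
import numpy as np

def is_nod(pitch_deg, min_swings=4, min_amplitude=3.0):
    """pitch_deg: head pitch angles (degrees) over a ~1-second window."""
    p = np.asarray(pitch_deg, dtype=float)
    p = p - p.mean()                        # remove the resting head pose
    if p.max() - p.min() < min_amplitude:   # too subtle to count as a nod
        return False
    swings = int(np.sum(np.diff(np.sign(p)) != 0))
    return swings >= min_swings             # enough up-down reversals

t = np.linspace(0, 1, 30)
nodding = 5 * np.sin(2 * np.pi * 3 * t)       # three up-down cycles
print(is_nod(nodding), is_nod(np.zeros(30)))  # True False
```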

Body Pose Tracking

• Body pose can provide information about indirect object reference, gestures, etc.
• Current approach uses example-based learning
  – Match an image to a large example corpus based on shape or appearance
  – Incrementally update pose by aligning the matched model to subsequent frames

Body Tracking [video]
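A minimal sketch of the example-based step, with random placeholders standing in for the shape descriptors and the pose-labeled corpus: the pose for a new frame comes from its nearest neighbors in feature space, which the incremental alignment stage would then refine.

```python
# Sketch of the example-based step: nearest-neighbor matching of a shape
# descriptor against a corpus of (feature, pose) examples. The random data
# stands in for real silhouette descriptors and joint-angle labels.
import numpy as np

rng = np.random.default_rng(0)
corpus_feats = rng.normal(size=(5000, 64))   # e.g., silhouette descriptors
corpus_poses = rng.normal(size=(5000, 20))   # e.g., 20 joint-angle parameters

def estimate_pose(query_feat, k=5):
    """Average the poses of the k nearest corpus examples in feature space."""
    dists = np.linalg.norm(corpus_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    return corpus_poses[nearest].mean(axis=0)

initial_pose = estimate_pose(rng.normal(size=64))
print(initial_pose.shape)   # (20,); later frames refine this by alignment
```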


Audio-Visual Integration


Importance of Audio-Visual Integration

• The audio and visual signals both contain information about:
  – Identity/location of the person: Who is talking? Where is he?
  – Linguistic message: What's she saying?
  – Emotion, mood, stress, etc.: How does he feel?
• Proper utilization of these two channels of information can lead to robust and enhanced capabilities, e.g.,
  – Locating and identifying the speaker
  – Speech understanding augmented with facial features
  – Speech, gesture, and sketching integration
  – Audio/visual information delivery


Audio-Visual Symbiosis (for Input)

[Diagram: microphone arrays and camera arrays produce the acoustic and visual signals. For person locating and identity, speaker ID and face ID combine into robust person ID and robust locating. For the linguistic message, acoustic speech understanding and visual lip/mouth reading combine into robust understanding. For paralinguistic information, acoustic and visual paralinguistic detection combine into robust paralinguistic detection]


Audio-Visual Symbiosis (for Output)

[Diagram: speaker arrays and projector arrays render the acoustic and visual signals, enabling an audio/video spotlight and multiple personas. Voice conversion and face rendering combine into robust avatar rendering and robust delivery; speech synthesis and facial animation combine into robust TTS; acoustic and visual paralinguistic generation combine into robust paralinguistic generation, all driven by the linguistic message and paralinguistic information]


Locating and Identifying a Speaker (Darrell, Fisher, Hazen)


Audio-Visual Synchrony Detection

• The “cocktail party” problem: which audio should the device listen to?

[Illustration: the command “Computer, show me the Oxygen presentation” buried in surrounding chatter]

• A visual link between user and device can help
  – Detect and recognize the user
  – Confirm that the utterance came from the user


Audio-Visual Synchrony Detection (cont’d)

• Find the most informative subspace projections of the audio and video features using mutual information
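One way to read that bullet, under a Gaussian assumption: the most informative one-dimensional projections of the audio and video features are the top canonical-correlation pair, and mutual information follows directly from that correlation. The sketch below implements this simplified reading; it is not the project's actual estimator, and the features and constants are illustrative.

```python
# Simplified sketch: score audio-visual synchrony as the mutual information
# of the most correlated pair of learned 1-D projections (CCA under a
# Gaussian model). Features and constants here are illustrative.
import numpy as np

def synchrony_score(audio, video):
    """audio: (T, Da) features, e.g. spectral energies per frame;
    video: (T, Dv) features, e.g. mouth-region motion per frame."""
    A, V = audio - audio.mean(0), video - video.mean(0)
    T = len(A)
    Caa = A.T @ A / T + 1e-6 * np.eye(A.shape[1])   # regularized covariances
    Cvv = V.T @ V / T + 1e-6 * np.eye(V.shape[1])
    Cav = A.T @ V / T
    Wa = np.linalg.cholesky(np.linalg.inv(Caa))     # whitening transforms
    Wv = np.linalg.cholesky(np.linalg.inv(Cvv))
    # Top canonical correlation = largest singular value after whitening
    rho = min(np.linalg.svd(Wa.T @ Cav @ Wv, compute_uv=False)[0], 0.999)
    return -0.5 * np.log(1 - rho**2)   # Gaussian MI of the projected pair

rng = np.random.default_rng(1)
src = rng.normal(size=200)             # shared articulation signal
audio = np.outer(src, [1.0, 0.5]) + 0.1 * rng.normal(size=(200, 2))
video = np.outer(src, [0.3, 1.0, 0.2]) + 0.1 * rng.normal(size=(200, 3))
print(synchrony_score(audio, video))                      # large: synchronous
print(synchrony_score(rng.normal(size=(200, 2)), video))  # near zero
```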


Audio-Visual Fusion for Talker Tracking

[Figure: a prior over talker location is combined with video and audio cues to produce posterior location estimates]

• Audio-only tracking is often ambiguous in reverberant environments; adding vision-based location cues makes tracking unambiguous
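A toy illustration of the fusion idea: treat audio and video as independent likelihoods over candidate directions and multiply them with the prior. The numbers below are invented so that reverberation gives the audio a spurious peak that the video likelihood suppresses.

```python
# Toy illustration: fuse independent audio and video likelihoods over a grid
# of candidate directions. All numbers are invented; reverberation gives the
# audio likelihood a spurious second peak that the video cue suppresses.
import numpy as np

azimuth = np.linspace(-90, 90, 361)            # candidate directions (degrees)

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

p_audio = gaussian(azimuth, 20, 8) + 1.2 * gaussian(azimuth, -45, 8)  # echo peak
p_video = gaussian(azimuth, 22, 15)            # one broad, unambiguous face peak
prior = np.ones_like(azimuth)                  # uniform prior over direction

posterior = prior * p_audio * p_video          # independence assumption
posterior /= posterior.sum()

print("audio-only peak:", azimuth[np.argmax(p_audio)])         # -45.0 (the echo)
print("fused posterior peak:", azimuth[np.argmax(posterior)])  # ~20 (the talker)
```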


Audio-Visual Person Verification

[Diagram: face identification and speaker identification scores feed a joint accept/reject decision]

[DET plot: false-accept vs. false-reject probability (%) for face-only, speech-only, and combined systems]

• Combined audio-visual input reduces the equal error rate by 90%

[video]
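The joint decision can be sketched as a weighted combination of the two modality scores compared against a single threshold; in practice the weight and threshold would be tuned, e.g. for equal error rate on held-out data. The values below are invented.

```python
# Sketch: accept/reject from a weighted combination of face and speaker
# verification scores. The weight and threshold are invented; in practice
# both would be tuned, e.g. for equal error rate on held-out data.
def verify(face_score, speaker_score, w_face=0.5, threshold=0.6):
    fused = w_face * face_score + (1.0 - w_face) * speaker_score
    return "accept" if fused >= threshold else "reject"

print(verify(0.9, 0.7))  # accept: both modalities agree
print(verify(0.9, 0.2))  # reject: a weak modality pulls the fused score down
```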

Audio-Visual Speech Recognition (Hazen, Glass)


Audio-Visual Speech Recognition

• Lip-reading significantly helps human recognition of speech in noisy environments
• Preliminary studies in automatic audio-visual speech recognition show the same trend

[Plot: word error rate (%) vs. signal-to-noise ratio (dB) for audio-only and audio-visual recognizers; the audio-visual system lowers word error rate by roughly 14-50% depending on SNR]


AVSR Data Collection & Feature Extraction

[Figure: mouth-region images with lip and teeth features tracked over time]


Integrating Speech with Gesture and Sketching (Darrell, Davis)


Gesture and Speech

• Integrating speech with visual cues for multimodal interaction
  – Untethered audio/video processing
  – Pointing and manipulation gestures (e.g., “that one” or “this big”)

Gesture & Speech [video]


Sketching & Pen Gesture Understanding

• Sketching is a natural means of communication between people
• We have begun to develop a toolkit to make it an equally natural means of communicating with legacy software systems
• We have used it in several applications, including
  – Mechanical design
  – Software development
• Integration with speech input is often desirable (e.g., “draw two parallel lines”)

Sketching & Speech [video]

Audio-Visual Information Delivery (Poggio, Glass)


Audio-Visual Information Delivery

• A new, data-driven approach (MIT Envoice) can produce very natural and intelligible synthetic speech
• We can now produce photo-realistic animations
• These animated agents can speak and sing in different languages
• We can combine speech synthesis and facial animation to produce realistic avatars

[Videos: Mary101-F, Mary101-M, Marilyn, Mary101-sing-C, Mary101-sing-J, English Avatar, Chinese Avatar]

Summary

• Oxygen strives for pervasive, human-centered computing
• Next-generation interfaces should be human-centric:
  – Carry on a conversation with the user
  – Permit multimodal interactions using speech, vision, gesture, and sketching
  – Create an integrated, immersive environment
• Many faculty members, researchers, and students at CSAIL are actively conducting research in these areas
• Systems with limited capabilities are beginning to emerge
• But much research remains to be done


The Team

• Vision:
  – Trevor Darrell, John Fisher, Tony Ezzat, Tommy Poggio
• Speech and Language:
  – Jim Glass, TJ Hazen, Stephanie Seneff, Chao Wang, Victor Zue
• Sketching:
  – Randy Davis
• Plus many of their talented students



Example of Content Processing

English: Some thunderstorms may be accompanied by gusty winds and hail

clause: weather_event
  topic: precip_act, name: thunderstorm, num: pl, quantifier: some
  pred: accompanied_by, adverb: possibly
    topic: wind, num: pl, pred: gusty
    and: precip_act, name: hail

Japanese:
Spanish: Algunas tormentas posiblemente acompañadas por vientos racheados y granizo
Chinese: 一些雷雨可能會伴有陣風和冰雹

Multilingual Weather Information [video]
