Advanced Multimodal Interfaces: Lessons Learned from Project Oxygen
Victor Zue ([email protected]) MIT Computer Science and Artificial Intelligence Laboratory
CSAIL: Formed in July 2003
• Merger of the former Artificial Intelligence Laboratory and the Laboratory for Computer Science (originally Project MAC, July 1963)
• About 820 members:
  – 93 principal investigators, including 72 active teaching faculty (EECS; Math; Brain and Cognitive Sciences; Aero/Astro; Mechanical Engineering; Planetary Sciences; Harvard-MIT HST)
  – 450 graduate students
  – 90 research staff and research visitors
  – 80 postdoctoral fellows
  – 60 staff
  – 50 undergraduates
Victor Zue — MIT Computer Science and Artificial Intelligence Laboratory 5/18/2005
Our New Home: The Stata Center
How Do We Organize?
• Four directorates (with research directors):
  – Architecture, Systems, and Networks (Srini Devadas)
  – Theory (David Karger)
  – Language, Learning, Vision, and Graphics (Leslie Kaelbling)
  – Physical, Biological, and Social Systems (Randy Davis)
• Cross-cutting organizations within the lab:
  – T-Party
  – Project Oxygen
  – Center for Information Security and Privacy
• Funding:
  – DARPA, NSF, NASA, CIA, NIH, ONR, AFOSR, ARDA
  – Singapore-MIT Alliance, Cambridge-MIT Institute, ITRI
  – Quanta Computer, Hewlett-Packard, Microsoft, NTT, Nokia, Philips, Delta Electronics, Acer, Shell, Ford, Sun, IBM, Intel, Toyota, Honda, Cisco, ABB
CSAIL PIs
• Architecture, Systems, & Networks: Arvind, Agarwal, Amarasinghe, Asanovic, Balakrishnan, Clark, Corbato, Devadas, Dennis, Ernst, Fano, Guttag, Jackson, Kaashoek, Katabi, Lampson, Liskov, Madden, Miller, Morris, Rinard, Rudolph, Saltzer, Sollins, Stonebraker, Terman, Ward
• Language, Learning, Vision, & Graphics: Adelson, Barzilay, Collins, Darrell, Durand, Fisher, Freeman, Glass, Golland, Grimson, Horn, Jaakkola, Kaelbling, Katz, Lozano-Perez, Poggio, Popovic, Richards, Seneff, Tenenbaum, Willsky, Zue
• Theory: Berger, Demaine, Edelman, Garland, Goemans, Goldwasser, Indyk, Karger, Leighton, Leiserson, Lynch, Meyer, Micali, Rivest, Rubinfeld, Shor, Sipser, Spielman, Sudan, Vempala
• Physical, Biological, & Social Systems: Abelson, Brooks, Davis, Gifford, Kellis, Knight, Leonard, Long, Massaquoi, Moses, O'Reilly, Roy, Berners-Lee (W3C), Rus, Shrobe, Stultz, Sussman, Szolovits, Teller, Weitzner, Tidor, Williams, Wisdom, Winston
CSAIL PIs
• Diverse:
  – From 7 departments
  – From 22 countries
  – 13 women
• Recognized worldwide:
  – 16 NAS/NAE members
  – 6 MacArthur Foundation "genius" awards
  – 4 Turing Awards
  – 2 Japan Prizes
  – 1 Knight of the British Empire
  – 1 Millennium Technology Award
  – …
What is Oxygen?
• A research project to radically change the way people deal with information-related activities
  – Launched in 2000
  – In collaboration with six world-class industrial partners (Acer, Delta Electronics, HP, NTT, Nokia, Philips)
  – Develop technology on many fronts
  – Experiment with prototypes
The Oxygen Vision
• Computing and communication (C&C) will soon be "freely" available to everybody, everywhere
  – Need to accommodate a new, nomadic lifestyle
• Computers should serve people, rather than the other way around
  – Need to create a new mode of interaction
⇒ To bring the abundance of computation and communication within easy reach of humans by:
  – Blending computation into people's lives through natural, perceptual interfaces using sound, sight, gesture, etc.
  – Enabling them to easily do the tasks they want to do: collaborate, access knowledge, automate, and customize
Pervasive, Human-Centered Computing
Some Technical Challenges
• Devices • Networks • Security and Privacy • Interfaces • Collaboration • Operating Systems • . . .
We Need Anthropomorphic Interfaces
• Computers should interact with humans the same way humans interact with each other
• Next-generation interfaces should be human-centric:
  – Carry on a conversation with the user
  – Permit multimodal interactions using speech, vision, gesture, and sketching
  – Create an integrated, immersive environment
[Diagram: input modalities converging on a shared meaning representation — speech recognition, handwriting recognition, language understanding, gesture recognition, and mouth & eye tracking]
Challenges for Multimodal Interfaces
• Input needs to be understood in the proper context (e.g., "What about that one?")
• Timing information is a useful way to relate inputs across modalities:
    Speech:    "Move this one over here"
    Pointing:         (object)  (location)    → time
• Handling uncertainties
• Need to develop a unifying linguistic framework
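The timing idea above can be sketched in a few lines. This is a minimal illustration, not the actual Oxygen implementation: the data classes, the deictic word list, and the one-second window are all assumptions made for the example, which simply binds each deictic word to the pointing event closest to it in time.

```python
# Illustrative sketch: bind deictic words in a timed speech transcript
# to pointing events by temporal proximity (all names/values hypothetical).
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    t: float          # midpoint time of the word, in seconds

@dataclass
class PointEvent:
    target: str       # what the pointing gesture resolved to
    t: float          # time of the gesture, in seconds

DEICTICS = {"this", "that", "here", "there", "one"}

def bind_deictics(words, points, max_gap=1.0):
    """For each deictic word, pick the pointing event closest in time."""
    bindings = {}
    for w in words:
        if w.text in DEICTICS:
            best = min(points, key=lambda p: abs(p.t - w.t), default=None)
            if best is not None and abs(best.t - w.t) <= max_gap:
                bindings[f"{w.text}@{w.t:.1f}"] = best.target
    return bindings

# "Move this one over here" with two pointing gestures
words = [Word("move", 0.2), Word("this", 0.5), Word("one", 0.7),
         Word("over", 1.1), Word("here", 1.4)]
points = [PointEvent("object_3", 0.6), PointEvent("location_B", 1.5)]
print(bind_deictics(words, points))
```

A real system would of course also handle the uncertainty the slide mentions — competing recognition hypotheses and ambiguous gestures — rather than a single hard alignment.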
Map-Based Interactions (video)
Conversational Interactions (Seneff, Glass, Zue)
Speech-Based Interfaces
• Communicate with users through a conversational paradigm • Understand verbal input – Speech recognition – Language understanding (in context) • Verbalize response – Language generation – Speech synthesis • Engage in dialogue with a user during the interaction
Conversational Interfaces
[Diagram: the Galaxy architecture — Audio I/O, Speech Recognition, Language Understanding, Context Resolution, Dialogue Management, Language Generation, Speech Synthesis, and Database servers, all communicating through a central Hub]
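The Galaxy hub-and-spoke idea — independent servers exchanging frames through a central hub that follows a routing script — can be sketched as a toy. The server names, frame keys, and routing rule below are illustrative, not the real Galaxy protocol:

```python
# Toy hub-and-spoke dialogue system: each "server" is a function that
# transforms a message frame; the hub routes frames through them in a
# scripted order. (Hypothetical servers; not the actual Galaxy code.)

def recognizer(frame):
    frame["hypothesis"] = "what is the weather in boston"
    return frame

def understanding(frame):
    frame["meaning"] = {"query": "weather", "city": "boston"}
    return frame

def dialogue_manager(frame):
    frame["response_frame"] = {"city": "boston", "forecast": "sunny"}
    return frame

def generation(frame):
    r = frame["response_frame"]
    frame["reply"] = f"The forecast for {r['city'].title()} is {r['forecast']}."
    return frame

class Hub:
    """Routes a frame through registered servers in a scripted order."""
    def __init__(self):
        self.servers = {}
    def register(self, name, fn):
        self.servers[name] = fn
    def dispatch(self, frame, script):
        for name in script:
            frame = self.servers[name](frame)
        return frame

hub = Hub()
for name, fn in [("recognizer", recognizer), ("understanding", understanding),
                 ("dialogue", dialogue_manager), ("generation", generation)]:
    hub.register(name, fn)

result = hub.dispatch({"audio": "..."},
                      ["recognizer", "understanding", "dialogue", "generation"])
print(result["reply"])
```

The point of the hub design is that servers stay decoupled: swapping a recognizer or adding a context-resolution step changes only the registration and the script, not the other components.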
Mono-lingual → Multi-lingual
[Diagram: multiple language-specific Speech Understanding and Speech Generation components sharing a Common Semantic Frame, all backed by a single database]
• Use an interlingua to capture meaning
• Keep technology components language-independent when possible
• Encode language-specific information in external data structures
Multilingual Weather Information (video)
Visual Conversational Cues (Darrell)
Visual Conversational Cues
• Locate and track people in a scene • Determine who is speaking • Determine to what/whom the speaker is speaking • Perceive head-nod agreement gestures • Track body position and pose • Recognize arm gestures • Provide pose-invariant features for audio-visual speech recognition • Infer audio-visual activities • …
Head Pose Tracking
• Head pose provides important visual cues in dialogue: – Determine focus of attention – Acknowledge information and provide answers to Yes/No questions • Real-world challenges – Variable background – Subtle movements – User independent • Solutions – Stereo processing – Motion estimation – Online model acquisition
Head Pose Tracking (video)
Body Pose Tracking
• Body pose can provide information about indirect object reference, gestures, etc. • Current approach uses example-based learning – Match an image to a large example corpus based on shape or appearance – Incrementally update pose by aligning matched model to subsequent frames
Body Tracking (video)
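The example-based matching step above amounts to a nearest-neighbor lookup in a corpus of (feature, pose) pairs. The sketch below is illustrative only — the corpus is random synthetic data and the 32-dim shape descriptor is an assumption, not the actual CSAIL features:

```python
# Sketch of example-based pose estimation: match an observed shape
# feature vector against a corpus of (feature, pose) examples and
# return the pose of the nearest example. (Synthetic, hypothetical data.)
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus: 500 examples, 32-dim shape descriptors, 3-dim poses
corpus_features = rng.normal(size=(500, 32))
corpus_poses = rng.normal(size=(500, 3))

def nearest_pose(observed, features, poses):
    """Return the pose whose example feature is closest in L2 distance."""
    dists = np.linalg.norm(features - observed, axis=1)
    return poses[np.argmin(dists)]

# Query with a slightly perturbed copy of example 42: we get pose 42 back
query = corpus_features[42] + 0.01 * rng.normal(size=32)
estimate = nearest_pose(query, corpus_features, corpus_poses)
```

The incremental step on the slide — aligning the matched model to subsequent frames — would then refine this coarse nearest-neighbor estimate over time.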
Audio-Visual Integration
Importance of Audio-Visual Integration
• The audio and visual signals both contain information about: – Identity/location of the person: Who is talking? Where is he? – Linguistic message: What’s she saying? – Emotion, mood, stress, etc.: How does he feel? • Proper utilization of these two channels of information can lead to robust and enhanced capabilities, e.g., – Locating and identifying the speaker – Speech understanding augmented with facial features – Speech, gesture, and sketching integration – Audio/visual information delivery
Audio-Visual Symbiosis (for Input)
[Diagram: audio-visual symbiosis for input — microphone arrays and camera arrays process the acoustic and visual signals; speaker ID and face ID combine into robust person locating and person ID; speech understanding and lip/mouth reading combine into the linguistic message; acoustic and visual paralinguistic detection combine into paralinguistic information]
Audio-Visual Symbiosis (for Output)
[Diagram: audio-visual symbiosis for output — speaker arrays (audio/video spotlight) and projector arrays (multiple personas) deliver the acoustic and visual signals; robust TTS with voice conversion and facial animation with face rendering combine into a robust avatar; acoustic and visual paralinguistic generation supply paralinguistic information alongside the linguistic message]
Locating and Identifying a Speaker (Darrell, Fisher, Hazen)
Audio-Visual Synchrony Detection
• The "cocktail party" problem: which audio should the system listen to? Amid competing background chatter, the user says: "Computer, show me the Oxygen presentation."
• A visual link between user and device can help – Detect and recognize user – Confirm that utterance was from user
Audio-Visual Synchrony Detection (cont’d)
• Find most informative subspace projection between audio and video features using mutual information
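One simple stand-in for that informative projection is canonical correlation analysis: project both streams onto the direction pair with maximum correlation, and score synchrony with the Gaussian mutual information I = -½ log(1 − ρ²). The sketch below makes that Gaussian assumption explicit and uses synthetic features; it is an illustration of the idea, not the actual CSAIL method:

```python
# Sketch: score audio-visual synchrony via the top canonical correlation
# between the two feature streams, converted to Gaussian mutual
# information. (Synthetic data; CCA is a stand-in for the slide's
# information-theoretic subspace projection.)
import numpy as np

def top_canonical_corr(A, V, eps=1e-6):
    """Largest canonical correlation between feature streams A and V."""
    A = A - A.mean(axis=0)
    V = V - V.mean(axis=0)
    n = len(A)
    Saa = A.T @ A / n + eps * np.eye(A.shape[1])
    Svv = V.T @ V / n + eps * np.eye(V.shape[1])
    Sav = A.T @ V / n
    Wa = np.linalg.inv(np.linalg.cholesky(Saa))   # whitening transforms
    Wv = np.linalg.inv(np.linalg.cholesky(Svv))
    M = Wa @ Sav @ Wv.T
    rho = np.linalg.svd(M, compute_uv=False)[0]
    return min(rho, 1.0 - 1e-9)

def synchrony_score(A, V):
    rho = top_canonical_corr(A, V)
    return -0.5 * np.log(1.0 - rho ** 2)   # Gaussian mutual information

rng = np.random.default_rng(1)
shared = rng.normal(size=(2000, 1))                      # common driver
audio = np.hstack([shared, rng.normal(size=(2000, 4))])  # 5-dim audio
video = np.hstack([shared + 0.3 * rng.normal(size=(2000, 1)),
                   rng.normal(size=(2000, 7))])          # 8-dim video

synced = synchrony_score(audio, video)
shuffled = synchrony_score(audio, video[rng.permutation(2000)])
assert synced > shuffled   # true A/V pair scores higher than a mismatch
```

In the cocktail-party setting, comparing such scores across candidate audio streams lets the system pick the stream that is actually synchronized with the visible speaker.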
Audio-Visual Fusion for Talker Tracking
[Diagram: a prior over talker location combined with audio and video location estimates to form a posterior]
• Audio-only tracking is often ambiguous in reverberant environments; adding vision-based location cues makes tracking unambiguous
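The prior/posterior combination above is just Bayes' rule over candidate locations. The sketch below uses made-up likelihood numbers (the two-peaked audio likelihood mimics a reverberant room) to show how an ambiguous audio cue is resolved by the video cue:

```python
# Sketch of Bayesian audio-visual fusion for talker localization over a
# discretized set of locations. (Illustrative likelihood values only.)
import numpy as np

def posterior(prior, audio_like, video_like):
    """Posterior ∝ prior × audio likelihood × video likelihood."""
    p = prior * audio_like * video_like
    return p / p.sum()

locations = ["left", "center", "right"]
prior = np.array([1/3, 1/3, 1/3])

# Reverberation makes the audio likelihood ambiguous (two strong peaks);
# the video likelihood disambiguates it.
audio_like = np.array([0.45, 0.10, 0.45])
video_like = np.array([0.80, 0.15, 0.05])

post = posterior(prior, audio_like, video_like)
print(locations[np.argmax(post)])   # "left"
```

A real tracker would apply this update recursively frame by frame, with the previous posterior (propagated through a motion model) serving as the next prior.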
Audio-Visual Person Verification
[Diagram: face identification and speaker identification scores fused into a joint accept/reject decision]
[DET plot: false-reject vs. false-accept probability for face-only, speech-only, and combined systems — combining audio-visual inputs reduces the equal error rate by 90%] (video)
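A common way to realize the joint decision shown above is score-level fusion: a weighted sum of the per-modality match scores compared against a threshold. The weight and threshold below are illustrative placeholders, not the tuned values behind the reported 90% EER reduction:

```python
# Sketch of score-level fusion for audio-visual person verification.
# (Weight and threshold are hypothetical; real systems tune them on
# development data to minimize the equal error rate.)

def fused_decision(face_score, speaker_score, w_face=0.5, threshold=0.6):
    """Scores assumed normalized to [0, 1]; returns (accept?, fused score)."""
    combined = w_face * face_score + (1.0 - w_face) * speaker_score
    return combined >= threshold, combined

# A face match that is weak on its own, rescued by a strong voice match:
accept, score = fused_decision(face_score=0.55, speaker_score=0.75)
print(accept, round(score, 2))
```

The intuition matches the DET plot: errors in the two modalities are largely independent, so the fused score separates true users from impostors better than either score alone.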
Audio-Visual Speech Recognition (Hazen, Glass)
Audio-Visual Speech Recognition
• Lip-reading significantly helps human recognition of speech in noisy environments
• Preliminary studies in automatic audio-visual speech recognition show the same trend
[Plot: word error rate (%) vs. signal-to-noise ratio (dB) for audio-only and audio-visual recognition — relative WER reductions of 14–50%, largest at low SNR]
AVSR Data Collection & Feature Extraction
[Video frames: lip and teeth regions tracked over time for visual feature extraction]
Integrating Speech with Gesture and Sketching (Darrell, Davis)
Gesture and Speech
• Integrating speech with visual cues for multimodal interaction
  – Untethered audio/video processing
  – Pointing and manipulation gestures (e.g., "that one" or "this big")
Gesture & Speech (video)
Sketching & Pen Gesture Understanding
• Sketching is a natural means of communication between people • We have begun to develop a toolkit to make it a natural means of communication with legacy software systems • We have used it in several applications, including – Mechanical design – Software development • Integration with speech input is often desirable (e.g., “draw two parallel lines”)
Sketching & Speech (video)
Audio-Visual Information Delivery (Poggio, Glass)
Audio-Visual Information Delivery
• A new, data-driven approach (MIT Envoice) can produce very natural and intelligible synthetic speech
• We can now produce photo-realistic animations
• These animated agents can speak and sing in different languages
• We can combine speech synthesis and facial animation to produce realistic avatars
(videos: Mary101-F, Mary101-M, Marilyn, Mary101-sing-C, Mary101-sing-J, English Avatar, Chinese Avatar)
Summary
• Oxygen strives for pervasive, human-centered computing
• Next-generation interfaces should be human-centric:
  – Carry on a conversation with the user
  – Permit multimodal interactions using speech, vision, gesture, and sketching
  – Create an integrated, immersive environment
• Many faculty members, researchers, and students at CSAIL are actively conducting research in these areas
• Systems with limited capabilities are beginning to emerge
• But much research remains
The Team
• Vision: – Trevor Darrell, John Fisher, Tony Ezzat, Tommy Poggio • Speech and Language: – Jim Glass, TJ Hazen, Stephanie Seneff, Chao Wang, Victor Zue • Sketching: – Randy Davis • Plus many of their talented students
Example of Content Processing
English: "Some thunderstorms may be accompanied by gusty winds and hail"
Semantic frame:
  clause: weather_event
    topic: precip_act, name: thunderstorm, num: pl, quantifier: some
    pred: accompanied_by, adverb: possibly
      topic: wind, num: pl, pred: gusty
      and: precip_act, name: hail
Spanish: "Algunas tormentas posiblemente acompañadas por vientos racheados y granizo" ("Some storms possibly accompanied by gusty winds and hail")
Japanese and Chinese renderings: [characters garbled in this transcription]
Multilingual Weather Information (video)
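The frame above is language-neutral, so a toy generator can render it in several languages from per-language lexicons. This is a deliberately simplified sketch — the flat frame encoding, lexicon entries, and template rule are all illustrative, not the real generation system:

```python
# Toy sketch: language-specific surface generation from a
# language-neutral semantic frame. (Frame mirrors the slide; the
# lexicons and template are hypothetical simplifications.)
frame = {
    "clause": "weather_event",
    "topic": {"name": "thunderstorm", "num": "pl", "quantifier": "some"},
    "pred": "accompanied_by",
    "adverb": "possibly",
    "objects": [{"name": "wind", "num": "pl", "pred": "gusty"},
                {"name": "hail"}],
}

LEXICON = {
    "en": {"thunderstorm": "thunderstorms", "wind": "gusty winds",
           "hail": "hail", "some": "some",
           "accompanied_by": "may be accompanied by"},
    "es": {"thunderstorm": "tormentas", "wind": "vientos racheados",
           "hail": "granizo", "some": "algunas",
           "accompanied_by": "posiblemente acompañadas por"},
}

def generate(frame, lang):
    """Fill a subject-predicate-objects template from the lexicon."""
    lex = LEXICON[lang]
    conj = " and " if lang == "en" else " y "
    subj = f"{lex[frame['topic']['quantifier']]} {lex[frame['topic']['name']]}"
    objs = conj.join(lex[o["name"]] for o in frame["objects"])
    return f"{subj} {lex[frame['pred']]} {objs}".capitalize()

print(generate(frame, "en"))
print(generate(frame, "es"))
```

This is the payoff of the interlingua design on the earlier slide: the frame and the generation machinery are shared, and only the external lexicon/template data differ per language.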