UNIVERSITY OF CALIFORNIA

Los Angeles

Relating Optical Speech to Speech Acoustics and Visual

Speech Perception

A dissertation submitted in partial satisfaction of the

requirements for the degree Doctor of Philosophy

in Electrical Engineering

by

Jintao Jiang

2003


© Copyright by

Jintao Jiang

2003

The dissertation of Jintao Jiang is approved.

Kung Yao

Lieven Vandenberghe

Patricia A. Keating

Lynne E. Bernstein

Abeer Alwan, Committee Chair

University of California, Los Angeles

2003

Table of Contents

Chapter 1. Introduction
1.1. Audio-Visual Speech Processing
1.2. How to Examine the Relationship between Data Sets
1.3. The Relationship between Articulatory Movements and Speech Acoustics
1.4. The Relationship between Visual Speech Perception and Physical Measures
1.5. Outline of This Dissertation

Chapter 2. Data Collection and Pre-Processing
2.1. Introduction
2.2. Background
2.3. Recording a Database for the Correlation Analysis
2.3.1. Talkers
2.3.2. Materials
2.3.3. Recording Facilities
2.3.4. Placement of EMA Pellets and Qualisys™ Retro-Reflectors
2.3.5. Synchronization
2.3.6. Recording Procedure
2.4. Recording a Database for the Perceptual Similarity Analysis
2.5. Perceptual Experiments
2.5.1. Participants
2.5.2. Presentations
2.5.3. Procedure
2.6. Conditioning the Data
2.6.1. Audio
2.6.2. Optical Data
2.6.3. EMA Data
2.6.4. All Three Data Streams
2.7. Summary of Physical Measures and Perceptual Data
2.7.1. Physical Measures
2.7.2. Perceptual Data
2.8. Summary

Chapter 3. Multilinear Regression, Multidimensional Scaling, Hierarchical Clustering Analysis, and Phoneme Equivalence Classes
3.1. Introduction
3.2. Multilinear Regression
3.2.1. Mean Subtraction
3.2.2. Multilinear Regression
3.2.3. Jackknife Procedure
3.2.4. Goodness of Fit
3.2.5. Maximum Correlation Criterion Estimation
3.3. Phi-Square Transformation
3.4. Multidimensional Scaling
3.5. Hierarchical Clustering Analysis and Phoneme Equivalence Classes
3.6. Summary

Chapter 4. On the Relationship between Face Movements, Tongue Movements, and Speech Acoustics
4.1. Introduction
4.2. Background
4.3. Analysis of CV Syllables
4.3.1. Consonants: Place and Manner of Articulation
4.3.2. Syllable-Dependent Predictions
4.3.3. Discussion
4.4. Examining the Relationships between Data Streams for Sentences
4.4.1. Analysis
4.4.2. Results
4.4.3. Discussion
4.5. Prediction Using Reduced Data Sets
4.5.1. Analysis
4.5.2. Results
4.5.3. Discussion
4.6. Predicting Face Movements from Speech Acoustics Using Spectral Dynamics
4.6.1. Analysis
4.6.2. Results of Correlation Analysis Using Dynamical Information
4.6.3. Discussion
4.7. Summary

Chapter 5. The Relationship between Visual Speech Perception and Physical Measures
5.1. Introduction
5.2. Background
5.3. Method
5.3.1. Analyses of Perceptual Data
5.3.2. 3-D Optical Signal Analyses
5.3.3. Consonant Classification, Traditional Visemes, and Phoneme Equivalence Classes
5.3.4. Analysis Approach
5.4. Results and Discussion
5.4.1. Overall Visual Perception Results
5.4.2. Predicting Visual Perceptual Measures from Physical Measures
5.5. Multidimensional Scaling Analysis
5.6. Phoneme Equivalence Class (PEC) Analysis
5.7. General Discussion
5.8. Summary

Chapter 6. Examining the Correlations between Face Movements and Speech Acoustics Using Mutual Information Faces
6.1. Introduction
6.2. Background
6.3. Method
6.3.1. Database
6.3.2. Mutual Information Faces
6.3.3. Results
6.3.4. Discussion
6.4. Summary

Chapter 7. Summary, Implications, and Future Directions
7.1. Summary
7.2. Major Results
7.2.1. Chapter 4 – On the Relationship between Face Movements, Tongue Movements, and Speech Acoustics
7.2.2. Chapter 5 – The Relationship between Visual Speech Perception and Physical Measures
7.2.3. Chapter 6 – Examining the Correlations between Face Movements and Speech Acoustics Using Mutual Information Faces
7.3. Implications for Visual Speech Synthesis
7.3.1. Visual Speech Synthesis: A Promising Approach
7.3.2. Challenges in Visual Speech Synthesis
7.3.3. Implications from the Current Study
7.4. Future Directions

Appendix A. Confusion Matrices
Appendix B. Phoneme Equivalence Classes
Appendix C. 3-D Multidimensional Scaling Analyses
Bibliography

List of Figures

Figure 1.1 The relationships between face movements, vocal tract movements, and speech acoustics.
Figure 1.2 Finding optical cues to visual speech perception.
Figure 2.1 How the EMA system works.
Figure 2.2 Optical retro-reflectors and EMA pellets.
Figure 2.3 Placement of optical retro-reflectors and EMA pellets.
Figure 2.4 Synchronization of optical, EMA, and audio.
Figure 2.5 Overall recording scene.
Figure 2.6 Sync tone in audio signals.
Figure 2.7 A close look at the sync tone.
Figure 2.8 A new 3-D coordinate system defined by retro-reflectors 1, 2, and 3.
Figure 2.9 Missing data recovery for optical data.
Figure 2.10 Finding the sync pulse in EMA data.
Figure 2.11 Alignment between the EMA chin pellet and optical chin retro-reflector.
Figure 2.12 Conditioning of the three data streams.
Figure 3.1 Illustration of multilinear regression.
Figure 3.2 Diagram for the multilinear regression, where O and L stand for OPT and LSP, respectively.
Figure 3.3 Distances between 10 American cities and their 2-D MDS representations [data from (Kruskal and Wish, 1978)].
Figure 3.4 Diagram for MDS process.
Figure 3.5 Stress value vs. MDS dimension.
Figure 3.6 PEC analysis of the confusion matrix in Table 2.5.
Figure 4.1 Definition of place and manner of articulation for consonants.
Figure 4.2 Correlation coefficients averaged as a function of vowel context, C/a/, C/i/, or C/u/. Line width represents intelligibility rating level. Circles represent female talkers, and squares represent male talkers.
Figure 4.3 Correlation coefficients averaged according to voicing (VL: voiceless).
Figure 4.4 Correlation coefficients averaged according to place of articulation. Refer to Figure 4.1 for place of articulation definitions.
Figure 4.5 Correlation coefficients averaged according to manner of articulation. Refer to Figure 4.1 for manner of articulation definitions.
Figure 4.6 Correlation coefficients averaged according to individual channels: (a) LSPE, (b) retro-reflectors, and (c) EMA pellets. Refer to Table 2.4 for definition of individual channels.
Figure 4.7 Comparison of syllable-dependent (SD), vowel-dependent (VD), and syllable-independent (SI) prediction results.
Figure 4.8 Prediction of individual channels for the sentences.
Figure 4.9 Using reduced data sets for (a) syllable-dependent and (b) sentence-independent predictions of one data stream from another.
Figure 4.10 Autocorrelations of LSPs.
Figure 4.11 Autocorrelations of face movements.
Figure 4.12 Frequency response of the BF filter.
Figure 5.1 A diagram of the Montgomery and Jackson's (1983) study.
Figure 5.2 The relationship between visual consonant perception and physical measures.
Figure 5.3 Consonant segmentation in a /sa/ syllable.
Figure 5.4 A diagram that shows how the analysis of the relationships between visual consonant perception and physical measures was done.
Figure 5.5 Predicting visual perceptual distances from physical distances.
Figure 5.6 (a) Stress and (b) variance accounted for (VAF) vs. MDS dimension.
Figure 5.7 3-D MDS analysis of perceptual consonant confusion from (a) lipreading. Predictions from (b) all face retro-reflectors, (c) lip area, (d) cheek area, and (e) chin area.
Figure 5.8 Visual perception as a function of voicing, manner, and place of articulation.
Figure 6.1 Video clip 1: "nine seven nine nine one eight six two eight zero".
Figure 6.2 Video clip 2: "Six one six nine three five eight".
Figure 6.3 Video clip 3: "Two nine six four one one nine".
Figure 6.4 Mutual information face for video clip 1.
Figure 6.5 Mutual information face for video clip 2.
Figure 6.6 Mutual information face for video clip 3.
Figure 6.7 Mutual information faces for database DcorSENT (Sentence 3, four talkers, and four repetitions).
Figure 6.8 Mutual information faces in (Nock et al., 2002).

List of Tables

Table 2.1 Summary of equipment used in the recordings.
Table 2.2 Statistics of retro-reflector dropouts during the recording of CV syllables.
Table 2.3 A summary of the three datasets (four talkers).
Table 2.4 A summary of data channels used in the analysis.
Table 2.5 An example of a stimulus-response confusion matrix.
Table 2.6 A summary of the perceptual confusion matrices.
Table 4.1 Correlation coefficients averaged over all CVs (N=69) and the corresponding standard deviation. The notation X→Y means that X data were used to predict Y data.
Table 4.2 Sentence-independent prediction.
Table 4.3 Sentence durations (in seconds) for the four talkers.
Table 4.4 Correlation coefficients obtained using S features.
Table 4.5 Correlation coefficients obtained using FD features.
Table 4.6 Correlation coefficients obtained using BD features.
Table 4.7 Correlation coefficients obtained using BFD features.
Table 4.8 Correlation coefficients obtained using S+FD features.
Table 4.9 Correlation coefficients obtained using S+BD features.
Table 4.10 Correlation coefficients obtained using S+BFD features.
Table 4.11 Correlation coefficients obtained using S+FD+BD features.
Table 4.12 Average correlation coefficients using individual LSP stream.
Table 4.13 Average correlation coefficients using combined features.
Table 5.1 Physical distance matrices as a function of vowel context and talker.
Table 5.2 Lipreading accuracy across talkers and vowel context.
Table 5.3 Lipreading accuracy based on 12 Kricos and Lesner (1982) viseme groups across talkers and the vowel context.
Table 5.4 PECs across vowel context for different talkers.
Table 5.5 PECs across talkers for different vowels.

ACKNOWLEDGEMENTS

First and foremost, I am deeply indebted to my advisor, Prof. Abeer Alwan, for giving me the opportunity to work with her. Without her guidance, encouragement, support, and patience, I would not have been able to complete this dissertation. I have also learned a lot from her technical and editorial advice and insights.

I am very grateful to Dr. Lynne E. Bernstein, Prof. Patricia A. Keating, and Dr. Edward T. Auer for their continuous help and insightful suggestions during the course of the project. I would also like to thank Brian Chaney, Sumiko Takayanagi, Jennifer Yarbrough, Patrick Barjam, Sven Mattys, Taehong Cho, Marco Baroni, and Paula E. Tucker.

My thanks also go to the other members of my committee, previous and current, Profs. Kung Yao, Lieven Vandenberghe, and Sun-Ah Jun, for taking the time to review my dissertation and offering valuable comments and suggestions.

Students at the UCLA Speech Processing and Auditory Perception Laboratory have offered many suggestions on my work, and I have had fruitful and interesting discussions with them. I thank Katherine Rogowski for the countless help she has given me.

I would like to acknowledge the contributions of Abeer Alwan, Lynne E. Bernstein, Patricia A. Keating, and Edward T. Auer to Chapters 1-5.

I would also like to acknowledge the contributions of Abeer Alwan, Harriet Nock, Giridharan Iyengar, Chalapathy Neti, and Gerasimos Potamianos to Chapter 6.

This work was supported in part by NSF (KDI-9996088). Any opinions or conclusions expressed in this dissertation do not necessarily reflect the views of NSF, and no official endorsement should be inferred.

Last but not least, I would like to thank my wife Xiaojin and my parents for their patience and support.

VITA

November 21, 1972 Born, Hubei Province, P.R. China

1995 B.S., Electronic Engineering, Tsinghua University, Beijing, P.R. China

1995-1998 Graduate Student Researcher, Tsinghua University, Beijing, P.R. China

1998 M.S., Electronic Engineering, Tsinghua University, Beijing, P.R. China

1998-2003 Graduate Student Researcher, University of California, Los Angeles, Los Angeles, California

2000-2001 Teaching Assistant, Department of Electrical Engineering, University of California, Los Angeles

2002 Summer Intern, Bell Labs, Lucent Technologies, Murray Hill, New Jersey

2003 Summer Intern, IBM T.J. Watson Research Center, Yorktown Heights, New York

PUBLICATIONS AND PRESENTATIONS

Bernstein, L.E., Jiang, J., Alwan, A., and Auer, E.T. (2001). “Visual phonetic perception and optical phonetics,” Proc. AVSP, Scheelsminde, Denmark, 104- 109.

Jiang, J., Alwan, A., Auer, E.T., and Bernstein, L.E. (2001). “Predicting visual consonant perception from physical measures,” Proc. EUROSPEECH, Aalborg, Denmark, 179-182.

Jiang, J., Alwan, A., Bernstein, L.E., Auer, E.T., and Keating, P.A. (2002). "Similarity structure in perceptual and physical measures for visual consonants across talkers," Proc. ICASSP, Orlando, Florida, 441-444.

Jiang, J., Alwan, A., Bernstein, L.E., Auer, E.T., and Keating, P.A. (2002). “Predicting face movements from speech acoustics using spectral dynamics,” Proc. Int’l Conf. on Multimedia and Expo (ICME), Lausanne, Switzerland, 181-184.

Jiang, J., Alwan, A., Bernstein, L.E., Keating, P.A., and Auer, E.T. (2000). "On the correlation between facial movements, tongue movements, and speech acoustics," Proc. ICSLP, Beijing, China, 1: 42-45.

Jiang, J., Alwan, A., Keating, P.A., Auer, E.T., and Bernstein, L.E. (2002). “On the relationship between face movements, tongue movements, and speech acoustics,” EURASIP J. on Applied Signal Proc. on Joint Audio-Visual Speech Proc., Nov. 2002, 1174-1188.

Jiang, J., Alwan, A., Keating, P.A., and Bernstein, L.E. (2000). “On the correlation between orofacial movements, tongue movements, and speech acoustics,” J. Acoust. Soc. Am., 107(5): 2904.

Jiang, J., Alwan, A., Keating, P.A., Bernstein, L.E., Auer, E.T. (2000). “On the correlation between articulatory and acoustic data,” J. Acoust. Soc. Am., 108(5): 2508.

ABSTRACT OF THE DISSERTATION

Relating Optical Speech to Speech Acoustics and Visual Speech Perception

by

Jintao Jiang

Doctor of Philosophy in Electrical Engineering
University of California, Los Angeles, 2003
Professor Abeer Alwan, Chair

Since relatively few studies have examined optical phonetics and the relationship between speech acoustics and optical movements, visual speech synthesis and audio-visual speech recognition are currently based on limited optical phonetic knowledge. This dissertation aims at quantifying the relationship between speech acoustics and optical movements and at finding physical cues to visual speech perception. A database of 69 consonant-vowel (CV) syllables and three sentences (each repeated four times) spoken by two males and two females was analyzed. A three-dimensional optical motion capture system and an electromagnetic midsagittal articulography system were used to capture face and tongue movements, respectively. The relationships between the three data streams were examined using multilinear regression. Results showed that the relationship between speech acoustics and optical movements was not uniform across talkers, vowel context, place of articulation, and individual optical, articulatory, or acoustic channels. For CV syllables, the correlations between face movements and tongue movements were high (r = 0.70-0.88), and articulatory data could be well predicted from speech acoustics (r = 0.74-0.82). Based on the non-uniformity of the relationships for CV syllables, a dynamical model was proposed to enhance the relationship between speech acoustics and optical movements, and an improvement of about 17% was reported. These correlations were further confirmed using mutual-information-face analysis.

To examine the relationship between visual consonant perception and optical measures, another database of 69 CV syllables (repeated twice) was recorded (without tongue movements). Each talker's syllable productions were presented for identification in a visual-only condition to 6 hearing participants. Physical and perceptual measures were analyzed with multilinear regression, multidimensional scaling, and phoneme equivalence class analyses. Results showed that physical measures accounted for about 66% of the variance of visual consonant perception. Among the different facial regions, the lip area (55%) was the most informative, although the cheeks (36%) and the chin (32%) also contributed significantly to visual perceptual intelligibility. The correlations were not uniform across vowel context and talkers. Implications for visual speech synthesis are discussed.


CHAPTER 1. INTRODUCTION

1.1. Audio-Visual Speech Processing

Research in audio-visual speech perception and recognition has confirmed the multi-modality of human speech perception (Binnie et al., 1974; Cox et al., 1997; Erber, 1975; Luettin and Dupont, 1998; MacLeod and Summerfield, 1987; Potamianos et al., 2001; Smith, 1989; Sumby and Pollack, 1954): In face-to-face communication, perceivers can obtain and process information from the faces and voices of talkers simultaneously. Visual speech can also generate illusions such as in the McGurk effect (McGurk and MacDonald, 1976): When an acoustic /ba/ and a video tape of a person saying /ga/ are presented simultaneously, viewers frequently report perception of /da/. Face movements are basically by-products of speech production through vocal tract activities. Visual speech is much less intelligible than speech acoustics; hence, its role in speech perception is complementary (Green and Kuhl, 1991), while the auditory modality is dominant.

However, information exploited by perceivers from faces becomes more important in the case of background acoustic noise, non-native listeners or speakers, and complex or unfamiliar content. Furthermore, visual information can generate user-friendly interfaces and attract more attention because humans can perceive more emotion and expression from visual speech than from audio (Henton and Litwinowics, 1994). Nevertheless, presenting natural faces requires large bandwidth and data storage, and this method lacks flexibility in generating new speech materials. These limitations of natural faces have prompted researchers to develop realistic synthetic talking faces. The effort to create talking machines began several hundred years ago (Lindsay, 1997), and over the years most speech synthesis efforts have focused mainly on speech acoustics. The desire to create talking faces along with synthetic voices has been inspired by many potential applications.

Figure 1.1 The relationships between face movements, vocal tract movements, and speech acoustics.

Many years of research on acoustic phonetics, recognition, and synthesis have led to tremendous achievements. However, relatively few studies have quantitatively examined visual speech signals. Speech production involves control of various speech articulators to produce acoustic speech signals. Under the central control of the human brain, the various speech articulators (tongue, jaw, velum, larynx, and lips) work together to produce speech movements, and acoustic speech sounds result from those movements. Predictable relationships between speech movements and external face movements, and thus also between external face movements and speech acoustics, are expected (Figure 1.1). Towards this end, this dissertation aims at quantifying optical signals and examining their relationships with acoustic signals. A better understanding of the relationships between speech acoustics and optical (face and tongue) movements would be helpful for developing better synthetic talking faces and for other applications as well (Rubin and Vatikiotis-Bateson, 1998). For example, in audio-visual speech recognition, optical (facial) information (both the information correlated and uncorrelated with speech acoustics) could be used to compensate for noisy speech waveforms (Luettin and Dupont, 1998; Potamianos et al., 2001); the correlations between optical information and speech acoustics could also be used to enhance auditory comprehension of speech in noisy situations (Girin et al., 2001).

More than five decades of research on auditory perception has yielded a large body of information on acoustic phonetics and phonology. Acoustic phonetics has benefited from numerous careful measurements and perceptual experiments on natural or synthetic acoustic speech. The relationship between acoustic physical measures and auditory speech perception has been very helpful in developing acoustic phonetics and thus has been used in speech communication, synthesis, and recognition systems.

Similarly, examining the relationship between optical measures and visual speech perception is an indispensable step towards creating knowledge about optical phonetics.


Only a few papers, however, have examined optical (facial) phonetics, and thus optical speech synthesis is currently based on limited optical phonetic knowledge. For example, what makes a talker visually intelligible is not well quantified, and how physical gestures and acoustic properties are related remains unknown. Visual speech perception, which is a viewer’s subjective judgment of optical stimuli, should have a basis in physical stimulus characteristics. Finding physical cues to visual speech perception can help us understand and model visual speech perception and ultimately improve visual speech synthesis (Benoît et al., 1996). Towards this end, this dissertation also aims at examining the relationship between visual speech perception and optical measures.

1.2. How to Examine the Relationship between Data Sets

The objectives of this dissertation are to examine the relationship between articulatory (face and tongue) movements and speech acoustics and the relationship between visual perceptual data and optical measures. If data Y (e.g., face movements) can be predicted from data X (e.g., speech acoustics), the relationship can be defined as Y = f(X), where f is an unknown function. Function f can indicate the strength of the relationship: strong for one-to-one mapping, moderate for many-to-one mapping, and weak for one-to-many mapping. Function f can also indicate which components are important for predicting Y from X. Unfortunately, little is known about the function f, and examining the relationship directly based on function f is only beginning to be studied using methods from cognitive neuroscience. A completely behavioral method is to design a mapping from X to Y and then examine the correlation between Y and its prediction from X.

Montgomery and Jackson (1983) used linear regression for the mapping between visual speech perception and physical measures. Studies that have examined the relationship between face movements and speech acoustics have used linear regression as well as neural networks (Barker and Berthommier, 1999; Massaro et al., 1999; Yehia et al., 1998; Yehia et al., 1999). In these studies, neural networks have been shown to account for more of the variance in the mapping. However, in our view, multilinear regression provides a better opportunity to probe the underlying mapping function. Therefore, in this dissertation multilinear regression is used for exploring the relationship between Y and X (Sen and Srivastava, 1990). As a result, the function f is approximated by multilinear regression, which is a one-to-one mapping. To assess the strength of the relationships between the original data (Y) and its prediction from X, Pearson correlations are computed. With a linear mapping, if a strong relationship exists, then this strong relationship can help build a parsimonious explanation of how Y and X are related.
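To make this approach concrete, the following sketch (Python with NumPy; the array names X and Y and the channel counts are hypothetical) fits a least-squares linear map from one data stream to another and scores the prediction with per-channel Pearson correlations. It is only an illustration of the general idea; the exact procedure used in this dissertation (mean subtraction, the jackknife training and testing scheme, and the goodness-of-fit measure) is described in Chapter 3.

```python
import numpy as np

def multilinear_fit(X, Y):
    """Least-squares linear map W from predictor frames X (T x p)
    to target frames Y (T x q), after mean subtraction."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    W, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)   # W has shape (p, q)
    return W, X.mean(axis=0), Y.mean(axis=0)

def pearson_per_channel(Y, Y_hat):
    """Pearson correlation between measured and predicted data, per channel."""
    Yc = Y - Y.mean(axis=0)
    Hc = Y_hat - Y_hat.mean(axis=0)
    num = (Yc * Hc).sum(axis=0)
    den = np.sqrt((Yc ** 2).sum(axis=0) * (Hc ** 2).sum(axis=0))
    return num / den

# Hypothetical example: predict 51 optical channels from 16 acoustic (LSP) channels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 16))                                   # acoustic frames
Y = X @ rng.standard_normal((16, 51)) + 0.1 * rng.standard_normal((500, 51))
W, x_mean, y_mean = multilinear_fit(X, Y)
Y_hat = (X - x_mean) @ W + y_mean
print(pearson_per_channel(Y, Y_hat).mean())   # average correlation over channels
```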

1.3. The Relationship between Articulatory Movements and Speech Acoustics

Many researchers have attempted to determine the relationship between speech acoustics and vocal tract shapes (e.g., Alwan et al., 1997; Atal et al., 1978; Badin et al., 1995; Bangayan et al., 1996; Fant, 1960; Flanagan, 1965; Ladefoged et al., 1978; Lieberman and Blumstein, 1988; Narayanan et al., 1997; Narayanan and Alwan, 2000; Schroeter and Sondhi, 1992, 1994; Sondhi and Schroeter, 1987; Stevens and House, 1955; Stevens, 1972). The inverse problem (predicting vocal tract shapes from speech acoustics) has a long history of difficulty, and neural network methods have been used to quantitatively examine the relationship between speech acoustics and vocal tract shapes (Atal and Rioul, 1989; Rahim and Goodyear, 1990; Shirai and Kobayashi, 1991). Levinson and Schmidt (1983) used adaptive computation for the mapping. The codebook method has also been explored (Dusan and Deng, 1998; Ouni and Laprie, 2000).

Although considerable research has been conducted on the relationship between speech acoustics and vocal tract shapes, direct examination of the relationship between speech acoustics and face movements has only recently been undertaken (Agelfors et al., 1999; Barker and Berthommier, 1999; Massaro et al., 1999; Mori and Sonoda, 1998; Vignoli, 2000; Yehia et al., 1998). Yehia et al. (1998) used linear regression to examine the relationships between tongue movements, external face movements (lips, jaw, cheeks), and speech acoustics for two English sentences. The authors reported that face movements could be well predicted from tongue movements (with 83% of the variance accounted for). Furthermore, acoustic line spectral pairs (LSPs; Sugamura and Itakura, 1986) were moderately predicted from face movements and tongue movements (with about 50% of the variance accounted for). Barker and Berthommier (1999) examined the relationship between face movements and the LSPs of 54 French nonsense words. Using multilinear regression, the authors reported that face movements predicted from LSPs and RMS energy accounted for 56% of the variance of obtained measurements, while predicted acoustic features from face movements accounted for only 30% of the variance.


These studies have established that lawful relationships can be demonstrated between different speech data streams. However, the previous studies were based on limited data. In (Yehia et al., 1998), only two English sentences spoken by one English talker were studied. Barker and Berthommier (1999) examined the relationship using nonsense French phrases, and no tongue movements were recorded. In order to have confidence about the generalization of the relationships those studies reported, additional research with more varied speech materials and larger databases is needed.

Speech production is a dynamical process, and it is expected that correlations between articulatory movements and speech acoustics change over time and thus with context. However, using sentences (Yehia et al., 1998) or phrases (Barker and Berthommier, 1999) as basic analysis units may not be appropriate. Recently, Lee et al. (2001) used a Kalman filtering technique to examine these relationships for phonemes, and the authors reported high correlations (more than 0.95) between predicted (from speech acoustics) and measured vocal tract trajectories. Barbosa and Yehia (2001) reported that linear analysis on segments of 0.5-second duration could yield high correlations. Therefore, it is reasonable and desirable to use consonant-vowel (CV) syllables (or even phonemes) for the mapping. Towards this end, CV nonsense syllables were studied in this dissertation. A database of sentences was also recorded, and correlation results were compared with those for CV syllables. In addition, analyses were also performed to examine possible effects associated with a talker's gender and visual intelligibility.

The relationships between face movements, tongue movements, and speech acoustics are most likely globally nonlinear. Indeed, nonlinear techniques (neural networks, codebooks, and Hidden Markov Models) have been used in other studies (Agelfors et al., 1999; Barker and Berthommier, 1999; Massaro et al., 1999; Yamamoto et al., 1998; Yehia et al., 1999). Yehia et al. (1998) also stated that various aspects of the speech production system are not related in a strictly linear fashion, and nonlinear methods may yield better results. However, from an implementation viewpoint, linear correlation is easy to implement and mathematically tractable. Therefore, in (Barbosa and Yehia, 2001; Lee et al., 2001), linear prediction (or Kalman filtering) was applied to small speech segments. In other words, these local linear functions can be used to approximate global nonlinear functions. Therefore, the relationships (between face movements, tongue movements, and speech acoustics) for CV syllables, which span a short time interval (locally), can be approximated by linear functions. A popular linear mapping technique for this purpose is multilinear regression (Sen and Srivastava, 1990). For sentences, nonlinear techniques should probably be applied.

In this dissertation, a dynamical model was also explored to enhance the relationship between face movements and speech acoustics. In general, the computational expense increases exponentially with the number of features used. Barker and Berthommier (1999) added the 1st-order derivative to static LSPs in the estimation process. Only a small enhancement (from 0.36 to 0.37) was found, because the 1st-order derivative captures only short-term correlations in the signal. Lee et al. (2001) applied dynamical information by dividing sentences into segments. The dynamical model proposed in this dissertation, however, uses a causal and a non-causal filter based on the autocorrelations of the acoustics and those of the face movements.
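For reference, the sketch below shows one simple way to append first-order dynamical (delta) information to static LSP frames, in the spirit of the derivative features used by Barker and Berthommier (1999). The function name and window length are hypothetical; the causal and non-causal filters actually proposed in this dissertation are derived in Chapter 4.

```python
import numpy as np

def add_delta_features(static, window=2):
    """Append first-order regression (delta) coefficients to static frames.

    static : (T, d) array of static acoustic features (e.g., LSPs).
    window : number of frames on each side used in the regression.
    Returns a (T, 2d) array of [static, delta] features.
    """
    T, d = static.shape
    padded = np.pad(static, ((window, window), (0, 0)), mode="edge")
    num = np.zeros_like(static, dtype=float)
    for k in range(1, window + 1):
        num += k * (padded[window + k:window + k + T] - padded[window - k:window - k + T])
    delta = num / (2.0 * sum(k * k for k in range(1, window + 1)))
    return np.hstack([static, delta])

# Hypothetical usage: 16 LSP channels over 300 frames.
lsp = np.random.default_rng(1).standard_normal((300, 16))
augmented = add_delta_features(lsp)   # shape (300, 32)
```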

1.4. The Relationship between Visual Speech Perception and Physical Measures

Potential applications of optical speech are now widely acknowledged, but there are several issues that need to be addressed. The first issue is the lack of sufficient optical measurements (Munhall and Vatikiotis-Bateson, 1998). Another issue is the scarcity of studies that attempt to correlate physical (optical) features and visual speech perception.

Work related to this question has been carried out in the auditory phonetic perception literature, where stimuli have been synthesized using parametric changes in various acoustic control parameters, and perceivers have been tested to determine the relationship between the physical parameters and their perceptual structure (Benkí, 2001; Cooper et al., 1952; Liberman, 1957; Lisker, 1975; Lisker and Abramson, 1964, 1970; Nearey, 1990, 1992, 1997). For example, acoustic speech synthesizers based on parameters (e.g., formants) were built for perceptual experiments, and the corresponding perceptual results were analyzed against physical parameters (Cooper et al., 1952; Liberman, 1957). This method (analysis-by-synthesis; Liberman and Mattingly, 1985) is limited, however, because the stimulus space is designed to investigate a very restricted number of segments: The method does not scale to the level of the general similarity structure across all segments or all words in a language. A method is needed to capture physical stimulus similarity across a wide range of stimuli and relate it to perception. In the study of auditory phonetic perception, a common method for obtaining knowledge about phonetic perception has been essentially descriptive and has three main components: (a) presenting speech stimuli to perceivers for their responses, (b) obtaining physical measures (formants, voice onset time, etc.) from these stimuli, and (c) examining the relationship between these perceptual data and physical measures (Chen and Alwan, 2000; Chen and Alwan, 2001; Soli and Arabie, 1979).

Like its counterpart in acoustics, optical phonetics can be explored using optical speech synthesis. Several researchers have performed visual speech perception experiments with synthetic visual speech (Benoît et al., 1996; Le Goff et al., 1994; Massaro, 1998; Walden et al., 1987). Le Goff et al. (1994) and Massaro (1998) reported significant gains in intelligibility from adding a synthetic face (or lips) to an acoustic speech synthesizer. Benoît et al. (1996) conducted audio-visual perceptual experiments and showed that speech identification accuracy increased as more of the face (lip, jaw, and cheek) was made visible. This result is predictable from the fact that speech articulators move somewhat independently to convey phonemic distinctions. For example, cheek movements often show clear evidence of deformation of the back cavity in the vocal tract. However, Benoît et al. (1996) and Le Goff et al. (1994) did not quantitatively examine the relationship between audio-visual perception and physical measures, and those studies were general in nature. Several other studies focused on generating more controllable, realistic, and natural face movements, but they did not perform perceptual experiments (Parke, 1974; Waters, 1987). Similarly, Massaro (1998) and Walden et al. (1987) did not make any physical measurements and thus did not examine the relationship.


Using optical speech synthesizers to study optical phonetics would limit the results due to the assumptions made in optical speech synthesis, because optical phonetics is still in its early stage. Therefore, the descriptive method is more appropriate for the current study. Several researchers have examined the visual perception of CV syllables, vowel-consonant-vowel (VCV) disyllables, or (nonsense) words (Bernstein et al., 2000b; Franks, 1972; Iverson et al., 1998; Kricos and Lesner, 1982; Owens and Blazek, 1985). Owens and Blazek (1985) examined differences in visemes for hearing-impaired and normal-hearing adult viewers using VCV disyllables, while Bernstein et al. (2000b) used CV syllables. Note that groups of phonemes that are visually indistinguishable, at some degree of accuracy, are called "visemes" (Benoît et al., 1992; Fisher, 1968). Auer, Bernstein, and their colleagues (Auer and Bernstein, 1997; Iverson et al., 1998) further computationally analyzed the effect of obtained visemes on lexical structure. Kricos and Lesner (1982) examined differences in visual intelligibility across talkers. These studies, however, fall short of exploring underlying physical (facial) characteristics. The only complete example in this direction is a study by Montgomery and Jackson (1983). Their study (four female talkers, 10 participants, and 15 vowels in /hVg/ nonsense words) showed that the physical measures were moderately successful as predictors of vowel perception (approximately 50% of the variance was accounted for), but the predictions were talker dependent. These predictions illustrated that visual vowel perception was cued by some physical measures. However, their study used only a few physical measures (lip height, lip width, lip aperture, acoustic duration, and visual duration), and there was no measurement in the cheek or chin areas. Furthermore, a similar experiment was not done for consonants, and little is known about the relationship between visual consonant perception and corresponding face movements.

Figure 1.2 Finding optical cues to visual speech perception.

In this dissertation, a descriptive method was used to examine the relationship between visual speech perception of spoken nonsense syllables (with a number of consonants and vowels) and the physical attributes of those stimuli (see Figure 1.2). This work extended Montgomery and Jackson's (1983) study in two directions: It focused on consonants and used more physical measures. Unlike vowels, physical attributes of consonants are difficult to describe, because the fast transition between the consonant and the vowel involves significant dynamical characteristics of the face movements. To date, most measurements on talkers' faces have applied to the lip borders. For example, studies on computer audio-visual speech recognition have focused on the region of interest around the lips (Chan et al., 1998; Luettin et al., 1996; Potamianos et al., 2001). Although the lip area is the most informative, lipreading also depends on the other parts of the face (Benoît et al., 1996). This is especially true for back phonemes, where cheek movements are prominent. Therefore, a thorough examination of face movements should include the whole face. The need for investigating areas other than the lips does, however, impose problems for image processing techniques, as such movements are more difficult to detect automatically. For the current study, an optical motion capture system was used to record accurately the dynamical positions of markers on talkers' faces (including chin and cheeks).

Visual consonant perception was examined in CV syllables, and only face

(including chin and cheeks) movements were recorded (no tongue movements). The similarity structure of the visual perceptual data was defined by analyzing the forced- choice perceptual identifications. The consonant identification data were tallied in stimulus-response confusion matrices that were transformed into dissimilarity matrices using the phi-square transformation (Iverson et al., 1998). The similarity structure of the face measurement data was defined by analyzing the 3-D motion capture data for each consonant segment. Multilinear regression was then used to predict perceptual dissimilarity vectors from the physical distance matrices. In addition, multidimensional scaling (MDS; Kruskal and Wish, 1978) was applied to both the perceptual dissimilarity matrices as well as matrices predicted from physical measures, and a phoneme equivalence class (PEC) analysis was used to examine the grouping of consonants.
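As an illustration of the confusion-to-dissimilarity step, the sketch below computes a simple phi-square style distance between the normalized response rows of a stimulus-response confusion matrix. This particular formula is an assumption made for illustration only; the transformation actually used here follows Iverson et al. (1998) and is defined in Chapter 3.

```python
import numpy as np

def phi_square_distances(confusions):
    """Pairwise dissimilarities between stimuli from a confusion matrix.

    confusions : (S, R) array of response counts (rows: stimuli, cols: responses).
    Returns an (S, S) symmetric matrix of phi-square style distances between
    the normalized response distributions of each pair of stimuli.
    """
    P = confusions / confusions.sum(axis=1, keepdims=True)   # response proportions
    S = P.shape[0]
    D = np.zeros((S, S))
    for i in range(S):
        for j in range(i + 1, S):
            denom = P[i] + P[j]
            mask = denom > 0
            d = np.sum((P[i, mask] - P[j, mask]) ** 2 / denom[mask])
            D[i, j] = D[j, i] = d
    return D

# Hypothetical example: a 23-consonant confusion matrix with random counts.
rng = np.random.default_rng(2)
C = rng.integers(0, 20, size=(23, 23)) + np.eye(23, dtype=int) * 40
D = phi_square_distances(C)   # feed D to the MDS or regression analyses
```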


Using stimulus-response confusion matrices is a common way to obtain perceptual similarity and thus a visual perceptual space. By hypothesis, humans perceive speech (auditory or visual) by judging the similarity between stimuli and templates in memory. This is yet another issue complicating the understanding of speech perception. An ability to assess similarity lies close to the core of cognition, and geometric and feature models have been among the most influential approaches to analyzing similarity (Blough, 2001; Goldstone, 1999). Geometric models assume three basic metric axioms (minimality, symmetry, and the triangle inequality) and are exemplified by multidimensional scaling (MDS) models (Shepard, 1962, 1987; Torgerson, 1965). The geometric approach stresses the representation of similarity relationships among the members of a set of objects. However, geometric models' metric assumptions are open to question (Blough, 2001). Tversky (1977) suggested an alternative approach, the contrast model, wherein similarity is determined by matching features of compared entities and by integrating the features. The features are assumed to be independent and can be manipulated via logical or set operations. However, unlike acoustic phonetic features (Chomsky and Halle, 1968; Jakobson et al., 1952; Stevens, 1989; Stevens and Blumstein, 1981; Stevens and Keyser, 1989), visual phonetic features have not been adequately studied, and currently there is no convincing definition of features from the raw physical data. Therefore, a geometric model is used in this dissertation for similarity. The objective is to show that direct (geometric) physical measures suffice to predict visual speech perception, and thus mediating features are not required.
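For reference, the three metric axioms mentioned above can be written, for a dissimilarity d defined on stimuli x, y, and z, as follows (standard definitions, not specific to this dissertation):

```latex
\begin{align*}
&\text{Minimality:}          && d(x,y) \ge d(x,x) = 0 \\
&\text{Symmetry:}            && d(x,y) = d(y,x) \\
&\text{Triangle inequality:} && d(x,z) \le d(x,y) + d(y,z)
\end{align*}
```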

We predicted that the whole face would yield better correspondence between visual speech perception and physical measures than parts of the face. If lipreading participants did not see the visual stimuli but simply guessed, then the correlation should be about zero; if participants saw only the lips, then the correlations should be higher; and the whole face should yield the best results. The level of correlation shows how much visual perception is derived from physical measures and consequently shows whether direct physical measures (rather than phonological features) can be used to construct a parsimonious explanation for visual speech perception. In addition, these correlations can support future studies that examine which regions on the face are important for visual speech perception. It is desirable to have a uniform relationship between visual speech perception and physical measures across talkers. However, talkers differ in physical conformation and pronunciation style. Therefore, talker differences were also investigated by examining how each part of the face contributes differently to visual speech perception for each talker. For nonsense syllables, a higher perceived intelligibility (or a larger face) may lead to a more detailed examination of face characteristics during lipreading (less guessing) and thus result in higher correlations.

1.5. Outline of This Dissertation

Chapter 2 describes the database used for the studies of the relationship between face movements, tongue movements, and speech acoustics and the relationship between optical speech perception and physical measures. Chapter 3 describes the basic mathematical techniques used in this dissertation. Chapter 4 presents a detailed analysis of correlations between face movements, tongue movements, and speech acoustics. A dynamical model for the enhancement of the correlation between face movements and speech acoustics is also presented in Chapter 4. The relationship between visual consonant perception and optical measures is explored in Chapter 5. Chapter 6 presents a mutual-information-face algorithm for analyzing the correlation between face movements and speech acoustics. Finally, the implications and future directions are discussed in Chapter 7.


CHAPTER 2. DATA COLLECTION AND PRE-PROCESSING

2.1. Introduction

In this dissertation, two aspects of visual speech were examined. One was the relationship between face movements, tongue movements, and speech acoustics. The other was the similarity structure between visual speech perception and optical (physical) measures.

For the correlation analysis, face movements, tongue movements, speech acoustics, and video (for reference only) were recorded simultaneously. For the similarity analysis, face movements, speech acoustics, and video (for presentations) were recorded simultaneously. This chapter describes the acquisition and processing of the databases.

The author of this dissertation participated in most recording sessions and perceptual experiments. To make this dissertation informative and complete, the author uses some information provided by House Ear Institute (HEI) and the UCLA Phonetics Laboratory. Note that several pictures used in this dissertation have been adapted from ones on the two websites: http://www.hei.org/research/projects/comneur/speechdata.htm and http://www.linguistics.ucla.edu/faciliti/uclaplab.html.

For example, the pictures about the data recording were adapted from those of the HEI website, and the pictures about the Electromagnetic Articulography (EMA) system and the head profile (including the one used in Chapter 4) were adapted from those of the UCLA Phonetics Laboratory website.

2.2. Background

This work represents the first attempt to record several speech-related data streams simultaneously. Yehia et al. (1998) recorded face movements and tongue movements in different sessions and then aligned them using dynamic time warping. The limitation of their method is that the recording configurations differed across sessions, and a talker cannot produce utterances in exactly the same way twice. However, recording multiple high-quality data streams simultaneously was very difficult and time consuming for the studies in this dissertation. First, the recording systems needed to be synchronized, and a customized circuit was designed for this purpose. Second, wearing an EMA helmet with retro-reflectors on the face and EMA pellets on the tongue was stressful for the talkers. Third, a crew was needed to operate the different pieces of equipment and monitor the process. Fourth, the effects of the EMA and Qualisys™ systems on the acoustic recording needed to be compensated for, and the microphone's location had to be carefully chosen. Fifth, there was extensive preparation before the recording: putting retro-reflectors on the face, putting pellets on the tongue, setting up the lighting, and calibrating the systems. Sixth, there was extensive work on post-processing of the raw data. The advantage of this simultaneous recording was that all data streams were recorded in the same recording configuration.


2.3. Recording a Database for the Correlation Analysis

2.3.1. Talkers

The talkers (all native speakers of English) were selected from a larger pool that had been initially screened for visual intelligibility. Each talker was video-recorded saying 20 different sentences. Five deaf adults were asked to transcribe the video recording (without audio) in regular English orthography for each talker and assign a subjective intelligibility rating to each sentence. Thus, each talker received 20 intelligibility-rating scores from each lipreader, and a mean was calculated. As a result, two male (M1 and M2) and two female (F1 and F2) talkers with different subjective intelligibility ratings were selected. Among the four talkers, M2 was perceived to be the most intelligible, F1 the least intelligible, and M1 and F2 moderately intelligible. The mean sentence intelligibility ratings were 3.6, 8.6, 1.0, and 6.6 for talkers M1, M2, F1, and F2, respectively, on a scale ranging from 1 (not intelligible) to 10 (very intelligible). The average percent words correct for the talkers were 46% for M1, 55% for M2, 14% for F1, and 58% for F2. The correlation between the objective English orthography results and the subjective intelligibility ratings was 0.89. Note that F2 had the highest average percent words correct, but she did not have the highest intelligibility rating.

Subsequently, extensive additional visual-only speech perception testing, with 16 normal-hearing human subjects, of 320 sentences produced by each of these four talkers showed that F2 was the most intelligible, then M1, followed by M2 and F1. These results were replicated with eight deaf lipreaders, except that M2 was more intelligible than M1 (i.e., F2 > M2 > M1 > F1).

Note that the visual intelligibility ratings used in this dissertation refer to the more extensive objective ratings, not to the initial subjective ratings.

2.3.2. Materials

The experimental corpus obtained with the four selected talkers consisted of 69 consonant-vowel (CV) syllables in which the vowel was one of /a, i, u/ and the consonant was one of the 23 American English consonants /y, w, r, l, m, n, p, t, k, b, d, g, h, θ, ð, s, z, f, v, ʃ, ʒ, tʃ, dʒ/. Each syllable was produced at least four times in a pseudo-randomly ordered list. In addition to the CVs, three sentences were recorded and produced at least four times by each talker:

1. When the sunlight strikes raindrops in the air, they act like a prism and form a rainbow.

2. Sam sat on top of the potato cooker and Tommy cut up a bag of tiny potatoes and popped the beet tips into the pot.

3. We were away a year ago.

Sentences 1 and 2 are the same sentences used by Yehia et al. (1998). Sentence 3 contains only voiced sonorants (and /g/). During recording, the crew monitored each CV or sentence production. If one syllable/sentence was not produced well, then that syllable/sentence and the following syllables/sentences in the same take were repeated.


For example, if the first syllable/sentence in a take needed to be repeated, then the whole take was repeated; if the last syllable/sentence in a take needed to be repeated, then only that syllable/sentence was repeated. Therefore, each syllable/sentence had at least four good repetitions. Note that sentences were recorded after all CV syllables were recorded.

2.3.3. Recording Facilities

The recordings were made at HEI. Researchers at HEI screened talkers and lipreading participants, set up the recording facilities (video, Qualisys™, acoustic tiles, lighting equipment, teleprompter, etc.), designed synchronization schemes, processed the raw data, set up the perceptual experiments, and managed and oversaw the whole process. Researchers from the UCLA Phonetics Laboratory were responsible for the EMA data recording and processing (they provided the EMA recording system) and the placement of retro-reflectors. Researchers from the UCLA Speech Processing and Auditory Perception Laboratory (SPAPL) were responsible for acoustic settings (they provided acoustic recording equipment), participated in the recording sessions and processing of raw data, and administered the perception experiments.

Four data streams were recorded simultaneously: acoustics, video, tongue motion, and 3-D face point motion. The recording facilities were described in (Bernstein et al., 2000a). The equipment used in the current study is listed in Table 2.1.


Table 2.1 Summary of equipment used in the recordings.

Audio: DAT recorder (HHB Portadat PDR1000 recorder; Sennheiser MKH416P48U microphone)
Video: VTR (Sony UVW-1800 recorder; Sony DXC-D30 camera)
Face motion: Qualisys™ system (MCU120/240 Hz CCD imagers; passive 4-mm retro-reflectors; infrared flash; www.qualisys.se)
Tongue motion: Carstens EMA system (Medizinelektronik AG100; 10-sensor recording; midsagittal plane; www.articulograph.de)

DAT recorder. During recording, a DAT recorder and a directional Sennheiser microphone were used to obtain the acoustic signals. The sampling frequency of the audio was 44.1 kHz. Because the microphone had to be out of the way of video and facial motion recording, the microphone was put at an angle of about 30 degrees below the chin and at a distance of about 5 inches. For the DAT recorder, one channel was used to record the audio signal; the other one was used to record a sync pulse.

Qualisys™ Motion Capture System. Face motion was captured with a Qualisys™ optical motion capture system using three infrared emitting-receiving cameras. During recording, passive retro-reflectors glued on the talker are illuminated by infrared flashes (not perceptible to the talker) emitted from the three cameras. Each camera produces size and two-dimensional position information (in the imager plane) about the retro-reflectors, at the rate of 120 Hz. Using the information gathered from each camera, as well as calibration data based on camera position, the 3-D position of each retro-reflector is calculated for each sampling period with a precision of 10 µm. The reconstruction of a retro-reflector's position depends on having data from at least two of the three cameras. When retro-reflectors are only seen by a single camera, dropouts in the motion data occur (missing data). In addition, when two retro-reflectors are too close to one another, dropouts occur. Usually, dropouts were only a few frames in duration, and only one or two retro-reflectors were missing at a time.
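Because dropouts were short and affected only one or two retro-reflectors at a time, interpolating across the gap is a natural recovery strategy. The sketch below (hypothetical function; plain linear interpolation assumed) illustrates the idea; the recovery procedure actually applied to the optical data is described in Section 2.6.2.

```python
import numpy as np

def fill_short_dropouts(track, max_gap=5):
    """Linearly interpolate short runs of missing samples in one coordinate track.

    track   : 1-D array of a retro-reflector coordinate, with NaN for dropouts.
    max_gap : longest gap (in frames) that will be filled; longer gaps stay NaN.
    """
    track = track.astype(float).copy()
    missing = np.isnan(track)
    idx = np.flatnonzero(missing)
    if idx.size == 0:
        return track
    # Group consecutive missing indices into gaps.
    gaps = np.split(idx, np.where(np.diff(idx) > 1)[0] + 1)
    good = np.flatnonzero(~missing)
    for gap in gaps:
        if len(gap) <= max_gap:
            track[gap] = np.interp(gap, good, track[good])
    return track

# Hypothetical usage on one coordinate sampled at 120 Hz:
x = np.sin(np.linspace(0, 3, 240))
x[100:103] = np.nan                    # a three-frame dropout
x_filled = fill_short_dropouts(x)
```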

VTR. A Sony recorder and a Sony digital video camera were used to record the video. The video was recorded onto BETACAM SP tape. Lighting and positioning were carefully adjusted to obtain the best view: talkers were lit with spotlights on both sides at the same height as their face and with an overhead fill light. During recording, talkers watched the screen of a teleprompter that was mounted below the camera.

EMA system. A Carstens AG-100 EMA system was used to track the tongue movements. The principle of the system is illustrated in Figure 2.1 (also see the UCLA Phonetics Lab's website, http://www.linguistics.ucla.edu/faciliti/uclaplab.html). During recording, a talker wears a helmet with three fixed transmitters: one near the forehead, one near the chin, and one near the back of the neck. The positions of these transmitters form an equilateral triangle whose sides limit the measurement plane of the system. The transmitters produce alternating magnetic fields at three different frequencies, and these fields induce alternating currents in the sensors. The distance between each sensor and each transmitter can be determined from the strength of the current at a given frequency (the electromagnetic field strength in a receiver is inversely proportional to the cube of its distance from a transmitter). It is then possible to calculate each sensor's XY coordinates and its tilt relative to the transmitter axes. To track vocal tract motion, a number of receiver pellets (sensors) are placed on the talker's articulators (such as the tongue, jaw, lips, and/or teeth) along the midsagittal plane. The induced voltages on these receiver pellets are sampled at 666 Hz, and thus each pellet's movements can be recovered. This EMA system could track up to 10 pellets, but fewer were used in the current experiments.
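In idealized form, the distance computation rests on the inverse-cube relation noted above; with V the induced signal strength at one transmitter frequency and k a calibration constant (symbols introduced here only for illustration):

```latex
V \propto \frac{1}{d^{3}} \quad\Longrightarrow\quad d = \left(\frac{k}{V}\right)^{1/3}
```

The three transmitter distances obtained this way jointly determine each pellet's position and tilt in the measurement plane.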

Figure 2.1 How the EMA system works.

2.3.4. Placement of EMA Pellets and Qualisys™ Retro-Reflectors

The physical attributes analyzed here are space- and time-sampled face (or tongue) movements. Space sampling was performed by carefully choosing a number of points of interest on the face (or tongue). Due to system limitations, only a small number of markers were used to sample the face (or tongue) movements. Figure 2.2 shows the overall configuration of the helmet, receivers, wires, etc., when a talker is set up for simultaneous EMA and Qualisys™ recording.

Figure 2.2 Optical retro-reflectors and EMA pellets.

Figure 2.3 Placement of optical retro-reflectors and EMA pellets.

Figure 2.3 shows the placement of the optical retro-reflectors and EMA pellets in the current experiments. Three EMA pellets (tongue back, tongue middle, and tongue tip) were placed on the tongue, one on the lower gum (for jaw movement), one on the upper gum, one on the chin, and one on the nose bridge. One EMA channel, which was used for synchronization with the other data streams, and two pellets, which were used only at the beginning of each session for defining the bite plane, are not shown in Figure 2.3. The pellets on the nose bridge (R2) and upper gum (R1), the most stable sensors available, were used for reference only (robust head movement correction). The pellet on the chin (e), which was co-registered with an optical retro-reflector, was used only for synchronization of tongue and face motion, because a chin retro-reflector (19) was used to track face movements. The chin generally moves with the jaw, except when the skin is pulled by the lower lip. This can happen, for example, when bilabial stops are produced; in such cases, the chin sometimes rises while the jaw stays still or moves down. The pellet on the lower gum (d), which is highly correlated with the chin pellet (e), was not used in the analysis. Hence, only the movements from the three pellets on the tongue (a, b, and c in Figure 2.3) went into the analysis of tongue movements.

There were 20 optical retro-reflectors. They were placed on the nose bridge (one), eyebrows (two), lip contour (eight), chin (three), and cheeks (six). The retro-reflectors on the nose bridge and eyebrows (retro-reflectors 1, 2, and 3) were used for head movement compensation only (they were not used in the analyses). The 17 retro-reflectors used to quantify speech movement were divided into three sets: lip retro-reflectors (9, 10, 11, 12, 13, 15, 16, and 17), chin retro-reflectors (18, 19, and 20), and cheek retro-reflectors (4, 5, 6, 7, 8, and 14).


2.3.5. Synchronization

Figure 2.4 Synchronization of optical, EMA, and audio.

EMA and optical data were temporally aligned by the co-registered EMA pellet and optical retro-reflector on the chin as well as by a special time-sync signal. At the beginning of each recording, a custom circuit (Bernstein et al., 2000a), which analyzed signals from the optical and video systems, generated a 100-ms pulse that was sent to one

EMA channel and a 100-ms 1-kHz pure tone that was sent to the DAT line input for synchronization. Figure 2.4 illustrates the synchronization scheme. The QualisysTM system initiated the sync tone and sync pulse first; the audio system was then synchronized by finding the tone position. The sync pulse in the EMA data can help to find an approximate starting point, and then an alignment between the co-registered chin pellet and retro-reflector gives an exact synchronization between EMA and optical data.


2.3.6. Recording Procedure

Figure 2.5 Overall recording scene.

The overall recording scene is illustrated in Figure 2.5. Talkers sat inside a sound-treated booth. The recording crew stayed in another room, monitored the talker through the video display, and communicated with the talker through microphones and speakers (the speaker inside the sound-treated booth was turned off during recording). In each recording session, a crew of about 5-6 people were involved and assigned to EMA,

QualisysTM, teleprompter, audio, overall direction, and communication with the subject. Prior to recording, the talkers practiced producing each syllable and were coached to produce the stimuli with a falling intonation. A pseudo-randomly ordered


syllable or sentence list was presented on the teleprompter. The talkers looked directly into the camera, and their faces filled the video monitor.

Each session lasted about 4 hours. It took 1.5 - 2 hours to place the retro-reflectors, calibrate the systems, and confirm data quality. Note that recording did not go on continuously.

Due to memory limitations of the QualisysTM and Carstens systems, each recording (one take) lasted 22 seconds. After each take, we saved the optical and EMA data, and then began a new take. A green light inside the sound-treated booth indicated recording. After each session, the data related to this dissertation were processed as follows: Raw video was copied to dubs; video in and out point time codes were identified for each take, and a take-list was compiled; the 3-D retro-reflectors were tracked across frames and labeled; the EMA data were filtered and rotated into the midsagittal plane; a master listing was generated to provide the synchronization information across different data streams; audio data on the DAT tape were transferred to a PC. Note that each optical file represented data for one take (22 seconds), while audio data were recorded continuously. Each EMA file also represented data for one take but required synchronization with the other streams.

2.4. Recording a Database for the Perceptual Similarity Analysis

The database for the perceptual similarity analysis was similar to that used in the correlation analysis. Acoustic signals were recorded and were used to select the starting and ending points for the analyses of optical speech. However, there were several major differences:


1. Tongue movements were not recorded, and the EMA system was not used.

2. Only CV syllables were recorded (no sentences).

3. Each CV syllable was repeated at least twice. The crew monitored each CV production. If one syllable was not produced well, then that syllable and the following syllables in the same take were repeated. For example, if the first syllable in a take needed to be repeated, then the whole take was repeated; if the last syllable in a take needed to be repeated, then only that syllable was repeated. Therefore, each syllable had at least two good repetitions.

2.5. Perceptual Experiments

2.5.1. Participants

The participants were six individuals (four women and two men; average age 32 years; range 22 to 43) with normal hearing, normal or corrected-to-normal vision, English as a native language, and average or better lipreading.

2.5.2. Video Presentations

For the similarity database, each CV syllable had at least two good repetitions. However, for the perceptual experiments, only two repetitions were selected for the current study; each began with the mouth closed and ended with a closed mouth. During the perceptual identification testing, only the video recordings were presented to participants. The


selected tokens were stored on BETACAM SP video in lists that were blocked by talker.

Tokens were pseudo-randomized across consonants and vowels to counterbalance the effects of token order. Two tapes were recorded. On Tape 1, tokens were pseudo-randomized, and the talker order was M1-F2-M2-F1. On Tape 2, tokens were pseudo-randomized, and the talker order was F2-M1-F1-M2.

2.5.3. Procedure

Testing took place in a sound-treated IAC booth. A Sony BETACAM videotape player

(outside the sound-treated booth) was controlled by a personal computer that was placed on the table (inside the sound-treated booth). Stimuli were presented on a 19” high-resolution color monitor (Sony Trinitron) placed next to the PC monitor at a distance of about one meter from the participants.

Visual perception was assessed using high-quality video recordings as described in Section 2.5.2. The two videotapes (without sound) were presented to each participant five times. Therefore, for each CV syllable, there were 480 responses (2 repetitions, 2 tapes, 5 trials, 4 talkers, and 6 participants).

The participants’ task was to make visual-only 23-alternative forced-choice identification. Testing was administered in blocks of 138 pseudo-randomized items (2 repetitions x 69 tokens) for each of the four talkers. At the beginning of each session, instructions were displayed on the PC monitor. After the participants acknowledged reading the instructions, a computer program proceeded to present each of the video stimuli. A practice set of 10 trials was given on Day 1 only. No feedback was given at


any time. There was no time limit on how quickly the participants had to respond to each visual stimulus.

After each stimulus presentation, the video monitor became black. At the same time, the graphical display on the PC monitor was activated, showing the 23 consonant labels (in an arbitrary order) with corresponding sample words to exemplify pronunciation. The participant then chose (guessing if necessary) a response using a computer mouse. Once the participant responded, no modification was allowed.

Following the response, the computer program automatically switched to the presentation of the next syllable. Typically, the participants responded to one list for each talker on each day of testing. Each list took about 16 minutes to finish, and a five-minute break was given between lists. The order of viewing the tapes was counter-balanced across participants. Half of the participants began with Tape 1 and the other half began with

Tape 2, and then they viewed the other tape. Occasionally, participants received more than four blocks of trials but not more than eight. A half-hour rest was always given when the number of blocks exceeded four. Each participant contributed ten responses for each stimulus token. Five participants completed testing within three weeks, and the sixth within eight weeks.

2.6. Conditioning the Data

2.6.1. Audio

The audio signals were continuously recorded. Therefore, the first step was to find the


starting point for each take. The synchronization was done by finding the position of the pure tone, as shown in Figures 2.6 and 2.7. After locating the tone position, a 22-second segment of audio was cut out for each take.

Figure 2.6 Sync tone in audio signals.

Figure 2.7 A close look at the sync tone.
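As an illustration of this step, the following minimal Python sketch locates the 1-kHz sync tone in a continuous (assumed mono) recording and cuts out a 22-second take. It is not the software actually used; the file name, the 10-ms analysis hop, and the 0.8 energy-ratio threshold are illustrative assumptions.

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

# Continuous session audio transferred from the DAT tape (placeholder file name).
fs, audio = wavfile.read("session_audio.wav")
audio = audio.astype(np.float64)

# Narrow band-pass around 1 kHz to isolate the sync tone.
sos = butter(4, [950, 1050], btype="bandpass", fs=fs, output="sos")
tone_band = sosfiltfilt(sos, audio)

# Short-time energy ratio (in-band energy / total energy) with a 10-ms hop.
hop = int(0.010 * fs)
n_frames = len(audio) // hop
ratio = np.empty(n_frames)
for k in range(n_frames):
    seg = slice(k * hop, (k + 1) * hop)
    total = np.sum(audio[seg] ** 2) + 1e-12
    ratio[k] = np.sum(tone_band[seg] ** 2) / total

# The tone onset is the first frame dominated by the 1-kHz band.
tone_onset = int(np.argmax(ratio > 0.8)) * hop

# Cut a 22-second segment of audio starting at the tone position.
take = audio[tone_onset: tone_onset + int(22 * fs)]
print(f"tone found at {tone_onset / fs:.3f} s; take length {len(take) / fs:.1f} s")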

Speech acoustic signals were originally sampled at 44.1 kHz and then down-sampled to 14.7 kHz. Speech signals were then divided into frames. The frame length and shift were 24 ms and 8.33 ms, respectively. Thus the frame rate was 120 Hz, which was consistent with the QualisysTM sampling rate. For each acoustic frame, pre-emphasis was applied. Then a covariance-based LPC algorithm (Rabiner and Schafer, 1978) was used to obtain 16th-order Line Spectral Pair (LSP) parameters (eight pairs) (Sugamura and

Itakura, 1986). If the vocal tract is modeled as a non-uniform acoustic tube of p sections of equal length (p = 16 in this dissertation), the LSP parameters indicate the resonant


frequencies at which the acoustic tube shows a particular structure under a pair of extreme artificial boundary conditions: complete opening and closure at the glottis. LSPs

are derived from the LPC polynomial $A(z) = 1 + \sum_{k=1}^{p} a_k z^{-k}$, which can be decomposed into:

P(z) = A(z) + z^{-(p+1)} A(z^{-1}) (2.1)

Q(z) = A(z) - z^{-(p+1)} A(z^{-1}) (2.2)

The roots of P(z) and Q(z) lie on the unit circle and alternate. LSPs have a tendency to approximate the formant frequencies (Sugamura and Itakura, 1986; Yehia et al., 1998). LSP parameters have good temporal interpolation properties, which are desirable (Yehia et al., 1998). The RMS energy (in dB) was also calculated.
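The frame-based parameterization can be sketched in Python as follows. The sketch substitutes the simpler autocorrelation LPC for the covariance-based LPC cited above, and the Hamming window and the 0.97 pre-emphasis coefficient are illustrative choices; it is meant only to show how the LSP frequencies follow from the roots of P(z) and Q(z) in Equations 2.1 and 2.2, together with the RMS energy.

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorr(frame, order=16):
    """Return A(z) = [1, a_1, ..., a_p] by the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], -r[1:order + 1])
    return np.concatenate(([1.0], a))

def lsp_from_lpc(A):
    """LSP angular frequencies (radians) from the roots of P(z) and Q(z)."""
    A_pad = np.concatenate((A, [0.0]))
    P = A_pad + A_pad[::-1]            # P(z) = A(z) + z^-(p+1) A(z^-1)
    Q = A_pad - A_pad[::-1]            # Q(z) = A(z) - z^-(p+1) A(z^-1)
    angles = np.concatenate((np.angle(np.roots(P)), np.angle(np.roots(Q))))
    return np.sort(angles[(angles > 1e-6) & (angles < np.pi - 1e-6)])

def lspe_frames(x, fs=14700, frame_ms=24.0, rate_hz=120.0, order=16):
    """Return (16 LSPs + RMS energy in dB) for each analysis frame."""
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])        # pre-emphasis
    flen, shift = int(frame_ms * 1e-3 * fs), int(round(fs / rate_hz))
    feats = []
    for start in range(0, len(x) - flen, shift):
        frame = x[start:start + flen] * np.hamming(flen)
        lsp = lsp_from_lpc(lpc_autocorr(frame, order))
        rms_db = 20.0 * np.log10(np.sqrt(np.mean(frame ** 2)) + 1e-12)
        feats.append(np.concatenate((lsp, [rms_db])))
    return np.array(feats)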

2.6.2. Optical Data

Figure 2.8 A new 3-D coordinate system defined by retro-reflectors 1, 2, and 3.

Compensation for head movements. Although the talkers were instructed to sit quietly


and focus on the camera (actually, the teleprompter), small head movements occurred.

The recording session for each talker lasted about one hour. Over such a long period, the head movements were noticeable. In order to quantify and examine face movements, it was necessary to compensate for head movements. During recording, the retro-reflectors on the nose bridge (2) and eyebrows (1 and 3) were relatively stable across the session, and their movements were mainly due to head movements. Note that for spontaneous speech, eyebrows can be very mobile. Keating et al. (2000), however, found that eyebrow movements accompany the production of focused words in sentences, but not of isolated words. Therefore, these three retro-reflectors were used for head movement compensation as shown in Figure 2.8. Plane 1 was through retro-reflectors 1, 2, and 3.

Plane 2 was defined as perpendicular to the line between retro-reflectors 1 and 3 and through retro-reflector 2. Plane 3 was perpendicular to planes 1 and 2 and through retro-reflector 2. These three planes were mutually perpendicular and thus defined a 3-D coordinate system with the origin at the nose bridge. In the new axes, the x axis was perpendicular to plane 2 and represented left and right movements; the y axis was perpendicular to plane 3 and represented up and down movements; and the z axis was perpendicular to plane 1 and represented near and far movements. Although the two retro-reflectors on the eyebrows had small movements, they usually moved in the same direction. Therefore, these planes were relatively stable.
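The following short Python sketch illustrates this head-movement compensation under the assumption that the 20 retro-reflectors are stored in numerical order (so markers 1-3 occupy the first three rows of each frame); it simply rebuilds the three axes on every video frame and re-expresses all markers relative to the nose bridge.

import numpy as np

def head_fixed_coordinates(markers):
    """markers: array (n_frames, 20, 3) of raw 3-D positions, markers 1-3 first.
    Returns an array of the same shape expressed in the head-fixed frame."""
    out = np.empty_like(markers)
    for t, frame in enumerate(markers):
        m1, m2, m3 = frame[0], frame[1], frame[2]       # eyebrow 1, nose bridge 2, eyebrow 3
        x_axis = m3 - m1                                # perpendicular to plane 2: left-right
        x_axis /= np.linalg.norm(x_axis)
        z_axis = np.cross(m1 - m2, m3 - m2)             # normal of plane 1: near-far
        z_axis /= np.linalg.norm(z_axis)
        y_axis = np.cross(z_axis, x_axis)               # perpendicular to plane 3: up-down
        R = np.stack((x_axis, y_axis, z_axis), axis=1)  # columns are the new axes
        out[t] = (frame - m2) @ R                       # translate to the nose bridge, then rotate
    return out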

Compensation for facial retro-reflector dropouts. During the recording of the

CV syllables and sentences, there was a small percentage of dropouts of optical retro-reflectors (Table 2.2). Dropout means that in the reconstructed 3-D optical retro-reflector


movements, the motion data for one of the retro-reflectors disappeared for a few frames, as shown in Figure 2.9. Because the dropouts happened for only a few frames, and only one or two retro-reflectors were missing at the same time, the remaining movements of the retro-reflector and movements from other retro-reflectors were used to predict missing segments. One example is shown in Figure 2.9: Retro-reflector 8 (Middle left face) was missing for 12 frames. Although the face was not strictly symmetrical, retro-reflector 14 (Middle right face) was highly correlated with retro-reflector 8. Non-dropout frames from retro-reflectors 8 and 14 were used to predict the missing data using a least-squares criterion. This method can be expressed as follows:

x_{MLF\_good} = a \cdot x_{MRF\_good} + b (2.3)

x_{MLF\_dropout} = a \cdot x_{MRF\_dropout} + b (2.4)

where x is a value on the X coordinate. The non-dropout frames were used to define the transformation parameters a and b (using the least-squares method). Then these parameters were applied to predict the missing data.
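A minimal sketch of this recovery step is given below, assuming that dropouts are marked with NaN in the tracked coordinate arrays; the same fit would be repeated for the Y and Z coordinates.

import numpy as np

def recover_dropouts(x_mlf, x_mrf):
    """x_mlf, x_mrf: 1-D arrays of one coordinate (e.g., X) for markers 8 and 14."""
    x_mlf = x_mlf.copy()
    missing = np.isnan(x_mlf)
    good = ~missing & ~np.isnan(x_mrf)
    a, b = np.polyfit(x_mrf[good], x_mlf[good], deg=1)   # least-squares fit, Eq. 2.3
    x_mlf[missing] = a * x_mrf[missing] + b              # Eq. 2.4 applied to the dropouts
    return x_mlf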

Figure 2.9 Missing data recovery for optical data.


Table 2.2 Statistics of retro-reflector dropouts during the recording of CV syllables.

Retro-reflector dropout (%)        M1      M2      F1      F2
8.  Middle left face               3.99    0       4.09    2.30
9.  Middle left center             0       0       0.06    2.00
10. Upper lip left                 0       0       0.09    0.08
11. Upper lip center               0.97    0       4.07    0
12. Upper lip right                0.15    0       1.88    1.94
13. Middle right center            0.08    0       0       0
14. Middle right face              2.45    0       0       0
15. Lower lip left                 2.45    0       6.04    7.39
16. Lower lip center               3.56    9.86    0       0
17. Lower lip right                0.11    0       1.55    1.55
18. Chin left                      0       0       21.05   0
19. Chin center                    0       0       0.06    0
20. Chin right                     0       0       13.14   0

2.6.3. EMA Data

Each EMA file represents data for one take. However, EMA recording did not start at the same time as the QualisysTM recording. Therefore, in the EMA data, we needed to find when the optical recording started. This was done by (a) finding the sync pulse in the EMA sync channel and (b) comparing the chin pellet and chin retro-reflector to obtain a fine alignment. In Figure 2.10, a sync pulse was found in one EMA channel, and this pulse indicated the beginning of QualisysTM recording. A further correlation analysis between the chin pellet and the chin retro-reflector gave a more accurate alignment (Figure 2.11).
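The two-stage alignment can be sketched in Python as follows. The half-peak pulse threshold and the ±0.5-s fine-search range are illustrative assumptions, and the vertical chin coordinates are used only as an example of a movement signal shared by the two systems.

import numpy as np
from scipy.signal import resample_poly

def align_ema_to_optical(ema_sync, ema_chin_y, opt_chin_y,
                         fs_ema=666, fs_opt=120, search_s=0.5):
    # (1) Coarse start: first sample where the sync channel exceeds half its peak.
    pulse_start = int(np.argmax(ema_sync > 0.5 * np.max(ema_sync)))

    # (2) Fine alignment on the chin channels, both at the 120-Hz optical rate.
    ema_120 = resample_poly(ema_chin_y[pulse_start:], up=fs_opt, down=fs_ema)
    n = min(len(ema_120), len(opt_chin_y))
    max_lag = int(search_s * fs_opt)
    best_lag, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        e = ema_120[max(0, lag): n + min(0, lag)]
        o = opt_chin_y[max(0, -lag): n - max(0, lag)]
        r = np.corrcoef(e, o)[0, 1]
        if r > best_r:
            best_lag, best_r = lag, r
    return pulse_start, best_lag, best_r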


Figure 2.10 Finding the sync pulse in EMA data.

Figure 2.11 Alignment between the EMA chin pellet and optical chin retro-reflector.


2.6.4. All Three Data Streams

Figure 2.12 Conditioning of the three data streams.

Figure 2.12 shows how data were processed so that the frame rate was uniform.

Hereafter, the following notation will be used: AC for acoustic data, OPT for optical

QualisysTM data, EMA for magnetometer tongue data, LSP for Line Spectral Pairs, LSPE for LSP plus RMS energy, and RMS for RMS energy.

2.7. Summary of Physical Measures and Perceptual Data

2.7.1. Physical Measures

Table 2.3 A summary of the three datasets (four talkers).

Database    Corpus        Repetitions   Common data streams             EMA
DcorCV      69 CVs        ≥4            Audio, Video, and QualisysTM    Yes
DcorSENT    3 sentences   ≥4            Audio, Video, and QualisysTM    Yes
DsimCV      69 CVs        ≥2            Audio, Video, and QualisysTM    None


There were three different datasets recorded for the correlation and similarity analyses, as summarized in Table 2.3. Datasets DcorCV and DcorSENT were recorded for the correlation analysis. Because there were four or more repetitions of each CV syllable or sentence, only the first four repetitions were used in the analysis. Dataset DsimCV was recorded for the similarity analysis. Two repetitions of each CV syllable were chosen based on the criterion that the mouth was closed before and after producing the syllable. The difference between DcorCV and DcorSENT is that DcorCV consisted of productions of the 69 CV syllables, while DcorSENT consisted of productions of the three sentences. DsimCV differs from DcorCV and DcorSENT in three respects: DsimCV does not have EMA data; DsimCV has two or more repetitions of each CV syllable; and DsimCV used a different selection criterion.

Table 2.4 summarizes the channels for each data stream used in this dissertation.

For EMA data, we were interested in tongue movements, so only tongue back, tongue middle, and tongue tip were used in the analysis. For optical data, the retro-reflectors 1, 2,

3 were relatively stable and were used for head movement removal; therefore, they were not used in the analyses. The optical channels were grouped according to their positions on the face: lips (9, 10, 11, 12, 13, 15, 16, and 17); chin (18, 19, and 20); cheeks (4, 5, 6, 7,

8, and 14). For acoustic data, LSP pairs 1-8 and RMS energy were used in the analysis.


Table 2.4 A summary of data channels used in the analysis.

Data streams    Channels used in the analysis
Optical data    Lip retro-reflectors (lip): 9, 10, 11, 12, 13, 15, 16, and 17
                Chin retro-reflectors (chn): 18, 19, and 20
                Cheek retro-reflectors (chk): 4, 5, 6, 7, 8, and 14
EMA data        Tongue back (TB), Tongue middle (TM), and Tongue tip (TT)
Acoustic data   RMS energy (L0), LSP pairs 1-8 (L1-L8)

The data were first organized into matrices (Horn and Johnson, 1985). Each EMA frame is a six-dimensional vector (x and y coordinates of the three moving pellets: TB,

TM, and TT). Each OPT frame is a 51-dimensional vector (x, y, and z coordinates of the

17 retro-reflectors). Each LSPE frame is a 17-dimensional vector (16 LSP parameters and

RMS energy). Let O, E, and L represent the OPT, EMA, and LSPE matrices, respectively. Using O as an example, each matrix can be written in the form of

O = \begin{pmatrix} o_{1,1} & \cdots & o_{1,N} \\ \vdots & \ddots & \vdots \\ o_{51,1} & \cdots & o_{51,N} \end{pmatrix} (2.5)

where N is the number of frames. For optical data, there are three subsets: Olip, Ochn, and

Ochk. EMA data also have three subsets: ETB, ETM, and ETT. For acoustic data, there are nine subsets: L0 (RMS energy) and L1-L8 (LSP pairs 1-8).


2.7.2. Perceptual Data

Table 2.5 An example of a stimulus-response confusion matrix.

[Rows (STIMULUS): the 23 consonant stimuli; columns (RESPONSE): the 23 response categories, labeled with the same consonant symbols. Each cell gives the number of times the row stimulus was identified as the column response. The complete confusion matrices appear in Appendix A.]

In perceptual experiments, each stimulus can receive different responses. For example, a visual stimulus /b/ may be perceived as /p/ or /m/. In one stimulus-response condition, the same stimulus is presented a certain number of times. Thus a distribution of responses can be obtained. To display and further process this information, the results are put together in a stimulus-response confusion matrix as shown in Table 2.5. “STIMULUS” appears in the first column, which indicates that each row represents a distribution of responses to one stimulus. The stimuli are listed in the second column. “RESPONSE” appears in the first row, which indicates that each column represents how many times a stimulus was perceived as that response type. The names of the response types are listed in the second row, and they are the same as the stimulus names. For example, in the table, across h


(row) and under k (column), the value is 105, which means that among 1440 (row total) presentations of the /h/ stimulus, the participants perceived it as /k/ 105 times.

Perceptual data consisted of six participants’ identifications of 23 consonants through lipreading each of the four talkers. Across all of the testing, for each CV syllable, there were 480 responses (2 repetitions x 10 trials x 4 talkers x 6 participants). Results are shown in stimulus-response confusion matrices. The results were pooled in 12 stimulus-

response confusion matrices (23x23) for each talker and vowel context. C_{T,V} represents one stimulus-response confusion matrix, where T is the talker and V is the vowel context.

In each of these confusion matrices, there were 120 responses (2 repetitions x 6 participants x 10 trials) per stimulus.

Ignoring either the talker or vowel factor, two types of confusion matrices can be computed, respectively, as:

C_{ALL,V} = C_{M1,V} + C_{F1,V} + C_{M2,V} + C_{F2,V} (2.6)

C_{T,aiu} = C_{T,a} + C_{T,i} + C_{T,u} (2.7)
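The pooling in Equations 2.6 and 2.7 amounts to element-wise summation of the count matrices, as in the short sketch below; the dictionary keys and the random counts are placeholders for the real 23x23 matrices.

import numpy as np

talkers, vowels = ["M1", "F1", "M2", "F2"], ["a", "i", "u"]
# C[(talker, vowel)] is a 23x23 count matrix; random data here for illustration only.
rng = np.random.default_rng(0)
C = {(t, v): rng.integers(0, 10, size=(23, 23)) for t in talkers for v in vowels}

# Eq. 2.6: pool over talkers for each vowel context.
C_all = {v: sum(C[(t, v)] for t in talkers) for v in vowels}
# Eq. 2.7: pool over vowels for each talker.
C_aiu = {t: sum(C[(t, v)] for v in vowels) for t in talkers}
# Grand matrix C_ALL,aiu (1440 responses per stimulus in the real data).
C_all_aiu = sum(C_all[v] for v in vowels)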

Table 2.6 lists the resulting confusion matrices (23x23; see Appendix A) that were analyzed in Chapter 5.


Table 2.6 A summary of the perceptual confusion matrices

Confusion matrix   Talker   Vowel    Response number
C_{M1,a}           M1       /a/      120
C_{M1,i}           M1       /i/      120
C_{M1,u}           M1       /u/      120
C_{M1,aiu}         M1       /aiu/    360
C_{F1,a}           F1       /a/      120
C_{F1,i}           F1       /i/      120
C_{F1,u}           F1       /u/      120
C_{F1,aiu}         F1       /aiu/    360
C_{M2,a}           M2       /a/      120
C_{M2,i}           M2       /i/      120
C_{M2,u}           M2       /u/      120
C_{M2,aiu}         M2       /aiu/    360
C_{F2,a}           F2       /a/      120
C_{F2,i}           F2       /i/      120
C_{F2,u}           F2       /u/      120
C_{F2,aiu}         F2       /aiu/    360
C_{ALL,a}          ALL      /a/      480
C_{ALL,i}          ALL      /i/      480
C_{ALL,u}          ALL      /u/      480
C_{ALL,aiu}        ALL      /aiu/    1440

2.8. Summary

This chapter described (1) the method of recording databases for the correlation analysis covering the issues of talkers, materials, recording facilities, placement of optical retro- reflectors and EMA pellets, synchronization, and procedure, (2) the method for the second database for the perceptual similarity analysis, (3) methods for the perceptual experiment covering the issues of participants, video presentations, and procedure, and

(4) methods for conditioning the physical streams (audio, optical data, EMA data) for the correlation and perceptual similarity analyses. Finally, the databases (physical and perceptual data) were summarized.


CHAPTER 3. MULTILINEAR REGRESSION,

MULTIDIMENSIONAL SCALING, HIERARCHICAL

CLUSTERING ANALYSIS, AND PHONEME

EQUIVALENCE CLASSES

3.1. Introduction

This chapter describes the algorithms for the correlation and perceptual similarity analyses. For the correlation analysis, multilinear regression is used. For the perceptual similarity analysis, the perceptual stimulus-response confusion matrices are first transformed to dissimilarity matrices using phi-square transformation, and then the relationship between the perceptual and physical measures is examined using multilinear regression. For the perceptual data, hierarchical clustering analysis is applied to obtain phoneme equivalence classes (PECs), and multidimensional scaling (MDS) is used to analyze the structures of the dissimilarity matrices.

3.2. Multilinear Regression

3.2.1. Mean Subtraction

Mean subtraction is applied just before multilinear regression to eliminate bias. For each utterance, the channel mean is subtracted from all the values in that channel.


3.2.2. Multilinear Regression

Multilinear regression is chosen as the method for deriving relationships between the various obtained measures. Multilinear regression fits a linear combination of the components of a multi-channel signal X to a single-channel signal y_j plus a residual error vector:

y_j = a_1 x_1 + a_2 x_2 + \cdots + a_I x_I + b (3.1)

where x_i (i = 1, 2, \ldots, I) is one channel of the multi-channel signal X, a_i is the corresponding weighting coefficient, and b is the residual vector. In multilinear regression, the objective is to minimize the root mean square error \|b\|_2, so that:

a = \arg\min \left\| X^T a - y_j^T \right\|_2 (3.2)

Figure 3.1 Illustration of multilinear regression.

This optimization problem has a standard solution (Kailath et al., 2000; Sen and

Srivastava, 1990). Consider the range of X^T, that is, the set of linear combinations of the column vectors of X^T. Thus X^T a is one vector in this plane. To obtain the most information about the target y_j from X, the error signal b should be perpendicular to this plane (see Figure


3.1):

X (X^T a - y_j^T) = 0 (3.3)

Thus,

a = (X X^T)^{-1} X y_j^T (3.4)
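Equation 3.4 translates directly into a few lines of Python; the synthetic data below are only a check that a known weighting vector is recovered, and the layout (channels in rows, frames in columns) matches the matrices defined in Section 2.7.1.

import numpy as np

def fit_weights(X, y_j):
    """Return a such that X.T @ a approximates y_j in the least-squares sense."""
    return np.linalg.solve(X @ X.T, X @ y_j)            # a = (X X^T)^{-1} X y_j^T

def predict(X, a):
    return X.T @ a

# Tiny synthetic check: recover a known linear mapping.
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 1000))                      # 5 channels, 1000 frames
true_a = np.array([0.5, -1.0, 2.0, 0.0, 0.3])
y = X.T @ true_a + 0.01 * rng.standard_normal(1000)
a_hat = fit_weights(X, y)
print(np.round(a_hat, 2))                               # close to true_a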

3.2.3. Jackknife Procedure

For correlation analysis between face movements, tongue movements, and speech acoustics, the data (DcorCV and DcorSENT) were limited, compared to the very large databases used in automatic speech recognition. Therefore, a leave-one-out Jackknife training procedure was applied to protect against bias in the prediction (Efron, 1982;

Miller, 1974). First, data were divided into training and test sets. The training set was used to define a weighting vector a, which was then applied to the test set.

Syllable-dependent, syllable-independent, and vowel-dependent predictions were performed for individual consonant-vowel (CV) syllables, all CV syllables, and vowel-grouped syllables (C/a/, C/i/, and C/u/ syllables), respectively. The differences between these prediction procedures were that, for syllable-dependent predictions, each syllable was treated separately; for syllable-independent predictions, all syllables were grouped together; and for vowel-dependent predictions, syllables sharing the same vowel were grouped together.

For syllable-dependent predictions, the data were divided into four sets, where


each set contained one repetition of a particular CV per talker. One set was left out for testing and the remaining sets were used for training. A rotation was then performed to guarantee that each utterance appeared in both the training and test sets. For syllable-independent prediction, the data were divided into four sets, where each set had one repetition of every CV syllable per talker. For vowel-dependent prediction, the syllables were divided into four sets for each of the three vowels separately. For example, for C/a/ syllables, each set had one repetition of every C/a/ syllable per talker.

Figure 3.2 shows how the multilinear regression was performed for the prediction of LSP data from OPT data.

Figure 3.2 Diagram for the multilinear regression, where O and L stand for OPT and LSP, respectively.
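A sketch of this leave-one-set-out rotation is given below. The list-of-four-sets data layout is an assumption made for illustration, and fit_weights repeats the closed-form estimator of Equation 3.4.

import numpy as np

def fit_weights(X, y):
    # Closed-form estimator of Eq. 3.4: a = (X X^T)^{-1} X y^T.
    return np.linalg.solve(X @ X.T, X @ y)

def jackknife_correlation(sets):
    """sets: list of four (X_k, y_k) tuples; X_k: channels x frames, y_k: frames."""
    rs = []
    for k in range(len(sets)):
        # Train on the three remaining sets, test on the held-out set.
        X_train = np.hstack([X for i, (X, _) in enumerate(sets) if i != k])
        y_train = np.hstack([y for i, (_, y) in enumerate(sets) if i != k])
        X_test, y_test = sets[k]
        a = fit_weights(X_train, y_train)
        y_hat = X_test.T @ a
        rs.append(np.corrcoef(y_hat, y_test)[0, 1])
    return float(np.mean(rs))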

3.2.4. Goodness of Fit

After applying the weighting vector a to the test data, a Pearson correlation is evaluated


between predicted (Y’) and measured data (Y). The correlation was calculated as:

r_{Y'Y} = \frac{\sum_j \sum_n (y'_{j,n} - \bar{y}'_j)(y_{j,n} - \bar{y}_j)}{\sqrt{\sum_j \sum_n (y'_{j,n} - \bar{y}'_j)^2 \cdot \sum_j \sum_n (y_{j,n} - \bar{y}_j)^2}} (3.5)

where Y' is the predicted data, Y is the measured data, j is the channel number, and n is the frame number. For OPT and EMA data, all channels are used to calculate the correlation coefficients. For acoustic data, LSP channels are used to calculate correlation coefficients separately from the RMS channel. When examining the difference between different areas, such as the face areas, lip, chin, and cheeks, the related channels are grouped to compute correlation coefficients. For example, when estimating OPT data from LSPE, optical retro-reflectors 9, 10, 11, 12, 13, 15, 16, and 17 are grouped together to compute the correlation coefficients for the lip area.

For the perceptual similarity analysis, Y’ and Y are one-dimensional (a vector of distances), so the correlation coefficients are simply calculated.
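For completeness, the grouped correlation of Equation 3.5 can be computed as in the sketch below, where the rows of the two arrays are the channels of one group (for example, the lip retro-reflector coordinates).

import numpy as np

def grouped_correlation(Y_pred, Y_meas):
    """Y_pred, Y_meas: arrays (channels, frames) for one channel group."""
    dp = Y_pred - Y_pred.mean(axis=1, keepdims=True)
    dm = Y_meas - Y_meas.mean(axis=1, keepdims=True)
    num = np.sum(dp * dm)
    den = np.sqrt(np.sum(dp ** 2) * np.sum(dm ** 2))
    return num / den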

3.2.5. Maximum Correlation Criterion Estimation

Multilinear regression minimizes the root mean squared error between obtained and predicted measures. Correlation coefficients, however, are used to measure the goodness of estimation. Hence, the maximum correlation coefficient could be used as an alternative criterion for linear estimation. Fortunately, it can be shown using linear programming techniques (Bertsimas and Tsitsiklis, 1997) that the multilinear regression method is also optimal in the sense of maximum correlation. The maximum


correlation criterion estimation can be formulated as:

maximize \quad p
subject to \quad y_j X^T a = p \cdot \|y_j\| \cdot \|X^T a\| (3.6)

where p is the correlation coefficient, a is the estimator vector, X^T a is the estimated target, and X and y_j are the same as in Section 3.2.2. This operation can be written as:

maximize \quad p
subject to \quad a^T \left( p^2 \|y_j\|^2 X X^T - X y_j^T y_j X^T \right) a = 0 (3.7)

\Rightarrow \; a^T (X X^T)^{\frac{1}{2}} \left[ p^2 \|y_j\|^2 I - (X X^T)^{-\frac{1}{2}} X y_j^T y_j X^T (X X^T)^{-\frac{1}{2}} \right] (X X^T)^{\frac{1}{2}} a = 0

where p^2 \|y_j\|^2 I - (X X^T)^{-\frac{1}{2}} X y_j^T y_j X^T (X X^T)^{-\frac{1}{2}} is a positive semidefinite matrix or a regular matrix, depending on p. For the identity matrix I, its eigenvectors can be any orthogonal unit vector set. Then (X X^T)^{-\frac{1}{2}} X y_j^T can be one of its eigenvectors, and we can assume the corresponding other eigenvectors are q_i (i = 1, 2, \ldots, M-1). Also, define

q_M = \frac{(X X^T)^{-\frac{1}{2}} X y_j^T}{\left\| (X X^T)^{-\frac{1}{2}} X y_j^T \right\|} (3.8)

Then,


p^2 \|y_j\|^2 I - (X X^T)^{-\frac{1}{2}} X y_j^T y_j X^T (X X^T)^{-\frac{1}{2}}

= p^2 \|y_j\|^2 \left( q_M q_M^T + \sum_{i=1}^{M-1} q_i q_i^T \right) - \left\| (X X^T)^{-\frac{1}{2}} X y_j^T \right\|^2 q_M q_M^T

= p^2 \|y_j\|^2 \sum_{i=1}^{M-1} q_i q_i^T + p^2 \|y_j\|^2 q_M q_M^T - \left\| (X X^T)^{-\frac{1}{2}} X y_j^T \right\|^2 q_M q_M^T (3.9)

= p^2 \|y_j\|^2 \sum_{i=1}^{M-1} q_i q_i^T + \left( p^2 \|y_j\|^2 - \left\| (X X^T)^{-\frac{1}{2}} X y_j^T \right\|^2 \right) q_M q_M^T

Therefore, the eigenvalues of the matrix are \lambda_1 = \lambda_2 = \cdots = \lambda_{M-1} = p^2 \|y_j\|^2 > 0 and \lambda_M = p^2 \|y_j\|^2 - \left\| (X X^T)^{-\frac{1}{2}} X y_j^T \right\|^2. \lambda_M can be positive or negative, depending on p. If p is small, \lambda_M is negative and the matrix p^2 \|y_j\|^2 I - (X X^T)^{-\frac{1}{2}} X y_j^T y_j X^T (X X^T)^{-\frac{1}{2}} is not positive semidefinite. If p is large, \lambda_M is positive and the matrix is positive definite. If the matrix is positive definite, then there is no solution for the estimator a. Therefore, the matrix should not be positive definite. The maximum p must then satisfy this equation:

p^2 \|y_j\|^2 - \left\| (X X^T)^{-\frac{1}{2}} X y_j^T \right\|^2 = 0 (3.10)

or

p^2 = \frac{\left\| (X X^T)^{-\frac{1}{2}} X y_j^T \right\|^2}{\|y_j\|^2} (3.11)

In Section 3.2.2, we showed that a = (X X^T)^{-1} X y_j^T. The resulting correlation from multilinear regression is:


p_{\mathrm{multiR}}^2 = \frac{y_j X^T a \, a^T X y_j^T}{\|X^T a\|^2 \cdot \|y_j\|^2}

= \frac{y_j X^T (X X^T)^{-1} X y_j^T \cdot y_j X^T (X X^T)^{-1} X y_j^T}{y_j X^T (X X^T)^{-1} (X X^T) (X X^T)^{-1} X y_j^T \cdot \|y_j\|^2}

= \frac{y_j X^T (X X^T)^{-1} X y_j^T \cdot y_j X^T (X X^T)^{-1} X y_j^T}{y_j X^T (X X^T)^{-1} X y_j^T \cdot \|y_j\|^2}

= \frac{y_j X^T (X X^T)^{-1} X y_j^T}{\|y_j\|^2}

= \frac{y_j X^T (X X^T)^{-\frac{1}{2}} (X X^T)^{-\frac{1}{2}} X y_j^T}{\|y_j\|^2}

= \frac{\left\| (X X^T)^{-\frac{1}{2}} X y_j^T \right\|^2}{\|y_j\|^2} (3.12)

The above equations (3.11 and 3.12) show that the multilinear regression is also optimal in the sense of maximum correlation.

3.3. Phi-Square Transformation

Prior to the application of multilinear regression, MDS, and PEC analyses, a perceptual stimulus-response confusion matrix (e.g., Table 2.5 in Chapter 2) is transformed into dissimilarity estimates using the phi-square statistic (Iverson et al., 1998). The phi-square coefficient replaces the original confusion data by examining the overall response distribution of a pair of stimuli, thus compensating for response biases and asymmetries in the stimulus-response confusion matrices (Iverson et al., 1998). The resulting dissimilarity matrices represent the dissimilarity structure in visual perceptual space and are then used in the hierarchical clustering and MDS analyses. The phi-square transformation is a


normalized version of the chi-square test of equality for the two response distributions

(identical distributions result in a distance of 0; non-overlapping distributions result in a distance of 1). The phi-square measure is:

\mathrm{PH2}(x, y) = \frac{\displaystyle \sum_i \frac{(x_i - E_{xy}(x_i))^2}{E_{xy}(x_i)} + \sum_i \frac{(y_i - E_{xy}(y_i))^2}{E_{xy}(y_i)}}{N} (3.13)

where x_i and y_i are the frequencies that phonemes x and y were identified as response category i. N equals the total number of responses to phonemes x and y. E_{xy}(x_i) and E_{xy}(y_i) equal the expected frequencies of responses for x_i and y_i, if phonemes x and y are equivalent. For example, E_{xy}(x_i) can be expressed as:

E_{xy}(x_i) = \frac{\sum_i x_i}{\sum_i x_i + \sum_i y_i} \cdot (x_i + y_i) (3.14)

For a stimulus-response confusion matrix as follows:

\begin{array}{c|ccc} & 1 & 2 & 3 \\ \hline x & 3 & 4 & 8 \\ y & 6 & 7 & 5 \\ z & 8 & 3 & 4 \end{array} (3.15)

The E_{xy}(x_1) is

E_{xy}(x_1) = \frac{\left( \sum_i x_i \right)(x_1 + y_1)}{\sum_i x_i + \sum_i y_i} = \frac{(3 + 4 + 8)(3 + 6)}{(3 + 4 + 8) + (6 + 7 + 5)} = \frac{135}{33} (3.16)

Note that the above E_{xy}(x_1) applies only to computing the distance between stimuli x and y. The distance between x and z is computed using E_{xz}(x_1) = (3 + 4 + 8)(3 + 8) / [(3 + 4 + 8) + (8 + 3 + 4)] = 5.5.
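The phi-square computation of Equations 3.13 and 3.14 is compact in Python. The sketch below checks the expected count against the worked example above and assumes that every response category receives at least one response from the pair being compared (so that no expected count is zero).

import numpy as np

def phi_square(x, y):
    """x, y: response-count vectors (rows of the confusion matrix) for two stimuli."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = x.sum() + y.sum()                            # total responses to both stimuli
    Ex = x.sum() / N * (x + y)                       # Eq. 3.14: expected counts for x
    Ey = y.sum() / N * (x + y)                       # and for y, under equivalence
    return (np.sum((x - Ex) ** 2 / Ex) + np.sum((y - Ey) ** 2 / Ey)) / N

x_row, y_row = [3, 4, 8], [6, 7, 5]
print(15 / 33 * (3 + 6))                             # E_xy(x_1) = 135/33, about 4.09
print(phi_square(x_row, y_row))                      # small value: similar response rows
print(phi_square([10, 0], [0, 10]))                  # 1.0: non-overlapping distributions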

Other than the phi-square transformation, there is another, simpler method that examines only the response categories associated with the individual phoneme pairs. For example, a consonant C1 is perceived as C2 with a percentage of p1_2 and as C1 with a percentage of p1_1. Consonant C2 is perceived as C1 with a percentage of p2_1 and as C2 with a percentage of p2_2. Then the similarity between C1 and C2 can be estimated as the average (p1_2 + p2_1)/2. However, this similarity estimate has disadvantages (Iverson et al., 1998). First, (p1_1 + p1_2) and (p2_1 + p2_2) are not necessarily equal to 1 and are affected by the presence of other consonants. For example, if C1 and C2 are identical and very distinguishable from other consonants, then (p1_2 + p2_1)/2 would be 50%. If a similar condition applies to consonants C1, C2, and C3, then (p1_2 + p2_1)/2 is about 33% instead.

The numbers are different, but the magnitude of similarity is actually equivalent. The phi-square transformation can correct this problem by comparing the response distributions across all available response categories. For both cases, the phi-square distance between

C1 and C2 should be near zero, because the two distributions are similar. Therefore, the phi-square coefficient for an individual consonant pair (C1 and C2) is independent of the other consonants in the experiment. Second, the phi-square transformation can protect against biases and asymmetries (Iverson et al., 1998). For example, suppose /p/, /b/, and /m/ have similar response distributions, and pb_b, pb_m, pm_b, and pm_m are all equal to zero. The phi-square distance between /b/ and /m/ would be near 0, because the two response distributions match, whereas the percentage method would yield a similarity of 0, incorrectly suggesting that /b/ and /m/ are completely dissimilar.


3.4. Multidimensional Scaling


Figure 3.3 Distances between 10 American cities and their 2-D MDS representations [data from (Kruskal and Wish, 1978)].

After the phi-square transformation, 23x23 dissimilarity matrices are obtained. However, from these matrices, it is difficult to examine the relationship between the 23 consonants.

If these 23 consonants can be placed in a 2-D or 3-D space, then their relationships become much easier to observe. One of the advantages of a geometry-based representation of similarity is that it can be examined in detail with MDS (Kruskal and Wish, 1978; Young and Hamer, 1987). The purpose of using MDS is to reduce the number of dimensions while keeping the most important ones so that researchers can explain observed similarities or dissimilarities between the investigated objects. Figure 3.3 illustrates a typical scenario for an MDS analysis. Given a matrix of distances between major US cities from a map, a 2-D MDS can be applied. As a result of the MDS analysis, a 2-D representation of the locations of the cities is obtained, which makes the relationships among the cities immediately clear.

In general, MDS is an optimization problem. In a Euclidean space, every object is assigned a location in the space. An optimization program is then used to move these objects around so as to produce a configuration that best approximates the observed


distances. It actually moves objects around in the space defined by the requested number of dimensions and then checks how well the distances between objects can be reproduced by the new configuration. In more technical terms, it uses a function minimization algorithm that evaluates different configurations with the goal of maximizing the goodness-of-fit. This procedure is illustrated in Figure 3.4.

Figure 3.4 Diagram for MDS process.

The goodness-of-fit function is the “stress value”, which is used to evaluate how well (or poorly) a newly produced matrix matches the observed distance matrix. The smaller the stress value, the better the fit. The raw stress value E of a configuration is defined as:

E = \sum_i \sum_j \left[ d_{ij} - f(\delta_{ij}) \right]^2 (3.17)

In Equation 3.17, dij stands for the reproduced distances, given the respective


number of dimensions, and δij stands for the input data (i.e., observed distances). The expression f(δij) can indicate a transformation (e.g., phi-square transformation) of the observed input data. The stress value in Equation 3.17 is just one of the available measures for the goodness-of-fit of MDS. In this dissertation, the stress value was computed using SPSS’ built-in function, and the actual form of the stress equation is unknown.

Another important issue with MDS is to determine the number of dimensions. In general, the more dimensions we use, the better the fit. On the other hand, our goal is to reduce the number of dimensions, that is, to adequately characterize the distance matrix in terms of fewer dimensions. A common way to decide how many dimensions to use is to plot the stress value against different numbers of dimensions to find a place where the stress values begin to decrease smoothly (Cattell, 1966). This procedure is illustrated in Figure 3.5. From two dimensions to three dimensions, there is a large decrease. From three dimensions to four dimensions, there is also a large decrease. However, after four dimensions, stress values decrease smoothly and appear to level off to the right of the plot. Therefore, a dimensionality of four can be used in the analysis. Nevertheless, if the dimensionality is too high, then it is difficult to display the data and obtain a clear interpretation. A compromise can be reached by looking at both the stress value and the variance accounted for. For example, in Figure 3.5, if the variance accounted for (VAF) is acceptable for a dimensionality of three and the decrease in stress from three to four dimensions is not large, then a dimensionality of three can be used to approximate the data.
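The dimension-selection step can be sketched with scikit-learn's metric MDS standing in for the SPSS routine used in this dissertation; the random symmetric matrix below is only a placeholder for a 23x23 phi-square dissimilarity matrix, and the raw stress values printed for one to six dimensions would be plotted as in Figure 3.5.

import numpy as np
from sklearn.manifold import MDS

# Placeholder for the 23x23 phi-square dissimilarity matrix (symmetric, zero diagonal).
rng = np.random.default_rng(0)
D = rng.random((23, 23))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

for n_dim in range(1, 7):
    mds = MDS(n_components=n_dim, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)          # 23 points in an n_dim-dimensional space
    print(n_dim, round(mds.stress_, 3))    # raw stress, cf. Eq. 3.17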


Figure 3.5 Stress value vs. MDS dimension.

Once an MDS analysis is completed, it is important to find and interpret its dimensions. An analytical way of interpreting dimensions is to use multiple regression techniques to seek a weighted combination of the coordinates of a configuration that agrees with a meaningful variable as well as possible (described in Kruskal and Wish, 1978). It is extremely valuable to find the relationship between a perceptual MDS space and physical measures: for example, does lip height, lip width, lip area, the chin, place of articulation, or manner of articulation have a strong effect on perception? Note that for MDS results, the original dimensions have no intrinsic meaning, because if the axes are rotated, the results still fit the observed data. In addition to the underlying dimensions, one should also look for clusters of points or particular patterns. For example, /b, m, p/ should be very close to each other.

In this dissertation, MDS is used to yield spatial representations of visual consonant perception from which the Euclidean distances between all possible pairs of consonants in a 3-D space are calculated (the number of distances is 253 for the 23


consonants). At the same time, MDS is also applied to perceptual dissimilarity matrices predicted from different levels of physical measures (the whole face or part of the face).

By comparing the spatial structures of visual consonant perception obtained from lipreading with those predicted from physical measures, it is possible to examine how well visual consonant perception is predicted from physical measures. Furthermore, given these visual perceptual distances and those predicted from physical measures, the predictability of visual perception from physical measures can be assessed as a function of the number of dimensions. For example, when a high-dimensional space collapses into a low-dimensional one, does the predictability change?

3.5. Hierarchical Clustering Analysis and Phoneme Equivalence Classes

Hierarchical clustering analysis (HCA; Aldenderfer and Blashfield, 1984) is a powerful method for analyzing the structure of data using a multi-stage process. HCA starts with an M-by-M matrix of similarities or dissimilarities and then forms a tree diagram by iteratively merging the two closest entities/clusters or splitting one cluster into two distinct clusters. Therefore, this procedure can be developed in two directions: one can start with each entity as a single cluster and then merge the two closest entities/clusters iteratively (agglomerative; the method used in this dissertation), or one can start with all entities as one large cluster and then split one cluster into two distinct clusters iteratively (divisive). One property of HCA is that the previous step is the basis for the next step, and it cannot be discarded or changed later. In other words, for each


step, the optimization is performed on the results of the previous step. As a result, the procedure is not robust and does not protect against outliers. For example, extraneous data or noise can result in a totally different structure. Misclassification at an early stage can happen, and a check on the final result is necessary. Another factor affecting

HCA is the possibility of having the same minimum distance between different clusters.

In this case, choosing either one as the minimum distance is theoretically correct, but the resulting hierarchical structures would be totally different.

During each iterative step, a distance measure is used to determine which two entities/clusters are closest to each other. There are four popular methods: single linkage, complete linkage, average linkage, and the centroid method. Let the between-groups distance be d(P, Q), x be one entity from cluster P, and y be one entity from cluster Q. The single linkage distance is min_{x,y} ||x - y||, and the complete linkage distance is max_{x,y} ||x - y||. The average linkage distance is the mean of ||x - y|| over all such pairs. The centroid linkage distance is the distance between the centroids of clusters P and Q.

Agglomerative HCA generates an inverted tree structure in which phonemes join classes based on dissimilarity. At the lowest level of the structure, no phonemes are joined together. At each succeeding level, the most similar pair of classes is joined together. This continues until, at the highest level, all phonemes join a single equivalence class. For a stimulus-response confusion matrix (23x23), after phi-square transform, a dissimilarity matrix (23x23) is obtained, and HCA can be applied. In this dissertation, an average-linkage-between-groups method is used to determine which classes to join at each level in the hierarchy. As the level of the hierarchy goes from low to high, the


average between-class distances become larger, and the number of classes becomes fewer. The procedures are (a) to create 23 clusters where each consonant is a cluster, (b) to compute distances between clusters and to find the smallest distance, and (c) to merge the two closest clusters. Then steps (b) and (c) are run iteratively until a single cluster is formed.
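These steps correspond to SciPy's average-linkage clustering, as sketched below; the single-letter consonant labels and the random dissimilarity matrix are placeholders for the real consonant labels and phi-square matrix.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

consonants = list("abcdefghijklmnopqrstuvw")            # 23 placeholder labels
rng = np.random.default_rng(0)
D = rng.random((23, 23))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D), method="average")            # average linkage between groups
labels_at_6 = fcluster(Z, t=6, criterion="maxclust")    # e.g., cut the tree into 6 classes
print(dict(zip(consonants, labels_at_6)))
# scipy.cluster.hierarchy.dendrogram(Z, labels=consonants) draws the tree diagram.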

For a given stimulus-response confusion matrix, a phi-square transformation is applied first. Then HCA is applied to the resulting dissimilarity matrix, and consequently, a dendrogram is derived. For visual speech perception, groups of phonemes that are visually highly similar (given a particular level of accuracy) have been referred to as visemes (Fisher, 1968). Auer and Bernstein (1997) generalized the notion of visually similar segments to the phoneme equivalence class (PEC), which explicitly acknowledges that visual speech similarity varies across talkers and vowel contexts.

Based on the dendrogram and the original stimulus-response confusion matrix, PECs are chosen by finding the first level at which at least 75% of all the responses are within-class, similar to the approach in Iverson et al. (1998) and Walden et al. (1977). As an example, if the phonemes /b/ and /m/ are in the same class, then a /b/ response to an /m/ stimulus is considered a within-class response. Therefore, stimuli with excellent intelligibility would yield many distinct classes of phonemes, because the criterion of 75% should be met at a relatively low level of the hierarchy. In contrast, less intelligible stimuli would result in fewer PECs, because the criterion of 75% would be more difficult to meet. An example is shown in Figure 3.6 (PEC analysis of the stimulus-response confusion matrix in Table 2.5). The bold line means the cluster it represents has


at least 75% in-class responses; the other line indicates that the cluster it represents has below 75% in-class responses. At the lowest level, only /f/ and /v/ meet the criterion. Then the two closest clusters join, and all clusters are checked to determine whether they meet the 75% criterion. This procedure continues until HCA reaches the level indicated by the dashed horizontal line. At this level, all six clusters have at least 75% in-class responses. Therefore, the PECs are {f v r}, {w}, {θ ð}, {ʒ dʒ ʃ tʃ s z t d}, {l n k g y h}, and {p b m}. PECs can be used to examine the characteristics of visual perception.

For the visual consonant perception predicted from physical measures, the same procedure can be applied. By this means, PECs can serve as an additional metric to examine how visual consonant perception can be predicted from physical measures. By examining the number of PECs, and how phonemes are grouped, the goodness of the structure can be easily determined.
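One way to implement the 75% within-class criterion is sketched below: the linkage tree from the previous sketch is cut into successively fewer classes (from 23 down to 1), and the first, finest partition in which every class captures at least 75% of the responses to its own stimuli is returned. Cutting the tree with fcluster is an approximation of walking the dendrogram level by level.

import numpy as np
from scipy.cluster.hierarchy import fcluster

def first_level_meeting_criterion(Z, confusion, threshold=0.75):
    """Z: SciPy linkage matrix; confusion: 23x23 response-count matrix."""
    n = confusion.shape[0]
    for n_classes in range(n, 0, -1):                  # from 23 classes down to 1
        labels = fcluster(Z, t=n_classes, criterion="maxclust")
        all_ok = True
        for c in np.unique(labels):
            members = np.where(labels == c)[0]
            within = confusion[np.ix_(members, members)].sum()
            total = confusion[members, :].sum()
            if within / total < threshold:             # this class falls below 75%
                all_ok = False
                break
        if all_ok:
            return labels                              # phoneme equivalence classes
    return np.ones(n, dtype=int)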

Figure 3.6 PEC analysis of the confusion matrix in Table 2.5.


3.6. Summary

This chapter describes several algorithms used in the data analyses: multilinear regression is used for examining the relationships between data streams, and Pearson correlation is used to evaluate goodness of fit. A jackknife procedure is used to compensate for the limited amount of data. Further, it is proven that multilinear regression also satisfies the maximum correlation criterion (in addition to the minimum root mean square error criterion). For the perceptual similarity analysis, the phi-square transformation, MDS, HCA, and PECs are used.


CHAPTER 4. ON THE RELATIONSHIP BETWEEN FACE

MOVEMENTS, TONGUE MOVEMENTS, AND SPEECH

ACOUSTICS

4.1. Introduction

In this chapter, the correlations between face movements, tongue movements, and speech acoustics were analyzed using multilinear regression for CV syllables (DcorCV) and sentences (DcorSENT). The correlations were then analyzed for the reduced data sets from the two cases (DcorCV and DcorSENT). In an effort to improve the predictability of face movements from speech acoustics, the spectral dynamics were modeled, and the new dynamical model enhanced the relationship between face movements and speech acoustics. Part of Chapter 4 is based on a journal paper that appeared in the EURASIP

Journal on Applied Signal Processing (November, 2002, pp. 1174-1188) entitled “On the relationship between face movements, tongue movements, and speech acoustics” by

Jintao Jiang, Abeer Alwan, Patricia A. Keating, Edward T. Auer, and Lynne E. Bernstein.

4.2. Background

Yehia et al. (1998) used multilinear regression to examine relationships between tongue movements, external face movements (lips, jaw, and cheeks), and speech acoustics for two or three sentences repeated four or five times by a native male talker of American


English and by a native male talker of Japanese. For the English talker, results showed that tongue movements predicted from face movements accounted for 61% of the variance of measured tongue movements (correlation coefficient r = 0.78), while face movements predicted from tongue movements accounted for 83% of the variance of measured face movements (r = 0.91). Furthermore, acoustic line spectral pairs (LSPs;

Sugamura and Itakura, 1986) predicted from face movements and tongue movements accounted for 53% and 48% (r = 0.73 and r = 0.69) of the variance in measured LSPs, respectively. Face and tongue movements predicted from the LSPs accounted for 52% and 37% (r = 0.72 and r = 0.61) of the variance in measured face and tongue movements, respectively.

Barker and Berthommier (1999) examined the correlation between face movements and the LSPs of 54 French nonsense words repeated ten times. Each word had the form V1CV2CV1 in which V was one of /a, i, u/, and C was one of /b, j, l, r, v, z/.

Using multilinear regression, the authors reported that face movements predicted from

LSPs and RMS energy accounted for 56% (r = 0.75) of the variance of obtained measurements, while predicted acoustic features from face movements accounted for only 30% (r = 0.55) of the variance.

The databases (DcorCV and DcorSENT) differed from those in (Barker and

Berthommier, 1999; Yehia et al., 1998) in several respects. First, all data streams were recorded simultaneously. Yehia et al. (1998) recorded face movements and tongue movements in different sessions and then used Dynamic Time Warping (DTW) to align them, while Barker and Berthommier (1999) did not record tongue movements. Second,


an optical QualisysTM system was used to capture facial motion. Note that Yehia et al.

(1998) used an OPTOTRAK system to capture facial motion, while Barker and

Berthommier (1999) extracted physical measures from video images. Third, the results were analyzed in terms of vowel context, place of articulation, and individual articulatory

(EMA or Optical) or acoustic (LSP) channel. Finally, the talkers recorded had different intelligibility ratings as shown by the visual perception scores of normal hearing and hearing-impaired individuals.

For sentences, other than segmenting sentences into smaller segments for correlation analysis (Barbosa and Yehia, 2001; Lee et al., 2001), the dynamical information in speech acoustics can be used to enhance the relationship between face movements and speech acoustics. Section 4.6 introduced a new dynamical model that enhanced the relationship between face movements and speech acoustics on the database

DcorSENT. Based on the autocorrelations of the speech acoustics and those of the face movements, a causal and a non-causal filter were proposed to approximate dynamical features in the speech signals.


4.3. Analysis of CV Syllables

4.3.1. Consonants: Place and Manner of Articulation

Figure 4.1 Definition of place and manner of articulation for consonants.

One objective in this dissertation was to observe whether linguistic features (place, manner, and voicing) of the consonants affect the correlations between the corresponding data streams. The 23 consonants were grouped in terms of place of articulation (position of maximum constriction), manner of articulation, and voicing (Chomsky and Halle,

1968; Jakobson et al., 1952). The places of articulation are, from back to front, Glottal

(G), Velar (V), Palatal (P), Palatoalveolar (PA), Alveolar (A), Dental (D), Labiodental

(LD), Labial-Velar (LV), and Bilabial (B). The manners of articulation are Approximant

(AP), Lateral (LA), Nasal (N), Plosive (PL), Fricative (F), and Affricate (AF) (see Figure

4.1; Ladefoged, 2001).

4.3.2. Syllable-Dependent Predictions

Syllable-dependent correlations (using DcorCV), that is, training and prediction


performed for each syllable type separately, are reported first. For each talker, four repetitions of each syllable were analyzed, and a mean correlation coefficient was computed. Table 4.1 summarizes the results averaged across the 69 syllables. The correlations between EMA and OPT data were moderate to high: 0.70-0.88 when predicting OPT from EMA, and 0.74-0.83 when predicting EMA from OPT. Table 4.1 also shows that LSPs were not predicted particularly well from articulatory data, although they were better predicted from EMA data (correlations ranged from 0.54 to 0.61) than from OPT data (correlations ranged from 0.37 to 0.55). However, OPT and EMA data can be recovered reasonably well from LSPE (correlations ranged from 0.74 to 0.82). As for speaker differences, in general, the data from talker F2 resulted in higher correlations than the data from talker F1, and results were similar for talkers M1 and M2.

Table 4.1 Correlation coefficients averaged over all CVs (N=69) and the corresponding standard deviation. The notation X→Y means that X data were used to predict Y data.

            M1           M2           F1           F2           Mean
OPT→EMA     0.83 (0.14)  0.81 (0.15)  0.81 (0.17)  0.74 (0.18)  0.80 (0.16)
OPT→LSP     0.50 (0.16)  0.55 (0.16)  0.37 (0.16)  0.42 (0.13)  0.46 (0.17)
OPT→E       0.75 (0.16)  0.79 (0.17)  0.57 (0.24)  0.70 (0.18)  0.70 (0.21)
EMA→OPT     0.88 (0.12)  0.71 (0.22)  0.70 (0.18)  0.77 (0.19)  0.76 (0.19)
EMA→LSP     0.61 (0.13)  0.61 (0.14)  0.54 (0.15)  0.57 (0.13)  0.59 (0.14)
EMA→E       0.76 (0.18)  0.70 (0.18)  0.65 (0.22)  0.78 (0.14)  0.72 (0.19)
LSPE→OPT    0.82 (0.13)  0.76 (0.17)  0.74 (0.12)  0.79 (0.14)  0.78 (0.14)
LSPE→EMA    0.80 (0.11)  0.79 (0.13)  0.78 (0.15)  0.75 (0.15)  0.78 (0.13)
Mean        0.74 (0.18)  0.71 (0.19)  0.65 (0.22)  0.69 (0.20)

In order to assess the effects of vowel context, voicing, place, manner, and channels, the results were re-organized and are shown in Figures 4.2 - 4.6.


Figure 4.2 Correlation coefficients averaged as a function of vowel context, C/a/, C/i/, or C/u/. Line width represents intelligibility rating level. Circles represent female talkers, and squares represent male talkers.

Figure 4.2 illustrates the results as a function of vowel context, /a, i, u/. It shows that C/a/ syllables were better predicted than C/i/ [t(23) = 6.2, p < 0.0001] and C/u/ [t(23) = 9.5, p

< 0.0001] syllables for all talkers. Also, several predictions (all but the prediction of LSP parameters) had near-maximum correlations, but only for C/a/. [Note that paired t-tests were used to test the significance of the difference (Sachs, 1984), where p refers to the significance level, t(N-1) refers to the calculated t-statistic value, and N-1 is the degrees of freedom.

In Figure 4.2, there were 24 correlation coefficients for C/a/ (four talkers and six predictions) so that N-1=23.]

Figure 4.3 Correlation coefficients averaged according to voicing (VL: voiceless).

Figure 4.3 illustrates the results as a function of voicing and shows that voicing had little effect on the correlations.


Figure 4.4 Correlation coefficients averaged according to place of articulation. Refer to Figure 4.1 for place of articulation definitions.

Figure 4.4 shows that the correlations for the lingual places of articulation (V, P,

PA, A, D, LD, and LV) were in general higher [t(22)=4.66, p<0.0001] than glottal (G) and bilabial (B) places.

Figure 4.5 Correlation coefficients averaged according to manner of articulation. Refer to Figure 4.1 for manner of articulation definitions.

Figure 4.5 shows the results based on manner of articulation. In general, the prediction of one data stream from another for the plosives was worse than for other


manners of articulation. This trend was stronger between the articulatory data and speech acoustics.

Figure 4.6 Correlation coefficients averaged according to individual channels: (a) LSPE, (b) retro-reflectors, and (c) EMA pellets. Refer to Table 2.4 for definition of individual channels.

Figures 4.6(a-c) illustrate the results based on individual channels. Figure 4.6(a) shows that the RMS energy (E) was the best-predicted acoustic feature from articulatory

(OPT and EMA) data, followed by the 2nd LSP pair. Also note that there was a dip around the 4th LSP pair. For talker F1, who had the smallest mouth movements, correlations for RMS energy were much lower than those from the other talkers, but still higher than the LSPs. For the OPT data [Figure 4.6(b)], chin movements were the easiest to predict from speech acoustics or EMA, while cheek movements were the hardest.

When predicting EMA data [Figure 4.6(c)], there was not much difference among the


EMA pellets. However, correlations for TB and TM were, in general, slightly higher than for TT.

Figure 4.7 Comparison of syllable-dependent (SD), vowel-dependent (VD), and syllable- independent (SI) prediction results.

Syllable-dependent, syllable-independent, and vowel-dependent predictions are compared in Figure 4.7. In general, syllable-dependent prediction yielded the best correlations, followed by vowel-dependent prediction, and then syllable-independent prediction. The only exception occurred when predicting LSPs from OPT data, when the syllable-dependent prediction yielded the lowest correlations.

4.3.3. Discussion

The correlations between internal movements (EMA) and external movements (OPT) were moderate to high, which can be readily explained inasmuch as these movements were produced simultaneously and are physically related (in terms of muscle activities) in the course of producing the speech signal. Yehia et al. (1998) also reported that facial motion is highly predictable from vocal-tract motion. However, Yehia et al. (1998) reported that LSPs were better recovered from OPT than from EMA data. This is not true in the current study for CVs, but the differences might be due to talkers having different


control strategies for CVs than for sentences. For example, sentences and isolated CVs have different stress and co-articulation characteristics.

Note that talker F2, who had higher visual intelligibility than talker F1, produced speech that resulted in higher correlations. However, for the male talkers, intelligibility ratings were not predictive of the correlations. On the other hand, the two male talkers were perceived with different visual intelligibility by 16 normal-hearing lipreaders (M1 > M2) and eight deaf lipreaders (M2 > M1).

C/a/ syllables were better predicted than C/i/ and C/u/ syllables for all talkers. This can be explained as an effect of the typically large mouth opening for /a/ and as an effect of co-articulation; articulatory movements are more prominent in the context of /a/.

Note that Yehia et al. (1998) also reported that the lowest correlation coefficients were usually associated with the smallest amplitudes of motion. As expected, voicing had little effect on the correlations, because the vocal cords, which vibrate when the consonant is voiced, are not visible. The correlations for the lingual places of articulation were in general higher than glottal and bilabial places. This result can be explained by the fact that, during bilabial production, the maximum constriction is formed at the lips, the tongue shape is not constrained, and therefore one data stream cannot be well predicted from another data stream. Similarly, for /h/, with the maximum constriction at the glottis, the tongue shape is flexible and typically assumes the shape of the following vowel.

In general, the prediction of one data stream from another was worse for the plosives than for other manners of articulation. This result is expected, because plosive production involves silence, a very short burst, and a rapid transition into the following vowel, which may be difficult to capture. For example, during the silent period, the acoustics contain no information, while the face is moving into position. In this work, the frame rate was 120 Hz, which is not sufficient to capture rapid acoustic formant transitions. In speech recognition, a variable frame rate method has been used to deal with this problem by capturing more frames in the transition regions (Zhu and Alwan, 2000).

Figure 4.6(a) shows that the RMS energy (E) and the 2nd LSP pair, which approximately corresponds to the 2nd formant frequency, were better predicted from articulatory (OPT and EMA) data than the other LSP pairs, as also reported by Yehia et al. (1998). One hypothesis is that the RMS energy is highly related to mouth aperture, and mouth aperture is well represented in both EMA and OPT data. In addition, the 2nd formant has been shown to be related to acoustic intelligibility (Langereis et al., 2000) and lip movements (Grant and Seitz, 2000; Yehia et al., 1998). Grant and Seitz (2000) reported that lip area functions were correlated best with the F2 envelope and least with the F1 envelope. Regarding the OPT channels, chin movements were presumably the easiest to predict because they are directly related to mouth opening, while cheek movements are not directly related to any single articulator. Not surprisingly, the larger movements of the tongue body have a greater effect on the acoustics (especially on F1 and F2) than do the smaller tongue tip movements.

Syllable-dependent prediction shows that vowel effects were prominent for all CVs. Hence, if a universal estimator were applied to all 69 CVs, correlations should decrease. This hypothesis was tested, and the results are shown in Figure 4.7. In general, syllable-dependent prediction yielded the best correlations, followed by vowel-dependent prediction, while syllable-independent prediction gave the worst results. An exception is that, for the prediction of LSPs from OPT data, the syllable-dependent prediction yielded the lowest correlations. These results show that there were significant differences between the predictions of the different CV syllables. Although vowel-dependent predictions gave lower correlations than syllable-dependent predictions, they were much better than syllable-independent predictions, suggesting that the vowel context effect was significant in the relationship between speech acoustics and articulatory movements.

Note that, compared with syllable-independent predictions, vowel-dependent predictions were performed with smaller data sets defined by vowel context.

4.4. Examining the Relationships between Data Streams for Sentences

Sentences (DcorSENT) were also analyzed to examine whether the results were similar to those obtained from the CV database.

4.4.1. Analysis

For sentence-independent predictions, the 12 utterances (three sentences repeated four times) were divided into four parts, where each part had one repetition of each sentence, and then a Jackknife training and testing procedure was used.
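As a concrete illustration of this procedure, the following minimal Python/NumPy sketch performs the four-fold jackknife with ordinary least-squares (multilinear) regression. The array layout, variable names, and the per-channel averaging of correlations are illustrative assumptions, not the exact implementation used in this work.

    import numpy as np

    def jackknife_correlation(source_parts, target_parts):
        # source_parts/target_parts: lists of four frame-synchronous arrays,
        # each of shape (frames_k, channels); part k holds one repetition of
        # each of the three sentences (hypothetical data layout).
        fold_corrs = []
        for k in range(4):
            X_tr = np.vstack([p for i, p in enumerate(source_parts) if i != k])
            Y_tr = np.vstack([p for i, p in enumerate(target_parts) if i != k])
            x_mean, y_mean = X_tr.mean(axis=0), Y_tr.mean(axis=0)
            # Mean subtraction followed by least-squares multilinear regression.
            W, *_ = np.linalg.lstsq(X_tr - x_mean, Y_tr - y_mean, rcond=None)
            Y_hat = (source_parts[k] - x_mean) @ W + y_mean
            # Pearson correlation per predicted channel, averaged over channels.
            r = [np.corrcoef(Y_hat[:, j], target_parts[k][:, j])[0, 1]
                 for j in range(Y_hat.shape[1])]
            fold_corrs.append(np.mean(r))
        return float(np.mean(fold_corrs))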


4.4.2. Results

Table 4.2 lists the results of the sentence-independent predictions for the four talkers. Note that talker F1, who had the lowest intelligibility rating based on sentence stimuli, gave the poorest prediction of one data stream from another. In general, the relationship between EMA and OPT data was relatively strong. The predictions of articulatory data from LSPs were better than the predictions of LSPs from articulatory data. LSP data were better predicted from OPT data than from EMA data. OPT data, which include mouth movements, yielded better prediction of RMS energy than did EMA data.

Table 4.2 Sentence-independent prediction.

            M1     M2     F1     F2
OPT→EMA     0.61   0.68   0.47   0.71
OPT→LSP     0.57   0.61   0.47   0.57
OPT→E       0.71   0.70   0.67   0.63
EMA→OPT     0.65   0.51   0.50   0.61
EMA→LSP     0.36   0.45   0.42   0.49
EMA→E       0.27   0.43   0.45   0.52
LSPE→OPT    0.59   0.67   0.59   0.62
LSPE→EMA    0.54   0.68   0.61   0.68

Figures 4.8(a-c) illustrate predictions for the sentences as a function of individual channels. For OPT data predicted from EMA or LSPE, chin movements were the best predicted and cheek data were the worst predicted, as was found for CVs. For EMA data predicted from OPT data, the tongue tip pellet (TT) was better predicted than the tongue back (TB) and tongue middle (TM) pellets, which differs from the CV results and suggests that the tongue tip was more strongly coupled with face movements during sentence production. For EMA data predicted from LSP, however, TT was the worst predicted among the three tongue pellets, as was found for CVs. For the acoustic features, the 2nd LSP pair was more easily recovered than the other LSP pairs and, unlike for CVs, even better than the RMS energy (E).

Figure 4.8 Prediction of individual channels for the sentences.

4.4.3. Discussion

As with the CV database, the data from talker F1 gave the poorest predictions of one data stream from another, while the data from talker F2, who had the highest intelligibility rating, did not always give the best predictions. Hence, it is not clear to what extent the obtained correlations are related to the factors that drove the intelligibility ratings for these talkers. In general, the data from the two males gave better predictions than those from the two females. This may be related to gender or to some other factor such as the talkers' face sizes; note that talker M1 had the largest face among the four talkers. As with CVs, the tongue motion could be recovered quite well from facial motion, as also reported by Yehia et al. (1998). Unlike with CVs, LSP data were better predicted from OPT than from EMA data. This is somewhat surprising, because the tongue movements should be more closely related to speech acoustics than the face movements are. However, as discussed by Yehia et al. (1998), this may be due to incomplete measurements of the tongue (sparse data). It may also be due to the fact that the tongue's relationship to speech acoustics is nonlinear. OPT data, which include mouth movements, yielded better prediction of RMS energy than did EMA data. Predictions of one data stream from another were much worse for sentences than for CV syllables. This is expected, because multilinear regression is more applicable to short segments, where the relationships between two data streams are approximately linear, and because context information should affect the correlations to some extent.

For EMA data predicted from OPT data, TT was better predicted than TB and TM, which was different from the CV results and suggests that the tongue tip was more strongly coupled with face movements during sentence production. A possible reason is that, during sentence production, the TT was more closely related to the constriction and the front cavity than were the TB and TM. For EMA data predicted from LSPE, however, TT was the worst predicted among the three tongue pellets, as was found for CVs.

The correlations in Table 4.2 were lower than those reported in two similar studies (Barker and Berthommier, 1999; Yehia et al., 1998). The differences, however, may result from different databases and different channels considered for analysis. In Yehia et al. (1998), four pellets on the tongue, one on the lower gum, one on the upper lip, and one on the lower lip were used in the analysis, which should give better prediction of face movements and speech acoustics because more EMA pellets, including several co-registered pellets, were used. Other differences include the fact that their EMA data were low-pass filtered at 7.5 Hz, the audio was recorded at 10 kHz, and 10th-order LSP parameters were used. However, there are several observations common to Yehia et al. (1998) and this work. For example, the correlations between face and tongue movements were relatively high, articulatory data could be well predicted from speech acoustics, and, for sentences, speech acoustics could be better predicted from face movements than from tongue movements.

In Barker and Berthommier (1999), nonsense V1CV2CV1 phrases were used, and face movements were represented by face configuration parameters derived from image processing. It is difficult to interpret results about correlations with face movements unless it is known which points on the face are being modeled. More specifically, it is difficult to include all the important face points and exclude unrelated points. Therefore, if the previous studies tracked different face points, then they would naturally obtain different results; differences could also be due to talker differences. In this work, results were talker-dependent. This is understandable, given that different talkers have different biomechanics and control strategies.


4.5. Prediction Using Reduced Data Sets

4.5.1. Analysis

In Sections 4.3 and 4.4, all available channels of one data stream were used to estimate all available channels of another data stream. For the EMA data, each channel represents a single pellet (TB, TM, or TT). For the optical data, the retro-reflectors were classified into three groups (lips, chin, and cheeks). In the analyses above, all three EMA pellets and all three optical groups were used. As a result, it is difficult to determine how much each channel contributes to the prediction of another data set; for example, how many EMA pellets are crucial for predicting OPT data? Predictions using EMA and OPT data were therefore re-calculated using only one of the three EMA pellets (TB, TM, or TT) and only one of the three optical sets (cheeks, chin, or lips) for syllable-dependent and sentence-independent predictions.


4.5.2. Results

Figure 4.9 Using reduced data sets for (a) syllable-dependent and (b) sentence-independent predictions of one data stream from another.

Figures 4.9(a-b) show the prediction results using reduced EMA and optical sets in syllable-dependent and sentence-independent predictions.

For syllable-dependent prediction, when predicting LSP from OPT, the chin retro-reflectors were the most informative channels, followed by the cheek, and then the lip retro-reflectors. Surprisingly, using all OPT channels did not yield better prediction of LSP. When predicting LSP or OPT from EMA, the TB, TM, and TT pellets did not function differently, and the all-channel prediction yielded only slightly better correlations; with only one pellet on the tongue, OPT data were still predicted fairly well. When predicting EMA from OPT, the different optical channels functioned similarly, and all-channel prediction did not yield higher correlations.

For sentence-independent prediction, when predicting LSP or EMA from OPT, the lip retro-reflectors were more informative than the cheek and chin retro-reflectors, and the all-channel prediction yielded more information about LSP or EMA. This was different from the predictions of CVs. When predicting OPT or LSP from EMA, the all-channel predictions yielded higher correlations than just one EMA channel. Note that TT provided the most information about OPT data, followed by TM, and then TB.

4.5.3. Discussion

For syllable-dependent predictions, the TB, TM, and TT pellets did not function differently, and using all channels yielded only slightly better prediction, which implies a high redundancy among the three EMA pellets. (Here, redundancy means that, when predicting face movements or speech acoustics from tongue movements, combined channels did not give much better predictions than one channel. To examine the absolute level of redundancy between channels, correlation analysis should be applied among the EMA pellets and among the chin, cheek, and lip retro-reflectors.) Such redundancy resulted in a fairly good prediction of OPT data using only one pellet on the tongue. When predicting EMA from OPT, the different optical channels functioned similarly, and all-channel prediction did not yield higher correlations, which implies that each facial region alone contains enough information about the EMA data. Note that cheek movements alone could predict tongue movements well, which shows the strong correlation of cheek movements with midsagittal movements, as also reported by Yehia et al. (1998).

For sentence-independent prediction, when predicting OPT or LSP from EMA, the all-channel predictions yielded higher correlations than just one EMA channel. This implies that the differences among the three tongue pellets were stronger for sentences than for CVs. This may be because, during CV production, talkers may have attempted to emphasize every sound, which resulted in more constrained tongue movements; it is presumably also because, for CVs, the largest spatial variation was in the vowels, whose prediction scores would be high. When predicting LSP or EMA from OPT, the lip retro-reflectors were more informative than the cheek and chin retro-reflectors, and all-channel prediction yielded better prediction of LSP or EMA, which is different from the results for CVs. Note that TT provided the most information about OPT data, followed by TM, and then TB.

4.6. Predicting Face Movements from Speech Acoustics Using Spectral Dynamics

4.6.1. Analysis

A dynamical model based on the autocorrelation of speech acoustics and face movements was proposed to model the articulatory dynamics. The objective was to use dynamical features while maintaining the low dimensionality of the system. A database of three sentences (DcorSENT) was used. Figures 4.10 and 4.11 illustrate the autocorrelations of the LSPs and face movements for each talker and for each of the 12 spoken sentences (superimposed lines). The autocorrelations were calculated for each sentence as a function of the autocorrelation lag. These figures verify that each frame is correlated with its neighboring frames and that, after about 15 frames, no further correlations are evident. Recall that the frame length is 24 ms and the frame shift is 8.33 ms. The correlations for face movements decrease slightly more slowly than those for the LSPs. A simple curve can be used to approximate the autocorrelations:

    R(n) = 0.9^{\,|n|}, \qquad n \in (-\infty, +\infty)    (4.1)
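The per-sentence autocorrelations in Figures 4.10 and 4.11 can be reproduced in outline by the following NumPy sketch; the mean removal and the normalization by the zero-lag value are assumptions about the exact normalization used.

    import numpy as np

    def normalized_autocorrelation(x, max_lag=30):
        # x: one parameter track (e.g., a single LSP channel) within a sentence.
        x = np.asarray(x, dtype=float) - np.mean(x)
        r = np.array([np.dot(x[:len(x) - n], x[n:]) for n in range(max_lag + 1)])
        return r / r[0]

    # Simple model of Equation 4.1 for comparison with the measured curves.
    lags = np.arange(31)
    model = 0.9 ** lags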

Figure 4.10 Autocorrelations of LSPs.

Figure 4.11 Autocorrelations of face movements.

Hence, the following causal (forward: F) filter and non-causal (backward: B) filter can be used to model the effect of speech dynamics on the LSPs:


    H_F(z) = \frac{1}{1 - 0.9\,z^{-1}}    (4.2)

    H_B(z) = \frac{1}{1 - 0.9\,z}    (4.3)

    H_{BF}(z) = \frac{1}{1 - 0.9\,z} + \frac{1}{1 - 0.9\,z^{-1}}    (4.4)

For the LSP parameters, applying the filters in Equations 4.2, 4.3, and 4.4 produced three new data streams, denoted the FD (forward dynamical), BD (backward dynamical), and BFD (backward-forward dynamical) LSP streams, respectively. The static LSP parameters (the same as those in Chapter 2) are denoted as S features. Figure 4.12 shows the frequency response of the BF filter, which is effectively a low-pass filter with a cutoff frequency of less than 10 Hz.
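A minimal sketch of how the three dynamical streams can be generated from the static LSP trajectories with the filters of Equations 4.2-4.4 is given below; SciPy's lfilter implements the causal filter, and the anti-causal filter is realized by time reversal. The function and array names are illustrative.

    import numpy as np
    from scipy.signal import lfilter

    def dynamic_lsp_streams(lsp):
        # lsp: static LSP trajectories, shape (frames, channels).
        b, a = [1.0], [1.0, -0.9]
        fd = lfilter(b, a, lsp, axis=0)                # Eq. 4.2, causal (forward)
        bd = lfilter(b, a, lsp[::-1], axis=0)[::-1]    # Eq. 4.3, anti-causal (backward)
        bfd = fd + bd                                  # Eq. 4.4, backward-forward sum
        return fd, bd, bfd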

Figure 4.12 Frequency response of the BF filter.

The autocorrelations for face movements and LSPs are also related to the speaking rates of the talkers. Table 4.3 lists the sentence durations for the four talkers.


Talkers M1 and F2 spoke relatively slowly, while talker F1 spoke the fastest. Speaking rates were also reflected in Figures 4.10 and 4.11: the curves for talkers F1 and M2 decline more quickly than those for talkers M1 and F2. Speaking rate did not seem to influence visual intelligibility significantly. For example, talker M2 spoke fast, but he had a higher perceived intelligibility rating than talker M1, as reported by deaf lipreaders.

Table 4.3 Sentence durations (in seconds) for the four talkers.

      Sentence 1   Sentence 2   Sentence 3
M1    5.6 (0.1)    7.6 (0.3)    2.6 (0.3)
F1    4.9 (0.3)    6.4 (0.2)    1.7 (0.1)
M2    4.8 (0.1)    7.1 (0.2)    2.1 (0.1)
F2    6.1 (0.3)    7.8 (0.2)    2.6 (0.3)

4.6.2. Results of Correlation Analysis Using Dynamical Information

In (Barker and Berthommier, 1999; Yehia et al., 1998), training was performed on one data set, and testing was performed on another data set. In this section, training and testing for predicting OPT from LSPE (see Figure 3.2) were performed on the same utterance. Thus, the variability incurred by recording was no longer a factor in the analyses. A correlation coefficient between face movements and those predicted from speech acoustics was calculated for each utterance, and a mean and a standard deviation were computed for the four repetitions of each sentence.

In this section, there were four different LSP streams: static, forward-dynamical, backward-dynamical, and backward-forward dynamical features. These LSP streams, alone or in combination, can be used to predict the optical data. Detailed results are listed in Tables 4.4-4.11.

Table 4.4 Correlation coefficients obtained using S features.

      Sentence 1    Sentence 2    Sentence 3
M1    0.68 (0.03)   0.69 (0.02)   0.78 (0.04)
F1    0.69 (0.01)   0.64 (0.01)   0.90 (0.02)
M2    0.71 (0.02)   0.67 (0.03)   0.86 (0.04)
F2    0.68 (0.02)   0.67 (0.03)   0.81 (0.05)

Table 4.5 Correlation coefficients obtained using FD features.

      Sentence 1    Sentence 2    Sentence 3
M1    0.77 (0.02)   0.66 (0.02)   0.88 (0.03)
F1    0.74 (0.01)   0.67 (0.02)   0.95 (0.01)
M2    0.75 (0.01)   0.65 (0.02)   0.88 (0.02)
F2    0.79 (0.01)   0.75 (0.02)   0.90 (0.04)

Table 4.6 Correlation coefficients obtained using BD features.

      Sentence 1    Sentence 2    Sentence 3
M1    0.79 (0.05)   0.77 (0.02)   0.90 (0.03)
F1    0.72 (0.03)   0.70 (0.03)   0.95 (0.01)
M2    0.74 (0.03)   0.69 (0.03)   0.93 (0.01)
F2    0.65 (0.03)   0.66 (0.03)   0.93 (0.02)

Table 4.7 Correlation coefficients obtained using BFD features.

      Sentence 1    Sentence 2    Sentence 3
M1    0.85 (0.03)   0.79 (0.02)   0.93 (0.01)
F1    0.81 (0.02)   0.74 (0.03)   0.96 (0.01)
M2    0.83 (0.03)   0.74 (0.03)   0.95 (0.01)
F2    0.81 (0.02)   0.80 (0.03)   0.96 (0.01)


Table 4.8 Correlation coefficients obtained using S+FD features.

      Sentence 1    Sentence 2    Sentence 3
M1    0.82 (0.02)   0.76 (0.03)   0.92 (0.04)
F1    0.82 (0.01)   0.77 (0.01)   0.98 (0.01)
M2    0.84 (0.01)   0.77 (0.02)   0.95 (0.01)
F2    0.82 (0.01)   0.80 (0.02)   0.95 (0.02)

Table 4.9 Correlation coefficients obtained using S+BD features.

      Sentence 1    Sentence 2    Sentence 3
M1    0.84 (0.03)   0.81 (0.02)   0.93 (0.02)
F1    0.82 (0.01)   0.77 (0.01)   0.98 (0.01)
M2    0.85 (0.01)   0.78 (0.02)   0.96 (0.01)
F2    0.80 (0.01)   0.77 (0.02)   0.95 (0.01)

Table 4.10 Correlation coefficients obtained using S+BFD features.

      Sentence 1    Sentence 2    Sentence 3
M1    0.88 (0.02)   0.82 (0.02)   0.97 (0.01)
F1    0.86 (0.01)   0.79 (0.02)   0.99 (0.00)
M2    0.87 (0.03)   0.79 (0.02)   0.97 (0.01)
F2    0.84 (0.02)   0.82 (0.02)   0.98 (0.01)

Table 4.11 Correlation coefficients obtained using S+FD+BD features.

      Sentence 1    Sentence 2    Sentence 3
M1    0.93 (0.01)   0.87 (0.01)   0.98 (0.01)
F1    0.91 (0.02)   0.86 (0.02)   0.99 (0.00)
M2    0.93 (0.01)   0.85 (0.02)   0.99 (0.00)
F2    0.90 (0.02)   0.87 (0.02)   0.99 (0.00)

Table 4.4 lists the correlations obtained using only the static LSP features, which had 17 channels. For Sentences 1 and 2, the correlations ranged from 0.64 to 0.71, while for Sentence 3 they ranged from 0.78 to 0.90. If training and testing were performed on different utterances, these correlations would be lower. Tables 4.5-4.7 list the correlations obtained using the FD, BD, and BFD LSP features, respectively. Tables 4.8-4.11 list the correlations obtained using the static LSP features together with the filtered LSP features or their combination.

To compare the overall performance, an average correlation was computed across all talkers and all sentences. These results are listed in Tables 4.12 and 4.13.

Table 4.12 Average correlation coefficients using individual LSP streams.

              S      FD     BD     BFD
Average       0.73   0.78   0.79   0.85
Improvement   -      7%     8%     16%

Table 4.12 lists correlations obtained using S, FD, BD, or BFD features (LSP) individually. This table shows that BFD filtering was better than using FD or BD filtering alone although they had the same number of channels (17).

Table 4.13 Average correlation coefficients using combined features.

              S+FD   S+BD   S+FD+BD   S+BFD
Average       0.85   0.86   0.92      0.88
Improvement   16%    18%    26%       21%

Table 4.13 lists the correlations obtained using the static LSP features together with the FD, BD, FD+BD, and BFD LSP features, respectively. Both the FD and BD features yielded improvements of about 17% with an additional 17 channels each. With an additional 34 channels (FD+BD), the improvement was about 26%.
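The improvement percentages in Tables 4.12 and 4.13 are consistent with relative gains over the static-feature average; the values below are copied from the tables.

    # Relative improvement over the static (S) baseline of Tables 4.12 and 4.13.
    baseline = 0.73
    for name, r in [("FD", 0.78), ("BD", 0.79), ("BFD", 0.85), ("S+FD", 0.85),
                    ("S+BD", 0.86), ("S+FD+BD", 0.92), ("S+BFD", 0.88)]:
        print(f"{name}: {100 * (r - baseline) / baseline:.0f}%")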


4.6.3. Discussion

Using FD or BD LSP features together with static features can improve the correlations between face movements and those predicted from speech acoustics by about 17%. When using FD and BD together with static features, the improvement was about 26%. Using the backward-forward filter was better than using the backward or forward filter alone.

When only using BFD LSP features, the correlations were higher than using static LSP features, although the number of channels is the same. This is because face movements are low-frequency movements, and hence, filtering the LSPs should result in better correlations. The results demonstrate that dynamical information in speech acoustics is important for prediction of face movements. However, dynamical constraints should differ from phoneme to phoneme.

4.7. Summary

In this chapter, relationships among face movements, tongue movements, and acoustic data were quantified through correlation analyses on CVs and sentences using multilinear regression. These correlations were further examined using reduced datasets. Finally, a dynamical acoustic spectral model was proposed to enhance the relationship between face movements and speech acoustics.

Predictions for syllables yielded higher correlations than those for sentences, which suggests a strong context effect in the relationships. The relationships between face movements and speech acoustics varied from vowel to vowel and from consonant to consonant. This suggests that the relationships are most likely nonlinear, or at best only locally linear.

For CV syllables, multilinear regression was successful in predicting articulatory movements from speech acoustics, and the correlations between tongue and face movements were high. LSP data were better predicted from EMA than from OPT.

Another observation about these correlations was the asymmetry of the predictions. In general, articulatory movements were easier to predict from speech acoustics than the reverse.

One reason for this asymmetry in estimating acoustic versus articulatory data may be that the acoustic data are more redundant than the facial or tongue data. The results reported in this chapter did not show a clear effect of talker intelligibility, although the data from the two males gave better predictions than those from the two females. Note that the data from talker M1, who had the largest face among the four talkers, yielded reasonably good predictions. Talker F1, who had the lowest intelligibility rating, yielded the lowest correlations; however, the talker with the highest intelligibility rating did not always produce the highest correlations.

Results also showed that the prediction of C/a/ syllables was better than that of C/i/ and C/u/ syllables. Furthermore, vowel-dependent predictions produced much better correlations than syllable-independent predictions. Across the different places of articulation, lingual places in general resulted in better predictions of one data stream from another than bilabial and glottal places. Among the manners of articulation, plosive consonants yielded lower correlations than the others, while voicing had no influence on the correlations.

For both syllable-dependent and sentence-independent predictions, the prediction of individual channels was also examined. The chin movements were the best predicted, followed by the lips, and then the cheeks. Regarding the acoustic features, the 2nd LSP pair and the RMS energy were better predicted than the other LSP pairs. The tongue and face movements are more related to the front cavity of the vocal tract, and thus correlated well with the 2nd formant. The RMS energy could be reliably predicted from face movements. The internal tongue movements could not predict the RMS energy and LSPs well over long stretches (sentences), although these acoustic features were predicted reasonably well from the tongue for short segments (CVs).

When predicted from LSP, TB and TM were better predicted than TT. When predicted from OPT, TT was the best predicted for sentences but the worst for CVs.

Another question examined was the magnitude of predictions based on reduced data sets. For both CVs and sentences, a high level of redundancy was found among TB, TM, and TT and among the chin, cheek, and lip movements. One implication is that the cheek movements could convey significant information about the tongue and speech acoustics, but these movements were redundant to some degree if chin and lip movements were present. For CVs, using one channel or all channels did not make a difference, except when predicting LSPs from OPT, where the chin movements were the most informative. For sentences, using all channels usually resulted in better prediction; lip movements were the most informative when predicting LSP or EMA, and TT was the most informative channel when predicting LSP or OPT.

The results in Section 4.6 demonstrated that dynamical information in speech acoustics was important for prediction of face movements. However, dynamical constraints appear to be phoneme-dependent.


CHAPTER 5. THE RELATIONSHIP BETWEEN VISUAL SPEECH PERCEPTION AND PHYSICAL MEASURES

5.1. Introduction

In this chapter, the relationship between visual speech perception and physical measures (DsimCV; see Chapter 2) was examined. Visual perceptual distances were obtained through the phi-square transform of perceptual stimulus-response confusion matrices. Physical distances were obtained by examining the 3-D face movements of pairs of consonants. Multilinear regression (see Chapter 3) was used to quantify the relationships between visual consonant perception and physical measures. The relationships were further examined using multidimensional scaling (MDS) and phoneme equivalence class (PEC) analyses (see Chapter 3 for a description of these algorithms).

5.2. Background

A descriptive method was used to examine the relationship between physical measures and visual speech perception. The only related study was that of Montgomery and Jackson (1983). The authors examined the relationship between visual vowel perception and physical characteristics in an experiment with four female talkers, 10 viewers, and 15 vowels in /hVg/ nonsense words. Because these investigators assumed a traditional description of vowels as relatively steady and of long duration, they used a set of static descriptors to define the physical characteristics: lip height (H), lip width (W), lip aperture (AREA), acoustic duration (AD), and visual duration (VD) (see Figure 5.1). However, their study was limited in that it considered only vowels, the results were talker-dependent, few physical measures were used, and dynamic information was completely absent from the analysis.

Figure 5.1 A diagram of Montgomery and Jackson's (1983) study.

The analysis in this chapter extended Montgomery and Jackson's (1983) study in two directions: it focused on consonants, and it used additional physical measures. Relationships between the perception of spoken nonsense syllables, covering a range of consonants and vowels, and the physical attributes of the stimuli were examined (see Figure 5.2). Visual consonant perception was examined using consonant-vowel (CV) syllables. An optical motion capture system was used to accurately capture the talkers' face markers (including chin and cheek markers) as they moved (see Chapter 2 for details). Physical measures were characterized in terms of the dynamic positions of the face markers (retro-reflectors). A forced-choice perceptual task was used to obtain stimulus-response confusion matrices (see Chapter 2 for details).

Figure 5.2 The relationship between visual consonant perception and physical measures.

The analyses were organized as follows: (1) The optical recordings were first segmented to extract the consonant segments. (2) Euclidean distances were computed between every consonant pair for each face marker; as a result, physical distance matrices (one row per face marker coordinate) were obtained. (3) The perceptual stimulus-response confusion matrices were transformed into dissimilarity matrices using the phi-square transformation, and dissimilarity vectors were obtained by exploiting the symmetry of the dissimilarity matrices. (4) Multilinear regression was then used to map physical distance matrices to perceptual dissimilarity vectors, and Pearson correlations were computed to assess these relationships. These correlations, however, can only indicate the general validity of the relationships. Thus, (5) further analyses were necessary to study these relationships in detail.

During the process of predicting perceptual data from physical measures, perceptual dissimilarity matrices and those predicted from physical measures were obtained. These matrices, however, can be regarded as raw data and provide no obvious relationships between visual consonant perception and physical measures among the consonants. If the consonants can be placed in a 2-D or 3-D space, then their relationships will be very easy to observe (Eriksen and Hake, 1955; Walden and Montgomery, 1975). Therefore, to examine the details of the relationship, MDS (Kruskal and Wish, 1978; see Chapter 3) was applied to both the perceptual dissimilarity matrices and those predicted from physical measures.

Multidimensional scaling (MDS) reduces the dimensionality of the data while keeping the most important dimensions, thereby facilitating the understanding of relationships. The reduction in dimensionality is particularly important given that the movements of the speech articulators (i.e., tongue, lips, jaw, larynx, and velum) are semi-independent; the jaw alone has six-dimensional movements (Vatikiotis-Bateson and Ostry, 1995). In this dissertation, MDS was used to yield spatial representations of visual consonant perception and of the perception predicted from physical measures. By comparing their spatial structures, one can examine how well visual consonant perception is predicted from physical measures. Furthermore, because MDS collapses high-dimensional data into low-dimensional data, the correlations can also be examined in spaces of different dimensionality to assess the stability of these relationships. Nevertheless, MDS provides many details, and high-level classification is not straightforward. Therefore, it is also desirable to emphasize results at a categorical level, while ignoring the internal structure.

To further examine the lipreading results at a higher level, a phoneme equivalence class (PEC) analysis (Auer and Bernstein, 1997; Iverson et al., 1998; see Chapter 3) was used (for both the perceptual data and the data predicted from physical measures) to examine the grouping of consonants while ignoring the internal structure. Human speech perception is often assessed according to the grouping of acoustic sounds or visual gestures. PECs can serve as a high-level (phoneme categorization) metric to examine how well visual consonant perception can be predicted from physical measures. By examining the number of PECs and how the phonemes are grouped, the goodness of the structure can be easily seen, and the contribution of each part of the face to visual speech perception can be examined qualitatively.

5.3. Method

5.3.1. Analyses of Perceptual Data

Prior to the application of the multilinear regression, MDS, and PEC analyses, each of the perceptual confusion data matrices (see Chapter 2) was transformed using the phi-square statistic (Iverson et al., 1998; see Chapter 3 for a description of these algorithms). The resulting matrices were symmetric. This yielded 253 different stimulus pairs for the 23 consonants. The notation VD^{T,V} represents the visual perceptual distances for talker T in vowel context V, organized as a 253-component, one-dimensional vector. Twenty such vectors, corresponding to the 20 raw stimulus-response confusion matrices, were computed.
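For reference, each 253-component vector is simply the upper triangle of the corresponding symmetric 23 x 23 dissimilarity matrix (23 x 22 / 2 = 253). A NumPy sketch of this vectorization follows; the pair ordering is an assumption.

    import numpy as np

    def dissimilarity_vector(D):
        # D: symmetric 23 x 23 phi-square dissimilarity matrix.
        rows, cols = np.triu_indices(D.shape[0], k=1)   # 253 unordered pairs
        return D[rows, cols]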


5.3.2. 3-D Optical Signal Analyses

Figure 5.3 Consonant segmentation in a /sa/ syllable.

The sampling frequency for the optical data (DsimCV) was 120 Hz. Analyses of 3-D motion were restricted to the initial part of the CV syllables, during the consonant onset, transition, and the initial part of the vowel. In order to have an objectively selected starting point for the analysis of each syllable, the acoustic onset was used because the starting point of the audio signal was a reliable event. However, speech movements often are initiated prior to acoustic signal onsets. Therefore, the beginning point for optical signal analysis was set at 30 ms prior to the acoustic onset (dashed line in Figure 5.3, that is, in the middle of the stop closure) to capture onset movements. Analyses were then applied to the 280-ms segment (between the two solid lines in Figure 5.3). At 120 frames/sec, the 280-ms analysis window was equivalent to 34 optical frames. Figure 5.3 illustrates this with a /sa/ token. Although the 23 consonants differed in duration, the 280- ms segment was sufficient to capture the CV transition and the initial portion of the vowel for all of the tokens. For example, for /ba/, a large interval of /a/ is included, while for /sa/, only a short interval of /a/ is included.
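The window selection just described amounts to the following slicing of the optical data; the frame rate and window length come from this chapter, while the rounding convention is an assumption.

    FRAME_RATE = 120            # optical frames per second
    N_FRAMES = 34               # approximately 280 ms

    def consonant_segment(optical, acoustic_onset_s):
        # optical: NumPy array of shape (total_frames, 51); onset time in seconds.
        # The analysis window starts 30 ms before the acoustic onset.
        start = max(int(round((acoustic_onset_s - 0.030) * FRAME_RATE)), 0)
        return optical[start:start + N_FRAMES, :]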


The data for each optical segment were organized into a matrix as follows:

    O^{T,CV,\beta}_{(1:51,\,1:34)} =
    \begin{bmatrix}
      o_{1,1}  & \cdots & o_{1,34}  \\
      \vdots   & \ddots & \vdots    \\
      o_{51,1} & \cdots & o_{51,34}
    \end{bmatrix}    (5.1)

where T, CV, and β refer to the talker, the CV syllable, and the token number, respectively. For example, O^{M1,ba,1}_{(1:51,1:34)} represents the data for the first repetition of syllable /ba/ for talker M1.

Each matrix had 34 columns that represented 34 frames, and 51 rows that represented the optical channels (17 retro-reflectors in a 3-D space).

The physical Euclidean distance between a pair of consonants (C1, C2), with vowel context V for talker T, was measured as follows:

    PO^{T,\,C_1-C_2,\,V}_{m} = \sqrt{ \sum_{j=1}^{2} \sum_{n=1}^{34} \left( O^{T,C_1,V,j}_{m,n} - O^{T,C_2,V,j}_{m,n} \right)^{2} }    (5.2)

where m is the channel number (1-51), n is the frame number, and j is the repetition number. If all the Euclidean distances between the 23 consonants in a vowel context for talker T are put together, a 51 (row) by 253 (column) matrix PO^{T,V} is obtained, where the rows represent optical channels and the columns represent consonant pairs. The distance between consonants in a pair was calculated first frame by frame (across channels), and then the mean was taken across the 34 frames. Therefore, dynamical and geometric information was captured to some degree in the distance measures but was weakened by taking the mean across frames.
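A sketch of this computation is given below, assuming the root-sum-of-squares reading of Equation 5.2 as reconstructed above; the token storage and pair ordering are illustrative.

    import numpy as np
    from itertools import combinations

    def physical_distance_matrix(tokens):
        # tokens: dict mapping each of the 23 consonants to an array of shape
        # (2, 51, 34): two repetitions x 51 optical channels x 34 frames.
        consonants = sorted(tokens)
        pairs = list(combinations(consonants, 2))        # 253 consonant pairs
        PO = np.zeros((51, len(pairs)))
        for col, (c1, c2) in enumerate(pairs):
            diff = tokens[c1] - tokens[c2]               # shape (2, 51, 34)
            PO[:, col] = np.sqrt((diff ** 2).sum(axis=(0, 2)))
        return PO                                        # 51 x 253 matrix

Pooling across talkers or vowels, as in Equations 5.3 and 5.4 below, then amounts to a further root-sum-of-squares over the corresponding matrices.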

If either the talker or the vowel factor is collapsed, then two types of physical distances can be computed, respectively, as:

  2   2   2   2 PO ALL ,V =  PO M1,V  +  PO F1,V  +  PO M2 ,V  +  PO F2,V  (5.3) m, n  m, n   m, n   m, n   m, n 

  2   2   2 POT, aiu =  POT, a  +  POT, i  +  POT, u  (5.4) m,n  m,n   m,n   m,n  where T can be talker M1, F1, M2, F2, or all talkers. For these distances, three subsets of distances were derived based on the physical distance matrices computed for the lip, cheek (chk), and chin (chn) retro-reflectors, respectively. The matrices were

T ,V T ,V T ,V labeled POlip (for the lip markers), POchk (for the cheeks), and POchn (for the chin). In summary, the distance calculations thus resulted in estimates of the dissimilarity of the visual stimuli in terms of the talker, the vowel, and the partition of the talkers’ faces into sub-regions (Table 5.1).

These three subsets were not independent of each other, although they did not contain markers in common. Therefore, when the cheek movements were used to predict visual speech perception, some of the lip and chin information was also included due to the correlations between face movements. Nevertheless, the level of correlations can still show the importance of the cheek movements.


Table 5.1 Physical distance matrices as a function of vowel context and talker.

One matrix was computed for each combination of talker (M1, F1, M2, F2, or ALL), vowel context (C/a/, C/i/, C/u/, or C/aiu/), and face region:

    All retro-reflectors: PO^{T,V}        Lip area:  PO_lip^{T,V}
    Cheek area:           PO_chk^{T,V}    Chin area: PO_chn^{T,V}

with T = M1, F1, M2, F2, or ALL, and V = a, i, u, or aiu.

5.3.3. Consonant Classification, Traditional Visemes, and Phoneme Equivalence Classes

The 23 consonants were also grouped in terms of place of articulation (position of maximum constriction), manner of articulation, and voicing (see Figure 4.1). It is well known that visual speech provides significant information about place of articulation and not as much about manner of articulation or voicing (Binnie et al., 1974; Fisher, 1968; Green and Kuhl, 1991; Kricos and Lesner, 1982). In this dissertation, we attempted to determine quantitatively how much information there was in the signal about articulatory features, and how much of that information was recovered by perceivers.

Traditionally, in the literature on lipreading, consonants have been grouped into 12 visual categories ("visemes"; Kricos and Lesner, 1982): glottal {h}, velars {k g}, palatal {y}, palatoalveolars {ʃ ʒ tʃ dʒ} and {r}, alveolars {l}, {s z}, and {t d n}, dentals {θ ð}, labiodentals {f v}, labial-velar {w}, and bilabials {p b m}. These groupings reflect the fact that many phonemic distinctions are acoustic but not visual. The 12 traditional visemes agree well with the different places of articulation, and they suggest that, in normal situations, it is difficult for perceivers to accurately label/categorize phonemes within one viseme group. For example, if {p b m} are not visually distinct, then lipreading /b/ as /p/ is not a visual error. Note that in actual lipreading sessions the categories depend on the talker (Owens and Blazek, 1985), and the number of categories is often fewer than 12 (Binnie et al., 1974; Fisher, 1968; Kricos and Lesner, 1982). However, the 12 visemes can be used to evaluate lipreading performance (Kricos and Lesner, 1982). For example, if /tʃ/ and /dʒ/ are classified into different visemes, then it would suggest that the talker produces speech in an unusual way or that the lipreading participant has a bias.

5.3.4. Analysis Approach

The analysis approach is outlined in Figure 5.4. Physical measures were first organized into matrices, and then physical distances between consonants were computed. Perceptual stimulus-response confusion matrices were transformed into visual perceptual distances using a phi-square transformation. A multilinear regression method was used to predict visual perceptual distances from physical distances.
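A minimal sketch of this regression step follows: each consonant pair is one observation, the channel distances are the predictors, and the phi-square perceptual distance is the response. The in-sample fit shown here ignores the jackknife of Chapter 3 and is for illustration only.

    import numpy as np

    def predicted_correlation(PO, vd):
        # PO: physical distance matrix (channels x 253);
        # vd: perceptual distance vector (253,).
        X = np.column_stack([np.ones(PO.shape[1]), PO.T])   # 253 x (channels + 1)
        w, *_ = np.linalg.lstsq(X, vd, rcond=None)           # weighted prediction
        vd_hat = X @ w
        return float(np.corrcoef(vd, vd_hat)[0, 1])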

MDS and PEC analyses were applied to examine the perceptual structure and the perceptual structure predicted from physical measures. The measures of visual perception, obtained through the phi-square transformation, are those described in Section 5.3.1; the measures of speech production are those described in Section 5.3.2. For example, in the vowel /a/ context, these measures are referred to as PO^{T,a} (51 x 253; 17 retro-reflectors on the face), PO_lip^{T,a} (24 x 253; 8 retro-reflectors on the lips), PO_chk^{T,a} (18 x 253; 6 retro-reflectors on the cheeks), and PO_chn^{T,a} (9 x 253; 3 retro-reflectors on the chin). The physical measures employed in this chapter included geometry, timing, duration, and, to some degree, dynamic information.


Figure 5.4 A diagram that shows how the analysis of the relationships between visual consonant perception and physical measures was done.


5.4. Results and Discussion

5.4.1. Overall Visual Perception Results

Table 5.2 Lipreading accuracy across talkers and vowel context.

             C/a/               C/i/               C/u/               All vowels
Talker       Mean  Range        Mean  Range        Mean  Range        Mean  Range
M1           0.36  (0.31-0.45)  0.34  (0.30-0.40)  0.30  (0.26-0.41)  0.33  (0.29-0.42)
F1           0.32  (0.28-0.35)  0.32  (0.27-0.36)  0.31  (0.19-0.39)  0.31  (0.26-0.36)
M2           0.39  (0.32-0.47)  0.34  (0.30-0.39)  0.32  (0.20-0.41)  0.35  (0.28-0.42)
F2           0.37  (0.35-0.40)  0.37  (0.35-0.41)  0.32  (0.25-0.39)  0.35  (0.32-0.39)
All talkers  0.36  (0.32-0.42)  0.34  (0.31-0.38)  0.31  (0.23-0.40)  0.34  (0.29-0.40)

Mean percent-correct scores for the perceptual identifications of the consonants are reported in Table 5.2. (A 4x3 repeated-measures ANOVA in a fully nested design was used to examine the effects of talker, vowel, and their interaction.) The talker effect was significant [F(3, 3) = 23.644, p = 0.014]. Identification was more accurate with talker F2 than with talker F1 [F(1, 5) = 29.915, p = 0.003], but no significant difference was obtained between F2 and talkers M1 [F(1, 5) = 3.240, p = 0.132] or M2 [F(1, 5) = 0.160, p = 0.706]. Identification accuracy with talker M2 was higher than with talkers M1 [F(1, 5) = 7.731, p = 0.039] and F1 [F(1, 5) = 17.048, p = 0.009]. Identification accuracy with talker M1 did not significantly differ from that with talker F1 [F(1, 5) = 1.568, p = 0.266]. The vowel effect was also significant [F(2, 3) = 10.412, p = 0.026]. Accuracy for C/a/ syllables was higher than for C/i/ [F(1, 5) = 7.665, p = 0.039] and C/u/ syllables [F(1, 5) = 11.744, p = 0.019], while accuracy for C/i/ and C/u/ syllables was not significantly different [F(1, 5) = 3.281, p = 0.130]. The average lipreading accuracy was 34% (36% correct for C/a/, 34% for C/i/, and 31% for C/u/ syllables). The talker-by-vowel interaction effect was "marginally" significant [F(6, 30) = 2.412, p = 0.051].

Table 5.3 Lipreading accuracy based on 12 Kricos and Lesner (1982) viseme groups across talkers and the vowel context.

Talker       C/a/   C/i/   C/u/   All vowels
M1           0.64   0.64   0.52   0.60
F1           0.65   0.61   0.60   0.62
M2           0.72   0.66   0.55   0.64
F2           0.74   0.71   0.64   0.70
All talkers  0.69   0.66   0.58   0.64

Lipreading accuracy was recomputed based on the 12 Kricos and Lesner (1982) visemes and is shown in Table 5.3. Obviously, a higher accuracy is then obtained: 0.69 for C/a/ syllables, 0.66 for C/i/ syllables, and 0.58 for C/u/ syllables.
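Viseme-level accuracy is taken here to mean that a response is scored correct whenever it falls in the same viseme group as the stimulus; a sketch under that assumption follows, with ASCII phoneme labels standing in for the IPA symbols.

    VISEMES = [{"h"}, {"k", "g"}, {"y"}, {"sh", "zh", "ch", "jh"}, {"r"}, {"l"},
               {"s", "z"}, {"t", "d", "n"}, {"th", "dh"}, {"f", "v"}, {"w"},
               {"p", "b", "m"}]
    GROUP = {c: k for k, g in enumerate(VISEMES) for c in g}

    def viseme_accuracy(counts, labels):
        # counts: NumPy array; counts[i, j] is the number of times stimulus
        # labels[i] received response labels[j].
        n = len(labels)
        correct = sum(counts[i, j] for i in range(n) for j in range(n)
                      if GROUP[labels[i]] == GROUP[labels[j]])
        return correct / counts.sum()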

5.4.2. Predicting Visual Perceptual Measures from Physical Measures

Figures 5.5(a-e) showed the Pearson correlations between visual perceptual distances and physical distances under different conditions. Figure 5.5(a) showed the correlations with unweighted physical measures, that is, with each physical channel equally weighted. Figure 5.5(b) showed the correlations using multilinear regression (each channel unequally weighted). Figures 5.5(c-e) were similar to Figure 5.5(b) except that the perceptual data and physical measures were represented in a 6-D MDS space, a 3-D MDS space, and a 2-D MDS space, respectively.


Figure 5.5 Predicting visual perceptual distances from physical distances.


As expected, the correlations in Figure 5.5(a) were not very high (below 0.65; correlations for talker F1 were below 0.30), because each channel was equally weighted. For example, information on the lips should be more important than that on the cheeks. Thus, weighting would be expected to result in a significant improvement in the correlations. In Figure 5.5(b), across all contexts and retro-reflectors, the correlations were 0.77 (z = 16.13, p = 0) for M1, 0.74 (z = 15.03, p = 0) for F1, 0.85 (z = 19.86, p = 0) for M2, and 0.78 (z = 16.53, p = 0) for F2. [Note that Fisher's z scores (normally distributed; see Hays, 1994), calculated from these correlations (not normally distributed), were used to assess the significance of the relationships.] Talker M2 had the highest correlations (z = 2.64, p = 0.008 vs. M1; z = 3.42, p = 0 vs. F1; z = 2.36, p = 0.018 vs. F2), and talker F1, who had the lowest sentence intelligibility, had the lowest correlations (z = 0.79, p = 0.435 vs. M1; z = 3.42, p = 0 vs. M2; z = 1.06, p = 0.289 vs. F2). The two males tended to produce higher correlations than the two females (z = 2.36, p = 0.018 between M2 and F2; z = 0.79, p = 0.435 between M1 and F1), which could be because the two males had larger faces and larger movements than the two females, leading to higher correlations. Across all talkers, the correlations were 0.77 (z = 16.13, p = 0) for C/a/, 0.78 (z = 16.53, p = 0) for C/i/, 0.84 (z = 19.31, p = 0) for C/u/, and 0.81 (z = 17.82, p = 0) for C/aiu/. Therefore, for the analysis across talkers and vowels, about 66% (r = 0.81 for C/aiu/) of the variance in the visual perceptual distances was accounted for by the physical distances. C/u/ syllables yielded higher correlations between visual perceptual distances and physical distances than C/a/ (z = 2.25, p = 0.025) and C/i/ syllables (z = 1.97, p = 0.049), even though the movements for /u/ were smaller.
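The z values reported above are consistent with Fisher's arctanh transformation scaled by sqrt(n - 3), with n = 253 consonant pairs per condition; a sketch of the two computations follows (the pairwise comparison assumes independent samples of equal size).

    import numpy as np

    N_PAIRS = 253                       # 23 consonants -> 253 pairs

    def fisher_z(r, n=N_PAIRS):
        # e.g., fisher_z(0.77) gives about 16.13, matching the value above.
        return np.arctanh(r) * np.sqrt(n - 3)

    def compare_correlations(r1, r2, n1=N_PAIRS, n2=N_PAIRS):
        # e.g., compare_correlations(0.85, 0.77) gives about 2.64.
        return (np.arctanh(r1) - np.arctanh(r2)) / np.sqrt(1/(n1 - 3) + 1/(n2 - 3))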


Figure 5.5(b) also showed that both the lips and the cheeks were important for visual perception (also true for the MDS analyses). Considering the panel with data for all talkers, if one used lip data only, the correlation across all three vowels (C/aiu/) was 0.74, which was better than the correlations for cheek data only (z = 2.88, p = 0.004) and chin data only (z = 3.39, p = 0.001). Including the cheek and chin data improved the correlation to 0.81 (z = 1.97, p = 0.048). In addition, cheek data yielded the same level of correlations as chin data (z = 0.51, p = 0.610), suggesting that cheek movements helped identify the visual gestures that are important for visual perception.

Figures 5.5(c-e) showed how the correlations varied with MDS spaces of different dimensionality. For the 6-D MDS, the correlations were very similar to those in Figure 5.5(b), suggesting that a 6-D representation of visual perception was sufficient to represent the perceptual confusions. A 3-D MDS could still produce acceptable results, because the difference between Figure 5.5(b) and Figure 5.5(d) was not significant. However, a 2-D MDS space was not sufficient to represent the perceptual data and in general yielded significantly lower correlations. Figures 5.5(c-e) also showed a large decrease in correlations from three dimensions to two dimensions. Because the 3-D MDS analysis yielded as large a correlation with articulation as could be had from the original data, the 3-D representations must have included most of the important information.

The talker differences were of the same magnitude and type across analyses, except for the analysis using unweighted physical distances and the 2-D MDS analysis.


5.5. Multidimensional Scaling Analysis

Figure 5.6 (a) Stress and (b) Variance accounted for (VAF) vs. MDS dimension.

MDS analyses with different numbers of dimensions (2-6) were applied to the perceptual dissimilarity matrices and to those predicted from physical measures, and the resulting stress values and variance accounted for are plotted in Figure 5.6. In all situations, as the number of dimensions increases, the stress value decreases and the variance accounted for (VAF) increases. Figure 5.6 also showed that the lipreading data had the most complex spatial structure (needing more dimensions to represent the data, because their stress values were the largest), followed by the perceptual data predicted from the whole face, the lips, the cheeks, and then the chin. The number of dimensions for perception was higher than that for the prediction from the whole face, suggesting that the data recorded for this dissertation (the 17 retro-reflectors on the face) did not capture all the dimensions used for visual perception. Adding the lips, cheeks, and chin yielded a higher number of dimensions, suggesting that they contained complementary information. However, the dimensionality numbers did not add up linearly, which confirms that there was redundancy of information (partial independence of the chin, cheeks, and lips). Montgomery and Jackson (1983) used 2-D maps to represent the visual vowel perceptual space; the perception of consonants appears to be somewhat more complicated than the perception of vowels in their study. In Figure 5.6(a), there was a large decrease in stress values between two and three dimensions. Four dimensions would be enough to model visual perception (the VAF was above 0.80, and the stress did not decrease much beyond four dimensions). Because a 3-D MDS analysis of the data yielded as large a correlation with articulation as could be had from the original data, the 3-D representations must have included most of the important information. Therefore, a compromise solution using only three dimensions can be adopted. When the number of MDS dimensions was three, the VAF was above 0.70, which was considered reasonable. (Note that a 2-D MDS was not sufficient to represent the consonant confusions.) When the number of dimensions was six, more than 96% of the variance was accounted for.
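A sketch of this dimensionality sweep using scikit-learn's metric MDS is shown below. The stress and VAF definitions used here (Kruskal-type stress-1 and the squared correlation between input dissimilarities and embedded distances) are common choices and may differ from the exact measures plotted in Figure 5.6.

    import numpy as np
    from sklearn.manifold import MDS
    from scipy.spatial.distance import pdist, squareform

    def mds_fit_quality(D, dims=range(2, 7)):
        # D: symmetric dissimilarity matrix with a zero diagonal.
        d = squareform(D)                       # observed dissimilarities
        quality = {}
        for k in dims:
            X = MDS(n_components=k, dissimilarity="precomputed",
                    random_state=0).fit_transform(D)
            d_hat = pdist(X)                    # distances in the embedding
            stress1 = np.sqrt(np.sum((d - d_hat) ** 2) / np.sum(d ** 2))
            vaf = np.corrcoef(d, d_hat)[0, 1] ** 2
            quality[k] = (stress1, vaf)
        return quality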



Figure 5.7 3-D MDS analysis of perceptual consonant confusion from (a) lipreading. Predictions from (b) all face retro-reflectors, (c) lip area, (d) cheek area, and (e) chin area.


In Figure 5.7, the 3-D MDS analysis of the confusion matrix C^{ALL,aiu} and of the corresponding matrices predicted from the whole face, lips, cheeks, and chin are displayed (see Appendix C for 3-D MDS plots of the other confusion matrices). The figure helps visualize the relative positions of the 23 consonants in a visual perceptual space. Figure 5.7(a) demonstrated that the clusters from lipreading had internal structure (e.g., the members of the group /t, d, s, z, ʃ, ʒ, tʃ, dʒ/ do not have identical coordinates) as well as different distances to other clusters (e.g., /t, d, s, z/ is closer to /θ, ð/ than to /r, f, v/). Such internal structure demonstrates that perceivers can differentiate among these consonants but have trouble categorizing them. For the confusion dissimilarity matrices predicted from the whole face and from the lips, the 3-D MDS structures [Figures 5.7(b, c)] were similar to those from the perceptual data. For the confusion dissimilarity matrices predicted from the cheeks and from the chin, the 3-D MDS structures [Figures 5.7(d, e)] were far from those of the perceptual data, with the 23 consonants spread out rather than clustered. Therefore, the cheeks and chin were not sufficient to produce acceptable lipreading results, but combined with the lips they helped to differentiate these consonants and thus made them easier to lipread.

The advantage of MDS is that important dimensions in the perceptual space can be examined. For example, we can address questions such as: Do the three most important dimensions in the data have a phonetic interpretation? Is there a dimension corresponding to voicing, to manner of articulation, or to place of articulation? Such relationships were examined as shown in Figure 5.8. For each perceptual or predicted dissimilarity matrix, an M-dimensional MDS analysis was applied to obtain a constellation of the 23 consonants in a visual perceptual space. From this constellation, it was possible to find underlying dimensions associated with articulatory features or physical measures. Toward this end, multilinear regression was first used to define a best direction, upon which the projections (one-dimensional positions) of the 23 consonants produced the best place-of-articulation classification. A Pearson correlation was then computed between the 23 consonants' projections (one-dimensional positions) and their places of articulation. The same procedure was also applied to voicing and manner of articulation. As shown in Figure 5.8, the correlation between the perceptual data and place of articulation was very high, while visual perception was only modestly related to manner of articulation. These correlations emphasize the fact that visual perception was very good for place, modest for manner, and poor for voicing in pre-vocalic consonants. Also, different numbers of MDS dimensions yielded different correlation levels; in general, the higher the number of dimensions, the better the correlations. This, again, showed that low-dimensional MDS analysis lost information to some degree. Among voicing, manner, and place of articulation, manner depended on the number of dimensions the most, and place the least. Figure 5.8 also showed that the physical measures contained more information about voicing and manner of articulation than was seen in lipreading [Figures 5.8(a, b)]. However, lipreading recovered more information about place of articulation than these physical measures provided. This can be explained in part by the fact that the 17 retro-reflectors were not enough to capture all the information, and the physical measures were obtained by averaging across 34 frames. In Figures 5.8(a, b), the combination of the lips, cheeks, and chin did not result in a higher correlation between the predicted perceptual data and voicing or manner of articulation, suggesting that the lips, cheeks, and chin may give conflicting cues to voicing and manner of articulation. On the contrary, Figure 5.8(c) showed that place of articulation cues were consistent among the lips, cheeks, and chin. In Table 5.3, talker F2 yielded the highest lipreading accuracy based on the traditional 12 visemes, and this may be because, for talker F2, the place of articulation effect was the strongest, as shown in Figure 5.8(c).

Figure 5.8 Visual perception as a function of voicing, manner, and place of articulation.


5.6. Phoneme Equivalence Class (PEC) Analysis

Table 5.4 PECs across vowel context for different talkers.

Source            PECs

Talker M1
  Perception      {w} {m p b} {r f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted-ALL   {w} {m p b} {r f v} {θ ð y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted-lip   {w} {m p b} {f v} {r θ ð y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted-chk   {w} {m p b g} {f v} {θ ð r y l n k h t d} {s z ʃ ʒ tʃ dʒ}
  Predicted-chn   {r m f v} {w p b θ y l n k g h} {ð t d s z ʃ ʒ tʃ dʒ}

Talker F1
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted-ALL   {w r m p b} {f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted-lip   {w r m p b} {f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted-chk   {f v} {w r m p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted-chn   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}

Talker M2
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted-ALL   {w r} {m p b} {f v θ ð} {y l n t k d g h s z ʃ ʒ tʃ dʒ}
  Predicted-lip   {w r} {f v} {m p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted-chk   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted-chn   {w r m p b f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}

Talker F2
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z} {ʃ ʒ tʃ dʒ}
  Predicted-ALL   {w} {m p b} {r f v} {θ ð y l k g h} {n t d s z} {ʃ ʒ tʃ dʒ}
  Predicted-lip   {w} {m p b} {r f v} {θ ð y l n k g h t d s z} {ʃ ʒ tʃ dʒ}
  Predicted-chk   {w m p b} {θ ð y l n k g h t d z} {r f v s ʃ ʒ tʃ dʒ}
  Predicted-chn   {w} {m r f v} {p b θ ð y l n k g h t d} {s z ʃ ʒ tʃ dʒ}

Traditional visemes  {w} {m p b} {r} {f v} {θ ð} {y} {l} {k g} {h} {t d n} {s z} {ʃ ʒ tʃ dʒ}


Table 5.5 PECs across talkers for different vowels.

Source            PECs

C/a/
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w m p b} {r f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b} {r θ ð y l n k g h t d} {f v s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m p b r f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}

C/i/
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h t d} {s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b} {r f v} {θ ð y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w} {m p b t} {r f v} {θ ð y l n k g h d} {s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w r m f v} {p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}

C/u/
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w r m p b} {f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r m p b θ ð} {f v} {y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b h} {f v} {r θ ð y l n k g t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m p b r f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}

C/aiu/
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {m p b} {w r f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w m r f v} {p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b r f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m r f v} {p b θ ð y l n k g h t d} {s z ʃ ʒ tʃ dʒ}

Traditional visemes
                  {w} {m p b} {r} {f v} {θ ð} {y} {l} {k g} {h} {t d n} {s z} {ʃ ʒ tʃ dʒ}

Tables 5.4 and 5.5 (additional tables are listed in Appendix B) list the PECs for each lipreading result (different talkers and vowel contexts) and the corresponding results predicted from physical measures. Table 5.4 lists PECs across vowels for each talker. For each talker, the first row lists the PECs derived from the lipreading results, and the following four rows list the PECs predicted from physical measures using all retro-reflectors, lip retro-reflectors, cheek retro-reflectors, and chin retro-reflectors, respectively. Table 5.5 is similar to Table 5.4, except that the PECs were derived across talkers for each vowel. The PECs were also examined against the traditional 12 visemes. In all cases (whether from lipreading or from prediction from physical measures), the palatoalveolars /ʃ, ʒ, tʃ, dʒ/ formed a group, and so did the labiodentals /f, v/.
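The PECs themselves were obtained with the hierarchical clustering procedure of Chapter 3. The sketch below is a minimal illustration of that idea, not the exact settings used in the study: it assumes a symmetric consonant-by-consonant dissimilarity matrix, uses SciPy's average-linkage clustering as a stand-in for the linkage of Chapter 3, and cuts the dendrogram at an arbitrary threshold to form the equivalence classes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def phoneme_equivalence_classes(dissim, labels, cut):
    """Group phonemes whose clusters merge below the dissimilarity `cut`."""
    condensed = squareform(np.asarray(dissim), checks=False)  # condensed distances
    tree = linkage(condensed, method="average")               # average linkage
    assignment = fcluster(tree, t=cut, criterion="distance")  # cut the dendrogram
    classes = {}
    for label, cluster_id in zip(labels, assignment):
        classes.setdefault(cluster_id, []).append(label)
    return list(classes.values())

# Toy dissimilarities for four phonemes: /p/-/b/ and /s/-/z/ are close pairs.
labels = ["p", "b", "s", "z"]
d = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.9, 0.9],
              [0.9, 0.9, 0.0, 0.1],
              [0.8, 0.9, 0.1, 0.0]])
print(phoneme_equivalence_classes(d, labels, cut=0.5))  # two classes: {p b} and {s z}
```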

In the lipreading results, the bilabials /m, p, b/ always formed a separate group; the dentals /θ, ð/ formed a group and, in most cases, a separate one; the velars /k, g/ formed a group, and so did the alveolars /s, z/ and /t, d/ (but not /t, d, n/); the labial-velar /w/ formed a separate group or was grouped with /r/; and the labiodentals /f, v/ formed a separate group or were grouped with /r/. The resulting PECs were nevertheless in general agreement with the traditional definition of the 12 visemes: the velars /k, g/, palatoalveolars /ʃ, ʒ, tʃ, dʒ/, alveolars /s, z/, dentals /θ, ð/, labiodentals /f, v/, and bilabials /p, b, m/ all formed groups, except that the alveolar /n/ sometimes was not grouped with the alveolars /t, d/. For each talker, the main difference lay in the distinctions among /θ, ð, y, l, n, k, g, h, t, d, s, z, ʃ, ʒ, tʃ, dʒ/. Talker M1 did not produce much difference among this set of consonants; talker F1 singled /θ, ð/ out from the rest; talker M2 further singled out /y, l, n, k, g/; and talker F2 further singled out /t, d, s, z/. Across talkers, the PECs for the confusion matrices C_ALL,a, C_ALL,i, and C_ALL,u were in general similar to those for the confusion matrix C_ALL,aiu. The main difference was that the alveolars /t, d/ and /s, z/ were not grouped for C_ALL,i. As shown in Table 5.2, C/u/ syllables were more difficult to lipread than C/a/ and C/i/ syllables, and this was reflected in Table 5.5 because, in general, fewer PECs were obtained for the confusion matrices C_T,u than for C_T,a and C_T,i.


Tables 5.4 and 5.5 clearly show that the number of PECs participants could identify was much smaller than 12. For example, talker F2, with the highest sentence intelligibility rating, had the highest number of PECs, seven (four for C/a/ and C/u/, and seven for C/i/; see Appendix B). Talker F1, with the lowest sentence intelligibility rating, had only five PECs (five for C/a/ and C/i/, and four for C/u/; see Appendix B). Talker M2, with a medium-high sentence intelligibility rating, had six PECs (six for C/a/, C/i/, and C/u/; see Appendix B). Talker M1, with a medium-high sentence intelligibility rating, had only four PECs (six for C/a/ and C/i/, and five for C/u/; see Appendix B). The number of PECs for each talker was in general agreement with the talker's intelligibility. The visual intelligibility ratings for the four talkers were based on lipreading results from sentences. It appears, however, that visual sentence intelligibility does not depend solely on the number of PECs; it may also depend on many other factors (e.g., vowels, prosody, and which consonants are perceived correctly). For example, the frequencies of /w/, /θ/, and /ð/ are usually high in words, so it is important to identify these consonants correctly (Auer and Bernstein, 1997).

For the results predicted from physical measures, Tables 5.4 and 5.5 showed that the PECs predicted from all face retro-reflectors agreed the best with the lipreading results, followed by those predicted from the lip-area retro-reflectors only. For the low intelligibility talkers (M1 and F1; as perceived by deaf lipreaders), there was not much difference between the PECs predicted from the whole face and those predicted from just the lip area. For talker F1, it seems that the talker did not move her cheeks and chin enough, because these areas did not provide much information; for talker M1, the cheek and chin areas provided some information, but it seemed largely redundant (the talker exaggerated when talking). For the high intelligibility talkers (M2 and F2; as perceived by deaf lipreaders), there was a significant difference between the PECs predicted from the whole face and those from just the lip area, suggesting that cheek and chin movements provided complementary information to achieve high visual intelligibility. For talker M2, the cheek movements were not informative, while talker F2 produced useful cheek and chin movements. Table 5.5 showed another interesting pattern: For C_ALL,a, chin movements were not informative; for C_ALL,i, cheek movements were not informative; for C_ALL,u, chin movements were not informative; and for C_ALL,aiu, cheek movements were not informative. For the PECs predicted from the whole face, C_ALL,a yielded the best results, followed by C_ALL,i, with C_ALL,u the least. This agreed well with the lipreading performance, but was the opposite of the result from the correlation analysis, where C_ALL,u yielded the highest correlations between visual perceptual distances and physical distances. It can be concluded that the cheek and chin areas can provide useful information for lipreading, but those areas alone are not sufficient for lipreading.

5.7. General Discussion

Visual perception results showed that the order of the talkers in terms of syllable identification accuracy partially replicated their order in terms of the sentence intelligibility reported in Section 2.3.1. These accuracy levels were somewhat similar to those reported by Owens and Blazek (1985; 40% correct for /aCa/, 33% for /iCi/, and 24% for /uCu/), but lower than those reported by Iverson et al. (1998; 48% for CVs). Note that talker F2's speech had the highest lipreading accuracy based on the 12 viseme categories, and F2 indeed had the highest visual sentence intelligibility rating. This could be because the perceived visual categories for talker F2 tended to fall into the traditionally defined 12 visemes, while they did not for the other talkers; for example, talker F2 may have produced more prominent facial place-of-articulation cues.

The relationship between perceptual and optical (physical) measures for visual consonants was examined using a variety of methods. The results indicated that high overall correlations (0.77, 0.74, 0.85, and 0.78 for talkers M1, F1, M2, and F2, respectively) between visual perceptual distances and physical distances could be achieved, depending on how the data were analyzed. An MDS analysis of the perceptual and physical data showed that information about place of articulation was recovered well by perceivers, while information about voicing or manner of articulation was not. PEC analysis of the perceptual and physical data revealed that each talker provided his or her own idiosyncratic visual information from different parts of the face, and such differences were somewhat related to the talker's intelligibility. The phoneme-based lipreading accuracies were in general agreement with the talkers' sentence intelligibility ratings. Traditional viseme-based lipreading accuracies showed a slightly different pattern, but they agreed well with the results from the PEC analysis, supporting the view that, for nonsense syllables, traditionally defined visemes are a better basis for measuring intelligibility (Kricos and Lesner, 1982).

The high overall correlations between perceptual and physical measures were related to visual intelligibility ratings and gender. Talkers with high intelligibility ratings yielded more detailed information, and so did the larger faces of the male talkers (Ostberg et al., 1988). At the same distance from perceivers, a larger face would result in more visual information being processed by the perceiver and thus in better lipreading results, and better lipreading results in turn reflect a stronger relationship between the lipreading responses and the visual stimulus. Recall that multilinear regression and Pearson correlation were used to assess the relationship between the two data sets. During lipreading, participants had to guess from time to time. If a talker's sentence intelligibility was low (or the face was small), lipreading was more vulnerable to noise (guessing), and this would result in lower correlations. As for vowel context, C/u/ syllables yielded higher correlations between visual perceptual distances and physical distances than C/a/ and C/i/ syllables, even though the movements for /u/ were smaller. This could be because the movements for /u/ were stronger in the front-to-back direction than those for /a/ and /i/.

Of the face movements, the lips (55%) and cheeks (36%) were somewhat more informative for visual perception than the chin (32%), although the chin was also informative. When all physical measures were combined, about 66% of the variance in visual consonant confusions was accounted for. As anticipated, combining the three sets of physical measures (from the lips, cheeks, and chin) resulted in a higher correlation than using the lips alone.

Montgomery and Jackson (1983) used 2-D maps to represent the visual vowel perceptual space. Results in this chapter showed that a 3-D MDS space was acceptable, while a 2-D MDS space was not sufficient to represent the perceptual data, based on their correlations with the physical measures. From 3-D to 6-D MDS, the correlations did not differ significantly; one speculation is that lipreading consisted mainly of recovering information about place of articulation, which did not depend much on the number of dimensions. As for the physical measures, they were obtained in a 3-D space, and thus 3-D representations are sufficient for them.

An MDS analysis showed that clusters from lipreading had an internal structure as well as different distances to other clusters (Auer and Bernstein, 1997; Iverson et al., 1998). For the confusion dissimilarity matrices predicted from the cheeks and chin, the 3-D MDS structures were far different from those of the perceptual data, but combined with the lips, the cheeks and chin helped to differentiate these consonants and thus made them easier to lipread. The 3-D MDS representations of the perceptual data quantitatively confirmed that lipreading recovers place of articulation the best. The correlation between the perceptual data (represented in MDS space) and manner of articulation decreased rapidly as the number of MDS dimensions decreased, suggesting that manner of articulation by itself had a multidimensional structure in the visual perceptual space. The correlation between the perceptual data and place of articulation did not depend much on the number of dimensions, suggesting that place of articulation can be approximated with only one dimension in the perceptual space. The strong correlation between the perceptual data and manner of articulation suggests that perceivers also obtained manner of articulation information from visual speech, not just place of articulation. This contradicts the general view that manner of articulation is mainly an auditory feature (Binnie et al., 1974; Fisher, 1968; Green and Kuhl, 1991; Ladefoged, 2001; Kricos and Lesner, 1982). It is interesting to note that the physical measures contained more information about voicing and manner of articulation than was recovered by lipreading, while lipreading uncovered more information about place of articulation than our physical measures provide.

Visual perception experiments revealed that the PECs were fewer than 12, and that the PECs were a function of vowel context and the talkers' sentence intelligibility ratings. As expected, higher sentence intelligibility for a talker corresponded to more PECs in the perception of that talker's CV stimuli (with fewer phonemes in each PEC), and the lipreading results showed a strong effect of place of articulation. Across talkers, the PECs were in general similar for the visual perceptual distance vectors VD_ALL,a, VD_ALL,i, and VD_ALL,u. For the results predicted from physical measures, the PECs obtained from all face retro-reflectors agreed the best with the lipreading results, followed by those obtained from the lip area only. However, the cheek and chin areas also contributed significantly to lipreading. PEC analysis further showed that, for the low intelligibility talkers (M1 and F1; as perceived by deaf lipreaders), the lip area contributed almost all the information, and the cheeks and chin either did not move much (talker F1) or their movements were redundant with the lips (talker M1). In contrast, the high intelligibility talkers (M2 and F2; as perceived by deaf lipreaders) exhibited complementary cheek and chin movements to achieve high visual intelligibility. For the intelligibility testing on the four talkers, results from deaf lipreaders indicated that talker M2 was more intelligible than talker M1, while results from hearing lipreaders indicated that talker M1 was more intelligible than talker M2. The reason may be that deaf lipreaders attend to more than the lips for phonemic distinctions, while hearing subjects used the cheek and chin movements only to confirm what they saw around the lips. PEC analysis also showed that when producing C/a/ or C/u/ syllables, the chin movements were low dimensional, while C/i/ syllables had low dimensional cheek movements.

5.8. Summary

In this chapter, the relationship between visual speech perception and physical measures was examined. The correlations between visual perceptual distances and physical distances were related to visual intelligibility ratings, vowel context, areas on the face, and gender. In general, about 66% of the variance in visual consonant confusions was accounted for by the physical measures. Of the face movements, the lips (55%) and cheeks (36%) were more informative for visual perception than the chin (32%). An MDS analysis showed that the physical measures contained more information about voicing and manner of articulation than the perceptual data, while lipreading uncovered more information about place of articulation than the physical measures provide. Visual perception experiments revealed that the PECs were related to vowel context and the talkers' sentence intelligibility ratings. The contribution of different parts of the face to lipreading differed from talker to talker. Interestingly, the PECs obtained from all face retro-reflectors agreed the best with the lipreading results, followed by those obtained from the lip area only.


CHAPTER 6. EXAMINING THE CORRELATIONS BETWEEN FACE MOVEMENTS AND SPEECH ACOUSTICS USING MUTUAL INFORMATION FACES

6.1. Introduction

In the summer of 2003, the author had an internship at the IBM T.J. Watson Research Center and had the opportunity to look at the correlation between face movements and speech acoustics from a different perspective. For the correlation analyses in Chapter 4, the face movements were represented by the 3-D trajectories of a few face markers. In this chapter, the correlation between face movements and speech acoustics was examined using a database of recorded videos and a mutual-information-face technique. A mutual information face is a plot/map of the mutual information between the audio and the video pixels. For comparison, the mutual-information algorithm was also applied to the database DcorSENT (see Chapter 2). The results from these experiments could help in understanding the findings of Chapter 4. Note that the videos were prepared at IBM (Potamianos and Neti, 2003) and that the mutual-information-faces algorithm was implemented by Nock et al. (2002, 2003), also at IBM. Any opinions or conclusions expressed in this chapter do not necessarily reflect the views of the IBM Audio-Visual Speech Technologies group.


6.2. Background

In Chapter 4, face movements were represented using the 3-D trajectories of a few face markers (captured with the QualisysTM motion capture system). Such data can be used to explore the basic relationship between face movements and speech acoustics while distracting factors are ignored. However, recording face movements with a motion capture system is not feasible for real applications. Instead, researchers often need to process video archives, where faces are in 2-D and may appear in different poses. Therefore, the correlation between face movements and speech acoustics was re-examined using a video database recorded in an office environment (Potamianos and Neti, 2003). This correlation can be used for such tasks as assessing lip synchronization quality in movie editing and computer animation, and for finding who is talking in a video (Nock et al., 2002, 2003). For comparison, the mutual-information algorithm was also applied to the audio-visual speech database DcorSENT (see Chapter 2).

6.3. Method

6.3.1. Database

The database (named “OFF”) was described in detail by Potamianos and Neti (2003). The database was recorded in the participants' offices at IBM without the use of a teleprompter; therefore, lighting, background, and head pose varied greatly. The videos were captured using a laptop-based audio-visual data collection system. The built-in laptop microphone was used to record wideband audio, and an inexpensive web-cam was used to record uncompressed video through the USB 2.0 interface. The videos were recorded at 30 Hz and at a 320x280 pixel size. The original database was very large; for the analysis in this chapter, only three video clips (from different individuals) were chosen (see Figures 6.1-6.3).

Figure 6.1 Video clip 1: “nine seven nine nine one eight six two eight zero”.

Figure 6.2 Video clip 2: “Six one six nine three five eight”.


Figure 6.3 Video clip 3: “Two nine six four one one nine”.

The mutual-information algorithm was also applied to the database DcorSENT, which is described in Chapter 2. Note that only Sentence 3 (spoken by four talkers with four repetitions) was used for the mutual-information analysis.

The main difference between these two types of databases is that, for the IBM video database, the temporal intensity changes for each pixel reflect how that pixel interacts with its neighboring pixels. These changes can reflect how the face muscles are moving. However, the temporal intensity changes for one pixel do not represent the movements of one specific point on the face, and they depend on the lighting and on the intensity of neighboring pixels. For example, if one pixel in the image has the same intensity as its neighboring pixels, then the intensity of this pixel does not change when the talker is talking. In contrast, in the database DcorSENT, the temporal changes for each retro-reflector represent the movements of a specific point on the face and therefore do not depend on lighting and other factors. The video database can be of higher resolution and is easy to record; the DcorSENT data provide robust movement measurements, but their resolution and availability are limited.


6.3.2. Mutual Information Faces

The mutual information I(A,V) of the audio (A) and video (V) indicates how much information is shared between A and V (or how much is learned about A when V is given, and vice versa). Suppose that there are two flashing lights: one is controlled by a sound, and the other flashes randomly. There is then a large amount of mutual information between the sound and the controlled light, and little between the sound and the random one. The mutual-information-faces algorithm was described in (Nock et al., 2002); this implementation differs from those used in other studies (Fisher and Darrell, 2002; Hershey and Movellan, 1999; Slaney and Covell, 2001). In this algorithm, the feature vectors derived from the audio (A) and video (V) are considered to be samples from a multivariate Gaussian probability distribution p(A,V). Gaussian assumptions are also made for p(A) and p(V). The mutual information I(A,V) between the random variables A and V can then be obtained as:

I(A,V) = \frac{1}{2} \log \frac{|\Sigma_A| \cdot |\Sigma_V|}{|\Sigma_{A,V}|}                (6.1)

where Σ_A, Σ_V, and Σ_{A,V} denote empirical estimates of the covariance matrices of A, of V, and of the joint A-V distribution, respectively. To produce the mutual information faces, 24 mel-frequency cepstral coefficients were extracted from the audio signal at a rate of 100 Hz. The videos were linearly interpolated to give a 100 Hz frame rate, matching the audio processing. Mutual information was then computed between the sequence of acoustic feature vectors and the sequence of intensities at each pixel location. Video clips 1, 2, and 3 have 410, 410, and 320 frames, respectively.
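A minimal sketch of this per-pixel computation under the Gaussian assumption of Eq. (6.1) is given below; the array shapes, the synthetic data, and the min-max brightness scaling are illustrative assumptions, and the audio features would in practice be the MFCC (or LSP-plus-energy) vectors described in this section.

```python
import numpy as np

def mutual_information(a, v):
    """Gaussian estimate of I(A, V) from paired feature samples (Eq. 6.1).

    a : (T, Da) array of audio feature vectors (e.g., MFCCs)
    v : (T, Dv) array of video features (here, one pixel's intensity over time)
    """
    det_a = np.linalg.det(np.atleast_2d(np.cov(a, rowvar=False)))
    det_v = np.linalg.det(np.atleast_2d(np.cov(v, rowvar=False)))
    det_av = np.linalg.det(np.atleast_2d(np.cov(np.hstack([a, v]), rowvar=False)))
    return 0.5 * np.log(det_a * det_v / det_av)

def mutual_information_face(audio_feats, video, eps=1e-12):
    """Compute one MI value per pixel and rescale to [0, 1] for display.

    audio_feats : (T, Da) acoustic features at the video frame rate
    video       : (T, H, W) pixel intensities, already interpolated in time
    """
    T, H, W = video.shape
    mi = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            pixel = video[:, i, j].reshape(T, 1)
            mi[i, j] = mutual_information(audio_feats, pixel)
    # Brighter pixels indicate more shared information with the audio.
    return (mi - mi.min()) / (mi.max() - mi.min() + eps)

# Synthetic example: 410 frames, 24 MFCC-like features, a tiny 8x8 "video".
rng = np.random.default_rng(0)
audio = rng.standard_normal((410, 24))
video = rng.standard_normal((410, 8, 8))
video[:, 3:5, 3:5] += audio[:, :1, None]   # a small patch driven by the audio
face = mutual_information_face(audio, video)
print(face.round(2))
```

In this toy example, the audio-driven patch in the center of the frame receives the brightest values, which is the behavior the mutual information faces rely on.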


For the sentences in DcorSENT, the acoustic feature vector was a 17-dimensional vector (16 LSPs plus RMS energy) at a rate of 120 Hz. The visual feature vector consisted of the 3-D coordinates of each retro-reflector at a rate of 120 Hz. Sentence durations are listed in Table 4.3.

In the mutual-information plots, pixel brightness indicates whether the mutual information is high (white) or low (black); in other words, a bright pixel in a mutual-information plot indicates more shared information between the audio and that video pixel than a dark pixel does.

6.3.3. Results

The corresponding mutual-information faces for the three video clips (Figures 6.1-6.3) are shown in Figures 6.4-6.6, respectively. The mutual information faces for the database DcorSENT are shown in Figure 6.7, where a mutual information face was generated for each utterance. Note that in Figure 6.7, the lower lip center was missing for talker F2 due to dropouts in the 3-D motion reconstruction.

Figure 6.4 Mutual information face for video clip 1.


Figure 6.5 Mutual information face for video clip 2.

Figure 6.6 Mutual information face for video clip 3.


Figure 6.7 Mutual information faces for database DcorSENT (Sentence 3, four talkers, and four repetitions).

6.3.4. Discussion

Figures 6.4-6.6 indicated that there was a significant amount of mutual information between the audio and the background (the areas surrounding the faces), which was not a desirable result. This may be due to noise from the camera. Note that the mutual information faces in (Nock et al., 2002) were generated using a database recorded in a studio environment with a very stable camera; therefore, the background showed no mutual information with the audio (Figure 6.8). Also, Nock et al. (2003) used Butz-style features based on pixel intensity changes in a local area, and those features alleviate this problem to some extent.

Figure 6.8 Mutual information faces in (Nock et al., 2002).

Note that in Figures 6.4-6.6, low mutual information does not necessarily mean low correlation between speech acoustics and face movements. For example, the cheek area does not show high mutual information, but only because the cheeks tend to have uniform intensity. In other words, high mutual information appears at the regions that are both related to speech events and have enough variability in intensity. This is especially true for boundary regions as shown in Figures 6.4-6.6.

Nevertheless, the mutual information "faces" are still clearly visible in Figures 6.4-6.6. The mutual information, however, also depends on the content of the speech, and Figures 6.4-6.6 show different patterns of mutual information. Mutual information faces can be used to limit the search space in face detection or to detect who is talking in a video clip.

The mutual information faces (Figure 6.7) for the database DcorSENT indicated how strongly each point on the face was coupled with the acoustic signals. For three (M1, M2, and F2) of the four talkers, the mutual information pattern was consistent across repetitions. The inconsistency for talker F1 may contribute to her low intelligibility and thus to her low correlations (see Chapter 4). The mutual information faces for talkers M1, F1, and M2 showed clear signs of left-right asymmetry in the face movements. Results in Table 4.2 indicated that the predictions of face movements from speech acoustics were better for talker M2 than for talker M1, and in Figure 6.7 talker M2 appears to have a stronger mutual information face. Similarly, the difference in the mutual information faces for talkers F1 and F2 can be used to explain the difference in the correlations for talkers F1 and F2 reported in Table 4.2. Figure 4.8 showed that the lip movements were better predicted from speech acoustics than were the chin movements for talker M1, and this coincides with the low mutual information on his chin, as shown in Figure 6.7.

6.4. Summary

In this chapter, the correlation between face movements and speech acoustics was examined from a different point of view than in Chapter 4. First, a mutual-information-faces algorithm (Nock et al., 2002, 2003) was applied to three video clips (Potamianos and Neti, 2003). Although the algorithm has some limitations (it depends on lighting conditions, intensity variability, etc.), the results indicated significant amounts of mutual information between speech acoustics and face movements. For comparison, the mutual-information algorithm was also applied to the database DcorSENT, and the results were consistent with those reported in Chapter 4; that is, lip movements were well predicted from speech acoustics and thus had strong mutual information with speech acoustics.


CHAPTER 7. SUMMARY, IMPLICATIONS, AND FUTURE DIRECTIONS

7.1. Summary

In this dissertation, the relationship between speech acoustics and optical (and articulatory) movements, and the relationship between visual consonant perception and optical measures were examined in an effort to improve future visual speech synthesis and audio-visual speech recognition. A database of 69 consonant-vowel (CV) syllables and three sentences spoken by four talkers was recorded with four simultaneous data streams: audio, video, face movements, and tongue movements. The relationships between face movements, tongue movements, and speech acoustics were examined using multilinear regression. To examine the relationship between visual consonant perception and optical measures, another database of 69 CV syllables was recorded with three simultaneous data streams: audio, video, and face movements. Each talker’s syllable productions were presented for identification in a visual-only condition to six perceivers.

Physical measures of the talkers’ productions were analyzed using multilinear regression, and perceptual measures of the perceivers’ responses were analyzed using multidimensional scaling (MDS) and phoneme equivalence classes (PECs).


7.2. Major Results

7.2.1. Chapter 4 – On the Relationship between Face Movements, Tongue Movements, and Speech Acoustics

Results showed that the relationship between speech acoustics and optical movements was not uniform across talkers, vowel contexts, places of articulation, or individual optical, articulatory, and acoustic channels. For CV syllables, the correlations between face movements and tongue movements were high (r = 0.70 - 0.88), and optical and articulatory data could be well predicted from speech acoustics (r = 0.74 - 0.82). LSP data were better predicted from EMA than from OPT (r = 0.54 - 0.61 vs. r = 0.37 - 0.55). Another interesting aspect of these correlations was the asymmetry of the predictions: In general, optical and articulatory movements were easier to predict from speech acoustics than the reverse. The data from the two males gave better predictions than those from the two females; therefore, face size may be a factor in the predictions. The prediction of C/a/ syllables was better than that of C/i/ and C/u/ syllables. Furthermore, vowel-dependent predictions produced much better correlations than syllable-independent predictions. In regard to the acoustic features, the 2nd LSP pair, which is around the 2nd formant frequency, and RMS energy, which is related to mouth aperture, were better predicted than the other LSP pairs. The tongue and face movements are more related to the front cavity of the vocal tract and thus correlated well with the 2nd formant. Across the different places of articulation, lingual places in general resulted in better predictions of one data stream from another than did bilabial and glottal places. Among the manners of articulation, plosive consonants yielded lower correlations than the others, while voicing had no influence on the correlations. The chin movements were the best predicted, followed by the lips and then the cheeks. A large degree of redundancy was found among TB, TM, and TT and among the chin, cheek, and lip movements.

In general, predictions for syllables yielded higher correlations than those for sentences, which suggests a strong context effect in the relationships. The internal tongue movements could not predict the RMS energy and LSPs well over long stretches (sentences), although these acoustic features were predicted reasonably well over short stretches (CVs). When predicted from LSPs, TB and TM were better predicted than TT. When predicted from OPT, TT was the best-predicted channel for sentences, but the worst for CVs. For sentences, using all channels usually resulted in better prediction; lip movements were the most informative when predicting LSPs or EMA, and TT was the most informative channel when predicting LSPs or OPT. Because of the non-uniformity of the relationships for CV syllables, a dynamical model (a backward and forward filter) was proposed to enhance the relationship between speech acoustics and optical movements for sentences, and an improvement of about 17% was reported.

7.2.2. Chapter 5 – The Relationship between Visual Speech Perception and Physical Measures

Results showed that the physical measures accounted for about 66% of the variance of visual consonant perception. Among the different facial regions, the lip area (55%) was the most informative, although the cheeks (36%) and the chin (32%) also contributed significantly to visual intelligibility. The variance in the visual perceptual results accounted for by the physical measures demonstrated that visual speech stimulus structure drives visual speech perception. The correlations were not uniform across vowel contexts and talkers. Talkers with high intelligibility ratings yielded more detailed information, and so did the larger faces of the male talkers. As for vowel context, C/u/ syllables yielded higher correlations between visual perceptual distances and physical distances than C/a/ and C/i/ syllables, even though the movements for /u/ were smaller.

The correlations between visual perceptual distances and physical distances varied with different MDS spaces. Results in the current study showed that a 3-D MDS space was acceptable (from 3-D to 6-D MDS, the correlations did not differ much). An MDS analysis showed that clusters from lipreading had an internal structure as well as different distances to other clusters. For the confusion dissimilarity matrices predicted from the whole face and the lips, the 3-D MDS structures were similar to those from the perceptual data. For the confusion dissimilarity matrices predicted from the cheeks and chin, the 3-D MDS structures were far different from those from the perceptual data: the 23 consonants were spread out rather than clustered. Therefore, the cheeks and chin were not sufficient to produce acceptable lipreading results, but combined with the lips, they helped to differentiate these consonants and thus made them easier to lipread. The correlation between the perceptual data and manner of articulation decreased rapidly as the number of MDS dimensions decreased, suggesting that manner of articulation by itself had a multidimensional structure in the visual perceptual space. The strong correlation between the perceptual data and manner of articulation suggests that perceivers also obtained manner of articulation information from visual speech, not just place of articulation. This contradicts the general view that manner of articulation is mainly auditory. Relative to what the physical measures provided, lipreading participants recovered more information about place of articulation and less about voicing or manner of articulation; that is, lipreaders did better than predicted from the physical measures when identifying place of articulation, and worse when identifying voicing and manner of articulation.

Visual perception experiments revealed that the PECs were fewer than 12, and that the PECs were a function of vowel context and the talkers' sentence intelligibility ratings. In general, as expected, higher sentence intelligibility corresponded to more PECs in the perception of the CV stimuli (with fewer phonemes in each PEC), and the obtained PECs tended to agree with the places of articulation. For the results predicted from physical measures, the PECs obtained from all face retro-reflectors agreed the best with the lipreading results, followed by those obtained from the lip area only. As expected, the lip area was the most informative for lipreading. However, the cheek and chin areas also contributed significantly to lipreading. PEC analysis further showed that, for the low intelligibility talkers (M1 and F1; as perceived by deaf lipreaders), the lip area contributed almost all the information, and the cheeks and chin either did not move much (talker F1) or their movements were redundant with the lips (talker M1). In contrast, the high intelligibility talkers (M2 and F2; as perceived by deaf lipreaders) exhibited complementary cheek and chin movements to achieve high visual intelligibility. In their cases, the cheek and chin areas provided useful supplementary information for lipreading, even though those areas alone were not sufficient for lipreading.

7.2.3. Chapter 6 – Examining the Correlations between Face Movements and Speech Acoustics Using Mutual Information Faces

The mutual information faces were clearly visible for the video clips. This information can be used for such tasks as limiting the search space in face detection and detecting who is talking in a video clip (Nock et al., 2002, 2003). The mutual information, however, also depended on the content of the speech.

The mutual information faces for the database DcorSENT indicated how strongly each point on the face was coupled with the acoustic signals. For three (M1, M2, and F2) of the four talkers, the mutual information pattern was consistent across repetitions. The inconsistency for talker F1 may contribute to her low intelligibility and thus to her low correlations. The mutual information faces for talkers M1, F1, and M2 showed clear signs of left-right asymmetry in the face movements. Differences across the mutual information faces can be used to explain differences across the correlations reported in Chapter 4.


7.3. Implications for Visual Speech Synthesis

7.3.1. Visual Speech Synthesis: A Promising Approach

The most immediate application of visual speech synthesis is to build research tools. These tools can be used to benefit the hearing-impaired, to design visual perception experiments (without human talkers), to predict intelligibility in various environments, and to improve audio-visual speech recognition and coding. Visual speech synthesis systems can also be used in many real applications, such as second-language learning, video communication, clinical evaluations, remote education, internet agents, animated films, games, virtual reality, and automated teller machines.

Animating faces on computers began in the early 1970s, and research on various aspects of computer facial animation has been growing ever since (Bailly, 2001, 2002; Bergeron and Lachapelle, 1985; Bregler et al., 1997; DiPaola, 1989, 1991; Ezzat and Poggio, 1998; King, 2001; Lee et al., 1993, 1995; Magnenat-Thalmann et al., 1988; Platt and Badler, 1981; Parke, 1974, 1982; Parke and Waters, 1996; Pearce et al., 1986; Rydfalk, 1987; Saintourens et al., 1990; Terzopoulos and Waters, 1990, 1991; Waters, 1987, 1992; Waters and Levergood, 1993). Recently, research on facial animation has attracted more attention: This trend is demonstrated by the appearance of the Audio-Visual Speech Processing Workshops (1997 in Rhodes, Greece; 1998 in Terrigal, Australia; 1999 in Santa Cruz; 2001 in Scheelsminde, Denmark; 2003 in Saint Jorioz, France). In 2002, several special sessions on audio-visual speech processing were held at ICSLP'02 in Denver, Colorado.

7.3.2. Challenges in Visual Speech Synthesis

Speech articulators cannot move from one gesture to another instantly, and thus at any moment the articulators may appear to be simultaneously adjusted to two or more gestures (Benguerel and Pichora-Fuller, 1982; Miller, 1999). Coarticulation and animation are the two most challenging problems in visual speech synthesis. Synthesizing without any coarticulation effects would result in unnatural stimuli. However, how to drive visual speech synthesis so as to obtain natural coarticulation patterns has not been studied extensively. Analogous to locus equations for acoustic speech synthesis, Cohen and Massaro (1993) used a dominance coarticulation model to approximate coarticulation patterns, but such an approximation may result in a loss of naturalness.

Another challenge is to evaluate the visual intelligibility of synthetic visual speech. Current research systems have used human subjects to evaluate the quality of visual speech synthesis (Cohen and Massaro, 1993; Cohen et al., 1996; Le Goff et al., 1994). This method is labor-intensive and subject to bias. An objective measurement of the quality of synthetic visual speech may become possible if visual speech perception can be predicted from the corresponding optical measures.


7.3.3. Implications from the Current Study

A theoretically ideal driving source for facial animation is speech acoustics, because optical signals are simultaneous by-products of acoustic speech production (Lavagetto et al., 1994; Massaro et al., 1999; Rao et al., 1997; Williams et al., 2000). In the current study, optical and articulatory data could be well predicted from speech acoustics, which demonstrated that speech acoustics contained most of the information about coarticulation and the movements of the articulators. Therefore, when speech acoustics is used as a driving source for visual speech synthesis, coarticulation patterns can be recovered well (if not perfectly). Moreover, such a method requires no additional consideration of synchronization between acoustic and optical speech. However, the prediction of optical speech from speech acoustics appears to be phoneme-dependent. Therefore, when predicting optical speech from speech acoustics, the context effect should be considered, or the dynamical information discussed in Chapter 4 can be exploited. Furthermore, because the 2nd LSP pair correlated better with the physical movements than did the other LSP pairs, more spectral resolution could be placed around the 2nd LSP pair when predicting face or tongue movements from speech acoustics.

Lip, chin, and cheek movements are redundant to some degree. However, cheek and chin movements still contribute significantly to visual speech perception. Therefore, a requirement for a highly intelligible synthetic talking face is that the cheek and chin movements also be carefully created. "Speaking clearly" has been explored by Picheny et al. (1985, 1986, 1989), while how to "speak clearly using facial gestures" is still an ongoing and less-explored question. For synthetic faces, understanding this issue is crucial: cheek movements should convey not only redundant information but also additional information. The issue of how to make a talker more intelligible is also one of how to make the information on the talker's face easier to sample visually. In this regard, larger moving regions (e.g., more pixels) or higher-dimensional movements may be proposed.

Currently, visual speech synthesis development goes through a long cycle, because human perceivers are needed to judge the performance of any visual speech synthesizer. However, if the relationship between visual speech perception and physical measures were known, it would be easier to assess the performance of a visual speech synthesizer. This would, of course, shorten the design cycle and reduce its cost. Results in this dissertation showed that visual speech perception could be predicted successfully from physical measures, especially if more face movements were captured.

7.4. Future Directions

The current study showed that a significant amount of optical information could be recovered from speech acoustics and that such predictions were context-dependent. Therefore, our future acoustics-driven visual speech synthesis system will probably be a diphone-based concatenation system, in which small stored segments of speech are retrieved when needed. [Peterson et al. (1958) proposed the diphone, a unit delimited at the middle of each phoneme, so that the concatenation boundaries fall in the most stable regions.]

For the experiments on optical phonetics, we will examine more facial markers and thus attempt to account more adequately for visual speech perception by physical measures. We then need to find out which facial regions are important for visual speech perception, especially for the perception of place of articulation, because lipreaders appear to recover more place-of-articulation information from visual speech than the current physical measures predict. Subsequently, we will focus on the movements of these "important" regions.

The estimation of optical speech from speech acoustics is not 100% successful; some optical movements cannot be estimated from speech acoustics. We refer to these movements as orthogonal optical movements (orthogonal to speech acoustics). Once we have a visual speech synthesizer driven by speech acoustics, we will examine whether these orthogonal optical movements have a significant effect on visual speech perception.

A limitation of the current study is that correlation analysis was carried out uniformly across time without taking into account important gestures or facial landmarks.

For example, some specific face gestures or movements may be very important for visual speech perception, such as mouth closure for a bilabial sound. In the future, physiological and perceptual experiments should be conducted to define which face movements are of importance to visual speech perception, so that those movements can be better predicted.

For the relationship between visual speech perception and optical measures, the dynamical characteristics of the face movements were included in the physical distances, but the averaging effect may not have been desirable in some cases. In the future, the effects of consonant duration and of the velocity of the movements will be examined. Also, for the physical distances, a simple average was taken across the talkers. One open question is whether, given that different talkers have different speaking styles and different face sizes, some normalization across talkers is needed and, if so, how it should be done. In the current study, the coordinates of the retro-reflectors were used. Visual speech perception, however, may use only relative information. Therefore, we will examine the correlation as a function of facial configurations, such as lip width, lip height, lip protrusion, lip opening area, and stress on the face.


APPENDIX A. CONFUSION MATRICES

In this appendix, we present the confusion matrices for the 23-consonant identification experiments described in Chapter 2. Each row represents the distribution of responses to one stimulus. C_T,V denotes one confusion matrix, where T is the talker and V is the vowel context. The results were first pooled into 12 stimulus-response (23 x 23) confusion matrices, one for each talker and vowel context. In each of these confusion matrices, each stimulus has 120 responses (2 repetitions x 6 participants x 10 trials). These confusion matrices were then pooled across talkers or vowels (or both).
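A minimal sketch of this pooling step is shown below; the dictionary layout and the randomly generated placeholder counts are illustrative assumptions (in practice, each matrix would hold the observed response counts).

```python
import numpy as np

TALKERS = ["M1", "F1", "M2", "F2"]
VOWELS = ["a", "i", "u"]

# C[(talker, vowel)] is a 23 x 23 stimulus-by-response count matrix.
# Placeholder data: each stimulus row is drawn so that it sums to 120 responses
# (2 repetitions x 6 participants x 10 trials), as in the experiments.
rng = np.random.default_rng(0)
C = {(t, v): rng.multinomial(120, np.full(23, 1 / 23), size=23)
     for t in TALKERS for v in VOWELS}

# Pool across talkers for one vowel context, e.g. C_ALL,a:
C_ALL_a = sum(C[(t, "a")] for t in TALKERS)

# Pool across vowels for one talker, e.g. C_M1,aiu:
C_M1_aiu = sum(C[("M1", v)] for v in VOWELS)

# Pool across both talkers and vowels, C_ALL,aiu:
C_ALL_aiu = sum(C[(t, v)] for t in TALKERS for v in VOWELS)

# Row sums scale with the number of pooled matrices (480 and 1440 responses here).
print(C_ALL_a.sum(axis=1)[:3], C_ALL_aiu.sum(axis=1)[:3])
```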


Table A.1 Perceptual confusion matrix C_M1,a.

y w r l m n p t k b d g h - * s z f v  , t d, y 23 25 20 1 11 12 12 13 1 1 1 w 1 117 1 1 r 3 107 10 l 104 2 2 1 3 5 3 m 29 24 67 n 101 15 1 2 1 p 1 33 23 63 t 5 7 41 1 33 3 2 3 2 18 3 1 1 k 9 18 9 5 35 9 13 22 b 29 20 71 d 15 14 24 44 1 3 1 14 2 1 1 g 8 11 3 42 3 12 39 1 1 h 1 118 1 - 1 16 11 4 3 5 5 40 35 * 1 3 11 4 6 33 59 2 1 s 1 10 18 77 11 1 2 z 13 22 1 70 13 1 f 1 102 17 v 99 21  2 1 5 1 15 57 6 8 2 15 8 , 1 1 6 8 38 2 21 2 24 17 t 1 2 3 1 38 6 1 29 5 20 14 d, 1 16 19 1 52 7 6 1 12 5

Table A.2 Perceptual confusion matrix C_M2,a.

y w r l m n p t k b d g h - * s z f v  , t d, y 5 36 16 5 23 10 7 18 w 2 117 1 r 72 43 1 1 1 1 1 l 115 1 1 2 1 m 62 8 50 n 86 18 4 7 2 2 1 p 52 25 43 t 1 2 39 1 1 25 1 2 1 39 7 1 k 35 2 2 41 4 14 21 1 b 1 68 11 40 d 8 47 2 38 3 2 1 12 3 1 1 2 g 1 29 12 3 33 17 18 7 h 1 5 4 11 8 91 - 51 69 * 55 64 1 s 21 2 22 68 7 z 1 20 20 65 9 4 1 f 102 18 v 1 97 22  10 5 11 1 22 3 39 29 , 1 19 16 1 1 3 1 6 37 35 t 2 12 1 5 1 1 10 1 50 37 d, 1 3 1 26 1 12 1 1 2 1 42 29


Table A.3 Perceptual confusion matrix C_F1,a.

y w r l m n p t K b d g h - * s z f v  , t d, y 6 11 16 6 37 14 14 4 1 1 10 w 2 107 9 1 1 r 2 23 74 1 5 2 1 2 4 1 1 4 l 3 17 14 20 19 18 10 8 1 1 6 1 1 1 m 46 28 46 n 3 10 22 22 31 9 4 2 17 p 46 43 29 1 1 t 2 20 2 29 1 3 2 52 9 k 1 2 9 3 33 11 7 48 2 4 b 1 43 38 37 1 d 21 23 1 59 7 5 4 g 5 12 13 3 26 12 16 23 2 8 h 1 7 2 16 3 8 82 1 - 1 2 28 88 1 * 2 2 1 48 67 s 1 1 26 27 1 3 54 7 z 1 18 19 1 1 2 57 8 1 1 3 8 f 1 1 105 13 v 103 17  1 1 2 1 3 2 3 17 15 4 26 45 , 1 5 5 2 7 1 1 4 16 2 11 2 31 32 t 3 3 1 8 2 10 9 2 32 2 6 2 21 19 d, 1 11 1 24 9 32 42

Table A.4 Perceptual confusion matrix C_F2,a.

y w r l m n p t k b d g h - * s z f v  , t d, y 35 17 27 3 10 9 8 2 2 7 w 3 117 r 2 99 19 l 108 5 4 2 1 m 1 29 18 72 n 2 1 13 36 1 42 7 6 9 2 1 p 18 1 30 68 3 t 2 2 3 26 4 33 3 3 34 1 6 2 1 k 5 1 1 34 13 66 b 1 11 41 67 d 1 23 26 1 1 53 9 1 5 g 3 14 7 2 26 2 15 48 1 1 1 h 1 9 5 103 1 1 - 1 44 75 * 4 4 18 10 6 20 10 17 11 19 1 s 3 14 77 24 2 z 4 14 92 10 f 99 21 v 1 4 1 97 17  1 6 13 1 30 7 32 30 , 1 1 2 1 29 7 34 45 t 1 5 28 3 56 27 d, 1 1 31 2 40 45


Table A.5 Perceptual confusion matrix C_M1,i.

y w r l m n p t K b d g h - * s z f v  , t d, y 18 5 2 21 1 5 17 7 30 11 2 1 w 117 2 1 r 2 21 1 81 15 l 87 5 3 19 6 m 17 32 70 1 n 94 19 2 1 1 3 p 10 35 75 t 7 19 20 45 2 13 4 6 2 1 1 k 28 1 6 1 40 2 15 24 1 1 1 b 15 44 61 d 41 64 1 2 6 4 1 1 g 11 1 2 22 3 38 7 16 16 3 1 h 2 1 1 11 1 2 99 1 1 1 - 6 21 2 1 56 34 * 2 6 5 1 3 45 58 s 7 7 81 22 1 1 1 z 1 12 1 20 1 59 18 4 1 1 2 f 1 1 79 39 v 78 42  1 2 8 1 42 5 23 11 17 10 , 1 2 3 18 33 4 23 7 13 16 t 1 1 17 21 6 55 17 2 d, 1 1 3 15 1 1 49 9 15 2 14 9

Table A.6 Perceptual confusion matrix C_M2,i.

y w r l m n p t k b d g h - * s z f v  , t d, y 2 5 25 1 42 6 18 20 1 w 119 1 r 100 19 1 l 66 20 7 2 8 17 m 60 17 42 1 n 1 30 44 1 22 1 8 13 p 1 62 17 40 t 1 1 21 16 13 30 2 30 6 k 1 7 17 3 3 87 2 b 39 29 52 d 15 31 5 27 8 13 18 1 1 1 g 1 1 24 1 28 2 9 54 h 12 17 10 81 - 60 60 * 3 1 2 1 4 3 3 3 60 39 1 s 5 20 2 78 14 1 z 1 3 9 25 1 64 16 1 f 1 87 32 v 89 31  2 2 2 8 12 4 10 40 8 20 12 , 1 1 12 12 55 4 16 19 t 2 1 5 8 8 4 24 3 26 2 24 13 d, 1 6 10 20 6 40 2 17 18


Table A.7 Perceptual confusion matrix C_F1,i.

y w r l m n p t k b d g h - * s z f v  , t d, y 4 1 9 22 10 37 4 4 7 18 2 1 1 w 78 35 2 3 1 1 r 52 66 1 1 l 2 3 7 29 33 2 5 34 5 m 84 7 28 1 n 2 20 35 10 2 17 5 4 23 1 1 p 3 1 48 29 39 t 1 4 22 2 25 3 4 7 12 35 3 1 1 k 3 1 12 13 22 18 5 37 2 1 6 b 26 36 58 d 1 21 3 30 2 48 14 1 g 2 22 1 18 4 1 40 5 11 2 2 12 h 2 21 2 15 13 1 56 4 6 - 1 77 42 * 68 52 s 1 14 1 24 1 6 3 54 12 2 2 z 2 1 5 5 14 14 3 1 1 47 6 2 5 3 11 f 84 36 v 1 74 45  2 3 1 5 1 2 33 3 18 9 9 34 , 2 2 20 2 18 2 41 13 7 13 t 1 20 1 7 50 3 13 25 d, 22 9 1 51 4 12 21

Table A.8 Perceptual confusion matrix C_F2,i.

y w r l m n p t k b d g h - * s z f v  , t d, y 16 31 18 8 4 41 1 1 w 116 3 1 r 6 47 1 1 59 6 l 2 75 1 29 1 3 1 6 2 m 1 47 30 42 n 3 15 10 13 15 6 53 4 1 p 15 57 48 t 8 20 5 30 4 41 12 k 1 1 1 10 3 18 1 14 70 1 b 18 43 59 d 11 29 1 27 1 2 45 4 g 4 8 11 18 5 40 21 12 1 h 1 2 11 5 98 1 1 1 - 68 52 * 13 6 12 2 6 4 1 43 30 3 s 1 3 9 90 17 z 2 10 85 23 f 1 1 1 84 33 v 89 31  1 16 5 55 5 11 27 , 12 4 52 6 19 27 t 1 1 16 3 41 9 27 22 d, 10 3 59 8 17 23


Table A.9 Perceptual confusion matrix C_M1,u.

y w r l m n p t k b d g h - * s z f v  , t d, y 37 24 1 7 1 2 48 w 119 1 r 1 1 1 1 1 5 4 2 7 5 68 19 1 4 l 110 1 1 8 m 49 20 50 1 n 1 115 1 2 1 p 31 26 63 t 7 1 17 1 6 14 4 13 1 14 2 15 21 1 3 k 6 26 3 2 12 1 6 64 b 1 38 21 60 d 5 1 52 5 11 1 22 11 2 3 6 1 g 12 19 17 2 3 23 5 39 h 10 32 1 2 1 1 1 1 70 1 - 47 1 66 5 1 * 6 2 23 6 15 17 1 27 10 7 5 1 s 1 1 8 9 5 1 48 41 1 1 2 2 z 1 2 9 5 1 57 37 4 3 1 f 1 93 26 v 96 24  4 1 6 2 11 44 35 4 1 9 3 , 13 8 1 43 26 14 1 6 8 t 9 1 1 13 2 6 2 35 29 5 4 4 9 d, 2 5 1 4 1 53 35 10 3 3 3

Table A.10 Perceptual confusion matrix C_M2,u.

y w r l m n p t k b d g h - * s z f v  , t d, y 31 30 20 2 3 1 29 4 w 119 1 r 1 99 15 1 1 1 2 l 9 102 1 2 4 2 m 111 4 4 1 n 3 12 2 70 5 2 4 1 17 3 1 p 1 68 27 24 t 22 14 3 25 11 9 4 15 6 6 5 k 33 1 1 1 83 1 b 71 20 28 1 d 21 18 1 10 5 17 4 12 4 7 9 4 7 1 g 23 23 1 16 1 6 2 43 4 1 h 41 1 2 76 - 2 9 7 2 1 79 18 1 1 * 1 99 20 s 1 2 1 14 8 1 47 34 5 2 1 4 z 1 1 8 1 7 1 53 33 6 1 3 5 f 1 90 29 v 4 87 29  5 21 7 2 41 17 8 3 9 7 , 2 16 13 37 22 1 10 3 5 11 t 17 3 33 2 15 2 15 9 8 3 7 6 d, 19 7 1 1 18 2 7 3 2 16 14 10 10 10


Table A.11 Perceptual confusion matrix C_F1,u.

y w r l m n p t k b d g h - * s z f v  , t d, y 56 1 2 9 12 17 3 2 1 2 1 7 2 1 4 w 77 12 31 r 19 74 1 22 1 2 1 l 57 1 9 1 10 9 8 12 3 2 1 2 1 3 1 m 103 9 7 1 n 11 1 39 2 19 2 1 3 16 8 9 9 p 2 72 14 31 1 t 21 1 3 32 3 14 1 10 9 7 5 14 k 30 5 2 6 5 18 13 19 14 1 4 1 2 b 70 22 27 1 d 15 1 21 9 1 1 6 15 14 2 9 26 g 48 5 2 3 23 3 18 16 1 1 h 7 22 1 5 4 79 1 1 - 109 11 * 3 96 21 s 10 15 19 1 8 11 1 20 7 28 z 11 31 17 2 9 14 17 2 1 16 f 1 1 93 25 v 86 33 1  16 1 9 7 3 2 33 4 14 31 , 17 1 12 7 1 1 3 1 24 4 20 29 t 30 1 1 14 4 7 2 8 10 14 1 3 25 d, 17 6 12 1 1 4 28 11 17 23

Table A.12 Perceptual confusion matrix C_F2,u.

y w r l m n p t k b d g h - * s z f v  , t d, y 84 4 6 2 1 1 1 2 1 8 5 5 w 81 30 4 3 2 r 7 84 29 l 47 59 4 4 2 3 1 m 46 24 50 n 57 13 1 9 2 17 3 1 5 4 1 3 2 2 p 51 23 46 t 16 25 8 12 4 6 4 8 11 1 7 18 k 22 17 3 4 22 2 15 34 1 b 59 26 35 d 39 2 20 3 22 6 6 3 1 10 1 3 4 g 35 8 2 1 34 1 22 16 1 h 6 29 3 2 7 1 2 5 65 - 3 1 1 102 13 * 3 3 1 3 1 95 13 1 s 4 1 23 14 2 29 30 8 2 7 z 6 14 1 18 25 33 11 2 5 5 f 86 34 v 2 1 88 29  1 8 11 6 3 38 10 15 28 , 1 6 12 14 6 38 5 13 25 t 2 11 1 14 1 1 4 37 5 17 27 d, 1 6 1 9 11 6 30 3 17 36


Table A.13 Perceptual confusion matrix C_M1,aiu.

y w r l m n p t k b d g h - * s z f v  , t d, y 78 24 1 32 25 3 33 1 17 31 68 1 1 30 11 3 1 w 1 353 3 1 1 1 r 1 3 25 1 1 1 5 4 2 7 5 256 44 1 4 l 301 8 5 1 3 1 32 9 m 95 76 187 1 1 n 1 310 35 3 1 3 1 5 1 p 1 74 84 201 t 7 1 29 1 32 75 5 91 6 2 30 8 39 26 2 3 3 k 43 26 21 1 17 6 87 1 11 34 110 1 1 1 b 1 82 85 192 d 5 1 108 83 36 3 72 1 18 4 18 8 1 1 1 g 31 19 1 30 24 9 103 10 33 94 3 2 1 h 12 32 1 2 1 2 12 3 3 287 1 1 1 2 - 1 69 33 4 3 7 6 162 74 1 * 7 2 28 23 24 1 26 1 105 127 9 5 1 1 s 1 2 25 34 5 1 206 74 2 3 2 5 z 1 1 2 34 1 47 2 1 186 68 8 1 5 3 f 1 1 1 1 274 82 v 273 87  4 2 2 11 3 1 28 8 1 143 46 35 14 41 21 , 1 1 1 21 19 18 1 114 32 58 10 43 41 t 9 2 2 1 32 2 30 6 1 2 128 52 1 36 9 24 23 d, 3 1 22 1 26 16 1 2 154 51 31 6 29 17

Table A.14 Perceptual confusion matrix C_M2,aiu.

y w r l m n p t k b d g h - * s z f v  , t d, y 38 30 61 43 6 68 17 25 67 5 w 2 355 1 2 r 1 271 77 1 1 1 1 1 1 3 1 1 l 9 283 21 1 8 4 9 19 4 2 m 233 29 96 1 1 n 4 12 2 186 67 3 30 8 11 32 4 1 p 1 1 182 69 107 t 22 14 5 1 23 80 25 1 64 7 47 1 51 13 1 5 k 33 1 37 9 2 59 7 17 191 2 1 1 b 1 178 60 120 1 d 21 18 1 18 20 95 11 77 15 22 9 1 34 11 2 1 1 1 2 g 25 23 1 46 37 4 67 19 29 104 4 1 h 1 41 6 12 4 30 18 248 - 2 9 7 2 1 190 147 1 1 * 4 1 2 1 4 3 3 3 214 123 1 1 s 1 2 1 40 2 50 2 1 193 55 5 2 2 4 z 1 1 2 3 37 1 52 1 1 182 58 7 1 7 6 f 1 1 279 79 v 1 4 273 82  5 2 33 2 20 12 4 2 62 18 70 14 68 48 , 3 35 1 30 13 1 52 23 1 71 7 58 65 t 19 3 2 1 50 3 28 8 6 1 40 12 44 6 81 56 d, 20 7 4 3 44 3 25 14 2 37 20 52 3 69 57


Table A.15 Perceptual confusion matrix C_F1,aiu.

y w r l m n p t k b d g h - * s z f v  , t d, y 66 13 27 37 59 68 21 10 8 1 29 4 1 8 2 1 5 w 2 262 56 33 4 1 1 1 r 2 94 214 2 28 2 1 1 2 4 1 3 1 5 l 62 1 29 22 59 28 59 24 16 3 1 41 8 2 3 2 m 233 44 81 1 1 n 16 21 45 71 26 67 16 9 2 43 16 1 8 10 9 p 3 3 166 86 99 1 1 1 t 21 1 1 9 74 7 68 4 8 9 12 97 21 8 1 5 14 k 34 5 4 1 27 21 73 42 31 99 4 1 11 4 1 2 b 1 139 96 122 1 1 d 15 1 1 63 3 62 3 1 1 113 36 15 2 14 30 g 53 17 17 28 50 33 38 40 3 49 5 11 2 2 12 h 8 22 3 28 4 36 16 13 217 5 7 1 - 1 1 2 214 141 1 * 3 2 2 1 212 140 s 11 2 55 1 70 1 8 6 116 30 3 22 7 28 z 13 1 1 5 54 14 50 4 1 4 2 113 28 20 8 7 35 f 1 1 1 1 282 74 v 1 263 95 1  19 1 6 1 13 2 15 1 2 53 5 66 17 49 110 , 18 6 5 16 16 21 3 1 4 35 7 1 76 19 58 74 t 33 1 4 1 23 6 17 31 1 2 47 12 70 6 37 69 d, 17 6 12 23 1 21 6 103 24 61 86

Table A.16 Perceptual confusion matrix C_F2,aiu.

y w r l m n p t k b d g h - * s z f v  , t d, y 119 33 58 7 34 19 12 44 3 7 1 1 2 2 8 5 5 w 3 314 33 4 1 3 2 r 6 56 1 1 242 54 l 49 242 1 38 1 7 6 6 6 2 1 1 m 1 1 122 72 164 n 59 17 29 55 16 74 16 60 5 4 13 4 3 1 2 2 p 84 1 110 162 3 t 18 2 11 71 17 75 7 13 79 21 17 1 9 19 k 23 17 1 9 11 8 74 3 42 170 1 1 b 1 88 110 161 d 39 14 72 4 75 7 8 1 1 98 16 1 10 1 4 9 g 38 8 16 12 10 71 21 42 104 1 21 12 1 1 1 1 h 8 29 3 2 2 27 1 2 15 266 2 1 1 1 - 3 1 2 214 140 * 7 20 25 25 8 26 15 18 149 62 4 1 s 5 1 29 37 2 196 71 10 2 7 z 6 20 1 42 202 66 11 2 5 5 f 1 1 1 269 88 v 2 1 4 1 1 274 77  1 9 18 16 24 4 123 22 58 85 , 1 1 6 13 12 20 7 119 18 66 97 t 2 11 1 2 15 16 1 9 4 106 17 100 76 d, 1 7 1 1 9 10 14 6 120 13 74 104


Table A.17 Perceptual confusion matrix C_ALL,a.

y w r l m n p t k b d g h - * s z f v  , t d, y 69 89 79 15 81 45 41 37 4 9 10 1 w 8 458 10 1 1 1 1 r 2 95 122 1 5 2 1 1 3 4 1 2 1 210 29 1 l 3 344 22 23 19 1 27 13 8 6 5 6 1 1 1 m 1 166 78 235 n 5 188 56 59 27 82 18 13 3 26 2 1 p 1 149 1 121 203 1 3 1 t 2 8 14 126 8 1 120 8 10 6 2 143 20 7 1 3 1 k 10 60 21 11 143 24 47 157 2 4 1 b 2 1 151 110 215 1 d 23 15 115 2 131 4 2 4 4 138 21 1 1 2 6 11 g 17 66 32 11 127 34 61 117 3 1 9 1 1 h 3 5 7 6 36 4 21 394 1 2 1 - 1 16 1 11 7 3 5 5 163 267 1 * 5 7 29 16 6 28 10 18 147 209 3 1 1 s 1 2 60 2 81 1 3 276 49 2 1 2 z 1 1 55 75 1 1 3 284 40 1 1 8 9 f 1 1 1 408 69 v 1 4 1 1 396 77  1 3 3 1 19 2 1 29 98 8 75 16 112 112 , 2 6 1 6 27 32 1 2 1 4 59 6 67 11 126 129 t 5 3 2 22 4 18 9 1 1 2 76 8 1 73 11 147 97 d, 1 3 2 43 1 1 31 1 1 1 64 8 63 13 126 121

Table A.18 Perceptual confusion matrix C_ALL,i.

y w r l m n p t k b d g h - * s z f v  , t d, y 24 22 70 25 91 1 56 43 72 9 18 2 30 12 1 2 2 w 430 41 2 1 4 1 1 r 160 153 2 1 2 141 21 l 4 231 1 61 33 10 35 11 28 21 6 34 5 m 1 208 86 182 1 1 1 n 3 147 113 23 38 34 19 70 3 27 1 1 1 p 3 2 135 138 202 t 9 1 52 78 20 130 7 38 20 16 88 17 1 2 1 k 32 1 2 2 35 17 97 24 37 218 6 2 6 1 b 98 152 230 d 41 91 82 11 90 11 15 4 1 112 19 1 1 1 g 12 1 3 52 34 78 45 34 111 3 1 61 17 11 3 2 12 h 3 2 1 36 2 54 14 18 334 5 7 1 1 1 1 - 7 21 2 1 261 188 * 18 1 14 18 6 1 12 7 4 216 179 3 1 s 1 1 29 1 60 2 1 6 3 303 65 2 3 1 1 1 z 2 1 1 4 5 28 1 14 69 3 2 2 255 63 7 6 4 13 f 1 1 1 1 1 1 334 140 v 1 330 149  2 5 3 3 16 36 6 2 90 8 136 33 57 83 , 1 4 1 6 62 2 67 6 171 30 55 75 t 3 2 23 1 30 50 5 89 20 119 14 64 60 d, 1 1 1 9 57 1 1 81 16 165 16 60 71


Table A.19 Perceptual confusion matrix C ALL,u .

y w r l m n p t k b d g h - * s z f v  , t d, y 208 54 1 28 4 13 22 20 5 80 4 2 3 2 2 15 2 6 9 w 396 42 36 3 1 2 r 2 119 97 2 23 1 6 5 2 3 9 5 153 48 1 4 l 104 10 280 6 10 14 10 15 6 14 1 4 1 3 2 m 309 57 111 1 2 n 72 12 2 199 7 50 8 36 6 19 10 5 3 17 11 1 11 11 p 1 2 222 90 164 1 t 66 16 20 1 9 96 26 48 9 22 14 2 35 44 18 1 13 40 k 58 81 1 9 8 9 53 1 15 40 195 2 5 1 2 b 1 238 89 150 1 1 d 80 19 1 63 12 69 8 65 11 14 20 2 13 31 1 24 4 13 30 g 118 50 1 40 6 6 86 4 47 114 5 1 1 1 h 23 124 1 4 3 1 2 15 1 3 10 290 1 1 1 - 2 9 57 2 2 1 1 356 47 1 1 1 * 9 2 30 7 18 17 1 1 317 64 7 5 1 1 s 15 3 1 2 60 50 2 7 1 132 116 1 34 3 12 41 z 18 1 1 2 62 2 47 4 144 117 38 5 12 27 f 1 1 1 1 362 114 v 2 1 4 357 115 1  26 2 44 2 36 2 94 57 83 18 47 69 , 20 1 47 40 1 1 95 57 1 1 86 13 44 73 t 58 3 2 2 71 8 1 42 2 2 3 59 52 64 13 31 67 d, 39 7 1 1 35 4 32 5 2 81 59 78 17 47 72

Table A.20 Perceptual confusion matrix C ALL,aiu .

y w r l m n p t k b d g h - * s z f v  , t d, y 301 54 1 139 153 53 194 1 121 89 189 17 9 30 5 32 14 16 5 6 11 w 8 1284 93 39 1 1 8 1 1 3 1 r 4 374 372 3 30 3 2 7 3 4 6 4 3 12 5 504 98 1 1 4 l 111 10 855 1 89 66 43 1 72 39 42 41 11 41 10 2 3 3 m 1 1 683 221 528 2 2 1 1 n 80 12 2 534 176 132 73 152 43 102 16 5 56 20 1 11 2 12 11 p 5 4 506 1 349 569 1 3 1 1 t 68 16 37 2 75 300 54 1 298 24 70 40 20 266 81 26 4 17 41 k 100 81 2 71 2 64 37 293 1 63 124 570 8 2 12 5 2 2 1 b 2 2 487 351 595 1 1 1 d 80 19 1 127 118 266 21 286 26 31 28 7 263 71 2 2 27 5 19 41 g 147 50 2 109 90 51 291 83 142 342 11 3 71 17 1 11 3 3 13 h 29 124 1 11 4 44 2 8 105 1 21 49 1018 7 9 1 1 1 1 2 1 - 3 9 80 3 34 8 3 7 6 1 780 502 1 1 1 1 * 14 2 55 1 50 52 12 1 57 18 23 680 452 13 6 2 1 1 s 17 3 1 5 149 2 1 191 2 3 14 7 711 230 3 39 5 13 44 z 20 1 1 3 1 6 5 145 3 14 191 4 2 7 3 683 220 46 12 24 49 f 1 1 1 2 1 1 2 1 1 1 1 1104 323 v 2 1 4 1 1 1 1 4 1083 341 1  29 3 10 1 66 7 1 81 36 6 4 282 73 294 67 216 264 , 22 8 1 6 78 1 78 64 4 1 5 221 69 1 1 324 54 225 277 t 63 3 5 7 2 116 12 2 90 61 8 4 2 224 80 1 256 38 242 224 d, 41 7 4 4 79 5 1 72 63 3 2 1 226 83 306 46 233 264


APPENDIX B. PHONEME EQUIVALENCE CLASSES

In this appendix, we present tables that list the phoneme equivalence classes (PECs) for each lipreading result (different talkers and vowel contexts) and the corresponding PECs predicted from physical measures. Within each table, the first row of each condition lists the PECs derived from the lipreading results, and the following four rows list the PECs predicted from physical measures using all retro-reflectors, lip retro-reflectors, cheek retro-reflectors, and chin retro-reflectors, respectively.
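
To make the comparison between perceptual and predicted PECs concrete, the short Python sketch below (hypothetical code, not taken from this dissertation) scores how often two partitions of the same phoneme set agree on whether a pair of consonants is grouped together; the abbreviated phoneme sets in the example are illustrative only, and this pairwise-agreement score is not the measure used in Chapter 5.

    # Hypothetical sketch: score agreement between a perceptual PEC partition and
    # a partition predicted from optical data, using the unadjusted Rand index
    # (fraction of phoneme pairs grouped consistently by both partitions).
    from itertools import combinations

    def to_labels(partition):
        # Map each phoneme to the index of the equivalence class containing it.
        return {p: i for i, cls in enumerate(partition) for p in cls}

    def pairwise_agreement(partition_a, partition_b):
        la, lb = to_labels(partition_a), to_labels(partition_b)
        phonemes = sorted(la)
        assert sorted(lb) == phonemes, "partitions must cover the same phoneme set"
        agree = sum((la[p] == la[q]) == (lb[p] == lb[q])
                    for p, q in combinations(phonemes, 2))
        return agree / (len(phonemes) * (len(phonemes) - 1) / 2)

    # Abbreviated, hypothetical phoneme sets for illustration only.
    perception = [{"w"}, {"m", "p", "b"}, {"f", "v"}, {"r", "l", "t", "d", "s", "z"}]
    predicted  = [{"w"}, {"m", "p", "b"}, {"f", "v", "r"}, {"l", "t", "d", "s", "z"}]
    print(round(pairwise_agreement(perception, predicted), 3))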


Table B.1 PECs across vowel context for different talkers.

Talker M1
  Perception      {w} {m p b} {r f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b} {r f v} {θ ð y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w} {m p b} {f v} {r θ ð y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w} {m p b g} {f v} {θ ð r y l n k h t d} {s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {r m f v} {w p b θ y l n k g h} {ð t d s z ʃ ʒ tʃ dʒ}
Talker F1
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w r m p b} {f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r m p b} {f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {f v} {w r m p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Talker M2
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w r} {m p b} {f v θ ð} {y l n t k d g h s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r} {f v} {m p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w r m p b f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Talker F2
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z} {ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b} {r f v} {θ ð y l k g h} {n t d s z} {ʃ ʒ tʃ dʒ}
  Predicted_lip   {w} {m p b} {r f v} {θ ð y l n k g h t d s z} {ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b} {θ ð y l n k g h t d z} {r f v s ʃ ʒ tʃ dʒ}
  Predicted_chn   {w} {m r f v} {p b θ ð y l n k g h t d} {s z ʃ ʒ tʃ dʒ}


Table B.2 PECs across talkers for different vowels.

Vowel /a/
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w m p b} {r f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b} {r θ ð y l n k g h t d} {f v s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m p b r f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Vowel /i/
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h t d} {s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b} {r f v} {θ ð y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w} {m p b t} {r f v} {θ ð y l n k g h d} {s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w r m f v} {p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Vowel /u/
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w r m p b} {f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r m p b θ ð} {f v} {y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b h} {f v} {r θ ð y l n k g t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m p b r f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Vowel /aiu/
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {m p b} {w r f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w m r f v} {p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b r f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m r f v} {p b θ ð y l n k g h t d} {s z ʃ ʒ tʃ dʒ}


Table B.3 PECs of C/a/ for different talkers.

Talker M1
  Perception      {w} {m p b} {r f v} {h} {θ ð y l n k g} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b} {r f v} {h θ ð y l n k g} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w} {m p b} {r f v} {h θ ð y l n k g} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w} {m p b y n k g h} {r f v θ ð l t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m p b r f v h y n k g θ ð} {l t d s z ʃ ʒ tʃ dʒ}
Talker F1
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w m p b} {f v} {r θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b} {r f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Talker M2
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w r} {m p b} {f v} {θ ð} {y l n t k d g h s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r} {m p b} {f v} {θ ð} {y l n t k d g h s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w r} {m p b n} {f v θ ð} {l k g h} {y t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m p b r f v} {θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Talker F2
  Perception      {w} {m p b} {r f v} {y l n k g h θ ð t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b} {r f v} {l} {y k g h θ ð} {n t d s z} {ʃ ʒ tʃ dʒ}
  Predicted_lip   {w} {r m f v} {p b} {n k g h θ ð} {y l t d s z} {ʃ ʒ tʃ dʒ}
  Predicted_chk   {w} {p b k g h θ ð} {m r f v y l n t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w} {b k g h θ ð} {r m f v} {p y l n t d s z ʃ ʒ tʃ dʒ}


Table B.4 PECs of C/i/ for different talkers.

Talker M1
  Perception      {w} {m p b} {r f v} {y k g h} {θ ð l n t d} {s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w} {m p b y k g h l n t d} {r f v θ ð} {s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w m p b y k g h θ ð l n t d} {r f v} {s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w} {m p b r f v y k g h θ ð l n t d} {s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m p b k} {y g h l n t d} {r f v θ ð} {s z ʃ ʒ tʃ dʒ}
Talker F1
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w r} {m f v} {p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r} {m f v} {p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w r m p b f v θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w r m f v} {p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Talker M2
  Perception      {w r} {m p b} {f v} {θ ð} {y l n t k d g h} {s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w r} {m p b} {f v} {θ ð} {y l n t k d g h} {s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r} {m p b} {f v} {θ ð} {y l n t k d g h} {s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w r} {m p b} {f v} {θ ð} {y l n t k d g h} {s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w r m p b f v θ ð y l n t k d g h s z ʃ ʒ tʃ dʒ}
Talker F2
  Perception      {w} {m p b} {r f v} {θ ð} {y l n k g h} {t d s z} {ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w m p b} {r f v θ} {ð y l n k g h t d s z} {ʃ ʒ tʃ dʒ}
  Predicted_lip   {w} {m p b} {r f v} {θ ð y l n k g h t d} {s z} {ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b r} {ð y l n k g h t d s z} {f v θ ʃ ʒ tʃ dʒ}
  Predicted_chn   {w m p b θ ð y l n k g h t d} {r f v s z ʃ ʒ tʃ dʒ}


Table B.5 PECs of C/u/ for different talkers.

Talker M1
  Perception      {w y k g h} {m p b} {r f v} {θ ð l n t d} {s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w y k g h} {m p b} {r f v} {θ l n d} {ð t s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w m p b y l n k g h θ} {r f v ð t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w m p b y l n k g h θ} {r f v ð t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {w p b} {r f v} {θ y l m n k g h} {ð t d s z ʃ ʒ tʃ dʒ}
Talker F1
  Perception      {w r m p b} {f v} {θ ð} {y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {w r m h} {f v} {p b θ ð y l n k g t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r m} {f v} {p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {f v} {w r m p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {r f v} {w m p b θ ð y l n k g h t d s z ʃ ʒ tʃ dʒ}
Talker M2
  Perception      {w r} {m p b} {f v} {θ ð} {y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {m p b} {f v θ ð l} {w r y n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w r m p b θ ð y l n k g h d} {f v t s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {w r m p b y n k g h t d s z ʃ ʒ tʃ dʒ} {f v θ ð l}
  Predicted_chn   {w m p b y n k g h} {r f v θ ð l} {t d s z ʃ ʒ tʃ dʒ}
Talker F2
  Perception      {m p b} {r f v} {θ ð} {w y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_ALL   {m p b} {r f v} {θ ð} {w y l n k g h} {t d s z ʃ ʒ tʃ dʒ}
  Predicted_lip   {w m p b θ ð} {r f v} {y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chk   {m p b r f v θ ð w y l n k g h t d s z ʃ ʒ tʃ dʒ}
  Predicted_chn   {m p b r f v θ ð w y l n k g h t d s z ʃ ʒ tʃ dʒ}


APPENDIX C. 3-D MULTIDIMENSIONAL SCALING ANALYSES

In this appendix, we present figures from a 3-D multidimensional scaling (MDS) analysis of the confusion dissimilarity matrices derived from Appendix A and of the corresponding matrices predicted from the all, lip, cheek, and chin retro-reflector sets. Each figure is labeled in the format V_T_src, where V is /a/, /i/, /u/, or /aiu/; T is M1, M2, F1, or F2; and src is perception, ALL (all retro-reflectors), lip, chk (cheeks), or chn (chin). When T is absent, the results were pooled across all talkers.
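
For readers who want to reproduce this kind of display, the short Python sketch below (hypothetical code, not taken from this dissertation) converts a small confusion matrix into a dissimilarity matrix and embeds it in three dimensions with classical (Torgerson) MDS. The toy matrix and the simple symmetrize-and-invert dissimilarity rule are illustrative assumptions only; the analyses reported here used the phi-square transformation and the MDS procedure described in Chapter 3.

    # Hypothetical sketch: embed a toy consonant confusion matrix in 3-D with
    # classical (Torgerson) MDS. The dissimilarity rule used here is a stand-in
    # for the phi-square transformation described in Chapter 3.
    import numpy as np

    labels = ["p", "b", "f", "v"]                   # hypothetical consonant subset
    C = np.array([[50., 30.,  5.,  2.],             # rows: stimuli, columns: responses
                  [28., 52.,  3.,  4.],
                  [ 4.,  2., 48., 33.],
                  [ 3.,  5., 30., 49.]])

    P = C / C.sum(axis=1, keepdims=True)            # response proportions per stimulus
    S = 0.5 * (P + P.T)                             # symmetric similarity
    D = 1.0 - S / S.max()                           # crude dissimilarity in [0, 1]
    np.fill_diagonal(D, 0.0)

    # Classical MDS: double-center the squared dissimilarities, then eigendecompose.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:3]             # keep the three largest dimensions
    coords = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

    for lab, xyz in zip(labels, coords):
        print(lab, np.round(xyz, 3))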

[Figures: 3-D MDS solutions, one page per condition, each page showing five panels labeled perception, ALL, lip, chk, and chn. Conditions: /a/_M1, /i/_M1, /u/_M1, /aiu/_M1; /a/_F1, /i/_F1, /u/_F1, /aiu/_F1; /a/_M2, /i/_M2, /u/_M2, /aiu/_M2; /a/_F2, /i/_F2, /u/_F2, /aiu/_F2; and, pooled across all talkers, /a/, /i/, /u/, /aiu/.]
