arXiv:1710.01351v1 [cs.CV] 3 Oct 2017

Understanding the visual speech signal

Helen L. Bear
University of East London, 4-6 University Way, London E16 2RD
[email protected]

Abstract

For machines to lipread, or understand speech from lip movement, they decode lip-motions (known as visemes) into the spoken sounds. We investigate the visual speech channel to further our understanding of visemes. This has applications beyond machine lipreading; speech therapists, animators, and psychologists can benefit from this work. We explain the influence of speaker individuality, and demonstrate how one can use visemes to boost lipreading.

1. Introduction

Machine lipreading (MLR) is speech recognition without audio input, e.g. from a silent video. MLR research is of interest to computer vision engineers and speech researchers. Two current complementary challenges in MLR are: to develop an end-to-end system, or to understand the visual speech signal so that the knowledge can be applied to new domains such as speech therapy and animation. Our work addresses the latter challenge. Phonemes are the smallest sounds one can make [2], and a viseme is the visual equivalent [12]. Current knowledge of visemes is limited: there is no proven function (often presented as a map) between visemes and phonemes. Our work here focuses on understanding visemes, in order to recognise the right phoneme.

1.1. Conventional lipreading machines

The conventional lipreading process has, at a high level, been adopted from audio recognition systems. This is: 1) track faces and extract features, 2) train a model and classify, 3) filter output through a language network. Debates over the optimal tracking methods, features [13], and classifier method [25] remain but, pre-deep learning, the classic choices with accurate results were Active Appearance Model features [20] and Hidden Markov Model classifiers [23] (often built with the HTK toolkit [27], e.g. [19, 22, 24]).

1.2. Data

Available lipreading datasets are reviewed in [3], but the datasets giving the most accurate lipreading to date are BBC [14], TCD-TIMIT [16], Oulu [1], and RMAV [17]. We use RMAV.

2. The phoneme-to-viseme map play-off

We begin with a play-off to measure the effect of using different phoneme-to-viseme (P2V) maps from prior work. 120 P2Vs are tested with the conventional system on 12 talkers. The results are displayed in Figure 1 as a heatmap [8]. Consonant P2Vs are on the x-axis and vowel P2Vs on the y-axis. We see that a combination of Disney vowels [18] and Woodward consonants [26] performs best. This contrasts with [10], which concluded that Lee's visemes [19] achieved the most accurate lipreading with isolated words, which suggests that utterance duration affects visemes.

Figure 1: Heatmap of lipreading P2V maps [8]. Consonant maps (x-axis): WOO, FIS, FRA, DIS, LEE, HEI, HAZ, FIN, BOZ, BIN, JEF, KRI, NET, WAL, NIC. Vowel maps (y-axis): DIS, JEF, HAZ, NET, LEE, BOZ, NIC, MON.

Figure 2 shows critical difference plots for the P2V maps. Critical difference is a measure of confidence intervals between different algorithms [15]. Overlapping bars join P2V maps which are not critically different. By comparing Figures 2a and 2b we see that consonant visemes vary less than the vowel sets. This observation is supported by lipreading practitioners (e.g. Nichie [21]), who advocate that there are key shapes for articulator sounds (vowels) and that gestures are formed by motion between the shapes, with the motions determined by consonants.

All P2V maps are fully tabulated in [4, 10].

Figure 2: Critical difference between P2V maps. (a) Consonant P2V maps. (b) Vowel P2V maps.

3. Speaker independence

In [6] results show that Speaker-Dependent (SD) visemes can improve lipreading accuracy. In Figure 3 this conclusion is reinforced with equivalent experiments on continuous speech talkers. Red plots show SD visemes, blue plots are Multi-speaker (MS) visemes, and orange are Speaker-Independent (SI) visemes. Speaker independence is the ability to lipread previously unseen talkers and is an obstacle for lipreading machines.

Figure 3: Comparing multi-speaker, speaker-dependent P2V functions on six RMAV speakers. (x-axis: test speaker, Speaker1 to Speaker6; y-axis: word correctness, C %; legend: SSD, MS, SI.)

In [11] we learn there is a limitation on how useful all SI visemes within a set are towards recognition accuracy. A badly trained viseme is worse than no viseme. However, with our SD visemes (red plots in Figure 3) all visemes increase accuracy. So, whilst bad training data is more detrimental to classification than having less, with the right knowledge of visual gestures our need for big data is reduced for accurate lipreading.

4. Boosting phonemes with visemes

We present an experiment in [9] which showed that viseme sets with < 11 visemes are negatively affected by homophone confusions. The sets which are too large (> 35) do not differentiate sufficiently for accurate lipreading. This means the range of optimum sizes is from 11 to 35 and varies by talker. Further to this, in [7] we designed a hierarchical training method which used viseme classifiers as initialisation models of phoneme classifiers, for all viseme set sizes. All talker mean results are in Figure 4. Phoneme HMMs initialised with visemes achieve higher accuracy.

Figure 4: Boosting with network decoding and classifier units. (x-axis: quantity of visemes in set, 11 to 37; y-axis: all speaker mean correctness, C %; legend, classifier and network units: visemes + word net, visemes + phoneme net, WLT phonemes + word net, WLT phonemes + phoneme net, weighted guessing.)

We also tested the effect of the language network unit. In Figure 4 we show that a phoneme network is better than a word network. However, using a phoneme network means the final output is a phoneme string which requires further processing to understand, but in [9] this effect is not significant.

5. Conclusions and the future

In our comparison of previous P2V mappings there is little difference between them, but Disney's outperforms others on continuous speech and Lee's marginally outperforms others [10] on isolated words. This means that visemes vary by speaker and by utterance. We suggest that speaker individuality in visual speech is due to the variability with which different people use visual gestures whilst talking.

For speaker-dependent recognition there are choices when selecting a set of visemes containing fewer classes than the phoneme set, yet these sets outperform phoneme-labelled classifiers. But phoneme classifiers are desirable as these are cross-speaker consistent, so we ask: is there a way of mapping similarities between SD visemes [5]? For not only can the right SD visemes outperform phoneme classifiers but, when used to help train phoneme classifiers, they also lipread significantly better [7].

Best results are achieved when the units match between classifiers and the language network, but not significantly so. So, for the purposes of decoding phonemes to the words spoken, the preferred network unit is words [7].

End-to-end systems perform well with big data and deep learning [14] but we are still to fully understand the visual speech signal. Understanding visual speech will mean we can improve adaptation between talkers in the future.

References

[1] I. Anina, Z. Zhou, G. Zhao, and M. Pietikäinen. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 1, pages 1–5, May 2015.
[2] I. P. Association. Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet. Cambridge University Press, 1999.
[3] Authors. When will machine lipreading come of age? Have we reached a singularity? Signal Processing Magazine - under review, 2017.
[4] H. L. Bear. Decoding visemes: improving machine lipreading, 2016.
[5] H. L. Bear. Visual gesture variability between talkers in continuous visual speech. In British Machine Vision Conference (BMVC) Deep learning for machine lip reading workshop. British Machine Vision Association (BMVA), 2017.
[6] H. L. Bear, S. J. Cox, and R.
[16] N. Harte and E. Gillen. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Transactions on Multimedia, 17(5):603–615, May 2015.
[17] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden. Improving visual features for lip-reading. Proceedings of the International Conference on Audio-Visual Speech Processing (AVSP), 7(3):42–48, 2010.
[18] J. Lander. Read my lips: Facial animation techniques. http://www.gamasutra.com/view/feature/131587/read_my_lips_facial_animation_.php. Accessed: 2014-01-28.
[19] S. Lee and D. Yook. Audio-to-visual conversion using hidden Markov models. In Proceedings of Pacific Rim International Conference on Artificial Intelligence (PRICAI), pages 563–570. Springer, 2002.
[20] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, 2004.
[21] E. Nichie. Lipreading principles and practice, 1912.
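
The notion of a P2V map from Section 2, and the homophone confusions that Section 4 reports for small viseme sets, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the map TOY_P2V, the viseme class names, the lexicon TOY_LEXICON, and the helper functions are hypothetical and are not any of the published maps or code compared in this paper.

```python
from collections import defaultdict

# Hypothetical toy P2V map: a many-to-one function from phonemes to visemes.
# It is NOT one of the maps compared in the paper (Disney, Woodward, Lee, ...).
TOY_P2V = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental", "v": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "ae": "V_open", "aa": "V_open",
    "iy": "V_spread", "ih": "V_spread",
}

# Hypothetical pronunciation dictionary (word -> phoneme sequence).
TOY_LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "mat": ["m", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "fin": ["f", "ih", "n"],
    "tea": ["t", "iy"],
}


def to_visemes(phonemes, p2v):
    """Translate a phoneme sequence into its viseme sequence."""
    return tuple(p2v[p] for p in phonemes)


def homophone_groups(lexicon, p2v):
    """Group words whose viseme sequences are identical under the map.

    Words in the same group cannot be told apart from the viseme string
    alone, which is the kind of confusion a small viseme set produces.
    """
    groups = defaultdict(list)
    for word, phonemes in lexicon.items():
        groups[to_visemes(phonemes, p2v)].append(word)
    # Keep only viseme strings shared by more than one word.
    return {v: sorted(words) for v, words in groups.items() if len(words) > 1}


if __name__ == "__main__":
    for visemes, words in homophone_groups(TOY_LEXICON, TOY_P2V).items():
        print(" ".join(visemes), "->", words)
```

With this toy map, "pat", "bat", "mat" and "bad" collapse to one viseme string, whereas "fin" and "tea" stay distinct. Merging more phonemes into fewer viseme classes enlarges such groups, which is one way to read the lower bound on useful viseme set size discussed in Section 4.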
