
Visual Speech Perception of Arabic Emphatics and Gutturals

By

Maha Saliba Foster

B.A., American University of Beirut, 1982

M.A., University of Colorado, 2012

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Department of Linguistics

Department of Cognitive Science

2016


This thesis entitled:

Visual Speech Perception of Arabic Emphatics and Gutturals

Written by Maha Saliba Foster

has been approved for the Department of Linguistics and the Institute of Cognitive Science

______Dr. Rebecca Scarborough, Chair

______Dr. Eliana Colunga

______Dr. Bhuvana Narasimhan

______Dr. David Rood

______Dr. Sarel Van Vuuren

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.

IRB protocol # 14-0394


Foster, Maha Saliba (Ph.D., Linguistics and Cognitive Science)

Visual Speech Perception of Arabic Emphatics and Gutturals

Thesis directed by Dr. Rebecca Scarborough

ABSTRACT

This investigation explores the potential effect on perception of the visual speech cues associated with Arabic gutturals (AGs) and Arabic emphatics (AEs); AEs are pharyngealized phonemes characterized by a visually salient primary articulation but a rather invisible secondary articulation produced deep in the pharynx. The corpus consisted of 72 minimal pairs, each containing two contrasting consonants of interest (COIs): an emphatic versus a non-emphatic, or a guttural paired with another guttural. In order to assess the potential effect that visual speech information in the lips, chin, cheeks, and neck has on the perception of the COIs, production data elicited from 4 native Lebanese speakers was captured on videos that were edited to allow perceivers to see only certain regions of the face. Fifty-three Lebanese perceivers watched the muted movies, each presented with a minimal pair containing the word uttered in the video, and selected, in a forced identification task, the word they thought they saw the speaker say. The speakers’ speech was analyzed to help explore what in their production informed correct identification of the COIs. Perceivers were above chance at correctly identifying AEs and AGs, though AEs were better perceived than AGs. In the emphatic category, the measurement differences between a word and its pair, and their effect on perception, were also submitted to automatic speech recognition. The machine learning models were generally successful at correctly classifying COIs as emphatic or non-emphatic across contexts; the models were able to predict the probability of perceivers’ accuracy in identifying certain COIs produced by certain speakers; and an overlap was found between the measurements selected by the computer and those selected by human perceivers. No difference in perception of AEs according to the part of the face that was visible was observed, suggesting that the lips, present in all of the videos, were most important for the perception of emphasis. Conversely, in the perception of AGs, the lips were not as informative, and perceivers relied more on the cheeks and chin. The presence of visible cues associated with AEs, particularly in the lips, suggests that such visual cues might be informative for non-native learners as well, if they were trained to attend to them.


Dedication

I dedicate this dissertation to my beloved father. Dad, if you weren’t constricted by your aging body, you would be calling me to ask how you could help me get through this trying time of writing a dissertation. Remember how you used to wake me up with a glass of fresh squeezed orange juice when I stayed up late studying for my finals? I know you would be proud of me and how much I wish you could be here to show me once more your love and support. To you my daddy, I dedicate this work.


Acknowledgements

It has been a long and arduous path from start to finish, having to deal with a long commute, a full-time job, a family, and a dissertation. Sitting day in and day out in my little corner, how many times have I cried out to my Lord and Savior Jesus Christ! It is only by putting my faith in Him that I was able to take the next step and then the next, towards the finish line that often seemed more like a mirage than a reality. Thank you, Jesus, for giving me the perseverance to go on, and for surrounding me with a multitude of people who were instrumental in making it possible for me to reach my goal. I apologize for any oversight, for my supporters and encouragers were many!

Special thanks to my advisor Dr. Rebecca Scarborough, whose guidance was invaluable, and to all four members of my committee, Dr. Bhuvana Narasimhan, Dr. David Rood, Dr. Eliana Colunga, and Dr. Sarel van Vuuren, for being there for me and taking time out of their busy schedules to read my dissertation and provide me with valuable comments.

I am also very fortunate to be married to a brilliant software engineer who put in countless hours writing code that allowed me to collect data online and capture all the answers. Without you, Allan, I would still be formatting my data! That was really a labor of love that you offered me unconditionally. Thank you also to my four boys, who witnessed me continuously sitting in the same corner surrounded with papers and with a computer glued to my lap. Writing a dissertation could take a toll on family relations, but I thank God for my loving and supportive family!

The completion of this dissertation would not have been possible without the help of many people at the University of Denver. Thank you, Dr. Mahoor, for providing me with all I needed to collect my data: a place to run subjects, the equipment, and the programs needed for fitting faces and capturing the relevant measurements; thank you, Kris Douglass, for all the hard work you put into fitting all the speakers’ faces, and Ali Mollahosseini, for lending Kris and me your expertise. Thank you to the staff at the media center, who put up with me for months as I was working on editing the videos and provided me with all the much needed technical support. Most of all, I owe Dr. Cathy Durso an inexpressible debt of gratitude for the many, many hours she invested in helping me with my statistics and data analysis. I really can’t thank you enough, Cathy; you went well beyond the call of duty!

Lastly, none of this would have been possible without the participation of the three speakers who volunteered their time, and of all the perceivers who were willing to put up with a rather lengthy study.

I am truly blessed to have had the pleasure to work with such a tremendous team where each member was irreplaceable and played an essential role in contributing to the success of this project.


Table of Contents

Abstract ...... iii

Acknowledgements ...... v

Table of Contents ...... vii

List of Tables ...... x

List of Figures ...... xii

CHAPTER 1: Introduction ...... 1

1 Background ...... 1

2 Arabic Emphatics and Gutturals ...... 5

2.1 Arabic Emphatics ...... 7

2.2 Arabic Gutturals ...... 9

3 Audio-Visual Speech Perception in Arabic ...... 11

4 Goal of this Study ...... 12

CHAPTER 2: Production of AEs and AGs ...... 15

1 Preparation of Stimuli ...... 15

2 Experiment 1: Production Data Collection ...... 18

3 Measurements ...... 20

3.1 Procedure ...... 20

3.2 AEs and NEs Arrow Plots ...... 25

CHAPTER 3: Perception of Emphasis ...... 35

1 Experiment 2: Perception Data ...... 35

2 Perception Results ...... 39

2.1 Descriptive Statistics ...... 40


2.2 Variables Interactions ...... 47

2.3 Models by Speaker ...... 62

CHAPTER 4: Production-Perception Connection ...... 69

1 Machine Learning Classification of Emphasis ...... 71

1.1 Machine Learning Models ...... 73

1.1.1 Decision Trees ...... 74

1.1.2 Machine Learning Forward-Build Models ...... 78

1.1.3 General Interpretation of Machine Learning Measurements...... 85

2 Connection of Production-Perception of Emphasis for Perceivers ...... 86

2.1 Best-Forward-Build Models ...... 86

2.2 Single-Best Function ...... 87

CHAPTER 5: Correlation between Machine Learning and Human Perceivers ... 90

1 Correlation between Machine Learning and Human Perceivers ...... 92

1.1 Procedures and Results ...... 93

CHAPTER 6: Discussion of Visual Perception of Emphasis ...... 98

Chapter 7: Statistical Overview of Arabic Gutturals ...... 106

1 Overview: Descriptive Statistics ...... 107

2 Variables Interactions ...... 117

Chapter 8: Perception-Production Connection in Arabic Gutturals ...... 126

1 Decision Trees ...... 126

2 Arrow Plots ...... 129

3 Forward-Build Models ...... 132

Chapter 9: Discussion and Conclusions ...... 134

1 Summary and Comparison of Emphatics and Gutturals ...... 134

2 Discussion ...... 143

3 Concluding Remarks and Future Direction ...... 148

References ...... 150

Appendix ...... 155


List of Tables

Table 1-1: Arabic Emphatic (AEs) and Guttural (AGs) Contrasts ...... 11
Table 2-1: Words Distribution in the Corpus ...... 17
Table 2-2: Measurements and their Descriptions ...... 22
Table 3-1: Visible Parts in Each of the Six Facial Conditions ...... 37
Table 3-2: Percent Correct by Long and Short Vowel ...... 43
Table 3-3: Percent Correct for NEs and AEs by Speaker ...... 44
Table 3-4: All-Inclusive Model ...... 49
Table 3-5: Best Model ...... 50
Table 3-6: Effect of Vowel-Emphasis Interaction on Perception ...... 51
Table 3-7a: Effect of Consonant-Vowel Interaction on Perception of AEs ...... 54
Table 3-7b: Effect of Consonant-Vowel Interaction on Perception of NEs ...... 55
Table 3-8: Emphasis-Speaker Interaction ...... 57
Table 3-9: Effect of Vowel-Speaker Interaction on Perception of Emphasis ...... 58
Table 3-10: Face Effect on Emphasis ...... 60
Table 3-11: Speaker*Face Interaction in the AE Subset ...... 61
Table 3-12: Patterns of Perception Associated with Speaker 1 ...... 63
Table 3-13: Patterns of Perception Associated with Speaker 2 ...... 64
Table 3-14: Patterns of Perception Associated with Speaker 4 ...... 66
Table 4-1: Decision Trees Accuracy Rates and Best Measurements ...... 77
Table 4-2: Best Model for Machine Identification of AEs ...... 79
Table 4-3: Best Measurements Selected in all Vowel Contexts ...... 80
Table 4-4: Best Measurements Selected for AE Identification in [æ]-Context ...... 81
Table 4-5: Best Measurements Selected in the [æ:]-Context ...... 81
Table 4-6: Best Measurements in the [i]-Context ...... 82
Table 4-7: Best Measurements in the [i:]-Context ...... 83
Table 4-8: Best Measurements in the [u]-Context ...... 83
Table 4-9: Best Measurements in the [u:]-Context ...... 84
Table 4-10: Forward-Build [æ]-best-model ...... 86
Table 4-11: Best-Models by Vowel ...... 87
Table 4-12: Single Best Variables Generated by the ‘Single-Best’ Function ...... 87
Table 5-1: Best Machine and Best Subsets Measurements ...... 92
Table 5-2: ML Classification and Perceivers’ Correctness Correlation ...... 94
Table 7-1: Percentage and Number of Correct Answers by Consonant Pairs ...... 108
Table 7-2: Percentage and Number of Correct Answers by Speaker ...... 108
Table 7-3: /ħ, h/-Percentage and Number of Correct Answers by Speaker ...... 109
Table 7-4: /ʔ, ʕ/-Percentage and Number of Correct Answers by Speaker ...... 109
Table 7-5: Percentage and Number of Correct Answers by Consonant ...... 111
Table 7-6: Effect of Consonant Position on Perception of /ħ, h/ and /ʕ, ʔ/ ...... 111
Table 7-7: Effect of Vowel Quality on Perception of /ħ, h/ ...... 111
Table 7-8: Effect of Vowel Quality on Perception of /ʕ, ʔ/ ...... 112
Table 7-9: Effect of Vowel Length on Perception of /ħ, h/ ...... 112
Table 7-10: Effect of Vowel Length on Perception of /ʕ, ʔ/ ...... 112
Table 7-11: Long and Short Vowel Effect on Perception of /ʕ, ʔ/ ...... 113
Table 7-12: Long and Short Vowel Effect on Perception of /ħ, h/ ...... 113
Table 7-13: Effect of CV/VC on Perception of /ħ, h/ ...... 114

Table 7-14: Effect of CV/VC on Perception of /ʔ, ʕ/ ...... 115
Table 7-15: Percentage and Number of Correct Answers by Face ...... 116
Table 7-16: Best /ħ, h/ Model ...... 118
Table 7-17: Effect of Speaker*Consonant Interaction on Perception of /ħ, h/ ...... 119
Table 7-18: Position Effect on Perception of /ħ/ ...... 119
Table 7-19: Position Effect on Perception of /h/ ...... 120
Table 7-20: Vowel Consonant Effect on Perception of /ħ, h/ ...... 120
Table 7-21: Best /ʔ, ʕ/ Model ...... 122
Table 7-22: Vowel Consonant Effect on Perception of /ʔ, ʕ/ ...... 123
Table 7-23: Position Effect on Perception of /ʕ/ ...... 123
Table 7-24: Face Effect in /ʔ/-only Model ...... 124
Table 7-25: Summary of AGs Effects on Perception ...... 124
Table 8-1: /ʔ, ʕ/ Best Single Variables Selected by the ‘Single-Best’ Function ...... 132
Table 8-2: /ħ, h/ Best Single Variables Selected by the ‘Single-Best’ Function ...... 133


List of Figures

Figure 1-1: X-Ray Image of Three Pharyngeal Cavities ...... 6
Figure 1-2: Videofluoroscopy of /sæb/ and /sʕɑb/ Produced by a Male Speaker ...... 8
Figure 2-1: Fitted Models ...... 19
Figure 2-2: 66 Landmark Points ...... 20
Figure 2-3: Spectrograms of Selected Intervals ...... 21
Figure 2-4: Horizontal Lip Trajectory in the Production of /ʃɑmɑtʕ, ʃæmæt/ ...... 24
Figure 2-5: Lip Horizontal Difference in [æ-æ:]-Context ...... 27
Figure 2-6: Chin Width Low Difference in [æ-æ:]-Context ...... 28
Figure 2-7: Lip Horizontal Difference in [i-i:]-Context ...... 30
Figure 2-8: Chin Width Mid Difference in [i-i:]-Context ...... 31
Figure 2-9: Upper Lip Thickness Difference in [u-u:]-Context ...... 32
Figure 2-10: Chin Vertical Difference in [u-u:]-Context ...... 33
Figure 3-1: Speaker 1: Six Facial Conditions ...... 35
Figure 3-2: Bone Structure of the Face ...... 36
Figure 3-3: % of Correct Responses of NEs and AEs across all ...... 41
Figure 3-4: % of Correct Responses as a Function of Position ...... 43
Figure 3-5: % of Correct Answers by Speaker for AEs, NEs, and AEs + Es ...... 44
Figure 3-6: AEs Distribution of Responses by Face Condition ...... 45
Figure 3-7: NEs Distribution of Responses by Face Condition ...... 46
Figure 4-1: Speaker 2 Mouth in /tʕɑ:b/ and /tæ:b/ ...... 70
Figure 4-2: Speaker 1 Mouth in /sʕu:r/ and /su:r/ ...... 70
Figure 4-3: [æ:]-Decision Tree ...... 75
Figure 4-4: [u]-Decision Tree ...... 76
Figure 5-1: [æ] Probability of Correctness ...... 96
Figure 6-1: Machine Learning vs Human Performance on AE Recognition ...... 100
Figure 7-1: Perception of AG by Speaker ...... 110
Figure 7-2: Vowel Effect on Perception ...... 114
Figure 8-1: /ħ, h/ Decision Tree by Scaled Measurements ...... 127
Figure 8-2: /ʔ, ʕ/ Decision Tree by Scaled Measurements ...... 129
Figure 8-3: Lower Cheek Measurement Differences Between /ʕ/ and /ʔ/ ...... 130
Figure 8-4: Lip Horizontal Measurement Differences Between /ʕ/ and /ʔ/ ...... 131
Figure 9-1: Comparison of Speakers 2 and 3 in [tʕifəl] and [tifəl] ...... 144


CHAPTER 1: INTRODUCTION

1 Background

Scientists have for many decades invested time and effort in understanding how humans perceive and organize their environment. A specific and not yet fully understood area of inquiry relates to the perception of language. Language perception has long been considered a process involving the decoding of the acoustic signal. Undoubtedly, for people with normal hearing, audition is the primary sensory medium through which a linguistic message is perceived; furthermore, human exchanges are becoming increasingly auditory-only due to the proliferation of technological devices such as cell phones and tablets. Yet face-to-face communication is still preferred, and it engages more than one sensory pathway.

Several neurological studies confirm the important role that vision plays in speech processing and the activation of the auditory cortex by seen speech. Face-to-face communication has now been established as a bimodal sensory process in which vision and audition integrate to produce one percept. The perception of visual speech (henceforth, VS), also known as lip-reading, refers to the ability humans have to perceive speech sounds visually by attending to the optical features or the oral configurations characteristic of different sounds; the most salient visual information is provided primarily by lip movements, mouth opening, and tongue position relative to the teeth. McLeod and Summerfield (1990) demonstrated that listeners’ reliance on the visual input increases in an environment where the acoustic signal is degraded; lip-reading enables a perceiver to tolerate an additional 4-6 dB of noise before message comprehension is impacted. Sanders and Goodrich (1971) tested their subjects’ auditory word recognition while listening through a 400-Hz low-pass filter. Under the auditory-only condition, subjects reached 24% correctness; however, the participants’ accuracy increased significantly, to 78%, when the speaker’s face was visible, confirming that speech perception in audiovisual (AV) conditions, where the listener can hear and see the speaker, is always better than when speech is transmitted via a unimodal audio-only or visual-only signal. According to Ross et al. (2007), watching the mouth’s articulations augments and enhances our auditory capabilities. In fact, Robert-Ribes and colleagues went as far as suggesting that in some cases visual speech not only improves the intelligibility and identification of the acoustic signal, but certain speech sounds are even better perceived visually than auditorily. A study on the perception of French vowels revealed that the audio channel was better at capturing the height feature of certain vowels, while the visual channel was more reliable for the distinction of roundness (Robert-Ribes et al., 1998).

The role that the visual modality plays in speech perception is a widely noted phenomenon. However, while the spectro-temporal physical acoustic characteristics of sound waves are fairly well understood, our understanding of VS, its contribution to perception, and how it integrates with audition is still lagging. Moreover, the acoustic signal tends to convey enough information to help perceivers effortlessly discriminate between sounds or phonemes under reasonable speech-to-noise ratio conditions; languages, on the other hand, count fewer distinct visual units of speech, known as visemes, because the articulatory gestures associated with many sounds, such as velars and pharyngeals, are hidden from the perceiver, and many phonemes, such as the bilabials /b/, /p/, and /m/, are not visually contrastive enough and are consequently grouped as one viseme. In French, for example, 36 phonemes are mapped onto 17 classes of visemes (Dubois et al., 2012).

In spite of the higher reliance perceivers seem to have on the acoustic signal, understanding the contribution of the visual modality can be beneficial in more than one way to more than one discipline. To mention a few: in the artificial intelligence field, visual features can be used to improve visual speech recognition of talking heads; in speech therapy, VS could serve as a tool to train patients with speech impediments; from a pedagogical point of view, VS can be used to develop speechreading training programs for the hearing impaired, and it can also help second language learners better perceive sounds that are outside their native repertoire. It is in fact this latter pedagogical reason that is the long-term motivation behind this study. Even though the reliance of second language learners on visual speech cues is beyond the scope of this investigation, the contrastive sounds that are the focus of this research were chosen as a set-up for a follow-up exploration targeting second language learners.

Thus far, however, visual speech studies have focused mainly on the most visually salient sounds, such as bilabials and dentals, where the lips play a central role. Lips are considered the most informative facial feature (e.g., Ouni et al., 2005; Thomas and Jordan, 2004). In a noisy environment, seeing the lips may be enough for the recognition of the signal, as it can often be inferred from the position and movement of the lips, teeth, and tongue (Summerfield, 1987). According to Campbell (2001), even though speech information is found in all the visible areas of the face, many sounds involve different lip gestures: rounding, unrounding, spreading, compression, and protrusion. Rounded protruded lips cue an [u] sound, or [y] in French; wide spread lips with visible teeth usually accompany the [i] articulation (Campbell, 2001). There are additional elements within the mouth that may be informative, such as the teeth and tongue, and around the mouth, such as wrinkles above the lips, but in this study I will be mostly focusing on the motion of the lips.

Several investigations suggest that the face is rich in speech information beyond the movements of the mouth, tongue, and teeth. Benoit, for example, uncovered speech information in the movements of the jaw (Benoit et al., 1996). Using digital video manipulation, Preminger et al. (1998) studied the effect of visual masking of the tongue, teeth, lips, and upper and lower parts of the face on consonant-viseme identification; their results suggested that consonant visemes were perceived with 70% or better accuracy even when the entire mouth was masked. They also found that for some consonants, speech gestures were also encoded in the chin and the upper and sides of the cheeks. However, perception was poor when the upper or lower parts of the face and the mouth were not visible (Preminger et al., 1998). Campbell’s (2001) independent component analysis also revealed speech cues in the jaw, chin, and even the neck.

Arabic, a language with many post-velar sounds, referred to in this study as guttural sounds, is a perfect context in which to study the potential contribution of different facial features to speech perception, including the cheeks, chin, and throat. Arabic is a diglossic Semitic language characterized by two main registers that fall at the two ends of a continuum (Harry, 1996). The more formal register, known as Fusha or Classical Arabic, is a variety shared by all literate Arabs; it is the written language and the language of the Quran. A little less formal is Modern Standard Arabic, which is the language of the media. At the other end of the spectrum are all the different spoken dialects, which vary from one country to another; these are the language Arabs use in their day-to-day interactions. In this study, the focus is on the Lebanese spoken dialect; all four speakers recorded, and all perceivers, speak the Lebanese variety of Arabic.

2 Arabic Emphatics and Gutturals

Most research on AV speech perception has understandably focused on the most visually salient sounds, such as the labials and dentals. Seeing consonants produced with an obstruction deep in the throat, in the pharyngeal and glottal areas, is very difficult if not impossible. The further back in the vocal tract the consonantal obstruction occurs, the less visually noticeable, if at all, the phoneme becomes. While the English phonemic inventory has a paucity of uvular, pharyngeal, and glottal sounds, the Arabic phonemic repertoire is rich in these quasi-invisible sounds. Among the Arabic phonemes produced in the back of the vocal tract are two pharyngeal fricatives (/ħ/, /ʕ/), two uvular fricatives (/χ/, /ʁ/), a glottal stop (/ʔ/), and a glottal fricative (/h/). Additionally, there are four emphatic sounds that are produced with a secondary articulation involving the retraction of the root of the tongue towards the lower pharynx and the lifting of the larynx, thus resulting in a bigger resonating vocal cavity in which vowels are pharyngealized and sound deeper and backer. The primary places of articulation of Arabic emphatics, where the major constriction occurs, are dental for /tʕ/ and /dʕ/, interdental for /ðʕ/, and alveolar for /sʕ/. The following section provides more details on the characteristics of these Arabic phonemes.

As illustrated in Figure 1-1, the vocal tract is comprised of three main cavities: the nasal cavity is situated above the palate; just below it, the oral cavity houses the tongue blade and body and is bounded by the hard and soft palates and by the pharyngeal cavity at the back. The pharyngeal cavity can itself be divided into three parts: the nasopharynx, above the velum; the oral part, or oropharynx, extending from the velum to the tip of the larynx; and the laryngeal part, also known as the laryngopharynx, posterior to the larynx and bounded by the pharyngeal wall.

Fig. 1-1: X-Ray Image of Three Pharyngeal Cavities

From Al-Tamimi et al., 2009

These three cavities are not separate entities, but are all connected and speech sounds are produced with constrictions happening in any one of them.

This study focuses on Arabic sounds that are produced with constrictions in the pharyngeal cavity, namely, Arabic gutturals (henceforth, AG) which are formed with narrowing in various parts of the pharynx, and Arabic emphatics (henceforth, AE) which are characterized by a front primary and pharyngeal secondary articulation. The following section describes the Arabic sounds under investigation.


2.1 Arabic Emphatics

Arabic utilizes four coronal phonemes often referred to as emphatics. Based on Bin-Muqbil's (2006) characterization of AEs, they are "a set of complex phonemes that are produced with a primary coronal articulation and a secondary articulation involving the retraction of the tongue body into the oropharynx." While most Arab linguists agree that the primary articulation involves a coronal articulation, there is no consensus on where exactly the secondary articulation happens. For example, Zawaydeh (1988), who studied Jordanian Arabic, claims that AEs are uvularized. Al-Tamimi and colleagues (2009) conducted a videofluoroscopic study of the emphatic consonants in Jordanian Arabic and did not observe any uvular involvement in the production of AEs. Instead, their investigation supported the view that AEs are pharyngealized.

As illustrated in Figure 1-2, the investigators measured (1) the length of the pharynx of 4 speakers, represented by the distance between the skull base and the larynx opposite the cervical vertebrae (B to B+ on the image below); (2) the raising of the larynx (A to A+), located at the level of the 5th cervical vertebra for males and the 4th cervical vertebra for females; (3) the oropharyngeal width (D to D+), between the anterior edge of the vertebral body and the root of the tongue; and (4) the hyoid bone elevation at the level of the 3rd vertebra (C to C+). The same measurements were taken for the words in minimal pairs, one with an emphatic and the other with its non-emphatic counterpart. Figure 1-2 shows the simultaneous action of the articulators in the pharynx that work together in the production of the emphatic /sʕ/ followed by the vowel /æ/, namely, the retraction of the tongue root into the oropharynx, which causes the elevation of the hyoid bone (the protrusion on the line between C and C+) and the raising of the larynx. As a result of this secondary articulation that narrows the pharynx, the configuration of the pharyngeal cavity is altered, affecting the formant frequencies of coarticulated vowels; /æ/ when adjacent to the emphatic is pulled back and realized as a back /ɑ/ with lip rounding.

For the purpose of this investigation, suffice it to say that AEs and their non-emphatic counterparts (NEs) share the same primary articulation, characterized by a major narrowing produced by the tongue coming very close to, or against, the alveolar ridge; they differ, however, in that the AEs' primary articulation is combined with another, less prominent constriction caused by pulling the root of the tongue towards the oropharynx. The tongue retraction is what sets AEs apart from their NE counterparts.

Fig. 1-2: Videofluoroscopy of /sæb/ and /sʕɑb/ Produced by a Male Speaker

From Al-Tamimi et al., 2009

There are four AEs characteristic of the Arabic language: two fricatives, [ðʕ, sʕ], and two stops, [tʕ, dʕ]; but the less frequent fricative [ðʕ] is often realized in Lebanese Arabic as either [zʕ] or [dʕ] and will not be considered in this study. As previously explained, AEs are mostly distinguishable through the effect they have on the adjacent vowels. In fact, the influence of AEs often has scope over the vocalic environment within a word and across word boundaries. Israel and colleagues (2012) used real-time MRI to study emphasis and observed both progressive and regressive emphasis spread. For this reason, and because non-invasive procedures were used in this investigation, I will be looking at the visual features associated with vowels coarticulated with AEs as reflected in movements of the lips, chin, and cheeks.

Pharyngealization, formed with a constriction in the back of the mouth, results in a shortened pharynx, which in turn raises the first formant, F1. All vowels adjacent to emphatics are associated with a high F1 and a low F2. These formant variations are caused by a change in the positioning of the speech articulators, which is usually visible in the displacement of facial articulators. Arabic has only three vowels, and vowel length is phonemic: /sʕɑb/ means pour and /sʕɑ:b/, afflict. The three vowels that constitute the Arabic vocalic repertoire are [æ], produced with jaw opening and realized as /ɑ/ near AEs (/ɑ/ is produced with open but rounded lips); [u], realized as [ʉ] near an AE and also produced with tighter lip protrusion; and [i], produced with lip spreading and realized as /ɨ/ when pharyngealized; the effect of pharyngealization on /ɨ/ is not as pronounced as it is with /ɑ/.

2.2 Arabic Gutturals

Arabic also counts in its phonemic inventory the voiced and voiceless pharyngeals [ʕ] and [ħ] respectively, the two laryngeals [h] and [ʔ], and three uvulars, [q, ʁ, χ]. Based on the UPSID data (cf. Maddieson, 1984), out of 317 world languages, few include sounds with a pharyngeal primary articulation: [ʕ] is found in only 7 other languages besides Arabic, the voiceless pharyngeal [ħ] in 13 languages, and 54 languages, including Arabic, contain the uvulars [q, ʁ, χ] (Elgendy, 2001).


Not all Arabic linguists agree on the exact place of articulation of the Arabic uvulars and pharyngeals, which will be referred to in this study as Arabic gutturals (henceforth, AG). The main difference between AGs and AEs is that in the production of AGs the pharyngeal constriction is the primary and main articulation and is not combined with any other simultaneous obstruction at a different site in the vocal tract. As Elgendy (2001) puts it, in the speech planning for the production of AGs, the pharyngeal muscles are triggered by direct neural commands, resulting in the active involvement of the pharynx and the contraction of the pharyngeal muscles in the articulatory process.

Based on their acoustic similarity, the perception of [ʕ] has been contrasted with that of [ʔ], [ħ] with [h], and [ʁ] with [χ]. The articulation of the glottal stop [ʔ] varies from a full occlusion at the level of the glottis to a more approximant-like production, similar to [ʕ], which has been described by Heselwood (2007) as varying between modal voice, pressed voice, and creaky voice. As shown by Ghazeli's cinefluorographic study, [ʕ] is formed by the approximation of the epiglottis and the rear wall of the pharynx. The voiceless pharyngeal [ħ] is produced with a similar but more pronounced narrowing between the epiglottis and the fourth cervical vertebra (CV4). The root of the tongue is retracted during the production of the laryngeal [h], causing a constriction at the level of the second cervical vertebra (CV2). The uvular gutturals [ʁ] and [χ] are articulated with the root of the tongue retracting towards the posterior wall of the oropharynx (Delattre, 1971; Ghazeli, 1977). Several researchers have investigated the articulation of AGs using cineradiography or fiberscopy, because these sounds are not visually salient. While there is some disagreement as to the exact location of the narrowing, there is a consensus that all gutturals are characterized by a primary constriction in the posterior region of the vocal tract produced by the retraction of the root of the tongue towards the oropharynx or the laryngopharynx (Bin-Muqbil, 2006). Table 1-1 lists all the Arabic sounds investigated in this study, their general place of articulation, and the sound contrasts in each minimal pair. The AG contrasts were selected based on the difficulty that non-native speakers of Arabic have in discriminating between a word and its pair.

Table 1-1: Arabic Emphatic (AEs) and Guttural (AGs) Contrasts

Sound | Contrasting with

Arabic Emphatics (AEs) | Non-Emphatics (NEs)
Alveolar Voiceless Pharyngealized Fricative [sʕ] | Alveolar Voiceless Fricative [s]
Alveolar Pharyngealized Voiced Stop [dʕ] | Alveolar Voiced Stop [d]
Alveolar Pharyngealized Voiceless Stop [tʕ] | Alveolar Voiceless Stop [t]

Arabic Gutturals (AGs)
Pharyngeal [ʕ] | Glottal Stop [ʔ]
Voiceless Pharyngeal /ħ/ | Laryngeal [h]
Voiced Uvular [ʁ] | Voiceless Uvular [χ]

3 Audio-Visual Speech Perception in Arabic

The very few studies investigating audiovisual speech perception in Arabic suggest that Arabs are sensitive to VS (Azra et al., 2005; Ouni et al., 2005; Ouni, 2007). The McGurk illusion has been used extensively as a behavioral tool to study AV integration of audition and video in speech; not only does it provide evidence for a multimodal representation of speech, but it also offers support for the integration claim of the two sensory modalities involved in AV perception. The effect is observed when the auditory signal and the visual signal, both originating from the same source, are incongruent. The audio-video mismatch results in a percept that is a fusion of the signals transmitted via each modality. For example, a bilabial /ba/ uttered over a face mouthing the velar /ga/ is often perceived as /da/. Azra et al. (2005) conducted an investigation of the McGurk effect in Arabic and its sensitivity to phonological context. They presented participants with audio /b/ and visual /q/. In most cases, the resulting percept was /d/, suggesting that native Arab speakers do attend to VS. Ouni et al. (2007) studied native Arabic speakers' perception of visual speech in Arabic. The stimuli consisted of a set of words with sounds that belonged to different visemes, including labial, dental, alveolar, pharyngeal, and pharyngealized sounds; the pharyngeals were tested in the context of the three pharyngealized vowels /ɑ/, /ɨ/, and /u/. Across the three vowel contexts, 45% of CVs were recognized correctly. The best recognition overall was in the /ɑ/ vowel context, where the tongue was partially visible: /tʕɑ/ was perceived correctly 52% of the time, /ðʕɑ/ 95%, and /sʕɑ/ 47%; /hɑ/ was recognized accurately 89% of the time, /ħɑ/ 61%, and /χɑ/ 47%. Perception of the aforementioned sounds in the context of the vowel /ɨ/ was almost as good as in the /ɑ/ environment and slightly lower when adjacent to /u/. Recognition of the sounds produced behind the velum was reasonably high in lipreading, suggesting that these visemes, coarticulated with one of the three vowels, carry some informative visual cues used by native speakers of Arabic.

4 Goal of this Study

In this study, I examine the visual cues associated with two sets of Arabic speech sounds: the AGs, which are recognized in the literature as including the true pharyngeals (the guttural [q] will be excluded from this study), and the AEs, in the context of the three vowels. These sounds are auditorily distinct and belong to two distinct groups of visemes. Viseme is a term coined by Fisher (1968) that refers to a group of speech sounds that are visually non-distinct. AGs are not optically salient due to their place of articulation deep in the vocal tract, which makes them practically invisible; however, since the formant frequencies of vowels neighboring AGs and AEs are altered, implying a difference in vowel quality, I argue that their effect on coarticulated vowels may provide important visual cues that enhance Arabic native speakers' perception of AGs, and particularly of AEs, which are expected to be better perceived than AGs due to their primary front place of articulation. The importance of the lips in visual perception as the most visually salient speech articulator is certainly not to be underestimated; however, this investigation also aims at exploring other regions of the face that might potentially transmit to the perceiver important speech information related to the identification of AGs and AEs. The following hypotheses will be tested:

1. Arabic gutturals and emphatics are visually distinguishable on the basis of the vowel with which they coarticulate.

2. There is visual speech information associated with AEs and AGs in the lips, chin, and cheeks.

3. In their perception of AEs and AGs, native speakers of Lebanese Arabic do attend to visual cues found in the lips, chin, and cheeks.

To this end, the two experiments described in Chapters 2 and 3 were designed to help examine the hypotheses posited above. Chapter 2 is dedicated to the production experiment, measurements, and results. After the methodology for gathering perception data is described in Chapter 3, Chapter 4 presents the perception results as they relate to production, based on phonetic features and measurement differences in the lips, chin, and cheeks. Chapter 5 is dedicated to machine learning and its relation to human perception, comparing the measurements used by machine learning and by perceivers to classify utterances as AE or NE. Chapter 6 discusses the visual perception of emphasis. The following two chapters present the AG perception results and the relation between perception and production for the gutturals. The dissertation concludes in Chapter 9 with a summary of procedures and an interpretation of results. Conclusions are drawn based on a comparison between AGs and AEs.


CHAPTER 2: PRODUCTION of AEs and AGs

Production data was elicited from 4 native Lebanese speakers and their speech was analyzed to help investigate what visible features were associated with the COIs and explore which of these might best inform perception. One of the hardest parts of collecting data for this research was finding, in the Denver area, enough speaker participants who were native Lebanese. Consequently, I included myself as one of the subjects and I am referred to as Speaker 4. The use of the researcher as one of the speakers lends asymmetry to the analysis. As I was videotaping myself, I was very well aware of the possible bias I would introduce to the study and made every effort to speak naturally, careful not to hyper-articulate. However, as will be discussed later, speakers adjust their articulation to accommodate the needs of the perceiver. In my current environment I mostly use Arabic for pedagogical reasons in my second language classroom where I adjust my articulation and rate of speech for the sake of my students.

The quality of student-directed speech may very well have carried over to the recordings in this investigation, which would explain why speaker 4 was the best perceived, at least in the identification of emphasis. To remedy this possible flaw, many of the analyses were carried out with and without speaker 4, and while some of the significant effects were attenuated without speaker 4, they generally remained significant nonetheless.

1 Preparation of Stimuli

Benguerel and Pichora-Fuller (1982) looked at V1CV2 utterances to test the visual intelligibility of the English [u], [i], and [æ] in the context of nine different consonants. Their results showed that consonant discrimination was better when the consonants were coarticulated with [æ] and [i] than with [u] (Benguerel and Pichora-Fuller, 1982; see also Jiang et al., 2001). Benoit et al. (1992), in a similar experiment, studied the contextual effect of vowels on consonant intelligibility in French; they confirmed that the AV discriminability of French consonants depends on the neighboring vowels. Adding to the evidence of a vowel effect on the perception of consonants, Ouni (2007), in the experiment mentioned above, similarly confirmed a vowel influence on the perception of AGs. Thus, all AEs, NEs, and AGs in this study occur systematically in the context of the three cardinal short and long Arabic vowels [æ], [i], and [u].

The corpus is composed of 144 words that form 72 minimal pairs, each containing two contrasting consonants of interest (COIs) and made up of two real Arabic words or two pseudo-words that abide by the rules of Arabic phonotactics and were constructed to 'sound' Arabic. The difficulty of finding enough Arabic tokens to form minimal pairs containing the targeted contrasts forced the use of nonce utterances. As previously mentioned, the ultimate goal of this study was to uncover potential visual cues associated with AEs and AGs and to test their benefit in training second language learners of Arabic to attend to the informative visual cues. The selection of AG and AE contrasts was therefore motivated by the difficulty that most American students face in discriminating between pairs of AGs and between AEs and NEs. Thus, the pharyngeal fricative [ʕ] was contrasted with the glottal stop [ʔ], [ħ] with [h], and [ʁ] with [χ], and the emphatics with their non-emphatic counterparts. For example, in the minimal pair [tæ:b/tʕɑ:b] (repent/heal), the word with the word-initial AE /tʕ/ occurs in the context of [æ] and was contrasted with its NE cognate, which starts with [t] in the same vocalic environment. Each AE and AG appeared with all six vowels (short and long [æ, i, u]), word-initially and word-finally. All 144 words were produced by 4 native speakers of Lebanese Arabic.

Table 2-1 summarizes the distribution of the words in the corpus.

Table 2-1: Words Distribution in the Corpus

/tʕ/ initial | 6 /#tʕ/ contrasted with 6 /#t/ | /tʕɑ-tæ/, /tʕɑ:-tæ:/, /tʕɨ-ti/, /tʕɨ:-ti:/, /tʕu-tu/, /tʕu:-tu:/
/sʕ/ initial | 6 /#sʕ/ contrasted with 6 /#s/ | /sʕɑ-sæ/, /sʕɑ:-sæ:/, /sʕɨ-si/, /sʕɨ:-si:/, /sʕu-su/, /sʕu:-su:/
/dʕ/ initial | 6 /#dʕ/ contrasted with 6 /#d/ | /dʕɑ-dæ/, /dʕɑ:-dæ:/, /dʕɨ-di/, /dʕɨ:-di:/, /dʕu-du/, /dʕu:-du:/
/tʕ/ final | 6 /tʕ#/ contrasted with 6 /t#/ | /ɑtʕ-æt/, /ɑ:tʕ-æ:t/, /ɨtʕ-it/, /ɨ:tʕ-i:t/, /utʕ-ut/, /u:tʕ-u:t/
/sʕ/ final | 6 /sʕ#/ contrasted with 6 /s#/ | /ɑsʕ-æs/, /ɑ:sʕ-æ:s/, /ɨsʕ-is/, /ɨ:sʕ-i:s/, /usʕ-us/, /u:sʕ-u:s/
/dʕ/ final | 6 /dʕ#/ contrasted with 6 /d#/ | /ɑdʕ-æd/, /ɑ:dʕ-æ:d/, /ɨdʕ-id/, /ɨ:dʕ-i:d/, /udʕ-ud/, /u:dʕ-u:d/
/χ/ and /ʁ/ initial | 6 /#χ/ contrasted with 6 /#ʁ/ | /χæ-ʁæ/, /χæ:-ʁæ:/, /χi-ʁi/, /χi:-ʁi:/, /χu-ʁu/, /χu:-ʁu:/
/χ/ and /ʁ/ final | 6 /χ#/ contrasted with 6 /ʁ#/ | /æχ-æʁ/, /æ:χ-æ:ʁ/, /iχ-iʁ/, /i:χ-i:ʁ/, /uχ-uʁ/, /u:χ-u:ʁ/
/ʕ/ and /ʔ/ initial | 6 /#ʕ/ contrasted with 6 /#ʔ/ | /ʕæ-ʔæ/, /ʕæ:-ʔæ:/, /ʕi-ʔi/, /ʕi:-ʔi:/, /ʕu-ʔu/, /ʕu:-ʔu:/
/ʕ/ and /ʔ/ final | 6 /ʕ#/ contrasted with 6 /ʔ#/ | /æʕ-æʔ/, /æ:ʕ-æ:ʔ/, /iʕ-iʔ/, /i:ʕ-i:ʔ/, /uʕ-uʔ/, /u:ʕ-u:ʔ/
/ħ/ and /h/ initial | 6 /#ħ/ contrasted with 6 /#h/ | /ħæ-hæ/, /ħæ:-hæ:/, /ħi-hi/, /ħi:-hi:/, /ħu-hu/, /ħu:-hu:/
/ħ/ and /h/ final | 6 /ħ#/ contrasted with 6 /h#/ | /æħ-æh/, /æ:ħ-æ:h/, /iħ-ih/, /i:ħ-i:h/, /uħ-uh/, /u:ħ-u:h/

#C = word-initial position; C# = word-final position
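As a quick check of the corpus arithmetic described above (6 COI contrasts, 6 vowel contexts, 2 word positions), the short Python sketch below enumerates the pair inventory; it is only an illustration of the design, not a script used in the study.

```python
from itertools import product

# Six COI contrasts (three emphatic/non-emphatic pairs, three guttural pairs),
# six vowel contexts, and two word positions, as laid out in Table 2-1.
contrasts = [("tʕ", "t"), ("sʕ", "s"), ("dʕ", "d"),
             ("χ", "ʁ"), ("ʕ", "ʔ"), ("ħ", "h")]
vowels = ["æ", "æ:", "i", "i:", "u", "u:"]
positions = ["initial", "final"]

pairs = list(product(contrasts, vowels, positions))
print(len(pairs), "minimal pairs,", 2 * len(pairs), "words")  # 72 minimal pairs, 144 words
```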


Section 2 is a report of the experiment which involved the collection and analysis of production data.

2 Experiment 1: Production Data Collection

Four native speakers of the Lebanese dialect, two males and two females, were seated in front of a video camera that captured a frontal view of the face and a computer screen where words were displayed one at a time. The backdrop was matte black to absorb reflected light. The subjects had no facial hair, piercings, or tattoos that could be distracting to viewers. They were videotaped as they enunciated the same 144 words, presented in a random order. After a 1, 2, 3 countdown displayed on the screen, a word was shown to the participant, who looked straight at the video camera placed in front of the monitor. Speakers were instructed to start and end each utterance with a closed-mouth position. The videos were saved individually as separate files and used to extract the measurements associated with potential visual features judged by native speakers to be important for the recognition of AEs and AGs.

As pointed out by Brooke and Summerfield (1983), one of the main difficulties in analyzing facial articulatory movements is the dissociation of the speech gestures from head and body movements. Certainly, clamps and supports can keep the speakers' face from moving, but this technique usually results in very unnatural speech. Instead, the approach used in this study does not restrain participants in any way; subjects were simply asked to sit as still as possible. Nonetheless, there was a need to compensate for non-speech movements by normalizing the head position around the position of the markers placed at the corners of the eyes.
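The dissertation does not spell out the normalization formula, so the following is only a minimal sketch, assuming the head position is anchored on the eye-corner markers as stated above: each frame is translated so the midpoint between the eye corners sits at the origin and scaled by the inter-ocular distance. The marker indices are hypothetical.

```python
import numpy as np

LEFT_EYE_CORNER, RIGHT_EYE_CORNER = 37, 46  # hypothetical 1-based marker indices

def normalize_head(frame_landmarks):
    """Remove head translation and scale for one frame of (66, 2) landmark coordinates."""
    left = frame_landmarks[LEFT_EYE_CORNER - 1]
    right = frame_landmarks[RIGHT_EYE_CORNER - 1]
    center = (left + right) / 2.0            # midpoint between the eye corners
    scale = np.linalg.norm(right - left)     # inter-ocular distance
    return (frame_landmarks - center) / scale
```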


The videos of each subject were recorded at 30 frames per second. To fit each face with a mask, or model, generated from selected facial points, an optimization method based on the Active Appearance Model (AAM) algorithm was applied to the set of 66 facial points shown on the prototype in Figure 2-2. Quoting Cootes et al. (2001), “our models of appearance can match to any of a class of deformable objects (e.g., any face with any expression, rather than one person’s face with a particular expression)”. Each mask was generated in MATLAB, a technical computing language, from forty different images for each speaker. Figure 2-1 shows the fitted face of one of the four speakers pronouncing an AE coarticulated with [u] and a NE coarticulated with [i], to demonstrate how the mask adjusts to the movement of the articulators in order to match the configuration the facial features take at each point in time:

Fig. 2-1: Fitted Models

AE + [u] NE +[i]


The fitting on the face in Figure 2-2 was designed as part of a project seeking to find as many markers as possible related to human expressions (Mavadati et al., 2013); for the purpose of this study, only the markers situated within the red shape (points 2, 17, 9) were used.

Fig. 2-2: 66 Landmark Points

3 Measurements

3.1 Procedure

Sound bites associated with each word were loaded into Praat (Boersma & Weenink, 2013), a computer program designed for the analysis of speech in phonetics, and text grids were created specifying the time intervals to be measured: from the onset of the selected consonant to the stable state of the vowel for word-initial COIs, and from the stable state of the vowel to the cessation of the movement for word-final COIs. These intervals were found to be the most comparable units of measurement across speakers, and the stable state of a vowel is considered to have the most articulatory stability with the least coarticulatory interference. Figure 2-3 shows spectrograms of my pronunciation of the words /tʕɑ:b/ and /bæ:t/, illustrating the parsing of the selected intervals at word onset and coda.

Fig. 2-3: Spectrograms of Selected Intervals

Word-Initial /tʕɑ:b/ Word-Final /bæ:t/

Based on the fittings associated with each word and speaker, as illustrated in Figure 2-2, the measurements at the selected interval were submitted to the AAM algorithm, which was run on all the videos to match each interval with its corresponding frames.

After identifying the frames that were closest to the selected time interval, the articulators' motions were measured from the facial markers using Euclidean distance (ordinary straight-line distance) and fitted to match each speaker's 3-D face. Ultimately, 13 measures were extracted: 7 lip parameters, 2 cheek measurements, and 4 chin measurements. The lip vertical inner square dimension was later added to account for the non-linear relationship that the vertical inner area between the lips has with the perception of AEs and NEs. These measurements are shown in Table 2-2.

Table 2-2: Measurements and their Descriptions

Measurement | Description
Cheek High | Measurement from marker 3 to 15
Cheek Low | Measurement from marker 4 to 14
Chin Width High | Measurement from marker 6 to 12
Chin Width Mid | Measurement from marker 7 to 11
Chin Width Low | Measurement from marker 8 to 10
Chin Vert | Measurement from marker 9 to 34
Lips Vertical Outer | Measurement from marker 58 to 52
Lips Vertical Inner | Measurement from marker 65 to 62
Lip Vertical Side (left and right sides combined and averaged) | Measurement from marker 61 to 66 / 63 to 64
Lip Horizontal | Measurement from marker 49 to 55
Lip Thickness Upper | Measurement from marker 62 to 52
Lip Thickness Lower | Measurement from marker 58 to 65
Open Area | Measurement from marker 49 to 55 * 65 to 62
Lips Vert Inner Sq | Similar to Open Area; a term introduced for the non-linear effect of Lips Vertical Inner
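To make the extraction step concrete, here is a minimal sketch (not the study's MATLAB/AAM code) of how the frame nearest a selected time point can be located at 30 frames per second and how a measurement such as Lip Horizontal (markers 49 to 55) reduces to a Euclidean distance between fitted landmarks. The array shape, file name, and example time are assumptions made for illustration.

```python
import numpy as np

FPS = 30  # the videos were recorded at 30 frames per second

# Hypothetical fitted landmarks for one video: (n_frames, 66, 2) array of (x, y) coordinates.
landmarks = np.load("speaker1_taab_landmarks.npy")

def nearest_frame(time_s, fps=FPS):
    """Index of the video frame closest to a Praat time point, in seconds."""
    return int(round(time_s * fps))

def distance(frame, a, b):
    """Euclidean (straight-line) distance between 1-based markers a and b."""
    return float(np.linalg.norm(frame[a - 1] - frame[b - 1]))

# A subset of the marker pairs listed in Table 2-2.
MEASURES = {
    "lip_horizontal": (49, 55),
    "lips_vertical_inner": (65, 62),
    "chin_width_low": (8, 10),
    "cheek_low": (4, 14),
}

vowel_midpoint = 0.415  # stable state of the vowel, taken from the TextGrid (example value)
frame = landmarks[nearest_frame(vowel_midpoint)]

values = {name: distance(frame, a, b) for name, (a, b) in MEASURES.items()}
values["open_area"] = values["lip_horizontal"] * values["lips_vertical_inner"]
print(values)
```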

The different measurements of the lips, cheeks, and chin movements were used to quantify and compare the differences in the movement of each of the facial features under investigation, during the production of AEs versus NEs or an AG versus its pair.

As a first attempt, the effect each facial feature gesture had on perception of the COIs was explored by looking at the trajectory of the movement from the time the motion started to the point it reached its maximum or minimum. This approach was riddled with problems, mainly related to syllable boundaries. Because each video had a different number of frames, it was not possible to compare the measures frame by frame. Additionally, the selected frames that correspond to the interval of interest were a close approximation, because the boundaries of each interval did not match exactly the boundaries of the frames within which it was embedded; as a result, the sections of the frames outside the end points of the interval contained some additional non-speech noise that interfered with the data analyses and resulted in false maxima and minima.

As an added complication, the speakers' rates of speech were not controlled for, which means that the same word spoken by two different speakers was likely to vary in duration. Adding to the effect of coarticulation, which makes clean delineation between word segments difficult, the variation within and between speakers complicated matters even more. Measuring maximum or minimum displacements, as well as the slope defined by the movement from the starting point to the highest or lowest point, turned out to be very unreliable. Figure 2-4 is a plot for the ending syllable of the pair /ʃɑmɑtʕ, ʃæmæt/ (steal, gloat), featuring the horizontal movement of the lips from corner to corner. Each color represents a different speaker. The dotted lines trace the movement of the syllable with the emphatic in the minimal pair, and the solid lines the movement of its NE cognate. The x-axis plots the frames in the videos during which these movements were captured, and the y-axis represents the distance the lips traveled from the moment the lips started forming the configuration needed for the vowel preceding the COI (emphatic and NE [t]) to the moment the lips closed after the release of the final stop. One of the issues clearly portrayed in Figure 2-4 was the variability in word length between the AE and NE words, and between speakers: for speakers 1 and 4, the syllable with the emphatic for this particular pair of words was longer than that with its NE counterpart. For speaker 3, the two contrasted syllables seemed to be of similar length, and for speaker 2, the syllable with the AE was longer than that with the NE. Due to the production discrepancy in syllable length, the same frames do not correspond to the exact section of the syllable under scrutiny for all four speakers. Additionally, the movements and trajectories of the articulators varied a lot from speaker to speaker: the horizontal opening of speaker 4's lip trajectory reached a maximum for the emphatic syllable that was much bigger than for the NE syllable, and the difference between the two syllables was larger than for the other three speakers. For speaker 1, both contrasting syllables followed a similar trajectory, moving in the same direction, but speaker 2's movements in the production of the two CVs travelled in opposite directions; however, unlike speaker 4, the NE had a larger horizontal lip opening than its emphatic counterpart. The overall pattern of the word pair /ʃɑmɑtʕ, ʃæmæt/ for speaker 3 looked similar to that of speaker 2, with the AE syllable yielding a wider opening than its NE cognate.

Fig. 2-4: Horizontal Lip Trajectory in the Production of /ʃɑmɑtʕ, ʃæmæt/


The outcome of this failed approach led to the decision to compare the production of AEs with that of their NE counterparts not based on an interval, but on a single point corresponding to the stable state of the vowel adjoining the COI, that is, toward the middle of the vowel, where the effect of coarticulation with previous or following segments is at its minimum. Nonetheless, the graph in Figure 2-4 serves two purposes: (1) it outlines the seemingly chaotic picture of speech production, and (2) it clearly reflects what perceivers have to adjust to in terms of speaker variability.

Measurements were used to predict the accuracy of perceivers' responses and also served as the input data for the machine learning classification task detailed in Chapter 4. To determine whether a perceiver can distinguish between a word and its pair based on the amplitude of the measurements listed in Table 2-2, the differences in measurement between each word and its pair were calculated for each of the facial features. For example, assuming that speaker 1's lip horizontal measure in the production of /tʕɑ:b/ was 2 mm and in the production of /tæ:b/ 3 mm, the measurement difference would be 1 mm. The question is whether the perceiver can tell which of the two words speaker 1 said based on the measurement difference between the AE and the NE.
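The sketch below illustrates the pairwise computation just described, producing an AE-minus-NE difference for each measurement of a minimal pair; the numeric values are invented for the example and are not data from the study.

```python
# Hypothetical per-word measurements (in mm) for one speaker and one minimal pair;
# keys follow Table 2-2, values are made up purely for illustration.
ae_word = {"lip_horizontal": 2.0, "lips_vertical_inner": 6.5, "chin_width_low": 31.2}
ne_word = {"lip_horizontal": 3.0, "lips_vertical_inner": 5.8, "chin_width_low": 33.0}

def pair_differences(ae, ne):
    """AE minus NE for every measurement shared by the two words of a pair."""
    return {m: ae[m] - ne[m] for m in ae.keys() & ne.keys()}

print(pair_differences(ae_word, ne_word))
# e.g. {'lip_horizontal': -1.0, ...}: a negative value means the NE movement was bigger.
```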

The arrow plots in the following section are a graphical illustration of the difference in measurements between AEs and NEs.

3.2 AEs and NEs Arrow Plots

Arrow plots were generated to provide a visual representation of the differences between AEs and NEs for each speaker across all consonant-vowel combinations. These plots make it possible to recognize at a glance where the significant measurements are for each speaker, and they also allow a quick cross-speaker comparison for each feature. A sampling of arrow plots is presented below, some applied to lip measurements and others to non-lip measurements. The length of an arrow represents the difference in the gesture associated with a certain facial feature during the production of AEs versus NEs: the longer the arrow, the bigger the difference, meaning the more distinguishable the AE and NE words are in a particular vowel context. Green arrows pointing down indicate that the NE movement is bigger than the AE movement for a particular measurement; blue arrows pointing up signify the opposite. Figure 2-5 is a plot that illustrates the difference in lip horizontal spreading between AEs and NEs in the context of the vowels [æ-æ:].
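As an illustration of this plotting convention (not the code that produced Figures 2-5 through 2-10), the matplotlib sketch below draws one arrow per word pair from zero to the AE-minus-NE difference, so green arrows point down when the NE movement is bigger and blue arrows point up when the AE movement is bigger; the pair labels and values are invented.

```python
import matplotlib.pyplot as plt

# Hypothetical AE-minus-NE lip horizontal differences (mm) for a few [æ]-context pairs.
pairs = ["tʕɑ:b-tæ:b", "sʕɑb-sæb", "dʕɑ:l-dæ:l"]
diffs = [-1.0, -0.6, 0.3]

fig, ax = plt.subplots()
for x, d in enumerate(diffs):
    color = "green" if d < 0 else "blue"   # green down = NE bigger, blue up = AE bigger
    ax.annotate("", xy=(x, d), xytext=(x, 0),
                arrowprops=dict(arrowstyle="->", color=color, lw=2))
ax.axhline(0, color="gray", lw=0.5)
ax.set_xticks(range(len(pairs)))
ax.set_xticklabels(pairs, rotation=45, ha="right")
ax.set_ylabel("AE - NE lip horizontal difference (mm)")
ax.set_ylim(min(diffs) - 0.5, max(diffs) + 0.5)
plt.tight_layout()
plt.savefig("arrow_plot_lip_horizontal_ae.png")
```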

• [æ-æ:] arrow plots

Based on my intuition about my own articulation (speaker 4) of AE + [æ] contrasted with NE + [æ], my lips spread out wider from corner to corner in the production of the NE than of the AE. The [æ] plots reflect the consistency of the short and long [æ] lip horizontal measurement across speakers. Just as expected, with only a few exceptions, most of the arrows are green, pointing down and indicating that the lips are spread wider in the production of NEs + [æ] than AEs + [æ]. The most consistent speaker, with only green arrows and the largest differences in lip horizontal measurements, is speaker 4. Perceivers watching speaker 4 are likely to correctly identify NEs + [æ] based on the horizontal spreading of the lips, because she articulates NEs + [æ] quite distinctively from AEs + [æ].


Fig. 2-5: Lip Horizontal Difference in [æ-æ:]-Context

The next arrow plot illustrates the lower part of the chin measurement in the context of [æ]. Based on my own impression as a speaker of Lebanese Arabic, my assumption is that AE + [æ] is associated with a narrower lower chin measurement than is the case with NE + [æ]. The plot in Figure 2-6 confirms my native speaker intuition with regard to my chin movement during speech production. The mixture of green and blue arrows suggests that the lower chin lacks the consistency of the lip horizontal measurement. These measures are unlikely to be as helpful to perceivers in the identification of emphasis.


Fig. 2-6: Chin Width Low Difference in [æ-æ:]-Context

Speakers 1 and 3 show more of a pattern for this particular measurement. With a few exceptions, speaker 1's chin widens more in the context of NE + [æ] than AE + [æ], while speaker 3's lower chin is larger during the production of AE + [æ] than NE + [æ]. Arrow sizes also provide indications about the difference, if any, between long and short vowels. For example, speaker 2's chin widens considerably when he utters the emphatic [d] preceding or following long [æ:]. The arrows representing [d] + short [æ] are much shorter, and the [dæ] arrow does not even point in the same direction as [dæ:], signifying a slightly larger chin for [dæ].

To summarize the information gleaned from the two [æ] arrow plots, the lip horizontal opening exhibits a similar pattern across speakers and is expected to be a more reliable measurement than the lower chin dimensions. The implication is that perceivers are more likely to look at the lips than the chin when trying to identify emphasis in the context of [æ], because the difference in lip measurements is large enough to be noticed: the lips take a clearly different configuration for AEs than for NEs, and this consistency across speakers is likely to reflect the perceivers' own pronunciations as native speakers of Arabic. The same seems to be true for the vowels [i, i:].

 [i;i:] arrow plots

I am fairly confident that in my own production, my lips are clearly more spread for NE + [i] than AE + [i]. My native-speaker intuition is reflected in the lip horizontal measurement plot that cues emphasis. However, I would have expected to see the same pattern for all four speakers. The plots in Figure 2-7 show that for speakers 2 and 4, the lip horizontal opening is indeed larger for NE + [i] than AE + [i], but this consistency is lacking in the production of speakers 1 and 3. The all-green, long arrows in speakers 2 and 4's plots signify a visible and consistent difference in the lip horizontal measurement, with bigger measurements associated with NEs. The mixed green and blue, shorter arrows in speakers 1 and 3's plots reflect a small difference between AEs and NEs for this measurement, which makes it hard for perceivers to discriminate between AE and NE in the [i]-context based on lip horizontal.


Fig. 2-7: Lip Horizontal Difference in [i;i:]-Context

A non-lip measurement associated with [i] is the width of the mid-chin, reported in Figure 2-8. The mixed blue and green arrows across speakers suggest that chin width may not be as reliable a measurement for perceivers as lip horizontal in the [i]-context. For speaker 4, for example, most of the arrows are blue, reflecting a bigger displacement of the chin for AEs + [i] than NEs + [i]; but the arrows are shorter than those in the lip horizontal plot in Figure 2-7, suggesting less of a distinction in chin width in the [i] environment. In the case of speaker 2, there seems to be a dividing line between long and short [i]: with the exception of [i:d], long [i:] + NE is associated with wider chin measurements than is [i] + AE. Speakers 1 and 3's plots do not exhibit a clear pattern in the production of emphasis with [i].

Fig. 2-8: Chin Width Mid Difference in [i;i:]-Context

 [u;u:] arrow plots

Figure 2-9 shows the arrows associated with the upper lip thickness for the COIs coarticulated with [u]. This feature seems to be informative for the perception of speaker 4 only. The arrows in her plot are mostly blue, pointing upward and signifying that the lips are thicker for AEs + [u] than NEs + [u]. Speaker 2's lip thickness measurements exhibit the opposite pattern: his upper lip looks thicker during the articulation of NEs + [u]. Speaker 3's plot shows a difference in lip thickness between long and short vowels: typically for him, his upper lip is thicker with long [u:] + AE and thinner with short [u] + NE. Speaker 1 shows no decipherable pattern in lip thickness that would help perceivers identify emphasis in conjunction with [u].

Fig. 2-9: Upper Lip Thickness Difference in [u-u:]-context

The non-lip [u] measurement shown in Figure 2-10 is the chin vertical, measured from the center of the chin to the tip of the nose.


Fig. 2-10: Chin Vertical Difference in [u;u:]-Context

The only consistency that the plots in Figure 2-10 show in relation to the chin vertical measurement is again connected to speaker 4. The arrows in her plot are long and mostly blue, reflecting that her chin drops more during the AE articulation. Speaker 1 shows a generally different pattern, with her chin dropping more for the production of NEs, but her arrows are generally shorter, with a few C[u] combinations where AEs and NEs are indistinguishable. The plots of speakers 2 and 3 lack consistency; while speaker 2 exhibits a few moderately large differences between AEs and NEs in the chin vertical measurement, he also shows within the same plot very small, probably indistinguishable differences, indicating that perceivers would in all likelihood struggle to discriminate between AEs + [u] and NEs + [u] based on the chin vertical measurement alone.

In sum, the arrow plots were used as an exploratory method to show the variation between and within speakers based on the differences in the facial gestures associated with the production of NEs and AEs. They help visually confirm some of the findings presented in the section on perception. In addition to the speaker effect, a vowel effect is also evident: in the few examples provided above, there were more long arrows pointing in the same direction in the [æ] plots, but less consistency and generally shorter arrows associated with [u]. The plots also demonstrated small vowel duration differences. This is clearly seen in Figure 2-8, where the arrows on the right of speaker 2's plot, associated with long vowels, are longer than the ones on the left representing short vowels. However, the connection between vowel length and larger differences between AEs and NEs is imperfect, which is reflected in the modest significance shown in some of the models below.

The plots have given us an idea of what to expect in the perception data. COIs in plots with inconsistent arrows, some blue and some green, are likely to be the COIs that perceivers identified inconsistently. The plots clearly reflect a vowel effect and a speaker effect. They also show that many features examined in this investigation besides the lips exhibit differences in the production of AEs and NEs. Do perceivers really attend to these differences? Chapter 3 details the perception data collection and presents the perception results.


CHAPTER 3: PERCEPTION OF EMPHASIS

1 Experiment 2: Perception Data

The videos of the speakers' faces described in experiment 1 were cropped to create six facial conditions. The goal was to test the perceivers' accuracy in emphasis and guttural identification based on how much of the face, and which parts of it, were visible to them. This may help determine what areas of the face Lebanese speakers attend to when trying to identify AEs and AGs based on the optical signal alone.

 Video Cropping

In order to isolate the facial features that potentially carry perceptually functional visual information, only certain parts of the face were visible to the perceivers. Using the video editor Adobe Premiere, the videos were cropped in such a way that participants dynamically saw movements of different parts of the face always including the mouth.

Following in Figure 3-1 are examples of the six facial conditions:

Fig. 3-1: Speaker 1: Six Facial Conditions

(Panels: AF, LwF, L, LCkCn, LCk, LCn)

The rationale behind choosing these conditions was to evaluate whether any of these facial regions transmitted sufficient information to lead to accurate identification of AEs and AGs. Better recognition of the COIs under the AF condition compared to the LwF condition would suggest that the eyes, eyebrows, and forehead contain speech information not present in the mouth. A comparison between LwF, where the neck is visible, and LCkCn would allow assessing whether the neck has any visible cues that benefited the perception of AEs and/or AGs. The LCk and LCn conditions may reveal potential cues encoded in the cheeks and chin. Figure 3-2 is a representation of the bone structure of the cheeks, chin, and lips, extending to the temporal bone. The bones are separated in Figure 3-2, but of course, in reality, all the different parts are joined together by ligaments and muscles and travel together.

Fig. 3-2: Bone Structure of the Face (labels: maxilla, mandible, temporal bone)

https://www.flickr.com/photos/internetarchivebookimages/14577717870/


Cheeks are defined as the facial area that extends from the fleshy area below the eyes on each side of the face to the outer edge of the zygomatic bone (cheek bone); the markers in Figure 2-2 were placed on the edge of the outer zygomatic bone closest to the temporal bone. The upper chin measurement extends from the mandible angle on one side of the face to the other. The zygomatic bone articulates with the maxilla, or upper jaw bone, and with the temporal bone at the base and sides of the skull. The lower teeth are implanted in the mandible, or lower jaw, which moves up and down as the mouth opens and closes, and rotates laterally to the left and to the right. The cheeks and chin move with the production of almost every sound and are therefore likely to encode speech information. Table 3-1 shows the visible parts in each face condition:

Table 3-1: Visible Parts in Each of the Six Facial Conditions

Face     Label               Visible parts
AF       All Face            the whole face, including the neck
LwF      Lower Face          from the bottom of the eyes to the bottom of the neck
L        Lips                lips only
LCkCn    Lips Cheeks Chin    lips, high cheeks, and chin
LCk      Lips Cheeks         lips and cheeks (no chin)
LCn      Lips Chin           lips and chin (no cheeks)


All the cropped videos were uploaded to Microsoft's cloud platform, Azure, which gave perceiver participants the ability to complete the study online. Fifty-three subjects whose mother tongue is Lebanese Arabic were recruited through social media. The participants were scattered across Europe, Canada, the United States, and Lebanon. Halfway through recruitment, to boost sluggish participation, subjects began receiving $20 compensation for their time or a $20 Amazon gift card (the compensation was approved by the IRB). After setting up a user name and password, each participant was able to log onto the website. Prior to starting the study, subjects were asked to provide their age and gender; the average age was 45, among 26 males and 27 females. After reading the instructions, perceivers completed a warm-up session in which they practiced on five videos. Every participant watched 432 silent videos of two of the four speakers, chosen at random; movies were about two seconds long each. Given speakers A, B, C, and D, there were six possible combinations: AB, AC, AD, BC, BD, and CD. In each movie, the speaker was seen uttering one of the two words of a minimal pair, but both words were displayed on the computer screen below the video; subjects, in a forced-choice identification task, were asked to click on the word they thought the speaker was saying. The Next button took them to the following video. Each movie could be seen only once.

The experiment was designed so that each member of each minimal pair appeared in three of the six facial conditions; thus all minimal pairs were presented in all six conditions. Each subject watched videos featuring two of the four speakers, but for any pair of words, the perceiver saw the same talker across all six conditions. For example, given the minimal pair [dærb-dʕɑrb] spoken by speaker A, a subject might see [dærb] under the AF, L, and LCk conditions; its pair [dʕɑrb] would then be displayed in the other three conditions, LwF, LCkCn, and LCn, spoken by the same speaker; the perceiver would therefore have seen the minimal pair [dærb-dʕɑrb] in all six facial conditions. This approach resulted in a lengthy study that took an average of one hour to complete, but participants had the option to take a break and pick up at a different time from where they left off. The answers were automatically recorded and linked to the perceiver.
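The counterbalancing logic can be sketched with a few lines of R; the function name and the word spellings are hypothetical placeholders for the actual stimulus files.

    # Sketch only: randomly split the six facial conditions into two sets of three,
    # one set for each member of a minimal pair spoken by a given speaker.
    conditions <- c("AF", "LwF", "L", "LCkCn", "LCk", "LCn")

    assign_pair <- function(pair, conditions) {
      half <- sample(conditions, 3)                          # conditions for word 1
      setNames(list(half, setdiff(conditions, half)), pair)  # remaining three for word 2
    }

    assign_pair(c("daerb", "DArb"), conditions)   # stand-ins for [dærb] and [dʕɑrb]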

2 Perception Results

Perception results for both emphatics and gutturals were collected together, but analyzed separately. This section focuses on the results associated with the AEs.

I collected a total of 10,939 responses to 72 minimal pairs, each containing an AE and its NE counterpart. As a refresher, in each silent video perceivers watched a speaker produce either an AE-word or an NE-word. Both items were presented to the participants, who were instructed to select, in a forced-choice task, the utterance they thought the speaker had pronounced. The overall percentage of correct answers was 69.6%, significantly higher at an alpha level of .05 than would result from random guessing, χ2 (1, N = 10,939) = 1686.4, p < .001. Perceivers also selected the NE member of a pair more often (52% of responses) than the AE member (48%), a significant difference, χ2 (1, N = 10,939) = 28,882, p < .001. The higher number of correct NE responses that resulted was likely due to this bias towards NEs and does not necessarily indicate that NEs were better perceived than AEs. In fact, the higher proportion of accurate AE responses compared to the proportion of accurate NE responses, χ2 (1, N = 10,939) = 4.5624, p = 0.032, confirms that perceivers, when unsure, tended to pick the NE-word.
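A comparison of this kind against chance can be reproduced with a goodness-of-fit test in R; the sketch below works from the rounded 69.6% figure, so its statistic will only approximate the value reported above.

    # Sketch only: test the overall correct rate against the 50% expected
    # from guessing in a two-alternative forced-choice task.
    n_total   <- 10939
    n_correct <- round(0.696 * n_total)

    chisq.test(c(n_correct, n_total - n_correct), p = c(0.5, 0.5))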

Since speaker 4 was the investigator and the best-perceived subject, statistical tests were run both with and without speaker 4 in order to ensure that she was not skewing the results. After removing speaker 4, NEs still yielded more correct responses than AEs, χ2 (1, N = 8364) = 140.33, p < .001.

It was also expected that in addition to emphasis, other factors influenced perception. These variables relate to the characteristics of the words in the corpus and included consonant quality and position in the word, vowel quality, and vowel length.

Other variables that were accounted for in the data analyses were speakers and face conditions. Following are statistical results providing an overview of the potential impact that each of these variables had on the perception of emphasis.

2.1 Descriptive Statistics

 Perception by Vowel

As pointed out earlier, there is substantial evidence in the literature that vowel context affects perception of neighboring consonants, such that consonants adjacent to [æ] are expected to be better perceived than those adjacent to [i], which in turn are usually easier to recognize than those coarticulated with [u] (e.g., Benoit et al., 1992; Jiang et al., 2002; Ouni, 2007). In line with other studies, the difference in perception based on vowel context was again confirmed in the perception of AEs, χ2 (5, N = 10,939) = 406.14, p < .001. Overall, COIs were best identified in the long [æ:]-context, with 86% correct recognition, followed by short [æ] at 75% correct identification, then [i:] at 70%, [i] at 68%, [u:] at 63%, and [u] at 58%. The vowels [æ:] and [u] were better perceived next to an emphatic than in an NE environment, while the other vowels were better perceived in NE contexts. Long vowels were generally better recognized than short vowels, but not significantly so (p = .90).

Figure 3-3 shows percentages of correct answers and demonstrates perception correctness on the basis of vowel quality (best to worst, [æ > i > u]) and vowel length.

The difference in recognition between AEs and NEs coarticulated with [æ, æ:] approached significance, χ2 (1, N = 3633) = 3.26, p = 0.07, but NEs, and especially AEs, were best perceived when adjacent to long [æ:]. NEs + [i] were significantly easier to see than AEs + [i], and both AEs and NEs were slightly better distinguished optically in the proximity of [i:]. There is also a significant difference in the identification of NEs and AEs in the [u, u:] environment (p < .001); however, [u] is better recognized when coarticulated with AEs than with NEs, and unlike the other two vowels, AEs around short [u] are better perceived than around long [u:], while the opposite is true for NEs coarticulated with [u].

Fig. 3-3: % of Correct Responses of NEs and AEs by Vowel

(Bar chart: percent correct for NE and AE in each vowel context, [æ], [æ:], [i], [i:], [u], [u:].)


 Perception by Consonant and Consonant Position

An analysis of goodness of perception based on consonant quality showed a significant difference across the three consonant types [t, d, s]. The consonant [d] (emphatic and non-emphatic) yielded 72% correct responses, [s] 69%, and [t] 68%, χ2 (2, N = 10,939) = 10.021, p = 0.0066.

Words were chosen or constructed with the COI positioned either word-initially or word-finally. Overall, utterances in which the targeted consonant occurred word-finally were better identified than words with the COI at word onset. This difference approached significance, χ2 (1, N = 10,939) = 3.42, p = 0.06. However, NEs and AEs showed different patterns of recognition as related to consonant position. NEs at the beginning of words were significantly better recognized (75% correct) than word-finally (68% correct), χ2 (1, N = 5466) = 31.65, p < .001. On the other hand, the percentage of correct AE identifications word-finally (69%) was significantly better than at the beginning of an utterance (66%), χ2 (1, N = 5473) = 7.7613, p = 0.005. It may be that regressive spreading of emphasis to the initial word consonant and to the vowel preceding the AE provides anticipatory cues signaling emphasis. These cues may provide additional cognitive processing time, leading to better recognition of AEs at the end of a word. Figure 3-4 illustrates the difference in perception between AEs and NEs as a function of position.


Fig. 3-4: % of Correct Responses as a Function of Position

(Bar chart: percent correct by consonant position, word-initial versus word-final, for NE and E.)

So far, reported percentages of correct answers indicate that responses vary based on vowel quality, length, consonant quality, and the COI’s position. It is also expected that perception varies as a function of speaker as is demonstrated in the next section.

 Perception by Speaker

By simply watching the videos of the four speakers producing the same set of words, it became evident that no two speakers articulate a sound in exactly the same manner. In fact, there were very noticeable differences in the way each speaker produced emphasis, as was shown in the arrow plots. For instance, speaker 3 abides by the theory of economy of movement and his lips do not move much; on the other hand, speaker 2 has a more active face. A chi-square test confirmed a significant speaker effect, χ2 (3, N = 10,939) = 451.15, p < .001; the effect remains significant even after removing the investigator (speaker 4) from the analysis, χ2 (2, N = 8364) = 140.33, p < .001. Figure 3-5 illustrates the percentages of correct answers for AEs, NEs, and overall perception for the category 'emphasis' (NEs and AEs) by speaker.

Fig. 3-5: % of Correct Answers by Speaker for AEs, NEs, and AEs + NEs

(Bar chart: percent correct for NE, E, and NE + E by speaker, S1-S4.)

For three out of four speakers (S1, S3, S4), there were more correct answers associated with the visual perception of NEs than of AEs. As reflected in Table 3-3, speakers 2 and 4 were generally better understood (71% accuracy for S2 and 83% for S4) than speakers 1 and 3 (around 60% for both S1 and S3).

Table 3-3: Percent Correct for NEs and AEs by Speaker

Speaker    NE      AE      NE + AE
S1         68%     51%     60%
S2         67%     75%     71%
S3         62%     58%     60%
S4         85%     80%     83%

A comparison between AEs and NEs spoken by speaker 3 suggests that perception of this speaker is mediocre for both AEs (58%) and NEs (62%); speaker 1 has a much higher percentage of correct responses for NEs (68%) but her AEs are the least-well identified category across speakers (51%). Speaker 4 is very well perceived regardless of emphasis with an advantage for NEs (NEs = 85%, AEs = 80%). Speaker 2 is also well-perceived overall but more so in his AEs than NEs (NEs = 67%, AEs = 75%).


One of the main goals of this investigation was to discover what visual features perceivers tend to rely on in face-to-face communication. The next section examines the percentages of correct responses based on facial conditions which contain different combinations of potentially functional visual cues.

 Perception by Face

For all six facial conditions, the proportion of correct to incorrect answers was around 75% correct to 25% incorrect in the perception of AEs, and approximately 70% to 30% in NE identification, as shown in Figures 3-6 and 3-7. No significant difference in perception between face conditions was found, χ2 (5, N = 10,939) = 0.7582, p = 0.9796. The lack of a significant difference in AE and NE identification between AF (the no-masking condition) and L (with only the lips visible) seems to suggest that the lips, present in all six facial conditions, contain most of the visual speech information that native speakers of Lebanese Arabic depend on.

Fig. 3-6: AEs Distribution of Responses by Face Condition

(Bar chart: percent correct (Cor) and incorrect (Inc) AE responses under AF, LwF, L, LCkCn, LCk, and LCn.)


Fig. 3-7: NEs Distribution of Responses by Face Condition

(Bar chart: percent correct (Cor) and incorrect (Inc) NE responses under AF, LwF, L, LCkCn, LCk, and LCn.)

In sum, the data presented in this section highlight native perceivers' ability to differentiate between AEs and NEs at better than chance. The results confirm what was observed in the arrow plots and point to a strong speaker effect, suggesting that, based on visual input only, not all speakers were equally well perceived overall. Speakers also differed in intelligibility by category: speaker 1's NEs were better identified than her AEs, and speaker 2 was best read in his production of AEs. Furthermore, a strong vowel effect suggests that vowel length as well as vowel quality mattered to perceivers. This was again observed in the arrow plots. More correct answers were associated with long than short vowels, especially with long [æ:] and [i:] compared to their short cognates. Consonants coarticulated with [u] were the hardest to identify. Results also suggest that consonant position (word-initial or word-final) informed perception: AEs were better identified word-finally, and the reverse was true for NEs. However, no significant difference in perception was found between the six faces, suggesting that the lips, present in all faces, contain sufficient information for differentiating between AEs and NEs.

2.2 Variable Interactions

The previous analysis gave an overview of the influence that variables such as vowels, speakers, and faces had on the perception of emphasis by providing counts and percentages; it assessed the significance of certain differences pertaining to the perception of the NE-AE contrast across all variables using the chi-square test. However, factors such as speaker, vowel, position, emphasis, and face, though treated as fixed independent variables, do interact to modulate perception. In order to understand the interactions between variables, the responses recorded in the visual decision task were submitted to mixed-effects logistic regression analyses based on seven fixed variables:

 vowel quality (long and short [æ], [i], [u])

 vowel length (short or long)

 consonant quality (AE and NE [t,d,s])

 consonant position (word-initial, word-final)

 speaker (S1,S2, S3, S4)

 face (AF: all face, LwF: from bottom of eyes to neck, L: lips only, LCkCn: lips, cheeks, and chin, LCk: lips and cheeks, LCn: lips and chin).

Perceivers' correctness was the dependent variable, as the goal was to examine how each variable, or variable interaction, predicted the accuracy of the perceivers' answers. The models were fitted by maximum likelihood estimation using the lme4 package in R (Bates et al., 2011). As a first step, comprehensive models including all, or most, of the explanatory variables were generated and compared to each other in search of the best predictors of perception. Model 1 looked at correctness of answers as a function of consonant quality ([d, t, s]), consonant position (word-initial or final), vowel quality ([æ, i, u]), vowel length (long versus short vowels), speaker (speakers 1, 2, 3, 4), and face (AF, LwF, L, LCkCn, LCk, LCn). Table 3-4 summarizes the All-Inclusive Model, in which the base conditions against which all other conditions are compared are [d] for the consonants, the short vowel [æ], speaker 3, non-emphatics (NEs), final consonant position, and the face with lips only (L); these values therefore do not show up in the table. "Perceiver" and "Word" were included as random effects in all of the presented models. The "Estimate" column is an estimate of the effect of the variable value relative to the base: a positive number indicates that the rate of correct responses goes up for that value when compared to the referent, while a negative number indicates that the number of correct responses associated with that value decreases. For example, the vowel [u] is significantly more often misperceived than the (omitted) vowel [æ], as reflected in the negative estimate -1.05. On the other hand, the positive coefficient +1.30 associated with speaker 4 signifies that speaker 4 is significantly more intelligible than speaker 3. Speaker 3 was chosen as the referent because he was generally the least well-perceived subject. The "signif." column is the p-value, relative to an alpha of 0.05; the number of stars symbolizes the level of significance.
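A model of this kind can be specified with lme4's glmer(), as in the minimal sketch below; the trial-level data frame and its column names (responses, correct, consonant, position, vowel, vowel_length, speaker, face, emphasis, perceiver, word) are hypothetical, and reference levels would be set with relevel() to match the baselines described above.

    # Sketch only: all-inclusive mixed-effects logistic regression with
    # random intercepts for perceiver and word.
    library(lme4)

    m_all <- glmer(correct ~ consonant + position + vowel + vowel_length +
                     speaker + face + emphasis +
                     (1 | perceiver) + (1 | word),
                   data = responses, family = binomial)
    summary(m_all)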


Table 3-4: All-Inclusive Model

Number of obs: 10939; Groups: word ID, 72; User ID, 53

Fixed effects        Estimate     Std. Error   z value   Pr(>|z|)
(Intercept)           1.136690     0.255736      4.445   8.80e-06 ***
Speaker1              0.037652     0.070831      0.532   0.59502
Speaker2              0.703360     0.077211      9.110   < 2e-16 ***
Speaker4              1.295574     0.081618     15.874   < 2e-16 ***
[i]                  -0.601007     0.209420     -2.870   0.00411 **
[u]                  -1.058010     0.208790     -5.067   4.03e-07 ***
Short vowels         -0.331217     0.170269     -1.945   0.05174 .
Word-initial cons     0.045157     0.170273      0.265   0.79085
[s]                  -0.052912     0.208594     -0.254   0.79976
[t]                  -0.109605     0.208357     -0.526   0.59886
AF                    0.028114     0.079074      0.356   0.72218
LwF                   0.050951     0.078973      0.645   0.51882
LCkCn                -0.023365     0.078759     -0.297   0.76672
LCk                   0.028859     0.079102      0.365   0.71523
LCn                   0.007555     0.079031      0.096   0.92384
Non-emphatic          0.250016     0.170270      1.468   0.14201

The All-Inclusive Model confirms some of the statistical results already revealed by the Chi-Square tests. Relative to Speaker 3, Speaker 4 has the highest rate of correct responses followed by Speaker 2. Speakers 3 and 1 are the least well perceived and the difference between them is not significant. Face did not really matter in identifying AEs and NEs, meaning that participants’ accuracy in perceiving emphasis was not dependent on how much of the face was available to them, as long as the lips were visible. However, after including all the factors, the effect of consonant quality and position was no longer significant.

Several other models were run, removing variables that did not reach significance one at a time. Using the likelihood ratio test, all models were compared for their predictive power. The statistical Model 2, summarized in Table 3-5, includes the significant factors from the All-Inclusive Model, namely speakers, vowel quality, and, to a certain extent, vowel length.
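Nested models of this kind can be compared with a likelihood ratio test through anova(); the sketch below reuses the hypothetical m_all object from the earlier sketch.

    # Sketch only: drop the predictors that did not reach significance and compare
    # the reduced model to the all-inclusive one (likelihood ratio test).
    m_best <- update(m_all, . ~ . - consonant - position - face - emphasis)
    anova(m_best, m_all)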

Table 3-5: Best Model

Number of obs: 10939; Groups: word ID, 72; User ID, 53

Fixed effects    Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)       1.24702     0.18617       6.698   2.11e-11 ***
Speaker1          0.03827     0.07082       0.540   0.5889
Speaker2          0.70349     0.077211      9.112   < 2e-16 ***
Speaker4          1.29573     0.21316      15.879   < 2e-16 ***
[i]              -0.60407     0.20942      -2.834   0.0046 **
[u]              -1.05890     0.21259      -4.981   6.33e-07 ***
Short vowels     -0.33214     0.17331      -1.916   0.0553 .

Based on the likelihood ratio test comparing the Best Model to the All-Inclusive Model, the Best Model was 94% better at predicting correct identification of emphasis. The Best Model once more shows a strong speaker effect relative to speaker 3, and a vowel effect whereby emphasis is best perceived in the [æ]-context, followed by [i], then [u]. Long vowels are marginally better perceived than short vowels.

 Vowel Effect

Zooming in on the significant main effects, the model summarized in Table 3-6 provides a closer look at the vowel effect on the perception of emphasis. Because of the strong speaker influence, the variable "Speaker" was included in all models to control for it. The results summarized in Table 3-6 show how each vowel, relative to the default [u] NE, interacts with AEs and NEs.
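One way to specify such an interaction model, reusing the hypothetical responses table from the earlier sketch, is to collapse vowel and emphasis into a single factor so that every vowel-emphasis cell receives its own coefficient relative to a single reference cell; the level label "u.NE" below is an assumption about how those factor levels would be named.

    # Sketch only: vowel x emphasis cells as one factor, with [u] NE as the reference.
    responses$vowel_emph <- relevel(interaction(responses$vowel, responses$emphasis),
                                    ref = "u.NE")

    m_ve <- glmer(correct ~ speaker + vowel_emph + (1 | perceiver) + (1 | word),
                  data = responses, family = binomial)
    summary(m_ve)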


Table 3-6: Effect of Vowel-Emphasis Interaction on Perception

Number of obs: 10,939; Groups: word ID, 72; User ID, 53

Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      0.34578     0.24032       1.439   0.15020
Speaker1         0.03768     0.07080       0.532   0.59456
Speaker2         0.70268     0.07720       9.102   < 2e-16 ***
Speaker4         1.29676     0.08157      15.898   < 2e-16 ***
[æ] : AE         0.31231     0.32962       0.947   0.34339
[i] : AE        -0.54364     0.32720      -1.661   0.09662 .
[u] : AE         0.13877     0.32915       0.422   0.67331
[æ:] : AE        1.32776     0.33760       3.933   8.39e-05 ***
[i:] : AE       -0.37049     0.32753      -1.131   0.25799
[u:] : AE       -0.50170     0.32745      -1.532   0.12549
[æ] : NE         0.37060     0.33049       1.121   0.26213
[i] : NE         0.73662     0.33279       2.213   0.02687 *
[æ:] : NE        0.91544     0.33346       2.745   0.00605 **
[i:] : NE        0.71990     0.33202       2.168   0.03014 *
[u:] : NE       -0.93178     0.32766      -2.844   0.00446 **

From the model in Table 3-6, we draw the following conclusions per vowel on the effect of vowel-emphasis interaction on perception of AE and NE. All observations are made relative to the reference [u]-NE:

 Long [æ:]-context: significantly better perception of both AEs and NEs

 Long [i:]: significantly better perception of NE, not AE

 Long [u:]: significantly worse perception of NE

 Short [æ]: no significant effect on either AE or NE

 Short [i]: significantly better perception of NE, marginally worse with AE

 Short [u]: no effect on perception of AE

It is apparent that the strongest interactions were in the [æ:]-context, probably because it is an open vowel that provides a good view of the tongue, which is clearly depressed and retracted during the production of an AE but sits much higher, with a forward thrust, during the NE articulation. Short and long [i] and [u] do not have any significant effect on the perception of AEs.

While [æ:] is still a good predictor of correctness in the NE context, short [æ] loses its significance. [i], and especially [i:], positively impact recognition of NEs, but [u:] interferes with the identification of NEs. One hypothesis attributes this to a duration effect: according to Giannini and Pettorino (1982), before [æ], AEs are longer, and before [i], NEs are longer. It is conceivable that longer segments provide longer cognitive processing time. Furthermore, some sounds have been shown to block the spreading of emphasis; the most frequently reported blocking sound is the vowel [i] (e.g., Ghazeli, 1977; Younes, 1993).

As previously pointed out, vowel duration is phonemic in Arabic, meaning that substituting a long vowel for its short counterpart will result in a different word. Regressions were run to explore the effect of vowel duration on perception when vowel length interacted with emphasis, but no vowel length effect was found except in the NE subset, where NEs + long vowels were slightly better perceived than NEs + short vowels (p = 0.08). The only meaningful effect in the vowel length-speaker interaction was associated with speaker 2, whose AEs were better recognized when coarticulated with long vowels (p = 0.03). The weak effect of vowel duration on perception was nevertheless surprising, considering that native speakers' semantic processing must be attuned to the distinction between short and long vowels. It is plausible that native speakers are used to relying on the acoustic signal to detect duration, and vision does not contribute to this dimension of perception.


To complete the picture of vowel influence on perception of AEs and NEs, Models 3-7a and 3-7b investigate the effect of vocalic environment on COI identification in the context of AEs and NEs.

Logistic regressions performed to assess the effect of consonants on the perception of emphasis showed that consonant quality did not impact correct identification of either AEs or NEs. Table 3-7a summarizes the differences in emphasis identification for AEs as a function of CV combinations. Wherever significance was found, it is likely attributable to the vocalic context rather than to the quality of the COI. What transpired from looking at the consonant-vowel (CV) interaction in the AE subset was the strong effect that long [æ:] had on all three pharyngealized COIs. Short [æ] also had a significant influence on the perception of the stops [dʕ] and [tʕ], but perception accuracy for the fricative [sʕ] favored the [u]-context. No significant effects were found for any of the COIs adjacent to [i] or [u]. It is also worth noting a change in the speaker effect when consonant-vowel interactions were examined in the AE subset: while perception of speaker 1 had not previously been shown to differ significantly from perception of speaker 3, in the AE subset the intelligibility of speaker 1 was significantly diminished compared to speaker 3. Only significant effects are reported in Tables 3-7a and 3-7b.


Table 3-7a: Effect of Consonant-Vowel Interaction on Perception of AEs

Number of obs: 5472; Groups: word ID, 36; User ID, 53

Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)     -0.22641     0.32159      -0.704   0.48141
Speaker1        -0.26271     0.10134      -2.592   0.00953 **
Speaker2         1.22800     0.11417      10.756   < 2e-16 ***
Speaker4         1.25266     0.11502      10.891   < 2e-16 ***
dʕ : æ           1.32008     0.44470       2.968   0.00299 **
dʕ : æ:          2.02958     0.45621       4.449   8.63e-06 ***
sʕ : æ:          1.37122     0.44484       3.083   0.00205 **
sʕ : u           1.28515     0.44252       2.904   0.00368 **
tʕ : æ           1.06107     0.44169       2.402   0.01629 *
tʕ : æ:          2.55913     0.47935       5.339   9.36e-08 ***
tʕ : u           0.81669     0.43829       1.863   0.06241 .

Table 3-7b shows fewer and weaker vowel effects on the visual recognition of NEs. Except for [sæ], NE identification was barely improved in the context of long or short [æ], but it was significantly better in the [si]-context. Visual perception was significantly weakened when [t], and even more so when [s], occurred in the context of short [u]. We also notice that in this model the significance levels of speakers 1 and 2 are altered: the overall well-perceived speaker 2 is shown in the NE subset as barely better than the least intelligible speaker 3, while speaker 1, whose face is usually not as intelligible as speaker 2's, was significantly better perceived in the context of NEs. Non-significant results are omitted from Table 3-7b.


Table 3-7b: Effect of Consonant–Vowel Interaction on Perception of NEs

Number of obs: 5467; Groups: word ID, 36; User ID, 53

Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      0.6520      0.3144        2.074   0.038072 *
Speaker1         0.3772      0.1076        3.507   0.000454 ***
Speaker2         0.2056      0.1150        1.788   0.073794 .
Speaker4         1.4546      0.1261       11.538   < 2e-16 ***
d : æ            0.7406      0.4315        1.716   0.086091 .
t : æ:          -0.7563      0.4163       -1.817   0.069258 .
s : æ:           1.0817      0.4404        2.456   0.014040 *
s : i            1.3131      0.4418        2.972   0.002956 **
s : i:           0.8374      0.4328        1.935   0.053019 .
s : u           -2.0004      0.4197       -4.766   1.88e-06 ***
t : u:          -1.3263      0.4161       -3.188   0.001434 **

The reference in this regression was the [tu:] CV combination. The most meaningful results found in terms of CV interaction in the NE subset relate to the [u]-context: [su] and [tu] were significantly harder to identify than [tu:]. [tæ:] was a little less well perceived than [tu:]. Any further attempt to explain these results would be pure speculation, because they do not show any interpretable patterns.

Two more models were run to examine whether vowels preceding the COI (word-final position) or following the COI (word-initial position) had any effect on perception. The only significant effect found in the AE subset was related to [u], which was significantly better perceived when preceding than when following the consonant. Values approaching significance in the NE subset suggest that NEs were slightly better perceived after short than after long [æ] (p = 0.07 and 0.06, respectively).

To summarize the important highlights related to the vowel effect on perception of AEs and NEs, both categories were well perceived in the [æ:]-context; long and short [i] improved perceivers' performance in correctly identifying NEs, but the high front vowel seemed to block spreading of emphasis and did not have a significant effect on perception of AEs. The spread of emphasis to the vowel [u] may not be distinctive enough for the less observant perceiver, since [u] is produced with round lips and in the back of the vocal tract for both NEs and AEs. [æ] is also a back vowel; however, it is articulated with more open lips that reveal the position of the tongue. That AEs and NEs were so much better perceived in the [æ]-context than in the [u]-context suggests that tongue visibility is an important cue in visual speech perception. Furthermore, consonant position and vowel length did not turn out to significantly influence perception of emphasis. Even though vowel length is phonemic in Arabic, perceivers do not seem to visually attend to the duration dimension of the vowel.

 Speaker Effect

The other significant main effect reflected in the Best Model in Table 3-5 is the speaker effect. From all the previous models in which "Speaker" was controlled for, we can confidently conclude that, overall, speakers 2 and 4 were better perceived than speakers 1 and 3. While speaker 3 was not very intelligible in either the NE or the AE category, speaker 4 was well identified in both sets. AEs were best recognized with speaker 2, but his NEs were difficult to perceive; the opposite was true for speaker 1.

The next question relating to the effect of speaker on perception was whether an interaction between speaker and emphasis impacted the accuracy of perceivers' identification of AEs and NEs. As previously discussed, a significantly higher number of correct responses involved NE words; this was due to a perceiver bias that led to a higher proportion of NE answers (52%) than AE answers (48%). However, the proportion of accurate AE responses (70%) was significantly higher than that of NE responses (68%), χ2 (1, N = 10,939) = 4.56, p = .033. What makes a perceiver select the NE word more often than the AE word may also be correlated with who the speaker is. Table 3-8 examines the emphasis-speaker interaction.

 Emphasis-Speaker Interaction

Table 3-8: Emphasis-Speaker Interaction

Number of obs: 10939, groups: words, 72; perceivers, 53

Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      0.39898     0.16298       2.448   0.01437 *
Speaker1        -0.24609     0.09163      -2.686   0.00724 **
Speaker2         1.16654     0.10332      11.291   < 2e-16 ***
Speaker4         1.24529     0.10532      11.824   < 2e-16 ***
NE               0.25199     0.21927       1.149   0.25047
Speaker1:NE      0.61308     0.12118       5.059   4.21e-07 ***
Speaker2:NE     -0.89732     0.12894      -6.959   3.42e-12 ***
Speaker4:NE      0.14007     0.14399       0.973   0.33067

The results in Model 3-8 are relative to the AE production of speaker 3. The model confirms the interaction between emphasis and speakers 1 and 2 as previously noted; compared to speaker 3’s AEs, perceivers recognized NEs significantly better than AEs when watching speaker 1, but they identified AEs significantly better when looking at speaker 2.

Would some COIs, when combined with certain vowels, result in better perception of emphasis depending on the speaker? Model 3-9, summarized in Table 3-9, looks at the vowel-speaker interaction.


 Vowel-Speaker Interaction and its Effect on Perception of Emphasis

Table 3-9: Effect of Vowel-Speaker Interaction on Perception of Emphasis

Number of obs: 5472; Groups: word ID, 36; User ID, 72

Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      0.54109     0.23753       2.278   0.022725 *
Speaker1        -0.38576     0.15384      -2.507   0.012159 *
Speaker2         0.71380     0.15889       4.492   7.04e-06 ***
Speaker4         1.57535     0.19984       7.883   3.20e-15 ***
[æ]             -0.28181     0.32468      -0.868   0.385416
[æ:]             0.39547     0.32818       1.205   0.228197
[i:]            -0.16560     0.32619      -0.508   0.611683
[u]             -0.16776     0.32332      -0.519   0.603857
[u:]            -0.07004     0.32463      -0.216   0.829184
Speaker1*æ       0.81322     0.21157       3.844   0.000121 ***
Speaker2*æ       0.54913     0.22095       2.485   0.012944 *
Speaker4*æ       1.12622     0.29786       3.781   0.000156 ***
Speaker1*æ:      1.08851     0.22551       4.827   1.39e-06 ***
Speaker2*æ:      0.78525     0.24883       3.156   0.001601 **
Speaker4*æ:      1.42378     0.38159       3.731   0.000191 ***
Speaker1*i:      0.37764     0.20615       1.832   0.066974 .
Speaker2*i:      0.47546     0.22293       2.133   0.032940 *
Speaker4*i:      0.04912     0.26895       0.183   0.855095
Speaker1*u       0.23113     0.20273       1.140   0.254243
Speaker2*u      -0.86197     0.21606      -3.989   6.62e-05 ***
Speaker4*u      -1.23048     0.24819      -4.958   7.13e-07 ***
Speaker1*u:      0.27114     0.20572       1.318   0.187485
Speaker2*u:     -0.55573     0.21001      -2.646   0.008142 **
Speaker4*u:     -1.24378     0.24828      -5.010   5.45e-07 ***

Interactions in Model 3-9 reveal that perception of AEs and NEs was unequivocally best with long and short [æ] for speakers 1, 2, and 4 compared to speaker 3 and especially for speakers 1 and 4. COIs with long [i:] were a little better identified when produced by speakers 1 or 2, and COIs coarticulated with [u:] were significantly misperceived when uttered by speakers 2 and 4.


Up to this point, from the Best Model, I have extrapolated the variables that had significant effects on perception of emphasis, namely "Speaker", "Vowel", and, to some extent, "Vowel Length" when associated with NEs. This is not to say that all the other variables were discarded; in fact, they have been used in a number of simple models investigating potential interactions between the different variables. The results, or partial results, of the models with at least some significance are summarized in Tables 3-10 and 3-11.

 Interactions with Face

One of the goals of this study was to isolate the facial features that perceivers potentially attend to when trying to identify emphasis. The descriptive statistics in the previous section and the All-Inclusive regression model concur that no particular face condition gave perceivers an advantage in differentiating, based on optical information only, between AEs and NEs. Upon further investigation, some interactions between face and emphasis, as well as between face and speaker in the AE subset, were detected. Models 3-10 and 3-11 reflect these interactions, with speaker 3 and AEs set as the references. It is worth remembering that all facial conditions included the lips.

The fixed effects in Model 3-10 suggest that the lower face, where lips, chin, cheeks, and neck are visible, was significantly better perceived when the emphasis-face interaction was introduced; however, perception of NEs under the LwF condition was significantly worse than when perceivers were looking at lips only in the AE context. Perhaps LwF introduced noise that distracted the viewers.


Table 3-10: Face Effect on Emphasis

Number of obs: 5472, groups: UserID, 53; wordid, 36

Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      0.79742     0.16881       4.724   2.32e-06 ***
NE               0.36388     0.22525       1.615   0.10621
AF               0.10861     0.10618       1.023   0.30636
LwF              0.25212     0.10816       2.331   0.01975 *
LCkCn            0.04467     0.10680       0.418   0.67577
LCk              0.06078     0.10682       0.569   0.56940
LCn             -0.06798     0.10652      -0.638   0.52332
NE:AF           -0.20604     0.15652      -1.316   0.18806
NE:LwF          -0.43428     0.15619      -2.780   0.00543 **
NE:LCkCn        -0.15251     0.15596      -0.978   0.32814
NE:LCk          -0.07843     0.15666      -0.501   0.61663
NE:LCn           0.14445     0.15677       0.921   0.35683

Interactions with face remain difficult to interpret. What is clear, however, is that perceivers did not appear to take advantage of the potential additional information afforded by the whole face.

One last attempt was made to uncover a face effect. Since there was a strong speaker effect, it is conceivable that some facial features are more functional when associated with a particular speaker; or perceivers may benefit from seeing more of the face if a particular speaker’s lips are not very informative. Model 3-11 reflects the results of speaker-face interactions.

 Speaker-Face Interaction in AEs

Model 3-11 shows that, compared to the identification of speaker 3's AEs under the AF condition, perception was significantly impacted when perceivers responded to a video showing only lips and chin. In spite of the negative main effect of the LCn facial condition compared to the whole face, the model shows a significant interaction positively impacting perception when perceivers' attention was focused on speaker 4's lips and chin during the production of AEs. It is worth noting that the movements of the lips and chin are highly correlated. It is plausible that for speaker 4, the dropping of the chin that accompanies the pronunciation of AEs was more accentuated than for the other speakers and cued emphasis.

Only fixed factors and significant interactions are reported:

Table 3-11: Speaker*Face Interaction in the AE Subset

Number of obs: 5,467; Groups: word ID, 36; User ID, 53

Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      0.55537     0.22423       2.477   0.013256 *
Speaker1        -0.43569     0.21135      -2.061   0.039263 *
Speaker2         1.40991     0.24853       5.673   1.4e-08 ***
Speaker4         0.93533     0.24161       3.871   0.000108 ***
LwF             -0.01858     0.22259      -0.083   0.933494
L               -0.31610     0.21723      -1.455   0.145637
LCkCn            0.06480     0.21990       0.295   0.768236
LCk             -0.15420     0.21717      -0.710   0.477683
LCn             -0.45463     0.21676      -2.097   0.035963 *
Speaker4:L       0.60036     0.33792       1.777   0.075633 .
Speaker4:LCn     0.68208     0.33929       2.010   0.044395 *

The few significant effects are not generalizable, and the results point to the lips as the locus of attention in the perception of emphasis.

To summarize the findings connected to correct identification of emphasis, Chapter 3 examined the meaningful effects of the explanatory variables that were uncovered by the Best Model. Accuracy in distinguishing between AEs and NEs was modulated by vowels and speakers. AEs and NEs were best perceived in the [æ]-context, where the tongue position and lip configuration distinguish the two groups of consonants; long and short [i], on the other hand, block the spread of emphasis and demonstrate a significant effect on NEs only. However, since emphasis on [i] does not exhibit the same level of lip rounding, which is the most visible cue of emphasis, it is likely that this significance is attributable to a bias towards NEs. The lip configuration in the [u]-context being similar to that assumed by vowels adjacent to AEs, perceivers could not easily tell NEs apart from AEs in the vicinity of the high back vowel. In addition to the Best Model including the most relevant effects, other models examined different interactions between variables. While no consonant effect was found, the consonant-vowel interaction models for the AE and NE subsets revealed significance in the identification of some consonant-vowel combinations examined separately for AEs and NEs. While all consonants were better identified in the [æ:]-context, no generalizable patterns emerged. Also, COIs were slightly better recognized in the context of long than short vowels, but vowel duration was only marginally significant, suggesting a lack of visual sensitivity to vowel length.

Emphasis-speaker interactions indicated that perceivers could generally identify emphasis in words spoken by speakers 2 and 4 much better than in words spoken by speakers 1 and 3. Speaker 1's NEs were significantly better identified than her AEs; conversely, perceivers could recognize speaker 2's AEs much better than his NEs.

All other models failed to indicate any interactions between variables. However, considering the very strong speaker effect, the following section examines variables and their interactions within each speaker. Only the models with significant results are fully or partially reported.

2.3 Models by Speaker

 Speaker 1

Following is a concatenation of the models showing some significant effects based on Speaker 1's intelligibility:


Table 3-12: Patterns of Perception Associated with Speaker 1

Number of obs: 3016, groups: wordid, 72; UserID, 29

Emphasis
Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      0.2344      0.1950        1.202   0.22930
NE               0.8785      0.2756        3.187   0.00144 **

Vowel quality
(Intercept)      1.3571      0.2459        5.518   3.43e-08 ***
[i]             -1.0114      0.3409       -2.967   0.00301 **
[u]             -1.0088      0.3395       -2.972   0.00296 **

Vowel quality/Emphasis interaction
(Intercept)      1.1563      0.2871        4.027   5.64e-05 ***
[i]             -1.8928      0.3970       -4.767   1.87e-06 ***
[u]             -0.8533      0.3955       -2.158   0.030949 *
NE               0.3199      0.4050        0.790   0.429630
[i]:NE           1.8718      0.5652        3.312   0.000928 ***
[u]:NE          -0.2369      0.5598       -0.423   0.672186

Face conditions
(Intercept)      0.86282     0.17991       4.796   1.62e-06 ***
AF              -0.22155     0.14974      -1.480   0.1390
LwF             -0.03317     0.14983      -0.221   0.8248
LCkCn           -0.25056     0.14965      -1.674   0.0941 .
LCk             -0.24160     0.15043      -1.606   0.1082
LCn             -0.35618     0.14976      -2.378   0.0174 *

Vowel quality-face interactions
(Intercept)      1.372188    0.314651      4.361   1.29e-05 ***
[i]             -0.932998    0.426088     -2.190   0.0285 *
[u]             -0.641039    0.424367     -1.511   0.1309
…
[u]:LCkCn       -0.892212    0.384876     -2.318   0.0204 *

Model 3-12 confirms previous findings concerning speaker 1. The vowel quality-emphasis interaction model indicates that her NEs are much better perceived than her AEs. Overall, her COIs in the context of [æ] are better identified than those with [i] or [u], but compared to her intelligibility on AEs, her NEs are significantly better identified in the context of [i] than in the environment of the other two vowels. The face conditions model seems to suggest that there is something in speaker 1's chin, or in the combination of cheeks and chin, that negatively impacts perception, especially in the [u] context.

 Speaker 2

Table 3-13 summarizes the significant effects found among the variables when applied to speaker 2.

Table 3-13: Patterns of Perception Associated with Speaker 2

Number of obs: 2791, groups: wordid, 72; UserID, 28

Emphasis
Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      1.6806      0.2315        7.260   3.87e-13 ***
NE              -0.6809      0.2971       -2.291   0.0219 *

Vowel quality
(Intercept)      2.1217      0.2538        8.360   < 2e-16 ***
[i]             -0.6666      0.3252       -2.050   0.0404 *
[u]             -1.6784      0.3244       -5.174   2.29e-07 ***

Emphasis-face interactions
(Intercept)      1.27932     0.27710       4.617   3.9e-06 ***
AE:AF            0.76268     0.38083       2.003   0.04521 *
NE:LwF          -0.668       0.23465      -2.848   0.00439 **

Vowel quality-face interactions
(Intercept)      1.9164      0.3272        5.856   4.74e-09 ***
[i]             -0.3234      0.4276       -0.756   0.44942
[u]             -1.6273      0.4209       -3.866   0.00011 ***
…
LCn              1.0303      0.3946        2.611   0.00902 **
…
[i]:LCn         -1.2078      0.4842       -2.494   0.01261 *
[u]:LCn         -0.8125      0.4779       -1.700   0.08913 .

Speaker-consonant position interactions
InitialPos:Sp2(AE)  -0.66571   0.19229    -3.462   0.000536 ***
InitialPos:Sp2(NE)   0.75423   0.18517     4.073   4.64e-05 ***


Speaker 2's strength is in his articulation of emphatics, especially word-finally. The emphasis-face interaction model shows that in his AE production, he was significantly better perceived when the whole face was visible, but recognition of his NEs was significantly impacted in the LwF condition. His NEs were not well identified compared to those of speakers 1 and 4, but they were better recognized word-initially than word-finally; speaker 2 appears to be responsible for the mild position effect found earlier. Perception of his COIs adjacent to [æ] was better than next to [i] and particularly [u]. Additionally, some face effects were found for this speaker: compared to lips only, the condition with lips and chin was better perceived; however, when lips and chin interacted with [i] or [u], perception was negatively affected.

 Speaker 3

There is not much to report on the models investigating interactions associated with speaker 3, as no additional significant effects or interactions were found: no vowel, no COI, no face, or emphasis interactions.

 Speaker 4

As for speaker 4, no emphasis effect was found; her AEs and NEs were equally well perceived. However, a very strong vowel quality effect was found for her AEs, which were significantly better identified in the [æ]-context than in the [i]- or [u]-context. The vowel-emphasis interaction model reflects the ease of correct identification of COIs in the context of long [æ:], but [i] + NE was even better recognized, possibly because of the ambiguity of the signal in the [i]-context and the perceivers' bias towards NEs; as indicated by the negative coefficients across all vowels, NE + [i] is better perceived than all the other NE + vowel combinations, and particularly better than NE + [æ:] and NE + [u]. Additionally, some weak face-vowel interactions, such as LCn with [u] and LCkCn with [i], were found.

Table 3-14: Patterns of Perception Associated with Speaker 4

Number of obs: 2575, groups: words, 72; perceivers, 24

Vowel quality
Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      3.4927      0.2778       12.572   < 2e-16 ***
[i]             -1.2755      0.3312       -3.851   0.000117 ***
[u]             -2.7243      0.3198       -8.518   < 2e-16 *

Vowel-emphasis interaction
(Intercept)      1.4627      0.3704        3.949   7.86e-05 ***
NE               1.7090      0.5786        2.954   0.003141 **
[æ]              1.4832      0.5358        2.768   0.005641 **
[æ:]             3.2516      0.7798        4.170   3.05e-05 ***
[i:]             0.2547      0.5001        0.509   0.610579
[u]             -0.3512      0.4878       -0.720   0.471585
[u:]            -1.0910      0.4799       -2.273   0.023001 *
NE : [æ]        -1.5639      0.8175       -1.913   0.055764 .
NE : [æ:]       -2.8309      1.0227       -2.768   0.005641 **
NE : [i:]       -0.7895      0.7850       -1.006   0.314546
NE : [u]        -2.5267      0.7450       -3.391   0.000695 ***
NE : [u:]       -0.7847      0.7436       -1.055   0.291320

Emphasis-face interactions
(Intercept)      1.3395      0.2780        4.819   1.45e-06 ***
AF               0.7025      0.2543        2.763   0.00573 **
LCk              0.5199      0.2462        2.112   0.03468
NE:AF           -0.9133      0.3422       -2.669   0.00761
NE:LwF          -0.8828      0.3315       -2.663   0.00775 *
NE:LCkCn        -0.4766      0.3322       -1.435   0.15140
NE:LCk          -0.5878      0.3409       -1.724   0.08468 .
NE:LCn          -0.2601      0.3423       -0.760   0.44732

Vowel-face interactions
(Intercept)      3.02756     0.39365       7.691   1.46e-14 ***
[i]             -0.83657     0.49681      -1.684   0.0922 .
[u]             -2.22955     0.46071      -4.839   1.30e-06 ***
LCn              1.06045     0.58692       1.807   0.0708
…
[i]:LCkCn       -1.24470     0.65486      -1.901   0.0573 .
…
[u]:LCn         -1.27735     0.64894      -1.968   0.0490 *

Several generalizations emerge from the results presented so far:

1. Perception accuracy varies by speaker, reflecting idiosyncrasies in emphasis production that stem from different backgrounds. Both speakers 2 and 3 have large faces, but speaker 2 has much more active articulators than speaker 3.

2. The vocalic context of the COIs influences the accuracy of perception: as pointed out in the first statement, this is true for some speakers and is not consistent across all three vowel contexts. This can be attributed in large part to the articulatory constrictions associated with the production of any given CV combination. For example, COIs were well identified across the board when coarticulated with the open, back vowel [æ:]; the mouth in the production of [æ:] has a distinctive configuration: in the context of AEs the lips are rounded and more open than in the context of NEs.

3. As long as the lips are visible, other facial features do not appear to impact the perception of emphasis in Arabic. This is not to say that the cheeks and chin and, in all likelihood, other features not tested in this study do not carry visual speech information; after all, the facial muscles are interconnected, and chin and cheek movements are synchronized with lip gestures. The indication, however, is that the lips are the main focus of attention when the perceiver is deprived of the information provided by the acoustic signal and has to rely solely on the optical information.

Considering that each speaker provided perceivers with a different set of visual cues, and that in some cases, such as with speaker 3, perceivers were hardly given any cues detectable within the settings of the study, it is remarkable that perceivers were able to identify emphasis at better than chance. We have confirmed up to this point that the vocalic environment has a definite impact on the perception of emphatics, and that some speakers are much better perceived than others. Why is that? What is it in speaker 4's pronunciation of emphatics that makes her more intelligible than speakers 1 and 3?

Perceivers are looking at faces that are very rich in speech information, and they must be sensitive to differences in how facial gestures move in one word as opposed to another. In the next section, I revisit the measurements described in Chapter 2 in order to examine the relationship between perception and production and to investigate how the shape and amplitude of a speech gesture inform perception. In other words, looking at the measurement differences between AEs and NEs may help us better understand what in the speaker's face informs the perceivers' responses.


CHAPTER 4: PRODUCTION-PERCEPTION CONNECTION OF AEs

Perceivers stared for an hour at videos of speakers mouthing words that looked very much alike and were asked to decide which of two words the speaker was saying. What in the speaker's face cues the perceiver to select the correct word? Is it how round the speaker's mouth is when producing a word with an AE, how wide the cheeks are, or how pointed the chin is? This chapter describes the relationship between measurements (production) and accuracy of perception. Using a machine learning (ML) algorithm, I first examine the measurements that the ML selects to best classify words as AE or NE. Then, I examine how production, i.e., measurements of different facial features, relates to perceivers' accuracy of perception. But first, let us look at a real example provided by two of the speakers.

In Arabic, rounded, protruded lips are usually characteristic of a pharyngealized coronal phoneme coarticulated with the vowel [æ], and such a phoneme is fairly easily distinguishable from its plain, non-pharyngealized counterpart, which is characterized by spread lips. Figure 4-1 is a representation of speaker 2's lips saying /tʕæ:b/ and /tæ:b/. We notice that the difference is not only in the lip configuration; there are also potential speech cues in the tongue position and the teeth. The /tʕæ:/ in /tʕæ:b/ is distinguished by rounded lips, a thick lower lip, a low tongue position, and upper teeth that are partially visible. In contrast, the mouth opening in the non-pharyngealized syllable /tæ:/ is bigger, the lower lip is thinner, both upper and lower teeth are visible, and the tongue sits higher in the oral cavity.


Fig. 4-1: Speaker 2's mouth in /tʕæ:b/ and /tæ:b/


The vowel effect on the articulation of this minimal pair is obvious, and so is the difference in the lip measurements. We also notice the asymmetry in speaker 2's lower lip, where the off-center vertical measurement on the left is larger than on the right. This might be one of the idiosyncrasies that confuse his viewers in the identification of NEs.

Statistical models did show that speaker 2's NEs are not as intelligible as his AEs.

As important as mouth movements are for speech recognition, lips alone are not always sufficiently informative in all consonant-vowel contexts; for instance, the contrast in the mouth configuration for pharyngealized and non-pharyngealized consonants coarticulated with /u/ is much smaller than in the [æ] context. Figure 4-2 shows speaker

1 saying /sʕu:r/ versus /su:r/:

Fig. 4-2: Speaker 1's mouth in /sʕu:r/ and /su:r/



In a still picture, when both faces are displayed side by side, we may be able to notice that the opening between the inner lips is slightly bigger for this particular speaker in the production of [su:]; but would this little difference be a sufficient cue for a perceiver to discriminate between /sʕu:r/ and /su:r/ by simply watching each word uttered at a normal rate of speech?

The next section examines the production-perception connection by looking at machine-learning forward-build models designed to predict perceivers' accuracy of emphasis identification based on measurement differences in the vertical and horizontal lip dimensions, the cheeks, and the chin.

1 MACHINE LEARNING CLASSIFICATION OF EMPHASIS

Models were developed to examine how production, expressed here as the measurements associated with the different facial features, informed the perception of AEs and NEs. The models predicted perceivers' answer correctness based on the contrast in the measurements of the AE and NE for the word shown. The database created to this end included all the information from the perceiver data, augmented with the differences between AE and NE measurements for each word, taken at the stable state of the vowel preceding or following the COI. The measurements were set to zero for faces in which they were not visible. As a reminder, each perceiver watched two of the four speakers under six facial conditions; in all conditions except AF, with no masking, different parts of the face were concealed. The measurement contrasts that were included were all the differences in the cheeks, chin, lips, lip vertical inner square, and open area width and/or opening, as outlined below; all the variables represent measurement differences between AEs and NEs. The explanatory variables are:

 8 different lip measurement differences: lip horizontal, lip vertical outer, lip thickness upper, lip thickness lower, lip vertical inner, open area, lip vertical inner-square, and lip vertical side.

 4 chin measurements: chin width high, chin width mid, chin width low, and chin vertical.

 2 cheek measurements: cheek high and cheek low.

 1 open area

 1 lip vertical inner-square

As previously mentioned, the lip vertical inner square was added to account for the non-linear relationship between the vertical inner area between the lips and the perception of AEs and NEs. This nonlinearity was uncovered by the decision tree algorithm reported above and turned out to be significant in many of the human perception models.

Human language as a tool for communication is extremely complex, and machine learning (ML) is not even close to uncovering all the layers of complexity related to the neurological, psychological, physical, and cultural factors, to mention just a few. As Salvi puts it, talking about machine-learning methods,

“They can be used to develop models of astounding complexity that can perform reasonably well in speech recognition and synthesis tasks, despite our incomplete understanding of the human speech perception and production mechanisms. Machine learning can also be used as a complement to standard statistics to extract knowledge from multivariate data collections, where the number of variables, the size (number of data points), and the quality of the data (missing data, inaccurate transcriptions) would make standard analysis methods ineffective. Finally these methods can be used to model and simulate the processes that take place in the human brain during speech perception and production.” (Salvi, 2006, pp 3-4).


This long quote was included because it speaks to the usefulness of ML and describes the dataset in this thesis (and, as a matter of fact, most human speech production and perception data) very well. Salvi talks about the quality of the data, and as in most other studies, there are some shortcomings that may have compromised the quality of the data here. For instance, a procedural limitation relates to the variability in the face fittings, which were done only once. Ideally, the face fittings would be repeated to control for potential fluctuations in the measurements from one fitting to another.

Also, machine learning in its classification task was limited in that the dataset focused on idiosyncratic features of a particular pronunciation by a particular speaker of a particular word. The ML analysis would have been more powerful if there were more speakers and more variety in the input.

Nonetheless, as outlined in the rest of this chapter, the ML was able to overcome the imperfections found in the database and perform better than humans, basing its classification on selected measurement differences. Additionally, in the context of this study, ML enabled the isolation of different parts of the face and thus largely mitigated the problem of the lips being present in all human-perceived faces.

1.1 Machine Learning Models

The measurements and the correct responses were entered into the ML model, whose task was to learn to identify emphatics by cycling through the measurements and selecting the ones that were most likely to be associated with an AE or an NE response. Different methods were used. The decision tree function was applied to classify tokens into AEs and NEs. The forward-build and best-subset functions were utilized to develop the best ML models based on the best subsets of measurement differences between AEs and NEs.

1.1.1 Decision Trees

Decision trees are particularly useful in settings where there is a significant amount of data, providing a starting point that informs, for example, the choice of variable measurements. The decision tree function in R is an algorithm designed for classification and for reducing the number of candidate variables. It automatically scans all the variables, takes a subset of randomly selected features, and identifies the best one to split on for better performance. The trees in this investigation use the "class" method of the rpart (Recursive Partitioning and Regression Trees) package in R. In this study, decision trees were used as an exploratory tool for the best measurements selected by the ML for the classification of words as AEs or NEs. The training dataset consisted of all the AE and NE words and the measurement differences between AEs and NEs for each of the facial features. The splitting rule implemented to segregate the data was the Gini index, defined as:

Gini(t) = Σᵢ pᵢ(1 − pᵢ), where pᵢ is the relative frequency (the number of observations of class i divided by the total number of observations) of class i at node t (parent or child) at which a given split of the data is performed (Zambon et al., 2005). In Zambon and colleagues' wording, "the gini index is a measure of impurity for a given node that is at a maximum when all observations are equally distributed among all classes."

The pruning criterion applied minimizes cross-validation error with 10-fold cross-validation. This is an alternative to train-test-validate in a small-data setting.
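As a concrete illustration of the procedure just described, the sketch below grows and prunes such a classification tree in R with rpart, assuming a data frame emph_data with a two-level factor emphasis (AE/NE) and the measurement-difference columns as predictors; the object and column names are illustrative, not the author's script. Note that for a node split 50/50 between the two classes, Gini(t) = 0.5(1 − 0.5) + 0.5(1 − 0.5) = 0.5, the maximum impurity for two classes.

```r
# Sketch: classification tree on measurement differences (hypothetical data frame)
library(rpart)

set.seed(1)
fit <- rpart(emphasis ~ ., data = emph_data, method = "class",
             parms = list(split = "gini"),                    # Gini impurity splits
             control = rpart.control(xval = 10, cp = 0.001))  # 10-fold cross-validation

# Prune at the complexity parameter with the lowest cross-validated error,
# to avoid over-fitting the tree to the training words
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

printcp(pruned)                             # selected splits and cross-validated error
plot(pruned); text(pruned, use.n = TRUE)    # tree diagrams like Figures 4-3 and 4-4
```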


The decision tree function cycled through all the measurements described above and selected, for each vowel, the ones that led to the best categorization of the COIs into AEs and NEs. This function uses a recursive algorithm and continues to split the data until it ends up with 'pure' sets, i.e., perfect classification. That is to say, the algorithm keeps recycling through the measurements, adding nodes or leaves, until it achieves perfect identification. This 100% correct recognition of AEs and NEs, however, ends up being based on singleton subsets as the algorithm becomes too specific to the training data and loses its predictive power when applied to new data. To overcome this overfitting problem, the fully grown tree is pruned by removing the branches that would hurt performance on unseen data. Figure 4-3 shows the pruned tree for long [æ:], and Figure 4-4 the pruned tree for [u]:

Fig. 4-3: [æ:]-Decision Tree

Figure 4-3 shows the only measurement associated with [æ:], stored in the top node and split 50/50 between AEs and NEs. Based on the tree, the best measurement for the computer to predict emphasis for [æ:] is lip horizontal. If the lip horizontal measurement difference is smaller than or equal to -0.26 (a threshold generated by the ML), the algorithm picks the word with the NE on the left side of the tree, with 81% accuracy. If the measurement is greater than -0.26, the ML selects the AE on the right branch of the tree, with 94% accuracy. Figure 4-4 is the decision tree that shows the best measurements the machine selected for the identification of emphasis in the [u] environment:

Fig. 4-4: [u] Decision Tree

The best measurement selected by the computer for the recognition of AEs and NEs in the [u]-context was the vertical opening between the inner upper and inner lower lips. When the inner vertical lip measurement was smaller than 0.1, the machine classified the item as an NE with 62% accuracy. By adding the off-center lip vertical opening, accuracy of recognition increased to 90% in 24% of the cases. When the measurement was bigger than 0.1, the machine picked emphatic and was correct 69% of the time, but this happened on only 38% of the items.

Table 4-1 shows the best measurements selected by the decision tree function for each vowel, and the rate of accuracy for AEs and NEs:

Table 4-1: Decision Trees Accuracy Rates and Best Measurements

Vowel | Selected Features | Accuracy
[æ]   | Lip horizontal | NE: 88%; AE: 76%
[æ:]  | Lip horizontal | NE: 81%; AE: 94%
[i]   | Lip horizontal; lip vertical; off-center outer lip vertical | NE: 87%; AE: 100%
[i:]  | Lip horizontal; upper chin | NE: 89%; AE: 79%
[u]   | Middle inner lip vertical; off-center outer lip vertical | NE: 90%; AE: 69%
[u:]  | Upper lip thickness | NE: 61%; AE: 100%

In the context of short and long [æ], as well as short and long [i], the machine favored the lip horizontal measurement for classifying AEs and NEs, with a better than 80% rate of correctness, except in the case of short [æ], where accuracy dropped to 76%. In the context of [æ], lip horizontal was the only measurement that the algorithm selected in its categorization task, indicating its confidence in this dimension.

In the [i] environment, the function picked two vertical lip measurements (center and off-center) in addition to lip horizontal and was able to identify AE + [i] with 100% accuracy. A non-lip measurement, found in the chin width, was also chosen for the recognition of emphasis in the context of [i:], but the machine was more successful in recognizing NEs than AEs with [i-i:]. Using only the lip thickness measurement as an emphatic identifier, the computer was able to recognize AEs adjacent to [u:] with 100% correctness. It was not as successful at recognizing NEs co-occurring with [u:]; the computer was better at correctly classifying NE + [u] than AE + [u].

Five forward-build models for each vowel were generated as an additional means of assessing which measurements the ML relied on most in selecting AEs versus NEs.

1.1.2 Machine Learning Forward-Build Models

Instead of the training, test, and validation approach typically used when the goal is to select a preferred predictive model from a collection of possible models, the forward-build models and best-subset models were selected to minimize the AIC (Akaike Information Criterion). AIC is a criterion that penalizes the use of too many variables, which artificially inflate the model without improving its predictive power. This was chosen over cross-validation for speed and to accommodate the structure of the questions asked of the perceivers.

Forward-build models were constructed in the statistical platform R in order to model correctness based on measurement differences. These are iterative nested models, with the first iteration being the log likelihood of the base model that predicts correctness as a function of 'Speaker' and includes 'Word' and 'Perceiver' as random variables. The forward-build function cycles through a vector of available variables, i.e., the measurement differences between AEs and NEs, looking for the single best additional predictor; the selected variable is removed from the available measurements and added to the new model. The process is repeated, and the forward build keeps building new and improved models by adding, at each stage, the single variable that gives the greatest increase in log likelihood. The iteration stops when a new addition to the model produces an output that is not significantly better than the previous one, with a p-value greater than 0.05.
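A minimal sketch of this forward-build loop, assuming a data frame resp with a binary correct column, a speaker factor, grouping factors word and perceiver, and scaled measurement-difference columns whose names are listed in candidates (all object and column names are illustrative, not the author's code):

```r
library(lme4)

candidates <- c("lip_horiz", "lip_vert_outer", "open_area")   # hypothetical names
model <- glmer(correct ~ speaker + (1 | word) + (1 | perceiver),
               data = resp, family = binomial)

repeat {
  # Fit one candidate model per remaining measurement difference
  fits <- lapply(candidates, function(v)
    update(model, as.formula(paste(". ~ . +", v))))
  best <- which.max(sapply(fits, function(m) as.numeric(logLik(m))))

  # Likelihood-ratio test of the best candidate against the current model;
  # stop when the improvement is no longer significant (p > 0.05)
  p <- anova(model, fits[[best]])[2, "Pr(>Chisq)"]
  if (is.na(p) || p > 0.05) break

  model <- fits[[best]]
  candidates <- candidates[-best]
  if (length(candidates) == 0) break
}
summary(model)   # the forward-build "best model"
```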

For all the values, the differences were scaled by making the average difference for each speaker equal to zero and the standard deviation equal to 1; standardizing measurement differences controlled for the variability between speakers.

The logistic regressions presented in Tables 4-2 to 4-9 were generated by the best-subsets process. The algorithm checked every possible subset of explanatory variables and selected the best ones according to the AIC criterion, a measure of predictive power corrected for the number of variables used. Typically, regressions tend to improve after the addition of variables, even if the added variables have no practical relationship to the outcome. The AIC criterion corrects for overfitting by penalizing the model for having too many explanatory variables.
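For reference, the standard definition (not spelled out in the text) is AIC = 2k − 2 log L̂, where k is the number of estimated parameters and L̂ is the maximized likelihood; lower values are better. A quick R check of this identity on any fitted model:

```r
# AIC = 2k - 2*logLik; R's AIC() applies exactly this definition
fit <- lm(dist ~ speed, data = cars)    # any fitted model will do
k   <- attr(logLik(fit), "df")          # parameters counted by R (incl. residual variance)
2 * k - 2 * as.numeric(logLik(fit))     # identical to AIC(fit)
AIC(fit)
```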

As the ML learned to tell AEs and NEs apart across all vowels, the best features that led to the highest accuracy in identifying emphasis were the Lip Vertical Outer opening and the Open Area (the area of Lip Horizontal * Lip Vertical Outer), with about 74% accuracy. As seen in Table 4-2, the positive coefficient for Lip Vertical Outer signifies that the larger the vertical distance between the upper and lower lips is, the more likely the word is to be classified as an AE. The negative coefficient associated with the Open Area indicates that as the Open Area gets larger, the probability of classifying the current syllable as emphatic decreases:

Table 4-2: Best Model for Machine Identification of AEs

                 Estimate   Std. Error   z-value
(Intercept)          1.46         0.15      9.53
Outer lip vert       3.50         0.54      6.51
Open area           -4.28         0.59     -7.29
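To make the coefficients concrete, a token's predicted probability of being an AE follows from the usual inverse logit of the fitted linear predictor. The sketch below plugs in the Table 4-2 estimates; the input values are made up purely for illustration.

```r
# Probability that a token is an AE, using the Table 4-2 coefficients
p_ae <- function(outer_lip_vert, open_area) {
  logit <- 1.46 + 3.50 * outer_lip_vert - 4.28 * open_area
  1 / (1 + exp(-logit))                 # inverse logit
}
p_ae(0.5, 0)   # a larger outer-lip vertical difference pushes the prediction toward AE
p_ae(0, 0.5)   # a larger open-area difference pushes it toward NE
```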


Logistic binomial regressions were run to generate the best five models per vowel with the highest predictive power for correctness, each having picked different features used by the ML to categorize each syllable as AE or NE. The best five models included 'Emphasis' as the dependent variable and the measurements as explanatory variables. The analyses were performed with and without Speaker 4 (the investigator). Table 4-3 lists the top five models across vowels, with and without speaker 4; also included in the table are the percentages of correct identification of AEs. The best model is listed first.

Table 4-3: Best Measurements Selected in all Vowel Contexts

Models (all vowels) | Selected Features with S4 | Selected Features without S4 | % correct with/without S4
Model 1 | Lip Vert Outer; Open Area | Lip Horz; Lip Vert Inner | 74% / 70%
Model 2 | Lip Vert Outer; Inner Lips Vert; Open Area | Lip Horz; Lip Vert Side | 76% / 70%
Model 3 | Lips Horz; Lip Vert Outer; Open Area | Lip Horz; Lip Vert Inner Sq; Lip Vert Side | 76% / 71%
Model 4 | Lip Vert Outer; Lower Lip Thick; Open Area | Lip Horz; Lip Vert Inner; Lip Vert Inner Sq | 76% / 73%
Model 5 | Upper Cheek; Lip Vert Outer; Open Area | Lip Horz; Lip Vert Inner Sq | 74% / 70%

Overall, the most useful measurement for speakers 1, 2, and 3 is lip horizontal. When the most intelligible speaker was included, the machine preferred lip vertical and open area. Tables 4-4 to 4-9 provide the most predictive measurements per vowel for the ML's correct classification of each word as AE or NE.


Table 4-4: Best Measurements Selected for AE Identification in [æ]-Context

Models ([æ]) | Selected Features with S4 | Selected Features without S4 | % correct with/without S4
Model 1 | Lip Vert Outer; Lip Thick Upper; Open Area | Lip Vert Outer; Lip Thickness Upper; Open Area | 90% / 87%
Model 2 | Cheek Low; Lip Vert Outer; Open Area | Lip Horz; Lip Vert Outer; Open Area | 86% / 80%
Model 3 | Lip Vert Outer; Open Area | Cheek Low; Lip Vert Outer; Open Area | 86% / 80%
Model 4 | Chin Width Low; Lip Vert Outer; Open Area | Cheek Low; Lip Vert Outer; Open Area | 86% / 80%
Model 5 | Lip Vert Outer; Open Area; Lip Vert Side | Chin Width Low; Lip Vert Outer; Open Area | 84% / 80%

The ML used the same informative measurements with and without speaker 4 to categorize words with AEs, except for lip horizontal, which appeared once in Model 2.

Additionally, the non-lip measurements in Models 2-4 suggest the presence of speech information in the cheeks and chin that the ML used to recognize AEs.

The perfect ML recognition of AEs and NEs coarticulated with [æ:] caused the regression to fail to converge. In spite of the convergence issues, the models still show the variables chosen by the ML for identifying AE + [æ:].

Table 4-5: Best Measurements Selected in the [æ:]-Context

Models ([æ:]) | Selected Features with S4 | Selected Features without S4 | % correct with/without S4
Model 1 | Lip Horz; Lip Vert Side | Lip Horz; Lip Vert Side | 100% / 100%
Model 2 | Lip Horz; Lip Vert Inner | Cheek High; Lip Thick Upper | 100% / 100%
Model 3 | Lip Horz; Open Area | Lip Horz; Lip Vert Inner | 100% / 100%
Model 4 | Lip Horz; Lip Vert Outer | Lip Horz; Open Area | 100% / 100%
Model 5 | Chin Width Mid; Lip Horz; Lip Vert Inner | Cheek Low; Lip Thick Upper | 100% / 100%


The ML chose lip horizontal, in addition to other vertical dimensions and lip thickness, as the most useful measurements in the context of [æ:]. The models highlight the importance of the lips in all their different measurements, even though other useful information was found in the cheeks and the chin.

The spreading of the lips in the context of [i] also cued the ML in categorizing COIs as AEs. However, Table 4-6 indicates that the prominence of Lip Horizontal was mostly driven by speaker 4.

Table 4-6: Best Measurements in the [i]-Context

Models ([i]) | Selected Features with S4 | Selected Features without S4 | % correct with/without S4
Model 1 | Lip Horiz; Lip Vert Outer | Chin Width High; Chin Width Mid | 74% / 72%
Model 2 | Lip Horiz; Open Area | Chin Width High; Chin Width Mid; Lip Vert Side | 74% / 83%
Model 3 | Lip Horiz; Lip Vert Inner | Chin Width High; Chin Width Mid; Lip Horiz | 78% / 72%
Model 4 | Lip Horiz; Lip Vert Outer; Lip Thick Upper | Chin Width High; Chin Width Mid; Lip Thick Low | 74% / 72%
Model 5 | Lip Horiz; Lip Thick Upper; Lip Vert Inner | Chin Width High; Chin Width Mid; Lip Thick Upper | 74% / 78%

What distinguishes the [i]-models are the non-lip measurements, primarily in the chin, associated with speakers 1, 2, and 3. Lip horizontal, though, appears to be a very good predictor of emphasis for speaker 4.

As seen in Table 4-7, chin width is informative for the identification of AEs with [i], but even more so with its long counterpart [i:]. Lip vertical seems to be the common measurement found in almost all the [i:] models.


Table 4-7: Best Measurements in the [i:]-Context

Models ([i:]) | Selected Features with S4 | Selected Features without S4 | % correct with/without S4
Model 1 | Lip Vert Outer; Open Area | Chin Width Mid; Open Area | 87% / 65%
Model 2 | Chin Width Mid; Lip Vert Outer; Open Area | Lip Horiz; Lip Vert Inner Sq | 87% / 76%
Model 3 | Chin Width Low; Lip Vert Outer; Open Area | Lip Horiz; Lip Vert Inner | 87% / 65%
Model 4 | Lip Vert Outer; Open Area; Lip Vert Inner Sq | Chin Width Mid; Lip Vert Side | 78% / 71%
Model 5 | Chin Width Mid; Open Area | Chin Width Mid; Lip Vert Inner Sq | 78% / 76%

The measurement that distinguishes speaker 4 from the other speakers in the recognition of AEs + [i:] is the open area between the upper and lower lips.

Telling AEs and NEs apart in the [u]-context when speaker 4 was included was problematic: ML recognition was poor and did not perform better than chance.

Table 4-8: Best Measurements in the [u]-Context

Models ([u]) | Selected Features with S4 | Selected Features without S4 | % correct with/without S4
Model 1 | XXXXX | Chin Width Mid; Lip Thick Upper | 50% / 65%
Model 2 | Lip Thick Upper | Cheek Low; Chin Width Mid; Lip Thick Upper | 67% / 76%
Model 3 | Chin Width Mid | Chin Width Mid; Lip Horiz; Lip Thick Upper | 62% / 65%
Model 4 | Cheek High | Chin Width Mid; Chin Width Low; Lip Thick Upper | 43% / 71%
Model 5 | Chin Width Low | Chin Width Mid; Chin Vert; Lip Thick Upper | 48% / 76%


The thickness of the upper lip and the lip vertical inner square were the two most informative features for correct identification of AEs in the context of [u] when produced by speakers 1, 2, and 3. In addition to the lips, cheeks and chin were also informative.

Table 4-9 shows the best models for ML associated with long [u:]. Lip horizontal was the first measurement picked by ML in the model without speaker 4. Other vertical lip measurements as well as lip thickness were also found informative.

Table 4-9: Best Measurements in the [u:]-Context

Models ([u:]) | Selected Features with S4 | Selected Features without S4 | % correct with/without S4
Model 1 | Lip Vert Outer; Lip Vert Inner; Open Area | Lip Horiz; Open Area; Lip Vert Inner Sq | 85% / 87%
Model 2 | Lip Horiz; Lip Vert Outer; Lip Vert Inner | Lip Horiz; Lip Vert Outer; Lip Vert Inner Sq | 85% / 80%
Model 3 | Lip Thick Upper; Lip Thick Low; Open Area | Lip Horiz; Lip Vert Inner Sq; Lip Vert Side | 75% / 80%
Model 4 | Chin Width Low; Lip Thick Low; Open Area | Lip Horiz; Lip Thick Low; Lip Vert Inner Sq | 80% / 80%
Model 5 | Lip Thick Upper | Lip Horiz; Lip Vert Outer; Lip Vert Inner | 78% / 80%

In sum, even though most of the measurements in the best ML models involved the lips, the ML was nonetheless able to separate out individual features despite the presence of the lips in all facial conditions, and some information that helps identify AEs was found in the chin and cheeks. However, all of the first models that include speaker 4 contain lip measurements only, indicating that most of the useful information, as far as speaker 4 is concerned, resides in her lips. Based on the different lip parameters, excluding chin and cheeks, the ML was able to identify emphasis correctly at least 85% of the time, except in the contexts of [i] and [u].

1.1.3 General Interpretation of Machine Learning Measurements

A number of these findings echo results previously seen in the regressions, decision trees, and arrow plots. For example, machine classification of AEs in the [æ]-context was best, cheeks in the context of [i] were informative, and [u:], when coarticulated with AEs, is not visually salient. Across all vowels and by vowel, the recurrent measurements used by the machine for emphasis identification were measurements on the lips. For the vowel [æ], Lip Vert Outer and Open Area were sufficient to generate the best model. For the long vowel [æ:], Lip Horizontal, Lip Vertical Side, and Lip Vertical Inner, all three correlated, produced the best recognition. However, as previously mentioned, machine recognition of [æ:] was very good, and adding Chin Width and Cheek Low, as shown in Model 5, did not contribute much, since the lips were more than adequate for 100% correct recognition. The same variables associated with the lips were selected for the identification of [i] and [i:], with Lip Horizontal, Lip Vertical Outer, and Open Area being the most informative. However, without speaker 4, Chin Width seemed useful in the machine's recognition of [i, i:], as this variable is included in each of the five models. Furthermore, Chin Width Mid and Low have also been selected by the ML for the recognition of emphasis followed or preceded by [u]. Lip Thickness is the recurring theme in all models with [u:] when speaker 4 is included. For the long [u:], different parts of the lips were the only measurements used by the ML, except for Chin Width Low, which appears once in Model 4.


2 CONNECTION OF PRODUCTION-PERCEPTION OF EMPHASIS FOR PERCEIVERS

This section explores whether human perceivers’ accuracy of response can be predicted based on the same measurements given to the ML. The models developed were the best forward build models according to the AIC and “single_best” models.

2.1 Best-Forward-Build Models

As previously explained, the function successively adds, from the available variables, the variable that gives the greatest increase in the log likelihood of the model, until all likelihood-ratio tests comparing the candidate models to the model without a new variable have p-values greater than 0.05.

The “forward-build” output includes the member “best_model”, which gives the largest model that is a significant improvement on its predecessor in the forward-build process. In all cases, the output was a single-variable model, indicating that this single variable was the only one that significantly improved the predictive power of the base model, which included only ‘Speaker’ as a fixed effect and ‘Word’ and ‘Perceiver’ as random effects. Table 4-10 is a summary of the forward-build [æ]-best-model:

Table 4-10: Forward-Build [æ]-Best-Model

Number of obs: 1565; groups: words, 12; perceivers, 52

Fixed effects         Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)             0.6797       0.2526     2.691   0.00712  **
Speaker2                0.3845       0.2425     1.586   0.11278
Speaker3               -0.5048       0.1790    -2.820   0.00480  **
Speaker4                2.5716       0.2646     9.720   < 2e-16  ***
Lip vert inner sq      17.5600       6.4598     2.718   0.00656  **
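The kind of mixed-effects logistic regression summarized in Table 4-10 can be expressed in lme4 roughly as below; the data frame and column names are placeholders, not the author's code.

```r
# Sketch: perceiver correctness in the [ae] context as a function of speaker and
# the lip vertical inner-square difference, with random intercepts for word and perceiver
library(lme4)
m_ae <- glmer(correct ~ speaker + lip_vert_inner_sq +
                (1 | word) + (1 | perceiver),
              data = ae_data, family = binomial)
summary(m_ae)   # fixed-effect estimates comparable to Table 4-10
```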

In this model, the lip vertical inner square measurement was the single variable that significantly predicted perceivers' accuracy in the [æ]-environment. Table 4-11 provides a summary of the measurement variables selected by the 'best-model' forward build for each vowel:

Table 4-11: Best Models by Vowel

Vowel   Measurements           Pr(>|z|)
[æ:]    Upper lip thickness    0.00901  **
[i]     Lower lip thickness    < 2e-16  ***
[i:]    Lip vert inner sq      0.001690 **
[u]     Lip vert inner sq      0.0352   *
[u:]    No measurement found

For all vowels except [u:], the best-model included a single measurement variable that significantly improved the base model. The [u:] model suggests that perceivers did not find one reliable measurement that cued emphasis.

2.2 ‘Single-Best’ Function

The ‘single-best’ function outputs the single measurement that best predicts correctness.

Table 4-12 lists per vowel the single best predictive variable added to the last

iteration of the model with the maximum log likelihood:

Table 4-12: Single Best Variables Generated by the ‘Single-Best’ Function

Vowel   Single-Best Variable        p-value
[æ]     Lip Vertical Inner-Square   0.006
[æ:]    Upper Lip Thickness         0.009
[i]     Lower Lip Thickness         < 2e-16
[i:]    Cheek High                  0.004
[u]     Lip Vertical Inner-Square   0.03
[u:]    Open Area                   NA

The single best variable for predicting the perceivers' probability of correct identification of emphasis in the [æ]- and [u]-contexts is the lip vertical inner square. Lip thickness appeared to be a good predictor of correct identification of AEs. When the lip horizontal measurement is big, the lips are stretched and appear thinner than when the lip corners come closer together. Upper lip thickness seems to be a helpful identifier of emphasis in the [æ:]-context, and lower lip thickness aids in the recognition of emphasis in the [i]-context. The only non-lip measurement selected by the single-best function is Cheek High, in the [i:]-context. It is similar to the gesture associated with a broad smile, when the lips are maximally stretched: the fleshy part of the cheeks is pulled up and the cheeks are pulled out. There was no measurement that significantly predicted correct perception of emphasis in the [u:]-context. We have previously seen that [u:] masks emphasis, probably because of the extreme protrusion of the lips that takes place in both AE and NE environments. We notice that the measurements output by the two functions, the forward-build 'best-model' and the 'single-best' function, are almost the same, except for Cheek High, associated with long [i:] and selected by the 'single-best' function.

In sum, the small p-values in Table 4-11 reflect that the best single variables significantly impacted perceivers' correctness in all vowel contexts except [u:]. It is noteworthy that all the selected measurements were in the lips, except in the context of [i:], where the distance to the upper part of the cheeks was found informative. The models also show that perceivers' correctness was predicted by different features for AE words with the same vowel quality but different vowel length. This may imply that vowel duration in Arabic involves changes in vowel quality; there is evidence in the literature that short vowels are more centralized than long vowels, which tend to occupy a more peripheral space on the vowel chart. It is thought that speakers, due to temporal constraints on short vowels, undershoot and do not reach the vowel target, while the articulators in the production of long vowels have more time to attain the targeted configuration (Moon and Lindblom, 1994).


CHAPTER 5: CORRELATION BETWEEN MACHINE LEARNING AND HUMAN PERCEIVERS

The ML models capture the measurement differences estimated by the computer as most informative for classifying each token as AE or NE. The forward-build models capture the measurement differences that influence human perceivers to select the correct response. Both sets of models contribute to finding a potential correlation between the machine's and the perceivers' processes. To this end, the measurement differences between each word and its pair were inputted into the computer. Since we are looking at the perception of emphasis on the adjacent vowel, all analyses regarding emphasis consider identification of the COI in the context of each vowel. For example, we have established that the horizontal spreading of the lips, corner to corner, was the best measurement selected by the machine learning algorithm for classifying words as AE or NE in the context of [i]. The lip dimensions associated with NEs in the [i]-context were larger than those associated with AEs adjacent to [i]. Thus, subtracting the AE[i] measurement from the NE[i] measurement will result in a positive number. Given the minimal pair (AE[i]/NE[i]) in which, for example, NE is the correct response, the positive difference

(larger NE lip measurement) − (smaller AE lip measurement) > 0, i.e., (NE + [i]) − (AE + [i]) > 0,
indicates a higher probability that the computer will assign the word to the NE category.

If the correct answer is AE, a negative difference would cue AE:

(smaller AE lip measurement) − (larger NE lip measurement) < 0
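A toy version of this sign rule, following the [i]-context example above (the function name and input values are illustrative only):

```r
# In the [i]-context, NE tokens have the larger lip-horizontal value, so a
# positive (current word minus pair) difference points to NE, a negative one to AE
classify_by_sign <- function(lip_horiz_diff) {
  if (lip_horiz_diff > 0) "NE" else "AE"
}
classify_by_sign( 0.4)   # "NE"
classify_by_sign(-0.4)   # "AE"
```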

While the machine algorithm always subtracted the measurement for the current word, be it NE or AE, from the measurement of its pair, the regression used to generate the perceivers' models always subtracted the measurement associated with the NE token from the same measurement associated with the AE word; the formula was always (AE − NE). The arrow plots in Figure 2-5 provide a clear illustration: the arrows associated with speaker 4 show a big lip-horizontal measurement difference that is correlated with NE words. Going back to the same example used with machine learning, given a pair of words (NE + [i]/AE + [i]), the difference in the lip-horizontal measurement is negative, since the NE value is larger than the AE value; a negative difference between a word and its pair in the context of [i], based on the lip-horizontal measurement difference, cues NE.

On the other hand, the [æ:] lip-horizontal measurements are larger for AEs than for NEs. Therefore, the difference between two words on lip horizontal is positive in the context of [æ:] spoken by speaker 4: (bigger AE lip horizontal + [æ:]) − (smaller NE lip horizontal + [æ:]) > 0. In other words, watching a silent video of a speaker saying the word [sʕæ:r] and presented with the minimal pair [sʕæ:r]-[sæ:r] ('become'-'walk'), perceivers are more likely to click on [sʕæ:r] when the speaker's lips are more rounded.

Table 5-1 shows side by side the measurements selected by the ML algorithm in its categorization task and the humans’ best models informed by the measurement differences that predict the perceivers’ probability of correctness.

All vowels except [i:] and [u] are associated with some measurement informative to both the ML algorithm and human perceivers in the selection of AEs. The main measurement used more than once by both is the lip-horizontal difference. The differences in the open area also cued both the ML and the perceivers in selecting the AE answer. Also, for [i:], the computer and perceiver models selected a lip vertical measurement. However, the ML and the humans do not have anything in common in the [u, u:] environments, suggesting again the ambiguity in identifying AEs in the [u, u:]-context.

The gray highlighted measurements are the best predictors for both human and machine learning performance.

Table 5-1: Best Machine and Best Subsets Measurements

Vowel | Best Machine Model | Best Subsets Models
[æ]   | Lip Vert Outer; Upper Lip Thick; Open Area | Lip Horiz; Lip Vert Side; Open Area; Lip Vert Inner Sq; Chin Vert
[æ:]  | Lip Horiz; Lip Vert Side | Lip Horiz; Upper Lip Thick; Lower Lip Thick; Lip Vert Inner Sq; Chin Width Low
[i]   | Lip Horiz; Lip Vert Outer | Lip Horiz; Lower Lip Thick; Cheek High; Chin Vert
[i:]  | Lip Vert Outer; Open Area (highlighted because correlated) | Lip Horiz; Upper Lip Thick; Lip Vert Side; Cheek High
[u]   | Chin Width Mid; Lip Thick Upper | Lip Vert Inner Sq
[u:]  | Lip Vert Outer; Lip Vert Inner; Open Area | Not available

1 Correlation between Machine Learning and Human Perceivers

This section compares the perceivers’ probability of correctness and machine learning confidence in its classification of emphasis.


1.1 Procedures and Results

The correlation models were based on two sets of models: the perceivers' correctness-on-measurement models and the ML models. The perceivers' correctness models refer to the probability of perceivers' accuracy for each word; the ML models refer to the ML's confidence in its classification of each token as AE or NE. Both sets of models use the difference in measurements between AEs and NEs. Such a comparison was not possible for the long [æ:], since the computer was 100% correct in its classification of AEs + [æ:]; [u] was also omitted from this analysis because the machine learning failed to select an informative measurement to categorize words with [u]. Logit regressions were run on the best models for perceivers and machine per vowel. A logit is a transformation of the probability, changing a 0-to-1 scale, with 0.5 representing equal probability, to an open-ended scale (minus infinity to plus infinity), with 0 representing equal probability. In other words, when we talk about probabilities, we use a scale that is limited between 0 and 1, and we cannot express a very high probability much beyond 99%. A logit scale stretches the ends of the distribution, so that small probability differences, for example between 98% and 99%, are expanded and correspond to a larger difference on the logit scale.
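A quick numerical illustration of the transform (standard definition; the specific values are just examples):

```r
# logit(p) = log(p / (1 - p)): 0.5 maps to 0, and differences near the ends
# of the probability scale are stretched out
logit <- function(p) log(p / (1 - p))
logit(0.5)    # 0
logit(0.98)   # ~3.89
logit(0.99)   # ~4.60 -- a 1-percentage-point gap becomes ~0.7 logits
```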

The very small p-values of the logit linear regressions compiled in Table 5-2 indicate a significant correlation, showing a strong relation between the ML classification and perceivers' correctness across all vowels; in other words, the correct classification of a word as an AE or NE is highly correlated with the probability of the perceiver selecting the correct answer. The high correlation between machine learning and human likelihood of correctness is particularly seen in the contexts of [æ] and [i:]; in the [u:] environment, the relation between the perceivers' and the computer's confidence is weaker, yet still significant.

Table 5-2: ML Classification and Perceivers’ Correctness Correlation

                            Estimate   Std. Error   t-value   Pr(>|t|)
[æ]   Intercept                0.60        0.03      15.00    < 2e-16 ***
      Computer Confidence      0.086       0.002     28.81    < 2e-16 ***
      R-squared value          0.35
[æ:]  omitted
[i]   Intercept                0.718       0.02      24.09    < 2e-16 ***
      Computer Confidence     -0.024       0.01      -2.363   0.0183  *
      R-squared value          0.005
[i:]  Intercept                0.65        0.032     20.26    < 2e-16 ***
      Computer Confidence      0.185       0.01      17.60    < 2e-16 ***
      R-squared value          0.15
[u]   omitted
[u:]  Intercept                0.58        0.016     35.789   < 2e-16 ***
      Computer Confidence      0.009       0.004      2.02    0.0426  *
      R-squared value          0.002
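A sketch of the kind of per-vowel regression summarized in Table 5-2, assuming a data frame with one row per word containing the perceivers' probability of correctness and the ML's classification confidence, both on the logit scale (object and column names are illustrative, not the author's code):

```r
# Regress perceiver correctness (logit) on ML confidence for one vowel context
fit_ae <- lm(perceiver_logit ~ ml_confidence_logit,
             data = subset(corr_data, vowel == "ae"))
summary(fit_ae)$coefficients   # estimate, std. error, t-value, Pr(>|t|)
summary(fit_ae)$r.squared      # the R-squared values reported per vowel
```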

In spite of the variability in the models, as seen in the low R-squared values, a significant trend was established whereby the machine learning's emphasis identification is positively correlated with perceivers' correctness based on measurement differences between AEs and NEs. The same regression executed without speaker 4 still shows a significant correlation (0.48 with speaker 4, 0.37 without speaker 4), but a lower R-squared value.

The plot in Figure 5-1 shows the relationship between the ML's confidence in its classification and the probability of perceivers' correctness. For ease of interpreting the graph, I have divided the plot area into sections that I will refer to as follows:


 High Positive Correlation or HPC, on the upper right hand corner

 NC, no correlation

 MLC or Mid-Low Correlation, in the center of plot towards the bottom

 MHC or Mid-High Correlation, in the center of the plot towards the top

 50-50, the area near the zero point

The scale of the plots has to be taken into consideration with ‘0’ being equivalent to the

50% probability on a 0-1 scale.

The [æ]-plot ([a] in R) is based on the logit regression, with the vertical line

(probability based on perceivers’ models) and the horizontal line (ML models) crossing at the 50/50 probability point of the logit scale. AEs are identified with capital letters and

NEs with lowercase letters. Speaker 1 is represented in black, speaker 2 in red, speaker

3 in green, and speaker 4 in blue. The x-axis plots the machine learning confidence in its classification of a syllable and its pair as AE or NE based on the measurement differences that best predict AE + [æ]. The y-axis represents the perceivers’ probability of correctness.

Based on the [æ]-plot in Figure 5-1, most of the syllables are above the 50/50 probability lines. The blue cluster in the HPC area reflects a high correlation, mostly driven by speaker 4; it indicates that the computer's confidence in its classification predicts that perceivers are very likely to perceive the words in HPC correctly. The [t]s and [s]s in the NC corner of the plot, below the 50% line, are all associated with speakers 1 and 3 and indicate that, for these tokens, the machine has no confidence in accurately classifying emphatics, and the probability of the perceivers' correct identification is very low. There is a cluster of syllables produced by speaker 4 in the MHC area, where the correlation is fairly good, and another group in the MLC region that includes many of speaker 4's emphatic [t]s, a big cluster of speaker 3's [t]s and [d]s, and a few of speaker 1's [d]s.

The big cluster in the 50-50 area comprises most of speakers 1, 2, and 3's utterances; here, the computer's confidence in its classification of emphasis, as well as its predictive power for perceivers' correctness, is about 50%. We notice that while perceivers identify speaker 2 (in red) quite well, the machine learning is not confident in its classification of all of speaker 2's COIs.

Fig. 5-1: [æ] Probability of Correctness

(Scatter plot; x-axis: ML confidence in emphasis classification; y-axis: probability of perceivers' correctness; points colored by speaker 1-4.)


Is there a correlation between ML confidence in its classification of COIs and the probability that perceivers will correctly identify emphasis? Yes, but it is an imperfect correlation, typically driven by a set of words uttered by one or two of the speakers.


CHAPTER 6: DISCUSSION OF VISUAL PERCEPTION OF EMPHASIS

After exploring a number of variables controlled for in this investigation, we conclude that vowel context, speaker, and measurement differences inform human perceivers in their identification of AEs versus NEs. The parameters that impacted perception are the ones carrying the most information and the most salient differences between the two categories.

In the production of the vowel [æ], since the mouth is open, perceivers have at their disposal a view of the position of the tongue, which is more depressed when [æ] is adjacent to AEs and sits higher up in the vocal tract in the production of NEs; they also benefit from the visibility of the teeth, seen when [æ] coarticulates with NEs but not AEs, and from the configuration of the lips, which are more rounded with AEs than NEs. Additionally, there are no articulatory constrictions associated with [æ] that prevent the tongue from pulling back towards the pharynx. The vowel [i] also has distinctive salient features, such as the horizontal opening of the lips and the visibility of the teeth; but due to physical constraints imposed by the front position of the tongue and the pharyngealization of the consonant, which retracts the tongue into a back position, the distinction between AEs and NEs becomes less obvious than with [æ]. The rounded lips in [u] conceal the tongue and teeth in both AEs and NEs and do not provide perceivers with enough distinctiveness to help them tell AEs and NEs apart.

The speaker effect is connected to the vowel effect. How well a speaker enunciates impacts how distinctively he or she produces the vowel within and between the NE and AE categories. According to Halliday (1978), "social dimension denotes differences between the speech of different speakers, and the stylistic denotes differences within the speech of a single speaker." It is possible that the sociolinguistic background of the four speakers used in this study was reflected in the clarity of their speech. Speaker 4 is a language teacher and is used to producing clear utterances for the benefit of the students. Speakers' personalities are reflected in their speech styles.

Speaker 2 is a loud, outgoing person and speakers 1 and 3 are a lot more reserved.

This is speculative, but it appears that their personalities are reflected in their body language and, more particularly, their speech movements. This is not to say that every large movement is informative. Speaker 2 seems to have some idiosyncrasies in his production of NEs that confuse perceivers. Nonetheless, there is no doubt that the speaker effect impacts the amplitude of facial speech movements and affects the measurement differences between AEs and NEs.

I have investigated what measurement differences the ML uses to classify syllables into AEs or NEs. Overall, the ML performed better than humans, as shown in Figure 6-1, which compares the ML's performance with the performance of perceivers looking at lips only.

The comparison is based on the L condition only, because the machine learning algorithm was using primarily lip measurement differences in its classification task. The graph shows the percent of correct recognition by vowel for the best computer model and the percent of correct answers of perceivers in the L condition. The graph indicates that, except in the [u]-context, the computer's identification of emphasis was mostly in tune with speaker 4, as its performance best predicted perceivers' correctness of identification when speaker 4 was included. It appears that machine learning and perceivers of speaker 4 were capitalizing on the lips to identify AEs and NEs, and their performance was best in the [æ]-context. Machine learning outperformed humans in the [i:]-context, but perceivers did better in the [u]-context, where the lip configuration is very similar in the AE and NE environments. It must be that human perceivers were relying on other information, not available to the computer, to discriminate between AEs and NEs in the [u]-context.

Fig. 6-1: Machine Learning vs Human Performance on Recognition of AE

(Bar chart of percent correct recognition by vowel context [æ], [æ:], [i], [i:], [u], [u:], for four series: Machine with S4, Machine without S4, Humans with S4, Humans without S4.)

The machine learning algorithm had a big advantage over human perceivers in that it was given the answers, and its task was to correlate those answers for AEs with the best measurements that predict emphasis at the highest level of confidence. Based on lip-only measurements, correct responses were mostly driven by speaker 4, as confirmed by the regression models and the arrow plots. In the absence of the most intelligible speaker, perceivers, in trying to discriminate between AEs and NEs, still relied primarily on the lips, but also on cheek and chin measurement differences. We can confidently infer that there is a connection between the readability of the lips and the intelligibility of the speaker: the more informative a speaker's lips are, the better he or she is perceived. Speaker 3's style can be described as minimalist, as he moved his lips just enough to produce a distinction that is perceived acoustically but not visually. For COIs in the context of [æ], and particularly [æ:], the lips alone seem to be a more reliable feature across speakers, and for the machine learning as well.

On the other hand, perceivers had more than just the measurements to help them identify AEs and NEs: they had at their disposal all the potential cues embedded in the videos they watched. However, as much as watching the movies provided them with multiple additional cues not available to the computer, seeing the cropped speakers' faces introduced more noise and attention-deflecting elements that perceivers were not used to under normal circumstances. This may explain why showing the lower face degraded perceivers' ability to accurately identify emphasis.

The positive correlation between ML and human accuracy implies that ML can be used to model visual speech perception, but a bigger number of speakers would provide the algorithm with more samples and make the ML classification more powerful.

However, even with only four speakers, there was some overlap between the ML's and the perceivers' processes. Both the ML and humans identified AEs best in the [æ:]-context; both favored lip measurements over cheek and chin measurements; and both found speaker 4 to be the most intelligible and speaker 3 the least intelligible. Measurement differences extracted from speaker 4's production were associated with the highest probability of perceivers' correct identification of AEs and NEs, while measurements extracted from speaker 3's production were not as helpful for either perceivers or the ML; from this we can infer a meaningful relationship between production and perception. This is strong evidence that the measurement differences underlying the shapes that different facial features, and in particular the lips, take during the production of AEs and NEs encode useful information for perceivers and the ML. The more pronounced the differences, the better the perception.

To summarize, measurement differences provided the underlying basis for evaluating ease of identification for both the ML and humans. The statistical models used the size and consistency of the contrasts between AEs and NEs, based on a set of measurement differences in the long and short [æ], [i], and [u] vocalic environments, to model the difficulty of emphasis recognition for the perceiver. It is easier for perceivers to distinguish between two contrasting sounds when there is a large, systematic difference between them. As illustrated by the arrow plots in Figures 2-2 to 2-7, long arrows correspond to a large difference between AEs and NEs, and the sounds they represent are likely to be very well perceived; short arrows, on the other hand, correspond to contrasts between syllables that are hard to distinguish because the gestures involved in producing the two contrasting sounds are not distinct enough in amplitude and direction (with the AE measurement larger than the NE one, or vice versa).

As for the ML algorithm, its task was to decipher what kind of word it was looking at: AE or NE. All the differences in measurements between each word and its pair were inputted into the computer. By looking at several examples, it learned to make associations between the size of a certain measurement and the emphasis category it is most likely to be associated with. In other words, the ML learned to recognize that a large lip-horizontal measurement is associated with NEs and a small one with AEs; therefore, a word whose lip-horizontal difference from its pair is positive is likely to be categorized as an NE.

The implication is that the ML needed several examples of the same type to learn to make these associations and to select the measurements that consistently distinguish between a word and its pair. The more alike production patterns are across speakers, and the bigger the measurement difference between AE and NE, the higher the ML's confidence in its classification. This is the case, for instance, with the lip-horizontal measurement differences in the context of [æ], where the horizontal measurement is smaller in the articulation of AEs and larger in the production of NEs. The ML categorized AEs + [æ] with 90% accuracy, as opposed to COIs in the [u]-context, which were identified with 50% correctness. This is also reflected in the perceivers' better recognition of emphasis in the context of [æ] versus [u]. In a perfect world, where the differences between AE and NE for each speaker were similar, the machine would have been correct 100% of the time, because it would have been able to infer from more examples across speakers the relationship between measurement differences and the kind of word they are associated with; perceivers also would have had more success in identifying emphasis correctly, because their own production would have been more consistent with the speaker's, and they would have had to make fewer adjustments to different speakers' idiosyncrasies. Considering the variation in emphasis production, machine learning confidence in classification varied as a function of speaker and vocalic environment. Humans' performance also depended on these same variables, plus many others, including phonetic information.

The perceivers’ task was harder because they had to deal not only with speaker variability but also with their own experience regarding the production of AEs and NEs.

Perceivers may often have been confronted with contradictory information between their preconceived idea of what the contrast between AEs and NEs should look like, based on their own production of emphasis as native speakers, and what they saw in the videos. It is likely that perception is best when the articulation of the speaker perceivers are watching matches their own, i.e., if the speaker's pronunciation of emphasis meets their expectation. The perception-production effect is becoming increasingly established in the literature. For example, Fridriksson (2009), in his study on treating non-fluent aphasia, observed a significant improvement in speech production when patients with non-fluent aphasia were trained to focus on visual speech. Perceiver variability motivated the use of 'Perceivers' as a random effect in the logistic models presented earlier.

In conclusion, the strong correlation between perceivers and the ML is a clear indication that humans and the computer use some of the same processes to tell AEs and NEs apart. However, the low R-squared values suggest that the correlation is largely driven by a set of words for which the machine and the perceivers have a high level of certainty in the classification of items as AEs or NEs. There are, however, many other tokens with weaker associations between the machine's and humans' performances.

Results also confirmed the strong predictive power of the lip measurements for both machine and humans; the correspondence is not expected to be perfect, considering that humans bring to the task their own experience, which sways their answers one way or the other, and they have many more variables to consider in their selection.

In addition to measurement differences, variables such as vowel context, speaker, face, and many others not even considered in this study may help or interfere with accurate perception of emphasis.


CHAPTER 7: STATISTICAL OVERVIEW OF ARABIC GUTTURALS

The distinguishing characteristic of emphatics is the dental or coronal primary articulation accompanied by a secondary pharyngeal articulation. Part 2 of this dissertation investigates another set of phonemes classified as Arabic gutturals or AGs, known as the ‘pure’ pharyngeals and characterized by only one primary constriction at a point along the pharynx. The three sets of pairs being studied are the two uvular fricatives (/χ/, /ʁ/), a laryngeal glottal stop (/ʔ/) contrasted with the voiced pharyngeal approximant /ʕ/, and the glottal approximant (/h/) contrasted with the laryngeal approximant /ħ/. As stated earlier, what motivated the choice of these contrasts was the difficulty that learners of Arabic as a second language have in distinguishing between the AG pairs. Even though second language acquisition of pharyngeals is beyond the scope of this study, I am nonetheless setting the stage for future investigations.

The procedures and analyses followed to study the gutturals are similar to those adopted in the study of emphatics. However, the AEs and NEs fall clearly into two categories distinguished by the feature '+emphasis', which makes comparing and contrasting them from a procedural point of view straightforward. As for the AGs, the pair groupings don't follow a principled linguistic criterion based on, for example, manner or place of articulation; consequently, most of the analyses were done on each pair separately. Having to section the AG data into three pairs made the machine learning analysis inadequate, due to the reduced number of tokens given to the computer to learn to make associations between measurement differences and correct classification.

Therefore, machine learning will not be applied to the AGs. The following section presents an overview of the variables and their relationship to correctness of perception.


In Chapter 8, I will discuss the connection between the production and perception of AGs using decision trees, arrow plots, and forward-build models. Finally, Chapter 9 summarizes the procedures and findings on the visual perception of AGs and AEs and discusses some conclusions drawn from this study.

1 Overview: Descriptive Statistics

A total of 10,916 responses based on 72 minimal pairs, each containing one of three contrasts /ʕ, ʔ/, /χ, ʁ/, or /ħ, h/, were collected, yielding about 54% accuracy (5,015 incorrect versus 5,901 correct). This percentage, based on a Chi-squared test for given probabilities, is significantly better than chance: X2(1, N = 10,916) = 71.912, p < .001.

Without speaker 4 (who is the investigator), around the same percentage of 54% of the answers were correct (4475/8364), which is still significantly better than chance: X2(1, N

= 8,364) = 41.056, p < .001. In this set, the gap between speaker 4 and the other three speakers seemed to have narrowed.
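The first goodness-of-fit test reported above can be reproduced directly from the counts in the text (a quick sketch using R's built-in test):

```r
# 5,901 correct vs. 5,015 incorrect responses, tested against 50/50 chance
chisq.test(c(5901, 5015), p = c(0.5, 0.5))
# X-squared = 71.912, df = 1, p-value < 2.2e-16
```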

The three contrasts were examined with the COI occurring word-finally and word-initially in the context of short [æ], [i], and [u], and their long counterparts [æ:], [i:], and [u:]. There were a total of 24 minimal pairs for each contrast. The following section provides descriptive statistics pertaining to the guttural category.

 Perception by minimal pair

While the difference in consonant quality in the emphasis category did not have much of an effect on the perception of AEs and NEs, the Chi-square test shows a significant difference between the AG pairs in their effect on correct visual identification.

As reflected in Table 7-1, the /ʔ, ʕ/ pair was the easiest to perceive, followed by /ħ, h/; the percentage of correct responses to /χ, ʁ/, however, was near chance (50.56% correct), as confirmed by the Chi-Square test, X2(1, N = 3,624) = 0.44, p = 0.50. Since the pair /χ, ʁ/ had low visual salience and perceivers could not visually tell the difference between /χ/ and /ʁ/, this pair was excluded from the analysis.

Table 7-1 shows the percentages of correct and incorrect responses by pair.

Table 7-1: % and Number of Correct Answers by Consonant Pairs with and without Speaker 4

With Speaker 4                                 Without Speaker 4
Pair       Incorrect  Correct  % Correct       Incorrect  Correct  % Correct
/χ, ʁ/       1792       1832     50.56           1405       1431     50.46
/ħ, h/       1664       1985     54.40           1273       1488     53.89
/ʔ, ʕ/       1559       2084     57.20           1211       1556     56.23

Chi-Square with speaker 4: X2(2, N = 10,916) = 32.641, p < .001
Chi-Square without speaker 4: X2(2, N = 8,364) = 19.033, p < .001

In the reduced data set of 7,292 responses to 48 minimal pairs, words were identified with 56% accuracy (4,069 correct versus 3,223 incorrect), a rate that is significantly better than chance, X2(1, N = 7,292) = 98.151, p < 0.01.

 Perception by Speaker

In order to determine if speaker 4 was skewing the results as the investigator, speaker effect on perception is examined in Table 7-2 for the two pairs /ʔ, ʕ/ and /ħ, h/:

Table 7-2: % and Number of Correct Answers by Speaker

Speaker      Incorrect  Correct  % Correct
Speaker 1       922       1159       56%
Speaker 2       738        927       56%
Speaker 3       824        958       54%
Speaker 4       739       1025       58%

Chi-square with speaker 4: X2(3, N = 7,292) = 6.83, p = .08
Chi-square without speaker 4: X2(2, N = 5,528) = 1.81, p = 0.40


Even though perceivers were a little better at identifying gutturals spoken by speaker 4, the influence of speaker on the perception of AGs didn’t quite reach significance. The small speaker effect was next investigated for each pair.

 Speaker Effect on Perception of / ħ, h/

Table 7-3 shows the numbers and percentages of correct and incorrect answers for the /ħ, h/ pair by speaker; accuracy for speakers 1, 2, and 4 reached about 55%. Speaker 3 was significantly less well perceived in the /ħ, h/ contrast, as perceivers were not able to distinguish between his /ħ/- and /h/-words at better than chance.

Table 7-3: / ħ, h/ - % and Number of Correct Answers by Speaker

/ħ, h/       Incorrect  Correct  % Correct
Speaker 1       475        580       55%
Speaker 2       378        486       56%
Speaker 3       420        422       50%
Speaker 4       391        497       55%

Chi-square: X2(3, N = 3,649) = 8.43, p = .03

 Speaker Effect on Perception of /ʕ, ʔ/

Table 7-4: /ʔ, ʕ/- % and Number of Correct Answers by Speaker

/ʔ, ʕ/       Incorrect  Correct  % Correct
Speaker 1       447        579       56%
Speaker 2       360        441       55%
Speaker 3       404        536       57%
Speaker 4       348        528       60%

Chi-square: X2(3, N = 3,643) = 5.144, p = .16

As reflected in Table 7-4, perceivers were better at correctly identifying the /ʔ, ʕ/ contrast and reached 60% accuracy when watching speaker 4. Even speaker 3 was much better perceived in this pair than in the /ħ, h/ contrast; in fact, he was easier to understand than speakers 1 and 2. The difference between speakers, however, was not significant for /ʔ, ʕ/. The graph in Figure 7-1 shows the effect of speaker on perception of AGs.

Fig. 7-1: Perception of AGs by Speaker (percentage of correct responses for the /ʔ, ʕ/ and /ħ, h/ contrasts, by speaker)

 Perception by Consonant Quality

Even though the overall accuracy in identifying each guttural consonant individually does not exceed 60% for any of the COIs, the Chi-Squared statistic shows a significant difference in the correct recognition of the individual COIs. Table 7-5 indicates that /h/ and /ʕ/ have the highest rates of correct answers; in interpreting these results, we have to keep in mind the possibility that the seemingly better perception of /h/ and /ʕ/ could be the result of a bias towards these two consonants when perceivers were uncertain of the answer.


Table 7-5: % and Number of Correct Answers by Consonant

With S4   Incorrect  Correct  % Correct
/ħ/          912        906      49.83
/h/          752       1079      58.92
/ʔ/          805       1018      55.84
/ʕ/          754       1066      58.57

Chi-Square: X2(5, N = 10,916) = 39.168, p < .001

 Perception by Position

Table 7-6 reports the counts and percentages of correct and incorrect responses by COI position, word-initial or word-final.

Table 7-6: Effect of Consonant Position on Perception of /ħ, h/ and /ʕ, ʔ/

Position            /ħ, h/ initial   /ħ, h/ final   /ʕ, ʔ/ initial   /ʕ, ʔ/ final
Number correct            859             1019            1014            1070
Number incorrect          966              805              802             757
Percent correct           53%              56%              56%             59%

Chi-Square /ħ, h/: X2(1, N = 3,649) = 3.05, p = .08
Chi-Square /ʕ, ʔ/: X2(1, N = 3,643) = 2.77, p = 0.102

COI position appears to have a marginal effect on the perception of /ħ, h/ where word-final consonants were slightly better perceived than word-initial consonants.

 Perception by vowel quality of the / ħ, h/ contrast

Tables 7-7 and 7-8 examine the effect of vowel quality on the perception of each pair, /ħ, h/ and /ʕ, ʔ/.

Table 7-7: Effect of Vowel Quality on Perception of /ħ, h/

/ħ, h/   Incorrect  Correct  % Correct
[æ]         534        683       56%
[i]         574        641       53%
[u]         556        661       54%

Chi-Square: X2(2, N = 3,649) = 2.77, p = 0.24


Table 7-8: Effect of Vowel Quality on Perception of /ʕ, ʔ/

/ʕ, ʔ/   Incorrect  Correct  % Correct
[æ]         554        653       54%
[i]         450        772       63%
[u]         555        659       54%

Chi-Square: X2(2, N = 3,643) = 26.77, p < 0.01

Based on Tables 7-7 and 7-8, /ħ, h/ is better perceived in the [æ] context, but overall, vowel quality does not appear to significantly modulate the perception of this contrast. On the other hand, the vowel [i] seemed to have a significant effect on distinguishing between /ʕ/ and /ʔ/.

 Perception by Vowel Length

Tables 7-9 and 7-10 show the impact that vowel length might have on the perception of each pair /ħ, h/ and /ʕ, ʔ/.

Table 7-9: Effect of Vowel Length on Perception of /ħ, h/

/ħ, h/          Incorrect  Correct  % Correct
Short vowels        840        976       54%
Long vowels         824       1009       55%

Chi-Square: X2(2, N = 3,649) = 0.57, p = 0.44

Table 7-10: Effect of Vowel Length on Perception of /ʕ, ʔ/

/ʕ, ʔ/          Incorrect  Correct  % Correct
Short vowels        830        991       54%
Long vowels         729       1093       60%

Chi-Square: X2(2, N = 3,643) = 11.309, p < 0.01

Perception of /ʕ, ʔ/ is sensitive to vowel duration. Generally, AGs were better perceived in the context of long vowels than short ones. As shown in Table 7-11, COIs in the /ʕ, ʔ/ contrast had the highest rate of correct responses in the [i:] context, at 68%, followed by [i] at 58%. /ʕ/ and /ʔ/ were also better identified in the long [æ:] context than short [æ] (56% vs. 52%), and better with [u:] than [u] (55% vs. 53%):

Table 7-11: Long and Short Vowel Effect on Perception of /ʕ, ʔ/

        Incorrect  Correct  % Correct
[æ]        288        314       52%
[æ:]       266        339       56%
[i]        257        353       58%
[i:]       193        419       68%
[u]        285        324       53%
[u:]       270        335       55%

Chi-Square: X2(2, N = 3,643) = 43.21, p < 0.01

While the effect of vowel quality on perception of /ħ, h/ was not meaningful, there was a significant overall vowel effect, as reflected in Table 7-12. The results also show a difference related to vowel duration in the [u] context only.

Table 7-12: Long and Short Vowel Effect on Perception of /ħ, h/

        Incorrect  Correct  % Correct
[æ]        264        335       56%
[æ:]       270        348       56%
[i]        346        417       54%
[i:]       278        329       54%
[u]        230        224       49%
[u:]       276        332       55%

Chi-Square: X2(2, N = 3,649) = 25.01, p < 0.01

The graph in Figure 7-2 illustrates the vowel effect on perception for both pairs:


Fig. 7-2: Vowel Effect on Perception (percentage of correct responses for /ħ, h/ and /ʕ, ʔ/ in the [æ], [æ:], [i], [i:], [u], and [u:] contexts)

 Perception of AGs by Consonant + Vowel

Table 7-13 indicates a significant difference in the perception of the /ħ, h/ pair depending on the consonant-vowel combination:

Table 7-13: Effect of CV/VC on Perception of /ħ, h/

Vowel      Initial / short V        Initial / long V          Final / short V          Final / long V
[i, i:]    [ħi] 21%   [hi] 85%      [ħi:] 53%  [hi:] 54%      [iħ] 58%   [ih] 40%      [i:ħ] 51%   [i:h] 59%
[æ, æ:]    [ħæ] 69%   [hæ] 46%      [ħæ:] 38%  [hæ:] 68%      [æħ] 53%   [æh] 56%      [æ:ħ] 62%   [æ:h] 57%
[u, u:]    [ħu] 54%   [hu] 40%      [ħu:] 65%  [hu:] 42%      [uħ] 54%   [uh] 68%      [u:ħ] 71%   [u:h] 41%

X2(23, N = 3,649) = 256.74, p < 0.01

Huge discrepancies such as the one between /ħi/ and /hi/ are very suspicious and suggest a strong bias towards /hi/; the arrow plots in Chapter 8 confirm this hypothesis. If true, these results indicate that perceivers cannot tell the difference between /ħi/ and /hi/ based on visual information alone. The same might apply to other consonant-vowel combinations such as [ħæ] and [hæ], as well as their long cognates [ħæ:] and [hæ:].

Similar observations apply to the perception of CV/VC in the /ʔ, ʕ/ data shown in Table 7-14, especially in the [i] context:

Table 7-14: Effect of CV/VC on Perception of /ʔ, ʕ/

Vowel      Initial / short V        Initial / long V          Final / short V          Final / long V
[i, i:]    [ʕi] 47%   [ʔi] 74%      [ʕi:] 57%  [ʔi:] 68%      [iʕ] 71%   [iʔ] 39%      [i:ʕ] 75%   [i:ʔ] 74%
[æ, æ:]    [ʕæ] 49%   [ʔæ] 57%      [ʕæ:] 65%  [ʔæ:] 49%      [æʕ] 57%   [æʔ] 45%      [æ:ʕ] 62%   [æ:ʔ] 48%
[u, u:]    [ʕu] 47%   [ʔu] 51%      [ʕu:] 43%  [ʔu:] 64%      [uʕ] 67%   [uʔ] 65%      [u:ʕ] 62%   [u:ʔ] 53%

X2(23, N = 3,649) = 166.48, p < 0.01

Taking those numbers at face value, the higher percentages of correct identification seem to be associated with the vowel [i]: [ʔi] is favored in initial position and its pair in final position. The same appears to be true in the [u] context, where [ʔu] and [ʔu:] are better identified word-initially and [uʕ] and [u:ʕ] word-finally.

These raw data are difficult to interpret. It remains to be seen whether the higher percentages in many pairs reflect genuinely better identification or perceiver bias.

 Perception by Face Condition

Similar to the emphatic category, the Chi-Square analysis shows that face condition does not impact perception of AGs, even though in the condition where the whole face is visible (AF), guttural identification is better. Table 7-15 shows the results of perception by face across speakers:


Table 7-15: % and Number of Correct Answers by Face

           /ħ, h/                                /ʕ, ʔ/
         Incorrect  Correct  % Correct        Incorrect  Correct  % Correct
AF          255        354       58%              251        362       59%
LwF         278        328       54%              260        351       57%
L           291        316       52%              285        320       53%
LCkCn       284        325       53%              245        361       60%
LCk         270        336       55%              264        338       56%
LCn         286        326       53%              254        352       58%

Chi-Square /ħ, h/: X2(5, N = 3,649) = 5.61, p = 0.34
Chi-Square /ʕ, ʔ/: X2(5, N = 3,643) = 7.31, p = 0.19

In short, statistical tests indicate that the AGs were overall perceived significantly better than chance; however, the uvular pair /χ, ʁ/ was dropped from the analysis due to its poor visual salience. Of the two remaining pairs, /ʔ, ʕ/ was better identified than /ħ, h/. Unlike the AEs and NEs, where the speaker effect was highly significant, there was no systematic difference in how well each speaker's AG production was perceived.

In the perception of / ħ, h /, speakers 1, 2, and 4 were perceived with about 55% accuracy; the /ħ/ and /h/ of speaker 3 were indistinguishable based on the visual signal alone. The /ʕ, ʔ/ contrast was better identified, especially with speakers 4 and 3.

Overall, /h/ and /ʕ/ were better perceived than their pairs /ħ/ and /ʔ/, but it is unclear whether this was due to a bias towards the /h/ and /ʕ/ or whether these two consonants contain optical cues that make them easier to recognize. Consonant position was found to be marginally significant for the / ħ, h / pair with the COIs at the end of the word significantly better perceived than at the beginning of the word. No position effect was found for the /ʕ, ʔ/ pair.

Vowel quality and length only impacted recognition of /ʕ, ʔ/, especially in the context of [i]; COIs in the /ʕ, ʔ/ contrast were better recognized next to all three long vowels than next to their short counterparts. Different CV and VC combinations also appear to have an influence on the perception of gutturals, but no generalizable patterns were found. No face effect was found either.

The significant variables, such as vowel quality and length, appear to influence one of the two pairs but not the other. The next section explores in more depth the relationships and interactions between variables.

2. Variable Interactions

The goal of this section is to investigate in depth how variable interactions modulate the accuracy of perceivers' identification of AGs within each pair. Binomial mixed-effects logistic regression models were fitted by maximum likelihood estimation using lme4 in R. For the sake of comparison, the variables included to model perception of AGs were the same as the ones used with AEs and NEs, namely:

 vowel quality (long and short [æ], [i], [u])

 vowel length (short or long)

 consonant quality (/ħ/, /h/, /ʔ/, /ʕ/)

 consonant position (word-initial, word-final)

 speaker (S1,S2, S3, S4)

 face (AF: all face; LwF: from bottom of eyes to neck; L: lips only; LCkCn: lips, cheeks, and chin; LCk: lips and cheeks; LCn: lips and chin).

All models included ‘Perceivers’ and ‘Words’ as random variables.
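As a rough sketch of how such a model can be specified with lme4 in R, the code below fits a binomial mixed-effects regression with 'Perceiver' and 'Word' as random intercepts. The simulated data frame and all of its column names are placeholders for illustration only, not the study's actual variables or data.

    library(lme4)

    # Placeholder data with the same general structure as the perception responses
    set.seed(1)
    ag_data <- data.frame(
      correct   = rbinom(500, 1, 0.55),
      vowel     = sample(c("ae", "i", "u"), 500, replace = TRUE),
      length    = sample(c("short", "long"), 500, replace = TRUE),
      consonant = sample(c("h", "H"), 500, replace = TRUE),
      position  = sample(c("initial", "final"), 500, replace = TRUE),
      speaker   = sample(paste0("S", 1:4), 500, replace = TRUE),
      face      = sample(c("AF", "LwF", "L", "LCkCn", "LCk", "LCn"), 500, replace = TRUE),
      perceiver = sample(paste0("P", 1:20), 500, replace = TRUE),
      word      = sample(paste0("W", 1:24), 500, replace = TRUE)
    )

    # Binomial mixed-effects model: fixed phonetic/facial predictors,
    # random intercepts for perceivers and words
    m_full <- glmer(correct ~ vowel + length + consonant + position + speaker + face +
                      (1 | perceiver) + (1 | word),
                    data = ag_data, family = binomial)
    summary(m_full)  # estimates, standard errors, z values, Pr(>|z|), as in Table 7-16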

 /ħ, h/ models

After building a comprehensive model that included all the variables (vowel quality and length, consonant quality and position, consonant-vowel combinations, speaker, and face), 'Speaker' and 'Face' were the only two meaningful effects found to influence the perception of /ħ, h/. Consonant quality was marginally significant and was included in the best /ħ, h/ model, which contained only the significant effects. The best /ħ, h/ model is summarized in Table 7-16.

Table 7-16: Best /ħ, h/ Model

Number of obs: 3,649; Groups: Word, 24; Perceiver, 53

Fixed effects   Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)     -0.11306     0.15492     -0.730    0.46554
Speaker1         0.22602     0.09672      2.337    0.01945 *
Speaker2         0.28615     0.10115      2.829    0.00467 **
Speaker4         0.26229     0.10061      2.607    0.00914 **
/ħ/             -0.40096     0.22355     -1.794    0.07287 .
AF               0.25294     0.11940      2.112    0.03471 *
LwF              0.10341     0.078973     0.866    0.38642
LCkCn            0.05676     0.11934      0.476    0.63434
LCk              0.13678     0.11932      1.146    0.25166
LCn              0.07096     0.11916      0.596    0.55154

The reference levels in the regression are speaker 3, COI /h/, and face condition L. Compared to the reference speaker, when controlling for face and consonant quality, speakers 1, 2, and 4 were significantly better perceived; /h/ was a little better identified than /ħ/. The positive coefficients associated with all face conditions indicate that the lips alone do not carry enough information to accurately differentiate between /h/ and its pair; the whole face (AF) was significantly better perceived than L, and all other facial conditions trended in the same direction.

Post hoc models were developed for a closer look at the speaker effect. The model summarized in Table 7-17 has speaker 3 and /h/ set as the default values. It shows that, in the /h/ context, speaker 3 is marginally better perceived than speaker 4 and /h/ is significantly better identified than /ħ/; however, the positive interaction terms show that, relative to speaker 3, speakers 1, 2, and 4 are significantly more intelligible when producing /ħ/.

Table 7-17: Effect of Speaker*Consonant Interaction on Perception of /ħ, h/

Number of obs: 3649; Groups: word ID, 24; User ID, 53

Fixed effects    Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)      0.46806     0.18030      2.596    0.009430 **
Short vowels     0.009896    0.310502     0.032    0.9746
Speaker1        -0.06188     0.13835     -0.447    0.654696
Speaker2         0.01279     0.14453      0.088    0.929489
Speaker4        -0.26848     0.14264     -1.882    0.059806 .
/ħ/             -0.95933     0.25528     -3.758    0.000171 ***
/ħ/*Speaker1     0.57957     0.19534      2.967    0.003007 **
/ħ/*Speaker2     0.54886     0.20399      2.691    0.007133 **
/ħ/*Speaker4     1.06430     0.20277      5.249    1.53e-07 ***

A model examining the speaker-face interaction in the perception of /ħ, h/ showed only one significant effect: relative to speaker 3 in the L condition, speaker 1 in the LCkCn condition was significantly better identified (p = 0.01).

A post hoc model showed some significant effects of position in the perception of /ħ/ alone. Based on the Chi-square tests, /ħ, h/ pairs were better perceived when the COI was at word offset than at word onset. This effect was obscured when other factors were introduced. However, restricting the data to the /ħ/ words, the Chi-squared results concerning consonant position were confirmed, as seen in Table 7-18:

Table 7-18: Position Effect on Perception of /ħ/

Number of obs: 1,818; Groups: Perceivers, 52; Words, 12

Fixed effects      Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)          0.4100     0.1572       2.608    0.00911 **
Initial Position    -0.8413     0.2076      -4.053    5.06e-05 ***

The negative coefficient associated with initial position indicates that words ending in /ħ/ were much better perceived than words beginning with it.


 /h/-only Models

Several models were developed exploring interactions and significant effects in the /h/ context. Restricting the data to /h/ words only, the only meaningful effect found was related to the position of /h/ within the word.

Table 7-19: Position Effect on Perception of /h/

Number of obs: 1,831; Groups: Perceivers, 52; Words, 12

Fixed effects      Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)         0.09427     0.20633      0.457    0.6477
Initial Position    0.62986     0.27953      2.253    0.0242 *

The model summarized in Table 7-20 was built to examine whether the consonant-vowel interactions can be attributed to goodness of perception or to a bias towards one COI or the other. The results support the bias hypothesis: the only significant effect among the CV and VC combinations is for the [ħi] syllable, which was the least well perceived relative to the default [hæ]. The /ħ, h/ contrast is indeed very difficult to perceive, and in a forced-choice task, perceivers tended to choose /h/ over /ħ/.

Table 7-20: Vowel Consonant Effect on Perception of /ħ, h/

Number of obs: 3,649; Groups: Word, 24; Perceiver, 53

Fixed effects   Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)      0.50800     0.34460      1.474    0.1404
[hæ:]            0.02517     0.48673      0.052    0.9588
[hi]             0.16154     0.44654      0.362    0.7175
[hi:]           -0.41595     0.48594     -0.856    0.3920
[hu]            -0.36304     0.59454     -0.611    0.5414
[hu:]           -0.39402     0.48664     -0.810    0.4181
[ħæ]            -0.52143     0.48615     -1.073    0.2835
[ħæ:]           -0.51051     0.48641     -1.050    0.2939
[ħi]            -0.98531     0.48874     -2.016    0.0438 *
[ħi:]           -0.26140     0.48620     -0.538    0.5908
[ħu]            -0.62099     0.48633     -1.277    0.2016
[ħu:]           -0.23297     0.48711     -0.478    0.6325


In sum, upon examining the perception of the /ħ, h/ pair, the significant effects were few. Speakers 1, 2, and 4 were found to be significantly better perceived in the /ħ, h/ contrast than speaker 3; the /ħ/-only and /h/-only models suggest that this significance was driven by the perception of /ħ/, which also showed a strong speaker effect; the accuracy of the answers associated with /h/, on the other hand, was not affected by who the speaker was. Searching for the source of the significance found in the Chi-square with regard to consonant position, the regression models showed that /ħ/ was significantly better perceived at the end than at the beginning of the word; the opposite was true for its pair /h/, but the effect was not as strong. The whole face, and the lips, cheeks, and chin together, carried speech information that aided perceivers in recognizing the /ħ, h/ contrast. The model examining the CV interactions confirmed the difficulty perceivers had in distinguishing between /h/ and /ħ/ and their tendency to click on the /h/ answer when uncertain. Even though /ħ/ was associated with more significant effects, such as face, speaker, and position, it was not as easy to recognize as its cognate /h/.

The next section explores meaningful effects on the perception of the /ʔ, ʕ/ contrast which according to the Chi-Square test was better perceived than /ħ, h/.

 /ʔ, ʕ/ Models

The comprehensive model that included all the variables returned no meaningful vowel or speaker effects on the perception of /ʔ, ʕ/. There was a face effect showing that the whole face (AF) significantly improved correct identification of the /ʔ, ʕ/ contrast, and seeing the lips and cheeks also helped (p = .03 for both AF and LCk). The Chi-squared analysis, however, revealed consonant quality, vowel quality, and vowel length effects that were not revealed by the regressions.

To further explore these effects, the best /ʔ, ʕ/ model summarized in Table 7-21 was developed; it includes the variables Face, Vowel Quality, Vowel Length, and Consonant Quality.

Table 7-21: Best /ʔ, ʕ/ Model

Number of obs: 3,643; Groups: Word, 24; Perceiver, 52

Fixed effects   Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)      0.188809    0.195257      0.967    0.3336
[æ]             -0.009077    0.194405     -0.047    0.9628
[i]              0.390070    0.195285      1.997    0.0458 *
Short Vowels    -0.235077    0.159203     -1.477    0.1398
/ʔ/             -0.107457    0.159222     -0.675    0.4997
AF               0.243865    0.118703      2.054    0.0399 *
LwF              0.173968    0.118359      1.470    0.1416
LCkCn            0.250162    0.119129      2.100    0.0357 *
LCk              0.119769    0.118657      1.009    0.3128
LCn              0.166265    0.118921      1.398    0.1621

The model still shows no evidence of significant effects of vowel length or consonant quality. It does show, however, that the /ʔ, ʕ/ contrast was best perceived in the context of [i]. The face effect associated with improved perception in the AF and LCkCn conditions persists. A model examining the speaker-face interaction did not show any speaker effect.

Since the vowel [i] was found to significantly impact perception, a model was run to explore whether this effect was restricted to certain consonants. The regression summarized in Table 7-22 shows that, relative to the default [ʕi:], [ʔi] and [ʔi:] were not significantly better perceived. It also indicates that [ʕi:] was better identified than [ʔæ:] and [ʔu], and to a lesser extent, [ʔæ].


Table 7-22: Vowel Consonant Effect on Perception of /ʔ, ʕ/

Number of obs: 3,643; Groups: Word, 24; Perceiver, 52

Fixed effects   Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)      0.6803     0.2526       2.693    0.00707 **
[ʕæ]            -0.5583     0.3533      -1.580    0.11402
[ʕæ:]           -0.1185     0.3541      -0.335    0.73787
[ʕi]            -0.2902     0.3538      -0.820    0.41201
[ʕu]            -0.3986     0.3532      -1.128    0.25915
[ʕu:]           -0.5717     0.3534      -1.618    0.10567
[ʔæ]            -0.6312     0.3527      -1.790    0.07353 .
[ʔæ:]           -0.7419     0.3529      -2.103    0.03551 *
[ʔi]            -0.3947     0.3548      -1.113    0.26590
[ʔi:]            0.2285     0.3563       0.641    0.52124
[ʔu]            -0.7016     0.3528      -1.988    0.04677 *
[ʔu:]           -0.3476     0.3532      -0.984    0.32500

 /ʕ/-alone effects

Additional models investigated effects on the perception of /ʕ/ alone. No meaningful face, vowel quality, or vowel length effects were found; however, speaker 4 was significantly better perceived in the context of /ʕ/ (p = 0.04). Table 7-23 shows the effect of /ʕ/ position on perception: /ʕ/-final words were significantly better perceived than /ʕ/-initial words.

Table 7-23: Position Effect on Perception of /ʕ/

Number of obs: 1,820; Groups: Perceivers, 51; Words, 12

Fixed effects      Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)          0.7061     0.1477       4.780    1.76e-06 ***
Initial Position    -0.6484     0.1783      -3.636    0.000277 ***

/ʔ/ only Models

Looking at accuracy of perception for /ʔ/ only, the facial effects noticed in the identification of the pair as /ʔ/ or /ʕ/ are replicated in the /ʔ/-only model: AF, and to a certain degree LCk, contributed to better distinguishing /ʔ/ from /ʕ/. The model is summarized in Table 7-24:

Table 7-24: Face Effect in /ʔ /-only Model

Number of obs: 1,823; Groups: Word, 12; Perceiver, 52

Fixed effects   Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)      0.09556    0.18249       0.524    0.6005
AF               0.35996    0.17201       2.093    0.0364 *
LwF              0.07992    0.16812       0.475    0.6345
LCkCn            0.29303    0.16866       1.737    0.0823 .
LCk              0.13658    0.16638       0.821    0.4117
LCn              0.07856    0.17154       0.458    0.6470

No other effects were uncovered in the /ʔ/-alone models. Table 7-25 provides a summary of all the effects found in relation to the AGs individually, and in pairs:

Table 7-25: Summary of AGs Effects on Perception

               /ħ, h/                          /ħ/            /h/     /ʕ, ʔ/     /ʔ/        /ʕ/
Face           AF, LCkCn                       AF, LCkCn      none    AF, LCk    AF, LCk    none
Speaker        S1, 2, 4 > 3                    S1, 2, 4 > 3   none    none       none       S4
Consonant      /ʔ, ʕ/ > /ħ, h/; /h/ > /ħ/      none           none    none       none       none
Position       none                            End > Start    none    none       none       End > Start
Vow Quality    none                            none           none    [i]        [i]        [i]
Vow Length     none                            none           none    none       none       none

Compared to AEs and NEs, the speaker effect was not as robust in the AG category. Speaker 3 was not intelligible enough for perceivers to tell the difference between /ħ/ and /h/, but he was better understood in the /ʔ, ʕ/ videos, and speaker 4 was more intelligible than the other speakers in the context of /ʕ/ only. A vowel effect was also absent except when [i] was associated with the /ʔ, ʕ/ pair. The lack of a vowel effect may have an articulatory explanation: the pharyngeal constriction can be produced with hardly any movement of the lips, which are usually salient in the articulation of vowels. [i] is a high front vowel, the furthest of the three from the pharyngeal cavity. The articulators, in anticipation of the trajectory from the pharynx to the front of the oral cavity, partly take the configuration of [i] for a seamless transition from the guttural COI to the front vowel.

Both /ħ/ and /ʕ/ were significantly better perceived word-finally. Based on my own observations as a native speaker, the differences are acoustically very subtle, and the pharyngeal frication or approximation associated with these consonants tends to be held a little longer word-finally for the sounds to be audible, especially in the case of /ħ/. It may be that this lengthening of the consonants contributes to better visual perception.

No facial condition was found to affect the perception of emphatics, most likely because the lips carry most of the speech information. On the other hand, identification of AGs was consistently better when the whole face was visible and when the cheeks could be seen. Combined with the lack of a vowel effect, this suggests that the lips alone are not sufficient to visually differentiate between the AGs.

Overall, very few variables impacted the visual perception of AGs. Perhaps the facial measurements will reveal what parts of the face, if any, enhance the optical information associated with AGs. The next chapter explores measurements of the lips, cheeks, and chin in the production of AGs and their connection to perception.


CHAPTER 8: PERCEPTION-PRODUCTION CONNECTION IN ARABIC GUTTURALS

Following a procedure similar to the one applied to the emphatic category, three approaches that complement and validate one another were adopted to tie measurements to the perception of AGs. As an exploratory tool, the decision tree function was run on the AG data to get a general idea of which measurements are potentially relevant to the perception of the gutturals. The arrow plots provide a visualization of the differences in measurements between a COI and its pair, and the 'single_best' and best forward-build models were used to identify the best predictors among the measurements of the lips, cheeks, and chin.

1 Decision Trees

To review, the decision tree function is a classification algorithm. In the context of this study, the algorithm cycles through all the measurements taken during the production of AGs by all four speakers and selects the one that best separates an AG consonant from its pair. The machine continues to refine its selection by adding more variables that improve the model, and the trees are pruned to avoid overfitting.
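A minimal sketch of this kind of classification tree, using the rpart package in R, is given below; the simulated measurements, their names, and the label are hypothetical stand-ins chosen only so the example runs and produces a split, not the study's data or its exact tree settings.

    library(rpart)

    # Placeholder scaled measurements; the label is loosely tied to one of them
    # so that the example tree has at least one split.
    set.seed(1)
    lip_vertical_outer <- rnorm(300)
    cheek_low          <- rnorm(300)
    chin_low           <- rnorm(300)
    coi <- factor(ifelse(lip_vertical_outer + rnorm(300, sd = 0.8) > 0, "h", "H"))
    ag_meas <- data.frame(coi, lip_vertical_outer, cheek_low, chin_low)

    # Grow a classification tree, then prune back to the subtree with the
    # lowest cross-validated error
    tree  <- rpart(coi ~ ., data = ag_meas, method = "class")
    ptree <- prune(tree, cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])
    print(ptree)                          # text summary of the selected splits
    plot(tree); text(tree, use.n = TRUE)  # plot the unpruned tree for comparison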

Figures 8-1 and 8-2 represent the /ħ, h/ and /ʔ, ʕ/ decision trees respectively, generated across vowels on scaled measurements.


Fig. 8-1: /ħ, h/ Decision Tree by Scaled Measurements

*In Fig. 8-1, /H/ represents the pharyngeal /ħ/.

The best scaled measurement selected by the algorithm for the identification of the /ħ, h/ contrast is the vertical measurement between the outer lips. When this vertical dimension was smaller than -1.2, the machine selected the /h/ answer with 86% accuracy, which was the case for 8% of the /ħ, h/ data. In 23% of the instances, accurate perception of /h/ reached 58% when the vertical measurements of the inner lips and the off-center lip opening were added. No measurements in the cheeks and chin were found to enhance perception of /h/. In 92% of the utterances with the /ħ, h/ contrast, the vertical inner lip opening was larger than -1.2, and the machine picked the /ħ/ answer with 53% accuracy. By adding other lip measurements, such as the opening between the inner lips in the center and the off-center vertical dimension, correct identification of /ħ/ was boosted to about 88%, but only in 10% of the cases. Most of the /ħ/s (43%) were recognized with 67% accuracy. The branching of the tree into several leaves indicates that no measurement stood out as the one measurement perceivers could rely on to accurately discriminate between /ħ/ and /h/. The low percentage of cases in which the machine reached better than 80% correct recognition of /ħ/ suggests that correctness in these cases was driven by one, or maybe two, speakers.

The regressions performed on the /ʔ, ʕ/ data suggested that perception significantly improved when perceivers saw the whole face (AF) and in the LCkCn condition, where the lips, cheeks, and chin were visible. Figure 8-2 shows the function's best selected measurements for correct identification of /ʔ, ʕ/. The decision tree provides cross-validation of the regression output in Table 7-21, as both the width of the lower cheeks and the lower chin were selected as informative measurements that help discriminate between /ʔ/ and /ʕ/. That this tree has fewer branches than the /ħ, h/ tree reflects the fact that the /ʔ, ʕ/ contrast is better perceived than the /ħ, h/ pair.


Fig. 8-2: /ʔ,ʕ/ Decision Tree by Scaled Measurements

The arrow plots in the following section provide a visualization that will allow us to better understand the difficulty perceivers had in seeing the difference between each COI and its pair.

2 Arrow Plots

Two sets of arrow plots were selected as a sample: the lower cheek plot was chosen to represent the best measurement differences between each COI and its pair, and the lip horizontal plot, the worst. The selection of the plot in Figure 8-3 was informed by the evidence provided by the regressions and decision tree indicating the positive effect that the lower cheeks have on the perception of the /ʔ, ʕ/ contrast. In the plots, /j/ stands for the glottal stop /ʔ/, and /g/ for the pharyngeal /ʕ/. Each arrow represents the difference in the lower cheek measurement per vowel, once in initial position and once in final position. The longer the arrow, the larger the difference in lower cheek width between the productions of /ʔ/ and /ʕ/; the larger measurement is associated with the consonant at the top of the arrow. Generally speaking, the lower cheeks are wider during the /ʕ/ articulation (represented as /g/ in the plot), and the long arrows, especially in speakers 2 and 4's plots, are associated with the vowel [i].

Fig. 8-3: Lower Cheek Measurement Differences Between /ʕ/ and /ʔ/


The plot in Figure 8-4 illustrates the lack of information offered by the lip horizontal measurement in helping to differentiate between /ʕ/ and /ʔ/:

Fig. 8-4: Lip Horizontal Measurement Differences Between /ʕ/ and /ʔ/

With a few exceptions, the arrows representing the difference in lip horizontal width are very short, indicating that the measurement that was most informative in the AE category probably has no effect on perception in the AG category.
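For readers who wish to reproduce this kind of display, the base-R sketch below draws one arrow per syllable cell, running from the /ʔ/ value to the /ʕ/ value of a measurement. All values here are randomly generated placeholders and the labels are illustrative; the sketch shows only the plotting idea, not the study's data.

    # Placeholder lower-cheek widths for four syllable cells (not real measurements)
    set.seed(3)
    cells      <- c("i-initial", "i-final", "ae-initial", "ae-final")
    glottal    <- runif(4, 0.2, 0.5)                 # /ʔ/ values
    pharyngeal <- glottal + runif(4, -0.1, 0.3)      # /ʕ/ values

    plot(seq_along(cells), glottal, pch = 16,
         ylim = range(c(glottal, pharyngeal)), xaxt = "n",
         xlab = "", ylab = "Lower-cheek width (scaled)")
    axis(1, at = seq_along(cells), labels = cells)
    # Each arrow runs from the /ʔ/ value to the /ʕ/ value; its length encodes the difference
    arrows(seq_along(cells), glottal, seq_along(cells), pharyngeal, length = 0.1)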


3 Forward-Build Models

In addition to decision trees and arrow plots, forward-build models were developed to select the single best measurement that predicts answer correctness.

The single-best models were run for each pair of AGs per vowel. A likelihood ratio (LR) test was performed to compare two models and obtain the p-value for the improvement of one model over the other; the basic model consists of an intercept plus 'Word' and 'Perceiver' as random variables, and the model it is compared to adds the single best measurement variable. Table 8-1 provides summaries of the best models of the forward build. The p-values assess the improvement that adding a measurement variable brings to the basic model.
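A hypothetical sketch of one such comparison is shown below: a base intercept-only model with the two random effects is compared, via a likelihood ratio test, against a model that adds a single candidate measurement difference. The data frame, its columns, and the chosen measurement name are illustrative assumptions, not the study's actual objects.

    library(lme4)

    # Placeholder data for one AG pair in one vowel context
    set.seed(2)
    pair_data <- data.frame(
      correct   = rbinom(600, 1, 0.57),
      lower_lip_thickness_diff = rnorm(600),   # difference between a COI and its pair
      word      = sample(paste0("W", 1:24), 600, replace = TRUE),
      perceiver = sample(paste0("P", 1:30), 600, replace = TRUE)
    )

    m_base <- glmer(correct ~ 1 + (1 | word) + (1 | perceiver),
                    data = pair_data, family = binomial)
    m_meas <- update(m_base, . ~ . + lower_lip_thickness_diff)
    anova(m_base, m_meas)   # likelihood ratio test; this p-value is what Table 8-1 reports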

Table 8-1: /ʔ,ʕ/ Best Single Variables Selected by the ‘Single-Best’ Function

/ʔ, ʕ/   Single-Best Variable         p-value
[æ]      High Cheeks                  0.456
[æ:]     Upper Lip Thickness          0.006
[i]      Lower Lip Thickness          0.005315
[i:]     Cheek High                   0.001861
[u]      Lip Vertical Inner-Square    0.05917
[u:]     Open Area                    0.14

Based on the p-values, the models that showed significant improvement over the basic intercept-only model included upper lip thickness in the [æ:] context, lower lip thickness in the [i] context, and cheek high in the [i:] context. Furthermore, lip-vertical-inner-square in the context of [u] marginally contributed to the perception of the /ʔ, ʕ/ contrast.

Table 8-2 reports the summaries of the single best-models by vowel, on /ħ, h/.


Table 8-2: /ħ,h/-Best Single Variables Selected by the ‘Single-Best’ Function

/ħ, h/   Single-Best Variable                p-value
[æ]      Low Chin Width                      0.02
[æ:]     High Chin Width                     0.02
[i]      Lip Horizontal & Chin Width High    0.005
[i:]     NA                                  NA
[u]      Cheek High                          0.01
[u:]     Lip Horizontal                      0.04

No measurement contrast was found to be a significant predictor of correctness in the context of long [i:]. Most of the informative contrasts are in the chin and cheeks, and only a few in the lips. Overall, the single-best models were comparatively weak, with only a couple of measurements potentially contributing to the correct identification of /ʔ, ʕ/ and /ħ, h/.


CHAPTER 9: DISCUSSION AND CONCLUSIONS

1. Summary and Comparison of Emphatics and Gutturals

This study has sought to explore visual speech perception of Arabic emphatics and gutturals as modulated by phonetic variables and differences in facial feature measurements. The first stated hypothesis concerns the vowel effect on the perception of AEs and AGs. This hypothesis was accepted when applied to emphatics. AEs were much better perceived in the [æ:]-context and the effect was robust and rather consistent across speakers. NEs were better perceived in the [i] context but emphatics in the [u]-environment were generally poorly identified. However, the vowel effect was not as robust in the AG category. The only significant effect was found when [i] was coarticulated with /ʔ/ or /ʕ/.

I also hypothesized that there are visual cues associated with AGs and AEs in the lips, chin, and cheeks, and that native Lebanese speakers attend to these cues to better perceive AEs and AGs visually. These two hypotheses were supported, but not without reservation. I indeed found visual cues in the lips, chin, and cheeks, but most of the information native speakers attended to in the perception of emphatics was in the lips. On the other hand, the cheeks and chin were more informative than the lips in the perception of AGs.

The corpus used to test these hypotheses comprised 72 minimal pairs, each containing an AE and its NE cognate. The AG corpus was composed of 24 /ħ, h/ pairs, 24 /ʔ, ʕ/ pairs, and 24 /χ, ʁ/ pairs. The /χ, ʁ/ pairs were excluded from the analyses due to the participants' inability to tell the two consonants apart based on optical information alone. As in the emphasis analysis, fifty-three perceivers watched two randomly selected speakers in 432 silent videos edited to fit one of six facial conditions, each showing only some parts of the face (AF, LwF, L, LCkCn, LCk, and LCn). With each video, perceivers saw the two words of a minimal pair and were asked to identify the talker's utterance in a forced-choice identification task.

Both the emphatics and the gutturals that perceivers were tested on are characteristic of Semitic languages and share a place of articulation involving a constriction in the pharynx, with one major difference: while AEs are produced with a primary coronal articulation in combination with a secondary pharyngeal articulation, the AGs are produced with a single constriction in the pharynx. Perceivers trying to visually identify a word as an AE have an advantage given by the visibly salient articulatory gestures formed in the front of the mouth. On the other hand, since all the action happens deep in the throat, visual identification of AGs is a lot more challenging and has to rely on subtle movements of the cheeks and jaw that result from the activities of the pharynx and the tongue. In other words, the movement of the articulators as they produce a guttural sound is not transparent; all perceivers have at their disposal are the consequences of the inner movements happening deep in the throat.

Words forming minimal pairs in both categories in this study are homophenous, meaning words that sound different but have very similar facial configurations (Berger, 1972); however, there are some differences in most contrasts that cue the native speaker in identifying one utterance over its pair. These differences appear to be far more subtle for AGs than for AEs, as evidenced by the lower accuracy means associated with the AGs. Chelali and colleagues (2012) examined 20 images for each of the 28 Arabic consonant phonemes and were able to collapse them into 11 visemes; all the gutturals were found to be visually indistinguishable and were grouped together. Nonetheless, Chi-Square tests performed on both categories of sounds in the current study revealed that both AEs and AGs were recognized optically at better than chance, even though AEs were considerably better identified than AGs.

Broken down by category, Chi-Square models provided quick and simple overviews exploring the predictability of correct identification based on phonetic and facial features. From this overview, strong speaker and vowel effects and a marginal vowel length effect emerged in connection with the AE set, but these variables, except for vowel length, had a more subdued influence on the perception of AGs. The [æ] context, and especially long [æ:], yielded significant improvement in the perception of AEs; in the AG category, [i] and long vowels exerted the greatest coarticulatory influence on the recognition of [ʕ] versus [ʔ]. In terms of consonant quality, the difference between the emphatics [dʕ], [tʕ], and [sʕ] was not significant, perhaps because all three consonants are articulated around the same area in the vocal tract, and what distinguishes them acoustically, manner and voicing, is not visually available or attended to by native speakers; results, however, trended towards better perception associated with [dʕ]. In contrast, there was a meaningful difference in the perception of AGs, with identification correctness of the COIs going in this order: /h/ = /ʕ/ > /ʔ/ > /ħ/. Also, participants were better at recognizing the COI within the /ʕ, ʔ/ contrast than within the /h, ħ/ one.

Consonant position was found to be significant for both emphatics and AGs. For some of the COIs, NEs were better recognized at onsets than at codas, but the opposite was true for AEs. Perceivers were better at perceiving /ħ, h/ word-finally, and the position of /ʕ, ʔ/ had no significant effect on perception.

Chi-Square tests were also used to explore the contribution of the face conditions to perception. The effect for emphatic COIs was negligible, but it was apparent in the AG category. These results strongly suggest that the lips, present in all conditions, carried enough speech information when associated with AEs and NEs; the influence of the cheeks and chin was, in most cases, not noticeable compared to the lips alone. On the other hand, the lips alone were not sufficiently distinguishable to significantly impact correct identification of AGs. In fact, AGs were best identified when the whole face (AF condition) was seen. Additionally, perception of AGs improved significantly when perceivers' attention was focused on the lips, cheeks, and chin (LCkCn). That LCkCn was better than the lower face (LwF), which included the neck, suggests that the movements in the pharynx are not visible in the neck.

Mixed-effects logistic regressions provided a more refined look at the Chi-Square tables. By setting perceivers and words as random variables, the regressions made it possible to control simultaneously for perceiver variation and for properties of the words beyond the ones accounted for.

The regression models examined how variables interacted together to modulate perception of AEs and AGs. Concurring with the Chi-Square models, speaker effect was impactful among emphatic contrasts. Speaker 4, being the investigator, was often removed from the analysis. All meaningful results were a bit attenuated, but they remained significant even without speaker 4.

In the emphatic category, the best model included speaker, vowel, and vowel length as explanatory variables. The meaningful factors in the /ħ, h/ best model for the AG category were consonant quality, speaker, and face; the significant variables in the /ʔ, ʕ/ best model were vowel quality, vowel length, consonant quality, and face.

Post hoc analyses were performed on subsets within categories, and other small effects were found that allowed us, at least in the AE set, to create a profile for each speaker: speaker 4 was highly intelligible across vowels, unlike speaker 3. Speaker 2's NEs were not as well identified as his AEs; on the contrary, speaker 1's NEs were much better recognized than her AEs. This large variability between speakers was not apparent in the AG category. This is not to say that all speakers produce AGs in the same manner; rather, the absence of a meaningful speaker effect indicates a lack of visually distinctive speech-related movements that would help perceivers differentiate between AG consonants.

The vowel effect was robust for the emphatics, but only significant in the context of [i] for the guttural category. AEs provide perceivers with much more visual information due to the front primary gestures in the production of emphatics. Furthermore, emphasis changes the vowel quality and accentuates the roundness of the lips in the /i/- and particularly the /ɑ/-context. Lip roundedness is a very visually salient feature that exploits the natural synergy between all the lip parameters and their effect on the cheeks and chin; a protrusion of the lips happens when the corners of the lips are drawn closer together, and it modifies the open area. But roundedness is not contrastive enough in the context of [u], which is already produced with rounded, protruded lips. This explains why AEs and NEs were poorly perceived in the context of the high back vowel; even the best machine learning models employed more chin and cheek variables, in addition to lip thickness, in the [u] context.

The only vowel that helped identification of AGs was [i], likely due to anticipatory coarticulation and the distance between the high front vowel's place of articulation and the pharynx. The literature provides evidence of coarticulation patterns in the lower pharynx: ultrasound studies reported narrowing of the lower pharynx and lateral and inward movements when pharyngeals coarticulate with low back vowels such as /ɑ/ (Kelsey et al., 1969b; Parush & Ostry, 1993; Zagzebski, 1975). This is no surprise considering that the lowering and retraction of the tongue dorsum changes the configuration of the pharyngeal cavity. However, the visual speech signal is not structured by all the movements in the pharynx in a way that helps perceivers distinguish between an AG and its pair.

Referring back to the fitted faces in Figure 2-1, a description of the differences between the two images would most likely elicit a comparison of the shape of the lips and, for the more observant, of the chin and cheeks. To fit such descriptions into statistical models, different horizontal and vertical measurements of the lips, cheeks, and chin were taken for each word, and the differences in these dimensions were compared for each item and its pair. The premise is that if, in the perceiver's mind, the chin drops more in the /ʔ/- than in the /ʕ/-context, then an utterance pronounced with a larger distance between the tip of the nose and the chin would cue /ʔ/. Speakers who do not exhibit sufficiently distinctive movements between a word and its pair would be difficult to perceive. Furthermore, native speakers are well aware of speaker variation and are adept at adjusting to their interlocutor's idiosyncrasies (Bertelson et al., 2003). But in order to do that, they need to see consistency within the speaker's articulation, especially when unfamiliar with the speaker. Without within-speaker consistency, the task of identifying an already ambiguous signal in the visual identification of AGs and AEs becomes very difficult.

The arrow plots by vowel and by speaker were the first qualitative exploratory data analysis tool used to examine at a glance whether measurements were informative predictors of perception accuracy. Long arrows indicated a distinctive difference in the displacement of the facial features when the speaker produced the two contrasting words within a minimal pair, and arrows of the same color signified speaker consistency. Many of the plots generated from the AE data showed long, same-color arrows pointing in the same direction, especially when associated with speaker 4. On the other hand, the arrows in the plots generated from the AG data were generally very short and inconsistent within and across speakers, reflecting the difficulty perceivers had in distinguishing between AGs. There were, however, a few plots where differences were apparent; for example, the low cheek measurement associated with speakers 1, 3, and 4 shows a more salient articulation of /ʕi:/ relative to /ʔi:/.

The other exploratory tool was the decision trees, which were used to select the best predictive measurements of accuracy. A simple tree with few branches typically reflects the machine's confidence in the measurement as a good predictor of the COI type; for example, lip horizontal was the feature that cued the computer in selecting AE or NE with a high level of confidence. The trees associated with AGs were generally more complicated, with many levels, pointing to the difficulty the function had in finding a few best measurements that contributed to the correct classification of a syllable. The decision tree function performed poorly in finding the most informative measurements related to AGs. The regressions were informed by the successful classification of AEs based on measurement differences selected by the decision trees; the trees also pointed to variables that might have a non-linear relationship to emphasis, which was the motivation for adding the lip vertical inner square variable.

The emphasis data could easily be analyzed in terms of two distinct classes, AEs and NEs. The statistics for the AGs, however, were done separately for each pair, and one of the initial three pairs was dropped. Consequently, the dataset for each AG pair was not big enough and did not lend itself to machine learning, which was therefore not applied to the AGs.

The quest for a simple, effective model to predict correct identification of emphasis led to the development of the machine learning models and the best-subset logistic regressions on emphasis, which were based purely on scaled differences between the AE and NE measurements, without perceivers' involvement. The best-subset regressions output by the machine learning process picked out the contrasts that were most informative, correcting for overfitting using the AIC criterion. The machine overwhelmingly selected horizontal and vertical lip measurements, which are in reality highly correlated; when the lips are stretched horizontally, the vertical opening and the open area between the inner lips are smaller, whereas rounded lips are associated with a smaller lip horizontal opening. The machine's classification accuracy exceeded the perceivers'; in fact, the computer was 100% accurate in its classification of AEs and NEs in the context of [æ:]. In addition to the lips, the machine also selected informative measurements in the chin and cheek areas, especially in the [i] and [u] environments.
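As a loose illustration of this kind of AIC-guided variable selection, the sketch below uses stepwise search on an ordinary logistic regression as a stand-in for the best-subset procedure, over a handful of randomly generated measurement differences; all object and column names are hypothetical.

    # Placeholder scaled measurement differences (AE minus NE) and an emphatic label
    set.seed(4)
    ae_meas <- data.frame(
      is_emphatic    = rbinom(200, 1, 0.5),
      lip_horizontal = rnorm(200),
      lip_vertical   = rnorm(200),
      open_area      = rnorm(200),
      cheek_low      = rnorm(200),
      chin_vertical  = rnorm(200)
    )

    null_model <- glm(is_emphatic ~ 1, data = ae_meas, family = binomial)
    full_scope <- is_emphatic ~ lip_horizontal + lip_vertical + open_area +
      cheek_low + chin_vertical
    # Stepwise search; k = 2 corresponds to the AIC penalty
    best <- step(null_model, scope = full_scope, direction = "both", k = 2, trace = 0)
    head(predict(best, type = "response"))   # fitted probabilities of 'emphatic'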

I also investigated the perception-production connection as reflected in the measurements of AEs and AGs. The question was how measurements that are a consequence of a speaker's production inform perceivers' correct identification of a word versus its pair. For example, if perceivers expected the AE measurement of the lip vertical opening in a specific word to be larger than the NE measurement, a positive coefficient (AE - NE) associated with lip vertical opening would predict the selection of the correct AE answer. On the other hand, if the perceiver estimates that the cheeks are wider in the production of [ʕi] than [ʔi], a negative coefficient would predict the wrong response. The perceivers' models based on measurement differences were forward-build models, also selected by the AIC criterion. The process consisted of cycling through all the measurements and adding at each step the variable that would result in the most improvement over the previous model. For AEs, most of the selected measurements were in the lips, except in the [i:] context, where the cheeks were selected as most informative. On the other hand, most measurements selected for the AG models were in the chin and cheeks. This was reflected in the perception regression models, where the LCkCn condition was found to be significant.

The final stage of this analysis was an investigation of correlations comparing whether the machine's certainty of classification was related to the probability of perceivers' accuracy. As reflected in the correlation plots, the relationship between machine and human confidence was not a simple linear one, but several of the correlations were highly significant, indicating a level of consistency shared by the machine and the perceivers.
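A minimal sketch of this last step is shown below: per-word machine certainty is correlated with per-word perceiver accuracy using a Pearson test. The two vectors are simulated placeholders with hypothetical names; they stand in for the per-word summaries derived from the actual models and responses.

    # Placeholder per-word summaries (48 words), not the study's results
    set.seed(5)
    machine_prob       <- runif(48)                                       # machine certainty
    perceiver_accuracy <- 0.4 + 0.4 * machine_prob + rnorm(48, sd = 0.1)  # perceiver accuracy

    cor.test(machine_prob, perceiver_accuracy)   # Pearson correlation and its p-value
    plot(machine_prob, perceiver_accuracy,
         xlab = "Machine certainty (predicted probability)",
         ylab = "Perceiver accuracy (proportion correct)")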

2 Discussion

It is remarkable that, with so few cues, perceivers could still identify the emphatic and especially the guttural consonants at better than chance. Perceivers were faced with a number of challenges that made their identification task rather difficult. Some of these challenges have to do with the perceivers themselves, others with speaker variability, the constraints caused by the different vocalic environments of the COIs, the deprivation of the acoustic signal, and the truncated visual signal.

Perceivers have different perceptual sensitivities modulated by their own backgrounds. Speaker variability no doubt has an effect on the perceivers, who are themselves native speakers of Lebanese Arabic and did not approach the task as naïve, uninformed participants. They had their own biases as to how, for example, an AE/NE contrast should look. If a speaker's articulation of a particular contrast matches their own, it is likely they will perceive that speaker accurately. In addition to the perceivers' linguistic background, some speakers are simply not very intelligible visually. The following videos illustrate the point: analysis after analysis has shown the poor visual perception associated with speaker 3 and the high accuracy mean associated with speaker 2's production of AEs. Videos 1 and 2 show speaker 2 uttering the words [tʕifəl] and [tifəl]; videos 3 and 4 show speaker 3 saying the same two words.

Fig. 9-1: Comparison of Speakers 2 and 3 in [tʕifəl] and [tifəl]
(Panels: Speaker 2, [tʕifəl]; Speaker 2, [tifəl]; Speaker 3, [tʕifəl]; Speaker 3, [tifəl])

Even though [i] blocks emphasis, the front [i] almost looks like a back [u] in speaker 2’s production. Psychologically, speaker 2 feels compelled to make a clear distinction between the two words. Lindblom (1990) reports that speakers tend to alter their speech production, in an effort to increase intelligibility and address the communicative needs of their perceiver. According to Green and colleagues (2010), adults also tend to modify their facial movements to enhance the visual cues in visual speech. But apparently, not all of them do; even for a native speaker, it would be very hard to tell [tʕifəl] and [tifəl] apart visually when spoken by speaker 3.

Moreover, perceivers were not given any feedback and had no idea about the accuracy of their answers. There is a lot of research to support the native perceiver's ability to adjust to speaker variability: Bertelson et al. (2003) showed how listeners retune their phonetic category boundaries when listening to ambiguous sounds, and it is plausible that they would also retune their perception of speech gestural boundaries to adjust to a speaker's idiosyncrasies. However, in this study, perceivers had no idea where adjustments were needed.

Other challenges facing the perceivers were related to the vocalic environment of the COI. Sanker (2016) studied patterns of misperception of Arabic gutturals. Her results indicated that the phonetic environment did not affect the accuracy of guttural identification as strongly as it affected other consonants, probably because of the pseudoformants that characterize the guttural sounds. In line with the findings of the current study, Sanker noted significantly higher accuracy in codas than in onsets. She also pointed out a directionality in perceivers' responses, where participants selected /h/ more than /ħ/ and /ʕ/ more than /ʔ/.

On the other hand, the vowel environment of the emphatics is important for AE identification. Vowels generally have a raised F1 in the vicinity of AEs; F1 is associated with tongue height, which affects the dimensions of the resonating cavities. The tongue, mostly visible in the [æ] context, may partially explain why AEs are best perceived in the vicinity of [æ]. In the [i] context, there is a conflict between the gesture of the front vowel, which wants to thrust the tongue forward, and the pharyngealization effect, which wants to pull it backwards. This conflict constrains the movement and makes AEs a little harder to identify in the [i] context. Furthermore, the tongue is not visible in the [i] and [u] environments, and the rounded-lip configuration of [u] makes it hard to distinguish between AEs and NEs.

Liberman (1985), in his motor theory of speech perception, claimed that when subjects listen to a word, they recognize the gestures associated with the production of the stimulus. If so, would perceivers recognize the movements related to the word they see? Yehia et al. (1998) stated that the neuromuscular activity controlling the vocal tract configuration has audible and visual consequences that are visible in the motion of the face. If Liberman is right, then the motion of the face would cue the native listener in recognizing the sounds associated with the gestures.

In modern neurolinguistics, researchers have taken special interest in mirror neurons, which activate the motor cortex of a person who is watching the action of another individual. According to Rizzolatti and Craighero (2004), actions belonging to the motor repertoire of the observer are mapped onto his or her motor system. In the context of this study, the perceivers, being native speakers of Arabic, must have the gestures associated with the production of Arabic speech sounds in their motor repertoire, since they have produced these sounds countless times in their lives. Consequently, seeing the facial motion of the speaker should activate the motor cortex, triggering in the perceiver recognition of the sound. Unfortunately, it is not as simple as that. As a native speaker, I may be able to describe what my lips and, to a lesser extent, my tongue are doing when I produce an emphatic; however, I have no awareness of what exactly is going on in my pharynx, and even less so in someone else's pharynx.

Moreover, seeing a salient gesture such as lip rounding or lip spreading may trigger an association between the motor action and the corresponding sound, but indistinguishable movements are unlikely to elicit strong associations, especially if the same movement is connected to more than one sound.

The question remains: what in the speaker's face informs the perceiver? The facial conditions revealed two important facts: (1) in sounds involving a front articulation, the lips encode most of the information; (2) when the mouth does not contain enough speech information, the perceiver directs his or her gaze to other regions of the face.

Watching an impoverished speech signal, perceivers only have time to focus on the most informative part of the face. AGs’ pharyngeal actions are not encoded in the lips; consequently, perceivers need all the cues they can get to identify the sound, hence the significant effect of the whole face, the cheeks, and the chin. In other words, perceivers focus their attention on the critical information that has the highest probability of leading them to correct identification. Lusk and Mitchel (2016) studied differential gaze patterns during audiovisual speech segmentation and concluded that perceivers direct their gaze mostly to the most informative features. Barenholtz and colleagues (2016) reported a highly flexible system of attention allocation during speech encoding. Even 8- to 10-month-old infants in their babbling stage fixate on the speaker’s mouth (Lewkowicz & Hansen-Tift, 2012). Humans strive to communicate, that is, to convey their message as speakers and to receive someone else’s message as perceivers, and they are remarkably skillful at accomplishing this goal even when given nothing more than cropped videos of moving faces.

3 Concluding Remarks and Future Direction

In conclusion, it is not clear how the outcomes of this investigation would generalize to a broader population. The sample was limited to Lebanese speakers and perceivers, which was a necessary starting point in order to control for dialectal variation. However, the study would need to be extended to other Arabic dialects.

Also, the reliability of the study would increase with the introduction of more speakers and a greater variety of words. Recruiting more participants would probably be easier if the experiment were limited to about 20 minutes; more words, distributed randomly across more participants, would increase the generalizability of the investigation.

In future work, I would like to use principal component analysis. Many of my measurements are correlated; for example, lip horizontal changes as a function of vertical lip opening and open area, and chin vertical is connected to vertical lip opening. Principal component analysis would allow the creation of a smaller pool of variables, or principal components, by combining the correlated variables while retaining most of the information. A smaller set of principal components would have more predictive power and would make the results more interpretable.
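To make this concrete, the sketch below shows a minimal principal component analysis in R (the environment of the lme4 models cited in the references), using the base function prcomp(). The variable names and simulated data are hypothetical stand-ins, not the study’s actual measurements.

# Minimal sketch of principal component analysis on correlated facial
# measurements; the variables and data below are hypothetical.
set.seed(1)
lip_vertical   <- rnorm(60)
open_area      <- 0.8 * lip_vertical + rnorm(60, sd = 0.3)   # correlated with lip_vertical
lip_horizontal <- -0.6 * lip_vertical + rnorm(60, sd = 0.4)  # correlated with lip_vertical
chin_vertical  <- 0.7 * lip_vertical + rnorm(60, sd = 0.5)   # correlated with lip_vertical
meas <- data.frame(lip_vertical, open_area, lip_horizontal, chin_vertical)

# Center and scale the measurements, then extract the components.
pca <- prcomp(meas, center = TRUE, scale. = TRUE)

summary(pca)   # proportion of variance captured by each component
head(pca$x)    # component scores, usable as predictors in later models

The summary() output indicates how much of the total variance the first one or two components capture; those component scores, rather than the raw correlated measurements, could then serve as predictors in the statistical models.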

In spite of some limitations, the results were still meaningful, and the results of all analyses were congruent with each other and with the literature overall. Furthermore, the use of speaker 4 may be counted as one of the drawbacks, but it had a beneficial unintended consequence: in setting up for a future study involving second-language (L2) learners of Arabic, the results confirmed that I would be an adequate speaker to present the training sets, since I was the most intelligible speaker. This investigation also helped in deciding which class of words would be best to start training on, namely, emphatics in the [æ:]-context. Once distinguishing AEs from NEs in the [æ:] environment becomes easy, training would gradually expand to [æ], then [i:], [i], and [u], saving [u:] for last, since AEs were found to be least intelligible in the [u:]-context. Also, L2 learners would first train on the lips-only condition before the chin and cheeks are gradually added. Based on this study, gutturals do not appear to be suitable for visual training; hopefully, learners would develop sensitivity to pharyngeal sounds after extensive exposure to pharyngealized phonemes.


REFERENCES

Azra, A.N., Madison, M.I., & Idrissi, A. (2005). McGurk fusion effects in Arabic words. AVSP, 11-16.

Al-Tamimi, F., Alzoubi, F., & Tarawna, R. (2009). A videofluoroscopic study of the emphatic consonants in Jordanian Arabic. Folia Phoniatrica et Logopaedica, 61, 247-253.

Barenholtz, E., Mavica, L., & Lewkowicz, D.J. (2016). Language familiarity modulates relative attention to the eyes and mouth of a talker. Cognition, 147, 100-105.

Bates, D., Maechler, M., Bolker, B., & Walker, S. (2014). lme4: Linear mixed-effects models using Eigen and S4. R package version 1.0-6. Retrieved from http://CRAN.R-project.org/package=lme4

Benguerel, A. P., & Pichora-Fuller, M. K. (1982). Coarticulation effects in lipreading. Journal of Speech and Hearing Research, 25(4), 600-607.

Benoit, C., Lallouache, T., Mohamadi, T., & Abry, C., (1992). A set of French visemes for visual speech synthesis. In G. Bailly & C. Benoit (Eds.), Talking machines: Theories, models, and applications, 485-504. North Holland, Netherlands: Elsevier Science Publishers.

Benoit, C., Guiard-Marigny, T., Le Goff, B., & Adjoudani, A. (1996). Which components of the face do humans and machines best speechread? In D. G. Stork & M. E. Hennecke (Eds.), Speechreading by humans and machines: Models, systems, and applications, 315-325. Berlin, Germany: Springer.

Berger, K. W. (1972). Visemes and homophenous words. Teacher of the deaf. 70. (2nd ed.) 369-399. Kent, Ohio: National Educational Press, Inc.

Bertelson, P., Vroomen, J., & de Gelder, B. (2003). Visual recalibration of auditory speech identification: A McGurk aftereffect. Psychological Science, 14, 592–597. doi:10.1046/j.0956-7976.2003.psci_1470

Bin-Muqbil, M. (2006). Phonetic and phonological aspects of Arabic emphatics and gutturals (Unpublished doctoral dissertation). University of Wisconsin, Madison, Wisconsin.

Boersma, P., & Weenink, D. (2011). Praat: Doing phonetics by computer (Version 5.2.28) [Software]. Available from http://www.praat.org/

Brooke, N. M., & Summerfield, A. Q. (1983). Analysis, synthesis and perception of visible articulatory movements. Journal of Phonetics, 11, 63-76.


Cootes, T., Edwards, G. J., & Taylor, C. J. (2001). Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 681-685.

Campbell, C.S. (2001). Patterns of evidence: Investigating information in visible speech perception. Retrieved from Dissertation Abstracts International, 61, 3869B (UMI No. 9979917).

Cappa, S.F. & Pulvermuller, F. (2012). Language and the motor system. Cortex, 48(7), 785-787.

Casserly, E. D., & Pisoni, D. B. (2010). Speech perception and production. Wiley Interdisciplinary Reviews: Cognitive Science, 1(5), 629-647.

Chelali, F.Z., Sadeddine, K, & Djeradi, A. (2012). Visual speech analysis application to Arabic phonemes. Special Issue of International Journal of Computer Applications (0975 – 8887) on Software Engineering, Databases and Expert Systems – SEDEXS.

Delattre, P. (1971). Pharyngeal features in the consonants of Arabic, German, Spanish, French, and American English. Phonetica, 21, 129–155.

Dubois, C., Otzenberg, H., Gounot, D., Sock, R., & Metz-Lutz, M. N. (2012). Visemic processing in audiovisual discrimination of natural speech: A simultaneous fMRI-EEG study. Neuropsychologia, 50, 1316–1326.

Elgendy, A. M. (2001). Aspects of pharyngeal coarticulation (Unpublished doctoral dissertation). University of Amsterdam, Amsterdam, Netherlands.

Fisher, C. G. (1968). Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11(4), 796–804.

Fridriksson, J., Baker, J., Whiteside, J., Eoute, D., Moser, D., Vesselinov, R., & Rorden, C. (2009). Treating visual speech perception to improve speech production in non-fluent aphasia. Stroke, 40(3), 853–858.

Ghazeli, S. (1977). Back consonants and backing coarticulation in Arabic (Unpublished doctoral dissertation). University of Texas, Austin, Texas.

Giannini, A., & Pettorino, M. (1982). The emphatic consonants in Arabic: Speech laboratory report IV. Naples: Istituto Universitario Orientale.

Green, J. R, Nip, S. B., Wilson, E. M., Mefferd, A. S., Yunusova, Y. (2010). Lip movement exaggerations during infant-directed speech. Journal of Speech, Language, and Hearing Research, 53(6).


Halliday, M.A.K. (1978). "An interpretation of the functional relationship between language and social structure", from Uta Quastoff (ed.), Sprachstruktur – Sozialstruktur: Zure Linguistichen Theorienbildung, 10, 3–42 of The Collected Works, 2007.

Hardison, D. (2005). Second-language spoken word identification: Effects of perceptual training, visual cues, and phonetic environment. Applied Psycholinguistics, 26, 579- 596.

Harry, B. (1996). The importance of the language continuum in Arabic Multiglossia. In A. Elgibali (Ed.). Understanding Arabic: Essays in contemporary Arabic Linguistics in honor of El-Said Badawi. New York, NY: Columbia Press.

Hazan, V., Sennema, A., Iba, M., & Faulkner, A. (2005). Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English. Speech Communication, 47(3).

Heselwood, B. (2007). The ‘tight approximant’ variant of the Arabic ‘ayn’. Journal of the International Phonetic Association, 37, 1-32.

Israel, A., Proctor, M., Goldstein, L., Iskarous, K., & Narayanan, S. (2012). Emphatic segments and emphasis spread in Lebanese Arabic: A real-time magnetic resonance imaging study. 13th Annual Conference of the International Speech Communication Association, InterSpeech, 3, 2175-2178.

Jiang, J., Alwan, A., Bernstein, L.E., Keating, P.A., & Auer, E.T. (2002). On the correlation between facial movements, tongue movements and speech acoustics. International Conference on Spoken Language Processing, 1, 42-45.

Kelsey, C. A., Woodhouse, R. J., & Minifie, F. D. (1969b). Ultrasonic observations of coarticulation in the pharynx. Journal of the Acoustical Society of America, 46, 1016- 1018.

Lewkowicz, D. J., & Hansen-Tift, A. M. (2012). Infants deploy selective attention to the mouth of a talking face when learning speech. Proceedings of the National Academy of Sciences, 109(5), 1431–1436. doi:10.1073/pnas.1114783109

Liberman, A.M., & Mattingly, I. (1985). The motor theory of speech perception revised. Cognition, 21, 1-36.

Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H&H theory. In W. J. Hardcastle & A. Marchal (Eds.), Speech production and speech modeling, 403-409. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Lusk, L. G., & Mitchel, A. D. (2016). Differential gaze patterns on eyes and mouth during audiovisual speech segmentation. Frontiers in Psychology, 7, 52.

Maddieson, I., & Lindblom, B. (1988). Phonetic universals in consonant systems. In Li, C., & Hyman, L. M. (Eds.), Language, Speech and Mind, 62-78.

Mavadati, M. S., Mahoor, M. H., Bartlett, K., Trinh, P., & Cohn, J. (2013). DISFA: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing, 4(2), 151-160.

McLeod, A., Summerfield, Q., (1990). A procedure for measuring auditory and audio- visual speech-reception thresholds for sentences in noise: Rationale, evaluation, and recommendations for use. British Journal of Audiology, 24(1), 29-43.

Moon, S.J. & Lindblom, B., (1994). Interaction between duration, context, and speaking style in English stressed vowels. Journal of the Acoustical Society of America, 96, 40-55.

Ouni, S, Cohen, M., & Massaro, D. (2005). Training Baldi to be multilingual: a case study for an Arabic Badr. Speech Communication, 45(2).

Ouni, S., & Ouni, K. (2007). Arabic pharyngeals in visual speech. International Conference on Auditory-Visual Speech Processing 2007 (AVSP), Hilvarenbeek, Netherlands, 212-215.

Parush, A. and Ostry, D.J. (1993). Lower pharyngeal wall coarticulation in VCV syllables. The Journal of the Acoustical Society of America, 94(2).

Pisoni, D. B., Logan, J. S., & Lively, S. E. (1994). Perceptual learning of nonnative speech contrasts: Implications for theories of speech perception. In H.C. Nusbaum and J. Goodman (Eds.). The Development of Speech Perception: The Transition from Speech Sounds to Spoken Words, 121-166. Cambridge: MIT Press.

Preminger, J. E., Lin, H.B., Payen, M., & Levitt, H. (1998). Selective visual masking in speech reading. Journal of Speech, Language and Hearing Research, 41, 564-575.

Rizzolatti, G., & Craighero, L. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27, 169-192.

Robert-Ribes, J., Schwartz, J. L., Lallouache, T., & Escudier, P. (1998). Complementarity and synergy in bimodal speech: Auditory, visual and audiovisual identification of French oral vowels in noise. Journal of the Acoustical Society of America, 103, 3677–3689.

Salvi, G. (2006). Mining speech sounds: Machine learning methods for automatic speech recognition and analysis (Unpublished doctoral dissertation). University of Stockholm, Stockholm, Sweden.


Sanker, C. (2016). Patterns of misperception of Arabic guttural and non-guttural consonants. Retrieved from Cornell Theses and Dissertations.

Sanders, D. & Goodrich, S. (1971). The relative contribution of visual and auditory components of speech to speech intelligibility as a function of three conditions of frequency distortion. Journal of Speech and Hearing Research, 14, 154-159.

Summerfield, A. Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In B. Dodd & R. Campbell (Eds.), Hearing by Eye: The Psychology of Lip-Reading, 3-51.

Thomas, S. M., Jordan, T. R. (2004). Contributions of oral and extraoral facial movement to visual and audiovisual speech perception. Journal of Experimental Psychology: Human Perception and Performance, 30, 873–888.

Walden, B., Prosek, R., Montgomery, A., Scherr, C., & Jones, C. (1977). Effects of training on speech recognition of consonants. Journal of Speech and Hearing Research, 20, 130-145.

Weikum, W.M., Vouloumanos, A., Navarra, J., Soto-Faraco, S., Sebastián-Gallés, N., Werker, J. (2007). Visual language discrimination in infancy. Science, 316, 1159.

Yehia, H.C., Rubin, P.E., and Vatikiotis-Bateson, E. (1998). Quantitative association of vocal-tract and facial behavior. Speech Communication, 26, 23-44.

Younes, M. (1993). Emphasis spread in two Arabic dialects. In M. Eid and C. Holes (Eds.), Perspectives on Arabic Linguistics, 119–145. Amsterdam: Benjamins.

Zagzebski, J. A. (1975). Ultrasonic measurement of lateral pharyngeal wall motion at two levels in the vocal tract. Journal of Speech and Hearing Research, 18, 308-318.

Zambon, M., Lawrence, R., Bunn, A., & Powell, S. (2005). Effect of alternative splitting rules on image processing using classification tree analysis. Photogrammetric Engineering & Remote Sensing, 72(1), 25–30.


Appendix

Corpus: minimal pairs – The correct answer that corresponds to the odd number is the
AE and the correct answer that corresponds to the even numbers is the NE. correct correct incorrect incorrect تَبَ عْ ْ w2s1f1c1 َطبَ عْ ْ w1s1f1c1 َطبَ عْ ْ w1s1f1c1 تَبَ عْ ْ w2s1f1c1 تا بْ ْ w4s1f1c1 طا بْ ْ w3s1f1c1 طا بْ ْ w3s1f1c1 تا بْ ْ w4s1f1c1 تِفِ لْ w6s1f1c1 ِطفِ لْ w5s1f1c1 ِطفِ لْ w5s1f1c1 تِفِ لْ w6s1f1c1 تي نْ w8s1f1c1 طي نْ w7s1f1c1 طي نْ w7s1f1c1 تي نْ w8s1f1c1 تُ نْ w10s1f1c1 طُ نْ w9s1f1c1 طُ نْ w9s1f1c1 تُ نْ w10s1f1c1 توما w12s1f1c1 طو مْ w11s1f1c1 طو مْ w11s1f1c1 توما w12s1f1c1 َس بْ w14s1f1c1 َص بْ w13s1f1c1 َص بْ w13s1f1c1 َس بْ w14s1f1c1 سا بْ w16s1f1c1 صا بْ w15s1f1c1 صا بْ w15s1f1c1 سا بْ w16s1f1c1 ِس بْ w18s1f1c1 ِص بْ w17s1f1c1 ِص بْ w17s1f1c1 ِس بْ w18s1f1c1 سي نْ w20s1f1c1 صي نْ w19s1f1c1 صي نْ w19s1f1c1 سي نْ w20s1f1c1 ُسلِ بْ w22s1f1c1 ُصلِ بْ w21s1f1c1 ُصلِ بْ w21s1f1c1 ُسلِ بْ w22s1f1c1 سو رْ w24s1f1c1 صو رْ w23s1f1c1 153

صو رْ w23s1f1c1 سو رْ w24s1f1c1 َد َر سْ w26s1f1c1 َض َر سْ w25s1f1c1 َض َر سْ w25s1f1c1 َد َر سْ w26s1f1c1 دائِ رْ w28s1f1c1 ضائِ رْ w27s1f1c1 ضائِ رْ w27s1f1c1 دائِ رْ w28s1f1c1 ِد بْ w30s1f1c1 ِض بْ w29s1f1c1 ِض بْ w29s1f1c1 ِد بْ w30s1f1c1 دي كْ w32s1f1c1 ضيق w31s1f1c1 ضيق w31s1f1c1 دي كْ w32s1f1c1 ُدرو بْ w34s1f1c1 ُضروب w33s1f1c1 ُضروب w33s1f1c1 ُدرو بْ w34s1f1c1 دو مْ w36s1f1c1 ضو مْ w35s1f1c1 ضو مْ w35s1f1c1 دو مْ w36s1f1c1 َش َم تْ w108s1f1c1 َش َم طْ w107s1f1c1 َش َم طْ w107s1f1c1 َش َم تْ w108s1f1c1 با تْ w110s1f1c1 با طْ w109s1f1c1 با طْ w109s1f1c1 با تْ w110s1f1c1 ِم ِش تْ w112s1f1c1 ِم ِش طْ w111s1f1c1 ِم ِش طْ w111s1f1c1 ِم ِش تْ w112s1f1c1 فَري تْ w114s1f1c1 فَري طْ w113s1f1c1 فَري طْ w113s1f1c1 فَري تْ w114s1f1c1 بِ ُز تْ w116s1f1c1 بِ ُز طْ w115s1f1c1 بِ ُز طْ w115s1f1c1 بِ ُز تْ w116s1f1c1 ُسكو تْ w118s1f1c1 ُسقو طْ w117s1f1c1

ُسقو طْ w117s1f1c1 ُسكو تْ w118s1f1c1 بَ سْ w120s1f1c1 بَ صْ w119s1f1c1 بَ صْ w119s1f1c1 بَ سْ w120s1f1c1


نا سْ w122s1f1c1 نا صْ w121s1f1c1 نا صْ w121s1f1c1 نا سْ w122s1f1c1 البِ سْ w124s1f1c1 البِ صْ w123s1f1c1 البِ صْ w123s1f1c1 البِ سْ w124s1f1c1 َرفي سْ w126s1f1c1 َرفي صْ w125s1f1c1 َرفي صْ w125s1f1c1 َرفي سْ w126s1f1c1 يَ ُخ سْ w128s1f1c1 يَ ُخ صْ w127s1f1c1 يَ ُخ صْ w127s1f1c1 يَ ُخ سْ w128s1f1c1 بو سْ w130s1f1c1 بو صْ w129s1f1c1 بو صْ w129s1f1c1 بو سْ w130s1f1c1 َخ دْ w132s1f1c1 َخ ضْ w131s1f1c1 َخ ضْ w131s1f1c1 َخ دْ w132s1f1c1 فا دْ w134s1f1c1 فا ضْ w133s1f1c1 فا ضْ w133s1f1c1 فا دْ w134s1f1c1 َرفِ دْ w136s1f1c1 َرفِ ضْ w135s1f1c1 َرفِ ضْ w135s1f1c1 َرفِ دْ w136s1f1c1 فَري دْ w138s1f1c1 فَري ضْ w137s1f1c1 فَري ضْ w137s1f1c1 فَري دْ w138s1f1c1 يَ ُع دْ w140s1f1c1 يَ ُع ضْ w139s1f1c1 يَ ُع ضْ w139s1f1c1 يَ ُع دْ w140s1f1c2 َم رفو دْ w142s1f1c1 َم رفو ضْ w141s1f1c1 َم رفو ضْ w141s1f1c2 َم رفو دْ w142s1f1c1 ْ ْ ْ ْ Gutturals أَلَ مْ w144s1f1c2 َعلَ مْ w143s1f1c2 َعلَ مْ w143s1f1c2 أَلَ مْ w144s1f1c2 آ ِج لْ w146s1f1c2 عا ِج لْ w145s1f1c2


عا ِج لْ w145s1f1c2 آ ِج لْ w146s1f1c2 إ ِر فْ w148s1f1c2 ِع ِر فْ w147s1f1c2 ِع ِر فْ w147s1f1c2 إ ِر فْ w148s1f1c2 إي دْ w150s1f1c2 عي دْ w149s1f1c2 عي دْ w149s1f1c2 إي دْ w150s1f1c2 أُ ُصور w152s1f1c1 ُع ُصور w151s1f1c2 ُع ُصور w151s1f1c2 أُ ُصور w152s1f1c1 أوم w154s1f1c2 ع ُوم w153s1f1c2 ع ُوم w153s1f1c2 أوم w154s1f1c2 غا صْ w156s1f1c2 خا صْ w155s1f1c2 خا صْ w155s1f1c2 غا صْ w156s1f1c2

َغ ر َغ رْ w158s1f1c2 َخ ر َخ رْ w157s1f1c2

َخ ر َخ رْ w157s1f1c2 َغ ر َغ رْ w158s1f1c2 ِغيا رْ w160s1f1c2 ِخيا رْ w159s1f1c2 ِخيا رْ w159s1f1c2 ِغيا رْ w160s1f1c2 ِغي بْ w162s1f1c2 ِخْي بْ w161s1f1c2 ِخي بْ w161s1f1c2 ِغي بْ w162s1f1c2 ُغمو ضْ w164s1f1c2 ُخمو ضْ w163s1f1c2 ُخمو ضْ w163s1f1c2 ُغمو ضْ w164s1f1c2 غ ُو تْ w166s1f1c2 خ ُو تْ w165s1f1c2 خ ُو تْ w165s1f1c2 غ ُو تْ w166s1f1c2 هَ لْ w168s1f1c2 َح لْ w167s1f1c2 َح لْ w167s1f1c2 هَ لْ w168s1f1c2 ها نْ w170s1f1c2 حا نْ w169s1f1c2 حا نْ w169s1f1c2 ها نْ w170s1f1c2 ِه رْ w172s1f1c2 ِح رْ w171s1f1c2 ِح رْ w171s1f1c2 ِه رْ w172s1f1c2


هي نْ w174s1f1c2 حي نْ w173s1f1c2 حي نْ w173s1f1c2 هي نْ w174s1f1c2 هُبو بْ w176s1f1c2 ُحبو بْ w175s1f1c2 ُحبو بْ w175s1f1c2 هُبو بْ w176s1f1c2 هو دْ w178s1f1c2 حو دْ w177s1f1c2 حو دْ w177s1f1c2 هو دْ w178s1f1c2 َدفَ عْ w246s1f1c2 َدفَأْ w245s1f1c2 َدفَأْ w245s1f1c2 َدفَ عْ w246s1f1c2 جاع w248s1f1c2 جا ءْ w247s1f1c2 جا ءْ w247s1f1c2 جاع w248s1f1c2 شا ِر عْ w250s1f1c2 شا ِريء w249s1f1c2 شا ِريء w249s1f1c2 شا ِر عْ w250s1f1c2 تَ ضيي عْ w252s1f1c2 تَ ديي ءْ w251s1f1c2 تَ ديي ءْ w251s1f1c2 تَ ضيي عْ w252s1f1c2 تَبَ ُّر عْ w254s1f1c2 تَبَ ُّر ءْ w253s1f1c2 تَبَ ُّر ءْ w253s1f1c2 تَبَ ُّر عْ w254s1f1c2 بِفو عْ w256s1f1c2 بِفو ءْ w255s1f1c2 بِفو ءْ w255s1f1c2 بِفو عْ w256s1f1c2 َض غْ w258s1f1c2 َض خْ w257s1f1c2 َض خْ w257s1f1c2 َض غْ w258s1f1c2 دا غْ w260s1f1c2 دا خْ w259s1f1c2 دا خْ w259s1f1c2 دا غْ w260s1f1c2 نَفِ غْ w262s1f1c2 نَفِ خْ w261s1f1c2 نَفِ خْ w261s1f1c2 نَفِ غْ w262s1f1c2 سي غْ w264s1f1c2 سي خْ w263s1f1c2 سي خْ w263s1f1c2 سي غْ w264s1f1c2

تَنُ غْ W266s1f1c2 تَنُ خْ w265s1f1c2


تَنُ خْ w265s1f1c2 تَنُ غْ W266s1f1c2 ن ُو غْ w268s1f1c2 نو خْ w267s1f1c2 نو خْ w267s1f1c2 ن ُو غْ w268s1f1c2 نحى w270s1f1c2 نهى w269s1f1c2 نهى w269s1f1c2 نحى w270s1f1c2 فا حْ w272s1f1c2 فا هْ w271s1f1c2 فا هْ w271s1f1c2 فا حْ w272s1f1c2 َش ِر حْ w274s1f1c2 َش ِر هْ w273s1f1c2 َش ِر هْ w273s1f1c2 َش ِر حْ w274s1f1c2 نَبي حْ w276s1f1c2 نَبي هْ w275s1f1c2 نَبي هْ w275s1f1c2 نَبي حْ w276s1f1c2 قُ ُر حْ w278s1f1c2 ُك ُر هْ w277s1f1c2 ُك ُر هْ w277s1f1c2 قُ ُر حْ w278s1f1c2 أَبوح w280s1f1c2 أَبوه w279s1f1c2 أَبوه w279s1f1c2 أَبوح w280s1f1c2
