Robust Estimation of Hypernasality in Dysarthria with Acoustic Model Likelihood Features

Michael Saxon (1), Ayush Tripathi (2), Yishan Jiao (3), Julie Liss (3), and Visar Berisha (1,3)

(1) School of Electrical, Computer, and Energy Engineering, Arizona State University
(2) Department of Electrical and Electronics Engineering, Visvesvaraya National Institute of Technology
(3) College of Health Solutions, Arizona State University

arXiv:1911.11360v4 [eess.AS] 6 Aug 2020

This work is partially supported by National Institutes of Health Grant 5R01DC006859-14.

Abstract—Hypernasality is a common characteristic symptom across many motor-speech disorders. For voiced sounds, hypernasality introduces an additional resonance in the lower frequencies and, for unvoiced sounds, there is reduced articulatory precision due to air escaping through the nasal cavity. However, the acoustic manifestation of these symptoms is highly variable, making hypernasality estimation very challenging, both for human specialists and automated systems. Previous work in this area relies on either engineered features based on statistical signal processing or machine learning models trained on clinical ratings. Engineered features often fail to capture the complex acoustic patterns associated with hypernasality, whereas metrics based on machine learning are prone to overfitting to the small disease-specific speech datasets on which they are trained. Here we propose a new set of acoustic features that capture these complementary dimensions. The features are based on two acoustic models trained on a large corpus of healthy speech. The first acoustic model aims to measure nasal resonance from voiced sounds, whereas the second acoustic model aims to measure articulatory imprecision from unvoiced sounds. To demonstrate that the features derived from these acoustic models are specific to hypernasal speech, we evaluate them across different dysarthria corpora. Our results show that the features generalize even when training on hypernasal speech from one disease and evaluating on hypernasal speech from another disease (e.g., training on Parkinson's disease, evaluation on Huntington's disease), and when training on neurologically disordered speech but evaluating on cleft palate speech.

Index Terms—hypernasality, dysarthria, velopharyngeal dysfunction, clinical speech analytics, speech features.

I. INTRODUCTION

HYPERNASALITY refers to the perception of excessive nasal resonance in speech, caused by velopharyngeal dysfunction (VPD), an inability to achieve proper closure of the velum, the soft palate regulating airflow between the oral and nasal cavities. Hypernasality is a common symptom in motor-speech disorders such as Parkinson's Disease (PD), Huntington's Disease (HD), amyotrophic lateral sclerosis (ALS), and cerebellar ataxia, as velar movement requires precise motor control [1], [2], [3], [4]. It is also the defining perceptual trait of cleft palate speech [5]. Reliable detection of hypernasality is useful in both rehabilitative (e.g., tracking the progress of speech therapy) and diagnostic (e.g., early detection of neurological diseases) settings [6], [7]. Because of the promise hypernasality tracking shows for assessing neurological disease, there is interest in developing strategies that are robust to the limitations of existing techniques—in this work we present automated metrics for hypernasality scoring that are robust to disease- and speaker-specific confounders.

Clinician perceptual assessment is the gold-standard technique for assessing hypernasality [8]. However, this method has been shown to be susceptible to a wide variety of error sources, including stimulus type, phonetic context, vocal quality, articulation patterns, and previous listener experience [9]. Additionally, these perceptual metrics have been shown to erroneously overestimate severity on high vowels when compared with low vowels [10], and vary based on broader phonetic context [11]. Although these difficulties may be mitigated by averaging multiple clinician ratings, this further drives up costs associated with hypernasality assessment and makes its use as a trackable metric over time less feasible.

Automated hypernasality assessment systems have been proposed as an objective alternative to perceptual assessment. Instrumentation-based direct assessment techniques visualize the velopharyngeal closing mechanism using videofluoroscopy [12] or magnetic resonance imaging [13] and provide information about velopharyngeal port size and shape [14]. These methods are invasive and can cause patients pain and discomfort. Alternatively, nasometry measures nasalance, the modulation of the velopharyngeal opening area, by estimating the acoustic energy from the nasal cavity relative to the oral cavity with two microphones separated by a plate that isolates the mouth from the nose [15]. In some cases, nasalance scores yield a modest correlation with perceptual judgment of hypernasality [16], [17]; however, there is considerable evidence that this relationship depends on the person and the reading passages used during assessment [17], [18]. Furthermore, properly administering the evaluation requires significant training and it cannot be used to evaluate hypernasality from existing speech recordings. Thus, the clinician's perception of hypernasality is often the de-facto gold standard in clinical practice [19].

An appealing alternative to instrumentation-based techniques is the direct estimation of hypernasality from recorded audio. This family of methods aims to measure the atypical acoustic resonance from VPD as an objective proxy for hypernasal speech. Systems that can accurately and consistently rate changes to a patient's hypernasality directly from the speech signal could enable remote tracking of neurological disease progression from a mobile device or home computer [20].

Previous work in this area can be categorized broadly in two groups: engineered features based on statistical signal processing [21] and supervised methods based on machine learning [22]. The simple acoustic features fail to capture the complex manifestation of hypernasality in speech, as there is a great deal of person-to-person variability [23]. The more complex machine learning-based metrics are prone to overfitting to the necessarily small disease-specific speech datasets, making it difficult to evaluate how effectively they generalize.

In this paper, we propose an approach that falls between these two extremes. We know that for voiced sounds hypernasal speech results in additional resonances at the lower frequencies [24]. The acoustic manifestation of the additional resonance is difficult to characterize with simple features as it is dependent on several factors, including the physio-anatomy of the speaker, the context, etc. For unvoiced sounds, hypernasal speech results in imprecise consonant production—the characteristic insufficient closure of the velopharyngeal port renders the speaker unable to build sufficient pressure in the oral cavity to properly form plosives, causing the air to instead leak out through the nose [25].

A. Related work

Spectral analysis of speech is a potentially effective method to analyze hypernasality. Acoustic cues based on formant F1 and F2 amplitudes, bandwidths, and pole/zero pairs [26], [27], [28], [29], [30], [31], [32], [33] and changes in the voice low tone/high tone ratio [34], [35] have been proposed to detect or evaluate hypernasal speech. These spectral modifications in hypernasal speech will have an impact on articulatory dynamics, thereby affecting speech intelligibility. Statistical signal processing methods that seek to reverse these cues, such as suppressing the nasal formant peaks and then performing peak-valley enhancement, have demonstrated improvement in the perceptual qualities of cleft palate and lip-caused hypernasal speech […]

[…] specific symptoms rather than the perceptual acoustic cues of hypernasality itself.

Features based on automatic speech recognition (ASR) acoustic models targeting articulatory precision have been used in nasality assessment systems [49]. The nasal cognate distinctiveness measure uses a similar approach but assesses the degree to which specific stops sound like their co-located nasal sonorants [57].

B. Contributions

We propose a novel set of features called "Nasalization-Articulation Precision" (NAP), addressing the limitations of the current methods with a hybrid approach that brings together domain expertise and machine learning. Our approach relies on a combination of two minimally-supervised acoustic models trained using only healthy speech data, with dysarthric speech data only used for training simple linear classifiers on top of the features. The first model learns a distribution of acoustic patterns associated with nasalization of voiced phonemes and the second model learns a distribution of acoustic patterns of precise articulation for unvoiced phonemes. For a dysarthric sample, the model produces phoneme-specific measures of nasalization and precise articulation. As a result, the features are intuitive and interpretable, and focus on hypernasality rather than other co-modulating factors.

In contrast to other approaches that rely on machine learning, the feature models are not trained on any clinical data. We show that these features can be combined using simple linear regression to develop a robust estimator of hypernasality that generalizes across different dysarthria corpora, outperforming both neural network-based and engineered feature-based approaches […]
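As an illustrative sketch only, the likelihood-feature idea behind NAP (two acoustic models fit on healthy speech whose average log-likelihoods on a new utterance feed a simple linear regressor) can be demonstrated with toy stand-ins. Nothing below comes from the paper itself: the class name, the Gaussian models, the 4-dimensional synthetic "frames," and the synthetic severity ratings are all hypothetical placeholders for the paper's actual acoustic models and clinical data.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianAcousticModel:
    """Toy stand-in for an acoustic model fit on healthy speech only."""
    def fit(self, frames):
        self.mean = frames.mean(axis=0)
        self.cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(frames.shape[1])
        self.cov_inv = np.linalg.inv(self.cov)
        _, self.logdet = np.linalg.slogdet(self.cov)
        return self

    def log_likelihood(self, frames):
        # Per-frame Gaussian log-density; lower values = more atypical frames.
        d = frames - self.mean
        mahal = np.einsum('ij,jk,ik->i', d, self.cov_inv, d)
        k = frames.shape[1]
        return -0.5 * (mahal + self.logdet + k * np.log(2 * np.pi))

# "Healthy" training frames for the nasalization model (voiced sounds)
# and the articulation-precision model (unvoiced sounds), synthetic here.
healthy_voiced = rng.normal(0.0, 1.0, size=(2000, 4))
healthy_unvoiced = rng.normal(3.0, 1.0, size=(2000, 4))
nasal_model = GaussianAcousticModel().fit(healthy_voiced)
artic_model = GaussianAcousticModel().fit(healthy_unvoiced)

def nap_features(voiced_frames, unvoiced_frames):
    # Utterance-level features: mean likelihood under each healthy model.
    return np.array([nasal_model.log_likelihood(voiced_frames).mean(),
                     artic_model.log_likelihood(unvoiced_frames).mean()])

# Simple linear regression (least squares) on top of the two features,
# trained on synthetic utterances whose frames drift away from "healthy"
# as the (synthetic) severity rating increases.
X, y = [], []
for severity in np.linspace(0, 1, 20):
    v = rng.normal(severity, 1.0, size=(100, 4))
    u = rng.normal(3.0 + severity, 1.0, size=(100, 4))
    X.append(nap_features(v, u))
    y.append(severity)
X = np.column_stack([np.ones(len(X)), np.array(X)])  # add intercept column
w, *_ = np.linalg.lstsq(X, np.array(y), rcond=None)
print(w.shape)  # (3,)
```

The design point the sketch illustrates is the paper's split of supervision: the acoustic models never see clinical data, and only the final low-capacity linear map is fit on rated speech, which is what limits overfitting to any one disease corpus.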
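For comparison with the instrumentation baseline discussed in the introduction: the two-microphone nasalance measurement reduces, at its core, to an energy ratio between the nasal and oral channels. A minimal sketch, assuming synchronized nasal- and oral-channel signals as NumPy arrays (the function name and the whole-signal energy formulation are our simplification, not the exact computation of the nasometer in [15]):

```python
import numpy as np

def nasalance_percent(nasal: np.ndarray, oral: np.ndarray) -> float:
    """Nasalance as an energy ratio: nasal acoustic energy over
    total (nasal + oral) energy, expressed in percent."""
    nasal_energy = float(np.sum(nasal.astype(np.float64) ** 2))
    oral_energy = float(np.sum(oral.astype(np.float64) ** 2))
    total = nasal_energy + oral_energy
    if total == 0.0:
        return 0.0
    return 100.0 * nasal_energy / total

# Synthetic check: two equal-energy channels give 50% nasalance.
t = np.linspace(0, 1, 16000, endpoint=False)
nasal = np.sin(2 * np.pi * 250 * t)  # stand-in for the nasal microphone
oral = np.cos(2 * np.pi * 500 * t)   # stand-in for the oral microphone
print(round(nasalance_percent(nasal, oral), 1))  # → 50.0
```

The sketch also makes the introduction's criticism concrete: the score requires a dedicated two-channel, plate-separated recording setup, so it cannot be computed retrospectively from ordinary single-microphone speech recordings.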