
Robust Estimation of Hypernasality in Dysarthria with Acoustic Model Likelihood Features

Michael Saxon (1), Ayush Tripathi (2), Yishan Jiao (3), Julie Liss (3), and Visar Berisha (1,3)

(1) School of Electrical, Computer, and Energy Engineering, Arizona State University
(2) Department of Electrical and Electronics Engineering, Visvesvaraya National Institute of Technology
(3) College of Health Solutions, Arizona State University

This work is partially supported by National Institutes of Health Grant 5R01DC006859-14.

Abstract—Hypernasality is a common characteristic symptom across many motor-speech disorders. For voiced sounds, hypernasality introduces an additional resonance in the lower frequencies and, for unvoiced sounds, there is reduced articulatory precision due to air escaping through the nasal cavity. However, the acoustic manifestation of these symptoms is highly variable, making hypernasality estimation very challenging, both for human specialists and automated systems. Previous work in this area relies on either engineered features based on statistical signal processing or machine learning models trained on clinical ratings. Engineered features often fail to capture the complex acoustic patterns associated with hypernasality, whereas metrics based on machine learning are prone to overfitting to the small disease-specific speech datasets on which they are trained. Here we propose a new set of acoustic features that capture these complementary dimensions. The features are based on two acoustic models trained on a large corpus of healthy speech. The first acoustic model aims to measure nasal resonance from voiced sounds, whereas the second acoustic model aims to measure articulatory imprecision from unvoiced sounds. To demonstrate that the features derived from these acoustic models are specific to hypernasal speech, we evaluate them across different dysarthria corpora. Our results show that the features generalize even when training on hypernasal speech from one disease and evaluating on hypernasal speech from another disease (e.g., training on Parkinson's disease, evaluation on Huntington's disease), and when training on neurologically disordered speech but evaluating on cleft speech.

Index Terms—hypernasality, dysarthria, velopharyngeal dysfunction, clinical speech analytics, speech features.

I. INTRODUCTION

HYPERNASALITY refers to the perception of excessive nasal resonance in speech, caused by velopharyngeal dysfunction (VPD), an inability to achieve proper closure of the velum, which regulates airflow between the oral and nasal cavities. Hypernasality is a common symptom in motor-speech disorders such as Parkinson's Disease (PD), Huntington's Disease (HD), amyotrophic lateral sclerosis (ALS), and cerebellar ataxia, as velar movement requires precise motor control [1], [2], [3], [4]. It is also the defining perceptual trait of cleft palate speech [5]. Reliable detection of hypernasality is useful in both rehabilitative (e.g., tracking the progress of speech therapy) and diagnostic (e.g., early detection of neurological diseases) settings [6], [7]. Because of the promise hypernasality tracking shows for assessing neurological disease, there is interest in developing strategies that are robust to the limitations of existing techniques—in this work we present automated metrics for hypernasality scoring that are robust to disease- and speaker-specific confounders.

Clinician perceptual assessment is the gold-standard technique for assessing hypernasality [8]. However, this method has been shown to be susceptible to a wide variety of error sources, including stimulus type, phonetic context, vocal quality, articulation patterns, and previous listener experience [9]. Additionally, these perceptual metrics have been shown to erroneously overestimate severity on high vowels when compared with low vowels [10], and vary based on broader phonetic context [11]. Although these difficulties may be mitigated by averaging multiple clinician ratings, this further drives up costs associated with hypernasality assessment and makes its use as a trackable metric over time less feasible.

Automated hypernasality assessment systems have been proposed as an objective alternative to perceptual assessment. Instrumentation-based direct assessment techniques visualize the velopharyngeal closing mechanism using videofluoroscopy [12] or magnetic resonance imaging [13] and provide information about velopharyngeal port size and shape [14]. These methods are invasive and can cause patients pain and discomfort. Alternatively, nasometry measures nasalance, the modulation of the velopharyngeal opening area, by estimating the acoustic energy from the nasal cavity relative to the oral cavity with two microphones separated by a plate that isolates the mouth from the nose [15]. In some cases, nasalance scores yield a modest correlation with perceptual judgment of hypernasality [16], [17]; however, there is considerable evidence that this relationship depends on the person and the reading passages used during assessment [17], [18]. Furthermore, properly administering the evaluation requires significant training and it cannot be used to evaluate hypernasality from existing speech recordings. Thus, the clinician's perception of hypernasality is often the de-facto gold-standard in clinical practice [19].

An appealing alternative to instrumentation-based techniques is the direct estimation of hypernasality from recorded audio. This family of methods aims to measure the atypical acoustic resonance from VPD as an objective proxy for hypernasal speech. Systems that can accurately and consistently rate changes to a patient's hypernasality directly from the speech signal could enable remote tracking of neurological disease progression from a mobile device or home computer [20].

Previous work in this area can be categorized broadly into two groups: engineered features based on statistical signal processing [21] and supervised methods based on machine learning [22]. The simple acoustic features fail to capture the complex manifestation of hypernasality in speech, as there is a great deal of person-to-person variability [23]. The more complex machine learning-based metrics are prone to overfitting to the necessarily small disease-specific speech datasets, making it difficult to evaluate how effectively they generalize.

In this paper, we propose an approach that falls between these two extremes. We know that for voiced sounds hypernasal speech results in additional resonances at the lower frequencies [24]. The acoustic manifestation of the additional resonance is difficult to characterize with simple features as it is dependent on several factors including the physio-anatomy of the speaker, the context, etc. For unvoiced sounds, hypernasal speech results in imprecise consonant production—the characteristic insufficient closure of the velopharyngeal port renders the speaker unable to build sufficient pressure in the oral cavity to properly form plosives, causing the air to instead leak out through the nose [25].

A. Related work

Spectral analysis of speech is a potentially effective method to analyze hypernasality. Acoustic cues based on formant F1 and F2 amplitudes, bandwidths, and pole/zero pairs [26], [27], [28], [29], [30], [31], [32], [33] and changes in the voice low tone/high tone ratio [34], [35] have been proposed to detect or evaluate hypernasal speech. These spectral modifications in hypernasal speech will have an impact on articulatory dynamics, thereby affecting speech intelligibility. Statistical signal processing methods that seek to reverse these cues, such as suppressing the nasal formant peaks and then performing peak-valley enhancement, have demonstrated improvement in the perceptual qualities of cleft palate and lip-caused hypernasal speech [36], further demonstrating the connection between these cues and intelligibility. The large variability of speech degradation patterns across neurological disease or injury challenges simple features that are based on domain expertise [37]. Overall, these simple features are not robust to the complicated acoustic patterns that emerge in hypernasality, and are prone to high false positive and negative error rates in out-of-domain test cases.

In response, data-derived representations of hypernasality that combine more elemental speech features and supervised learning have been proposed. Mel-frequency cepstral coefficients (MFCCs) and other spectral transformations [38], [39], [37], [40], [41], [42], [43], [44], [45], glottal source related features (jitter and shimmer) [46], [47], the difference between the low-pass and band-pass profile of the Teager Energy Operator (TEO) [48], [49], and non-linear features [50], [51] have all been proposed as model input features. Gaussian mixture models (GMM), support vector machines, and deep neural networks have been used in conjunction with these features for hypernasality evaluation from word and sentence level data [52], [53], [54], [55]. Recently, end-to-end neural networks taking MFCC frames as input and producing hypernasality assessments as output have also been proposed [56]. These methods rely on supervised learning and are trained on small data sets. For our application they run the risk of overfitting to the data by focusing on associated disease-specific symptoms rather than the perceptual acoustic cues of hypernasality itself.

Features based on automatic speech recognition (ASR) acoustic models targeting articulatory precision have been used in nasality assessment systems [49]. The nasal cognate distinctiveness measure uses a similar approach but assesses the degree to which specific stops sound like their co-located nasal sonorants [57].

B. Contributions

We propose a novel set of features called "Nasalization-Articulation Precision" (NAP), addressing the limitations of the current methods with a hybrid approach that brings together domain expertise and machine learning. Our approach relies on a combination of two minimally-supervised acoustic models trained using only healthy speech data, with dysarthric speech data only used for training simple linear classifiers on top of the features. The first model learns a distribution of acoustic patterns associated with nasalization of voiced phonemes and the second model learns a distribution of acoustic patterns of precise articulation for unvoiced phonemes. For a dysarthric sample, the model produces phoneme-specific measures of nasalization and precise articulation. As a result, the features are intuitive and interpretable, and focus on hypernasality rather than other co-modulating factors.

In contrast to other approaches that rely on machine learning, the feature models are not trained on any clinical data. We show that these features can be combined using simple linear regression to develop a robust estimator of hypernasality that generalizes across different dysarthria corpora, outperforming both neural network-based and engineered feature-based approaches. Furthermore, we demonstrate that this estimator indeed robustly captures the perceptual attributes of hypernasality by training on our full dysarthria corpus and evaluating on a cleft lip and palate (CLP) dataset, with speakers who are, apart from the cleft palate-induced hypernasality, otherwise healthy speakers.

To evaluate the efficacy of this new feature set, we train and validate a linear model that estimates clinician-rated hypernasality scores across different neurological diseases to ensure that it is focusing on hypernasality and not other disease-specific dysarthria symptoms. We show that this representation correlates strongly with clinical perception of hypernasality across several different neurological disorders, even when trained on one disorder and evaluated on a different disorder. Such assessment has a potential advantage as an inexpensive, non-invasive and simple-to-administer method, scalable to large and diverse populations [58], [59].

II. METHODS

The proposed hypernasality evaluation algorithm is based on the intuition that as the severity of hypernasality increases, two broad perceptible changes take place: the unvoiced phonemes become less precise and the voiced phonemes become nasalized. To that end, we model these perceptual changes at the phone level. In Fig. 1, we provide a high-level overview of the proposed hypernasality score estimation scheme.

Fig. 1: A high-level diagram of the NAP method. The leftmost pre-processing segment depicts the forced alignment of transcript to audio as well as the aligned word-phoneme-nasal class segmentation of the speech signal and spectrogram.
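To make the pipeline in Fig. 1 concrete, the following sketch outlines how the pieces fit together for a single utterance. It is an illustration only: the callables (align, nasalization, articulation_precision) and the unvoiced phoneme set are hypothetical placeholders, not code or an exhaustive inventory from the authors, who use the Montreal Forced Aligner and the Kaldi-based models described below.

    from collections import defaultdict
    import numpy as np

    # Voiced non-nasal phonemes scored by the nasalization model (Sec. II-C);
    # the unvoiced set below is an assumed example list for the AP model.
    VOICED_ORAL = {"AA", "AE", "AH", "AO", "AW", "AY", "B", "D", "DH",
                   "EH", "ER", "EY", "G", "IY", "JH", "V", "Z"}
    UNVOICED = {"P", "T", "K", "F", "S", "SH", "CH", "TH", "HH"}

    def nap_features(audio, transcript, align, nasalization, articulation_precision):
        """Route each force-aligned phoneme to the appropriate model and
        average the per-phoneme scores over the utterance."""
        scores = defaultdict(list)
        for phone, frames in align(audio, transcript):
            if phone in VOICED_ORAL:
                scores["N(%s)" % phone].append(nasalization(frames))
            elif phone in UNVOICED:
                scores["AP(%s)" % phone].append(articulation_precision(frames))
        return {name: float(np.mean(vals)) for name, vals in scores.items()}

    def estimate_score(features, linear_model, feature_names):
        # Missing phonemes default to 0.0 here purely for illustration.
        x = np.array([[features.get(name, 0.0) for name in feature_names]])
        return float(linear_model.predict(x)[0])

The per-phoneme feature averages produced this way are what the simple linear model in Section II-E consumes.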

After forced alignment and pre-processing, the input speech is routed phoneme-by-phoneme to one of two acoustic models. The voiced phonemes are analyzed using the first model, yielding an objective estimate of acoustic nasalization based on a likelihood ratio. The unvoiced phonemes are analyzed with the second model, which captures the production quality of unvoiced phonemes, objectively estimating articulatory precision as another likelihood ratio. Both the acoustic nasalization model and the articulatory precision model are trained using healthy speech from the LibriSpeech dataset [60]. The features from these models are averaged by phoneme, then input to a simple linear model to predict clinician-assessed hypernasality ratings across several different neurological disorders. We describe the data and processing steps in detail below.

A. Data

1) Healthy speech corpus: LibriSpeech is a public domain corpus of healthy English utterances with transcripts. It contains roughly 1000 hours of speech from 1,128 female and 1,210 male speakers reading book passages aloud, sampled at 16 kHz. It contains 100 and 360 hour "clean" subsets, which have been carefully segmented and aligned [60]. We use this corpus to train both acoustic models shown in Fig. 1.

2) Dysarthric speech corpus: The dataset contains recordings from 75 speakers (40 male and 35 female) with varying levels of hypernasality. The corpus contains data from speakers diagnosed with several different neurological disorders: 38 patients have Parkinson's disease (PD), 6 have Huntington's disease (HD), 16 have cerebellar Ataxia (A), and 15 have amyotrophic lateral sclerosis (ALS). All individuals read the same set of five sentences, capturing a range of phonemes. Reading is an ideal stimulus for this task because it controls for phonetic distributional variations that would be present in more spontaneous speech and allows for consistency between speakers and between assessments over time, ideal qualities for a clinical measure.

The perceptual evaluation of hypernasality from recorded samples was carried out by 14 different speech language pathologists on a scale of 1 to 7. The average hypernasality score for each speaker was used as the ground truth. The inter-rater reliability of the SLPs was moderate, with an average inter-clinician mean absolute error of 1.44 on the 7-point scale. The sentences spoken were:

1) The supermarket chain shut down because of poor management.
2) Much more money must be donated to make this department succeed.
3) In this famous coffee shop they serve the best doughnuts in town.
4) The chairman decided to pave over the shopping center garden.
5) The standards committee met this afternoon in an open meeting.

The speech recordings were carried out in a sound-treated room using a microphone. Table I shows the breakdown of clinical characteristics of the subjects and the statistics of the nasality score (NS) subsets. S.D. denotes standard deviation. Figure 2 contains the clinician hypernasality score histograms for each disorder population.

TABLE I: Clinical characteristics and nasality scores of the subjects.

Disease | Male | Female | Mean Age | S.D. Age | Mean NS | S.D. NS
PD      | 20   | 18     | 71.06    | 9.62     | 2.55    | 0.75
A       | 6    | 10     | 62.47    | 14.05    | 3.58    | 0.68
ALS     | 8    | 7      | 59.54    | 13.23    | 4.41    | 0.85
HD      | 6    | 0      | 58.40    | 13.20    | 3.31    | 0.59
Total   | 40   | 35     | 65.80    | 12.67    | 3.20    | 1.04

3) Cleft Palate speech corpus: While we are chiefly concerned with evaluating hypernasality in dysarthric speakers exhibiting neuromuscular diseases, cleft lip and palate (CLP) speech is useful for evaluation purposes, as CLP speech often exhibits hypernasality without the other kinds of perceptual changes (slurring, generalized articulatory imprecision) that also arise in dysarthria. We use a corpus of 6 child and 12 adult CLP speakers with different levels of hypernasality severity that span the hypernasality range (from normal to extreme) in equal intervals [61], to demonstrate that our model chiefly captures hypernasality rather than any associated neurologically disordered speech symptoms.

Fig. 2: Distribution of clinician-rated hypernasality score by disorder.

B. Data pre-processing

Consider an utterance x(t) with sampling rate F_s and a corresponding transcript of phonemes p_j, {p_1, p_2, . . . , p_Np}. We analyze x(t) with a 20 ms frame length and 10 ms overlap. For a frame indexed by i, x_i(t), we extract a set of features, x_i. The utterance x(t) is force-aligned using the Montreal Forced Aligner¹ [62] at the phoneme level. We denote the data feature matrix for all frames that are aligned to phoneme p_j by X^{p_j}.

After alignment, we use different features for the acoustic nasalization model than for the articulatory precision model. For the nasalization model we use perceptual linear prediction (PLP) features because they better preserve acoustic cues that have been previously used to model hypernasality, including formant frequencies, bandwidths, and spectral tilt [63]. For the articulatory precision model for unvoiced phonemes, we use mel-frequency cepstral coefficients (MFCCs), a common representation for automatic speech recognition applications [64]. We also use a lower sampling rate for the voiced nasalization model (8 kHz) compared to the unvoiced articulatory precision model (16 kHz). This is motivated by the difference in spectral energy distribution between voiced and unvoiced sounds.

C. Nasalization model

To assess nasalization on a phoneme level, we model the distributions of two classes of voiced phonemes. The "oral" non-nasal (ORL) class consists of all voiced oral consonants and all vowels from syllables where nasal consonants are not present. Similarly, we define the "nasal" class (NAS) to contain the nasal consonants as well as half of the adjacent vowels surrounding them. These rules were implemented after alignment; an illustrative example of the two classes is shown in the third tier of the aligned example in Fig. 1.

For this task, we use 100 hours of clean-labeled speech from the LibriSpeech dataset [60]. We first perform forced phone-alignment to the transcript as shown in Figure 1. We partition all phonemes into the NAS and ORL classes. For each frame in each phoneme, we extract 13 PLP coefficients, giving two feature matrices, X^{NAS} and X^{ORL}, containing all frames of nasal PLPs in one, and non-nasal PLPs in the other. To model the probability density functions, we use a 16-mixture Gaussian Mixture Model (GMM). The weight, mean, and covariance matrix for each of the GMM components is learned using the expectation maximization (EM) algorithm. The GMM for the nasal class is represented by λ_NAS = {µ_i^NAS, Σ_i^NAS, ω_i^NAS}, i = 1, 2, . . . , 16. Here, µ_i^NAS, Σ_i^NAS and ω_i^NAS represent the mean, covariance matrix and weight of the ith Gaussian, respectively. Similarly, for the non-nasal class the GMM components are given by λ_ORL = {µ_i^ORL, Σ_i^ORL, ω_i^ORL}, i = 1, 2, . . . , 16.

After training on healthy speech, we provide a segmented dysarthric utterance to evaluate the likelihood from each of the two learned probability density functions. For an out-of-sample input, we estimate the likelihood, voiced phoneme by voiced phoneme. That is, for data feature matrix X^{p_j}, the likelihood that this phoneme is nasalized is

f(X^{p_j} | λ_NAS) = ∏_{i ∈ p_j} f(x_i | λ_NAS),    (1)

where the notation i ∈ p_j is shorthand for all 20 ms frames aligned to phoneme p_j. Similarly, for the ORL class,

f(X^{p_j} | λ_ORL) = ∏_{i ∈ p_j} f(x_i | λ_ORL).    (2)

We use the log-likelihood ratio test statistic as a continuous measure of nasalization. In particular, we define

N(p_j) = log [ f(X^{p_j} | λ_NAS) / f(X^{p_j} | λ_ORL) ] / |X^{p_j}|,    (3)

where |X^{p_j}| represents the number of acoustic frames aligned to phoneme p_j. This statistic is calculated for every voiced, non-nasal phoneme in the input utterance. Thus, for a given speaker, a nasalization log-likelihood ratio is computed for all the voiced phonemes (AA, AE, AH, AO, AW, AY, B, D, DH, EH, ER, EY, G, IY, JH, V, Z)².

For non-nasalized speech, we expect the value of N(X^{p_j}) to be low, whereas for nasalized speech, we expect it to be high. Figure 3 shows a comparison of the value of the nasalization likelihood feature between a group of high hypernasality (> 4 perceptual rating) and a group of low hypernasality (< 3 perceptual rating) speakers. We average the hypernasality scores for the 4 most relevant phonemes for predicting hypernasality (see Section III-B for details). As expected, there is an increase in the nasality feature value for an increase in severity of hypernasality.

¹See Section IV-B for discussion on forced alignment performance for these dysarthric speech samples.
²In this work we use ARPAbet codes to transcribe English phonemes [65].
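As a concrete illustration of Eqs. (1)-(3), the sketch below trains the two 16-component GMMs on healthy-speech PLP frames and computes the per-phoneme nasalization feature. It is a minimal sketch assuming pre-computed 13-dimensional PLP frame matrices (plp_nasal_frames, plp_oral_frames, and phoneme_frames are hypothetical inputs); it uses scikit-learn's GaussianMixture rather than the authors' own implementation.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Train the NAS and ORL class models on healthy speech.
    # plp_nasal_frames, plp_oral_frames: (num_frames, 13) PLP matrices.
    def train_class_models(plp_nasal_frames, plp_oral_frames, n_components=16):
        gmm_nas = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm_orl = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm_nas.fit(plp_nasal_frames)   # lambda_NAS
        gmm_orl.fit(plp_oral_frames)    # lambda_ORL
        return gmm_nas, gmm_orl

    # Per-phoneme nasalization feature N(p_j), Eq. (3): the frame-averaged
    # log-likelihood ratio between the NAS and ORL models.
    def nasalization_feature(phoneme_frames, gmm_nas, gmm_orl):
        # score_samples returns per-frame log-likelihoods, so their sum is
        # the log of the products in Eqs. (1) and (2).
        ll_nas = gmm_nas.score_samples(phoneme_frames).sum()
        ll_orl = gmm_orl.score_samples(phoneme_frames).sum()
        return (ll_nas - ll_orl) / len(phoneme_frames)

Normalizing by the number of frames keeps the feature comparable across phonemes of different durations, which is why Eq. (3) divides by |X^{p_j}|.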

Fig. 3: Bar chart of the nasalization feature for D, B, IY, and AA for low hypernasality and high hypernasality speakers.

Fig. 4: Bar chart of the articulatory precision feature for F and T for low hypernasality and high hypernasality speakers.

D. Articulation model

The articulation model is an implementation of the goodness of pronunciation algorithm [66] based on an acoustic model trained using Kaldi as specified in [67]. For our implementation we used a triphone model trained with a Gaussian Mixture Model-Hidden Markov Model on 960 hours of healthy native English speech data from the LibriSpeech corpus [60]. The input features to the ASR model are 39-dimensional second-order Mel-Frequency Cepstral Coefficients (MFCCs), including deltas and delta-deltas, with utterance-level cepstral mean variance normalization and a Linear Discriminant Analysis transformation. We use the Kaldi toolkit training scripts for training the model. In contrast to the nasalization model, here we use a sampling rate of 16 kHz to capture the wideband nature of unvoiced phonemes. The nasalization model required a much smaller training set (100 hours) since there were only two classes modeled by the GMM.

After training, the acoustic model can be queried using the Viterbi decoding algorithm for the posterior probability P(X|q) of a given set of acoustic feature frames X representing a realization of some phoneme q. For a "well-articulated" phoneme, no phoneme apart from the one intended by the speaker should maximize this posterior.

We use the acoustic model to assess articulatory precision as follows. Considering the set of phonemes Q in the language, we assess the log-likelihood ratio of the frames X^{p_j} from a given phoneme p_j, to the maximum log-likelihood across all phonemes,

AP(p_j) = log [ P(X^{p_j} | p_j) / max_{q ∈ Q} P(X^{p_j} | q) ] / |X^{p_j}|,    (4)

where |X^{p_j}| represents the number of acoustic frames aligned to phoneme p_j. This processing is performed after forced alignment to the transcript labels, and assessed for each unvoiced phoneme to permit by-phoneme analysis of precise articulation.

For speakers who exhibit little hypernasality, we expect the value of AP(X^{p_j}) for unvoiced phonemes to be high, whereas for hypernasal speakers, we expect it to be lower.
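To make Eq. (4) concrete, the sketch below shows how the articulatory precision feature can be computed once an acoustic model can return a total log-likelihood for a phoneme hypothesis over a span of frames. This is a minimal illustration, not the authors' Kaldi recipe: phone_log_likelihood is a hypothetical callable standing in for the Viterbi-scored P(X|q) from the trained triphone model.

    # Articulatory precision, Eq. (4): log-likelihood of the intended phoneme
    # relative to the best competing phoneme, normalized by frame count.
    # phone_log_likelihood(frames, q) -> total log P(frames | q)  (hypothetical)
    def articulation_precision(frames, intended_phone, phone_set, phone_log_likelihood):
        ll_intended = phone_log_likelihood(frames, intended_phone)
        ll_best = max(phone_log_likelihood(frames, q) for q in phone_set)
        return (ll_intended - ll_best) / len(frames)

    # AP is at most 0: it equals 0 when no competing phoneme scores higher
    # than the intended one, and grows more negative as the realization
    # drifts toward other phonemes (e.g., a weakly articulated plosive).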
Figure 4 shows a comparison of the average value of the articulation precision feature between a group of high hypernasality (> 4 perceptual rating) subjects and a group of low hypernasality (< 3 perceptual rating) subjects. We average the articulation scores for the most relevant phonemes for predicting hypernasality (see Section III-B). As expected, there is a decrease in the articulation precision feature value for an increase in severity of hypernasality. Furthermore, we expect hypernasality to exhibit unique patterns in terms of affected and unaffected unvoiced phonemes, which are not general to dysarthria [57], making phoneme-level AP classification a valuable signal in quantifying hypernasality.

E. Linear regression model

In the interest of generalization and clinical interpretability, simple linear ridge regression models [68] are used to estimate the nasality score using the phoneme-averaged nasalization and articulatory precision features as input.

Two different cross-validation strategies are used to evaluate model performance and the quality of the input features. First, we use leave-one-speaker-out (LOSO) cross-validation, as is typically done. To evaluate the generalization across out-of-domain diseases, we also perform leave-one-disease-out (LODO) cross-validation. In LODO, data for three of the neurological conditions is used for training and the fourth is used for testing.

III. EXPERIMENTS AND RESULTS

In this section we evaluate the efficacy of the features in two ways: by analyzing the correlation between individual per-feature averages and speaker hypernasality rating, and by observing the performance of a simple linear regression model directly calculating the hypernasality score for a speaker from their features. For the hypernasality score computation task, we evaluate the performance of a model against two baselines. Models are compared using mean absolute error (MAE) and the Pearson correlation coefficient (PCC) between the clinical perceptual hypernasality score and the predicted hypernasality scores.
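A minimal sketch of the LODO evaluation loop described in Section II-E, producing the MAE and PCC reported in Table II, is given below. It assumes a feature matrix X, clinician scores y, and a per-speaker disease label array; the variable names and the default ridge penalty are illustrative, not taken from a released codebase.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.linear_model import Ridge

    def evaluate_lodo(X, y, disease_labels, alpha=1.0):
        """Leave-one-disease-out evaluation returning (MAE, PCC) per disease."""
        results = {}
        for disease in np.unique(disease_labels):
            test = disease_labels == disease
            model = Ridge(alpha=alpha).fit(X[~test], y[~test])
            pred = model.predict(X[test])
            mae = float(np.mean(np.abs(pred - y[test])))
            pcc = float(pearsonr(pred, y[test])[0])
            results[disease] = (mae, pcc)
        return results

    # LOSO is the same loop with one speaker (row) held out at a time
    # instead of one disease group.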

TABLE II: Comparative evaluation of state-of-the-art formant features (FF-Linear, -Additive, -KNN) and neural network-based (MFCC-, PASE-NN) approaches and the clinician raters (Human) against our NAP features for predicting clinician hypernasality score, conducted using leave-one-speaker-out (LOSO) and leave-one-disease-out (LODO) cross validation. We report mean absolute error (MAE) on the 7-point scale and Pearson correlation coefficient (PCC). Bold denotes best performance overall for a metric; italic denotes best non-human performance for a metric when applicable.

Train on:            Ataxia, HD, PD, ALS | HD, PD, ALS   | Ataxia, PD, ALS | Ataxia, HD, ALS | Ataxia, PD, HD
Test on:             Left-out speaker    | Ataxia        | HD              | PD              | ALS

Model             |  MAE    PCC   |  MAE    PCC    |  MAE    PCC    |  MAE    PCC   |  MAE    PCC
FF-Linear         | 0.871  0.180  | 0.823   0.042  | 0.666  -0.751  | 1.316  0.351  | 1.426  -0.425
FF-Additive       | 0.789  0.435  | 0.730  -0.123  | 0.693  -0.557  | 1.334  0.277  | 1.260   0.429
FF-KNN            | 0.754  0.481  | 0.781   0.333  | 0.567   0.381  | 1.218  0.402  | 1.227  -0.039
MFCC-NN           | 0.884  0.458  | 0.904  -0.120  | 0.429   0.568  | 0.800  0.457  | 1.233   0.315
PASE-NN           | 0.774  0.417  | 0.707  -0.204  | 0.433   0.237  | 1.150  0.163  | 1.407   0.176
NAP-Linear (ours) | 0.587  0.722  | 0.546   0.750  | 0.559   0.737  | 0.509  0.697  | 0.597   0.527
Human             | 0.832  0.725  | 0.871   0.476  | 1.256   0.550  | 0.746  0.636  | 0.979   0.601

Fig. 5: LOSO results from predicting the hypernasality score for the simple feature baseline with (a) the KNN classifier and simple formant features (FF-KNN), (b) the neural network baseline (MFCC-NN), and (c) the NAP features with simple linear regression (NAP-Linear).

A. Hypernasality evaluation

In Table II, we show the results of the evaluations for six different models alongside comparable evaluations of human clinician raters. We evaluated our NAP features against comparison models representing the state of the art in engineered features and in supervised learning. The most predictive acoustic features for hypernasality presented in [69] were extracted using Praat source code provided by the authors [70]. These formant features (FF) included F1 formant amplitude, P0 nasality peak amplitude, and normalized and raw A1−P0 difference [71]. All features were extracted for each vowel and used in a linear and non-linear model to estimate the clinician-assessed hypernasality labels. The linear model is based on simple multiple regression whereas the non-linear models are based on additive regression and k-nearest neighbor regression. The results of these models are labeled FF-Linear, FF-Additive, and FF-KNN in Table II.

In addition, we evaluated two neural networks. We implemented the neural network proposed in [56], consisting of three feed-forward layers with sigmoid activations. The inputs to the networks are 39-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) computed with a 20-ms window length and no overlap. The hidden layer is of size 100, and the output layer of size 1. The output value is averaged across all frames to provide a single nasality score estimation per speaker. We also implemented a basic model using the Problem Agnostic Speech Encoder (PASE) [72]. The PASE encodings were first extracted from the raw audio, then fed through three feed-forward layers with ReLU activations, followed by a single LSTM layer, all with hidden size 250. After max-pooling in time, a final feed-forward layer projects the latent codes to the final hypernasality score estimation. Both models are trained using L1 loss and the Adam optimizer [73] for 50 epochs with a learning rate of 0.001. These models' results are reported in Table II as MFCC-NN and PASE-NN, respectively.

Finally, we treat each of the 14 clinician hypernasality severity evaluators as an individual estimator, with which we assess MAE and PCC from the ground-truth average hypernasality severity scores. For the LOSO evaluation we average the 14 human evaluator MAE and PCC scores across the 75 speakers, and then average these across the 14 evaluators to get an average human baseline MAE and PCC. Similarly, for the LODO conditions we evaluate only the evaluation disease subset. These results are presented in Table II as "Human."

The results show that the linear model based on NAP features consistently outperforms the baseline FF and NN models, achieving the best performance of all systems on all measures but LODO HD MAE. The improved performance of our system over the baselines is most pronounced on the LODO conditions, suggesting that our model is more consistently robust to disease-specific manifestations.
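For reference, a minimal PyTorch sketch of the MFCC-NN baseline described above (feed-forward layers with sigmoid activations, hidden size 100, frame-level outputs averaged per speaker, L1 loss with Adam at learning rate 0.001) might look like the following. The layer sizes follow the text; the exact reading of "three feed-forward layers", the class name, and the training-loop details are assumptions for illustration, not the original implementation from [56].

    import torch
    import torch.nn as nn

    class MFCCNasalityNet(nn.Module):
        """Frame-level 39-dim MFCC -> hypernasality score, averaged over frames."""
        def __init__(self, n_mfcc=39, hidden=100):
            super().__init__()
            # Interpreted here as input -> hidden(100) -> hidden(100) -> output(1).
            self.net = nn.Sequential(
                nn.Linear(n_mfcc, hidden), nn.Sigmoid(),
                nn.Linear(hidden, hidden), nn.Sigmoid(),
                nn.Linear(hidden, 1),
            )

        def forward(self, frames):          # frames: (num_frames, 39)
            return self.net(frames).mean()  # one score per speaker/utterance

    def train_step(model, optimizer, frames, target):
        optimizer.zero_grad()
        loss = nn.functional.l1_loss(model(frames), target)
        loss.backward()
        optimizer.step()
        return loss.item()

    # optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # 50 epochs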


Fig. 6: Plots of the two most prominent articulatory precision features (AP(T) and AP(F)) as well as four of the most prominent nasalization features (N(D), N(B), N(IY), and N(AA)) against clinician-assessed nasality score.

The differences are also apparent when we analyze the individual LOSO correlation plots in Fig. 5. These scatter plots relate the estimated hypernasality score for each speaker against the actual hypernasality score. As is clear from the figures, the correlation of the baseline methods is largely driven by the samples with very high nasality scores. The NAP model exhibits a linear trend between the predicted and actual values throughout the hypernasality range.

Furthermore, our model outperforms the human evaluators in all but two measures, LOSO PCC and LODO-ALS PCC.

B. Individual feature contributions

We use a simple forward selection algorithm for the LOSO model to identify the most predictive NAP features. The algorithm identifies the subset of features that minimizes the cross-validation mean square error between the model-predicted and clinical hypernasality rating, with features iteratively added until the error stops decreasing. This procedure results in 6 non-redundant features selected for prediction. This includes the articulatory precision for T and F and the nasalization for D, B, IY, and AA. We plot the top features against the clinical perceptual nasality ratings in Fig. 6; Fig. 7 depicts the marginal improvement in LOSO correlation as features are added in by decreasing feature prominence.

Fig. 7: Cumulative marginal improvement plot of leave-one-speaker-out correlation with the addition of the most optimal articulatory precision and nasalization features.
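The forward selection procedure described above is a standard greedy search; a minimal sketch is shown below, assuming a cross_val_mse(feature_subset) helper (hypothetical, standing in for the LOSO mean-squared-error evaluation of a ridge model restricted to the chosen feature columns).

    def forward_select(candidate_features, cross_val_mse):
        """Greedy forward selection: add the feature that most reduces
        cross-validation MSE, stopping when no addition helps."""
        selected, best_mse = [], float("inf")
        remaining = list(candidate_features)
        while remaining:
            scores = {f: cross_val_mse(selected + [f]) for f in remaining}
            best_feature = min(scores, key=scores.get)
            if scores[best_feature] >= best_mse:
                break                      # error stopped decreasing
            best_mse = scores[best_feature]
            selected.append(best_feature)
            remaining.remove(best_feature)
        return selected

    # e.g. forward_select(["AP(T)", "AP(F)", "N(D)", "N(B)", "N(IY)", "N(AA)", ...],
    #                     cross_val_mse) would return the selected subset in order.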

IV. SUPPLEMENTARY ANALYSIS

The previous section experimentally addresses the core question of whether our novel model achieves improved results in estimating hypernasality in individuals with neuromuscular disease in ways that generalize across diseases. In this section, we perform supplementary experiments to interrogate specific aspects of the features, model, and approach.

A. Role of articulatory precision in hypernasality

Articulatory precision and hypernasality are tightly linked. Neurological VPD, which gives rise to hypernasality, is often accompanied by impaired articulatory precision. The neurological conditions we study herein impact several aspects of speech production, including respiration, voicing, resonance, and articulation. This brings up two important questions:

• Do our features capture changes related to hypernasality that go beyond changes in articulatory precision?
• Are our features sensitive to changes in articulatory precision that result from only hypernasality (and not other articulatory impairments resulting from dysarthria)?

Fig. 8: Cumulative marginal improvement plot of LOSO correlation for the most optimal articulatory precision and nasalization features, and clinician articulatory precision (CAP).

Fig. 9: Plot of average alignment errors per speaker (s) against clinician-rated articulatory precision at the phone level. Dashed line indicates an average alignment error of 25 ms.

In an attempt to decouple articulatory precision from hypernasality, we collect clinical articulatory precision ratings (in addition to the hypernasality ratings) from the same clinicians. The inter-rater reliability of the ratings was robust, with an average pairwise Pearson correlation coefficient of 0.75 and a mean absolute error of 1.01 on a 7-point scale.

To answer the first question above, and demonstrate that our features capture information beyond changes in articulatory precision, we use a multiple linear regression model with clinician-rated articulatory precision alongside our six most predictive features (N(AA), N(IY), N(B), N(D), AP(T), AP(F)) as independent variables. The dependent variable is the clinical hypernasality rating. We once again use the forward selection algorithm on PCC to cumulatively select the most predictive features. The results are depicted in Figure 8. As expected, the subjective AP rating is most predictive as there is significant overlap with hypernasality, and it is selected first. In the presence of this generalized measure of articulatory precision, it makes sense that AP(T, F), features that are themselves estimating AP, would not be selected. This reinforces the rationale for their inclusion in the model. Three nasalization features, N(IY, AA, D), are able to further improve the correlation of the linear model predictions.

To answer the second question, and demonstrate that our features are sensitive to hypernasality alone, we evaluate a linear model trained on our full dataset of dysarthric speech, using the six most predictive features, predicting hypernasality scores for the 18 speech samples from individuals with cleft lip and palate in our CLP dataset. The linear hypernasality model trained on our dysarthric speech corpus achieves a hypernasality severity prediction PCC of 0.89 for the adults and 0.82 for the children. This provides additional evidence that our features capture the perceptual quality of hypernasality and not other co-modulating symptoms.

B. Effectiveness of forced alignment

The features we have proposed herein rely on force-aligning known transcripts to dysarthric speech [74]. This can be problematic as co-articulation, blending, missed targets, distorted vowels, and poor articulation present in severely disordered speech [75] may interfere with the appropriate matching of dictionary phoneme-word pairs to the realized sounds [76].

We directly evaluate the prevalence of alignment errors generated by our forced alignment methodology using manually aligned transcripts. Two annotators produced word- and syllable-level aligned transcripts using the same spelling and phoneme-word conventions employed in the acoustic model dictionary for all utterances in the dataset. For each speaker we count word- and phone-level alignment errors based on the position of the center point of a word or phoneme t'_c as assessed by the forced aligner and the beginning and end of the corresponding word or syllable, t_min, t_max, as assessed by the human transcriber. For each word or phoneme, the error is counted as t_e = max(0, t_min − t'_c, t'_c − t_max). This error measure returns 0 if the center of the phoneme falls within the syllable; otherwise it returns the maximum error between the center of the automatically aligned phoneme and the start and end of the manually-aligned syllable. In Figure 9 we show the alignment error (in sec.) against the hypernasality rating to show how alignment error rates progress as hypernasality increases. The results show that for all but the most severely hypernasal speakers forced alignment works effectively.

These results also indicate that our objective hypernasality ratings for the most imprecise speakers are not reliable. While this is a limitation of the approach, it is not severely limiting. In most cases, clinicians are more concerned with evaluating speakers in the mild-moderate end of the scale where they can monitor disease progress early or evaluate the effects of an intervention. This is less common for later stages of disease.

It is interesting to note that despite poor alignment the model still yields high hypernasality scores for imprecise speakers. Precise alignment for speakers in this range is simply not possible, manually or otherwise. It is likely that the poor hypernasality ratings predicted by the model are driven by the poor alignment itself [77].
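The alignment error measure defined above is simple to compute; a small sketch (with hypothetical variable names) is given below for a forced-aligned phoneme center t'_c and a manually annotated interval [t_min, t_max].

    def alignment_error(t_center_auto, t_min, t_max):
        """t_e = max(0, t_min - t'_c, t'_c - t_max): zero when the automatic
        phoneme center falls inside the manually aligned syllable, otherwise
        the distance by which it falls outside the interval."""
        return max(0.0, t_min - t_center_auto, t_center_auto - t_max)

    def mean_alignment_error(centers, intervals):
        """Per-speaker score: average the error over all words/phonemes."""
        errors = [alignment_error(c, lo, hi) for c, (lo, hi) in zip(centers, intervals)]
        return sum(errors) / len(errors)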

C. Analysis of Imprecise Articulation as a Confound

In Section IV-A we analyzed the extent to which the NAP features genuinely capture both the perceptual dynamics of hypernasality and information beyond mere articulatory precision. In this section, we further analyze the role that articulatory precision alone may play as a confound, driving our system to label a low-hypernasality but otherwise severely dysarthric speaker as high hypernasality. After all, as discussed before, hypernasality and articulatory precision are well correlated—in our own data, the perceptual clinician AP and hypernasality ratings have a PCC of 0.84 to each other.

By dividing each speaker into differentiated classes based on the relationship between their own level of hypernasality and articulatory precision, we seek to assess if relatively low hypernasality but high CAP speakers are mislabeled as high hypernasality. To do this, we assess "∆_Z," the difference between speaker i's z-scored clinician AP and z-scored hypernasality scores (CAP(i), H(i)),

∆_Z(i) = (CAP(i) − µ_CAP) / σ_CAP − (H(i) − µ_H) / σ_H,    (5)

where µ_CAP, σ_CAP, µ_H, σ_H are the mean and standard deviation of the CAP and hypernasality scores, respectively.

We find that for our dataset, ∆_Z is roughly normally distributed, with a mean of −0.001 and a standard deviation of 0.57, with a minimum of −1.56 and a maximum of 1.25. Figure 10 shows a scatter plot of H against CAP for all speakers in our dataset, color-coded by ∆_Z. For most levels of hypernasality or CAP, there exists a range of speakers with varying delta. Partitioning the ∆_Z distribution at ±0.3 and 0 splits it roughly into quarters.

Fig. 10: A scatter plot of clinician-assessed perceptual hypernasality score against clinician articulatory precision rating (CAP), color-coded by ∆_Z.

To analyze the impact of imprecise articulation, we use ∆_Z and hypernasality scores to produce a train/test partition of the data, where the 7 speakers with ∆_Z > 0.3 and H < 2.5, containing low-hypernasality speakers with relatively high CAP, are the test set, while all others form the training set. We find that a linear model trained on the NAP features with this partition achieves an MAE of 0.823 and a PCC of 0.560. These results suggest that our model is robust to the correlation between hypernasality and the more generalized articulatory precision-reducing symptoms of dysarthria, as the model does not falsely label relatively high-CAP but low-hypernasality speakers as high hypernasality.

V. DISCUSSION

The model based on NAP features outperforms both baselines, across all settings but MAE in the leave-out-HD case. Furthermore, the NAP-based model outperforms the human annotators in 8 of the 10 conditions, with the exception of LOSO PCC and leave-out-ALS PCC. In the LOSO condition the MFCC-NN approach outperforms the simpler formant features in PCC; while the formant feature model does achieve a lower MAE, it seems to be a result of largely predicting the mean, with only a very modest upsloping trend in Figure 5(a) as opposed to Figures 5(b) and 5(c), which clearly show upward-sloping trends.

In the LODO conditions, the formant-feature-based and NN models perform unpredictably. On some disease classes, MFCC-NN outperforms FF, while the opposite is true for others. By comparison, the NAP achieves consistent performance across all LODO classes. This suggests that these features are a robust measure of hypernasality, relatively invariant to the disease-specific co-modulating variables that hinder the performance of the baselines on the same task. The nasalization features in the NAP are simultaneously more robust to both the disease-specific overfitting expected from NN methods such as [56] and the speaker-to-speaker variances discussed in the design of the formant-based A1P0 and related features in [69], [71]; this is the advantage of being trained on a large corpus of healthy speech, and targeting a specific perceptual quality. Articulatory precision features are robust in a similar way.

One of the added benefits of the proposed approach over the baseline methods is the direct interpretability of the individual NAP features. While it is not immediately clear how MFCC features or formant-based features are expected to change with different hypernasality levels, the proposed features are easy to interpret.

The feature-level analyses of the nasalization and articulatory precision features behave as expected, with the nasalization log likelihood of the phonemes increasing as hypernasality increases, while the articulatory precision decreases as hypernasality increases (Figure 6). Analysis of Eqns (3) and (4) shows that this makes sense. As hypernasality increases, the voiced phonemes become more and more like the N class in the acoustic model in Section II. Similarly, as hypernasality increases, the acoustics of the unvoiced phonemes become less and less like the intended target, therefore the ratio in Eqn (4) decreases.

During the feature selection analysis in Section III-B, certain consonants appeared prominently. The nasalization feature for phonemes D and B, as well as the articulatory precision of T and F, were prominent. T, B, and D are referred to as "nasal cognates" in [57], as the bilabial consonant B shares a place of articulation with the bilabial nasal M, and the lingua-alveolar consonants T and D share a place of articulation with the lingua-alveolar nasal N. Leakage through the nasal cavity will interfere with the production of all of these phonemes, and in the voiced case, they will sound like their corresponding nasal phonemes. It is not surprising that the nasalization model is most sensitive to these phonemes since that model is trained on healthy speech, where the N class consists mostly of instances of M and N and surrounding vowels.

Through the same analysis, the most prominent vowels selected were AA and IY. AA is the most open and back vowel in English, whereas IY is the most closed and fronted. It may be the case that these extreme ends of the vowel chart exhibit more noticeable patterns of nasalization, either on a perceptual level or just in their PLP-nasalization feature realization.
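For completeness, the ∆_Z quantity of Eq. (5) and the train/test partition used in the confound analysis of Section IV-C above can be expressed in a few lines; the sketch assumes NumPy arrays cap and h of per-speaker clinician articulatory precision and hypernasality ratings (names are illustrative).

    import numpy as np

    def delta_z(cap, h):
        """Eq. (5): difference between z-scored CAP and z-scored hypernasality."""
        return (cap - cap.mean()) / cap.std() - (h - h.mean()) / h.std()

    def confound_partition(cap, h, dz_threshold=0.3, h_threshold=2.5):
        """Test set: low-hypernasality speakers whose articulatory precision is
        relatively high given their hypernasality; all others train the model."""
        dz = delta_z(cap, h)
        test_mask = (dz > dz_threshold) & (h < h_threshold)
        return ~test_mask, test_mask   # train mask, test mask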

Dysarthria is characterized by a constellation of often co-modulating perceptual and acoustic attributes. Neuromotor speech control problems are not necessarily restricted to a specific articulator; when a speaker exhibits hypernasality arising from neuromuscular VPD, chances are other articulators such as the tongue and lips are affected, thereby driving a more generalized articulatory imprecision. The converse is also true, meaning that a speaker with some level of neuromuscular tongue, lip, and jaw-driven imprecision may also exhibit some amount of hypernasality. As a result it becomes difficult to truly isolate whether a dysarthric hypernasality assessment system is precisely measuring hypernasality, rather than broader articulatory imprecision. To do this, experiments where only hypernasality varies as other articulatory imprecisions are held constant, and where articulatory precision varies against a low level of hypernasality, are necessary. The former experiment is straightforward—carried out for our analysis on cleft palate speakers described in Section IV-A. The latter was the experiment to evaluate the role that imprecise articulation plays as a confound in Section IV-C.

We found that our hypernasality estimation model—trained exclusively on the dataset of dysarthric speech—achieved high correlation on estimating the hypernasality of CP speech. The CP recordings were all from speakers exhibiting no neuromuscular disease, meaning they exhibited hypernasality with no other articulatory imprecisions. Thus, the generalization results exhibited by our model suggest it models the genuine perceptual attributes of hypernasality in dysarthric speech, as opposed to the correlated generalized imprecision.

In spite of its robustness the NAP model has limitations. Most limiting is its reliance on aligned transcripts to perform the estimation. The results shown in this paper were based on forced alignment. This is always possible when the ground truth transcript is known but is not feasible for spontaneous speech. The robustness of the model comes from the fact that it is trained on a large corpus of healthy speech; however, this training induces a bias in the model. As the feature selection results show, the model is adept at detecting hypernasal speech from phonemes that look similar to nasals in healthy speech; however, it is impossible to capture nasalization acoustic patterns for unvoiced speech since these sounds never occur in healthy speech (and hence cannot be captured in our model). As a result, we use articulatory precision of unvoiced consonants as a proxy for nasalization for these sounds. Increased hypernasality typically implies reduced articulatory precision, but the converse is not necessarily true. As such, it is possible for speakers to exhibit reduced precision for reasons other than hypernasality. As we showed with the CLP speech experiments, when the reduction in articulatory precision is due to hypernasality, the model generalizes out-of-disease quite well. We did an initial analysis on 7 speakers with low hypernasality scores and relatively high clinical articulatory precision scores where the model showed that it tracks with hypernasality; however, further analysis on larger dysarthric datasets is required to fully validate the model in this context.

VI. CONCLUSION

We have presented and demonstrated the Nasalization-Articulation Precision features for objective estimation of hypernasality. This method leverages a data-driven approach to learning expert-designed features on healthy speech that capture perceptible elements in hypernasal speech. We demonstrated that these features, when evaluated on disordered speech, track the expected trends in perceptual hypernasality ratings, and can be used with ridge regression to estimate a clinician-rated hypernasality score more accurately than several representative baseline methods as well as human annotators, on average. Additionally, we demonstrated that the NAP algorithm predictions for hypernasality rating generalize across diseases with significantly less loss in accuracy than existing approaches. This implies that the NAP features are a robust method for estimating hypernasality in dysarthria.

The chief limitation of this approach, and of articulatory precision estimation techniques more generally, is a reliance on known transcripts with which alignment may be performed. Neural models for directly assessing articulatory precision from raw speech audio are a promising future research direction—such models could provide the simultaneous identification and precision assessment of phonemes on the fly, and provide downstream representations that could drive characterization of hypernasality without relying on reading as a stimulus, or known transcripts for assessment.

We plan to expand on this work by collecting a larger dataset of nasality-scored dysarthric speech, representing more diseases, and designing stimuli better tailored for this task. Furthermore, we will work to apply insights from this work to improve the robustness of neural models for the estimation of articulatory precision, nasality, and other objective speech biomarkers.

REFERENCES

[1] M. Novotny, J. Rusz, R. Cmejla, H. Ruzickova, J. Klempir, and E. Ruzicka, "Hypernasality associated with basal ganglia dysfunction: evidence from Parkinson's disease and Huntington's disease," PeerJ, vol. 4, p. e2530, 2016.
[2] D. Theodoros, B. Murdoch, and E. Thompson, "Hypernasality in Parkinson's disease: A perceptual and physiological analysis," J Med Speech-Lang Pathol, vol. 3, no. 2, pp. 73–84, 1995.
[3] M. Poole, J. Wee, J. Folker, L. Corben, M. Delatycki, and A. Vogel, "Nasality in Friedreich ataxia," Clin Linguist Phon, vol. 29, no. 1, pp. 46–58, Jan 2015.
[4] J. Duffy, Motor Speech Disorders: Substrates, Differential Diagnosis, and Management. Mosby, 1995.
[5] D. P. Kuehn and K. T. Moller, "Speech and language issues in the cleft palate population: the state of the art," The Cleft Palate-Craniofacial Journal, vol. 37, no. 4, pp. 1–35, 2000.
[6] E. Carrow, V. Rivera, M. Mauldin, and L. Shamblin, "Deviant speech characteristics in motor neuron disease," JAMA Otolaryngology–Head & Neck Surgery, vol. 100, no. 3, pp. 212–218, 1974.
[7] D. Theodoros, B. Murdoch, P. Stokes, and H. Chenery, "Hypernasality in dysarthric speakers following severe closed head injury: A perceptual and instrumental analysis," Brain Injury, vol. 7, no. 1, pp. 59–69, 1993.
[8] H. Extence and S. Cassidy, "The role of the speech pathologist in the care of the patient with cleft palate," in Maxillofacial Surgery (Third Edition), P. A. Brennan, H. Schliephake, G. Ghali, and L. Cascarini, Eds. Churchill Livingstone, 2017, pp. 1014–1023.
[9] R. D. Kent, "Hearing and believing: Some limits to the auditory-perceptual assessment of speech and voice disorders," American Journal of Speech-Language Pathology, vol. 5, no. 3, pp. 7–23, 1996.

[10] D. P. Kuehn and J. B. Moon, "Velopharyngeal closure force and levator veli palatini activation levels in varying phonetic contexts," Journal of Speech, Language, and Hearing Research, vol. 41, no. 1, pp. 51–62, 1998.
[11] L. B. Lintz and D. Sherman, "Phonetic elements and perception of nasality," Journal of Speech and Hearing Research, vol. 4, no. 4, pp. 381–396, 1961.
[12] D. G. Henningsson and D. A. Isberg, "Comparison between multiview videofluoroscopy and nasendoscopy of velopharyngeal movements," The Cleft Palate-Craniofacial Journal, vol. 28, no. 4, pp. 413–418, 1991, PMID: 1742312.
[13] D. S. Kao, D. A. Soltysik, J. S. Hyde, and A. K. Gosain, "Magnetic resonance imaging as an aid in the dynamic assessment of the velopharyngeal mechanism in children," Plastic and Reconstructive Surgery, vol. 122, no. 2, p. 572, 2008.
[14] K. Bettens, F. L. Wuyts, and K. M. Van Lierde, "Instrumental assessment of velopharyngeal function and resonance: A review," Journal of Communication Disorders, vol. 52, pp. 170–183, 2014.
[15] Pentax, "Nasometer II: Model 6450," 2016. [Online]. Available: https://www.pentaxmedical.com/pentax/en/99/1/Nasometer-II-Model-6450
[16] T. U. Brancamp, K. E. Lewis, and T. Watterson, "The relationship between nasalance scores and nasality ratings obtained with equal appearing interval and direct magnitude estimation scaling methods," The Cleft Palate-Craniofacial Journal, vol. 47, no. 6, pp. 631–637, 2010.
[17] T. Watterson, S. C. McFarlane, and D. S. Wright, "The relationship between nasalance and nasality in children with cleft palate," Journal of Communication Disorders, vol. 26, no. 1, pp. 13–28, 1993.
[18] K. Sinko, M. Gruber, R. Jagsch, I. Roesner, A. Baumann, A. Wutzl, and D.-M. Denk-Linnert, "Assessment of nasalance and nasality in patients with a repaired cleft palate," European Archives of Oto-Rhino-Laryngology, vol. 274, no. 7, pp. 2845–2854, 2017.
[19] K. L. Chapman, A. Baylis, J. Trost-Cardamone, K. N. Cordero, A. Dixon, C. Dobbelsteyn, A. Thurmes, K. Wilson, A. Harding-Bell, T. Sweeney et al., "The Americleft Speech Project: a training and reliability study," The Cleft Palate-Craniofacial Journal, vol. 53, no. 1, pp. 93–108, 2016.
[20] L. Beijer and A. Rietveld, "Asynchronous telemedicine applications in the rehabilitation of acquired speech-language disorders in neurological patients," Smart Homecare Technology and TeleHealth, vol. 3, pp. 39–48, 2015.
[21] J. Rafael Orozco Arroyave, J. Francisco Vargas Bonilla, and E. Delgado Trejos, "Acoustic analysis and non linear dynamics applied to voice pathology detection: A review," Recent Patents on Signal Processing, vol. 2, no. 2, pp. 96–107, 2012.
[22] S. Hegde, S. Shetty, S. Rai, and T. Dodderi, "A survey on machine learning approaches for automatic detection of voice disorders," Journal of Voice, vol. 33, no. 6, pp. 947–e11, 2019.
[23] D. A. Lohmander and M. M. Olsson, "Methodology for perceptual assessment of speech in patients with cleft palate: A critical review of the literature," The Cleft Palate-Craniofacial Journal, vol. 41, no. 1, pp. 64–70, 2004, PMID: 14697067.
[24] A. Kummer and L. Lee, "Evaluation and treatment of resonance disorders," Language, Speech, and Hearing in Schools, vol. 27, pp. 271–281, Jul 1996.
[25] A. Woo, "Velopharyngeal dysfunction," Semin Plast Surg, vol. 26, no. 4, pp. 170–177, Nov 2012.
[26] L. Yu and B. D. Barkana, "Classifying hypernasality using the pitch and formants," in Proceedings of the 6th International Conference on Information Technology New Generations, 2009.
[27] Y. Kozaki-Yamaguchi, N. Suzuki, Y. Fujita, H. Yoshimasu, M. Akagi, and T. Amagasa, "Perception of hypernasality and its physical correlates," Oral Science International, vol. 2, no. 1, pp. 21–35, 2005.
[28] R. Kataoka, D. W. Warren, D. J. Zajac, R. Mayo, and R. W. Lutz, "The relationship between spectral characteristics and perceived hypernasality in children," The Journal of the Acoustical Society of America, vol. 109, no. 5, pp. 2181–2189, 2001.
[29] S. Hawkins and K. N. Stevens, "Acoustic and perceptual correlates of the non-nasal–nasal distinction for vowels," The Journal of the Acoustical Society of America, vol. 77, no. 4, pp. 1560–1575, 1985.
[30] J. R. Glass and V. W. Zue, "Detection of nasalized vowels in American English," in Proceedings of ICASSP, vol. 4, 1985, pp. 1569–1572.
[31] P. Vijayalakshmi, T. Nagarajan, and J. Rav, "Selective pole modification-based technique for the analysis and detection of hypernasality," in TENCON 2009–2009 IEEE Region 10 Conference, 2009, pp. 1–5.
[32] T. Pruthi, C. Y. Espy-Wilson, and B. H. Story, "Simulation and analysis of nasalized vowels based on magnetic resonance imaging data," The Journal of the Acoustical Society of America, vol. 121, no. 6, pp. 3858–3873, 2007.
[33] P. Vijayalakshmi, M. R. Reddy, and D. O'Shaughnessy, "Acoustic analysis and detection of hypernasality using a group delay function," IEEE Transactions on Biomedical Engineering, vol. 54, no. 4, pp. 621–629, 2007.
[34] G.-S. Lee, C.-P. Wang, C. C. Yang, and T. B. Kuo, "Voice low tone to high tone ratio: a potential quantitative index for vowel [a:] and its nasalization," IEEE Transactions on Biomedical Engineering, vol. 53, no. 7, pp. 1437–1439, 2006.
[35] G.-S. Lee, C.-P. Wang, and S. Fu, "Evaluation of hypernasality in vowels using voice low tone to high tone ratio," The Cleft Palate-Craniofacial Journal, vol. 46, no. 1, pp. 47–52, 2009.
[36] M. C. Vikram, N. Adiga, and S. R. M. Prasanna, "Spectral enhancement of cleft lip and palate speech," in Proc. Interspeech 2016, 2016, pp. 117–121.
[37] J. R. Orozco-Arroyave, E. A. Belalcazar-Bolanos, J. D. Arias-Londoño, J. F. Vargas-Bonilla, S. Skodda, J. Rusz, K. Daqrouq, F. Hönig, and E. Nöth, "Characterization methods for the detection of multiple voice disorders: neurological, functional, and laryngeal diseases," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 6, pp. 1820–1828, 2015.
[38] D. K. Rah, Y. I. Ko, C. Lee, and D. W. Kim, "A noninvasive estimation of hypernasality using a linear predictive model," Annals of Biomedical Engineering, vol. 29, no. 7, pp. 587–594, 2001.
[39] L. He, J. Zhang, Q. Liu, H. Yin, and M. Lech, "Automatic evaluation of hypernasality and consonant misarticulation in cleft palate speech," IEEE Signal Processing Letters, vol. 21, no. 10, pp. 1298–1301, 2014.
[40] S. M. Rendón, J. O. Arroyave, J. V. Bonilla, J. A. Londoño, and C. C. Domínguez, "Automatic detection of hypernasality in children," in International Work-Conference on the Interplay Between Natural and Artificial Computation. Springer, 2011, pp. 167–174.
[41] K. Nikitha, S. Kalita, C. Vikram, M. Pushpavathi, and S. M. Prasanna, "Hypernasality severity analysis in cleft lip and palate speech using vowel space area," in Proc. Interspeech 2017, 2017, pp. 1829–1833.
[42] A. K. Dubey, S. M. Prasanna, and S. Dandapat, "Zero time windowing based severity analysis of hypernasal speech," in 2016 IEEE Region 10 Conference (TENCON), 2016, pp. 970–974.
[43] ——, "Pitch-adaptive front-end feature for hypernasality detection," Proc. Interspeech 2018, pp. 372–376, 2018.
[44] A. Vogel, H. Ibrahim, S. Reilly, and N. Kilpatrick, "A comparative study of two acoustic measures of hypernasality," Journal of Speech, Language, and Hearing Research, vol. 52, pp. 1640–1651, 2009.
[45] R. Kataoka, K.-I. Michi, K. Okabe, T. Miura, and H. Yoshida, "Spectral properties and quantitative evaluation of hypernasality in vowels," The Cleft Palate-Craniofacial Journal, vol. 33, no. 1, pp. 43–50, 1996.
[46] G. Castellanos, G. Daza, L. Sánchez, O. Castrillón, and J. Suárez, "Acoustic speech analysis for hypernasality detection in children," in 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, 2006, pp. 5507–5510.
[47] A. K. Dubey, A. Tripathi, S. Prasanna, and S. Dandapat, "Detection of hypernasality based on vowel space area," The Journal of the Acoustical Society of America, vol. 143, no. 5, pp. EL412–EL417, 2018.
[48] D. A. Cairns, J. H. Hansen, and J. E. Riski, "A noninvasive technique for detecting hypernasal speech using a nonlinear operator," IEEE Transactions on Biomedical Engineering, vol. 43, no. 1, p. 35, 1996.
[49] A. Maier, A. Reuß, C. Hacker, M. Schuster, and E. Nöth, "Analysis of hypernasal speech in children with cleft lip and palate," in International Conference on Text, Speech and Dialogue. Springer, 2008, pp. 389–396.
[50] J. R. Orozco-Arroyave, J. D. Arias-Londoño, J. F. Vargas-Bonilla, and E. Nöth, "Automatic detection of hypernasal speech signals using nonlinear and entropy measurements," in Proc. Interspeech 2012, 2012, pp. 2029–2032.
[51] J. R. Orozco-Arroyave, J. Vargas-Bonilla, J. D. Arias-Londoño, S. Murillo-Rendón, G. Castellanos-Domínguez, and J. Garcés, "Nonlinear dynamics for hypernasality detection in Spanish vowels and words," Cognitive Computation, vol. 5, no. 4, pp. 448–457, 2013.
[52] R. G. Nieto, J. I. Marín-Hurtado, L. M. Capacho-Valbuena, A. A. Suarez, and E. A. B. Bolaños, "Pattern recognition of hypernasality in voice of patients with cleft and lip palate," in 2014 XIX Symposium on Image, Signal Processing and Artificial Vision. IEEE, 2014, pp. 1–5.
[53] M. Golabbakhsh, F. Abnavi, M. Kadkhodaei Elyaderani, F. Derakhshandeh, F. Khanlar, P. Rong, and D. Kuehn, "Automatic identification of hypernasality in normal and cleft lip and palate patients with acoustic analysis of speech," The Journal of the Acoustical Society of America, vol. 141, no. 2, pp. 929–935, 2017.

[54] D. Cairns, J. Hansen, and J. Riski, "A noninvasive technique for detecting hypernasal speech using a nonlinear operator," IEEE Transactions on Biomedical Engineering, vol. 43, no. 1, p. 35, 1996.
[55] C. Vikram, A. Tripathi, S. Kalita, and S. Prasanna, "Estimation of hypernasality scores from cleft lip and palate speech," Proc. Interspeech 2018, pp. 1701–1705, 2018.
[56] C. M. Vikram, A. Tripathi, S. Kalita, and S. R. M. Prasanna, "Estimation of hypernasality scores from cleft lip and palate speech," in Proc. Interspeech 2018, 2018, pp. 1701–1705.
[57] M. Saxon, J. Liss, and V. Berisha, "Objective measures of plosive nasalization in hypernasal speech," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6520–6524.
[58] S. B. Rutkove, K. Qi, K. Shelton, J. Liss, V. Berisha, and J. M. Shefner, "ALS longitudinal studies with frequent data collection at home: study design and baseline data," Amyotroph Lateral Scler Frontotemporal Degener, vol. 20, no. 1-2, pp. 61–67, Feb 2019.
[59] J. Peplinski, V. Berisha, J. Liss, S. Hahn, J. Shefner, S. Rutkove, K. Qi, and K. Shelton, "Objective assessment of vocal tremor," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2019, pp. 6386–6390.
[60] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2015, pp. 5206–5210.
[61] D. P. Kuehn, P. B. Imrey, L. Tomes, D. L. Jones, M. M. O'Gara, E. J. Seaver, B. E. Smith, D. Van Demark, and J. M. Wachtel, "Efficacy of continuous positive airway pressure for treatment of hypernasality," The Cleft Palate-Craniofacial Journal, vol. 39, no. 3, pp. 267–276, 2002.
[62] M. McAuliffe, M. R. F. P. Danar, V. Willerton, M. Asocolof, and A. Coles, "MontrealCorpusTools/Montreal-Forced-Aligner: Version 1.0.1," Apr. 2019.
[63] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738–1752, 1990.
[64] L. Muda, M. Begam, and I. Elamvazuthi, "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," arXiv preprint arXiv:1003.4083, 2010.
[65] V. W. Zue and S. Seneff, "Transcription and alignment of the TIMIT database," in Recent Research Towards Advanced Man-Machine Interface Through Spoken Language. Elsevier, 1996, pp. 515–525.
[66] S. Witt and S. Young, "Phone-level pronunciation scoring and assessment for interactive language learning," Speech Commun., vol. 30, no. 2-3, pp. 95–108, Feb. 2000.
[67] M. Tu, A. Grabek, J. Liss, and V. Berisha, "Investigating the role of L1 in automatic pronunciation evaluation of L2 speech," Proc. Interspeech 2018, vol. 2018-September, pp. 1636–1640, 2018.
[68] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," The Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[69] W. Styler, "On the acoustical and perceptual features of vowel nasality," 2015.
[70] ——, "Using Praat for linguistic research," University of Colorado at Boulder Phonetics Lab, 2013.
[71] M. Y. Chen, "Acoustic correlates of English and French nasalized vowels," The Journal of the Acoustical Society of America, vol. 102, no. 4, pp. 2360–2370, 1997.
[72] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, "Learning problem-agnostic speech representations from multiple self-supervised tasks," in Proc. Interspeech 2019, 2019, pp. 161–165.
[73] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[74] Y. T. Yeung, K. H. Wong, and H. Meng, "Improving automatic forced alignment for dysarthric speech transcription," in Proc. Interspeech 2015, 2015, pp. 2991–2995.
[75] P. Green, J. Carmichael, A. Hatzis, P. Enderby, M. Hawley, and M. Parker, "Automatic speech recognition with sparse training data for dysarthric speakers," in Eighth European Conference on Speech Communication and Technology (Eurospeech), 2003, pp. 1189–1192.
[76] T. Knowles, M. Clayards, and M. Sonderegger, "Examining factors influencing the viability of automatic acoustic analysis of child speech," Journal of Speech, Language, and Hearing Research, vol. 61, no. 10, pp. 2487–2501, 2018.
[77] P. Green and J. Carmichael, "Revisiting dysarthria assessment intelligibility metrics," in Eighth International Conference on Spoken Language Processing, 2004, pp. 485–488.

Michael Saxon received the MS degree in Computer Engineering from Arizona State University (ASU) in Tempe, AZ, USA in May 2020, with a thesis topic centered on dysarthric speech analysis. He is now a Ph.D. student and NSF Fellow in the NLP Lab at the University of California, Santa Barbara (UCSB) in Santa Barbara, CA, USA. His research interests include automatic speech recognition, deep learning, and natural language processing.

Ayush Tripathi received the B.Tech degree in Electrical and Electronics Engineering from Visvesvaraya National Institute of Technology, Nagpur, India in 2019. He is currently a researcher in the Speech & Natural Language Processing Group at TCS Research & Innovation, India. He has previously worked as a Research Aide at Arizona State University and a Research Intern at Indian Institute of Technology, Guwahati.

Yishan Jiao received her Ph.D. degree in Speech and Hearing Science from Arizona State University (ASU) in Tempe, AZ, USA in May 2019. Her research interests include acoustic signal processing and machine learning.

Julie M. Liss is a Professor of Speech & Hearing Science and Associate Dean of the College of Health Solutions at Arizona State University (ASU) in Tempe, AZ, USA. Her research explores the ways in which speech and language change in the context of neurological damage or disease.

Visar Berisha is an Associate Professor in the College of Health Solutions and Fulton Entrepreneurial Professor in the School of Electrical, Computer & Energy Engineering at Arizona State University (ASU) in Tempe, AZ, USA. His research interests include computational models of speech production and perception, clinical speech analytics, and statistical signal processing.