applied sciences

Article Modelling of Amplitude Modulated Vocal Fry Glottal Area Waveforms Using an Analysis-by-Synthesis Approach

Vinod Devaraj 1,2,* and Philipp Aichinger 1

1 Department of Otorhinolaryngology, Division of Phoniatrics-Logopedics, Medical University of Vienna, 1090 Vienna, Austria; [email protected] 2 Signal Processing and Speech Communication Laboratory, University of Technology, 8010 Graz, Austria * Correspondence: [email protected]

Abstract: The characterization of quality is important for the diagnosis of a voice disorder. Vocal fry is a voice quality which is traditionally characterized by a low frequency and a long closed phase of the glottis. However, we also observed amplitude modulated vocal fry glottal area waveforms (GAWs) without long closed phases (positive group) which we modelled using an analysis-by-synthesis approach. Natural and synthetic GAWs are modelled. The negative group consists of euphonic, i.e., normophonic GAWs. The analysis-by-synthesis approach fits two modelled GAWs for each of the input GAW. One modelled GAW is modulated to replicate the amplitude and frequency modulations of the input GAW and the other modelled GAW is unmodulated. The modelling errors of the two modelled GAWs are determined to classify the GAWs into the positive and the negative groups using a simple support vector machine (SVM) classifier with a linear kernel. The modelling errors of all vocal fry GAWs obtained using the modulating model are smaller than the modelling errors obtained using the unmodulated model. Using the two modelling errors as  predictors for classification, no false positives or false negatives are obtained. To further distinguish  the subtypes of amplitude modulated vocal fry GAWs, the entropy of the modulator’s power spectral Citation: Devaraj, V.; Aichinger, P. density and the modulator-to-carrier frequency ratio are obtained. Modelling of Amplitude Modulated Vocal Fry Glottal Area Waveforms Keywords: voice quality; glottal area waveforms; vocal fry; irregular Using an Analysis-by-Synthesis Approach. Appl. Sci. 2021, 11, 1990. https://doi.org/10.3390/app11051990 1. Introduction Academic Editor: Michael Döllinger Vocal fry is a voice quality which is synonymously referred to as , pulse , glottal fry or creak [1–3]. In addition, the term vocal fry is used to designate a Received: 31 December 2020 subtype of creaky voice [4]. The remaining subtypes are multipulsed voice, aperiodic Accepted: 22 February 2021 Published: 24 February 2021 voice, nonconstricted creak and tense/pressed voice. Vocal fry is mainly characterized by a low fundamental frequency which gives an auditory impression of “a stick being

Publisher’s Note: MDPI stays neutral run along a railing”, “popping of corn” or “cooking of food on a pan” [1,2,5]. Other with regard to jurisdictional claims in characteristics include shimmer, jitter, and damping of pulses. Furthermore, subglottal published maps and institutional affil- air pressure and air flow were found to be smaller in vocal fry than in modal registers [2]. iations. However, fundamental frequency is one of the main factors which distinguishes vocal fry from modal and harsh voice [6]. In our work, we identify vocal fry based on the impulsivity of voice samples, i.e., the auditory attribute associated with the separate perception of the glottal cycles. It was shown in the past that the peak prominence observed in loudness versus time curves may reflect the temporal auditory segregation of the glottal cycles in Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. vocal fry [7]. This article is an open access article Vocal fry has been identified in healthy and dysphonic speakers. In particular, vocal distributed under the terms and fry was reported to be one of the salient voice characteristics of male patients with contact conditions of the Creative Commons granuloma [8]. Hence, vocal fry may be an indicator for the presence of a pathology which Attribution (CC BY) license (https:// makes the study of characteristics of vocal fry clinically important. Although it can be a creativecommons.org/licenses/by/ sign of a voice disorders, Hollien et al. reported vocal fry voice quality in nonpathological 4.0/). voice [9,10]. These studies investigated the fluctuation of pulse rate of vocal fry samples,

Appl. Sci. 2021, 11, 1990. https://doi.org/10.3390/app11051990 https://www.mdpi.com/journal/applsci Appl. Sci. 2021, 11, 1990 2 of 13

and evidence against classifying vocal fry as pathological is provided. The results were further supported by an investigation of vocal fry with 104 first year graduate students [11]. Of the students in the study group, 86% had a nonpathological voice and the remaining 14% were reported to use vocal fry along with a pathological voice. However, 16% of the students with nonpathological voice quality also used vocal fry. For the objective detection of vocal fry and creak, these following acoustic features have been used in the past. The presence of vocal fry segments in speech utterances was detected based on the autocorrelation properties of the audio signals [3]. The insertion and detection rates obtained were 13% and 74%, respectively. In [12], audio features like inter- frame periodicity, interpulse similarity, peak fall and peak rise, H2-H1, i.e., the difference in amplitudes of the first two harmonics, F0 contours in each frame and peak prominence were used for vocal fry detection. An aperiodicity, periodicity and pitch (APP) detector, which uses a dip profile of the average magnitude difference function (AMDF) at lags, was used to distinguish between irregular phonation and modal phonation [13]. A Fourier spectrum analysis approach of the audio signals was also proposed for distinguishing vocal fry segments from diplophonic voice [14]. The ratio of the numbers of harmonics present in consecutive frames was determined to detect transitions of modal to vocal fry regimes, or vice versa. In total, 81% of the voice segments were correctly detected as vocal fry. The distinct characteristics of the creaky and modal regions were obtained using so called epoch parameters [15]. Epochs are negative to positive zero crossing instances of zero frequency filtered audio signals. The used epoch parameters were the number of epochs in a frame, the time instants of epochs, and the time interval between successive epochs. Based on the variance of these epoch parameters, a neural network classifier was developed to identify creaky regions. This study was done using voice samples of several speaker groups. The F1 scores were analyzed to validate the detector. The maximum F1 score was found to be a little over 0.8 for female Finnish speakers when compared to the other groups. Convolutional neural networks (CNN) based on the detection of the creaky voice from emergency calls was also proposed, where 32 mel-scale log filter-bank outputs were used as input features [16]. The data consist of emergency call recordings and a conversational speech corpus with 30 recordings each. Though these methods detect vocal fry or creaky segments, they do not allow a detailed study of voice production. Previous studies have tried to interlink vocal fold vibration patterns with voice quality types. Several studies have investigated the vibration patterns of vocal fry, which include the number of opening and closing phases of the vocal folds in a single modulator cycle, and the duration of the closed phase. The authors of [17] and [18] found single- and double-pulsed patterns, respectively, using high-speed videos, where the closed phase is longer than the open phase. More evidence for the existence of multiple pulses in a single cycle was reported in [1,2,5]. Furthermore, multiple pulses in a single cycle without a long closed phase were observed using electroglottograms which reflect translaryngeal electrical resistance that is proportional to the contact area of the vocal folds [19]. In this paper, we observe vocal fry GAWs with multiple single-peak pulses in a single cycle. A GAW is the time series of the glottal area, i.e., the projected area of the space between the vocal folds, which vibrate during phonation. In the open phase, the magnitude of the GAW monotonically increases and decreases while the vocal folds are opening and closing, respectively. The magnitude becomes zero when the folds are closed completely, i.e., during the closed phase. The pulses are amplitude modulated and no long closed phase is observed. Throughout the paper, we refer to single-peak pulses. We model the GAWs using an analysis-by-synthesis approach. A modified version of the analysis-by-synthesis approach implemented in [20] for modelling diplophonic GAWs is used. Here, we use two supporting points per pulse of GAW instead of one while modelling the input GAWs. This helps in improving the estimation of the upper and lower envelopes,—discussed in Section 2.4. Furthermore, instead of two fundamental frequencies, one fundamental frequency is used for modelling GAWs. Appl. Sci. 2021, 11, 1990 3 of 13

In this study, GAWs are used to distinguish between vocal fry and normal voice quality. The remainder of this article is structured as follows. In Section2, the extraction of GAWs, the different types of observed amplitude modulated GAWs, and the modelling of GAWs using analysis-by-synthesis are described. The results of automatic classification of amplitude modulated vocal fry GAWs and euphonic GAWs are reported in Section3. Section4 discusses the performance, limitations and the possible future works of this study. Finally, Section5 concludes the paper.

2. Materials and Methods 2.1. Data High-speed videos (HSVs) of the vocal folds with a frame rate of 4000 frames per second are obtained by inserting a laryngeal endoscopic camera (HRES ENDOCAM 5562) through the mouth into the pharynx [21]. Subjects are instructed to produce sustained phonation of for a duration of two seconds while the vocal fold vibrations are filmed. Audio files are recorded simultaneously using a head worn microphone. The audio files are synchronized with the captured video files. The voice samples are annotated by three expert annotators with regard to presence or absence of vocal fry. Based on the annotations, the corresponding video frames of the HSVs are cropped and used for GAW extraction. The GAWs are extracted using a seeded region growing segmentation algorithm [22]. Seven GAWs annotated as vocal fry contain amplitude modulations without long phases. They are used as a positive group. Eight euphonic i.e., normophonic GAWs are used as a negative group. The vocal fry GAWs are further distinguished into two categories based on the cyclic- ity of the GAWs by visual inspection. Three out of seven amplitude modulated GAWs are observed to be acyclically modulated. These GAWs are termed as Acyclically Amplitude Modulated Pulse train (AAMP). The remaining four vocal fry GAWs are observed to have a cyclic modulator. These cyclic GAWs are further subclassified based on the number of pulses in each modulator cycle, one of which is characterized by double-pulsing, i.e., alternation of pulse magnitudes, timing, and/or shapes. This type is termed as Cyclically Amplitude Modulated Pulse train-2 (CAMP = 2). The other cyclically modulated type is characterized by more than two pulses per cycle. This type is termed as Cyclically Amplitude Modulated Pulse train = 4 (CAMP = 4). Along with the natural GAWs, 1200 syn- thetic GAWs are used. Of the synthetic GAWs, 300 belong to the euphonic group and the remaining 900 are vocal fry GAWs, with each of the vocal fry type having 300 GAWs. Synthetic GAWs are generated using parameter distributions obtained from estimated parameters of the natural GAWs. Regularity and symmetry of vocal fold vibration is analyzed using phonovibrograms (PVGs). A PVG displays the deflections of the vocal folds from the glottal axis represented as the color intensities [23]. The brighter the color intensity, the farther away are the vocal folds from the glottal axis. Figure1 shows PVGs of the GAWs shown in Figure2, i.e., of the amplitude modulated vocal fry and euphonic types of vocal fold vibration. A negligible anterior–posterior asymmetry of vocal folds is observed in the PVG belonging to CAMP = 4 in Figure1a. Here, each modulator cycle contains four pulses where the borders of one of the modulator cycles is marked by two vertical green lines. The vibration pattern is described as an increase in the amplitude of the vocal fold vibration for four consecutive glottal pulses followed by a sudden decrease in amplitude. In contrast, the anterior–posterior phase difference is observed in PVGs of CAMP = 2, AAMP, and the euphonic type. In Figure1b, the anterior–posterior phase difference in every second pulse of the vocal folds results in a period-doubled GAW, as shown in Figure2b. Appl. Sci. 2021, 11, 1990 4 of 13 Appl. Sci. 2021, 11, Appl.1990 Sci. 2021, 11, 1990 4 of 14 4 of 14

Figure 1. PhonovibrogramFigureFigure 1.1. PhonovibrogramPhonovibrogram (PVG) representation (PVG)(PVG )of representation vocal fold kinematics of of vocal vocal offold fold the kinematics kinematicsGAWs shown of of the the in GAWs GAWs shown shown in in Figure 2: (a) CAMFigureFigureP = 42 ,:(2: (b a()a) )CAM CAMP CAMPP = = =2 4,,4 (, (c(b)b )AAMP) CAMPCAM Pand == 2,2 ,( (d(cc))) euphonic AAMPAAMP andand type. ((dd)) euphoniceuphonic type.type.

Figure 2 Glottal area waveforms (GAWs) extracted by segmentation of the high-speed videos Figure 2. GlottalFigure area 2 Glottal waveforms area waveforms (GAWs) extracted (GAWs) byextracted segmentation by segmentation of the high-speed of the high videos-speed (HSVs): videos (a–c) vocal fry (HSVs): (a–c) vocal(HSVs fry )and: (a– (cd) )vocal euphonic fry and voice. (d) euphonic(a) illustrates voice. an (examplea) illustrates of a anCAM exampleP = 4 type of a CAMP = 4 type and (d) euphonic voice. (a) illustrates an example of a CAMP = 4 type GAW, having four pulses in a modulator cycle; GAW, having fourGAW pulses, having in a modulatorfour pulses cycle in a modulator; (b) illustrates cycle an; ( exampleb) illustrates of CAM an exampleP = 2 type of, CAMP = 2 type, (b) illustrates an example of CAMP = 2 type, where each modulator cycle contains two pulses; and (c) belongs to an example where each modulatorwhere eachcycle modulator contains two cycle pulses contains; and two(c) belongs pulses; toand an ( cexample) belongs of to the an Acyclically example of the Acyclically of the AcyclicallyAmplitude Amplitude ModulatedAmplitude Modulated Pulse Modulated Pulse train train(AAMP (AAMP)Pulse) type train, type, which (AAMP which is acyclically) type is acyclically, which modula is modulated. acyclicallyted. The Themodulaeuphonic euphonicted. The GAW euphonic has smaller amplitudeGAW has modulations smallerGAW amplitude than has the smaller vocalmodulations fryamplitude GAWs. than modulations the vocal fry than GAWs. the vocal fry GAWs. Figure3 shows an overview of the analysis and synthesis combined with parameter Figure 3 showsFigure an overview 3 shows of an the overview analysis ofand the synthesis analysis combinedand synthesis with combined parameter with parameter distribution fitting. First, the analyzer models the natural GAW y(t) (input) using an distribution fittingdistribution. First, the fitting analyzer. First, models the analyzer the natural models GAW the natural푦(t) (input) GAW using 푦(t) an(input) using an analysis-by-synthesis approach, which is described in Section 2.2. A Fourier synthesizer is analysis-by-synthesisanalysis app-byroach-synthesis, which app isroach described, which in is Section described 2.2. inA FourierSection 2.2.synthesizer A Fourier synthesizer used to obtain the modelled GAWs yˆ(t). The modelling error E is obtained from the root is used to obtainis usedthe modelled to obtain GAWsthe modelled 푦̂(푡). The GAWs modelling 푦̂(푡). The error modelling 퐸 is obtained error from퐸 is obtainedthe from the root mean squareroot error mean between square error the input between GAW the and input the GAW modelled and theGAW modelled 푦̂(푡). The GAW 푦̂(푡). The Appl. Sci. 2021, 11, 1990 5 of 14 Appl. Sci. 2021, 11, 1990 5 of 13 Appl. Sci. 2021, 11, 1990 5 of 14

modelled GAW is refined iteratively to obtain a minimum modelling error. Parameters of mean square error between the input GAW and the modelled GAW yˆ(t). The modelled the inputmodelled GAW 휓 GAW are alsois refined estimated iteratively during to theobtain modelling a minimum process. modelling Distributions error. Parameters of of GAW is푖 refined iteratively to obtain a minimum modelling error. Parameters of the input estimatedthe parameters input GAW of the휓 inputare also GAW estimated are fitted during as described the modelling in Section process. 2.3 Distributionsfor the of GAW ψ are also푖 estimated during the modelling process. Distributions of estimated purpose ofestimated generating i parameters the synthetic of the corpus input, which GAW is arelarger fitted than as the described corpus of in the Section natural 2.3 for the parameters of the input GAW are fitted as described in Section 2.3 for the purpose of GAWs. Sectionpurpose 2.4 of explains generating the theFourier synthetic synthesizer corpus ,in which detail. is largerThe synthetic than the GAWs corpus 푦of( 푡the) natural generating the synthetic corpus, which is larger than the corpus of the푠 natural GAWs. are againGAWs. input inSectionto the 2.4 analyzer. explains The the modellingFourier synthesizer errors obtained in detail. for The the synthetic natural andGAWs 푦푠(푡) Section 2.4 explains the Fourier synthesizer in detail. The synthetic GAWs ys(t) are again synthetic GAWs are used for the classification of the positive and the negative group. In areinput again into input the analyzer.into the Theanalyzer. modelling The errorsmodelling obtained errors for obtained the natural for and the synthetic natural GAWs and Section 2.5, the features and classification are explained. syntheticare used GAWs for the are classification used for the ofclassification the positive of and the the positive negative and group. the negative In Section group. 2.5 In, the Sectionfeatures 2.5, and the classificationfeatures and classification are explained. are explained.

Figure 3. Overview of the analyzer and the synthesizer used for modelling and synthesizing Figure 3. Overview of the analyzer and the synthesizer used for modelling and synthesizing GAWs. GAWs. TheFigure analyzer 3. Overview models and of the estimates analyzer parameters and the synthesizer of input GAWs, used for which modelling is described and synthesizing in Section 2.2GA. TheTheWs. synthesizer analyzer The analyzer models described models and estimatesin and Section estimates parameters 2.4 generates parameters of inputsynthetic of GAWs,input GAWs GAWs, which 푦푠 (iswhich푡) described is based is described inon Section in 2.2. The ( ) the distributionsSectionsynthesizer of 2.2 the. The estimated described synthesizer parameters in Sectiondescribed 2.4휓푖 ingeneratesthat Section are fitted 2.4 synthetic generates as described GAWs syntheticy ins tSectionis GAWs based 2.3. on푦 푠( the푡) is distributions based on of thethe distributions estimated parametersof the estimatedψi that parameters are fitted as휓푖 described that are fitted in Section as described 2.3. in Section 2.3. 2.2. Analyzer 2.2. Analyzer Figure2.2. 4 Analyzershows the block diagram of the analysis-by-synthesis approach used in the Figure4 shows the block diagram of the analysis-by-synthesis approach used in the analyzer to modelFigure input 4 shows GAWs. the Ablock similar diagram approach of the was analysis proposed-by-synthesis in the past approach to model used in the analyzer to model input GAWs. A similar approach was proposed in the past to model diplophonicanalyzer GAWs to [ 20model] where input two GAWs. simultaneous A similar fundamental approach wasfrequency proposed (푓표) tracksin the werepast to model diplophonic GAWs [20] where two simultaneous fundamental frequency ( fo) tracks were used. In thisdiplophonic study, instead GAWs of [20 two] where 푓표 tracks, two simultaneous only one 푓표 trackfundamental is estimated. frequency First, (the푓 ) tracks푓표 were used. In this study, instead of two f tracks, only one f track is estimated.표 First, the f track of theused. input In thisGAW study, is extracted instead byof twoa hidden 푓 tracks,o Markov only model one 푓 (HMM) tracko iscombined estimated. with First, the 푓 o track of the input GAW is extracted표 by a hidden Markov표 model (HMM) combined with표 repetitivetrack execution of the of input a Viterbi GAW a lgorithm.is extracted An by unmodulated a hidden Markov quasi modelunit pulse (HMM) train combined 푢푡 is with repetitive execution of a Viterbi algorithm. An unmodulated quasi unit pulse train ut is generatedrepetitive by an oscillator execution driven of a Viterbiby the aextractedlgorithm. 푓An표 track. unmodulated The pulse quasi locations unit pulse of this train 푢푡 is generated by an oscillator driven by the extracted fo track. The pulse locations of this pulse pulse traingenerated approximate by an the oscillator time instants driven of theby themaxima extracted of the 푓 input track. GAW. The Anpulse additional locations of this train approximate the time instants of the maxima of표 the input GAW. An additional pulse pulse trainpulse for trainindic approximateating the locations the time of instantsthe minima of the of maxima the input of theGAW input is obtained GAW. An by additional train for indicating the locations of the minima of the input GAW is obtained by phase phase shiftingpulse thetrain train’s for indic phaseating by the180 locations표. The instantaneous of the minima phase of theand input amplitude GAW ofis theobtained by shifting the train’s phase by 180◦. The instantaneous phase and amplitude of the maxima maxima and minima are extracted from the quasi180표 unit pulse train, which is explained in phaseand shifting minima the are train’s extracted phase from by the quasi. The unit instantaneous pulse train, whichphase and is explained amplitude in detailof the in detail in Section 2.5. maximaSection and 2.5 .minima are extracted from the quasi unit pulse train, which is explained in detail in Section 2.5.

FigureFigure 4. 4.The The modified modified version version of of the the analyzer analyzer used used for for modelling modelling GAWs, GAWs, modified modified from from [20 [20].]. Figure 4. The modified version of the analyzer used for modelling GAWs, modified from [20]. A Fourier synthesizerA Fourier synthesizer (FS) uses (FS) the uses extracted the extracted instantaneous instantaneous phase phaseΘ(t) Θand(t) and the the Fourier Fourier coefficientscoefficientsA Fourier 푅푘 R obtainedk synthesizerobtained from from (FS) pulse pulse uses shapes shapes the extracted 푟r푙l to model model instantaneous the the input input GAW. GAW.phase The Θ(t) The pulse and shapes the Fourier coefficients 푅푘 obtained from pulse shapes 푟푙 to model the input GAW. The Appl. Sci. 2021, 11, 1990 6 of 13

are obtained by cross-correlating the quasi unit pulse with each block of the input GAW of length 32 ms obtained using a Hanning window with a 50% overlap. These single-peak pulse shapes are then modelled using Chen’s model [24]. Chen’s model estimates the pulse parameters which are used for generating synthetic GAWs. The pulse shapes are transformed to the Fourier coefficient Rk using the discrete Fourier transform (DFT). The synthesized GAW yF(t) is obtained by the Fourier synthesizer using the Fourier coefficients of the pulse shapes and the extracted instantaneous phase. The synthesized GAW yF(t) is multiplied with an amplitude modulator m(t) to obtain a modelled GAW,

yˆ(t) = yF(t)· m(t) (1)

The modelling error E is the level of the root mean squared difference between the input GAW y(t) and the modelled GAW yˆ(t) i.e.,  q  (y(t) − yˆ(t))2 E(t) = 20 · log10 q  [dB] (2) y2(t)

yˆ(t) obtained using the unmodulated quasi unit pulse train ut is the output of a nonmodulating model where the frequency and amplitude modulation present in the original GAWs are not modelled. Random modulations of the individual pulses of the input GAW are modelled by minimizing the error E via modulating the quasi unit pulse trains, which approximates a nonlinear system in a linear way. Pulse heights and time instants of the unit pulse trains are modulated on a pulse-to-pulse time scale. Additionally, the modelled GAWs are clipped at the baseline to minimize the modelling error E. The amplitude modulation vector h(ρ) and the pulse time modulation vector ∆(ρ) of the quasi unit pulse train are estimated iteratively by minimizing the time domain modelling error E(t). ρ is the pulse index.

2.3. Distribution Fitting The parameters of the pulse shape of the estimated input GAW are used for generating synthetic GAWs. These parameters are fitted to Gaussian distributions independently of each other and random numbers are drawn from the distribution to obtain original synthesis parameters. A is a parameter matrix containing N rows and P columns. N is the number of rows of a single GAW type, and P is the number of frames in each GAW, which may be different for each GAW. First, the mean µA and the standard deviation σA of the matrix is taken over all frames for each GAW, i.e.,

∑P A µ = i=1 i (3) A P s 2 ∑P (A − µ ) σ = i=1 i A (4) A P These two measures describe the distribution of a parameters observed in a GAW. To reflect inter-GAW variation of the parameter distribution, means and standard deviations of µA and σA are also obtained. ∑N µ µ = i=1 A,i (5) µA N s N 2 ∑ µA,i − µµ σ = i=1 A (6) µA N ∑N σ µ = i=1 A,i (7) σA N Appl. Sci. 2021, 11, 1990 7 of 13

s N 2 ∑ (σ − µσ ) σ = i=1 A,i A (8) σA N In particular, two Gaussian distributions are estimated indicating the variation of the means and standard deviations of a GAW type’s parameter over time. From each of these two Gaussian distributions N (µ, σ) with mean µ and standard deviation σ, M random numbers are drawn.  µAs,M ∼ N µµA , σµA (9)

σAs, M ∼ N (µσA , σσA ) (10)

These M random pairs of µAs and σAs indicate the means and standard deviations of the parameters for synthesizing M GAWs. {\displaystyle {\mathcal {N}}(\mu,\sigma

ˆ{2})}Using each pair of µAs and σAs , a Gaussian distribution is obtained. From this distri- bution, R numbers are drawn which indicate the values of a parameter for R frames.

As,M ∼ N (µAs , σAs ) (11)

All parameters of the pulse shape are generated using the same process, but differ- ent distribution parameters. The parameters are input to the synthesizer for generating synthetic GAWs.

2.4. Fourier Synthesizer This subsection explains the Fourier synthesizer. The Fourier synthesizer generates yF(t) using the estimated locations and amplitudes of the extrema of the input GAW and the Fourier coefficients obtained by transforming the pulse model. The location of two quasi unit pulse trains approximate the maxima and minima of the individual pulses of the input GAW. They are the supporting points at which the instantaneous phase is equal to odd and even integer multiples of π, i.e., (π, 3 π, 5 π... and 2 π, 4 π, 6 π ... ), respectively. These supporting points are sorted and interpolated to obtain Θ(t). Instantaneous phase Θ(t) and the Fourier coefficients Rk obtained by transforming prototype pulses are provided to the Fourier synthesizer to synthesize the GAWs. The amplitude of the modelled GAW is modulated by multiplying the carrier yF(t) by an estimated modulator m(t). The modulator is obtained as a superposition of the estimated upper envelope mu(t) and the lower envelope ml (t), i.e.,

 1 + y (t)   −1 + y (t)  m(t) = F ·m (t) − F ·m (t) (12) 2 u 2 l

The envelopes are estimated by interpolating the pulse heights of the modulated unit pulse trains using shape preserving piecewise cubic interpolation. The carrier yF(t) is obtained as the negative cosine of the estimated instantaneous phase Θ(t). Hence, at times when yF(t) is +1 and −1, the modulator is set equal to the upper and lower envelope respectively. At times in between, (12) enables smooth transition. To obtain the signals yF(t) used in the generation of the synthesized corpus, the same synthesizer is used, but instead of using as input the parameters estimated from an input GAW, parameters drawn from the distributions explained in Section 2.3 are used.

2.5. Features and Classification Modelling errors are used as features to distinguish between the positive and the negative group. Eunmod is the modelling error obtained using an unmodulated quasi unit pulse u(t) train and Emod is the improved modelling error obtained using a modulated pulse train ue(t). An SVM classifier with a linear kernel is used for classifying the positive (vocal fry GAWs) and the negative (euphonic GAWs) groups with the two modelling errors as predictors. Appl. Sci. 2021, 11, 1990 8 of 14

2.5. Features and Classification Modelling errors are used as features to distinguish between the positive and the negative group. 퐸푢푛푚표푑 is the modelling error obtained using an unmodulated quasi unit pulse 푢(t) train and 퐸푚표푑 is the improved modelling error obtained using a modulated Appl. Sci. 2021, 11, 1990 pulse train 푢̃(t). An SVM classifier with a linear kernel is used for classifying the positive8 of 13 (vocal fry GAWs) and the negative (euphonic GAWs) groups with the two modelling errors as predictors. The subtypes of AMS-type vocal fry GAWs are further distinguished based on the cyclicityThe subtypesof their modulators. of AMS-type The vocal modulators fry GAWs for are CAM furtherP = 2 distinguished and CAMP = based 4 are oncyclic, the cyclicitywhereas ofthe their modulators modulators. are acyclic The modulators in AAMP fortype CAMP GAWs =. 2As and a result, CAMP the = 4 magnitude are cyclic, whereasspectra of the the modulators acyclic modulators are acyclic are in AAMPmore homogene type GAWs.ous acrossAs a result, frequencies the magnitude than the spectramagnitude of the spectra acyclic of the modulators cyclic modulators. are more homogeneousThe homogeneity across of a frequenciesmagnitude spectrum than the magnitudeis reflected spectraby Shannon’s of the cyclicEntropy modulators. The homogeneity of a magnitude spectrum is reflected by Shannon’s Entropy 푁 ∑푓=1 푆(푓) 푙표푔2푆(푓) 퐻 = −N (13) ∑ f =1 S( f푙표푔) log2푁2푓S( f ) H = − (13) log N 2 f 퐺(푓) where 푆(푓) = G( f ) , is the probability distribution of the modulators’ power spectrum where S( f ) = ∑푓 퐺(푓) , is the probability distribution of the modulators’ power spectrum 2∑ f G( f ) 퐺(푓) = |푋(푓)| , 푋(푓) is the discrete Fourier transform of the modulator and 푁푓 is the G( f ) = |X( f )|2, X( f ) is the discrete Fourier transform of the modulator and N is the total total number of discrete frequencies 푓. To distinguish between CAMP = 2 andf CAMP = 4 number of discrete frequencies f . To distinguish between CAMP = 2 and CAMP = 4 GAWs, GAWs, the modulator-to-carrier frequency ratio 푟 = 푓 ⁄푓 , where the modulator the modulator-to-carrier frequency ratio r = f / f , where푓 the푚 modulator푐 frequency f is frequency 푓 is the frequency correspondingf m c to the maximal magnitude of m the the frequency푚 corresponding to the maximal magnitude of the modulator’s spectrum and modulator’s spectrum and 푓푐 is the carrier frequency estimated using the time instances fc is the carrier frequency estimated using the time instances of the quasi unit pulse train. of the quasi unit pulse train. A pairwise statistical comparison is conducted to check the A pairwise statistical comparison is conducted to check the variability of the two features, variability of the two features, i.e., the entropy of the modulator spectrum and the 푟푓 i.e., the entropy of the modulator spectrum and the r f ratio, in terms of p-values between theratio, different in terms subtypes of p-values of vocal between fry GAWs. the different subtypes of vocal fry GAWs.

3.3. ResultsResults FigureFigure5 5 shows shows waveforms waveforms involved involved in in the the modelling modelling of of an an example example of of an an amplitude amplitude modulatedmodulated naturalnatural vocalvocal fryfry GAWGAW usingusing thethe nonmodulatingnonmodulating modelmodel (top(top plot)plot) andand thethe modulatingmodulating modelmodel (bottom(bottom plot).plot). TheThe inputinput GAWGAW isis anan amplitudeamplitude modulatedmodulated signalsignal withwith fourfour pulsespulses withinwithin aa modulatormodulator cyclecycle (CAMP(CAMP == 4).4). TheThe modulatingmodulating modelmodel hashas aa modellingmodelling errorerror ofof −−111.551.55 dB dB for for the the example example GAW GAW,, which is larger than the modelling errorerror obtainedobtained usingusing thethe nonmodulatingnonmodulating modelmodel ((−−22.15.15 dB) dB).. A smaller modelling errorerror indicatesindicates aa betterbetter fitfit toto thethe inputinput GAW.GAW.

FigureFigure 5.5. ModellingModelling ofof anan exampleexample ofof anan amplitudeamplitude modulatedmodulated vocalvocal fryfry GAWGAW usingusing thethe nonmodu- latingnonmodulating model (top model plot) and(top theplot) modulating and the modulating model (bottom model plot). (bottom plot).

The modulating model is observed to fit the input GAW better than the nonmodulating model. In particular, the nonmodulating model fails to track the jitter and shimmer (instantaneous frequency and amplitude modulations) of the individual pulses of the input GAW. As a result, the individual extrema of the estimated unit pulse trains (not shown here) are equally high, which results in constant upper and lower envelopes. Hence, the modulator estimated using these constant envelopes has negligible fluctuation. On the other hand, in the modulating model, the quasi unit pulse trains track the instantaneous Appl. Sci. 2021, 11, 1990 9 of 14

The modulating model is observed to fit the input GAW better than the nonmodulating model. In particular, the nonmodulating model fails to track the jitter and shimmer (instantaneous frequency and amplitude modulations) of the individual pulses of the input GAW. As a result, the individual extrema of the estimated unit pulse trains Appl. Sci. 2021, 11, 1990 9 of 13 (not shown here) are equally high, which results in constant upper and lower envelopes. Hence, the modulator estimated using these constant envelopes has negligible fluctuation. On the other hand, in the modulating model, the quasi unit pulse trains track the instantaneousfrequency frequency and and amplitude amplitude modulation modulation ofof thethe individual pulses pulses of of the the input input GAW. The GAW. The modulator estimated estimated using using these these quasi quasi unit unitpulse pulse trains trainsfluctuates fluctuates in accordance in with the accordance withinput the GAW, input which GAW, results which in results a modelling in a modelling error smaller error than smaller the errorthan th achievede error when using achieved whenthe using nonmodulating the nonmodulating model. model. Figure 6 showsFigure waveforms6 shows waveforms involved in involved the modelling in the modelling of an example of an example of a euphonic of a euphonic GAW GAW using theusing non themodulating nonmodulating model (top model plot) (top and plot) the andmodulating the modulating model (bottom model (bottomplot). plot). The The modulationmodulation is negligible is negligible as compare as comparedd to what to is what observed is observed in vocal in vocalfry GAW fry GAW in Figure in Figure 5. Thus, 5. Thus, the GAWsthe GAWs modelled modelled by the by thetwo twomodels models are aresimilar similar which which makes makes the thedifference difference between between the themodelling modelling errors errors smaller smaller compared compared to the to difference the difference between between the modelling the modelling errors errors obtainedobtained for the for GAW the GAWshown shown in Figure in Figure 5. 5.

Figure 6. ModellingFigure 6. of Modelling an example of an of example a natural of euphonica natural euphonic GAW using GAW the using nonmodulating the nonmodulating model (topmodel plot) and the modulating model(top plot) (bottom and plot).the modulating model (bottom plot).

E E 퐸푚표푑 vs. 퐸푢푛푚표푑mod arevs. theunmod two featuresare the two used features for classifying used for amplitude classifying modulated amplitude vocal modulated vocal fry GAWs (positivefry GAWs group) (positive and group) euphonic and (negative euphonic (negativegroup). Figure group). 7 Figureshows7 theshows scatter the scatter plots plots of the twoof the modelling two modelling errors errorsplotted plotted against against each other each otherfor natural for natural (a) and (a) synthetic and synthetic (b) data. (b) data. For vocalFor vocal fry GAWs, fry GAWs, the use the of use the of modulating the modulating model model results results in modelling in modelling errors errors smaller smaller than thanthe modelling the modelling errors errors achieved achieved when when using using the non themodulating nonmodulating model. model. The The line of line of equalityequality (LoE) (LoE) is d isashed. dashed. Modelling Modelling errors errors of of the euphoniceuphonic GAWs GAWs are are well well separated from separated fromthe the modelling modelling errors errors of vocal of vocal fry GAWsfry GAWs in the in featurethe feature space. space. An SVMAn SVM classifier with a classifier withlinear a linear kernel kernel acheives acheives a 5-fold a 5-fold cross cross validated validated accuracy accuracy of of 100% 100% for for natural natural and synthetic and syntheticGAWs. GAWs. The The performance performance of of the the classification classification is reportedis reported using using sentivities, sentivities, specificities and accuracies with 95% confidence intervals (CI) in Table1. For both natural and synthetic specificities and accuracies with 95% confidence intervals (CI) in Table 1. For both natural GAWs, no false positives or false negatives are observed. and synthetic GAWs, no false positives or false negatives are observed.

Table 1. Sensitivities, specificities and accuracies of classification between vocal fry and euphonic GAWs with 95% confidence interval (CI).

Sensitivity Specificity Accuracy GAW Value 95% CI Value 95% CI Value 95% CI Natural 100% 59.04% to 100.00% 100% 63.06% to 100.00% 100% 78.20% to 100.00% Synthetic 100% 99.59% to 100.00% 100% 98.78% to 100.00% 100% 99.69% to 100.00% Appl. Sci. 2021, 11, 1990 10 of 13 Appl. Sci. 2021, 11, 1990 10 of 14

Figure 7. ScatterFigure plots 7. of Scatter the modelling plots of errrorsthe modelling vocal fry errrors GAWs vocal and otherfry GAW typess ofand GAWs: other (typesa) natural of GAWs and (b: (a)) synthetic GAWs. natural and (b) synthetic GAWs. The subtypes of amplitude modulated vocal fry GAWs are distinguished using the Table 1. Sensitivities,entropy specificities of the modulator’s and accuracies power of classification spectral density between (PSD) vocal fry and and the euphonic modulator-to-carrier GAWs with 95% confidence interval (CI). frequency ratio r f . Figure8 shows box plots of the entropy and r f values of natural and syntheticSensitivity vocal fry GAW subtypes.Specificity The entropy of the modulator’sAccuracy PSD is used to dis- GAW tinguish between cyclically and acyclically modulated vocal fry GAWs. The entropy of Value 95% CI Value 95% CI Value 95% CI the modulators’ PSD is larger for modulator with a homogeneous spectrum than for a 59.04% to 63.06% to 78.20% to Natural modulator100% with inhomogeneous100% spectrum. This results100% in larger modulator spectrum entropies for AAMP100.00% than for CAMP. An100.00% Anova test is performed100.00% to check the variance of 99.59% to 98.78% to p 99.69% to Synthetic entropy100% between the subtypes100% of vocal fry GAWs. The100%-values obtained between CAMP (CAMP = 2 and100.00% CAMP = 4) and AAMP100.00% are less than 0.01, which100.00% indicates significant differences between the group means of CAMP and AAMP. For the two cyclically modu- The subtypeslated groups,of amplitude the obtained modulated p-value vocal is fry 0.2526. GAWs Therefore, are distinguished a distinction using is possible the between entropy of theCAMP modulator’s and AAMP power type spectral GAWs baseddensity on the(PSD) entropies and the of modulator the modulator-to-carrier spectrum. To make frequency ratioa distinction 푟푓. Figure between8 shows box the twoplots subtypes of the entropy of cyclically and 푟푓modulated values of natural vocal fry and GAWs, the r f synthetic vocalratio fry is used.GAW The subtypes.r f ratio The for entropy double pulsing of the modulator’s(CAMP = 2) isPSD approximately is used to 0.5 and for distinguish betweenquadruple cyclically pulsing and (CAMP acyclically = 4) is modulated approximately vocal 0.25. fry InGAWs. double The pulsing, entropy a modulator of cycle the modulators’contains PSD twois larger pulses, for whereas modulator in quadruple with a homogeneous pulsing, a modulator spectrum cycle than contains for a four pulses. Appl. Sci. 2021, 11, 1990 The Anova test for the r ratio resulted in a p-value of less than 0.05,11 which of 14 indicates a modulator with inhomogeneous spectrum.f This results in larger modulator spectrum entropies for significantAAMP than difference for CAMP between. An Anova the cyclically test is performed modulated to check vocal frythe GAWs.variance of entropy between the subtypes of vocal fry GAWs. The P-values obtained between CAMP (CAMP = 2 and CAMP = 4) and AAMP are less than 0.01, which indicates significant differences between the group means of CAMP and AAMP. For the two cyclically modulated groups, the obtained p-value is 0.2526. Therefore, a distinction is possible between CAMP and AAMP type GAWs based on the entropies of the modulator spectrum. To make a distinction between the two subtypes of cyclically modulated vocal fry GAWs, the 푟푓 ratio is used. The 푟푓 ratio for double pulsing (CAMP = 2) is approximately 0.5 and for quadruple pulsing (CAMP = 4) is approximately 0.25. In double pulsing, a modulator cycle contains two pulses, whereas in quadruple pulsing, a modulator cycle contains four pulses. The Anova test for the 푟푓 ratio resulted in a p-value of less than 0.05, which indicates a significant difference between the cyclically modulated vocal fry GAWs.

Figure 8. FeaturesFigure used 8. Features for distinguishing used for distinguishing the subtypes the of subtypes synthetic of vocal synthetic fry GAWs:vocal fry (a GAWs:) entropy (a) ofentropy the modulators’ of power spectralthe density modulators’ (PSD) for power distinguishing spectral density cyclically (PSD vs.) for acyclically distinguishing modulated cyclically vocal vs. fry acyclically and (b) modulator-to-carrier modulated vocal fry and (b) modulator-to-carrier frequency ratio 푟 to distinguish CAMP = 2 and frequency ratio r f to distinguish CAMP = 2 and CAMP = 4 type vocal fry GAWs.푓 The features of natural GAWs are represented using”x”CAMP = encircled 4 type vocal in green. fry GAWs. The features of natural GAWs are represented using”x” encircled in green.

4. Discussion In this study, the detection of vocal fry based on characterizing the variation observed in the vibration patterns of the vocal folds is proposed. The results of the classification accuracies suggest that the proposed model enables the distinction of normal and vocal fry GAWs. The obtained classification accuracies of detection were found to be competitive with the accuracies reported for past detection techniques proposed in [3,12– 14]. In these past techniques, the detection rates of vocal fry were less than 90% and the models were based mainly on audio signals, whereas in our study, the high-speed videos of vocal folds enable a more detailed study of voice production. This may help in studying the cause–effect relationship between voice production and perception, and aid the diagnosis of voice disorders. Bailly et al. investigated the ventricular fold dynamics in human phonation [25,26]. A correlation between the vibration of the ventricular and vocal folds was demonstrated using time derivative of EGG signals (DEEG) [25]. A periodic contact between the ventricular folds was observed for every second glottal cycle which resulted in a period- doubling pattern. In [26], ventricular motion was studied using synchronized high-speed cinematographic and audio recordings. The recorded audio sequences consist of glides with transition between different voice qualities. In most of vocal fry sequences, the ventricular folds were observed to follow a slow nonoscillatory motion with a partial ventricular contact, or a total contact. However, in our study, no significant ventricular motion was observed in the high-speed videos. The audio sequences were also recorded for sustained phonation of vowels without any glides. The cyclic modulation of the vibration patterns of vocal fry which we observed using high-speed videos were only due to distinct vocal fold vibration behavior. For CAMP = 2 and CAMP = 4 type GAWs, in each modulator cycle, the peak area of the glottis increased after every subsequent opening of the vocal folds. It is suggested that the myoelastic aerodynamic cause of this phenomenon is to be studied in the future. A central aspect of our study is the analysis of noise contained in the GAWs. The GAWs extracted from the videos contain modulation noise (i.e., jitter and shimmer) and quantization noise. We modelled the modulation noise on a pulse-to-pulse time scale by modulating the quasi unit pulse train that the model uses. The additive quantization noise arises from the finite pixel resolution of the videos and is part of the modelling error levels E. In particular, depending on the relative size of the glottis in the video, a GAW is quantized to only a few hundred available values, i.e., a discrete number of pixels. We focused in this study on modelling modulation noise instead of additive quantization Appl. Sci. 2021, 11, 1990 11 of 13

4. Discussion In this study, the detection of vocal fry based on characterizing the variation observed in the vibration patterns of the vocal folds is proposed. The results of the classification accuracies suggest that the proposed model enables the distinction of normal and vocal fry GAWs. The obtained classification accuracies of detection were found to be competitive with the accuracies reported for past detection techniques proposed in [3,12–14]. In these past techniques, the detection rates of vocal fry were less than 90% and the models were based mainly on audio signals, whereas in our study, the high-speed videos of vocal folds enable a more detailed study of voice production. This may help in studying the cause–effect relationship between voice production and perception, and aid the diagnosis of voice disorders. Bailly et al. investigated the ventricular fold dynamics in human phonation [25,26]. A correlation between the vibration of the ventricular and vocal folds was demonstrated using time derivative of EGG signals (DEEG) [25]. A periodic contact between the ventric- ular folds was observed for every second glottal cycle which resulted in a period-doubling pattern. In [26], ventricular motion was studied using synchronized high-speed cine- matographic and audio recordings. The recorded audio sequences consist of glides with transition between different voice qualities. In most of vocal fry sequences, the ventricular folds were observed to follow a slow nonoscillatory motion with a partial ventricular contact, or a total contact. However, in our study, no significant ventricular motion was observed in the high-speed videos. The audio sequences were also recorded for sustained phonation of vowels without any glides. The cyclic modulation of the vibration patterns of vocal fry which we observed using high-speed videos were only due to distinct vocal fold vibration behavior. For CAMP = 2 and CAMP = 4 type GAWs, in each modulator cycle, the peak area of the glottis increased after every subsequent opening of the vocal folds. It is suggested that the myoelastic aerodynamic cause of this phenomenon is to be studied in the future. A central aspect of our study is the analysis of noise contained in the GAWs. The GAWs extracted from the videos contain modulation noise (i.e., jitter and shimmer) and quantization noise. We modelled the modulation noise on a pulse-to-pulse time scale by modulating the quasi unit pulse train that the model uses. The additive quantization noise arises from the finite pixel resolution of the videos and is part of the modelling error levels E. In particular, depending on the relative size of the glottis in the video, a GAW is quantized to only a few hundred available values, i.e., a discrete number of pixels. We focused in this study on modelling modulation noise instead of additive quantization noise, because (i) the modulations and their properties were used to directly describe the voice types that we report, and (ii) the amount of modulation noise in voices labeled as vocal fry exceeded the amount of additive quantization noise by an order of a magnitude. Added values of the presented approach include the following. First, analysis-by- synthesis combined with parameter distribution fitting provides a means of data augmen- tation. That is particularly relevant because not much natural data is available yet. The selection of the synthesis parameter distributions that we fitted to create the synthetic, i.e., augmented corpus appears to be reasonable since the scatter plots of modelling errors as well as the boxplots of the entropy of the modulator’s spectrum and the modulator- to-carrier frequency ratio reflect comparable distributions for natural and synthetic data. Second, we opted for a detailed signal modelling based approach for classification instead of using a black box approach. This has the advantages that the classification appears to be better explainable, and that qualitative knowledge regarding voice production kinematics is obtained. The limitations of our proposal and suggestions for future work include the following. First, all of the vocal fold vibrations patterns that we found in the voices which were labelled as vocal fry are auditorily perceived as pulsatile, i.e., individual glottal cycles appear to be audible due to temporal segregation. However, it is not yet clear whether listeners may reliably distinguish the subtypes that we found, or whether they may be Appl. Sci. 2021, 11, 1990 12 of 13

trained to do so. Second, although our model gives a classification with no true negatives or false negatives, it only distinguishes between vocal fry and euphonic GAWs. For future studies, we suggest the addition other types of dysphonic voices as negative data, and the combination of analyses with audio waveforms. Third, to further test and improve our waveform model, more positive natural data will be needed. With more natural data, the performance estimation of the classification could be made more confident. In particular, the lower limit of the 95% CI of classification accuracy for the natural data could be improved. Finally, instead of modelling the temporal transitions between the voice qualities, intervals of homogeneous voice qualities were preselected. Thus, models of the temporal transitions of the voice qualities related to vocal fry may be proposed in the future.

5. Conclusions This paper investigated different types of amplitude modulated vocal fry GAWs. They were modelled using an analysis-by-synthesis approach and distinguished automatically from euphonic GAWs based on their modelling errors. Traditionally, vocal fry GAWs are characterized by a long closed phase. However, we also observed amplitude modulated vocal fry GAWs without a long closed phase, and analyzed them in a detailed way. These GAWs have been termed as AAMP, CAMP = 2 and CAMP = 4 based on their cyclicity and the number of pulses in a single modulator cycle. Modulated and unmodulated GAWs were modelled for vocal fry and euphonic GAWs. The rational for using a nonmodulating model is to obtain larger modelling errors for amplitude modulated GAWs than for nonmodulated GAWs. Modelling errors of the natural and synthetic vocal fry GAWs are observed to be well separated from the euphonic GAWs in the feature space. These modelling errors are used as predictors for classifying the vocal fry and euphonic GAWs. For the natural and synthetic GAWs, no false positives or false negatives were obtained for classification between vocal fry and euphonic GAWs.

Author Contributions: V.D.: conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, and writing—original draft preparation. P.A.: conceptual- ization, methodology, validation; formal analysis, investigation, resources, data curation, writing— review and editing, visualization, supervision, project administration, and funding acquisition. Both the authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the Austrian Science Fund (FWF), grant number KLI722-B30. Institutional Review Board Statement: Ethical review and approval were waived by ethics commit- tee of Medical University of Vienna. The approval number is 1473/2016. Informed Consent Statement: Informed consent was obtained from all subjects involved in the study. Data Availability Statement: P.Aichinger, I. Roesner, M. Leonhard, D. Denk-Linnert and B.Schneider- Stickler, “A database of laryngeal high-speed videos with simultaneous high-quality audio recordings of pathological and nonpathological voices”, Proceedings of the International Conference on Language Resources and Evaluation (LREC), vol.10, pp. 767–770, 2016. Acknowledgments: This work was supported by the Austrian Science Fund (FWF): KLI 722-B30. The authors would like to thank the University Hospital Erlangen for providing the segmentation tool. Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results. Appl. Sci. 2021, 11, 1990 13 of 13

References 1. Whitehead, R.L.; Metz, D.E.; Whitehead, B.H. Vibratory patterns of the vocal folds during pulse register phonation. J. Acoust. Soc. Am. 1984, 75, 1293–1297. [CrossRef][PubMed] 2. Blomgren, M.; Chen, Y.; Gilbert, H.R. Acoustic, aerodynamic, physiologic, and perceptual properties of modal and vocal fry registers. J. Acoust. Soc. Am. 1998, 103, 2649–2658. [CrossRef][PubMed] 3. Ishi, C.T.; Ishiguro, H.; Hagita, N. Proposal of acoustic measures for automatic detection of vocal fry. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisbon, Portuga, 4–8 September 2005. 4. Keating, P.A.; Garellek, M.; Kreiman, J. Acoustic properties of different kinds of creaky voice. In Proceedings of the International Congress of Phonetic Sciences, Glasgow, UK, 10–14 August 2015. 5. Degottex, G. Glottal Source and Vocal-Tract Separation. Ph.D. Thesis, Université Pierre et Marie Curie-Paris VI, Paris, France, 2010. 6. Laver, J. The phonetic description of voice quality. Camb. Stud. Linguist. Lond. 1980, 31, 1–186. 7. Aichinger, P.; Roesner, I.; Schoentgen, J. Auditory sensation of impulsivity and tonality in vocal fry. J. Acoust. Soc. Am. 2019, 145, 1909. [CrossRef] 8. Ylitalo, R.; Hammarberg, B. Voice characteristics, effects of voice therapy, and long-term follow-up of contact granuloma patients. J. Voice 2000, 14, 557–566. [CrossRef] 9. Hollien, H.; Moore, P.; Wendahl, R.W.; Michel, J.F. On the nature of vocal fry. J. Speech Hear. Res. 1966, 9, 245–247. [CrossRef] [PubMed] 10. Hollien, H.; Wendahl, R. Perceptual study of vocal fry. J. Acoust. Soc. Am. 1968, 43, 506–509. [CrossRef][PubMed] 11. Gottliebson, R.O.; Lee, L.; Weinrich, B.; Sanders, J. Voice problems of future speech-language pathologists. J. Voice 2007, 21, 699–704. [CrossRef][PubMed] 12. Drugman, T.; Kane, J.; Gobl, C. Data-driven detection and analysis of the patterns of creaky voice. Comput. Speech Lang. 2014, 28, 1233–1253. [CrossRef] 13. Vishnubhotla, S.; Espy-Wilson, C.Y. Automatic detection of irregular phonation in continuous speech. In Proceedings of the Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, 17–21 September 2006. 14. Martin, P. Automatic detection of voice creak. In Proceedings of the Speech Prosody, Sixth International Conference, Shanghai, China, 22–25 May 2012. 15. Narendra, N.P.; Rao, K.S. Automatic detection of creaky voice using epoch parameters. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015. 16. Lauri, T.; Tanel, A.; Werner, S. Recognition of Creaky Voice from Emergency Calls. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1990–1994. 17. Hollien, H.; Girard, G.T.; Coleman, R.F. Vocal fold vibratory patterns of pulse register phonation. Folia Phoniatr. Logop. 1977, 29, 200–205. [CrossRef][PubMed] 18. Moore, P.; von Leden, H. Dynamic variations of the vibratory pattern in the normal larynx. Folia Phoniatr. Logop. 1958, 10, 205–238. [CrossRef][PubMed] 19. Childers, D.G.; Lee, C.K. Vocal quality factors: Analysis, synthesis, and perception. J. Acoust. Soc. Am. 1991, 90, 2394–2410. [CrossRef][PubMed] 20. Aichinger, P.; Pernkopf, F. Synthesis and Analysis-by-Synthesis of Modulated Diplophonic Glottal Area Waveforms. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 914–916. [CrossRef] 21. Aichinger, P.; Roesner, I.; Leonhard, M.; Denk-Linnert, D.; Schneider-Stickler, B. A database of laryngeal high-speed videos with simultaneous high-quality audio recordings of pathological and non-pathological voices. In Proceedings of the InternationalCon- ference on Language Resources and Evaluation (LREC), Portorož, Slovenia, 23–28 May 2016; Volume 10, pp. 767–770. 22. Glottis Analysis Tools (GAT-2018), Computer Program: Version 5; Department of Phoniatrics and Pediatric Audiology, University Hospital Erlangen: Erlangen, Germany. 23. Voigt, D.; Döllinger, M.; Braunschweig, T.; Yang, A.; Eysholdt, U.; Lohscheller, J. Classification of functional voice disorders based on phonovibrograms. Artif. Intell. Med. 2010, 49, 51–59. [CrossRef][PubMed] 24. Chen, G.; Shue, T.L.; Kreiman, J.; Alwan, A. Estimating the voice source noise. In Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association, Portland, OR, USA, 9–13 September 2012. 25. Bailly, L.; Bernardoni, N.H.; Müller, F.; Rohlfs, A.K.; Hess, M. Ventricular-fold dynamics in human phonation. J. Speech Lang. Hear. Res. 2014, 57, 1219–1242. [CrossRef][PubMed] 26. Bailly, L.; Nathalie, H.; Xavier, P. Vocal fold and ventricular fold vibration in period-doubling phonation: Physiological description and aerodynamic modeling. J. Acoust. Soc. Am. 2010, 127, 3212–3222. [CrossRef][PubMed]