arXiv:1812.03415v2 [cs.SD] 3 Aug 2019

INCREASE APPARENT PUBLIC SPEAKING FLUENCY BY SPEECH AUGMENTATION

Sagnik Das Nisha Gandhi Tejas Naik Roy Shilkrot

Human Interaction Lab, Stony Brook University Department of Computer Science Stony Brook, NY, USA

ABSTRACT

Fluent and confident speech is desirable to every speaker, but professional speech delivery requires a great deal of experience and practice. In this paper, we propose a speech stream manipulation system that helps non-professional speakers produce fluent, professional-like speech content, in turn contributing towards better listener engagement and comprehension. We achieve this by manipulating the disfluencies in human speech: the sounds uh and um, the filler words, and awkward long silences. Given any unrehearsed speech, we segment and silence the filled pauses, then adjust the duration of the imposed silences, as well as other long (disfluent) pauses, using a predictive model learned from a professional speech dataset. Finally, we output an audio stream in which the speaker sounds more fluent, confident and practiced than in the original recording. According to our quantitative evaluation, we significantly increase the fluency of speech by reducing the rate of pauses and fillers.

Index Terms: Speech disfluency detection, Speech disfluency repair, Speech Processing, Assistive technologies in speech

1. INTRODUCTION

Professional speakers, who make their living from their speech, speak clearly and fluently with very few repetitions and revisions. Such error-free utterances are the result of many hours of practice and experience. A regular speaker, on the other hand, generally speaks with no real practice of articulation and delivery. Naturally, an unrehearsed speech contains unintentional disfluencies that interrupt its flow. Speech disfluencies generally include long pauses, discourse markers, repeated words, phrases or sentences, and fillers or filled pauses like uh and um. According to [1], approximately 6% of speech appears to be non-pause disfluency. Filled pauses or filler words are the most common disfluency in any unrehearsed, impromptu speech [2].

Fig. 1. The proposed speaker augmentation pipeline (stages shown: filler-word segmentation, disfluent silence detection, filler word removal, silence synthesis).

Numerous linguistic studies have examined the effect of speech disfluencies on the listener's comprehension and on the speaker's cognitive state, such as uncertainty, confidence, thoughtfulness and cognitive load [3, 2]. Different studies claim that disfluencies often reflect uncertainty in the speaker's mind about upcoming statements. Consequently, less confident speakers tend to be more disfluent [2]. Moreover, it has been observed that filled pauses specifically indicate the level of cognitive difficulty of the speaker [4]. Generally, disfluencies occur before a longer utterance [5] or when the topic is unfamiliar to the speaker [6]. Fluency reflects the speaker's ability to focus the listener's attention on his/her message rather than inviting the listener to focus on the idea and try to self-interpret it [7]. Considering the diverse factors affecting speaker fluency, our idea is to doctor a speech to make it fluent by taking care of the temporal factors contributing to it.

Examples: https://sagniklp.github.io/pub-speaker-aug/

In this work, we propose a system to detect, segment, and remove the most common disfluencies, namely filler words and long, unnatural pauses, from a speech to aid the speaker's fluency. Our system takes a raw speech track as input and outputs a modified, fluent version of it by removing the filled pauses and adjusting the "long" silences.

We interpret the occurrence of a disfluency in a speech as an acoustic event, and take a segmentation approach to detection. A combined CNN and RNN architecture, a convolutional recurrent neural network (CRNN) inspired by [8], is used for this task. Further, a binary classification approach is taken to detect long pauses between words. After deleting the filler words and adjusting the silences, the fluent version of the speech is obtained. The performance of our system is evaluated on speeches of non-native speakers of English using the fluency metrics proposed by [9]. We also propose an assistive user interface that helps users visualize and comparatively analyze their speech.

The essential contributions of this paper are -

1. A disfluency detection mechanism that works directly on acoustic features without using any language features.

2. A silence modeling scheme directly conditioned on the previous speech.

3. A disfluency repair technique to help users improve a pre-delivered speech.

2. RELATED WORKS

In recent years, there have been many works related to speech disfluencies, spanning the domains of psychology, linguistics and natural language processing (NLP). Psychology and linguistics researchers focused on defining disfluencies and on their causes and effects from a language and cognitive perspective, whereas NLP researchers focused more on detecting them in speech transcripts to help language understanding and recognition systems.

The prime motivation for disfluency detection in NLP is to better interpret speech-to-text transcripts for natural language understanding systems.

One of the first works [10] focuses on classifying edit words (restarts and repairs) in text using a boosted classifier. Another contemporary method applied a noisy channel model to detect and correct speech disfluencies [11, 12, 13]. Later, methods based on Hidden Markov Models (HMM), Conditional Random Fields (CRF) and Integer Linear Programming (ILP) [14, 15] were introduced. A classification approach using lexical features is taken by [16], which focuses specifically on schizophrenic patient dialogs. Some incremental [17, 18, 19, 20], multi-step [21] and joint task (parsing and disfluency detection) [22, 17] methods were introduced recently. Though all of these provide some convincing results, they are limited to pre-defined feature templates (lexical, acoustic, and prosodic).

With advances in deep learning, the most recent methods rely on recurrent neural networks (RNN) [23, 24, 25, 26, 27]. These methods use word embeddings and acoustic features instead of pre-defined feature templates.

All the techniques above make one fundamental assumption: any disfluency detection must have an automatic speech recognizer (ASR) in the pipeline. Consequently, to the best of our knowledge, all disfluency detection schemes presented so far work at the transcript level. Also, these systems have never been paired with an acoustic-level repair scheme with the goal of exploring use cases from the perspective of the listener and the speaker.

In our work, we address these motivations by devising a disfluency detection and repair method that relies solely on acoustic features to synthesize temporally fluent speech segments from the perspective of human interaction.

3. PROPOSED METHOD

3.1. Disfluency Detection

Our work focuses on building a system that can be used not only as a disfluency detection system but also as a way to understand users' disfluency better. The primary motivations of this work are the following -

• Work with disfluencies at the acoustic level without using any transcript.

• A significant portion of a disfluent speech consists of long pauses. At the transcript level this is not an issue, but at the acoustic level it matters a lot in determining the speaker's fluency.

• Repair disfluent segments to help users understand the possible improvements to their speech, and to allow them to create fluent speech content without much hassle.

The types of disfluencies we consider in this work are the use of filler words and intermittent long pauses.

3.1.1. Dataset

The dataset used for filler word segmentation is obtained from the Switchboard transcription (1). We also used the Automanner (2) transcription [28] for additional data. This gives more generalization to our training samples since it contains recordings from standard interfaces.

(1) https://www.isip.piconepress.com/projects/switchboard/
(2) https://www.cs.rochester.edu/hci/currentprojects.php?proj=automanner

To label disfluent silences we use a combination of a silence probability model [29] and a disfluency detection model [25]. First, we locate the silences in the dataset and segment each word pair, then decide according to the silence probability model whether the silence is disfluent. For each word pair utterance, the silence probability model gives the probability of a silence ($P_{sil}$) occurring between the two words. A word pair with low $P_{sil}$ but a significant amount of silence is labeled as disfluent. If the word pair doesn't exist in the model vocabulary, we resort to the following approach. Since longer silences accompany general disfluencies, any silence within a disfluent segment is labeled as an unnatural pause. Additionally, word pairs surrounded by silences of more than 0.7 seconds are also labeled similarly. This choice can be considered safe because it is considerably higher than the experimental and quantitative measure suggested for micro-pauses (0.2 secs. [30]). On the other hand, additional fluent pairs are collected from TIMIT [31].

3.1.2. Features

In this step, frame-level acoustic features (log mel band energy or mel frequency cepstral coefficients (MFCCs)) are obtained at each timestep $t$, resulting in a feature vector $m_t \in \mathbb{R}^{C}$. Here, $C$ is the number of features (in the frequency dimension) at frame $t$. The task of segmenting the filler words is formulated as binary classification of each frame into its correct class $k$ (Eq. 1).

\[ \arg\max_{\theta} P(y_t^{(k)} \mid m_t, \theta) \tag{1} \]

where $k = \{1, 2\}$ and $\theta$ are the parameters of the classifier. In the training data, the target class $y_t^{(k)} = 1$ if frame $t$ belongs to class $k$ (determined using the onset/offset timeline of class $k$ associated with a sound segment), and zero otherwise.

Each soundtrack $S_k$ is divided into multiple fixed-length sequences of frames $M_{t:t+T-1}$, where $T$ is the length of the frame sequence. The corresponding class label matrix $Y_{t:t+T-1} \in \mathbb{R}^{N \times T}$ contains all the $y_t$.

Fig. 2. Block diagram of the filler-word segmentation stack (blocks shown: MFCC, Conv and MaxPool, GRU1 … GRUL, FC1 (ReLU), FC2 Softmax).

3.1.3. CRNN for filler word segmentation

Here we propose a Convolutional Recurrent Neural Network (CRNN) for filler word segmentation. A similar architecture was previously used for speech recognition [32] and sound event detection (SED) [8]. The architecture is a combination of convolutional and recurrent layers, followed by feed-forward layers.

The sequence of extracted features $M \in \mathbb{R}^{C \times T}$ is fed to the CNN layers with rectified linear unit (ReLU) activations. Filters used in the CNN layers span the feature and time dimensions. Max-pooling is applied only across the frequency dimension. The output after the max-pooling operation over the final convolution layer is a tensor $P_c \in \mathbb{R}^{F \times M' \times T}$, where $F$ is the number of filters of the final convolution layer and $M'$ is the truncated frequency dimension after the max-pooling operation.

To learn the features over the time axis, the $F$ feature maps are then stacked along the frequency axis, which outputs a tensor $P_s \in \mathbb{R}^{(F \times M') \times T}$. This is fed to the RNN as a sequence of frames $p_t$. The RNN layer outputs a hidden vector $\hat{p}_t$, given by Eq. 2.

\[ \hat{p}_t^{\,i} = f\left(\hat{p}_t^{\,i-1}, \hat{p}_{t-1}^{\,i}\right) \tag{2} \]

where $f$ is a function learned by each RNN unit and $\hat{p}_t^{\,i}$ is the output of the $i$-th recurrent layer. In this work, we use GRUs as presented in [33]. The final RNN layer outputs $\hat{P}_s$ are then fed to a fully-connected (FC) layer with ReLU activation, and $\hat{G} \in \mathbb{R}^{K \times T}$ is obtained, where $K$ is the number of neurons of the layer. Finally, another layer with softmax activation is applied to get the class probabilities. Let $G$ be the output tensor of the final FC layer; the probabilities are then given by -

\[ P(y_t \mid m_{0:t}, \theta) = \mathrm{Softmax}(G) \tag{3} \]
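The frame-level targets of Section 3.1.2 can be illustrated with a minimal sketch. It assumes the 30 ms frames with 15 ms overlap reported in the parameter settings, and it assumes (this rule is not stated in the paper) that a frame is labeled as filler when its centre falls inside an annotated onset/offset interval. The function names `frame_targets` and `one_hot` are hypothetical, and classes are indexed 0/1 here rather than $k = \{1, 2\}$.

```python
def frame_targets(duration_ms, filler_intervals_ms, frame_ms=30, hop_ms=15):
    """Return one class index per frame: 1 = filler, 0 = non-filler.

    Assumption: a frame counts as filler if its centre lies inside any
    (onset, offset) interval; the paper does not specify this rule.
    """
    if duration_ms < frame_ms:
        return []
    n_frames = (duration_ms - frame_ms) // hop_ms + 1
    targets = []
    for i in range(n_frames):
        centre = i * hop_ms + frame_ms // 2
        is_filler = any(on <= centre < off for on, off in filler_intervals_ms)
        targets.append(1 if is_filler else 0)
    return targets


def one_hot(targets, n_classes=2):
    """Expand class indices into per-frame indicator vectors y_t^(k)."""
    return [[1 if t == k else 0 for k in range(n_classes)] for t in targets]


# A 1-second track with one filler annotated from 300 ms to 600 ms.
labels = frame_targets(1000, [(300, 600)])
indicators = one_hot(labels)
```

Working in integer milliseconds avoids floating-point drift when counting frames; a real pipeline would take the onset/offset times from the transcription timeline.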

Fig. 3. The visualization interface; top: Speech track with colored segmentation outputs (brown: disfluent silences, blue: fillers, green: fluent silences); bottom: Modified speech.

Fig. 4. Silence modification pipeline: The dashed line on the histogram shows the median time of the fluent silences. (Panel shown: histogram of fluent silence times; x-axis: bins, y-axis: count.)
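The silence modification step illustrated in Fig. 4 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the speech has already been segmented and each silence classified, it uses the plain median of the fluent silence durations as the target, and it only shortens disfluent silences (never lengthens them). The segment labels and the function name `repair_silences` are hypothetical.

```python
from statistics import median


def repair_silences(segments):
    """Shorten disfluent silences to the median fluent silence duration.

    `segments` is a list of (kind, duration_s) tuples, where kind is one of
    "speech", "fluent_silence" or "disfluent_silence" (hypothetical labels).
    """
    fluent = [d for kind, d in segments if kind == "fluent_silence"]
    if not fluent:
        # No reference silences available: leave the track unchanged.
        return list(segments)
    target = median(fluent)
    repaired = []
    for kind, d in segments:
        if kind == "disfluent_silence":
            repaired.append(("repaired_silence", min(d, target)))
        else:
            repaired.append((kind, d))
    return repaired


track = [
    ("speech", 1.2), ("fluent_silence", 0.2),
    ("speech", 0.8), ("disfluent_silence", 1.5),
    ("speech", 1.0), ("fluent_silence", 0.4),
    ("speech", 0.5), ("fluent_silence", 0.3),
]
fixed = repair_silences(track)
total_before = sum(d for _, d in track)
total_after = sum(d for _, d in fixed)
```

Using the median rather than removing silences entirely keeps the pace of the speech intact, which matches the motivation that too much reduction makes the speech sound unnatural.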

The CRNN training objective is to minimize the cross-entropy loss with $l_2$ regularization (Eq. 4).

\[ L(\theta) = -\sum_{t} \sum_{k} \log P\left(y_t^{(k)} \mid m_{0:t}, \theta\right) + \lambda \lVert \theta \rVert_2 \tag{4} \]

3.1.4. Disfluent silence Classification

The problem is formulated as a binary classification task: given a silent segment $Z$, the task is to decide whether it is a disfluent or a non-disfluent silence. Classifying a silence only makes sense when it is combined with the adjacent utterances, because the occurrence of a silence is driven by the utterance and is also heavily influenced by disfluencies. Thus, it is not always evident that every pause longer than a given threshold is disfluent; the illustration in Fig. 3 gives an idea of this fact.

We train a support vector machine (SVM) to achieve this task. A silent segment $Z$ is first padded with the one-word utterances on its left and right ($\hat{Z}$). Then the MFCC features are extracted and we take the mean over the frequency axis to create the feature vector $z_i \in \mathbb{R}^{T}$, where $T$ is the number of frames in $z_i$. Segments are of variable length, so $z_i$ is padded with trailing zeros prior to classification. During testing we don't use the previous and next word boundaries; a fixed-length time window is used instead. In our experiments, we found that 0.8 - 1.0 secs. gives good results.

3.2. Disfluency Repair

First, the fillers are replaced with silences. We found that it is often helpful (such as when ambient noise is present) to use a decomposition mechanism [34] on the speech to separate the background noise and the vocals. Each filler is then replaced with its corresponding background segment. The modified track is then used to segment ($Z$) the silences, and finally the classification is done.

All the silence segment lengths are then modified to make the speech fluent (Fig. 4). The goal is to reduce the amount of long, unnatural pauses that hurt the fluency of the speech. It is also necessary to keep the pace of the speech intact: too much reduction of silences makes the speech unnatural and broken. We take the fluent silence times (i.e., as suggested by our silence classifier), build a histogram, and find that taking the median of the histogram bins as the optimal amount of silence works quite well. In this way, the distribution of silence along the speech progression is confined to a constant distribution, and the speaker sounds more consistent and fluent in the modified speech.

4. RESULTS & ANALYSIS

4.1. Experimental Settings

4.1.1. Datasets

The experiments are performed on Switchboard [35], Automanner [28] and our own dataset of public speaker recordings. To train the CRNN we use segments from Switchboard. The CRNN test results are reported on held-out data from Switchboard-I. Silence classification results are reported on held-out data from TIMIT [31], Switchboard, and Automanner. All the fluency metrics are evaluated on our dataset, which contains recordings of 20 non-native speakers of English. The speakers were asked to talk on a specific topic for 50-60 seconds.

4.1.2. Parameter settings

We experimented with different configurations of the CNN and RNN parameters and with different features.

Types of features: Initial experiments were performed on mel frequency cepstrum coefficients (mfcc), mel spectrograms, log mel spectrograms (log mel), spectral contrast, zero crossing rate and tonnetz. Based on the experimental results, the mfcc (40 × t) and log mel (128 × t) features are used for filler segmentation. For the silence classification, mfcc features are used after taking the mean over the frequency axis. The feature dimensions used are shown in the table. All features are extracted in 30ms frames with 15ms overlap.

RNN & CRNN parameters: In experiments with the CNN and CRNN, we explore {1, 2, 3} convolutional (conv) layers with combinations of max pooling and average pooling. ReLU activation is used at each layer. The following settings are used for the number of conv filters - {16, 32, 64} and kernel sizes - {2, 3, 4, 5, 8}. All the conv layers use same padding. The pooling size was varied within {2, 3, 4, 5, 8}. Pooling is performed only on the frequency dimension. We tried different dropout ratios of {0.3, 0.5, 0.75}.

The RNN we use is the Gated Recurrent Unit (GRU). Experiments are performed with {2, 3} layers (l) and {64, 128, 256} hidden units (d). No intermediate dropout (dr) is applied. The final fully connected layer (FC1 in Fig. 2) is experimented with {100, 200} hidden units (d) and dropout ratios of {0.3, 0.5, 0.75}.

Features   CNN                                                                            RNN          FC
mfcc       conv1 [32,(8,8)], conv2 [64,(4,4)], maxpool1 [5,5], maxpool2 [4,4], dr=0.25    l=3, d=128   d=100, dr=0.5
log mel    conv1 [32,(8,8)], conv2 [64,(4,4)], maxpool1 [8,4], maxpool2 [4,2], dr=0.25

Table 1. Final parameters for CNN, RNN and fully connected layers

The networks are trained in an end-to-end fashion using the AdaGrad algorithm for 200 epochs. The learning rate was set to 0.01. The regularization constant (λ) was set to 0.01. The final parameters are given in Table 1.

Silence classification parameters: The maximum length of the sequences was set to 128. The final parameters are given in Table 3.

SVM          XGBoost          LogReg
itr=1500     depth=3          itr=100
kernel=rbf   lr=0.1           C=10
C=10         estimators=100

Table 3. Final parameters used in silence classification

4.1.3. Evaluation Metrics

To evaluate the filler word segmentation we use the following frame-level statistics:

• F1 Score (F1): The F1 score is calculated at the frame level (30ms) using TP, the frames where fillers are correctly detected; TN, the frames where non-fillers are correctly detected; FP, the frames where fillers are wrongly detected; and FN, the frames where non-fillers are wrongly detected.

The silence classification is evaluated using the F1 score w.r.t. the disfluent silence class.

To evaluate the quality of the augmented speech from our system, we use the following metrics defined in [9]:

• Speech rate:

\[ SR = \frac{\#\text{ of syllables}}{\text{total time} - ufp[<3]} \times 60 \tag{5} \]

where $ufp[<3]$ is the total time of unfilled pauses shorter than 3 seconds, since pauses > 3 secs. are considered articulation pauses [30].

• Articulation rate:

\[ AR = \frac{\#\text{ of syllables}}{\text{total time}} \times 60 \tag{6} \]

• Phonation-time ratio:

\[ PTR = \frac{\text{speaking time}}{\text{total time}} \tag{7} \]

• Mean length of runs:

\[ MLR = \frac{\#\text{ of syllables}}{\#\text{ utterances between } p[>0.25]} \tag{8} \]

where $p[>0.25]$ denotes pauses greater than 0.25 seconds.

• Mean length of pauses:

\[ MLP = \frac{\text{total of } p[>0.2]}{\#\text{ of } p[>0.2]} \tag{9} \]

• Filled pauses per min.:

\[ FPM = \frac{\#\text{ of filled pauses}}{\text{total time}} \tag{10} \]

4.2. Filler Word Segmentation

The filler word segmentation results are given in Tables 4 and 5. In Table 4 we report the comparative performance of the CRNN using different features. To better assess the credibility of the CRNN, in Table 5 we compare its results to an automatic speech recognizer available with Kaldi (ASpIRE Chain Model (3)). Considering the simplicity of our network, it performs close to the ASR in terms of F1 score. All results are evaluated on a subset of the Switchboard-I dataset.

(3) https://github.com/kaldi-asr/kaldi/tree/master/egs/aspire

Features   Precision   Recall   F1
mfcc       0.9482      0.9610   0.9534
log mel    0.9495      0.9629   0.9550

Table 4. Performance of the CRNN with different features

Metrics      SR ↑       AR ↑       PTR ↑    MLR ↑   MLP ↓   FPM ↓
Original     165.3571   171.0986   58.865   0.400   0.654   3.659
Processed    186.241    186.241    65.570   0.495   0.365   1.762

Table 2. The fluency metrics, before and after processing the speeches. ↑ means higher is better and ↓ denotes lower is better
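For concreteness, the fluency metrics of Eqs. 5-10 can be computed as in the sketch below, which follows the equations as printed. Several details are assumptions not stated in the paper: times are in seconds (minutes for FPM), speaking time is taken as total time minus all unfilled pauses, and the number of runs is taken as the number of pauses longer than 0.25 s plus one. The function name `fluency_metrics` and its arguments are hypothetical.

```python
def fluency_metrics(n_syllables, total_time_s, unfilled_pauses_s, n_filled_pauses):
    """Compute SR, AR, PTR, MLR, MLP and FPM following Eqs. 5-10 as printed.

    Assumptions (not from the paper): times are in seconds, speaking time
    excludes every unfilled pause, and runs = (# pauses > 0.25 s) + 1.
    """
    ufp_lt3 = sum(p for p in unfilled_pauses_s if p < 3.0)
    pauses_gt02 = [p for p in unfilled_pauses_s if p > 0.2]
    n_runs = sum(1 for p in unfilled_pauses_s if p > 0.25) + 1
    speaking_time = total_time_s - sum(unfilled_pauses_s)
    return {
        "SR": n_syllables / (total_time_s - ufp_lt3) * 60,       # Eq. 5
        "AR": n_syllables / total_time_s * 60,                   # Eq. 6
        "PTR": speaking_time / total_time_s,                     # Eq. 7
        "MLR": n_syllables / n_runs,                             # Eq. 8
        "MLP": (sum(pauses_gt02) / len(pauses_gt02)
                if pauses_gt02 else 0.0),                        # Eq. 9
        "FPM": n_filled_pauses / (total_time_s / 60),            # Eq. 10
    }


# Hypothetical one-minute recording, for illustration only.
metrics = fluency_metrics(
    n_syllables=150,
    total_time_s=60.0,
    unfilled_pauses_s=[0.1, 0.3, 0.5, 1.0, 0.1],
    n_filled_pauses=4,
)
```

The input numbers here are invented for the example and are unrelated to the values reported in Table 2.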

Method   Precision   Recall   F1
ASR      0.9774      0.9792   0.9775
CRNN     0.9495      0.9629   0.9550

Table 5. Performance of filler word segmentation compared to an automatic speech recognizer.

The only drawback that we have observed while comparing our method with the ASR is that our classifier sometimes detects segments that sound similar to 'uh' or 'um'.

4.3. Disfluent Silence Classification

For this task we experimented with SVM, Logistic Regression (LogReg) and XGBoost. The results are summarized in Table 6. We used 10-fold cross validation to report our results.

Method   SVM      LogReg   XGBoost
F1       0.9055   0.9200   0.9207

Table 6. Silence classification performance on TIMIT, SwitchBoard and Automanner

4.4. Disfluency Repair

After processing the speeches by removing the fillers and long silences, the fluent speech is obtained. To compare the fluency of the synthesized and the original speech, the metrics discussed in Section 4.1.3 are used. The results are reported in Table 2. The mean of each metric across all the samples is reported. The numbers show that we improve the fluency on every metric. It is notable that in the processed speech the articulation rate and the speech rate increase to the same quantity, since we take care of all the unfilled pauses in the speech and introduce a more uniform silence production. Apart from the numbers, for qualitative understanding, some processed samples are available on the examples page (https://sagniklp.github.io/pub-speaker-aug/).

5. FUTURE WORK

This work is motivated by the fact that disfluency detection is not only useful for intelligent agents but is also a practical problem definition for helping users produce a better, more confident and fluent talk. Given the range of disfluency types produced in a speech, this work is a small step towards a bigger goal: repairing disfluencies in a speech from the speaker's perspective. Along with addressing the pitfalls of our method, the following could be the future directions of this work -

• Improving the filler word segmentation performance, as well as devising techniques to segment other kinds of common disfluencies (repetition, discourse markers, corrections) and speech impairments (stuttering).

• Devising a dynamic and online repair scheme, by generating necessary (disfluent) portions of speech, instead of replacing.

6. CONCLUSION

Disfluency detection is a well-explored problem in the speech processing community and is performed on speech transcripts, mostly to aid intelligent conversational agents. In this work, we interpret disfluency detection from the speaker's perspective and introduce an additional component of repairing the disfluencies. Consequently, we work solely in the acoustic domain, removing the need for a complex system like an ASR before disfluency detection. With the results of our detection and repair scheme, we show improved fluency in speakers' dialogues, given a less-fluent speech. To the best of our knowledge, this is the first work related to disfluency repair for the sake of users, and it can be further extended to assist users with speech impairments and other general disfluencies.

7. ACKNOWLEDGEMENTS

We are thankful to Faizaan Charania and Mahima Parashar for curating the dataset and working on some essential observations. We would also like to thank the participating speakers for the speeches they provided. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp and P6000 GPU used for this research.

8. REFERENCES

[1] Jean E Fox Tree, “The effects of false starts and repetitions on the processing of subsequent words in spontaneous speech,” Journal of memory and language, vol. 34, no. 6, pp. 709–738, 1995.

[2] Kathryn Womack, Wilson McCoy, Cecilia Ovesdotter Alm, Cara Calvelli, Jeff B Pelz, Pengcheng Shi, and Anne Haake, “Disfluencies as extra-propositional indicators of cognitive processing,” in Proceedings of the workshop on extra-propositional aspects of meaning in computational linguistics. Association for Computational Linguistics, 2012, pp. 1–9.

[3] Martin Corley and Oliver W Stewart, “Hesitation disfluencies in spontaneous speech: The meaning of um,” Language and Linguistics Compass, vol. 2, no. 4, pp. 589–602, 2008.

[4] Dale J Barr and Mandana Seyfeddinipur, “The role of fillers in listener attributions for speaker disfluency,” Language and Cognitive Processes, vol. 25, no. 4, pp. 441–455, 2010.

[5] Elizabeth Shriberg, “Disfluencies in switchboard,” in Proceedings of International Conference on Spoken Language Processing, 1996, vol. 96, pp. 11–14.

[6] Sandra Merlo and Letícia Lessa Mansur, “Descriptive discourse: topic familiarity and disfluencies,” Journal of Communication Disorders, vol. 37, no. 6, pp. 489–503, 2004.

[7] Paul Lennon, “Investigating fluency in efl: A quantitative approach,” Language learning, vol. 40, no. 3, pp. 387–417, 1990.

[8] Emre Cakır, Giambattista Parascandolo, Toni Heittola, Heikki Huttunen, and Tuomas Virtanen, “Convolutional recurrent neural networks for polyphonic sound event detection,” arXiv preprint arXiv:1702.06286, 2017.

[9] Judit Kormos and Mariann Dénes, “Exploring measures and perceptions of fluency in the speech of second language learners,” System, vol. 32, no. 2, pp. 145–164, 2004.

[10] Eugene Charniak and Mark Johnson, “Edit detection and parsing for transcribed speech,” in Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies. Association for Computational Linguistics, 2001, pp. 1–9.

[11] Matthias Honal and Tanja Schultz, “Correction of disfluencies in spontaneous speech using a noisy-channel approach,” in Eighth European Conference on Speech Communication and Technology, 2003.

[12] Mark Johnson and Eugene Charniak, “A tag-based noisy-channel model of speech repairs,” in Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 2004.

[13] Simon Zwarts, Mark Johnson, and Robert Dale, “Detecting speech repairs incrementally using a noisy channel approach,” in Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010, pp. 1371–1378.

[14] Yang Liu, Elizabeth Shriberg, Andreas Stolcke, Dustin Hillard, Mari Ostendorf, and Mary Harper, “Enriching speech recognition with automatic detection of sentence boundaries and disfluencies,” IEEE Transactions on audio, speech, and language processing, vol. 14, no. 5, pp. 1526–1540, 2006.

[15] Kallirroi Georgila, “Using integer linear programming for detecting speech disfluencies,” in Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. Association for Computational Linguistics, 2009, pp. 109–112.

[16] Christine Howes, Matt Purver, Rose McCabe, PG Healey, and Mary Lavelle, “Helping the medicine go down: Repair and adherence in patient-clinician dialogues,” in Proceedings of the 16th SemDial Workshop on the Semantics and Pragmatics of Dialogue (SeineDial), 2012, pp. 19–21.

[17] Matthew Honnibal and Mark Johnson, “Joint incremental disfluency detection and dependency parsing,” Transactions of the Association of Computational Linguistics, vol. 2, no. 1, pp. 131–142, 2014.

[18] Julian Hough and Matthew Purver, “Strongly incremental repair detection,” arXiv preprint arXiv:1408.6788, 2014.

[19] Christine Howes, Julian Hough, Matthew Purver, and Rose McCabe, “Helping, i mean assessing psychiatric communication: An application of incremental self-repair detection,” 2014.

[20] James Ferguson, Greg Durrett, and Dan Klein, “Disfluency detection with a semi-markov model and prosodic features,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 257–262.

[21] Xian Qian and Yang Liu, “Disfluency detection using multi-step stacked learning,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 820–825.

[22] Mohammad Sadegh Rasooli and Joel Tetreault, “Joint parsing and disfluency detection in linear time,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 124–129.

[23] Julian Hough and David Schlangen, “Recurrent neural networks for incremental disfluency detection,” Interspeech 2015, 2015.

[24] Shaolei Wang, Wanxiang Che, and Ting Liu, “A neural attention model for disfluency detection,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 278–287.

[25] Vicky Zayats, Mari Ostendorf, and Hannaneh Hajishirzi, “Disfluency detection using a bidirectional lstm,” arXiv preprint arXiv:1604.03209, 2016.

[26] Julian Hough and David Schlangen, “Joint, incremental disfluency detection and utterance segmentation from speech,” in Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL), 2017.

[27] Shaolei Wang, Wanxiang Che, Yue Zhang, Meishan Zhang, and Ting Liu, “Transition-based disfluency detection using lstms,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2785–2794.

[28] M Iftekhar Tanveer, Ru Zhao, Kezhen Chen, Zoe Tiet, and Mohammed Ehsan Hoque, “Automanner: An automated interface for making public speakers aware of their mannerisms,” in Proceedings of the 21st International Conference on Intelligent User Interfaces. ACM, 2016, pp. 385–396.

[29] Guoguo Chen, Hainan Xu, Minhua Wu, Daniel Povey, and Sanjeev Khudanpur, “Pronunciation and silence probability modeling for asr,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[30] Heidi Riggenbach, “Toward an understanding of fluency: A microanalysis of nonnative speaker conversations,” Discourse processes, vol. 14, no. 4, pp. 423–441, 1991.

[31] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett, “Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1,” NASA STI/Recon technical report n, vol. 93, 1993.

[32] Tara N Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4580–4584.

[33] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.

[34] Zafar Rafii and Bryan Pardo, “Music/voice separation using the similarity matrix.,” in ISMIR, 2012, pp. 583–588.

[35] John J Godfrey, Edward C Holliman, and Jane McDaniel, “Switchboard: Telephone speech corpus for research and development,” in Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., 1992 IEEE International Conference on. IEEE, 1992, vol. 1, pp. 517–520.