
Robust Prediction of Punctuation and Truecasing for Medical ASR

Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, Katrin Kirchhoff
Amazon AWS AI, USA
{sunkaral, ronanks}@amazon.com

Abstract

Automatic speech recognition (ASR) systems in the medical domain that focus on transcribing clinical dictations and doctor-patient conversations often pose many challenges due to the complexity of the domain. ASR output typically undergoes automatic punctuation to enable users to speak naturally, without having to vocalise awkward and explicit punctuation commands, such as "period", "add comma" or "exclamation point", while truecasing enhances user readability and improves the performance of downstream NLP tasks. This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing using pretrained masked language models such as BERT, BioBERT and RoBERTa. We also present techniques for domain and task specific adaptation by fine-tuning masked language models with medical domain data. Finally, we improve the robustness of the model against common errors made in ASR by performing data augmentation. Experiments performed on dictation and conversational style corpora show that our proposed model achieves ∼5% absolute improvement on ground truth text and ∼10% improvement on ASR outputs over baseline models under the F1 metric.

1 Introduction

Medical ASR systems automatically transcribe medical speech found in a variety of use cases like physician-dictated notes (Edwards et al., 2017), telemedicine and even doctor-patient conversations (Chiu et al., 2017), without any human intervention. These systems ease the burden of long hours of administrative work and also promote better engagement with patients. However, the generated ASR outputs are typically devoid of punctuation and truecasing, thereby making them difficult to comprehend. Furthermore, their recovery improves the accuracy of subsequent natural language understanding algorithms (Peitz et al., 2011a; Makhoul et al., 2005) to identify information such as patient diagnosis, treatments, dosages, symptoms and signs. Typically, clinicians explicitly dictate punctuation commands like "period", "add comma" etc., and a postprocessing component takes care of punctuation restoration. This process is usually error-prone as the clinicians may struggle with appropriate punctuation insertion during dictation. Moreover, doctor-patient conversations lack explicit vocalization of punctuation marks, motivating the need for automatic prediction of punctuation and truecasing. In this work, we aim to solve the problem of automatic punctuation and truecasing restoration for medical ASR system text outputs.

Most recent approaches to the punctuation and truecasing restoration problem rely on neural models (Nguyen et al., 2019a; Salloum et al., 2017). Although it is a well explored problem in the literature, most of these improvements do not directly translate to great real world performance in all settings. For example, unlike general text, it is a harder problem to solve when applied to the medical domain for various reasons, which we illustrate below:

• Large vocabulary: ASR systems in the medical domain have a large set of domain-specific vocabulary. Owing to the domain specific data set and the open vocabulary in LVCSR (large-vocabulary continuous speech recognition) outputs, we often run into OOV (out of vocabulary) or rare word problems. Furthermore, a large vocabulary set leads to data sparsity issues. We address both these problems by using subword models. Subwords have been shown to work well in open-vocabulary speech recognition and several NLP tasks (Sennrich et al., 2015; Bodapati et al., 2019). We compare word and subword models across different architectures and show that subword models consistently outperform the former.

• Data scarcity: Data scarcity is one of the major bottlenecks. When it comes to the medical domain, obtaining data is not as straightforward as in some of the other domains where an abundance of text is available. On the other hand, obtaining large amounts of data is a tedious and costly process; procuring and maintaining it could be a challenge owing to strict privacy laws. We overcome the data scarcity problem by using pretrained masked language models like BERT (Devlin et al., 2018) and its successors (Liu et al., 2019; Yang et al., 2019), which have been shown to produce state-of-the-art results when finetuned for several downstream tasks such as question answering and language inference. We approach the prediction task as a sequence labeling problem and jointly learn punctuation and truecasing. We show that finetuning a pretrained model with a very small medical dataset (∼500k words) yields a ∼5% absolute performance improvement in terms of F1 compared to a model trained from scratch. We further boost the performance by first finetuning the masked language model to the medical speech domain and then to the downstream task.

• ASR Robustness: Models trained on ground truth data are not exposed to typical errors in speech recognition and perform poorly when evaluated on ASR outputs. Our objective is to make punctuation prediction and truecasing more robust to speech recognition errors and to establish a mechanism to test the performance of the model quantitatively. To address this issue, we propose a data augmentation based approach using n-best lists from ASR.

The contributions of this work are:

• A general post-processing framework for conditional joint labeling of punctuation and truecasing for medical ASR (clinical dictation and conversations).

• An analysis comparing different embeddings that are suitable for the medical domain, and an in-depth analysis of the effectiveness of using pretrained masked language models like BERT and its successors to address the data scarcity problem.

• Techniques for effective domain and task adaptation using Masked Language Model (MLM) finetuning of BERT on medical domain data to boost downstream task performance.

• A method for enhancing robustness of the models via data augmentation, adding n-best lists (from ASR output) to the ground truth during training, to improve performance on ASR hypotheses at inference time.

The rest of this paper is organized as follows. Section 2 presents related work on punctuation and truecasing restoration. Section 3 introduces the model architecture used in this paper and describes various techniques for improving accuracy and robustness. The experimental evaluation and results are discussed in Section 4 and finally, Section 5 presents the conclusions.

2 Related work

Several researchers have proposed a number of methodologies, such as the use of probabilistic machine learning models, neural network models, and acoustic fusion approaches, for punctuation prediction. We review related work in these areas below.

2.1 Earlier methods

In earlier efforts, punctuation prediction was approached by using finite state or hidden Markov models (Gotoh and Renals, 2000; Christensen et al., 2001a). Several other approaches addressed it as a language modeling problem by predicting the most probable sequence of words with punctuation marks inserted (Stolcke et al., 1998; Beeferman et al., 1998; Gravano et al., 2009). Some others used conditional random fields (CRFs) (Lu and Ng, 2010; Ueffing et al., 2013) and maximum entropy using n-grams (Huang and Zweig, 2002). The rise of stronger techniques such as deep and/or recurrent neural networks replaced these conventional models.

2.2 Using acoustic information

Some methods used only acoustic information such as speech rate, intonation, pause duration etc. (Christensen et al., 2001b; Levy et al., 2012).

While pauses influence the prediction of Comma, intonation helps in disambiguation between punctuation marks like period and exclamation. Although this seemed to work, the most effective approach is to combine acoustic information with lexical information at word level using force-aligned duration (Klejch et al., 2017). In this work, we only considered lexical input and a pretrained lexical encoder for prediction of punctuation and truecasing. The use of a pretrained acoustic encoder and fusion with lexical outputs are possible extensions in future work.

2.3 Neural approaches

Neural approaches for punctuation and truecasing can be classified into two broad categories: sequence labeling based models and MT-based seq2seq models. These approaches have proven to be quite effective in capturing contextual information and have achieved huge success. While some approaches considered only punctuation prediction, some others jointly modeled punctuation and truecasing.

One set of approaches treated punctuation as a machine translation problem and used phrase-based statistical machine translation systems to output punctuated and truecased text (Peitz et al., 2011b; Cho et al., 2012; Driesen et al., 2014). Inspired by recent end-to-end approaches, (Yi and Tao, 2019) proposed the use of a self-attention based transformer model to predict punctuation marks as an output sequence for given word sequences. Most recently, (Nguyen et al., 2019b) proposed joint modeling of punctuation and truecasing by generating words with punctuation marks as part of the decoding. Although seq2seq based approaches have shown strong performance, they are computationally intensive and demanding, and are not suitable for production deployment at large scale.

For the sequence labeling problem, each word in the input is tagged with a punctuation. If there is no punctuation associated with a word, a blank label is used, often referred to as "no punc". (Cho et al., 2015) used a combination of neural networks and CRFs for joint prediction of punctuation and disfluencies. With the growing popularity of deep recurrent neural networks, LSTMs and BLSTMs with attention mechanism were introduced for punctuation restoration (Tilk and Alumäe, 2015, 2016). Later, (Pahuja et al., 2017) proposed joint training of punctuation and truecasing using BLSTM models. This work addressed joint learning as two correlated tasks, and predicted punctuation and truecasing as two independent outputs. Our proposed approach is similar to this work, but we rather condition the truecasing prediction on the punctuation output; this is discussed in detail in Section 3.

Punctuation and casing restoration for speech/ASR outputs in the medical domain has not been explored extensively. Recently, (Salloum et al., 2017) proposed a sequence labeling model using bi-directional RNNs with an attention mechanism and late fusion for punctuation restoration on clinical dictation. To our knowledge, there has not been any work on medical conversations, and we aim to bridge the gap here with the latest advances in NLP with large-scale pretrained language models.

3 Modeling: Conditional Joint labeling of Punctuation + Casing

We propose a postprocessing framework for conditional and joint learning of punctuation and truecasing prediction. Consider an input utterance x_{1:T} = {x_1, x_2, ..., x_T}, of length T and consisting of words x_i. The first step in our modeling process involves punctuation prediction as a sequence tagging task. Once the model predicts a probability distribution over punctuation, this along with the input utterance is fed in as input for predicting the case of a word x_i. We consider the punctuation to be independent of casing and a conditional dependence of the truecase of a word on punctuation given the learned input representations. Our plausible reasoning follows from this example sentence – "She took dance classes. She had no natural grace or sense of rhythm.". The word after the period is capitalized, which implies that punctuation information can help in better prediction of casing. A pair of punctuation and truecasing labels is assigned per word:

Pr(p_{1:T}, c_{1:T} | x_{1:T}) = Pr(p_{1:T} | x_{1:T}) Pr(c_{1:T} | p_{1:T}, x_{1:T})    (1)

where c_i ∈ C, a fixed set of casing labels {Lower Case, Upper Case, All Caps, Mixed Case}, and p_i ∈ P, a fixed set of punctuation labels {Comma, Period, Question Mark, No Punct}.

[Figure 1: Pre-trained BERT encoder for prediction of punctuation and truecasing. The input words are split by a subword tokenizer into "[CLS] he has not had p ##nd or ##th ##op ##nea or ed ##ema [SEP]", encoded by BERT, and passed to two output layers, Linear + Softmax (Punc) and Linear + Softmax (Case), producing "He has not had PND, Orthopnea, or Edema."]
3.1 Pretrained lexical encoder

We propose to use a pretrained model like BERT, trained on a large corpus, as a lexical encoder for learning an effective representation of the input utterance. Figure 1 illustrates our proposed model architecture.

Subword embeddings: Given a sequence of input vectors (x_1, x_2, ..., x_T), where x_i represents a word w_i, we extract the subword embeddings (s_1, s_2, ..., s_n) using a wordpiece tokenizer (Schuster and Nakajima, 2012). Using subwords is especially effective in the medical domain, as it contains more words with common subwords. For example, consider the six words {hypotension, hypertension, hypoactive, hyperactive, active, tension} with four common subwords {hyper, hypo, active, tension}. In Section 4.2, we provide a comparative analysis of word and subword models across different architectures on medical data.

BERT encoder: We provide the subword embeddings (s_1, s_2, ..., s_n) as input to the BERT encoder, which outputs a sequence of hidden states H = (h_1, ..., h_n) at its final layer. The pretrained BERT base encoder consists of 12 transformer encoder self-attention layers. For this task, we truncate the BERT encoder and fine-tune only the first six layers to reduce the model complexity. Although a deep encoder might enable us to learn a long-memory, context dependent representation of the input utterance, the performance gain is very minimal compared to the increased latency (we experimentally found that the 12-layer BERT base model gives ∼1% improvement over the 6-layer BERT base model, whereas the inference and training times were double for the former).

For punctuation, we input the last layer representations of the truncated BERT encoder h_1, h_2, ..., h_n to a linear layer with softmax activation to classify over the punctuation labels, generating (p_1, p_2, ..., p_n) as outputs. For casing, we concatenate the softmax probabilities of the punctuation output with the BERT encoder's outputs and feed them to a linear layer with softmax activation, generating case labels (c_1, c_2, ..., c_n) for the sequence. The softmax output for punctuation (p̂_i) and truecasing (ĉ_i) is as follows:

p̂_i = softmax(W^k h_i + b^k)    (2)

ĉ_i = softmax(W^l (p̂_i ⊕ h_i) + b^l)    (3)

where W^k, b^k denote the weights and bias of the punctuation linear output layer, and W^l, b^l denote the weights and bias of the truecasing linear output layer.

Joint learning objective: We model our learning objective to maximize the joint probability Pr(p_{1:T}, c_{1:T} | x_{1:T}). The model is finetuned end-to-end to minimize the cross-entropy loss between the assigned distribution and the training data. The parameters of the BERT encoder are shared across the punctuation and casing prediction tasks and are jointly trained. We compute the losses (L^p, L^c) for each task using cross entropy loss. The final loss L to be optimized is a weighted average of the task-specific losses:

L = α L^p + L^c    (4)

where α is a fixed weight optimized for best predictions across both tasks. In our experiments, we explored α values in the range (0.2–2) and found 0.6 to be the optimal value.
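The paper does not prescribe an implementation, but the architecture and loss above can be sketched roughly as follows, assuming PyTorch and the HuggingFace transformers BertModel. The 6-layer truncation and α = 0.6 follow the text; the class name, variable names and base checkpoint are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class PunctCaseTagger(nn.Module):
    """Rough sketch of the conditional joint model: a punctuation head (Eq. 2) and a
    casing head that sees the punctuation softmax concatenated with the hidden
    states (Eq. 3), trained with the weighted joint loss (Eq. 4)."""

    def __init__(self, n_punct=4, n_case=4, alpha=0.6):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # assumed checkpoint
        # Keep only the first 6 of the 12 encoder layers, as described in the text.
        self.bert.encoder.layer = self.bert.encoder.layer[:6]
        hidden = self.bert.config.hidden_size
        self.punct_head = nn.Linear(hidden, n_punct)          # Eq. (2)
        self.case_head = nn.Linear(hidden + n_punct, n_case)  # Eq. (3)
        self.alpha = alpha                                     # weight in Eq. (4)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, punct_labels=None, case_labels=None):
        h = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        punct_logits = self.punct_head(h)
        punct_probs = punct_logits.softmax(dim=-1)
        case_logits = self.case_head(torch.cat([punct_probs, h], dim=-1))
        loss = None
        if punct_labels is not None and case_labels is not None:
            loss_p = self.ce(punct_logits.view(-1, punct_logits.size(-1)), punct_labels.view(-1))
            loss_c = self.ce(case_logits.view(-1, case_logits.size(-1)), case_labels.view(-1))
            loss = self.alpha * loss_p + loss_c                # L = alpha*L^p + L^c
        return punct_logits, case_logits, loss
```

Feeding the punctuation softmax into the casing head is what makes the truecasing prediction conditional on punctuation, while the shared encoder keeps the two tasks jointly trained.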

3.2 Finetuning using Masked Language Model with Medical domain data

BERT and its successors have shown great performance on downstream NLP tasks. But just like any other model, these language models are biased by their training data. In particular, they are typically trained on data that is easily available in large quantities on the internet, e.g. Wikipedia, CommonCrawl etc. Our domain, medical ASR text, is not "common" and is very under-represented in the training data for these language models. One way to correct this situation is to perform a few steps of unsupervised Masked Language Model finetuning on the BERT models before performing cross-entropy training using the labeled task data (Han and Eisenstein, 2019).

Domain adaptation: We finetune the pretrained BERT model with the MLM (Masked LM) objective on medical domain data. 15% of input tokens are masked randomly before feeding into the BERT model, as proposed by (Devlin et al., 2018). The main goal is to adapt and learn better representations of speech data. The domain adapted model can be further finetuned with an additional layer to a downstream task like punctuation and casing prediction.

Domain+Task adaptation: Building on the previous technique, we attempt to finetune the pretrained model for task adaptation in combination with domain adaptation. In this technique, instead of randomly masking 15% of the input tokens, we do selective masking, i.e. 50% of the masked tokens would be random and the other 50% would be punctuation marks ([".", ",", "?"] in our case). Therefore, the finetuned model would not only adapt to the speech domain, but would also effectively learn the placement of punctuation marks in a text based on the context (a short illustrative sketch of this masking scheme is given after Section 4.1 below).

3.3 Robustness to ASR errors

Models trained on ground truth text inputs may not perform well when tested with ASR output, especially when the system introduces grammatical errors. To make models more robust against ASR errors, we perform data augmentation with ASR outputs for training. For punctuation restoration, we align the ASR hypothesis with the ground truth punctuated text. Before computing the alignment, we strip all punctuation from the ground truth and lowercase the text. This helps us find the best alignment between the ASR hypothesis and the ground truth text. Once the alignment is found, we restore the punctuation from each word in the ground truth text to the hypothesis. If there are words that are punctuated in the ground truth but got deleted in the ASR hypothesis, we restore the punctuation to the previous word. For truecasing, we try to match the reference word with a hypothesis word from the aligned sequences within a window size of 5 (two words to the left and two words to the right of the current word) and restore truecasing only in the cases where the reference word is found (a sketch of this restoration step is also given after Section 4.1 below). We performed experiments with data augmentation using the 1-best hypothesis and n-best lists as additional training data and the results are reported in Section 4.4.

4 Experiments and results

4.1 Data

We evaluate our proposed framework and models on a subset of two internal medical datasets: dictation and conversational. The dictation corpus contains 3.7M words and the conversational corpus contains 51M words. The medical data comes with special tags masking personally identifiable and patient health information. We also use a general domain Wikipedia dataset for comparative analysis with the medical domain data. This data is a subset of the publicly available release of the Wiki dataset (Sproat and Jaitly, 2016). The corpus contains 35M words and relatively shorter sentences, ranging from 8 to 200 words in length. 90% of the data from each corpus is used for training, 5% for fine-tuning and the remaining 5% is held out for testing.

For the robustness experiments presented in Section 4.4, we used data from the dictation corpus consisting of 2265 text files and corresponding audio files with an average duration of ∼15 minutes. The total length of the corpus is 550 hours. For augmentation with ground-truth transcription, we transcribed the audio files using a speech recognition system. Restoration of punctuation and truecasing to transcribed text can be erroneous as the word error rate (WER) goes up. We therefore discarded the transcribed text of those audio files whose WER is more than 25%. We sorted the remaining transcriptions based on WER to make further splits: hypothesis from the top 50 files with best WER is set as test data, the next 50 files were chosen as development, and the rest of the transcribed text was used for training. The partition was done this way to minimize the number of errors that may occur during restoration.
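Referring back to Section 3.2, a minimal sketch of the selective masking used for domain+task adaptation follows. Only the 15% masking rate, the 50/50 random-vs-punctuation split and the punctuation set [".", ",", "?"] come from the text; the function and example are our own illustration.

```python
import random

PUNCT_TOKENS = {".", ",", "?"}

def select_mask_positions(tokens, mask_prob=0.15, punct_share=0.5, seed=0):
    """Return indices to replace with [MASK] for domain+task MLM finetuning:
    roughly half of the masked positions are punctuation marks, the rest random."""
    rng = random.Random(seed)
    n_mask = max(1, int(round(mask_prob * len(tokens))))
    n_punct = int(round(punct_share * n_mask))

    punct_pos = [i for i, t in enumerate(tokens) if t in PUNCT_TOKENS]
    other_pos = [i for i, t in enumerate(tokens) if t not in PUNCT_TOKENS]

    chosen = set(rng.sample(punct_pos, min(n_punct, len(punct_pos))))
    chosen |= set(rng.sample(other_pos, min(n_mask - len(chosen), len(other_pos))))
    return sorted(chosen)

# Hypothetical example sentence, lowercased but with punctuation kept, as used
# for the punctuation-aware masking objective.
tokens = "she was advised to take aspirin daily . follow up in two weeks , as needed .".split()
masked = list(tokens)
for i in select_mask_positions(tokens):
    masked[i] = "[MASK]"
print(" ".join(masked))
```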

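Similarly, for the restoration step of Section 3.3, the following is a simplified sketch of projecting punctuation labels from the ground truth onto an ASR hypothesis. The fallback to the previous word for deleted reference words follows the text; the use of difflib is only a stand-in for the alignment measure, and truecasing would be restored analogously using the ±2-word matching window described above.

```python
import difflib

def restore_punct(ref_words, ref_punct, hyp_words):
    """Project punctuation labels from the (lowercased, punctuation-stripped)
    reference words onto the ASR hypothesis after alignment."""
    hyp_punct = ["No Punct"] * len(hyp_words)
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    prev = -1  # index of the most recent hypothesis word seen so far
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            n = min(i2 - i1, j2 - j1)
            for k in range(n):                       # word-to-word alignment
                hyp_punct[j1 + k] = ref_punct[i1 + k]
            for i in range(i1 + n, i2):              # leftover reference words
                if ref_punct[i] != "No Punct" and j1 + n - 1 >= 0:
                    hyp_punct[j1 + n - 1] = ref_punct[i]
        elif tag == "delete":                        # reference words missing in ASR output:
            for i in range(i1, i2):                  # push their punctuation to the previous word
                if ref_punct[i] != "No Punct" and prev >= 0:
                    hyp_punct[prev] = ref_punct[i]
        if j2 > j1:
            prev = j2 - 1
    return hyp_punct

# Hypothetical reference/hypothesis pair for illustration.
ref = "patient denies chest pain shortness of breath or palpitations".split()
ref_p = ["No Punct", "No Punct", "No Punct", "Comma", "No Punct",
         "No Punct", "Comma", "No Punct", "Period"]
hyp = "patient denies chest pain shortness of breath of palpitation".split()
print(list(zip(hyp, restore_punct(ref, ref_p, hyp))))
```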
                                  Punctuation                     Truecasing
Model                 Token       No Punc   Full stop   Comma     LC     UC     CA     MC
CNN-Highway           word        0.97      0.81        0.71      0.98   0.84   0.95   0.99
                      subword     0.98      0.83        0.70      0.99   0.87   0.95   0.99
3-LSTM                word        0.97      0.82        0.73      0.98   0.84   0.96   0.98
                      subword     0.98      0.84        0.75      0.99   0.87   0.97   0.99
3-BLSTM               word        0.98      0.86        0.75      0.99   0.88   0.97   0.98
                      subword     0.99      0.87        0.76      0.99   0.90   0.97   1.0
Transformer encoder   word        0.97      0.84        0.7       0.98   0.86   0.97   0.98
                      subword     0.98      0.85        0.72      0.99   0.87   0.97   0.99

Table 1: Dictation corpus: Comparison of F1 scores for punctuation and truecasing across different model architectures using word and subword tokens (LC: lower case; UC: Upper case; CA: CAPS All; MC: Mixed Case).

                                  Punctuation                            Truecasing
Model                 Token       No Punc   Full stop   Comma   QM      LC     UC     CA     MC
CNN-Highway           word        0.96      0.72        0.64    0.60    0.96   0.78   0.99   0.91
                      subword     0.97      0.74        0.65    0.61    0.97   0.80   0.98   0.99
3-LSTM                word        0.96      0.74        0.64    0.65    0.96   0.79   0.99   0.95
                      subword     0.97      0.75        0.65    0.66    0.97   0.79   0.97   1.0
3-BLSTM               word        0.97      0.77        0.68    0.68    0.97   0.82   0.99   0.95
                      subword     0.98      0.79        0.68    0.69    0.97   0.83   0.99   1.0
Transformer encoder   word        0.97      0.77        0.68    0.68    0.97   0.83   0.99   0.92
                      subword     0.98      0.79        0.69    0.69    0.98   0.83   0.99   1.0

Table 2: Conversational corpus: Comparison of F1 scores for punctuation and truecasing across different model architectures using word and subword tokens (QM: Question Mark; LC: lower case; UC: Upper case; CA: CAPS All; MC: Mixed Case).

Preprocessing long-speech transcriptions: Conversational style speech has long-speech transcripts, in which the context is spread across multiple segments. We use an overlapped chunking and merging component to pre- and post-process the data. We use a sliding window approach (Nguyen et al., 2019a) to split long ASR outputs into chunks of 200 words each with an overlapping window of 50 words each to the left and right. The overlap helps in preserving the context for all the words after splitting and ensures accurate prediction of punctuation and case corresponding to each word (a short sketch of this chunk-and-merge step is given at the end of this subsection).

4.2 Large Vocabulary: Word vs Subword models

For a fair comparison with BERT, we evaluate various recurrent and non-recurrent architectures with both word and subword embeddings. The two recurrent models include a 3 layer uni-directional LSTM (3-LSTM) and a 3 layer Bi-directional LSTM (3-BLSTM). One of the non-recurrent encoders implements a CNN-Highway architecture based on the work proposed by (Kim et al., 2016), whereas the other one implements a transformer encoder based model (Vaswani et al., 2017). We train all four models on medical data from the dictation and conversation corpus with weights initialized randomly. The vocabulary for word models is derived by considering all the unique words from the training corpus, with additional tokens for unknown and padding. This yielded a vocabulary size of 30k for dictation and 64k for the conversational corpus. Subwords are extracted using a wordpiece model (Schuster and Nakajima, 2012) and its inventory is less than half that of the word model for conversation. Tables 1 and 2 summarize our results on the dictation and conversation datasets respectively. We observe that subword models consistently performed the same as or better than word models. On the punctuation task, for Full stop and Comma, we notice an absolute ∼1-2% improvement respectively on the dictation set. Similarly, on the conversation dataset, we notice an absolute ∼1-2% improvement on Full stop, Comma and Question Mark. For the casing task, we notice that word and subword models performed equally well, except on the dictation dataset where we see an absolute ∼3% improvement for Upper Case.
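A minimal sketch of the overlapped chunking and merging component described in the preprocessing paragraph above (our own illustration; only the 200-word chunks and 50-word overlaps come from the text):

```python
def chunk(words, size=200, overlap=50):
    """Split a long transcript into overlapping chunks; each chunk records the
    index range whose predictions will be kept when merging."""
    chunks, start = [], 0
    while start < len(words):
        left = max(0, start - overlap)
        right = min(len(words), start + size + overlap)
        chunks.append({"tokens": words[left:right],
                       "keep": (start - left, start - left + min(size, len(words) - start))})
        start += size
    return chunks

def merge(chunks, predictions):
    """Stitch per-chunk label predictions back into one sequence, keeping only
    the non-overlapping core of each chunk."""
    merged = []
    for ch, preds in zip(chunks, predictions):
        lo, hi = ch["keep"]
        merged.extend(preds[lo:hi])
    return merged

# Toy usage: 450 dummy words, dummy per-token labels for each chunk.
words = [f"w{i}" for i in range(450)]
chunks = chunk(words)
preds = [["No Punct"] * len(ch["tokens"]) for ch in chunks]
assert len(merge(chunks, preds)) == len(words)
```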

                           Punctuation                     Truecasing
Model      Dataset    No Punc   Full stop   Comma     LC     UC     CA     MC
3-BLSTM    Wiki       0.95      0.17        0.27      0.95   0.31   0.55   0.19
BERT       Wiki       0.96      0.2         0.39      0.95   0.36   0.65   0.2
3-BLSTM    Medical    0.99      0.87        0.76      0.99   0.9    0.97   1.0
BERT       Medical    0.99      0.9         0.81      0.99   0.93   0.99   1.0
FT-BERT    Medical    0.99      0.92        0.82      0.99   0.93   0.99   1.0
PM-BERT    Medical    0.99      0.93        0.82      0.99   0.94   0.99   1.0
Bio-BERT   Medical    0.99      0.92        0.82      0.99   0.93   0.99   1.0
RoBERTa    Medical    0.99      0.92        0.81      0.99   0.94   0.99   1.0

Table 3: Comparison of F1 scores for punctuation and truecasing using BERT and BLSTM when trained on Wiki data and Medical dictation data (FT-BERT: Finetuned BERT for domain adaptation, PM-BERT: Finetuned BERT by punctuation masking for domain and task adaptation).

                           Punctuation                            Truecasing
Model      Dataset    No Punc   Full stop   Comma   QM      LC     UC     CA     MC
3-BLSTM    Wiki       0.89      0.001       0.25    0.002   0.93   0.13   0.9    0.95
BERT       Wiki       0.93      0.004       0.4     0.007   0.93   0.4    0.95   0.95
3-BLSTM    Medical    0.98      0.79        0.68    0.69    0.97   0.83   0.99   1.0
BERT       Medical    0.98      0.8         0.71    0.72    0.98   0.85   0.99   1.0
FT-BERT    Medical    0.98      0.81        0.72    0.73    0.98   0.85   0.99   1.0
PM-BERT    Medical    0.98      0.82        0.72    0.74    0.98   0.86   0.99   1.0
Bio-BERT   Medical    0.98      0.81        0.71    0.72    0.98   0.85   0.99   1.0
RoBERTa    Medical    0.98      0.82        0.73    0.74    0.98   0.86   0.99   1.0

Table 4: Comparison of F1 scores for punctuation and truecasing using BERT and BLSTM when trained on Wiki data and Medical conversation data (FT-BERT: Finetuned BERT for domain adaptation, PM-BERT: Finetuned BERT by punctuation masking for domain and task adaptation).

                            Punctuation                            Truecasing
Model      n-best     No Punc   Full stop   Comma   QM      LC     UC     CA     MC
BERT-GT    -          0.97      0.58        0.45    0.0     0.98   0.60   0.78   0.90
BERT-ASR   1-best     0.97      0.66        0.56    0.54    0.99   0.72   0.86   1.0
           3-best     0.98      0.67        0.57    0.42    0.98   0.69   0.79   0.84
           5-best     0.97      0.61        0.5     0.35    0.98   0.65   0.79   0.83

Table 5: Comparison of F1 scores for punctuation and truecasing with ground truth and ASR augmented data.

We hypothesize that the medical vocabulary contains a large set of compound words, over which a subword based model works more effectively than a word model. Upon examining a few utterances, we noticed that subword models can learn effective representations of these compound medical words by tokenizing them into subwords. On the other hand, word models often run into rare word or OOV issues.

4.3 Pretrained language models

Significance of in-domain data: For analyzing the importance of in-domain data, we train a baseline BLSTM model and a pretrained BERT model on Wiki and Medical data from both the dictation and conversational corpus and test the models on Medical held-out data. The first four rows of Tables 3 and 4 summarize the results. The models trained on Wiki data performed very poorly when compared to models trained on Medical data from either the dictation or conversation corpus. Although the dictation corpus (3.7M words) is relatively smaller than the Wiki corpus (35M words), the difference in accuracy is significantly higher across both models. Imbalanced classes like Full stop, Comma and Question Mark were most affected. Another interesting observation is that the models trained on Medical data performed better on Full stop compared to Comma, whereas general domain models performed better on Comma compared to Full stop. The degradation in general models might be due to Wiki sentences being short and ending with a Full stop, unlike lengthy medical transcripts. Also, the F1 scores are lower on conversation data across both tasks, indicating the complexity involved in modeling conversational data due to its highly unstructured format. Overall, the pretrained BERT model consistently outperformed the baseline BLSTM model on both dictation and conversation data. This motivated us to focus on adapting the pretrained models for this task.

Finetuning Masked LM: We have run two levels of fine-tuning as explained in Section 3.2. First, we finetuned BERT with Medical domain data using random masking (FT-BERT) and, for task adaptation, we performed fine-tuning with punctuation based masking (PM-BERT).

[Figure 2: Difference in F1 scores between Bio-BERT and BLSTM for varying data sizes. Line plot of the F1 difference (y-axis "F1 Difference b/w Bio-BERT & BLSTM", 0.00–0.18) against corpus size (x-axis: 40k, 20k, 10k, 5k, 1k) for Dict-Comma, Dict-Fullstop, Dict-Uppercase, Conv-Comma, Conv-Fullstop and Conv-Uppercase.]

For both experiments, we used the same data as we have used for finetuning the downstream task. From the results presented in Tables 3 and 4, we infer that finetuning boosts the performance of punctuation and truecasing (an absolute improvement of ∼1-2%). From both datasets, it is clear that task specific masking helps more than simple random masking. For the dictation dataset, Full stop improved by an absolute 3% by performing punctuation specific masking, suggesting that finetuning the MLM can give higher benefits when the amount of data is low.

Variants of BERT: We compare three pretrained models, namely BERT, its successor RoBERTa (Liu et al., 2019) and Bio-BERT (Lee et al., 2020), which was trained on large scale biomedical corpora. The results are summarized in the last two rows of Tables 3 and 4. First, we observe that both Bio-BERT and RoBERTa outperformed the initial BERT model and have shown an absolute ∼3-5% improvement over the baseline 3-BLSTM. To further validate this, we extended our experiments to understand how the performance of our best model (Bio-BERT) varies across different training dataset sizes compared to the baseline. From Figure 2, we observe that the difference increases significantly as we move towards smaller datasets. For the smallest data set size of 500k words (1k transcripts), there is an absolute improvement of 6-17% in F1 over the baseline. This shows that pretraining on a large dataset helps to overcome the data scarcity issue effectively.

4.4 Robustness

For testing robustness, we performed experiments with augmentation of ASR data from n-best lists (BERT-ASR). We considered top-1, top-3 and top-5 hypotheses for n-best list augmentation with ground truth text and the results are presented in Table 5. Additionally, the best BERT model trained using only ground truth text inputs (BERT-GT) from Table 3 is also evaluated on ASR outputs. To compute F1 scores on the held-out test set, we first aligned the ASR hypothesis with the ground truth data and restored the punctuation and truecasing as described in Section 3.3. From the results presented in Table 5, we infer that adding ASR hypotheses to the training data helped improve the performance of both punctuation and truecasing. In punctuation, both Full stop and Comma have seen an absolute 10% improvement in F1 score. Although the number of question marks is small in the test data, the augmented systems performed really well compared to the system trained purely on ground truth text. However, we found that using n-best lists with n > 1 did not help much compared to the 1-best list. This may be due to sub-optimal restoration of punctuation and truecasing, as the WER with n-best lists is likely to go up as n increases.

5 Conclusion

In this paper, we have presented a framework for conditional joint modeling of punctuation and truecasing in medical transcriptions using pretrained language models such as BERT. We also demonstrated the benefit of MLM objective finetuning of the pretrained model with task specific masking. We further improved the robustness of punctuation and truecasing on ASR outputs by data augmentation during training. Experiments performed on both dictation and conversation corpora show the effectiveness of the proposed approach. Future work includes the use of either pretrained acoustic features or a pretrained acoustic encoder to perform fusion with the pretrained linguistic encoder to further boost the performance of punctuation.

References

Doug Beeferman, Adam Berger, and John Lafferty. 1998. Cyberpunc: A lightweight punctuation annotation system for speech. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'98), volume 2, pages 689–692. IEEE.

Sravan Bodapati, Spandana Gella, Kasturi Bhattacharjee, and Yaser Al-Onaizan. 2019. Neural word decomposition models for abusive language detection. In Proceedings of the Third Workshop on Abusive Language Online, pages 135–145, Florence, Italy. Association for Computational Linguistics.

Chung-Cheng Chiu, Anshuman Tripathi, Katherine Chou, Chris Co, Navdeep Jaitly, Diana Jaunzeikare, Anjuli Kannan, Patrick Nguyen, Hasim Sak, Ananth Sankar, et al. 2017. Speech recognition for medical conversations. arXiv preprint arXiv:1711.07274.

Eunah Cho, Kevin Kilgour, Jan Niehues, and Alex Waibel. 2015. Combination of nn and crf models for joint detection of punctuation and disfluencies. In Sixteenth Annual Conference of the International Speech Communication Association.

Eunah Cho, Jan Niehues, and Alex Waibel. 2012. Segmentation and punctuation prediction in speech language translation using a monolingual translation system. In International Workshop on Spoken Language Translation (IWSLT) 2012.

Heidi Christensen, Yoshihiko Gotoh, and Steve Renals. 2001a. Punctuation annotation using statistical prosody models. In ISCA Tutorial and Research Workshop (ITRW) on Prosody in Speech Recognition and Understanding.

Heidi Christensen, Yoshihiko Gotoh, and Steve Renals. 2001b. Punctuation annotation using statistical prosody models. In ISCA Tutorial and Research Workshop (ITRW) on Prosody in Speech Recognition and Understanding.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Joris Driesen, Alexandra Birch, Simon Grimsey, Saeid Safarfashandi, Juliet Gauthier, Matt Simpson, and Steve Renals. 2014. Automated production of true-cased punctuated subtitles for weather and news broadcasts. In Fifteenth Annual Conference of the International Speech Communication Association.

Erik Edwards, Wael Salloum, Greg P Finley, James Fone, Greg Cardiff, Mark Miller, and David Suendermann-Oeft. 2017. Medical speech recognition: reaching parity with humans. In International Conference on Speech and Computer, pages 512–524. Springer.

Yoshihiko Gotoh and Steve Renals. 2000. Sentence boundary detection in broadcast speech transcripts.

Agustin Gravano, Martin Jansche, and Michiel Bacchiani. 2009. Restoring punctuation and capitalization in transcribed speech. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4741–4744. IEEE.

Xiaochuang Han and Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4229–4239.

Jing Huang and Geoffrey Zweig. 2002. Maximum entropy model for punctuation annotation from speech. In Seventh International Conference on Spoken Language Processing.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In AAAI, pages 2741–2749.

Ondřej Klejch, Peter Bell, and Steve Renals. 2017. Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5700–5704. IEEE.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Tal Levy, Vered Silber-Varod, and Ami Moyal. 2012. The effect of pitch, intensity and pause duration in punctuation detection. In 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, pages 1–4. IEEE.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Wei Lu and Hwee Tou Ng. 2010. Better punctuation prediction with dynamic conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 177–186.

John Makhoul, Alex Baron, Ivan Bulyko, Long Nguyen, Lance Ramshaw, David Stallard, Richard Schwartz, and Bing Xiang. 2005. The effects of speech recognition and punctuation on information extraction performance. In Ninth European Conference on Speech Communication and Technology.

Binh Nguyen, Vu Bao Hung Nguyen, Hien Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong Do, and Luong Chi Mai. 2019a. Fast and accurate capitalization and punctuation for automatic speech recognition using transformer and chunk merging. arXiv preprint arXiv:1908.02404.

Binh Nguyen, Vu Bao Hung Nguyen, Hien Nguyen, Pham Ngoc Phuong, The-Loc Nguyen, Quoc Truong Do, and Luong Chi Mai. 2019b. Fast and accurate capitalization and punctuation for automatic speech recognition using transformer and chunk merging. arXiv preprint arXiv:1908.02404.

Vardaan Pahuja, Anirban Laha, Shachar Mirkin, Vikas Raykar, Lili Kotlerman, and Guy Lev. 2017. Joint learning of correlated sequence labelling tasks using bidirectional recurrent neural networks. arXiv preprint arXiv:1703.04650.

Stephan Peitz, Markus Freitag, Arne Mauser, and Hermann Ney. 2011a. Modeling punctuation prediction as machine translation. In International Workshop on Spoken Language Translation (IWSLT) 2011.

Stephan Peitz, Markus Freitag, Arne Mauser, and Hermann Ney. 2011b. Modeling punctuation prediction as machine translation. In International Workshop on Spoken Language Translation (IWSLT) 2011.

Wael Salloum, Gregory Finley, Erik Edwards, Mark Miller, and David Suendermann-Oeft. 2017. Deep learning for punctuation restoration in medical reports. In BioNLP 2017, pages 159–164.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Richard Sproat and Navdeep Jaitly. 2016. RNN approaches to text normalization: A challenge. arXiv preprint arXiv:1611.00068.

Andreas Stolcke, Elizabeth Shriberg, Rebecca Bates, Mari Ostendorf, Dilek Hakkani, Madelaine Plauche, Gokhan Tur, and Yu Lu. 1998. Automatic detection of sentence boundaries and disfluencies based on recognized words. In Fifth International Conference on Spoken Language Processing.

Ottokar Tilk and Tanel Alumäe. 2015. LSTM for punctuation restoration in speech transcripts. In Sixteenth Annual Conference of the International Speech Communication Association.

Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In Interspeech, pages 3047–3051.

Nicola Ueffing, Maximilian Bisani, and Paul Vozila. 2013. Improved models for automatic punctuation prediction for spoken and written text. In Interspeech, pages 3097–3101.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.

Jiangyan Yi and Jianhua Tao. 2019. Self-attention based model for punctuation prediction using word and speech embeddings. In Proc. ICASSP, pages 7270–7274.