Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-Gram Embeddings
Pieter Fivez*, Simon Šuster*†, Walter Daelemans*
*CLiPS, University of Antwerp, Antwerp, Belgium
†Antwerp University Hospital, Edegem, Belgium
[email protected]

Abstract

We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates misspelling replacement candidates and ranks them according to their semantic fit, by calculating a weighted cosine similarity between the vectorized representation of a candidate and the misspelling context. We greatly outperform two baseline off-the-shelf spelling correction tools on a manually annotated MIMIC-III test set, and counter the frequency bias of an optimized noisy channel model, showing that neural embeddings can be successfully exploited to include context-awareness in a spelling correction model. Our source code, including a script to extract the annotated test data, can be found at https://github.com/pieterfivez/bionlp2017.

1 Introduction

The genre of clinical free-text is notoriously noisy. Corpora contain observed spelling error rates which range from 0.1% (Liu et al., 2012) and 0.4% (Lai et al., 2015) to 4% and 7% (Tolentino et al., 2007), and even 10% (Ruch et al., 2003). This high spelling error rate, combined with the variable lexical characteristics of clinical text, can render traditional spell checkers ineffective (Patrick et al., 2010).

Recently, Lai et al. (2015) achieved nearly 80% correction accuracy on a test set of clinical notes with their noisy channel model. However, their model does not leverage any contextual information, while the context of a misspelling can provide important clues for the spelling correction process, for instance to counter the frequency bias of a context-insensitive corpus frequency-based system. Flor (2012) also pointed out that ignoring contextual clues harms performance where a specific misspelling maps to different corrections in different contexts, e.g. iron deficiency due to enemia → anemia vs. fluid injected with enemia → enema. A noisy channel model like the one by Lai et al. will choose the same item for both corrections.

Our proposed method exploits contextual clues by using neural embeddings to rank misspelling replacement candidates according to their semantic fit in the misspelling context. Neural embeddings have recently proven useful for a variety of related tasks, such as unsupervised normalization (Sridhar, 2015) and reducing the candidate search space for spelling correction (Pande, 2017).

We hypothesize that, by using neural embeddings, our method can counter the frequency bias of a noisy channel model. We test our system on manually annotated misspellings from the MIMIC-III (Johnson et al., 2016) clinical notes. In this paper, we focus on already detected non-word misspellings, i.e. where the misspellings are not real words, following Lai et al.

2 Approach

2.1 Candidate Generation

We generate replacement candidates in two phases. First, we extract all items within a Damerau-Levenshtein edit distance of 2 from a reference lexicon. Secondly, to allow for candidates beyond that edit distance, we also apply the Double Metaphone matching popularized by the open source spell checker Aspell (http://aspell.net/metaphone/). This algorithm converts lexical forms to an approximate phonetic consonant skeleton, and matches all Double Metaphone representations within a Damerau-Levenshtein edit distance of 1. As reference lexicon, we use a union of the UMLS SPECIALIST lexicon (https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lexicon/current/web/index.html) and the general dictionary from Jazzy (http://jazzy.sourceforge.net), a Java open source spell checker.
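To make the two-phase procedure concrete, the following minimal Python sketch illustrates it under some assumptions: it relies on the jellyfish library for edit distances, and substitutes jellyfish's single Metaphone for the Double Metaphone algorithm used by Aspell; the function name and toy lexicon are illustrative, not part of our released implementation.

    import jellyfish

    def generate_candidates(misspelling, lexicon):
        # Phase 1: all lexicon items within a Damerau-Levenshtein
        # edit distance of 2 from the misspelling.
        candidates = {
            word for word in lexicon
            if jellyfish.damerau_levenshtein_distance(misspelling, word) <= 2
        }
        # Phase 2: phonetic matching, to allow candidates beyond that edit
        # distance. The paper matches Double Metaphone codes within an edit
        # distance of 1; jellyfish's single Metaphone stands in for it here.
        mis_code = jellyfish.metaphone(misspelling)
        candidates |= {
            word for word in lexicon
            if jellyfish.damerau_levenshtein_distance(
                   mis_code, jellyfish.metaphone(word)) <= 1
        }
        return candidates

    # Toy example: both readings of "enemia" survive candidate generation.
    print(generate_candidates("enemia", {"anemia", "enema", "phlebitis"}))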
2.2 Candidate Ranking

Our setup computes the cosine similarity between the vector representation of a candidate and the composed vector representations of the misspelling context, weights this score with other parameters, and uses it as the ranking criterion. This setup is similar to the contextual similarity score by Kilicoglu et al. (2015), which proved unsuccessful in their experiments. However, their experiments were preliminary: they used a limited context window of 2 tokens, could not account for candidates which are not observed in the training data, and did not investigate whether a bigger training corpus leads to vector representations which scale better to the complexity of the task.

We attempt a more thorough examination of the applicability of neural embeddings to the spelling correction task. To tune the parameters of our unsupervised context-sensitive spelling correction model, we generate tuning corpora with self-induced spelling errors for three different scenarios, following the procedures described in section 3.2. These three corpora present increasingly difficult scenarios for the spelling correction task. Setup 1 is generated from the same corpus which is used to train the neural embeddings, and exclusively contains corrections which are present in the vocabulary of these neural embeddings. Setup 2 is generated from a corpus in a different clinical subdomain, and also exclusively contains in-vector-vocabulary corrections. Setup 3 presents the most difficult scenario: we use the same corpus as for Setup 2, but only include corrections which are not present in the embedding vocabulary (OOV). In other words, here our model has to deal with both domain change and data sparsity.

We report our tuning experiments with the different setups in section 4.1. The final architecture of our model is described in Figure 1. We evaluate this model on our test data in section 4.2.

Figure 1: The final architecture of our model. It vectorizes every context word on each side within a specified scope if it is present in the vector vocabulary, applies reciprocal weighting, and sums the representations. It then calculates the cosine similarity with each candidate vector, and divides this score by the Damerau-Levenshtein edit distance between the candidate and the misspelling. If the candidate is OOV, the score is divided by an OOV penalty.

Correcting OOV tokens in Setup 3 is made possible by using a combination of word and character n-gram embeddings. We train these embeddings with the fastText model (Bojanowski et al., 2016), an extension of the popular word2vec model (Mikolov et al., 2013), which creates vector representations for character n-grams and sums these with word unigram vectors to create the final word vectors. FastText allows for creating vector representations for misspelling replacement candidates absent from the trained embedding space, by summing only the vectors of the character n-grams.
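The ranking pipeline of Figure 1 can be sketched in a few lines of numpy. In this sketch, vec is assumed to be a lookup into the trained fastText embeddings which composes character n-gram vectors for words absent from the vocabulary, vocab is the embedding vocabulary used to trigger the OOV penalty, and the penalty value of 1.5 is an illustrative placeholder for the tuned parameter.

    import numpy as np
    import jellyfish

    def rank_candidates(misspelling, left_context, right_context,
                        candidates, vec, vocab, window=10, oov_penalty=1.5):
        # Compose the context vector: a reciprocally weighted sum of the
        # embeddings of up to `window` tokens on each side, skipping
        # context words absent from the vector vocabulary.
        context = np.zeros(300)  # dimensionality of our embeddings
        for side in (list(reversed(left_context))[:window],
                     right_context[:window]):
            for dist, token in enumerate(side, start=1):  # 1 = nearest token
                v = vec(token)
                if v is not None:
                    context += v / dist  # reciprocal weighting
        norm_c = np.linalg.norm(context) or 1.0
        scored = []
        for cand in candidates:
            v = vec(cand)  # composes character n-grams for OOV candidates
            cos = (context @ v) / (norm_c * np.linalg.norm(v))
            # Divide by the candidate-misspelling edit distance ...
            score = cos / jellyfish.damerau_levenshtein_distance(misspelling,
                                                                 cand)
            # ... and apply the OOV penalty where the candidate is OOV.
            if cand not in vocab:
                score /= oov_penalty
            scored.append((score, cand))
        return [cand for _, cand in sorted(scored, reverse=True)]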
We train these embed- dings with the fastText model (Bojanowski et al., 3 Materials 2016), an extension of the popular Word2Vec model (Mikolov et al., 2013), which creates vec- We tokenize all data with the Pattern tokenizer (De Smedt and Daelemans, 2012). All text is lower- 2 https://lexsrv3.nlm.nih.gov/ cased, and we remove all tokens that include any- LexSysGroup/Projects/lexicon/current/ web/index.html thing different from alphabetic characters or hy- 3http://jazzy.sourceforge.net phens. 3.1 Neural embeddings viations and shorthand, resulting in instances such phebilitis ! phlebitis sympots ! symp- We train a fastText skipgram model on 425M as and toms words from the MIMIC-III corpus, which contains . We provide a script to extract this test set https://github.com/ medical records from critical care units. We use from MIMIC-III at pieterfivez/bionlp2017 the default parameters, except for the dimension- . ality, which we raise to 300. 4 Results 3.2 Tuning corpora For all experiments, we use accuracy as the metric In order to tune our model parameters in an to evaluate the performance of models. Accuracy unsupervised way, we automatically create self- is simply defined as the percentage of correct mis- induced error corpora. We generate these tuning spelling replacements found by a model. corpora by randomly sampling lines from a refer- 4.1 Parameter tuning ence corpus, randomly sampling a single word per To tune our model, we investigate a variety of pa- line if the word is present in our reference lexi- rameters: con, transforming these words with either 1 (80%) or 2 (20%) random Damerau-Levenshtein opera- Vector composition functions tions to a non-word, and then extracting these mis- (a) addition spelling instances with a context window of up to 10 tokens on each side, crossing sentence bound- (b) multiplication aries. For Setup 1, we perform this procedure (c) max embedding by Wu et al. (2015) for MIMIC-III, the same corpus which we use to Context metrics train our neural embeddings, and exclusively sam- ple words present in our vector vocabulary, re- (a) Window size (1 to 10) sulting in 5,000 instances. For Setup 2, we per- (b) Reciprocal weighting form our procedure for the THYME (Styler IV (c) Removing stop words using the English stop et al., 2014) corpus, which contains 1,254 clin- word list from scikit-learn (Pedregosa et al., ical notes on the topics of brain and colon can- 2011) cer. We once again exclusively sample in-vector- (d) Including a vectorized representation of the vocabulary words, resulting in 5,000 instances. misspelling For Setup 3, we again perform our procedure for the THYME corpus, but this time we exclusively Edit distance penalty sample OOV words, resulting in 1,500 instances. (a) Damerau-Levenshtein (b) Double Metaphone 3.3 Test corpus (c) Damerau-Levenshtein + Double Metaphone No benchmark test sets are publicly available for (d) Spell score by Lai et al. clinical spelling correction. A straightforward an- notation task is costly and can lead to small cor- We perform a grid search for Setup 1 and Setup 2 pora, such as the one by Lai et al., which con- to discover which parameter combination leads to tains just 78 misspelling instances.