Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-Gram Embeddings

Total Page:16

File Type:pdf, Size:1020Kb

Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-Gram Embeddings Unsupervised Context-Sensitive Spelling Correction of Clinical Free-Text with Word and Character N-Gram Embeddings Pieter Fivez∗ Simon Susterˇ ∗y Walter Daelemans∗ ∗CLiPS, University of Antwerp, Antwerp, Belgium yAntwerp University Hospital, Edegem, Belgium [email protected] Abstract process, for instance to counter the frequency bias of a context-insensitive corpus frequency-based We present an unsupervised context- system. Flor (2012) also pointed out that ignor- sensitive spelling correction method for ing contextual clues harms performance where a clinical free-text that uses word and char- specific misspelling maps to different corrections acter n-gram embeddings. Our method in different contexts, e.g. iron deficiency due to generates misspelling replacement candi- enemia ! anemia vs. fluid injected with enemia dates and ranks them according to their se- ! enema. A noisy channel model like the one by mantic fit, by calculating a weighted co- Lai et al. will choose the same item for both cor- sine similarity between the vectorized rep- rections. resentation of a candidate and the mis- Our proposed method exploits contextual clues spelling context. We greatly outperform by using neural embeddings to rank misspelling two baseline off-the-shelf spelling cor- replacement candidates according to their seman- rection tools on a manually annotated tic fit in the misspelling context. Neural embed- MIMIC-III test set, and counter the fre- dings have recently proven useful for a variety of quency bias of an optimized noisy channel related tasks, such as unsupervised normalization model, showing that neural embeddings (Sridhar, 2015) and reducing the candidate search can be successfully exploited to include space for spelling correction (Pande, 2017). context-awareness in a spelling correction We hypothesize that, by using neural embed- model. Our source code, including a dings, our method can counter the frequency bias script to extract the annotated test data, can of a noisy channel model. We test our sys- be found at https://github.com/ tem on manually annotated misspellings from the pieterfivez/bionlp2017. MIMIC-III (Johnson et al., 2016) clinical notes. In this paper, we focus on already detected non-word 1 Introduction misspellings, i.e. where the misspellings are not The genre of clinical free-text is notoriously noisy. real words, following Lai et al. Corpora contain observed spelling error rates which range from 0.1% (Liu et al., 2012) and 0.4% 2 Approach (Lai et al., 2015) to 4% and 7% (Tolentino et al., 2.1 Candidate Generation 2007), and even 10% (Ruch et al., 2003). This high spelling error rate, combined with the vari- We generate replacement candidates in 2 phases. able lexical characteristics of clinical text, can ren- First, we extract all items within a Damerau- der traditional spell checkers ineffective (Patrick Levenshtein edit distance of 2 from a reference et al., 2010). lexicon. Secondly, to allow for candidates be- Recently, Lai et al. (2015) have achieved nearly yond that edit distance, we also apply the Dou- ble Metaphone matching popularized by the open 80% correction accuracy on a test set of clinical 1 notes with their noisy channel model. However, source spell checker Aspell. This algorithm their model does not leverage any contextual in- converts lexical forms to an approximate pho- formation, while the context of a misspelling can netic consonant skeleton, and matches all Dou- provide important clues for the spelling correction 1http://aspell.net/metaphone/ ble Metaphone representations within a Damerau- For each Levenshtein edit distance of 1. As reference lexi- candidate con, we use a union of the UMLS R SPECIALIST lexicon2 and the general dictionary from Jazzy3, a Vectorize misspelling Java open source spell checker. Vectorize candidate context words 2.2 Candidate Ranking Our setup computes the cosine similarity be- Addition with reciprocal weighting tween the vector representation of a candidate and the composed vector representations of the mis- spelling context, weights this score with other pa- Cosine similarity rameters, and uses it as the ranking criterium. This setup is similar to the contextual similarity score Divide by edit Is candidate OOV? by Kilicoglu et al. (2015), which proved unsuc- distance cessful in their experiments. However, their ex- periments were preliminary. They used a limited Divide by OOV Yes context window of 2 tokens, could not account for penalty candidates which are not observed in the train- No ing data, and did not investigate whether a big- Rank by score ger training corpus leads to vector representations which scale better to the complexity of the task. We attempt a more thorough examination of the Figure 1: The final architecture of our model. It applicability of neural embeddings to the spelling vectorizes every context word on each side within correction task. To tune the parameters of our a specified scope if it is present in the vector vo- unsupervised context-sensitive spelling correction cabulary, applies reciprocal weighting, and sums model, we generate tuning corpora with self- the representations. It then calculates the cosine induced spelling errors for three different scenar- similarity with each candidate vector, and divides ios following the procedures described in section this score by the Damerau-Levenshtein edit dis- 3.2. These three corpora present increasingly dif- tance between the candidate and misspelling. If ficult scenarios for the spelling correction task. the candidate is OOV, the score is divided by an Setup 1 is generated from the same corpus which OOV penalty. is used to train the neural embeddings, and exclu- sively contains corrections which are present in the vocabulary of these neural embeddings. Setup 2 is tor representations for character n-grams and sums generated from a corpus in a different clinical sub- these with word unigram vectors to create the final domain, and also exclusively contains in-vector- word vectors. FastText allows for creating vector vocabulary corrections. Setup 3 presents the most representations for misspelling replacement can- difficult scenario, where we use the same corpus as didates absent from the trained embedding space, for Setup 2, but only include corrections which are by only summing the vectors of the character n- not present in the embedding vocabulary (OOV). grams. In other words, here our model has to deal with We report our tuning experiments with the dif- both domain change and data sparsity. ferent setups in 4.1. The final architecture of our Correcting OOV tokens in Setup 3 is made pos- model is described in Figure1. We evaluate this sible by using a combination of word and char- model on our test data in section 4.2. acter n-gram embeddings. We train these embed- dings with the fastText model (Bojanowski et al., 3 Materials 2016), an extension of the popular Word2Vec model (Mikolov et al., 2013), which creates vec- We tokenize all data with the Pattern tokenizer (De Smedt and Daelemans, 2012). All text is lower- 2 https://lexsrv3.nlm.nih.gov/ cased, and we remove all tokens that include any- LexSysGroup/Projects/lexicon/current/ web/index.html thing different from alphabetic characters or hy- 3http://jazzy.sourceforge.net phens. 3.1 Neural embeddings viations and shorthand, resulting in instances such phebilitis ! phlebitis sympots ! symp- We train a fastText skipgram model on 425M as and toms words from the MIMIC-III corpus, which contains . We provide a script to extract this test set https://github.com/ medical records from critical care units. We use from MIMIC-III at pieterfivez/bionlp2017 the default parameters, except for the dimension- . ality, which we raise to 300. 4 Results 3.2 Tuning corpora For all experiments, we use accuracy as the metric In order to tune our model parameters in an to evaluate the performance of models. Accuracy unsupervised way, we automatically create self- is simply defined as the percentage of correct mis- induced error corpora. We generate these tuning spelling replacements found by a model. corpora by randomly sampling lines from a refer- 4.1 Parameter tuning ence corpus, randomly sampling a single word per To tune our model, we investigate a variety of pa- line if the word is present in our reference lexi- rameters: con, transforming these words with either 1 (80%) or 2 (20%) random Damerau-Levenshtein opera- Vector composition functions tions to a non-word, and then extracting these mis- (a) addition spelling instances with a context window of up to 10 tokens on each side, crossing sentence bound- (b) multiplication aries. For Setup 1, we perform this procedure (c) max embedding by Wu et al. (2015) for MIMIC-III, the same corpus which we use to Context metrics train our neural embeddings, and exclusively sam- ple words present in our vector vocabulary, re- (a) Window size (1 to 10) sulting in 5,000 instances. For Setup 2, we per- (b) Reciprocal weighting form our procedure for the THYME (Styler IV (c) Removing stop words using the English stop et al., 2014) corpus, which contains 1,254 clin- word list from scikit-learn (Pedregosa et al., ical notes on the topics of brain and colon can- 2011) cer. We once again exclusively sample in-vector- (d) Including a vectorized representation of the vocabulary words, resulting in 5,000 instances. misspelling For Setup 3, we again perform our procedure for the THYME corpus, but this time we exclusively Edit distance penalty sample OOV words, resulting in 1,500 instances. (a) Damerau-Levenshtein (b) Double Metaphone 3.3 Test corpus (c) Damerau-Levenshtein + Double Metaphone No benchmark test sets are publicly available for (d) Spell score by Lai et al. clinical spelling correction. A straightforward an- notation task is costly and can lead to small cor- We perform a grid search for Setup 1 and Setup 2 pora, such as the one by Lai et al., which con- to discover which parameter combination leads to tains just 78 misspelling instances.
Recommended publications
  • Automatic Correction of Real-Word Errors in Spanish Clinical Texts
    sensors Article Automatic Correction of Real-Word Errors in Spanish Clinical Texts Daniel Bravo-Candel 1,Jésica López-Hernández 1, José Antonio García-Díaz 1 , Fernando Molina-Molina 2 and Francisco García-Sánchez 1,* 1 Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain; [email protected] (D.B.-C.); [email protected] (J.L.-H.); [email protected] (J.A.G.-D.) 2 VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain; [email protected] * Correspondence: [email protected]; Tel.: +34-86888-8107 Abstract: Real-word errors are characterized by being actual terms in the dictionary. By providing context, real-word errors are detected. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being a real-word error is computed. On the other hand, state-of-the-art approaches make use of deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model were implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation Model mapped erroneous sentences to correct them. For that, different types of error were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing Citation: Bravo-Candel, D.; López-Hernández, J.; García-Díaz, with patient information.
    [Show full text]
  • ACL 2019 Social Media Mining for Health Applications (#SMM4H)
    ACL 2019 Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task Proceedings of the Fourth Workshop August 2, 2019 Florence, Italy c 2019 The Association for Computational Linguistics Order copies of this and other ACL proceedings from: Association for Computational Linguistics (ACL) 209 N. Eighth Street Stroudsburg, PA 18360 USA Tel: +1-570-476-8006 Fax: +1-570-476-0860 [email protected] ISBN 978-1-950737-46-8 ii Preface Welcome to the 4th Social Media Mining for Health Applications Workshop and Shared Task - #SMM4H 2019. The total number of users of social media continues to grow worldwide, resulting in the generation of vast amounts of data. Popular social networking sites such as Facebook, Twitter and Instagram dominate this sphere. According to estimates, 500 million tweets and 4.3 billion Facebook messages are posted every day 1. The latest Pew Research Report 2, nearly half of adults worldwide and two- thirds of all American adults (65%) use social networking. The report states that of the total users, 26% have discussed health information, and, of those, 30% changed behavior based on this information and 42% discussed current medical conditions. Advances in automated data processing, machine learning and NLP present the possibility of utilizing this massive data source for biomedical and public health applications, if researchers address the methodological challenges unique to this media. In its fourth iteration, the #SMM4H workshop takes place in Florence, Italy, on August 2, 2019, and is co-located with the
    [Show full text]
  • Mapping the Hyponymy Relation of Wordnet Onto
    Under review as a conference paper at ICLR 2019 MAPPING THE HYPONYMY RELATION OF WORDNET ONTO VECTOR SPACES Anonymous authors Paper under double-blind review ABSTRACT In this paper, we investigate mapping the hyponymy relation of WORDNET to feature vectors. We aim to model lexical knowledge in such a way that it can be used as input in generic machine-learning models, such as phrase entailment predictors. We propose two models. The first one leverages an existing mapping of words to feature vectors (fastText), and attempts to classify such vectors as within or outside of each class. The second model is fully supervised, using solely WORDNET as a ground truth. It maps each concept to an interval or a disjunction thereof. On the first model, we approach, but not quite attain state of the art performance. The second model can achieve near-perfect accuracy. 1 INTRODUCTION Distributional encoding of word meanings from the large corpora (Mikolov et al., 2013; 2018; Pen- nington et al., 2014) have been found to be useful for a number of NLP tasks. These approaches are based on a probabilistic language model by Bengio et al. (2003) of word sequences, where each word w is represented as a feature vector f(w) (a compact representation of a word, as a vector of floating point values). This means that one learns word representations (vectors) and probabilities of word sequences at the same time. While the major goal of distributional approaches is to identify distributional patterns of words and word sequences, they have even found use in tasks that require modeling more fine-grained relations between words than co-occurrence in word sequences.
    [Show full text]
  • Pairwise Fasttext Classifier for Entity Disambiguation
    Pairwise FastText Classifier for Entity Disambiguation a,b b b Cheng Yu , Bing Chu , Rohit Ram , James Aichingerb, Lizhen Qub,c, Hanna Suominenb,c a Project Cleopatra, Canberra, Australia b The Australian National University c DATA 61, Australia [email protected] {u5470909,u5568718,u5016706, Hanna.Suominen}@anu.edu.au [email protected] deterministically. Then we will evaluate PFC’s Abstract performance against a few baseline methods, in- cluding SVC3 with hand-crafted text features. Fi- For the Australasian Language Technology nally, we will discuss ways to improve disambig- Association (ALTA) 2016 Shared Task, we uation performance using PFC. devised Pairwise FastText Classifier (PFC), an efficient embedding-based text classifier, 2 Pairwise Fast-Text Classifier (PFC) and used it for entity disambiguation. Com- pared with a few baseline algorithms, PFC Our Pairwise FastText Classifier is inspired by achieved a higher F1 score at 0.72 (under the the FastText. Thus this section starts with a brief team name BCJR). To generalise the model, description of FastText, and proceeds to demon- we also created a method to bootstrap the strate PFC. training set deterministically without human labelling and at no financial cost. By releasing 2.1 FastText PFC and the dataset augmentation software to the public1, we hope to invite more collabora- FastText maps each vocabulary to a real-valued tion. vector, with unknown words having a special vo- cabulary ID. A document can be represented as 1 Introduction the average of all these vectors. Then FastText will train a maximum entropy multi-class classi- The goal of the ALTA 2016 Shared Task was to fier on the vectors and the output labels.
    [Show full text]
  • Welsh Language Technology Action Plan Progress Report 2020 Welsh Language Technology Action Plan: Progress Report 2020
    Welsh language technology action plan Progress report 2020 Welsh language technology action plan: Progress report 2020 Audience All those interested in ensuring that the Welsh language thrives digitally. Overview This report reviews progress with work packages of the Welsh Government’s Welsh language technology action plan between its October 2018 publication and the end of 2020. The Welsh language technology action plan derives from the Welsh Government’s strategy Cymraeg 2050: A million Welsh speakers (2017). Its aim is to plan technological developments to ensure that the Welsh language can be used in a wide variety of contexts, be that by using voice, keyboard or other means of human-computer interaction. Action required For information. Further information Enquiries about this document should be directed to: Welsh Language Division Welsh Government Cathays Park Cardiff CF10 3NQ e-mail: [email protected] @cymraeg Facebook/Cymraeg Additional copies This document can be accessed from gov.wales Related documents Prosperity for All: the national strategy (2017); Education in Wales: Our national mission, Action plan 2017–21 (2017); Cymraeg 2050: A million Welsh speakers (2017); Cymraeg 2050: A million Welsh speakers, Work programme 2017–21 (2017); Welsh language technology action plan (2018); Welsh-language Technology and Digital Media Action Plan (2013); Technology, Websites and Software: Welsh Language Considerations (Welsh Language Commissioner, 2016) Mae’r ddogfen yma hefyd ar gael yn Gymraeg. This document is also available in Welsh.
    [Show full text]
  • Fasttext.Zip: Compressing Text Classification Models
    Under review as a conference paper at ICLR 2017 FASTTEXT.ZIP: COMPRESSING TEXT CLASSIFICATION MODELS Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Herve´ Jegou´ & Tomas Mikolov Facebook AI Research fajoulin,egrave,bojanowski,matthijs,rvj,[email protected] ABSTRACT We consider the problem of producing compact architectures for text classifica- tion, such that the full model fits in a limited amount of memory. After consid- ering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quan- tization artefacts. Combined with simple approaches specifically adapted to text classification, our approach derived from fastText requires, at test time, only a fraction of the memory compared to the original FastText, without noticeably sacrificing quality in terms of classification accuracy. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy. 1 INTRODUCTION Text classification is an important problem in Natural Language Processing (NLP). Real world use- cases include spam filtering or e-mail categorization. It is a core component in more complex sys- tems such as search and ranking. Recently, deep learning techniques based on neural networks have achieved state of the art results in various NLP applications. One of the main successes of deep learning is due to the effectiveness of recurrent networks for language modeling and their application to speech recognition and machine translation (Mikolov, 2012).
    [Show full text]
  • Learning to Read by Spelling Towards Unsupervised Text Recognition
    Learning to Read by Spelling Towards Unsupervised Text Recognition Ankush Gupta Andrea Vedaldi Andrew Zisserman Visual Geometry Group Visual Geometry Group Visual Geometry Group University of Oxford University of Oxford University of Oxford [email protected] [email protected] [email protected] tttttttttttttttttttttttttttttttttttttttttttrssssss ttttttt ny nytt nr nrttttttt ny ntt nrttttttt zzzz iterations iterations bcorote to whol th ticunthss tidio tiostolonzzzz trougfht to ferr oy disectins it has dicomered Training brought to view by dissection it was discovered Figure 1: Text recognition from unaligned data. We present a method for recognising text in images without using any labelled data. This is achieved by learning to align the statistics of the predicted text strings, against the statistics of valid text strings sampled from a corpus. The figure above visualises the transcriptions as various characters are learnt through the training iterations. The model firstlearns the concept of {space}, and hence, learns to segment the string into words; followed by common words like {to, it}, and only later learns to correctly map the less frequent characters like {v, w}. The last transcription also corresponds to the ground-truth (punctuations are not modelled). The colour bar on the right indicates the accuracy (darker means higher accuracy). ABSTRACT 1 INTRODUCTION This work presents a method for visual text recognition without read (ri:d) verb • Look at and comprehend the meaning of using any paired supervisory data. We formulate the text recogni- (written or printed matter) by interpreting the characters or tion task as one of aligning the conditional distribution of strings symbols of which it is composed.
    [Show full text]
  • Detecting Personal Life Events from Social Media
    Open Research Online The Open University’s repository of research publications and other research outputs Detecting Personal Life Events from Social Media Thesis How to cite: Dickinson, Thomas Kier (2019). Detecting Personal Life Events from Social Media. PhD thesis The Open University. For guidance on citations see FAQs. c 2018 The Author https://creativecommons.org/licenses/by-nc-nd/4.0/ Version: Version of Record Link(s) to article on publisher’s website: http://dx.doi.org/doi:10.21954/ou.ro.00010aa9 Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online’s data policy on reuse of materials please consult the policies page. oro.open.ac.uk Detecting Personal Life Events from Social Media a thesis presented by Thomas K. Dickinson to The Department of Science, Technology, Engineering and Mathematics in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science The Open University Milton Keynes, England May 2019 Thesis advisor: Professor Harith Alani & Dr Paul Mulholland Thomas K. Dickinson Detecting Personal Life Events from Social Media Abstract Social media has become a dominating force over the past 15 years, with the rise of sites such as Facebook, Instagram, and Twitter. Some of us have been with these sites since the start, posting all about our personal lives and building up a digital identify of ourselves. But within this myriad of posts, what actually matters to us, and what do our digital identities tell people about ourselves? One way that we can start to filter through this data, is to build classifiers that can identify posts about our personal life events, allowing us to start to self reflect on what we share online.
    [Show full text]
  • Question Embeddings for Semantic Answer Type Prediction
    Question Embeddings for Semantic Answer Type Prediction Eleanor Bill1 and Ernesto Jiménez-Ruiz2 1, 2 City, University of London, London, EC1V 0HB Abstract. This paper considers an answer type and category prediction challenge for a set of natural language questions, and proposes a question answering clas- sification system based on word and DBpedia knowledge graph embeddings. The questions are parsed for keywords, nouns and noun phrases before word and knowledge graph embeddings are applied to the parts of the question. The vectors produced are used to train multiple multi-layer perceptron models, one for each answer type in a multiclass one-vs-all classification system for both answer cat- egory prediction and answer type prediction. Different combinations of vectors and the effect of creating additional positive and negative training samples are evaluated in order to find the best classification system. The classification system that predict the answer category with highest accuracy are the classifiers trained on knowledge graph embedded noun phrases vectors from the original training data, with an accuracy of 0.793. The vector combination that produces the highest NDCG values for answer category accuracy is the word embeddings from the parsed question keyword and nouns parsed from the original training data, with NDCG@5 and NDCG@10 values of 0.471 and 0.440 respectively for the top five and ten predicted answer types. Keywords: Semantic Web, knowledge graph embedding, answer type predic- tion, question answering. 1 Introduction 1.1 The SMART challenge The challenge [1] provided a training set of natural language questions alongside a sin- gle given answer category (Boolean, literal or resource) and 1-6 given answer types.
    [Show full text]
  • Spelling Correction: from Two-Level Morphology to Open Source
    Spelling Correction: from two-level morphology to open source Iñaki Alegria, Klara Ceberio, Nerea Ezeiza, Aitor Soroa, Gregorio Hernandez Ixa group. University of the Basque Country / Eleka S.L. 649 P.K. 20080 Donostia. Basque Country. [email protected] Abstract Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of languages and there are two-level based descriptions for very different languages. After doing the morphological description for a language, it is easy to develop a spelling checker/corrector for this language. However, what happens if we want to use the speller in the "free world" (OpenOffice, Mozilla, emacs, LaTeX, ...)? Ispell and similar tools (aspell, hunspell, myspell) are the usual mechanisms for these purposes, but they do not fit the two-level model. In the absence of two-level morphology based mechanisms, an automatic conversion from two-level description to hunspell is described in this paper. previous work and reuse the morphological information. 1. Introduction Two-level morphology (Koskenniemi, 1983; Beesley & Karttunen, 2003) has been applied successfully to the 2. Free software for spelling correction morphological description of highly inflected languages. Unfortunately there are not open source tools for spelling There are two-level based descriptions for very different correction with these features: languages (English, German, Swedish, French, Spanish, • It is standardized in the most of the applications Danish, Norwegian, Finnish, Basque, Russian, Turkish, (OpenOffice, Mozilla, emacs, LaTeX, ...). Arab, Aymara, Swahili, etc.). • It is based on the two-level morphology. After doing the morphological description, it is easy to The spell family of spell checkers (ispell, aspell, myspell) develop a spelling checker/corrector for the language fits the first condition but not the second.
    [Show full text]
  • Easychair Preprint Feedback Learning: Automating the Process
    EasyChair Preprint № 992 Feedback Learning: Automating the Process of Correcting and Completing the Extracted Information Rakshith Bymana Ponnappa, Khurram Azeem Hashmi, Syed Saqib Bukhari and Andreas Dengel EasyChair preprints are intended for rapid dissemination of research results and are integrated with the rest of EasyChair. May 12, 2019 Feedback Learning: Automating the Process of Correcting and Completing the Extracted Information Abstract—In recent years, with the increasing usage of digital handwritten or typed information in the digital mailroom media and advancements in deep learning architectures, most systems. The problem arises when these information extraction of the paper-based documents have been revolutionized into systems cannot extract the exact information due to many digital versions. These advancements have helped the state-of-the- art Optical Character Recognition (OCR) and digital mailroom reasons like the source graphical document itself is not readable, technologies become progressively efficient. Commercially, there scanner has a poor result, characters in the document are too already exists end to end systems which use OCR and digital close to each other resulting in problems like reading “Ouery” mailroom technologies for extracting relevant information from instead of “Query”, “8” instead of ”3”, to name a few. It financial documents such as invoices. However, there is plenty of is even more challenging to correct errors in proper nouns room for improvement in terms of automating and correcting post information extracted errors. This paper describes the like names, addresses and also in numerical values such as user-involved, self-correction concept based on the sequence to telephone number, insurance number and so on.
    [Show full text]
  • Hunspell – the Free Spelling Checker
    Hunspell – The free spelling checker About Hunspell Hunspell is a spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding. Hunspell interfaces: Ispell-like terminal interface using Curses library, Ispell pipe interface, OpenOffice.org UNO module. Main features of Hunspell spell checker and morphological analyzer: - Unicode support (affix rules work only with the first 65535 Unicode characters) - Morphological analysis (in custom item and arrangement style) and stemming - Max. 65535 affix classes and twofold affix stripping (for agglutinative languages, like Azeri, Basque, Estonian, Finnish, Hungarian, Turkish, etc.) - Support complex compoundings (for example, Hungarian and German) - Support language specific features (for example, special casing of Azeri and Turkish dotted i, or German sharp s) - Handle conditional affixes, circumfixes, fogemorphemes, forbidden words, pseudoroots and homonyms. - Free software (LGPL, GPL, MPL tri-license) Usage The src/tools dictionary contains ten executables after compiling (or some of them are in the src/win_api): affixcompress: dictionary generation from large (millions of words) vocabularies analyze: example of spell checking, stemming and morphological analysis chmorph: example of automatic morphological generation and conversion example: example of spell checking and suggestion hunspell: main program for spell checking and others (see manual) hunzip: decompressor of hzip format hzip: compressor of
    [Show full text]