An Unsupervised and Data-Driven Approach for Spell Checking in Vietnamese OCR-Scanned Texts

Total pages: 16

File type: PDF, size: 1020 KB

An Unsupervised and Data-Driven Approach for Spell Checking in Vietnamese OCR-scanned Texts

Cong Duy Vu HOANG & Ai Ti AW
Department of Human Language Technology (HLT), Institute for Infocomm Research (I2R), A*STAR, Singapore
{cdvhoang,aaiti}@i2r.a-star.edu.sg

Abstract

OCR (Optical Character Recognition) scanners do not always produce 100% accuracy in recognizing text documents, leading to spelling errors that make the texts hard to process further. This paper presents an investigation of the task of spell checking for OCR-scanned text documents. First, we conduct a detailed analysis of the characteristics of spelling errors produced by an OCR scanner. Then, we propose a fully automatic approach combining both error detection and correction phases within a unique scheme. The scheme is designed in an unsupervised, data-driven manner, suitable for resource-poor languages. In an evaluation on a real dataset in the Vietnamese language, our approach gives acceptable performance (detection accuracy 86%, correction accuracy 71%). In addition, we give a result analysis showing how accurate our approach can be.

1 Introduction and Related Work

Documents that are only available in print require scanning with OCR devices for retrieval or e-archiving purposes (Tseng, 2002; Magdy and Darwish, 2008). However, OCR scanners do not always produce 100% accuracy in recognizing text documents, leading to spelling errors that make the texts hard to process further. Several factors may cause these errors. For instance, shape or visual similarity leads OCR scanners to misrecognize some characters, or the input documents are of poor quality, causing noise in the resulting scanned texts. The task of spell checking for OCR-scanned text documents aims to address this situation.

Researchers in the literature have approached this task for various languages, such as English (Tong and Evans, 1996; Taghva and Stofsky, 2001; Kolak and Resnik, 2002), Chinese (Zhuang et al., 2004), Japanese (Nagata, 1996; Nagata, 1998), Arabic (Magdy and Darwish, 2006), and Thai (Meknavin et al., 1998).

The most common approach is to involve users for their intervention with computer support. Taghva and Stofsky (2001) designed an interactive system (called OCRSpell) that assists users with as many interactive features as possible during their correction, such as choosing word boundaries, memorizing user-corrected words for future correction, and providing specific prior knowledge about typical errors. For applications requiring automation, however, the interactive scheme may not work.

Unlike Taghva and Stofsky (2001), non-interactive (or fully automatic) approaches have also been investigated. Such approaches need pre-specified lexicons and confusion resources (Tong and Evans, 1996), language-specific knowledge (Meknavin et al., 1998), or manually created phonetic transformation rules (Hodge and Austin, 2003) to assist the correction process.

Other approaches used supervised mechanisms for OCR error correction, such as statistical language models (Nagata, 1996; Zhuang et al., 2004; Magdy and Darwish, 2006) and the noisy channel model (Kolak and Resnik, 2002). These approaches performed well but are limited by requiring large annotated training data specific to OCR spell checking, which are very hard to obtain in many languages.

Further, research in spell checking for the Vietnamese language has been understudied.

Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data (Hybrid2012), EACL 2012, pages 36-44, Avignon, France, April 23, 2012.
© 2012 Association for Computational Linguistics

Hunspell-spellcheck-vn [1] and Aspell [2] are interactive spell-checking tools that work based on pre-defined dictionaries. To the best of our knowledge, no work in the literature has reported on the task of spell checking for Vietnamese OCR-scanned text documents. In this paper, we approach this task with 1) a fully automatic scheme; 2) no annotated corpora; and 3) the capability of solving both non-word and real-word spelling errors simultaneously. Such an approach is beneficial for a resource-poor language like Vietnamese.

2 Error Characteristics

First of all, we observe and analyse the characteristics of OCR-induced errors compared with typographical errors in a real dataset.

2.1 Data Overview

We used a total of 24 samples of Vietnamese OCR-scanned text documents for our analysis. Each sample contains real and OCR texts, referring to texts without and with spelling errors, respectively. Our manual sentence segmentation gives a total of 283 sentences for the above 24 samples, with 103 good (no errors) and 180 bad (errors present) sentences. Also, the numbers of syllables [3] in real and OCR sentences (over all samples) are 2392 and 2551, respectively.

2.2 Error Classification

We carried out an in-depth analysis of spelling errors, identified existing errors, and then manually classified them into three pre-defined error classes. For each class, we also determined how an error is formed. As a result, we classified OCR-induced spelling errors into three classes:

Typographic or Non-syllable Errors (Class 1): refer to incorrect syllables (not included in a standard dictionary). Normally, at least one character of a syllable is misspelled.

Real-syllable or Context-based Errors (Class 2): refer to syllables that are correct in terms of their existence in a standard dictionary but incorrect in terms of their meaning in the context of the given sentence.

Unexpected Errors (Class 3): are accidentally formed by unknown operators, such as inserting non-alphabet characters, incorrect upper-/lower-casing, splitting/merging/removing syllable(s), changing syllable order, etc.

Note that errors in Classes 1 & 2 can be formed by applying one of four operators [4] (Insertion, Deletion, Substitution, Transposition). Class 3 is exclusive, formed by unexpected operators. Table 1 gives examples of the three error classes.

An important note is that an erroneous syllable can contain errors across different classes. Class 3 can appear together with Class 1 or Class 2, but Class 1 never appears together with Class 2. For example:
− hoàn (correct) || Hòan (incorrect) (Class 3 & 1)
− bắt (correct) || bặt' (incorrect) (Class 3 & 2)

Table 1: Examples of error classes.
  Class 1. Insertion: áp (correct) || áip (incorrect) ("i" inserted); Deletion: không (correct) || kh (incorrect) ("ô", "n", and "g" deleted); Substitution: yếu (correct) || ỵếu (incorrect) ("y" substituted by "ỵ"); Transposition: N.A.
  Class 2. Insertion: lên (correct) || liên (contextually incorrect) ("i" inserted); Deletion: trình (correct) || tình (contextually incorrect) ("r" deleted); Substitution: ngay (correct) || ngây (contextually incorrect) ("a" substituted by "â"); Transposition: N.A.
  Class 3. xác nhận là (correct) || x||nha0a (incorrect): 3 syllables were misspelled and accidentally merged.
  (Our analysis reveals no examples of the Transposition operator.)

2.3 Error Distribution

Our analysis reveals a total of 551 recognized errors over all 283 sentences. Each error is classified into one of the three wide classes (Class 1, Class 2, Class 3). Specifically, we also identified the operators used in Class 1 and Class 2. As a result, we have in total 9 finer-grained error classes (1A..1D, 2A..2D, 3) [5].

We explored the distribution of the three error classes in our analysis. Class 1 is distributed the most, followed by Class 3 (slightly less) and Class 2. Generally, each class contributed a substantial share of errors (38%, 37%, and 25%), making the correction process more challenging. In addition, there are in total 613 counts over the 9 fine-grained classes (over 551 errors in 283 sentences), yielding an average and standard deviation of 3.41 and 2.78, respectively. Also, one erroneous syllable can contain the following numbers of (fine-grained) error classes: 1 (492), 2 (56), 3 (3), 4 (0), where (N) is the count of cases.

We can also observe the distribution of operators used within each error class in Figure 1. The Substitution operator was used the most in both Class 1 and Class 2, accounting for 81% and 97%, respectively. Only a few other operators (Insertion, Deletion) were used. Notably, the Transposition operator was not used in either Class 1 or Class 2. This reflects the fact that OCR scanners typically have ambiguity in recognizing similar characters.

Figure 1: Distribution of operators used in Class 1 (left) & Class 2 (right).

3 Proposed Approach

The architecture of our proposed approach (namely VOSE) is outlined in Figure 2. Our purpose is to develop VOSE as an unsupervised, data-driven spell-checking scheme. Currently, VOSE implements two different correctors to produce the final output: a Contextual Corrector and a Weighting-based Corrector. The Contextual Corrector employs language modelling to rank a list of potential candidates in the scope of the whole sentence, whereas the Weighting-based Corrector chooses for each syllable the candidate with the highest weight. The following gives detailed descriptions of all components developed in VOSE.

3.1 Pre-processor

The Pre-processor takes in the input text and performs tokenization and normalization steps. Tokenization in Vietnamese is similar to that in English. The normalization step includes: normalizing Vietnamese tones and vowels (e.g. hòa –> hoà), standardizing upper/lower case, finding numbers/punctuation/abbreviations, removing noise characters, etc.

This step also extracts unigrams. Each of them is then checked against a pre-built list of unigrams (from large raw text data). Unigrams that do not exist in the list are regarded as potential Class 1 & 3 errors and passed to the Non-syllable Detector.

[1] http://code.google.com/p/hunspell-spellcheck-vi/
[2] http://en.wikipedia.org/wiki/GNU_Aspell/
[3] In the Vietnamese language, we use the word "syllable" instead of "token" for a unit separated by spaces.
[4] Their definitions can be found in (Damerau, 1964).
[5] A, B, C, and D stand for Insertion, Deletion, Substitution, and Transposition, respectively. For instance, 1A means Insertion in Class 1.
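The detection and candidate-generation steps just described can be sketched minimally: syllables absent from the pre-built unigram list are flagged as potential Class 1 & 3 errors, and replacement candidates one Damerau operation (insertion, deletion, substitution, transposition) away can be enumerated. The tiny unigram set and alphabet below are illustrative assumptions, not VOSE's actual resources.

```python
def damerau_candidates(syllable, alphabet):
    """All strings one Damerau edit (insert, delete, substitute,
    transpose) away from `syllable`."""
    splits = [(syllable[:i], syllable[i:]) for i in range(len(syllable) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def detect_errors(tokens, unigrams):
    """Flag potential Class 1 & 3 errors: tokens absent from the unigram list."""
    return [t for t in tokens if t.lower() not in unigrams]

unigrams = {"không", "yếu", "trình", "ngay", "là"}
print(detect_errors(["kh", "yếu"], unigrams))  # → ['kh']
```

Real-syllable (Class 2) errors such as ngay/ngây pass this dictionary check, which is why a contextual ranking step is still needed.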
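The Contextual Corrector's language-model ranking can be illustrated with a toy add-alpha-smoothed bigram model that scores each candidate sentence as a whole; the corpus and smoothing below are assumptions for the sketch, not the model VOSE actually trains.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def sentence_logprob(sent, uni, bi, vocab_size, alpha=1.0):
    """Add-alpha smoothed bigram log-probability of a sentence."""
    toks = ["<s>"] + sent + ["</s>"]
    return sum(math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * vocab_size))
               for a, b in zip(toks, toks[1:]))

corpus = [["xác", "nhận", "là"], ["xác", "nhận", "là", "đúng"]]
uni, bi = train_bigram(corpus)
cands = [["xác", "nhận", "là"], ["xác", "nhân", "là"]]
best = max(cands, key=lambda s: sentence_logprob(s, uni, bi, len(uni)))
print(best)  # → ['xác', 'nhận', 'là']
```

The candidate sentence whose bigrams were actually observed in the training data wins, which is how a context-based (Class 2) error like nhân for nhận can be corrected even though both syllables are in the dictionary.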
Recommended publications
  • Automatic Correction of Real-Word Errors in Spanish Clinical Texts
    sensors Article. Automatic Correction of Real-Word Errors in Spanish Clinical Texts. Daniel Bravo-Candel 1, Jésica López-Hernández 1, José Antonio García-Díaz 1, Fernando Molina-Molina 2 and Francisco García-Sánchez 1,* 1 Department of Informatics and Systems, Faculty of Computer Science, Campus de Espinardo, University of Murcia, 30100 Murcia, Spain; [email protected] (D.B.-C.); [email protected] (J.L.-H.); [email protected] (J.A.G.-D.) 2 VÓCALI Sistemas Inteligentes S.L., 30100 Murcia, Spain; [email protected] * Correspondence: [email protected]; Tel.: +34-86888-8107. Abstract: Real-word errors are characterized by being actual terms in the dictionary. By providing context, real-word errors are detected. Traditional methods to detect and correct such errors are mostly based on counting the frequency of short word sequences in a corpus. Then, the probability of a word being a real-word error is computed. On the other hand, state-of-the-art approaches make use of deep learning models to learn context by extracting semantic features from text. In this work, a deep learning model was implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation Model mapped erroneous sentences to correct them. For that, different types of error were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing with patient information.
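    The rule-based error-generation step this abstract describes, injecting errors into correct sentences to create training pairs, can be sketched as follows; the confusion table and corruption rate are illustrative assumptions, not the paper's actual rules.

```python
import random

def corrupt(sentence, confusions, rate=0.3, seed=0):
    """Inject real-word errors by swapping words per a confusion table."""
    rng = random.Random(seed)  # seeded for reproducible corpora
    out = []
    for w in sentence.split():
        if w in confusions and rng.random() < rate:
            out.append(rng.choice(confusions[w]))
        else:
            out.append(w)
    return " ".join(out)

# Hypothetical confusion pairs; a real system would derive these from data.
confusions = {"there": ["their"], "affect": ["effect"]}
pairs = [(corrupt(s, confusions, rate=1.0), s)
         for s in ["there was no affect", "the dose was low"]]
print(pairs[0])  # → ('their was no effect', 'there was no affect')
```

    Each (corrupted, original) pair then serves as a source/target training example for a Seq2seq corrector.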
  • Welsh Language Technology Action Plan: Progress Report 2020
    Welsh language technology action plan Progress report 2020 Welsh language technology action plan: Progress report 2020 Audience All those interested in ensuring that the Welsh language thrives digitally. Overview This report reviews progress with work packages of the Welsh Government’s Welsh language technology action plan between its October 2018 publication and the end of 2020. The Welsh language technology action plan derives from the Welsh Government’s strategy Cymraeg 2050: A million Welsh speakers (2017). Its aim is to plan technological developments to ensure that the Welsh language can be used in a wide variety of contexts, be that by using voice, keyboard or other means of human-computer interaction. Action required For information. Further information Enquiries about this document should be directed to: Welsh Language Division Welsh Government Cathays Park Cardiff CF10 3NQ e-mail: [email protected] @cymraeg Facebook/Cymraeg Additional copies This document can be accessed from gov.wales Related documents Prosperity for All: the national strategy (2017); Education in Wales: Our national mission, Action plan 2017–21 (2017); Cymraeg 2050: A million Welsh speakers (2017); Cymraeg 2050: A million Welsh speakers, Work programme 2017–21 (2017); Welsh language technology action plan (2018); Welsh-language Technology and Digital Media Action Plan (2013); Technology, Websites and Software: Welsh Language Considerations (Welsh Language Commissioner, 2016) Mae’r ddogfen yma hefyd ar gael yn Gymraeg. This document is also available in Welsh.
  • Learning to Read by Spelling Towards Unsupervised Text Recognition
    Learning to Read by Spelling: Towards Unsupervised Text Recognition. Ankush Gupta, Andrea Vedaldi, Andrew Zisserman. Visual Geometry Group, University of Oxford. [email protected] [email protected] [email protected]
    Figure 1: Text recognition from unaligned data. We present a method for recognising text in images without using any labelled data. This is achieved by learning to align the statistics of the predicted text strings against the statistics of valid text strings sampled from a corpus. The figure visualises the transcriptions as various characters are learnt through the training iterations. The model first learns the concept of {space}, and hence learns to segment the string into words; followed by common words like {to, it}, and only later learns to correctly map the less frequent characters like {v, w}. The last transcription also corresponds to the ground truth (punctuations are not modelled). The colour bar on the right indicates the accuracy (darker means higher accuracy).
    ABSTRACT: This work presents a method for visual text recognition without using any paired supervisory data. We formulate the text recognition task as one of aligning the conditional distribution of strings
    1 INTRODUCTION read (ri:d) verb • Look at and comprehend the meaning of (written or printed matter) by interpreting the characters or symbols of which it is composed.
  • Detecting Personal Life Events from Social Media
    Open Research Online: The Open University's repository of research publications and other research outputs. Detecting Personal Life Events from Social Media. Thesis. How to cite: Dickinson, Thomas Kier (2019). Detecting Personal Life Events from Social Media. PhD thesis, The Open University. For guidance on citations see FAQs. © 2018 The Author. https://creativecommons.org/licenses/by-nc-nd/4.0/ Version: Version of Record. Link(s) to article on publisher's website: http://dx.doi.org/doi:10.21954/ou.ro.00010aa9 Copyright and Moral Rights for the articles on this site are retained by the individual authors and/or other copyright owners. For more information on Open Research Online's data policy on reuse of materials please consult the policies page. oro.open.ac.uk
    Detecting Personal Life Events from Social Media: a thesis presented by Thomas K. Dickinson to The Department of Science, Technology, Engineering and Mathematics in partial fulfilment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science. The Open University, Milton Keynes, England, May 2019. Thesis advisors: Professor Harith Alani & Dr Paul Mulholland.
    Abstract: Social media has become a dominating force over the past 15 years, with the rise of sites such as Facebook, Instagram, and Twitter. Some of us have been with these sites since the start, posting all about our personal lives and building up a digital identity of ourselves. But within this myriad of posts, what actually matters to us, and what do our digital identities tell people about ourselves? One way that we can start to filter through this data is to build classifiers that can identify posts about our personal life events, allowing us to start to self-reflect on what we share online.
  • Spelling Correction: from Two-Level Morphology to Open Source
    Spelling Correction: from two-level morphology to open source. Iñaki Alegria, Klara Ceberio, Nerea Ezeiza, Aitor Soroa, Gregorio Hernandez. Ixa group, University of the Basque Country / Eleka S.L. 649 P.K. 20080 Donostia, Basque Country. [email protected]
    Abstract: Basque is a highly inflected and agglutinative language (Alegria et al., 1996). Two-level morphology has been applied successfully to this kind of language, and there are two-level based descriptions for very different languages. After doing the morphological description for a language, it is easy to develop a spelling checker/corrector for that language. However, what happens if we want to use the speller in the "free world" (OpenOffice, Mozilla, emacs, LaTeX, ...)? Ispell and similar tools (aspell, hunspell, myspell) are the usual mechanisms for these purposes, but they do not fit the two-level model. In the absence of two-level morphology based mechanisms, an automatic conversion from a two-level description to hunspell is described in this paper.
    1. Introduction: Two-level morphology (Koskenniemi, 1983; Beesley & Karttunen, 2003) has been applied successfully to the morphological description of highly inflected languages. There are two-level based descriptions for very different languages (English, German, Swedish, French, Spanish, Danish, Norwegian, Finnish, Basque, Russian, Turkish, Arab, Aymara, Swahili, etc.). After doing the morphological description, it is easy to develop a spelling checker/corrector for the language, building on previous work and reusing the morphological information.
    2. Free software for spelling correction: Unfortunately there are no open source tools for spelling correction with these features: • It is standardized in most of the applications (OpenOffice, Mozilla, emacs, LaTeX, ...). • It is based on two-level morphology. The spell family of spell checkers (ispell, aspell, myspell) fits the first condition but not the second.
  • Easychair Preprint Feedback Learning: Automating the Process
    EasyChair Preprint № 992. Feedback Learning: Automating the Process of Correcting and Completing the Extracted Information. Rakshith Bymana Ponnappa, Khurram Azeem Hashmi, Syed Saqib Bukhari and Andreas Dengel. EasyChair preprints are intended for rapid dissemination of research results and are integrated with the rest of EasyChair. May 12, 2019.
    Abstract—In recent years, with the increasing usage of digital media and advancements in deep learning architectures, most of the paper-based documents have been revolutionized into digital versions. These advancements have helped the state-of-the-art Optical Character Recognition (OCR) and digital mailroom technologies become progressively efficient. Commercially, there already exist end-to-end systems which use OCR and digital mailroom technologies for extracting relevant information from financial documents such as invoices. However, there is plenty of room for improvement in terms of automating and correcting post information extraction errors. This paper describes the user-involved, self-correction concept based on the sequence to
    handwritten or typed information in the digital mailroom systems. The problem arises when these information extraction systems cannot extract the exact information due to many reasons: the source graphical document itself is not readable, the scanner has a poor result, or characters in the document are too close to each other, resulting in problems like reading "Ouery" instead of "Query", or "8" instead of "3", to name a few. It is even more challenging to correct errors in proper nouns like names and addresses, and also in numerical values such as telephone numbers, insurance numbers and so on.
  • Hunspell – the Free Spelling Checker
    Hunspell – The free spelling checker.
    About Hunspell: Hunspell is a spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding. Hunspell interfaces: Ispell-like terminal interface using the Curses library, Ispell pipe interface, OpenOffice.org UNO module.
    Main features of the Hunspell spell checker and morphological analyzer: - Unicode support (affix rules work only with the first 65535 Unicode characters) - Morphological analysis (in custom item and arrangement style) and stemming - Max. 65535 affix classes and twofold affix stripping (for agglutinative languages, like Azeri, Basque, Estonian, Finnish, Hungarian, Turkish, etc.) - Support for complex compounding (for example, Hungarian and German) - Support for language-specific features (for example, special casing of the Azeri and Turkish dotted i, or the German sharp s) - Handling of conditional affixes, circumfixes, fogemorphemes, forbidden words, pseudoroots and homonyms - Free software (LGPL, GPL, MPL tri-license)
    Usage: The src/tools directory contains ten executables after compiling (some of them are in src/win_api): affixcompress (dictionary generation from large, millions-of-words vocabularies), analyze (example of spell checking, stemming and morphological analysis), chmorph (example of automatic morphological generation and conversion), example (example of spell checking and suggestion), hunspell (main program for spell checking and others; see the manual), hunzip (decompressor of hzip format), hzip (compressor of hzip format).
  • Sethesaurus: Wordnet in Software Engineering
    This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TSE.2019.2940439 IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 14, NO. 8, AUGUST 2015. SEthesaurus: WordNet in Software Engineering. Xiang Chen, Member, IEEE, Chunyang Chen, Member, IEEE, Dun Zhang, and Zhenchang Xing, Member, IEEE.
    Abstract—Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language processing (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, a consistent vocabulary for a concept is essential to make effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly-used morphological forms, is desirable for performing normalization of software engineering text. However, constructing this thesaurus manually is a challenging task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations.
  • Corpus Based Evaluation of Stemmers
    Corpus based evaluation of stemmers. István Endrédy. Pázmány Péter Catholic University, Faculty of Information Technology and Bionics; MTA-PPKE Hungarian Language Technology Research Group, 50/a Práter Street, 1083 Budapest, Hungary. [email protected]
    Abstract: There are many available stemmer modules, and the typical questions are usually: which one should be used, which will help our system, and what is its numerical benefit? This article is based on the idea that any corpus with lemmas can be used for stemmer evaluation, not only for lemma quality measurement but for information retrieval quality as well. The sentences of the corpus act as documents (hit items of the retrieval), and its words are the testing queries. This article describes this evaluation method, and gives an up-to-date overview of 10 stemmers for 3 languages with different levels of inflectional morphology. An open source, language independent Python tool and a Lucene based stemmer evaluator in Java are also provided, which can be used for evaluating further languages and stemmers.
    1. Introduction: Information retrieval (IR) systems have a common critical point. It is when the root form of an inflected word form is determined. This is called stemming. There are two types of the common base word form: lemma or stem.
    2. Related work: Stemmers can be evaluated in several ways. First, their impact on IR performance is measurable by test collections. The earlier stemmer evaluations (Hull, 1996; Tordai and De Rijke, 2006; Halácsy and Trón, 2007) were based on these.
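    The evaluation idea this abstract describes, using any lemma-annotated corpus to score a stemmer, reduces to a simple accuracy count; the toy suffix-stripping stemmer below is an assumption for illustration, not one of the 10 stemmers the article evaluates.

```python
def lemma_accuracy(stemmer, corpus):
    """corpus: list of (word, gold_lemma) pairs from a lemma-annotated corpus."""
    hits = sum(1 for word, lemma in corpus if stemmer(word) == lemma)
    return hits / len(corpus)

def naive_en_stemmer(word):
    # Toy suffix stripper, for illustration only.
    for suf in ("ing", "ed", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

corpus = [("walked", "walk"), ("dogs", "dog"), ("ran", "run")]
print(round(lemma_accuracy(naive_en_stemmer, corpus), 2))  # → 0.67
```

    The same harness can wrap any stemmer callable, which is what makes a lemma-annotated corpus reusable across stemmers and languages.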
  • Detecting Politeness in Natural Language by Michael Yeomans, Alejandro Kantor, Dustin Tingley
    CONTRIBUTED RESEARCH ARTICLES. The politeness Package: Detecting Politeness in Natural Language by Michael Yeomans, Alejandro Kantor, Dustin Tingley.
    Abstract: This package provides tools to extract politeness markers in English natural language. It also allows researchers to easily visualize and quantify politeness between groups of documents. This package combines and extends prior research on the linguistic markers of politeness (Brown and Levinson, 1987; Danescu-Niculescu-Mizil et al., 2013; Voigt et al., 2017). We demonstrate two applications for detecting politeness in natural language during consequential social interactions: distributive negotiations, and speed dating.
    Introduction: Politeness is a universal dimension of human communication (Goffman, 1967; Lakoff, 1973; Brown and Levinson, 1987). In practically all settings, a speaker can choose to be more or less polite to their audience, and this can have consequences for the speaker's social goals. Politeness is encoded in a discrete and rich set of linguistic markers that modify the information content of an utterance. Sometimes politeness is an act of commission (for example, saying "please" and "thank you") and sometimes it is an act of omission (for example, declining to be contradictory). Furthermore, the mapping of politeness markers often varies by culture, by context (work vs. family vs. friends), by a speaker's characteristics (male vs. female), or by the goals (buyer vs. seller). Many branches of social science might be interested in how (and when) people express politeness to one another, as one mechanism by which social co-ordination is achieved. The politeness package is designed to make it easier to detect politeness in English natural language, by quantifying relevant characteristics of polite speech and comparing them to other data about the speaker.
  • Automatic Detection of Anglicisms for the Pronunciation Dictionary Generation: a Case Study on Our German It Corpus
    AUTOMATIC DETECTION OF ANGLICISMS FOR THE PRONUNCIATION DICTIONARY GENERATION: A CASE STUDY ON OUR GERMAN IT CORPUS. Sebastian Leidig, Tim Schlippe, Tanja Schultz. Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany.
    ABSTRACT: With globalization, more and more words from other languages come into a language without assimilation to the phonetic system of the new language. To economically build up lexical resources with automatic or semi-automatic methods, it is important to detect and treat them separately. Due to the strong increase of Anglicisms, especially from the IT domain, we developed features for their automatic detection and collected and annotated a German IT corpus to evaluate them. Furthermore, we applied our methods to Afrikaans words from the NCHLT corpus and German words from the news domain. Combining features based on grapheme perplexity, grapheme-to-phoneme confidence, Google hits count, as well as spell-checker dictionary and Wiktionary lookup reaches 75.44% f-score. Producing pronunciations for the words in our German IT corpus based on our methods resulted in 1.6% phoneme error rate against reference pronunciations, while applying exclusively German grapheme-to-phoneme rules for all words achieved 5.0%.
    Fig. 1. Overview of the Anglicism detection system.
    Index Terms—Foreign entity detection, lexical resources, pronunciation modeling, Anglicisms.
    We evaluated our features on the two matrix languages German and Afrikaans. However, our methods can easily be adapted to new matrix languages. In our scenario we receive single words from texts in the matrix language as input. As output we produce a classification between the classes English and native.
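    The grapheme-perplexity feature mentioned in the abstract can be sketched with a character-bigram model: a model trained on native words tends to be more "surprised" (higher perplexity) by unassimilated English loanwords. The word lists and smoothing constants below are illustrative assumptions, not the paper's training data.

```python
import math
from collections import Counter

def train_char_bigrams(words):
    """Count character unigrams/bigrams, with ^ and $ as word boundaries."""
    uni, bi = Counter(), Counter()
    for w in words:
        s = f"^{w}$"
        uni.update(s)
        bi.update(zip(s, s[1:]))
    return uni, bi

def perplexity(word, uni, bi, alpha=0.1):
    """Add-alpha smoothed per-bigram perplexity of a word."""
    s = f"^{word}$"
    vocab = len(uni) + 1
    logprob = sum(math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * vocab))
                  for a, b in zip(s, s[1:]))
    return math.exp(-logprob / (len(s) - 1))

german = ["schnittstelle", "verzeichnis", "drucker", "rechner", "speicher"]
uni, bi = train_char_bigrams(german)
# A German-trained model should find a German form less surprising
# than an English loanword:
print(perplexity("speichern", uni, bi) < perplexity("download", uni, bi))
```

    In the paper's setting this perplexity score is one feature among several (grapheme-to-phoneme confidence, dictionary lookups, etc.) combined by a classifier.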
  • Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings
    Unsupervised Context-Sensitive Spelling Correction of English and Dutch Clinical Free-Text with Word and Character N-Gram Embeddings. Pieter Fivez [email protected], Simon Šuster [email protected], Walter Daelemans [email protected]. CLiPS, University of Antwerp, Prinsstraat 13, 2000 Antwerp, Belgium.
    Abstract: We present an unsupervised context-sensitive spelling correction method for clinical free-text that uses word and character n-gram embeddings. Our method generates misspelling replacement candidates and ranks them according to their semantic fit, by calculating a weighted cosine similarity between the vectorized representation of a candidate and the misspelling context. To tune the parameters of this model, we generate self-induced spelling error corpora. We perform our experiments for two languages. For English, we greatly outperform off-the-shelf spelling correction tools on a manually annotated MIMIC-III test set, and counter the frequency bias of a noisy channel model, showing that neural embeddings can be successfully exploited to improve upon the state of the art. For Dutch, we also outperform an off-the-shelf spelling correction tool on manually annotated clinical records from the Antwerp University Hospital, but can offer no empirical evidence that our method counters the frequency bias of a noisy channel model in this case as well. However, both our context-sensitive model and our implementation of the noisy channel model obtain high scores on the test set, establishing a state of the art for Dutch clinical spelling correction with the noisy channel model.
    1. Introduction: The problem of automated spelling correction has a long history, dating back to the late 1950s. Traditionally, spelling errors are divided into two categories: non-word misspellings, the most prevalent type of misspellings, where the error leads to a nonexistent word, and real-word misspellings, where the error leads to an existing word, either caused by a typo (e.g.
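    The ranking step this abstract describes, scoring each replacement candidate by cosine similarity between its vector and the misspelling's context, can be sketched with toy vectors; a real system would use trained word and character n-gram embeddings (which are assumed here as given), and the paper additionally weights the similarity, which this sketch omits.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(candidates, context_words, emb):
    """Rank candidates by cosine fit with the averaged context vector."""
    ctx = [emb[w] for w in context_words if w in emb]
    dim = len(next(iter(emb.values())))
    centroid = [sum(v[i] for v in ctx) / len(ctx) for i in range(dim)]
    return sorted(candidates, key=lambda c: cosine(emb[c], centroid),
                  reverse=True)

# Toy 2-d "embeddings"; the misspelling "fver" in a clinical context
# should resolve to "fever" rather than "ever".
emb = {"patient": [1.0, 0.1], "fever": [0.9, 0.2],
       "ever": [0.1, 1.0], "fewer": [0.8, 0.3]}
print(rank_candidates(["ever", "fewer"], ["patient", "fever"], emb))
```

    Because the score depends only on semantic fit with the context, it avoids the frequency bias of a noisy channel model, which favours frequent words regardless of context.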