Pronunciation Modeling in Spelling Correction for Writers of English as a Foreign Language

Total Pages: 16

File Type: pdf, Size: 1020 KB

PRONUNCIATION MODELING IN SPELLING CORRECTION FOR WRITERS OF ENGLISH AS A FOREIGN LANGUAGE

A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Adriane Boyd, B.A., M.A.

The Ohio State University, 2008

Master's Examination Committee:
Professor Eric Fosler-Lussier, Advisor
Professor Christopher Brew

Approved by: Advisor, Computer Science and Engineering Graduate Program

© Copyright by Adriane Boyd, 2008

ABSTRACT

In this thesis I propose a method for modeling pronunciation variation in the context of spell checking for non-native writers of English. Spell checkers, which are nearly ubiquitous in text-processing software, have been developed with native speakers as the target audience and fail to address many of the types of spelling errors peculiar to non-native speakers, especially those errors influenced by their native language's writing system and by differences in the phonology of the native and non-native languages. The model of pronunciation variation is used to extend a pronouncing dictionary for use in the spelling correction algorithm developed by Toutanova and Moore (2002), which includes statistical models of spelling errors related to both orthography and pronunciation. The pronunciation variation modeling is shown to improve performance for misspellings produced by Japanese writers of English as a foreign language.

DEDICATION

To my parents.

ACKNOWLEDGMENTS

I would like to thank my advisor, Eric Fosler-Lussier, and the computational linguistics faculty in the Linguistics and Computer Science and Engineering departments at Ohio State for their support. I would also like to thank the computational linguistics discussion group Clippers for their feedback in the early stages of this work.

VITA

2003: B.A., Linguistics and German, University of North Carolina at Chapel Hill
2007: M.A., Linguistics, The Ohio State University
2005-2008: Graduate Research and Teaching Associate, The Ohio State University

PUBLICATIONS

Adriane Boyd, Markus Dickinson, and Detmar Meurers (2007). Increasing the recall of corpus annotation error detection. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007).

Adriane Boyd, Markus Dickinson, and Detmar Meurers (2007). On representing dependency relations – Insights from converting the German TiGerDB. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007).

Adriane Boyd (2007). Discontinuity Revisited: An Improved Conversion to Context-Free Representations. In Proceedings of the Linguistic Annotation Workshop (LAW 2007).

Adriane Boyd, Whitney Gegg-Harrison, and Donna Byron (2006). Identifying non-referential it: A machine learning approach incorporating linguistically motivated patterns. Traitement Automatique des Langues, Volume 46, No. 1.

Adriane Boyd, Whitney Gegg-Harrison, and Donna Byron (2005). Identifying non-referential it: A machine learning approach incorporating linguistically motivated features. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
Chapters:

1. Introduction and Motivation
   1.1 Characteristics of Spelling Errors
       1.1.1 Native Writers of English
       1.1.2 Japanese Writers of English as a Foreign Language
   1.2 Developing a Spell Checker for Non-Native Writers of English
2. Background
   2.1 Spell Checking Tasks
       2.1.1 Non-Word Error Detection
       2.1.2 Isolated Word Error Correction
   2.2 Edit Operations
       2.2.1 Types of Edit Operations
       2.2.2 Costs of Edit Operations
       2.2.3 Extending Edits to Pronunciation
   2.3 Noisy Channel Spelling Correction
       2.3.1 Training the Error Model
       2.3.2 Extending the Model to Pronunciation Errors
       2.3.3 Letter-To-Phone Model
   2.4 Spell Checkers Adapted for JWEFL
   2.5 Summary
3. Resources and Data Preparation
   3.1 TIMIT
   3.2 English Read by Japanese Corpus
   3.3 CMU Pronouncing Dictionary
   3.4 Atsuo-Henry Corpus
   3.5 Spell-Checker Oriented Word Lists
4. Method
   4.1 Pronouncing Dictionary with Variation
       4.1.1 Initial Recognizer
       4.1.2 Adapting the Recognizer
       4.1.3 Generating Pronunciations
   4.2 Implementation of the Noisy Channel Spelling Correction Approach
       4.2.1 Letter-to-Phone Model
       4.2.2 Noisy Channel Spelling Correction
5. Results
   5.1 Experimental Setup
   5.2 Baseline
   5.3 Evaluation
       5.3.1 Tuning Model Parameters
       5.3.2 Evaluation of Pronunciation Variation
       5.3.3 Evaluation of the Spelling Correction Model
   5.4 Summary
6. Summary and Outlook
   6.1 Outlook

Bibliography

Appendices:

A. Annotation Schemes
   A.1 Phonetic Transcriptions
       A.1.1 TIMIT
       A.1.2 English Read by Japanese Corpus
   A.2 Mapping to CMUDICT Phoneme Set
B. Letter-to-Phone Alignments

LIST OF TABLES

1.1 Difficult Phoneme Pairs for Japanese Speakers of English
2.1 Percentage of Correct Suggestions in the 1- to 3-Best Candidates as a Function of the Maximum Substitution Length (N) on Native Speaker Misspellings from Brill and Moore (2000)
2.2 Percentage of Correct Suggestions in the 1- to 4-Best Candidates by the Letter (L), Pronunciation (PHL), and Combined (CMB) Models on Native Speaker Misspellings from Toutanova and Moore (2002)
2.3 Summary of Types and Costs of Edit Operations in Previous Spelling Correction Approaches
2.4 Percentage of Correct Suggestions in the 1- to 6-Best Candidates for Native and JWEFL Misspellings from the Atsuo-Henry Corpus (Mitton and Okada, 2007)
3.1 Word List Sizes
4.1 Number of Pronunciations with Five Generated Variations
4.2 Phone and Word Accuracy for Letter-to-Phone Model Trained and Tested on CMUDICT as a Function of the Number of Most-Specific Contexts (N)
4.3 Phone and Word Accuracy for Letter-to-Phone Models Trained on Word List 70 and CMUDICT, Tested on Word List 70 Test Set as a Function of the Number of Most-Specific Contexts N
5.1 Aspell Results: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set
5.2 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for P_L
5.3 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for P_PHL
5.4 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for the Combined Model
5.5 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of Dictionary Size for All Models
5.6 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of Minimum Probability m for All Models
5.7 Candidate Corrections for the Misspelling *eney, Intended Word any
5.8 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set as a Function of Pronunciation Variation for P_PHL
5.9 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set for All Models
5.10 Performance of Spell Checker on Test Data
A.1 TIMIT Phonemes
A.2 ERJ Phonemes
A.3 Mapping to CMUDICT Phonemes
B.1 Letter-Phone Edit Distances
B.2 Letter-Phone Edit Distances, cont.

LIST OF FIGURES

2.1 Sample Trie
2.2 Directed Graph for Calculating the Distance between *plog and peg (from Mitton, 1996)
2.3 Letter Alignment of Word and Misspelling
4.1 Example Phone Alignment
4.2 Original phone model for p
4.3 Adapted phone model for p accounting for variation between p, th, t, and dh
4.4 Finite state transducer for canonical phone r where the respective transition probabilities reflect the negative logarithm of the probability that the phone r, uh, d, or l was observed for r
4.5 Word List Trie

CHAPTER 1
INTRODUCTION AND MOTIVATION

Spell checkers are very frequently included in software where text is entered, such as word processors, email programs, and web browsers. The goal of a spell checker is to identify misspellings, select appropriate words as suggested corrections, and rank the suggested corrections so that the intended word is high in the suggestion list. Since spell checkers have been developed with competent native speakers as the target users, they do not appropriately address many types of errors made by non-native writers, and they often fail to suggest the appropriate corrections (cf. Okada, 2004; L'Haire, 2007). Non-native writers of English struggle with many of the same idiosyncrasies of English spelling that cause difficulty for native speakers, but differences between English phonology and the phonology of their native language lead to types of spelling errors not anticipated by traditional spell checkers (Okada, 2004; L'Haire, 2007; Mitton and Okada, 2007).

In order to address the spelling errors that result from these phonological differences, I propose a method for modeling pronunciation variation from a phonetically untranscribed corpus of read speech. The model of pronunciation variation is evaluated in the context of the spelling correction algorithm developed by Toutanova and Moore (2002), which takes into account pronunciation similarity between misspellings and suggested corrections.
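The noisy channel approach referenced above scores each candidate correction w for an observed misspelling s as P(w) · P(s | w) and ranks candidates by that product. The sketch below illustrates only this ranking step, with a toy unigram language model and a hand-filled error-model table standing in for the statistical models that Toutanova and Moore train; all probabilities here are invented for illustration.

```python
import math

# toy unigram language model P(w); frequencies are invented
lm = {"any": 8e-3, "annoy": 5e-4, "deny": 1e-3}

# toy error model P(s | w) for the misspelling *eney; values are invented
error_model = {
    ("eney", "any"): 1e-4,
    ("eney", "annoy"): 3e-5,
    ("eney", "deny"): 2e-5,
}

def rank_candidates(misspelling, candidates):
    scored = []
    for w in candidates:
        # noisy channel score: P(w) * P(misspelling | w), in log space
        p = error_model.get((misspelling, w), 1e-12) * lm.get(w, 1e-9)
        scored.append((w, math.log(p)))
    return sorted(scored, key=lambda pair: -pair[1])

print(rank_candidates("eney", ["any", "annoy", "deny"]))
# 'any' ranks first: it is both frequent and a plausible source of *eney
```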
Recommended publications
  • Easy Slackware
Building a Lightweight Slackware-Based System

I - Introduction

Slackware enjoys well-deserved popularity as a classic Linux distribution, and the saying "he who knows Red Hat knows only Red Hat, but he who knows Slackware knows Linux", despite the obvious snobbery of the devotees of "god Patreg", still has some basis in fact. One of the advantages of Slackware is how easily practically any system can be built on top of it, including a fast and light desktop system, which is what this article is about. There are distributions, clones of Slackware, created precisely for this purpose, such as Absolute, but it is still better to build a system for yourself, tailored as closely as possible to your own needs, and Slackware is perhaps better suited to this purpose than any other distribution.

The lightness and speed of the system are determined by the choice of WM (DM), the set of programs, and the optimization of the programs and the system as a whole. The first requirement rules out KDE, Gnome, and even recent versions of XFCE; that leaves LXDE, but its set of programs is entirely unsatisfactory. The most frequently used programs and a few basic system packages are optimized by building them from source with a compiler optimized for your particular machine, with each program configured according to your requirements for its capabilities. The system as a whole is optimized by configuring it according to the specific requirements of a desktop.

This approach was chosen for a banal reason: there is no desire to fiddle with Gentoo; a computer exists to be used, not to compile programs. At the same time, everyone has a minimal set of a small number of most frequently used programs on which it is worth spending some time, not all that much, to polish them. Moreover, this approach makes it possible to always have the freshest versions of the most frequently used programs.
  • Behavior Based Software Theft Detection, CCS 2009
Behavior Based Software Theft Detection

Xinran Wang(1), Yoon-Chan Jhi(1), Sencun Zhu(1,2), and Peng Liu(2)
(1) Department of Computer Science and Engineering
(2) College of Information Sciences and Technology
Pennsylvania State University, University Park, PA 16802
{xinrwang, szhu, jhi}@cse.psu.edu, [email protected]

ABSTRACT

Along with the burst of open source projects, software theft (or plagiarism) has become a very serious threat to the healthiness of the software industry. A software birthmark, which represents the unique characteristics of a program, can be used for software theft detection. We propose a system call dependence graph based software birthmark called SCDG birthmark, and examine how well it reflects unique behavioral characteristics of a program. To our knowledge, our detection system based on SCDG birthmark is the first one that is capable of detecting software component theft where only partial code is stolen. We demonstrate the strength of our birthmark against various evasion techniques, including …

…(e.g., in SourceForge.net there were over 230,000 registered open source projects as of Feb. 2009), software theft has become a very serious concern to honest software companies and open source communities. As one example, in 2005 it was determined in a federal court trial that IBM should pay an independent software vendor Compuware $140 million to license its software and $260 million to purchase its services [1] because it was discovered that certain IBM products contained code from Compuware.

To protect software from theft, Collberg and Thomborson [10] proposed software watermark techniques. A software watermark is a unique identifier embedded in the protected software, which is hard to remove but easy to verify.
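The paper compares SCDG birthmarks by subgraph matching over system call dependence graphs; as a much cruder stand-in for that idea, the sketch below scores the overlap of two dependence graphs represented as edge sets (Jaccard similarity). The graphs and call names are invented:

```python
def edge_jaccard(g1, g2):
    # graphs as sets of directed dependence edges (from_syscall, to_syscall)
    union = g1 | g2
    return len(g1 & g2) / len(union) if union else 0.0

plaintiff = {("read", "parse"), ("parse", "encrypt"), ("encrypt", "write")}
suspect = {("read", "parse"), ("parse", "encrypt"), ("encrypt", "send")}
print(edge_jaccard(plaintiff, suspect))  # 0.5
```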
  • Spell Checker
International Journal of Scientific and Research Publications, Volume 5, Issue 4, April 2015, ISSN 2250-3153

SPELL CHECKER

Vibhakti V. Bhaire, Ashiki A. Jadhav, Pradnya A. Pashte, Mr. Magdum P.G
Computer Engineering, Rajendra Mane College of Engineering and Technology

Abstract: The Spell Checker project adds spell checking and correction functionality to a Windows-based application using an autosuggestion technique. It helps the user reduce typing work by identifying spelling errors and making it easier to repeat searches. The main goal of the spell checker is to provide unified treatment of various spell corrections. First, the spell checking and correction problem is formally described in order to provide a better understanding of the task. A spell checker and corrector is either a stand-alone application capable of processing a string of words or a text, or an embedded tool that is part of a larger application such as a word processor. Various search and replace algorithms are adopted to fit into the domain of a spell checker. Spell checking identifies the words that are valid in the language as well as misspelled words in the …

For handling morphology, a language-dependent algorithm is an additional step. The spell checker will need to consider different forms of the same word, such as verbal forms, contractions, possessives, and plurals, even for a lightly inflected language like English. For many other languages, such as those featuring agglutination and more complex declension and conjugation, this part of the process is more complicated.

A spell checker carries out the following processes:

• It scans the text and selects the words contained in it.
• It then compares each word with a known list of correctly spelled words (i.e. …
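A minimal sketch of the two steps just listed, tokenization followed by lookup against a known word list; the five-word lexicon here is a stand-in for a real dictionary:

```python
import re

lexicon = {"the", "cat", "sat", "on", "mat"}   # stand-in for a real word list

def check(text):
    # step 1: scan the text and select the words contained in it
    words = re.findall(r"[a-z']+", text.lower())
    # step 2: compare each word with the known list of correctly spelled words
    return [w for w in words if w not in lexicon]

print(check("The cat szat on the mat"))  # ['szat']
```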
  • Finite State Recognizer and String Similarity Based Spelling
FINITE STATE RECOGNIZER AND STRING SIMILARITY BASED SPELLING CHECKER FOR BANGLA

A Thesis Submitted to the Department of Computer Science and Engineering of BRAC University by Munshi Asadullah in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Computer Science and Engineering. Fall 2007, BRAC University, Dhaka, Bangladesh.

DECLARATION

I hereby declare that this thesis is based on the results found by me. Materials of work found by other researchers are mentioned by reference. This thesis, neither in whole nor in part, has been previously submitted for any degree.

Signature of Supervisor / Signature of Author

ACKNOWLEDGEMENTS

Special thanks to my supervisor Mumit Khan, without whom this work would have been very difficult. Thanks to Zahurul Islam for providing all the support that was required for this work. Also special thanks to the members of CRBLP at BRAC University, who managed to take out some time from their busy schedules to support, help and give feedback on the implementation of this work.

Abstract

A crucial figure of merit for a spelling checker is not just whether it can detect misspelled words, but also how it ranks the suggestions for the word. Spelling checker algorithms using edit distance methods tend to produce a large number of possibilities for misspelled words. We propose an alternative approach to checking the spelling of Bangla text that uses a finite state automaton (FSA) to probabilistically create the suggestion list for a misspelled word.
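The Bangla-specific FSA is not reproduced in this excerpt, but the general pattern of walking a dictionary automaton while tolerating a bounded number of edits can be sketched with a trie used as a simple acceptor plus the row-by-row Levenshtein recurrence; the word list below is hypothetical:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.word = None     # set to the full word at terminal nodes

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root

def suggest(root, misspelling, max_dist=2):
    # first DP row: distance from the empty prefix to each prefix of the input
    first_row = list(range(len(misspelling) + 1))
    results = []
    for ch, child in root.children.items():
        _search(child, ch, misspelling, first_row, results, max_dist)
    return sorted(results, key=lambda pair: pair[1])

def _search(node, ch, target, prev_row, results, max_dist):
    # compute the next Levenshtein DP row for the trie prefix ending in `ch`
    row = [prev_row[0] + 1]
    for i in range(1, len(target) + 1):
        row.append(min(row[i - 1] + 1,                            # insertion
                       prev_row[i] + 1,                           # deletion
                       prev_row[i - 1] + (target[i - 1] != ch)))  # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:                                      # prune hopeless branches
        for next_ch, child in node.children.items():
            _search(child, next_ch, target, row, results, max_dist)

root = build_trie(["cat", "cart", "card", "bat"])   # hypothetical word list
print(suggest(root, "cay", max_dist=1))             # [('cat', 1)]
```

Pruning whole trie branches once every cell in the current row exceeds the edit bound is what keeps this search cheap compared to computing edit distance against every dictionary word separately.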
  • Spell Checking in Computer-Assisted Language Learning: a Study of Misspellings by Nonnative Writers of German
SPELL CHECKING IN COMPUTER-ASSISTED LANGUAGE LEARNING: A STUDY OF MISSPELLINGS BY NONNATIVE WRITERS OF GERMAN

Anne Rimrott
Diplom, Justus Liebig Universität, 2002

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ARTS in the Department of Linguistics

© Anne Rimrott 2005
SIMON FRASER UNIVERSITY
Spring 2005

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author.

APPROVAL

Name: Anne Rimrott
Degree: Master of Arts
Title of Thesis: Spell Checking in Computer-Assisted Language Learning: A Study of Misspellings by Nonnative Writers of German

Examining Committee:
Chair: Dr. Alexei Kochetov, Assistant Professor, Department of Linguistics
Dr. Trude Heift, Senior Supervisor, Associate Professor, Department of Linguistics
Dr. Chung-hye Han, Supervisor, Assistant Professor, Department of Linguistics
Dr. Maria Teresa Taboada, Supervisor, Assistant Professor, Department of Linguistics
Dr. Mathias Schulze, External Examiner, Assistant Professor, Department of Germanic and Slavic Studies, University of Waterloo

Date Defended/Approved:

SIMON FRASER UNIVERSITY PARTIAL COPYRIGHT LICENCE

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection.
  • Exploiting Wikipedia Semantics for Computing Word Associations
Exploiting Wikipedia Semantics for Computing Word Associations

by Shahida Jabeen

A thesis submitted to the Victoria University of Wellington in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science. Victoria University of Wellington, 2014.

Abstract

Semantic association computation is the process of automatically quantifying the strength of a semantic connection between two textual units based on various lexical and semantic relations such as hyponymy (car and vehicle) and functional associations (bank and manager). Humans can infer implicit relationships between two textual units based on their knowledge about the world and their ability to reason about that knowledge. Automatically imitating this behavior is limited by restricted knowledge and poor ability to infer hidden relations.

Various factors affect the performance of automated approaches to computing semantic association strength. One critical factor is the selection of a suitable knowledge source for extracting knowledge about the implicit semantic relations. In the past few years, semantic association computation approaches have started to exploit web-originated resources as substitutes for conventional lexical semantic resources such as thesauri, machine readable dictionaries and lexical databases. These conventional knowledge sources suffer from limitations such as coverage issues, high construction and maintenance costs, and limited availability. To overcome these issues, one solution is to use the wisdom of crowds in the form of collaboratively constructed knowledge sources. An excellent example of such knowledge sources is Wikipedia, which stores detailed information not only about the concepts themselves but also about various aspects of the relations among concepts. The overall goal of this thesis is to demonstrate that using Wikipedia for computing word association strength yields better estimates of humans' associations than the approaches based on other structured and unstructured knowledge sources.
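As a loose illustration of quantifying association strength (not the thesis's actual measures), two words can be compared through vectors of the Wikipedia concepts they are associated with, using cosine similarity; the concept names and counts below are invented:

```python
import math
from collections import Counter

def cosine(u, v):
    # dot product over shared dimensions, divided by the vector norms
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# invented "concept vectors": how strongly each word is tied to a Wikipedia concept
bank = Counter({"Finance": 12, "Money": 9, "River": 2})
manager = Counter({"Finance": 7, "Money": 4, "Employment": 10})
print(round(cosine(bank, manager), 3))  # high overlap via Finance/Money
```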
  • Cross-Lingual Geo-Parsing for Non-Structured Data
Cross-lingual geo-parsing for non-structured data

Judith Gelernter, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, [email protected]
Wei Zhang, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, [email protected]

ABSTRACT

A geo-parser automatically identifies location words in a text. We have generated a geo-parser specifically to find locations in unstructured Spanish text. Our novel geo-parser architecture combines the results of four parsers: a lexico-semantic Named Location Parser, a rules-based building parser, a rules-based street parser, and a trained Named Entity Parser. Each parser has different strengths: the Named Location Parser is strong in recall, the Named Entity Parser is strong in precision, and the building and street parsers find buildings and streets that the others are not designed to find. To test our Spanish geo-parser performance, we compared the output of Spanish text through our Spanish geo-…

…to cross-language geo-information retrieval. We will examine Strötgen's view that normalized location information is language-independent [16]. Difficulties in cross-language experiments are exacerbated when the languages use different alphabets, and when the focus is on proper names, as in this case, the names of locations.

Geo-parsing structured versus unstructured text requires different language processing tools. Twitter messages are challenging to geo-parse because of their non-grammaticality. To handle non-grammatical forms, we use a Twitter tokenizer rather than a word tokenizer. We use an English part-of-speech tagger created for tweets, which was not available to us in Spanish.
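The combination step is not spelled out in this excerpt; as a loose illustration (not the authors' actual algorithm), one simple way to merge the span outputs of several parsers is a union with precision-first precedence on overlaps:

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int
    end: int
    label: str
    source: str   # which parser produced the span

def merge(parser_outputs):
    # parsers listed first win on overlap, so order them precision-first
    accepted = []
    for spans in parser_outputs:
        for s in spans:
            if not any(s.start < a.end and a.start < s.end for a in accepted):
                accepted.append(s)
    return sorted(accepted, key=lambda s: s.start)

ner = [Span(0, 6, "LOC", "named_entity")]          # precise
gazetteer = [Span(0, 6, "LOC", "named_location"),  # high recall
             Span(10, 18, "LOC", "named_location")]
print(merge([ner, gazetteer]))
```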
  • Automatic Spelling Correction Based on N-Gram Model
International Journal of Computer Applications (0975 - 8887), Volume 182, No. 11, August 2018

Automatic Spelling Correction based on n-Gram Model

S. M. El Atawy, Dept. of Computer Science, Faculty of Specific Education, Damietta University, Egypt
A. Abd ElGhany, Dept. of Computer, Damietta University, Egypt

ABSTRACT

A spell checker is a basic requirement for any language to be digitized. It is software that detects and corrects errors in a particular language. This paper proposes a model for spelling error detection and auto-correction that is based on an n-gram technique and is applied to error detection and correction in English as a global language. The proposed model provides correction suggestions by selecting the most suitable suggestions from a list of corrective suggestions based on lexical resources and n-gram statistics. It depends on a lexicon of Microsoft words. The evaluation of the proposed model uses English standard datasets of misspelled words. Error detection, automatic error correction, and replacement are the main features of the proposed model. The results of the experiment reached approximately 93% accuracy and acted similarly to Microsoft Word as well as outperformed both of …

…correction. In this paper, we designed, implemented and evaluated an end-to-end system that performs spell checking and auto-correction.

This paper is organized as follows: Section 2 illustrates types of spelling errors and some related works. Section 3 explains the system description. Section 4 presents the results and evaluation. Finally, Section 5 includes conclusion and future work.

2. RELATED WORKS

Spell-checking techniques, such as error detection and correction, have been studied substantially. Two common approaches for error detection are dictionary lookup and n-gram analysis. Most spell-checker methods described in the literature use dictionaries as a list of correct spellings that help algorithms to find target words [16].
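As a toy illustration of the n-gram statistics mentioned above (not the paper's system), candidate corrections can be ranked by a smoothed bigram probability conditioned on the preceding word; the corpus and candidates here are made up:

```python
from collections import Counter

# toy training corpus; a real system would use a large lexical resource
corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, word):
    # add-one smoothed P(word | prev)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def rank(prev, candidates):
    return sorted(candidates, key=lambda w: -bigram_prob(prev, w))

print(rank("the", ["cat", "cta", "mat"]))  # ['cat', 'mat', 'cta']
```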
  • Development of a Persian Syntactic Dependency Treebank
Development of a Persian Syntactic Dependency Treebank

Mohammad Sadegh Rasooli, Department of Computer Science, Columbia University, New York, NY, [email protected]
Manouchehr Kouhestani, Department of Linguistics, Tarbiat Modares University, Tehran, Iran, [email protected]
Amirsaeid Moloodi, Department of Linguistics, University of Tehran, Tehran, Iran, [email protected]

Abstract

This paper describes the annotation process and linguistic properties of the Persian syntactic dependency treebank. The treebank consists of approximately 30,000 sentences annotated with syntactic roles in addition to morpho-syntactic features. One of the unique features of this treebank is that there are almost 4800 distinct verb lemmas in its sentences, making it a valuable resource for educational goals. The treebank is constructed with a bootstrapping approach by means of available tagging and parsing tools and manually correcting the annotations. The data is split into standard train, development and test sets in the CoNLL dependency format and is freely available to researchers.

…tions in tasks such as machine translation. Dependency treebanks are collections of sentences with their corresponding dependency trees. In the last decade, many dependency treebanks have been developed for a large number of languages. There are at least 29 languages for which at least one dependency treebank is available (Zeman et al., 2012). Dependency trees are much more similar to the human understanding of language and can easily represent the free word-order nature of syntactic roles in sentences (Kübler et al., 2009).

Persian is a language with about 110 million speakers all over the world (Windfuhr, 2009), yet in terms of the availability of teaching materials and annotated data for text processing, it is undoubt…
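The CoNLL dependency format mentioned above stores one token per line with tab-separated columns and a blank line between sentences. A minimal reader, assuming the CoNLL-X column layout (ID and FORM in columns 1-2, HEAD in column 7, DEPREL in column 8):

```python
from typing import NamedTuple

class Token(NamedTuple):
    idx: int
    form: str
    head: int     # 0 means the root of the sentence
    deprel: str

def read_conll(path):
    # assumes CoNLL-X columns: ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL, ...
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:            # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            cols = line.split("\t")
            sentence.append(Token(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    if sentence:
        yield sentence
```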
  • Study on Spell-Checking System Using Levenshtein Distance Algorithm
International Journal of Recent Development in Engineering and Technology (ISSN 2347-6435 (Online), Volume 8, Issue 9, September 2019), www.ijrdet.com

Study on Spell-Checking System using Levenshtein Distance Algorithm

Thi Thi Soe(1), Zarni Sann(2)
(1) Faculty of Computer Science
(2) Faculty of Computer Systems and Technologies
University of Computer Studies (Mandalay), Myanmar

Abstract: Natural Language Processing (NLP) is one of the most important research areas in the world of Artificial Intelligence (AI). NLP supports AI tasks such as spell checking, machine translation, automatic text summarization, and information extraction. A spell-checking application presents valid suggestions to the user based on each mistake it encounters in the user's document. The user then either makes a selection from a list of suggestions or accepts the current word as valid. A spell-checking program is often integrated with word processing and checks for the correct spelling of words in a document. Each word is compared against a dictionary of correctly spelt words.

If the word is not found, it is considered to be an error, and an attempt may be made to suggest that word. When a word which is not within the dictionary is encountered, most spell checkers provide an option to add that word to a list of known exceptions that should not be flagged. An adaptive spelling checker tool based on the Ternary Search Tree data structure is described in [1]. A comprehensive spelling checker application presented a significant challenge in producing suggestions for a misspelled word when employing the traditional methods in [2]. It learned the complex orthographic rules of Bangla.
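For reference, the Levenshtein distance named in the title can be computed with the classic two-row dynamic program; this is the textbook algorithm, not the paper's specific system:

```python
def levenshtein(a, b):
    # classic two-row dynamic program; insert, delete, substitute all cost 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```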
  • Spell Checker in CET Designer
Linköping University | Department of Computer Science
Bachelor thesis, 16 ECTS | Datateknik
2016 | LIU-IDA/LITH-EX-G--16/069--SE

Spell checker in CET Designer

Rasmus Hedin
Supervisor: Amir Aminifar
Examiner: Zebo Peng

Linköpings universitet, SE-581 83 Linköping, +46 13 28 10 00, www.liu.se

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose.
  • Discourse-Aware Statistical Machine Translation As a Context-Sensitive Spell Checker
Discourse-aware Statistical Machine Translation as a Context-Sensitive Spell Checker

Behzad Mirzababaei, Heshaam Faili and Nava Ehsan
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
{b.mirzababaei,hfaili,n.ehsan}@ut.ac.ir

Abstract

Real-word errors, or context-sensitive spelling errors, are misspelled words that have been wrongly converted into another word of the vocabulary. One way to detect and correct real-word errors is using Statistical Machine Translation (SMT), which translates a text containing some real-word errors into a correct text of the same language. In this paper, we improve the results of the mentioned SMT system by employing some discourse-aware features in a log-linear reranking method. Our experiments on real-world test data in Persian show an improvement of about 9.5% and 8.5% in the recall of detection and correction respectively.

Phrase-based SMT is weak in handling long-distance dependencies between the sentence words. In order to capture this kind of dependency, which affects detecting the correct candidate word, the mentioned SMT is augmented with a discourse-aware reranking method for reranking the N-best results of SMT. Our work can be regarded as an extension of the method introduced by Ehsan and Faili (2013), in which they use SMT to detect and correct the spelling errors of a document. But here, we use the N-best results of SMT as a candidate list for each erroneous word and rerank the list by using a discourse-aware reranking system which is just a log-linear ranker. Shortly, the contributions of this paper can be …
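The log-linear reranker mentioned above scores each N-best hypothesis as a weighted sum of feature values and re-sorts the list. A minimal sketch; the feature names, weights, and hypotheses are all invented, and real weights would be tuned on held-out data:

```python
def score(features, weights):
    # log-linear score: weighted sum of feature values
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rerank(nbest, weights):
    return sorted(nbest, key=lambda hyp: -score(hyp[1], weights))

weights = {"smt_score": 1.0, "discourse_coherence": 0.7}
nbest = [
    ("I went to the tank", {"smt_score": -1.9, "discourse_coherence": 0.1}),
    ("I went to the bank", {"smt_score": -2.1, "discourse_coherence": 0.9}),
]
print([hyp for hyp, _ in rerank(nbest, weights)])
# ['I went to the bank', 'I went to the tank']: the discourse feature
# overrides the slightly better raw SMT score of the wrong candidate
```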