Real-Word Spelling Correction Using Google Web 1T N-Gram with Backoff

Aminul ISLAM and Diana INKPEN
Department of Computer Science, SITE, University of Ottawa, Ottawa, ON, Canada
[email protected], [email protected]

Abstract: We present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method focuses mainly on how to improve the correction recall (the fraction of errors corrected) while keeping the correction precision (the fraction of suggestions that are correct) as high as possible. Evaluation results on a standard data set show that our method performs very well.

Keywords: Real-word; spelling correction; Google Web 1T; n-gram

1. Introduction

Real-word spelling errors are words in a text that occur when a user mistakenly types a correctly spelled word when another was intended. Errors of this type may be caused by the writer's ignorance of the correct spelling of the intended word or by typing mistakes. Such errors generally go unnoticed by most spellcheckers, as they deal with words in isolation, accepting them as correct if they are found in the dictionary and flagging them as errors if they are not. This approach would be sufficient to correct the non-word error myss in "It doesn't know what the myss is all about." but not the real-word error muss in "It doesn't know what the muss is all about." To correct the latter, the spell-checker needs to make use of the surrounding context, in this case to recognise that fuss is more likely than muss to occur in the context of all about. Ironically, errors of this type may even be caused by spelling checkers during the correction of non-word spelling errors, when the auto-correct feature in some word-processing software silently changes a non-word to the wrong real word [1], or when, in correcting a flagged error, the user accidentally makes a wrong selection from the choices offered [2]. An extensive review of real-word spelling correction is given in [3, 1], and the problem of spelling correction more generally is reviewed in [4].

In this paper, we present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set [5]¹ and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm (details are in Section 3.1). Our intention is to focus on how to improve the correction recall while maintaining the correction precision as high as possible. The reason behind this intention is that if the recall for any method is around 0.5, the method fails to correct around 50 percent of the errors. As a result, we cannot completely rely on such methods; some form of human intervention or suggestion is needed to correct the remaining uncorrected errors. Thus, if we have a method that can correct almost 90 percent of the errors, even generating some extra candidates that are incorrect is more helpful to the human.

This paper is organized as follows: Section 2 presents a brief overview of the related work. Our proposed method is described in Section 3. Evaluation and experimental results are discussed in Section 4. We conclude in Section 5.

¹ Details of the Google Web 1T data set can be found at www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt.

2. Related Work

Work on real-word spelling correction can roughly be classified into two basic categories: methods based on semantic information or human-made lexical resources, and methods based on machine learning or probability information. Our proposed method falls into the latter category.

2.1 Methods Based on Semantic Information

The 'semantic information' approach, first proposed by [6] and later developed by [1], detected semantic anomalies but was not restricted to checking words from predefined confusion sets. It was based on the observation that the words a writer intends are generally semantically related to their surrounding words, whereas some types of real-word spelling errors are not.

2.2 Methods Based on Machine Learning

Machine learning methods treat real-word spelling correction as a lexical disambiguation task and use confusion sets to model the ambiguity between words. Normally, the machine learning and statistical approaches rely on pre-defined confusion sets, which are sets (usually pairs) of commonly confounded words, such as {their, there, they're} and {principle, principal}. [7], an example of a machine-learning method, combined the Winnow algorithm with weighted-majority voting, using nearby and adjacent words as features. Another example of a machine-learning method is that of [8].

2.3 Methods Based on Probability Information

[9] proposed a statistical method that uses word-trigram probabilities for detecting and correcting real-word errors without requiring predefined confusion sets. In this method, if the trigram-derived probability of an observed sentence is lower than that of any sentence obtained by replacing one of its words with a spelling variation, then the method hypothesizes that the original word is an error and that the variation is what the user intended.
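To make the trigram test concrete, the following is a minimal sketch of the idea (not [9]'s actual implementation; the trigram_logprob model and the variations generator of spelling variations are assumptions here, supplied for example by a smoothed trigram model estimated from a large corpus):

```python
from typing import Callable, List

def sentence_logprob(words: List[str],
                     trigram_logprob: Callable[[str, str, str], float]) -> float:
    """Score a sentence as the sum of log trigram probabilities,
    padding with beginning- and end-of-sentence markers."""
    padded = ["<s>", "<s>"] + words + ["</s>", "</s>"]
    return sum(trigram_logprob(padded[i - 2], padded[i - 1], padded[i])
               for i in range(2, len(padded)))

def correct_real_word_errors(words: List[str],
                             variations: Callable[[str], List[str]],
                             trigram_logprob: Callable[[str, str, str], float]) -> List[str]:
    """Hypothesize an error wherever replacing one word by a spelling
    variation yields a more probable sentence; return the most probable variant."""
    best, best_score = list(words), sentence_logprob(words, trigram_logprob)
    for i, word in enumerate(words):
        for variant in variations(word):
            candidate = words[:i] + [variant] + words[i + 1:]
            score = sentence_logprob(candidate, trigram_logprob)
            if score > best_score:
                best, best_score = candidate, score
    return best
```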
[2] analyze the advantages and limitations of [9]'s method and present a new evaluation of the algorithm, designed so that the results can be compared with those of other methods; they then construct and evaluate some variations of the algorithm that use fixed-length windows. They consider a variation of the method that optimizes over relatively short, fixed-length windows instead of over a whole sentence (except in the special case when the sentence is smaller than the window), while respecting sentence boundaries as natural breakpoints. Checking the spelling of a span of d words requires a window of length d+4 to accommodate all the trigrams that overlap with the words in the span. The smallest possible window is therefore 5 words long, which uses 3 trigrams to optimize only its middle word. They assume that the sentence is bracketed by two BoS and two EoS markers (to accommodate trigrams involving the first two and last two words of the sentence). The window starts with its left-hand edge at the first BoS marker, and [9]'s method is run on the words covered by the trigrams that it contains; the window then moves d words to the right, and the process repeats until all the words in the sentence have been checked. As [9]'s algorithm is run separately in each window, potentially changing a word in each, [2]'s method as a side effect also permits multiple corrections in a single sentence.
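The windowing scheme can be sketched roughly as follows (an illustrative sketch only: correct_window stands in for running the trigram method on a single window and is assumed to return a corrected window of the same length; short trailing windows are handled implicitly by slicing):

```python
from typing import Callable, List

def check_with_windows(sentence: List[str], d: int,
                       correct_window: Callable[[List[str]], List[str]]) -> List[str]:
    """Slide fixed-length windows of d + 4 tokens over a sentence bracketed by
    two BoS and two EoS markers. Each window holds a span of d words plus the
    two tokens of context on either side needed for the overlapping trigrams;
    the per-window corrector optimizes that span, then the window advances d words."""
    padded = ["<s>", "<s>"] + sentence + ["</s>", "</s>"]
    out = padded[:]
    left = 0  # left-hand edge of the window, starting at the first BoS marker
    while left + 2 < 2 + len(sentence):          # while the span still covers real words
        window = out[left:left + d + 4]          # may be shorter at the end of the sentence
        out[left:left + d + 4] = correct_window(window)  # assumed to preserve window length
        left += d                                # move d words to the right
    return out[2:-2]                             # strip the BoS/EoS markers
```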
[10] proposed a trigram-based method for real-word errors that does not explicitly use probabilities or even localize the possible error to a specific word. This method simply assumes that any word trigram in the text that is attested in the British National Corpus [11] is correct, and that any unattested trigram is a likely error. When an unattested trigram is observed, the method tries the spelling variations of all words in the trigram to find attested trigrams to present to the user as possible corrections.

… of these². [14] showed that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. [15] normalized LCS by dividing the length of the longest common subsequence by the length of the longer string and called it the longest common subsequence ratio (LCSR). But LCSR does not take into account the length of the shorter string, which sometimes has a significant impact on the similarity score.

[13] normalized the longest common subsequence so that it takes into account the lengths of both the shorter and the longer string and called it the normalized longest common subsequence (NLCS). We normalize NLCS in the following way, as it gives a better similarity value and is more computationally efficient:

    v₁ = NLCS(sᵢ, sⱼ) = 2 × len(LCS(sᵢ, sⱼ)) / (len(sᵢ) + len(sⱼ))        (1)

While in classical LCS the common subsequence need not be consecutive, in spelling correction a consecutive common subsequence is important for a high degree of matching. [13] used the maximal consecutive longest common subsequence starting at character 1, MCLCS₁, and the maximal consecutive longest common subsequence starting at any character n, MCLCSₙ. MCLCS₁ takes two strings as input and returns the shorter string, or maximal consecutive portions of the shorter string that consecutively match the longer string, where matching must start from the first character (character 1) for both strings. MCLCSₙ takes two strings as input and returns the shorter string, or maximal consecutive portions of the shorter string that consecutively match the longer string, where matching may start from any character (character n) for both strings. They normalized MCLCS₁ and MCLCSₙ and called them normalized MCLCS₁ (NMCLCS₁) and normalized MCLCSₙ (NMCLCSₙ), respectively. Similarly, we normalize NMCLCS₁ and NMCLCSₙ in the following way:

    v₂ = NMCLCS₁(sᵢ, sⱼ) = 2 × len(MCLCS₁(sᵢ, sⱼ)) / (len(sᵢ) + len(sⱼ))    (2)

    v₃ = NMCLCSₙ(sᵢ, sⱼ) = 2 × len(MCLCSₙ(sᵢ, sⱼ)) / (len(sᵢ) + len(sⱼ))    (3)
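A minimal sketch of these similarity measures, following Equations (1)-(3) (our illustration, not the authors' code: the functions return only lengths, whereas [13]'s MCLCS definitions return the matched portions of the shorter string themselves):

```python
def lcs_len(s1: str, s2: str) -> int:
    """Length of the longest common subsequence (not necessarily consecutive)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def mclcs1_len(s1: str, s2: str) -> int:
    """Maximal consecutive LCS anchored at character 1: length of the common prefix."""
    k = 0
    while k < min(len(s1), len(s2)) and s1[k] == s2[k]:
        k += 1
    return k

def mclcsn_len(s1: str, s2: str) -> int:
    """Maximal consecutive LCS starting at any character: length of the longest
    substring of the shorter string that also occurs in the longer string."""
    shorter, longer = (s1, s2) if len(s1) <= len(s2) else (s2, s1)
    best = 0
    for i in range(len(shorter)):
        for j in range(i + best + 1, len(shorter) + 1):
            if shorter[i:j] in longer:
                best = j - i
            else:
                break
    return best

def _normalize(common_len: int, s1: str, s2: str) -> float:
    """2 * len(common) / (len(s1) + len(s2)), shared by Equations (1)-(3)."""
    return 2.0 * common_len / (len(s1) + len(s2))

def v1(s1: str, s2: str) -> float:
    """Equation (1): normalized LCS."""
    return _normalize(lcs_len(s1, s2), s1, s2)

def v2(s1: str, s2: str) -> float:
    """Equation (2): normalized MCLCS_1."""
    return _normalize(mclcs1_len(s1, s2), s1, s2)

def v3(s1: str, s2: str) -> float:
    """Equation (3): normalized MCLCS_n."""
    return _normalize(mclcsn_len(s1, s2), s1, s2)
```

For an illustrative pair such as "albastru" and "alabaster", v1 gives 2·7/17 ≈ 0.82 (the non-consecutive subsequence "albastr"), v2 gives 2·2/17 ≈ 0.24 (common prefix "al"), and v3 gives 2·4/17 ≈ 0.47 (common substring "bast"), showing how the consecutive measures penalize scattered matches.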