Watch Your Spelling!
by Kurt Hornik and Duncan Murdoch


Abstract: We discuss the facilities in base R for spell checking via Aspell, Hunspell or Ispell, which are useful in particular for conveniently checking the spelling of natural language texts in package Rd files and vignettes. Spell checking performance is illustrated using the Rd files in package stats. This example clearly indicates the need for a domain-specific statistical dictionary. We analyze the results of spell checking all Rd files in all CRAN packages and show how these can be employed for building such a dictionary.

R and its add-on packages contain large amounts of natural language text (in particular, in the documentation in package Rd files and vignettes). This text is useful for reading by humans and as a testbed for a variety of natural language processing (NLP) tasks, but it is not always spelled correctly. This is not entirely surprising, given that available spell checkers are not aware of the special Rd file and vignette formats, and are thus rather inconvenient to use. (In particular, we are unaware of ways to take advantage of the ESS (Rossini et al., 2004) facilities to teach Emacs to check the spelling of vignettes by only analyzing the LaTeX chunks in suitable TeX modes.)

In addition to providing facilities to make spell checking for Rd files and vignettes more convenient, it is also desirable to have programmatic (R-level) access to the possibly mis-spelled words (and suggested corrections). This will allow their inclusion into automatically generated reports (e.g., by R CMD check), or aggregation for subsequent analyses. Spell checkers typically know how to extract words from text, employing language-dependent algorithms for handling morphology, and compare the extracted words against a known list of correctly spelled ones, the so-called dictionary. However, NLP resources for statistics and related knowledge domains are rather scarce, and we are unaware of suitable dictionaries for these. The result is that domain-specific terms in texts from these domains are typically reported as possibly mis-spelled. It thus seems attractive to create additional dictionaries based on determining the most frequent possibly mis-spelled terms in corpora such as the Rd files and vignettes in the packages in the R repositories.

In this paper, we discuss the spell check functionality provided by aspell() and related utilities made available in the R standard package utils. After a quick introduction to spell checking, we indicate how aspell() can be used for checking individual files and packages, and how this can be integrated into typical work flows. We then compare the agreement of the spell check results obtained by two different programs (Aspell and Hunspell) and their respective dictionaries. Finally, we investigate the possibility of using the CRAN package Rd files for creating a statistics dictionary.

Spell checking

The first spell checkers were widely available on mainframe computers in the late 1970s (e.g., http://en.wikipedia.org/wiki/Spell_checker). They actually were "verifiers" instead of "correctors", reporting words not found in the dictionary but not making useful suggestions for replacements (e.g., as close matches in the Levenshtein distance to terms from the dictionary). In the 1980s, checkers became available on personal computers, and were integrated into popular word-processing packages like WordStar. Recently, applications such as web browsers and email clients have added spell check support for user-written content.

Extending coverage from English to beyond western European languages has required software to deal with character encoding issues and increased sophistication in the morphology routines, particularly with regard to heavily-agglutinative languages like Hungarian and Finnish, and resulted in several generations of open-source checkers. Limitations of the basic unigram (single-word) approach have led to the development of context-sensitive spell checkers, currently available in commercial software such as Microsoft Office 2007. Another approach to spell checking is using adaptive domain-specific models for mis-spellings (e.g., based on the frequency of word n-grams), as employed for example in web search engines ("Did you mean ...?").

The first spell checker on Unix-alikes was spell (originally written in 1971 in PDP-10 Assembly language and later ported to C), which read an input file and wrote possibly mis-spelled words to output (one word per line). A variety of enhancements led to (International) Ispell (http://lasr.cs.ucla.edu/geoff/ispell.html), which added interactivity (hence "i"-spell), suggestion of replacements, and support for a large number of European languages (hence "international"). It pioneered the idea of a programming interface, originally intended for use by Emacs, and awareness of special input file formats (originally, TeX or nroff/troff).

GNU Aspell, usually called just Aspell (http://aspell.net), was mainly developed by Kevin Atkinson and designed to eventually replace Ispell. Aspell can either be used as a library (in fact, the Omegahat package Aspell (Temple Lang, 2005) provides a fine-grained R interface to this) or as a standalone program. Compared to Ispell, Aspell can also easily check documents in UTF-8 without having to use a special dictionary, and supports using multiple dictionaries. It also tries to do a better job suggesting corrections (Ispell only suggests words with a Levenshtein distance of 1), and provides powerful and customizable TeX filtering. Aspell is the standard spell checker for the GNU software system, with packages available for all common Linux distributions. E.g., for Debian/Ubuntu flavors, the aspell package contains the programs, aspell-en the English language dictionaries (with American, British and Canadian spellings), and libaspell-dev provides the files needed to build applications that link against the Aspell libraries.

See http://aspell.net for information on obtaining Aspell, and available dictionaries. A native Windows build of an old release of Aspell is available on http://aspell.net/win32/. A current build is included in the Cygwin system at http://cygwin.com, and a native build is available at http://www.ndl.kiev.ua/content/patch-aspell-0605-win32-compilation.

Hunspell (http://hunspell.sourceforge.net/) is a spell checker and morphological analyzer designed for languages with rich morphology and complex word compounding or character encoding, originally designed for the Hungarian language. It is in turn based on Myspell, which was started by Kevin Hendricks to integrate various open source spelling checkers into the OpenOffice.org build, and facilitated by Kevin Atkinson. Unlike Myspell, Hunspell can use Unicode UTF-8-encoded dictionaries. Hunspell is the spell checker of OpenOffice.org, Mozilla Firefox 3 & Thunderbird and Google Chrome, and it is also used by proprietary software like Mac OS X. Its TeX support is similar to Ispell's, but nowhere near that of Aspell. Again, Hunspell is conveniently packaged for all common Linux distributions (with Debian/Ubuntu packages hunspell for the standalone program, dictionaries including hunspell-en-us and hunspell-en-ca, and libhunspell-dev for the application library interface). For Windows, the most recent available build appears to be version 1.2.8 on http://sourceforge.net/projects/hunspell/files/.

Aspell and Hunspell both support the so-called Ispell pipe interface, which reads a given input file and then, for each input line, writes a single line to the standard output for each word checked for spelling on the line, with different line formats for words found in the dictionary, and words not found, either with or without suggestions.

Argument files is a character vector with the names of the files to be checked (in fact, in R 2.12.0 or later alternatively a list of R objects representing connections or having suitable srcrefs), control is a list or character vector of control options (command line arguments) to be passed to the spell check program, and program optionally specifies the name of the program to be employed. By default, the system path is searched for aspell, hunspell and ispell (in that order), and the first one found is used. Encodings which cannot be inferred from the files can be specified via encoding. Finally, one can use argument filter to specify a filter for processing the files before spell checking, either as a user-defined function, or a character string specifying a built-in filter, or a list with the name of a built-in filter and additional arguments to be passed to it. The built-in filters currently available are "Rd" and "Sweave", corresponding to functions RdTextFilter and SweaveTeXFilter in package tools, with self-explanatory names: the former blanks out all non-text in an Rd file, dropping elements such as \email and \url or as specified by the user via argument drop; the latter blanks out code chunks and Noweb markup in an Sweave input file.

aspell() returns a data frame inheriting from "aspell" with the information about possibly mis-spelled words, which has useful print() and summary() methods. For example, consider 'lm.Rd' in package stats, which provides the documentation for fitting linear models. Assuming that src is set to the URL for the source of that file, i.e. "http://svn.R-project.org/R/trunk/src/library/stats/man/lm.Rd", and f is set to "lm.Rd", we can obtain a copy and check it as follows:

> download.file(src, f)
> a <- aspell(f, "Rd")
> a

accessor
    lm.Rd:128:25
ANOVA
    lm.Rd:177:7
datasets
    lm.Rd:199:37
differenced

The R Journal Vol. 3/2, ISSN 2073-4859
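The Ispell pipe interface described above uses a small set of one-line result formats. As a minimal sketch (not part of the original article), the following parser handles the line markers documented for GNU Aspell's pipe mode: "*" for a correctly spelled word, "&" for a miss with suggestions, and "#" for a miss without suggestions; the sample word and suggestions are made up for illustration.

```python
def parse_pipe_line(line):
    """Parse one result line from an Ispell-pipe-compatible checker.

    Returns a dict describing the word's status, or None for lines
    that are not per-word results (e.g., blank separator lines).
    """
    if line.startswith("*"):
        # Word was found in the dictionary.
        return {"status": "ok"}
    if line.startswith("&"):
        # Format: & original count offset: suggestion, suggestion, ...
        head, _, tail = line.partition(":")
        _, original, count, offset = head.split()
        return {
            "status": "miss",
            "word": original,
            "offset": int(offset),
            "suggestions": [s.strip() for s in tail.split(",")][: int(count)],
        }
    if line.startswith("#"):
        # Format: # original offset  (no suggestions available)
        _, original, offset = line.split()
        return {"status": "miss", "word": original,
                "offset": int(offset), "suggestions": []}
    return None


result = parse_pipe_line("& helo 3 0: hole, help, hello")
```

Here `result` carries the mis-spelled word "helo", its character offset 0, and the three suggested replacements, which is the same per-word information that aspell() collects into its data frame.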