Watch Your Spelling!
by Kurt Hornik and Duncan Murdoch


Abstract: We discuss the facilities in base R for spell checking via Aspell, Hunspell or Ispell, which are useful in particular for conveniently checking the spelling of natural language texts in package Rd files and vignettes. Spell checking performance is illustrated using the Rd files in package stats. This example clearly indicates the need for a domain-specific statistical dictionary. We analyze the results of spell checking all Rd files in all CRAN packages and show how these can be employed for building such a dictionary.

R and its add-on packages contain large amounts of natural language text (in particular, in the documentation in package Rd files and vignettes). This text is useful for reading by humans and as a testbed for a variety of natural language processing (NLP) tasks, but it is not always spelled correctly. This is not entirely surprising, given that available spell checkers are not aware of the special Rd file and vignette formats, and are thus rather inconvenient to use. (In particular, we are unaware of ways to take advantage of the ESS (Rossini et al., 2004) facilities to teach Emacs to check the spelling of vignettes by only analyzing the LaTeX chunks in suitable TeX modes.)

In addition to providing facilities to make spell checking for Rd files and vignettes more convenient, it is also desirable to have programmatic (R-level) access to the possibly mis-spelled words (and suggested corrections). This will allow their inclusion into automatically generated reports (e.g., by R CMD check), or aggregation for subsequent analyses. Spell checkers typically know how to extract words from text, employing language-dependent algorithms for handling morphology, and compare the extracted words against a known list of correctly spelled ones, the so-called dictionary. However, NLP resources for statistics and related knowledge domains are rather scarce, and we are unaware of suitable dictionaries for these. The result is that domain-specific terms in texts from these domains are typically reported as possibly mis-spelled. It thus seems attractive to create additional dictionaries based on determining the most frequent possibly mis-spelled terms in corpora such as the Rd files and vignettes in the packages in the R repositories.

In this paper, we discuss the spell check functionality provided by aspell() and related utilities made available in the R standard package utils. After a quick introduction to spell checking, we indicate how aspell() can be used for checking individual files and packages, and how this can be integrated into typical work flows. We then compare the agreement of the spell check results obtained by two different programs (Aspell and Hunspell) and their respective dictionaries. Finally, we investigate the possibility of using the CRAN package Rd files for creating a statistics dictionary.

Spell checking

The first spell checkers were widely available on mainframe computers in the late 1970s (e.g., http://en.wikipedia.org/wiki/Spell_checker). They actually were "verifiers" instead of "correctors", reporting words not found in the dictionary but not making useful suggestions for replacements (e.g., as close matches in the Levenshtein distance to terms from the dictionary). In the 1980s, checkers became available on personal computers, and were integrated into popular word-processing packages like WordStar. Recently, applications such as web browsers and email clients have added spell check support for user-written content.

Extending coverage from English to beyond western European languages has required software to deal with character encoding issues and increased sophistication in the morphology routines, particularly with regard to heavily-agglutinative languages like Hungarian and Finnish, and resulted in several generations of open-source checkers. Limitations of the basic unigram (single-word) approach have led to the development of context-sensitive spell checkers, currently available in commercial software such as Microsoft Office 2007. Another approach to spell checking is using adaptive domain-specific models for mis-spellings (e.g., based on the frequency of word n-grams), as employed for example in web search engines ("Did you mean ...?").

The first spell checker on Unix-alikes was spell (originally written in 1971 in PDP-10 Assembly language and later ported to C), which read an input file and wrote possibly mis-spelled words to output (one word per line). A variety of enhancements led to (International) Ispell (http://lasr.cs.ucla.edu/geoff/ispell.html), which added interactivity (hence "i"-spell), suggestion of replacements, and support for a large number of European languages (hence "international"). It pioneered the idea of a programming interface, originally intended for use by Emacs, and awareness of special input file formats (originally, TeX or nroff/troff).

GNU Aspell, usually called just Aspell (http://aspell.net), was mainly developed by Kevin Atkinson and designed to eventually replace Ispell. Aspell can either be used as a library (in fact, the Omegahat package Aspell (Temple Lang, 2005) provides a fine-grained R interface to this) or as a standalone program. Compared to Ispell, Aspell can also easily check documents in UTF-8 without having to use a special dictionary, and supports using multiple dictionaries. It also tries to do a better job suggesting corrections (Ispell only suggests words with a Levenshtein distance of 1), and provides powerful and customizable TeX filtering. Aspell is the standard spell checker for the GNU software system, with packages available for all common Linux distributions. E.g., for Debian/Ubuntu flavors, the aspell package contains the programs, aspell-en the English language dictionaries (with American, British and Canadian spellings), and libaspell-dev provides the files needed to build applications that link against the Aspell libraries.

See http://aspell.net for information on obtaining Aspell, and available dictionaries. A native Windows build of an old release of Aspell is available on http://aspell.net/win32/. A current build is included in the Cygwin system at http://cygwin.com, and a native build is available at http://www.ndl.kiev.ua/content/patch-aspell-0605-win32-compilation.

Hunspell (http://hunspell.sourceforge.net/) is a spell checker and morphological analyzer designed for languages with rich morphology and complex word compounding or character encoding, originally designed for the Hungarian language. It is in turn based on Myspell, which was started by Kevin Hendricks to integrate various open source spelling checkers into the OpenOffice.org build, and facilitated by Kevin Atkinson. Unlike Myspell, Hunspell can use Unicode UTF-8-encoded dictionaries. Hunspell is the spell checker of OpenOffice.org, Mozilla Firefox 3 & Thunderbird and Google Chrome, and it is also used by proprietary software like Mac OS X. Its TeX support is similar to Ispell's, but nowhere near that of Aspell. Again, Hunspell is conveniently packaged for all common Linux distributions (with Debian/Ubuntu packages hunspell for the standalone program, dictionaries including hunspell-en-us and hunspell-en-ca, and libhunspell-dev for the application library interface). For Windows, the most recent available build appears to be version 1.2.8 on http://sourceforge.net/projects/hunspell/files/.

Aspell and Hunspell both support the so-called Ispell pipe interface, which reads a given input file and then, for each input line, writes a single line to the standard output for each word checked for spelling on the line, with different line formats for words found in the dictionary, and words not found, either with or without suggestions.

Argument files is a character vector with the names of the files to be checked (in fact, in R 2.12.0 or later alternatively a list of R objects representing connections or having suitable srcrefs), control is a list or character vector of control options (command line arguments) to be passed to the spell check program, and program optionally specifies the name of the program to be employed. By default, the system path is searched for aspell, hunspell and ispell (in that order), and the first one found is used. Encodings which cannot be inferred from the files can be specified via encoding. Finally, one can use argument filter to specify a filter for processing the files before spell checking, either as a user-defined function, or a character string specifying a built-in filter, or a list with the name of a built-in filter and additional arguments to be passed to it. The built-in filters currently available are "Rd" and "Sweave", corresponding to functions RdTextFilter and SweaveTeXFilter in package tools, with self-explanatory names: the former blanks out all non-text in an Rd file, dropping elements such as \email and \url or as specified by the user via argument drop; the latter blanks out code chunks and Noweb markup in an Sweave input file.

aspell() returns a data frame inheriting from "aspell" with the information about possibly mis-spelled words, which has useful print() and summary() methods. For example, consider 'lm.Rd' in package stats, which provides the documentation for fitting linear models. Assuming that src is set to the URL for the source of that file, i.e. "http://svn.R-project.org/R/trunk/src/library/stats/man/lm.Rd", and f is set to "lm.Rd", we can obtain a copy and check it as follows:

> download.file(src, f)
> a <- aspell(f, "Rd")
> a

accessor
    lm.Rd:128:25
ANOVA
    lm.Rd:177:7
datasets
    lm.Rd:199:37
differenced

The R Journal Vol. 3/2, ISSN 2073-4859
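The Ispell pipe interface described above uses a small set of one-line result formats. As a minimal sketch (not part of the original article), the following parser handles the line markers documented for GNU Aspell's pipe mode: "*" for a correctly spelled word, "&" for a miss with suggestions, and "#" for a miss without suggestions; the sample word and suggestions are made up for illustration.

```python
def parse_pipe_line(line):
    """Parse one result line from an Ispell-pipe-compatible checker.

    Returns a dict describing the word's status, or None for lines
    that are not per-word results (e.g., blank separator lines).
    """
    if line.startswith("*"):
        # Word was found in the dictionary.
        return {"status": "ok"}
    if line.startswith("&"):
        # Format: & original count offset: suggestion, suggestion, ...
        head, _, tail = line.partition(":")
        _, original, count, offset = head.split()
        return {
            "status": "miss",
            "word": original,
            "offset": int(offset),
            "suggestions": [s.strip() for s in tail.split(",")][: int(count)],
        }
    if line.startswith("#"):
        # Format: # original offset  (no suggestions available)
        _, original, offset = line.split()
        return {"status": "miss", "word": original,
                "offset": int(offset), "suggestions": []}
    return None


result = parse_pipe_line("& helo 3 0: hole, help, hello")
```

Here `result` carries the mis-spelled word "helo", its character offset 0, and the three suggested replacements, which is the same per-word information that aspell() collects into its data frame.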